#612 Unicode characters as part of the language (project SpoofaxLegacy on YellowGrass.org)

It would be very nice if I could use symbols like λ, →, and so on as part of my language developed in Spoofax.

One possibility for implementing this would be a pre-pass that replaces them with a non-Unicode equivalent, for example replacing λ with \lambda. That might mess up the source locations, though, unless the replacement is also a single character (like \).
Submitted by Dobes Vandermeer on 11 January 2013 at 19:55

improvement@seba

On 16 January 2013 at 16:42 Florian Lorenzen commented:

I would also appreciate Unicode symbols in the input, preferably in the UTF-8 encoding. Rather than having a pre-pass, I would love to see UTF-8 support in JSGLR, sdf2table and related tools.

On 16 June 2013 at 19:22 Moritz Lichter commented:

On 16 June 2013 at 19:24 Moritz Lichter commented:

I implemented support for Unicode characters. It's rather like the suggestion from Dobes.
The idea is to encode Unicode symbols as sequences of ASCII characters. It is assumed that ASCII character 7 (the old bell character that makes the computer produce a short beep) is not used in the grammar or in the text to parse. Then a Unicode character cannot be parsed as ASCII, and no sequence of characters from the input file can accidentally be read as an encoded Unicode character. Of course, the input string of the parser needs to be preprocessed to encode the Unicode characters.
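
The input preprocessing described above could look roughly like the following Python sketch. It is an assumption, inferred from the SDF patterns quoted later in this comment, that each non-ASCII character becomes the bell character followed by its UTF-16 big-endian bytes. The name `encode_input` is hypothetical, and this is not the actual JSGLR code.

```python
BEL = "\x07"  # ASCII character 7, assumed to be unused in grammar and input


def encode_input(text):
    """Replace every non-ASCII character by BEL plus its UTF-16BE bytes.

    Hypothetical helper: the byte layout is inferred from the examples in
    this thread, not taken from the JSGLR sources.
    """
    if BEL in text:
        # The scheme only works if the marker never occurs in real input.
        raise ValueError("input already contains ASCII character 7")
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            units = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
            out.append(BEL + "".join(chr(b) for b in units))
    return "".join(out)
```

Note that the encoded string is longer than the original, so under this sketch character offsets reported by the parser would have to be mapped back to positions in the original text.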

To write Unicode characters in SDF files, I introduced syntactic sugar: $Unicode, which denotes a single Unicode character or a range of Unicode characters, e.g.:
$Unicode(Ø,∀)
$Unicode(∀)
$Unicode(∀-水,“Symbol for Tablechef”)
$Unicode(“Symbol for Tablechef”)
* In code you would write the character itself, but the comment editor cannot process this character.

A preprocessor converts this file to valid SDF code. For the four examples above, it produces:
[\7]([\0][\216]|[\34][\0])
[\7]([\34][\0])
[\7](([\34-\108][\0-\52])|([\216][\52][\221][\30]))
[\7]([\216][\52][\221][\30])
Thus, use $Unicode only in lexical syntax.
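
The first, second and fourth expansions above are consistent with the bell character followed by the UTF-16 big-endian bytes of each listed character, with alternatives separated by `|`. The Python sketch below reproduces them under that assumption. The function name `expand_unicode` is hypothetical, ranges such as ∀-水 are deliberately not handled, and this is not the actual preprocessor.

```python
def expand_unicode(chars):
    """Build an SDF pattern for a list of single characters.

    Hypothetical sketch: each character becomes a sequence of numeric
    character classes, one per UTF-16BE byte, and the whole group is
    prefixed with [\\7] (ASCII character 7).
    """
    alternatives = []
    for ch in chars:
        units = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
        alternatives.append("".join("[\\%d]" % b for b in units))
    return "[\\7](" + "|".join(alternatives) + ")"


print(expand_unicode(["Ø", "∀"]))  # first example above
print(expand_unicode(["∀"]))       # second example above
```

A character outside the Basic Multilingual Plane, such as U+1D11E, encodes to four bytes, which matches the shape of the fourth example.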

THIS IS NOT INTEGRATED INTO THE MAIN BRANCH CURRENTLY.
I did only a few short tests and need to test a bit more :).

On 21 June 2013 at 12:52 Sebastian Erdweg tagged @seba

On 30 June 2013 at 14:14 Moritz Lichter commented:

Now, after my tests, it seems that this mechanism works correctly. Now I am going to improve the Unicode preprocessor so that the user can use Unicode characters like other regular characters in SDF.

On 13 October 2013 at 12:10 Moritz Lichter commented:

Unicode support is now implemented. Unicode symbols can be used (with small exceptions in follow restrictions) anywhere as normal symbols in the grammar.
There is a separate branch waiting to be merged. A tutorial for Unicode is included in the doc folder of jsglr.

On 7 August 2014 at 23:47 Oskar van Rest commented:

It would be great if someone could take a look at this and merge the utf8 branch of jsglr into master.
I would like to use this for my SPARQL grammar and someone else was also asking about unicode support on IRC just now.
