It would be very nice if I could use symbols like λ, →, and so on as part of my language developed in Spoofax.

One way to implement this would be a pre-pass that replaces each symbol with a non-Unicode equivalent, for example replacing λ with \lambda. That might throw off the source locations, though, unless the replacement is also a single character (like \).
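As a rough illustration of the source-location concern, here is a minimal sketch of such a pre-pass that also records where each output character came from. The replacement table and the names `REPLACEMENTS`, `prepass` and `origin` are made up for this example and are not part of Spoofax or JSGLR.

```python
# Hypothetical pre-pass: replace non-ASCII symbols with ASCII escapes and
# remember, for every output position, the input position it came from.
REPLACEMENTS = {"λ": "\\lambda", "→": "->"}  # illustrative table only


def prepass(source):
    out = []
    origin = []  # origin[i] = index in `source` that produced output char i
    for i, ch in enumerate(source):
        replacement = REPLACEMENTS.get(ch, ch)
        out.append(replacement)
        # Every character of the replacement maps back to the same input index.
        origin.extend([i] * len(replacement))
    return "".join(out), origin


text, origin = prepass("λ x → x")
print(text)                       # the ASCII-only text handed to the parser
print(origin[text.index("->")])   # input offset of the arrow
```

With a table like `origin`, an error reported at an offset in the rewritten text can be translated back to an offset in the file the user actually wrote, so the replacement does not have to be a single character.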

Submitted by Dobes Vandermeer on 11 January 2013 at 19:55

On 16 January 2013 at 16:42 Florian Lorenzen commented:

I would also appreciate Unicode symbols in the input, preferably in the UTF-8 encoding. Rather than having a pre-pass, I would love to see UTF-8 support in JSGLR, sdf2table and related tools.


On 16 June 2013 at 19:22 Moritz Lichter commented:

On 16 June 2013 at 19:24 Moritz Lichter commented:

I implemented support for Unicode characters. It's rather like the suggestion from Dobes.
The idea is to encode each Unicode symbol as a sequence of ASCII characters. It is assumed that ASCII character 7 (the old control character that makes the computer produce a short beep) is not used in the grammar or in the text to parse. Then a Unicode character cannot be parsed as ASCII, and no sequence of characters from the input file can accidentally be parsed as a Unicode character. Of course, the input string of the parser needs to be preprocessed to encode the Unicode characters.
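The input preprocessing could look roughly like the sketch below. The byte layout (character 7 followed by the character's UTF-16 big-endian bytes) is an assumption inferred from the example expansions in this comment. The actual preprocessor may differ, and the function name `encode_input` is hypothetical.

```python
BEL = "\x07"  # ASCII character 7, assumed absent from grammar and input


def encode_input(text):
    """Replace each non-ASCII character with BEL + its UTF-16BE bytes."""
    if BEL in text:
        raise ValueError("input already contains character 7")
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            # 2 bytes for most characters, 4 for a surrogate pair
            units = ch.encode("utf-16-be")
            out.append(BEL + "".join(chr(b) for b in units))
    return "".join(out)


encoded = encode_input("∀x")
print([ord(c) for c in encoded])
```

Every character in the result has a code below 256, so a parser that only knows single-byte characters can consume it, and the leading character 7 marks where an encoded symbol starts.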

To write Unicode characters in SDF files, I introduced syntactic sugar: $Unicode denotes one or more Unicode characters or ranges of Unicode characters, e.g.
$Unicode(Ø,∀)
$Unicode(∀)
$Unicode(∀-水,“Symbol for Tablechef”)
$Unicode(“Symbol for Tablechef”)
* “Symbol for Tablechef” stands in for the character itself. In code you would write the character, but this character cannot be processed by the comment editor.

A preprocessor converts this file to valid SDF code. For the four examples, it produces:
[\7]([\0][\216]|[\34][\0])
[\7]([\34][\0])
[\7](([\34-\108][\0-\52])|([\216][\52][\221][\30]))
[\7]([\216][\52][\221][\30])
Because $Unicode expands to a sequence of character classes like these, use it only in lexical syntax.
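To make the expansion concrete, here is a small sketch that reproduces the three expansions without a range, under the assumption that each character is written as its UTF-16 big-endian bytes in decimal. The function names are hypothetical and ranges such as ∀-水 are deliberately not handled.

```python
def char_classes(ch):
    """One single-byte SDF character class per UTF-16BE byte of `ch`."""
    return "".join("[\\%d]" % b for b in ch.encode("utf-16-be"))


def expand_unicode(chars):
    """Expand a list of single characters the way $Unicode(a,b,...) appears to."""
    alternatives = "|".join(char_classes(ch) for ch in chars)
    return "[\\7](" + alternatives + ")"


print(expand_unicode(["Ø", "∀"]))
print(expand_unicode(["∀"]))
print(expand_unicode(["\U0001D11E"]))  # a character outside the BMP
```

The last call uses U+1D11E, whose surrogate pair D834 DD1E matches the four bytes 216, 52, 221, 30 in the fourth expansion. That match is what suggests UTF-16 as the underlying encoding.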

THIS IS NOT INTEGRATED INTO THE MAIN BRANCH CURRENTLY.
I did only a few short tests and need to test a bit more :).


On 21 June 2013 at 12:52 Sebastian Erdweg tagged @seba

On 30 June 2013 at 14:14 Moritz Lichter commented:

Now, after my tests, it seems that this mechanism works correctly. Now I am going to improve the Unicode preprocessor so that the user is able to use Unicode characters like other regular characters in SDF.


On 13 October 2013 at 12:10 Moritz Lichter commented:

Unicode support is now implemented. Unicode symbols can be used (with small exceptions in follow restrictions) anywhere as normal symbols in the grammar.
There is a separate branch waiting to be merged. A tutorial for Unicode is included in the doc folder of jsglr.


On 7 August 2014 at 23:47 Oskar van Rest commented:

It would be great if someone could take a look at this and merge the utf8 branch of jsglr into master.
I would like to use this for my SPARQL grammar and someone else was also asking about unicode support on IRC just now.
