Unicode characters as part of the language
It would be very nice if I could use symbols like λ, →, and so on as part of my language developed in Spoofax.
One possibility for implementing this would be a pre-pass that replaces them with a non-Unicode equivalent, such as replacing λ with \lambda. That might mess up the source locations, though, unless the replacement was also a single character (like \).
Submitted by Dobes Vandermeer on 11 January 2013 at 19:55
Issue Log
I would also appreciate Unicode symbols in the input, preferably in the UTF-8 encoding. Rather than having a pre-pass, I would love to see UTF-8 support in JSGLR, sdf2table and related tools.
I implemented support for Unicode characters. It's rather like the suggestion from Dobes.
The idea is to encode each Unicode symbol as a sequence of ASCII characters. It is assumed that ASCII character 7 (the old bell character, which made the computer produce a short beep) is not used in the grammar or in the text to parse. Then a Unicode character cannot be parsed as ASCII, and no sequence of characters from the input file can accidentally be parsed as a Unicode character. Of course, the input string of the parser needs to be preprocessed to encode the Unicode characters.
To write Unicode characters in SDF files, I introduced syntactic sugar: $Unicode, which denotes a single Unicode character or a range of Unicode characters, e.g.
$Unicode(Ø,∀)
$Unicode(∀)
$Unicode(∀-水,“Symbol for Tablechef”)
$Unicode(“Symbol for Tablechef”)
* In code you would write the character itself, but this character cannot be processed by the comment editor.
A preprocessor converts this file to valid SDF code. For the four examples above, it produces:
[\7]([\0][\216]|[\34][\0])
[\7]([\34][\0])
[\7](([\34-\108][\0-\52])|([\216][\52][\221][\30]))
[\7]([\216][\52][\221][\30])
Thus, use $Unicode only in lexical syntax.
THIS IS NOT INTEGRATED INTO THE MAIN BRANCH CURRENTLY.
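For readers who want to see the encoding concretely, here is a minimal Python sketch of it. It is not the actual preprocessor from the branch. The byte layout is inferred from the four generated patterns above, which are consistent with the bell character (ASCII 7) followed by the UTF-16 big-endian bytes of the character, with the last one using a surrogate pair. The function names `encode_char`, `encode_text` and `sdf_class` are hypothetical, and passing plain ASCII through unchanged is an assumption. Ranges such as ∀-水 are not handled here.

```python
BELL = "\x07"  # ASCII 7, assumed never to occur in the grammar or the input


def encode_char(ch: str) -> str:
    """Return BELL followed by the UTF-16BE bytes of ch, one character per byte."""
    # latin-1 maps each byte value 0..255 to the character with the same code.
    return BELL + ch.encode("utf-16-be").decode("latin-1")


def encode_text(text: str) -> str:
    """Encode every non-ASCII character; ASCII is passed through (assumption)."""
    return "".join(c if ord(c) < 128 else encode_char(c) for c in text)


def sdf_class(*chars: str) -> str:
    """Build an SDF pattern for a $Unicode(...) list of single characters."""
    alternatives = []
    for ch in chars:
        units = ch.encode("utf-16-be")
        # One character class per byte, written with decimal escapes.
        alternatives.append("".join(f"[\\{b}]" for b in units))
    return "[\\7](" + "|".join(alternatives) + ")"


if __name__ == "__main__":
    print(sdf_class("Ø", "∀"))
    print(sdf_class("∀"))
    print(repr(encode_text("λx")))
```

With this layout every encoded character takes up three or five characters in the parser input, so source locations reported by the parser would have to be mapped back to positions in the original text.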
I did only a few short tests and need to test a bit more :).
Now, after my tests, it seems that this mechanism works correctly. Now I am going to improve the Unicode preprocessor so that the user is able to use Unicode characters like other regular characters in SDF.
Unicode support is now implemented. Unicode symbols can be used (with small exceptions in follow restrictions) anywhere as normal symbols in the grammar.
There is a separate branch waiting to be merged. A tutorial for Unicode is included in the doc folder of jsglr.
It would be great if someone could take a look at this and merge the utf8 branch of jsglr into master.
I would like to use this for my SPARQL grammar and someone else was also asking about unicode support on IRC just now.