#337 Add support for unicode characters in lexicals and comments (project SpoofaxLegacy on YellowGrass.org)

The parse table format doesn’t quite support unicode, but we can map unicode letters, unicode numbers, and all other non-ASCII unicode characters to the \255, \254, and \253 characters, respectively.

That way, identifiers with unicode letters and numbers can be specified as:

[a-zA-Z\255][a-zA-Z0-9_\255\254]* -> ID

while line comments and string literals work as before but now also support non-ASCII characters:

"\"" StringChar* "\"" -> STRING
~[\"\n] -> StringChar
"//" ~[\n\r]* ([\n\r] | EOF) -> LAYOUT

Submitted by Lennart Kats on 9 February 2011 at 11:16

feature 1.0 @lennartkats !dobesv

On 9 February 2011 at 11:18 Lennart Kats commented:

To be included with 0.6.1.

On 9 February 2011 at 11:18 Lennart Kats closed this issue.

On 9 February 2011 at 11:18 Lennart Kats tagged 1.0

On 28 December 2011 at 13:16 Lennart Kats tagged @lennartkats

On 7 September 2012 at 21:41 Emmanuel Castro commented:

Where is it written in the documentation?

On 30 October 2012 at 20:30 Dobes Vandermeer commented:

Has this been implemented? How does it work?

Ideally to support parsing a language like Scala you would need support for unicode character classes, I think. Not sure if this workaround would help with that.

On 1 December 2012 at 07:04 Dobes Vandermeer tagged !dobesv

On 1 December 2012 at 07:05 Dobes Vandermeer commented:

On re-reading the description I now think that it works like this:

\255 matches any single unicode letter
\254 matches any single unicode number
\253 matches any other single unicode non-ASCII character
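The reading above can be sketched as a small preprocessing step that runs before the parser sees the input. This is only an illustration in Python, not the actual Spoofax implementation: the function names are hypothetical, and the choice of unicode general categories (L* for letters, N* for numbers) is an assumption, since the issue does not say how "letter" and "number" are decided.

```python
import re
import unicodedata

# Placeholder values taken from the issue text (SDF writes them in decimal).
LETTER, NUMBER, OTHER = 255, 254, 253  # \255, \254, \253


def to_parser_bytes(text: str) -> bytes:
    """Map each character to one byte: ASCII is kept, the rest becomes a placeholder."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 128:
            out.append(ord(ch))
            continue
        # Assumption: general categories L* count as letters and N* as numbers.
        category = unicodedata.category(ch)
        if category.startswith("L"):
            out.append(LETTER)
        elif category.startswith("N"):
            out.append(NUMBER)
        else:
            out.append(OTHER)
    return bytes(out)


# Rough regex counterpart of: [a-zA-Z\255][a-zA-Z0-9_\255\254]* -> ID
ID = re.compile(rb"[a-zA-Z\xff][a-zA-Z0-9_\xff\xfe]*")


def is_identifier(text: str) -> bool:
    """Check whether the mapped text matches the ID pattern as a whole."""
    return ID.fullmatch(to_parser_bytes(text)) is not None
```

Note that the mapping is lossy: the original characters have to be recovered from the source text by position, because every unicode letter ends up as the same byte. Under the category assumption made here, combining marks fall into the "other" class, so a decomposed accented letter would not be accepted inside an identifier.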
