Given the following SDF3:

context-free syntax
  Exp = <<ID>();>

template options
  tokenize : "();"

The following SDF is generated:

ID "();" -> Exp

Expected output

I expect (, ), and ; to be separate tokens, so that layout is allowed between them, i.e. ID "(" ")" ";" -> Exp. Instead, the generated production contains the single literal "();".


Detailed analysis

Tokenization is edge-triggered: a literal is split only where the input string changes from a non-tokenization character to a tokenization character, or vice versa. In the example above all three tokenization characters are adjacent, so no such change occurs and the literal is left as a single token.

The workaround that we’ve been teaching students is to change the tokenize option to "(;". The ) is then no longer a tokenization character, so there is a change on either side of it and the string is correctly tokenized as (, ), ;. But this is neither intuitive nor documented.
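
The behaviour described above can be reproduced with a small Python model. This is only a sketch of the edge-triggered strategy as described in this report, not the actual SDF3 generator code; the function name edge_tokenize is made up for illustration.

```python
def edge_tokenize(literal, tokenize_chars):
    """Hypothetical model: split `literal` wherever the input switches between
    a tokenization character and a non-tokenization character."""
    tokens = []
    current = ""
    for ch in literal:
        # Start a new token only when the character class changes.
        if current and (ch in tokenize_chars) != (current[-1] in tokenize_chars):
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens


print(edge_tokenize("();", "();"))  # ['();']
print(edge_tokenize("();", "(;"))   # ['(', ')', ';']
```

With tokenize set to "();" every character belongs to the same class, so the model never sees an edge and returns one token. With "(;" the ) falls into the other class and both of its neighbours become boundaries.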


This issue is related to #727. Eduardo and I discussed it and we don’t know why this edge-triggering strategy is being used. Perhaps there are use cases where you want to tokenize on a character, but not when that character occurs several times in a row. We can still support this use case if we allow users to specify the tokenization rules as lists of strings:

tokenize : ["(", ")", ";", ".."]

This allows you to specify multi-character tokenization rules (e.g. tokenize on two consecutive characters). Personally, I find it more readable to specify these rules on multiple lines (see #727 for Guido’s proposal).
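
To make the proposal more concrete, here is a rough Python sketch of how a list of tokenization strings could be applied to a literal. The longest-match rule and the function name list_tokenize are assumptions made for illustration; none of this is an existing SDF3 feature.

```python
def list_tokenize(literal, rules):
    """Hypothetical model: split `literal` on every occurrence of a string in
    `rules`, preferring the longest rule at each position (assumed semantics)."""
    # Ignore empty rules (they would never advance) and try longer rules first.
    ordered = sorted((r for r in rules if r), key=len, reverse=True)
    tokens = []
    rest = ""  # characters that match no rule are kept together
    i = 0
    while i < len(literal):
        match = next((r for r in ordered if literal.startswith(r, i)), None)
        if match is None:
            rest += literal[i]
            i += 1
            continue
        if rest:
            tokens.append(rest)
            rest = ""
        tokens.append(match)
        i += len(match)
    if rest:
        tokens.append(rest)
    return tokens


rules = ["(", ")", ";", ".."]
print(list_tokenize("();", rules))    # ['(', ')', ';']
print(list_tokenize("a..b.", rules))  # ['a', '..', 'b.']
```

Under these assumed semantics adjacent rules are split independently of each other, and a rule such as ".." only fires on two consecutive dots, so a single "." stays part of the surrounding text.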

Does anybody know the reasoning behind the current tokenization strategy? Is somebody already working on this? If not, I’d be glad to try to solve it.

Submitted by Martijn on 3 November 2015 at 12:46
