Easier greedy / unambiguous token matching
The way “greedy” rules are currently defined is trickier and more error-prone than it has to be, I think. So, here’s a little thread to track progress on making it easier.
Probably the most obvious improvement would be to keep the restrictions needed for greedy matching next to the rules they apply to. Consider:
lexical syntax
  [a-zA-Z\_][a-zA-Z0-9\_]* -> ID
lexical restrictions
  ID -/- [a-zA-Z0-9\_]
Could these be combined into one line/rule? Like:
lexical syntax
  [a-zA-Z\_][a-zA-Z0-9\_]* -> ID{greedy} %% Special keyword
  [a-zA-Z\_][a-zA-Z0-9\_]*(?![a-zA-Z0-9\_]) -> ID %% Negative lookahead
The keyword “greedy” here would imply a restriction: the token fails to match if the match could be extended by one more character. The negative lookahead variant instead folds the lexical restriction into the regular expression itself.
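To make the effect of the restriction concrete, here is a rough sketch in Python rather than SDF. It lists every offset at which an ID starting at the beginning of the input could end, with and without the follow restriction. The helper name `id_ends` and the sample input are made up for illustration, and Python's `(?!...)` is only assumed to behave like the lookahead syntax proposed above.

```python
import re

# The ID rule and its follow restriction, transcribed to Python regex syntax.
ID = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
FOLLOW = re.compile(r"[a-zA-Z0-9_]")

# The combined form with negative lookahead (hypothetical in SDF).
ID_LOOKAHEAD = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*(?![a-zA-Z0-9_])")


def id_ends(text, restricted):
    """Return every end offset at which an ID starting at offset 0 may stop."""
    ends = []
    for end in range(1, len(text) + 1):
        if not ID.fullmatch(text[:end]):
            continue  # this prefix is not an ID at all
        if restricted and FOLLOW.match(text, end):
            continue  # the next character could extend the ID: reject
        ends.append(end)
    return ends


unrestricted = id_ends("foo1 bar", restricted=False)
restricted = id_ends("foo1 bar", restricted=True)
lookahead_end = ID_LOOKAHEAD.match("foo1 bar").end()
```

Without the restriction every prefix of `foo1` is a candidate ID, which is the ambiguity a scannerless parser has to deal with. With it, only the longest candidate survives, and the lookahead form stops at the same offset.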
The other cases that require lexical restrictions (in my grammar) to ensure greedy matches are keywords and operators. The problem there is that you can’t infer from a string like “if” or “*” what class of characters it was drawn from, so the “greedy” keyword wouldn’t cut it. Negative lookahead would work, but it would make every use of a keyword pretty wordy. The current approach of putting them all in one big restriction rule is pretty good, except that it’s a pain to keep that list up to date.
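A rough Python sketch of the keyword case, again not SDF: the lookahead is attached once to a pattern built from a single keyword list, so there is only one place to update. The keyword list, `IDENT_CHAR`, and `keyword_at` are hypothetical names chosen for illustration.

```python
import re

KEYWORDS = ["if", "else", "while"]  # hypothetical keyword list
IDENT_CHAR = r"[a-zA-Z0-9_]"        # characters that must not follow a keyword

# Longest keywords first, so a keyword is never shadowed by one of its prefixes.
_alternatives = "|".join(
    re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True)
)
KEYWORD = re.compile(r"(?:%s)(?!%s)" % (_alternatives, IDENT_CHAR))


def keyword_at(text, pos=0):
    """Return the keyword starting at pos, or None if there is none."""
    m = KEYWORD.match(text, pos)
    return m.group(0) if m else None
```

Here `keyword_at("if x")` yields `"if"`, while `keyword_at("iffy")` yields `None` because the character after `if` could continue an identifier.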
Anyway, I hope I’m capturing something useful or intelligent here …
Submitted by Dobes Vandermeer on 7 December 2012 at 01:34
Issue Log
Yes, this is a good idea. Will be on the agenda of the SDF project.
I do not get the part with the negative lookahead. For the greedy part, the current nightly supports longest-match, from Sebastian Erdweg’s work on layout-sensitive parsing.