#148 Using 'origin-text' during analysis makes SPT fail (project Spoofax on YellowGrass.org)

Eclipse: org.eclipse.platform.ide 4.6.0.I20160606-1100
Spoofax: org.metaborg.spoofax.eclipse 2.0.0
System: Linux amd64 4.4.0-34-generic
Example grammar:
context-free syntax
  
  Start.Empty = ID
Example analysis:
  editor-analyze = analyze-all(normalize, id, id|<language>)

  normalize:
   x -> x
   where
     <origin-text; debug> x
Example SPT:
test test1 [[
  exampleId
]] 1 error
The generated error is:
Failed to analyze the input fragment, which is required to evaluate some of the test expectations.
Expected analysis to succeed
Note that analysis only fails in SPT but not in the normal editor.

If there is a workaround, please let me know.
Submitted by Oskar van Rest on 19 August 2016 at 01:47

Tags: error, spt

On 19 August 2016 at 10:28 Gabriël Konat tagged spt

On 19 August 2016 at 16:22 Volker commented:

This issue is most likely caused by SPT’s ‘massaging’ of the token stream created after parsing.

Background
The problem is that in SPT there can be test fixtures in which test fragments will get inserted.
For parsing that means it should first parse a piece of text from a file at position x (the start of the fixture), then parse a piece at position y (the test case), then parse a piece at position z (the end of the fixture). However, in the file these offsets are ordered x < z < y, so the pieces are not parsed in order of increasing offset.
Such awesome parsing situations are not supported by our parser, so we combine these pieces of text into a single string and parse that.
This means, however, that the resulting token stream is based on this input string, with offsets that do not correspond to the offsets within the actual file.

This is where SPT ‘massages’ the token stream to get the offsets corresponding to the offsets in the actual file, and then reorders the tokens to be in order of increasing offset again. As I’m not familiar with all the uses of the Tokenizer (i.e. token stream) and what assumptions people make on it, there are likely more errors related to this part of SPT.
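
To make the offset bookkeeping concrete, here is a minimal sketch of the idea, not SPT's actual code. The function names, the tuple layout, and the offsets 10, 30 and 60 are all made up for illustration.

```python
# Hypothetical sketch: names and data layout are illustrative, not SPT's implementation.

def combine(pieces):
    """Concatenate (file_offset, text) pieces in parse order.

    Returns the combined string and, for each piece, a tuple
    (start_in_combined, file_offset, length).
    """
    combined = ""
    mapping = []
    for file_offset, text in pieces:
        mapping.append((len(combined), file_offset, len(text)))
        combined += text
    return combined, mapping


def to_file_offset(mapping, offset_in_combined):
    """Translate an offset in the combined string to an offset in the file."""
    for start, file_offset, length in mapping:
        if start <= offset_in_combined < start + length:
            return file_offset + (offset_in_combined - start)
    raise ValueError("offset outside the combined string")


# Parse order: fixture start (x=10), test fragment (y=60), fixture end (z=30).
pieces = [(10, "entity Person {\n"), (60, "name : String\n"), (30, "}\n")]
combined, mapping = combine(pieces)

# One token start per piece, as offsets in the combined string.
token_starts = [0, 16, 30]

# 'Massaging': rewrite the offsets to file offsets, then restore increasing order.
file_starts = [to_file_offset(mapping, o) for o in token_starts]
reordered = sorted(file_starts)
```

After the remapping, the offsets describe positions in the file, while the string that was parsed is still `combined`.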

Problem
I’m assuming the error here is that origin-text tries to get the text from the Tokenizer (which is the string that was actually parsed, i.e. the string result of concatenating the three pieces) based on the offsets of the tokens. These offsets, however, correspond to the offsets in the file, not in the string, so you get an index out of bounds exception.
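
The suspected failure can be imitated with the test from the report. This is only a sketch of the assumption above: `text_between` is a hypothetical stand-in for whatever origin-text does internally, written to be bounds-checked the way Java's `String.substring` is.

```python
# Hypothetical stand-in for a bounds-checked text lookup; not the actual origin-text code.
def text_between(input_string, start, end):
    if start < 0 or end > len(input_string) or start > end:
        raise IndexError(
            f"range {start}..{end} outside input of length {len(input_string)}"
        )
    return input_string[start:end]


spt_file = "test test1 [[\n  exampleId\n]] 1 error"

# The string that is actually parsed: only the fragment between [[ and ]].
fragment_start = spt_file.index("[[") + 2
fragment_end = spt_file.index("]]")
parsed_input = spt_file[fragment_start:fragment_end]

# Offsets of the ID token after massaging, i.e. relative to the file.
id_start = spt_file.index("exampleId")
id_end = id_start + len("exampleId")

try:
    text_between(parsed_input, id_start, id_end)
    failed = False
except IndexError:
    failed = True
```

Here the parsed input is 13 characters long while the token ends at file offset 25, so the lookup fails. The same offsets resolve correctly against `spt_file`.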

Suggested solutions
The simplest fix is to further change the Tokenizer’s contents and make the actual file contents the ‘input string’. This would solve our current issue, but I have no clue if it adversely affects any other part of Spoofax that uses the token stream.
It would also be kind of a ‘hack’, until we agree on a concept for parsing in these situations.

For example, we could have these different views on parsing:
1.) the input of parsing is a String (the file contents) and we supply offset ranges referring to the pieces of the String that should be parsed. The resulting token stream would have the entire file contents as input string, resulting in the proposed solution.
2.) the input of parsing is a list of tuples (start_offset, String). The resulting token stream would have the concatenated Strings as input string, but the offsets would be adjusted based on the given start offsets of each piece. In this case anyone using the Tokenizer (e.g. origin-text) would have to use some smart offset shuffling to figure out how to get the corresponding piece of text. Or we could implement that in the Tokenizer itself.
3.) the input of parsing is just a String and that’s it (the old/current way). In this case SPT should keep its hands off the token stream and find some other way to sync the parsed AST with the actual offsets in the file (required for showing errors and resolving selections).
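
As a rough illustration of view 2 with the offset shuffling done inside the Tokenizer, a sketch could look like the following. `PieceTokenizer` and its `text` method are hypothetical names, and the offsets are made up; this is not an existing Spoofax API.

```python
# Hypothetical sketch of view 2; not an existing Spoofax class.
class PieceTokenizer:
    """Holds the parsed input as (start_offset, text) pieces."""

    def __init__(self, pieces):
        # Keep the pieces ordered by their offset in the file.
        self.pieces = sorted(pieces)

    def text(self, start, end):
        """Return the text for file offsets [start, end) within a single piece."""
        for piece_start, piece_text in self.pieces:
            piece_end = piece_start + len(piece_text)
            if piece_start <= start and end <= piece_end:
                return piece_text[start - piece_start:end - piece_start]
        raise IndexError("range does not lie within a single parsed piece")


tokenizer = PieceTokenizer(
    [(10, "entity Person {\n"), (60, "name : String\n"), (30, "}\n")]
)
person = tokenizer.text(17, 23)
name = tokenizer.text(60, 64)
```

This sketch deliberately rejects ranges that span more than one piece. What the text of such a range should be is one of the things that would have to be decided for this view.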

I’m open to other suggestions.

tl;dr

We’re working on it, but it could be a while.

On 19 August 2016 at 16:36 Volker commented:

More interesting insight into the problem:

The way SPT currently massages the token stream leads to interesting implications for the TracingService as well.
For example, let’s use an entity:
entity Person {
  name : String
}
and represent that using an SPT test and test fixture:
fixture [[
  entity Person {
    [[...]]
  }
]]

test simple String property [[
  name : String
]]
Going by the offsets of the actual SPT file contents, the ISourceRegion spanned by the Entity AST node would start and end before the ISourceRegion of the ‘name’ Property AST node. The creator of the language (the entity language in this case) could assume that any property term of an entity is contained within the entity’s source region. However, such assumptions will not hold in SPT.
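
A small sketch of how such a containment assumption breaks. `Region` is a hypothetical stand-in for ISourceRegion with inclusive offsets, and the numbers are invented: the fixture is assumed to start at file offset 10 and close at 30, with the test fragment at 60.

```python
from dataclasses import dataclass


# Hypothetical stand-in for ISourceRegion; offsets are inclusive and made up.
@dataclass(frozen=True)
class Region:
    start: int
    end: int

    def contains(self, other):
        return self.start <= other.start and other.end <= self.end


# Regions in the single combined program text that was parsed.
entity_parsed = Region(0, 30)
property_parsed = Region(16, 28)

# The same AST nodes after mapping back to offsets in the SPT file.
entity_in_file = Region(10, 30)
property_in_file = Region(60, 72)

holds_in_parsed_text = entity_parsed.contains(property_parsed)
holds_in_spt_file = entity_in_file.contains(property_in_file)
```

The check holds for the combined program text and fails for the file offsets, which is exactly the kind of assumption a language designer might rely on.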

This means we should either make it clear to language designers that they should not use the Tokenizer and should not make any assumptions about source regions, or stop SPT from changing the token stream. The former sounds like SPT is overextending, which suggests option 3 is the correct approach: SPT should create this ‘imaginary world’ (having a single program text) for the language under test and then handle all of the nasty conversions to the ‘real world’ (fixtures and test fragments) in some other way.
