Unicode characters not handled correctly during Stratego transformation
Eclipse: epp.package.java 4.5.2.20160218-0600
Spoofax: org.metaborg.spoofax.eclipse 2.0.0.beta1
System: Linux amd64 3.19.0-58-generic
Program:
"招弟"
AST: String("\"招弟\"")
(correct!)
Analyzed AST after using un-double-quote: String("ÿÿ")
(wrong! Should be: String("招弟"))
The bug is only there when using JAR instead of CTREE.
Submitted by Oskar van Rest on 19 April 2016 at 22:03
This was working fine in Spoofax 1.4.2.
Issue Log
The issue appears not only in combination with un-double-quote/un-single-quote, but with all kinds of transformations.
For example:
!String("招弟"); ?String(<id>); ?s
binds s to "ÿÿ" instead of "招弟".
I’m not sure what would cause this, maybe different encoding somehow?
I tried to debug it a bit and found out that the characters only change during analysis, not during other Stratego transformations.
For example, after normalization, the characters still look fine. But after analysis, they are changed.
editor-analyze = analyze-all(normalize; debug, id, id|<language>); debug
Output:
14:54 | INFO | stderr - Module("example",[Entity("\"招弟\"",[])])
14:54 | INFO | stderr - Result([FileResult("eclipse:///entity/example.ent",Module("example",[Entity("\"招弟\"",[])]),Module("example",[Entity("\"ÿÿ\"",[])]),[],[],[])],[],[],DebugResult(CollectDebugResult(0,0,0,0,0),[],[],[]),TimeResult(545656.0,1022172.0,1.1414683E7,1517174.0,116765.0,-1.0,-1.0))
I also found out that this is not something new, but something that was also in Spoofax 1.4.1. The reason I thought it was only introduced in 2.0.0 is that I previously worked with the normalized AST, but I have since switched to using the analyzed AST.
Does this only happen with NaBL+TS analysis? If you don’t use analyze-all, but some custom analysis instead, does it still happen?
Only with NaBL+TS analysis. For example, the Unicode characters are preserved with the following analysis strategy:
editor-analyze = ?[File(f,ast,_)]; !Result([FileResult(f,ast,ast,[],[],[])],[],[],DebugResult(CollectDebugResult(0,0,0,0,0),[],[],[]),TimeResult(0,0,0,0,0,0,0))
Do note that it only happens with JAR but not with CTREE.
You can reproduce it with the following grammar:
context-free syntax
Start.Empty = STRING
and example file:
"招弟"
Ok, so it only occurs with the analysis from the analysis library, when in JAR mode. That’s some weird bug :)
Even weirder: PGX switched to Java 8 last night and now the problem is gone. No matter if I compile the Spoofax project with Java 7 or Java 8, what matters is the version of HotSpot that it runs with afterwards.
However, I don’t fully understand it. When showing the analyzed AST inside Eclipse, the bug is still there (not a big deal), even though I set the JDK compliance level to 1.8. I think it still uses 1.7 because the Spoofax plugins demand that or so.
Ok so ignore my previous comment. The issue is still there.
It is most likely some encoding problem. I found this online:
ÿ is character 255 in ISO-8859-1 (and maybe other character encodings). It is also the first byte of the Byte Order Marker for UTF16_LE. Please check the encoding you are using to save the file, and use the appropriate character set when you open it for reading, e.g.
Note that every special character, such as €, Ω and 招, ends up as ÿ.
During analysis, do you write the AST to an output stream and then read it later again? If so, I expect there is something going wrong there. Can’t really explain why it’s only with JAR but not CTREE.
Not that I know of. We somehow use a builtin/primitive/strategy in the analysis library that behaves differently in JAR mode that causes the string to be rewritten wrongly. No idea which one it is though…
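For reference, the kind of write/read round trip asked about above can be sketched in Java (chosen only because Spoofax runs on the JVM). The class and method names are made up for illustration and are not part of Spoofax. The sketch checks that ÿ is the single byte 0xFF in ISO-8859-1, and that 招弟 survives a round trip through UTF-8 but not through ISO-8859-1. Note that Java's own ISO-8859-1 encoder substitutes ? for unmappable characters rather than ÿ, so this shows the general class of problem, not necessarily the mechanism behind this bug.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    // Encode text to bytes and decode it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u62DB\u5F1F"; // 招弟, written as escapes to avoid source-encoding issues

        // ÿ (U+00FF) is the single byte 0xFF in ISO-8859-1.
        byte[] latin1 = "\u00FF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length == 1 && (latin1[0] & 0xFF) == 0xFF); // true

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // true

        // ISO-8859-1 cannot represent CJK characters, so the round trip is lossy.
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1).equals(original)); // false
    }
}
```

If a stream round trip were involved, passing an explicit Unicode charset on both the writing and the reading side would rule this class of problem out.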
I found out it goes wrong in nabl-collect(sibling-uris|lang, partition, unique*, uri*) in the following piece of code:
id#(nabl-siblings(|lang, partition, unique*, child-uri*))
What does id# mean? Can we rewrite it to something else?
Nice catch! That is generic term deconstruction and construction. What it does is deconstruct the current term into a constructor on the left-hand side and its subterms on the right-hand side. It then applies the strategies to the left- and right-hand side and reconstructs the term. In this case, we run id on the constructor, and nabl-siblings(|lang, partition, unique*, child-uri*) on all subterms.
I have no idea what happens when this is executed on something that is not a constructor application; I guess it goes wrong on strings in JAR mode :)
Also, I don’t know why we don’t just use all(nabl-siblings(|lang, partition, unique*, child-uri*)) there. I think that does the same, but Stratego can probably still surprise me.
Disregard that, I remember why we did that there. nabl-siblings is executed on the list of all subterms, because we thread state through siblings.
Maybe it can be fixed by skipping that line when the term is a string or integer:
(where(is-string + is-int) <+ preserve-annos(origin-track-forced(id#(nabl-siblings(|lang, partition, unique*, child-uri*)))))
this fixed the issue indeed: https://github.com/metaborg/runtime-libraries/pull/19
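The shape of that guard can be modelled outside Stratego. Below is a minimal Java sketch, assuming a made-up Term class that is not the Spoofax term API, and covering only string leaves (the integer case and the recursion inside nabl-siblings are left out). In the Stratego version, where(...) tests the term without changing it, and <+ only tries the right-hand side if the left-hand side fails. So a string is returned as-is and never reaches id#, while a constructor application is still taken apart, has the strategy applied to its whole list of siblings, and is rebuilt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class GuardedTraversal {
    // Hypothetical, simplified term: a string leaf or a constructor application.
    static final class Term {
        final String constructor;  // null for string leaves
        final String value;        // null for constructor applications
        final List<Term> children; // empty for string leaves

        Term(String constructor, String value, List<Term> children) {
            this.constructor = constructor;
            this.value = value;
            this.children = children;
        }

        static Term str(String value) {
            return new Term(null, value, new ArrayList<Term>());
        }

        static Term appl(String constructor, Term... children) {
            return new Term(constructor, null, new ArrayList<Term>(Arrays.asList(children)));
        }

        boolean isString() {
            return constructor == null;
        }
    }

    // Models `where(is-string) <+ id#(s)`: leaves are returned untouched;
    // applications are taken apart, s runs on the whole list of siblings,
    // and the term is rebuilt with the same constructor.
    static Term guardedSiblings(Term term, UnaryOperator<List<Term>> s) {
        if (term.isString()) {
            return term; // guard: never deconstruct and rebuild a string
        }
        return new Term(term.constructor, null, s.apply(term.children));
    }

    public static void main(String[] args) {
        Term leaf = Term.str("\u62DB\u5F1F"); // 招弟
        Term entity = Term.appl("Entity", leaf);
        UnaryOperator<List<Term>> identity = siblings -> siblings;

        // The leaf comes back as the very same object, so its characters cannot change.
        System.out.println(guardedSiblings(leaf, identity) == leaf);
        // The application is rebuilt around its unchanged sibling list.
        Term rebuilt = guardedSiblings(entity, identity);
        System.out.println(rebuilt.constructor.equals("Entity") && rebuilt.children.get(0) == leaf);
    }
}
```

Passing the strategy the whole sibling list, rather than one child at a time, mirrors the reason given earlier for using id# instead of all: state is threaded through the siblings.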
closing the issue