#105 Unicode characters not handled correctly during Stratego transformation (project Spoofax on YellowGrass.org)

Eclipse: epp.package.java 4.5.2.20160218-0600
Spoofax: org.metaborg.spoofax.eclipse 2.0.0.beta1
System: Linux amd64 3.19.0-58-generic
Program: "招弟"
AST: String("\"招弟\"") (correct!)
Analyzed AST after using un-double-quote: String("ÿÿ") (wrong! Should be: String("招弟"))

The bug is only there when using JAR instead of CTREE.
This was working fine in Spoofax 1.4.2.
Submitted by Oskar van Rest on 19 April 2016 at 22:03

error

On 19 April 2016 at 23:51 Oskar van Rest commented:

The issue appears not only in combination with un-double-quote/un-single-quote, but with all kinds of transformations.
For example:
!String("招弟"); ?String(<id>); ?s
Binds s to "ÿÿ" instead of "招弟"

On 25 April 2016 at 13:11 Gabriël Konat commented:

I’m not sure what would cause this, maybe different encoding somehow?

On 5 May 2016 at 23:56 Oskar van Rest commented:

I tried to debug it a bit and found out that characters only change during analysis, but not during other Stratego transformations.
For example, after normalization, the characters still look fine. But after analysis, they are changed.
  editor-analyze = analyze-all(normalize; debug, id, id|<language>); debug
Output:
14:54 | INFO  | stderr                         - Module("example",[Entity("\"招弟\"",[])])
14:54 | INFO  | stderr                         - Result([FileResult("eclipse:///entity/example.ent",Module("example",[Entity("\"招弟\"",[])]),Module("example",[Entity("\"ÿÿ\"",[])]),[],[],[])],[],[],DebugResult(CollectDebugResult(0,0,0,0,0),[],[],[]),TimeResult(545656.0,1022172.0,1.1414683E7,1517174.0,116765.0,-1.0,-1.0))
I also found out that this is not something new, but something that was also in Spoofax 1.4.1. The reason I thought it was only introduced in 2.0.0 is that I previously worked with the normalized AST, but I have since switched to using the analyzed AST.

On 9 May 2016 at 16:22 Gabriël Konat commented:

Does this only happen with NaBL+TS analysis? If you don’t use analyze-all, but some custom analysis instead, does it still happen?

On 9 May 2016 at 22:08 Oskar van Rest commented:

Only with NaBL+TS analysis. For example, the Unicode characters are preserved with the following analysis strategy:
editor-analyze = ?[File(f,ast,_)]; !Result([FileResult(f,ast,ast,[],[],[])],[],[],DebugResult(CollectDebugResult(0,0,0,0,0),[],[],[]),TimeResult(0,0,0,0,0,0,0))

On 9 May 2016 at 22:13 Oskar van Rest commented:

Do note that it only happens with JAR but not with CTREE.

You can reproduce it with the following grammar:
context-free syntax
  Start.Empty = STRING
and example file:
"招弟"

On 10 May 2016 at 11:06 Gabriël Konat commented:

Ok, so it only occurs with the analysis from the analysis library, when in JAR mode. That’s some weird bug :)

On 10 May 2016 at 19:52 Oskar van Rest commented:

Even weirder: PGX switched to Java 8 last night and now the problem is gone. No matter if I compile the Spoofax project with Java 7 or Java 8, what matters is the version of HotSpot that it runs with afterwards.

However, I don’t fully understand it. When showing the analyzed AST inside Eclipse, the bug is still there (not a big deal), even though I set the JDK compliance level to 1.8. I think it still uses 1.7 because the Spoofax plugins demand that or so.
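
To see which JVM and default charset a process actually runs with, as opposed to the compliance level it was compiled for, a small check like the following can help. This is only a sketch: the class name `RuntimeCheck` is made up for illustration, and it would have to be run inside the process in question (for example from a plugin or a launched test) to say anything about Eclipse itself.

```java
import java.nio.charset.Charset;

final class RuntimeCheck {
    /** Summarizes the running JVM and its default charset using standard system properties. */
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
            + ", java.vm.name=" + System.getProperty("java.vm.name")
            + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The values depend entirely on the JVM that executes this class.
        System.out.println(describe());
    }
}
```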

On 17 May 2016 at 22:55 Oskar van Rest commented:

Ok so ignore my previous comment. The issue is still there.

It is most likely some encoding problem. I found this online:
ÿ is character 255 in ISO-8859-1 (and maybe other character encodings). It is also the first byte of the Byte Order Marker for UTF16_LE.
Please check the encoding you are using to save the file, and use the appropriate character set when you open it for reading, e.g.
Note that every special character, such as €, Ω and 招, ends up as ÿ.

During analysis, do you write the AST to an output stream and then read it later again? If so, I expect there is something going wrong there. Can’t really explain why it’s only with JAR but not CTREE.
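
The byte-level facts quoted above can be checked directly on the JVM. The Java sketch below (class and method names are made up for illustration) confirms that ÿ is byte 0xFF in ISO-8859-1 and that the UTF-16LE byte order mark starts with 0xFF. It also shows that a plain ISO-8859-1 round trip of 招弟 through `String.getBytes` yields `??` rather than `ÿÿ`, which suggests that the corruption here comes from some other lossy step and not from a simple charset mismatch when writing and reading a stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingFacts {
    /** Encodes s with cs and returns the bytes as unsigned values (0..255). */
    static int[] unsignedBytes(String s, Charset cs) {
        byte[] raw = s.getBytes(cs);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF;
        }
        return out;
    }

    /** Encodes s with cs and decodes the bytes again with the same charset. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // The two characters from the report, written as escapes (U+62DB, U+5F1F).
        String name = "\u62DB\u5F1F";

        // U+00FF (y with diaeresis) is the single byte 0xFF in ISO-8859-1.
        System.out.println(unsignedBytes("\u00FF", StandardCharsets.ISO_8859_1)[0] == 0xFF);

        // The byte order mark U+FEFF encoded as UTF-16LE is the byte pair FF FE.
        int[] bom = unsignedBytes("\uFEFF", StandardCharsets.UTF_16LE);
        System.out.println(bom[0] == 0xFF && bom[1] == 0xFE);

        // A UTF-8 round trip preserves the text.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));

        // ISO-8859-1 cannot represent these characters; getBytes substitutes '?' for each.
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1).equals("??"));
    }
}
```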

On 18 May 2016 at 10:59 Gabriël Konat commented:

Not that I know of. Somehow we use a builtin/primitive/strategy in the analysis library that behaves differently in JAR mode and causes the string to be rewritten incorrectly. No idea which one it is though…

On 19 May 2016 at 00:11 Oskar van Rest commented:

I found out it goes wrong in nabl-collect(sibling-uris|lang, partition, unique*, uri*) in the following piece of code:
id#(nabl-siblings(|lang, partition, unique*, child-uri*))
What does id# mean? Can we rewrite it to something else?

On 19 May 2016 at 10:46 Gabriël Konat commented:

Nice catch! That is generic term deconstruction and construction. What it does is deconstruct the current term into a constructor on the left hand side and its subterms on the right hand side. It then applies the strategies to the left and right hand side and reconstructs the term. In this case, we run id on the constructor, and nabl-siblings(|lang, partition, unique*, child-uri*) on all subterms.

I have no idea what happens when this is executed on something that is not a constructor application. I guess it goes wrong on strings in JAR mode :)

Also, I don’t know why we don’t just use all(nabl-siblings(|lang, partition, unique*, child-uri*)) there, I think that does the same, but Stratego can probably still surprise me.

On 19 May 2016 at 13:05 Gabriël Konat commented:

Disregard that, I remember why we did that there. nabl-siblings is executed on the list of all subterms, because we thread state through siblings.

Maybe it can be fixed by skipping that line when the term is a string or integer: (where(is-string + is-int) <+ preserve-annos(origin-track-forced(id#(nabl-siblings(|lang, partition, unique*, child-uri*)))))
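
The difference between `all(s)` and `id#(s)`, and the effect of the proposed guard, can be modelled with a toy term representation. The sketch below is in Java rather than Stratego and is not the Spoofax implementation: all class and method names are hypothetical, integers are left out for brevity, and `lossyRebuild` is an invented stand-in for whatever goes wrong when a string is reconstructed in JAR mode, since the actual cause is never pinned down in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class GenericTraversalSketch {
    /** Toy term model: a string leaf or a constructor application. */
    interface Term {}

    static final class Str implements Term {
        final String value;
        Str(String value) { this.value = value; }
    }

    static final class Appl implements Term {
        final String cons;
        final List<Term> kids;
        Appl(String cons, List<Term> kids) { this.cons = cons; this.kids = kids; }
    }

    /** Models all(s): s is applied to each direct subterm on its own. */
    static Term all(Term t, UnaryOperator<Term> s) {
        if (!(t instanceof Appl)) return t; // leaves have no subterms
        Appl a = (Appl) t;
        List<Term> out = new ArrayList<>();
        for (Term kid : a.kids) out.add(s.apply(kid));
        return new Appl(a.cons, out);
    }

    /** Models id#(s): s receives the whole list of subterms, then the term is rebuilt. */
    static Term deconstructRebuild(Term t, UnaryOperator<List<Term>> s) {
        if (t instanceof Str) {
            // HYPOTHETICAL: stands in for the unknown faulty string rebuild in JAR mode.
            return new Str(lossyRebuild(((Str) t).value));
        }
        Appl a = (Appl) t;
        return new Appl(a.cons, s.apply(a.kids));
    }

    /** HYPOTHETICAL lossy step: every char above U+00FF collapses to U+00FF. */
    static String lossyRebuild(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c > 0xFF ? '\u00FF' : c);
        }
        return sb.toString();
    }

    /** Models where(is-string + is-int) <+ id#(s): string leaves are returned untouched. */
    static Term guarded(Term t, UnaryOperator<List<Term>> s) {
        if (t instanceof Str) return t;
        return deconstructRebuild(t, s);
    }

    public static void main(String[] args) {
        Term name = new Str("\u62DB\u5F1F"); // U+62DB U+5F1F, the string from the report

        // Unguarded: the string goes through the (hypothetical) lossy rebuild.
        Term broken = deconstructRebuild(name, kids -> kids);
        System.out.println(((Str) broken).value.equals("\u00FF\u00FF"));

        // Guarded: the very same string term comes back.
        System.out.println(guarded(name, kids -> kids) == name);

        // On an application, all(s) visits each child; id#(s) sees both siblings at once.
        Term entity = new Appl("Entity", Arrays.asList(name, new Str("x")));
        all(entity, kid -> kid);
        guarded(entity, kids -> { System.out.println(kids.size()); return kids; });
    }
}
```

Because the strategy passed to `guarded` sees the complete sibling list, it can thread state from one sibling to the next, which is the reason given above for not using `all` here.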

On 20 May 2016 at 20:13 Oskar van Rest commented:

This did indeed fix the issue: https://github.com/metaborg/runtime-libraries/pull/19

On 26 May 2016 at 02:48 Oskar van Rest commented:

closing the issue

On 26 May 2016 at 02:48 Oskar van Rest closed this issue.
