REXML & HTMLentities incorrectly map to UTF-8

I have some XML data (UTF-8) that I’m trying to convert into another XML
set that will eventually be UTF-16. The data contains encoded HTML/XML
entities.

The problem is that when I try to parse the data, HTML/XML entities
inside of CDATA text are converted into 2-byte codes that don’t match
their original usage.

For instance, &#146; (which should be a right single quote) is translated
into bytes C292 when parsed, exported, and examined in a hex editor.

Apparently what REXML and HTMLentities do is transliterate a value like
“&#146;” to code point 146 on the Unicode chart. Unfortunately, that
point is a CONTROL code, not a punctuation code. The real code point
should be U+2019.

Is there a fix for this? Or does one have to write their own parser to
map these values back to appropriate usage?

Thank you,
Mark

Subject: REXML & HTMLentities incorrectly map to UTF-8
Date: Wed 31 Oct 12 05:30:50AM +0900

Quoting Mark S. ([email protected]):

I have some XML data (UTF-8) that I’m trying to convert into another XML
set that will eventually be UTF-16. The data contains encoded HTML/XML
entities.

The problem is that when I try to parse the data, HTML/XML entities
inside of CDATA text are converted into 2-byte codes that don’t match
their original usage.

For instance, &#146; (which should be a right single quote) is translated
into bytes C292 when parsed, exported, and examined in a hex editor.

I am not really sure what happens inside REXML there, but when you get
your CDATA string, if you are sure that the content is UTF-8, you can
force the encoding. By

p string.encoding

you can see the current encoding, and by

string.force_encoding('utf-8')

you can, without changing the byte content of the string, change
Ruby’s idea of how to interpret it.

Then, you should be able to obtain a UTF-16 version of the same string
by

string_16 = string.encode('utf-16')
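
Putting those pieces together, a rough sketch of what I mean (untested,
and with cdata_string as a made-up stand-in for whatever you actually
pull out of the CDATA):

cdata_string = "fund\u0092s"               # hypothetical example text
p cdata_string.encoding                    # what Ruby currently thinks it is
cdata_string.force_encoding('utf-8')       # relabel, without touching the bytes
string_16 = cdata_string.encode('utf-16')  # transcode to UTF-16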

I cannot guarantee that this works. In the past, I poured bucketfuls of
sweat and tears into character encodings - I got what I wanted only
through (much) trial and (much) error.

Carlo

On 2012-10-30, at 4:30 PM, Mark S. [email protected] wrote:

Apparently what REXML and HTMLentities do is transliterate a value like
“&#146;” to code point 146 on the Unicode chart. Unfortunately, that
point is a CONTROL code, not a punctuation code. The real code point
should be U+2019.

Is there a fix for this? Or does one have to write their own parser to
map these values back to appropriate usage?

Are you saying that REXML is parsing the content of the CDATA section
and replacing those entities? Or are you extracting the CDATA sections
after REXML is finished and then parsing them yourself?

If REXML is doing this then have you tried Nokogiri? (REXML should not
be parsing the contents of a CDATA section.) If not, then you’ll need to
do something along the lines of what Carlo suggested in his response.
If you’re still having problems, can you post some sample XML and maybe
some of your translation code?

Cheers,
Bob

On 2012-10-31, at 8:19 AM, Bob H. [email protected]
wrote:

Are you saying that REXML is parsing the content of the CDATA section and
replacing those entities? Or are you extracting the CDATA sections after REXML is
finished and then parsing them yourself?

If REXML is doing this then have you tried Nokogiri? (REXML should not be
parsing the contents of a CDATA section) If not,

“If not” → if REXML is not parsing the entities in the CDATA section
then…

Hello Bob & Carlo,

I am not really sure what happens inside REXML
there, but when you get your CDATA string, if you are
sure that the content is UTF-8, you can force the
encoding. By

I’m pretty sure that REXML converts to UTF-8; that’s what the tutorial
implies. In any event, it’s already done the translation by the time I
use element.text. The problem is that it’s converted HTML entities like
&#146; into the character at code point 146 (which is bytes C292)
instead of into the corresponding functional code point 2019 (right
single quote).

Are you saying that REXML is parsing the content of the
CDATA section and replacing those entities? Or are you
extracting the CDATA sections after REXML is finished
and then parsing them yourself?

Yes, REXML is replacing entities like &#146; and converting them into
whatever happens to be at code point 146, which happens to be a control
code, not a printable character. This is not an intelligent mapping.

This conversion apparently happens whenever I use any form of XPath to
collect Elements. This is not what a typical user would expect.

There is a raw mode that will tell REXML not to translate anything, but
then it also hands back the enclosing tags, so I get the tags wrapped
around

Apostrophe: &#146;

So maybe I could clean the tags out in this code, or maybe I could write
some complicated recursive code that doesn’t use XPath. But I would
still need an intelligent way to convert HTML entities to UTF-8.

Which leads me to HTMLentities.

If I try to use HTMLentities to translate the codes, it does the same
unhelpful translation, converting &#146; straight to code point 146.
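
The kind of mapping I have in mind would be something along these lines
(untested, and assuming the stray code points really are Windows-1252
punctuation, as &#146; is):

# Untested sketch: reinterpret the C1 range (U+0080..U+009F) as
# Windows-1252, so that U+0092 becomes U+2019 (right single quote).
def fix_cp1252_controls(text)
  text.gsub(/[\u0080-\u009F]/) do |ch|
    ch.ord.chr.force_encoding('Windows-1252').encode('UTF-8')
  end
end

fix_cp1252_controls("fund\u0092s")  # => "fund’s"

(It would still blow up on the handful of code points that Windows-1252
leaves undefined, but it gets the quotes and apostrophes back.)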

I didn’t know about Nokogiri. I took 2 days to learn REXML … thought
it was the standard. Guess I’ll look into Nokogiri and see if it’s better.

Thanks !
Mark

On 2012-10-31, at 2:34 PM, Mark S. [email protected] wrote:

I didn’t know about Nokogiri. I took 2 days to learn REXML … thought
it was the standard. Guess I’ll look into Nokogiri and see if it’s better.

I think nokogiri is probably your best bet, sorry to tell you that. But
most of what you learned getting going with REXML is portable to
nokogiri. Not only is nokogiri significantly faster, it’s also more
powerful, actively developed, and backed by a very accessible community
that can help you out quickly.

Cheers,
Bob

Hi,

In [email protected]
“REXML & HTMLentities incorrectly map to UTF-8” on Wed, 31 Oct 2012
05:30:50 +0900,
“Mark S.” [email protected] wrote:

Apparently what REXML and HTMLentities do is transliterate a value like
“&#146;” to code point 146 on the Unicode chart. Unfortunately, that
point is a CONTROL code, not a punctuation code. The real code point
should be U+2019.

Is there a fix for this? Or does one have to write their own parser to
map these values back to appropriate usage?

Could you show me some sample Ruby code?
If I can reproduce your problem with the code on my machine, I
will fix the problem and the fix will be shipped in Ruby 2.0.0.

Thanks,

On 2012-11-04, at 9:18 PM, Mark S. [email protected] wrote:

http://www.ruby-forum.com/attachment/7857/sample-REXML-input.xml

It doesn’t look, to me, as though there’s any CDATA section in the input
file. The output file does have CDATA sections, though; did they get
switched around?

Cheers,
Bob

Kouhei S. wrote in post #1082578:

Could you show me some sample Ruby code?
If I can reproduce your problem with the code on my machine, I
will fix the problem and the fix will be shipped in Ruby 2.0.0.

Here is some code to reproduce the problem, plus the input XML and the
output XML that I got when running the code. If you view the output in
an editor that shows hex codes, you’ll see that the apostrophe in
“fund’s” gets transliterated to char point C292 – which is just an
unused control code.

The entity code used for the apostrophe is &#146;, which my O’Reilly
HTML book indicates should indeed be rendered as an apostrophe.

But the problem is even worse. It turns out that if there is any HTML
tagging inside of the CDATA … REXML deletes the data! Sometimes it
even hangs up with a “tree parsing error” (not the exact text) with no
indication of which source tag is causing the problem. (Sorry, I can’t
provide that sample input since it’s 15 megs of semi-private data.)

Thanks,
Mark

Something that might not have been noticed:

U+0092 (apparently called “PRIVATE USE TWO”, but usually rendered as a
rounded apostrophe) when encoded in UTF-8 is two bytes long: 0xC2 0x92.
Where you’re seeing what appears to be U+C292, I would assume you’re
actually seeing the two-byte UTF-8 encoded form of U+0092 (remember that
UTF-8 is emphatically not UCS-2). Thus the character is being
interpreted “correctly” as the apostrophe character, and output as
UTF-8.

If it starts looking like “Â’” (that is, 0xC3 0x82 0xC2 0x92), then
you’re in double-encoding land.

And for the record, U+C292 isn’t a control code; it’s a Hangul character.
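
A quick check in irb makes the distinction visible (assuming Ruby 1.9 or
later):

s = 146.chr(Encoding::UTF_8)           # the character U+0092
s.bytes.map { |b| b.to_s(16) }         # => ["c2", "92"]  (two UTF-8 bytes, not U+C292)
"\u2019".bytes.map { |b| b.to_s(16) }  # => ["e2", "80", "99"]  (the real right single quote)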

Matthew K., B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

“You’ll never find a programming language that frees
you from the burden of clarifying your ideas.” - xkcd

Hi,

In [email protected]
“Re: REXML & HTMLentities incorrectly map to UTF-8” on Mon, 5 Nov 2012
11:18:33 +0900,
“Mark S.” [email protected] wrote:

The entity code used for the apostrophe is &#146;, which my O’Reilly
HTML book indicates should indeed be rendered as an apostrophe.

Thanks for providing sample code.

First, “&#146;” should be handled as U+0092 in XML.
See also:
Extensible Markup Language (XML) 1.0 (Fifth Edition)

If the character reference begins with “&#x”, the digits
and letters up to the terminating ; provide a hexadecimal
representation of the character’s code point in ISO/IEC
10646. If it begins just with “&#”, the digits up to the
terminating ; provide a decimal representation of the
character’s code point.

Your reference uses the “&#” form, so 146 is read as decimal;
146 decimal is 0x92 in hexadecimal, and therefore &#146; is
U+0092 in XML.

(Note that XML is not HTML.)
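
For example, with a small standalone document (not your data):

require "rexml/document"

doc = REXML::Document.new("<r>&#146;&#x2019;</r>")
p doc.root.text.codepoints
# => [146, 8217]

The decimal reference resolves to code point 146 (U+0092), and the hex
reference &#x2019; resolves to the real right single quote.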

But the problem is even worse. It turns out that if there is any HTML
tagging inside of the CDATA … REXML deletes the data! Sometimes it
even hangs up with a “tree parsing error” (not the exact text) with no
indication of which source tag is causing the problem. (Sorry, I can’t
provide that sample input since it’s 15 megs of semi-private data.)

I can’t reproduce your problem with the following script:

require "rexml/document"

document = REXML::Document.new(<<-EOX)
<notebook>
  <note><![CDATA[<b>tag</b>]]></note>
</notebook>
EOX

note = document.elements["/notebook/note"]
cdata = note[0]
p cdata

=> "<b>tag</b>"

It seems that the output includes the HTML tag inside the CDATA.

Thanks,

Bob H. wrote in post #1082902:

It doesn’t look, to me, as though there’s any CDATA section in the input
file. The output file does have CDATA sections, though; did they get
switched around?

There’s no CDATA in the input because REXML was refusing to put out ANY
output when there was CDATA in my real-life data. So I generated a set
without CDATA so that the entity problem could be investigated.

Thanks,
Mark

Kouhei S. wrote in post #1082922:

First, “&#146;” should be handled as U+0092 in XML.
See also:
Extensible Markup Language (XML) 1.0 (Fifth Edition)

If the character reference begins with “&#x”, the digits
and letters up to the terminating ; provide a hexadecimal
representation of the character’s code point in ISO/IEC
10646. If it begins just with “&#”, the digits up to the
terminating ; provide a decimal representation of the
character’s code point.

Your reference uses the “&#” form, so 146 is read as decimal;
146 decimal is 0x92 in hexadecimal, and therefore &#146; is
U+0092 in XML.

(Note that XML is not HTML.)

I’m not sure what you’re saying.

The apostrophe started out life on a web page as &#146;. It lived in
application “A” and was viewed as an apostrophe. During conversion, it
was transliterated to (I guess) U+0092, which is represented by bytes
C292. This displays in application “B” and everywhere else as a control
code.

From my standpoint, it should have been either translated to whatever
code point is equivalent to an apostrophe, or left byte-equivalent to
&#146;.

If that’s not possible, it should at least leave the entities alone. It
seems to do these conversions only when an XPath command is given.

Using the “raw” option causes the data to be left alone, but it INCLUDES
the outer wrapping tags. There didn’t seem to be a raw option that would
just hand me the data inside the tags.
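
If it comes to that, I suppose I could strip one layer of wrapping tags
off the raw text myself; something like this untested bit, where the raw
string is just a made-up example:

raw = "<note>Apostrophe: &#146;</note>"  # made-up example of raw-mode output
inner = raw.sub(/\A<[^>]+>/, '').sub(%r{</[^>]+>\z}, '')
# => "Apostrophe: &#146;"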

But the problem is even worse. It turns out that if there is any HTML

I can’t reproduce your problem with the following script:

require "rexml/document"

document = REXML::Document.new(<<-EOX)
<notebook>
  <note><![CDATA[<b>tag</b>]]></note>
</notebook>
EOX

I suspect that your test case is too simple. Maybe I’ll revisit this and
see what data caused the problem.

Thanks,
Mark