I have some XML data (UTF-8) that I'm trying to convert into another XML set, which will eventually be UTF-16. The data contains encoded HTML/XML entities.

The problem is that when I try to parse the data, HTML/XML entities inside of CDATA text are converted into 2-byte codes that don't match their original usage.

For instance, &#146; (should be a right single quote) is translated into bytes C292 when parsed, exported, and examined in a hex editor.

Apparently what REXML and HTMLentities do is transliterate a value like "&#146;" to character point U-146 on the Unicode chart. Unfortunately, this point is a CONTROL code, not a punctuation code. The real character point should be U-2019.

Is there a fix for this? Or does one have to write their own parser to map these values back to appropriate usage?

Thank you,
Mark
on 2012-10-30 21:30
on 2012-10-31 09:32
Subject: REXML & HTMLentities incorrectly map to UTF-8
Date: Wed 31 Oct 12 05:30:50AM +0900

Quoting Mark S. (lists@ruby-forum.com):
> I have some XML data (UTF-8) that I'm trying to convert into another XML
> set which will be eventually UTF-16. The data contains encoded html/xml
> entities.
>
> The problem is that when I try to parse the data, html/xml entities
> inside of CDATA text are converted into 2-byte codes that don't match
> their original usage.
>
> For instance, &#146; (should be right single quote) is translated into
> bytes C292 when parsed and exported and examined in a hex editor.

I am not really sure about what happens within REXML there, but when you get your CDATA string, if you are sure that the stuff inside is UTF-8, you can force the encoding. By

    p string.encoding

you can see the current encoding, and by

    string.force_encoding('utf-8')

you can, *without changing the byte content of the string*, change Ruby's idea of how to interpret it. Then, you should be able to obtain a UTF-16 version of the same string by

    string_16 = string.encode('utf-16')

I cannot assure you this works. In the past, I poured bucketfuls of sweat and tears on character encodings - I was able to reach what I wanted to reach only by (much) trial and (much) error.

Carlo
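To make the difference between relabeling and transcoding concrete, here is a minimal Ruby sketch (not part of the original post). The sample string and variable names are illustrative assumptions, and UTF-16LE is chosen only so that the output carries no byte-order mark.

```ruby
# Raw bytes as they might arrive from a file: "fund’s" with U+2019 stored as UTF-8.
raw = "fund\xE2\x80\x99s".b            # .b returns a copy tagged ASCII-8BIT (binary)

# force_encoding only relabels the string; the bytes are untouched.
text = raw.dup.force_encoding("UTF-8")
same_bytes = (text.bytes == raw.bytes)

# encode transcodes: same characters, different bytes.
utf16 = text.encode("UTF-16LE")
hex = utf16.bytes.map { |b| format("%02X", b) }.join(" ")

puts text.encoding
puts same_bytes
puts hex
```

If the bytes were not actually valid UTF-8, `text.valid_encoding?` would return false after the relabeling step, which is a cheap check to run before calling `encode`.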
on 2012-10-31 13:22
On 2012-10-30, at 4:30 PM, Mark S. <lists@ruby-forum.com> wrote:
> Apparently what REXML and HTMLentities do is transliterate a value like
> "&#146;" to character point U-146 on the Unicode chart. Unfortunately,
> this point is a CONTROL code, not a punctuation code. The real character
> point should be U-2019.
>
> Is there a fix for this? Or does one have to write their own parser to
> map these values back to appropriate usage?

Are you saying that REXML is parsing the content of the CDATA section and replacing those entities? Or are you extracting the CDATA sections after REXML is finished and then parsing them yourself?

If REXML is doing this then have you tried Nokogiri? (REXML should not be parsing the contents of a CDATA section.) If not, then you'll need to do something along the lines of what Carlo suggested in his response.

If you're still having problems, can you post some sample XML and maybe some of your translation code?

Cheers,
Bob
on 2012-10-31 13:27
On 2012-10-31, at 8:19 AM, Bob Hutchison <hutch-lists@recursive.ca> wrote:
> Are you saying that REXML is parsing the content of the CDATA section and replacing those entities? Or are you extracting the CDATA sections after REXML is finished and then parsing them yourself?
>
> If REXML is doing this then have you tried Nokogiri? (REXML should not be parsing the contents of a CDATA section) If not,

"If not" --> if REXML is not parsing the entities in the CDATA section then...
on 2012-10-31 19:34
Hello Bob & Carlo,

> I am not really sure about what happens within rexml
> there, but when you get your CDATA string, if you are
> sure that the stuff inside is UTF-8, you can force the
> encoding. By

I'm pretty sure that REXML converts with UTF-8. That's what the tutorial implies. In any event, it has already done the translation at the moment I use element.text. The problem is that it has converted HTML entities like &#146; into the code point at 146 (which is C292) instead of into the corresponding functional code point 2019 (single right quote).

> Are you saying that REXML is parsing the content of the
> CDATA section and replacing those entities? Or are you
> extracting the CDATA sections after REXML is finished
> and then parsing them yourself?

Yes, REXML is replacing entities like &#146; and converting them into whatever happens to be at code point 146, which happens to be a control code -- not a character. This is not an intelligent mapping. This conversion apparently happens when I use any form of XPath to collect Elements. This is not what a typical user would expect.

There is a raw mode that will tell REXML to not translate anything, but then it also pulls out the enclosing tags. So I get <mystuff>Apostrophe: &#146; </mystuff>. So maybe I could clean out the tags in this code or maybe I could write some complicated recursive code that doesn't use XPath. But I would still need an intelligent way to convert HTML entities to UTF-8.

Which leads me to HTMLentities. If I try to use HTMLentities to translate the codes, it also does the useless translation of converting &#146; to a code point.

I didn't know about Nokogiri. I took 2 days to learn REXML ... thought it was a standard. Guess I'll look into NG and see if it's better.

Thanks!
Mark
on 2012-11-02 12:26
On 2012-10-31, at 2:34 PM, Mark S. <lists@ruby-forum.com> wrote:
> I didn't know about Nokogiri. I took 2 days to learn REXML ... thought
> it was a standard. Guess I'll look into NG and see if it's better.

I think Nokogiri is probably your best bet, sorry to tell you that. But most of what you learned getting going with REXML is portable to Nokogiri. Not only is Nokogiri significantly faster, it's way more powerful, is actively developed, and has a very accessible community that can help you out quickly.

Cheers,
Bob
on 2012-11-03 03:50
Hi,

In <b4beb061e4d5c4d274f78f143e4be29f@ruby-forum.com>
"REXML & HTMLentities incorrectly map to UTF-8" on Wed, 31 Oct 2012 05:30:50 +0900,
"Mark S." <lists@ruby-forum.com> wrote:
> Apparently what REXML and HTMLentities do is transliterate a value like
> "&#146;" to character point U-146 on the Unicode chart. Unfortunately,
> this point is a CONTROL code, not a punctuation code. The real character
> point should be U-2019.
>
> Is there a fix for this? Or does one have to write their own parser to
> map these values back to appropriate usage?

Could you show me some sample Ruby code? If I can reproduce your problem with the code on my machine, I will fix the problem and the fix will be shipped in Ruby 2.0.0.

Thanks,
on 2012-11-05 03:17
Kouhei Sutou wrote in post #1082578:
> Could you show me a sample Ruby code?
> If I can reproduce your problem with the code on my machine, I
> will fix the problem and the fix will be shipped in Ruby 2.0.0.

Here is some code to reproduce the problem, plus the input XML and the output XML that I got when running the code. If you view the output in an editor that shows hex code, you'll see that the apostrophe in "fund's" becomes transliterated to char point C292 -- which is just an unused control code.

The entity code used for the apostrophe is &#146;, which my O'Reilly HTML book indicates should indeed be rendered as an apostrophe.

But the problem is even worse. It turns out that if there is any HTML tagging inside of the CDATA ... REXML deletes the data! Sometimes it even hangs up with a "tree parsing error" (not exact text) with no indication of which source tag is causing the problem. (Sorry, can't provide that sample input since it's 15 MB of semi-private data.)

Thanks,
Mark
on 2012-11-05 13:05
On 2012-11-04, at 9:18 PM, Mark S. <lists@ruby-forum.com> wrote:
> http://www.ruby-forum.com/attachment/7857/sample-R...

It doesn't look, to me, as though there's any CDATA section in the input file. The output file does have CDATA sections, though. Did they get switched around?

Cheers,
Bob
on 2012-11-05 13:53
Hi,

In <fec82c6b596b842bf6731e87991ed3cc@ruby-forum.com>
"Re: REXML & HTMLentities incorrectly map to UTF-8" on Mon, 5 Nov 2012 11:18:33 +0900,
"Mark S." <lists@ruby-forum.com> wrote:
> The entity code used for the apostrophe is &#146; which my Oreilly HTML
> book indicates should indeed be rendered as an apostrophe.

Thanks for providing sample code.

First, "&#146;" should be handled as U+0092 in XML.

See also: http://www.w3.org/TR/REC-xml/#sec-references

    If the character reference begins with " &#x ", the digits
    and letters up to the terminating ; provide a hexadecimal
    representation of the character's code point in ISO/IEC
    10646. If it begins just with " &# ", the digits up to the
    terminating ; provide a decimal representation of the
    character's code point.

Yours is the "&#" case. It means that 146 is handled as decimal, and it is 0x92 in hexadecimal. So &#146; is U+0092 in XML.

(Note that XML is not HTML.)

> But the problem is even worse. It turns out that if there is any HTML
> tagging inside of the CDATA ... REXML deletes the data! Sometimes it
> even hangs up with a "tree parsing error" (not exact text) with no
> indication what source tag is giving the problem. (Sorry, can't provide
> that sample input since its 15megs of semi-private data).

I can't reproduce your problem with the following script:

    require "rexml/document"

    document = REXML::Document.new(<<-EOX)
    <notebook>
      <note><![CDATA[<html>tag</html>]]></note>
    </notebook>
    EOX

    note = document.elements["/notebook/note"]
    cdata = note[0]
    p cdata # => "<html>tag</html>"

It seems that the output includes the HTML tag in CDATA.

Thanks,
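The rule quoted from the XML specification can be sketched in plain Ruby. This is not REXML's actual code: `decode_char_refs` is a hypothetical helper written only to illustrate that a decimal reference such as `&#146;` names code point 146, that is U+0092, regardless of what a browser would display for it.

```ruby
# Hypothetical helper illustrating XML numeric character references:
# "&#N;" is decimal, "&#xN;" is hexadecimal, and both name a Unicode code point.
def decode_char_refs(text)
  text.gsub(/&#(x[0-9A-Fa-f]+|[0-9]+);/) do
    ref  = Regexp.last_match(1)
    code = ref.start_with?("x") ? ref[1..-1].hex : ref.to_i
    [code].pack("U")                 # code point -> UTF-8 encoded character
  end
end

decoded = decode_char_refs("fund&#146;s")
puts decoded.codepoints.map { |c| format("U+%04X", c) }.join(" ")
puts decode_char_refs("&#146;").bytes.map { |b| format("%02X", b) }.join(" ")
```

Under this rule the reference that yields a right single quote is `&#8217;` (or `&#x2019;`), not `&#146;`.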
on 2012-11-05 14:10
Something that might not have been noticed: U+0092 (officially called "PRIVATE USE TWO", but usually displayed as a rounded apostrophe because Windows-1252 puts one at byte 0x92) is two bytes long when encoded in UTF-8: 0xC2 0x92.

Where you're seeing what appears to be U+C292, I would assume you're actually seeing the two-byte UTF-8 encoded form of U+0092 (remember that UTF-8 is emphatically not UCS-2). Thus the character is being interpreted "correctly" as the apostrophe char, and output as UTF-8.

If it starts looking like "Â’" (that is, 0xC3 0x82 0xC2 0x92) then you're in double-encoding land.

And for the record, U+C292 isn't a control code, it's a hangul character: 슒

--
Matthew Kerwin, B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

"You'll never find a programming language that frees you from the burden of clarifying your ideas." - xkcd
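The byte sequences mentioned above can be checked with a few lines of Ruby. This is only an illustrative sketch; the variable names and the `hex` helper are assumptions introduced for the example.

```ruby
# Small helper for printing a string's bytes in hex.
def hex(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

c1      = "\u0092"      # the code point that &#146; refers to
hangul  = "\uC292"      # the unrelated hangul syllable

# Double encoding: treat the UTF-8 bytes as ISO-8859-1 and encode to UTF-8 again.
doubled = c1.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts hex(c1)        # UTF-8 form of U+0092
puts hex(hangul)    # UTF-8 form of U+C292
puts hex(doubled)   # what double encoding produces
```

Seeing `C2 92` in a hex editor therefore points to U+0092 stored as UTF-8, while U+C292 would appear as three bytes.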
on 2012-11-05 16:44
Bob Hutchison wrote in post #1082902:
> It doesn't look, to me, as though there's any CDATA section in the input
> file. The output file does have CDATA sections though did they get
> switched around?

There's no CDATA in the input because it was refusing to put out ANY output if there was CDATA in my real-life data. So I generated a set without CDATA so the entity problem could be investigated.

Thanks,
Mark
on 2012-11-05 16:56
Kouhei Sutou wrote in post #1082922:
> First, "&#146;" should be handled as U+0092 in XML.
> See also:
> http://www.w3.org/TR/REC-xml/#sec-references
>
> If the character reference begins with " &#x ", the digits
> and letters up to the terminating ; provide a hexadecimal
> representation of the character's code point in ISO/IEC
> 10646. If it begins just with " &# ", the digits up to the
> terminating ; provide a decimal representation of the
> character's code point.
>
> In your case, "&#" case. It means that 146 is handled as
> decimal and it is 0x92 in hexadecimal. So &#146; is U+0092
> in XML.
>
> (Note that XML is not HTML.)

I'm not sure what you're saying. The apostrophe started out life on a web page as &#146;. It lived in application "A" and was viewed as an apostrophe. During conversion, it was transliterated to (I guess) U-0092, which is represented by bytes C292. This displays in application "B" and everywhere else as a control code.

From my standpoint, it should have been translated either to whatever code is equivalent to an apostrophe, or to something byte-equivalent to &#146;. If that's not possible, it should at least leave the entities alone. It seems to only do these conversions if an XPath command is given. Using the "raw" option causes the data to be left alone, but INCLUDES the outer wrapping tags. There didn't seem to be a raw option that would just hand me the data inside the tags.

>> But the problem is even worse. It turns out that if there is any HTML
> ...
> I can't reproduce your problem with the following script:
>
> require "rexml/document"
>
> document = REXML::Document.new(<<-EOX)
> <notebook>
> <note><![CDATA[<html>tag</html>]]></note>
> </notebook>
> EOX

I suspect that your case is too simple. Maybe I'll revisit and see what data caused the problem.

Thanks,
Mark
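One possible way to get the apostrophe back, sketched here as a post-processing step, is to reinterpret the C1 control range (U+0080 to U+009F) as Windows-1252, which is how web browsers have traditionally treated references 128 to 159. This assumes the source data really was authored that way; `remap_c1_as_cp1252` is a hypothetical helper, not something REXML or HTMLentities provides.

```ruby
# Hypothetical helper: replace C1 control characters (U+0080..U+009F) with the
# characters Windows-1252 assigns to the same byte values (0x92 -> U+2019, etc.).
def remap_c1_as_cp1252(text)
  text.gsub(/[\u0080-\u009F]/) do |ch|
    byte = [ch.ord].pack("C")        # one raw byte, e.g. "\x92"
    begin
      byte.force_encoding("Windows-1252").encode("UTF-8")
    rescue EncodingError
      ch                             # byte has no Windows-1252 meaning; keep it
    end
  end
end

parsed = "fund\u0092s"               # the U+0092 that results from "fund&#146;s"
fixed  = remap_c1_as_cp1252(parsed)
puts fixed.codepoints.map { |c| format("U+%04X", c) }.join(" ")
```

The remapping would be applied to the strings returned by the parser, before writing the UTF-16 output. Text that legitimately contains C1 control characters would be altered too, so it is only appropriate when the input is known to come from Windows-1252-style HTML.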