I have some XML data (UTF-8) that I’m trying to convert into another XML
set, which will eventually be UTF-16. The data contains encoded HTML/XML
entities. The problem is that when I try to parse the data, HTML/XML
entities inside CDATA text are converted into 2-byte codes that don’t
match their original usage.
For instance, &#146; (should be a right single quote) is translated into
the bytes C2 92 when parsed, exported, and examined in a hex editor.
Apparently what REXML and HTMLEntities do is translate a value like
“&#146;” to code point 146 (U+0092) on the Unicode chart. Unfortunately,
this point is a C1 CONTROL code, not a punctuation code. The real code
point should be U+2019.
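
To make the mismatch concrete: 146 is 0x92, which is the right single
quote as a Windows-1252 byte, but U+0092 as a Unicode code point. Below
is a minimal sketch of the kind of remapping I have in mind, assuming
every character in the range U+0080..U+009F was really meant as a
Windows-1252 byte. The helper name fix_c1_controls is hypothetical; it
is not part of REXML or HTMLEntities, and it would be run on the text
after parsing.

```ruby
# Hypothetical helper, not provided by REXML or HTMLEntities.
# Assumption: any C1 control character (U+0080..U+009F) in the parsed
# text came from a numeric reference that was meant as a Windows-1252
# byte (e.g. &#146; for the right single quote).
def fix_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/) do |ch|
    begin
      # Turn the code point back into a single raw byte, then
      # reinterpret that byte as Windows-1252 and transcode to UTF-8.
      [ch.ord].pack("C")
              .force_encoding(Encoding::Windows_1252)
              .encode(Encoding::UTF_8)
    rescue EncodingError
      # A few bytes (such as 0x81) have no Windows-1252 character;
      # leave those characters untouched.
      ch
    end
  end
end

parsed = "it\u0092s"                      # what the parser hands back
fixed  = fix_c1_controls(parsed)          # "it\u2019s"
```

In this sketch the broken string holds the bytes C2 92 for the quote,
and the fixed one holds E2 80 99, which is U+2019 in UTF-8.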
Is there a fix for this? Or does one have to write their own parser to
map these values back to appropriate usage?