I have some XML data (UTF-8) that I'm trying to convert into another XML
set, which will eventually be UTF-16. The data contains encoded HTML/XML
entities.

The problem is that when I try to parse the data, HTML/XML entities
inside CDATA text are converted into 2-byte sequences that don't match
the characters they were meant to represent.

For instance, &#146; (which should be a right single quote) comes out as
the bytes C2 92 when the data is parsed, exported, and examined in a hex
editor.

Apparently what REXML and HTMLentities do is map a value like
"&#146;" to code point U+0092 (decimal 146) on the Unicode chart.
Unfortunately, this code point is a C1 CONTROL code, not a punctuation
character. The correct code point is U+2019 (RIGHT SINGLE QUOTATION
MARK), which is what byte 0x92 means in Windows-1252.

Is there a fix for this? Or does one have to write their own parser to
map these values back to appropriate usage?
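
To make the remapping concrete, here is a minimal sketch of the kind of
pre-processing step I have in mind. It assumes the references in the
128..159 range were really meant as Windows-1252 byte values, and it only
handles decimal references. The helper name fix_cp1252_refs is my own
invention, not part of REXML or HTMLentities.

```ruby
# U+0092 is what &#146; decodes to; its UTF-8 encoding is the two bytes C2 92.
puts "\u0092".unpack("H*").first   # prints "c292"

# Hypothetical helper: rewrite decimal character references in the C1
# control range (128..159) to the code points that Windows-1252 assigns
# to those byte values. Other references are left untouched.
def fix_cp1252_refs(text)
  text.gsub(/&#(\d+);/) do |ref|
    code = Regexp.last_match(1).to_i
    next ref unless (128..159).cover?(code)

    begin
      # Treat the number as a Windows-1252 byte and convert it to UTF-8.
      fixed = code.chr(Encoding::Windows_1252).encode(Encoding::UTF_8)
      "&#x#{fixed.ord.to_s(16).upcase};"
    rescue EncodingError, RangeError
      ref # a few byte values have no Windows-1252 mapping; keep them as-is
    end
  end
end

puts fix_cp1252_refs("It&#146;s here")   # prints "It&#x2019;s here"
```

This would have to run on the raw text before it is handed to the parser,
so that the parser only ever sees the corrected references.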

Thank you,
Mark
