Hi,

In <b4beb061e4d5c4d274f78f143e4be29f / ruby-forum.com>
  "REXML & HTMLentities incorrectly map to UTF-8" on Wed, 31 Oct 2012 05:30:50 +0900,
  "Mark S." <lists / ruby-forum.com> wrote:

> I have some XML data (UTF 8) that I'm trying to convert into another XML
> set which will be eventually UTF 16. The data contains encoded html/xml
> entities.
> 
> The problem is that when I try to parse the data, html/xml entities
> inside of CDATA text are converted into 2-byte codes that don't match
> their original usage.
> 
> For instance, &#146; (should be right single quote) is translated into
> bytes C292 when parsed and exported and examined in a hex editor.
> 
> Apparently what REXML and HTMLentities do is transliterate a value like
> "&#146;" to code point 146 (U+0092) on the Unicode chart. Unfortunately,
> this point is a CONTROL code, not a punctuation code. The real code
> point should be U+2019.
> 
> Is there a fix for this? Or does one have to write their own parser to
> map these values back to appropriate usage?

Could you show me some sample Ruby code?
If I can reproduce your problem with the code on my machine, I
will fix the problem and the fix will be shipped in Ruby 2.0.0.
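
In the meantime, here is a minimal sketch of the mapping as I understand
it from your description. It uses only core Ruby and does not call REXML
or HTMLentities, so it is my guess at the behavior, not your actual code.
The helper names utf8_hex and remap_c1_as_cp1252 are hypothetical, and
treating code points U+0080..U+009F as Windows-1252 bytes is an assumption
about what your data intends.

```ruby
# Hex dump of a string's UTF-8 bytes, e.g. for comparing with a hex editor.
def utf8_hex(text)
  text.encode(Encoding::UTF_8).bytes.map { |b| format("%02X", b) }.join
end

# Assumption: C1 control characters (U+0080..U+009F) in the text really
# came from Windows-1252 bytes, so reinterpret each one as such.
def remap_c1_as_cp1252(text)
  text.gsub(/[\u0080-\u009F]/) do |char|
    begin
      [char.ord].pack("C")
                .force_encoding(Encoding::Windows_1252)
                .encode(Encoding::UTF_8)
    rescue EncodingError
      char # leave the character as is if Windows-1252 has no mapping
    end
  end
end

# A numeric character reference names a code point, so "&#146;" is U+0092.
literal = 146.chr(Encoding::UTF_8)
puts utf8_hex(literal)                      # C292

# Read as a Windows-1252 byte instead, 0x92 is U+2019.
puts utf8_hex(remap_c1_as_cp1252(literal))  # E28099
```

If your script produces C292 in the same way, a pass like
remap_c1_as_cp1252 over the parsed text may be a workaround, but I would
still like to see your code first.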


Thanks,
--
kou