On 2012-10-31, at 8:19 AM, Bob Hutchison <hutch-lists / recursive.ca> wrote:

>
> On 2012-10-30, at 4:30 PM, Mark S. <lists / ruby-forum.com> wrote:
>
>> I have some XML data (UTF-8) that I'm trying to convert into another XML
>> set which will eventually be UTF-16. The data contains encoded html/xml
>> entities.
>>
>> The problem is that when I try to parse the data, html/xml entities
>> inside of CDATA text are converted into 2-byte codes that don't match
>> their original usage.
>>
>> For instance, &#146; (should be right single quote) is translated into
>> bytes C2 92 when parsed and exported and examined in a hex editor.
>>
>> Apparently what REXML and HTMLentities do is translate a value like
>> "&#146;" to code point U+0092 (decimal 146) on the Unicode chart. Unfortunately,
>> this point is a CONTROL code, not a punctuation code. The real code
>> point should be U+2019.
>>
>> Is there a fix for this? Or does one have to write their own parser to
>> map these values back to appropriate usage?
>
> Are you saying that REXML is parsing the content of the CDATA section and replacing those entities? Or are you extracting the CDATA sections after REXML is finished and then parsing them yourself?
>
> If REXML is doing this then have you tried Nokogiri? (REXML should not be parsing the contents of a CDATA section) If not,

"If not" --> if REXML is not parsing the entities in the CDATA section then...

> then you'll need to do something along the lines of what Carlos suggested in his response. If you're still having problems can you post some sample XML and maybe some of your translation code?
>
> Cheers,
> Bob
>
>
>>
>> Thank you,
>> Mark
>>
>> --
>> Posted via http://www.ruby-forum.com/.
>>
>
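
For what it's worth, here is a minimal sketch of the kind of remapping Mark describes. It is not necessarily what Carlos proposed. It assumes the references in the 128-159 range were written with Windows-1252 byte values in mind (so &#146; was meant as the right single quote, U+2019), and that you run it on the raw text before anything else expands the references. The method name fix_cp1252_refs is made up for illustration.

```ruby
# Rewrite numeric character references in the C1 control range (128-159)
# to the characters that Windows-1252 assigns to those byte values.
# ASSUMPTION: the data's author meant Windows-1252, not real C1 controls.
def fix_cp1252_refs(text)
  text.gsub(/&#(?:(\d+)|[xX]([0-9a-fA-F]+));/) do |ref|
    code = $1 ? $1.to_i : $2.to_i(16)

    # Leave every reference outside 0x80..0x9F exactly as it was.
    next ref unless (0x80..0x9F).cover?(code)

    begin
      # Treat the number as a Windows-1252 byte and transcode it to UTF-8.
      code.chr(Encoding::Windows_1252).encode(Encoding::UTF_8)
    rescue EncodingError, RangeError
      # A few values (e.g. 0x81) have no Windows-1252 mapping; keep them.
      ref
    end
  end
end

fixed = fix_cp1252_refs("It&#146;s a test")
puts fixed.unpack("U*").map { |cp| format("U+%04X", cp) }.join(" ")
```

Named entities such as &amp; and references outside that range are left alone, so the result can still be handed to whatever does the rest of the decoding.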