Hello Bob & Carlo,

> I am not really sure about what happens within rexml
> there, but when you get your CDATA string, if you are
> sure that the stuff inside is UTF-8, you can force the
> encoding. By

I'm pretty sure that REXML converts to UTF-8; that's what the tutorial 
implies. In any event, it's already done the translation by the time I 
use element.text. The problem is that it converts HTML entities like 
&#146; into code point 146 (which comes out as the UTF-8 bytes C2 92) 
instead of into the corresponding functional code point 2019 (the right 
single quote).
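
To make the mismatch concrete, here's a quick irb check (Ruby 1.9+ 
string escapes, purely illustrative -- nothing below touches REXML):

   wrong  = "\u0092"   # what &#146; ends up as: a C1 control character
   wanted = "\u2019"   # the right single quote I actually want

   wrong.codepoints.map  { |c| c.to_s(16) }   # => ["92"]
   wrong.bytes.map       { |b| b.to_s(16) }   # => ["c2", "92"]  (the C2 92 above)
   wanted.codepoints.map { |c| c.to_s(16) }   # => ["2019"]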

> Are you saying that REXML is parsing the content of the
> CDATA section and replacing those entities? Or are you
> extracting the CDATA sections after REXML is finished
> and then parsing them yourself?

Yes, REXML is replacing entities like &#146; and converting them into 
whatever happens to be at code point 146 -- which happens to be a 
control character, not a printable character. This is not an 
intelligent mapping.

This conversion apparently happens whenever I use any form of XPath to 
collect Elements. This is not what a typical user would expect.
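
Here's a stripped-down version of what I'm doing (Ruby 1.9 here; 
<mystuff> just stands in for my real element):

   require 'rexml/document'

   xml = '<mystuff>Apostrophe: &#146;</mystuff>'
   doc = REXML::Document.new(xml)

   el = REXML::XPath.first(doc, '//mystuff')
   # By the time I read .text the reference is already expanded:
   el.text.codepoints.to_a.last            # => 146  (U+0092, a control character)
   el.text.codepoints.to_a.last == 0x2019  # => false -- not the quote I want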

There is a raw mode that tells REXML not to translate anything, but 
then it also includes the enclosing tags in what it returns. So I get

   <mystuff>Apostrophe: &#146; </mystuff>.

So maybe I could clean out the tags in this code, or maybe I could 
write some complicated recursive code that doesn't use XPath. But I 
would still need an intelligent way to convert HTML entities to UTF-8.
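
The best workaround I can come up with so far looks something like this 
(untested sketch, Ruby 1.9 string escapes again -- CP1252_MAP and 
fix_ncr are just names I made up, and the table only covers the codes 
I've actually hit):

   CP1252_MAP = {
     145 => "\u2018",   # left single quote
     146 => "\u2019",   # right single quote
     147 => "\u201c",   # left double quote
     148 => "\u201d",   # right double quote
     150 => "\u2013",   # en dash
   }

   def fix_ncr(raw)
     # Drop the enclosing tags that raw mode leaves in place ...
     body = raw.sub(%r{\A<[^>]+>}, '').sub(%r{</[^>]+>\z}, '')
     # ... then map &#nnn; through the CP1252 table, or straight to UTF-8
     # for codes that really are plain Unicode code points.
     body.gsub(/&#(\d+);/) { CP1252_MAP[$1.to_i] || [$1.to_i].pack('U') }
   end

   fix_ncr('<mystuff>Apostrophe: &#146; </mystuff>')   # => "Apostrophe: \u2019 "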

Which leads me to HTMLEntities.

If I try to use HTMLEntities to translate the codes, it also does the 
useless translation of converting &#146; straight to code point 146.
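
If I'm reading its docs right, the call looks something like this 
(decode is the gem's documented method; the fix-up afterwards is just 
my own guess, reusing CP1252_MAP from the sketch above):

   require 'htmlentities'

   coder   = HTMLEntities.new
   decoded = coder.decode('Apostrophe: &#146;')
   decoded.codepoints.to_a.last   # => 146 again, the same control character

   # Once decoded, the references are gone, so the fix-up has to target
   # the control characters themselves rather than the &#nnn; strings:
   fixed = decoded.gsub(/[\u0080-\u009f]/) { |c| CP1252_MAP[c.ord] || c }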

I didn't know about Nokogiri. I took 2 days to learn REXML ... thought 
it was the standard. Guess I'll look into Nokogiri and see if it's 
better.
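
For the record, the equivalent lookup in Nokogiri seems to be something 
like the following (names taken from its docs; I haven't checked yet 
whether it treats &#146; any differently):

   require 'nokogiri'

   xml = '<mystuff>Apostrophe: &#146;</mystuff>'
   doc = Nokogiri::XML(xml)
   doc.at_xpath('//mystuff').text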

Thanks!
Mark

-- 
Posted via http://www.ruby-forum.com/.