Kouhei Sutou wrote in post #1082922:
>
> First, "&#146;" should be handled as U+0092 in XML.
> See also:
>   http://www.w3.org/TR/REC-xml/#sec-references
>
>   If the character reference begins with " &#x ", the digits
>   and letters up to the terminating ; provide a hexadecimal
>   representation of the character's code point in ISO/IEC
>   10646. If it begins just with " &# ", the digits up to the
>   terminating ; provide a decimal representation of the
>   character's code point.
>
> Your case is the "&#" case. It means that 146 is handled as
> decimal and it is 0x92 in hexadecimal. So &#146; is U+0092
> in XML.
>
> (Note that XML is not HTML.)

I'm not sure what you're saying.

The apostrophe started out life on a web page as &#146;. It lived in
application "A" and was displayed as an apostrophe. During conversion, it
was translated to (I guess) U+0092, which is represented by the bytes
C2 92. This displays in application "B" and everywhere else as a control
code.
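
To make the byte-level picture concrete, here is a minimal sketch in plain Ruby (no REXML involved). The helper names are mine, and the guess that application "A" was treating 146 as a Windows-1252 byte is an assumption, not something I have confirmed.

```ruby
# Read 146 as a Unicode code point, which is how an XML parser reads &#146;.
def xml_char(code_point)
  [code_point].pack("U")
end

# UTF-8 bytes of that character, as a hex string.
def utf8_hex(code_point)
  xml_char(code_point).bytes.map { |b| format("%02X", b) }.join(" ")
end

# Read 146 as a single Windows-1252 byte instead (assumed to be what
# a browser-like application does), then convert to UTF-8.
def cp1252_char(code_point)
  [code_point].pack("C")
              .force_encoding(Encoding::Windows_1252)
              .encode(Encoding::UTF_8)
end

puts utf8_hex(146)                    # bytes of U+0092, a C1 control code
puts cp1252_char(146).inspect         # the right single quotation mark
puts format("U+%04X", cp1252_char(146).ord)
```

If that assumption holds, the same number names two different characters depending on who reads it, which would explain the apostrophe in "A" and the control code in "B".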

From my standpoint, it should have been translated either to whatever
code point represents an apostrophe, or left byte-for-byte as &#146;.
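
As a workaround sketch for the first of those two outcomes: after extraction, any character in the C1 control range (U+0080 to U+009F) could be reinterpreted as a Windows-1252 byte. The method name `remap_c1` is hypothetical, and the whole idea assumes the source page really meant Windows-1252, which the XML itself cannot tell you.

```ruby
# Replace C1 control characters with the character that the same byte
# value denotes in Windows-1252 (an assumption about the source data).
def remap_c1(text)
  text.gsub(/[\u0080-\u009F]/) do |char|
    byte = [char.ord].pack("C").force_encoding(Encoding::Windows_1252)
    begin
      byte.encode(Encoding::UTF_8)
    rescue Encoding::UndefinedConversionError
      char # a few byte values are unassigned in Windows-1252; keep as-is
    end
  end
end

puts remap_c1("It\u0092s a test")
```

This only touches the C1 range, so ordinary text and legitimate non-ASCII characters pass through unchanged.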

If that's not possible, it should at least leave the entities alone. It
seems to do this conversion only when an XPath expression is given.

Using the "raw" option causes the data to be left alone, but INCLUDES 
the outer wrapping tags. There didn't seem to be a raw option that would 
just hand me the data inside the tags.
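
One thing that might give the inner data without the wrapping tags is to fetch the text node itself rather than the element. This is a sketch under the assumption that `REXML::Text#to_s` returns the text as it appeared in the source for parsed documents, while `Element#text` returns the expanded value; the sample document and the `note_text_forms` helper are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical sample; the real notebook data is not shown in this thread.
SAMPLE = "<notebook><note>It&#146;s</note></notebook>"

# Return both forms of the first text node under <note>.
def note_text_forms(xml)
  note = REXML::Document.new(xml).root.elements["note"]
  text_node = note.get_text          # REXML::Text, not a String
  {
    expanded: note.text,             # character references expanded
    raw:      text_node.to_s         # assumed: source form, reference intact
  }
end

forms = note_text_forms(SAMPLE)
puts forms[:expanded].inspect
puts forms[:raw].inspect
```

Note that `get_text` only returns the first text node, so an element with mixed content would need its `texts` collected and joined.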

>> But the problem is even worse. It turns out that if there is any HTML
> ...
> I can't reproduce your problem with the following script:
>
>   require "rexml/document"
>
>   document = REXML::Document.new(<<-EOX)
>   <notebook>
>     <note><![CDATA[<html>tag</html>]]></note>
>   </notebook>
>   EOX

I suspect that your case is too simple. Maybe I'll revisit and see what 
data caused the problem.

Thanks,
Mark

-- 
Posted via http://www.ruby-forum.com/.