On 01/11/12 6:26 AM, "Tobias Reif" <tobiasreif / pinkjuice.com> wrote:

> Bob Hutchison wrote:
>
>
>>> IMHO, default encoding of XML parser in Ruby should be UTF-8.
>>> Because XML is in Unicode world, not ISO-8859-* nor EUC world
>>> (unfortunately for me). And Ruby's regex doesn't support
>>> UTF-16.
>>> So, if the parser support only one encoding, it should be UTF-8,
>>> and documents in other encoding should be converted to UTF-8.
>>>
>>> Is it good solution?
>>>
>>
>> No I don't think so. How you represent the character stream internally is
>> entirely up to you (immediate *internal* conversion to UTF-8 by your parser
>> is OK). Restricting input to UTF-8 will place an impossible to live with
>> constraint on the use of your parser. Presumably having an XML parser is to
>> allow ruby programs to participate in a larger context -- and this larger
>> context isn't going to provide encoding conversions.
>
>
> http://www.w3.org/TR/REC-xml.html :
> http://www.w3.org/TR/REC-xml.html#charencoding :
> "All XML processors must be able to read entities in both the UTF-8 and
> UTF-16 encodings."

I don't know what you are getting at here, but I think this is the second
time this has been quoted in response to my message, so...

This says that an XML processor must be able to read entities in both UTF-8
and UTF-16 encodings -- it seems that I cannot say it any differently than
the spec does :-)

This says absolutely nothing at all about what happens after that.
Specifically, it does not require an XML application to work with any
particular encodings, and it does not require an XML application to produce
XML in any particular encodings (though you'd be making a mistake not to
support at least one of these two, since otherwise there is no guarantee
that your XML can be read).

There are a couple of things being discussed simultaneously in this thread:

1) what does an XML processor have to do regardless of implementation language.
2) what does Ruby have to do to be useful in this world.

What must be done for 1 is that the XML processor has to read both UTF-8
and UTF-16.

What must be done for 2 is for Ruby to support an internal character
encoding that covers the character set defined by UTF-8/16.

>
> Tobi
>
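
To make the split between "what the processor must read" and "what it uses
internally" concrete, here is a minimal sketch in present-day Ruby. It is an
illustration only: `to_internal_utf8` is a hypothetical helper, not part of
any existing XML library, and it assumes a Ruby with the `Encoding` class
(1.9 or later) plus `String#delete_prefix` (2.5 or later). It looks only at
the byte order mark and ignores the `encoding="..."` declaration, so it
covers just the two encodings the spec makes mandatory.

```ruby
# Hypothetical helper (not from any real XML library): accept a document in
# UTF-8 or UTF-16 and hand back a UTF-8 String for internal use.
# Assumption: only the byte order mark (BOM) is inspected; a real processor
# would also honour the encoding declaration in the XML prolog.
def to_internal_utf8(bytes)
  raw = bytes.dup.force_encoding(Encoding::BINARY)

  if raw.start_with?("\xFE\xFF".b)
    # Big-endian UTF-16: drop the 2-byte BOM, then transcode.
    raw.byteslice(2..-1).force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  elsif raw.start_with?("\xFF\xFE".b)
    # Little-endian UTF-16: drop the 2-byte BOM, then transcode.
    raw.byteslice(2..-1).force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  else
    # No UTF-16 BOM: treat the input as UTF-8, dropping an optional UTF-8 BOM.
    text = raw.delete_prefix("\xEF\xBB\xBF".b).force_encoding(Encoding::UTF_8)
    raise ArgumentError, "input is not valid UTF-8" unless text.valid_encoding?
    text
  end
end

# Usage: the same document arrives in two encodings and ends up identical
# internally.
utf8_doc  = "<greeting>gr\u00FC\u00DF dich</greeting>"
utf16_doc = "\uFEFF#{utf8_doc}".encode(Encoding::UTF_16LE)

puts to_internal_utf8(utf8_doc) == to_internal_utf8(utf16_doc)
```

The point of the sketch is that the input side stays open (both mandatory
encodings are accepted) while everything after the conversion sees a single
internal encoding, which is the "immediate *internal* conversion" described
above.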