On 01/11/22 4:56 PM, "Martin v. Loewis" <martin / v.loewis.de> wrote:

> Jim Menard <jimm / io.com> writes:
> 
>> Multi-byte characters are another problem. How should I handle character
>> references like "&#xD7A3;" when Ruby can't handle that character? Is there
>> any way that NQXML can reach 100.0% conformance?
> 
> I would suggest to follow the same strategy that Expat uses: Export
> all data as UTF-8 to the application. This will allow for arbitrary
> input encodings, as well as for arbitrary character entities.

I mentioned what the Eiffel XML parsers did in a previous message, buried in
one of the XML threads in the last couple of weeks. Eiffel has basically the
same problem as Ruby: a byte and a char are the same thing. There were two
(immediate) "solutions" used there: extend the string class to handle UTF-8
characters; and, load the UTF-8 data into the existing string class and
simply ignore what is going on. Both work, sort-of, though it is clear that
the answer is to lose the character-byte identity in the language.

This is a Ruby issue, not an XML parser issue. Ruby has to solve this
problem generally -- and I don't doubt that it will.

For now, I'd think that it is perfectly reasonable and workable to do
exactly as you suggest: export all character data as UTF-8 to the
application.
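
As a rough sketch of what that could look like for numeric character
references such as "&#xD7A3;": the parser expands each reference into its
UTF-8 bytes and hands the result to the application as an ordinary string.
The helper name expand_char_refs is made up for illustration and is not part
of NQXML, and the sketch does not check XML's rules for which code points are
legal.

```ruby
# Hypothetical helper, not NQXML's API: replace numeric character
# references (&#xHHHH; or &#DDDD;) with the UTF-8 encoding of the
# code point they name. Named entities such as &amp; are left alone.
def expand_char_refs(text)
  text.gsub(/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/) do
    # $1 is set for the hexadecimal form, $2 for the decimal form.
    code_point = $1 ? $1.hex : $2.to_i
    # Array#pack with "U" emits the code point as a UTF-8 sequence.
    [code_point].pack("U")
  end
end
```

With this, "&#xD7A3;" comes out as the three bytes ED 9E A3, which a
byte-oriented string can carry without knowing anything about characters.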

> 
> Alternatively, offer the application to
> a) receive the data in an encoding of their choice, or

I think that this problem is in Ruby (i.e. the application, from the parser's
point of view), not in the parser. Loading this kind of thing into the parser
is going to be a lot of work, work that will possibly be of little use to any
other software faced with the problem, and work that will be tossed when Ruby
supports proper characters (as I have no doubt that it will).

> b) offer the application to receive all data in the input encoding,
>  reporting an error when you get data that cannot be represented
>  in the input encoding (such as character entities).

I'm not sure what you are getting at here. What errors? Detected by what
software? If you mean errors in the XML file then it is easy: if the input
encoding is not as promised, the parser should fail (with a message :-).
Do you mean output? The XML writer has to ensure that it only writes
characters in the proper encoding (this is a different (hard) problem, and
in my mind at least, a different tool than the parser -- and again :-) the
problem is Ruby's and best solved there). If you mean in the application,
then this is just the price you pay for the convenience of shoving the UTF-8
into a Ruby string (you are going to have to test things, or document what
encodings your application supports).
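
To make the input and output cases concrete, here is a minimal sketch. It
assumes a Ruby whose strings carry an encoding tag (String#force_encoding,
String#valid_encoding? and String#encode), which is exactly the support
discussed above as missing. The method names check_declared_encoding and
write_in_encoding are hypothetical.

```ruby
# Input side (hypothetical helper): fail with a message if the bytes
# are not valid in the encoding the document promised.
def check_declared_encoding(bytes, declared = "UTF-8")
  # dup so the caller's string is not retagged in place.
  text = bytes.dup.force_encoding(declared)
  unless text.valid_encoding?
    raise ArgumentError, "input is not valid #{declared}"
  end
  text
end

# Output side (hypothetical helper): transcode to the target encoding.
# String#encode raises Encoding::UndefinedConversionError when a
# character has no representation in the target.
def write_in_encoding(text, target)
  text.encode(target)
end
```

The first method is the parser's job; the second belongs to the writer, and
its error is the one an application sees when it asks for an encoding that
cannot hold a character such as U+D7A3.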

It might be interesting to find out what the Perl parsers do -- I've got
"personal issues" with Perl, so I am definitely *not* the person to be doing
this :-)

Cheers,
Bob

> 
> HTH,
> Martin
>