Executive summary: supporting "UTF-16" isn't easy.

Details:

I'm sure we would have supported "UTF-16", as several of you
have suggested, if we had figured out exactly how to do it,
and I'm rather sure we would still add it if we found a
clean way to do so.

But it's not that easy. Equating "UTF-16" with UTF-16BE might
work on paper, but not in practice: with the Mac moving to
Intel, almost everybody is on an LE machine :-(.
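
To illustrate with a small sketch (the byte string is a made-up
example, not taken from any real file): the same bytes give
completely different text depending on the byte order assumed.

```ruby
# Four bytes that a little-endian machine would write for "hi" in UTF-16.
bytes = "h\x00i\x00".b

# dup first, because force_encoding changes the receiver in place.
as_le = bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
as_be = bytes.dup.force_encoding("UTF-16BE").encode("UTF-8")

p as_le             # "hi"
p as_be.codepoints  # [26624, 26880], i.e. U+6800 and U+6900
```

Both readings are valid UTF-16, so there is no error to tell you
that the wrong guess was made.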

Using the BOM may work if there is one (but what if there isn't?),
but only for files, not really for strings, and more for input
than for output.
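
As a rough sketch of what BOM-based detection could look like on
the input side (sniff_utf16 is a hypothetical helper, not an
existing Ruby method), note that the "no BOM" branch has no good
answer:

```ruby
# Hypothetical helper: guess the UTF-16 variant from a leading BOM.
# Returns [encoding or nil, bytes with the BOM stripped].
def sniff_utf16(data)
  bytes = data.b  # binary copy, so the comparisons below are byte-wise
  if bytes.start_with?("\xFE\xFF".b)
    [Encoding::UTF_16BE, bytes.byteslice(2..-1)]
  elsif bytes.start_with?("\xFF\xFE".b)
    [Encoding::UTF_16LE, bytes.byteslice(2..-1)]
  else
    [nil, bytes]  # no BOM: the caller has to decide somehow
  end
end

enc, payload = sniff_utf16("\xFF\xFEh\x00i\x00")
text = payload.force_encoding(enc).encode("UTF-8")
p enc   # the UTF-16LE encoding object
p text  # "hi"
```

This only covers reading. For output, somebody still has to choose
a byte order and decide whether to write a BOM at all.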

Using host byte order may work for interfacing with some
system routines, but actually not quite, because for example,
there is no real guarantee in Ruby that any of the UTF-16
data is aligned on a 16-bit boundary (although in practice,
it shouldn't usually be off).
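
For what it's worth, finding the host byte order itself is the
easy part. A minimal sketch (host_utf16 is just an illustrative
name, and the alignment issue cannot be shown from Ruby code):

```ruby
# "S" packs a 16-bit unsigned integer in the host's native byte order,
# so the first byte is 1 on a little-endian machine and 0 otherwise.
little_endian = [1].pack("S").bytes.first == 1

# Hypothetical choice of what a bare "UTF-16" would mean on this machine.
host_utf16 = little_endian ? Encoding::UTF_16LE : Encoding::UTF_16BE
p host_utf16
```

The result differs from machine to machine, so data labeled this
way would not be portable.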

So we would end up with a lot of special casing, e.g. some
interfaces where "UTF-16" can be used and others where it cannot,
some additional identifiers (e.g. "UTF-16BE-BOM", "UTF-16LE-BOM"
or so to indicate that you want something with BOM, but in a
certain endianness), and so on.

Also, "UTF-16" (of course together with its friend "UTF-32")
would be the only encodings that (with lots of caveats) may
work for String#encode and related transcoding, but never
as internal encodings.

My personal experience, e.g. from having programmed the XML
encoding bootstrap stuff in several programming languages,
indicates that "UTF-16" always means some additional,
somewhat application-specific work anyway.

All such considerations led to the current state of not
supporting "UTF-16".

Regards,   Martin.


At 02:45 08/09/17, Tim Bray wrote:
>On Sep 16, 2008, at 3:37 AM, Martin Duerst wrote:
>
>> So I'm suggesting that we produce some special error message for
>> UTF-16, such as "UTF-16 not available, use either UTF-16BE or
>> UTF-16LE". In general, I don't like such special-casing, but
>> given that Ruby tries to be user-friendly, it may be worth
>> doing.
>
>I haven't been involved in this discussion, but in my experience a  
>fairly high proportion of UTF-16 texts begin with a Byte Order Mark.   
>In this case, the programmer doesn't need to worry about BE/LE.  So  
>maybe in some cases Ruby could actually accept "UTF-16" and silently  
>do the right thing?  -Tim
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst / it.aoyama.ac.jp