David Masover wrote:
> Java at least did this sanely -- UTF16 is at least a fixed width. If you're 
> going to force a single encoding, why wouldn't you use fixed-width strings?

Actually, it's not. A fixed-width 16-bit encoding is simply
mathematically impossible, given that there are more than 65536
Unicode codepoints. AFAIK, you need (at the moment) at least 21 bits
to represent all Unicode codepoints. UTF-16 is *not* fixed-width: it
encodes every Unicode codepoint as either one or two 16-bit code
units, just like UTF-8 encodes every Unicode codepoint as 1, 2, 3 or
4 octets.
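
To make that concrete, here's a minimal Java sketch (the class name
is mine) showing a single codepoint outside the BMP occupying two
UTF-16 code units:

    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so a
    // Java string needs a surrogate pair, i.e. two 16-bit code
    // units, to store this single codepoint.
    public class Utf16Width {
        public static void main(String[] args) {
            String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
            System.out.println(clef.length());                         // 2 code units
            System.out.println(clef.codePointCount(0, clef.length())); // 1 codepoint
        }
    }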

The only two Unicode encodings that are fixed-width are the obsolete
UCS-2 (which can only encode the lower 65536 codepoints) and UTF-32.

You can produce corrupt strings and slice into a half-character in
Java just as you can in Ruby 1.8.
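
For example (a sketch; the class name is mine), String#substring
slices at code-unit boundaries, so it will happily cut a surrogate
pair in half:

    // Slicing at an arbitrary char index can split a surrogate pair,
    // leaving an unpaired high surrogate behind: a corrupt string,
    // much like slicing into a multibyte character in Ruby 1.8.
    public class HalfCharacter {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";       // 'a', U+1D11E, 'b'
            String broken = s.substring(0, 2); // cuts between the surrogates
            System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true
        }
    }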

> Oh, that's right -- UTF16 wastes half your RAM when dealing with mostly ASCII 
> characters. So UTF-8 makes the most sense... in the US.

Of course, that problem is even more pronounced with UTF-32.

German text blows up about 5%-10% when encoded in UTF-8 instead of
ISO8859-15. Arabic, Persian, Indic and East Asian text (which is,
after all, far more text than European) fares much worse. (E.g.
Chinese blows up *at least* 50% when encoded in UTF-8 instead of Big5
or GB2312, because a typical CJK character takes two bytes in those
encodings but three bytes in UTF-8.) Given that the current tendency
is for devices to get *smaller*, bandwidth to get *lower* and latency
to get *higher*, that's simply not a price everybody is willing to
pay.
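
Here's a rough Java sketch of the size difference (class name and
sample strings are mine; it assumes your JRE ships the ISO-8859-15
and GB2312 charsets, which are technically optional):

    // Compare encoded sizes of the same text in a legacy charset vs. UTF-8.
    public class Blowup {
        public static void main(String[] args) throws Exception {
            String german  = "Grüße aus München"; // 17 characters
            String chinese = "中文编码示例";       // 6 CJK characters
            System.out.println(german.getBytes("ISO-8859-15").length); // 17 bytes
            System.out.println(german.getBytes("UTF-8").length);       // 20 (umlaut-heavy sample)
            System.out.println(chinese.getBytes("GB2312").length);     // 12 bytes
            System.out.println(chinese.getBytes("UTF-8").length);      // 18 bytes, i.e. +50%
        }
    }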

> The whole point of having multiple encodings in the first place is that other 
> encodings make much more sense when you're not in the US.

There's also a lot of legacy data, even within the US. On IBM systems,
the standard encoding, even for greenfield systems that are being
written right now, is still pretty much EBCDIC all the way.

There simply does not exist a single encoding that is appropriate
for every case, or even for the majority of cases. In fact, I'm not
sure there is a single encoding that is appropriate for even a
significant minority of cases.

We tried that One Encoding To Rule Them All in Java, and it was a
failure. We tried it again with a different encoding in Java 5, and it
was a failure. We tried it in .NET, and it was a failure. The Python
community is currently in the process of realizing it was a failure.
Five years of work on PHP 6 were completely destroyed because of
this. (At least they realized it *before* releasing it into the
wild.)

And now there's a push for a One Encoding To Rule Them All in Ruby 2.
That's *literally* insane! (One definition of insanity is repeating
the same behavior and expecting a different outcome.)

jwm