Curt Sampson <cjs / cynic.net> wrote in message
news:<Pine.NEB.4.44.0208121234260.2317-100000 / angelic.cynic.net>...
> This is one of the big advantages of UTF-16 over UTF-8; you can do
> simple operations the simple way and still produce valid UTF-16 output.
> (There's no explicit rule, as far as I know at least, that states that
> UTF-8 parsers *must* ignore broken characters, as there is with UTF-16.)

UTF-8 parsers must ignore "broken" characters because, as I pointed out
in a previous message, "broken" characters are never valid UTF-8, due to
the UTF-8 design. The standard now permits only valid sequences to be
parsed (the loopholes that existed in Unicode version 3.0 were
eliminated by the updates in versions 3.1 and 3.2). The Unicode standard
expressly forbids the interpretation of illegal UTF-8 sequences.

There are also advantages to a fixed-width encoding, such as the
recently introduced UTF-32, which can often outweigh the endianness
issues. (Encodings which are not byte-grained, such as UTF-32 and
UTF-16, need two variants, big-endian and little-endian.)

UTF-16 was not thought through very well. It is an encoding that follows
the mental line of least resistance: encode the code points by their
numbers. There was no reason the encoding should have included zero
bytes, thus sabotaging byte-grained string processing by C programs. And
of course it was thought that all characters of interest to any
significant community could fit in the two-byte "Basic Multilingual
Plane". This is not attainable even with current Unicode unless you
consider Chinese, Japanese, mathematicians, and musicians to be
insignificant communities. Also, important further expansion outside the
BMP is inevitable.

But UTF-16, in both its big-endian and little-endian variants, is sure
to be one of those technical blunders which far outlives its
excusability, due to inertia and corporate politics, so Ruby should
probably provide direct support.
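
To make the zero-byte, surrogate, and validity points concrete, here is
a rough Ruby sketch. It assumes an interpreter with encoding-aware
strings (String#encode, String#valid_encoding?, and String#byteslice,
i.e. Ruby 1.9.3 or later); the variable names are only illustrative, and
this is not a proposal for how Ruby itself should implement anything.

```ruby
# Sketch only: assumes Ruby 1.9.3+ with encoding-aware strings.

ascii = "abc"

# UTF-16 pads every ASCII character with a zero byte; UTF-8 does not.
utf16_zero_bytes = ascii.encode("UTF-16BE").bytes.count(0)
utf8_zero_bytes  = ascii.encode("UTF-8").bytes.count(0)

# Encodings that are not byte-grained need two byte orders.
a_big_endian    = "a".encode("UTF-16BE").bytes   # zero byte first
a_little_endian = "a".encode("UTF-16LE").bytes   # zero byte last

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = [0x1D11E].pack("U")
clef_utf8_bytes  = clef.bytes.length                      # 4-byte sequence
clef_utf16_bytes = clef.encode("UTF-16BE").bytes.length   # surrogate pair
clef_utf32_bytes = clef.encode("UTF-32BE").bytes.length   # fixed width

# A truncated multi-byte sequence is never valid UTF-8.
broken = clef.byteslice(0, 2).force_encoding("UTF-8")
broken_is_valid = broken.valid_encoding?

puts "zero bytes in 'abc': UTF-16 #{utf16_zero_bytes}, UTF-8 #{utf8_zero_bytes}"
puts "G clef bytes: UTF-8 #{clef_utf8_bytes}, UTF-16 #{clef_utf16_bytes}, UTF-32 #{clef_utf32_bytes}"
puts "truncated sequence valid? #{broken_is_valid}"
```

Note that once a character falls outside the BMP, UTF-16 is no longer
fixed-width: the G clef takes two 16-bit units, the same four bytes it
costs in UTF-8 and UTF-32.
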
Failing that, Ruby could provide indirect support via invisible
translation to some other Unicode encoding.

Some people use UTF-16 as a disk storage format and expand to UTF-32 in
memory. This allows one to access characters directly by index in
Unicode strings held in memory, while avoiding crass inefficiency in
disk usage. But for general multilingual processing, UTF-8 seems more
efficient and handier as a disk storage format.

The Unicode consortium has recently promulgated *yet another* encoding
form, CESU-8, intended only for internal use by programs, and not for
data transfer between applications. CESU-8 is byte-grained and similar
to UTF-8, but it has been designed so that CESU-8 text will have the
same binary collation as equivalent UTF-16 text. I don't know if there
is a reason for Ruby to support this.

Though notoriously unwise myself, I'd like to make a plea for some
wisdom. Many people here have a great deal of experience with
internationalization, and rightly consider themselves experts. But
expertise comes in many flavors, and one should think twice before
making assertions about what *other* people need. The needs of a maker
of corporate web sites for internationalization, M17n, and so forth are
different from the needs of a mathematician, a musician, or someone
trying to computerize Akkadian tablets. We should avoid the parochial
thought that our interests are the only important or "practical" ones.

Regards,
Bret

http://www.rexx.com/~oinkoink/
oinkoink at rexx dot com