Austin Ziegler wrote:

> On 3/10/06, Anthony DeRobertis <aderobertis / metrics.net> wrote:
>
>> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
>> UNIX program: UTF-16 allows the octet 0x00, which has always been
>> the end-of-string marker.
> 
> You're right. And I'm saying that I don't care.

Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.
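
To make that concrete, a quick irb session (this assumes a Ruby recent
enough to have String#encode; the string is just an example):

    "abc".encode("UTF-16LE").bytes  # => [97, 0, 98, 0, 99, 0]
    "abc".encode("UTF-8").bytes     # => [97, 98, 99]

Any C string routine (and hence most of UNIX) stops at the first 0x00,
so the UTF-16 name above would be silently truncated to "a".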

> People need to stop 
> thinking in terms of bytes (octets) and start thinking in terms of
> characters. I'll say it flat out here: the POSIX filesystem definition
> is going to badly limit what can be done with Unix systems.

Why? POSIX file names are nearly binary-transparent; the only reserved
octets are 0x00 and 0x2F (the path separator, '/'). Given the lossless
1:1 mapping between UTF-8 and the other Unicode encodings, how can the
choice of one or another "badly limit" what can be done?
 
>> Personally, my file names have been in UTF-8 for quite some time now,
>> and it works well: What exactly is this 'stone age' you refer to?
> 
> Change an environment variable and watch your programs break that had
> worked so well with Unicode. *That* is the stone age that I refer to.

dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What's your point?

It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that's not the OS's problem.

> I'm also guessing that you don't do much with long Japanese filenames
> or deep paths that involve *anything* except US-ASCII (a subset of
> UTF-8).

Well, I have Japanese file names (though not that many, in the grand
scheme of things), and plenty of files and directories with
non-US-ASCII names. Yeah, I know that file name and path length limits
suck, but those are implementation limits of, e.g., ext3, nothing
fundamental.
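
For the record, on a typical Linux/ext3 box the limits are 255 octets
per file name component and 4096 per path, so a rough sketch of where
a UTF-8 Japanese name tops out (assuming three octets per character):

    name = "名" * 85
    name.bytesize   # => 255, right at ext3's NAME_MAX
    name.length     # => 85 characters, not 255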

> 
>> UTF-8 can take multiple octets to represent a character. So can
>> UTF-16, UTF-32, and every other variation of Unicode.
> 
> This last statement is true only because you use the term "octet."

You're correct; that isn't what I meant to say. Something along these
lines is better worded:

        UTF-8 can take more than one octet to represent a
        character; UTF-16 can take more than two (surrogate
        pairs); UTF-32 more than four (combining sequences);
        etc.
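
A few concrete cases, in Ruby (the sample characters are arbitrary):

    "\u{E9}".encode("UTF-8").bytesize        # => 2 (e-acute)
    "\u{1D11E}".encode("UTF-16LE").bytesize  # => 4 (surrogate pair)
    "e\u0301".encode("UTF-32LE").bytesize    # => 8 (one visible
                                             #    character, two
                                             #    code points)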

> It's a useless term here, because UTF-8 only has any level of
> efficiency for US-ASCII.

English, I've heard, is a rather common language.

> Even if you step to European content, UTF-8 
> is no longer perfectly efficient,

Of course not --- but still generally better than UTF-16, I think.
Spanish, I've heard, is also a rather common language.

> and when you step to Asian content, 
> UTF-8 is so bloody inefficient that most folks who have to deal with
> it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
> which is 1..2 bytes or do everything in UTF-16.

Yes, for CJK, UTF-8 is fairly inefficient: a full 50% bigger than
UTF-16, at three octets per character instead of two.
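
Easy enough to verify (same String#encode assumption; the name is
invented):

    jp = "日本語のファイル名"       # 9 characters, all in the BMP
    jp.encode("UTF-8").bytesize     # => 27 (3 octets each)
    jp.encode("UTF-16LE").bytesize  # => 18 (2 octets each)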

OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is
lost), not having byte-order issues, etc.
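
The resynchronization is visible in the octets themselves: a UTF-8
continuation octet always starts with the bits 10, and a lead octet
never does, so a reader can always find the next character boundary:

    "日本".bytes                    # => [230, 151, 165, 230, 156, 172]
    "日本".bytes.map { |b| b >> 6 } # => [3, 2, 2, 3, 2, 2]

Here 3 means a lead octet (11xxxxxx) and 2 a continuation (10xxxxxx).
Drop any one octet and the next character is still unambiguous; lose
an octet of UTF-16 and everything after it is misaligned.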

Now, honestly, what portion of your hard disk is taken up by file names?