On Wed, Sep 17, 2008 at 10:09 AM, Matthias Wächter
<matthias / waechter.wiz.at> wrote:
> Is there a complete characterization of this whole problem? It seems
> to be the main reason for sticking to non-UTF-8 character sets in
> Ruby these days, and concluding from what I have read about it, a
> solution could be the addition of missing characters/codepoints to
> Unicode. Why does no-one consider going that way, but instead builds
> a complicated stack of functions for conversions on top level?

While Unicode does have private use areas, text that relies on them
is not generally interoperable. Adding characters to Unicode is a
lengthy process that requires going through a standards body. The
Unicode standard is updated every few years, but the Unicode
Consortium is much more likely to listen to the Japanese standards
bodies than to Ruby programmers.
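
To make the interoperability point concrete, here is a rough sketch
(the helper names are mine, purely for illustration, and it assumes
Ruby 1.9-style strings that know their encoding). The private use
ranges are valid codepoints, but the standard assigns them no
meaning, so the most a receiver can do is detect them:

```ruby
# Private use ranges as defined by the Unicode standard.
PRIVATE_USE_RANGES = [
  0xE000..0xF8FF,       # Private Use Area in the BMP
  0xF0000..0xFFFFD,     # Supplementary Private Use Area-A (plane 15)
  0x100000..0x10FFFD    # Supplementary Private Use Area-B (plane 16)
].freeze

# Hypothetical helper: true if the single character is private use.
def private_use?(char)
  cp = char.ord
  PRIVATE_USE_RANGES.any? { |range| range.cover?(cp) }
end

# Hypothetical helper: list the private use characters in a string.
# Their meaning depends on a private agreement with the sender.
def private_use_chars(text)
  text.each_char.select { |c| private_use?(c) }
end
```

Anything `private_use_chars` returns can only be interpreted if you
already know which vendor or project assigned it.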

> To some extent, it looks like 'some' people like insisting on the
> status quo as it makes them feel special, swimming upstream the
> Unicode waterfall, retaining on regional locales instead of solving
> the issue. I do explicitly not refer to Ruby or the developers, they
> just accept these special needs more than other computer language
> designers with less sympathy for this anomaly.

The reality is that Unicode *doesn't* completely represent all Asian
languages well (see the discussions around Han unification for a brief
primer on the issues involved). The problem is exacerbated in the
academic arena where people want to be able to represent ancient
characters accurately, but it's not limited to that. Just because you
and I can represent our words in under one hundred characters doesn't
mean that it's appropriate to do the same with others' languages.

It's getting better, but it's still not perfect.

> Nevertheless, a persisting fix is needed, and I think writing more
> and more crutches for encoding conversion goes the wrong way. This
> might still be needed for legacy file support, but day-to-day work
> should not have to deal with this issue so prominently.

Day-to-day work *doesn't*.

Deal with all of your stuff in a single encoding (UTF-8, UTF-16,
whatever) and you don't even have to think about it.

If you *ever* deal with more than one encoding, you're going to run
into this problem in *any* language.
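
A minimal sketch of both cases, assuming Ruby 1.9-style strings
(`to_utf8` is a hypothetical helper, not part of Ruby): mixing two
encodings fails, while transcoding once at the boundary lets the rest
of the program stay in a single encoding.

```ruby
# Hypothetical helper: tag incoming bytes with the encoding they are
# known to be in, then transcode them to UTF-8 at the boundary.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

utf8 = "\u65E5\u672C\u8A9E"               # UTF-8 text
sjis = utf8.encode(Encoding::Shift_JIS)   # same text as legacy data

# Mixing two non-ASCII strings in different encodings raises.
mixed_error =
  begin
    utf8 + sjis
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end

# Simulate raw bytes read from a legacy file, then convert once.
raw = sjis.dup.force_encoding(Encoding::BINARY)
joined = utf8 + to_utf8(raw, Encoding::Shift_JIS)
```

After the conversion, `joined` is ordinary UTF-8 and nothing further
down needs to care where the data came from.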

Sorry.

-austin, still working on a blog post about a .NET Unicode/XML bug
-- 
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
 * austin / halostatue.ca * http://www.halostatue.ca/feed/
 * austin / zieglers.ca