On Sat, 10 Aug 2002, MikkelFJ wrote:

> It is far less common to index strings by number of characters. But when you
> do, UCS-4 is better. A typical application is text formatting where you
> want the nearest linebreak to the given width of say 80 characters. UCS-2
> is probably not good enough because combining (or surrogates or whatever)
> will only take up one display unit. But then you probably need to take
> proportional spacing into account anyway.

Well, in a lot of cases it's no big deal, because you just want to
limit the length of a string. For example, I may want to truncate
a display field to twenty characters, so it doesn't overflow. With
UTF-16, I can safely just truncate. If I break a surrogate, no
problem; it doesn't display. If I break a combining character, it's
a bit more of a problem (because only part of it displays), but
nothing most people can't live with.
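
To make the truncation case concrete, here is a minimal sketch in Python.
The helper names (to_utf16_units, truncate_units, truncate_units_safe,
units_to_text) are made up for illustration and do not come from any
library. It compares plain truncation of UTF-16 code units with a variant
that drops a high surrogate left dangling at the cut. How a lone surrogate
is actually displayed depends on the consumer, so treat this as an
illustration of the idea rather than a statement about any particular
renderer.

```python
import struct

def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def truncate_units(units, limit):
    """Plain truncation: keep the first `limit` code units, nothing more."""
    return units[:limit]

def truncate_units_safe(units, limit):
    """Truncate, then drop a trailing high surrogate that lost its partner."""
    cut = units[:limit]
    if cut and 0xD800 <= cut[-1] <= 0xDBFF:
        cut = cut[:-1]
    return cut

def units_to_text(units):
    """Decode a list of well-formed UTF-16 code units back to a str."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")

# U+1F600 lies outside the BMP, so it occupies two code units (D83D DE00).
units = to_utf16_units("abc\U0001F600def")
naive = truncate_units(units, 4)       # ends in the lone high surrogate D83D
safe = truncate_units_safe(units, 4)   # the dangling surrogate is dropped
print([hex(u) for u in naive])
print(units_to_text(safe))
```

The plain version is the "just truncate" approach described above; the
safe variant costs one extra comparison if you would rather not hand a
broken pair to whatever consumes the string.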

This is one of the big advantages of UTF-16 over UTF-8; you can do
simple operations the simple way and still produce valid UTF-16 output.
(There's no explicit rule, as far as I know at least, that states that
UTF-8 parsers *must* ignore broken characters, as there is with UTF-16.)

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC