On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

> You need glyphs, and some glyphs can be produced with multiple code
> points (e.g., LOWERCASE A + COMBINING ACUTE ACCENT as opposed to
> A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "má" and "má" to always be true
even if their "á" characters are encoded differently.  The right way
to solve this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization);
the idea is that you normalize the composed characters at the time
you create the string; then the internal equality test can be done
with strcmp() or equivalent.
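
To make that concrete, here's a minimal sketch in (today's) Ruby,
using the stdlib's String#unicode_normalize; the MyString wrapper
class is made up purely for illustration:

  # Sketch: normalize to NFC once, at construction time, so that
  # equality can be a plain byte-for-byte comparison afterwards.
  class MyString
    attr_reader :normalized

    def initialize(s)
      # Early Uniform Normalization happens exactly once, here.
      @normalized = s.unicode_normalize(:nfc)
    end

    def ==(other)
      # The strcmp() equivalent: both sides were normalized when
      # they were created, so byte equality is character equality.
      normalized == other.normalized
    end
  end

  composed   = MyString.new("\u00E1")   # A ACUTE, one code point
  decomposed = MyString.new("a\u0301")  # A + COMBINING ACUTE ACCENT
  puts composed == decomposed           # => true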

>> Map legacy data, that is characters still not in Unicode, to a high
>> Plane in Unicode. That way all characters can be used together all
>> the time. When Unicode includes them we can change that to the
>> official code points. Note there are no files in String's internal
>> storage format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane.  The Private Use Area  
in the Basic Multilingual Plane has 6,400 code points, which is quite
a few.  Even if you did use a high plane, it's not obvious there'd be  
a detectable runtime performance penalty.
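
For what it's worth, such a mapping doesn't have to be fancy; a
sketch (with an entirely made-up legacy character set) might be just
a table from legacy byte values into the BMP's Private Use Area,
U+E000 through U+F8FF:

  # Sketch: map characters from a hypothetical legacy encoding that
  # Unicode doesn't yet cover into the BMP Private Use Area.
  PUA_BASE = 0xE000  # the BMP PUA, U+E000..U+F8FF: 6,400 code points

  # Made-up table: legacy byte value => private-use code point.
  LEGACY_TO_PUA = {
    0x80 => PUA_BASE + 0,
    0x81 => PUA_BASE + 1,
  }

  def import_legacy(bytes)
    bytes.map { |b| (LEGACY_TO_PUA[b] || b).chr(Encoding::UTF_8) }.join
  end

  puts import_legacy([0x68, 0x69, 0x80])  # "hi" followed by U+E000

If Unicode later assigns real code points, only the table changes;
strings already created would need a one-time remap, but as the
quoted proposal notes, there are no files in the internal format to
reencode.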

>  Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case.  You need to do a cost-benefit analysis; you  
could buy a lot of simplicity by decreeing all-Unicode-internally;  
would the benefits of allowing non-Unicode characters be big enough  
to compensate for the loss of simplicity?  I don't know the
answer, but it needs thinking about.

  -Tim