On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

> You need glyphs, and some glyphs can be produced with multiple code
> points (e.g., LOWERCASE A + COMBINING ACUTE ACCENT as opposed to
> A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "más" and "más" to always be true
even if their "á" characters are encoded differently.  The right way
to solve this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization);
the idea is that you normalize the composed characters at the time
you create the string; then the internal equality test can be done
with strcmp() or equivalent.
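
For example, here's a minimal sketch in Ruby (using
String#unicode_normalize from modern Ruby; the two spellings are the
"á" pair from the quote above):

    # "más" two ways: precomposed U+00E1 vs. a + COMBINING ACUTE (U+0301)
    composed   = "m\u00E1s"
    decomposed = "ma\u0301s"

    composed == decomposed                      # => false (different bytes)

    # Normalize once, when the string is created; after that a plain
    # byte-by-byte comparison is all you need.
    composed.unicode_normalize(:nfc) ==
      decomposed.unicode_normalize(:nfc)        # => true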

>> Map legacy data, that is characters still not in Unicode, to a high
>> Plane in Unicode. That way all characters can be used together all
>> the time. When Unicode includes them we can change that to the
>> official code points. Note there are no files in String's internal
>> storage format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane.  The Private Use Area
in the Basic Multilingual Plane has 6,400 code points (U+E000
through U+F8FF), which is quite a few.  Even if you did use a high
plane, it's not obvious there'd be a detectable runtime performance
penalty.
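
A sketch of what that mapping might look like (the legacy byte values
0x80 and 0x81 here are made up purely for illustration):

    # Hypothetical legacy-charset importer: route characters Unicode
    # doesn't have yet into the BMP Private Use Area.
    PUA_BASE   = 0xE000   # Private Use Area: U+E000..U+F8FF
    LEGACY_MAP = { 0x80 => PUA_BASE, 0x81 => PUA_BASE + 1 }

    def import_legacy(bytes)
      # Unmapped bytes pass through at their Latin-1 values, just to
      # keep the sketch short.
      bytes.map { |b| (LEGACY_MAP[b] || b).chr(Encoding::UTF_8) }.join
    end

    import_legacy([0x41, 0x80])   # => "A\uE000"

When Unicode does assign real code points, only LEGACY_MAP changes.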

>  Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case.  You need to do a cost-benefit analysis; you  
could buy a lot of simplicity by decreeing all-Unicode-internally;  
would the benefits of allowing non-Unicode characters be big enough  
to compensate for the loss of simplicity?  I don't know the
answer, but it needs thinking about.

  -Tim