On Jun 18, 2006, at 8:29 AM, Austin Ziegler wrote:

> You need glyphs, and some glyphs can be produced with multiple code
> points (e.g., LOWERCASE A + COMBINING ACUTE ACCENT as opposed to
> A ACUTE).

This is another thing you need your String class to be smart about.
You want an equality test between "máÔ" and "máÔ" to always be true
even if their "á" characters are encoded differently. The right way to
solve this is called "Early Uniform Normalization" (see
http://www.w3.org/TR/2003/WD-charmod-20030822/#sec-Normalization); the
idea is that you normalize the composed characters at the time you
create the string, so the internal equality test can then be done with
strcmp() or equivalent.

>> Map legacy data, that is characters still not in Unicode, to a high
>> Plane in Unicode. That way all characters can be used together all
>> the time. When Unicode includes them we can change that to the
>> official code points. Note there are no files in String's internal
>> storage format, so we don't have to worry about reencoding them.
>
> Um. This is the statement of someone who is ignoring legacy issues.
> Performance *is* a big issue when you're dealing with enough legacy
> data.

Note that you don't have to use a high plane. The Private Use Area in
the Basic Multilingual Plane has 6,400 code points, which is quite a
few. Even if you did use a high plane, it's not obvious there'd be a
detectable runtime performance penalty.

> Unicode is *often* the right choice, but it's *not* the only
> choice and there are times when having the *flexibility* to work in
> other encodings without having to work through Unicode as an
> intermediary is the right choice.

That may be the case. You need to do a cost-benefit analysis: you
could buy a lot of simplicity by decreeing all-Unicode-internally.
Would the benefits of allowing non-Unicode characters be big enough to
compensate for the loss of simplicity? I don't know the answer, but it
needs thinking about.

-Tim
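
For illustration only, the "normalize when the string is created" idea described above can be sketched in a few lines of Python. The class name NormalizedString is hypothetical, and the choice of NFC as the normalization form is an assumption (the message does not name a form). The only library call used is the standard unicodedata.normalize().

```python
import unicodedata


class NormalizedString:
    """Hypothetical wrapper: normalize once, at construction time (NFC assumed)."""

    __slots__ = ("_text",)

    def __init__(self, text: str) -> None:
        # All normalization cost is paid here, when the string is created.
        self._text = unicodedata.normalize("NFC", text)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, NormalizedString):
            return NotImplemented
        # Plain code-point comparison, the analogue of strcmp().
        return self._text == other._text

    def __hash__(self) -> int:
        return hash(self._text)

    def __str__(self) -> str:
        return self._text


# The same visible text, spelled two ways.
precomposed = "m\u00e1"   # m + LATIN SMALL LETTER A WITH ACUTE
decomposed = "ma\u0301"   # m + a + COMBINING ACUTE ACCENT

# Raw strings differ code point by code point.
raw_equal = precomposed == decomposed

# After early normalization the cheap comparison gives the expected answer.
normalized_equal = NormalizedString(precomposed) == NormalizedString(decomposed)
```

Here raw_equal is False and normalized_equal is True. The comparison itself never has to know about combining characters, because the constructor is the only place that does.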