On Saturday 03 August 2002 09:21 am, Danny van Bruggen wrote:
> 7: According to this point, Unicode has -much- more characters than
> 65536, so it needs more than two bytes. I haven't seen that
> mentioned here before.

Actually, we've pretty much beat this to death in the past few days.

Here's my understanding, from what I've read (and I don't know much 
about Unicode, so please correct me if I'm wrong), in layman's terms 
(i.e. not standards-compliant language):

Unicode was originally *defined* so that every character fit in a 
2-byte number. There is now a provision (UTF-16) for using pairs of 
these numbers ("surrogate pairs") to represent characters outside the 
basic 64K (the "Basic Multilingual Plane"). Few applications deal 
fully with these extended characters; Java (for instance) doesn't.
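To make the surrogate-pair mechanism concrete, here's a minimal sketch 
of the UTF-16 arithmetic. The code point U+1D54A is just a hypothetical 
example of a character outside the 64K range:

```java
// Sketch of UTF-16 surrogate-pair arithmetic (not production code).
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1D54A;      // a code point beyond U+FFFF
        int v = codePoint - 0x10000;  // leaves a 20-bit value to split
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits -> high surrogate
        char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits -> low surrogate
        // prints: U+1D54A -> D835 DD4A
        System.out.printf("U+%X -> %04X %04X%n", codePoint, (int) high, (int) low);
    }
}
```

So an extended character always costs exactly two 16-bit units, and the 
high/low ranges (D800-DBFF, DC00-DFFF) never collide with ordinary 
characters, which is how a decoder can resynchronize.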

In a program, representing Unicode text with 2-byte numbers (UTF-16) 
is probably ideal unless you're very space-constrained (in which case 
UTF-8 is best) or you have to do a lot of character indexing *and* are 
using these extension characters (in which case you might want a 
fixed-width 4-byte representation, i.e. UTF-32).
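To put numbers on that tradeoff, here's a small sketch comparing byte 
counts for a hypothetical mixed string (ASCII, an accented Latin 
letter, and one extended character written as its surrogate pair):

```java
// Compares storage costs of the encodings discussed above.
public class EncodingSizes {
    public static void main(String[] args) throws Exception {
        // "héllo " plus U+1D54A as a surrogate pair
        String s = "h\u00E9llo \uD835\uDD4A";
        System.out.println("UTF-16 units: " + s.length());              // 8 units = 16 bytes
        System.out.println("UTF-8 bytes:  " + s.getBytes("UTF-8").length); // 11 bytes
        System.out.println("UTF-32 bytes: " + 4 * 7);                   // 7 code points = 28 bytes
    }
}
```

Mostly-ASCII text favors UTF-8; the 4-byte form only pays off when you 
need constant-time indexing by character rather than by storage unit.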

-- 
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE