On Jun 17, 2006, at 6:50 AM, Stefan Lang wrote:

> It seems that the main argument against using Unicode strings
> in Ruby is because Unicode doesn't work well for eastern
> countries.

Point of information: there are highly successful word-processing
products, selling well in countries whose writing systems include Han
characters, which internally use Unicode.  So while the
Han-unification problems have been much discussed and are regarded as
important by people who are not fools, there is in fact existence
proof that Unicode works well enough for wide deployment in
commercial software.

> If Unicode is choosen as character set, there is the
> question which encoding to use internally. UTF-32 would be a
> good choice with regards to simplicity in implementation,

UTF-32 has a practical problem: in C code, you can't use strcmp()
and friends, because UTF-32 text is full of null bytes.  Of course if
you're careful to code everything using wchar_t you'll be OK, but
lots of code isn't.  (UTF-8 doesn't have this problem, and is much
more compact.)

> Consider
> indexing of Strings:
>
>         "some string"[4]
>
> If UTF-32 is used, this operation can internally be
> implemented as a simple, constant array lookup. If UTF-16 or
> UTF-8 is used, this is not possible to implement as an array

Correct.  But in practice this seems not to be too huge a problem,
since text is most often accessed sequentially.  The times that you
really need true random access to the N'th character are rare enough
that, for some problems, the advantages of UTF-8 are big enough to
compensate.  Note that in a variable-length character encoding,
there's no trouble whatever with a table of pointers into text; the
*only* problem is when you need to find the Nth character cheaply.

> An advantage of using UTF-8 would be that for pure ASCII files
> no conversion would be necessary for IO.

Be careful.  There are almost no pure ASCII files left.  Café®
Ordoz. Šmart quotes.
  -Tim