Um, hi everyone.  I'm a Ruby newbie but a very, very old hand at
Unicode & text processing.  I wrote all those articles Charles Nutter
pointed to the other day.  I spent years doing full-text search for
a living, adapted a popular engine to handle Japanese text, and
co-edited the XML spec and helped work out its character-encoding
issues.  Lots more war stories on request.

Anyhow, I have some ideas about good ways to do text processing
in a language like Ruby, but I thought for the moment I'd just
watch this interesting debate go by and serve as an information
resource.

On Jun 15, 2006, at 11:17 AM, Juergen Strobel wrote:

> UTF-8 encodes every Unicode code point as a variable length sequence
> of 1 to 4 (I think) bytes.

UTF-8 can encode all 1,114,112 Unicode codepoints in at most 4 bytes
each.  We probably don't need any more codepoints until we meet
alien civilizations.
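
If it helps to see that concretely, here's a sketch; note that it
assumes a Ruby whose strings know their encoding and answer to
bytesize and ord, which is more or less what this thread is arguing
about, not what the 1.8 series gives you today.  The sample
characters are just ones I picked for illustration:

  # How many UTF-8 bytes each codepoint takes, from ASCII out to
  # the supplementary planes:
  ["A", "é", "日", "𝄞"].each do |ch|
    printf("U+%04X needs %d byte(s)\n", ch.ord, ch.bytesize)
  end
  # U+0041 needs 1 byte(s)
  # U+00E9 needs 2 byte(s)
  # U+65E5 needs 3 byte(s)
  # U+1D11E needs 4 byte(s)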

> Most western symbols only require 1 or 2
> bytes. This encoding is space efficient

UTF-8 is racist.  The further East you go, the less efficient it is
at storing text.  Having said that, it has a lot of other advantages.
Also, now that almost every storage device is increasingly filled
with audio and video, at megabytes per minute, the efficiency of
text storage is less and less likely to be a bottleneck.
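
For a feel for the numbers (same caveat as above about assuming an
encoding-aware Ruby; the greetings are just made-up sample strings):

  english  = "Hello, world!"   # 13 characters
  japanese = "こんにちは世界"     # 7 characters
  puts english.bytesize        # 13 -- 1 byte per character
  puts japanese.bytesize       # 21 -- 3 bytes per character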

> Java got bitten by that by defining the character type to 16 bit and
> hardcoding this in their VM, and now they need some kludges.

Java screwed up, with the result that a Java (and C#) "char"
represents a single UTF-16 code unit, not necessarily a whole
character.  Blecch.
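
The root problem is that anything past U+FFFF just doesn't fit in 16
bits; UTF-16 has to spend two code units (a surrogate pair) on it,
so one "char" isn't one character any more.  A sketch, with the same
assumptions about a Unicode-aware Ruby as above:

  clef = "𝄞"                                 # U+1D11E, MUSICAL SYMBOL G CLEF
  puts clef.encode("UTF-16BE").bytesize / 2  # 2 -- two 16-bit code units
  puts clef.encode("UTF-8").bytesize         # 4 -- four bytes, one codepoint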

  -Tim