Since there's been a lot of talk about Unicode lately, I thought I'd
share a Ruby library I've been working on. It supports Unicode
characters and strings based on the Unicode 4.1.0 standard and key
specifications from the Unicode Consortium.

   ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and  
allows conversion to and from Unicode::String and Unicode::Character.  
A default encoding is chosen based on $KCODE, or the default can be  
set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the + unary operator to  
native strings, e.g. +"Hello" (where the native string is encoded in  
the default encoding).

   % irb -I. -runicode -Ku
   irb(main):001:0> ustr = +"π is pi"
   => +"π is pi"

Native strings are obtained from Unicode strings by calling to_s,  
which accepts an optional argument to indicate the desired encoding.

   irb(main):002:0> str = ustr.to_s
   => "π is pi"
   irb(main):003:0> str.encoding
   => Unicode::Encoding::UTF8

Individual characters can be indexed from Unicode strings, returning  
a Unicode::Character object.

   irb(main):004:0> ustr[0]
   => U+03C0 GREEK SMALL LETTER PI
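
As a rough illustration of the code point part of that output, here is a
sketch in plain Ruby (1.9 or later). It is not this library's code;
code_point_label is a hypothetical helper, and it prints only the code
point because the character name has to come from the Unicode Character
Database.

```ruby
# Hypothetical helper, not part of the library described in this post.
def code_point_label(str, index)
  ch = str[index]                  # one character, returned as a String
  format("U+%04X", ch.ord)         # code point in the usual U+XXXX notation
end

puts code_point_label("\u03C0 is pi", 0)   # U+03C0
puts code_point_label("\u03C0 is pi", 2)   # U+0069
```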

Case conversion is handled as with native strings.

   irb(main):005:0> ustr.upcase
   => +"Π IS PI"
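
The same kind of mapping can be tried with native strings in Ruby 2.4 and
later, where String#upcase covers non-ASCII characters. This is a sketch
with the built-in method only, not this library's implementation;
shout is a made-up helper name.

```ruby
# Built-in case mapping (Ruby 2.4+); shout is a hypothetical helper name.
def shout(str)
  str.upcase
end

puts shout("\u03C0 is pi") == "\u03A0 IS PI"   # true: U+03C0 maps to U+03A0
puts shout("stra\u00DFe")                      # STRASSE (one character becomes two)
```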

Normalization is accomplished with the ~ unary operator. In this
example, the precomposed í is decomposed into i followed by a combining
acute accent.

   irb(main):006:0> ustr = +"mí"
   => +"mí"
   irb(main):007:0> ustr.to_a
   => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
   irb(main):008:0> (~ustr).each_char { |ch| p ch }
   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT
   => +"mí"
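
The decomposition shown above can be reproduced with native strings in
Ruby 2.2 and later via String#unicode_normalize. This is a sketch for
comparison only; it assumes the ~ operator here behaves like canonical
decomposition (NFD), which the transcript suggests but does not state,
and code_points is a hypothetical helper.

```ruby
# Built-in normalization (Ruby 2.2+), not this library's ~ operator.
# code_points is a hypothetical helper that lists each character's code point.
def code_points(str)
  str.each_char.map { |ch| format("U+%04X", ch.ord) }
end

composed   = "m\u00ED"                          # "mí" with a precomposed í
decomposed = composed.unicode_normalize(:nfd)   # canonical decomposition

p code_points(composed)      # ["U+006D", "U+00ED"]
p code_points(decomposed)    # ["U+006D", "U+0069", "U+0301"]
p decomposed.unicode_normalize(:nfc) == composed   # true
```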

There is much more -- character properties, text boundaries (grapheme  
clusters and words), Hangul decompositions, modular encodings (ASCII,  
Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is  
unfinished. If anyone is interested in helping develop it further,  
let me know.
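
To give a flavor of the Hangul part, here is a minimal sketch of the
arithmetic decomposition of precomposed Hangul syllables as described in
the Unicode Standard. It is written independently of this library; the
method name decompose_hangul and the constant names are my own choices
for illustration.

```ruby
# Arithmetic Hangul syllable decomposition, following the Unicode Standard.
# Names are illustrative only and not taken from the library.
S_BASE  = 0xAC00              # first precomposed syllable
L_BASE  = 0x1100              # first leading consonant
V_BASE  = 0x1161              # first vowel
T_BASE  = 0x11A7              # one before the first trailing consonant
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total

def decompose_hangul(code_point)
  s_index = code_point - S_BASE
  # Anything outside the syllable block is returned unchanged.
  return [code_point] unless (0...S_COUNT).cover?(s_index)

  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t = T_BASE + s_index % T_COUNT
  t == T_BASE ? [l, v] : [l, v, t]   # omit the trailing consonant if absent
end

p decompose_hangul(0xD55C).map { |cp| format("U+%04X", cp) }
# ["U+1112", "U+1161", "U+11AB"]
```

On Ruby 2.2 and later the result can be cross-checked against
"\uD55C".unicode_normalize(:nfd).codepoints.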

The library incorporates the entire Unicode 4.1.0 Character Database  
(demand-loaded!) which is why the archive is rather large.

Cheers,

-- 
Rob Leslie
rob / mars.org