Text processing with extended character sets
Hi!
I'd like to do text processing on non-English texts (related to a multilingual
environment for language learning and similar uses).

I found the UTF-8 option in regexes useful.

I am a newbie to Ruby, so I am very thankful for hints and pointers!
I did skim the "book", but otherwise I do apologize for "naive" questions!

Questions:
(1) What Unicode support does Ruby have? (Python has had it since 1.6, but it
is not so well integrated into regexes: word boundaries are not supported.)

(2) Since Ruby is popular in Japan, where can I find something on handling
Japanese text?

(3) I am a bit unsure whether to convert everything to 16-bit Unicode
(Ruby?), or to stay in UTF-8.

(4) If I stay in UTF-8: how difficult would it be to add functionality so that
things like .upcase and sort do the right thing for various languages?

(5) I think integration with Ruby's built-in services would allow much
cleaner code. I mean: why should non-English text processing be more awkward
than English? Considering Ruby's origin, I would expect Ruby to be easily
extensible in this area...

(6) Would it be time-inefficient to make the String class UTF-8 aware
(returning the "correct" length rather than the physical one, and so on, so
that for example a c with circumflex, or a kanji symbol, would be counted as
length 1)?
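
To make (6) concrete, here is a rough sketch of the difference I mean between
the physical length and the "correct" length. It assumes the string holds
valid UTF-8. The name utf8_length is one I made up for illustration; it is
not an existing String method, and I am not claiming this is how Ruby should
do it internally.

```ruby
# Hypothetical helper (not part of Ruby's String class): counts UTF-8
# characters rather than bytes by unpacking the string into code points.
# Assumes str contains valid UTF-8.
def utf8_length(str)
  str.unpack("U*").length
end

# "c with circumflex" followed by "u", written as raw UTF-8 bytes.
word = "\xC4\x89u"

physical = word.unpack("C*").length  # number of bytes
logical  = utf8_length(word)         # number of characters (code points)

puts "bytes: #{physical}, characters: #{logical}"
# prints "bytes: 3, characters: 2"
```

Note that this counts code points, so a letter built from a base character
plus a separate combining accent would still count as 2.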

Very thankful for any pointers or other helpful comments.

Amicably yours, Henning VON ROSEN, Norway