Hi, 

In message "[ruby-talk:01598] Thanks and more regex q's"
    on 00/02/27, Wes Nakamura <wknaka / pobox.com> writes:
>I see that there are classes for Japanese string conversion and
>detection, and there's jcode.rb, but is there a class or module that has
>the concept of each EUC/SJIS "character" being a discrete unit instead
>of two bytes?  Maybe a string-like class with the underlying data being
>an array of integers. 

jcode.rb makes String a `Japanese character string' rather than a `byte
string'.  Let's assume XX is a Japanese character. "XX"[0] == "XX" if 
jcode.rb is loaded. 
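
A minimal sketch of the character-versus-byte distinction, assuming a
Ruby whose strings carry their own encoding (1.9 or later) rather than
the jcode.rb setup described here.  The sample bytes (EUC-JP hiragana
"a" and "i") and the variable names are only for illustration.

```ruby
# Assumption: Ruby 1.9+ with per-string encodings, not jcode.rb/$KCODE.
# Two EUC-JP hiragana characters, four bytes in total.
s = "\xa4\xa2\xa4\xa4".dup.force_encoding("EUC-JP")

byte_count = s.bytesize   # counts bytes
char_count = s.length     # counts characters
first_char = s[0]         # indexing returns a whole character

p byte_count
p char_count
p first_char.bytes
```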

>Is the Japanese-sensitive regex's behavior documented anywhere (I didn't
>see anything for the "n" option either)?  e.g. is there a way to use
>regexes where /./ would match 2 bytes, since . could match a single
>multibyte character? 

Well, ..., oh, this feature is not documented in the English version of 
the reference manual :-<

 * String, Regexp and program parsing are Japanese character code sensitive. 
 * $KCODE is used to control the character code. "e" for EUC-Japan, 
   "s" for Shift-JIS, "n" for none (i.e. non-J-sensitive). 
 * The $KCODE value can be set by the -K command line option. -Ke for
   EUC-Japan, etc.
 * The default for $KCODE can be specified at the configuration stage:
   "./configure --with-default-kcode=none".  See "./configure --help". 
 * "./configure --with-default-kcode=none" will be the default in the next
   release of Ruby. 
 * Regexp options e, s and n control how a match is done, regardless of
   what $KCODE is set to.
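
As a rough illustration of the e and n options from the last item, here
is a sketch that assumes a recent Ruby (2.0 or later) instead of
$KCODE.  The sample character and the variable names are my own
choice.

```ruby
# Assumption: Ruby 2.0+, where each string carries an encoding.
# One EUC-JP character (hiragana "a"), stored as two bytes.
a = "\xa4\xa2".dup.force_encoding("EUC-JP")

euc = /./e.match(a)     # e: EUC-sensitive, . matches the whole character
raw = /./n.match(a.b)   # n: non-sensitive, . matches one byte of the binary copy

p euc[0].bytesize
p raw[0].bytesize
```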

>Is it possible to set an option like "n" when creating a regex when
>using Regexp.new() (since I was creating the regex on the fly using
>strings)?  The regex options become an attribute of the regex
>itself, right?

Yes. 
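
A small sketch of building such a regex from a string, assuming a recent
Ruby (2.0 or later) where the constant Regexp::NOENCODING plays the role
of the "n" option.  The names source, re, kept and result are only for
illustration.

```ruby
# Assumption: Ruby 2.0+, with Regexp::NOENCODING standing in for "n".
source = "\\xa4([\\xa1-\\xf3])"               # pattern assembled as a plain string
re = Regexp.new(source, Regexp::NOENCODING)

# The option stays with the regex object itself.
kept = (re.options & Regexp::NOENCODING) != 0

# Block form of the substitution, on binary (byte) strings.
result = "\xa4\xa2".b.sub(re) { "\xa5".b + $1 }

p kept
p result.bytes
```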

>This also didn't work:
>
># change hiragana to katakana...
>"\xa4\xa2".sub(/\xa4([\xa1-\xf3])/n, "\xa5\\1")

Hmmm, I don't know why that didn't work :-<  The following works:

  "\xa4\xa2".sub(/\xa4([\xa1-\xf3])/n){"\xa5#{$1}"}

By the way, Ruby/KAKASI can be used to convert
{kanji,hiragana,katakana} -> {hiragana,katakana,ascii(romaji)}

For example, 

  require "kakasi"
  include Kakasi
  p Nakamura = "\xc3\xe6\xc2\xbc"           #=> (Nakamura in kanji)
  p kakasi("-ieuc -oeuc -Ja", Nakamura)     #=> "nakamura"
  p a = kakasi("-ieuc -oeuc -JH", Nakamura) #=> (Nakamura in hiragana)
  p kakasi("-ieuc -oeuc -JK", a)            #=> (Nakamura in katakana)

Check out http://www.ruby-lang.org/en/raa.html. 

-- gotoken