Hi,
In message "[ruby-talk:01598] Thanks and more regex q's"
on 00/02/27, Wes Nakamura <wknaka / pobox.com> writes:
>I see that there are classes for Japanese string conversion and
>detection, and there's jcode.rb, but is there a class or module that has
>the concept of each EUC/SJIS "character" being a discrete unit instead
>of two bytes? Maybe a string-like class with the underlying data being
>an array of integers.
jcode.rb makes String behave as a `Japanese character string' rather
than a `byte string'. Let's assume XX is a Japanese character. "XX"[0] == "XX"
once jcode.rb is loaded.
>Is the Japanese-sensitive regex's behavior documented anywhere (I didn't
>see anything for the "n" option either)? e.g. is there a way to use
>regexes where /./ would match 2 bytes, since . could match a single
>multibyte character?
Well, ..., oh, this feature is not documented in the English version of
the reference manual :-<
* String, Regexp and program parsing are Japanese character code sensitive.
* $KCODE is used to control the character code. "e" for EUC-Japan,
"s" for Shift-JIS, "n" for none (i.e. non-J-sensitive).
* $KCODE value can be set by -K command line option. -Ke for
EUC-Japan, etc.
* The default $KCODE value can be specified at the configuration stage:
"./configure --with-default-kcode=none". See "./configure --help".
* "./configure --with-default-kcode=none" will be the default in the next
release of Ruby.
* The Regexp options e, s and n control how a regexp matches, whatever
$KCODE is set to.
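
A minimal sketch of what the e and n options do to /./. This is an illustration only, and it assumes a later Ruby (1.9 or newer), where $KCODE no longer exists and the character code is carried by the string itself. The names HIRAGANA_A, first_char_euc and first_char_raw are made up for the example.

```ruby
# Assumption: Ruby 1.9 or later. $KCODE is gone there, so the character
# code is attached to the string (and, via e/s/n, to the regexp).

# EUC-JP bytes for hiragana "a", held as raw bytes.
HIRAGANA_A = "\xa4\xa2".b

# /./e treats the data as EUC-JP, so "." consumes one 2-byte character.
def first_char_euc(bytes)
  bytes.dup.force_encoding("EUC-JP")[/./e]
end

# /./n is not Japanese-sensitive, so "." consumes a single byte.
def first_char_raw(bytes)
  bytes.b[/./n]
end

p first_char_euc(HIRAGANA_A).bytesize  # 2
p first_char_raw(HIRAGANA_A).bytesize  # 1
```

So with the e option . matches the whole two-byte character, and with the n option it matches only the first byte.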
>Is it possible to set an option like "n" when creating a regex when
>using Regexp.new() (since I was creating the regex on the fly using
>strings)? The regex options become an attribute of the regex
>itself, right?
Yes.
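
A small sketch of building such a regexp from a string. It assumes a later Ruby (1.9 or newer), where the "n" option can be passed to Regexp.new as the Regexp::NOENCODING flag; the helper name raw_regexp is made up for the example.

```ruby
# Assumption: Ruby 1.9 or later, where the "n" option is available to
# Regexp.new as the Regexp::NOENCODING flag.

# Build a non-Japanese-sensitive regexp from a pattern string made at run time.
def raw_regexp(pattern)
  Regexp.new(pattern, Regexp::NOENCODING)
end

re = raw_regexp(".")
p re.source                                    # "."
p((re.options & Regexp::NOENCODING) != 0)      # true: the option stays with the regexp
```

The option is stored in the regexp object, so it can be read back later with Regexp#options.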
>This also didn't work:
>
># change hiragana to katakana...
>"\xa4\xa2".sub(/\xa4([\xa1-\xf3])/n, "\xa5\\1")
Hmmm, I don't know why that didn't work :-< The following works:
"\xa4\xa2".sub(/\xa4([\xa1-\xf3])/n){"\xa5#{$1}"}
By the way, Ruby/KAKASI can be used to convert
{kanji,hiragana,katakana} -> {hiragana,katakana,ascii(romaji)}.
For example,
require "kakasi"
include Kakasi
p Nakamura = "\xc3\xe6\xc2\xbc" #=> (Nakamura in kanji)
p kakasi("-ieuc -oeuc -Ja", Nakamura) #=> "nakamura"
p a = kakasi("-ieuc -oeuc -JH", Nakamura) #=> (Nakamura in hiragana)
p kakasi("-ieuc -oeuc -JK", a) #=> (Nakamura in katakana)
Check out http://www.ruby-lang.org/en/raa.html.
-- gotoken