Hi,

>From: "Mike Meng" <meng.yan / gmail.com>
>Reply-To: ruby-talk / ruby-lang.org
>To: ruby-talk / ruby-lang.org (ruby-talk ML)
>Subject: Encounter troubles with Regex in Chinese text splitting
>Date: Sat, 3 Dec 2005 14:42:31 +0900
>
>Hi All,
>   I'm a Ruby newbie. I'm writing a program to process a big chunk of
>Chinese text. The first step is to split the chunk of text into a list
>of sentences. In Chinese, all the characters are listed one by one
>without any natural boundary tag like space in English. Sentences are
>separated by one of three special characters (。, ！ or ？). So at
>first glance, I thought it was a simple task:
>
># $chunk stores the text body
>$sentences = $chunk.split(/[。！？]/)
># now $sentences holds the list of sentences.
>
>   But when I checked the result, I found that some of the sentences were
>not split correctly. For instance, take a sentence containing "病，"
>(the sentence means "You are not sick, how about him?"). In
>GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
>encoded as (hex) a1a3. So the String#split method finds a
>"。" in the middle of the sentence and incorrectly splits there.
>
>   Certainly this is because String#split (and the Ruby regex
>engine) is byte-oriented instead of truly character-oriented, and it's a
>frequent problem in the i18n domain. Is there any way in Ruby to correctly
>split Chinese text?
>
>   Thanks in advance.
>
>   myan
>
>
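Try the script with $KCODE = "E" set at the top. That puts the regex
engine into EUC mode, so each two-byte GB2312 (EUC-CN) character is
matched as one unit rather than as two separate bytes.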

Hope this helps,

Park Heesob