Hi,

>From: "Mike Meng" <meng.yan / gmail.com>
>Reply-To: ruby-talk / ruby-lang.org
>To: ruby-talk / ruby-lang.org (ruby-talk ML)
>Subject: Encounter troubles with Regex in Chinese text splitting
>Date: Sat, 3 Dec 2005 14:42:31 +0900
>
>Hi All,
>   I'm a Ruby newbie. I'm writing a program to process a big chunk of
>Chinese text. The first step is to split the chunk of text into a list
>of sentences. In Chinese, the characters follow one another without any
>natural boundary marker such as the space in English. Sentences are
>separated by one of three special characters (。, ！, ？). So at
>first glance, I thought it was a simple task:
>
># $chunk stores the text body
>$sentenses = $chunk.split(/。|！|？/)
># now $sentenses holds the list of sentences.
>
>   But when I checked the result, I found that some of the sentences were
>not split correctly. For instance, here is a sentence:
>"[...]病，[...]" (means "You are not sick, how about him?"). In
>GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
>encoded as (hex) a1a3. So the String#split method finds a
>"。" in the middle of the sentence and incorrectly splits there.
>
>   Certainly this is because String#split (and the Ruby regex
>engine) is byte-oriented instead of truly character-oriented, and it's a
>frequent problem in the i18n domain. Is there any way in Ruby to correctly
>split Chinese text?
>
>   Thanks in advance.
>
>   myan
>

Try the script with $KCODE = "E"

Hope this helps,

Park Heesob
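
P.S. For anyone reading this on Ruby 1.9 or later, where $KCODE no longer has any effect: the sketch below reproduces the false match using the byte values quoted above, then avoids it by tagging the bytes as GB2312 and transcoding to UTF-8 before splitting. This is only an illustration under assumptions, not the original script: the variable names are made up, the trailing "。" is added so there is something to split on, and the separator set 。！？ is assumed.

```ruby
# Bytes from the post: "病，" is b2a1 a3ac in GB2312.
# A final "。" (a1a3) is appended here as an assumption.
raw  = "\xB2\xA1\xA3\xAC\xA1\xA3".b   # binary string: one "character" per byte
stop = "\xA1\xA3".b                   # the bytes of "。" in GB2312

# Byte-oriented split: a1 a3 also appears across the boundary of "病" and "，",
# so the text is cut in the middle of a character.
byte_pieces = raw.split(stop)
puts byte_pieces.inspect

# Character-oriented split: declare the real encoding, convert to UTF-8,
# then split on the (assumed) sentence-ending characters.
text      = raw.dup.force_encoding("GB2312").encode("UTF-8")
sentences = text.split(/[。！？]/)
puts sentences.inspect
```

The byte-oriented split yields two broken fragments, while the encoding-aware split yields the single sentence fragment "病，".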