On Jul 30, 11:26 am, "Robert Klemme" <shortcut... / googlemail.com> wrote:
> 2007/7/30, mike b. <michael.w.b... / gmail.com>:
>
> > I have to parse about 2000 files that are written in multiple
> > languages (some English, some Korean, some Arabic and some Japanese).
> > I have to split these UTF-8 encoded into individual sentences. Has
> > anyone written a good parser that can parse all these non-Latin
> > character languages or can someone give me some advice on how to go
> > about writing a parser that can handle all these fairly different
> > languages?
>
> I would consider doing this in Java, as Java's regular expressions
> support Unicode. That might make the job much easier. OTOH, if all
> files use only dot, question mark etc. (i.e. ASCII chars) as sentence
> delimiters then Ruby's regular expressions might as well do the job.

Ruby supports UTF-8 regular expressions: for example, /\w+|\W/u can be used to scan a string, splitting it into words and non-words. There were some bugs with Unicode character classification in older versions of Ruby, but I'm not aware of any in 1.8.6. OTOH, I've never tried it with non-Latin text, so I don't know whether it works correctly in those cases too.
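
To make the scanning idea concrete, here is a minimal sketch, not a tested multilingual parser. It assumes Ruby 1.9 or later with UTF-8 strings; as far as I know, \w in those versions matches only ASCII word characters, so the sketch uses \p{Word} in place of the /\w+|\W/u pattern above. The helper names words_and_nonwords and split_sentences are hypothetical, and the set of sentence terminators (ASCII . ! ? plus the ideographic full stop, the fullwidth ! and ?, and the Arabic question mark) is my assumption about what the files contain.

```ruby
# encoding: utf-8

# Hypothetical helper: split text into runs of word characters and
# single non-word characters (spaces, punctuation), in the spirit of
# scanning with /\w+|\W/u. \p{Word} covers letters, marks, digits and
# connector punctuation in any script.
def words_and_nonwords(text)
  text.scan(/\p{Word}+|[^\p{Word}]/)
end

# Assumed sentence terminators: . ! ? plus
#   \u3002 ideographic full stop, \uFF01 fullwidth exclamation mark,
#   \uFF1F fullwidth question mark, \u061F Arabic question mark.
SENTENCE = /[^.!?\u3002\uFF01\uFF1F\u061F]+[.!?\u3002\uFF01\uFF1F\u061F]*/

# Hypothetical helper: naive sentence splitter. Each match is a run of
# non-terminator characters followed by any terminators that end it.
def split_sentences(text)
  text.scan(SENTENCE).map(&:strip).reject(&:empty?)
end

if __FILE__ == $PROGRAM_NAME
  sample = "Hello world. How are you? 今日は晴れです。明日は雨です。"
  split_sentences(sample).each { |sentence| puts sentence }
end
```

This splits on every terminator, so abbreviations such as "e.g." and decimal numbers such as "3.14" are cut in the wrong place; real text would need extra rules for those cases.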