On Jul 30, 11:26 am, "Robert Klemme" <shortcut... / googlemail.com>
wrote:
> 2007/7/30, mike b. <michael.w.b... / gmail.com>:
>
> > I have to parse about 2000 files that are written in multiple
> > languages (some English, some Korean, some Arabic and some Japanese).
> > I have to split these UTF-8 encoded into individual sentences. Has
> > anyone written a good parser that can parse all these non-Latin
> > character languages or can someone give me some advice on how to go
> > about writing a parser that can handle all these fairly different
> > languages?
>
> I would consider doing this in Java, as Java's regular expressions
> support Unicode.  That might make the job much easier.  OTOH, if all
> files use only dot, question mark etc. (i.e. ASCII chars) as sentence
> delimiters then Ruby's regular expressions might as well do the job.

Ruby supports UTF-8 regular expressions: for example, /\w+|\W/u can be
used to scan a string, splitting it into words and non-words. There were
some bugs with Unicode character classification in older versions of
Ruby, but I'm not aware of any in 1.8.6. OTOH, I've never tried it with
non-Latin text, so I don't know whether it works correctly in those
cases too.
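
A rough sketch of what I mean, not a tested solution for your files. The
first method is just the scan from above. The second is a hypothetical
helper (the name and the list of terminators are my own assumptions) that
cuts text after a hand-picked set of sentence-ending characters: ASCII
. ! ? plus the CJK full stop, the fullwidth ! and ?, and the Arabic
question mark. Note that how \w treats non-Latin letters depends on the
Ruby version; as far as I know, newer Rubies (1.9 and later) make \w
ASCII-only and offer \p{L} or [[:alpha:]] for Unicode letters, so the
first method is only shown on English text here.

```ruby
# Split text into runs of word characters and single non-word characters.
def words_and_nonwords(text)
  text.scan(/\w+|\W/u)
end

# Hypothetical helper: treat any run of these characters as a sentence end.
# The terminator list is an assumption; extend it for your data.
def sentences(text)
  chunks = text.scan(/[^.!?。！？؟]+[.!?。！？؟]*/u)
  chunks.map { |s| s.strip }.reject { |s| s.empty? }
end

p words_and_nonwords("Hello, world!")
p sentences("Hi there. 元気ですか？ はい。")
```

This will wrongly split on abbreviations such as "Dr." and on decimal
numbers, so real data will likely need extra rules on top of it.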