Depending on the text, you might be able to search for a period (or other
punctuation) followed by two spaces.  It's not robust, but if you know the
authors follow that convention consistently, it can work.
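
Something along these lines might do it.  This is only a rough sketch: the
method name split_on_double_space is made up for illustration, and it assumes
the authors always put at least two spaces after a sentence and never break a
line right after one.

```ruby
# Split on sentence-ending punctuation followed by two or more spaces.
# Assumption: the text consistently double-spaces between sentences.
def split_on_double_space(text)
  # The lookbehind keeps the punctuation attached to its sentence;
  # only the run of spaces is consumed by the split.
  text.split(/(?<=[.!?]) {2,}/).map(&:strip).reject(&:empty?)
end

sample = "Mr. Smith went home.  He was tired!  Was Dr. Jones there?"
puts split_on_double_space(sample)
```

Because "Mr." and "Dr." are followed by a single space, they are left alone
here.  A sentence that ends at the end of a line would not be split, so
hard-wrapped text would need the line breaks dealt with first.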

_Kevin

-----Original Message-----
From: Matthew Smillie [mailto:M.B.Smillie / sms.ed.ac.uk] 
Sent: Tuesday, November 29, 2005 09:06 PM
To: ruby-talk ML
Subject: Re: Splitting a text file into sentences


On Nov 29, 2005, at 23:49, basi wrote:

> Looking for ideas on how to split a text file into sentences. I see 
> the problem of basing the split on [.!?] -- they're also used in ways 
> other than to end a sentence. If I have to do manual pre-processing of 
> the text file, what editing might I do? Has anyone had to deal with 
> this problem and how did you make life easier for you?
> Thanks for the help.
> basi
>
>


Doing really, really good sentence boundary detection is an ongoing problem
in natural language processing.  I'm not aware of any Ruby-based NLP
packages, but if you want better accuracy than just using [.!?:], there are
several free NLP packages around (NLTK in Python and Stanford's Java NLP
package spring to mind) that might help you.  Googling "sentence
tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough for
government work" by making a list of regular expressions to catch exceptions
to the punctuation rule.  These will necessarily vary a little depending on
your source text, but typical examples are catching titles like "Mr.",
"Mrs.", "Dr.", and all-caps abbreviations like "U.S.A." or "M.D." (something
like this: /([A-Z]\.([A-Z]\.)+)/).
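
One way to put the exception list to work is to hide the periods inside the
exceptions, split on what is left, and then restore them.  The sketch below
is only that: the names EXCEPTIONS, PLACEHOLDER and split_sentences are
invented for illustration, the list of titles is a starting point rather than
a complete one, and it assumes the placeholder character never occurs in your
text.

```ruby
# Patterns whose periods should NOT end a sentence (illustrative, not complete).
EXCEPTIONS = [
  /\b(?:Mr|Mrs|Ms|Dr)\./,    # titles
  /\b[A-Z]\.(?:[A-Z]\.)+/    # all-caps abbreviations such as U.S.A. or M.D.
].freeze

# Stand-in for a protected period; assumed not to appear in the source text.
PLACEHOLDER = "\u0001"

def split_sentences(text)
  # Step 1: replace the periods inside every exception with the placeholder.
  protected_text = EXCEPTIONS.inject(text) do |t, pattern|
    t.gsub(pattern) { |match| match.gsub(".", PLACEHOLDER) }
  end

  # Step 2: split after the remaining [.!?] when whitespace follows,
  # then put the protected periods back.
  protected_text
    .split(/(?<=[.!?])\s+/)
    .map { |sentence| sentence.gsub(PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end

text = "Dr. Smith moved to the U.S.A. in 1990. He met Mrs. Jones there! Was it planned?"
puts split_sentences(text)
```

The obvious weakness is that an abbreviation at the very end of a sentence
("He moved to the U.S.A. Then he left.") is protected too, so that boundary
is missed; catching it takes another rule or a real NLP package.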

good luck,
matthew smillie.

----
Matthew Smillie            <M.B.Smillie / sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems University of
Edinburgh