On Nov 29, 2005, at 23:49, basi wrote:

> Looking for ideas on how to split a text file into sentences. I see
> the problem of basing the split on [.!?] -- they're also used in ways
> other than to end a sentence. If I have to do manual pre-processing
> of the text file, what editing might I do? Has anyone had to deal
> with this problem and how did you make life easier for you?
> Thanks for the help.
> basi
>
>


Doing really, really good sentence boundary detection is an ongoing
problem in natural language processing.  I'm not aware of any
Ruby-based NLP packages, but if you want better accuracy than you get
from just splitting on [.!?:], there are several free NLP packages
around (NLTK in Python and Stanford's Java NLP package spring to
mind) that might help you.  Googling "sentence tokenization" may also
turn up some help.

If that sounds like overkill, then you can get accuracy "good enough
for government work" by making a list of regular expressions to catch
exceptions to the punctuation rule.  These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations
like "U.S.A." or "M.D." (something like this: /[A-Z]\.([A-Z]\.)+/).
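
To make the exception-list idea concrete, here is a rough sketch in
Ruby.  Everything named in it (split_sentences, TITLES, ABBREVIATION,
PLACEHOLDER) is made up for illustration, and the title list is only a
starting point you would tune to your own text.  It hides the periods
that belong to titles and all-caps abbreviations behind a placeholder
character, splits on whatever punctuation is left, and then puts the
periods back.  It assumes a Ruby with regexp lookbehind (1.9 or later).

```ruby
# Illustrative names only; adjust the lists to suit your source text.
TITLES = %w[Mr Mrs Ms Dr Prof].freeze

# All-caps abbreviations such as "U.S.A." or "M.D."
ABBREVIATION = /\b[A-Z]\.(?:[A-Z]\.)+/

# Stand-in for a period that should not end a sentence.
# Assumes the input text never contains a NUL character.
PLACEHOLDER = "\u0000"

def split_sentences(text)
  guarded = text.dup

  # Hide the periods inside all-caps abbreviations.
  guarded.gsub!(ABBREVIATION) { |abbr| abbr.tr(".", PLACEHOLDER) }

  # Hide the period that follows a known title.
  guarded.gsub!(/\b(#{TITLES.join("|")})\./) { "#{$1}#{PLACEHOLDER}" }

  # Split after ., ! or ? when followed by whitespace,
  # then restore the hidden periods.
  guarded.split(/(?<=[.!?])\s+/).map { |s| s.tr(PLACEHOLDER, ".").strip }
end

text = "Dr. Smith moved to the U.S.A. in 1999. He loved it! " \
       "Did Mrs. Jones agree?"
split_sentences(text).each { |sentence| puts sentence }
# Prints:
#   Dr. Smith moved to the U.S.A. in 1999.
#   He loved it!
#   Did Mrs. Jones agree?
```

One known weakness: an abbreviation that really does end a sentence
("He lives in the U.S.A. She doesn't.") will not be split, so you may
want a further rule for that case if it turns up often in your text.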

good luck,
matthew smillie.

----
Matthew Smillie            <M.B.Smillie / sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh