Depending on the text, you might be able to search for a period (or other punctuation) followed by two spaces. It's not robust, but if you know the authors follow that convention, it can work.

_Kevin

-----Original Message-----
From: Matthew Smillie [mailto:M.B.Smillie / sms.ed.ac.uk]
Sent: Tuesday, November 29, 2005 09:06 PM
To: ruby-talk ML
Subject: Re: Splitting a text file into sentences

On Nov 29, 2005, at 23:49, basi wrote:

> Looking for ideas on how to split a text file into sentences. I see
> the problem of basing the split on [.!?] -- they're also used in ways
> other than to end a sentence. If I have to do manual pre-processing of
> the text file, what editing might I do? Has anyone had to deal with
> this problem and how did you make life easier for you?
> Thanks for the help.
> basi

Doing really, really good sentence boundary detection is an ongoing problem in natural language processing. I'm not aware of any Ruby-based NLP packages, but if you want better accuracy than just using [.!?:] there are several free NLP packages around (NLTK in Python and Stanford's Java NLP package spring to mind) that might help you. Googling "sentence tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough for government work" by making a list of regular expressions to catch exceptions to the punctuation rule. These will necessarily vary a little depending on your source text, but typical examples are catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations like "U.S.A." or "M.D." (something like this: /([A-Z]\.([A-Z]\.)+)/)

good luck,
matthew smillie.

----
Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh
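
For anyone finding this thread later, here is a rough Ruby sketch of the two ideas above: Kevin's split on punctuation followed by two spaces, and Matthew's list of exception patterns. The method names, the constants, and the placeholder character are my own assumptions for illustration, not anything posted in the thread, and the title list is deliberately tiny. It masks the periods in known abbreviations, splits on [.!?] followed by whitespace, then restores the periods.

```ruby
# Hypothetical helpers for illustration only; none of these names come
# from the thread.

# Exceptions to the punctuation rule: a few titles, plus all-caps
# initialisms such as "U.S.A." or "M.D.".
TITLES      = /\b(Mr|Mrs|Ms|Dr)\./
INITIALISMS = /\b[A-Z]\.(?:[A-Z]\.)+/

# Stand-in for a masked period. Assumed never to occur in the input.
PLACEHOLDER = "\u0000"

# Kevin's convention: a sentence ends at [.!?] followed by 2+ spaces.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/).map(&:strip).reject(&:empty?)
end

# Matthew's approach: mask periods in known exceptions, split on
# [.!?] followed by whitespace, then put the periods back.
def split_sentences(text)
  masked = text
    .gsub(TITLES) { "#{$1}#{PLACEHOLDER}" }
    .gsub(INITIALISMS) { |abbr| abbr.gsub(".", PLACEHOLDER) }

  masked
    .split(/(?<=[.!?])\s+/)
    .map { |sentence| sentence.gsub(PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end

sample = "Dr. Smith moved to the U.S.A. in 1999. Did Mrs. Jones follow? Yes!"
split_sentences(sample).each { |sentence| puts sentence }
```

One known gap in this sketch: an abbreviation that really does end a sentence ("He lives in the U.S.A. She doesn't.") is masked too, so no split happens there. As Matthew says, the exception list will need tuning for your source text.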