On Nov 29, 2005, at 23:49, basi wrote:

> Looking for ideas on how to split a text file into sentences. I see the
> problem of basing the split on [.!?] -- they're also used in ways other
> than to end a sentence. If I have to do manual pre-processing of the
> text file, what editing might I do? Has anyone had to deal with this
> problem and how did you make life easier for you?
> Thanks for the help.
> basi

Doing really, really good sentence boundary detection is an ongoing problem in natural language processing. I'm not aware of any Ruby-based NLP packages, but if you want better accuracy than just splitting on [.!?:], there are several free NLP packages around (NLTK in Python and Stanford's Java NLP package spring to mind) that might help you. Googling "sentence tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough for government work" by making a list of regular expressions to catch exceptions to the punctuation rule. These will necessarily vary a little depending on your source text, but typical examples are catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations like "U.S.A." or "M.D." (something like this: /[A-Z]\.([A-Z]\.)+/).

Good luck,
matthew smillie.

----
Matthew Smillie <M.B.Smillie / sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh
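
For what it's worth, here is a minimal Ruby sketch of the exception-list approach described above. It is only one way to do it, not a tested library: the names PLACEHOLDER, EXCEPTIONS and split_sentences are made up for illustration, the title list is a guess you would extend for your own text, and it assumes the input never contains the U+0001 control character, which is used to hide "safe" periods temporarily.

```ruby
# Stand-in for periods that do NOT end a sentence.
# Assumption: the source text never contains U+0001.
PLACEHOLDER = "\u0001"

# Hypothetical exception list; adjust it to suit your source text.
EXCEPTIONS = [
  /\b(?:Mr|Mrs|Ms|Dr|Prof)\./,  # titles such as "Mr." or "Dr."
  /\b[A-Z]\.(?:[A-Z]\.)+/       # all-caps abbreviations: "U.S.A.", "M.D."
]

def split_sentences(text)
  # Step 1: hide the periods inside every exception match.
  protected_text = EXCEPTIONS.reduce(text) do |str, pattern|
    str.gsub(pattern) { |match| match.tr(".", PLACEHOLDER) }
  end

  # Step 2: split on whitespace that follows ., ! or ?
  # Step 3: restore the hidden periods in each piece.
  protected_text
    .split(/(?<=[.!?])\s+/)
    .map { |sentence| sentence.tr(PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end

sample = "Dr. Smith moved to the U.S.A. in 2005. Did Mrs. Jones follow? " \
         "Yes! She is an M.D. now."
puts split_sentences(sample)
```

One known weakness of this sketch: when an abbreviation really does end a sentence ("She lives in the U.S.A. He doesn't."), the period stays hidden and the two sentences are not split. Catching that needs extra rules or one of the NLP packages mentioned above.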