Mike Harris wrote:
> Ruby Quiz wrote:
>
>> *snip*
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>
>> by elliot temple
>>
>> sometimes i type in all or mostly lowercase. a friend of mine says it's
>> hard to read essays with no capital letters. so the problem is to write a
>> method which takes a string (which could include many paragraphs), and
>> capitalizes words that should be capitalized. at minimum it should do the
>> starts of sentences.
>>
>> solutions could range from simple (a few regexes) to complex (lots of
>> special cases are possible, like abbreviations that use a period). an
>> addition would be using a dictionary to find proper nouns and capitalize
>> those. it could also ask the user about cases the program can't figure
>> out. or log them.
>>
>> i can provide an example solution (regex based) and a list of reasons it
>> doesn't work very well, if you want.
>>
>> sample input:
>>
>> - this email itself works nicely
>>
>> - this one is hard. sometimes i might want to write about gsub vs. gsub!
>> without the "." or "!" causing any capitalization (or the punctuation in
>> quotes).
>>
>> one problem is maybe dealing with sentences that contain periods is too
>> hard. i don't know.

My day job is developing natural language processing apps, and we've had
to implement a similar case-correcting tool. What we found is that a
simple regex-based approach is correct about 90% of the time. When we
used machine learning to do the same thing, the results went up to about
95%. Compare this to human performance (i.e., have two or more people
manually correct a text, then compare how often their corrections
agree), which was, IIRC, about 97%.

> It would be nice if you could assume two spaces after an end of sentence
> with punctuation. Generally I think that's correct grammar, although my
> grammar stinks so I could easily be wrong. If you have to get into
> parsing incorrect grammar it becomes much more difficult.

The two-spaces-after-period rule is not a grammatical one; it's a
typographic convention that grew out of typewriter (i.e. monospaced)
fonts.
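
As a rough illustration of the kind of simple regex-based approach discussed above, here is a minimal Ruby sketch. It is not the tool described in this post. The method name `capitalize_sentences` and its three rules (start of text or line, after `.`, `!` or `?` followed by whitespace, and the standalone pronoun "i") are assumptions made for illustration only.

```ruby
# Hypothetical helper, not from the original post: capitalizes sentence
# starts and the standalone pronoun "i" using a few regexes.
def capitalize_sentences(text)
  result = text.dup

  # Rule 1: a lowercase letter at the start of the text or of a line.
  result.gsub!(/(\A|\n)(\s*)([a-z])/) { "#{$1}#{$2}#{$3.upcase}" }

  # Rule 2: a lowercase letter after ., ! or ? (optionally followed by a
  # closing quote or parenthesis) and at least one whitespace character.
  result.gsub!(/([.!?]["')]?\s+)([a-z])/) { "#{$1}#{$2.upcase}" }

  # Rule 3: the standalone pronoun "i" (this also catches "i'm", "i've").
  result.gsub!(/\bi\b/, "I")

  result
end

puts capitalize_sentences("sometimes i type in lowercase. a friend says it's hard to read.")
# prints: Sometimes I type in lowercase. A friend says it's hard to read.

puts capitalize_sentences("write about gsub vs. gsub! without the punctuation")
# prints: Write about gsub vs. Gsub! Without the punctuation
```

The second call shows the failure mode from the quiz's own sample input: the rules cannot tell the period in "vs." or the "!" in "gsub!" from a real sentence end, so both trigger a wrong capital. Rule 3 has a similar blind spot and would turn "i.e." into "I.e.".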