Hello, all.
Here's an idea I'm toying with. Suggestions
are welcome.
I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it...
...and get a list of the *sentences* in the
document.
There are, of course, several things that make
this difficult:
- need to distinguish end-of-sentence
punctuation from embedded punctuation, such
as abbreviations and textual references to
Ruby methods like eof? and split!
- need to treat sentence fragments as sentences
- need to ignore blocks of code
- etc.
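
To make the first two points concrete, here is a rough sketch of the
kind of heuristic I have in mind. It is not a finished solution. The
names split_sentences and ABBREVIATIONS are made up for illustration,
and the abbreviation list is only a placeholder. The assumption is that
a . ? or ! ends a sentence only when the next word starts with a
capital letter or digit (or the text ends), so "eof? and" does not
split. Anything left over at the end counts as a fragment.

```ruby
# Placeholder abbreviation list (an assumption); real text needs far more.
ABBREVIATIONS = %w[e.g. i.e. etc. Mr. Mrs. Dr. vs.].freeze

# Split plain text into sentences using a simple token-based heuristic.
def split_sentences(text)
  sentences = []
  current = []
  tokens = text.split(/\s+/).reject(&:empty?)

  tokens.each_with_index do |token, i|
    current << token
    next_token = tokens[i + 1]

    # Candidate terminator: . ? or ! possibly followed by closing quotes/parens.
    terminator = token.match?(/[.?!]["')]*\z/)
    # The next word must look like a sentence start (capital or digit),
    # which keeps "eof? and split! with" together.
    starts_next = next_token.nil? || next_token.match?(/\A["'(]*[A-Z0-9]/)

    if terminator && !ABBREVIATIONS.include?(token) && starts_next
      sentences << current.join(" ")
      current = []
    end
  end

  # Whatever remains is a fragment; treat it as a sentence too.
  sentences << current.join(" ") unless current.empty?
  sentences
end

p split_sentences("Use eof? and split! with care. Dr. Smith agrees, e.g. for files. A fragment")
# => ["Use eof? and split! with care.", "Dr. Smith agrees, e.g. for files.", "A fragment"]
```

Known weakness: a sentence that begins with a lowercase word (say, a
method name) will be glued onto the previous one.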
My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn't have any dependencies.
Not sure whether to do it in two steps or not:
1. Convert to text
2. Process
Might be just as easy to do it in one step if
I knew what I was doing.
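
For the two-step version, step 1 might look roughly like the sketch
below. It uses plain regexes rather than htmlsplit or a real parser,
so it is only an approximation and will trip on badly broken markup.
The names html_to_blocks and ENTITIES are hypothetical, and the lists
of tags and entities are assumptions, not complete. It drops
pre/script/style blocks (the "blocks of code"), keeps inline code
text, and returns one string per block-level element so that headings
and list items come out as separate fragments for step 2.

```ruby
# Small, incomplete entity table (an assumption); "&amp;" is decoded last
# so that "&amp;lt;" becomes "&lt;" rather than "<".
ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&quot;" => '"',
  "&nbsp;" => " ", "&amp;" => "&"
}.freeze

# Step 1: turn HTML into an array of plain-text blocks.
def html_to_blocks(html)
  # Remove comments and whole blocks of code.
  text = html.gsub(/<!--.*?-->/m, " ")
  text = text.gsub(%r{<(pre|script|style)\b[^>]*>.*?</\1\s*>}mi, " ")

  # Line breaks in the HTML source carry no meaning; flatten them first.
  text = text.gsub(/\s+/, " ")

  # Block-level tags mark boundaries between text blocks.
  block_tag = %r{</?(?:p|div|li|ul|ol|h[1-6]|tr|table|blockquote)\b[^>]*>|<br\b[^>]*>}i
  text = text.gsub(block_tag, "\n")

  # Strip every remaining tag, then decode entities.
  text = text.gsub(/<[^>]+>/, " ")
  ENTITIES.each { |entity, char| text = text.gsub(entity, char) }

  text.split("\n").map { |line| line.gsub(/\s+/, " ").strip }.reject(&:empty?)
end

html = "<h1>Title</h1>\n<p>Call <code>eof?</code> first.\nThen stop.</p>" \
       "<pre>x = 1</pre><p>A &amp; B</p>"
p html_to_blocks(html)
# => ["Title", "Call eof? first. Then stop.", "A & B"]
```

Step 2 would then run a sentence splitter over each block in turn.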
Also not sure what the best tool/library is
for this job.
Comments welcome.
Hal
--
Hal Fulton
hal9000 / hypermetrics.com