Hello, all.

Here's an idea I'm toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it...

...and get a list of the *sentences* in the
document.

There are, of course, several things that make
this difficult:
  - need to distinguish between end-of-sentence
    punctuation and embedded punctuation, including
    both abbreviations and textual references to
    Ruby methods such as eof? and split!
  - need to treat sentence fragments as sentences
  - need to ignore blocks of code
  - etc.
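
To make the first two items concrete, here is a
rough sketch of one possible heuristic, not a
finished solution. It assumes the text is already
plain (no tags). The names ABBREVIATIONS, END_MARK
and split_sentences are made up for illustration,
and the abbreviation list is only a placeholder.

```ruby
# Hypothetical, very incomplete abbreviation list (stored without the final dot).
ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof etc vs e.g i.e].freeze

# A word "looks final" if it ends in . ? or !, optionally followed by
# closing quotes or brackets.
END_MARK = /[.?!]["')\]]*\z/

# Split plain text into sentences. Heuristic: a final-looking word ends a
# sentence only if the next word starts with an uppercase letter (or there
# is no next word) and the word is not a known abbreviation.
# Trailing text with no end punctuation is kept as a fragment.
def split_sentences(text)
  words = text.split            # split on any run of whitespace
  sentences = []
  current = []
  words.each_with_index do |word, i|
    current << word
    next unless word =~ END_MARK
    following = words[i + 1]
    # "eof? and ..." : next word is lowercase, so treat the ? as embedded.
    next if following && following !~ /\A["'(\[]?[A-Z]/
    # "Dr. Smith" : known abbreviation, so treat the . as embedded.
    next if ABBREVIATIONS.include?(word.sub(END_MARK, ""))
    sentences << current.join(" ")
    current = []
  end
  sentences << current.join(" ") unless current.empty?
  sentences
end

puts split_sentences("Dr. Smith likes eof? and split! a lot. Is that odd? Just a fragment")
# Dr. Smith likes eof? and split! a lot.
# Is that odd?
# Just a fragment
```

This will still get it wrong when a method name
like eof? happens to be followed by a capitalized
word, or when a sentence really does end in an
abbreviation, so it is a starting point at best.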

My current approach is to start with htmlsplit
from the RAA. It is fairly simplistic, but
at least it doesn't have any dependencies.

Not sure whether to do it in two steps or not:
1. Convert to text
2. Process
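
For the two-step route, step 1 might look roughly
like the sketch below. It uses plain regexes only
(no dependencies) and is not how htmlsplit works.
The names ENTITIES and html_to_blocks are made up,
and the choices of which tags count as block-level
and which blocks to drop (pre, script, style) are
assumptions.

```ruby
# A few common entities only; far from a complete table.
ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&quot;" => '"',
  "&nbsp;" => " ", "&amp;" => "&"
}.freeze

# Step 1: turn HTML into an array of plain-text blocks.
# Block-level tags become block boundaries, so headings and list items
# survive as separate fragments instead of running into the next paragraph.
def html_to_blocks(html)
  text = html.gsub(/\s+/, " ")                  # newlines in the source mean nothing
  text = text.gsub(/<!--.*?-->/, " ")           # drop comments
  # Drop code-like blocks together with their contents.
  text = text.gsub(/<(pre|script|style)\b[^>]*>.*?<\/\1\s*>/i, "\n")
  # Block-level tags mark a boundary between blocks.
  text = text.gsub(/<\/?(?:p|br|div|li|ul|ol|h[1-6]|tr|td|th|table|blockquote)\b[^>]*>/i, "\n")
  text = text.gsub(/<[^>]+>/, "")               # strip all remaining (inline) tags
  # Decode entities last, so a decoded "<" is never mistaken for a tag.
  text = text.gsub(/&(?:lt|gt|quot|nbsp|amp);/) { |e| ENTITIES[e] }
  text.split("\n").map(&:strip).reject(&:empty?)
end

html = "<h1>My Title</h1>\n<p>First <b>bold</b> sentence.\nSecond one.</p>" \
       "<pre>x = 1\nputs x</pre><p>Use &lt;br&gt; here.</p>"
p html_to_blocks(html)
# ["My Title", "First bold sentence. Second one.", "Use <br> here."]
```

Step 2 would then split each block into sentences
separately. Note that an unclosed <pre> is not
caught by this: its tags are stripped but its
contents stay in the text.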

Might be just as easy to do it in one step if
I knew what I was doing.

Also not sure what the best tool/library is for
this job.

Comments welcome.

Hal

--
Hal Fulton
hal9000 / hypermetrics.com