----- Original Message ----- 
From: "Aredridel" <aredridel / nbtsc.org>
To: "ruby-talk ML" <ruby-talk / ruby-lang.org>
Sent: Thursday, June 12, 2003 11:16 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)


> I would parse into a tree, process there, then strip tags.  The reason
> being, Ruby code and other nongrammatical entities are likely to be
> offset by tags -- <pre>, <code>, <tt>, things like that.  Not always,
> but it's a useful heuristic.
> 
> It's not a trivial task -- I've done a lot of natural-language work for
> the Wiki that I run (its markup is one of the least code-like of any
> wiki).  How good you need the results to be is a big deciding factor in
> how to implement, for sure.  Natural language parsing is a big CPU
> cruncher.

Yes, in this case, large code fragments are always set off by "pre"
tags. That does simplify things.

As I said, I'm not interested in true natural-language parsing.
Something "mostly" accurate is good enough.
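
To make "mostly accurate" concrete, here is a rough sketch of the kind
of heuristic I have in mind. It is not a real HTML parser: it assumes
code always sits inside "pre" tags, uses regular expressions instead of
a tree, and does not decode entities. The method name html_to_sentences
is just made up for illustration.

```ruby
# Rough heuristic: drop <pre> blocks, strip the remaining tags, then
# split on sentence-ending punctuation. Assumes code lives only in <pre>.
def html_to_sentences(html)
  text = html.gsub(/<pre\b[^>]*>.*?<\/pre>/mi, " ")  # remove code blocks entirely
  text = text.gsub(/<[^>]+>/, " ")                    # replace every other tag with a space
  text = text.gsub(/\s+/, " ").strip                  # collapse runs of whitespace
  text = text.gsub(/\s+([.!?,;:])/, '\1')             # re-attach punctuation separated by a stripped tag
  # Split after ., ! or ? when followed by whitespace and a capital letter.
  text.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

html = "<p>Ruby is fun. It has blocks!</p><pre>puts 'hi. There'</pre>" \
       "<p>Is it <b>fast</b>? Mostly.</p>"
html_to_sentences(html).each { |sentence| puts sentence }
```

It will split wrongly on abbreviations followed by a capitalized word
and will miss sentences that start with a lowercase letter or a digit,
which is the price of skipping real natural-language parsing.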

Thanks,
Hal