----- Original Message -----
From: "Aredridel" <aredridel / nbtsc.org>
To: "ruby-talk ML" <ruby-talk / ruby-lang.org>
Sent: Thursday, June 12, 2003 11:16 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)

> I would parse into a tree, process there, then strip tags. The reason
> being, Ruby code and other nongrammatical entities are likely to be
> offset by tags -- <pre>, <code>, <tt>, things like that. Not always,
> but it's a useful heuristic.
>
> It's not a trivial task -- I've done a lot of natural-language work for
> the Wiki that I run (its markup is one of the least code-like of any
> wiki). How good you need the results to be is a big deciding factor in
> how to implement, for sure. Natural language parsing is a big CPU
> cruncher.

Yes, in this case, large code fragments are always
set off by "pre" tags. That does simplify.

As I said, I'm not interested in true natural-language
parsing. Something "mostly" accurate is good enough.

Thanks,
Hal
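
P.S. A rough sketch of the "mostly accurate" approach, assuming (as stated above) that large code fragments always sit inside "pre" tags. It uses plain regular expressions rather than the tree parse suggested in the quoted message, and the method name html_to_sentences is made up for illustration. Entities are left undecoded, and abbreviations such as "e.g." will be split wrongly.

```ruby
# Hypothetical helper (name not from the thread): turn HTML into a rough
# list of sentences. Regex-based, so it only suits reasonably tidy markup.
def html_to_sentences(html)
  # Drop whole <pre>...</pre> blocks, since they hold code, not prose.
  text = html.gsub(%r{<pre\b[^>]*>.*?</pre>}mi, " ")
  # Replace every remaining tag with a space so adjacent blocks do not fuse.
  text = text.gsub(/<[^>]+>/, " ")
  # Collapse runs of whitespace and trim the ends.
  text = text.gsub(/\s+/, " ").strip
  # Remove the stray space a stripped inline tag leaves before punctuation.
  text = text.gsub(/\s+([.!?,;:])/, '\1')
  # Split after ., ! or ? when followed by whitespace (naive heuristic).
  text.split(/(?<=[.!?])\s+/)
end

html = "<p>First sentence. Second one!</p>" \
       "<pre>x = 1. y = 2.</pre>" \
       "<p>Is this <b>third</b>?</p>"
puts html_to_sentences(html)
```

Anything smarter (code inside "code" or "tt" tags, quoted speech, abbreviations) would need the tree-based processing described in the quoted message.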