Kevin,

I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.

Tidy does all the hard stuff of dealing with bad HTML and straightening
it up. The HTML scanner is very lightweight and has a simple, clean
API. You don't need to run Rails; just require the scanner library
(look for html/document.rb).

It's two passes, but with Tidy written in C and the HTML scanner doing
no cleanup, it's amazingly fast. I'm processing around 500Kb/s on a
1.8GHz mobile Core Duo.

You can walk the DOM, or use XPath-like finders, or my preferred method
of looking up content: using CSS selectors.
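
To give a rough idea of what "walking the DOM" in document order looks
like, here is a minimal sketch. It does not use the HTML scanner's real
API: Node, el, each_node and find_all are hypothetical stand-ins I made
up for illustration, and the tree is built by hand rather than parsed.

```ruby
# Hypothetical minimal node type, standing in for whatever a parser returns.
# This is NOT the Rails HTML scanner API, only an illustration.
Node = Struct.new(:tag, :text, :children) do
  # Yield this node, then every descendant, in document order (pre-order).
  def each_node(&block)
    return enum_for(:each_node) unless block
    yield self
    children.each { |child| child.each_node(&block) }
  end

  # Collect all nodes with the given tag name, in document order.
  def find_all(tag_name)
    each_node.select { |node| node.tag == tag_name }
  end
end

# Hypothetical helper to build nodes by hand for the example.
def el(tag, text = nil, children = [])
  Node.new(tag, text, children)
end

tree = el("div", nil, [
  el("h1", "Title"),
  el("ul", nil, [el("li", "one"), el("li", "two")]),
  el("p", "footer")
])

order = tree.each_node.map(&:tag)        # tags in the order they appear
items = tree.find_all("li").map(&:text)  # text of every <li>
```

A real parser hands you the tree; the traversal idea is the same, and
CSS selectors just replace the tag-name comparison with a richer match.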

If you're doing HTML scraping, this library will do all the hard work
for you:
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

Assaf
http://labnotes.org


Kevin Weller wrote:
> Anybody have experience with a decent HTML parser for a Ruby
> application?  I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file.  I don't want or need a callback mechanism, only
> something I can iterate and tree-search.  Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here.  Thanks in advance!
>
> --
> Kevin Weller
> Information Technology Crucible
> http://www.itcrucible.com