I have written a tree-building HTML parser that is handy for
analyzing, repairing, or transforming HTML text.

The RAA entry is 
http://www.ruby-lang.org/en/raa-list.rhtml?name=ruby-htmltools

and you can download it at:
http://www.bike-nomad.com/ruby/ruby-htmltools-1.01.tar.gz

It requires the html-parser library, available from 
http://www.jin.gr.jp/~nahi/Ruby/html-parser/html-parser-19990912p2.tar.gz

Coming right on the heels of v1.0, this release makes the
following changes:

* attributes now maintain their order. Though this probably isn't
  strictly necessary under HTML, it may make it easier to compare
  document versions.

* the generated tree now has a top-level node for the document itself,
  so the DTD can be stored. THIS WILL REQUIRE CODE CHANGES if you have
  code that assumes that the root node is always <html>. To find the
  <html> node, you can use the new methods HTMLTreeParser#html() or
  HTMLDocument#html_node():

     html = parser.html()

  Or, querying the tree:

     html = parser.tree.html_node()

* comments are now stored in the tree

* added HTMLElement#print_on() to print a (sub)tree to an IO stream
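
To make the new tree shape concrete, here is a minimal sketch that
does not use ruby-htmltools at all. ToyDocument, ToyElement and
ToyComment are hypothetical stand-ins invented for this example, and
their methods only imitate the behaviour described above: a document
node on top that holds the DTD, attributes kept as ordered pairs,
comments stored as nodes, and a print_on() that writes a (sub)tree to
an IO stream. The real classes may well differ in names and details.

```ruby
require 'stringio'

# Toy stand-ins for illustration only; NOT the ruby-htmltools classes.
class ToyNode
  attr_reader :children

  def initialize
    @children = []
  end

  # Append a child and return it, so calls can be chained while building.
  def add(child)
    @children << child
    child
  end
end

# Top-level node: holds the DTD plus anything that sits beside <html>.
class ToyDocument < ToyNode
  attr_accessor :dtd

  # Imitates the idea of html_node(): the first child element named "html".
  def html_node
    @children.find { |c| c.is_a?(ToyElement) && c.tag == 'html' }
  end

  def print_on(io, depth = 0)
    io.puts "<!DOCTYPE #{dtd}>" if dtd
    @children.each { |c| c.print_on(io, depth) }
  end
end

class ToyElement < ToyNode
  attr_reader :tag, :attributes

  # attributes is an array of [name, value] pairs, so source order is kept.
  def initialize(tag, attributes = [])
    super()
    @tag = tag
    @attributes = attributes
  end

  # Print the opening tag, then the children one level deeper.
  def print_on(io, depth = 0)
    attrs = @attributes.map { |name, value| %( #{name}="#{value}") }.join
    io.puts "#{'  ' * depth}<#{@tag}#{attrs}>"
    @children.each { |c| c.print_on(io, depth + 1) }
  end
end

class ToyComment
  def initialize(text)
    @text = text
  end

  def print_on(io, depth = 0)
    io.puts "#{'  ' * depth}<!--#{@text}-->"
  end
end

doc = ToyDocument.new
doc.dtd = 'HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"'
doc.add(ToyComment.new(' generated '))   # a comment before <html>
html = doc.add(ToyElement.new('html'))
body = html.add(ToyElement.new('body', [['bgcolor', 'white'], ['text', 'black']]))
body.add(ToyComment.new(' empty '))

# The root is the document, so ask it for <html> instead of assuming it.
out = StringIO.new
doc.html_node.print_on(out)
puts out.string
```

Code that used to treat the root as <html> would, in this toy model,
find a comment as the first child of the root instead, which is why
asking the document for its <html> node is the safer habit.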

-- 
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE