I have written a tree-building HTML parser that is handy for
analyzing, repairing, or transforming HTML text.
The RAA entry is
http://www.ruby-lang.org/en/raa-list.rhtml?name=ruby-htmltools
and you can download it at:
http://www.bike-nomad.com/ruby/ruby-htmltools-1.01.tar.gz
It requires the html-parser library, available from
http://www.jin.gr.jp/~nahi/Ruby/html-parser/html-parser-19990912p2.tar.gz
Version 1.01 follows immediately on the heels of v1.0 and makes the
following changes:
* attributes now maintain their order. This probably isn't strictly
necessary under HTML, but it may make it easier to compare
document versions.
* the generated tree now has a top-level node for the document itself,
so the DTD can be stored. THIS WILL REQUIRE CODE CHANGES if you have
code that assumes that the root node is always <html>. To find the
<html> node, you can use the new methods HTMLTreeParser#html() or
HTMLDocument#html_node():
    html = parser.html()
Or, querying the tree:
    html = parser.tree.html_node()
* comments are stored in the tree
* added HTMLElement#print_on() to print a (sub)tree to an IO stream
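
To illustrate the root-node change without depending on the library
itself, here is a minimal, self-contained sketch. The Node struct, its
print_on method, and the find_html helper are hypothetical stand-ins
written for this example; they are NOT the ruby-htmltools classes and
only model a tree whose root is a document node holding the DTD, a
comment, and the <html> element.

```ruby
# Hypothetical stand-in for a tree node; NOT part of ruby-htmltools.
Node = Struct.new(:tag, :children) do
  # Print this node and its descendants to an IO stream, one per line,
  # indented two spaces per level.
  def print_on(io, depth = 0)
    io.puts("#{'  ' * depth}#{tag}")
    children.each { |child| child.print_on(io, depth + 1) }
  end
end

# Look for <html> among the document node's direct children instead of
# assuming that the root node itself is <html>.
def find_html(document)
  document.children.find { |child| child.tag == 'html' }
end

html = Node.new('html', [Node.new('head', []), Node.new('body', [])])
document = Node.new('#document',
                    [Node.new('!DOCTYPE', []), Node.new('#comment', []), html])

# Code that used to treat the root as <html> now has to ask for it.
root_html = find_html(document)
root_html.print_on($stdout)
```

With the real library, the lookup in find_html corresponds to the
parser.html() or parser.tree.html_node() calls described above.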
--
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE