------art_11287_19310825.1126568412932
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

> And Lyndon, I'm a huge fan of Tidy for cleaning up my own web pages,
> but I'm not sure it's helpful here, as I was aiming to use regexes to
> parse the HTML rather than the DOM.



Well, the DOM allows you to use XPath, which is a powerful query mechanism.

This article is XQuery-specific, but it relies on XPath:

http://www-128.ibm.com/developerworks/java/library/j-jtp03225.html?ca=dgr-jw26XQuery

An example from the article:

//td[contains(a/small/text(), "New York, NY")]
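
To give a feel for what that expression selects, here is a rough sketch in Python. It assumes the page has already been cleaned into well-formed XHTML (for example by Tidy). The sample markup and the helper name cells_mentioning are made up for illustration and do not come from the article. Python's standard-library ElementTree implements only a subset of XPath and has no contains(), so the sketch picks the td elements with a path and does the substring test in Python; a full XPath engine such as lxml should accept the expression directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not taken from the article.
SAMPLE = """
<table>
  <tr><td><a href="/ny"><small>New York, NY</small></a></td></tr>
  <tr><td><a href="/sf"><small>San Francisco, CA</small></a></td></tr>
  <tr><td>no link here</td></tr>
</table>
"""


def cells_mentioning(root, needle):
    """Approximate //td[contains(a/small/text(), needle)].

    ElementTree has no contains(), so the path selects the first
    a/small under each td and the substring test runs in Python.
    """
    matches = []
    for td in root.iter("td"):
        small = td.find("a/small")
        if small is not None and needle in (small.text or ""):
            matches.append(td)
    return matches


root = ET.fromstring(SAMPLE)
hits = cells_mentioning(root, "New York, NY")
print([td.find("a").get("href") for td in hits])
```

The point is that the query describes the shape of the tree (a td holding an a holding a small), which is awkward to express robustly with a regex over the raw HTML.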

------art_11287_19310825.1126568412932--