Yeah, I had the same problem recently.  I think that since HTML allows lax 
closing of elements, REXML will just barf on it.  In the end I used regular 
expressions to pick out the lines I was interested in and to capture the 
fields I wanted.  Works really well.  There's also an HTML parser class 
based on the Python one, but it was so badly documented and seemed so 
poorly supported that I chose not to use it.
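
For what it's worth, the regex approach looks roughly like the sketch 
below.  The sample markup, the field names and the extract_rows helper 
are all made up for illustration, so adjust the patterns to whatever 
your pages actually contain.  It assumes each record sits on one line.

```ruby
# Hypothetical sample input: this markup is invented for the example.
# Note the unclosed <link> and <tr> tags, which a strict XML parser rejects.
HTML = <<~PAGE
  <html><head><link rel="stylesheet" href="style.css">
  <body>
  <table>
  <tr><td class="name">Widget</td><td class="price">9.99</td>
  <tr><td class="name">Gadget</td><td class="price">24.50</td>
  </table>
PAGE

# First pattern selects the lines of interest; second captures the fields.
ROW_LINE = /<tr>/i
FIELDS   = /<td class="name">([^<]*)<\/td>\s*<td class="price">([^<]*)<\/td>/i

# Returns one hash per matching line; lines that do not match are skipped.
def extract_rows(html)
  html.each_line.grep(ROW_LINE).map do |line|
    m = FIELDS.match(line)
    { name: m[1], price: m[2].to_f } if m
  end.compact
end

extract_rows(HTML).each { |row| puts "#{row[:name]}: #{row[:price]}" }
```

This is brittle if the page layout changes or a record spans several 
lines, but for pages with a fixed layout it gets the job done without 
needing well-formed input.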

Dario Linsky wrote:

> Hi,
> 
> On Fri, 02 Apr 2004 00:58:24 +0900, Paul Argentoff wrote:
> 
>>Sorry for returning to sorta well-discussed (but not in a sense I need)
>>topic. 
>>I can't parse xml files by rexml since some tags in html are open (such as 
>><link>, etc). Document.new errors with a message about such a tag.
> 
> Do I understand your problem right, that REXML gives you an Exception
> because you did not close a tag? If so, a possible solution would be to
> use XHTML instead of normal HTML.
> 
> 