Sorry for the late reply.

I'm surprised no one mentioned RubyfulSoup:

http://www.crummy.com/software/RubyfulSoup/

If I understand your problem correctly, it's exactly what you need: a
forgiving HTML parser.
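
In case it helps to see the idea, here is a rough sketch of what "forgiving" link extraction could look like in plain Ruby. To be clear, this is not RubyfulSoup's API and not how htmltokenizer works; extract_links is a made-up helper that uses a regex, and I'm assuming the links you care about are simple, non-nested <a href="..."> ... </a> pairs with quoted href values.

```ruby
# Hypothetical helper, not part of htmltokenizer or RubyfulSoup.
# Assumption: anchors are not nested and href values are quoted.
# The link text is taken as everything up to the next </a>, so stray
# "<" and ">" characters inside the text are kept as-is.
def extract_links(html)
  links = []
  pattern = /<a\s[^>]*?href\s*=\s*["']([^"']*)["'][^>]*>(.*?)<\/a>/im
  html.scan(pattern) do |href, text|
    links << [href, text.strip]
  end
  links
end

html = '<a href="an_uri" > this is a <link> </a>'
extract_links(html).each { |href, text| puts "#{href}: #{text}" }
```

A regex like this will miss plenty of real-world cases (unquoted attributes, nested markup, comments), so for anything beyond a quick script a tolerant parser is probably still the better route.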

Dan

On 11/28/05, Horacio Sanson <hsanson / moegi.waseda.jp> wrote:
> Well, the problem is that this HTML is not mine; I am retrieving the pages from
> the Internet.
>
>
> I guess I will skip this page in my script.
>
> thanks,
> Horacio
>
> On Monday 28 November 2005 21:52, Daniel Schierbeck wrote:
> > Horacio Sanson wrote:
> > > I am using htmltokenizer to extract the links from some web pages. My
> > > script worked perfectly until I started to parse pages with "<" and ">"
> > > chars in the text.
> > >
> > > An HTML string like this
> > >
> > > <a href="an_uri" > this is a <link> </a>
> > >
> > > causes the htmlparser to raise an exception; Error, tag is nil....
> > >
> > >
> > > Is there a patch or any way to make htmlparser parse this text?
> > >
> > >
> > > regards,
> > > Horacio
> >
> > Your HTML isn't valid. Either use the proper entities (< = &lt; and > =
> > &gt;) or make a CDATA section, though the latter isn't really that
> > well-supported in most browsers.
> >
> >    <a href="an_uri"><![CDATA[this is a <link>]]></a>
> >
> >
> > Cheers,
> > Daniel
>
>