> I'm trying to match and extract pieces from structured text in a similar
> way to what sgrep (see <http://www.cs.helsinki.fi/~jjaakkol/sgrep.html>)
> does.
>
> The concrete purpose is to get titles from HTML files, that is the first
> occurrence of any text between <title> and </title>. Better still, I'd
> like to get the "X" from <html>..<head>..<title> X </title>..</head>.

I don't know of a Ruby option, but Perl has a really nice
HTML parser: http://search.cpan.org/search?dist=HTML-Tree

The author has an article about its use in TPJ (password required):
http://www.tpj.com/issues/currvol/tpj0503-0008.html

You can do things like this (I quote from the article):

Sometimes the only way to pin down what you're after is by position in the
tree. For example, headlines of interest may be in the third column of the
second row of the second table element in a page:

  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
  # ...then do things with $col3...


Or they might be all the links in a <p> element with more than two <br>
elements as children:

  my $p = $tree->look_down(
    '_tag', 'p',
    sub {
      2 < grep { ref($_) and $_->tag eq 'br' }
              $_[0]->content_list
    }
  );
  my @links = $p->look_down('_tag', 'a');
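
For the title case in the original question, something along these lines
might work. This is only a sketch, not code from the article: it assumes
HTML::TreeBuilder (from the HTML-Tree distribution linked above) is
installed, and first_title is a helper name I made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # assumption: installed from the HTML-Tree distribution

# Hypothetical helper: return the text of the first <title> found under
# <head>, with surrounding whitespace trimmed, or undef if there is none.
sub first_title {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $head  = $tree->look_down('_tag', 'head');
    # In scalar context look_down returns only the first matching element.
    my $title = $head ? $head->look_down('_tag', 'title') : undef;
    my $text;
    if ($title) {
        $text = $title->as_text;
        $text =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    }
    $tree->delete;                 # release the parse tree explicitly
    return $text;
}

my $html = '<html><head><title> X </title></head><body><p>hi</p></body></html>';
my $found = first_title($html);
print defined $found ? "$found\n" : "(no title)\n";
```

Going through <head> first mirrors the <html>..<head>..<title> nesting asked
about; calling look_down for 'title' directly on the tree would pick up the
first <title> anywhere in the document instead.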

All in all, I think it is a very powerful parser, and it would
be great if Ruby had something similar (which it may already have;
I just haven't seen it).

regards,
-joe