Robert Gustavsson <0317025435 / telia.com> wrote:

> "Michael Schuerig" <schuerig / acm.org> wrote in message
> news:1eodn9k.5gnwx1g5scq3N%schuerig / acm.org...
> >
> > The concrete purpose is to get titles from HTML files, that is the first
> > occurrence of any text between <title> and </title>. Better still, I'd
> > like to get the "X" from <html>..<head>..<title> X </title>..</head>.
> 
> # Sample line from an HTML file
> str = "<title>This is the title!</title><title>Another one!</title>"
> 
> # Make a regular expression match that finds a text expression that
> # 1. Starts with the text "<title>"
> # 2. Is followed by any (".") character(s), zero or more ("*"), do it
> #    non-greedy ("?")
> # 3. And then followed by the text "</title>" (note that the / is escaped
> #    by a backslash, if not the Ruby interpreter would think that the
> #    forward slash indicated the end of the regular expression.)

[snip]

> Please note that the samples provided assume that the start and end tags
> appear in the same string (that is, on the same line in an HTML file).

That's exactly the restriction I'd like to avoid...

I haven't looked into it, but I'm sure it's possible to redefine the
input record separator, slurp a complete file into a string and match a
regex against that string. This very much goes against my sense of
aesthetics. There's no need to read in the file beyond a successful
match, and there's no need to read further when an orphaned </title> or
a </head> tag is encountered.
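
A rough sketch of the kind of early-exit reading described above, in plain Ruby. The method name first_title is hypothetical, and the sketch assumes a plain "<title>" tag without attributes and input that arrives line by line. It is not a real HTML parser, just a regex run against a growing buffer:

```ruby
require "stringio"

# Hypothetical helper: read +io+ line by line and return the text of the
# first <title>...</title> pair, or nil if none is found.
# Reading stops as soon as the answer is known.
def first_title(io)
  buffer = "".dup
  io.each_line do |line|
    buffer << line
    # The /m flag lets "." match newlines, so a title may span several lines;
    # /i makes the tag names case-insensitive.
    if (m = buffer.match(%r{<title>(.*?)</title>}mi))
      return m[1].strip
    end
    # No complete pair so far: give up on an orphaned </title> or on </head>.
    return nil if buffer =~ %r{</title>|</head>}i
  end
  nil
end

html = "<html>\n<head>\n<title>\n  My Page\n</title>\n</head>\n<body>...</body>\n"
puts first_title(StringIO.new(html))
```

The same method should work with File.open(path) { |f| first_title(f) }, since it only relies on each_line. The stop conditions are checked once per line, so a line that contains both </head> and a later complete <title> pair would still yield that title.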

Dealing correctly with cases such as this requires parsing the input. In
the case of HTML there already is a suitable parser; for other purposes
one could use Racc to generate one (see the RAA for both). But that's
not really what I'm looking for. For lack of a better word, what I'd
like to do is "ad hoc"-parsing in a similar fashion to what sgrep
provides. Possibly my best bet is to extract a library from sgrep
and add Ruby bindings. But before I go there, I'd like to see what the
pure-Ruby options are.


Michael

-- 
Michael Schuerig
mailto:schuerig / acm.org
http://www.schuerig.de/michael/