schuerig / acm.org (Michael Schuerig) writes:

> > Please note that the samples provided assume that the start and end tags
> > appear in the same string (that is, on the same line in an HTML file).
> 
> That's exactly the restriction I'd like to avoid...
> 
> I haven't looked into it, but I'm sure it's possible to redefine the
> input record separator, slurp a complete file into a string and match a
> regex against that string.

str = File.open("x.html") {|f| f.read}
str =~ /.../m
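
To make that concrete, here is a minimal sketch of slurp-and-match
applied to the title case. The method name title_of and the sample
markup are made up for illustration; the part that matters is the /m
flag, which lets "." match newlines so the tags can sit on different
lines. In real use the string would come from the File.open line above.

```ruby
# Hypothetical helper: pull the <title> text out of an HTML string.
def title_of(html)
  # /m lets "." match newlines, so the tags may sit on different lines;
  # /i tolerates <TITLE> as well as <title>.
  html =~ %r{<title.*?>(.*?)</title>}mi ? $1.strip : nil
end

# Sample input with the start and end tags on different lines.
sample = "<html><head>\n<title>\n  My Page\n</title>\n</head></html>"
puts title_of(sample)
```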

> This very much goes against my sense of aesthetics. There's no need
> to read in the file beyond a successful match, and there's no need
> to read further when an orphaned </title> or a </head> tag is
> encountered.

All true, but at the same time, if you can do it in two lines rather
than writing a full parser, isn't there some compensating gain to be
had?

I've used a technique for a while now to convert structured files from 
one form to another.

1. Slurp the whole file in
2. Convert escaped characters into something distinct so they are no
   longer involved in processing.
3. Match delimiters (for example, braces in LaTeX and <>'s in
   HTML). This is where you take account of strings, commands and the
   like.
4. Perform a series of substitutions which match the command pattern
   and any arguments. The name of the command is then used either to
   look up a hash, or as the name of a method to call. The results of
   all this then get substituted back into the buffer.
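
As a rough sketch of those four steps, assuming a toy LaTeX-like
input where commands look like \name{arg}: the COMMANDS table, the
convert method and the NUL-delimited placeholder are all invented for
this example. Instead of explicitly pairing up delimiters in step 3,
the sketch takes a shortcut and rewrites the innermost commands first,
repeating until nothing is left to substitute.

```ruby
# Hypothetical command table: command name => lambda producing the output.
COMMANDS = {
  "emph" => lambda {|arg| "<em>#{arg}</em>" },
  "bold" => lambda {|arg| "<b>#{arg}</b>" },
}

def convert(text)
  buf = text.dup   # step 1: the whole input sits in one buffer

  # Step 2: hide escaped characters (\{ \} \\) behind a placeholder
  # that none of the later patterns can match.
  buf.gsub!(/\\([{}\\])/) { "\x00#{$1.ord}\x00" }

  # Steps 3 and 4: [^{}]* only matches a brace-free argument, so each
  # pass rewrites the innermost commands; loop until nothing changes.
  nil while buf.gsub!(/\\(\w+)\{([^{}]*)\}/) {
    name, arg = $1, $2
    handler = COMMANDS[name]
    handler ? handler.call(arg) : arg   # unknown command: keep its argument
  }

  # Finally, put the escaped characters back as literals.
  buf.gsub(/\x00(\d+)\x00/) { $1.to_i.chr }
end

puts convert('Some \emph{very \bold{important}} text with a \{brace\}.')
# prints: Some <em>very <b>important</b></em> text with a {brace}.
```

Looking the name up in a hash keeps the substitution loop itself
ignorant of what any particular command does; calling a method named
after the command would work the same way.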

It sounds messy, but the reality is that it works, and is a whole lot
simpler than doing the full parse (particularly for non-regular
languages such as LaTeX).


For your particular example, if I was worried about the potential size
of reading in the whole file, I might just read in the first (say) 2k,
and quickly check for </title>. If I didn't find it, I'd read another
2k until I did.


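   def findTitle(file)
      str = ''
      loop do
        begin
           # Read the next 2k chunk; sysread raises EOFError at end of file
           str << file.sysread(2048)
        rescue EOFError
           raise "</title> not found in file"
        end
        # %r{...} builds a regex; a plain %{...} would only be a string
        break if str =~ %r{</title>}
      end

      # No trailing </head> in the pattern: reading stops at </title>,
      # so </head> may not be in the buffer yet.
      return $1 if str =~ %r{<head.*?>.*?<title.*?>(.*?)</title>}m

      raise "Couldn't find title in file"
   end

   # The block form of File.open closes the file when we're done
   title = File.open("test.html") {|f| findTitle(f)}
   puts title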

Can't say as I've tested this, but it _might_ work ;-)


Dave