--bpVaumkpfGNUagdU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* Eric Schwartz (emschwar / pobox.com) wrote:
> "Hal E. Fulton" <hal9000 / hypermetrics.com> writes:
> > IIRC you can use the multiline modifier "m" on
> > the RE. I haven't tried this:
> > 
> >   str =~ /(<html>.*<\/html>)/m
> 
> But when parsing HTML, you probably shouldn't be using REs at all.
> This perfectly legal HTML will confuse that RE:
> 
> <html>
> <!-- You can put </html> in a comment -->
> </html>

I agree with the sentiment (that using REs to parse HTML, XML, etc.
isn't a good idea), but the example above doesn't actually confuse that
RE.  The .* is greedy, so it runs on to the last </html> and the match
comes out the way it should.

--
#!/usr/bin/ruby

str = "<html>
<head><title>test html</title></head>

<!-- comment with </html> in it -->
<body bgcolor='#ffffff'>bleh</body>
</html>"

puts $1 if str =~ /<html>(.*)<\/html>/m
--
:!./re_test.rb

<head><title>test html</title></head>

<!-- comment with </html> in it -->
<body bgcolor='#ffffff'>bleh</body>
--
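
For contrast, here is a quick sketch of where Eric's example would bite:
the non-greedy form of the same RE.  The variable names are mine and
only there for illustration.

```ruby
str = "<html>
<head><title>test html</title></head>

<!-- comment with </html> in it -->
<body bgcolor='#ffffff'>bleh</body>
</html>"

# Same pattern twice; only the quantifier differs.
greedy     = str[/<html>(.*)<\/html>/m, 1]
non_greedy = str[/<html>(.*?)<\/html>/m, 1]

# The greedy capture reaches the real closing tag, so the body survives.
puts "greedy keeps the body:     #{greedy.include?('<body')}"
# The non-greedy capture stops at the </html> inside the comment.
puts "non-greedy keeps the body: #{non_greedy.include?('<body')}"
```

This should print true for the greedy form and false for the non-greedy
one, so whether the comment matters depends on which quantifier you
picked.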

Here's a better example:

--
#!/usr/bin/ruby

str = "
<b>here's some bold text
<!-- with a </b> comment --></b><br />
<b>here's some more bold text</b>"

puts "greedy: #$1" if str =~ /<b>(.*)<\/b>/m
puts "non-greedy: #$1" if str =~ /<b>(.*?)<\/b>/m
--
:!./re_test.rb
greedy: here's some bold text
<!-- with a </b> comment --></b><br />
<b>here's some more bold text
non-greedy: here's some bold text
<!-- with a 
--

In this case neither form works: the greedy match runs through to the
last </b> and swallows both elements, while the non-greedy match stops
at the </b> inside the comment.  No sane amount of lookahead or
lookbehind assertions could account for all the corner cases.  Note
that this is neither pathological nor contrived; think about the number
of existing HTML documents with unterminated <p> and <td> tags.
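
To show what "knowing about the structure" buys you, here is a minimal
sketch of a tokenizer built on StringScanner from the standard library.
It is not an HTML parser and shouldn't be used as one: tokenize and
bold_texts are hypothetical helpers made up for this example, and they
assume no ">" inside attribute values, no CDATA, no <script> bodies and
so on.  The only point is that once comments are recognized as their own
token, the </b> inside one stops being a problem.

```ruby
require 'strscan'

# Split markup into [type, value] pairs.  Hypothetical helper: it only
# knows about comments, simple tags, and text.
def tokenize(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<!--.*?-->/m)
      tokens << [:comment, scanner.matched]
    elsif scanner.scan(/<\/([a-zA-Z][a-zA-Z0-9]*)\s*>/)
      tokens << [:close, scanner[1].downcase]
    elsif scanner.scan(/<([a-zA-Z][a-zA-Z0-9]*)[^>]*>/)
      tokens << [:open, scanner[1].downcase]
    elsif scanner.scan(/[^<]+/)
      tokens << [:text, scanner.matched]
    else
      # A stray "<" that starts neither a tag nor a comment.
      tokens << [:text, scanner.getch]
    end
  end
  tokens
end

# Collect the text inside each <b>...</b> element.  Comment tokens are
# never looked at, so markup inside them is ignored.
def bold_texts(html)
  results = []
  current = nil
  tokenize(html).each do |type, value|
    case type
    when :open
      current = String.new if value == "b"
    when :close
      if value == "b" && current
        results << current
        current = nil
      end
    when :text
      current << value if current
    end
  end
  results
end

str = "
<b>here's some bold text
<!-- with a </b> comment --></b><br />
<b>here's some more bold text</b>"

bold_texts(str).each { |text| puts "bold: #{text.inspect}" }
```

With the string from the example above, this should report the two bold
runs separately, with the comment left out of both.  Unterminated tags
and the rest of real-world HTML are still unhandled, which is exactly
why a real parser is the better tool.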

> If you need to parse HTML, use an HTML parser; there will always be a
> (usually simple) way to defeat a regular expression.  "But I can
> control the HTML!", I hear you (generic) say.  Sure you can-- now.
> But what happens a year down the road when someone else is in charge
> of generating it?  Best to be safe and do it the right way from the
> start.

Agreed.

> -=Eric

-- 
Paul Duncan <pabs / pablotron.org>        pabs in #gah (OPN IRC)
http://www.pablotron.org/               OpenPGP Key ID: 0x82C29562

--bpVaumkpfGNUagdU
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+oDU4zdlT34LClWIRAuFkAKCTMyWs9vJY0z6E0M+kIDfddF7JpgCZARDx
81/k3axj/6F06301KdfebZo/v
-----END PGP SIGNATURE-----

--bpVaumkpfGNUagdU--