On Feb 29, 10:52 am, Florian Gilcher <f... / andersground.net> wrote:
> On Feb 29, 2008, at 2:54 PM, Mark Thomas wrote:
>
> > All the regex solutions provided will break with the following
> > perfectly valid HTML:
>
> > <div class="info">
> > <h5 >Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
>
> > This is one of many reasons it is a BAD idea to use regexes to parse
> > HTML. Regular expressions are simply not the right tool for the job.
>
> What's quite interesting is that I am not able to find a nice article
> on _why_ this doesn't work. So, in short:
>
> Regexps can only parse languages that are regular (hence the name) or,
> in other words, Type 3 languages in the Chomsky hierarchy [1]. This is a
> rule of thumb, because many Regexp libraries nowadays implement
> features that enable you to do more than formal regular expressions.
> But for typical use, it is true.
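
A side note on those extra features: Ruby's regexp engine (1.9 and
later) supports recursive subexpression calls via \g<name>. As far as I
can tell, that is enough to match the "n times a, then n times b" words
from the example below with one pattern. This is a minimal sketch, not a
recommendation; the constant AN_BN, the group name "ab" and the method
an_bn? are names made up for this illustration.

```ruby
# Sketch: match n a's followed by n b's with a single recursive pattern.
# (?<ab>...) defines a named group; \g<ab> calls that same group again,
# so every "a" consumed on the way in must be paired with a "b" on the
# way out. This goes beyond formal regular expressions.
AN_BN = /\A(?<ab>a\g<ab>?b)\z/

# Hypothetical helper: true if word is a^n b^n for some n >= 1.
def an_bn?(word)
  AN_BN.match?(word)
end

%w[ab xx aabb aaabbb aaabb aaaabbbb].each do |s|
  puts(an_bn?(s) ? s : '-')
end
```

The pattern is compact but harder to read than the two-step versions
further down, which is part of why splitting the work is attractive.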
>
> Regular languages have no way to "look behind": a recognizer for them
> only moves forward and cannot remember an unbounded amount of what it
> has already read. This is the reason why you cannot define a regular
> language (and thus no regular expression) that describes and parses an
> arbitrarily deeply nested structure: you have no way to determine which
> closing tag matches a given opening tag.
>
> A more abstract example:
> There is no (formal) regular expression that matches a word that consists
> of n times "a" and then n times "b":

And that doesn't matter much.  One can use as many regular expressions
as one wishes.

>
> ab
> aabb
> aaabbb
> aaaabbbb
> etc.

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
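  # The first match captures the run of a's in $1; the second pattern is
  # built from its length, so it demands exactly that many b's.
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)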
    puts s
  else
    puts '-'
  end
}

Or one can use a regular expression plus code:

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  # One pattern captures both runs; plain code compares their lengths.
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}

What makes anyone think that a single regular expression
has to do all the work?
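
The same division of labour can be sketched for the nesting problem
raised at the top of the thread: let a regular expression recognise
individual tags and let ordinary code keep a stack of open ones. This is
only an illustration under loud assumptions: TAG and balanced_tags? are
hypothetical names, and the sketch ignores void elements such as <br>,
self-closing tags, comments, and attribute values containing ">". It is
not a substitute for a real HTML parser.

```ruby
# Assumed, simplified tag pattern: an optional "/" and a tag name, then
# anything up to the next ">". It tolerates the "<h5 >" spelling.
TAG = /<(\/)?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/

# Hypothetical helper: true if every opening tag is closed in the
# right order. The regex finds the tags; the stack does the counting.
def balanced_tags?(html)
  stack = []
  html.scan(TAG) do |closing, name|
    if closing
      # A closing tag must match the most recently opened tag.
      return false unless stack.pop == name.downcase
    else
      stack.push(name.downcase)
    end
  end
  stack.empty?
end

html = <<~HTML
  <div class="info">
  <h5 >Tagline:</h5>
  Yippee Ki Yay Mo - John 6:27
  </div>
HTML

puts balanced_tags?(html)
puts balanced_tags?("<div><h5></div></h5>")
```

The regular expression never has to know how deep the nesting goes;
that part lives in the stack.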