On Feb 29, 10:52 am, Florian Gilcher <f... / andersground.net> wrote:
> On Feb 29, 2008, at 2:54 PM, Mark Thomas wrote:
>
> > All the regex solutions provided will break with the following
> > perfectly valid HTML:
> >
> > <div class="info">
> > <h5 >Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
> >
> > This is one of many reasons it is a BAD idea to use regexes to parse
> > HTML. Regular expressions are simply not the right tool for the job.
>
> What's quite interesting is that I am not able to find a nice article
> on _why_ this doesn't work. So, in short:
>
> Regexp can only parse languages that are regular (hence the name) or -
> in other words - a Type 3 language in the Chomsky hierarchy [1]. This
> is a rule of thumb because many Regexp libraries nowadays implement
> features that enable you to do more than formal regular expressions.
> But for the typical use, it is true.
>
> Regular languages do not have any possibility to "look behind". They
> only look forward. This is the reason why you cannot define a regular
> language to describe and parse an arbitrarily deeply nested structure
> (and thus, no regular expression):
> You have no possibility to determine which closing tag matches a given
> opening tag.
>
> A more abstract example:
> There is no (formal) regular expression that matches a word that
> consists of n times "a" and then n times "b":

And that doesn't matter much. One can use as many regular
expressions as one wishes.

> ab
> aabb
> aaabbb
> aaaabbbb
> etc.

    "ab xx aabb aaabbb aaabb aaaabbbb".split.each{|s|
      # The first match captures the run of a's; the second regex is
      # built from its length, so it demands exactly that many b's.
      if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
        puts s
      else
        puts '-'
      end
    }

Or one can use a regular expression + code:

    "ab xx aabb aaabbb aaabb aaaabbbb".split.each{|s|
      # The regex checks the shape (a's, then b's); code compares counts.
      if s.match(/^(a+)(b+)$/) and $1.size == $2.size
        puts s
      else
        puts '-'
      end
    }

What makes anyone think that a single regular expression
has to do all the work?
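
The same division of labor seems to carry over to the nested-tag case from the quoted text. What follows is only a rough sketch, not a real HTML parser: balanced_tags? is a made-up helper name, and it assumes simple markup (no comments, no void or self-closing elements, no ">" inside attribute values). The regex merely tokenizes the tags; an ordinary array used as a stack does the matching of closing tags to opening tags that a formal regular expression cannot do.

```ruby
# Hypothetical helper, not from the thread: reports whether every closing
# tag matches the innermost tag that is still open.
# Assumptions: no comments, no void/self-closing elements, and no ">"
# inside attribute values.
def balanced_tags?(html)
  stack = []
  # Group 1 is "/" for a closing tag (empty otherwise); group 2 is the name.
  html.scan(%r{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>}) do |closing, name|
    if closing.empty?
      stack.push(name.downcase)
    elsif stack.pop != name.downcase
      return false  # closing tag does not match the innermost open tag
    end
  end
  stack.empty?      # anything left over was never closed
end

puts balanced_tags?('<div class="info"><h5 >Tagline:</h5> text </div>')  # true
puts balanced_tags?('<div><div><p>x</p></div></div>')                    # true
puts balanced_tags?('<div><p>x</div></p>')                               # false
```

Note that the stray space in "<h5 >" is harmless here, because the tokenizing regex only has to recognize one tag at a time rather than the whole document.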