On 26.06.2009 18:14, Wes Gamble wrote:
> Robert Klemme wrote:
>> If you provide more detail about the input and the text that you want
>> to match we might be able to help fix the regular expression.  IMHO
>> that approach is superior to simply returning the match effectively
>> replacing it with itself (which does work of course).
> 
> self.html.gsub!(/<a\s+?[^>]*?href=(['"])   # <a up to and including href=' or href="
>                 (?!mailto:)(.*?)            # Contents of any non-mailto: href attribute
>                 \1.*?>                      # End of href attribute (same quote) + arbitrary text to end of opening <a> tag
>                 (.*?)                       # Contents of <a> - the "link display"
>                  <\\?\/a>/mix) {            # Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>
> 
> So, this regex is attempting to pull out the contents of an href in a 
> <a> tag, as well as the content enclosed by the <a> tag.
> 
> The problem comes when it encounters a particularly nefarious kind of 
> HTML which looks like this:
> 
> <a href="x"><div>....<a href="x"></a>....</div>
> 
> and there is no closing </a> for the first anchor.  What I want to pull 
> is the _valid_ <a> tag "on the inside", but what I get is the first <a> 
> tag up to the closing </a> tag, which is not correct.  The problem is 
> that the first <a> tag just shouldn't be there at all.

Another way to put it is that you want to match <a>...</a> without any 
intermediate <a>.

> So I need to modify my regex to not match if there is a <a> tag inside 
> of another one.  I tried for about 30 minutes yesterday using a (?!) 
> assertion, but couldn't quite get it.

So the basic pattern here is that you want to match a combination A...B 
without any A in between.

We try with a simple example:

irb(main):005:0> s = '....A;;A+++B'
=> "....A;;A+++B"
irb(main):006:0> s.scan %r{A(?:.(?!A))+B}
=> ["A+++B"]
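
For comparison, here is a small sketch of the same string scanned with and without the lookahead. The variable names naive and guarded are only for illustration.

```ruby
s = '....A;;A+++B'

# A plain non-greedy match starts at the first A and runs straight
# through the second one.
naive = s.scan(/A.+?B/)

# With the lookahead after each character, the first A cannot start a
# match, because the second ';' is directly followed by another A.
guarded = s.scan(/A(?:.(?!A))+B/)

p naive    # => ["A;;A+++B"]
p guarded  # => ["A+++B"]
```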

Now with an HTML-like string:

irb(main):008:0> t = s.gsub(/A/, '<a href="foo">').gsub(/B/, '</a>')
=> "....<a href=\"foo\">;;<a href=\"foo\">+++</a>"
irb(main):017:0> t.scan %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}i
=> ["<a href=\"foo\">+++</a>"]

A bit more readable, using the extended (x) syntax:

irb(main):024:0> t.scan %r{
irb(main):025:0/ <a(?:\s+\w+=["'][^"']*["'])*> # opening tag
irb(main):026:0/ (?:.(?!<a))*?  # between <a> and </a>
irb(main):027:0/ </a>  # closing tag
irb(main):028:0/ }mix
=> ["<a href=\"foo\">+++</a>"]
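
To connect this back to the original goal (pulling out the href and the link text), here is a rough sketch run against a string shaped like the problematic HTML. It is not a drop-in replacement for the gsub! above. The sample string and the name link are made up for illustration, and it assumes href values contain no quote characters. I also put the lookahead in front of the dot here, so that the very first character after the opening tag is checked as well.

```ruby
html = '<a href="x"><div>....<a href="y">inner</a>....</div>'

link = %r{
  <a\s+[^>]*?href=["']     # opening tag up to the href quote
  (?!mailto:)([^"']*)["']  # href value, skipping mailto: links
  [^>]*>                   # rest of the opening tag
  ((?:(?!<a\b).)*?)        # link text, guarded against a nested <a
  </a>                     # closing tag
}mix

# scan returns one [href, text] pair per match
p html.scan(link)  # => [["y", "inner"]]
```

The unclosed outer anchor is skipped because its would-be content runs into the inner <a before any </a> is found.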

The trick is to apply a negative lookahead assertion to *each* character 
between the beginning and ending sequence.  This prevents a match 
whenever the opening sequence appears again in between.
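
One edge case is worth checking. With the assertion placed after the dot, the character directly following the opening sequence is consumed without being tested itself, so two adjacent openers can slip through. A minimal sketch, assuming it is acceptable to move the lookahead in front of the dot (the variable names are only illustrative):

```ruby
adjacent = 'AA+++B'

# Lookahead after the dot: the second A is consumed by the first '.',
# and only the character following it is checked.
after_dot = adjacent.scan(/A(?:.(?!A))+B/)

# Lookahead before the dot: every character, including the first one
# after the opener, is checked before it is consumed.
before_dot = adjacent.scan(/A(?:(?!A).)+B/)

p after_dot   # => ["AA+++B"]
p before_dot  # => ["A+++B"]
```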

Kind regards

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/