On 26.06.2009 18:14, Wes Gamble wrote:
> Robert Klemme wrote:
>> If you provide more detail about the input and the text that you want
>> to match we might be able to help fix the regular expression. IMHO
>> that approach is superior to simply returning the match effectively
>> replacing it with itself (which does work of course).
>
> self.html.gsub!(/<a\s+?[^>]*?href=(['"])  #<a up to and including href=' or href="
>                  (?!mailto:)(.*?)          #Contents of any non-mailto: href attribute
>                  \1.*?>                    #End of href attribute (same quote) + arbitrary text to end of opening <a> tag
>                  (.*?)                     #Contents of <a> - the "link display"
>                  <\\?\/a>/mix) {           #Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>
>
> So, this regex is attempting to pull out the contents of an href in a
> <a> tag, as well as the content enclosed by the <a> tag.
>
> The problem comes when it encounters a particularly nefarious kind of
> HTML which looks like this:
>
> <a href="x"><div>....<a href="x"></a>....</div>
>
> and there is no closing </a> for the first anchor. What I want to pull
> is the _valid_ <a> tag "on the inside", but what I get is the first <a>
> tag up to the closing </a> tag, which is not correct. The problem is
> that the first <a> tag just shouldn't be there at all.
Another way to put it is that you want to match <a>...</a> without any
intermediate <a>.
> So I need to modify my regex to not match if there is a <a> tag inside
> of another one. I tried for about 30 minutes yesterday using a (?!)
> assertion, but couldn't quite get it.
So the basic pattern here is that you want to match a combination A...B
without any A in between.
We try with a simple example:
irb(main):005:0> s = '....A;;A+++B'
=> "....A;;A+++B"
irb(main):006:0> s.scan %r{A(?:.(?!A))+B}
=> ["A+++B"]
Now with an HTML-like string:
irb(main):008:0> t = s.gsub(/A/, '<a href="foo">').gsub(/B/, '</a>')
=> "....<a href=\"foo\">;;<a href=\"foo\">+++</a>"
irb(main):017:0> t.scan %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}i
=> ["<a href=\"foo\">+++</a>"]
A bit more readable:
irb(main):024:0> t.scan %r{
irb(main):025:0/ <a(?:\s+\w+=["'][^"']*["'])*> # opening tag
irb(main):026:0/ (?:.(?!<a))*? # between <a> and </a>
irb(main):027:0/ </a> # closing tag
irb(main):028:0/ }mix
=> ["<a href=\"foo\">+++</a>"]
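The same idea might be folded into the href-capturing regex from the quoted message roughly as follows. This is only a sketch: the sample HTML, the constant names, the `[^'">]*` used for the href value, and the `\b` after `<a` are my assumptions, not part of either regex above.

```ruby
# Sample input modeled on the problem HTML from the quoted message;
# the href values and link text are made up for illustration.
HTML = '<a href="x"><div>....<a href="y">link text</a>....</div>'

ANCHOR = %r{
  <a\s+[^>]*?href=(['"])   # opening <a ... href= and its quote character
  (?!mailto:)([^'">]*)\1   # href value, not mailto:, closed by the same quote
  [^>]*>                   # remainder of the opening tag
  ((?:(?!<a\b).)*?)        # link display: no further <a may start in here
  </a>                     # closing tag
}mix

# scan returns one array of captures per match: quote, href, link display.
LINKS = HTML.scan(ANCHOR).map { |_quote, href, text| [href, text] }
p LINKS  # => [["y", "link text"]]
```

Restricting the href value to `[^'">]*` keeps the engine from backtracking across the unclosed outer tag and picking up the inner href as part of the outer one.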
The trick is to have a negative lookahead assertion on *each* character
between the beginning and ending sequence. This prevents a match if the
opening sequence appears anywhere in between.
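One edge case may be worth checking: with `.(?!A)` the lookahead runs after each consumed character, so the character directly behind the opening sequence is consumed without being tested itself. Placing the lookahead before the dot covers that position too. A small sketch of the difference, where the constant names and the second test string are made up for illustration:

```ruby
# Lookahead after the dot: the first character behind the opener is not checked.
AFTER_DOT  = /A(?:.(?!A))+B/
# Lookahead before the dot: every position between opener and closer is checked.
BEFORE_DOT = /A(?:(?!A).)+B/

s = '....A;;A+++B'
p s.scan(AFTER_DOT)    # => ["A+++B"]
p s.scan(BEFORE_DOT)   # => ["A+++B"]

# Two openers back to back: only the second form skips the outer one.
adjacent = 'AA+++B'
p adjacent.scan(AFTER_DOT)    # => ["AA+++B"]
p adjacent.scan(BEFORE_DOT)   # => ["A+++B"]
```

For the HTML case that would mean writing `(?:(?!<a).)*?` instead of `(?:.(?!<a))*?`.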
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/