Robert Klemme wrote:
> If you provide more detail about the input and the text that you want
> to match we might be able to help fix the regular expression. IMHO
> that approach is superior to simply returning the match effectively
> replacing it with itself (which does work of course).
self.html.gsub!(/<a\s+?[^>]*?href=(['"])  #<a up to and including href=' or href="
                 (?!mailto:)(.*?)          #Contents of any non-mailto: href attribute
                 \1.*?>                    #End of href attribute (same quote) + arbitrary text to end of opening <a> tag
                 (.*?)                     #Contents of <a> - the "link display"
                 <\\?\/a>/mix) {           #Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>
So, this regex is attempting to pull out the contents of an href in a
<a> tag, as well as the content enclosed by the <a> tag.
The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:
<a href="x"><div>....<a href="x">
</a>....</div>
and there is no closing </a> for the first anchor. What I want to pull
is the _valid_ <a> tag "on the inside", but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn't be there at all.
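To make the failure concrete, here is a minimal sketch that runs the same pattern against a cut-down version of the bad HTML. The constant name LINK_RE and the sample href values "outer" and "inner" are made up for illustration; only the pattern itself comes from the code above.

```ruby
# Same pattern as above, stored in a hypothetical constant.
LINK_RE = /<a\s+?[^>]*?href=(['"])  # <a up to and including href=' or href="
           (?!mailto:)(.*?)          # contents of any non-mailto: href attribute
           \1.*?>                    # end of href attribute, then rest of opening tag
           (.*?)                     # contents of <a>, the link display
           <\\?\/a>/mix              # closing </a> or <\/a>

# Made-up sample: the first anchor is never closed.
html = %(<a href="outer"><div>before<a href="inner">text</a>after</div>)

m = LINK_RE.match(html)
puts m[2]  # => outer
puts m[3]  # => <div>before<a href="inner">text
```

The match starts at the unclosed outer anchor, and the lazy (.*?) simply runs forward until it reaches the inner anchor's closing tag.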
So I need to modify my regex so that it does not match when another
opening <a> tag appears before the closing </a>. I tried for about 30
minutes yesterday using a negative lookahead, (?!...), but couldn't
quite get it.
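One possible placement for such an assertion is to check it before every character of the link display, so that the display can never step over another opening <a. The sketch below is an untested-on-real-data variant, not a confirmed fix. It also assumes the href value and the rest of the opening tag contain no > character, since otherwise those parts could still stretch across the nested tag. The constant name NESTED_SAFE_RE and the sample HTML are made up for illustration.

```ruby
# Hypothetical variant of the pattern from this post.
NESTED_SAFE_RE = /<a\s+?[^>]*?href=(['"])  # <a up to and including the opening quote
                  (?!mailto:)([^>]*?)       # href contents, assumed to contain no >
                  \1[^>]*>                  # closing quote, then the rest of the opening tag
                  ((?:(?!<a\b).)*?)         # link display, never stepping over another <a
                  <\\?\/a>/mix              # closing </a> or <\/a>

# Made-up sample: the first anchor is never closed.
html = %(<a href="outer"><div>before<a href="inner">text</a>after</div>)

m = NESTED_SAFE_RE.match(html)
puts m[2]  # => inner
puts m[3]  # => text
```

With these restrictions the attempt that starts at the unclosed outer anchor fails outright, so the engine moves on and matches the inner anchor instead. The cost is that an anchor whose attributes legitimately contain > would no longer match.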
Thanks,
Wes
--
Posted via http://www.ruby-forum.com/.