"Scott Rubin" <slr2777 / cs.rit.edu> schrieb im Newsbeitrag
news:41c04ca2$1 / buckaroo.cs.rit.edu...
> Hello,
>
> I'm writing an application that parses log files, specifically gaim html log
> files, extracts any links it finds and creates an RSS feed of those links.  I
> have a working program that's about 60 lines of ruby, but it is far from
> perfect.  Most of the necessary fixes and improvements are things I know how to
> do, but just take time. But there are a couple things I need help with.
>
> First, in ruby, how do I extract parts of a regex?  Let's use the example from
> my program.  Normally I could use an expression like the following
>
>   href\s*=\s*?:(\"?<url>[^\"]*)\")
>
> And this would allow me to get the <url> out of the expression.  But this
> doesn't seem to work in ruby, or at least I don't know how to make it work in
> ruby.  What I would really like to do is match the entire <a href tag structure.
>   I would want to extract: the protocol (ftp,http) the url (www.website.com),
> and the text which appears between the <a> and the </a> into three string
> variables.  And I have to extract this entire structure from any random line of
> text in which the structure either exists or does not.  I'm guaranteed that it
> won't be partial, i.e. an <a> without a </a>.

You need grouping.  As a first shot:

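if %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i =~ text
  # $1 = protocol, $2 = host and path, $3 = text between <a> and </a>
  proto, url, link_text = $1, $2, $3   # link_text, so the input in text is not overwritten
end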
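
If a line can hold more than one link, String#scan with the same pattern returns one array of captures per match, and an empty array when nothing matches. A rough sketch only: the names LINK_RE and extract_links and the sample line are made up for illustration, and the pattern still assumes double-quoted href values with href as the first attribute.

```ruby
# Same pattern as above, kept in a constant (hypothetical name).
LINK_RE = %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i

# Hypothetical helper: returns [protocol, url, link_text] triples for every
# complete <a href="...">...</a> in the line, or [] if there is none.
def extract_links(line)
  line.scan(LINK_RE)
end

# Made-up sample input.
line = 'see <a href="http://www.website.com/page">my site</a> and ' \
       '<a href="ftp://ftp.example.org/pub">files</a>'

extract_links(line).each do |proto, url, link_text|
  puts "#{proto} | #{url} | #{link_text}"
end
```

Because scan returns an empty array for lines without links, the caller needs no separate "did it match" test.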

> The other thing I don't know how to do is replace things like &amp; with &.  Is
> there anything in the ruby standard library, maybe in rexml, that automatically
> takes care of all those standard entities for me?  I looked, but I couldn't find
> one.

Dunno.  But you can easily create that on your own:

ENT = {
  "amp" => "&",
  "gt" => ">",
  # ...
}

text.gsub!(%r{&(\w+);}i) {|m| ENT[$1] || m}
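
For the most common entities, the standard library's CGI.unescapeHTML may already be enough. As far as I know it only covers &amp;, &lt;, &gt;, &quot; and numeric references like &#38;, and leaves other named entities (for example &nbsp;) untouched, so a table like the one above is still needed for those. A small sketch with a made-up input string:

```ruby
require 'cgi'

# Made-up sample input; raw and decoded are illustrative names.
raw = 'Tom &amp; Jerry &lt;tj@example.com&gt; said &quot;hi&quot;'

# Decodes the handful of entities CGI knows about.
decoded = CGI.unescapeHTML(raw)
puts decoded
```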

Kind regards

    robert