"Scott Rubin" <slr2777 / cs.rit.edu> schrieb im Newsbeitrag news:41c04ca2$1 / buckaroo.cs.rit.edu... > Hello, > > I'm writing an application that parses log files, specifically gaim html log > files, extracts any links it finds and creates an RSS feed of those links. I > have a working program that's about 60 lines of ruby, but it is far from > perfect. Most of the necessary fixes and improvements are things I know how to > do, but just take time. But there are a couple things I need help with. > > First, in ruby, how do I extract parts of a regex? Let's use the example from > my program. Normally I could use an expression like the following > > href\s*=\s*?:(\"?<url>[^\"]*)\") > > And this would allow me to get the <url> out of the expression. But this > doesn't seem to work in ruby, or at least I don't know how to make it work in > ruby. What I would really like to do is match the entire <a href tag structure. > I would want to extract: the protocol (ftp,http) the url (www.website.com), > and the text which appears between the <a> and the </a> into three string > variables. And I have to extract this entire structure from any random line of > text in which the structure either exists or does not. I'm guaranteed that it > wont be partial i.e: an <a> without a </a>. You need grouping. As a first shot: if %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i =~ text proto, url, text = $1, $2, $3 end > The other thing I don't know how to do is replace things like & with &. Is > there anything in the ruby standard library, maybe in rexml, that automatically > takes care of all those standard entities for me? I looked, but I couldn't find > one. Dunno. But you can easily create that on your own: ENT = { "amp" => "&", "gt" => ">", # ... } text.gsub!(%r{&(\w+);}i) {|m| ENT[$1] || m} Kind regards robert