On May 25, 2004, at 1:43 PM, Nicolas Cavigneaux wrote:

> Hello,
>
> Some time ago I wrote some Ruby code that lets me follow web links
> and easily retrieve interesting files. This little program works
> well. To extract the links from a downloaded web page I use
> URI.extract, and I've noticed that URI.extract misses a lot of links.
> In fact URI.extract doesn't seem to understand (resolve?) relative
> links. Am I wrong? If not, what would you advise to be sure to
> retrieve all the relative links?

IIRC, URI.extract(str) just scans plain text for URIs, so for a link
to be found it would have to be absolute, not relative, i.e.
"http://google.com/help", not just "/help".

To get all the links out of HTML, you would probably need to create a
regular expression that finds all link-ish HTML attributes (<a
href="">, <link rel="">, <img src="">, etc.), parse them to see what
type of link they are, then construct a full URI based on the page's
original location.

A quick, incomplete, untested example:

# open-uri is nice; uri gives us URI.parse and URI.join
require 'open-uri'
require 'uri'


def get_URI_list(uri)

   # download the page at the uri passed
   page_data = open(uri) { |f| f.read }

   # scan it for the contents of html
   # attributes that are usually links
   # (scan returns each capture wrapped in an array, hence the flatten)
   uris = page_data.scan(/(?:href|src|rel)="([^"]*)"/).flatten

   # convert relative links to absolute links
   uris.map do |item|
     case item
     when /^\// # it's relative to site root
       "http://" + URI.parse(uri).host + item
     when /^http:/ # it's absolute
       item
     else # it's relative to the current page
       # merge the two uris; URI.join does the job
       URI.join(uri, item).to_s
     end
   end
end
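
You'd call it along these lines (untested; the URL is just an example):

   puts get_URI_list("http://www.ruby-lang.org/en/")

Note that the rel capture will also pull in values like "stylesheet"
that aren't URIs at all, so the "parse them to see what type of link
they are" step above is still needed before trusting the output.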

HTH,
Mark