I am trying to screen-scrape a web page and pull out the name, address,
city, state, zip, and phone number from a site that lists apartments for rent.

Here is my code:
------------------------
require 'net/http'
require 'uri'

temparray = []

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) do |http|
  http.get('/connecticut/0')
end
# puts res.body

# Keep only the lines containing the table cells I care about,
# with the double quotes stripped out.
res.body.each_line do |line|
  line.gsub!(/"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
end

# Try to remove the markup I don't want with more gsub! calls.
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/, '')
  j.gsub!(/\shref=\/map\//, '')
  j.gsub!(/\d+\sclass=map>Map&nbsp;It!/, '')
  j.gsub!(/<\/td>/, '')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/, '')

  puts j
end
----------------------
I am able to grab the HTML from the page. I then gsub! out the "
characters and push each line that starts with <td valign=top onto an
array. Next I iterate through the array and try to remove what I don't
want with more gsub! calls. The output still has HTML tags in it; it
looks fine when I write it out as an HTML page (you can see the output
here: http://www.holy-name.org/ct.html), but I really need to strip the
HTML tags and get just the important facts into a CSV file. Since there
are 4 elements in the array for each record, the only way I could get it
to display sensibly on a web page was to add a <br> between records.
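
For the CSV step, this is roughly what I had in mind (just an untested
sketch: the filename is arbitrary, and it assumes the cleaned-up array
really does end up holding exactly 4 cells per apartment record):

   require 'csv'
   require 'enumerator'   # for each_slice on older Rubies

   CSV.open("ct_apartments.csv", "w") do |csv|
     # Group the cleaned-up lines into one record (row) per apartment.
     temparray.each_slice(4) do |record|
       # Strip any leftover tags and whitespace before writing the row.
       csv << record.map { |cell| cell.gsub(/<[^>]+>/, '').strip }
     end
   end

The each_slice(4) grouping would take the place of the <br> trick, since
the records would get separated in the array instead of in the HTML output.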

Is there a better way to pull out the pertinent info and avoid all the
HTML tags?

thanks

atomic

-- 
Posted via http://www.ruby-forum.com/.