Nokogiri provides a great interface for accessing the data trapped
inside markup.

Try something like:

require 'nokogiri'

page = Nokogiri::HTML(res.body)
data = []
page.xpath("//xpath/to/table").each do |node|
  data << node.xpath("./rel/xpath/to/data/text()")
end



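And for the CSV half of your question: once you have a flat array of cleaned strings, four per listing as in your loop, the stdlib CSV library plus each_slice will group them into rows. The header names and sample data here are made up for illustration:

```ruby
require 'csv'

# Flat list of cleaned fields, four per record (sample data, not real output)
fields = ["Maple Apartments", "12 Main St", "Hartford, CT 06103", "(860) 555-0100",
          "Oak Terrace", "34 Elm St", "New Haven, CT 06510", "(203) 555-0199"]

CSV.open("apartments.csv", "w") do |csv|
  csv << %w[name address city_state_zip phone]     # header row
  fields.each_slice(4) { |record| csv << record }  # one CSV row per listing
end
```

CSV handles the quoting for you, so fields containing commas (like "Hartford, CT 06103") come back intact when the file is re-read.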

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl / hi5.com

On Thu, 2010-12-09 at 14:43 -0600, Atomic Bomb wrote:

> I am trying to screen scrape a webpage and pull out the name, address,
> city, state, zip and phone on a site that lists apartments for rent.
> 
> Here is my code:
> ------------------------
>    temparray = Array.new
> 
>    url = URI.parse("http://www.apartment-directory.info")
>    res = Net::HTTP.start(url.host, url.port) {|http|
>      http.get('/connecticut/0')
>    }
>    # puts res.body
> 
>    res.body.each_line {|line|
>      line.gsub!(/\"/, '')
>      temparray.push(line) if line =~ /<td\svalign=top/
>    }
> 
>    temparray.each do |j|
>      # j.gsub!(/<a\shref=map.*<\/a>/,'')
>      j.gsub!(/\shref=map\//,'')
>      j.gsub!(/\d+\sclass=map>Map\&nbsp\;It!/,'')
>      j.gsub!(/<\/td>/,'')
>      j.gsub!(/<td\svalign=top>/, '')
>      j.gsub!(/<td\svalign=top\snowrap>/, '')
>      j.gsub!(/<tr\sbgcolor=white>/, '<br>')
>      j.gsub!(/MapIt!/, ', ')
>      j.gsub!(/\(/, ', (')
>      j.gsub!(/<\/tr>/,'')
> 
>      puts j
>    end
> ----------------------
> I am able to grab the HTML from the page. I then gsub! out the " signs
> and push each line that starts with <td valign=top onto an array. I
> then iterate through the array and try to remove what I don't want with
> more gsub! calls. The output from this still has HTML tags in it, though
> it looks fine when rendered as an HTML page (you can see the output
> here: http://www.holy-name.org/ct.html). But I really need to strip the
> HTML tags and get just the important facts into a CSV file. Since there
> are 4 elements in the array for each record, the only way I could get it
> to work on a web page was to add a <br> between records.
> 
> Is there a better way to pull out the pertinent info and avoid all the
> HTML tags?
> 
> thanks
> 
> atomic
> 
