Nokogiri provides a great interface for accessing the data trapped
inside markup.
Try something like:

require 'nokogiri'

page = Nokogiri::HTML(res.body)
data = []
# Walk each matching table node and collect the text nodes beneath it.
page.xpath("//xpath/to/table").each do |node|
  data << node.xpath("./rel/xpath/to/data/text()")
end
________________________________________________________________________
Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl / hi5.com
On Thu, 2010-12-09 at 14:43 -0600, Atomic Bomb wrote:
> I am trying to screen scrape a webpage and pull out the name, address,
> city, state, zip and phone on a site that lists apartments for rent.
>
> Here is my code:
> ------------------------
> require 'net/http'
>
> temparray = Array.new
>
> url = URI.parse("http://www.apartment-directory.info")
> res = Net::HTTP.start(url.host, url.port) {|http|
>   http.get('/connecticut/0')
> }
> # puts res.body
>
> res.body.each_line {|line|
>   line.gsub!(/\"/, '')
>   temparray.push(line) if line =~ /<td\svalign=top/
> }
> temparray.each do |j|
>   # j.gsub!(/<a\shref=\/map.*<\/a>/,'')
>   j.gsub!(/\shref=\/map\//,'')
>   j.gsub!(/\d+\sclass=map>Map\&nbsp\;It!/,'')
>   j.gsub!(/<\/td>/,'')
>   j.gsub!(/<td\svalign=top>/, '')
>   j.gsub!(/<td\svalign=top\snowrap>/, '')
>   j.gsub!(/<tr\sbgcolor=white>/, '<br>')
>   j.gsub!(/MapIt!/, ', ')
>   j.gsub!(/\(/, ', (')
>   j.gsub!(/<\/tr>/,'')
>
>   puts j
> end
> ----------------------
> I am able to grab the HTML from the page. I then gsub! out the " signs
> and push each line that starts with <td valign=top onto an array. I
> then iterate through the array and try to remove what I don't want with
> more gsub! commands. The output from this still has HTML tags in it and
> looks good if I output it to an HTML page (you can see the output here:
> http://www.holy-name.org/ct.html), but I really need to remove the HTML
> tags and get just the important facts into a CSV file. Since there are 4
> elements in the array for each record, the only way I could get it to
> work on a web page was to add a <br> between records.
>
> Is there a better way to pull out the pertinent info and avoid all the
> HTML tags?
>
> thanks
>
> atomic
>