Vikash Kumar wrote:

/ ...

> What will be the right solution if someone wants to get the data from the
> Yahoo site http://finance.yahoo.com/q?s=IBM and then display only some
> values, such as Prev Close and Last Trade? Let's suppose we go to the URL
> through:
> 
> require 'watir'
> include Watir
> require 'hpricot'
> include Hpricot
> ie=Watir::IE.new
> ie.goto("http://finance.yahoo.com/q?s=IBM")
> 
> Now, what's next?

What? What's next? You have already assumed that the Watir and Hpricot
libraries are the optimal solution for this problem. Not necessarily. There
are many circumstances where a simple Ruby solution is better. And the more
you need to know about the process of page scraping, the more likely it is
that you will want to understand and tune the details.

> Also, let's suppose we want to get all the values of a
> table, but we don't know the table structure. Then what should be the
> correct solution?

How about this approach:

--------------------------------------------------

#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp = http.get('/q?s=IBM')
page = resp.body

# BEGIN processing HTML

# return the inner content of every <tag ...> ... </tag> pair in the data
def parse_html(data,tag)
   return data.scan(%r{<#{tag}(?:\s[^>]*)?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page,"table")
table_data.each do |table|
   out_row = []
   row_data = parse_html(table,"tr")
   row_data.each do |row|
      cell_data = parse_html(row,"td")
      cell_data.each do |cell|
         cell.gsub!(%r{<.*?>},"")
      end
      out_row << cell_data
   end
   output << out_row
end

# END processing HTML

# examine the result

def parse_nested_array(array,tab = 0)
   n = 0
   array.each do |item|
      if(item.size > 0)
         puts "#{"\t" * tab}[#{n}] {"
         if(item.class == Array)
            parse_nested_array(item,tab+1)
         else
            puts "#{"\t" * (tab+1)}#{item}"
         end
         puts "#{"\t" * tab}}"
      end
      n += 1
   end
end

parse_nested_array(output)


--------------------------------------------------

Notice that roughly half of this program parses the Web page and builds a
nested array (tables, rows, cells), while the remainder only displays that
array. The entire task of scraping the page is carried out in the middle of
the program.

If you examine the array display created in the latter part of the program,
you will see that all the data are placed in an array that can be indexed
by table, row and cell. Simply select which array elements you want.

I want to emphasize something. The twenty or so lines, including blank lines
and comments, between "# BEGIN processing HTML" and "# END processing HTML"
are all that is required to scrape the page. After that, you simply choose
which table cells you want by indexing the array, as in the sketch below.

This way of scraping pages is better if you have to post-process the
extracted data, if you need a lightweight solution for environments with
limited resources, if you want detailed control over the scraping process,
if you would rather not learn a large, powerful library that can do
absolutely anything, or if you want to learn how to write Ruby programs.

And this way of scraping pages is not for everyone.

Also, I must add, if the Web page contains certain kinds of HTML syntax
errors, in particular any unpaired <table>, <tr> or <td> tags, my program
will break, and Hpricot probably won't. If, on the other hand, the page is
syntactically correct, this program is perfectly adequate to extract the
data.
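
For completeness, and since the original question mentioned Hpricot, here
is roughly what the same extraction looks like with that library. This is
only a sketch written from memory of the Hpricot API, so check it against
the documentation before relying on it:

require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://finance.yahoo.com/q?s=IBM"))

# build the same table/row/cell nesting as the regex version above
output = (doc/"table").map do |table|
   (table/"tr").map do |row|
      (row/"td").map { |cell| cell.inner_text.strip }
   end
end

Hpricot builds a parse tree and tolerates a certain amount of malformed
HTML, which is exactly the trade-off discussed above: more machinery, less
sensitivity to broken markup.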

Obligatory editorial comment: Yahoo exists because it can expose you to
advertising. That is the foundation of their business model. When you
scrape pages, you avoid having to look at their advertising. If everyone
did this, for better or worse Yahoo would go out of business (or change
their business model).

Those are the facts. If this page-scraping business becomes commonplace,
eventually Yahoo and other similar Web sites will choose a different
strategy; for example, they might sell subscriptions. Or they might try to
do more than they are already doing to discourage scraping. This activity
might end up being a contest between the scrapers and the scrapees, with
the scrapees making their pages more and more complex.

I think eventually these content providers might put up their content as
graphics rather than text, as the spammers are now doing. Then the scrapers
would have to invest in OCR to get the content.

This scraping activity isn't illegal, unless of course you exploit or
re-post the scraped content.

End of editorial.

-- 
Paul Lutus
http://www.arachnoid.com