doog wrote:

> Thanks so much.  Parsing a web page is sufficient, and would
> be a great starting point.

Okay, here is a simple parser in ordinary Ruby. It should give you some idea
of what is involved in parsing.

There are many libraries that do much more than this script does. Some of
them have steep learning curves, and many offer exotic ways to acquire
particular kinds of content.
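
For comparison, here is a minimal sketch of the same table extraction using
one such library, the Nokogiri gem. It is not part of the script below, and
it assumes the gem is installed (gem install nokogiri):

------------------------------------------------

#!/usr/bin/ruby -w

require 'net/http'
require 'nokogiri' # third-party gem, assumed installed

# read the page data (same target as the script below)
http = Net::HTTP.new('finance.yahoo.com', 80)
page = http.get('/q?s=IBM').body

# build a parse tree, then walk table -> tr -> td
doc = Nokogiri::HTML(page)
tables = doc.css('table').map do |table|
   table.css('tr').map do |row|
      row.css('td').map { |cell| cell.text.strip }
   end
end

p tables

------------------------------------------------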

This is a simple parser that returns an array containing all the table
content in the target Web page. I wrote it earlier today for someone who
wanted to scrape a yahoo.com financial page, which explains the choice of
target page; it is easy to change:

------------------------------------------------

#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp = http.get('/q?s=IBM')
page = resp.body

# BEGIN processing HTML

# return the inner content of every <tag>...</tag> block in the data
# (case-insensitive, matches across newlines)
def parse_html(data,tag)
   return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

# collect the page content as a nested array: tables -> rows -> cells
out_tables = []
table_data = parse_html(page,"table")
table_data.each do |table|
   out_rows = []
   row_data = parse_html(table,"tr")
   row_data.each do |row|
      out_cells = parse_html(row,"td")
      out_cells.each do |cell|
         # strip any markup left inside the cell
         cell.gsub!(%r{<.*?>},"")
      end
      out_rows << out_cells
   end
   out_tables << out_rows
end

# END processing HTML

# examine the result

# recursively print a nested array as an indexed, indented listing
def parse_nested_array(array,tab = 0)
   n = 0
   array.each do |item|
      # skip empty rows and cells
      if(item.size > 0)
         puts "#{"\t" * tab}[#{n}] {"
         if(item.class == Array)
            parse_nested_array(item,tab+1)
         else
            puts "#{"\t" * (tab+1)}#{item}"
         end
         puts "#{"\t" * tab}}"
      end
      n += 1
   end
end

parse_nested_array(out_tables)

------------------------------------------------

This program emits an indexed, indented listing of the table content it
extracted, so you can customize it to pick out particular table cells by
using the index numbers shown in the listing.
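
For example, once the listing tells you where the cell you want lives, you
can read it straight out of the nested array. The index numbers below are
hypothetical; substitute the ones the listing actually shows:

------------------------------------------------

# hypothetical example: print table 2, row 0, cell 1 from the
# nested array built by the script above
puts out_tables[2][0][1]

------------------------------------------------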

It should work with any Web page that has the interesting content embedded
in tables and whose markup is reasonably well-formed.
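
To point it at another page, change the host and path passed to Net::HTTP
near the top of the script. If the page is only served over HTTPS, recent
versions of Ruby will handle the SSL connection for you if you fetch by URI
instead; the URL below is only a placeholder:

------------------------------------------------

require 'net/http'
require 'uri'

# placeholder URL -- substitute the page you actually want to scrape
uri = URI('https://www.example.com/some/page')

# Net::HTTP.get sets up SSL automatically for an https URI
page = Net::HTTP.get(uri)

------------------------------------------------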

The primary value of this program is to show you how easy it is to scrape
pages using Ruby, and to give you a starting point you can customize to meet
your own requirements.

-- 
Paul Lutus
http://www.arachnoid.com