Thanks to this group for helping me get various screen scrapers
up and running.  I had made a couple of silly typos that held me
back.  It will take me a weekend or so of spare time to digest what
I have and actually write the code I want.

Thanks Peter, Paul and Alvim :)

Thanks Alvim for the pointer to hpricot -- I've got the demo script
working and will study it.

Thanks Peter for your article -- I had read it before, but re-reading
it at this point helps quite a bit.

Thanks Paul for the code below.  I know regex, so it will just be
a matter of me learning the flow/expression syntax.
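
For reference, the core Ruby idiom in Paul's script is String#scan with a
non-greedy capture group. A minimal sketch on a made-up fragment (the HTML
string here is just an example, not the Yahoo page):

```ruby
# String#scan collects every match of the pattern; the non-greedy (.*?)
# group keeps each match from swallowing the cells that follow it.
html = "<td>IBM</td><td>120.50</td>"
cells = html.scan(%r{<td>(.*?)</td>}im).flatten
puts cells.inspect   # => ["IBM", "120.50"]
```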

-Doug

Paul Lutus wrote:
> doog wrote:
> 
>> Thanks so much.  Parsing a web page is sufficient, and would
>> be a great starting point.
> 
> Okay, here is a simple parser in ordinary Ruby; it will give you some ideas
> about what is involved in parsing.
> 
> There are many libraries that do much more than this script does; some of
> them have steep learning curves, and many offer exotic ways to acquire
> particular kinds of content.
> 
> This is a simple parser that returns an array containing all the table
> content in the target Web page. I wrote it earlier today for someone who
> wanted to scrape a yahoo.com financial page, which explains the target
> page; that's easy to change:
> 
> ------------------------------------------------
> 
> #!/usr/bin/ruby -w
> 
> require 'net/http'
> 
> # read the page data
> 
> http = Net::HTTP.new('finance.yahoo.com', 80)
> resp = http.get('/q?s=IBM')
> page = resp.body
> 
> # BEGIN processing HTML
> 
> def parse_html(data,tag)
>    return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
> end
> 
> out_tables = []
> table_data = parse_html(page,"table")
> table_data.each do |table|
>    out_rows = []
>    row_data = parse_html(table,"tr")
>    row_data.each do |row|
>       out_cells = parse_html(row,"td")
>       out_cells.each do |cell|
>          cell.gsub!(%r{<.*?>},"")
>       end
>       out_rows << out_cells
>    end
>    out_tables << out_rows
> end
> 
> # END processing HTML
> 
> # examine the result
> 
> def parse_nested_array(array,tab = 0)
>    n = 0
>    array.each do |item|
>       if(item.size > 0)
>          puts "#{"\t" * tab}[#{n}] {"
>          if(item.class == Array)
>             parse_nested_array(item,tab+1)
>          else
>             puts "#{"\t" * (tab+1)}#{item}"
>          end
>          puts "#{"\t" * tab}}"
>       end
>       n += 1
>    end
> end
> 
> parse_nested_array(out_tables)
> 
> ------------------------------------------------
> 
> This program emits an indexed, indented listing of the table content that it
> extracted, so you can then customize it by acquiring particular table cells
> through use of the provided index numbers.
> 
> It should work with any Web page that has the interesting content embedded
> in tables, and whose syntax is reliable.
> 
> The primary value of this program is to show you how easy it is to scrape
> pages using Ruby, and give you a starting point you can customize to meet
> your own requirements.
>
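
To see how the index numbers in that listing map back to particular cells,
here is a minimal self-contained sketch using the same parse_html helper on
a made-up HTML fragment (not the Yahoo page):

```ruby
# Same regex-based extractor as in Paul's script above.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

page   = "<table><tr><td>IBM</td><td>120.50</td></tr></table>"
tables = parse_html(page, "table")     # => ["<tr><td>IBM</td><td>120.50</td></tr>"]
rows   = parse_html(tables[0], "tr")   # => ["<td>IBM</td><td>120.50</td>"]
cells  = parse_html(rows[0], "td")     # => ["IBM", "120.50"]
puts cells[1]                          # table 0, row 0, cell 1 => 120.50
```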