> So. Discuss ;P (Thanks in advance for any advice.)

Well, there is a new kid on the block: scRUBYt! (DISCLAIMER: I am the
author, so I may be biased a bit ;-), a web scraping framework based on
Mechanize and Hpricot. I am planning to add WATIR alongside Mechanize
(or replace Mechanize with it? not sure yet), so that it can handle
JavaScript, too. If you can do without JavaScript for the moment, I
think scRUBYt! is an interesting choice, because:

1) Mechanize and Hpricot are super great in themselves - now sum the
power of the two and multiply it by n because of the added
functionality (you decide the value of n - for all the people I have
got feedback from so far, it was much greater than 1 :-) ...

2) scRUBYt! is easy to learn and use, quite powerful, has tons of docs
(check out http://scrubyt.org), is nicely documented
(http://scrubyt.rubyforge.org), unit tested, blackbox tested etc. The
API and the whole thing are designed to be extensible with your own
stuff - and I am usually available for support if this is still not
enough.

3) I am planning to invest a lot of time into scRUBYt! - I am just
releasing the next version as I write this mail, my TODO list has 200+
items and the community seems to be very active: I have already got
tons of bug reports, feature requests and even patches (and the whole
thing has been out for only about 2 weeks).

4) I am planning to launch a community site where (hopefully) the users
will upload, tag, rate etc. the extractors they create - so this can
also be an interesting thing if it works out.

A quick example:

=====================================================================
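amazon_stuff = Scrubyt::Extractor.define do

   # Navigation: load the page, fill in the search form and submit it
   fetch          'http://www.amazon.com'
   fill_textfield 'field-keywords', 'logitech keyboard'
   choose_option  'url', 'Computers & PC Hardware'
   submit

   # Extraction: one example per field, copied from the result page;
   # scRUBYt! generalizes from these to the other results
   stuff do
     item_name "Logitech diNovo Edge ( 967685-0403 )"
     price "$169.98"
   end
end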

amazon_stuff.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(amazon_stuff)
=====================================================================

output:

[MODE] learning
[ACTION] fetching document: http://www.amazon.com
[ACTION] typing logitech keyboard into the textfield named 'field-keywords'
[ACTION] selecting option Computers & PC Hardware from the option list 'url'
[ACTION] submitting form...
[ACTION] fetched http://www.amazon.com/s/ref=nb_ss_gw/002-0854452-3734424?field-keywords=logitech+keyboard&url=search-alias%3Daps

   <root>
     <stuff>
       <item_name>Logitech diNovo Edge ( 967685-0403 )</item_name>
       <price>$169.98</price>
     </stuff>
     <stuff>
       <item_name>Logitech G15 Gaming Keyboard</item_name>
       <price>$77.74</price>
     </stuff>
     <stuff>
       <item_name>Logitech Media Keyboard Elite- Black ( 967559-0403 )</item_name>
       <price>$27.43</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop S510</item_name>
       <price>$52.79</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop MX 3000 Laser (967553-0403)</item_name>
       <price>$60.93</price>
     </stuff>
     <stuff>
       <item_name>Logitech Classic Keyboard</item_name>
       <price>$11.99</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop LX 300</item_name>
       <price>$38.74</price>
     </stuff>
     <stuff>
       <item_name>Logitech diNovo Cordless Desktop</item_name>
       <price>$104.99</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop MX 5000 Laser (967558-0403)</item_name>
       <price>$116.99</price>
     </stuff>
     <stuff>
       <item_name>Logitech Media Keyboard</item_name>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop MX3200 Laser</item_name>
       <price>$76.98</price>
     </stuff>
     <stuff>
       <item_name>Logitech G11 Gaming Keyboard</item_name>
       <price>$61.73</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop S 530 Laser for Mac ( 967664-0403 )</item_name>
       <price>$67.94</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop Comfort Laser</item_name>
       <price>$77.81</price>
     </stuff>
     <stuff>
       <item_name>Logitech Cordless Desktop EX110 ( 967561-0403 )</item_name>
       <price>Used & new
  from $24.97</price>
     </stuff>
     <stuff>
       <item_name>Sony Playstation 2 USB Keyboard</item_name>
     </stuff>
   </root>

     stuff extracted 16 instances.
         item_name extracted 16 instances.
         price extracted 14 instances.

I think you get the idea... scRUBYt! hides all the ugly stuff (HTML,
XPaths, form names, whatnot) and figures out everything based on your
examples.
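
To give a rough feel for what "figuring it out from examples" can mean,
here is a toy sketch in plain Ruby. It is NOT scRUBYt!'s actual code or
algorithm, just an illustration of the general idea: find the node that
holds the example text, remember the chain of tags leading to it, then
collect every node reachable through the same chain. The Node struct,
the helpers node, path_to_example and texts_at, and the tiny page tree
are all made up for this illustration.

```ruby
# Toy document tree: every node has a tag, optional text and children.
# (Hypothetical stand-in for a parsed HTML page.)
Node = Struct.new(:tag, :text, :children)

def node(tag, text = nil, children = [])
  Node.new(tag, text, children)
end

# Depth-first search: return the list of tags leading from the root
# to the first node whose text equals the example, or nil if none does.
def path_to_example(current, example, path = [])
  here = path + [current.tag]
  return here if current.text == example
  current.children.each do |child|
    found = path_to_example(child, example, here)
    return found if found
  end
  nil
end

# Generalize: collect the text of every node reachable through the
# same sequence of tags as the example.
def texts_at(current, path)
  return [] unless current.tag == path.first
  return [current.text].compact if path.length == 1
  current.children.flat_map { |child| texts_at(child, path.drop(1)) }
end

page = node('html', nil, [
  node('body', nil, [
    node('div', nil, [node('span', 'Logitech diNovo Edge'),
                      node('b', '$169.98')]),
    node('div', nil, [node('span', 'Logitech G15 Gaming Keyboard'),
                      node('b', '$77.74')]),
    # A result without a price, like some entries in the output above
    node('div', nil, [node('span', 'Logitech Media Keyboard')])
  ])
])

# One example per field is enough to pick up the other results.
name_path  = path_to_example(page, 'Logitech diNovo Edge')
price_path = path_to_example(page, '$169.98')

names  = texts_at(page, name_path)
prices = texts_at(page, price_path)

puts names
puts prices
```

Note that the third result yields a name but no price, which is the same
reason the statistics above report fewer price instances than item_name
instances.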

BTW, don't try to run this example with 0.2.0 (the currently released
version); it needs 0.2.3, which I am going to release in a few hours.

scRUBYt! has many more features than this example suggests - if you are
interested, check out http://scrubyt.org.

Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.