> Having since dared the HTML, it seems at least for some of the apps, not 
> really.
Well, it is more about the navigation part: if you would like to scrape
a page where you have to log in first, and the login uses JS (like Google
Analytics, for example), you cannot do it with Mechanize. (O.K., you can
work around the JS and log in through a plain old HTML page in the case
of Google pages, but let's suppose there is no ye good olde HTML login
possibility.)


> I never actually used either, I can only vaguely guess at the scope - 
> Hpricot doing the low-level parsing and cleanup, Mechanize the 
> higher-level data extraction from the result of that.
Not exactly. Mechanize is used to do the navigation (log in, click this,
fill that, don't touch those, submit the form etc.), so it gets you to
the page where you would like to actually do the scraping (in scRUBYt!,
those are the fetch, fill_textfield etc. commands).

Once you arrive at the page of interest, you can forget about
Mechanize: Hpricot takes over from this point. scRUBYt! figures out what
you are up to, turns it into XPath, regexps and that sort of stuff, then
hands it over to Hpricot to evaluate.

>> amazon_stuff = Scrubyt::Extractor.define do
>>
>>    fetch          'http://www.amazon.com'
>>    fill_textfield 'field-keywords', 'logitech keyboard'
>>    choose_option  'url', 'Computers & PC Hardware'
>>    submit
> 
> I like this API.
Great!

> Where'd the stuff variable come from?
From the scraper's creator. I could have written funky_ooze, and the
only difference would be that in the XML output you would see
<funky_ooze> tags instead of <stuff> tags.

So these names can be arbitrary; they are just used to hold your
results. The structure is more important: the fact that the other two
things (actually called patterns in scRUBYt! terminology), 'item_name'
and 'price', are passed to it as a block says that they are logically
stuff's children. This means that the input of item_name and price is
the output of stuff.
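
To make the parent/child idea a bit more concrete, here is a toy sketch
in plain Ruby. It is NOT how scRUBYt! is implemented: the ToyPattern
class, its child and to_xml methods, and the pipe-separated sample
"page" are all made up for illustration. It only shows two things: the
pattern name ends up as the tag name, and each child works on whatever
its parent extracted.

```ruby
# Toy illustration only; ToyPattern is hypothetical, not part of scRUBYt!.
class ToyPattern
  # name:      arbitrary label, used as the XML tag name
  # extractor: lambda that takes the pattern's input and returns one
  #            match or an array of matches
  def initialize(name, extractor)
    @name = name
    @extractor = extractor
    @children = []
  end

  # Register a child pattern; its input will be this pattern's output.
  def child(name, extractor)
    @children << ToyPattern.new(name, extractor)
    self
  end

  # Returns an array of XML strings, one per match of this pattern.
  # (No XML escaping is done, to keep the sketch short.)
  def to_xml(input)
    Array(@extractor.call(input)).map do |match|
      body = if @children.empty?
               match
             else
               @children.map { |c| c.to_xml(match).join }.join
             end
      "<#{@name}>#{body}</#{@name}>"
    end
  end
end

# Made-up stand-in for a result page: one product per line.
page = "Logitech Keyboard | $29.99\nLogitech Mouse | $19.99"

stuff = ToyPattern.new('stuff', ->(text) { text.split("\n") })
stuff.child('item_name', ->(row) { row.split('|').first.strip })
stuff.child('price',     ->(row) { row.split('|').last.strip })

puts stuff.to_xml(page)
```

Change 'stuff' to 'funky_ooze' in the sketch and only the outer tag name
changes; the children still get the same rows as their input.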

> Right, I suppose it goes on the List of Things To Try on the saner of 
> the webapps. And after that Excel automation for the paperwork done 
> -that- way (unsurprisingly the most laborious of them all.)
Great! Feedback is highly appreciated, so let me know how it goes or if
you get stuck with something.

Cheers,
Peter

__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.