Michael Neumann wrote:
> James Britt wrote:
>> I had to hack Mechanize to have it grab 'p' elements, but it is dead 
>> easy to do.
> 
> 
> What exactly did you have to hack? If it's worthwhile, I'll add it to the lib.


At first, I just used the built-in 'links' property to get the search
result links.  That sort of worked; I could get an array of URLs, but
they had no descriptive context.  Looking at the HTML coming back from
Google, I saw that I really needed the 'p' elements, which hold both the
search result URL and its description.

As best I could tell, the Page object has only a few built-in arrays 
(links, forms, maybe another, I don't recall) that get populated when 
calling parse_html.  Adding another array, and telling parse_html to 
populate this array, was super easy.

In retrospect I think I could have run an XPath query over the tree of
nodes held by the Page object, but I just took what seemed to be the
easiest route at the time.  (Besides, XPath over the full node set is
likely to be slower than simply collecting the nodes of interest during
the single pass over the document that parse_html already makes.)
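
For comparison, here is a rough sketch of what the XPath route might look like. It assumes the parsed tree behaves like a REXML document; the sample markup and variable names are made up for illustration, and none of this is Mechanize's own API.

```ruby
require 'rexml/document'

# Made-up, well-formed markup standing in for a search result page.
html = <<~HTML
  <html><body>
    <p><a href="http://example.com/">Example</a> A description.</p>
    <p>No link here.</p>
  </body></html>
HTML

doc = REXML::Document.new(html)

# One XPath query over the whole tree collects every 'p' element.
paragraphs = REXML::XPath.match(doc, '//p')

# Keep only the paragraphs that directly contain a link.
linked = paragraphs.select { |para| para.elements['a'] }

linked.each do |para|
  puts para.elements['a'].attributes['href']
end
```

This walks the tree a second time after parsing, which is the cost mentioned above.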

Where parse_html has:

       when 'a'
         @links << Link.new(node)

I added in

       when 'p'
         @paragraphs << Para.new(node)

The Para class is nothing more than a wrapper for a generic node.

I then ask for page.paragraphs and grab the ones I want.
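
To give an idea of how thin that wrapper can be, here is a minimal sketch. FakeNode is a hypothetical stand-in for whatever node type parse_html hands over, and the only thing assumed about a node is that it responds to #text.

```ruby
# Hypothetical stand-in for the node type that parse_html passes along.
FakeNode = Struct.new(:name, :attributes, :text)

# Thin wrapper: keeps the node and exposes the bits the caller cares about.
class Para
  attr_reader :node

  def initialize(node)
    @node = node
  end

  # Text content of the paragraph with surrounding whitespace removed.
  def text
    @node.text.to_s.strip
  end
end

# Stand-in for page.paragraphs.
paragraphs = [
  Para.new(FakeNode.new('p', {}, '  First result description  ')),
  Para.new(FakeNode.new('p', {}, 'Second result'))
]

# "Grab the ones I want": here, those mentioning 'First'.
wanted = paragraphs.select { |para| para.text =~ /First/ }
puts wanted.map(&:text)
```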


BTW, while writing this post, I started thinking about my hackish 
implementation, and ended up replacing it with an arguably less hackish 
implementation, one that lets you do this:


   agent = WWW::Mechanize.new {|a|
      a.log = Logger.new(STDERR)
   }
   agent.watch_for_set = { 'style' => Style, 'p' => Para  }
   page = agent.get( url )
   page.body
   paragraphs = page.elements[ 'p' ]
   styles = page.elements[ 'style' ]

The calling code just has to define the classes passed in via the
'watch_for_set' hash.  Each of these classes then has to implement this
constructor:

   def initialize( node ) ; end

It's up to each class then to extract what data it wants from the node.

So one could write a Style class that grabs the text value of the node 
and makes each CSS selector available for inspection.
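
A Style class along those lines might look roughly like this. It is only a sketch: CssNode is a hypothetical stand-in for the real node type, the node is assumed to respond to #text, and the CSS handling is deliberately naive (no comments, no @media blocks).

```ruby
# Hypothetical stand-in for the node type; only #text is assumed.
CssNode = Struct.new(:name, :text)

class Style
  attr_reader :rules

  # node: anything responding to #text with the raw CSS of a 'style' element.
  def initialize(node)
    @rules = {}
    # Naive split into "selectors { declarations }" chunks.
    node.text.to_s.scan(/([^{}]+)\{([^{}]*)\}/) do |selectors, body|
      # A rule such as "h1, h2 { ... }" is recorded once per selector.
      selectors.split(',').each do |selector|
        @rules[selector.strip] = body.strip
      end
    end
  end

  # All selectors found in the style block.
  def selectors
    @rules.keys
  end

  # Declarations for one selector, or nil if it was not found.
  def [](selector)
    @rules[selector]
  end
end

css = "h1, h2 { color: red; }\np.note { margin: 0 }"
style = Style.new(CssNode.new('style', css))
puts style.selectors.inspect
puts style['p.note']
```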

(page.elements, of course, has an array only for each of the element
names passed in.)
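
For the curious, the collecting step could be sketched like this. collect_elements is a hypothetical helper rather than the actual patch, and it assumes a REXML-style tree; it walks the document once and wraps each element whose name appears in the hash.

```ruby
require 'rexml/document'

# Caller-defined wrapper classes; each one only needs initialize(node).
class Para
  attr_reader :node

  def initialize(node)
    @node = node
  end
end

class Style
  attr_reader :node

  def initialize(node)
    @node = node
  end
end

# Hypothetical helper, not part of Mechanize: a single pass over the tree
# that wraps every element whose name is a key of watch_for_set.
def collect_elements(root, watch_for_set)
  elements = {}
  # Only the watched names get an array, even if nothing matches.
  watch_for_set.each_key { |name| elements[name] = [] }

  root.each_recursive do |node|
    klass = watch_for_set[node.name]
    elements[node.name] << klass.new(node) if klass
  end
  elements
end

# Made-up, well-formed markup for illustration.
html = <<~HTML
  <html><head><style>p { margin: 0 }</style></head>
  <body><p>One</p><div><p>Two</p></div><a href="#">x</a></body></html>
HTML

doc = REXML::Document.new(html)
elements = collect_elements(doc, 'style' => Style, 'p' => Para)
puts elements['p'].size
puts elements.key?('a')
```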


James