pkellner wrote:
> I was really hoping for some code or pseudo code.  I'm new to ruby and
> have been thrashing over this for hours.  I promise to put some back
> later when I know more about this.  (and sadly, I'm not a regular
> expression wizard)

I use WWW::Mechanize to slurp down numerous CafePress shop pages and 
snarf out the img info, which I use to automagically create the product 
pages for rubystuff.com.

The code sample here is a much simplified version.

Mechanize lets you use custom classes to encapsulate node types, which 
in turn makes it simpler to manipulate assorted HTML elements.  I need 
to extract assorted data from image URLs, so I coded up some additional 
trickery not shown here.

Also note that some sites reject bots, spiders, etc. when the declared 
user-agent is not one they recognize as a regular browser.  Hence the 
random selection from the UA list here.
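
In case the random pick is unclear, here is a tiny stand-alone sketch of 
the idea (plain Ruby, no Mechanize needed).  ALIASES and pick_alias are 
names made up for this illustration; the strings are the same aliases 
used in the UA list.  rand( n ) returns an integer from 0 up to n - 1, 
so passing the array size covers every entry.

```ruby
# Stand-alone sketch: pick a random user-agent alias from a list.
# ALIASES and pick_alias are illustrative names, not part of Mechanize.
ALIASES = [
    'Windows IE 6',
    'Windows Mozilla',
    'Mac Safari',
    'Mac Mozilla',
    'Linux Mozilla',
    'Linux Konqueror' ]

def pick_alias( list )
  # rand( n ) gives 0 <= i < n, so every index is reachable
  list[ rand( list.size ) ]
end

puts pick_alias( ALIASES )
```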

#!/usr/local/bin/ruby

require 'logger'
require 'mechanize'

UA = [
    'Windows IE 6' ,
    'Windows Mozilla',
    'Mac Safari' ,
    'Mac Mozilla' ,
    'Linux Mozilla',
    'Linux Konqueror' ]

# Wrap certain nodes in an Img class to make
# node attribute access a bit easier to grok.
class Img
   attr_reader :alt, :src

   def initialize( node )
     @node = node
     @alt = ''
     @src = ''

     if @node.attributes[ 'alt' ]
       @alt = @node.attributes[ 'alt' ].to_s.strip
     end
     if @node.attributes[ 'src' ]
       @src = @node.attributes[ 'src' ].to_s.strip
     end
   end
end

# Now with Rails tote bags and thongs and stuff!
url = 'http://www.cafepress.com/rubyonrailsshop'

agent = WWW::Mechanize.new {|a| a.log = Logger.new( STDERR ) }
# rand( n ) returns 0..n-1, so pass the full size to reach the last entry
agent.user_agent_alias = UA[ rand( UA.size ) ]

# This tells Mechanize to watch for certain elements, and
# map matching nodes to the keyed class.  Here, when an img
# element is encountered, Mechanize will use the node to create
# an Img object and store it for us.
agent.watch_for_set = { 'img' => Img }

page = agent.get( url )

# Get the watch items we're interested in
images = page.watches[ 'img' ]

# What did we get?
images.each do |img|
   p img.src
end

#----------------
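
As for pulling data out of the image URLs: the actual code is not shown 
here, but a minimal sketch of the general idea, using only the standard 
uri library, might look like the following.  The image_info helper and 
the example URL are hypothetical; real CafePress image URLs may well be 
laid out differently, so treat the field names as assumptions.

```ruby
require 'uri'

# Hypothetical helper: split an image URL into a few handy pieces.
# The URL layout is assumed for illustration only.
def image_info( src )
  uri  = URI.parse( src )
  file = File.basename( uri.path )
  {
    :host => uri.host,
    :file => file,
    :name => File.basename( file, '.*' ),
    :ext  => File.extname( file ).sub( /\A\./, '' )
  }
end

# Made-up URL, not a real product image
info = image_info( 'http://example.com/images/12345_F_store.jpg' )
p info[ :name ]
p info[ :ext ]
```

Something like this could be called from inside Img#initialize once @src 
is set, so each wrapped node carries its parsed bits along with it.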

Hope this helps.

Get Mechanize from rubyforge.org, from the wee project page.

http://rubyforge.org/projects/wee/


James Britt

-- 

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com  - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com  - Playing with Better Toys