Go look at O'Reilly's "Spidering Hacks", which I would quote pieces of,
but somebody borrowed my copy.

Eric Hodel wrote:
> On 12 Jun 2005, at 18:31, Xeno Campanoli wrote:
>
> > I've tried both the net/http and the open-uri packages now, and the
> > latter is getting me a little closer, especially with the help of
> > timeouts and nearly universal exception rescues.  I now have it
> > narrowed down to 12 or so exception failures, which are probably
> > real problems, plus a bunch of 404s, and most or all of those sites
> > really are there.  I suspect there is an index.html that just won't
> > be seen by these methods, or perhaps a bunch of them are configured
> > to use somedarnthing.html instead of index.html, or perhaps the
> > server wants to see certain headers.  Anyway, I wonder whether
> > people typically just go out and write their own crawlers from
> > scratch at this point (that's what I did a few years back with
> > Perl, LWP being much more of a hindrance than a help), or if there
> > are add-ons or other things I'm just not seeing that would make
> > these efforts less ugly.  Is there some standard Ruby thing that:
> >
> >    1)   will deliver acceptable headers and such so I can retrieve
> > the content behind some of these 404 sites?
> >    2)   is there a cleaner or more standardized method that just
> > gives me results no matter what, without blowing up on certain HTTP
> > return codes, so that I can treat all failures uniformly rather
> > than constructing my own external rescue handlers?
> >    3)   should I just be building my own crawler at this point, and
> > ignoring the above packages as being meant for more casual users?
>
> I have found a good deal of success with http-access2, but I still
> had to do the things you don't want to do (external error handlers).
> I don't think there is a good generic way of handling that.
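
For what it's worth, here is a minimal sketch of that kind of external
handling with open-uri; the fetch() helper, the User-Agent string, and
the example URL are my own placeholders, not anything standard:

  require 'open-uri'
  require 'timeout'

  # Wrap every request so it returns [status, body] instead of raising,
  # letting 404s, timeouts, and DNS failures all be handled the same way.
  def fetch(url)
    open(url, "User-Agent" => "MyCrawler/0.1 (me@example.com)") do |f|
      return [f.status.first, f.read]     # normally ["200", <page body>]
    end
  rescue OpenURI::HTTPError => e
    [e.io.status.first, nil]              # e.g. "404", "403", "500"
  rescue Timeout::Error, SocketError, SystemCallError => e
    [e.class.name, nil]                   # timeouts, DNS errors, refused connections
  end

  status, body = fetch("http://www.example.com/")
  puts "#{status}: #{body ? body.length : 0} bytes"

Sending a real User-Agent (and, if needed, Accept or Referer headers)
through open-uri's options hash is also the usual way to coax content
out of servers that 404 anonymous-looking requests.
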
>
> >    4)   (off topic) does anyone have any etiquette recommendations
> > for what I am doing, so I don't needlessly irk any netadmins or
> > others?
>
> obey /robots.txt
>
> don't crawl too fast (faster than 2 requests/sec is probably too fast)
>
> don't re-crawl too often
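
And a rough sketch of the polite-crawling loop; the one-second pause,
the user-agent string, and the deliberately naive robots.txt handling
(only the "User-agent: *" section, Disallow lines only) are my own
assumptions, not anything the libraries give you:

  require 'open-uri'
  require 'uri'

  AGENT = "MyCrawler/0.1 (me@example.com)"

  # Return false if the site's robots.txt disallows this path for all agents.
  def allowed?(url)
    uri = URI.parse(url)
    robots = open("#{uri.scheme}://#{uri.host}/robots.txt",
                  "User-Agent" => AGENT) { |f| f.read }
    applies = false
    robots.each_line do |line|
      case line
      when /^User-agent:\s*\*/i then applies = true
      when /^User-agent:/i      then applies = false
      when /^Disallow:\s*(\S+)/i
        return false if applies and uri.path[0, $1.length] == $1
      end
    end
    true
  rescue StandardError
    true    # no robots.txt (or it failed to load): assume we may crawl
  end

  urls = ["http://www.example.com/", "http://www.example.com/private/page.html"]
  urls.each do |url|
    next unless allowed?(url)
    # ... fetch and process the page here ...
    sleep 1                    # stays well under two requests per second
  end
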
>
> --
> Eric Hodel - drbrain / segment7.net - http://segment7.net
> FEC2 57F1 D465 EB15 5D6E  7C11 332A 551C 796C 9F04