On Sep 17, 2007, at 1:52 PM, Chuck Dawit wrote:

> John Joyce wrote:
>> On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
>>
>>> equipment. So I started by downloading the DNS list of all domain
>>> Posted via http://www.ruby-forum.com/.
>>>
>> Doesn't sound like much scraping, just searching text for a string.
>> You could even do a lot of that work with Google.
>> but just download the file and search for a string. create a data
>> file of your own that tells you what line you found the string.
>> Scraping is really for getting data from other sites, using the DOM
>> structure they have to get (for example) the weather report.
>
>
> Well, I disagree. Once I have all the websites with Cisco in its domain
> name and I look through them, there are lots of pages that won't show me
> info unless I do a search within that page itself. (ex. usedcisco.com)
> To search for specific items on this website I would have to use the
> search bar located within its page to search for say "WIC-1T" and then
> search for a price below a specific amount for that item.
>
What I mean is, scraping usually relies on the document's structure
in some way. Without looking at the structure that a given site uses
(or a given page, if it isn't a templated, dynamically generated page),
there is no way to know what corresponds to what. Page structure is
pretty arbitrary. Presentation and structure don't necessarily
correspond well, or in a way you could guess.
Ironically, the better their web designers, the easier it will be.
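
A rough sketch of what I mean, using only REXML from Ruby's standard
distribution. Both pages, the markup, and the XPath expressions are made
up for illustration, and REXML only copes with well-formed markup, so
real pages would probably need a more forgiving HTML parser. The point
is just that each site needs its own rule, found by reading its source.

```ruby
require "rexml/document"

# Two hypothetical pages listing the same item. The markup is invented;
# a real page has to be inspected by hand before writing any rule.
TIDY_PAGE = <<~HTML
  <table>
    <tr class="item"><td class="part">WIC-1T</td><td class="price">45.00</td></tr>
  </table>
HTML

MESSY_PAGE = <<~HTML
  <table>
    <tr><td><b>WIC-1T</b></td><td><font color="red">45.00</font></td></tr>
  </table>
HTML

# One XPath per site (hypothetical site keys). The tidy page labels its
# cells with class names; the messy one only has presentational tags.
PRICE_XPATH = {
  tidy:  "//td[@class='price']",
  messy: "//td/font",
}

# Returns the price as a Float, or nil if the rule matches nothing.
def price_from(html, site)
  doc  = REXML::Document.new(html)
  node = REXML::XPath.first(doc, PRICE_XPATH.fetch(site))
  node && node.text.to_f
end
```

The tidy rule would likely survive a redesign of the page; the messy one
breaks as soon as somebody changes the font tag.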

But if you are talking about searching a dynamically generated site,
you still have to find out whether it has a search mechanism, and what
it calls the form fields and submit buttons. The names in HTML can be
arbitrary, especially if they use graphic buttons.
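
Something like this could list what a search form expects, so you know
which parameter names to send. The form below is invented (including the
odd field name "q17" and the image button), and REXML again assumes
well-formed markup, so treat it as a sketch rather than a general tool.

```ruby
require "rexml/document"

# Hypothetical search form with an arbitrary field name and a graphic
# submit button, as you might find on a real site.
SEARCH_FORM = <<~HTML
  <form action="/find.cgi" method="get">
    <input type="text" name="q17" />
    <input type="image" name="go" src="/img/search.gif" />
  </form>
HTML

# Returns the form's action, method, and the name/type of each input,
# or nil if the markup contains no form at all.
def describe_form(html)
  form = REXML::XPath.first(REXML::Document.new(html), "//form")
  return nil unless form

  fields = REXML::XPath.match(form, ".//input").map do |input|
    { name: input.attributes["name"], type: input.attributes["type"] }
  end

  { action: form.attributes["action"],
    method: form.attributes["method"],
    fields: fields }
end
```

Calling describe_form(SEARCH_FORM) tells you the search term goes in
"q17" and the request goes to "/find.cgi" via GET. You would still have
to look at the page to confirm which form is the search form.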

If you have a long list of products to search for, you will still save
yourself some work, but scraping involves some visual inspection of
pages and page source to get things going. Be aware that their
sysadmin may spot you doing a big blast of searches all at once and
block you from the site. If they check their logs and see that
somebody is searching for all Cisco stuff in an automated fashion,
they might just block you anyway, whether or not they are legit
themselves. Many sysadmins don't like bots searching their
databases! They might see it as searching for exploits.
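
One way to avoid the "big blast" is to pause between searches. Here is a
minimal sketch; throttled_each and its delay parameter are names I made
up, and the block is where your actual HTTP request would go. Spacing
requests out is no guarantee you won't be blocked, it just makes you
less of a nuisance.

```ruby
# Runs the block once per search term, sleeping `delay` seconds between
# calls (but not before the first one). Returns a hash of term => result.
def throttled_each(terms, delay: 5.0)
  results = {}
  terms.each_with_index do |term, i|
    sleep(delay) if i > 0
    results[term] = yield(term)
  end
  results
end

# Example with a stand-in block instead of a real request:
#   throttled_each(%w[WIC-1T WIC-2T], delay: 10) { |t| search_site(t) }
# where search_site is whatever method you write to query the site.
```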