On Sep 17, 2007, at 1:52 PM, Chuck Dawit wrote:
> John Joyce wrote:
>> On Sep 17, 2007, at 12:25 PM, Charles Pareto wrote:
>>
>>> equipment. So I started by downloading the DNS list of all domain
>>> Posted via http://www.ruby-forum.com/.
>>>
>> Doesn't sound like much scraping, just searching text for a string.
>> You could even do a lot of that work with Google.
>> but just download the file and search for a string. create a data
>> file of your own that tells you what line you found the string.
>> Scraping is really for getting data from other sites, using the DOM
>> structure they have to get (for example) the weather report.
>
> Well, I disagree. Once I have all the websites with Cisco in its
> domain name and I look through them, there are lots of pages that
> won't show me info unless I do a search within that page itself.
> (ex. usedcisco.com)
> To search for specific items on this website I would have to use the
> search bar located within its page to search for say "WIC-1T" and then
> search for a price below a specific amount for that item.
>

What I mean is, scraping usually relies on the document's structure in some way. Without looking at the structure that a given site uses (or a given page, if it isn't a templated, dynamically generated page), there is no way to know what corresponds to what. Page structure is pretty arbitrary. Presentation and structure don't necessarily correspond well, or in a way you could guess. Ironically, the better their web designers, the easier it will be.

But if you are talking about searching a dynamically generated site, you still have to find out whether it has a search mechanism, and what it calls the form fields and submit buttons. The names in HTML can be arbitrary, especially if they use graphic buttons.

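To make the "find out what the form calls its fields" step concrete, here is a rough sketch in plain Ruby that lists each form's action and field names from a page's source. The method name `form_fields` and the sample markup are made up for illustration, and regex matching like this is fragile on real-world HTML, so treat it as a first look at a page rather than a real parser.

```ruby
# Crude sketch: list each form's action and the names of its fields.
# Regexes are fragile on messy HTML; a proper HTML parser is safer.
def form_fields(html)
  html.scan(/<form\b([^>]*)>(.*?)<\/form>/mi).map do |attrs, body|
    # Pull the action attribute out of the opening <form ...> tag.
    action = attrs[/\baction\s*=\s*["']([^"']*)["']/i, 1]
    # Collect the name attribute of every field-like tag inside the form.
    names = body.scan(
      /<(?:input|select|textarea|button)\b[^>]*?\bname\s*=\s*["']([^"']*)["']/i
    ).flatten
    { action: action, fields: names }
  end
end

# Hypothetical page source, invented for this example.
sample = <<~HTML
  <form action="/search.php" method="get">
    <input type="text" name="q">
    <input type="image" src="go.gif" name="btnGo">
  </form>
HTML

form_fields(sample).each do |form|
  puts "#{form[:action]}: #{form[:fields].join(', ')}"
end
```

Note the graphic button in the sample: its name tells you nothing about what it does, which is why you still end up reading the page source by eye.
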
If you have a long list of products to search for, you will still save yourself some work, but scraping involves some visual inspection of pages and page source to get things going.

Be aware that their sysadmin may spot you doing a big blast of searches all at once and block you from the site. If they check their logs and see that somebody is searching for all Cisco stuff in an automated fashion, they might just block you anyway, whether or not they are legit themselves. Many sysadmins don't like bots searching their databases! They might see it as searching for exploits.
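
One way to avoid the "big blast" is to space the searches out. The sketch below is only an illustration: `polite_search`, its `pause` and `sleeper` parameters, and the stand-in fetcher are all hypothetical names, and the right delay for any given site is a guess, not a rule. It takes whatever fetching code you pass as a block, so the real HTTP call stays out of the throttling logic.

```ruby
# Sketch: run searches one at a time with a pause between requests.
# The block receives a search term and returns the response body.
# `pause` is the delay in seconds; `sleeper` exists so the wait can be
# swapped out (for example, when trying the method without waiting).
def polite_search(terms, pause: 5, sleeper: Kernel.method(:sleep), &fetch)
  results = {}
  terms.each_with_index do |term, i|
    sleeper.call(pause) if i > 0 # no wait before the first request
    results[term] = fetch.call(term)
  end
  results
end

# Stand-in fetcher so the sketch runs without touching a real site.
fake_fetch = ->(term) { "results for #{term}" }

pages = polite_search(%w[WIC-1T WIC-2T], pause: 0.1, &fake_fetch)
pages.each { |term, body| puts "#{term}: #{body}" }
```

In real use the block would make the actual request (with Net::HTTP, say) against whatever search URL and field name you found in the page source. Slowing down won't guarantee you stay unblocked, but it looks a lot less like an attack in the logs.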