Hello,

On Wed, Apr 18, 2012 at 1:10 PM, Panagiotis Atmatzidis <ml / convalesco.org> wrote:

> Hello,
>
> I need some advice on where to start digging (and how) because
> I'm a bit confused.
>
> I'd like to be able to grab all content from a website. Using nokogiri I
> can use XPath and get blog post content, among other things, from a web page.
> But I don't have a clue about where to start looking in order to be able to
> scan a whole website, following all the links that belong to that website.
>
> Is nokogiri the right tool or should I use something like mechanize? Can
> you provide any hint on how to perform scraping on an entire website? I'm
> interested in blogs mostly, wordpress and blogger platforms for the time
> being.

I recommend starting by watching Ryan Bates's excellent screencasts:

* http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
* http://railscasts.com/episodes/191-mechanize