--20cf3056389fdb33f104bdf7ce29
Content-Type: text/plain; charset=ISO-8859-1

Hello,

On Wed, Apr 18, 2012 at 1:10 PM, Panagiotis Atmatzidis <ml / convalesco.org> wrote:

> Hello,
>
> I need some sort of advice on where to start digging (and how) because
> I'm a bit confused.
>
> I'd like to be able to grab all content from a website. Using nokogiri I
> can use XPath and get blog post content among other things from a web page.
> But I don't have a clue about where to start looking in order to be able to
> scan a website flying through all possible links that include that website.
>
> Is nokogiri the right tool or should I use something like mechanize? Can
> you provide any hint on how to perform scraping on an entire website? I'm
> interested in blogs mostly, wordpress and blogger platforms for the time
> being.
>

I recommend starting by watching Ryan Bates's excellent screencasts:

* http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
* http://railscasts.com/episodes/191-mechanize
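
As a rough idea of what the crawling part involves, here is a minimal sketch (not taken from the screencasts) of a breadth-first crawl that stays on one host. The names crawl_site and extract_links, the fetch: and extract: parameters, and the max_pages limit are all made up for illustration, and the default link extraction assumes the nokogiri gem is installed.

```ruby
require 'set'
require 'uri'

# Hypothetical helper: returns every href found in an HTML string.
# Nokogiri is required lazily so the crawl logic itself needs no gems.
def extract_links(html)
  require 'nokogiri'
  Nokogiri::HTML(html).css('a[href]').map { |a| a['href'] }
end

# Hypothetical breadth-first crawl restricted to the host of start_url.
# fetch:   callable that takes a URL string and returns HTML, or nil on failure
# extract: callable that takes HTML and returns an array of href strings
# Returns a hash of url => html for every page that was fetched.
def crawl_site(start_url, fetch:, extract: method(:extract_links), max_pages: 50)
  host  = URI(start_url).host
  queue = [start_url]
  seen  = Set.new(queue)
  pages = {}

  until queue.empty? || pages.size >= max_pages
    url  = queue.shift
    html = fetch.call(url)
    next if html.nil?
    pages[url] = html

    extract.call(html).each do |href|
      begin
        link = URI.join(url, href)   # resolve relative links against the page
      rescue URI::Error
        next                         # skip hrefs that are not valid URIs
      end
      # keep only http(s) links on the same host
      next unless link.is_a?(URI::HTTP) && link.host == host
      link.fragment = nil            # treat page#a and page#b as one page
      target = link.to_s
      queue << target if seen.add?(target)
    end
  end

  pages
end
```

One possible way to call it, assuming plain Net::HTTP is enough for the site in question, is to require 'net/http' and pass fetch: ->(u) { Net::HTTP.get(URI(u)) }. Each returned page can then be parsed with Nokogiri and XPath as you already do for a single page.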

--20cf3056389fdb33f104bdf7ce29
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hello,<br><br><div class="gmail_quote">On Wed, Apr 18, 2012 at 1:10 PM, Panagiotis Atmatzidis <span dir="ltr">&lt;ml / convalesco.org&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hello,<br>
<br>
I need some sort of advice on where to start digging (and how) because I&#39;m a bit confused.<br>
<br>
I&#39;d like to be able to grab all content from a website. Using nokogiri I can use XPath and get blog post content among other things from a web page. But I don&#39;t have a clue about where to start looking in order to be able to scan a website flying through all possible links that include that website.<br>

<br>
Is nokogiri the right tool or should I use something like mechanize? Can you provide any hint on how to perform scraping on an entire website? I&#39;m interested in blogs mostly, wordpress and blogger platforms for the time being.<br>
</blockquote><div><br></div><div>I recommend starting by watching Ryan Bates&#39;s excellent screencasts:</div><div><br></div><div>* <a href="http://railscasts.com/episodes/190-screen-scraping-with-nokogiri">http://railscasts.com/episodes/190-screen-scraping-with-nokogiri</a></div>
<div>* <a href="http://railscasts.com/episodes/191-mechanize">http://railscasts.com/episodes/191-mechanize</a></div><div><br></div></div>

--20cf3056389fdb33f104bdf7ce29--