Hi,

I have been trying for some time to crack a problem which I am sure is very
simple! I am learning Ruby, and programming in general, and having lots of
fun, but some help would be appreciated...

I am writing a program to scrape body text from a series of web pages so
that they can be presented in a text file.

The format for the URLs of the series of pages I am interested in is:

www.targeturl.com/episode_x?=page1
...
...
www.targeturl.com/episode_x?=page[y]


Basically, "episode_x" has y number of pages, starting at 1.
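
To make the pattern concrete, here is a minimal sketch of how the page URLs
could be generated. The helper name page_url and the http:// prefix are my
own assumptions for illustration; the base string is just the example URL
above with the page number left off.

```ruby
# Hypothetical helper: builds the URL for page `page_number` of an episode
# by appending the number to a base string that ends in "page".
def page_url(base_url, page_number)
  "#{base_url}#{page_number}"
end

# Assumed base URL, taken from the example format above.
base = 'http://www.targeturl.com/episode_x?=page'

# Pages are numbered from 1, so pages 1..3 look like this:
urls = (1..3).map { |n| page_url(base, n) }
puts urls
```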

I am using Nokogiri to grab the text from each page and can quite easily get
the text from page1. What I want is to loop: grab the text from page2, then
page3, and so on, until I reach page[y], which is where the text ends. Past
that point, Nokogiri finds no more text on the page (i.e. body_text == nil).

Before attempting to grab the body text and append it to a text file, my
strategy is to populate an array of 'valid' URLs, based on a test in which
Nokogiri looks for text in the body tag, starting at page1.  I want the loop
to finish when the test finds body_text == nil, leaving me with a collection
of URLs which I know definitely contain body text.
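
The strategy above could be sketched roughly as follows. This is only an
outline under some assumptions: collect_valid_urls, fetch_text and max_pages
are names made up for illustration, and the fetching is passed in as a lambda
so the loop can be tried against stand-in data without touching the network.

```ruby
# Sketch: collect page URLs, starting at page 1, until a page has no text.
# `fetch_text` is a hypothetical callable that returns the body text for a
# URL, or nil when the page has none. `max_pages` is a safety cap so the
# loop cannot run forever if the end condition is never met.
def collect_valid_urls(base_url, fetch_text, max_pages: 100)
  valid_urls = []
  1.upto(max_pages) do |page_number|
    url = "#{base_url}#{page_number}"
    text = fetch_text.call(url)
    # Stop at the first page with no usable text.
    break if text.nil? || text.strip.empty?
    valid_urls << url
  end
  valid_urls
end

# Stand-in for the real site: two pages with text, then nothing.
FAKE_PAGES = {
  'http://www.targeturl.com/episode_x?=page1' => 'First page text',
  'http://www.targeturl.com/episode_x?=page2' => 'Second page text'
}.freeze

fetcher = ->(url) { FAKE_PAGES[url] }
puts collect_valid_urls('http://www.targeturl.com/episode_x?=page', fetcher)
```

In real use the lambda would wrap the Nokogiri lookup, i.e. open the URL and
return the text of the relevant element, or nil if it is not there.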

After a lot of playing around, I have got this far, but there is no looping
going on.  I am getting the page okay and am testing for a certain condition
which results in "END!" being appended to the array (essentially, when
body_text == nil).  But I can't work out how to loop.

require 'nokogiri'
require 'open-uri'

# Fetch one page and return the stripped text of its first
# div.body_recap, or nil if the page has no such div.
def get_text(base_url, page_number)
  @target_url = base_url + page_number.to_s
  # URI.open is needed on Ruby 3.0+, where plain open no longer takes URLs.
  @noko_doc = Nokogiri::HTML(URI.open(@target_url))

  node = @noko_doc.at_css('div.body_recap')
  node && node.content.strip
end

# Checks a single page only; nothing here loops yet.
def collect_urls(base_url, page_number)
  @valid_urls = []
  text = get_text(base_url, page_number)
  if text =~ /\A\s*Previous/
    @valid_urls << "END!"
  else
    @valid_urls << @target_url
  end
  @valid_urls
end

Any help or comments very welcome!

Thanks all.

Matt