Hi,

I'm trying to get my feet wet with Ruby by tackling a manageable, but
real, issue I'd like to solve.

I'm an academic and subscribe to the RSS feeds of some journals I read.
However, the feeds are really bad: they contain only lists of authors
and titles (with no markup) and links to the issue URLs.

So, I want a script that takes those feeds, goes to the issue pages,
grabs the links for the articles, and then from there extracts author
and title information.

For some reason I don't understand, the fragment below works, except
that the author attribute is always blank.  The problem is not with my
regular expression pattern.

Could someone explain what I'm doing wrong?

Bruce

# journals is an array of rss feed urls and titles
journals.each do |journal|
  open(journal[1]) do |http|
    response = http.read
    result = RSS::Parser.parse(response, false)

    # grab first issue url listed from each journal
    issue_url = result.items[0].link

    # regular expression patterns to use below
    article_page = /<a href="(.*?)">Article Description<\/a>/
    title_match = /<span class="article-title">(.*?)<\/span>/
    author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

    articles = open(issue_url)
    # find each article url by screen-scraping
    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        # screen-scrape for article author and title
        title = article.read.scan(title_match)
        # for whatever reason, author never returns anything
        author = article.read.scan(author_match)
        # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end
  end
end
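
One possible explanation, offered as a guess rather than a confirmed
diagnosis: article.read is called twice on the same stream. The first
call consumes everything, so the second call may return an empty string
and give the author pattern nothing to scan. The sketch below tries to
show that effect in isolation. StringIO stands in for the object that
open yields, and the HTML snippet is made up for illustration.

```ruby
require 'stringio'

# Hypothetical page content; StringIO stands in for the IO-like object
# that open yields, so the sketch runs without a network connection.
html = '<span class="article-title">A Title</span>' \
       '<strong>Author:</strong></td><td class="rightcol">A. Author</td>'

title_match  = /<span class="article-title">(.*?)<\/span>/
author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

article = StringIO.new(html)
first_read  = article.read   # consumes the whole stream
second_read = article.read   # stream is already at EOF, so this is ""

# Reading once and reusing the string lets both patterns see the page.
article.rewind
body   = article.read
title  = body.scan(title_match)
author = body.scan(author_match)

puts second_read.inspect
puts title.inspect
puts author.inspect
```

If this is indeed the cause, storing the result of article.read in a
variable once and scanning that variable for both title and author
should be enough. Note also that scan with a capture group returns an
array of arrays, so the values may need unwrapping before use.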