Andrei Maxim wrote:
> Hi guys,
>
> I'm starting to learn Ruby and I was thinking about a little app so I can
> get things started as quickly as possible. Since I'm an avid blog reader,
> the first thing that went through my mind was a small app that would extract
> the RSS or Atom feed from a web page, given the URL.
>
> My first choice was regexps but I'm thinking that my little app may grow a
> little bit more in the not-so-distant future and I might be doing more than
> just extracting feeds.
>
> I found:
>
> * ymHTML at http://www.yoshidam.net/Ruby.html
> * RAA at http://raa.ruby-lang.org/project/html-parser-2/
>
> but they don't look really standard and RAA doesn't look like it's currently
> maintained. I've also heard that there's a Rails HTML parser but I couldn't
> find more info (and pro'lly I'll ask on one of the Rails lists).
>
> Is there a more "standard" way to parse HTML pages in Ruby?

The closest you'll find to a standard is REXML, which is an XML parser that
ships in the stdlib. Because it only accepts well-formed XML, you'll want to
throw your HTML through Tidy first, though - but that's an easy install.

There are a couple of alternatives: Hpricot and html-parser spring instantly
to mind. If you're doing feed parsing, you probably also want to check out
feedtools.

-- Alex
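
For what it's worth, here is a rough sketch of the REXML route for the feed-extraction case. It assumes the page has already been cleaned up into well-formed XHTML (for example by Tidy), and that the page advertises its feeds with the usual `<link rel="alternate" type="application/rss+xml" ...>` autodiscovery tags, which is an assumption on my part rather than something from the question. The helper name `feed_links` and the sample markup are made up for illustration.

```ruby
require 'rexml/document'

# MIME types commonly used for feed autodiscovery (assumed, not exhaustive).
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns the href of every feed <link> in the page.
# Assumes `xhtml` is already well-formed; REXML raises
# REXML::ParseException on tag soup.
def feed_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  # Walk every element in document order. Comparing the local name
  # sidesteps any XHTML default-namespace handling in XPath.
  doc.each_recursive do |el|
    next unless el.name == 'link'
    rel  = el.attributes['rel'].to_s.downcase.split
    type = el.attributes['type'].to_s.downcase
    href = el.attributes['href']
    links << href if href && rel.include?('alternate') && FEED_TYPES.include?(type)
  end
  links
end

# Made-up sample page standing in for Tidy's output.
SAMPLE = <<~XHTML
  <html>
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/rss+xml" href="/feed.rss" />
      <link rel="alternate" type="application/atom+xml" href="/feed.atom" />
    </head>
    <body><p>Hello</p></body>
  </html>
XHTML

puts feed_links(SAMPLE)
```

The returned hrefs may be relative, so a real app would probably resolve them against the page URL (`URI.join` in the stdlib does that) before fetching anything.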