On 10/11/06, Gregory Seidman <gsslist+ruby / anthropohedron.net> wrote:
> On Thu, Oct 12, 2006 at 02:28:13AM +0900, Aaron Patterson wrote:
> } On Thu, Oct 12, 2006 at 02:13:07AM +0900, Rick DeNatale wrote:
> } > I'm trying to scan an html file using Hpricot to produce a table of
> } > links within the file.
> } >
> } > Right now I've got something like this.
> } >
> } > doc = Hpricot(open(url))
> } > doc.search('a').each do |element|
> } >   puts "#{element.inner_html}"
> } >   puts "  #{element.attributes['href']}"
> } > end
> } >
> } > This works, but in this document some of the a tags use markup on
> } > their contents. Something like
> } > <a href="http://blah.org/blah.htm"><b>blah blah</b> blah</a>
> } >
> } > I'd like to strip out the markup tags so that I'd get
> } >
> } > blah blah blah
> } >   http://blah.org/blah.htm
> } >
> } > Is there some way to search for or iterate over the leaf elements of
> } > the tree rooted by an element in Hpricot?
> }
> } I had to do something similar in Mechanize, and this is what I came up
> } with:
> }
> } class Hpricot::Elem
> }   def all_text
> }     text = ''
> }     children.each do |child|
> }       if child.respond_to? :content
> }         text << child.content
> }       end
> }       if child.respond_to? :all_text
> }         text << child.all_text
> }       end
> }     end
> }     text
> }   end
> } end
> }
> } doc = Hpricot("<a href=\"http://blah.org/blah.htm\"><b>blah blah</b> blah</a>")
> } doc.search('a').each do |e|
> }   puts "#{e.all_text}"
> }   puts "  #{e.attributes['href']}"
> } end
>
> There is a simpler implementation of all_text:
>
> class Hpricot::Elem
>   def all_text
>     text = ''
>     traverse_text {|t| text << t.content }
>     text
>   end
> end
>
> } Hope that helps!
> } --Aaron
> --Greg
>
>

Thanks Aaron and Greg, works a treat!

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/
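
For anyone reading this thread without Hpricot installed, the idea behind both all_text versions (walk the tree under an element and concatenate the text of its leaves in document order) can be sketched in plain Ruby. TextNode and ElemNode below are hypothetical stand-ins invented for this illustration, not Hpricot classes, and the tree is built by hand rather than parsed from HTML.

```ruby
# Hypothetical stand-ins for a parsed HTML tree. These are NOT Hpricot
# classes; they only mimic the shape of an element with child nodes.
TextNode = Struct.new(:content)

ElemNode = Struct.new(:name, :attributes, :children) do
  # Concatenate the text of every leaf below this element, in document
  # order, recursing into child elements and skipping their markup.
  def all_text
    children.map { |child|
      child.is_a?(TextNode) ? child.content : child.all_text
    }.join
  end
end

# Hand-built equivalent of:
#   <a href="http://blah.org/blah.htm"><b>blah blah</b> blah</a>
link = ElemNode.new('a', { 'href' => 'http://blah.org/blah.htm' }, [
  ElemNode.new('b', {}, [TextNode.new('blah blah')]),
  TextNode.new(' blah')
])

puts link.all_text
puts "  #{link.attributes['href']}"
```

With this hand-built tree the two puts calls print "blah blah blah" followed by the indented href, which is the output asked for at the top of the thread.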