On 10/11/06, Gregory Seidman <gsslist+ruby / anthropohedron.net> wrote:
> On Thu, Oct 12, 2006 at 02:28:13AM +0900, Aaron Patterson wrote:
> } On Thu, Oct 12, 2006 at 02:13:07AM +0900, Rick DeNatale wrote:
> } > I'm trying to scan an html file using Hpricot to produce a table of
> } > links within the file.
> } >
> } > Right now I've got something like this.
> } >
> } >   doc = Hpricot(open(url))
> } >   doc.search('a').each do |element|
> } >         puts "#{element.inner_html}"
> } >         puts "     #{element.attributes['href']}"
> } >   end
> } >
> } > This works, but in this document some of the a tags use markup on
> } > their contents. Something like
> } > <a href="http://blah.org/blah.htm"><b>blah blah</b> blah</a>
> } >
> } > I'd like to strip out the markup tags so that I'd get
> } >
> } > blah blah blah
> } >    http://blah.org/blah.htm
> } >
> } > Is there some way to search for or iterate over the leaf elements of
> } > the tree rooted by an element in Hpricot?
> }
> } I had to do something similar in Mechanize, and this is what I came up
> } with:
> }
> }   class Hpricot::Elem
> }     def all_text
> }       text = ''
> }       children.each do |child|
> }         if child.respond_to? :content
> }           text << child.content
> }         end
> }         if child.respond_to? :all_text
> }           text << child.all_text
> }         end
> }       end
> }       text
> }     end
> }   end
> }
> }   doc = Hpricot("<a href=\"http://blah.org/blah.htm\"><b>blah blah</b> blah</a>")
> }   doc.search('a').each do |e|
> }     puts "#{e.all_text}"
> }     puts "     #{e.attributes['href']}"
> }   end
>
> There is a simpler implementation of all_text:
>
> class Hpricot::Elem
>   def all_text
>     text = ''
>     traverse_text {|t| text << t.content }
>     text
>   end
> end
>
> } Hope that helps!
> } --Aaron
> --Greg

Thanks Aaron and Greg, works a treat!
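
For the archives, here is a minimal sketch of the idea behind all_text
that runs without Hpricot installed. TextNode and ElemNode are
hypothetical stand-ins I made up for illustration; they are not
Hpricot classes, and this only models the recursion (gather the
content of every text leaf under an element, in document order).

```ruby
# Hypothetical stand-ins for a parser's text and element nodes.
# These are NOT Hpricot's classes; they only model the tree shape.
TextNode = Struct.new(:content)

ElemNode = Struct.new(:name, :attributes, :children) do
  # Concatenate the text of every leaf below this element,
  # descending into child elements and keeping document order.
  def all_text
    children.map { |child|
      child.respond_to?(:all_text) ? child.all_text : child.content
    }.join
  end
end

# Models <a href="http://blah.org/blah.htm"><b>blah blah</b> blah</a>
bold = ElemNode.new('b', {}, [TextNode.new('blah blah')])
link = ElemNode.new('a', { 'href' => 'http://blah.org/blah.htm' },
                    [bold, TextNode.new(' blah')])

puts link.all_text
puts "     #{link.attributes['href']}"
```

With the real library, the all_text definitions quoted above are the
ones to use; the sketch just shows why the markup disappears: only
the text leaves contribute to the result.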

-- 
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/