------------mnRjvk51oifgxuLWGKD7k4
Content-Type: text/plain; format=flowed; delsp=yes; charset=utf-8
Content-Transfer-Encoding: 7bit

On Tue, 21 Nov 2006 22:27:15 -0000, Wes Gamble <weyus / att.net> wrote:

> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.
>

I recently did a small head-to-head with RubyfulSoup, Hpricot, and the
up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding to
the libxml2 HTML parser.

Running against the RubyfulSoup homepage (perhaps ironically, it's pretty
badly formed) over 100 iterations, the attached benchmark gave the
following results. Each benchmark parses the original HTML and then gets
back a specific node set (Hpricot and libxml2 using XPath, RubyfulSoup
using its own query API):

                                   user     system      total         real
rubyful soup - simple         25.900000   0.710000  26.610000 ( 26.669350)
rubyful soup - trickier       26.220000   0.010000  26.230000 ( 26.252975)
hpricot - simple xpath         7.930000   0.000000   7.930000 (  7.950092)
hpricot - trickier xpath       8.200000   0.010000   8.210000 (  8.212230)
libxml2 - simple xpath         0.900000   0.000000   0.900000 (  0.899329)
libxml2 - trickier xpath       0.940000   0.000000   0.940000 (  1.217441)

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing a pretty good job of fixing up
broken HTML. There were minor differences in the XML produced, and from a
(biased, nitpicking) spec point of view I think libxml2's output is
slightly more 'proper' (self-closing tags, etc).

RubyfulSoup, on the other hand, seemed to have a few inconsistencies - it
would occasionally lose tag attributes, and sometimes return varying
results for the same query.
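
To put the timings above in relative terms, here is a minimal sketch that
turns the reported "real" (wall-clock) figures into slowdown factors
against the fastest parser. The numbers are copied from the results above;
the constant REAL_SECONDS and the helper slowdown_vs_fastest are
hypothetical names made up for this illustration, not part of the attached
benchmark or of any of the three libraries.

```ruby
# Wall-clock ("real") seconds for 100 iterations, copied from the results above.
REAL_SECONDS = {
  'rubyful soup' => { simple: 26.669350, trickier: 26.252975 },
  'hpricot'      => { simple: 7.950092,  trickier: 8.212230 },
  'libxml2'      => { simple: 0.899329,  trickier: 1.217441 }
}.freeze

# Hypothetical helper: how many times slower each parser is than the
# fastest one for the given query, rounded to one decimal place.
def slowdown_vs_fastest(results, query)
  fastest = results.values.map { |times| times[query] }.min
  results.transform_values { |times| (times[query] / fastest).round(1) }
end

[:simple, :trickier].each do |query|
  slowdown_vs_fastest(REAL_SECONDS, query).each do |parser, factor|
    puts format('%-12s %-8s %5.1fx', parser, query, factor)
  end
end
```

For the simple query this works out to roughly 8.8x for Hpricot and 29.7x
for RubyfulSoup relative to libxml2. These are wall-clock ratios from a
single run on one page, so they are best read as a rough indication rather
than a general ranking.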

As for feature support, well, I don't want to rain on anyone's parade, but
the libxml HTML parser outputs an XML::Document with which you can
transparently use all of libxml2's (many) features ... ;) I couldn't get
XPath functions to work with Hpricot, but then I'm not sure how complete
an XPath implementation it's aiming for, and apart from that it seems
pretty solid. OTOH RubyfulSoup has no XPath support at all :(

--
Ross Bamford - rosco / roscopeco.remove.co.uk
------------mnRjvk51oifgxuLWGKD7k4
Content-Disposition: attachment; filename=libxml-perfcomp.rb
Content-Type: text/x-ruby-script; name=libxml-perfcomp.rb
Content-Transfer-Encoding: Quoted-Printable

require 'rubygems'
require 'hpricot'
require 'rubyful_soup'
require 'libxml_test'
require 'benchmark'
require 'open-uri'

# Make the test fair - spitting errors to stderr would slow libxml2 down.
XML::Parser.register_error_handler(lambda { |msg| nil })

# The test page is HTML 4.0 transitional with some unclosed tags.
# Fetch it once and cache it so every run parses the same input.
unless File.exist?('data.html')
  File.open('data.html', 'w+') do |f|
    f << URI.parse('http://www.crummy.com/software/RubyfulSoup/documentation.html').read
  end
end

data = File.read('data.html')
iters = 100

# Run a single labelled benchmark, with a rehearsal pass (Benchmark.bmbm).
def do_benchmark(title)
  Benchmark.bmbm do |x|
    x.report(title) do
      yield
    end
  end
end

do_benchmark('rubyful soup - simple xpath') do
  iters.times do
    soup = BeautifulSoup.new(data)
    r = soup.find_all('span')
  end
end

do_benchmark('hpricot - simple xpath') do
  iters.times do
    doc = Hpricot(data)
    r = (doc/'//span')
  end
end

do_benchmark('libxml2 - simple xpath') do
  iters.times do
    doc = XML::HTMLParser.string(data).parse
    r = doc.find('//span').to_a
  end
end

# N.B. potential bug here - seems to return different results on some
# iterations.
do_benchmark('rubyful soup - trickier xpath') do
  iters.times do
    soup = BeautifulSoup.new(data)
    # No XPath in RubyfulSoup, so filter tags with a block instead.
    r = soup.find_all do |tag|
      tag.respond_to? :name and tag.name == 'td' and tag['valign'] == 'top'
    end
  end
end

do_benchmark('hpricot - trickier xpath') do
  iters.times do |i|
    doc = Hpricot(data)
    r = (doc/'//tr/td[@valign = "top"]')
  end
end

do_benchmark('libxml2 - trickier xpath') do
  iters.times do
    doc = XML::HTMLParser.string(data).parse
    r = doc.find('//tr/td[@valign = "top"]').to_a
  end
end
------------mnRjvk51oifgxuLWGKD7k4--