------------mnRjvk51oifgxuLWGKD7k4
Content-Type: text/plain; format=flowed; delsp=yes; charset=utf-8
Content-Transfer-Encoding: 7bit

On Tue, 21 Nov 2006 22:27:15 -0000, Wes Gamble <weyus / att.net> wrote:

> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.
>

I recently did a small head-to-head with RubyfulSoup, Hpricot, and the  
up-and-coming (now in CVS, release in a few weeks) libxml-ruby binding to  
the libxml2 HTML parser. Running against the RubyfulSoup homepage (perhaps  
ironically, it's pretty badly formed) over 100 iterations, the attached  
benchmark gave the following results. Each benchmark parses the  
original HTML and then fetches a specific node set (Hpricot and  
libxml2 using XPath, RubyfulSoup using its own query API):

                                   user     system      total        real
rubyful soup - simple        25.900000   0.710000  26.610000 ( 26.669350)
rubyful soup - trickier      26.220000   0.010000  26.230000 ( 26.252975)
hpricot - simple xpath        7.930000   0.000000   7.930000 (  7.950092)
hpricot - trickier xpath      8.200000   0.010000   8.210000 (  8.212230)
libxml2 - simple xpath        0.900000   0.000000   0.900000 (  0.899329)
libxml2 - trickier xpath      0.940000   0.000000   0.940000 (  1.217441)
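
To put those totals in per-parse terms: each figure covers 100 parse-and-query passes, so the "total" column works out to roughly 266 ms per pass for RubyfulSoup, 79 ms for Hpricot and 9 ms for libxml2 on the simple query. The sketch below just redoes that arithmetic from the numbers above. The constant and helper names (TOTALS, per_iteration_ms, slowdown) are made up for illustration and are not part of the attached script.

```ruby
# "total" column (user + system CPU seconds for 100 passes), copied from
# the results above. Nothing here is re-measured.
ITERATIONS = 100
TOTALS = {
  'rubyful soup - simple'    => 26.61,
  'rubyful soup - trickier'  => 26.23,
  'hpricot - simple xpath'   => 7.93,
  'hpricot - trickier xpath' => 8.21,
  'libxml2 - simple xpath'   => 0.90,
  'libxml2 - trickier xpath' => 0.94
}

# Hypothetical helper: CPU milliseconds spent on one parse-and-query pass.
def per_iteration_ms(total_seconds, iterations = ITERATIONS)
  total_seconds * 1000.0 / iterations
end

# Hypothetical helper: how many times slower a run is than a baseline run.
def slowdown(total_seconds, baseline_seconds)
  total_seconds / baseline_seconds
end

baseline = TOTALS['libxml2 - simple xpath']
TOTALS.each do |label, total|
  puts format('%-26s %7.1f ms/pass  %5.1fx the libxml2 simple run',
              label, per_iteration_ms(total), slowdown(total, baseline))
end
```

On these numbers Hpricot comes out at a little under 9 times the libxml2 time, and RubyfulSoup at a bit over 3 times the Hpricot time, for this one page only.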

In terms of preserving the original HTML, I found the libxml2 and Hpricot  
parsers to be fairly even, with both doing a pretty good job of fixing up  
broken HTML. There were minor differences in the XML produced, and from a  
(biased, nitpicking) spec point of view I think libxml2's output is  
slightly more 'proper' (self closing tags, etc). RubyfulSoup on the other  
hand seemed to have a few inconsistencies - it would occasionally lose tag  
attributes, and sometimes return varying results to the same query.
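
One rough way to check for that kind of varying result is to run the same query repeatedly and tally how many nodes come back each time. This is only a sketch: result_size_tally is a hypothetical helper, not something from any of the three libraries, and the blocks below are stand-ins for a real parse-and-query call.

```ruby
# Hypothetical helper: call the given block `runs` times and count how often
# each result size occurs. A consistent query yields a single-entry tally.
def result_size_tally(runs)
  tally = Hash.new(0)
  runs.times do
    results = yield
    tally[results.size] += 1
  end
  tally
end

# Stand-in for a well-behaved query: always returns the same three nodes.
stable = result_size_tally(5) { [:a, :b, :c] }

# Stand-in for a misbehaving query: alternates between one and two nodes.
calls = 0
flaky = result_size_tally(4) do
  calls += 1
  calls.even? ? [:a, :b] : [:a]
end

puts "stable query consistent? #{stable.size == 1}"
puts "flaky query consistent?  #{flaky.size == 1}"
```

With the real libraries, the block would wrap something like the parse and find_all call from the attached script.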

As for feature support, well, I don't want to rain on anyone's parade but  
the libxml HTML parser outputs an XML::Document with which you can  
transparently use all of libxml2's (many) features ... ;) I couldn't get  
XPath functions to work with Hpricot, but then I'm not sure how complete  
an XPath implementation it's aiming for, and apart from that it seems  
pretty solid. OTOH RubyfulSoup has no XPath support at all :(

-- 
Ross Bamford - rosco / roscopeco.remove.co.uk

------------mnRjvk51oifgxuLWGKD7k4
Content-Disposition: attachment; filename=libxml-perfcomp.rb
Content-Type: text/x-ruby-script; name=libxml-perfcomp.rb
Content-Transfer-Encoding: Quoted-Printable

require 'rubygems'
require 'hpricot'
require 'rubyful_soup'
require 'libxml_test'
require 'benchmark'
require 'open-uri'

# Make the test fair - spitting errors to stderr would slow libxml2 down.
XML::Parser.register_error_handler(lambda { |msg| nil })

# The test page is HTML 4.0 Transitional with some unclosed tags.
# File.exist? replaces the deprecated File.exists?, removed in Ruby 3.2.
unless File.exist?('data.html')
  File.open('data.html','w+') do |f|
    f << URI.parse('http://www.crummy.com/software/RubyfulSoup/documentation.html').read
  end
end

data = File.read('data.html')
iters = 100

def do_benchmark(title)
  Benchmark.bmbm do |x|
    x.report(title) do
      yield
    end
  end
end

do_benchmark('rubyful soup - simple') do
  iters.times do
    soup = BeautifulSoup.new(data)
    r = soup.find_all('span')
  end
end

do_benchmark('hpricot - simple xpath') do
  iters.times do
    doc = Hpricot(data)
    r = (doc/'//span')
  end
end

do_benchmark('libxml2 - simple xpath') do
  iters.times do
    doc = XML::HTMLParser.string(data).parse
    r = doc.find('//span').to_a
  end
end

# N.B. potential bug here - seems to return different results on some
# iterations.
do_benchmark('rubyful soup - trickier') do
  iters.times do
    soup = BeautifulSoup.new(data)
    r = soup.find_all do |tag|
      tag.respond_to? :name and tag.name == 'td' and tag['valign'] == 'top'
    end
  end
end

do_benchmark('hpricot - trickier xpath') do
  iters.times do |i|
    doc = Hpricot(data)
    r = (doc/'//tr/td[@valign = "top"]')
  end
end

do_benchmark('libxml2 - trickier xpath') do
  iters.times do
    doc = XML::HTMLParser.string(data).parse
    r = doc.find('//tr/td[@valign = "top"]').to_a
  end
end
------------mnRjvk51oifgxuLWGKD7k4--