On Wed, Nov 22, 2006 at 08:03:54PM +0900, Peter Szinek wrote:
} Luciano Ramalho wrote:
} > On 11/22/06, Peter Szinek <peter / rubyrailways.com> wrote:
[...]
} > Also, RubyfulSoup aims to be very resilient to malformed markup, 
} So it's HPricot. HPricot is not just a HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
} whether HPricot's 'somehow' is better or worse that RubyfulSoup's, but
} it is a fact that HPricot is handling malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don't
} > know fow HPricot handles  HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} This concretely is absolutely OK. Maybe we would need a list of serious
} problems and see how Hpricot vs RubyfulSoup is handling them. From what
} I have seen, HPricot did not have any problems with any page...

HPricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken HTML
if that's your intent, but since I am mostly using HPricot for sanitization
I just set the close tags to nil so the output closes with the correct tag.
I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what are the added benefits of RubyfulSoup parsing
} over HPricot (although I am not claiming that there are none) - I would
} like to see a real serious comparison to decide this...

I haven't tried RubyfulSoup, but HPricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of HPricot-specific objects. It
doesn't try to behave like a real DOM, which means that it can use arrays
for child lists and ordinary references for parent nodes and hashes for
attributes, all read/write. It is possible to perform significant
transformations with minimal difficulty.

} Peter
--Greg