On Wed, Oct 11, 2006 at 12:30:16AM +0900, Bart Braem wrote:
> One question though: do you see a way of parsing a structure like this with
> hpricot:
> 
> <h3>Structure 1</h3>
> <h4>Substructure 1</h4>
> 
> <p>Substructure info</p>
> 
> <ul>
> 
>   <li>Somefiles description. Addition date.</li>
> 
> I can cope with setting a date in the RSS, the problem is parsing this
> structure. There is no surrounding element for the ul and I need both the
> structure and the substructure information because the combination of those
> two defines the effective identity of the ul and its items. 
> There seems to be no method to "give everything between two specific tags and
> then go on to the next one"...

I'm not sure I understand exactly, but here's my impression of what you're
trying to do.

  require 'hpricot'

  doc = Hpricot(html_string)
  (doc/:h3).each do |h3|
    rss_title = h3  # okay, so you have the 3rd-level header
    rss_contents = Hpricot::Elements[]

    # start at the header and step through its siblings until we hit the ul
    ele = h3
    while ele = ele.next_sibling
      rss_contents << ele
      break if ele.respond_to?(:name) and ele.name == "ul"
    end
  end

So, basically, you can use `next_sibling` (or `previous_sibling`) to walk back
and forth between HTML brothers and sisters.  I store it in an Hpricot::Elements
array, since you can then just call `rss_contents.to_html` or do other searches
on it.
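
If what you really need is the h3/h4 pair attached to every list item, the
bookkeeping might look something like the sketch below. It is plain Ruby and
does not touch Hpricot at all: `Node` and `collect_entries` are names I made up
for illustration, and the flat array stands in for whatever your sibling walk
hands you. Treat it as a rough outline under those assumptions, not as
anything the library provides.

```ruby
# Hypothetical stand-in for a parsed node: tag name, text, and (for a ul)
# the text of its list items. Not an Hpricot class.
Node = Struct.new(:name, :text, :items)

# Walk a flat list of sibling nodes in document order, remembering the most
# recent h3 and h4, and emit one entry per list item tagged with both headers.
def collect_entries(siblings)
  h3 = h4 = nil
  entries = []
  siblings.each do |node|
    case node.name
    when "h3"
      h3 = node.text
      h4 = nil  # a new structure starts with no substructure yet
    when "h4"
      h4 = node.text
    when "ul"
      node.items.each do |item|
        entries << { :structure => h3, :substructure => h4, :item => item }
      end
    end
  end
  entries
end

# Sample data mirroring the structure from the question.
siblings = [
  Node.new("h3", "Structure 1"),
  Node.new("h4", "Substructure 1"),
  Node.new("p",  "Substructure info"),
  Node.new("ul", nil, ["Somefiles description. Addition date."])
]
entries = collect_entries(siblings)
```

Each entry then carries enough to build an RSS item whose identity is the
structure plus the substructure, which is the combination you said you need.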

This is available since changeset [49], so you'll need to either install from SVN
or monkeypatch.

_why

[49] http://code.whytheluckystiff.net/hpricot/changeset/49