On Mon, Oct 02, 2006 at 11:38:05PM +0900, HH wrote:
} I've been messing with Hpricot and I'm trying to do a few things that
} aren't apparently documented or available as part of Hpricot. Can
} someone verify the following...
}
} 1) Is there a simple way to determine the element's current path /
} location? For example, if I find a text node, is there a simple way to
} determine the path of that text node so I can find it again later using
} that path / location as a parameter to the search method? I assume I
} can use the parent method to find the parent and recurse through until
} I get to the root node...is there an easier way?
I have been using the recursive (well, iterative, actually) way. I suspect
that is the intended way to do it, since the tree structure is intentionally
simple and is designed to allow you to move nodes around arbitrarily.
Maintaining a node's path independent of its structural location is
inefficient at best and impossible at worst.
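
For what it's worth, here is a rough sketch of the iterative walk. Note the
assumptions: Node below is a stand-in Struct I made up so the example runs on
its own, not an Hpricot class, and path_to is a hypothetical helper name. The
name[position] step format is also just my choice; you may need to adjust it
to whatever the search method actually accepts. With Hpricot you would follow
the parent method in the same way.

```ruby
# Stand-in node type for illustration only; NOT an Hpricot class.
Node = Struct.new(:name, :parent, :children) do
  # Create a child with the given name, attach it, and return it.
  def add(name)
    child = Node.new(name, self, [])
    children << child
    child
  end
end

# Walk up through parent links until we reach the node with no parent,
# recording each step as name[position among same-named siblings].
def path_to(node)
  steps = []
  while node.parent
    same = node.parent.children.select { |c| c.name == node.name }
    position = same.index { |c| c.equal?(node) } + 1
    steps.unshift("#{node.name}[#{position}]")
    node = node.parent
  end
  '/' + steps.join('/')
end

root = Node.new('root', nil, [])
body = root.add('html').add('body')
body.add('p')
second = body.add('p')
puts path_to(second)
```

The path is computed on demand from the current tree, so it stays correct
after nodes are moved, which is the reason not to store it on the node.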
} 2) Is there a simple way to find all elements with non-empty text
} nodes? It appears that Hpricot is focused on providing methods for
} finding something if you know the element tag / attributes / classes /
} etc. I've been using traverse_text which requires going through every
} text node and filtering out the ones that are empty / whitespace. Is
} there an easier way to find all elements with non-empty text nodes?
nodes = []
# Keep the parent of every text node with at least one non-whitespace character.
doc.traverse_text { |t| nodes << t.parent if t.content.to_s =~ /\S/ }
} This is in reference to parsing HTML pages which may or may not be
} well-formed.
I've found Hpricot to be remarkably resilient in parsing questionable HTML.
} All in all - I really like Hpricot. I was using REXML and tidy before,
} but this is a lot simpler and faster!
}
} Thanks to _why the lucky stiff for a great little HTML parser...
I'll second that.
--Greg