On 17/12/2006, at 11:15 PM, Paul Lutus wrote:

> Henry Maddocks wrote:
>
>> Sorry, try again...
>>
>> Not sure where to send this, sorry if it's not the right place...
>>
>> The html in the attached file renders 'correctly' in the 3 browsers I
>> have tried but it tricks hpricot because of the second malformed
>> comment. When I say correctly I mean I get to see 'Some text'. I
>> guess it could be argued that this is incorrect. For my application
>> it would be nice if hpricot behaved like a browser.

Paul,

Before I address your response directly, I will say that I am aware of
your crusade against html parsing libraries, and while I believe you
are entitled to your opinion, I disagree with it. I have done enough
of this sort of thing to know that, for me, the level of abstraction
that these libraries give is beneficial in both development time and
maintenance. I am neither an html nuby nor a ruby nuby. I am also
aware that my needs may not match those of someone else, so I'm not
going to ram my opinions down their throat every time they ask for a
little help.


> You have created a new thread, and you have not attached any prior
> text. This requires us to start over.

As this is the first time I have posted on this subject, that much is  
obvious. Unless I am missing something.


> Tell us what you hoped would happen, what happened instead, and how
> they differ.

Run the script and that too will be obvious.


> If your goal is to filter particular content from HTML pages, just say
> so, and be specific about what you want and don't want. Given this
> information, I will show you how to extract the desired content with a
> few lines of Ruby, no fuss, no undue complexity, no Hpricot.

My goal is to highlight an issue I found with a particular library  
and provide some sample code that shows the problem with the minimum  
amount of code. I posted it here so that there may be some discussion  
with interested people as to the desired behaviour.


> IIRC, you had asked for help using Hpricot to extract text between <p>
> and </p> tag pairs, but with the added requirement that there be an IMG
> tag within the <p> ... </p> tag pair to validate the case. Is this still
> the goal? If so, how did my previously posted, simple solution work out
> for you?

What IMG tag? There isn't one in the sample code. What previous  
solution? You do not recall correctly.


> This is a scene in a much larger play, one in which someone says, "Wow,
> I had no idea there was such a powerful library, so carefully designed,
> so complete. But, notwithstanding its extraordinary features,
> notwithstanding the hundreds of man-hours expended creating it ... I
> can't get it to do what I want."

The incident that prompted my post went thus...
I had a page that seemed to render fine in a browser, but when parsing
it my code failed. I inspected the html and found a malformed comment
to be the problem. It was probably put there to stop screen scraping. I
wrote a bit of code, using regexps no less, that removed the
offending comment, and hpricot then went on its merry way. Job done.
I thought others might be interested, so I posted some sample code. I am
now regretting that decision.
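
For anyone curious, the clean-up step described above can be sketched
roughly as follows. This is not the code I actually ran, only a minimal
illustration. The helper name strip_comments and the sample markup are
hypothetical, and the sketch assumes the malformed comment is one that
opens with "<!--" but is closed by a bare ">" instead of "-->", which may
not match the comment in the attached file.

```ruby
# Hypothetical helper: strips HTML comments before the markup is handed
# to a parser. Assumption: the malformed comment opens with "<!--" and is
# closed by a bare ">" rather than "-->".
def strip_comments(html)
  # Remove well-formed comments first: "<!--" up to the nearest "-->".
  cleaned = html.gsub(/<!--.*?-->/m, '')
  # Then remove any leftover opener up to the next ">" (the assumed
  # malformed case).
  cleaned.gsub(/<!--[^>]*>/, '')
end

# Made-up sample input: one well-formed comment, then one malformed one.
sample = "<p>before</p><!-- ok --><!-- broken ><p>Some text</p>"
puts strip_comments(sample)
```

The cleaned string can then be passed to the parser as usual. Note that
if a malformed comment appears before a well-formed one, the first
pattern will swallow everything between them, so the order of the two
substitutions would need rethinking for that case.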


> This is a very common refrain. I think I can solve your problem  
> with a few
> lines of Ruby code, code that you can easily understand and adapt to
> specific and evolving requirements. And if I cannot do this, I will  
> say so.

I could too, but I don't care.


> -- 
> Paul Lutus

Thanks for hijacking my thread. Thanks for nothing.