On 17/12/2006, at 11:15 PM, Paul Lutus wrote:

> Henry Maddocks wrote:
>
>> Sorry, try again...
>>
>> Not sure where to send this, sorry if it's not the right place...
>>
>> The html in the attached file renders 'correctly' in the 3 browsers I
>> have tried but it tricks hpricot because of the second malformed
>> comment. When I say correctly I mean I get to see 'Some text'. I
>> guess it could be argued that this is incorrect. For my application
>> it would be nice if hpricot behaved like a browser.

Paul, before I address your response directly I will say that I am
aware of your crusade against html parsing libraries and, while I
believe you are entitled to your opinion, I disagree with it. I have
done enough of this sort of thing to know that, for me, the level of
abstraction that these libraries give is beneficial in both development
time and maintenance. I am neither an html nuby nor a ruby nuby. I am
also aware that my needs may not match those of someone else, so I'm
not going to ram my opinions down their throat every time they ask for
a little help.

> You have created a new thread, and you have not attached any prior
> text. This requires us to start over.

As this is the first time I have posted on this subject, that much is
obvious. Unless I am missing something.

> Tell us what you hoped would happen, what happened instead, and how
> they differ.

Run the script and that too will be obvious.

> If your goal is to filter particular content from HTML pages, just
> say so, and be specific about what you want and don't want. Given
> this information, I will show you how to extract the desired content
> with a few lines of Ruby, no fuss, no undue complexity, no Hpricot.

My goal is to highlight an issue I found with a particular library and
to provide a minimal piece of sample code that shows the problem. I
posted it here so that there may be some discussion with interested
people as to the desired behaviour.

> IIRC, you had asked for help using Hpricot to extract text between
> <p> and </p> tag pairs, but with the added requirement that there be
> an IMG tag within the <p> ... </p> tag pair to validate the case. Is
> this still the goal? If so, how did my previously posted, simple
> solution work out for you?

What IMG tag? There isn't one in the sample code. What previous
solution? You do not recall correctly.

> This is a scene in a much larger play, one in which someone says,
> "Wow, I had no idea there was such a powerful library, so carefully
> designed, so complete. But, notwithstanding its extraordinary
> features, notwithstanding the hundreds of man-hours expended creating
> it ... I can't get it to do what I want."

The incident that prompted my post went thus... I had a page that
seemed to render fine in a browser, but when I parsed it my code
failed. I inspected the html and found a malformed comment to be the
problem. Probably put there to stop screen scraping. I wrote a bit of
code, using regexps no less, that removed the offending comment, and
hpricot then went on its merry way. Job done. I thought others might be
interested so I posted some sample code. I am now regretting that
decision.

> This is a very common refrain. I think I can solve your problem with
> a few lines of Ruby code, code that you can easily understand and
> adapt to specific and evolving requirements. And if I cannot do this,
> I will say so.

I could too, but I don't care.

> --
> Paul Lutus

Thanks for hijacking my thread. Thanks for nothing.