Henry,

There was someone just a few days ago who had a problem with using Hpricot
and IMG elements in P tags. Paul must have gotten you two confused.

On 12/18/06, Henry Maddocks <henryj / paradise.net.nz> wrote:
>
> On 17/12/2006, at 11:15 PM, Paul Lutus wrote:
>
> > Henry Maddocks wrote:
> >
> >> Sorry, try again...
> >>
> >> Not sure where to send this, sorry if it's not the right place...
> >>
> >> The html in the attached file renders 'correctly' in the 3 browsers I
> >> have tried but it tricks hpricot because of the second malformed
> >> comment. When I say correctly I mean I get to see 'Some text'. I
> >> guess it could be argued that this is incorrect. For my application
> >> it would be nice if hpricot behaved like a browser.
>
> Paul,
>
> Before I address your response directly I will say that I am aware of
> your crusade against html parsing libraries and while I believe you
> are entitled to your opinion, I disagree with it. I have done enough
> of this sort of thing to know that, for me, the level of abstraction
> that these libraries give is beneficial in both development time and
> maintenance. I am neither an html nuby, nor a ruby nuby. I am also
> aware that my needs may not match those of someone else so I'm not
> going to ram my opinions down their throats every time they ask for a
> little help.
>
> > You have created a new thread, and you have not attached any prior text.
> > This requires us to start over.
>
> As this is the first time I have posted on this subject, that much is
> obvious. Unless I am missing something.
>
> > Tell us what you hoped would happen, what happened instead, and how they
> > differ.
>
> Run the script and that too will be obvious.
>
> > If your goal is to filter particular content from HTML pages, just say so,
> > and be specific about what you want and don't want.
> > Given this information, I will show you how to extract the desired
> > content with a few lines of Ruby, no fuss, no undue complexity, no
> > Hpricot.
>
> My goal is to highlight an issue I found with a particular library
> and provide some sample code that shows the problem with the minimum
> amount of code. I posted it here so that there may be some discussion
> with interested people as to the desired behaviour.
>
> > IIRC, you had asked for help using Hpricot to extract text between <p> and
> > </p> tag pairs, but with the added requirement that there be an IMG tag
> > within the <p> ... </p> tag pair to validate the case. Is this still the
> > goal? If so, how did my previously posted, simple solution work out for
> > you?
>
> What IMG tag? There isn't one in the sample code. What previous
> solution? You do not recall correctly.
>
> > This is a scene in a much larger play, one in which someone says, "Wow, I
> > had no idea there was such a powerful library, so carefully designed, so
> > complete. But, notwithstanding its extraordinary features, notwithstanding
> > the hundreds of man-hours expended creating it ... I can't get it to do
> > what I want."
>
> The incident that prompted my post went thus...
> I had a page that seemed to render fine in a browser but when parsing
> it my code failed. I inspected the html and found a malformed comment
> to be the problem. Probably put there to stop screen scraping. I
> wrote a bit of code, using regexps no less, that removed the
> offending comment and hpricot then went on its merry way. Job done.
> I thought others may be interested so I posted some sample code. I am
> now regretting that decision.
>
> > This is a very common refrain. I think I can solve your problem with a few
> > lines of Ruby code, code that you can easily understand and adapt to
> > specific and evolving requirements. And if I cannot do this, I will say so.
>
> I could too, but I don't care.
>
> > --
> > Paul Lutus
>
> Thanks for hijacking my thread. Thanks for nothing.
>

--
Chris Carter
concentrationstudios.com
brynmawrcs.com