James Edward Gray II wrote:

> On Nov 19, 2006, at 8:50 PM, Paul Lutus wrote:
>
>> Chris Gallagher wrote:
>>
>>> OK that code all works great but i have one last question :)

/ ...

>>> My question is, how would i modify the
>>> code in order to get it to capture say a block of text such as:
>>>
>>> <p>this is text that i want to scrape</p>
>>>
>>> any ideas?
>>
>> Really simple:
>>
>> array = page_content.scan(%r{<p>(.*?)</p>}m).flatten
>>
>> Returns an array, each cell of which is a paragraph from the
>> original page.

/ ...

>> In Ruby, writing normal code is so easy that the traditional cautions
>> against adopting miraculous libraries should be amplified tenfold.
>
> I hope you're not arguing that HTML should be parsed with simple
> regular expression instead of a real parser. I think most would
> agree with me when I say that strategy seldom holds up for long.

That depends on the complexity of the problem to be solved, and on the
reliability of the source page's HTML formatting. For a page that can
pass validation of one kind or another, or that is XHTML, the simplest
kinds of parsers provide terrific results. For legacy pages and those
that can be expected to have "relaxed" syntax, more robust parsers are
required.

But I must say I regularly see requests here for parsers that can be
expected to do anything. As often as not, IMHO, such a library
represents too much complexity for the majority of routine HTML/XML
parsing tasks, which involve Web pages and documents that are often
generated, not hand-written.

This thread is an example. It began with the generic equivalent of "Is
there a library that can ...", followed almost immediately by "Great!
But how do I make it do this ...", a request for a really trivial
extraction step that can be accomplished in a single line of Ruby.

I find this rather ironic, since Ruby is meant to provide an easy way
to create solutions to everyday problems.
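For anyone who wants to try the one-line extraction quoted above, here is a minimal sketch. The sample page_content string is made up for illustration; in practice it would hold the text of a fetched page. The pattern is the one from the post and assumes bare <p> tags, so a tag with attributes such as <p class="x"> would not match without adjusting the pattern.

```ruby
# Hypothetical sample input, standing in for the text of a real page.
page_content = <<~HTML
  <html><body>
  <p>this is text that i want to scrape</p>
  <p>a second paragraph
  spanning two lines</p>
  </body></html>
HTML

# Non-greedy match between <p> and </p>; the m flag lets "." match newlines.
# With one capture group, scan returns an array of one-element arrays,
# so flatten turns the result into a flat array of strings.
paragraphs = page_content.scan(%r{<p>(.*?)</p>}m).flatten

paragraphs.each { |text| puts text }
```

Whether a pattern like this is enough depends, as discussed above, on how regular the source markup is.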
One then sees a blizzard of libraries whose purpose is to shield the
user from the complexities of the language, in such a way that the
remedy is often more complex than the problem it is meant to solve.

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently never considered writing code to solve the problem directly.
After choosing a library, the OP realized he didn't see an obvious way
to solve the original problem -- extracting specific content from the
source pages.

As to modern XHTML Web pages that can pass a validator, I know from
direct recent experience that they yield to the simplest parser design,
and can be relied on to produce a tree of organized content, stripped of
tags and XHTML-specific formatting, in a handful of lines of Ruby code.
It is hard to justify bringing out the big guns for a task like this,
when one could instead use a small self-documenting routine such as the
one I suggested.

In the bad old days of assembly and comparatively heavy, inflexible
languages like C, C++ and the like, it was easy to see why people would
be motivated to create specialized libraries to solve generic problems
just once for all time. In fact, the argument can be made that Ruby is
just such a library of generics, broadly speaking an
extension/amplification of the STL project.

Now we see people writing easy-to-use application libraries, each
composed using the easy-to-use Ruby library, but that are sometimes
harder to sort out, or make practical use of, than a short bit of code
would have been.

Lest my readers think I am going overboard here on a topic dear to my
heart, let me quote the OP once again:

>>> OK that code all works great but i have one last question :)
>>>
>>> This is allowing me to scrape the values of the class values on
>>> tags and any other attribues such as that.
>>> My question is, how would i modify the
>>> code in order to get it to capture say a block of text such as:
>>>
>>> <p>this is text that i want to scrape</p>
>>>
>>> any ideas?

In other words, after choosing a library and playing with it for a
while, he found himself back at square one, unable to solve the original
problem.

To quote one of my favorite authors (William Burroughs), it seems people
are busy inventing cures for which there are no diseases.

-- 
Paul Lutus
http://www.arachnoid.com