James Edward Gray II wrote:

> On Nov 19, 2006, at 8:50 PM, Paul Lutus wrote:
> 
>> Chris Gallagher wrote:
>>
>>> OK that code all works great but i have one last question :)

/ ...

>>>> My question is, how would i modify the
>>> code in order to get it to capture say a block of text such as:
>>>
>>>  <p>this is text that i want to scrape</p>
>>>
>>> any ideas?
>>
>> Really simple:
>>
>> array = page_content.scan(%r{<p>(.*?)</p>}m).flatten
>>
>> Returns an array, each cell of which is a paragraph from the
>> original page.

/ ...

>> In Ruby, writing normal code is so easy that the traditional cautions
>> against adopting miraculous libraries should be amplified tenfold.
> 
> I hope you're not arguing that HTML should be parsed with simple
> regular expression instead of a real parser.  I think most would
> agree with me when I say that strategy seldom holds up for long.

That depends on the complexity of the problem to be solved, and the
reliability of the source page's HTML formatting.

For a page that passes validation of one kind or another, or that is
well-formed XHTML, the simplest kinds of parsers provide terrific results.
For legacy pages, and for those that can be expected to have "relaxed"
syntax, more robust parsers are required.

That said, I regularly see requests here for parsers that can be expected
to handle anything, but as often as not, IMHO, such a library represents
too much complexity for the majority of routine HTML/XML parsing tasks --
tasks involving Web pages and documents that are often generated, not
hand-written.

This thread is an example. It began with the generic equivalent of "Is
there a library that can ...", followed almost immediately by "Great! But
how do I make it do this ...", a request for a truly trivial extraction
step that can be accomplished in a single line of Ruby.
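To make that concrete, here is the one-liner in context. This is a sketch:
the sample page content is invented for illustration, and in practice the
string would come from an HTTP fetch.

```ruby
# Stand-in for a fetched page; in real use this would come from Net::HTTP
# or a similar source.
page_content = "<html><body><p>first paragraph</p><p>second</p></body></html>"

# Non-greedy capture pulls the text of each <p> element; the /m modifier
# lets paragraph text span line breaks.
paragraphs = page_content.scan(%r{<p>(.*?)</p>}m).flatten
# paragraphs => ["first paragraph", "second"]
```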

I find this rather ironic, since Ruby is meant to provide an easy way to
create solutions to everyday problems. One then sees a blizzard of
libraries whose purpose is to shield the user from the complexities of the
language, with the result that the remedy is often more complex than the
problem it is meant to solve.

In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but apparently
never considered writing code to solve the problem directly. After choosing
a library, the OP realized he didn't see an obvious way to solve the
original problem -- extracting specific content from the source pages.

As to modern XHTML Web pages that can pass a validator, I know from recent
direct experience that they yield to the simplest parser design, and can be
relied on to produce a tree of organized content, stripped of tags and
XHTML-specific formatting, in a handful of lines of Ruby code. It is hard
to justify bringing out the big guns for a task like this when one could
instead use a small, self-documenting routine such as the one I suggested.
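As a sketch of that claim (not necessarily the exact routine I had in mind
earlier), Ruby's bundled REXML library turns a well-formed XHTML fragment
into a navigable tree in a few lines; the fragment below is invented for
illustration.

```ruby
require 'rexml/document'

# A small, well-formed XHTML fragment; a real page would be fetched first.
xhtml = "<div><h1>Title</h1><p>this is text that i want to scrape</p></div>"

doc = REXML::Document.new(xhtml)

# Collect the text of every <p> element, tags stripped, via XPath.
texts = []
doc.elements.each('//p') { |e| texts << e.text }
# texts => ["this is text that i want to scrape"]
```

Note that this only works because the input is well-formed; feed REXML the
"relaxed" markup of a legacy page and it will raise a parse error, which is
exactly the dividing line described above.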

In the bad old days of assembly and of comparatively heavy, inflexible
languages like C and C++, it was easy to see why people were motivated to
create specialized libraries that solve generic problems once, for all
time. In fact, the argument can be made that Ruby itself is just such a
library of generics -- broadly speaking, an extension and amplification of
the STL project.

Now we see people writing easy-to-use application libraries, each composed
with the easy-to-use Ruby language, yet sometimes harder to sort out, or to
put to practical use, than a short bit of direct code would have been.

Lest my readers think I am going overboard here on a topic dear to my heart,
let me quote the OP once again:

>>> OK that code all works great but i have one last question :)
>>>
>>> This is allowing me to scrape the values of the class values on
>>> tags and
>>> any other attribues such as that. My question is, how would i
>>> modify the
>>> code in order to get it to capture say a block of text such as:
>>>
>>>  <p>this is text that i want to scrape</p>
>>>
>>> any ideas?

In other words, after choosing a library and playing with it for a while, he
found himself back at square one, unable to solve the original problem.

To quote one of my favorite authors (William Burroughs), it seems people are
busy inventing cures for which there are no diseases.

-- 
Paul Lutus
http://www.arachnoid.com