Phrogz wrote:
> On May 15, 9:50 am, "M. R." <r... / schwingerverband.ch> wrote:
>> I want to filter the content of a body-Tag in html. How can I do this
>> with regular expression?
>>
>> @h = Net::HTTP.new(url, 80)
>> @response = @h.get(file, nil)
>>
>> if @response.message == "OK"
>>   @body_content = @response.body.scan(/..................../).to_s
>> end
> Assuming your HTML is valid, then simply:
> @body_content = @response.body[ /<body[^>]*>(.+?)<\/body>/m, 1 ]

Whenever someone asks me how to parse HTML with regular expressions, I
usually tell them: don't.  HTML is an extremely complex language; if
you want to parse HTML, use an HTML parser.  For example, the
following snippet is a perfectly well-formed and valid HTML document,
but none of the regexps posted in this thread so far are able to
correctly parse it:

  <HTML/
    <HEAD/
      <TITLE/>/
      <P/>

Oh, and no, there is nothing missing there (well, except for the
DOCTYPE declaration, which I left out for brevity; this snippet is
valid HTML 2.0, HTML 3.2 and HTML 4.01).  That is actually a complete,
well-formed and valid HTML document.  It relies on features that HTML
up to 4.01 inherits from SGML: null end tags (SHORTTAG) and omitted
tags (OMITTAG).

The content of the above document's body element, flattened to a
string, should be something like this: '<P>></P>'.
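
To make the failure concrete, here is a minimal sketch that runs a
case-insensitive variant of the regexp quoted above against both a
conventionally written document and the snippet from this post.  The
method name body_via_regexp and the sample string conventional are
made up for illustration; they are not from the thread.

```ruby
# Hypothetical helper: extracts the body content with a case-insensitive
# variant of the regexp quoted earlier in this thread.
def body_via_regexp(html)
  html[/<body[^>]*>(.+?)<\/body>/mi, 1]
end

# A conventionally written document (illustrative sample, not from the thread).
conventional = "<html><head><title>t</title></head><body><p>hi</p></body></html>"

# The document from this post, which never spells out a BODY tag.
short_tags = <<~SGML
  <HTML/
    <HEAD/
      <TITLE/>/
      <P/>
SGML

p body_via_regexp(conventional)  # prints "<p>hi</p>"
p body_via_regexp(short_tags)    # prints nil
```

The regexp returns nil for the second document because the BODY
element is only implied there, so there is no literal tag to match.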

Using an actual HTML parser like Hpricot should be a much better
choice.  Actually, I just checked, and neither Hpricot nor RubyfulSoup
seems to parse this document correctly either.  Strange.  What other
Ruby HTML parsers are there that I could try?

jwm