On Mon, Mar 23, 2009 at 9:49 AM, Arun Kumar <arunkumar / innovaturelabs.com> wrote:

> Can anybody tell me how to extract all the contents which are included
> inside the '<html>' and '</html>' tag and also to extract the text given
> in between the '<a>' and '</a>' tag using regular expression. I know it
> can be extracted using the 'scan' method but I don't know what should be
> the matching patterns or expressions. Can anybody please help me?


Let's assume we have the following content:

<html>
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>
</html>

Here are two quick and dirty regexps:

/<html>(.*)<\/html>/m
This regexp will capture anything between an opening html tag and a closing
one. The /m option specifies multiline mode: "." will match any character,
including a newline.
For our content, it will capture:
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>
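
To try this from irb, here is a minimal sketch. The variable names (html,
inner) are made up for illustration, and it assumes Ruby 2.3 or later for
the <<~ heredoc:

```ruby
# The sample content from this post.
html = <<~HTML
  <html>
  <body>
  <p>
  Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
  </p>
  </body>
  </html>
HTML

# String#[] with a regexp and a group number returns that capture group,
# or nil when the pattern does not match.
inner = html[/<html>(.*)<\/html>/m, 1]
puts inner

# Without /m, "." stops at each newline, so the opening and closing tags
# (which sit on different lines) can never be part of one match.
no_match = html[/<html>(.*)<\/html>/, 1]
```

Note that the capture keeps the newlines right after <html> and right
before </html>; call strip on it if you don't want them.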

/<a.*>(.*)<\/a>/
This regexp will capture the text between an opening anchor element and a
closing one. The first ".*" is there to deal with href and any other
attribute. Since "." does not match a newline by default, you might wanna
throw the /m option in there too if a tag or its text can span several lines.
For our content, it will capture:
Rubular
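
Since you mentioned scan, here is a rough sketch of how that plays out. The
strings and variable names are made up for illustration, and the last
pattern is my own variation rather than one of the two regexps above:

```ruby
# A link whose tag is broken across two lines (mail wrapping can do this).
wrapped = "Check out <a href=\"\nhttp://www.rubular.com/\">Rubular</a>.\n"

# Without /m, "." cannot cross the newline inside the tag: no match.
single_line = wrapped.scan(/<a.*>(.*)<\/a>/)

# With /m, the same pattern matches across the line break.
multi_line = wrapped.scan(/<a.*>(.*)<\/a>/m)

# Two links on one line show the "dirty" part: the greedy ".*" runs to the
# last anchor, so only the final link text is captured.
two_links = '<a href="/a">One</a> and <a href="/b">Two</a>'
greedy = two_links.scan(/<a.*>(.*)<\/a>/)

# Assumed variant: [^>]* stays inside the opening tag and the lazy .*?
# stops at the first closing tag, so each link is matched separately.
lazy = two_links.scan(/<a[^>]*>(.*?)<\/a>/m)

p single_line, multi_line, greedy, lazy
```

scan returns one array per match holding its capture groups, so call
flatten on the result if you just want a flat list of link texts. The
variant is still quick and dirty: it would also match other tags whose
names start with "a".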

On Mon, Mar 23, 2009 at 11:18 AM, Arun Kumar <arunkumar / innovaturelabs.com>
wrote:

> I know that using mechanize or hpricot is a far better option in this
> case. But I'm just asking as a matter of curiosity to know about regexps.


Dare I say, a man should use regexps if only to satisfy his curiosity. ;-)

Regards,
Yaser