On Feb 29, 1:14 am, William James <w_a_x_... / yahoo.com> wrote:
> On Feb 28, 9:50 am, William James <w_a_x_... / yahoo.com> wrote:
>
>
>
> > On Feb 28, 12:36 am, Chirantan <chirantan.rajh... / gmail.com> wrote:
>
> > > I have some HTML code in a string. I want to retrieve the content (which
> > > can be any HTML with any number of tags) inside the div, from just after
> > > the heading to the end of the div.
>
> > > Example,
>
> > > <div class="info">
> > > <h5>Tagline:</h5>
> > > Yippee Ki Yay Mo - John 6:27
> > > </div>
>
> > > <div class="info">
> > > <h5>Plot Outline:</h5>
> > > John McClane takes on an Internet-based terrorist organization who is
> > > systematically shutting down the United States. <a class="tn15more
> > > inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> > > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > > link=/title/tt0337978/plotsummary';">more</a>
> > > </div>
>
> > > In the above example, if "Plot Outline:" is the header that I am looking
> > > for, then the regex should give me -
>
> > > John McClane takes on an Internet-based terrorist organization who is
> > > systematically shutting down the United States. <a class="tn15more
> > > inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> > > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > > link=/title/tt0337978/plotsummary';">more</a>
>
> > > And if "Tagline:" is what I am looking for, then the regex should give me -
>
> > > Yippee Ki Yay Mo - John 6:27
>
> > > I hope the problem statement is clear.
>
> > Note that this will give spurious results if an html comment happens
> > to contain what you are looking for.
>
> > def find_header header, html
> >   # Put all of the DIVs in an array.
> >   divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
> >   divs.each{|s|
> >     if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
> >       return $2.strip
> >     end
> >   }
> >   return nil
> > end
>
> > html = DATA.read
>
> > puts find_header( "Plot Outline:", html )
>
> > __END__
> > <div class="info">
> > <h5>Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
>
> > <div class="info">
> > <h5>Plot Outline:</h5>
> > John McClane takes on an Internet-based terrorist organization who is
> > systematically shutting down the United States. <a class="tn15more
> > inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > link=/title/tt0337978/plotsummary';">more</a>
> > </div>
>
> More concise:
>
> def find_header header, html
>   html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
>     return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
>   return nil
> end

Thank you William and Mark,

The code worked. :-) Thanks a lot.
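
For anyone reusing this: William's note about HTML comments, and the fact that the header is interpolated into the regex unescaped, can both be handled with a small variation. The following is only a sketch under stated assumptions, not the code posted in the thread. The name find_header_safe is hypothetical, and it assumes that simply deleting <!-- ... --> comments before scanning is acceptable for your input. Like the original, it will not cope with nested divs.

```ruby
# Hypothetical helper (not from the thread): same idea as find_header,
# but it drops HTML comments first and escapes the header text.
def find_header_safe(header, html)
  # Remove HTML comments so a commented-out div cannot produce a match.
  cleaned = html.gsub(/<!--.*?-->/m, "")
  # Regexp.escape keeps characters such as "?" or "(" in the header literal.
  pattern = %r{<h(\d)>#{Regexp.escape(header)}</h\1>(.*)}im
  # Collect the body of each div, then look for the heading inside it.
  cleaned.scan(%r{<div.*?>(.*?)</div>}im).flatten.each do |body|
    match = pattern.match(body)
    return match[2].strip if match
  end
  nil
end

html = <<~HTML
  <!-- <div class="info"><h5>Tagline:</h5>Old tagline</div> -->
  <div class="info">
  <h5>Tagline:</h5>
  Yippee Ki Yay Mo - John 6:27
  </div>
HTML

puts find_header_safe("Tagline:", html)
```

For anything beyond simple, flat markup like this, a real HTML parser is likely a safer choice than regular expressions.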