On Feb 28, 12:36 am, Chirantan <chirantan.rajh... / gmail.com> wrote:
> I have some HTML code in a string. I want to retrieve the content
> (which can be any HTML with any number of tags) inside the div, from
> after the heading to the end of the div.
>
> Example,
>
> <div class="info">
> <h5>Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
>
> <div class="info">
> <h5>Plot Outline:</h5>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
> </div>
>
> In the above example, if "Plot Outline:" is the header that I am
> looking for, then the regex should give me -
>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
>
> And if "Tagline:" is what I am looking for, then the regex should give me -
>
> Yippee Ki Yay Mo - John 6:27
>
> I hope the problem statement is clear.

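Note that this will give spurious results if an HTML comment happens
to contain what you are looking for. It also assumes the DIVs are not
nested, since the non-greedy match stops at the first </div> it finds.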
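
One way to sidestep the comment problem is to strip comments from the
string before searching it. This is only a rough sketch: the helper name
strip_html_comments and the sample markup are made up for illustration,
and it assumes comments are well-formed and do not appear inside
attribute values or scripts.

```ruby
# Hypothetical helper (not part of the script in this post): remove HTML
# comments so that commented-out markup cannot produce a false match.
def strip_html_comments(html)
  # Non-greedy, multiline: each comment ends at the first "-->" after it opens.
  html.gsub(/<!--.*?-->/m, "")
end

# Made-up sample: the first DIV is commented out and should be ignored.
sample = <<~HTML
  <!-- <div class="info"><h5>Tagline:</h5>Old tagline</div> -->
  <div class="info">
  <h5>Tagline:</h5>
  Yippee Ki Yay Mo - John 6:27
  </div>
HTML

cleaned = strip_html_comments(sample)
puts cleaned
```

Passing the cleaned string to the search instead of the raw one keeps
the commented-out DIV from being matched first.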

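def find_header(header, html)
  # Put the inner content of all of the DIVs in an array.
  divs = html.scan( %r{<div\b[^>]*>(.*?)</div>}im ).flatten
  divs.each{|s|
    # Escape the header so that any regex metacharacters in it
    # (e.g. "?" or "(") are matched literally.
    if s =~ %r{<h(\d)>#{Regexp.escape(header)}</h\1>(.*)}im
      # $2 is everything after the closing heading tag.
      return $2.strip
    end
  }
  nil
end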

html = DATA.read

puts find_header( "Plot Outline:", html )

__END__
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>