On Jan 18, 10:18 pm, Stedwick <philip.broc... / gmail.com> wrote:
> This is just a whimsical question, really. I've been working on a
> website where people can vote on episodes of TV shows (and I happen to
> be a big Star Trek fan, so I'm starting there ha ha). By the way, the
> website is, literally, 40 lines of code. I'm loving Ruby on Rails so
> far.
>
> http://brocoum.com/voter/startrekvoyager/episodes
>
> Anyway, I need to extract the episode descriptions for the tool tips,
> and the descriptions come from TV.com. Unfortunately, this has turned
> out to be rather harder than it looks!
>
> http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.ht...
>
> If any of you feel up to the challenge, see if you can streamline my
> code below, or write better code yourself. I can't help but think that
> there's an easier way to do this!
>
> # open html file
> f = File.read("episode_guide.html")
>
> # keep track of the number of descriptions found
> count = 0
>
> # each description is enclosed in a multiline <p> </p> tag
> f.scan(/<p>.*?<\/p>/m) do |match|
>   # start with a blank description
>   desc = ''
>   # i want to condense each desc into a single line, and remove the stardate info
>   match.each_line {|m|
>     # remove stardate...<br /> because the stardate is not always on its own line
>     m.sub!(/^.*<br \/>/,'')
>     # remove unnecessary whitespace from beginning
>     m.sub!(/^\s*/,'')
>     # add non-stardate and non-blank lines to the desc and remove trailing \n
>     desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
>   }
>
>   # remove html tags
>   desc.gsub!(/<.*?>/,'')
>   # fix periods ie. "Hi there.I love you." => "Hi there. I love you."
>   # these period problems were caused by concatenating the paragraphs above into one line
>   desc.gsub!(/(\w\.)(\w)/,'\1 \2')
>   # fix stupid html &nbsp; type stuff
>   desc.gsub!(/&nbsp;/," ")
>   desc.gsub!(/&#39;/,"'")
>   # make all spaces single
>   desc.gsub!(/ {2,}/,' ')
>
>   # output finished description followed by blank line and increment counter
>   puts desc + "\n\n"
>   count += 1
> end
>
> # make sure i got all 176 episode descriptions
> puts count
>
> Philip

This is not exactly what you want, but you may find it helpful. It parses
the printable episode guide with Hpricot instead of regexes:

require 'hpricot'
require 'open-uri'

url = 'http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?printable=1'
doc = Hpricot(open(url))

# Walk each div directly under the first div of the page body
doc.search("/html/body/div[1]/div").each do |div|
  # Print the text of each h1 link (presumably the episode title)
  div.search("h1/a") do |h1|
    puts h1.inner_text.gsub(/\s+/, " ").strip
  end

  # Print each description block on one line, followed by a blank line
  div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc|
    puts desc.inner_text.gsub(/\s+/, " ").strip
    puts
  end
end
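
If you would rather stay with plain regexes, the cleanup steps from your
script can probably be folded into one small method. This is only a sketch
using the standard library: clean_description is a name I made up, and it
assumes each fragment looks like the <p>...</p> blocks you describe, with the
stardate followed by a <br /> tag. I have not checked it against the real
TV.com markup.

```ruby
require 'cgi'

# Hypothetical helper (not part of any library): turn one scraped
# <p>...</p> fragment into a single clean line of text.
def clean_description(fragment)
  # Assumption: the stardate runs from "Stardate:" up to the next <br> tag
  text = fragment.gsub(/stardate:.*?<br\s*\/?>/im, ' ')
  # Line breaks and paragraph tags become spaces so sentences do not fuse
  text = text.gsub(/<br\s*\/?>|<\/?p>/i, ' ')
  # Any remaining tags are simply dropped
  text = text.gsub(/<[^>]*>/, '')
  # CGI.unescapeHTML does not decode &nbsp;, so handle it by hand first
  text = text.gsub('&nbsp;', ' ')
  # Decode &#39;, &amp;, &quot;, &lt; and &gt;
  text = CGI.unescapeHTML(text)
  # Collapse every run of whitespace (including newlines) to one space
  text.gsub(/\s+/, ' ').strip
end

sample = "<p>\n  Stardate: 46379.1<br />\n  Sisko&#39;s  crew&nbsp;arrives.<br />\n  They stay.\n</p>"
puts clean_description(sample)
```

Turning <br /> into a space before stripping the other tags avoids the
"Hi there.I love you." problem, so the period fix-up should not be needed.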