On Jan 18, 10:18 pm, Stedwick <philip.broc... / gmail.com> wrote:
> This is just a whimsical question, really. I've been working on a
> website where people can vote on episodes of TV shows (and I happen to
> be a big Star Trek fan, so I'm starting there ha ha). By the way, the
> website is, literally, 40 lines of code. I'm loving Ruby on Rails so
> far.
>
> http://brocoum.com/voter/startrekvoyager/episodes
>
> Anyway, I need to extract the episode descriptions for the tool tips,
> and the descriptions come from TV.com. Unfortunately, this has turned
> out to be rather harder than it looks!
>
> http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.ht...
>
> If any of you feel up to the challenge, see if you can streamline my
> code below, or write better code yourself. I can't help but think that
> there's an easier way to do this!
>
> # open html file
> f = File.read("episode_guide.html")
>
> # keep track of the number of descriptions found
> count = 0
>
> # each description is enclosed in a multiline <p> </p> tag
> f.scan(/<p>.*?<\/p>/m) do |match|
>   # start with a blank description
>   desc = ''
>   # i want to condense each desc into a single line, and remove the
>   # stardate info
>   match.each_line {|m|
>     # remove stardate...<br /> because the stardate is not always on
>     # its own line
>     m.sub!(/^.*<br \/>/,'')
>     # remove unnecessary whitespace from beginning
>     m.sub!(/^\s*/,'')
>     # add non-stardate and non-blank lines to the desc and remove
>     # trailing \n
>     desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
>   }
>
>   # remove html tags
>   desc.gsub!(/<.*?>/,'')
>   # fix periods ie. "Hi there.I love you." => "Hi there. I love you."
>   # these period problems were caused by concatenating the paragraphs
>   # above into one line
>   desc.gsub!(/(\w\.)(\w)/,'\1 \2')
>   # fix stupid html type stuff
>   desc.gsub!(/&nbsp;/," ")
>   desc.gsub!(/&#39;/,"'")
>   # make all spaces single
>   desc.gsub!(/ {2,}/,' ')
>
>   # output finished description followed by blank line and increment
>   # counter
>   puts desc + "\n\n"
>   count += 1
> end
>
> # make sure i got all 176 episode descriptions
> puts count
>
> Philip

This is not exactly what you want, but you may find it helpful:

require 'hpricot'
require 'open-uri'

# printable version of the episode guide
url = 'http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?printable=1'
@doc = Hpricot(open(url))

# each episode sits in its own div under the first top-level div
@doc.search("/html/body/div[1]/div").each do |div|
  # episode title
  div.search("h1/a") do |h1|
    puts h1.inner_text.strip().squeeze(" ").gsub("\n"," ")
  end
  # episode description, collapsed onto a single line
  div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc|
    puts desc.inner_text.strip().squeeze(" ").gsub("\n"," ")
    puts
  end
end
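
If you would rather stay with plain regexes, the per-paragraph cleanup in your script can probably be folded into one small method. This is only a rough sketch, assuming the page looks the way you describe (each description in a multiline <p> tag, with the stardate either on its own line or followed by <br />). The method names clean_description and extract_descriptions are made up, the sample HTML is invented, and the two entities handled (&nbsp; and &#39;) are my guess at what the page contains.

```ruby
# Hypothetical helper: condense one <p>...</p> block into a single line.
def clean_description(paragraph)
  lines = paragraph.each_line.map do |line|
    # drop everything up to "<br />" (the stardate prefix), then trim
    line.sub(/^.*<br \/>/, '').strip
  end
  # skip stardate-only lines and lines with no word characters
  lines.reject! { |l| l =~ /stardate:/i || l !~ /\w/ }
  lines.join(' ')          # joining with a space avoids "there.I love"
       .gsub(/<.*?>/, '')  # remove html tags
       .gsub('&nbsp;', ' ')
       .gsub('&#39;', "'")
       .squeeze(' ')       # make all spaces single
       .strip
end

# Hypothetical helper: one cleaned description per <p> block.
def extract_descriptions(html)
  html.scan(/<p>.*?<\/p>/m).map { |para| clean_description(para) }
end

# Invented sample input, not taken from the real page.
sample = <<~HTML
  <p>
    Stardate: 46379.1<br />
    Commander Sisko arrives&nbsp;at the station.
    He isn&#39;t happy about it.
  </p>
HTML

descriptions = extract_descriptions(sample)
puts descriptions
puts descriptions.size
```

Joining the lines with a space, instead of concatenating and then repairing the periods afterwards, is the main difference from the original. To run it on the real file, pass File.read("episode_guide.html") to extract_descriptions and compare the size against the 176 you expect.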