Challenge: Extract episode descriptions

stedwick · January 19, 2008, 5:20am

This is just a whimsical question, really. I’ve been working on a
website where people can vote on episodes of TV shows (and I happen to
be a big Star Trek fan, so I’m starting there ha ha). By the way, the
website is, literally, 40 lines of code. I’m loving Ruby on Rails so
far.

http://brocoum.com/voter/startrekvoyager/episodes

Anyway, I need to extract the episode descriptions for the tool tips,
and the descriptions come from TV.com. Unfortunately, this has turned
out to be rather harder than it looks!

http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?season=0&tag=season_dropdown;dropdown;7

If any of you feel up to the challenge, see if you can streamline my
code below, or write better code yourself. I can’t help but think that
there’s an easier way to do this!

# open html file
f = File.read("episode_guide.html")

# keep track of the number of descriptions found
count = 0

# each description is enclosed in a multiline <p> tag
# (the tag name was lost when this post was archived; <p> is assumed)
f.scan(/<p>.*?<\/p>/m) do |match|
  # start with a blank description
  desc = ''

  # i want to condense each desc into a single line, and remove the
  # stardate info
  match.each_line {|m|
    # remove stardate...
    # because the stardate is not always on its own line
    m.sub!(/^.*&nbsp;/,'')
    # remove unnecessary whitespace from beginning
    m.sub!(/^\s*/,'')
    # add non-stardate and non-blank lines to the desc and remove
    # trailing \n
    desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
  }

  # remove html tags
  desc.gsub!(/<.*?>/,'')

  # fix periods ie. "Hi there.I love you." => "Hi there. I love you."
  # these period problems were caused by concatenating the paragraphs
  # above into one line
  desc.gsub!(/(\w\.)(\w)/,'\1 \2')

  # fix stupid html type stuff
  # (the apostrophe entity was rendered away in the archive; &#39; is assumed)
  desc.gsub!(/&nbsp;/," ")
  desc.gsub!(/&#39;/,"'")

  # make all spaces single
  desc.gsub!(/ {2,}/,' ')

  # output finished description followed by blank line and increment
  # counter
  puts desc + "\n\n"
  count += 1
end

# make sure i got all 176 episode descriptions
puts count

Philip
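
For anyone following along, the cleanup half of the script above can be sketched as a single helper. This is only an illustration under stated assumptions: the name clean_description, the sample string, and the &#39; spelling of the apostrophe entity are hypothetical and do not come from the post. It also swaps each tag for a space instead of deleting it, which is a different choice from the original script.

```ruby
# Hypothetical helper; the name and the sample input are illustrative only.
def clean_description(fragment)
  fragment
    .gsub(/<.*?>/m, ' ')   # swap each tag for a space so adjacent sentences stay separated
    .gsub('&nbsp;', ' ')   # non-breaking space entity
    .gsub('&#39;', "'")    # apostrophe entity (assumed spelling)
    .gsub(/\s+/, ' ')      # collapse runs of whitespace, including newlines
    .strip
end

sample = "<p>Hi there.<br />I&nbsp;love  you.\nIt&#39;s true.</p>"
puts clean_description(sample)
```

Because tags become spaces before whitespace is collapsed, the "Hi there.I love you." problem should not arise in this version, so no separate period-fixing pass is needed.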

stedwick · January 19, 2008, 7:40am

On Jan 18, 10:18 pm, Stedwick wrote:

[…]

Look into Hpricot - http://code.whytheluckystiff.net/hpricot/ - or
another HTML parser. It makes things like this much easier - no need
for regexes.

stedwick · January 19, 2008, 7:42am

2008/1/19, Stedwick:

[…]

You should take a look at the Hpricot gem to make the
HTML scraping easier.

-- Jean-François.

stedwick · January 20, 2008, 5:41am

Stedwick wrote:

[…]

# (the tag name was lost when this post was archived; <p> is assumed)
text = IO.read("episode_guide.html")
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
  map{|s|
    s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
      gsub(/\s+/, " ") }
puts a.join("\n\n")
puts
puts a.size

stedwick · January 20, 2008, 12:01am

On Jan 18, 10:18 pm, Stedwick wrote:

[…]

This is not exactly what you want, but you may find it helpful:

require 'hpricot'
require 'open-uri'

url = 'http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?printable=1'
@doc = Hpricot(open(url))

# walk each episode block on the printable page
@doc.search("/html/body/div[1]/div").each do |div|

  # episode title
  div.search("h1/a") do |h1|
    puts h1.inner_text.strip().squeeze(" ").gsub("\n"," ")
  end

  # episode description (block variable renamed so it does not clobber the outer div)
  div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc_div|
    puts desc_div.inner_text.strip().squeeze(" ").gsub("\n"," ")
    puts
  end

end

stedwick · January 20, 2008, 10:40am

On Jan 19, 10:39 pm, William J. wrote:

[…]

Corrected:

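# (the tag and the apostrophe entity were lost when this post was archived;
# <p> and &#39; are assumed)
text = IO.read("episode_guide.html")
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
  map{|s|
    s.gsub(/&nbsp;/," ").gsub(/<.*?>/m,"").gsub("&#39;","'").
      gsub(/\s+/, " ").strip }
puts a.join("\n\n")
puts
puts a.size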
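
To see how the scan-based approach behaves without the saved TV.com page, here is a small sketch that runs it on an inline sample. The markup in SAMPLE_HTML is invented to resemble the structure the thread describes, and the method name extract_descriptions, the <p> wrapper, and the &#39; entity are all assumptions rather than details confirmed by the posts.

```ruby
# Hypothetical sample markup; not copied from TV.com.
SAMPLE_HTML = <<~HTML
  <p>
  Stardate: 48315.6<br />
  First episode summary.<br />
  It&#39;s a&nbsp;long way home.
  </p>
  <p>
  Stardate: Unknown<br />
  <b>Second</b> episode summary.
  </p>
HTML

# Hypothetical wrapper around the scan-and-clean pipeline.
def extract_descriptions(html)
  # Capture everything between the stardate line and the closing tag.
  html.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.map do |s|
    s.gsub(/&nbsp;/, ' ')      # non-breaking spaces become plain spaces
     .gsub(/<.*?>/m, '')       # drop any remaining tags
     .gsub('&#39;', "'")       # apostrophe entity (assumed spelling)
     .gsub(/\s+/, ' ')         # condense to a single line
     .strip
  end
end

descriptions = extract_descriptions(SAMPLE_HTML)
puts descriptions.join("\n\n")
puts
puts descriptions.size
```

One thing worth noting about the pattern: with the i flag, [ a-z.\d]* is greedy and matches letters as well as digits, so it only stops cleanly when the stardate is followed by something outside that class, such as the <br /> tag in this sample.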

stedwick · January 21, 2008, 11:35pm

On Jan 20, 4:38 am, William J. wrote:

[…]

I’m liking yours so far, William. It’s pretty elegant.