This is just a whimsical question, really. I’ve been working on a
website where people can vote on episodes of TV shows (and I happen to
be a big Star Trek fan, so I’m starting there ha ha). By the way, the
website is, literally, 40 lines of code. I’m loving Ruby on Rails so
far.
http://brocoum.com/voter/startrekvoyager/episodes
Anyway, I need to extract the episode descriptions for the tool tips,
and the descriptions come from TV.com. Unfortunately, this has turned
out to be rather harder than it looks!
http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?season=0&tag=season_dropdown;dropdown;7
If any of you feel up to the challenge, see if you can streamline my
code below, or write better code yourself. I can’t help but think that
there’s an easier way to do this!
# open html file
f = File.read("episode_guide.html")
# keep track of the number of descriptions found
count = 0
# each description is enclosed in a multiline <p> tag
f.scan(/<p>.*?<\/p>/m) do |match|
  # start with a blank description
  desc = ''
  # I want to condense each desc into a single line, and remove the
  # stardate info
  match.each_line {|m|
    # remove stardate...<br /> because the stardate is not always on
    # its own line
    m.sub!(/^.*<br \/>/,'')
    # remove unnecessary whitespace from beginning
    m.sub!(/^\s*/,'')
    # add non-stardate and non-blank lines to the desc and remove
    # trailing \n
    desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
  }
  # remove html tags
  desc.gsub!(/<.*?>/,'')
  # fix periods, i.e. "Hi there.I love you." => "Hi there. I love you."
  # these period problems were caused by concatenating the paragraphs
  # above into one line
  desc.gsub!(/(\w\.)(\w)/,'\1 \2')
  # fix stupid html type stuff (non-breaking space and apostrophe entities)
  desc.gsub!(/&nbsp;/," ")
  desc.gsub!(/&#0*39;/,"'")
  # make all spaces single
  desc.gsub!(/ {2,}/,' ')
  # output finished description followed by blank line and increment
  # counter
  puts desc + "\n\n"
  count += 1
end
# make sure I got all 176 episode descriptions
puts count
Philip
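The period fix above is needed only because the lines are glued together without a separator, and a pattern like (\w\.)(\w) would also split abbreviations such as "U.S.S." into "U. S. S.". A minimal sketch of one alternative, joining the kept lines with a single space instead, might look like this. The sample markup, the stardate values, and the clean_description helper are all hypothetical; the real TV.com page may be structured differently.

```ruby
# Hypothetical sample input; the real TV.com markup may differ.
SAMPLE = <<~HTML
  <p>
  Stardate: 48315.6<br />
  The crew of the U.S.S. Voyager is stranded.
  Janeway&#039;s&nbsp;plan   begins.
  </p>
  <p>Stardate: 48317.1<br />
  The ship returns home.</p>
HTML

# Collapse one <p>...</p> block into a single cleaned line.
# (clean_description is an illustrative helper, not part of the original script.)
def clean_description(block)
  lines = block.each_line.map do |line|
    # drop everything up to a <br />, which removes the stardate
    line.sub(/^.*<br \/>/, '').strip
  end
  # discard any remaining stardate lines and lines with no word characters
  lines.reject! { |l| l =~ /stardate:/i || l !~ /\w/ }
  lines.join(' ')              # join with a space, so no period fix is needed
       .gsub(/<.*?>/, '')      # remove html tags
       .gsub(/&nbsp;/, ' ')    # non-breaking space entity
       .gsub(/&#0*39;/, "'")   # numeric apostrophe entity
       .squeeze(' ')           # make all spaces single
       .strip
end

descriptions = SAMPLE.scan(/<p>.*?<\/p>/m).map { |block| clean_description(block) }
puts descriptions
```

On this sample, "U.S.S." comes through untouched, because nothing rewrites periods after the join.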
On Jan 18, 10:18 pm, Stedwick wrote:
Look into Hpricot - http://code.whytheluckystiff.net/hpricot/ - or
another HTML parser. It makes things like this much easier - no need
for regexes.
2008/1/19, Stedwick:
# each description is enclosed in a multiline <p> tag
f.scan(/<p>.*?<\/p>/m) do |match|
[…]
You should take a look at the Hpricot gem to make the
HTML scraping easier.
-- Jean-François.
Stedwick wrote:
text = IO.read("episode_guide.html")
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
  map{|s|
    s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
      gsub(/\s+/, " ") }
puts a.join("\n\n")
puts
puts a.size
On Jan 18, 10:18 pm, Stedwick wrote:
This is not exactly what you want, but you may find it helpful:
require 'hpricot'
require 'open-uri'
url = 'http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?printable=1'
@doc = Hpricot(open(url))
@doc.search("/html/body/div[1]/div").each do |div|
  # episode title
  div.search("h1/a") do |h1|
    puts h1.inner_text.strip().squeeze(" ").gsub("\n"," ")
  end
  # episode description (block variable renamed so it does not clobber the outer div)
  div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc|
    puts desc.inner_text.strip().squeeze(" ").gsub("\n"," ")
    puts
  end
end
On Jan 19, 10:39 pm, William J. wrote:
Corrected:
text = IO.read("episode_guide.html")
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
  map{|s|
    s.gsub(/&nbsp;/," ").gsub(/<.*?>/m,"").gsub(/&#0*39;/,"'").
      gsub(/\s+/, " ").strip }
puts a.join("\n\n")
puts
puts a.size
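For anyone following along without the TV.com file, here is a small sketch of how scan behaves when the pattern has a capture group, which is what the version above relies on. The sample markup and stardate values are made up for illustration, and only the &nbsp; entity is handled here.

```ruby
# Hypothetical sample input; the real TV.com markup may differ.
html = <<~HTML
  <p>
  Stardate: 48315.6<br />
  First&nbsp;episode   summary.
  </p>
  <p>
  Stardate: Unknown<br />
  Second <b>episode</b> summary.
  </p>
HTML

# With one capture group, String#scan returns an array of one-element
# arrays, so flatten turns it into a plain array of captured strings.
# The character class [ a-z.\d]* swallows the stardate value and stops
# at the "<" of the <br /> tag.
raw = html.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten

cleaned = raw.map do |s|
  s.gsub(/&nbsp;/, ' ')   # non-breaking space entity
   .gsub(/<.*?>/m, '')    # remove html tags
   .gsub(/\s+/, ' ')      # collapse newlines and runs of spaces
   .strip
end

puts cleaned
puts cleaned.size
```

One thing to watch: if a stardate were followed by description text on the same line with no tag in between, the character class would consume part of the description too, so this depends on the <br /> always being there.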
On Jan 20, 4:38 am, William J. wrote:
I’m liking yours so far, William.
It’s pretty elegant.