Hi. I’m trying to scrape scripts from a news site that follows.
As you can see, this site is unusual in that a news video plays
together with the corresponding script. I want to scrape only the
script.
So I wrote this simple code.
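#---------------------
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'

if ARGV.size > 0
  url = ARGV[0]
  # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
  html = Nokogiri::HTML(URI.open(url))
  # Print the text of the span whose class attribute is exactly "segment sec5"
  p html.search("//span[@class='segment sec5']").text
end
#---------------------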
When I run it from the command line,
$> ruby code.rb URL
it does obtain part of the script (eventually I want to get all of
it, of course), which in this case is “NBC”. But the code behaves
oddly. The process starts, and after a while it prints the result,
but it does not seem to exit on its own. Only when I hit “RET” does
it finally stop.
My guess is that this Ruby code triggers JavaScript or something
similar that controls the video.
Is there a way to avoid this behavior and get the scripts as quickly
as possible?
Thanks in advance.
soichi