Scraping fails due to the uniqueness of the site

Dobai-Pataky_BSSSSl · December 13, 2010, 11:08am

Hi. I’m trying to scrape scripts from the following news site.

http://voxaleadnews.labs.exalead.com/play.php?flv=MSNBC_Brian_Williams/pdv_nn_netcast_m4v-12-12-2010-173721.mp4&q=&language=en

As you can see, this site is unusual in that a news video plays
together with the corresponding script. I want to scrape only the
scripts.

So I wrote this simple code.
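#---------------------
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'

if ARGV.size > 0
  url = ARGV[0]
  # On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open.
  html = Nokogiri::HTML(URI.open(url))
  # Print the text of every span whose class attribute is exactly 'segment sec5'.
  p html.search("//span[@class='segment sec5']").text
end
#---------------------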

On the command line, I run:

$> ruby code.rb URL

This does return part of the script, in this case “NBC” (eventually I
want all of it, of course). But the program behaves oddly: it starts,
shows the result after a while, and then does not seem to exit. It
only stops when I press “RET”.

I am guessing that this Ruby code activates JavaScript or whatever to
manipulate the video.

Do you think we can avoid such behavior and get the scripts as quickly
as possible?

Thanks in advance.

soichi

soujiro0725 · December 13, 2010, 12:00pm

On 13 December 2010 10:08, Soichi I. wrote:

I am guessing that this Ruby code activates JavaScript or whatever to
manipulate the video.

Given that you are only using net/http, I would say that the JavaScript
is not being run at all.

This code is a step in the right direction:

require 'rubygems'
require 'open-uri'
require 'nokogiri'

URL = "http://voxaleadnews.labs.exalead.com/play.php?flv=MSNBC_Brian_Williams/pdv_nn_netcast_m4v-12-12-2010-173721.mp4&q=&language=en"

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open.
html = URI.open(URL).read
doc = Nokogiri::HTML(html)

# Print every span found inside the div with id 'fulltext'.
doc.search("//div[@id='fulltext']//span").each do |span|
  puts span
end

soujiro0725 · December 13, 2010, 12:05pm

Thanks Peter!

It worked well.

soichi