Vikash Kumar wrote:
…
ie.goto("http://finance.yahoo.com/q?s=IBM")
Now, what's next?
What? What's next? You have already assumed that the Watir and Hpricot libraries are the optimal solution for this problem. Not necessarily. There are many circumstances where a simple Ruby solution is better. And the more you need to know about the process of page scraping, the more likely it is that you will want to understand and tune the details.
Also, let's suppose we want to get all the values of a table, and we don't know the table structure; then what should be the correct solution?
How about this approach:
#!/usr/bin/ruby -w

require 'net/http'

# read the page data
http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)

# BEGIN processing HTML
def parse_html(data,tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page,"table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table,"tr")
  row_data.each do |row|
    cell_data = parse_html(row,"td")
    cell_data.each do |cell|
      # strip any markup left inside the cell
      cell.gsub!(%r{<.*?>},"")
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result
def parse_nested_array(array,tab = 0)
  n = 0
  array.each do |item|
    if(item.size > 0)
      puts "#{"\t" * tab}[#{n}] {"
      if(item.class == Array)
        parse_nested_array(item,tab+1)
      else
        puts "#{"\t" * (tab+1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output)
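
As a quick sanity check, here is what parse_html returns for a contrived fragment (made up for illustration, not taken from the Yahoo page):

# each captured group becomes one element of the flattened array
p parse_html("<tr><td>a</td><td>b</td></tr>", "td")
# => ["a", "b"]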
Notice about this program that about half the code parses the Web page and creates an array of arrays, while the remainder shows the array. The entire task of scraping the page is carried out in the middle of the program.

If you examine the array display created in the latter part of the program, you will see that all the data are placed in an array that can be indexed by table, row and cell. Simply select which array elements you want.
I want to emphasize something. The 21 lines, including spaces and comments, between "# BEGIN processing HTML" and "# END processing HTML" are all that is required to scrape the page. After this, you simply choose which table cells you want to use by indexing the array.
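
For example (the indices below are placeholders; which table actually holds the quote data depends on how Yahoo lays out the page, so pick the indices after looking at the parse_nested_array display):

# dump the third table on the page (index 2), then pull out one cell
output[2].each_with_index do |row, i|
  puts "row #{i}: #{row.join(' | ')}"
end
puts output[2][0][1]   # hypothetical: table 2, row 0, cell 1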
This way of scraping pages is better if you have to post-process the extracted data, or you need a lightweight solution for environments with limited resources, or if you want to exercise detailed control over the scraping process, or if you don't want to try to figure out how to use a large, powerful library that can do absolutely anything, or if you want to learn how to create Ruby programs.
And this way of scraping pages is not for everyone.
Also, I must add, if the Web page contains certain kinds of HTML syntax errors, in particular any unpaired <table>, <tr> or <td> tags, my program will break, and Hpricot probably won't. If, on the other hand, the page is syntactically correct, this program is perfectly adequate to extract the data.
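
To make that failure mode concrete, here is a contrived fragment (not real Yahoo markup) with an unclosed <td>. The non-greedy capture swallows everything up to the next </td>, so the two cells silently come back as one item:

broken = "<tr><td>IBM<td>120.50</td></tr>"
p parse_html(broken, "td")
# => ["IBM<td>120.50"]  -- one merged cell instead of two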
Obligatory editorial comment: Yahoo exists because it can expose you to
advertising. That is the foundation of their business model. When you
scrape pages, you avoid having to look at their advertising. If everyone
did this, for better or worse Yahoo would go out of business (or change
their business model).
Those are the facts. If this page scraping business becomes commonplace, eventually Yahoo and other similar Web sites will choose a different strategy, for example, they might sell subscriptions. Or they might try to do more than they are already doing to discourage scraping. This activity might end up being a contest between the scrapers and the scrapees, with the scrapees making their pages more and more complex.
I think eventually these content providers might put up their content as graphics rather than text, as the spammers are now doing. Then the scrapers would have to invest in OCR to get the content.
This scraping activity isn’t illegal, unless of course you exploit or
re-post the scraped content.
End of editorial.