doog wrote:
Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.
Okay, here is a simple parser in ordinary Ruby. It will give you some idea
of what is involved in parsing.
There are many libraries that do much more than this script does; some of
them have steep learning curves, and many offer exotic ways to acquire
particular kinds of content.
This is a simple parser that returns an array containing all the table
content in the target Web page. I wrote it earlier today for someone who
wanted to scrape a yahoo.com financial page, which explains the choice of
target page (it is easy to change):
#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)
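# Note: on newer Ruby versions, Net::HTTP#get returns a single response
# object rather than a [response, body] pair, so `page` above may end up
# nil. A version-independent fetch would look roughly like this (a sketch,
# not part of the original script):
#
#   require 'uri'
#   resp = Net::HTTP.get_response(URI.parse('http://finance.yahoo.com/q?s=IBM'))
#   page = resp.body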
# BEGIN processing HTML

def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

out_tables = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_rows = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    out_cells = parse_html(row, "td")
    out_cells.each do |cell|
      # strip any markup left inside the cell
      cell.gsub!(%r{<.*?>}, "")
    end
    out_rows << out_cells
  end
  out_tables << out_rows
end

# END processing HTML
# examine the result

def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.class == Array
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(out_tables)
This program emits an indexed, indented listing of the table content it
extracted, so you can then customize it by acquiring particular table cells
through the provided index numbers.
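For example, once the listing shows you where the value you want lives, you
can index straight into the nested array. The indices below are purely
illustrative; yours will depend on the page:

# third table, first row, second cell (hypothetical indices)
value = out_tables[2][0][1]
puts value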
It should work with any Web page that has the interesting content embedded
in tables, and whose syntax is reliable.
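To point it at a different site, you only need to change the host and path
handed to Net::HTTP; the host and path here are placeholders:

http = Net::HTTP.new('www.example.com', 80)
resp, page = http.get('/some/page.html', nil)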
The primary value of this program is to show you how easy it is to scrape
pages using Ruby, and to give you a starting point you can customize to
meet your own requirements.