Using hpricot to get tables

casper_the_ghost · July 1, 2008, 8:10pm

I am working with the following script to parse a page

require 'hpricot'
require 'open-uri'

strLink = "http://www.sportsline.com/mlb/gamecenter/boxscore/MLB_20080331_ARI@CIN"
strPath = "div[@class=SLTables1]/div"

@doc = Hpricot(open(strLink))

@doc.search(strPath) do |div|
  puts div.inner_html
  puts div.css_path
  puts div.xpath
  puts
  puts
end

This prints 4 tables to the screen

blah, blah, blah

I would like to access each table individually. How can I do that?

Thanks,

Luis

casper_the_ghost · July 1, 2008, 11:07pm

I would like to access each table individually

doc.search returns an array even if there is only one match. The
construct you are using iterates through this array:

doc.search(strPath) do |div|

end

If you capture the search results in a variable named "divs", you can
index it like an array (because it is one):

divs=doc.search(strPath)
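
As a rough sketch of what array-style access gives you, the snippet below uses a plain Ruby array of stand-in objects rather than a live page. FakeDiv is a hypothetical placeholder for the elements that doc.search returns, and it mimics only inner_text. The thing to notice is that an out-of-range index returns nil instead of raising, so it is worth checking before calling a method on the result.

```ruby
# FakeDiv is a hypothetical stand-in for an Hpricot element; it only
# mimics the inner_text method used in this thread.
FakeDiv = Struct.new(:inner_text)

divs = [FakeDiv.new("Arizona batting"), FakeDiv.new("Cincinnati batting")]

first_table = divs[0]     # first match
last_table  = divs.last   # last match, same as divs[-1]
missing     = divs[5]     # out of range: nil, no exception

puts divs.length
puts first_table.inner_text
puts last_table.inner_text
puts missing.nil?
```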

If you want to immediately start iterating you can do this:

doc.search(strPath).each_with_index do |div,idiv|
puts idiv if idiv==2
end

I work with hpricot a lot, and I find it more productive not to use
all the fancy Ruby idioms to shorten the code. The pages you are
parsing are fragile: the parse breaks easily when someone changes the
page content.

See code below

require 'hpricot'
require 'open-uri'

strLink = "http://www.sportsline.com/mlb/gamecenter/boxscore/MLB_20080331_ARI@CIN"
strPath = "//div[@class='SLTables1']/div"

doc = Hpricot(open(strLink))
divs=doc.search(strPath)

puts "#{divs[0].inner_text.slice(0...70)}\n\n"
puts "#{divs[1].inner_text.slice(0...70)}\n\n"
puts "#{divs[2].inner_text.slice(0...70)}\n\n"
puts "#{divs[3].inner_text.slice(0...70)}\n\n"
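
The four puts lines above assume the page always has exactly four matching divs. Since the markup can change, a slightly more defensive variant might loop over whatever was found. This is only a sketch: table_previews is a hypothetical helper, not part of Hpricot, and it assumes only that each element responds to inner_text. StubElement stands in for real elements so the example runs without a network request.

```ruby
# Hypothetical helper (not part of Hpricot): returns one preview string
# per element, so nothing breaks if the page has more or fewer tables.
# Assumes each element responds to inner_text.
def table_previews(elements, width = 70)
  previews = []
  elements.each_with_index do |element, index|
    text = element.inner_text.to_s.strip.slice(0...width)
    previews << "table #{index}: #{text}"
  end
  previews
end

# StubElement is a stand-in for an Hpricot element, for illustration only.
StubElement = Struct.new(:inner_text)
stubs = [StubElement.new("  Arizona  "), StubElement.new("Cincinnati")]

table_previews(stubs).each { |line| puts line }
```

With the real page, you would pass the divs returned by doc.search(strPath) instead of stubs.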

casper_the_ghost · July 1, 2008, 11:45pm

On Jul 1, 4:03 pm, Dan D. wrote:

If you capture the search results in a variable named "divs", you can index it like an array (because it is one)

puts "#{divs[0].inner_text.slice(0...70)}\n\n"
puts "#{divs[1].inner_text.slice(0...70)}\n\n"
puts "#{divs[2].inner_text.slice(0...70)}\n\n"
puts "#{divs[3].inner_text.slice(0...70)}\n\n"

This works. Will be very useful for future projects.

I ended up using the xpath for each table which also worked.

Thanks,

Luis