Hpricot getting a table

casper_the_ghost · April 18, 2007, 5:45pm

I am currently trying to scrape some data from the following web page

I am using some hpricot code that looks like this
@doc = Hpricot(open(strLink))

@doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table") do |data|
  puts data
end

At this point data contains html that looks like this

Stuff

More stuff continues …

I want to capture each of these four tables individually for further
processing. I have tried a variety of methods but nothing seems to
work.

Thanks,
Luis

casper_the_ghost · April 18, 2007, 6:15pm

[email protected] wrote:

processing. I have tried a variety of methods but nothing seems to
work.

Would something as simple as this work? I’m not sure how complex
your tables get.

#!/usr/bin/env ruby

require "hpricot"

doc =
Hpricot("

Stuff

")

(doc/"table").map {|t| puts t.to_html}

This outputs:

“

Stuff

”
“

Stuff

”
“

Stuff

”
“

Stuff

”

Note that there’s an Hpricot mailing list at
http://code.whytheluckystiff.net/hpricot/ that might be a more
appropriate forum for these questions.

-Drew

casper_the_ghost · April 18, 2007, 8:55pm

On Apr 18, 11:11 am, Drew R. [email protected] wrote:

I want capture each of these four tables individually for further
doc = Hpricot("

Stuff

Stuff

“

Stuff

Stuff

”
“

Stuff

Stuff

”

Note that there’s an Hpricot mailing list at http://code.whytheluckystiff.net/hpricot/ that might be a more
appropriate forum for these questions.

-Drew

Thanks,

This gets me a lot closer to what I need.
I’m having some problems with syntax. If I’m reading the docs
correctly map returns an array. So I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Luis

casper_the_ghost · April 18, 2007, 9:29pm

[email protected] wrote:

At this point data contains html that looks like this
work.
What are you trying to do exactly? What should be the result?
Could you please provide some real data? These ‘Stuff’ placeholders do not
make much sense.

Thanks,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

casper_the_ghost · April 18, 2007, 10:00pm

[email protected] wrote:

This gets me a lot closer to what I need.
I’m having some problems with syntax. If I’m reading the docs
correctly map returns an array. So I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Array#map wasn’t particularly relevant in my example; I just used it
to iterate #puts over the result of (doc/"table").
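
To illustrate the point about map, here is a minimal sketch in plain Ruby. It assumes a hypothetical Array of HTML strings standing in for the result of (doc/"table"); as far as I know, Hpricot returns an Array-like collection there, so the same indexing should apply, but treat that as an assumption. It shows what map returns with and without a block in current Ruby versions (older 1.8 releases may behave differently).

```ruby
# Hypothetical stand-in for the result of (doc/"table"): a plain Array of
# HTML strings. The real result would hold Hpricot elements instead.
tables = ["<table>one</table>", "<table>two</table>", "<table>three</table>"]

# map with a block returns a new Array built from the block's return values.
lengths = tables.map { |t| t.length }

# puts returns nil, so mapping puts over the tables yields an Array of nils.
# each is the better fit when only the side effect (printing) matters.
printed = tables.map { |t| puts t }

# map without a block returns an Enumerator in current Ruby, not an Array,
# so it cannot be indexed with [].
enum = tables.map

# The collection itself can already be indexed, so no map call is needed
# just to reach an individual table.
first_table = tables[0]
```

So to get at individual tables, keeping the collection itself and indexing into it is likely all that is needed; map only earns its place when each table has to be transformed into something else.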

The real lesson to glean from my response is that:

(doc/"table")

is much nicer than:

doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table")

…which, as Peter alluded to, is fairly meaningless to us because
we don’t know what the full original HTML looks like. You can grab
the <table>s from the snippet you provided with just a CSS-style
search[1].

FWIW, if you have any control over the markup, you can simply add
some unique classes to the tables you want:

<table class="foo">...</table>

Then do:

(doc/"table.foo")

That’ll work regardless of how many <table>s are on the page.

-Drew

Footnotes:
[1] http://lnk.nu/code.whytheluckystiff.net/e7m