Hpricot getting a table


#1

I am currently trying to scrape some data from the following web page

I am using some Hpricot code that looks like this:

@doc = Hpricot(open(strLink))

@doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table") do |data|
  puts data
end

At this point, data contains HTML that looks like this:

Stuff
Stuff
Stuff
Stuff
Stuff
Stuff
Stuff
Stuff

More stuff continues …

I want to capture each of these four tables individually for further
processing. I have tried a variety of methods, but nothing seems to
work.

Thanks,
Luis


#2

removed_email_address@domain.invalid wrote:

I want to capture each of these four tables individually for further
processing. I have tried a variety of methods but nothing seems to
work.

Would something as simple as this work? I’m not sure how complex
your tables get.

#!/usr/bin/env ruby

require "hpricot"

doc =
Hpricot("

Stuff
Stuff
Stuff
Stuff
Stuff
Stuff
Stuff
Stuff
")

(doc/"table").map {|t| puts t.to_html}

This outputs:

Stuff
Stuff

Stuff
Stuff

Stuff
Stuff

Stuff
Stuff

Note that there’s an Hpricot mailing list at
http://code.whytheluckystiff.net/hpricot/ that might be a more
appropriate forum for these questions.

-Drew


#3

On Apr 18, 11:11 am, Drew R. removed_email_address@domain.invalid wrote:

I want capture each of these four tables individually for further
doc = Hpricot("

Stuff
Stuff

Stuff
Stuff

Stuff
Stuff

Note that there’s an Hpricot mailing list at
http://code.whytheluckystiff.net/hpricot/ that might be a more
appropriate forum for these questions.

-Drew

Thanks,

This gets me a lot closer to what I need.
I’m having some problems with the syntax. If I’m reading the docs
correctly, map returns an array, so I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Luis


#4

removed_email_address@domain.invalid wrote:

At this point data contains html that looks like this
work.

What exactly are you trying to do? What should the result be?
Could you please provide some real data? These ‘Stuff’ placeholders
do not make much sense :)

Thanks,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby


#5

removed_email_address@domain.invalid wrote:

This gets me a lot closer to what I need.
I’m having some problems with the syntax. If I’m reading the docs
correctly, map returns an array, so I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Array#map wasn’t particularly relevant in my example; I just used it
to iterate #puts over the result of (doc/"table").
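
To make that point concrete, here is a minimal sketch in plain Ruby. The strings are hypothetical placeholders standing in for whatever elements a search returns; they are not real Hpricot objects. It only illustrates that map with puts collects nils, while the collection itself can be indexed directly.

```ruby
# Hypothetical stand-ins for the elements a search such as (doc/"table")
# would return; plain strings are used so the sketch needs no gems.
tables = ["<table>one</table>", "<table>two</table>"]

# Kernel#puts returns nil, so map with puts builds an array of nils.
printed = tables.map { |t| puts t }

# each is the usual choice when only the side effect matters;
# it returns the receiver unchanged.
same = tables.each { |t| puts t }

# The collection is already indexable, so no map is needed to reach one item.
first = tables[0]
```

So storing the result of the map call is probably not what you want; keep the search result itself and index into that.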

The real lesson to glean from my response is that:

(doc/"table")

is much nicer than:

doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table")

…which, as Peter alluded to, is fairly meaningless to us because
we don’t know what the full original HTML looks like. You can grab
the <table>s from the snippet you provided with just a CSS-style
search[1].

FWIW, if you have any control over the markup, you can simply add
some unique classes to the tables you want:

<table class="foo">...</table>
<table class="foo">...</table>
<table class="foo">...</table>

Then do:

(doc/"table.foo")

That’ll work regardless of how many <table>s are on the page.

-Drew

Footnotes:
[1] http://lnk.nu/code.whytheluckystiff.net/e7m