Forum: Ruby Hpricot getting a table

lrlebron@gmail.com (Guest)
on 2007-04-18 17:45
(Received via mailing list)
I am currently trying to scrape some data from the following web page

I am using some hpricot code that looks like this
 @doc = Hpricot(open(strLink))

 @doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table") do |data|
   puts data
 end

At this point data contains html that looks like this

<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>

More stuff continues ......

I want to capture each of these four tables individually for further
processing. I have tried a variety of methods, but nothing seems to
work.

Thanks,
Luis
Drew Raines (Guest)
on 2007-04-18 18:15
(Received via mailing list)
lrlebron@gmail.com wrote:

> processing. I have tried a variety of methods but nothing seems to
> work.

Would something as simple as this work?  I'm not sure how complex
your tables get.

   #!/usr/bin/env ruby

   require "hpricot"

   doc = Hpricot("<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>")

   (doc/"table").map {|t| puts t.to_html}

This outputs:

   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
   <table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>

Note that there's an Hpricot mailing list at
http://code.whytheluckystiff.net/hpricot/ that might be a more
appropriate forum for these questions.

-Drew
lrlebron@gmail.com (Guest)
on 2007-04-18 20:55
(Received via mailing list)
On Apr 18, 11:11 am, Drew Raines <aarai...@gmail.com> wrote:
> > I want capture each of these four tables individually for further
>    doc = Hpricot("<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>
>    "<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>"
>    "<table><tr><td>Stuff</td></tr><tr><td>Stuff</td></tr></table>"
>
> Note that there's an Hpricot mailing list at
> http://code.whytheluckystiff.net/hpricot/ that might be a more
> appropriate forum for these questions.
>
> -Drew

Thanks,

This gets me a lot closer to what I need.
I'm having some problems with the syntax. If I'm reading the docs
correctly, map returns an array, so I should be able to do something
like

arrTables = (doc/"tables").map

And then access each table individually. For example

arrTables[0]

Luis
Peter Szinek (Guest)
on 2007-04-18 21:29
(Received via mailing list)
lrlebron@gmail.com wrote:
> At this point data contains html that looks like this
> work.
What exactly are you trying to do? What should the result be?
Could you please provide some real data? The 'Stuff' placeholders do
not make much sense :-)


Thanks,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby
Drew Raines (Guest)
on 2007-04-18 22:00
(Received via mailing list)
lrlebron@gmail.com wrote:

> This gets me a lot closer to what I need.
> I'm having some problems with syntax. If I'm reading the docs
> correctly map returns an array. So I should be able to do something
> like
>
> arrTables = (doc/"tables").map
>
> And then access each table individually. For example
>
> arrTables[0]

Array#map wasn't particularly relevant in my example; I just used it
to iterate #puts over the result of (doc/"table").
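
To make the point about map concrete, here is a minimal plain-Ruby
sketch. The `tables` array of strings is a hypothetical stand-in for the
elements a search returns, so it runs without Hpricot. It shows three
things: map with a block builds a new Array; map with a `puts` block
builds an Array of nils; and, on current Ruby versions, map with no
block returns an Enumerator, which cannot be indexed.

```ruby
# Hypothetical stand-ins for the four <table> elements found by a search.
tables = [
  "<table><tr><td>one</td></tr></table>",
  "<table><tr><td>two</td></tr></table>",
  "<table><tr><td>three</td></tr></table>",
  "<table><tr><td>four</td></tr></table>"
]

# With a block, map returns a new Array holding the block's results.
upcased = tables.map { |t| t.upcase }

# puts returns nil, so mapping puts over the list yields an Array of nils.
# For printing alone, each is the more conventional choice.
printed = tables.map { |t| puts t }

# Without a block, map returns an Enumerator rather than an Array.
no_block = tables.map

# The collection itself can already be indexed; no map call is needed.
first_table = tables[0]
```

So to work with one table at a time, index the search result directly
(or call each on it) rather than calling map without a block.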

The real lesson to glean from my response is that:

  (doc/"table")

is much nicer than:

  doc.search("/html/body/table[5]/tr/td[2]/div[3]/table/tr/td/div[1]/table")

...which, as Peter alluded to, is fairly meaningless to us because
we don't know what the full original HTML looks like.  You can grab
the <table>s from the snippet you provided with just a CSS-style
search[1].

FWIW, if you have any control over the markup, you can simply add
some unique classes to the tables you want:

  <table class="foo">...</table>
  <table class="foo">...</table>
  <table class="foo">...</table>

Then do:

  (doc/"table.foo")

That'll work regardless of how many <table>s are on the page.
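
A rough sketch of the class-based approach, and of handling each matched
table separately, might look like the following. So that it runs without
the Hpricot gem, it swaps in REXML, which ships with Ruby, and uses the
XPath `//table[@class='foo']` as a stand-in for the CSS-style selector
`table.foo`. The wrapper `<div>`, the cell text, and the variable names
are assumptions made for illustration; REXML also requires well-formed
markup, which real-world HTML often is not.

```ruby
require "rexml/document"

# Hypothetical markup. The outer <div> exists only because REXML needs a
# single root element.
html = <<~HTML
  <div>
    <table class="foo"><tr><td>A1</td></tr><tr><td>A2</td></tr></table>
    <table><tr><td>layout only</td></tr></table>
    <table class="foo"><tr><td>B1</td></tr><tr><td>B2</td></tr></table>
  </div>
HTML

doc = REXML::Document.new(html)

# Select only the tables whose class attribute is exactly "foo".
foo_tables = REXML::XPath.match(doc, "//table[@class='foo']")

# Each entry is a separate element, so every table can be processed alone.
cells_per_table = foo_tables.map do |table|
  REXML::XPath.match(table, "tr/td").map { |td| td.text }
end

puts foo_tables.length
puts cells_per_table.inspect
```

The unclassed table is skipped, and `foo_tables[0]` and `foo_tables[1]`
give the two remaining tables individually.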

-Drew

Footnotes:
[1]  http://lnk.nu/code.whytheluckystiff.net/e7m