Ruby (and programming) beginner’s question regarding 'NoMethodError' while using Hpricot

dubstep · February 15, 2011, 3:41pm

Hi!
I am trying to build a web scraper that fetches fundamental data for
listed companies from finance websites.
Let me show an example.

"

PE ratio
16.83 14/02/11

      <tr><td>EPS (Rs)</td><td class="numericalColumn">10.59

Mar, 10
Sales (Rs crore)
13,963.81 Dec, 10
Face Value (Rs)10
Net profit margin (%)
17.72 Mar, 10

      <tr><td>Last dividend (%)</td><td class="numericalColumn">30

18/01/11
Return on average equity13.69Mar, 10

"
I want to extract the value ‘16.83’ from the HTML above, so this is what I do:
I parse the HTML file and save it into doc.
I search doc for the inner text ‘PE ratio’.
Then I choose the next element using next_sibling.
But I am getting an error:
C:\Users\Administrator\Documents>ruby scraper.rb
scraper.rb:9:in `<main>': undefined method `next_sibling' for #<Hpricot::Elements[{elem "PE ratio" }]> (NoMethodError)

I’ll be grateful for any suggestions.
Sorry about the formatting of the HTML text!

sgudibanda · February 15, 2011, 4:47pm

Hi Sandeep.

The #search method returns an Hpricot::Elements object, which is somewhat
similar to an array. You should call #next_sibling on one of the elements
inside that collection, which are in fact Hpricot::Elem objects. For
instance:

# perform search
elements = doc.search('td[text()="PE ratio"]')
=> #<Hpricot::Elements[{elem "PE ratio" }]>

# get the targeted cell
cell = elements*.first.*next_sibling
=> {elem " 16.83" }

# print out the raw value
puts cell.to_plain_text
16.83
=> nil
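
The collection-versus-element distinction can also be tried without Hpricot installed. The sketch below is only an illustration: it uses REXML (shipped with Ruby) as a stand-in, so `REXML::XPath.match` and `next_element` play the roles of Hpricot's `search` and `next_sibling`, and the `ROW_HTML` string is an assumed, cleaned-up version of one table row rather than the real page markup.

```ruby
require "rexml/document"

# Well-formed stand-in for one table row; the real page's markup may differ.
ROW_HTML = '<table><tr><td>PE ratio</td>' \
           '<td class="numericalColumn">16.83</td><td>14/02/11</td></tr></table>'

doc = REXML::Document.new(ROW_HTML)

# Like Hpricot's #search, this yields a collection (here a plain Array),
# even when only one cell matches.
cells = REXML::XPath.match(doc, "//td").select { |td| td.text == "PE ratio" }

# Sibling navigation is not defined on the collection itself...
collection_error =
  begin
    cells.next_element
    nil
  rescue NoMethodError => e
    e.class
  end

# ...only on a single element taken out of it.
label_cell = cells.first
value_cell = label_cell.next_element   # REXML's rough counterpart to #next_sibling
pe_ratio   = value_cell.text.strip

puts collection_error  # NoMethodError
puts pe_ratio          # 16.83
```

Calling the navigation method on `cells` fails in the same way as the original script did, while calling it on `cells.first` returns the neighbouring cell.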

Regards.

–
Estanislau Trepat

sgudibanda · February 17, 2011, 2:42pm

Thank you very much, Estanislau!

If I am not bothering you too much: why wasn’t ‘next_sibling’ working
in my code? And what are those ‘*’ for in here:

cell = elements*.first.*next_sibling

They were giving an error: ‘syntax error, unexpected '.'’.
I removed them and now it’s working fine.

One more thing I need to ask, if I may use this thread.
I have this web page ‘http://money.rediff.com/companies/all/1-200’
At the bottom there is a ‘Next’ link to the next page of the list,
but this link is driven by JavaScript.
After I finish scraping this page, I want to go to the next one
through the ‘Next’ link. Is there any way to do that?

Note: a cruder method would be to visit every page of the list by its
URL and scrape from that page (the total number of pages will be 17).

Any suggestions are welcome!
Thank you!
Sandeep G.

sgudibanda · February 19, 2011, 10:48am

Hi!
Thanks for the link, Estanislau.
It certainly made my work a lot easier.

I uploaded my ‘almost’ final program. It searches for some data for
each company on the BSE and writes it to an Excel sheet.

First I collect all the links and save them in an array ‘x’.
Then I collect the data that I need and save it to a spreadsheet I
defined earlier in the program.
Lastly I write the spreadsheet to an Excel file.
I can control how many companies I process by changing the number of
iterations (in this case 8).
The program runs fine if the number of iterations is less than 6;
otherwise, I get an error:
links.rb:34:in `block in <main>': undefined method `next_sibling' for nil:NilClass (NoMethodError)
        from links.rb:28:in `times'
        from links.rb:28:in `<main>'

I’m puzzled (like always!).
All suggestions are welcome!
Thank you
Sandeep G.

sgudibanda · February 25, 2011, 10:33am

Hi!
I tried to find the class of the object on which I am calling
next_sibling, using the code below (iterating 25 times):

sheet1[num,1]=doc.search('td[text()="PE ratio"]').first
puts num # num is the block variable |num|
puts sheet1[num,1].class

It turns out it gives NilClass for the 13th, 15th, 16th and 24th
iterations.
So next_sibling raises a NoMethodError.

Please help me with this problem.
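
One thing that is certain here: `first` returns nil when the search matched nothing, so on those iterations the page simply had no cell whose text is exactly "PE ratio". Why it is missing (no such row for that company, slightly different label text, a failed download) cannot be told from the output above. A minimal sketch of a guard follows, again using REXML as a stand-in for Hpricot; `value_after` is a hypothetical helper name and both sample documents are made up for illustration. With Hpricot the same nil check would go around `doc.search(...).first`.

```ruby
require "rexml/document"

# Hypothetical helper: returns the text of the cell that follows the cell
# labelled `label`, or nil when the document has no such label.
def value_after(doc, label)
  label_cell = REXML::XPath.match(doc, "//td").find do |td|
    td.text.to_s.strip == label
  end
  return nil if label_cell.nil?          # nothing matched: skip, do not crash

  value_cell = label_cell.next_element
  value_cell && value_cell.text.to_s.strip
end

# Made-up sample pages: one with the row, one without it.
with_pe    = REXML::Document.new("<table><tr><td>PE ratio</td><td>16.83</td></tr></table>")
without_pe = REXML::Document.new("<table><tr><td>Face Value (Rs)</td><td>10</td></tr></table>")

p value_after(with_pe, "PE ratio")     # "16.83"
p value_after(without_pe, "PE ratio")  # nil
```

In the scraping loop, a nil result could be written to the sheet as an empty cell, so that one company without the figure does not abort the whole run.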

sgudibanda · February 17, 2011, 4:25pm

Hi Sandeep.

The #next_sibling method was not working because you were calling it on the
whole elements array (in fact, an Hpricot::Elements object) and not on one of
the elements inside. That’s why we had to use elements.first to get the
first node that met our search criteria and then call #next_sibling on that
node. The #next_sibling method is only defined on each of those nodes, not on
the array itself.

I apologize for the * characters. I think I was trying to put that part in
bold and got bad formatting out of my email client.

For the problem you describe, maybe you could try using Watir (http://watir.com/).
It drives a real web browser, and can thus handle JavaScript links.

If you allow me a suggestion: looking at the page you’re trying to
scrape and the structure of its URL, I’d suggest extracting the
total number of results from the bottom part, which reads
“Showing 1 - 200 of 3529”. If you extract that last number (the total
number of results), then you could point your scraping script to
http://money.rediff.com/companies/all/1-3529 without needing to follow
JavaScript links.
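
A minimal sketch of that idea, under stated assumptions: the footer text is exactly the string quoted above, the URL pattern is inferred from the two URLs in this thread, and `total_results` and `full_listing_url` are hypothetical helper names. Whether the site really serves the whole range in one request would have to be checked by hand.

```ruby
# Assumed base URL, taken from the addresses quoted in this thread.
BASE_URL = "http://money.rediff.com/companies/all"

# Pulls the total out of a footer such as "Showing 1 - 200 of 3529".
# Returns nil when the text does not look like that.
def total_results(footer_text)
  match = footer_text.match(/Showing\s+\d+\s*-\s*\d+\s+of\s+(\d+)/)
  match && match[1].to_i
end

# Builds the address that asks for every result at once.
def full_listing_url(total)
  "#{BASE_URL}/1-#{total}"
end

footer = "Showing 1 - 200 of 3529"
total  = total_results(footer)

puts total                     # 3529
puts full_listing_url(total)   # http://money.rediff.com/companies/all/1-3529
```

The footer string itself would come from the first page (for example via to_plain_text on the element that holds it); that lookup is left out because the page markup is not shown in this thread.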

Hope it helps.

Regards.

–
Estanislau Trepat

http://twitter.com/etrepat
