Nokogiri not pulling correct XPath

dubstep · February 28, 2011, 10:28am

Hi everyone,

I was wondering if anyone could help me. I’m trying to pull text from a
website using Nokogiri, but not all of the text is being pulled into my
variables through XPath.

I have used Firebug (Firefox extension) to pull the correct XPath from
the page so I’m thinking it should be correct. So far, I have:

variable1 =
(doc/"/html/body/div[2]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div/div/h2").inner_html

variable2 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong").inner_html

variable3 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong[2]").inner_html

Now, variable1 is working but I can’t get any values out of variable2 or
variable3. Is there a different syntax I should be using? To test, I’ve
only been outputting to the cli but I want to eventually push these into
a sqlite3 database.

Anyone have any ideas?
Cheers.

Scott.

scottbrisko · February 28, 2011, 11:50am

Hello…

I’ve been using Nokogiri for a while and I never had problems with it.
It works great.

I have some questions for you… Why do you use the full path to the h2
tag?
Does the h2 have a class or an id defined? How about the divs in between,
do they have a class or an id defined?

I’m asking that because you can access inner_html of an html tag like
this:

doc.xpath("//div[@class='(class of the div here)']/h2").each do |node|
  var = node.inner_html
end

You don’t really need to put the full path to the html tag. You can also
use //div[@id='(id of the div here)'], for example.

The other variables are probably not working because you missed a div or
something else in between… I think the approach shown above makes it
easier to get the HTML content without making mistakes.

If you want, just let me know the URL you want to get the content from
and I’ll build a small script to do that.

Regards,

Luis Goncalves

scottbrisko · February 28, 2011, 1:20pm

On Mon, Feb 28, 2011 at 10:28 AM, Scott B. wrote:

Anyone have any ideas?

First I would dump the page as loaded by your program (this is
important) to disk and verify that those XPaths do work independently
(e.g. with Firefox’s DOM Inspector or Eclipse XML tools).

Kind regards

robert

scottbrisko · March 1, 2011, 11:51pm

Thanks guys for the help. In the end, I think it had more to do with the
tbody than anything. I still couldn’t get it working with XPath, however,
so I used CSS and was able to get it working that way (albeit in a
roundabout fashion using an array).

Cheers.

Scott.

scottbrisko · February 28, 2011, 4:06pm

On Mon, Feb 28, 2011 at 3:28 AM, Scott B. wrote:

variable1 =

(doc/"/html/body/div[2]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div/div/h2").inner_html

variable2 =

(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong").inner_html

variable3 =

(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong[2]").inner_html

Now, variable1 is working but I can’t get any values out of variable2 or
variable3.

In my experience, Firebug shows a tbody element as part of the XPath,
even if there is no actual tbody tag in the HTML (the browser adds one
to the DOM when it parses a table). In that case, Nokogiri will fail to
find the right element unless you take out the ‘tbody/’.