Using nokogiri

dubstep · December 5, 2011, 7:06pm

Hi,

I want to grab some information about university names, and I found
this term called “web scraping”.
I searched for it on Google, and there are tools for it in Ruby.
One of them is Nokogiri, but I’m a bit confused because it seems that
it only gets information that is already in an HTML or XML document.

I found a web page that has a list of university names in a
<select> (HTML element),
and I want to grab that information.

The question is… can I do that with nokogiri or another tool?
The list is like a country list, but with the names of the
universities of my country.

It seems that the page gets that information from a DB using Ajax, and what
I’m trying to do may not be legal or possible.

I’d really appreciate it if someone could help me understand what this
tool is used for, and whether what I’m trying to do is possible.

Thanks

Javier Q

JavierQQ · December 5, 2011, 7:33pm

On Mon, Dec 5, 2011 at 4:05 PM, JavierQQ wrote:

HI,

Hi

and I want to grab that information

Thanks

Javier Q

Take a look at some screencasts:

With Nokogiri, you can use CSS3 selectors to grab the information you
want.

Best Regards,
Everaldo

JavierQQ · December 5, 2011, 7:33pm

On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:

Hi,

I want to grab some information about university names, and I found
this term called “web scraping”.
I searched for it on Google, and there are tools for it in Ruby.
One of them is Nokogiri, but I’m a bit confused because it seems that
it only gets information that is already in an HTML or XML document.

Yes, Nokogiri is a toolkit for (among lots of other things) running
XPath or CSS queries against a text file. That text file can be anything
– an IO stream of one sort or another with textual data in it will do.

I found a web page that has a list of university names in a
<select> (HTML element),
and I want to grab that information.

The question is… can I do that with nokogiri or another tool?
The list is like a country list, but with the names of the
universities of my country.

A select can be traversed like any other DOM object; this should be
fairly close:

# given doc is a Nokogiri::XML or Nokogiri::HTML document
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']
  # whatever else you want to do with foo here
end

It seems that the page gets that information from a DB using Ajax, and what
I’m trying to do may not be legal or possible.

If it’s Ajax, you’ll need to run a JavaScript interpreter against it.
Rails 3.1 shows the way to do that server-side. Once you have munged the
page into a text stream that includes this desired data (flattened it
down to the result of the Ajax plus the base code) then Nokogiri or
Hpricot or any other XML/HTML parser could rip through that DOM and give
you individual nodes to play with.

I’d really appreciate it if someone could help me understand what this
tool is used for, and whether what I’m trying to do is possible.

Possible, sure. It’s never entirely clear why someone would run an Ajax
request to populate a page. They may have done it to keep the scrapers
out (like you), or they may have done it to isolate and accelerate a
laggy part of the initial page load. If the latter (so they aren’t
actually discouraging you – did you ask them if you could do this?)
then you might also want to look into loading the endpoint of that Ajax
request instead of the surrounding page, as that would eliminate the
whole JavaScript abstraction entirely. You’d have one HTTP request, and
unless that endpoint was kinked to only accept requests from within its
own domain, you would likely have JSON or some other structured data in
return, and that could be even easier to interpret in your application.
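
If the Ajax endpoint does return JSON, Nokogiri is not needed for that part at all. The sketch below assumes a hypothetical response body (the real endpoint's URL and field names are unknown here, so `id` and `name` are invented) and parses it with Ruby's standard json library.

```ruby
require 'json'

# Hypothetical body, as it might come back from the Ajax endpoint.
# The keys "id" and "name" are assumptions for this example.
body = '[{"id": 1, "name": "First University"}, {"id": 2, "name": "Second University"}]'

# JSON.parse turns the text into plain Ruby arrays and hashes.
universities = JSON.parse(body)
names = universities.map { |u| u['name'] }

puts names
```

In a real script the body would come from an HTTP client such as open-uri or Net::HTTP rather than from a string literal.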

Walter

JavierQQ · December 5, 2011, 8:11pm

On Dec 5, 2011, at 1:55 PM, JavierQQ wrote:

  #whatever else you want to do with foo here
end
Thanks. In the Nokogiri example the result is like “link.content”, and
that’s why I was wondering how I can grab that information from the select
group.

There are some basic things one can do with nodes once you find them.
content() spills out the textual content of any node (in the case of an
option, that might give you the same thing as the Option.text attribute
in JavaScript, but I wouldn’t count on it specifically). In the case of a
div, for example, content would give you the textual content of that
div, minus any HTML tags, while inner_html would give you the actual
HTML code defining all of the content tags as well as their text
content.

For everything else, any other named attribute on the given node you
access simply by putting the name of the attribute in as a key:

my_select['label'] or my_select['value'] or my_select['selected'] for
example.

Behind the scenes, [] is simply Nokogiri’s attribute reader on the node:
it gives you the value if the attribute is present and nil otherwise.

You mean that in order to make a better application I have to deliver
the information as JSON?

I have seen this technique used for this reason, by splitting the
application load over time on the same server or across servers. But
then I would just throw a caching layer at the problem. Much less
heartache.

I’ve also seen this technique used to obfuscate the data source, or
simply to integrate third-party data sources into an existing site.

I’m kind of new to Rails (not a complete newbie, but… sort of)

Me too, but I’ve done quite a lot of Nokogiri recently, so it’s all
fairly fresh.

Walter

JavierQQ · December 6, 2011, 4:22pm

Hi,
It’s me again. I was working through some easy examples and they worked… but now
I’ve run into some trouble.
Is there a way to give Nokogiri data such as a username and password?
On one website I have to log in first.
Scrapy provides a way to simulate a user login, and I was wondering if
Nokogiri can do the same.

Javier

JavierQQ · December 6, 2011, 4:26pm

You wouldn’t do it at the Nokogiri level. You need to read up on the
open-uri library; there are all sorts of goodies in there to manage
authentication, sessions, everything needed to create a Web client. That
layer of your application will get the text stream that you will send on
to Nokogiri. There’s nothing in Noko that is specific to solving that
problem; it starts from the assumption that you have a text file locally
or a stream from another client like open-uri.

Walter

JavierQQ · December 6, 2011, 5:30pm

It seems that :http_basic_authentication => [user, pass]
no longer works. I’ve tested it with two websites and got nothing.
Is there any other way?

Thanks

Javier

JavierQQ · December 6, 2011, 6:00pm

Can you post some code surrounding this, show the open-uri method call
you’re using?

Walter

JavierQQ · December 5, 2011, 7:56pm

On Dec 5, 13:32, Walter Lee D. wrote:

A select can be traversed like any other DOM object, this should be fairly
close:

# given doc is a Nokogiri::XML or Nokogiri::HTML document
doc.css('#yourPickerId option').each do |opt|
  foo = opt['value']
  # whatever else you want to do with foo here
end

Thanks. In the Nokogiri example the result is like “link.content”, and
that’s why I was wondering how I can grab that information from the select
group.

Possible, sure. It’s never entirely clear why someone would run an Ajax request
to populate a page. They may have done it to keep the scrapers out (like you), or
they may have done it to isolate and accelerate a laggy part of the initial page
load. If the latter (so they aren’t actually discouraging you – did you ask them
if you could do this?) then you might also want to look into loading the endpoint
of that Ajax request instead of the surrounding page, as that would eliminate the
whole JavaScript abstraction entirely. You’d have one HTTP request, and unless
that endpoint was kinked to only accept requests from within its own domain, you
would likely have JSON or some other structured data in return, and that could be
even easier to interpret in your application.

Walter

You mean that in order to make a better application I have to deliver
the information as JSON?
I’m kind of new to Rails (not a complete newbie, but… sort of)

Thanks for your help

Javier Q

JavierQQ · December 6, 2011, 6:22pm

doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))

I made a mistake; that was from another file.
What I’m using is:

open(url, :http_basic_authentication => [user, pass] )
doc = Nokogiri::HTML(open(url))

Javier

JavierQQ · December 6, 2011, 6:18pm

On Tue, Dec 6, 2011 at 11:58 AM, Walter Lee D. wrote:

Can you post some code surrounding this, show the open-uri method call
you’re using?

Walter

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open(url, :http_basic_authentication => [user, pass]))
doc.xpath('//select/option').each do |opt|
  puts opt.content
end

I can grab some info from the main page of the URL (so it works), but when I
go to its login page with user/pass and try to get some, it seems to
get information from some other place (I’m not even sure from where).

Javier

JavierQQ · December 7, 2011, 10:03am

Hi,

The question is… can I do that with nokogiri or another tool?
The list is like a country list, but with the names of the
universities of my country.

Like Nokogiri, there is another tool called Hpricot.

It seems that the page gets that information from a DB using Ajax, and what
I’m trying to do may not be legal or possible.

Yes, it is possible.
See some examples that I tried with Nokogiri and Ruby:

Nokogiri

Hpricot


JavierQQ · December 6, 2011, 6:25pm

On Dec 6, 2011, at 12:17 PM, Javier Q. wrote:

I can grab some info from the main page of the URL (so it works), but when I go to
its login page with user/pass and try to get some, it seems to get information
from some other place (I’m not even sure from where).

Try all this out in a terminal with telnet or cURL – see where you’re
actually going when you log in. You may be redirected in some subtle
way. Also, a browser may throw a “basic authentication” dialog box when
you’re actually being challenged for digest authentication.
The :http_basic_authentication option only covers Basic; Digest is not
the same thing.

I think your real solution here will be to abstract out the open() bit
inside the Nokogiri::HTML() call. Look for a gem that accepts a URL and
returns a text stream and offers a whole bunch of configuration options
for authentication. I am certain there are at least a handful of them
out there. By separating your concerns in this way, you’ll end up with a
more modular solution so that you can swap in different credentials for
each site you’re scraping.
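
As a rough sketch of that separation, the class below keeps URLs and credentials in one place and leaves parsing to whatever consumes the returned text. It uses Ruby's standard Net::HTTP rather than a gem; the class name `PageFetcher`, the credentials, and the example.com URL are all hypothetical, and it handles Basic authentication only (no Digest, no redirects, no login forms).

```ruby
require 'net/http'
require 'uri'

# Hypothetical fetcher: knows about URLs and credentials, nothing about parsing.
class PageFetcher
  def initialize(user: nil, password: nil)
    @user = user
    @password = password
  end

  # Builds the GET request; split out so it can be inspected without a network call.
  def build_request(url)
    request = Net::HTTP::Get.new(URI(url))
    request.basic_auth(@user, @password) if @user
    request
  end

  # Sends the request and returns the response body as a String.
  def fetch(url)
    uri = URI(url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request(build_request(url)).body
    end
  end
end

# One fetcher per site, each with its own (made-up) credentials.
fetcher = PageFetcher.new(user: 'alice', password: 'secret')
request = fetcher.build_request('http://example.com/members')
puts request['authorization']
```

The string returned by `fetch` can be passed straight to Nokogiri::HTML, so swapping credentials for another site means creating another fetcher and leaving the parsing code alone.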

Walter