On Dec 5, 2011, at 1:05 PM, JavierQQ wrote:
Hi,
I want to grab some information about university names, and I found
this term called “web scraping”.
I searched for it on Google, and there are tools for it in Ruby.
One of them is Nokogiri, but I’m a bit confused because it seems that
it only gets information that is already in an HTML or XML document.
Yes, Nokogiri is a toolkit for (among lots of other things) running
XPath or CSS queries against a text file. That text file can be anything:
an IO stream of one sort or another with textual markup in it will do.
I found a webpage that has a list of university names in a select
(an HTML tag),
and I want to grab that information.
The question is: can I do that with Nokogiri or another tool?
The list is like a country list, but with the names of the
universities of my country.
A select can be traversed like any other DOM object; this should be
fairly close:
  # given doc is a Nokogiri::XML or Nokogiri::HTML document
  doc.css('#yourPickerId option').each do |opt|
    foo = opt['value']
    # whatever else you want to do with foo here
  end
It seems that it gets that information from a DB using Ajax, and what
I’m trying to do may not be legal or possible.
If it’s Ajax, you’ll need to run a JavaScript interpreter against it.
Rails 3.1 shows the way to do that server-side. Once you have munged the
page into a text stream that includes this desired data (flattened it
down to the result of the Ajax plus the base code) then Nokogiri or
Hpricot or any other XML/HTML parser could rip through that DOM and give
you individual nodes to play with.
I’d really appreciate it if someone could help me understand what this
tool is used for, and whether what I’m trying to do is possible.
Possible, sure. It’s never entirely clear why someone would run an Ajax
request to populate a page. They may have done it to keep scrapers (like
you) out, or they may have done it to isolate and accelerate a laggy
part of the initial page load. If the latter (so they aren’t actually
discouraging you; did you ask them if you could do this?), then you
might also want to look into loading the endpoint of that Ajax request
instead of the surrounding page, as that would eliminate the JavaScript
layer entirely. You’d have one HTTP request, and unless that endpoint
was restricted to only accept requests from within its own domain, you
would likely get JSON or some other structured data in return, and that
could be even easier to interpret in your application.
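A minimal sketch of that endpoint approach, using only Ruby's standard library. It assumes the endpoint answers a plain GET with a JSON array of objects that each carry a "name" key; the real payload shape, the URL, and the helper names here are guesses you would need to check against the actual site.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Fetch the raw body of the Ajax endpoint with a single GET request.
def fetch_body(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Assumption: the payload is a JSON array of objects with a "name" key.
def university_names(json_text)
  JSON.parse(json_text).map { |entry| entry['name'] }
end

# Hypothetical payload so the parsing step can be tried without a network call.
SAMPLE = '[{"id": 1, "name": "First University"}, {"id": 2, "name": "Second University"}]'
puts university_names(SAMPLE)

# Against a real site (the URL below is a placeholder):
# puts university_names(fetch_body('http://example.com/universities.json'))
```

Your browser's network inspector will usually show the real endpoint URL and what the response looks like, which tells you how to adjust `university_names`.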
Walter