On Nov 12, 9:48am, Terry M. [email protected] wrote:
--
Posted via http://www.ruby-forum.com/.
An approach that can prove handy is to “screen scrape” the data
from the HTML. One of the easiest ways to see how a page is put
together is Firefox with the Firebug add-on installed. With Firebug,
you can inspect the elements on the page and view formatted source.
After you figure out how the data you are looking for is tagged,
or can be located, there are Ruby tools like Hpricot and Nokogiri
which allow one to quickly throw together an extraction routine.
For example, a few minutes ago, I did a Google search on “helium high
voice”, and came up with a few lines of code to extract the first page
of links as follows:
- I inspected the links, and found that they all seem to have
  'class="l"'.
- I copied the ugly source from a source-view window, and pasted it
  into scite (any editor would do, but in scite it’s easy to view
  changes in output as you experiment).
- I opened up a few lines, and pasted the HTML source under an
  __END__ tag, which makes it available as the ‘DATA’ pseudo file.
- I tried a couple of things using Nokogiri, and found something that
  seemed to work.
The code:
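# coding: utf-8
require 'nokogiri'

# DATA reads everything that follows the __END__ marker below
html_doc = Nokogiri::HTML(DATA.read)
# print the href of every <a class="l"> element, one per line
puts html_doc.css("a.l").collect { |el| el.attribute("href") }
__END__
(the ugly HTML page source goes here)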
The output:
Why does the act of inhaling helium make your voice high-pitched?
Why does helium make your voice squeaky? - The Straight Dope
Helium - Wikipedia
Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos
http://blog.sciencegeekgirl.com/2009/03/26/myth-helium-makes-your-voice-high-pitched/
Why does helium make your voice go high? - Answers
Lintastoto : Bandar Togel Online Dan Slot Pragmatic Paling Gacor Di Indonesia
http://www.hrwiki.org/wiki/helium
http://www.helium.com/items/1905495-why-does-helium-make-your-voice-squeaky
- YouTube
- YouTube
For production, just build the query and retrieve the page directly
to build the array of URLs.
Since there is no guarantee that Google won’t tweak its markup
and break this particular code, having a very high-level method
of page-scraping means that it wouldn’t be hard to adjust. Moreover,
this technique can be used in many situations, and once you’ve done
a few sites, you’ll find most applications are as easy as parsing XML
or adapting JSON from “data only” APIs. After all, you get to see
exactly what data is available on the pages, which may include useful
things that an API might not make available.