API for getting Google search results?

I’m writing a command-line Ruby script that needs to be able to
submit a search string and get back Google result links. Remarkably, when I
google this subject, I find Google APIs to do every possible
thing imaginable except for this. The only thing I found was Goose, which
is apparently based on a deprecated API.

I tried just looking at the actual HTML of the Google home page, but
that is the nastiest mess of web code I’ve ever seen. Please tell me
somebody else has reverse-engineered it so I don’t have to…

On Nov 12, 9:48am, Terry M. wrote:

Posted via http://www.ruby-forum.com/.

An approach which can prove handy is to “screen scrape” the data
from the HTML. One of the easiest ways is with Firefox with the
Firebug add-on installed. With Firebug, you can inspect the elements
on the page, and view formatted source.

After you figure out how the data you are looking for is tagged,
or can be located, there are Ruby tools like Hpricot and Nokogiri
which allow one to quickly throw together an extraction routine.

For example, a few minutes ago, I did a Google search on “helium high
and came up with a few lines of code to extract the first page of
results, as follows:

  1. I inspected the links, and found that they all seem to have
    class “l” (hence the “a.l” selector in the code).
  2. I copied the ugly source from a source-view window, and pasted it
    into scite (any editor would do), but in scite it’s easy to view changes
    as you experiment.
  3. I opened up a few lines, and pasted the HTML source under an
    __END__ tag, which makes it available as the ‘DATA’ pseudo file.
  4. I tried a couple of things using Nokogiri, and found something that
    seemed to work.

The code:

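    # coding: utf-8

    require 'nokogiri'

    # DATA reads everything that follows the __END__ marker in this file.
    html_doc = Nokogiri::HTML(DATA.read)

    # Print the href of every <a> element with class "l".
    puts html_doc.css("a.l").collect { |el| el.attribute("href") }

    __END__
    (the ugly HTML page source goes here)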

The output:


For production, just build the query and retrieve the page directly
to build the array of URLs.

Since there is no guarantee that Google won’t tweak its technique
and break this particular code, having a very high-level method
of page-scraping means that it wouldn’t be hard to adjust. Moreover,
this technique can be used in many situations, and once you’ve done
a few sites, you’ll find most applications are as easy as parsing XML
or adapting JSON from “data only” APIs. After all, you get to see
exactly what data is available on the pages, which may include useful
things that an API might not make available.


scroll to ‘Code Snippets’

(also found this, but looks old,