Getting all google results with hpricot and connecting two g

kazaam · August 29, 2007, 2:46pm

Hi,
I’m trying to fetch all google results with hpricot. For the first page
of results I wrote this here:

#!/usr/bin/env ruby
$VERBOSE = true

require 'hpricot'
require 'open-uri'

google = Hpricot(open("Google"))
(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there an even
better way than gsub for cleaning up the results?

And the next question is: how can I get all results not just from the
first page?

greets

kazaam · August 29, 2007, 4:12pm

On Wed, Aug 29, 2007 at 09:45:04PM +0900, kazaam wrote:

(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there an even
better way than gsub for cleaning up the results?

And the next question is: how can I get all results not just from the
first page?

Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.

greets
kazaam

--Greg

kazaam · August 29, 2007, 9:34pm

Gregory S. wrote:

google = Hpricot(open("Google"))
Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot alone.

greets
kazaam

--Greg

I always thought the Google Terms of Service (around 5.3) suggested that
you shouldn’t be running Google searches in scripts?

TerryP.

kazaam · August 29, 2007, 11:44pm

Hi Kazaam,

On 29 Aug 2007, at 13:45, kazaam wrote:

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there an even
better way than gsub for cleaning up the results?

why covered this a while ago in one of his blog posts:
http://redhanded.hobix.com/inspect/nostrils.html

The non-graphic pastie version can be seen here:
http://pastie.caboo.se/54741

And just for completeness, I wrapped it up in a rails controller:
http://douglasfshearer.com/blog/site-search-using-google-in-ruby-on-rails

Cheers.

Douglas F Shearer

kazaam · August 30, 2007, 4:41pm

On Wed, 29 Aug 2007 16:10:06 +0200, Gregory S. wrote:

google = Hpricot(open("Google"))
(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

This doesn’t work?

(google/"h2.r/a").each {|line| puts line['href']}
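
For the case where only the raw markup string is at hand, the two chained gsub calls can also be collapsed into a single capture. This is a minimal sketch using only core Ruby; the sample anchor and the helper names href_two_pass and href_one_pass are made up for illustration and are not part of Hpricot. When the element is already parsed, the attribute lookup shown above is the simpler route.

```ruby
# Sample anchor markup; the URL and class are invented for illustration.
ANCHOR = '<a href="http://example.com/page" class="l">Example</a>'

# Original approach: two gsub passes strip everything around the URL.
def href_two_pass(html)
  html.gsub(/^.+href="/, '').gsub(/" .+$/, '')
end

# Single-pass alternative: capture the attribute value directly.
# Returns nil when the string contains no href attribute.
def href_one_pass(html)
  html[/href="([^"]*)"/, 1]
end

puts href_two_pass(ANCHOR)
puts href_one_pass(ANCHOR)
```

Both calls print the same URL for this sample. The single capture also avoids building an intermediate string, although any speed difference is likely to be small next to the network request.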

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there an even
better way than gsub for cleaning up the results?

And the next question is: how can I get all results not just from the
first page?

Can’t you teach Google to show more results on a page?
If not, extract the “next page” link and fetch that page. Proceed for as
long as you want.
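
The "follow the next page link" loop could be sketched roughly as follows. This is only an illustration under stated assumptions: collect_pages is a hypothetical helper (not part of Hpricot or Mechanize), the pages are in-memory stand-ins rather than real search results, and the id="next" marker on the link is invented for the sample markup.

```ruby
# Hypothetical helper: collect pages by following a "next page" link
# until there is none, a link repeats, or max_pages is reached.
#   fetch     - callable: URL -> HTML string
#   next_link - callable: HTML string -> next URL, or nil on the last page
def collect_pages(start_url, fetch, next_link, max_pages = 10)
  pages = []
  seen  = {}
  url   = start_url
  while url && !seen[url] && pages.size < max_pages
    seen[url] = true          # guard against link cycles
    html = fetch.call(url)
    pages << html
    url = next_link.call(html)
  end
  pages
end

# In-memory stand-ins so the sketch runs without network access.
SAMPLE_PAGES = {
  "page1" => '<a href="http://example.com/a">A</a> <a id="next" href="page2">Next</a>',
  "page2" => '<a href="http://example.com/b">B</a>'
}

fetch     = lambda { |url| SAMPLE_PAGES.fetch(url) }
next_link = lambda { |html| html[/<a id="next" href="([^"]+)"/, 1] }

pages = collect_pages("page1", fetch, next_link)
puts pages.size
```

In a real script, fetch would wrap an HTTP request and next_link would query the parsed document instead of using a regular expression. The max_pages cap keeps the loop bounded even if every page offers another link.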

Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.

Nope. For this purpose, Hpricot is the best you can use!!! And it is
really great!
Mechanize doesn’t give you much advantage if you just want to parse
pages (and I say that as the author who wrote the initial version of
Mechanize :).

scrubyt never worked on my machine, but I also find it too complicated,
or maybe I am just too stupid :).

Regards,

Michael

kazaam · August 30, 2007, 10:24pm

From the Google Terms of Service, which as far as I know cover all Google
products in lieu of a separate service agreement.

quote:

Use of the Services by you

5.1 In order to access certain Services, you may be required to provide
information about yourself (such as identification or contact details)
as part of the registration process for the Service, or as part of your
continued use of the Services. You agree that any registration
information you give to Google will always be accurate, correct and up
to date.

5.2 You agree to use the Services only for purposes that are permitted
by (a) the Terms and (b) any applicable law, regulation or generally
accepted practices or guidelines in the relevant jurisdictions
(including any laws regarding the export of data or software to and from
the United States or other relevant countries).

5.3 You agree not to access (or attempt to access) any of the Services
by any means other than through the interface that is provided by
Google, unless you have been specifically allowed to do so in a separate
agreement with Google. You specifically agree not to access (or attempt
to access) any of the Services through any automated means (including
use of scripts or web crawlers) and shall ensure that you comply with
the instructions set out in any robots.txt file present on the Services.

5.4 You agree that you will not engage in any activity that interferes
with or disrupts the Services (or the servers and networks which are
connected to the Services).

5.5 Unless you have been specifically permitted to do so in a separate
agreement with Google, you agree that you will not reproduce, duplicate,
copy, sell, trade or resell the Services for any purpose.

5.6 You agree that you are solely responsible for (and that Google has
no responsibility to you or to any third party for) any breach of your
obligations under the Terms and for the consequences (including any loss
or damage which Google may suffer) of any such breach.

:unquote

Doesn’t that apply to this??

Cheers.

kazaam · August 30, 2007, 12:27am

Douglas F Shearer wrote:

why covered this a while ago in one of his blog posts:
http://redhanded.hobix.com/inspect/nostrils.html

The non-graphic pastie version can be seen here:
http://pastie.caboo.se/54741

And just for completeness, I wrapped it up in a rails controller:
http://douglasfshearer.com/blog/site-search-using-google-in-ruby-on-rails

I would not suggest doing that, as Google doesn’t allow bot requests and
your IP will simply get blacklisted. For site search I recommend Google
Co-op: Programmable Search Engine by Google

Daniel