Getting all google results with hpricot and connecting two g

Hi,
I’m trying to fetch all google results with hpricot. For the first page
of results I wrote this here:

#!/usr/bin/env ruby
$VERBOSE = true

require 'hpricot'
require 'open-uri'

google = Hpricot(open("Google"))
(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there even a
better way than using gsub for cleaning the results?

And the next question is: how can I get all results not just from the
first page?

greets

On Wed, Aug 29, 2007 at 09:45:04PM +0900, kazaam wrote:

(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there even a
better way than using gsub for cleaning the results?

And the next question is: how can I get all results not just from the
first page?

Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.

greets
kazaam
–Greg

Gregory S. wrote:

google = Hpricot(open("Google"))
Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot alone.


I always thought the Google Terms of Service (around 5.3) suggested that
you shouldn’t be running Google searches in scripts?

TerryP.

Hi Kazaam,

On 29 Aug 2007, at 13:45, kazaam wrote:

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there even
a better way than using gsub for cleaning the results?

why covered this a while ago in one of his blog posts:
http://redhanded.hobix.com/inspect/nostrils.html

The non-graphic pastie version can be seen here:
http://pastie.caboo.se/54741

And just for completeness, I wrapped it up in a Rails controller:
http://douglasfshearer.com/blog/site-search-using-google-in-ruby-on-rails

Cheers.

Douglas F Shearer

On Wed, 29 Aug 2007 16:10:06 +0200, Gregory S. wrote:

google = Hpricot(open("Google"))
(google/"h2.r/a").each {|line| puts line.to_s.gsub(/^.+href="/,'').gsub(/" .+$/,'')}

This doesn’t work?

(google/"h2.r/a").each {|line| puts line['href']}
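
If you would rather keep working on the element's string form, the two
gsub calls can be folded into one pattern with a capture group. A minimal
sketch in plain Ruby, without Hpricot; the sample anchor string and the
helper name extract_href are made up for illustration:

```ruby
# Hypothetical helper: pull the href value out of an anchor tag's HTML string.
# String#[] with a regexp and a group index returns that capture, or nil when
# the pattern does not match.
def extract_href(html)
  html[/href="([^"]*)"/, 1]
end

# Made-up sample input, standing in for line.to_s in the original script.
sample = '<a href="http://example.com/page" class="l">Example</a>'

puts extract_href(sample)
```

This scans the string once instead of twice and does not depend on what
follows the closing quote of the attribute.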

So my first question is: can I combine the two gsub statements above
into a single gsub, which should increase the speed? Or is there even a
better way than using gsub for cleaning the results?

And the next question is: how can I get all results not just from the
first page?

Can’t you teach Google to show more results on a page?
If not, extract the “next page” link and fetch that page. Proceed as
long as you want :)
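
The “follow the next page link” idea can be sketched as a loop that keeps
fetching until no further link is found. This is only an outline:
each_page, the fetch and next_link callables, and the max_pages limit are
hypothetical names, and the fake pages below stand in for real HTTP
requests and Hpricot lookups.

```ruby
# Walk a chain of pages: fetch a page, yield it, then ask next_link for the
# URL of the following page. Stops when next_link returns nil or after
# max_pages pages, whichever comes first. Returns the number of pages seen.
def each_page(start_url, fetch:, next_link:, max_pages: 10)
  url = start_url
  count = 0
  while url && count < max_pages
    page = fetch.call(url)
    yield page
    url = next_link.call(page)
    count += 1
  end
  count
end

# Fake three-page site used instead of the network (illustrative data only).
PAGES = {
  "p1" => { body: "results 1", next: "p2" },
  "p2" => { body: "results 2", next: "p3" },
  "p3" => { body: "results 3", next: nil }
}

fetch     = ->(url)  { PAGES.fetch(url) }
next_link = ->(page) { page[:next] }

each_page("p1", fetch: fetch, next_link: next_link) { |page| puts page[:body] }
```

With real pages, fetch would be the part that downloads and parses a URL
and next_link the part that returns the href of the “next page” anchor;
the max_pages cap is there so the loop cannot run forever.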

Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.

Nope. For this purpose, Hpricot is the best you can use!!! And it is
really great!
Mechanize doesn’t give you much advantage if you just want to parse
pages (and I say that as the author who wrote the initial version of
Mechanize :).

scrubyt never worked on my machine, but I also find it too complicated,
or maybe I am just too stupid :).

Regards,

Michael

From the Google Terms of Service, which as far as I know cover all Google
products in lieu of a separate service agreement.

quote:

  5. Use of the Services by you

    5.1 In order to access certain Services, you may be required to provide
    information about yourself (such as identification or contact details)
    as part of the registration process for the Service, or as part of your
    continued use of the Services. You agree that any registration
    information you give to Google will always be accurate, correct and up
    to date.

    5.2 You agree to use the Services only for purposes that are permitted
    by (a) the Terms and (b) any applicable law, regulation or generally
    accepted practices or guidelines in the relevant jurisdictions
    (including any laws regarding the export of data or software to and
    from the United States or other relevant countries).

    5.3 You agree not to access (or attempt to access) any of the Services
    by any means other than through the interface that is provided by
    Google, unless you have been specifically allowed to do so in a
    separate agreement with Google. You specifically agree not to access
    (or attempt to access) any of the Services through any automated means
    (including use of scripts or web crawlers) and shall ensure that you
    comply with the instructions set out in any robots.txt file present on
    the Services.

    5.4 You agree that you will not engage in any activity that interferes
    with or disrupts the Services (or the servers and networks which are
    connected to the Services).

    5.5 Unless you have been specifically permitted to do so in a separate
    agreement with Google, you agree that you will not reproduce,
    duplicate, copy, sell, trade or resell the Services for any purpose.

    5.6 You agree that you are solely responsible for (and that Google has
    no responsibility to you or to any third party for) any breach of your
    obligations under the Terms and for the consequences (including any
    loss or damage which Google may suffer) of any such breach.

:unquote

Doesn’t that apply to this?? :confused:

Cheers.

Douglas F Shearer wrote:

why covered this a while ago in one of his blog posts:
http://redhanded.hobix.com/inspect/nostrils.html

The non-graphic pastie version can be seen here:
http://pastie.caboo.se/54741

And just for completeness, I wrapped it up in a Rails controller:
http://douglasfshearer.com/blog/site-search-using-google-in-ruby-on-rails

I would not suggest doing that, as Google doesn’t allow bot requests and
your IP will simply get blacklisted. For site search I recommend Google
Co-op: Programmable Search Engine by Google

Daniel