Hi,
I’m trying to fetch all Google results with Hpricot. For the first page
of results I wrote this:
#!/usr/bin/env ruby
$VERBOSE = true
require 'hpricot'
require 'open-uri'
google = Hpricot(open("Google"))
(google/"h2.r/a").each { |line| puts line.to_s.gsub(/^.+href="/, '').gsub(/" .+$/, '') }
So my first question is: can I combine the two gsub statements above into
a single gsub, which should increase the speed? Or is there an even better
way than gsub to clean up the results?
And the next question is: how can I get all results, not just those from the
first page?
Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.
On Wed, 29 Aug 2007 16:10:06 +0200, Gregory S. wrote:
google = Hpricot(open("Google"))
(google/"h2.r/a").each { |line| puts line.to_s.gsub(/^.+href="/, '').gsub(/" .+$/, '') }
This doesn’t work?
(google/"h2.r/a").each { |line| puts line['href'] }
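If you ever do have to work on the raw tag string rather than the parsed element, the two gsub passes can probably be replaced by a single pattern with a capture group. The following is only a sketch using plain Ruby strings; the helper name `extract_href` and the sample tag are made up for illustration, and it assumes the attribute value is wrapped in double quotes:

```ruby
# Hypothetical helper: returns the value of the first double-quoted href
# attribute in a tag string, or nil if there is none.
def extract_href(tag)
  # String#[] with a regexp and a group index returns just that capture.
  tag[/href="([^"]*)"/, 1]
end

# Sample anchor tag, invented for this example.
tag = '<a href="http://example.com/page" class="l">Example</a>'

# The original two-pass approach, for comparison.
two_pass = tag.gsub(/^.+href="/, '').gsub(/" .+$/, '')

puts two_pass
puts extract_href(tag)
```

Both lines print the same URL for this sample. The capture version also returns nil instead of the whole tag when no href is present, which is usually easier to handle.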
So my first question is can I connect the both gsub statments above in
just one gsub which should increase the speed? Or is there even a better
way than using gsub for cleaning the results?
And the next question is: how can I get all results not just from the
first page?
Can’t you teach Google to show more results on a page?
If not, extract the “next page” link and fetch that page. Repeat for as
long as you want.
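The “follow the next link” loop could look roughly like the sketch below. It is not tied to any particular site: the method name `collect_pages`, the `max_pages` limit, and the assumption that the link text is literally “Next” are all mine, and the actual fetching is left to a block so you can plug in open-uri or anything else. A real page would be better parsed with Hpricot than with a regexp; the regexp only keeps the example free of dependencies.

```ruby
require 'uri'

# Hypothetical helper: starting at start_url, fetches a page via the given
# block, looks for a "Next" link, and repeats until there is no such link
# or max_pages pages have been collected. Returns the fetched HTML strings.
def collect_pages(start_url, max_pages: 10)
  pages = []
  url = start_url
  while url && pages.size < max_pages
    html = yield(url)
    pages << html
    # Assumed markup: <a ... href="...">Next</a>; adjust to the real page.
    next_href = html[/<a[^>]*href="([^"]*)"[^>]*>\s*Next\s*<\/a>/i, 1]
    # Resolve relative links against the current page's URL.
    url = next_href && URI.join(url, next_href).to_s
  end
  pages
end

# Usage with an in-memory "site" instead of real HTTP requests.
site = {
  'http://example.test/results?page=1' => '<a href="/results?page=2">Next</a>',
  'http://example.test/results?page=2' => '<p>last page</p>'
}
pages = collect_pages('http://example.test/results?page=1') { |u| site.fetch(u) }
puts pages.size
```

The `max_pages` cap is there so that a page whose “Next” link points back to itself cannot keep the loop running forever.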
Look into mechanize or scrubyt for this. They sit on top of hpricot, but
are much better suited to screen scraping applications than hpricot
alone.
Nope. For this purpose, Hpricot is the best you can use!!! And it is
really great!
Mechanize doesn’t give you much advantage if you just want to parse pages
(and I say that as the author who wrote the initial version of Mechanize :).
scrubyt never worked on my machine, but I also find it too complicated, or
maybe I am just too stupid :).
From the Google Terms of Service, which as far as I know cover all Google
products in lieu of a separate service agreement.
quote:
Use of the Services by you
5.1 In order to access certain Services, you may be required to provide information about yourself (such as identification or contact details) as part of the registration process for the Service, or as part of your continued use of the Services. You agree that any registration information you give to Google will always be accurate, correct and up to date.
5.2 You agree to use the Services only for purposes that are permitted by (a) the Terms and (b) any applicable law, regulation or generally accepted practices or guidelines in the relevant jurisdictions (including any laws regarding the export of data or software to and from the United States or other relevant countries).
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
5.4 You agree that you will not engage in any activity that interferes with or disrupts the Services (or the servers and networks which are connected to the Services).
5.5 Unless you have been specifically permitted to do so in a separate agreement with Google, you agree that you will not reproduce, duplicate, copy, sell, trade or resell the Services for any purpose.
5.6 You agree that you are solely responsible for (and that Google has no responsibility to you or to any third party for) any breach of your obligations under the Terms and for the consequences (including any loss or damage which Google may suffer) of any such breach.
I would not suggest doing that, as Google doesn’t allow bot requests and
your IP will simply get blacklisted. For site search I recommend Google
Co-op: Programmable Search Engine by Google
Daniel