How to check if a webpage exists

This is probably trivial, but I have been googling for almost two hours
without finding a viable solution.
Basically I have this Rails app which uses Hpricot to parse web pages.
There’s this line

page = Hpricot( open(url))

If the url is wrong, or the server is down, obviously I get an
exception.
First I tried my luck with

if page = Hpricot( open(url))
blah blah
end

but this did not work.

Then I started googling like crazy for a method to check if a webpage is
loadable. I only found this thread

http://markmail.org/message/iurqf4ejbndbczqq

I tried the suggested code, it does not work.
Now, I am pretty sure there’s a straightforward way of checking whether
a webpage is loadable.
Can you help me?
Thanks in advance,
Davide

On Wed, Sep 10, 2008 at 4:38 PM, Davide B. [email protected]
wrote:

> I tried the suggested code, it does not work.
> Now, I am pretty sure there’s a straightforward way of checking whether
> a webpage is loadable.
> Can you help me?
> Thanks in advance,
> Davide

Not really an answer, but should point you in the right direction.
Also, I can’t test with Hpricot right now due to gem install issues.
But, using open-uri…

require 'open-uri'; begin; open('http://www.www.www') {} rescue '404 error'; end

Todd

Thanks folks,
your suggestion works, but I am not able to integrate it with my
existing code; I have some problems understanding how rescue interacts
with code blocks.
Basically, I have a chunk of code that must be executed ONLY IF there is
no exception; if there is an exception, I need to execute another chunk
of code.
I tried a couple of syntaxes:

require 'open-uri'
begin
  open('http://www.www.www') {
    # code to execute if everything's ok
  } rescue '404 error'
  # code to execute in case of error
end

I also tried

require 'open-uri'
begin
  open('http://www.www.www') {}
rescue '404 error'
  # code to execute in case of error
else
  # code to execute if everything's ok
end

Also

require 'open-uri'
begin
  open('http://www.www.www')
  puts "ok"
rescue '404 error'
  puts "error"
end

None of this works.
Which is the proper syntax?
Davide
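
A minimal sketch of the begin/rescue/else form that the attempts above are reaching for. A rescue clause expects exception classes, not a string such as '404 error', and the else branch runs only when nothing was raised. The names load_page and DEFAULT_FETCHER are hypothetical, and the fetcher is passed in only so the example can be tried without a network connection. The sketch assumes Ruby 2.5 or later, where open-uri provides URI.open; on the Ruby 1.8 of this thread, plain open(url) plays the same role.

```ruby
require 'open-uri'
require 'socket'

# Hypothetical default: fetch the body of a URL with open-uri.
DEFAULT_FETCHER = lambda { |url| URI.open(url) { |io| io.read } }

# Hypothetical helper: returns [:ok, body] when the page loaded,
# [:error, message] when fetching raised one of the listed exceptions.
def load_page(url, fetcher = DEFAULT_FETCHER)
  begin
    body = fetcher.call(url)
  rescue OpenURI::HTTPError, SocketError, SystemCallError => e
    # Runs only if fetching raised: HTTP error status, DNS failure,
    # refused connection and similar.
    [:error, e.message]
  else
    # Runs only if no exception was raised.
    [:ok, body]
  end
end
```

Calling load_page(url) and branching on the first element of the result keeps the "everything is ok" code and the "error" code apart, which is what the question asks for.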

-------- Original Message --------

Date: Thu, 11 Sep 2008 07:32:40 +0900
From: “Todd B.” [email protected]
To: [email protected]
Subject: Re: How to check if a webpage exists

> Also, I can’t test with Hpricot right now due to gem install issues.
> But, using open-uri…
>
> require 'open-uri'; begin; open('http://www.www.www') {} rescue '404 error'; end
>
> Todd

Dear Davide,

I was just about to suggest the same thing. It works on my Ubuntu
machine.

Best regards,

Axel

From: Davide B. [mailto:[email protected]]

> None of this works.
> Which is the proper syntax?

compare,

require 'open-uri'
=> false
begin

  p "i'm ok" #<-- ok codes here
rescue
  p "sorry can't do" #<-- not ok codes here
end
"i'm ok"
=> nil

begin

  p "i'm ok"
rescue => e
  p "sorry can't do"
  p "error is: #{e}"
end
"sorry can't do"
"error is: 503 Service Unavailable"
=> nil

Thanks for your super-fast answer :)
Yet, none of this works on my system; to be doubly sure I copied and
pasted.
In a script, I get “I’m ok” even when a page does not exist; in IRB, I
always get “sorry can’t do”.
Any suggestion?
Davide

-------- Original Message --------

Date: Thu, 11 Sep 2008 17:15:39 +0900
From: Davide B. [email protected]
To: [email protected]
Subject: Re: How to check if a webpage exists

> Thanks for your super-fast answer :)
> Yet, none of this works on my system; to be doubly sure I copied and
> pasted.
> In a script, I get “I’m ok” even when a page does not exist; in IRB, I
> always get “sorry can’t do”.
> Any suggestion?
> Davide

Dear Davide,

hmmm … the suggested code works on my system (Ubuntu 8.04 / ruby
1.8.7.p-22), both for scripts and in irb.
How do you enter the code in irb?
Do you do

begin (enter)
line 1 (enter)
rescue (enter)
line 2 (enter)
end (enter) ?

I tried instead

begin ; line 1 ; rescue ; line 2 ; end (enter)

This caused irb to work correctly.

Best regards,

Axel

Hi Axel,
I work on Mac OS X Leopard. Ruby works all right, and I also have a number of
Rails websites running locally, no problems so far.
Could you help me with the proper “script” syntax? I am sure the
rationale behind the mechanism is correct, but I ultimately need to
integrate this script in a Rails application, so I need to have it
working in a common .rb file. As I said, I tried this

require 'open-uri'
begin
  open "www.does.not.exist.sdadasdas.com", :proxy=>true
  p "i'm ok" #<-- ok codes here
rescue => e
  p "sorry can't do"
  p "error is: #{e}"
end

Simple as it seems, it does not work, I always end up with “i’m ok”. I’m
sure it’s some stupid syntactic glitch…
Any suggestion?
Davide

From: Davide B. [mailto:[email protected]]
> #…
> open "www.does.not.exist.sdadasdas.com", :proxy=>true

fwiw, mine does not work if I do not qualify the URL, i.e. it should be

open "http://www.does.not.exist.sdadasdas.com", :proxy=>true

note the http://

but your case is weird, since it always works regardless :)
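
The reason the scheme matters: open-uri only takes over when the string starts with a scheme such as http://. Anything else is handed to the ordinary open, which treats it as a local file name. A small sketch of a guard that could run before fetching; the name http_url? is hypothetical and not part of open-uri.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http:// or https:// URLs
# that carry a host name.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI at all are not usable either.
  false
end
```

With such a check, "www.does.not.exist.sdadasdas.com" is rejected before open is ever called, while the http:// form goes through to the network.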

From: Davide B. [mailto:[email protected]]

> I work on Mac OS X Leopard. Ruby works all right, and I also have
> a number of Rails websites running locally, no problems so far.
> Could you help me with the proper “script” syntax? I am sure the
> rationale behind the mechanism is correct, but I ultimately need to
> integrate this script in a Rails application, so I need to have it
> working in a common .rb file. As I said, I tried this

try

> require 'open-uri'
> begin
> open "www.does.not.exist.sdadasdas.com", :proxy=>true

replace the above line with

puts open("www.does.not.exist.sdadasdas.com",:proxy=>true).read

post the output again

> p "i'm ok" #<-- ok codes here
> rescue => e
> p "sorry can't do"
> p "error is: #{e}"
> end

The output is

"sorry can't do"
"error is: can't convert Hash into String"

So now there is a type conflict…
Davide

Ok folks, I probably see what’s wrong.
The latest example works

begin
  puts open("http://www.does.not.exist.sdadasdas.com", :proxy=>true).read
  p "i'm ok"
rescue => e
  p "sorry can't do"
  p "error is: #{e}"
end

I get "sorry can't do" "error is: 400 Bad Request".

But if I do this

begin
  puts open("http://www.jhkhjhkj.com", :proxy=>true).read
  p "i'm ok"
rescue => e
  p "sorry can't do"
  p "error is: #{e}"
end

I get

<html

xmlns=“XHTML namespace”>body{font-family:Arial, Helvetica, FreeSans,
sans-serif;font-size:16px;color:#333;margin:10px 0 0
0;padding:0}a{color:#333;outline:0}a:hover{text-

etc etc

This is a page that my (!%&%&!) ISP loads when no page is found.
So the problem is that open gets a page from a redirect, right?
The problem is that my Rails application, a kind of spider, is supposed
to load a number of pages; if one server goes down, I don’t want it to
get stuck. Moreover, when the app runs online, my hosting server
might react in a different way. Is there a way to make sure the page
loaded is the page I asked for, not an error, a redirect or anything
else?
What do you suggest?
Davide

From: Davide B. [mailto:[email protected]]

> The output is
>
> "sorry can't do"
> "error is: can't convert Hash into String"
>
> So now there is a type conflict…

because I just copied your URL.
try this,

puts open("http://www.does.not.exist.sdadasdas.com", :proxy=>true).read

Dear Davide,

glad that it worked this far.
You could use Hpricot’s parsing capabilities to check whether the page
you loaded

Good point Axel, I was so focused on this small chunk of code I didn’t
think that I could check the page later, during the processing.
At any rate, thanks all here, now I’m able to deal with the 404
exception, I just have to take care of other problems in other parts of
the app.
Cheers,
Davide

-------- Original Message --------

Date: Thu, 11 Sep 2008 18:55:27 +0900
From: Davide B. [email protected]
To: [email protected]
Subject: Re: How to check if a webpage exists

> This is a page that my (!%&%&!) ISP loads when no page is encountered.

Dear Davide,

glad that it worked this far.
You could use Hpricot’s parsing capabilities to check whether the page
you loaded

> <html
> xmlns="XHTML namespace">body{font-family:Arial, Helvetica, FreeSans,
> sans-serif;font-size:16px;color:#333;margin:10px 0 0
> 0;padding:0}a{color:#333;outline:0}a:hover{text-

is the one you asked for, by comparing the address of the page your
request returns with the address you first gave it.

Best regards,

Axel
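
One way to read the suggestion above, as a rough sketch: the object returned by open-uri responds to base_uri, which gives the URI the data finally came from after any redirects, so its host can be compared with the host that was requested. The helper name same_host? is hypothetical. This only catches the case where the ISP answers with an HTTP redirect to a different host, and it would also flag harmless redirects such as example.com to www.example.com.

```ruby
require 'uri'

# Hypothetical helper: did the response come from the host we asked for?
# Both arguments are URL strings; the comparison ignores letter case.
def same_host?(requested_url, final_url)
  requested = URI.parse(requested_url).host.to_s.downcase
  final = URI.parse(final_url).host.to_s.downcase
  !requested.empty? && requested == final
end

# Possible use with open-uri (needs a network connection, so not run here):
#   page = URI.open(url)
#   trusted = same_host?(url, page.base_uri.to_s)
```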

On Thu, Sep 11, 2008 at 6:55 PM, Davide B. [email protected]
wrote:

> This is a page that my (!%&%&!) ISP loads when no page is encountered.
> So the problem is that open gets a page from a redirect, right?
> The problem is that my Rails application, a kind of spider, is supposed
> to load a number of pages; if one server goes down, I don’t want it to
> get stuck. Moreover, when the app runs online, my hosting server
> might react in a different way. Is there a way to make sure the page
> loaded is the page I asked for, not an error, a redirect or anything
> else?
> What do you suggest?

You can use something really low level:
http://p.ramaze.net/1960

Note that you can use the status to determine what happened… see the
http rfc for more information on the status codes.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1
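
The linked paste is not reproduced here, so the following is not that code, only a sketch of the same idea with Net::HTTP from the standard library: make one request, do not follow redirects, and branch on the class of the response, which mirrors the status code ranges in the RFC. The names classify and check are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: maps a Net::HTTPResponse to a coarse outcome.
def classify(response)
  case response
  when Net::HTTPSuccess     then :ok            # 2xx
  when Net::HTTPRedirection then :redirect      # 3xx, target in response['location']
  when Net::HTTPClientError then :client_error  # 4xx, e.g. 404 Not Found
  when Net::HTTPServerError then :server_error  # 5xx, e.g. 503 Service Unavailable
  else :unknown
  end
end

# Hypothetical helper: one GET request; redirects are NOT followed.
def check(url)
  classify(Net::HTTP.get_response(URI.parse(url)))
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused connection, timeout: the server could not be reached.
  :unreachable
end
```

Because redirects are reported rather than followed, a spider can decide for itself whether a :redirect result is acceptable for a given page.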

On Thu, Sep 11, 2008 at 10:34 PM, Davide B.
[email protected] wrote:

> > You can use something really low level:
> > http://p.ramaze.net/1960
>
> Hi Micheal, the low-level solution you’ve linked to doesn’t seem to work
> at all; I changed the code and tested http://www.google.com and I get
> “what to do when it doesn’t work”.
> Yours would actually be the solution I like best; does it work on your
> system?

Sorry, I should test last-minute changes…
http://p.ramaze.net/1961
but please read and try to understand the code, it’s no use if you’re
simply copy&pasting solutions.

^ manveru

Hi Micheal,
sorry if you got the impression I am just copying and pasting. I had got
the gist of your script’s rationale, and I tried to modify this and that,
but I am probably too much of a beginner to debug it by myself. At any rate,
thanks for your help so far.

Anyway, for some reason the script does not work yet. Working pages
(like repubblica.it or corriere.it, Italy’s most popular newspaper
websites) return a 400 status, and the script fails. Moreover, if I
insert a URL without a www (like my own website,
http://davidebenini.it), the connection doesn’t even start.
Why is that? Do you think there’s a problem with the request format?
Cheers,
Davide

I extended the format, so this should work:
http://p.ramaze.net/1962

Hi Micheal, with the Host part added to the request it works better; now
I can sort positive from negative responses.
I still get false positives, because my ISP has decided to issue a
redirect to its own error page in case of missing pages, but as I said I
am going to catch these cases later, when I parse the code with Hpricot.
As a matter of fact, I don’t think the response is enough to see if the
page loaded correctly; many pages legitimately work through a redirect, so if ISPs
decide to issue a redirect for wrong pages too, I don’t think
there’s any way to tell the difference.
Thanks for your help,
Davide
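
For readers without access to the linked pastes, a sketch of what "the Host part" of a hand-written request looks like. This is an assumption about the fix, not the pasted code: many servers host several sites on one address and may answer a request that lacks a Host header with 400 Bad Request, which would match the symptoms reported above. The name build_get_request is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: builds the text of a minimal HTTP/1.1 GET request,
# including the Host header that name-based virtual hosts rely on.
def build_get_request(url)
  uri = URI.parse(url)
  path = uri.path.to_s.empty? ? '/' : uri.path
  path += "?#{uri.query}" if uri.query
  [
    "GET #{path} HTTP/1.1",
    "Host: #{uri.host}",
    'Connection: close',
    '', ''                      # blank line terminates the header section
  ].join("\r\n")
end
```

The resulting string can be written to a TCPSocket opened on the URL's host and port 80; the first line of the reply then carries the status code.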