open-uri question

akanksha · July 26, 2006, 6:42pm

I am using open-uri for the first time. I need to visit a bunch of URLs
and gather some data. Here is a small code snippet:

require 'open-uri' # allows the use of a file-like API for URLs
open("http://no-way-outspaik375.spaces.msn.com/") { |file|
  lines = file.read
  puts lines
}

and here is the error I get
ruby test.rb
/usr/local/lib/ruby/1.8/open-uri.rb:290:in `open_http': 500 Internal Server Error (OpenURI::HTTPError)
        from /usr/local/lib/ruby/1.8/open-uri.rb:629:in `buffer_open'
        from /usr/local/lib/ruby/1.8/open-uri.rb:167:in `open_loop'
        from /usr/local/lib/ruby/1.8/open-uri.rb:165:in `open_loop'
        from /usr/local/lib/ruby/1.8/open-uri.rb:135:in `open_uri'
        from /usr/local/lib/ruby/1.8/open-uri.rb:531:in `open'
        from /usr/local/lib/ruby/1.8/open-uri.rb:86:in `open'
        from test.rb:2

However,

require 'open-uri' # allows the use of a file-like API for URLs
open("http://www.google.com/") { |file|
  lines = file.read
  puts lines
}

works just fine. What am I doing wrong??

akanksha · July 26, 2006, 6:49pm

akanksha wrote:

and here is the error I get
ruby test.rb
/usr/local/lib/ruby/1.8/open-uri.rb:290:in `open_http': 500 Internal
Server Error (OpenURI::HTTPError)
…

You can see some info on HTTP 500 errors here:
http://www.checkupdown.com/status/E500.html

Maybe the service was down?
Or they may have it restricted to prevent scraping?
You may need to provide some info to fool the site into
thinking you're a regular browser…

Cheers
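
To see what the server actually sent back, the exception can be rescued and inspected instead of letting it abort the script. A minimal sketch, assuming a current Ruby where `URI.open` is available (on Ruby 1.8, as in the trace above, plain `open` plays the same role). `describe_http_error` and `try_fetch` are hypothetical helper names, not part of open-uri:

```ruby
require 'open-uri'

# Hypothetical helper: summarize an OpenURI::HTTPError for logging.
# error.io.status is a two-element array such as ["500", "Internal Server Error"].
def describe_http_error(error)
  code, reason = error.io.status
  "HTTP #{code} (#{reason})"
end

# Hypothetical helper: return the page body, or nil after reporting an HTTP error.
def try_fetch(url)
  URI.open(url) { |file| file.read }
rescue OpenURI::HTTPError => e
  warn "#{url}: #{describe_http_error(e)}"
  nil
end

# Example call (needs network access):
# try_fetch("http://no-way-outspaik375.spaces.msn.com/")
```

Returning nil on failure lets a loop over many URLs keep going when one of them errors out.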

akanksha · July 26, 2006, 7:55pm

On 7/26/06, akanksha wrote:

Or they may have it restricted to prevent scraping?
You may need to provide some info to fool the site into
thinking you're a regular browser…

How would I go about doing that …could you plz point me to some
info?
Thank you.

Use something like Ethereal to capture the packets sent between your
browser and the service. Then imitate that in code. You will just need
to send the same HTTP headers and follow any redirects that the server
sends.

Good luck!

Justin
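
For reference, sending chosen headers and following redirects by hand might look roughly like this with Net::HTTP from the standard library. This is only a sketch: `fetch_with_headers`, the redirect limit of 5, and the header values in the example call are assumptions for illustration, and the real values would come from whatever the packet capture shows.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: GET a URL with the given headers, following redirects
# up to redirect_limit times. Returns the final Net::HTTPResponse.
def fetch_with_headers(url, headers, redirect_limit = 5)
  raise ArgumentError, "too many redirects" if redirect_limit.zero?

  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri, headers)
  response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == "https") do |http|
    http.request(request)
  end

  if response.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response["location"]).to_s
    fetch_with_headers(next_url, headers, redirect_limit - 1)
  else
    response
  end
end

# Example call (needs network access; header values are placeholders):
# response = fetch_with_headers("http://no-way-outspaik375.spaces.msn.com/",
#                               "User-Agent" => "Mozilla/4.0", "Accept" => "text/html")
# puts response.code
```

Note that open-uri already follows ordinary redirects on its own, so this much manual work is only needed when finer control over the request is wanted.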

akanksha · July 26, 2006, 7:58pm

On Thu, 27 Jul 2006, akanksha wrote:

How would I go about doing that …could you plz point me to some
info?
Thank you.

you need to set user-agent to a ‘real’ browser. something like
‘Mozilla/4.0’

-a
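
With open-uri, request headers can be passed as a hash after the URL, so the suggestion above could be sketched like this. It assumes a current Ruby with `URI.open`; on Ruby 1.8, plain `open(url, headers)` accepts the same hash. `BROWSER_HEADERS` and `fetch_as_browser` are hypothetical names, and 'Mozilla/4.0' is just the value mentioned above:

```ruby
require 'open-uri'

# User-Agent value suggested in the thread; any other browser-like string is an assumption.
BROWSER_HEADERS = { "User-Agent" => "Mozilla/4.0" }

# Hypothetical helper: fetch a page while presenting a browser-like User-Agent.
def fetch_as_browser(url, headers = BROWSER_HEADERS)
  # String keys in the options hash are sent as HTTP request headers.
  URI.open(url, headers) { |file| file.read }
end

# Example call (needs network access):
# puts fetch_as_browser("http://no-way-outspaik375.spaces.msn.com/")
```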

akanksha · July 26, 2006, 8:22pm

yes that works and so does mechanize …thanks!!!

akanksha · July 26, 2006, 7:51pm

Maybe the service was down?

The service was not down. Both urls open in a browser.

Or they may have it restricted to prevent scraping?
You may need to provide some info to fool the site into
thinking you're a regular browser…

How would I go about doing that …could you plz point me to some
info?
Thank you.