Forum: Ruby Stuck in a Redirect Loop While Crawling

Matt W. (Guest)
on 2007-06-12 22:41
(Received via mailing list)
Hello,

I am writing a crawler in Ruby to crawl websites. One of the sites I
crawl is very picky about headers, so I am mimicking my Firefox browser
as closely as possible. One of the GETs I make to this site results in
a redirect response. I take the 'location' field from the redirect
header and go there. When Firefox sends its GET to this location, it
gets a 200 OK response. When my crawler sends the same GET, however, it
gets redirected again every time.
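
For reference, the follow-the-location step described above can be sketched roughly as below. This is a minimal sketch, not the crawler's actual code: `follow_redirects`, `RedirectLoopError`, the hop limit of 5, and the injectable `fetch` lambda are all assumptions made for illustration. It stops as soon as a redirect points back to a URL it has already requested, which is the symptom described here.

```ruby
require 'net/http'
require 'uri'

# Hypothetical error class for this sketch.
class RedirectLoopError < StandardError; end

# Follow redirects from start_url, sending the same headers on every hop.
# fetch is injectable so the logic can be exercised without a network;
# the default performs a real GET with Net::HTTP.
def follow_redirects(start_url, headers = {}, limit = 5, fetch = nil)
  fetch ||= lambda do |uri, hdrs|
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = (uri.scheme == 'https')
    http.get(uri.request_uri, hdrs)
  end

  uri = URI.parse(start_url)
  seen = []
  limit.times do
    raise RedirectLoopError, "redirected back to #{uri}" if seen.include?(uri.to_s)
    seen << uri.to_s

    response = fetch.call(uri, headers)
    return response unless response.is_a?(Net::HTTPRedirection)

    location = response['location']
    return response if location.nil? # e.g. 304, nothing to follow

    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  raise RedirectLoopError, "gave up after #{limit} redirects"
end
```

Calling `follow_redirects(uri.to_s, headers)` would return the first non-redirect response, or raise instead of looping forever.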

Here is what Firefox is sending:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4
Keep-Alive: 300
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Language: en-us,en;q=0.5
Cookie: sessionid=6d7dd6277ec64983bf642760d7d77d6a
Connection: keep-alive
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Host: <hostname here>

And here is how the server responds to Firefox:

HTTP/1.x 200 OK
Date: Tue, 12 Jun 2007 17:30:20 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Expires: Tue, 12 Jun 2007 17:29:18 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 81118

I am sending these exact same headers using Ruby's Net::HTTP#get method:

server = Net::HTTP.new(uri.host, uri.port)
response,data = server.get(uri.request_uri, headers)

where headers is a hash with the exact same keys and values as the
Firefox headers above (the cookie value differs, of course, as that is
retrieved and stored dynamically). But I always get redirected to the
exact same URL that I just GETed. This is the response I get:

RESPONSE: #<Net::HTTPFound:0x300c604>
Printing Response:

cache-control: private
expires: Tue, 12 Jun 2007 18:17:26 GMT
x-aspnet-version: 1.1.4322
content-type: text/html; charset=utf-8
x-powered-by: ASP.NET
date: Tue, 12 Jun 2007 18:18:26 GMT
microsoftofficewebserver: 5.0_Pub
server: Microsoft-IIS/6.0
content-length: 200
location: <exact same URL I just GETed>
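
One thing worth checking in a response like this is whether it carries any Set-Cookie headers. The dump above shows none, but a server can set a cookie on a 302 and expect it back on the next request. A minimal sketch, assuming a plain hash as the cookie store; `update_cookies` and `cookie_header` are hypothetical helper names, and cookie attributes such as path and expiry are ignored:

```ruby
require 'net/http'

# Merge any Set-Cookie headers from a response into a name => value hash.
# Attributes after the first ';' (path, expires, ...) are ignored in this sketch.
def update_cookies(jar, response)
  (response.get_fields('set-cookie') || []).each do |line|
    next if line.strip.empty?
    name, value = line.split(';', 2).first.split('=', 2)
    jar[name.strip] = value.to_s.strip
  end
  jar
end

# Build the value for the Cookie request header from the stored cookies.
def cookie_header(jar)
  jar.map { |name, value| "#{name}=#{value}" }.join('; ')
end
```

The idea would be to call `update_cookies` on every response, redirects included, and set `headers['Cookie'] = cookie_header(jar)` before the next GET.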

Can anyone enlighten me as to what I am doing differently that makes
the site redirect me to the same place? I can't tell if it's something
I'm doing wrong or something Ruby is doing that is not the same as
what Firefox is doing. Thanks.
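
One way to narrow this down is to look at the headers Net::HTTP will actually attach to the request, since it fills in defaults for anything the hash leaves out. A minimal sketch, assuming a current Ruby; `outgoing_headers` is a hypothetical helper name:

```ruby
require 'net/http'

# Return the headers a GET for path would carry, as sorted "name: value" lines.
# Net::HTTP stores header names in lower case and adds defaults
# (for example Accept and User-Agent) when the hash does not supply them.
def outgoing_headers(path, headers)
  request = Net::HTTP::Get.new(path, headers)
  lines = []
  request.each_header { |name, value| lines << "#{name}: #{value}" }
  lines.sort
end
```

Printing `outgoing_headers(uri.request_uri, headers)` gives a list that can be compared line by line with the browser capture. Calling `server.set_debug_output($stderr)` before the GET should also print the raw exchange.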
Ball, Donald A Jr (Library) (Guest)
on 2007-06-12 23:05
(Received via mailing list)
> Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> Host: <hostname here>

Presumably this is from LiveHTTPHeaders? I note that the Referer header
is not included herein, but Firefox does send those data by default.
Perhaps that's the substantive difference between the Firefox request
and the Net::HTTP request? Just a thought.

- donald
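
If a missing Referer does turn out to matter, it can be carried from one request to the next. A minimal sketch; `with_referer` is a hypothetical helper name, not part of Net::HTTP:

```ruby
# Return a copy of headers with Referer set to the previously visited URL.
# The original hash is left untouched; a nil previous_url adds no Referer.
def with_referer(headers, previous_url)
  return headers.dup if previous_url.nil?
  headers.merge('Referer' => previous_url.to_s)
end
```

The result can be passed as the headers argument of the next GET in place of the plain hash.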
Matt W. (Guest)
on 2007-06-12 23:13
(Received via mailing list)
On Jun 12, 1:03 pm, "Ball, Donald A Jr (Library)"
<removed_email_address@domain.invalid> wrote:
> > Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> > Host: <hostname here>
>
> Presumably this is from LiveHTTPHeaders? I note that the Referer header
> is not included herein, but Firefox does send those data by default.
> Perhaps that's the substantive difference between the Firefox request
> and the Net::HTTP request? Just a thought.
>
> - donald

Donald,

Good thought. The GET that Firefox sent right before this one did have
the Referer field, but this one did not, so I removed it from my
request as well. Any other ideas?

Matt