Stuck in a Redirect Loop While Crawling

Hello,

I am writing a crawler in Ruby to crawl websites. One of the sites I
crawl is very picky about headers so I am mimicking my FireFox browser
as closely as possible. One of the GETs I make to this site results in
a redirect response. I take the ‘location’ field from the redirect
header and go there. When FireFox sends its GET to this location, it
gets a 200 OK response. However, I keep getting redirected every time.
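
A minimal sketch of the redirect-following step described above. The helper name follow_redirects, the hop limit, the injectable fetch lambda, and the loop check are all illustrative assumptions, not part of the original crawler. It resolves the Location value against the current URI and raises as soon as the same URL comes up twice, which is the situation described here.

```ruby
require 'net/http'
require 'uri'
require 'set'

# Hypothetical helper: follows redirects by hand, stopping on a loop
# or after `limit` hops. `fetch` is injectable so the logic can be
# exercised without touching the network.
def follow_redirects(url, headers = {}, limit: 5, fetch: nil)
  fetch ||= lambda do |uri, hdrs|
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = (uri.scheme == 'https')
    http.get(uri.request_uri, hdrs)
  end

  seen = Set.new
  uri = URI.parse(url)
  limit.times do
    # Set#add? returns nil when the URL was already visited.
    raise "redirect loop at #{uri}" unless seen.add?(uri.to_s)

    response = fetch.call(uri, headers)
    return [uri, response] unless response.is_a?(Net::HTTPRedirection)

    # Location may be relative, so resolve it against the current URI.
    uri = uri + response['location']
  end
  raise "gave up after #{limit} redirects"
end
```

Failing fast on a repeated URL makes it obvious that the server is rejecting the request itself, rather than leaving the crawler to spin.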

Here is what FireFox is sending:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4
Keep-Alive: 300
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Language: en-us,en;q=0.5
Cookie: sessionid=6d7dd6277ec64983bf642760d7d77d6a
Connection: keep-alive
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Host:

And here is how the server responds to FireFox:

HTTP/1.x 200 OK
Date: Tue, 12 Jun 2007 17:30:20 GMT
Server: Microsoft-IIS/6.0
MicrosoftOfficeWebServer: 5.0_Pub
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Expires: Tue, 12 Jun 2007 17:29:18 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 81118

I am sending these exact same headers using Ruby’s Net::HTTP#get instance method:

require 'net/http'
server = Net::HTTP.new(uri.host, uri.port)
response = server.get(uri.request_uri, headers)
data = response.body  # on current Ruby, get returns only the response object

where headers is a hash with the exact same keys and values as the
FireFox headers above (the cookie value differs, of course, as that is
retrieved and stored dynamically). But I always get redirected to the
exact same URL that I just GETed. This is the response I get:

RESPONSE: #<Net::HTTPFound:0x300c604>
Printing Response:

cache-control: private
expires: Tue, 12 Jun 2007 18:17:26 GMT
x-aspnet-version: 1.1.4322
content-type: text/html; charset=utf-8
x-powered-by: ASP.NET
date: Tue, 12 Jun 2007 18:18:26 GMT
microsoftofficewebserver: 5.0_Pub
server: Microsoft-IIS/6.0
content-length: 200
location:

Can anyone enlighten me as to what I am doing differently that the
site redirects me to the same place? I can’t tell if it’s something
I’m doing wrong or something Ruby is doing that is not the same as
what FireFox is doing. Thanks.
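
Since the cookie is retrieved and stored dynamically, one thing that may be worth checking is whether the redirect response itself sets a cookie that has to accompany the next GET. The sketch below shows one way to carry cookies forward. The helper name update_cookie_header and the plain Hash used as a cookie jar are assumptions for illustration, and cookie attributes such as path and expiry are ignored.

```ruby
require 'net/http'

# Hypothetical helper: merges any Set-Cookie headers from `response`
# into `jar` (a plain Hash of name => value) and returns the string to
# send as the Cookie header on the next request.
def update_cookie_header(jar, response)
  # get_fields returns nil when the header is absent.
  (response.get_fields('set-cookie') || []).each do |line|
    # Keep only the leading "name=value" pair; drop path, expires, etc.
    name, value = line.split(';', 2).first.strip.split('=', 2)
    jar[name] = value.to_s
  end
  jar.map { |name, value| "#{name}=#{value}" }.join('; ')
end
```

Calling this on every response, including the 302s, and assigning the result to headers['Cookie'] before the next GET keeps the session value in step with what the server last issued.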

> Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> Host:

Presumably this is from LiveHTTPHeaders? I note that the Referer header
is not included herein, but Firefox does send those data by default.
Perhaps that’s the substantive difference between the Firefox request
and the Net::HTTP request? Just a thought.

- donald
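
One way to look for a difference between the two requests is to build the Net::HTTP request object without sending it and compare its headers against the browser capture. This is only a sketch: header_diff is a hypothetical helper, the '/page' path is a placeholder, and the Firefox hash is abbreviated from the capture earlier in the thread. On current Ruby versions, Net::HTTP::Get fills in some defaults of its own when they are not supplied, and those show up in the comparison too.

```ruby
require 'net/http'

# Hypothetical helper: returns the headers that are missing on one
# side or carry different values. Names are compared in lowercase,
# which is how Net::HTTP stores them.
def header_diff(browser_headers, request)
  browser = browser_headers.map { |name, value| [name.to_s.downcase, value] }.to_h
  ruby = {}
  request.each_header { |name, value| ruby[name] = value }

  (browser.keys | ruby.keys).sort.each_with_object({}) do |name, diff|
    next if browser[name] == ruby[name]
    diff[name] = { browser: browser[name], ruby: ruby[name] }
  end
end

# Abbreviated from the capture in the original post.
firefox = {
  'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4',
  'Accept-Language' => 'en-us,en;q=0.5',
  'Keep-Alive' => '300'
}

# '/page' is a placeholder path; only User-Agent is passed on purpose.
request = Net::HTTP::Get.new('/page', 'User-Agent' => firefox['User-Agent'])
header_diff(firefox, request).each do |name, sides|
  puts "#{name}: #{sides.inspect}"
end
```

Anything listed with a nil on the ruby side is a header the browser sent that the crawler would not.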

On Jun 12, 1:03 pm, “Ball, Donald A Jr (Library)” wrote:

> > Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> > Host:
>
> Presumably this is from LiveHTTPHeaders? I note that the Referer header
> is not included herein, but Firefox does send those data by default.
> Perhaps that’s the substantive difference between the Firefox request
> and the Net::HTTP request? Just a thought.
>
> - donald

Donald,

Good thought. The GET right before this one that FireFox sent did have
the referer field but then it wasn’t there for this one, so I removed
it. Any other ideas?

Matt