Mechanize GETting twice without redirect?

[Cross posted to Ruby on Rails Forum and Mechanize mailing list.]

I’m using Mechanize for page scraping (Ruby 1.9.2 / Rails 3.0.5 /
Mechanize 2.0.1). I’m seeing a case where a single

agent.get(url)

generates two HTTP GETs. Why is this happening?

The response to the first GET is a 200 (no redirect) and doesn’t have
any meta-refresh. I don’t see why Mechanize is issuing the second GET
(which happens to be failing with an EOFError with Content-Length / body
length mismatch).

Details: I’m using the nifty Charles web proxy debugger to monitor
browser / server interactions.

=====
In the original browser + server exchange, I see:

Req: POST /login/Login HTTP/1.1
Rsp: sets two cookies + HTTP/1.1 302 Moved Temporarily =>
https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: sets a cookie + HTTP/1.1 200 OK
The body contains onLoad JavaScript to set this.location =
'start.swe?SWECmd=Start'

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

=====
In the Mechanize + server exchange:

My code: page2 = agent.submit(login_form)
Req: POST /login/Login HTTP/1.1
Rsp: sets two cookies + HTTP/1.1 302 Moved Temporarily =>
https://online.nationalgridus.com/eservice_enu/

Req: GET /eservice_enu/ HTTP/1.1
Rsp: sets a cookie + HTTP/1.1 200 OK
The body contains onLoad JavaScript to set this.location =
'start.swe?SWECmd=Start', but Mechanize can't follow that automatically.
So I do an agent.get() to emulate it:

My code: page3 =
agent.get("https://online.nationalgridus.com/eservice_enu/start.swe?SWECmd=Start")
Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

Note that at this point both the user driven and mechanize driven
interactions appear to be identical. But Mechanize appears to generate
another GET all by itself:

Req: GET /eservice_enu/start.swe?SWECmd=Start HTTP/1.1
Rsp: sets four cookies + HTTP/1.1 200 OK

… and this response throws an EOFError:
Content-Length (536) does not match response body length (524) -
EOFError

=====
So: Why did Mechanize generate that last GET without me asking it to?
Was the EOFError actually in the first GET and it’s doing a retry? If
so, how do I work around the length mismatch?

UPDATE: sort of solved.

The web server sets Content-Length to an incorrect value (at least for
compressed replies), and Mechanize was re-trying the GET before giving
up.
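
To make the retry behavior concrete, here is a minimal simulation in plain Ruby. It is not Mechanize's actual code: FakeResponse, check_length! and get_with_retry are hypothetical names, and the single retry is an assumption chosen to match the two GETs seen in the proxy. It only shows how a length check plus a silent retry turns one agent.get into two requests followed by an EOFError.

```ruby
# Stand-in for one HTTP exchange; NOT one of Mechanize's real classes.
FakeResponse = Struct.new(:content_length, :body)

# Raise the kind of error reported above when the declared length and the
# number of bytes actually received disagree.
def check_length!(response)
  return if response.content_length == response.body.bytesize

  raise EOFError,
        "Content-Length (#{response.content_length}) does not match " \
        "response body length (#{response.body.bytesize})"
end

# Fetch with a silent retry (assumed: one retry), logging every request
# that a proxy such as Charles would see on the wire.
def get_with_retry(log, retries = 1)
  attempts = 0
  begin
    attempts += 1
    log << "GET /eservice_enu/start.swe?SWECmd=Start"
    response = yield
    check_length!(response)
    response
  rescue EOFError
    retry if attempts <= retries
    raise
  end
end

log = []
begin
  # The simulated server always declares 536 bytes but sends 524.
  get_with_retry(log) { FakeResponse.new(536, "x" * 524) }
rescue EOFError => e
  puts "#{log.size} GETs sent, then: #{e.message}"
end
```

Because the server misreports the length every time, the retry cannot succeed; it only adds a second, identical GET before the error surfaces.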

The temporary fix is to monkey patch Mechanize::HTTP::Agent to ignore
Content-Length.
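
The shape of such a patch can be sketched against a stand-in class. StrictReader and read_body are hypothetical; the real reading code lives in Mechanize::HTTP::Agent, and its method names and signatures differ between Mechanize versions, so check the installed source before adapting this.

```ruby
# Hypothetical stand-in for the class that reads response bodies.
# It is NOT Mechanize::HTTP::Agent; it only mimics the length check.
class StrictReader
  def read_body(content_length, body)
    if content_length != body.bytesize
      raise EOFError,
            "Content-Length (#{content_length}) does not match " \
            "response body length (#{body.bytesize})"
    end
    body
  end
end

# The monkey patch: reopen the class, keep the original method under a new
# name, and return whatever bytes arrived when the length check fails.
class StrictReader
  alias_method :strict_read_body, :read_body

  def read_body(content_length, body)
    strict_read_body(content_length, body)
  rescue EOFError
    warn "ignoring bogus Content-Length #{content_length}"
    body
  end
end

page_body = StrictReader.new.read_body(536, "x" * 524)
puts page_body.bytesize
```

A patch like this hides truncated downloads as well as misreported lengths, so it is only reasonable for a server that is known to send a wrong Content-Length.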

A longer-term fix would be to include a mechanism in Mechanize to allow
ignoring bogus Content-Length. I’ve submitted a feature request to
mechanize/issues.
