Mechanize

Ramiro_Diaz_Trepat · December 17, 2007, 3:44am

Hello list,
I need to parse the contents of an unusual web site that keeps a
session id about 80,000 characters long in a hidden input tag.
I am trying to use Mechanize for the task, but because line 12 of
the page is 80k characters long, I get the following error:

/usr/lib/ruby/gems/1.8/gems/hpricot-0.6/lib/hpricot/parse.rb:51:in `scan': ran out of buffer space on element <input>, starting on line 12. (Hpricot::ParseError)
        from /usr/lib/ruby/gems/1.8/gems/hpricot-0.6/lib/hpricot/parse.rb:51:in `make'
        from /usr/lib/ruby/gems/1.8/gems/hpricot-0.6/lib/hpricot/parse.rb:15:in `parse'
        from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize/page.rb:37:in `initialize'
        from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize.rb:551:in `new'
        from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize.rb:551:in `fetch_page'
        from /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
        from /usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
        from /usr/lib/ruby/1.8/net/http.rb:1049:in `request'
        from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize.rb:514:in `fetch_page'
        from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize.rb:185:in `get'

Hpricot's line buffer is probably a fixed size and cannot hold a
line this long.

The “program” is this simple test script:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("https://replica.megsa.com.ar/Usuario/MantenimientoContratos.aspx")
puts page.body

Is there a way to configure Hpricot to use a dynamically sized
collection for the line buffer?

Ramiro_Diaz_Trepat · December 17, 2007, 5:05am

On Mon, Dec 17, 2007 at 11:43:11AM +0900, Ramiro Diaz Trepat wrote:

from /usr/lib/ruby/gems/1.8/gems/hpricot-0.6/lib/hpricot/parse.rb:15:in `parse'
from /usr/lib/ruby/gems/1.8/gems/mechanize-0.6.11/lib/mechanize.rb:185:in `get'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get("https://replica.megsa.com.ar/Usuario/MantenimientoContratos.aspx")
puts page.body

Is there a way to configure Hpricot to use a dynamically sized
collection for the line buffer?

You can configure hpricot’s buffer size:

Hpricot.buffer_size = 2621444

http://code.whytheluckystiff.net/hpricot/ticket/13

I think that should fix your issue.
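
If you would rather not guess the number, here is a rough sketch of one way to pick it. It assumes the limit is tied to the longest line in the page, as suggested earlier in the thread, which may not be exactly how Hpricot counts. The helpers longest_line_bytes and suggested_buffer_size and the 16 KB floor are made up for illustration and are not part of Hpricot or Mechanize; only Hpricot.buffer_size is the real setting. The sample page is synthetic.

```ruby
begin
  require 'rubygems'
  require 'hpricot'
rescue LoadError
  # hpricot is not installed; the helpers below still run on their own.
end

# Hypothetical helper: length in bytes of the longest line in an HTML string.
def longest_line_bytes(html)
  html.each_line.map { |line| line.bytesize }.max || 0
end

# Hypothetical helper: doubles a starting size (assumed floor of 16 KB)
# until it is strictly larger than the longest line.
def suggested_buffer_size(html, minimum = 16_384)
  longest = longest_line_bytes(html)
  size = minimum
  size *= 2 while size <= longest
  size
end

# Synthetic page with one very long hidden input, similar to the case above.
sample = "<html>\n<input type=\"hidden\" value=\"#{'x' * 80_000}\">\n</html>\n"
size = suggested_buffer_size(sample)
puts "longest line: #{longest_line_bytes(sample)} bytes, buffer: #{size} bytes"

# The setting has to be in place before the page is parsed, so with
# Mechanize it would go before the call to agent.get.
Hpricot.buffer_size = size if defined?(Hpricot)
```

In practice a fixed value set once at the top of the script, as in the line above, is simpler; the helper only shows how much headroom a given page would need under that assumption.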

Ramiro_Diaz_Trepat · December 17, 2007, 5:24am

Thank you!
That did the trick.