Mechanize file save on generated link

dkam · September 12, 2010, 10:53pm

Hi there,
I’m working on a project to automate retrieval of content and the
download of a PDF bill from an ISP’s website. I have 30 connections with
this ISP, and each has its own username and password. So far I have been
able to get the content I need that is actually stored within the page.

You have to log in and go to a specific page to be able to click on the
link.
The link itself isn’t the PDF, as it’s generated on the fly.
Source code of the link:
Download PDF

I’m selecting it with:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

How can I click this link to download the PDF and store it on the
filesystem?

btw, I only started with Ruby and Mechanize less than 24 hours ago.
TIA
Regards,
Dan

dkam · September 12, 2010, 11:03pm

Hi Dan,

Try with:

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

where agent is the Mechanize agent you used to log in and get the
link.

–

Andrea D.

On 12/09/2010 22:53, Dan M. wrote:

dkam · September 12, 2010, 11:53pm

Thanks, so my script is now:
…
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

This results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'

What is the proper method to see the response back from clicking the
link?

dkam · September 15, 2010, 10:16pm

Mike D. wrote:

On Sun, Sep 12, 2010 at 5:54 PM, Dan M. wrote:

C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'

What is the proper method to see the response back from clicking the
link?

links_with returns an array. Try using .first to pick out the first
result, so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first

OK, so some progress:
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)
#pp page
File.open('myfile.pdf', 'w+') do |file|
  file << page
end

If I look at the content of page now it contains the stream of PDF data
as well as:
@code="200",
@filename="Invoice_14051844_08/09/2010.pdf",
@response=
  {"date"=>"Wed, 15 Sep 2010 19:13:04 GMT",
   "server"=>"Apache",
   "expires"=>"Wed, 15 Sep 2010 19:14:04 GMT",
   "cache-control"=>"max-age=60",
   "content-disposition"=>
    "attachment;filename=Invoice_14051844_08/09/2010.pdf",
   "vary"=>"Accept-Encoding",
   "content-encoding"=>"gzip",
   "content-length"=>"7950",
   "keep-alive"=>"timeout=5, max=92",
   "connection"=>"Keep-Alive",
   "content-type"=>"application/octet-stream"},
@uri=
  #<URI::HTTPS:0x335bc38 URL:https://www.bethere.co.uk/be-portal/downloadPdf>>

dkam · September 13, 2010, 2:46pm

On Sun, Sep 12, 2010 at 5:54 PM, Dan M. wrote:

C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'

What is the proper method to see the response back from clicking the
link?

links_with returns an array. Try using .first to pick out the first
result, so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
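
The TypeError above can be reproduced without Mechanize at all. The following is a minimal sketch that uses a plain Ruby array of stand-in objects; FakeLink is a hypothetical struct for illustration only, not a Mechanize class. It shows that indexing an array with a string raises TypeError, while taking .first gives back a single object whose href can be read.

```ruby
# FakeLink is a hypothetical stand-in for a link object; it is not part of Mechanize.
FakeLink = Struct.new(:href)

# A links_with-style result: an array, even when only one link matches.
links = [FakeLink.new("/be-portal/downloadPdf")]

error_message = nil
begin
  # Arrays are indexed by integer, so a String index raises TypeError.
  links["href"]
rescue TypeError => e
  error_message = e.message
end

# Picking the first element gives a single link object to work with.
link = links.first
href = link.href
```

The exact wording of the exception message differs between Ruby versions, but the exception class is TypeError in each case.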

dkam · September 15, 2010, 10:19pm

I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

This almost works but produces a corrupt PDF file. I can see the
document properties of the PDF file, but there is no content.

I have also tried using:
agent.pluggable_parser.pdf = Mechanize::FileSaver
agent.click(link)

which did not produce an error, but did not produce a PDF file
either.

dkam · September 20, 2010, 2:12pm

Dan M. wrote:

I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

This almost works but produces a corrupt PDF file. I can see the
document properties of the PDF file, but there is no content.

SO CLOSE!

By doing a binary compare of a working version downloaded through the
browser and the one saved through Mechanize, I have found that my script
is saving the line breaks as 0D 0A in hex, versus just 0A in the working
file.

While I dig around to find out how to stop Mechanize/Ruby from doing
that, has anyone else come across this and found a solution?
Thanks

dkam · September 20, 2010, 2:39pm

Thanks to everyone who helped. Writing the file in binary mode did the
trick.
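
The effect of binary mode can be checked in isolation. This is a minimal sketch, assuming a small made-up byte string in place of a real PDF body and a temporary directory in place of the real output path. It writes the bytes with "wb", reads them back with File.binread, and separately builds the kind of 0D 0A expansion that a Windows text-mode write would produce, purely for comparison.

```ruby
require "tmpdir"

# Made-up sample bytes standing in for a PDF body; the 0A bytes are data, not line endings.
data = "%PDF-1.4\n\x00\x01 payload\nmore\n".b

round_trip = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.pdf")

  # The "b" flag turns off newline translation, so each 0A is written unchanged.
  File.open(path, "wb") { |file| file.write(data) }

  # binread returns the raw bytes of the file.
  round_trip = File.binread(path)
end

# Simulated text-mode damage: every 0A becomes 0D 0A, so the data grows by one byte per newline.
crlf = data.gsub("\n", "\r\n")
```

On Linux or macOS a text-mode write leaves 0A alone, which is why the corruption only shows up on Windows; opening with "b" keeps the script portable either way.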

In case anyone has this problem in the future here is my full script:

require 'rubygems'
require 'mechanize'

URL_LOGIN = 'https://www.bethere.co.uk/cas/login?service=https://www.bethere.co.uk/c/portal/login'
URL_BILLING = 'https://www.bethere.co.uk/group/beportal/billsandpayment'

abort "Usage: #{$0} <username> <password>" unless ARGV.length == 2

agent = Mechanize.new
agent.follow_meta_refresh = true
agent.redirect_ok = true
agent.user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'
login_page = agent.get(URL_LOGIN)

login_form = login_page.forms.first
login_form.username = ARGV[0]
login_form.password = ARGV[1]

redirect_page = agent.submit(login_form)

invoice_page = agent.get(URL_BILLING)
# links_with returns an array, so take the first match
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)

# Open in binary mode ('b') so Windows does not write 0A as 0D 0A
File.open(page.filename.gsub("/", "_"), 'w+b') do |file|
  file << page.body.strip
end
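
The gsub on page.filename is needed because the server's suggested name, Invoice_14051844_08/09/2010.pdf, contains slashes that would otherwise be read as directory separators. A small sketch of that step on its own follows; safe_filename is a hypothetical helper, not part of Mechanize, and replacing backslashes as well is an added assumption for Windows paths that the script above does not make.

```ruby
# safe_filename is a hypothetical helper for illustration; it is not part of Mechanize.
def safe_filename(name)
  # Replace forward slashes and backslashes so the name stays a single path component.
  name.gsub(%r{[/\\]}, "_")
end

# Filename taken from the content-disposition header shown earlier in the thread.
suggested = "Invoice_14051844_08/09/2010.pdf"
local_name = safe_filename(suggested)
```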