Hi there,
I’m working on a project to automate retrieving content and downloading
a PDF bill from an ISP’s website. I have 30 connections with this ISP,
and each has its own username and password. So far I have been able to
get the content I need that is actually stored within the page.
You have to log in and go to a specific page to be able to click on the
link.
The link itself isn’t the PDF, as it’s generated on the fly:
source code of the link: Download PDF
I’m selecting it with:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")
How can I click this link to download the PDF and store it to the
filesystem?
btw, I only started with Ruby and Mechanize less than 24 hours ago.
TIA
Regards,
Dan
Thanks, so my script is now:
…
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")
File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end
This results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'
What is the proper method to see the response back from clicking the
link?
links_with returns an array, so link['href'] ends up calling Array#[]
with a String, which is the TypeError you’re seeing. Try using .first to
pick out the first result:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
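To see why the error mentions converting a String into an Integer, here is a small sketch that needs no network access. FakeLink and the helper methods are hypothetical stand-ins, not part of Mechanize; the only assumption is that links_with hands back a plain Array of link objects that respond to href.

```ruby
# Hypothetical stand-in for a Mechanize link object; only `href` matters here.
FakeLink = Struct.new(:href)

# Mimics the shape of a links_with result: an Array, even for a single match.
def matching_links
  [FakeLink.new('/be-portal/downloadPdf')]
end

# Indexing the Array with a String is what raises the TypeError.
def index_with_string(links)
  links['href']
rescue TypeError => e
  "TypeError: #{e.message}"
end

# Taking the first element gives the link itself, which does know its href.
def first_href(links)
  links.first.href
end

puts index_with_string(matching_links)
puts first_href(matching_links)
```

The exact wording of the TypeError message differs between Ruby versions, but the cause is the same: Array#[] wants an integer index, not a key.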
Ok, so some progress:
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link) # pp page
File.open('myfile.pdf', 'w+') do |file|
  file << page
end
If I look at the content of page now it contains the stream of PDF data
as well as:
@code="200",
@filename="Invoice_14051844_08/09/2010.pdf",
@response=
{"date"=>"Wed, 15 Sep 2010 19:13:04 GMT",
"server"=>"Apache",
"expires"=>"Wed, 15 Sep 2010 19:14:04 GMT",
"cache-control"=>"max-age=60",
"content-disposition"=>
"attachment;filename=Invoice_14051844_08/09/2010.pdf",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"7950",
"keep-alive"=>"timeout=5, max=92",
"connection"=>"Keep-Alive",
"content-type"=>"application/octet-stream"},
@uri=
#<URI::HTTPS:0x335bc38
URL:https://www.bethere.co.uk/be-portal/downloadPdf>>
I tried this too:
File.open('myfile.pdf', 'w+') do |file|
file << page.body
end
This almost works, but it produces a corrupt PDF file. I can see the
document properties of the PDF file, but there is no content.
SO CLOSE!
By doing a binary compare of a working version downloaded through the
browser and the one saved through Mechanize, I have found that mine has
the line breaks saved as 0D 0A in hex versus just 0A in the working file.
While I dig around to find out how to stop Mechanize/Ruby from doing
that, has anyone else come across this and found a solution?
Thanks
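The extra 0D bytes look like Windows text-mode newline translation rather than anything Mechanize does: on Windows, a file opened with 'w' or 'w+' is in text mode, so every 0A written comes out as 0D 0A. Opening the file with 'wb' (binary mode) avoids that. Below is a minimal sketch using only the standard library; save_binary is a hypothetical helper name, and the sample string simply stands in for page.body.

```ruby
require 'tmpdir'

# Write raw bytes without any newline translation.
# 'wb' opens the file in binary mode; on Windows, plain 'w' or 'w+' is text mode.
def save_binary(path, data)
  File.open(path, 'wb') { |file| file.write(data) }
end

# Stand-in for page.body: a few bytes containing bare LF (0x0A) characters.
sample = "%PDF-1.4\n\x00\xFFbinary\n".b

Dir.mktmpdir do |dir|
  path = File.join(dir, 'myfile.pdf')
  save_binary(path, sample)
  # Read the bytes back unmodified and compare them with what was written.
  puts File.binread(path) == sample
end
```

In the script above, that would presumably mean changing 'w+' to 'wb' in the File.open call and still writing page.body.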