Question: Downloading files with open(-uri)?


#1

Hi,

I could use a quick hand here.

I want to watch the RailsConf 2006 videos and want to download
them with a script.

Unfortunately open("http:/xx") never comes back. Any idea what I
am doing wrong here?

I tested it with a URL that returns plain HTML and that worked
fine. See the first line, ibm.com.

    require 'open-uri'

    urls = %w{
      http://ibm.com
      http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
      http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
      http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
      http://downloads.scribemedia.net/rails2006/04_paul_graham_full.m4v
      http://downloads.scribemedia.net/rails2006/06_railsCorePanel_full.m4v
      http://downloads.scribemedia.net/rails2006/07_why_lucky_stiff.m4v
    }
    BUFFER_SIZE = 1_024 * 1_024

    urls.each do |url|
      puts "downloading #{url}"
      open(url) do |input|
        puts "opened connection."
        output = open(url.split(/\//).last, "wb")
        while (buffer = input.read(BUFFER_SIZE))
          print "."
          $stdout.flush
          output.write(buffer)
        end
        output.close
      end
      puts "done."
    end
    puts "All downloads done."

Cheers,
Mariano


#2

Oops, forgot the output:

RubyMate r5712 running Ruby v1.8.4 (/usr/local/bin/ruby)

download.rb

downloading http://ibm.com
opened connection.
.done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v


#3

Mariano K. wrote:

> I tested it with a URL that returns plain html and that worked
> fine.
> […]

There’s nothing wrong with your program; I tested it by
downloading a picture. If you have a dial-up connection, maybe
the transfer is progressing very slowly.


#4

On Dec 23, 2006, at 3:20 PM, William J. wrote:

> There’s nothing wrong with your program; I tested it by
> downloading a picture. If you have a dial-up connection, maybe
> the transfer is progressing very slowly.

Hey Bill,

hmm, not sure. If I change the BUFFER_SIZE to 1KB I still don’t
see anything and the “puts ‘opened connection’” should at least be
visible, shouldn’t it?

Anyways I have a 6 MBit/s downstream so even a 1MB buffer
shouldn’t be a problem.

I also suspected that the server is checking for deep links and
would evaluate the referer in the process, but when I enter one of
the urls directly into my browser it works.

Very strange.

Cheers,
Mariano


#5

William J. wrote:

> Mariano K. wrote:
>
>> I tested it with a URL that returns plain html and that worked
>> fine.
>> […]
>
> There’s nothing wrong with your program; I tested it by
> downloading a picture. If you have a dial-up connection, maybe
> the transfer is progressing very slowly.

Actually, I think the site is slow or overloaded. The movies are 250MB -
500MB in size, and the download speed I am getting is around 52
KBytes/second (and I have a broadband connection). This code works
better at showing progress:

    require 'open-uri'

    urls = %w{
      http://ibm.com
      http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
      http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
      http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
      http://downloads.scribemedia.net/rails2006/04_paul_graham_full.m4v
      http://downloads.scribemedia.net/rails2006/06_railsCorePanel_full.m4v
      http://downloads.scribemedia.net/rails2006/07_why_lucky_stiff.m4v
    }

    BUFFER_SIZE = 8 * 1_024

    urls.each do |url|
      puts "downloading #{url}"
      out_file = url.split(/\//).last
      puts "Writing to #{out_file}"

      # content_length_proc is called once with the Content-Length header;
      # progress_proc is called repeatedly with the bytes received so far.
      open(url, "r",
           :content_length_proc => lambda { |content_length| puts "Content length: #{content_length} bytes" },
           :progress_proc => lambda { |size| printf("Read %010d bytes\r", size.to_i) }) do |input|
        open(out_file, "wb") do |output|
          while (buffer = input.read(BUFFER_SIZE))
            output.write(buffer)
          end
        end
      end
      puts "\ndone."
    end
    puts "All downloads done."


#6

Edwin F. wrote:

> […]

Wow. Cool. How did you know about the content_length and progress
hooks? I don’t see them in the docs.

Anyway … That looks nice, but I still don’t see the progress on the
console, other than for ibm.com. Do you?

I can see that I am downloading at 50KBytes/s using a network traffic
monitor, but not on the console. And if I read this right it should
yield a progress update roughly every kilobyte, right?

This is what I see after … say … 5 minutes after launching the
program.

downloading http://ibm.com
Writing to ibm.com
Content
length: 25348 bytes
Read 0000000822 bytes Read 0000001158 bytes Read 0000002182 bytes
Read 0000002518 bytes Read 0000003542 bytes Read 0000003878 bytes
Read 0000004902 bytes Read 0000005238 bytes Read 0000006262 bytes
Read 0000006598 bytes Read 0000007622 bytes Read 0000007958 bytes
Read 0000008982 bytes Read 0000009318 bytes Read 0000010342 bytes
Read 0000011366 bytes Read 0000012390 bytes Read 0000013398 bytes
Read 0000014422 bytes Read 0000014758 bytes Read 0000015782 bytes
Read 0000016118 bytes Read 0000017142 bytes Read 0000017478 bytes
Read 0000018502 bytes Read 0000018838 bytes Read 0000019862 bytes
Read 0000020198 bytes Read 0000021222 bytes Read 0000021558 bytes
Read 0000022582 bytes Read 0000022918 bytes Read 0000023942 bytes
Read 0000024278 bytes Read 0000025302 bytes Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
Writing to 03_martin_fowler_full.m4v
Content
length: 413031533 bytes

Cheers,
Mariano


#7

On Sat, 23 Dec 2006 12:35:40 -0000, Mariano K. wrote:

> Hi,
>
> I could need a quick hand here.
>
> I want to watch the RailsConf 2006 videos and want to download them
> with a script.

If you have libcurl and are willing to install an extension, the
recently released (;)) Curb 0.1 makes this as easy as:

    #!/usr/bin/env ruby
    require 'curb'  # the Curb extension (libcurl bindings)

    urls = %w{
      http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
      http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
      http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
      http://downloads.scribemedia.net/rails2006/04_paul_graham_full.m4v
      http://downloads.scribemedia.net/rails2006/06_railsCorePanel_full.m4v
      http://downloads.scribemedia.net/rails2006/07_why_lucky_stiff.m4v
    }

    # Each file is saved under the last path segment of its URL.
    urls.each { |url| Curl::Easy.download(url) }

    __END__

It’s at http://curb.rubyforge.org/


#8

On Dec 23, 2006, at 4:55 PM, Ross B. wrote:

> If you have libcurl and are willing to install an extension, the
> recently released (;)) Curb 0.1 makes this as easy as:

Thanks for the tip Ross.

I tried gem install curb ;) but that didn’t work. And as the other
version is already downloading the files and I just wanted this
program to do this single job I will try out curb the next time ;)

You’ve implemented it in C, so you probably can’t answer my question
about how you dealt with the buffer size either, can you?

Cheers,
Mariano


#9

On 23.12.2006 15:40, Mariano K. wrote:

> […]
> Anyways I have a 6 MBit/s downstream so even a 1MB buffer shouldn’t be
> a problem.
>
> I also suspected that the server is checking for deep links and would
> evaluate the referer in the process, but when I enter one of the urls
> directly into my browser it works.
>
> Very strange.

I observe the same behavior that you see. I have no knowledge of
open-uri internals but here’s my theory: the response is probably read
completely before open returns. This would explain why you see the dots
from ibm.com in one go. I would test the same with net/http and see
whether there is any difference. Make sure to use the streaming form
(read_body with a block).

Kind regards

robert

#10

On 23.12.2006 16:29, Robert K. wrote:

> […]
> I observe the same behavior that you see. I have no knowledge of
> open-uri internals but here’s my theory: the page is probably loaded
> completely before open returns. This would explain why you see the dots
> from ibm.com in one go. I would test the same with net/http and see
> whether there is any difference. Make sure to use the stream form.

Try this (note, this will not follow redirects):

    require 'net/http'
    require 'uri'

    urls = %w{
      http://ibm.com
      http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
      http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
    }

    $stdout.sync = true

    urls.each do |url|
      puts "downloading #{url}"

      Net::HTTP.get_response(URI.parse(url)) do |res|
        puts "opened connection."
        target = url.split(/\//).last
        puts "writing to #{target}"

        File.open(target, "wb") do |output|
          # next line will read in chunks but not provide option for dots...
          # res.read_body(output)
          res.read_body do |chunk|
            output.write(chunk)
            print "."
          end
        end
      end

      puts "done."
    end

    puts "All downloads done."

robert
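
Since the version above does not follow redirects, here is a minimal sketch of one way to add that while keeping the streaming read. The helper name stream_download and the default limit of 5 redirects are assumptions made for illustration, not something from this thread; the sketch uses only standard-library calls (Net::HTTP.start, request_get, read_body, URI#merge).

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: streams the body at `url` into `io`, following
# up to `max_redirects` HTTP redirects. Returns `io`.
def stream_download(url, io, max_redirects = 5)
  uri = URI.parse(url)

  (max_redirects + 1).times do
    location = nil

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request_get(uri.request_uri) do |res|
        case res
        when Net::HTTPRedirection
          # Remember where to go next; the body is not needed.
          location = res['location']
        when Net::HTTPSuccess
          # Chunks are written as they arrive, never held in memory at once.
          res.read_body { |chunk| io.write(chunk) }
        else
          res.error! # raises an exception matching the status class
        end
      end
    end

    return io if location.nil?
    uri = uri.merge(location) # also resolves relative Location headers
  end

  raise "too many redirects for #{url}"
end

# Possible usage (file name derived as in the scripts above):
#   File.open(url.split('/').last, 'wb') { |f| stream_download(url, f) }
```

Each hop opens a fresh connection, which keeps the sketch short at the cost of some efficiency when a redirect stays on the same host.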


#11

On Dec 23, 2006, at 5:10 PM, Robert K. wrote:

>     File.open(target, "wb") do |output|
>       # next line will read in chunks but not provide option for dots...
>       # res.read_body(output)
>       res.read_body do |chunk|
>         output.write(chunk)
>         print "."
>       end
>     end

Nice. Thanks.

Cheers,
Mariano


#12

Mariano K. wrote:

> Wow. Cool. How did you know about the content_length and progress
> hooks? I don’t see them in the docs.
>
> Anyway … That looks nice, but I still don’t see the progress on the
> console, other than for ibm.com. Do you?
>
> […]

It’s documented here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

This is what I am seeing:
downloading http://ibm.com
Writing to ibm.com
Content length: 25348 bytes
Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
Writing to 03_martin_fowler_full.m4v
Content length: 413031533 bytes
Read 0131826472 bytes

It seems to update around every second, based on informal observation. I
don’t know why your output looks different; did you redirect or tee it
to a file? I’m using an old ‘C’ trick of printing a CR (\r) after each
update, which should keep the output on the same line and just overwrite
what was there before.

I’m running this using Ruby 1.8.5 on Ubuntu Edgy x86_64. Perhaps your OS
is different and has some other behavior.

I tried everything I could think of to disable or bypass buffering,
including $stdout.sync = true, using $stderr, calling $stdout.flush,
using syswrite, and so on, to get the output to appear periodically,
without success. I think the output is buffered at the OS level, or
something like that, so that even calling flush won’t always work. The
only thing that works for me is the progress hook.
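
As a rough sketch of the carriage-return trick combined with the two hooks, the helper below builds both procs so the progress line can also show a percentage once the content length is known. The name progress_callbacks is hypothetical; :content_length_proc and :progress_proc are the open-uri options used earlier in this thread.

```ruby
# Hypothetical helper: returns a hash with the two open-uri hooks.
# Progress is written to `out` and ends with "\r", so each update
# overwrites the previous one on a terminal.
def progress_callbacks(out = $stdout)
  total = nil
  {
    # Called once; the argument is nil when no Content-Length was sent.
    content_length_proc: lambda { |length| total = length },
    # Called repeatedly with the number of bytes received so far.
    progress_proc: lambda do |size|
      if total && total > 0
        out.printf("Read %d of %d bytes (%.1f%%)\r", size, total, 100.0 * size / total)
      else
        out.printf("Read %d bytes\r", size)
      end
      out.flush
    end
  }
end

# Possible usage, assuming open-uri is loaded and a Ruby with URI.open:
#   URI.open(url, "rb", **progress_callbacks) do |input|
#     File.open(out_file, "wb") { |output| IO.copy_stream(input, output) }
#   end
```

Keep in mind that these hooks fire while open-uri is still fetching the response, before the block runs, so on a slow download they may be the only sign of life.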


#13

Edwin F. wrote:

> Mariano K. wrote:
>
>> Wow. Cool. How did you know about the content_length and progress
>> hooks? I don’t see them in the docs.
>
> It’s documented here:
> http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

Grmpfh. I looked there, but probably too properly.

> downloading http://ibm.com
> […]
> Read 0131826472 bytes

Thanks for trying that out.

Well, it seems that open had already read all the bytes before my
block ever ran. Changing the implementation the way Robert suggested
fixed that.

So it was not really a problem with the buffering, as I suspected,
but with improper use of the API.

Cheers,
Mariano
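
That conclusion can be checked without the slow remote server. The sketch below is an illustration under stated assumptions: the helper names and the throwaway local server are invented here, and it uses URI.open, because newer Ruby versions (3.0 and later) no longer open URLs through a bare open. It measures how long open-uri takes to reach the block and how long net/http takes to yield its first chunk when the body trickles in slowly.

```ruby
require 'open-uri'
require 'net/http'
require 'socket'

# Throwaway local HTTP server (illustration only): answers every request
# with `chunks` four-byte pieces, pausing `delay` seconds between pieces.
# Returns the port it listens on.
def start_slow_server(chunks: 3, delay: 0.2)
  server = TCPServer.new('127.0.0.1', 0)
  Thread.new do
    loop do
      client = server.accept
      # Skip the request line and headers.
      while (line = client.gets) && line != "\r\n"; end
      client.write("HTTP/1.1 200 OK\r\nContent-Length: #{chunks * 4}\r\n" \
                   "Connection: close\r\n\r\n")
      chunks.times do |i|
        sleep(delay) if i > 0
        client.write('data')
      end
      client.close
    end
  end
  server.addr[1]
end

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Seconds from the call until open-uri hands control to the block.
def open_uri_block_delay(url)
  started = monotonic_now
  URI.open(url) { |_io| monotonic_now - started }
end

# Seconds from the call until Net::HTTP yields the first body chunk.
def net_http_first_chunk_delay(url)
  started = monotonic_now
  first = nil
  Net::HTTP.get_response(URI.parse(url)) do |res|
    res.read_body { |_chunk| first ||= monotonic_now - started }
  end
  first
end

if __FILE__ == $PROGRAM_NAME
  url = "http://127.0.0.1:#{start_slow_server}/"
  printf("open-uri block entered after %.2fs\n", open_uri_block_delay(url))
  printf("net/http first chunk after   %.2fs\n", net_http_first_chunk_delay(url))
end
```

With a 400 MB file at about 50 KBytes/s, the same effect would keep the "opened connection." line from appearing for a couple of hours, which matches what was seen in this thread.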


#14

On Dec 23, 2006, at 07:35, Mariano K. wrote:

> […]
> Wow. Cool. How did you know about the content_length and progress
> hooks? I don’t see them in the docs.

ri OpenURI::OpenRead#open


Eric H. - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!


#15

On Sat, 23 Dec 2006 16:04:33 -0000, Mariano K. wrote:

> On Dec 23, 2006, at 4:55 PM, Ross B. wrote:
>
>> If you have libcurl and are willing to install an extension, the
>> recently released (;)) Curb 0.1 makes this as easy as:
>
> Thanks for the tip Ross.

Sure :)

> I tried gem install curb ;) but that didn’t work. And as the other
> version is already downloading the files and I just wanted this program
> to do this single job I will try out curb the next time ;)

I hear you on the rubygem thing. In preparation for next time, you might
try that gem install again - it should work now ;)

> You’ve implemented it in C, so you probably can’t answer my question how
> you dealt with the buffer size too, can you?

I just left that to the experts - although libcurl does provide some
opportunity for fiddling with its buffers, it generally seems to do
pretty well with its defaults so none of that’s exposed in Ruby yet.

Cheers,