Forum: Ruby Question: Downloading files with open(-uri)?

Mariano Kamp (Guest)
on 2006-12-23 13:36
(Received via mailing list)
Hi,

   I could use a quick hand here.

   I want to watch the RailsConf 2006 videos and want to download
them with a script.

   Unfortunately, open("http:/xx") never returns. Any idea what I
am doing wrong here?

   I tested it with a URL that returns plain HTML and that worked
fine. See the first line, ibm.com.

require 'open-uri'

urls = %w{
http://ibm.com
http://downloads.scribemedia.net/rails2006/03_mart...
http://downloads.scribemedia.net/rails2006/02_dave...
http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
http://downloads.scribemedia.net/rails2006/04_paul...
http://downloads.scribemedia.net/rails2006/06_rail...
http://downloads.scribemedia.net/rails2006/07_why_...
}
BUFFER_SIZE = 1_024*1_024*1

urls.each do |url|
   puts "downloading #{url}"
   open(url) do |input|
     puts "opened connection."
     output = open(url.split(/\//).last, "wb")
     while (buffer = input.read(BUFFER_SIZE))
       print "."
       $stdout.flush
       output.write(buffer)
     end
     output.close
   end
   puts "done."
end
puts "All downloads done."

Cheers,
Mariano
Mariano Kamp (Guest)
on 2006-12-23 13:38
(Received via mailing list)
Oops, forgot the output:

RubyMate r5712 running Ruby v1.8.4 (/usr/local/bin/ruby)
 >>> download.rb

downloading http://ibm.com
opened connection.
.done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
William James (Guest)
on 2006-12-23 15:21
(Received via mailing list)
Mariano Kamp wrote:
>    I tested it with an URL that returns plain html and that worked
> http://downloads.scribemedia.net/rails2006/06_rail...
>        print "."
> Mariano
There's nothing wrong with your program; I tested it by
downloading a picture.  If you have a dial-up connection, maybe
the transfer is progressing very slowly.
Mariano Kamp (Guest)
on 2006-12-23 15:41
(Received via mailing list)
On Dec 23, 2006, at 3:20 PM, William James wrote:

>>
>> http://downloads.scribemedia.net/rails2006/04_paul...
>>      while (buffer = input.read(BUFFER_SIZE))
> There's nothing wrong with your program; I tested it by
> downloading a picture.  If you have a dial-up connection, maybe
> the transfer is progressing very slowly.

Hey Bill,

   hmm, not sure. If I change the BUFFER_SIZE to 1KB I still don't
see anything and the "puts 'opened connection'" should at least be
visible, shouldn't it?

   Anyways I have a 6 MBit/s downstream so even a 1MB buffer
shouldn't be a problem.

   I also suspected that the server is checking for deep links and
would evaluate the referer in the process, but when I enter one of
the urls directly into my browser it works.

   Very strange.

Cheers,
Mariano
Edwin Fine (efine)
on 2006-12-23 16:22
William James wrote:
> Mariano Kamp wrote:
>>    I tested it with an URL that returns plain html and that worked
>> http://downloads.scribemedia.net/rails2006/06_rail...
>>        print "."
>> Mariano
> There's nothing wrong with your program; I tested it by
> downloading a picture.  If you have a dial-up connection, maybe
> the transfer is progressing very slowly.

Actually, I think the site is slow or overloaded. The movies are 250MB -
500MB in size, and the download speed I am getting is around 52
KBytes/second (and I have a broadband connection). This code works
better at showing progress:

require 'open-uri'

urls = %w{
http://ibm.com
http://downloads.scribemedia.net/rails2006/03_mart...
http://downloads.scribemedia.net/rails2006/02_dave...
http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
http://downloads.scribemedia.net/rails2006/04_paul...
http://downloads.scribemedia.net/rails2006/06_rail...
http://downloads.scribemedia.net/rails2006/07_why_...
}

BUFFER_SIZE = 8 * 1_024

urls.each do |url|
  puts "downloading #{url}"
  out_file = url.split(/\//).last
  puts "Writing to #{out_file}"

  open(url, "r",
       :content_length_proc => lambda { |content_length|
         puts "Content length: #{content_length} bytes"
       },
       :progress_proc => lambda { |size|
         printf("Read %010d bytes\r", size.to_i)
       }) do |input|
    open(out_file, "wb") do |output|
      while (buffer = input.read(BUFFER_SIZE))
        output.write(buffer)
      end
    end
  end
  puts "\ndone."
end
puts "All downloads done."
Mariano Kamp (Guest)
on 2006-12-23 16:37
(Received via mailing list)
Edwin Fine wrote:
>
> http://downloads.scribemedia.net/rails2006/03_mart...
>   puts "downloading #{url}"
>         output.write(buffer)
>       end
>     end
>   end
>   puts "\ndone."
> end
> puts "All downloads done."

Wow. Cool. How did you know about the content_length and progress
hooks? I don't see them in the docs.

Anyway ... That looks nice, but I still don't see the progress on the
console, other than for ibm.com. Do you?

I can see that I am downloading at 50KBytes/s using a network traffic
monitor, but not on the console. And if I read this right it should
yield a progress update roughly every kilobyte, right?

This is what I see after ... say ... 5 minutes after launching the
program.

downloading http://ibm.com
Writing to ibm.com
Content
length: 25348 bytes
Read 0000000822 bytes Read 0000001158 bytes Read 0000002182 bytes
Read 0000002518 bytes Read 0000003542 bytes Read 0000003878 bytes
Read 0000004902 bytes Read 0000005238 bytes Read 0000006262 bytes
Read 0000006598 bytes Read 0000007622 bytes Read 0000007958 bytes
Read 0000008982 bytes Read 0000009318 bytes Read 0000010342 bytes
Read 0000011366 bytes Read 0000012390 bytes Read 0000013398 bytes
Read 0000014422 bytes Read 0000014758 bytes Read 0000015782 bytes
Read 0000016118 bytes Read 0000017142 bytes Read 0000017478 bytes
Read 0000018502 bytes Read 0000018838 bytes Read 0000019862 bytes
Read 0000020198 bytes Read 0000021222 bytes Read 0000021558 bytes
Read 0000022582 bytes Read 0000022918 bytes Read 0000023942 bytes
Read 0000024278 bytes Read 0000025302 bytes Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
Writing to 03_martin_fowler_full.m4v
Content
length: 413031533 bytes


Cheers,
Mariano
Ross Bamford (Guest)
on 2006-12-23 16:57
(Received via mailing list)
On Sat, 23 Dec 2006 12:35:40 -0000, Mariano Kamp <mariano.kamp@acm.org>
wrote:

> Hi,
>
>    I could need a quick hand here.
>
>    I want to watch the RailsConf 2006 videos and want to download them
> with a script.
>

If you have libcurl and are willing to install an extension, the
recently released (;)) Curb 0.1 makes this as easy as:

#!/usr/bin/env ruby
require 'curb'
urls = %w{
http://downloads.scribemedia.net/rails2006/03_mart...
http://downloads.scribemedia.net/rails2006/02_dave...
http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
http://downloads.scribemedia.net/rails2006/04_paul...
http://downloads.scribemedia.net/rails2006/06_rail...
http://downloads.scribemedia.net/rails2006/07_why_...
}

urls.each { |url| Curl::Easy.download(url) }

__END__

It's at http://curb.rubyforge.org/
Robert Klemme (Guest)
on 2006-12-23 17:03
(Received via mailing list)
On 23.12.2006 15:40, Mariano Kamp wrote:
>>>
>>> http://downloads.scribemedia.net/rails2006/03_mart...
>>>    open(url) do |input|
>>> end
> shouldn't it?
>
>   Anyways I have a 6 MBit/s downstream so even a 1MB buffer shouldn't be
> a problem.
>
>   I also suspected that the server is checking for deep links and would
> evaluate the referer in the process, but when I enter one of the urls
> directly into my browser it works.
>
>   Very strange.

I observe the same behavior that you see.  I have no knowledge of
open-uri internals, but here's my theory: the page is probably loaded
completely before open returns.  This would explain why you see the dots
from ibm.com in one go.  I would test the same with net/http and see
whether there is any difference.  Make sure to use the stream form.

Kind regards

	robert
Mariano Kamp (Guest)
on 2006-12-23 17:05
(Received via mailing list)
On Dec 23, 2006, at 4:55 PM, Ross Bamford wrote:

>
> If you have libcurl and are willing to install an extension, the
> recently released (;)) Curb 0.1 makes this as easy as:
Thanks for the tip Ross.

I tried gem install curb ;-) but that didn't work. And as the other
version is already downloading the files and I just wanted this
program to do this single job I will try out curb the next time ;-)

You've implemented it in C, so you probably can't answer my question
how you dealt with the buffer size too, can you?
Cheers,
Mariano
Robert Klemme (Guest)
on 2006-12-23 17:11
(Received via mailing list)
On 23.12.2006 16:29, Robert Klemme wrote:
>>>> them with a script.
>>>> http://ibm.com
>>>>    puts "downloading #{url}"
>>>>    puts "done."
>> anything and the "puts 'opened connection'" should at least be
>
> I observe the same behavior that you see.  I have no knowledge of
> openuri internals but here's my theory: the page is probably loaded
> completely before open returns.  This would explain why you see the dots
> from ibm.com in one go.  I would test the same with net/http and see
> whether there is any difference.  Make sure to use the stream form.

Try this (note, this will not follow redirects):

	robert


require 'net/http'
require 'uri'

urls = %w{
http://ibm.com
http://downloads.scribemedia.net/rails2006/03_mart...
http://downloads.scribemedia.net/rails2006/02_dave...
}

$stdout.sync=true

urls.each do |url|
   puts "downloading #{url}"

   Net::HTTP.get_response(URI.parse(url)) do |res|
     puts "opened connection."
     target = url.split(/\//).last
     puts "writing to #{target}"

     File.open(target, "wb") do |output|
       # The next line would read in chunks but gives no hook for printing dots:
       # res.read_body(output)
       res.read_body do |chunk|
         output.write(chunk)
         print "."
       end
     end
   end

   puts "done."
end

puts "All downloads done."
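Robert's script above does not follow redirects. The following is a minimal sketch of one way to add that while keeping the chunked read, assuming a reasonably current Ruby where Net::HTTP.start accepts an options hash. The helper names redirect_target and stream_download, and the limit of 5 hops, are made up for illustration and do not come from the thread.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: resolve a Location header against the URL that sent it,
# because a server may answer with a relative redirect target.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

# Hypothetical helper: stream url into the file named target, following at
# most `limit` redirects. Returns target on success.
def stream_download(url, target, limit = 5)
  raise ArgumentError, "too many redirects" if limit <= 0

  uri = URI.parse(url)
  next_url = nil

  Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    # The block form of request_get yields the response before the body is read.
    http.request_get(uri.request_uri) do |res|
      case res
      when Net::HTTPSuccess
        File.open(target, 'wb') do |output|
          # read_body with a block yields chunks as they arrive from the socket.
          res.read_body { |chunk| output.write(chunk) }
        end
      when Net::HTTPRedirection
        next_url = redirect_target(url, res['location'])
      else
        res.error! # raises the exception matching the 4xx/5xx response
      end
    end
  end

  next_url ? stream_download(next_url, target, limit - 1) : target
end
```

A call would look like stream_download(url, url.split(/\//).last) inside the urls.each loop; the print "." from the original script can go back into the read_body block if dots are wanted.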
Mariano Kamp (Guest)
on 2006-12-23 17:23
(Received via mailing list)
On Dec 23, 2006, at 5:10 PM, Robert Klemme wrote:

>  File.open(target, "wb") do |output|
>       # next line will read in chunks but not provide option for dots...
>       # res.read_body(output)
>       res.read_body do |chunk|
>         output.write(chunk)
>         print "."
>       end
>     end
Nice. Thanks.

Cheers,
Mariano
Edwin Fine (efine)
on 2006-12-23 17:35
Mariano Kamp wrote:
> Edwin Fine wrote:
>>
>> http://downloads.scribemedia.net/rails2006/03_mart...
>>   puts "downloading #{url}"
>>         output.write(buffer)
>>       end
>>     end
>>   end
>>   puts "\ndone."
>> end
>> puts "All downloads done."
>
> Wow. Cool. How did you know about the content_length and progress
> hooks? I don't see them in the docs.
>
> Anyway ... That looks nice, but I still don't see the progress on the
> console, other than for ibm.com. Do you?
>
> I can see that I am downloading at 50KBytes/s using a network traffic
> monitor, but not on the console. And if I read this right it should
> yield a progress update roughly every kilobyte , right?
>
> This is what I see after ... say ... 5 minutes after launching the
> program.
>
> downloading http://ibm.com
> Writing to ibm.com
> Content
> length: 25348 bytes
> Read 0000000822 bytes Read 0000001158 bytes Read 0000002182 bytes
> Read 0000002518 bytes Read 0000003542 bytes Read 0000003878 bytes
> Read 0000004902 bytes Read 0000005238 bytes Read 0000006262 bytes
> Read 0000006598 bytes Read 0000007622 bytes Read 0000007958 bytes
> Read 0000008982 bytes Read 0000009318 bytes Read 0000010342 bytes
> Read 0000011366 bytes Read 0000012390 bytes Read 0000013398 bytes
> Read 0000014422 bytes Read 0000014758 bytes Read 0000015782 bytes
> Read 0000016118 bytes Read 0000017142 bytes Read 0000017478 bytes
> Read 0000018502 bytes Read 0000018838 bytes Read 0000019862 bytes
> Read 0000020198 bytes Read 0000021222 bytes Read 0000021558 bytes
> Read 0000022582 bytes Read 0000022918 bytes Read 0000023942 bytes
> Read 0000024278 bytes Read 0000025302 bytes Read 0000025348 bytes
> done.
> downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
> Writing to 03_martin_fowler_full.m4v
> Content
> length: 413031533 bytes
>
>
> Cheers,
> Mariano

It's documented here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

This is what I am seeing:
downloading http://ibm.com
Writing to ibm.com
Content length: 25348 bytes
Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_mart...
Writing to 03_martin_fowler_full.m4v
Content length: 413031533 bytes
Read 0131826472 bytes

It seems to update around every second, based on informal observation. I
don't know why your output looks different; did you redirect or tee it
to a file? I'm using an old 'C' trick of printing a CR (\r) after each
update, which should keep the output on the same line and just overwrite
what was there before.

I'm running this using Ruby 1.8.5 on Ubuntu Edgy x86_64. Perhaps your OS
is different and has some other behavior.

I tried everything I could think of to disable or bypass buffering,
including $stdout.sync = true, using $stderr, calling $stdout.flush,
using syswrite, and so on, to get the output to appear periodically,
without success. I think the output is buffered at the OS level, or
something like that, so that even calling flush won't always work. The
only thing that works for me is the progress hook.
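For readers on a newer Ruby, here is a hedged sketch of the same progress-hook idea. It assumes Ruby 2.5 or later, where URI.open is available (the Ruby 1.8 used in this thread went through Kernel#open). The names progress_line and download_with_progress are hypothetical. It relies on open-uri reading the whole body into a buffer before it yields, so the hooks are the only place where progress can be reported while the transfer is still running.

```ruby
require 'open-uri'

# Hypothetical helper: build one progress line. The leading "\r" moves the
# cursor back to column 0 so each update overwrites the previous one.
def progress_line(bytes_read, total = nil)
  if total && total > 0
    percent = 100.0 * bytes_read / total
    format("\rRead %010d of %010d bytes (%5.1f%%)", bytes_read, total, percent)
  else
    # No Content-Length header was sent, so only the running count is known.
    format("\rRead %010d bytes", bytes_read)
  end
end

# Hypothetical wrapper around URI.open; it is defined here but not called.
def download_with_progress(url, target, out = $stderr)
  total = nil
  URI.open(url,
           :content_length_proc => lambda { |length| total = length },
           :progress_proc => lambda { |size| out.print progress_line(size, total) }) do |input|
    # By the time this block runs, open-uri has already buffered the complete
    # body (in a StringIO or a Tempfile), so this copy is purely local.
    IO.copy_stream(input, target)
  end
  out.puts
  target
end
```

Writing the progress to $stderr by default keeps it visible even when standard output is redirected or piped to a file.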
Mariano Kamp (Guest)
on 2006-12-23 18:05
(Received via mailing list)
Edwin Fine wrote:
> Mariano Kamp wrote:
>> Edwin Fine wrote:
>>
>> Wow. Cool. How did you know about the content_length and progress
>> hooks? I don't see them in the docs.
> It's documented here:
> http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/
Grmpfh. I looked there, but probably too properly.


> downloading http://ibm.com
[..]
> Read 0131826472 bytes
Thanks for trying that out.

Well, it seems that open had already read all the bytes before it
yielded. Changing the implementation the way Robert suggested fixed that.

So it was not really a problem with the buffering, as I suspected,
but with improper use of the API.

Cheers,
Mariano
Ross Bamford (Guest)
on 2006-12-23 22:35
(Received via mailing list)
On Sat, 23 Dec 2006 16:04:33 -0000, Mariano Kamp <mariano.kamp@acm.org>
wrote:

>
> On Dec 23, 2006, at 4:55 PM, Ross Bamford wrote:
>
>>
>> If you have libcurl and are willing to install an extension, the
>> recently released (;)) Curb 0.1 makes this as easy as:
> Thanks for the tip Ross.
>

Sure :)

> I tried gem install curb ;-) but that didn't work. And as the other
> version is already downloading the files and I just wanted this program
> to do this single job I will try out curb the next time ;-)
>

I hear you on the rubygem thing. In preparation for next time, you might
try that gem install again - it should work now ;)

> You've implemented it in C, so you probably can't answer my question how
> you dealt with the buffer size too, can you?

I just left that to the experts - although libcurl does provide some
opportunity for fiddling with its buffers, it generally seems to do
pretty well with its defaults so none of that's exposed in Ruby yet.

Cheers,
Eric Hodel (Guest)
on 2006-12-23 23:07
(Received via mailing list)
On Dec 23, 2006, at 07:35, Mariano Kamp wrote:
>>        :progress_proc => lambda { |size| printf("Read %010d bytes\r",
>
> Wow. Cool. How did you know about the content_length and progress
> hooks? I don't see them in the docs.

ri OpenURI::OpenRead#open

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!