Forum: Ruby safe way to calc md5 on very large files

rtilley (Guest)
on 2006-03-19 02:35
(Received via mailing list)
I'm calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also... is the file closed when the block exits?
I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}
Stephen W. (Guest)
on 2006-03-19 06:04
(Received via mailing list)
rtilley wrote:
> I'm calculating md5 checksums on very large files (2 GB). This is a safe
> way to do so, right? Also... is the file closed when the block exits?
> I'm using 'rb' as this is used on Windows and Linux computers.
>
> md5 = Digest::MD5.new()
> File.open(file, 'rb').each {|line| md5.update(line)}
>

Close.. try this..

     require 'md5'
     File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve
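
The closing behaviour is easy to check for yourself. A minimal sketch, assuming a throwaway temp file (the file and variable names here are made up for the illustration): the block form closes the handle when the block returns, while chaining .each onto a non-block open hands back a handle that is still open.

```ruby
require 'tempfile'

# Create a small throwaway file so the example is self-contained.
tmp = Tempfile.new('closing-demo')
tmp.write("some data\n")
tmp.close

# Block form: the handle is closed as soon as the block returns.
leaked = nil
File.open(tmp.path, 'rb') { |f| leaked = f }
block_form_closed = leaked.closed?

# Non-block form followed by #each: #each returns the File object itself,
# and nothing has closed it yet.
handle = File.open(tmp.path, 'rb').each { |line| line }
chained_form_closed = handle.closed?
handle.close # close it by hand so the demo does not leak the handle

puts "block form closed:   #{block_form_closed}"
puts "chained form closed: #{chained_form_closed}"
```

So the one-liner from the original question leaves the file open until the garbage collector gets around to it.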
unknown (Guest)
on 2006-03-19 08:29
(Received via mailing list)
On Sun, 19 Mar 2006, Stephen W. wrote:

>
>    require 'md5'
>    File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }
>
> And yes, the file is closed with the block form of open.
>
> --Steve

i think the OP has the right approach - note that an 'f.read' will consume
2GB.  but the OP's code

   harp:~ > cat a.rb
   require 'digest/md5'
   md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
   p md5.hexdigest

will not.

regards.

-a
Andrew J. (Guest)
on 2006-03-19 08:29
(Received via mailing list)
On Sun, 19 Mar 2006 13:49:51 +0900, removed_email_address@domain.invalid
<removed_email_address@domain.invalid> wrote:
>
> i think the OP has the right approach - note that an 'f.read' will consume
> 2GB.  but the OP's code
>
>    harp:~ > cat a.rb
>    require 'digest/md5'
>    md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
>    p md5.hexdigest
>
> will not.


In my reading of the OP, both the block-open and iteration are actually
desired:

  md5 = Digest::MD5.new
  File.open(file,'rb') do |ios|
    ios.each {|line| md5 << line }
  end

cheers,
andrew
Bill K. (Guest)
on 2006-03-19 08:29
(Received via mailing list)
From: "rtilley" <removed_email_address@domain.invalid>
>
> I'm calculating md5 checksums on very large files (2 GB). This is a safe
> way to do so, right? Also... is the file closed when the block exits?
> I'm using 'rb' as this is used on Windows and Linux computers.
>
> md5 = Digest::MD5.new()
> File.open(file, 'rb').each {|line| md5.update(line)}

Hi - does the file really contain text lines?  Or is it a file
full of binary data?  If it's a binary file, there may be no
guarantee the whole thing isn't one very long "line".  In that
case I'd recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end


Regards,

Bill
Robert K. (Guest)
on 2006-03-19 14:38
(Received via mailing list)
Andrew J. <removed_email_address@domain.invalid> wrote:
>>>
>> will not.
>
>
> In my reading of the OP, both the block-open and iteration are
> actually desired:
>
>  md5 = Digest::MD5.new
>  File.open(file,'rb') do |ios|
>    ios.each {|line| md5 << line }
>  end

IMHO it's a bad idea to use line-oriented reading on a binary file because
"lines" can be arbitrarily long (i.e. the whole file in the worst case).  Using
IO#read is much better.

Kind regards

    robert
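
To make the "one very long line" case concrete, here is a small sketch. It assumes a made-up 1 MiB test file of zero bytes (no newline anywhere) and compares what line-oriented iteration and fixed-size reads hand back; the file name and sizes are only for illustration.

```ruby
require 'tempfile'

# 1 MiB of zero bytes: binary data that contains no "\n" at all.
tmp = Tempfile.new('one-long-line')
tmp.binmode
tmp.write("\0" * (1024 * 1024))
tmp.close

# Line-oriented reading: each "line" runs up to the next newline or EOF.
line_sizes = []
File.open(tmp.path, 'rb') { |f| f.each { |line| line_sizes << line.bytesize } }

# Chunked reading: no piece is ever larger than the requested size.
chunk_sizes = []
File.open(tmp.path, 'rb') do |f|
  while (buf = f.read(4096))
    chunk_sizes << buf.bytesize
  end
end

puts "lines:  #{line_sizes.size}, largest #{line_sizes.max} bytes"
puts "chunks: #{chunk_sizes.size}, largest #{chunk_sizes.max} bytes"
```

With line iteration the whole file arrives as a single string, so memory use depends on the data; with IO#read(4096) it is bounded by the chunk size.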
Robert K. (Guest)
on 2006-03-19 14:48
(Received via mailing list)
Bill K. <removed_email_address@domain.invalid> wrote:
> full of binary data.  If it's a binary file, there may be no
> end
io.read will return nil at EOF, so your test for positive length is basically
redundant.  Also, for reasons of error checking I'd place the digest creation
inside the block, because then the digest is never created if the file cannot
be opened:

md5 = File.open(file, 'rb') do |io|
 dig = Digest::MD5.new
 while (buf = io.read(4096))
   dig.update(buf)
 end
 dig
end

If you want to increase efficiency, you can do this, which prevents new
strings from being created as buffers all the time:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Here's another nice variant:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end

Kind regards

    robert
rtilley (Guest)
on 2006-03-19 16:48
(Received via mailing list)
Robert K. wrote:
> dig
> end

Thank you Robert, Billy and others! Your suggestions have helped me to
solve the problem.
Tanaka A. (Guest)
on 2006-03-19 17:21
(Received via mailing list)
In article <removed_email_address@domain.invalid>,
  "Robert K." <removed_email_address@domain.invalid> writes:

> md5 = File.open(file, 'rb') do |io|
>   dig = Digest::MD5.new
>   buf = ""
>   while io.read(4096, buf)
>     dig.update(buf)
>   end
>   dig
> end

Why we have no such method in the digest library?

I think it is useful enough to have in the library.
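
Later releases of Ruby's digest library do ship such a method: Digest::MD5.file(path) (and the same for the other digest classes) reads the file piece by piece and returns the digest object. A short sketch comparing it with the hand-written loop from this thread; the temp file and its contents are assumptions made up for the check.

```ruby
require 'digest/md5'
require 'tempfile'

# Throwaway input file for the comparison.
tmp = Tempfile.new('digest-file-demo')
tmp.binmode
tmp.write('hello world' * 1000)
tmp.close

# Library method: opens the file itself and feeds it to the digest.
builtin = Digest::MD5.file(tmp.path).hexdigest

# Hand-rolled loop as posted above, for comparison.
manual_digest = File.open(tmp.path, 'rb') do |io|
  dig = Digest::MD5.new
  buf = String.new # explicitly unfrozen buffer that read() can overwrite
  dig.update(buf) while io.read(4096, buf)
  dig
end
manual = manual_digest.hexdigest

puts builtin
puts "same as manual loop: #{builtin == manual}"
```

On a Ruby that has it, the library call is the simpler choice; the loop is still useful when you already hold an open IO rather than a path.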
unknown (Guest)
on 2006-03-19 18:54
(Received via mailing list)
On Mon, 20 Mar 2006, Tanaka A. wrote:

> Why we have no such method in the digest library?
>
> I think it is useful enough to have in the library.

indeed.  in fact this seems a good candidate to add a method to a base
class:


     harp:~ > cat a.rb
     require 'digest/md5'
     require 'digest/rmd160'
     require 'digest/sha1'
     require 'digest/sha2'

     #
     # this in digest.rb or something equiv
     #
       digests = %w( MD5 RMD160 SHA1 SHA256 SHA384 SHA512 )

       digests.each do |d|
         digest_method = d.downcase

         IO.module_eval do
           define_method(digest_method) do |*argv|
             bufsize = argv.shift || 8192
             digest = ::Digest.const_get(d).new
             buf = ''
             off = pos rescue nil
             begin
               digest.update buf while read bufsize, buf
             ensure
               seek off rescue nil
             end
             digest
           end
         end

         File.module_eval do
           singleton_class = class << self; self; end
           singleton_class.module_eval do
             define_method(digest_method) do |path, *argv|
               mode = argv.shift || 'r'
               open(path, mode){|f| f.send digest_method}
             end
           end
         end
       end


     #
     # demo
     #
       report = {}
       digests.each do |d|
         digest_method = d.downcase
         report.update "File##{ digest_method}" => open(__FILE__){|f| f.send(digest_method).hexdigest}
         report.update "File.#{ digest_method}" => File.send(digest_method, __FILE__).hexdigest
       end
       require 'yaml' and y report



     harp:~ > ruby a.rb
     ---
     File.md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
     File#md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
     File.rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
     File#rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
     File.sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
     File#sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
     File.sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
     File#sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
     File.sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
     File#sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
     File.sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78
     File#sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78



comments?

-a
Erik V. (Guest)
on 2006-03-19 21:23
(Received via mailing list)
> Why we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Use it like this:

 md5 = MD5.file("foo.bar")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

 require "md5"

 class MD5
   def self.file(file)
     File.open(file, "rb") do |f|
       res = self.new
       while (data = f.read(4096))
         res << data
       end
       res
     end
   end
 end
rtilley (Guest)
on 2006-03-19 21:34
(Received via mailing list)
Erik V. wrote:
>>Why we have no such method in the digest library?
>
>
> I extended the MD5 class with a class method to build an MD5
> object directly from the contents of a given file.

Should this be done to sha1, sha2, etc?
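
One way to avoid patching each class separately is a single helper that takes the digest class as a parameter. This is only a sketch: file_digest, its argument names, and the default chunk size are hypothetical and not part of any library.

```ruby
require 'digest/md5'
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper (not a library method): digest a file in fixed-size
# chunks with whichever Digest class is passed in.
def file_digest(path, digest_class = Digest::MD5, chunk_size = 4096)
  File.open(path, 'rb') do |io|
    dig = digest_class.new
    buf = String.new # reused, unfrozen read buffer
    dig.update(buf) while io.read(chunk_size, buf)
    dig
  end
end

# Throwaway input file for the demonstration.
data = 'abc' * 5000
tmp = Tempfile.new('file-digest-demo')
tmp.binmode
tmp.write(data)
tmp.close

[Digest::MD5, Digest::SHA1, Digest::SHA256].each do |klass|
  puts "#{klass}: #{file_digest(tmp.path, klass).hexdigest}"
end
```

Because every Digest class answers to new, update and hexdigest, the same loop serves all of them.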