Safe way to calc md5 on very large files

rtilley · March 19, 2006, 1:35am

I’m calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also… is the file closed when the block exits?
I’m using ‘rb’ as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

rtilley · March 19, 2006, 5:04am

rtilley wrote:

I’m calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also… is the file closed when the block exits?
I’m using ‘rb’ as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Close… try this…

 require 'md5'
 File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

–Steve
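
To make the point about closing concrete, here is a minimal sketch. The method names are made up for illustration and are not from the thread. It contrasts the non-block form used in the original question, which leaves the file open after `each` returns, with the block form, which closes it when the block exits.

```ruby
# Non-block form: File.open returns the File object, and iterating over
# it does not close it. Returns true if the file is still open afterwards.
def still_open_after_each?(path)
  file = File.open(path, "rb")
  file.each { |line| line }
  !file.closed?
ensure
  # Clean up explicitly, since nothing else will until garbage collection.
  file.close if file && !file.closed?
end

# Block form: the file is closed as soon as the block exits.
# Returns true if the handle captured inside the block is closed afterwards.
def closed_after_block?(path)
  handle = nil
  File.open(path, "rb") { |f| handle = f }
  handle.closed?
end
```

Calling either method with the path of any readable file should return true, which suggests the one-liner in the question leaks the handle until it is garbage collected.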

rtilley · March 19, 2006, 7:29am

On Sun, 19 Mar 2006, Stephen W. wrote:

require 'md5'
File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

–Steve

i think the OP has the right approach - note that an ‘f.read’ will consume
2GB. but the OP’s code

harp:~ > cat a.rb
require 'digest/md5'
md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
p md5.hexdigest

will not.

regards.

-a

rtilley · March 19, 2006, 7:29am

On Sun, 19 Mar 2006 13:49:51 +0900, [email protected]
[email protected] wrote:

i think the OP has the right approach - note that an ‘f.read’ will consume
2GB. but the OP’s code

harp:~ > cat a.rb
require 'digest/md5'
md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
p md5.hexdigest

will not.

In my reading of the OP, both the block-open and iteration are actually
desired:

md5 = Digest::MD5.new
File.open(file,'rb') do |ios|
  ios.each {|line| md5 << line }
end

cheers,
andrew

rtilley · March 19, 2006, 7:29am

From: “rtilley” [email protected]

I’m calculating md5 checksums on very large files (2 GB). This is a safe
way to do so, right? Also… is the file closed when the block exits?
I’m using ‘rb’ as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Hi - does the file really contain text lines? Or is it a file
full of binary data. If it’s a binary file, there may be no
guarantee the whole thing isn’t one very long “line”. In that
case I’d recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end

Regards,

Bill
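
A rough sketch of the concern above, with helper names invented for illustration: on a binary file that happens to contain no newline bytes, line iteration hands over the whole file as a single "line", while `read` with a size never returns more than that many bytes at once.

```ruby
# Size in bytes of the largest "line" seen when iterating line by line.
def largest_line(path)
  File.open(path, "rb") { |io| io.each_line.map(&:bytesize).max || 0 }
end

# Size in bytes of the largest chunk seen when reading fixed-size blocks.
def largest_chunk(path, bufsize = 4096)
  largest = 0
  File.open(path, "rb") do |io|
    # read(bufsize) returns nil at end of file, which ends the loop.
    while (buf = io.read(bufsize))
      largest = buf.bytesize if buf.bytesize > largest
    end
  end
  largest
end
```

For a file of 100,000 zero bytes, `largest_line` reports the full file size, whereas `largest_chunk` stays at the buffer size.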

rtilley · March 19, 2006, 1:38pm

Andrew J. [email protected] wrote:

will not.

In my reading of the OP, both the block-open and iteration are
actually desired:

md5 = Digest::MD5.new
File.open(file,'rb') do |ios|
  ios.each {|line| md5 << line }
end

IMHO it’s a bad idea to use line-oriented reading on a binary file because
“lines” can be arbitrarily long (i.e. the whole file in the worst case).
Using IO#read is much better.

Kind regards

robert

rtilley · March 19, 2006, 1:48pm

Bill K. [email protected] wrote:

io.read(4096) returns nil at EOF, so your test for positive length is
redundant. Also, for error handling I’d create the digest inside the
block, because then the digest is never created if the file cannot be
opened:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

If you want to increase efficiency, you can do this, which avoids
creating a new string for every chunk by reusing one buffer:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Here’s another nice variant:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end

Kind regards

robert
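
The two properties these variants rely on can be checked with a small sketch. It uses StringIO in place of a real file, and the helper names are hypothetical: `read` with a size returns nil once the data is exhausted, and passing a buffer as the second argument hands back that same string object on every call.

```ruby
require "stringio"

# Collects what successive read(size) calls return, including the final nil.
def read_results(data, size)
  io = StringIO.new(data)
  results = []
  loop do
    chunk = io.read(size)
    results << chunk
    break if chunk.nil?
  end
  results
end

# Counts how many distinct String objects read(size, buf) hands back
# when the same buffer is passed in on every call.
def distinct_buffers(io, size)
  buf = String.new
  ids = []
  while (chunk = io.read(size, buf))
    ids << chunk.object_id
  end
  ids.uniq.size
end

p read_results("abcdef", 4)                     # => ["abcd", "ef", nil]
p distinct_buffers(StringIO.new("abcdefgh"), 4) # => 1
```

Since the loop condition already stops on nil, the extra length check adds nothing, and the buffer variant allocates one string for the whole run instead of one per chunk.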

rtilley · March 19, 2006, 3:48pm

Robert K. wrote:

Thank you Robert, Billy and others! Your suggestions have helped me to
solve the problem.

rtilley · March 19, 2006, 4:21pm

In article [email protected],
“Robert K.” [email protected] writes:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Why do we have no such method in the digest library?

I think it is useful enough to have in the library.

rtilley · March 19, 2006, 5:54pm

On Mon, 20 Mar 2006, Tanaka A. wrote:

Why do we have no such method in the digest library?

I think it is useful enough to have in the library.

indeed. in fact this seems a good candidate to add a method to a base
class:

 harp:~ > cat a.rb
 require 'digest/md5'
 require 'digest/rmd160'
 require 'digest/sha1'
 require 'digest/sha2'

 #
 # this in digest.rb or something equiv
 #
   digests = %w( MD5 RMD160 SHA1 SHA256 SHA384 SHA512 )

   digests.each do |d|
     digest_method = d.downcase

     IO.module_eval do
       define_method(digest_method) do |*argv|
         bufsize = argv.shift || 8192
         digest = ::Digest.const_get(d).new
         buf = ''
         off = pos rescue nil
         begin
           digest.update buf while read bufsize, buf
         ensure
           seek off rescue nil
         end
         digest
       end
     end

     File.module_eval do
       singleton_class = class << self; self; end
       singleton_class.module_eval do
         define_method(digest_method) do |path, *argv|
           mode = argv.shift || 'r'
           open(path, mode){|f| f.send digest_method}
         end
       end
     end
   end


 #
 # demo
 #
   report = {}
   digests.each do |d|
     digest_method = d.downcase
     report.update "File##{ digest_method}" => open(__FILE__){|f| f.send(digest_method).hexdigest}
     report.update "File.#{ digest_method}" => File.send(digest_method, __FILE__).hexdigest
   end
   require 'yaml' and y report

 harp:~ > ruby a.rb
 ---
 File.md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
 File#md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
 File.rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
 File#rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
 File.sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
 File#sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
 File.sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
 File#sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
 File.sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
 File#sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
 File.sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78
 File#sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78

comments?

-a
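
As one possible variation on the idea above, the same chunked loop can live in plain helper methods that take the digest class as a parameter, instead of adding methods to IO and File. This is only a sketch: `digest_io` and `digest_path` are names invented here, and unlike the code above it does not save and restore the stream position.

```ruby
require "digest"

# Hypothetical helper: feeds an already-open IO to a digest object in
# fixed-size chunks, reusing a single buffer string.
def digest_io(io, digest_class = Digest::MD5, bufsize = 8192)
  digest = digest_class.new
  buf = String.new
  digest.update(buf) while io.read(bufsize, buf)
  digest
end

# Hypothetical helper: opens the file in binary mode (so the result is
# the same on Windows and Linux) and digests its contents.
def digest_path(path, digest_class = Digest::MD5, bufsize = 8192)
  File.open(path, "rb") { |io| digest_io(io, digest_class, bufsize) }
end
```

Usage would look like `digest_path("big.iso", Digest::SHA1).hexdigest`, where the file name is just a placeholder.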

rtilley · March 19, 2006, 8:23pm

Why do we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Use it like this:

md5 = MD5.file("foo.bar")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

require "md5"

class MD5
  def self.file(file)
    File.open(file, "rb") do |f|
      res = self.new
      while (data = f.read(4096))
        res << data
      end
      res
    end
  end
end

rtilley · March 19, 2006, 8:34pm

Erik V. wrote:

Why do we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Should this be done to sha1, sha2, etc?
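
For readers on later Ruby releases: the standard digest library provides a `file` class method on the digest classes, which reads the file in chunks rather than loading it whole, so the same call works for MD5, SHA1 and the SHA2 family. A minimal sketch, where the wrapper method name is made up for illustration:

```ruby
require "digest"

# Returns hex digests of the file at `path` for several algorithms,
# using the `file` class method that the digest library provides.
def builtin_file_digests(path)
  {
    "md5"    => Digest::MD5.file(path).hexdigest,
    "sha1"   => Digest::SHA1.file(path).hexdigest,
    "sha256" => Digest::SHA256.file(path).hexdigest
  }
end
```

Each entry should match what the hand-written chunked loops in this thread produce for the same file.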