Best/better way of md5summing a really large file in Ruby?

I’ve got a script that goes through data and, in some cases,
generates md5s of the files. Normally this isn’t a problem, but I’ve
got a few largish (~2G) files in there, and my script is dying on them.
I ran it in a screen session, so I’m not sure of the exact error it
threw, but I’m re-running just that part now to find out. In the
meantime, any suggestions?

This is how I’m generating the md5sum right now…
Digest::MD5.hexdigest(File.read(fn))
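
`File.read(fn)` loads the whole file into one string before hashing, which is a plausible source of trouble at ~2G. A minimal sketch of the usual alternative, feeding the digest in fixed-size chunks, is below. The names `md5_of_file` and `chunk_size` are hypothetical and not taken from this thread or the linked post.

```ruby
require 'digest/md5'

# Hypothetical helper: hash a file chunk by chunk so that only
# chunk_size bytes are held in memory at any one time.
def md5_of_file(path, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |io|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Ruby's digest library also offers `Digest::MD5.file(fn).hexdigest`, which reads the file in blocks rather than all at once and may be enough on its own.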

--Kyle

I googled for ‘md5 large files’ and ended up here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834

yun

On Wed, 2009-04-22 at 23:18 +0900, Yun Huang Y. wrote:

rthompso@raker /cpartition/hold $ ls -rlt dummyfile
-rw-r--r-- 1 rthompso staff 2147483648 2009-04-22 10:27 dummyfile
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> result = %x[md5sum dummyfile]
=> "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
irb(main):002:0> p result
"a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
=> nil
irb(main):003:0> def timeit
irb(main):004:1> tstart = Time.now
irb(main):005:1> result = %x[md5sum dummyfile]
irb(main):006:1> tend = Time.now
irb(main):007:1> elapsed = tend - tstart
irb(main):008:1> puts elapsed.to_s
irb(main):009:1> end
=> nil
irb(main):011:0> timeit
10.633416
=> nil

Thanks, both of you. I’d rather not shell out using %x[], but I may end
up doing that. I tried the modified MD5, and it actually ran in close
to the same time on my work machine; I’ll have to see how it does on
my home one.

--Kyle

On Wed, 2009-04-22 at 23:34 +0900, Reid T. wrote:


more realistic…
rthompso@raker /cpartition/hold $ dd if=/dev/urandom of=dummyfile count=4M
4194304+0 records in
4194304+0 records out
2147483648 bytes (2.1 GB) copied, 529.518 s, 4.1 MB/s
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> def timeit
irb(main):002:1> tstart = Time.now
irb(main):003:1> result = %x[md5sum dummyfile]
irb(main):004:1> tend = Time.now
irb(main):005:1> elapsed = tend - tstart
irb(main):006:1> puts elapsed.to_s
irb(main):007:1> end
=> nil
irb(main):008:0> timeit
49.366641
=> nil
irb(main):009:0> timeit
48.416673
=> nil
irb(main):010:0>
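
For comparison with the `md5sum` timings above, the same kind of measurement can be taken in-process. This is only a sketch: `time_md5` is a hypothetical helper, and it uses `Digest::MD5.file` and `Benchmark.realtime` from Ruby's standard library rather than anything posted in this thread.

```ruby
require 'benchmark'
require 'digest/md5'

# Hypothetical helper: returns [hexdigest, elapsed_seconds] for one file.
# Digest::MD5.file reads the file in blocks rather than all at once.
def time_md5(path)
  hex = nil
  elapsed = Benchmark.realtime do
    hex = Digest::MD5.file(path).hexdigest
  end
  [hex, elapsed]
end

# Example use (dummyfile is the file from the session above):
#   hex, secs = time_md5('dummyfile')
#   puts "#{hex}  #{secs}"
```

Repeated runs on the same file may be affected by the operating system's file cache, so it is worth timing more than once, as in the session above.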