Best/better way of md5summing a really large file in Ruby?


#1

I’ve got a script that is going through data and, in some cases,
generating md5s of the files. Normally this isn’t a problem, but I’ve
got a few largish (~2 GB) files in there, and my script is dying on them.
I ran it in a screen session, so I’m not sure of the exact error it threw,
but I’m re-running just that part now to find out. In the meantime, any
suggestions?

This is how I’m generating the md5sum right now…
Digest::MD5.hexdigest(File.read(fn))

–Kyle


#2

Kyle S. wrote:

I’ve got a script that is going through data and, in some cases,
generating md5s of the files. Normally this isn’t a problem, but I’ve
got a few largish (~2 GB) files in there, and my script is dying on them.
I ran it in a screen session, so I’m not sure of the exact error it threw,
but I’m re-running just that part now to find out. In the meantime, any
suggestions?

I googled for ‘md5 large files’ and ended up here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834
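
For anyone who can't reach that link, the general idea can be sketched as follows. This is only an assumption about the technique (feed the digest in chunks instead of calling File.read on the whole file), not a copy of the linked code; the method name md5_of_file and the 1 MB chunk size are made up for illustration.

```ruby
require 'digest/md5'

# Hypothetical helper: hash a file in fixed-size chunks so the whole
# file is never held in memory at once.
def md5_of_file(path, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |io|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Calling md5_of_file(fn) should give the same hex string as Digest::MD5.hexdigest(File.read(fn)), but memory use stays around one chunk. Ruby's digest library also has Digest::MD5.file(fn).hexdigest, which reads the file incrementally for you.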

yun


#3

On Wed, 2009-04-22 at 23:18 +0900, Yun Huang Y. wrote:

yun

rthompso@raker /cpartition/hold $ ls -rlt dummyfile
-rw-r--r-- 1 rthompso staff 2147483648 2009-04-22 10:27 dummyfile
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> result = %x[md5sum dummyfile]
=> "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
irb(main):002:0> p result
"a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
=> nil
irb(main):003:0> def timeit
irb(main):004:1> tstart = Time.now
irb(main):005:1> result = %x[md5sum dummyfile]
irb(main):006:1> tend = Time.now
irb(main):007:1> elapsed = tend - tstart
irb(main):008:1> puts elapsed.to_s
irb(main):009:1> end
=> nil
irb(main):011:0> timeit
10.633416
=> nil


#4

Thanks, both of you. I’d rather not shell out using %x[], but I may end
up doing that. I tried the modified MD5, and it actually ran in close
to the same time on my work machine. I’ll have to see how it does on
my home one.
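
In case it helps anyone comparing the two approaches, here is a rough sketch of how the timing could be done from one script. The helper names (timed, md5_via_md5sum, md5_via_ruby) are made up for this example, it assumes an md5sum binary is on the PATH, and it is not the exact code I ran.

```ruby
require 'benchmark'
require 'digest/md5'

# Hypothetical helper: run a block and return [result, elapsed_seconds].
def timed
  result = nil
  seconds = Benchmark.realtime { result = yield }
  [result, seconds]
end

# Shell out to md5sum. The array form of IO.popen bypasses the shell,
# so odd characters in the file name need no quoting.
# Assumes md5sum is installed; raises Errno::ENOENT otherwise.
def md5_via_md5sum(path)
  output = IO.popen(['md5sum', path]) { |io| io.read }
  output.split.first # keep only the hex digest, drop the file name
end

# Pure-Ruby route: Digest::MD5.file reads the file incrementally.
def md5_via_ruby(path)
  Digest::MD5.file(path).hexdigest
end
```

Usage would be something like `digest, secs = timed { md5_via_ruby(fn) }` and the same with md5_via_md5sum. Both should return the same digest, so the only thing left to compare is the elapsed time, and it is worth running each more than once since the file cache affects the numbers.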

–Kyle


#5

On Wed, 2009-04-22 at 23:34 +0900, Reid T. wrote:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834

More realistic, with a file filled from /dev/urandom:
rthompso@raker /cpartition/hold $ dd if=/dev/urandom of=dummyfile count=4M
4194304+0 records in
4194304+0 records out
2147483648 bytes (2.1 GB) copied, 529.518 s, 4.1 MB/s
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> def timeit
irb(main):002:1> tstart = Time.now
irb(main):003:1> result = %x[md5sum dummyfile]
irb(main):004:1> tend = Time.now
irb(main):005:1> elapsed = tend - tstart
irb(main):006:1> puts elapsed.to_s
irb(main):007:1> end
=> nil
irb(main):008:0> timeit
49.366641
=> nil
irb(main):009:0> timeit
48.416673
=> nil
irb(main):010:0>