On Wed, Sep 19, 2012 at 11:41 AM, Peter H. wrote:
On 19 September 2012 10:09, Carlos A. wrote:
I’d like to know, too. I stumbled upon a similar problem, but it was long
ago.
Ok, here is a quick test that I hacked up. The data is a 2,659,800 line,
639 MB text file. Some lines contain the string “FRED”; the task is to count them.
Let’s see: that is about 252 chars per line on average.
Here’s how I generated the file:
$ ruby -e 'x="X"*243; 2_600_000.times {|i| printf "%7d%s%s\n",i,x,rand(1000)==0 ? "FRED" : "OOOO" }' >results201101.dat
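For quick experiments, a scaled-down version of the same generator can be written as a script. This is only a sketch, not the command used above: the file name `sample.dat`, the line count of 10,000, and the fixed seed are assumptions chosen so that runs are small and repeatable.

```ruby
require "tmpdir"

# Write `lines` records in the same layout as the one-liner above:
# a 7-character line number, 243 filler characters, then FRED or OOOO.
def generate(path, lines, rng)
  filler = "X" * 243
  # "wb" avoids newline translation, so every line is exactly 255 bytes.
  File.open(path, "wb") do |out|
    lines.times do |i|
      tag = rng.rand(1000) == 0 ? "FRED" : "OOOO"
      out.printf("%7d%s%s\n", i, filler, tag)
    end
  end
end

path = File.join(Dir.tmpdir, "sample.dat") # hypothetical file name
generate(path, 10_000, Random.new(42))     # fixed seed: an assumption for repeatability

puts File.size(path)                                      # 255 bytes per line
puts File.foreach(path).count { |l| l.include?("FRED") }  # number of FRED lines
```

Passing the random number generator in as an argument keeps the file contents reproducible, which makes timings from different runs easier to compare.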
To be honest, I suspect that it is more an issue with the regexes than
with file IO, and the real regexes are much more complicated than just
matching a string. I was a bit surprised that index() wasn’t faster.
Darn! Maybe encoding plays a role here. The pure IO is pretty fast
(see last test):
RUN 2
2659
real 0m3.520s
user 0m3.213s
sys 0m0.249s
./perl.pl
2659
real 0m2.220s
user 0m1.950s
sys 0m0.249s
./ruby-1.rb
2659
real 0m4.912s
user 0m4.383s
sys 0m0.498s
./ruby-2.rb
real 0m5.032s
user 0m4.336s
sys 0m0.639s
./ruby-3.rb
real 0m3.610s
user 0m3.276s
sys 0m0.312s
./ruby-4.rb
2659
real 0m5.004s
user 0m4.399s
sys 0m0.467s
./ruby-5.rb
2659
real 0m4.980s
user 0m4.430s
sys 0m0.451s
./ruby-6.rb
0
real 0m2.495s
user 0m2.012s
sys 0m0.420s
$ head -200 *.pl *.rb
==> perl.pl <==
#!/usr/bin/env perl
use strict;
use warnings;
my $logfile = 'results201101.dat';
my $counter = 0;
open FILE, "<$logfile" or die $!;
while(my $line = <FILE>) {
  if($line =~ /FRED/) {
    $counter++;
  }
}
close(FILE);
print "$counter\n";
==> ruby-1.rb <==
#!/usr/bin/env ruby
counter = 0
File.open("results201101.dat").each do |line|
  if line =~ /FRED/
    counter += 1
  end
end
puts counter
==> ruby-2.rb <==
#!/usr/bin/env ruby
r = Regexp.new('FRED')
counter = 0
File.open("results201101.dat").each do |line|
  if r.match(line)
    counter += 1
  end
end
==> ruby-3.rb <==
#!/usr/bin/env ruby
counter = 0
File.open("results201101.dat").each do |line|
  if line.index("FRED")
    counter += 1
  end
end
==> ruby-4.rb <==
#!/usr/bin/env ruby
count = 0
File.foreach "results201101.dat" do |line|
  count += 1 if /FRED/ =~ line
end
puts count
==> ruby-5.rb <==
#!/usr/bin/env ruby
count = 0
File.foreach "results201101.dat", encoding: "ASCII" do |line|
  count += 1 if /FRED/ =~ line
end
puts count
==> ruby-6.rb <==
#!/usr/bin/env ruby
count = 0
File.foreach "results201101.dat", encoding: "ASCII" do |line|
  count += 1 if /FRED/ =~ line
end
puts count
And here’s the test run:
$ for i in {1..2}; do echo "RUN $i"; time fgrep -c FRED results201101.dat;
for f in ./*.pl ./*.rb; do echo "$f"; time "$f"; done; done
This was all on cygwin on a machine with plenty of memory, so the file
was likely cached and there was no real disk IO.
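If the file is served from the cache, the differences between the scripts should come mostly from the matching itself. One way to check that is to time the matching methods on lines that are already in memory. The following is a sketch under assumptions: the lines are synthetic, the sample is small, every 1000th line carries FRED instead of a random choice, and the absolute numbers will differ by Ruby version and platform.

```ruby
require "benchmark"

# Synthetic lines in the same shape as the test file.
# Assumption: FRED appears on every 1000th line rather than at random.
filler = "X" * 243
lines = Array.new(100_000) do |i|
  format("%7d%s%s\n", i, filler, i % 1000 == 0 ? "FRED" : "OOOO")
end

re = /FRED/
counts = {}

# Each report counts the matching lines with one of the methods
# used in the scripts above, so the results can be cross-checked.
Benchmark.bm(10) do |bm|
  bm.report("=~")       { counts[:op]      = lines.count { |l| l =~ re } }
  bm.report("match")    { counts[:match]   = lines.count { |l| re.match(l) } }
  bm.report("index")    { counts[:index]   = lines.count { |l| l.index("FRED") } }
  bm.report("include?") { counts[:include] = lines.count { |l| l.include?("FRED") } }
end

p counts
```

Storing the count from each variant makes it easy to confirm that all four methods agree before comparing their timings.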
Ah, it gets a tad better without regexp:
$ time ./ruby-7.rb
2659
real 0m3.432s
user 0m2.869s
sys 0m0.529s
$ cat ruby-7.rb
#!/usr/bin/env ruby
count = 0
f = 'FRED'
File.foreach "results201101.dat", encoding: "BINARY" do |line|
  count += 1 if line.include? f
end
puts count
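The `encoding:` option controls the encoding each line string is tagged with. A small sketch, assuming a throwaway two-line file in the temp directory (the file name is hypothetical), shows what each variant used in these scripts produces:

```ruby
require "tmpdir"

path = File.join(Dir.tmpdir, "encoding-demo.dat") # hypothetical file name
File.write(path, "1 XXXX FRED\n2 XXXX OOOO\n")

# Record the encoding Ruby assigns to a line for each option.
seen = {}
seen[:default] = File.foreach(path).first.encoding
seen[:ascii]   = File.foreach(path, encoding: "ASCII").first.encoding
seen[:binary]  = File.foreach(path, encoding: "BINARY").first.encoding

seen.each { |name, enc| puts "#{name}: #{enc}" }

# include? still works on binary-tagged lines here because
# both the line and the search string contain only ASCII bytes.
count = File.foreach(path, encoding: "BINARY").count { |l| l.include?("FRED") }
puts count
```

The default entry depends on the external encoding of the environment, so it may differ between machines; the other two are fixed by the option.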
Kind regards
robert