Number of lines in a text file

alcina · October 19, 2013, 4:02pm

If I want the number of lines in a text file, I can use

File.readlines().size

but this builds a useless extra Array, or

%x(wc -l ).to_i

but this requires a *nix system (or a wc.exe command on Windows).
Or else a File.read followed by a grep for '\n'…

I feel there should be a simpler way to do that…
_md

micheldogger · October 19, 2013, 4:21pm

On 2013-10-19, at 10:02 AM, Michel D. wrote:

I feel there should be a simpler way to do that…
_md

Have you looked at Enumerable’s count method?

mike$ wc -l /etc/passwd
83 /etc/passwd
mike$ ruby -e "puts File.open('/etc/passwd') { |f| f.count }"
83
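
The two numbers agree here because the file ends with a newline. As a minimal sketch (the helper names count_lines and count_newlines are made up for illustration, and the temp file stands in for a real one), the two approaches differ when the last line has no trailing newline: wc -l counts newline characters, while count counts the lines that each yields.

```ruby
require 'tempfile'

# Hypothetical helper: counts lines the way the one-liner above does,
# relying on IO including Enumerable.
def count_lines(path)
  File.open(path) { |f| f.count }
end

# Hypothetical helper: counts newline characters, which is what wc -l reports.
def count_newlines(path)
  File.read(path).count("\n")
end

Tempfile.create('lines') do |tmp|
  tmp.write("one\ntwo\nthree")    # no newline after the last line
  tmp.flush                       # make sure the data is on disk before re-reading
  puts count_lines(tmp.path)      # prints 3: the unterminated last line is counted
  puts count_newlines(tmp.path)   # prints 2: only "\n" characters are counted
end
```

For files that end with a newline, as most text files do, both helpers return the same number.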

Hope this helps,

Mike

–

Mike S.
http://www.stok.ca/~mike/

The “`Stok’ disclaimers” apply.

micheldogger · October 19, 2013, 8:23pm

On Sat, Oct 19, 2013 at 4:02 PM, Michel D. wrote:

I feel there should be a simpler way to do that…

lines = File.foreach(file).count

Kind regards

robert

micheldogger · October 20, 2013, 10:37am

Robert K. wrote in post #1124923:

lines = File.foreach(file).count

Thanks, Robert, using ‘foreach’ is cleaner.

FWIW, I benchmarked. The File methods are equivalent and much faster.

require 'benchmark'
file = __FILE__
n = 10000
Benchmark.bm do |rep|
  rep.report("readlines") { n.times { File.readlines(file).size } }
  rep.report("wc -l    ") { n.times { `wc -l #{file}`.to_i } }
  rep.report("foreach  ") { n.times { File.foreach(file).count } }
end

gives

                user     system      total        real
readlines   0.219000   0.499000   0.718000 (  0.752043)
wc -l       2.542000   5.257000   7.799000 ( 83.502776)
foreach     0.219000   0.531000   0.750000 (  0.761044)

_md

micheldogger · October 20, 2013, 5:07pm

Robert K. wrote in post #1124958:

It would be interesting to see how that works out for a large file. I
would expect the last version to be more efficient than the first
one.

I would guess so. But the test below shows the same pattern: readlines
is a bit faster.

file = File.join(File.dirname(__FILE__), 'test.txt')
File.open(file, 'w') do |f|
  3000.times { f.puts 'bla' * 10 }
end

n = 10000
Benchmark.bm do |rep|
  rep.report("readlines") { n.times { File.readlines(file).size } }
  rep.report("foreach  ") { n.times { File.foreach(file).count } }
end

                user     system      total        real
readlines  11.341000   1.217000  12.558000 ( 12.686726)
foreach    12.433000   1.264000  13.697000 ( 13.871793)

micheldogger · October 20, 2013, 3:10pm

On Sun, Oct 20, 2013 at 10:37 AM, Michel D. wrote:

Robert K. wrote in post #1124923:

lines = File.foreach(file).count

Thanks, Robert, using ‘foreach’ is cleaner.

Yes, and it avoids building an Array for the whole file in memory.
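
To illustrate the point, here is a small sketch (the helper name describe_line_readers and the temp file are only for illustration): File.readlines returns an Array holding every line at once, whereas File.foreach without a block returns an Enumerator that hands lines to count one at a time.

```ruby
require 'tempfile'

# Hypothetical helper: reports what each call returns, plus the line count.
def describe_line_readers(path)
  {
    readlines_class: File.readlines(path).class, # Array with all lines in memory
    foreach_class:   File.foreach(path).class,   # Enumerator, lines produced on demand
    count:           File.foreach(path).count
  }
end

Tempfile.create('lines') do |tmp|
  5.times { |i| tmp.puts "line #{i}" }
  tmp.flush                                      # make sure the data is on disk
  p describe_line_readers(tmp.path)
end
```

Each line is still allocated as a String in both cases; the difference is that with foreach a line can be garbage collected as soon as it has been counted.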

FWIW, I benchmarked. The File methods are equivalent and much faster.

Naturally since they avoid the overhead of forking and IPC.

                user     system      total        real
readlines   0.219000   0.499000   0.718000 (  0.752043)
wc -l       2.542000   5.257000   7.799000 ( 83.502776)
foreach     0.219000   0.531000   0.750000 (  0.761044)

It would be interesting to see how that works out for a large file. I
would expect the last version to be more efficient than the first
one.

Kind regards

robert

micheldogger · October 20, 2013, 5:14pm

Michel D. wrote in post #1124962:

                user     system      total        real
readlines  11.341000   1.217000  12.558000 ( 12.686726)
foreach    12.433000   1.264000  13.697000 ( 13.871793)

With 300_000 lines and 100 iterations, instead of 3_000 lines and
10_000 iterations, one gets the same pattern:

                user     system      total        real
readlines  11.622000   1.060000  12.682000 ( 12.692726)
foreach    12.246000   0.858000  13.104000 ( 13.156753)

but the difference is smaller…

_md

micheldogger · October 21, 2013, 4:11am

On Oct 20, 2013, at 4:13 PM, Robert K. wrote:

      user     system      total        real
rep.report("readlines") { n.times { File.readlines(tmp.path).size } }

–
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

What about space? That’s also a huge consideration here, isn’t it?
foreach should win that by lots and lots, too.
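
If space is the main concern, one option not benchmarked in this thread is to skip per-line strings entirely and count newline bytes in fixed-size chunks. The following is only a sketch under assumptions: the helper name count_newlines_chunked and the 64 KiB chunk size are made up for illustration, and it counts "\n" characters like wc -l, so an unterminated last line is not counted.

```ruby
require 'tempfile'

CHUNK_SIZE = 64 * 1024 # assumed buffer size; any positive value works

# Hypothetical helper: counts "\n" bytes by reading fixed-size chunks
# into a single reused buffer, so memory use does not grow with file size.
def count_newlines_chunked(path, chunk_size = CHUNK_SIZE)
  count = 0
  buffer = String.new
  File.open(path, 'rb') do |f|
    # IO#read(length, outbuf) fills the buffer and returns nil at end of file
    while f.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end

Tempfile.create('lines') do |tmp|
  1_000.times { tmp.puts 'x' * 99 }
  tmp.flush                              # make sure the data is on disk
  puts count_newlines_chunked(tmp.path)  # prints 1000
end
```

Whether this is actually faster than File.foreach(file).count would need to be measured; the sketch only addresses the memory question.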

micheldogger · October 20, 2013, 11:14pm

On Sun, Oct 20, 2013 at 5:14 PM, Michel D. wrote:

readlines 11.622000 1.060000 12.682000 ( 12.692726)
foreach 12.246000 0.858000 13.104000 ( 13.156753)

but the difference is smaller…

$ ruby x.rb
                user     system      total        real
readlines  56.831000   7.597000  64.428000 ( 64.241000)
foreach    50.357000   5.476000  55.833000 ( 56.153000)
$ cat x.rb

require 'tempfile'
require 'benchmark'

LINE = 'x' * 99
n = 100

Tempfile.open(ENV['TMP'] || '/tmp') do |tmp|
  1_000_000.times { tmp.puts LINE }
  tmp.flush # make sure buffered lines reach the file before it is re-read by path

  Benchmark.bm do |rep|
    rep.report("readlines") { n.times { File.readlines(tmp.path).size } }
    rep.report("foreach  ") { n.times { File.foreach(tmp.path).count } }
  end
end

So with even larger files the difference shows.

Kind regards

robert

micheldogger · October 21, 2013, 8:17am

tamouse m. wrote in post #1124992:

What about space? That’s also a huge consideration here, isn’t it?
foreach should win that by lots and lots, too.

Sure.
_md