Class for data analysis: File.open once

ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1]

I am trying to analyze a file and extract necessary data from it.
First I need to count the total number of lines and the blank lines. the
class and methods follow.

class DataExtractor
attr_reader :file_name
def initialize(file)
@file_name = file
end

def total_lines
    @lines = 0
    f = File.open(@file_name, 'r')
    f.read.each_line do |l|
        @lines += 1
    end
    f.close
    @lines
end

def total_blank_lines
    regEx = /^[\s]*$\n/
    @total_blank_lines = 0
    f = File.open(@file_name, 'r')
    f.read.each_line do | line |
        if line =~ regEx
            @total_blank_lines += 1
        end
    end
    f.close
    @total_blank_lines
end
....

The problem is that each method has 'File.open…each_line do ’ so the
whole loop gets executed separately.

Don’t you think this is inefficient? Is there better ways to develop
methods so that the loop is needed only once?

soichi

On 3 February 2013 11:13, Soichi I. [email protected] wrote:

end

    @total_blank_lines
end
....

The problem is that each method has 'File.open…each_line do ’ so the
whole loop gets executed separately.

Don’t you think this is inefficient? Is there better ways to develop
methods so that the loop is needed only once?

Yep, sure do; and yep, sure is. The fact that you’re using @variables
inside your methods is already a hint at how to improve things – have
the
object remember them. My first approach would be to do all the file
analysis in the constructor, sort of like this (using your code; it
would
look a little different if I wrote it for myself):

class DataExtractor
  attr_reader :file_name, :lines, :total_blank_lines
  def initialize(file)
    @file_name = file
    @lines = 0
    @total_blank_lines = 0
    regEx = /^[\s]*$\n/
    f = File.open(@file_name, 'r')
    f.read.each_line do | line |
      if line =~ regEx
        @total_blank_lines += 1
      end
      @lines += 1
    end
    f.close
  end
end

My next approach would be to lazy-initialise the data, because that’s
just
a thing I like to do. It means the lines aren’t counted until they’re
needed, in case that’s an improvement. (Again, rough code, untested.)

class DataExtractor
  attr_reader :file_name

  def initialize(file)
    @file_name = file
    @lines = -1
    @total_blank_lines = -1
  end

  def lazy_init_lines
    regEx = /^[\s]*$\n/
    f = File.open(@file_name, 'r')
    f.read.each_line do | line |
      if line =~ regEx
        @total_blank_lines += 1
      end
      @lines += 1
    end
    f.close
  end

  def lines
    lazy_init_lines if @lines < 0
    @lines
  end

  def total_blank_lines
    lazy_init_lines if @total_blank_lines < 0
    @total_blank_lines
  end
end

In either case you perform a single open-read-close loop, and can access
the totals over and over again. Various further optimisations and
improvements are available to be made at your discretion.


Matthew K., B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

“You’ll never find a programming language that frees
you from the burden of clarifying your ideas.” - xkcd

Matthew K. wrote in post #1094954:

  def lazy_init_lines
    regEx = /^[\s]*$\n/
    f = File.open(@file_name, 'r')
    f.read.each_line do | line |
      if line =~ regEx
        @total_blank_lines += 1
      end
      @lines += 1
    end
    f.close
  end

  def lines
    lazy_init_lines if @lines < 0
    @lines
  end

  def total_blank_lines
    lazy_init_lines if @total_blank_lines < 0
    @total_blank_lines
  end
end

+1 nice resolution! :slight_smile:

Thanks everyone. They will certainly help.

In addition to these, what if I want to apply specific regular
expressions to each of the lines?

say,

if line =~ /exp1/ then

do stuff

else

do stuff

end

In this case, the loop for all lines must be implemented, am I correct?

If the class methods dynamically take regular expression like

test = DataExtractor.new(‘file.txt’)
test.apply_regexp(/exp1/)

Doesn’t the loop need to be reactivated for each method like this?

soichi

Am 03.02.2013 02:13, schrieb Soichi I.:

ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1]

I am trying to analyze a file and extract necessary data from it.
First I need to count the total number of lines and the blank lines. the
class and methods follow.

Since you are reading the complete file into memory anyway
(with File.read), you could get the information you need
much easier without any loop at all:

content = File.readlines(‘file.dat’)
content.size
content.grep(/^\s*$/).size

thanks for your help.
I will try them all!

for bigger files, this code may be better

lines, empty_lines = 0,0
File.foreach(‘file.dat’) {|l| lines += 1; empty_lines += 1 if l =~
/^\s*$/}

On 4 February 2013 12:03, Soichi I. [email protected] wrote:

do stuff

end

In this case, the loop for all lines must be implemented, am I correct?

The previous code already does that, and we showed that it can be done
in
one loop.

If the class methods dynamically take regular expression like

test = DataExtractor.new(‘file.txt’)
test.apply_regexp(/exp1/)

Doesn’t the loop need to be reactivated for each method like this?

Yes, if the regex isn’t necessarily known beforehand you’ll have to do
the
loop again once it is known. Note that, as Marcus pointed out, you’re
reading the file into memory; if you are happy trading IO delays for
permanent memory usage, you could store the file contents in an instance
variable (either in the constructor or lazily) and then perform your
variable pattern match on the stored file contents. It would still need
to
perform the loop many times, but would save a lot of time in the #open
and
#read phases.

There are other caching techniques, as well, if you think a particular
pattern is likely to be tested multiple times. Probably a bit advanced
for
this discussion, though.


Matthew K., B.Sc (CompSci) (Hons)
http://matthew.kerwin.net.au/
ABN: 59-013-727-651

“You’ll never find a programming language that frees
you from the burden of clarifying your ideas.” - xkcd

On Sun, Feb 3, 2013 at 8:03 PM, Soichi I. [email protected]
wrote:

In this case, the loop for all lines must be implemented, am I correct?

If the class methods dynamically take regular expression like

test = DataExtractor.new(‘file.txt’)
test.apply_regexp(/exp1/)

Doesn’t the loop need to be reactivated for each method like this?

This last bit would be true in most cases, but you’d be looping
through the lines of the file in memory rather than re-reading it.

This sounds like a great case for using an Enumerable mixin:

class Dataanalysis
include Enumerable
attr_reader :lines
attr_reader :emptylines

def initialize(filename)
@filename = filename
@content = File.readlines(@filename)
@lines = @content.size
@emptylines = @content.grep(/^\s*$/).size
end

def each(&b)
@content.each do |line|
if block_given?
b.call line
else
yield line
end
end
end

end

Then you can do fun things like:

1.9.3-head :034 > f = Dataanalysis.new(‘dataanalysis.rb’)
=> #<Dataanalysis:0xa3d6210 @filename=“dataanalysis.rb”,
@content=[“class Dataanalysis\n”, " include Enumerable\n", "
attr_reader :lines\n", " attr_reader :emptylines\n", “\n”, " def
initialize(filename)\n", " @filename = filename\n", " @content =
File.readlines(@filename)\n", " @lines = @content.size\n", "
@emptylines = @content.grep(/^\s*$/).size\n", " end\n", “\n”, " def
each(&b)\n", " @content.each do |line|\n", " if
block_given?\n", " b.call line\n", " else\n", "
yield line\n", " end\n", " end\n", " end\n", “\n”, “end\n”],
@lines=23, @emptylines=3>
1.9.3-head :035 > f.lines
=> 23
1.9.3-head :036 > f.emptylines
=> 3

1.9.3-head :040 > f.each{|l| puts l.gsub(/\s+/,'')}
class
Dataanalysis*
includeEnumerable*
attr_reader:lines*
attr_reader:emptylines*
*
definitialize(filename)*
@filename=filename
@content=File.readlines(@filename)
@lines=@content.size
@emptylines=@content.grep(/^\s$/).size*
end
*
defeach(&b)*
@content.eachdo*|line|*
ifblock_given?*
b.callline*
else
yieldline*
end
end
end
*
end*

1.9.3-head :041 > f.map{|l| l.gsub(/\s+/,'')}
=> ["class
Dataanalysis*", “includeEnumerable*”,
attr_reader:lines*”, “attr_reader:emptylines*”, “",
"definitialize(filename)
”, “@filename=filename”,
@content=File.readlines(@filename)”, “@lines=@content.size”,
@emptylines=@content.grep(/^\s$/).size*”, “end”, “",
"defeach(&b)
”, “@content.eachdo*|line|", "ifblock_given?”,
b.callline*”, “else”, “yieldline*”, “end”, “end”, “end”,
", "end”]

and so on.

All this said, there is still a caveat: the size of the file under
analysis. If it’s something smallish to medium, no problem. If it’s a
4G log file – then you’ll likely have some issues.