Forum: Ruby Class for data analysis: File.open once

Posted by Soichi Ishida (soichi)
on 2013-02-03 02:12
ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1]

I am trying to analyze a file and extract necessary data from it.
First I need to count the total number of lines and the blank lines. the
class and methods follow.

class DataExtractor
    attr_reader :file_name
    def initialize(file)
        @file_name = file
    end

    def total_lines
        @lines = 0
        f = File.open(@file_name, 'r')
        f.read.each_line do |l|
            @lines += 1
        end
        f.close
        @lines
    end

    def total_blank_lines
        regEx = /^[\s]*$\n/
        @total_blank_lines = 0
        f = File.open(@file_name, 'r')
        f.read.each_line do | line |
            if line =~ regEx
                @total_blank_lines += 1
            end
        end
        f.close
        @total_blank_lines
    end
    ....

The problem is that each method has 'File.open...each_line do ' so the
whole loop gets executed separately.

Don't you think this is inefficient?  Is there better ways to develop
methods so that the loop is needed only once?

soichi
Posted by Matthew Kerwin (mattyk)
on 2013-02-03 07:50
(Received via mailing list)
On 3 February 2013 11:13, Soichi Ishida <lists@ruby-forum.com> wrote:

>     end
>
>         @total_blank_lines
>     end
>     ....
>
> The problem is that each method has 'File.open...each_line do ' so the
> whole loop gets executed separately.
>
> Don't you think this is inefficient?  Is there better ways to develop
> methods so that the loop is needed only once?
>

Yep, sure do; and yep, sure is.  The fact that you're using @variables
inside your methods is already a hint at how to improve things -- have 
the
object remember them.  My first approach would be to do all the file
analysis in the constructor, sort of like this (using your code; it 
would
look a little different if I wrote it for myself):

    class DataExtractor
      attr_reader :file_name, :lines, :total_blank_lines
      def initialize(file)
        @file_name = file
        @lines = 0
        @total_blank_lines = 0
        regEx = /^[\s]*$\n/
        f = File.open(@file_name, 'r')
        f.read.each_line do | line |
          if line =~ regEx
            @total_blank_lines += 1
          end
          @lines += 1
        end
        f.close
      end
    end

My next approach would be to lazy-initialise the data, because that's 
just
a thing I like to do.  It means the lines aren't counted until they're
needed, in case that's an improvement. (Again, rough code, untested.)

    class DataExtractor
      attr_reader :file_name

      def initialize(file)
        @file_name = file
        @lines = -1
        @total_blank_lines = -1
      end

      def lazy_init_lines
        regEx = /^[\s]*$\n/
        f = File.open(@file_name, 'r')
        f.read.each_line do | line |
          if line =~ regEx
            @total_blank_lines += 1
          end
          @lines += 1
        end
        f.close
      end

      def lines
        lazy_init_lines if @lines < 0
        @lines
      end

      def total_blank_lines
        lazy_init_lines if @total_blank_lines < 0
        @total_blank_lines
      end
    end

In either case you perform a single open-read-close loop, and can access
the totals over and over again.  Various further optimisations and
improvements are available to be made at your discretion.

--
  Matthew Kerwin, B.Sc (CompSci) (Hons)
  http://matthew.kerwin.net.au/
  ABN: 59-013-727-651

  "You'll never find a programming language that frees
  you from the burden of clarifying your ideas." - xkcd
Posted by Love U Ruby (my-ruby)
on 2013-02-03 09:20
Matthew Kerwin wrote in post #1094954:

>
>       def lazy_init_lines
>         regEx = /^[\s]*$\n/
>         f = File.open(@file_name, 'r')
>         f.read.each_line do | line |
>           if line =~ regEx
>             @total_blank_lines += 1
>           end
>           @lines += 1
>         end
>         f.close
>       end
>
>       def lines
>         lazy_init_lines if @lines < 0
>         @lines
>       end
>
>       def total_blank_lines
>         lazy_init_lines if @total_blank_lines < 0
>         @total_blank_lines
>       end
>     end
>


`+1` nice resolution! :)
Posted by unknown (Guest)
on 2013-02-03 10:53
(Received via mailing list)
Am 03.02.2013 02:13, schrieb Soichi Ishida:
> ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1]
>
> I am trying to analyze a file and extract necessary data from it.
> First I need to count the total number of lines and the blank lines. the
> class and methods follow.

Since you are reading the complete file into memory anyway
(with File.read), you could get the information you need
much easier without any loop at all:

content = File.readlines('file.dat')
content.size
content.grep(/^\s*$/).size
Posted by Soichi Ishida (soichi)
on 2013-02-04 03:03
Thanks everyone. They will certainly help.

In addition to these, what if I want to apply specific regular 
expressions to each of the lines?

say,

if line =~ /exp1/ then
 # do stuff
else
 # do stuff
end

In this case, the loop for all lines must be implemented, am I correct?

If the class methods dynamically take regular expression like

test = DataExtractor.new('file.txt')
test.apply_regexp(/exp1/)

Doesn't the loop need to be reactivated for each method like this?

soichi
Posted by Matthew Kerwin (mattyk)
on 2013-02-04 03:49
(Received via mailing list)
On 4 February 2013 12:03, Soichi Ishida <lists@ruby-forum.com> wrote:

>  # do stuff
> end
>
> In this case, the loop for all lines must be implemented, am I correct?
>

The previous code already does that, and we showed that it can be done 
in
one loop.


> If the class methods dynamically take regular expression like
>
> test = DataExtractor.new('file.txt')
> test.apply_regexp(/exp1/)
>
> Doesn't the loop need to be reactivated for each method like this?


Yes, if the regex isn't necessarily known beforehand you'll have to do 
the
loop again once it is known.  Note that, as Marcus pointed out, you're
reading the file into memory; if you are happy trading IO delays for
permanent memory usage, you could store the file contents in an instance
variable (either in the constructor or lazily) and then perform your
variable pattern match on the stored file contents.  It would still need 
to
perform the loop many times, but would save a lot of time in the #open 
and
#read phases.

There are other caching techniques, as well, if you think a particular
pattern is likely to be tested multiple times.  Probably a bit advanced 
for
this discussion, though.

--
  Matthew Kerwin, B.Sc (CompSci) (Hons)
  http://matthew.kerwin.net.au/
  ABN: 59-013-727-651

  "You'll never find a programming language that frees
  you from the burden of clarifying your ideas." - xkcd
Posted by Soichi Ishida (soichi)
on 2013-02-04 04:00
thanks for your help.
I will try them all!
Posted by Hans Mackowiak (hanmac)
on 2013-02-04 08:53
for bigger files, this code may be better

lines, empty_lines = 0,0
File.foreach('file.dat') {|l| lines += 1; empty_lines += 1 if l =~ 
/^\s*$/}
Posted by tamouse mailing lists (Guest)
on 2013-02-04 10:13
(Received via mailing list)
On Sun, Feb 3, 2013 at 8:03 PM, Soichi Ishida <lists@ruby-forum.com> 
wrote:
>
> In this case, the loop for all lines must be implemented, am I correct?
>
> If the class methods dynamically take regular expression like
>
> test = DataExtractor.new('file.txt')
> test.apply_regexp(/exp1/)
>
> Doesn't the loop need to be reactivated for each method like this?

This last bit would be true in most cases, but you'd be looping
through the lines of the file in memory rather than re-reading it.

This sounds like a great case for using an Enumerable mixin:

class Dataanalysis
  include Enumerable
  attr_reader :lines
  attr_reader :emptylines

  def initialize(filename)
    @filename = filename
    @content = File.readlines(@filename)
    @lines = @content.size
    @emptylines = @content.grep(/^\s*$/).size
  end

  def each(&b)
    @content.each do |line|
      if block_given?
        b.call line
      else
        yield line
      end
    end
  end

end



Then you can do fun things like:

1.9.3-head :034 > f = Dataanalysis.new('dataanalysis.rb')
 => #<Dataanalysis:0xa3d6210 @filename="dataanalysis.rb",
@content=["class Dataanalysis\n", "  include Enumerable\n", "
attr_reader :lines\n", "  attr_reader :emptylines\n", "\n", "  def
initialize(filename)\n", "    @filename = filename\n", "    @content =
File.readlines(@filename)\n", "    @lines = @content.size\n", "
@emptylines = @content.grep(/^\\s*$/).size\n", "  end\n", "\n", "  def
each(&b)\n", "    @content.each do |line|\n", "      if
block_given?\n", "        b.call line\n", "      else\n", "
yield line\n", "      end\n", "    end\n", "  end\n", "\n", "end\n"],
@lines=23, @emptylines=3>
1.9.3-head :035 > f.lines
 => 23
1.9.3-head :036 > f.emptylines
 => 3

1.9.3-head :040 > f.each{|l| puts l.gsub(/\s+/,'*')}
class*Dataanalysis*
*include*Enumerable*
*attr_reader*:lines*
*attr_reader*:emptylines*
*
*def*initialize(filename)*
*@filename*=*filename*
*@content*=*File.readlines(@filename)*
*@lines*=*@content.size*
*@emptylines*=*@content.grep(/^\s*$/).size*
*end*
*
*def*each(&b)*
*@content.each*do*|line|*
*if*block_given?*
*b.call*line*
*else*
*yield*line*
*end*
*end*
*end*
*
end*

1.9.3-head :041 > f.map{|l| l.gsub(/\s+/,'*')}
 => ["class*Dataanalysis*", "*include*Enumerable*",
"*attr_reader*:lines*", "*attr_reader*:emptylines*", "*",
"*def*initialize(filename)*", "*@filename*=*filename*",
"*@content*=*File.readlines(@filename)*", "*@lines*=*@content.size*",
"*@emptylines*=*@content.grep(/^\\s*$/).size*", "*end*", "*",
"*def*each(&b)*", "*@content.each*do*|line|*", "*if*block_given?*",
"*b.call*line*", "*else*", "*yield*line*", "*end*", "*end*", "*end*",
"*", "end*"]

and so on.


All this said, there is still a caveat: the size of the file under
analysis. If it's something smallish to medium, no problem. If it's a
4G log file -- then you'll likely have some issues.
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.