ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1]
I am trying to analyze a file and extract necessary data from it.
First I need to count the total number of lines and the blank lines. the
class and methods follow.
class DataExtractor
attr_reader :file_name
def initialize(file)
@file_name = file
end
def total_lines
@lines = 0
f = File.open(@file_name, 'r')
f.read.each_line do |l|
@lines += 1
end
f.close
@lines
end
def total_blank_lines
regEx = /^[\s]*$\n/
@total_blank_lines = 0
f = File.open(@file_name, 'r')
f.read.each_line do | line |
if line =~ regEx
@total_blank_lines += 1
end
end
f.close
@total_blank_lines
end
....
The problem is that each method has 'File.open...each_line do ' so the
whole loop gets executed separately.
Don't you think this is inefficient? Is there better ways to develop
methods so that the loop is needed only once?
soichi
on 2013-02-03 02:12
on 2013-02-03 07:50
On 3 February 2013 11:13, Soichi Ishida <lists@ruby-forum.com> wrote: > end > > @total_blank_lines > end > .... > > The problem is that each method has 'File.open...each_line do ' so the > whole loop gets executed separately. > > Don't you think this is inefficient? Is there better ways to develop > methods so that the loop is needed only once? > Yep, sure do; and yep, sure is. The fact that you're using @variables inside your methods is already a hint at how to improve things -- have the object remember them. My first approach would be to do all the file analysis in the constructor, sort of like this (using your code; it would look a little different if I wrote it for myself): class DataExtractor attr_reader :file_name, :lines, :total_blank_lines def initialize(file) @file_name = file @lines = 0 @total_blank_lines = 0 regEx = /^[\s]*$\n/ f = File.open(@file_name, 'r') f.read.each_line do | line | if line =~ regEx @total_blank_lines += 1 end @lines += 1 end f.close end end My next approach would be to lazy-initialise the data, because that's just a thing I like to do. It means the lines aren't counted until they're needed, in case that's an improvement. (Again, rough code, untested.) class DataExtractor attr_reader :file_name def initialize(file) @file_name = file @lines = -1 @total_blank_lines = -1 end def lazy_init_lines regEx = /^[\s]*$\n/ f = File.open(@file_name, 'r') f.read.each_line do | line | if line =~ regEx @total_blank_lines += 1 end @lines += 1 end f.close end def lines lazy_init_lines if @lines < 0 @lines end def total_blank_lines lazy_init_lines if @total_blank_lines < 0 @total_blank_lines end end In either case you perform a single open-read-close loop, and can access the totals over and over again. Various further optimisations and improvements are available to be made at your discretion. -- Matthew Kerwin, B.Sc (CompSci) (Hons) http://matthew.kerwin.net.au/ ABN: 59-013-727-651 "You'll never find a programming language that frees you from the burden of clarifying your ideas." - xkcd
on 2013-02-03 09:20
Matthew Kerwin wrote in post #1094954: > > def lazy_init_lines > regEx = /^[\s]*$\n/ > f = File.open(@file_name, 'r') > f.read.each_line do | line | > if line =~ regEx > @total_blank_lines += 1 > end > @lines += 1 > end > f.close > end > > def lines > lazy_init_lines if @lines < 0 > @lines > end > > def total_blank_lines > lazy_init_lines if @total_blank_lines < 0 > @total_blank_lines > end > end > `+1` nice resolution! :)
on 2013-02-03 10:53
Am 03.02.2013 02:13, schrieb Soichi Ishida: > ruby 1.9.3p362 (2012-12-25 revision 38607) [x86_64-darwin12.2.1] > > I am trying to analyze a file and extract necessary data from it. > First I need to count the total number of lines and the blank lines. the > class and methods follow. Since you are reading the complete file into memory anyway (with File.read), you could get the information you need much easier without any loop at all: content = File.readlines('file.dat') content.size content.grep(/^\s*$/).size
on 2013-02-04 03:03
Thanks everyone. They will certainly help.
In addition to these, what if I want to apply specific regular
expressions to each of the lines?
say,
if line =~ /exp1/ then
# do stuff
else
# do stuff
end
In this case, the loop for all lines must be implemented, am I correct?
If the class methods dynamically take regular expression like
test = DataExtractor.new('file.txt')
test.apply_regexp(/exp1/)
Doesn't the loop need to be reactivated for each method like this?
soichi
on 2013-02-04 03:49
On 4 February 2013 12:03, Soichi Ishida <lists@ruby-forum.com> wrote: > # do stuff > end > > In this case, the loop for all lines must be implemented, am I correct? > The previous code already does that, and we showed that it can be done in one loop. > If the class methods dynamically take regular expression like > > test = DataExtractor.new('file.txt') > test.apply_regexp(/exp1/) > > Doesn't the loop need to be reactivated for each method like this? Yes, if the regex isn't necessarily known beforehand you'll have to do the loop again once it is known. Note that, as Marcus pointed out, you're reading the file into memory; if you are happy trading IO delays for permanent memory usage, you could store the file contents in an instance variable (either in the constructor or lazily) and then perform your variable pattern match on the stored file contents. It would still need to perform the loop many times, but would save a lot of time in the #open and #read phases. There are other caching techniques, as well, if you think a particular pattern is likely to be tested multiple times. Probably a bit advanced for this discussion, though. -- Matthew Kerwin, B.Sc (CompSci) (Hons) http://matthew.kerwin.net.au/ ABN: 59-013-727-651 "You'll never find a programming language that frees you from the burden of clarifying your ideas." - xkcd
on 2013-02-04 08:53
for bigger files, this code may be better
lines, empty_lines = 0,0
File.foreach('file.dat') {|l| lines += 1; empty_lines += 1 if l =~
/^\s*$/}
on 2013-02-04 10:13
On Sun, Feb 3, 2013 at 8:03 PM, Soichi Ishida <lists@ruby-forum.com> wrote: > > In this case, the loop for all lines must be implemented, am I correct? > > If the class methods dynamically take regular expression like > > test = DataExtractor.new('file.txt') > test.apply_regexp(/exp1/) > > Doesn't the loop need to be reactivated for each method like this? This last bit would be true in most cases, but you'd be looping through the lines of the file in memory rather than re-reading it. This sounds like a great case for using an Enumerable mixin: class Dataanalysis include Enumerable attr_reader :lines attr_reader :emptylines def initialize(filename) @filename = filename @content = File.readlines(@filename) @lines = @content.size @emptylines = @content.grep(/^\s*$/).size end def each(&b) @content.each do |line| if block_given? b.call line else yield line end end end end Then you can do fun things like: 1.9.3-head :034 > f = Dataanalysis.new('dataanalysis.rb') => #<Dataanalysis:0xa3d6210 @filename="dataanalysis.rb", @content=["class Dataanalysis\n", " include Enumerable\n", " attr_reader :lines\n", " attr_reader :emptylines\n", "\n", " def initialize(filename)\n", " @filename = filename\n", " @content = File.readlines(@filename)\n", " @lines = @content.size\n", " @emptylines = @content.grep(/^\\s*$/).size\n", " end\n", "\n", " def each(&b)\n", " @content.each do |line|\n", " if block_given?\n", " b.call line\n", " else\n", " yield line\n", " end\n", " end\n", " end\n", "\n", "end\n"], @lines=23, @emptylines=3> 1.9.3-head :035 > f.lines => 23 1.9.3-head :036 > f.emptylines => 3 1.9.3-head :040 > f.each{|l| puts l.gsub(/\s+/,'*')} class*Dataanalysis* *include*Enumerable* *attr_reader*:lines* *attr_reader*:emptylines* * *def*initialize(filename)* *@filename*=*filename* *@content*=*File.readlines(@filename)* *@lines*=*@content.size* *@emptylines*=*@content.grep(/^\s*$/).size* *end* * *def*each(&b)* *@content.each*do*|line|* *if*block_given?* *b.call*line* *else* *yield*line* *end* *end* *end* * end* 1.9.3-head :041 > f.map{|l| l.gsub(/\s+/,'*')} => ["class*Dataanalysis*", "*include*Enumerable*", "*attr_reader*:lines*", "*attr_reader*:emptylines*", "*", "*def*initialize(filename)*", "*@filename*=*filename*", "*@content*=*File.readlines(@filename)*", "*@lines*=*@content.size*", "*@emptylines*=*@content.grep(/^\\s*$/).size*", "*end*", "*", "*def*each(&b)*", "*@content.each*do*|line|*", "*if*block_given?*", "*b.call*line*", "*else*", "*yield*line*", "*end*", "*end*", "*end*", "*", "end*"] and so on. All this said, there is still a caveat: the size of the file under analysis. If it's something smallish to medium, no problem. If it's a 4G log file -- then you'll likely have some issues.
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.