IO#Foreach -- Max line length

I’m trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?

Tristin D. wrote:

I’m trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end

But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.

7stud – wrote:

Tristin D. wrote:

I’m trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?

max = 3
count = 0

IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end

  puts line
end

Hi,

On Fri, Mar 7, 2008 at 9:37 AM, 7stud – [email protected] wrote:

puts line
end

Not quite the solution. This reads a number of lines, as opposed to
limiting the length of a single line read.

Arlen

Tristin D. wrote:

But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.

Is this what you are looking for:

max_bytes = 30
text = IO.read('data.txt', max_bytes)
puts text
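As an aside, IO.read also accepts a byte offset in addition to a length, so successive fixed-size chunks can be pulled without an explicit File.open. A minimal sketch ('chunks.txt' is a throwaway file created just for the demo):

```ruby
# IO.read(name, length, offset) reads length bytes starting at offset.
# 'chunks.txt' is a placeholder file written here for illustration.
File.open('chunks.txt', 'w') { |f| f.write('hello world') }

chunk1 = IO.read('chunks.txt', 5)      # first 5 bytes  => "hello"
chunk2 = IO.read('chunks.txt', 5, 5)   # next 5 bytes, from offset 5 => " worl"
p chunk1, chunk2
```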

On Behalf Of Tristin D.:

But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.

read will accept arg on how many bytes to read.

so how about,

irb(main):040:0> File.open "test.rb" do |f| f.read end
=> "a=(1..2)\n\na\nputs a\n\nputs a.each{|x| puts x}"

irb(main):041:0> File.open "test.rb" do |f| f.read 2 end
=> "a="

irb(main):042:0> File.open "test.rb" do |f| f.read 2; f.read 2 end
=> "(1"

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
"a="
"(1"
".."
"2)"
"\n\n"
"a\n"
"pu"
"ts"
" a"
"\n\n"
"pu"
"ts"
" a"
".e"
"ac"
"h{"
"|x"
"| "
"pu"
"ts"
" x"
"}"
=> nil

kind regards -botp

On 3/6/08, Peña, Botp [email protected] wrote:

On Behalf Of Tristin D.:

But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.

read will accept arg on how many bytes to read.

so how about,

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end

That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:


class IO
  # read by characters
  def for_eachA(linelen)
    c = 0
    while (c)
      buf = ''
      linelen.times {
        break unless c = getc
        buf << c
        break if c.chr == $/
      }
      yield buf
    end
  end

  # read by lines
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf = ''
    while (line = read(linelen - buf.length))
      buf = (buf + line).gsub(re) { |l| yield l; '' }
      if buf.length == linelen
        yield buf
        buf = ''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
  f.for_eachA(10) { |l| p l }
end

File.open("foreach.rb") do |f|
  f.for_eachB(10) { |l| p l }
end

I’d guess the second version would be faster, but I didn’t time it.

-Adam
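The guess about speed could be checked with a quick, self-contained timing sketch (the data here is made up: 1,000 lines of 100 bytes, with StringIO standing in for a file):

```ruby
require 'benchmark'
require 'stringio'

# Minimal sketch comparing the two styles Adam describes: walking the
# input one character at a time versus reading fixed-size chunks.
# The input data is fabricated for the demo.
data = ("x" * 99 + "\n") * 1_000

Benchmark.bm(14) do |bm|
  bm.report("char-by-char") do
    io = StringIO.new(data)
    while io.getc; end        # touch every character, like for_eachA
  end
  bm.report("chunked read") do
    io = StringIO.new(data)
    while io.read(4096); end  # pull 4 KB blocks, like for_eachB
  end
end
```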

Thanks for the ideas, Adam. I thought someone might be able to use it, so
I figured I’d post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance, then have at it. :)

module Util
  def too_large?(buffer, max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf = ''
record = 1
frequency = 100

f = File.open(file, 'r')

while c = f.getc
  buf << c

  if too_large?(buf, 102400)
    p "record #{record} is too long, skipping to end"
    while (x = f.getc)
      if x.chr == $/
        buf = ''
        record += 1
        p "At record #{record}" if (record % frequency) == 0
        break
      end
    end
  end

  if c.chr == $/
    record += 1
    print "At record #{record}" if (record % frequency) == 0
    buf = ''
  end
end

# If we still have something in the buffer, it is probably the last record.
unless buf.empty?
  #record += 1
  p "Last record is: " + buf
end

f.close
p record

Adam S. wrote:

On 3/6/08, Peña, Botp [email protected] wrote:

On Behalf Of Tristin D.:

But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.

read will accept arg on how many bytes to read.

so how about,

irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end

That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:


class IO
  # read by characters
  def for_eachA(linelen)
    c = 0
    while (c)
      buf = ''
      linelen.times {
        break unless c = getc
        buf << c
        break if c.chr == $/
      }
      yield buf
    end
  end

  # read by lines
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf = ''
    while (line = read(linelen - buf.length))
      buf = (buf + line).gsub(re) { |l| yield l; '' }
      if buf.length == linelen
        yield buf
        buf = ''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
  f.for_eachA(10) { |l| p l }
end

File.open("foreach.rb") do |f|
  f.for_eachB(10) { |l| p l }
end

I’d guess the second version would be faster, but I didn’t time it.

-Adam

That’s what the second if statement is for: catching the delimiter when
the buffer isn’t too large. I can’t use gets because I may exhaust all
the memory before the actual line is read. I’m reading variable-length
records, but some of them are bad data and exceed a max length of 100k.
That’s what the script is scanning for. :)

7stud – wrote:

Tristin D. wrote:

Thanks for the ideas, Adam. I thought someone might be able to use it, so
I figured I’d post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance, then have at it. :)

module Util
  def too_large?(buffer, max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf = ''
record = 1
frequency = 100

f = File.open(file, 'r')

while c = f.getc
  if buf.length < max #(but what if you find a '\n' before max?)
    buf << c
  else
    buf = ''
    f.gets
  end

Tristin D. wrote:

That’s what the second if statement is for: catching the delimiter when
the buffer isn’t too large. I can’t use gets because I may exhaust all
the memory before the actual line is read.

Look: a string and a file are really no different, except that reading
from a file is slow. Therefore, to speed things up, read the maximum
amount every time you read from the file and store it in a string.
Process the string just as you would the file. Then read from the file
again.
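That buffered approach can be sketched like this (the method name each_limited_line and the chunk size are my own invention, not an existing API): pull a big block into a string buffer, carve lines out of the string, refill only when the buffer runs dry, and cap every yielded piece at max_len bytes.

```ruby
require 'stringio'

# Sketch of the buffered approach: read chunk-sized blocks into a string
# and slice lines out of it, capping each yielded piece at max_len bytes.
# Method name and chunk size are illustrative, not a standard API.
def each_limited_line(io, max_len, chunk = 65_536)
  buf = ''
  loop do
    nl = buf.index("\n")
    if nl && nl < max_len
      yield buf.slice!(0, nl + 1)   # a whole line fits under the cap
    elsif buf.length >= max_len
      yield buf.slice!(0, max_len)  # over-long line: emit max_len bytes
    else
      more = io.read(chunk)         # refill the buffer from the source
      break if more.nil?
      buf << more
    end
  end
  yield buf unless buf.empty?       # trailing data with no newline
end

# StringIO stands in for a file here.
io = StringIO.new("short\n" + "x" * 25 + "\nlast")
each_limited_line(io, 10) { |piece| p piece }
```

Because the getc-per-character loop is replaced by one read per 64 KB block, the slow file I/O happens far less often, which is the point of the suggestion above.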

Tristin D. wrote:

Thanks for the ideas, Adam. I thought someone might be able to use it, so
I figured I’d post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance, then have at it. :)

module Util
  def too_large?(buffer, max=10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift #"C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf = ''
record = 1
frequency = 100

f = File.open(file, 'r')

while c = f.getc
  if buf.length < max #(but what if you find a '\n' before max?)
    buf << c
  else
    buf = ''
    f.gets
  end

Gotcha, I’ll post the code once I revamp. ;)

7stud – wrote:

Tristin D. wrote:

That’s what the second if statement is for: catching the delimiter when
the buffer isn’t too large. I can’t use gets because I may exhaust all
the memory before the actual line is read.

Look: a string and a file are really no different, except that reading
from a file is slow. Therefore, to speed things up, read the maximum
amount every time you read from the file and store it in a string.
Process the string just as you would the file. Then read from the file
again.

Here are the benchmarks for the old and new code:
Old: 5.484000 0.031000 5.515000 ( 5.782000)
New: 5.094000 0.047000 5.141000 ( 5.407000)


module DataVerifier
  require 'strscan'

  def too_large?(buffer, max=1024)
    return true if buffer.length >= max
    false
  end

  def verify_vbl(file, frequency, max, delimiter, out, cache_size)
    $/ = delimiter

    buf = ''
    record = 1
    o = File.new(out, "w")
    f = File.open(file, 'r')

    # Read the file in cache_size-byte chunks and scan each chunk in memory.
    while (buffer = f.read(cache_size))
      cache = StringScanner.new(buffer)

      while (c = cache.getch)
        buf << c

        if too_large?(buf, max)
          o.print "record #{record} is too long, skipping to end\n"
          while (x = cache.getch)
            if x == $/
              buf = ''
              record += 1
              print "At record #{record}\n" if frequency && (record % frequency) == 0
              break
            end
          end
        end

        if c == $/
          record += 1
          print "At record #{record}\n" if frequency && (record % frequency) == 0
          buf = ''
        end
      end
    end
    f.close
    o.close
    record
  end
end