I’m trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?
Tristin D. wrote:
I’m trying to emulate the new feature in 1.9 that allows you to specify
the maximum length of a line read in Ruby 1.8.6. Can anyone help?
max = 3
count = 0
IO.foreach('data.txt') do |line|
  if count == max
    break
  else
    count += 1
  end
  puts line
end
But by the time you actually check the count, isn’t the line already read
into memory? So if the line is 7 gigabytes, it’ll probably crash the system.
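(The concern is easy to demonstrate: gets and IO.foreach buffer everything up to the next separator, so a single long line is held in memory whole. A small illustration using StringIO so it runs without a file:)

```ruby
require 'stringio'

# gets reads until the next separator, so the whole line -- however
# long -- ends up in one Ruby string.
io = StringIO.new("x" * 1_000 + "\n")
line = io.gets
p line.length  # 1001 -- the full line, newline included
```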
Hi,
On Fri, Mar 7, 2008 at 9:37 AM, 7stud – [email protected] wrote:
puts line
end
Not quite the solution. This reads a number of lines, as opposed to
limiting the length of a single line read.
Arlen
Tristin D. wrote:
But by the time you actually get count, isn’t the line already read in
memory. So if the line is 7 gigabytes, it’ll probably crash the system.
Is this what you are looking for:
max_bytes = 30
text = IO.read('data.txt', max_bytes)
puts text
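Worth noting: IO.read with a length argument limits the total bytes read from the start of the file, regardless of newlines, rather than capping each line. A quick sketch (the temp file is just for illustration):

```ruby
require 'tempfile'

Tempfile.create('data') do |tmp|
  tmp.write("abcdefghij\nsecond line\n")
  tmp.flush

  max_bytes = 5
  # Returns the first 5 bytes of the file, ignoring line boundaries.
  p IO.read(tmp.path, max_bytes)  # "abcde"
end
```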
On Behalf Of Tristin D.:
But by the time you actually get count, isn’t the line
already read in
memory. So if the line is 7 gigabytes, it’ll probably crash
the system.
read will accept an arg for how many bytes to read, so how about:
irb(main):040:0> File.open "test.rb" do |f| f.read end
=> "a=(1..2)\n\na\nputs a\n\nputs a.each{|x| puts x}"
irb(main):041:0> File.open "test.rb" do |f| f.read 2 end
=> "a="
irb(main):042:0> File.open "test.rb" do |f| f.read 2; f.read 2 end
=> "(1"
irb(main):043:0> File.open "test.rb" do |f| while x=f.read(2); p x; end; end
"a="
"(1"
".."
"2)"
"\n\n"
"a\n"
"pu"
"ts"
" a"
"\n\n"
"pu"
"ts"
" a"
".e"
"ac"
"h{"
"|x"
"| "
"pu"
"ts"
" x"
"}"
=> nil
kind regards -botp
On 3/6/08, Peña, Botp [email protected] wrote:
On Behalf Of Tristin D.:
But by the time you actually get count, isn’t the line
already read in
memory. So if the line is 7 gigabytes, it’ll probably crash
the system.
read will accept arg on how many bytes to read.
so how about,
…
irb(main):043:0> File.open “test.rb” do |f| while x=f.read(2); p x; end; end
That solution essentially ignores linebreaks.
If you want to read up to a linebreak or N characters, whichever comes
first, you could use one of these:
class IO
  # read by characters
  def for_eachA(linelen)
    c = 0
    while c
      buf = ''
      linelen.times {
        break unless c = getc
        buf << c
        break if c.chr == $/
      }
      yield buf
    end
  end

  # read by lines
  def for_eachB(linelen)
    re = Regexp.new(".*?#{Regexp.escape($/)}")
    buf = ''
    while (line = read(linelen - buf.length))
      buf = (buf + line).gsub(re) { |l| yield l; '' }
      if buf.length == linelen
        yield buf
        buf = ''
      end
    end
    yield buf
  end
end

File.open("foreach.rb") do |f|
  f.for_eachA(10) { |l| p l }
end

File.open("foreach.rb") do |f|
  f.for_eachB(10) { |l| p l }
end
I’d guess the second version would be faster, but I didn’t time it.
-Adam
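For reference, the 1.9 feature being emulated here is the optional limit argument to gets/each_line: at most limit bytes are returned, stopping early at the separator. StringIO shows the behavior without needing a file:

```ruby
require 'stringio'

io = StringIO.new("short\nthis line is rather long\n")

# With a limit, gets stops at the newline OR the byte limit,
# whichever comes first (Ruby 1.9+).
p io.gets(10)  # "short\n"     -- newline came first
p io.gets(10)  # "this line "  -- cut off after 10 bytes
```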
Thanks for the ideas, Adam. I thought someone might be able to use it, so
I figured I’d post it. It processed about 675,000 1100+ byte records in
an hour. Not fantastic performance, but it works. If someone can tell
me how to improve the performance, then have at it.
module Util
  def too_large?(buffer, max = 10)
    return true if buffer.length >= max
    false
  end
end

include Util

file = ARGV.shift # "C:/Documents and Settings/trdavi/Desktop/a1-1k.aa"
buf = ''
record = 1
frequency = 100

f = File.open(file, 'r')
while c = f.getc
  buf << c
  if too_large?(buf, 102400)
    p "record #{record} is too long, skipping to end"
    while x = f.getc
      if x.chr == $/
        buf = ''
        record += 1
        p "At record #{record}" if (record % frequency) == 0
        break
      end
    end
  end
  if c.chr == $/
    record += 1
    print "At record #{record}" if (record % frequency) == 0
    buf = ''
  end
end

# If we still have something in the buffer, then it is probably the last
# record.
unless buf.empty?
  #record += 1
  p "Last record is:" + buf
end
f.close
p record
Adam S. wrote:
That solution essentially ignores linebreaks. If you want to read up to
a linebreak or N characters, whichever comes first, you could use one of
these: [quoted code snipped]
That’s what the 2nd if statement is for; it catches the delimiter when
the buffer isn’t too large. I can’t use gets because I may exhaust all
the memory before the actual line is read. I’m reading variable-length
records, but some of them are bad data and exceed a max length of 100k.
That’s what the script is scanning for.
7stud – wrote:
Tristin D. wrote:
If someone can tell me how to improve the performance then have at it.
[code snipped]
while c = f.getc
  if buf.length < max # (but what if you find a '\n' before max?)
    buf << c
  else
    buf = ''
    f.gets
  end
end
Tristin D. wrote:
That’s what the 2nd if statement is; for catching the delimiter if the
buffer isn’t too large. I can’t use gets b/c I may expend all the
memory before the actual line is read.
Look. A string and a file are really no different, except that reading
from a file is slow. Therefore, to speed things up, read the maximum
amount every time you read from the file and store it in a string.
Process the string just like you would the file. Then read from the
file again.
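A sketch of what 7stud is describing -- read big chunks, then split them in memory. The name each_limited_line and the chunk size are made up for illustration; it yields each record truncated to max bytes and skips the remainder of oversized records:

```ruby
require 'stringio'

# Hypothetical helper: yields each newline-terminated record, truncated
# to at most max bytes, reading the IO in large chunks instead of getc.
def each_limited_line(io, max, chunk_size = 4096)
  buf = ''
  skipping = false  # true while discarding the tail of an oversized record
  while (chunk = io.read(chunk_size))
    until chunk.empty?
      if (nl = chunk.index("\n"))
        piece = chunk[0..nl]          # up to and including the newline
        chunk = chunk[(nl + 1)..-1]
        ended = true
      else
        piece, chunk = chunk, ''
        ended = false
      end
      if skipping
        skipping = !ended
        next
      end
      buf << piece
      if ended
        yield buf
        buf = ''
      elsif buf.length >= max
        yield buf[0, max]  # emit the truncated prefix of the big record
        buf = ''
        skipping = true    # drop the rest until the next newline
      end
    end
  end
  yield buf unless buf.empty?
end

io = StringIO.new("abc\n" + "x" * 10 + "\nde\n")
each_limited_line(io, 5, 4) { |rec| p rec }
# "abc\n", then "xxxxx" (truncated), then "de\n"
```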
Gotcha, I’ll post the code once I revamp it.
Here are the benchmarks for the old and new code:
Old: 5.484000 0.031000 5.515000 ( 5.782000)
New: 5.094000 0.047000 5.141000 ( 5.407000)
module DataVerifier
  require 'strscan'

  def too_large?(buffer, max = 1024)
    return true if buffer.length >= max
    false
  end

  def verify_vbl(file, frequency, max, delimiter, out, cache_size)
    $/ = delimiter
    buf = ''
    record = 1
    o = File.new(out, "w")
    f = File.open(file, 'r')
    while buffer = f.read(cache_size)
      cache = StringScanner.new(buffer)
      while c = cache.getch
        buf << c
        if too_large?(buf, max)
          o.print "record #{record} is too long, skipping to end\n"
          while x = cache.getch
            if x == $/
              buf = ''
              record += 1
              print "At record #{record}\n" if !frequency.nil? && (record % frequency) == 0
              break
            end
          end
        end
        if c == $/
          record += 1
          print "At record #{record}\n" if !frequency.nil? && (record % frequency) == 0
          buf = ''
        end
      end
    end
    f.close
    o.close
    record
  end
end
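Regarding further speedups: StringScanner can also consume whole delimited records at once with scan_until, instead of pulling characters out one at a time with getch, which cuts the per-character Ruby overhead. A minimal standalone sketch:

```ruby
require 'strscan'

scanner = StringScanner.new("first\nsecond\n")

# scan_until consumes up to and including the first match of the
# regexp, returning nil once the string is exhausted.
p scanner.scan_until(/\n/)  # "first\n"
p scanner.scan_until(/\n/)  # "second\n"
p scanner.scan_until(/\n/)  # nil
```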