NOT reading an entire file into memory

I am trying to write a parser for a text-based file format. Files in
this format frequently become very large. While the specification
specifically allows applications to crash on large files, I know
several people who have taken to editing these files by hand in
Notepad or other basic text editors. This format is not at all
friendly for this type of editing, and it is extremely tedious work,
but their programs all crash due to the size of these files.
I had been using File.readline and saving a lot of temporary files via
tempfile.rb
(http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/index.html).
However, I have heard that File.readline is in fact equivalent to
File.read.split("\n").each, which would defeat my purpose of not
loading the whole file. I’d really like to keep this in Ruby, as I
want to package the whole thing via the wonderful rubyscript2exe, as
well as, of course, a standard rubygem.
What I really want to know is: is there a way to read lines
4 through 7 without reading the whole file?
My current method has made the program not nearly as beautiful as Ruby
ought to be.

Quoth Devi Web D.:

I am trying to write a parser for a text-based file format. Files in
this format frequently become very large. While the specification
specifically allows applications to crash on large files, I know
several people who have taken to editing these files by hand in
Notepad or other basic text editors. This format is not at all
friendly for this type of editing, and it is extremely tedious work,
but their programs all crash due to the size of these files.
What I really want to know is:
I had been using File.readline and saving a lot of temporary files via
tempfile.rb
(http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/index.html).

Daniel Brumbaugh K.
Devi Web D.

f = File.open("myfile")

# skip through 3rd line
3.times { f.readline }

# read lines 4 through 7
lines = Array.new(4) { f.readline }

f.close

Devi Web D. wrote:

I have heard that File.readline is in fact equivalent to
File.read.split("\n").each, which would really ruin my purpose of not
loading the whole file.

I doubt that is true, but as is often the case with Ruby there is no
easily locatable documentation that describes File I/O buffering. Just
in case, here is another solution:

# create a data file containing:
# line 1
# line 2
# ...
# line 10

File.open("data.txt", "w") do |file|
  10.times do |i|
    file.puts("line #{i + 1}")
  end
end

# read lines 4-7 and display them:
File.open("data.txt") do |file|
  file.each_with_index do |line, i|
    i = i + 1 # i starts at 0

    if i < 4
      next
    elsif i < 8
      puts line
    else
      break
    end
  end
end

Quoth 7stud --:

# create a data file containing:
[...]
    else
      break
    end
  end
end

IO#each_with_index and IO#readline are probably the same internally, so the
real answer here is that NO, IO#readline is NOT the same as
File.read.split("\n"); that’s IO#readlines.
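
To make the readline/readlines distinction concrete, here is a minimal sketch. The temporary file and the variable names are made up for illustration; only IO#readline and IO#readlines come from the discussion above.

```ruby
require "tempfile"

# Hypothetical three-line sample file, created only so the sketch runs as-is.
sample = Tempfile.new("readline_demo")
sample.puts("line 1", "line 2", "line 3")
sample.close

# IO#readline returns a single String: the next line only.
one_line = File.open(sample.path) { |f| f.readline }

# IO#readlines returns an Array holding every line of the file.
every_line = File.open(sample.path) { |f| f.readlines }

p one_line          # a String
p every_line.length # number of lines in the whole file
```

So it is readlines, not readline, that you want to avoid on very large files.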

Konrad M. wrote:

IO#each_with_index and IO#readline are probably the same internally, so the
real answer here is that NO, IO#readline is NOT the same as
File.read.split("\n"); that’s IO#readlines.

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

--output:--
line 4
line 5
line 6
line 7

Quoth 7stud --:

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Performance isn’t everything. If it was, you wouldn’t be using Ruby. The idea
is that this will work “well enough”, shouldn’t take too much thought on the
programmer’s behalf, and doesn’t load the entire (huge) file into RAM.

On 28.10.2007 08:29, 7stud -- wrote:

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Ruby does buffering but will not read the whole file unless asked to do
so.

There are several ways to access only lines 4 through 7. For example:

# 1

require 'enumerator' # only needed before Ruby 1.9
File.to_enum(:foreach, "foo.dat").each_with_index do |line, idx|
  case idx
  when 0...3
    # ignore
  when 3...7
    puts line
  else
    break # or return or exit
  end
end

# 2

File.open("foo.dat") do |io|
  io.each do |line|
    case io.lineno # 1-based line number
    when 1...4
      # ignore
    when 4..7 # inclusive range, so line 7 is printed too
      puts line
    else
      break
    end
  end
end

# 3

File.foreach "foo.dat" do |line|
  case $. # line number of the last line read, 1-based
  when 1...4
    # ignore
  when 4..7 # inclusive range, so line 7 is printed too
    puts line
  else
    break
  end
end
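
If you need this in more than one place, the same idea can be wrapped in a small method. This is only a sketch: read_line_range and the temporary demo file are hypothetical names made up here, and it assumes Ruby 1.9 or later, where File.foreach without a block returns an Enumerator.

```ruby
require "tempfile"

# Hypothetical helper: returns lines first..last (1-based, inclusive) and
# stops reading as soon as it has gone past line `last`.
def read_line_range(path, first, last)
  selected = []
  File.foreach(path).with_index(1) do |line, number|
    break if number > last            # nothing beyond this line is needed
    selected << line if number >= first
  end
  selected
end

# Demo data: a ten-line temporary file, created only for this sketch.
data = Tempfile.new("line_range_demo")
(1..10).each { |i| data.puts("line #{i}") }
data.close

puts read_line_range(data.path, 4, 7)
```

Only the selected lines are kept in the array; the lines before them are read and discarded one at a time.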

Kind regards

robert

On Sun, 28 Oct 2007 16:29:47 +0900, 7stud -- wrote:

The real question is: does readline do any buffering?

It must. There’s no POSIX call that can read until the end of a line, so
you have to read(2) a bunch of data, look for a newline, and if there’s
no newline in it you have to read more. If there is a newline in it, then
you have to buffer everything you read that comes after the newline.
That’s life with POSIX.

The standard C library has fgets(3), which can find a newline, but it
probably does its own buffering internally, for the same reasons that
other POSIX apps would.

Ruby uses fread(3), the C library’s equivalent of read(2), so Ruby has to
do its own buffering.
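
The read-a-chunk, look-for-a-newline, keep-the-leftovers loop described above can be sketched in Ruby itself. To be clear, ChunkedLineReader is a hypothetical class written for illustration; it is not how Ruby's io.c is implemented, and the tiny chunk size is chosen only to make the buffering visible.

```ruby
require "stringio"

# Hypothetical illustration of line reading on top of fixed-size reads.
class ChunkedLineReader
  CHUNK_SIZE = 8 # deliberately tiny; a real buffer would be far larger

  def initialize(io)
    @io = io
    @buffer = +"" # data already read but not yet handed out
  end

  # Returns the next line (including its newline), or nil at end of input.
  def next_line
    until (newline_at = @buffer.index("\n"))
      chunk = @io.read(CHUNK_SIZE)
      break if chunk.nil?   # end of input: no more data to scan
      @buffer << chunk      # no newline yet, so read more
    end
    return nil if @buffer.empty?

    # Hand out everything up to the newline; whatever was read past it
    # stays in the buffer for the next call.
    line_end = newline_at ? newline_at + 1 : @buffer.length
    @buffer.slice!(0, line_end)
  end
end

reader = ChunkedLineReader.new(StringIO.new("first\nsecond line\nlast"))
lines = []
while (line = reader.next_line)
  lines << line
end
p lines
```

At no point does the buffer hold more than one line plus at most one chunk of read-ahead, which is the whole point.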

What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

rb_io_each_line implements IO#each_line and IO#each. It boils down to a
loop:

while (!NIL_P(str = rb_io_getline(rs, io))) {
    rb_yield(str);
}

and rb_io_getline reads only as much as it feels is necessary to find
that newline. It doesn’t put the whole file in memory at once.
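
One way to convince yourself of this from Ruby, without reading io.c, is to run each_line over a pipe whose writing end is still open. This is a small sketch with made-up variable names: if each_line had to slurp all of the input first, it would wait forever for end-of-file, but instead it yields each line as soon as its newline has arrived.

```ruby
reader, writer = IO.pipe

# Two lines are written, but the writer stays open, so the reader
# cannot see end-of-file yet.
writer.puts("line 1", "line 2")
writer.flush

received = []
reader.each_line do |line|
  received << line
  break if received.size == 2 # stop before asking for a third line
end

writer.close
reader.close

p received
```

The break matters here: asking for a third line would block until the writer sends more data or closes the pipe.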

--Ken