Parsing a file with look ahead

robertjames · February 22, 2007, 1:36am

I need to parse a file line by line, and output the results line by
line (too big to fit into memory). So far, simple enough:
file.each_line.

However, the parser needs the ability to peek ahead to the next line,
in order to parse this line. What’s the right way to do this? Again,
I really don’t want to try to slurp the whole file into memory and
split on newlines.

Here’s an example:
Line1: Hi
Line2: How
Line3: Are
Line4: you?

I’d like to:
parse('Hi', 'How')
parse('How', 'Are')
parse('Are', 'you?')
parse('you?', false)

hey, this is practically a unit test!

Any ideas?

robertjames · February 22, 2007, 1:38am

The first thing I can think of is to use file.each_line and store each
line in a previous_line variable at the end of the block. Then you have
access to both the line that was read beforehand and the current one.
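
A minimal sketch of that idea, assuming the parser can run one line behind the reader. The parse method here is a hypothetical stand-in that just records its arguments, and StringIO stands in for the real file:

```ruby
require 'stringio'

# Hypothetical stand-in for the real parser: records each call.
def parse(line, next_line, out)
  out << [line, next_line]
end

# Parse each line one step late, once its successor is known,
# then flush the final line with false as its "next" line.
def parse_with_previous(io, out = [])
  previous = nil
  io.each_line do |raw|
    line = raw.chomp
    parse(previous, line, out) unless previous.nil?
    previous = line
  end
  parse(previous, false, out) unless previous.nil?
  out
end

calls = parse_with_previous(StringIO.new("Hi\nHow\nAre\nyou?\n"))
calls.each { |pair| p pair }
```

Only two lines are referenced at any time, so memory use does not grow with the size of the file.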

robertjames · February 22, 2007, 1:57am

S. Robert J. wrote:

require 'enumerator'

File.new(filename).enum_slice(2).each do |first, second|
  p [ first, second ? second : false ]
end

robertjames · February 22, 2007, 2:40am

Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
no?

robertjames · March 1, 2007, 2:42am

Gregory B. wrote:

> On 2/21/07, S. Robert J. wrote:
>
>> Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
>> no?
>
> If you are dealing with paired lines, use enum_slice(2)
>
> if you are dealing with data dependent on the current and previous
> line, use each_cons, yes.

Except each_cons(2) will iterate only 9 times if you have 10 lines, so
the last line never shows up as the current line.

Maybe something simple like this?

line = f.gets
while line
  nextline = f.gets
  # do stuff…
  line = nextline
end

Daniel
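
The gets loop above could be wrapped in a small helper so the calling code only sees (line, next line) pairs. This is a sketch, not part of the original post: each_with_next is a hypothetical name, and StringIO stands in for the real file.

```ruby
require 'stringio'

# Hypothetical helper: yields each line together with the line that
# follows it, or false when there is no following line.
def each_with_next(io)
  line = io.gets
  while line
    next_line = io.gets
    yield line.chomp, (next_line ? next_line.chomp : false)
    line = next_line
  end
end

pairs = []
each_with_next(StringIO.new("Hi\nHow\nAre\nyou?\n")) do |current, following|
  pairs << [current, following]
end
p pairs
```

With a real file the same helper would be called inside File.open(filename) { |f| ... }, which also closes the file when the block ends.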

robertjames · February 22, 2007, 3:00am

On 2/21/07, S. Robert J. wrote:

> Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
> no?

If you are dealing with paired lines, use enum_slice(2).

If you are dealing with data that depends on the current and previous
line, use each_cons, yes.
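
To make the difference concrete, here is a small comparison on the four example lines. It assumes a current Ruby, where each_slice and each_cons are core Enumerable methods and no require is needed; the false sentinel is an illustration added here, not something from the posts above.

```ruby
lines = %w[Hi How Are you?]

# Non-overlapping pairs: each line appears exactly once.
slices = lines.each_slice(2).to_a

# Sliding window: every line is paired with its successor,
# but the last line never appears in the first position.
conses = lines.each_cons(2).to_a

# Appending a sentinel gives the last line a partner too.
padded = (lines + [false]).each_cons(2).to_a

p slices
p conses
p padded
```

The array is only for demonstration. On an IO object the same methods can be called on each_line, which reads lines as they are requested rather than loading the whole file.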

robertjames · March 4, 2007, 3:45am

S. Robert J. wrote:

> I need to parse a file line by line, and output the results line by
> line (too big to fit into memory). So far, simple enough:
> file.each_line.
>
> However, the parser needs the ability to peek ahead to the next line,
> in order to parse this line. What's the right way to do this? Again,
> I really don't want to try to slurp the whole file into memory and
> split on newlines.

Sounds to me like it could be solved elegantly with a lazy stream of
input lines. For lazy streams see the Usenet thread starting with
article [email protected], for instance.

The file will be split into lines, but lazily, so the lines don't all
need to be held in memory at the same time. Old, i.e. already consumed,
lines will be garbage collected soon, because the application no longer
references them. You can have as many lookahead lines as you want
(tradeoff: more lookahead needs more memory, of course).

Regards
Thomas
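
One way to approximate such a lazy stream without any extra library is an external Enumerator plus a small buffer. This is only a sketch under that assumption, not the stream implementation from the Usenet thread mentioned above; each_with_lookahead and its depth parameter are hypothetical names, and StringIO stands in for the real file.

```ruby
require 'stringio'

# Hypothetical helper: yields each line plus an array of up to `depth`
# following lines. Lines are pulled from the source only when needed.
def each_with_lookahead(io, depth = 1)
  source = io.each_line   # external enumerator; reads on demand
  buffer = []
  exhausted = false
  loop do
    # Top up the buffer to the current line plus `depth` lookahead lines.
    until exhausted || buffer.size > depth
      begin
        buffer << source.next.chomp
      rescue StopIteration
        exhausted = true
      end
    end
    break if buffer.empty?
    current = buffer.shift
    yield current, buffer.dup
  end
end

seen = []
each_with_lookahead(StringIO.new("Hi\nHow\nAre\nyou?\n"), 2) do |line, ahead|
  seen << [line, ahead]
end
seen.each { |entry| p entry }
```

At most depth + 1 lines are buffered at once, which is the memory tradeoff described above. Consumed lines are dropped from the buffer and become eligible for garbage collection.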