Parsing a file with look ahead


#1

I need to parse a file line by line, and output the results line by
line (too big to fit into memory). So far, simple enough:
file.each_line.

However, the parser needs the ability to peek ahead to the next line,
in order to parse this line. What’s the right way to do this? Again,
I really don’t want to try to slurp the whole file into memory and
split on newlines.

Here’s an example:
Line1: Hi
Line2: How
Line3: Are
Line4: you?

I’d like to:
parse('Hi', 'How')
parse('How', 'Are')
parse('Are', 'you?')
parse('you?', false)

Hey, this is practically a unit test!

Any ideas?


#2

The first thing I can think of is to use file.each_line and store the
current line in a previous_line variable at the end of the block. Then
on each iteration you have access to both the line that was read
beforehand and the current one.
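
A minimal sketch of that idea, assuming the goal is the
parse(current, next) sequence from the first post. Each line is handed
out one iteration late, once its successor is known, and the last line
is flushed after the loop with false as its lookahead. The helper name
each_with_lookahead and the StringIO stand-in for the file are
illustrative, not from the thread.

```ruby
require 'stringio'

# Hypothetical helper: yields each line together with its successor
# (false after the last line), keeping only two lines in memory.
def each_with_lookahead(io)
  previous_line = nil
  io.each_line do |line|
    line = line.chomp
    # The previous line can be processed now that its successor is known.
    yield previous_line, line unless previous_line.nil?
    previous_line = line
  end
  # The final line has no successor.
  yield previous_line, false unless previous_line.nil?
end

io = StringIO.new("Hi\nHow\nAre\nyou?\n")
each_with_lookahead(io) { |current, following| p [current, following] }
# prints:
# ["Hi", "How"]
# ["How", "Are"]
# ["Are", "you?"]
# ["you?", false]
```

With a real file, File.open(filename) { |f| each_with_lookahead(f) { ... } }
should behave the same way, since each_line reads one line at a time.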


#3

S. Robert J. wrote:

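require 'enumerator'  # needed on Ruby 1.8 only

# Yields non-overlapping pairs of lines; on Ruby 1.9+ the method is each_slice(2).
File.new(filename).enum_slice(2).each do |first, second|
  p [ first, second ? second : false ]
end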


#4

Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
no?


#5

Gregory B. wrote:

On 2/21/07, S. Robert J. removed_email_address@domain.invalid wrote:

Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
no?

If you are dealing with paired lines, use enum_slice(2).

If you are dealing with data dependent on the current and previous
line, use each_cons, yes.

Except each_cons(2) will iterate only 9 times if you have 10 lines, so
the last line never shows up as the current line.

Maybe something simple like this?

line = f.gets
while line
  nextline = f.gets
  # do stuff…
  line = nextline
end

Daniel


#6

On 2/21/07, S. Robert J. removed_email_address@domain.invalid wrote:

Thanks! BTW, looking at the Rdoc, it seems each_cons is what I want,
no?

If you are dealing with paired lines, use enum_slice(2).

If you are dealing with data dependent on the current and previous
line, use each_cons, yes.
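
To make the difference concrete, here is a small sketch comparing the
two on the four-line example from the first post. It uses each_slice
and each_cons as they exist in current Ruby (enum_slice is the Ruby 1.8
spelling of an each_slice enumerator). The three helper names and the
StringIO input are illustrative. The third helper is one possible way,
not something proposed in the thread, to get a final pair for the last
line by appending a trailing false before windowing.

```ruby
require 'stringio'

# Non-overlapping pairs: lines (1,2), (3,4), ...
def sliced_pairs(io)
  io.each_line.each_slice(2).map { |pair| pair.map(&:chomp) }
end

# Sliding window: lines (1,2), (2,3), ... (one pair fewer than there are lines).
def sliding_pairs(io)
  io.each_line.each_cons(2).map { |pair| pair.map(&:chomp) }
end

# Sliding window over the lines plus a trailing false, so the last line
# also appears as the first element of a pair.
def sliding_pairs_with_sentinel(io)
  padded = Enumerator.new do |yielder|
    io.each_line { |line| yielder << line.chomp }
    yielder << false
  end
  padded.each_cons(2).to_a
end

text = "Hi\nHow\nAre\nyou?\n"
p sliced_pairs(StringIO.new(text))
# => [["Hi", "How"], ["Are", "you?"]]
p sliding_pairs(StringIO.new(text))
# => [["Hi", "How"], ["How", "Are"], ["Are", "you?"]]
p sliding_pairs_with_sentinel(StringIO.new(text))
# => [["Hi", "How"], ["How", "Are"], ["Are", "you?"], ["you?", false]]
```

The helpers collect everything into an array only to print it. Passing
a block to each_slice or each_cons instead processes one pair at a
time.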


#7

“S. Robert J.” removed_email_address@domain.invalid wrote:

I need to parse a file line by line, and output the results line by
line (too big to fit into memory). So far, simple enough:
file.each_line.

However, the parser needs the ability to peek ahead to the next line,
in order to parse this line. What’s the right way to do this? Again,
I really don’t want to try to slurp the whole file into memory and
split on newlines.

Sounds to me like it could be solved elegantly with a lazy stream of
input lines. For lazy streams see the Usenet thread starting with
article removed_email_address@domain.invalid, for instance.

The file will be split into lines, but lazily, so the lines never all
need to be held in memory at the same time. Old, i.e. already consumed,
lines will be garbage collected soon, because the application no
longer references them. You can have as many lookahead lines as you
want (tradeoff: more lookahead needs more memory, of course).
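
As a rough illustration of the lazy approach (not the stream
implementation from the referenced Usenet thread), current Ruby
provides external enumerators with Enumerator#next and Enumerator#peek
(Ruby 1.9 and later). The sketch below assumes one line of lookahead is
enough. The helper name each_with_peek and the StringIO input are
illustrative.

```ruby
require 'stringio'

# Hypothetical helper: pulls lines on demand from an external enumerator
# and uses Enumerator#peek to look one line ahead without consuming it.
def each_with_peek(io)
  lines = io.each_line   # no block given, so this returns an Enumerator
  loop do
    # next raises StopIteration at end of input; Kernel#loop rescues it
    # and ends the loop.
    current = lines.next.chomp
    following = begin
      lines.peek.chomp
    rescue StopIteration
      false              # no line follows the last one
    end
    yield current, following
  end
end

io = StringIO.new("Hi\nHow\nAre\nyou?\n")
each_with_peek(io) { |current, following| p [current, following] }
# prints:
# ["Hi", "How"]
# ["How", "Are"]
# ["Are", "you?"]
# ["you?", false]
```

Only the current line and the peeked line are referenced at any time,
so earlier lines become eligible for garbage collection as described
above.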

Regards
Thomas