What is the most efficient way of reading two large files at the same
time without using too much memory? Also, I need to compare strings in
the header lines throughout one file with the other file to weed out
what is different. I just don’t know how to start reading the files at
the same time. Any ideas on how to get started? I was thinking of a
while loop but wasn’t sure how to implement it. A small example
would help. Thanks in advance.
ruby.rb:6:in `next': iteration reached an end (StopIteration)
        from ruby.rb:6:in `block (2 levels) in <main>'
        from ruby.rb:4:in `foreach'
        from ruby.rb:4:in `block in <main>'
        from ruby.rb:1:in `open'
        from ruby.rb:1:in `<main>'
Or, if you want to read each file in its entirety:
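f1 = File.open('xml.xml')
f2 = File.open('html.htm')
e1 = f1.each
e2 = f2.each
enums = [e1, e2]
# Take turns: print one line from each file until both are exhausted.
while not enums.empty?
  e = enums.shift
  begin
    puts e.next # raises StopIteration at EOF
  rescue StopIteration
    # do nothing; this enumerator is finished and is not re-queued
  else
    # no exception, so put the enumerator back in the queue
    enums << e
  end
end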
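The StopIteration shown in the trace above is what Enumerator#next raises once a file runs out of lines. A minimal sketch of reading two files in lockstep without rescuing it by hand, relying on the fact that Kernel#loop rescues StopIteration; the method name each_line_pair and its parameters are made up for illustration:

```ruby
# Hypothetical helper: yields one line from each file per iteration and
# stops as soon as either file is exhausted. Only the current pair of
# lines is held in memory.
def each_line_pair(path_a, path_b)
  File.open(path_a) do |file_a|
    File.open(path_b) do |file_b|
      lines_a = file_a.each_line
      lines_b = file_b.each_line
      # Kernel#loop rescues StopIteration, so the loop ends quietly
      # when Enumerator#next hits the end of either file.
      loop do
        yield lines_a.next, lines_b.next
      end
    end
  end
end
```

It could be called as each_line_pair('a.txt', 'b.txt') { |a, b| puts "differs" if a != b }. Any leftover lines in the longer file are ignored, which may or may not be what you want.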
What exactly are “header lines” in your case? Without knowing the
format and your parsing / processing requirements it’s difficult to
come up with suggestions. The most generic is
File.open f1 do |io1|
  File.open f2 do |io2|
    # now what?
  end
end
I have two large files (several gigs) that I need to read, each with a
bunch of “entries” like the ones shown here. I need to match the header
lines; in this case I need to make sure “1_4_138_” is the same in both
entries and, if it is, write those entries with matching headers to new
separate files.
Do you have guaranteed ordering for the header lines in each file?
A header line starts with “>”.
First entry, first file:
>1_4_138_F5-P2
234234234234234
First entry, second file:
>1_4_138_F3
234234234234234
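A minimal sketch of how entries in this format could be read one at a time, assuming every entry begins with a “>” header line and that the part to compare is the first three underscore-terminated numbers. The helper names entries_in and header_key are invented for this example:

```ruby
# Hypothetical helper: returns an enumerator of entries, where each entry
# is an array holding the header line followed by its data lines.
# File.foreach without a block reads lazily, so only one entry is in
# memory at a time.
def entries_in(path)
  File.foreach(path).slice_before { |line| line.start_with?(">") }
end

# Hypothetical helper: pulls the comparison key out of a header line,
# or returns nil if the line does not look like a header.
def header_key(header_line)
  header_line[/\A>((?:\d+_){3})/, 1]
end
```

With this, entries_in('first.txt').each { |entry| puts header_key(entry.first) } would print one key per entry. Lines that come before the first header would show up as a first chunk with no header, so that case needs handling if it can occur.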
I’m not done writing the script yet (and haven’t tested it either), but
this is what I have:
while !f3_file.eof?
  while !f5_file.eof?
    f5_ln = f5_file.readline
    if f5_ln =~ /(\d*_\d*_\d*_)/
      f5_ln += f5_file.readline
    end
    f5_out.puts(f5_ln)
  end
  f3_ln = f3_file.readline
  if f3_ln =~ /(\d*_\d*_\d*_)/
    f3_ln += f3_file.readline
  end
end
The headers should be in the same order in both files. Some headers are
going to be missing in the second file however, which is why I need to
do the check. If a header doesn’t match one from the first file, then
those two headers don’t get written out to the file.
So do you want to use the first file as a template for checking only,
and write out only the matching sections from the second file? In other
words, is this what you want conceptually?
valid_sections = read_headers(file_1)
for each section in file_2
  if section in valid_sections
    print to file_3
Here’s an implementation of the algorithm above:
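require 'set'

headers = Set.new

# Collect the header key (e.g. "1_4_138_") of every entry in the first file.
File.foreach file_1 do |line|
  line =~ %r{^>((?:\d+_){3})} and headers << $1
end

File.open file_3, "w" do |out|
  do_print = false

  File.foreach file_2 do |line|
    # A header line decides whether it and the following lines are copied.
    if line =~ %r{^>((?:\d+_){3})}
      do_print = headers.include? $1
    end

    out.puts line if do_print
  end
end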
Kind regards
robert
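Since the headers are said to be in the same order in both files, and the second file only leaves some out, a streaming variant is also possible. It does not hold any set of headers in memory and writes the matching entries from both inputs to two new files. This is only a sketch under those assumptions; write_matching, header_key and all parameter names are made up here, and both inputs are assumed to start with a header line:

```ruby
# Hypothetical helper: key such as "1_4_138_" from a ">" header line.
def header_key(header_line)
  header_line[/\A>((?:\d+_){3})/, 1]
end

# Hypothetical helper: copies entries whose key occurs in both inputs.
# Assumes the keys in path_b are an in-order subset of those in path_a.
def write_matching(path_a, path_b, out_a_path, out_b_path)
  # Each enumerator yields one entry (header line + data lines) at a time.
  entries_a = File.foreach(path_a).slice_before { |l| l.start_with?(">") }
  entries_b = File.foreach(path_b).slice_before { |l| l.start_with?(">") }

  File.open(out_a_path, "w") do |out_a|
    File.open(out_b_path, "w") do |out_b|
      # Kernel#loop stops quietly when either input is exhausted.
      loop do
        entry_b = entries_b.next
        key = header_key(entry_b.first)
        # Skip entries of the first file that the second file lacks.
        entry_a = entries_a.next
        entry_a = entries_a.next until header_key(entry_a.first) == key
        out_a.write(entry_a.join)
        out_b.write(entry_b.join)
      end
    end
  end
end
```

If the second file could contain a header that the first file lacks, this sketch would skip through the rest of the first file looking for it, so in that case the Set-based version above is the safer choice.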