Large file reading

Hello all,

What is the most efficient way of reading two large files at the same
time without using too much memory? I also need to compare strings in
the header lines of one file with those of the other file to weed out
what is different. I just don’t know how to start reading the files at the
same time. Any ideas on how to get started? I was thinking of a
while loop but wasn’t really sure how to implement it. A small example
would help. Thanks in advance.

File.open("xml.xml") do |f|
  e = f.each

  IO.foreach("html.htm") do |line|
    puts "----#{line}"
    puts "****#{e.next}"
  end
end

--output:--

----
****<?xml version="1.0"?>
----
****
----


---- Test
**** Tove
----
**** xxxx
----
**** Tove’s value is: 10


----
**** Tove

**** Jani
---- $(document).ready(function() {
**** Reminder
---- $(’#my_table td’).addClass(‘edit’).click(function() {
**** Don’t forget me this weekend!
---- $(this).parentNode(‘tr’).children(‘td’).first().click();



****
---- });
**** Jani

**** xxxx
---- $(".edit").editable(’#’);
**** xxxx

**** Jani’s value is: 20

**** 1200
---- $(’#my_checkbox’).click(function() {
****
---- var current_display = $(’#my_div’).css(‘display’);


---- var new_display = (current_display == ‘none’) ? ‘block’ :
‘none’;
****
---- $(’#my_div’).css(‘display’, new_display);
**** Diane
---- });
****

**** Diane’s value is: 30



****
---- });



ruby.rb:6:in `next': iteration reached an end (StopIteration)
        from ruby.rb:6:in `block (2 levels) in <main>'
        from ruby.rb:4:in `foreach'
        from ruby.rb:4:in `block in <main>'
        from ruby.rb:1:in `open'
        from ruby.rb:1:in `<main>'
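
The StopIteration in the trace above is raised because e.next is called after xml.xml has run out of lines while html.htm still has some. One way to stop cleanly is to drive both enumerators from Kernel#loop, which rescues StopIteration. This is only a sketch: each_line_pair is a made-up helper name, and StringIO objects stand in for the two files so that it runs anywhere.

```ruby
require 'stringio'

# Hypothetical helper: yields one line from each IO per step and stops
# as soon as either source is exhausted.
def each_line_pair(io1, io2)
  e1 = io1.each_line
  e2 = io2.each_line
  # Kernel#loop rescues StopIteration, so the first enumerator to run
  # dry ends the loop quietly instead of raising.
  loop do
    yield e1.next, e2.next
  end
end

# StringIO stands in for the two files (an assumption for the demo).
short = StringIO.new("a1\na2\n")
long  = StringIO.new("b1\nb2\nb3\n")

pairs = []
each_line_pair(short, long) { |l1, l2| pairs << [l1.chomp, l2.chomp] }
p pairs
```

With real files, the two StringIO objects would be replaced by File objects opened in nested File.open blocks. Only one line per file is held in memory at a time.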

Or, if you want to read each file in its entirety:

f1 = File.open('xml.xml')
f2 = File.open('html.htm')

e1 = f1.each
e2 = f2.each

enums = [e1, e2]

until enums.empty?
  e = enums.shift

  begin
    puts e.next   # throws an exception at eof
  rescue StopIteration
    # do nothing: this file is finished, so drop its enumerator
  else
    # if no exception, put the enumerator back in the queue
    enums << e
  end
end

On Fri, Sep 9, 2011 at 10:38 PM, Cyril J. wrote:

What is the most efficient way of reading two large files at the same
time without using too much memory? Also, I need to compare strings in
the header lines throughout one file with the other file to weed out
what is different. Just don’t know how to start reading the files at the
same time. Any ideas on how to go about starting this? I was thinking a
while loop but wasn’t really sure on how to implement it. Small example
would help. Thanks in advance.

What exactly are “header lines” in your case? Without knowing the
format and your parsing / processing requirements it’s difficult to
come up with suggestions. The most generic approach is

File.open f1 do |io1|
  File.open f2 do |io2|
    # now what?
  end
end

Kind regards

robert

On Tue, Sep 13, 2011 at 3:01 PM, Cyril J. wrote:

First Entry, second file:

>1_4_138_F3
234234234234234

I have two large files (several gigs) that I need to read, each with a bunch of
“entries” like the one shown above. I need to match the header lines: in this
case I need to make sure “1_4_138_” is the same in both entries and, if
it is, write the entries with matching headers to new separate files.

Do you have guaranteed ordering for the header lines in each file?

What exactly are “header lines” in your case? Without knowing the
format and your parsing / processing requirements it’s difficult to
come up with suggestions.

A header line starts with “>”

First Entry, first file:

>1_4_138_F5-P2
234234234234234

First Entry, second file:

>1_4_138_F3
234234234234234

I have two large files (several gigs) that I need to read, each with a bunch of
“entries” like the ones shown above. I need to match the header lines: in this
case I need to make sure “1_4_138_” is the same in both entries and, if
it is, write the entries with matching headers to new separate files.

I’m not done writing the script yet (and haven’t tested it either), but this is
what I have:

while !f3_file.eof?
  while !f5_file.eof?
    f5_ln = f5_file.readline
    if f5_ln =~ /(\d*_\d*_\d*_)/
      f5_ln += f5_file.readline
    end
    f5_out.puts(f5_ln)
  end
  f3_ln = f3_file.readline
  if f3_ln =~ /(\d*_\d*_\d*_)/
    f3_ln += f3_file.readline
  end
end
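
The comparison described above depends on pulling the shared part of a header (such as “1_4_138_”) out of each header line. A minimal sketch of that step follows. It assumes a header is “>” followed by three underscore-terminated numbers and then a suffix, as in the sample entries; HEADER_KEY and header_key are illustrative names, not part of the script above.

```ruby
# Assumed header shape, taken from the sample entries in this thread:
# ">" followed by three underscore-terminated numbers, then a suffix.
HEADER_KEY = /\A>((?:\d+_){3})/

# Hypothetical helper: returns the shared key for a header line,
# or nil for any line that is not a header.
def header_key(line)
  line[HEADER_KEY, 1]
end

p header_key(">1_4_138_F3")      # => "1_4_138_"
p header_key(">1_4_138_F5-P2")   # => "1_4_138_"
p header_key("234234234234234")  # => nil
```

Two header lines belong together when header_key returns the same non-nil string for both, regardless of the F3 or F5-P2 suffix.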

Do you have guaranteed ordering for the header lines in each file?

The headers should be in the same order in both files. Some headers are
going to be missing in the second file however, which is why I need to
do the check. If a header doesn’t match one from the first file, then
those two headers don’t get written out to the file.

Here is an update from my original code:

cs_files = Dir.glob("Desktop/scripts/*.csfasta")

f3_file = File.open(cs_files[0])
f5_file = File.open(cs_files[1])

f3_out = File.new("Desktop/f3.csfasta", "w")
f5_out = File.new("Desktop/f5.csfasta", "w")

while !f3_file.eof? && !f5_file.eof?
  f3_ln = f3_file.readline.chomp
  f5_ln = f5_file.readline.chomp
  if f3_ln =~ /^>\d*_\d*_\d*_F3$/
    f3_headers = f3_ln.gsub(/F3/, "")
  end
  if f5_ln =~ /^>\d*_\d*_\d*_F5-P2$/
    f5_headers = f5_ln.gsub(/F5-P2/, "")
    if f3_headers == f5_headers
      puts f3_headers
      puts f5_headers
      puts "worked"
    else
      puts f3_headers
      puts f5_headers
      puts "nope"
    end
  end
end # end while
f3_file.close
f5_file.close

First Entry, second file:

>1_4_138_F3
234234234234234

I have two large files (several gigs) that I need to read, each with a bunch of
“entries” like the one shown above. I need to match the header lines: in this
case I need to make sure “1_4_138_” is the same in both entries and, if
it is, write the entries with matching headers to new separate files.

So do you want to use the first file only as a template for checking, and
write out only the matching sections from the second file? In other
words, is this what you want conceptually?

valid_sections = read_headers(file_1)

for each section in file_2
  if section in valid_sections
    print to file_3

On Wed, Sep 14, 2011 at 3:26 AM, Cyril J. wrote:

Do you have guaranteed ordering for the header lines in each file?

The headers should be in the same order in both files. Some headers are
going to be missing in the second file however, which is why I need to
do the check. If a header doesn’t match one from the first file, then
those two headers don’t get written out to the file.

Here’s an implementation of the algorithm above:

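require 'set'

headers = Set.new

# file_1, file_2 and file_3 are assumed to hold the three file paths

# collect the "1_4_138_" style prefix of every header in the first file
File.foreach file_1 do |line|
  %r{^>((?:\d+_){3})} =~ line and headers << $1
end

File.open file_3, "w" do |out|
  do_print = false

  File.foreach file_2 do |line|
    # a header line switches printing on or off for the whole entry
    if %r{^>((?:\d+_){3})} =~ line
      do_print = headers.include? $1
    end

    out.puts line if do_print
  end
end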

Kind regards

robert
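
Since the headers are said to be in the same order in both files, the filtering can also be sketched without first collecting a Set of headers, and with output for both files. What follows is only a sketch under stated assumptions, not a tested solution for the real data. It assumes that every entry in the second file also occurs in the first file, in the same order, and that a header is “>” followed by three underscore-terminated numbers. The names each_entry and write_matching are made up for illustration, and StringIO objects stand in for the real files.

```ruby
require 'stringio'

# Assumed header shape: ">" plus three underscore-terminated numbers.
HEADER = /\A>((?:\d+_){3})/

# Hypothetical helper: lazily yields [key, entry_text] for each entry,
# where an entry is a header line plus the lines that follow it.
def each_entry(io)
  return enum_for(:each_entry, io) unless block_given?
  key = nil
  text = nil
  io.each_line do |line|
    if (m = HEADER.match(line))
      yield key, text if key   # emit the previous entry
      key = m[1]
      text = line.dup
    elsif key
      text << line
    end
  end
  yield key, text if key       # emit the last entry
end

# Assumes every entry in io2 also occurs in io1, in the same order.
def write_matching(io1, io2, out1, out2)
  entries1 = each_entry(io1)
  each_entry(io2) do |key2, text2|
    key1, text1 = entries1.next
    # skip entries of the first file that the second file lacks
    key1, text1 = entries1.next until key1 == key2
    out1 << text1
    out2 << text2
  end
rescue StopIteration
  # io1 ran out before a match was found; nothing more to write
end

# Demo data (invented): the second file lacks the 1_4_139_ entry.
first  = StringIO.new(">1_4_138_F5-P2\nAAA\n>1_4_139_F5-P2\nBBB\n>1_4_140_F5-P2\nCCC\n")
second = StringIO.new(">1_4_138_F3\nDDD\n>1_4_140_F3\nEEE\n")
out1 = StringIO.new
out2 = StringIO.new

write_matching(first, second, out1, out2)
puts out1.string
puts out2.string
```

Only one entry per file is held in memory at a time, so the approach does not depend on file size. If the subset assumption does not hold, an unmatched header in the second file makes the loop consume the rest of the first file, so the Set-based version above is the safer choice in that case.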