Team,
Every week I get a large file: over 50 million records with a record
length of 150 characters. These files can be over 15 GB.
I need to take the new file and compare it against the one from the
previous week.
Reading the files into two arrays would make the process a bit easier,
but the files are too large; when I try using arrays, the process
crashes with out-of-storage messages.
I am looking for suggestions on how to efficiently perform the following
process:
- Compare each record from this week's file against last week's file.
- If every record is the same, do nothing or just indicate so: SAM.
- If there are any duplicate records in the new file, output the record
  to a file of dups.
- If there are any new records (records found in the new file but not in
  last week's file), output INS followed by the record.
- If there is a record found in last week's file (the old file) but not
  in this week's file, output DEL followed by the record.
- If there is a record with the same key (the first 13 chars) in both
  files, but the rest of the record is different, output UPD followed by
  the record.
Hey, I can do all of the above by reading each record from both files
and doing different types of comparison/matching, but I was wondering if
there is a more efficient way to do this. I am looking for suggestions.
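
For what it's worth, the kind of single-pass approach I have been
thinking about is: sort both files by the 13-character key first (for
data this size that would have to be an external sort, e.g. the
platform's sort utility, since it will not fit in memory), then read the
two sorted files in parallel and merge them. Below is a rough Python
sketch of the merge step only; the file names, the newline-delimited
record format, and the assumption that both inputs are already sorted by
key are all made up for illustration, so please treat it as a sketch
rather than a working program.

    # Sketch only. Assumptions (mine): records are newline-delimited and
    # both input files are already sorted by the first 13 characters.
    KEY_LEN = 13

    def next_record(f):
        """Return the next record without its newline, or None at end of file."""
        line = f.readline()
        return line.rstrip("\n") if line else None

    def compare(old_path, new_path, out_path, dup_path):
        with open(old_path) as old_f, open(new_path) as new_f, \
             open(out_path, "w") as out, open(dup_path, "w") as dups:
            old_rec = next_record(old_f)
            new_rec = next_record(new_f)
            prev_new_key = None          # last key consumed from the new file
            while old_rec is not None or new_rec is not None:
                old_key = old_rec[:KEY_LEN] if old_rec is not None else None
                new_key = new_rec[:KEY_LEN] if new_rec is not None else None

                # Duplicate key within the new file goes to the dups file.
                if new_rec is not None and new_key == prev_new_key:
                    dups.write(new_rec + "\n")
                    new_rec = next_record(new_f)
                    continue

                if old_rec is None or (new_rec is not None and new_key < old_key):
                    out.write("INS " + new_rec + "\n")      # only in the new file
                    prev_new_key = new_key
                    new_rec = next_record(new_f)
                elif new_rec is None or old_key < new_key:
                    out.write("DEL " + old_rec + "\n")      # only in the old file
                    old_rec = next_record(old_f)
                else:                                       # same key in both files
                    if old_rec == new_rec:
                        out.write("SAM " + new_rec + "\n")  # identical record
                    else:
                        out.write("UPD " + new_rec + "\n")  # same key, data changed
                    prev_new_key = new_key
                    old_rec = next_record(old_f)
                    new_rec = next_record(new_f)

    # Example call (names made up):
    # compare("lastweek.dat", "thisweek.dat", "changes.out", "dups.out")

Does a sort-then-merge pass like this sound reasonable, or is there a
better-established way to handle files this size?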
Thank you