Looking for suggestions on processing and comparing two very large files

Team,

Every week I get a large file of over 50 million records, each with a
record length of 150 chars. These files can actually be over 15 GB.
I need to take the new file and compare it against the one from the
previous week.
Reading the files into two arrays would make the process a bit easier,
but the files are too large; when I try using arrays, the process
crashes with out-of-storage messages.

I am looking for suggestions on how to perform the following process
efficiently:

  1. Compare each record from this week's file against last week's
     file.
  2. If a record is the same in both files, do nothing or just
     indicate so: SAM
  3. If there are any duplicate records in the new file, output the
     record to a file of dups.
  4. If there are any new records (records found in the new file but
     not in last week's file), output: INS followed by the record.
  5. If there is a record found in last week's file (the old file) but
     not in this week's file, output: DEL followed by the record.
  6. If there is a record with the same key (the first 13 chars) in
     both files but the rest of the record differs, output: UPD
     followed by the record.

I can do all of the above by reading each record from both files and
doing different kinds of comparison/matching, but I was wondering
whether there is a more efficient way. I am looking for suggestions.

Thank you

This is almost exactly what diff on Unix/Linux/Mac OS X does. There
are also diff utilities for Windows, so instead of reinventing the
wheel, you may want to look at the diff utilities that already exist.

Wayne

On Mon, Oct 22, 2012 at 2:21 PM, Ruby S. [email protected]
wrote:

Every week I get a large file of over 50 million records

The big question is… are these files SORTED, preferably on some
UNIQUE key, or at least some order that will remain the same from week
to week? If yes, then you can use the same sort of techniques as in
the “diff” utility found on every Unix-derived system and many others.
(Windows has something similar but the name escapes me at the moment.
IIRC, in an ironic twist, this is one of those cases where the
Windows command has a more cryptic name than its Unix cognate.) How
to make a “diff” type program has been covered in gazillions of
blog/magazine articles, textbooks, etc., so I won’t go into detail.
If you’re lucky, you might even be able to just use the ones existing
on your system, with some shell scripting for glue.

On the other claw, if the records are in random order, then you’ve got
a much more serious problem. In that case, ASSUMING that the keys,
and number of updated/duplicated records, are both quite small, off
the top of my head I think I’d:

  • Extract the keys from last week’s file
  • Ditto for this week’s
  • Sort those, assuming the keys are sufficiently smaller than the
    full records that this is reasonable
  • Diff them.
  • Extract the actual records from both weeks for any matching keys.
  • Sort and diff, under the same assumption.
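
The extraction step is cheap with fixed-width records; here is a rough
Ruby sketch of it (the file names are placeholders, not from the post):

  # Write just the 13-char key of each record to a much smaller file
  # that can then be sorted and diffed with the standard utilities.
  File.open("old_keys.txt", "w") do |out|
    File.foreach("old.txt") { |rec| out.puts rec[0, 13] }
  end

Do the same for this week's file, then sort and diff the two key files.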

Or, if the above data sets are not small enough to make sorting
reasonable, but the potential dups might at least fit in RAM:

  • Extract last week’s keys into a Set
  • Initialize a “Needs Further Inspection” (NFI) Set
  • Iterate over this week’s records:
    = Try to find the key in last week’s Set of keys
    = If seen, remove it from last week's Set and add it to the NFI Set
    = Else process as an Insertion
  • Anything left in last week’s Set is a Removal
  • (You can now get rid of last week’s Set of keys)
  • Extract last week’s full records matching NFI keys,
    putting them in a hash keyed by the key
  • Extract this week’s records matching NFI keys,
    looking them up in the hash
  • Compare the entire records, processing as either
    Duplicate or Update as needed
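
In Ruby, that outline might look roughly like the sketch below. The
file names, the DUP/DEL output format, and streaming with File.foreach
are my assumptions, not part of the recipe above:

  require 'set'

  KEY_LEN = 13

  # Pass 1: collect last week's keys.
  last_keys = Set.new
  File.foreach("old.txt") { |rec| last_keys << rec[0, KEY_LEN] }

  # Pass 2: walk this week's records.
  nfi = Set.new                       # "Needs Further Inspection"
  File.foreach("new.txt") do |rec|
    k = rec[0, KEY_LEN]
    if last_keys.delete?(k)           # seen last week: inspect later
      nfi << k
    elsif nfi.include?(k)
      puts "DUP #{rec}"               # key already moved to NFI
    else
      puts "INS #{rec}"               # brand-new record
    end
  end

  removed = last_keys                 # whatever is left was deleted

  # Pass 3: reread the old file, emitting DELs and stashing NFI records.
  old_recs = {}
  File.foreach("old.txt") do |rec|
    k = rec[0, KEY_LEN]
    if removed.include?(k)
      puts "DEL #{rec}"
    elsif nfi.include?(k)
      old_recs[k] = rec
    end
  end

  # Pass 4: compare this week's NFI records against the hash.
  File.foreach("new.txt") do |rec|
    old = old_recs[rec[0, KEY_LEN]]
    puts "UPD #{rec}" if old && rec != old   # same key, changed payload
  end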

-Dave

I forgot to mention that the files are sorted!

On Mon, Oct 22, 2012 at 2:57 PM, Dave A. [email protected] wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

If the files are still too big, use two-character prefixes: aa, ab, or the like.
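
A rough Ruby sketch of that split for one file (the output names and
the one-character split are assumptions; keys that are not
filename-safe would need escaping):

  # Fan a sorted file out into one piece per leading key character.
  handles = Hash.new do |h, prefix|
    h[prefix] = File.open("old_#{prefix}.txt", "w")
  end
  File.foreach("old.txt") { |rec| handles[rec[0, 1]] << rec }
  handles.each_value(&:close)

Because the input is sorted, each piece comes out sorted too, so the
old/new pieces can be compared pairwise.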

Marc W.

On Mon, Oct 22, 2012 at 10:05 PM, Marc W. [email protected]
wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

One could also read records from both files at the same time and
output diffs while doing that. If the data is line-based, there are
already tools for comparing two sorted files:

$ comm --help
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1                      suppress column 1 (lines unique to FILE1)
  -2                      suppress column 2 (lines unique to FILE2)
  -3                      suppress column 3 (lines that appear in both files)

  --check-order           check that the input is correctly sorted, even
                            if all input lines are pairable
  --nocheck-order         do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
  --help                  display this help and exit
  --version               output version information and exit

Note, comparisons honor the rules specified by 'LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3 file1 file2   Print lines in file1 not in file2, and vice versa.

Report comm bugs to [email protected]
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
For complete documentation, run: info coreutils 'comm invocation'

If the files are not line-based, it may be possible to convert them to
a line-based format and then use comm like this:

$ comm <(convert old) <(convert new)
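
Building on the options quoted above, a first classification of
whole-line differences could look like this (the file names are
placeholders):

$ comm -23 old.txt new.txt > only_in_old.txt   # DEL or UPD candidates
$ comm -13 old.txt new.txt > only_in_new.txt   # INS or UPD candidates

Note that comm compares entire lines, so a record whose 13-char key
matches but whose payload changed shows up in both outputs; separating
those UPD cases from a genuine DEL plus INS still needs a key-aware pass.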

Kind regards

robert

On Wed, Oct 24, 2012 at 9:50 AM, Robert K.
[email protected] wrote:

On Mon, Oct 22, 2012 at 10:05 PM, Marc W. [email protected] wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

One could also read records from both files at the same time and
output diffs while doing that. If the data is line-based, there are
already tools for comparing two sorted files:

$ comm --help

PS: You can use this scheme if you want to implement it yourself.
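
A rough Ruby sketch of that single-pass merge, assuming (per the
original post) that both files are sorted by their first 13
characters; the file names and the choice to print everything except
dups to stdout are mine:

  KEY_LEN = 13

  old_f = File.open("old.txt")
  new_f = File.open("new.txt")
  dups  = File.open("dups.txt", "w")

  old_rec = old_f.gets
  new_rec = new_f.gets
  prev_new_key = nil

  while old_rec || new_rec
    old_key = old_rec && old_rec[0, KEY_LEN]
    new_key = new_rec && new_rec[0, KEY_LEN]

    if new_key && new_key == prev_new_key
      dups.puts(new_rec)                # duplicate key within the new file
      new_rec = new_f.gets
    elsif old_key.nil? || (new_key && new_key < old_key)
      puts "INS #{new_rec}"             # only in this week's file
      prev_new_key = new_key
      new_rec = new_f.gets
    elsif new_key.nil? || old_key < new_key
      puts "DEL #{old_rec}"             # only in last week's file
      old_rec = old_f.gets
    else                                # same key in both files
      puts(old_rec == new_rec ? "SAM #{new_rec}" : "UPD #{new_rec}")
      prev_new_key = new_key
      old_rec = old_f.gets
      new_rec = new_f.gets
    end
  end

  [old_f, new_f, dups].each(&:close)

Each file is read exactly once and only the current record from each
is held in memory, so the 15 GB file size stops being a problem.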