Looking for suggestions on processing and comparing two very large files

Team,

Every week I get a large file of over 50 million records, each with a
record length of 150 chars. These files can actually be over 15 GB.
I need to take the new file and compare it against the one from the
previous week.
Reading the files into two arrays would make the process a bit easier,
but the files are too large; when I try using arrays, the process
crashes with out-of-storage messages.

I am looking for suggestions on how to perform the following process
efficiently:

  1. Compare each record from this week's file against last week's
     file.
  2. If a record is the same in both files, do nothing or just
     indicate so: SAM
  3. If there are any duplicate records in the new file, output the
     record to a file of dups.
  4. If there are any new records (records found in the new file but
     not in last week's file), output: INS followed by the record.
  5. If there is a record found in last week's file (the old file) but
     not in this week's file, output: DEL followed by the record.
  6. If there is a record with the same key (the first 13 chars) in
     both files but the rest of the record differs, output: UPD
     followed by the record.

I can do all of the above by reading each record from both files and
doing different kinds of comparison/matching, but I was wondering
whether there is a more efficient way. I am looking for suggestions.

Thank you

This is almost exactly what diff on Unix/Linux/Mac OS X does. There
are also diff utilities for Windows, so instead of reinventing the
wheel, you may want to look at the diff utilities that already exist.

Wayne

On Mon, Oct 22, 2012 at 2:21 PM, Ruby S. [email protected]
wrote:

Every week I get a large file of over 50 million records

The big question is… are these files SORTED, preferably on some
UNIQUE key, or at least some order that will remain the same from week
to week? If yes, then you can use the same sort of techniques as in
the “diff” utility found on every Unix-derived system and many others.
(Windows has something similar but the name escapes me at the moment.
IIRC, in an ironic twist, this is one of those cases where the
Windows command has a more cryptic name than its Unix cognate.) How
to make a “diff” type program has been covered in gazillions of
blog/magazine articles, textbooks, etc., so I won’t go into detail.
If you’re lucky, you might even be able to just use the ones existing
on your system, with some shell scripting for glue.

On the other claw, if the records are in random order, then you’ve got
a much more serious problem. In that case, ASSUMING that the keys,
and number of updated/duplicated records, are both quite small, off
the top of my head I think I’d:

  • Extract the keys from last week’s file
  • Ditto for this week’s
  • Sort those, assuming the keys are sufficiently smaller than the
    full records that this is reasonable
  • Diff them.
  • Extract the actual records from both weeks for any matching keys.
  • Sort and diff, under the same assumption.
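
The extraction step is cheap with fixed-width records; here is a rough
Ruby sketch of it (the file names are placeholders, not from the post):

  # Write just the 13-char key of each record to a much smaller file
  # that can then be sorted and diffed with the standard utilities.
  File.open("old_keys.txt", "w") do |out|
    File.foreach("old.txt") { |rec| out.puts rec[0, 13] }
  end

Do the same for this week's file, then sort and diff the two key files.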

Or, if the above data sets are not small enough to make sorting
reasonable, but the potential dups might at least fit in RAM:

  • Extract last week’s keys into a Set
  • Initialize a “Needs Further Inspection” (NFI) Set
  • Iterate over this week’s records:
    = Try to find the key in last week’s Set of keys
    = If seen, remove it from last week's Set and add it to the NFI Set
    = Else process as an Insertion
  • Anything left in last week’s Set is a Removal
  • (You can now get rid of last week’s Set of keys)
  • Extract last week’s full records matching NFI keys,
    putting them in a hash keyed by the key
  • Extract this week’s records matching NFI keys,
    looking them up in the hash
  • Compare the entire records, processing as either
    Duplicate or Update as needed
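
In Ruby, that outline might look roughly like the sketch below. The
file names, the DUP/DEL output format, and streaming with File.foreach
are my assumptions, not part of the recipe above:

  require 'set'

  KEY_LEN = 13

  # Pass 1: collect last week's keys.
  last_keys = Set.new
  File.foreach("old.txt") { |rec| last_keys << rec[0, KEY_LEN] }

  # Pass 2: walk this week's records.
  nfi = Set.new                       # "Needs Further Inspection"
  File.foreach("new.txt") do |rec|
    k = rec[0, KEY_LEN]
    if last_keys.delete?(k)           # seen last week: inspect later
      nfi << k
    elsif nfi.include?(k)
      puts "DUP #{rec}"               # key already moved to NFI
    else
      puts "INS #{rec}"               # brand-new record
    end
  end

  removed = last_keys                 # whatever is left was deleted

  # Pass 3: reread the old file, emitting DELs and stashing NFI records.
  old_recs = {}
  File.foreach("old.txt") do |rec|
    k = rec[0, KEY_LEN]
    if removed.include?(k)
      puts "DEL #{rec}"
    elsif nfi.include?(k)
      old_recs[k] = rec
    end
  end

  # Pass 4: compare this week's NFI records against the hash.
  File.foreach("new.txt") do |rec|
    old = old_recs[rec[0, KEY_LEN]]
    puts "UPD #{rec}" if old && rec != old   # same key, changed payload
  end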

-Dave

I forgot to mention that the files are sorted!

On Mon, Oct 22, 2012 at 2:57 PM, Dave A. [email protected] wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

If the files are still too big, use two-character prefixes: aa, ab, or the like.
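
A rough Ruby sketch of that split for one file (the output names and
the one-character split are assumptions; keys that are not
filename-safe would need escaping):

  # Fan a sorted file out into one piece per leading key character.
  handles = Hash.new do |h, prefix|
    h[prefix] = File.open("old_#{prefix}.txt", "w")
  end
  File.foreach("old.txt") { |rec| handles[rec[0, 1]] << rec }
  handles.each_value(&:close)

Because the input is sorted, each piece comes out sorted too, so the
old/new pieces can be compared pairwise.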

Marc W.

On Mon, Oct 22, 2012 at 10:05 PM, Marc W. [email protected]
wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

One could also read records from both files at the same time and
output diffs while doing that. If the data is line-based, there are
already tools for comparing two sorted files:

$ comm --help
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1                      suppress column 1 (lines unique to FILE1)
  -2                      suppress column 2 (lines unique to FILE2)
  -3                      suppress column 3 (lines that appear in both files)

  --check-order           check that the input is correctly sorted, even
                            if all input lines are pairable
  --nocheck-order         do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
  --help                  display this help and exit
  --version               output version information and exit

Note, comparisons honor the rules specified by 'LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3 file1 file2   Print lines in file1 not in file2, and vice versa.

Report comm bugs to [email protected]
GNU coreutils home page: http://www.gnu.org/software/coreutils/
General help using GNU software: http://www.gnu.org/gethelp/
For complete documentation, run: info coreutils 'comm invocation'

If the files are not line-based, it may be possible to convert them to
a line-based format and then use comm like this:

$ comm <(convert old) <(convert new)
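
Building on the options quoted above, a first classification of
whole-line differences could look like this (the file names are
placeholders):

$ comm -23 old.txt new.txt > only_in_old.txt   # DEL or UPD candidates
$ comm -13 old.txt new.txt > only_in_new.txt   # INS or UPD candidates

Note that comm compares entire lines, so a record whose 13-char key
matches but whose payload changed shows up in both outputs; separating
those UPD cases from a genuine DEL plus INS still needs a key-aware pass.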

Kind regards

robert

On Wed, Oct 24, 2012 at 9:50 AM, Robert K.
[email protected] wrote:

On Mon, Oct 22, 2012 at 10:05 PM, Marc W. [email protected] wrote:

If they are sorted you can split each file into pieces, e.g. create a,
b, and c files, and compare a(old) with a(new) and b(old) with b(new) only.

One could also read records from both files at the same time and
output diffs while doing that. If the data is line-based, there are
already tools for comparing two sorted files:

$ comm --help

PS: You can use this scheme if you want to implement it yourself.
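
A rough Ruby sketch of that single-pass merge, assuming (per the
original post) that both files are sorted by their first 13
characters; the file names and the choice to print everything except
dups to stdout are mine:

  KEY_LEN = 13

  old_f = File.open("old.txt")
  new_f = File.open("new.txt")
  dups  = File.open("dups.txt", "w")

  old_rec = old_f.gets
  new_rec = new_f.gets
  prev_new_key = nil

  while old_rec || new_rec
    old_key = old_rec && old_rec[0, KEY_LEN]
    new_key = new_rec && new_rec[0, KEY_LEN]

    if new_key && new_key == prev_new_key
      dups.puts(new_rec)                # duplicate key within the new file
      new_rec = new_f.gets
    elsif old_key.nil? || (new_key && new_key < old_key)
      puts "INS #{new_rec}"             # only in this week's file
      prev_new_key = new_key
      new_rec = new_f.gets
    elsif new_key.nil? || old_key < new_key
      puts "DEL #{old_rec}"             # only in last week's file
      old_rec = old_f.gets
    else                                # same key in both files
      puts(old_rec == new_rec ? "SAM #{new_rec}" : "UPD #{new_rec}")
      prev_new_key = new_key
      old_rec = old_f.gets
      new_rec = new_f.gets
    end
  end

  [old_f, new_f, dups].each(&:close)

Each file is read exactly once and only the current record from each
is held in memory, so the 15 GB file size stops being a problem.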