Efficient way for comparing records between 2 large files (16 million records)

aris · November 12, 2012, 11:22pm

Team,

I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.

I need to compare the keys from the new file to the keys on the file
from
last week. If the rest of the records are the same, then I do nothing.
If
the keys matches but the rest of the records are different, I then have
an
update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either
any
built-in method or an efficient algorithm which I can implement.

Thank you

rubina · November 12, 2012, 11:25pm

Generate an MD5 or stronger hash on the file. If anything changes, the
hash
changes. I say ‘or stronger’ because MD5 has been shown to have
(extremely
rare) overlaps.

rubina · November 13, 2012, 12:04am

On Nov 12, 2012, at 14:21 , Ruby S. [email protected] wrote:

I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.

You or someone else posted this exact problem just a few weeks ago.
Please don’t repost if that was you. Please don’t post your homework
regardless. We’re happy to answer questions and give advice. We’re not
happy to directly contribute to your grade.

rubina · November 13, 2012, 12:10am

On Nov 12, 2012, at 5:21 PM, Ruby S. wrote:

Thank you

–
Ruby S.

You can get quite efficient using the important fact that the files are
already sorted by your key value.

Your problem is a slight variation on the merge phase of an External
Sort

Let’s name the two files P (as in “previous”) and N (as in “new”).

At each step, you have the next record from P and from N.

One of three things will be true:

the keys match
Compare the rest of the records and
a) write the update if different
b) do nothing if all are the same
read the next P and the next N
Pkey < Nkey
Pkey was missing from N
Delete the record P (update to nothing?) if deletions can happen
read the next P (only; use the same N for the next iteration)
Pkey > Nkey
Nkey was missing from P
Insert a new record N
read the next N (only; use the same P for the next iteration)

If the end of file is reached for only one of the files, treat the key
as always greater than the other from the not-yet-at-end file.

When both ends-of-file have been reached, you are done.

Close the files.

Of course, this depends on being able to read the files one record at a
time (or having a buffer to do so). Since it seems like these might be
line-oriented files, I suspect you’re OK here.

-Rob

rubina · November 14, 2012, 3:02pm

On Mon, Nov 12, 2012 at 11:57 PM, Ryan D. [email protected]
wrote:

On Nov 12, 2012, at 14:21 , Ruby S. [email protected] wrote:

I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.

You or someone else posted this exact problem just a few weeks ago. Please don’t
repost if that was you. Please don’t post your homework regardless. We’re happy to
answer questions and give advice. We’re not happy to directly contribute to your
grade.

IIRC I even posted a solution outline.

Cheers

robert

rubina · November 14, 2012, 4:53pm

Hello,

On 12 Νοε 2012, at 23:21 , Ruby S. [email protected] wrote:

Team,

I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.

I need to compare the keys from the new file to the keys on the file from last
week. If the rest of the records are the same, then I do nothing. If the keys
matches but the rest of the records are different, I then have an update and I
will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either any
built-in method or an efficient algorithm which I can implement.

Not sure if you dealing with hashes or arrays but take a look at the ‘-’
method for arrays, it’s blazing fast for comparing even large arrays.

Thank you

–
Ruby S.

Panagiotis (atmosx) Atmatzidis

email: [email protected]
URL: http://www.convalesco.org
GnuPG ID: 0xE736C6A0
gpg --keyserver x-hkp://pgp.mit.edu --recv-keys 0xE736C6A0

rubina · November 16, 2012, 2:35pm

Robert,

You stated: “IIRC I even posted a solution outline.”
I went to your blog (http://blog.rubybestpractices.com/) but could not
find
the item you mentioned. Would you mind to post the link to it?

Thank you

On Wed, Nov 14, 2012 at 10:53 AM, Panagiotis A.
<[email protected]

rubina · November 16, 2012, 3:48pm

Am 16.11.2012 14:35, schrieb Ruby S.:

Robert,

You stated: “IIRC I even posted a solution outline.”
I went to your blog (http://blog.rubybestpractices.com/) but could not
find the item you mentioned. Would you mind to post the link to it?

I assume he meant he posted a solution on this very list
when you(!) asked the same question in October.

Search the ruby-talk archives for
“Looking for suggestions processing and comparing 2 very large files”.

As far as I remember there were several useful suggestions.

rubina · November 16, 2012, 5:46pm

Aha, Got it now. The problem is that I never got or noticed the answer
to y
first post. I will search now.

Thank you