I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file
from
last week. If the rest of the records are the same, then I do nothing.
If
the keys matches but the rest of the records are different, I then have
an
update and I will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either
any
built-in method or an efficient algorithm which I can implement.
Generate an MD5 or stronger hash on the file. If anything changes, the
hash
changes. I say ‘or stronger’ because MD5 has been shown to have
(extremely
rare) overlaps.
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
You or someone else posted this exact problem just a few weeks ago.
Please don’t repost if that was you. Please don’t post your homework
regardless. We’re happy to answer questions and give advice. We’re not
happy to directly contribute to your grade.
You can get quite efficient using the important fact that the files are
already sorted by your key value.
Your problem is a slight variation on the merge phase of an External
Sort
Let’s name the two files P (as in “previous”) and N (as in “new”).
At each step, you have the next record from P and from N.
One of three things will be true:
the keys match
Compare the rest of the records and
a) write the update if different
b) do nothing if all are the same
read the next P and the next N
Pkey < Nkey
Pkey was missing from N
Delete the record P (update to nothing?) if deletions can happen
read the next P (only; use the same N for the next iteration)
Pkey > Nkey
Nkey was missing from P
Insert a new record N
read the next N (only; use the same P for the next iteration)
If the end of file is reached for only one of the files, treat the key
as always greater than the other from the not-yet-at-end file.
When both ends-of-file have been reached, you are done.
Close the files.
Of course, this depends on being able to read the files one record at a
time (or having a buffer to do so). Since it seems like these might be
line-oriented files, I suspect you’re OK here.
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
You or someone else posted this exact problem just a few weeks ago. Please don’t
repost if that was you. Please don’t post your homework regardless. We’re happy to
answer questions and give advice. We’re not happy to directly contribute to your
grade.
I have two large files, about 16 million records each.
The files are sorted.
The first 13 characters are used as a key.
We get an updated file every week.
We also keep the previous week file.
I need to compare the keys from the new file to the keys on the file from last
week. If the rest of the records are the same, then I do nothing. If the keys
matches but the rest of the records are different, I then have an update and I
will output that record to a new file.
I was wondering if there is an efficient way to do this in Ruby. Either any
built-in method or an efficient algorithm which I can implement.
Not sure if you dealing with hashes or arrays but take a look at the ‘-’
method for arrays, it’s blazing fast for comparing even large arrays.
You stated: “IIRC I even posted a solution outline.”
I went to your blog (http://blog.rubybestpractices.com/) but could not
find
the item you mentioned. Would you mind to post the link to it?
You stated: “IIRC I even posted a solution outline.”
I went to your blog (http://blog.rubybestpractices.com/) but could not
find the item you mentioned. Would you mind to post the link to it?
I assume he meant he posted a solution on this very list
when you(!) asked the same question in October.
Search the ruby-talk archives for
“Looking for suggestions processing and comparing 2 very large files”.
As far as I remember there were several useful suggestions.