The issue here is that we could have duplicate entries. A duplicate entry is when both the ID and the INSTANCE from one record are found in another record. My task is to identify all dups and create a file with them. A dup can occur at any time during the 24-hour period. Each record looks like this:

ID= '71f3f1f8-3e54-4f5c-a673-841578b3b118' INSTANCE= '044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read' START= '2013-10-22 00:00:01.554328' END= '2013-10-22 00:00:01.555390'

My idea was to take each ID and do a grep against the file. That will certainly find a match because, at the very minimum, it will find itself. That's no good, because I only want to output the record when I find two or more instances of the ID/INSTANCE. I also thought about extracting just the ID/INSTANCE and creating an array such as this:

ce3ddaf5-7a43-45f8-b6a7-667c64bb9183 d8810bd7-edcc-4033-8b73-10a04e7b994b

The idea was to take each element of the first column and navigate the entire file looking for dups. I'm stuck here too, finding a good way to do this. The other problem is whether I will even be able to create an array that large. Remember that there could potentially be 2 million records.

Any thoughts?
Thank you
–
Ruby S.
I’d create a DB table with fields for ID and INSTANCE and put a
unique constraint on them. Then insert your records, rescuing the
ActiveRecord::RecordNotUnique errors and writing those entries
to your dup file (or another DB table).
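A rough sketch of that idea, assuming a standalone ActiveRecord setup with SQLite; the table, column and file names and the regexes that pull ID and INSTANCE out of each line are only illustrative:

require "active_record"

ActiveRecord::Base.establish_connection(adapter: "sqlite3",
                                        database: "dups.sqlite3")

ActiveRecord::Schema.define do
  create_table :entries, force: true do |t|
    t.string :record_id
    t.string :instance_id
    t.text   :raw_line
  end
  # The unique index is what makes duplicate inserts raise RecordNotUnique.
  add_index :entries, [:record_id, :instance_id], unique: true
end

class Entry < ActiveRecord::Base; end

File.open("dups.txt", "w") do |dups|
  File.foreach("records.txt") do |line|
    # Assumed line format: ID= '...' INSTANCE= '...' TYPE= ...
    id       = line[/\bID=\s*'([^']+)'/, 1]
    instance = line[/INSTANCE=\s*'([^']+)'/, 1]
    begin
      Entry.create!(record_id: id, instance_id: instance, raw_line: line)
    rescue ActiveRecord::RecordNotUnique
      dups.puts line   # second and later occurrences end up in the dup file
    end
  end
end

The unique index does the duplicate detection; with 2 million rows the row-by-row inserts will be the slow part, so wrapping batches of inserts in a transaction helps.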
I think you are close, but what you are considering is searching the
entire file for each [ID, Instance] pair, in essence doing n*n
comparisons. This is not good, especially when n = 1000000.
Instead, go through the file once and maintain a Set containing each
[ID, Instance] pair. For each record, check if the pair is already in
the Set. If it is, you have a duplicate. If not, add it to the Set and
move on.
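A minimal sketch of that single pass, assuming the same "ID= ... INSTANCE= ..." line format as in the question (the file names and the extraction regexes are assumptions):

require "set"

seen = Set.new                      # every [ID, INSTANCE] pair seen so far

File.open("dups.txt", "w") do |dups|
  File.foreach("records.txt") do |line|
    # Assumed line format: ID= '...' INSTANCE= '...' TYPE= ...
    id       = line[/\bID=\s*'([^']+)'/, 1]
    instance = line[/INSTANCE=\s*'([^']+)'/, 1]
    pair     = [id, instance]

    if seen.include?(pair)
      dups.puts line                # pair already seen: this line is a dup
    else
      seen.add(pair)                # first occurrence: remember it and move on
    end
  end
end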
My task is to identify all dups and create a file with them.
Your program just writes the second, third and any further occurrence.
Btw. you can simplify lines 21 to 25 to
puts line unless entries.add? entry
alternatively
entries.add? entry or puts line
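Put in context, a self-contained version of that loop could look like this; Set#add? returns nil when the element is already in the set, so only the duplicate lines get written (file names and the ID/INSTANCE regexes are again just assumptions about the format):

require "set"

seen = Set.new
File.open("dups.txt", "w") do |dups|
  File.foreach("records.txt") do |line|
    pair = [line[/\bID=\s*'([^']+)'/, 1], line[/INSTANCE=\s*'([^']+)'/, 1]]
    dups.puts line unless seen.add?(pair)
  end
end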
Agreed that you could save a lot of memory by considering a day at a time instead of all entries.
It's not about saving memory. With large files it may not even be possible to run a program that holds the entire file content in memory, because it dies from memory exhaustion. If, on the other hand, there is enough memory in the machine, then chances are that the second run through the file is done from cached file content and will be much cheaper than the first one.
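The two-pass approach being referred to isn't shown in this excerpt; a generic sketch of one such version, with the same assumed line format, counts the pairs on the first run and on the second run writes every line of a duplicated pair (so the first occurrence of each dup is written as well):

# Hypothetical two-pass sketch; file names and regexes are assumptions.
extract = lambda do |line|
  [line[/\bID=\s*'([^']+)'/, 1], line[/INSTANCE=\s*'([^']+)'/, 1]]
end

# First pass: count each [ID, INSTANCE] pair.
counts = Hash.new(0)
File.foreach("records.txt") { |line| counts[extract.call(line)] += 1 }

# Second pass: write every line whose pair occurs more than once.
File.open("dups.txt", "w") do |dups|
  File.foreach("records.txt") do |line|
    dups.puts line if counts[extract.call(line)] > 1
  end
end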
Cheers
robert