Help with ideas for finding dups on very large file

Hello Team,

I have a file with over a million records in chronological order. A piece
of the file looks like this:

START= '2013-10-22 00:00:00.436810' END= 2013-10-22 00:00:00.448811
ID= '3b9e6ca5-27fc-4481-af6a-e0860689c054' INSTANCE=
'8e09cdcc-8597-4a05-84c0-ccb53be3a50b' TYPE= 'read'
START= '2013-10-22 00:00:00.570842' END= 2013-10-22 00:00:00.578333
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'
START= '2013-10-22 00:00:00.649034' END= 2013-10-22 00:00:00.656009
ID= 'b2f445ab-23f8-4859-a16d-7eecb58f378b' INSTANCE=
'1fc40423-7645-48ff-854d-a5359691ac67' TYPE= 'read'
START= '2013-10-22 00:00:00.658711' END= 2013-10-22 00:00:00.669888
ID= '4677fd94-7f69-4296-b8f0-d02d3a034cf7' INSTANCE=
'5bf6e04f-be12-4132-946a-d62a2ffc41d3' TYPE= 'read'
START= '2013-10-22 00:00:00.760461' END= 2013-10-22 00:00:00.767010
ID= 'a1489b31-375f-4793-9ed2-72c391918f40' INSTANCE=
'4a7de027-3bb3-4586-acc4-559d1dd52e27' TYPE= 'read'
START= '2013-10-22 00:00:00.969661' END= 2013-10-22 00:00:01.005119
ID= '71f3f1f8-3e54-4f5c-a673-841578b3b118' INSTANCE=
'ae2a659a-a482-4caa-a69f-81e8d422bd06' TYPE= 'read'
START= '2013-10-22 00:00:01.204287' END= 2013-10-22 00:00:01.215604
ID= 'fd27a3f7-35a6-4de9-bfa1-c416737a3cc5' INSTANCE=
'ec438e7e-7801-4b4a-8559-26a279347bc2' TYPE= 'read'
START= '2013-10-22 00:00:01.383654' END= 2013-10-22 00:00:01.407878
ID= '324c5c6c-ad2a-4389-b26d-0354f030da68' INSTANCE=
'384e9658-1e13-47fe-a1de-2542f04dea2f' TYPE= 'read'
START= '2013-10-22 00:00:01.485704' END= 2013-10-22 00:00:01.494108
ID= '11ca58b5-5e7c-4f17-a4a0-75383d1d8a13' INSTANCE=
'49bb66e4-d771-4aee-9a36-d5fba2babdb5' TYPE= 'read'
START= '2013-10-22 00:00:01.554328' END= 2013-10-22 00:00:01.555390
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'
START= '2013-10-22 00:00:01.699761' END= 2013-10-22 00:00:01.731712
ID= '374af9f1-d344-4e89-9f82-33c8960edd77' INSTANCE=
'dcab9192-1ec5-4246-a133-599c5107781c' TYPE= 'read'

The issue here is that we could have duplicate entries. A duplicate entry
is when both the ID & the INSTANCE from one record are found on another
record. My task is to identify all dups and create a file with them. A dup
can occur at any time during the 24-hour period.
My idea was to take each ID and do a grep against the file. That will
certainly find a match, because at the very minimum it will find itself.
That's no good, because I only want to output the record when I find two
or more instances of the ID/INSTANCE pair. I also thought about extracting
just the ID/INSTANCE pairs and creating an array such as this:

519233e9-04fe-45c2-80c1-f1dbae752fee 1141003f-9af1-4868-861b-10ba9f54478b
6d1a98e7-b74b-4ed3-a1fd-a29b231dae84 8ac2385b-bbfc-4e2f-ae03-ed5fddbc8285
0137f425-6770-4fd4-abf0-b4827d6f563f 6cb955a9-fd29-4787-82af-72120fa9362b
3acf4174-3b21-4816-b44d-4b8379e9ea5e 64d5423d-2a4b-4e74-a87c-1f15b13330bc
91c3a54b-6487-4bf2-9a95-7a661f05661c 75b9896c-7f5f-433b-8613-fcea32ed22bf
ce3ddaf5-7a43-45f8-b6a7-667c64bb9183 d8810bd7-edcc-4033-8b73-10a04e7b994b
b2658eba-7186-43f4-b0f1-bef322a6cfd5 66a378d0-32fe-4305-9529-42524ec69cc8
78b31a85-e8c8-42cc-b6d4-6ca3e6f86776 ecdfc94c-8996-41bf-8ceb-f9b647065a4d
a0a27c60-9b5d-44a8-9ac2-73dd619c63cf f4716c2e-bac2-4090-8a01-fb11693b58e0
8f6357df-532b-4f30-bc11-8e775137c957 a71de32e-9a12-4003-987c-9a7115fb3f2a
253fb047-0b21-4550-be95-1f86779f361c cb1c8e25-7e6f-4851-a74c-291a910fe93f

The idea was to take each element of the first column and walk the entire
file looking for dups. I'm also stuck here on finding a good way to do
this. The other problem is whether I will even be able to create an array
that large. Remember that there could potentially be 2 million records.

Any thoughts?

Thank you

On Thu, Oct 24, 2013 at 11:17 AM, Ruby S. [email protected]
wrote:

The issue here is that we could have duplicate entries. A duplicate entry is
when both the ID & the INSTANCE from one record are found on another record.
My task is to identify all dups and create a file with them.

I’d create a DB table with fields for ID and INSTANCE and put a
unique constraint on them. Then insert your records, rescuing the
ActiveRecord::RecordNotUnique errors and writing those entries
to your dup file (or another DB table).
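
In case it helps, here is a rough, untested sketch of that approach against
a throwaway SQLite database. The file names, column names, and parsing
regex are made up for illustration; it assumes one record per line, as in
your real file.

# Needs the activerecord and sqlite3 gems.
require 'active_record'

ActiveRecord::Base.establish_connection(adapter: 'sqlite3',
                                        database: 'dedup.sqlite3')

ActiveRecord::Schema.define do
  create_table :entries, force: true do |t|
    t.string :record_id   # "id" would clash with the primary key column
    t.string :instance
  end
  # The unique index is what makes a duplicate insert raise.
  add_index :entries, [:record_id, :instance], unique: true
end

class Entry < ActiveRecord::Base; end

File.open('dups.txt', 'w') do |dups|
  File.foreach('records.txt') do |line|
    # Pull the quoted ID= and INSTANCE= values out of the record line.
    id, instance = line.scan(/(?:ID|INSTANCE)=\s*'([^']+)'/).flatten
    next unless id && instance
    begin
      Entry.create!(record_id: id, instance: instance)
    rescue ActiveRecord::RecordNotUnique
      dups.puts line   # second or later occurrence of this ID/INSTANCE pair
    end
  end
end

As written, only the second and later occurrences of a pair end up in the
dup file; the first occurrence stays only in the table.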

FWIW,

On 10/24/2013 11:17 AM, Ruby S. wrote:

My idea was to take each ID and do a grep against the file. That will
certainly find a match, because at the very minimum it will find itself.

I think you are close, but what you are considering is searching the
entire file for each [ID, Instance] pair, in essence doing n*n
comparisons. This is not good, especially when n = 1000000.

Instead, go through the file once and maintain a Set containing each
[ID, Instance] pair. For each record, check if the pair is already in
the Set. If it is, you have a duplicate. If not, add it to the Set and
move on.
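
A rough sketch of that single-pass Set approach (file names and the
parsing regex are placeholders; it assumes one record per line):

require 'set'

seen = Set.new   # every [ID, INSTANCE] pair seen so far

File.open('dups.txt', 'w') do |dups|
  File.foreach('records.txt') do |line|
    # Pull the quoted ID= and INSTANCE= values out of the record line.
    id, instance = line.scan(/(?:ID|INSTANCE)=\s*'([^']+)'/).flatten
    next unless id && instance
    pair = [id, instance]
    if seen.include?(pair)
      dups.puts line      # pair already seen -> this record is a dup
    else
      seen << pair
    end
  end
end

Very roughly, a couple of million such pairs is on the order of a few
hundred MB in MRI, so the "array that large" worry is usually manageable
on a normal machine.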

-Justin

On Thu, Oct 24, 2013 at 11:00 PM, Reid T. [email protected]
wrote:

I also think we need a two-step approach, because storing whole files
in memory is not an option.

Since this is a Ruby group, I put this into Ruby code:
https://gist.github.com/rklemme/7151692
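
The gist itself isn't reproduced here, but a minimal two-pass sketch of
the same idea (with placeholder file names) might look like this: pass one
counts each [ID, INSTANCE] pair, pass two writes every record, including
the first occurrence, whose pair occurs more than once.

counts = Hash.new(0)   # [ID, INSTANCE] pair => number of occurrences

extract_pair = lambda do |line|
  line.scan(/(?:ID|INSTANCE)=\s*'([^']+)'/).flatten
end

# Pass 1: count every pair.
File.foreach('records.txt') do |line|
  id, instance = extract_pair.call(line)
  counts[[id, instance]] += 1 if id && instance
end

# Pass 2: write every record whose pair occurs at least twice,
# including the first occurrence.
File.open('dups.txt', 'w') do |dups|
  File.foreach('records.txt') do |line|
    id, instance = extract_pair.call(line)
    dups.puts line if id && instance && counts[[id, instance]] > 1
  end
end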

Room for improvement if memory is really low:

  • store pairs in a compressed form
  • only look at entries from a single day at a time

Cheers

robert

On Thu, 2013-10-24 at 14:17 -0400, Ruby S. wrote:



use [gn]awk

$ cat /tmp/data
START= '2013-10-22 00:00:00.436810' END= 2013-10-22 00:00:00.448811
ID= '3b9e6ca5-27fc-4481-af6a-e0860689c054' INSTANCE=
'8e09cdcc-8597-4a05-84c0-ccb53be3a50b' TYPE= 'read'
START= '2013-10-22 00:00:00.570842' END= 2013-10-22 00:00:00.578333
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'
START= '2013-10-22 00:00:00.649034' END= 2013-10-22 00:00:00.656009
ID= 'b2f445ab-23f8-4859-a16d-7eecb58f378b' INSTANCE=
'1fc40423-7645-48ff-854d-a5359691ac67' TYPE= 'read'
START= '2013-10-22 00:00:00.658711' END= 2013-10-22 00:00:00.669888
ID= '4677fd94-7f69-4296-b8f0-d02d3a034cf7' INSTANCE=
'5bf6e04f-be12-4132-946a-d62a2ffc41d3' TYPE= 'read'
START= '2013-10-22 00:00:00.760461' END= 2013-10-22 00:00:00.767010
ID= 'a1489b31-375f-4793-9ed2-72c391918f40' INSTANCE=
'4a7de027-3bb3-4586-acc4-559d1dd52e27' TYPE= 'read'
START= '2013-10-22 00:00:00.969661' END= 2013-10-22 00:00:01.005119
ID= '71f3f1f8-3e54-4f5c-a673-841578b3b118' INSTANCE=
'ae2a659a-a482-4caa-a69f-81e8d422bd06' TYPE= 'read'
START= '2013-10-22 00:00:01.204287' END= 2013-10-22 00:00:01.215604
ID= 'fd27a3f7-35a6-4de9-bfa1-c416737a3cc5' INSTANCE=
'ec438e7e-7801-4b4a-8559-26a279347bc2' TYPE= 'read'
START= '2013-10-22 00:00:01.383654' END= 2013-10-22 00:00:01.407878
ID= '324c5c6c-ad2a-4389-b26d-0354f030da68' INSTANCE=
'384e9658-1e13-47fe-a1de-2542f04dea2f' TYPE= 'read'
START= '2013-10-22 00:00:01.485704' END= 2013-10-22 00:00:01.494108
ID= '11ca58b5-5e7c-4f17-a4a0-75383d1d8a13' INSTANCE=
'49bb66e4-d771-4aee-9a36-d5fba2babdb5' TYPE= 'read'
START= '2013-10-22 00:00:01.554328' END= 2013-10-22 00:00:01.555390
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'
START= '2013-10-22 00:00:01.699761' END= 2013-10-22 00:00:01.731712
ID= '374af9f1-d344-4e89-9f82-33c8960edd77' INSTANCE=
'dcab9192-1ec5-4246-a133-599c5107781c' TYPE= 'read'

[16:35:08][0s] [email protected]> ~
$ gawk '{ COLUMN[sprintf("%51s %21s", $8, $10)]++ } END { for (j in COLUMN) print j, COLUMN[j] }' /tmp/data
'b2f445ab-23f8-4859-a16d-7eecb58f378b' '1fc40423-7645-48ff-854d-a5359691ac67' 1
'324c5c6c-ad2a-4389-b26d-0354f030da68' '384e9658-1e13-47fe-a1de-2542f04dea2f' 1
'4677fd94-7f69-4296-b8f0-d02d3a034cf7' '5bf6e04f-be12-4132-946a-d62a2ffc41d3' 1
'11ca58b5-5e7c-4f17-a4a0-75383d1d8a13' '49bb66e4-d771-4aee-9a36-d5fba2babdb5' 1
'71f3f1f8-3e54-4f5c-a673-841578b3b118' 'ae2a659a-a482-4caa-a69f-81e8d422bd06' 1
'fd27a3f7-35a6-4de9-bfa1-c416737a3cc5' 'ec438e7e-7801-4b4a-8559-26a279347bc2' 1
'374af9f1-d344-4e89-9f82-33c8960edd77' 'dcab9192-1ec5-4246-a133-599c5107781c' 1
'0138c106-1879-458a-bdba-d3074c700e54' '044ddf01-6cca-4c4f-bc74-c48b859cbea7' 2  <---- your sample data has one duplicate instance in it
'3b9e6ca5-27fc-4481-af6a-e0860689c054' '8e09cdcc-8597-4a05-84c0-ccb53be3a50b' 1
'a1489b31-375f-4793-9ed2-72c391918f40' '4a7de027-3bb3-4586-acc4-559d1dd52e27' 1

so … minor changes…

$ gawk '{ COLUMN[sprintf("grep \"%s .* %s\"", $8, $10)]++ } END { for (j in COLUMN) if (COLUMN[j] > 1) print j, FILENAME }' /tmp/data
grep "'0138c106-1879-458a-bdba-d3074c700e54' .* '044ddf01-6cca-4c4f-bc74-c48b859cbea7'" /tmp/data

copy and paste as command …
[16:58:36][0s] [email protected]> ~
$ grep "'0138c106-1879-458a-bdba-d3074c700e54' .* '044ddf01-6cca-4c4f-bc74-c48b859cbea7'" /tmp/data
START= '2013-10-22 00:00:00.570842' END= 2013-10-22 00:00:00.578333
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'
START= '2013-10-22 00:00:01.554328' END= 2013-10-22 00:00:01.555390
ID= '0138c106-1879-458a-bdba-d3074c700e54' INSTANCE=
'044ddf01-6cca-4c4f-bc74-c48b859cbea7' TYPE= 'read'

write a script to combine all this into one…

On 10/25/2013 01:59 AM, Robert K. wrote:

  • only look at entries from a single day at a time

Cheers

robert

Why parse each file twice?

https://gist.github.com/presidentbeef/7156955

Agree that you could save a lot of memory by considering a day at a time
instead of all entries.

-Justin

On Fri, Oct 25, 2013 at 6:12 PM, Justin C. [email protected]
wrote:

Room for improvement if memory is really low:

  • store pairs in a compressed form
  • only look at entries from a single day at a time

Why parse each file twice?

Because of the subject.

https://gist.github.com/presidentbeef/7156955

That program does not satisfy the requirement:

My task is to identify all dups and create a file with them.

Your program just writes the second, third, and any further occurrences.

Btw. you can simplify lines 21 to 25 of the gist to

puts line unless entries.add? entry

alternatively

entries.add? entry or puts line
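
(That works because Set#add? returns nil when the element is already
present, and the set itself when it was newly added. A tiny illustration:)

require 'set'

entries = Set.new

%w[a b a].each do |entry|
  line = "record for #{entry}"
  # add? returns nil for the second "a", so only that line is printed
  puts line unless entries.add? entry
end
# => record for a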

Agree that you could save a lot of memory by considering a day at a time
instead of all entries.

It’s not about saving memory. With large files it may not even be
possible to execute a program which holds all file content in memory
because it dies from memory exhaustion. If OTOH there is enough
memory in the machine then chances are that the second run through the
files is done from cached file content and will be much cheaper than
the first one.

Cheers

robert