Database speed issues

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much a select from the different
tables where the specific column matches my search string. This is
really slow! Obviously if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

Thanks,

~Jeremy

On Nov 2, 7:45 pm, [email protected] wrote:

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

you could sort both lists, and find duplicates using O(n + m) time.
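
For example, a minimal sketch of that idea in Ruby (assuming the records can
be reduced to comparable keys and both lists fit in memory; everything here
is made up for illustration):

# Two-pointer scan over two sorted arrays of keys: linear after the sort.
def duplicates_between(a, b)
  a = a.sort
  b = b.sort
  dups = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      dups << a[i]      # same key on both sides => duplicate
      i += 1
      j += 1
    elsif a[i] < b[j]
      i += 1
    else
      j += 1
    end
  end
  dups
end

duplicates_between([42, 5, 10], [10, 15, 42])  # => [10, 42]

The sorting dominates the cost; the scan itself is O(n + m).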

Jeremy W. wrote:

So, I have a bit of a design problem. I have an application that works,
but not as well as I would like it to. My problem is that I had to
write an application that checks millions of records against hundreds
of millions of records to see if there are any duplicates. The only
way I could think to do this is pretty much a select from the different
tables where the specific column matches my search string. This is
really slow! Obviously if it returns something, then there is a dup;
if not, then YAY!

I know there has to be a better way than just doing a SQL select
statement. Any ideas?

I dunno how fast it is compared to your select statement, but if you can
build arrays out of the data you want to compare, you can call the % method
on them. Example…

irb> [ 5, 10 ] & [ 10, 15 ]
=> [10]

It works with more complex arrays, too…

irb> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps…

haha, you said “call the % method on them.” and then used the “&”
method.
That may work, but how long would it take to build an array from 500GB
of text files? And then, every time no duplicate exists, that file is
added to the array?

but wait… THAT’S AWESOME… hmmmm so I could probably do something
like
IO.readlines("somefile) & IO.readlines("anotherfile) and it would
return the lines that are the same.

The “anotherfile” would have to be an iteration through a complete
network drive of folders, so I could find each folder and then go
through each file inside that folder doing the same thing.
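
Roughly like this, perhaps (paths and layout are invented, and a Set keeps
the membership test fast instead of intersecting huge arrays over and over):

require 'set'

# Hypothetical layout: one freshly generated job file checked against every
# file under a network share of past jobs.
new_lines = Set.new(File.readlines("new_job.txt").map { |l| l.chomp })

dups = []
Dir.glob("/mnt/jobs/**/*.txt") do |path|
  File.foreach(path) do |line|
    dups << [path, line.chomp] if new_lines.include?(line.chomp)
  end
end

puts "#{dups.size} duplicate line(s) found"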

ughh. I hate dealing with these stupid mass quantities of records.
Thanks for the advice :)

~Jeremy


You could let the database do the work in the following way, though it's
relatively dirty:

First, duplicate the database you're comparing to. Then change the indexes
on the tables so that you can force uniqueness on the records based on the
data you're comparing. Then simply insert the records you're checking. If
the insert did not work, it's a duplicate.
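
A rough sketch of that flow, using the sqlite3 gem purely as a stand-in (the
schema, column name, and file names are assumptions; your real DB and tables
will differ):

require 'sqlite3'   # stand-in only; the real DB vendor is unknown

db = SQLite3::Database.new("cards_copy.db")   # a working copy, not production

# Force uniqueness on whatever field(s) you compare; card_number is made up.
db.execute "CREATE TABLE IF NOT EXISTS cards (card_number TEXT)"
db.execute "CREATE UNIQUE INDEX IF NOT EXISTS idx_card ON cards (card_number)"

duplicates = []
File.foreach("new_job.txt") do |line|
  card_number = line.chomp
  begin
    db.execute("INSERT INTO cards (card_number) VALUES (?)", [card_number])
  rescue SQLite3::ConstraintException
    duplicates << card_number   # insert refused, so it already exists
  end
end

puts "#{duplicates.size} duplicate(s) found"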

HTH,

Felix

“Felix W.” [email protected] writes:

First, duplicate the database you’re comparing to. Then change the indexes
on the tables so that you can force uniqueness on the records based on the
data you’re comparing. Then simply insert the records you’re checking. If
the insert did not work, it’s a duplicate.

Rather than duplicating the database, create a temporary table, load
the 2nd set of data there, create an index appropriately,
and do:

select second.* from first join second on (…) limit 1;

If the result is not an empty set, then there is a duplicate.
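
For example, again with the sqlite3 gem as a stand-in and invented
table/column names (this assumes the big cards table already exists and is
indexed on the join key):

require 'sqlite3'   # stand-in only; use whatever your real DB is

db = SQLite3::Database.new("cards.db")

# Stage the incoming job in a scratch table, index the join key, and let the
# database do the matching.
db.execute "CREATE TEMPORARY TABLE incoming (card_number TEXT)"
File.foreach("new_job.txt") do |line|
  db.execute("INSERT INTO incoming (card_number) VALUES (?)", [line.chomp])
end
db.execute "CREATE INDEX idx_incoming ON incoming (card_number)"

dup = db.get_first_row(<<~SQL)
  SELECT incoming.card_number
  FROM cards JOIN incoming ON cards.card_number = incoming.card_number
  LIMIT 1
SQL

puts(dup ? "duplicate found: #{dup.first}" : "no duplicates")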

YS.

On Nov 2, 8:14 pm, [email protected] wrote:

I know there has to be a better way than just doing a SQL select
statement.

irb> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps…

Could you be more specific?
First, if your data is in some kind of SQL database, you should look into
how well you are exploiting the power of SQL.

In my experience, if you design your indexes right, especially the
multi-column indexes that favor your select statement, you can gain a
real performance boost.
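
For instance, a multi-column index matching the duplicate-check query
(sqlite3 used only as a stand-in, names made up, and it assumes the cards
table already exists):

require 'sqlite3'   # stand-in; the CREATE INDEX syntax is similar elsewhere

db = SQLite3::Database.new("cards.db")

# Composite index matching a lookup of the form
#   SELECT 1 FROM cards WHERE customer = ? AND card_number = ?
db.execute "CREATE INDEX IF NOT EXISTS idx_cards_cust_num ON cards (customer, card_number)"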

The second thing is helping the SQL database find the right way to
do the JOINs.
This has proven worthwhile mostly with MySQL.

If you give us some more details, perhaps we can give you a more
meaningful response.

On 02.11.2007 20:43, dima wrote:

I know there has to be a better way than just doing a SQL select
statement.
irb> [ [ 5, 10 ], [ 15, 20 ] ] & [ [ 10, 15 ], [ 5, 10] ]
=> [[5, 10]]

Hope that helps…

Could you be more specific?

The information given is really weak.

If you give us some more details, perhaps we can give you a more
meaningful response.

Completely agree with all stated above! This should be offloaded to the
database.

Kind regards

robert

Cool. OK, I will explain the whole situation from the beginning as best
I can.

I work for a company that produces plastic cards (i.e. credit cards,
debit cards, gift cards, ID cards, etc.).

I receive files from customers containing anywhere from 1 to 100,000,000
records. These files get formatted and put on a network drive in
their respective job number folders. One customer is Starbucks, another
CVS Pharmacy; also McDonald's, Jack in the Box, and Mexican government
driver's licenses.

About a year ago there was an idiot programmer who formatted the data
incorrectly causing 150,000 duplicate cards. So because of this, the
company wants to put in place a system of checking for duplicates. The
way it has to work (because I don't have a say in this even though I'm
the developer >.<) is that the past 18 months of jobs will be loaded
into a database (or something useful) and when we (the programmers)
get a new file, we will generate our data from these files, then run
our application to see if there are any duplicate records in what we
created compared to what we have created in the past 18 months
according to their respective companies (i.e. we get a Starbucks job,
program it, and then test it against the last 18 months of Starbucks
records).

If there are no duplicates, then we will load that job into the
database (or whatever we need) and continue on from there. If there
are duplicates, then we have to re-program that job, and generate a
report stating how many duplicates, and what exactly it is that was
duplicated (i.e. account number, mag stripe, pin, customer name). The
report is then automatically e-mailed to our boss who then walks next
door to our office to yell at us and tell us to re-program the job.

To give a figure: we did 100 million Starbucks cards in the past 12
months. We need to check the last 18 months. Starbucks jobs come in
from a company called Valuelink, who generates the data for us. CVS
Pharmacy, Disney, and McDonald's roughly equal another 100 million
records combined. These also come from Valuelink.
This is important, because if we program it correctly but Valuelink
sends over a duplicate card number, then we need to be able to report
this. So when they send over another job of 5 million cards, we are now
checking that against 200 million+ cards to see if anything is a
duplicate.

We program roughly 20-40 jobs in a day, so we need a quick way to go
through all this and not lose time, or else we will end up working
longer days and not be compensated for it :(

I hope that explains it a little better. Thanks again for all the
advice.

~Jeremy

On 03.11.2007 01:53, [email protected] wrote:

Cool. OK, I will explain the whole situation from the beginning as best
I can.

I work for a company that produces plastic cards (i.e. credit cards,
debit cards, gift cards, ID cards, etc.).

I receive files from customers containing anywhere from 1 to 100,000,000
records. These files get formatted and put on a network drive in
their respective job number folders. One customer is […].

First of all a bit of general advice: in my experience customers are not
happy reading their names somewhere public without their prior
permission. So you probably should omit them in future postings to
avoid trouble. Company names do not actually help in resolving a
technical issue.

About a year ago there was an idiot programmer who formatted the data
incorrectly causing 150,000 duplicate cards. So because of this, the
company wants to put in place a system of checking for duplicates.

Even in the absence of “idiot programmers”, duplicate checking is a good
idea. There is so much that can go wrong - and you indicated yourself
that you might get buggy input data. It is always a good idea to verify
that data you receive from somewhere else complies with your
requirements or otherwise meets your expectations. And this holds true
on all levels (i.e. input files obtained from somewhere, method
arguments, etc.). So you might as well be thankful for this guy's
mistake, because the checking you are doing might save your company a
lot of hassle in another situation. :)

The way it has to work (because I don't have a say in this even though
I'm the developer >.<) is that the past 18 months of jobs will be loaded
into a database (or something useful) and when we (the programmers)
get a new file, we will generate our data from these files, then run
our application to see if there are any duplicate records in what we
created compared to what we have created in the past 18 months
according to their respective companies (i.e. we get a Starbucks job,
program it, and then test it against the last 18 months of Starbucks
records).

Do you actually reload the past 18 months' job data for every new job?
That would be a major source of inefficiency. If you do, I would rather
have one large table that stays there and add entries with timestamps.
Then with a proper index (or with a table partitioned by month) you can
efficiently delete old data (i.e. data from jobs older than 18 months).
If your DB product supports partitioning you should definitely use it,
as it makes deletions much more efficient.
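
As a sketch of that rolling window (sqlite3 used purely as a stand-in since
we don't know the vendor; partitioning is not shown because its syntax is
vendor-specific, and all names are placeholders):

require 'sqlite3'

db = SQLite3::Database.new("cards.db")

# One persistent table with a timestamp instead of reloading 18 months of
# data for every job.
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS cards (
    customer    TEXT,
    card_number TEXT,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
  )
SQL
db.execute "CREATE INDEX IF NOT EXISTS idx_cards_created ON cards (created_at)"

# Periodic cleanup of anything older than the 18-month window.
db.execute "DELETE FROM cards WHERE created_at < datetime('now', '-18 months')"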

Btw, I can't find a reference to the DB vendor that you are using.
That would be interesting to know. :) Seriously, this can have
a major impact on your options.

You do not give details about your jobs and what it is that gets
duplicated. I will just assume that it is some kind of “key”, which
might be just a single character sequence or a number of fields. Also,
we do not know what other information comprises a “job”, so take the
following with a grain of salt.

There are a few ways you could tackle this. I will try to sketch some.

  1. Avoid invalid jobs

You could do this by having a table with job definitions (maybe one per
customer if keys are different) where the key has a UNIQUE constraint on
it. Then you prepare your data (possibly in CSV files) and load it into
the DB. The unique constraint will prevent duplicates from being loaded.

Now it depends on the DB vendor you are using. IIRC, with Oracle you can
use SQL*Loader and have it report duplicate records, i.e. records that
were not inserted. Same for SQL Server: there is an index option,
IGNORE_DUP_KEY, which will prevent duplicate insertion. I am not sure
about duplicate reporting, though.

If your DB vendor does not support this, you can load data into a
temporary table and insert from there. You can then use the approach
from 2 below to detect duplicates:

  2. Detect duplicates

Approach as above, but do not create a UNIQUE constraint; instead,
index the key (if your key contains multiple fields you just have a
covering index with several columns). Now you can check for duplicates
with a query like this:

select key1, key2, … keyn, count(*) occurrences
from job_table
group by key1, key2, … keyn
having count(*) > 1

Now this query will return all key fields which occur more than once.
If you also need other info you can do this:

select *
from job_table jt join (
select key1, key2, … keyn
from job_table
group by key1, key2, … keyn
having count(*) > 1
) jt_dup on jt.key1 = jt_dup.key1
and jt.key2 = jt_dup.key2
and …
and jt.keyn = jt_dup.keyn

(The joined table is an “inline view” in case you want to look further
into this concept.)

Using the index you can also delete duplicates pretty efficiently.
Alternatively delete all entries from the new job, modify the data
outside and load again. It depends on the nature of your other
processing which approach is better. Either way you could also generate
your job data from this table even with duplicates in the table by using
SELECT DISTINCT or GROUP BY.

I hope I have given you some food for thought and this gets you started
digging deeper.

Kind regards

robert

PS: I’m traveling the next three days so please do not expect further
replies too early.