On 03.11.2007 01:53, [email protected] wrote:
Cool. Ok I will explain the best I can the whole entire situation from
the beginning.
I work for a company that produces plastic cards (i.e. credit cards,
debit cards, gift cards, id cards etc…)
I receive files from customers containing anywhere from 1…100000000
records. These files get formatted and put into a network drive in
their respective job number folders. One customer is […].
First of all a bit of general advice: in my experience customers are not
happy reading their names somewhere public without their prior
permission. So you probably should omit them in future postings to
avoid trouble. Company names do not actually help in resolving a
technical issue.
About a year ago there was an idiot programmer who formatted the data
incorrectly causing 150,000 duplicate cards. So because of this, the
company wants to put in place a system of checking for duplicates.
Even in the absence of “idiot programmers”, duplicate checking is a good
idea. There is so much that can go wrong - and you indicated yourself
that you might get buggy input data. It is always a good idea to verify
that data you receive from somewhere else complies with your
requirements or otherwise meets your expectations. And this holds true
at all levels (e.g. input files obtained from somewhere, method
arguments etc.). So you might as well be thankful for this guy’s
mistake, because the checking you are doing might save your company a
lot of hassle in another situation.
The
way it has to work (because I don’t have a say in this even though i’m
the developer >.<) is that the past 18 months of jobs will be loaded
into a database (or something useful) and when we (the programmers)
get a new file, we will generate our data from these files, then run
our application to see if there are any duplicate records in what we
created compared to what we have created in the past 18 months
according to their respective companies. (i.e. we get a starbucks job,
program it and then test it against last 18 months of starbucks
records.)
Do you actually reload the past 18 months’ job data for every new job?
That would be a major source of inefficiency. If you do, I would rather
keep one large table that stays in place and add entries with
timestamps. Then, with a proper index (or with a table partitioned by
month), you can efficiently delete old data (i.e. data from jobs older
than 18 months). If your DB product supports partitioning you should
definitely use it, as it makes deletions much more efficient.
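A minimal sketch of that idea, with made-up table and column names
(card_records, customer_id, card_key, loaded_at) - adapt this to
whatever your records actually contain:

-- One persistent table holding all card records generated in the last
-- 18 months.  Names are illustrative only.
CREATE TABLE card_records (
  customer_id  VARCHAR(30)  NOT NULL,  -- customer the job belongs to
  card_key     VARCHAR(50)  NOT NULL,  -- whatever identifies a card
  loaded_at    DATE         NOT NULL   -- when the job was loaded
);

-- Index the timestamp so old data can be found and removed efficiently.
CREATE INDEX idx_card_records_loaded_at ON card_records (loaded_at);

-- Periodic cleanup: drop everything older than 18 months.  With a table
-- partitioned by month this becomes a cheap "drop partition" instead.
DELETE FROM card_records
 WHERE loaded_at < ADD_MONTHS(SYSDATE, -18);  -- Oracle syntax; adjust to your DB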
Btw, I can’t find a reference to the DB vendor that you are using -
that would be an interesting brand name to know. Seriously, this can
have a major impact on your options.
[…] from a company called Valuelink, who generates the data for us. CVS […]
[…] longer days, and not be compensated for it […]
You do not give details about your jobs and what it is that gets
duplicated. I will just assume that it is some kind of “key”, which
might be just a single character sequence or a number of fields. Also,
we do not know what other information comprises a “job”, so take the
following with a grain of salt.
There are a few ways you could tackle this. I will try to sketch some.
- Avoid invalid jobs
You could do this by having a table with job definitions (maybe one per
customer if keys are different) where the key has a UNIQUE constraint on
it. Then you prepare your data (possibly as CSV files) and load it into
the DB. The unique constraint will prevent duplicate jobs.
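For illustration, such a table (one per customer, or one shared table
with a customer column) might look like this - the names are made up,
and the key may of course span several fields:

-- Hypothetical job table with a UNIQUE constraint on the key field.
CREATE TABLE starbucks_jobs (
  card_key    VARCHAR(50)  NOT NULL,
  job_number  VARCHAR(20)  NOT NULL,
  loaded_at   DATE         NOT NULL,
  CONSTRAINT uq_starbucks_card_key UNIQUE (card_key)
);

-- Any insert of a card_key that is already present now fails, so
-- duplicates cannot silently enter the table.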
Now it depends on the DB vendor you are using. IIRC with Oracle you can
use SQL*Loader and have it report duplicate records, i.e. records that
were rejected instead of inserted. SQL Server has something similar:
the IGNORE_DUP_KEY option on a unique index will prevent duplicate
insertion, although I am not sure about duplicate reporting there.
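A sketch of the SQL Server variant, again with made-up names - with the
option switched on, duplicate rows are skipped with a warning instead of
failing the whole insert:

-- SQL Server: unique index that silently discards duplicate key values.
CREATE UNIQUE INDEX ux_starbucks_jobs_card_key
    ON starbucks_jobs (card_key)
  WITH (IGNORE_DUP_KEY = ON);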
If your DB vendor does not support this, you can load the data into a
temporary staging table and insert from there (see the sketch below).
You can then use the approach from the next point, “Detect duplicates”,
to find the offending records.
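One way to do the copy from the staging table, inserting only rows whose
key is not yet in the main table (all names are invented):

-- Load the raw file into staging_jobs first, then copy only new keys.
INSERT INTO starbucks_jobs (card_key, job_number, loaded_at)
SELECT s.card_key, s.job_number, s.loaded_at
  FROM staging_jobs s
 WHERE NOT EXISTS (
         SELECT 1
           FROM starbucks_jobs j
          WHERE j.card_key = s.card_key
       );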
- Detect duplicates
Proceed as above, but do not create a UNIQUE constraint; instead, index
the key (if your key contains multiple fields you just use a covering
index with several columns). Now you can check for duplicates with a
query like this:
select key1, key2, … keyn, count(*) occurrences
from job_table
group by key1, key2, … keyn
having count(*) > 1
This query will return all key combinations which occur more than once.
If you also need other fields you can do this:
select *
from job_table jt join (
select key1, key2, … keyn
from job_table
group by key1, key2, … keyn
having count(*) > 1
) jt_dup on jt.key1 = jt_dup.key1
and jt.key2 = jt_dup.key2
and …
and jt.keyn = jt_dup.keyn
(The joined table is an “inline view” in case you want to look further
into this concept.)
Using the index you can also delete duplicates pretty efficiently.
Alternatively, delete all entries from the new job, modify the data
outside the database and load again. Which approach is better depends
on the nature of your other processing. Either way you could also
generate your job data from this table, even with duplicates present,
by using SELECT DISTINCT or GROUP BY.
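For example, one common way to remove duplicates while keeping one row
per key looks like this - it assumes the table has some unique row
identifier (a surrogate id column here; on Oracle you could use ROWID):

-- Keep the row with the smallest id per key, delete the rest.
DELETE FROM starbucks_jobs
 WHERE id NOT IN (
         SELECT MIN(id)
           FROM starbucks_jobs
          GROUP BY card_key
       );

-- Or generate the job output ignoring duplicates without deleting:
SELECT DISTINCT card_key, job_number
  FROM starbucks_jobs;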
I hope I have given you some food for thought and this gets you started
digging deeper.
Kind regards
robert
PS: I’m traveling the next three days so please do not expect further
replies too early.