On 03.11.2007 01:53, JeremyWoertink / gmail.com wrote:
> Cool. Ok I will explain the best I can the whole entire situation from
> the beginning.
> 
> I work for a company that produces plastic cards (i.e. credit cards,
> debit cards, gift cards, id cards etc..)
> 
> I receive files from customers containing anywhere from 1..100000000
> records. These files get formatted and put into a network drive in
> their respective job number folders. One customer is [...].

First of all a bit of general advice: in my experience customers are not 
happy reading their names somewhere public without their prior 
permission.  So you probably should omit them in future postings to 
avoid trouble.  Company names do not actually help in resolving a 
technical issue.

> About a year ago there was an idiot programmer who formatted the data
> incorrectly causing 150,000 duplicate cards. So because of this, the
> company wants to put in place a system of checking for duplicates.

Even in the absence of "idiot programmers" duplicate checking is a good 
idea.  There is so much that can go wrong - and you indicated yourself, 
that you might get buggy input data.  It is always a good idea to verify 
that data you receive from somewhere else complies with your 
requirements or otherwise meets your expectations.  And this holds true 
on all levels (i.e. input files obtained from somewhere, method 
arguments etc.).  So you might as well be thankful for this guy's 
mistake because the checking you are doing might save your company a lot 
of hassle in another situation. :-)

> The
> way it has to work (because I don't have a say in this even though i'm
> the developer >.<) is that the past 18 months of jobs will be loaded
> into a database (or something useful) and when we (the programmers)
> get a new file, we will generate our data from these files, then run
> our application to see if there are any duplicate records in what we
> created compared to what we have created in the past 18 months
> according to their respective companies. (i.e. we get a starbucks job,
> program it and then test it against last 18 months of starbucks
> records.)

Do you actually reload past 18 month's job data for every new job?  This 
would be a major source of inefficiency.  If you do I would rather have 
a large table that stays there and add entries with timestamps.  Then 
with a proper index (or with a table partitioned by month) you can 
efficiently delete old data (i.e. data from jobs older than 18 months). 
  If your DB product supports partitioning you should definitively use 
it as it makes deletions much more efficient.

Btw, I can't find a reference to the DB vendor that you are using. 
*That* would be an interesting brand name. :-)  Seriously, this can have 
a major impact on your options.

> If there are no duplicate files, then we will load that job into the
> database (or whatever we need) and continue on from there. If there
> are duplicates, then we have to re-program that job, and generate a
> report stating how many duplicates, and what exactly it is that was
> duplicated (i.e. account number, mag stripe, pin, customer name). The
> report is then automatically e-mailed to our boss who then walks next
> door to our office to yell at us and tell us to re-program the job.
> 
> So for a figure we did 100 million starbucks cards in the past 12
> months. We need to check the last 18 months. Starbucks jobs come in
> from a company called Valuelink, who generates the data for us. CVS
> pharmacy and Disney and McDonalds roughly equal another 100 million
> records combined. These also come from Valuelink.
> This is important, because if we program it correctly, but Valulink
> sends over a duplicate card number, then we need to be able to report
> this. So when they send over another job of 5 million cards, now we
> are checking that against 200 million + cards to see if anything is
> duplicate.
> 
> We program roughly 20-40 jobs in a day, so we need a quick way to go
> through all this and not lose time or else we will end up working
> longer days, and not be compensated for it :(

You do not give details about your jobs and what it is that gets 
duplicated.  I will just assume that it is some kind of "key", which 
might be just a single character sequence or a number of fields.  Also, 
we do not know what other information comprises a "job", so take the 
following with a grain of salt.

There are a few ways you could tackle this.  I will try to sketch some.

1. Avoid invalid jobs

You could do this by having a table with job definitions (maybe one per 
customer if keys are different) where the key has a UNIQUE constraint on 
it.  Then you prepare your data (possibly in CVS files) and load it into 
the DB.  The unique constraint will prevent duplicate jobs.

Now it depends on the DB vendor you are using.  IIRC with Oracle you can 
use SQL*Loader and have it report duplicate records, i.e. records that 
were not inserted.  Same for SQL Server, there is a key property IGNORE 
DUPLICATES which will prevent duplicate insertion.  I am not sure about 
duplicate reporting though.

If your DB vendor does not support this, you can load data into a 
temporary table and insert from there.  You can then use an approach 
from 2 to detect duplicates:

2. Detect duplicates

Approach as above but do not create a UNIQUE constraints but instead 
index the key (if your key contains multiple fields you just have a 
covering index with several columns).  Now you can check for duplicates 
with a query like this:

select key1, key2, ... keyn, count(*) occurrences
from job_table
group by key1, key2, ... keyn
having count(*) > 1

Now this query will return all key fields which occur more than once. 
If you also need other info you can do this:

select *
from job_table jt join (
select key1, key2, ... keyn
from job_table
group by key1, key2, ... keyn
having count(*) > 1
) jt_dup on jt.key1 = jt_dup.key1
and jt.key2 = jt_dup.key2
and ...
and jt.keyn = jt_dup.keyn

(The joined table is an "inline view" in case you want to look further 
into this concept.)

Using the index you can also delete duplicates pretty efficiently. 
Alternatively delete all entries from the new job, modify the data 
outside and load again.  It depends on the nature of your other 
processing which approach is better.  Either way you could also generate 
your job data from this table even with duplicates in the table by using 
SELECT DISTINCT or GROUP BY.

I hope I have given you some food for thought and this gets you started 
digging deeper.

Kind regards

	robert


PS: I'm traveling the next three days so please do not expect further 
replies too early.