"Felix Windt" <fwmailinglists / gmail.com> writes:

>> -----Original Message-----
>> From: JeremyWoertink / gmail.com [mailto:JeremyWoertink / gmail.com] 
>> Sent: Friday, November 02, 2007 10:45 AM
>> To: ruby-talk ML
>> Subject: Database speed issues
>> 
>> So, I have a bit of a design problem. I have an application that works,
>> but not as well as I would like it to. My problem is that I had to
>> write an application that checks millions of records against hundreds
>> of millions of records to see if there are any duplicates. The only
>> way I could think to do this is pretty much select from the different
>> tables where the specific column matches my search string. This is
>> really slow! Obviously if it returns something, then there is a dup,
>> if not, then YAY!
>> 
>> I know there has to be a better way than just doing a SQL select
>> statement. Any ideas?
>> 
> You could let the database do the work in the following way, though it's
> relatively dirty:
>
> First, duplicate the database you're comparing to. Then change the indexes
> on the tables so that the database enforces uniqueness on the columns
> you're comparing. Then simply insert the records you're checking: if an
> insert fails, that record is a duplicate.
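A minimal sketch of that insert-and-catch approach, using Python's stdlib sqlite3 for a self-contained example (the `records` table, `email` column, and sample values are invented for illustration; the original poster's schema isn't shown):

```python
import sqlite3

# Hypothetical schema: compare records on a single "email" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, email TEXT)")
# A unique index makes the database itself reject duplicate values.
conn.execute("CREATE UNIQUE INDEX idx_records_email ON records (email)")
conn.execute("INSERT INTO records (email) VALUES ('a@example.com')")

def is_duplicate(conn, email):
    """Attempt the insert; a constraint violation means the value exists."""
    try:
        conn.execute("INSERT INTO records (email) VALUES (?)", (email,))
        return False
    except sqlite3.IntegrityError:
        return True

print(is_duplicate(conn, "a@example.com"))  # True: already present
print(is_duplicate(conn, "b@example.com"))  # False: inserted cleanly
```

The same idea carries over to MySQL or PostgreSQL with a UNIQUE constraint; each failed insert flags one duplicate without a separate SELECT round-trip.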


Rather than duplicating the database, create a temporary table, load the
second set of data there, create an appropriate index,
and do:

select second.* from first join second on (...) limit 1;

If the result is not an empty set, then there is a duplicate.
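The temp-table-and-join version, again sketched in Python with the stdlib sqlite3 module (the `first`/`second` table names follow the query above; the column and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "first" stands in for the large existing table.
conn.execute("CREATE TABLE first (email TEXT)")
conn.executemany("INSERT INTO first VALUES (?)",
                 [("a@example.com",), ("b@example.com",)])

# Load the second data set into a temporary table and index the join column.
conn.execute("CREATE TEMP TABLE second (email TEXT)")
conn.executemany("INSERT INTO second VALUES (?)",
                 [("b@example.com",), ("z@example.com",)])
conn.execute("CREATE INDEX idx_second_email ON second (email)")

# A single indexed join answers "is there any duplicate?" in one query.
row = conn.execute(
    "SELECT second.* FROM first JOIN second"
    " ON first.email = second.email LIMIT 1"
).fetchone()
print("duplicate found" if row else "no duplicates")  # prints "duplicate found"
```

Dropping the LIMIT 1 turns the same join into a full listing of every duplicate, which is still one indexed query instead of millions of individual SELECTs.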

YS.