"Felix Windt" <fwmailinglists / gmail.com> writes: >> -----Original Message----- >> From: JeremyWoertink / gmail.com [mailto:JeremyWoertink / gmail.com] >> Sent: Friday, November 02, 2007 10:45 AM >> To: ruby-talk ML >> Subject: Database speed issues >> >> So, I have a bit of a design problem. I have an application that work, >> but not as well as I would like it to. My problem is that I had to >> write an application that checks millions of records against hundreds >> of millions of records to see if there are any duplicates. The only >> way I could think to do this is pretty much select from the different >> tables where the specific column matches my search string. This is >> really slow! Obviously if it returns something, then there is a dup, >> if not, then YAY! >> >> I know there has to be a better way then just doing a SQL select >> statement. Any ideas? >> > You could let the database do the work in the following way, though it's > relatively dirty: > > First, duplicate the database you're comparing to. Then change the indexes > on the tables so that you can force uniqueness on the records based on the > data you're comparing. Then simply insert the records you're checking. If > the insert did not work, it's a duplicate. Rather than duplicating the database, create a temporary table, load the 2nd set of data there, create index appropriately, and do: select second.* from first join second on (...) limit 1; If the result is not an empty set, then there is duplicate. YS.