Using CTE to remove duplicate records

I have been asked a few times on what is the best way to remove duplicate records.  Although there are many ways to do this, and the best answer would really depend on the environment and the scenario.  I would suggest developers to evaluate these two factors before jumping ahead and writing TSQL code.

In this post, I’d like to show a way of deleitng duplicate records using CTE.  I think knowing as many options and their good/bad points could never hurt.

The example below uses AdventureWorks database, mainly the Person.Contact table.  To make the sample easier, I copy the data from Person.Contact table into a separate table which I will use later to perform the duplicate record deletion.

So, the scenario is, I have a list of people with the same First Name and Last Name. To make it simpler, the scenario assumes that everyone in the database has a unique First Name and Last Name combination.  The duplicate records to be kept are those with the higher number of email promotion count and latest ContactID value.

Here’s the sample code that I would use to delete the duplicate records for the above scenario.

USE AdventureWorks
GO

-- Prepare temporary table to test the deletion of duplicate records
SELECT *
INTO dbo.TempPersonContact
FROM Person.Contact

-- Check the duplicate records
SELECT FirstName, LastName, COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Scenario: For duplicate records, keep those with the highest EmailPromotion
-- number. If the EmailPromotion values tie for the duplicate reocrds,
-- keep the record with the bigger ContactID
-- Note: by using CTE, we can affect the original table (dbo.TempPersonContact)
-- by referring to the CTE name (i.e. "Duplicates")
WITH Duplicates AS 
(
SELECT
     ContactID,
     ROW_NUMBER() OVER( PARTITION BY FirstName, LastName
          ORDER BY EmailPromotion DESC, ContactID DESC) AS OrderID
FROM dbo.TempPersonContact
)
DELETE Duplicates
WHERE OrderID > 1

-- Check the duplicate records; this should return no dataset since we have deleted in the previous statement
SELECT FirstName, LastName,  COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Clean up table
/* DROP TABLE dbo.TempPersonContact */

 

There are two major points to note here:

  • Usage of CTE and the deletion is performed on the CTE (instead of the actual table)
  • Usage of ROW_NUMBER

For more information on CTE: http://msdn.microsoft.com/en-us/library/ms190766.aspx

For more information on ROW_NUMBER: http://msdn.microsoft.com/en-us/library/ms186734.aspx