I have been asked a few times on what is the best way to remove duplicate records. Although there are many ways to do this, and the best answer would really depend on the environment and the scenario. I would suggest developers to evaluate these two factors before jumping ahead and writing TSQL code.
In this post, I’d like to show a way of deleitng duplicate records using CTE. I think knowing as many options and their good/bad points could never hurt.
The example below uses AdventureWorks database, mainly the Person.Contact table. To make the sample easier, I copy the data from Person.Contact table into a separate table which I will use later to perform the duplicate record deletion.
So, the scenario is, I have a list of people with the same First Name and Last Name. To make it simpler, the scenario assumes that everyone in the database has a unique First Name and Last Name combination. The duplicate records to be kept are those with the higher number of email promotion count and latest ContactID value.
Here’s the sample code that I would use to delete the duplicate records for the above scenario.
USE AdventureWorks GO -- Prepare temporary table to test the deletion of duplicate records SELECT * INTO dbo.TempPersonContact FROM Person.Contact -- Check the duplicate records SELECT FirstName, LastName, COUNT(1) AS NumberOfInstances FROM dbo.TempPersonContact GROUP BY FirstName, LastName HAVING COUNT(1) > 1 ; -- Scenario: For duplicate records, keep those with the highest EmailPromotion -- number. If the EmailPromotion values tie for the duplicate reocrds, -- keep the record with the bigger ContactID -- Note: by using CTE, we can affect the original table (dbo.TempPersonContact) -- by referring to the CTE name (i.e. "Duplicates") WITH Duplicates AS ( SELECT ContactID, ROW_NUMBER() OVER( PARTITION BY FirstName, LastName ORDER BY EmailPromotion DESC, ContactID DESC) AS OrderID FROM dbo.TempPersonContact ) DELETE Duplicates WHERE OrderID > 1 -- Check the duplicate records; this should return no dataset since we have deleted in the previous statement SELECT FirstName, LastName, COUNT(1) AS NumberOfInstances FROM dbo.TempPersonContact GROUP BY FirstName, LastName HAVING COUNT(1) > 1 ; -- Clean up table /* DROP TABLE dbo.TempPersonContact */
There are two major points to note here:
- Usage of CTE and the deletion is performed on the CTE (instead of the actual table)
- Usage of ROW_NUMBER
For more information on CTE: http://msdn.microsoft.com/en-us/library/ms190766.aspx
For more information on ROW_NUMBER: http://msdn.microsoft.com/en-us/library/ms186734.aspx