Using CTE to remove duplicate records

I have been asked a few times on what is the best way to remove duplicate records.  Although there are many ways to do this, and the best answer would really depend on the environment and the scenario.  I would suggest developers to evaluate these two factors before jumping ahead and writing TSQL code.

In this post, I’d like to show a way of deleitng duplicate records using CTE.  I think knowing as many options and their good/bad points could never hurt.

The example below uses AdventureWorks database, mainly the Person.Contact table.  To make the sample easier, I copy the data from Person.Contact table into a separate table which I will use later to perform the duplicate record deletion.

So, the scenario is, I have a list of people with the same First Name and Last Name. To make it simpler, the scenario assumes that everyone in the database has a unique First Name and Last Name combination.  The duplicate records to be kept are those with the higher number of email promotion count and latest ContactID value.

Here’s the sample code that I would use to delete the duplicate records for the above scenario.

USE AdventureWorks

-- Prepare temporary table to test the deletion of duplicate records
INTO dbo.TempPersonContact
FROM Person.Contact

-- Check the duplicate records
SELECT FirstName, LastName, COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Scenario: For duplicate records, keep those with the highest EmailPromotion
-- number. If the EmailPromotion values tie for the duplicate reocrds,
-- keep the record with the bigger ContactID
-- Note: by using CTE, we can affect the original table (dbo.TempPersonContact)
-- by referring to the CTE name (i.e. "Duplicates")
WITH Duplicates AS 
     ROW_NUMBER() OVER( PARTITION BY FirstName, LastName
          ORDER BY EmailPromotion DESC, ContactID DESC) AS OrderID
FROM dbo.TempPersonContact
DELETE Duplicates
WHERE OrderID > 1

-- Check the duplicate records; this should return no dataset since we have deleted in the previous statement
SELECT FirstName, LastName,  COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Clean up table
/* DROP TABLE dbo.TempPersonContact */


There are two major points to note here:

  • Usage of CTE and the deletion is performed on the CTE (instead of the actual table)
  • Usage of ROW_NUMBER

For more information on CTE:

For more information on ROW_NUMBER:

Open your mind to Business Intelligence

At the beginning of this year, I have decided to have a look through some videos, demos and tutoials on Business Intelligence, particularly on what Microsoft has to offer.

Rafal Lukawiecki has done a fantastic work in summarising and presenting Business Intelligence and its application to any organisation (i.e. for revenue-driven and non-revenue driven organisation), n who is interested in improving their performance, This presentation was recorded live at the Stockholm Business Intelligence Conference on 21 April 2009. The technology used are a little outdated now, i.e. Microsoft Office 2007, Microsoft Sharepoint 2007 and Microsoft SQL Server 2008. However, the concept presented is truly worth the time, especially for emerging BI professionals and Business Users.

The presentation has been divided into 4 comprehensive sessions, each with enough information to understand the concept at high level allowing for viewers to research it further to master it.

Session 1: Improving Insight and Decision Making Using Microsoft Business Intelligence and SQL Server 2008

Session 2: Delivering Business Intelligence through Microsoft Office 2007

Session 3: Aggregating Knowledge in Data Warehouse and Multidimensional Analysis (+ additional breakouts for Public Sector, BI for Sales, Executive BI)

Session 4: Finding Hidden Intelligence with Predictive Anaysis of Data Mining

I hope when Denali comes out, we will be lucky to have Rafal presenting a similar session with Micrrosoft’s new BI stack.

Special thanks to Rafal Lukawiecki for the presentation and Channel 9 MSDN for facilitating the video streaming!