ABC Classification With SQL Server Window Function

There are many benefits to using Window Functions with the OVER clause, which have been enhanced in SQL Server 2012. In this blog post, we will go through a use case example of ABC Analysis / Classification using a Window Function.

Let us examine the definition of ABC Analysis.

To borrow the definition from Wikipedia: ABC analysis provides a mechanism for identifying items that will have a significant impact on overall inventory cost.

ABC Analysis Scenario

Marco Russo has posted a very useful ABC classification using Power Pivot here. Using a similar rule, we will classify products as follows:

  • Products that make up the first 70% of sales are in Class A
  • Products that make up the next 20% of sales are in Class B
  • Products that make up the last 10% of sales are in Class C

So, let’s take a look at how we can apply the new Window Function to do this. In this scenario, we will use AdventureWorksDW2012 to classify product models based on their sales. The breakdown of the logic to perform the ABC classification in this scenario is:

  1. Calculate the total sales of each product model.
  2. Calculate the cumulative sales of each product model in the order of the Sales amount (higher to lower values).
  3. Calculate the percentage of the cumulative sales of each product model against the total sales.
  4. Determine the classification based on the percentage of cumulative sales.

In the example below, I use a CTE to calculate the sales of each product model (step 1). Steps 2 to 4 can then be done in one go. Please note that there are many ways of calculating an ABC classification; below is just one of them.


USE AdventureWorksDW2012
GO

WITH ProductSales AS
(
	-- Get the total for each Product Model
	SELECT 
		dp.[ModelName], 
		SUM(ssd.[SalesAmount]) Sales
	FROM [dbo].[FactInternetSales] ssd
		INNER JOIN [dbo].[dimProduct] dp
		ON ssd.[ProductKey] = dp.[ProductKey]
	GROUP BY 
	dp.[ModelName]
)
-- Calculate cumulative total and categorise it based on the percentage.
-- Product Models that have high sales amounts and make up the first 70% 
-- will be classified as A, the next 20% is B and the rest is C.
-- Product Models in A generate more revenue.
SELECT
	ps.[ModelName], 
	ps.[Sales],
	SUM(ps.[Sales]) OVER (ORDER BY ps.[Sales] DESC) AS CumulativeSales,
	SUM(ps.[Sales]) OVER () AS TotalSales,
	SUM(ps.[Sales]) OVER (ORDER BY ps.[Sales] DESC) / SUM(ps.[Sales]) OVER () AS CumulativePercentage,
	CASE 
		WHEN SUM(ps.[Sales]) OVER (ORDER BY ps.[Sales] DESC) / SUM(ps.[Sales]) OVER () <= 0.7 
			THEN 'A'
		WHEN SUM(ps.[Sales]) OVER (ORDER BY ps.[Sales] DESC) / SUM(ps.[Sales]) OVER () <= 0.9 
			THEN 'B'
		ELSE 'C'
	END AS Class
FROM    ProductSales ps
GROUP BY 
	ps.[ModelName],
	ps.[Sales];


The result looks like this (exported to Excel):

ABC Analysis on AdventureWorks Product Model


Have I Done This Correctly?

This example is very similar to Marco Russo’s post on ABC Analysis Using Power Pivot. To check that I’ve done this correctly, I created a Pivot Table based on the table above (no Power Pivot used). The resulting Pivot Table is very similar to Marco’s.

Pivot Table of ABC Classification


Using PARTITION BY For Additional Context

What if I’m interested in the ABC Analysis of product models for each year? This is simply addressed by using PARTITION BY on Year to get the Year context. In the CTE part, I add YEAR(OrderDate) to obtain the Year that I will be partitioning by.

ABC Analysis per Order Year


The code will look similar to this:


USE AdventureWorksDW2012
GO

WITH ProductSales AS
(
	-- Get the total for each Product Model and Year
	SELECT 
		YEAR(ssd.[OrderDate]) AS OrderYear,
		dp.[ModelName], 
		SUM(ssd.[SalesAmount]) Sales
	FROM [dbo].[FactInternetSales] ssd
		INNER JOIN [dbo].[dimProduct] dp
		ON ssd.[ProductKey] = dp.[ProductKey]
	GROUP BY 
		dp.[ModelName],
		YEAR(ssd.[OrderDate])
)
-- Calculate cumulative total and categorise it based on the percentage.
-- Product Models each year that have high sales amounts and make up 
-- the first 70% will be classified as A, the next 20% is B and the rest is C.
-- Product Models in A generate more revenue.
SELECT
	ps.[OrderYear],
	ps.[ModelName], 
	ps.[Sales],
	SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) AS CumulativeSales,
	SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) AS TotalSales,
	SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) 
		/ SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) AS CumulativePercentage,
	CASE 
		WHEN SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) 
				/ SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) <= 0.7 
			THEN 'A'
		WHEN SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) 
				/ SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) <= 0.9 
			THEN 'B'
		ELSE 'C'
	END AS Class
FROM    ProductSales ps
GROUP BY 
	ps.[OrderYear],
	ps.[ModelName],	
	ps.[Sales]
ORDER BY 
	ps.[OrderYear],
	CASE 
		WHEN SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) 
				/ SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) <= 0.7 
			THEN 'A'
		WHEN SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear] ORDER BY ps.[Sales] DESC) 
				/ SUM(ps.[Sales]) OVER (PARTITION BY ps.[OrderYear]) <= 0.9 
			THEN 'B'
		ELSE 'C'
	END ;


The result will look similar to this (I have updated my AdventureWorksDW to start in Year 2001 instead of the default one).

ABC Analysis on Product Model Per Year


Wrap Up

A number of Window Functions are very useful for analytics. Here we used SUM with OVER and PARTITION BY to generate cumulative totals and to calculate percentages, which can then be used to derive the ABC classification.

This post is part of the #SQLCoop series on Window Functions.

#SQLCoop

I would highly recommend these other Window Function articles by #SQLCoop bloggers:

A Date At The End of The Month by Mickey Stuewe

Windows Functions: Who Knew by Chris Yates

Write readable and high-performance queries with Window Functions by Jeffrey Verheul


Further Reading

ABC Analysis in PowerPivot by Marco Russo

ABC Analysis on Wikipedia

SUM (Transact-SQL) on Technet


WASD with SSMS Tools Pack 2.7.2 Review

I have been using Windows Azure SQL Database (WASD) for a little while now, and I think it would be a good idea to start a WASD Tools Review series on tools that work well for WASD. To kick off the series, I would like to explore the top 8 features of SSMS Tools Pack 2.7.2 that have worked well for me in WASD development / administration (and which also work well for on-premises SQL Server instances).

As you may already know, SSMS Tools Pack was originally built for SQL Server 2005 and has since been enhanced to keep up with later versions of SQL Server. If you have not used it before, download it from here and read a comprehensive review by Grant Fritchey.

All the options offered by SSMS Tools Pack 2.7.2 work for SQL Server, but not all of them are relevant for WASD instances. One example is the pre-configured “DB – DBCC SHOWCONTIG” custom script. At the time of writing, this option is not automatically checked by the tool for its validity to run on WASD instances. Running this script would result in the following error:

Msg 40518, Level 16, State 1, Line 1
DBCC command 'SHOWCONTIG' is not supported in this version of SQL Server.

This may change in the future though – if you would like to suggest enhancements to any part of the tool, I would highly encourage you to provide the author (Mladen Prajdić) with some feedback.


A Quick Introduction

Once you have installed the SSMS Tools Pack add-in for SQL Server Management Studio, you can right-click on items under the WASD instance in Object Explorer, similar to below.

WASD - STP Server Options

Options / configurations are usually accessible from the SSMS Tools menu.

SSMS Tools Menu


Below are my top 8 favourite features (not in any particular order) that I use quite frequently or find most useful when working with WASD (and equally with SQL Server instances).

1. Create CRUD

This is useful for generating a stored procedure for each of the Create, Read, Update and Delete actions on data pertaining to each table in the database.

WASD - STP Server Options

Generate CRUD scripts
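As a rough illustration only – the actual naming convention and procedure body come from the tool’s configurable templates, so treat this as an assumption rather than the tool’s exact output – a generated “Read” procedure for a hypothetical dbo.Customer table might look something like this:

-- Hypothetical example of a generated "Read" procedure; the real output
-- depends on the templates configured in SSMS Tools Pack.
CREATE PROCEDURE [dbo].[usp_Customer_Select]
	@CustomerID INT
AS
BEGIN
	SET NOCOUNT ON;

	-- Read a single row by primary key (table and columns are illustrative).
	SELECT [CustomerID], [FirstName], [LastName]
	FROM [dbo].[Customer]
	WHERE [CustomerID] = @CustomerID;
END
GO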


2. Generate Insert Statements

This is a very handy feature to export data out as Insert statements. You can also choose selected tables instead of all the tables in the database.

STP - Generate Insert Scripts


Below is an excerpt of the generated Insert scripts:

STP Insert Scripts


3. Search Database Data

This is probably one of the most handy features out there. As a consultant, sometimes I have to search for a specific keyword, and it can be like looking for a needle in a haystack. This feature makes my job easier – although it is probably best not to run it on a large database that is already performing poorly.

WASD - STP Search In Database

Fortunately, should you find that the search is taking too long while it is executing, you can cancel it.

WASD - STP Searching Database Dialog

Once the search completes, it returns a SELECT statement for each place the keyword is found in the data. In the example below, I am looking for any record (in any table / view) that contains the word “Daenerys”.

WASD - STP Search Database Results


4. Run Custom Scripts

Do you have custom scripts that you run regularly against a server? Well, you can do this for a WASD server too, as long as the script is relevant. Below is an example of the Update Statistics script that comes pre-configured with your SSMS Tools Pack installation.

WASD - STP Update Statistics Output

It is possible to add / modify custom scripts and to configure which servers they can be executed against. To do this, go to the toolbar and choose SSMS Tools > Run Custom Scripts > Options.
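For reference, the heart of an update-statistics script is typically just the following (the pre-configured script that ships with the tool may differ in detail):

-- Update statistics on all tables in the current database.
EXEC sp_updatestats;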


5. Window Connection Coloring

Using Window Connection Coloring, you can quickly find out what server the active script is connected to. As an example, the blue one below is configured to connect to my WASD instance, while the fuchsia is connected to my local database instance.

WASD - STP Window Colouring

The Window Connection Coloring > Options item on the SSMS Tools menu allows you to configure more colours for more database instances.


6. Show / Hide EPA

If you are into performance tuning, Show / Hide EPA (Execution Plan Analyzer) may be just for you. When Include Actual Execution Plan is enabled, you can further show (or hide) EPA. It gives a list of suggestions on how you can improve the performance of your query. Please note that these are just suggestions, and you should perform some testing on how your proposed action(s) will affect the environment as a whole.

WASD - ShowHide EPA Suggestions

By the way, there are many more options within EPA. Make sure you right-click on the EPA area to check them out in your free time.


7. SQL Snippets

I have to admit, I’m a typist. I love my keyboard. But this particular feature is just super cool. If you come from an application / web development background, you are probably familiar with Sublime Text’s Code Snippet feature. Well, this works almost exactly like that.

When I put on my Data Analyst hat, in a day I could be typing a lot of “SELECT COUNT(*) FROM ..” or “SELECT TOP 10 * FROM ..” for quick investigations. (No, I don’t use SELECT * FROM in my stored procedures or any production-related activities :D). The SQL Snippets feature has quickly become my favourite. It saves me time: I just type SSC or SST. You can configure more SQL Snippets too, or change the ones that have been preconfigured for you.

STP SQL Snippets

To access these options, simply go to SQL Snippets > Options from the SSMS Tools menu.


8. Tab Sessions History

Every now and then your machine shuts down unexpectedly or SSMS stops functioning, and you lose your queries (yes, they taught us to save those queries since we were young). Or perhaps you accidentally closed a query without saving it, thinking that you wouldn’t need it again. Well, guess what? The SSMS Tools Pack Tab Sessions History feature is your little insurance policy.

SSMS Tools Pack Tab Sessions History

You can access it by pressing Ctrl+K, Ctrl+T, or by going to SQL History > Tabs Sessions Management. From the SQL History menu, you can also configure the interval between the tab session saves.


How Much Does it Cost?

It depends on which version of SSMS you have installed. Prices may also change after this post’s publish date, so it would be wise for me to refer you to the simple Licensing page from the official site.


Wrap Up

Remember the saying “Work smarter, not harder”? Well, SSMS Tools Pack can definitely help you towards that. You still need to ensure that you customise it to the way you work, or want to work.


Further Reading

SSMS Tools Pack 2.0 by Grant Fritchey

SSMS Tools Pack (official site) by Mladen Prajdić


SQL Tools Review: SQL Search

As a SQL Server database developer / DBA / consultant, at some point in your daily routine you’d probably be asked to investigate a bug or to make an enhancement that is specific to a particular object. A common example is to remove a column from a table, and find all objects that refer to that column. Obviously you want to remove the references to the column (including in dynamic SQL statements) before you drop that column from the table.

An easy way to find all the objects on the server that may refer to that column is to use the SQL Search tool by Red Gate.

Price

It doesn’t cost a dime. At the time of writing, SQL Search 1 is free for download.

SQL Search 1 - Free


Easy to use

The tool is an add-in to SQL Server Management Studio. The user interface is self-explanatory.

SQL Search - Searching on a SQL Server instance


It’s at most 4 simple steps:

1. Type in the text that you want to search

2. Specify whether it’s an exact match or not

3. Choose which object types you want to search on

4. Specify the server. The server drop-down list is populated based on the database servers that you have connected to in the SSMS Object Explorer.

SQL Search - Server Selection is based on Object Explorer connections


Limitations

1. Windows Azure SQL Database

If you look carefully at the screenshot above, there are a couple of hints that show I have 2 connections to Windows Azure SQL Database (WASD) in my SSMS Object Explorer. Unfortunately, searching on WASD instances is not supported in SQL Search 1. Hopefully Red Gate will add this feature in.

SQL Search does not support WASD


2. Dynamic SQL Statement

SQL Search can find matches in dynamic SQL statements inside objects (stored procedures, triggers, etc.) only when those objects contain the search text in plain form, as in the example below.


CREATE PROCEDURE dbo.usp_DynamicSQLSimple
AS
/**************************************************************************
 *
 * Purpose: Update all NULL values to "Blank" on Title in dbo.Game.
 *			This is for a demo used in
 *			http://www.mssqlgirl.com/sql-tools-in-a-jiffy-sql-search.html
 *
 * Parameters: None
 *
 * Update History:
 * 20140209 - Julie Koesmarno - v1.0 - Initial Release
 *
 **************************************************************************/
BEGIN
	SET NOCOUNT ON;

	DECLARE @SQLStmt VARCHAR(1000);

	-- Construct a dynamic SQL Statement that updates the Column values.
	SET @SQLStmt = 'UPDATE [dbo].[Game] ' +
				   'SET [Title] = ''Blank'' ' +
				   'WHERE [Title] IS NULL '; 

	-- Execute SQL Statement
	IF @SQLStmt IS NOT NULL
	BEGIN
		PRINT @SQLStmt; 
		EXEC (@SQLStmt);
	END;

	-- Print number of rows updated.
	PRINT QUOTENAME(@@ROWCOUNT) + ' row(s) updated';

END
GO


On the other hand, if you have a stored procedure with a dynamic SQL statement that only constructs / resolves a reference to the text you are searching for at execution time, SQL Search will not be able to find it. Suppose you are interested in updating the first column in dbo.Game whose name starts with “T”, wherever it has a NULL value. Obviously, if “Title” is the first column that starts with T in dbo.Game, the dynamic SQL statement in the stored procedure below (dbo.usp_DynamicSQLComplex) will resolve to the same statement as the one above (dbo.usp_DynamicSQLSimple).

Below is a stored procedure example with a more complex dynamic SQL statement, where @SQLStmt only contains a reference to “Title” at the time of execution. Therefore SQL Search is not going to be able to find it – which is fair enough, but it is something that you have to bear in mind when finding dependencies.

CREATE PROCEDURE dbo.usp_DynamicSQLComplex 
AS
/**************************************************************************
 *
 * Purpose: Update all NULL values to "Blank" on first column that starts 
 *			with T in dbo.Game. 
 *			This is for a demo used in
 *			http://www.mssqlgirl.com/sql-tools-in-a-jiffy-sql-search.html
 *
 * Parameters: None
 *
 * Update History:
 * 20140209 - Julie Koesmarno - v1.0 - Initial Release
 *
 **************************************************************************/
BEGIN
	SET NOCOUNT ON;

	DECLARE @ColumnName VARCHAR(255);
	DECLARE @SQLStmt VARCHAR(1000);

	-- Get the first Column that starts with T in dbo.Game.
	SET @ColumnName = (SELECT TOP 1 COLUMN_NAME 
						FROM INFORMATION_SCHEMA.COLUMNS
						WHERE TABLE_NAME = 'Game'
						AND TABLE_SCHEMA = 'dbo'
						AND COLUMN_NAME LIKE 'T%');

	-- Print message
	PRINT ISNULL('First column in dbo.Game that starts with T is ' + @ColumnName, 
			'No column starts with T found in dbo.Game');

	-- Construct a dynamic SQL Statement that updates the Column values.
	SET @SQLStmt = 'UPDATE [dbo].[Game] ' +
				   'SET ' + QUOTENAME(@ColumnName) + ' = ''Blank'' ' +
				   'WHERE ' + QUOTENAME(@ColumnName) + ' IS NULL ' ;

	-- Execute SQL Statement
	IF @SQLStmt IS NOT NULL
	BEGIN
		PRINT @SQLStmt;
		EXEC (@SQLStmt);
	END;

	-- Print number of rows updated.
	PRINT QUOTENAME(@@ROWCOUNT) + ' row(s) updated';

END
GO


Alternatives

An alternative is to create your own script that inspects sys.sql_modules, and to iterate through your instances. You will then need to look up the object_id to get more details on each match. There is more than one way to skin a cat, but for a free tool, SQL Search is a user-friendly and time-saving one.
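As a rough sketch of that hand-rolled approach (run per database; the keyword ‘Title’ here is just an example), you could start with something like:

-- Find objects in the current database whose definition contains a keyword.
SELECT
	 OBJECT_SCHEMA_NAME(sm.[object_id]) AS SchemaName
	,OBJECT_NAME(sm.[object_id]) AS ObjectName
	,o.[type_desc] AS ObjectType
FROM sys.sql_modules sm
	INNER JOIN sys.objects o
	ON sm.[object_id] = o.[object_id]
WHERE sm.[definition] LIKE '%Title%';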


Wrap Up

I think SQL Search is a must-have tool for database developers. You may not use it on a daily basis, but boy, when you do need it, it really saves you time, especially if you don’t have your own reusable custom script to search text.


Further Reading

Not convinced on how SQL Search can help you out? Check out other #SQLCoop blog posts here:

On a SQL Quest Using SQL Search by Red Gate by Mickey Stuewe

Headache + Pain <> Red Gates SQL Search by Chris Yates

How SQL Search saves you time by Jeffrey Verheul


Paging Function Performance in SQL Server 2012

Late last year, when I discovered the new T-SQL enhancements introduced in SQL Server 2012, particularly the new Paging Function, I thought to myself that this would be very useful in a number of scenarios. Before I go any further, let’s have a look at what the Paging Function is.

Update: For simplicity, the following scenario uses [Production].[TransactionHistoryArchive] from AdventureWorks2012 and assumes that TransactionID is always continuous and incremental, as well as starting with a TransactionID of 1. (I will discuss a scenario where the TransactionID is not continuous in a different post.) Special thanks to Shy Engelberg for highlighting the behaviour differences in all three methods discussed in this post. Please read Shy’s comment at the end of this post for further clarification.

The Paging Function is part of the SELECT syntax, as an extension to the ORDER BY clause. Below is an example of its usage to retrieve a data set with TransactionID from 5001 and for the next 100 rows.

-- Paging Function
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
ORDER BY [TransactionID]
OFFSET 5001 ROWS
FETCH NEXT 100 ROWS ONLY

This function is very intuitive and is applicable for returning a data set one “page” (batch of rows) at a time. The OFFSET value can be expressed as an integer variable, and so can the FETCH NEXT value, which allows easy and configurable data set paging for web / reporting applications.
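As a quick illustration (the variable names here are mine, not part of the original example), the same query can be driven by page-number and page-size variables:

-- Variable-driven paging: return page @PageNumber of size @PageSize.
DECLARE @PageNumber INT = 3;
DECLARE @PageSize INT = 100;

SELECT
	 [TransactionID]
	,[ProductID]
	,[TransactionDate]
FROM [Production].[TransactionHistoryArchive]
ORDER BY [TransactionID]
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;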

A few things to note about the Paging Function:

  • The ORDER BY column(s) doesn’t have to be consecutive, meaning that we can avoid creating a surrogate consecutive integer key for the purpose of paging. This helps in a typical query to retrieve a page of “active” Transaction records whereby some rows in the table may be deleted or “deactivated”, rendering broken IDs.
  • OFFSET and FETCH can only be used in the last query that participates in UNION, EXCEPT or INTERSECT operation.
  • If the column specified in the ORDER BY column(s) is not unique, the order of the output is not always consistent.

Alternative: TOP … Batching

In earlier versions, you could write a query to return the same data set using the TOP keyword. Please note that the SET ROWCOUNT option for limiting the number of rows returned by a SELECT query will not be supported in a future version of SQL Server – the TOP keyword should be used instead.

-- TOP... Batching
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT TOP 100
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
WHERE [TransactionID] >= 5001
ORDER BY [TransactionID]

Alternative: Direct ID Batching

An alternative available in versions earlier than SQL Server 2012 is shown below. The result will be the same, assuming TransactionID values are always consecutive.

-- Direct ID Batching
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
WHERE [TransactionID] BETWEEN 5001 AND 5100
ORDER BY [TransactionID]

By now, you may realise that the Direct ID batching method may need a surrogate consecutive key to ensure the batch/paging size is consistent, with the exception of the last page.
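One way to manufacture such a surrogate key (my own sketch, not from the original scenario) is to stage the data with ROW_NUMBER():

-- Stage a consecutive surrogate key so Direct ID Batching works over gaps.
SELECT
	 ROW_NUMBER() OVER (ORDER BY [TransactionID]) AS RowNum
	,[TransactionID]
INTO #NumberedTransactions
FROM [Production].[TransactionHistoryArchive];

-- Batches can then filter on RowNum, e.g. WHERE RowNum BETWEEN 5001 AND 5100.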

Execution Plan Comparison

Let’s inspect the execution plan of all three methods.

1. Direct ID Batching

As expected, SQL Server is using a Clustered Index Seek here, as per the index filter on the ID between the two values (5001 and 5100). Note that the Actual Number of Rows in the Execution Plan details of the Clustered Index Seek is 100.

Clustered Index Seek on Production.TransactionHistory

Actual Number of Rows = 100 for Clustered Index Seek - Direct ID Batching

2. Top Batching

Similar to the Direct ID Batching, SQL Server is also using a Clustered Index Seek, followed by a TOP operator. Note that, as with Direct ID Batching, the Clustered Index Seek is also returning an Actual Number of Rows of 100.


Clustered Index Seek on Production.TransactionHistory

Actual Number of Rows = 100 for Clustered Index Seek - Top ... Batching


3. Paging Function

Surprisingly, SQL Server is using a Clustered Index Scan, and the Actual Number of Rows returned here is 5101!

Clustered Index Scan on Production.TransactionHistory


Actual Number of Rows = 5101 for Clustered Index Scan - Paging Function

Performance Comparison

Using the Paging Function on a source table with a large number of records may not be ideal. As you can already guess, the larger the offset, the larger the Actual Number of Rows returned. Consequently, the Paging Function will take longer and longer as the paging progresses. To demonstrate this, I have a table with over 14.5 million rows, through which I iterate in batches of 500,000 rows. Each iteration inserts the batched records into a heap table that is empty to start with.
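The Paging Function variant of the harness looked roughly like the sketch below (dbo.LargeSourceTable and dbo.TargetHeap are placeholder names, not the actual test objects):

-- Iterate through the source in 500,000-row batches using OFFSET / FETCH,
-- inserting each batch into an initially empty heap.
DECLARE @BatchSize INT = 500000;
DECLARE @Offset BIGINT = 0;
DECLARE @RowsInserted INT = 1;

WHILE @RowsInserted > 0
BEGIN
	INSERT INTO dbo.TargetHeap ([ID], [Payload])
	SELECT [ID], [Payload]
	FROM dbo.LargeSourceTable
	ORDER BY [ID]
	OFFSET @Offset ROWS
	FETCH NEXT @BatchSize ROWS ONLY;

	SET @RowsInserted = @@ROWCOUNT;
	SET @Offset = @Offset + @BatchSize;
END;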

Below is a chart showing the time taken to insert the records in batches using the Direct ID Batching vs Top Batching vs Paging Function methods. To start with, the three techniques take about the same amount of time. After inserting about 1.5 million rows, the Paging Function query takes significantly longer than the other two.

SQL_Batching_Methods_Comparison_Chart

Wrap Up

Although SQL Server 2012’s new Paging Function is intuitive for the purpose of paging a data set, it comes at a cost for large volumes of data. Top Batching and Direct ID Batching perform significantly better under similar circumstances, i.e. simple source data with a continuous ID.

The above discussion stems from my recent research for the 24 Hours of PASS Spring 2012 session “High Volume Data Processing Without Driving Your DBAs Crazy!”, which will be held on 21 March 2012 at 16:00 AEST. 24 Hours of PASS is free and contains 24 back-to-back great SQL Server sessions. You can find many different SQL Server topics, from DBA, database development and BI to performance sessions. So, register now!

Disclaimer

Please note that every environment is different, and these tests were conducted on my small laptop; they should not be used as an indication of how the above batching methods will perform in your Production environment. Please test and confirm that you understand these techniques before applying them to your Production environment.

 Last updated on 25 March 2012 to add further explanation on the scenario used in this post.


Filtered Index Limitation

The Filtered Index feature was introduced in SQL Server 2008. In my recent survey of some Data Analysts, Database Developers and DBAs, it seems that the Filtered Index has been ignored or forgotten. So here’s a reminder on why it’s a good idea to think again about using Filtered Indexes.

Filtered index characteristics:

  • an optimised nonclustered index
  • containing only the specified subset of data, specified by the filter predicate
  • referring to column(s) of the table which it belongs to

Having a filtered index is particularly useful if the same subset of data is queried very frequently (or almost all the time). A filtered index will be smaller in size compared to an index with the same definition but without the filter predicate.

Below is an illustration of the Production.WorkOrder set and its subset where DueDate is between 1 Jan 2011 and 30 Jun 2011.

Set Diagram representing Production.WorkOrder records


The non-filtered index is much larger in comparison to the filtered index, as shown below. The smaller the index size, the quicker SQL Server can fetch the data.

Non Filtered Index vs Filtered Index Size Comparison
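If you would like to reproduce a size comparison like this, one way (among several) is to query sys.dm_db_partition_stats, assuming both indexes exist on Production.WorkOrder:

-- Compare index sizes (in KB) for all indexes on Production.WorkOrder.
SELECT
	 i.[name] AS IndexName
	,SUM(ps.[used_page_count]) * 8 AS IndexSizeKB
FROM sys.dm_db_partition_stats ps
	INNER JOIN sys.indexes i
	ON ps.[object_id] = i.[object_id]
	AND ps.[index_id] = i.[index_id]
WHERE ps.[object_id] = OBJECT_ID('Production.WorkOrder')
GROUP BY i.[name];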


A couple of basic usage examples are listed below:

1. Reports are generated only for year-to-date data.

A good example would be creating an index filtering the date column from the beginning of the year, e.g.

CREATE NONCLUSTERED INDEX IX_Production_WorkOrder_DueDate ON Production.WorkOrder
(
     DueDate
)
WHERE DueDate >= '01 Jan 2011' 

Of course, at the beginning of next year, you may wish to change the DueDate clause to refer to “01 Jan 2012” (instead of “01 Jan 2011”). To avoid manually changing the filter every year, you could create a SQL Server job that dynamically alters the index so that the filtered index refers to the new DueDate filter at the specified date of the year.
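A minimal sketch of such a job step might look like the following (it assumes the index from the example above already exists; CREATE INDEX with DROP_EXISTING = ON lets the rebuild change the filter predicate):

-- Rebuild the filtered index so its predicate starts at the current year.
DECLARE @YearStart CHAR(8) = CONVERT(CHAR(4), YEAR(GETDATE())) + '0101';
DECLARE @SQLStmt NVARCHAR(MAX) =
	N'CREATE NONCLUSTERED INDEX IX_Production_WorkOrder_DueDate ' +
	N'ON Production.WorkOrder (DueDate) ' +
	N'WHERE DueDate >= ''' + @YearStart + ''' ' +
	N'WITH (DROP_EXISTING = ON);';

EXEC sp_executesql @SQLStmt;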


2. Data is processed only for Customers that have an email address ID value.

If there is more data with non-NULL EmailAddressID values than with NULL EmailAddressID, creating a filtered index will reduce the index size and therefore improve performance (compared to creating a non-filtered index).

CREATE NONCLUSTERED INDEX IX_Person_Person ON Person.Person
(
     EmailAddressID
)
WHERE EmailAddressID IS NOT NULL


More scenarios and best practices are documented on BOL: http://msdn.microsoft.com/en-us/library/cc280372(v=SQL.100).aspx

The filter predicate only supports simple comparison operators, such as the following:

  1. =
  2. !=
  3. <
  4. <=
  5. >
  6. >=
  7. IS NULL
  8. IS NOT NULL

This means “LIKE” and “BETWEEN … AND …” comparisons cannot be used. However, there is a workaround for the “BETWEEN … AND …” comparison: replace it with the “>=” and “<=” operators, as shown in the snippet below.

CREATE NONCLUSTERED INDEX IX_Production_WorkOrder_DueDate ON Production.WorkOrder
(
     DueDate
)
WHERE DueDate >= '01 Jan 2011' AND DueDate <= '30 Jun 2011'
-- equivalent to DueDate BETWEEN '01 Jan 2011' AND '30 Jun 2011'


When implementing filtered indexes, I would recommend comparing the output of SET STATISTICS IO ON before and after. This will give you a better understanding of how much improvement you will gain with the filtered index.
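For example, you can capture the logical reads for a representative query before and after creating the filtered index:

-- Capture IO statistics for a query that targets the filtered subset.
SET STATISTICS IO ON;

SELECT COUNT(*)
FROM Production.WorkOrder
WHERE DueDate >= '01 Jan 2011' AND DueDate <= '30 Jun 2011';

SET STATISTICS IO OFF;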

An article from MSSQLTips worth reading to learn more about Filtered Indexes can be found here: http://www.mssqltips.com/sqlservertip/1785/sql-server-filtered-indexes-what-they-are-how-to-use-and-performance-advantages/

Using CTE to remove duplicate records

I have been asked a few times what the best way to remove duplicate records is. There are many ways to do this, and the best answer really depends on the environment and the scenario. I would suggest developers evaluate these two factors before jumping ahead and writing T-SQL code.

In this post, I’d like to show a way of deleting duplicate records using a CTE. I think knowing as many options as possible, and their good and bad points, can never hurt.

The example below uses the AdventureWorks database, mainly the Person.Contact table. To make the sample easier, I copy the data from the Person.Contact table into a separate table, which I will use later to perform the duplicate record deletion.

So, the scenario is: I have a list of people, some of whom share the same First Name and Last Name. To make it simpler, the scenario assumes that each person in the database should have a unique First Name and Last Name combination. Among each set of duplicates, the record to keep is the one with the highest EmailPromotion count and, as a tie-breaker, the latest (highest) ContactID value.

Here’s the sample code that I would use to delete the duplicate records for the above scenario.

USE AdventureWorks
GO

-- Prepare temporary table to test the deletion of duplicate records
SELECT *
INTO dbo.TempPersonContact
FROM Person.Contact

-- Check the duplicate records
SELECT FirstName, LastName, COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Scenario: For duplicate records, keep those with the highest EmailPromotion
-- number. If the EmailPromotion values tie for the duplicate records,
-- keep the record with the bigger ContactID
-- Note: by using CTE, we can affect the original table (dbo.TempPersonContact)
-- by referring to the CTE name (i.e. "Duplicates")
WITH Duplicates AS 
(
SELECT
     ContactID,
     ROW_NUMBER() OVER( PARTITION BY FirstName, LastName
          ORDER BY EmailPromotion DESC, ContactID DESC) AS OrderID
FROM dbo.TempPersonContact
)
DELETE Duplicates
WHERE OrderID > 1

-- Check the duplicate records; this should return no rows since we deleted the duplicates in the previous statement
SELECT FirstName, LastName,  COUNT(1) AS NumberOfInstances
FROM dbo.TempPersonContact
GROUP BY FirstName, LastName
HAVING  COUNT(1) > 1 ; 

-- Clean up table
/* DROP TABLE dbo.TempPersonContact */


There are two major points to note here:

  • Usage of a CTE – the deletion is performed on the CTE (instead of directly on the actual table)
  • Usage of ROW_NUMBER

For more information on CTE: http://msdn.microsoft.com/en-us/library/ms190766.aspx

For more information on ROW_NUMBER: http://msdn.microsoft.com/en-us/library/ms186734.aspx