24 Hours of PASS – High Volume Data Processing Techniques Without Driving Your DBA Crazy

I have just delivered my presentation on “High Volume Data Processing Techniques Without Driving Your DBA Crazy!” via webinar at 24 Hours of PASS (24HOP) Spring 2012. Thank you for those who attended the session. Hope it was useful for you. If you have any questions, please don’t hestitate to contact me and I will do my best to answer them.

 

As promised, below is a list of links that I briefly touched on at the presentation.

SQL Server 2012

ORDER BY Clause (OFFSET .. FETCH NEXT …)

http://msdn.microsoft.com/en-us/library/ms188385%28v=SQL.110%29.aspx

 

Paging Function Performance in SQL Server 2012

http://www.mssqlgirl.com/paging-function-performance-in-sql-server-2012.html

 

Columnstore Indexes

http://msdn.microsoft.com/en-us/library/gg492088(v=SQL.110).aspx

 

 

SSIS

The Data Loading Performance Guide

http://msdn.microsoft.com/en-us/library/dd425070(v=sql.100).aspx

 

Minimally Logging Bulk Load Inserts into SQL Server

http://www.mssqltips.com/sqlservertip/1185/minimally-logging-bulk-load-inserts-into-sql-server/

 

BDD

The “Balanced Data Distributor” SSIS

http://blogs.msdn.com/b/sqlperf/archive/2011/05/25/the-balanced-data-distributor-for-ssis.aspx

 

Download Center – BDD

http://www.microsoft.com/download/en/details.aspx?id=4123

 

Special thanks to 24HOP team and IBTalk for everything. I hope to present with 24HOP in the near future. The recording will be available in a few weeks time via the 24HOP website.

 

 

Paging Function Performance in SQL Server 2012

Late last year when I discovered the new TSQL Enhancement introduced in SQL Server 2012, particularly the new Paging Function, I thought to myself that this would be very useful in a number of scenarios. Before I go any further, let’s have a look at what the Paging Function is.

Update: For simplicity, the following scenario uses [Production].[TransactionHistoryArchive] from AdventureWorks2012 and assumes that TransactionID is always continous and incremental as well as starting with TransactionID of 1. (I will discuss more on a scenario where the TransactionID is not continuous in a different post). Special thanks to Shy Engelberg for highlighting the behaviour differences in all three methods discussed in this post. Please read Shy’s comment at the end of this post for further clarification.

The Paging Function is part of the SELECT syntax, as an extension to the ORDER BY clause. Below is an example of its usage to retrieve a data set with TransactionID from 5001 and for the next 100 rows.

-- Paging Function
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
ORDER BY [TransactionID]
OFFSET 5001 ROWS
FETCH NEXT 100 ROWS ONLY

This function is very intuitive and would be applicable for returning data set a “page” (batch of rows) at a time. The OFFSET value can be expressed as an integer variable and so can the FETCH NEXT value, which allows easy and configurable data set paging for web/reporting applications.

A few things to note about the Paging Function:

  • The ORDER BY column(s) doesn’t have to be consecutive, meaning that we can avoid creating a surrogate consecutive integer key for the purpose of paging. This helps in a typical query to retrieve a page of ”active” Transaction records whereby some rows in the table may be deleted or “deactivated”, rendering broken IDs.
  • OFFSET and FETCH can only be used in the last query that participates in UNION, EXCEPT or INTERSECT operation.
  • If the column specified in the ORDER BY column(s) is not unique, the order of the output is not always consistent.

Alternative: TOP … Batching

In an earlier version, you could write a query to return the same data set using the TOP keyword.  Please note that the SET ROWCOUNT clause to limit the number of rows returned in a SELECT query, will not be supported in the next version of SQL Server – and the TOP keyword should be used instead.

-- TOP... Batching
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT TOP 100
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
WHERE [TransactionID] >= 5001
ORDER BY [TransactionID]

Alternative: Direct ID Batching

An alternative to the above in an earlier version to SQL Server 2012 is shown below, The result will be the same, assuming TransactionID values are always consecutive.

-- Direct ID Batching
-- Retrieve 100 rows starting from TransactionID of 5001
SELECT
	 [TransactionID]
	,[ProductID]
	,[ReferenceOrderID]
	,[ReferenceOrderLineID]
	,[TransactionDate]
	,[TransactionType]
	,[Quantity]
	,[ActualCost]
	,[ModifiedDate]
FROM [Production].[TransactionHistoryArchive]
WHERE [TransactionID] BETWEEN 5001 AND 5100
ORDER BY [TransactionID]

By now, you may realise that the Direct ID batching method may need a surrogate consecutive key to ensure the batch/paging size is consistent, with the exception of the last page.

Execution Plan Comparison

Let’s inspect the execution plan of all three methods.

1. Direct ID Batching

As expected, here SQL Server is using Clustered Index Seek as per the index filter on the ID between the 2 values (5001 and 5100). Note that the Actual Number of Rows on the Execution Plan Details of the Clustered Index Seek is 100.

Clustered Index Seek on Production.TransactionHistory

Clustered Index Seek on Production.TransactionHistory

Actual Number of Rows = 100 for Clustered Index Seek - Direct ID Batching

2. Top Batching

Similar to the Direct ID Batching, SQL Server is also using Clustered Index Seek. It then uses TOP as the next step.  Note that here similar to Direct ID Batching, the Clustered Index Seek is also returning Actual Number of Rows of 100.

 

Clustered Index Seek on Production.TransactionHistory

Clustered Index Seek on Production.TransactionHistory

Actual Number of Rows = 100 for Clustered Index Seek - Top ... Batching

 

3. Paging Function

Surprisingly, SQL Server is using Clustered Index Scan and the Actual Number of Rows returned here is 5101!

Clustered Index Seek on Production.TransactionHistory

Clustered Index Seek on Production.TransactionHistory

 

Actual Number of Rows = 5101 for Clustered Index Seek - Paging Function

Performance Comparison

Using Paging Fuction on a source table with a large number of records may not be ideal. As you can already guess, the larger the offset size, the larger the Actual Number of Rows is returned too. Consequently, the Paging Function will take longer and longer as the paging progresses. To demonstrate this, I have a table with over 14.5 Million rows where I iterate through the records in a batch size of 500,000. Each iteration inserts the batched records into a heap table that is empty to start with.

Below is the chart exhibiting the time taken to insert the records in batches using Direct ID batching vs Top batching vs Paging function methods. To start with, the three techniques take about the same amount of time. After inserting about 1.5 Million rows the Paging Function query takes significantly longer than the other two.

SQL_Batching_Methods_Comparison_Chart

Wrap Up

Although SQL Server 2012 new Paging function is intuitive for the purpose of paging a data set, it comes at a cost for large volume of data. The Top Batching and Direct ID Batching perform significantly better under similar circumstances, i.e. simple Source Data with continuous ID.

The above discussion stems from my recent research for the 24 Hours of PASS Spring 2012 session “High Volume Data Processing Without Driving Your DBAs Crazy!” which will be held on 21 March 2012 at 16:00 AEST. 24 Hours of PASS is free and contains 24 back-to-back  great SQL Server sessions. You can find many different topics of SQL Server, from DBA, Database development, BI to Performance sessions. So, register now!

Disclaimer

Please note that every environment is different and that these tests are conducted on my small laptop and should not be used as an indication on how the above batching methods will perform in your Production environment. Please test and confirm that you understand these techniques before applying to your Production environment.

 Last updated on 25 March 2012 to add further explanation on the scenario used in this post.