Late last year when I discovered the new TSQL Enhancement introduced in SQL Server 2012, particularly the new Paging Function, I thought to myself that this would be very useful in a number of scenarios. Before I go any further, let’s have a look at what the Paging Function is.
Update: For simplicity, the following scenario uses [Production].[TransactionHistoryArchive] from AdventureWorks2012 and assumes that TransactionID is always continous and incremental as well as starting with TransactionID of 1. (I will discuss more on a scenario where the TransactionID is not continuous in a different post). Special thanks to Shy Engelberg for highlighting the behaviour differences in all three methods discussed in this post. Please read Shy’s comment at the end of this post for further clarification.
The Paging Function is part of the SELECT syntax, as an extension to the ORDER BY clause. Below is an example of its usage to retrieve a data set with TransactionID from 5001 and for the next 100 rows.
-- Paging Function -- Retrieve 100 rows starting from TransactionID of 5001 SELECT [TransactionID] ,[ProductID] ,[ReferenceOrderID] ,[ReferenceOrderLineID] ,[TransactionDate] ,[TransactionType] ,[Quantity] ,[ActualCost] ,[ModifiedDate] FROM [Production].[TransactionHistoryArchive] ORDER BY [TransactionID] OFFSET 5001 ROWS FETCH NEXT 100 ROWS ONLY
This function is very intuitive and would be applicable for returning data set a “page” (batch of rows) at a time. The OFFSET value can be expressed as an integer variable and so can the FETCH NEXT value, which allows easy and configurable data set paging for web/reporting applications.
A few things to note about the Paging Function:
- The ORDER BY column(s) doesn’t have to be consecutive, meaning that we can avoid creating a surrogate consecutive integer key for the purpose of paging. This helps in a typical query to retrieve a page of ”active” Transaction records whereby some rows in the table may be deleted or “deactivated”, rendering broken IDs.
- OFFSET and FETCH can only be used in the last query that participates in UNION, EXCEPT or INTERSECT operation.
- If the column specified in the ORDER BY column(s) is not unique, the order of the output is not always consistent.
Alternative: TOP … Batching
In an earlier version, you could write a query to return the same data set using the TOP keyword. Please note that the SET ROWCOUNT clause to limit the number of rows returned in a SELECT query, will not be supported in the next version of SQL Server – and the TOP keyword should be used instead.
-- TOP... Batching -- Retrieve 100 rows starting from TransactionID of 5001 SELECT TOP 100 [TransactionID] ,[ProductID] ,[ReferenceOrderID] ,[ReferenceOrderLineID] ,[TransactionDate] ,[TransactionType] ,[Quantity] ,[ActualCost] ,[ModifiedDate] FROM [Production].[TransactionHistoryArchive] WHERE [TransactionID] >= 5001 ORDER BY [TransactionID]
Alternative: Direct ID Batching
An alternative to the above in an earlier version to SQL Server 2012 is shown below, The result will be the same, assuming TransactionID values are always consecutive.
-- Direct ID Batching -- Retrieve 100 rows starting from TransactionID of 5001 SELECT [TransactionID] ,[ProductID] ,[ReferenceOrderID] ,[ReferenceOrderLineID] ,[TransactionDate] ,[TransactionType] ,[Quantity] ,[ActualCost] ,[ModifiedDate] FROM [Production].[TransactionHistoryArchive] WHERE [TransactionID] BETWEEN 5001 AND 5100 ORDER BY [TransactionID]
By now, you may realise that the Direct ID batching method may need a surrogate consecutive key to ensure the batch/paging size is consistent, with the exception of the last page.
Execution Plan Comparison
Let’s inspect the execution plan of all three methods.
1. Direct ID Batching
As expected, here SQL Server is using Clustered Index Seek as per the index filter on the ID between the 2 values (5001 and 5100). Note that the Actual Number of Rows on the Execution Plan Details of the Clustered Index Seek is 100.
2. Top Batching
Similar to the Direct ID Batching, SQL Server is also using Clustered Index Seek. It then uses TOP as the next step. Note that here similar to Direct ID Batching, the Clustered Index Seek is also returning Actual Number of Rows of 100.
3. Paging Function
Surprisingly, SQL Server is using Clustered Index Scan and the Actual Number of Rows returned here is 5101!
Using Paging Fuction on a source table with a large number of records may not be ideal. As you can already guess, the larger the offset size, the larger the Actual Number of Rows is returned too. Consequently, the Paging Function will take longer and longer as the paging progresses. To demonstrate this, I have a table with over 14.5 Million rows where I iterate through the records in a batch size of 500,000. Each iteration inserts the batched records into a heap table that is empty to start with.
Below is the chart exhibiting the time taken to insert the records in batches using Direct ID batching vs Top batching vs Paging function methods. To start with, the three techniques take about the same amount of time. After inserting about 1.5 Million rows the Paging Function query takes significantly longer than the other two.
Although SQL Server 2012 new Paging function is intuitive for the purpose of paging a data set, it comes at a cost for large volume of data. The Top Batching and Direct ID Batching perform significantly better under similar circumstances, i.e. simple Source Data with continuous ID.
The above discussion stems from my recent research for the 24 Hours of PASS Spring 2012 session “High Volume Data Processing Without Driving Your DBAs Crazy!” which will be held on 21 March 2012 at 16:00 AEST. 24 Hours of PASS is free and contains 24 back-to-back great SQL Server sessions. You can find many different topics of SQL Server, from DBA, Database development, BI to Performance sessions. So, register now!
Please note that every environment is different and that these tests are conducted on my small laptop and should not be used as an indication on how the above batching methods will perform in your Production environment. Please test and confirm that you understand these techniques before applying to your Production environment.
Last updated on 25 March 2012 to add further explanation on the scenario used in this post.