Bulk Insert of sorted data

In a previous post I showed how to make bulk insert faster using SSIS. During Sql Server Conference 2007 I had the great opportunity to double check it with Stefano Stefani that analyzed the problem in greater detail and finally came up with both an explanation of the problem and a simple workaround when facing the same problem with T-SQL (my solution is still good with SSIS but Stefano’s one is very effective with T-SQL).

The problem is related to bad performances when inserting rows into a table that has a clustered index (typically a primary key) based on an INT IDENTITY field and is easily shown in this example:

Create Table Destination ( ID_Test Int Identity (1,1) Primary Key, Cod_Test VarChar (20), Test VarChar (100)) Go  Create Table RowsToInsert ( Cod_Test VarChar (20), Test VarChar (100)) Go 


Now, suppose you have 1.000.000 of rows in RowsToInsert, if you analyze the execution plan of this simple insert:

INSERT      Destination SELECT      Cod_Test,      Test FROM     RowsToInsert OPTION (RECOMPILE) 


You will get this: BulkInsertWithSort

As you can see, the data is sorted before being inserted into the table even if the IDENTITY specification of the primary key guarantees to SQL Server that the data is already sorted. Of course, sorting one million rows takes a long time.

The problem is that SQL Server can insert rows in parallel mode into the table, thus there is no guarantee that the IDENTITY fields will be generated sequentially. This leads to very poor performances when inserting a huge amount of data, as is normally the case in DWH tasks.

A very effective workaround, when facing the problem in a T-SQL task, is that of sorting the data based on a dummy column, as in this example:

INSERT      Destination SELECT      Cod_Test,      Test FROM  (SELECT 1 A, * FROM RowsToInsert) Q ORDER BY A OPTION (RECOMPILE) 


The execution plan is now: BulkInsertWithoutSort

The sort task disappeared, as now SQL Server knows that the IDENTITY fields will be generated sequentially because the input is sorted.

The final consideration is that you can have a dramatic speed gain in huge INSERT operations with SQL Server into clustered indexed tables if you provide SORTED SELECTS to the INSERT operation. When the clustered index is based on an INT IDENTITY field you can get the same speed by sorting on a dummy field, instructing SQL Server that the IDENTITY columns will be computed sequentially and there is no need to sort the data.