SQLBulkTool

SqlBulkTool is a command line utility that is used to quickly create a mirror of a database. It reads its configuration from an XML file containing source and destination command strings and a list of all the tables to mirror and then handles the work of copying the database in an automated and highly parallelized way.

The parallelism can use the partition capabilities of SQL Server 2005: to handle a huge table it is enough to partition it to make the tool load it by running each single partition in a separate thread, dramatically increasing table load time. In the case where no partitioning is defined the parallelism is handled at the table level.

Introduction

In the process of extracting data from a host to load your data mart you basically have two choices

While the first choice seems easier to develop, the latter one has several advantages

On the other hand, the main disadvantage of this choice is that your system will spend time to load the host database into the SQL staging and that you still need to build code to create the mirror, even if it is pretty easy with SSIS, you still have code to manage over time.

Up to now, I don’t think I have said anything interesting or new, everybody involved in BI well know this situation, the good news is that we decided to develop a tool to facilitate the process of mirroring several tables from the host to a staging database, by using SQL 2005 partitions we achieve a high level of parallelism and consequently loading speed, so the interesting part of the article is the description of the tool and the technique beside it.

The naïf solution

As you normally don’t need all the columns of the host table, the staging table have less fields than the original one the first (and naïf) solution is that of creating an SSIS package that has source SELECT statements fired against the host database and destination components to the SQL staging database, SSIS does the rest of the work for us, managing some kind of parallelism at the table level. A very easy solution that still have some severe limitations:

Sql Bulk Tool

In writing SqlBulkTool, our goal was to make this process easy and fast by

So, SqlBulkTool is a command line utility that reads its configuration from an XML file containing source and destination command strings and a list of all the tables to mirror, it handles the work of mirroring the database in an automated and highly parallelized way.

The parallelism is handled using the partition capabilities of SQL Server 2005, to handle a huge table it is enough to partition it to make the tool load it by running each single partition in a separate thread, dramatically increasing table load time.

Why parallelizing operations improve performance

We observed that gaining speed using partitions is often possible even when all partitions are on the same physical disk: the reason is often the bottleneck of protocol/driver connecting with the source database, that doesn’t use the full bandwidth available to transfer data.

We would expect the bottleneck to be the write speed capacity of our destination disk subsystem, but we don’t often reach this limit nor that of network and cpu. In this situation, chances are that protocol latency, bandwidth limitation on client connections or ODBC/OLE DB client driver quality are the real issue for low performance.

Another aspect is that bulk insert operations on SQL Server are (strangely?) CPU intensive. As we use ADO.NET 2.0 classes to get bulk functionality and most of the time is presumably spent on marshaling data from unmanaged to managed world and vice-versa. Anyway, having many CPU cores allows getting a great performance advantage compared to a sequential bulk insert operations.

To make the story short, the fact is that in many situations we get a great advantage using SqlBulkTool. Sometime this doesn’t happen, for example when you move data from a SQL Server to another SQL Server: in this case, main the advantage using SqlBulkTool is less code to write and not performance.

Partitioning the table

Let’s say you have a table with 20 millions of rows. Loading it in a whole is very time consuming, it would be really better to load it using 20 chunks of one million records each. You achieve a dramatic increase in speed because increasing parallelism on slow network operations optimize the use of fast CPU and disks, when a thread is waiting for data from the network, other threads can use CPU and disk to load the data that is arrived.

This was our situation: we decided to partition the table based on a date field (stored as int containing yyyymmdd on the host) creating a partition for each quarter, so the partition function is

CREATE PARTITION FUNCTION [DatePartitionFunction](int)AS RANGE RIGHT FOR VALUES (
20030101, 20030401, 20030901,
20040101, 20040401, 20040901,
20050101, 20050401, 20050901,
20060101, 20060401, 20060901,
20070101, 20070401, 20070901,
20080101, 20080401, 20080901,
20090101, 20090401, 20090901)

As you can see we decided to create ALL the partitions we will need over time so we don’t have to create code to automate partition handling in the future.

The partition scheme is very easy, we don’t need separate data files, we will use partition to achieve parallelism, not to manage storage:

CREATE PARTITION SCHEME [DatePartitionSchema] AS PARTITION [DatePartitionFunction]
ALL TO ([PRIMARY])

The last phase is that of creating the mirror table with all and only the fields we need to handle the TL phases:

CREATE TABLE [BORI200F](.....
[DATTS] [decimal](18, 0) NOT NULL,.....) ON [DatePartitionSchema]([DATTS])

So, now we have a huge table that can be divided into several smaller partitions; the task of loading them in parallel will be pretty easy and will be shown later.

The config file

The configuration is an XML file that looks like this:

Configuration.XML
<?xml version="1.0" encoding="utf-8"?>
<Configuration 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="SqlBulkToolConfig.xsd" 
MaxNumberOfThreads="15" >
<DataBases>
<DataBase
Name = "AdventureWorks TEST"
Run = "true" 
DataBaseClass="SqlServer"
SourceConnectionString="Data Source=localhost;
               Integrated Security=True;Initial Catalog=AdventureWorksDW"
DestConnectionString="Data Source=localhost;
               Integrated Security=True;Initial Catalog=SqlBulkToolTest">
<Tables>
<Table SourceTableName="AdventureWorksDWBuildVersion" DestTableSchema="dbo" 
                       DestTableName="AdventureWorksDWBuildVersion" TruncateBeforeLoad ="true"/>
<Table SourceTableName="DatabaseLog" DestTableSchema="dbo" 
                       DestTableName="DatabaseLog" TruncateBeforeLoad ="true"/>
<Table SourceTableName="DimAccount" DestTableSchema="dbo" 
                       DestTableName="DimAccount" TruncateBeforeLoad ="true"/>
<Table SourceTableName="DimCurrency" DestTableSchema="dbo" 
                       DestTableName="DimCurrency" TruncateBeforeLoad ="true"/>
<Table SourceTableName="DimCustomer" DestTableSchema="dbo" 
                       DestTableName="DimCustomer" TruncateBeforeLoad ="true"/>
<Table SourceTableName="DimDepartmentGroup" DestTableSchema="dbo" 
                       DestTableName="DimDepartmentGroup" TruncateBeforeLoad ="true"/>
</Tables>
</DataBase>
</DataBases>
</Configuration>

Where

Samples

We have provided a sample script to create a database that will mirror some of the table in the AdventureWorksDW database, it is not intended as a demo or a test, just a sample to make it easier to understand the configurations of SqlBulkTool and to test the tool in an easy way.

The project

Here we provide a basic description of the project, for whose of you who want to update/modify the code. The tool is built upon several classes

The process of mirroring the database is

The flow

When the system starts it

Let’s see the classes in more detail

DataBaseCopy Class

DataBaseCopy is an abstract class designed to make it easy to handle specific dialect details of source database. Its subclasses are specialized implementation that define details about the dialect used by the source database.

DataBaseCopy manages three important operations:

TableCopy Class

Tablecopy is a class that holds information about one mirror table. If the table is partitioned then TableCopy holds a list of all the partitions in the table, otherwise TableCopy contains all the information needed to mirror the table by its own.

TableCopy is designed to

PartitionCopy Class

If a table is partitioned then for each single partition we need to maintain the information that will let us handle the mirror operation, in detail:

The process of copying a single partition is

As you should already know, this is the only sequence that guarantee fast partition handling, all the operation (beside loading) are metadata operation and do not need any operation on the data, switching a partition in and out is really fast.

Metadata operations are carried on in a critical section, loading is handled in parallel.

ThreadManager Class

ThreadManager is responsible to coordinate threads, it will create a separate thread for each operation that has to be carried on and detach it when opportune.

As it is useless to launch all the thread at once, ThreadManager holds a semaphore that will limit the number of running threads to a configured value, in this way it is easy to scale the system based on the power of the source or target server.

Final considerations

The tool is not intended to be perfect, its main limitations up to now is that it will handle only RIGHT partitions, if someone wants to update it to provide handling of LEFT partitions… he/she will be welcome. Moreover, you may have different sources than the ones we provided, if so it will be enough to derivate a class from DatabaseCopy and handle there the specific dialect of your DB.

The project can be easily extended to handle more complex situations nevertheless we believe that it is a good example to demonstrate the power of SQL 2005 partitions as they can be used to parallelize the process of loading huge tables gaining a dramatic speed improvement simply by making a better use of the resources of the server.

We are currently using the tool in several project and it works fine giving us a dramatic speed improvement in the mirroring of the production database to a first staging DB.

Please note that – using SqlBulkTool – you may experience a timeout problem, this can be easily solved applying the patch that you can find in Microsoft Knowledge Base.