SqlBulkTool is a command line utility that is used to quickly create a mirror of a database. It reads its configuration from an XML file containing source and destination command strings and a list of all the tables to mirror and then handles the work of copying the database in an automated and highly parallelized way.
The parallelism can use the partition capabilities of SQL Server 2005: to handle a huge table it is enough to partition it to make the tool load it by running each single partition in a separate thread, dramatically increasing table load time. In the case where no partitioning is defined the parallelism is handled at the table level.
In the process of extracting data from a host to load your data mart you basically have two choices
- Extract data directly from the host to the datamart
- Create a mirror of the data to a staging SQL server database and then load it into the datamart
While the first choice seems easier to develop, the latter one has several advantages
- You can develop all the packages disconnected from the host
- You can easily create test cases modifying the staging database
- You don’t depend on availability of the host during the TL phases of ETL
- You have the power of SQL Server 2005 syntax fore querying the tables
- You can easily create indexes to make your queries faster
On the other hand, the main disadvantage of this choice is that your system will spend time to load the host database into the SQL staging and that you still need to build code to create the mirror, even if it is pretty easy with SSIS, you still have code to manage over time.
Up to now, I don’t think I have said anything interesting or new, everybody involved in BI well know this situation, the good news is that we decided to develop a tool to facilitate the process of mirroring several tables from the host to a staging database, by using SQL 2005 partitions we achieve a high level of parallelism and consequently loading speed, so the interesting part of the article is the description of the tool and the technique beside it.
The naïf solution
As you normally don’t need all the columns of the host table, the staging table have less fields than the original one the first (and naïf) solution is that of creating an SSIS package that has source SELECT statements fired against the host database and destination components to the SQL staging database, SSIS does the rest of the work for us, managing some kind of parallelism at the table level. A very easy solution that still have some severe limitations:
- You have to manually code SELECT statements
- If you have a lot of tables the package is not so easy to manage
- The level of parallelism is to the table level; if you have a huge table and several small tables, the final time to run the package is that of the biggest table.
- When you will need other columns and/or tables from the host you will have to modify the package with all the implication of making some sort of change to the software
Sql Bulk Tool
In writing SqlBulkTool, our goal was to make this process easy and fast by
- Never depend on manual code to mirror tables
- Parallelize the load process of the mirror using partitions
So, SqlBulkTool is a command line utility that reads its configuration from an XML file containing source and destination command strings and a list of all the tables to mirror, it handles the work of mirroring the database in an automated and highly parallelized way.
The parallelism is handled using the partition capabilities of SQL Server 2005, to handle a huge table it is enough to partition it to make the tool load it by running each single partition in a separate thread, dramatically increasing table load time.
Why parallelizing operations improve performance
We observed that gaining speed using partitions is often possible even when all partitions are on the same physical disk: the reason is often the bottleneck of protocol/driver connecting with the source database, that doesn’t use the full bandwidth available to transfer data.
We would expect the bottleneck to be the write speed capacity of our destination disk subsystem, but we don’t often reach this limit nor that of network and cpu. In this situation, chances are that protocol latency, bandwidth limitation on client connections or ODBC/OLE DB client driver quality are the real issue for low performance.
Another aspect is that bulk insert operations on SQL Server are (strangely?) CPU intensive. As we use ADO.NET 2.0 classes to get bulk functionality and most of the time is presumably spent on marshaling data from unmanaged to managed world and vice-versa. Anyway, having many CPU cores allows getting a great performance advantage compared to a sequential bulk insert operations.
To make the story short, the fact is that in many situations we get a great advantage using SqlBulkTool. Sometime this doesn’t happen, for example when you move data from a SQL Server to another SQL Server: in this case, main the advantage using SqlBulkTool is less code to write and not performance.
Partitioning the table
Let’s say you have a table with 20 millions of rows. Loading it in a whole is very time consuming, it would be really better to load it using 20 chunks of one million records each. You achieve a dramatic increase in speed because increasing parallelism on slow network operations optimize the use of fast CPU and disks, when a thread is waiting for data from the network, other threads can use CPU and disk to load the data that is arrived.
This was our situation: we decided to partition the table based on a date field (stored as int containing yyyymmdd on the host) creating a partition for each quarter, so the partition function is
CREATE PARTITION FUNCTION [DatePartitionFunction](int)AS RANGE RIGHT FOR VALUES ( 20030101, 20030401, 20030901, 20040101, 20040401, 20040901, 20050101, 20050401, 20050901, 20060101, 20060401, 20060901, 20070101, 20070401, 20070901, 20080101, 20080401, 20080901, 20090101, 20090401, 20090901)
As you can see we decided to create ALL the partitions we will need over time so we don’t have to create code to automate partition handling in the future.
The partition scheme is very easy, we don’t need separate data files, we will use partition to achieve parallelism, not to manage storage:
CREATE PARTITION SCHEME [DatePartitionSchema] AS PARTITION [DatePartitionFunction] ALL TO ([PRIMARY])
The last phase is that of creating the mirror table with all and only the fields we need to handle the TL phases:
CREATE TABLE [BORI200F](..... [DATTS] [decimal](18, 0) NOT NULL,.....) ON [DatePartitionSchema]([DATTS])
So, now we have a huge table that can be divided into several smaller partitions; the task of loading them in parallel will be pretty easy and will be shown later.
The config file
The configuration is an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?> <Configuration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="SqlBulkToolConfig.xsd" MaxNumberOfThreads="15" > <DataBases> <DataBase Name = "AdventureWorks TEST" Run = "true" DataBaseClass="SqlServer" SourceConnectionString="Data Source=localhost; Integrated Security=True;Initial Catalog=AdventureWorksDW" DestConnectionString="Data Source=localhost; Integrated Security=True;Initial Catalog=SqlBulkToolTest"> <Tables> <Table SourceTableName="AdventureWorksDWBuildVersion" DestTableSchema="dbo" DestTableName="AdventureWorksDWBuildVersion" TruncateBeforeLoad ="true"/> <Table SourceTableName="DatabaseLog" DestTableSchema="dbo" DestTableName="DatabaseLog" TruncateBeforeLoad ="true"/> <Table SourceTableName="DimAccount" DestTableSchema="dbo" DestTableName="DimAccount" TruncateBeforeLoad ="true"/> <Table SourceTableName="DimCurrency" DestTableSchema="dbo" DestTableName="DimCurrency" TruncateBeforeLoad ="true"/> <Table SourceTableName="DimCustomer" DestTableSchema="dbo" DestTableName="DimCustomer" TruncateBeforeLoad ="true"/> <Table SourceTableName="DimDepartmentGroup" DestTableSchema="dbo" DestTableName="DimDepartmentGroup" TruncateBeforeLoad ="true"/> </Tables> </DataBase> </DataBases> </Configuration>
- MaxNumberOfThreads is the maximum number of parallel processes (table and partitions) that will run, tune it by your needs depending on CPU numbers and network speed.
- DataBase is the configuration of one database, you can have as many entries as you need.
- Name is a simple name used in the log to identify your database
- Run is a Boolean, if false then the database will not be mirrored. It can be used to create a test database that will not run unless you are testing.
- DataBaseClass is the name of the destination DB. Up to now valid values are:
- SqlServer: Microsoft SQL Server.
- OleDb: Any OleDB provider.
- Navision: OleDB attached to a Navision database (via ODBC driver)
- Firebird: Firebird database
- Db2: OleDB attached to a DB2 database.
- SourceConnectionString, DestConnectionString are the connection strings for both source and destination DB. Beware that destination DB is always SQL Server, source DB may be any of the supported DB of SqlBulkTool.
- Tables: for each table you want to mirror you must provide:
- SourceTableName: the name used to build the SELECT clause.
- DestTableSchema: the schema of the destination table.
- DestTableName: the name of the destination table.
- TruncateBeforeLoad: if true the table will be truncated before loading it.
We have provided a sample script to create a database that will mirror some of the table in the AdventureWorksDW database, it is not intended as a demo or a test, just a sample to make it easier to understand the configurations of SqlBulkTool and to test the tool in an easy way.
Here we provide a basic description of the project, for whose of you who want to update/modify the code. The tool is built upon several classes
- DataBase, Table and Partition: these classes contain information about how we can mirror information.
- ThreadManager: this class will handle all the threads that will do the mirroring, it will detach them when needed and guarantee proper synchronization of the operations.
- DatabaseOperation: is the class used by ThreadManager to achieve parallelism, each DatabaseOperation will do the atomic mirror of one table or one partition, depending on the structure of the destination table
The process of mirroring the database is
- Gather information about each table, decide whether it is partitioned or not and create the operations that will mirror the table
- Start one thread for each operation, and let them work in parallel upon completion, limiting the maximum number of threads in order not to waste time in useless context switching
When the system starts it
- reads an XML file containing a list of databases and, for each database, a list of tables to be mirrored.
- Analyze each table of each database and optionally divide it into partitions, for each table or partition a DatabaseOperation is created which will load that table or partition.
- Upon termination of the analysis the real work starts and one thread will be detached for each operation, when all the threads are finished the mirroring process is terminated.
Let’s see the classes in more detail
DataBaseCopy is an abstract class designed to make it easy to handle specific dialect details of source database. Its subclasses are specialized implementation that define details about the dialect used by the source database.
DataBaseCopy manages three important operations:
- iterate over its tables to prepare the loading threads;
- handle specific dialect related topics in building the SELECT statements for the source database;
- handle some list of commands in a critical section for the table as metadata operations for switching partitions in and out need to be carried on atomically.
Tablecopy is a class that holds information about one mirror table. If the table is partitioned then TableCopy holds a list of all the partitions in the table, otherwise TableCopy contains all the information needed to mirror the table by its own.
TableCopy is designed to
- analyze the schema of the target table and build a SELECT statement that contains all the necessary columns to fill the target database. In this way, when you will need another field from a host table, you will only have to add the field with a proper type to the target table, the tool will recognize it and add it to the SELECT statement.
- Determine the partitioning structure of the table by analyzing partition function name, partition column name and partition limits.
- If the table is partitioned, it decides which partitions need to be rebuilt.
If a table is partitioned then for each single partition we need to maintain the information that will let us handle the mirror operation, in detail:
- The minimum and maximum values of the partition columns that are allowed to reside in that partition.
- The partition ID.
- Whether the partition needs to be rebuilt or not. If we decide not to load the entire source table (e.g. we can decide to refresh only the latter year and avoid useless loading of years already consolidated) not all the partitions need rebuilding, the tool will load only the partition that really need to be rebuilt.
The process of copying a single partition is
- Create a temporary table with the same structure of the target one
- Switch out the partition to rebuild into the temporary table
- Truncate it
- Load all the data into it
- Switch in the temporary table into the partition
- Drop the temporary table
As you should already know, this is the only sequence that guarantee fast partition handling, all the operation (beside loading) are metadata operation and do not need any operation on the data, switching a partition in and out is really fast.
Metadata operations are carried on in a critical section, loading is handled in parallel.
ThreadManager is responsible to coordinate threads, it will create a separate thread for each operation that has to be carried on and detach it when opportune.
As it is useless to launch all the thread at once, ThreadManager holds a semaphore that will limit the number of running threads to a configured value, in this way it is easy to scale the system based on the power of the source or target server.
The tool is not intended to be perfect, its main limitations up to now is that it will handle only RIGHT partitions, if someone wants to update it to provide handling of LEFT partitions… he/she will be welcome. Moreover, you may have different sources than the ones we provided, if so it will be enough to derivate a class from DatabaseCopy and handle there the specific dialect of your DB.
The project can be easily extended to handle more complex situations nevertheless we believe that it is a good example to demonstrate the power of SQL 2005 partitions as they can be used to parallelize the process of loading huge tables gaining a dramatic speed improvement simply by making a better use of the resources of the server.
We are currently using the tool in several project and it works fine giving us a dramatic speed improvement in the mirroring of the production database to a first staging DB.
Please note that – using SqlBulkTool – you may experience a timeout problem, this can be easily solved applying the patch that you can find in Microsoft Knowledge Base.