Populating a fact table with SSIS

Approaching SSIS (Sql Server Integration Services) from a DTS background one of the questions that comes to mind is: what is the better way to populate a fact table?

Assuming you have a conventional star schema, you should have a fact table with two kind of fields: surrogate keys and measures. When it’s time to populate the fact table, the common issue is to transform application dimensional keys into surrogate keys. To do that, my favorite way with DTS is to create a SQL query which joins the source table containing the fact measures (still in staging db) with the previously loaded dimension tables (already in data mart db).

SELECT
COALESCE( Northwind_Mart.dbo.Time_Dim.TimeKey, 1 ),
COALESCE( Northwind_Mart.dbo.Customer_Dim.CustomerKey, 1 ),
COALESCE( Northwind_Mart.dbo.Product_Dim.ProductKey, 1 ),
LineItemTotal = [Order Details].UnitPrice * [Order Details].Quantity,
LineItemQuantity = [Order Details].Quantity,
FROM Orders
INNER JOIN [Order Details] ON Orders.OrderID = [Order Details].OrderID
LEFT JOIN Northwind_Mart.dbo.Product_Dim ON [Order Details].ProductID =
Northwind_Mart.dbo.Product_Dim.ProductID
LEFT JOIN Northwind_Mart.dbo.Customer_Dim ON Orders.CustomerID =
Northwind_Mart.dbo.Customer_Dim.CustomerID
LEFT JOIN Northwind_Mart.dbo.Time_Dim ON Orders.ShippedDate =
Northwind_Mart.dbo.Time_Dim.theDate
WHERE (Orders.ShippedDate IS NOT NULL)

The source db is the well-known Northwind of SQL2000. I use LEFT JOIN to be sure that rows with missing or wrong dimensions are still imported: the COALESCE function put a 1 key (a dummy member inserted in each dimension to identify “unknown” values) in place of the missing/wrong element.

Now, what is the best approach with SSIS?
If you migrate DTS the whole Transform Task is still executed in a DTS 2000 package. I could have used the Data Flow Task with a simple transformation, but in this way performance and limitations are not different from DTS2000: you still have to hope that both databases (staging and data mart) are on the same Sql server (just to have good performance) and you still have to hope to not have too many dimensions with complex join conditions.

One of the performance trouble difficult to solve with this approach is when you have many dimensions (like 30 or more) and some of these dimensions requires complex join condition, with correlated subqueries that aggregate some data to obtain a value to join. Someone could argue about using some precalculated temporary table… and of course I did it, but the overall performance (read: total execution time) of a nightly-scheduled batch could be not so enhanced simply moving processing time from one task to another.

With SSIS we can choose a different approach in many ways. One is illustrated by Jamie Thomson in his “The New ETL Paradigm” article. Using Merge Join is an approach that requires a previous Sort operation: you could skip one for the dimension data (using an ORDER BY clause on the application key expression) but you can’t avoid the sort operation on fact table rows… and you need to sort this rowset many times, and all the sort has to be done in memory on the whole fact table content. A lot of memory and a lot of CPU will be consumed with fact tables having many rows.

Another approach is the Lookup operation: you can add a column to the pipeline using another column (that could be previously calculated in the pipeline, solving the problem of complex join BLOCKED EXPRESSION as a key inside a table or a query result obtained from an OLE DB connection. If you don’t change default, the whole dimension is loaded in memory at the first lookup operation, so following rows in the pipeline don’t require further access to a database. This consumes memory, but it is common to have most of the dimension with not so many members. For dimensions having too many members (suppose millions of customers) you can still use Lookup, but limiting the cache size and eventually defining the lookup query for each member. In the first case a SELECT * FROM DimTable is produced by SSIS (and you can optimize it writing a query with only application key and surrogate key fields, consuming less memory and less bandwith), in the second case each lookup produce a query with a WHERE condition (and you can redefine it, eventually using a stored procedure) if the searched value has not previously loaded. Remember, you have to check the “Enable caching” checkbox in Advanced Properties of Lookup Transformation, otherwise you will have very poor performance (each lookup will send a query to the database, even if the value is the same of the previous lookup).

The results are very good: low memory consumption (rows run in the pipeline without stopping for sort operations), good tuning possible for large dimensions, low load on the database server. I still hadn’t time to convert a real-world DTS with many dimensions with complex joins, but all my tests suggest that I will be able to outperform the SQL Server performance of a single query with tens of LEFT JOINs.

The Lookup approach is very good, but it’s not so comfortable when you want to define an “unknown” member value for values that aren’t found in the lookup table. My COALESCE( field, 1 ) in DTS was very simple, maintainable and fast. Now with Lookup I need to handle the error pipeline, substitute the value and then use a Union All transformation. Repeating this task for a lot of dimensions is very tedious. Did someone know a better approach?