SSIS: living in a parallel world? Not yet…

Today I come into a nice bug (misfeature?) of SSIS 2008 and I’d like to share it with you. Let’s start from the beginning:

One of the keys to make something speedy is to parallelize it. In TableDifference I read from two inputs, compare them and produce several outputs for new, deleted, updated rows and so on. It is the classical “produce/consumer” paradigm that works fine in parallel, creating a consumer thread that compares rows and different threads to produce them.

During the porting of the component in SQL 2008 I discovered that my code does not work as expected. It reads all the lines, then terminates abnormally without producing a single row of output. It took me a couple of hours and several coffes to discover the change in the way SSIS handles asyncronous components causing my bug.

In order to make the problem easier to understand (for microsoft connect too), I produced the same behaviour with a much simpler script component. Imagine to have a script transformation that needs to make some very complex work on each row. You will end up with something like this:

    public override void Input0_ProcessInputRow(Input0Buffer Row) {         Output0Buffer.AddRow();         Output0Buffer.LineID = VeryComplexFunction(Row);         if (Row.EndOfRowset ()) {            OutputBuffer.SetEndOfRowset ();         }     }

But, if you have a 64 CPU computer. You can imagine to detach many threads, each one handling just one very complex function, in this way you will produce a massive parallel computation. Wow! The solution to get maximum speed would be this:

    public override void Input0_ProcessInputRow(Input0Buffer Row) {         Thread T = new Thread(SendAsyncRow);         T.Start(Row.EndOfRowset());     }      public void SendAsyncRow(Object O) {         Output0Buffer.AddRow();         Output0Buffer.LineID = VeryComplexFunction();         if ((bool)O == true) {             Output0Buffer.SetEndOfRowset();         }     }

You detach a separate thread for each row, the thread will compute the very complex function and send it to the output buffer as soon as it is ready. In order to make SSIS understand when everything is finished, you signal SetEndOfRowset in the last thread.

Don’t start crying, this code will never work, the SetEndOfRowset is called on the last read line and not on the last produced one, but this is just pseudocode to understand the topic, don’t blame me for inconsistencies. :) Let’s go on with the main topic.

Under SSIS 2005 this code (in VB, of course) would work fine as the engine detects the end of processing when it receives the EndOfRowset on the outputs. In SSIS 2008 the engine detects the end of the processing as soon as the last call to ProcessInputRow has finished. So, in our case, after the last thread has been detached SSIS believes that the processing is finished and calls the cleaning up of the component, completely ignoring both the presence of separate threads running and our williness to not terminate the component because we know that we need some more time.

The result? As with any other parallel bug you will sometime get a crash, on the same inputs sometime finish corectly, sometime get an error… the nightmare of any programmer.

If you want to try it by yourself, cut and paste this code to a transformation script in a data flow that process some lines ( rows 10/20 lines are enough):

using System; using System.Data; using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper;  using System.Threading;  [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute] public class ScriptMain : UserComponent {      bool useThreads = false;      /// <summary>     ///      /// Processes input rows     ///      /// </summary>     /// <param name="Row"></param>     public override void Input0_ProcessInputRow(Input0Buffer Row) {         Thread T = new Thread(SendAsyncRow);          if (useThreads) {             T.Start(Row.EndOfRowset());         } else {             SendRow();         }     }      /// <summary>     ///      /// Sends a row in sync mode     ///      /// </summary>     private void SendRow() {         // Thread.Sleep(100);         Output0Buffer.AddRow();         Output0Buffer.LineID = VeryComplexFunction();     }      /// <summary>     ///      /// Sends a row in async mode     ///      /// </summary>     /// <param name="O"></param>     public void SendAsyncRow(Object O) {         Thread.Sleep(100);         Output0Buffer.AddRow();         Output0Buffer.LineID = VeryComplexFunction();         if ((bool)O == true) {             Output0Buffer.SetEndOfRowset();         }     }      /// <summary>     ///      /// Not very complex... but it's a demo!     ///      /// </summary>     /// <returns></returns>     private int VeryComplexFunction() {         return 10;     } }

using Serial mode (useThreads se to false) everything works fine. Using parallel mode (useThreads set to true) SSIS will crash and produce no output at all. The bug in TableDifference was very similar to that, I needed to handle the wait for termination of the consumer process during the ProcessInput function.

It is not the first time I discover inconsistencies in the way SSIS handles parallelism, in order to try to make it easier to write code (using 2008 mode there is no need to call SetEndOfRowset and your program will run fine if you forget it) the architects of SSIS introduce a behavior that will make better and correct code crash. If you believe this is a problem, vote here, anyway… be aware of it, it mighe be useful in the future. :)