Distinct is a partially blocking component that removes duplicates from one flow. Its main advantages against the sort component provided in SSIS are:
- Memory usage: Distinct does not cache the whole flow the way Sort does; it only retains the distincts in memory, consuming less memory than Sort;
- Distinct is partially blocking whereas Sort is fully blocking;
- Distinct is freeware, you can easily download sources and adapt it to your needs.
During ETL we often need to take the DISTINCT values from one flow and process them in some way. This is normally done using the Sort transformation, a fully blocking component that sorts data and optionally removes duplicates.
The Sort component is really powerful, fast and easy to use; however, because it will cache all data received before releasing any row to its output, it consumes a lot of memory. I usually tend to think twice before inserting a Sort transformation into a flow. Indeed, I always have to be aware of how many rows the Sort component will have to manage.
Looking at the Microsoft samples for SSIS I have found a sample component that removes duplicates. The sample component behaves much like how I want Distinct to work so I decided to use it as a start.
I have removed some of the capabilities of the sample component as I didn’t need them, rearranged the user interface and then, I started working on the behavior of Distinct.
If you look at the original implementation of the RemoveDuplicatesCS component from Microsoft you will see that it uses the technique of the sort component: it caches all data, sorts it and returns non-duplicated rows.
I changed the behavior to only cache the non-duplicated rows in a dictionary, and to send each single row to the output as soon as the component receives it and discovers that it is not duplicated – meaning it is the first time the component sees the row.
Performance has been a problem: as I said before, Sort is a very fast component and achieving the same performance has not been easy. The problem is that when writing a component you have to manage arrays of Object; indeed, you have no idea what type of data you’ll have to handle. How do you quickly search for duplicates in an array of objects?
The final implementation uses a hash function cached on each single row to speed up searching the row in the distinct dictionary. It behaves at pretty much the same speed as the sort component – sometimes 1.5 times faster, sometimes a bit slower.
A few words about memory usage: If the distinct values are few compared to the whole flow of data, then Distinct will be a perfect candidate for your package. Indeed, it will consume a lot less memory compared to Sort and you will not have to worry about memory issues. On the other hand, if you want to remove just a few duplicated rows in a huge flow of data then Distinct will need to cache the whole flow (like Sort does) and memory will still be a problem. The choice depends on your knowledge about the original data.
Installation is very simple:
- copy Distinct.DLL into “<Program Files>/Microsoft SQL Server/90/DTS/PipelineComponents”
- add the DLL to the GAC using “GACUTIL –I Distinct.DLL”
No installer for the moment… Sorry.
Using Distinct is rather straightforward: just insert the transformation where you would have used a Sort, like in this diagram:
Configuring Distinct is as easy as double-clicking on it to reach the component editor, that lets you choose the columns you want to use in order to take their distinct values:
The component needs to have a strong name and hence a signature file. We have provided a free .snk file named SqlBiFree.snk. Please note that any change in the signature requires an update to the declaration of the Distinct class where you must declare the public key of the assembly to let SSIS invoke the component editor.
Distinct is a nice component that solves memory challenges when taking distinct values out of a flow. It took a single day to write it, starting from the samples provided by Microsoft. The real lesson here is that it is almost always easier to write your own components when you have a specific requirement, rather than trying to adapt the existing components.
I am pretty sure that somebody studying the sources will discover a faster algorithm for detecting distinct values in a flow – in this case his/her contribution will be greatly appreciated.
During further tests, Marco Russo discovered an easier and faster way to get Distinct that is to use an Aggregate component instead of a Sort component. Using aggregate memory usage is very low and the SSIS package runs faster – so in the end, the usefulness of this component is just to use it as a sample. I think I will use Aggregate to take distincts from now on. :)