Distinct is a partially blocking component that remove duplicates from one flow. Its main advantages against the sort component provided in SSIS are:
- Memory usage: Distinct does not cache the whole flow (as Sort does) but retains in memory only the distincts, consuming less memory then Sort
- Distinct is partially blocking where Sort is fully blocking
- Distinct is freeware, you can easily download sources and adapt it to your needs
During ETL we often need to take the DISTINCT values from one flow and process them in some way. This is normally done using the Sort transformation, a fully blocking component that sorts data and optionally removes duplicates.
The Sort component is really powerful, it is fast and easy to use but, as it will cache all data received before releasing any row to its output, it consumes a lot of memory. I am normally used to think twice before inserting a sort transformation in a flow as I have always to be aware of how many rows the Sort component will have to manage.
Looking in the Microsoft samples for SSIS I have found a sample component that removes duplicates. The sample component behaves much like I want Distinct to work so I decided to use it as a start.
I have removed some of the capabilities of the sample component as I didn’t need them, rearranged the use interface and then, I started working on the Distinct behaviour.
If you look at the original implementation of the RemoveDuplicatesCS component from Microsoft you will see that it uses the same technique of the sort component: it caches all data, sorts them and output non duplicated rows.
I changed the behaviour caching only the non duplicated rows in a dictionary and sending each single row to the output as soon as the component receives it and discovers that it is not a duplicated one (that means it is the first time the component sees the row).
Performance has been a problem: as I said before, Sort is a very fast component and gaining its same performance has not been easy. The problem is that – writing a component – you have to manage arrays of Object as you have no idea of what type of data you’ll have to handle. How do you quickly search for duplicates in an array of objects?
The final implementation uses a hash function cached on each single row to speed up searching the row in the distinct dictionary and behaves at much the same speed (sometimes 1,5 times faster, sometimes a bit slower) of the sort component.
A few words about memory usage. If the distinct values are few, when compared to the whole flow of data, then Distinct will be a perfect candidate for your package as it will consume a lot less memory compared to Sort and you will not have to worry about memory problems. On the other side, if you want to remove just a few duplicated rows in a huge flow then Distinct will need to cache the whole flow (as Sort does) and memory will still be a problem. The choice depends on your knowledge about the original data
Installation is very simple:
- copy Distinct.DLL into “<Program Files>/Microsoft SQL Server/90/DTS/PipelineComponents”
- add the DLL to the GAC using “GACUTIL –I Distinct.DLL”
No installer for the moment… sorry.
Using Distinct is straightforward: just insert the transformation where you would have used a Sort one, like in this diagram:
Configuring Distinct is easy as to double click on it to reach the component editor that let you choose the columns you want to use in order to take their distinct values:
The component needs to have a strong name and hence a signature file. We have provided a free snk file named SqlBiFree.snk, please note that any change in the signature needs an update to the declaration of the Distinct class where you must declare the public key of the assembly to let SSIS invoke the component editor.
Distinct is a nice component that solves memory problem for taking distinct values from a flow. It took a single day to write it, starting from the samples provided by Microsoft. The real lesson here is that it is quite always easier to write your components when you have a specific need instead of trying to adapt the existing ones.
I am pretty sure that somebody – studying the sources – will discover a faster algorithm for detecting distinct values in a flow, in the case his/her contribution will be highly appreciated.
During more tests, Marco Russo discovered an easier and faster way to get Distinct that is to use an Aggregate component instead of a Sort one. Using aggregate memory usage is very low and the SSIS package runs faster so, at the end, the usefulness of this component is just to use it as a sample, I think I’ll use Aggregate to take distincts from now on. :)