DISTINCT COUNT measures and NULL value

After several years of using Analysis Services and SQL Server 2000, today I stumbled in this problem.

I have a DISTINCT COUNT measure that have to measure how many different documents are present in a certain aggregation. Sometimes valid cells has no documents and the document field is NULL on the fact table (in reality it’s a view that generate this degenerate dimension, anyway this doesn’t really matter). The problem is that the NULL value is counted as a valid different value… so when you see 2 in the cube it could be 1 or 2 if you want to ignore NULL values.

Solution is not so light. Following this kb article (only available in korean language, I suppose – no english translation available but keywords are still in latin character so I guessed right…) I’ve found that a solution is to define two other measure in the cube with “regular” measure (as a best practice it’s best to put distinct count measures each in a different cube, for performance reasons – then merge all measures in a virtual cube):

  • Rows_number – defined as SUM of expression “1”
  • NotNullRows_number – defined as COUNT of expression “field”, where field is the same field you use as DISTINCT COUNT expression

In the cube with DISTINCT COUNT measure I defined only this measure:

  • Documents_null – defined as DISTINCT COUNT of expression “field”, where field is the same as the COUNT measure

Then in the virtual cube you put all together creating a single calculated measure that evaluate the right number:

  • Documents – defined as calculated measure with expression “IIF( Rows_number > NotNullRows_number, Documents_null – 1, Documents_null )”

It works, but you are wasting three times the space originally needed for the DISTINCT COUNT measure.

Distinct count measures are a really bad beast in Analysis Services 2000. Unfortunately it seems that even the Yukon release will suffer from some actual limitations, like the performance penalty caused by an ORDER BY (with the field you used as DISTINCT COUNT measure) in the SELECT sent by Analysis Services to the source database when you have a DISTINCT COUNT measure. Try it on a (virtual) cube with several DISTINCT COUNT measures and a large fact table… a lot of RDBMS resource are needed while processing this cube.

UPDATE: thanks to a valuable feedback, it has to be noted that Analysis Services handles a NULL value like a 0 value in a DISTINCT COUNT measure. In my real-world scenario I had a DISTINCT COUNT measure on a dimension key that’s an int identity, so it never has 0 as a valid value. Anyway, you have to take care of this behaviour if you want to use this technique.