Data modeling in Tabular is different from Multidimensional. I spend much time teaching people that they need to change their mind when approaching the Vertipaq engine, because it is a different tool with different capabilities. Nevertheless, I still have a lot to learn on this topic, as I am going to show in the next paragraphs.
One customers of mine has a 4 Billion rows table containing some numbers he wants to work with. Among those, two columns contain a value and a weight and, when aggregating the value, it should be weighted by the weight. This is a very common pattern: in the banking environment, for example, you might have the rate and the amount and, to get the mean rate, you need to weight the detail rates by the amount. In my specific scenario, the value represents the age of a customer, ranging from 0 to 100, while the weight is an integer number ranging from 1 to 60M. Take note of the ranges, it will come handy later.
I am used, in such a scenario, to store in the fact table the result of Age * Weight in a column called AgeMultipliedByWeight so that the aggregate can be computed as SUM (AgeMultipliedByWeight)/SUM(Weight). Following this pattern, I developed the solution and performances were pretty good. Nothing exceptional, but it worked fine. The reason to store the intermediate calculation is to use SUM instead of SUMX, getting better performances, as I normally teach during my classes.
Then, during PASS 2011, I had the option to discuss this model with the dev team and it has been very funny to get their feedback, which sounded like: “why did you do something that stupid? You don’t need it, just compute the values runtime using SUMX, it is much faster and you consume a lot less memory”. Learning why this pattern is wrong in Tabular is definitely worth spending some time, because in Multidimensional this is the right pattern. So, let us think in DAX and try to figure out what’s wrong with storing intermediate calculations.
First of all, we already know that the memory usage of Vertipaq depends on the number of distinct values of a column. It turns out that, while the age has only 100 distinct values, AgeMultipliedByWeight has 600 millions of different values. The difference is huge and, in fact, it turned out that the AgeMultipliedByWeight column was using a lot of memory (9Gb in my scenario with 4 billion rows). First lesson learned, increasing the number of distinct values of a column is easy as performing some computation over it and a lot of precious memory is wasted if you don’t pay attention to that.
The second and most important consideration is that the DAX Formula Engine is capable to push simple calculations to the Vertipaq Engine, which runs in a multithreaded environment (while the Formula Engine is still single threaded). Thus, if a SUMX contains a simple calculation (i.e. simple math with no complex formulas), this formula will be pushed down to Vertipaq and will be executed in a parallel way directly at the leaf level. There’s a lot of work yet to be done in order to understand what “simple calculations” means, but I will speak about that in future posts, as soon as I discover more information about what can and cannot be pushed down to Vertipaq.
Back home, I had the option to try this on my server and, as expected, performance is now much better. By changing the formula from SUM to SUMX, avoiding the intermediate calculated column, I saved 9Gb of RAM and the test query, which was running in 13 seconds, is now executed in 3 seconds, i.e. 4 times faster, scaling on all the available cores.
Thus, the first lesson here is always the same: thinking in DAX is not easy, we need to get rid of many of the concepts we have learned with UDM/MDX and change the way we approach problems. The results, when following this practice, are awesome. There is another lesson, probably the most important one: speaking with people at conferences lets you learn a lot, much more than you will ever be able to learn alone in the office or at home.
For the ones of you coming at SQL PASS Nordic… I am eager to speak with you too, see you there on 8-9 November.
Adds all the numbers in a column.
SUM ( <ColumnName> )
Returns the sum of an expression evaluated for each row in a table.
SUMX ( <Table>, <Expression> )