UPDATE 2018-01-24 : The content of this article is obsolete as of January 2018. Recent versions of Excel 2016, Power BI, and Analysis Services have a SUMMARIZE behavior that is different from the one described in this article. As noted below, using SUMMARIZE should be deprecated for aggregations and you should use SUMMARIZECOLUMNS instead. Read more in Introducing SUMMARIZECOLUMNS.
UPDATE 2017-01-30 : Excel 2016, Power BI and SSAS Tabular 2016 have now SUMMARIZECOLUMNS, which should replace the use of SUMMARIZE described in this article. Read more in Introducing SUMMARIZECOLUMNS.
Because the topic of SUMMARIZE is really complex, we use a simple data model to explain its behavior in detail. The example data model consists of a single table with four columns:
On this simple model we run a query that returns four columns using SUMMARIZE:
EVALUATE SUMMARIZE ( Sales, Sales[Color], "Sales", SUM ( Sales[Amount] ), "AllColorSales", CALCULATE ( SUM ( Sales[Amount] ), ALL ( Sales[Color] ) ), "AllSales", CALCULATE ( SUM ( Sales[Amount] ), ALL ( Sales ) ) )
Looking at the code, you might expect these columns to return:
- Color: the color name
- Sales: the sales of that specific color
- SalesAllColors: the sales of products of any colors. Because the color is the only filter put in the query, you might expect it to return the sales of all products
- AllSales: looks like a different formulation of AllColorSales, i.e. returns the sales for all products, regardless of any filter.
If you execute this query, the result might surprise you:
Out of the three sales columns, one makes no sense at all. In fact, Sales and AllSales show the expected result, i.e. the sales of the current product and the sales of all products. The AllColorSales, on the other hand, seems to compute meaningless results.
In order to make a sense out of these numbers, we have to understand the SUMMARIZE behavior in detail, because it is much more complex than expected.
Let us recap the behavior of SUMMARIZE. The general pattern is the following:
SUMMARIZE ( Source, GroupByColumns, NewColumns )
In order to compute its result, SUMMARIZE does the following:
- It scans the Source table and performs a GROUP BY using the set of GroupByColumns. As a result, it creates partitions of the source table dividing them by the GroupByColumns.
- It calculates the additional columns creating a suitable filter for each column, so that the NewColumns set is computed for the set of values of GroupByColumns determined in step 1.
The key is understanding how SUMMARIZE creates the filter context in point (2) and what is its effect on the formulas. Let us follow the evaluation path of the example.
The first step is to perform a scan of Internet Sales table, grouping it by Color. The result is that the table is partitioned into three segments:
There are three groups of rows: red, green and blue. At this point, Tabular has to compute the NewColumns set and, to do this, it creates a filter. In the table, there are two types of columns: the Color column, which has been used to partition, and all the other ones, which are part of the table but are not used for the grouping. We call the Color column a GroupBy column, whereas we call the other ones as Filter columns.
Tabular iterates over the segments and, for each segment, it computes the measures using a condition that filters the GroupBy columns and all the other Filter columns in a slightly different way. For example, the filter for the red partition looks like this:
Color = "Red" && ( (Product = "Bike" && Quantity = 1 && Amount = 100) || (Product = "Shirt" && Quantity = 2 &&; Amount = 200) )
As you can easily see, this condition is very specific and it filters all the columns of the source table so that it uniquely identifies the partition to scan. In fact, for the [Sales] added column it works just fine.
But what happens when you modify this filter using CALCULATE, as in SalesAllColors? The SalesAllColors removes the filter from the color, because of ALL ( Sales[Color] ). Thus, the final condition becomes:
(Product = "Bike" && Quantity = 1 && Amount = 100) || (Product = "Shirt" && Quantity = 2 && Amount = 200)
The only columns filtered are Product, Quantity, and Amount. In the next figure, the background color highlights the rows selected by this last filter:
The third row is a shirt, with Quantity equal to 2 and Amount equal to 200. Thus, it satisfies the filter condition and it is considered a valid row, even it does not belong to the red segment. The sum of Amount of the highlighted rows is 500, which is the SUMMARIZE result of AllColorSales for the red color.
At this point, it is an interesting exercise to verify that the other numbers are computed by this same algorithm. Thus, they look wrong, but only because of the complexity of SUMMARIZE, because in reality they follow the SUMMARIZE specifications.
As a further exercise, take a look at the following query:
EVALUATE SUMMARIZE ( Sales, Sales[Color], "Sales", SUM ( Sales[Amount] ), "AllColorSales", CALCULATE ( SUM ( Sales[Amount] ), ALL ( Sales[Color] ), ALL ( Sales[Product] ) ) )
Then, explain why the result of the query is the following:
As you can see, the value in AllColorSales for the Blue row is different than the previous SUMMARIZE, because the ALL on the Product column removes the filter from the Product, whereas filters on quantity and amount are still active.
When you apply ALL on the entire table, you remove all the filters and the result is the sum of all sales, as you have seen in AllSales.
We created this small table on purpose, with several rows with identical values for the filter columns, so to clearly show the issue. In a real-world database, the scenario is much more complex and you might face databases where the problem is not so clearly visible.
For example, you can test this query on the AdventureWorks database:
EVALUATE SUMMARIZE ( 'Internet Sales', Product[Color], "Sales", [Internet Total Sales], "SalesAllColors", CALCULATE ( [Internet Total Sales], ALL ( Product[Color] ) ) )
You might expect that this query returns sales for all colors in the SalesAllColors result. However, removing the filter on the color has no effect on the segment filter, because of the filter on all the columns of the table. These filters are still active and the result is the following:
In this case, because of the data distribution, the two results (Sales and SalesAllColors) produce identical results. As you have seen from the previous example, the two results might be completely unrelated if there are rows, in different colors, with all other columns identical. This is unlikely to happen in the real world, but “unlikely” does not mean “never”.
This is the reason why the best practice for SUMMARIZE is to use it to perform only the grouping of values, avoiding its use to compute additional columns. In fact, if you rewrite the previous query using a combination of ADDCOLUMNS and SUMMARIZE, the results will be correct.
EVALUATE ADDCOLUMNS ( SUMMARIZE ( 'Internet Sales', Product[Color] ), "Sales", [Internet Total Sales], "SalesAllColors", CALCULATE ( [Internet Total Sales], ALL ( Product[Color] ) ) )
The advantage of this approach is that SUMMARIZE does not create the filter context by itself. In fact, SUMMARIZE performs only the grouping of the fact table, returning a table of Product[Color] rows. Then, ADDCOLUMNS evaluates the measures in a row context over the Product[Color] column only, and the inner ALL ( Product[Color] ) removes the filter from the color in the current row, producing the correct result.
It is worth to mention that the SUMMARIZE applies the filter on the source table. In fact, if you summarize the Product table, instead of the Internet Sales one, the calculation works in a more intuitive way.
For example, look at this query:
EVALUATE SUMMARIZE ( 'Internet Sales', Product[Color], "Sales", [Internet Total Sales], "SalesAllColors", CALCULATE ( [Internet Total Sales], ALL ( Product ) ) )
You can see that the two columns show identical values, exploiting the issue we are discussing about SUMMARIZE. But, if you change the source table to Product, instead of Internet Sales, as in:
EVALUATE SUMMARIZE ( Product, Product[Color], "Sales", [Internet Total Sales], "SalesAllColors", CALCULATE ( [Internet Total Sales], ALL ( Product ) ) )
When you use Internet Sales as the source table, SUMMARIZE applies the filter to the fact table, filtering all of its columns. Thus, removing the filter on Product does not modify the filter on the fact table. On the other hand, when you use Product as the source table, then SUMMARIZE applies the filter to the product table and ALL ( Product ) removes the filter from the right table, producing the expected result.
After reading this dissertation on SUMMARIZE, you might end considering never to use SUMMARIZE at all in your code, because it is a dangerous function. In fact, it is, but only if you do not understand its behaviour.
Let us recap briefly the various patterns where SUMMARIZE can be used, looking at some examples. The first safe pattern is a SUMMARIZE that does not add new columns:
SUMMARIZE ( ‘Internet Sales’, Product[Color] )
You can safely use this pattern to perform a GROUP BY of the table by any set of related columns. Keep in mind that, from the performance point of view, you can express the same result as:
FILTER ( ADDCOLUMNS ( VALUES ( Product[Color] ), "NumOfRows", CALCULATE ( COUNTROWS ( ‘Internet Sales’ ) ) ), [NumOfRows] > 0 )
Depending on data distribution and number of rows in each table, one of the two expression might outperform the other one, and we suggest you to perform some test before choosing the one which works better in your model.
The following pattern, on the other hand, is a very dangerous one:
SUMMARIZE ( ‘Internet Sales’, Product[Color], "Sales", [Internet Total Sales], )
We explained the reasons in this article: because of the complex way SUMMARIZE perform its filtering, the measure can return unexpected results if it relies on some inner CALCULATE.
The previous SUMMARIZE expression can be replaced with the following one:
ADDCOLUMNS ( VALUES ( Product[Color] ), "Sales", [Internet Total Sales], )
This formula is not only safer, generally it is much faster. If you need to get rid of empty values, as SUMMARIZE does, you can surround the expression with an outer FILTER, as in:
FILTER ( ADDCOLUMNS ( VALUES ( Product[Color] ), "Sales", [Internet Total Sales], ), [Sales] <> 0 )
There is a last scenario to consider: if you need to compute subtotals using the ROLLUP function, then SUMMARIZE is your only option. But, in this case, you can safely use SUMMARIZE by splitting its behaviour into two steps: first you determine the columns to filter, later you perform the actual summarization:
SUMMARIZE ( SUMMARIZE ( ‘Internet Sales’, Product[Color] ), ROLLUP ( Product[Color] ), "Sales", [Internet Total Sales] )
In this last example we used two different SUMMARIZE. You can safely replace the innermost SUMMARIZE with ADDCOLUMNS or with one of the FILTER pattern you have seen in the previous formulas.