Articles   \  

FILTER vs CALCULATETABLE: optimization using cardinality estimation



A common best practice is to use CALCULATETABLE instead of FILTER for performance reason. This article explore the reasons of that and explain when FILTER might be better than CALCULATETABLE.


Please note that in this article CALCULATE is used instead of CALCULATETABLE, but the difference is the same (CALCULATETABLE returns a table, whereas CALCULATE returns a scalar value). I will extend the article for a deeper discussion – by now, consider this article as a discussion about the relative importance of removing explicit FILTER from the arguments of CALCULATE/CALCULATETABLE, because an implicit FILTER is always rewritten in case a logical condition is used in one of those arguments.

How FILTER Works

The FILTER function is an iterator. For example, consider the following expression:

CALCULATE ( 
    SUM ( Movements[Quantity] ),
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= MAX( 'Date'[Date] )
    )
)

FILTER will iterate all the distinct values of the Date column. Thus, the number of iteration is equal to the number of distinct values of the column. In Adventure Works, this value is 2191.

Now, what happen if you iterate the entire Date table with the following expression?

CALCULATE ( 
    SUM ( Movements[Quantity] ),
    FILTER (
         ALL ( 'Date' ),
         'Date'[Date] <= MAX( 'Date'[Date] )
     )
 )

The number of iterations is the same, because you were iterating the primary key of the Date table, so the cardinality of the Date table is identical to the cardinality of the Date[Date] column. If you look at query plan with a profiler, you will notice minor differences in query plans (just the columns logically included) and, most important, exactly the same VertiPaq SE queries. Remember, what is important is we evaluated a logical condition 2191 times in both expressions.

How CALCULATETABLE Works

CALCULATETABLE pushes more job possible to the VertiPaq engine, but remember that each filter argument might contain a FILTER even if you do not see it. For example, consider this query, where CALCULATE is used (CALCULATETABLE has an identical behavior):

CALCULATE ( 
    SUM ( Movements[Quantity] ),
    'Date'[Date] < DATE ( 2013, 1, 1)
)

It seems faster just because you do not see a FILTER. In reality, a logical expression in a filter argument of a CALCULATE statement is always transformed into a corresponding FILTER statement:

CALCULATE ( 
    SUM ( Movements[Quantity] ),
    FILTER ( 
        ALL ( 'Date'[Date] ),
        'Date'[Date] < DATE ( 2013, 1, 1)
    )
)

This is not so different than the FILTER we have seen before. However, there are differences in case you have multiple filters on the same table. For example, you can write the following measure in two ways, returning the same result:

OrdersInPlaceSingleFilter :=
CALCULATE (
    COUNTROWS ( 'Internet Sales' ),
    FILTER (
        ALL ( 'Internet Sales' ),
        'Internet Sales'[Ship Date] >= MAX ( 'Date'[Date] )
        && 'Internet Sales'[Order Date] <= MAX ( 'Date'[Date] )
    )
)

OrdersInPlaceDoubleFilter :=
CALCULATE (
    COUNTROWS ( 'Internet Sales' ),
    FILTER (
        ALL ( 'Internet Sales'[Ship Date] ),
        'Internet Sales'[Ship Date] >= MAX ( 'Date'[Date] )
    ),
    FILTER ( 
        ALL ( 'Internet Sales'[Order Date] ),
        'Internet Sales'[Order Date] <= MAX ( 'Date'[Date] )
    )
) 

The measure OrdersInPlaceSingleFilter iterates all the 60398 rows in Internet Sales. The measure OrdersInPlaceDoubleFilter executes two filters iterating the 1124 unique values of the columns Ship Date and Order Date of Internet Sales table, for 2248 iterations. The OrdersInPlaceDoubleFilter will be an order of magnitude faster in iterating the filter conditions of the CALCULATE statement.

Most of the times, you should reduce the number of iterations performed in the filter statements, with just a possible exception. In certain conditions, the filter arguments applied to CALCULATE might produce a temporary Cartesian product of all the combinations of the elements you are considering. If this would happen, the internal evaluation of 1,263,376 combinations would be slower than iterating all the 60,398 rows of Internet Sales. However, this might happen only when different conditions happen at the same time: you need more than one filter depending on external contexts and you have to evaluate a distinct count measure. For example, in the sample workbook you download, this measure is very slow:

ProductsSlow :=
CALCULATE (
    DISTINCTCOUNT ( FactInternetSales[CustomerKey] ),
    FILTER (
        VALUES ( FactInternetSales[OrderDate] ),
        FactInternetSales[OrderDate] >= MIN ( DimDate[FullDateAlternateKey] )
    ),
    FILTER (
        VALUES ( FactInternetSales[ShipDate] ),
        FactInternetSales[ShipDate] <= MAX ( DimDate[FullDateAlternateKey] )
    ),
    FactInternetSalesReason
)

We optimized the measure by consolidating the two filters into a single one, more expensive if considered alone, but producing a better execution plan.

ProductsOk :=
CALCULATE (
    DISTINCTCOUNT ( FactInternetSales[CustomerKey] ),
    FILTER (
        FactInternetSales,
        FactInternetSales[OrderDate] >= MIN ( DimDate[FullDateAlternateKey] )
        && FactInternetSales[ShipDate] <= MAX ( DimDate[FullDateAlternateKey] )
    ),
    FactInternetSalesReason
)

However, please consider that this is a very special condition, and the application of a many-to-many relationship (note the FactInternetSalesReason filter) is the factor that, combined with the distinct count measure, produce a worst execution plan. It is correct assuming that separate filters are better, but if you have many filters and a distinct count calculation, you might try to consolidate filters in order to improve performance. I expect this behavior to be improved in future releases of Power Pivot and Analysis Services Tabular. That day, this optimization obtained through a more expensive FILTER condition will be no longer necessary. In the meantime, always do your benchmark before doing that on your data!

Download Demo (ZIP)







 
Want to read more?