From SQL to DAX: Grouping Data

The GROUP BY condition of a SQL statement is natively implemented by SUMMARIZE in DAX. This article shows how to use SUMMARIZE and an alternative syntax to group data.

Consider the following SQL query:

SELECT
    OrderDate,
    SUM(SalesAmount) AS Sales
FROM   
    FactInternetSales
GROUP BY 
    OrderDate

It corresponds to this DAX query using SUMMARIZE:

EVALUATE
SUMMARIZE ( 
    'Internet Sales',
    'Internet Sales'[Order Date],
    "Sales", SUM ( 'Internet Sales'[Sales Amount] )
)

You can also use a syntax that produces the same result, even if it is not semantically the same, as you will see later:

EVALUATE
ADDCOLUMNS ( 
    VALUES ( 'Internet Sales'[Order Date] ),
    "Sales", CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) )
)

Using ADDCOLUMNS you need to apply CALCULATE because you have to transform the row context (defined by the iteration in the Order Date colum) into a filter context. This is not required in SUMMARIZE, because the expression specified is already executed in a filter context of the group you specified.

The semantic difference between ADDCOLUMNS and SUMMARIZE becomes clearer as soon as we involve more tables in the grouping operation. For example, consider the following SQL query:

SELECT
    d.CalendarYear,
    SUM(s.SalesAmount) AS Sales
FROM
    FactInternetSales s
LEFT JOIN DimDate d
    ON s.OrderDateKey = d.DateKey
GROUP BY
    d.CalendarYear

In this case, if the data model has a relationship between DimDate and Internet Sales, the DAX expression implicitly use it. This is the corresponding DAX syntax (the order by only guarantees that data is displayed as in the following table):

EVALUATE
SUMMARIZE ( 
    'Internet Sales',
    'Date'[Calendar Year],
    "Sales", SUM ( 'Internet Sales'[Sales Amount] )
)
ORDER BY 'Date'[Calendar Year]

You obtain this result in Adventure Works Tabular Model SQL 2012:

Date[Calendar Year] [Sales]
2005 3266373,6566
2006 6530343,5264
2007 9791060,2977
2008 9770899,74

The table passed as first argument is joined with tables required to reach the column(s) used to group data. Thus, SUMMARIZE performs the equivalent SQL operations DISTINCT and GROUP BY, and it includes a LEFT JOIN between a table and one or more lookup tables.

You can avoid the SUMMARIZE by using this other DAX syntax:

EVALUATE
ADDCOLUMNS ( 
    VALUES ( 'Date'[Calendar Year] ),
    "Sales", CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) )
)
ORDER BY 'Date'[Calendar Year]

However, this returns a different result, including calendar years defined in Date table that do not have any corresponding data in the Internet Sales table.

Date[Calendar Year] [Sales]
2005 3266373,6566
2006 6530343,5264
2007 9791060,2977
2008 9770899,74
2009
2010

This happens because the last DAX query corresponds to the following SQL syntax:

SELECT
    d.CalendarYear,
    SUM(s.SalesAmount) AS Sales
FROM
    DimDate d
LEFT JOIN FactInternetSales s
    ON d.DateKey = s.OrderDateKey
GROUP BY
    d.CalendarYear

As you see, SUMMARIZE is not required to perform a JOIN, but different DAX syntaxes executes different join types. In this case, the tables used in the LEFT JOIN are inverted.

Note: you can find more information about differences between ADDCOLUMN and SUMMARIZE in the article Best Practices Using SUMMARIZE and ADDCOLUMNS.

Finally, consider the HAVING condition in the following SQL query:

SELECT
    d.CalendarYear,
    SUM(s.SalesAmount) AS Sales
FROM
    DimDate d
LEFT JOIN FactInternetSales s
    ON d.DateKey = s.OrderDateKey
GROUP BY
    d.CalendarYear
HAVING
    SUM(s.SalesAmount) > 8000000
ORDER BY
    d.CalendarYear

This query returns only years with sales greater than 8 million:

Date[Calendar Year] [Sales]
2007 9791060,2977
2008 9770899,74

DAX does not have a syntax corresponding to the HAVING condition. After all, you might obtain the same result in SQL by applying a WHERE condition to a subquery, like in the following example.

SELECT
    CalendarYear,
    Sales
FROM
    ( SELECT
        d.CalendarYear,
        SUM(s.SalesAmount) AS Sales
      FROM
        DimDate d
      LEFT JOIN FactInternetSales s
        ON d.DateKey = s.OrderDateKey
      GROUP BY
        d.CalendarYear
    ) YearlySales
WHERE
    Sales > 8000000
ORDER BY
    CalendarYear

In DAX you need the same approach to write a syntax corresponding to the HAVING condition: just FILTER the result of a SUMMARIZE or ADDCOLUMNS function call, as you see in the following examples.

EVALUATE
FILTER (
    ADDCOLUMNS ( 
        VALUES ( 'Date'[Calendar Year] ),
        "Sales", CALCULATE ( SUM ( 'Internet Sales'[Sales Amount] ) )
    ), 
    [Sales] > 8000000
)
ORDER BY 'Date'[Calendar Year]
EVALUATE
FILTER (
    SUMMARIZE ( 
        'Internet Sales',
        'Date'[Calendar Year],
        "Sales", SUM ( 'Internet Sales'[Sales Amount] )
    ),
    [Sales] > 8000000
)
ORDER BY 'Date'[Calendar Year]

It is important to remember that the formula engine evaluates filters applied to the result of a calculation. When you can, it is better to apply a filter using CALCULATE or CALCULATETABLE (see Filtering Data article). However, you have to use FILTER in this case because the result of an aggregation cannot be part of a filter argument in CALCULATE or CALCULATETABLE.