A relationship in a semantic model can be either regular or limited. Regular relationships are the most common: a one-to-many relationship between two tables on the same data island is almost always a regular relationship. If the tables involved in the relationship are stored in different data islands, then the relationship is limited. We analyzed the semantic differences between the two types of relationships in previous articles: Understanding blank row and limited relationships and Regular and limited relationships in Power BI.

Relationships between different data islands are the most common case of limited relationships. In that scenario, performance depends on multiple factors, most of which are not under the control of a DAX developer. Indeed, when mixing data from different data islands, the DAX formula engine must act as a bridge between them, resulting in complex execution plans. Besides, when two tables reside in different data islands, only limited relationships can connect them. Therefore, a performance comparison would not make sense, as there are no alternative options to link the tables.

However, a model can have limited relationships in the very special case of two tables stored in the same data island and connected by a many-to-many cardinality relationship. By nature, many-to-many cardinality relationships are limited. While they seem like a convenient way to link two tables when the key used to link them is not unique in either table, many-to-many cardinality relationships are extremely expensive, and a wise data modeler should limit their use strictly to cases where they are absolutely necessary. In this article, we analyze the differences between regular and limited relationships, focusing solely on performance.

Introduction

Choosing the most efficient type of relationship is important in data models because relationships are used every time the DAX engine needs to scan a table and group the result by columns of a related table. A slow relationship slows down every measure that traverses it.

Limited relationships are shown in Power BI by displaying small gaps at the edges of the line that connects two tables. In the following figure, the relationship between Product and Sales is regular (no gaps), whereas the relationship between Sales and Customer is limited (gaps are present).

When Power BI creates a matrix like the following, grouping by Customer[Country] and computing the Sales Amount measure, it needs to scan Sales and group by the columns in Customer, using the relationship to perform the join.
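
Throughout the article, Sales Amount is a simple aggregation over Sales. The downloadable demo uses the usual Contoso definition; we show it here for reference, assuming the standard Contoso column names (any measure that scans Sales behaves the same way in these tests):

Measure
Sales Amount :=
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )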

To test the difference in performance between the two types of relationships, we execute the same queries twice: first traversing a regular relationship, and then a limited one. Nothing else changes: the tables are still Customer and Sales, linked by the same relationship. The only difference is the type of relationship: either regular or limited.

Choosing the test database

Quick heads up before we move forward with the test: the downloadable demo for this article uses the usual Contoso 10K model, which has around 4,000 rows in the Sales table. Tests executed on such a tiny model are useless, because execution times are so short that any noticeable difference may depend on external factors. Therefore, the timings shown in this article come from the same model, but with 1.4 billion rows in the fact table. The conclusions are the same, but the numbers are so much larger that the differences are clear.

As an example, this simple query runs in 3 milliseconds on the small model and in 15,000 milliseconds on the large model:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Sales", [Sales Amount]
)

However, multiple executions of the same query may vary in timing: it can range from 3 to 7 milliseconds on the small model (more than double the time), whereas it stays around 15,000 milliseconds (+/- 300) on the large model. Bear that in mind when measuring performance: small models are pretty much useless for assessing the quality of your DAX or modeling choices. Always test on large models, with tens of millions of rows on the many side and millions of rows on the one side of the relationship!
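
Before trusting any benchmark, it is worth double-checking the size of the model you are testing on. A trivial query like the following (table names are those of the Contoso model used here; adjust them to your own) returns the cardinality of both sides of the relationship:

Query
EVALUATE
ROW (
    "Sales rows", COUNTROWS ( Sales ),
    "Customer rows", COUNTROWS ( Customer )
)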

Testing simple measures

Let us start measuring the baseline: the time required to compute Sales Amount without any relationship being involved. This measures the time needed to scan the Sales table:

Query
EVALUATE
SUMMARIZECOLUMNS (
    "Sales", [Sales Amount]
)

Here are the server timings.

The interesting number is the storage engine CPU (SE CPU): more than 7 seconds. Because of parallelism, those 7 seconds of CPU time translate into just 134 milliseconds of wall-clock waiting time (SE): 7,000 divided by 134 corresponds to a degree of parallelism of roughly 50. However, when evaluating performance, we are interested in the SE CPU rather than the SE timings, because SE CPU provides a better picture of the real processing cost of an operation: the SE wall-clock time depends on how many cores happen to be available, whereas SE CPU measures the total work performed.

We now run the same query, grouping by Customer[Country], and we evaluate the cost of traversing the relationship:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Sales", [Sales Amount]
)

As you can see, the execution time nearly doubled.

The server timings also show that the relationship is used within the VertiPaq storage engine, where there is a join between Sales and Customer.

In all its simplicity, this demo already shows how expensive relationships are. Scanning the Sales table takes 7 seconds, whereas scanning the same Sales table while using the relationship to group by Customer[Country] adds another 7 seconds. Be mindful that the cost of a relationship strongly depends on the number of rows in the table on the one side of the relationship (Customer, in this case). Customer contains 1.8M rows. Using a smaller table like Product produces a faster execution time:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Product[Brand],
    "Sales", [Sales Amount]
)

Instead of 15 seconds, using Product as the one side of the relationship produces a noticeably faster result.

Now that we have a better understanding of the baseline and the main factors to consider in a relationship, let us change the relationship between Customer and Sales from one-to-many to many-to-many, thus obtaining a limited relationship.

If we execute the query that groups by Customer[Country] again, we notice two important details:

  • There are now multiple storage engine queries executed in a batch. A many-to-many cardinality relationship cannot be pushed down to the storage engine as a single query.
  • The execution time is noticeably longer.

Here are the server timings.

The SE CPU skyrocketed from 15,000 to 138,000 milliseconds. What is most relevant is to understand what this number comprises. We know that out of 15,000 milliseconds, around 7,000 are needed to scan the Sales table. Therefore, the additional cost of the regular relationship is around 8,000 milliseconds. Similarly, out of 138,000 milliseconds, 7,000 are needed to scan the table. Therefore, the additional cost of the limited relationship is around 131,000 milliseconds. In other words, the many-to-many cardinality relationship makes traversing the relationship more than 16 times as expensive as a regular relationship (131,000 versus 8,000 milliseconds).

Be mindful that this price must be paid whenever the table is scanned. In a regular report with dozens of different measures being computed, all these additional costs add up, further slowing the report.

Testing multiple measures

An important consideration about many-to-many cardinality relationships is that they cannot be pushed down to the storage engine as simple joins. This can limit the optimization options available to the DAX engine. Let us see an example of this in the context of Fusion optimization. If you are not familiar with Fusion, you can find further information here: Introducing horizontal fusion in DAX, and here: Optimizing fusion optimization for DAX measures.

The following query needs to compute Sales Amount for male and female customers, using two different columns for the result:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Male Sales", CALCULATE ( [Sales Amount], Customer[Gender] = "Male" ),
    "Female Sales", CALCULATE ( [Sales Amount], Customer[Gender] = "Female" )    
)

Thanks to Fusion optimization, with a regular relationship, the entire query can be computed through a single storage engine query that retrieves sales grouped by country and gender.
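
Conceptually, the fused datacache is equivalent to the result of a single scan grouped by both Country and Gender; the formula engine then slices that datacache into the Male and Female columns. The following query is a rough DAX equivalent of that single datacache, shown only for illustration; it is not what the engine literally executes:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    Customer[Gender],
    "Sales", [Sales Amount]
)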

If the relationship is set to many-to-many cardinality, Fusion is disabled because the VertiPaq engine cannot retrieve the values for both male and female in a single query by using a limited relationship. The same query with a many-to-many cardinality relationship generates these timings.

The storage engine needs to execute two complex batches: one for male and one for female. Needless to say, the more measures you have that cannot be merged with Fusion, the more complicated the entire plan becomes.
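
As a contrived sketch of how this compounds, consider adding two more measures over the same Gender filters. With a regular relationship, Fusion can still satisfy a query like the following with very few storage engine scans; with a many-to-many cardinality relationship, each measure tends to require its own batch:

Query
EVALUATE
SUMMARIZECOLUMNS (
    Customer[Country],
    "Male Sales", CALCULATE ( [Sales Amount], Customer[Gender] = "Male" ),
    "Female Sales", CALCULATE ( [Sales Amount], Customer[Gender] = "Female" ),
    "Male Transactions", CALCULATE ( COUNTROWS ( Sales ), Customer[Gender] = "Male" ),
    "Female Transactions", CALCULATE ( COUNTROWS ( Sales ), Customer[Gender] = "Female" )
)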

Conclusions

Many-to-many cardinality relationships are a great modeling feature, but they significantly slow down every query executed by the DAX engine. A wise DAX developer needs to know the price of many-to-many cardinality relationships and limit their use to only the scenarios where they are strictly necessary.

Specifically, because the price of a relationship is linked to the number of unique values in the column used for the relationship, many-to-many cardinality relationships should be avoided as much as possible with large cardinalities (millions of unique values). At the same time, they may be a good solution for smaller cardinalities (preferably thousands of unique values or fewer), where the limited number of unique values involved reduces the additional overhead.