Optimising The Performance Of Power Query Merges In Power BI, Part 5: Cross Joins

In a late addition to the series of posts that I started here a few months ago, I thought it would be useful to look at how to optimise the performance of a rare (but nonetheless useful) type of merge: the cross join. Cross joins are not available as a join kind in the Table.Join or Table.NestedJoin M functions but there are two ways of creating them in Power Query:

  1. Creating a custom column in your query that contains the query that you want to cross join with and then expanding it, as Matt Allington describes here
  2. Creating a custom column that contains a single value, such as the number 1, in both your queries and then merging the two queries on that column (something I mentioned a long time ago here)

Which one gives you the best performance? Apart from my old blog post there’s at least one other post suggesting the second approach is much faster, but I thought it would be useful to do some proper testing.

For my source data I created a query called SourceQuery that returned a table with one thousand rows, containing all the integers from 1 to 1000:
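
In case you want to recreate this, here's one way of building a table like that in M. This is just a sketch, and not necessarily how SourceQuery was originally built; all that matters for the tests below is that the query returns a single column called Column1 containing the integers 1 to 1000:

let
  // Generate the integers 1 to 1000 and turn them into a single-column table
  NumberList = List.Numbers(1, 1000),
  ToTable = Table.FromList(
      NumberList, 
      Splitter.SplitByNothing(), 
      {"Column1"}
    ),
  #"Changed Type" = Table.TransformColumnTypes(
      ToTable, 
      {{"Column1", Int64.Type}}
    )
in
  #"Changed Type"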

To test the first approach, I created a second query that referenced SourceQuery, then added a custom column that also returned the SourceQuery query:

…and then expanded this new custom column. The result was a query that returned a million rows – the cross join of the thousand-row SourceQuery table with itself.

Here’s the M code:

let
  Source = SourceQuery,
  #"Added Custom" = Table.AddColumn(
      Source, 
      "Custom", 
      each SourceQuery
    ),
  #"Expanded Custom" = Table.ExpandTableColumn(
      #"Added Custom", 
      "Custom", 
      {"Column1"}, 
      {"Custom.Column1"}
    ),
  #"Changed Type" = Table.TransformColumnTypes(
      #"Expanded Custom", 
      {{"Custom.Column1", Int64.Type}}
    )
in
  #"Changed Type"

The two timings from Profiler (the same ones that I have used to measure refresh performance throughout this series) were:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 3.80 seconds

To test the second approach I added a custom column to the SourceQuery query that contained the value 1:
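
Here's a sketch of what that query (called "SourceQuery With Join Column", the name referenced in the M below) might look like, assuming the new column is named JoinColumn:

let
  Source = SourceQuery,
  // Add a constant column to join on - every row gets the value 1
  #"Added Custom" = Table.AddColumn(
      Source, 
      "JoinColumn", 
      each 1
    )
in
  #"Added Custom"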

…and then created a new query that joined this query to itself:

…and then expanded the new column to get the same cross-joined output as in the first test. Here's the M code (all created in the Power Query Editor – no custom M code required):

let
  Source = Table.NestedJoin(
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      "SourceQuery With Join Column", 
      JoinKind.Inner
    ),
  #"Expanded SourceQuery With Join Column"
    = Table.ExpandTableColumn(
        Source, 
        "SourceQuery With Join Column", 
        {"Column1"}, 
        {"SourceQuery With Join Column.Column1"}
      ),
  #"Removed Columns" = Table.RemoveColumns(
      #"Expanded SourceQuery With Join Column", 
      {"JoinColumn"}
    )
in
  #"Removed Columns"

The performance of this query was much better:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 0.80 seconds

So, as you can see, the second approach is the one to use.

There’s another advantage of this second approach too, if you’re using a foldable data source like SQL Server: it is possible to make query folding happen, which is of course incredibly important from a performance point of view. You do have to be careful not to change the data type of your custom column, though (or at least be careful how you do it). For example, here’s the M for a query that gets the cross join of the DimScenario table from the Adventure Works DW database with itself:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_DimScenario = AdventureWorksDW2017{[Schema = "dbo", Item = "DimScenario"]}[Data],
  #"Removed Columns" = Table.RemoveColumns(
      dbo_DimScenario, 
      {"FactFinance"}
    ),
  #"Added Custom" = Table.AddColumn(
      #"Removed Columns", 
      "Custom", 
      each 1
    ),
  #"Merged Queries" = Table.NestedJoin(
      #"Added Custom", 
      {"Custom"}, 
      #"Added Custom", 
      {"Custom"}, 
      "Added Custom", 
      JoinKind.Inner
    ),
  #"Expanded Added Custom"
    = Table.ExpandTableColumn(
        #"Merged Queries", 
        "Added Custom", 
        {"ScenarioKey", "ScenarioName", "Custom"}, 
        {
          "Added Custom.ScenarioKey", 
          "Added Custom.ScenarioName", 
          "Added Custom.Custom"
        }
      ),
  #"Removed Other Columns" = Table.SelectColumns(
      #"Expanded Added Custom", 
      {
        "ScenarioName", 
        "Added Custom.ScenarioName"
      }
    )
in
  #"Removed Other Columns"

Here’s the SQL code generated by this query:

select [$Outer].[ScenarioName] as [ScenarioName],
    [$Inner].[ScenarioName2] as [Added Custom.ScenarioName]
from 
(
    select [_].[ScenarioKey] as [ScenarioKey],
        [_].[ScenarioName] as [ScenarioName],
        1 as [Custom]
    from [dbo].[DimScenario] as [_]
) as [$Outer]
inner join 
(
    select [_].[ScenarioKey] as [ScenarioKey2],
        [_].[ScenarioName] as [ScenarioName2],
        1 as [Custom2]
    from [dbo].[DimScenario] as [_]
) as [$Inner] on ([$Outer].[Custom] = [$Inner].[Custom2])
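
Going back to the warning about data types above: if you add a separate Changed Type step on the custom column, you may find that folding stops happening. One option that can help is to declare the type in the optional fourth parameter of Table.AddColumn instead of adding a separate step. Here's a sketch of what the #"Added Custom" step in the query above would look like with that change – it's worth checking with the View Native Query option afterwards to confirm that folding still takes place:

  // Declare the column type here rather than in a separate Changed Type step,
  // which may make it easier for query folding to keep happening
  #"Added Custom" = Table.AddColumn(
      #"Removed Columns", 
      "Custom", 
      each 1, 
      Int64.Type
    ),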

8 thoughts on “Optimising The Performance Of Power Query Merges In Power BI, Part 5: Cross Joins”

  1. My suggestion for the next post in this series:
    “Part 6: performance of multiple joins in parallel vs a single join”
    Scenario: suppose you import a folder of 50 Excel files. The two tests to do are:
    1 – Perform a join within each file’s query – multiple joins.
    vs
    2 – Perform a join in the outer (resulting) query – one single join.

    This will answer some interesting questions:
    – Which one is better?
    – Is PQ multithreaded? If so, what are the gains?

  2. A better, also relevant suggestion: which is faster, an Inner Join and an Anti Join, or a Left Join, then a Select, then a Select?

    Specifically, I have two tables from two different systems that I’m trying to join.

    Both systems have a name field, but names might not match perfectly (Barb vs Barbara, for example). I can guess the user ID based on the name, but sometimes people get married after starting….
    And sometimes there is no match.

    In Power Query, I did:
    1. Left Join on Table 1 and Table 2
    2. Filtered for only matches
    3. Referred to step 1, and filtered for only nulls
    4. Left join on step 3 and Table 2 (new criteria)
    5. Filtered for only matches
    6. Referred to step 4, and filtered for only nulls
    7. Left join on step 6 and Table 2 (2nd new criteria)
    8. Appended step 2 and step 5 to step 7.
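
    For anyone who wants to see what this staged-join pattern looks like in M, here is a rough sketch. Table1, Table2 and the column names (Name, UserID, GuessedUserID) are all made up for illustration, and the actual match criteria will obviously differ in practice:

    let
      // Stage 1: left join on the name field, then split matches from non-matches
      Join1 = Table.NestedJoin(
          Table1, {"Name"}, 
          Table2, {"Name"}, 
          "T2", 
          JoinKind.LeftOuter
        ),
      Expanded1 = Table.ExpandTableColumn(
          Join1, "T2", {"UserID"}, {"T2.UserID"}
        ),
      Matched1 = Table.SelectRows(
          Expanded1, each [T2.UserID] <> null
        ),
      Unmatched1 = Table.RemoveColumns(
          Table.SelectRows(Expanded1, each [T2.UserID] = null), 
          {"T2.UserID"}
        ),
      // Stage 2: retry the unmatched rows on a second criterion,
      // for example a user ID guessed from the name
      Join2 = Table.NestedJoin(
          Unmatched1, {"GuessedUserID"}, 
          Table2, {"UserID"}, 
          "T2", 
          JoinKind.LeftOuter
        ),
      Expanded2 = Table.ExpandTableColumn(
          Join2, "T2", {"UserID"}, {"T2.UserID"}
        ),
      Matched2 = Table.SelectRows(
          Expanded2, each [T2.UserID] <> null
        ),
      StillUnmatched = Table.SelectRows(
          Expanded2, each [T2.UserID] = null
        ),
      // Finally, append the matches from each stage plus whatever never matched
      Result = Table.Combine({Matched1, Matched2, StillUnmatched})
    in
      Result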
