PowerPivot: improve the performances by sorting tables

While it is well know that in UDM data should be loaded sorted, to get better file compression and map structures. The world of PowerPivot (and BISM) is yet not explored. Working with a 50 millions rows table, I tested various scenarios to check if sorting affects PowerPivot data structures too.

The fact table structure is very simple:

CREATE TABLE Fact_Transaction (       ID_Account INT NOT NULL ,       ID_Type INT NOT NULL ,       ID_Date INT NOT NULL ,       Amount MONEY NOT NULL ) 

There are 240,000 accounts, 1,000 dates and 4 types. Thus, the ID_Account is the most selective, followed by the date and the type. I tried loading data inside PowerPivot using different sorting and the final workbook size is in the next table:

Order File Size (K) HEAP (unosrted) 498,905 Type, Date, Account 372,505 Date, Type, Account 372,209 Account, Date, Type 374,001

Thus, the compression algorithm in PowerPivot seems to prefer sorted data (which is somehow expected, even if not obvious). The big difference is between sorted and unsorted data while changing the sort order does not affect the final size in an evident way. This is definitely expected since the ratio between transactions (50 millions) and accounts (240.000) is very high, resulting in a good sorting even with the most selective column. Different distributions in real world scenarios might lead to different results and they are definitely worth trying.

What is interesting is that the query speed of the final workbook is much better with the smaller ones, when compared with the biggest, i.e. unsorted. This might indicate that it is not just a matter of workbook size but even of internal data structures that result in a better optimized format when data is fed sorted to PowerPivot. I don’t have clear number here, but the difference is evident at first glance by pivotting over the data model.

The final hint is straightforward: if you use a heap or a table with an INT IDENTITY column for your fact table, then it might be useful to check different sorting and (as always) verify the final result. It might be the case that you are wasting memory and CPU due to a poorly designed data structure or, in other words, that there is space for improvement.