Optimizing High Cardinality Columns in VertiPaq

VertiPaq is the internal column-based database engine used by PowerPivot and BISM Tabular models. High cardinality columns are often the most expensive parts of a table. If you cannot remove a high cardinality column from a table, the techniques described in this article can optimize its storage, saving more than 90% of the original space.

Because of its nature, VertiPaq stores every table by column instead of by row. For each column it creates a dictionary of distinct values and a sort of bitmap index that references the dictionary. The bitmap index can be highly compressed, and in high cardinality columns the dictionary might represent more than 90% of the cost of the column. You can save this space by splitting the column into two or more columns with a smaller number of distinct values. The resulting dictionaries will be a fraction of the original one, and you can obtain the original value by combining the split values. You can use different techniques depending on the data type of the column you want to reduce. The following examples are based on a hypothetical SQL Server data source; you can easily adapt them to other data sources, but remember that these transformations have to be applied before data is imported by VertiPaq: creating calculated columns in a table would not save the space required by the original column.

WARNING: By splitting a column into multiple columns, you lose the ability to obtain a DISTINCTCOUNT calculation over a single column. You can obtain the same result by using COUNTROWS(SUMMARIZE(table,col1,col2,…,colN)), but performance can be much slower and there can be high memory pressure at query time, especially when a large number of rows is involved in the calculation.
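As a sketch, assuming a Fact table whose transaction ID has been split into TransactionHighID and TransactionLowID (hypothetical column names used throughout this article), the workaround measure would look like this in DAX:

    -- Distinct count over a column that has been split into two parts.
    -- SUMMARIZE groups by both halves, so each original value counts once.
    Fact[DistinctTransactions] :=
    COUNTROWS(
        SUMMARIZE( Fact, Fact[TransactionHighID], Fact[TransactionLowID] )
    )

The grouping materializes one row per distinct (high, low) pair at query time, which is the source of the memory pressure mentioned above.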

DATETIME Columns

If you have a DATETIME column containing a sort of timestamp of an event (i.e., both date and time), it is more efficient to split it into two columns, one for the date and one for the time. You might use the DATE and TIME data types in SQL Server, but VertiPaq will store both columns using the same Date data type: the date column will always have the same time component and the time column will always have the same date component. In this way, the date column will have at most 366 distinct values multiplied by the number of years stored, and the number of distinct values in the time column depends on time granularity (for example, there are 86,400 seconds per day).

For example, in SQL Server you might have an EventLog DATETIME column extracted in this way:

SELECT
    EventLog,
    Action,
    Value
FROM Log

You can split the EventLog column into two columns in this way:

SELECT
    CAST( EventLog AS DATE ) AS EventDate,
    CAST( EventLog AS TIME( 0 ) ) AS EventTime,
    Action,
    Value
FROM Log

This is an important optimization for any column of DATETIME data type. You can avoid importing the time column into VertiPaq if you are not interested in that part. In the sample query above, EventTime is rounded down to the second, so that you will not have more than 86,400 distinct values in that column.
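If second-level precision is not needed, you can reduce the time granularity even further. The following variation (a sketch based on the same Log table) rounds EventLog down to the minute, so the EventTime dictionary cannot exceed 1,440 distinct values:

    SELECT
        CAST( EventLog AS DATE ) AS EventDate,
        -- DATEDIFF/DATEADD truncate the timestamp to the minute
        CAST( DATEADD( MINUTE, DATEDIFF( MINUTE, 0, EventLog ), 0 ) AS TIME( 0 ) ) AS EventTime,
        Action,
        Value
    FROM Log

The same pattern works for any granularity (hour, 15 minutes, and so on) by changing the datepart and the arithmetic accordingly.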

INT and BIGINT Columns

A 32-bit or 64-bit integer value in SQL Server is stored in VertiPaq by using the same 64-bit integer format. The size used by VertiPaq is not relevant, because the actual storage cost is related to the dictionary, which depends on the number of distinct values. If an integer column has a large number of distinct values, you can split it into two or more columns, each using a smaller range of values. A typical scenario for this need is the transaction ID in a fact table, which might be required in order to retrieve a single transaction. For example, imagine you have a table with a TransactionID using all the numbers from 1 to 100,000,000. You have 100 million distinct values in the TransactionID dictionary.

SELECT
    TransactionID,
    Quantity,
    Price
FROM Fact

You can split this number into two numbers, each with no more than about 10,000 distinct values.

SELECT
    TransactionID / 10000 AS TransactionHighID,
    TransactionID % 10000 AS TransactionLowID,
    Quantity,
    Price
FROM Fact

As you can see, the TransactionID is no longer imported, and you can obtain it by using the following DAX measure:

Fact[TransactionID] := 
IFERROR(
    VALUES( Fact[TransactionHighID] ) * 10000 + VALUES( Fact[TransactionLowID] ),
    BLANK()
)
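Before importing, you can measure in SQL Server how much smaller the two dictionaries will be. This is a sketch against the same Fact table: with TransactionID values from 1 to 100,000,000, OriginalDictionary is 100,000,000 while HighDictionary and LowDictionary are about 10,000 each.

    SELECT
        COUNT( DISTINCT TransactionID )         AS OriginalDictionary,
        COUNT( DISTINCT TransactionID / 10000 ) AS HighDictionary,
        COUNT( DISTINCT TransactionID % 10000 ) AS LowDictionary
    FROM Fact

Note that the recombination HighID * 10000 + LowID matches the original value only for non-negative IDs, because integer division and modulo behave differently on negative numbers.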

In case you want to apply a filter over the table, you have to split the TransactionID you are looking for into two parts (just replace @TransactionID with the actual value in the following syntax):

CALCULATETABLE(
    Fact,
    Fact[TransactionHighID] = INT( @TransactionID / 10000 ),
    Fact[TransactionLowID] = MOD( @TransactionID, 10000 )
)

The original cost of the TransactionID column for 100 million values is nearly 3 GB in VertiPaq, whereas the split version requires less than 200 MB: the dictionary shrinks from 100,000,000 entries to roughly 20,000 entries across the two columns. This is a saving of more than 90%!

It is possible to further reduce this size by splitting the original column into more than two columns. For example, you can obtain the three-column version by using this SQL syntax:

SELECT 
    TransactionID / 1000000 AS TransactionHighID, 
    (TransactionID / 1000) % 1000 AS TransactionMidID, 
    TransactionID % 1000 AS TransactionLowID, 
    Quantity, 
    Price 
FROM Fact

You can also go further with the same technique, up to one column per digit. We tested the split up to 8 columns, and it is interesting to consider not only the storage cost, but also the processing time. The following table shows the results measured on an 8-core server.



Number of Columns   Process Time   Cores Used   Disk Size
1 (original)        02:48          1            2,811 MB
2                   03:21          up to 8        191 MB
3                   03:49          up to 8        129 MB
4                   04:01          up to 8         97 MB
8                   05:32          up to 8        105 MB

As you can see, processing a single column is a single-threaded operation. Processing the multiple-column versions requires more resources, and the execution times are similar only because multiple cores are available. With a smaller number of available cores, processing would have required longer times. However, more complex evaluations should be done in case partitioning is involved, and this is out of the scope of this article.

String Columns

You can split a string column by using the same technique you have seen for INT and BIGINT columns. The only difference is that strings have to be split by using string functions. This is a useful technique in case the TransactionID in your fact table contains non-numeric characters. For example, you can split an alphanumeric TransactionID column with a fixed 10-character length in this way:

SELECT 
    LEFT( TransactionID, 5 ) AS TransactionHighID, 
    SUBSTRING( TransactionID, 6, LEN( TransactionID ) - 5 ) AS TransactionLowID, 
    Quantity, 
    Price 
FROM Fact

The split algorithm for a string column should consider the distribution of the final result, so that it populates two smaller dictionaries and obtains a good space saving.
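For example, you can compare candidate split points by counting the distinct values each part would produce; a balanced result, where both dictionaries are far smaller than the original, usually saves the most space. This is a sketch against the same Fact table, testing the 5-character split shown above:

    SELECT
        COUNT( DISTINCT TransactionID )            AS OriginalDictionary,
        COUNT( DISTINCT LEFT( TransactionID, 5 ) ) AS HighDictionary,
        COUNT( DISTINCT SUBSTRING( TransactionID, 6, LEN( TransactionID ) - 5 ) ) AS LowDictionary
    FROM Fact

If one of the two counts is close to the original, move the split point (or split on a different character boundary, such as a prefix/suffix pattern in the codes) and measure again.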

Conclusion

High cardinality columns in PowerPivot and BISM Tabular models can be particularly expensive. The best practice is to remove them from the model, especially when these columns are not relevant for data analysis, such as a GUID or timestamp column of a SQL Server table. However, whenever the information they contain is required, you can optimize these columns by splitting the value into two or more columns with a smaller number of distinct values. This requires some more effort when accessing the column value, but the saving can be so high in large tables that it is definitely worth the effort.