Choose the Right Hardware for Analysis Services Tabular

Choosing the right hardware is critical for a solution based on Analysis Services Tabular. Spending more does not always mean having a better machine. This short article describes how to choose the right server and, as you will see, the perfect Tabular server is not expensive.

Since the introduction of Analysis Services 2012, I have helped several companies adopt the new Tabular model in their solutions. A very common issue is that, when going into production, performance is lower than expected. Worse, it is lower than in the development environment. Most of the time, the reason is wrong hardware sizing, especially when the server runs in a virtualized environment. As you will see, the problem is not the use of a virtual machine by itself, but the technical specs of the underlying hardware. A complete and detailed hardware sizing guide for Analysis Services Tabular is available in the whitepaper Hardware Sizing a Tabular Solution (SQL Server Analysis Services), so the goal of this article is to provide a shorter quick guide that will help you understand the issues that affect many data centers when they have to host a Tabular solution.

Can You Choose the Hardware?

The first question is whether you can choose the hardware at all. The problem with using a virtual machine for a Tabular solution is that oftentimes the hardware has already been selected and installed, and you can only influence the number of cores and the amount of RAM assigned to your server. Unfortunately, these parameters are not the most relevant ones for performance. If your choices are limited in this way, you should collect information about the CPU model and clock of your host server as soon as possible. If you do not have access to this information, ask for a small virtual machine running on the same host server and open Task Manager: in the Performance tab, you will see the CPU model and the clock rate. With this information, you can predict whether performance will be worse than that of an average modern laptop. Unfortunately, chances are that you will be in that condition, so you will have to sharpen your political skills to convince the right people that running Tabular on that server is a bad idea. If you find that your host server is adequate, you only have to avoid the pitfall of running a virtual machine across different NUMA nodes (more on that later).

Set Hardware Priorities

Assuming that you can influence the hardware selection, keep in mind that you have to set priorities in this order:

  1. CPU Clock and Model
  2. Memory Speed
  3. Number of Cores
  4. Memory Size

As you see, disk I/O performance is not on the list, because it is not important at all. There is a separate section about it below, only because there is one condition (paging) in which disk I/O affects performance. However, you should size the RAM of the server so that no paging occurs at all. Allocate your budget to CPU, memory speed, and memory size, and do not waste money on disk I/O bandwidth.

CPU Model

Analysis Services Tabular runs DAX and MDX queries using an architecture with two main modules: the formula engine and the storage engine. Every query starts its execution in the formula engine, which runs operations sequentially using a single thread. The formula engine requests data from the storage engine, which can execute operations in parallel on multiple threads, but then has to return the data to the formula engine as a single dataset. All the remaining operations executed by the formula engine run in a single thread. Thus, the clock speed of the CPU is critical for the formula engine, which is also unable to use the cache to store the results of a DAX query. The cache only stores the results of storage engine requests, so the formula engine code runs again in every execution of the same DAX query.
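You can observe this cache behavior yourself. As a sketch (assuming the same Adventure Works-derived sample model used in the benchmark query later in this article), run the following query twice in DAX Studio with Server Timings enabled: on the second execution the storage engine time drops to nearly zero, because the VertiPaq scans are served from the storage engine cache, while the formula engine time remains roughly the same, since formula engine results are never cached.

```dax
-- Run twice in DAX Studio with Server Timings enabled.
-- Second run: storage engine (SE) time is close to zero (served from SE cache),
-- formula engine (FE) time is unchanged (FE results are never cached).
EVALUATE
ADDCOLUMNS (
    VALUES ( Reseller[ResellerKey] ),
    "Sales", CALCULATE ( SUM ( Sales[SalesAmount] ) )
)
```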

The most important factors that affect the speed of code running in the formula engine are the CPU clock and model. Different CPU models might deliver different performance at the same clock rate, so considering the clock alone is not enough. The best practice is to run your own benchmark, measuring performance with queries that stress the formula engine. An example of such a query, on a model derived from Adventure Works, is the following (you can download the sample workbook to test this query on your hardware: open the Excel workbook and run the query below in DAX Studio):

EVALUATE
ROW (
    "test", COUNTROWS (
        GENERATE (
            TOPN (
                8000,
                CROSSJOIN (
                    ALL ( Reseller[ResellerKey] ),
                    ALL ( Reseller[GeographyKey] )
                ),
                Reseller[ResellerKey]
            ),
            ADDCOLUMNS (
                SUMMARIZE (
                    Sales,
                    OrderDate[FullDate],
                    Products[ProductKey]
                ),
                "Sales", CALCULATE ( SUM ( Sales[SalesAmount] ) )
            )
        )
    )
)

You can try this query (which is intentionally slow and does not produce any meaningful result) or similar ones. Using a query from a typical workload for your data model is certainly better, because performance might vary on different hardware depending on the memory allocated to materialize intermediate results (the query above makes minimal use of memory).

For example, this query runs in 8.1 seconds on an Intel i7-4770K 3.5 GHz and in 12.4 seconds on an Intel i7-2860QM 2.5 GHz. These CPUs power a desktop workstation and a notebook, respectively. If you think a server might run faster, run your own test and share the results in the comments, specifying CPU model and clock. If you do not have Excel on the server, you can restore the Power Pivot model to Analysis Services Tabular and run the query in SQL Server Management Studio if you do not have DAX Studio.

In general, the Intel Xeon processors used in servers are from the E5 and E7 series, and it is very common to find clock speeds around 2 GHz, even with a very high number of cores available. You should look for a clock speed of 3 GHz or more whenever possible. Another important factor is the L2 and L3 cache size: the larger, the better. This is especially important for large tables and for relationships based on columns that have more than one million unique values.

We observed that modern Intel i7 series CPUs running at 3 GHz or more (up to 4 GHz) provide very good performance. These CPUs are optimized for desktop workloads and you will hardly find them in a server. The concept of “modern” is relative: look at the CPU model release date, and consider “modern” whatever has been released in the last 18-24 months. For older models, verify whether a newer generation exists and compare CPU benchmarks (single-thread operations) against other CPUs. However, in-house tests with a benchmark made on your own data are more relevant than any published benchmark you can find.

Memory Speed

Memory speed is an important factor for a Tabular model. Every operation made by the storage engine accesses memory at a very high speed. When RAM bandwidth is the bottleneck, you see CPU usage instead of I/O waits. Unfortunately, there is no performance counter that monitors the time spent waiting for RAM access. In Tabular, this amount of time can be significant, and it is hard to measure.

In general, you should get RAM running at 1600 MHz at least, but if the hardware platform permits, you should select faster RAM (1866, 2133, or 2400 MHz). At the time of writing (June 2015), 1866 MHz is a fast standard on a server, whereas it is hard to find 2133 MHz, and impossible to find 2400 MHz unless you buy a desktop optimized for gaming.

Number of Cores

The storage engine splits execution across multiple threads only when the table involved has multiple segments. By default, each segment contains 8 million rows (1 million in Power Pivot). Therefore, if you have 8 cores, you will not see all of them involved in a storage engine scan unless the table has at least 64 million rows.
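To verify how many segments your tables actually have (and therefore the maximum parallelism available to the storage engine when scanning them), you can query a dynamic management view from SQL Server Management Studio or DAX Studio connected to the Tabular instance. This is only a sketch; note that the view returns one row per column per segment, so filter on a single column of the table to count its segments:

```sql
-- One row per column per segment; RECORDS_COUNT shows the rows in each segment.
SELECT DIMENSION_NAME, TABLE_ID, SEGMENT_NUMBER, RECORDS_COUNT
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS
```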

For these reasons, scalability over multiple cores is effective only for very large tables. Raising the number of cores improves performance for a single query only when it hits a large table (200 million rows or more). In terms of scalability (number of concurrent users), a higher number of cores might not improve performance if users access the same tables, because they contend for access to shared RAM. A better way to increase the number of concurrent users is to use more servers in a load-balancing configuration.

The best practice is to get the maximum number of cores you can on a single socket, with the highest clock rate available. It is not a good idea to have two or more sockets on the same server. Analysis Services Tabular does not recognize the NUMA architecture, which splits memory between sockets and requires more expensive inter-socket communication whenever a thread running on one socket accesses memory allocated by another socket. You can find more details about the NUMA architecture in Hardware Sizing a Tabular Solution (SQL Server Analysis Services).

Memory Size

You have to store the entire database in memory. You also need RAM to execute process operations (unless you use a separate process server) and to execute queries. Optimized queries usually do not require much RAM, but a single query can materialize temporary tables that could be very large (database tables have a high compression rate, whereas the materialization of intermediate tables during a single query generates uncompressed data).
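To get an idea of how much memory the objects of your instance actually consume, you can query a standard Analysis Services dynamic management view from SQL Server Management Studio or DAX Studio. This is a sketch for rough sizing, not a precise accounting of the RAM needed at query time:

```sql
-- Memory used by each object of the Analysis Services instance, in bytes.
-- OBJECT_MEMORY_NONSHRINKABLE is memory the server cannot release on demand.
SELECT OBJECT_PARENT_PATH, OBJECT_ID, OBJECT_MEMORY_NONSHRINKABLE
FROM $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE
```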

Having enough memory only guarantees that your queries will complete; increasing available RAM beyond that does not produce any performance improvement. The cache used by Tabular does not grow just because more RAM is available. However, a low-memory condition might affect query performance in a negative way if the server starts paging data. You should simply have enough memory to store all the data of your database and to accommodate the materialization that occurs during query execution.

Disk I/O and Paging

You should not allocate budget to storage I/O for Analysis Services Tabular. This is very different from Multidimensional, where random disk I/O operations occur very frequently, especially for certain measures. In Tabular, there are no direct storage I/O operations during a query. The only case in which disk I/O matters is a low-memory condition that causes paging. However, it is less expensive and more effective to provide more RAM to the server than to try to improve performance by increasing storage I/O throughput when systematic paging is caused by low available memory.

Conclusion

You should measure performance before choosing the hardware for Analysis Services Tabular. It is common to observe a server running twice as slow as a development workstation, even when the server is brand new. This is because a server designed to be scalable (especially for virtual machines) usually does not perform well in activities executed by a single thread, and this type of workload is very common in Tabular. You will need time and numbers (run your own benchmark) to convince your company that a “standard server” could be the weak point of the entire BI solution. This article should give you the arguments, and a starting point to create your own benchmark tests.

Related Material

You will find more details about hardware selection for Analysis Services Tabular in the following articles and whitepapers:

  * Hardware Sizing a Tabular Solution (SQL Server Analysis Services)

Download

Download Demo (ZIP)