This case study describes the methodologies and approaches used in optimizing ETL and query time on complex OLAP cubes such as those used to track viewing habits by national TV broadcasters.
The analysis of TV broadcast is performed using a highly detailed database that contains, at the minute level, which channel a subject is looking at. This database let the analyst look at the behavior of people, whether the same people looks at the same channel over time and so on. The detailed relational database is then used to feed some OLAP cubes. The customer (one important TV broadcast company in Italy) called us because they had problems on both ETL and query time with their SSAS cube.
- The ETL process took roughly two hours to execute and customers wanted to have data much earlier since data is necessary to give press agencies detailed information to publish as soon as possible.
- Query time was far from optimal (a couple of minutes for a pretty simple analysis) using the standard Excel interface, since the OLAP engine was always forced to read data at the leaf level (most the analysis need to go at the single minute level, which is the granularity of the data).
After the required maintenance, we reached these results:
- ETL process takes now five minutes starting from received flat files up to the process completion of the final cube
- Query time is performed in less than 10 seconds for a very complex query
Now, let us go into some details of what has been done to reach these results.
The flat files contained detailed information with pretty long business keys for the channels and the final database was designed to use the same keys. In other words, natural keys were used in the fact table, resulting in a huge waste of space. Moreover, since data was stored at the minute level, the primary key of the table was pretty complex (people, date, time) while the useful data is only the channel seen. SQL Server had to read a lot of information from a page to retrieve just a small set of useful ones.
The first task has been to rewrite the database structure in an optimized way, using the smallest integer for each dimension. This first task reduced the size of the fact tables, which in turn leads to a better speed in the ETL processing. Moreover, in order to reduce the overhead of large keys with small data, we used CLR SQL types to store, for each people, the list of channel seen during the whole day. Using this structure the keys are now much smaller than the useful portion of data and a single I/O operation is able to load a whole bunch of useful information without too much overhead.
The next step has been the optimization of the ETL. Before, a set of stored procedures loaded the facts into temporary tables and performed all the business logic using those tables. This solution required a lot of I/O and performed very bad. We switched it to SSIS logic and in-memory computation. Using SSIS we loaded the full flat file into highly optimized structures in memory and then, using a multithreaded algorithm, we performed all the business logic in memory, obtaining a dramatic speed improvement since I/O time has almost gone and the algorithm is able to use all the available CPUs in the server.
The final update to the ETL phase has been that of feeding the OLAP cubes directly from inside SSIS, instead of writing the database and then process the cubes. This reduced the need to reload the fact table for processing purposes and let the processing of the cube run in parallel with the database loading. If, for any reason, the cube needs to be reprocessed, it can load data from the relational database but, for the first time, it is faster to pull data into it while writing the same information to the relational database.
When it came the time to analyze query performance, the work has been harder. It turned out that, while SSAS performed well on some queries, for some specific ones the OLAP cube was not the correct solution. In fact, needing to always go at the granularity level, the cube could not make a good use of aggregations. Moreover, since the kind of analysis required for TV broadcasts is very complex, MDX queries were complex too and SSAS was not able to perform well.
Nevertheless, having reduced the storage of the relational database to a smaller size, we are now able to load information directly from the relational database with pretty good performance. Due to the specific nature of the queries (a lot of incremental processing was needed, to accumulate values from previous computation, something that is not so easy to describe in a set-based query language like SQL), we decided to go with .NET coding to provide the fastest queries. We designed a computing model which loads chunks of data from SQL Server using a multithreaded environment and performs all the needed computations using bitmaps and highly optimized structures, specifically designed for the kind of analysis needed for TV broadcasts. The usage of a variable number of threads let the model scale even on high powered servers.
The final result is that a query that analyzes a full month of data, at the minute level, which previously took five to six minutes, now runs in less than seven seconds.
Let us conclude with what we can learn from this case study:
- Sometimes, the solution to a problem needs deeper changes in the underlying technology. We would never had reached good ETL performance using only stored procedures or the old database structure of the customer. Deep changes are sometimes needed to open the road to better ideas and solutions.
- Always use the right tool for the job. As we described, SSAS was not the right tool to make some kind of analysis required by the customer, nor was pure SQL querying. It is useless to try to make both tools work the way we want if they are not designed to do it. Since the kind of analysis required by the customer were very complex, it has been much easier to write .NET code and leverage the full power of .NET libraries instead of trying to write very complex (and unoptimizable) MDX or SQL queries.
- The key to scale on today servers resides on multi threaded environments. Handling multi threaded computing is not an easy task and it makes very easy to create subtle bugs. Nevertheless, the gain in speed is dramatic and it is the only way we can think to make queries run faster.
- In-memory computations are faster but require more memory. Nevertheless, adding some gigabyte to a server is quite always easier than trying to make very complex code run with less memory.