Direct MOLAP insert vs. proactive cache

While I’m testing Analysis Services 2005, I’m starting to think about when and how to use the proactive cache feature vs. using the DTS to insert data straight into a MOLAP partition (using the Dimension and Partition Processing Data Flow items). In both cases the target is to reduce the latency between a data change and the cube update.

My first attempts was thwarted by the user interface of BI studio: I was trying to change the cube storage settings from the cube property ProactiveCaching, but it has no effects on already created partitions. So remember: you have to change the aggregation design (and proactive cache mode) for each single partition of your cube (you have already one for each measure group). I hope that Microsoft will change this behaviour a little bit because is counterintuitive: if you have just created a cube and want to modify the storage settings, probably you want to do it for any partition of the cube; I’d expect at least a warning and a UI feature to change the storage settings for all cube’s partitions.

When I solved this problem (thanks to an answer in the newsgroup after I sent a repro case) I tried to implement a cube without a back-end data mart. I made tests with Northwind, creating a DTS package (or how it’s named now.. SSIS package?) that push dimension members and fact rows directly into the cube. You have to create a Data Source View (with corresponding SQL tables) anyway, but SQL tables are never populated and data are loaded directly into the MOLAP partition. You can load data both in full and incremental mode, and probably the incremental mode is very interesting.

My hypothesis on how to use these two modes are the following.

Direct data load in partition with DTS

  • Simple cube, large amount of data, no need of relational star schema, no need of complex transformations from the raw data
  • Incremental update with small amount of new data and a very high frequency of update

Proactive cache

  • Complex cube structure, need for high availability
  • Unscheduled and/or unexpected updates of data mart data (may be due to corrections or maintenance batch)

I’m not a strong believer of the proactive cache usefulness in a traditional data warehouse scenario: the number of transformations necessary to bring the data on the multidimensional structure is very high (and sometimes they are very complex) that is practically impossible to update the data warehouse in a continuous manner. Of course, there are a lot of other scenarios where proactive cache is the optimal solution: I think its primary use will be to get the best performance from the UDM (Unified Dimensional Model) that will become the primary source of data for any kind of report (starting from Reporting Services).

The direct data load with DTS is probably interesting to build a “frequent update” scenario like this: imagine a big and complex data warehouse with some time-critical cubes (sales, orders, stock availability). While the nightly update could be very complex due to many dimension processing, may be that for any transaction in the LOB applications you can build a simple transformation that can update the cube at least for the most important dimensions. I imagine you can easily build a queue of “events” to be loaded into the cube, resolving only the more important dimensions (like customer, product and time for a sale) and leaving a “dummy” member to other dimensions: a DTS could extract these events from the queue and incrementally update the MOLAP partition, with a very low resource-consumption from all the point of views. Remember, a ROLAP query (using the proactive cache solution) could be resource-intensive and managing many different partitions to reduce this effect could be harder and expensive, not mentioning the need to update the data mart incrementally with data coming from the LOB applications.

Do you agree with my thoughts or not? Have I missed something? Do you see other scenarios? Comments are welcome!