Data Platform and Data Science

4 September 2023

Building a Dimensional Model on a Data Lakehouse

Filed under: Data Warehousing — Vincent Rainardi @ 7:53 am

Whilst data lakes are great for data science and machine learning, they are not good for BI and reporting: poor data quality, duplicate data, missing data, no foreign keys, no NULL checks, the list goes on. That is why many companies created data warehouses/marts at the back of their data lake, in a dimensional model, like this: (source: Altexsoft)

Side note: I’ve read many websites and books about the Data Lakehouse, including Bill Inmon’s book (Ref #1 below), and found that the clearest explanations were given by Altexsoft (Ref #2) and, of course, Databricks (Ref #3). So hats off to them. Altexsoft is a Ukrainian IT consultancy with offices in the US. Last week I came across their excellent data fabric article (Ref #4), and today their lakehouse article makes the list again. They must be very good in the data platform space.

The above is not a Data Lakehouse. It is a Data Lake with a Data Warehouse at the back, i.e. a Data Lake + Data Warehouse architecture, not a Data Lakehouse.

This is a Data Lakehouse: (source: Altexsoft)

In a Data Lakehouse, there is no separate Data Lake and Data Warehouse. They are integrated into one thing called a Data Lakehouse. Like a data lake, a data lakehouse has layers, and the last layer is a dimensional model data warehouse/mart, like this: (source: Intercept, Ref #5)

It doesn’t have to be 3 layers like in a data lake. We can have 4 layers (bronze, silver, gold like in a data lake, plus a fourth layer for the dimensional model), or we can put the dimensional model in the gold layer and have only 3 layers, like below. (source: Databricks, Ref #3)

In the above ELT pipeline, the two fact tables for sales (fact_daily_sale and fact_daily_store_sale) are created from a gold table called fact_sale. fact_daily_sale is at date-customer-product-store grain, whereas fact_daily_store_sale is at date-store grain. fact_sale is the fact staging table, with grain = transaction (each item purchased by a customer in a store). You can have a look at the code in Ref #6.
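
To make the grains concrete, here is a minimal sketch in DLT SQL of how the two daily fact tables could be derived from fact_sale. The column names (transaction_date, customer_key, product_key, store_key, quantity, sales_amount) are my assumptions for illustration; the actual pipeline code is in Ref #6.

-- Daily sales fact at date-customer-product-store grain,
-- aggregated from the transaction-grain staging fact.
CREATE OR REFRESH LIVE TABLE fact_daily_sale
AS SELECT
  transaction_date,
  customer_key,
  product_key,
  store_key,
  SUM(quantity)     AS total_quantity,
  SUM(sales_amount) AS total_sales_amount
FROM LIVE.fact_sale
GROUP BY transaction_date, customer_key, product_key, store_key;

-- Daily store sales fact at date-store grain.
CREATE OR REFRESH LIVE TABLE fact_daily_store_sale
AS SELECT
  transaction_date,
  store_key,
  SUM(sales_amount) AS total_sales_amount
FROM LIVE.fact_sale
GROUP BY transaction_date, store_key;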

All 3 fact tables above are created as Delta Live Tables (DLT) on Databricks. To get started with DLT using SQL and Python, see Ref #7. With DLT we get automatic data quality testing (wow!), automatic scaling of the number of nodes, and we can visually track the operational statistics of the ELT and the data lineage too. And we can create foreign key constraints and NULL checks to enforce data integrity and improve data quality (see Ref #9 below).
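
As a rough sketch (not the actual Databricks code) of how those pieces look: a DLT expectation handles the data quality testing inside the pipeline, and the key constraints are declared on the published Unity Catalog tables. silver_sale, dim_store and the column names are assumed for illustration.

-- DLT expectation: rows with a NULL store key are dropped and
-- counted in the pipeline's data quality metrics.
CREATE OR REFRESH LIVE TABLE fact_sale (
  CONSTRAINT valid_store_key EXPECT (store_key IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM LIVE.silver_sale;

-- NULL check plus primary/foreign key constraints, declared on the
-- Unity Catalog tables (the PK/FK constraints are informational, Ref #9).
ALTER TABLE dim_store ALTER COLUMN store_key SET NOT NULL;
ALTER TABLE dim_store ADD CONSTRAINT pk_dim_store PRIMARY KEY (store_key);
ALTER TABLE fact_daily_sale ADD CONSTRAINT fk_dim_store
  FOREIGN KEY (store_key) REFERENCES dim_store (store_key);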

And DLT has Slowly Changing Dimension type 2 out of the box:

Look at the text that I highlighted in blue above. It’s as simple as using APPLY CHANGES with STORED AS SCD TYPE 2 at the end. The Type 2 Dim Store is created from the Store table in the Silver layer. The code is in Ref #6 below.
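
For reference, a minimal sketch of that statement in DLT SQL (store_id and updated_at are assumed key and sequencing columns; the actual code is in Ref #6):

-- Declare the target streaming table for the Type 2 dimension.
CREATE OR REFRESH STREAMING TABLE dim_store;

-- Apply changes from the Silver store table; DLT keeps the history
-- automatically, adding __START_AT and __END_AT columns to each row.
APPLY CHANGES INTO LIVE.dim_store
FROM STREAM(LIVE.silver_store)
KEYS (store_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;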

Conclusion

So that’s how we build a dimensional model on a Data Lakehouse. A Data Lakehouse is a data lake where the last layer (the gold layer, or a layer after gold) is a dimensional model data warehouse/mart. Instead of one big monolithic data warehouse, I would recommend building multiple data marts. Some of these marts are dimensional and some are not, depending on the reporting requirements.

I hope this clarifies what a data lakehouse is, and where the dimensional model sits in the data lakehouse. I’m a Kimball practitioner, which is why I said the data warehouse is a dimensional model (see the 12 reasons for using dimensional modelling in Ref #8). But for those of you who are Inmon practitioners, you can of course create your Inmon warehouse in the gold layer of the Data Lakehouse; the same principles apply.

References:

  1. Book: Building a Data Lakehouse, by Bill Inmon, Mary Levins, Ranjeet Srivastava: link.
  2. Data Lakehouse: Concept, Key Features and Architecture Layers, by Altexsoft: link.
  3. Dimensional modelling implementation on modern lakehouse, by Databricks (Leo Mao, Soham Bhatt and Abhishek Dey): link.
  4. Data Fabric, by Altexsoft: link.
  5. How to build a data lake, by Intercept: link.
  6. Dimensional Modelling on Delta Live Tables, by Databricks: link.
  7. Delta Live Tables (in SQL and Python), by Databricks: link.
  8. 12 reasons to use dimensional modelling, by Vincent Rainardi: link.
  9. Creating foreign keys and NULL checks on Databricks, by Databricks: link.
