Data Warehouse Layers (Modern Data Warehousing, Part 1)

This article is the first part of a series on modern data warehousing using data vault.

Introduction

The classic data warehouse architecture, going back to Bill Inmon, consists of three layers with different purposes:

  • a staging layer for getting data from various source systems into the data warehouse,
  • a core layer for integrating the data from the different systems and
  • a presentation layer for making the data accessible to consumers.

In most cases, these layers are implemented as different schemas in the same database but there are hybrid setups as well (e.g., using a Hadoop system for the staging layer and an analytical database for the core and presentation layers).

Nowadays, the core layer is usually modeled using data vault constructs. Inmon himself endorsed this approach as far back as 2007.

The presentation layer is usually modeled in a dimensional way. In recent years, however, newer data visualization and data science tools have made flat tables more popular.

Staging Layer

In the staging layer, data from operational systems and other sources is collected and prepared for loading into the core layer.

Historically, the staging layer only consisted of a transient staging area containing tables that were essentially 1:1 copies of source system tables. Each table was filled with the data from its source system counterpart and emptied after the data had arrived in the core layer.
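As a concrete (if simplified) illustration, here is a sketch of such a truncate-and-load staging job in Python with SQLite; the customers table and both in-memory connections are hypothetical stand-ins for a real source system and staging schema:

    import sqlite3

    # Hypothetical source system with a single table.
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    src.execute("INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob')")

    # Transient staging area with a 1:1 copy of the source table.
    stg = sqlite3.connect(":memory:")
    stg.execute("CREATE TABLE stg_customers (customer_id INTEGER, name TEXT)")

    # Truncate-and-load: empty the staging table, then refill it from the source.
    stg.execute("DELETE FROM stg_customers")
    stg.executemany(
        "INSERT INTO stg_customers VALUES (?, ?)",
        src.execute("SELECT customer_id, name FROM customers"),
    )

    # ... the core layer load would read stg_customers at this point ...
    stg.execute("DELETE FROM stg_customers")  # emptied again after the core load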

In recent years, the staging layer’s responsibilities have expanded. Often, it includes a persistent staging area that acts as a permanent archive of source data deliveries. This allows for loading and historizing source data before you have explicitly modeled it and serves as a kind of insurance against modeling mistakes in the core layer. If you have to change the core layer model, you can easily repopulate it from the persistent staging area.

A persistent staging area can be implemented as a history of changes (similar to an SCD2 dimension, using source system primary keys and a technical, DWH-generated timeline) or as a history of snapshots (storing all deliveries as they come, like a data lake).
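Here is a sketch of the change-history variant in the same Python/SQLite style; the table and column names are invented for the example, and a real implementation would compare complete attribute sets (typically via hash diffing) rather than a single column:

    import sqlite3
    from datetime import datetime, timezone

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE psa_customers (
            customer_id INTEGER,  -- source system primary key
            name        TEXT,     -- source attribute(s)
            load_ts     TEXT      -- technical, DWH-generated timeline
        )
    """)

    def load_delivery(rows):
        """Store only rows that differ from the latest stored version."""
        load_ts = datetime.now(timezone.utc).isoformat()
        for customer_id, name in rows:
            latest = db.execute(
                "SELECT name FROM psa_customers WHERE customer_id = ? "
                "ORDER BY load_ts DESC LIMIT 1",
                (customer_id,),
            ).fetchone()
            if latest is None or latest[0] != name:  # new key or changed value
                db.execute("INSERT INTO psa_customers VALUES (?, ?, ?)",
                           (customer_id, name, load_ts))

    load_delivery([(1, "Alice"), (2, "Bob")])     # first delivery: both rows stored
    load_delivery([(1, "Alice"), (2, "Robert")])  # only the change to key 2 is stored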

In addition, the proliferation of non-tabular sources that store data in JSON or XML formats (or only let you access the data via APIs that return it in such formats) means that the staging layer has to convert source data into a tabular format for easier downstream processing. As developing your own conversion routines can be challenging, you might want to use a third-party tool that does the conversion for you and only work with its tabular output yourself.
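As a small illustration of such a conversion, the following sketch flattens a hypothetical nested JSON payload into one row per order item; real feeds are considerably messier, which is exactly why dedicated tools are attractive:

    import json

    # Hypothetical API response with nested objects and arrays.
    payload = json.loads("""
    {"orders": [
        {"id": 1,
         "customer": {"id": 42, "name": "Alice"},
         "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]}
    ]}
    """)

    # Flatten to one row per order item; parent attributes are repeated per row.
    rows = [
        (order["id"], order["customer"]["id"], item["sku"], item["qty"])
        for order in payload["orders"]
        for item in order["items"]
    ]
    print(rows)  # [(1, 42, 'A-1', 2), (1, 42, 'B-7', 1)]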

Core Layer

In the core layer, the data from the different source systems is integrated, historized and transformed from the way the sources actually store it to the way the organization wants to see it.

To reliably fulfill these purposes, the core layer uses a normalized data model that splits the data into many different tables and therefore doesn’t offer the best user experience for analysts and other (usually human) consumers. For user-friendly data access, a presentation layer usually sits on top of it.

In recent years, data vault has become the most popular modeling paradigm for the core layer. Data vault splits the data into hub, link and satellite tables to create a flexible, easily extensible core layer with built-in historization that can be loaded with a small number of standardized loading patterns.
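To give a flavor of these standardized patterns, here is a sketch of a hub load in Python; the hashing convention (trimmed, upper-cased business key, MD5) and all names are illustrative, as conventions differ between implementations:

    import hashlib

    def hash_key(business_key: str) -> str:
        """Derive a deterministic hash key from a normalized business key."""
        return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

    # Stand-in for the hub table: hash key -> (business key, load date, source).
    hub_customer = {}

    def load_hub(business_keys, record_source, load_date):
        """Hub loading pattern: insert-only, one row per new business key."""
        for bk in business_keys:
            hk = hash_key(bk)
            if hk not in hub_customer:  # already-known keys are simply skipped
                hub_customer[hk] = (bk, load_date, record_source)

    load_hub(["C-001", "C-002"], "crm", "2024-01-01")
    load_hub(["C-002", "C-003"], "erp", "2024-01-02")  # C-002 is already known
    print(len(hub_customer))  # 3 distinct business keys from two sources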

In many cases, the data vault model consists of a raw vault area that integrates the source data around business concepts and historizes attribute values using a technical, DWH-generated timeline but doesn’t correct, harmonize or otherwise calculate attribute values. Corrections, harmonizations and other calculations (also called “soft rules”) take place in a business vault area that takes data from the raw vault and makes it fit for presentation where it isn’t already.

While the raw vault usually consists of physical tables, the business vault is often partially or completely virtual (i.e., implemented as views) to allow for frequent changes to business rules and avoid unnecessary duplication of data.
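A minimal sketch of this idea, again with invented names: the raw vault satellite keeps the delivered values untouched, while a business vault view applies a soft rule (here, harmonizing a country attribute), so changing the rule only means redefining the view:

    import sqlite3

    db = sqlite3.connect(":memory:")

    # Raw vault satellite: source values stored as delivered, never corrected.
    db.execute("""CREATE TABLE sat_customer_details
                  (customer_hk TEXT, country TEXT, load_ts TEXT)""")
    db.execute("""INSERT INTO sat_customer_details VALUES
                  ('a1', 'Germany', '2024-01-01'),
                  ('b2', 'DE',      '2024-01-01')""")

    # Business vault as a view: the soft rule lives only in the definition,
    # so no data is duplicated and the rule can be changed cheaply.
    db.execute("""
        CREATE VIEW bv_sat_customer_details AS
        SELECT customer_hk,
               CASE WHEN country = 'Germany' THEN 'DE' ELSE country END AS country,
               load_ts
        FROM sat_customer_details
    """)
    print(db.execute(
        "SELECT customer_hk, country FROM bv_sat_customer_details").fetchall())
    # [('a1', 'DE'), ('b2', 'DE')]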

Presentation Layer

In the presentation layer, the integrated data from the core layer is structured in a way that is accessible to consumers. These consumers can include human analysts, managers and other business users but also downstream systems that want to work with good-quality data.

Traditionally, the presentation layer consists of data marts that present the data for a certain subject area in the form of a dimensional model (fact tables surrounded by dimension tables). Sometimes, a distinction is made between raw marts (that present unrefined data from the raw vault in a dimensional format, e.g., for prototyping purposes) and information marts (that only present data that has been cleansed and harmonized).

Today, dimensional data marts are still the backbone of the presentation layer because they are familiar to users and their favorite BI tools alike. However, newer data visualization or data science tools might prefer their data flattened into one big table.
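The following sketch shows both shapes side by side, with invented names: a tiny star schema (one fact table, one dimension) and a “one big table” view that pre-joins them for tools that prefer flat input:

    import sqlite3

    db = sqlite3.connect(":memory:")

    # A tiny star schema: measures in the fact table, labels in the dimension.
    db.execute("CREATE TABLE dim_customer (customer_sk INTEGER, name TEXT, country TEXT)")
    db.execute("CREATE TABLE fact_sales (customer_sk INTEGER, sale_date TEXT, amount REAL)")
    db.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'DE'), (2, 'Bob', 'FR')")
    db.execute("""INSERT INTO fact_sales VALUES
                  (1, '2024-01-01', 100.0), (1, '2024-01-02', 50.0),
                  (2, '2024-01-01', 75.0)""")

    # Classic dimensional query: join facts to dimensions, then aggregate.
    print(db.execute("""
        SELECT d.country, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer d USING (customer_sk)
        GROUP BY d.country
    """).fetchall())  # [('DE', 150.0), ('FR', 75.0)]

    # "One big table": pre-join once so consumers need no joins at all.
    db.execute("""
        CREATE VIEW obt_sales AS
        SELECT f.sale_date, f.amount, d.name AS customer_name, d.country
        FROM fact_sales f JOIN dim_customer d USING (customer_sk)
    """)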

In addition to data marts, the presentation layer can include data interfaces that make data available to downstream systems in the format they need (often flattened or even non-tabular). These systems can include websites, regulatory reporting, accounting or feedback loops feeding corrections or calculated values back to the source systems.
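As a final sketch, such an interface can be as simple as reshaping mart data into whatever format a (here hypothetical) downstream consumer expects, e.g., a JSON feed:

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE mart_sales_by_country (country TEXT, total REAL)")
    db.execute("INSERT INTO mart_sales_by_country VALUES ('DE', 150.0), ('FR', 75.0)")

    # Reshape the tabular mart into the non-tabular format the consumer needs.
    payload = [
        {"country": country, "total": total}
        for country, total in db.execute(
            "SELECT country, total FROM mart_sales_by_country")
    ]
    print(json.dumps(payload))
    # [{"country": "DE", "total": 150.0}, {"country": "FR", "total": 75.0}]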

The second part of the series can be found here.

MIN P.

Shopee - Data Engineering

1y

The core layer itself usually contains multiple inner layers as well.

Dzung Vu (Daniel)

Data Architect | Data Engineer | GCP & Azure Expert | Data Warehouse Solutions | Spark & Talend Specialist | DataOps Professional

2y

Hi Christian, thank you for the useful post. I have a question related to the core layer: how do you deal with joining multiple tables (SCD Type 2) and inserting into a core table? Is this full processing or delta? Which kind of processing is more popular? Delta processing is faster than full processing, but it is more complicated to write the query.

Lawrence Fernandes

Data Architect | Solutions Architect - Big Data & Analytics

4y

Congratulations Christian, excellent article and series! As I was thinking about writing something along these lines, I would like to raise some questions and suggestions in the hope of enriching the discussion!

In the presentation layer, I would include the Kimball Group's approach. I know Ralph Kimball and his colleagues didn't create dimensional modeling, as is sometimes erroneously claimed. However, given their extensive work in spreading this approach, I believe it would be interesting to quote them. In fact, they proposed a two-layer approach, containing only the staging and presentation layers, which I've successfully used on several projects, although I recognize it may not be the ideal architecture, especially for large enterprises. Still in the presentation layer, I believe it could be subdivided into two layers: a physical layer within the DW, using dimensional modeling and Kimball's conformed dimensions, and another one inside the chosen BI platform, in the form of flat tables (data sources) for the dashboards. In my experience, this design approach allows for greater flexibility, despite duplicate ingestion efforts.

Regarding the core layer, what do you think about using anchor modeling? Recently, I started to dive into the works of Lars Rönnbäck, and thought it would be an interesting alternative or complementary option to data vault modeling.

Finally, what do you think about hybrid architectures, such as the Data Lakehouse (which I translate as using Big Data tools in a proven and disciplined way)? I had access to Bill Inmon's Exploration Warehouse and Kimball's Data Webhouse works, and I believe they are a perfect fit with these new architecture principles and technologies, such as BigQuery, Redshift and Snowflake, among others. Thanks!


So far, so good 😂 Looking forward to the series

Harald Erb

Architect, Data Management & Analytics at Scale

4y

Thanks Christian. I look forward to the next part of the series.
