HOOK vs Data Vault: Agility

To quote Heraclitus, “The only constant in life is change” - a universal truth that also applies to the data warehouse, so we might as well make life easy for ourselves and use an approach that doesn’t limit our ability to respond to change.

Continuing the side-by-side comparison of HOOK and Data Vault, we will explore how the two approaches can absorb the inevitable changes that we will encounter.


The Scenario

I’ll try to keep this simple. Consider the following Order table we want to ingest into our data warehouse. 

[Image: the Order source table, with the Order_ID and Customer_ID business keys shown in bold]

Initially, we identified two business keys (bolded) in the data, referring to the Order and Customer core business concepts (CBCs).

We will then introduce changes into the respective models to see how they change in response.


Data Vault Solution

First, let’s see what the initial target Data Vault model might look like. 

[Image: the initial Data Vault model - Order and Customer Hubs joined by an Order Link, with an Order Satellite]

It is a familiar structure with the two business concepts represented by two Hubs, joined through a Link, from which we hang a Satellite. Let’s assume we implemented this model with no issues, and we have been happily loading data to it for a period of time.

Sometime later, after speaking with the business, we realise we have a new business concept for a Store to consider. As it happens, the existing source table already contained the business key for a Store (Store_Code), but we didn’t need it then.

So now we have three business keys to worry about; we must add a new Store Hub and adapt the model to incorporate it, as follows:

[Image: the revised Data Vault model with the new Store Hub and V2 versions of the Order Link and Satellite]

Remember that the existing structures still exist, and we have been loading data to the Link and Satellite tables over a period of time. Refactoring those tables could be problematic, requiring new keys to be calculated. The recommended approach is to leave the old structures alone, implement new versions, and then load the data to them going forward. So now we have a second version (V2) for the Order Link and Order Satellite (as well as the new Store Hub).
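To illustrate why the new keys matter, here is a rough sketch of what the V2 Link might look like as DDL. The column names and types are assumptions based on common Data Vault conventions, not a prescribed standard:

-- V2 of the Order Link: the link hash now covers three business keys,
-- so the V1 keys cannot simply be reused (simplified; types are assumptions)
CREATE TABLE [LINK ORDER V2]
( Link_Order_HashKey   CHAR(32)    NOT NULL  -- hash of Order_ID, Customer_ID and Store_Code
, Hub_Order_HashKey    CHAR(32)    NOT NULL
, Hub_Customer_HashKey CHAR(32)    NOT NULL
, Hub_Store_HashKey    CHAR(32)    NOT NULL  -- reference to the new Store Hub
, Load_Timestamp       DATETIME    NOT NULL
, Load_Source          VARCHAR(50) NOT NULL
);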

Now we load the data into the new structures, but note that we could still load the incoming data into the old structures. However, that means we are deliberately loading duplicate information into the warehouse, which is bound to cause problems.

More time goes by, and we discover another change: a requirement to add a new business concept, that of a Promotion. Again, the business key was in the data all along but, as before, wasn’t important enough to pull out as a business key. We extend the model again, and now we have a four-way Link.

[Image: the Data Vault model extended again with a Promotion Hub and a four-way V3 Link and Satellite]

And as before, we needed to create a new version of the Link and Satellite and introduce the new Promotion Hub. What does the overall Data Vault model look like now?

[Image: the complete Data Vault model, now comprising ten tables across all versions]

The picture is getting a bit complicated, which, as I said before, an automation tool would handle, but it shows how a Data Vault model can gradually bloat over time.


HOOK Solution

The HOOK approach is not immune to change, but depending upon the design choices, there will be slightly different ways of handling it. We will go through the same steps based on the design choice of physical Bags, as this is the closest option to the Data Vault steps described in the previous section.

The initial model would look like this:

[Image: the initial HOOK model - a single Order Bag with Hook Keys for Order and Customer]

The table structure on the right-hand side is a HOOK construct called a Bag, which we can implement in different ways, but let’s assume it is a physical table. The ingestion of the Order data is straightforward, with fields being copied 1:1 from the source into the Bag, and as with the Data Vault equivalent, we add in the standard Load_Timestamp and Load_Source audit fields. We also add two Hook Keys, which are projections of the business keys, Order_ID and Customer_ID. The pattern is simple, and there is no need to carve up the source into separate tables.
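As a minimal sketch, the Bag load might look something like this. The staging table name, keyset literals, and source system name are illustrative, and I’m assuming the simple ‘keyset|business key’ Hook Key format used later in this article:

-- Illustrative Bag load: copy source fields 1:1 and project the Hook Keys
-- (assumes the Bag's column order matches this SELECT)
INSERT INTO [BAG ORDER]
SELECT HK_Order       = CONCAT('{order keyset}|', SRC.Order_ID)
     , HK_Customer    = CONCAT('{customer keyset}|', SRC.Customer_ID)
     , SRC.*                                  -- the source fields, copied as-is
     , Load_Timestamp = CURRENT_TIMESTAMP     -- standard audit fields
     , Load_Source    = 'order_system'
  FROM [STG ORDER] AS SRC;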

So far, so good. Let’s move on to the introduction of the Store business concept. The new model looks like this.

[Image: the Order Bag with the new Store Hook Key (HK_Store) added]

All we need to do is add the Store Hook Key (HK_Store); everything else stays the same. If we had implemented the table on a standard SQL-compliant RDBMS, we could alter the table to add the new Hook Key column to the Bag table. As nothing else has changed, there is no impact on downstream processing pipelines, and we don’t need to implement a new version of the Bag.
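On such a platform, that change might be as small as this (the column type is an assumption):

-- Add the new Hook Key; existing rows hold NULL until backfilled (if ever)
ALTER TABLE [BAG ORDER]
  ADD HK_Store VARCHAR(200) NULL;
-- The ingestion pipeline then starts populating it:
--   HK_Store = CONCAT('{keyset}|', Store_Code)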

The same logic applies to the second change to incorporate the Promotion business key.

[Image: the Order Bag extended with the Promotion Hook Key]

We started with one table, and we finished with one table. There is no bloat in the model, and we have gracefully incorporated the changes with almost zero impact.


Analysis

Of course, the comparison might seem a little unfair and biased towards Hook, so let’s explore some alternative implementations.

Alternative 1

What if the underlying storage infrastructure isn’t a “standard” RDBMS? For example, what if we had built the platform using data lake technology? In this case, it probably wouldn’t be possible to add a column to the table directly. The obvious way to make the change would be to query the old data and insert the data into the new structure, calculating the new Hook Key values. 

SELECT HK_Store = CONCAT('{keyset}|', Store_Code)  -- derive the new Hook Key
     , OLD.*                                       -- carry over every existing column
  INTO [BAG ORDER V2]
  FROM [BAG ORDER] AS OLD;

Once we’ve built the new structure, we can drop the old one and rename the new one to the old name. This approach would work well for tables that don’t contain large volumes of data but may not be as practical for massive tables that may take a long time to copy. 
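The final swap is then a quick metadata operation; on SQL Server, for example:

-- Retire the old Bag and promote the rebuilt one
DROP TABLE [BAG ORDER];
EXEC sp_rename '[BAG ORDER V2]', 'BAG ORDER';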

Alternative 2

What if we implemented the Bag as a view over the top of a source-aligned persisted staging table? In this case, the ingestion pipelines write to the Persisted Staging Area (PSA), but the Hook Keys are not calculated and stored. Instead, we create Bag views over the top of the source-aligned PSA data that perform the necessary Hook Key calculations. This approach is an extremely agile way of changing the Bag structure: we simply drop the view and recreate it. No data is copied or moved, and the change can be implemented as quickly on a 10-row table as on a multi-billion-row table.
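A sketch of such a Bag view, assuming a PSA table named [PSA ORDER]:

-- Bag as a view: Hook Keys are derived at query time; no data is copied
CREATE VIEW [BAG ORDER] AS
SELECT HK_Order    = CONCAT('{order keyset}|', PSA.Order_ID)
     , HK_Customer = CONCAT('{customer keyset}|', PSA.Customer_ID)
     , HK_Store    = CONCAT('{store keyset}|', PSA.Store_Code)
     , PSA.*                                  -- source fields exposed 1:1
  FROM [PSA ORDER] AS PSA;

Absorbing the Store change is then just a DROP VIEW and a CREATE VIEW with the extra Hook Key.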

The downside is query performance. Every time we query the view - especially when joining it to another table - the DBMS has to calculate the Hook Keys for every row! Again, this isn’t a problem for smaller tables but could be limiting for larger ones.

Alternative 3

If your platform supports them, then consider implementing Bags as materialised views. You get the same ‘drop and recreate’ flexibility, but the underlying DBMS physically stores the result as a table, so it only needs to calculate the Hook Keys once.
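Syntax varies by platform, so the example below is indicative rather than tied to a particular DBMS; the shape is the same as the view from Alternative 2:

-- Materialised Bag: defined like a view, stored like a table
CREATE MATERIALIZED VIEW [BAG ORDER] AS
SELECT CONCAT('{order keyset}|', PSA.Order_ID)       AS HK_Order
     , CONCAT('{customer keyset}|', PSA.Customer_ID) AS HK_Customer
     , CONCAT('{store keyset}|', PSA.Store_Code)     AS HK_Store
     , PSA.Order_ID
     , PSA.Customer_ID
     , PSA.Store_Code
  FROM [PSA ORDER] AS PSA;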

But again, this could be an expensive operation for Bags with larger volumes of data.

Alternative 4

Or we could go with a similar approach to the one described for Data Vault. We create new versions of the Bag (as a physical table) and update the ingestion pipeline to target the new structure.  

[Image: three physical versions of the Order Bag sitting beneath a single unioning view]

The beauty of this approach is that we can place a view over the top of all three versions of the Bag that simply unions them together, as sketched below. The only drawback, if you can call it that, is that for the older versions of the Bag, we won’t have calculated values for the newer Hook Keys, even though the information is available to do so.
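A sketch of that unifying view, assuming V1 predates the Store change and V2 predates the Promotion change. I’ve assumed the Promotion Hook Key is named HK_Promotion; the payload columns would be listed the same way and are omitted for brevity:

-- One logical Bag over all physical versions; missing Hook Keys surface as NULL
CREATE VIEW [BAG ORDER ALL] AS
SELECT HK_Order, HK_Customer
     , CAST(NULL AS VARCHAR(200)) AS HK_Store
     , CAST(NULL AS VARCHAR(200)) AS HK_Promotion
     , Load_Timestamp, Load_Source
  FROM [BAG ORDER]      -- V1: original structure
UNION ALL
SELECT HK_Order, HK_Customer, HK_Store
     , CAST(NULL AS VARCHAR(200)) AS HK_Promotion
     , Load_Timestamp, Load_Source
  FROM [BAG ORDER V2]   -- V2: added the Store Hook Key
UNION ALL
SELECT HK_Order, HK_Customer, HK_Store, HK_Promotion
     , Load_Timestamp, Load_Source
  FROM [BAG ORDER V3];  -- V3: added the Promotion Hook Key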

One way around this is to build migration processing that merges the old versions into the new structure. For really large tables, this process would execute incrementally over time, with the new table eventually catching up.
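A minimal sketch of one catch-up batch, assuming the Promotion business key column is called Promotion_Code (a hypothetical name) and that a row can be identified by its Hook Key and load timestamp:

-- Migrate one batch of V1 rows into the V3 structure, deriving the newer Hook Keys
-- (payload columns elided, as in the view sketch above)
INSERT INTO [BAG ORDER V3]
SELECT TOP (100000)
       OLD.HK_Order
     , OLD.HK_Customer
     , CONCAT('{store keyset}|', OLD.Store_Code)         AS HK_Store
     , CONCAT('{promotion keyset}|', OLD.Promotion_Code) AS HK_Promotion
     , OLD.Load_Timestamp
     , OLD.Load_Source
  FROM [BAG ORDER] AS OLD
 WHERE NOT EXISTS (SELECT 1
                     FROM [BAG ORDER V3] AS NEW
                    WHERE NEW.HK_Order       = OLD.HK_Order
                      AND NEW.Load_Timestamp = OLD.Load_Timestamp);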


Conclusions

The HOOK model is simple. Our single ingested table eventually blew out to ten tables for Data Vault. HOOK needs at most three tables, and in most cases only one.

The flat nature of the Hook Bag structure makes absorbing changes much more manageable than its Data Vault equivalent. There is no need to restructure the data; in most cases, there will be no impact on downstream processing logic.

Of course, I am biased, but it seems clear that there are certain advantages to the Hook approach over Data Vault.
