HOOK vs. Data Vault

HOOK vs. Data Vault

If you‘ve read my series of articles introducing the HOOK approach to data warehousing, you might wonder how it compares to other methodologies. In this article, I will attempt to compare HOOK to Data Vault. I’ll try and keep it short so I won’t be able to get into too much detail…

Example Source Model

To highlight the differences, I will use this simple example shown in the diagram below, a data model of part of an ordering system for an online store. 

No alt text provided for this image

There are five tables. The tables coloured black (Customer, Address, Order and Product) are core business entities/terms with unique identifiers/keys. Order Line associates Orders with Products. I hope the model is self-explanatory.

The aim is to ingest the data from these five tables into Data Vault and HOOK structures.

Data Vault Representation

We will assume that we are not applying any transformations, so this is very much a Raw Vault design, which means that we will have five corresponding Satellite tables for the five source tables. We also know that four of the tables contain identifying business keys, which implies the existence of four Hub tables. To round the model off, we also need three Link tables to handle the relationships.

The resulting Data Vault model will look something like this:

No alt text provided for this image

Assuming we are using standard loading patterns and that we are being strict about the standards, then each inbound source table will be taken apart: any business keys will be populated in the appropriate Hub tables; any relationships populated into the Link tables; the remaining fields are loaded into the Satellite tables. Let’s also assume that we are using Hash Keys to join everything up together. So there is a bit of work to do.

HOOK Representation

Compare this to the HOOK equivalent. The ingestion of each source table into the Data Lake (“Lake”) is simple; again, no transformations are required. Over the top of each Lake table, we create a view that hooks each business key to one of the four Business Terms (“Term”), Customer, Order, Address and Product.

The resulting HOOK model looks like this: 

No alt text provided for this image

The Hook tables are entirely optional. They are equivalent to Data Vault Hub tables and are populated in precisely the same way. The only difference is that the HOOK table only has a single column for the business key; hash keys and audit columns are unnecessary.

Comparison

So let’s compare the two methods.

Which is easier to load?

The loading patterns for HOOK are certainly easier than Data Vault. It is a straight copy from source to the Lake. Data Vault requires additional processing, so it is not as performant.

Which is easier to query?

Say I wanted to join together data from the Order table to the Order Line table. In the Datavault model this requires a four-table join:

SELECT * 
  FROM Sat_Order         SO
  JOIN Link_Order        LO
    ON SO.HK_Link_Order       = LO.HK_Link_Order
  JOIN Link_Order_Line   LOL
    ON LO.HK_Hub_Order        = LOL. HK_Hub_Order
  JOIN Sat_Order_Line    SOL
    ON LOL.HK_Link_Order_Line = SOL.HK_Link_Order_Line
        

In HOOK, it is a two-table join:


SELECT  
  FROM Bag_Order      BO
  JOIN Bag_Order_Line BOL
    ON BO.HK_Order = BOL.HK_Order
        

Of course, we can simplify the Data Vault query by introducing a Bridge table, but this additional structure must be created and populated.

Which is easier to change?

Say we got the model wrong? What if we missed that each Order Line may have an alternate delivery address that overrides the delivery address defined at the Order level.

In the Data Vault implementation, the Link_Order_Line table now needs to reference the Hub_Address table. These are the steps we would need to go through to accommodate this:

  • Create a new version of the Link_Order_Line table that includes the additional reference to the HUB_Address table.
  • Create a new version of the Sat_Order_Line table that references the new Link_Order_Line table.
  • Change the ingestion pipeline (this should be relatively straightforward if you have an automation tool).
  • Populate the new Link_Order_Line table.
  • Populate the new Sat_Order_Line table.
  • Refactor any downstream processing to use the new tables.

In the HOOK implementation, the steps are:

  • Drop and recreate the Bag view to include the Address business key.

I’m not going to lie, I know which looks easier to me!

Query performance

One of the criticisms of Data Vault has always been query performance and one of the reasons why Bridge and PIT tables were added to the methodology. Those same criticisms will surely be levelled at HOOK, but the fact is that whatever performance-enhancing mechanisms in Data Vault can equally be applied to HOOK.

Bridge tables? Probably not needed as business keys are flattened into the Bag view. PIT tables? Why not? They are just regular tables that can be ingested and then “hooked” in the same way as any other source table.

In short, the performance of Data Vault and HOOK will be on par, especially if we materialise Bag views.

Closing Thoughts

I hope I have been able to demonstrate that HOOK is an appealing alternative to Data Vault. It appears to be more flexible, and there should be no concerns over performance.

Data Vault aficionados will rightly highlight that there is much more to the methodology than Hubs, Links and Satellites. For example, there are many different flavours of Satellite tables. There are different types of Link tables. What about Business Vault? The truth is that we can replicate all of these types of tables in HOOK.

The main differentiating aspect between Data Vault and HOOK is how we approach the modelling. Successful Data Vault implementation requires some upfront business architecture and modelling. HOOK is more forgiving, allowing the model to be applied after the data has been ingested. Data Vault is more of a Schema-on-Write approach, whereas HOOK is a Schema-on-Read paradigm that offers a great deal of flexibility.

It would be great to hear from all you Data Vault folk. Am I missing something? What killer features of Data Vault make it superior to HOOK. And yes, I am absolutely inviting criticism, and rightly so. If I can’t argue the case for HOOK then it doesn’t deserve to exist.

Next up HOOK vs. Star Schema…

Again, many thanks to the kind folk at Ellie.ai for granting me access to their cloud-based business glossary and data modelling tool, which I’ve used to produce the diagrams in this article.


Hi, Andrew Foad. Thanks for sharing this. I'm looking forward to your DM3 session. I can see similarities between hook and a Data Vault that includes its Business Keys in the satellites (which is not uncommon MPP DVs). I'm also interested in how Hook will replace the functions provided by links and hubs.

Remco Broekmans

VP International Programs at Genesee Academy, LLC Author of "from Stories to Solutions"

3y

Hi Andrew Foad, thanks for this write up. Although I must say that I do not agree with your Vault model - having Address as a hub is not correct in my opinion. Address is always descriptive information - where does an employee work, where does the order needs to be delivered, where does the customer wants to have his/her/... bill send to - -and thus needs to be modelled accordingly. With that aside I do see the idea behind your Hook approach as long as your bags do match the Core Business Concepts (what is important for the business).With the current thinking on ELM and Business Concept mapping it is easy to convert to either a full Data Vault model or your Hook approach (or dimensional, 3NF, Anchor, Focal Point, etc.). Having a good understanding of what is going on in the business and what is important there (not on a techy level) and be able to visually present it to the organization to get feedback and know you are on the right track is important. The ELM and business mapping approach is supported by ellie.io (and BimlFlex and CaseTalk). Love to get on a discussion on this any time.

Christian Kaul

Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH

3y

i get your points but i’m not sure if they matter that much in the grand scheme of things. the difficult, time-consuming and hard-to-automate part of data warehousing is the data integration and data harmonization part (what happens between getting data from different, often contradictory sources into the data warehouse system and presenting a nice unified view in a data mart, roughly equivalent to the business vault part of a data vault). does hook help me with that? i would love to read an article about that.

Sascha Nau

Fachreferent Business Intelligence bei Stadtsparkasse München

3y

I like the aproach; why do you don‘t persist the hooks to the initial tables - why views?

Like
Reply
Maarten van der Heijden

Data Architect at Tata Steel BV, designer of data constructs.

3y

I see a point that I see as both an advantage and a disadvantage. The HOOK allows you to keep the data in a structure that is close to or the same as that from which it came. That isn't trivial if you take your business meta data serious. At the same time, if the solution for change is to drop and create when your bag changes shape, you will also drop your history or at least the structure in which your history was. With that meaning is lost, for meaning is not only in the field level, but also in the structure of the data. Mind you, I am aware that Data Vault also has in itself little patience with meaning, but it has patience with history.

To view or add a comment, sign in

More articles by Andrew Foad

  • The map is NOT the territory.

    I was bemoaning that ‘Nobody gets it!’ when discussing the recent feedback from my last article with my business…

    16 Comments
  • Square Pegs and Round Holes

    Why isn’t data modelling the No.1 go-to skill required by everybody in the data industry? I believe the answer has…

    40 Comments
  • "You have the right to remain silent..."

    There is a phrase in the 'Miranda' rights that is typically read when arresting suspects in the UK and Australia that I…

    2 Comments
  • Capability vs. Features

    As many of you know, I've been critical of the fact that a high percentage of data projects seem to fail, which was one…

    3 Comments
  • Is Data Modelling Dead?

    I’ve been using data modelling techniques for most of my 35-year career, and I consider it an essential skill for…

    48 Comments
  • Data Vault is Garbage (change my mind)

    If you are unfamiliar with the Hans Christian Andersen story of “The Emperor’s New Clothes”, ChatGPT summarises it…

    7 Comments
  • Introducing HOOK. Part 1 - Logical Architecture

    This article is the first in a short series on the Hook approach to Data Warehousing. As I said in my introduction…

  • Is “Agile Data” an Oxymoron?

    In last week’s article, I discussed how you can “boil the ocean” when building a HOOK data warehouse. The article…

    11 Comments
  • Always try to boil the ocean!

    Recently, I have felt very left out by all these Snowflake folk with their superhero avatars. As I don’t work for…

    16 Comments
  • HOOK and the Unified Star Schema

    First, we should probably understand what a Unified Start Schema is before we start talking about how we can use it in…

    6 Comments

Insights from the community

Others also viewed

Explore topics