HOOK vs. Data Vault
If you‘ve read my series of articles introducing the HOOK approach to data warehousing, you might wonder how it compares to other methodologies. In this article, I will attempt to compare HOOK to Data Vault. I’ll try and keep it short so I won’t be able to get into too much detail…
Example Source Model
To highlight the differences, I will use this simple example shown in the diagram below, a data model of part of an ordering system for an online store.
There are five tables. The tables coloured black (Customer, Address, Order and Product) are core business entities/terms with unique identifiers/keys. Order Line associates Orders with Products. I hope the model is self-explanatory.
The aim is to ingest the data from these five tables into Data Vault and HOOK structures.
Data Vault Representation
We will assume that we are not applying any transformations, so this is very much a Raw Vault design, which means that we will have five corresponding Satellite tables for the five source tables. We also know that four of the tables contain identifying business keys, which implies the existence of four Hub tables. To round the model off, we also need three Link tables to handle the relationships.
The resulting Data Vault model will look something like this:
Assuming we are using standard loading patterns and that we are being strict about the standards, then each inbound source table will be taken apart: any business keys will be populated in the appropriate Hub tables; any relationships populated into the Link tables; the remaining fields are loaded into the Satellite tables. Let’s also assume that we are using Hash Keys to join everything up together. So there is a bit of work to do.
HOOK Representation
Compare this to the HOOK equivalent. The ingestion of each source table into the Data Lake (“Lake”) is simple; again, no transformations are required. Over the top of each Lake table, we create a view that hooks each business key to one of the four Business Terms (“Term”), Customer, Order, Address and Product.
The resulting HOOK model looks like this:
The Hook tables are entirely optional. They are equivalent to Data Vault Hub tables and are populated in precisely the same way. The only difference is that the HOOK table only has a single column for the business key; hash keys and audit columns are unnecessary.
Comparison
So let’s compare the two methods.
Which is easier to load?
The loading patterns for HOOK are certainly easier than Data Vault. It is a straight copy from source to the Lake. Data Vault requires additional processing, so it is not as performant.
Which is easier to query?
Say I wanted to join together data from the Order table to the Order Line table. In the Datavault model this requires a four-table join:
Recommended by LinkedIn
SELECT *
FROM Sat_Order SO
JOIN Link_Order LO
ON SO.HK_Link_Order = LO.HK_Link_Order
JOIN Link_Order_Line LOL
ON LO.HK_Hub_Order = LOL. HK_Hub_Order
JOIN Sat_Order_Line SOL
ON LOL.HK_Link_Order_Line = SOL.HK_Link_Order_Line
In HOOK, it is a two-table join:
SELECT
FROM Bag_Order BO
JOIN Bag_Order_Line BOL
ON BO.HK_Order = BOL.HK_Order
Of course, we can simplify the Data Vault query by introducing a Bridge table, but this additional structure must be created and populated.
Which is easier to change?
Say we got the model wrong? What if we missed that each Order Line may have an alternate delivery address that overrides the delivery address defined at the Order level.
In the Data Vault implementation, the Link_Order_Line table now needs to reference the Hub_Address table. These are the steps we would need to go through to accommodate this:
In the HOOK implementation, the steps are:
I’m not going to lie, I know which looks easier to me!
Query performance
One of the criticisms of Data Vault has always been query performance and one of the reasons why Bridge and PIT tables were added to the methodology. Those same criticisms will surely be levelled at HOOK, but the fact is that whatever performance-enhancing mechanisms in Data Vault can equally be applied to HOOK.
Bridge tables? Probably not needed as business keys are flattened into the Bag view. PIT tables? Why not? They are just regular tables that can be ingested and then “hooked” in the same way as any other source table.
In short, the performance of Data Vault and HOOK will be on par, especially if we materialise Bag views.
Closing Thoughts
I hope I have been able to demonstrate that HOOK is an appealing alternative to Data Vault. It appears to be more flexible, and there should be no concerns over performance.
Data Vault aficionados will rightly highlight that there is much more to the methodology than Hubs, Links and Satellites. For example, there are many different flavours of Satellite tables. There are different types of Link tables. What about Business Vault? The truth is that we can replicate all of these types of tables in HOOK.
The main differentiating aspect between Data Vault and HOOK is how we approach the modelling. Successful Data Vault implementation requires some upfront business architecture and modelling. HOOK is more forgiving, allowing the model to be applied after the data has been ingested. Data Vault is more of a Schema-on-Write approach, whereas HOOK is a Schema-on-Read paradigm that offers a great deal of flexibility.
It would be great to hear from all you Data Vault folk. Am I missing something? What killer features of Data Vault make it superior to HOOK. And yes, I am absolutely inviting criticism, and rightly so. If I can’t argue the case for HOOK then it doesn’t deserve to exist.
Next up HOOK vs. Star Schema…
Again, many thanks to the kind folk at Ellie.ai for granting me access to their cloud-based business glossary and data modelling tool, which I’ve used to produce the diagrams in this article.
GenAI and Data
3yHi, Andrew Foad. Thanks for sharing this. I'm looking forward to your DM3 session. I can see similarities between hook and a Data Vault that includes its Business Keys in the satellites (which is not uncommon MPP DVs). I'm also interested in how Hook will replace the functions provided by links and hubs.
VP International Programs at Genesee Academy, LLC Author of "from Stories to Solutions"
3yHi Andrew Foad, thanks for this write up. Although I must say that I do not agree with your Vault model - having Address as a hub is not correct in my opinion. Address is always descriptive information - where does an employee work, where does the order needs to be delivered, where does the customer wants to have his/her/... bill send to - -and thus needs to be modelled accordingly. With that aside I do see the idea behind your Hook approach as long as your bags do match the Core Business Concepts (what is important for the business).With the current thinking on ELM and Business Concept mapping it is easy to convert to either a full Data Vault model or your Hook approach (or dimensional, 3NF, Anchor, Focal Point, etc.). Having a good understanding of what is going on in the business and what is important there (not on a techy level) and be able to visually present it to the organization to get feedback and know you are on the right track is important. The ELM and business mapping approach is supported by ellie.io (and BimlFlex and CaseTalk). Love to get on a discussion on this any time.
Data Modeling Aficionado and Senior Technical Consultant at virtual7 GmbH
3yi get your points but i’m not sure if they matter that much in the grand scheme of things. the difficult, time-consuming and hard-to-automate part of data warehousing is the data integration and data harmonization part (what happens between getting data from different, often contradictory sources into the data warehouse system and presenting a nice unified view in a data mart, roughly equivalent to the business vault part of a data vault). does hook help me with that? i would love to read an article about that.
Fachreferent Business Intelligence bei Stadtsparkasse München
3yI like the aproach; why do you don‘t persist the hooks to the initial tables - why views?
Data Architect at Tata Steel BV, designer of data constructs.
3yI see a point that I see as both an advantage and a disadvantage. The HOOK allows you to keep the data in a structure that is close to or the same as that from which it came. That isn't trivial if you take your business meta data serious. At the same time, if the solution for change is to drop and create when your bag changes shape, you will also drop your history or at least the structure in which your history was. With that meaning is lost, for meaning is not only in the field level, but also in the structure of the data. Mind you, I am aware that Data Vault also has in itself little patience with meaning, but it has patience with history.