Data Vault Modeling Patterns: Links, Hierarchy, Identity (Modern Data Warehousing, Part 13)
If you’re new to the topic or don’t have a lot of practical data vault experience, you might want to consult the previous articles in the Modern Data Warehousing series (see below) or have a look at one of the books recommended in the Data Vault section here to get the most out of this article.
This article is the thirteenth part in an ongoing series on modern data warehousing using data vault.
Introduction
When building a data vault model, you encounter some of the same situations again and again. Therefore it is good to have something similar to the structural conventions described here so that you deal with these situations in a consistent manner. The data vault modeling patterns described in this little series will help you to achieve this consistency by recognizing standard situations and sticking to standard solutions for resolving them.
The most important data vault modeling patterns include options for
To raise awareness of these data vault modeling patterns and to point out the standard solution for a certain situation, consider creating some documentation similar to the one for the structural conventions. Your successors will thank you.
Keyed Instance
When representing connections between instances of concepts as a link, in many cases you have to include an additional concept in the form of a so called keyed-instance hub.
When You Need It
In general, you need a keyed-instance hub if:
If you’re using link satellites, you usually can get around creating a keyed-instance hub in cases 1 and 2 by attaching a link satellite to the respective link. In cases 3 and 4, you have to create a keyed-instance hub no matter what.
Example
Keyed-instance hub example
The diagram above can be seen as an example of all four cases:
This keyed-instance pattern occurs so often that some people even argue that every link should include a keyed-instance hub by default.
Recommended by LinkedIn
Hierarchy
Hierarchical connections are another common use case:
In general, there are two ways of dealing with this use case.
Hierarchical Link
If the two instances involved in the hierarchical connection are indeed instances of the same concept, you can create a hierarchical link that includes the same hub twice (plus maybe an additional keyed-instance hub). The Employee-reports-to-Employee connection most likely would be implemented as a hierarchical link. To avoid confusion later on, it is important to clearly mark the role played by each of the two identifiers referring to the same hub (by calling them MANAGING_EMPLOYEE_SRG and MANAGED_EMPLOYEE_SRG, for example).
Hierarchical link example (with keyed-instance hub)
Regular Link
If the two instances involved in the hierarchical connection represent different concepts, you can create a regular link between two different hubs (plus maybe an additional keyed-instance hub). The Organizational-Unit-consists-of-multiple-Organizational-Units connection would be implemented as a regular link if in your organization, the lower-level Organizational Units are consistently called Teams and the higher-level Organizational Units are consistently called Departments.
Identity
In some cases, you find out that what seemed like two different instances at first actually are one and the same:
Same-as Link
Many of these cases can be handled using a same-as link that includes the same hub twice (plus maybe a keyed-instance hub). One of the hub identifiers in the link refers to the master record and the other one the duplicate record. Once again, the role played by each of the two identifiers referring to the same hub should be marked clearly (by calling them MASTER_CUSTOMER_SRG and DUPLICATE_CUSTOMER_SRG, for example).
Same-as link example (with keyed-instance hub)
Regular Link
Sometimes, identity has to be documented using a regular link. Source system A might treat both Customers and Suppliers as Partners while source system B clearly distinguishes between the two. When it’s not always completely obvious which Partners from source system A are actually Customers, it is better to load all of them to a Partner hub and then create a link between this Partner hub and the Customer hub for those Partners that you have found to be Customers.
As mentioned in the introduction, there are some data vault modeling patterns for dealing with headers and line items and handling lookup values. We’ll look at them in the next article.
This article is based on an issue of my Substack newsletter.