Data Vault Modeling Patterns: Links, Hierarchy, Identity (Modern Data Warehousing, Part 13)

Data Vault Modeling Patterns: Links, Hierarchy, Identity (Modern Data Warehousing, Part 13)

If you’re new to the topic or don’t have a lot of practical data vault experience, you might want to consult the previous articles in the Modern Data Warehousing series (see below) or have a look at one of the books recommended in the Data Vault section here to get the most out of this article.

This article is the thirteenth part in an ongoing series on modern data warehousing using data vault.

  • The first part on data warehouse layers can be found here.
  • The second part on kinds of time can be found here.
  • The third part on number of timelines can be found here.
  • The fourth part on hubs can be found here
  • The fifth part on links and satellites can be found here.
  • The sixth part on satellite variations can be found here.
  • The seventh part on other optional constructs can be found here.
  • The eighth part on raw vault and business vault can be found here.
  • The ninth part on conventions for construct usage can be found here.
  • The tenth part on naming conventions can be found here.
  • The eleventh part on surrogate identifiers can be found here.
  • The twelfth part on technical columns can be found here.

Introduction

When building a data vault model, you encounter some of the same situations again and again. Therefore it is good to have something similar to the structural conventions described here so that you deal with these situations in a consistent manner. The data vault modeling patterns described in this little series will help you to achieve this consistency by recognizing standard situations and sticking to standard solutions for resolving them.

The most important data vault modeling patterns include options for

  • some of the finer points of link design,
  • modeling hierarchies (when one instance is subordinate to another),
  • modeling identity (when two instances actually refer to the same thing),
  • modeling headers and line items and
  • handling lookup values.

To raise awareness of these data vault modeling patterns and to point out the standard solution for a certain situation, consider creating some documentation similar to the one for the structural conventions. Your successors will thank you.

Keyed Instance

When representing connections between instances of concepts as a link, in many cases you have to include an additional concept in the form of a so called keyed-instance hub.

When You Need It

In general, you need a keyed-instance hub if:

  1. The connection is transferable (e. g., an Employee can work for different departments over time and you have to record the currently active Department Assigment).
  2. You have to store some details about the connection (e. g., quantity and actual unit price for a Product on a Sale).
  3. There can be multiple distinguishable connections involving the same instances of concepts (e. g., a Visitor can go to the same Website multiple times, each time being a different Visit with different detail values).
  4. You want to be able to connect other concepts to a specific connection without repeating certain combinations of concepts over and over again or building any link-on-link constructs (e. g., a Delivery can include some Products on a Sale but not others that are currently out of stock and will be part of a later Delivery).

If you’re using link satellites, you usually can get around creating a keyed-instance hub in cases 1 and 2 by attaching a link satellite to the respective link. In cases 3 and 4, you have to create a keyed-instance hub no matter what.

Example

Es wurde kein Alt-Text für dieses Bild angegeben.

Keyed-instance hub example

The diagram above can be seen as an example of all four cases:

  1. A Customer can subscribe to different Services over time.
  2. There are relevant details about a Customer–Service combination (how the Customer subscribed to the Service, why the Customer subscribed to the Service , …).
  3. A Customer can stop using a Service and then start subscribing to the same Service again.
  4. There are possible connections to a Customer-Service combination (you send an Invoice to the Customer for using the Service for a period of time, the Customer can file a Complaint about some problems with the Service, …).

This keyed-instance pattern occurs so often that some people even argue that every link should include a keyed-instance hub by default.

Hierarchy

Hierarchical connections are another common use case:

  • An Employee reports to another Employee.
  • A Car Part is made of multiple other Car Parts.
  • An Organizational Unit consists of multiple smaller Organizational Units.

In general, there are two ways of dealing with this use case.

Hierarchical Link

If the two instances involved in the hierarchical connection are indeed instances of the same concept, you can create a hierarchical link that includes the same hub twice (plus maybe an additional keyed-instance hub). The Employee-reports-to-Employee connection most likely would be implemented as a hierarchical link. To avoid confusion later on, it is important to clearly mark the role played by each of the two identifiers referring to the same hub (by calling them MANAGING_EMPLOYEE_SRG and MANAGED_EMPLOYEE_SRG, for example).

Es wurde kein Alt-Text für dieses Bild angegeben.

Hierarchical link example (with keyed-instance hub)

Regular Link

If the two instances involved in the hierarchical connection represent different concepts, you can create a regular link between two different hubs (plus maybe an additional keyed-instance hub). The Organizational-Unit-consists-of-multiple-Organizational-Units connection would be implemented as a regular link if in your organization, the lower-level Organizational Units are consistently called Teams and the higher-level Organizational Units are consistently called Departments.

Identity

In some cases, you find out that what seemed like two different instances at first actually are one and the same:

  • Two customer codes from different source systems actually refer to the same Customer.
  • A new customer code actually refers to an existing Customer that has lost access to our system for some reason.
  • When migrating from one source system to another, new customer codes had to be created for existing Customers for technical reasons.

Same-as Link

Many of these cases can be handled using a same-as link that includes the same hub twice (plus maybe a keyed-instance hub). One of the hub identifiers in the link refers to the master record and the other one the duplicate record. Once again, the role played by each of the two identifiers referring to the same hub should be marked clearly (by calling them MASTER_CUSTOMER_SRG and DUPLICATE_CUSTOMER_SRG, for example).

Es wurde kein Alt-Text für dieses Bild angegeben.

Same-as link example (with keyed-instance hub)

Regular Link

Sometimes, identity has to be documented using a regular link. Source system A might treat both Customers and Suppliers as Partners while source system B clearly distinguishes between the two. When it’s not always completely obvious which Partners from source system A are actually Customers, it is better to load all of them to a Partner hub and then create a link between this Partner hub and the Customer hub for those Partners that you have found to be Customers.

As mentioned in the introduction, there are some data vault modeling patterns for dealing with headers and line items and handling lookup values. We’ll look at them in the next article.

This article is based on an issue of my Substack newsletter.

To view or add a comment, sign in

More articles by Christian Kaul

Insights from the community

Others also viewed

Explore topics