Introducing HOOK. Part 3: Bags

In the previous Introducing HOOK article, I explained the importance of the Business Glossary and said that we would use it to organise, rather than model, the Data Warehouse. In this article, I will demonstrate how that looks using Hook’s Bag concept.

What is a Bag? Bags are to HOOK what Hubs, Links and Satellites are to Data Vault, or what Dimensions and Facts are to Kimball. Bags are the mechanism by which we can easily organise our data around Business Concepts.

Data Ingestion

Before we get into Bags, let me circle back to the way we acquire data. As I mentioned in the article on Logical Architecture, acquisition is about capturing new data from source systems. The most important thing to say is that when we acquire the data, we do not transform it in any way. Therefore, the data we load into the warehouse will align with the source data. We will load one row into the warehouse for each new row detected in the source data. Every attribute from the source will also have a corresponding column in the data warehouse. We don’t aggregate the data, attempt to normalise it, or split it up. The warehouse is a 1:1 representation of the source.

Although we will not transform the data, there is no reason we cannot include additional information. For example, we might add information that tells us:

  • Where did the data come from?
  • When was the data loaded into the data warehouse?
  • Did the new data represent an update, an insert or a deletion of the source data?
  • Which processing batch loaded the data?
  • When was the record updated on the source system?
  • Etc.

In the example below, we are ingesting the ATM Transaction table from the source (SRC.ATM_Tx) and loading it into its corresponding Data Warehouse table (EDW.ATM_Tx).

[Image: the SRC.ATM_Tx source table mapped 1:1 to the EDW.ATM_Tx warehouse table, with two added audit columns]
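
To make the pattern concrete, here is a minimal sketch of what such a load could look like in SQL. The transaction columns (Tx_ID, Tx_Amount) and the two audit fields (Audit_Source, Audit_Load_DTS) are illustrative assumptions, not the actual table definitions:

  -- Hypothetical 1:1 load from the source into the warehouse.
  -- Every source column is carried across unchanged; only audit
  -- metadata is added. Column names are assumptions.
  INSERT INTO EDW.ATM_Tx (
      Tx_ID,
      Customer_No,
      ATM_ID,
      Currency_Code,
      Tx_Amount,
      Audit_Source,     -- where the data came from
      Audit_Load_DTS    -- when it was loaded into the warehouse
  )
  SELECT
      Tx_ID,
      Customer_No,
      ATM_ID,
      Currency_Code,
      Tx_Amount,
      'SRC.ATM_Tx',
      CURRENT_TIMESTAMP
  FROM SRC.ATM_Tx;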

I’ve only included two additional metadata fields, which I’ve prefixed with ‘Audit’, but you can include as many as your standards require; just make sure you are consistent and that ALL warehouse tables have the same audit fields.

Once loaded into the Data Warehouse, the data remains unchanged. The data is non-volatile or immutable. We load any source data updates as new versions, and the old versions remain untouched. The Data Warehouse, therefore, represents a ‘fossil record’ of the activity within the enterprise.

Acquiring the data is only half of the story. At this stage, the data is still aligned with the source data, and we need to organise it around the Business Concepts from the Business Glossary. To do that, we need Bags.

Bags

A Bag is a view (which can be materialised) that we use to wrap a Data Warehouse table. Within the view, we project the fields that identify Business Concepts as Hook Keys. You can think of a Bag as a shopping bag into which you place a single Data Warehouse table. The handles of the Bag are used to hang it on one or more Hooks, where each Hook represents a Business Concept, as shown below.

[Image: a Data Warehouse table sitting inside a Bag, whose handles hang the Bag on Hooks representing Business Concepts]

Our example shows a Data Warehouse table containing data for Automatic Teller Machine (ATM) transactions. The table includes three key fields:

  • Customer_No – The unique customer identifier.
  • ATM_ID – The ID of the cash machine where the transaction was executed.
  • Currency_Code – The currency of the transaction.

As noted above, a Bag is a view we build over a Data Warehouse table. Again, we do not apply transformations; we only add Hook keys, which are shown below prefixed with ‘HK’ or ‘WK’.

[Image: the EDW.ATM_Tx Bag view, with the Hook Key columns added alongside the unchanged source columns]

In this example, each of the three key fields is projected as a Hook Key. The DDL for this view might look something like this:

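(A sketch only: the KeySet codes 'CUST.BNK', 'CHNL.ATM' and 'CCY.ISO', the '|' delimiter, the BAG schema and the SQL Server-style concatenation are all assumptions here, and the key columns are assumed to be character types.)

  -- Sketch of a Bag view over EDW.ATM_Tx. Each Hook key prefixes
  -- the business key value with an assumed KeySet code and casts
  -- the result to VARBINARY for type consistency across Bags.
  CREATE VIEW BAG.ATM_Tx AS
  SELECT
      CAST('CUST.BNK|' + t.Customer_No   AS VARBINARY(64)) AS HK_Customer,
      CAST('CHNL.ATM|' + t.ATM_ID        AS VARBINARY(64)) AS HK_Channel,
      CAST('CCY.ISO|'  + t.Currency_Code AS VARBINARY(64)) AS WK_Currency,
      t.*  -- every source-aligned column, carried through unchanged
  FROM EDW.ATM_Tx AS t;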

Here we have implemented the Bag as a view, but if your database platform supports them, it could equally be a materialised view, or even a regular table with the necessary loading logic. In other words, you have options.

Each Hook Key uses a strict naming convention to refer to a Business Concept in the Business Glossary. The convention dictates that Customer, Channel and Ref must be Business Concepts defined in the Glossary. In this case, we project the Customer_No field to HK_Customer and the ATM_ID to HK_Channel. We use a slightly different convention for Currency_Code as this points to reference data, which we refer to as a Weak Business Concept. I will discuss this idea in a future article.

Notice that we cast each Hook key to a common data type (VARBINARY), ensuring type consistency in queries that join Bags. You may also have noticed that each Hook key has a unique code prefixing the business key value. This code is the identifier for the business key’s KeySet, which I will deal with in the next article.
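
This is easiest to see in a query. Suppose a second Bag were built over a CRM customer table (BAG.CRM_Customer and its Customer_Name column are assumed names for illustration); hanging on the same Customer Hook, the two Bags join directly on the Hook key:

  -- Hypothetical join of two Bags on the Customer Hook key.
  -- Assuming both sources share the same customer KeySet, the
  -- VARBINARY keys line up and the join needs no type conversion.
  SELECT
      c.Customer_Name,
      t.Tx_Amount
  FROM BAG.ATM_Tx       AS t
  JOIN BAG.CRM_Customer AS c
      ON c.HK_Customer = t.HK_Customer;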

Using this straightforward mechanism, we have aligned the Data Warehouse table with the Business Glossary. The Bag is both source-aligned and business-aligned, and we didn’t need to restructure the data.

Summary

Hopefully, in this article, I’ve been able to outline the underlying principle of the HOOK approach. Firstly, we load source-aligned data into the Data Warehouse. We then wrap each Warehouse table in a view, called a Bag, which associates the table with one or more Business Concepts from the Business Glossary. The association is entirely virtual, in that a Bag may be dropped or altered without impacting the underlying Warehouse table, which means we can reorganise the warehouse without re-ingesting data.

Next Time

My next article will focus on KeySets. In addition to ensuring that keys from different systems do not clash, KeySets also help simplify the Glossary.

Comments

Rogier Werschkull
Interesting, working on digesting all of this! Question: when you write 'However, sets of Bags may be grouped within a HOOK Model', what does this grouping do conceptually? What does the term 'model' mean in this context? You are not 'joining bag views' here, right?

Andreas Seidel
Hi Andrew Foad! I have read all your articles on Hook and attended one of your webinars. While playing around with the HOOK concepts in our "real" world, I came across a question about Hook key projection. Let's assume I have a CBC identified by a customer_nr attribute in our conceptual data model. Our source systems deliver two tables containing customer data, e.g. TABLE_A and TABLE_B. For both tables there is a primary key column defined named cust_no. TABLE_A also contains a column customer_nr that can be projected to the CBC's customer_nr. As far as I understood, that makes up the Hook key (prefixed with a KeySet). Our TABLE_B does not have a column that can be projected to customer_nr (just the technical PK column cust_no). What would be your advice for applying the Hook approach to get the data of TABLE_B wrapped in a Bag? (In our Data Vault warehouse, data from TABLE_B is joined with TABLE_A on the cust_no column to get the corresponding customer_nr attribute for loading into the specific hub.) Thanks in advance, Andreas

Kasper Hansen
How do you consolidate data about customers from two different source systems? Are you allowed to union or join in Bag views? Or where do you do this in HOOK?

Sathish Rao Koteshwar
Really interesting article. I really like the thought process here. I think the approach kinda makes sense (in theory). However, the one big assumption is that the keys in the example above are generated in the source and you can readily reuse them in the data lake. In many complex implementations, I have seen the keys being derived in the data warehouse (after cleaning / transforming the data). Does that throw a wrench in this design?
