Introducing HOOK. Part 3 : Bags
In the last Introduction to Hook article, I explained the importance of the Business Glossary and that we would use it to organise, rather than model, the Data Warehouse. In this article, I will demonstrate how that looks using Hook’s Bag concept.
What is a Bag? Bags are to Hook, what Hubs, Links and Satellites are to Data Vault, or what Dimensions and Facts are to Kimball. Bags are the mechanism by which we can easily organise our data around Business Concepts.
Data Ingestion
Before we get into Bags, let me circle back to the way we acquire data. As I mentioned in the article on Logical Architecture, acquisition is about capturing new data from source systems. The most important thing to say is that when we acquire the data, we do not transform it in any way. Therefore, the data we load into the warehouse will align with the source data. We will load one row into the warehouse for each new row detected in the source data. Every attribute from the source will also have a corresponding column in the data warehouse. We don’t aggregate the data, attempt to normalise it, or split it up. The warehouse is a 1:1 representation of the source.
Although we will not transform the data, there is no reason we cannot include additional information. For example, we might add information that tells us:
In the example below, we are ingesting the ATM Transaction table from the source (SRC.ATM_Tx) and loading it into its corresponding Data Warehouse table (EDW.ATM_Tx).
I’ve only included two additional metadata fields, which I’ve prefixed with ‘Audit’, but you could have as many as you need according to your standards; make sure you are consistent and that ALL warehouse tables have the same audit fields.
Once loaded into the Data Warehouse, the data remains unchanged. The data is non-volatile or immutable. We load any source data updates as new versions, and the old versions remain untouched. The Data Warehouse, therefore, represents a ‘fossil record’ of the activity within the enterprise.
Acquiring the data is only half of the story. At this stage, the data is still aligned with the source data, and we need to organise it around the Business Concepts from the Business Glossary. To do that, we need Bags.
Bags
A Bag is a view (which can be materialised) that we use to wrap a Data Warehouse table. We project the fields within Data Warehouse tables that identify Business Concepts as Hook Keys. You can think of a Bag view as a shopping bag into which you may place a single Data Warehouse table. The handles of the Bag are used to hang it on one or more Hooks. A Hook represents a Business Concept, as shown below.
Recommended by LinkedIn
Our example shows a Data Warehouse table containing data for Automatic Teller Machine (ATM) transactions. The table includes three key fields:
A Bag is a view we build over a Data Warehouse table. Again, we do not apply transformations but add Hook keys, which are shown below prefixed with ‘HK' or 'WK'.
In this example, each of the three key fields is projected as a Hook Key. The DDL for this view might look something like this:
In this example, we have implemented the Bag as a view, but if your database platform supports them, we could also have implemented it as a materialised view. Or we could also implement the Bag as a regular table with the necessary loading logic. In other words, you have options.
Each Hook Key uses a strict naming convention to refer to a Business Concept in the Business Glossary. The convention dictates that Customer, Channel and Ref must be Business Concepts defined in the Glossary. In this case, we project the Customer_No field to HK_Customer and the ATM_ID to HK_Channel. We use a slightly different convention for Currency_Code as this points to reference data, which we refer to as a Weak Business Concept. I will discuss this idea in a future article.
Notice that we cast each Hook key to a common data type (VARBINARY), ensuring type consistency in queries that join Bags. You may also have noticed that each Hook key has a unique code prefixing the business key value. This code is the identifier for the business key’s KeySet, which I will deal with in the next article.
Using this straightforward mechanism, we have aligned the Data Warehouse table with the Business Glossary. The Bag is both source and business-aligned. Also, we didn’t need to restructure the data.
Summary
Hopefully, in this article, I’ve been able to outline the underlying principle of the HOOK approach. Firstly, we load source-aligned data into the Data Warehouse. We then wrap each Warehouse table in a view, called a Bag, which associates each table with one or more Business Concepts from the Business Glossary. The association is entirely virtual in that a Bag may be dropped or altered without impacting the underlying Warehouse table, which means that we can reorganise the data without the need to re-ingest data.
Next Time
My next article will focus on KeySets. In addition to ensuring that keys from different systems do not clash, KeySets also help simplify the Glossary.
Head of Data @ Coolgradient | Data-analytics trainer | Rockstar & AI artist @ night ;) | Love calling bullshit
2yInteresting, working on digesting all of this! Question, when you write 'However, sets of Bags may be grouped within a HOOK Model': what is does this grouping do conceptually? What does the term 'model' mean in this context? You are not 'joining bag views' here right?
Teamleiter Data Management bei DKB | Deutsche Kreditbank AG
3yHi Andrew Foad! I read all your articles according to Hook and attended one of your webinars. While playing around with the HOOK concepts in our "real" world I came around a question according to Hook key projection. Let's assume I have a CBC identified by a customer_nr attribute in our conceptual data model. Our source systems deliver two tables containing customer data, e.g. TABLE_A and TABLE_B. For both tables there is a primary key column defined named cust_no. TABLE_A also contains a column customer_nr that can me projected to the CBC's customer_nr. As far as I understood that makes up the Hook Key (prefixed with a key set). Out TABLE_B does not have a column that can be projected to the customer_nr (just the technical PK column cust_no). What would be your advise for applying the Hook approach for getting data of TABLE_B wrapped in a bag? (In our data vault warehouse data from TABLE_B is joined with TABLE_A on cust_no column to get the corresponding customer_nr attribute for loading into the specific hub). Thank's in advance Andreas
Working with Databricks, data lake and data catalog
3yHow do you consolidate data about customers from two different source systems? Are you allowed to union or join in bag views? Or where do you do this in HOOK?
Passionate about all things Data!!
3yReally interesting article. I really like the thought process here. I think the approach kinda makes sense (in theory). However the one big assumption is that the keys in the example above are generated in the source and you can readily reuse them in the data lake. In many complex implementations, I have seen the keys being derived in the data warehouse (after cleaning / transforming the data). Does that throw a wrench in this design?