Square Pegs and Round Holes

Why isn’t data modelling the No. 1 go-to skill required of everybody in the data industry? I believe the answer has nothing to do with how we apply data modelling, but rather when we choose to use it.

The saying goes, “If the only tool you have is a hammer, then every problem looks like a nail.” Just because you have a hammer doesn’t mean you need to use it. For example, if you try to use a hammer to drive in a screw, then the results are going to be suboptimal.

One area where the benefits of data modelling are grossly overstated is within the data warehouse. Both ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) patterns require some form of data modelling to define the Transform part. In the Kimball methodology, we transform data into facts and dimensions. In Data Vault, we transform data into hubs, links and satellites. Regardless of the methodology, we need to perform some upfront data modelling to determine the target structures and then map sources to them. It is a lot of work and very easy to get wrong.
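
To make that mapping work concrete, here is a minimal sketch of the kind of upfront source-to-target mapping a Transform step implies, using a hypothetical CRM source and a Kimball-style customer dimension (all table and column names are purely illustrative, not taken from any real system):

```python
# Hypothetical source-to-target mapping for a Kimball-style customer dimension.
# Every source column must be studied, renamed and conformed up front; any source
# shape the mapping doesn't anticipate forces a model change or a mapping change.

CUSTOMER_DIM_MAPPING = {
    # target column         : source column in a hypothetical crm.customers table
    "customer_business_key": "cust_no",
    "customer_name":         "full_name",
    "customer_segment":      "segment_cd",
    "effective_from":        "last_updated_ts",
}

def transform_customer_row(source_row: dict) -> dict:
    """Bend one source row into the shape the target model expects."""
    return {target: source_row[source] for target, source in CUSTOMER_DIM_MAPPING.items()}

# A source row that doesn't match the mapping (a renamed or missing column)
# fails immediately -- the "easy to get wrong" part of upfront modelling.
row = {"cust_no": "C-1001", "full_name": "Jane Doe", "segment_cd": "SMB",
       "last_updated_ts": "2024-01-31T09:15:00"}
print(transform_customer_row(row))
```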

If you have ever read any of the literature from John Giles or Remco Broekmans, you might be persuaded that building a logical view of the business as a target structure for your data warehouse is imperative. I also believed this, but now I find myself disagreeing with this sentiment, albeit reluctantly. Logical models are great as a communication tool, but are they useful in the context of a data warehouse? Or are they more trouble than they are worth?

Assume that we have spent some time building a logical model (or “town plan”), which might take many weeks or months. If the intention is for the logical model to define a target structure for the data in the data warehouse, then we are asking for trouble.

Our problem is that the logical model represents an “ideal world” view of the business or organisation, but not reality; it does not always accurately represent the structure of the data from our operational systems, so we either have to bend the data to fit the model or change the model. We now have more work to do and more things that could go wrong.

To put this in more rustic terms, we are constantly trying to put square pegs into round holes. How can we make square pegs fit into the round holes? We must reshape something, either the pegs or the holes, which requires effort. The bits we whittle away to make the pegs and holes fit one another represent technical debt, which we know is seldom addressed. But is this a problem? Isn’t that the purpose of a data modeller? Yes, but what if modelling wasn’t necessary? What would that look like?    

The solution is remarkably unremarkable. If we have a square peg, all we need is a square hole of the same size and depth. We don’t need to model a target structure; it will be identical to the source structure. Now I hear you say, “You have built a data warehouse siloed by source system.” And you would be correct, but breaking down silos is an entirely separate problem. And that is the entire point: by modelling data, we are attempting to solve two problems in one go, loading the data into the warehouse and restructuring it to eliminate silos. Why are we trying to solve both problems using the same tool? Different problems require different solutions.
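
As a rough sketch of what a square hole for a square peg might look like in practice, the landing structure can simply be derived from whatever the source sends, rather than designed in advance. The function and table names below are hypothetical, and SQLite stands in for whatever platform you actually use:

```python
# Minimal sketch: derive the landing table's shape from the source itself,
# so no target model has to exist before the data arrives.
import sqlite3

def land_as_is(conn, table_name, rows):
    """Create (if needed) a landing table whose columns mirror the source rows,
    then insert the rows unchanged -- no remodelling, no reshaping."""
    columns = list(rows[0].keys())                      # the peg defines the hole
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})',
        [tuple(str(row[c]) for c in columns) for row in rows],
    )

conn = sqlite3.connect(":memory:")
land_as_is(conn, "crm__customers", [{"cust_no": "C-1001", "full_name": "Jane Doe"}])
print(conn.execute('SELECT * FROM "crm__customers"').fetchall())
```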

In a traditional warehouse, we would never attempt to alter the size and shape of inventory items just to fit a predetermined shelf; we leave them unmodified. If we can efficiently receive goods and put them on shelves, we have solved the first problem; if we can efficiently locate and retrieve goods from the warehouse, we have solved the second problem. We are free to place inventory on any shelf we like, but we must also ensure we record enough information to find it again.

This same reasoning applies to data warehouses. There is no need to (re)model data, but we should tag it with enough information to tell us how to find it again. Ideally, we should perform this tagging at the earliest opportunity, but that shouldn’t stop us from ingesting the data if we haven’t figured out how to organise it. We should be able to apply the tagging information later. In other words, ingestion and organisation are two separate processes. I call this pattern Extract-Load-Organise (ELO – [Mr.] Blue-Sky thinking?). The important point here is that there is no Transform! Therefore, there is no modelling or trying to force square pegs into round holes.
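
To illustrate the separation of ingestion and organisation, here is a minimal sketch of an Organise step applied after loading. The tag names and the simple catalogue dictionary are my own illustration, not a prescribed Hook or ELO implementation:

```python
# Minimal sketch of Extract-Load-Organise: organisation is metadata applied
# after loading, not a transform applied before it. The landed data is untouched.

# Landed exactly as received (the Load step); table and column names are illustrative.
landed = {
    "crm__customers":    [{"cust_no": "C-1001", "full_name": "Jane Doe"}],
    "billing__invoices": [{"inv_id": 9001, "cust_ref": "C-1001", "amount": 250.0}],
}

# The Organise step: record, per table, which columns carry which business keys.
# These tags can be added (or corrected) long after ingestion without reloading anything.
organisation = {
    "crm__customers":    {"customer": "cust_no"},
    "billing__invoices": {"customer": "cust_ref", "invoice": "inv_id"},
}

def find(concept, key_value):
    """Locate rows for a business key across source silos using only the tags."""
    for table, tags in organisation.items():
        column = tags.get(concept)
        if column is None:
            continue
        for row in landed[table]:
            if row[column] == key_value:
                yield table, row

print(list(find("customer", "C-1001")))
```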

If you have read anything I’ve posted over the past few years, you’ll realise that the ELO pattern underpins the #Hook approach to data warehousing. How we organise data is a separate topic I have written about before, and I hope to publish a simple introduction guide in the next few weeks (it’s mostly written, but it needs some editing).

To round it off, I must address the “elephant in the room” (sorry John 😊). There is no get-out-of-jail-free card; we still have to do some modelling when figuring out how to present data for end-user consumption. However, by organising the data according to the Hook principles, the modelling overhead is much lighter, which is a topic for another time.

Pugazendhi A.

Andrew Foad I would be more comfortable comparing the ETL/ELT transformation to the transformation we do from Raw Vault to Business Vault. Creating Hubs, Links and Satellites is just connecting to the appropriate business keys; intentionally, not much of a transformation. Organizing historic data around a logical model (or some level of abstraction of it) helps significantly in ensuring data integration and integrity, and it is not as much trouble as it may appear when we use the right tool. sudar.io.

Brad S.

Yet another variant on a source-system raw data lake, with a slightly renamed medallion framework sitting on top. There are many reasons that data modeling is done, not just to design target data structures that raw data can be landed in. If you understand this, then you understand why data modeling, including normalization / standardization / cleansing / master data, is on the critical path for an enterprise data platform.

Dale Anderson (CDVP2)

I agree

Martijn Imrich

Remco Broekmans will say no to that question. I would even add that context is king now with the arrival of GenAI. Data modelling, to me, is a simplified model of the business reality. We need that more than ever to make GenAI work.

For me, it’s the conceptual model that defines the semantics of the universe of discourse - the model of the business information if you like. And no, a conceptual model isn’t just a logical model without attributes! The logical model is the solution design - the logical definition of the storage structure proposed. The physical model is the implementation of that logical design. So one conceptual model with many logical models: one for each storage architecture (relational, dimensional, data vault, …). Each logical model may have many physical models, one for each physical platform (Oracle, SQL Server, …). So the mantra is: conceptual modelling, logical design, physical implementation.
