There is something fundamentally wrong with the way we reuse data


This post came to me after reading a post on LinkedIn by @Ronald Baan. My response started to look like an essay right from the start, so I decided not to take it to the comment section; sorry, Ronald.

 

As a very short summary of Ronald's post (but please read the whole thing):

====

We make a mess of our data by buying and stitching together applications, and then we turn to #datamesh or #datafabric, but also to #DataVault, Enterprise Service Bus, APIs and a host of other solution directions that help us make sense of the mess we made.

 

Can't we link our data in such a way that it does not need to be disassembled first, only to be reassembled again?

====

 

I think the answer is both yes and no.

Yes, we can link our data and take care of Data Quality, Data Security and Data Governance on the data as it is and where it sits in the source. That is, in principle; there are a number of things that have to be taken into account, which I'll come back to.

No, we can't offer a complete, useful data environment based purely on the data generated at the source. There is data that we don't need outside the source, and there is data that is useful in its source context but needs to be adapted to become valuable outside it. Also, Data Quality and Data Security are not static; they depend on how the data is used. And not all source technologies are capable of handling every type of data need.

 

We surely can do better though. How did we get here in the first place? In the race to stay aligned with the business's needs for data access, data specialists have not taken the necessary step back to really understand data as it is created and served in the computational environment.

 

I hope to give a number of entry points and hints as to how to arrive at an environment that is not constantly deconstructing and reconstructing (all) the data needed for insights.

 

What do we do with data?

Data is primarily created to monitor the progression of work and to remember what had to be passed on. Passing information on initially stayed in paper form, and once computers were generally connected, the paper moved to some form of digital message.

The first things people started tracking were possessions, either actual or due. The first administrative software was built around that as well, and that is why most IT departments at the time sprang up from the financial department.

After that, data was created for and in automated processes and moved into all other administrations. The ability to look back was great, but there were other possible uses of the data: you could count and summarize. Yet little was done to combine data created in different applications. The makers of software had no incentive (rather the contrary) to make sure data could be interchanged: keeping the data enclosed keeps the user bound to the application.

Bringing the data into a more central structure offers no advantage to the makers of the software (apps break data), so that is left to the user organisations. This results in problems with lineage and with understanding the source context, in contexts changing along the way and at the endpoint (the report), and in organisations being forced to move data around to get a grip on overarching information.

 

There are more types of data that have to be identified.

I use the 'more' here on purpose. We have identified data types as technical constructs out of necessity (numeric, string, boolean and their subtypes). Then Master Data and Reference Data were singled out. Little attention has been given to Steering Data (the data that helps track progression through a program). Database specialists are very much interested in keys; foreign and primary keys are essential to making relational databases work. With this enumeration we have still hardly escaped the technical aspects of the source. If we stick to these typologies of data, we stay in the disassembled or deconstructed situation.

Then the data people begin to reconstruct the data in a Data Vault or Data Mesh; great approaches in themselves, but not part of the user experience.

I have the impression that we data people have been jumping to solutions all the time. Since there is a strong bond with technology, the solutions we found were usually shaped by that technology and did not really address the business needs around data.

What does the business do with data?

They communicate among themselves and within their own teams, which is largely covered by the applications, and they communicate with others in the organisation, mainly with those from whom they take over tasks and those to whom they pass tasks on.

So there is software-internal data and software-external (communication) data. Staying on the technical side, these are internal and external relative to a piece of software, which usually largely overlaps with a business process. The external data is essentially defined by the information needed to complete the communication act, and that communication act is a business thing. Amazingly, very little attention has been given to this essential set of Communication Data.

Not giving it this attention means that fortunes are spent constructing and passing on the right data between the departments of our organisations. Fortunes that could be saved if the outlines of the necessary communication were described and implemented before building or buying software. Currently, however, we re-invent the whole communication set anew every time we implement a new piece of our software puzzle. By taking control of the way the organisation's communication is managed beforehand, you can also define, with more precision and shared understanding, what is communicated and how the links are to be used.

I would advocate a construction where interdepartmental (inter-process) communication is no longer kept inside the functional constraints of the source software, but rather (or also) is published centrally in a business-owned format and structure. This would allow virtually all data to be understood in terms of this central data structure.
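To make this a little more tangible, here is a minimal sketch in Python of what such a centrally published, business-owned communication record could look like. The class name, field names and the publish step are my own assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class CommunicationRecord:
    """One interdepartmental communication act, published centrally
    in a business-owned format rather than kept inside the source app."""
    business_key: str          # key the business itself uses, e.g. an order number
    source_system: str         # where the act originated (useful for lineage)
    from_department: str
    to_department: str
    subject: str               # what the communication is about, in business terms
    payload: dict              # the agreed minimal set of data points
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def publish(record: CommunicationRecord) -> str:
    """Hypothetical publish step: here it just serialises to JSON; in practice
    this could write to a central store, a topic, or a shared table."""
    return json.dumps(asdict(record), indent=2)

# Example: the sales department passes an order on to fulfilment.
print(publish(CommunicationRecord(
    business_key="ORD-2024-0042",
    source_system="webshop",
    from_department="Sales",
    to_department="Fulfilment",
    subject="customer order",
    payload={"customer": "ACME", "items": 3, "requested_delivery": "2024-07-01"},
)))
```

The point of the sketch is only that the record carries the business key and the business meaning, independent of whichever application happened to create it.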

 

So what do we link?

Now how on earth are we to find out which data is needed in this central data structure? Most existing software does not really fit such a solution and uses disparate terms for things that, to the members of the organisation, are or at least should be the same. One needs a method to detach from the current reality of deconstruction.

 

The good news is that there is at least a partial method, created by Jan Dietz, who developed a methodology called DEMO. There are some branches of this with different names, but let us not diverge. The idea is that, at different levels, there is actually a single pattern in companies that determines how things work. Jan Dietz baptized this pattern a transaction, and it is composed of four different steps. Each of these can be dressed with a limited set of data points, much like the good old forms from the days when paper was the way to communicate. Back then, both ends of the communication had to cope with a single way of naming things.

The transaction starts off with a request. This request can be a customer order, a request for an allowance from some governmental organisation, or a factory request passed between production facilities. This should be followed by a 'promise'. This can be a rejection of the request or, if the request can be processed, an acceptance of it, implying that some process will be started to handle the request accordingly. Quite often the 'promise' is implicit and only shows itself where a rejection is possible. The next step is the delivery, with, of course, the data needed to inform the requester of the assumed success. The last step is the acceptance of the delivery, a sign-off confirming that the promise was followed up in a satisfactory way. The acceptance, just like the 'promise', is often implicit as well and only surfaces as a rejection.
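Purely as an illustration of the pattern described above (the step names follow that description; the class and the allowed-transition table are my own assumptions, not part of DEMO itself), the four steps could be sketched as a small state model:

```python
from enum import Enum

class Step(Enum):
    REQUEST = "request"
    PROMISE = "promise"        # often implicit; surfaces where a rejection is possible
    DELIVERY = "delivery"
    ACCEPTANCE = "acceptance"  # also often implicit; surfaces as a rejection
    REJECTED = "rejected"

# Which steps may follow which: a promise (or rejection) answers the request,
# a delivery follows the promise, and the acceptance (or rejection) closes it.
ALLOWED = {
    Step.REQUEST: {Step.PROMISE, Step.REJECTED},
    Step.PROMISE: {Step.DELIVERY},
    Step.DELIVERY: {Step.ACCEPTANCE, Step.REJECTED},
    Step.ACCEPTANCE: set(),
    Step.REJECTED: set(),
}

def next_step(current: Step, proposed: Step) -> Step:
    """Move a transaction to its next step, refusing transitions the pattern does not allow."""
    if proposed not in ALLOWED[current]:
        raise ValueError(f"{proposed.value} cannot follow {current.value}")
    return proposed

# A happy path: request -> promise -> delivery -> acceptance.
state = Step.REQUEST
for step in (Step.PROMISE, Step.DELIVERY, Step.ACCEPTANCE):
    state = next_step(state, step)
print(state.value)  # acceptance
```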

While there is a huge amount of detail around this theory of Jan Dietz, I think that here we find a very useful frame for defining the data that should be made available in a centralized way: it should cover the information needs of the transaction pattern.

 

You can ask a department, with quite some success, to define the information that is minimally needed for them to be in a position to give a promise (or to decide when they will reject), and that gives you the data the request should consist of. You can ask the requester what information they require to see the promise (or rejection) as a valid one. The same goes for the delivery and the acceptance.
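One way to record the outcome of such conversations, purely as a hypothetical sketch (the field lists are invented examples of what a department and a requester might agree on), is a simple table of required data points per step:

```python
# Hypothetical outcome of asking both sides what each step minimally needs.
# The keys are the transaction steps; the values are the agreed data points.
REQUIRED_FIELDS = {
    "request":    ["business_key", "requested_by", "what_is_requested", "needed_by"],
    "promise":    ["business_key", "promised_by", "promised_date"],   # or a rejection reason
    "delivery":   ["business_key", "delivered_on", "delivery_details"],
    "acceptance": ["business_key", "accepted_by"],                    # or a rejection reason
}

def missing_fields(step: str, message: dict) -> list[str]:
    """Check a communicated message against the agreed minimum for its step."""
    return [f for f in REQUIRED_FIELDS[step] if f not in message]

print(missing_fields("request", {"business_key": "ORD-2024-0042", "requested_by": "Sales"}))
# ['what_is_requested', 'needed_by']
```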

Because this pattern can go into quite some detail due to sub-processes, it is difficult to say when your central data structure is complete. This will depend a lot on the organisation. Reasoning from the outside request into your organisation, you should certainly stop where declining the promise is no longer an option, and possibly before that point to prevent unnecessary detail.

When you follow this pattern you will capture all necessary communications, which is essential for being able to link the data in your organisation. Details of the processes that are not in the communication structure can still be connected through the communications in this structure.

 

How do we link?

There are several ways to integrate data. I already mentioned Data Vault, but ontologies, taxonomies and business glossaries are among them as well. The problem we face is that these initially detach from what the business does, only to then come back to the business with a way of thinking that is alien to it, while stating that it is essential for a working centralized IT. (Yes, I intentionally left out the C ;) why was that letter put in the middle in the first place?)

Clearly, linking the different communication acts based on Jan Dietz's transaction pattern requires a common understanding of the terms used in the communications, but this is actually dealt with when you discuss the different DEMO steps. That means you stay close to the business and its processes. While this takes care of part of what is needed in an ontology, taxonomy or business glossary, you still need to get your keys together for the whole to become queryable (if that is a word).

We all know the letter where you are asked to refer to a document number; aside from that technical type of key, you should also define the business key and make both available. That way one can also dig more easily into the more detailed data that lies behind the centralized set. This may well take on some resemblance to the hubs and links of a Data Vault, which offer a flexible way to connect disparate data.
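As a rough sketch of what "making both keys available" could look like, loosely resembling a Data Vault hub and link (all names here are assumptions for illustration, not a modelling standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderHub:
    """One entry per real-world order: the business key plus the technical keys
    under which the same order is known in each source system."""
    business_key: str        # e.g. the order number printed on the letter
    technical_keys: tuple    # (system, document number) pairs

@dataclass(frozen=True)
class Link:
    """Connects two hubs, e.g. an order to the customer that placed it."""
    from_business_key: str
    to_business_key: str
    relation: str

order = OrderHub(
    business_key="ORD-2024-0042",
    technical_keys=(("webshop", "DOC-991"), ("erp", "4711")),
)
link = Link(from_business_key="ORD-2024-0042",
            to_business_key="CUST-ACME",
            relation="placed_by")

# With both keys available, a question phrased in business terms ("order ORD-2024-0042")
# can still be traced to the detailed records behind it (webshop DOC-991, erp 4711).
print(dict(order.technical_keys))
```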

 

Why is that not all?

Data is a carrier of meaning, and meaning shifts with time, place and intent. Shifting meaning can also result in the need to change your integration methods and content; it is never finished. A well-understood data environment is a good data environment. Being in control of the central set of communicated data is a great starting point for a good understanding of the data. When all parties use the same central data structure, their understanding will grow together, and that in turn will make the central structure more stable.

An ontology will help, but it should support the central structure and probably not extend it too much. The same goes for a taxonomy, while a business glossary may go into more detail where it extends beyond the boundaries of the defined central structure. When one sets up ontologies, taxonomies and even business glossaries without the central structure in place, these attempts at integration look to me like solutions without a clear set of requirements.

Data quality needs a clear understanding of what the data is, and so does the implementation of security. If you don't define those on a rather stable foundation, you will be working your head off keeping track of the many variations of data quality and security.

Can we now do without moving data around, deconstructing and reconstructing it in a different way? No, we probably can't, because there are more requirements than just knowing how to knit things together. You may need summaries, or a faster response. You will probably always have some datamarts in areas where the central data structure is too cumbersome as an intermediate, or where sources can't cope with the extra writes you ask of them for the central data structure. The latter will probably be a growth path that takes a long time, but once under way there is a good chance that the importance of the central data structure will become part of the decision making when choosing new software. If communication with the central data structure becomes part of the requirements, you will buy software that can feed it.

 

Will this change the way we work with data?

In the short run, little will change from the end users' perspective. They will still write queries and have some difficulty combining data.

The biggest change is for IT, where communication now becomes a central part of their thinking, across the board. There will be far less data on the move from one environment to another, constantly changing context, and thus meaning, and cluttering up network and storage.

With less data moving around and a clearer central structure and meaning, end users will benefit after a while, because a simpler and probably less volatile data environment is offered to them. On the other hand, if it becomes easier to find your way, users will also start asking more questions and finding more solutions. I think no organisation would want to discourage that.
