Data Architecture
What is data architecture
A company’s data architecture shapes its livelihood. If a company were a chess piece, its data architecture would define the moves it can make on the board.
A primitive architecture allows your company to move like a pawn. An advanced architecture can make that pawn a queen.
Picture these different data architectures:
Storing a file as a .csv on a local hard drive and reading the file into Tableau on a person’s computer for analysis is a very simple kind of data architecture.
Streaming data from a set of point-of-sale registers to accounting is another kind of architecture.
The data architecture largely determines how freely a company can move.
When agility is needed, whether to avoid collapse during slow seasons or to capitalize on the sudden popularity of a new product, a more advanced data architecture leaves the company better able to act.
Explicitly, the data architecture:
Gives a fuller picture of what is happening in the company
Creates a better understanding of the company’s data
Offers protocols by which data moves from its source to the destinations where it is analyzed and consumed
Ensures a system is in place to secure the data
Grants all teams the ability to make data-driven decisions
Components of data architecture
Today’s data architectures are built from components such as:
Data pipelines
Cloud storage
APIs
AI & ML models
Data streaming
Kubernetes
Cloud computing
Real-time analytics
And more…
Data standards
Data standards are the overarching rules of a data architecture. You apply them to areas such as data schemas and security.
Data schemas
The architecture is responsible for setting the data standards that define what kinds of data will pass through it.
These standards can be achieved by creating a data schema. The data schema defines:
Each entity that should be collected. A schema for contact info, for example, might include name, phone number, email, and place of work.
The type of data each piece should be. For example, name is text data, phone number is text data (it is an identifier rather than a quantity, and it can carry a leading zero or a "+"), email is text data, place of work is text data.
The relationship of that entity to others in the database, such as where it comes from and where it’s going.
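As a minimal sketch of the first two points, the contact schema can be written down as a mapping from field name to expected type, with a small check that a record conforms. The names `CONTACT_SCHEMA` and `validate` are hypothetical, and storing the phone number as text is an assumption made here so that leading zeros and a "+" prefix survive. Relationships between entities are not covered by this sketch.

```python
# Hypothetical schema for the contact-info example: field name -> expected type.
# Assumption: phone numbers are kept as text so "+" and leading zeros are preserved.
CONTACT_SCHEMA = {
    "name": str,
    "phone_number": str,
    "email": str,
    "place_of_work": str,
}


def validate(record, schema=CONTACT_SCHEMA):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    # Every field the schema names must be present and of the declared type.
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Fields the schema does not know about are reported rather than silently kept.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


good = {
    "name": "Ada Lovelace",
    "phone_number": "+44 20 7946 0000",
    "email": "ada@example.org",
    "place_of_work": "Analytical Engines Ltd",
}
bad = {"name": "Ada Lovelace", "phone_number": 5550100, "email": "ada@example.org"}

print(validate(good))
print(validate(bad))
```

Returning a list of problems, instead of raising on the first one, lets a pipeline log everything that is wrong with a record in a single pass.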
Most companies will version their data schema. As data becomes increasingly pervasive, companies are beginning to use non-relational (NoSQL) databases over more traditional relational (SQL) databases.
Non-relational (NoSQL) databases let you add data easily and piece it together more like a network of entities than a strict hierarchy of entities. Plus, these databases are designed to grow much larger and to accept new data dynamically, something traditional SQL databases either could not do or were strongly advised against doing.
That’s why versioning is so vital. Versioning the data schema helps standardize:
What to find where
The ability to ask when a piece of data was where
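One way to picture both points is a version history that records, for each schema version, the date it took effect and where each field lives. The sketch below is illustrative only: the version numbers, dates, and locations in `SCHEMA_VERSIONS` are made up, and `schema_as_of` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history, sorted by effective date:
# (effective date, version label, field -> location of that field)
SCHEMA_VERSIONS = [
    (date(2021, 1, 1), "1.0", {"email": "contacts.email"}),
    (
        date(2022, 6, 1),
        "2.0",
        {"email": "contacts.primary_email", "place_of_work": "contacts.employer"},
    ),
]


def schema_as_of(day):
    """Return (version, locations) for the schema that was in effect on the given day."""
    effective_dates = [effective for effective, _, _ in SCHEMA_VERSIONS]
    # bisect_right counts the versions whose effective date is on or before `day`.
    index = bisect_right(effective_dates, day) - 1
    if index < 0:
        raise LookupError(f"no schema version in effect on {day}")
    _, version, locations = SCHEMA_VERSIONS[index]
    return version, locations


version, locations = schema_as_of(date(2021, 12, 31))
print(version, locations["email"])
```

With a history like this, "what to find where" is a lookup in the latest entry, and "when was this data where" is the same lookup with an earlier date.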
Data security
Data standards also help set the security rules for the architecture. These can be visualized in the architecture and schema by showing what data gets passed where, and, when it travels from point A to point B, how the data is secured.
Security protocols can include:
Encrypting data in transit
Restricting access to specific individuals
Anonymizing data to reduce the value of the information once the receiving party has it
Additional measures as the architecture requires
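To make the anonymizing step more concrete, here is a small sketch that replaces sensitive fields with a keyed hash before a record is passed on. Strictly speaking this is pseudonymization, not full anonymization: the same input always maps to the same token, so records can still be joined downstream. The key, the field list, and the `pseudonymize` helper are all assumptions for illustration. A real key would come from a secrets manager, never from source code.

```python
import hashlib
import hmac

# Assumption: placeholder key for the example only.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumption: which fields count as sensitive is a policy decision; this set is illustrative.
SENSITIVE_FIELDS = {"name", "phone_number", "email"}


def pseudonymize(record, key=SECRET_KEY):
    """Return a copy of the record with sensitive values replaced by HMAC-SHA256 tokens."""
    masked = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            # Non-sensitive fields pass through unchanged.
            masked[field] = value
    return masked


record = {"name": "Ada Lovelace", "email": "ada@example.org", "place_of_work": "Acme"}
print(pseudonymize(record))
```

A keyed hash is used instead of a plain hash so that someone without the key cannot recover common values, such as well-known email addresses, by hashing guesses.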
Shifting to new architecture
McKinsey published a great article about six important changes to consider when building a data architecture in today’s world. It highlights the older architectural components and how they have been updated into the distributed, agile architecture of today’s companies.
Here is the short version of these six changes:
From on-premise to cloud-based data platforms
From batch to real-time data processing
From pre-integrated commercial solutions to modular, best-of-breed platforms
From point-to-point to decoupled data access
From an enterprise warehouse to domain-based architecture
From rigid data models toward flexible, extensible data schemas