Modern Data Stack- Part 2
Fundamental Components of the Modern Data Platform
The basic components of a data platform are:
Data Collection and Tracking
Data Ingestion
Data Transformation
Data Storage (Data warehouse/lake)
BI Tools
Reverse ETL
Orchestration (Workflow engine)
Data Management, Quality, and Governance
Let's delve into the details of each of the components listed above.
Data Collection and Tracking- This includes the process of collecting behavioural data from client applications (mobile, web, IoT devices) and transactional data from backend services.
The MDS tools in this area focus on reducing quality issues that arise due to badly designed, incorrectly implemented, missed, or delayed tracking of data.
Common capabilities include
Interface for event schema design
Workflow for collaboration and peer review
Integration of event schema with the rest of the stack
Auto-generation of tracking SDKs from event schemas
Validation of events against schemas
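The last capability, validating events against their schemas, can be sketched in a few lines of Python. The schema format, event names, and fields below are illustrative assumptions, not the API of any particular tracking tool:

```python
# Minimal sketch of validating tracked events against an event schema.
# The schema format and the 'page_view' event are illustrative assumptions.

def validate_event(event: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for field, expected_type in schema["required"].items():
        if field not in event:
            errors.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"field {field!r} should be {expected_type.__name__}")
    return errors

# Hypothetical schema for a 'page_view' event
page_view_schema = {"required": {"user_id": str, "url": str, "timestamp": float}}

good = {"user_id": "u1", "url": "/home", "timestamp": 1.0}
bad = {"user_id": 42, "url": "/home"}

print(validate_event(good, page_view_schema))  # []
print(validate_event(bad, page_view_schema))   # two errors
```

In practice, MDS tracking tools generate this kind of validation code (and typed SDKs) automatically from the centrally designed schemas, so client teams cannot drift from the agreed contract.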
Data Ingestion- Ingestion is the mechanism for extracting and loading raw data from its source of truth to a central data warehouse/lake.
A modern data ecosystem has pipelines bringing raw data from hundreds of first- and third-party sources into the warehouse. New ingestion pipelines must constantly be built to meet growing business demands.
MDS data ingestion tools aim to improve productivity and ensure data quality.
Common capabilities include
Configurable framework
Plug and play connectors for well-known data formats and sources
Plug and play integrations for popular storage destinations
Quality checks against ingested data
Monitoring and alerting of ingestion pipelines
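The configurable-framework and plug-and-play-connector pattern can be sketched as a small connector registry. The connector name, registry API, and quality rule here are illustrative assumptions, not the interface of any specific ingestion tool:

```python
# Sketch of a configurable ingestion framework with pluggable source
# connectors. Connector names and the registry API are illustrative.
from typing import Callable, Iterator

CONNECTORS: dict[str, Callable[[dict], Iterator[dict]]] = {}

def register(name: str):
    """Decorator that registers a source connector under a config key."""
    def wrap(fn):
        CONNECTORS[name] = fn
        return fn
    return wrap

@register("csv_stub")
def read_csv_stub(config: dict) -> Iterator[dict]:
    # A real connector would read from config["path"]; stubbed rows here.
    yield {"id": 1, "amount": 10}
    yield {"id": 2, "amount": -5}

def ingest(config: dict) -> list[dict]:
    """Run the configured connector and apply a basic quality check."""
    rows = list(CONNECTORS[config["source"]](config))
    bad = [r for r in rows if r["amount"] < 0]
    if bad:
        print(f"quality warning: {len(bad)} rows with negative amount")
    return rows

rows = ingest({"source": "csv_stub"})
print(len(rows))  # 2
```

Adding a new source then means writing one connector function and a config entry, which is how these tools keep pace with growing ingestion demands.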
Data Transformation- Transformation is the process of cleaning, normalizing, filtering, joining, modelling, and summarizing raw data to make it easier to understand and query. In the ELT architecture, transformation happens after raw data is loaded into the warehouse; in ETL, it happens before the data is loaded.
MDS data transformation tools focus on providing frameworks that enable consistent data model design, promoting code reuse and testability.
Common capabilities include
Strong support for software engineering best practices like version control, testing, CI/CD, and code reusability
Support for common transformation patterns such as idempotency, snapshots, and incrementality
Self-documentation
Integration with other tools in the data stack
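Two of the patterns above, incrementality and idempotency, can be sketched together: only rows newer than a stored watermark are transformed, and re-running with the same watermark changes nothing. Column names and the watermark mechanism are illustrative assumptions:

```python
# Sketch of an idempotent, incremental transformation: only rows with a
# timestamp past the last watermark are processed, so reruns are no-ops.
# The 'ts'/'amount' columns and watermark scheme are illustrative.

def transform_incremental(raw_rows: list[dict], watermark: int) -> tuple[list[dict], int]:
    """Clean rows with ts > watermark; return (rows, new watermark)."""
    new = [r for r in raw_rows if r["ts"] > watermark]
    cleaned = [{"ts": r["ts"], "amount": round(r["amount"], 2)} for r in new]
    new_watermark = max((r["ts"] for r in cleaned), default=watermark)
    return cleaned, new_watermark

raw = [{"ts": 1, "amount": 9.999}, {"ts": 2, "amount": 5.0}]

out, wm = transform_incremental(raw, watermark=1)
print(out, wm)   # only the ts=2 row is processed; watermark advances to 2

out2, _ = transform_incremental(raw, watermark=wm)
print(out2)      # [] -- rerunning with the same watermark is a no-op
```

Frameworks like dbt package this kind of pattern (plus snapshots, tests, and documentation) so that every model in the project follows it consistently.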
Data Storage (Data Warehouse/lake)- The data warehouse/lake is at the heart of the modern data platform. It acts as the historical record of truth for all behavioural and transactional data of the organization.
MDS data storage systems focus on providing serverless auto-scaling, lightning-fast performance, economies of scale, better data governance, and high developer productivity.
Common capabilities include
Auto-scaling during heavy loads
Support for open data formats such as Parquet, ORC, and Avro
Strong security and access control
Data governance features such as managing personally identifiable information
Support for both batch and real-time data ingestion
Rich information schema
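A "rich information schema" means the warehouse exposes its own catalog as queryable metadata. As a stand-in for a cloud warehouse's INFORMATION_SCHEMA views, the same idea can be shown with SQLite's catalog from the Python standard library; the table and columns are illustrative:

```python
# Sketch of inspecting a warehouse's information schema, using SQLite's
# built-in catalog as a stand-in for a warehouse's INFORMATION_SCHEMA views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, url TEXT, ts REAL)")

# List tables from the catalog, then list columns for one table.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [r[1] for r in conn.execute("PRAGMA table_info(events)")]

print(tables)   # ['events']
print(columns)  # ['user_id', 'url', 'ts']
```

Governance and lineage tooling builds on exactly this kind of metadata, e.g. to locate columns holding personally identifiable information.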
BI tools
BI tools are analytical, reporting, and dashboarding tools used by data consumers to understand data and support business decisions in an organization. MDS BI tools focus on enabling data democracy by making it easy for anyone in the organization to quickly analyze data and build feature-rich reports.
Common capabilities include
Low or no code
Data visualizations for specific use cases such as geospatial data
Built-in metrics definition layer
Integration with other tools in the data stack
Embedded collaboration and documentation features
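The built-in metrics definition layer is worth a sketch: metrics are declared once, as data, so every report and dashboard computes them identically. The metric names and definition format below are illustrative assumptions, not any BI tool's actual syntax:

```python
# Sketch of a metrics definition layer: each metric is declared once and
# reused by every report. Metric names and the format are illustrative.

METRICS = {
    "revenue": lambda rows: sum(r["amount"] for r in rows),
    "order_count": lambda rows: len(rows),
}

def compute(metric: str, rows: list[dict]) -> float:
    """Compute a named metric over a set of rows."""
    return METRICS[metric](rows)

orders = [{"amount": 10.0}, {"amount": 5.5}]
print(compute("revenue", orders))      # 15.5
print(compute("order_count", orders))  # 2
```

Centralizing definitions this way prevents the classic failure mode where two dashboards disagree on what "revenue" means.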
Reverse ETL- Reverse ETL is the process of moving transformed data from the data warehouse to downstream systems like operations, finance, marketing, CRM, sales, and even back into the product, to facilitate operational decision making.
Reverse ETL tools are similar to MDS data ingestion tools except that the direction of data flow is reversed (from the data warehouse to downstream systems).
Common capabilities include
Configurable framework
Plug and play connectors for well-known data formats and destinations
Plug and play integrations for popular data sources
Quality checks against egressed data
Monitoring and alerting of data pipelines
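A reverse-ETL sync with a quality check on egressed data can be sketched as follows. The destination here is a stubbed in-memory CRM keyed by email; a real tool would call the CRM's API, and the field names are illustrative assumptions:

```python
# Sketch of a reverse-ETL sync: rows read from the warehouse are quality-
# checked, then upserted into a downstream system. The in-memory 'crm'
# dict stands in for a real CRM API; field names are illustrative.

def sync_to_crm(warehouse_rows: list[dict], crm: dict) -> int:
    """Upsert valid rows into the CRM keyed by email; return rows synced."""
    synced = 0
    for row in warehouse_rows:
        if not row.get("email"):           # quality check on egressed data
            print(f"skipping row without email: {row}")
            continue
        crm[row["email"]] = row            # idempotent upsert by key
        synced += 1
    return synced

crm_store: dict = {}
rows = [{"email": "a@x.com", "ltv": 120}, {"ltv": 5}]
print(sync_to_crm(rows, crm_store))  # 1 (the keyless row is skipped)
```

Upserting by a stable key makes the sync safe to re-run, which matters because these pipelines are typically scheduled by the orchestrator described next.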
Orchestration (Workflow engine)
Orchestration systems are required to run data pipelines on schedule, request and relinquish infrastructure resources on demand, react to failures, and manage dependencies across data pipelines from a common interface.
MDS orchestration tools focus on providing end-to-end management of workflow schedules, extensive support for complex workflow dependencies, and seamless integration with modern infrastructure components like Kubernetes.
Common capabilities include
Declarative definition of workflows
Complex scheduling
Backfills, reruns, and ad-hoc runs
Integration with other tools in the data stack
Modular and extendible design
Plugins for popular cloud and infrastructure services
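The declarative-workflow idea can be sketched with the standard library: the pipeline is described as data (a DAG of task dependencies), and the engine derives a valid execution order. The task names are illustrative, not from any specific orchestrator:

```python
# Sketch of a declaratively defined workflow: tasks and their upstream
# dependencies are plain data, and the engine computes a dependency-safe
# execution order via topological sort. Task names are illustrative.
from graphlib import TopologicalSorter

# Declarative DAG: task -> set of upstream tasks it depends on.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "reverse_etl": {"quality_check"},
    "dashboard_refresh": {"quality_check"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # 'ingest' runs first; both leaf tasks run after 'quality_check'
```

Real orchestrators such as Airflow or Dagster build on this same idea, adding scheduling, retries, backfills, and infrastructure integrations on top of the declared DAG.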