Modern Data Stack - Part 2

Fundamental Components of the Modern Data Platform

The basic components of a data platform are:

Data Collection and Tracking

Data Ingestion

Data Transformation

Data Storage (Data warehouse/lake)

BI Tools

Reverse ETL

Orchestration (Workflow engine)

Data Management, Quality, and Governance

Let's delve into the details of each of the above-mentioned components.

Data Collection and Tracking - This is the process of collecting behavioural data from client applications (mobile, web, and IoT devices) and transactional data from backend services.

The MDS tools in this area focus on reducing quality issues that arise due to badly designed, incorrectly implemented, missed, or delayed tracking of data.

Common capabilities include:

Interface for event schema design

Workflow for collaboration and peer review

Integration of event schema with the rest of the stack

Auto-generation of tracking SDKs from event schemas

Validation of events against schemas
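To make the schema-validation capability above concrete, here is a minimal sketch in Python using the open-source jsonschema package. The event name, fields, and schema are illustrative assumptions, not taken from any particular tracking tool.

```python
from jsonschema import validate, ValidationError

# Illustrative schema for a hypothetical "order_completed" event.
ORDER_COMPLETED_SCHEMA = {
    "type": "object",
    "properties": {
        "event": {"const": "order_completed"},
        "user_id": {"type": "string"},
        "order_value": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "INR"]},
    },
    "required": ["event", "user_id", "order_value", "currency"],
}

def is_valid_event(event: dict) -> bool:
    """Return True if the event conforms to the schema, False otherwise."""
    try:
        validate(instance=event, schema=ORDER_COMPLETED_SCHEMA)
        return True
    except ValidationError:
        return False

# A well-formed event passes; an event with missing fields is rejected
# before it can pollute downstream tables.
print(is_valid_event({"event": "order_completed", "user_id": "u1",
                      "order_value": 49.9, "currency": "USD"}))       # True
print(is_valid_event({"event": "order_completed", "user_id": "u1"}))  # False
```

In practice, the same schema can also drive auto-generated tracking SDKs, so event producers and validators stay in sync.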

 

Data Ingestion - Ingestion is the mechanism for extracting raw data from its source of truth and loading it into a central data warehouse/lake.

A modern data ecosystem has pipelines that bring raw data from hundreds of first- and third-party sources into the warehouse, and new ingestion pipelines constantly need to be built to meet growing business demands.

MDS data ingestion tools aim to improve productivity and ensure data quality.

Common capabilities include:

Configurable framework

Plug and play connectors for well-known data formats and sources

Plug and play integrations for popular storage destinations

Quality checks against ingested data

Monitoring and alerting of ingestion pipelines
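As a sketch of what an ingestion pipeline boils down to, the snippet below extracts rows from a stand-in source and loads them into SQLite, which is used here only as a placeholder for a cloud warehouse; the source records, table, and column names are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable

def extract() -> Iterable[Dict]:
    # Stand-in for a plug-and-play connector to an API, database, or file
    # source; a real connector would page through the source system.
    yield {"user_id": "u1", "signup_date": "2024-01-05"}
    yield {"user_id": "u2", "signup_date": "2024-01-06"}

def load(rows: Iterable[Dict], conn: sqlite3.Connection) -> int:
    # Loading into SQLite as a placeholder for the warehouse destination.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_users (user_id TEXT, signup_date TEXT)"
    )
    count = 0
    for row in rows:
        conn.execute(
            "INSERT INTO raw_users (user_id, signup_date) VALUES (?, ?)",
            (row["user_id"], row["signup_date"]),
        )
        count += 1
    conn.commit()
    return count  # row counts feed quality checks and pipeline alerting

conn = sqlite3.connect(":memory:")
print(f"Loaded {load(extract(), conn)} rows")
```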

 

Data Transformation - Transformation is the process of cleaning, normalizing, filtering, joining, modelling, and summarizing raw data to make it easier to understand and query. In ELT, transformation happens after the raw data has been loaded into the warehouse; in ETL, it happens before the data is loaded.

MDS data transformation tools focus on providing frameworks that enable consistent data model design, promoting code reuse and testability.

Common capabilities include:

Strong support for software engineering best practices like version control, testing, CI/CD, and code reusability

Support for common transformation patterns such as idempotency, snapshots, and incrementality

Self-documentation

Integration with other tools in the data stack
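The idempotency and incrementality patterns listed above can be shown with a small sketch (again using SQLite as a stand-in for the warehouse): the transform rebuilds a single date partition with delete-then-insert, so rerunning it for the same date, as a backfill would, gives the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount REAL);
    CREATE TABLE daily_revenue (order_date TEXT, revenue REAL);
    INSERT INTO raw_orders VALUES
        ('o1', '2024-01-05', 10.0),
        ('o2', '2024-01-05', 15.0),
        ('o3', '2024-01-06', 20.0);
""")

def build_daily_revenue(conn: sqlite3.Connection, ds: str) -> None:
    """Rebuild one date partition of the summary table.

    Delete-then-insert makes the transform idempotent: running it twice
    for the same date leaves exactly one set of rows for that date.
    """
    conn.execute("DELETE FROM daily_revenue WHERE order_date = ?", (ds,))
    conn.execute(
        """INSERT INTO daily_revenue
           SELECT order_date, SUM(amount) FROM raw_orders
           WHERE order_date = ? GROUP BY order_date""",
        (ds,),
    )
    conn.commit()

build_daily_revenue(conn, "2024-01-05")
build_daily_revenue(conn, "2024-01-05")  # safe re-run, e.g. during a backfill
print(conn.execute("SELECT * FROM daily_revenue").fetchall())
# [('2024-01-05', 25.0)]
```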

 

Data Storage (Data Warehouse/Lake) - The data warehouse/lake is at the heart of the modern data platform. It acts as the historical record of truth for all of the organization's behavioural and transactional data.

MDS data storage systems focus on providing serverless auto-scaling, lightning-fast performance, economies of scale, better data governance, and high developer productivity.

Common capabilities include:

Auto-scaling during heavy loads

Support for open data formats such as Parquet, ORC, and Avro

Strong security and access control

Data governance features such as managing personally identifiable information

Support for both batch and real-time data ingestion

Rich information schema
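As a small illustration of open-format support, the sketch below writes a date-partitioned Parquet dataset and reads back only the columns a query needs. It assumes pandas with the pyarrow engine installed; the table layout is made up for the example.

```python
import pandas as pd

# Hypothetical slice of behavioural event data.
events = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "event": ["page_view", "page_view", "order_completed"],
    "event_date": ["2024-01-05", "2024-01-05", "2024-01-06"],
})

# Write an open, columnar Parquet dataset partitioned by date
# (a common warehouse/lake layout).
events.to_parquet("events_dataset", partition_cols=["event_date"])

# Read it back, scanning only the columns needed for the analysis.
subset = pd.read_parquet("events_dataset", columns=["user_id", "event"])
print(subset)
```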

 

BI tools

BI tools are analytical, reporting, and dashboarding tools used by data consumers to understand data and support business decisions in an organization. MDS BI tools focus on enabling data democracy by making it easy for anyone in the organization to quickly analyze data and build feature-rich reports.

Common capabilities include:

Low or no code

Data visualizations for specific use cases such as geospatial data

Built-in metrics definition layer

Integration with other tools in the data stack

Embedded collaboration and documentation features
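A metrics definition layer is easiest to picture as a declarative metric that the tool compiles into SQL. The sketch below is a tool-agnostic, hypothetical illustration in Python, not the syntax of any specific BI product.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str          # business-facing metric name
    table: str         # source table in the warehouse
    expression: str    # SQL aggregation that defines the metric
    time_column: str   # column used for time-based grouping

    def to_sql(self, grain: str = "day") -> str:
        """Compile the metric definition into a simple GROUP BY query."""
        return (
            f"SELECT DATE_TRUNC('{grain}', {self.time_column}) AS period, "
            f"{self.expression} AS {self.name} "
            f"FROM {self.table} GROUP BY 1 ORDER BY 1"
        )

daily_revenue = Metric(
    name="revenue",
    table="analytics.orders",
    expression="SUM(amount)",
    time_column="order_date",
)
print(daily_revenue.to_sql(grain="day"))
```

Defining the metric once and compiling it on demand is what keeps numbers consistent across every report that uses it.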

 

Reverse ETL - Reverse ETL is the process of moving transformed data from the data warehouse to downstream systems such as operations, finance, marketing, CRM, and sales tools, and even back into the product itself, to facilitate operational decision making.

Reverse ETL tools are similar to MDS data ingestion tools except that the direction of data flow is reversed (from the data warehouse to downstream systems).

Common capabilities include:

Configurable framework

Plug and play connectors for well-known data formats and destinations

Plug and play integrations for popular data sources

Quality checks against egressed data

Monitoring and alerting of data pipelines
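The sketch below shows the shape of a reverse ETL sync: read modelled rows from the warehouse (SQLite as a stand-in) and hand each record to a destination connector. The send callable here just prints the payload; a real tool would call the downstream system's API and surface failures for alerting.

```python
import json
import sqlite3

# Warehouse stand-in holding one modelled table of customer attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_summary (email TEXT, lifetime_value REAL);
    INSERT INTO customer_summary VALUES
        ('a@example.com', 120.0),
        ('b@example.com', 80.5);
""")

def sync_to_crm(conn: sqlite3.Connection, send) -> int:
    """Push each warehouse row to a downstream destination via `send`."""
    synced = 0
    for email, ltv in conn.execute(
        "SELECT email, lifetime_value FROM customer_summary"
    ):
        send(json.dumps({"email": email, "lifetime_value": ltv}))
        synced += 1
    return synced  # synced-record counts feed monitoring and alerting

# Dry run: print payloads instead of calling a real CRM endpoint.
print(f"Synced {sync_to_crm(conn, send=print)} records")
```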

 

Orchestration (Workflow engine)

Orchestration systems are required to run data pipelines on schedule, request and relinquish infrastructure resources on demand, react to failures, and manage dependencies across data pipelines, all from a common interface.

MDS orchestration tools focus on providing end-to-end management of workflow schedules, extensive support for complex workflow dependencies, and seamless integration with modern infrastructure components like Kubernetes.

Common capabilities include:

Declarative definition of workflows

Complex scheduling

Backfills, reruns, and ad-hoc runs

Integration with other tools in the data stack

Modular and extendible design

Plugins for popular cloud and infrastructure services
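To illustrate declarative workflow definition and scheduling, here is a minimal DAG sketch assuming Apache Airflow 2.x as the orchestrator (the article does not prescribe a specific tool); the DAG, tasks, and schedule are made up for the example.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("extract and load raw data")

def transform():
    print("build summary tables")

with DAG(
    dag_id="daily_revenue_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # cron expressions cover more complex schedules
    catchup=True,                # past runs are created automatically (backfills)
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Declarative dependency: transform runs only after ingest succeeds.
    ingest_task >> transform_task
```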
