Modern Data Stack- Part 2
Fundamental Components of the Modern Data Platform
The basic components of a data platform are:
Data Collection and Tracking
Data Ingestion
Data Transformation
Data Storage (Data warehouse/lake)
BI Tools
Reverse ETL
Orchestration (Workflow engine)
Data Management, Quality, and Governance
Let's delve into the details of each of the components listed above.
Data Collection and Tracking- This includes the process of collecting behavioural data from client applications (mobile, web, IoT devices) and transactional data from backend services.
The MDS tools in this area focus on reducing quality issues that arise due to badly designed, incorrectly implemented, missed, or delayed tracking of data.
Common capabilities include
Interface for event schema design
Workflow for collaboration and peer review
Integration of event schema with the rest of the stack
Auto-generation of tracking SDKs from event schemas
Validation of events against schemas
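The last capability, validating events against their schemas, can be sketched in a few lines of Python. The schema format, event names, and fields below are illustrative assumptions, not the API of any particular tracking tool:

```python
# Minimal sketch of validating tracked events against an event schema.
# The schema format and the 'page_view' event are illustrative assumptions.

def validate_event(event: dict, schema: dict) -> list[str]:
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for field, expected_type in schema["required"].items():
        if field not in event:
            errors.append(f"missing required field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"field {field!r} should be {expected_type.__name__}")
    return errors

# Hypothetical schema for a 'page_view' event
page_view_schema = {"required": {"user_id": str, "url": str, "timestamp": float}}

good = {"user_id": "u1", "url": "/home", "timestamp": 1.0}
bad = {"user_id": 42, "url": "/home"}

print(validate_event(good, page_view_schema))  # []
print(validate_event(bad, page_view_schema))   # two errors
```

In practice, MDS tracking tools generate this kind of validation code (and typed SDKs) automatically from the centrally designed schemas, so client teams cannot drift from the agreed contract.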
Data Ingestion- Ingestion is the mechanism for extracting and loading raw data from its source of truth to a central data warehouse/lake.
A modern data ecosystem has pipelines bringing raw data from hundreds of first- and third-party sources into the warehouse. New ingestion pipelines must constantly be built to meet growing business demands.
MDS data ingestion tools aim to improve productivity and ensure data quality.
Common capabilities include
Configurable framework
Plug and play connectors for well-known data formats and sources
Plug and play integrations for popular storage destinations
Quality checks against ingested data
Monitoring and alerting of ingestion pipelines
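The configurable-framework and plug-and-play-connector pattern can be sketched as a small connector registry. The connector name, registry API, and quality rule here are illustrative assumptions, not the interface of any specific ingestion tool:

```python
# Sketch of a configurable ingestion framework with pluggable source
# connectors. Connector names and the registry API are illustrative.
from typing import Callable, Iterator

CONNECTORS: dict[str, Callable[[dict], Iterator[dict]]] = {}

def register(name: str):
    """Decorator that registers a source connector under a config key."""
    def wrap(fn):
        CONNECTORS[name] = fn
        return fn
    return wrap

@register("csv_stub")
def read_csv_stub(config: dict) -> Iterator[dict]:
    # A real connector would read from config["path"]; stubbed rows here.
    yield {"id": 1, "amount": 10}
    yield {"id": 2, "amount": -5}

def ingest(config: dict) -> list[dict]:
    """Run the configured connector and apply a basic quality check."""
    rows = list(CONNECTORS[config["source"]](config))
    bad = [r for r in rows if r["amount"] < 0]
    if bad:
        print(f"quality warning: {len(bad)} rows with negative amount")
    return rows

rows = ingest({"source": "csv_stub"})
print(len(rows))  # 2
```

Adding a new source then means writing one connector function and a config entry, which is how these tools keep pace with growing ingestion demands.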
Data Transformation- Transformation is the process of cleaning, normalizing, filtering, joining, modelling, and summarizing raw data to make it easier to understand and query. In the ELT architecture, transformation happens after raw data is loaded into the warehouse; in ETL, it happens before the data is loaded.
MDS data transformation tools focus on providing frameworks that enable consistent data model design, promoting code reuse and testability.
Common capabilities include
Strong support for software engineering best practices like version control, testing, CI/CD, and code reusability
Support for common transformation patterns such as idempotency, snapshots, and incrementality
Self-documentation
Integration with other tools in the data stack
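Two of the patterns above, incrementality and idempotency, can be sketched together: only rows newer than a stored watermark are transformed, and re-running with the same watermark changes nothing. Column names and the watermark mechanism are illustrative assumptions:

```python
# Sketch of an idempotent, incremental transformation: only rows with a
# timestamp past the last watermark are processed, so reruns are no-ops.
# The 'ts'/'amount' columns and watermark scheme are illustrative.

def transform_incremental(raw_rows: list[dict], watermark: int) -> tuple[list[dict], int]:
    """Clean rows with ts > watermark; return (rows, new watermark)."""
    new = [r for r in raw_rows if r["ts"] > watermark]
    cleaned = [{"ts": r["ts"], "amount": round(r["amount"], 2)} for r in new]
    new_watermark = max((r["ts"] for r in cleaned), default=watermark)
    return cleaned, new_watermark

raw = [{"ts": 1, "amount": 9.999}, {"ts": 2, "amount": 5.0}]

out, wm = transform_incremental(raw, watermark=1)
print(out, wm)   # only the ts=2 row is processed; watermark advances to 2

out2, _ = transform_incremental(raw, watermark=wm)
print(out2)      # [] -- rerunning with the same watermark is a no-op
```

Frameworks like dbt package this kind of pattern (plus snapshots, tests, and documentation) so that every model in the project follows it consistently.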
Data Storage (Data Warehouse/lake)- The data warehouse/lake is at the heart of the modern data platform. It acts as the historical record of truth for all behavioural and transactional data of the organization.
MDS data storage systems focus on providing serverless auto-scaling, lightning-fast performance, economies of scale, better data governance, and high developer productivity.
Common capabilities include
Auto-scaling during heavy loads
Support for open data formats such as Parquet, ORC, and Avro
Strong security and access control
Data governance features such as managing personally identifiable information
Support for both batch and real-time data ingestion
Rich information schema
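A "rich information schema" means the warehouse exposes its own catalog as queryable metadata. As a stand-in for a cloud warehouse's INFORMATION_SCHEMA views, the same idea can be shown with SQLite's catalog from the Python standard library; the table and columns are illustrative:

```python
# Sketch of inspecting a warehouse's information schema, using SQLite's
# built-in catalog as a stand-in for a warehouse's INFORMATION_SCHEMA views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, url TEXT, ts REAL)")

# List tables from the catalog, then list columns for one table.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [r[1] for r in conn.execute("PRAGMA table_info(events)")]

print(tables)   # ['events']
print(columns)  # ['user_id', 'url', 'ts']
```

Governance and lineage tooling builds on exactly this kind of metadata, e.g. to locate columns holding personally identifiable information.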
BI tools
BI tools are analytical, reporting, and dashboarding tools used by data consumers to understand data and support business decisions in an organization. MDS BI tools focus on enabling data democracy by making it easy for anyone in the organization to quickly analyze data and build feature-rich reports.
Common capabilities include
Low or no code
Data visualizations for specific use cases such as geospatial data
Built-in metrics definition layer
Integration with other tools in the data stack
Embedded collaboration and documentation features
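The built-in metrics definition layer is worth a sketch: metrics are declared once, as data, so every report and dashboard computes them identically. The metric names and definition format below are illustrative assumptions, not any BI tool's actual syntax:

```python
# Sketch of a metrics definition layer: each metric is declared once and
# reused by every report. Metric names and the format are illustrative.

METRICS = {
    "revenue": lambda rows: sum(r["amount"] for r in rows),
    "order_count": lambda rows: len(rows),
}

def compute(metric: str, rows: list[dict]) -> float:
    """Compute a named metric over a set of rows."""
    return METRICS[metric](rows)

orders = [{"amount": 10.0}, {"amount": 5.5}]
print(compute("revenue", orders))      # 15.5
print(compute("order_count", orders))  # 2
```

Centralizing definitions this way prevents the classic failure mode where two dashboards disagree on what "revenue" means.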
Reverse ETL- Reverse ETL is the process of moving transformed data from the data warehouse to downstream systems like operations, finance, marketing, CRM, sales, and even back into the product, to facilitate operational decision making.
Reverse ETL tools are similar to MDS data ingestion tools except that the direction of data flow is reversed (from the data warehouse to downstream systems).
Common capabilities include
Configurable framework
Plug and play connectors for well-known data formats and destinations
Plug and play integrations for popular data sources
Quality checks against egressed data
Monitoring and alerting of data pipelines
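A reverse-ETL sync with a quality check on egressed data can be sketched as follows. The destination here is a stubbed in-memory CRM keyed by email; a real tool would call the CRM's API, and the field names are illustrative assumptions:

```python
# Sketch of a reverse-ETL sync: rows read from the warehouse are quality-
# checked, then upserted into a downstream system. The in-memory 'crm'
# dict stands in for a real CRM API; field names are illustrative.

def sync_to_crm(warehouse_rows: list[dict], crm: dict) -> int:
    """Upsert valid rows into the CRM keyed by email; return rows synced."""
    synced = 0
    for row in warehouse_rows:
        if not row.get("email"):           # quality check on egressed data
            print(f"skipping row without email: {row}")
            continue
        crm[row["email"]] = row            # idempotent upsert by key
        synced += 1
    return synced

crm_store: dict = {}
rows = [{"email": "a@x.com", "ltv": 120}, {"ltv": 5}]
print(sync_to_crm(rows, crm_store))  # 1 (the keyless row is skipped)
```

Upserting by a stable key makes the sync safe to re-run, which matters because these pipelines are typically scheduled by the orchestrator described next.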
Orchestration (Workflow engine)
Orchestration systems are required to run data pipelines on schedule, request and relinquish infrastructure resources on demand, react to failures, and manage dependencies across data pipelines from a common interface.
MDS orchestration tools focus on providing end-to-end management of workflow schedules, extensive support for complex workflow dependencies, and seamless integration with modern infrastructure components like Kubernetes.
Common capabilities include
Declarative definition of workflows
Complex scheduling
Backfills, reruns, and ad-hoc runs
Integration with other tools in the data stack
Modular and extendible design
Plugins for popular cloud and infrastructure services
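The declarative-workflow idea can be sketched with the standard library: the pipeline is described as data (a DAG of task dependencies), and the engine derives a valid execution order. The task names are illustrative, not from any specific orchestrator:

```python
# Sketch of a declaratively defined workflow: tasks and their upstream
# dependencies are plain data, and the engine computes a dependency-safe
# execution order via topological sort. Task names are illustrative.
from graphlib import TopologicalSorter

# Declarative DAG: task -> set of upstream tasks it depends on.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "reverse_etl": {"quality_check"},
    "dashboard_refresh": {"quality_check"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # 'ingest' runs first; both leaf tasks run after 'quality_check'
```

Real orchestrators such as Airflow or Dagster build on this same idea, adding scheduling, retries, backfills, and infrastructure integrations on top of the declared DAG.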