Data Ingestion Framework
What is a Data Ingestion Framework?
A data ingestion framework is the collection of processes and technologies used to extract and load data from source systems into a target repository, including data repositories, data integration software, and data processing tools.
Data ingestion frameworks are generally divided between batch and real-time architectures. It is also helpful to think of the intent of the end-user application: whether you will use the data pipeline to make analytical decisions for the business or as part of a data-driven product.
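To make the distinction concrete, here is a minimal, illustrative sketch in Python: batch ingestion collects whatever has accumulated and loads it in one scheduled run, while real-time ingestion handles each event as it arrives. The in-memory event list and function names are hypothetical stand-ins, not part of any particular tool.

```python
# Toy sketch contrasting batch and real-time (streaming) ingestion.
# The in-memory SOURCE_EVENTS list stands in for a real database extract
# or message stream; function names are hypothetical.

from typing import Iterable, Iterator

SOURCE_EVENTS = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 40.0},
]

def batch_ingest(events: Iterable[dict]) -> list[dict]:
    """Batch: collect the full snapshot, then load it in one scheduled run."""
    extracted = list(events)        # extract everything available
    return extracted                # the load step would write to a landing zone

def stream_ingest(events: Iterable[dict]) -> Iterator[dict]:
    """Real-time: handle each event as it arrives, one at a time."""
    for event in events:            # in practice, a Kafka/Kinesis consumer loop
        yield event                 # forward/load immediately

if __name__ == "__main__":
    print("batch load:", batch_ingest(SOURCE_EVENTS))
    for event in stream_ingest(SOURCE_EVENTS):
        print("streamed event:", event)
```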
How Frameworks Put Your Data Ingestion Strategy to Work
In software development, a framework is a conceptual platform for application development. Frameworks provide a foundation for programming, along with the tools, functions, generic structures, and classes that help streamline the application development process. In the same way, a data ingestion framework simplifies the process of integrating and collecting data from different data sources and data types.
The data ingestion framework you choose will depend on your data processing requirements and its purpose. You can hand-code a customized framework to fulfill the specific needs of your organization, or you can use a data ingestion tool. Since your data ingestion strategy informs your framework, some factors to consider are the complexity of the data, whether or not the process can be automated, how quickly the data is needed for analysis, the regulatory and compliance requirements involved, and the quality parameters. Once you’ve determined your data ingestion strategy, you can move on to the data ingestion process flow.
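One way to make these strategy decisions actionable is to capture them as configuration that the framework reads at run time. The sketch below is purely illustrative; the field names are hypothetical rather than part of any standard tool.

```python
# Illustrative sketch: capturing data ingestion strategy decisions as
# configuration the framework can act on. Field names are hypothetical.

from dataclasses import dataclass

@dataclass
class IngestionStrategy:
    source_name: str
    complexity: str            # e.g., "flat files" vs. "nested semi-structured"
    automated: bool            # can the pipeline run without manual steps?
    latency: str               # "batch" or "real-time", driven by analysis needs
    compliance: list[str]      # e.g., ["GDPR", "HIPAA"]
    quality_checks: list[str]  # quality parameters to enforce on load

strategy = IngestionStrategy(
    source_name="crm_events",
    complexity="nested JSON",
    automated=True,
    latency="batch",
    compliance=["GDPR"],
    quality_checks=["no null primary keys", "row counts match source"],
)
```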
Components of a Data Ingestion Process Flow
All data originates from specific source systems and, depending on the type of source, will subsequently be routed through different steps in the data ingestion process. Source systems broadly consist of OLTP databases, cloud and on-premises applications, messages from Customer Data Platforms, logs, webhooks from third-party APIs, files, and object storage.
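As a small illustration of how extraction differs by source type, the sketch below pulls rows from a transactional database (SQLite stands in for a production OLTP system) and reads webhook-style events from a file standing in for object storage. The table, column, and file contents are hypothetical.

```python
# Sketch of extracting from two different source types. SQLite stands in
# for a production OLTP database; a local newline-delimited JSON file
# stands in for webhook payloads landed in object storage.

import json
import sqlite3
from pathlib import Path

def extract_from_oltp(db_path: str) -> list[tuple]:
    """Pull rows from a transactional database table (hypothetical schema)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute("SELECT id, email FROM customers").fetchall()

def extract_from_object_storage(path: Path) -> list[dict]:
    """Read webhook-style events stored as newline-delimited JSON."""
    return [json.loads(line) for line in path.read_text().splitlines() if line]
```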
Data from these source systems needs to be orchestrated through a series of workflows, or streamed across the data infrastructure stack, to reach its target destination, because data pipelines still contain a lot of custom scripts and logic that don’t fit neatly into a standard ETL workflow. Workflow orchestration tools like Airflow perform this function by scheduling jobs across different nodes, defined as a series of Directed Acyclic Graphs (DAGs).
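Below is a hedged sketch of what such a DAG might look like in Airflow; exact parameters vary by Airflow version (for example, `schedule` versus `schedule_interval`), and the task bodies are placeholders.

```python
# Hedged sketch of a daily ingestion DAG in Apache Airflow. Exact parameters
# vary by Airflow version, and the task bodies here are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Pull data from the source system (placeholder)."""

def load():
    """Write the extracted data to the landing zone (placeholder)."""

with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",      # "schedule_interval" on older Airflow versions
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task   # load runs only after extract succeeds
```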
Metadata management also enters the process early so that data scientists can perform data discovery downstream and address concerns such as data quality rule definitions, data lineage, and access control groups.
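The sketch below illustrates the kind of metadata that might be captured at ingestion time so downstream users can discover the data, trace its lineage, and know who may access it. The field names are hypothetical and not tied to any particular catalog tool.

```python
# Illustrative sketch of metadata captured at ingestion time. Field names
# are hypothetical, not tied to a specific catalog tool.

from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str
    owner: str
    source_system: str                              # lineage: where it came from
    quality_rules: list[str] = field(default_factory=list)
    access_groups: list[str] = field(default_factory=list)

orders_metadata = DatasetMetadata(
    name="raw_orders",
    owner="data-platform",
    source_system="orders_oltp",
    quality_rules=["order_id is not null", "amount >= 0"],
    access_groups=["analytics", "finance"],
)
```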
Once the necessary transformations are complete, this data’s “landing” zone can be a data lake built on an open table format like Apache Iceberg, Apache Hudi, or Delta Lake, or a cloud data warehouse like Snowflake, Google BigQuery, or Amazon Redshift. Data quality testing tools are often used here to check for issues like null values and renamed columns, or to verify that certain acceptance criteria are met. Depending on the use case, cleaned data may also be orchestrated from the data lake into a data warehouse.
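A minimal sketch of such checks, using pandas for illustration (the column names and acceptance criterion are hypothetical):

```python
# Minimal sketch of post-landing data quality checks: null values, missing or
# renamed columns, and a simple acceptance criterion. Column names are
# hypothetical.

import pandas as pd

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}

def check_landed_data(df: pd.DataFrame) -> list[str]:
    """Return human-readable failures (an empty list means all checks pass)."""
    failures = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing or renamed columns: {sorted(missing)}")
    if "order_id" in df.columns and df["order_id"].isnull().any():
        failures.append("null values found in order_id")
    if len(df) == 0:
        failures.append("acceptance criterion failed: no rows landed")
    return failures

# Example: a frame with a null key and a renamed column triggers two failures.
landed = pd.DataFrame({"order_id": [1, None], "cust_id": [10, 11], "amount": [5.0, 7.5]})
print(check_landed_data(landed))
```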
This is the point where, based on the specific use case (analytical decisions or operational data feeding into an application), data can be sent onward: to a data science platform such as Databricks or Domino Data Lab for machine learning workloads, to an ad-hoc query engine like Presto or Dremio, or to a real-time analytics engine like Imply, ClickHouse, or Rockset. Then, as the last step, analytics data is sent to dashboards like Looker or Tableau, while operational data is sent to custom apps or application frameworks like Streamlit.
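As a small, assumed example of the operational path, the snippet below feeds a DataFrame (standing in for data served from the warehouse or lake) into a Streamlit app; run it with `streamlit run app.py`.

```python
# Small, assumed sketch of operational data feeding a Streamlit app.
# The DataFrame stands in for data served from the warehouse or lake.

import pandas as pd
import streamlit as st

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [25.0, 40.0, 12.5]})

st.title("Orders ingested today")                         # page title
st.metric("Total revenue", f"${orders['amount'].sum():.2f}")
st.dataframe(orders)                                      # interactive table
```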
Techniques Used to Ingest Data
Data ingestion involves different techniques and programming languages used to code data ingestion engines. For starters, extract/transform/load (ETL) and extract/load/transform (ELT) are two integration methods that are quite similar. Each method moves data from a source to a data warehouse. The key difference is where the data is transformed and how much of it is retained in the warehouse.
ETL is a traditional integration approach that involves transforming data for use before it’s loaded into the warehouse. Information is pulled from remote sources, converted into the necessary styles and formats, and then loaded into its destination. With the ELT method, by contrast, data is extracted from one or multiple remote sources and loaded directly into its destination without any formatting; data transformation takes place within the target database.
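The sketch below contrasts the two approaches on the same toy dataset, with SQLite standing in for the target warehouse and a trivial email-normalization step as the transform. Table names and data are hypothetical, and each function represents one alternative path rather than steps to run together.

```python
# Sketch contrasting ETL and ELT on the same toy dataset. SQLite stands in
# for the target warehouse; normalizing email case is the "transform".

import sqlite3

RAW_ROWS = [("1", "Alice@Example.com"), ("2", "Bob@Example.com")]

def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in the pipeline first, then load only the shaped data."""
    transformed = [(int(i), email.lower()) for i, email in RAW_ROWS]
    conn.execute("CREATE TABLE customers (id INT, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", transformed)

def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw data as-is, then transform inside the target database."""
    conn.execute("CREATE TABLE raw_customers (id TEXT, email TEXT)")
    conn.executemany("INSERT INTO raw_customers VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE customers AS "
        "SELECT CAST(id AS INT) AS id, LOWER(email) AS email FROM raw_customers"
    )

etl(sqlite3.connect(":memory:"))   # each path gets its own in-memory "warehouse"
elt(sqlite3.connect(":memory:"))
```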
Several programming languages are used to code data ingestion engines when manipulating and analyzing big data. Some of the most popular languages are: