15 Real-World Data Pipeline Issues and How to Solve Them Like a Pro

Building resilient and scalable ETL pipelines is a cornerstone of data engineering. However, even with solid architecture, failures are inevitable. Whether it’s an API failure, a crashing Spark job, or an upstream schema change, the key is to design for failure and monitor everything. Over my career as a Data Engineer, I’ve encountered dozens of these issues. Below, I share 15 commonly faced challenges with actionable solutions based on real-world experience.


1. How do you handle job failures in an ETL pipeline?

Scenario: An ETL job in Azure Data Factory that ingests daily POS data from an SFTP server starts failing intermittently due to connection timeouts.

Approach:

• Retry Logic: Implement retry with exponential backoff. Many transient issues (e.g., network glitches) are resolved in subsequent retries.

• Alerting: Integrate Azure Monitor or PagerDuty to notify relevant teams on job failures.

• Failure Isolation: Design jobs to be modular and idempotent, allowing failed segments to rerun without corrupting data.

• Logging: Maintain logs in Blob Storage or a centralized logging system for root cause analysis.
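To make the retry logic concrete, here is a minimal Python sketch of retry with exponential backoff. The fetch_from_sftp call and its path are hypothetical stand-ins for the actual ADF/SFTP ingestion step.

```python
import logging
import random
import time

logger = logging.getLogger("ingestion")

def with_retries(func, max_attempts=5, base_delay=2.0):
    """Retry a flaky call with exponential backoff; re-raise after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                logger.error("All %d attempts failed: %s", max_attempts, exc)
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)

# Hypothetical usage: wrap the SFTP pull so transient timeouts are retried.
# files = with_retries(lambda: fetch_from_sftp("/daily/pos/"))
```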

2. What steps do you take when a data pipeline is running slower than expected?

Scenario: A PySpark batch job that processes 500M rows daily starts taking 4 hours instead of 45 minutes.

Root Causes & Fixes:

• Skewed Joins: Detected skew on store_id, fixed by using salting.

• Shuffle Partition Tuning: Reduced spark.sql.shuffle.partitions from 2000 to 500.

• Broadcast Joins: Enabled broadcast for smaller reference data.

• Caching: Cached intermediate DataFrames to avoid recomputation.

• Cluster Scaling: Temporarily scaled up the Databricks cluster from 4 to 8 workers.
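As an illustration, here is a PySpark sketch of two of those fixes, shuffle-partition tuning and broadcasting the small reference table. The table paths and column names (sales, stores, store_id) are assumed for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pos-batch").getOrCreate()

# Tune shuffle parallelism down from 2000 to 500 for this workload.
spark.conf.set("spark.sql.shuffle.partitions", "500")

sales = spark.read.format("delta").load("/data/sales")    # large fact table (assumed path)
stores = spark.read.format("delta").load("/data/stores")  # small dimension table (assumed path)

# Broadcast the small reference table so the join avoids shuffling the large side.
joined = sales.join(F.broadcast(stores), on="store_id", how="left")

# Cache the intermediate DataFrame because several aggregations reuse it.
joined.cache()
daily_revenue = joined.groupBy("store_id", "sale_date").agg(F.sum("amount").alias("revenue"))
```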

3. How do you address data quality issues in a large dataset?

Scenario: A client’s ecommerce dataset has inconsistent country values like US, U.S.A, and United States.

Solution:

• Data Profiling: Used Great Expectations and pandas-profiling to detect anomalies.

• Harmonization Scripts: Applied mapping dictionaries to standardize values.

• Validation Rules: Defined completeness (no nulls), uniqueness (order_id), and referential integrity checks.

• Quarantine Strategy: Invalid records were stored separately with metadata logs.
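A minimal PySpark sketch of the harmonization and quarantine steps, assuming a raw orders table and a small mapping dictionary; column names and paths are illustrative.

```python
from pyspark.sql import functions as F

orders = spark.read.format("delta").load("/delta/orders_raw")   # assumed source table

# Assumed mapping; in practice this lives in a maintained reference dataset.
country_map = {"US": "United States", "U.S.A": "United States", "USA": "United States"}
mapping_expr = F.create_map([F.lit(x) for pair in country_map.items() for x in pair])

# Standardize country values, falling back to the original when no mapping exists.
orders_std = orders.withColumn(
    "country_std", F.coalesce(mapping_expr[F.col("country")], F.col("country"))
)

# Quarantine invalid records with metadata instead of silently dropping them.
valid = orders_std.filter(F.col("order_id").isNotNull())
invalid = (
    orders_std.filter(F.col("order_id").isNull())
              .withColumn("rejected_at", F.current_timestamp())
)
invalid.write.format("delta").mode("append").save("/delta/quarantine/orders")
```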

4. What would you do if a scheduled job didn’t trigger as expected?

Scenario: A Databricks notebook scheduled via Azure Data Factory didn’t execute at 2 a.m.

Investigation Checklist:

• Trigger Logs: Checked ADF pipeline trigger history.

• Time Zone Mismatch: The trigger was configured in UTC while the schedule assumed IST; fixed in the trigger configuration.

• Monitoring: Set up Logic App alerts if the job didn’t start by 2:10 a.m.

• Manual Recovery: Added a parameterized notebook to allow manual reruns for backfills.

5. How do you troubleshoot memory-related issues in Spark jobs?

Scenario: Spark job crashes with ExecutorLostFailure during a large aggregation.

Fixes Applied:

• Avoided Collects: Removed .collect() calls on large datasets.

• Memory Tuning: Increased spark.executor.memory and spark.memory.fraction.

• Repartitioning: Increased parallelism using .repartition(200) to avoid large shuffles.

• Efficient Transformations: Rewrote wide transformations using mapPartitions and reduceByKey.
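For reference, a hedged sketch of what the memory and parallelism tuning might look like; the values, paths, and column names are assumptions, not universal settings.

```python
from pyspark.sql import SparkSession

# Illustrative memory settings; the right values depend on the cluster and workload.
spark = (
    SparkSession.builder
    .appName("large-aggregation")
    .config("spark.executor.memory", "8g")
    .config("spark.memory.fraction", "0.8")
    .getOrCreate()
)

events = spark.read.parquet("/data/events")   # assumed path

# Increase parallelism before the wide aggregation so no single task holds a huge shuffle block.
aggregated = (
    events.repartition(200, "customer_id")
          .groupBy("customer_id")
          .count()
)

# Avoid .collect() on large results; write them out or take a bounded sample instead.
aggregated.write.mode("overwrite").parquet("/data/customer_counts")
```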

6. What is your approach to handling schema changes in source systems?

Scenario: A marketing events table added a new column utm_medium, breaking the existing ingestion job.

Strategy:

• Schema Evolution: Enabled schema evolution in Delta Lake ingestion pipelines.

• Versioned Schema Registry: Tracked schema changes using a Git-based registry.

• Alerting: Triggered alerts on mismatch between expected and incoming schema.

• Backward Compatibility: Ensured missing fields defaulted to null if not present.
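A short sketch of schema-evolution-friendly ingestion into Delta Lake, assuming an existing SparkSession and illustrative paths and column names.

```python
# New columns such as utm_medium are merged into the target schema instead of breaking the job.
incoming = spark.read.json("/landing/marketing_events/")   # assumed landing path

(
    incoming.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # allow additive schema changes
    .save("/delta/marketing_events")
)

# Simple drift check: alert if expected columns go missing from the feed.
expected_cols = {"event_id", "utm_source", "utm_medium", "event_ts"}
missing = expected_cols - set(incoming.columns)
if missing:
    raise ValueError(f"Schema drift detected; missing columns: {missing}")
```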

7. How do you manage data partitioning in large-scale data processing?

Scenario: A 5 TB Delta Lake table with no partitioning was slowing down queries.

Action Plan:

• Partitioning Strategy: Partitioned the table by region and date, the most commonly filtered columns.

• Z-Ordering: Applied Z-Ordering on customer_id to speed up point queries.

• File Optimization: Used OPTIMIZE and VACUUM commands to manage small files and storage.
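Roughly, the re-layout could look like the following Databricks/Delta sketch; the paths, partition columns, and retention window are assumptions for illustration.

```python
df = spark.read.format("delta").load("/delta/transactions")   # assumed 5 TB table

# Rewrite the table partitioned by the most commonly filtered columns.
(
    df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("region", "date")
    .save("/delta/transactions_partitioned")
)

# Compact small files and co-locate rows by customer_id for faster point lookups,
# then clean up files no longer referenced by the table.
spark.sql("OPTIMIZE delta.`/delta/transactions_partitioned` ZORDER BY (customer_id)")
spark.sql("VACUUM delta.`/delta/transactions_partitioned` RETAIN 168 HOURS")
```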

8. What do you do if data ingestion from a third-party API fails?

Scenario: Weather data API exceeded rate limits during campaign targeting.

Mitigation Strategy:

• Rate Limiting: Introduced backoff and jitter in retry logic.

• Fallback Data: Used cached data from the last successful response.

• Monitoring: Monitored API latency and failure rates using Prometheus + Grafana.

• Error Logging: Logged failed payloads for offline retries.
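Here is a small Python sketch of backoff-with-jitter plus a cached fallback; the URL handling, cache location, and retry counts are hypothetical.

```python
import json
import pathlib
import random
import time

import requests

CACHE_FILE = pathlib.Path("/tmp/weather_cache.json")   # assumed fallback cache location

def fetch_weather(url, max_attempts=4):
    """Call the API with backoff + jitter; fall back to the last cached payload on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429:                        # rate limited by the provider
                raise RuntimeError("rate limited")
            resp.raise_for_status()
            CACHE_FILE.write_text(resp.text)                   # refresh the fallback cache
            return resp.json()
        except (requests.RequestException, RuntimeError):
            time.sleep((2 ** attempt) + random.uniform(0, 2))  # backoff with jitter
    if CACHE_FILE.exists():                                    # serve the last good response
        return json.loads(CACHE_FILE.read_text())
    raise RuntimeError("Weather API unavailable and no cached data present")
```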

9. How do you resolve issues with data consistency between different data stores?

Scenario: Mismatched customer records between MySQL and Snowflake.

Approach:

• Row Count Checks: Compared record counts for key partitions.

• Checksum Validation: Added hash columns (MD5) to detect drift.

• CDC Audit Logs: Verified that the Debezium-based CDC job captured all updates.

• Data Reconciliation Dashboard: Built in Power BI for business to validate deltas visually.
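A hedged PySpark sketch of the row-count and checksum comparison, assuming both sides have been staged to files and that the listed columns exist.

```python
from pyspark.sql import functions as F

def with_row_hash(df):
    # Row-level fingerprint over the business columns we expect to match.
    return df.withColumn(
        "row_hash",
        F.md5(F.concat_ws("||", F.col("customer_id"), F.col("email"), F.col("updated_at")))
    )

mysql_df = with_row_hash(spark.read.parquet("/staging/mysql/customers"))         # assumed extract
snowflake_df = with_row_hash(spark.read.parquet("/staging/snowflake/customers")) # assumed extract

print("MySQL rows:", mysql_df.count(), "Snowflake rows:", snowflake_df.count())

# Rows present in MySQL whose hash has no match in Snowflake indicate drift to reconcile.
drift = mysql_df.select("customer_id", "row_hash").subtract(
    snowflake_df.select("customer_id", "row_hash")
)
print("Drifted rows:", drift.count())
```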

10. How do you handle out-of-memory errors in a Hadoop job?

Scenario: A Hadoop MapReduce job crashes during the reduce phase on 10 GB of product data.

Resolution:

• Increase Heap Size: Set mapreduce.reduce.memory.mb to 3072 MB.

• Enable Combiners: Reduced data volume by 40% before the reduce phase.

• Split Input Data: Adjusted mapreduce.input.fileinputformat.split.maxsize to improve parallelism.

• Moved to Spark: Rewrote the job in Spark for better memory handling and performance.

11. What steps do you take when a data job exceeds its allocated time window?

Scenario: Campaign segmentation job exceeds its 2-hour SLA on peak traffic days.

Fixes:

• Incremental Loads: Switched from full reload to delta-based ingestion.

• Parallelism: Broke job into multiple Spark tasks per campaign type.

• Pre-Aggregation: Pre-aggregated at the campaign level to reduce downstream join sizes.

• Profiling: Identified expensive joins and filtered unnecessary columns early on.
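The incremental-load switch can be sketched like this, assuming a watermark table and illustrative column names; a real pipeline also needs to handle the first run when no watermark exists yet.

```python
from pyspark.sql import functions as F

# Read the last successful watermark for this job (assumed bookkeeping table).
last_ts = (
    spark.read.format("delta").load("/delta/job_watermarks")
    .filter(F.col("job_name") == "campaign_segmentation")
    .agg(F.max("last_loaded_ts"))
    .collect()[0][0]
)

# Pull only new events and drop unneeded columns early to shrink downstream joins.
new_events = (
    spark.read.format("delta").load("/delta/campaign_events")
    .filter(F.col("event_ts") > F.lit(last_ts))
    .select("campaign_id", "customer_id", "event_ts")
)

new_events.write.format("delta").mode("append").save("/delta/campaign_segments_input")
```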

12. How do you manage and monitor data pipeline dependencies?

Scenario: Daily pipeline fails because the upstream campaign job didn’t complete.

Solution:

• Dependency Graphs: Used Apache Airflow DAGs for clear visual tracking.

• State Tracking: Maintained metadata tables with job run status and completion timestamps.

• Soft Dependencies: Implemented waits with backoff in Databricks notebooks where Airflow wasn’t used.

• Alerts: Triggered alerts if upstream job didn’t finish within X minutes.
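For the Airflow route, a minimal DAG sketch with an upstream sensor might look like this; the DAG ids, task ids, and timings are hypothetical, and it assumes both DAGs share the same schedule (otherwise ExternalTaskSensor needs an execution_delta).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="daily_reporting",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
) as dag:
    # Wait for the upstream campaign job; fail (and alert) if it isn't done within an hour.
    wait_for_campaign = ExternalTaskSensor(
        task_id="wait_for_campaign_job",
        external_dag_id="campaign_pipeline",
        external_task_id="load_campaigns",
        timeout=60 * 60,
        mode="reschedule",
    )

    run_reporting = EmptyOperator(task_id="run_reporting")

    wait_for_campaign >> run_reporting
```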

13. What do you do if the output of a data transformation step is incorrect?

Scenario: total_revenue was lower than expected after a data transformation.

Root Cause Analysis:

• Step-by-Step Debug: Validated individual transformations with small sample data.

• Unit Tests: Created test cases for logic with mocked inputs.

• Logging: Added intermediate logging to capture before/after states.

• Peer Review: Conducted a quick review to catch logic issues in filtering.
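A small pytest-style sketch of how such a transformation can be unit-tested with mocked inputs; the total_revenue function and its filtering rule are hypothetical examples, not the actual pipeline logic.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def total_revenue(orders_df):
    """Example transformation under test: revenue should exclude cancelled orders."""
    return (
        orders_df.filter(F.col("status") != "cancelled")
                 .agg(F.sum("amount").alias("total_revenue"))
    )

def test_total_revenue_excludes_cancelled_orders():
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    data = [("o1", "completed", 100.0), ("o2", "cancelled", 50.0), ("o3", "completed", 25.0)]
    orders = spark.createDataFrame(data, ["order_id", "status", "amount"])

    result = total_revenue(orders).collect()[0]["total_revenue"]
    assert result == 125.0
```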

14. How do you address issues with data duplication in a pipeline?

Scenario: Duplicate orders were appearing in the warehouse due to a reprocessed Kafka stream.

Solutions:

• Deduplication Logic: Used ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) to retain the latest record per order.

• Idempotent Loads: Employed MERGE INTO in Delta Lake to upsert.

• Checkpointing: Enabled checkpointing in Spark Structured Streaming to avoid reprocessing.

• Hash Columns: Used a SHA-256 hash of the full record to detect duplicates efficiently.
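To tie the first two points together, here is a hedged PySpark/Delta sketch of window-based deduplication followed by an idempotent MERGE; the incoming_orders DataFrame and the target path are assumed.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the latest version of each order from the reprocessed stream.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (
    incoming_orders                       # assumed DataFrame built from the Kafka batch
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Idempotent upsert into the warehouse table so replays don't create duplicates.
target = DeltaTable.forPath(spark, "/delta/orders")   # assumed target table path
(
    target.alias("t")
    .merge(latest.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```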

15. How do you handle and log errors in a distributed data processing job?

Scenario: A PySpark job processing JSON logs failed silently due to corrupt records.

Best Practices:

• Fail-Safe Read: Used option("badRecordsPath", "/logs/corrupt") in Spark.

• Granular Logging: Captured partition info and record metadata in logs.

• Structured Logging: Employed structured logging in JSON for easy queryability.

• Retry Strategy: Failed stages retried up to 3 times with backoff.
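A brief sketch of the fail-safe read plus structured logging, assuming an existing SparkSession; note that badRecordsPath is a Databricks-specific option, while open-source Spark offers PERMISSIVE mode with a corrupt-record column as the closest equivalent.

```python
import json
import logging

logger = logging.getLogger("json-ingest")

# Route corrupt records to a quarantine path instead of failing silently.
logs = (
    spark.read
    .option("badRecordsPath", "/logs/corrupt")
    .json("/landing/app_logs/")            # assumed source path
)

# Structured (JSON) log line so downstream tooling can query job metadata easily.
logger.info(json.dumps({
    "job": "json-ingest",
    "rows_read": logs.count(),
    "bad_records_path": "/logs/corrupt",
}))
```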

Troubleshooting data pipeline failures is inevitable in real-world systems. The best engineers aren’t those who avoid failure, but those who build resilient, observable, and self-healing systems.

