15 Real-World Data Pipeline Issues and How to Solve Them Like a Pro
Building resilient and scalable ETL pipelines is a cornerstone of data engineering. However, even with solid architecture, failures are inevitable. Whether it is a failing API call, a crashing Spark job, or an upstream schema change, the key is to design for failure and monitor everything. Over my career as a Data Engineer, I have encountered dozens of these issues. Below, I share 15 commonly faced challenges with actionable solutions based on real-world experience.
1. How do you handle job failures in an ETL pipeline?
Scenario: An ETL job in Azure Data Factory that ingests daily POS data from an SFTP server starts failing intermittently due to connection timeouts.
Approach:
• Retry Logic: Implement retries with exponential backoff; many transient issues (e.g., network glitches) resolve on a subsequent attempt (see the sketch below).
• Alerting: Integrate Azure Monitor or PagerDuty to notify relevant teams on job failures.
• Failure Isolation: Design jobs to be modular and idempotent, allowing failed segments to rerun without corrupting data.
• Logging: Maintain logs in Blob Storage or a centralized logging system for root cause analysis.
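To make the retry idea concrete, here is a minimal Python sketch of a retry wrapper with exponential backoff and jitter. The task callable, exception types, and retry parameters are illustrative placeholders rather than the actual production code; in ADF itself, the activity-level retry count and interval settings give a similar effect without custom code.

```python
import random
import time

def with_retries(task, max_attempts=4, base_delay=2.0):
    """Run a flaky task with exponential backoff plus jitter.

    `task` is any zero-argument callable, e.g. a hypothetical
    fetch_pos_files() that pulls the daily POS extracts over SFTP.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise  # let the orchestrator mark the run failed and fire alerts
            sleep_for = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {sleep_for:.1f}s")
            time.sleep(sleep_for)
```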
2. What steps do you take when a data pipeline is running slower than expected?
Scenario: A PySpark batch job that processes 500M rows daily starts taking 4 hours instead of 45 minutes.
Root Causes & Fixes:
• Skewed Joins: Detected heavy skew on store_id and fixed it by salting the join key (see the sketch below).
• Shuffle Partition Tuning: Reduced spark.sql.shuffle.partitions from 2000 to 500.
• Broadcast Joins: Enabled broadcast for smaller reference data.
• Caching: Cached intermediate DataFrames to avoid recomputation.
• Cluster Scaling: Temporarily scaled up the Databricks cluster from 4 to 8 workers.
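Below is a minimal PySpark sketch of the salting fix for the skewed store_id join. The table names, the salt bucket count, and the assumption that dim_store carries store_id are all illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
SALT_BUCKETS = 16  # tune to the degree of skew observed

sales = spark.table("sales")        # large fact table, skewed on store_id (illustrative)
stores = spark.table("dim_store")   # small dimension table (illustrative)

# Add a random salt to the skewed side, and explode the small side across all
# salt values so every salted key still finds its match.
sales_salted = sales.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
stores_salted = stores.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
)

# Broadcasting the (still small) exploded dimension also avoids a shuffle on that side.
joined = sales_salted.join(F.broadcast(stores_salted), on=["store_id", "salt"]).drop("salt")
```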
3. How do you address data quality issues in a large dataset?
Scenario: A client’s ecommerce dataset has inconsistent country values like US, U.S.A, and United States.
Solution:
• Data Profiling: Used Great Expectations and pandas-profiling to detect anomalies.
• Harmonization Scripts: Applied mapping dictionaries to standardize values (see the sketch below).
• Validation Rules: Defined completeness (no nulls), uniqueness (order_id), and referential integrity checks.
• Quarantine Strategy: Invalid records were stored separately with metadata logs.
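As a sketch of the harmonization and quarantine steps, the snippet below standardizes country values via a broadcast mapping table and routes unmapped rows to a quarantine DataFrame. The ecommerce_orders table name and the mapping itself are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative mapping for harmonizing country spellings.
COUNTRY_MAP = {"US": "United States", "U.S.A": "United States",
               "USA": "United States", "United States": "United States"}
mapping_df = spark.createDataFrame(list(COUNTRY_MAP.items()),
                                   ["country_raw", "country_std"])

orders = spark.table("ecommerce_orders")  # hypothetical source table

cleaned = (orders.join(F.broadcast(mapping_df),
                       orders["country"] == mapping_df["country_raw"], "left")
                 .drop("country_raw"))

# Valid rows flow on; unmapped values are quarantined with metadata for review.
valid = cleaned.filter(F.col("country_std").isNotNull())
quarantine = (cleaned.filter(F.col("country_std").isNull())
                     .withColumn("quarantine_reason", F.lit("unmapped country"))
                     .withColumn("quarantined_at", F.current_timestamp()))
```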
4. What would you do if a scheduled job didn’t trigger as expected?
Scenario: A Databricks notebook scheduled via Azure Data Factory didn’t execute at 2 a.m.
Investigation Checklist:
• Trigger Logs: Checked ADF pipeline trigger history.
• Time Zone Mismatch: The trigger was defined in UTC while the schedule assumed IST; fixed the time zone in the trigger configuration.
• Monitoring: Set up Logic App alerts if the job didn’t start by 2:10 a.m.
• Manual Recovery: Added a parameterized notebook to allow manual reruns for backfills.
5. How do you troubleshoot memory-related issues in Spark jobs?
Scenario: A Spark job crashes with ExecutorLostFailure during a large aggregation.
Fixes Applied:
• Avoided Collects: Removed .collect() calls on large datasets.
• Memory Tuning: Increased spark.executor.memory and spark.memory.fraction (see the sketch below).
• Repartitioning: Used .repartition(200) to increase parallelism so individual shuffle partitions stayed small enough to fit in executor memory.
• Efficient Transformations: Rewrote wide transformations using mapPartitions and reduceByKey.
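A rough PySpark sketch of these fixes is shown below. The memory values, input path, and column names are illustrative, and on Databricks the executor memory would normally be set at the cluster level rather than in the session builder.

```python
from pyspark.sql import SparkSession

# Illustrative memory settings; the right values depend on node size and workload.
spark = (
    SparkSession.builder
    .appName("large-aggregation")
    .config("spark.executor.memory", "8g")
    .config("spark.memory.fraction", "0.8")
    .getOrCreate()
)

events = spark.read.parquet("/data/events")  # hypothetical input path

# Spread the data across more tasks before the wide aggregation so no single
# executor holds an oversized partition.
daily_totals = (
    events.repartition(200, "store_id")
          .groupBy("store_id", "event_date")
          .agg({"amount": "sum"})
)

# Write the result out instead of calling .collect() on a large DataFrame.
daily_totals.write.mode("overwrite").parquet("/data/daily_totals")
```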
6. What is your approach to handling schema changes in source systems?
Scenario: A marketing events table added a new column utm_medium, breaking the existing ingestion job.
Strategy:
• Schema Evolution: Enabled schema evolution in the Delta Lake ingestion pipelines (see the sketch below).
• Versioned Schema Registry: Tracked schema changes using a Git-based registry.
• Alerting: Triggered alerts on mismatch between expected and incoming schema.
• Backward Compatibility: Ensured missing fields defaulted to null if not present.
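Here is a minimal sketch of the schema-evolution setting on a Delta append. The landing and target paths are placeholders, and it assumes a Delta-enabled Spark session (e.g., Databricks).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

incoming = spark.read.json("/landing/marketing_events/")  # hypothetical landing path

# mergeSchema lets Delta add the new utm_medium column instead of failing the
# write; rows from older files simply carry null for the new field.
(incoming.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/delta/marketing_events"))
```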
7. How do you manage data partitioning in large-scale data processing?
Scenario: A 5 TB Delta Lake table with no partitioning was slowing down queries.
Action Plan:
• Partitioning: Repartitioned the table by region and date, the most commonly filtered columns (see the sketch below).
• Z-Ordering: Applied Z-Ordering on customer_id to speed up point queries.
• File Optimization: Used OPTIMIZE and VACUUM commands to manage small files and storage.
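A sketch of the repartitioning and maintenance commands is below. It assumes a Databricks/Delta runtime where OPTIMIZE ... ZORDER BY and VACUUM are available; the paths and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled environment

df = spark.read.format("delta").load("/delta/transactions_unpartitioned")

# Rewrite the table partitioned by the most commonly filtered columns.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("region", "date")
   .save("/delta/transactions"))

# Co-locate rows for point lookups, compact small files, then clean up old files.
spark.sql("OPTIMIZE delta.`/delta/transactions` ZORDER BY (customer_id)")
spark.sql("VACUUM delta.`/delta/transactions` RETAIN 168 HOURS")
```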
8. What do you do if data ingestion from a third-party API fails?
Scenario: Weather data API exceeded rate limits during campaign targeting.
Mitigation Strategy:
• Rate Limiting: Introduced backoff and jitter into the retry logic (see the sketch below).
• Fallback Data: Used cached data from the last successful response.
• Monitoring: Monitored API latency and failure rates using Prometheus + Grafana.
• Error Logging: Logged failed payloads for offline retries.
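The snippet below sketches the backoff-plus-jitter and cached-fallback behaviour for a rate-limited API. The endpoint, cache location, and retry parameters are placeholders.

```python
import json
import pathlib
import random
import time

import requests

CACHE = pathlib.Path("/tmp/weather_cache.json")   # hypothetical cache location
API_URL = "https://api.example.com/weather"       # placeholder endpoint

def fetch_weather(params, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        resp = requests.get(API_URL, params=params, timeout=10)
        if resp.status_code == 429:  # rate limited: back off with jitter and retry
            time.sleep(2 ** attempt + random.uniform(0, 1))
            continue
        resp.raise_for_status()
        CACHE.write_text(resp.text)  # remember the last good payload
        return resp.json()
    # Retries exhausted: fall back to the last successful response if we have one.
    if CACHE.exists():
        return json.loads(CACHE.read_text())
    raise RuntimeError("Weather API unavailable and no cached fallback")
```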
9. How do you resolve issues with data consistency between different data stores?
Scenario: Mismatched customer records between MySQL and Snowflake.
Approach:
• Row Count Checks: Compared record counts for key partitions.
• Checksum Validation: Added hash columns (MD5) to detect drift (see the sketch below).
• CDC Audit Logs: Verified that the Debezium-based CDC job captured all updates.
• Data Reconciliation Dashboard: Built in Power BI for business to validate deltas visually.
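As an illustration of the row-count and checksum checks, the sketch below fingerprints each row with an MD5 hash and uses exceptAll to surface drifting records. It assumes both extracts have already been staged with identical columns (in practice via the MySQL JDBC and Snowflake Spark connectors); the paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

mysql_df = spark.read.parquet("/staging/mysql/customers")      # hypothetical staging paths
snow_df = spark.read.parquet("/staging/snowflake/customers")

def with_row_hash(df):
    """Add an MD5 hash over all columns (sorted for a stable order)."""
    cols = sorted(df.columns)
    return df.withColumn(
        "row_md5",
        F.md5(F.concat_ws("||", *[F.col(c).cast("string") for c in cols])),
    )

mysql_h, snow_h = with_row_hash(mysql_df), with_row_hash(snow_df)

# 1. Row-count check.
print("mysql rows:", mysql_h.count(), "snowflake rows:", snow_h.count())

# 2. Rows in MySQL whose hash has no match in Snowflake indicate drift.
drift = (mysql_h.select("customer_id", "row_md5")
                .exceptAll(snow_h.select("customer_id", "row_md5")))
print("drifting rows:", drift.count())
```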
10. How do you handle out-of-memory errors in a Hadoop job?
Scenario: A Hadoop MapReduce job crashes during the reduce phase on 10 GB of product data.
Resolution:
• Increase Memory: Raised mapreduce.reduce.memory.mb to 3072 (the reducer container size) and the reducer JVM heap (mapreduce.reduce.java.opts) to match.
• Enable Combiners: Reduced data volume by 40% before reduce phase.
• Split Input Data: Adjusted mapreduce.input.fileinputformat.split.maxsize to improve parallelism.
• Moved to Spark: Rewrote job in Spark for better memory handling and performance.
11. What steps do you take when a data job exceeds its allocated time window?
Scenario: Campaign segmentation job exceeds its 2-hour SLA on peak traffic days.
Fixes:
• Incremental Loads: Switched from full reloads to delta-based ingestion (see the sketch below).
• Parallelism: Broke job into multiple Spark tasks per campaign type.
• Pre-Aggregation: Aggregated at campaign level beforehand to reduce downstream joins.
• Profiling: Identified expensive joins and filtered unnecessary columns early on.
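A minimal sketch of the incremental-load pattern is shown below. The etl_control table, column names, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical control table that stores the last successfully loaded timestamp.
last_watermark = (spark.table("etl_control")
                       .filter("job_name = 'campaign_segmentation'")
                       .agg(F.max("watermark_ts"))
                       .first()[0])  # None on the very first run; handle as a full load

source = spark.table("campaign_events")

# Pull only rows that arrived since the previous run instead of a full reload.
increment = source.filter(F.col("updated_at") > F.lit(last_watermark))

(increment.write
    .format("delta")
    .mode("append")
    .save("/delta/campaign_segments_staging"))
```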
12. How do you manage and monitor data pipeline dependencies?
Scenario: Daily pipeline fails because the upstream campaign job didn’t complete.
Solution:
• Dependency Graphs: Used Apache Airflow DAGs for clear visual tracking of dependencies (see the sketch below).
• State Tracking: Maintained metadata tables with job run status and completion timestamps.
• Soft Dependencies: Implemented waits with backoff in Databricks notebooks where Airflow wasn’t used.
• Alerts: Triggered alerts if upstream job didn’t finish within X minutes.
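Below is an illustrative Airflow DAG that waits on the upstream campaign job before running the downstream load. The DAG ids, schedule, and command are placeholders; if the two DAGs run on different schedules, an execution_delta would also be needed on the sensor.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="daily_reporting",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
) as dag:
    # Block until the upstream campaign DAG run for the same logical date finishes.
    wait_for_campaign = ExternalTaskSensor(
        task_id="wait_for_campaign_job",
        external_dag_id="campaign_pipeline",  # upstream DAG name (assumed)
        external_task_id=None,                # wait for the whole DAG run
        timeout=60 * 60,                      # fail (and alert) if not done within an hour
        mode="reschedule",
    )

    load_reports = BashOperator(
        task_id="load_reports",
        bash_command="python /opt/jobs/load_reports.py",  # placeholder command
    )

    wait_for_campaign >> load_reports
```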
13. What do you do if the output of a data transformation step is incorrect?
Scenario: total_revenue was lower than expected after a data transformation.
Root Cause Analysis:
• Step-by-Step Debug: Validated individual transformations with small sample data.
• Unit Tests: Created test cases for the transformation logic with mocked inputs (see the sketch below).
• Logging: Added intermediate logging to capture before/after states.
• Peer Review: Conducted a quick review to catch logic issues in filtering.
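As an example of the unit-test approach, here is a pytest sketch that checks a hypothetical revenue transformation against a tiny hand-built DataFrame.

```python
import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def compute_total_revenue(orders):
    """Transformation under test (illustrative): revenue should include
    completed AND shipped orders, net of refunds."""
    return (orders.filter(F.col("status").isin("completed", "shipped"))
                  .agg(F.sum(F.col("amount") - F.col("refund")).alias("total_revenue")))

def test_total_revenue_counts_shipped_orders(spark):
    orders = spark.createDataFrame(
        [("o1", "completed", 100.0, 0.0),
         ("o2", "shipped",    50.0, 10.0),
         ("o3", "cancelled",  75.0, 0.0)],
        ["order_id", "status", "amount", "refund"],
    )
    result = compute_total_revenue(orders).first()["total_revenue"]
    assert result == 140.0  # 100 + (50 - 10); cancelled order excluded
```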
14. How do you address issues with data duplication in a pipeline?
Scenario: Duplicate orders were appearing in the warehouse due to a reprocessed Kafka stream.
Solutions:
• Deduplication Logic: Used ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) to retain the latest record per order (see the sketch below).
• Idempotent Loads: Employed MERGE INTO in Delta Lake to upsert.
• Checkpointing: Enabled checkpointing in Spark Structured Streaming to avoid reprocessing.
• Hash Columns: Used a SHA-256 hash of the full record to detect duplicates efficiently.
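A sketch combining the window-based dedup with an idempotent MERGE INTO upsert is shown below. It assumes a Delta-enabled session and a warehouse.orders target table, both illustrative.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

orders = spark.read.format("delta").load("/delta/orders_staging")  # hypothetical path

# Keep only the latest version of each order_id from the (possibly replayed) stream.
latest = (orders
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())))
    .filter("rn = 1")
    .drop("rn"))

# Idempotent upsert into the warehouse table: reprocessing the same batch
# updates rows in place instead of inserting duplicates.
latest.createOrReplaceTempView("latest_orders")
spark.sql("""
    MERGE INTO warehouse.orders AS t
    USING latest_orders AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```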
15. How do you handle and log errors in a distributed data processing job?
Scenario: A PySpark job processing JSON logs failed silently due to corrupt records.
Best Practices:
• Fail-Safe Read: Used option("badRecordsPath", "/logs/corrupt") in Spark (see the sketch below).
• Granular Logging: Captured partition info and record metadata in logs.
• Structured Logging: Employed structured logging in JSON for easy queryability.
• Retry Strategy: Failed stages retried up to 3 times with backoff.
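To make the fail-safe read concrete, the sketch below shows the Databricks badRecordsPath option alongside an open-source alternative using PERMISSIVE mode with a corrupt-record column. The paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Databricks: unparsable rows are written under badRecordsPath (with the source
# file and exception) instead of silently disappearing or failing the whole job.
logs = (spark.read
        .option("badRecordsPath", "/logs/corrupt")
        .json("/landing/app_logs/"))  # hypothetical landing path

# Open-source Spark alternative: PERMISSIVE mode with a corrupt-record column,
# so bad rows can be filtered out and quarantined for offline retries.
raw = (spark.read
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .json("/landing/app_logs/")
       .cache())  # cache before filtering on the corrupt-record column

if "_corrupt_record" in raw.columns:
    (raw.filter("_corrupt_record IS NOT NULL")
        .write.mode("append").json("/logs/corrupt_oss"))
```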
Troubleshooting data pipeline failures is inevitable in real-world systems. The best engineers are not those who avoid failure, but those who build resilient, observable, and self-healing systems.