From Dirty Data to Reliable Inputs: Cleaning and Validation in PD Modelling
Introduction
This is the fifth post in my series on building a Probability of Default (PD) model from a business data analyst’s perspective.
In the last post, “From Raw Data to Early Insights: Preliminary EDA in PD Modelling,” I shared how early visual and statistical exploration helps us uncover initial trends, anomalies, and questions.
Now, we move to Step 3 in our roadmap: Data Cleaning and Validation — the stage where raw data is transformed into trustworthy inputs. Before any feature engineering or model building, it’s critical to address missing values, detect outliers, and correct errors that could mislead analysis or degrade model performance.
Handling Missing Values
To reduce dimensionality and prevent bias from excessive imputation, I first screened features against a 30% missingness threshold. This aligns with common practice, where variables exceeding 30–40% missingness are often considered unreliable unless they hold critical value; dropping them avoids distortion from heavy imputation and improves model generalisability and training efficiency (Little & Rubin, 2002; Hastie, Tibshirani, & Friedman, 2009).
In this dataset, no feature exceeded the 30% threshold, so none were removed. I then addressed the remaining missing values using tailored, feature-level strategies.
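As a sketch, the threshold screen can be implemented in a few lines of pandas (the column names below are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

def drop_high_missing(df: pd.DataFrame, threshold: float = 0.30) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()              # per-column fraction of NaNs
    to_drop = missing_share[missing_share > threshold].index
    return df.drop(columns=to_drop)

# Toy data with hypothetical columns
df = pd.DataFrame({
    "income": [50_000, None, 62_000, 48_000],        # 25% missing -> kept
    "loan_amount": [10_000, 12_000, 9_500, 11_000],  # 0% missing  -> kept
    "months_since_delinq": [None, None, 4, None],    # 75% missing -> dropped
})
clean = drop_high_missing(df)
print(list(clean.columns))  # ['income', 'loan_amount']
```

Keeping the threshold as a parameter makes the rule easy to tighten or relax when re-running the pipeline.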
In total:
This strategy was applied modularly using a configurable pipeline, ensuring transparency, minimal bias, and strong downstream reliability (Little & Rubin, 2002).
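The post doesn't enumerate the per-feature strategies, but a configurable pipeline of this kind can be sketched as a strategy map; the column names and imputation choices below are hypothetical:

```python
import pandas as pd

# Hypothetical per-feature strategies; real choices come from domain review
IMPUTE_CONFIG = {
    "income": "median",           # robust to skewed distributions
    "employment_length": "mode",  # most frequent category
    "revolving_util": "zero",     # assume missing means "no utilisation"
}

def impute(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Fill missing values column by column, following the configured strategy."""
    df = df.copy()
    for col, strategy in config.items():
        if col not in df.columns:
            continue
        if strategy == "median":
            df[col] = df[col].fillna(df[col].median())
        elif strategy == "mode":
            df[col] = df[col].fillna(df[col].mode().iloc[0])
        elif strategy == "zero":
            df[col] = df[col].fillna(0)
    return df

df = pd.DataFrame({
    "income": [40_000, None, 60_000],
    "employment_length": ["5y", "5y", None],
    "revolving_util": [0.3, None, 0.5],
})
out = impute(df, IMPUTE_CONFIG)  # income -> 50_000, length -> "5y", util -> 0
```

Because the configuration is plain data, every imputation decision is visible and auditable in one place.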
Outlier Detection and Handling
Before diving into feature-level outlier detection, a row-wise threshold for removal was defined — similar to how we handle missingness.
If more than 50% of a row's numerical features are detected as outliers, the entire record is considered corrupted and removed. This helps protect the model from noisy or unstable observations (Osborne & Overbay, 2004).
🟢 Result: 0 records removed due to excessive outliers.
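As a sketch of how such a row-wise screen could work (using IQR fences as the per-feature detector, which is an assumption on my part):

```python
import pandas as pd

def iqr_outlier_flags(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """True where a numeric value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = df.select_dtypes("number")
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    return (num < q1 - k * iqr) | (num > q3 + k * iqr)

def drop_corrupted_rows(df: pd.DataFrame, max_share: float = 0.50) -> pd.DataFrame:
    """Remove rows where more than `max_share` of numeric features are outliers."""
    outlier_share = iqr_outlier_flags(df).mean(axis=1)  # per-row fraction flagged
    return df[outlier_share <= max_share]

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],       # 100 is an outlier
    "b": [10, 11, 12, 13, -500],  # -500 is an outlier
})
kept = drop_corrupted_rows(df)    # last row is 100% outliers -> removed
```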
Next, feature-level outlier detection was performed using multiple statistical and machine learning methods, each selected for its unique strengths.
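The exact methods used aren't listed here; a minimal sketch combining two common detectors (Tukey's IQR fences and the z-score rule, with scikit-learn's IsolationForest as a further machine-learning option) could look like this:

```python
import pandas as pd

def zscore_flags(s: pd.Series, z: float = 3.0) -> pd.Series:
    """Flag values more than `z` standard deviations from the mean."""
    return (s - s.mean()).abs() > z * s.std()

def iqr_flags(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def consensus_outliers(s: pd.Series) -> pd.Series:
    """Flag a value only when both detectors agree, reducing false positives."""
    return zscore_flags(s) & iqr_flags(s)

s = pd.Series(list(range(100)) + [1_000])
flags = consensus_outliers(s)  # only the extreme value is flagged
```

Requiring agreement between methods is one way to exploit their complementary strengths: the z-score rule is sensitive to spread, while IQR fences are robust to the outliers themselves.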
✏️ Example Adjustments
Two features required specific treatment based on domain knowledge and data review.
Other features like income, loan amount, and interest rate were visually inspected and retained unchanged due to plausible business ranges.
A unified pipeline was used to apply these strategies consistently and efficiently, supporting reproducibility and analysis integrity.
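As a sketch of such a unified pipeline, per-feature treatments can be registered in a map and applied in one pass; the features and treatments below (percentile capping, range enforcement) are hypothetical examples, not the actual adjustments made:

```python
import pandas as pd

# Hypothetical per-feature treatments; actual adjustments follow domain review
TREATMENTS = {
    "dti": lambda s: s.clip(upper=s.quantile(0.99)),  # cap extreme debt-to-income
    "age": lambda s: s.clip(lower=18, upper=100),     # enforce a plausible range
}

def apply_treatments(df: pd.DataFrame, treatments: dict) -> pd.DataFrame:
    """Apply each registered treatment to its column, if present."""
    df = df.copy()
    for col, fn in treatments.items():
        if col in df.columns:
            df[col] = fn(df[col])
    return df

df = pd.DataFrame({"age": [17, 30, 150]})
adjusted = apply_treatments(df, TREATMENTS)  # age -> [18, 30, 100]
```

Collecting the treatments in one registry keeps every adjustment reviewable and makes the cleaning step trivially re-runnable on refreshed data.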
📌 Key Takeaways
🔜 What’s Next?
Next, we move to:
👉 Step 4: Full Exploratory Data Analysis (EDA), now with clean, validated data.
🧠 Why it matters: Post-cleaning EDA uncovers deeper patterns that guide feature engineering and model design.
👨‍💻 Follow along and explore the code and full report here:
Stay tuned!