From Dirty Data to Reliable Inputs: Cleaning and Validation in PD Modelling

Introduction

This is the fifth post in my series on building a Probability of Default (PD) model from a business data analyst’s perspective.

In the last post, From Raw Data to Early Insights: Preliminary EDA in PD Modelling, I shared how early visual and statistical exploration helps us uncover initial trends, anomalies, and questions.

Now, we move to Step 3 in our roadmap: Data Cleaning and Validation — the stage where raw data is transformed into trustworthy inputs. Before any feature engineering or model building, it’s critical to address missing values, detect outliers, and correct errors that could mislead analysis or degrade model performance.

Handling Missing Values

To reduce dimensionality and prevent bias from excessive imputation, I began by removing features with more than 30% missing values using a threshold-based approach. This aligns with common practice, where variables exceeding 30–40% missingness are often considered unreliable unless they hold critical value. Eliminating such features helps avoid distortion from imputation and improves model generalisability and training efficiency (Little & Rubin, 2002; Hastie, Tibshirani, & Friedman, 2009).
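A minimal sketch of this threshold rule in pandas follows; the toy column names are illustrative, not the project's actual schema:

```python
import pandas as pd

def drop_high_missing(df: pd.DataFrame, threshold: float = 0.30) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_frac = df.isna().mean()  # per-column fraction of NaNs
    to_drop = missing_frac[missing_frac > threshold].index.tolist()
    return df.drop(columns=to_drop)

# Toy usage: "notes" is 75% missing, so it is dropped
toy = pd.DataFrame({"income": [50_000, None, 62_000, 58_000],
                    "notes": [None, None, None, "ok"]})
cleaned = drop_high_missing(toy)
```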

Applying this 30% threshold to the dataset, no feature met the removal criterion, so none were dropped. I then addressed the remaining missing values using tailored, feature-level strategies (a sketch follows the list below).

  • Numerical variables prone to outliers (e.g., income, age, credit history length) were imputed using the median.
  • Categorical variables (e.g., home ownership) were imputed using the mode.
  • Critical features (e.g., interest rate, loan amount, and the target variable) had rows dropped to preserve data integrity.
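Here is a sketch of these per-feature strategies; the column names follow the features mentioned above but are assumptions about the dataset's actual schema:

```python
import pandas as pd

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Median for outlier-prone numerical features (robust to extreme values)
    for col in ["person_income", "person_age", "cb_person_cred_hist_length"]:
        df[col] = df[col].fillna(df[col].median())
    # Mode for categorical features
    df["person_home_ownership"] = df["person_home_ownership"].fillna(
        df["person_home_ownership"].mode()[0])
    # Drop rows missing critical fields, including the target variable
    return df.dropna(subset=["loan_int_rate", "loan_amnt", "loan_status"])
```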

In total:

  • 3,116 records removed
  • 827 records imputed

This strategy was applied modularly using a configurable pipeline, ensuring transparency, minimal bias, and strong downstream reliability (Little & Rubin, 2002).
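One way such a configurable pipeline can be expressed is as a feature-to-strategy mapping; this is a hypothetical sketch of the pattern, not the project's actual implementation:

```python
import pandas as pd

# Hypothetical configuration: each feature maps to a handling strategy
CONFIG = {
    "person_income": "median",
    "person_home_ownership": "mode",
    "loan_int_rate": "drop_row",
    "loan_status": "drop_row",
}

def clean_missing(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    df = df.copy()
    # Drop rows first so imputation never touches critical fields
    critical = [col for col, strat in config.items() if strat == "drop_row"]
    df = df.dropna(subset=critical)
    for col, strat in config.items():
        if strat == "median":
            df[col] = df[col].fillna(df[col].median())
        elif strat == "mode":
            df[col] = df[col].fillna(df[col].mode()[0])
    return df
```

Keeping the strategy in a plain mapping makes every imputation decision visible in one place and easy to review or change without touching the pipeline code.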

Table 1: Imputation Strategies for Handling Missing Values and Outliers

Outlier Detection and Handling

Before diving into feature-level outlier detection, a row-wise removal threshold was defined — similar to the approach used for missingness.

If more than 50% of a row's numerical features are detected as outliers, the entire record is considered corrupted and removed. This helps protect the model from noisy or unstable observations (Osborne & Overbay, 2004).

🟢 Result: 0 records removed due to excessive outliers.
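A minimal sketch of this row-level rule, assuming the IQR fence as the per-feature flagging criterion (an assumption — the post combines several detection methods):

```python
import numpy as np
import pandas as pd

def drop_corrupted_rows(df: pd.DataFrame, max_frac: float = 0.50) -> pd.DataFrame:
    """Remove rows where more than max_frac of numeric features are flagged."""
    num = df.select_dtypes(include=np.number)
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # True where a value lies outside the 1.5 * IQR fences for its column
    flagged = (num < q1 - 1.5 * iqr) | (num > q3 + 1.5 * iqr)
    # Keep rows whose share of flagged features is at or below the threshold
    return df.loc[flagged.mean(axis=1) <= max_frac]
```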

Next, feature-level outlier detection was performed using multiple statistical and machine learning methods, each selected for its unique strengths.

Table 2: Summary of Techniques Used to Detect Outliers
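As an illustration of combining detectors, here is how three commonly paired methods — z-score, IQR fences, and scikit-learn's Isolation Forest — can flag the same feature. The specific methods and parameters here are assumptions, not necessarily those listed in Table 2:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_outliers(x: pd.Series) -> pd.DataFrame:
    """Flag outliers in one numeric feature via three complementary methods."""
    x = x.dropna()
    flags = pd.DataFrame(index=x.index)
    # Z-score: suits roughly normal features
    flags["zscore"] = ((x - x.mean()) / x.std()).abs() > 3
    # IQR fences: robust to skewed distributions
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    flags["iqr"] = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    # Isolation Forest: model-based, isolates anomalous values
    iso = IsolationForest(contamination=0.01, random_state=42)
    flags["iforest"] = iso.fit_predict(x.to_frame()) == -1
    return flags
```

Comparing flags across methods helps separate genuinely corrupted values from legitimately extreme but plausible ones.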

✏️ Example Adjustments

Two features required specific treatment based on domain knowledge and data review (a sketch of both rules follows the list):

  • Age: Ranged from 20 to 144, with high skewness (2.58) and extreme kurtosis (18.56). 🔧 Fixed: values outside the plausible [18, 80] range were replaced with the median.
  • Employment Length: Reached values as high as 123 years, with kurtosis above 43. 🔧 Fixed: values above 50 were capped at 50.
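A sketch of these two domain rules; the column names person_age and person_emp_length are assumptions about the dataset's schema:

```python
import pandas as pd

def apply_domain_rules(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Age outside the plausible [18, 80] range -> replace with the median
    median_age = df["person_age"].median()
    implausible = (df["person_age"] < 18) | (df["person_age"] > 80)
    df.loc[implausible, "person_age"] = median_age
    # Employment length above 50 years -> cap at 50
    df["person_emp_length"] = df["person_emp_length"].clip(upper=50)
    return df
```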

Other features, such as income, loan amount, and interest rate, were visually inspected and retained unchanged because their values fell within plausible business ranges.

  • Total custom-based outliers handled: 7
  • 0 records removed due to row-level thresholds

A unified pipeline was used to apply these strategies consistently and efficiently, supporting reproducibility and analysis integrity.

📌 Key Takeaways

  • Applied modular, per-feature missing value handling (median, mode, drop).
  • Defined and applied a configurable row-wise outlier threshold (50%) to preserve data quality.
  • Used multiple statistical and ML-based outlier detection methods.
  • Corrected extreme values in age and employment length using custom domain rules.
  • Designed the process to be transparent, reproducible, and configurable.

🔜 What’s Next?

Next, we move to:

👉 Step 4: Full Exploratory Data Analysis (EDA)

With clean, validated data, we will:

  • Explore feature correlations, multicollinearity, and interactions
  • Apply dimensionality-reduction techniques (e.g., PCA) and clustering
  • Assess relationships between features and the target variable (Default)

🧠 Why it matters: Post-cleaning EDA uncovers deeper patterns that guide feature engineering and model design.

👨‍💻 Follow along and explore the code and full report here:

🔗 [GitHub – Daanish Framework]

🔗 [GitHub – PD Modelling Project]

Stay tuned!
