Test with Real Data and Avoid Mock Data Pitfalls!
QC PASSED vs PROD PASSED!

Have you ever tested your system with scenario-based, manually generated mock data, only to find that it fails in production? If so, you are not alone. Many agile teams face this problem because mock data differs from real data in subtle ways, and most people assume that the logic in their code is independent of the data it runs on. In this article, I will show you how to test with real data and avoid the pitfalls of mock data.

Say you have a rule-based engine that buckets orders for prioritization, based on product types and demographics. You think through all possible scenarios and create mock data to test positive, negative, exception, and rare cases. You run the tests, all test cases pass, and you label the rules QC PASSED. The rules go live, but in production no rule ever fires!

What went wrong? You investigate and find that the production data is ever so slightly different from your mock data. A field used in a rule may carry trailing spaces and upper-case strings in production, whereas your mock data had lower-case strings with no whitespace! Ah, no problem with the testing method, you say, just the wrong mock data. It can be even worse when only some rules fail, silently, and you don’t find out until it is too late.
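The mismatch above can be reproduced in a few lines. This is a minimal sketch; the rule, the field name, and the values are hypothetical stand-ins for the order-bucketing engine:

```python
def bucket_order(order):
    """Assign a priority bucket based on product type (naive rule)."""
    if order["product_type"] == "electronics":
        return "priority"
    return None

mock_order = {"product_type": "electronics"}    # clean, hand-made mock data
prod_order = {"product_type": "ELECTRONICS  "}  # real data: upper case + trailing spaces

assert bucket_order(mock_order) == "priority"   # QC passes on mock data
assert bucket_order(prod_order) is None         # but no rule fires in production!

def bucket_order_fixed(order):
    """Same rule, but normalizing the field first closes the gap."""
    if order["product_type"].strip().lower() == "electronics":
        return "priority"
    return None

assert bucket_order_fixed(prod_order) == "priority"
```

The fix itself is trivial; the point is that nobody thinks to write the `"ELECTRONICS  "` test case until real data shows it to them.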

This is a classic scenario, and it shows why testing with manually created mock data fails so consistently. Is there a better way? Yes: use real data wherever possible. It is easy once you know the methods for the steps below.

Testing with real data:

  • Copy full production data (real data) into your test environment, with sensitive fields masked.
  • Generate more mock data using permutations and combinations of values found in real data.
  • Derive mock data for scenarios that you expect but that have not yet appeared in real data.
  • Use brute force to derive the expected output for all records.
  • For subsequent releases, reconcile the data (old code vs new code) to detect unintended changes in expected values.
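The first two steps above can be sketched roughly as follows. The dataset, the field names, and the hash-based masking rule are illustrative assumptions, not a prescribed implementation:

```python
import hashlib
import itertools

def mask(record, sensitive):
    """Replace sensitive values with a stable one-way hash so joins still line up."""
    return {
        k: hashlib.sha256(str(v).encode()).hexdigest()[:12] if k in sensitive else v
        for k, v in record.items()
    }

# A tiny stand-in for copied production data:
prod_rows = [
    {"customer_email": "a@example.com", "product_type": "Electronics ", "region": "EU"},
    {"customer_email": "b@example.com", "product_type": "toys", "region": "US"},
]
masked = [mask(r, {"customer_email"}) for r in prod_rows]

# Generate extra mock rows by combining values actually seen in real data:
types = sorted({r["product_type"] for r in prod_rows})
regions = sorted({r["region"] for r in prod_rows})
extra_mock = [{"product_type": t, "region": g}
              for t, g in itertools.product(types, regions)]
# 2 observed values per field -> 4 combinations, including pairs never seen together
```

Using a stable hash (rather than a random replacement) for masking means the same email always masks to the same token, so relationships between records survive the copy.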

Why does mocking from real data perform better?

  • It reduces dependency on the limits of human knowledge.
  • You don’t need to think of every possible scenario yourself.
  • It covers all the real data scenarios, including cosmetic data deviations.
  • You only create scenarios that are absent from real data or that may appear in the future.
  • “All possible scenarios” easily run into millions of combinations; real data is only the subset you actually need to care about.
  • You get full coverage over real data.
  • It enables performance testing at real data scale.
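The reconciliation step, running the old and new code over the same real records and flagging every record whose output changed, can be sketched like this. Both rule versions are hypothetical:

```python
def bucket_v1(order):
    """Old release: exact string match."""
    return "priority" if order["product_type"] == "electronics" else "standard"

def bucket_v2(order):
    """New release: normalizes the field before matching."""
    return "priority" if order["product_type"].strip().lower() == "electronics" else "standard"

def reconcile(records):
    """Yield (record, old_bucket, new_bucket) for every record whose output changed."""
    for rec in records:
        old, new = bucket_v1(rec), bucket_v2(rec)
        if old != new:
            yield rec, old, new

real_data = [{"product_type": "electronics"}, {"product_type": "ELECTRONICS "}]
diffs = list(reconcile(real_data))
# Only the second record changes bucket: "standard" under v1, "priority" under v2.
# Every diff is either an intended fix or an unintended regression to investigate.
```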

Some may object to testing with real data because of the perceived cost of copying production data into a test environment, and because of concerns such as storage, network, data security, and privacy. However, these concerns can be mitigated with simple, effective methods, and the cost of poor test quality far exceeds the cost of testing with real data. Testing with real data ensures that your system works as expected in production and prevents costly errors and rework.

But what if you don’t have real data yet? Start with manual mock data, go live, and analyze production scenarios continuously while you collect real data; then update your mock test data with real data using the method above. There are very few real limitations, all of which you can easily mitigate, and I shall discuss them in another article.

In conclusion, testing with real data is a better way to ensure the quality of your system than testing with mock data. Have you tried testing with real data? What issues have you encountered with manually generated test data?

Rahul Anand Kale

Software Engineer: Tech Mentor, Innovator, Technology Migration

