Test with Real Data and Avoid Mock Data Pitfalls!
QC PASSED vs PROD PASSED!

Have you ever tested your system with scenario-based, manually generated mock data, only to find that it fails in production? If so, you are not alone. Many agile teams face this problem because mock data differs from real data in subtle ways, and most people assume that the logic in their code is independent of the data it runs on. In this article, I will show you how to test with real data and avoid the pitfalls of mock data.

Say you have a rule-based engine that buckets orders for prioritization, based on product types and demographics. You think through all possible scenarios and create mock data to test positive, negative, exception, and rare cases. You run the tests, all test cases pass, and you label the rules QC PASSED. The rules go live, but in production no rule ever fires!

What went wrong? You investigate and find that the production data is ever so slightly different from your mock data. A field used in a rule may carry trailing spaces and upper-case strings in production, whereas your mock data had lower-case strings with no whitespace! Ah, no problem with the testing method, you say, just the wrong mock data. It can be even worse when only some rules fail, silently, and you don’t find out until it is too late.
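The mismatch above can be reproduced in a few lines. This is a minimal sketch; the rule, the field name, and the values are hypothetical stand-ins for the order-bucketing engine:

```python
def bucket_order(order):
    """Assign a priority bucket based on product type (naive rule)."""
    if order["product_type"] == "electronics":
        return "priority"
    return None

mock_order = {"product_type": "electronics"}    # clean, hand-made mock data
prod_order = {"product_type": "ELECTRONICS  "}  # real data: upper case + trailing spaces

assert bucket_order(mock_order) == "priority"   # QC passes on mock data
assert bucket_order(prod_order) is None         # but no rule fires in production!

def bucket_order_fixed(order):
    """Same rule, but normalizing the field first closes the gap."""
    if order["product_type"].strip().lower() == "electronics":
        return "priority"
    return None

assert bucket_order_fixed(prod_order) == "priority"
```

The fix itself is trivial; the point is that nobody thinks to write the `"ELECTRONICS  "` test case until real data shows it to them.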

This is a classic scenario, and it shows why testing with manually created mock data fails so consistently. Is there a better way? Yes: use real data wherever possible. It is easy once you know the methods for the steps below.

Testing with real data:

  • Copy full production data (real data) into your test environment, with sensitive fields masked.
  • Generate more mock data using permutations and combinations of values found in real data.
  • Derive mock data for scenarios that you expect but that have not yet appeared in real data.
  • Use brute force to derive the expected output for all records.
  • For subsequent releases, reconcile the data (old code vs new code) to detect unintended changes in expected values.
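The first two steps above can be sketched roughly as follows. The dataset, the field names, and the hash-based masking rule are illustrative assumptions, not a prescribed implementation:

```python
import hashlib
import itertools

def mask(record, sensitive):
    """Replace sensitive values with a stable one-way hash so joins still line up."""
    return {
        k: hashlib.sha256(str(v).encode()).hexdigest()[:12] if k in sensitive else v
        for k, v in record.items()
    }

# A tiny stand-in for copied production data:
prod_rows = [
    {"customer_email": "a@example.com", "product_type": "Electronics ", "region": "EU"},
    {"customer_email": "b@example.com", "product_type": "toys", "region": "US"},
]
masked = [mask(r, {"customer_email"}) for r in prod_rows]

# Generate extra mock rows by combining values actually seen in real data:
types = sorted({r["product_type"] for r in prod_rows})
regions = sorted({r["region"] for r in prod_rows})
extra_mock = [{"product_type": t, "region": g}
              for t, g in itertools.product(types, regions)]
# 2 observed values per field -> 4 combinations, including pairs never seen together
```

Using a stable hash (rather than a random replacement) for masking means the same email always masks to the same token, so relationships between records survive the copy.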

Why does mocking from real data perform better?

  • It reduces dependency on the limits of human knowledge.
  • You don’t need to think of every possible scenario yourself.
  • It covers all the real data scenarios, including cosmetic data deviations.
  • You only create scenarios that are absent from real data or that may appear in the future.
  • “All possible scenarios” easily run into millions of combinations; real data is only the subset you actually need to care about.
  • You get full coverage over real data.
  • It enables performance testing at real data scale.
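The reconciliation step, running the old and new code over the same real records and flagging every record whose output changed, can be sketched like this. Both rule versions are hypothetical:

```python
def bucket_v1(order):
    """Old release: exact string match."""
    return "priority" if order["product_type"] == "electronics" else "standard"

def bucket_v2(order):
    """New release: normalizes the field before matching."""
    return "priority" if order["product_type"].strip().lower() == "electronics" else "standard"

def reconcile(records):
    """Yield (record, old_bucket, new_bucket) for every record whose output changed."""
    for rec in records:
        old, new = bucket_v1(rec), bucket_v2(rec)
        if old != new:
            yield rec, old, new

real_data = [{"product_type": "electronics"}, {"product_type": "ELECTRONICS "}]
diffs = list(reconcile(real_data))
# Only the second record changes bucket: "standard" under v1, "priority" under v2.
# Every diff is either an intended fix or an unintended regression to investigate.
```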

Some may object to testing with real data because of the perceived cost of copying production data into a test environment, and because of concerns such as storage, network, data security, and privacy. However, these concerns can be mitigated with simple, effective methods, and the cost of poor test quality far exceeds the cost of testing with real data. Testing with real data ensures that your system works as expected in production and prevents costly errors and rework.

But what if you don’t have real data yet? Start with manual mock data, go live, and analyze production scenarios continuously while you collect real data; then update your mock test data with real data using the method above. There are very few real limitations, all of which you can easily mitigate, and I shall discuss them in another article.

In conclusion, testing with real data is a better way to ensure the quality of your system than testing with mock data. Have you tried testing with real data? What issues have you encountered with manually generated test data?

Rahul Anand Kale

Software Engineer: Tech Mentor, Innovator, Technology Migration

