Machine Learning models can overcome poor data....
[Header image generated with Midjourney]


Data quality is of utmost importance for Machine Learning (ML) models and fraud prevention.

When you input bad data, you output bad predictions.

If you work in fraud prevention as an analyst or a data scientist, then you know how challenging it is to get the data you need. As mastheadata.com puts it:

"Poor data quality can lead to incorrect business intelligence decisions, worse data analysis, and a multitude of errors. Minor problems in the input data going into training a model can turn into large-scale issues at the output. A 2016 study by IBM revealed that $3.1 trillion is lost annually from the U.S. economy due to poor data quality and its effects alone."

"$3.1 trillion is lost annually from the U.S. economy due to poor data quality"

How much money are you losing as a result of poor data quality?

A hymn from my childhood comes to mind: "You better build your house upon a rock, make a good foundation on a solid spot."

Build a solid data foundation

I am not a data purist who believes that every event should have all the data available; I am realistic and understand that can be challenging. Still, ensuring a minimum standard does improve data quality and the performance of the model's predictions, ultimately enabling you to identify fraud.

That said, it comes with a caution. I recall helping a customer who had been told their data quality was poor and likely useless.

I completed a data quality audit and deemed the data quality to be reasonable, i.e. no better or worse than other customers'. The data was of a standard that could be used to build rules that would accurately detect attacks and fraud, and as such I believed it was fine to build an ML model.

The data scientists working on the project reviewed the data and said it was of low value and that they couldn't do anything meaningful with it. Some of the issues raised were:

  • Not enough labels – there were not enough events classified as fraud or genuine
  • Data continuity – the Device ID, IP Address and Session ID changed mid-session due to an implementation error
  • Atypical events – different events were generated by different applications

They proceeded with the project regardless, generating features, training models and creating predictions to see what could be realized. Unfortunately, the predictions were poor: no fraud was identified, and generally the models missed the mark.

For most of the data scientists working on the project, this was their first time working with real data in fraud prevention. It was quite a wake-up call.

Ultimately, they blamed the data for the poor outcome rather than reviewing their approach and learning from it.

[Cartoon: Roger Beale, FT.com]

To prove the data was usable, I worked closely with a talented data scientist with considerable real-world experience. Together we successfully identified fraud, generated accurate predictions and built a bot detection model with the same data the previous team had failed with.

Why were we successful?

The data scientist knew the data's limitations, what would influence the predictions, and how the models would be sensitive to time, frequency and the sequence of events. She was used to working with limited data, and she knew the importance of working with an SME who understood the attacks, user journeys, applications, data, the customer and their problems.

In this example, the data was not the issue; the issue was the experience of the team working on the data and the process applied, i.e. not working with a subject matter expert, not setting realistic expectations about the data, and not working within its limitations.

But wait... your article implies that you cannot overcome bad data with ML models.

The example above related to good-quality data with some limitations. When the data quality is genuinely poor, the predictions will be poor and the likelihood of identifying fraud will be close to zero.

Here are some common problems that I’ve encountered:


  • Unbalanced dataset – 99.99%+ unlabelled / non-fraud vs. <0.01% fraud
  • Null data – data that should be present and is not
  • Unreliable data – data that cannot be trusted, e.g. a device fingerprint, IP or cookie changing mid-session due to an incorrect implementation rather than a nefarious actor
  • Incorrect calculations – aggregations are typically built through counters / increments; if the initial value is wrong, then everything that follows is wrong
  • Not knowing your data – building features / rules with incorrect assumptions about values
  • Not handling nulls – building features that infer change when a null is present, e.g. a device ID going from known to unknown (see the sketch after this list)
  • Missing data – typically due to a poor understanding of user journeys, resulting in missing events / API data
  • Slow message delivery – many features are time-sensitive, as are rules; when the data is not delivered in sequence, this can impact the calculations made
  • Incorrectly anonymized / pseudonymized data – anonymization may have been applied incorrectly or inconsistently, reducing the value of the data, cohorts and features
  • No feedback – labelled data is not provided to the model to allow it to retrain
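
To make a couple of these points concrete, here is a minimal sketch in Python of null-aware feature building. The event fields, and the idea of tracking a per-session null rate, are illustrative assumptions rather than a real schema:

    from collections import defaultdict

    def device_changed(prev_id, curr_id):
        # A null / missing device ID is treated as "unknown", not as a change;
        # otherwise an instrumentation gap looks like a fraudster rotating devices.
        if prev_id is None or curr_id is None:
            return None  # unknown: do not let the feature infer a change
        return prev_id != curr_id

    def build_session_features(events):
        # events: list of dicts with "session_id" and "device_id" (may be None)
        sessions = defaultdict(list)
        for e in events:
            sessions[e["session_id"]].append(e.get("device_id"))

        features = {}
        for sid, ids in sessions.items():
            changes = [device_changed(a, b) for a, b in zip(ids, ids[1:])]
            features[sid] = {
                # True only if we positively observed a mid-session change
                "device_changed_mid_session": any(c is True for c in changes),
                # expose how much of the signal was missing, so a model or
                # analyst can discount unreliable sessions
                "device_id_null_rate": ids.count(None) / len(ids),
            }
        return features

    events = [
        {"session_id": "s1", "device_id": "d1"},
        {"session_id": "s1", "device_id": None},  # instrumentation gap, not fraud
        {"session_id": "s1", "device_id": "d1"},
    ]
    print(build_session_features(events))

The point is that a naive prev != curr comparison would count that single null as two device changes and hand the model a false signal.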

Aside from not identifying fraud, this can also result in missed opportunities, as the business is unable to accurately understand its users and trends.

Conversely, when the data quality is good, then you can:

[Diagram: what good data quality enables, generated using PowerPoint]

The benefits of good data quality become an accelerator when you want to produce business intelligence, tie in enrichment services, hook into additional networks and so on. So it is important to get the fundamentals right.

Let's go deeper.

[Image source: Naval Open Source Intelligence]

If you do not have consistent event types, transaction types, device fingerprinting, IP enrichment, geo enrichment, or consistency across the different products monitored by the same tool, then you should absolutely invest in data quality. Here's why.

Identifying fraud requires consistent aggregation data points to identify the logic, sequence or scenario relating to an attack. If the data is not consistent, you will struggle to identify a link, and so will a machine learning model; it will likely infer something incorrect as a result.

If there are no consistent aggregation data points such as the device ID, event type or time, then rules, machine learning models, reporting, investigation and forensic analysis will all fall short.
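
As a toy illustration, consider a naive velocity feature keyed on the device ID (the field names here are assumptions, not a prescribed schema): with a stable ID an attack is obvious, while an ID that mutates mid-session makes the same attack invisible.

    from collections import Counter

    def login_velocity(events):
        # number of login events observed per device ID
        return Counter(e["device_id"] for e in events if e["event_type"] == "login")

    # A stable device ID makes a burst of bot logins obvious...
    stable = [{"event_type": "login", "device_id": "d1"} for _ in range(10)]
    print(login_velocity(stable))  # Counter({'d1': 10}) -> clear spike

    # ...while an ID that mutates on every request (an implementation bug,
    # or an evasive bot) fragments the attack into ten counts of one,
    # and the rule or model sees nothing unusual.
    mutating = [{"event_type": "login", "device_id": f"d{i}"} for i in range(10)]
    print(login_velocity(mutating))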

Therefore, the biggest improvement can be to standardize the data expected in the form of an API and ensure applications meet this standard.

It is also important to be realistic: there will be cases where the data quality is not as expected, and you need to handle them. For example, meeting the standard may not be possible at the application level, or the payload may be intercepted or altered by a malicious actor.
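
One lightweight way to enforce such a minimum specification at the API boundary, and to handle payloads that fall short of it, is a validation step that records why an event failed rather than silently dropping it. A minimal sketch, assuming a hypothetical payload with the fields below:

    REQUIRED_FIELDS = {"event_type", "session_id", "device_id", "timestamp"}

    def validate_event(payload):
        # Return a list of problems with an incoming event payload, so
        # data-quality failures are visible and can be fed back to the
        # owning application team.
        problems = [f"missing field: {field}"
                    for field in REQUIRED_FIELDS - payload.keys()]
        problems += [f"null value: {field}"
                     for field in REQUIRED_FIELDS & payload.keys()
                     if payload[field] is None]
        return problems

    event = {"event_type": "login", "session_id": "s1", "device_id": None}
    issues = validate_event(event)
    if issues:
        print("quarantined:", issues)  # alert and fix, rather than train on it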

[Image source: FreePik]

This is where data scientists can really come into their own, dealing with pockets of poor-quality data, but ultimately they need a solid frame of data consistency to build on.

You may want to consider what is limiting the data quality currently:

  • Is there an issue client-side whereby the device fingerprinting or data collection is inconsistent or inaccurate?
  • Is the data secured in transit?
  • Has the data been altered (maliciously or programmatically)?
  • Are you collecting enough data / events?
  • Can you infer behaviour of a user on an app or in a session?
  • Are there missing events?
  • Is there middleware passing through only some events or some data?
  • Are you enriching data to increase its value?

On the last point, you can enrich raw data, for example to provide IP location, Tor or proxy detection, geolocation, malicious device or cookie flags, sanctions screening and bot behaviour identification.
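
As a sketch of how that enrichment might be layered onto a raw event (note that ip_to_geo and is_known_proxy below are hypothetical stand-ins for whichever enrichment providers you actually use):

    # Toy stand-ins for real enrichment providers (geo/IP intelligence, proxy lists).
    def ip_to_geo(ip):
        return {"203.0.113.7": "NL"}.get(ip)

    def is_known_proxy(ip):
        return ip in {"203.0.113.7"}

    def enrich_event(event):
        # Attach derived signals without mutating the raw event; a failed
        # lookup degrades to None rather than blocking the pipeline.
        enriched = dict(event)
        ip = event.get("ip")
        enriched["ip_country"] = ip_to_geo(ip) if ip else None
        enriched["is_proxy"] = is_known_proxy(ip) if ip else None
        return enriched

    print(enrich_event({"event_type": "login", "ip": "203.0.113.7"}))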

[Image source: SEON – example of data aggregation / linking]

Enriching the data will enable you to realize value as follows:

  • Infer aggregations, causes and links previously unknown
  • Create cohorts to share
  • Increase human understanding and interpretation
  • Potentially remove private data without losing intelligence
  • Increase the features available
  • Increase the accuracy of the model
  • Reduce fraud

So you have audited the data, created an API and the data quality is good.

What's next?

Ensure that labelled data is of high quality to enable model retraining.

And

Consider how a change in data quality, good or bad, impacts the rules or features in a model.

Change management, specifically of the data specification, is fundamental to a performant anti-fraud solution.
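
Returning to the label-quality point above, here is a minimal sketch of the checks a feedback record might need to pass before being used for retraining; the field names and label vocabulary are assumptions:

    from datetime import datetime, timezone

    VALID_LABELS = {"fraud", "genuine"}

    def valid_label(record):
        # A label is only useful for retraining if it ties back to a real
        # event, uses an agreed vocabulary, and records when it was decided
        # (fraud is often confirmed weeks after the event occurred).
        return (bool(record.get("event_id"))
                and record.get("label") in VALID_LABELS
                and isinstance(record.get("labelled_at"), datetime))

    record = {"event_id": "evt-42", "label": "fraud",
              "labelled_at": datetime(2023, 1, 5, tzinfo=timezone.utc)}
    assert valid_label(record)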

8 Recommendations for consistent, high-quality data

  1. Know your data
  2. Build a minimum specification for data
  3. Correlate the minimum data specification to user journeys and examine payloads
  4. Correlate the minimum data specification to rules, features, model interpretation and attack identification
  5. Improve and enrich data to increase the value and understanding
  6. Ensure a robust change management process is in place that allows for data extensibility and consistency
  7. Ensure data scientists and SMEs work together
  8. Tie data quality to revenue / fraud

I hope this fifth blog in the common misconceptions series was of use to you. I will publish the sixth next week:

"Unbalanced data sets can be overcome with the right model”

Feedback is very welcome; please do let me know if there are any problems and I will look to update the post. Or, if you think there is something I've missed that I should cover, then reach out.

I love to learn and to hear from you all!

Thank you for your time.

#fraudprevention  #artificialintelligence  #businessowners #dataquality
