Machine Learning models can overcome poor data....
Data quality is of utmost importance for Machine Learning (ML) models and fraud prevention.
When you input bad data, you output bad predictions.
If you work in fraud prevention as an analyst or data scientist, then you know how challenging it is to get the data you need. According to mastheadata.com (linked at the bottom of the article):
"Poor data quality can lead to incorrect business intelligence decisions, worse data analysis, and a multitude of errors. Minor problems in the input data going into training a model can turn into large-scale issues at the output. A 2016 study by IBM revealed that $3.1 trillion is lost annually from the U.S. economy due to poor data quality and its effects alone."
How much money are you losing as a result of poor data quality?
A hymn comes to mind from my childhood "You better build your house upon a rock, Make a good foundation on a solid spot."
Build a solid data foundation
I am not a data purist who believes that every event should have all the data available; I am realistic and understand that can be challenging. But ensuring a minimum standard does improve data quality and the performance of a model's predictions, ultimately enabling you to identify fraud.
That said, it comes with a caution. I recall helping a customer who had been told their data quality was poor and likely useless.
I completed a data quality audit and deemed the data quality reasonable, i.e. no better or worse than that of other customers. The data was of a standard that could be used to build rules that would accurately detect attacks and fraud, and as such I believed it was fine to build an ML model.
The data scientists working on the project reviewed the data and said it was of low value and that they could not do anything meaningful with it. Some of the issues raised were:
They proceeded with the project, though, and generated features, trained models, and created predictions to see what could be realized. Unfortunately, the predictions were poor: no fraud was identified and they generally missed the mark.
For most of the data scientists working on the project, this was their first time working with real data in fraud prevention. It was quite a wake-up call.
Ultimately, they blamed the data for the poor outcome rather than reviewing their approach and learning from it.
To prove this, I worked closely with a talented data scientist with considerable real-world experience. Together we successfully identified fraud, generated accurate predictions, and built a bot detection model with the same data on which the previous team had failed.
Why were we successful?
The data scientist knew the data's limitations, what would influence the predictions, and how the models would be sensitive to the time, frequency, and sequence of events. Ultimately, she is a very talented data scientist used to working with limited data. She also knew the importance of working with an SME who understood the attacks, user journeys, applications, data, the customer, and their problems.
In this example, the data was not the issue; the issue was the experience of the team working on the data and the process applied, i.e. not working with a subject matter expert, not having realistic expectations about the data, and not working within its limitations.
But wait... your article implies that you cannot overcome bad data with ML models.
The example above related to good quality data with some limitations. When the data quality is genuinely poor, the predictions will be poor and the likelihood of identifying fraud will be close to zero.
Here are some common problems that I’ve encountered:
Aside from not identifying fraud, this also results in missed opportunity, as the business is not able to accurately understand its users and trends.
Conversely, when the data quality is good then you can:
The benefits of good data quality become an accelerator as you move to produce business intelligence, tie in enrichment services, hook into additional networks, and so on. So it is important to get the fundamentals right.
Let's go deeper.
If you do not have consistent event types, transaction types, device fingerprinting, IP enrichment, geo enrichment, and consistency across the different products monitored by the same tool, then you should absolutely invest in data quality, and here's why.
Identifying fraud requires consistent aggregation data points to identify the logic, sequence, or scenario of an attack. If the data is not consistent, you will struggle to identify a link, and so will a machine learning model; likely it will infer something incorrect as a result.
If there are no consistent aggregation data points, such as device ID, event type, and time, then rules, machine learning models, reporting, investigation, forensic analysis, and so on will all fall short.
Therefore the biggest improvement can be to standardize the expected data in the form of an API and ensure applications meet this standard.
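To make this concrete, here is a minimal sketch of what enforcing such a standard could look like at the point of ingestion. The field names and allowed event types are illustrative assumptions, not a real specification:

```python
# Minimal sketch of validating incoming events against an assumed
# standard API schema. Field names and event types are illustrative,
# not a real specification.
REQUIRED_FIELDS = {"event_type", "device_id", "ip", "timestamp", "user_id"}
KNOWN_EVENT_TYPES = {"login", "payment", "password_reset"}

def validate_event(event: dict) -> list:
    """Return a list of data-quality problems found in one event."""
    problems = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if event.get("event_type") not in KNOWN_EVENT_TYPES:
        problems.append(f"unknown event_type: {event.get('event_type')!r}")
    return problems

# Usage: quarantine events that fail validation instead of letting
# them silently pollute rules and model features.
event = {"event_type": "login", "device_id": "abc123", "ip": "203.0.113.7"}
print(validate_event(event))  # flags the missing timestamp and user_id
```

Rejecting or quarantining non-conforming events at the boundary keeps the inconsistency out of every downstream rule, model, and report at once, rather than patching each consumer separately.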
It is important to be realistic: there will be cases where the data quality is not as expected, and you need to handle them. For example, enforcement may not be possible at the application level, or the payload may be intercepted or altered by a malicious actor.
This is where data scientists can really come into their own by dealing with poor quality data, but ultimately they need a good frame of data consistency to build on.
You may want to consider what is limiting the data quality currently:
On the last point, you can enrich raw data, for example to provide IP location, Tor, proxy, geolocation, malicious device, malicious cookie, sanctions, and bot behaviour identification.
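As a sketch of the idea, enrichment is essentially a join between each raw event and external signals keyed on a consistent field such as the IP. The lookup table below is a stand-in for a real enrichment service (geo, proxy, Tor databases); the values are invented for illustration:

```python
# Sketch of enriching raw events with IP-derived signals.
# IP_ENRICHMENT stands in for a real enrichment service; the
# addresses and flags below are illustrative only.
IP_ENRICHMENT = {
    "203.0.113.7": {"country": "GB", "is_proxy": False, "is_tor": False},
    "198.51.100.2": {"country": "RU", "is_proxy": True, "is_tor": True},
}

def enrich(event: dict) -> dict:
    """Merge IP-derived fields into the event, prefixed with 'ip_'."""
    extra = IP_ENRICHMENT.get(event.get("ip"), {})
    return {**event, **{f"ip_{k}": v for k, v in extra.items()}}

enriched = enrich({"ip": "198.51.100.2", "event_type": "login"})
# enriched now carries ip_country, ip_is_proxy and ip_is_tor,
# ready for rules and model features.
```

Note that the enrichment only works because the IP field is consistently present and formatted, which is exactly why the standardized API comes first.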
Enriching the data will enable you to realize value as follows:
So you have audited the data, created an API and the data quality is good.
What's next?
Ensure that labelled data is of high quality to enable model retraining.
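A basic label-quality check before retraining can catch the most damaging problem: the same transaction labelled both fraud and genuine, which directly teaches the model contradictions. This is a minimal sketch; the (transaction_id, label) data shape is an assumption:

```python
# Sketch of a label-quality check before model retraining: find
# transactions carrying contradictory labels. The input shape
# (transaction_id, label) pairs is an illustrative assumption.
from collections import defaultdict

def label_quality(labels: list) -> dict:
    """labels: iterable of (transaction_id, 'fraud' | 'genuine') pairs."""
    seen = defaultdict(set)
    for txn_id, label in labels:
        seen[txn_id].add(label)
    conflicting = sorted(t for t, ls in seen.items() if len(ls) > 1)
    return {"labelled": len(seen), "conflicting": conflicting}

report = label_quality([("t1", "fraud"), ("t1", "genuine"), ("t2", "fraud")])
# report["conflicting"] lists transactions needing manual review
# before they are allowed into the training set.
```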
And consider how a change in data quality, good or bad, impacts the rules or the features in a model.
Change management applied specifically to the data specification is fundamental to a performant anti-fraud solution.
8 Recommendations for consistent, high-quality data
I hope this fifth blog in the common misconceptions series was of use to you. The sixth will be published next week:
"Unbalanced data sets can be overcome with the right model”
Feedback is very welcome, please do let me know if there are any problems and I will look to update the post. Or if you think there is something I’ve missed and should cover, then reach out.
I love learning and hearing from you all!
Thank you for your time.
Links referred to in the article: