AI is Easy, Data is Hard
Artificial Intelligence (AI) is more accessible today than ever. From open-source machine learning libraries to user-friendly AI cloud services, anyone can deploy AI models with relative ease. But here’s the catch: AI is only as good as the data you give it. A state-of-the-art algorithm fed poor data will produce poor results – a classic case of “garbage in, garbage out.” Many companies eagerly invest in AI technology, yet they overlook the less glamorous work of data preparation and quality control, often to their peril.
In fact, AI initiatives are notoriously prone to failure when data problems are ignored. Gartner has predicted that 85% of AI projects will deliver erroneous outcomes, with poor data quality and a lack of relevant data among the leading causes (Gartner, 2020). Even when the AI tech is sound, bad data can derail the outcome – one survey found that 99% of AI and machine learning projects encounter data quality issues (Forbes, 2020). As one industry observer succinctly put it, “it’s a data game, not a code fest.” The core of any AI system “lies not in complex coding, but in the data that powers it” (Ng, 2020). In other words, the real power lies in information, not just algorithms.
AI can feel like a high-wire act: the technology is dazzling, but without a solid data foundation, it’s one step away from a plunge. Businesses often race to implement AI solutions, pouring resources into model development, while treating data as an afterthought. To make AI work, organisations must fortify that wire – focus on data collection, cleaning, and governance – or risk a very public tumble.
Extreme Worst-Case AI Failure
What happens when AI is fed bad data? In the extreme case, it can lead to spectacular failures. A famous real-world example comes from Amazon. The company built an experimental AI recruiting tool to automatically screen resumes and identify top talent. Unfortunately, the AI quickly developed a bias against women and Amazon had to shut it down (Dastin, 2018). The model was trained on ten years of past hiring data and, because the tech industry – and thus Amazon’s pool of prior hires – was predominantly male, the AI learned a toxic lesson. In effect, Amazon’s system taught itself that male candidates were preferable, even penalising resumes that included the word “women’s” (as in “women’s chess club captain”). It even started downgrading graduates of women’s colleges. The very data that was meant to help the AI select the best candidates instead taught it to discriminate. This extreme failure was a direct result of biased, poor-quality training data.
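To make the mechanism concrete, here is a minimal Python sketch – emphatically not Amazon’s actual system, and using invented data – of how a naive resume scorer that learns keyword weights from historical hiring decisions absorbs the bias baked into those decisions. The bias lives in the data, not the code:

```python
from collections import defaultdict

# Hypothetical training data: (resume keywords, was_hired) pairs reflecting
# a decade of male-dominated hiring decisions.
history = [
    ({"python", "chess club"}, True),
    ({"java", "football"}, True),
    ({"python", "rowing"}, True),
    ({"python", "women's chess club"}, False),
    ({"java", "women's college"}, False),
    ({"sql", "chess club"}, True),
]

def learn_keyword_weights(history):
    """Weight = hire rate among resumes containing the keyword, minus the base rate."""
    base_rate = sum(hired for _, hired in history) / len(history)
    counts = defaultdict(lambda: [0, 0])  # keyword -> [hires, total]
    for keywords, hired in history:
        for kw in keywords:
            counts[kw][0] += hired
            counts[kw][1] += 1
    return {kw: hires / total - base_rate for kw, (hires, total) in counts.items()}

weights = learn_keyword_weights(history)
# Keywords correlated with female candidates inherit negative weights
# purely from the skewed history -- nothing in the code mentions gender.
print(weights["women's chess club"])  # negative
print(weights["chess club"])          # positive
```

The scorer never sees a “gender” field; it simply learns that certain words were rare among past hires. Any model trained to reproduce historical decisions will reproduce historical patterns, fair or not.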
Now, imagine a similar data issue in a higher-stakes arena. An AI system in healthcare could be fed incomplete or unrepresentative patient data and end up misdiagnosing illnesses or recommending unsafe treatments. In finance, a trading algorithm might make disastrous decisions if its market data is skewed or erroneous, potentially triggering huge losses. These hypotheticals underscore a sobering truth: when AI fails due to bad data, it can fail hard – with outcomes ranging from embarrassment and lost business to legal troubles or harm to human lives. It’s the nightmare scenario that shows why getting the data right is absolutely critical.
Expected Worst-Case: Everyday AI Failures
Not every AI mishap makes headlines. Far more common is the everyday worst-case scenario: AI that doesn’t live up to its promise because of flawed or insufficient data. These failures may be mundane, but they are widespread – and they carry a real cost for businesses. Bad data can lead AI systems to make decisions that are simply off the mark, resulting in inefficiencies and missed opportunities rather than dramatic crashes. Typical scenarios include a pricing engine working from wrong pricing data, recommendations skewed by duplicate customer records, forecasts degraded by missing fields, and stock decisions thrown off by miscategorised inventory.
Individually, these issues are small fires; collectively, they’re a raging blaze of lost potential. Poor data quality is estimated to cost U.S. businesses around $3.1 trillion every year (Pereira, 2020). Think about that – trillions lost not to sci-fi robot rebellions or grand AI glitches, but to mundane problems like wrong pricing data, duplicate records, missing fields, and miscategorised inventory. In the business world, the “silent killer” of AI ROI is bad data. It’s a pervasive problem that shows how the real power play is in information. Get the data right, and even simple AI can yield great results. Get the data wrong, and even the most advanced AI will struggle.
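None of these defects require sophisticated tooling to catch. As a sketch, here is a minimal data-quality audit in Python for exactly the mundane problems named above – duplicate records, missing fields, and implausible prices. The field names and example records are illustrative assumptions:

```python
def audit(records, required=("sku", "name", "price")):
    """Return simple counts of duplicate SKUs, missing fields, and bad prices."""
    seen = set()
    report = {"duplicates": 0, "missing_fields": 0, "bad_prices": 0}
    for rec in records:
        key = rec.get("sku")
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        if any(rec.get(f) in (None, "") for f in required):
            report["missing_fields"] += 1
        price = rec.get("price")
        if not isinstance(price, (int, float)) or price <= 0:
            report["bad_prices"] += 1
    return report

catalogue = [
    {"sku": "A1", "name": "Snack drink", "price": 2.50},
    {"sku": "A1", "name": "Snack drink", "price": 2.50},  # duplicate record
    {"sku": "B2", "name": "", "price": 1.99},             # missing name
    {"sku": "C3", "name": "Energy bar", "price": -1.0},   # impossible price
]
print(audit(catalogue))
# {'duplicates': 1, 'missing_fields': 1, 'bad_prices': 1}
```

A few dozen lines of validation like this, run before data ever reaches a model, is often the highest-ROI “AI investment” a team can make.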
FMCG Case Study: When Bad Data Spoils the Recipe
To see how data issues play out in practice, let’s look at a scenario in the fast-moving consumer goods (FMCG) industry – makers of everyday products like food, beverages, and toiletries. Imagine a consumer goods company about to launch a new snack beverage. They decide to use AI to forecast demand, set the optimal price, and plan the marketing campaign. The team pours in all the data they have: last year’s sales for similar products, market research survey results, and even some synthetic data (artificially generated examples) to cover scenarios they haven’t seen before. The AI model crunches the numbers and comes back with confident predictions – it forecasts sky-high demand for the new drink and even suggests that consumers would be willing to pay a premium price.
Launch day comes, and things don’t go according to plan. In reality, sales are lukewarm. It turns out the historical sales data was misleading – last year, a competitor had a supply chain issue that temporarily drove more customers to our hypothetical company’s products, artificially boosting those numbers. The model didn’t understand that context. It also turns out consumers were more price-sensitive than the AI assumed. By relying on synthetic and historical data without the full picture, the company overestimated demand and overpriced the product. They’ve now got warehouses full of unsold cans of the new drink. To move inventory, they’re forced to slash prices in a fire sale, hurting their profit margins and brand reputation. And the marketing? The AI’s insights into the target audience were off, so the ads didn’t resonate with the people who actually might buy this beverage. In the end, the product launch flops – not because the idea was bad or the team lacked talent, but because the data feeding the AI was flawed at multiple points.
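The historical-data trap in this story is easy to show with numbers. In the illustrative sketch below (all figures invented), a naive forecast that averages past sales treats the anomalous year – when the competitor’s supply-chain outage drove customers over – as normal demand, inflating the estimate by roughly a quarter:

```python
# Hypothetical annual unit sales; 2022 is the competitor-outage year.
units_sold = {2020: 100_000, 2021: 105_000, 2022: 180_000}

def naive_forecast(history):
    """Forecast next year's demand as the plain average of past years."""
    return sum(history.values()) / len(history)

with_anomaly = naive_forecast(units_sold)
without_anomaly = naive_forecast(
    {year: units for year, units in units_sold.items() if year != 2022}
)

print(round(with_anomaly))     # 128333 -- inflated by the outage year
print(round(without_anomaly))  # 102500 -- closer to baseline demand
```

The model isn’t wrong about the arithmetic; it’s wrong about the world, because nobody told it 2022 was an outlier. Context about the data is itself data.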
This case study also shines a light on the lure – and limits – of synthetic data. To compensate for limited real data, companies often generate synthetic data to simulate consumer behaviour. Synthetic data can be useful, but it’s often too neat and optimistic. By design, it reflects the patterns we expect to see, not necessarily the chaotic reality of human behaviour. In fact, one analysis notes that synthetic data “may not capture the complexity of real-world datasets and can potentially omit important details or relationships needed for accurate predictions” (Dhillon, 2021). That was part of the issue for our snack launch: the synthetic scenarios assumed ideal conditions and typical customer responses, missing the possibility of wild-card events (like a sudden social media trend or a regional taste shift). Those messy real-world nuances were absent, so the AI was essentially flying blind to important factors. The lesson? Synthetic data is a helpful supplement, but it’s no substitute for real, high-quality data. If you only train on a polished, imagined version of reality, your AI will be unprepared for the rough edges of the real world.
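The “too neat” problem is visible in even a toy comparison. In this sketch (invented numbers), synthetic weekly demand generated from the expected pattern tracks the baseline nicely but never contains the wild-card spike that real behaviour produced, so its spread drastically understates reality:

```python
from statistics import stdev

# Hypothetical weekly demand. Week 6 of the real series is a wild-card
# spike (say, a sudden social media trend); the synthetic series, built
# from expected patterns, contains no such shocks.
real_weekly_demand = [98, 102, 100, 97, 101, 230, 99, 103]
synthetic_demand   = [100, 99, 101, 100, 98, 102, 100, 100]

print(round(stdev(real_weekly_demand)))  # large spread, driven by the spike
print(round(stdev(synthetic_demand)))    # tiny spread: no shocks to learn from
```

A model trained only on the synthetic series would treat week 6 as impossible, which is precisely the blind spot the snack launch suffered from.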
The Real AI Power Play is Data
In the end, the message is clear: the real power of AI lies in the data behind it. Sophisticated algorithms and models are now widely available, but what differentiates success from failure is how well an organisation handles its data. Companies that thrive with AI are not necessarily those with the fanciest models, but those with the best data practices. That means having ample, relevant data and ensuring it’s accurate, up-to-date, and truly reflective of the domain. It means breaking down data silos so that AI systems have a complete view of the business. It means investing in data cleaning, integration, and monitoring pipelines to continuously feed models clean, rich information. As AI pioneer Andrew Ng argues, the shift to a “data-centric” approach – focusing on the quality of data fueling AI systems – is crucial to “unlocking [AI’s] full power” (Ng, 2020). In short, if you want better AI, start by improving your data.
For business leaders and decision-makers, this is a call to action. Treat data as a strategic asset – one that deserves at least as much attention and investment as the AI models themselves. Before chasing the next cutting-edge AI tool, make sure your data foundation is solid. Are you collecting the right data? Is it accurate, comprehensive, and unbiased? Do you have processes to fix errors, fill gaps, and update the data over time? Often, making progress with AI isn’t about inventing a new algorithm at all, but about doing the unglamorous work of data curation. The old adage “garbage in, garbage out” holds especially true in the age of AI. An algorithm that ingests bad data will churn out bad insights every time. Conversely, if you feed your AI high-quality data, you set the stage for reliable, transformative results. So while AI tools are becoming easier to acquire (the “easy” part), getting the data right remains hard – but that’s exactly where the real power lies. Master the data, and the AI will follow. After all, an AI is only as smart as the information you give it.
References