The Trust Paradox: Why we need Evals in the AI Age

Last week, I was using ChatGPT for some quantitative analysis. I had a large dataset of issues across various categories. The kind of analysis that would typically take hours of pivot tables and VLOOKUP formulas. I asked ChatGPT to identify patterns and count the frequency of each issue type within different product classes. The response was impressive—in seconds, it spotted trends I hadn't noticed and gave me precise counts for each category. Feeling satisfied with this time-saving breakthrough, I ran the same query again just to double-check the numbers. That's when things got interesting. The numbers were different. Curious and slightly concerned, I tried a third time. Different again. Each time, ChatGPT confidently presented its analysis as if it were the definitive answer.

This is the paradox of LLMs: they're simultaneously more capable and less predictable than traditional software.

When a regular program fails, it fails consistently: give it the same input and you'll get the same error. But LLMs are more like creative collaborators; they can surprise you with brilliant insights one moment and confidently state nonsense the next. This isn't a flaw; it's by design. These models work by predicting the most probable next token in a sequence, which enables creativity but makes perfectly consistent output impossible. With LLMs, we're forced to ask a trickier question: "How do you test something that's intentionally probabilistic?"
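
To make that concrete, here is a minimal sketch of the repeatability problem, assuming the OpenAI Python SDK; the model name, prompt, and temperature are placeholders for illustration, not a recommendation.

```python
# Minimal sketch: the same prompt, sampled three times, can yield different answers.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

prompt = "Count the high-severity issues per product category in this data: ..."

answers = []
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",                 # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,                     # sampling is on, so outputs can vary
    )
    answers.append(response.choices[0].message.content)

print(f"{len(set(answers))} distinct answers out of 3 runs")
```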

As AI becomes more integrated into our daily tools and workflows, we need systematic ways to measure and verify its performance. We need evals. That's what we're going to explore today: what evals are, why they matter, and how they might shape the future of AI development.

What are Evals?

At their core, evals are systematic methods for measuring the quality and reliability of LLM outputs. They help us gauge whether a model's responses are reliable, relevant, and appropriate for their intended use. Let's see how this works in practice.

Deep Dive into how Evals work in practice

Remember our data analysis scenario? When analysing issue patterns across categories, we need to ensure several layers of correctness:

  1. Did the system understand the data structure correctly? (Comprehension)
  2. Did it perform the calculations accurately? (Computation)
  3. Are the identified patterns meaningful and useful? (Relevance)


To address these reliability challenges, evals are built on three key pillars: Test Data, Task Definition, and Scoring System. Let's look at each of them in detail:

Test Data: Think of this as your calibration dataset—a carefully selected set of examples where you know exactly what good (and bad) outputs should look like. In our data analysis scenario, each test example contains "ground truth"—the correct counts, relationships, and insights that should be found. This becomes your baseline for measuring whether the LLM's performance is reliable.

Input Query: 
For each product category, identify all issues with severity level 'High' and tell me if there's a pattern in when they occur.

Data Context:
- Excel sheet with columns: Product_Category, Issue_Type, Severity, Date_Reported
- 1000+ rows spanning 6 months
- 3 product categories (Mobile, Web, Desktop)

Ground Truth:
- Mobile: 45 high-severity issues, 80% related to authentication failures, spike in occurrences during app updates
- Web: 32 high-severity issues, 60% related to payment processing, consistent distribution
- Desktop: 28 high-severity issues, 50% related to data sync, higher occurrence on Mondays        
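
In code, a calibration example like this can be captured as a small record that pairs the query and data context with the known-correct answer. The sketch below uses an illustrative `EvalCase` dataclass; the field names are assumptions, not any particular framework's schema.

```python
# One calibration example: query + data context + ground truth.
# The EvalCase dataclass and its field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    query: str              # the user request under test
    data_context: str       # description of the spreadsheet given to the model
    ground_truth: dict = field(default_factory=dict)  # known-correct counts and patterns

case = EvalCase(
    query=(
        "For each product category, identify all issues with severity level "
        "'High' and tell me if there's a pattern in when they occur."
    ),
    data_context=(
        "Excel sheet with columns Product_Category, Issue_Type, Severity, "
        "Date_Reported; 1000+ rows over 6 months; categories Mobile, Web, Desktop."
    ),
    ground_truth={
        "Mobile":  {"high_count": 45, "top_issue": "authentication failures"},
        "Web":     {"high_count": 32, "top_issue": "payment processing"},
        "Desktop": {"high_count": 28, "top_issue": "data sync"},
    },
)
```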

Task Definition: This is where you define exactly what you're testing. For our analysis task, this means specifying how the system should read the data, what patterns it should look for, and how it should present its findings. Each step needs to be clearly defined so you can identify where things might go wrong.

Success Criteria Example:
- Data Understanding: 100% correct column identification
- Filtering: Zero missed high-severity issues
- Counting: ±1% margin of error in counts
- Pattern Recognition: Identify patterns present in >10% of the data
- Output: All categories addressed, clear pattern description, supported by numbers        
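
Criteria like these translate naturally into small programmatic checks. The sketch below assumes the model's answer has already been parsed into per-category counts; that parsing step and the function names are hypothetical.

```python
# Hypothetical checks for two of the criteria above. `parsed` is assumed to be
# the model's answer already extracted into {category: {"high_count": int, ...}}.
def check_all_categories_addressed(parsed: dict, truth: dict) -> bool:
    # Output criterion: every category in the ground truth must be covered.
    return set(truth) <= set(parsed)

def check_counts_within_margin(parsed: dict, truth: dict, margin: float = 0.01) -> bool:
    # Counting criterion: each reported count within the ±1% margin of error.
    for category, expected in truth.items():
        reported = parsed.get(category, {}).get("high_count")
        if reported is None:
            return False
        if abs(reported - expected["high_count"]) > margin * expected["high_count"]:
            return False
    return True
```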

Scoring System: Instead of single pass/fail metrics, you might have multiple scores: one for calculation accuracy, another for pattern recognition, and maybe even one for how clearly the insights are presented.

- Data Processing Accuracy (30 points)
- Quantitative Analysis (30 points)
- Pattern Recognition Quality (25 points)
- Output Quality (15 points)        
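
One simple way to combine these dimensions is a weighted rubric that mirrors the point breakdown above. In the sketch below, each sub-score is assumed to be a value between 0.0 and 1.0; how those sub-scores are produced (exact match, an LLM judge, a human reviewer) is a separate design choice.

```python
# Weighted rubric mirroring the point breakdown above.
# Each sub-score is assumed to be a value between 0.0 and 1.0.
WEIGHTS = {
    "data_processing": 30,
    "quantitative_analysis": 30,
    "pattern_recognition": 25,
    "output_quality": 15,
}

def overall_score(subscores: dict) -> float:
    # Weighted sum on a 0-100 scale, like a report-card grade.
    return sum(WEIGHTS[name] * subscores.get(name, 0.0) for name in WEIGHTS)

# Example: strong numbers, weaker pattern write-up.
print(overall_score({
    "data_processing": 1.0,
    "quantitative_analysis": 0.95,
    "pattern_recognition": 0.7,
    "output_quality": 0.8,
}))  # 88.0 -> below the 90-point production-readiness bar
```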

This scoring system strikes a delicate balance between quantitative precision and qualitative insight. Think of it as a report card that not only tells you the grade but helps you understand why you got it: scores above 90 indicate production readiness, while anything below 80 signals specific areas needing attention.

These three pillars—Test Data, Task Definition, and Scoring System—give us a framework for measuring LLM reliability. But they also point to something bigger: the evolution of how we ensure software quality as technology advances.
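
Put together, the three pillars reduce to a small loop: run the model on each test case, score the response against the ground truth, and aggregate. The sketch below is framework-agnostic; `run_model` and `score_response` are placeholders for your own model call and scoring logic, such as the checks and rubric sketched above.

```python
# Framework-agnostic eval loop tying the three pillars together.
# `run_model` and `score_response` are placeholders you supply.
def run_eval(cases, run_model, score_response, threshold: float = 90.0):
    results = []
    for case in cases:                           # Test Data
        response = run_model(case)               # Task Definition: the job under test
        score = score_response(response, case)   # Scoring System (0-100)
        results.append((case, score))
    average = sum(score for _, score in results) / len(results)
    return {
        "average_score": average,
        "production_ready": average >= threshold,
        "cases_below_threshold": [case for case, score in results if score < threshold],
    }
```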

Path ahead

As we build more AI systems, evals aren't just becoming useful; they're becoming critical infrastructure. Think about it: every major advance in software development came with its own testing paradigm. Unit tests for object-oriented programming. Integration tests for microservices. Now, as we enter the age of AI, we need a new testing paradigm, one that can handle probabilistic outputs, measure both technical accuracy and human utility, and scale across different use cases.

Platforms like Braintrust.dev provide frameworks for building and running evals at scale. But more importantly, we're seeing a shift in how we think about AI development—from "Can we build it?" to "Can we build it reliably?"

Because ultimately, the question isn't whether AI will transform our tools and workflows. The question is whether we can make that transformation trustworthy and reliable. Evals are our first step toward answering that question.
