The Trust Paradox: Why we need Evals in the AI Age
Last week, I was using ChatGPT for some quantitative analysis. I had a large dataset of issues across various categories, and I needed the kind of analysis that would typically take hours of pivot tables and VLOOKUP formulas. I asked ChatGPT to identify patterns and count the frequency of each issue type within different product classes. The response was impressive: in seconds, it spotted trends I hadn't noticed and gave me precise counts for each category. Feeling satisfied with this time-saving breakthrough, I ran the same query again just to double-check the numbers. That's when things got interesting. The numbers were different. Curious and slightly concerned, I tried a third time. Different again. Each time, ChatGPT confidently presented its analysis as if it were the definitive answer.
This is the paradox of LLMs: they're simultaneously more capable and less predictable than traditional software.
When a regular program fails, it fails consistently: give it the same input, and you'll get the same error. But LLMs are more like creative collaborators; they can surprise you with brilliant insights one moment and confidently state nonsense the next. This isn't a flaw; it's by design. These models generate text by sampling the next token from a probability distribution, which enables creativity but means the same prompt can produce different outputs on different runs. With LLMs, we're forced to ask a trickier question: "How do you test something that's intentionally probabilistic?"
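To make that concrete, here is a toy illustration in Python. The token probabilities are made up purely for illustration and don't come from any real model; the point is only that sampling from a distribution means repeated runs can end differently.

```python
import random

# Toy next-token distribution for a prompt like "The total count is ..."
# (illustrative numbers only, not from any real model).
next_token_probs = {"45": 0.40, "44": 0.25, "46": 0.20, "roughly 45": 0.15}

def sample_next_token(probs):
    """Sample one token in proportion to its probability, as an LLM does at non-zero temperature."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# The same prompt can yield different completions on repeated runs.
for run in range(3):
    print(f"Run {run + 1}: The total count is {sample_next_token(next_token_probs)}")
```

Run it a few times and you see exactly the behaviour from the ChatGPT anecdote: same question, different answers.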
As AI becomes more integrated into our daily tools and workflows, we need systematic ways to measure and verify its performance. We need evals. That's what we're going to explore today: what evals are, why they matter, and how they might shape the future of AI development.
What are Evals?
At their core, evals are systematic methods for measuring the quality and reliability of LLM outputs. They help us gauge whether a model's responses are reliable, relevant, and appropriate for their intended use. Let's see how this works in practice.
Deep Dive into how Evals work in practice
Remember our data analysis scenario? When analysing issue patterns across categories, we need to ensure several layers of correctness: the data has to be read accurately, every high-severity issue has to be counted, and the patterns reported have to actually exist in the data.
To address these reliability challenges, evals are built on three key pillars: Test Data, Task Definition, and Scoring System. Let's look at each of them in detail:
Test Data: Think of this as your calibration dataset—a carefully selected set of examples where you know exactly what good (and bad) outputs should look like. In our data analysis scenario, each test example contains "ground truth"—the correct counts, relationships, and insights that should be found. This becomes your baseline for measuring whether the LLM's performance is reliable.
Input Query:
For each product category, identify all issues with severity level 'High' and tell me if there's a pattern in when they occur.
Data Context:
- Excel sheet with columns: Product_Category, Issue_Type, Severity, Date_Reported
- 1000+ rows spanning 6 months
- 3 product categories (Mobile, Web, Desktop)
Ground Truth:
- Mobile: 45 high-severity issues, 80% related to authentication failures, spike in occurrences during app updates
- Web: 32 high-severity issues, 60% related to payment processing, consistent distribution
- Desktop: 28 high-severity issues, 50% related to data sync, higher occurrence on Mondays
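To make this tangible, here is one way such a test case might be written down in code. This is a minimal sketch assuming a Python-based eval harness; the EvalCase class and the file name are hypothetical, while the ground-truth values come straight from the example above.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One eval example: the input query plus the ground truth we expect back."""
    query: str
    data_file: str
    expected_counts: dict      # high-severity issue counts per category
    expected_patterns: dict    # the dominant pattern we expect to be named

# The data analysis example above, encoded as a test case.
case = EvalCase(
    query=("For each product category, identify all issues with severity level "
           "'High' and tell me if there's a pattern in when they occur."),
    data_file="issues_last_6_months.xlsx",   # hypothetical file name
    expected_counts={"Mobile": 45, "Web": 32, "Desktop": 28},
    expected_patterns={
        "Mobile": "authentication failures spiking during app updates",
        "Web": "payment processing issues, evenly distributed",
        "Desktop": "data sync issues, more frequent on Mondays",
    },
)
```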
Task Definition: This is where you define exactly what you're testing. For our analysis task, this means specifying how the system should read the data, what patterns it should look for, and how it should present its findings. Each step needs to be clearly defined so you can identify where things might go wrong.
Success Criteria Example:
- Data Understanding: 100% correct column identification
- Filtering: Zero missed high-severity issues
- Counting: ±1% margin of error in counts
- Pattern Recognition: Identify patterns present in >10% of the data
- Output: All categories addressed, clear pattern description, supported by numbers
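As one illustration, the counting criterion above (a ±1% margin, with no missing categories) could be turned into an automated check along these lines. This is a sketch; the function name and signature are my own, not from any particular eval library, and the margin is rounded up to at least one whole issue because counts are integers.

```python
def check_counts(reported, expected, tolerance=0.01):
    """Counting criterion: every category present, each count within a ±1% margin (at least ±1)."""
    if set(reported) != set(expected):
        return False  # a missing or extra category is an automatic failure
    return all(
        abs(reported[cat] - expected[cat]) <= max(1, tolerance * expected[cat])
        for cat in expected
    )

# The model's answer vs. the ground truth from the test data above.
reported = {"Mobile": 45, "Web": 31, "Desktop": 28}
print(check_counts(reported, {"Mobile": 45, "Web": 32, "Desktop": 28}))  # True: within the margin
```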
Scoring System: Instead of single pass/fail metrics, you might have multiple scores: one for calculation accuracy, another for pattern recognition, and maybe even one for how clearly the insights are presented.
- Data Processing Accuracy (30 points)
- Quantitative Analysis (30 points)
- Pattern Recognition Quality (25 points)
- Output Quality (15 points)
This scoring system strikes a deliberate balance between quantitative precision and qualitative insight. Think of it as a report card that not only tells you the grade but also helps you understand why you got it: scores above 90 indicate production readiness, while anything below 80 signals specific areas needing attention.
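Here is a sketch of how that rubric could be wired up, assuming each dimension has already been scored between 0.0 and 1.0 by its own check. The weights and thresholds mirror the ones above; the code itself is illustrative, not tied to any specific framework, and the label for the 80-90 band is my own assumption since the text above only defines the two ends.

```python
# Rubric weights from the scoring system above (out of 100).
WEIGHTS = {
    "data_processing": 30,
    "quantitative_analysis": 30,
    "pattern_recognition": 25,
    "output_quality": 15,
}

def rubric_score(component_scores):
    """Combine per-component scores (each 0.0-1.0) into a single 0-100 rubric score."""
    return sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)

def verdict(score):
    """Map the rubric score to the thresholds described above."""
    if score >= 90:
        return "production ready"
    if score >= 80:
        return "acceptable, monitor closely"   # the 80-90 band isn't spelled out above; assumed here
    return "needs attention"

scores = {"data_processing": 1.0, "quantitative_analysis": 0.9,
          "pattern_recognition": 0.8, "output_quality": 1.0}
total = rubric_score(scores)       # 30 + 27 + 20 + 15 = 92
print(total, verdict(total))       # 92.0 production ready
```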
These three pillars—Test Data, Task Definition, and Scoring System—give us a framework for measuring LLM reliability. But they also point to something bigger: the evolution of how we ensure software quality as technology advances.
Path ahead
As we build more AI systems, evals aren't just becoming useful; they're becoming critical infrastructure. Think about it: every major advancement in software development came with its own testing paradigm. Unit tests for object-oriented programming. Integration tests for microservices. Now, as we enter the age of AI, we need a new testing paradigm: one that can handle probabilistic outputs, measure both technical accuracy and human utility, and scale across different use cases.
Platforms like Braintrust.dev provide frameworks for building and running evals at scale. But more importantly, we're seeing a shift in how we think about AI development—from "Can we build it?" to "Can we build it reliably?"
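Whatever platform you choose, the core loop is the same: run each test case through the model, grade the output against its ground truth, and aggregate. Here is a minimal, framework-free sketch; the model call is stubbed out with a canned answer purely for demonstration, and the grader is a simplified version of the counting check from earlier.

```python
from statistics import mean

# Stand-in for the real LLM call; swap in your actual API client or agent here.
def call_model(query):
    return {"Mobile": 45, "Web": 31, "Desktop": 28}   # canned answer, for demonstration only

def count_grader(output, expected):
    """Score 1.0 if every category's count is within ±1 of the ground truth, else 0.0."""
    return float(all(abs(output.get(cat, 0) - n) <= 1 for cat, n in expected.items()))

# Each test case pairs a query with its ground truth.
test_set = [
    ("How many high-severity issues per product category?",
     {"Mobile": 45, "Web": 32, "Desktop": 28}),
    # ... more cases, each with its own ground truth ...
]

scores = [count_grader(call_model(query), truth) for query, truth in test_set]
print(f"Mean score across {len(test_set)} cases: {mean(scores):.2f}")
```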
Because ultimately, the question isn't whether AI will transform our tools and workflows. The question is whether we can make that transformation trustworthy and reliable. Evals are our first step toward answering that question.