The Trust Paradox: Why we need Evals in the AI Age
Last week, I was using ChatGPT for some quantitative analysis. I had a large dataset of issues across various categories, and I needed the kind of analysis that would typically take hours of pivot tables and VLOOKUP formulas. I asked ChatGPT to identify patterns and count the frequency of each issue type within different product classes. The response was impressive: in seconds, it spotted trends I hadn't noticed and gave me precise counts for each category. Feeling satisfied with this time-saving breakthrough, I ran the same query again just to double-check the numbers. That's when things got interesting. The numbers were different. Curious and slightly concerned, I tried a third time. Different again. Each time, ChatGPT confidently presented its analysis as if it were the definitive answer.
This is the paradox of LLMs: they're simultaneously more capable and less predictable than traditional software.
When a regular program fails, it fails consistently: give it the same input, and you'll get the same error. But LLMs are more like creative collaborators; they can surprise you with brilliant insights one moment and confidently state nonsense the next. This isn't a flaw; it's by design. These models generate text by sampling the next token from a probability distribution, which enables creativity but means the same prompt can produce different outputs on different runs. With LLMs, we're forced to ask a trickier question: "How do you test something that's intentionally probabilistic?"
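To make that concrete, here is a toy illustration in Python. The token probabilities are made up purely for illustration and don't come from any real model; the point is only that sampling from a distribution means repeated runs can end differently.

```python
import random

# Toy next-token distribution for a prompt like "The total count is ..."
# (illustrative numbers only, not from any real model).
next_token_probs = {"45": 0.40, "44": 0.25, "46": 0.20, "roughly 45": 0.15}

def sample_next_token(probs):
    """Sample one token in proportion to its probability, as an LLM does at non-zero temperature."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# The same prompt can yield different completions on repeated runs.
for run in range(3):
    print(f"Run {run + 1}: The total count is {sample_next_token(next_token_probs)}")
```

Run it a few times and you see exactly the behaviour from the ChatGPT anecdote: same question, different answers.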
As AI becomes more integrated into our daily tools and workflows, we need systematic ways to measure and verify its performance. We need evals. That's what we're going to explore today: what evals are, why they matter, and how they might shape the future of AI development.
What are Evals?
At their core, evals are systematic methods for measuring the quality and reliability of LLM outputs. They help us gauge whether a model's responses are reliable, relevant, and appropriate for their intended use. Let's see how this works in practice.
Deep Dive into how Evals work in practice
Remember our data analysis scenario? When analysing issue patterns across categories, we need to ensure several layers of correctness: the data has to be read accurately, every high-severity issue has to be counted, and the patterns reported have to actually exist in the data.
To address these reliability challenges, evals are built on three key pillars: Test Data, Task Definition, and Scoring System. Let's look at each of them in detail:
Test Data: Think of this as your calibration dataset—a carefully selected set of examples where you know exactly what good (and bad) outputs should look like. In our data analysis scenario, each test example contains "ground truth"—the correct counts, relationships, and insights that should be found. This becomes your baseline for measuring whether the LLM's performance is reliable.
Input Query:
For each product category, identify all issues with severity level 'High' and tell me if there's a pattern in when they occur.
Data Context:
- Excel sheet with columns: Product_Category, Issue_Type, Severity, Date_Reported
- 1000+ rows spanning 6 months
- 3 product categories (Mobile, Web, Desktop)
Ground Truth:
- Mobile: 45 high-severity issues, 80% related to authentication failures, spike in occurrences during app updates
- Web: 32 high-severity issues, 60% related to payment processing, consistent distribution
- Desktop: 28 high-severity issues, 50% related to data sync, higher occurrence on Mondays
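To make this tangible, here is one way such a test case might be written down in code. This is a minimal sketch assuming a Python-based eval harness; the EvalCase class and the file name are hypothetical, while the ground-truth values come straight from the example above.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One eval example: the input query plus the ground truth we expect back."""
    query: str
    data_file: str
    expected_counts: dict      # high-severity issue counts per category
    expected_patterns: dict    # the dominant pattern we expect to be named

# The data analysis example above, encoded as a test case.
case = EvalCase(
    query=("For each product category, identify all issues with severity level "
           "'High' and tell me if there's a pattern in when they occur."),
    data_file="issues_last_6_months.xlsx",   # hypothetical file name
    expected_counts={"Mobile": 45, "Web": 32, "Desktop": 28},
    expected_patterns={
        "Mobile": "authentication failures spiking during app updates",
        "Web": "payment processing issues, evenly distributed",
        "Desktop": "data sync issues, more frequent on Mondays",
    },
)
```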
Task Definition: This is where you define exactly what you're testing. For our analysis task, this means specifying how the system should read the data, what patterns it should look for, and how it should present its findings. Each step needs to be clearly defined so you can identify where things might go wrong.
Success Criteria Example:
- Data Understanding: 100% correct column identification
- Filtering: Zero missed high-severity issues
- Counting: ±1% margin of error in counts
- Pattern Recognition: Identify patterns present in >10% of the data
- Output: All categories addressed, clear pattern description, supported by numbers
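As one illustration, the counting criterion above (a ±1% margin, with no missing categories) could be turned into an automated check along these lines. This is a sketch; the function name and signature are my own, not from any particular eval library, and the margin is rounded up to at least one whole issue because counts are integers.

```python
def check_counts(reported, expected, tolerance=0.01):
    """Counting criterion: every category present, each count within a ±1% margin (at least ±1)."""
    if set(reported) != set(expected):
        return False  # a missing or extra category is an automatic failure
    return all(
        abs(reported[cat] - expected[cat]) <= max(1, tolerance * expected[cat])
        for cat in expected
    )

# The model's answer vs. the ground truth from the test data above.
reported = {"Mobile": 45, "Web": 31, "Desktop": 28}
print(check_counts(reported, {"Mobile": 45, "Web": 32, "Desktop": 28}))  # True: within the margin
```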
Scoring System: Instead of single pass/fail metrics, you might have multiple scores: one for calculation accuracy, another for pattern recognition, and maybe even one for how clearly the insights are presented.
- Data Processing Accuracy (30 points)
- Quantitative Analysis (30 points)
- Pattern Recognition Quality (25 points)
- Output Quality (15 points)
This scoring system strikes a deliberate balance between quantitative precision and qualitative insight. Think of it as a report card that not only tells you the grade but also helps you understand why you got it: scores above 90 indicate production readiness, while anything below 80 signals specific areas needing attention.
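Here is a sketch of how that rubric could be wired up, assuming each dimension has already been scored between 0.0 and 1.0 by its own check. The weights and thresholds mirror the ones above; the code itself is illustrative, not tied to any specific framework, and the label for the 80-90 band is my own assumption since the text above only defines the two ends.

```python
# Rubric weights from the scoring system above (out of 100).
WEIGHTS = {
    "data_processing": 30,
    "quantitative_analysis": 30,
    "pattern_recognition": 25,
    "output_quality": 15,
}

def rubric_score(component_scores):
    """Combine per-component scores (each 0.0-1.0) into a single 0-100 rubric score."""
    return sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)

def verdict(score):
    """Map the rubric score to the thresholds described above."""
    if score >= 90:
        return "production ready"
    if score >= 80:
        return "acceptable, monitor closely"   # the 80-90 band isn't spelled out above; assumed here
    return "needs attention"

scores = {"data_processing": 1.0, "quantitative_analysis": 0.9,
          "pattern_recognition": 0.8, "output_quality": 1.0}
total = rubric_score(scores)       # 30 + 27 + 20 + 15 = 92
print(total, verdict(total))       # 92.0 production ready
```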
These three pillars—Test Data, Task Definition, and Scoring System—give us a framework for measuring LLM reliability. But they also point to something bigger: the evolution of how we ensure software quality as technology advances.
Path ahead
As we build more AI systems, evals aren't just becoming useful; they're becoming critical infrastructure. Think about it: every major advancement in software development came with its own testing paradigm. Unit tests for object-oriented programming. Integration tests for microservices. Now, as we enter the age of AI, we need a new testing paradigm: one that can handle probabilistic outputs, measure both technical accuracy and human utility, and scale across different use cases.
Platforms like Braintrust.dev provide frameworks for building and running evals at scale. But more importantly, we're seeing a shift in how we think about AI development—from "Can we build it?" to "Can we build it reliably?"
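Whatever platform you choose, the core loop is the same: run each test case through the model, grade the output against its ground truth, and aggregate. Here is a minimal, framework-free sketch; the model call is stubbed out with a canned answer purely for demonstration, and the grader is a simplified version of the counting check from earlier.

```python
from statistics import mean

# Stand-in for the real LLM call; swap in your actual API client or agent here.
def call_model(query):
    return {"Mobile": 45, "Web": 31, "Desktop": 28}   # canned answer, for demonstration only

def count_grader(output, expected):
    """Score 1.0 if every category's count is within ±1 of the ground truth, else 0.0."""
    return float(all(abs(output.get(cat, 0) - n) <= 1 for cat, n in expected.items()))

# Each test case pairs a query with its ground truth.
test_set = [
    ("How many high-severity issues per product category?",
     {"Mobile": 45, "Web": 32, "Desktop": 28}),
    # ... more cases, each with its own ground truth ...
]

scores = [count_grader(call_model(query), truth) for query, truth in test_set]
print(f"Mean score across {len(test_set)} cases: {mean(scores):.2f}")
```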
Because ultimately, the question isn't whether AI will transform our tools and workflows. The question is whether we can make that transformation trustworthy and reliable. Evals are our first step toward answering that question.