From Chess to ChatGPT: The Elo Scoring Method Explained

Like many others, I was curious about the comparisons of large language models (LLMs) that kept mentioning "Elo scoring". What did it mean? Was it really a good way to measure something as complex as an AI model? These questions pushed me to dig deeper, and what I found was both fascinating and a bit flawed.


Over the past few years, the world of LLMs has exploded. We’ve seen groundbreaking systems like GPT-4 and Claude alongside open-source alternatives such as Vicuna and LLaMA. These models are everywhere now, helping people write, code, summarize and even answer complex questions. But with so many options, how do we figure out which one is actually the best for a specific task?


This challenge has made evaluation methods more important than ever. Benchmarks like GLUE, SuperGLUE, and SQuAD, along with metrics such as BLEU, offer some insights, but they have their limitations. They often focus on narrow tasks or datasets, which can miss the bigger picture. Fluency, creativity, and adaptability—qualities critical for real-world use—are often overlooked. That’s where the Elo scoring system comes in. Elo offers a way to compare models directly and rank their performance dynamically. Named after its creator, Arpad Elo, a Hungarian-American physicist, the system was originally devised to rate chess players and has since been used to measure skill in many competitive games.


Why comparing LLMs is tricky

Evaluating LLMs isn’t as straightforward as it might seem. Here’s why:

  • Different strengths for different tasks: Some models are great at creative writing, others excel in technical problem-solving, and a few shine in delivering factual answers. A single benchmark score rarely captures these nuances.
  • Data overlap: Many LLMs are trained on massive datasets, and sometimes the evaluation benchmarks contain content similar to what the models have already seen during training. This overlap can lead to inflated scores, giving the impression that a model performs better than it actually does in novel situations.
  • The subjective factor: Metrics like accuracy and BLEU focus on measurable outputs but don’t capture qualities like fluency or coherence, things that matter to real users.

These challenges demand a more flexible and nuanced approach. That’s why Elo scoring, with its focus on direct comparisons, has become an interesting alternative.

 

How Elo scoring works for LLMs

Elo scoring measures relative performance rather than absolute ability. Here’s how it works:

  • Head-to-head comparisons: Two models are given the same task and their outputs are compared.
  • Judging the winner: Human evaluators (or automated tools) decide which model performed better.
  • Adjusting scores: Ratings are updated based on the result. If a lower-rated model beats a higher-rated one, it gains more points than it would for beating an equal opponent, and the higher-rated model loses a corresponding amount (a minimal sketch of this update follows below).
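Under the hood, the update is driven by an expected-score formula: each rating implies a win probability against any opponent, and ratings move in proportion to how much the actual result deviates from that expectation. Below is a minimal Python sketch of a single update; the K-factor of 32 and the example ratings are illustrative assumptions, not values prescribed by any particular leaderboard.

```python
# A minimal sketch of a single Elo update, using the classic chess formula.
# The K-factor of 32 and the example ratings are illustrative only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability for A implied by the two ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for an A win, 0.5 for a tie, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# An underdog win moves ratings much more than an expected win does.
print(elo_update(1000, 1200, score_a=1.0))  # lower-rated model wins: roughly (1024, 1176)
print(elo_update(1200, 1000, score_a=1.0))  # higher-rated model wins: roughly (1208, 992)
```

The K-factor controls how quickly ratings react to new results: a larger K makes the leaderboard more responsive but also more volatile.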


This system has some clear advantages:

  • Task-specific insights: Elo highlights which model performs better in specific head-to-head comparisons (though it often averages these results across multiple tasks to produce an overall ranking).
  • Reduces bias: By focusing on head-to-head comparisons, it avoids some pitfalls of data overlap in benchmarks.
  • Human input: Elo often relies on human judgments, capturing subjective qualities like creativity and clarity that automated metrics miss.


The Chatbot Arena

A practical example of Elo scoring in action is Chatbot Arena, developed by researchers at UC Berkeley and the LMSYS organization. This platform pits popular models like GPT-4, Claude and others against each other in head-to-head tasks. These tasks might include generating coherent responses to user queries, creating creative text or explaining complex topics in an accessible way. Human judges evaluate the results, and Elo scores are updated to reflect performance rankings.
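To make the mechanism concrete, here is a small, self-contained sketch of how a stream of pairwise votes could be folded into an Elo-style leaderboard. The model names, votes, K-factor, and starting rating are all hypothetical, and real platforms use vastly more comparisons (and more careful statistical fitting) than this toy loop.

```python
from collections import defaultdict

# Toy illustration: fold crowd-sourced pairwise votes into Elo-style ratings.
# Model names, votes, K-factor, and starting rating are hypothetical.
K, START = 32.0, 1000.0
ratings = defaultdict(lambda: START)

votes = [  # (model_a, model_b, winner)
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-z", "model-z"),
    ("model-x", "model-z", "model-x"),
    ("model-y", "model-x", "model-y"),
]

for a, b, winner in votes:
    score_a = 1.0 if winner == a else 0.0                          # 1 if model_a won, else 0
    exp_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))  # expected score for model_a
    ratings[a] += K * (score_a - exp_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - exp_a))

# Print the resulting leaderboard, highest rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

One detail worth noting: with sequential updates like this, the final ranking can depend on the order in which votes arrive, which is one reason stable rankings require large numbers of comparisons.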

 

What’s missing in Elo scoring?

For all its strengths, Elo scoring has notable limitations when applied to LLMs.

First, it oversimplifies performance by boiling it down to a single score. Models’ abilities span many dimensions, such as factual accuracy, creativity, and handling of context, and these are flattened by a binary win/lose framework. For example, one model might deliver a highly creative but slightly off-topic response while another provides a dull yet factually correct answer. Elo isn’t built to handle such nuances.

Second, the system relies heavily on subjective evaluations. What evaluators prioritize varies widely, introducing bias into the rankings. This becomes especially problematic when the tasks themselves are limited or skewed.

Finally, Elo scoring is inefficient. It requires many comparisons to produce statistically meaningful rankings and may not be well-suited to keeping up with the rapid development cycles of modern LLMs.

 

Conclusion

Elo scoring offers a practical way to compare LLMs, especially in head-to-head scenarios. It provides insights that go beyond static benchmarks, helping us understand relative strengths and weaknesses. Platforms like Chatbot Arena showcase its potential, but it’s not the final word. As LLMs grow more sophisticated, so must our evaluation methods. Elo scoring is a start—but not the whole story.

 

#ai #genai #chatgpt #gemini #claude #eloscore #eloscoring #llms #llm #chatbotarena #largelanguagemodels #llmevaluation

