All LLMs Now Perform About the Same. Right?

Introduction

Think all LLMs are the same? If you speak English, it might seem that way. Whether you’re using OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, or Meta’s LLaMA, their fluency, reasoning, and factual accuracy in English feel nearly identical. I frequently see articles and comments claiming “there’s no real difference between LLMs anymore.”

But try asking an LLM a complex legal question in Hindi, an advanced coding query in Arabic, or a cultural reference in Swahili—and suddenly, the illusion of parity collapses. The reality is that while English-language LLMs have plateaued, their multilingual capabilities remain wildly inconsistent.

The reason for this is simple: most leading LLMs are trained on the same massive datasets, pulling from Common Crawl, Wikipedia, public books, and other overlapping sources. This has led to an English performance ceiling, where further improvements feel marginal. But while English AI seems to be reaching a limit, multilingual AI is still far from solved, with many languages barely functional in LLMs.

At the same time, open-source AI is closing the gap with proprietary AI faster than ever, often replicating major breakthroughs within days. As we move beyond simply scaling up training data, new approaches such as Reinforcement Learning with Human Feedback (RLHF), Mixture of Experts (MoE), and Chain-of-Thought prompting are emerging as the next frontiers of AI development.

This article explores why LLMs seem indistinguishable in English, why multilingual AI remains broken, and how open-source models are matching proprietary advancements at an unprecedented pace. Because while English LLMs may have reached a plateau, the real AI race is just beginning.


1. Why LLMs Seem About the Same Now

1.1 Training Data Overlap

Most LLMs are trained on vast but similar internet-sourced datasets, which include:

  • Common Crawl (massive web scrape of public data)
  • Wikipedia
  • Books (public domain or licensed)
  • News articles & research papers
  • Open-source code repositories (e.g., GitHub)

Because both proprietary and open-source models rely on these same sources, they end up learning from roughly the same pool of knowledge, leading to similar general knowledge capabilities and a plateau in differentiation.

1.2 Shared Transformer-Based Architectures

Since Google introduced the Transformer model in “Attention Is All You Need” (2017), nearly every competitive LLM has used variations of this architecture. Most leading models today follow similar training techniques, such as:

  • Self-supervised learning (predicting missing words)
  • Reinforcement Learning with Human Feedback (RLHF) (fine-tuning based on user preference)
  • Mixture of Experts (MoE) (specialized subnetworks for efficiency)

Because these architectural and training techniques are shared, performance differences between models are often marginal unless a model is specifically optimized for a niche task.
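The MoE idea mentioned above can be sketched in a few lines: a gating function scores a set of experts for each input, and only the top-k experts are activated, so compute scales with k rather than the total expert count. Everything below is a toy illustration; the expert names, keywords, and scoring are invented, and real gates are learned networks, not keyword matchers.

```python
# Toy Mixture-of-Experts routing: a gate scores each "expert" and only
# the top-k experts are run for a given input. All names and keyword
# lists here are illustrative, not from any real model.

def gate_scores(token: str, experts: dict) -> dict:
    """Score each expert by crude keyword overlap with the input text."""
    return {name: sum(token.count(kw) for kw in kws)
            for name, kws in experts.items()}

def route(token: str, experts: dict, k: int = 2) -> list:
    """Pick the k highest-scoring experts for this input."""
    scores = gate_scores(token, experts)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical experts keyed by the patterns they "specialize" in.
EXPERTS = {
    "code":  ["def", "class", "import"],
    "math":  ["sum", "integral", "="],
    "prose": ["the", "and", "of"],
}

chosen = route("import sum", EXPERTS, k=2)  # only these experts would run
```

The efficiency win is that the other experts are never executed, which is how MoE models hold large total parameter counts while keeping per-token compute modest.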

1.3 Standardized Benchmarks Reinforce Similarity

LLMs are evaluated using standardized tests such as:

  • MMLU (Massive Multitask Language Understanding) – measures general knowledge
  • HELM (Holistic Evaluation of Language Models) – assesses fairness, bias, and reasoning
  • BIG-bench – tests logical reasoning

Since LLMs are fine-tuned to perform well on these tests, they often achieve similar scores, reinforcing the perception that they are indistinguishable in core capabilities.
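One reason scores cluster is that benchmarks such as MMLU ultimately reduce to multiple-choice accuracy, a single saturating number. A minimal sketch of that scoring step, with made-up questions and a hypothetical model's answers:

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style):
# the model picks one option per question; the benchmark reports accuracy.
# The answer keys and predictions below are invented for illustration.

def score(predictions: list, answers: list) -> float:
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

answers     = ["B", "A", "D", "C"]        # gold answer key
predictions = ["B", "A", "C", "C"]        # a hypothetical model's choices

accuracy = score(predictions, answers)    # 3 of 4 correct
```

Once several frontier models sit within a point or two of each other on such a metric, the remaining differences are invisible to most users.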

1.4 Diminishing Returns in English Performance

While early improvements in LLMs led to major leaps in fluency, coherence, and reasoning, further scaling has produced diminishing returns in English:

  • GPT-4, Claude, and Gemini already perform at near-human level on many tasks.
  • Adding more English data doesn’t improve models significantly—it simply reinforces existing patterns.
  • Fine-tuning and prompt engineering now matter more than raw model size.

This plateau in English performance contributes to the idea that LLMs have all become "about the same."


2. The English-Centric Nature of LLM Training

2.1 English Dominates Training Data

Research indicates that models like GPT-3 allocate over 92% of their training tokens to English, leaving only 7.35% for all other languages combined. Even more concerning, a study found that in some cases, over 80% of an LLM’s non-English training data comes from low-quality translations of English content, rather than native sources.

This means that even when AI appears 'multilingual,' it often isn’t—it's just processing poorly translated English, leading to unnatural, biased, or incorrect outputs in many languages.

This leads to:

  • High proficiency in English but weaker performance in languages with limited digital presence.
  • Superior performance in high-resource languages (e.g., German, French, Spanish).
  • Poorer performance in low-resource languages (e.g., Amharic, Lao, and indigenous dialects).

2.2 Weaknesses in Multilingual Performance

Despite being labeled as "multilingual," LLMs frequently struggle with non-English tasks due to fundamental training limitations:

  • Limited Training Data: Many languages lack high-quality datasets, meaning LLMs struggle with basic fluency in low-resource languages like Lao, Amharic, and Maori.
  • Cultural and Contextual Misunderstandings: Even in high-resource languages, models often fail to grasp cultural context. For example:
      ◦ Korean users frequently report that ChatGPT translates slang incorrectly or misinterprets common phrases.
      ◦ Japanese speakers find that AI-generated text often sounds "unnatural" or overly formal, reflecting English sentence structures rather than native fluency.
      ◦ Arabic speakers have noted that LLMs struggle with dialect variations, defaulting to Modern Standard Arabic, which is rarely used in casual conversation.
  • Bias Toward English-Centric Thinking: LLMs often prioritize English worldviews, leading to distorted answers in other languages.
      ◦ French and Spanish users have noticed that AI-generated content defaults to American or British cultural perspectives, even when asked about local issues.
      ◦ On politically sensitive topics, LLMs trained on English-heavy datasets apply Western-centric assumptions when responding in non-English languages.
  • Inaccurate or Nonsensical Translations: When translating between languages, LLMs frequently introduce errors that range from minor inaccuracies to complete nonsense.
      ◦ In 2024, AI users reported that GPT-4 mistranslated legal documents in Chinese, causing critical misinterpretations in business contracts.
      ◦ Google’s Gemini was caught producing gibberish when asked to summarize complex academic papers in Thai.

2.3 Are LLMs Already Solving Multilingual AI? Not So Fast.

A common argument against the "multilingual AI is broken" claim is that LLMs are improving—and that soon, these weaknesses will disappear as models train on more diverse datasets.

It’s true that recent models are better than their predecessors at handling non-English text. Proprietary AI labs have:

  • Expanded multilingual datasets with more diverse web content.
  • Improved tokenization for languages with complex grammar (e.g., Japanese and Arabic).
  • Used translation fine-tuning to boost performance in lower-resource languages.
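The tokenization point is easy to make concrete with a crude proxy: under a byte-level fallback, scripts outside basic Latin need several bytes per character, so the same sentence can consume far more tokens. The sketch below uses raw UTF-8 byte counts as a stand-in for token counts; real BPE tokenizers are more efficient than this, but the skew points the same way, and it is one reason running LLMs in many non-English languages costs more.

```python
# Rough illustration of cross-language tokenization cost, using UTF-8
# byte counts as a worst-case proxy for token counts. Real tokenizers
# (BPE etc.) compress better, but the disparity is similar in spirit.

def utf8_bytes(text: str) -> int:
    """Number of UTF-8 bytes needed to encode the text."""
    return len(text.encode("utf-8"))

english = "Hello, how are you?"
thai    = "สวัสดี คุณเป็นอย่างไรบ้าง"  # roughly the same greeting in Thai

# Thai characters take 3 bytes each in UTF-8, so the ratio is well above 1.
ratio = utf8_bytes(thai) / utf8_bytes(english)
```

A model priced or context-limited by token count therefore quietly penalizes users of scripts the tokenizer was not optimized for.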

So, isn’t multilingual AI just a matter of time?

Not necessarily. Throwing more data at the problem isn’t enough.

🔹 Scaling Training Data Doesn't Solve Structural Issues

More training data only helps when quality data exists. The problem is that many languages lack large-scale, high-quality digital content.

  • Low-resource languages like Amharic, Quechua, or Lao have limited online text, meaning models can’t learn as effectively.
  • Even in higher-resource languages like Korean or Hindi, much of the data is outdated, low-quality, or biased.
  • AI-generated text is now polluting training datasets, leading to a feedback loop where LLMs train on their own mistakes.

🔹 Cultural & Linguistic Nuances Are Still a Major Challenge

Even when models handle basic fluency, they misunderstand regional dialects, idioms, and context-specific knowledge.

  • A study found that ChatGPT’s Spanish responses often default to Latin American Spanish, ignoring European variations.
  • AI-written Japanese articles sound overly formal, making them unnatural for everyday use.
  • Arabic dialects vary widely, but most LLMs default to Modern Standard Arabic, which is rarely spoken in real life.

🔹 The Gap Between English & Non-English AI Remains Huge

Despite these improvements, English LLM performance is still far ahead.

  • GPT-4 performs significantly better on English-language tasks than on multilingual ones.
  • Many benchmarks still prioritize English evaluations, making improvements in non-English tasks less of a research priority.

2.4 Flawed Multilingual Benchmarks

Most multilingual evaluations rely on:

  • Direct translations from English rather than native-language datasets.
  • Poorly localized cultural references, which skew results.
  • Incomplete testing for grammatical nuances in different languages.

The problem is even deeper than flawed benchmarks. A 2024 study analyzing AI evaluation datasets found that over 75% of major LLM benchmarks are designed for English tasks first, with non-English testing often being an afterthought. This means that even when AI models claim to be multilingual, they are optimized for English performance, and their actual non-English reasoning ability is rarely tested at the same depth.

So, Can Multilingual AI Be Fixed?

Yes—but it won’t happen automatically. LLMs won’t magically become good at all languages just by scraping more web data.

Without better multilingual benchmarks, native-language datasets, and region-specific fine-tuning, multilingual AI will remain second-class compared to English AI.


3. Open Source AI Is Rapidly Matching—and Sometimes Beating—Proprietary Models

3.1 The Open-Source AI Revolution: From Underdog to Contender

For years, proprietary AI labs dominated the field, holding an insurmountable lead. But today, open-source AI is catching up at breakneck speed—often replicating breakthroughs within days. In some cases, open-source is no longer just following—it’s leading.

🔹 Hugging Face vs. OpenAI’s Deep Research (The 24-Hour Hackathon)

In February 2025, OpenAI released Deep Research, an AI system designed to autonomously browse the web, summarize content, and provide in-depth answers. Within 24 hours, a team at Hugging Face reverse-engineered and replicated the tool in an open-source format, calling it Open Deep Research.

This event sent a clear message: OpenAI had spent months developing Deep Research—yet the open-source community matched it in a single day.

If that sounds shocking, it shouldn’t be. Open-source AI has been moving faster than ever, and Deep Research was just the latest example.

🔹 Meta’s LLaMA and Mistral’s Mixtral: Open-Source LLMs Are Now Competitive

  • Meta’s LLaMA models started as an internal research project, but when LLaMA 2 & 3 were released, they shattered expectations—achieving near-GPT-4 performance while remaining fully open-source.
  • Mistral’s Mixtral-8x7B made headlines by delivering GPT-3.5-class capabilities in an open-weight model, proving that proprietary AI companies no longer have an automatic edge.
  • Within days, Meta integrated algorithms from DeepSeek, further improving LLaMA’s reasoning and efficiency—demonstrating just how quickly open-source AI incorporates the latest advancements.

🔹 Open-Source AI Sometimes Comes First: DeepSeek Beats Proprietary AI to Market

While open-source AI is often seen as a fast follower, there are now cases where it leads the way—forcing proprietary labs to catch up.

  • DeepSeek, an open-source AI project, pioneered advanced training techniques before many proprietary models.
  • Within days of its release, Meta integrated DeepSeek’s algorithms into LLaMA, proving that open-source innovation is now influencing even the biggest AI companies.
  • In some areas, open-source is not just keeping pace—it’s setting the pace.

This trend challenges the assumption that proprietary AI will always be ahead—because now, the next AI breakthrough might not come from a closed lab at all.

🔹 Open-Source vs. Proprietary Code Generation: A Serious Battle

It’s not just text-based LLMs where open-source is thriving. Even in code generation, open models are emerging as real competitors:

  • Falcon and StarCoder are proving to be credible alternatives to OpenAI’s Codex and DeepMind’s AlphaCode.
  • The open-source AI community is rapidly integrating advanced fine-tuning techniques, narrowing the performance gap even further.

🔹 The Open-Source AI Playbook: Match, Improve, Lead

Once, open-source AI lagged behind, struggling to keep up with the rapid advancements of proprietary labs.

Today? Every time a major AI breakthrough happens, it’s not a question of if open-source will catch up—it’s how fast. The Hugging Face hackathon showed just how quickly the community can match proprietary AI. DeepSeek proved that sometimes, open-source gets there first.

The next step? Not just keeping pace—but leading the AI revolution. 🚀

3.2 Open-Source AI Is Democratizing Cutting-Edge Techniques

Both proprietary and open-source models rely on data, but English-language performance is nearing its ceiling. Instead of just scaling data, researchers are now focusing on:

  • Reinforcement Learning with Human Feedback (RLHF) – Enhancing AI via human feedback.
  • Mixture of Experts (MoE) – Activating specialized "expert" subnetworks within models for efficiency.
  • Chain-of-Thought (CoT) Prompting – Improving complex reasoning by forcing models to explain their thought processes step by step.

When new AI techniques emerge, the open-source community is often the first to implement them or create an equivalent, ensuring AI remains democratized.
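Of the three techniques above, Chain-of-Thought is the simplest to illustrate, since it lives entirely in the prompt: a worked example nudges the model to reason step by step before answering. The sketch below shows only the prompt construction; the worked example and wording are my own, and the actual model call is omitted.

```python
# Sketch of Chain-of-Thought prompting: include a worked example and an
# explicit "think step by step" cue so the model produces intermediate
# reasoning instead of jumping to an answer. Example text is invented.

COT_EXAMPLE = (
    "Q: A shop has 3 boxes with 4 apples each. How many apples?\n"
    "A: Each box has 4 apples and there are 3 boxes, "
    "so 3 * 4 = 12. The answer is 12.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example and ask the model to reason stepwise."""
    return (COT_EXAMPLE
            + f"Q: {question}\n"
            + "A: Let's think step by step.")

prompt = build_cot_prompt(
    "If a train travels 60 km/h for 2 hours, how far does it go?")
```

Because the entire technique is expressible as plain text, it is trivially reproducible by the open-source community, which is part of why prompting advances spread so quickly.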


4. Beyond Text: Will Multimodal AI Solve the Multilingual Challenge?

Much of today’s AI research focuses on improving LLMs, but what if the real breakthrough isn’t in text at all?

Multimodal AI—models that can process and understand images, video, and speech alongside text—could be the key to breaking language barriers. Instead of relying only on written text (which heavily favors English and other high-resource languages), multimodal systems might learn languages more like humans do—through sound, visuals, and context.

For example, a multimodal AI trained on spoken conversations instead of just text corpora might be better at handling low-resource languages, dialects, and non-standard grammar. Instead of defaulting to English-centric logic, it could derive meaning from real-world interactions.

We’re already seeing the first steps toward this future with models like GPT-4V (which integrates vision with text) and Google’s Gemini, which can analyze both words and images. As these technologies advance, they might help address some of the deep-rooted multilingual weaknesses that text-based LLMs struggle with today.

However, this raises a new question: Will multimodal AI be truly accessible, or will it deepen the divide between proprietary and open-source AI? Open-source projects have made incredible progress in text-based LLMs, but multimodal models require vast amounts of high-quality audiovisual data, which is often locked behind corporate walls. If proprietary labs control the best multimodal datasets, we could see a future where multimodal AI is even more closed than today’s text-based LLMs.

So while multimodal AI might help bridge the multilingual gap, the battle for open, accessible AI is far from over.


The Bottom Line

For English speakers, it’s easy to believe that all LLMs are the same. When you’re using GPT-4, Gemini, Claude, or LLaMA, they all seem to produce fluent, accurate, and coherent responses—because they’re trained on nearly identical datasets. But this illusion of parity collapses the moment you test these models in other languages, where performance can vary wildly or even fail completely.

The AI world isn’t converging—it’s fragmenting. While English-language AI has reached a temporary plateau, multilingual AI is still inconsistent, unreliable, and deeply biased toward high-resource languages. And while proprietary AI labs continue their race to dominate the field, open-source AI has proven it can match, and sometimes even outpace, proprietary innovation.

The real AI race isn’t just about making English responses marginally better—it’s about who will crack the challenge of true multilingual AI first. Will it be Big Tech with its massive resources? Or will it be the open-source community, which has already proven its ability to replicate breakthroughs at lightning speed?

One thing is certain: AI is not "solved." The question is no longer whether all LLMs are the same in English—it’s who will build the first LLM that truly understands the world, in every language, for every culture. And that race has only just begun.

AI won’t truly be ‘intelligent’ until it understands every language, every culture. The conversation needs to change. Who’s going to push it forward—you, or Big Tech?

What do you think? Have you noticed the cracks in multilingual AI?

The next time someone says 'all LLMs are the same,' ask them: In which language?

If you care about AI that truly understands the world, the conversation needs to shift. Who will lead the way?


Scott Bass

Principal, LocFluent Consulting


Dion, as always you get to the heart of the matter. The "wow factor" is finally wearing off. I've only succeeded in breaking high-resource, English-bound ChatGPT once recently...that was with some 1944 German diaries. It's easy to forget that it's a big, complex world with thousands of languages that aren't English.

Mirko Plitt

Multilingualism consultant


Been reading your important post while attending a talk on the very same topic at the UNESCO LT4ALL conference. I'll share the link when I can find it. The presenter also pointed out that, due to how tokenisation works, using LLMs in languages other than English (and, as a linguist, I suppose Chinese and other isolating languages) is much more expensive. He considered that a sign of the superiority of English (he used a different term, but same idea); it's of course only more evidence of the tech being (unintentionally) optimised for and fundamentally biased towards English.

Insightful perspective! Do you think open-source AI will be the key to bridging the multilingual gap, or will major players dominate the space? 


The quality and volume you are posting are impressive, Dion. How are you doing this?

