The Pinnacle of Reasoning Model Performance: A Deep Dive into OpenAI's o3
On April 16, 2025, OpenAI officially released the o3 model, along with the o4 mini. This marks a significant milestone in their product line.
How Innovative Is OpenAI o3?
To assess how innovative o3 is, I would place it at the second level of innovation.
The first level of innovation includes the launch of GPT-4 and o1. GPT-4 demonstrated that the scaling law continued to hold during the pretraining phase. The launch of o1 showed that the scaling law was also fully effective in the inference phase.
Although the second-level innovation doesn’t match the foundational significance of the first level, it still brings major improvements. For o3, that improvement means its performance exceeds that of human experts in a wide range of domains.
OpenAI had previously released a mini version of o3. That model was distilled from the full version and only supported reasoning in a purely text-based environment. In contrast, full o3 supports integration with external tools like code execution, search, and image/audio processing.
The full version of o3 also uses longer chains of reasoning, stacking more computational resources to solve complex problems.
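In ChatGPT these tools are built in, but a developer can get a similar effect through the API by exposing their own tools to the model via function calling. Below is a minimal sketch using the OpenAI Python SDK; the run_python tool, its schema, and the prompt are hypothetical, and whether the o3 model name is available to you via the API depends on your account.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition in the standard Chat Completions
# function-calling format: a code-execution helper the model may call.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user",
               "content": "What is the 40th Fibonacci number? Use the tool if needed."}],
    tools=tools,
)

# If the model chose to call the tool, the call (name plus JSON arguments)
# appears here; the caller runs it and sends the result back in a follow-up turn.
print(response.choices[0].message.tool_calls)
```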
From my personal experience and feedback from others, the full o3 performs noticeably better than o4 mini—even better than o4 mini high.
Following the release of the full o3, o3 mini was retired. GPT-4, which was launched in March 2023, will also be deprecated on April 30.
Going forward, all of OpenAI’s models will be natively multimodal, meaning they are trained from the ground up on data including text, images, audio, and video. As a result, they can naturally handle a broad range of input types.
The GPT-4 model being retired on April 30 was trained entirely on text data. By today’s standards, it’s already outdated. It’s astonishing to realize that this shift has occurred in just two years—AI development is accelerating at an incredible pace.
o3’s Power as a Native Agent
What makes o3 truly powerful is that it functions as an agent natively.
Its agent capabilities are comparable to Manus, a March 2025 sensation that required a costly invite code to access.
For example, if you’re watching a video and don’t understand a part of it, you can provide a screenshot (with the timestamp), ask o3 to explain it simply, and it will locate the right time segment, analyze the content, and provide a clear explanation. That goes far beyond simple text analysis—o3 is actually interpreting the video content with tools and reasoning capabilities.
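ChatGPT handles that screenshot-plus-question flow automatically, but the same kind of request can be sketched against the API. The snippet below assumes the standard multi-part Chat Completions input (text plus image_url) works with o3; the screenshot URL and timestamp are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder screenshot URL and timestamp; the multi-part message
# (text plus image_url) is the standard Chat Completions vision input.
response = client.chat.completions.create(
    model="o3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is a screenshot from 12:34 of a lecture video. "
                     "Explain what is happening in simple terms."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/lecture-frame.png"}},
        ],
    }],
)

print(response.choices[0].message.content)
```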
Because it is multimodal, its understanding of images is also exceptional.
For instance, I once gave o3 a challenging task: design a PC case whose rear panel supports a 20cm fan. That is an almost unrealistic requirement: most ATX-compliant cases only fit a 12cm fan at the rear, and even 14cm support is rare, so a 20cm fan would normally block access to the motherboard and GPU I/O ports because of width constraints.
But o3 proposed placing the 20cm fan outside the rear panel and increasing the case width to 23cm. Surprisingly, the design was quite feasible. I’ve included the sketch in the article—it’s not perfect in scale, but it presents an innovative solution.
Recently, a popular trend has emerged where people upload photos with no obvious landmarks and ask o3 to guess the location. Astonishingly, it often guesses correctly—even when the original poster has forgotten where it was taken. o3’s analysis jogs their memory.
This leap in visual reasoning is due to major technical breakthroughs.
From Modular Vision to Unified Reasoning
Traditional visual reasoning relied on many separate modules: one to detect people, another to classify whether they wear glasses, another to segment bus lanes, plus hand-written scripts to reason about spatial relationships. Every module required its own input-output processing, which was slow and fragile; an error in any step could invalidate the final result.
o3, as a natively multimodal model, handles such tasks differently. It internally sequences the necessary operations and performs relational reasoning to quickly and accurately answer the question.
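To make the contrast concrete, here is a purely schematic sketch. The "modules" below are dummy stand-ins rather than real models; the point is the difference in structure, not the implementations.

```python
# Schematic only: each function below is a dummy stand-in for a separate
# vision model in the traditional pipeline.

def detect_people(image):
    return [{"bbox": (10, 20, 50, 120)}]           # pretend person detector

def classify_glasses(image, person):
    return True                                     # pretend attribute classifier

def segment_bus_lanes(image):
    return [((0, 200), (640, 210))]                 # pretend lane segmenter

def answer_modular(image, question):
    # Traditional approach: chain narrow models with hand-written glue logic.
    # An error in any stage propagates to the final answer.
    people = detect_people(image)
    with_glasses = [p for p in people if classify_glasses(image, p)]
    lanes = segment_bus_lanes(image)
    return f"{len(with_glasses)} person(s) with glasses near {len(lanes)} bus lane(s)"

def answer_unified(image, question):
    # Natively multimodal approach: one model call; detection, attribute
    # recognition, and spatial reasoning all happen inside the model.
    return "a single multimodal model call would go here"

print(answer_modular(None, "How many people with glasses are near the bus lane?"))
```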
OpenAI’s Free Tool: Codex CLI
Alongside o3 and o4 mini, OpenAI also released Codex CLI, a free open-source natural language coding agent. It uses o4 mini by default, but can be switched to o3.
It has three operating modes, which control how much autonomy the agent gets: Suggest, which proposes edits and commands but asks for approval before acting; Auto Edit, which can modify files automatically but still asks before running commands; and Full Auto, which can both edit files and run commands inside a sandbox.
With o3 or o4 mini, Codex CLI understands architecture diagrams and code descriptions very well. It can access local code and assist with debugging and code generation. At just $20–$40/month, it puts serious pressure on other commercial coding assistants. It’s likely one of many agents that o3 calls internally to complete tasks.
Pricing Breakdown
Let’s average the cost of input and output per 1 million tokens:
Their mini versions are much cheaper:
As impressive as o3’s performance is, these prices feel like a cold splash of reality. High-performance inference models remain very costly.
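For reference, this is how a blended per-million-token figure is computed. The rates in the snippet are illustrative placeholders, not OpenAI's current list prices, which change over time.

```python
def blended_cost_per_million(input_rate: float, output_rate: float,
                             input_share: float = 0.5) -> float:
    """Average cost per 1M tokens, weighted by the input/output token mix."""
    return input_rate * input_share + output_rate * (1.0 - input_share)

# Placeholder rates: $10 per 1M input tokens, $40 per 1M output tokens,
# assuming an even 50/50 split between input and output tokens.
print(blended_cost_per_million(10.0, 40.0))  # -> 25.0
```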
What’s Next for AI?
This raises a critical question: Are we nearing the ceiling of the current technological paradigm? While o3 delivers significant gains over o1, it’s unclear whether future models like o4 will continue that trend at the same rate.
Even if o4 matches that level of improvement, it may be the last major gain achievable through scaling under the current methodology.
Historically, scaling laws yield 2–3 rounds of significant performance improvement in any new dimension. After that, the compute costs to double performance become prohibitive.
A New Hope: Learning from Experience
Perhaps hope lies in another direction—experience-based learning.
Just before o3’s release, Richard Sutton (the father of reinforcement learning and 2024 Turing Award winner) shared an article co-authored with David Silver (DeepMind VP and AlphaGo creator) titled "Welcome to the Era of Experience."
The article, to be published in MIT Press’s Designing an Intelligence, argues that to break past reasoning limits, AI must learn through experience—just as humans do.
AI performance has long depended on mimicking human data. But imitation caps AI’s potential at human levels. True intelligence comes from trial and error: solving math problems, riding a bike, cooking, repairing—activities that require making mistakes and learning from them.
Many say “we’ve run out of high-quality data,” but in truth, we’ve just wasted it.
If agents can interact with environments, generate their own data, and validate their actions, then even existing datasets would suffice for them to evolve far beyond human capabilities. These agents must have long-term, continuous experiences, and must be rewarded by real-world feedback, not human-defined scores.
For example, a health-coaching agent could be rewarded by real improvements in a user's heart rate, sleep, or activity levels rather than by a satisfaction rating, and a tutoring agent by the student's actual exam results.
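As a toy illustration of the loop Silver and Sutton describe, the sketch below has an agent improve purely from an environment-provided reward, with no human scoring involved; the environment here is a trivial numeric stand-in for the rich real-world signals they have in mind.

```python
import random

class Environment:
    """Stand-in for the real world: the grounded reward is simply how close
    the agent's action lands to a hidden target value."""
    def __init__(self):
        self.target = 0.7

    def step(self, action):
        return -abs(action - self.target)      # higher (less negative) is better

class Agent:
    def __init__(self):
        self.estimate = 0.0                    # current best guess

    def act(self):
        return self.estimate + random.uniform(-0.1, 0.1)   # explore a little

    def learn(self, action, reward, best_reward):
        if reward > best_reward:               # keep whatever worked better
            self.estimate = action

env, agent = Environment(), Agent()
best = float("-inf")
for _ in range(1000):                          # a long, continuous stream of experience
    action = agent.act()
    reward = env.step(action)                  # feedback comes from the environment, not a human rater
    agent.learn(action, reward, best)
    best = max(best, reward)

print(round(agent.estimate, 2))                # drifts toward the hidden target
```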
With this kind of training, AI may develop more efficient ways of thinking. Programming agents might invent new programming languages. The “world models” of tomorrow won’t be large multimodal LLMs—they’ll be agents trained on rich, environment-derived data.
This could be the next breakthrough after hitting the ceiling of reasoning models.