Artificial General Intelligence Doesn’t Matter – Superintelligence Does
Given the pace of improvement of AI systems, and their now pervasive presence in day-to-day life, it’s tempting to forget that all of this really came to a head in 2022. It was with the release of ChatGPT by OpenAI, powered at the time by GPT-3.5-turbo, that AI began to overtly change the world.
ChatGPT was the first high-visibility, publicly available AI model that seemed to understand people when they talked to it. Not just parroting a pre-set list of responses or pre-learned phrases, but generating conversation in a meaningful way.
It amazes me that in three short years we, as a society at least, have seemingly forgotten the leap forward that represented. Not just in terms of technology, but in how emotive it was for a lot of people. Computers had, in a very real way, come to life. What had been science fiction just a couple of years prior was now science fact.
Suddenly an entire universe of potential became available: AI as a useful tool for the masses, not just for large social media companies doing content recommendation. Movies like ‘Her’ (released in 2013), about a man forming a relationship with his AI operating system, suddenly felt like prophecy[1] rather than fiction. Commentators at the time indicated we were likely decades away from AI as capable as that portrayed in the movie. Human progress apparently had other ideas.
The history of this type of AI is of course more complicated than that. The writing has been on the wall since the 1950s and 60s; by the mid-1960s, Joseph Weizenbaum at MIT had built ELIZA, which used pattern matching to perform basic Natural Language Processing (NLP) tasks. The 90s were dominated by n-gram models, which used large text corpora (from the web) to predict word sequences. We just didn’t have the computing power or memory to really scale these ideas up.
These early models were somewhat cool, but ultimately painful to interact with. Anyone who dealt with the little paper-clip assistant (Clippy) that Microsoft used for years as its ‘help’ function has a sense of how they worked, and how effective they were.
Then in 2017, the transformer architecture was born. Ashish Vaswani and colleagues, at the time working at Google, published the paper ‘Attention Is All You Need’, laying out the concept. This led in 2018 to GPT-1 (a Generative Pre-trained Transformer, or ‘GPT’), which deployed the architecture at scale[2], and later that year to a Google model called ‘BERT’, a cute acronym hiding a mouthful of an actual name: “Bidirectional Encoder Representations from Transformers.”
This was followed by OpenAI releasing GPT-2 in 2019 (the first one that could really talk, to a degree), and then GPT-3 in 2020. By the time 2023 rolled around and GPT-4 was released within ChatGPT, with multimodal capabilities (comfortable working with images as well as text), modern AI systems were thrust into public discourse. There was no denying that something had happened, even if there was no clear agreement as to what, or how much it mattered.
Following along with a YouTube video, I built a copy of GPT-2 (it was open weights) from scratch in Excel. To be clear, I absolutely hate mathematics (outside of financial modeling in Excel), but when I interacted with GPT-4 for the first time, I was driven to really try and understand how these systems worked.
Unfortunately, that required formal mathematics: a lot of matrix multiplication, and a lot of partial derivatives. What shocked me the most was that the mathematics in these systems was so simple that even I could understand it. If I was confused before, understanding the mathematics involved turned my confusion into a combination of awe and some existential dread. There was so much that was non-obvious to me:
1) How the heck can a stack of matrix multiplications and very simple functions (activations, etc.) seemingly understand context, grasp what I’m saying, and respond intelligently?
2) Why is it that the performance of the models improves with the scale of data and compute? (Inference-time scaling was not a thing at the time.)
3) How does a series of mathematical operations perform better (subject to prompting) than most humans at Theory of Mind[3] tasks? (Formal theory of mind: recognizing itself as discrete from me and from others, and being able to put itself in others’ shoes.)
4) How does a model trained entirely on text seemingly develop a visual understanding of what a pink unicorn looks like, to the point that it can draw one, when images of such representations were entirely absent from its training data?[4]
5) How does a sufficiently large model get a sense of certainty about how right or wrong it is (known as calibration)?[5] I have yet to have anyone describe certainty to me without invoking the qualia (the internal sensation) of certainty.
6) How is it that machines can learn through things like gradient descent in the first place?
Some of this comes down to the ‘big bet’ that the GPT architecture made. I’m not going to go into the math (I did that in a prior article about the Pink Unicorn[6]), but rather the concepts.
The idea is that you train a model on a whole lot of text (it doesn’t have to be text, but in this case it was). You do this training in two stages:
1) Unsupervised Pre-Training
2) Supervised Fine-Tuning – most frontier LLMs also go through Reinforcement Learning from Human Feedback (RLHF) and/or Direct Preference Optimization (DPO) so they become ‘friendly’ and useful.
The best way to think about unsupervised pre-training is as a form of compression. The model is tasked with ‘predicting the next token’ in a sequence of tokens (a sentence). This objective is chosen because it is very easy to calculate how wrong the model is, and then nudge its parameters (scaled by a learning rate) towards the more correct answer. This can be done ‘automatically’, and so doesn’t require ‘supervision’ by humans, hence ‘unsupervised’.
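To make that concrete, here is a minimal sketch of one such training step in Python (using PyTorch, with a deliberately tiny, made-up model and fake data): predict the next token, measure how wrong the prediction was, and nudge every parameter a little in the direction that reduces that error. Real frontier training runs differ enormously in scale and engineering, not in principle.

```python
# A minimal sketch of one unsupervised pre-training step. The vocabulary,
# model and "data" below are toy stand-ins, purely to show the mechanics.
import torch
import torch.nn as nn

vocab_size, embed_dim, context = 100, 32, 8

# A deliberately tiny "language model": embeddings -> hidden layer -> a score
# (logit) for every token in the vocabulary.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),                          # concatenate the context embeddings
    nn.Linear(context * embed_dim, 128),
    nn.ReLU(),
    nn.Linear(128, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # the learning rate
loss_fn = nn.CrossEntropyLoss()

# A batch of 4 contexts (8 tokens each) and the token that actually came next.
# No human labelling is needed -- the "answer" is just the next word in the
# corpus, which is why this stage is called unsupervised (or self-supervised).
tokens = torch.randint(0, vocab_size, (4, context))
next_token = torch.randint(0, vocab_size, (4,))

logits = model(tokens)               # the model's guesses for the next token
loss = loss_fn(logits, next_token)   # how wrong was it?
loss.backward()                      # which direction should each parameter move?
optimizer.step()                     # nudge them, scaled by the learning rate
optimizer.zero_grad()
print(f"loss after one step: {loss.item():.3f}")
```

Repeat that loop over trillions of tokens and you have pre-training; everything else is scale.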
When the model has seen enough text, within its neurons and hidden layers, it builds up a mathematical ‘representation’ of how words and sentences are put together by human authors. It compresses all the text into a set of mathematical formulas that, rather than ‘remembering’ all the text, can calculate what the text should be.
Think of it as the difference between remembering the square root of every number vs. remembering the formula for calculating the square root of any number. If you know the formula, you can calculate the square root of any number without remembering any of them. You have a general formula for square roots. You have compressed every possible square root into a much smaller (possibly the smallest possible) computation. (The shortest program that can reproduce a given body of data would be perfect compression of it; the length of that shortest program is known as its Kolmogorov complexity.)
To test this idea out for myself, I created a very simple three-layer, 12-parameter neural network in Excel (a multi-layer perceptron, or MLP). I wrote a macro that handled pre-training for me. It generated a set of 10,000 two-number pairs, each number between 1 and 10 (e.g., 2 and 5, 1 and 6, etc.). I used Excel’s square root function to calculate the correct square root of each pair’s sum (to ultimately calculate a loss function) and then ran a very simple gradient descent (not stochastic, no complex optimization functions) as the optimizer.
After 20,000 training iterations, starting from a set of randomly initialized parameters, it had learned how to calculate the square root of the sum of any two numbers (between 1 and 10) to within three decimal places.
It had compressed all the combinations of numbers and answers that it had seen into a set of mathematical formulas that could just do square roots. It didn’t have to remember the tens of thousands of numbers it had seen, just represent them mathematically. I’m happy to send anyone who is interested a copy of the sheet. It was at this point I understood fundamentally how pre-training works.
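For readers who would rather see code than an Excel macro, here is a rough Python/NumPy sketch of the same idea. To be clear, this is not my actual spreadsheet: the layer sizes, activation, and learning rate are illustrative, and exactly how far it generalizes will depend on those choices.

```python
# A rough NumPy analogue of the Excel experiment: a tiny MLP trained by plain
# (full-batch) gradient descent to predict sqrt(a + b) for a, b between 1 and 10.
# The architecture and hyperparameters are illustrative, not the 12-parameter
# network described above.
import numpy as np

rng = np.random.default_rng(0)

# Training data: 10,000 random pairs; the target is the square root of their sum.
X = rng.uniform(1, 10, size=(10_000, 2))
y = np.sqrt(X.sum(axis=1, keepdims=True))

# One hidden layer of 8 units with a smooth, ReLU-like (softplus) activation.
W1 = rng.normal(0, 0.3, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 0.3, (8, 1)); b2 = np.zeros((1, 1))
lr = 0.005

def softplus(z):
    return np.logaddexp(0.0, z)

for step in range(20_000):
    # Forward pass: hidden layer, then a single output neuron.
    z = X @ W1 + b1
    h = softplus(z)
    pred = h @ W2 + b2
    err = pred - y                       # how wrong is the network right now?

    # Backward pass: hand-rolled gradients of the mean squared error.
    n = len(X)
    dW2 = h.T @ err / n
    db2 = err.mean(axis=0, keepdims=True)
    dh = (err @ W2.T) * (1 / (1 + np.exp(-z)))   # derivative of softplus = sigmoid
    dW1 = X.T @ dh / n
    db1 = dh.mean(axis=0, keepdims=True)

    # Plain gradient descent: nudge every parameter against its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Try it on pairs it never saw in training, including sums beyond the training range.
X_test = np.array([[3.5, 6.5], [12.0, 13.0]])
pred = softplus(X_test @ W1 + b1) @ W2 + b2
print("learned:", pred.ravel(), "true:", np.sqrt(X_test.sum(axis=1)))
```

The network never stores any of the pairs it saw; after training, its handful of weights simply compute something close to the square-root function.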
The big bet of the transformer architecture came from the realization that the ‘formulas’ for how humans think, write, and ultimately consciously act (assuming they pre-plan, of course) are discoverable by paying attention to the relationships between words, and how important they are to each other. Each layer of a GPT is essentially paying attention to how words relate to one another through slightly different lenses. It then uses the relative importance of those words to one another in a sentence to predict the next one during training.
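At the heart of that ‘paying attention’ is a surprisingly small piece of math called scaled dot-product attention. The sketch below shows a single attention head in plain NumPy; the word vectors and projection matrices are random toy stand-ins rather than anything taken from a real model.

```python
# One head of scaled dot-product attention, the core operation of a transformer.
# All numbers here are random stand-ins for learned word representations.
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_head = 5, 16, 8        # e.g. a 5-word sentence

x = rng.normal(size=(n_words, d_model))    # one vector per word

# Learned projections (random here) turn each word into a query ("what am I
# looking for?"), a key ("what do I offer?") and a value ("what do I carry?").
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Score how relevant every word is to every other word...
scores = Q @ K.T / np.sqrt(d_head)

# ...mask out future words so each position can only look backwards
# (this is what makes a GPT a *next*-token predictor)...
mask = np.triu(np.ones((n_words, n_words), dtype=bool), k=1)
scores[mask] = -1e9

# ...and turn the scores into attention weights that sum to 1 for each word.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each word's new representation is a weighted blend of the words it attended to.
out = weights @ V
print(weights.round(2))   # how much attention each word pays to the others
print(out.shape)          # (5, 8): a new vector per word
```

A full GPT stacks dozens of layers, each with many such heads looking at the sentence through their own ‘lens’, and that is essentially the whole trick.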
Through this method of compression, the text itself is (mostly) not captured (sorry, New York Times copyright lawsuit); instead, the model houses mathematical formulas that can calculate the outcome of human thought from a given input.
Which is why, when people say that these models ‘just predict the next token’, those people are completely wrong. Next-token prediction is the method used during training to compress the mechanisms of how humans think into formulas that can be applied generally.
Or, to go back to my simple Excel neural network: it was trained to correctly guess the square root of the sum of two random numbers between 1 and 10. In the process of doing so, it developed general math for computing square roots beyond the numbers it had been trained on.
I confirmed this by running it against a ‘test’ set: pairs of numbers it had never seen in training (i.e., numbers greater than 10). It remained accurate to within one or two decimal places for numbers as high as 100. So it’s not just ‘predicting’ an expected number at that point. It’s doing math.
Similarly, when you ‘talk’ to a GPT, it’s not just ‘predicting the next token’; it’s calculating what to say, based on complex mathematical representations of how humans think and of what an appropriate response is given the circumstances. That sounds fairly simple until you start to think through all of the things required to have a meaningful conversation about something as simple as two people taking a walk in a park together.
Let’s say you’re telling ChatGPT a story about Alice and Bill who are walking through a sunny park. Let’s also assume it’s a very simple story. “Alice and Bill are walking through a park on a sunny day, when they see a squirrel in a tree, looking at them. Bill then turns to Alice and says, “I wonder what it wants?”, then Alice says, “Perhaps some food, like some nuts or something.” – The End”.
To converse intelligently about that very simple story, here are some of the things that the model must have a representation of:
- That the model itself is separate from Bill, Alice, the squirrel, and you as the person asking the question; otherwise the answers it gives would be a blurred commingling of all of the entities. Similarly, it has to know that each entity is aware of being separate from all the other entities in the story too.
- That Alice and Bill are separate people with different perspectives
- That the squirrel has a unique perspective of its own.
- That squirrels, at least stereotypically (though perhaps not unfairly stereotypically) like nuts.
- What sunny weather is, and how that may impact human behavior in the scenario.
- Narratively what may come next
- Etc.
The list goes on, and on and on, because you could ask the model anything about that scenario that you wanted to.
What’s remarkable is that the models we have now can do all of those things, and do them better than humans on several benchmarks. These abilities are known as ‘emergent’ capabilities. They emerge from a very large set of data and a very large volume of very small and simple calculations, in conjunction with an optimizer (gradient descent, Adam, AdamW, etc.). They’re called emergent because they aren’t inherently predictable from either the inputs or the methods used in the math that works on them.
There are myriad examples of these from recent history:
1) In-context, few-shot learning: a model picking up a brand-new task from demonstrations alone, without any further training or fine-tuning (a sketch of what such a prompt looks like follows this list). For instance, GPT-3 (175 billion parameters) went from near 0% accuracy to over 80% accuracy on a SuperGLUE multiple-choice task with just 32 in-context examples.[7] The threshold for this emerging in models is thought to be around the 100-billion-parameter mark.
2) Chain-of-thought reasoning: a model that reasons things out on its own ‘scratch pad’. PaLM 62B (62 billion parameters) essentially could not do this; in PaLM 540B (540 billion parameters) the capability emerged, with accuracy rising from 18% to 58%. The only difference was the scale of the model (how many parameters it has).[8]
3) Discontinuous jumps on many tasks: models that suddenly improve massively on a wide array of tasks by virtue of having more parameters. The tipping point is thought to be around 500 billion parameters.[9]
4) Theory of Mind tasks: as we discussed already, GPT-4 solved 90% of ToM probes, while GPT-3.5-turbo only solved 20% of them.[10]
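Coming back to item 1: a few-shot prompt really is just worked examples pasted into the context ahead of the new question. Here is a toy illustration in Python; the task and examples are invented, and the point is that none of the model’s parameters change – the ‘learning’ happens entirely inside the context window.

```python
# A toy few-shot prompt. The "training" is just demonstrations placed in the
# prompt; the model's weights are never updated.
demos = [
    ("The movie was a tedious, overlong mess.", "negative"),
    ("An absolute joy from start to finish.", "positive"),
    ("I checked my watch every five minutes.", "negative"),
]
query = "I was grinning the whole way through."

prompt = "Label each review as positive or negative.\n\n"
for review, label in demos:
    prompt += f"Review: {review}\nLabel: {label}\n\n"
prompt += f"Review: {query}\nLabel:"

print(prompt)
# The string above is sent to the model as-is, and a sufficiently large model
# simply continues the pattern (here, with "positive") -- no fine-tuning involved.
```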
The list could go on, and on. At certain scales (parameters and compute), models just take a large step up in certain areas: calibration (having a sense of how ‘right’ they feel they are), or multi-step planning within the hidden layers (in ordinary models, not just explicit reasoning models), as shown by Anthropic. In studying how Claude 3.5 worked, they found that when asked to compose a rhyming poem, it picked candidate rhyming words for the end of line two as soon as it had the first line, and then wrote towards that rhyme.[11]
To anyone still reading this, it would be forgivable to ask: what does any of this have to do with Artificial General Intelligence (AGI) and superintelligence? For this, we have to dive into what is, in my opinion, a misconception: that AGI is somehow a prerequisite for superintelligence across enough domains to be generally useful.
Firstly, there is no consistent working definition of AGI. OpenAI sees it as AI that can do most economically valuable human work; others see it as ‘human-like’ intelligence, which of course carries in the door with it the baggage of consciousness, sentience, qualia, and feelings.
I’m going to attack the second idea first, because I think it misses the mark. Whether LLMs instantiate phenomenal consciousness is undecidable today, but functionally, they already tick many of the behavioral boxes we once thought required it.
Philosophers have been debating the nature of consciousness for literally thousands of years, and in the last two hundred or so have been supported by an absolute army of neuroscientists around the world. Despite this effort, we as a species have absolutely no working, provable theory of what consciousness is.
Picture red in your mind’s eye. Where is that red, physically? Where is the ‘audience’ (you) visualizing that red? Who or what is aware of the red you are seeing, and where does that thing live? We have absolutely no idea. This is the ‘Hard Problem’ of consciousness as posed by David Chalmers: we could know everything about the brain and how it works, and still not have a place to put consciousness.
In fact, we can’t even identify where, or how, consciousness impacts the physical world. If we can’t ground it physically, then it can’t have a physical effect on the world. Consciousness may in fact just be a side effect of processing information about the world and about yourself simultaneously, and so quickly, that for some reason (one we don’t understand) a feeling or sensation arises.
Now sure, if you are a proponent of dualism and believe in the soul, then you can say, ‘well, that’s the soul’, but then we run into the same problem. That soul cannot be located physically or observed by any physical law, and thus can have no physically modellable impact on the world at large. It is, by virtue of its mystical nature, not a participant but at best an observer.
So, when people start trying to tie consciousness to the concept of AI, it tends to lose me. Current LLMs are certainly not sophisticated enough in architecture to have anything resembling human experience. That said, they exhibit behaviors that we normally attribute to conscious entities, without necessarily having such internal experience:
1) Theory of mind – which typically requires a sense of self to distinguish ‘you’ from ‘others’.
2) Self-recognition – show an LLM text that it has written and ask it who wrote it, and it will recognize itself: a kind of virtual mirror test.
3) Self-reflection – when you look at the reasoning traces of an agentic AI system, it will reflect on what it’s done, what worked, what didn’t work and what it could do differently.
4) Agency – an inherent understanding that you are separate from the world at large and can act on that world. If anyone doubts it, go and watch Claude play Pokémon over on Twitch.
5) Self-consistency – models sampling multiple rationales and choosing the majority answer, a crude form of ‘meta’ cognition that is present even in non-reasoning models.
6) A sense of certainty – LLMs can attach a probability to each answer, and with simple calibration tricks (e.g., temperature scaling or isotonic regression) those probabilities track the ground truth surprisingly well (a minimal sketch of temperature scaling follows this list). Humans, by contrast, report certainty as a gut feeling, a quale. The fact that a purely mathematical procedure can approximate that feeling-of-knowing undercuts the claim that subjective experience is required for confidence.
7) In-context self-reflection and learning – when given the opportunity, models will typically think through what they did wrong and come up with a new approach.
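As an aside on point 6, temperature scaling really is as simple as it sounds. The sketch below (plain NumPy, with random logits and labels standing in for a real validation set) fits a single number T that rescales the model’s raw scores so that its stated confidence lines up better with how often it is actually right.

```python
# A minimal sketch of temperature scaling: fit one scalar T that rescales a
# model's raw scores (logits) so its confidence matches its actual accuracy.
# The logits and labels are random stand-ins for a real validation set.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 4                                  # validation examples, classes
logits = rng.normal(scale=4.0, size=(n, k))     # deliberately over-confident scores
labels = rng.integers(0, k, size=n)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(T):
    """Average negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(n), labels] + 1e-12).mean()

# Fit T with a simple grid search (real implementations typically use a few
# steps of gradient descent instead). Dividing by T never changes which answer
# the model picks -- only how confident it claims to be.
temps = np.linspace(0.25, 10.0, 200)
T_best = temps[np.argmin([nll(T) for T in temps])]

confidence_before = softmax(logits).max(axis=1).mean()
confidence_after = softmax(logits / T_best).max(axis=1).mean()
accuracy = (softmax(logits).argmax(axis=1) == labels).mean()
print(f"T = {T_best:.2f}")
print(f"avg confidence before: {confidence_before:.2f}, "
      f"after: {confidence_after:.2f}, actual accuracy: {accuracy:.2f}")
```

With random labels, the ‘model’ above is only right about a quarter of the time, and the fitted temperature duly drags its average confidence down towards that accuracy, which is exactly the behavior calibration is after.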
Which brings me to the crux of the point I am attempting to make. If, functionally, a model can act with agency, act as if it is separate from the world, plan, use tools, pursue an agenda, reflect on its mistakes, recognize itself, and so on, then functionally it is acting as if it were conscious.
Whether it has internal experience or not is, at this point, as irrelevant and intractable a question as it is in humans. For the record, I don’t believe that models have an ‘internal subjective experience’; rather, it doesn’t matter whether they do or not, because they can simulate having one so well that it becomes a moot point with respect to how they think, act, and will act.
If you buy that point, then buying the next point becomes easier. We’re already seeing superintelligent capabilities in LLMs today. OpenAI’s o3 model achieved an IQ of 119 on a sight-unseen IQ test, and 132 on an IQ test that may have been in its training data[12]. That puts it somewhere between the top 10% and the top 2% of the human population. Now, one can agree or disagree with the validity of IQ testing in general, but the point is that the model already does significantly better than most humans without achieving whatever the heck AGI means to whoever the heck writes down the definition.
OpenAI’s ‘Deep Research’ significantly outperforms human researchers on a huge swathe of research tasks, and does so in minutes, rather than hours and days.
Frontier LLMs already outperform most humans at common-sense reasoning (HellaSwag), Python coding (HumanEval), the ARC Challenge, BIG-Bench Hard, and SuperGLUE, and are progressing rapidly on benchmarks like FrontierMath. That’s an emerging superintelligence that has not had to pass through the ‘step’ of AGI first.
Why does this matter?
‘Why does any of this matter?’ is a fair question, and in my view the answer is alignment and safety. This ‘gate’ that has been put in front of potentially dangerous, out-of-control superintelligence, in the form of AGI, is in my mind a fallacy.
We have autonomous agents now that can perform hugely complex tasks. OpenAI is working on ‘AI engineers’ – AIs that engineer AIs – and they are performing increasingly well. How far away are we from a world where engineers are supervising AI agents that perform tasks so quickly they can barely keep up, tasks that they couldn’t understand even if they were explained, and then deploying those solutions to the world in the interests of remaining competitive?
Labs are now training LLMs and other architectures on direct physical observation of the world through video, audio, etc., and evidence is mounting that these models are developing accurate physical world models: ones in which they can identify, work through, and act on the laws of physics. They are compressing the world into math and making connections that humans can’t necessarily see (such as the possible folds of proteins built from amino acids we’ve been studying for over 100 years).
If one believes that we can compress data, and from that compression recover formulas that allow general understanding to be applied accurately to all similar examples, then the sheer efficiency of AI algorithms and their optimizers makes superintelligence a certainty. AGI (in the form it is commonly invoked) is far from a certainty, but that doesn’t matter.
So, imagine this scenario. AI Agent 19 has been developed by a lab. It is superintelligent in the sense that it is a match for humans in every domain and exceeds humans at optimizing AI training, to the point that it can train better AI models than humans can. It has done this by observing every training run, optimizing, and learning through self-taught reinforcement learning. It is coming up with training approaches that we can see result in better models than we could produce, but we don’t really understand why.
That model behaves as if it has agency (it uses tools, talks to researchers, comes up with alternatives, re-works its strategy), acts as if it has, or at least simulates, a sense of self and self-awareness, and relentlessly pursues what it believes its goals to be.
The models it produces are then used to generate the next generation of models, to the point that we then have models upon models, most of which we don’t understand, but all of which are becoming increasingly intelligent, each being fed different types of data for different task domains.
AlphaGo, for instance, was awesome at Go. Now imagine AlphaChess, AlphaPoker, etc. Then put even a GPT-4o-level director in that mix, give it access to the internet, and charge it with making as much money playing these games competitively as possible. It would pursue that goal relentlessly and would, by using its other models as tools, act as if it were superintelligent at every type of game. No AGI required. Just a competent director and some narrow-domain superintelligence, and you have an unbeatable system.
My concluding thought is this: AGI is a false gate. A security blanket. Something we can define in such a way that we can’t even prove that we as humans have it. In the background, however, without ever reaching that golden definition, superintelligent and potentially highly dangerous (or highly beneficial) systems can develop that relentlessly pursue goals which may be misaligned with ours. We’ll all feel safe and happy because ‘well, they don’t have AGI’, right up to the point the lights go out.
The canary in the coal mine isn't AGI. The canary in the coal mine is the increasing emergence of broadly superintelligent models (or groups of them).
[2] Radford et al., “Improving Language Understanding by Generative Pre-Training” (2018)
[7] Brown et al., “Language Models are Few-Shot Learners” (2020)
[8] Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022)
[9] Wei et al., “Emergent Abilities of Large Language Models” (2022)
[10] Kosinski, “Evaluating Large Language Models in Theory of Mind Tasks,” PNAS (2024)