What can LLMs tell us about human language ability? – Part 2: How human language ability is like an LLM’s
We have seen how LLMs produce language through pattern recognition over a large training set, using conditional probabilities to predict the next word (or token).
And as Peterson points out, this is very similar to how humans create language, too.
If you pause to think about human language production, like speech, it is truly amazing. In fact, it is a bit creepy, like a ghost in the machine. As a competent speaker of a language, you open your mouth and language flows out, one word after another, very often with little—or apparently no—conscious thought.
How does this happen? Where is this ghost?
It turns out that how you produce speech is similar to how large language models produce text. After years of listening to and reading language, you have internalized a database, or training set, of language; you have subconsciously identified patterns of recurring words and parts of words; and, most of the time, you reproduce these patterns automatically.
So, how does this automaticity develop into the native-speaker fluency that most language learners aspire to?
When first learning a first or second language, you associate language with a context, and then repeat that language whenever you are in that context. So, when you learn Mandarin, you learn to associate ni hao (“hello”) with first seeing someone, or ma fan ni dao … (“please take me to …”) with finding yourself in a taxi. Since we re-live many contexts on a regular basis, we repeat the same language again and again.
But this does not tell the whole story about how we learn and use language.
It has long been recognized that the operation of language is highly efficient. Around 90 years ago, Zipf observed that a relatively small number of words in a language account for a disproportionate amount of usage. This has come to be known as a Zipfian distribution: the highest-ranked (i.e., most frequently occurring) word occurs approximately twice as often as the second most frequent word, three times as often as the third, and so on. In the Brown corpus of American English, for example, “the” is the most common word, accounting for about 7% of all word occurrences, followed by “of” at about 3.5% and “and” at about 2.8%.
This Zipfian distribution suggests a “principle of least effort”: humans gravitate toward the patterns that let them understand and produce language with as little processing work as possible.
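If you want to see the arithmetic, here is a minimal sketch in Python of the Zipfian prediction, seeded with the 7% figure for “the” from the Brown corpus example above. (The words at ranks four and five are my own illustrative guesses, not corpus facts.)

```python
# A minimal sketch of Zipf's law: frequency is roughly inversely
# proportional to rank, i.e. f(rank) ~ f(1) / rank.
# The 7% share for "the" comes from the Brown corpus example above;
# the words at ranks 4-5 are illustrative assumptions, not corpus facts.

top_word_share = 0.07  # share of all tokens taken by the rank-1 word, "the"

for rank, word in enumerate(["the", "of", "and", "a", "to"], start=1):
    predicted = top_word_share / rank
    print(f"rank {rank}: {word!r} -> predicted share {predicted:.1%}")
```

The observed Brown figures above (7%, 3.5%, about 2.8%) track this idealized 1/rank curve closely.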
Just like LLMs are trained on massive amounts of data, so are humans. Hart and Risley (1995) estimated that infants raised in middle-class households will have been exposed to about 4 million words before the age of two. At about two years of age, children experience a vocabulary explosion and can learn between two and ten words per day, or over 1,000 words per year, for the next few years. With repeated exposure to words and word combinations, grammar patterns (like “the” before nouns in English) and phrasal chunks (like “Good job!”, “Time for bed”, “Thank you”, “Don’t touch that!”) become stored in and retrieved from a child’s memory.
Sinclair, a pioneer in corpus linguistics, called this chunking process the “idiom principle”, and it aligns with the principle of least effort in human language processing and production. Sinclair noticed that much of the language we use is made up of prefabricated phrases or "chunks" that are stored in memory and retrieved whole, rather than generated anew from rules each time we speak or write.
The “idiom principle”, then, contrasts with the "open-choice principle", which sees language production as a process of choosing, at every point, from all the words and structures the language makes available. This would not be evolutionarily helpful. Imagine how non-fluent we would be if we had to choose every word from a near-infinite number of possibilities before uttering it!
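A quick back-of-the-envelope sketch makes the contrast vivid. (The vocabulary size and sentence length here are illustrative assumptions, not measured figures.)

```python
# A back-of-the-envelope contrast between the two principles.
# Vocabulary size and utterance length are illustrative assumptions.

vocabulary_size = 20_000   # a plausible active vocabulary for an adult speaker
utterance_length = 10      # words in a short sentence

# Open-choice principle: every word slot is filled from the whole vocabulary.
open_choice_space = vocabulary_size ** utterance_length
print(f"open choice: {open_choice_space:.2e} candidate word sequences")  # ~1.02e+43

# Idiom principle: much of the utterance is retrieved whole from memory.
stored_chunks = ["Good job!", "Time for bed", "Thank you", "Don't touch that!"]
print(f"idiom principle: pick 1 of {len(stored_chunks)} stored chunks")
```

Retrieving a stored chunk is one memory lookup; assembling the same utterance under open choice means navigating a space of roughly 10^43 candidate sequences.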
Clearly, a statistical view of language that emphasizes the importance of frequency, repetition, and patterning in language is more useful.
At the level of individual words and phrases, the statistical co-occurrence of words trains your own neural pattern-recognizers, so that when you speak or write, you often automatically (read: subconsciously) make next-word predictions similar to a large language model’s. Hoey called this next-word-prediction process “lexical priming”.
This priming affects not just the likelihood of a word being used in a specific context but also influences collocations (words that frequently occur together), colligations (grammatical patterns associated with specific words), semantic associations, and pragmatic associations (how words are used in social contexts).
According to Hoey, these primings are the result of our exposure to language over time and shape our expectations and production of language in a probabilistic manner. “Lexical priming” is seen as a mechanism that underlies our ability to produce and understand language fluently with the least effort. This also reflects a statistical understanding of language acquisition and use.
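To make this concrete, here is a toy sketch of priming as a bigram model: count which words follow which, then turn the counts into next-word probabilities. (The mini-corpus is invented, and real lexical priming, like a real LLM, is vastly richer than word pairs.)

```python
from collections import Counter, defaultdict

# A toy bigram model of "priming": next-word expectations learned from
# repeated exposure. The mini-corpus below is invented for illustration.

corpus = (
    "good job . time for bed . thank you . thank you very much . "
    "good job . time for bed . good morning ."
).split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def next_word_probabilities(word):
    """Estimate P(next | word) from co-occurrence counts."""
    counts = following[word]
    total = sum(counts.values())
    return {nxt: round(n / total, 2) for nxt, n in counts.items()}

print(next_word_probabilities("good"))   # {'job': 0.67, 'morning': 0.33}
print(next_word_probabilities("thank"))  # {'you': 1.0}
```

Even at this toy scale, repeated exposure does the work: hear “thank you” often enough, and “you” becomes the automatic, least-effort continuation of “thank”.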
So far, various linguistic theories support Peterson’s use of LLMs as an analogy to illustrate how the human mind creates and understands language, emphasizing the structured and non-arbitrary nature of our linguistic and cognitive processes. LLMs work by calculating conditional probabilities, predicting the likelihood of one word following another based on vast databases of text. This process seems to mirror how humans learn and use language, where certain words, language chunks, and ideas are statistically more likely to prime each other based on our experiences and the contexts in which we encounter them.
This is a bottom-up model that accurately describes both LLMs and our real-time use of language in most forms of oral and written communication. Bottom-up here means two things: from the language perspective, it takes letters and words as building blocks for larger structures like paragraphs or reports; from the cognitive perspective, it implies a sub-conscious process that is usually beyond conscious control.
But what about for more complex forms of communication, like stories and business reports?
This is actually Peterson’s main interest. He wants to use LLMs to demonstrate that the similar stories we encounter across cultures and history are not arbitrary, as he claims postmodernists argue, but are empirically validated by the functioning of large language models when they create coherent texts and stories.
It is not my intention to enter this larger debate about deep social and cultural meanings, but to show how Peterson has perhaps overextended the analogy between LLM language use and human language use.
Unlike LLMs, which can leverage tremendous computational power to compute conditional probabilities for next-, fourth-, or tenth-word prediction and so create highly complex texts, humans do not have the cognitive bandwidth to do this. Our attention and memory (working, short-, and long-term) are limited, finite resources. As a result, humans who create more complicated texts typically cannot do so bottom-up in real time; they need to plan and revise the text in a protracted, iterative process.
This is where the analogy breaks down and where LLM language radically diverges from human language. This is also where Peterson’s view becomes problematic if we want to better understand human language ability and whether archetypes and narrative structures are embedded in bottom-up subconscious processes.
In the next article, we will break down language competence to see how alien LLM language ability is compared to a human’s, but also look at ways LLMs can serve as tools within this competence framework to enhance human language competence, for both native and non-native speakers.