How Probability Functions Relate to Large Language Models (LLMs)

The other day I read a thought-provoking article exploring our increasing reliance on AI. The article presented a fascinating perspective: while AI tools are powerful, are we becoming too dependent on them at the expense of our critical thinking abilities?

In that spirit, this article demystifies AI and Large Language Models for non-technical readers, aiming for a balance between accessibility and (some) technical accuracy.

I think we need to maintain our fundamental human capacities - creativity, critical thinking, and intellectual curiosity (the essential 'wait, but why?' mindset).

 

What's the magic?

Despite the movies and the hype, Large Language Models such as ChatGPT are fundamentally based on probability theory and statistics. When training a language model, you are essentially estimating the probability of one word following another in a given context.

 

Crystal ball a.k.a. probability distributions

Close your eyes and imagine a machine that can tell you the probability of a certain event happening given some kind of input. Now open them: whoa! Probability Distributions!!!!

From measurement errors in blood gas analyzers, to the default rate on the loan for that luxury sports car, to the occurrence of genetic mutations along a DNA sequence, to predicting the next word in language models – all are governed by underlying statistical distributions.


Divide and conquer – Tokenization


A token is a fundamental unit of text, such as a word, part of a word, punctuation mark, or special character, that computers use to analyze and generate language. By breaking down text into these smaller pieces, computers can better understand and process human language, enabling tasks like chatbot responses and language translation.

In an LLM, each token has an associated probability distribution over the possible tokens that can follow it. This distribution is updated and refined during the training process.
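As a rough sketch of the idea (the tokens and probability values below are invented just for illustration, and real LLMs compute these distributions with neural networks rather than lookup tables):

```python
# Toy illustration: a "model" as a table of next-token probability distributions.
# All tokens and probabilities here are invented for the example.
next_token_probs = {
    "Argentina": {"is": 0.6, "won": 0.3, "plays": 0.1},
    "is": {"the": 0.5, "a": 0.3, "problem": 0.2},
}

# The distribution after "Argentina": each value is P(next_token | "Argentina"),
# and the values sum to 1.
print(next_token_probs["Argentina"])
```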


Maximum Likelihood training

The training of LLMs is usually performed using a maximum likelihood approach. Remember probability distributions? You should, it's three paragraphs above. Anyway, this basically means that the model adjusts its parameters to maximize the probability of the training text.

In mathematical terms, it seeks to maximize the likelihood function, which is the product of the probabilities of all token sequences in the training dataset.
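For readers who don't mind a little notation, this is the standard textbook way to write that objective (the symbols below are generic notation, not tied to any specific model):

```latex
% L(theta): likelihood of the training tokens w_1 ... w_T given model parameters theta.
% Training maximizes this product, or equivalently its logarithm (a sum).
L(\theta) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}; \theta)
\qquad
\log L(\theta) = \sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}; \theta)
```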


Practical training example

Already bored with too much theory? Me too, so let's see a practical example.


Now you are training a very simple language model to predict the next word in a sentence. Your training dataset consists of just two sentences:

1. "Argentina is the current world cup champion"

2. "All life is problem solving"


Step 1 – tokenization



 Break down each sentence into tokens (words):

Sentence 1: ["Argentina", "is", "the", "current", "world", "cup", "champion"]

 Sentence 2: ["All", "life", "is", "problem", "solving"]
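A minimal sketch of this step in Python, splitting on whitespace (real tokenizers are more sophisticated and usually work with sub-word pieces):

```python
# Word-level tokenization by whitespace (real LLM tokenizers split into sub-words).
sentences = [
    "Argentina is the current world cup champion",
    "All life is problem solving",
]
tokenized = [s.split() for s in sentences]
print(tokenized[0])  # ['Argentina', 'is', 'the', 'current', 'world', 'cup', 'champion']
print(tokenized[1])  # ['All', 'life', 'is', 'problem', 'solving']
```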


Step 2 – probability estimation


Initially, our model doesn't know the probabilities. Let's assume it starts with random probabilities for simplicity.

For example, it might initially think the probability of "is" following "Argentina" is 0.1, "life" is 0.1, etc.
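Here is a small sketch of what those random starting probabilities could look like in code; the vocabulary and the random initialization scheme are assumptions made purely for this toy example:

```python
import random

# Vocabulary taken from the two training sentences.
vocab = ["Argentina", "is", "the", "current", "world", "cup",
         "champion", "All", "life", "problem", "solving"]

random.seed(0)  # for reproducibility of the toy example

# For every token, start with random scores over all possible next tokens,
# then normalize so each row is a valid probability distribution (sums to 1).
probs = {}
for prev in vocab:
    scores = {nxt: random.random() for nxt in vocab}
    total = sum(scores.values())
    probs[prev] = {nxt: s / total for nxt, s in scores.items()}

print(round(probs["Argentina"]["is"], 3))          # some arbitrary starting value
print(round(sum(probs["Argentina"].values()), 3))  # 1.0
```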


Step 3 – likelihood function calculation



Imagine you're a baker perfecting a cake recipe. Your training data is a list of ingredients and quantities from successful cakes. Your recipe is like the model's parameters. Each time you bake a cake, you taste it to see how close it is to perfect—this is like calculating the likelihood.

Based on your taste (likelihood), you adjust the recipe, reducing sugar if it's too sweet or adding more flour if it's too dense. These adjustments are like updating model parameters to maximize likelihood.

After many attempts, you find the perfect balance of ingredients that consistently produces a delicious cake, similar to maximizing the likelihood function—finding the model parameters that make the training data most probable.

In more formal terms, the likelihood function calculates the probability of the entire training dataset given the model's current parameters.

Going back to our model, for sentence 1, the likelihood is the product of the probabilities of each token sequence:

 

P(Argentina, is, the, current, world, cup, champion) = P(Argentina) · P(is | Argentina) · P(the | is) · P(current | the) · P(world | current) · P(cup | world) · P(champion | cup)


Similarly, for Sentence 2:


P(All, life, is, problem, solving) = P(All) · P(life | All) · P(is | life) · P(problem | is) · P(solving | problem)

 

The symbol | indicates a conditional probability and basically means "given that". It's like asking, "What are the chances of seeing 'is' if we already saw 'Argentina'?"

The expression uses the chain rule of probability, which states that the joint probability of a sequence of events can be expressed as the product of the initial probability and the conditional probabilities of each subsequent event given the previous ones. (Strictly, the chain rule conditions each word on all the words before it; our toy example simplifies this by conditioning only on the immediately preceding word.)
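To make that concrete, the sketch below evaluates the product for Sentence 1 under this simplified, previous-word-only view, using made-up probability values (in a real model these numbers would come from the current parameters):

```python
# Made-up probabilities P(next | previous) for Sentence 1.
# In a real model these numbers come from the current parameters.
bigram = {
    ("Argentina", "is"): 0.6,
    ("is", "the"): 0.5,
    ("the", "current"): 0.4,
    ("current", "world"): 0.7,
    ("world", "cup"): 0.8,
    ("cup", "champion"): 0.9,
}
p_first = 0.5  # assumed P("Argentina") as the first word

tokens = ["Argentina", "is", "the", "current", "world", "cup", "champion"]

likelihood = p_first
for prev, nxt in zip(tokens, tokens[1:]):
    likelihood *= bigram[(prev, nxt)]

print(likelihood)  # probability of the whole sentence under these toy numbers
```

Training nudges the numbers in that table so that this product (computed over all sentences in the dataset) gets as large as possible.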

After training, the model's probabilities will reflect the patterns in the training data.

It will be better at predicting the next word in a sentence, because it has maximized the likelihood of the training text.

By maximizing the likelihood function, the model learns to assign higher probabilities to token sequences that are more common in the training data, thereby improving its ability to generate and understand text.

  

Nerd stuff (you can skip it)


Softmax and Logits



Logits are raw, unnormalized scores output by the model. So, we need to transform them into probabilities: values between 0 and 1 that sum to 1.

The softmax function converts logits into probabilities, turning raw model outputs into interpretable values. It also ensures that all probabilities are positive and sum to 1, which makes it easy to compare and rank the different possible outputs.
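A minimal sketch of the softmax computation in Python (the logit values are arbitrary, chosen just to show the shape of the calculation):

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max is a standard trick for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]             # arbitrary raw model scores for three tokens
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 3))          # 1.0
```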

 

Sampling Methods




Sampling methods in LLMs are techniques used to select the next word or token when generating text. Think of it as different ways the model makes choices about what to say next.

When an LLM is generating text, it calculates probabilities for all possible next words. Then, it needs a method to choose one - this is where sampling comes in. There are three main approaches:

 

1. Greedy Sampling: Like a person who always orders the same dish at a restaurant, it always picks the word with the highest probability. While reliable, it can make text feel mechanical and repetitive. Yeah, just like all those articles generated exclusively with AI.

2. Random Sampling: This is like being good at some sport in high school and getting picked for a team - words with higher probabilities are more likely to be chosen, but lower-probability words still have a chance. This creates more varied and natural-sounding text.

3. Temperature Sampling: This acts like a "creativity dial." Higher temperature settings make the model more willing to take risks with unusual word choices, while lower settings make it stick to more predictable (boring) options.

These methods help balance between consistency and creativity in the generated text, similar to how a human might choose between using common phrases or being more creative in their writing.
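The three approaches can be sketched in a few lines of Python; the tokens and probabilities are invented for the example:

```python
import math
import random

# Toy next-token distribution (invented values).
tokens = ["the", "a", "champion", "banana"]
probs  = [0.5, 0.3, 0.15, 0.05]

# 1. Greedy sampling: always take the most probable token.
greedy = tokens[probs.index(max(probs))]

# 2. Random sampling: draw a token in proportion to its probability.
sampled = random.choices(tokens, weights=probs, k=1)[0]

# 3. Temperature sampling: rescale the distribution before drawing.
#    Temperature < 1 sharpens it (more predictable), > 1 flattens it (more adventurous).
def apply_temperature(probs, temperature):
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

creative = random.choices(tokens, weights=apply_temperature(probs, 1.5), k=1)[0]

print(greedy, sampled, creative)
```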

 

Attention Mechanisms




Attention mechanisms are like a smart reading system that helps language models understand context better.

For example, in the sentence "The cat sat on the mat, it was comfortable", when the model processes the word "it", the attention mechanism helps figure out whether "it" refers to the cat or the mat by assigning different levels of importance (or "weights") to each previous word. These weights are calculated using probability distributions - higher probabilities are given to words that are more relevant to understanding the current context.

This is crucial because it allows the model to:

1. Connect related ideas even when they're far apart in the text

2. Handle complex relationships between different parts of a sentence

3. Maintain context awareness over longer passages

This capability is what makes modern language models so much better at maintaining coherence and context compared to older systems that could only look at a few words at a time.
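As a heavily simplified sketch (real attention computes these scores from learned query and key vectors; here the relevance scores are just made-up numbers), the weighting step looks like this:

```python
import math

def softmax(scores):
    """Turn raw relevance scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up relevance of each previous word with respect to the word "it".
previous_words = ["The", "cat", "sat", "on", "the", "mat"]
relevance      = [0.1,   2.0,  0.3,  0.1,  0.1,   1.2]

weights = softmax(relevance)
for word, w in zip(previous_words, weights):
    print(f"{word:>4}: {w:.2f}")  # higher weight = more attention paid to that word
```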

 

Moral of the story

At their core, language models like ChatGPT are sophisticated probability calculators.

Just like how we humans learn language patterns through experience, these models learn by calculating the chances of words appearing together. Every time the model writes something, it's making educated guesses based on what it learned.

While it might seem like magic when these models generate human-like text, it's really just a sophisticated way of using probability patterns learned from millions of examples.

Understanding this helps demystify these AI systems - they're not magical black boxes, but rather mathematical tools that have gotten very good at playing the odds in language.


#AI #ArtificialIntelligence #MachineLearning #LargeLanguageModels #Chatbots #NaturalLanguageProcessing #DeepLearning #Probability #Statistics #DataScience #AIExplained #DemystifyingAI #BehindTheScenes #HowAIWorks #AIbasics #AIforEveryone
