DECODING GEN AI TRANSFORMER ARCHITECTURE
Over the last two to three years, you've probably noticed a surge of posts, articles, and discussions about Generative AI at conferences and in forums. From writing essays to creating art, Generative AI is making headlines everywhere. But what's driving this wave of innovation? At the heart of it all is a powerful technology called the Transformer architecture.
Ever wondered how the core actually works? Let me decode it for you, starting in simple terms and gradually increasing the level of complexity.
You might be working with digital assistants day in and day out to cut out unnecessary handoffs and speed up communication. So how does that work? Whenever you type a query, the assistant does not understand words the way we do; it sees text as a series of numbers, or codes, and then has to figure out the best way to interpret and respond to what you say.
The answer is the Transformer. This is the brain behind the assistant. Think of it as a super-fast reader that can search and scan millions or even billions of pieces of text, even when they are far apart, and synthesize the right pieces of information into a relevant, polished response for you.
BUT HOW DOES IT WORK? CAN IT BE DEMYSTIFIED?
Let me share the architecture here and walk you through it from left to right. The left part is the encoder and the right part is the decoder.
Let's Start with Input Embedding
To put it in simple terms, think of your input as a group of words or a sentence. For simplicity, let's use the sentence:
"I have a cute kitten."
Each word in this sentence is like a unique item in a large vocabulary list, and each word has its own special ID.
These IDs represent the positions of the words in the vocabulary. These IDs are then passed to something called input embeddings, which are basically vectors—sets of numbers, each of size 512. These vectors aren’t fixed; they change based on the learning process as the model is trained.
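To make this concrete, here is a minimal NumPy sketch. The toy vocabulary, the IDs, and the random embedding table are all assumptions for illustration; in a real model the embedding table is a learned parameter that is updated during training.

```python
import numpy as np

# Toy vocabulary: each word gets an integer ID (assumed for illustration).
vocab = {"I": 0, "have": 1, "a": 2, "cute": 3, "kitten": 4}
d_model = 512  # embedding size used throughout this article

# The embedding table is a (vocab_size x d_model) matrix of weights.
# Here it is random; during training these values are learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["I", "have", "a", "cute", "kitten"]
token_ids = [vocab[w] for w in sentence]          # [0, 1, 2, 3, 4]
input_embeddings = embedding_table[token_ids]     # shape (5, 512)
print(input_embeddings.shape)                     # (5, 512)
```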
Position Encoding: Understanding Word Order
After we have the input embeddings, we use position encoding. Position encoding helps the model understand the order of the words in a sentence. It ensures that words close to each other are treated as "near" and those far apart are treated as "distant." This is crucial because the meaning of a sentence often depends on the order of the words.
The output of the position encoding is also a vector of size 512, which is then added to the input embeddings. The formula used for this is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here, the even dimension indices use the sine function and the odd dimension indices use the cosine function. Sine and cosine are used because they provide a smooth, continuous pattern that helps the model understand the relative position of words.
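Below is a minimal NumPy sketch of this sinusoidal encoding, continuing the toy example above; the shapes (5 words, d_model = 512) are carried over from that example.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding: sine on even dimensions, cosine on odd."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimension indices
    pe[:, 1::2] = np.cos(angles)   # odd dimension indices
    return pe

pe = positional_encoding(seq_len=5, d_model=512)   # shape (5, 512)
# The encoding is simply added to the input embeddings:
# x = input_embeddings + pe
```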
Multi-Head Attention: Finding Relationships
Once the positions are recognized, the data is fed into the multi-head attention layer.
Why do we use multi-head attention?
Multi-head attention is used to find the relationship between your query (like a search or prompt) and the final output. It helps the model figure out how strongly the words in your input are connected to each other. For example, some words can have multiple meanings depending on their context, like whether they’re used as a noun, verb, or adjective.
How does multi-head attention work?
Multi-head attention takes inputs from three sources: the query, the key, and the value. Let’s go back to our sentence, "I have a cute kitten."
Imagine this sentence as a query of five words, each word represented by a vector of 512 numbers, giving a 5 x 512 query matrix. Multiplying the query matrix by the transpose of the key matrix produces a 5x5 matrix. This matrix tells us how closely related the words are to each other; essentially, it measures the correlation between every pair of words.
The formula for this process, known as scaled dot-product attention, is:

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V

This formula helps the model figure out which words in the sentence are most important for generating the output.
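Here is a small NumPy sketch of scaled dot-product attention over the same toy shapes (5 words, d_model = 512); the random Q, K, and V matrices stand in for the learned projections a real model would compute.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (5, 5) correlation matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V, weights                                # (5, d_k), (5, 5)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 512))   # queries, one row per word
K = rng.normal(size=(5, 512))   # keys
V = rng.normal(size=(5, 512))   # values
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.shape)   # (5, 5): how strongly each word attends to every other word
```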
Next Step: Fragmenting the Matrices
After obtaining the initial embeddings, the next step is to split each matrix into smaller, more manageable parts. For example, if the embeddings have size 512 and we use 4 attention heads, we divide them into smaller matrices, each of size 512/4 = 128. This lets us break the sentence representation into smaller pieces, making it easier to analyze each part. Attention is computed in each of these heads, and the resulting smaller matrices are then concatenated back together. This process is known as multi-head attention, as sketched below.
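A minimal sketch of that split-and-concatenate step, assuming 4 heads as in the example above; the per-head learned projection matrices of a real Transformer are omitted here for brevity.

```python
import numpy as np

def split_heads(x, num_heads):
    """Split a (seq_len, d_model) matrix into num_heads smaller matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads                  # 512 / 4 = 128
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

def concat_heads(heads):
    """Concatenate per-head outputs back into one (seq_len, d_model) matrix."""
    num_heads, seq_len, d_head = heads.shape
    return heads.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 512))
heads = split_heads(x, num_heads=4)    # (4, 5, 128): attention runs on each head separately
merged = concat_heads(heads)           # (5, 512): heads concatenated back together
print(np.allclose(x, merged))          # True: split and concat are inverse operations
```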
Why Do We Use Query, Keys, and Values in Transformers?
Imagine you’re searching a database. You start with a query—a question you’re asking. For instance, you might type the word "table" as your query (Q). The word "table" can have multiple meanings: it could refer to a piece of furniture or to a table of rows and columns.
To determine the correct meaning, we use a Key (K). The Key helps differentiate between the different meanings of "table." Finally, the Value (V) is assigned based on the context—whether it's a table of rows and columns or a piece of furniture. This process helps the Transformer model understand the relationship between words and generate the correct output. Essentially, the query identifies what you’re looking for, the key helps clarify the context, and the value determines the output based on that context.
Next Layer: Add & Normalize
The Add & Normalize layer is crucial for processing the data further. This layer uses statistical methods to normalize the range of values between 0 and 1, making it easier to score and compare them. It calculates the mean and variance to ensure all values are on the same scale. Additionally, this layer amplifies the most important variables—those with the highest correlation—and minimizes or omits the less relevant ones.
Decoder: How It Works
An important point to note is that the Key and Value inputs for the decoder come from the encoder, while the Query comes from the decoder itself. The decoder’s architecture is similar to the encoder’s, with one key difference: it includes a masked multi-head attention mechanism, which I will discuss in the next article.
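As a rough sketch of that wiring, here is how the decoder's cross-attention reuses the scaled dot-product attention from earlier; the encoder output and decoder state are random stand-ins, and the sequence lengths are arbitrary assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """Same scaled dot-product attention as in the earlier sketch."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(5, 512))   # encoder output for the source sentence
decoder_state = rng.normal(size=(3, 512))    # decoder state for the tokens generated so far

# Cross-attention in the decoder: Query from the decoder, Key and Value from the encoder.
context = attention(Q=decoder_state, K=encoder_output, V=encoder_output)
print(context.shape)   # (3, 512): each decoder position attends over the encoder output
```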
Disclaimer: This article is authored by me for knowledge-sharing purposes for deep enthusiasts in Gen AI. My organization is not responsible for any errors or inaccuracies in the content. If you have any questions or need further clarification, please feel free to contact me directly.