The Algorithmic Core of Positional Encoding in Transformers
Welcome! This article kicks off the ‘Decoding Transformers’ series!
My goal here is to articulate the full Transformer architecture in detail, including the math behind it all. Each installment of this series will focus on a different component of the model, from beginning to end. It's only fitting to start at the beginning: positional encoding!
Transformers revolutionized natural language processing (NLP) with their attention-only architecture. Their ability to train in parallel and deliver impressive performance gains made them popular among NLP researchers, and the availability of implementations in common deep learning frameworks has made them easy for students and practitioners to experiment with.
However, while accessibility is great, it can sometimes lead to an oversight of the intricate details of the model. In this series, we’ll dive deep into each part of the Transformer architecture, starting with positional encoding. Positional encoding is crucial because, unlike traditional models that process words sequentially, Transformers handle words in parallel and need a way to retain the order of words in a sentence. By assigning unique position identifiers to each word, positional encoding helps Transformers maintain the correct word order, ensuring accurate understanding and generation of text.
What Is Positional Encoding?
The order of the words in a sentence is crucial for conveying the intended meaning. Because Transformers process tokens independently and in parallel, the position of each token must be captured explicitly to preserve that meaning.
Consider the phrases:
• “King and Queen are awesome.”
• “Queen and King are awesome.”
These sentences use the same words, and without positional encoding their embedding vectors are identical; the model would see the same bag of tokens for both. To address this, positional encoding incorporates information about the position of each embedding within the sequence.
But first...
Before positional encoding comes into play, each word in the input sequence is converted into a dense vector representation known as an input embedding. These embeddings capture the semantic meaning of the words, allowing the model to understand and differentiate between different tokens based on their context. The embeddings are typically obtained from pre-trained models or learned during the training process. Once the words are converted into these embeddings, positional encoding is added to incorporate information about the order of words in the sequence.
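To make this concrete, here is a minimal sketch of an embedding lookup, assuming a toy three-word vocabulary and a randomly initialized 4-dimensional embedding table (real models learn these weights during training):

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table (4-dimensional,
# matching the examples in this article). Real models learn these weights.
vocab = {"Transformers": 0, "are": 1, "awesome": 2}
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up a dense vector for each token in the sequence."""
    return np.stack([embedding_table[vocab[t]] for t in tokens])

x = embed(["Transformers", "are", "awesome"])
print(x.shape)  # (3, 4): one 4-dimensional embedding per token
```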
How Positional Encoding Works
Positional encoding is designed to introduce unique position-specific patterns to each token in a sequence. This is done using a combination of sine and cosine functions, which generate distinct positional vectors.
Positional encoding consists of a series of sine and cosine waves and involves two parameters.
The first is the `pos` parameter: the position of the token within the sequence. It plays the same role as the time variable t or the x coordinate in a standard plot, determining the point at which each sine and cosine function is evaluated.
[Figure: the sine and cosine waves used in positional encoding.]
The second is the parameter `i`, the dimension index within the embedding.
The value of `i` effectively selects a distinct sine or cosine wave for each embedding dimension by controlling how quickly that wave oscillates. An oscillation is one full cycle of a wave, from its highest point to its lowest point and back again; larger values of `i` produce slower oscillations (longer wavelengths). Each wave is sampled at the token's position, and the resulting values are added to the corresponding dimensions of the word embedding. This is how positional information is encoded within each embedding.
In the design of positional encodings in Transformer models:
• For even indices (0, 2, …), the positional encoding uses a sine function.
• For odd indices (1, 3, …), the positional encoding uses a cosine function.
Formally, for a token at position pos and pair index i of an embedding with d_model dimensions:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
• pos is the position of the token in the sequence (0, 1, 2, ...),
• i indexes each sine/cosine pair, so 2i and 2i+1 are the even and odd embedding dimensions, and
• d_model is the dimensionality of the embeddings.
This approach ensures that each dimension of the word embedding receives unique positional information, enabling the model to differentiate between different positions in the input sequence.
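Here is a minimal NumPy sketch of that scheme; the function name and arguments are my own, but the formula inside is the one given above:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)                    # pos = 0, 1, ..., seq_len - 1
    pe = np.zeros((seq_len, d_model))
    for i in range(0, d_model, 2):                    # i is the even dimension index (2i in the formula)
        angle = positions / 10000 ** (i / d_model)    # oscillation slows as i grows
        pe[:, i] = np.sin(angle)                      # even dimensions: sine
        if i + 1 < d_model:
            pe[:, i + 1] = np.cos(angle)              # odd dimensions: cosine
    return pe

print(np.round(sinusoidal_positional_encoding(3, 4), 2))
# rows for pos = 0, 1, 2:
# [0, 1, 0, 1], [0.84, 0.54, 0.01, 1.0], [0.91, -0.42, 0.02, 1.0]
```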
Example: Positional Encoding in Action
To explore further, let’s begin with a sequence of embeddings for the phrase “Transformers are awesome.” Each row represents an embedding for a specific word, and each column corresponds to an element in that embedding.
| Token        | dim 0 | dim 1 | dim 2 | dim 3 |
|--------------|-------|-------|-------|-------|
| Transformers | 0.2   | 0.4   | 0.1   | 0.3   |
| are          | 0.5   | 0.2   | 0.7   | 0.9   |
| awesome      | 0.8   | 0.6   | 0.4   | 0.2   |
POS(t) is the specific position of each word embedding within the sequence.
| Token        | POS(t) |
|--------------|--------|
| Transformers | 0      |
| are          | 1      |
| awesome      | 2      |
In this example, the sequence length is 3.
Each word embedding has a dimensionality of 4, so its dimension indices are 0, 1, 2, and 3. Each dimension is handled differently depending on whether its index is even or odd.
Let's add positional encoding to the embedding for Transformers:
| Token        | dim 0 | dim 1 | dim 2 | dim 3 |
|--------------|-------|-------|-------|-------|
| Transformers | 0.2   | 0.4   | 0.1   | 0.3   |
If you remember:
- For even indices (0, 2, ...), the positional encoding uses a sine function.
- For odd indices (1, 3, ...), the positional encoding uses a cosine function.
For each dimension of the positional encoding, you evaluate the corresponding sine or cosine wave. For "Transformers" (pos = 0):
For i = 0, a sine and a cosine wave fill dimensions 0 and 1, and the pattern repeats for i = 1 (dimensions 2 and 3), so that each dimension is represented by its own wave:
i = 0: PE(0, 0) = sin(0 / 10000^(0/4)) = sin(0) = 0, and PE(0, 1) = cos(0 / 10000^(0/4)) = cos(0) = 1
i = 1: PE(0, 2) = sin(0 / 10000^(2/4)) = sin(0) = 0, and PE(0, 3) = cos(0 / 10000^(2/4)) = cos(0) = 1
So the positional encoding for "Transformers" is [0, 1, 0, 1].
Similarly, the positional encoding values are calculated for the words "are" (pos = 1) and "awesome" (pos = 2).
A positional encoding table:
| Token        | dim 0 | dim 1 | dim 2 | dim 3 |
|--------------|-------|-------|-------|-------|
| Transformers | 0     | 1     | 0     | 1     |
| are          | 0.84  | 0.54  | 0.01  | 0.99  |
| awesome      | 0.90  | -0.41 | 0.02  | 0.99  |
Plotted over position, each embedding dimension traces its own sine or cosine curve; with a sequence length of 3, only three points (pos = 0, 1, 2) are sampled from each curve.
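As a sanity check, the table values can be reproduced directly from the formula. This small script rounds to two decimals, so some entries appear as 1.0 or -0.42 rather than the truncated 0.99 or -0.41 shown above:

```python
import math

d_model = 4
tokens = ["Transformers", "are", "awesome"]

for pos, token in enumerate(tokens):
    row = []
    for dim in range(d_model):
        # The pair index 2i equals dim for even dims and dim - 1 for odd dims.
        angle = pos / 10000 ** ((dim - dim % 2) / d_model)
        value = math.sin(angle) if dim % 2 == 0 else math.cos(angle)
        row.append(round(value, 2))
    print(f"pos={pos} ({token}): {row}")
# pos=0 (Transformers): [0.0, 1.0, 0.0, 1.0]
# pos=1 (are): [0.84, 0.54, 0.01, 1.0]
# pos=2 (awesome): [0.91, -0.42, 0.02, 1.0]
```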
Let's also generate positional encodings with an embedding dimension of 5, and for positions beyond our short sentence, to see how the pattern extends:
| Position             | dim 0 | dim 1 | dim 2 | dim 3 | dim 4 |
|----------------------|-------|-------|-------|-------|-------|
| pos=0 (Transformers) | 0     | 1     | 0     | 1     | 0     |
| pos=1 (are)          | 0.84  | 0.54  | 0.01  | 0.99  | 0.03  |
| pos=2 (awesome)      | 0.90  | -0.41 | 0.02  | 0.99  | 0.05  |
| pos=3                | -     | -     | -     | -     | -     |
| pos=4                | -     | -     | -     | -     | -     |
| pos=99               | -1.0  | 0.04  | 0.02  | -1.0  | 0.61  |
Positional encoding can be conceptualized as a series of vectors, where each vector captures a specific location within the sequence. When a positional vector is added to its corresponding embedding vector, the combination preserves the positional information, ensuring that the order of elements in the sequence is maintained within the resulting vector. In models such as GPT, positional encodings are not fixed but are learnable parameters. These parameters, represented by tensors, are added to the embedding vectors and optimized during training.
Segment Embeddings
Segment embeddings, used in models such as BERT, complement positional encodings by providing additional information about which part of the input a token belongs to. They are integrated into the existing embeddings alongside the positional encodings.
1. For instance, if the embedding size is 4, a word might be represented as:
x_w = [0.2, 0.4, 0.1, 0.3]
2. For a position i in a sequence, the positional encoding might be:
p_i = [0, 1, 0, 1]
3. Adding the positional encoding to the word embedding produces a new vector that contains both the word's semantic meaning and its position in the sequence.
Let's perform the addition element-wise:
x'_w = x_w + p_i = [0.2, 0.4, 0.1, 0.3] + [0, 1, 0, 1] = [0.2, 1.4, 0.1, 1.3]
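The same addition can be done for the whole sequence at once. Here is a small NumPy sketch using the embedding and positional encoding tables from the example above:

```python
import numpy as np

# Embeddings for "Transformers are awesome" (from the embedding table above).
X = np.array([[0.2, 0.4, 0.1, 0.3],
              [0.5, 0.2, 0.7, 0.9],
              [0.8, 0.6, 0.4, 0.2]])

# Positional encodings for positions 0, 1, 2 (rounded values from the PE table).
P = np.array([[0.00, 1.00,  0.00, 1.00],
              [0.84, 0.54,  0.01, 0.99],
              [0.90, -0.41, 0.02, 0.99]])

X_pos = X + P    # element-wise addition, row by row
print(X_pos[0])  # [0.2 1.4 0.1 1.3] -- matches x'_w above
```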
Learnable Positional Encoding in GPT Models
Learnable positional encodings offer flexibility and can adapt to the specific patterns of the dataset, potentially leading to better performance on tasks where the relative position of tokens is particularly important.
The learnable positional encoding for position i might initially be:
w_i = [0.02, 0.04, 0.08, 0.06]
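A minimal PyTorch sketch of this idea (not GPT's actual code; the class and variable names here are illustrative) uses a second embedding table indexed by position, whose rows are updated by the optimizer just like any other weight:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Token embeddings plus a learnable position-embedding table (GPT-style)."""
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # optimized during training

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)  # broadcast over the batch

emb = LearnedPositionalEmbedding(vocab_size=100, max_len=16, d_model=4)
out = emb(torch.tensor([[3, 7, 9]]))   # a toy 3-token sequence of ids
print(out.shape)                       # torch.Size([1, 3, 4])
```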
Segment Embeddings in BERT
Segment embeddings are additional embeddings used in models like BERT to distinguish between different segments of text (e.g., sentences in a document).
Segment embeddings allow the model to distinguish between different parts of the input, such as the question and answer in a question-answering task. By adding a unique embedding to tokens from different segments, the model can learn segment-specific representations.
For example, the segment embedding applied to a token in one segment might be:
s_i = [0.05, 0.07, 0.1, 0.02]
Integrating Segment Embeddings
The integration of segment embeddings with word embeddings and positional encodings enriches the input representation with multiple facets of information: semantic content (word embeddings), position in the sequence (positional encodings), and role in the context of multiple sequences (segment embeddings). This comprehensive representation enables the model to perform complex reasoning over the input.
x''_w = x'_w + s_i = [0.2, 1.4, 0.1, 1.3] + [0.05, 0.07, 0.1, 0.02] = [0.25, 1.47, 0.2, 1.32]
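Putting the three pieces together in code, again with the illustrative numbers from the examples above:

```python
import numpy as np

x_w = np.array([0.2, 0.4, 0.1, 0.3])     # word embedding (semantic content)
p_i = np.array([0.0, 1.0, 0.0, 1.0])     # positional encoding (position 0)
s_i = np.array([0.05, 0.07, 0.1, 0.02])  # segment embedding (e.g., "sentence A")

x_final = x_w + p_i + s_i                # all three are summed element-wise
print(np.round(x_final, 2))              # [0.25 1.47 0.2  1.32]
```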
Great!
So, what happens next...?
After the positional embeddings are combined with the input embeddings, the resulting vectors are sent to the multi-head self-attention mechanism. This allows the Transformer to analyze the relationships between words in the sequence. The attention mechanism helps the model determine which words to focus on when processing each token.
Input Embedding + Positional Encoding
|
v
Multi-Head Self-Attention
|
v
Feed-Forward Neural Network (FFN)
|
v
Layer Normalization + Residual Connections
|
v
Stacked Encoder/Decoder Layers
|
v
Final Output
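As a rough sketch of the first hand-off in that flow, assuming PyTorch and its built-in `nn.MultiheadAttention` (a stand-in for a full encoder layer, not a complete Transformer):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 4, 2, 3

# Stand-in for "input embedding + positional encoding" (a batch of one sequence).
x = torch.randn(1, seq_len, d_model)

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
out, weights = attn(x, x, x)   # self-attention: queries, keys, and values are all x

print(out.shape)      # torch.Size([1, 3, 4]) -- same shape, now context-mixed vectors
print(weights.shape)  # torch.Size([1, 3, 3]) -- how much each token attends to every other token
```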
Conclusion
Positional encoding is a crucial component of the Transformer architecture, enabling the model to understand the order of words in a sequence. By combining sine and cosine functions, positional encoding provides unique position-specific information that, when added to input embeddings, allows the Transformer to capture the relative positions of tokens. This process ensures that the model can accurately interpret and generate text with a coherent understanding of the context.
Next up in the series, we will venture into the multi-head attention mechanism, exploring how Transformers can focus on different parts of the input sequence simultaneously to enhance their understanding and processing of language.