Layer Normalization in Transformers
In this article, I’ll cover what you need to know about Layer Normalization and how it helps stabilize and accelerate the training of Transformer models.
As always, if you find my articles interesting, don’t forget to clap and follow 👍🏼 These articles take time and effort to create!
What is Layer Normalization, and why do we need it?
Layer Normalization is an operation applied within each layer of a Transformer model. Put simply, it’s like giving each layer of a Transformer a “reset button” to keep it stable while learning!
As the model learns, the values flowing through its layers can become unbalanced (some get too large, others too small), which makes training harder. This problem is known as internal covariate shift.
Layer Normalization fixes this by adjusting the values so they stay within a balanced range (not too high, not too low) at every step of training.
Imagine you’re teaching a student math, but the way you grade their work changes every day: one day a 70 is an A, the next day it’s a C. The student would get confused, right? That’s what happens to a model without normalization.
Layer Normalization makes sure the “grading system” stays consistent, so the model can learn clearly and reliably :)
How is it done?
1 — Calculating Mean and Variance: Layer Normalization starts by computing the mean and the variance of the values inside each token vector. This operation is done independently for each token vector, across its features: μ = (1/d) Σ xᵢ and σ² = (1/d) Σ (xᵢ − μ)², where d is the number of features and xᵢ is the i-th value in the vector.
When a token passes through a Transformer layer, it is represented as a token vector: a list of numbers that captures the token’s meaning and context.
Each token vector might look something like [4, 6, 8, 2] (a toy 4-feature example; real models use hundreds or thousands of features per token). We’ll reuse this vector in the Python example below.
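To make “independently for each token” concrete, here is a minimal sketch (the shapes and variable names are my own illustration, not tied to any specific library):

import numpy as np

# Hypothetical Transformer activations: 2 sentences, 3 tokens each,
# 4 features per token -> shape (batch, seq_len, d_model)
hidden_states = np.random.randn(2, 3, 4)

# Layer Normalization computes mean and variance over the LAST axis only,
# i.e. separately for each of the 2 * 3 = 6 token vectors
mean = hidden_states.mean(axis=-1, keepdims=True)     # shape (2, 3, 1)
variance = hidden_states.var(axis=-1, keepdims=True)  # shape (2, 3, 1)
print(mean.shape, variance.shape)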
2 — Standardization: Once we have the mean and variance for a token vector (calculated in Step 1), Layer Normalization moves to the standardization stage.
Each value (feature) inside the token vector is normalized using this formula: x̂ᵢ = (xᵢ − μ) / √(σ² + ε), where ε is a tiny constant that prevents division by zero.
After normalization, the mean of the values becomes 0 and the variance becomes 1, ensuring that the values inside the token vector are on the same scale (which makes learning easier for the model).
Let’s walk through a quick example in Python. For the sake of the example, say we have a token vector with the following values:
import numpy as np
token_vector = np.array([4, 6, 8, 2])
mean = np.mean(token_vector)       # μ = 5.0
variance = np.var(token_vector)    # σ² = 5.0
epsilon = 1e-8                     # tiny constant to avoid division by zero
standardized_vector = (token_vector - mean) / np.sqrt(variance + epsilon)
print(mean, variance, standardized_vector)
# 5.0 5.0 [-0.4472136   0.4472136   1.34164079 -1.34164079]
# The values inside the token vector are now standardized, with a mean of 0 and variance of 1
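As a quick sanity check (my own addition, not part of the original walkthrough), we can verify that the standardized vector really does have mean ≈ 0 and variance ≈ 1:

print(np.mean(standardized_vector))  # ~0.0, up to floating-point error
print(np.var(standardized_vector))   # ~1.0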
3 — Scaling and Shifting: After standardization, we give the model the ability to adjust the scale and shift of these normalized values, because not all features (values in the token vector) should be treated equally; some features may matter more for understanding the token’s meaning than others.
Layer Normalization applies two learnable parameters to each feature in the vector: gamma (γ) and beta (β). In practice these are vectors with one entry per feature, learned during training (see the full sketch after the example below).
The formula now becomes: yᵢ = γ · x̂ᵢ + β
Let’s continue the example from Step 2 and, for simplicity, assume scalar parameters:
gamma = 2.0
beta = 1.0
final_output = gamma * standardized_vector + beta
print(final_output)
# [ 0.10557281 1.89442719 3.68328157 -1.68328157]
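Putting all three steps together, here is a minimal self-contained sketch of the whole operation (my own illustrative code, not a production implementation; note that in a real model γ and β are learned per-feature vectors rather than the scalars used above):

import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-8):
    # x: a token vector (1-D array of features)
    # gamma, beta: learnable per-feature parameters (same shape as x)
    mean = np.mean(x)                                 # Step 1: mean
    variance = np.var(x)                              # Step 1: variance
    x_hat = (x - mean) / np.sqrt(variance + epsilon)  # Step 2: standardization
    return gamma * x_hat + beta                       # Step 3: scale and shift

token_vector = np.array([4, 6, 8, 2])
gamma = np.ones(4)   # scale initialized to 1 (no scaling at first)
beta = np.zeros(4)   # shift initialized to 0 (no shifting at first)
print(layer_norm(token_vector, gamma, beta))
# [-0.4472136   0.4472136   1.34164079 -1.34164079]

With this initialization the output is just the standardized vector; during training, the model learns γ and β values that re-scale and re-shift each feature as needed.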
We are done with Layer Normalization. If you’d like to read my previous articles on the Transformer architecture, I will include them in the resources section. If there’s a specific topic you’d like me to cover, please don’t hesitate to let me know! Your input will help shape the direction of my content and ensure it remains relevant and engaging 😀
Resources
If you found this helpful, consider sharing ♻️ and follow me, Dr. Oualid Soula, for more content like this.
Join the journey of discovery and stay ahead in the world of data science and AI! Don't miss out on the latest insights and updates - subscribe to the newsletter for free 👉🏼 https://lnkd.in/eNBG5dWm and become part of our growing community!