Layer Normalization in Transformers

In this article, I’ll cover what you need to know about Layer Normalization and how it helps stabilize and accelerate the training of Transformer models.

As always, if you find my articles interesting, don’t forget to clap and follow 👍🏼. These articles take time and effort to create!

What is Layer Normalization, and why do we need it?

[Figure: Layer normalization. Image Source: pylessons]

Layer Normalization is a process applied within each layer of a Transformer model. Put simply, it’s like giving each layer of the Transformer a “reset button” that keeps it stable while learning!

As the model learns, the values flowing through each layer can become unbalanced (some might get too large, others too small), which makes training harder. This problem is known as internal covariate shift.

[Figure: Covariate shift illustration. Image Source: ResearchGate]

Layer Normalization fixes this by adjusting the values so they stay within a balanced range (not too high, not too low) at every step of training.

Imagine you’re teaching a student math, but the way you grade their work changes every day: one day a 70 is an A, the next day it’s a C. The student would get confused, right? That’s what happens to a model without normalization.

Layer Normalization makes sure the “grading system” stays consistent, so the model can learn clearly and reliably :)

How is it done?

1 — Calculating Mean and Variance: Layer Normalization starts by computing two statistics. This is done independently for each token vector, across its features:

  • The mean (µ): the average of the values in that token vector
  • The variance (σ²): how much the values in the vector vary from that average

When a token passes through a Transformer layer, it gets transformed into a token vector: a list of numbers that captures the token’s meaning and context.

Each token vector might look something like this (below 👇🏼)

[Figure: Mean and variance calculation. Image Source: Dr. Walid Soula]
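
Concretely, for a token vector x = (x₁, x₂, …, x_d) with d features:

µ = (x₁ + x₂ + … + x_d) / d
σ² = ((x₁ − µ)² + (x₂ − µ)² + … + (x_d − µ)²) / d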

2 — Standardization: Once we have the mean and variance for a token vector (calculated in Step 1), Layer Normalization moves to the standardization stage.

Each value (feature) inside the token vector is normalized using this formula (below 👇🏼)

[Figure: Standardization formula. Image Source: Dr. Walid Soula]
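
In symbols:

x̂ = (x − µ) / √(σ² + ϵ)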

  • x is the original value
  • μ is the mean
  • σ² is the variance
  • ϵ is a small constant (e.g., 10⁻⁸) added to avoid division by zero during training

After normalization, the mean of the values becomes 0 and the variance becomes 1, ensuring that the values inside the token vector are on the same scale (making it easier for the model to learn).

Let’s illustrate this with a quick example in Python. For the sake of the example, let’s say we have a token vector with the following values:

import numpy as np

token_vector = np.array([4, 6, 8, 2])        

  • Calculating the mean and the variance

mean = np.mean(token_vector)
variance = np.var(token_vector)        

  • Standardization

epsilon = 1e-8
standardized_vector = (token_vector - mean) / np.sqrt(variance + epsilon)
print(mean, variance, standardized_vector)
# 5.0 5.0 [-0.4472136  0.4472136  1.34164079 -1.34164079]
# The values inside the token vector are now standardized, with a mean of 0 and variance of 1
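
As a quick sanity check, the standardized vector now has (almost exactly) the promised statistics:

print(np.mean(standardized_vector))  # ~0.0
print(np.var(standardized_vector))   # ~1.0 (a hair below 1 because of epsilon)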

3 — Scaling and Shifting: After standardization, we give the model the ability to adjust the scale and shift of the normalized values, because not all features (values in the token vector) should be treated equally; some features might matter more for understanding the token’s meaning than others.

[Figure: Scaling and shifting representation. Image Source: kristakingmath]

Layer Normalization applies two learnable parameters to each feature in the vector: gamma (ɣ) and beta (β)

  • Gamma (ɣ): This is the scaling factor, which can amplify or shrink the normalized values
  • Beta (β): This is the shifting factor, which can move the normalized values left or right (1D perspective) by adding a constant to each feature in the distribution

The formula now becomes (below 👇🏻)

[Figure: Scaling and shifting formula. Image Source: Dr. Walid Soula]
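
In symbols, where x̂ is the standardized value from Step 2:

y = ɣ · x̂ + β

Both ɣ and β are learned during training; common implementations (for example, PyTorch’s nn.LayerNorm) initialize ɣ to 1 and β to 0, so training starts from the plain standardized values.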

Let’s continue the example from Step 2 and assume:

  • Gamma (ɣ) = 2
  • Beta (β) = 1

gamma = 2.0  # scaling factor (ɣ)
beta = 1.0   # shifting factor (β)

final_output = gamma * standardized_vector + beta
print(final_output)

# [ 0.10557281  1.89442719  3.68328157 -1.68328157]        
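
Putting the three steps together, here is a minimal sketch of Layer Normalization as a single NumPy function (the name layer_norm and its defaults are illustrative, not taken from any library):

import numpy as np

def layer_norm(token_vector, gamma, beta, epsilon=1e-8):
    # Step 1: mean and variance across the features of one token vector
    mean = np.mean(token_vector)
    variance = np.var(token_vector)
    # Step 2: standardize to roughly mean 0, variance 1
    standardized = (token_vector - mean) / np.sqrt(variance + epsilon)
    # Step 3: learnable scale and shift
    return gamma * standardized + beta

print(layer_norm(np.array([4, 6, 8, 2]), gamma=2.0, beta=1.0))
# [ 0.10557281  1.89442719  3.68328157 -1.68328157]

In a real Transformer, ɣ and β would be vectors with one entry per feature, learned along with the rest of the model’s weights.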

We are done with Layer Normalization. If you’d like to read my previous articles on the Transformer architecture, you’ll find them in the Resources section. If there’s a specific topic you’d like me to cover, please don’t hesitate to let me know! Your input will help shape the direction of my content and ensure it remains relevant and engaging 😀

Resources 


If you found this helpful, consider sharing ♻️ and following me, Dr. Oualid Soula, for more content like this.

Join the journey of discovery and stay ahead in the world of data science and AI! Don't miss out on the latest insights and updates - subscribe to the newsletter for free 👉🏼 https://lnkd.in/eNBG5dWm, and become part of our growing community!
