Parameter Efficient Fine Tuning with Additive Adaptation: Part 3 of my Fine-Tuning Series of Blogs
1. Introduction
This is the third blog in my series on Fine-Tuning of LLMs. In Part 1 and Part 2 of this series, I discussed the essential fundamentals of Full Fine Tuning and then went into the details of Single-Task Fine Tuning and Multi-Task Fine Tuning. It should be clear from those notes that Full Fine Tuning is a costly affair, both in terms of compute and in terms of storage, mainly because it involves updating ALL of the model parameters.
From the figure below, one can appreciate how model sizes have been growing in terms of the number of parameters, thereby increasing the compute and storage burden of Full Fine Tuning.
It should be underscored that it is not only the model weights that must be stored during each training iteration of fine tuning, but also additional components such as the optimizer states, gradients, forward activations and temporary memory used during training.
[Note: The optimizer state consists of the optimizer's momentum vectors or similar history-tracking quantities.
For example, the Adam optimizer tracks moving averages of the gradient and of the squared gradient. If you resume training without restoring this data, the optimizer will behave differently.]
These additional components can be many times larger than the model parameters themselves (12-20 times, as seen in Figure 1B above), and as the training process proceeds they can become too large to handle on your hardware.
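As a rough illustration of why this matters, the back-of-the-envelope sketch below applies the 12-20x overhead mentioned above to an assumed 1-billion-parameter model stored in 32-bit precision; the model size and precision are assumptions chosen only to make the arithmetic concrete.

```python
# Back-of-the-envelope memory estimate for full fine tuning.
# Assumptions (illustrative only): 1B parameters, 32-bit (4-byte) weights,
# and a 12-20x training overhead for gradients, optimizer states and activations.
params = 1_000_000_000
bytes_per_param = 4  # fp32

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone:            ~{weights_gb:.0f} GB")        # ~4 GB
print(f"Training footprint (12x): ~{weights_gb * 12:.0f} GB")   # ~48 GB
print(f"Training footprint (20x): ~{weights_gb * 20:.0f} GB")   # ~80 GB
```

Under these assumptions, a model whose weights fit comfortably on a consumer GPU can easily exceed the memory of that same GPU once the full training state is accounted for.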
In contrast to Full Fine Tuning, where every model weight is updated during supervised learning, Parameter Efficient Fine Tuning (PEFT) methods update only a small number of parameters. Some PEFT techniques work on a subset of the existing model parameters, while others do not touch the original parameters at all; instead, they add new layers to the model architecture and fine tune only the weights of those layers.
With PEFT, most of the model weights are kept frozen. As a result, the number of trainable parameters is much smaller than the number of parameters in the original LLM – in some cases just 15-20% of the original LLM weights. This makes the compute and storage requirements much smaller than for Full Fine Tuning, so much so that PEFT can often be performed on a single GPU.
There are several PEFT techniques, briefly explained in Section 3 of this article. This blog focuses on the PEFT technique of “Additive Adaptation”, which adds extra layers inside the Encoder/Decoder component of the pre-trained model; the parameters of these layers are then “adapted” to the task under consideration.
However, before diving into Additive Adaptation, I felt it is important to provide some background on the overall problem. Therefore, Sections 2, 3 and 4 of this article discuss the Challenges of Full Fine Tuning, the Methods for Parameter Efficient Fine Tuning and the background of the Transformer Architecture. Sections 5 and 6 cover the details of the PEFT technique in which we introduce additional layers into the Encoder/Decoder component of the Transformer architecture and fine tune only their parameters for the task under consideration.
2. Challenges with Full Fine Tuning
Catastrophic Forgetting
I discussed Catastrophic Forgetting in Section 2 of Part 2 of this series. The problem of Catastrophic Forgetting is associated with Full Fine Tuning on separate tasks. Recall from my notes that Catastrophic Forgetting happens because fine tuning updates ALL the parameters of the model. As a result, the model performs well on the single task it was fine-tuned on most recently, but its performance on the tasks it was fine-tuned on earlier degrades.
For example, if the model was fine-tuned on a question-answering task last, it performs well on that task, but its performance on a task it was fine-tuned on earlier (e.g. sentiment classification) degrades. Parameter Efficient Fine Tuning (PEFT) is less prone to Catastrophic Forgetting.
Compute and Storage Requirements in Full-Fine Tuning:
Full Fine Tuning carried out on separate tasks produces a new version of the model for each task it is trained on. Each version is the same size as the original, as shown in Figure 3 below, which leads to an expensive storage problem if you are fine tuning for multiple tasks.
With Parameter Efficient Fine Tuning (PEFT), you train only a small number of weights – so small that they may require only megabytes to store. The new parameters are combined with the frozen parameters for inference. The PEFT weights can be trained for each task separately and swapped in at inference time, allowing efficient adaptation of the original model to multiple tasks; a minimal sketch of this freeze-and-swap idea follows below.
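The PyTorch sketch below is a minimal, hypothetical illustration of this idea: the base model here is a stand-in `nn.Linear` rather than a real LLM, and the per-task weight file names are invented placeholders, not a specific library's format.

```python
import torch.nn as nn

def freeze_base_weights(base_model: nn.Module) -> None:
    # The pre-trained weights stay fixed; only the small task-specific
    # PEFT weights will receive gradients during fine tuning.
    for param in base_model.parameters():
        param.requires_grad = False

base_model = nn.Linear(512, 512)   # stand-in for the frozen pre-trained LLM
freeze_base_weights(base_model)

# Hypothetical per-task PEFT weight files, each only a few megabytes on disk.
# At inference time the frozen base model is combined with whichever task's
# weights are swapped in, so one base model can serve many tasks.
task_weights = {
    "sentiment": "peft_weights/sentiment.pt",
    "qa": "peft_weights/qa.pt",
}
```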
3. Methods for Parameter Efficient Fine Tuning
There are several methods/techniques of Parameter Efficient Fine Tuning one could use, each with trade-offs between parameter efficiency, memory efficiency, model performance, inference cost and training speed.
There are three main classes for PEFT methods:
Selective Methods:
Selective Methods fine tune only a subset of the original LLM parameters. There are several approaches to identifying which parameters to update: you can train only certain components of the model, specific layers, or individual parameter types.
Reparameterization Techniques:
Reparameterization techniques also work with the original LLM parameters but reduce the number of parameters to train by creating new “Low Rank” transformations of the original network weights.
A commonly used technique of this type is LoRA. This will be discussed in the Part 4 of my series.
Additive Methods:
Lastly, we have Additive Methods which carry out fine tuning by keeping all the model weights frozen and introducing new trainable components. Here, there are two main approaches:
a) Adapter Methods add new trainable layers to the architecture of the model, typically inside the Encoder or Decoder component, after the Attention or Feed Forward layers.
b) Soft Prompt Methods, on the other hand, keep the model architecture fixed and frozen and focus on manipulating the input to achieve better performance. This can be done by adding trainable parameters to the prompt embeddings, or by keeping the input fixed and retraining the embedding weights.
Let us take a closer look at the Adapter methods, wherein we add new trainable layers to the architecture of the model.
4. Re-visiting the Transformer Neural Network Architecture
Understanding the architecture of Transformers is crucial when discussing Parameter-Efficient Fine-Tuning (PEFT) because it allows for selective fine-tuning of the most impactful components, such as attention heads and feed-forward layers; this is particularly relevant for Low-Rank Adaptation (LoRA), as pointed out above. For additive adaptation methods, additional layers are introduced into the model, enabling it to learn new tasks without extensively modifying the original parameters.
Let us consider the Encoder-Decoder Transformer architecture shown in the figure below. In the subsequent paragraphs we will look, at a high level, at how the Transformer model works. This will help reinforce the understanding of the PEFT techniques covered here (Additive Adaptation) and in Part 4 of this series (Low Rank Adaptation, LoRA).
Below (Figure 7) is a simplified diagram of the Transformer architecture; the idea of this section is to focus, at a high level, on the processes taking place within the Transformer.
The Transformer Architecture is split into two distinct parts: the Encoder and the Decoder. These components work in conjunction with each other, and they share a number of similarities.
Before passing text into the model, we must tokenize the words (because computers work with numbers, not text!). Tokenization converts the words into numbers, with each number representing a position in the dictionary of all possible words the model can work with, as illustrated in Figure 8 below:
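As a small complementary illustration, the sketch below uses the Hugging Face transformers library purely as an example (the choice of the bert-base-uncased tokenizer and the sample sentence are assumptions, not something this blog depends on):

```python
from transformers import AutoTokenizer

# Any pre-trained tokenizer works for illustration; "bert-base-uncased" is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The teacher teaches the student"
token_ids = tokenizer.encode(text)

print(token_ids)                                   # a list of integer IDs from the tokenizer's vocabulary
print(tokenizer.convert_ids_to_tokens(token_ids))  # the corresponding tokens, including any special tokens
```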
Once the input is represented as numbers, we pass it to the Embedding layer. This layer is a trainable vector embedding space: a high-dimensional space in which each token ID is represented as a vector that occupies a unique location. The vectors learn to encode the meaning and context of the individual tokens in the input sequence.
Therefore, the sample sequence will appear as illustrated below:
Each word is matched to a token ID, and each token ID is mapped to a vector. In the original Transformer paper [https://arxiv.org/abs/1706.03762], the vector size is 512.
Assuming a three-dimensional space (because a vector space of more than three dimensions cannot be visualised by the human eye), it can be appreciated from Figure 11 below that words with similar semantic meaning are located close to each other in the embedding space.
The token embeddings are added to the positional encodings; the positional encodings carry information about word order so that we do not lose the relevance of word order in a sentence.
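A minimal PyTorch sketch of these two steps is given below; the vocabulary size, sequence length and toy token IDs are assumptions for illustration, while the embedding dimension of 512 and the sinusoidal positional encoding follow the original Transformer paper.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 30_000, 512, 128   # illustrative sizes; d_model = 512 as in the original paper

token_embedding = nn.Embedding(vocab_size, d_model)   # trainable embedding table: one vector per token ID

# Sinusoidal positional encoding as in "Attention Is All You Need".
position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
pos_encoding = torch.zeros(max_len, d_model)
pos_encoding[:, 0::2] = torch.sin(position * div_term)
pos_encoding[:, 1::2] = torch.cos(position * div_term)

token_ids = torch.tensor([[5, 42, 7, 199]])                          # a toy tokenized sequence
x = token_embedding(token_ids) + pos_encoding[: token_ids.size(1)]   # sum of token embeddings and positions
```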
Once we have summed the word embeddings and the positional encodings, we pass the resulting vectors to the Self-Attention layer. Here the model analyses the relationships between the input tokens. This allows the model to attend to different parts of the input sequence and better capture the contextual dependencies between words.
The attention matrix, which contains the self-attention weights reflecting the relationship of each word to every other word in the sequence, is illustrated in Figure 14 below:
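Alongside that figure, the sketch below computes such an attention matrix with scaled dot-product attention; the random projection matrices and the toy sequence length are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 512                 # toy sequence of 4 tokens
x = torch.randn(seq_len, d_model)         # embeddings + positional encodings from the previous step

# Learned projections produce queries, keys and values (random here for illustration).
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: each row of `weights` says how strongly one token attends to every other token.
scores = Q @ K.T / d_model ** 0.5
weights = F.softmax(scores, dim=-1)       # the attention matrix (seq_len x seq_len)
output = weights @ V                      # context-aware token representations
```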
Multi-Head Self Attention:
The self-attention process does not happen only once. The Transformer architecture uses Multi-Headed Self Attention, which means that multiple sets of self-attention weights, or heads, are learnt in parallel, independently of each other. The number of attention heads varies from model to model but is generally in the order of 8 to 12.
The intuition here is that each self-attention head learns a different aspect of language: for example, one head may learn the relationships between entities while another may focus on the activities described in the sentence. It should be underscored that it is not decided ahead of time which aspect of language an attention head will learn. The weights of the attention heads are randomly initialised, and given adequate training data and time, each head learns a different aspect of language.
From the attention layers, the output is processed through a fully connected feed-forward network.
The output of this layer is a vector of logits, one for each token in the tokenizer dictionary. These logits are passed to a final Softmax layer, where they are normalised into probability scores between 0 and 1.
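A tiny sketch of that final step, using made-up logits over an assumed four-word vocabulary:

```python
import torch

logits = torch.tensor([2.1, 0.3, -1.2, 0.8])   # one logit per token in a toy 4-word vocabulary
probs = torch.softmax(logits, dim=-1)          # normalised probabilities between 0 and 1
print(probs, probs.sum())                      # the probabilities sum to 1; the largest belongs to the most likely token
```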
5. Parameter Efficient Fine Tuning with Additive Adapter Layers:
I mentioned in Section 3 the PEFT technique involving Additive Adapter Layers. These Adapter Layers are typically added inside the Encoder or Decoder component, after the Attention and Feed Forward layers, as shown below:
We insert two Adapter layers per Transformer layer for each task: one after the Attention layer and one after the Feed Forward layer. These Adapter layers are small multi-layer perceptrons (feed-forward layers).
Introducing Adapter layers into the Encoder/Decoder portion of the Transformer, and then training only the weights of these additive layers while keeping all other weights (in the Feed Forward and Attention layers of the Transformer) frozen, is extremely efficient compared to Full Fine Tuning, where ALL the weights of the model get updated, which is very costly in terms of compute and storage.
Only the weights/parameters of the Adapter layers are fine-tuned. As a result, it may also be possible to fine tune on a single GPU when using PEFT with adapters.
Architecture of the Adapter Layers:
In summary, you can describe the adapter layer as having:
a) a down-projection feed-forward layer that maps the input from the model dimension d down to a much smaller bottleneck dimension m,
b) a non-linear activation,
c) an up-projection feed-forward layer that maps from the bottleneck dimension m back up to the model dimension d, and
d) a skip (residual) connection that adds the adapter's input back to its output.
The adapter neural network setup is illustrated in Figure 19 below:
This bottleneck structure ensures parameter efficiency, as the total number of trainable parameters in the adapter layer is relatively small.
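A minimal PyTorch sketch of such a bottleneck adapter follows; the class name, the choice of ReLU and the dimensions (d = 768, m = 64) are illustrative assumptions, while the essential pieces are the down-projection, the up-projection and the skip connection, giving the 2md + m + d parameter count worked out in Section 6.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative adapter block: d -> m -> d with a residual (skip) connection."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down_project = nn.Linear(d_model, bottleneck)   # m*d weights + m biases
        self.activation = nn.ReLU()
        self.up_project = nn.Linear(bottleneck, d_model)     # m*d weights + d biases

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The skip connection lets the adapter start out close to an identity mapping.
        return hidden_states + self.up_project(self.activation(self.down_project(hidden_states)))

adapter = BottleneckAdapter()
trainable = sum(p.numel() for p in adapter.parameters())
print(trainable)   # 2*768*64 + 64 + 768 = 99,136 trainable parameters
```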
Downside of PEFT Using Adapter Layers:
However, there is a downside here. The number of trainable parameters is certainly much smaller than in Full Fine Tuning, but we are introducing new layers into the model, so the total number of weights increases.
Although this has no effect during fine tuning, it does have an effect during inference, because the model must infer using all the parameters, including those of the Adapter layers. As the model is fine-tuned on more tasks and more adapter layers are added, the inference time can increase further.
The answer to this is Low-Rank Adaptation (LoRA), which uses the existing model weights within the Attention layer and re-parametrizes them based on rank decomposition! This will be discussed at length in Part 4 of the series.
6. Quantification of savings in compute and storage using Parameter Efficient Fine Tuning over Full Fine Tuning
Let us now quantify the savings of PEFT with additive adapters in comparison to Full Fine Tuning.
From the explanation of the adapter architecture in the section above, and from the figure below, one should be able to work out the number of trainable parameters in the Adapter layers:
The down-projection contributes m × d weights plus m biases, and the up-projection contributes m × d weights plus d biases (the skip connection adds no parameters). Therefore, the total number of trainable weights/parameters = (2md + m + d) per Adapter per Transformer layer.
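As a worked example under assumed dimensions (d = 768, roughly a BERT-base-sized hidden dimension, a bottleneck of m = 64, two adapters per Transformer layer and 12 layers; all of these values are assumptions for illustration), the arithmetic looks like this:

```python
d, m = 768, 64                     # assumed hidden size and bottleneck size
per_adapter = 2 * m * d + m + d    # 99,136 parameters per adapter
per_layer = 2 * per_adapter        # two adapters per Transformer layer: ~198k
per_model = 12 * per_layer         # e.g. 12 Transformer layers: ~2.4M trainable parameters

print(per_adapter, per_layer, per_model)
# Compared with roughly 110M parameters in a BERT-base-sized model:
print(f"trainable fraction: {per_model / 110_000_000:.2%}")   # ~2%
```

Under these assumed numbers, only about 2% of the model's parameters need to be trained and stored per task, which is where the compute and storage savings over Full Fine Tuning come from.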
The table below shows the total number of parameters in the pre-trained model and the number of trainable parameters per Adapter per Transformer layer.