Introduction to LLM Model Fine Tuning
This is a study note from the "Generative AI with LLMs" course by DeepLearning.AI x AWS. It is only for personal learning purposes. It is generalized and focuses more on practical implementations.
If you have some machine learning modeling experience, I believe that after you finish reading you will have a basic understanding of how LLMs work and how to fine-tune an LLM to perform better on your specific task.
Directory:
Since LinkedIn articles don't have an outline for easy navigation, I created this directory so you can search and jump to the topic of your interest:
Intro to LLM Background & Usage:
> Transformer: "Attention is All You Need"
> Prompt Engineering
LLM Training Optimization
> Quantization
> Fully Sharded Data Parallel (FSDP) by Zero Redundancy Optimizer (ZeRO)
> Chinchilla -- Scaling laws for model and dataset size
Fine Tune LLM Model
> Evaluation Metrics
> ROUGE -- Recall-Oriented Understudy for Gisting Evaluation
> BLEU Score -- Bilingual Evaluation Understudy
> Benchmarks -- Industrial standard validation dataset
> Instruction Fine Tuning
> Parameter-Efficient Fine-Tuning (PEFT)
> LoRA: Low-Rank Adaptation
> Soft prompts -- Prompt Tuning
> Reinforcement Learning with Human Feedback (RLHF)
> Prepare Labeled Data -- Obtaining human feedback
> Reinforcement Learning Algorithm -- Proximal Policy Optimization (PPO)
> Scaling human feedback by Constitutional AI
Intro to LLM Background & Usage:
Transformer: “Attention is All You Need”
The modern Transformer model was proposed in the 2017 paper "Attention Is All You Need" (arxiv-1706.03762) by the Google Brain team. It is a more powerful and efficient sequence-modeling architecture than earlier recurrent ones such as LSTM. A large proportion of current LLMs are built on the Transformer architecture.
It learns an embedding vector space for the words and, during training, attention weights that measure how strongly each word is connected to every other word in the input (self-attention).
By adding position embeddings, it also preserves information about word order: the relevance of each word's position in the sentence.
The Transformer model uses self-attention to compute representations of input sequences, which can capture long-range dependencies between tokens and be computed in parallel across the whole sequence, unlike recurrent models.
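A minimal sketch of scaled dot-product self-attention in plain NumPy (illustrative only; the weight matrices here are random placeholders for the learned projections):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings (position embeddings already added)
    # Wq, Wk, Wv: (d_model, d_head) learned query/key/value projections
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
    return weights @ V                               # each output is a weighted mix of value vectors

d_model, d_head, seq_len = 512, 64, 10
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_head) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (10, 64)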
Prompt Engineering
Example of One-shot Inference:
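A hypothetical example of the pattern, with one solved demonstration of the task followed by the new input the model should complete:

Classify this review: I loved this movie!
Sentiment: Positive

Classify this review: The plot was predictable and the acting felt flat.
Sentiment: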
Few-shot Inference: Adding more examples to the prompt can bring more value in helping the model understand the goal. However, beyond about 5 or 6 examples in the prompt, the improvement in model performance reaches a plateau.
Generative Config:
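The main inference-time knobs are the maximum number of new tokens, greedy vs. sampling decoding, top-k, top-p, and temperature. A hedged Hugging Face transformers sketch (the checkpoint name is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("your-model-name")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("your-model-name")

gen_config = GenerationConfig(
    max_new_tokens=200,   # cap on the length of the generated completion
    do_sample=True,       # sample from the distribution instead of greedy decoding
    temperature=0.7,      # <1 sharpens the next-token distribution, >1 flattens it
    top_k=50,             # keep only the 50 most likely tokens at each step
    top_p=0.9,            # nucleus sampling: keep the smallest set covering 90% probability
)

inputs = tokenizer("Summarize the following conversation: ...", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))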
Limitations:
To resolve the above problems, we will need to fine-tune the LLM. Solutions are covered in the later sections.
LLM Training Cost Optimization
Quantization
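The idea is to store model weights (and optionally other training states) in lower-precision data types such as FP16, BF16, or INT8 instead of FP32, cutting the memory footprint roughly in half or to a quarter. A rough sketch of the memory math for the weights alone:

params = 1_000_000_000                                   # a 1B-parameter model
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.0f} GB just for the weights")
# FP32 ~4 GB, FP16/BF16 ~2 GB, INT8 ~1 GB; full training needs several times more
# once gradients, optimizer states, and activations are included.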
Fully Sharded Data Parallel (FSDP) by Zero Redundancy Optimizer (ZeRO), arxiv-1910.02054:
In Stage 3 (P_os+g+p), where optimizer states, gradients, and parameters are all sharded, the memory reduction is linear in the degree of data parallelism.
FSDP Workflow:
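A hedged PyTorch sketch of that workflow on a toy model (to be launched with torchrun; not the course's code):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                       # torchrun sets the env vars this needs
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)                                   # shard parameters, gradients, optimizer states
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(10):                                   # toy loop on random data
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()                     # forward: FSDP all-gathers the shards it needs
    loss.backward()                                   # backward: gradients are reduce-scattered
    optimizer.step()                                  # each rank updates only its own shard
    optimizer.zero_grad()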
Performance comparison from "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel" (arxiv-2304.11277):
Chinchilla -- Scaling laws for model and dataset size
Intuition: If we fix the compute budget, then the impact of model size and dataset size on model performance follows a power law.
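The Chinchilla paper ("Training Compute-Optimal Large Language Models", arxiv-2203.15556) fits a loss of the form

L(N, D) = E + A / N^α + B / D^β

where N is the number of parameters, D is the number of training tokens, and E is the irreducible loss. The resulting compute-optimal rule of thumb is roughly 20 training tokens per parameter; Chinchilla itself is a 70B-parameter model trained on about 1.4 trillion tokens, and it outperforms much larger models trained on less data.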
Fine Tune LLM Model
Evaluation Metrics
1. ROUGE -- Recall-Oriented Understudy for Gisting Evaluation (see the word-overlap sketch after this list)
2. BLEU Score -- Bilingual Evaluation Understudy
a. Used for text translation
b. Compares to human-generated translations
c. BLEU = average precision across a range of n-gram sizes (the full metric combines the n-gram precisions via a geometric mean and applies a brevity penalty)
3. Benchmarks -- industry-standard evaluation datasets
● GLUE
● SuperGLUE
● HELM
● MMLU (Massive Multitask Language Understanding)
● BIG-bench
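As referenced above, here is a toy sketch of the unigram-overlap idea behind ROUGE-1 (recall, precision, F1); real evaluations should use a maintained library such as rouge-score rather than this simplification:

from collections import Counter

def rouge_1(reference, candidate):
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Unigrams the candidate shares with the reference (counts are clipped)
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

print(rouge_1("It is cold outside", "It is very cold outside"))
# recall 1.0, precision 0.8: the extra word hurts precision but not recall.
# BLEU works similarly but combines precision over several n-gram sizes.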
Instruction Fine Tuning
Wrap the additional training samples into prompt-completion pairs, just as in prompt engineering.
For example, wrap each training example with an instruction describing the task:
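A hypothetical prompt-completion pair in this style (the dialogue itself is made up):

Prompt:
Summarize the following conversation.
Customer: Hi, I'd like a refund for the jacket I ordered.
Agent: Of course, I can process that for you today.
Summary:

Completion:
The customer asked for a refund on a jacket and the agent agreed to process it.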
Instruction fine-tuning on a single-task training dataset can significantly increase performance on that specific task, but it may lead to catastrophic forgetting.
If this model is only used for that specific task, that is fine. Otherwise, a quick remedy is to fine-tune on multi-task datasets, and, going further, to use PEFT (covered next).
Parameter-Efficient Fine-Tuning (PEFT)
In full fine-tuning, every model weight is updated during supervised learning; PEFT, in contrast, updates only a small subset of the parameters.
Three main approaches:
● Selective: Select a subset of the initial LLM parameters to fine-tune.
● Reparameterization: Reparameterize the model weights using a low-rank representation (LoRA: Low-Rank Adaptation).
● Additive: Keep all the original LLM weights frozen and add new trainable layers or parameters to the model.
○ Adapters: Add trainable layers to the model architecture. Usually, inside the encoder or decoder components after the attention or feed-forward layers.
○ Prompt Tuning (Soft Prompts): Keep the architecture frozen and focus on manipulating the input, either by (1) adding trainable parameters to the prompt embeddings, or (2) keeping the input fixed and retraining the embedding weights.
Advantages of PEFT:
LoRA: Low-Rank Adaptation.
In principle, LoRA can also be applied to other layers, such as the feed-forward layers; but most of an LLM's parameters are in the attention layers, so the rank-decomposition matrices gain the biggest savings in trainable parameters there.
It only trains (4,096 + 512) = 4,608 parameters instead of 32,768, roughly an 86% saving. You only need to store the matrices A_i and B_i for each specific task i, then swap them in and update the original model weights (by adding B_i x A_i) before the inference stage of each task. The performance trade-off of LoRA vs. full fine-tuning is comparatively minimal.
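A hedged NumPy sketch of that arithmetic and of the LoRA forward pass, using the example dimensions above (a 512 x 64 attention weight matrix, rank r = 8); this is illustrative, not the loralib implementation:

import numpy as np

d, k, r = 512, 64, 8
full_params = d * k                    # 32,768 weights in the original matrix
lora_params = d * r + r * k            # B is 512x8 (4,096), A is 8x64 (512) -> 4,608
print(f"savings: {1 - lora_params / full_params:.0%}")   # ~86%

W = np.random.randn(d, k)              # frozen pretrained weights
B = np.zeros((d, r))                   # trainable, initialized to zero
A = np.random.randn(r, k) * 0.01       # trainable
alpha = 8                              # LoRA scaling factor

def lora_forward(x):                   # x: (batch, d)
    return x @ (W + (alpha / r) * (B @ A))   # original projection plus the low-rank update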
In the original paper (arxiv-2106.09685), the authors compare performance for different choices of the rank r:
It shows that a higher rank is not always better. Furthermore, there may also be a relationship between the optimal rank and the size of the dataset.
Implementation: GitHub - microsoft/LoRA: Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
Soft prompts -- Prompt Tuning:
With prompt tuning, you add additional trainable tokens to your prompt and leave it up to the supervised learning process to determine their optimal values. The set of trainable tokens is called a soft prompt and it gets prepended to embedding vectors that represent your input text.
During the soft prompt training, the weights of the model are frozen and soft prompt vectors are trained and updated.
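A minimal PyTorch sketch of the mechanism (the embedding layer and frozen model here are placeholders for a real LLM; in practice the Hugging Face peft library's PromptTuningConfig handles this for you):

import torch
import torch.nn as nn

num_virtual_tokens, embed_dim = 20, 4096   # soft-prompt length; embed_dim must match the model
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.01)  # the only trainable weights

def forward_with_soft_prompt(input_ids, embedding_layer, frozen_model):
    token_embeds = embedding_layer(input_ids)                   # (batch, seq_len, embed_dim), frozen
    prompt = soft_prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)  # one copy per example
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)    # prepend the soft prompt
    return frozen_model(inputs_embeds=inputs_embeds)            # all model weights stay frozen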
The performance comparison for prompt tuning, from "The Power of Scale for Parameter-Efficient Prompt Tuning" (arxiv-2104.08691):
Limitation:
● Soft prompts are trainable free tokens that can take any value in the embedding vector space. It is possible that the trained soft-prompt tokens cannot be mapped to any known token or word in the vocabulary or corpus. Usually, we can use nearest-neighbor search (KNN) to find the vocabulary words closest to each soft-prompt token, which tend to have similar meanings, as an interpretation.
Implementation: GitHub - google-research/prompt-tuning: Original Implementation of Prompt Tuning from Lester, et al, 2021
Reinforcement Learning with Human Feedback (RLHF)
The purpose of fine-tuning with human feedback is to obtain a model that is better aligned with human preferences. Examples of bad model behavior include toxic language, aggressive responses, or providing dangerous information. Basically, we want to (1) maximize helpfulness and relevance, (2) minimize harm, and (3) avoid dangerous topics.
The structure of the reinforcement learning setup:
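Concretely, in RLHF the agent is the LLM and its policy is the instruct model being fine-tuned; an action is generating the next token, the state is the prompt plus the tokens generated so far, and the reward comes from a reward model that scores how well a completion aligns with human preferences.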
Prepare Labeled Data -- Obtaining human feedback
The reward model is the most important part of RLHF, and the labeled data is the key to building it.
The procedures to prepare the labeled data:
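Labelers rank several completions of the same prompt, the rankings are converted into pairwise (preferred, rejected) comparisons, and the reward model is trained so the preferred completion gets the higher score. A hedged PyTorch sketch of that pairwise loss (the scores would come from a classifier head on top of an LLM):

import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred, r_rejected):
    # r_preferred / r_rejected: reward-model scores for the human-preferred
    # and the rejected completion of the same prompt.
    # Minimizing this loss pushes r_preferred above r_rejected.
    return -F.logsigmoid(r_preferred - r_rejected).mean()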
Reinforcement Learning Algorithm -- Proximal Policy Optimization (PPO):
Overall PPO objective function (Equation (9) from the original paper Proximal Policy Optimization Algorithms arxiv-1707.06347):
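In the paper's notation, the combined objective adds the clipped policy surrogate, subtracts a weighted value-function error, and adds a weighted entropy bonus:

L_t^{CLIP+VF+S}(θ) = Ê_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]

where c_1 and c_2 are weighting coefficients. The three pieces are described below.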
The value function error term:
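In the paper's notation, this is a squared error:

L_t^{VF}(θ) = (V_θ(s_t) - V_t^{targ})^2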
where the first term, V_θ(s_t), is the estimated future total reward and the second term, V_t^{targ}, is the known future total reward.
The Policy surrogate function:
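In the paper's notation (its Equation (7)):

L_t^{CLIP}(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - ε, 1 + ε) Â_t ) ]

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio between the updated and the old policy, and Â_t is the estimated advantage. Clipping the ratio keeps each policy update within a small, trusted range.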
The entropy bonus function:
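S[π_θ](s_t) is the entropy of the policy's distribution at state s_t; weighting it by c_2 in the overall objective rewards the policy for staying stochastic, which encourages exploration and keeps completions creative, playing a role during training roughly analogous to the temperature setting at inference time.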
Scaling human feedback by Constitutional AI:
0. Create constitutional principles
Summary in one cheat sheet:
Providing a solved sample is a great way to personalise prompts 😃