Introduction to LLM Model Fine Tuning 

This is a study note for the “Generative AI with LLMs” course by DeepLearning.AI x AWS. It is for personal learning purposes only. It is generalized and focuses more on practical implementation.

If you have some Machine Learning modeling experience, I believe that, after you finish reading, you will have a basic understanding of how LLMs work and how to fine-tune an LLM to perform better on your specific task.

Directory:

Since LinkedIn articles don't have an outline for easy navigation, I created this directory so you can search and jump to the topic of interest:

Intro to LLM Background & Usage:
> Transformer: "Attention Is All You Need"
> Prompt Engineering

LLM Training Optimization
> Quantization
> Fully Sharded Data Parallel (FSDP) by Zero Redundancy Optimizer (ZeRO)
> Chinchilla -- Scaling laws for model and dataset size

Fine Tune LLM Model
> Evaluation Metrics
   > ROUGE -- Recall-Oriented Understudy for Gisting Evaluation
   > BLEU Score -- Bilingual Evaluation Understudy
   > Benchmarks -- Industry-standard validation datasets
> Instruction Fine Tuning
> Parameter-Efficient Fine-Tuning (PEFT)
   > LoRA: Low-Rank Adaptation
   > Soft prompts -- Prompt Tuning
> Reinforcement Learning with Human Feedback (RLHF)
   > Prepare Labeled Data -- Obtaining human feedback
   > Reinforcement Learning Algorithm -- Proximal Policy Optimization (PPO)
   > Scaling human feedback with Constitutional AI

Intro to LLM Background & Usage:

Transformer: “Attention is All You Need” 

The modern Transformer model was proposed in the 2017 paper “Attention Is All You Need” (arxiv-1706.03762) by the Google Brain team. It is a more powerful and efficient sequence-modeling architecture than earlier recurrent networks such as LSTMs. A large proportion of current LLMs are built on the Transformer architecture.

[Figure: The Transformer model architecture, from “Attention Is All You Need”]

It learns a vector space for the words (token embeddings) and, during training, attention weights that measure how strongly each word connects to every other word in the sequence (self-attention).

[Figure: self-attention connections between words in a sentence]

By adding positional embeddings, it preserves word-order information: the relevance of each word's position in the sentence.

The Transformer model uses self-attention to compute representations of input sequences, which can 

  • capture long-term dependencies
  • parallelize computation effectively. 

Prompt Engineering

  1. Customize/format the input prompt so the model understands the goal of the output.
  2. Provide examples along with the input prompt to guide the LLM (ICL: in-context learning).

Example of One-shot Inference:

[Figure: one-shot inference prompt example]

Few-shot Inference: adding more examples to the prompt can further help the model understand the goal. However, beyond roughly five or six examples, the improvement in model performance plateaus.
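To make this concrete, here is a minimal sketch (the task and example pairs below are made up for illustration) of how solved examples are prepended to the new input to form a one-shot or few-shot prompt:

```python
# Build an in-context-learning prompt by prepending solved examples to the query.
# One example pair = one-shot inference; several pairs = few-shot inference.
examples = [
    ("Classify the sentiment: 'I loved this movie.'", "Positive"),
    ("Classify the sentiment: 'The plot was a mess.'", "Negative"),
]
query = "Classify the sentiment: 'The acting was brilliant.'"

prompt = ""
for question, answer in examples:
    prompt += f"{question}\nAnswer: {answer}\n\n"
prompt += f"{query}\nAnswer:"

print(prompt)  # this string is what gets sent to the LLM
```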

Generative Config (decoding parameters; a minimal sketch of using them follows the list):

  • Greedy: always return the highest-probability token.
  • Top-K sampling: sample only from the top-k tokens after sorting by probability.
  • Top-P sampling: sample from the smallest set of tokens whose cumulative probability <= P.
  • Temperature: scaling factor applied to the logits before the softmax layer; higher temperature = higher randomness.
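As a hedged sketch, these decoding parameters map directly onto the Hugging Face transformers generate() API (the model name and prompt below are placeholders for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Summarize the following dialogue: ...", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,   # set False for greedy decoding (always the highest-probability token)
    top_k=50,         # sample only from the 50 highest-probability tokens
    top_p=0.9,        # ...or from the smallest set whose cumulative probability <= 0.9
    temperature=0.7,  # <1 sharpens the distribution, >1 flattens it (more randomness)
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```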

Limitations:

  1. A lot of manual effort is needed to try and find the best prompts.
  2. Prompts are limited by the length of the context window.
  3. Performance boosts are limited; adding more examples does not keep helping.

To resolve the above problems, we need to fine-tune the LLM. Solutions are covered in the later sections.

LLM Training Cost Optimization

Quantization

  • Reduces the memory required to store and train models (see the sketch after this list).
  • Projects the original 32-bit floating-point numbers into lower-precision spaces.
  • Quantization-aware training (QAT) learns the quantization scaling factors during training.
  • BFloat16 -- the Brain Floating Point format from Google Brain -- is the most popular choice (arxiv.org-1711.10374).
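A quick way to see the saving (a minimal PyTorch sketch, not a full quantization pipeline): casting FP32 weights to BF16 halves the memory per parameter.

```python
import torch

w_fp32 = torch.randn(1000, 1000, dtype=torch.float32)
w_bf16 = w_fp32.to(torch.bfloat16)

print(w_fp32.element_size() * w_fp32.nelement())  # 4 bytes/param -> 4,000,000 bytes
print(w_bf16.element_size() * w_bf16.nelement())  # 2 bytes/param -> 2,000,000 bytes
```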

[Figure: floating-point formats and their memory footprints]

Fully Sharded Data Parallel (FSDP) by Zero Redundancy Optimizer (ZeRO), arxiv-1910.02054:

[Figure: ZeRO memory optimization stages, from Zero Redundancy Optimizer (ZeRO)]

In Stage 3 (P_os+g+p), the memory saving scales linearly with the number of devices: the per-device memory footprint is divided by the degree of data parallelism.

FSDP Workflow:

[Figure: FSDP workflow]
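A minimal sketch of this workflow with PyTorch's FSDP API (assumes the script is launched with torchrun so a distributed process group can be initialized; the toy model is a placeholder):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)  # parameters, gradients, and optimizer state are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```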

Performance comparison from "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel"(arxiv-2304.11277):

[Figure: FSDP performance comparison]


Chinchilla -- Scaling laws for model and dataset size (arXiv-2203.15556):

Intuition: if we fix the compute budget, then the impact of model size and dataset size on model performance follows a power law.

[Figure: compute-optimal trade-off between model size and training tokens]


  1. Fix model sizes and vary the number of training tokens: this gives a mapping from any FLOP count C to the most efficient choice of model size N and number of training tokens D, subject to FLOPs(N, D) = C.

[Figure: training curves for fixed model sizes and varying token counts]

  2. IsoFLOP profiles: vary the model size for a fixed set of 9 different training FLOP counts (ranging from 6 × 10^18 to 3 × 10^21 FLOPs), and use the final training loss at each point to find the optimal parameter count for a given FLOP budget.
  3. Fitting a parametric loss function (written out after the figure below).

[Figure: fitted parametric loss function]
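For reference, the parametric form fitted in the Chinchilla paper (transcribed from arXiv-2203.15556, so treat the notation as a sketch) models the final loss in terms of parameter count N and training tokens D:

```latex
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \text{minimized subject to } \mathrm{FLOPs}(N, D) = C
```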

Fine Tune LLM Model

Evaluation Metrics

1. ROUGE -- Recall-Oriented Understudy for Gisting Evaluation

  • Used for text summarization
  • Compares a summary to one or more reference summaries (a toy ROUGE-1 computation follows the figure below)

[Figure: ROUGE score calculation example]
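A toy ROUGE-1 computation (unigram overlap only; in practice you would use a library such as rouge_score rather than this sketch):

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())              # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)      # matches / unigrams in reference
    precision = overlap / max(sum(cand.values()), 1)  # matches / unigrams in candidate
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_1("it is cold outside", "it is very cold outside"))
```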

2. BLEU Score -- Bilingual Evaluation Understudy

a. Used for text translation

b. Compares to human-generated translations

c. BLEU = (geometric) average of precision across a range of n-gram sizes, with a brevity penalty for short outputs

3. Benchmarks -- Industry-standard validation datasets

● GLUE
● SuperGLUE
● HELM
○ GitHub - Holistic Evaluation of Language Models (HELM)
○ arxiv-2211.09110

[Figure: HELM evaluation results]

● MMLU (Massive Multitask Language Understanding)
● BIG-bench


Instruction Fine Tuning

Wrap the additional task-specific training samples into prompt-completion pairs, as in prompt engineering.

 

[Figure: instruction fine-tuning with prompt-completion pairs]

For example, wrapping the training examples with the instruction for the task:

[Figure: training examples wrapped with the task instruction]
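A hedged sketch of such wrapping (the template wording and the sample below are illustrative assumptions, not the course's exact template):

```python
def to_instruction_pair(dialogue: str, summary: str) -> dict:
    """Wrap a raw (dialogue, summary) sample into an instruction-style prompt-completion pair."""
    prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
    return {"prompt": prompt, "completion": " " + summary}

sample = to_instruction_pair("Tom: My charger broke. Anna: Borrow mine.",
                             "Tom's charger broke and Anna offered hers.")
print(sample["prompt"])
print(sample["completion"])
```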

Instruction fine-tuning on a single-task training dataset can significantly increase performance on that specific task, but may lead to catastrophic forgetting.

If the model will only be used for that specific task, this is fine. Otherwise, a quick remedy is to fine-tune on multiple task datasets, or better, to use PEFT.

 

Parameter-Efficient Fine-Tuning (PEFT)

In full fine-tuning, every model weight is updated during supervised learning; PEFT, in contrast, updates only a small subset of the parameters.

Three main approaches:

● Selective: select a subset of the initial LLM parameters to fine-tune.
● Reparameterization: reparameterize the model weights using a low-rank representation (LoRA: Low-Rank Adaptation).
● Additive: keep all the original LLM weights frozen and add new trainable layers or parameters to the model.
○ Adapters: add trainable layers to the model architecture, usually inside the encoder or decoder components after the attention or feed-forward layers.
○ Prompt Tuning (Soft Prompts): keep the architecture frozen and focus on manipulating the input, either by (1) adding trainable parameters to the prompt embeddings, or (2) keeping the input fixed and retraining the embedding weights.


Advantages of PEFT:

  1. Lower memory requirement for training: PEFT's trainable parameters are usually only 15% - 20% of the original LLM weights, which significantly reduces the memory required for training. Training can often be done on a single GPU.
  2. Lower storage cost: in full fine-tuning, each task needs a new version of the LLM that is the same size as the original. With PEFT, you only need to store the small set of additional PEFT weights trained for each task.
  3. Resistance to catastrophic forgetting: because the original LLM is only slightly modified, PEFT is less prone to catastrophic forgetting than full fine-tuning.


LoRA: Low-Rank Adaptation

[Figure: LoRA rank-decomposition matrices added alongside the frozen weights]

In principle, LoRA can be applied to other layers such as the feed-forward layers; but since most LLM parameters are in the attention layers, applying rank-decomposition matrices there yields the largest savings in trainable parameters.

[Figure: LoRA parameter-count example]

It trains only (4096 + 512) = 4,608 parameters instead of 32,768, an 86% reduction. You only need to store the matrices A_i and B_i for each specific task i, then swap them in and update the original model weights before the inference stage of each task. The performance trade-off of LoRA versus full fine-tuning is comparatively minimal.
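A minimal sketch of the idea in PyTorch (illustrative only, not the official loralib implementation), using the same 512 × 64 projection and rank r = 8 as the example above:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update B @ A (minimal sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # original frozen projection + scaled low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 64), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4608 (= 4096 + 512) trainable vs 32768 (+ 64 bias) frozen in the base layer
```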

The original paper (arxiv.org-2106.09685) compares the performance of different choices of rank r:

[Figure: validation performance for different ranks r, from "LoRA: Low-Rank Adaptation of Large Language Models"]


It shows that a higher rank is not always better. Furthermore, there may also be a relationship between the optimal rank and the size of the dataset.

Implementation: GitHub - microsoft/LoRA: Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

Soft prompts -- Prompt Tuning:

With prompt tuning, you add additional trainable tokens to your prompt and leave it up to the supervised learning process to determine their optimal values. The set of trainable tokens is called a soft prompt and it gets prepended to embedding vectors that represent your input text.

[Figure: soft prompt tokens prepended to the input embedding vectors]

During soft-prompt training, the weights of the model are frozen and only the soft-prompt vectors are trained and updated.
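A minimal sketch of the mechanism (an illustrative module, not the reference implementation): the trainable soft-prompt vectors are simply concatenated in front of the frozen input embeddings.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend n trainable 'virtual token' vectors to the (frozen) input embeddings."""
    def __init__(self, n_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft = SoftPrompt(n_tokens=20, embed_dim=768)  # only these 20 x 768 values are trained
```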

The performance check for Prompt Tuning from "The Power of Scale for Parameter-Efficient Prompt Tuning"(arxiv-2104.08691):

[Figure: prompt tuning performance vs. model size, from "The Power of Scale for Parameter-Efficient Prompt Tuning"]

Limitation:

● Soft prompts are trainable free tokens that can take any value in the embedding vector space, so the trained soft-prompt tokens may not map to any known token or word in the vocabulary or corpus. Usually, we can use KNN to find the vocabulary words closest to each soft-prompt token, which tend to have similar meanings, as an interpretation.

Implementation: GitHub - google-research/prompt-tuning: Original Implementation of Prompt Tuning from Lester, et al, 2021


Reinforcement Learning with Human Feedback (RLHF)

The purpose of fine-tuning with human feedback is to obtain a model that is better aligned with human preferences. Examples of bad model behavior are toxic language, aggressive responses, or providing dangerous information. Basically, we want to (1) maximize helpfulness and relevance, (2) minimize harm, and (3) avoid dangerous topics.

The structure of reinforcement learning for RLHF:

[Figure: reinforcement learning setup for RLHF]

Prepare Labeled Data -- Obtaining human feedback

The reward model is the most important part of reinforcement learning here, and labeled data is the key to building the reward model.

The procedures to prepare the labeled data:

  1. Define the labeling criteria.
  2. Give a prompt to the model and ask for X completions.
  3. Ask the labelers to rank the completions based on the predefined criteria, resolving disagreements between labelers.

[Figure: labelers ranking model completions]

  4. Convert the rankings into pairwise training data for the reward model.
  5. Reorder each pair so that the preferred completion always comes first.

[Figure: rankings converted into pairwise reward-model training data]

  6. For each input pair (prompt x, completion y_j) and (prompt x, completion y_k), the reward model returns rewards r_j and r_k respectively. These rewards are used in the RL algorithm to update the model.
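For reference, the reward model is typically trained with a pairwise ranking loss over the preferred completion y_j and the rejected completion y_k (this form follows the InstructGPT-style setup the course describes; treat the notation as a sketch):

```latex
\mathcal{L}_{\text{reward}}
  = -\,\mathbb{E}_{(x,\,y_j,\,y_k)}
      \Big[\log \sigma\big(r_\theta(x, y_j) - r_\theta(x, y_k)\big)\Big]
```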


Reinforcement Learning Algorithm -- Proximal Policy Optimization (PPO):

Overall PPO objective function (Equation (9) from the original paper Proximal Policy Optimization Algorithms arxiv-1707.06347):

[Equation: overall PPO objective]

The value function error term:

[Equation: value function error term]

where the first term is the estimated future total reward (the value function output) and the second term is the known (target) future total reward.

The Policy surrogate function:

[Equation: clipped policy surrogate objective]

The entropy bonus term:

[Equation: entropy bonus]
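For reference, the terms above as written in the cited PPO paper (transcribed from arxiv-1707.06347; c_1, c_2 are weighting coefficients, Â_t is the advantage estimate, and ε is the clipping range):

```latex
L_t^{\mathrm{CLIP+VF+S}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\, L_t^{\mathrm{CLIP}}(\theta)
      - c_1\, L_t^{\mathrm{VF}}(\theta)
      + c_2\, S[\pi_\theta](s_t) \right]

L_t^{\mathrm{VF}}(\theta) = \big(V_\theta(s_t) - V_t^{\mathrm{targ}}\big)^2

L_t^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```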

Scaling human feedback with Constitutional AI:

[Figure: Constitutional AI training process]

  0. Create the constitutional principles.

  1. Red Teaming: ask the model to generate harmful responses.
  2. Ask the model to critique the Red Teaming responses it generated in step 1, and ask it to revise them to comply with the constitutional principles.
  3. Fine-tune the model using the pairs of [Red Teaming response, revised constitutional response].

Source: "Constitutional AI: Harmlessness from AI Feedback", arxiv-2212.08073.

Summary in one cheat sheet:

[Figure: summary cheat sheet]



