Understanding Hyperparameters in Large Language Models: A Comprehensive Guide


Large Language Models (LLMs) have emerged as transformative tools, possessing an extraordinary ability to process and generate human-like language. Trained on vast amounts of internet data, these models have acquired a remarkable "intelligence" that enables them to provide comprehensive and informative responses to a wide range of questions.

While LLMs exhibit remarkable capabilities, it's crucial to recognize that they are not infallible. Some of their apparent limitations stem not from the models themselves but from how we formulate prompts and interact with them. This is where prompt engineering comes into play.

Prompt Engineering: Unleashing LLM Potential

Prompt engineering is a discipline dedicated to crafting effective instructions for LLMs, ensuring that they accurately interpret our intent and produce the desired outcomes. This process involves carefully designing prompts to guide the model towards the intended task, providing it with the necessary context and information to generate relevant and insightful responses.

Hyperparameter Tuning

A critical component of prompt engineering, often overlooked, is hyperparameter tuning. Hyperparameters are essentially the settings that control the inference process of an LLM. Just as in traditional machine learning models, carefully adjusting hyperparameters is essential for harnessing the full potential of these large models.

LLMs are built on the Transformer architecture, which involves two key stages during inference: encoding and decoding.

Encoding: Transforming Words into Numbers

During encoding, the user's input prompt is converted into numerical representations called word embeddings. These embeddings allow the model to process and understand the information contained in the prompt.
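
As a toy illustration of this step, here is a minimal Python sketch; the vocabulary and embedding matrix are invented for the example, whereas real models use learned embeddings over tens of thousands of sub-word tokens:

```python
import numpy as np

# Toy vocabulary mapping tokens to integer IDs (illustrative only)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))  # one row per token

prompt_tokens = ["the", "cat", "sat", "down"]
token_ids = [vocab[t] for t in prompt_tokens]   # tokens -> integer IDs
embeddings = embedding_matrix[token_ids]        # IDs -> dense vectors
print(embeddings.shape)                         # (4, 4): one vector per prompt token
```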

Decoding: Generating the Response

In the decoding phase, the model uses the encoded information to generate the most relevant and informative response possible. Two main approaches exist for decoding (a short sketch contrasting them follows the list below):

  • Deterministic Decoding: Here, the model selects the most probable token at each step based on the probability distribution generated by the Softmax layer. This approach produces accurate and consistent responses but may lack creativity.
  • Randomized Decoding: This introduces an element of chance by selecting probable, but not necessarily the most probable, tokens. This approach encourages diversity and creativity in the responses but may also lead to less precise or coherent results.
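
To make the distinction concrete, here is a minimal, self-contained Python sketch contrasting greedy selection with sampling from the softmax distribution; the candidate tokens and logits are invented for illustration:

```python
import numpy as np

# Hypothetical candidate tokens and raw model scores (logits), for illustration only
tokens = ["hier", "pour", "dans", "et"]
logits = np.array([2.1, 1.8, 1.4, 0.3])

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)

# Deterministic decoding: always pick the most probable token
greedy_token = tokens[int(np.argmax(probs))]

# Randomized decoding: sample a token according to its probability
rng = np.random.default_rng(0)
sampled_token = tokens[int(rng.choice(len(tokens), p=probs))]

print("probabilities:", dict(zip(tokens, probs.round(3))))
print("greedy choice:", greedy_token)
print("sampled choice:", sampled_token)
```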


In this article, we embark on a journey to explore the key hyperparameters that shape the behavior of language models. We will delve into their impact on model outputs and provide insights into how to effectively tune them for optimal performance.

Temperature T

Temperature is a parameter that controls the randomness of the model's output during decoding. It affects the probability distribution over tokens produced by the Softmax layer: intuitively, raising the temperature reduces the influence of the largest logits when the softmax turns them into probabilities.

The temperature enters directly into the softmax calculation: each logit z_i is divided by T before normalization, so the probability of token i becomes

P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)   (softmax with temperature)

A short Python sketch after the list below illustrates the effect.


  • Low Temperature (T<=1): With a low temperature, the model favors the most probable tokens, leading to more predictable and conservative responses. This is suitable for tasks requiring factual accuracy, such as question answering. T=1 is the classical softmax setting.
  • High Temperature (T>1): As the temperature increases, the probability distribution becomes more uniform, giving less probable tokens a higher chance of being selected. This encourages more creative and diverse outputs, but it may also increase the risk of nonsensical or inaccurate responses.
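
The sketch below (with made-up logits) shows how dividing the logits by T before the softmax sharpens or flattens the resulting distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by a temperature T > 0."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.1, 1.8, 1.4, 0.3]   # hypothetical next-token scores

for T in (0.5, 1.0, 2.0):
    print(f"T={T}:", softmax_with_temperature(logits, T).round(3))
# Low T sharpens the distribution toward the most probable token;
# high T flattens it, giving less probable tokens more weight.
```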

Choosing the Right Temperature

The appropriate temperature value depends on the specific task and the desired outcome. For factual tasks, a lower temperature is generally preferred to ensure accuracy. For creative tasks, a higher temperature can be used to explore more possibilities and generate more engaging content.

Top-k and Top-p sampling

Top-k sampling restricts the choice of the next token to the most relevant and likely continuations. The model ranks all possible next tokens by the probabilities produced by the Softmax function and keeps only the top k of them. By adjusting the value of k (typically ranging from 1 to 100, with a common default of 50), we control the level of diversity in the generated output. Lower k values lead to a more focused selection, often favoring the most predictable choices, while higher k values allow for a wider range of possibilities, potentially introducing more creative and unexpected elements. This approach offers a balance between ensuring coherence and encouraging some level of variation in the generated text.

To illustrate Top-k, consider the sentence "J’ai rencontré mon professeur ..." ("I met my teacher ...") as a starting point for an LLM to complete. Here are four possible continuations to illustrate the model's potential outputs:

  1. "... et il m'a donné un bon conseil." (and he gave me good advice)
  2. "... hier à la bibliothèque." (yesterday at the library)
  3. "... pour la première fois aujourd'hui." (for the first time today)
  4. "... dans la rue ce matin." (on the street this morning)

Suppose the probabilities (in descending order) of the tokens likely to continue the sentence are as follows:

Token      Probability
"hier"     0.42
"pour"     0.31
"dans"     0.20
"et"       0.07

If k=2, only the top 2 tokens, "hier" and "pour", will be added to the sampling subset from which the LLM selects the output token. If k=4, all four options could be considered. The higher k is, the greater the potential variety in the output.
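
A minimal Python sketch of the Top-k filtering step, using the illustrative probabilities from the table above:

```python
import numpy as np

tokens = ["hier", "pour", "dans", "et"]
probs = np.array([0.42, 0.31, 0.20, 0.07])

def top_k_filter(tokens, probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    order = np.argsort(probs)[::-1][:k]             # indices of the top-k tokens
    kept_probs = probs[order] / probs[order].sum()  # renormalize to sum to 1
    return [tokens[i] for i in order], kept_probs

print(top_k_filter(tokens, probs, k=2))   # (['hier', 'pour'], [~0.58, ~0.42])
print(top_k_filter(tokens, probs, k=4))   # all four tokens remain candidates
```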

Top-p Sampling:

Alternatively, Top-p (nucleus) sampling controls the diversity of the generated output by selecting tokens based on their cumulative probability. Tokens are added, in descending order of probability, as long as their cumulative sum does not exceed the specified threshold p (a decimal value between 0 and 1.0).

Returning to the example above, if p=0.8, only "hier" and "pour" (cumulative probability 0.42 + 0.31 = 0.73) will be added to the sampling subset.
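
A minimal Python sketch of this rule, reusing the same illustrative probabilities (note that some libraries instead keep the smallest set whose cumulative probability reaches p, which can include one extra token):

```python
import numpy as np

tokens = ["hier", "pour", "dans", "et"]
probs = np.array([0.42, 0.31, 0.20, 0.07])

def top_p_filter(tokens, probs, p):
    """Keep tokens, in descending probability, while the cumulative sum stays <= p."""
    order = np.argsort(probs)[::-1]
    kept, total = [], 0.0
    for i in order:
        if total + probs[i] > p:
            break
        kept.append(i)
        total += probs[i]
    kept_probs = probs[kept] / probs[kept].sum()   # renormalize the kept subset
    return [tokens[i] for i in kept], kept_probs

print(top_p_filter(tokens, probs, p=0.8))   # (['hier', 'pour'], ...) since 0.42 + 0.31 = 0.73 <= 0.8
```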

General Recommendation for Sampling Parameters:

When using temperature (T) or top-p/top-k sampling for LLM text generation, it's generally recommended to adjust only one of these parameters at a time, rather than modifying both simultaneously. This approach helps to isolate the impact of each parameter and allows for more precise control over the generated text.

Modifying both T and top-p/top-k together can lead to complex interactions and make it challenging to predict the exact outcome. By adjusting one parameter at a time, we can better understand its individual effect on the generated text and make more informed decisions about how to tune the sampling process.
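
For instance, with an OpenAI-style chat completion API (sketched here with the openai Python client; the model name is only a placeholder), one would typically adjust either temperature or top_p and leave the other at its default:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": "Write a short poem about autumn."}],
    temperature=1.2,       # raised for a more creative output
    # top_p is left at its default: adjust one sampling parameter at a time
)
print(response.choices[0].message.content)
```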

Frequency and Presence Penalties

Frequency (Repetition) penalty is a hyperparameter (typically a decimal value between -2.0 and 2.0) that discourages the model from repeatedly using the same tokens in the generated text. This helps to promote diversity and prevent the model from getting stuck in loops or producing repetitive outputs.

How It Works:

  • The penalty is applied to the probabilities of tokens that have recently been used in the generated text.
  • As a token is used more frequently, its probability is lowered, making it less likely to be selected again.

The presence penalty is similar to the frequency penalty, but it applies a flat, one-time penalty to any token that has appeared at least once, regardless of how many times it has been used.

While both penalties aim to reduce repetition, they have subtle differences:

  • Frequency Penalty: penalizes a token in proportion to how often it has already been used in the generated text, so heavily repeated tokens are penalized more strongly.
  • Presence Penalty: applies the same fixed penalty to any token that has appeared at least once, regardless of its count, so as to encourage a wider assortment of tokens.

Effects on Output:

Frequency penalty encourages the model to explore a wider range of vocabulary and avoid generating repetitive phrases or sentences. This can be particularly beneficial for tasks that require diverse and creative outputs, such as poetry generation or storytelling.
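
One common formulation (used, for example, by OpenAI-style APIs) applies both penalties directly to the logits before sampling. The sketch below assumes the per-token usage counts have already been collected from the text generated so far:

```python
import numpy as np

def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust next-token logits given how often each token has already appeared.

    counts[i] = number of times token i occurs in the text generated so far.
    The frequency penalty scales with the count; the presence penalty is a flat
    one-time reduction for any token that has appeared at least once.
    """
    counts = np.asarray(counts, dtype=float)
    return (np.asarray(logits, dtype=float)
            - frequency_penalty * counts
            - presence_penalty * (counts > 0).astype(float))

logits = np.array([2.0, 1.5, 1.0])   # hypothetical scores for three tokens
counts = np.array([3, 1, 0])         # token 0 used 3 times, token 1 once, token 2 never
print(apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=0.4))
# -> [0.1, 0.6, 1.0]: the heavily repeated token is penalized the most
```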


Max output tokens

Max output tokens sets the maximum number of tokens an LLM can generate in its response. This parameter significantly impacts the coherence and contextual relevance of the output, as well as the computational cost of producing it.

Higher Max Outputs:

  • Coherence: Longer responses can express ideas more comprehensively and logically.
  • Contextual Relevance: Better addresses the input prompt and provides a more complete response.
  • Computational Demands: Increased inference time and memory usage, potentially impacting cost and speed.

Lower Max Outputs:

  • Faster Inference: Reduces processing time and resource requirements, leading to faster response generation.
  • Cost Control: Lower resource usage translates to lower computational costs.
  • Risk of Incoherence: Limited space for the model to develop a comprehensive and coherent response.
  • Error Possibility: Shorter outputs may lack context and potentially lead to inaccuracies or misunderstandings.

Use Cases:

  • Boosting Performance: Setting higher max tokens can be beneficial for tasks requiring detailed and context-rich answers, such as comprehensive summaries or in-depth explanations.
  • Cost-Effective Responses: Lower max tokens are suitable for tasks where speed and cost efficiency are prioritized, such as simple question answering or short text generation.

Stop sequences

Beyond setting a maximum number of tokens, we have another tool for controlling the length of an LLM's response: stop sequences.

Stop sequences are strings of characters that signal the LLM to halt its output generation. A common example is a period (".") which instructs the model to stop at the end of a sentence.

Alternatively, you can define the stopping point using a stop token limit. This is an integer value that specifies the number of tokens the model generates before stopping. For instance, a limit of 1 would restrict the output to a single sentence, while a limit of 2 might constrain it to a paragraph.

Similar to the max output tokens hyperparameter, stop sequences offer finer control over the LLM's inference process. This can be particularly beneficial when budget constraints are a concern, as shorter outputs often translate to lower computational costs.
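
In practice, both controls are usually exposed as request options. A hedged sketch with the openai Python client (the model name, token limit, and stop strings are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": "List three effects of temperature on LLM outputs."}],
    max_tokens=150,            # cap the length (and cost) of the reply
    stop=["\n\n", "###"],      # generation halts as soon as either string appears
)
print(response.choices[0].message.content)
```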

Conclusion

As we've delved into the intricacies of decoding, hyperparameters, and their impact on LLM behavior, it's evident that understanding and manipulating these parameters is crucial for unlocking the true potential of these powerful language models.

Whether you're crafting engaging chatbots, generating captivating content, or devising intelligent recommendation systems, mastering hyperparameters empowers you to fine-tune the performance of LLMs to suit your specific needs.

Remember, there's no one-size-fits-all approach. Embrace experimentation, iterate relentlessly, and strike the perfect balance that aligns with your unique use case. By doing so, you'll transform LLMs into invaluable tools that augment human capabilities, revolutionizing the way we interact with information and the world around us.
