LLM Text Generation: Why is Determinism so Hard to Achieve?

Large Language Models employ various parameters to generate text. One critical parameter in text generation is 'temperature'. Technically, it is defined as the inverse of the scaling factor applied to the logits before the softmax function (Programmer 2023). Practically, it controls the degree of randomness in the generated output. In this article, I will explore how temperature scaling plays out in real-world applications of LLMs, and why, despite these controls, achieving deterministic outputs remains a complex challenge.

To illustrate the concept of temperature, consider a distribution of log odds (logits) {log(p1), log(p2), …, log(pn)}. Introducing a temperature variable T, we scale these logits by dividing them by T, leading to a transformed distribution: {log(p1)/T, log(p2)/T, …, log(pn)/T}. The probability of selecting a particular token i then becomes P_i = e^(y_i/T) / Σ_j e^(y_j/T), where y_i = log(p_i) are the original, unscaled logits.
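This formula is straightforward to implement. Here is a minimal sketch in plain Python (no ML framework assumed) of the temperature-scaled softmax described above:

import math

def softmax_with_temperature(logits, temperature):
    # Divide each logit by T, then apply the standard softmax.
    scaled = [y / temperature for y in logits]
    exps = [math.exp(y) for y in scaled]
    total = sum(exps)
    return [e / total for e in exps]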

[Figure 1. Impact of T on log odds]

The effect of temperature on the distribution is nuanced (a short numerical demo follows this list):

- When T = 1, the original distribution remains unchanged.

- When T < 1, the distribution sharpens: the model's output becomes less random and more predictable, potentially sacrificing creativity.

- When T > 1, the distribution flattens, enhancing randomness and creative potential in the output.
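To see this concretely, here is a small demo using the softmax_with_temperature sketch above on made-up logits for a three-token vocabulary:

logits = [2.0, 1.0, 0.5]  # made-up logits for three tokens

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])

# T=0.5 -> [0.844, 0.114, 0.042]  (sharper: the top token dominates)
# T=1.0 -> [0.629, 0.231, 0.14]   (the original distribution)
# T=2.0 -> [0.481, 0.292, 0.227]  (flatter: closer to uniform)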

Theoretically, setting the temperature parameter to zero (T = 0) should lead to deterministic outputs, with the highest probability token always being selected. This scenario implies that the model would consistently generate the same text for a given input. Yet, in practice, this level of determinism is unobtainable in most language models.
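One wrinkle the formula itself makes clear: at T = 0 the division log(p_i)/T is undefined, so implementations that do accept T = 0 typically treat it as greedy decoding, always taking the argmax. A minimal sketch of that convention:

def greedy_pick(logits):
    # Greedy decoding: always return the index of the highest logit.
    return max(range(len(logits)), key=lambda i: logits[i])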

In some fields, particularly in scientific research, there is a strong preference for deterministic outputs, where the same input consistently produces the same output. This predictability is crucial for replicability and reliability in scientific studies (Ouyang et al. 2023).

When I experimented with this concept, I discovered that some models do not allow setting the temperature to exactly zero due to technical constraints; doing so results in an error. However, I did encounter models like OpenAI’s GPT-3.5 and Falcon-7B that permit a zero-temperature setting. Surprisingly, even these models do not achieve complete determinism at zero temperature. For instance, in an analysis of GPT-3.5, the mean Longest Common Subsequence (LCS) similarity across runs of a code-generation task was only 0.77 with the temperature set to zero (Ouyang et al. 2023).

One possible explanation for the non-deterministic behavior at a zero temperature setting is a deliberate programming decision: an edge case in the code that circumvents a division by zero error. In such a scenario, the temperature is silently clamped to a value close to, but not exactly, zero. As a result, randomness persists during the sampling process, usually through top-k or nucleus sampling.
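A hypothetical sketch of such a guard (the constant and function names here are illustrative, not taken from any particular library):

EPSILON = 1e-6  # an assumed floor; real implementations may differ

def scale_logits(logits, temperature):
    # Clamp T so that dividing by it never raises a ZeroDivisionError;
    # a requested T = 0 quietly becomes T = 1e-6.
    temperature = max(temperature, EPSILON)
    return [y / temperature for y in logits]

With that in mind, consider the following API call as an initial setup: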

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_text(conversation):
    res = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=conversation,
        temperature=0.0,
    )
    return res.choices[0].message.content
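Calling this function twice with the same prompt makes the behavior easy to reproduce:

conversation = [
    {"role": "user",
     "content": "What is the best thing about being a student at Cornell?"}
]
print(generate_text(conversation))
print(generate_text(conversation))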

Even though both calls run with a zero temperature setting, the responses show variations:

- Response 1: "The best thing about being a student at Cornell is the opportunity to receive a world-class education from renowned faculty members in a wide range of academic disciplines. Additionally, the campus is beautiful and offers a vibrant and diverse community …"

- Response 2: "The best thing about being a student at Cornell is the opportunity to receive a world-class education from renowned faculty members in a diverse and vibrant community. Cornell offers a wide range of academic programs and resources …"

In these responses, despite the divergence in content after a certain point, there is noticeable convergence in specific phrases, such as “wide range of academic” and “vibrant and diverse community.” While the outputs diverge, their thematic elements and some exact phrases remain highly similar. Even so, with the model effectively running at a near-zero rather than zero temperature, the residual randomness at the sampling stage shapes the generation process and prevents absolute determinism.

In response to the demand for more deterministic outputs, OpenAI introduced the seed parameter. This parameter is intended to make token sampling after temperature scaling reproducible, offering a promise of consistent outputs for the same input under the same conditions. In practice, however, this does not always hold. For example, consider the following experiment with the same prompt, where we set

seed=1        
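In the API, this is one more argument to the same create call used earlier (seed is a documented parameter of chat.completions.create, though OpenAI describes the resulting determinism as best-effort):

res = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=conversation,
    temperature=0.0,
    seed=1,
)

Two runs of this call produced the following outputs: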

- Output 1: "... Additionally, the beautiful campus and the strong sense of tradition and pride among students and alumni make for a truly unique and enriching college experience. …"

- Output 2: "... Additionally, the beautiful campus and the strong sense of tradition and pride among students and alumni make for a unique and enriching college experience…"

Despite using the same seed, the outputs differ by a single word. This difference, seemingly minor and involving just a filler word, calls the promised determinism into question. Why do such variations occur even when T = 0 and the seed is fixed?

The answer lies in the computational processes used to train and run these models. Boris Power, head of applied research at OpenAI, points out that "there’s inherent non-determinism in GPU calculations around floating point operations" (Power 2021). GPUs are optimized for speed and parallel processing, and that parallelism introduces non-determinism: floating point arithmetic is not associative, so different evaluation orders (e.g., computing a*b*c as (a*b)*c versus a*(b*c)) can produce slightly different results.
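The underlying arithmetic effect is easy to reproduce even on a CPU; on a GPU, parallel reductions can change the grouping of operands from run to run, which is where the run-to-run variation creeps in:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False: addition is not associative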

These minute differences, often less than 1% in the probability of the top two tokens, can lead to the selection of a different token. Once a different token is chosen, the subsequent generation diverges more significantly, because the probability distributions for all following tokens change (Power 2022). Thus, in longer texts the likelihood of divergent outputs increases, even when the model operates at T = 0 with a specified seed.
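To make the compounding concrete, here is a toy illustration with made-up logits, reusing the greedy_pick sketch from earlier: a perturbation far below 1% is enough to flip the greedy choice, after which the two runs generate different continuations.

run_a = [5.0000001, 5.0000000, 1.0]  # hypothetical logits from run 1
run_b = [5.0000000, 5.0000001, 1.0]  # the "same" logits after a reordered reduction

print(greedy_pick(run_a), greedy_pick(run_b))  # 0 1 -> different tokens chosen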

Takeaways and Conclusions

According to NVIDIA, the non-determinism often observed in TensorFlow and GPU operations is attributable to asynchronous floating point operations (Riach 2019, slide 51). So while there is notable demand for predictability in LLM outputs, the practical course is to understand and adjust to the stochastic nature inherent in these models.

This adjustment can be compared to how we deal with unpredictability in human communication. In most applications, particularly when dealing with shorter texts, the minor variations in output from LLMs typically do not drastically change the overall content. Therefore, while perfect consistency in outputs may not always be possible, these slight deviations are generally not substantial enough to significantly impact the effectiveness of the models for users seeking reliable results.

Works Cited

Ouyang, Shuyin, et al. “LLM Is Like a Box of Chocolates: The Non-Determinism of ChatGPT in Code Generation.” arXiv, 5 Aug. 2023, arxiv.org/pdf/2308.02828.pdf

Power, Boris. “A Question on Determinism.” OpenAI Developer Forum, 22 Aug. 2021, community.openai.com/t/a-question-on-determinism/8185

Power, Boris. “This Happens with All the Models in Our API...” Twitter, 29 Dec. 2022, twitter.com/BorisMPower/status/1608522707372740609

Riach, Duncan. “Determinism in Deep Learning.” NVIDIA GTC, 2019, developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9911-determinism-in-deep-learning.pdf
