Cut Your Losses in Large-Vocabulary Language Models

Today's paper introduces Cut Cross-Entropy (CCE), a method for reducing memory consumption in large language model training. It addresses a critical bottleneck: as vocabulary sizes grow, the cross-entropy loss computation consumes a disproportionate share of memory, and in current LLMs this single component can account for up to 90% of the total training memory footprint. CCE dramatically reduces this memory usage without sacrificing training speed or convergence.
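To get a feel for the scale involved, here is a rough back-of-the-envelope estimate of the naive logit matrix; the token count and precision below are assumptions for illustration, only the ~256k vocabulary size comes from Gemma 2, and the result is not meant to reproduce the paper's exact figures.

```python
# Rough estimate of the memory a naive cross-entropy needs just to hold the
# full logit matrix. The vocabulary size matches Gemma 2 (~256k entries);
# the token count and dtype are assumptions chosen for illustration.
vocab_size = 256_000          # Gemma 2 vocabulary (~256k entries)
tokens_per_batch = 8 * 4096   # assumed: 8 sequences of 4,096 tokens
bytes_per_logit = 4           # assumed: logits upcast to fp32 for a stable loss

logits_gib = tokens_per_batch * vocab_size * bytes_per_logit / 1024**3
print(f"naive logit matrix alone: {logits_gib:.1f} GiB")  # ~31 GiB, before gradients
```

The exact number depends on batch size and precision, but the tokens-times-vocabulary product is what CCE avoids ever materializing.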

Method Overview

CCE works by reformulating how the cross-entropy loss is computed during training. It computes the cross-entropy loss without materializing the logits for all tokens into global memory. Instead, it only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly.
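To make this concrete, here is a simplified PyTorch sketch of that idea. The paper's actual implementation uses custom fused Triton kernels that keep intermediate results in on-chip SRAM; the function name, block size, and explicit Python loop below are illustrative assumptions, not the authors' code.

```python
import torch

def cce_loss_sketch(hidden, classifier, targets, block=8192):
    """Cross-entropy without materializing the full [tokens, vocab] logit matrix.

    hidden:     [tokens, dim]   last-layer embeddings
    classifier: [vocab, dim]    unembedding / LM-head weights
    targets:    [tokens]        correct token ids
    """
    # Logit of the correct token only: one dot product per position
    # (the memory-efficient indexed matrix multiplication).
    correct_logit = (hidden * classifier[targets]).sum(-1)

    # Running log-sum-exp over vocabulary blocks, so only a
    # [tokens, block] slice of logits exists at any one time.
    lse = torch.full_like(correct_logit, float("-inf"))
    for start in range(0, classifier.shape[0], block):
        block_logits = hidden @ classifier[start:start + block].T
        lse = torch.logaddexp(lse, torch.logsumexp(block_logits, dim=-1))

    # loss_i = logsumexp_j(e_j . x_i) - e_{y_i} . x_i, averaged over tokens
    return (lse - correct_logit).mean()
```

Unlike the real CCE kernels, this sketch still writes each block of logits to global memory, so it captures the memory idea rather than the speed of the fused implementation.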

The method consists of three main components. First, it implements a memory-efficient indexed matrix multiplication that avoids storing large intermediate results. Second, it introduces a linear-log-sum-exp operation that computes necessary values in blocks while keeping memory usage minimal. Finally, it leverages the inherent sparsity of the softmax operation to skip computations that would have negligible contributions to the gradient.
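A reassuring property of this reformulation is that it computes the same loss as a standard implementation, up to floating-point rounding. A quick sanity check using the cce_loss_sketch function from the sketch above (shapes and seed are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden = torch.randn(512, 256)          # 512 tokens, 256-dim embeddings
classifier = torch.randn(32_000, 256)   # 32k-entry vocabulary
targets = torch.randint(0, 32_000, (512,))

blocked = cce_loss_sketch(hidden, classifier, targets, block=4096)
full = F.cross_entropy(hidden @ classifier.T, targets)  # materializes all logits
print(torch.allclose(blocked, full, atol=1e-4))  # True, up to float32 rounding
```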

A key ingredient is gradient filtering, which exploits the fact that most tokens in the vocabulary receive vanishingly small probabilities and therefore contribute almost nothing to the gradient computation. By skipping these entries, CCE achieves significant speed improvements without affecting training quality.
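In plain PyTorch, the filtering idea looks roughly like the sketch below. This only illustrates the concept: CCE applies the filter blockwise inside its fused backward kernel so the skipped entries are never computed at all, and the threshold used here is an assumption rather than the paper's exact setting.

```python
import torch

def filtered_logit_grad(logits, targets, eps=2.0 ** -12):
    """Gradient of the mean cross-entropy w.r.t. the logits, with
    vocabulary entries of negligible softmax probability dropped.
    The eps cutoff is an illustrative assumption."""
    probs = torch.softmax(logits, dim=-1)

    # Near-zero probabilities contribute essentially nothing to the
    # gradient, so zero them out (CCE skips computing them entirely).
    probs = torch.where(probs < eps, torch.zeros_like(probs), probs)

    # d(loss_i)/d(logits) = softmax - onehot(target), averaged over tokens.
    grad = probs
    grad[torch.arange(logits.shape[0]), targets] -= 1.0
    return grad / logits.shape[0]
```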

Results

The results are impressive:

  • For the Gemma 2 (2B) model, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB
  • The total training-time memory consumption of the classifier head drops from 28 GB to 1 GB
  • These reductions come without sacrificing training speed or convergence
  • Batch sizes can be increased by 1.5x to 10x, depending on the model
  • Training curves are identical to those of traditional implementations

Conclusion

CCE addresses the memory bottleneck associated with cross-entropy loss computation. The method enables training models with larger vocabularies and batch sizes while maintaining training efficiency, potentially paving the way for more efficient pipeline parallelism in very large models. For more information, please consult the full paper.

Congrats to the authors for their work!

Wijmans, Erik, et al. "Cut Your Losses in Large-Vocabulary Language Models." arXiv preprint arXiv:2411.09009 (2024).
