Cut Your Losses in Large-Vocabulary Language Models
Today's paper introduces Cut Cross-Entropy (CCE), a method for reducing memory consumption in large language model training. It addresses a critical bottleneck: as vocabulary sizes grow, the cross-entropy loss computation consumes a disproportionate share of memory, and in current LLMs this single component can account for up to 90% of the total training memory footprint. CCE dramatically reduces this memory usage without sacrificing training speed or convergence.
Method Overview
CCE reformulates how the cross-entropy loss is computed during training: instead of materializing the logits for all tokens in global memory, it computes only the logit for the correct token and evaluates the log-sum-exp over all logits on the fly.
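To make the reformulation concrete, here is a minimal, non-fused PyTorch sketch of the idea, assuming an embedding matrix of shape [num_tokens, d] and a classifier matrix of shape [vocab_size, d]. The function and parameter names (cce_loss_reference, block_size) are illustrative rather than taken from the paper's code, which implements this as fused custom GPU kernels:

```python
import torch

def cce_loss_reference(embeddings, classifier, targets, block_size=4096):
    """Sketch: cross-entropy without materializing the full
    [num_tokens, vocab_size] logit matrix. The correct-token logit is
    computed directly, and the log-sum-exp over the vocabulary is
    accumulated block by block."""
    num_tokens = embeddings.shape[0]
    vocab_size = classifier.shape[0]

    # Logit of the correct token only: one dot product per token (indexed matmul).
    correct_logit = (embeddings * classifier[targets]).sum(dim=-1)

    # Running log-sum-exp over vocabulary blocks.
    lse = torch.full((num_tokens,), float("-inf"),
                     device=embeddings.device, dtype=embeddings.dtype)
    for start in range(0, vocab_size, block_size):
        block = classifier[start:start + block_size]   # [block_size, d]
        block_logits = embeddings @ block.T            # [num_tokens, block_size], freed each iteration
        lse = torch.logaddexp(lse, torch.logsumexp(block_logits, dim=-1))

    # Per-token cross-entropy: log-sum-exp minus the correct-token logit.
    return (lse - correct_logit).mean()
```

Because each block of logits is discarded as soon as it has updated the running log-sum-exp, peak memory scales with the block size rather than with the full vocabulary.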
The method consists of three main components. First, it implements a memory-efficient indexed matrix multiplication that avoids storing large intermediate results. Second, it introduces a linear-log-sum-exp operation that computes necessary values in blocks while keeping memory usage minimal. Finally, it leverages the inherent sparsity of the softmax operation to skip computations that would have negligible contributions to the gradient.
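As an illustration of the second component, a blockwise log-sum-exp can be maintained with a running maximum and a rescaled running sum, so each block is folded in with numerically stable arithmetic. The sketch below uses assumed names (online_logsumexp_update); in the paper this update is fused with the matrix multiplication inside a single kernel rather than written as a separate PyTorch function:

```python
import torch

def online_logsumexp_update(running_max, running_sum, block_logits):
    """One streaming update of a numerically stable log-sum-exp.
    running_max, running_sum: [num_tokens] statistics carried across
    vocabulary blocks (initialized to -inf and 0, respectively).
    block_logits: [num_tokens, block_size] logits of the current block."""
    block_max = block_logits.max(dim=-1).values
    new_max = torch.maximum(running_max, block_max)
    # Rescale the previous sum to the new maximum, then add this block's terms.
    new_sum = (running_sum * torch.exp(running_max - new_max)
               + torch.exp(block_logits - new_max.unsqueeze(-1)).sum(dim=-1))
    return new_max, new_sum

# After the final block, the log-sum-exp is running_max + torch.log(running_sum).
```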
A key ingredient is gradient filtering, which exploits the fact that most vocabulary tokens receive extremely small probabilities and therefore contribute negligibly to the gradient. By skipping these entries in the backward pass, CCE achieves significant speed improvements without affecting training quality.
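The sketch below shows what this filtering could look like for a single vocabulary block in the backward pass; the threshold value and function names are assumptions for illustration, not the paper's exact kernel logic (the paper ties the cutoff to the numerical precision of the gradients):

```python
import torch

def filtered_block_gradient(embeddings, block_classifier, lse, threshold=1e-4):
    """Sketch of gradient filtering for one vocabulary block.
    embeddings: [num_tokens, d], block_classifier: [block_size, d],
    lse: [num_tokens] log-sum-exp over the full vocabulary.
    The gradients below correspond to the log-sum-exp term of the loss;
    the correct-token term is handled separately."""
    logits = embeddings @ block_classifier.T       # [num_tokens, block_size]
    probs = torch.exp(logits - lse.unsqueeze(-1))  # softmax w.r.t. the full vocabulary

    # Drop probabilities too small to meaningfully affect the gradient.
    probs = torch.where(probs < threshold, torch.zeros_like(probs), probs)

    grad_embeddings = probs @ block_classifier     # [num_tokens, d]
    grad_block_classifier = probs.T @ embeddings   # [block_size, d]
    return grad_embeddings, grad_block_classifier
```

In this dense sketch the zeroed entries still pass through the matrix multiplications; the paper's fused implementation skips computations whose contribution falls below the cutoff, which is where the speedup comes from.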
Results
The results are impressive: for Gemma 2 (2B), CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB and cuts the total training-time memory consumption of the classifier head from 28 GB to 1 GB, while matching the speed and convergence of a standard cross-entropy implementation.
Conclusion
CCE tackles the memory bottleneck associated with cross-entropy loss computation. The method enables training models with larger vocabularies and batch sizes while maintaining training efficiency, potentially paving the way for more efficient pipeline parallelism in very large models. For more information, please consult the full paper.
Congrats to the authors for their work!
Wijmans, Erik, et al. "Cut Your Losses in Large-Vocabulary Language Models." arXiv preprint arXiv:2411.09009 (2024).