Say Goodbye to Tokenization: Meta’s New AI Model Learns From Raw Data Directly
Here is a quick primer on why training directly on bytes is challenging, why we cannot naively adopt this approach, and how Meta goes beyond it. Eliminating arbitrary tokenization or segmentation of input sequences may seem appealing, but training on raw bytes introduces significant hurdles.
Training on bytes is inherently a much harder problem to solve. Byte sequences are orders of magnitude longer than tokenized sequences, making them extremely inefficient to model. While Byte Pair Encoding (BPE) and similar tokenization methods have their limitations, they act as a form of compression, allowing us to manage sequence lengths efficiently. Without this compression, handling byte-level input would quickly become infeasible.
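To make the length blow-up concrete, here is a tiny sketch. The whitespace split is only a rough stand-in for a real BPE tokenizer (GPT-style BPE averages very roughly four bytes per token on English text), so the exact ratio is illustrative, not a benchmark:

```python
text = "Tokenization compresses text: byte-level models see every byte."

# Byte-level models must consume every single byte of the UTF-8 encoding.
byte_seq = text.encode("utf-8")

# Crude proxy for a tokenizer: split on whitespace. Real BPE produces
# subword tokens, but the point stands: far fewer units than bytes.
approx_tokens = text.split()

# Sequence-length inflation factor of bytes over "tokens".
ratio = len(byte_seq) / len(approx_tokens)
```

For ASCII English, the byte sequence here is several times longer than the word-level one; attention cost grows quadratically with sequence length, which is why this inflation matters.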
The first part of Meta's architecture focuses on converting sequences of bytes into patches. They discuss three approaches: fixed-stride patching (start a new patch every k bytes), space patching (start a new patch at space-like bytes), and entropy-based patching (use a small byte-level language model and start a new patch wherever next-byte entropy spikes).
The key observation behind entropy-based patching: if there is a sudden jump in entropy, the model is surprised, so the new byte is probably not part of the current "stage" (patch). If your input sequence has a lot of entropy jumps, the model keeps getting surprised, your input is probably "all over the place": it is a more complex sequence, so reduce the patch size, produce more patches, and spend more compute on it.
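A minimal sketch of that idea. The paper uses a small trained byte-level transformer to score next-byte entropy; here a toy bigram estimator built from the input itself stands in for it, and the `threshold` is an assumed hyperparameter, so this only illustrates the mechanism:

```python
import math
from collections import Counter

def next_byte_entropies(data: bytes) -> list[float]:
    """Toy stand-in for BLT's small byte LM: estimate each next-byte
    entropy from bigram statistics of the input itself."""
    cont: dict[int, Counter] = {}
    for i in range(len(data) - 1):
        cont.setdefault(data[i], Counter())[data[i + 1]] += 1
    ents = [0.0]  # first byte has no context; assign zero entropy
    for i in range(1, len(data)):
        dist = cont[data[i - 1]]
        total = sum(dist.values())
        ents.append(-sum((c / total) * math.log2(c / total)
                         for c in dist.values()))
    return ents

def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch wherever next-byte entropy jumps above threshold."""
    ents = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches
```

High-entropy (surprising) regions end up sliced into many small patches, and predictable runs collapse into long ones, which is exactly the "spend more compute where the data is harder" behavior described above.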
The second stage of Meta's architecture involves a small encoder-decoder mechanism that maps bytes into patch embeddings:
The Encoder
The actual encoder is a transformer with cross-attention between the byte representations and the patch latents. The decoder is essentially the same thing in reverse: when doing cross-attention, it attends to the output of the big latent transformer to unpatch back into bytes.
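A minimal single-head cross-attention sketch in NumPy may help here. This is not Meta's implementation; the shapes, weight matrices, and pooling direction are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: `queries` attend over `keys_values`.
    Encoder: queries = patch latents, keys/values = byte embeddings.
    Decoder: the roles are reversed when unpatching back to bytes."""
    Q = queries @ Wq                       # (n_q, d)
    K = keys_values @ Wk                   # (n_kv, d)
    V = keys_values @ Wv                   # (n_kv, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V    # (n_q, d)

# Toy shapes: 3 patch latents pooling information from 12 byte embeddings.
rng = np.random.default_rng(0)
d = 8
latents = rng.normal(size=(3, d))
byte_embs = rng.normal(size=(12, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
patch_repr = cross_attention(latents, byte_embs, Wq, Wk, Wv)
```

The output has one row per patch latent: each patch embedding is a learned weighted mix of the bytes it covers.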
If your intuition, like mine, is built on tokens, this part is a little weird. To compare fairly with standard token-based LLMs, they:
- Keep the context length constant in terms of bytes
- Compute FLOPs per byte
- Report bits-per-byte instead of perplexity
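The bits-per-byte conversion is simple enough to sketch. The 4-bytes-per-token ratio below is an assumed example figure, not from the paper:

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats, as most
    training losses report it) into bits-per-byte: divide by the
    byte count and by ln(2) to go from nats to bits."""
    return total_nll_nats / (n_bytes * math.log(2))

# Example: a token model with mean cross-entropy of 2.0 nats/token,
# over text averaging ~4 bytes per token (assumed ratio):
n_tokens, bytes_per_token = 1000, 4.0
total_nll = 2.0 * n_tokens
bpb = bits_per_byte(total_nll, int(n_tokens * bytes_per_token))
```

Because it normalizes by bytes rather than by tokens, this metric is comparable across models with different (or no) tokenizers, which is exactly why it is the right yardstick here.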
Scaling Considerations
Meta emphasizes scaling as a key factor in evaluating the effectiveness of this architecture, as illustrated in Figure 1. Notably, placing scaling considerations at the forefront of the paper reflects the broader trend in modern AI research, where scalability is a critical factor for adoption and evaluation of new models.
The paper also notes an important point: although their approach achieves strong results, the Llama 3 model is trained on a dataset size that exceeds the compute-optimal threshold. Interestingly, BLT (the Byte Latent Transformer) excels in these over-trained settings, showcasing its robustness and scalability when more data is available. Thanks to AI at Meta.
This focus on scaling, combined with BLT's performance under non-optimal compute scenarios, suggests that internally at Meta, this research is a high-priority initiative. Fingers crossed, this could be a significant step towards mainstream adoption and further breakthroughs in byte-level model architectures.
On to evaluations: note they train new baseline models using the Llama 3 tokenizer, they do NOT compare against Llama 3 itself (this is super important, especially if you are just reading the table without context).
They also have this extremely cool multilingual table
They also release code! A beautiful win for open source ✨
Some thoughts: my brain is still not used to reading in 'bytes'. I saw "trained on 4T BYTES" and did a double take ☺️
Feel free to subscribe and connect: my profile is Tafar M.