Say Goodbye to Tokenization: Meta’s New AI Model Learns From Raw Data Directly

Here is a quick primer on why training directly on bytes is challenging and why this approach cannot be adopted naively, even as Meta pushes beyond tokenization. Eliminating arbitrary tokenization or segmentation of input sequences may seem appealing, but training on raw bytes introduces significant hurdles.

Training on bytes is inherently a much harder problem to solve. Byte sequences are orders of magnitude longer than tokenized sequences, making them extremely inefficient to model. While Byte Pair Encoding (BPE) and similar tokenization methods have their limitations, they act as a form of compression, allowing us to manage sequence lengths efficiently. Without this compression, handling byte-level input would quickly become infeasible.
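To see the gap concretely, here is a minimal sketch (my own illustration, not from the paper) that assumes the tiktoken package is installed and uses its cl100k_base BPE vocabulary; any modern BPE tokenizer tells a similar story:

```python
# Compare the number of positions a model must attend over: raw UTF-8 bytes vs. BPE tokens.
# Assumes the `tiktoken` package is available; the exact ratio depends on the text and vocabulary.
import tiktoken

text = "Byte-level models have to read every single byte of this sentence."
enc = tiktoken.get_encoding("cl100k_base")

n_bytes = len(text.encode("utf-8"))
n_tokens = len(enc.encode(text))

print(f"bytes:  {n_bytes}")    # sequence length for a byte-level model
print(f"tokens: {n_tokens}")   # typically around 4x shorter after BPE compression
```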



The first part of Meta's architecture focuses on converting sequences of bytes into patches. They discuss three approaches:

  • Static fixed size: Patches are created using a predefined, fixed length.
  • Character-based segmentation: Start a new patch whenever a byte is "space-like", i.e., not a Latin character, digit, or UTF-8 continuation byte (spaces are the most common example).
  • Entropy-based dynamic segmentation: Train a small model over individual bytes and use the entropy of the byte sequence, as determined by the small model, to identify break points (see the sketch below).
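As a rough sketch of that third option (my own paraphrase, not Meta's released code), the per-byte entropy can be read off the small model's next-byte distribution; small_byte_lm below is a hypothetical model that returns logits over the 256 possible byte values:

```python
# Per-byte entropy H(x_i | x_<i) under a small byte-level LM, in bits.
# `small_byte_lm` is a hypothetical model: it takes a (1, seq_len) tensor of byte ids
# and returns next-byte logits of shape (1, seq_len, 256).
import torch
import torch.nn.functional as F

def next_byte_entropy(byte_ids: torch.Tensor, small_byte_lm) -> torch.Tensor:
    logits = small_byte_lm(byte_ids.unsqueeze(0))[0]            # (seq_len, 256)
    probs = F.softmax(logits, dim=-1)
    entropy_nats = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy_nats / torch.log(torch.tensor(2.0))          # nats -> bits
```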



The key observations are:

  1. The entropy model is small, so we don't really care that the byte sequence is long; running it over every byte is cheap.
  2. The patching function is dynamic, allowing flexibility in identifying meaningful segments.



These observations illustrate:

  • Global and Approximate Monotonic Constraints: Meta uses a global constraint and an approximate monotonic constraint to determine entropy-based break points. This ensures that patches are dynamically segmented based on the byte sequence's entropy, capturing meaningful boundaries efficiently.
  • Visual Insights (Figure 4): The plot demonstrates how byte-level entropy is used to create patches. High-entropy bytes exceeding the global threshold (the red horizontal line) mark new patch boundaries. For instance, the entropy of the characters "G" and "e" in "George R.R. Martin" exceeds the threshold, segmenting these as individual or larger patches. This dynamic segmentation ensures efficient patch creation without relying on arbitrary fixed-size divisions.

The intuition here is basically: if there is a sudden jump in entropy, the model is surprised, so this new byte is probably not part of the current "stage". If your input sequence has a lot of entropy jumps, the model is surprised often and your input is probably "all over the place"; it is a more complex sequence, so you reduce the patch size, create more patches, and spend more compute on it.
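Putting those two rules into plain Python (again my own sketch; the threshold values are made-up hyperparameters, not the paper's), a boundary opens either when the absolute entropy crosses the global threshold or when the entropy jumps sharply relative to the previous byte:

```python
# Entropy-based patch boundaries: a global threshold plus an approximate monotonic constraint.
# `entropies` is the per-byte entropy (e.g., from the sketch above); thresholds are illustrative.
def entropy_patch_boundaries(entropies, global_threshold=3.0, jump_threshold=1.0):
    boundaries = [0]                                                     # the first byte always starts a patch
    for i in range(1, len(entropies)):
        over_global = entropies[i] > global_threshold                    # global constraint
        sudden_jump = entropies[i] - entropies[i - 1] > jump_threshold   # approx. monotonic constraint
        if over_global or sudden_jump:
            boundaries.append(i)
    return boundaries
```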


The second stage of Meta's architecture involves a small encoder-decoder mechanism that maps bytes into patch embeddings:

The Encoder

  • Embedding bytes: Each byte is first embedded using an embedding table.
  • Grouping into n-grams: For each size in a small set of n-gram sizes, the encoder groups the preceding bytes into n-grams of that size.
  • Hashing n-grams: The grouped n-grams are hashed into another embedding table.
  • Adding embeddings: The n-gram embeddings are added to the original byte embeddings.
  • Normalization: The resulting embeddings are normalized to ensure stability during training and to provide a consistent representation for downstream modeling (a rough sketch of these steps follows below).
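A rough sketch of those steps in PyTorch (the dimensions, number of hash buckets, n-gram sizes, and rolling-hash scheme are all my assumptions, not the paper's exact choices):

```python
import torch
import torch.nn as nn

class ByteNgramEmbedder(nn.Module):
    """Byte embeddings + hashed n-gram embeddings + normalization (simplified)."""
    def __init__(self, dim=256, hash_buckets=50_000, ngram_sizes=(3, 4, 5)):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)            # one entry per possible byte value
        self.ngram_emb = nn.Embedding(hash_buckets, dim)  # shared table for hashed n-grams
        self.ngram_sizes = ngram_sizes
        self.hash_buckets = hash_buckets
        self.norm = nn.LayerNorm(dim)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:  # byte_ids: (batch, seq_len), int64
        x = self.byte_emb(byte_ids)
        for n in self.ngram_sizes:
            # Rolling hash of the n-gram ending at each position (wrap-around at the
            # start of the sequence is a simplification kept for brevity).
            h = torch.zeros_like(byte_ids)
            for k in range(n):
                h = (h * 257 + torch.roll(byte_ids, shifts=k, dims=1)) % self.hash_buckets
            x = x + self.ngram_emb(h)                     # add hashed n-gram embedding to byte embedding
        return self.norm(x)                               # normalize the summed representation
```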



The actual encoder is a transformer encoder with cross-attention between the byte representations and the patch latents. The decoder is basically the same thing in reverse: when doing cross-attention for unpatching, it attends to the output of the big transformer.
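In PyTorch terms, the patching cross-attention looks roughly like this (a minimal sketch with made-up dimensions; the real model restricts each patch query to the bytes inside its patch, which I omit here):

```python
import torch
import torch.nn as nn

dim, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

byte_states = torch.randn(1, 512, dim)    # per-byte hidden states from the small encoder
patch_queries = torch.randn(1, 64, dim)   # one query per patch (e.g., pooled byte embeddings)

# Each patch query attends over the byte states to produce a patch embedding;
# the decoder does the mirror image, with byte queries attending over the big
# transformer's patch outputs during unpatching.
patch_latents, _ = cross_attn(patch_queries, byte_states, byte_states)
print(patch_latents.shape)                # torch.Size([1, 64, 256])
```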



If your intuition, like mine, is built on tokens, this part is a little weird:

- To compare with standard LLMs, they keep the context length constant in terms of bytes

- They compute FLOPs per byte

- They report bits per byte instead of perplexity (see the sketch below)
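Bits per byte is just the model's total cross-entropy expressed in bits and divided by the number of raw bytes, which makes models with different tokenizers (or no tokenizer at all) directly comparable. A minimal sketch with placeholder numbers:

```python
import math

# Placeholders: whatever your eval loop accumulates over the evaluation set.
total_nll_nats = 123_456.0   # summed negative log-likelihood, in nats
n_bytes = 250_000            # number of UTF-8 bytes in the same evaluation set

bits_per_byte = total_nll_nats / (math.log(2) * n_bytes)
print(f"BPB: {bits_per_byte:.3f}")
```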

Scaling Considerations

Meta emphasizes scaling as a key factor in evaluating the effectiveness of this architecture, as illustrated in Figure 1. Notably, placing scaling considerations at the forefront of the paper reflects the broader trend in modern AI research, where scalability is a critical factor for adoption and evaluation of new models.

The paper also notes an important point: although their approach achieves strong results, the LLaMA 3 model is trained on a dataset size that exceeds the compute-optimal threshold. Interestingly, BLT (Byte Latent Transformer) excels in these over-trained settings, showcasing its robustness and scalability when more data is available. Thanks to AI at Meta.

This focus on scaling, combined with BLT's performance under non-optimal compute scenarios, suggests that internally at Meta, this research is a high-priority initiative. Fingers crossed, this could be a significant step towards mainstream adoption and further breakthroughs in byte-level model architectures.



On to evaluations: note that they train new models using the Llama 3 tokenizer and do NOT evaluate against Llama 3 itself (this is super important, especially if you are just looking at the table without context).



They also have this extremely cool multilingual table



They also release code! A beautiful moment for open source ✨


Some thoughts: my brain is not used to reading "bytes". I saw "trained on 4T BYTES" and did a double take ☺️

Feel free to subscribe and connect: my profile is Tafar M.
