Tokenization in NLP: How Models Like GPT Understand Human Language

In my previous article, The Art of Text Processing in Natural Language Processing (NLP), I introduced several fundamental concepts for text processing, including normalization, tokenization, stemming, and lemmatization.

While these concepts are interconnected, tokenization deserves a deeper exploration as it forms the foundational layer upon which modern language models are built.

Tokenization: More Than Just Splitting Text

At its core, tokenization divides text into meaningful units (tokens) that serve as the atomic elements for further processing. In my previous post, I briefly mentioned word and sentence tokenization.

However, tokenization strategies have evolved significantly with the advancement of language models, addressing complex linguistic phenomena and computational efficiency challenges.


The Evolution of Tokenization Approaches

1. Word-Level Tokenization

Word tokenization is perhaps the most intuitive approach - splitting text at whitespace and removing punctuation. This was the standard in earlier NLP systems and still serves specific use cases well.

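To make this concrete, here is a minimal sketch (not the notebook's original code) of both variants: a naive split that strips punctuation, and NLTK's word_tokenize, which handles contractions and punctuation more gracefully. The sample sentence is illustrative, and NLTK's 'punkt' tokenizer data must be downloaded first.

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data needed by word_tokenize
# Note: newer NLTK releases may also require nltk.download("punkt_tab")
from nltk.tokenize import word_tokenize

text = "Dr. Smith's model doesn't overfit, does it?"

# Basic word tokenization: split on whitespace, then strip surrounding punctuation.
basic_tokens = [t.strip(".,!?;:'\"()") for t in text.split()]
basic_tokens = [t for t in basic_tokens if t]

# NLTK's word_tokenize keeps punctuation as separate tokens and splits contractions.
nltk_tokens = word_tokenize(text)

print("Basic:", basic_tokens)
print("NLTK: ", nltk_tokens)
```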

Limitations of Word-Level Tokenization:

  • Vocabulary explosion: Each inflection (run, running, runs) requires a separate token
  • Out-of-vocabulary (OOV) problem: New words encountered during inference cannot be represented
  • Language dependence: Not all languages have clear word boundaries (e.g., Chinese)
  • Rare word problem: Infrequent words lack sufficient training examples



2. Character-Level Tokenization

Character tokenization represents the opposite extreme - breaking text into individual characters:

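A character tokenizer needs almost no code; this sketch simply converts the string into a list of characters (the example string is illustrative):

```python
text = "Tokenization!"

# Character tokenization: every character, including punctuation, becomes a token.
char_tokens = list(text)

print(char_tokens)
print(f"{len(char_tokens)} tokens for {len(text.split())} word(s)")
```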

Advantages:

  • No OOV problem - can represent any text
  • Much smaller vocabulary
  • Language-agnostic

Limitations:

  • Very long sequences (computational inefficiency)
  • Loses word-level semantic meaning
  • Harder to capture long-range dependencies



3. Subword Tokenization: The Middle Ground

Modern NLP has converged on subword tokenization methods that balance the trade-offs between word and character approaches. These methods adaptively break words into meaningful subword units based on frequency and patterns in the training corpus.

Let's implement a simple frequency-based subword tokenization to understand the concept:

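The sketch below is a simplified stand-in for the notebook's code: it assumes a hand-crafted vocabulary, sample_vocab, and applies a greedy longest-match-first split, falling back to single characters for unknown spans.

```python
# A toy greedy longest-match subword tokenizer over a fixed vocabulary.
# sample_vocab is hand-crafted for illustration; real systems learn it from a corpus.
sample_vocab = {"the", "quick", "brown", "fox", "jump", "s", "over", "lazy", "dog"}

def subword_tokenize(word, vocab):
    """Greedily match the longest known subword from the left; fall back to characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:                              # no subword matched: emit a single character
            tokens.append(word[i])
            i += 1
    return tokens

sentence = "the quick brown fox jumps over the lazy dog"
tokens = [t for w in sentence.split() for t in subword_tokenize(w, sample_vocab)]
print(tokens)
# ['the', 'quick', 'brown', 'fox', 'jump', 's', 'over', 'the', 'lazy', 'dog']
```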
"In the example above, our simple subword tokenizer uses the vocabulary sample_vocab to split the input sentence. Notice how common words like 'the', 'quick', 'brown', 'fox' are tokenized as single units, as they are present in our vocabulary. However, the word 'jumps' is broken down into 'jump' and 's', reflecting the subword units available in our vocabulary. This demonstrates the adaptive nature of subword tokenization. Words outside the pre-defined vocabulary or infrequent words, like 'jumps' in this case, are split into smaller, known components.
This simple frequency-based approach is a foundational concept behind more advanced techniques like Byte Pair Encoding (BPE) and WordPiece, which are commonly employed in modern NLP models like BERT and GPT. These advanced methods learn the vocabulary and subword units directly from a large text corpus, leading to more efficient and flexible tokenization."




Leading Subword Tokenization Approaches

Several subword tokenization algorithms have become standard in modern NLP:


1. Byte-Pair Encoding (BPE)

Originally a data compression algorithm, BPE was adapted for NLP in the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016). It starts with a character-level vocabulary and iteratively merges the most frequent adjacent pairs.

Note: My next article will dive deep into BPE implementation and optimization.


2. WordPiece

Developed by Google and used in BERT, WordPiece modifies BPE by using a likelihood-based approach rather than frequency alone for merging decisions.

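As a quick illustration (the transformers library and the bert-base-uncased checkpoint are assumptions here, not part of the original screenshot), WordPiece splits can be inspected directly; the exact pieces depend on the pretrained vocabulary:

```python
from transformers import AutoTokenizer

# bert-base-uncased ships a WordPiece tokenizer; '##' marks word-internal pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("transformerification"))
```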

Notice how WordPiece breaks "transformerification" into meaningful subwords: 'transform', '##er', '##ific', '##ation'.


3. SentencePiece and Unigram Language Model

SentencePiece treats the input as a raw character stream without pre-tokenization, making it truly language-agnostic. It supports both BPE and the Unigram Language Model algorithm, with Unigram as the default.

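A minimal way to see SentencePiece output (assuming the transformers library and the xlm-roberta-base checkpoint, whose tokenizer is a SentencePiece Unigram model):

```python
from transformers import AutoTokenizer

# xlm-roberta-base uses a SentencePiece Unigram tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokens = tokenizer.tokenize("Tokenization is language agnostic.")
print(tokens)  # pieces that start a word carry the '▁' boundary marker
```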

SentencePiece uses '▁' (U+2581, an underscore-like block character) to mark word boundaries, allowing perfect reconstruction of the original text without ambiguity.


4. Byte-Level BPE

Used in GPT models, byte-level BPE operates on bytes rather than Unicode characters. The base vocabulary is exactly 256 byte tokens plus the merges learned during training, so any input text can be represented and nothing is ever out of vocabulary.
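A small sketch (assuming the transformers library and the gpt2 checkpoint) shows that arbitrary text, including accented characters and emoji, always maps onto known byte-level tokens:

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE: any string, including emoji, maps to known tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for text in ["hello world", "naïve café 🙂"]:
    print(text, "->", tokenizer.tokenize(text))
```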



Practical Considerations for Tokenization

1. Vocabulary Size Trade-offs

Larger vocabularies typically produce fewer tokens per sequence, reducing computation during inference, but they enlarge the embedding and output layers and therefore the model size.

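As a rough illustration (the checkpoints and sentence are assumptions, not the article's original code), tokenizing the same text with tokenizers of different vocabulary sizes shows the effect on sequence length:

```python
from transformers import AutoTokenizer

text = "Tokenization strategies trade vocabulary size against sequence length."

# Compare vocabulary size and resulting token count for the same input.
for name in ["bert-base-uncased", "gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab={tok.vocab_size}, tokens={len(tok.tokenize(text))}")
```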


2. Special Token Handling

Most modern tokenizers use special tokens for specific functions:

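For example (assuming a BERT tokenizer from the transformers library, not necessarily the one in the original screenshot), the special tokens and their placement can be inspected like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Map of roles to special tokens: [CLS], [SEP], [PAD], [UNK], [MASK], etc.
print(tokenizer.special_tokens_map)

# Special tokens are inserted automatically when encoding a sequence pair.
enc = tokenizer("How are you?", "I am fine.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```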

Special tokens serve crucial functions like marking sentence boundaries, padding, unknown tokens, and sequence start/end.


3. Language Considerations

Tokenization strategies should consider language-specific characteristics:

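One quick way to see this (the multilingual checkpoint and the example sentences are illustrative assumptions) is to run a single tokenizer over several scripts:

```python
from transformers import AutoTokenizer

# A multilingual SentencePiece tokenizer covers many scripts with one vocabulary.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = {
    "English": "The weather is nice today.",
    "German":  "Donaudampfschifffahrtsgesellschaft",  # long compound word
    "Chinese": "今天天气很好",                          # no whitespace word boundaries
}
for lang, text in samples.items():
    print(lang, "->", tokenizer.tokenize(text))
```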

Multilingual models must handle diverse writing systems, morphological complexity, and compounding rules across languages.



Tokenization Impact on Model Performance

Tokenization directly impacts model performance in several ways:

1. Sequence Length

Models have context length limitations, making tokenization efficiency crucial. Compare:

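The original notebook's input text isn't reproduced here, so the sentence below is a stand-in; the point is simply to compare token counts for the same input under two tokenizers (the 30 vs. 31 figures discussed next come from the notebook's own example):

```python
from transformers import AutoTokenizer

text = ("Tokenization efficiency matters because every model has a fixed "
        "context window measured in tokens, not characters.")

# Same text, two tokenizers, different sequence lengths.
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.tokenize(text))} tokens")
```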
In the notebook's example, the same text was tokenized with two different models, GPT-2 and BERT: GPT-2 produced 30 tokens, while BERT generated 31 tokens for the same input.

This difference in token count highlights how different tokenization strategies impact sequence length. Even a small difference in the number of tokens can affect the model's computational efficiency and its ability to process longer texts.

Why is this important? Models like GPT and BERT have limits on the number of tokens they can handle in a single input sequence, often referred to as the "context window." If the tokenized text exceeds the context window, the model may need to truncate the input, potentially losing valuable information. Therefore, choosing a tokenization strategy that produces shorter sequences while preserving essential information is crucial for optimal performance, especially when dealing with longer texts.

2. Information Density

More efficient tokenization increases information density within context windows:

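A sketch of the comparison described below (the gpt2 checkpoint is an assumption, and the subword count can vary with the tokenizer version):

```python
from transformers import AutoTokenizer

text = "Electroencephalographically is a long word derived from electroencephalography."

char_tokens = list(text)           # character-level: every character is a token
word_tokens = text.split()         # naive word-level: whitespace split
gpt2_tokens = AutoTokenizer.from_pretrained("gpt2").tokenize(text)  # subword-level

print("Character-level:", len(char_tokens), "tokens")
print("Word-level:     ", len(word_tokens), "tokens")
print("Subword (GPT-2):", len(gpt2_tokens), "tokens")
```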

This example illustrates how tokenization strategies affect information density. We've tokenized the sentence 'Electroencephalographically is a long word derived from electroencephalography.' using three different approaches:

  • Character-level: breaks the text down into individual characters, resulting in 79 tokens.
  • Word-level: splits the text into words, resulting in 8 tokens.
  • Subword-level (using GPT-2): breaks words into meaningful subword units, resulting in 16 tokens.

Observe that character-level tokenization produces the largest number of tokens, while word-level produces the fewest. Subword-level tokenization falls in between.

Information density refers to the amount of information captured per token. Character-level tokens have low information density since individual characters often don't convey much semantic meaning on their own. Word-level tokens have higher information density, but they might struggle with out-of-vocabulary words or rare words.

Subword tokenization strikes a balance by breaking down complex words into smaller, meaningful units. This allows the model to capture more information within a given context window. In this example, the subword approach breaks down 'Electroencephalographically' and 'electroencephalography' into smaller, reusable subword units like 'Elect,' 'ro,' 'ence,' 'phal,' and 'ography.' This enables the model to understand the relationship between these words and potentially generalize to unseen words with similar subword components.

By using subword tokenization, we can pack more information into the limited context window of language models, leading to better performance and generalization capabilities.


3. Out-of-Domain Performance

Tokenization strategies affect how models handle domain-specific language:

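The sentences below are illustrative stand-ins for the notebook's examples, one per domain, tokenized with GPT-2 (assuming the transformers library):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Illustrative sentences from different domains (stand-ins for the article's originals).
sentences = {
    "general":   "The cat sat on the mat.",
    "medical":   "The patient presented with acute myocardial infarction.",
    "technical": "We deployed the service on a Kubernetes cluster.",
    "chemical":  "The reaction produced 2,4-dinitrophenylhydrazine.",
}
for domain, text in sentences.items():
    print(f"{domain}: {tokenizer.tokenize(text)}")
```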
This example demonstrates how the same tokenizer (GPT-2) handles text from different domains. We used four sentences: general, medical, technical, and chemical.

Notice that domain-specific terms, like 'myocardial infarction' (medical) or 'Kubernetes' (technical), are often broken down into smaller subword units. This is because the tokenizer was primarily trained on a general corpus and might not have encountered these specialized terms frequently.

This breakdown can affect a model's understanding of domain-specific language. If a tokenizer consistently splits essential technical terms into fragments, it can make it harder for the model to grasp their meaning and relationships.

Key takeaway: The choice of tokenizer and its training data significantly impact performance on domain-specific tasks. Fine-tuning tokenizers on domain-specific data can improve their ability to handle specialized vocabulary and improve model performance.



Evaluating Tokenization Approaches

When building NLP systems, consider these criteria for tokenization selection:

  1. Task specificity: Translation requires different tokenization than classification
  2. Language support: Consider all target languages
  3. Computational efficiency: Balance token length with vocabulary size
  4. Reconstructability: Perfect reconstruction for generative tasks
  5. Domain adaptability: Performance on domain-specific terminology


Future Directions in Tokenization

The field continues to evolve:

  1. Token-free models: Direct character-level processing with efficient architectures
  2. Learnable tokenization: End-to-end learned tokenization strategies
  3. Multi-modal tokenization: Unified tokenization across text, images, and audio
  4. Neural compression: Learned neural codes for more efficient representation


Conclusion

Tokenization represents a critical design choice in modern NLP systems, serving as the bridge between human language and computational representation. By understanding the trade-offs between different approaches, practitioners can make informed decisions that significantly impact model performance, efficiency, and capabilities.

In my next article, I'll dive deeper into Byte-Pair Encoding (BPE), exploring its implementation details, optimizations, and why it has become the foundation for models like GPT-3, GPT-4, and Claude.


Colab Notebook - Tokenisation

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
  2. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.
  3. Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv preprint arXiv:1808.06226.


#NLP #MachineLearning #Tokenization #NaturalLanguageProcessing #AI
