The Decoder In the Transformer Neural Network for Large Language Models

In my last article, we established that the Transformer neural network from the research paper Attention Is All You Need is an encoder-decoder architecture built around the attention mechanism. We primarily focused on the encoder, but now we will shift our focus to the decoder.

The job of the decoder is to generate the output text sequence, conditioned on the representations produced by the encoder. Each decoder layer consists of two multi-headed attention sublayers and a pointwise feed-forward sublayer, with a residual connection and layer normalization after each sublayer, as sketched below.
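
As a rough illustration of that wiring, here is a minimal NumPy sketch of one decoder layer's forward pass. It assumes the masked self-attention, encoder-decoder attention, and feed-forward sublayers are supplied as callables, and only shows the residual-plus-layer-norm structure around them.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def decoder_layer(x, enc_out, self_attn, cross_attn, ffn):
    # Sublayer 1: masked multi-headed self-attention, then residual + layer norm.
    x = layer_norm(x + self_attn(x))
    # Sublayer 2: encoder-decoder attention over enc_out, then residual + layer norm.
    x = layer_norm(x + cross_attn(x, enc_out))
    # Sublayer 3: pointwise feed-forward network, then residual + layer norm.
    x = layer_norm(x + ffn(x))
    return x
```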

The multi-headed attention in the decoder works slightly differently from the encoder's attention mechanism. The decoder takes the tokens it has generated so far as its input, along with the encoder outputs that carry attention information about the source sequence, and it stops decoding once it generates an <end> token.

Similar to the encoder layer, the input first passes through an embedding layer and a positional encoding layer to obtain positional embeddings. These embeddings are then fed into the first multi-headed attention layer, which computes attention scores for the decoder's input. Unlike the encoder, however, the decoder must be prevented from attending to future tokens, since it generates the sequence one token at a time. Because the Transformer model is not recurrent, this restriction is enforced with masking, specifically the Look-Ahead Mask.
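
For reference, this is a small sketch of the sinusoidal positional encoding used in the paper; the resulting matrix is added to the token embeddings to give the positional embeddings mentioned above (shapes and names are illustrative).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encodings: sine on even dimensions, cosine on odd ones.
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The decoder's input would then be:
# decoder_input = token_embeddings + positional_encoding(seq_len, d_model)
```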

Fig 1: Look-Ahead Mask Mechanism

The Look-Ahead Mask ensures that each token can only attend to itself and the tokens before it, preventing any leakage of future information. Concretely, the attention scores for future positions are set to negative infinity before applying softmax, so they receive a probability of effectively zero. This keeps the model autoregressive: each word attends only to itself and the words that precede it.
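
A minimal NumPy sketch of that masking step might look like this, with a toy 4-token sequence: raw scores for future positions are replaced with negative infinity so the softmax drives their weights to zero.

```python
import numpy as np

def look_ahead_mask(seq_len):
    # True above the diagonal: position i may not attend to positions j > i.
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

def masked_softmax(scores, mask):
    # Set masked (future) scores to -inf before the softmax.
    scores = np.where(mask, -np.inf, scores)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)                 # raw attention scores for 4 tokens
weights = masked_softmax(scores, look_ahead_mask(4))
print(np.round(weights, 2))                    # upper triangle is all zeros
```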

In the second multi-headed attention layer, the decoder uses its previous layer’s output as queries, while the keys and values come from the encoder outputs. This attention mechanism allows the decoder to focus on relevant parts of the encoder’s input, helping it align with the meaning of the original sequence. The output of this attention layer then passes through a pointwise feed-forward layer for further processing.
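
Here is a single-head sketch of that encoder-decoder (cross) attention, assuming the projection matrices W_q, W_k, and W_v are already-learned parameters; multi-headed attention would simply run several of these in parallel and concatenate the results.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_hidden, enc_out, W_q, W_k, W_v):
    # Queries come from the decoder; keys and values come from the encoder outputs.
    Q = dec_hidden @ W_q                       # (tgt_len, d_k)
    K = enc_out @ W_k                          # (src_len, d_k)
    V = enc_out @ W_v                          # (src_len, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (tgt_len, src_len)
    return softmax(scores) @ V                 # (tgt_len, d_v)
```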

Finally, the processed output is passed to a linear classifier with N classes, where N is the vocabulary size. Its output then goes through a softmax layer, which assigns each word in the vocabulary a probability between 0 and 1. The word with the highest probability is selected as the predicted token. The decoder repeats this process, appending each predicted token to its input sequence, until it generates an <end> token.
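
Putting that prediction step into a loop, a greedy version of the generation process could be sketched as follows; decoder_step, W_out, start_id, and end_id are hypothetical placeholders for the decoder stack, the output projection, and the special-token ids.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def greedy_decode(decoder_step, W_out, start_id, end_id, max_len=50):
    tokens = [start_id]
    for _ in range(max_len):
        hidden = decoder_step(tokens)      # hidden state of the last position, (d_model,)
        logits = hidden @ W_out            # project to the N vocabulary classes
        probs = softmax(logits)            # probability score for every word
        next_id = int(np.argmax(probs))    # pick the highest-probability word
        tokens.append(next_id)             # append it to the decoder inputs
        if next_id == end_id:              # stop once <end> is generated
            break
    return tokens
```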

Unlike traditional recurrent models, the Transformer’s decoder enforces autoregression purely through self-attention masking, eliminating the need for recurrence. In practice, decoding strategies like beam search or top-k sampling are often used instead of always selecting the highest probability word at each step to improve fluency and coherence.
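
As a small example of one such strategy, here is a minimal top-k sampling sketch: instead of always taking the argmax, it samples from the k most probable words. Beam search, by contrast, keeps several partial hypotheses alive at each step and is not shown here.

```python
import numpy as np

def top_k_sample(probs, k=10, rng=None):
    # Keep the k most probable tokens, renormalize their probabilities, and sample one.
    rng = np.random.default_rng() if rng is None else rng
    top_ids = np.argsort(probs)[-k:]
    top_probs = probs[top_ids] / probs[top_ids].sum()
    return int(rng.choice(top_ids, p=top_probs))
```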

By leveraging multi-headed attention, masking, and encoder-decoder interactions, the Transformer decoder generates highly contextualized and coherent text sequences, making it a powerful architecture for tasks like machine translation and text generation.
