Study Guide for “Attention Is All You Need”
This guide is designed to review and reinforce understanding of the seminal paper introducing the Transformer model. It includes a quiz with an answer key, a set of essay questions for deeper analysis, and a comprehensive glossary of key terms as defined and used within the source document.
Quiz: Short-Answer Questions
Answer each question in 2-3 sentences based on the information provided in the source text.
What is the fundamental architectural innovation of the Transformer model compared to dominant sequence transduction models that preceded it?
Describe the two main types of sub-layers that constitute each layer in the Transformer’s encoder stack.
What is the purpose of the “masking” implemented in the self-attention sub-layer of the decoder?
Explain the function of Scaled Dot-Product Attention, including the role of the scaling factor.
What is the primary benefit of using Multi-Head Attention instead of a single attention function?
Why does the Transformer model require “Positional Encodings,” and what method is used to create them in the paper?
According to the paper, what are the three main advantages of self-attention layers over recurrent and convolutional layers?
How does the per-layer computational complexity of a self-attention layer compare to that of a recurrent layer?
What two forms of regularization were employed during the training of the Transformer models?
Beyond machine translation, what other task was the Transformer applied to, and how did its performance compare to previous models in that domain?
Answer Key
The Transformer is the first sequence transduction model based entirely on attention mechanisms. It completely dispenses with the recurrence and convolutions that formed the basis of previous dominant models like RNNs and LSTMs.
Each of the N=6 identical layers in the encoder is composed of two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
Masking in the decoder’s self-attention sub-layer prevents positions from attending to subsequent positions. This ensures the auto-regressive property is preserved, meaning the prediction for a position i can only depend on the known outputs at positions less than i.
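To make this concrete, here is a minimal NumPy sketch of a causal (look-ahead) mask; the function names and the sequence length are illustrative assumptions, not details taken from the paper's code:

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 where attending is allowed (j <= i), -inf where j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Arbitrary attention logits for a length-4 sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))
weights = softmax(logits + causal_mask(4))
print(np.round(weights, 3))  # everything above the diagonal is exactly 0
```

Adding -inf to the logits before the softmax drives the corresponding weights to exactly zero, so position i never attends to positions j > i.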
Scaled Dot-Product Attention computes an output as a weighted sum of values. The weights are obtained by taking the dot products of a query with all keys, dividing each by √dk (the square root of the key dimension), and applying a softmax function. The scaling factor counteracts the tendency of large dot products to push the softmax into regions where its gradients are extremely small.
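For reference, a minimal NumPy sketch of this computation (an illustration, not the authors' code; the toy shapes are arbitrary assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility of each query with each key
    return softmax(scores) @ V        # weighted sum of the values

# Toy example: 3 queries, 5 key-value pairs, d_k = d_v = 8 (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```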
Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this ability, whereas multiple heads can learn to perform different tasks and capture more nuanced relationships (see the sketch below).
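The following NumPy sketch illustrates the idea. For brevity it slices h heads out of single d_model × d_model projections, which is mathematically equivalent to the paper's per-head projection matrices; all sizes are toy assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X, attend separately in h subspaces, concatenate, project back."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):                              # each head sees its own subspace
        s = slice(i * d_k, (i + 1) * d_k)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o     # final output projection

# Toy sizes (assumptions for illustration): n = 4 positions, d_model = 16, h = 4 heads.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))
W = [rng.normal(size=(16, 16)) for _ in range(4)]
print(multi_head_attention(X, *W, h=4).shape)  # (4, 16)
```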
Since the model contains no recurrence or convolution, it has no inherent way to make use of the order of the sequence. Positional Encodings are added to the input embeddings to inject this information, using sine and cosine functions of different frequencies.
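The paper's formulas are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A direct NumPy transcription, assuming an even d_model:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# These encodings are simply added to the input embeddings.
print(positional_encoding(50, 512).shape)  # (50, 512)
```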
The three advantages, or desiderata, considered are: total computational complexity per layer, the amount of computation that can be parallelized (measured by minimum sequential operations), and the path length between long-range dependencies in the network.
A self-attention layer has per-layer complexity O(n²·d), whereas a recurrent layer has O(n·d²). Self-attention is therefore faster whenever the sequence length n is smaller than the representation dimensionality d, which is most often the case for the sentence representations used in machine translation (see the worked comparison below).
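A quick back-of-the-envelope check (d = 512 is the base model's dimensionality; the sentence length n = 70 is an assumption chosen for illustration):

```python
# Per-layer operation counts with constant factors omitted.
n, d = 70, 512                 # n = 70 tokens is an illustrative assumption
self_attention = n**2 * d      # O(n^2 * d): every position attends to every position
recurrent      = n * d**2      # O(n * d^2): one d-by-d matrix multiply per position
print(self_attention)          # 2508800
print(recurrent)               # 18350080 -> roughly 7x more, since n < d
```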
The two regularization techniques used during training are Residual Dropout and Label Smoothing. Dropout (Pdrop = 0.1 for the base model) is applied to the output of each sub-layer and to the sums of embeddings and positional encodings, while label smoothing (ϵls = 0.1) was found to improve accuracy and BLEU score.
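The paper applies label smoothing following Szegedy et al.; in the common formulation, the one-hot target is mixed with a uniform distribution over the vocabulary, as in this sketch (the function name and toy vocabulary size are assumptions):

```python
import numpy as np

def smoothed_targets(true_index, vocab_size, eps=0.1):
    """One-hot target mixed with a uniform distribution:
    (1 - eps) * one_hot + eps * uniform."""
    t = np.full(vocab_size, eps / vocab_size)
    t[true_index] += 1.0 - eps
    return t

# Toy vocabulary of 5 tokens, true token at index 2, eps = 0.1 as in the paper.
print(smoothed_targets(2, 5))  # [0.02 0.02 0.92 0.02 0.02]
```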
The Transformer was applied to English constituency parsing. It performed surprisingly well, yielding better results than all previously reported models except for the Recurrent Neural Network Grammar and outperforming the BerkeleyParser even when trained only on the smaller WSJ dataset.
Essay Questions
The following questions are designed for longer-form, analytical responses. No answers are provided.
The paper argues that the ability to learn long-range dependencies is a key challenge in sequence transduction. Analyze and compare how Recurrent, Convolutional, and Self-Attention layers handle this challenge, focusing on the concept of “path length” as described in the text.
Explain the complete architecture of the Transformer, detailing the flow of information from an input sequence to an output sequence. Describe the role of the encoder stack, the decoder stack, and the three distinct ways Multi-Head Attention is applied within this architecture.
The authors state, “self-attention could yield more interpretable models.” Based on the attention visualizations and discussion in the paper’s appendix, elaborate on this claim. What kind of linguistic structures or behaviors do the attention heads appear to learn?
Describe the training regime for the “Transformer (big)” model for the WMT 2014 English-to-German task. Cover the dataset, hardware, training schedule, optimizer, and regularization techniques. How did this model’s performance and training cost compare to previous state-of-the-art models?
The paper details several “Model Variations” in Table 3 to evaluate the importance of different components. Discuss the findings related to varying the number of attention heads (A), the attention key size dk (B), the overall model size (C), and the use of dropout (D). What do these results suggest about the Transformer’s design?
Glossary of Key Terms
Attention
A function that can be described as mapping a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Auto-regressive
A property of a model where, at each step, it consumes the previously generated symbols as additional input when generating the next symbol. The Transformer’s decoder is auto-regressive.
BLEU Score
(Bilingual Evaluation Understudy) A metric for evaluating the quality of machine-translated text. Higher scores indicate better translation quality. The paper uses this as a primary metric for its machine translation tasks.
Decoder
In an encoder-decoder structure, the component that generates an output sequence of symbols (y1, ..., ym) one element at a time, given the continuous representation z produced by the encoder.
Encoder
In an encoder-decoder structure, the component that maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn).
Encoder-Decoder Structure
A common architecture for neural sequence transduction models where an encoder processes the input sequence and maps it to a continuous representation, which a decoder then uses to generate an output sequence. The Transformer follows this overall structure.
Intra-attention
Another name for self-attention.
Label Smoothing
A regularization technique where, during training, the model is encouraged to be less confident in its predictions. The paper notes this hurts perplexity but improves accuracy and BLEU score.
Layer Normalization
A technique used after each sub-layer in the Transformer. The output of a sub-layer is calculated as LayerNorm(x + Sublayer(x)), where x is the input and Sublayer(x) is the function implemented by the sub-layer itself.
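A minimal sketch of this pattern (learned gain and bias parameters, present in standard layer normalization, are omitted for brevity; names and sizes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_with_residual(x, sublayer):
    """The paper's pattern: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Toy check with an arbitrary sub-layer (here just a fixed linear map).
rng = np.random.default_rng(2)
x, W = rng.normal(size=(4, 8)), rng.normal(size=(8, 8))
print(sublayer_with_residual(x, lambda v: v @ W).shape)  # (4, 8)
```

Note that the same sketch also shows the residual connection (the x + Sublayer(x) term) defined later in this glossary.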
Multi-Head Attention
An attention mechanism where queries, keys, and values are linearly projected h different times. The attention function is performed in parallel on each of these projected versions, and the outputs are concatenated and projected again to produce the final result. This allows the model to jointly attend to information from different representation subspaces.
Positional Encoding
Information about the relative or absolute position of tokens in a sequence that is injected into the model. Since the Transformer contains no recurrence or convolution, these are added to the input embeddings using sine and cosine functions of different frequencies.
Position-wise Feed-Forward Network
A sub-layer in the Transformer’s encoder and decoder that consists of two linear transformations with a ReLU activation in between. It is applied to each position separately and identically.
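The paper writes this as FFN(x) = max(0, xW1 + b1)W2 + b2, with d_model = 512 and inner dimensionality d_ff = 2048 in the base model. A direct NumPy sketch:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear maps

# Base-model dimensions: d_model = 512, d_ff = 2048.
rng = np.random.default_rng(3)
x = rng.normal(size=(10, 512))                       # 10 positions
W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)
W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)    # (10, 512)
```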
Residual Connection
A connection that adds the input of a sub-layer to its output (x + Sublayer(x)). This technique is employed around each of the sub-layers in both the encoder and decoder.
Scaled Dot-Product Attention
The specific attention mechanism used in the Transformer. It computes dot products of the query with all keys, divides each by √dk (the scaling factor), and applies a softmax function to obtain weights on the values.
Self-attention
An attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. In a self-attention layer, the keys, values, and queries all come from the same place (e.g., the output of the previous layer).
Sequence Transduction
The task of converting one sequence to another, such as in machine translation or constituency parsing.
Transformer
The model architecture proposed in the paper that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.