Attention: Computing Dynamic Context for Sequence Models
Attention mechanisms empower sequence-to-sequence models to compute a bespoke context vector for each output element by weighing input representations according to their relevance. This approach overcomes the limitations of fixed-size context vectors and enhances a model’s ability to capture dependencies over long sequences.
How Attention Works
At each decoding step $t$, the model constructs a context vector $c_t$ as a weighted sum of encoder hidden states. Each input word $x_j$ is associated with a hidden state $h_j$ from an encoder (for example, an LSTM or GRU). The context vector is defined as:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj}\, h_j$$

Here, the attention weights $\alpha_{tj}$ reflect how much focus the output at position $t$ gives to each input position $j$.
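As a minimal NumPy sketch (the sizes and weights below are illustrative assumptions, not values from the text), the context vector is simply a weight vector multiplied into the stack of encoder states:

```python
import numpy as np

# Encoder hidden states, one row per input position j (illustrative sizes).
T_x, d = 4, 8
H = np.random.randn(T_x, d)

# Attention weights alpha_tj for one decoding step t; they sum to one.
alpha = np.array([0.1, 0.6, 0.2, 0.1])

# Context vector c_t = sum_j alpha_tj * h_j: a weighted sum of the rows of H.
c_t = alpha @ H            # shape: (d,)
```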
Calculating Attention Weights
Attention weights are obtained by normalizing alignment scores with a softmax function, ensuring they sum to one across all inputs:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$

The alignment score $e_{tj}$ quantifies how well the decoder state at step $t$ aligns with the encoder hidden state $h_j$.
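A short sketch of this normalization (the score values are made up for illustration):

```python
import numpy as np

def softmax(e: np.ndarray) -> np.ndarray:
    """Normalize alignment scores into attention weights that sum to one."""
    e = e - e.max()        # subtract the max for numerical stability
    w = np.exp(e)
    return w / w.sum()

scores = np.array([1.2, 3.5, 0.3, -0.7])   # alignment scores e_tj (illustrative)
alpha = softmax(scores)                     # attention weights alpha_tj
assert np.isclose(alpha.sum(), 1.0)
```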
Alignment Scores
Alignment scores can be computed with different scoring functions, some of which carry trainable parameters. Two common variants are:

- Dot product: $e_{tj} = s_{t-1}^{\top} h_j$, where $s_{t-1}$ is the decoder hidden state from the previous step.
- Additive (Bahdanau) attention: $e_{tj} = v_a^{\top} \tanh\!\left(W_a\, [s_{t-1}; h_j]\right)$, with $W_a$ and $v_a$ as trainable parameters and $[\,\cdot\,;\,\cdot\,]$ denoting concatenation.
These scores, once passed through the softmax, yield attention weights that guide the formation of each context vector.
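The two scoring functions can be sketched side by side; in a real model $W_a$ and $v_a$ would be learned, whereas here they are random placeholders:

```python
import numpy as np

d = 8                                   # hidden size (illustrative)
s_prev = np.random.randn(d)             # decoder state s_{t-1}
h_j = np.random.randn(d)                # one encoder state h_j

# Dot-product score: e_tj = s_{t-1}^T h_j (no trainable parameters).
e_dot = s_prev @ h_j

# Additive (Bahdanau) score: e_tj = v_a^T tanh(W_a [s_{t-1}; h_j]).
d_a = 16                                # attention layer size (assumed)
W_a = np.random.randn(d_a, 2 * d)       # trainable in practice
v_a = np.random.randn(d_a)              # trainable in practice
e_add = v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_j]))
```

The dot product is cheap but requires the decoder and encoder states to share a dimension; the additive form trades extra parameters for more flexibility.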
Self-Attention in a Single Sequence
Self-attention applies the same principle within one sequence, allowing each element to attend to all others (including itself). Given a sequence of $n$ embeddings arranged in a matrix $X \in \mathbb{R}^{n \times d}$, three projection matrices produce queries $Q$, keys $K$, and values $V$:

$$Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V$$

Scaled dot-product self-attention then computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where the factor $\sqrt{d_k}$ (the key dimension) keeps the inner products in a numerically stable range. The result is a sequence of the same length, where each position's representation integrates information from all positions.
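Here is a compact NumPy sketch of the full computation; the sizes and random projection matrices are stand-ins for learned parameters:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True) # stabilize the softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax -> weights
    return A @ V                                 # each row blends all values

n, d, d_k = 5, 8, 8                              # illustrative sizes
X = np.random.randn(n, d)                        # sequence of n embeddings
out = self_attention(X, *(np.random.randn(d, d_k) for _ in range(3)))
print(out.shape)                                 # (5, 8): same length as input
```

Every output row is computed independently of the others, which is exactly the parallelism discussed at the end of this section.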
Practical Example: Machine Translation
Consider translating the Chinese phrase “我 爱 机器 学习” (“I love machine learning”). The encoder processes each word into hidden states:
Input words: [我, 爱, 机器, 学习]
Encoder hidden states: $[h_1, h_2, h_3, h_4]$
At the decoding step that produces "love", the decoder state $s$ computes an alignment score against each encoder state:

$$e_j = \mathrm{score}(s, h_j), \qquad j = 1, \dots, 4$$

Softmax normalizes these scores into probabilities $\alpha_1, \dots, \alpha_4$ (here, most of the weight would fall on 爱). The context vector becomes:

$$c = \alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3 + \alpha_4 h_4$$

This context is then combined with the decoder state $s$ to generate the output embedding for "love." A fresh context vector is computed for each subsequent target word, enabling the model to focus on different parts of the input.
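The step above can be traced numerically; the weights below are invented purely to make the arithmetic concrete, with the peak on 爱 as the example describes:

```python
import numpy as np

d = 8
H = np.random.randn(4, d)                    # encoder states h1..h4 (illustrative)

# Hypothetical weights for the step that produces "love": peaked at 爱 (h2).
alpha = np.array([0.05, 0.80, 0.10, 0.05])

c = alpha @ H                                # context vector for this step
# c is combined with the decoder state s to emit "love"; the next step
# recomputes alpha and c from scratch, shifting focus toward 机器 / 学习.
```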
Why Attention Matters
Dynamic context vectors allow models to avoid the bottleneck of a single summary vector, improving performance on long sequences. The soft alignment provided by attention also yields interpretability: examining the weights $\alpha_{tj}$ reveals which inputs influenced each output. Self-attention's parallelism further accelerates training by removing sequential dependencies.