Attention: Computing Dynamic Context for Sequence Models

Attention mechanisms empower sequence-to-sequence models to compute a bespoke context vector for each output element by weighting input representations according to their relevance. This approach overcomes the limitations of fixed-size context vectors and enhances a model’s ability to capture dependencies over long sequences.

How Attention Works

At each decoding step $i$, the model constructs a context vector $C_i$ as a weighted sum of encoder hidden states $h_j$. Each input word $x_j$ is associated with a hidden state $h_j$ from an encoder (for example, an LSTM or GRU). The context vector is defined as:

$$C_i = \sum_{j=1}^{n} \alpha_{ij} \, h_j.$$

Here, the attention weights $\alpha_{ij}$ reflect how much focus the output at position $i$ gives to each input position $j$.
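
As a minimal sketch of this weighted sum, the snippet below uses NumPy with hypothetical encoder states and placeholder attention weights (the dimensions and values are illustrative, not from a trained model):

```python
import numpy as np

# Hypothetical encoder hidden states: n = 4 input positions, hidden size d = 8.
h = np.random.randn(4, 8)

# Placeholder attention weights for one decoding step i; they must sum to 1.
alpha_i = np.array([0.4, 0.3, 0.2, 0.1])

# Context vector C_i: weighted sum of the encoder states, shape (8,).
C_i = alpha_i @ h
```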

Calculating Attention Weights

Attention weights $\alpha_{ij}$ are obtained by normalizing alignment scores $e_{ij}$ with a softmax function, ensuring they sum to one across all inputs:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}.$$

The alignment score $e_{ij}$ quantifies how well the decoder state at step $i$ aligns with the encoder hidden state $h_j$.
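
A small sketch of this normalization, assuming a vector of already-computed alignment scores (the values are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Turn alignment scores e_ij into attention weights alpha_ij."""
    # Subtracting the max keeps the exponentials numerically stable
    # without changing the result.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical alignment scores for one decoder step against n = 4 encoder states.
e_i = np.array([2.0, 1.7, 1.3, 0.6])
alpha_i = softmax(e_i)   # non-negative and sums to 1
```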

Alignment Scores

Alignment scores can be computed with different scoring functions, some of which introduce trainable parameters. Two common variants are:

  • Dot Product:

    $$e_{ij} = s_i^\top \, h_j$$

    where $s_i$ is the decoder hidden state at step $i$.

  • Additive (Bahdanau) Attention:

    $$e_{ij} = v_a^\top \tanh\bigl(W_a [s_i; h_j]\bigr)$$

    with $W_a$ and $v_a$ as trainable parameters and $[s_i; h_j]$ denoting concatenation.

These scores, once passed through the softmax, yield attention weights that guide the formation of each context vector.
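
Both scoring functions are easy to sketch directly. The snippet below assumes equal encoder and decoder hidden sizes and uses randomly initialized arrays as stand-ins for the trained parameters $W_a$ and $v_a$:

```python
import numpy as np

d, d_a = 8, 16                 # hidden size and attention size, chosen for illustration

s_i = np.random.randn(d)       # decoder state at step i
h_j = np.random.randn(d)       # encoder state at position j

# Dot-product score: e_ij = s_i^T h_j
e_dot = s_i @ h_j

# Additive (Bahdanau) score: e_ij = v_a^T tanh(W_a [s_i; h_j])
W_a = np.random.randn(d_a, 2 * d)   # trainable in a real model; random stand-in here
v_a = np.random.randn(d_a)
e_add = v_a @ np.tanh(W_a @ np.concatenate([s_i, h_j]))
```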

Self-Attention in a Single Sequence

Self-attention applies the same principle within one sequence, allowing each element to attend to all others (including itself). Given a sequence of embeddings arranged in a matrix $X \in \mathbb{R}^{n \times d}$, three projection matrices produce queries $Q$, keys $K$, and values $V$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V.$$

Scaled dot-product self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\Bigl(\frac{Q K^\top}{\sqrt{d_k}}\Bigr) V,$$

where $d_k$ is the dimensionality of the queries and keys, and the factor $1/\sqrt{d_k}$ keeps the inner products in a numerically stable range. The result is a sequence of the same length, where each position’s representation integrates information from all positions.
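
A compact sketch of scaled dot-product self-attention in NumPy; the projection matrices are random placeholders for learned parameters, and the dimensions are chosen only for illustration:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a single sequence X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_k): one output per position

# Hypothetical sizes: n = 5 tokens, model size d = 8, projection size d_k = 8.
n, d, d_k = 5, 8, 8
X = np.random.randn(n, d)
W_Q, W_K, W_V = [np.random.randn(d, d_k) for _ in range(3)]
out = self_attention(X, W_Q, W_K, W_V)                 # same sequence length as the input
```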

Practical Example: Machine Translation

Consider translating the Chinese phrase “我 爱 机器 学习” (“I love machine learning”). The encoder processes each word into hidden states:

Input words:            [ 我 , 爱 , 机器 , 学习 ]
Encoder hidden states:  [ h1,  h2,  h3,    h4   ]

At the first decoding step (producing “I”), the model scores the decoder state $s_1$ against each encoder state:

$$\text{score}(s_1, h_j) = v_a^\top \tanh(W_a [s_1; h_j])$$

Softmax normalizes these scores into probabilities, for example $[0.4, 0.3, 0.2, 0.1]$. The context vector becomes:

$$C_1 = 0.4\,h_1 + 0.3\,h_2 + 0.2\,h_3 + 0.1\,h_4.$$

This context $C_1$ is then combined with $s_1$ to generate the output embedding for “I.” A fresh context vector is computed for each subsequent target word, enabling the model to focus on different parts of the input.
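
The whole step can be strung together in a few lines. The sketch below reuses the additive scoring form with random stand-ins for the encoder states, the decoder state, and the trained parameters, so the resulting weights will not match the illustrative $[0.4, 0.3, 0.2, 0.1]$ above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_a = 8, 16                               # sizes chosen only for illustration

# Hypothetical encoder states for [我, 爱, 机器, 学习] and the decoder state s_1.
h = rng.normal(size=(4, d))
s_1 = rng.normal(size=d)

# Additive scoring with random stand-ins for the trained parameters W_a, v_a.
W_a = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_1, h_j])) for h_j in h])

# Softmax -> attention weights, then the context vector C_1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
C_1 = weights @ h        # combined with s_1 downstream to predict the first target word
```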

Why Attention Matters

Dynamic context vectors allow models to avoid the bottleneck of a single summary vector, improving performance on long sequences. The soft alignment provided by attention also yields interpretability: examining $\alpha_{ij}$ reveals which inputs influenced each output. Self-attention’s parallelism further accelerates training by removing sequential dependencies.