Attention: Computing Dynamic Context for Sequence Models

Attention mechanisms empower sequence-to-sequence models to compute a bespoke context vector for each output element by weighting input representations according to their relevance. This approach overcomes the limitations of fixed-size context vectors and enhances a model’s ability to capture dependencies over long sequences.

How Attention Works

At each decoding step $i$, the model constructs a context vector $C_i$ as a weighted sum of encoder hidden states $h_j$. Each input word $x_j$ is associated with a hidden state $h_j$ from an encoder (for example, an LSTM or GRU). The context vector is defined as:

$$C_i = \sum_{j=1}^{n} \alpha_{ij} \, h_j.$$

Here, the attention weights $\alpha_{ij}$ reflect how much focus the output at position $i$ gives to each input position $j$.
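
As a minimal sketch of this weighted sum, the snippet below uses NumPy with hypothetical encoder states and placeholder attention weights (the dimensions and values are illustrative, not from a trained model):

```python
import numpy as np

# Hypothetical encoder hidden states: n = 4 input positions, hidden size d = 8.
h = np.random.randn(4, 8)

# Placeholder attention weights for one decoding step i; they must sum to 1.
alpha_i = np.array([0.4, 0.3, 0.2, 0.1])

# Context vector C_i: weighted sum of the encoder states, shape (8,).
C_i = alpha_i @ h
```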

Calculating Attention Weights

Attention weights $\alpha_{ij}$ are obtained by normalizing alignment scores $e_{ij}$ with a softmax function, ensuring they sum to one across all inputs:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}.$$

The alignment score $e_{ij}$ quantifies how well the decoder state at step $i$ aligns with the encoder hidden state $h_j$.
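
A small sketch of this normalization, assuming a vector of already-computed alignment scores (the values are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Turn alignment scores e_ij into attention weights alpha_ij."""
    # Subtracting the max keeps the exponentials numerically stable
    # without changing the result.
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical alignment scores for one decoder step against n = 4 encoder states.
e_i = np.array([2.0, 1.7, 1.3, 0.6])
alpha_i = softmax(e_i)   # non-negative and sums to 1
```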

Alignment Scores

Alignment scores can be computed with different scoring functions, some of which introduce trainable parameters. Two common variants are:

  • Dot Product:

    $$e_{ij} = s_i^\top \, h_j$$

    where $s_i$ is the decoder hidden state at step $i$.

  • Additive (Bahdanau) Attention:

    $$e_{ij} = v_a^\top \tanh\bigl(W_a [s_i; h_j]\bigr)$$

    with $W_a$ and $v_a$ as trainable parameters and $[s_i; h_j]$ denoting concatenation.

These scores, once passed through the softmax, yield attention weights that guide the formation of each context vector.
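
Both scoring functions are easy to sketch directly. The snippet below assumes equal encoder and decoder hidden sizes and uses randomly initialized arrays as stand-ins for the trained parameters $W_a$ and $v_a$:

```python
import numpy as np

d, d_a = 8, 16                 # hidden size and attention size, chosen for illustration

s_i = np.random.randn(d)       # decoder state at step i
h_j = np.random.randn(d)       # encoder state at position j

# Dot-product score: e_ij = s_i^T h_j
e_dot = s_i @ h_j

# Additive (Bahdanau) score: e_ij = v_a^T tanh(W_a [s_i; h_j])
W_a = np.random.randn(d_a, 2 * d)   # trainable in a real model; random stand-in here
v_a = np.random.randn(d_a)
e_add = v_a @ np.tanh(W_a @ np.concatenate([s_i, h_j]))
```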

Self-Attention in a Single Sequence

Self-attention applies the same principle within one sequence, allowing each element to attend to all others (including itself). Given a sequence of embeddings arranged in a matrix $X \in \mathbb{R}^{n \times d}$, three projection matrices produce queries $Q$, keys $K$, and values $V$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V.$$

Scaled dot-product self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\Bigl(\frac{Q K^\top}{\sqrt{d_k}}\Bigr) V,$$

where $d_k$ is the dimensionality of the queries and keys, and the factor $1/\sqrt{d_k}$ keeps the inner products in a numerically stable range. The result is a sequence of the same length, where each position’s representation integrates information from all positions.
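
A compact sketch of scaled dot-product self-attention in NumPy; the projection matrices are random placeholders for learned parameters, and the dimensions are chosen only for illustration:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a single sequence X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_k): one output per position

# Hypothetical sizes: n = 5 tokens, model size d = 8, projection size d_k = 8.
n, d, d_k = 5, 8, 8
X = np.random.randn(n, d)
W_Q, W_K, W_V = [np.random.randn(d, d_k) for _ in range(3)]
out = self_attention(X, W_Q, W_K, W_V)                 # same sequence length as the input
```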

Practical Example: Machine Translation

Consider translating the Chinese phrase “我 爱 机器 学习” (“I love machine learning”). The encoder processes each word into hidden states:

Input words:            [ 我 , 爱 , 机器 , 学习 ]
Encoder hidden states:  [ h1,  h2,  h3,    h4   ]

At the first decoding step (producing “I”), the model scores the decoder state $s_1$ against each encoder state:

$$\text{score}(s_1, h_j) = v_a^\top \tanh(W_a [s_1; h_j])$$

Softmax normalizes these scores into probabilities, for example $[0.4, 0.3, 0.2, 0.1]$. The context vector becomes:

$$C_1 = 0.4\,h_1 + 0.3\,h_2 + 0.2\,h_3 + 0.1\,h_4.$$

This context $C_1$ is then combined with $s_1$ to generate the output embedding for “I.” A fresh context vector is computed for each subsequent target word, enabling the model to focus on different parts of the input.
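
The whole step can be strung together in a few lines. The sketch below reuses the additive scoring form with random stand-ins for the encoder states, the decoder state, and the trained parameters, so the resulting weights will not match the illustrative $[0.4, 0.3, 0.2, 0.1]$ above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_a = 8, 16                               # sizes chosen only for illustration

# Hypothetical encoder states for [我, 爱, 机器, 学习] and the decoder state s_1.
h = rng.normal(size=(4, d))
s_1 = rng.normal(size=d)

# Additive scoring with random stand-ins for the trained parameters W_a, v_a.
W_a = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_1, h_j])) for h_j in h])

# Softmax -> attention weights, then the context vector C_1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
C_1 = weights @ h        # combined with s_1 downstream to predict the first target word
```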

Why Attention Matters

Dynamic context vectors allow models to avoid the bottleneck of a single summary vector, improving performance on long sequences. The soft alignment provided by attention also yields interpretability: examining $\alpha_{ij}$ reveals which inputs influenced each output. Self-attention’s parallelism further accelerates training by removing sequential dependencies.