Attention Mechanism and Transformers: What, Why & How
Going beyond recurrent and convolutional models, attention mechanisms let neural networks dynamically weight different parts of the input sequence, capturing long‑range dependencies and enabling parallel computation. Transformers, built entirely on attention, have revolutionized natural language processing and beyond. In this article, we explore how attention works, implement key components in TensorFlow/Keras, and show why auto-regressive decoding gives models like GPT their generative prowess.
Understanding Attention
Attention lets a model focus selectively on parts of its input. Given a Query matrix Q, a Key matrix K, and a Value matrix V, scaled dot‑product attention computes raw scores by taking the dot product of queries and keys, scales them to stabilize gradients, applies softmax to generate weights, and finally produces a weighted sum of values:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Here, the scaling factor 1/√d_k (where d_k is the key dimension) keeps the dot products in a suitable numerical range, while softmax converts raw scores into a probability distribution over input positions.
import tensorflow as tf

def scaled_dot_product_attention(query, key, value):
    matmul_qk = tf.matmul(query, key, transpose_b=True)  # raw scores: Q Kᵀ
    dk = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)         # scale by √d_k
    weights = tf.nn.softmax(scaled_logits, axis=-1)      # attention weights per query
    output = tf.matmul(weights, value)                   # weighted sum of values
    return output, weights

# Example:
q = tf.random.normal((1, 10, 64))
k = tf.random.normal((1, 10, 64))
v = tf.random.normal((1, 10, 64))
out, attn = scaled_dot_product_attention(q, k, v)
print("Attention output shape:", out.shape)
The returned attn tensor shows which positions the model attends to most for each query.
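Because softmax normalizes the scores, each row of attn is a probability distribution over the key positions. A quick sanity check on the tensors from the example above confirms that the weights along the last axis sum to one:

# Each query position's weights form a probability distribution over the 10 key positions.
print("Attention weights shape:", attn.shape)                  # (1, 10, 10)
print("Weight sums per query:", tf.reduce_sum(attn, axis=-1))  # all values ≈ 1.0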
Building Transformer Blocks
Transformers stack attention with feed‑forward layers, residual connections, and normalization. Multi‑Head Attention runs the basic attention mechanism h times in parallel with different learned projection matrices, concatenating their outputs:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
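To make the formula concrete, here is a minimal sketch of multi-head self-attention built on top of the scaled_dot_product_attention function defined earlier. The SimpleMultiHeadAttention class name and the head-splitting details are illustrative; the TransformerBlock below relies on Keras's built-in MultiHeadAttention layer instead.

import tensorflow as tf
from tensorflow.keras.layers import Dense

class SimpleMultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.depth = embed_dim // num_heads  # per-head dimension
        self.wq = Dense(embed_dim)           # learned projections (fused across heads)
        self.wk = Dense(embed_dim)
        self.wv = Dense(embed_dim)
        self.wo = Dense(embed_dim)           # output projection W^O

    def split_heads(self, x, batch_size):
        # (batch, seq, embed_dim) -> (batch, num_heads, seq, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, x):
        batch_size = tf.shape(x)[0]
        q = self.split_heads(self.wq(x), batch_size)
        k = self.split_heads(self.wk(x), batch_size)
        v = self.split_heads(self.wv(x), batch_size)
        out, _ = scaled_dot_product_attention(q, k, v)  # attention runs per head in parallel
        out = tf.transpose(out, perm=[0, 2, 1, 3])      # (batch, seq, num_heads, depth)
        out = tf.reshape(out, (batch_size, -1, self.num_heads * self.depth))  # concatenate heads
        return self.wo(out)

# Example:
mha = SimpleMultiHeadAttention(embed_dim=64, num_heads=8)
x = tf.random.normal((1, 10, 64))
print("Multi-head attention output shape:", mha(x).shape)  # (1, 10, 64)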
Positional encodings inject sequence order, and two-layer feed‑forward networks add non-linearity and depth.
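As a sketch, the sinusoidal positional encoding from the original Transformer paper can be computed as follows. The positional_encoding function name is illustrative, and learned position embeddings are a common alternative.

import numpy as np
import tensorflow as tf

def positional_encoding(max_len, embed_dim):
    # Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions.
    positions = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    dims = np.arange(embed_dim)[np.newaxis, :]     # (1, embed_dim)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(embed_dim))
    angles = positions * angle_rates
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return tf.cast(angles[np.newaxis, ...], tf.float32)  # (1, max_len, embed_dim)

# Added to token embeddings before the first Transformer block:
pe = positional_encoding(max_len=20, embed_dim=64)
print("Positional encoding shape:", pe.shape)  # (1, 20, 64)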
from tensorflow.keras.layers import MultiHeadAttention, Dense, LayerNormalization, Dropout
import tensorflow as tf

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation='relu'),  # two-layer feed-forward network
            Dense(embed_dim),
        ])
        self.norm1 = LayerNormalization(epsilon=1e-6)
        self.norm2 = LayerNormalization(epsilon=1e-6)
        self.drop1 = Dropout(rate)
        self.drop2 = Dropout(rate)

    def call(self, x, training=None):
        attn_out = self.att(x, x)                           # self-attention over the sequence
        attn_out = self.drop1(attn_out, training=training)
        out1 = self.norm1(x + attn_out)                     # residual connection + normalization
        ffn_out = self.ffn(out1)
        ffn_out = self.drop2(ffn_out, training=training)
        return self.norm2(out1 + ffn_out)

# Usage:
block = TransformerBlock(embed_dim=64, num_heads=8, ff_dim=256)
x = tf.random.normal((1, 20, 64))
out = block(x)
print("Transformer block output shape:", out.shape)
Deep stacks of these blocks power encoders and decoders in models like BERT, GPT, and T5.
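As a rough sketch of how such a stack can be assembled, the toy encoder below combines token embeddings, the positional_encoding helper, and a stack of TransformerBlocks. The MiniEncoder name, layer count, and vocabulary size are illustrative, not a production configuration.

import tensorflow as tf
from tensorflow.keras.layers import Embedding

class MiniEncoder(tf.keras.Model):
    def __init__(self, vocab_size, max_len, embed_dim, num_heads, ff_dim, num_blocks):
        super().__init__()
        self.embed = Embedding(vocab_size, embed_dim)
        self.pos = positional_encoding(max_len, embed_dim)  # sinusoidal encodings from the sketch above
        self.blocks = [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_blocks)]

    def call(self, token_ids, training=False):
        seq_len = tf.shape(token_ids)[1]
        x = self.embed(token_ids) + self.pos[:, :seq_len, :]  # inject sequence order
        for block in self.blocks:                             # deep stack of Transformer blocks
            x = block(x, training=training)
        return x

# Example:
encoder = MiniEncoder(vocab_size=1000, max_len=20, embed_dim=64,
                      num_heads=8, ff_dim=256, num_blocks=4)
tokens = tf.random.uniform((1, 20), maxval=1000, dtype=tf.int32)
print("Encoder output shape:", encoder(tokens).shape)  # (1, 20, 64)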
Auto-Regressive Decoding in GPT
Generative tasks, such as text completion, benefit from auto-regressive models that predict one token at a time, using prior outputs as context. GPT exemplifies this, feeding its own generated tokens back as input during inference.
This left‑to‑right generation ensures continuity and coherence, distinguishing GPT’s sequential decoding from bidirectional encoders like BERT. GPT is a large language model (LLM), but it is its auto-regressive nature that defines how it generates text, step by step.
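A minimal sketch of that decoding loop is shown below, assuming a hypothetical language_model callable that maps token IDs to next-token logits. Real GPT inference adds causal masking inside the model, sampling strategies such as top-k or nucleus sampling, and key/value caching for speed.

import tensorflow as tf

def generate_greedy(language_model, prompt_ids, max_new_tokens=20):
    # Greedy auto-regressive decoding: repeatedly append the most likely next token.
    tokens = tf.constant([prompt_ids], dtype=tf.int32)  # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = language_model(tokens)                 # (1, seq_len, vocab_size)
        next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)  # last position only
        tokens = tf.concat([tokens, next_id[:, tf.newaxis]], axis=1)          # feed it back as context
    return tokens

# Example with a stand-in "model" that returns random logits over a 1000-token vocabulary:
vocab_size = 1000
dummy_model = lambda ids: tf.random.normal(tf.concat([tf.shape(ids), [vocab_size]], axis=0))
print(generate_greedy(dummy_model, prompt_ids=[1, 2, 3], max_new_tokens=5))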
Attention’s ability to link any output position to every earlier token, combined with auto-regressive decoding, gives transformers both interpretability and generative power, fueling breakthroughs in natural language understanding and creation.