In 2017, a team at Google published a paper titled "Attention Is All You Need". It introduced the Transformer architecture, and it changed AI forever. Every major language model you've heard of — ChatGPT, GPT-4, BERT, Gemini, Claude, LLaMA — is built on Transformers.
To understand why Transformers matter, you first need to understand what they replaced.
Before Transformers, the dominant approach for processing sequences (text, audio, time series) was the Recurrent Neural Network (RNN) and its variants: LSTMs and GRUs.
RNNs process text one word at a time, maintaining a "hidden state" that carries information forward:
Input: "The cat sat on the mat because it was comfortable"
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
h1 → h2 → h3 → h4 → h5 → h6 → h7 → h8 → h9 → h10 → output
This works, but has two major problems:
1. Vanishing gradients over long sequences. By the time the model processes word 10, the information from word 1 has been "diluted" by nine hidden state updates. The model struggles to connect "it" (word 8) back to "cat" (word 2) when the sentence is long.
2. No parallelisation. Because each word depends on the previous hidden state, you must process words one at a time. You cannot parallelise across a sequence, so training on long documents is painfully slow.
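The sequential bottleneck is visible in a minimal sketch. This toy RNN cell (hypothetical, for illustration only) shows why each step must wait for the previous one:

```python
import torch

# A toy RNN cell: each hidden state depends on the previous one,
# so this loop cannot be parallelised across time steps
torch.manual_seed(0)
d = 4
W_xh = torch.randn(d, d) * 0.1
W_hh = torch.randn(d, d) * 0.1

inputs = [torch.randn(d) for _ in range(10)]  # ten "words"
h = torch.zeros(d)
for x in inputs:
    h = torch.tanh(x @ W_xh + h @ W_hh)  # step t needs step t-1's result

print(h.shape)  # the final hidden state must summarise the whole sequence
```

Information from the first word reaches the output only after passing through nine tanh squashes and matrix multiplies, which is exactly where the dilution comes from.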
The Transformer solves both problems with a single elegant mechanism: attention.
The central innovation of the Transformer is self-attention: the ability for every word in a sequence to directly attend to every other word simultaneously, regardless of how far apart they are.
Rather than passing information sequentially, self-attention computes how relevant each word is to every other word in the sequence — all at once.
For each word (or token), the Transformer creates three vectors:
- a Query (Q): what this token is looking for
- a Key (K): what this token can be matched against
- a Value (V): the information this token actually carries
The attention score between two words is computed by taking the dot product of one word's Query with another word's Key. A high score means "these two words are highly relevant to each other."
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.shape[-1]  # dimension of key vectors

    # Compute raw attention scores: how relevant is each word to each other
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Convert scores to probabilities (weights that sum to 1)
    attention_weights = F.softmax(scores, dim=-1)

    # Weighted sum of Value vectors
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
# For the sentence "The cat sat on the mat because it was comfortable",
# "it" will have a high attention score with "cat" (the antecedent)
# regardless of how many words separate them
The result: every token has a direct line of sight to every other token. No more information bottleneck, no more vanishing gradients over long sequences.
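As a sanity check, the attention weights always form a probability distribution over the whole sequence, with no distance penalty between positions. A standalone sketch (restating the same three lines with random vectors):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 10, 64
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))

scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (10, 10): every token vs every token
weights = F.softmax(scores, dim=-1)
output = weights @ V                             # (10, 64): contextualised tokens

# Each row of weights sums to 1: a distribution over all tokens,
# so position 1 and position 10 are equally reachable in one hop
print(weights.sum(dim=-1))  # all ones
```

Note that the score matrix is seq_len × seq_len regardless of how far apart any two tokens are, which is the "direct line of sight" in practice.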
The / (d_k ** 0.5) in the code above prevents the dot products from becoming very large numbers that would push the softmax into regions of extremely small gradients. It's a stabilisation trick.
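The effect of the scaling is easy to demonstrate numerically. With random unit-variance vectors, dot products grow like the square root of d_k, and an unscaled softmax saturates (a standalone illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q, k = torch.randn(10, d_k), torch.randn(10, d_k)

raw = q @ k.T               # dot products have std ≈ sqrt(d_k) ≈ 22.6
scaled = raw / (d_k ** 0.5) # back to roughly unit scale

# Unscaled logits are huge, so softmax collapses toward a one-hot
# distribution whose gradients are vanishingly small
print(F.softmax(raw, dim=-1).max().item())     # nearly saturated
print(F.softmax(scaled, dim=-1).max().item())  # much softer
```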
Instead of computing attention just once, Transformers compute it multiple times in parallel with different learned projections. This is called multi-head attention:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 per head
        # Separate linear projections for each head
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Each head learns to attend to different aspects:
        # Head 1 might focus on syntactic relationships
        # Head 2 might focus on semantic similarity
        # Head 3 might focus on co-reference (it → cat)
        # ... and so on for all 8 heads
        batch, seq_len, d_model = x.shape

        def split(t):  # (batch, seq, d_model) → (batch, heads, seq, d_k)
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        # Scaled dot-product attention within each head, in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)

        # Concatenate heads back to (batch, seq, d_model) and mix them
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(out)
Each attention head can specialise in a different type of relationship. One head might track syntactic dependencies (subject-verb agreement), another might focus on coreference (pronouns and their antecedents), another might capture semantic similarity.
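The "heads" are not separate networks; splitting is just a reshape of the model dimension. A standalone sketch of how 512 dimensions become 8 heads of 64 (shapes match the code above):

```python
import torch

batch, seq_len, d_model, num_heads = 2, 10, 512, 8
d_k = d_model // num_heads  # 64

x = torch.randn(batch, seq_len, d_model)

# Split the model dimension into heads: (batch, heads, seq, d_k)
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 10, 64])

# Each head attends over the sequence independently, in parallel,
# before the results are concatenated back to (batch, seq, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(torch.equal(merged, x))  # True — split and merge are exact inverses
```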
Self-attention has one blind spot: it's inherently order-agnostic. "Dog bites man" and "Man bites dog" would produce identical attention scores without additional information.
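This order-blindness can be verified directly: permuting the input rows permutes the attention output in exactly the same way. A standalone check with random vectors (using Q = K = V = X for simplicity):

```python
import torch
import torch.nn.functional as F

def attend(X):
    # Plain self-attention with Q = K = V = X, for illustration
    scores = X @ X.T / (X.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ X

torch.manual_seed(0)
X = torch.randn(5, 16)              # five "words"
perm = torch.tensor([2, 0, 4, 1, 3])

out = attend(X)
out_permuted = attend(X[perm])

# Shuffling the words just shuffles the outputs identically: without
# positional information, attention cannot tell the two orders apart
print(torch.allclose(out[perm], out_permuted, atol=1e-5))  # True
```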
To give the model a sense of word order, Transformers add positional encodings to the input embeddings. The original Transformer used sinusoidal functions:
import numpy as np

def positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i already steps by 2, so the paper's 2i/d_model exponent is i/d_model here
            PE[pos, i] = np.sin(pos / 10000 ** (i / d_model))
            PE[pos, i + 1] = np.cos(pos / 10000 ** (i / d_model))
    return PE

# Each position gets a unique "fingerprint" added to its embedding
# Position 0 looks different from position 1, which looks different from position 2
Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode relative position more effectively and scale better to very long contexts.
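A minimal sketch of the idea behind RoPE (a simplified illustration, not LLaMA's actual implementation): each position rotates consecutive dimension pairs by position-dependent angles, so the dot product between a rotated query and key depends only on their relative offset:

```python
import torch

def rope(x, base=10000.0):
    # x: (seq_len, d) with d even; rotate dimension pairs by pos-dependent angles
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
v, w, N = torch.randn(8), torch.randn(8), 16
V = rope(v.repeat(N, 1))  # the same vector v placed at every position
W = rope(w.repeat(N, 1))

d1 = (V[5] @ W[2]).item()    # query at position 5, key at position 2 (offset 3)
d2 = (V[13] @ W[10]).item()  # same offset 3, ten positions later
print(abs(d1 - d2) < 1e-3)   # True: the score depends only on relative position
```

Because each rotation is length-preserving and the angle difference depends only on the position gap, the attention score is a function of relative distance, which is what helps long-context extrapolation.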
The original Transformer has two main parts:
The encoder. Reads the input and builds a rich contextual representation. Each token's representation is informed by every other token via self-attention. Used for tasks like sentence classification, named entity recognition, and question answering over a given passage.
The decoder. Generates output tokens one at a time, attending both to the encoder's output and to the already-generated tokens. Used for translation, summarisation, and text generation.
A full encoder-decoder stack:
Input text → Token Embeddings + Positional Encoding
→ N × [Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm]
→ Encoder Output
Encoder Output + Generated tokens so far
→ N × [Masked Self-Attention → Add & Norm → Cross-Attention → Add & Norm → Feed-Forward → Add & Norm]
→ Next token prediction
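The "masked" part of the decoder's self-attention can be sketched on its own: scores for future positions are set to -inf before the softmax, so each token can only attend backwards (a standalone illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 16
Q, K = torch.randn(seq_len, d_k), torch.randn(seq_len, d_k)

scores = Q @ K.T / (d_k ** 0.5)

# Causal mask: True above the diagonal marks "future" positions
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# Row i now has exactly zero weight on every position j > i: a token
# cannot peek at tokens it hasn't generated yet
print(weights)
```

This is the only structural difference between the decoder's self-attention and the encoder's; everything else in the score computation is identical.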
Not all Transformer-based models use the full encoder-decoder structure. The two dominant variants are:
Encoder-only (BERT and its descendants): sees the whole sentence at once and is trained by masking out words and predicting them. Suited to understanding tasks.
BERT training: "The [MASK] sat on the mat" → predict "cat"
Decoder-only (the GPT family, LLaMA): sees only the words so far and is trained to predict the next token. Suited to generation, and the architecture behind modern chat models.
GPT training: "The cat sat on the" → predict "mat"
The Transformer architecture turned out to scale remarkably well. Larger models with more parameters, trained on more data, with more compute, consistently produce better performance — a phenomenon described by scaling laws (Kaplan et al., 2020).
This led to a progression: GPT-2 (1.5 billion parameters, 2019) → GPT-3 (175 billion, 2020) → GPT-4 (2023, size undisclosed).
Each jump brought qualitative improvements in capability. Tasks that GPT-2 couldn't do at all (complex reasoning, code generation) became possible at GPT-3 scale and excellent at GPT-4 scale.
Two very different computational profiles:
Training: extremely expensive. Processing billions of tokens, computing gradients, and updating hundreds of billions of parameters across thousands of GPUs for weeks or months. GPT-4's training was estimated to cost over $100 million.
Inference: much cheaper. A single forward pass through the network to generate each token. Still requires significant hardware (powerful GPUs or specialised AI chips), but manageable at scale.
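The gap can be made concrete with a common back-of-the-envelope rule (an approximation, not a figure from this text): a forward pass costs roughly 2 FLOPs per parameter per token, and training roughly 6 (forward plus backward). For a hypothetical 70-billion-parameter model:

```python
# Rough cost arithmetic for a hypothetical 70B-parameter model,
# using the common ~2 FLOPs/parameter/token inference approximation
n_params = 70e9
flops_per_token = 2 * n_params  # ≈ 1.4e11 FLOPs to generate one token

# Training costs ~6 FLOPs/parameter/token (forward + backward pass),
# repeated over every token in the training corpus
tokens_trained = 1e12           # a hypothetical 1-trillion-token run
training_flops = 6 * n_params * tokens_trained

print(f"{flops_per_token:.1e} FLOPs per generated token")
print(f"{training_flops:.1e} total training FLOPs")
```

Generating a token is a single pass; training repeats a 3x-more-expensive pass trillions of times, which is the asymmetry described below.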
This asymmetry has shaped the industry: a few well-resourced organisations train foundation models; everyone else accesses them via APIs.
What problem with RNNs did the Transformer's attention mechanism solve?