AI Branches • Intermediate • ⏱️ 35 min read

Transformers Explained: The Architecture Behind ChatGPT

In 2017, a team at Google published a paper titled "Attention Is All You Need". It introduced the Transformer architecture, and it changed AI forever. Every major language model you've heard of — ChatGPT, GPT-4, BERT, Gemini, Claude, LLaMA — is built on Transformers.

To understand why Transformers matter, you first need to understand what they replaced.

⏱️ The Problem with RNNs: Processing Text Sequentially

Before Transformers, the dominant approach for processing sequences (text, audio, time series) was the Recurrent Neural Network (RNN) and its variants: LSTMs and GRUs.

RNNs process text one word at a time, maintaining a "hidden state" that carries information forward:

Input:  "The cat sat on the mat because it was comfortable"
         ↓    ↓    ↓   ↓    ↓    ↓      ↓      ↓   ↓      ↓
        h1 → h2 → h3 → h4 → h5 → h6 → h7 → h8 → h9 → h10 → output

This works, but has two major problems:

1. Vanishing gradients over long sequences
By the time the model processes word 10, the information from word 1 has been "diluted" by nine hidden-state updates. The model struggles to connect "it" (word 8) back to "cat" (word 2) when the sentence is long.

2. Slow — no parallelisation
Because each word depends on the previous hidden state, you must process words one at a time. You cannot parallelise across a sequence. Training on long documents is painfully slow.

The Transformer solves both problems with a single elegant mechanism: attention.

👀 The Self-Attention Mechanism

The central innovation of the Transformer is self-attention: the ability for every word in a sequence to directly attend to every other word simultaneously, regardless of how far apart they are.

Rather than passing information sequentially, self-attention computes how relevant each word is to every other word in the sequence — all at once.

Query, Key, and Value — In Plain English

For each word (or token), the Transformer creates three vectors:

  • Query (Q): "What am I looking for?" — the question this word is asking
  • Key (K): "What do I contain?" — the advertisement each word broadcasts about itself
  • Value (V): "What information do I provide?" — the actual content to be retrieved

The attention score between two words is computed by taking the dot product of one word's Query with another word's Key. A high score means "these two words are highly relevant to each other."

import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = Q.shape[-1]  # dimension of key vectors
    
    # Compute raw attention scores: how relevant is each word to each other
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    
    # Convert scores to probabilities (weights that sum to 1)
    attention_weights = F.softmax(scores, dim=-1)
    
    # Weighted sum of Value vectors
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights

# For the sentence "The cat sat on the mat because it was comfortable",
# "it" will have a high attention score with "cat" (the antecedent)
# regardless of how many words separate them

The result: every token has a direct line of sight to every other token. No more information bottleneck, no more vanishing gradients over long sequences.

Why the Square Root Scaling?

The / (d_k ** 0.5) in the code above keeps the dot products at a manageable scale: with d_k-dimensional vectors, the raw scores have a standard deviation of roughly √d_k, large enough to push the softmax into regions with extremely small gradients. Scaling them back down stabilises training.
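You can see the effect with random vectors (a quick sketch; the exact numbers depend on the seed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64
q = torch.randn(d_k)        # one query
K = torch.randn(10, d_k)    # keys for 10 tokens

raw = K @ q                 # std grows like sqrt(d_k), about 8 here
scaled = raw / d_k ** 0.5   # back to roughly unit variance

# Large raw scores concentrate the softmax on one position,
# leaving near-zero gradient for every other position.
p_raw = F.softmax(raw, dim=-1)
p_scaled = F.softmax(scaled, dim=-1)
```

The scaled scores give a flatter attention distribution, so all positions keep receiving useful gradient signal during training.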

🧠 Multi-Head Attention

Instead of computing attention just once, Transformers compute it multiple times in parallel with different learned projections. This is called multi-head attention:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 dimensions per head
        
        # Learned projections for queries, keys, values, and output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        
        # Project, then split d_model into num_heads separate subspaces.
        # Each head learns to attend to different aspects:
        # Head 1 might focus on syntactic relationships,
        # Head 2 on semantic similarity,
        # Head 3 on co-reference (it → cat), and so on for all 8 heads.
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        
        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))
        
        scores = Q @ K.transpose(-2, -1) / (self.d_k ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = weights @ V  # (batch, num_heads, seq_len, d_k)
        
        # Concatenate the heads and mix them with the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.W_o(out)

Each attention head can specialise in a different type of relationship. One head might track syntactic dependencies (subject-verb agreement), another might focus on coreference (pronouns and their antecedents), another might capture semantic similarity.

🤯
GPT-4 is rumoured to use 96 attention heads. Each one can independently learn to track different linguistic relationships across the entire context window simultaneously.

📍 Positional Encoding

Self-attention has one blind spot: it's inherently order-agnostic. "Dog bites man" and "Man bites dog" would produce identical attention scores without additional information.

To give the model a sense of word order, Transformers add positional encodings to the input embeddings. The original Transformer used sinusoidal functions:

import numpy as np

def positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i already advances in steps of 2, so the exponent is i / d_model
            PE[pos, i]     = np.sin(pos / 10000 ** (i / d_model))
            PE[pos, i + 1] = np.cos(pos / 10000 ** (i / d_model))
    return PE

# Each position gets a unique "fingerprint" added to its embedding
# Position 0 looks different from position 1, which looks different from position 2

Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode relative position more effectively and scale better to very long contexts.
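RoPE's key property is that attention scores depend only on the distance between two positions, not on their absolute values. A rough NumPy sketch of the idea (an illustration, not LLaMA's exact implementation):

```python
import numpy as np

def rope(x, pos):
    """Rotate each (even, odd) pair of dimensions by an angle
    proportional to the token's position."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * theta
    out = np.empty_like(x, dtype=float)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out[..., 0::2] = x_even * np.cos(ang) - x_odd * np.sin(ang)
    out[..., 1::2] = x_even * np.sin(ang) + x_odd * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# The query–key dot product depends only on the relative offset
# (4 in both cases here), which is what helps RoPE scale to long contexts.
same_offset = np.allclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17))
```

Because rotations preserve vector lengths, RoPE also leaves each embedding's norm untouched, unlike additive sinusoidal encodings.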

🏗️ The Full Transformer Architecture

The original Transformer has two main parts:

Encoder

Reads the input and builds a rich contextual representation. Each token's representation is informed by every other token via self-attention. Used for tasks like: sentence classification, named entity recognition, question answering over a given passage.

Decoder

Generates output tokens one at a time, attending both to the encoder's output and to the already-generated tokens. Used for: translation, summarisation, text generation.

A full encoder-decoder stack:

Input text → Token Embeddings + Positional Encoding
           → N × [Multi-Head Self-Attention → Add & Norm → Feed-Forward → Add & Norm]
           → Encoder Output

Encoder Output + Generated tokens so far
           → N × [Masked Self-Attention → Cross-Attention → Feed-Forward]
           → Next token prediction

🆚 BERT vs GPT: Two Flavours of Transformer

Not all Transformer-based models use the full encoder-decoder structure. The two dominant variants are:

BERT (Encoder-Only)

  • Uses only the encoder stack
  • Trained with Masked Language Modelling: some input tokens are hidden, and BERT predicts them
  • Excellent at understanding text: classification, search, Q&A
  • Reads the entire context bidirectionally (every word sees every other word)

GPT (Decoder-Only)

  • Uses only the decoder stack
  • Trained with next-token prediction: given all previous tokens, predict the next one
  • Excellent at generating text: conversation, writing, code
  • Causal masking: each token can only attend to previous tokens (not future ones — that would be cheating)

BERT training:   "The [MASK] sat on the mat"  → predict "cat"
GPT training:    "The cat sat on the"          → predict "mat"
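The causal masking mentioned above is just a lower-triangular matrix applied to the attention scores before the softmax. A minimal sketch:

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)
# Future positions are set to -inf, so the softmax assigns them zero weight
masked = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)
```

After the softmax, every entry above the diagonal is exactly zero: no token can "see" a token that comes after it.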
🤔
Think about it: BERT's bidirectional attention means it can see future words when building a word's representation. GPT cannot. Why does this make GPT better for generation but BERT better for classification tasks?

📈 Why Scale Matters

The Transformer architecture turned out to scale remarkably well. Larger models with more parameters, trained on more data, with more compute, consistently produce better performance — a phenomenon described by scaling laws (Kaplan et al., 2020).

This led to a progression:

  • GPT-1 (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • GPT-4 (2023): estimated 1 trillion+ parameters

Each jump brought qualitative improvements in capability. Tasks that GPT-2 couldn't do at all (complex reasoning, code generation) became possible at GPT-3 scale and excellent at GPT-4 scale.

⚡ Training vs Inference

Two very different computational profiles:

Training: extremely expensive. Processing billions of tokens, computing gradients, and updating hundreds of billions of parameters across thousands of GPUs for weeks or months. GPT-4's training was estimated to cost over $100 million.

Inference: much cheaper. A single forward pass through the network to generate each token. Still requires significant hardware (powerful GPUs or specialised AI chips), but manageable at scale.
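The per-token cost of inference comes from autoregression: each new token requires a full forward pass over the prefix. A toy sketch with a stand-in "model" (a random bigram table, purely illustrative):

```python
import torch

torch.manual_seed(0)
vocab = 4
# Stand-in "model": a random bigram table of next-token logits.
bigram = torch.randn(vocab, vocab)

def toy_model(tokens):
    # Next-token logits for every position: shape (seq_len, vocab)
    return bigram[tokens]

def greedy_generate(tokens, n_new):
    """One full forward pass per generated token."""
    for _ in range(n_new):
        logits = toy_model(tokens)       # re-run on the whole prefix
        next_tok = logits[-1].argmax()   # greedy: pick the most likely token
        tokens = torch.cat([tokens, next_tok.view(1)])
    return tokens

out = greedy_generate(torch.tensor([0]), 5)
```

Generating n tokens costs n forward passes (production systems cache key/value tensors to avoid recomputing the prefix), but there is no gradient computation or parameter update, which is why inference is so much cheaper than training.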

This asymmetry has shaped the industry: a few well-resourced organisations train foundation models; everyone else accesses them via APIs.

🧠 Quiz

What problem with RNNs did the Transformer's attention mechanism solve?

Key Takeaways

  • RNNs processed sequences step-by-step, struggling with long-range dependencies and unable to parallelise — Transformers replaced them
  • Self-attention lets every token directly attend to every other token via Query, Key, Value vectors — solving both problems at once
  • Multi-head attention runs multiple attention computations in parallel, allowing the model to learn different types of relationships simultaneously
  • Positional encoding adds order information to the inherently position-agnostic attention mechanism
  • BERT (encoder-only) is pre-trained to understand text bidirectionally; best for classification and search tasks
  • GPT (decoder-only) is pre-trained to generate text from left to right; best for generation and conversation
  • Scale matters enormously: larger Transformers trained on more data consistently unlock new capabilities, as demonstrated by the GPT-1 → GPT-4 progression
  • Training is extremely expensive; inference is much cheaper — this asymmetry shapes how foundation models are deployed
Lesson 12 of 14
← Prompt Engineering: The Art of Conversing with AI
Reinforcement Learning: Teaching AI Through Trial and Error →