🤖
AI Canopy • Advanced • ⏱️ 20 min read

Large Language Models

The Engines Behind Modern AI

Large language models have fundamentally changed what machines can do with text, code, and reasoning. In this lesson, we'll pull back the curtain on how they work, why they're so powerful, and where they still fall short.

What Exactly Is an LLM?

A large language model is a neural network trained on enormous quantities of text to predict the next token in a sequence. "Large" refers to the number of parameters - learnable weights that encode patterns from training data.

  • GPT-4 is estimated to have over 1 trillion parameters.
  • Llama 3 comes in sizes from 8 billion to 405 billion parameters.
  • Mistral 7B shows that smaller models can punch well above their weight.

These models are trained on hundreds of billions of words drawn from books, websites, code repositories, and academic papers. The sheer scale of data and parameters is what gives LLMs their remarkable versatility.

Tokens, not words: LLMs don't process whole words - they work with tokens, which are sub-word units. The word "understanding" might be split into "under" + "stand" + "ing". A typical English word averages about 1.3 tokens. This tokenisation scheme allows models to handle rare words, technical jargon, and even multiple languages without an impossibly large vocabulary.
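To make tokenisation concrete, here is a toy greedy longest-match tokeniser over a made-up vocabulary. Real tokenisers such as BPE learn their vocabularies from data and apply merge rules, so this is an illustration of the idea only:

```python
# Toy greedy longest-match subword tokeniser. The vocabulary below is
# invented for illustration; real tokenisers learn theirs from data.
VOCAB = {"under", "stand", "ing", "un", "der", "st", "and", "in", "g"}

def tokenise(word: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("understanding"))  # → ['under', 'stand', 'ing']
```

Because any unknown character falls back to a single-character token, a scheme like this never fails on rare words or new jargon - exactly the property that lets LLMs cover many languages with a bounded vocabulary.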

🤯

Training GPT-4 is estimated to have cost over $100 million in compute alone. That's roughly the budget of a mid-range Hollywood film - except the "actor" can speak every programming language.

The Transformer Architecture

Every modern LLM is built on the transformer, introduced in the 2017 paper Attention Is All You Need. The key innovation is self-attention - a mechanism that lets the model weigh how important each word is relative to every other word in the input.

How self-attention works (simplified):

  1. Each token is converted into three vectors: Query, Key, and Value.
  2. The model computes attention scores by comparing every Query against every Key.
  3. These scores determine how much each token "pays attention" to every other token.
  4. The weighted sum of Values produces a context-aware representation.

The mathematical formula at the heart of attention is:

    Attention(Q, K, V) = softmax(QK^T / √d) × V

The division by √d (the square root of the key dimension) prevents the dot products from growing too large, which would push the softmax function into regions with extremely small gradients. This simple scaling trick was one of the key insights that made training deep transformers stable.

This is repeated across multiple attention heads in parallel, allowing the model to capture different types of relationships simultaneously - syntax in one head, semantics in another.
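The steps above can be sketched in a few lines of NumPy. This is a single attention head with random toy inputs; a real transformer produces Q, K, and V through learned projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d = K.shape[-1]
    # Attention scores: every Query compared against every Key,
    # scaled by sqrt(d) to keep the softmax in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d)
    weights = softmax(scores)  # each row sums to 1
    return weights @ V         # context-aware weighted sum of Values

# 4 tokens, 8-dimensional vectors (toy sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = self_attention(Q, K, V)
print(out.shape)  # → (4, 8)
```

Note that `scores` has shape (4, 4) - every token against every other - which is exactly the quadratic cost discussed below.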

Figure: Diagram of the transformer self-attention mechanism with Query, Key, and Value vectors. Self-attention allows every token to attend to every other token, capturing long-range dependencies.
🤔
Think about it:

If a sentence has 500 tokens, self-attention compares every token with every other - that's 250,000 comparisons per layer. How might this quadratic cost affect what LLMs can process?

Key Models in the Landscape

| Model | Creator | Notable Feature |
|-------|---------|-----------------|
| GPT-4o | OpenAI | Multimodal (text, image, audio) |
| Claude 4 | Anthropic | Extended thinking, safety-focused |
| Gemini 2.5 | Google DeepMind | Native multimodality, long context |
| Llama 3 | Meta | Open-weight, community-driven |
| Mistral Large | Mistral AI | Efficient European alternative |

The field moves extraordinarily fast - by the time you read this, newer models may already exist.

A crucial distinction in this landscape is between closed-source models (GPT-4o, Claude, Gemini) where only the API is available, and open-weight models (Llama, Mistral) where the model weights are publicly released. Open-weight models allow organisations to run inference on their own infrastructure, fine-tune for specific domains, and inspect the model's behaviour - advantages that matter greatly for privacy-sensitive industries.

💡

Not all "open" models are truly open source. Some release weights but restrict commercial use or don't share training data and code. Always check the licence before deploying an open-weight model in production.

Pre-Training, Fine-Tuning, and RLHF

LLMs are built in stages:

  1. Pre-training - The model reads vast amounts of text and learns to predict the next token. This stage is enormously expensive and produces a base model that can complete text but isn't particularly helpful.

  2. Supervised fine-tuning (SFT) - Human-written examples of ideal responses teach the model to follow instructions and answer questions properly.

  3. RLHF (Reinforcement Learning from Human Feedback) - Human raters rank model outputs from best to worst. A reward model learns these preferences, and the LLM is trained to maximise the reward. This is what makes models helpful, harmless, and honest.
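To make stage 3 concrete, here is a minimal sketch of the pairwise preference loss a reward model is commonly trained with (a Bradley-Terry-style objective). The scalar rewards here stand in for a real reward network's outputs:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise preference loss: push the reward of the response the
    # human ranked higher above the reward of the one ranked lower.
    #   loss = -log(sigmoid(r_chosen - r_rejected))
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# If the reward model already prefers the chosen response, loss is small.
print(reward_model_loss(2.0, -1.0))
# If it prefers the rejected response, loss is large.
print(reward_model_loss(-1.0, 2.0))
```

Minimising this loss over many ranked pairs teaches the reward model to reproduce human preferences; the LLM is then optimised against that learned reward.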

🧠 Quiz

What is the primary purpose of RLHF in LLM training?

Emergent Capabilities

As models scale up, they develop abilities that weren't explicitly trained:

  • Complex reasoning - Multi-step logical deduction and mathematical problem-solving.
  • Code generation - Writing, debugging, and explaining code across dozens of languages.
  • Multilingual fluency - Translating and generating text in languages with relatively little training data.
  • In-context learning - Adapting behaviour based on examples provided in the prompt, without any weight updates.

These emergent properties are one of the most fascinating aspects of scaling - abilities that appear almost "for free" once a model crosses certain size thresholds.
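In-context learning happens entirely in the prompt. Here is a sketch of how a few-shot prompt might be assembled - the translation task and examples are invented for illustration:

```python
def few_shot_prompt(examples, query):
    # Each (input, output) pair demonstrates the task; the model is
    # expected to continue the pattern for the final input - no
    # weight updates are involved.
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Toy French-to-English task shown purely through examples.
examples = [("cheval", "horse"), ("chien", "dog")]
print(few_shot_prompt(examples, "chat"))
```

The prompt ends mid-pattern, so the model's next-token prediction naturally completes the task - which is all "in-context learning" mechanically is.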

🤯

GPT-4 passed the Uniform Bar Exam in the 90th percentile - better than most human law graduates. Yet it can still struggle with basic arithmetic if the numbers are unusual enough.

Limitations You Must Understand

LLMs are powerful but far from perfect:

  • Hallucinations - Models generate confident-sounding text that is factually wrong. They don't "know" facts; they predict likely token sequences.
  • Context window limits - Each model has a maximum input size (e.g., 128K tokens for GPT-4o). Information beyond this window is simply invisible.
  • Cost and latency - Running inference on large models requires expensive GPU clusters. A single GPT-4 query costs significantly more than a GPT-3.5 query.
  • Lack of true understanding - LLMs manipulate statistical patterns in text. Whether this constitutes "understanding" is a deep philosophical debate.
  • Training data cutoffs - Models don't know about events after their training data was collected unless augmented with retrieval systems.

💡

Always verify critical facts from LLM outputs against authoritative sources. Treat LLMs as a brilliant but unreliable research assistant - extraordinarily useful, but never the final word on matters of fact.
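The context-window limit can be sanity-checked with a rough estimate based on the ~1.3 tokens-per-word average mentioned earlier. This is a ballpark only; production code should count tokens with the target model's actual tokeniser:

```python
def fits_in_context(text: str, context_window: int,
                    tokens_per_word: float = 1.3) -> bool:
    # Rough estimate: average English word ≈ 1.3 tokens. Real counts
    # depend on the specific model's tokeniser.
    estimated_tokens = int(len(text.split()) * tokens_per_word)
    return estimated_tokens <= context_window

doc = "word " * 200_000  # ~200k words ≈ 260k estimated tokens
print(fits_in_context(doc, 128_000))  # → False
```

A document that fails this check must be chunked, summarised, or routed through a retrieval system before the model can use it.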

🧠 Quiz

Why do LLMs sometimes produce hallucinations?

The Scaling Debate

For years, the dominant belief was that bigger is always better - more parameters, more data, and more compute would steadily improve performance. This "scaling law" held remarkably well through GPT-2, GPT-3, and GPT-4.

However, recent research suggests we may be approaching diminishing returns on raw scale alone. The focus is shifting towards:

  • Better data quality over quantity (curated, deduplicated datasets).
  • Inference-time compute - letting models "think longer" on hard problems rather than making the model itself larger.
  • Specialised architectures - mixture-of-experts models that activate only a fraction of parameters per query, improving efficiency without sacrificing capability.
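The mixture-of-experts idea can be sketched as a router that scores the experts and evaluates only the top-k for each token. Everything here - the linear "experts", the router, the sizes - is a toy stand-in for the feed-forward expert blocks of a real MoE layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, experts, router_weights, k=2):
    # The router scores every expert for this token, but only the
    # top-k experts are actually evaluated - the parameters of the
    # remaining experts stay idle for this query.
    logits = router_weights @ token
    top_k = np.argsort(logits)[-k:]
    gates = softmax(logits[top_k])
    return sum(g * experts[i](token) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
num_experts, dim = 8, 16
# Each "expert" is a small linear map; real experts are full
# feed-forward blocks.
weights = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]
experts = [lambda x, W=W: W @ x for W in weights]
router_weights = rng.normal(size=(num_experts, dim))

token = rng.normal(size=dim)
out = moe_forward(token, experts, router_weights, k=2)
print(out.shape)  # → (16,)
```

With k=2 of 8 experts active, only a quarter of the expert parameters are touched per token - the efficiency win the bullet above describes.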

Wrapping Up

Large language models represent a genuine paradigm shift in computing. Understanding their architecture, training pipeline, and limitations isn't optional for anyone working seriously with AI - it's foundational knowledge that will serve you well as the field continues to evolve.

🧠 Quiz

Which stage of LLM training is the most computationally expensive?

🤔
Think about it:

If an LLM can pass the bar exam but can't reliably count the number of letters in a word, what does that tell us about the difference between human intelligence and language model capabilities?