🔤
AI Sprout • Intermediate • ⏱️ 14 min read

Tokenisation - How AI Reads Text

Neural networks work with numbers. They cannot read the word "hello" the way you do. Before any language model can process text, it must be broken into small numerical pieces called tokens. This seemingly simple step has profound consequences for how AI understands - and misunderstands - language.

Why Can't AI Just Read Characters?

The simplest approach: treat each character as a token. "Hello" becomes ['H', 'e', 'l', 'l', 'o'] - five tokens.

The problem? Words become absurdly long sequences. A 500-word essay might become 2,500+ character tokens. Since self-attention in Transformer models scales quadratically with sequence length, this is computationally brutal. Worse, individual characters carry almost no meaning - the model must learn that 'c', 'a', 't' together mean a furry animal.
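
Here is a quick sketch in plain Python - the "essay" is a stand-in built from one repeated word:

```python
text = "Hello"
char_tokens = list(text)
print(char_tokens)       # ['H', 'e', 'l', 'l', 'o'] - five tokens for one word

# A stand-in for a 500-word essay: "word " is 5 characters including the space.
essay = "word " * 500
print(len(list(essay)))  # 2500 character tokens for a single short essay
```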

Word-Level Tokenisation

The opposite extreme: each word is one token. "The cat sat" becomes ['The', 'cat', 'sat'] - compact and meaningful.

But this creates a different problem: the vocabulary explosion. English alone has hundreds of thousands of words. Add misspellings, technical jargon, and code, and the vocabulary becomes unmanageable. Any word not in the vocabulary becomes an unknown [UNK] token - a dead end for understanding.
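
A minimal sketch of a word-level tokeniser with an [UNK] fallback - the tiny vocabulary here is invented purely for illustration:

```python
# A deliberately tiny vocabulary; real word-level vocabularies held 100k+ entries.
vocabulary = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def word_tokenise(text: str) -> list[int]:
    # Any word missing from the vocabulary collapses to the [UNK] id.
    return [vocabulary.get(word.lower(), vocabulary["[UNK]"]) for word in text.split()]

print(word_tokenise("The cat sat"))      # [0, 1, 2]
print(word_tokenise("The cat ChatGPT"))  # [0, 1, 3] - 'ChatGPT' is a dead end
```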

🤔
Think about it:

If a model using word-level tokens encounters "ChatGPT" for the first time and it is not in the vocabulary, it becomes [UNK]. How might this affect the model's ability to discuss new technology?

The Sweet Spot - Subword Tokenisation

Modern language models use subword tokenisation, which sits between characters and words. Common words stay whole ("the", "and"), while rare words are split into meaningful pieces ("un" + "believ" + "able").

This gives us a manageable vocabulary (typically 32,000–100,000 tokens) while handling any text - even words the model has never seen before.

[Figure: the word 'unbelievable' split into three subword tokens - 'un', 'believ', and 'able' - with arrows showing how they recombine]
Subword tokenisation splits rare words into reusable pieces while keeping common words whole.
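
To make the idea concrete, here is a toy greedy longest-match splitter over an invented subword vocabulary (real tokenisers learn their pieces from data rather than using a hand-picked set):

```python
# Toy subword vocabulary: common words stay whole, rare words split into pieces.
subwords = {"the", "and", "un", "believ", "able"}

def greedy_split(word: str) -> list[str]:
    # Greedy longest-match from the left; real tokenisers use learned merge rules.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character falls back to itself
            i += 1
    return pieces

print(greedy_split("unbelievable"))  # ['un', 'believ', 'able']
```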

Byte Pair Encoding (BPE) - Step by Step

BPE is the algorithm behind GPT models. Here is how it builds a vocabulary:

  1. Start with individual characters: {'h', 'e', 'l', 'o', 'w', 'r', 'd', ' '}.
  2. Count which pairs of adjacent tokens appear most frequently in the training text.
  3. Merge the most frequent pair into a new token. If 'l' + 'o' appears most, create 'lo'.
  4. Repeat steps 2–3 until you reach the desired vocabulary size.

Worked example with the text "low lower lowest":

| Step | Most frequent pair | New token | Vocabulary grows |
|------|--------------------|-----------|------------------|
| 1    | l + o              | lo        | ...lo...         |
| 2    | lo + w             | low       | ...low...        |
| 3    | e + r              | er        | ...er...         |
| 4    | low + e            | lowe      | ...lowe...       |

After enough merges, common words and word fragments emerge naturally from the data.
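
Here is a minimal sketch of the BPE training loop in Python. It is deliberately simplified - real implementations match whole tokens rather than raw substrings and use careful tie-breaking - so the merge order may differ slightly from the table above:

```python
from collections import Counter

def count_pairs(corpus: dict[str, int]) -> Counter:
    """Count adjacent token pairs across all words (tokens are space-separated)."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], corpus: dict[str, int]) -> dict[str, int]:
    """Replace every occurrence of the pair with a single merged token.
    Note: str.replace is a simplification that works for this tiny corpus."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Step 1: every word starts as individual characters.
corpus = {"l o w": 1, "l o w e r": 1, "l o w e s t": 1}

# Steps 2-4: repeatedly merge the most frequent adjacent pair.
for step in range(1, 5):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    print(f"Step {step}: merge {best[0]} + {best[1]} -> {''.join(best)}")
```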

🤯

BPE was originally invented in 1994 as a data compression algorithm. It was repurposed for NLP in 2015 by Sennrich et al. - a beautiful example of ideas crossing disciplines.

Other Tokenisation Methods

WordPiece

Used by BERT and related models. Similar to BPE, but instead of merging the most frequent pair, it merges the pair that maximises the likelihood of the training data. Subword pieces are prefixed with ## (e.g., "playing" → ['play', '##ing']).
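
If you have the Hugging Face transformers library installed, you can inspect WordPiece splits directly; the exact pieces depend on the model's learned vocabulary:

```python
from transformers import AutoTokenizer

# BERT ships with a WordPiece tokenizer (downloads on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Continuation pieces carry the ## prefix; exact splits depend on the vocabulary.
print(tokenizer.tokenize("playing"))
print(tokenizer.tokenize("unbelievability"))
```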

SentencePiece

Treats the input as a raw, unsegmented stream - no pre-tokenisation on spaces, since whitespace is encoded like any other symbol. This is crucial for languages like Japanese and Chinese that do not use spaces between words. LLaMA uses SentencePiece; GPT models achieve similar any-text coverage with byte-level BPE instead.
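
A sketch of training your own SentencePiece model with the sentencepiece package - 'corpus.txt' is a placeholder for a text file of your own:

```python
import sentencepiece as spm

# Train a small BPE-style model; 'corpus.txt' is a placeholder path, and
# vocab_size must be small enough for the corpus you provide.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="demo", vocab_size=2000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="demo.model")
# Works on unsegmented text - no spaces required.
print(sp.encode("こんにちは世界", out_type=str))
```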

How GPT-4 Tokenises Text

GPT-4 uses a BPE variant called cl100k_base with roughly 100,000 tokens in its vocabulary. Some surprising behaviours:

  • "Hello world" → 2 tokens (Hello, world - note the space is attached).
  • "indivisibility" → 4 tokens (ind, iv, isibility - it splits rare words).
  • A single emoji 🎉 → often 1–3 tokens.
  • Python code def hello(): → each keyword and symbol is typically its own token.
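
You can verify these yourself with the tiktoken library (introduced properly below); decoding one token at a time can print replacement characters for multi-byte symbols like emoji:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 vocabulary

for text in ["Hello world", "indivisibility", "🎉", "def hello():"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```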
🧠 Quiz

Why do language models use subword tokenisation instead of whole words?

The Vocabulary Size Trade-Off

| Vocabulary size | Pros                             | Cons                                |
|-----------------|----------------------------------|-------------------------------------|
| Small (8k)      | Smaller model, fewer embeddings  | Longer sequences, slower processing |
| Large (100k+)   | Shorter sequences, richer tokens | Larger embedding table, more memory |

Finding the right balance is an engineering decision that affects model speed, memory, and capability.
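
Some back-of-the-envelope numbers make the trade-off concrete; the hidden size of 768 below is an illustrative choice:

```python
def embedding_memory_mb(vocab_size: int, d_model: int, bytes_per_param: int = 4) -> float:
    # One float32 vector of length d_model per vocabulary entry.
    return vocab_size * d_model * bytes_per_param / 1e6

print(embedding_memory_mb(8_000, 768))    # ~24.6 MB embedding table
print(embedding_memory_mb(100_000, 768))  # ~307 MB - over 12x larger
```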

Multilingual Challenges

Tokenisers trained primarily on English text are biased. The same sentence in Hindi or Arabic may require 3–5× more tokens than its English equivalent, because those scripts were underrepresented in training data. This means:

  • Non-English users hit context limits sooner.
  • API costs are higher per word for non-English text.
  • The model has less "thinking space" for non-English reasoning.
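
You can measure the gap with tiktoken - the Hindi sentence below is a rough equivalent of the English one, and the exact ratio depends on the tokeniser:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Hello, how are you today?"
hindi = "नमस्ते, आप आज कैसे हैं?"  # a rough Hindi equivalent

print(len(enc.encode(english)))  # a handful of tokens
print(len(enc.encode(hindi)))    # typically several times more
```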
🧠 Quiz

Why might the same sentence cost more API tokens in Hindi than in English?

Token Counting and Cost Implications

Every API call to GPT-4, Claude, or Gemini is billed per token. Understanding tokenisation helps you:

  • Estimate costs before running large jobs.
  • Optimise prompts - shorter prompts with the same meaning save money.
  • Respect context windows - GPT-4 Turbo accepts 128k tokens; inputs beyond the limit are rejected or cut off, depending on the tool.

A rough rule of thumb for English: 1 token ≈ ¾ of a word, or about 4 characters.
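
That rule of thumb is easy to turn into a quick estimator; the price below is purely illustrative, so check your provider's current rates:

```python
def estimate_tokens(word_count: int) -> int:
    # Rule of thumb for English: 1 token ≈ 3/4 of a word.
    return round(word_count * 4 / 3)

def estimate_cost_usd(word_count: int, usd_per_million_tokens: float) -> float:
    return estimate_tokens(word_count) * usd_per_million_tokens / 1e6

print(estimate_tokens(1_000))          # ≈ 1333 tokens for a 1,000-word essay
print(estimate_cost_usd(1_000, 10.0))  # ≈ $0.013 at an illustrative $10 / 1M tokens
```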

🧠 Quiz

Approximately how many tokens is a 1,000-word English essay?

🤯

OpenAI's open-source tiktoken library lets you tokenise text locally with the exact same algorithm GPT-4 uses. Try it on your own writing to see how many tokens your messages really cost.

🤔
Think about it:

If you were building a language model for a low-resource language like Welsh, how would you approach tokenisation to ensure fair and efficient encoding?

Key Takeaways

  • Tokenisation converts raw text into numerical tokens that models can process.
  • BPE builds a vocabulary by iteratively merging the most frequent character pairs.
  • Subword tokenisation balances vocabulary size with the ability to handle any text.
  • Tokeniser bias disadvantages non-English languages in cost and capability.
  • Understanding tokens helps you estimate costs and optimise prompts.

📚 Further Reading

  • Andrej Karpathy - nn-zero-to-hero (Tokenizer lecture) - Build a BPE tokeniser from scratch alongside Karpathy
  • OpenAI Tokenizer Tool - Interactive tool to see how GPT models tokenise your text
  • Hugging Face - Summary of Tokenizers - Clear comparison of BPE, WordPiece, and SentencePiece