🔤
AI 萌芽 • Intermediate ⏱️ 14 min read

Tokenisation

How AI Reads Text

Neural networks work with numbers. They cannot read the word "hello" the way you do. Before any language model can process text, it must be broken into small numerical pieces called tokens. This seemingly simple step has profound consequences for how AI understands - and misunderstands - language.

Why Can't AI Just Read Characters?

The simplest approach: treat each character as a token. "Hello" becomes ['H', 'e', 'l', 'l', 'o'] - five tokens.

The problem? Words become absurdly long sequences. A 500-word essay might become 2,500+ character tokens. Since Transformer models scale quadratically with sequence length, this is computationally brutal. Worse, individual characters carry almost no meaning - the model must learn that 'c', 'a', 't' together mean a furry animal.
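To make the scale problem concrete, here is a minimal sketch of character-level tokenisation in Python: the vocabulary stays tiny, but every sentence becomes a token sequence as long as the text itself.

```python
# Character-level tokenisation: every distinct character becomes one token.
text = "the cat sat on the mat"

vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}  # tiny vocabulary
tokens = [vocab[ch] for ch in text]

print(len(vocab))   # only a handful of distinct characters
print(len(tokens))  # 22 tokens for a six-word sentence
```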

Word-Level Tokenisation

The opposite extreme: each word is one token. "The cat sat" becomes ['The', 'cat', 'sat'] - compact and meaningful.

But this creates a different problem: the vocabulary explosion. English alone has hundreds of thousands of words. Add misspellings, technical jargon, and code, and the vocabulary becomes unmanageable. Any word not in the vocabulary becomes an unknown [UNK] token - a dead end for understanding.
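A minimal sketch of word-level tokenisation with a small, hypothetical vocabulary shows the dead end directly: every out-of-vocabulary word collapses into the same [UNK] id.

```python
# Word-level tokenisation with a small, hypothetical fixed vocabulary.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(sentence):
    # Any word not in the vocabulary maps to [UNK] - its identity is lost.
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

print(tokenize("the cat sat on the mat"))    # [1, 2, 3, 4, 1, 5]
print(tokenize("the dragon sat on chatgpt")) # [1, 0, 3, 4, 0] - two dead ends
```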

🤔
Think about it:

If a model using word-level tokens encounters "ChatGPT" for the first time and it is not in the vocabulary, it becomes [UNK]. How might this affect the model's ability to discuss new technology?

The Sweet Spot - Subword Tokenisation

Modern language models use subword tokenisation, which sits between characters and words. Common words stay whole ("the", "and"), while rare words are split into meaningful pieces ("un" + "believ" + "able").

This gives us a manageable vocabulary (typically 32,000–100,000 tokens) while handling any text - even words the model has never seen before.

[Image: the word 'unbelievable' split into the subword tokens 'un', 'believ', and 'able'.]
Subword tokenisation splits rare words into reusable pieces while keeping common words whole.

Byte Pair Encoding (BPE) - Step by Step

BPE is the algorithm behind GPT models. Here is how it builds a vocabulary:

  1. Start with individual characters: {'h', 'e', 'l', 'o', 'w', 'r', 'd', ' '}.
  2. Count which pairs of adjacent tokens appear most frequently in the training text.
  3. Merge the most frequent pair into a new token. If 'l' + 'o' appears most, create 'lo'.
  4. Repeat steps 2–3 until you reach the desired vocabulary size.

Worked example with the text "low lower lowest":

| Step | Most frequent pair | New token | Vocabulary grows |
|------|--------------------|-----------|------------------|
| 1    | l + o              | lo        | ...lo...         |
| 2    | lo + w             | low       | ...low...        |
| 3    | e + r              | er        | ...er...         |
| 4    | low + e            | lowe      | ...lowe...       |

After enough merges, common words and word fragments emerge naturally from the data.
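The merge loop itself fits in a few lines of Python. This is a simplified sketch that merges pairs within a single string (real BPE trainers count pairs across word frequencies), but the count-then-merge idea is the same.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(text)                            # step 1: start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))  # step 2: count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]         # step 3: pick the most frequent pair
        merges.append(a + b)
        merged, i = [], 0                           # replace every occurrence of the pair
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged                            # step 4: repeat
    return symbols, merges

tokens, merges = bpe_train("low lower lowest", num_merges=4)
print(merges)   # vocabulary entries created by each merge, e.g. 'lo', then 'low', ...
print(tokens)   # the corpus re-tokenised with the learned merges
```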

🤯

BPE was originally invented in 1994 as a data compression algorithm. It was repurposed for NLP in 2015 by Sennrich et al. - a beautiful example of ideas crossing disciplines.

Other Tokenisation Methods

WordPiece

Used by BERT and related models. Similar to BPE, but instead of merging the most frequent pair, it merges the pair that maximises the likelihood of the training data. Subword pieces are prefixed with ## (e.g., "playing" → ['play', '##ing']).

SentencePiece

Treats the input as a raw stream of characters, with no pre-tokenisation by spaces. This is crucial for languages like Japanese and Chinese that do not use spaces between words. LLaMA and many multilingual models use SentencePiece-style tokenisers.

How GPT-4 Tokenises Text

GPT-4 uses a BPE variant called cl100k_base with roughly 100,000 tokens in its vocabulary. Some surprising behaviours:

  • "Hello world" → 2 tokens (Hello, world - note the space is attached).
  • "indivisibility" → 4 tokens (ind, iv, isibility - it splits rare words).
  • A single emoji 🎉 → often 1–3 tokens.
  • Python code def hello(): → each keyword and symbol is typically its own token.
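You can check these behaviours yourself with the tiktoken library mentioned later in this lesson. A minimal sketch (exact splits depend on the encoding version):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 vocabulary

for text in ["Hello world", "indivisibility", "def hello():", "🎉"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```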
🧠 Quiz

Why do language models use subword tokenisation instead of whole words?

The Vocabulary Size Trade-Off

| Vocabulary size | Pros                             | Cons                                |
|-----------------|----------------------------------|-------------------------------------|
| Small (8k)      | Smaller model, fewer embeddings  | Longer sequences, slower processing |
| Large (100k+)   | Shorter sequences, richer tokens | Larger embedding table, more memory |

Finding the right balance is an engineering decision that affects model speed, memory, and capability.
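To see why the embedding table matters, here is a back-of-the-envelope sketch; the hidden size and fp16 precision are illustrative assumptions, not values from any particular model.

```python
# Memory used by the token embedding table alone: vocab_size × d_model parameters.
d_model = 4096          # hypothetical hidden size
bytes_per_param = 2     # fp16

for vocab_size in (8_000, 32_000, 100_000):
    params = vocab_size * d_model
    print(f"vocab {vocab_size:>7,}: {params:>11,} params, "
          f"{params * bytes_per_param / 1e6:,.0f} MB in fp16")
```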

Multilingual Challenges

Tokenisers trained primarily on English text are biased. The same sentence in Hindi or Arabic may require 3–5× more tokens than its English equivalent, because those scripts were underrepresented in training data. This means:

  • Non-English users hit context limits sooner.
  • API costs are higher per word for non-English text.
  • The model has less "thinking space" for non-English reasoning.
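You can measure this bias directly by counting tokens for the same sentence in two languages. A sketch using tiktoken; the Hindi line is an approximate translation, and the exact ratio depends on the text:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "Artificial intelligence is changing how we learn.",
    "Hindi":   "कृत्रिम बुद्धिमत्ता हमारे सीखने के तरीके को बदल रही है।",  # approximate translation
}

for language, sentence in sentences.items():
    n_tokens = len(enc.encode(sentence))
    print(f"{language}: {len(sentence.split())} words -> {n_tokens} tokens")
```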
🧠 Quiz

Why might the same sentence cost more API tokens in Hindi than in English?

Token Counting and Cost Implications

Every API call to GPT-4, Claude, or Gemini is billed per token. Understanding tokenisation helps you:

  • Estimate costs before running large jobs.
  • Optimise prompts - shorter prompts with the same meaning save money.
  • Respect context windows - GPT-4 Turbo accepts up to 128k tokens; anything beyond the limit cannot be processed.

A rough rule of thumb for English: 1 token ≈ ¾ of a word, or about 4 characters.
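Here is how that rule of thumb turns into a quick cost estimate (the per-token price below is a placeholder, not a real rate - check your provider's current pricing):

```python
# Rough cost estimate from the "1 token ≈ 0.75 words" rule of thumb.
def estimate_tokens(word_count):
    return round(word_count / 0.75)

price_per_1k_tokens = 0.01   # placeholder price in dollars, not a real rate

essay_words = 1_000
tokens = estimate_tokens(essay_words)
print(f"{essay_words} words ≈ {tokens} tokens "
      f"≈ ${tokens / 1000 * price_per_1k_tokens:.4f} per call")
```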

🧠 Quiz

Approximately how many tokens is a 1,000-word English essay?

🤯

OpenAI's open-source tiktoken library lets you tokenise text locally with the exact same algorithm GPT-4 uses. Try it on your own writing to see how many tokens your messages really cost.

🤔
Think about it:

If you were building a language model for a low-resource language like Welsh, how would you approach tokenisation to ensure fair and efficient encoding?

Key Takeaways

  • Tokenisation converts raw text into numerical tokens that models can process.
  • BPE builds a vocabulary by iteratively merging the most frequent character pairs.
  • Subword tokenisation balances vocabulary size with the ability to handle any text.
  • Tokeniser bias disadvantages non-English languages in cost and capability.
  • Understanding tokens helps you estimate costs and optimise prompts.

📚 Further Reading

  • Andrej Karpathy - nn-zero-to-hero (Tokenizer lecture) - Build a BPE tokeniser from scratch alongside Karpathy
  • OpenAI Tokenizer Tool - Interactive tool to see how GPT models tokenise your text
  • Hugging Face - Summary of Tokenizers - Clear comparison of BPE, WordPiece, and SentencePiece