Neural networks work with numbers. They cannot read the word "hello" the way you do. Before any language model can process text, it must be broken into small numerical pieces called tokens. This seemingly simple step has profound consequences for how AI understands - and misunderstands - language.
The simplest approach: treat each character as a token. "Hello" becomes ['H', 'e', 'l', 'l', 'o'] - five tokens.
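Character-level tokenisation is one line of Python, which also makes the sequence-length problem easy to see:

```python
# Character-level tokenisation: every character becomes one token.
text = "Hello"
tokens = list(text)
print(tokens)  # ['H', 'e', 'l', 'l', 'o']

# A 500-word text balloons this way: here, 500 five-character
# "words" (including trailing spaces) become 2,500 tokens.
essay = "word " * 500
print(len(list(essay)))  # 2500
```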
The problem? Words become absurdly long sequences. A 500-word essay might become 2,500+ character tokens. Since Transformer models scale quadratically with sequence length, this is computationally brutal. Worse, individual characters carry almost no meaning - the model must learn that 'c', 'a', 't' together mean a furry animal.
The opposite extreme: each word is one token. "The cat sat" becomes ['The', 'cat', 'sat'] - compact and meaningful.
But this creates a different problem: the vocabulary explosion. English alone has hundreds of thousands of words. Add misspellings, technical jargon, and code, and the vocabulary becomes unmanageable. Any word not in the vocabulary becomes an unknown [UNK] token - a dead end for understanding.
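The [UNK] dead end is easy to demonstrate with a toy word-level tokeniser (the six-word vocabulary here is invented purely for illustration):

```python
# Word-level tokenisation with a fixed vocabulary: any word not in
# the vocabulary collapses to the unknown token [UNK].
vocab = {"The", "cat", "sat", "on", "the", "mat"}

def word_tokenise(text):
    return [w if w in vocab else "[UNK]" for w in text.split()]

print(word_tokenise("The cat sat"))           # ['The', 'cat', 'sat']
print(word_tokenise("The cat used ChatGPT"))  # ['The', 'cat', '[UNK]', '[UNK]']
```

Whatever "ChatGPT" meant in the input, the model sees only [UNK] and the information is gone.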
If a model using word-level tokens encounters "ChatGPT" for the first time and it is not in the vocabulary, it becomes [UNK]. How might this affect the model's ability to discuss new technology?
Modern language models use subword tokenisation, which sits between characters and words. Common words stay whole ("the", "and"), while rare words are split into meaningful pieces ("un" + "believ" + "able").
This gives us a manageable vocabulary (typically 32,000–100,000 tokens) while handling any text - even words the model has never seen before.
Byte-Pair Encoding (BPE) is the algorithm behind GPT models. Here is how it builds a vocabulary:
Start with a vocabulary of individual characters, then repeatedly merge the most frequent adjacent pair into a new token. Worked example with the text "low lower lowest", whose initial character vocabulary is {'l', 'o', 'w', 'e', 'r', 's', 't', ' '}:
| Step | Most frequent pair | New token |
|------|--------------------|-----------|
| 1 | l + o | lo |
| 2 | lo + w | low |
| 3 | low + e | lowe |
| 4 | lowe + r | lower |

(At step 3, "low + e" wins because it appears in both "lower" and "lowest", while "e + r" appears only once.)
After enough merges, common words and word fragments emerge naturally from the data.
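The merge loop above can be sketched in a few lines of Python. This is a teaching sketch, not an optimised trainer; ties between equally frequent pairs are broken by first occurrence:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a word-frequency dict (word -> count)."""
    # Represent each word as a tuple of symbols (single characters to start).
    corpus = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, freq in corpus.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for sym, freq in corpus.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

print(bpe_merges({"low": 1, "lower": 1, "lowest": 1}, 4))
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 'r')]
```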
BPE was originally invented in 1994 as a data compression algorithm. It was repurposed for NLP in 2015 by Sennrich et al. - a beautiful example of ideas crossing disciplines.
WordPiece is used by BERT and related models. It is similar to BPE, but instead of merging the most frequent pair, it merges the pair that maximises the likelihood of the training data. Subword pieces that continue a word are prefixed with ## (e.g., "playing" → ['play', '##ing']).
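Once a WordPiece vocabulary exists, segmenting a word is typically done by greedy longest-match-first lookup. A minimal sketch, with a tiny made-up vocabulary:

```python
# Greedy longest-match-first segmentation against a WordPiece-style
# vocabulary. The vocabulary here is invented for illustration.
vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}

def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get the ## prefix
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
    return pieces

print(wordpiece("playing", vocab))       # ['play', '##ing']
print(wordpiece("unbelievable", vocab))  # ['un', '##believ', '##able']
```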
SentencePiece treats the input as a raw stream with no pre-tokenisation by spaces. This is crucial for languages like Japanese and Chinese that do not use spaces between words. LLaMA uses SentencePiece; GPT models use a similarly space-free byte-level BPE.
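Working at the byte level means every possible input reduces to the same 256 base symbols before any merging, so no script needs special handling. The difference between characters and bytes is visible in plain Python:

```python
# Byte-level view: text becomes a stream of UTF-8 bytes before any
# merging, so no language needs space-separated words.
text = "こんにちは"          # "hello" in Japanese: 5 characters
raw = text.encode("utf-8")   # ...but 15 bytes (3 bytes per character here)
print(len(text), len(raw))   # 5 15
```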
GPT-4 uses a BPE variant called cl100k_base with roughly 100,000 tokens in its vocabulary. Some surprising behaviours:
- "Hello, world" tokenises as ["Hello", ",", " world"] - note the space is attached to "world".
- Rare words are split into pieces: "indivisibility" → ["ind", "iv", "isibility"].
- In code such as `def hello():`, each keyword and symbol is typically its own token.

Why do language models use subword tokenisation instead of whole words?
| Vocabulary size | Pros | Cons |
|-----------------|------|------|
| Small (8k) | Smaller model, fewer embeddings | Longer sequences, slower processing |
| Large (100k+) | Shorter sequences, richer tokens | Larger embedding table, more memory |
Finding the right balance is an engineering decision that affects model speed, memory, and capability.
Tokenisers trained primarily on English text are biased. The same sentence in Hindi or Arabic may require 3–5× more tokens than its English equivalent, because those scripts were underrepresented in the training data. This means speakers of those languages pay more per API call and exhaust the model's context window sooner.
Why might the same sentence cost more API tokens in Hindi than in English?
Every API call to GPT-4, Claude, or Gemini is billed per token. Understanding tokenisation helps you estimate and control costs before you send a request.
A rough rule of thumb for English: 1 token ≈ ¾ of a word, or about 4 characters.
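That rule of thumb turns into a one-line estimator (a heuristic for English prose only, not a substitute for a real tokeniser):

```python
def estimate_tokens(word_count):
    # 1 token ≈ 3/4 of a word, so tokens ≈ words × 4 / 3.
    return round(word_count * 4 / 3)

print(estimate_tokens(500))  # 667
```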
Approximately how many tokens is a 1,000-word English essay?
OpenAI's open-source tiktoken library lets you tokenise text locally with the same algorithm GPT-4 uses. Try it on your own writing to see how many tokens your messages really cost.
If you were building a language model for a low-resource language like Welsh, how would you approach tokenisation to ensure fair and efficient encoding?