After tokenisation, each token is just a number - an index in a vocabulary. But index 4,821 tells the model nothing about meaning. How does AI know that "king" and "queen" are related, or that "bank" can mean a riverbank or a financial institution? The answer is embeddings.
The naive approach, known as one-hot encoding, represents each word as a vector with a single 1 and thousands of 0s. "Cat" might be [0, 0, 1, 0, ..., 0] and "dog" [0, 0, 0, 1, ..., 0].
This has two fatal flaws:

- The vectors are huge and sparse: a 50,000-word vocabulary means 50,000-dimensional vectors that are almost entirely zeros.
- Every pair of words is equally far apart, so the representation encodes no similarity at all: "cat" is no closer to "dog" than to any other word.
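The second flaw is easy to see in code. A minimal NumPy sketch with a toy five-word vocabulary shows that every pair of distinct one-hot vectors sits at exactly the same distance:

```python
import numpy as np

# Toy vocabulary of 5 words (a real one would have tens of thousands).
vocab = ["cat", "dog", "king", "queen", "bank"]

# One-hot encoding: an identity matrix, one row per word.
one_hot = np.eye(len(vocab))

def distance(w1, w2):
    """Euclidean distance between the one-hot vectors of two words."""
    return np.linalg.norm(one_hot[vocab.index(w1)] - one_hot[vocab.index(w2)])

print(distance("cat", "dog"))   # sqrt(2) ≈ 1.414
print(distance("cat", "king"))  # sqrt(2) ≈ 1.414 - identical, no similarity signal
```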
An embedding maps each token to a dense vector of, say, 256 or 768 dimensions. Unlike one-hot vectors, these dimensions are learned during training and encode meaning.
Words used in similar contexts end up close together in this space. "Puppy" lands near "kitten." "London" lands near "Paris." The geometry of the space is the meaning.
The 2013 Word2Vec paper showed something remarkable. Trained on large text corpora, the learned vectors exhibit arithmetic relationships:
vector("king") − vector("man") + vector("woman") ≈ vector("queen")
The direction from "man" to "woman" captures the concept of gender. Adding it to "king" moves to "queen." This is not programmed - it emerges from patterns in language.
Other examples: Paris − France + Italy ≈ Rome, bigger − big + small ≈ smaller.
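The arithmetic can be sketched with hand-crafted toy vectors. The three dimensions and their values below are invented for illustration (real Word2Vec vectors have hundreds of learned dimensions), but the mechanics are the same:

```python
import numpy as np

# Invented 3-dimensional toy vectors: dim 0 ≈ "royalty", dim 1 ≈ "gender".
# These are NOT real Word2Vec outputs - just an illustration of the idea.
vec = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
}

# king - man + woman: remove "male", add "female", keep "royal".
result = vec["king"] - vec["man"] + vec["woman"]

# The vocabulary word closest to the result:
closest = min(vec, key=lambda w: np.linalg.norm(vec[w] - result))
print(closest)  # queen
```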
Word2Vec was created by Tomáš Mikolov at Google in 2013. The paper has over 40,000 citations and is considered one of the most influential NLP papers ever published. It demonstrated that a simple neural network trained on raw text could learn surprisingly rich semantic relationships.
Modern models use different embedding sizes:
| Model | Embedding dimensions |
|-------|---------------------|
| Word2Vec | 100–300 |
| BERT | 768 |
| GPT-3 | 12,288 |
| OpenAI text-embedding-3-large | 3,072 |
More dimensions capture finer distinctions but require more memory and compute. Think of it like describing a person: 3 dimensions (height, weight, age) give a rough sketch; 768 dimensions paint a detailed portrait.
What does the famous equation 'king − man + woman ≈ queen' demonstrate?
Word embeddings represent individual words, but we often need to compare entire sentences or documents. Sentence embeddings (from models like Sentence-BERT or OpenAI's embedding API) compress a whole passage into a single vector.
"How do I reset my password?" and "I forgot my login credentials" would have very similar sentence embeddings, even though they share almost no words. The embedding captures intent, not just vocabulary.
To compare two embeddings, we use cosine similarity - the cosine of the angle between two vectors. It ranges from −1 (opposite) to +1 (identical direction).
Cosine similarity ignores vector magnitude, focusing purely on direction - which is where meaning lives.
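Cosine similarity is a one-liner in NumPy: the dot product divided by the product of the magnitudes. A minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ranges from -1 to +1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))        # 1.0  (identical direction)
print(cosine_similarity(a, 10 * a))   # 1.0  (magnitude is ignored)
print(cosine_similarity(a, -a))       # -1.0 (opposite direction)
```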
"Love" and "hate" are opposites in meaning but might have moderate cosine similarity because they appear in similar contexts (emotions, relationships). What does this tell us about the limitations of embeddings trained purely on word co-occurrence?
A vector database stores millions of embeddings and retrieves the most similar ones blazingly fast. Instead of keyword matching ("find documents containing 'machine learning'"), you search by meaning ("find documents about AI education").
Popular vector databases include:

- Pinecone (fully managed cloud service)
- Weaviate (open source, with hybrid keyword + vector search)
- Milvus (open source, built for billion-scale collections)
- Qdrant (open source, written in Rust)
- Chroma (lightweight, popular for prototyping)
- pgvector (an extension that adds vector search to PostgreSQL)
These databases use algorithms like HNSW (Hierarchical Navigable Small World) to search billions of vectors in milliseconds.
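A brute-force version of vector search fits in a few lines of NumPy. Real vector databases replace this exact scan with approximate indexes like HNSW, but the idea - rank stored vectors by similarity to a query - is the same. The data here is random, purely for illustration:

```python
import numpy as np

# A toy "vector store": 10,000 random unit vectors of 256 dimensions.
rng = np.random.default_rng(0)
store = rng.standard_normal((10_000, 256))
store /= np.linalg.norm(store, axis=1, keepdims=True)

def top_k(query, k=5):
    """Return indices of the k stored vectors most similar to the query."""
    query = query / np.linalg.norm(query)
    scores = store @ query               # cosine similarity via dot product
    return np.argsort(scores)[::-1][:k]  # highest scores first

# A query near stored vector 42 should return 42 as the best match.
query = store[42] + 0.01 * rng.standard_normal(256)
print(top_k(query))
```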
What advantage does vector search have over traditional keyword search?
RAG (Retrieval-Augmented Generation) is one of the most important patterns in modern AI. It combines vector search with language models:

1. Split your documents into chunks and embed each chunk.
2. Store the chunk embeddings in a vector database.
3. At query time, embed the user's question and retrieve the most similar chunks.
4. Add the retrieved chunks to the prompt so the language model answers using that context.
RAG lets language models answer questions about your specific data - company documents, product catalogues, research papers - without retraining. It dramatically reduces hallucination because the model has real sources to reference.
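The retrieval half of RAG can be sketched end to end. The `embed()` function below is a deliberately crude stand-in (a hashed bag of words) so the example runs without any model; a real system would call an embedding model instead:

```python
import zlib

import numpy as np

def embed(text, dim=512):
    """Toy embedding: hash each word into a bucket, then normalise.
    A stand-in for a real embedding model, for illustration only."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[zlib.crc32(word.encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

docs = [
    "To reset your password, open Settings and choose Security.",
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
]
doc_vecs = np.array([embed(d) for d in docs])

def retrieve(question, k=1):
    """Return the k documents most similar to the question."""
    scores = doc_vecs @ embed(question)          # cosine similarities
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

question = "I forgot my password, how do I reset it?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would now be sent to a language model for the final answer.
```

Note the toy `embed()` still matches on shared words, so it cannot capture the "intent, not vocabulary" behaviour of real sentence embeddings - swapping in a genuine embedding model is what makes the pipeline robust to paraphrases.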
In a RAG system, what role does the vector database play?
Embeddings power countless real-world systems:
Spotify uses audio embeddings to recommend songs. Each track is embedded based on its acoustic features, and recommendations come from finding nearby vectors - songs that "sound similar" in embedding space.
If you embedded every product in an online shop, how could you build a recommendation system that says "customers who viewed this item might also like..." without relying on purchase history?