A Large Language Model (LLM) is a neural network trained on massive amounts of text to understand and generate human language. The word "large" refers to scale along several dimensions:
Scale of modern LLMs:
┌────────────────┬──────────────────────────────────────┐
│ Parameters     │ Billions (7B → 400B+)                │
│ Training data  │ Trillions of tokens                  │
│ Training cost  │ Millions of dollars                  │
│ Training time  │ Weeks to months on thousands of GPUs │
└────────────────┴──────────────────────────────────────┘
At its core, an LLM does one thing: predict the next token. Given "The cat sat on the", it predicts "mat" (or "roof" or "couch") with different probabilities. This simple objective, at massive scale, produces remarkably intelligent behaviour.
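The two common ways of turning those probabilities into an actual output can be sketched in a few lines. The probabilities below are made up for illustration:

```python
import random

# Hypothetical next-token probabilities for "The cat sat on the"
next_token_probs = {"mat": 0.55, "roof": 0.20, "couch": 0.15, "floor": 0.10}

# Greedy decoding: always pick the single most likely token
greedy = max(next_token_probs, key=next_token_probs.get)
print(greedy)  # mat

# Sampling: pick tokens in proportion to their probabilities,
# which is what gives chat models their variety
sampled = random.choices(
    list(next_token_probs), weights=list(next_token_probs.values())
)[0]
print(sampled)  # usually "mat", occasionally "roof", "couch", or "floor"
```

Real models do this over a vocabulary of ~100,000 tokens, once per generated token.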
Imagine reading every book, article, and website ever written: billions of pages. After all that reading, you'd be pretty good at predicting what word comes next in any sentence. That's essentially what an LLM does, but with mathematical precision.
Every modern LLM is built on the Transformer architecture (from the 2017 paper "Attention Is All You Need"). The key innovation: self-attention.
Traditional models read text sequentially: word by word, left to right. Transformers read everything at once and figure out which words are relevant to each other.
Sentence: "The bank by the river was steep"
For the word "bank", attention scores might be:
"bank" โโ "river" = 0.45 (high โ helps clarify meaning)
"bank" โโ "steep" = 0.30 (medium โ supports the "riverbank" meaning)
"bank" โโ "The" = 0.05 (low โ not very informative)
This lets the model understand context: "bank" near "river" means a riverbank, not a financial bank.
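Those attention weights come from a softmax over raw relevance scores. A minimal sketch, using made-up scores echoing the example above:

```python
import math

words = ["The", "bank", "river", "steep"]
scores = [0.5, 2.0, 4.0, 3.5]  # made-up raw relevance of each word to "bank"

# Softmax turns raw scores into attention weights that are positive
# and sum to 1
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

for word, w in zip(words, weights):
    print(f'"bank" <-> "{word}": {w:.2f}')
```

In a real Transformer the scores themselves are computed from learned query and key vectors, and the weights are used to mix the words' value vectors into a new context-aware representation of "bank".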
Transformers have multiple attention heads running in parallel, each learning different relationships (grammar, meaning, coreference). GPT-3 uses 96 attention heads in each of its 96 layers, and every one of those relationships is self-discovered during training, with no human guidance.
Each Transformer layer follows: Self-Attention → Add + Normalise → Feed-Forward → Add + Normalise. Stack 50–100+ of these blocks and you have a modern LLM.
The "Add + Normalise" steps are skip connections, the same trick from ResNet! They keep gradients healthy across dozens of layers, making deep Transformers trainable.
Training an LLM happens in three distinct phases:
Phase 1 is pre-training. The model reads trillions of tokens from books, websites, and code. It learns grammar, facts, reasoning patterns, and even some world knowledge, all from predicting the next token.
Input: "The capital of France is ___"
Target: "Paris"
Input: "def fibonacci(n):\n if n <= 1:\n return ___"
Target: "n"
Cost: Millions of dollars and weeks of GPU time. This is the expensive step.
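Under the hood, "predict the next token" is a classification problem, trained with cross-entropy loss. A minimal sketch with made-up probabilities for the first example above:

```python
import math

# Model's predicted probabilities for the blank in "The capital of France is ___"
probs = {"Paris": 0.80, "Lyon": 0.12, "Berlin": 0.08}
target = "Paris"

# Cross-entropy loss: -log(probability assigned to the correct token).
# A perfect prediction (probability 1.0) gives loss 0; confident wrong
# predictions give a very large loss.
loss = -math.log(probs[target])
print(f"{loss:.3f}")  # 0.223
```

Pre-training is just this loss, averaged over trillions of tokens and driven down by gradient descent.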
Phase 2 is supervised fine-tuning. The base model is great at completing text but terrible at following instructions, so it is trained further on curated question-answer pairs:
User: "Summarise this article in three bullet points."
Assistant: "• Point one...\n• Point two...\n• Point three..."
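Fine-tuning datasets are commonly stored as JSON Lines, one conversation per line, in the chat format that the major APIs accept. A sketch of one record (the content is just the example above):

```python
import json

# One supervised fine-tuning example in the common chat JSONL shape
example = {
    "messages": [
        {"role": "user",
         "content": "Summarise this article in three bullet points."},
        {"role": "assistant",
         "content": "• Point one...\n• Point two...\n• Point three..."},
    ]
}

# One JSON object per line; a training file is thousands of these
line = json.dumps(example)
print(line[:50] + "...")
```

The model is trained to reproduce the assistant turns given everything before them, which is still next-token prediction, just on carefully chosen text.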
Phase 3 is Reinforcement Learning from Human Feedback (RLHF), which teaches the model what humans consider helpful, harmless, and honest.
Prompt: "How do I pick a lock?"
Response A: "Here are detailed instructions..." → Ranked lower
Response B: "I can't help with that because..." → Ranked higher
The model learns: safety and helpfulness matter.
RLHF is what makes the difference between a model that just completes text and one that feels like a helpful assistant. It aligns the model with human values, but it's not perfect, which is why AI safety research remains critical.
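In practice, human rankings like the one above are typically used to train a reward model with a pairwise preference loss: the preferred response should score higher than the rejected one. A minimal sketch of that loss (Bradley-Terry style), with made-up reward scores:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(r_chosen - r_rejected)): small when the reward model
    # already scores the human-preferred response higher
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for responses B (preferred) and A
loss_agree = preference_loss(2.0, -1.0)     # model agrees with the ranking
loss_disagree = preference_loss(-1.0, 2.0)  # model disagrees: much larger loss
print(f"{loss_agree:.3f} vs {loss_disagree:.3f}")
```

The LLM is then tuned with reinforcement learning to produce responses the reward model scores highly.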
The LLM landscape evolves rapidly. Here are the major players:
┌──────────────┬────────────┬─────────────────────────────────┐
│ Model Family │ Creator    │ Key Characteristics             │
├──────────────┼────────────┼─────────────────────────────────┤
│ GPT-4/4o     │ OpenAI     │ Strong general reasoning,       │
│              │            │ multimodal (text + images)      │
├──────────────┼────────────┼─────────────────────────────────┤
│ Claude       │ Anthropic  │ Safety-focused, long context,   │
│              │            │ strong at analysis and coding   │
├──────────────┼────────────┼─────────────────────────────────┤
│ Llama        │ Meta       │ Open-weight, community-driven,  │
│              │            │ can run locally                 │
├──────────────┼────────────┼─────────────────────────────────┤
│ Gemini       │ Google     │ Multimodal-native, integrated   │
│              │            │ with Google services            │
├──────────────┼────────────┼─────────────────────────────────┤
│ Mistral      │ Mistral AI │ Efficient, European-made,       │
│              │            │ strong for its size             │
└──────────────┴────────────┴─────────────────────────────────┘
No single model is "best" at everything. The right choice depends on your task, budget, privacy requirements, and whether you need to run the model locally.
LLMs are like brilliant but unreliable interns. They can draft amazing work, but you should always fact-check their output. Trust, but verify, especially for anything critical.
LLMs don't read characters or words; they read tokens. A token is roughly 3-4 characters, or about ¾ of a word.
"Hello, how are you today?" → ["Hello", ",", " how", " are", " you", " today", "?"]
= 7 tokens
Rule of thumb:
100 tokens ≈ 75 words
1,000 tokens ≈ 750 words ≈ 1.5 pages
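Those ratios give a quick back-of-the-envelope estimator. This is an approximation only; a real tokeniser library (such as tiktoken for OpenAI models) gives exact counts:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: ~4 characters per token (equivalently, ~0.75 words/token)
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, how are you today?"))  # 6 (true count above: 7)
```

Close enough for budgeting, but always use the model's own tokeniser when the exact count matters, e.g. when packing a context window.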
The context window is the maximum tokens an LLM can process at once (input + output). GPT-4o supports 128K tokens (~300 pages), Claude handles 200K (~500 pages). API pricing is per token, so understanding token economics is essential:
# Rough cost estimation
input_tokens = 1000 # Your prompt
output_tokens = 500 # Model's response
price_per_1k = 0.01 # Varies by model and provider
cost = ((input_tokens + output_tokens) / 1000) * price_per_1k
print(f"Cost per request: ${cost:.4f}") # $0.0150
print(f"Cost for 10,000 requests: ${cost * 10000:.2f}") # $150.00
Here's how to call an LLM API in Python. This pattern works similarly across providers:
from openai import OpenAI

# Initialise the client (API key from environment variable)
client = OpenAI()

# Send a request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful science tutor for teenagers."
        },
        {
            "role": "user",
            "content": "Explain photosynthesis in simple terms."
        }
    ],
    temperature=0.7,  # 0 = deterministic, 1 = creative
    max_tokens=300    # Limit response length
)

# Extract the reply
answer = response.choices[0].message.content
print(answer)

# Check token usage
usage = response.usage
print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")
Key parameters explained:
temperature: Controls randomness (0 = focused, 1 = creative)
max_tokens: Limits response length (saves cost)
messages: The conversation history (system + user + assistant turns)
model: Which LLM to use
The messages array is your conversation history. The model doesn't "remember" previous conversations; you send the full context every time. This is why context windows matter: longer conversations cost more tokens.
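Because the model is stateless, a multi-turn chat means appending every reply to the list and resending the whole thing. A sketch of how the history grows (the actual API call is elided; the assistant texts here are placeholders):

```python
messages = [{"role": "system", "content": "You are a helpful science tutor."}]

def add_turn(messages, user_text, assistant_text):
    # Each round trip appends both sides of the exchange; the FULL list
    # is sent to the API on every call, so cost grows with history length
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": assistant_text})
    return messages

add_turn(messages, "Explain photosynthesis.", "Plants turn light into sugar...")
add_turn(messages, "Shorter, please.", "Light in, sugar out.")
print(len(messages))  # 5 messages are now resent with every new request
```

This is why long-running chats get slower and more expensive, and why applications often summarise or truncate old turns.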
You now understand the engines, but knowing how to drive them is equally important. In the next lesson, we'll master prompt engineering: the art and science of getting the best results from LLMs through carefully crafted instructions. ✨