Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.
Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.
A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.
message = "Hello, World!"
Each character has a position (index), just like an array:
index:  0  1  2  3  4  5  6  7  8  9 10 11 12
char:   H  e  l  l  o  ,  ␣  W  o  r  l  d  !

(␣ marks the space character at index 6)
A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
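In Python, for example, single characters come from indexing and substrings from slice notation; a minimal sketch (note that the end index of a slice is exclusive):

```python
message = "Hello, World!"

# Indexing gives a single character; slicing gives a substring.
first_char = message[0]   # "H"
word = message[7:12]      # "World" - indices 7 to 11; end index 12 is exclusive

print(first_char, word)
```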
In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
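A quick Python sketch of the performance point: concatenating in a loop can copy the whole string on each step, while collecting pieces in a list and joining once allocates the final string a single time.

```python
# Fast: accumulate pieces, then join once at the end.
pieces = []
for _ in range(1000):
    pieces.append("token ")
text_fast = "".join(pieces)

# Slow in the worst case: `+=` may build a brand-new string each iteration,
# making the loop quadratic in the total length.
text_slow = ""
for _ in range(1000):
    text_slow += "token "
```

Some runtimes optimise the `+=` pattern in special cases, but `join` is the idiom you can rely on.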
The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.
AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.
Input: "unhappiness"
Tokens: ["un", "happiness"]
Input: "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]
Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
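A toy greedy longest-match tokeniser illustrates the idea. The vocabulary here is invented for the example; real systems such as BPE learn their sub-pieces from data and use more sophisticated merge rules.

```python
def subword_tokenise(word, vocab):
    """Greedily match the longest known sub-piece at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown piece: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "happi", "happiness", "ness"}
print(subword_tokenise("unhappiness", vocab))  # ['un', 'happiness']
```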
When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?
A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?
Slide the pattern along the text, checking character by character:
text: "the cat sat on the mat"
pattern: "cat"
Position 0: "the" → no match
Position 1: "he " → no match
Position 2: "e c" → no match
Position 3: " ca" → no match
Position 4: "cat" → match found at index 4!
This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
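The sliding comparison above can be written directly; a straightforward Python sketch:

```python
def naive_search(text, pattern):
    """Return the index of the first match, or -1 if absent. O(n * m) worst case."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        # Compare the pattern against the m characters starting here.
        if text[start:start + m] == pattern:
            return start
    return -1

print(naive_search("the cat sat on the mat", "cat"))  # 4
```

Smarter algorithms such as Knuth-Morris-Pratt and Boyer-Moore avoid re-checking characters they have already seen, bringing search down to roughly O(n + m).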
Why is naive pattern matching slow for very large texts?
Reversing a string is simple but reveals how you think about data:
reverse("hello") → "olleh"
AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.
A palindrome reads the same forwards and backwards: "racecar", "madam", "level".
def is_palindrome(text):
    return text == text[::-1]  # text[::-1] is the reversed string
Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".
The elegant solution? Count character frequencies using a hash map:
from collections import Counter

def are_anagrams(word1, word2):
    return Counter(word1) == Counter(word2)  # compare character frequencies
This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).
Which approach most efficiently checks if two words are anagrams?
Regular expressions (regex) let you describe patterns rather than exact text:
| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |
AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
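A small Python sketch of that cleaning step, using patterns similar to those in the table (the tag-stripping pattern is a simplification for illustration, not a full HTML parser):

```python
import re

raw = "<p>Contact alice@example.com, 42 new messages!</p>"

no_html = re.sub(r"<[^>]+>", "", raw)              # strip simple HTML tags
numbers = re.findall(r"\d+", no_html)              # ['42']
emails = re.findall(r"\b\w+@\w+\.\w+\b", no_html)  # ['alice@example.com']

print(no_html)
print(numbers, emails)
```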
Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.
Here's a simplified view of how an AI processes text:

raw text → clean (regex, normalisation) → tokenise → convert tokens to numbers → model

Every step involves string operations - slicing, searching, replacing, and splitting.
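Those stages can be combined into a minimal pipeline sketch. The vocabulary and token IDs here are invented for illustration, and a plain whitespace split stands in for real subword tokenisation:

```python
import re

def pipeline(raw):
    # 1. Clean: strip simple tags, lower-case, collapse whitespace.
    text = re.sub(r"<[^>]+>", "", raw).lower()
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Tokenise: naive whitespace split (real systems use subword tokens).
    tokens = text.split(" ")
    # 3. Convert to numbers: map each token to an ID; unknown tokens get 0.
    vocab = {"the": 1, "cat": 2, "sat": 3}
    return [vocab.get(tok, 0) for tok in tokens]

print(pipeline("<b>The cat</b> sat down"))  # [1, 2, 3, 0]
```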
When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?
OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!
In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?