AI Educademy
📝
AI Sketch • Intermediate • ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
A string broken into individual characters with index positions shown beneath each one
Under the hood, a string is an array of characters - each with its own index.
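In Python, for instance, reading a character by its index is a single expression (a minimal sketch of the diagram above):

```python
message = "Hello, World!"

# Indexing retrieves one character; positions start at 0
first = message[0]    # "H"
eighth = message[7]   # "W"

# len() reports the total number of characters
length = len(message)  # 13

print(first, eighth, length)
```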

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
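In Python this operation is called slicing. Note that the end index is exclusive, so indices 7 to 11 inclusive are written as `7:12` (a small sketch):

```python
text = "Hello, World!"

# Slice from index 7 (inclusive) to 12 (exclusive)
word = text[7:12]
print(word)  # "World"

# find() returns the starting index of a substring, or -1 if absent
position = text.find("World")
print(position)  # 7
```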

Immutability

In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
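Python strings behave this way: assigning to an index raises an error, and the efficient pattern for repeated modification is to collect pieces in a list and join once at the end (a sketch):

```python
s = "cat"
try:
    s[0] = "b"  # strings are immutable: this raises TypeError
except TypeError:
    print("cannot modify a string in place")

# Instead, build a new string from the old one
t = "b" + s[1:]
print(t)  # "bat"

# For many modifications, accumulate pieces and join once,
# avoiding a new string allocation on every step
pieces = [str(i) for i in range(5)]
result = "".join(pieces)
print(result)  # "01234"
```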

🤯

The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
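Real tokenisers (such as byte-pair encoding) learn their vocabulary from data, but the matching idea can be sketched with a toy greedy longest-match tokeniser over a made-up vocabulary:

```python
# Hypothetical vocabulary - real models learn theirs from training data
VOCAB = {"un", "happiness", "happy", "ness"}

def tokenise(word):
    """Greedily match the longest known sub-piece at each position."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until one matches
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("unhappiness"))  # ["un", "happiness"]
```

Because unknown words fall back to smaller known pieces, the tokeniser never fails outright - it just produces more tokens.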

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?

Lesson 2 of 10

Why Tokenisation Matters

  • A typical language model has a vocabulary of 50,000–100,000 tokens.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.
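The token-to-ID mapping is just a lookup table. As a toy illustration (the tokens and IDs below are invented, not a real model's vocabulary):

```python
# Hypothetical token-to-ID table; real vocabularies hold 50,000+ entries
token_to_id = {"Chat": 0, "G": 1, "PT": 2, " is": 3, " brilliant": 4}

tokens = ["Chat", "G", "PT", " is", " brilliant"]
ids = [token_to_id[t] for t in tokens]
print(ids)  # [0, 1, 2, 3, 4] - the numbers the model actually sees
```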

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
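The sliding approach translates directly into code (a sketch):

```python
def naive_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Slide the pattern across every valid starting position
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

print(naive_search("the cat sat on the mat", "cat"))  # 4
```

Smarter algorithms such as Knuth-Morris-Pratt avoid re-comparing characters the text has already revealed, bringing the worst case down to O(n + m).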

🧠 Quiz

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.
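In Python, reversal is a one-liner using a slice with a negative step (a sketch):

```python
def reverse(text):
    # A slice with step -1 walks the string backwards
    return text[::-1]

print(reverse("hello"))  # "olleh"
```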

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # A palindrome equals its own reverse
    return text == text[::-1]

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

from collections import Counter

def are_anagrams(word1, word2):
    # Counter builds a character-frequency hash map for each word
    return Counter(word1) == Counter(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).

🧠 Quiz

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
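Python's re module covers all of these tasks. A small sketch using the patterns from the table (the sample text is invented, and the email pattern is simplified):

```python
import re

raw = "Order #4521 shipped on 2024-03-15. Contact: support@example.com"

# \d+ : extract every run of digits
numbers = re.findall(r"\d+", raw)
print(numbers)  # ['4521', '2024', '03', '15']

# Simplified email pattern from the table above
emails = re.findall(r"\b\w+@\w+\.\w+\b", raw)
print(emails)  # ['support@example.com']

# Data cleaning: strip HTML tags before training
html = "<p>Hello <b>world</b></p>"
clean = re.sub(r"<[^>]+>", "", html)
print(clean)  # "Hello world"
```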

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

Here's a simplified view of how an AI processes text:

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
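The steps above can be sketched end to end. Here a plain whitespace split stands in for real sub-word tokenisation, and the vocabulary and IDs are invented for illustration:

```python
import re

def preprocess(text, token_to_id):
    # Steps 1-2: lowercase the raw text
    text = text.lower()
    # Step 3: remove punctuation (keep word characters and whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Step 4: tokenise - a whitespace split stands in for sub-word tokenisation
    tokens = text.split()
    # Step 5: map tokens to IDs (0 stands in for "unknown token")
    return [token_to_id.get(t, 0) for t in tokens]

vocab = {"the": 1, "wifi": 2, "is": 3, "not": 4, "working": 5}  # hypothetical
print(preprocess("The Wi-Fi is NOT working!!!", vocab))  # [1, 2, 3, 4, 5]
```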

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!

🧠 Quiz

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.