AI Educademy
📝
AI Sketch • Intermediate • ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
A string broken into individual characters with index positions shown beneath each one
Under the hood, a string is an array of characters - each with its own index.
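In Python, for instance, reading a character by its index is a single expression (a minimal sketch of the diagram above):

```python
message = "Hello, World!"

# Indexing retrieves one character; positions start at 0
first = message[0]    # "H"
eighth = message[7]   # "W"

# len() reports the total number of characters
length = len(message)  # 13

print(first, eighth, length)
```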

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
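In Python this operation is called slicing. Note that the end index is exclusive, so indices 7 to 11 inclusive are written as `7:12` (a small sketch):

```python
text = "Hello, World!"

# Slice from index 7 (inclusive) to 12 (exclusive)
word = text[7:12]
print(word)  # "World"

# find() returns the starting index of a substring, or -1 if absent
position = text.find("World")
print(position)  # 7
```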

Immutability

In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
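Python strings behave this way: assigning to an index raises an error, and the efficient pattern for repeated modification is to collect pieces in a list and join once at the end (a sketch):

```python
s = "cat"
try:
    s[0] = "b"  # strings are immutable: this raises TypeError
except TypeError:
    print("cannot modify a string in place")

# Instead, build a new string from the old one
t = "b" + s[1:]
print(t)  # "bat"

# For many modifications, accumulate pieces and join once,
# avoiding a new string allocation on every step
pieces = [str(i) for i in range(5)]
result = "".join(pieces)
print(result)  # "01234"
```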

🤯

The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
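Real tokenisers (such as byte-pair encoding) learn their vocabulary from data, but the matching idea can be sketched with a toy greedy longest-match tokeniser over a made-up vocabulary:

```python
# Hypothetical vocabulary - real models learn theirs from training data
VOCAB = {"un", "happiness", "happy", "ness"}

def tokenise(word):
    """Greedily match the longest known sub-piece at each position."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until one matches
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("unhappiness"))  # ["un", "happiness"]
```

Because unknown words fall back to smaller known pieces, the tokeniser never fails outright - it just produces more tokens.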

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?

Lesson 2 of 10

Why Tokenisation Matters

  • A typical language model has a vocabulary of 50,000–100,000 tokens.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.
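The token-to-ID mapping is just a lookup table. As a toy illustration (the tokens and IDs below are invented, not a real model's vocabulary):

```python
# Hypothetical token-to-ID table; real vocabularies hold 50,000+ entries
token_to_id = {"Chat": 0, "G": 1, "PT": 2, " is": 3, " brilliant": 4}

tokens = ["Chat", "G", "PT", " is", " brilliant"]
ids = [token_to_id[t] for t in tokens]
print(ids)  # [0, 1, 2, 3, 4] - the numbers the model actually sees
```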

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
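The sliding approach translates directly into code (a sketch):

```python
def naive_search(text, pattern):
    """Return the index of the first match of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Slide the pattern across every valid starting position
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

print(naive_search("the cat sat on the mat", "cat"))  # 4
```

Smarter algorithms such as Knuth-Morris-Pratt avoid re-comparing characters the text has already revealed, bringing the worst case down to O(n + m).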

🧠 Quiz

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.
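In Python, reversal is a one-liner using a slice with a negative step (a sketch):

```python
def reverse(text):
    # A slice with step -1 walks the string backwards
    return text[::-1]

print(reverse("hello"))  # "olleh"
```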

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # A palindrome equals its own reverse
    return text == text[::-1]

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

from collections import Counter

def are_anagrams(word1, word2):
    # Counter builds a character-frequency hash map for each word
    return Counter(word1) == Counter(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).

🧠 Quiz

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
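Python's re module covers all of these tasks. A small sketch using the patterns from the table (the sample text is invented, and the email pattern is simplified):

```python
import re

raw = "Order #4521 shipped on 2024-03-15. Contact: support@example.com"

# \d+ : extract every run of digits
numbers = re.findall(r"\d+", raw)
print(numbers)  # ['4521', '2024', '03', '15']

# Simplified email pattern from the table above
emails = re.findall(r"\b\w+@\w+\.\w+\b", raw)
print(emails)  # ['support@example.com']

# Data cleaning: strip HTML tags before training
html = "<p>Hello <b>world</b></p>"
clean = re.sub(r"<[^>]+>", "", html)
print(clean)  # "Hello world"
```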

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

Here's a simplified view of how an AI processes text:

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
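The steps above can be sketched end to end. Here a plain whitespace split stands in for real sub-word tokenisation, and the vocabulary and IDs are invented for illustration:

```python
import re

def preprocess(text, token_to_id):
    # Steps 1-2: lowercase the raw text
    text = text.lower()
    # Step 3: remove punctuation (keep word characters and whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Step 4: tokenise - a whitespace split stands in for sub-word tokenisation
    tokens = text.split()
    # Step 5: map tokens to IDs (0 stands in for "unknown token")
    return [token_to_id.get(t, 0) for t in tokens]

vocab = {"the": 1, "wifi": 2, "is": 3, "not": 4, "working": 5}  # hypothetical
print(preprocess("The Wi-Fi is NOT working!!!", vocab))  # [1, 2, 3, 4, 5]
```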

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!

🧠 Quiz

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.