AI Educademy

AI Sketch • Intermediate • ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
Under the hood, a string is an array of characters - each with its own index.

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
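
As a small illustration, here is how indexing and slicing might look in Python, using the same "Hello, World!" string. The variable names are only for this sketch; other languages use different syntax for the same idea.

```python
message = "Hello, World!"

# Indexing: positions start at 0.
first_char = message[0]        # "H"
last_char = message[-1]        # "!" (negative indices count from the end)

# Slicing: the end index is exclusive, so 7:12 covers indices 7 to 11.
word = message[7:12]           # "World"

# Searching for a substring returns the index where it starts.
start = message.find("World")  # 7

print(first_char, last_char, word, start)
```

Note that the slice end is exclusive: to get indices 7 to 11 you ask for 7:12.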

Immutability

In many languages, including Python, Java, and JavaScript, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
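
A minimal Python sketch of what immutability means in practice. It assumes Python's behaviour specifically; the variable names are made up for the example.

```python
text = "cat"

# Trying to change a character in place raises TypeError in Python.
try:
    text[0] = "b"
    mutated = True
except TypeError:
    mutated = False

# Instead, build a new string from pieces of the old one.
new_text = "b" + text[1:]    # "bat"; the original "cat" is unchanged

# Repeated concatenation in a loop may copy the growing string each time.
# Collecting the pieces in a list and joining once is the usual idiom.
pieces = [str(i) for i in range(5)]
joined = ",".join(pieces)    # "0,1,2,3,4"

print(mutated, new_text, text, joined)
```

Joining once at the end avoids creating a new intermediate string for every piece.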

🤯

The entire works of Shakespeare contain roughly 900,000 words. Large language models such as GPT-4 are trained on text datasets hundreds of thousands of times larger - on the order of hundreds of billions of words or more, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
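
To make the idea concrete, here is a toy tokeniser in Python. It is only a sketch: it uses a greedy longest-match rule over a tiny hand-written vocabulary (TOY_VOCAB is hypothetical), whereas real tokenisers learn their vocabulary from data and use different algorithms such as byte-pair encoding.

```python
def tokenise(text, vocab):
    """Greedy longest-match split of text into pieces found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in vocab.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                break
        else:
            # No known piece starts here: fall back to a single character.
            piece = text[i]
        tokens.append(piece)
        i += len(piece)
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"un", "happiness", "happy", "ness",
             "Chat", "G", "PT", " is", " brilliant"}

print(tokenise("unhappiness", TOY_VOCAB))           # ['un', 'happiness']
print(tokenise("ChatGPT is brilliant", TOY_VOCAB))  # ['Chat', 'G', 'PT', ' is', ' brilliant']
```

The single-character fallback is what lets a sub-word tokeniser handle any input, even words it has never seen.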

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?

Why Tokenisation Matters

  • Many language models have a vocabulary of roughly 50,000–100,000 tokens, and some newer ones use larger vocabularies.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.
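
The points above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the IDs in TOKEN_TO_ID are made up, and PRICE_PER_1000_TOKENS is a hypothetical price, not a real API fee.

```python
# Hypothetical vocabulary: every token is assigned a made-up integer ID.
TOKEN_TO_ID = {"Chat": 0, "G": 1, "PT": 2, " is": 3, " brilliant": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens):
    """Map each token to its ID - the numbers a model would process."""
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(ids):
    """Map IDs back to tokens and glue them into the original text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


tokens = ["Chat", "G", "PT", " is", " brilliant"]
ids = encode(tokens)

# Hypothetical price, only to show that cost grows with the token count.
PRICE_PER_1000_TOKENS = 0.01
cost = len(ids) / 1000 * PRICE_PER_1000_TOKENS

print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # ChatGPT is brilliant
print(cost)
```

Because the cost is computed from len(ids), a text that splits into more tokens costs more even if it has the same number of words.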

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 2: "e c" → no match
Position 3: " ca" → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
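
Here is one way the sliding approach could be written in Python. The function name naive_search is an assumption for this sketch; in practice you would normally use the language's built-in search, such as str.find.

```python
def naive_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        # Compare the pattern against the window beginning at start.
        matched = 0
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


print(naive_search("the cat sat on the mat", "cat"))  # 4
print(naive_search("the cat sat on the mat", "dog"))  # -1
```

The outer loop runs up to n times and the inner loop up to m times, which is where the O(n × m) worst case comes from.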

🧠 Quick check

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

is_palindrome(text):
    return text == reverse(text)
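
In Python, the reversal and palindrome pseudocode might look like the sketch below. The two-pointer variant is an addition for comparison, not something the lesson requires: it checks characters from both ends and avoids building a reversed copy.

```python
def reverse(text):
    # A slice with step -1 walks the string backwards and builds a new string.
    return text[::-1]


def is_palindrome(text):
    return text == reverse(text)


def is_palindrome_two_pointers(text):
    # Compare characters from both ends, moving towards the middle.
    left, right = 0, len(text) - 1
    while left < right:
        if text[left] != text[right]:
            return False
        left += 1
        right -= 1
    return True


print(reverse("hello"))                     # olleh
print(is_palindrome("racecar"))             # True
print(is_palindrome_two_pointers("hello"))  # False
```

Both checks are case-sensitive and treat spaces as characters, so real text usually needs cleaning first.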

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

are_anagrams(word1, word2):
    return character_counts(word1) == character_counts(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).
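
A possible Python version of the pseudocode above, with character_counts written out by hand so the hash map is visible. Python's collections.Counter does the same counting in one call.

```python
def character_counts(word):
    """Build a hash map from each character to how often it appears."""
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def are_anagrams(word1, word2):
    # Two words are anagrams when their character counts are identical.
    return character_counts(word1) == character_counts(word2)


print(are_anagrams("listen", "silent"))  # True
print(are_anagrams("listen", "listed"))  # False
```

Each word is scanned once, so the work grows linearly with the length of the words. This sketch also returns True for two identical words; add a word1 != word2 check if you want to exclude that case.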

🧠 Quick check

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
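
The patterns from the table can be tried with Python's built-in re module. The sample sentences are made up for this sketch, and the HTML-stripping pattern is a simple illustration, not a robust HTML parser.

```python
import re

text = "Alice paid 42 dollars on 3 May. Contact: alice@example.com"

numbers = re.findall(r"\d+", text)               # ['42', '3']
capitalised = re.findall(r"[A-Z][a-z]+", text)   # ['Alice', 'May', 'Contact']
emails = re.findall(r"\b\w+@\w+\.\w+\b", text)   # ['alice@example.com']

# Alternation: find the first of several keywords.
match = re.search(r"(cat|dog|bird)", "my dog barked")
keyword = match.group(1) if match else None      # 'dog'

# Data cleaning: drop anything that looks like an HTML tag.
cleaned = re.sub(r"<[^>]+>", "", "<p>Hello <b>world</b></p>")  # 'Hello world'

print(numbers, capitalised, emails, keyword, cleaned)
```

Notice that [A-Z][a-z]+ also picks up "Contact", which is capitalised but not a proper noun. Simple patterns like these are a first pass, not a final answer.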

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

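Here's a simplified view of how an AI system might process text. Modern language-model tokenisers usually keep case and punctuation, but the overall flow from raw text to numbers is similar: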

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
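
The steps above can be sketched in Python. This is a simplified version under stated assumptions: it splits on whitespace into whole words instead of sub-word pieces, and the IDs are assigned on the fly in order of first appearance, so they will not match the numbers in the list above.

```python
import re


def preprocess(raw):
    """Lowercase the text and remove punctuation."""
    lowered = raw.lower()
    # Keep word characters (letters, digits, underscore) and whitespace only.
    return re.sub(r"[^\w\s]", "", lowered)


raw = "The café's Wi-Fi isn't working!!!"
clean = preprocess(raw)   # "the cafés wifi isnt working"
tokens = clean.split()    # word-level split on whitespace

# Hypothetical vocabulary built as we go: each new token gets the next ID.
vocab = {}
ids = []
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)
    ids.append(vocab[token])

print(clean)
print(tokens)
print(ids)    # [0, 1, 2, 3, 4]
```

In a real system the vocabulary is fixed before training, so the same token always maps to the same ID across every document.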

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

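OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. The American spelling "tokenization" becomes ["token", "ization"]. In this example both spellings take two tokens, but the second token differs, and for other words a variant spelling can split into more pieces. The same concept can cost different amounts depending on how you spell it!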

🧠 Quick check

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.