AI Educademy
📝
AI Sketch • Advanced ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
Under the hood, a string is an array of characters, each with its own index.
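In Python, for example, indexing into the string above looks like this (a minimal sketch):

```python
message = "Hello, World!"

print(message[0])    # character at index 0: 'H'
print(message[7])    # character at index 7: 'W'
print(len(message))  # total number of characters: 13
```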

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
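In Python, substrings are extracted with slice notation. The hashtag example below is a made-up illustration of the kind of extraction AI pipelines do:

```python
message = "Hello, World!"
print(message[7:12])  # indices 7 to 11 → 'World'

# Isolating hashtags from a tweet is substring extraction too:
tweet = "Loving this lesson #AI #NLP"
hashtags = [word for word in tweet.split() if word.startswith("#")]
print(hashtags)  # → ['#AI', '#NLP']
```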

Immutability

In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
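A quick Python sketch of what immutability means in practice:

```python
s = "cat"
# s[0] = "b"  # would raise TypeError: strings are immutable

# To "change" a character, build a new string instead:
t = "b" + s[1:]
print(t)  # → 'bat'

# When a pipeline makes many edits, collecting pieces and joining
# once avoids creating a fresh string on every single step:
pieces = [word.upper() for word in ["clean", "fast", "text"]]
result = " ".join(pieces)
print(result)  # → 'CLEAN FAST TEXT'
```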

🤯

The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
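As a rough illustration, here is a toy greedy longest-match tokeniser over a tiny hand-picked vocabulary. Real tokenisers such as BPE learn their vocabulary from data, so this is only a sketch of the sub-word idea:

```python
# Tiny, invented vocabulary for demonstration purposes only
VOCAB = {"un", "happiness", "token", "isation"}

def tokenise(word):
    """Greedily match the longest known sub-word at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-char token
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenise("unhappiness"))  # → ['un', 'happiness']
```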

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?


Why Tokenisation Matters

  • A typical language model has a vocabulary of 50,000–100,000 tokens.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.
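The token-to-ID step can be pictured as a simple dictionary lookup. The IDs below are invented for illustration; a real model's vocabulary assigns its own:

```python
# Made-up vocabulary mapping tokens to integer IDs
vocab = {"Chat": 101, "G": 102, "PT": 103, " is": 104, " brilliant": 105}

tokens = ["Chat", "G", "PT", " is", " brilliant"]
ids = [vocab[t] for t in tokens]
print(ids)  # → [101, 102, 103, 104, 105]
```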

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
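The sliding comparison above can be written in a few lines of Python:

```python
def naive_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1.
    Worst case O(n * m): at each of ~n positions, compare up to m chars."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1

print(naive_search("the cat sat on the mat", "cat"))  # → 4
```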

🧠Quick Check

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.
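In Python this is a one-liner using slice notation with a negative step:

```python
def reverse(text):
    return text[::-1]  # step -1 walks the string backwards

print(reverse("hello"))  # → 'olleh'
```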

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    return text == text[::-1]

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

from collections import Counter

def are_anagrams(word1, word2):
    return Counter(word1) == Counter(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).

🧠Quick Check

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
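A small sketch of regex-based cleaning using Python's standard `re` module; the patterns and sample text are illustrative:

```python
import re

raw = "<p>Call 555-0199 before 12/05/2024!</p>"

text = re.sub(r"<[^>]+>", "", raw)  # strip HTML tags
digits = re.findall(r"\d+", text)   # pull out every run of digits

print(text)    # → 'Call 555-0199 before 12/05/2024!'
print(digits)  # → ['555', '0199', '12', '05', '2024']
```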

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

Here's a simplified view of how an AI processes text:

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
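The first few steps of that pipeline can be sketched in Python. Real systems use learned sub-word tokenisers rather than a plain whitespace split, and token IDs come from a trained vocabulary, so this is only an outline:

```python
import re

raw = "The café's Wi-Fi isn't working!!!"

lowered = raw.lower()                      # step 2: lowercasing
cleaned = re.sub(r"[^\w\s]", "", lowered)  # step 3: drop punctuation
words = cleaned.split()                    # crude word-level split

print(cleaned)  # → 'the cafés wifi isnt working'
print(words)    # → ['the', 'cafés', 'wifi', 'isnt', 'working']
```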

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!

🧠Quick Check

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.