AI Educademy
📝
AI Sketch • Advanced ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
A string broken into individual characters with index positions shown beneath each one
Under the hood, a string is an array of characters - each with its own index.

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
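In Python notation, indexing and slicing make this concrete (a small sketch; note that the end index of a slice is exclusive):

```python
message = "Hello, World!"

print(message[7])      # character at index 7 → "W"
print(message[7:12])   # substring from index 7 up to (not including) 12 → "World"
print(message[-1])     # negative indices count from the end → "!"
```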

Immutability

In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
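A short Python sketch of what immutability means in practice, and the usual workaround for building text in a loop:

```python
s = "hello"
# s[0] = "H"        # would raise TypeError: strings are immutable

s = "H" + s[1:]     # instead, create a new string
print(s)            # → "Hello"

# In a hot loop, collect pieces and join once at the end;
# repeated += copies the whole string on every iteration.
pieces = ["many", "small", "strings"]
result = " ".join(pieces)
```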

🤯

The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
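The idea can be sketched as a greedy longest-match tokeniser over a toy vocabulary. This is a simplification: the vocabulary below is invented for illustration, and real models learn their subword pieces with algorithms like byte-pair encoding (BPE) rather than matching greedily against a hand-written set.

```python
# Toy vocabulary of known subword pieces (illustrative only).
VOCAB = {"un", "happy", "happiness", "ness", "chat", "work", "ing"}

def tokenise(word):
    """Greedy longest-match split into known sub-pieces."""
    tokens, i = [], 0
    while i < len(word):
        piece = None
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in VOCAB:
                piece = word[i:j]
                break
        if piece is None:                   # unknown character: emit it as-is
            piece = word[i]
        tokens.append(piece)
        i += len(piece)
    return tokens

print(tokenise("unhappiness"))  # → ['un', 'happiness']
```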

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?

Why Tokenisation Matters

  • A typical language model has a vocabulary of 50,000–100,000 tokens.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
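The sliding approach above can be written in a few lines of Python (a minimal sketch of the naive O(n × m) scan, not what production search engines use):

```python
def find_pattern(text, pattern):
    """Return the index of the first match, or -1 if none (naive scan)."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:   # compare pattern against this window
            return i
    return -1

print(find_pattern("the cat sat on the mat", "cat"))  # → 4
```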

🧠 Quick Check

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # compare the string with its reverse (Python's [::-1] reverses)
    return text == text[::-1]

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

from collections import Counter

def are_anagrams(word1, word2):
    return Counter(word1) == Counter(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).

🧠 Quick Check

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses | Data extraction |
| (cat\|dog\|bird) | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
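Two of the patterns from the table in action with Python's `re` module (the sample text is made up for illustration):

```python
import re

text = "Contact alice@example.com before 2024-05-01 or call 555-0199."

print(re.findall(r"\d+", text))               # → ['2024', '05', '01', '555', '0199']
print(re.findall(r"\b\w+@\w+\.\w+\b", text))  # → ['alice@example.com']
```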

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

Here's a simplified view of how an AI processes text:

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
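Steps 1-3 of the pipeline can be sketched in a few lines (real preprocessing pipelines vary by model, and real tokenisers use subword schemes like BPE rather than a whitespace split):

```python
import re

raw = "The café's Wi-Fi isn't working!!!"

lowered = raw.lower()                      # step 2: lowercasing
cleaned = re.sub(r"[^\w\s]", "", lowered)  # step 3: strip punctuation
words = cleaned.split()                    # crude whitespace split, not real tokenisation

print(cleaned)  # → the cafés wifi isnt working
```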

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!

🧠 Quick Check

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.