AI Educademy
📝
AI Sketch • Intermediate • ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it must process raw text. Every chatbot, search engine, and language model starts with strings - sequences of characters that represent human language.

Understanding how strings work unlocks the door to natural language processing, one of the most exciting areas of artificial intelligence.

String Basics

A string is simply a sequence of characters - letters, digits, spaces, and symbols - stored in order.

message = "Hello, World!"

Each character has a position (index), just like an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !
Under the hood, a string is an array of characters - each with its own index.
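Indexing is easy to verify in a concrete language; a minimal Python sketch:

```python
message = "Hello, World!"

# Characters are accessed by zero-based index, just like an array.
print(message[0])    # 'H' — the first character
print(message[7])    # 'W' — the character at index 7
print(len(message))  # 13 characters in total
```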

Substrings

A substring is a slice of a string. From "Hello, World!" you could extract "World" (indices 7 to 11). AI systems constantly extract substrings - pulling out names from sentences, isolating hashtags from tweets, or grabbing URLs from web pages.
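In Python, substrings are extracted with slicing; a short sketch (the tweet here is invented for illustration):

```python
text = "Hello, World!"

# Slicing: start index inclusive, end index exclusive.
print(text[7:12])  # "World" — indices 7 through 11

# A practical use: isolating a hashtag from a tweet-like string.
tweet = "Loving this course #AI"
start = tweet.index("#")  # position of the first '#'
print(tweet[start:])      # "#AI"
```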

Immutability

In most languages, strings are immutable - you cannot change a character in place. Instead, you create a new string. This matters for performance: if your AI pipeline modifies text millions of times, creating new strings each time can slow things down.
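Python strings behave exactly this way; a quick sketch of immutability and the usual workaround:

```python
s = "hello"
# s[0] = "H"  # would raise TypeError: strings are immutable

# Create a new string instead of modifying in place:
capitalised = "H" + s[1:]
print(capitalised)  # "Hello"

# When modifying text many times, collect pieces and join once at the end;
# repeated "+" concatenation copies the whole string every time.
pieces = [w.upper() for w in ["fast", "text", "pipeline"]]
print(" ".join(pieces))  # "FAST TEXT PIPELINE"
```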

🤯

The entire works of Shakespeare contain roughly 900,000 words. GPT-4 was trained on text datasets thousands of times larger - hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models don't read words the way humans do. They use tokenisation - splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It handles rare words by breaking them into known sub-pieces, keeping the vocabulary manageable.
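A toy greedy longest-match tokeniser illustrates the idea. The vocabulary below is hand-made for this example only; real models learn their vocabularies with algorithms like byte-pair encoding (BPE):

```python
# Hand-made toy vocabulary — real models learn tens of thousands of tokens.
VOCAB = {"un", "happy", "happiness", "ness"}

def tokenise(word):
    """Greedily match the longest known sub-piece at each position."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens

print(tokenise("unhappiness"))  # ['un', 'happiness']
print(tokenise("happyness"))   # ['happy', 'ness']
```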

🤔
Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why might this be better than storing every possible English word as a separate token?

Why Tokenisation Matters

  • A typical language model has a vocabulary of 50,000–100,000 tokens.
  • Each token maps to a number (its ID), which the model actually processes.
  • The way text is tokenised affects cost - more tokens means more computation and higher API fees.

Pattern Matching - Finding Needles in Haystacks

A core string operation is searching for a pattern within a larger text. Does this email contain the word "urgent"? Does this code contain a security vulnerability?

The Simple Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 4: "cat" → match found at index 4!

This is O(n × m) in the worst case, where n is the text length and m is the pattern length. For short patterns, it's fine. For scanning millions of documents, we need smarter approaches.
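A minimal Python version of this sliding check:

```python
def naive_find(text, pattern):
    """Return the index of the first match, or -1. O(n * m) worst case."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        # Compare the pattern against the window starting at position i.
        if text[i:i + m] == pattern:
            return i
    return -1

print(naive_find("the cat sat on the mat", "cat"))  # 4
```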

🧠 Quick Check

Why is naive pattern matching slow for very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple but reveals how you think about data:

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models - for instance, some early translation models reversed the input sentence to improve accuracy.
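One common way to implement reversal is the classic two-pointer swap; a Python sketch:

```python
def reverse(text):
    """Reverse a string by swapping characters from both ends inward."""
    chars = list(text)  # strings are immutable, so work on a list copy
    left, right = 0, len(chars) - 1
    while left < right:
        chars[left], chars[right] = chars[right], chars[left]
        left += 1
        right -= 1
    return "".join(chars)

print(reverse("hello"))  # "olleh"
```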

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # A string is a palindrome if it equals its own reverse.
    return text == text[::-1]

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

The elegant solution? Count character frequencies using a hash map:

from collections import Counter

def are_anagrams(word1, word2):
    # Counter builds a hash map of character frequencies.
    return Counter(word1) == Counter(word2)

This connects directly to the frequency counting pattern from the previous lesson - hash maps make it O(n).

🧠 Quick Check

Which approach most efficiently checks if two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns rather than exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| `\d+` | One or more digits | Extracting numbers from text |
| `[A-Z][a-z]+` | A capitalised word | Finding proper nouns |
| `\b\w+@\w+\.\w+\b` | Email addresses | Data extraction |
| `(cat\|dog\|bird)` | Any of three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning - removing HTML tags, extracting dates, standardising phone numbers, and filtering out unwanted characters before training.
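A small sketch of such a cleaning step, using Python's built-in `re` module (the input string is invented for illustration):

```python
import re

raw = "<p>Call   us: <b>555-0100</b></p>"

no_tags = re.sub(r"<[^>]+>", "", raw)         # strip HTML tags
clean = re.sub(r"\s+", " ", no_tags).strip()  # collapse runs of whitespace
print(clean)  # "Call us: 555-0100"

# Extracting a phone-like number with \d and a repetition count:
print(re.findall(r"\d{3}-\d{4}", clean))  # ['555-0100']
```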

💡

Regular expressions are powerful but can be tricky. A poorly written regex can take exponentially long on certain inputs - a problem known as "catastrophic backtracking." Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

Here's a simplified view of how an AI processes text:

  1. Raw text → "The café's Wi-Fi isn't working!!!"
  2. Lowercasing → "the café's wi-fi isn't working!!!"
  3. Removing punctuation → "the cafés wifi isnt working"
  4. Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
  5. Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
  6. Into the model → Numbers the AI can actually process

Every step involves string operations - slicing, searching, replacing, and splitting.
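The first few steps can be sketched in Python. Note the simplifications: a whitespace split stands in for real subword tokenisation, and the token-ID lookup (step 5) would come from a trained tokeniser's vocabulary, omitted here:

```python
import re

raw = "The café's Wi-Fi isn't working!!!"

lowered = raw.lower()                        # step 2: lowercasing
no_punct = re.sub(r"[^\w\s]", "", lowered)   # step 3: strip punctuation
tokens = no_punct.split()                    # step 4 (simplified): whitespace
                                             # split instead of subword pieces
print(tokens)  # ['the', 'cafés', 'wifi', 'isnt', 'working']
```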

🤔
Think about it:

When you send a message to a chatbot in a language that doesn't use spaces between words (like Chinese or Japanese), how might tokenisation work differently? What extra challenges does this create?

🤯

OpenAI's tokeniser splits "tokenisation" into ["token", "isation"] - two tokens. But the American spelling "tokenization" becomes ["token", "ization"]. The same concept costs different amounts depending on how you spell it!

🧠 Quick Check

In a text processing pipeline, why is tokenisation performed before feeding text to an AI model?

Key Takeaways

  • Strings are sequences of characters - the raw material for all text-based AI.
  • Tokenisation bridges human language and machine learning by splitting text into processable pieces.
  • Pattern matching and regex are essential tools for cleaning and extracting data.
  • Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
  • Every message you send to ChatGPT passes through a pipeline of string operations before the model sees it.
← Arrays and Hash Maps
Sorting and Searching →