AI Sketch • Intermediate • ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it has to process raw text. Every chatbot, search engine, and language model starts with strings: sequences of characters that represent human language.

Understanding how strings work opens the door to natural language processing.

String Basics

A string is a sequence of characters (letters, digits, spaces, and symbols) stored in order.

message = "Hello, World!"

Each character has a position (index), just as in an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !

Figure: a string broken into individual characters, with the index position shown under each one. A string is an array of characters, each with its own index.

Substrings

A substring is a slice of a string. From "Hello, World!" you can extract "World" (indices 7 to 11). AI systems extract substrings constantly: names from sentences, hashtags from tweets, URLs from web pages.
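A minimal Python sketch of substring extraction, assuming Python's half-open slice notation (the end index is excluded, so indices 7 to 11 are written as `7:12`). The `post` text and the hashtag rule are hypothetical and only illustrate the idea:

```python
message = "Hello, World!"

# Slicing is half-open: the start index is included, the end index is excluded.
world = message[7:12]          # characters at indices 7 through 11
first_char = message[0]
last_char = message[-1]        # negative indices count from the end

# Hypothetical example text for pulling hashtags out of a post.
post = "Learning #python and #nlp today"
hashtags = [word for word in post.split() if word.startswith("#")]

print(world, first_char, last_char, hashtags)
```

Real extraction tasks usually need more careful rules than `startswith("#")`, but the building block is the same: find the boundaries, then slice.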

Immutability

In many languages strings are immutable: you cannot change a character in place. Instead, you create a new string. This matters for performance: if an AI pipeline modifies text millions of times, creating a new string every time becomes slow.
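The following sketch shows what immutability looks like in Python, one of the languages where it applies. The list-and-join pattern at the end is a common way to avoid building many intermediate strings; the word list is made up for illustration:

```python
message = "Hello, World!"

# In-place assignment is rejected because Python strings are immutable.
try:
    message[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# Any "change" builds a new string; the original is untouched.
changed = "J" + message[1:]

# Repeated concatenation in a loop may copy the growing string each time.
# Collecting pieces in a list and joining once is the usual alternative.
pieces = []
for word in ["text", "is", "everywhere"]:
    pieces.append(word.upper())
joined = " ".join(pieces)

print(mutated, changed, message, joined)
```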

🤯

Shakespeare's complete works contain roughly 900,000 words. Models such as GPT-4 are trained on text datasets many thousands of times larger, reportedly hundreds of billions of words, all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models do not read words the way humans do. They use tokenisation: splitting text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It keeps the vocabulary manageable by breaking rare words into known sub-pieces.
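To make the sub-word idea concrete, here is a toy sketch that splits a word by greedily taking the longest piece found in a vocabulary. The `VOCAB` set and the `tokenise` function are hypothetical. Production tokenisers typically learn their vocabulary from data (for example with byte-pair encoding) and do not work exactly like this:

```python
# Hypothetical toy vocabulary; real tokenisers learn theirs from data.
VOCAB = {"un", "happiness", "happy", "ness", "token", "isation"}

def tokenise(word, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become single tokens."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is known.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens

print(tokenise("unhappiness"))
print(tokenise("xyz"))
```

The single-character fallback is what keeps the vocabulary small: any word can be represented, even one the vocabulary has never seen.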

🤔

Think about it:

When you type a long, unusual word like "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why is this better than storing every English word as a separate token?

Why Tokenisation Matters

A typical language model vocabulary contains 50,000–100,000 tokens.

Each token maps to a number (its ID), and that number is what the model actually processes.

How text is tokenised affects cost: more tokens mean more computation and higher API fees.
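A small sketch of the token-to-ID mapping and of why token count drives cost. The vocabulary, the IDs, and the price per 1,000 tokens are all assumptions chosen for illustration, not real values from any provider:

```python
# Hypothetical vocabulary and price, chosen only for illustration.
TOKEN_TO_ID = {"<unk>": 0, "Chat": 1, "G": 2, "PT": 3, " is": 4, " brilliant": 5}
PRICE_PER_1000_TOKENS = 0.002  # assumed figure, not a real tariff

def encode(tokens):
    """Map each token to its integer ID, using <unk> for unknown tokens."""
    return [TOKEN_TO_ID.get(token, TOKEN_TO_ID["<unk>"]) for token in tokens]

def estimate_cost(token_count):
    """Cost grows linearly with the number of tokens."""
    return token_count * PRICE_PER_1000_TOKENS / 1000

tokens = ["Chat", "G", "PT", " is", " brilliant"]
ids = encode(tokens)
print(ids, estimate_cost(len(ids)))
```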

Pattern Matching - Finding a Needle in a Haystack

A core string operation is searching for a pattern in a larger text. Does this email contain "urgent"? Does this code contain a security vulnerability?

The Naive Approach

Slide the pattern along the text, checking character by character:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 2: "e c" → no match
Position 3: " ca" → no match
Position 4: "cat" → match found at index 4!

The worst case is O(n × m), where n is the text length and m is the pattern length. That is fine for short patterns. Scanning millions of documents calls for smarter approaches.
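The sliding comparison can be sketched in Python as follows. The function name `naive_search` is an assumption for this example. The nested loops are where the O(n × m) worst case comes from: up to n starting positions, and up to m character comparisons at each one.

```python
def naive_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        matched = 0
        # Compare character by character until a mismatch or a full match.
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return start
    return -1

print(naive_search("the cat sat on the mat", "cat"))
```

In everyday Python code the built-in `str.find` does the same job and is usually the better choice; the sketch only exposes the mechanics.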

🧠 Quick check

Why is naive pattern matching slow on very large texts?

Classic String Challenges

String Reversal

reverse("hello") → "olleh"

AI uses reversal in sequence-to-sequence models: some early translation models reversed the input sentence to improve accuracy.
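A short Python sketch of both flavours of reversal. `reverse` flips characters, while `reverse_words` flips word order; the second helper is a hypothetical illustration of reversing a sentence token by token, not a reconstruction of any specific model's preprocessing:

```python
def reverse(text):
    """Return a new string with the characters in reverse order."""
    return text[::-1]

def reverse_words(sentence):
    """Return the sentence with its words in reverse order."""
    return " ".join(reversed(sentence.split()))

print(reverse("hello"))
print(reverse_words("the cat sat"))
```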

Palindrome Detection

A string that reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # A palindrome is equal to its own reverse.
    return text == text[::-1]

Anagram Detection

Two words that contain the same characters in a different order: "listen" and "silent".

Counting character frequencies with a hash map takes O(n):

from collections import Counter  # hash-map-based character counts

def are_anagrams(word1, word2):
    return Counter(word1) == Counter(word2)
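A minimal sketch of the hash-map counting idea written out by hand in Python, so the O(n) behaviour is visible: each word is scanned once, and each dictionary update takes constant time on average. The early length check is an added shortcut, not something the lesson requires:

```python
def character_counts(word):
    """Build a hash map from each character to how often it occurs."""
    counts = {}
    for char in word:
        counts[char] = counts.get(char, 0) + 1
    return counts

def are_anagrams(word1, word2):
    # Words of different lengths can never be anagrams, so skip the counting.
    if len(word1) != len(word2):
        return False
    return character_counts(word1) == character_counts(word2)

print(are_anagrams("listen", "silent"))
```

Sorting both words and comparing them also works, but sorting costs O(n log n), which is why counting is usually preferred.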

🧠 Quick check

What is the most efficient way to check whether two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regular expressions (regex) let you describe patterns instead of exact text:

| Pattern | Matches | Use case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Email addresses (simple forms) | Data extraction |
| (cat\|dog\|bird) | Any of the three words | Classification keywords |

AI data pipelines use regex extensively for data cleaning: removing HTML tags, extracting dates, standardising phone numbers, and filtering unwanted characters before training.
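The patterns from the table can be tried with Python's standard `re` module. The sample sentence, the HTML snippet, and the tag-stripping pattern `<[^>]+>` are assumptions made for this sketch. The tag pattern is a rough cleaning heuristic, not a full HTML parser:

```python
import re

text = "Contact Alice at alice@example.com or call 555 0199 before 5pm."

numbers = re.findall(r"\d+", text)                 # runs of digits
capitalised = re.findall(r"[A-Z][a-z]+", text)     # capitalised words
emails = re.findall(r"\b\w+@\w+\.\w+\b", text)     # simple email shapes only

# Alternation: which of the keywords appears first?
keyword = re.search(r"(cat|dog|bird)", "my dog barks").group(1)

# Data cleaning: strip anything that looks like an HTML tag.
html = "<p>Hello <b>world</b></p>"
no_tags = re.sub(r"<[^>]+>", "", html)

print(numbers, capitalised, emails, keyword, no_tags)
```

Note that `[A-Z][a-z]+` also matches "Contact" at the start of the sentence, which is a reminder that a capitalised word is not always a proper noun.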

💡

Regular expressions are powerful but tricky. A badly written regex can take exponentially long on some inputs, a problem known as "catastrophic backtracking". Always test your patterns on edge cases.
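A small, deliberately limited demonstration of the problem using Python's `re` module. The pattern `(a+)+$` is a commonly cited risky shape (nested quantifiers), and `a+$` is an equivalent rewrite for this input. The input length of 18 is an assumption chosen to keep the run short. In a backtracking engine, each extra "a" roughly doubles the work for the risky pattern:

```python
import re
import time

risky = re.compile(r"(a+)+$")   # nested quantifiers: many ways to split the run of a's
safer = re.compile(r"a+$")      # accepts the same strings with a single quantifier

text = "a" * 18 + "b"           # the trailing "b" forces every attempt to fail

def timed_match(pattern, subject):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start

risky_result, risky_seconds = timed_match(risky, text)
safer_result, safer_seconds = timed_match(safer, text)
print(risky_result, safer_result, risky_seconds, safer_seconds)
```

Both patterns correctly report no match; the difference is only in how long they take to find that out, so timings will vary by machine and engine.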

Real-World AI Text Processing Pipeline

A simplified view of how AI processes text:

Raw text → "The café's Wi-Fi isn't working!!!"
Lowercasing → "the café's wi-fi isn't working!!!"
Removing punctuation → "the cafés wifi isnt working"
Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
Into the model → numbers the AI can actually process

Every step involves string operations: slicing, searching, replacing, splitting.
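The steps above can be sketched in Python as follows. This is a simplification under stated assumptions: the vocabulary and IDs are hypothetical, and a plain whitespace split stands in for sub-word tokenisation, so the tokens are whole words rather than the finer pieces shown in the walkthrough:

```python
import string

# Hypothetical vocabulary; IDs are arbitrary and "<unk>" covers unseen words.
VOCAB = {"<unk>": 0, "the": 1, "cafés": 2, "wifi": 3, "isnt": 4, "working": 5}

def preprocess(raw_text):
    """Lowercase, strip punctuation, split into tokens, and map tokens to IDs."""
    lowered = raw_text.lower()
    # Drop ASCII punctuation; accented letters such as "é" are kept.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace split stands in for sub-word tokenisation
    ids = [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]
    return lowered, no_punct, tokens, ids

lowered, no_punct, tokens, ids = preprocess("The café's Wi-Fi isn't working!!!")
print(lowered)
print(no_punct)
print(tokens)
print(ids)
```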

🤔

Think about it:

When you message a chatbot in a language that does not put spaces between words (such as Chinese or Japanese), how might tokenisation work differently? What extra challenges does that create?

🤯

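OpenAI's tokeniser splits "tokenisation" into ["token", "isation"], two tokens. The American spelling "tokenization" becomes ["token", "ization"]. The same concept maps to different tokens depending on spelling, and for other words a spelling variant can also change the token count and therefore the cost!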

🧠 Quick check

Why is text tokenised before it is fed to an AI model in a text processing pipeline?

Key Takeaways

Strings are sequences of characters, the raw material for all text-based AI.
Tokenisation splits text into processable pieces and bridges human language and machine learning.
Pattern matching and regex are essential tools for cleaning and extracting data.
Classic string problems (reversal, palindromes, anagrams) build the thinking skills needed for AI text processing.
