AI Sketch • Intermediate ⏱️ 15 min read

Strings and Text Processing

Text Is Everywhere in AI

Before an AI can understand your question, write an essay, or translate a sentence, it has to process raw text. Every chatbot, search engine, and language model starts with strings: sequences of characters that represent human language.

Understanding strings opens the door to natural language processing, one of the most exciting areas of AI.

String Basics

A string is a sequence of characters (letters, digits, spaces, symbols) stored in order.

message = "Hello, World!"

Each character has a position (index), just like an element of an array:

index:  0  1  2  3  4  5  6  7  8  9  10  11  12
char:   H  e  l  l  o  ,     W  o  r   l   d   !

Figure: a string broken into individual characters, with each character's index position below it. Internally, a string is an array of characters, each with its own index.

Substrings

A substring is a piece of a string. From "Hello, World!" you can extract "World" (indices 7 to 11). AI systems extract substrings constantly: names from sentences, hashtags from tweets, URLs from web pages.
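As a minimal sketch of indexing and substring extraction in Python, the snippet below reuses the "Hello, World!" example. The tweet text is made up for illustration.

```python
message = "Hello, World!"

# Indexing: positions start at 0, and negative indices count from the end.
first_char = message[0]      # "H"
last_char = message[-1]      # "!"

# Slicing: the end index is exclusive, so indices 7 to 11 are written 7:12.
world = message[7:12]        # "World"

# Extracting hashtags from a made-up tweet by splitting on whitespace.
tweet = "Learning #python and #nlp today"
hashtags = [word for word in tweet.split() if word.startswith("#")]

print(first_char, last_char, world, hashtags)
```

Note that the slice end is one past the last index you want, which is a common source of off-by-one errors.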

Immutability

In many languages (Python, Java, and JavaScript among them) strings are immutable: you cannot change a character in place, so every modification creates a new string. This matters for performance: if an AI pipeline modifies text millions of times, building a new string each time can slow it down.
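The following sketch shows what immutability looks like in Python, along with one common way to avoid building many intermediate strings. The word list is an arbitrary example.

```python
message = "Hello, World!"

# Strings are immutable: assigning to an index raises TypeError.
try:
    message[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# To "change" a character, build a new string from slices of the old one.
changed = "J" + message[1:]   # "Jello, World!"

# Repeated concatenation may copy the growing string on each step.
# Collecting the pieces in a list and joining once avoids that.
pieces = []
for word in ["strings", "are", "immutable"]:
    pieces.append(word.upper())
joined = " ".join(pieces)     # "STRINGS ARE IMMUTABLE"

print(mutated, changed, joined)
```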

🤯

Shakespeare's complete works contain roughly 900,000 words. Large language models such as GPT-4 are trained on text datasets many thousands of times larger, on the order of hundreds of billions of words (exact figures are not public), all processed as strings before being converted to numbers.

How ChatGPT Reads Text: Tokenisation

AI models do not read words the way humans do. They use tokenisation: breaking text into smaller pieces called tokens.

Input:  "unhappiness"
Tokens: ["un", "happiness"]

Input:  "ChatGPT is brilliant"
Tokens: ["Chat", "G", "PT", " is", " brilliant"]

Tokenisation sits between character-level and word-level processing. It breaks rare words into known sub-pieces, which keeps the vocabulary manageable.
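To make the idea of sub-word pieces concrete, here is a toy tokeniser that greedily takes the longest known piece at each position. This is only a sketch: production tokenisers learn their vocabulary from data (for example with byte-pair encoding) and do not work exactly like this. The function `tokenise` and the vocabulary `TOY_VOCAB` are hypothetical names invented for the example.

```python
def tokenise(word, vocab):
    """Split word into the longest known pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is in vocab.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


# Hand-written toy vocabulary (an assumption, not a real model's vocabulary).
TOY_VOCAB = {"un", "happiness", "happy", "ness", "token", "isation"}

print(tokenise("unhappiness", TOY_VOCAB))
print(tokenise("tokenisation", TOY_VOCAB))
```

A word that is not in the vocabulary as a whole still gets represented, because it falls back to smaller pieces and ultimately to single characters.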

🤔

Think about it:

When you type a long, unusual word such as "antidisestablishmentarianism" into ChatGPT, the model breaks it into familiar sub-word tokens. Why is this better than storing every possible English word as a separate token?

Why Tokenisation Matters

A typical language model has a vocabulary of roughly 50,000–100,000 tokens.

Each token maps to a number (an ID), and these IDs are what the model actually processes.

How text is tokenised affects cost: more tokens mean more computation and higher API fees.
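A small sketch of the token-to-ID mapping and of why token count drives cost. The vocabulary, the `<unk>` fallback token, and the price are all hypothetical values chosen for illustration; real models use far larger vocabularies and their own pricing.

```python
# Hypothetical vocabulary: each token gets an integer ID.
TOKEN_TO_ID = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}


def encode(tokens):
    """Map each token to its ID, using <unk> for tokens not in the vocabulary."""
    return [TOKEN_TO_ID.get(token, TOKEN_TO_ID["<unk>"]) for token in tokens]


def estimate_cost(token_count, price_per_1000_tokens):
    """Cost grows linearly with the number of tokens (price is made up)."""
    return token_count / 1000 * price_per_1000_tokens


tokens = "the cat sat on the mat".split()
ids = encode(tokens)
cost = estimate_cost(len(ids), price_per_1000_tokens=0.5)

print(ids, cost)
```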

Pattern Matching - Needles in Haystacks

A core string operation is finding a pattern in text. Does an email contain the word "urgent"? Does a piece of code contain a security vulnerability?

Simple Approach

Slide the pattern along the text and compare character by character at each position:

text:    "the cat sat on the mat"
pattern: "cat"

Position 0: "the" → no match
Position 1: "he " → no match
Position 2: "e c" → no match
Position 3: " ca" → no match
Position 4: "cat" → match found at index 4!

The worst case is O(n × m), where n is the text length and m is the pattern length. That is fine for short patterns, but scanning millions of documents calls for smarter approaches.
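The sliding comparison described above can be sketched as follows. `naive_find` is a hypothetical name for this example; in everyday Python you would normally call the built-in `str.find` instead.

```python
def naive_find(text, pattern):
    """Return the first index where pattern occurs in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    # Try every start position where the pattern could still fit.
    for start in range(n - m + 1):
        matched = 0
        # Compare character by character until a mismatch or a full match.
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


print(naive_find("the cat sat on the mat", "cat"))
```

The outer loop runs up to n times and the inner loop up to m times, which is where the O(n × m) worst case comes from.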

🧠 Quick check

Why is naive pattern matching slow on very large texts?

Classic String Challenges

String Reversal

Reversing a string is simple, but it reveals how you think about data:

reverse("hello") → "olleh"
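Two possible ways to write `reverse` in Python are sketched below: the idiomatic slice, and a manual two-pointer swap that shows what happens underneath. `reverse_manual` is a name introduced only for this example.

```python
def reverse(text):
    """Idiomatic reversal: a slice with step -1 walks the string backwards."""
    return text[::-1]


def reverse_manual(text):
    """Swap characters from both ends, moving towards the middle."""
    chars = list(text)  # strings are immutable, so work on a list copy
    left, right = 0, len(chars) - 1
    while left < right:
        chars[left], chars[right] = chars[right], chars[left]
        left += 1
        right -= 1
    return "".join(chars)


print(reverse("hello"), reverse_manual("hello"))
```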

Palindrome Detection

A palindrome reads the same forwards and backwards: "racecar", "madam", "level".

def is_palindrome(text):
    # A palindrome equals its own reverse; text[::-1] builds the reversed copy.
    return text == text[::-1]
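An alternative sketch that avoids building a reversed copy: compare characters from both ends and stop at the first mismatch. The name `is_palindrome_two_pointer` is made up for this example.

```python
def is_palindrome_two_pointer(text):
    """Check for a palindrome without allocating a reversed string."""
    left, right = 0, len(text) - 1
    while left < right:
        if text[left] != text[right]:
            return False  # early exit on the first mismatch
        left += 1
        right -= 1
    return True


for word in ["racecar", "madam", "level", "hello"]:
    print(word, is_palindrome_two_pointer(word))
```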

Anagram Detection

Two words are anagrams if they contain the same characters in a different order: "listen" and "silent".

from collections import Counter

def are_anagrams(word1, word2):
    return Counter(word1) == Counter(word2)

Counter is a hash map from each character to its frequency, so the check runs in O(n) time.
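To show what the hash-map counting does step by step, here is a hand-written version using a plain dictionary. `character_counts` and `are_anagrams_manual` are illustrative helpers, not library functions.

```python
def character_counts(word):
    """Build a hash map from each character to how many times it occurs."""
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def are_anagrams_manual(word1, word2):
    """Anagrams have identical character counts; each word is scanned once."""
    if len(word1) != len(word2):
        return False  # different lengths can never be anagrams
    return character_counts(word1) == character_counts(word2)


print(character_counts("listen"))
print(are_anagrams_manual("listen", "silent"))
```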

🧠 Quick check

What is the most efficient way to check whether two words are anagrams?

Regular Expressions - Pattern Matching on Steroids

Regex lets you describe patterns instead of exact text:

| Pattern | Matches | Use Case |
|---------|---------|----------|
| \d+ | One or more digits | Extracting numbers from text |
| [A-Z][a-z]+ | A capitalised word | Finding proper nouns |
| \b\w+@\w+\.\w+\b | Simple email addresses | Data extraction |
| (cat\|dog\|bird) | Any one of the three words | Classification keywords |
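The patterns in the table can be tried out with Python's standard `re` module, as in this sketch. The sample sentence is invented, and the email pattern is the simplified one from the table, not a full email validator.

```python
import re

# Made-up sample text for illustration.
text = "Contact Alice at alice@example.com or call 555 0199 about the cat."

numbers = re.findall(r"\d+", text)               # runs of digits
capitalised = re.findall(r"[A-Z][a-z]+", text)   # capital letter + lowercase letters
emails = re.findall(r"\b\w+@\w+\.\w+\b", text)   # simplified email shape
animals = re.findall(r"(cat|dog|bird)", text)    # any one of the alternatives

print(numbers)
print(capitalised)
print(emails)
print(animals)
```

Note that `[A-Z][a-z]+` also picks up "Contact", a sentence-initial word rather than a proper noun, which is a reminder to test patterns against realistic input.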

AI data pipelines use regex extensively for data cleaning: stripping HTML tags, extracting dates, standardising phone numbers.

💡

Regex is powerful but tricky. On some inputs a poorly written regex can take exponentially long to run, a problem known as "catastrophic backtracking". Always test your patterns on edge cases.

Real-World AI Text Processing Pipeline

A simplified view of how AI processes text:

Raw text → "The café's Wi-Fi isn't working!!!"
Lowercasing → "the café's wi-fi isn't working!!!"
Punctuation removal → "the cafés wifi isnt working"
Tokenisation → ["the", "café", "s", "wi", "fi", "isn", "t", "working"]
Token IDs → [1, 8432, 82, 5901, 3344, 2817, 83, 1562]
Into the model → numbers the AI can actually process

Every step involves string operations: slicing, searching, replacing, splitting.
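The pipeline steps can be sketched in a few lines of Python. This version makes two simplifying assumptions: it splits on whitespace instead of using sub-word tokens, and it assigns IDs in order of first appearance, so its tokens and IDs will not match the illustrative numbers shown in the lesson. `preprocess` and `build_vocab` are hypothetical helper names.

```python
import re


def preprocess(raw_text):
    """Lowercase, strip punctuation, and split into whitespace tokens."""
    lowered = raw_text.lower()
    # Remove every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    tokens = no_punct.split()
    return lowered, no_punct, tokens


def build_vocab(tokens):
    """Assign each distinct token an ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


lowered, no_punct, tokens = preprocess("The café's Wi-Fi isn't working!!!")
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]

print(lowered)
print(no_punct)
print(tokens)
print(token_ids)
```

In Python 3, `\w` is Unicode-aware for string patterns, which is why the accented "é" survives the punctuation step.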

🤔

Think about it:

When you message a chatbot in a language that does not put spaces between words (such as Chinese or Japanese), how would tokenisation differ? What extra challenges would come up?

🤯

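OpenAI's tokeniser splits "tokenisation" into two tokens, ["token", "isation"], while the American spelling "tokenization" becomes ["token", "ization"]. The concept is the same, but the token IDs differ, and for some spelling pairs even the token count (and therefore the cost) changes. Exact splits depend on the tokeniser version.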

🧠 Quick check

Why is text tokenised before it is fed to an AI model in a text processing pipeline?

Key Takeaways

Strings are sequences of characters, the raw material of text-based AI.
Tokenisation is the bridge between human language and machine learning.
Pattern matching and regex are essential tools for data cleaning and extraction.
Classic string problems (reversal, palindromes, anagrams) build the thinking skills that AI text processing requires.
