AI 草图 • Intermediate • ⏱️ 15 min read

Arrays & Hash Maps

Why Data Storage Matters in AI

Every AI system needs to store and retrieve data - quickly. Whether it's a list of pixel values in an image, a vocabulary of 50,000 words, or millions of user preferences, the choice of data structure determines how fast your AI can think.

Two structures dominate: arrays and hash maps. Master these, and you've got the foundation for nearly every AI pipeline.

Arrays - Ordered Lists with Instant Access

An array is simply a numbered list of items stored side by side in memory. Each item has an index - its position in the list, starting from zero.

index:   0       1       2       3       4
value: ["cat", "dog", "bird", "fish", "frog"]

Because items sit next to each other, you can jump straight to any position. Want item 3? Done - no searching required. That's O(1) access, meaning it takes the same time whether the array holds 10 items or 10 million.

[Figure: an array of five elements with indices 0 through 4, showing direct access to index 3. Arrays let you jump directly to any position using its index.]
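
In Python, a list plays the role of an array here. A quick sketch using the animal names above:

```python
# A five-element array (Python list) with indices 0 through 4.
animals = ["cat", "dog", "bird", "fish", "frog"]

# Jump straight to index 3 -- no scanning, O(1) whatever the length.
print(animals[3])  # fish
```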

When Arrays Shine

  • Feature vectors: An image might be represented as an array of 784 numbers (28×28 pixels), each holding a brightness value.
  • Embeddings: Language models store word meanings as arrays of hundreds of floating-point numbers.
  • Batch processing: Training data is loaded into arrays so the GPU can process thousands of examples simultaneously.
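
The feature-vector bullet can be made concrete: a 28×28 image flattened into one array of 784 numbers is still addressable by row and column with a little index arithmetic (a sketch; the brightness value is made up):

```python
# A 28x28 grayscale image stored as one flat array of 784 brightness values.
WIDTH = 28
pixels = [0.0] * (28 * 28)

# Write the pixel at row 5, column 10: flat index = row * width + column.
pixels[5 * WIDTH + 10] = 0.8

# Read it back with the same formula -- still O(1).
print(pixels[5 * WIDTH + 10])  # 0.8
```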

The Catch

Inserting or removing items in the middle is expensive. Every item after the change must shuffle along. That's O(n) - the more items, the longer it takes.
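
You can watch the shuffle happen with Python's `list.insert` (a small sketch with hypothetical track names):

```python
songs = ["track0", "track1", "track2", "track3"]

# Insert at position 1: every later song shifts one slot to the right -- O(n).
songs.insert(1, "new_track")
print(songs)  # ['track0', 'new_track', 'track1', 'track2', 'track3']
```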

🤔
Think about it:

If you had a playlist of 10,000 songs and wanted to insert a new track at position 5, every song from position 5 onwards would need to shift. How might a streaming service handle this without slowing down?

Hash Maps - Instant Lookups by Name

A hash map (also called a dictionary or hash table) stores data as key-value pairs. Instead of accessing items by index number, you access them by a meaningful key.

word_counts = {
  "hello": 42,
  "world": 37,
  "AI": 156
}

Need the count for "AI"? The hash map uses a hash function to convert the key into an index behind the scenes. The result? O(1) average lookup time - just like arrays, but using names instead of numbers.
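
With a real Python dictionary like the `word_counts` example above, that lookup is a single expression (a self-contained sketch that restates the dictionary so it runs on its own):

```python
word_counts = {"hello": 42, "world": 37, "AI": 156}

# Behind the scenes, Python hashes the key "AI" to find its slot -- average O(1).
print(word_counts["AI"])       # 156

# Membership checks by key are just as fast.
print("hello" in word_counts)  # True
```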

🤯

Python's dictionaries are hash maps. When ChatGPT counts word frequencies during training, it uses hash-map-like structures to track billions of word occurrences across the entire internet's text.

Real AI Uses for Hash Maps

  • Vocabulary mapping: Turning words like "brilliant" into token ID 8921 for a language model.
  • Frequency counting: How often does each word appear in a dataset? Hash maps answer this in one pass.
  • Caching results: If an AI has already computed a prediction for input X, store it so you never recompute.
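
The caching bullet can be sketched with a plain dictionary (here `slow_predict` is a made-up stand-in for an expensive model call):

```python
cache = {}

def slow_predict(x):
    # Stand-in for an expensive computation, e.g. a model forward pass.
    return x * x

def cached_predict(x):
    # Serve a remembered answer if we've seen this input before...
    if x in cache:
        return cache[x]
    # ...otherwise compute once and store the result for next time.
    result = slow_predict(x)
    cache[x] = result
    return result

print(cached_predict(7))  # 49 -- computed
print(cached_predict(7))  # 49 -- returned straight from the cache
```

Python's standard library packages this same pattern as the `functools.lru_cache` decorator.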

Time Complexity Made Simple

| Operation | Array | Hash Map |
|-----------|-------|----------|
| Access by index | O(1) ⚡ | N/A |
| Access by key | O(n) 🐢 | O(1) ⚡ |
| Insert at end | O(1) ⚡ | O(1) ⚡ |
| Insert in middle | O(n) 🐢 | N/A |
| Search for an item | O(n) 🐢 | O(1) ⚡ (by key) |

Think of O(1) as "instant, no matter the size" and O(n) as "slower the bigger the data."

🧠 Quiz

You have 100,000 user profiles and need to find a user by their username. Which structure is fastest?

Common Pattern: Frequency Counting

One of the most useful patterns in both interviews and AI is counting occurrences. Here's the idea in Python:

text = "to be or not to be"
counts = {}
for word in text.split():
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1

This single pass through the text gives you every word's frequency. Language models use exactly this approach (at massive scale) to understand which words matter most.
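
Python's standard library ships this exact pattern as `collections.Counter`:

```python
from collections import Counter

text = "the cat sat on the mat the end"
counts = Counter(text.split())

print(counts["the"])          # 3
print(counts.most_common(1))  # [('the', 3)]
```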

Common Pattern: The Two-Sum Problem

Given an array of numbers and a target, find two numbers that add up to the target. The naive approach checks every pair - O(n²). The clever approach uses a hash map:

def two_sum(numbers, target):
    seen = {}  # number -> index where it appeared
    for current_index, number in enumerate(numbers):
        complement = target - number
        if complement in seen:
            return [seen[complement], current_index]
        seen[number] = current_index
    return None  # no pair adds up to the target

One pass, O(n) time. The hash map remembers what you've already seen.

🧠 Quiz

Why is the hash map approach to two-sum faster than checking every pair?

When to Choose Which

  • Use arrays when order matters, you access items by position, or you need to process everything sequentially (like pixels in an image).
  • Use hash maps when you need fast lookups by a key, want to count occurrences, or need to check if something exists quickly.
🤔
Think about it:

A recommendation engine needs to check whether a user has already watched a film before suggesting it. Would you store the user's watch history in an array or a hash map? What trade-offs come to mind?

🧠 Quiz

An AI model stores word embeddings as arrays of 300 numbers each. Why are arrays a good choice here?

Key Takeaways

  • Arrays give O(1) access by index and are the backbone of numerical AI (images, embeddings, tensors).
  • Hash maps give O(1) access by key and are essential for lookups, counting, and caching.
  • Choosing the right structure can turn an O(n²) algorithm into an O(n) one - a difference that matters when processing millions of data points.
  • In practice, most AI pipelines use both: arrays for numerical computation and hash maps for metadata lookups.