Every API you use, whether OpenAI, Stripe, or GitHub, has a rate limiter guarding it. Without rate limiting, a single bad actor (or a buggy script) can bring down an entire service.
Rate limiting isn't just about saying "no"; it's about fairness. It ensures every user gets their share of a finite resource: compute, bandwidth, or AI inference time.
| Scenario | Without Rate Limiting | With Rate Limiting |
|---|---|---|
| DDoS attack | Service goes down | Attackers get 429, legitimate users unaffected |
| Runaway script | 10K requests/sec burns through quota | Capped at allowed rate, fails gracefully |
| AI API abuse | One user hogs all GPUs | Fair share across all users |
| Cost control | $50K surprise cloud bill | Predictable, budgeted spend |
| Cascading failures | Backend overload becomes total outage | Backpressure protects downstream services |
| Provider | Free Tier | Paid Tier | Unit |
|---|---|---|---|
| OpenAI | 3 RPM / 200 RPD | 10,000 RPM | Requests per minute |
| Claude | 50 RPM | 4,000 RPM | Requests per minute |
| Gemini | 15 RPM | 1,000 RPM | Requests per minute |
| Stability AI | 10 req/10sec | 150 req/10sec | Per time window |
OpenAI doesn't just rate limit by requests: they also limit by tokens per minute (TPM). A single request using GPT-4 with a 128K context window consumes far more resources than a short GPT-3.5 call. This is why they have dual rate limits: RPM (requests) and TPM (tokens).
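To see how dual limits interact, here is a minimal sketch using per-minute counters; the `DualLimit` class and its numbers are illustrative, not OpenAI's actual mechanism:

```python
import time

class DualLimit:
    """Track both requests and tokens in the same one-minute window.
    A request is admitted only if BOTH budgets have room."""
    def __init__(self, rpm: int, tpm: int):
        self.rpm, self.tpm = rpm, tpm
        self.requests = 0
        self.tokens = 0
        self.window_start = time.time()

    def allow(self, prompt_tokens: int) -> bool:
        now = time.time()
        if now - self.window_start >= 60:
            # New minute: reset both budgets
            self.requests, self.tokens = 0, 0
            self.window_start = now
        if self.requests + 1 > self.rpm or self.tokens + prompt_tokens > self.tpm:
            return False
        self.requests += 1
        self.tokens += prompt_tokens
        return True

limiter = DualLimit(rpm=3, tpm=1000)
print(limiter.allow(400))  # True  (1 request, 400 tokens used)
print(limiter.allow(700))  # False (would exceed 1,000 TPM despite RPM headroom)
```

Note that either budget alone can reject a request: a single huge prompt can be blocked even when the request count is nowhere near its limit.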
Token Bucket is the most widely used algorithm: simple, memory-efficient, and it allows controlled bursts.
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # max burst size
        self.tokens = capacity          # current tokens
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow_request(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # 200 OK
        return False      # 429 Too Many Requests

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

# Usage: bursts of up to 10, 5 req/sec sustained
limiter = TokenBucket(capacity=10, refill_rate=5)
```
| Property | Value | Meaning |
|---|---|---|
| Memory | O(1) per user | Just 2 numbers: tokens + timestamp |
| Burst handling | Allows short bursts | Up to capacity requests instantly |
| Sustained rate | refill_rate req/sec | Long-term average throughput |
| Precision | Approximate | Depends on refill granularity |
Why does the Token Bucket algorithm allow bursts? Think about real user behaviour: you might load a page and fire 10 API calls simultaneously, then nothing for 2 seconds. A strict "5 per second" limit would reject 5 of those initial calls, even though your average rate is well within limits.
The Sliding Window Counter is more precise than fixed windows and uses less memory than a sliding window log.
Fixed window problem: the boundary spike.

```
Window 1 (0:00-1:00): ....xxxxxxxx  (8 requests in the last 10 sec)
Window 2 (1:00-2:00): xxxxxxxx....  (8 requests in the first 10 sec)
                      ^^^^^^^^
16 requests in 20 seconds! (The limit was 10/minute, bypassed at the boundary.)
```
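The spike is easy to reproduce in code. Below is a minimal fixed-window counter, written for illustration with an injectable clock so the boundary can be simulated deterministically:

```python
class FixedWindowCounter:
    """Naive fixed-window limiter: reset the count at each window boundary."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_id = None

    def allow_request(self, now: float) -> bool:
        wid = int(now // self.window)
        if wid != self.window_id:
            self.window_id = wid   # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# Reproduce the boundary spike with a 10/minute limit
fw = FixedWindowCounter(limit=10, window_seconds=60)
late  = sum(fw.allow_request(50 + i * 0.1) for i in range(10))  # 0:50-0:51
early = sum(fw.allow_request(60 + i * 0.1) for i in range(10))  # 1:00-1:01
print(late + early)  # 20 requests accepted in about 11 seconds, double the limit
```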
Sliding Window Solution: Weight by position in window
```python
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.previous_count = 0
        self.current_count = 0
        self.window_start = int(time.time())

    def allow_request(self) -> bool:
        now = int(time.time())
        elapsed = now - self.window_start
        if elapsed >= self.window:
            # If more than one full window passed idle, the previous
            # window actually saw no requests at all
            self.previous_count = self.current_count if elapsed < 2 * self.window else 0
            self.current_count = 0
            self.window_start = now
            elapsed = 0
        # Weighted count: blend previous + current window
        weight = 1 - (elapsed / self.window)
        estimated_count = (self.previous_count * weight) + self.current_count
        if estimated_count < self.limit:
            self.current_count += 1
            return True
        return False
```
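To make the weighted estimate concrete, here is a worked example with illustrative numbers (limit 100, 60-second window, 15 seconds into the current window):

```python
limit, window = 100, 60
previous_count, current_count = 80, 30
elapsed = 15

weight = 1 - (elapsed / window)   # 0.75: 75% of the sliding window
                                  # still overlaps the previous window
estimated = previous_count * weight + current_count
print(estimated)                  # 90.0, below 100, so the request is allowed
```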
| Algorithm | Memory | Burst Control | Precision | Best For |
|---|---|---|---|---|
| Token Bucket | O(1) | Allows bursts | Approximate | API gateways, general use |
| Leaky Bucket | O(1) | No bursts | Smooth | Network traffic shaping |
| Fixed Window | O(1) | Poor at boundaries | Low | Simple rate counting |
| Sliding Window Log | O(n) | Exact | Exact | Small-scale, audit logging |
| Sliding Window Counter | O(1) | Good | High | Production rate limiting |
In interviews, recommend Token Bucket or Sliding Window Counter. Token Bucket is used by AWS, Stripe, and Cloudflare; Sliding Window Counter is common in Redis-based implementations. Know both and be ready to explain the trade-offs.
Single-server rate limiting is easy. The real challenge is distributed rate limiting across multiple API servers.
The critical insight: checking the count and incrementing must be atomic. Otherwise, race conditions allow limit bypasses.
```lua
-- Redis Lua script: atomic sliding window rate limit
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])   -- seconds
local now = tonumber(ARGV[3])      -- milliseconds

-- Remove entries older than the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window * 1000)

-- Count requests in the current window
local count = redis.call('ZCARD', key)
if count < limit then
    -- Allow: record this request (random suffix avoids member
    -- collisions when two requests share the same timestamp)
    redis.call('ZADD', key, now, now .. '-' .. math.random(1000000))
    redis.call('EXPIRE', key, window)
    return 1  -- allowed
else
    return 0  -- rejected
end
```
| Layer | What It Limits | Example |
|---|---|---|
| CDN/Edge | IP-based, geographic | Cloudflare: 1000 req/min per IP |
| API Gateway | API key, route-based | Kong: 100 req/sec per API key |
| Application | User-based, resource-specific | App: 50 AI generations/day per user |
| Database | Connection pool, query rate | PgBouncer: max 100 connections |
If you have 10 API servers, each with a local rate limiter set to 100 req/sec, a user could potentially make 1,000 req/sec (100 on each server). How does shared state in Redis solve this? And what happens if Redis goes down: do you fail open (allow all) or fail closed (block all)?
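There is no single right answer, but a common pattern is to fail open with a local fallback: if the shared Redis check errors out, degrade to a coarse per-server counter so the API stays up yet retains some protection. A sketch (the `redis_check` callable and `LocalFallbackLimiter` class are illustrative stand-ins, not a specific library's API):

```python
import time

class LocalFallbackLimiter:
    """Fail-open wrapper: use the shared (Redis) check when it works,
    fall back to a coarse local fixed-window counter when it raises."""
    def __init__(self, redis_check, local_limit: int, window_seconds: int = 1):
        self.redis_check = redis_check   # callable -> bool, may raise
        self.local_limit = local_limit
        self.window = window_seconds
        self.local_count = 0
        self.window_start = time.time()

    def allow_request(self) -> bool:
        try:
            return self.redis_check()
        except Exception:
            # Redis unreachable: degrade to the local counter
            now = time.time()
            if now - self.window_start >= self.window:
                self.window_start = now
                self.local_count = 0
            if self.local_count < self.local_limit:
                self.local_count += 1
                return True
            return False

def broken_redis() -> bool:
    raise ConnectionError("redis down")

limiter = LocalFallbackLimiter(broken_redis, local_limit=2)
print([limiter.allow_request() for _ in range(3)])  # [True, True, False]
```

The trade-off: during an outage each server enforces its limit independently, so the effective global limit temporarily inflates by the server count, which is usually preferable to rejecting all traffic.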
Every well-designed API returns rate limit info in response headers:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 955
X-RateLimit-Reset: 1672531200
X-RateLimit-Policy: "1000;w=3600"
```

Or, when the client is limited:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1672531200
Content-Type: application/json

{"error": "rate_limit_exceeded", "message": "Try again in 30 seconds"}
```
Always include Retry-After in 429 responses. Good clients use it for exponential backoff. Without it, clients retry immediately, making the overload worse.
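On the client side, a retry loop that honours Retry-After and falls back to exponential backoff with jitter might look like this (a sketch; `send` is a stand-in for your actual HTTP call, returning a status, headers, and body):

```python
import random
import time

def request_with_backoff(send, max_retries: int = 5):
    """Retry on 429, honouring Retry-After when present, otherwise
    backing off exponentially with jitter."""
    for attempt in range(max_retries):
        status, headers, body = send()
        if status != 429:
            return status, body
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)   # the server knows best
        else:
            # Exponential backoff, capped, with jitter to avoid
            # synchronized retry storms across clients
            delay = min(2 ** attempt, 30) * (0.5 + random.random() / 2)
        time.sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")

# Simulated server: one 429 (with Retry-After), then success
responses = iter([
    (429, {"Retry-After": "0"}, ""),
    (200, {}, "ok"),
])
print(request_with_backoff(lambda: next(responses)))  # (200, 'ok')
```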
Traditional rate limiters use static rules. Modern systems use ML to adapt limits in real-time.
```python
# Features for bot detection
features = {
    "request_interval_stddev": 0.002,    # bots: very consistent timing
    "user_agent_entropy": 1.2,           # bots: low-entropy UA strings
    "mouse_movement_score": 0.0,         # bots: no mouse events
    "js_fingerprint_valid": False,       # bots: often no JS execution
    "geo_request_pattern": "datacenter", # bots: originate from cloud IPs
    "session_depth": 1,                  # bots: single-page visits
}
# Model: Gradient Boosted Trees (XGBoost)
# Output: probability of being a bot -> adjust rate limit dynamically
```
Instead of fixed limits, ML models adjust thresholds based on observed behaviour. For example:

- Normal pattern: user makes 10-20 req/min, steady throughout the day
- Anomaly: the same user suddenly makes 500 req/min at 3 AM
- Action: auto-throttle and alert the security team
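The "auto-throttle on anomaly" step can be sketched with a rolling baseline and a z-score threshold; production systems use far richer models, and the class and thresholds below are illustrative only:

```python
from statistics import mean, stdev

class AnomalyThrottle:
    """Flag a user when their per-minute request count deviates far
    from their own rolling baseline (z-score above a threshold)."""
    def __init__(self, threshold: float = 3.0, history_size: int = 60):
        self.threshold = threshold
        self.history_size = history_size
        self.history = []

    def observe(self, requests_per_min: int) -> bool:
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= 10:   # need a baseline first
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and (requests_per_min - mu) / sigma > self.threshold:
                anomalous = True      # e.g. throttle + alert here
        if not anomalous:
            # Only normal samples feed the baseline
            self.history = (self.history + [requests_per_min])[-self.history_size:]
        return anomalous

throttle = AnomalyThrottle()
for rpm in [12, 15, 18, 14, 16, 13, 17, 15, 14, 16]:  # normal daytime traffic
    throttle.observe(rpm)
print(throttle.observe(500))  # True: 500 req/min vs a ~15 req/min baseline
```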
Cloudflare's Bot Management uses ML models that analyse over 30 signals per request, including TLS fingerprinting, HTTP/2 frame ordering, and JavaScript execution patterns. They classify 40% of internet traffic as automated bots.
| Concern | Decision |
|---|---|
| Algorithm | Token Bucket (burst-friendly) or Sliding Window Counter (precise) |
| Storage | Redis (distributed) or local memory (single server) |
| Granularity | Per user? Per IP? Per API key? Per endpoint? |
| Fail mode | Fail open (allow) or fail closed (reject)? |
| Headers | Include X-RateLimit-* and Retry-After |
| Monitoring | Track 429 rate, false positives, latency added |
| Client SDK | Provide retry logic with exponential backoff |
| Multi-tier | Edge (CDN) + Gateway + Application layer |
Design a rate limiter for an AI code completion API (like GitHub Copilot). Users type code and get suggestions in real-time. How would you handle the bursty nature of typing? Would you rate limit per keystroke, per completion request, or per accepted suggestion? What about token-based limits for the underlying LLM?