Modern applications send billions of notifications daily across multiple channels. Designing this system teaches you about message queues, third-party integrations, and handling failures gracefully at scale.
Caller Services → Notification Service → Validation + Preference Check
↓
Message Queue (Kafka)
↓         ↓         ↓         ↓
Push      Email     SMS       In-App
Worker    Worker    Worker    Worker
↓         ↓         ↓         ↓
FCM       SendGrid  Twilio    WebSocket
Store preferences per user and per notification type:
user_preferences:
  user_id: "u123"
  channels:
    email: { enabled: true, quiet_hours: "22:00-08:00" }
    push: { enabled: true }
    sms: { enabled: false }
  categories:
    marketing: { enabled: false }
    security: { enabled: true, override_quiet_hours: true }
The notification service checks preferences before enqueuing, so a notification the user has opted out of never reaches a delivery worker.
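As a rough illustration, the preference check could look like the sketch below. It reuses the shape of the `user_preferences` example; the function names `in_quiet_hours` and `should_send_now` are hypothetical, and treating a missing channel or category as disabled is an assumption, not something the design above specifies.

```python
from datetime import time

# Preferences mirroring the example record above.
PREFS = {
    "user_id": "u123",
    "channels": {
        "email": {"enabled": True, "quiet_hours": "22:00-08:00"},
        "push": {"enabled": True},
        "sms": {"enabled": False},
    },
    "categories": {
        "marketing": {"enabled": False},
        "security": {"enabled": True, "override_quiet_hours": True},
    },
}


def in_quiet_hours(window: str, now: time) -> bool:
    """Return True if `now` is inside an "HH:MM-HH:MM" window (may wrap past midnight)."""
    start_s, end_s = window.split("-")
    start, end = time.fromisoformat(start_s), time.fromisoformat(end_s)
    if start <= end:
        return start <= now < end
    # Window such as 22:00-08:00 spans midnight.
    return now >= start or now < end


def should_send_now(prefs: dict, channel: str, category: str, now: time) -> bool:
    """Decide whether a notification may be enqueued for this channel right now."""
    chan = prefs["channels"].get(channel, {})
    cat = prefs["categories"].get(category, {})
    # Assumption: anything not explicitly enabled is treated as disabled.
    if not chan.get("enabled", False) or not cat.get("enabled", False):
        return False
    window = chan.get("quiet_hours")
    if window and in_quiet_hours(window, now):
        # Only categories flagged override_quiet_hours get through.
        return cat.get("override_quiet_hours", False)
    return True
```

With these preferences, a security email at 23:00 is allowed because the category overrides quiet hours, while any SMS or marketing message is rejected. A `False` result during quiet hours could also mean "defer until the window ends" rather than "drop"; that policy is left to the caller.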
Not all notifications are equal. Use priority levels:
Why should the notification system use separate message queues per channel rather than a single shared queue?
Prevent duplicate notifications using an idempotency key. Before processing, check if the notification ID exists in a deduplication store (Redis with TTL):
Key: "dedup:{notification_id}" → TTL: 24 hours
If key exists → skip (already sent)
If not → process and set key
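The check and the set are best done as one atomic step, otherwise two workers can both see "key missing" and both send. In Redis this is typically a single `SET key value NX EX <ttl>` command. The sketch below uses a hypothetical in-memory stand-in (`InMemoryDedupStore`) so it runs without a Redis server; `process_once` and the injectable `clock` are illustrative names only.

```python
import time

DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as in the key layout above


class InMemoryDedupStore:
    """Stand-in for Redis; set_if_absent mimics `SET key value NX EX ttl`."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}  # key -> expiry timestamp
        self._clock = clock

    def set_if_absent(self, key: str, ttl_seconds: float) -> bool:
        now = self._clock()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # key exists and has not expired
        self._expiry[key] = now + ttl_seconds
        return True

    def delete(self, key: str) -> None:
        self._expiry.pop(key, None)


def process_once(store, notification_id: str, send) -> bool:
    """Send the notification unless its ID was already claimed. Returns True if sent."""
    key = f"dedup:{notification_id}"
    if not store.set_if_absent(key, DEDUP_TTL_SECONDS):
        return False  # already sent (or in flight): skip
    try:
        send(notification_id)
    except Exception:
        # Release the claim so a later retry is not blocked by a failed send.
        store.delete(key)
        raise
    return True
```

Claiming the key before sending favors "at most once per ID"; releasing it on failure lets the retry path try again. Whether to release on failure is a design choice, not something the scheme above prescribes.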
Third-party services fail. Implement retries with exponentially increasing delays (exponential backoff); in this schedule each wait is four times the previous one:
Attempt 1: immediate
Attempt 2: wait 1 second
Attempt 3: wait 4 seconds
Attempt 4: wait 16 seconds
Attempt 5: move to dead-letter queue
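A minimal sketch of that schedule follows. The names `deliver_with_retries`, `send`, and `dead_letter` are hypothetical callables supplied by the worker, and `sleep` is injectable only so the logic can be exercised without real waiting.

```python
import time

# Seconds to wait before attempts 1-4, taken from the schedule above.
BACKOFF_DELAYS = [0, 1, 4, 16]


def deliver_with_retries(send, message, dead_letter, sleep=time.sleep) -> bool:
    """Try to send up to four times, then hand the message to the dead-letter queue."""
    last_error = None
    for delay in BACKOFF_DELAYS:
        if delay:
            sleep(delay)
        try:
            send(message)
            return True
        except Exception as exc:  # a real worker would catch provider-specific errors
            last_error = exc
    # All four attempts failed: park the message for inspection or replay.
    dead_letter(message, last_error)
    return False
```

In practice a small random jitter is often added to each delay so that many failing messages do not all retry at the same instant.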
Protect users from notification fatigue:
| Channel | Provider | Protocol | Challenges |
|---------|----------|----------|------------|
| Push (iOS) | APNs | HTTP/2 | Token management, silent push limits |
| Push (Android) | FCM | HTTP | Topic-based broadcasting |
| Email | SendGrid/SES | SMTP/API | Deliverability, spam filters |
| SMS | Twilio | REST API | Cost per message, regional compliance |
| In-App | WebSocket | WS | Connection management, offline buffering |
Track the notification lifecycle for every message:
Store events in a time-series database or event stream for dashboard reporting: delivery rates, open rates, and click-through rates per channel and notification type.
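To make the reporting concrete, here is a small sketch that turns lifecycle events into per-channel rates. The event shape (`channel`, `status`) and the status names `sent`, `delivered`, `opened`, and `clicked` are assumptions for illustration, as are the denominators: teams define open and click-through rates differently.

```python
from collections import Counter, defaultdict


def channel_rates(events):
    """Aggregate lifecycle events into delivery, open, and click-through rates per channel."""
    counts = defaultdict(Counter)
    for event in events:
        counts[event["channel"]][event["status"]] += 1

    rates = {}
    for channel, c in counts.items():
        sent, delivered, opened = c["sent"], c["delivered"], c["opened"]
        rates[channel] = {
            # Assumed definitions: each rate is relative to the previous stage.
            "delivery_rate": delivered / sent if sent else 0.0,
            "open_rate": opened / delivered if delivered else 0.0,
            "click_through_rate": c["clicked"] / opened if opened else 0.0,
        }
    return rates
```

The same grouping could be extended with a notification-type key to report per channel and per type.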
A notification system retries a failed SMS delivery 3 times. Without deduplication, what could happen if the first attempt actually succeeded but the acknowledgement was lost?
| Bottleneck | Solution |
|-----------|----------|
| Third-party rate limits | Queue with controlled throughput per provider |
| Template rendering at scale | Pre-render and cache common templates |
| Preference lookups per message | Cache user preferences in Redis |
| Peak traffic spikes | Auto-scale workers based on queue depth |
| Cross-region delivery | Deploy workers close to provider endpoints |
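One common way to get controlled throughput per provider is a token bucket in front of each worker's outbound calls. The table does not name a specific algorithm, so this is a sketch of one option; the `TokenBucket` class, its parameters, and the injectable `clock` are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_sec` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller should wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A worker would keep one bucket per provider and, when `try_acquire()` returns `False`, leave the message on the queue and retry shortly instead of calling the provider.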
Which component should check user notification preferences?