Notification System Design 2026 — Fanout, Priority Queue, Idempotency, and Template Engine for Millions of Push/Email/SMS per Day
Posted on: 4/17/2026 2:10:41 AM
Table of contents
- 1. Why Notifications Are Harder Than You Think
- 2. Functional and Non-Functional Requirements
- 3. Channel Map and Channel Characteristics
- 4. Overall Architecture
- 5. Core Data Model
- 6. Ingest Pipeline — From API to Queue
- 7. Fanout — One Event, Many Deliveries
- 8. Template Engine — Separate Content from Code
- 9. Priority Queue and Back-Pressure
- 10. Retry, DLQ, and Self-Healing
- 11. Dedup, Suppression, and Per-User Rate Limiting
- 12. Quiet Hours, Timezones, and Localisation
- 13. Delivery Callback — The Truth Is at the Provider
- 14. In-App Inbox and Realtime via WebSocket
- 15. Observability — Metrics, Traces, and Analytics
- 16. Campaign Scheduler — Millions of Messages, Each User's Local Time
- 17. Security and Abuse Prevention
- 18. Realistic Capacity Planning
- 19. Case Studies — How the Big Players Solve It
- 20. Rollout Checklist for Your Team
- 21. Conclusion
- References
1. Why Notifications Are Harder Than You Think
At first glance, sending a notification is just calling a Firebase or Twilio API. But when your system has millions of users, dozens of event types, three or four concurrent channels (push, email, SMS, in-app, webhook), and must respect quiet-hours in each user's timezone, the picture turns into a complex distributed system with a distinctive set of constraints: per-channel latency varies wildly, per-channel cost varies wildly, failure modes vary wildly, and the hardest constraint of all is that you must never send duplicates, yet you must never forget a single one.
This article dives deep into Notification Service architecture at millions-of-users scale: from data model, ingest pipeline, multi-channel fanout, template engine, personalised rate-limiting, idempotency, retry and DLQ, through to campaign effectiveness observability and unsubscribe/quiet-hours enforcement. The illustrated stack is .NET 10 plus Vue/Nuxt for the admin dashboard, but the principles apply to any stack — Java Spring, Node.js, Go, or Python.
2. Functional and Non-Functional Requirements
Before sketching the architecture, you have to nail down what the system must do — and what it must not do. This is where many "ship fast" teams trip up: they only think about being able to send, not about not sending the wrong thing or not spamming the user.
| Requirement group | Content | Concrete example |
|---|---|---|
| Functional | Multi-channel delivery driven by domain events (order, payment, promo, system) | When an order transitions to Shipped, send push + email but not SMS |
| Template | Multi-locale templating with variables, A/B testing, localisation | Your order {{orderId}} has shipped, {{recipient.firstName}} |
| Priority | Priority tiers, transactional strictly separated from promotional | OTP P0, order updates P1, marketing P3 |
| Idempotency | Each event must be delivered exactly once, regardless of upstream retries | Producer replays the same idempotencyKey ⇒ one delivery only |
| Throughput | ≥ 50k notifications/s peak for push, ≥ 10k/s for email | Black Friday flash-sale push to 2 million users |
| Latency | OTP e2e ≤ 3s p99, order updates ≤ 30s, marketing unconstrained | International OTP SMS still under 5s |
| Reliability | No message loss on process crash, intelligent retry, replayable DLQ | Kafka replication 3, manual offset checkpointing |
| Privacy | Per-channel and per-category user opt-out, respect quiet hours and timezone | Never send promo push after 22:00 local time |
| Observability | Per-channel/template/campaign metrics, end-to-end trace, open/click rate | Grafana dashboard plus a data warehouse for marketing |
3. Channel Map and Channel Characteristics
Every channel has its own behaviour. A "one-size-fits-all" design collapses because SMS costs money per message but arrives with near certainty, while push may be free but can be silently dropped the moment a user disables notifications or turns off background app refresh. These differences must be encoded into your send and retry strategy.
| Channel | Typical provider | Latency | Cost/msg | Success rate | Quirks |
|---|---|---|---|---|---|
| Mobile push | APNs (iOS), FCM (Android, Web) | ~1s | Free | 80–95% | Token expiration, notifications disabled, silent fail common |
| Email | Amazon SES, SendGrid, Postmark, Mailgun | 2–30s | $0.0001–0.001 | 95–99% | Async bounce & complaint callbacks, IP/domain reputation matters |
| SMS | Twilio, Vonage, local carrier | 2–15s | $0.005–0.05 | 97–99% | Expensive, 160-char limit, per-country whitelisting required |
| In-app | In-house service + WebSocket/SSE | ~100ms | Internal infra | ~100% if online | Must persist inbox for offline users |
| Webhook | Customer-hosted endpoint | 100ms–10s | Internal infra | 90–99% | Endpoint out of your control, need retry plus signed payload |
| Chat | Slack, Teams, Zalo, Viber | 1–5s | Free/Bot | 95–99% | Strict rate limits, OAuth tokens need refresh |
Don't mix marketing and transactional in one pipeline
OTPs and "today's discount" have wildly different SLAs. If you share a queue, one marketing campaign of 2 million SMS can push the OTP p99 from 30 seconds to 5 minutes — long enough for a user to abandon their cart. Always split at least two lanes: transactional (high priority, never throttled) and marketing (low priority, throttled when the system is loaded).
4. Overall Architecture
The heart of a Notification Service is an event-driven pipeline with a clear path from domain event to each send channel. Every component must be idempotent, independently restartable, and measurable at every hop.
flowchart LR
subgraph Producers
ORD["Order Service"]
PAY["Payment Service"]
AUTH["Auth Service (OTP)"]
MKT["Marketing Campaign"]
end
Producers --> API["Notification API
(.NET 10 Minimal API)"]
API --> VAL["Validator + Dedup
(Redis idempotency cache)"]
VAL --> TOPIC{"Kafka topics"}
TOPIC -->|transactional| WKT["Transactional Worker Pool"]
TOPIC -->|marketing| WKM["Marketing Worker Pool"]
WKT --> FAN["Fanout & Preferences"]
WKM --> FAN
FAN --> TMPL["Template Engine"]
TMPL --> ROUTE{"Per-channel router"}
ROUTE --> APNs
ROUTE --> FCM
ROUTE --> SES["SES / SendGrid"]
ROUTE --> SMS["Twilio / Local SMS"]
ROUTE --> WS["WebSocket / SSE"]
ROUTE --> HOOK["Webhook dispatcher"]
APNs --> CB1["Delivery callback"]
FCM --> CB1
SES --> CB1
SMS --> CB1
CB1 --> LEDGER[("Delivery ledger
PostgreSQL / OLTP")]
CB1 --> ANALYTICS[("Analytics warehouse")]
A few important things about this diagram:
- Topics split by priority: at minimum two topics, `notif.transactional` and `notif.marketing`. Separate worker pools, separate resources.
- Fanout lives after ingest: the producer only needs to know that user X should be notified about event A. Fanning out to N channels × M devices happens inside the service — the producer shouldn't have to know how many devices the user owns.
- Dedicated delivery ledger: an authoritative history table — who was sent what, when, with what result. This table feeds debugging, the user-facing notification history UI, and compliance.
5. Core Data Model
A Notification Service data schema doesn't have many tables, but the relationships between them determine how well it scales later. Here's the minimal model — it can grow when you need A/B testing, campaign scheduling, journey orchestration…
erDiagram
USER ||--o{ DEVICE : owns
USER ||--o{ PREFERENCE : has
USER ||--o{ DELIVERY : receives
TEMPLATE ||--o{ CAMPAIGN : used_in
CAMPAIGN ||--o{ NOTIFICATION_EVENT : produces
NOTIFICATION_EVENT ||--o{ DELIVERY : fans_out_to
DEVICE ||--o{ DELIVERY : targeted_by
CHANNEL ||--o{ DELIVERY : uses
USER {
uuid id
string locale
string timezone
}
DEVICE {
uuid id
uuid user_id
string platform
string push_token
datetime last_seen
bool active
}
PREFERENCE {
uuid user_id
string category
string channel
bool enabled
time quiet_start
time quiet_end
}
TEMPLATE {
uuid id
string code
string channel
string locale
text body
json schema
}
NOTIFICATION_EVENT {
uuid id
string idempotency_key
string type
int priority
json payload
datetime created_at
}
DELIVERY {
uuid id
uuid event_id
uuid user_id
string channel
string provider_msg_id
string status
datetime sent_at
datetime delivered_at
}
Design notes:
- `NOTIFICATION_EVENT.idempotency_key` is the input-side dedup key, produced by the caller (e.g. `order:1234:shipped`). Insert with a UNIQUE constraint; on duplicate, return the existing event with a 2xx status instead of an error.
- `DELIVERY` is split from `NOTIFICATION_EVENT` so that one event can produce many deliveries (push to device A, push to device B, email). Each delivery has its own lifecycle: `queued → sent → delivered → opened → clicked`.
- `PREFERENCE` must be split down to category × channel. A user may want promo emails but not promo pushes. A single `is_subscribed` column is not enough.
- Sharding: the `DELIVERY` table grows fast. From day one, shard by `user_id` or partition by `created_at` (one partition per day). Don't wait until 500 million rows to deal with it.
6. Ingest Pipeline — From API to Queue
Producers call the Notification Service API instead of calling FCM/SES directly. This centralises control: rate limiting, templates, preferences, idempotency, and audit all live in one place. The .NET 10 Minimal API snippet below illustrates a minimal but complete ingest endpoint:
app.MapPost("/v1/notifications", async (
NotifyRequest req,
IValidator<NotifyRequest> validator,
IIdempotencyStore idem,
IPublisher publisher,
CancellationToken ct) =>
{
// 1. Validate payload
var result = await validator.ValidateAsync(req, ct);
if (!result.IsValid) return Results.ValidationProblem(result.ToDictionary());
    // 2. Idempotency: if the key was already seen, return the existing event.
    // Note: check-then-set is racy under concurrent duplicates; the UNIQUE
    // constraint on idempotency_key in the DB is the backstop (see below).
    var existing = await idem.GetAsync(req.IdempotencyKey, ct);
    if (existing is not null) return Results.Accepted($"/v1/notifications/{existing.Id}", existing);
// 3. Build event, pick topic by priority
var evt = NotificationEvent.Create(req);
var topic = req.Priority <= 1 ? "notif.transactional" : "notif.marketing";
// 4. Publish to Kafka (transactional producer, exactly-once)
await publisher.PublishAsync(topic, evt, ct);
// 5. Cache idempotency for 24h
await idem.SetAsync(req.IdempotencyKey, evt, TimeSpan.FromHours(24), ct);
return Results.Accepted($"/v1/notifications/{evt.Id}", evt);
})
.RequireAuthorization("NotificationProducer")
.WithName("SubmitNotification");
Idempotency done right
Redis `SET key value NX EX 86400` is enough on the hot path. If you need absolute certainty, pair it with a UNIQUE constraint in the DB — but avoid DB round-trips on the hot path; let the DB only catch duplicates that Redis missed during a cache flush. For critical events (OTP, payment), add a per-user sequence number to catch out-of-order delivery caused by upstream retries.
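To make the SET-NX semantics concrete, here is a language-agnostic sketch in Python of an in-memory equivalent (the `IdempotencyStore` name and its methods are hypothetical; in production the two calls map directly to Redis `SET key value NX EX 86400` and `GET`):

```python
import time

class IdempotencyStore:
    """In-memory stand-in for Redis SET-NX-EX semantics (class and method
    names are hypothetical; an injectable clock keeps the sketch testable)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key: str, value, ttl_seconds: float) -> bool:
        """True if the key was newly claimed; False if a live duplicate exists."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return False  # duplicate within TTL: caller returns the existing event
        self._entries[key] = (value, now + ttl_seconds)
        return True

    def get(self, key: str):
        """Return the cached value, or None once the TTL has lapsed."""
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]
```

The first producer to claim a key wins; replays within the 24-hour window read back the original event instead of creating a second delivery.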
7. Fanout — One Event, Many Deliveries
A worker consuming the topic breaks one event into a list of deliveries by this formula:
delivery_set =
(user.devices ∪ user.emails ∪ user.phone)
∩ template.channels
∩ user.preferences
\ user.quiet_hours_violating_channels
In other words, you take the intersection of three sets and subtract channels that are currently in quiet hours. The result is a concrete list of (channel, target) pairs to send. A user with 2 phones + 1 email + a web-push subscription can blow up into 4 distinct deliveries from a single event.
public async Task<IReadOnlyList<Delivery>> FanoutAsync(NotificationEvent evt, CancellationToken ct)
{
    var user = await users.GetAsync(evt.UserId, ct);
    var userPrefs = await prefs.GetAsync(evt.UserId, evt.Category, ct); // local renamed so it doesn't shadow the injected store
    var template = await templates.GetAsync(evt.TemplateCode, user.Locale, ct);
    var deliveries = new List<Delivery>();
    // Interpret "now" in the user's timezone, never the server's
    var tz = TimeZoneInfo.FindSystemTimeZoneById(user.TimeZone);
    var nowLocal = TimeZoneInfo.ConvertTime(DateTimeOffset.UtcNow, tz);
    foreach (var channel in template.Channels)
    {
        if (!userPrefs.IsEnabled(channel)) continue;
        // P0/P1 override quiet hours; everything else waits
        if (userPrefs.InQuietHours(channel, nowLocal) && evt.Priority > 1) continue;
        foreach (var target in user.TargetsFor(channel))
        {
            deliveries.Add(Delivery.New(evt.Id, user.Id, channel, target, template));
        }
    }
    return deliveries;
}
Subtle but important: P0/P1 overrides quiet hours. Nobody wants to miss an OTP just because they happen to be asleep.
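The quiet-hours check itself is easy to get wrong because the default window 22:00–07:00 wraps past midnight. A minimal, language-agnostic sketch in Python of the rule plus the P0/P1 override (function names are illustrative):

```python
from datetime import time

def in_quiet_hours(local: time, quiet_start: time = time(22, 0), quiet_end: time = time(7, 0)) -> bool:
    """True if `local` falls inside the quiet window; handles windows that
    wrap past midnight, like the default 22:00-07:00."""
    if quiet_start <= quiet_end:
        return quiet_start <= local < quiet_end
    # Wrap-around window: quiet if after the start OR before the end
    return local >= quiet_start or local < quiet_end

def should_send(priority: int, local: time) -> bool:
    """P0/P1 (priority <= 1) override quiet hours; everything else waits."""
    return priority <= 1 or not in_quiet_hours(local)
```

Deferred messages go to a delayed queue keyed on the end of the quiet window rather than being dropped.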
8. Template Engine — Separate Content from Code
Templates live in the DB and are preloaded into a cache (Redis or in-memory). Each template carries a schema validating its inputs: if an event is missing a variable, reject at ingest instead of letting a worker discover it and die midway.
code: order.shipped
locale: en-US
channel: push
body: |
Hi {{recipient.firstName}}, your order {{orderId}} is on the way
to {{shippingAddress.short}}. ETA {{eta | date:"HH:mm, MM/dd"}}.
schema:
required: [recipient.firstName, orderId, shippingAddress.short, eta]
Template versioning and A/B testing
Each template carries a version. When updating, create a new version and keep the old one running. Route 10% of traffic to the new version for 24 hours, watch CTR and open rate in the warehouse. If it wins, cut over completely. This is the same principle as a feature flag, applied to content.
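Routing a stable 10% slice to the new version is usually done with deterministic hashing rather than random sampling, so a given user always sees the same variant. A hedged Python sketch (the version labels and rollout percentage are illustrative assumptions):

```python
import hashlib

def template_version(user_id: str, template_code: str, rollout_percent: int = 10) -> str:
    """Deterministically bucket a user into the candidate or stable version.
    Hashing (template, user) means each user sticks to one variant per template,
    and ramping the percentage up only ever adds users to the candidate."""
    digest = hashlib.sha256(f"{template_code}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "v2-candidate" if bucket < rollout_percent else "v1-stable"
```

Because the bucket is derived from (template, user), raising `rollout_percent` from 10 to 50 only adds users to the candidate; nobody flips back to the old variant mid-experiment.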
9. Priority Queue and Back-Pressure
The system will be overloaded occasionally. An upstream service fails and retries everything, a marketing campaign hits Send on 2 million users in one shot, an SMS carrier slows to 500ms/msg. Without priority and back-pressure, every layer suffers — from CPU to provider limits.
flowchart TB
IN["Kafka:
notif.transactional
notif.marketing"] --> WK["Worker dispatcher"]
WK --> P0["P0 pool (OTP, auth)
high concurrency, no throttle"]
WK --> P1["P1 pool (order, payment)
moderate concurrency"]
WK --> P3["P3 pool (marketing)
low concurrency, token-bucket throttled"]
P0 --> PROV["Provider pool"]
P1 --> PROV
P3 --> PROV
PROV --> RL["Global per-provider rate limit"]
RL -->|block| RET["Retry queue (delay)"]
RET -. after 2^n seconds .-> PROV
A few rules distilled from real incidents:
- Distinct worker pools per priority: use separate thread pools or separate processes, NOT shared. If shared, one marketing burst will push OTPs behind tens of thousands of messages.
- Back-pressure from provider to queue: when SES returns 429, the worker must pause consumption for a window — don't blindly push to DLQ.
- Per-provider token bucket: FCM 600 req/s, SES 100/s by default. Apply the limit at the worker layer so you don't get cut off by the provider.
- Graceful degradation: if the primary SMS provider dies, fail over to secondary for P0/P1 but accept dropping P3. A marketing notification delayed an hour hurts nobody; an OTP delayed a minute loses you the order.
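The per-provider token bucket mentioned above fits in a few lines. A language-agnostic sketch in Python (the injectable clock is there to make the behaviour deterministic to test; in production you would use a monotonic clock):

```python
class TokenBucket:
    """Minimal per-provider token bucket; refills lazily on each acquire."""

    def __init__(self, rate_per_sec: float, burst: float, clock):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Refill based on elapsed time, then take n tokens if available.
        On False, the worker pauses consumption instead of hammering the provider."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

The same structure works per provider (FCM, SES) and per phone number for SMS; only the rate and burst change.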
10. Retry, DLQ, and Self-Healing
Wrong channel, expired token, external endpoint timeout — all of these are either transient or permanent failures. Correctly distinguishing the two is the key to not spamming retries for nothing.
| Failure type | Example | Handling |
|---|---|---|
| Transient | 5xx provider, timeout, rate-limit | Retry with exponential backoff + jitter, capped at 5 attempts |
| Permanent | Invalid token, email hard bounce, malformed phone number | Do not retry. Log it, disable the target, trigger cleanup |
| Ambiguous | Provider returned 202 without a delivery status | Retry only if the delivery callback doesn't arrive within TTL |
| Critical bug | Malformed template, worker crash loop | Park in DLQ, page on-call, keep the main queue flowing |
A DLQ isn't "where messages go to die". It must come with a replay tool. A simple CLI that lists, edits metadata, and replays into the main queue is enough for on-call to handle most incidents.
// Exponential backoff with ±30% jitter, capped at 30 seconds
public static TimeSpan NextRetryDelay(int attempt)
{
    var baseMs = Math.Min(30_000, 500 * Math.Pow(2, attempt));
    var jitter = (Random.Shared.NextDouble() * 2 - 1) * 0.3; // uniform in [-0.3, +0.3]
    return TimeSpan.FromMilliseconds(baseMs * (1 + jitter));
}
11. Dedup, Suppression, and Per-User Rate Limiting
No user should get 47 notifications in 10 minutes. But you don't want to hard-block either, because sometimes they genuinely need them (e.g. a sequence of order, payment, shipped events within seconds). The answer: per-user, per-category rate limits.
public async Task<bool> ShouldSuppressAsync(Guid userId, string category, CancellationToken ct)
{
    // Fixed-window counter in Redis: at most 5 marketing pushes per hour
var key = $"ratelimit:push:{userId}:{category}";
var count = await redis.StringIncrementAsync(key);
if (count == 1) await redis.KeyExpireAsync(key, TimeSpan.FromHours(1));
return category == "marketing" && count > 5;
}
Pair it with the digest pattern: when you detect that you're about to breach the rate limit, instead of dropping, bundle 10 small notifications into one "You have 10 new updates" message. This pattern works wonders for social and collaboration apps.
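The digest fallback can be sketched as a pure function over the pending batch (the cap and the summary wording here are illustrative assumptions):

```python
def apply_digest(pending: list[str], cap: int = 5) -> list[str]:
    """Under the cap, send everything as-is; over the cap, collapse the whole
    batch into a single summary message instead of dropping notifications."""
    if len(pending) <= cap:
        return list(pending)
    return [f"You have {len(pending)} new updates"]
```

The key property is that information is compressed, not lost: the in-app inbox still carries every individual item, and only the push channel gets the summary.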
12. Quiet Hours, Timezones, and Localisation
A project I once worked on pushed a promotional notification at 3am because the servers ran on UTC and the campaign was scheduled in UTC. Outcome: thousands of 1-star reviews. Lesson: every user-facing time must be interpreted in the user's timezone, never the server's.
Three minimum rules:
- Store `user.timezone` as an IANA identifier (`Asia/Ho_Chi_Minh`), not a raw offset.
- Default quiet hours 22:00–07:00 local for marketing (studies show very low open rates outside 8:00–21:00).
- For "send to each user at 9:00 local" batch campaigns, you need a dedicated scheduler: split the campaign into buckets by timezone, enqueue each bucket at the right moment.
13. Delivery Callback — The Truth Is at the Provider
You call SES and get a 202 — don't assume you're done. 202 just means SES accepted. The email can bounce, trigger a complaint, arrive immediately, or end up in the promotions tab. The truth lives in the delivery callback the provider sends back to your webhook.
app.MapPost("/webhooks/ses", async (SesEvent evt, IDeliveryService svc, CancellationToken ct) =>
{
// Verify SNS signature first
if (!SesSignature.Verify(evt.RawPayload, evt.Signature)) return Results.Unauthorized();
var deliveryId = evt.Tags["delivery_id"];
var status = evt.Type switch
{
"Delivery" => DeliveryStatus.Delivered,
"Bounce" => DeliveryStatus.Bounced,
"Complaint" => DeliveryStatus.Complaint,
"Open" => DeliveryStatus.Opened,
"Click" => DeliveryStatus.Clicked,
_ => DeliveryStatus.Unknown
};
await svc.UpdateAsync(Guid.Parse(deliveryId), status, evt.Timestamp, ct);
return Results.Ok();
});
When the status is a hard Bounce, the cleanup worker must disable that email. Continuing to send will destroy your sender reputation — SES and SendGrid score this very quickly and you can be blocked from sending until you contact support.
14. In-App Inbox and Realtime via WebSocket
Push is used to alert in the moment. But when users open the app, they want to see the history — that's the job of the in-app inbox. It has two requirements: (1) fast lookup by user, (2) realtime update when a new message arrives.
Common architecture:
sequenceDiagram
participant W as Worker
participant DB as Postgres (inbox)
participant R as Redis Pub/Sub
participant GW as Realtime Gateway
participant APP as Mobile/Web App
W->>DB: INSERT inbox row
W->>R: PUBLISH user:{id} new_msg
R->>GW: subscription event
GW->>APP: WebSocket/SSE push
APP->>APP: Update badge count
Note over APP: On user tap, call GET /inbox?limit=50
APP->>DB: SELECT with cursor pagination
A few often-overlooked details:
- Badge count must be computed on the server. Don't rely on the client counting, multi-device will drift.
- Mark-as-read needs an event upstream so other devices stay in sync — opened on mobile, badge on web drops too.
- Pagination must use cursors (`WHERE created_at < :lastSeen`), not OFFSET — with a large inbox, OFFSET is painfully slow.
- TTL: inboxes older than 90 days can be archived to cold storage or deleted per data policy.
15. Observability — Metrics, Traces, and Analytics
An unobserved Notification Service will almost certainly fail silently: you still deliver 99%, but that 1% is your most important users. Measure three layers:
| Layer | Metric | Used for |
|---|---|---|
| Pipeline | events_in/s, fanout_ratio, queue_lag, worker_throughput | System health monitoring, SRE alerts |
| Channel | send_rate, success_rate, bounce_rate, latency p50/p95/p99 | Provider comparison, alert on degradation |
| Business | delivery_rate, open_rate, CTR per template/campaign | Marketing and product content optimisation |
End-to-end traces should carry attributes event.id, user.id, template.code, channel so a specific message can be followed from ingest to callback. OpenTelemetry auto-instrumentation for Kafka and HTTP clients gets you most of the way with little config; the hard part is setting attributes correctly at the fanout point — where one event becomes N spans.
SLI/SLO for a Notification Service
Example: 99.5% of OTP SMS are acknowledged by the provider within 3 seconds of ingest. Record the delta delivered_at - event_created_at, compute hourly percentiles, alert when >30% of the weekly error budget burns. That's how you turn "it seems fine" into a number you can defend to stakeholders.
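The SLI computation is just percentile arithmetic over the per-delivery latency deltas. A small Python sketch with nearest-rank percentiles, using the thresholds from the example above:

```python
import math

def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of delivery latency deltas (seconds)."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def slo_met(samples: list[float], threshold_s: float = 3.0, target: float = 0.995) -> bool:
    """True when at least `target` of deliveries landed within `threshold_s`."""
    within = sum(1 for s in samples if s <= threshold_s)
    return within / len(samples) >= target
```

Run this hourly over the `delivered_at - event_created_at` deltas per priority tier and alert on the trend, not on a single bad hour.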
16. Campaign Scheduler — Millions of Messages, Each User's Local Time
Marketing wants to send "Monday 9am local time" to 2 million users. Simple-sounding but users span dozens of timezones. The naïve approach — enqueue all 2 million at 00:00 UTC and have workers hold each message — isn't just memory-hungry, it can't survive a restart.
The tidy solution: time-bucketed scheduler.
flowchart LR
CAMP["Campaign 'Weekly promo'
send at 9:00 local"] --> BUCKET["Bucket by timezone"]
BUCKET --> B1["bucket +7 (Asia/Ho_Chi_Minh)"]
BUCKET --> B2["bucket 0 (UTC, London)"]
BUCKET --> B3["bucket -5 (America/New_York)"]
B1 --> C["Cron at 02:00 UTC = 09:00 VN"]
B2 --> D["Cron at 09:00 UTC"]
B3 --> E["Cron at 14:00 UTC"]
C --> ENQ["Enqueue into Kafka"]
D --> ENQ
E --> ENQ
Inside each bucket, enqueue in ~10k-user batches at a fixed rate so you don't spike the provider. If a user changes timezone, re-check at read-time and shift buckets for that user — not the whole user base, just those few.
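Computing each bucket's UTC enqueue time should go through the IANA timezone database rather than fixed offsets, so DST shifts are handled for free. A Python sketch using `zoneinfo` (the function name is illustrative):

```python
from datetime import date, datetime, time
from zoneinfo import ZoneInfo

def enqueue_time_utc(send_date: date, local_send_time: time, tz_name: str) -> datetime:
    """UTC instant at which the bucket for `tz_name` must be enqueued so that
    every user in it receives the campaign at `local_send_time`, local clock."""
    local = datetime.combine(send_date, local_send_time, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC"))
```

Note that the diagram's fixed offsets are a simplification: America/New_York is UTC-4 in summer and UTC-5 in winter, so the real cron time moves with DST. That is one more reason to store IANA names rather than raw offsets.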
17. Security and Abuse Prevention
Notifications are an attack surface people overlook. Until someone uses the internal API to flood SMS to a victim's phone number, or spoof emails from your company's domain. A few mandatory measures:
- Authenticated producers: only internal backend services (mTLS or OAuth2 service-to-service) may call the API. Never expose it directly to clients.
- Template whitelisting: the body must be an identified template code; no free-text sends. Locking down free text is how you prevent internal phishing.
- Per-tenant rate limits: each producer has its own quota. Stops a buggy service from collapsing the whole pipeline.
- PII minimisation: payloads contain only keys (userId, orderId). The worker resolves personal data from the user service itself. Logs must never print email/phone in plaintext.
- DKIM, SPF, DMARC for email; reputable sending IPs for SES/SendGrid; signed payloads (HMAC-SHA256) for outbound webhooks.
- Hard opt-out honour: when a user clicks unsubscribe, the worker must block at fanout, not merely hide the UI. Regulations such as CAN-SPAM, GDPR, and Vietnam's Decree 91/2020 all require this.
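Signing outbound webhook payloads takes a dozen lines with any standard library. A sketch of both sides in Python (the header name and secret handling are assumptions):

```python
import hashlib
import hmac

def sign_payload(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 hex digest the dispatcher would attach to the request,
    e.g. in an X-Signature header (header name is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_payload(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiving side: recompute and compare in constant time."""
    return hmac.compare_digest(sign_payload(secret, body), signature)
```

Sign the raw request body bytes, not a re-serialised object; any re-serialisation difference between sender and receiver breaks verification.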
18. Realistic Capacity Planning
Reference numbers you can anchor your estimate to:
| Component | Per-node capacity | Notes |
|---|---|---|
| API ingest (.NET 10 Minimal) | ~20–30k req/s | Bounded by CPU, validation, Redis I/O |
| Kafka broker | ~100MB/s write, 3 replicas | Tune batch size, ack=all for transactional |
| Fanout worker | ~2k events/s | With fanout ratio ~3, yields ~6k deliveries/s |
| FCM push | ~600 req/s per HTTP/2 connection | Scale with many connections + 500-token batch |
| SES email | 100/s default, rampable to 10k/s | Quota is per-account, request early |
| Twilio SMS | 10/s per phone number | More numbers for throughput, or Messaging Service |
| PostgreSQL delivery ledger | ~20k write/s with batching | Partition by day, proactive vacuuming |
For 50 million notifications/day (~580/s average, 5k/s peak), a cluster of 3 Kafka brokers, 6–8 worker nodes, and 2 API nodes is comfortable. Don't ignore cost: at $0.02 per message, 50 million SMS cost $1 million. Shifting 80% of non-critical content off to free channels (push + in-app inbox) is a much bigger lever than tuning worker code.
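The back-of-envelope arithmetic above is worth encoding so capacity reviews don't rely on mental math (a trivial sketch; the figures are the ones from this section):

```python
def avg_rate_per_sec(per_day: int) -> float:
    """Average submit rate implied by a daily volume (86,400 seconds/day)."""
    return per_day / 86_400

def channel_cost_usd(messages: int, unit_cost_usd: float) -> float:
    """Raw provider cost for a batch of messages at a flat unit price."""
    return messages * unit_cost_usd
```

50 million/day works out to roughly 579/s average; the same 50 million sent as $0.02 SMS would cost $1 million.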
19. Case Studies — How the Big Players Solve It
A few publicly documented architectures worth studying:
- Slack: the in-app inbox is the source of truth. Push is only a teaser. They use a "fanout-on-read" pattern for large channels: don't push to 10k members simultaneously; push based on presence and active subscribers.
- Uber: transactional (trip events) is completely separated from promotional, on a dedicated Kafka pipeline. Marketing runs in another service with a hard quota.
- LinkedIn: their "Air Traffic Controller" balances many notification types, preventing a user from receiving multiple messages on the same topic within 24 hours. This is the canonical lesson on digest and frequency capping.
- Pinterest: uses ML to predict when a user opens the app, sending exactly at that moment instead of spamming. Beautiful idea, but you need a large behavioural dataset before it's worth building.
20. Rollout Checklist for Your Team
Pre-launch
- Idempotency key convention agreed across producers.
- Template versioning and schema validation on by default.
- DLQ with replay tooling, on-call runbook with remediation steps.
- SLO defined per priority tier (OTP, transactional, marketing).
- Token refresh for chat/webhook, signing-key rotation.
First 90 days in production
- Track daily email bounce rate and SES complaint rate. Kill bad targets immediately.
- Audit PII logs, ensure no email/phone leaks in plaintext.
- Run game-days: simulate FCM/SES outages, verify graceful degradation.
- Review marketing-messages-per-user-per-week — if the median exceeds 5, opt-out risk soars.
21. Conclusion
The Notification Service is one of the most underestimated backends out there. On the surface it's just "call FCM and SES", but as you go deeper, it's the sum of nearly every distributed-systems pattern: event-driven, idempotency, priority queue, retry with backoff, fanout, rate limiting, scheduling, observability, security. Building it right up front saves the team months of firefighting; building it in a rush means paying the price every Black Friday, every time a user reports "I didn't get my OTP".
Hopefully this article gives you a detailed enough map so you don't have to code and learn at the same time. The most important takeaways: separate transactional from marketing, enforce idempotency at ingest, make templates schema-backed, honour quiet hours in the user's timezone, and observe every single message. Those five principles alone separate a Notification Service that "works" from one that "can be trusted".
References
- Google Cloud — Building a large-scale notification system
- Firebase Cloud Messaging documentation
- Apple Developer — Setting up a remote notification server (APNs)
- Amazon SES — Email deliverability concepts
- Twilio — Messaging Services and high-throughput SMS
- LinkedIn Engineering — Air Traffic Controller: Member-First Notifications
- Slack Engineering — Messaging and inbox architecture
- Uber Engineering — Real-time push platform
- Apache Kafka — Delivery semantics and exactly-once
- Microsoft Learn — Background tasks with IHostedService (.NET)
- OpenTelemetry — Messaging semantic conventions