Load Balancing: The Art of Traffic Distribution for Million-Request Systems
Posted on: 4/20/2026 7:40:51 AM
Table of contents
- What Is Load Balancing and Why Does It Matter?
- Layer 4 vs Layer 7: Two Schools of Load Balancing
- The 6 Most Popular Load Balancing Algorithms
- Algorithm Comparison at a Glance
- Common Load Balancing Tools
- Health Checks and Failover — The Safety Net
- Real-World Deployment Architectures
- Common Pitfalls and How to Fix Them
- Load Balancing in Kubernetes
- Load Balancer Deployment Checklist
- Conclusion
- References
What Is Load Balancing and Why Does It Matter?
Imagine you run a restaurant with 10 tables. If every customer ended up sitting at one table while the other 9 sat empty, the experience would be terrible. A Load Balancer is the restaurant host — distributing customers evenly across tables so nobody waits too long.
In software architecture, a Load Balancer is the component that distributes client traffic across multiple backend servers to optimize performance, increase availability, and ensure no single server is overloaded.
When do you need a Load Balancer?
As soon as your system has 2 or more instances, you need a Load Balancer. But its role isn't just "splitting requests evenly" — it also performs health checks, SSL termination, and rate limiting, and serves as the first line of defense against DDoS attacks.
Layer 4 vs Layer 7: Two Schools of Load Balancing
This is the first architectural decision when choosing a Load Balancer. The difference lies in the OSI layer where the LB operates, and that directly affects routing capabilities, performance, and cost.
```mermaid
graph TB
    subgraph L4["Layer 4 — Transport"]
        A[Client Request] -->|TCP/UDP| B[L4 Load Balancer]
        B -->|IP + Port| C[Server A]
        B -->|IP + Port| D[Server B]
        B -->|IP + Port| E[Server C]
    end
    subgraph L7["Layer 7 — Application"]
        F[Client Request] -->|HTTP/gRPC| G[L7 Load Balancer]
        G -->|/api/*| H[API Server]
        G -->|/static/*| I[CDN/Static Server]
        G -->|/ws/*| J[WebSocket Server]
    end
    style L4 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style L7 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style B fill:#e94560,stroke:#fff,color:#fff
    style G fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#2c3e50,stroke:#fff,color:#fff
    style H fill:#2c3e50,stroke:#fff,color:#fff
    style I fill:#2c3e50,stroke:#fff,color:#fff
    style J fill:#2c3e50,stroke:#fff,color:#fff
```
Comparison of Layer 4 vs Layer 7 Load Balancer request flows
| Criterion | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Operates at | TCP/UDP — sees only IP and Port | HTTP/HTTPS/gRPC — reads headers, URLs, cookies |
| Routing | Based on source/destination IP and port | Based on URL path, hostname, headers, query string |
| Performance | Extremely fast — doesn't parse the payload | Slower — must decrypt TLS, parse HTTP |
| SSL Termination | TLS passthrough (no decryption) | Decrypts at the LB, re-encrypts if needed |
| Connection Pooling | No — forwards the TCP stream directly | Yes — multiplexes many clients over few backend connections |
| Use case | Databases, game servers, IoT, streaming | Web apps, APIs, microservices, gRPC |
| Examples | AWS NLB, HAProxy TCP mode, IPVS | AWS ALB, NGINX, HAProxy HTTP mode, Envoy |
Production practice
Most production architectures use both layers: L4 at the edge to quickly distribute traffic into L7 clusters, then L7 performs detailed content-based routing. For example: AWS NLB (L4) → ALB (L7), or Google Maglev (L4) → Envoy (L7).
The 6 Most Popular Load Balancing Algorithms
1. Round Robin — Simple but Effective
Round Robin
Complexity: O(1) | Stateless | Default for NGINX and HAProxy
Distribute requests sequentially in a loop: Server A → B → C → A → B → C... No need to track server state, extremely simple and effective when servers have uniform capacity.
✓ Pros
- Simple, stateless
- Evenly distributed over time
- O(1) performance
✗ Cons
- Ignores real load differences
- Not ideal for requests with very uneven processing time
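The core of Round Robin fits in a few lines. A minimal sketch in Python (illustrative only, not a real load balancer):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Cycle through the server list in order; no load state is tracked."""
    def __init__(self, servers):
        self._cycle = cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["A", "B", "C"])
picks = [lb.pick() for _ in range(6)]
print(picks)  # ['A', 'B', 'C', 'A', 'B', 'C']
```

Note there is nothing here reacting to how busy each server is — which is exactly the con listed above.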
2. Weighted Round Robin — When Servers Aren't Equal
Weighted Round Robin
Complexity: O(1) | Semi-stateless | Requires weight configuration
Assign weights to each server based on capacity. A powerful server (weight=5) receives 5× more requests than a weaker one (weight=1). Great for fleets with mixed machine types.
```nginx
# NGINX config
upstream backend {
    server app1.example.com weight=5;  # 16 CPU, 64GB RAM
    server app2.example.com weight=3;  # 8 CPU, 32GB RAM
    server app3.example.com weight=1;  # 2 CPU, 8GB RAM
}
```
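The weight semantics can be sketched naively by repeating each server in the rotation according to its weight (NGINX itself uses a "smooth" weighted round robin that interleaves the picks, but the long-run ratio is the same):

```python
from itertools import cycle

def weighted_cycle(weights):
    """Naive weighted round robin: repeat each server `weight` times.
    (NGINX uses a "smooth" variant that interleaves picks instead.)"""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return cycle(expanded)

picks_iter = weighted_cycle({"app1": 5, "app2": 3, "app3": 1})
picks = [next(picks_iter) for _ in range(9)]  # one full cycle of 5+3+1 picks
print(picks.count("app1"), picks.count("app2"), picks.count("app3"))  # 5 3 1
```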
3. Least Connections — Adaptive to Real Load
Least Connections
Complexity: O(n) or O(log n) with a heap | Stateful
Send requests to the server with the fewest active connections. Smarter than Round Robin because it reacts to real load — busy servers naturally receive fewer new requests.
✓ Pros
- Adapts to varying request processing times
- Self-balancing when a server is slow
- Ideal for WebSocket, long-polling
✗ Cons
- Must track state per connection
- Higher overhead than Round Robin
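The "stateful" part is simply a per-server counter of in-flight connections. A minimal sketch in Python (the O(n) scan version; a heap gets this to O(log n)):

```python
class LeastConnectionsBalancer:
    """Route to the server with the fewest active connections (O(n) scan)."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["A", "B", "C"])
a = lb.acquire()  # all tied at 0 connections
b = lb.acquire()
lb.release(a)     # first request finishes quickly
c = lb.acquire()  # the freed server has the fewest connections again
print(a, b, c)    # A B A
```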
4. IP Hash — Simple Session Affinity
IP Hash
Complexity: O(1) | Deterministic
Hash the client's IP to pick the destination server. The same IP always routes to the same server — giving session stickiness without cookies or a shared session store.
```nginx
# NGINX config
upstream backend {
    ip_hash;
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
}
```
Beware of NAT
If many clients share the same IP (via NAT/proxy), they'll all pile onto a single server → uneven load. In enterprise networks, this is a common issue.
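The mechanism is just `hash(ip) mod N`. A sketch in Python (illustrative — NGINX's actual `ip_hash` uses its own hash over the address, not MD5):

```python
import hashlib

def ip_hash(client_ip, servers):
    """Deterministically map a client IP to one server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["app1", "app2", "app3"]
# The same IP always lands on the same server...
assert ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers)
# ...which is exactly why thousands of clients sharing one NAT IP
# all pile onto a single backend.
```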
5. Consistent Hashing — The King of Distributed Cache
Consistent Hashing
Complexity: O(log n) lookup | Virtual Nodes improve distribution
Uses a hash ring — both servers and request keys are hashed onto a circle. Requests go to the nearest server in the clockwise direction. When servers are added/removed, only ~1/n of keys are affected instead of remapping everything.
```mermaid
graph TB
    subgraph Ring["Hash Ring — Consistent Hashing"]
        direction TB
        N1["Server A<br/>position: 0°"]
        N2["Server B<br/>position: 120°"]
        N3["Server C<br/>position: 240°"]
        K1["Key 'user:42'<br/>hash: 35°"]
        K2["Key 'session:99'<br/>hash: 155°"]
        K3["Key 'cart:17'<br/>hash: 280°"]
    end
    K1 -.->|clockwise → 120°| N2
    K2 -.->|clockwise → 240°| N3
    K3 -.->|wraps around → 0°| N1
    style Ring fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style N1 fill:#e94560,stroke:#fff,color:#fff
    style N2 fill:#e94560,stroke:#fff,color:#fff
    style N3 fill:#e94560,stroke:#fff,color:#fff
    style K1 fill:#2c3e50,stroke:#fff,color:#fff
    style K2 fill:#2c3e50,stroke:#fff,color:#fff
    style K3 fill:#2c3e50,stroke:#fff,color:#fff
```
A hash ring with 3 servers — each key routes to the nearest server clockwise
Virtual nodes are an important technique for improving distribution. Instead of each server occupying a single position on the ring, you create 100-200 virtual positions per physical server. This helps:
- Distribute keys much more evenly
- When one server fails, its load spreads across many servers rather than piling on the next one
- Amazon DynamoDB, Apache Cassandra, and ScyllaDB all use this technique
```typescript
// Consistent Hashing with virtual nodes — runnable TypeScript (Node.js)
import { createHash } from "crypto";

class ConsistentHash {
  private ring = new Map<number, string>();  // ring position -> server
  private positions: number[] = [];          // sorted positions for lookup
  private readonly virtualNodes = 150;

  // First 8 hex chars of MD5 → 32-bit position on the ring
  private hash(key: string): number {
    return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
  }

  addServer(server: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.set(this.hash(`${server}:${i}`), server);
    }
    this.positions = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getServer(key: string): string {
    // Binary search for the nearest position clockwise (first one >= hash)
    const h = this.hash(key);
    let lo = 0, hi = this.positions.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.positions[mid] < h) lo = mid + 1; else hi = mid;
    }
    // Wrap around to the first position if we ran off the end of the ring
    return this.ring.get(this.positions[lo % this.positions.length])!;
  }

  removeServer(server: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.delete(this.hash(`${server}:${i}`));
    }
    this.positions = [...this.ring.keys()].sort((a, b) => a - b);
    // Only ~1/n of keys remap — no global impact
  }
}
```
6. Random Two Choices — The "Just Smart Enough" Algorithm
Power of Two Random Choices
Complexity: O(1) | Near-optimal distribution
Randomly pick 2 servers, then send the request to the one with fewer connections. Sounds simple, but the "power of two choices" result from probability theory shows this achieves near-optimal distribution — the expected maximum load drops from O(log n / log log n) with purely random assignment to O(log log n).
```nginx
# NGINX config — the random directive is in open source NGINX 1.15.1+
upstream backend {
    random two least_conn;
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
    server app4.example.com;
}
```
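A quick simulation makes the effect visible. The sketch below (illustrative Python, seeded for reproducibility) assigns 10,000 requests across 4 servers without ever releasing a connection — the pessimistic case — and the counts still stay nearly equal:

```python
import random

def pick_two_choices(active, rng):
    """Sample two distinct servers at random, route to the less-loaded one."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b

rng = random.Random(42)  # seeded so the sketch is reproducible
active = {f"app{i}": 0 for i in range(1, 5)}
for _ in range(10_000):
    active[pick_two_choices(active, rng)] += 1
print(sorted(active.values()))  # four counts, all very close to 2500
```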
Algorithm Comparison at a Glance
| Algorithm | Stateful? | Session Sticky? | Main use case | When NOT to use |
|---|---|---|---|---|
| Round Robin | No | No | Stateless APIs, uniform microservices | Heterogeneous server configs |
| Weighted RR | No | No | Mixed fleets (on-prem + cloud) | Highly variable load |
| Least Connections | Yes | No | WebSockets, long-running requests | Very short, uniform requests |
| IP Hash | No | Yes | Legacy apps needing session affinity | Many clients behind NAT |
| Consistent Hash | No | Yes | Distributed caches, sharded DBs | Simple stateless services |
| Random Two Choices | Yes (light) | No | Large clusters needing near-optimal balance | Small clusters (<4 servers) |
Common Load Balancing Tools
NGINX — Reverse Proxy and Load Balancer
NGINX is the most popular choice for L7 Load Balancing thanks to its high performance (handling millions of concurrent connections), simple configuration, and rich module ecosystem.
```nginx
# nginx.conf — Complete Load Balancing
http {
    upstream api_servers {
        least_conn;
        server 10.0.1.10:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.1.11:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.1.12:8080 weight=1 backup;  # Only used when the above 2 fail
    }

    server {
        listen 443 ssl http2;
        server_name api.example.com;

        # SSL Termination
        ssl_certificate     /etc/ssl/certs/api.crt;
        ssl_certificate_key /etc/ssl/private/api.key;

        # Passive health check via max_fails/fail_timeout
        location /api/ {
            proxy_pass http://api_servers;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;

            # Timeout configuration
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
            proxy_send_timeout 10s;
        }
    }
}
```
HAProxy — Dedicated Load Balancer
HAProxy is renowned for its L4 and L7 capabilities, purpose-built for load balancing with active health checks, detailed metrics, and impressive performance.
```haproxy
# haproxy.cfg
frontend http_front
    bind *:443 ssl crt /etc/ssl/api.pem
    mode http

    # Content-based routing
    acl is_api path_beg /api
    acl is_ws  hdr(Upgrade) -i websocket
    use_backend api_servers if is_api
    use_backend ws_servers  if is_ws
    default_backend static_servers

backend api_servers
    mode http
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    server api1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    server api2 10.0.1.11:8080 check inter 5s fall 3 rise 2

backend ws_servers
    mode http
    balance source       # IP Hash to keep WebSocket sessions sticky
    timeout tunnel 3600s
    server ws1 10.0.2.10:8080 check
    server ws2 10.0.2.11:8080 check
```
Cloud Load Balancers — Managed and Auto-scaling
| Service | Layer | Free Tier | Strengths |
|---|---|---|---|
| AWS ALB | L7 | 750h/month (first 12 months) | Path-based routing, gRPC, WebSocket |
| AWS NLB | L4 | 750h/month (first 12 months) | Ultra-low latency, static IP, TLS passthrough |
| Azure Load Balancer | L4 | Basic SKU free | Zone-redundant, HA Ports |
| Azure App Gateway | L7 | None (from ~$18/month) | Integrated WAF, SSL offloading, URL rewrite |
| Cloudflare LB | L7 | None (from $5/month) | 330+ PoPs, Geo-steering, global health checks |
Health Checks and Failover — The Safety Net
A Load Balancer without health checks is like an intersection without traffic lights. Health checks let the LB automatically remove dead servers from the pool and bring them back when they recover.
```mermaid
sequenceDiagram
    participant LB as Load Balancer
    participant S1 as Server A (healthy)
    participant S2 as Server B (failing)
    participant S3 as Server C (healthy)
    loop Health Check (every 5s)
        LB->>S1: GET /health
        S1-->>LB: 200 OK ✓
        LB->>S2: GET /health
        S2-->>LB: 503 Error ✗
        LB->>S3: GET /health
        S3-->>LB: 200 OK ✓
    end
    Note over LB,S2: Server B fails 3 times in a row → marked DOWN
    LB->>S1: Route traffic (50%)
    LB->>S3: Route traffic (50%)
    Note over S2: Receives no traffic
    LB->>S2: GET /health (probes continue)
    S2-->>LB: 200 OK ✓ (after 2 successful checks)
    Note over LB,S2: Server B recovers → returned to the pool
```
Health Check flow: detect a failing server → remove from pool → auto-recover
There are 3 common types of health checks:
- Active Health Check: The LB actively sends probes (HTTP GET /health, TCP connect, or custom scripts). HAProxy and cloud LBs support this by default.
- Passive Health Check: The LB observes real traffic responses — if a server returns errors continuously (e.g., 5× 5xx in 30s), mark it down automatically. NGINX Open Source only supports this.
- Deep Health Check: Verify dependencies too (database connection, disk space, memory). Return details via a /health/detailed endpoint.
```csharp
// ASP.NET Core — Deep Health Check (Program.cs)
// AddSqlServer/AddRedis come from the AspNetCore.HealthChecks.* NuGet packages;
// UIResponseWriter comes from HealthChecks.UI.Client.
builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, name: "database")
    .AddRedis(redisConnection, name: "cache")
    .AddCheck("disk-space", () =>
    {
        var drive = new DriveInfo("C");
        return drive.AvailableFreeSpace > 1_073_741_824 // > 1GB
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Degraded("Low disk space");
    });

app.MapHealthChecks("/health", new HealthCheckOptions
{
    Predicate = _ => true,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
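The fall/rise thresholds from the diagram (fall 3, rise 2, matching the HAProxy config earlier) amount to a small state machine. A sketch in Python (illustrative class name):

```python
class HealthState:
    """Mark a server DOWN after `fall` consecutive failed checks,
    and UP again after `rise` consecutive successful checks."""
    def __init__(self, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self._streak = 0  # consecutive checks contradicting the current state

    def record(self, check_passed):
        if check_passed == self.healthy:
            self._streak = 0  # current state confirmed, reset the counter
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = check_passed  # flip state
                self._streak = 0
        return self.healthy

s = HealthState(fall=3, rise=2)
results = [s.record(ok) for ok in [False, False, False, True, True]]
print(results)  # [True, True, False, False, True]
```

The single-failure tolerance is what prevents the flapping described in the "Common Pitfalls" section below: one timeout never evicts a server on its own.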
Real-World Deployment Architectures
Pattern 1: Single-tier LB — For Small and Medium Systems
```mermaid
graph LR
    Client[Client] --> LB[NGINX / HAProxy<br/>L7 Load Balancer]
    LB --> S1[App Server 1]
    LB --> S2[App Server 2]
    LB --> S3[App Server 3]
    S1 --> DB[(Database)]
    S2 --> DB
    S3 --> DB
    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LB fill:#e94560,stroke:#fff,color:#fff
    style S1 fill:#2c3e50,stroke:#fff,color:#fff
    style S2 fill:#2c3e50,stroke:#fff,color:#fff
    style S3 fill:#2c3e50,stroke:#fff,color:#fff
    style DB fill:#4CAF50,stroke:#fff,color:#fff
```
Single-tier: simple, easy to operate, fits ~10K RPS
Pattern 2: Two-tier LB — For Large Systems
```mermaid
graph TB
    Client[Client] --> DNS[DNS / GeoDNS]
    DNS --> L4A[NLB - L4<br/>Region A]
    DNS --> L4B[NLB - L4<br/>Region B]
    L4A --> L7A1[ALB/NGINX - L7]
    L4A --> L7A2[ALB/NGINX - L7]
    L4B --> L7B1[ALB/NGINX - L7]
    L7A1 --> API1[API Pods]
    L7A2 --> WEB1[Web Pods]
    L7B1 --> API2[API Pods]
    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style DNS fill:#ff9800,stroke:#fff,color:#fff
    style L4A fill:#e94560,stroke:#fff,color:#fff
    style L4B fill:#e94560,stroke:#fff,color:#fff
    style L7A1 fill:#2c3e50,stroke:#fff,color:#fff
    style L7A2 fill:#2c3e50,stroke:#fff,color:#fff
    style L7B1 fill:#2c3e50,stroke:#fff,color:#fff
    style API1 fill:#4CAF50,stroke:#fff,color:#fff
    style WEB1 fill:#4CAF50,stroke:#fff,color:#fff
    style API2 fill:#4CAF50,stroke:#fff,color:#fff
```
Two-tier: L4 edge → L7 routing, supports multi-region
Pattern 3: Global Load Balancing — For Worldwide Systems
When users are spread globally, you need load balancing at the DNS layer. GeoDNS or Anycast routing brings users to the nearest datacenter. Cloudflare, AWS Route 53, and Azure Traffic Manager all support this pattern.
3 Global LB strategies
- Geo-proximity: Route to the geographically nearest datacenter. Simple, effective at reducing latency.
- Latency-based: Measure actual latency from users to each region, route to the fastest. More accurate than geo-proximity.
- Failover: Active-passive — 100% of traffic to the primary region. When primary is down, shift everything to secondary.
Common Pitfalls and How to Fix Them
1. Thundering Herd When a Server Recovers
When a server is brought back into the pool after passing health checks, using Least Connections means all new requests pile onto it (because it has 0 connections). Solution: Slow Start — gradually increase the recovered server's weight over 30-60 seconds.
```haproxy
# HAProxy slow start
backend api_servers
    server api1 10.0.1.10:8080 check slowstart 60s
```
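Conceptually, slow start ramps the recovered server's effective weight toward its configured weight over the window. A linear sketch in Python (illustrative function; HAProxy's internal ramp is its own implementation detail):

```python
def effective_weight(full_weight, seconds_since_up, slowstart=60):
    """Linearly ramp a recovered server's weight over the slow-start
    window so it isn't flooded while its caches are still cold."""
    if seconds_since_up >= slowstart:
        return full_weight
    return max(1, int(full_weight * seconds_since_up / slowstart))

print([effective_weight(100, t) for t in (0, 15, 30, 60)])  # [1, 25, 50, 100]
```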
2. Session Affinity Causing Imbalance
Sticky sessions (via cookies or IP hash) can make one server receive most of the traffic if "heavy users" cluster onto it. Solution: move to a stateless architecture — store sessions in a shared store (database or in-memory cache) instead of on the server.
3. Overly Sensitive or Overly Slow Health Checks
Too sensitive (interval=1s, fall=1): a server gets evicted because of one timeout → constant flapping. Too slow (interval=30s, fall=5): it takes 2.5 minutes to notice a dead server. Recommended: interval=5s, fall=3, rise=2 — detect in 15s, confirm recovery in 10s.
Load Balancing in Kubernetes
Kubernetes has its own load balancing system via Services and Ingress. Understanding how they work helps avoid duplication or conflicts with external LBs.
| Component | Layer | Scope | Default Algorithm |
|---|---|---|---|
| kube-proxy (iptables) | L4 | Inside the cluster | Random (probability-based) |
| kube-proxy (IPVS) | L4 | Inside the cluster | Round Robin (also supports Least Conn, Source Hash...) |
| Ingress Controller | L7 | External → cluster | Depends on controller (NGINX, Traefik, Envoy) |
| Service type LoadBalancer | L4 | External → cluster | Cloud provider LB (ALB, NLB...) |
| Gateway API | L4/L7 | External → cluster | Depends on implementation, more flexible than Ingress |
Load Balancer Deployment Checklist
Production Readiness Checklist
- Pick the algorithm that fits your workload — Round Robin for stateless, Least Connections for variable latency, Consistent Hash for caches
- Configure health checks — active checks with a /health endpoint, 5s interval, fail threshold 3
- SSL Termination — terminate TLS at the LB to offload work from backends
- Logging & Monitoring — track request count, latency p50/p95/p99, error rate, active connections per backend
- LB High Availability — the LB itself needs redundancy: VRRP (keepalived), or use a managed cloud LB
- Rate Limiting — protect backends from unexpected traffic spikes
- Connection Draining — when removing a server from the pool, let in-flight requests finish (graceful shutdown)
- Sensible timeouts — short connect timeout (5s), read timeout matching SLA (30-60s)
Conclusion
Load Balancing isn't just "splitting requests evenly" — it's the art of distributing load so the system stays fast, stable, and resilient. There's no "best" algorithm — only the one that fits your specific context:
- Stateless APIs → Round Robin or Random Two Choices
- WebSockets / Long-running → Least Connections
- Distributed Cache → Consistent Hashing
- Legacy session-based apps → IP Hash (temporary; migrate to stateless)
- Multi-region → Global LB (GeoDNS) + Regional L4/L7
Start simple with Round Robin, add health checks, and only add complexity when you truly need it. Over-engineering load balancing from day one is one of the most common mistakes in system design.
References
- NGINX HTTP Load Balancing Documentation
- Understanding Load Balancing Algorithms and Strategies (2026)
- Consistent Hashing Explained — AlgoMaster
- Layer 4 vs Layer 7: Load Balancing and Why It Matters — CloudRPS
- ALB vs NLB: Which AWS Load Balancer Fits Your Needs?
- Edge Computing: Cloudflare Workers Development Guide 2026
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.