Load Balancing: The Art of Traffic Distribution for Million-Request Systems

Posted on: 4/20/2026 7:40:51 AM

What Is Load Balancing and Why Does It Matter?

Imagine you run a restaurant with 10 tables. If every customer ended up sitting at one table while the other 9 sat empty, the experience would be terrible. A Load Balancer is the restaurant host — distributing customers evenly across tables so nobody waits too long.

In software architecture, a Load Balancer is the component that distributes client traffic across multiple backend servers to optimize performance, increase availability, and ensure no single server is overloaded.

  • <1ms — average cold start for an L4 LB
  • 99.99% — cloud LB uptime SLA
  • 10M+ — requests/s handled by NGINX
  • 330+ — PoPs for Cloudflare LB

When do you need a Load Balancer?

As soon as your system has 2 or more instances, you need a Load Balancer. But its role isn't just "splitting requests evenly" — it also performs health checks, SSL termination, rate limiting, and is the first line of defense against DDoS.

Layer 4 vs Layer 7: Two Schools of Load Balancing

This is the first architectural decision when choosing a Load Balancer. The difference lies in the OSI layer where the LB operates, and that directly affects routing capabilities, performance, and cost.

graph TB
    subgraph L4["Layer 4 — Transport"]
        A[Client Request] -->|TCP/UDP| B[L4 Load Balancer]
        B -->|IP + Port| C[Server A]
        B -->|IP + Port| D[Server B]
        B -->|IP + Port| E[Server C]
    end

    subgraph L7["Layer 7 — Application"]
        F[Client Request] -->|HTTP/gRPC| G[L7 Load Balancer]
        G -->|/api/*| H[API Server]
        G -->|/static/*| I[CDN/Static Server]
        G -->|/ws/*| J[WebSocket Server]
    end

    style L4 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style L7 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style B fill:#e94560,stroke:#fff,color:#fff
    style G fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style E fill:#2c3e50,stroke:#fff,color:#fff
    style H fill:#2c3e50,stroke:#fff,color:#fff
    style I fill:#2c3e50,stroke:#fff,color:#fff
    style J fill:#2c3e50,stroke:#fff,color:#fff

Comparison of Layer 4 vs Layer 7 Load Balancer request flows

| Criterion | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Operates at | TCP/UDP — sees only IP and port | HTTP/HTTPS/gRPC — reads headers, URLs, cookies |
| Routing | Based on source/destination IP and port | Based on URL path, hostname, headers, query string |
| Performance | Extremely fast — doesn't parse the payload | Slower — must decrypt TLS and parse HTTP |
| SSL termination | TLS passthrough (no decryption) | Decrypts at the LB, re-encrypts if needed |
| Connection pooling | No — forwards the TCP stream directly | Yes — multiplexes many clients over few backend connections |
| Use case | Databases, game servers, IoT, streaming | Web apps, APIs, microservices, gRPC |
| Examples | AWS NLB, HAProxy TCP mode, IPVS | AWS ALB, NGINX, HAProxy HTTP mode, Envoy |

Production practice

Most production architectures use both layers: L4 at the edge to quickly distribute traffic into L7 clusters, then L7 performs detailed content-based routing. For example: AWS NLB (L4) → ALB (L7), or Google Maglev (L4) → Envoy (L7).

1. Round Robin — Simple but Effective

Round Robin

Complexity: O(1) | Stateless | Default for NGINX and HAProxy

Distribute requests sequentially in a loop: Server A → B → C → A → B → C... No need to track server state, extremely simple and effective when servers have uniform capacity.

✓ Pros

  • Simple, stateless
  • Evenly distributed over time
  • O(1) performance

✗ Cons

  • Ignores real load differences
  • Not ideal for requests with very uneven processing time
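
The whole algorithm fits in a few lines. A minimal TypeScript sketch (class and server names are placeholders, not any particular LB's API):

```typescript
// Minimal round-robin selector — a sketch, not a production LB.
class RoundRobin {
    private index = 0;

    constructor(private servers: string[]) {}

    // Return the next server in the cycle: A → B → C → A ...
    next(): string {
        const server = this.servers[this.index];
        this.index = (this.index + 1) % this.servers.length;
        return server;
    }
}
```

The only state is a single counter — which is why Round Robin is so cheap and so easy to reason about.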

2. Weighted Round Robin — When Servers Aren't Equal

Weighted Round Robin

Complexity: O(1) | Semi-stateless | Requires weight configuration

Assign weights to each server based on capacity. A powerful server (weight=5) receives 5× more requests than a weaker one (weight=1). Great for fleets with mixed machine types.

# NGINX config
upstream backend {
    server app1.example.com weight=5;  # 16 CPU, 64GB RAM
    server app2.example.com weight=3;  # 8 CPU, 32GB RAM
    server app3.example.com weight=1;  # 2 CPU, 8GB RAM
}
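
Naive weighted rotation would send 5 requests to app1 in a burst before touching app2. A "smooth" variant (popularized by NGINX's implementation) interleaves picks while still honoring the weights exactly over each cycle. A TypeScript sketch of that variant (class and field names are my own):

```typescript
// Smooth weighted round robin — a sketch of the interleaving variant.
interface Peer { name: string; weight: number; current: number; }

class SmoothWRR {
    private peers: Peer[];
    private total: number;

    constructor(weights: Record<string, number>) {
        this.peers = Object.entries(weights).map(
            ([name, weight]) => ({ name, weight, current: 0 })
        );
        this.total = this.peers.reduce((sum, p) => sum + p.weight, 0);
    }

    next(): string {
        let best = this.peers[0];
        for (const p of this.peers) {
            p.current += p.weight;          // every peer gains its weight
            if (p.current > best.current) best = p;
        }
        best.current -= this.total;         // the winner pays the total back
        return best.name;
    }
}
```

Over any window of (sum of weights) requests, each server is picked exactly `weight` times, but heavy servers never receive long consecutive bursts.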

3. Least Connections — Adaptive to Real Load

Least Connections

Complexity: O(n) or O(log n) with a heap | Stateful

Send requests to the server with the fewest active connections. Smarter than Round Robin because it reacts to real load — busy servers naturally receive fewer new requests.

✓ Pros

  • Adapts to varying request processing times
  • Self-balancing when a server is slow
  • Ideal for WebSocket, long-polling

✗ Cons

  • Must track state per connection
  • Higher overhead than Round Robin
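
A minimal sketch of the bookkeeping this requires (names are illustrative). Note that unlike Round Robin, the balancer must be told when each connection ends:

```typescript
// Least-connections selection — a sketch with an O(n) scan.
// A heap would make selection O(log n) for large pools.
class LeastConnections {
    private active = new Map<string, number>();

    constructor(servers: string[]) {
        for (const s of servers) this.active.set(s, 0);
    }

    // Pick the server with the fewest in-flight connections.
    acquire(): string {
        let best = "";
        let min = Infinity;
        for (const [server, count] of this.active) {
            if (count < min) { min = count; best = server; }
        }
        this.active.set(best, min + 1);
        return best;
    }

    // Must be called when the connection finishes, or counts drift.
    release(server: string): void {
        this.active.set(server, (this.active.get(server) ?? 1) - 1);
    }
}
```

The `release` call is exactly the "state per connection" the cons list refers to — forget it, and a server looks permanently busy.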

4. IP Hash — Simple Session Affinity

IP Hash

Complexity: O(1) | Deterministic

Hash the client's IP to pick the destination server. The same IP always routes to the same server — giving session stickiness without cookies or a shared session store.

# NGINX config
upstream backend {
    ip_hash;
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
}

Beware of NAT

If many clients share the same IP (via NAT/proxy), they'll all pile onto a single server → uneven load. In enterprise networks, this is a common issue.
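
The selection logic is a plain modulo over a hash. A TypeScript sketch (the FNV-1a hash here is an illustrative choice, not what any particular LB uses):

```typescript
// IP-hash selection — a sketch using a tiny FNV-1a hash.
function fnv1a(text: string): number {
    let h = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        h ^= text.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;   // keep it an unsigned 32-bit value
    }
    return h;
}

// Same IP in, same server out — deterministic stickiness.
function pickServer(clientIp: string, servers: string[]): string {
    return servers[fnv1a(clientIp) % servers.length];
}
```

Because the result depends on `servers.length`, adding or removing a single server changes the mapping for roughly (n−1)/n of all clients — the weakness that motivates consistent hashing in the next section.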

5. Consistent Hashing — The King of Distributed Cache

Consistent Hashing

Complexity: O(log n) lookup | Virtual Nodes improve distribution

Uses a hash ring — both servers and request keys are hashed onto a circle. Requests go to the nearest server in the clockwise direction. When servers are added/removed, only ~1/n of keys are affected instead of remapping everything.

graph TB
    subgraph Ring["Hash Ring — Consistent Hashing"]
        direction TB
        N1["Server A<br/>position: 0°"]
        N2["Server B<br/>position: 120°"]
        N3["Server C<br/>position: 240°"]
        K1["Key 'user:42'<br/>→ Server A"]
        K2["Key 'session:99'<br/>→ Server B"]
        K3["Key 'cart:17'<br/>→ Server C"]
    end

    K1 -.->|hash → 35°| N1
    K2 -.->|hash → 155°| N2
    K3 -.->|hash → 280°| N3

    style Ring fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style N1 fill:#e94560,stroke:#fff,color:#fff
    style N2 fill:#e94560,stroke:#fff,color:#fff
    style N3 fill:#e94560,stroke:#fff,color:#fff
    style K1 fill:#2c3e50,stroke:#fff,color:#fff
    style K2 fill:#2c3e50,stroke:#fff,color:#fff
    style K3 fill:#2c3e50,stroke:#fff,color:#fff

A hash ring with 3 servers — each key routes to the nearest server clockwise

Virtual nodes are an important technique for evening out the distribution. Instead of each server occupying a single position on the ring, you create 100-200 virtual positions per physical server. This helps:

  • Distribute keys much more evenly
  • When one server fails, its load spreads across many servers rather than piling on the next one
  • Amazon DynamoDB, Apache Cassandra, and ScyllaDB all use this technique

// Consistent hashing with virtual nodes — runnable TypeScript sketch
import { createHash } from "node:crypto";

// Hash a string to an unsigned 32-bit position on the ring
function hash32(key: string): number {
    return createHash("md5").update(key).digest().readUInt32BE(0);
}

class ConsistentHash {
    // Ring kept as a sorted array of [position, server] pairs
    private ring: Array<[number, string]> = [];
    private virtualNodes = 150;

    addServer(server: string): void {
        for (let i = 0; i < this.virtualNodes; i++) {
            this.ring.push([hash32(`${server}:${i}`), server]);
        }
        this.ring.sort((a, b) => a[0] - b[0]);
    }

    getServer(key: string): string {
        const h = hash32(key);
        // Binary search for the nearest node clockwise (first position >= h)
        let lo = 0, hi = this.ring.length;
        while (lo < hi) {
            const mid = (lo + hi) >> 1;
            if (this.ring[mid][0] < h) lo = mid + 1;
            else hi = mid;
        }
        // Wrap around the ring when h is past the last node
        return this.ring[lo % this.ring.length][1];
    }

    removeServer(server: string): void {
        this.ring = this.ring.filter(([, s]) => s !== server);
        // Only ~1/n of keys remap — no global impact
    }
}

6. Random Two Choices — The "Just Smart Enough" Algorithm

Power of Two Random Choices

Complexity: O(1) | Near-optimal distribution

Randomly pick 2 servers, then send the request to the one with fewer active connections. It sounds simple, but the classic balls-into-bins analysis shows this is near optimal: the expected maximum load drops from Θ(log n / log log n) under purely random placement to Θ(log log n) — an exponential improvement for one extra comparison.

# NGINX config — the random directive is available in open source since 1.15.1
upstream backend {
    random two least_conn;
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
    server app4.example.com;
}
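
The mechanism behind that directive is just two random picks and one comparison. A TypeScript sketch (the injectable `rand` parameter is mine, added only so the example is deterministic under test):

```typescript
// Power of two random choices — a minimal sketch.
class TwoChoices {
    private active = new Map<string, number>();

    constructor(servers: string[], private rand: () => number = Math.random) {
        for (const s of servers) this.active.set(s, 0);
    }

    acquire(): string {
        const names = [...this.active.keys()];
        // Sample two distinct servers uniformly at random.
        const i = Math.floor(this.rand() * names.length);
        let j = Math.floor(this.rand() * (names.length - 1));
        if (j >= i) j++;
        const a = names[i], b = names[j];
        // Send to whichever of the two has fewer active connections.
        const winner = this.active.get(a)! <= this.active.get(b)! ? a : b;
        this.active.set(winner, this.active.get(winner)! + 1);
        return winner;
    }

    // Call when the connection finishes.
    release(server: string): void {
        this.active.set(server, this.active.get(server)! - 1);
    }
}
```

Note it only inspects 2 servers per request regardless of pool size — that O(1) cost is what makes it attractive for very large clusters where a full least-connections scan is too expensive.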

Algorithm Comparison at a Glance

| Algorithm | Stateful? | Session sticky? | Main use case | When NOT to use |
| --- | --- | --- | --- | --- |
| Round Robin | No | No | Stateless APIs, uniform microservices | Heterogeneous server configs |
| Weighted RR | No | No | Mixed fleets (on-prem + cloud) | Highly variable load |
| Least Connections | Yes | No | WebSockets, long-running requests | Very short, uniform requests |
| IP Hash | No | Yes | Legacy apps needing session affinity | Many clients behind NAT |
| Consistent Hash | No | Yes | Distributed caches, sharded DBs | Simple stateless services |
| Random Two Choices | Yes (light) | No | Large clusters needing near-optimal spread | Small clusters (<4 servers) |

Common Load Balancing Tools

NGINX — Reverse Proxy and Load Balancer

NGINX is the most popular choice for L7 Load Balancing thanks to its high performance (handling millions of concurrent connections), simple configuration, and rich module ecosystem.

# nginx.conf — Complete Load Balancing
http {
    upstream api_servers {
        least_conn;
        server 10.0.1.10:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.1.11:8080 weight=3 max_fails=3 fail_timeout=30s;
        server 10.0.1.12:8080 weight=1 backup;  # Only used when the above 2 fail
    }

    server {
        listen 443 ssl http2;
        server_name api.example.com;

        # SSL Termination
        ssl_certificate     /etc/ssl/certs/api.crt;
        ssl_certificate_key /etc/ssl/private/api.key;

        # Implicit health check via max_fails/fail_timeout
        location /api/ {
            proxy_pass http://api_servers;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;

            # Timeout configuration
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
            proxy_send_timeout 10s;
        }
    }
}

HAProxy — Dedicated Load Balancer

HAProxy is renowned for its L4 and L7 capabilities, purpose-built for load balancing with active health checks, detailed metrics, and impressive performance.

# haproxy.cfg
frontend http_front
    bind *:443 ssl crt /etc/ssl/api.pem
    mode http

    # Content-based routing
    acl is_api path_beg /api
    acl is_ws  hdr(Upgrade) -i websocket

    use_backend api_servers if is_api
    use_backend ws_servers  if is_ws
    default_backend static_servers

backend api_servers
    mode http
    balance leastconn
    option httpchk GET /health
    http-check expect status 200

    server api1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    server api2 10.0.1.11:8080 check inter 5s fall 3 rise 2

backend ws_servers
    mode http
    balance source   # IP Hash to keep WebSocket sessions sticky
    timeout tunnel 3600s
    server ws1 10.0.2.10:8080 check
    server ws2 10.0.2.11:8080 check

Cloud Load Balancers — Managed and Auto-scaling

| Service | Layer | Free tier | Strengths |
| --- | --- | --- | --- |
| AWS ALB | L7 | 750h/month (first 12 months) | Path-based routing, gRPC, WebSocket |
| AWS NLB | L4 | 750h/month (first 12 months) | Ultra-low latency, static IP, TLS passthrough |
| Azure Load Balancer | L4 | Basic SKU free | Zone-redundant, HA Ports |
| Azure App Gateway | L7 | None (from ~$18/month) | Integrated WAF, SSL offloading, URL rewrite |
| Cloudflare LB | L7 | None (from $5/month) | 330+ PoPs, geo-steering, global health checks |

Health Checks and Failover — The Safety Net

A Load Balancer without health checks is like an intersection without traffic lights. Health checks let the LB automatically remove dead servers from the pool and bring them back when they recover.

sequenceDiagram
    participant LB as Load Balancer
    participant S1 as Server A (healthy)
    participant S2 as Server B (failing)
    participant S3 as Server C (healthy)

    loop Health Check (every 5s)
        LB->>S1: GET /health
        S1-->>LB: 200 OK ✓
        LB->>S2: GET /health
        S2-->>LB: 503 Error ✗
        LB->>S3: GET /health
        S3-->>LB: 200 OK ✓
    end

    Note over LB,S2: Server B fails 3 times in a row → marked DOWN

    LB->>S1: Route traffic (50%)
    LB->>S3: Route traffic (50%)
    Note over S2: Receives no traffic

    S2-->>LB: 200 OK ✓ (after 2 successful checks)
    Note over LB,S2: Server B recovers → returned to the pool

Health Check flow: detect a failing server → remove from pool → auto-recover

There are 3 common types of health checks:

  • Active Health Check: The LB actively sends probes (HTTP GET /health, TCP connect, or custom scripts). HAProxy and cloud LBs support this by default.
  • Passive Health Check: The LB observes real traffic — if a server keeps returning errors (e.g., 5× 5xx in 30s), it is marked down automatically. This is the only kind NGINX Open Source supports; active checks require NGINX Plus.
  • Deep Health Check: Verify dependencies too (database connection, disk space, memory). Return details via a /health/detailed endpoint.
// ASP.NET — Deep Health Check
// Program.cs
builder.Services.AddHealthChecks()
    .AddSqlServer(connectionString, name: "database")
    .AddRedis(redisConnection, name: "cache")
    .AddCheck("disk-space", () =>
    {
        var drive = new DriveInfo("C");
        return drive.AvailableFreeSpace > 1_073_741_824  // > 1GB
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Degraded("Low disk space");
    });

app.MapHealthChecks("/health", new HealthCheckOptions
{
    Predicate = _ => true,
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

Real-World Deployment Architectures

Pattern 1: Single-tier LB — For Small and Medium Systems

graph LR
    Client[Client] --> LB[NGINX / HAProxy<br/>L7 Load Balancer]
    LB --> S1[App Server 1]
    LB --> S2[App Server 2]
    LB --> S3[App Server 3]
    S1 --> DB[(Database)]
    S2 --> DB
    S3 --> DB

    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style LB fill:#e94560,stroke:#fff,color:#fff
    style S1 fill:#2c3e50,stroke:#fff,color:#fff
    style S2 fill:#2c3e50,stroke:#fff,color:#fff
    style S3 fill:#2c3e50,stroke:#fff,color:#fff
    style DB fill:#4CAF50,stroke:#fff,color:#fff

Single-tier: simple, easy to operate, fits ~10K RPS

Pattern 2: Two-tier LB — For Large Systems

graph TB
    Client[Client] --> DNS[DNS / GeoDNS]
    DNS --> L4A[NLB - L4<br/>Region A]
    DNS --> L4B[NLB - L4<br/>Region B]
    L4A --> L7A1[ALB/NGINX - L7]
    L4A --> L7A2[ALB/NGINX - L7]
    L4B --> L7B1[ALB/NGINX - L7]
    L7A1 --> API1[API Pods]
    L7A2 --> WEB1[Web Pods]
    L7B1 --> API2[API Pods]

    style Client fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style DNS fill:#ff9800,stroke:#fff,color:#fff
    style L4A fill:#e94560,stroke:#fff,color:#fff
    style L4B fill:#e94560,stroke:#fff,color:#fff
    style L7A1 fill:#2c3e50,stroke:#fff,color:#fff
    style L7A2 fill:#2c3e50,stroke:#fff,color:#fff
    style L7B1 fill:#2c3e50,stroke:#fff,color:#fff
    style API1 fill:#4CAF50,stroke:#fff,color:#fff
    style WEB1 fill:#4CAF50,stroke:#fff,color:#fff
    style API2 fill:#4CAF50,stroke:#fff,color:#fff

Two-tier: L4 edge → L7 routing, supports multi-region

Pattern 3: Global Load Balancing — For Worldwide Systems

When users are spread globally, you need load balancing at the DNS layer. GeoDNS or Anycast routing brings users to the nearest datacenter. Cloudflare, AWS Route 53, and Azure Traffic Manager all support this pattern.

3 Global LB strategies

  • Geo-proximity: Route to the geographically nearest datacenter. Simple, effective at reducing latency.
  • Latency-based: Measure actual latency from users to each region, route to the fastest. More accurate than geo-proximity.
  • Failover: Active-passive — 100% of traffic to the primary region. When primary is down, shift everything to secondary.

Common Pitfalls and How to Fix Them

1. Thundering Herd When a Server Recovers

When a server is brought back into the pool after passing health checks, using Least Connections means all new requests pile onto it (because it has 0 connections). Solution: Slow Start — gradually increase the recovered server's weight over 30-60 seconds.

# HAProxy slow start
backend api_servers
    server api1 10.0.1.10:8080 check slowstart 60s

2. Session Affinity Causing Imbalance

Sticky sessions (via cookies or IP hash) can make one server receive most of the traffic if "heavy users" cluster onto it. Solution: move to a stateless architecture — store sessions in a shared store (database or in-memory cache) instead of on the server.

3. Overly Sensitive or Overly Slow Health Checks

Too sensitive (interval=1s, fall=1): a server gets evicted because of one timeout → constant flapping. Too slow (interval=30s, fall=5): it takes 2.5 minutes to notice a dead server. Recommended: interval=5s, fall=3, rise=2 — detect in 15s, confirm recovery in 10s.

Load Balancing in Kubernetes

Kubernetes has its own load balancing system via Services and Ingress. Understanding how they work helps avoid duplication or conflicts with external LBs.

| Component | Layer | Scope | Default algorithm |
| --- | --- | --- | --- |
| kube-proxy (iptables) | L4 | Inside the cluster | Random (probability-based) |
| kube-proxy (IPVS) | L4 | Inside the cluster | Round Robin (also supports Least Conn, Source Hash...) |
| Ingress Controller | L7 | External → cluster | Depends on controller (NGINX, Traefik, Envoy) |
| Service type LoadBalancer | L4 | External → cluster | Cloud provider LB (ALB, NLB...) |
| Gateway API | L4/L7 | External → cluster | Depends on implementation; more flexible than Ingress |

Load Balancer Deployment Checklist

Production Readiness Checklist

  1. Pick the algorithm that fits your workload — Round Robin for stateless, Least Connections for variable latency, Consistent Hash for caches
  2. Configure health checks — active checks with a /health endpoint, 5s interval, fail threshold 3
  3. SSL Termination — terminate TLS at the LB to offload work from backends
  4. Logging & Monitoring — track request count, latency p50/p95/p99, error rate, active connections per backend
  5. LB High Availability — the LB itself needs redundancy: VRRP (keepalived), or use a managed cloud LB
  6. Rate Limiting — protect backends from unexpected traffic spikes
  7. Connection Draining — when removing a server from the pool, let in-flight requests finish (graceful shutdown)
  8. Sensible timeouts — short connect timeout (5s), read timeout matching SLA (30-60s)
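
Item 7 also has an application-side half: the backend itself must finish in-flight work before exiting. A hedged Node.js sketch (handler, port, and the 30s deadline are illustrative):

```typescript
// Connection draining at the application level — a Node.js sketch.
// Assumes the LB has already stopped sending new traffic to this
// instance (failed checks or deregistration) before SIGTERM arrives.
import { createServer, Server } from "node:http";

const server = createServer((_req, res) => {
    res.end("ok");
});
// Port 0 picks an ephemeral port for this sketch; a real service
// would bind its configured port (e.g., 8080).
server.listen(0);

process.on("SIGTERM", () => {
    // Stop accepting new connections; the callback fires once every
    // in-flight request has completed.
    server.close(() => process.exit(0));
    // Hard deadline in case a request hangs past the LB's read timeout.
    setTimeout(() => process.exit(1), 30_000).unref();
});
```

Pairing this with the LB's connection draining (e.g., deregistration delay on ALB) is what makes rolling deploys invisible to clients.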

Conclusion

Load Balancing isn't just "splitting requests evenly" — it's the art of distributing load so the system stays fast, stable, and resilient. There's no "best" algorithm — only the one that fits your specific context:

  • Stateless APIs → Round Robin or Random Two Choices
  • WebSockets / Long-running → Least Connections
  • Distributed Cache → Consistent Hashing
  • Legacy session-based apps → IP Hash (temporary; migrate to stateless)
  • Multi-region → Global LB (GeoDNS) + Regional L4/L7

Start simple with Round Robin, add health checks, and only add complexity when you truly need it. Over-engineering load balancing from day one is one of the most common mistakes in system design.
