WebRTC — Building Peer-to-Peer Video Call Architecture in the Browser

Posted on: 4/27/2026 10:18:36 AM

Every day, billions of minutes of video calls happen on Google Meet, Zoom, Discord and hundreds of other apps — all running on the same foundation: WebRTC. This set of browser APIs enables direct peer-to-peer transmission of audio, video and arbitrary data without plugins, without Flash, without installing anything. This article dives deep into WebRTC architecture from network protocols to production deployment with SFU, helping you understand how to build large-scale real-time communication systems.

  • 7.7B+ WebRTC minutes per week globally
  • <200 ms average P2P latency with STUN
  • 98% browser support for WebRTC (2026)
  • 85% of connections succeed via STUN without TURN

1. What is WebRTC and How Does It Work?

WebRTC (Web Real-Time Communication) is a collection of W3C/IETF standard APIs and protocols that enable browsers and native apps to establish peer-to-peer connections for media and data transmission. Unlike the traditional client-server model, WebRTC allows two devices to communicate directly — reducing latency, saving server bandwidth and simplifying architecture for real-time use cases.

Three core WebRTC APIs in the browser:

  • MediaStream (getUserMedia) — Access camera, microphone and screen capture
  • RTCPeerConnection — Establish P2P connections, handle codecs, SRTP encryption and ICE candidate management
  • RTCDataChannel — Arbitrary data channel (text, files, game state) over SCTP with reliable or unreliable mode
graph TD
    A["getUserMedia()<br/>Camera + Mic"] --> B["MediaStream<br/>Audio/Video Tracks"]
    B --> C["RTCPeerConnection<br/>Encryption + ICE + DTLS"]
    C --> D{"NAT Traversal"}
    D -->|STUN succeeds| E["P2P Direct<br/>~85% of cases"]
    D -->|STUN fails| F["TURN Relay<br/>~15% of cases"]
    E --> G["Remote Peer<br/>Receives stream"]
    F --> G
    C --> H["RTCDataChannel<br/>Arbitrary data"]
    H --> G
    style A fill:#e94560,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
    style D fill:#ff9800,stroke:#fff,color:#fff

WebRTC architecture overview — from MediaStream to P2P connection

2. Signaling — The First Step WebRTC Doesn't Define

WebRTC intentionally does not specify a signaling protocol. This is by design — allowing developers to choose any transport channel that fits: WebSocket, HTTP long-polling, Firebase Realtime Database, or even email. Signaling does one thing: exchange the information needed for two peers to find each other and negotiate codecs.

The signaling process involves 3 main steps:

sequenceDiagram
    participant A as Peer A (Caller)
    participant S as Signaling Server
    participant B as Peer B (Callee)

    A->>S: 1. Create Offer (SDP)
    S->>B: Forward Offer
    B->>S: 2. Create Answer (SDP)
    S->>A: Forward Answer
    A->>S: 3. Send ICE Candidates
    S->>B: Forward ICE Candidates
    B->>S: Send ICE Candidates
    S->>A: Forward ICE Candidates
    A-->>B: P2P Connection Established!

Signaling flow — exchanging SDP and ICE Candidates through an intermediary server

SDP (Session Description Protocol) is a text format describing each peer's media capabilities: supported codecs (VP9, H.264, Opus), bandwidth, IP/port addresses. When Peer A creates an offer and Peer B responds with an answer, both sides have agreed on codec and encryption parameters.
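Because SDP is plain text, inspecting what was negotiated is straightforward. A minimal sketch — the sample SDP below is an illustrative fragment, not a complete real offer:

```javascript
// Extract codec names from an SDP string by matching "a=rtpmap" lines.
function listCodecs(sdp) {
  const codecs = [];
  for (const line of sdp.split('\r\n')) {
    // a=rtpmap:<payload type> <codec>/<clock rate>[/<channels>]
    const match = line.match(/^a=rtpmap:\d+ ([^/]+)\//);
    if (match) codecs.push(match[1]);
  }
  return codecs;
}

// Hypothetical SDP fragment for demonstration
const sampleSdp = [
  'v=0',
  'm=audio 9 UDP/TLS/RTP/SAVPF 111',
  'a=rtpmap:111 opus/48000/2',
  'm=video 9 UDP/TLS/RTP/SAVPF 96 98',
  'a=rtpmap:96 VP8/90000',
  'a=rtpmap:98 H264/90000',
].join('\r\n');

console.log(listCodecs(sampleSdp)); // → [ 'opus', 'VP8', 'H264' ]
```

In the browser you would read the same text from `pc.localDescription.sdp` after `createOffer()`.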

Signaling Server Implementation Tips

For small apps (<1,000 concurrent users), a simple WebSocket server on Node.js or ASP.NET Core SignalR is sufficient. When scaling up, use Redis Pub/Sub as a message broker between signaling nodes to ensure all peers receive ICE candidates on time.
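The core of any signaling server is room bookkeeping: who is in which room, and who should receive a forwarded offer, answer or candidate. A transport-agnostic sketch (class and method names are illustrative, not a SignalR or Socket.IO API):

```javascript
// Minimal in-memory room registry for a signaling server.
// In production this state would live in Redis so every node sees it.
class RoomRegistry {
  constructor() {
    this.rooms = new Map(); // roomId -> Set of peerIds
  }

  join(roomId, peerId) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Set());
    this.rooms.get(roomId).add(peerId);
  }

  leave(roomId, peerId) {
    this.rooms.get(roomId)?.delete(peerId);
  }

  // Everyone in the room except the sender — the fan-out list for
  // forwarding an offer/answer/ICE candidate.
  othersInRoom(roomId, senderId) {
    const peers = this.rooms.get(roomId) ?? new Set();
    return [...peers].filter(id => id !== senderId);
  }
}

const registry = new RoomRegistry();
registry.join('room-1', 'alice');
registry.join('room-1', 'bob');
registry.join('room-1', 'carol');
console.log(registry.othersInRoom('room-1', 'alice')); // → [ 'bob', 'carol' ]
```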

3. NAT Traversal — STUN, TURN and the ICE Framework

The biggest challenge for P2P is NAT (Network Address Translation). Most devices sit behind NAT routers without direct public IPs. WebRTC solves this with ICE (Interactive Connectivity Establishment) — a framework that tries all possible paths and selects the best one.

3.1 STUN — Discovering Your Public IP

STUN (Session Traversal Utilities for NAT) servers help clients discover their public IP and port mapping. The client sends a request to the STUN server, which responds with the public address it sees. This process is lightweight — just a few UDP packets. Google provides free STUN servers at stun:stun.l.google.com:19302.

STUN works with approximately 85% of standard NAT configurations (Full Cone, Restricted Cone, Port Restricted Cone). However, Symmetric NAT — common in enterprise networks — blocks STUN because each different destination gets NAT-mapped to a different port.

3.2 TURN — Relay When P2P Fails

TURN (Traversal Using Relays around NAT) is the fallback: all media passes through a TURN server as a relay. This consumes significant server bandwidth — each 720p video stream uses ~1.5 Mbps, doubled through relay — so TURN is only used when STUN fails.

TURN Costs Are Not Cheap

A TURN server handling 500 concurrent 1-on-1 video calls needs ~1.5 Gbps bandwidth. At average cloud pricing of $0.08/GB, bandwidth costs can reach $500–800/day. Always prioritize STUN and only fall back to TURN when necessary. Use coturn (open-source) and deploy close to users to reduce latency.
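The figures above can be reproduced with back-of-envelope arithmetic. Assumptions: both legs of each 1-on-1 call are relayed, and average daily utilization sits well below the 500-call peak (the ~50% factor below is an assumption, which is roughly what lands in the $500–800/day range):

```javascript
// Back-of-envelope TURN bandwidth and cost estimate.
function relayGbps(concurrentCalls, mbpsPerStream) {
  const streams = concurrentCalls * 2;      // two relayed legs per call
  return (streams * mbpsPerStream) / 1000;  // Mbps -> Gbps
}

function dailyCostUsd(gbps, usdPerGB, utilization) {
  const gbPerSecond = gbps / 8;             // Gbps -> GB/s
  return gbPerSecond * 86400 * utilization * usdPerGB;
}

const peak = relayGbps(500, 1.5);           // 500 concurrent 720p calls
console.log(peak);                          // → 1.5 (Gbps)
console.log(Math.round(dailyCostUsd(peak, 0.08, 0.5))); // → 648 ($/day at 50% avg utilization)
```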

3.3 ICE — Finding the Best Path

The ICE framework collects all ICE candidates (possible connection addresses) from 3 sources: host candidates (local IP), server reflexive candidates (from STUN) and relay candidates (from TURN). ICE then performs connectivity checks in priority order — preferring direct P2P, falling back through TURN if needed.

graph LR
    A["ICE Agent"] --> B["Host Candidate<br/>Local IP: 192.168.1.5:4532"]
    A --> C["Server Reflexive<br/>STUN: 203.0.113.5:6789"]
    A --> D["Relay Candidate<br/>TURN: 198.51.100.2:3478"]
    B --> E{"Connectivity Check"}
    C --> E
    D --> E
    E -->|Priority 1| F["Direct P2P"]
    E -->|Priority 2| G["STUN-assisted P2P"]
    E -->|Priority 3| H["TURN Relay"]
    style A fill:#e94560,stroke:#fff,color:#fff
    style E fill:#ff9800,stroke:#fff,color:#fff
    style F fill:#4CAF50,stroke:#fff,color:#fff

ICE Framework — gathering candidates and selecting the optimal connection path
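The "priority order" is not arbitrary — RFC 8445 defines a formula that bakes the candidate type into the most significant bits, so host candidates always outrank server-reflexive ones, which outrank relay. A sketch of that formula:

```javascript
// ICE candidate priority per RFC 8445 §5.1.2.1:
//   priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
// Recommended type preferences: host 126, peer-reflexive 110,
// server-reflexive 100, relay 0.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPreference = 65535, componentId = 1) {
  return (
    2 ** 24 * TYPE_PREFERENCE[type] +
    2 ** 8 * localPreference +
    (256 - componentId)
  );
}

// Direct beats STUN-assisted beats relayed, regardless of local preference:
console.log(candidatePriority('host') > candidatePriority('srflx'));  // → true
console.log(candidatePriority('srflx') > candidatePriority('relay')); // → true
```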

4. Media Pipeline — Codecs, Encryption and Adaptive Bitrate

4.1 Audio & Video Codecs

WebRTC mandates support for the following codecs:

Type  | Mandatory Codec | Optional (Common) | Characteristics
Audio | Opus            | G.711, iSAC       | Opus: 6–510 kbps, adaptive bitrate, 48 kHz. Best for voice + music
Video | VP8, H.264      | VP9, AV1, H.265   | VP9 saves 30–50% bandwidth vs VP8. AV1 is newest but CPU-intensive to encode

4.2 Mandatory Encryption

All WebRTC connections are encrypted by default — there is no option to disable it. The encryption stack consists of:

  • DTLS (Datagram Transport Layer Security) — Handshake and key exchange, similar to TLS but for UDP
  • SRTP (Secure Real-time Transport Protocol) — Encrypts audio/video payload with AES-128
  • SCTP over DTLS — Encrypts data on RTCDataChannel

4.3 Adaptive Bitrate & Congestion Control

WebRTC uses the GCC (Google Congestion Control) algorithm to automatically adjust bitrate based on network conditions. When packet loss or increased latency is detected, the encoder reduces resolution/framerate/bitrate. When the network improves, quality automatically increases. This is why video calls sometimes go "blurry" for a few seconds then clear up — GCC at work.
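A simplified sketch of GCC's loss-based controller, with the thresholds from the GCC draft (the deployed algorithm also runs a delay-based estimator in parallel, which this sketch omits):

```javascript
// Loss-based rate control: back off multiplicatively above 10% loss,
// probe upward by 5% below 2% loss, hold steady in between.
function nextBitrate(currentBps, lossFraction) {
  if (lossFraction > 0.10) return currentBps * (1 - 0.5 * lossFraction);
  if (lossFraction < 0.02) return currentBps * 1.05;
  return currentBps; // 2-10% loss: hold
}

let rate = 2_000_000;           // start at 2 Mbps
rate = nextBitrate(rate, 0.20); // heavy loss -> back off
console.log(rate);              // → 1800000
rate = nextBitrate(rate, 0.0);  // clean network -> probe up
console.log(rate);              // → 1890000
```

The multiplicative decrease / gentle increase asymmetry is what produces the "blurry then clear" pattern described above.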

Modern browsers support Simulcast — sending multiple quality layers simultaneously (e.g., 1080p + 720p + 360p). The receiver or SFU selects the appropriate layer for current bandwidth, avoiding CPU-intensive re-encoding.

5. Production Architecture — P2P, SFU and MCU

Pure P2P only works well for 1-on-1 calls. With 3+ participants, the full mesh model (every peer connects to every other peer) doesn't scale — N participants need N×(N-1)/2 connections. With 10 people, each device must encode and send 9 separate streams.
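The mesh arithmetic is worth making concrete:

```javascript
// Full-mesh scaling: N peers need N*(N-1)/2 connections,
// and each peer must encode and upload N-1 separate streams.
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}
function uploadStreamsPerPeer(n) {
  return n - 1;
}

console.log(meshConnections(10));      // → 45
console.log(uploadStreamsPerPeer(10)); // → 9
// At ~1.5 Mbps per 720p stream, each peer uploads 13.5 Mbps:
console.log(uploadStreamsPerPeer(10) * 1.5); // → 13.5
```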

graph TD
    subgraph "P2P Mesh — Max 4-5 people"
        P1["Peer 1"] <--> P2["Peer 2"]
        P1 <--> P3["Peer 3"]
        P2 <--> P3
    end

    subgraph "SFU — Hundreds of people"
        S1["Peer 1"] --> SFU["SFU Server<br/>Forward streams"]
        S2["Peer 2"] --> SFU
        S3["Peer 3"] --> SFU
        S4["Peer N"] --> SFU
        SFU --> S1
        SFU --> S2
        SFU --> S3
        SFU --> S4
    end

    subgraph "MCU — Low bandwidth"
        M1["Peer 1"] --> MCU["MCU Server<br/>Mix + Re-encode"]
        M2["Peer 2"] --> MCU
        M3["Peer 3"] --> MCU
        MCU --> M1
        MCU --> M2
        MCU --> M3
    end

    style SFU fill:#e94560,stroke:#fff,color:#fff
    style MCU fill:#2c3e50,stroke:#fff,color:#fff

Three WebRTC architectures: Mesh (P2P), SFU (Selective Forwarding) and MCU (Mixing)

5.1 SFU — The #1 Choice for Production in 2026

SFU (Selective Forwarding Unit) is the dominant architecture for WebRTC production. Each participant sends 1 stream to the SFU, which forwards it to all other participants — no decoding, no re-encoding. Advantages:

  • Low server CPU — only forwards packets, no media processing
  • Low latency — no intermediate decode/encode step
  • Scales well — each SFU node handles 500-1000 concurrent streams
  • Simulcast compatible — SFU selects appropriate layer for each receiver

5.2 MCU — When Client Bandwidth Is the Issue

MCU (Multipoint Control Unit) decodes all incoming streams, mixes them into a single layout, re-encodes and sends to each participant. Clients only receive 1 stream — saving downstream bandwidth. But MCU consumes massive server CPU and adds 200-500ms latency from decode/encode. MCU fits: weak IoT devices, 3G mobile connections, or recording/broadcasting.

5.3 Detailed SFU vs MCU Comparison

Criteria                      | SFU                                | MCU
Server CPU                    | Low (forward only)                 | Very high (decode + mix + encode)
Added latency                 | ~10–50 ms                          | ~200–500 ms
Client bandwidth (downstream) | High (receives N−1 streams)        | Low (receives 1 stream)
Scale                         | Good — 500–1,000 streams/node      | Limited — 50–100 participants/node
Video quality                 | Original (no re-encoding)          | Reduced (through re-encoding)
Simulcast                     | Native support                     | Not needed (already mixed)
Best use case                 | Video conferencing, live streaming | IoT, legacy devices, recording
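The downstream-bandwidth row is the decisive trade-off, and it is easy to quantify:

```javascript
// Client downstream bandwidth for a room of n participants:
// an SFU forwards n-1 individual streams; an MCU mixes them into one.
function downstreamMbps(architecture, n, mbpsPerStream) {
  if (architecture === 'sfu') return (n - 1) * mbpsPerStream;
  if (architecture === 'mcu') return mbpsPerStream;
  throw new Error(`unknown architecture: ${architecture}`);
}

console.log(downstreamMbps('sfu', 8, 1.5)); // → 10.5
console.log(downstreamMbps('mcu', 8, 1.5)); // → 1.5
```

An 8-person 720p call costs each SFU client 10.5 Mbps down (before simulcast savings) versus a constant 1.5 Mbps with an MCU — which is exactly why MCU still fits constrained devices.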

6. Open-Source SFUs — LiveKit, mediasoup and Janus

The three most popular open-source SFUs, each suited for different contexts:

6.1 LiveKit — Modern SFU Written in Go

LiveKit has emerged as the top choice for teams wanting to ship fast. Written in Go, leveraging goroutines for concurrent connections. Ships with SDKs for JavaScript, React, Swift, Kotlin, Flutter, Unity and server-side SDKs for Node.js, Python, Go, .NET. LiveKit includes signaling, room management and recording out of the box.

// LiveKit JavaScript Client — connect to room
import { Room, RoomEvent } from 'livekit-client';

const room = new Room();
await room.connect('wss://your-livekit-server.com', token);

room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
  const element = track.attach();
  document.getElementById('remote-video').appendChild(element);
});

// Publish local camera
const localTracks = await room.localParticipant.enableCameraAndMicrophone();

6.2 mediasoup — High-Performance SFU with C++ Core

mediasoup has its core written in C++ for optimal media processing performance, with a Node.js signaling layer. Worker-based architecture: each CPU core runs a Worker process, handling media routing for multiple rooms. mediasoup provides fine-grained control over every transport, producer and consumer — ideal for teams wanting deep customization.

6.3 Janus Gateway — Versatile Plugin Architecture

Janus is written in C, released in 2014, making it the oldest and most versatile SFU. Its plugin architecture enables extension: VideoRoom (SFU), AudioBridge (audio mixing), Streaming (one-to-many), SIP Gateway, Record/Play. Janus fits when you need integration with legacy VoIP/SIP systems.

Criteria      | LiveKit               | mediasoup                         | Janus
Language      | Go                    | C++ (core) + Node.js              | C
Setup         | Fast — all-in-one SDK | Medium — build signaling yourself | Medium — choose plugins
Customization | Medium                | Very high                         | High (plugin system)
Scalability   | Built-in multi-node   | Self-managed                      | Self-managed
.NET SDK      | Yes (server-side)     | No official support               | No
Recording     | Built-in (Egress)     | Self-implement                    | Record/Play plugin
Best for      | Startups, ship fast   | Custom platforms, large scale     | SIP/VoIP, legacy integration

7. Encoded Transform — True End-to-End Encryption

By default, WebRTC encrypts hop-by-hop with DTLS-SRTP — meaning SFU servers can see media in plaintext when forwarding. For sensitive applications (healthcare, finance), this isn't sufficient.

The WebRTC Encoded Transform API (W3C Working Draft, updated 02/2026) allows inserting a processing step into the pipeline between the encoder and packetizer. Developers can encrypt the payload with a private key before sending to the SFU — the SFU can only forward encrypted payload, unable to read content. This is true E2EE (End-to-End Encryption).

// Encoded Transform — encrypt frames before sending
// Note: with Chrome's legacy API, the RTCPeerConnection must be created
// with { encodedInsertableStreams: true } for createEncodedStreams() to exist.
// encryptFrame() and sharedKey are application-provided.
const sender = peerConnection.getSenders()[0];
const senderStreams = sender.createEncodedStreams();
const transformStream = new TransformStream({
  transform(encodedFrame, controller) {
    // Encrypt payload with AES-GCM
    const encryptedData = encryptFrame(encodedFrame.data, sharedKey);
    encodedFrame.data = encryptedData;
    controller.enqueue(encodedFrame);
  }
});
senderStreams.readable
  .pipeThrough(transformStream)
  .pipeTo(senderStreams.writable);

Encoded Transform Browser Support (04/2026)

Chrome/Edge: full support since Chrome 86+. Firefox: implementing RTCRtpScriptTransform (latest spec). Safari: not yet supported — needs polyfill or fallback. If your target audience primarily uses Chrome/Edge (>75% market share), you can deploy E2EE today.
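Given the split between Chrome's legacy streams API and the newer RTCRtpScriptTransform spec, feature detection is worth doing before enabling E2EE. A hedged sketch — the `scope` parameter is injected so the logic is testable outside a browser; in a page you would pass `window`:

```javascript
// Detect which Encoded Transform flavor (if any) this environment offers.
function encodedTransformSupport(scope) {
  if (typeof scope.RTCRtpScriptTransform === 'function') return 'script-transform';
  if (scope.RTCRtpSender &&
      typeof scope.RTCRtpSender.prototype.createEncodedStreams === 'function') {
    return 'encoded-streams';
  }
  return 'none'; // e.g. Safari as of 04/2026 -> fall back, or skip E2EE
}

// Mock scopes stand in for different browsers:
console.log(encodedTransformSupport({ RTCRtpScriptTransform: function () {} })); // → script-transform
console.log(encodedTransformSupport({})); // → none
```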

8. Production Deployment — Reference Architecture

Below is a production architecture for a video conferencing system supporting 10,000+ concurrent users:

graph TD
    Client["Client App<br/>Vue.js + LiveKit SDK"] -->|WebSocket| LB["Load Balancer<br/>Geographic DNS"]
    LB --> SIG["Signaling Cluster<br/>3+ nodes"]
    SIG -->|Redis Pub/Sub| SIG
    SIG --> SFU1["SFU Node 1<br/>Region: Asia"]
    SIG --> SFU2["SFU Node 2<br/>Region: EU"]
    SIG --> SFU3["SFU Node 3<br/>Region: US"]
    SFU1 --> TURN1["TURN Server<br/>coturn — Asia"]
    SFU2 --> TURN2["TURN Server<br/>coturn — EU"]
    SFU3 --> TURN3["TURN Server<br/>coturn — US"]
    SFU1 --> REC["Recording Service<br/>Egress to S3/R2"]
    SFU1 --> MON["Monitoring<br/>Prometheus + Grafana"]
    style Client fill:#e94560,stroke:#fff,color:#fff
    style SIG fill:#2c3e50,stroke:#fff,color:#fff
    style SFU1 fill:#16213e,stroke:#fff,color:#fff
    style SFU2 fill:#16213e,stroke:#fff,color:#fff
    style SFU3 fill:#16213e,stroke:#fff,color:#fff
    style MON fill:#4CAF50,stroke:#fff,color:#fff

Multi-region production architecture for WebRTC — SFU cluster with geographic routing

8.1 Deployment Checklist

  • TURN server — Deploy coturn in each region, configure TLS (port 443) to bypass corporate firewalls
  • Bandwidth planning — Each 720p video participant: ~1.5 Mbps up + 1.5×(N-1) Mbps down (SFU mode). With simulcast: downstream drops 40-60%
  • Monitoring — Track metrics: ICE connection time, packet loss rate, bitrate adaptation, TURN usage percentage
  • Fallback strategy — When an SFU node is overloaded, redirect participants to another node. LiveKit has built-in load balancing
  • Recording — Use composite recording (MCU-style) for archives, or individual track recording for post-processing
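The bandwidth-planning bullet can be turned into a small planner. Assumptions: ~1.5 Mbps per 720p upload, and a mid-range 50% simulcast saving (the checklist quotes 40–60%):

```javascript
// Per-participant SFU bandwidth estimate for an n-person room.
// The 0.5 simulcast factor is an assumed mid-range saving.
function participantMbps(n, { upMbps = 1.5, simulcast = false } = {}) {
  const down = (n - 1) * upMbps * (simulcast ? 0.5 : 1);
  return { up: upMbps, down };
}

console.log(participantMbps(10));                      // → { up: 1.5, down: 13.5 }
console.log(participantMbps(10, { simulcast: true })); // → { up: 1.5, down: 6.75 }
```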

8.2 Code Sample — Signaling Server with ASP.NET Core

// SignalingHub.cs — WebRTC signaling via SignalR
public class SignalingHub : Hub
{
    public async Task JoinRoom(string roomId)
    {
        await Groups.AddToGroupAsync(Context.ConnectionId, roomId);
        await Clients.OthersInGroup(roomId).SendAsync("user-joined", Context.ConnectionId);
    }

    public async Task SendOffer(string targetId, string sdp)
    {
        await Clients.Client(targetId).SendAsync("offer", Context.ConnectionId, sdp);
    }

    public async Task SendAnswer(string targetId, string sdp)
    {
        await Clients.Client(targetId).SendAsync("answer", Context.ConnectionId, sdp);
    }

    public async Task SendIceCandidate(string targetId, string candidate)
    {
        await Clients.Client(targetId).SendAsync("ice-candidate", Context.ConnectionId, candidate);
    }
}

// Client — Vue.js + WebRTC
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    { urls: 'turn:turn.example.com:443', username: 'user', credential: 'pass' }
  ]
});

const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
stream.getTracks().forEach(track => pc.addTrack(track, stream));

pc.onicecandidate = ({ candidate }) => {
  if (candidate) signalR.invoke('SendIceCandidate', targetId, JSON.stringify(candidate));
};

// Create offer and send via signaling
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
signalR.invoke('SendOffer', targetId, JSON.stringify(offer));

9. WebRTC Performance Optimization

9.1 Simulcast & SVC

Send multiple quality layers simultaneously, letting the SFU select the right layer for each receiver. Configuring simulcast in WebRTC:

const sender = pc.addTrack(videoTrack, stream);
const params = sender.getParameters();
params.encodings = [
  { rid: 'low', maxBitrate: 200000, scaleResolutionDownBy: 4 },
  { rid: 'mid', maxBitrate: 700000, scaleResolutionDownBy: 2 },
  { rid: 'high', maxBitrate: 2500000 }
];
await sender.setParameters(params);

9.2 Bandwidth Estimation

Use RTCPeerConnection.getStats() for real-time monitoring:

setInterval(async () => {
  const stats = await pc.getStats();
  stats.forEach(report => {
    if (report.type === 'outbound-rtp' && report.kind === 'video') {
      // bytesSent is cumulative since the connection started
      console.log(`Bytes sent: ${report.bytesSent}, Frames: ${report.framesEncoded}`);
      console.log(`Quality limit: ${report.qualityLimitationReason}`);
    }
  });
}, 2000);
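Because `bytesSent` is cumulative, the actual bitrate has to be derived from the delta between two polling samples. A helper sketch:

```javascript
// Derive bitrate (kbps) from two successive outbound-rtp samples.
// getStats() report timestamps are in milliseconds.
function bitrateKbps(prevReport, currReport) {
  const bytes = currReport.bytesSent - prevReport.bytesSent;
  const seconds = (currReport.timestamp - prevReport.timestamp) / 1000;
  return (bytes * 8) / 1000 / seconds;
}

const prev = { bytesSent: 1_000_000, timestamp: 0 };
const curr = { bytesSent: 1_500_000, timestamp: 2000 };
console.log(bitrateKbps(prev, curr)); // → 2000
```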

9.3 Network Quality Indicator

Measure packet loss and round-trip time to display a quality indicator for users:

  • Good: packet loss < 1%, RTT < 150ms
  • Fair: packet loss 1-5%, RTT 150-300ms
  • Poor: packet loss > 5%, RTT > 300ms → automatically reduce resolution
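The thresholds above map directly to a classifier (loss as a fraction, so 0.01 = 1%):

```javascript
// Map packet loss and round-trip time to a quality label.
function connectionQuality(lossFraction, rttMs) {
  if (lossFraction < 0.01 && rttMs < 150) return 'good';
  if (lossFraction <= 0.05 && rttMs <= 300) return 'fair';
  return 'poor'; // trigger automatic resolution reduction here
}

console.log(connectionQuality(0.005, 100)); // → good
console.log(connectionQuality(0.03, 200));  // → fair
console.log(connectionQuality(0.08, 400));  // → poor
```

The loss and RTT inputs come from `getStats()` — `packetsLost` on inbound-rtp reports and `currentRoundTripTime` on the selected candidate pair.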

10. Real-World Use Cases Beyond Video Calls

WebRTC isn't just for video conferencing:

  • Screen Sharing — getDisplayMedia() API captures screen, window or specific tab
  • P2P File Transfer — RTCDataChannel enables direct file transfer between browsers, bypassing servers. Speed can reach 100+ Mbps on LAN
  • Cloud Gaming — Stream gameplay from server, receive input from client via DataChannel. Google Stadia (now shut down) and Xbox Cloud Gaming both used WebRTC
  • IoT & Robotics — Control robots/drones via DataChannel, receive video feed through media streams
  • Live Streaming — WHIP (WebRTC HTTP Ingestion Protocol) enables publishing live streams to CDN via WebRTC, replacing traditional RTMP

WHIP & WHEP — New Standards for Live Streaming

WHIP (WebRTC HTTP Ingestion Protocol) standardizes how to publish streams to servers. WHEP (WebRTC HTTP Egress Protocol) standardizes how viewers subscribe to streams. Both are already supported by Cloudflare Stream, AWS IVS and many CDNs. This is the future replacement for RTMP in live streaming with sub-second latency.
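The appeal of WHIP is its simplicity: publishing is a single HTTP POST carrying the offer SDP as `application/sdp`, and the answer SDP comes back in the response. A sketch that builds the request descriptor (endpoint URL and token are placeholders):

```javascript
// Build a WHIP publish request: offer SDP up, answer SDP back.
function buildWhipRequest(endpoint, offerSdp, bearerToken) {
  return {
    url: endpoint,
    method: 'POST',
    headers: {
      'Content-Type': 'application/sdp',
      ...(bearerToken ? { Authorization: `Bearer ${bearerToken}` } : {}),
    },
    body: offerSdp,
  };
}

const req = buildWhipRequest('https://whip.example.com/publish', 'v=0\r\n...', 'token123');
console.log(req.method, req.headers['Content-Type']); // → POST application/sdp
// In the browser: const res = await fetch(req.url, req);
// const answerSdp = await res.text();
// await pc.setRemoteDescription({ type: 'answer', sdp: answerSdp });
```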

Conclusion

WebRTC has matured from a Google experiment into the standard platform for all real-time communication applications on the web. With SFU architecture, you can build video conferencing systems supporting thousands of concurrent users. The Encoded Transform API enables true end-to-end encryption. And with WHIP/WHEP, WebRTC is expanding into live streaming territory.

Key takeaways for deployment: start with LiveKit if you need to ship fast, mediasoup if you need deep customization, and Janus if you need legacy VoIP integration. Always deploy TURN servers in each region and monitor ICE connection metrics to ensure the best user experience.

References:
WebRTC Official — webrtc.org
MDN Web Docs — WebRTC API
W3C WebRTC Specification
W3C WebRTC Encoded Transform
LiveKit Documentation
mediasoup Documentation
Janus Gateway Documentation
BlogGeek.me — WebRTC Open Source Media Servers