WebRTC — Building Peer-to-Peer Video Call Architecture in the Browser
Posted on: 4/27/2026 10:18:36 AM
Table of contents
- 1. What is WebRTC and How Does It Work?
- 2. Signaling — The First Step WebRTC Doesn't Define
- 3. NAT Traversal — STUN, TURN and the ICE Framework
- 4. Media Pipeline — Codecs, Encryption and Adaptive Bitrate
- 5. Production Architecture — P2P, SFU and MCU
- 6. Open-Source SFUs — LiveKit, mediasoup and Janus
- 7. Encoded Transform — True End-to-End Encryption
- 8. Production Deployment — Reference Architecture
- 9. WebRTC Performance Optimization
- 10. Real-World Use Cases Beyond Video Calls
- Conclusion
Every day, billions of minutes of video calls happen on Google Meet, Zoom, Discord and hundreds of other apps — all running on the same foundation: WebRTC. This set of browser APIs enables direct peer-to-peer transmission of audio, video and arbitrary data without plugins, without Flash, without installing anything. This article dives deep into WebRTC architecture from network protocols to production deployment with SFU, helping you understand how to build large-scale real-time communication systems.
1. What is WebRTC and How Does It Work?
WebRTC (Web Real-Time Communication) is a collection of W3C/IETF standard APIs and protocols that enable browsers and native apps to establish peer-to-peer connections for media and data transmission. Unlike the traditional client-server model, WebRTC allows two devices to communicate directly — reducing latency, saving server bandwidth and simplifying architecture for real-time use cases.
Three core WebRTC APIs in the browser:
- MediaStream (getUserMedia) — Access camera, microphone and screen capture
- RTCPeerConnection — Establish P2P connections, handle codecs, SRTP encryption and ICE candidate management
- RTCDataChannel — Arbitrary data channel (text, files, game state) over SCTP with reliable or unreliable mode
graph TD
A["getUserMedia()
Camera + Mic"] --> B["MediaStream
Audio/Video Tracks"]
B --> C["RTCPeerConnection
Encryption + ICE + DTLS"]
C --> D{"NAT Traversal"}
D -->|STUN succeeds| E["P2P Direct
~85% of cases"]
D -->|STUN fails| F["TURN Relay
~15% of cases"]
E --> G["Remote Peer
Receives stream"]
F --> G
C --> H["RTCDataChannel
Arbitrary data"]
H --> G
style A fill:#e94560,stroke:#fff,color:#fff
style C fill:#2c3e50,stroke:#fff,color:#fff
style G fill:#4CAF50,stroke:#fff,color:#fff
style D fill:#ff9800,stroke:#fff,color:#fff
WebRTC architecture overview — from MediaStream to P2P connection
2. Signaling — The First Step WebRTC Doesn't Define
WebRTC intentionally does not specify a signaling protocol. This is by design — allowing developers to choose any transport channel that fits: WebSocket, HTTP long-polling, Firebase Realtime Database, or even email. Signaling does one thing: exchange the information needed for two peers to find each other and negotiate codecs.
The signaling process involves 3 main steps:
sequenceDiagram
participant A as Peer A (Caller)
participant S as Signaling Server
participant B as Peer B (Callee)
A->>S: 1. Create Offer (SDP)
S->>B: Forward Offer
B->>S: 2. Create Answer (SDP)
S->>A: Forward Answer
A->>S: 3. Send ICE Candidates
S->>B: Forward ICE Candidates
B->>S: Send ICE Candidates
S->>A: Forward ICE Candidates
A-->>B: P2P Connection Established!
Signaling flow — exchanging SDP and ICE Candidates through an intermediary server
SDP (Session Description Protocol) is a text format describing each peer's media capabilities: supported codecs (VP9, H.264, Opus), bandwidth, IP/port addresses. When Peer A creates an offer and Peer B responds with an answer, both sides have agreed on codec and encryption parameters.
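To make the SDP format concrete, here is a minimal sketch that pulls the codec list out of an SDP blob. The sample SDP fragment below is hypothetical but shaped like a real browser offer:

```javascript
// Extract codec names from the rtpmap attributes of an SDP blob.
// SDP is plain text: one "a=rtpmap:<payload> <codec>/<clockrate>" line per codec.
function extractCodecs(sdp) {
  const codecs = [];
  for (const line of sdp.split(/\r?\n/)) {
    const match = line.match(/^a=rtpmap:\d+ ([^/]+)\/(\d+)/);
    if (match) codecs.push({ name: match[1], clockRate: Number(match[2]) });
  }
  return codecs;
}

// Hypothetical SDP fragment for illustration
const sampleSdp = [
  'v=0',
  'm=audio 9 UDP/TLS/RTP/SAVPF 111',
  'a=rtpmap:111 opus/48000/2',
  'm=video 9 UDP/TLS/RTP/SAVPF 96 98',
  'a=rtpmap:96 VP8/90000',
  'a=rtpmap:98 H264/90000'
].join('\r\n');

console.log(extractCodecs(sampleSdp).map(c => c.name)); // [ 'opus', 'VP8', 'H264' ]
```

In practice you rarely parse SDP by hand — the browser does the negotiation — but seeing the rtpmap lines demystifies what the offer/answer exchange actually carries.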
Signaling Server Implementation Tips
For small apps (<1,000 concurrent users), a simple WebSocket server on Node.js or ASP.NET Core SignalR is sufficient. When scaling up, use Redis Pub/Sub as a message broker between signaling nodes to ensure all peers receive ICE candidates on time.
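The core of any signaling server is just room-scoped message fan-out. Here is a transport-agnostic sketch — plug in a WebSocket or SignalR send function; all names are illustrative:

```javascript
// Minimal room-based signaling router: tracks who is in which room
// and forwards offers/answers/candidates to everyone else in the room.
class SignalingRouter {
  constructor() {
    this.rooms = new Map(); // roomId -> Map(peerId -> send callback)
  }
  join(roomId, peerId, send) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Map());
    this.rooms.get(roomId).set(peerId, send);
  }
  // Forward a signaling message to every peer in the room except the sender
  relay(roomId, fromPeerId, message) {
    const peers = this.rooms.get(roomId);
    if (!peers) return 0;
    let delivered = 0;
    for (const [peerId, send] of peers) {
      if (peerId !== fromPeerId) { send({ from: fromPeerId, ...message }); delivered++; }
    }
    return delivered;
  }
}

const router = new SignalingRouter();
const inboxB = [];
router.join('room-1', 'peer-a', () => {});
router.join('room-1', 'peer-b', msg => inboxB.push(msg));
router.relay('room-1', 'peer-a', { type: 'offer', sdp: 'v=0...' });
console.log(inboxB[0].type); // 'offer'
```

With multiple signaling nodes, `relay` is where the Redis Pub/Sub hop goes: publish the message to the room's channel so the node holding the target peer's socket can deliver it.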
3. NAT Traversal — STUN, TURN and the ICE Framework
The biggest challenge for P2P is NAT (Network Address Translation). Most devices sit behind NAT routers without direct public IPs. WebRTC solves this with ICE (Interactive Connectivity Establishment) — a framework that tries all possible paths and selects the best one.
3.1 STUN — Discovering Your Public IP
STUN (Session Traversal Utilities for NAT) servers help clients discover their public IP and port mapping. The client sends a request to the STUN server, which responds with the public address it sees. This process is lightweight — just a few UDP packets. Google provides free STUN servers at stun:stun.l.google.com:19302.
STUN works with approximately 85% of standard NAT configurations (Full Cone, Restricted Cone, Port Restricted Cone). However, Symmetric NAT — common in enterprise networks — blocks STUN because each different destination gets NAT-mapped to a different port.
3.2 TURN — Relay When P2P Fails
TURN (Traversal Using Relays around NAT) is the fallback: all media passes through a TURN server as a relay. This consumes significant server bandwidth — each 720p video stream uses ~1.5 Mbps, doubled through relay — so TURN is only used when STUN fails.
TURN Costs Are Not Cheap
A TURN server handling 500 concurrent 1-on-1 video calls needs ~1.5 Gbps bandwidth. At average cloud pricing of $0.08/GB, bandwidth costs can reach $500–800/day. Always prioritize STUN and only fall back to TURN when necessary. Use coturn (open-source) and deploy close to users to reduce latency.
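The numbers above are easy to sanity-check. A rough back-of-envelope estimator, using the figures from this callout (~3 Mbps relayed per 1-on-1 call, $0.08/GB); the utilization factor is an assumption, since calls are rarely saturated around the clock:

```javascript
// Rough TURN relay cost estimate. mbpsPerCall = 3 assumes a 1-on-1 call
// relaying ~1.5 Mbps in each direction through the server.
function turnCostPerDay({ calls, mbpsPerCall = 3, pricePerGB = 0.08, utilization = 1 }) {
  // Mbps -> MB/s -> GB/day, scaled by average utilization
  const gbPerDay = (calls * mbpsPerCall / 8) * 86400 / 1024 * utilization;
  return { gbPerDay: Math.round(gbPerDay), usdPerDay: Math.round(gbPerDay * pricePerGB) };
}

// 500 concurrent calls at ~50% average utilization over a day
console.log(turnCostPerDay({ calls: 500, utilization: 0.5 })); // { gbPerDay: 7910, usdPerDay: 633 }
```

At 50% utilization this lands in the $500–800/day range quoted above; at full saturation it would roughly double.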
3.3 ICE — Finding the Best Path
The ICE framework collects all ICE candidates (possible connection addresses) from 3 sources: host candidates (local IP), server reflexive candidates (from STUN) and relay candidates (from TURN). ICE then performs connectivity checks in priority order — preferring direct P2P, falling back through TURN if needed.
graph LR
A["ICE Agent"] --> B["Host Candidate
Local IP: 192.168.1.5:4532"]
A --> C["Server Reflexive
STUN: 203.0.113.5:6789"]
A --> D["Relay Candidate
TURN: 198.51.100.2:3478"]
B --> E{"Connectivity
Check"}
C --> E
D --> E
E -->|Priority 1| F["Direct P2P"]
E -->|Priority 2| G["STUN-assisted P2P"]
E -->|Priority 3| H["TURN Relay"]
style A fill:#e94560,stroke:#fff,color:#fff
style E fill:#ff9800,stroke:#fff,color:#fff
style F fill:#4CAF50,stroke:#fff,color:#fff
ICE Framework — gathering candidates and selecting the optimal connection path
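The priority order shown in the diagram is not ad hoc — ICE computes a numeric priority per candidate. A sketch of the RFC 8445 formula (the `localPreference` and `componentId` defaults here are the spec's recommended maximums, used for illustration):

```javascript
// ICE candidate priority per RFC 8445 §5.1.2:
// priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function icePriority(type, localPreference = 65535, componentId = 1) {
  return (2 ** 24) * TYPE_PREFERENCE[type] + (2 ** 8) * localPreference + (256 - componentId);
}

// Higher priority is tried first: host beats STUN-derived, STUN beats TURN relay
const candidates = ['relay', 'srflx', 'host'].sort((a, b) => icePriority(b) - icePriority(a));
console.log(candidates); // [ 'host', 'srflx', 'relay' ]
```

The large 2^24 multiplier on the type preference guarantees that candidate type always dominates the ordering — a relay candidate can never outrank a direct one, no matter its local preference.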
4. Media Pipeline — Codecs, Encryption and Adaptive Bitrate
4.1 Audio & Video Codecs
WebRTC mandates support for the following codecs:
| Type | Mandatory Codec | Optional (Common) | Characteristics |
|---|---|---|---|
| Audio | Opus, G.711 (PCMU/PCMA) | G.722, iSAC (legacy) | Opus: 6–510 kbps, adaptive bitrate, 48kHz. Best for voice + music |
| Video | VP8, H.264 | VP9, AV1, H.265 | VP9 saves 30-50% bandwidth vs VP8. AV1 newest but CPU-intensive encoding |
4.2 Mandatory Encryption
All WebRTC connections are encrypted by default — there is no option to disable it. The encryption stack consists of:
- DTLS (Datagram Transport Layer Security) — Handshake and key exchange, similar to TLS but for UDP
- SRTP (Secure Real-time Transport Protocol) — Encrypts audio/video payload with AES-128
- SCTP over DTLS — Encrypts data on RTCDataChannel
4.3 Adaptive Bitrate & Congestion Control
WebRTC uses the GCC (Google Congestion Control) algorithm to automatically adjust bitrate based on network conditions. When packet loss or increased latency is detected, the encoder reduces resolution/framerate/bitrate. When the network improves, quality automatically increases. This is why video calls sometimes go "blurry" for a few seconds then clear up — GCC at work.
Modern browsers also support Simulcast — sending multiple quality layers simultaneously (e.g., 1080p + 720p + 360p). The receiver or SFU selects the appropriate layer for the current bandwidth, avoiding CPU-intensive re-encoding on the server.
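The per-receiver decision an SFU makes with simulcast is simple to sketch. The layer bitrates below mirror the simulcast encodings configured later in this article; the selection logic itself is illustrative, not any particular SFU's implementation:

```javascript
// Sketch of simulcast layer selection: pick the highest layer whose
// bitrate fits the receiver's estimated available bandwidth.
const LAYERS = [
  { rid: 'high', maxBitrate: 2_500_000 },
  { rid: 'mid',  maxBitrate:   700_000 },
  { rid: 'low',  maxBitrate:   200_000 },
];

function selectLayer(availableBps, layers = LAYERS) {
  // Layers are ordered high -> low; take the first that fits, else the lowest
  return layers.find(l => l.maxBitrate <= availableBps) ?? layers[layers.length - 1];
}

console.log(selectLayer(3_000_000).rid); // 'high'
console.log(selectLayer(1_000_000).rid); // 'mid'
console.log(selectLayer(100_000).rid);   // 'low'
```

Real SFUs refine this with hysteresis (to avoid flapping between layers) and with temporal layers inside each spatial layer, but the core idea is exactly this lookup.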
5. Production Architecture — P2P, SFU and MCU
Pure P2P only works well for 1-on-1 calls. With 3+ participants, the full mesh model (every peer connects to every other peer) doesn't scale — N participants need N×(N-1)/2 connections. With 10 people, each device must encode and send 9 separate streams.
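The mesh math above in one small function:

```javascript
// Full-mesh scaling: every peer connects to every other peer.
function meshStats(participants) {
  return {
    connections: participants * (participants - 1) / 2, // total P2P links in the mesh
    uplinkStreamsPerPeer: participants - 1,             // streams each device must encode and send
  };
}

console.log(meshStats(3));  // { connections: 3, uplinkStreamsPerPeer: 2 }
console.log(meshStats(10)); // { connections: 45, uplinkStreamsPerPeer: 9 }
```

The quadratic `connections` term is why mesh tops out around 4-5 participants: the per-peer encode cost and uplink bandwidth grow linearly, but the total connection count grows with the square of the room size.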
graph TD
subgraph "P2P Mesh — Max 4-5 people"
P1["Peer 1"] <--> P2["Peer 2"]
P1 <--> P3["Peer 3"]
P2 <--> P3
end
subgraph "SFU — Hundreds of people"
S1["Peer 1"] --> SFU["SFU Server
Forward streams"]
S2["Peer 2"] --> SFU
S3["Peer 3"] --> SFU
S4["Peer N"] --> SFU
SFU --> S1
SFU --> S2
SFU --> S3
SFU --> S4
end
subgraph "MCU — Low bandwidth"
M1["Peer 1"] --> MCU["MCU Server
Mix + Re-encode"]
M2["Peer 2"] --> MCU
M3["Peer 3"] --> MCU
MCU --> M1
MCU --> M2
MCU --> M3
end
style SFU fill:#e94560,stroke:#fff,color:#fff
style MCU fill:#2c3e50,stroke:#fff,color:#fff
Three WebRTC architectures: Mesh (P2P), SFU (Selective Forwarding) and MCU (Mixing)
5.1 SFU — The #1 Choice for Production in 2026
SFU (Selective Forwarding Unit) is the dominant architecture for WebRTC production. Each participant sends 1 stream to the SFU, which forwards it to all other participants — no decoding, no re-encoding. Advantages:
- Low server CPU — only forwards packets, no media processing
- Low latency — no intermediate decode/encode step
- Scales well — each SFU node handles 500-1000 concurrent streams
- Simulcast compatible — SFU selects appropriate layer for each receiver
5.2 MCU — When Client Bandwidth Is the Issue
MCU (Multipoint Control Unit) decodes all incoming streams, mixes them into a single layout, re-encodes and sends the result to each participant. Clients receive only one stream — saving downstream bandwidth. But an MCU consumes massive server CPU and adds 200-500ms of latency from the decode/encode cycle. MCU fits low-powered IoT devices, 3G mobile connections, and recording/broadcasting pipelines.
5.3 Detailed SFU vs MCU Comparison
| Criteria | SFU | MCU |
|---|---|---|
| Server CPU | Low (forward only) | Very high (decode + mix + encode) |
| Added Latency | ~10-50ms | ~200-500ms |
| Client Bandwidth (downstream) | High (receives N-1 streams) | Low (receives 1 stream) |
| Scale | Good — 500-1000 streams/node | Limited — 50-100 participants/node |
| Video Quality | Original (no re-encoding) | Reduced (through re-encoding) |
| Simulcast | Native support | Not needed (already mixed) |
| Best Use Case | Video conferencing, live streaming | IoT, legacy devices, recording |
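The downstream-bandwidth row of the table can be quantified with the article's ~1.5 Mbps figure for a 720p stream (an illustrative average, not a fixed constant):

```javascript
// Downstream bandwidth per participant under each architecture,
// assuming ~1.5 Mbps per 720p stream.
function downstreamMbps(architecture, participants, mbpsPerStream = 1.5) {
  switch (architecture) {
    case 'mesh':
    case 'sfu': return (participants - 1) * mbpsPerStream; // one stream per remote peer
    case 'mcu': return mbpsPerStream;                      // a single pre-mixed stream
    default: throw new Error(`unknown architecture: ${architecture}`);
  }
}

console.log(downstreamMbps('sfu', 10)); // 13.5
console.log(downstreamMbps('mcu', 10)); // 1.5
```

A 10-person SFU room costs each client ~13.5 Mbps down without simulcast — which is exactly the pressure that simulcast relieves by letting the SFU downgrade some tiles to a lower layer.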
6. Open-Source SFUs — LiveKit, mediasoup and Janus
The three most popular open-source SFUs, each suited for different contexts:
6.1 LiveKit — Modern SFU Written in Go
LiveKit has emerged as the top choice for teams wanting to ship fast. It is written in Go, using goroutines to handle large numbers of concurrent connections. It ships with SDKs for JavaScript, React, Swift, Kotlin, Flutter, Unity and server-side SDKs for Node.js, Python, Go, .NET. LiveKit includes signaling, room management and recording out of the box.
// LiveKit JavaScript Client — connect to room
import { Room, RoomEvent } from 'livekit-client';
const room = new Room();
await room.connect('wss://your-livekit-server.com', token);
room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
const element = track.attach();
document.getElementById('remote-video').appendChild(element);
});
// Publish local camera
const localTracks = await room.localParticipant.enableCameraAndMicrophone();
6.2 mediasoup — High-Performance SFU with C++ Core
mediasoup has its core written in C++ for optimal media processing performance, with a Node.js signaling layer. Worker-based architecture: each CPU core runs a Worker process, handling media routing for multiple rooms. mediasoup provides fine-grained control over every transport, producer and consumer — ideal for teams wanting deep customization.
6.3 Janus Gateway — Versatile Plugin Architecture
Janus is written in C, released in 2014, making it the oldest and most versatile SFU. Its plugin architecture enables extension: VideoRoom (SFU), AudioBridge (audio mixing), Streaming (one-to-many), SIP Gateway, Record/Play. Janus fits when you need integration with legacy VoIP/SIP systems.
| Criteria | LiveKit | mediasoup | Janus |
|---|---|---|---|
| Language | Go | C++ (core) + Node.js | C |
| Setup | Fast — all-in-one SDK | Medium — build signaling yourself | Medium — choose plugins |
| Customization | Medium | Very high | High (plugin system) |
| Scalability | Built-in multi-node | Self-managed | Self-managed |
| .NET SDK | Yes (server-side) | No official support | No |
| Recording | Built-in (Egress) | Self-implement | Record/Play plugin |
| Best For | Startups, ship fast | Custom platforms, large scale | SIP/VoIP, legacy integration |
7. Encoded Transform — True End-to-End Encryption
By default, WebRTC encrypts hop-by-hop with DTLS-SRTP — meaning SFU servers can see media in plaintext when forwarding. For sensitive applications (healthcare, finance), this isn't sufficient.
The WebRTC Encoded Transform API (W3C Working Draft, updated 02/2026) allows inserting a processing step into the pipeline between the encoder and packetizer. Developers can encrypt the payload with a private key before sending to the SFU — the SFU can only forward encrypted payload, unable to read content. This is true E2EE (End-to-End Encryption).
// Encoded Transform — encrypt frames before sending.
// Chrome's legacy API requires the RTCPeerConnection to be created with
// { encodedInsertableStreams: true }; newer browsers expose the spec's
// RTCRtpScriptTransform instead (feature-detect both).
const sender = peerConnection.getSenders()[0];
const { readable, writable } = sender.createEncodedStreams();
const transformStream = new TransformStream({
  transform(encodedFrame, controller) {
    // Encrypt payload with AES-GCM (encryptFrame is app-defined)
    encodedFrame.data = encryptFrame(encodedFrame.data, sharedKey);
    controller.enqueue(encodedFrame);
  }
});
readable.pipeThrough(transformStream).pipeTo(writable);
Encoded Transform Browser Support (04/2026)
Chrome/Edge: the legacy createEncodedStreams() API has shipped since Chrome 86. Safari (15.4+) and Firefox (117+) implement the spec's RTCRtpScriptTransform interface instead. Feature-detect both code paths in production, and fall back to standard hop-by-hop DTLS-SRTP where neither is available.
8. Production Deployment — Reference Architecture
Below is a production architecture for a video conferencing system supporting 10,000+ concurrent users:
graph TD
Client["Client App
Vue.js + LiveKit SDK"] -->|WebSocket| LB["Load Balancer
Geographic DNS"]
LB --> SIG["Signaling Cluster
3+ nodes"]
SIG -->|Redis Pub/Sub| SIG
SIG --> SFU1["SFU Node 1
Region: Asia"]
SIG --> SFU2["SFU Node 2
Region: EU"]
SIG --> SFU3["SFU Node 3
Region: US"]
SFU1 --> TURN1["TURN Server
coturn — Asia"]
SFU2 --> TURN2["TURN Server
coturn — EU"]
SFU3 --> TURN3["TURN Server
coturn — US"]
SFU1 --> REC["Recording Service
Egress to S3/R2"]
SFU1 --> MON["Monitoring
Prometheus + Grafana"]
style Client fill:#e94560,stroke:#fff,color:#fff
style SIG fill:#2c3e50,stroke:#fff,color:#fff
style SFU1 fill:#16213e,stroke:#fff,color:#fff
style SFU2 fill:#16213e,stroke:#fff,color:#fff
style SFU3 fill:#16213e,stroke:#fff,color:#fff
style MON fill:#4CAF50,stroke:#fff,color:#fff
Multi-region production architecture for WebRTC — SFU cluster with geographic routing
8.1 Deployment Checklist
- TURN server — Deploy coturn in each region, configure TLS (port 443) to bypass corporate firewalls
- Bandwidth planning — Each 720p video participant: ~1.5 Mbps up + 1.5×(N-1) Mbps down (SFU mode). With simulcast: downstream drops 40-60%
- Monitoring — Track metrics: ICE connection time, packet loss rate, bitrate adaptation, TURN usage percentage
- Fallback strategy — When an SFU node is overloaded, redirect participants to another node. LiveKit has built-in load balancing
- Recording — Use composite recording (MCU-style) for archives, or individual track recording for post-processing
8.2 Code Sample — Signaling Server with ASP.NET Core
// SignalingHub.cs — WebRTC signaling via SignalR
public class SignalingHub : Hub
{
public async Task JoinRoom(string roomId)
{
await Groups.AddToGroupAsync(Context.ConnectionId, roomId);
await Clients.OthersInGroup(roomId).SendAsync("user-joined", Context.ConnectionId);
}
public async Task SendOffer(string targetId, string sdp)
{
await Clients.Client(targetId).SendAsync("offer", Context.ConnectionId, sdp);
}
public async Task SendAnswer(string targetId, string sdp)
{
await Clients.Client(targetId).SendAsync("answer", Context.ConnectionId, sdp);
}
public async Task SendIceCandidate(string targetId, string candidate)
{
await Clients.Client(targetId).SendAsync("ice-candidate", Context.ConnectionId, candidate);
}
}
// Client — Vue.js + WebRTC
const pc = new RTCPeerConnection({
iceServers: [
{ urls: 'stun:stun.l.google.com:19302' },
{ urls: 'turn:turn.example.com:443', username: 'user', credential: 'pass' }
]
});
const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
stream.getTracks().forEach(track => pc.addTrack(track, stream));
pc.onicecandidate = ({ candidate }) => {
if (candidate) signalR.invoke('SendIceCandidate', targetId, JSON.stringify(candidate));
};
// Create offer and send via signaling
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
signalR.invoke('SendOffer', targetId, JSON.stringify(offer));
9. WebRTC Performance Optimization
9.1 Simulcast & SVC
Send multiple quality layers simultaneously, letting the SFU select the right layer for each receiver. Configuring simulcast in WebRTC:
// Simulcast layers must be declared when the transceiver is created —
// setParameters() cannot add or remove encodings afterwards
const transceiver = pc.addTransceiver(videoTrack, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low', maxBitrate: 200000, scaleResolutionDownBy: 4 },
    { rid: 'mid', maxBitrate: 700000, scaleResolutionDownBy: 2 },
    { rid: 'high', maxBitrate: 2500000 }
  ]
});
// Later adjustments to existing layers (e.g. toggling one off) go through
// transceiver.sender.setParameters()
9.2 Bandwidth Estimation
Use RTCPeerConnection.getStats() for real-time monitoring:
setInterval(async () => {
const stats = await pc.getStats();
stats.forEach(report => {
if (report.type === 'outbound-rtp' && report.kind === 'video') {
console.log(`Bytes sent (cumulative): ${report.bytesSent}, Frames: ${report.framesEncoded}`);
console.log(`Quality limit: ${report.qualityLimitationReason}`);
}
});
}, 2000);
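Note that `bytesSent` in outbound-rtp stats is cumulative, so the actual bitrate has to be derived from the delta between two samples. A small helper (sample objects here are stand-ins for two successive getStats() reports):

```javascript
// Derive bitrate in kbps from two successive outbound-rtp samples.
// WebRTC stats timestamps are in milliseconds.
function bitrateKbps(prevSample, currSample) {
  const deltaBytes = currSample.bytesSent - prevSample.bytesSent;
  const deltaSec = (currSample.timestamp - prevSample.timestamp) / 1000;
  return Math.round((deltaBytes * 8) / deltaSec / 1000);
}

// 375 kB sent over 2 seconds -> 1500 kbps
const prev = { bytesSent: 1_000_000, timestamp: 10_000 };
const curr = { bytesSent: 1_375_000, timestamp: 12_000 };
console.log(bitrateKbps(prev, curr)); // 1500
```

Feed this per-track bitrate, together with `qualityLimitationReason`, into your dashboards to see whether GCC is throttling for bandwidth or the client is CPU-bound.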
9.3 Network Quality Indicator
Measure packet loss and round-trip time to display a quality indicator for users:
- Good: packet loss < 1%, RTT < 150ms
- Fair: packet loss 1-5%, RTT 150-300ms
- Poor: packet loss > 5%, RTT > 300ms → automatically reduce resolution
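The thresholds above translate directly into a classifier. Combining the two dimensions with "worst one wins" is a design choice here, not a standard:

```javascript
// Map measured packet loss and round-trip time to the quality buckets above.
function networkQuality({ packetLossPct, rttMs }) {
  if (packetLossPct > 5 || rttMs > 300) return 'poor';   // trigger resolution downgrade
  if (packetLossPct >= 1 || rttMs >= 150) return 'fair';
  return 'good';
}

console.log(networkQuality({ packetLossPct: 0.5, rttMs: 80 }));  // 'good'
console.log(networkQuality({ packetLossPct: 3,   rttMs: 200 })); // 'fair'
console.log(networkQuality({ packetLossPct: 8,   rttMs: 400 })); // 'poor'
```

In practice, smooth the inputs over a few seconds before classifying, so a single lost burst doesn't flash the "poor connection" badge at users.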
10. Real-World Use Cases Beyond Video Calls
WebRTC isn't just for video conferencing:
- Screen Sharing — The getDisplayMedia() API captures the screen, a window or a specific tab
- P2P File Transfer — RTCDataChannel enables direct file transfer between browsers, bypassing servers. Speed can reach 100+ Mbps on LAN
- Cloud Gaming — Stream gameplay from server, receive input from client via DataChannel. Google Stadia (now shut down) and Xbox Cloud Gaming both used WebRTC
- IoT & Robotics — Control robots/drones via DataChannel, receive video feed through media streams
- Live Streaming — WHIP (WebRTC HTTP Ingestion Protocol) enables publishing live streams to CDN via WebRTC, replacing traditional RTMP
WHIP & WHEP — New Standards for Live Streaming
WHIP (WebRTC HTTP Ingestion Protocol) standardizes how to publish streams to servers. WHEP (WebRTC HTTP Egress Protocol) standardizes how viewers subscribe to streams. Both are already supported by Cloudflare Stream, AWS IVS and many CDNs. This is the future replacement for RTMP in live streaming with sub-second latency.
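What makes WHIP attractive is its simplicity: publishing is a single HTTP POST of the local SDP offer with Content-Type application/sdp; the response body carries the answer, and the Location header identifies the session (DELETE it to stop publishing). A sketch that builds such a request — the endpoint and token are hypothetical placeholders:

```javascript
// Build the single HTTP request that WHIP publishing consists of.
// The caller passes this to fetch(); the 201 response body is the SDP answer.
function buildWhipRequest(endpoint, sdpOffer, bearerToken) {
  return {
    url: endpoint,
    method: 'POST',
    headers: {
      'Content-Type': 'application/sdp',
      ...(bearerToken ? { Authorization: `Bearer ${bearerToken}` } : {}),
    },
    body: sdpOffer,
  };
}

const req = buildWhipRequest('https://cdn.example.com/whip/stream-1', 'v=0...', 'secret');
console.log(req.method, req.headers['Content-Type']); // POST application/sdp
```

Compare this with RTMP, which needs a stateful TCP handshake and a custom protocol stack — WHIP rides on plain HTTPS, so it works through proxies and existing CDN ingest infrastructure.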
Conclusion
WebRTC has matured from a Google experiment into the standard platform for all real-time communication applications on the web. With SFU architecture, you can build video conferencing systems supporting thousands of concurrent users. The Encoded Transform API enables true end-to-end encryption. And with WHIP/WHEP, WebRTC is expanding into live streaming territory.
Key takeaways for deployment: start with LiveKit if you need to ship fast, mediasoup if you need deep customization, and Janus if you need legacy VoIP integration. Always deploy TURN servers in each region and monitor ICE connection metrics to ensure the best user experience.
References:
WebRTC Official — webrtc.org
MDN Web Docs — WebRTC API
W3C WebRTC Specification
W3C WebRTC Encoded Transform
LiveKit Documentation
mediasoup Documentation
Janus Gateway Documentation
BlogGeek.me — WebRTC Open Source Media Servers
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.