CRDT and Real-time Collaboration 2026 — Multi-User Sync Architecture à la Figma/Notion with Yjs, Automerge, WebSocket, and Presence/Awareness

Posted on: 4/17/2026 7:11:25 AM

Table of contents

  1. Why real-time collaboration became default UX in 2026
     - Real-time isn't just chat
  2. The journey from Google Wave to mature CRDTs
  3. OT vs CRDT — An in-depth comparison for technology choosers
     - Quick decision rule
  4. CRDT theory — state-based vs op-based, and why YATA won
     - 4.1. State-based CRDTs (CvRDT — Convergent)
     - 4.2. Operation-based CRDTs (CmRDT — Commutative)
     - 4.3. List/Text CRDTs — YATA (Yjs) and RGA (Automerge)
  5. Yjs — internal architecture, shared types, and update format
     - 5.1. Shared types and composability
     - 5.2. Binary update format and sync protocol
       - Tombstones never truly disappear
     - 5.3. The awareness protocol — presence and cursors
  6. Automerge 3 — JSON-first, columnar storage, and sync protocol
  7. Production architecture — the four most common patterns in 2026
     - 7.1. Pattern A — Monolithic WebSocket node keeping state in RAM
     - 7.2. Pattern B — Stateless WebSocket nodes + Redis pub/sub
     - 7.3. Pattern C — Actor model (Orleans / Erlang / Cloudflare Durable Objects)
     - 7.4. Pattern D — Managed service (Liveblocks / PartyKit / Pusher)
       - Pattern selection rule
  8. Persistence — snapshots, log compaction, and versioning
     - 8.1. Versioning and time travel
  9. Scaling — room-based sharding, tombstone GC, and backpressure
     - 9.1. Room-based sharding
     - 9.2. Tombstone GC
     - 9.3. Backpressure when users type too fast
  10. Security — auth, room permissions, and end-to-end encryption
     - 10.1. JWT handshake
     - 10.2. Room permissions
     - 10.3. End-to-end encryption with CRDTs
       - E2EE trades off server-side awareness
  11. Six common anti-patterns in CRDT production
  12. A 2026 go-live checklist for real-time collaboration systems
  13. The future — AI agents as the Nth CRDT peer
  14. Conclusion
  15. References

1. Why real-time collaboration became default UX in 2026

Ten years ago, having a "Save" button in a SaaS product was considered normal. In 2026 the opposite is true — a product that still has a Save button feels dated. Users are used to Figma, Notion, Linear, Google Docs, Miro, and FigJam: you type — the other person sees it instantly; you drag a block — the whole meeting watches your cursor move; you go offline for ten minutes, come back, and no "conflict" dialog asks you to pick a version. Behind that experience sits a family of algorithms rooted in the early 2000s but only truly production-ready in the last five years: CRDTs — Conflict-free Replicated Data Types.

This article is an in-depth handbook for engineers building or evaluating a real collaboration system. We'll cover four layers: theory (what a CRDT is and how it differs from Operational Transformation), implementation (Yjs and Automerge 3 — the two libraries dominating the market), backend architecture (WebSocket transport, presence/awareness, persistence, scaling), and finally the anti-patterns plus a go-live checklist for teams choosing technology in 2026.

  • ~85% — share of top-tier B2B SaaS in 2026 with at least one real-time multi-user surface
  • 100 ms — end-to-end latency ceiling for cursors and keystrokes to feel "real-time"
  • 0 — number of "merge conflict" dialogs Yjs ever needs; that's the architectural win
  • ~10x — backend RAM cost of keeping document state per room vs a stateless API

Real-time isn't just chat

Three layers of "real-time" are often conflated: broadcast (chat, notifications — handled well by SignalR/Socket.io), shared state (presence, cursors, "who's looking with me" — Liveblocks/Phoenix Channels), and collaborative documents (text, JSON, drawing — Yjs/Automerge). This article focuses on the third layer, which takes the most work but also creates the clearest product differentiation.

2. The journey from Google Wave to mature CRDTs

To see why CRDTs win many 2026 use cases over OT, you need the arc from 1989 to today. Many design decisions in Yjs and Automerge are direct reactions to failures of earlier systems.

1989 — Operational Transformation (OT) is born
Ellis & Gibbs propose OT in the GROVE editor. The idea: each op sent to the server is "transformed" to compensate for concurrent ops. OT requires a central server to arbitrate ordering.
2009 — Google Wave and OT's lessons
Wave was the most ambitious real-time text editing system at the time. It was shut down after 18 months. One technical reason: the Jupiter OT algorithm was hard to get right and not easily extensible to rich text. Google Docs later used a simpler OT with a central server.
2011 — Shapiro et al. define CRDTs
The "Conflict-free Replicated Data Types" paper by Shapiro, Preguiça, Baquero, and Zawirski laid the mathematical foundation for CRDTs: state-based (CvRDT) and operation-based (CmRDT), with proofs of eventual consistency without a central coordinator.
2006-2013 — WOOT, Logoot, RGA, and the boom of list CRDT algorithms
Over these years a wave of list/text CRDT algorithms arrived: WOOT, Treedoc, Logoot, RGA, LSEQ. Performance was still far from Google Docs, but feasibility was proven: peer-to-peer text editing without a server.
2018 — YATA and Yjs mature
Kevin Jahns publishes YATA (Yet Another Transformation Approach) and the Yjs library. YATA is simpler than RGA, tracking structure with a doubly linked list whose items carry origin/rightOrigin pointers. Yjs becomes the first CRDT library to approach Google Docs on speed and memory.
2020 — Automerge and "local-first software"
The "Local-first software" paper from Ink & Switch along with Automerge resonate strongly. The philosophy: data lives on the client, syncs via CRDT, the server is just a relay. The reverse of traditional SaaS.
2023 — Automerge 2.0 rewritten in Rust
Automerge 2 uses columnar binary format, its core written in Rust, bringing performance on par with Yjs. Browser integration via WebAssembly. Officially production-ready for arbitrary JSON documents.
2024 — Liveblocks, PartyKit, ElectricSQL — the commercial ecosystem
Hosted layers appear: Liveblocks sells "collaboration as a service" on top of Yjs, PartyKit is an edge server for multiplayer, and ElectricSQL pushes CRDTs down into Postgres replication. Real-time starts becoming a commodity.
2025-2026 — AI agents become CRDT peers
ElectricSQL's post on "AI agents as CRDT peers" sketches the next direction: an agent writes into a document alongside the user via the same Yjs mechanism — no race condition, no separate handshake. Real-time collaboration is no longer only human-to-human.

3. OT vs CRDT — An in-depth comparison for technology choosers

This is the first and most important architectural decision. Don't trust the "CRDTs are always better" claim — Google Docs still uses OT, Quip uses OT, Etherpad uses OT. CRDTs win some problems, OT wins others. The table below is an honest comparison based on real production experience.

| Criterion | Operational Transformation (OT) | CRDT (Yjs / Automerge) |
| --- | --- | --- |
| Ordering arbiter | Central server required | None needed (peer-to-peer feasible) |
| Offline editing | Hard — must re-transform on reconnect | Easy — merges naturally on reconnect |
| Document memory | Only the current snapshot | Needs metadata (tombstones, logical timestamps) |
| Algorithmic complexity | High (transform function hard to get right for rich text) | Moderate (op + merge rules well-defined) |
| Rich text formatting | Quill OT, ShareDB OT are mature | Yjs Y.XmlFragment, Automerge Rich Text recently stabilized |
| Per-user undo/redo | Needs complex custom logic | Yjs UndoManager built in |
| Peak throughput | High with a well-tuned server (Google Docs level) | High, but needs tombstone GC |
| Ease of reasoning about correctness | Hard — transform property is tricky to verify | Easier — mathematical proofs of convergence exist |
| Strongest use case | Server-centric, online-only document (Google Docs) | Local-first, offline-capable, peer-to-peer (Linear, Figma) |
| Weakest use case | Mobile offline, peer-to-peer | Very large documents (>100 MB) — tombstones balloon |

Quick decision rule

If your product needs (1) offline-first support, (2) mobile, (3) reuse of open-source editors (Tiptap, Slate, Lexical, ProseMirror), or (4) future peer-to-peer operation — pick CRDT. If you are (1) online-only, (2) have heavy engineering resources, (3) already employ a long-tenured OT team, or (4) serve very large documents with few concurrent ops — OT is still a safe bet. In 2026, the default for new teams is CRDT.

4. CRDT theory — state-based vs op-based, and why YATA won

CRDTs come in two main families. Understanding the difference helps you read Yjs or Automerge source without getting lost.

4.1. State-based CRDTs (CvRDT — Convergent)

Each replica holds full state and defines a merge function that must have three properties: commutative (a+b = b+a), associative ((a+b)+c = a+(b+c)), and idempotent (a+a = a). If all three hold, every replica merging states in any order reaches the same result — that's eventual consistency.

Classic example: the G-Counter (grow-only counter). Each replica keeps a map {nodeId: localCount}. The counter value is the sum of all localCount values. Merge is element-wise max. Property: if two replicas increment simultaneously then sync, the result is always the correct total.

Upside of state-based: simple, no causal ordering needed. Downside: you must send the whole state each sync — impractical for large documents. That's why production rarely uses pure state-based CRDTs for text/JSON documents.
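
The G-Counter above fits in a few lines. The sketch below is a minimal plain-JavaScript illustration (class and method names are ours, not from any library); the test of the three merge properties is what makes it a CvRDT:

```javascript
// Minimal G-Counter sketch: each replica tracks per-node counts,
// merge is element-wise max — commutative, associative, idempotent.
class GCounter {
  constructor(nodeId) {
    this.nodeId = nodeId
    this.counts = {} // nodeId -> localCount
  }
  increment(by = 1) {
    this.counts[this.nodeId] = (this.counts[this.nodeId] ?? 0) + by
  }
  value() {
    // counter value = sum of every replica's local count
    return Object.values(this.counts).reduce((a, b) => a + b, 0)
  }
  merge(other) {
    for (const [id, n] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] ?? 0, n)
    }
  }
}
```

Because merge is a max, applying the same state twice changes nothing, which is exactly why retransmitting a state during sync is always safe.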

4.2. Operation-based CRDTs (CmRDT — Commutative)

Replicas send operations instead of state. Requirements: ops must commute (applying them in different orders gives the same result), and the transport must be reliable + at-most-once + causal-ordered (parent ops arrive before child ops).

Example: the OR-Set (observed-remove set): when adding an element, tag it with a unique id; when removing, record which ids have been removed. Concurrent add and remove of the same element resolves to add-wins (remove only clears ids it has observed).
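
A toy OR-Set along those lines (all names are ours; the shared tag counter is a simplification — a real implementation tags with (nodeId, localCounter) so uniqueness holds without coordination):

```javascript
// Toy OR-Set: adds tag the element with a unique id; remove only clears
// tags it has observed, so a concurrent add wins over the remove.
let nextTag = 0
class ORSet {
  constructor() {
    this.adds = new Map()    // element -> Set of add-tags
    this.removed = new Set() // observed-removed tags
  }
  add(el) {
    const tag = `t${nextTag++}`
    if (!this.adds.has(el)) this.adds.set(el, new Set())
    this.adds.get(el).add(tag)
  }
  remove(el) {
    // only tags this replica has *observed* are removed
    for (const tag of this.adds.get(el) ?? []) this.removed.add(tag)
  }
  has(el) {
    for (const tag of this.adds.get(el) ?? []) {
      if (!this.removed.has(tag)) return true
    }
    return false
  }
  merge(other) {
    for (const [el, tags] of other.adds) {
      if (!this.adds.has(el)) this.adds.set(el, new Set())
      for (const t of tags) this.adds.get(el).add(t)
    }
    for (const t of other.removed) this.removed.add(t)
  }
}
```

The add-wins behaviour falls out naturally: a concurrent re-add carries a fresh tag the remover never observed, so the element survives the merge.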

Op-based is more bandwidth-efficient but demands a stronger transport layer. Yjs and Automerge are both op-based with an optimization: the op log is compressed into binary updates that can be repackaged as "snapshots" or "deltas".

4.3. List/Text CRDTs — YATA (Yjs) and RGA (Automerge)

The hardest list CRDT problem: two users both insert a character at position 5 — who wins? Indexes (numeric) don't work (they shift after inserts). The solution: assign each character a stable identifier (ID = nodeId + clock), describe an insert as "insert X to the right of Y", then use a tie-breaking rule when both end up at the same position.

graph LR
    subgraph U1["User A types 'X' after 'He'"]
        A1["H"] --> A2["e"] --> A3["X"]
    end
    subgraph U2["User B types 'Y' after 'He'"]
        B1["H"] --> B2["e"] --> B3["Y"]
    end
    subgraph MERGE["After merge — YATA tie-breaks by clientID"]
        M1["H"] --> M2["e"] --> M3["X (A.5)"] --> M4["Y (B.7)"]
    end
    style A3 fill:#e94560,color:#fff
    style B3 fill:#4CAF50,color:#fff
    style M3 fill:#e94560,color:#fff
    style M4 fill:#4CAF50,color:#fff
Two concurrent inserts are ordered deterministically by (origin, clientID, clock)

Yjs's YATA is simpler than RGA: each item has origin (the ID of the character to the left at creation), rightOrigin (the character to the right at creation), and tie-breaks by (clientID, clock). On merge, the new item is "nested" between origin and rightOrigin by a deterministic rule. Efficiency: O(N) for normal inserts, with index hashing optimizable toward O(1).
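
The tie-break can be shown in isolation. The sketch below is a deliberately reduced toy (function names are ours): it only handles concurrent inserts that share the same origin, ordering them by ascending clientID, and omits rightOrigin and the subtler cases real YATA handles. It is enough to demonstrate that two replicas applying the same ops in different orders converge:

```javascript
// Toy tie-break only: concurrent inserts anchored at the same origin are
// ordered by ascending clientID, so every replica converges.
// Real YATA also consults rightOrigin; this sketch deliberately omits it.
function sameId(a, b) {
  if (a === null || b === null) return a === b
  return a.client === b.client && a.clock === b.clock
}

function integrate(items, op) {
  // op = { id: {client, clock}, originId: id-or-null, char }
  if (items.some(i => i.id.client === op.id.client && i.id.clock === op.id.clock)) return
  let pos = op.originId === null ? 0
    : items.findIndex(i => sameId(i.id, op.originId)) + 1
  // skip concurrent siblings with the same origin and a smaller clientID
  while (pos < items.length &&
         sameId(items[pos].originId, op.originId) &&
         items[pos].id.client < op.id.client) pos++
  items.splice(pos, 0, op)
}

function text(items) { return items.map(i => i.char).join('') }
```

Running the diagram's scenario through this function gives "HeXY" on both replicas regardless of delivery order, which is the convergence property the diagram illustrates.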

5. Yjs — internal architecture, shared types, and update format

Yjs is the most popular text CRDT in 2026. It's not an editor and has no UI — it's a shared data model: you structure your data with Y.Map, Y.Array, Y.Text, and Y.XmlFragment, and every change automatically syncs with every other peer.

graph TB
    subgraph CLIENT["Yjs Client (Browser/Node)"]
        DOC["Y.Doc<br/>(root container)"]
        TYPES["Shared Types<br/>Y.Text / Y.Array / Y.Map / Y.XmlFragment"]
        STORE["DocStore<br/>(Item list, indexed by clientID)"]
        ENCODER["Update Encoder<br/>(binary, lib0)"]
        AWARE["Awareness Protocol<br/>(presence, cursor, user)"]
    end
    subgraph TRANSPORT["Provider (transport agnostic)"]
        WS["y-websocket"]
        WEBRTC["y-webrtc"]
        REDIS["y-redis"]
        IDB["y-indexeddb (persistence)"]
    end
    subgraph BACKEND["Backend"]
        SYNCSERVER["Sync Server<br/>(broadcasts updates)"]
        DB[("Persistence<br/>Postgres / S3 / LevelDB")]
        PUBSUB["Redis Pub/Sub<br/>(cross-node)"]
    end
    DOC --> TYPES --> STORE --> ENCODER
    DOC --> AWARE
    ENCODER --> WS
    ENCODER --> WEBRTC
    ENCODER --> REDIS
    ENCODER --> IDB
    AWARE --> WS
    WS --> SYNCSERVER
    SYNCSERVER --> DB
    SYNCSERVER --> PUBSUB
    PUBSUB --> SYNCSERVER
    style DOC fill:#e94560,color:#fff
    style ENCODER fill:#e94560,color:#fff
    style SYNCSERVER fill:#2c3e50,color:#fff
Yjs cleanly separates data model, encoder, transport, and persistence — every layer is swappable

5.1. Shared types and composability

You can nest shared types inside each other: Y.Map<string, Y.Array<Y.Map>> describes a complete Trello board — a map of columns → array of cards → map of card fields. Each sub-tree change is encoded as a minimal update, no need to rebroadcast the whole board.

// Structure of a Notion-like document
import * as Y from 'yjs'

const doc = new Y.Doc()
const blocks = doc.getArray('blocks')

const heading = new Y.Map()
heading.set('type', 'heading')
heading.set('text', new Y.Text('CRDT 2026'))
blocks.push([heading])

const paragraph = new Y.Map()
paragraph.set('type', 'paragraph')
paragraph.set('text', new Y.Text('Hello collaborative world'))
blocks.push([paragraph])

// Every other user will automatically see these 2 blocks after sync

5.2. Binary update format and sync protocol

A Yjs update is tightly optimized binary: VarInt for numbers, dictionary encoding for repeated characters, run-length encoding for consecutive ids. A 1,000-character paragraph typed sequentially compresses to ~150 bytes of update because consecutive IDs get run-length-encoded into a single range.

The sync protocol has two steps (sync step 1 and step 2): the client sends a state vector (a map clientID → max clock seen), and the server returns a diff update (only the ops the client doesn't have). This is why Yjs syncs fast even with large documents: client A has 1 MB of state, reopens after 5 minutes offline, and sync costs only a few KB if there weren't many changes.
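
The two sync steps can be modeled in a few lines of plain JavaScript. This is a toy model of the idea only (Yjs does the same thing over its compact binary encoding, not over JS objects; all names here are ours): per-client op logs, a state vector of "max clock seen per client", and a diff of everything above that vector:

```javascript
// Toy model of Yjs sync step 1 (state vector) and step 2 (diff update).
class OpStore {
  constructor() { this.ops = new Map() } // clientId -> [{clock, payload}] ascending
  add(clientId, clock, payload) {
    if (!this.ops.has(clientId)) this.ops.set(clientId, [])
    this.ops.get(clientId).push({ clock, payload })
  }
  stateVector() { // sync step 1: "here is the max clock I have seen per client"
    const sv = {}
    for (const [c, list] of this.ops) sv[c] = list[list.length - 1].clock
    return sv
  }
  diff(remoteSv) { // sync step 2: only the ops the remote has not seen
    const out = []
    for (const [c, list] of this.ops) {
      const seen = remoteSv[c] ?? -1
      for (const op of list) if (op.clock > seen) out.push({ clientId: c, ...op })
    }
    return out
  }
  applyDiff(ops) {
    for (const { clientId, clock, payload } of ops) {
      const list = this.ops.get(clientId) ?? []
      const seen = list.length ? list[list.length - 1].clock : -1
      if (clock > seen) this.add(clientId, clock, payload)
    }
  }
}
```

This is also why reconnect is cheap: the state vector is tiny regardless of document size, and the diff is proportional to what changed while you were away.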

Tombstones never truly disappear

When you delete a character, Yjs doesn't really delete it — it marks it deleted. The tombstone keeps the ID so late-arriving ops can still anchor correctly. A heavily edited document can balloon over time. Production strategy: periodically snapshot with Y.encodeStateAsUpdate(doc) to produce a new update that only contains the current state; old unneeded tombstones get compressed.

5.3. The awareness protocol — presence and cursors

Awareness is a concept separate from the document: it's ephemeral state (cursor position, selection range, "user X is viewing"). Not persisted, no tombstones, expires after ~30 seconds without a heartbeat.

// Presence on the client
import { Awareness } from 'y-protocols/awareness'
const awareness = new Awareness(doc)
awareness.setLocalStateField('user', { name: 'Anh Tu', color: '#e94560' })
awareness.setLocalStateField('cursor', { anchor: 120, head: 145 })

awareness.on('change', () => {
  for (const [clientId, state] of awareness.getStates()) {
    if (clientId === doc.clientID) continue
    renderRemoteCursor(clientId, state.user, state.cursor)
  }
})

6. Automerge 3 — JSON-first, columnar storage, and sync protocol

Automerge 3 is Yjs's main rival. Different philosophy: Yjs prioritizes text editors, Automerge prioritizes arbitrary JSON documents. If your app isn't an editor but structured data (kanban board, todo list, config sync), Automerge feels more like "just a JSON object".

| Criterion | Yjs | Automerge 3 |
| --- | --- | --- |
| Core language | JavaScript (with C++/Rust ports) | Rust (browser via WASM) |
| API style | Shared types (Y.Map, Y.Text, ...) | JSON proxy + change function |
| Text performance | Best in benchmarks | On par since v3; still slightly slower |
| Arbitrary JSON nesting | Possible but requires declaration | Natural like a regular object |
| Storage format | Binary update list | Columnar binary (better compression) |
| Sync protocol | State vector exchange | Heads-based + bloom filter |
| Multi-language | JavaScript primary, Rust port (yrs) | Rust core, JS/Python/Swift bindings official |
| Editor ecosystem | Tiptap, Slate, ProseMirror, Quill, Lexical, Monaco, CodeMirror | Custom integration needed for most editors |
| When to pick | Rich text editor is the core (Notion, Linear) | Arbitrary JSON documents, native mobile, multi-language stack |
// Automerge 3 — feels like a regular JSON object
import { next as Automerge } from '@automerge/automerge'

let doc = Automerge.from({
  todos: [],
  filter: 'all'
})

doc = Automerge.change(doc, d => {
  d.todos.push({ id: 1, text: 'Learn CRDT', done: false })
  d.todos.push({ id: 2, text: 'Refactor backend', done: false })
})

// Sync with other peers — generateSyncMessage returns [nextSyncState, message]
const sync = Automerge.initSyncState()
const [nextSync, message] = Automerge.generateSyncMessage(doc, sync)
// send message over WebSocket / HTTP / any transport (message is null when in sync)

7. Production architecture — the four most common patterns in 2026

The client code is the easy part. The backend is where 90% of production bugs happen. There are four architectural patterns to choose between, each with clear trade-offs.

7.1. Pattern A — Monolithic WebSocket node keeping state in RAM

Each document is "pinned" to a single node. Clients connect to that node via WebSocket. The node keeps the Y.Doc in memory and broadcasts updates between clients on the same node. Periodic snapshotting to disk (every 30 s).

graph LR
    C1["Client 1"] --> WS1["WS Node A<br/>(Y.Doc room1)"]
    C2["Client 2"] --> WS1
    C3["Client 3"] --> WS2["WS Node B<br/>(Y.Doc room2)"]
    LB["Load Balancer<br/>(sticky by roomId)"] --> WS1
    LB --> WS2
    WS1 --> DB[("Snapshot Store<br/>S3 / Postgres")]
    WS2 --> DB
    style WS1 fill:#e94560,color:#fff
    style WS2 fill:#e94560,color:#fff
Each room "pinned" to a node — simple and effective for a startup

Fits: under 100k concurrent users, under 10k concurrent rooms, moderate document size. Problems: a node restart loses presence, horizontal scaling needs sticky sessions, cold start is slow when loading documents from disk.

7.2. Pattern B — Stateless WebSocket nodes + Redis pub/sub

No WebSocket node "owns" a fixed room. Updates arriving at a node are decoded → pushed through a Redis pub/sub channel doc:{roomId} → every node subscribed to the channel receives it and broadcasts to its own clients. Document state lives in Redis (or a leader node via Raft).

graph TB
    subgraph CLIENTS["Clients"]
        C1["Client 1"]
        C2["Client 2"]
        C3["Client 3"]
        C4["Client 4"]
    end
    subgraph NODES["WebSocket Nodes (stateless)"]
        N1["Node A"]
        N2["Node B"]
        N3["Node C"]
    end
    subgraph SHARED["Shared State"]
        REDIS[("Redis<br/>Pub/Sub + Stream<br/>doc:{roomId}")]
        BLOB[("Persistence<br/>Postgres / S3<br/>snapshot + log")]
    end
    C1 --> N1
    C2 --> N2
    C3 --> N2
    C4 --> N3
    N1 <--> REDIS
    N2 <--> REDIS
    N3 <--> REDIS
    REDIS --> BLOB
    style REDIS fill:#e94560,color:#fff
    style BLOB fill:#2c3e50,color:#fff
The go-to pattern for production scale — all nodes are equal, horizontal scaling is easy

Fits: 100k+ concurrent users, Kubernetes with many nodes, need to restart nodes without breaking clients. Problems: Redis becomes a SPOF (need Cluster/Sentinel), high Redis bandwidth cost without filtering, document state needs a leader-based mechanism to avoid write conflicts.
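
The fan-out logic of Pattern B fits in a small sketch. The classes below are an in-process stand-in for Redis pub/sub (all names are ours, no Redis client involved): each node subscribes to doc:{roomId}, an update entering any node is published once, and every subscribed node re-broadcasts to its own local clients:

```javascript
// In-process stand-in for Redis pub/sub fan-out between stateless WS nodes.
class PubSubHub {
  constructor() { this.subs = new Map() } // channel -> Set<handler>
  subscribe(channel, handler) {
    if (!this.subs.has(channel)) this.subs.set(channel, new Set())
    this.subs.get(channel).add(handler)
  }
  publish(channel, message) {
    for (const h of this.subs.get(channel) ?? []) h(message)
  }
}

class WsNode {
  constructor(hub) {
    this.hub = hub
    this.clients = new Map() // roomId -> Set<client>; client = (update) => void
  }
  join(roomId, client) {
    if (!this.clients.has(roomId)) {
      this.clients.set(roomId, new Set())
      // subscribe once per room, like SUBSCRIBE doc:{roomId}
      this.hub.subscribe(`doc:${roomId}`, msg => this.broadcast(roomId, msg))
    }
    this.clients.get(roomId).add(client)
  }
  receive(roomId, update, sender) {
    // a local client's update goes through the hub so every node sees it
    this.hub.publish(`doc:${roomId}`, { update, sender })
  }
  broadcast(roomId, { update, sender }) {
    for (const c of this.clients.get(roomId) ?? []) {
      if (c !== sender) c(update)
    }
  }
}
```

Note that the sending client is excluded from its own broadcast, which is the detail most home-grown implementations of anti-pattern #2 get wrong first.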

7.3. Pattern C — Actor model (Orleans / Erlang / Cloudflare Durable Objects)

Each room is an actor (a grain in Orleans, a GenServer in Phoenix, a Durable Object in Cloudflare Workers). The actor system guarantees single-writer per room — no race conditions. Clients are routed to the right actor; the actor holds state in RAM and persists asynchronously.

Cloudflare Durable Objects is the most polished implementation for the web today: each document = one Durable Object, running at the edge near users, persisting to Cloudflare's SSD storage. Liveblocks and PartyKit are built on similar ideas.

Fits: global apps that need low latency, teams fine with platform lock-in. Problems: higher cost than Pattern B, harder to debug without actor-model familiarity.

7.4. Pattern D — Managed service (Liveblocks / PartyKit / Pusher)

You don't build the backend. Liveblocks handles WebSocket transport, persistence, auth, and presence. You pay a monthly fee based on MAU. Clear trade-off: fast launch and low engineering effort, in exchange for vendor lock-in and a bill that scales linearly with usage.

Pattern selection rule

Startup at idea-validation stage → Pattern D (Liveblocks). After Series A, MAU > 100k → migrate to Pattern B (Redis). Strong budget and team → Pattern C (Durable Objects/Orleans). Pattern A should be used only for prototypes or internal tools under 1,000 users.

8. Persistence — snapshots, log compaction, and versioning

A common mistake: persist every Yjs update straight into Postgres. After a month, the doc_updates table has 50 million rows and loading a document takes 10 seconds. The right approach combines an append-only log with periodic snapshots.

graph LR
    UPD["Update arrives<br/>(binary, ~100B)"] --> APPEND["Append to<br/>log table"]
    APPEND --> CHECK{"Log size<br/>>= threshold?"}
    CHECK -->|No| END1[Done]
    CHECK -->|Yes| MERGE["Apply all updates<br/>into an in-memory Y.Doc"]
    MERGE --> SNAP["Y.encodeStateAsUpdate<br/>=> binary snapshot"]
    SNAP --> WRITE["Write snapshot to<br/>doc_snapshot table/S3"]
    WRITE --> DELETE["Delete old log entries<br/>(before the snapshot)"]
    DELETE --> END2[Done]
    style MERGE fill:#e94560,color:#fff
    style SNAP fill:#4CAF50,color:#fff
Log + snapshot — balance cheap writes with fast reads

Suggested Postgres schema:

-- Append-only log, very fast to write
CREATE TABLE doc_update (
    id BIGSERIAL PRIMARY KEY,
    doc_id UUID NOT NULL,
    update_data BYTEA NOT NULL,        -- Yjs binary update
    client_id BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_update_doc ON doc_update(doc_id, id);

-- Periodic snapshot — fast to load
CREATE TABLE doc_snapshot (
    doc_id UUID PRIMARY KEY,
    snapshot BYTEA NOT NULL,           -- encodeStateAsUpdate
    last_update_id BIGINT NOT NULL,    -- final log id included in the snapshot
    state_vector BYTEA NOT NULL,       -- for diff sync
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- On load: doc_snapshot.snapshot + doc_update WHERE id > last_update_id

Common thresholds: snapshot every 100 updates or 1 MB of accumulated log size. Snapshots bigger than 5 MB should move to S3 with the URL stored in Postgres.
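
The threshold logic itself is simple enough to sketch over an in-memory stand-in for the log table (all names are ours). The merge function is a placeholder: in production it would be "apply the log to a Y.Doc, then Y.encodeStateAsUpdate(doc)", and the arrays would be the two SQL tables:

```javascript
// Threshold-based log compaction, modeled in memory.
const SNAPSHOT_EVERY = 100 // compact after this many log entries

class DocStore {
  constructor(mergeUpdates) {
    this.merge = mergeUpdates   // placeholder for the Y.Doc replay
    this.snapshot = null        // { data, lastUpdateId }
    this.log = []               // stand-in for the doc_update table
    this.nextId = 1
  }
  appendUpdate(data) {
    this.log.push({ id: this.nextId++, data })
    if (this.log.length >= SNAPSHOT_EVERY) this.compact()
  }
  compact() {
    const base = this.snapshot ? [this.snapshot.data] : []
    const data = this.merge([...base, ...this.log.map(e => e.data)])
    this.snapshot = { data, lastUpdateId: this.log[this.log.length - 1].id }
    this.log = [] // in SQL: DELETE FROM doc_update WHERE id <= last_update_id
  }
  load() { // snapshot + the tail of the log that isn't in it yet
    const base = this.snapshot ? [this.snapshot.data] : []
    return this.merge([...base, ...this.log.map(e => e.data)])
  }
}
```

Writes stay cheap (a single append) while reads stay bounded (one snapshot plus at most SNAPSHOT_EVERY log entries), which is the balance the diagram above describes.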

8.1. Versioning and time travel

Yjs supports snapshots that can be "frozen" into versions and then diffed. Y.snapshot(doc) returns a small object holding just the state vector and the delete set; combined with the update log, the document can be reconstructed at any historical point. This is the mechanism behind Notion's "Page History" and Figma's "Version History".

9. Scaling — room-based sharding, tombstone GC, and backpressure

9.1. Room-based sharding

Real-time collaboration documents don't need global consistency — only per-room consistency. This is a beautiful property for scaling: you fully shard by roomId. Each shard can be a consumer group, a Redis cluster, or a Durable Object.

Suggested approach: place the WebSocket nodes on a consistent-hash ring and route each roomId to the first node clockwise from hash(roomId). A naive hash(roomId) % N remaps almost every room whenever N changes; with consistent hashing, auto-scaling moves only a small fraction of rooms.
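
A minimal consistent-hash ring illustrating that property (FNV-1a as the hash, a few virtual nodes per server; all names are ours):

```javascript
// Minimal consistent-hash ring: lookups walk clockwise to the first
// virtual node, so adding a server remaps only the keys in its new arcs.
function fnv1a(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

class HashRing {
  constructor(vnodes = 32) {
    this.vnodes = vnodes
    this.ring = [] // sorted [hash, nodeName] pairs
  }
  addNode(name) {
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push([fnv1a(`${name}#${v}`), name])
    }
    this.ring.sort((a, b) => a[0] - b[0])
  }
  nodeFor(roomId) {
    const h = fnv1a(roomId)
    for (const [hash, name] of this.ring) if (hash >= h) return name
    return this.ring[0][1] // wrap around the ring
  }
}
```

With four nodes, adding a fifth should move roughly a quarter of the rooms, not all of them; that bounded migration is what makes auto-scaling safe for pinned rooms.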

9.2. Tombstone GC

The longer a document lives, the more tombstones it accumulates. Yjs doesn't auto-GC because late ops still need anchors. The pragmatic approach: periodically create a "compaction snapshot" — not fully deleting tombstones but packing them into a single block. Mature production stacks use Yjs document v2 (in preview in 2026), which supports "permanent delete" after a safe-time threshold (>1 day = no more delayed ops possible).

9.3. Backpressure when users type too fast

An auto-typing keyboard at 100 chars/second generates 100 updates/second. Multiplied by 50 users in the same room, that's 5,000 messages/second to broadcast. Backpressure patterns:

  • Client debounce: a Yjs transaction (doc.transact(() => ...)) batches many changes into a single update.
  • Server batching: wait 50 ms and merge all incoming updates into one broadcast.
  • Drop awareness: cursor updates can be dropped if a client can't keep up — nothing is persisted, no harm done.
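
The server-batching item can be sketched as a small window-based batcher (names are ours; the merge function is a placeholder — with Yjs you would likely pass Y.mergeUpdates):

```javascript
// Server-side batching: updates arriving within a window are merged
// into one broadcast instead of N individual ones.
class UpdateBatcher {
  constructor(flush, { windowMs = 50, merge = u => u } = {}) {
    this.flush = flush        // called once per window with the merged batch
    this.windowMs = windowMs
    this.merge = merge
    this.pending = []
    this.timer = null
  }
  push(update) {
    this.pending.push(update)
    if (this.timer === null) {
      // first update of the window starts the timer; the rest piggyback
      this.timer = setTimeout(() => {
        const batch = this.pending
        this.pending = []
        this.timer = null
        this.flush(this.merge(batch))
      }, this.windowMs)
    }
  }
}
```

At 100 updates/second per typist, a 50 ms window cuts the broadcast count by an order of magnitude while adding at most 50 ms of perceived latency.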

10. Security — auth, room permissions, and end-to-end encryption

Before CRDTs enter the picture, a plain WebSocket server already has two familiar auth problems: who can connect, and once connected, who can join which room. With CRDTs, a third appears: who's allowed to apply which update.

10.1. JWT handshake

The browser WebSocket API can't attach custom headers to the upgrade request. Two common workarounds: send the JWT in the query string on connect (wss://server/yjs?token=xxx), or use a connection cookie. The server verifies the token during the handshake, attaches the userId to the connection state, and every subsequent message is checked against that userId.
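
A sketch of the query-string variant using only Node's built-in crypto: parse ?token= from the upgrade URL and verify an HS256 JWT by hand. The claim names (sub, exp) and function names are illustrative; in production you would likely reach for a maintained JWT library instead:

```javascript
// Hand-rolled HS256 JWT handshake sketch for a WebSocket upgrade.
import crypto from 'node:crypto'

const b64url = s => Buffer.from(s).toString('base64url')

function signToken(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }))
  const body = b64url(JSON.stringify(payload))
  const sig = crypto.createHmac('sha256', secret)
    .update(`${header}.${body}`).digest('base64url')
  return `${header}.${body}.${sig}`
}

function verifyUpgrade(requestUrl, secret) {
  const token = new URL(requestUrl, 'wss://placeholder').searchParams.get('token')
  if (!token) return null
  const [header, body, sig] = token.split('.')
  const expected = crypto.createHmac('sha256', secret)
    .update(`${header}.${body}`).digest('base64url')
  const a = Buffer.from(sig ?? ''), b = Buffer.from(expected)
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) return null
  const claims = JSON.parse(Buffer.from(body, 'base64url').toString())
  if (claims.exp && claims.exp < Date.now() / 1000) return null
  return claims // attach claims.sub as the connection's userId
}
```

The timing-safe comparison matters: a naive string comparison of signatures leaks timing information an attacker can exploit.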

10.2. Room permissions

When a client subscribes to room doc:{roomId}, the server checks whether userId has access. Cache permissions in Redis with a 60 s TTL to avoid hitting the DB on every message. When permissions change (an admin revokes), publish an event permission:revoked:{userId}:{roomId} so every node disconnects the relevant connections.

10.3. End-to-end encryption with CRDTs

This is a big advantage of CRDTs: because merging is deterministic and the server doesn't need to understand content, you can encrypt updates on the client with a key the server doesn't know. The server just relays binary blobs. Common pattern: a room key derived from a shared password, each Yjs update encrypted with AES-GCM before going over the WebSocket.
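
A minimal sketch of that pattern with Node's built-in crypto (function names are ours; deriving the room key from a shared password, e.g. via a KDF, is elided here): each binary update is sealed with AES-256-GCM before it goes over the WebSocket, and the server only ever sees the opaque blob:

```javascript
// E2EE relay sketch: seal each update with AES-256-GCM under a room key
// the server never sees; the server just forwards the blob.
import crypto from 'node:crypto'

function sealUpdate(update, roomKey) { // update: Buffer, roomKey: 32-byte Buffer
  const iv = crypto.randomBytes(12)    // fresh nonce per update
  const cipher = crypto.createCipheriv('aes-256-gcm', roomKey, iv)
  const ciphertext = Buffer.concat([cipher.update(update), cipher.final()])
  // wire format: iv (12) || auth tag (16) || ciphertext
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext])
}

function openUpdate(blob, roomKey) {
  const iv = blob.subarray(0, 12)
  const tag = blob.subarray(12, 28)
  const ciphertext = blob.subarray(28)
  const decipher = crypto.createDecipheriv('aes-256-gcm', roomKey, iv)
  decipher.setAuthTag(tag)
  return Buffer.concat([decipher.update(ciphertext), decipher.final()])
}
```

GCM's auth tag means a tampered blob fails to decrypt rather than corrupting the document, so a malicious relay can drop updates but cannot silently alter them.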

E2EE trades off server-side awareness

When you encrypt updates, the server can't run content-based logic (search, mention notifications, full-text indexing). Every such feature must move to the client or use a delegated relay that can decrypt. Weigh carefully before going E2EE.

11. Six common anti-patterns in CRDT production

  1. Persisting every update straight into Postgres without snapshots. The table balloons and document loading slows. Ship snapshot + log compaction from day one.
  2. Forgetting to broadcast updates over Redis pub/sub when horizontally scaling. User on node A types, user on node B sees nothing. Test with a multi-instance load balancer from the start.
  3. Sticky sessions that last too long. A user disconnects, reconnects to another node, and waits 5 seconds for the document to load from the DB. Pattern B (stateless + Redis) avoids this.
  4. Not debouncing updates. A 50 KB paste produces 50,000 tiny ops instead of one transaction. Always wrap bulk changes in doc.transact(() => ...).
  5. Awareness leak. Disconnect without cleaning up awareness state — users see "ghost" cursors for users who've left. Handle cleanup in the onClose handler.
  6. No per-room quotas. One user sending a 1 MB text over WebSocket can OOM a node. Impose message-size limits (e.g. 256 KB), per-user connection caps (10), and document-size caps (50 MB).

12. A 2026 go-live checklist for real-time collaboration systems

ItemRequirement
Choose the CRDTBenchmarked Yjs vs Automerge on a real sample document (10 MB, 50 users, 1,000 ops/s)
Editor integrationTiptap/ProseMirror/Slate/Lexical selected, Yjs plugins verified for every needed rich-text feature
TransportWebSocket with 30 s ping/pong heartbeats, exponential-backoff reconnect, long-polling fallback when proxies block WS
PersistenceAppend log + snapshot every 100 updates, large snapshots in S3, an offline log-compaction script
ScalingStateless WS nodes + Redis pub/sub, room-based sharding, auto-scale on CPU + connection count
AuthJWT in query string, refresh before expiry, room permission cache TTL 60 s
AwarenessCursor + selection broadcast, 30 s expiry, cleanup on disconnect
Backpressure50 ms server debounce, drop cursors under overload, 256 KB message size limit
ObservabilityOpenTelemetry traces for every sync round-trip, metrics: connection count, doc size, updates/s, snapshot lag
Disaster recoveryHourly snapshot backups, replay log from S3 within the last 24 hours, RPO < 1 minute
VersioningTime-travel UI for users, "named version" tagged snapshots, fork from an older version
TestingLoad test with 10k concurrent WS connections, chaos testing (kill random nodes, network partitions), property-based tests for merge convergence

13. The future — AI agents as the Nth CRDT peer

2026 brings a new perspective: if a human user is a CRDT peer, why can't an AI agent be one? ElectricSQL's "AI agents as CRDT peers" post argues this is far more natural than designing a bespoke RPC protocol for agents writing into documents.

Concretely: a Claude agent generates a paragraph and applies it to Y.Text just like the user typing. If the user is typing simultaneously, Yjs merges automatically — no "AI overrides user" or "user overrides AI". Both coexist as equal peers. This is the pattern Notion AI and Linear AI are adopting, and it's the cleanest path for generative agents inside multi-user documents.

When designing a new collaboration system in 2026, plan for three peer types from the start: human, AI agent, and integration bot. All three write through the same CRDT layer with the same permission model. That's the durable architecture for the next decade.

14. Conclusion

Real-time collaboration is no longer a "nice to have" SaaS feature in 2026 — it's the baseline expectation. CRDTs have solved the hardest part (merge convergence) through mathematics, leaving engineers with the practical parts: pick the right library (Yjs for editors, Automerge for arbitrary JSON), design the right backend (Pattern B for scale, Pattern D for quick launch), persist correctly (snapshot + log), and avoid the easy anti-patterns.

Don't wait six months post-launch to retrofit collaboration — the migration cost later is always 5-10× higher than building it in from the start. Also don't over-engineer: a three-person startup doesn't need Pattern C on day one; a $99/month Liveblocks subscription is enough to validate the idea before investing in a backend.

Three questions to answer before starting: (1) Is your document mostly text or JSON? (chooses Yjs vs Automerge). (2) Do you need offline-first or online-only? (chooses CRDT vs OT). (3) What's your expected MAU in the next 12 months? (chooses the backend pattern). With those three answers, every remaining technical decision becomes clear.

15. References