CRDT and Real-time Collaboration 2026 — Multi-User Sync Architecture à la Figma/Notion with Yjs, Automerge, WebSocket, and Presence/Awareness

Posted on: 4/17/2026 7:11:25 AM

Table of contents

  1. Why real-time collaboration became default UX in 2026
     - Real-time isn't just chat
  2. The journey from Google Wave to mature CRDTs
  3. OT vs CRDT — An in-depth comparison for technology choosers
     - Quick decision rule
  4. CRDT theory — state-based vs op-based, and why YATA won
     - 4.1. State-based CRDTs (CvRDT — Convergent)
     - 4.2. Operation-based CRDTs (CmRDT — Commutative)
     - 4.3. List/Text CRDTs — YATA (Yjs) and RGA (Automerge)
  5. Yjs — internal architecture, shared types, and update format
     - 5.1. Shared types and composability
     - 5.2. Binary update format and sync protocol
       - Tombstones never truly disappear
     - 5.3. The awareness protocol — presence and cursors
  6. Automerge 3 — JSON-first, columnar storage, and sync protocol
  7. Production architecture — the four most common patterns in 2026
     - 7.1. Pattern A — Monolithic WebSocket node keeping state in RAM
     - 7.2. Pattern B — Stateless WebSocket nodes + Redis pub/sub
     - 7.3. Pattern C — Actor model (Orleans / Erlang / Cloudflare Durable Objects)
     - 7.4. Pattern D — Managed service (Liveblocks / PartyKit / Pusher)
       - Pattern selection rule
  8. Persistence — snapshots, log compaction, and versioning
     - 8.1. Versioning and time travel
  9. Scaling — room-based sharding, tombstone GC, and backpressure
     - 9.1. Room-based sharding
     - 9.2. Tombstone GC
     - 9.3. Backpressure when users type too fast
  10. Security — auth, room permissions, and end-to-end encryption
     - 10.1. JWT handshake
     - 10.2. Room permissions
     - 10.3. End-to-end encryption with CRDTs
       - E2EE trades off server-side awareness
  11. Six common anti-patterns in CRDT production
  12. A 2026 go-live checklist for real-time collaboration systems
  13. The future — AI agents as the Nth CRDT peer
  14. Conclusion
  15. References

1. Why real-time collaboration became default UX in 2026

Ten years ago, having a "Save" button in a SaaS product was considered normal. In 2026 the opposite is true — a product that still has a Save button feels dated. Users are used to Figma, Notion, Linear, Google Docs, Miro, and FigJam: you type — the other person sees it instantly; you drag a block — the whole meeting watches your cursor move; you go offline for ten minutes, come back, and no "conflict" dialog asks you to pick a version. Behind that experience sits a family of algorithms rooted in the early 2000s but only truly production-ready in the last five years: CRDTs — Conflict-free Replicated Data Types.

This article is an in-depth handbook for engineers building or evaluating a real collaboration system. We'll cover four layers: theory (what a CRDT is and how it differs from Operational Transformation), implementation (Yjs and Automerge 3 — the two libraries dominating the market), backend architecture (WebSocket transport, presence/awareness, persistence, scaling), and finally the anti-patterns plus a go-live checklist for teams choosing technology in 2026.

  • ~85% — share of top-tier B2B SaaS in 2026 with at least one real-time multi-user surface
  • 100 ms — end-to-end latency ceiling for cursors and keystrokes to feel "real-time"
  • 0 — number of "merge conflict" dialogs Yjs ever needs; that's the architectural win
  • ~10x — backend RAM cost of keeping document state per room vs a stateless API

Real-time isn't just chat

Three layers of "real-time" are often conflated: broadcast (chat, notifications — handled well by SignalR/Socket.io), shared state (presence, cursors, "who's looking with me" — Liveblocks/Phoenix Channels), and collaborative documents (text, JSON, drawing — Yjs/Automerge). This article focuses on the third layer, which takes the most work but also creates the clearest product differentiation.

2. The journey from Google Wave to mature CRDTs

To see why CRDTs win many 2026 use cases over OT, you need the arc from 1989 to today. Many design decisions in Yjs and Automerge are direct reactions to failures of earlier systems.

1989 — Operational Transformation (OT) is born
Ellis & Gibbs propose OT in the GROVE editor. The idea: each op sent to the server is "transformed" to compensate for concurrent ops. OT requires a central server to arbitrate ordering.
2009 — Google Wave and OT's lessons
Wave was the most ambitious real-time text editing system at the time. It was shut down after 18 months. One technical reason: the Jupiter OT algorithm was hard to get right and not easily extensible to rich text. Google Docs later used a simpler OT with a central server.
2011 — Shapiro et al. define CRDTs
The "Conflict-free Replicated Data Types" paper by Shapiro, Preguiça, Baquero, and Zawirski laid the mathematical foundation for CRDTs: state-based (CvRDT) and operation-based (CmRDT), with proofs of eventual consistency without a central coordinator.
2006-2013 — WOOT, Logoot, RGA, and the boom of list CRDT algorithms
Over these years a wave of list/text CRDT algorithms arrived: WOOT, Treedoc, Logoot, RGA, LSEQ. Performance was still far from Google Docs, but feasibility was proven: peer-to-peer text editing without a server.
2018 — YATA and Yjs mature
Kevin Jahns publishes YATA (Yet Another Transformation Approach) and the Yjs library. YATA is simpler than RGA, tracking structure with a doubly linked list whose items carry origin/rightOrigin pointers. Yjs becomes the first CRDT library to approach Google Docs on speed and memory.
2020 — Automerge and "local-first software"
The "Local-first software" paper from Ink & Switch along with Automerge resonate strongly. The philosophy: data lives on the client, syncs via CRDT, the server is just a relay. The reverse of traditional SaaS.
2023 — Automerge 2.0 rewritten in Rust
Automerge 2 uses columnar binary format, its core written in Rust, bringing performance on par with Yjs. Browser integration via WebAssembly. Officially production-ready for arbitrary JSON documents.
2024 — Liveblocks, PartyKit, ElectricSQL — the commercial ecosystem
Hosted layers appear: Liveblocks sells "collaboration as a service" on top of Yjs, PartyKit is an edge server for multiplayer, and ElectricSQL pushes CRDTs down into Postgres replication. Real-time starts becoming a commodity.
2025-2026 — AI agents become CRDT peers
ElectricSQL's post on "AI agents as CRDT peers" sketches the next direction: an agent writes into a document alongside the user via the same Yjs mechanism — no race condition, no separate handshake. Real-time collaboration is no longer only human-to-human.

3. OT vs CRDT — An in-depth comparison for technology choosers

This is the first and most important architectural decision. Don't trust the "CRDTs are always better" claim — Google Docs still uses OT, Quip uses OT, Etherpad uses OT. CRDTs win some problems, OT wins others. The table below is an honest comparison based on real production experience.

| Criterion | Operational Transformation (OT) | CRDT (Yjs / Automerge) |
| --- | --- | --- |
| Ordering arbiter | Central server required | None needed (peer-to-peer feasible) |
| Offline editing | Hard — must re-transform on reconnect | Easy — merges naturally on reconnect |
| Document memory | Only the current snapshot | Needs metadata (tombstones, logical timestamps) |
| Algorithmic complexity | High (transform function hard to get right for rich text) | Moderate (op + merge rules well-defined) |
| Rich text formatting | Quill OT, ShareDB OT are mature | Yjs Y.XmlFragment, Automerge Rich Text recently stabilized |
| Per-user undo/redo | Needs complex custom logic | Yjs UndoManager built in |
| Peak throughput | High with a well-tuned server (Google Docs level) | High, but needs tombstone GC |
| Ease of reasoning about correctness | Hard — transform property is tricky to verify | Easier — mathematical proofs of convergence exist |
| Strongest use case | Server-centric, online-only document (Google Docs) | Local-first, offline-capable, peer-to-peer (Linear, Figma) |
| Weakest use case | Mobile offline, peer-to-peer | Very large documents (>100 MB) — tombstones balloon |

Quick decision rule

If your product needs (1) offline-first support, (2) mobile, (3) reuse of open-source editors (Tiptap, Slate, Lexical, ProseMirror), or (4) future peer-to-peer operation — pick CRDT. If you are (1) online-only, (2) have heavy engineering resources, (3) already employ a long-tenured OT team, or (4) serve very large documents with few concurrent ops — OT is still a safe bet. In 2026, the default for new teams is CRDT.

4. CRDT theory — state-based vs op-based, and why YATA won

CRDTs come in two main families. Understanding the difference helps you read Yjs or Automerge source without getting lost.

4.1. State-based CRDTs (CvRDT — Convergent)

Each replica holds full state and defines a merge function that must have three properties: commutative (a+b = b+a), associative ((a+b)+c = a+(b+c)), and idempotent (a+a = a). If all three hold, every replica merging states in any order reaches the same result — that's eventual consistency.

Classic example: the G-Counter (grow-only counter). Each replica keeps a map {nodeId: localCount}. The counter value is the sum of all localCount values. Merge is element-wise max. Property: if two replicas increment simultaneously then sync, the result is always the correct total.

Upside of state-based: simple, no causal ordering needed. Downside: you must send the whole state each sync — impractical for large documents. That's why production rarely uses pure state-based CRDTs for text/JSON documents.
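
The G-Counter above fits in a few lines. The sketch below is a minimal plain-JavaScript illustration (class and method names are ours, not from any library); the test of the three merge properties is what makes it a CvRDT:

```javascript
// Minimal G-Counter sketch: each replica tracks per-node counts,
// merge is element-wise max — commutative, associative, idempotent.
class GCounter {
  constructor(nodeId) {
    this.nodeId = nodeId
    this.counts = {} // nodeId -> localCount
  }
  increment(by = 1) {
    this.counts[this.nodeId] = (this.counts[this.nodeId] ?? 0) + by
  }
  value() {
    // counter value = sum of every replica's local count
    return Object.values(this.counts).reduce((a, b) => a + b, 0)
  }
  merge(other) {
    for (const [id, n] of Object.entries(other.counts)) {
      this.counts[id] = Math.max(this.counts[id] ?? 0, n)
    }
  }
}
```

Because merge is a max, applying the same state twice changes nothing, which is exactly why retransmitting a state during sync is always safe.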

4.2. Operation-based CRDTs (CmRDT — Commutative)

Replicas send operations instead of state. Requirements: ops must commute (applying them in different orders gives the same result), and the transport must be reliable + at-most-once + causal-ordered (parent ops arrive before child ops).

Example: the OR-Set (observed-remove set): when adding an element, tag it with a unique id; when removing, record which ids have been removed. Concurrent add and remove of the same element resolves to add-wins (remove only clears ids it has observed).
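
A toy OR-Set along those lines (all names are ours; the shared tag counter is a simplification — a real implementation tags with (nodeId, localCounter) so uniqueness holds without coordination):

```javascript
// Toy OR-Set: adds tag the element with a unique id; remove only clears
// tags it has observed, so a concurrent add wins over the remove.
let nextTag = 0
class ORSet {
  constructor() {
    this.adds = new Map()    // element -> Set of add-tags
    this.removed = new Set() // observed-removed tags
  }
  add(el) {
    const tag = `t${nextTag++}`
    if (!this.adds.has(el)) this.adds.set(el, new Set())
    this.adds.get(el).add(tag)
  }
  remove(el) {
    // only tags this replica has *observed* are removed
    for (const tag of this.adds.get(el) ?? []) this.removed.add(tag)
  }
  has(el) {
    for (const tag of this.adds.get(el) ?? []) {
      if (!this.removed.has(tag)) return true
    }
    return false
  }
  merge(other) {
    for (const [el, tags] of other.adds) {
      if (!this.adds.has(el)) this.adds.set(el, new Set())
      for (const t of tags) this.adds.get(el).add(t)
    }
    for (const t of other.removed) this.removed.add(t)
  }
}
```

The add-wins behaviour falls out naturally: a concurrent re-add carries a fresh tag the remover never observed, so the element survives the merge.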

Op-based is more bandwidth-efficient but demands a stronger transport layer. Yjs and Automerge are both op-based with an optimization: the op log is compressed into binary updates that can be repackaged as "snapshots" or "deltas".

4.3. List/Text CRDTs — YATA (Yjs) and RGA (Automerge)

The hardest list CRDT problem: two users both insert a character at position 5 — who wins? Indexes (numeric) don't work (they shift after inserts). The solution: assign each character a stable identifier (ID = nodeId + clock), describe an insert as "insert X to the right of Y", then use a tie-breaking rule when both end up at the same position.

graph LR
    subgraph U1["User A types 'X' after 'He'"]
        A1["H"] --> A2["e"] --> A3["X"]
    end
    subgraph U2["User B types 'Y' after 'He'"]
        B1["H"] --> B2["e"] --> B3["Y"]
    end
    subgraph MERGE["After merge — YATA tie-breaks by clientID"]
        M1["H"] --> M2["e"] --> M3["X (A.5)"] --> M4["Y (B.7)"]
    end
    style A3 fill:#e94560,color:#fff
    style B3 fill:#4CAF50,color:#fff
    style M3 fill:#e94560,color:#fff
    style M4 fill:#4CAF50,color:#fff
Two concurrent inserts are ordered deterministically by (origin, clientID, clock)

Yjs's YATA is simpler than RGA: each item has origin (the ID of the character to the left at creation), rightOrigin (the character to the right at creation), and tie-breaks by (clientID, clock). On merge, the new item is "nested" between origin and rightOrigin by a deterministic rule. Efficiency: O(N) for normal inserts, with index hashing optimizable toward O(1).
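
The tie-break can be shown in isolation. The sketch below is a deliberately reduced toy (function names are ours): it only handles concurrent inserts that share the same origin, ordering them by ascending clientID, and omits rightOrigin and the subtler cases real YATA handles. It is enough to demonstrate that two replicas applying the same ops in different orders converge:

```javascript
// Toy tie-break only: concurrent inserts anchored at the same origin are
// ordered by ascending clientID, so every replica converges.
// Real YATA also consults rightOrigin; this sketch deliberately omits it.
function sameId(a, b) {
  if (a === null || b === null) return a === b
  return a.client === b.client && a.clock === b.clock
}

function integrate(items, op) {
  // op = { id: {client, clock}, originId: id-or-null, char }
  if (items.some(i => i.id.client === op.id.client && i.id.clock === op.id.clock)) return
  let pos = op.originId === null ? 0
    : items.findIndex(i => sameId(i.id, op.originId)) + 1
  // skip concurrent siblings with the same origin and a smaller clientID
  while (pos < items.length &&
         sameId(items[pos].originId, op.originId) &&
         items[pos].id.client < op.id.client) pos++
  items.splice(pos, 0, op)
}

function text(items) { return items.map(i => i.char).join('') }
```

Running the diagram's scenario through this function gives "HeXY" on both replicas regardless of delivery order, which is the convergence property the diagram illustrates.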

5. Yjs — internal architecture, shared types, and update format

Yjs is the most popular text CRDT in 2026. It's not an editor and has no UI — it's a shared data model: you structure your data with Y.Map, Y.Array, Y.Text, and Y.XmlFragment, and every change automatically syncs with every other peer.

graph TB
    subgraph CLIENT["Yjs Client (Browser/Node)"]
        DOC["Y.Doc<br/>(root container)"]
        TYPES["Shared Types<br/>Y.Text / Y.Array / Y.Map / Y.XmlFragment"]
        STORE["DocStore<br/>(Item list, indexed by clientID)"]
        ENCODER["Update Encoder<br/>(binary, lib0)"]
        AWARE["Awareness Protocol<br/>(presence, cursor, user)"]
    end
    subgraph TRANSPORT["Provider (transport agnostic)"]
        WS["y-websocket"]
        WEBRTC["y-webrtc"]
        REDIS["y-redis"]
        IDB["y-indexeddb (persistence)"]
    end
    subgraph BACKEND["Backend"]
        SYNCSERVER["Sync Server<br/>(broadcasts updates)"]
        DB[("Persistence<br/>Postgres / S3 / LevelDB")]
        PUBSUB["Redis Pub/Sub<br/>(cross-node)"]
    end
    DOC --> TYPES --> STORE --> ENCODER
    DOC --> AWARE
    ENCODER --> WS
    ENCODER --> WEBRTC
    ENCODER --> REDIS
    ENCODER --> IDB
    AWARE --> WS
    WS --> SYNCSERVER
    SYNCSERVER --> DB
    SYNCSERVER --> PUBSUB
    PUBSUB --> SYNCSERVER
    style DOC fill:#e94560,color:#fff
    style ENCODER fill:#e94560,color:#fff
    style SYNCSERVER fill:#2c3e50,color:#fff
Yjs cleanly separates data model, encoder, transport, and persistence — every layer is swappable

5.1. Shared types and composability

You can nest shared types inside each other: Y.Map<string, Y.Array<Y.Map>> describes a complete Trello board — a map of columns → array of cards → map of card fields. Each sub-tree change is encoded as a minimal update, no need to rebroadcast the whole board.

// Structure of a Notion-like document
import * as Y from 'yjs'

const doc = new Y.Doc()
const blocks = doc.getArray('blocks')

const heading = new Y.Map()
heading.set('type', 'heading')
heading.set('text', new Y.Text('CRDT 2026'))
blocks.push([heading])

const paragraph = new Y.Map()
paragraph.set('type', 'paragraph')
paragraph.set('text', new Y.Text('Hello collaborative world'))
blocks.push([paragraph])

// Every other user will automatically see these 2 blocks after sync

5.2. Binary update format and sync protocol

A Yjs update is tightly optimized binary: VarInt for numbers, dictionary encoding for repeated characters, run-length encoding for consecutive ids. A 1,000-character paragraph typed sequentially compresses to ~150 bytes of update because consecutive IDs get run-length-encoded into a single range.

The sync protocol has two steps (sync step 1 and step 2): the client sends a state vector (a map clientID → max clock seen), and the server returns a diff update (only the ops the client doesn't have). This is why Yjs syncs fast even with large documents: client A has 1 MB of state, reopens after 5 minutes offline, and sync costs only a few KB if there weren't many changes.
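
The two sync steps can be modeled in a few lines of plain JavaScript. This is a toy model of the idea only (Yjs does the same thing over its compact binary encoding, not over JS objects; all names here are ours): per-client op logs, a state vector of "max clock seen per client", and a diff of everything above that vector:

```javascript
// Toy model of Yjs sync step 1 (state vector) and step 2 (diff update).
class OpStore {
  constructor() { this.ops = new Map() } // clientId -> [{clock, payload}] ascending
  add(clientId, clock, payload) {
    if (!this.ops.has(clientId)) this.ops.set(clientId, [])
    this.ops.get(clientId).push({ clock, payload })
  }
  stateVector() { // sync step 1: "here is the max clock I have seen per client"
    const sv = {}
    for (const [c, list] of this.ops) sv[c] = list[list.length - 1].clock
    return sv
  }
  diff(remoteSv) { // sync step 2: only the ops the remote has not seen
    const out = []
    for (const [c, list] of this.ops) {
      const seen = remoteSv[c] ?? -1
      for (const op of list) if (op.clock > seen) out.push({ clientId: c, ...op })
    }
    return out
  }
  applyDiff(ops) {
    for (const { clientId, clock, payload } of ops) {
      const list = this.ops.get(clientId) ?? []
      const seen = list.length ? list[list.length - 1].clock : -1
      if (clock > seen) this.add(clientId, clock, payload)
    }
  }
}
```

This is also why reconnect is cheap: the state vector is tiny regardless of document size, and the diff is proportional to what changed while you were away.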

Tombstones never truly disappear

When you delete a character, Yjs doesn't really delete it — it marks it deleted. The tombstone keeps the ID so late-arriving ops can still anchor correctly. A heavily edited document can balloon over time. Production strategy: periodically snapshot with Y.encodeStateAsUpdate(doc) to produce a new update that only contains the current state; old unneeded tombstones get compressed.

5.3. The awareness protocol — presence and cursors

Awareness is a concept separate from the document: it's ephemeral state (cursor position, selection range, "user X is viewing"). Not persisted, no tombstones, expires after ~30 seconds without a heartbeat.

// Presence on the client
import { Awareness } from 'y-protocols/awareness'
const awareness = new Awareness(doc)
awareness.setLocalStateField('user', { name: 'Anh Tu', color: '#e94560' })
awareness.setLocalStateField('cursor', { anchor: 120, head: 145 })

awareness.on('change', () => {
  for (const [clientId, state] of awareness.getStates()) {
    if (clientId === doc.clientID) continue
    renderRemoteCursor(clientId, state.user, state.cursor)
  }
})

6. Automerge 3 — JSON-first, columnar storage, and sync protocol

Automerge 3 is Yjs's main rival. Different philosophy: Yjs prioritizes text editors, Automerge prioritizes arbitrary JSON documents. If your app isn't an editor but structured data (kanban board, todo list, config sync), Automerge feels more like "just a JSON object".

| Criterion | Yjs | Automerge 3 |
| --- | --- | --- |
| Core language | JavaScript (with C++/Rust ports) | Rust (browser via WASM) |
| API style | Shared types (Y.Map, Y.Text, ...) | JSON proxy + change function |
| Text performance | Best in benchmarks | On par since v3; still slightly slower |
| Arbitrary JSON nesting | Possible but requires declaration | Natural like a regular object |
| Storage format | Binary update list | Columnar binary (better compression) |
| Sync protocol | State vector exchange | Heads-based + bloom filter |
| Multi-language | JavaScript primary, Rust port (yrs) | Rust core, JS/Python/Swift bindings official |
| Editor ecosystem | Tiptap, Slate, ProseMirror, Quill, Lexical, Monaco, CodeMirror | Custom integration needed for most editors |
| When to pick | Rich text editor is the core (Notion, Linear) | Arbitrary JSON documents, native mobile, multi-language stack |
// Automerge 3 — feels like a regular JSON object
import { next as Automerge } from '@automerge/automerge'

let doc = Automerge.from({
  todos: [],
  filter: 'all'
})

doc = Automerge.change(doc, d => {
  d.todos.push({ id: 1, text: 'Learn CRDT', done: false })
  d.todos.push({ id: 2, text: 'Refactor backend', done: false })
})

// Sync with other peers — generateSyncMessage returns [nextSyncState, message]
const sync = Automerge.initSyncState()
const [nextSync, message] = Automerge.generateSyncMessage(doc, sync)
// send message over WebSocket / HTTP / any transport (message is null when in sync)

7. Production architecture — the four most common patterns in 2026

The client code is the easy part. The backend is where 90% of production bugs happen. There are four architectural patterns to choose between, each with clear trade-offs.

7.1. Pattern A — Monolithic WebSocket node keeping state in RAM

Each document is "pinned" to a single node. Clients connect to that node via WebSocket. The node keeps the Y.Doc in memory and broadcasts updates between clients on the same node. Periodic snapshotting to disk (every 30 s).

graph LR
    C1["Client 1"] --> WS1["WS Node A<br/>(Y.Doc room1)"]
    C2["Client 2"] --> WS1
    C3["Client 3"] --> WS2["WS Node B<br/>(Y.Doc room2)"]
    LB["Load Balancer<br/>(sticky by roomId)"] --> WS1
    LB --> WS2
    WS1 --> DB[("Snapshot Store<br/>S3 / Postgres")]
    WS2 --> DB
    style WS1 fill:#e94560,color:#fff
    style WS2 fill:#e94560,color:#fff
Each room "pinned" to a node — simple and effective for a startup

Fits: under 100k concurrent users, under 10k concurrent rooms, moderate document size. Problems: a node restart loses presence, horizontal scaling needs sticky sessions, cold start is slow when loading documents from disk.

7.2. Pattern B — Stateless WebSocket nodes + Redis pub/sub

No WebSocket node "owns" a fixed room. Updates arriving at a node are decoded → pushed through a Redis pub/sub channel doc:{roomId} → every node subscribed to the channel receives it and broadcasts to its own clients. Document state lives in Redis (or a leader node via Raft).

graph TB
    subgraph CLIENTS["Clients"]
        C1["Client 1"]
        C2["Client 2"]
        C3["Client 3"]
        C4["Client 4"]
    end
    subgraph NODES["WebSocket Nodes (stateless)"]
        N1["Node A"]
        N2["Node B"]
        N3["Node C"]
    end
    subgraph SHARED["Shared State"]
        REDIS[("Redis<br/>Pub/Sub + Stream<br/>doc:{roomId}")]
        BLOB[("Persistence<br/>Postgres / S3<br/>snapshot + log")]
    end
    C1 --> N1
    C2 --> N2
    C3 --> N2
    C4 --> N3
    N1 <--> REDIS
    N2 <--> REDIS
    N3 <--> REDIS
    REDIS --> BLOB
    style REDIS fill:#e94560,color:#fff
    style BLOB fill:#2c3e50,color:#fff
The go-to pattern for production scale — all nodes are equal, horizontal scaling is easy

Fits: 100k+ concurrent users, Kubernetes with many nodes, need to restart nodes without breaking clients. Problems: Redis becomes a SPOF (need Cluster/Sentinel), high Redis bandwidth cost without filtering, document state needs a leader-based mechanism to avoid write conflicts.
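
The fan-out logic of Pattern B fits in a small sketch. The classes below are an in-process stand-in for Redis pub/sub (all names are ours, no Redis client involved): each node subscribes to doc:{roomId}, an update entering any node is published once, and every subscribed node re-broadcasts to its own local clients:

```javascript
// In-process stand-in for Redis pub/sub fan-out between stateless WS nodes.
class PubSubHub {
  constructor() { this.subs = new Map() } // channel -> Set<handler>
  subscribe(channel, handler) {
    if (!this.subs.has(channel)) this.subs.set(channel, new Set())
    this.subs.get(channel).add(handler)
  }
  publish(channel, message) {
    for (const h of this.subs.get(channel) ?? []) h(message)
  }
}

class WsNode {
  constructor(hub) {
    this.hub = hub
    this.clients = new Map() // roomId -> Set<client>; client = (update) => void
  }
  join(roomId, client) {
    if (!this.clients.has(roomId)) {
      this.clients.set(roomId, new Set())
      // subscribe once per room, like SUBSCRIBE doc:{roomId}
      this.hub.subscribe(`doc:${roomId}`, msg => this.broadcast(roomId, msg))
    }
    this.clients.get(roomId).add(client)
  }
  receive(roomId, update, sender) {
    // a local client's update goes through the hub so every node sees it
    this.hub.publish(`doc:${roomId}`, { update, sender })
  }
  broadcast(roomId, { update, sender }) {
    for (const c of this.clients.get(roomId) ?? []) {
      if (c !== sender) c(update)
    }
  }
}
```

Note that the sending client is excluded from its own broadcast, which is the detail most home-grown implementations of anti-pattern #2 get wrong first.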

7.3. Pattern C — Actor model (Orleans / Erlang / Cloudflare Durable Objects)

Each room is an actor (a grain in Orleans, a GenServer in Phoenix, a Durable Object in Cloudflare Workers). The actor system guarantees single-writer per room — no race conditions. Clients are routed to the right actor; the actor holds state in RAM and persists asynchronously.

Cloudflare Durable Objects is the most polished implementation for the web today: each document = one Durable Object, running at the edge near users, persisting to Cloudflare's SSD storage. Liveblocks and PartyKit are built on similar ideas.

Fits: global apps that need low latency, teams fine with platform lock-in. Problems: higher cost than Pattern B, harder to debug without actor-model familiarity.

7.4. Pattern D — Managed service (Liveblocks / PartyKit / Pusher)

You don't build the backend. Liveblocks handles WebSocket transport, persistence, auth, and presence. You pay a monthly fee based on MAU. Clear trade-off: fast launch and low engineering effort, in exchange for vendor lock-in and a bill that scales linearly with usage.

Pattern selection rule

Startup at idea-validation stage → Pattern D (Liveblocks). After Series A, MAU > 100k → migrate to Pattern B (Redis). Strong budget and team → Pattern C (Durable Objects/Orleans). Pattern A should be used only for prototypes or internal tools under 1,000 users.

8. Persistence — snapshots, log compaction, and versioning

A common mistake: persist every Yjs update straight into Postgres. After a month, the doc_updates table has 50 million rows and loading a document takes 10 seconds. The right approach combines an append-only log with periodic snapshots.

graph LR
    UPD["Update arrives<br/>(binary, ~100B)"] --> APPEND["Append to<br/>log table"]
    APPEND --> CHECK{"Log size<br/>>= threshold?"}
    CHECK -->|No| END1[Done]
    CHECK -->|Yes| MERGE["Apply all updates<br/>into an in-memory Y.Doc"]
    MERGE --> SNAP["Y.encodeStateAsUpdate<br/>=> binary snapshot"]
    SNAP --> WRITE["Write snapshot to<br/>doc_snapshot table/S3"]
    WRITE --> DELETE["Delete old log entries<br/>(before the snapshot)"]
    DELETE --> END2[Done]
    style MERGE fill:#e94560,color:#fff
    style SNAP fill:#4CAF50,color:#fff
Log + snapshot — balance cheap writes with fast reads

Suggested Postgres schema:

-- Append-only log, very fast to write
CREATE TABLE doc_update (
    id BIGSERIAL PRIMARY KEY,
    doc_id UUID NOT NULL,
    update_data BYTEA NOT NULL,        -- Yjs binary update
    client_id BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_update_doc ON doc_update(doc_id, id);

-- Periodic snapshot — fast to load
CREATE TABLE doc_snapshot (
    doc_id UUID PRIMARY KEY,
    snapshot BYTEA NOT NULL,           -- encodeStateAsUpdate
    last_update_id BIGINT NOT NULL,    -- final log id included in the snapshot
    state_vector BYTEA NOT NULL,       -- for diff sync
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- On load: doc_snapshot.snapshot + doc_update WHERE id > last_update_id

Common thresholds: snapshot every 100 updates or 1 MB of accumulated log size. Snapshots bigger than 5 MB should move to S3 with the URL stored in Postgres.
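
The threshold logic itself is simple enough to sketch over an in-memory stand-in for the log table (all names are ours). The merge function is a placeholder: in production it would be "apply the log to a Y.Doc, then Y.encodeStateAsUpdate(doc)", and the arrays would be the two SQL tables:

```javascript
// Threshold-based log compaction, modeled in memory.
const SNAPSHOT_EVERY = 100 // compact after this many log entries

class DocStore {
  constructor(mergeUpdates) {
    this.merge = mergeUpdates   // placeholder for the Y.Doc replay
    this.snapshot = null        // { data, lastUpdateId }
    this.log = []               // stand-in for the doc_update table
    this.nextId = 1
  }
  appendUpdate(data) {
    this.log.push({ id: this.nextId++, data })
    if (this.log.length >= SNAPSHOT_EVERY) this.compact()
  }
  compact() {
    const base = this.snapshot ? [this.snapshot.data] : []
    const data = this.merge([...base, ...this.log.map(e => e.data)])
    this.snapshot = { data, lastUpdateId: this.log[this.log.length - 1].id }
    this.log = [] // in SQL: DELETE FROM doc_update WHERE id <= last_update_id
  }
  load() { // snapshot + the tail of the log that isn't in it yet
    const base = this.snapshot ? [this.snapshot.data] : []
    return this.merge([...base, ...this.log.map(e => e.data)])
  }
}
```

Writes stay cheap (a single append) while reads stay bounded (one snapshot plus at most SNAPSHOT_EVERY log entries), which is the balance the diagram above describes.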

8.1. Versioning and time travel

Yjs supports snapshots that can be "frozen" into versions and then diffed. Y.snapshot(doc) returns a small object holding just the state vector and the delete set; combined with the update log, the document can be reconstructed at any historical point. This is the mechanism behind Notion's "Page History" and Figma's "Version History".

9. Scaling — room-based sharding, tombstone GC, and backpressure

9.1. Room-based sharding

Real-time collaboration documents don't need global consistency — only per-room consistency. This is a beautiful property for scaling: you fully shard by roomId. Each shard can be a consumer group, a Redis cluster, or a Durable Object.

Suggested approach: place the WebSocket nodes on a consistent-hash ring and route each roomId to the first node clockwise from hash(roomId). A naive hash(roomId) % N remaps almost every room whenever N changes; with consistent hashing, auto-scaling moves only a small fraction of rooms.
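
A minimal consistent-hash ring illustrating that property (FNV-1a as the hash, a few virtual nodes per server; all names are ours):

```javascript
// Minimal consistent-hash ring: lookups walk clockwise to the first
// virtual node, so adding a server remaps only the keys in its new arcs.
function fnv1a(str) {
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

class HashRing {
  constructor(vnodes = 32) {
    this.vnodes = vnodes
    this.ring = [] // sorted [hash, nodeName] pairs
  }
  addNode(name) {
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push([fnv1a(`${name}#${v}`), name])
    }
    this.ring.sort((a, b) => a[0] - b[0])
  }
  nodeFor(roomId) {
    const h = fnv1a(roomId)
    for (const [hash, name] of this.ring) if (hash >= h) return name
    return this.ring[0][1] // wrap around the ring
  }
}
```

With four nodes, adding a fifth should move roughly a quarter of the rooms, not all of them; that bounded migration is what makes auto-scaling safe for pinned rooms.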

9.2. Tombstone GC

The longer a document lives, the more tombstones it accumulates. Yjs doesn't auto-GC because late ops still need anchors. The pragmatic approach: periodically create a "compaction snapshot" — not fully deleting tombstones but packing them into a single block. Mature production stacks use Yjs document v2 (in preview in 2026), which supports "permanent delete" after a safe-time threshold (>1 day = no more delayed ops possible).

9.3. Backpressure when users type too fast

An auto-typing keyboard at 100 chars/second generates 100 updates/second. Multiplied by 50 users in the same room, that's 5,000 messages/second to broadcast. Backpressure patterns:

  • Client debounce: a Yjs transaction (doc.transact(() => ...)) batches many changes into a single update.
  • Server batching: wait 50 ms and merge all incoming updates into one broadcast.
  • Drop awareness: cursor updates can be dropped if a client can't keep up — nothing is persisted, no harm done.
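
The server-batching item can be sketched as a small window-based batcher (names are ours; the merge function is a placeholder — with Yjs you would likely pass Y.mergeUpdates):

```javascript
// Server-side batching: updates arriving within a window are merged
// into one broadcast instead of N individual ones.
class UpdateBatcher {
  constructor(flush, { windowMs = 50, merge = u => u } = {}) {
    this.flush = flush        // called once per window with the merged batch
    this.windowMs = windowMs
    this.merge = merge
    this.pending = []
    this.timer = null
  }
  push(update) {
    this.pending.push(update)
    if (this.timer === null) {
      // first update of the window starts the timer; the rest piggyback
      this.timer = setTimeout(() => {
        const batch = this.pending
        this.pending = []
        this.timer = null
        this.flush(this.merge(batch))
      }, this.windowMs)
    }
  }
}
```

At 100 updates/second per typist, a 50 ms window cuts the broadcast count by an order of magnitude while adding at most 50 ms of perceived latency.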

10. Security — auth, room permissions, and end-to-end encryption

Before CRDTs enter the picture, a plain WebSocket server already has two familiar auth problems: who can connect, and once connected, who can join which room. With CRDTs, a third appears: who's allowed to apply which update.

10.1. JWT handshake

The browser WebSocket API can't attach custom headers to the upgrade request. Two common workarounds: send the JWT in the query string on connect (wss://server/yjs?token=xxx), or use a connection cookie. The server verifies the token during the handshake, attaches the userId to the connection state, and every subsequent message is checked against that userId.
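
A sketch of the query-string variant using only Node's built-in crypto: parse ?token= from the upgrade URL and verify an HS256 JWT by hand. The claim names (sub, exp) and function names are illustrative; in production you would likely reach for a maintained JWT library instead:

```javascript
// Hand-rolled HS256 JWT handshake sketch for a WebSocket upgrade.
import crypto from 'node:crypto'

const b64url = s => Buffer.from(s).toString('base64url')

function signToken(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }))
  const body = b64url(JSON.stringify(payload))
  const sig = crypto.createHmac('sha256', secret)
    .update(`${header}.${body}`).digest('base64url')
  return `${header}.${body}.${sig}`
}

function verifyUpgrade(requestUrl, secret) {
  const token = new URL(requestUrl, 'wss://placeholder').searchParams.get('token')
  if (!token) return null
  const [header, body, sig] = token.split('.')
  const expected = crypto.createHmac('sha256', secret)
    .update(`${header}.${body}`).digest('base64url')
  const a = Buffer.from(sig ?? ''), b = Buffer.from(expected)
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) return null
  const claims = JSON.parse(Buffer.from(body, 'base64url').toString())
  if (claims.exp && claims.exp < Date.now() / 1000) return null
  return claims // attach claims.sub as the connection's userId
}
```

The timing-safe comparison matters: a naive string comparison of signatures leaks timing information an attacker can exploit.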

10.2. Room permissions

When a client subscribes to room doc:{roomId}, the server checks whether userId has access. Cache permissions in Redis with a 60 s TTL to avoid hitting the DB on every message. When permissions change (an admin revokes), publish an event permission:revoked:{userId}:{roomId} so every node disconnects the relevant connections.

10.3. End-to-end encryption with CRDTs

This is a big advantage of CRDTs: because merging is deterministic and the server doesn't need to understand content, you can encrypt updates on the client with a key the server doesn't know. The server just relays binary blobs. Common pattern: a room key derived from a shared password, each Yjs update encrypted with AES-GCM before going over the WebSocket.
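
A minimal sketch of that pattern with Node's built-in crypto (function names are ours; deriving the room key from a shared password, e.g. via a KDF, is elided here): each binary update is sealed with AES-256-GCM before it goes over the WebSocket, and the server only ever sees the opaque blob:

```javascript
// E2EE relay sketch: seal each update with AES-256-GCM under a room key
// the server never sees; the server just forwards the blob.
import crypto from 'node:crypto'

function sealUpdate(update, roomKey) { // update: Buffer, roomKey: 32-byte Buffer
  const iv = crypto.randomBytes(12)    // fresh nonce per update
  const cipher = crypto.createCipheriv('aes-256-gcm', roomKey, iv)
  const ciphertext = Buffer.concat([cipher.update(update), cipher.final()])
  // wire format: iv (12) || auth tag (16) || ciphertext
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext])
}

function openUpdate(blob, roomKey) {
  const iv = blob.subarray(0, 12)
  const tag = blob.subarray(12, 28)
  const ciphertext = blob.subarray(28)
  const decipher = crypto.createDecipheriv('aes-256-gcm', roomKey, iv)
  decipher.setAuthTag(tag)
  return Buffer.concat([decipher.update(ciphertext), decipher.final()])
}
```

GCM's auth tag means a tampered blob fails to decrypt rather than corrupting the document, so a malicious relay can drop updates but cannot silently alter them.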

E2EE trades off server-side awareness

When you encrypt updates, the server can't run content-based logic (search, mention notifications, full-text indexing). Every such feature must move to the client or use a delegated relay that can decrypt. Weigh carefully before going E2EE.

11. Six common anti-patterns in CRDT production

  1. Persisting every update straight into Postgres without snapshots. The table balloons and document loading slows. Ship snapshot + log compaction from day one.
  2. Forgetting to broadcast updates over Redis pub/sub when horizontally scaling. User on node A types, user on node B sees nothing. Test with a multi-instance load balancer from the start.
  3. Sticky sessions that last too long. A user disconnects, reconnects to another node, and waits 5 seconds for the document to load from the DB. Pattern B (stateless + Redis) avoids this.
  4. Not debouncing updates. A 50 KB paste produces 50,000 tiny ops instead of one transaction. Always wrap bulk changes in doc.transact(() => ...).
  5. Awareness leak. Disconnect without cleaning up awareness state — users see "ghost" cursors for users who've left. Handle cleanup in the onClose handler.
  6. No per-room quotas. One user sending a 1 MB text over WebSocket can OOM a node. Impose message-size limits (e.g. 256 KB), per-user connection caps (10), and document-size caps (50 MB).

12. A 2026 go-live checklist for real-time collaboration systems

ItemRequirement
Choose the CRDTBenchmarked Yjs vs Automerge on a real sample document (10 MB, 50 users, 1,000 ops/s)
Editor integrationTiptap/ProseMirror/Slate/Lexical selected, Yjs plugins verified for every needed rich-text feature
TransportWebSocket with 30 s ping/pong heartbeats, exponential-backoff reconnect, long-polling fallback when proxies block WS
PersistenceAppend log + snapshot every 100 updates, large snapshots in S3, an offline log-compaction script
ScalingStateless WS nodes + Redis pub/sub, room-based sharding, auto-scale on CPU + connection count
AuthJWT in query string, refresh before expiry, room permission cache TTL 60 s
AwarenessCursor + selection broadcast, 30 s expiry, cleanup on disconnect
Backpressure50 ms server debounce, drop cursors under overload, 256 KB message size limit
ObservabilityOpenTelemetry traces for every sync round-trip, metrics: connection count, doc size, updates/s, snapshot lag
Disaster recoveryHourly snapshot backups, replay log from S3 within the last 24 hours, RPO < 1 minute
VersioningTime-travel UI for users, "named version" tagged snapshots, fork from an older version
TestingLoad test with 10k concurrent WS connections, chaos testing (kill random nodes, network partitions), property-based tests for merge convergence

13. The future — AI agents as the Nth CRDT peer

2026 brings a new perspective: if a human user is a CRDT peer, why can't an AI agent be one? ElectricSQL's "AI agents as CRDT peers" post argues this is far more natural than designing a bespoke RPC protocol for agents writing into documents.

Concretely: a Claude agent generates a paragraph and applies it to Y.Text just like the user typing. If the user is typing simultaneously, Yjs merges automatically — no "AI overrides user" or "user overrides AI". Both coexist as equal peers. This is the pattern Notion AI and Linear AI are adopting, and it's the cleanest path for generative agents inside multi-user documents.

When designing a new collaboration system in 2026, plan for three peer types from the start: human, AI agent, and integration bot. All three write through the same CRDT layer with the same permission model. That's the durable architecture for the next decade.

14. Conclusion

Real-time collaboration is no longer a "nice to have" SaaS feature in 2026 — it's the baseline expectation. CRDTs have solved the hardest part (merge convergence) through mathematics, leaving engineers with the practical parts: pick the right library (Yjs for editors, Automerge for arbitrary JSON), design the right backend (Pattern B for scale, Pattern D for quick launch), persist correctly (snapshot + log), and avoid the easy anti-patterns.

Don't wait six months post-launch to retrofit collaboration — the migration cost later is always 5-10× higher than building it in from the start. Also don't over-engineer: a three-person startup doesn't need Pattern C on day one; a $99/month Liveblocks subscription is enough to validate the idea before investing in a backend.

Three questions to answer before starting: (1) Is your document mostly text or JSON? (chooses Yjs vs Automerge). (2) Do you need offline-first or online-only? (chooses CRDT vs OT). (3) What's your expected MAU in the next 12 months? (chooses the backend pattern). With those three answers, every remaining technical decision becomes clear.

15. References