atproto relay implementation in zig (zlay.waow.tech)

zlay — system design#

an AT Protocol relay that crawls PDS instances directly, validates commit signatures, and rebroadcasts to downstream consumers over WebSocket.

data flow#

PDS instances (N hosts)
  │
  ▼
Subscriber (one reader thread per host)
  │  header decode, cursor tracking, rate limiting
  │  submits raw frame to processing pool
  ▼
Frame Pool (configurable workers, default 16)
  │  full CBOR decode, DID → UID resolution via postgres
  │  ▼
  │  Validator
  │  │  cache lookup: DID → signing key (secp256k1 / p256)
  │  │  cache hit → verify commit signature (zat SDK)
  │  │  cache miss → skip, queue background resolution
  │  ▼
  │  DiskPersist
  │  │  append to event log (28-byte LE header + CBOR payload)
  │  │  assign relay sequence number (monotonic, relay-scoped)
  │  │  write postgres metadata (account state, host cursor)
  │  ▼
  │  Broadcaster
  │     resequence frame with relay seq
  │     fan out to all connected consumers via SharedFrame (ref-counted)
  │     per-consumer ring buffer (8,192 frames) + write thread
  ▼
Downstream consumers (WebSocket)

additionally:

  • collection index (RocksDB): subscriber calls trackCommitOps on each validated commit; stores (collection, did) pairs for listReposByCollection
  • event log: append-only files rotated every 10K events, configurable retention (default: 72h, env: RELAY_RETENTION_HOURS). supports cursor replay — disk first, then in-memory ring buffer (50K frames)
  • slurper: orchestrates subscribers. bootstraps host list from seed relay's listHosts API, spawns/stops workers, processes requestCrawl requests

threading model#

| thread type | count at ~2,750 PDS | stack size | responsibility |
|---|---|---|---|
| subscriber readers | ~2,750 (one per host) | 8 MB | lightweight WebSocket read loop, header decode, cursor tracking, rate limiting, submit to pool |
| frame pool workers | 16 (env: FRAME_WORKERS) | 8 MB | CBOR decode, validation, DB persist, broadcast |
| resolver threads | 4–8 (env: RESOLVER_THREADS) | 8 MB | DID document resolution, signing key extraction, cache population |
| consumer write threads | 1 per downstream consumer | 8 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
| flush thread | 1 | 8 MB | batched fsync of event log (100ms or 400 events) |
| GC thread | 1 | 8 MB | event log file cleanup every 10 minutes |
| crawl queue thread | 1 | 8 MB | process requestCrawl — validate hostname, describeServer, spawn worker |
| metrics server | 1 | 8 MB | HTTP on internal port, prometheus scrape |
| main thread | 1 | default | signal handling, shutdown coordination |

total: ~2,770 + consumers. subscriber reader threads run blocking WebSocket read loops — lightweight, no async runtime, no event loop. each reader does minimal work per frame (header decode, cursor update, rate limit check) and spends most time blocked in recv(). heavy processing (full CBOR decode, CAR parse, ECDSA verify, DB persist, broadcast) runs on pool workers. the header is intentionally decoded twice — once by the reader for routing, once by the pool worker for full processing. this double decode costs ~1–2μs per frame, far cheaper than serializing parsed state across threads.

the 8 MB stack size (vs zig's 16 MB default) supports ReleaseSafe's TLS handshake path (~134 KiB peak stack from tls.Client.init + KeyShare.init). only touched pages count as RSS — reader threads use ~0.45 MiB RSS each since the deepest paths (crypto, CBOR, DB) now run on pool workers.
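
a minimal sketch of the spawn pattern, assuming a stand-in `Host` type and `readerLoop` body (the actual subscriber carries connection state, cursor, and rate limiter):

```zig
const std = @import("std");

const Host = struct { hostname: []const u8 };

// illustrative reader body: the real loop connects, blocks in recv(),
// decodes the frame header, updates the cursor, and submits to the pool.
fn readerLoop(host: *const Host) void {
    _ = host;
}

pub fn spawnReaders(allocator: std.mem.Allocator, hosts: []const Host) ![]std.Thread {
    const threads = try allocator.alloc(std.Thread, hosts.len);
    for (hosts, threads) |*host, *t| {
        // 8 MB explicit stack vs the 16 MB std.Thread default: ~22 GB of
        // virtual reservation at ~2,750 hosts, but only touched pages
        // count toward RSS.
        t.* = try std.Thread.spawn(
            .{ .stack_size = 8 * 1024 * 1024 },
            readerLoop,
            .{host},
        );
    }
    return threads;
}
```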

memory model#

allocator: std.heap.c_allocator (libc malloc). glibc has per-thread arenas and madvise-based page return. zig's GeneralPurposeAllocator (GPA), by contrast, is a debug allocator that never returns freed pages to the OS — unsuitable for long-running servers.

shared TLS CA bundle: loaded once by the slurper, passed to all ~2,750 subscriber connections via config.ca_bundle. without this, each websocket.zig TLS client calls Bundle.rescan() and loads its own copy (~800 KB each), totaling ~2.2 GiB of duplicate CA certificates in memory.
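
a sketch of the load-once pattern; `loadSharedBundle` is a hypothetical helper, but `std.crypto.Certificate.Bundle.rescan` is the std call each TLS client would otherwise make per connection:

```zig
const std = @import("std");

// load the system CA store once (~800 KB) and hand every subscriber a
// pointer via config.ca_bundle, instead of one Bundle.rescan() per client.
pub fn loadSharedBundle(allocator: std.mem.Allocator) !std.crypto.Certificate.Bundle {
    var bundle: std.crypto.Certificate.Bundle = .{};
    try bundle.rescan(allocator);
    return bundle;
}
```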

arena per frame: each subscriber creates a std.heap.ArenaAllocator per WebSocket message. all CBOR decode temporaries, CAR parse buffers, and MST nodes live in this arena. freed in bulk after the frame is processed. this prevents fragmentation from per-field allocations.
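
a sketch of the per-message arena lifecycle, with a stand-in decode step:

```zig
const std = @import("std");

// one arena per WebSocket message: every CBOR/CAR temporary shares the
// arena's lifetime, and deinit frees the whole frame's garbage at once.
fn handleMessage(parent: std.mem.Allocator, raw: []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(parent);
    defer arena.deinit(); // bulk free: no per-field free calls, no fragmentation
    const alloc = arena.allocator();

    // stand-in for the real CBOR decode / CAR parse into the arena
    const scratch = try alloc.dupe(u8, raw);
    _ = scratch;
}
```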

shared frames: the broadcaster creates one SharedFrame per broadcast. consumers acquire references; the frame is freed when the last consumer releases. this avoids copying frame bytes per consumer.
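
a minimal sketch of the ref-counting pattern (illustrative types, not the exact SharedFrame layout):

```zig
const std = @import("std");

// one SharedFrame per broadcast: consumers acquire/release a refcount;
// the bytes are freed exactly once, by whichever consumer releases last.
const SharedFrame = struct {
    refs: std.atomic.Value(usize),
    data: []u8,
    allocator: std.mem.Allocator,

    fn create(allocator: std.mem.Allocator, data: []const u8) !*SharedFrame {
        const self = try allocator.create(SharedFrame);
        self.* = .{
            .refs = std.atomic.Value(usize).init(1),
            .data = try allocator.dupe(u8, data),
            .allocator = allocator,
        };
        return self;
    }

    fn acquire(self: *SharedFrame) void {
        _ = self.refs.fetchAdd(1, .monotonic);
    }

    fn release(self: *SharedFrame) void {
        if (self.refs.fetchSub(1, .release) == 1) {
            _ = self.refs.load(.acquire); // pairs with the release decrement
            self.allocator.free(self.data);
            self.allocator.destroy(self);
        }
    }
};
```

a consumer acquires when the frame lands in its ring and releases after the write completes, so the payload bytes are never copied per consumer.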

validator cache: StringHashMap(CachedKey) — DID string → 75-byte fixed-size struct (key type + 33-byte compressed pubkey + resolve timestamp). capped at 250K entries (env: VALIDATOR_CACHE_SIZE), LRU-ish eviction of oldest 10% when full. ~37 MB at capacity. the resolve queue uses a StringHashMapUnmanaged(void) as a dedupe set to prevent the same DID from being queued multiple times. migration checks are interleaved with DID resolutions (1 per 10) to prevent starvation.
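
an illustrative sketch of the entry shape and the oldest-10% eviction, assuming `resolved_at` is the eviction key; the real 75-byte layout differs:

```zig
const std = @import("std");

const KeyType = enum(u8) { secp256k1, p256 };

// illustrative entry shape (the real struct is 75 bytes; fields are a guess)
const CachedKey = struct {
    key_type: KeyType,
    pubkey: [33]u8, // compressed secp256k1/p256 point
    resolved_at: i64, // unix seconds; eviction key
};

// LRU-ish eviction sketch: collect (did, resolved_at), sort by age, drop
// the oldest ~10%. (real code would also free the owned DID key strings.)
fn evictOldestTenth(gpa: std.mem.Allocator, cache: *std.StringHashMap(CachedKey)) !void {
    const Entry = struct { did: []const u8, at: i64 };
    const all = try gpa.alloc(Entry, cache.count());
    defer gpa.free(all);

    var it = cache.iterator();
    var i: usize = 0;
    while (it.next()) |kv| : (i += 1) {
        all[i] = .{ .did = kv.key_ptr.*, .at = kv.value_ptr.resolved_at };
    }
    std.mem.sort(Entry, all, {}, struct {
        fn lessThan(_: void, a: Entry, b: Entry) bool {
            return a.at < b.at;
        }
    }.lessThan);

    for (all[0 .. all.len / 10]) |e| _ = cache.remove(e.did);
}
```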

ring buffer: 50K-entry in-memory frame history for cursor replay when disk replay isn't available. entries are (seq, data) pairs with data duped from broadcast.

persistence#

event log (append-only files)#

format matches indigo's diskpersist:

[4B flags LE] [4B kind LE] [4B payload_len LE] [8B uid LE] [8B seq LE] [payload]

files named evts-{startSeq}, rotated every 10K events. buffered writes flushed every 100ms or 400 events (whichever comes first). GC deletes files older than RELAY_RETENTION_HOURS (default: 72h).
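
the header layout transcribed into Zig; a sketch assuming the field order shown above:

```zig
const std = @import("std");

pub const header_len = 28;

// serialize the indigo-compatible diskpersist header:
// [4B flags][4B kind][4B payload_len][8B uid][8B seq], all little-endian.
pub fn writeHeader(
    buf: *[header_len]u8,
    flags: u32,
    kind: u32,
    payload_len: u32,
    uid: u64,
    seq: u64,
) void {
    std.mem.writeInt(u32, buf[0..4], flags, .little);
    std.mem.writeInt(u32, buf[4..8], kind, .little);
    std.mem.writeInt(u32, buf[8..12], payload_len, .little);
    std.mem.writeInt(u64, buf[12..20], uid, .little);
    std.mem.writeInt(u64, buf[20..28], seq, .little);
}
```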

cursor replay: playback(cursor) binary-searches log files for the starting seq, then streams entries forward. the broadcaster tries disk first, falls back to in-memory ring buffer.
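
a sketch of the file-selection step, assuming the start sequences of the evts-{startSeq} files are available sorted in memory (the log_file_refs table serves this role in postgres):

```zig
const std = @import("std");

// given the sorted start sequences of evts-{startSeq} files, pick the file
// that can contain `cursor`: the last file whose startSeq <= cursor.
fn fileForCursor(start_seqs: []const u64, cursor: u64) ?usize {
    var lo: usize = 0;
    var hi: usize = start_seqs.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        if (start_seqs[mid] <= cursor) lo = mid + 1 else hi = mid;
    }
    return if (lo == 0) null else lo - 1;
}

test "fileForCursor picks the covering file" {
    const seqs = [_]u64{ 0, 10_000, 20_000 };
    try std.testing.expectEqual(@as(?usize, 1), fileForCursor(&seqs, 15_000));
    try std.testing.expectEqual(@as(?usize, 2), fileForCursor(&seqs, 20_000));
}
```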

postgres#

tables:

  • account — uid, did, status, upstream_status, host_id
  • account_repo — uid, rev, commit_data_cid (latest repo state)
  • host — id, hostname, status, last_seq, failed_attempts
  • log_file_refs — seq→file mapping for cursor binary search
  • domain_ban — banned domain suffixes
  • backfill_progress — collection backfill cursor tracking

connection pool: 5 connections (hardcoded in pg.zig pool init).

RocksDB (collection index)#

two column families:

  • rbc: <collection>\0<did> → () — prefix scan by collection
  • cbr: <did>\0<collection> → () — per-repo deletion

populated live from firehose commits. backfill from source relay's listReposByCollection for historical data.
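
a sketch of the key composition, assuming NSID collection names and DIDs never contain a NUL byte (both are ASCII identifiers):

```zig
const std = @import("std");

// rbc key layout: <collection> 0x00 <did>. NSIDs and DIDs are NUL-free,
// so 0x00 is an unambiguous separator and a prefix scan on
// "<collection>\x00" yields exactly the repos holding that collection.
fn rbcKey(gpa: std.mem.Allocator, collection: []const u8, did: []const u8) ![]u8 {
    return std.fmt.allocPrint(gpa, "{s}\x00{s}", .{ collection, did });
}

// the cbr mirror swaps the components for per-repo deletion scans.
fn cbrKey(gpa: std.mem.Allocator, did: []const u8, collection: []const u8) ![]u8 {
    return std.fmt.allocPrint(gpa, "{s}\x00{s}", .{ did, collection });
}
```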

scaling limits#

current deployment: ~2,780 PDS hosts, running on a 32 GB / 16 CPU node. steady-state memory: ~1.1 GiB (ReleaseSafe, with frame pool). postgres alongside at ~240 MiB. resource limits: 8 GiB memory, 1 GiB request, 1000m CPU.

| component | current (~2,750 PDS) | at 10x (~27,500 PDS) | status |
|---|---|---|---|
| thread stacks | ~22 GB virtual (2,750 × 8 MB), ~1.2 GiB RSS | ~220 GB virtual | breaks — virtual exceeds 32 GB node, but RSS scales sublinearly (~0.45 MiB/thread) |
| pg pool | 5 connections (hardcoded) | 5 connections | breaks — saturates under concurrent UID lookups |
| resolver queue | ArrayList + dedupe set | same | ok — dedupe prevents unbounded growth from duplicate DIDs |
| validator cache | 250K entries, ~19 MB | same (capped) | degrades — miss rate climbs with more unique DIDs |
| broadcaster | O(n consumers) under mutex, contention from 16 pool workers (not ~2,750 threads) | same | improved — contention reduced from ~2,750 threads to N workers |
| RocksDB | manageable write rate | ~1.4M writes/sec projected | needs compaction tuning |
| event log | buffered, 100ms flush | fine — sequential I/O | ok |
| kernel threads | ~2,800 (below 30K default) | ~28,000 (near default max) | breaks without sysctl tuning |
| RSS | ~1.1 GiB (ReleaseSafe) | ~4–6 GiB projected | ok — fits 32 GB node |

what breaks first#

  1. thread count: linux default kernel.threads-max is ~30K. at 27,500 subscriber threads + resolver + consumer + system threads, we hit the wall. virtual address space for stacks alone is ~220 GB (27,500 × 8 MB).

  2. postgres pool: 5 connections shared across ~2,750 subscriber threads works because UID lookups are fast (~0.5ms) and only happen on new DIDs. at 10x, queue contention becomes the bottleneck — every frame touches uidForDidFromHost.

  3. validator cache miss rate: 250K cache with 60M+ DIDs means ~99% miss rate for cold starts. resolver threads (4–8) can't keep up with the resolution queue at 10x ingest rate.

migration path#

near-term (no architecture change)#

  • expose PG_POOL_SIZE env var, increase from 5 to 20–50
  • increase VALIDATOR_CACHE_SIZE from 250K to 2M+ (costs ~150 MB)
  • tune RESOLVER_THREADS to 16–32 for higher resolution throughput
  • sysctl kernel.threads-max=65536 on deploy node
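
a hypothetical helper for wiring these tunables, sketching env-var-with-default parsing (names mirror the variables above):

```zig
const std = @import("std");

// read an integer tunable from the environment, falling back to the
// compiled-in default on absence or parse failure.
fn envUsize(gpa: std.mem.Allocator, name: []const u8, default: usize) usize {
    const raw = std.process.getEnvVarOwned(gpa, name) catch return default;
    defer gpa.free(raw);
    return std.fmt.parseInt(usize, raw, 10) catch default;
}

// e.g. const pool_size = envUsize(gpa, "PG_POOL_SIZE", 20);
```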

mid-term (thread pool — done, commit f0c7baf)#

  • replace one-thread-per-host with a thread pool of N workers. done: reader threads (one per PDS) submit raw frames to a shared processing pool (default 16 workers). heavy work (CBOR decode, validation, persist, broadcast) runs on pool workers.
  • thread count is still O(hosts) for readers, but reader threads are lightweight — no crypto, no DB, no broadcast contention.
  • remaining: IO multiplexing to reduce reader thread count from O(hosts) to O(cores)

long-term (async I/O)#

  • zig 0.16 introduces Io (io_uring on linux, kqueue on darwin)
  • single-threaded event loop with coroutines for all I/O
  • eliminates thread overhead entirely, scales to 100K+ hosts per process
  • requires rewriting subscriber, resolver, and consumer write paths
  • pg.zig and websocket.zig would need async-compatible forks

deliberate divergences from indigo#

documented policy choices where zlay intentionally differs from the Go relay (bluesky-social/indigo). these are not bugs — each reflects a tradeoff appropriate for zlay's architecture.

per-PDS concurrency model#

indigo uses goroutines (M:N scheduling on a small thread pool). zlay uses one OS thread per host — simple, no async runtime, no event loop. each thread spends most time blocked in recv() with minimal per-frame CPU work.

observability: prometheus metrics expose thread count, RSS, per-host memory. the 0.16 Io migration (io_uring/kqueue) is the planned optimization path, replacing OS threads with coroutines.

skip-on-miss validation#

when the validator has no cached signing key for a DID (cache miss or pending new-account verification), zlay broadcasts the frame while the key resolves in the background. indigo blocks on DID resolution before forwarding.

zlay trades a brief trust window for throughput. the window is bounded:

  • new accounts trigger async DID doc verification; on mismatch → rejected
  • signature failures trigger key eviction + re-resolution (sync spec guidance)
  • next commit from the same DID hits the refreshed cache
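
a sketch of the decision flow under these rules; the types and verification stub are illustrative, not the actual validator API:

```zig
const std = @import("std");

const Action = enum { verified, rejected, skipped_queued };

// stand-in for real secp256k1/p256 verification via the zat SDK
fn verifySig(pubkey: [33]u8, msg: []const u8, sig: []const u8) bool {
    _ = pubkey;
    _ = msg;
    _ = sig;
    return true;
}

fn validateCommit(
    cache: *std.StringHashMap([33]u8),
    pending: *std.StringHashMap(void), // dedupe set feeding the resolver threads
    did: []const u8,
    msg: []const u8,
    sig: []const u8,
) !Action {
    if (cache.get(did)) |pubkey| {
        if (verifySig(pubkey, msg, sig)) return .verified;
        // signature failure: evict the key and re-resolve (sync spec guidance)
        _ = cache.remove(did);
        try pending.put(did, {});
        return .rejected;
    }
    // cache miss: queue background resolution, let the frame through
    try pending.put(did, {});
    return .skipped_queued;
}
```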

consumer buffer sizing#

zlay uses an 8K-entry per-consumer ring buffer (vs indigo's 16K-entry channel). can be tuned independently based on observed ConsumerTooSlow disconnect rate. the ring buffer is lock-free (atomic read/write indices), so the bottleneck is consumer write throughput, not buffer contention.
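
a sketch of the lock-free SPSC shape, assuming a power-of-two capacity (8,192 qualifies), one producer (the broadcaster), and one consumer (the write thread):

```zig
const std = @import("std");

// single-producer / single-consumer ring: unbounded wrapping indices,
// power-of-two capacity so the slot index is a cheap mask.
fn Ring(comptime T: type, comptime capacity: usize) type {
    std.debug.assert(std.math.isPowerOfTwo(capacity));
    return struct {
        buf: [capacity]T = undefined,
        head: std.atomic.Value(usize) = std.atomic.Value(usize).init(0), // next read
        tail: std.atomic.Value(usize) = std.atomic.Value(usize).init(0), // next write

        const Self = @This();

        fn push(self: *Self, item: T) bool {
            const tail = self.tail.load(.monotonic);
            if (tail -% self.head.load(.acquire) == capacity) return false; // full
            self.buf[tail & (capacity - 1)] = item;
            self.tail.store(tail +% 1, .release);
            return true;
        }

        fn pop(self: *Self) ?T {
            const head = self.head.load(.monotonic);
            if (head == self.tail.load(.acquire)) return null; // empty
            const item = self.buf[head & (capacity - 1)];
            self.head.store(head +% 1, .release);
            return item;
        }
    };
}
```

a failed push here would be the natural hook for the ConsumerTooSlow disconnect decision.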