# zlay — system design
an AT Protocol relay that crawls PDS instances directly, validates commit signatures, and rebroadcasts to downstream consumers over WebSocket.
## data flow
```
PDS instances (N hosts)
  │
  ▼
Subscriber (one reader thread per host)
  │  header decode, cursor tracking, rate limiting
  │  submits raw frame to processing pool
  ▼
Frame Pool (configurable workers, default 16)
  │  full CBOR decode, DID → UID resolution via postgres
  │  ▼
  │  Validator
  │  │  cache lookup: DID → signing key (secp256k1 / p256)
  │  │  cache hit  → verify commit signature (zat SDK)
  │  │  cache miss → skip, queue background resolution
  │  ▼
  │  DiskPersist
  │  │  append to event log (28-byte LE header + CBOR payload)
  │  │  assign relay sequence number (monotonic, relay-scoped)
  │  │  write postgres metadata (account state, host cursor)
  │  ▼
  │  Broadcaster
  │     resequence frame with relay seq
  │     fan out to all connected consumers via SharedFrame (ref-counted)
  │     per-consumer ring buffer (8,192 frames) + write thread
  ▼
Downstream consumers (WebSocket)
```
additionally:
- collection index (RocksDB): subscriber calls `trackCommitOps` on each validated commit; stores `(collection, did)` pairs for `listReposByCollection`
- event log: append-only files rotated every 10K events, configurable retention (default: 72h, env: `RELAY_RETENTION_HOURS`). supports cursor replay — disk first, then in-memory ring buffer (50K frames)
- slurper: orchestrates subscribers. bootstraps host list from seed relay's `listHosts` API, spawns/stops workers, processes `requestCrawl` requests
## threading model
| thread type | count at ~2,750 PDS | stack size | responsibility |
|---|---|---|---|
| subscriber readers | ~2,750 | 8 MB | one per host — lightweight WebSocket read loop, header decode, cursor tracking, rate limiting, submit to pool |
| frame pool workers | 16 (env: FRAME_WORKERS) | 8 MB | CBOR decode, validation, DB persist, broadcast |
| resolver threads | 4–8 (env: RESOLVER_THREADS) | 8 MB | DID document resolution, signing key extraction, cache population |
| consumer write threads | 1 per downstream consumer | 8 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
| flush thread | 1 | 8 MB | batched fsync of event log (100ms or 400 events) |
| GC thread | 1 | 8 MB | event log file cleanup every 10 minutes |
| crawl queue thread | 1 | 8 MB | process requestCrawl — validate hostname, describeServer, spawn worker |
| metrics server | 1 | 8 MB | HTTP on internal port, prometheus scrape |
| main thread | 1 | default | signal handling, shutdown coordination |
total: ~2,770 + consumers. subscriber reader threads run blocking WebSocket
read loops — lightweight, no async runtime, no event loop. each reader does
minimal work per frame (header decode, cursor update, rate limit check) and
spends most time blocked in recv(). heavy processing (full CBOR decode, CAR
parse, ECDSA verify, DB persist, broadcast) runs on pool workers. the header
is intentionally decoded twice — once by the reader for routing, once by the
pool worker for full processing. this double decode costs ~1–2μs per frame,
far cheaper than serializing parsed state across threads.
the 8 MB stack size (vs zig's 16 MB default) supports ReleaseSafe's TLS
handshake path (~134 KiB peak stack from tls.Client.init + KeyShare.init).
only touched pages count as RSS — reader threads use ~0.45 MiB RSS each since
the deepest paths (crypto, CBOR, DB) now run on pool workers.
## memory model
allocator: `std.heap.c_allocator` (libc malloc). glibc provides per-thread
arenas and madvise-based page return. zig's general-purpose allocator (GPA),
by contrast, is a debug allocator that never returns freed pages to the OS —
unsuitable for long-running servers.
shared TLS CA bundle: loaded once by the slurper, passed to all ~2,750
subscriber connections via config.ca_bundle. without this, each websocket.zig
TLS client calls Bundle.rescan() and loads its own copy (~800 KB each),
totaling ~2.2 GiB of duplicate CA certificates in memory.
arena per frame: each subscriber creates a std.heap.ArenaAllocator per
WebSocket message. all CBOR decode temporaries, CAR parse buffers, and MST
nodes live in this arena. freed in bulk after the frame is processed. this
prevents fragmentation from per-field allocations.
shared frames: the broadcaster creates one SharedFrame per broadcast.
consumers acquire references; the frame is freed when the last consumer
releases. this avoids copying frame bytes per consumer.
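a minimal python sketch of the refcounting scheme — names like `acquire` / `release` are assumptions here; zlay implements this in zig with atomic counters:

```python
import threading

class SharedFrame:
    """One heap copy of the frame bytes, shared by all consumers."""
    def __init__(self, data: bytes):
        self.data = data
        self._refs = 1            # broadcaster holds the initial reference
        self._lock = threading.Lock()
        self.freed = False

    def acquire(self):
        with self._lock:
            self._refs += 1
        return self

    def release(self):
        with self._lock:
            self._refs -= 1
            if self._refs == 0:
                self.freed = True  # stand-in for freeing the buffer

# broadcaster creates one frame; two consumers share the same bytes
frame = SharedFrame(b"\x00" * 64)
a, b = frame.acquire(), frame.acquire()
frame.release()        # broadcaster done
a.release()
assert not frame.freed # second consumer still holds a reference
b.release()
assert frame.freed     # last release frees the bytes
```

the point of the scheme is the last line: whichever consumer finishes last frees the frame, so the broadcaster never copies bytes per consumer.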
validator cache: StringHashMap(CachedKey) — DID string → 75-byte
fixed-size struct (key type + 33-byte compressed pubkey + resolve timestamp).
capped at 250K entries (env: VALIDATOR_CACHE_SIZE), LRU-ish eviction of
oldest 10% when full. ~37 MB at capacity. the resolve queue uses a
StringHashMapUnmanaged(void) as a dedupe set to prevent the same DID from
being queued multiple times. migration checks are interleaved with DID
resolutions (1 per 10) to prevent starvation.
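the capped cache and its "evict the oldest 10% when full" policy can be sketched in python (field names are hypothetical; insertion order stands in for the LRU-ish ordering):

```python
class ValidatorCache:
    """DID -> signing key cache, capped; evicts the oldest 10% when full."""
    def __init__(self, cap: int):
        self.cap = cap
        self.entries = {}  # did -> (key_bytes, resolved_at); dicts keep insertion order

    def put(self, did: str, key: bytes, ts: int):
        if len(self.entries) >= self.cap:
            # drop the oldest 10% (at least one entry) in one sweep
            for old in list(self.entries)[: max(1, self.cap // 10)]:
                del self.entries[old]
        self.entries[did] = (key, ts)

cache = ValidatorCache(cap=10)
for i in range(10):
    cache.put(f"did:plc:{i}", b"k", i)
cache.put("did:plc:new", b"k", 99)     # full: evicts the oldest entry first
assert "did:plc:0" not in cache.entries
assert "did:plc:new" in cache.entries
assert len(cache.entries) == 10
```

evicting a batch rather than one entry per insert amortizes the sweep, at the cost of occasionally dropping entries that were about to be hit again.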
ring buffer: 50K-entry in-memory frame history for cursor replay when
disk replay isn't available. entries are (seq, data) pairs with data duped
from broadcast.
## persistence
### event log (append-only files)
format matches indigo's diskpersist:
```
[4B flags LE] [4B kind LE] [4B payload_len LE] [8B uid LE] [8B seq LE] [payload]
```
files named evts-{startSeq}, rotated every 10K events. buffered writes
flushed every 100ms or 400 events (whichever comes first). GC deletes files
older than RELAY_RETENTION_HOURS (default: 72h).
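the 28-byte record header packs cleanly with a fixed little-endian layout. a python sketch of the framing (illustrative — the relay writes this in zig, and function names here are hypothetical):

```python
import struct

# flags, kind, payload_len (u32 each) + uid, seq (u64 each), little-endian
HEADER = struct.Struct("<IIIQQ")
assert HEADER.size == 28

def encode_record(flags: int, kind: int, uid: int, seq: int, payload: bytes) -> bytes:
    return HEADER.pack(flags, kind, len(payload), uid, seq) + payload

def decode_record(buf: bytes):
    flags, kind, plen, uid, seq = HEADER.unpack_from(buf)
    payload = buf[HEADER.size : HEADER.size + plen]
    return flags, kind, uid, seq, payload

rec = encode_record(0, 1, uid=42, seq=1007, payload=b"\xa2cbor")
assert decode_record(rec) == (0, 1, 42, 1007, b"\xa2cbor")
```

because `payload_len` is in the header, a reader can skip records without parsing CBOR — which is what makes replay scans cheap.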
cursor replay: playback(cursor) binary-searches log files for the starting
seq, then streams entries forward. the broadcaster tries disk first, falls
back to in-memory ring buffer.
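the file-level binary search can be sketched as a bisect over each `evts-{startSeq}` file's first sequence number (illustrative python; the function name is hypothetical):

```python
import bisect

def find_start_file(file_start_seqs, cursor):
    """Pick the log file whose seq range contains `cursor`.

    file_start_seqs: sorted first-seq of each evts-{startSeq} file.
    Returns the index of the file to begin replay from, or None if the
    cursor predates retention (caller falls back or signals a gap).
    """
    i = bisect.bisect_right(file_start_seqs, cursor) - 1
    if i < 0:
        return None  # cursor older than the oldest retained file
    return i

starts = [1, 10_001, 20_001, 30_001]          # files rotate every 10K events
assert find_start_file(starts, 15_000) == 1   # lands inside evts-10001
assert find_start_file(starts, 30_001) == 3
assert find_start_file(starts, 0) is None     # fell off retention
```

from the chosen file, replay streams entries forward, filtering until it reaches the exact cursor seq.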
### postgres
tables:
- `account` — uid, did, status, upstream_status, host_id
- `account_repo` — uid, rev, commit_data_cid (latest repo state)
- `host` — id, hostname, status, last_seq, failed_attempts
- `log_file_refs` — seq→file mapping for cursor binary search
- `domain_ban` — banned domain suffixes
- `backfill_progress` — collection backfill cursor tracking
connection pool: 5 connections (hardcoded in pg.zig pool init).
### RocksDB (collection index)
two column families:
- `rbc:<collection>\0<did>` → `()` — prefix scan by collection
- `cbr:<did>\0<collection>` → `()` — per-repo deletion
populated live from firehose commits. backfill from source relay's
listReposByCollection for historical data.
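the `\0` separator keeps prefix scans unambiguous, since NSIDs never contain NUL. a python sketch of the forward-index key scheme (names hypothetical; the real index lives in RocksDB column families, and a sorted list stands in for the ordered keyspace):

```python
SEP = b"\x00"

def fwd_key(collection: str, did: str) -> bytes:
    """Key in the forward (rbc) family: <collection>\\0<did>."""
    return collection.encode() + SEP + did.encode()

def scan_by_collection(keys, collection: str):
    """Emulate a RocksDB prefix scan over the sorted forward keys."""
    prefix = collection.encode() + SEP
    return [k for k in sorted(keys) if k.startswith(prefix)]

keys = [
    fwd_key("app.bsky.feed.post", "did:plc:a"),
    fwd_key("app.bsky.feed.post", "did:plc:b"),
    fwd_key("app.bsky.feed.like", "did:plc:a"),
]
assert len(scan_by_collection(keys, "app.bsky.feed.post")) == 2
assert len(scan_by_collection(keys, "app.bsky.feed.like")) == 1
```

without the NUL separator, a scan for `app.bsky.feed` would also match `app.bsky.feedgen`-style siblings; with it, the prefix boundary is exact. the reverse (cbr) family mirrors this with `<did>\0<collection>` so a repo's entries can be enumerated and deleted together.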
## scaling limits
current deployment: ~2,780 PDS hosts, running on a 32 GB / 16 CPU node. steady-state memory: ~1.1 GiB (ReleaseSafe, with frame pool). postgres alongside at ~240 MiB. resource limits: 8 GiB memory, 1 GiB request, 1000m CPU.
| component | current (~2,750 PDS) | at 10x (~27,500 PDS) | status |
|---|---|---|---|
| thread stacks | ~22 GB virtual (2,750 × 8 MB), ~1.2 GiB RSS | ~220 GB virtual | breaks — virtual exceeds 32 GB node, but RSS scales sublinearly (~0.45 MiB/thread) |
| pg pool | 5 connections (hardcoded) | 5 connections | breaks — saturates under concurrent UID lookups |
| resolver queue | ArrayList + dedupe set | same | ok — dedupe prevents unbounded growth from duplicate DIDs |
| validator cache | 250K entries, ~19 MB | same (capped) | degrades — miss rate climbs with more unique DIDs |
| broadcaster | O(n consumers) under mutex, contention from 16 pool workers (not ~2,750 threads) | same | improved — contention reduced from ~2,750 threads to N workers |
| RocksDB | manageable write rate | ~1.4M writes/sec projected | needs compaction tuning |
| event log | buffered, 100ms flush | fine — sequential I/O | ok |
| kernel threads | ~2,800 (below 30K default) | ~28,000 (near default max) | breaks without sysctl tuning |
| RSS | ~1.1 GiB (ReleaseSafe) | ~4–6 GiB projected | ok — fits 32 GB node |
### what breaks first
- thread count: linux's default `kernel.threads-max` is ~30K. at 27,500 subscriber threads + resolver + consumer + system threads, we hit the wall. virtual address space for the stacks alone is ~220 GB (27,500 × 8 MB).
- postgres pool: 5 connections shared across ~2,750 subscriber threads works because UID lookups are fast (~0.5ms) and only happen on new DIDs. at 10x, queue contention becomes the bottleneck — every frame touches `uidForDidFromHost`.
- validator cache miss rate: a 250K-entry cache against 60M+ DIDs means a ~99% miss rate on cold start. resolver threads (4–8) can't keep up with the resolution queue at 10x ingest rate.
## migration path
### near-term (no architecture change)
- expose a `PG_POOL_SIZE` env var, increase from 5 to 20–50
- increase `VALIDATOR_CACHE_SIZE` from 250K to 2M+ (costs ~150 MB)
- tune `RESOLVER_THREADS` to 16–32 for higher resolution throughput
- `sysctl kernel.threads-max=65536` on the deploy node
### mid-term (thread pool — done, commit f0c7baf)
- replace one-thread-per-host with a thread pool of N workers — done: reader threads (one per PDS) submit raw frames to a shared processing pool (default 16 workers). heavy work (CBOR decode, validation, persist, broadcast) runs on pool workers.
- thread count is still O(hosts) for readers, but reader threads are lightweight — no crypto, no DB, no broadcast contention.
- remaining: IO multiplexing to reduce reader thread count from O(hosts) to O(cores)
### long-term (async I/O)
- zig 0.16 introduces `Io` (io_uring on linux, kqueue on darwin)
- single-threaded event loop with coroutines for all I/O
- eliminates thread overhead entirely, scales to 100K+ hosts per process
- requires rewriting subscriber, resolver, and consumer write paths
- pg.zig and websocket.zig would need async-compatible forks
## deliberate divergences from indigo
documented policy choices where zlay intentionally differs from the Go relay (bluesky-social/indigo). these are not bugs — each reflects a tradeoff appropriate for zlay's architecture.
### per-PDS concurrency model
indigo uses goroutines (M:N scheduling on a small thread pool). zlay uses one
OS thread per host — simple, no async runtime, no event loop. each thread
spends most time blocked in recv() with minimal per-frame CPU work.
observability: prometheus metrics expose thread count, RSS, per-host memory.
the 0.16 Io migration (io_uring/kqueue) is the planned optimization path,
replacing OS threads with coroutines.
### skip-on-miss validation
when the validator has no cached signing key for a DID (cache miss or pending new-account verification), zlay broadcasts the frame while the key resolves in the background. indigo blocks on DID resolution before forwarding.
zlay trades a brief trust window for throughput. the window is bounded:
- new accounts trigger async DID doc verification; on mismatch → rejected
- signature failures trigger key eviction + re-resolution (sync spec guidance)
- next commit from the same DID hits the refreshed cache
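the decision flow can be sketched in python (names are hypothetical, and treating a signature failure as a dropped frame is an assumption — the text above only specifies eviction + re-resolution):

```python
def handle_commit(did, sig_valid, cache, resolve_queue):
    """Skip-on-miss: forward on cache miss, verify on hit.

    sig_valid: callable taking the cached key, True if the commit
    signature verifies. Returns (broadcast, reason).
    """
    key = cache.get(did)
    if key is None:
        resolve_queue.add(did)    # set-based dedupe: queued at most once
        return True, "skipped"    # trust window: forward while key resolves
    if sig_valid(key):
        return True, "verified"
    cache.pop(did)                # bad signature: evict + re-resolve
    resolve_queue.add(did)
    return False, "rejected"      # assumption: failing frame is dropped

cache, queue = {"did:plc:known": b"pubkey"}, set()
assert handle_commit("did:plc:new", lambda k: True, cache, queue) == (True, "skipped")
assert handle_commit("did:plc:known", lambda k: True, cache, queue) == (True, "verified")
assert handle_commit("did:plc:known", lambda k: False, cache, queue) == (False, "rejected")
```

the "skipped" branch is the divergence from indigo: the frame goes out before the key arrives, and the next commit from the same DID hits the refreshed cache.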
### consumer buffer sizing
zlay uses an 8K-entry per-consumer ring buffer (vs indigo's 16K-entry channel).
can be tuned independently based on observed ConsumerTooSlow disconnect rate.
the ring buffer is lock-free (atomic read/write indices), so the bottleneck is
consumer write throughput, not buffer contention.
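a python sketch of the monotonic-index ring (single producer — the broadcaster — and a single consumer write thread; plain integers stand in for the atomic indices in the zig version):

```python
class SPSCRing:
    """Single-producer/single-consumer ring buffer.

    head/tail grow monotonically and are masked by a power-of-two
    capacity. push reports overflow instead of blocking, so the
    broadcaster can disconnect a too-slow consumer.
    """
    def __init__(self, capacity=8192):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0   # written only by the producer
        self.tail = 0   # written only by the consumer

    def push(self, frame) -> bool:
        if self.head - self.tail == len(self.buf):
            return False                      # full: ConsumerTooSlow
        self.buf[self.head & self.mask] = frame
        self.head += 1
        return True

    def pop(self):
        if self.tail == self.head:
            return None                       # empty
        frame = self.buf[self.tail & self.mask]
        self.tail += 1
        return frame

r = SPSCRing(capacity=4)
for i in range(4):
    assert r.push(i)
assert not r.push(4)                          # overflow → disconnect path
assert [r.pop() for _ in range(4)] == [0, 1, 2, 3]
```

with one writer per index, no compare-and-swap is needed — each side only loads the other's index and stores its own, which is why contention stays off the broadcaster's hot path.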