# building a relay in zig

the previous devlogs covered zat as a library — parsing, decoding, verifying. this one is about what happens when you point those primitives at the full network and try to keep up. [zlay](https://tangled.org/zzstoatzz.io/zlay) is an AT Protocol relay written in zig, running at `zlay.waow.tech`, serving ~2,750 PDS hosts with ~6,000 lines of code.

## why build another relay

there are already working relay implementations — bluesky's reference [indigo](https://github.com/bluesky-social/indigo) in Go and [rsky](https://github.com/blacksky-algorithms/rsky) (by Rudy Fraser / BlackSky) in Rust. but running indigo taught me things about the protocol that reading the spec didn't:

- how identity resolution interacts with event ordering under load
- what happens when 2,750 PDS hosts each send 100ms of silence between bursts
- where the actual bottlenecks are (spoiler: not parsing)

building another implementation from zat's primitives — CBOR, CAR, signatures, DID resolution — was the fastest way to verify the library works at scale, and to understand the design space.

## architecture

zlay crawls PDS hosts directly. there's no fan-out relay in between. the bootstrap relay (bsky.network) is called once at startup to get the host list via `listHosts`, then all data flows directly from each PDS.

```
PDS hosts (2,750)
  ↓ one OS thread each
[subscriber] → decode frame → validate signature → [broadcaster]
      ↓                                   ↓
[validator cache]                downstream consumers
[collection index]                   (WebSocket)
[disk persist]
```

the key modules:

- **subscriber** — one thread per PDS, WebSocket connection with auto-reconnect and exponential backoff. decodes firehose frames using zat's CBOR codec, extracts ops from commits.
- **validator** — signing key cache + 4 background resolver threads. on cache miss, the frame passes through unvalidated and the DID is queued for resolution. subsequent commits from the same account are verified.
- **broadcaster** — lock-free fan-out to downstream consumers. ref-counted shared frames (one copy, N consumers). ring buffer of 50k frames for cursor replay.
- **collection index** — RocksDB with two column families (`rbc` for collection→DID, `cbr` for DID→collection). indexes live commits inline, no separate process.
- **event log** — postgres for account state, cursor tracking, host management. disk persistence for event replay.
### design choices that differ from indigo

**optimistic validation.** indigo blocks on DID resolution — every event waits for the signing key before proceeding. zlay passes frames through on cache miss and resolves in the background. first commit from an unknown account is unvalidated; everything after is verified. in practice, >99.9% of frames hit the cache after the first few minutes.
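
the control flow is small enough to sketch — all names here (`Validator`, `Frame`, the queue) are hypothetical stand-ins, not zlay's real types:

```zig
// verify when the signing key is cached; otherwise pass the frame
// through and queue the DID for the background resolver threads.
fn processCommit(v: *Validator, did: []const u8, frame: *Frame) !void {
    if (v.key_cache.get(did)) |key| {
        // hot path: after warm-up, >99.9% of frames land here
        try frame.verifySignature(key);
        frame.validated = true;
    } else {
        // cold path: exactly one unvalidated commit per unknown account
        try v.resolve_queue.push(did); // drained by 4 resolver threads
        frame.validated = false;
    }
    v.broadcaster.publish(frame); // either way, the frame flows on
}
```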

**inline collection index.** indigo runs [collectiondir](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) as a sidecar — a separate process that subscribes to the relay's localhost firehose and maintains a pebble KV store. zlay indexes directly in its event processing pipeline. one process, one deployment, one thing to monitor.

**OS threads, not goroutines.** one thread per PDS host. predictable memory, no GC pauses, but thread count scales linearly. 2,750 threads is fine — most are blocked on WebSocket reads. per-thread RSS is modest (stack pages on demand, ~1-2 MiB when active).

**single port.** everything — WebSocket firehose, HTTP API, admin endpoints — on port 3000. a second port (3001) serves only prometheus metrics. indigo does the same: 2470 for everything, 2471 for metrics. this required patching the websocket.zig fork to support HTTP fallback — when a non-WebSocket request arrives, the handshake parser routes it to an HTTP handler instead of returning an error.

## deployment war stories

### the musl saga

first deploy: alpine linux container, default zig target. relay starts, connects to PDS hosts, processes a few hundred events, then `SIGILL` — illegal instruction in RocksDB's LRU cache.

the cause: zig 0.15's C++ code generator for musl targets emits instructions that don't exist on baseline x86_64. RocksDB is C++ linked via rocksdb-zig, and the LRU cache's `std::function` vtable dispatch was the casualty.

fix chain:
1. `-Dcpu=baseline` — force baseline instruction set. helped, but musl's C++ ABI still had issues.
2. switch from alpine to debian bookworm-slim, `-Dtarget=x86_64-linux-gnu` — use glibc. this stuck.

the Dockerfile comment is a warning to future-me: "zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache."

### TCP splits everything

behind traefik (k3s's ingress controller), POST endpoints would hang or return "invalid JSON." the issue: a reverse proxy can split a request's headers and body across TCP segments.

the original code did one `stream.read()` and assumed the full request was in that buffer. traefik sent the headers in frame 1 and the body in frame 2, so the JSON parser got an empty body.
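
the fix is the classic read-until-delimiter loop. a minimal sketch, assuming a blocking `std.net.Stream`:

```zig
const std = @import("std");

// keep reading until the header terminator appears, instead of
// trusting a single read() to deliver the whole request.
fn readHeaders(stream: std.net.Stream, buf: []u8) ![]u8 {
    var len: usize = 0;
    while (std.mem.indexOf(u8, buf[0..len], "\r\n\r\n") == null) {
        if (len == buf.len) return error.HeadersTooLarge;
        const n = try stream.read(buf[len..]);
        if (n == 0) return error.ConnectionClosed; // peer hung up mid-request
        len += n; // a proxy may deliver headers across several segments
    }
    // body bytes (if any) still need a second loop driven by Content-Length
    return buf[0..len];
}
```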

same class of bug in the WebSocket handshake — karlseguin's websocket.zig assumed the HTTP upgrade response arrived in one TCP segment. behind a TLS-terminating proxy, it doesn't. had to fork the library to buffer full lines before parsing.

lesson: if there's a reverse proxy between you and the client, TCP will split your data at the worst possible boundary.

### RocksDB iterator lifetimes

rocksdb-zig returns `Data` structs with a `rocksdb_free` finalizer. natural instinct: call `.deinit()` when done. but iterator entries are views into rocksdb's internal snapshot buffers — calling `.deinit()` on them double-frees and triggers `SIGABRT`.
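
the ownership rule in sketch form — the API shape here is illustrative, not rocksdb-zig's exact signatures:

```zig
var it = try db.iterator(.{ .column_family = "rbc" });
defer it.deinit(); // correct: the iterator owns the snapshot buffers

while (try it.next()) |entry| {
    // entry.key / entry.value are views into the snapshot — copy out
    // anything that must outlive the loop iteration
    try out.append(try allocator.dupe(u8, entry.key));
    // calling entry.deinit() here double-frees and SIGABRTs under load
}
```

the rule of thumb: free what you were handed ownership of, and only that. views borrow from their parent; the parent's `deinit` is the only one that counts.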

separately: rocksdb-zig passes the database path pointer directly to the C API. if the path isn't null-terminated (which zig slices generally aren't), rocksdb reads past the slice boundary. fix: always use `realpathAlloc`, which guarantees null termination.

both bugs were invisible in tests and only appeared under production load patterns.

### pg.zig doesn't coerce

the backfill status endpoint crashed on first request. postgres `COALESCE(SUM(imported_count), 0)` returns `numeric`, not `bigint`. Go's pq driver silently coerces. pg.zig panics. fix: explicit `::bigint` casts on every aggregate.
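
a sketch of the fixed query, assuming pg.zig's `conn.row` / `get` style of access (the table and column names are illustrative):

```zig
// cast the aggregate in SQL so the wire type matches the zig type
// you read it back as — int8 on the wire, i64 in zig.
const row = try conn.row(
    \\select coalesce(sum(imported_count), 0)::bigint
    \\from backfill_progress where collection = $1
, .{collection});
// without ::bigint, postgres sends numeric here and the driver panics
const imported: i64 = if (row) |r| r.get(i64, 0) else 0;
```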

strictness has its benefits — you catch schema bugs earlier. but you pay for it in production when the schema is "correct" by postgres standards and wrong by your driver's standards.

## the collection index backfill

the collection index only knows about accounts that have posted since live indexing started. historical data — tens of millions of `(DID, collection)` pairs — needs to come from somewhere.

the backfiller discovers collections from two sources: [lexicon garden](https://lexicon.garden/llms.txt) (~700 NSIDs scraped from their llms.txt) and a RocksDB scan of collections already observed from the firehose. then it pages through `listReposByCollection` on bsky.network for each collection, adding DIDs to the index.

progress is tracked in postgres — cursor position and imported count per collection — so crashes resume where they left off. triggered via admin API, monitored via status endpoint.

first backfill run: 1,287 collections discovered. the small ones (niche lexicons, alt clients) complete in seconds. the big ones — `app.bsky.feed.like`, `app.bsky.feed.post`, `app.bsky.actor.profile` — each have 20-30M+ DIDs and take hours to page through at 1,000 per request with a 100ms pause between pages.
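
the paging loop is the whole trick — checkpoint the cursor per page so a crash resumes mid-collection instead of restarting it. a sketch with hypothetical helpers (`listReposByCollection`, `saveProgress`, etc. are stand-ins):

```zig
const std = @import("std");

// resume from the last checkpointed cursor, if any
var cursor: ?[]const u8 = try db.loadCursor(nsid);

while (true) {
    // one page of up to 1,000 repos for this collection
    const page = try listReposByCollection(client, nsid, cursor, 1000);
    for (page.repos) |repo| try index.addDid(nsid, repo.did);

    // checkpoint cursor + count before moving on, so a crash
    // loses at most one page of work
    try db.saveProgress(nsid, page.cursor, page.repos.len);

    cursor = page.cursor orelse break; // null cursor: fully paged
    std.Thread.sleep(100 * std.time.ns_per_ms); // the 100ms politeness pause
}
```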

as of writing: backfill complete — 1,287 collections indexed, 61M DIDs imported.

## the build pipeline

zig cross-compilation from macOS to linux/amd64 via Docker is slow (QEMU emulation). the production server is already x86_64 linux. so the deploy recipe SSHs into the server, does a native `zig build`, builds a thin runtime image with `buildah`, imports it directly into k3s's containerd (no registry), and restarts the deployment. the whole cycle takes under a minute.

the runtime Dockerfile is five lines: debian base, ca-certificates, copy the binary, expose ports, entrypoint.

## numbers

| | indigo (Go) | zlay (zig) |
|---|---|---|
| dependencies | ~50 Go modules | 4 (zat, websocket, pg, rocksdb) |
| memory | ~6 GiB (GOMEMLIMIT) | ~2.9 GiB (~2,750 hosts) |
| collection index | sidecar process (pebble) | inline (RocksDB) |
| validation | blocking (DID resolution) | optimistic (pass-through on miss) |
| services to deploy | 2 (relay + collectiondir) | 1 |

the first measurement (1.8 GiB at 1,486 hosts) was misleading — memory climbed to 6.6 GiB as the relay connected to all ~2,750 hosts, approaching the 8 GiB OOM limit. two fixes brought it back down:

1. **thread stack sizes.** zig's default is 16 MB per thread. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most threads just read WebSockets and decode CBOR — 2 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 2 * 1024 * 1024 }`.

2. **c_allocator instead of GeneralPurposeAllocator.** GPA is actually a debug allocator (renamed `DebugAllocator` in zig 0.15) — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc, `std.heap.c_allocator` gives glibc malloc with per-thread arenas, `madvise`-based page return, and production-grade fragmentation mitigation.
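
both fixes in one sketch — `subscriberLoop` and `host` are hypothetical names, but the `std.Thread.spawn` config is the standard library's:

```zig
const std = @import("std");

// production allocator: glibc malloc, not zig's DebugAllocator
const allocator = std.heap.c_allocator;

// 2 MiB stacks instead of the 16 MB default — across ~2,750 threads
// that's the difference between ~5.5 GB and ~44 GB of virtual mappings
const t = try std.Thread.spawn(
    .{ .stack_size = 2 * 1024 * 1024 },
    subscriberLoop,
    .{ host, allocator },
);
t.detach(); // the subscriber runs until the process exits
```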

## what zat exercises

zlay is the heaviest consumer of zat. every firehose frame exercises the CBOR codec. every commit exercises CAR parsing. every new account exercises DID resolution and key extraction. the collection index uses NSID validation. the backfill uses HTTP client patterns.

running at ~600 events/sec sustained, zat processes roughly 50M CBOR decodes per day. that's a different kind of test than unit vectors.

## spec compliance

after the memory fixes, the next pass was checking zlay against the actual lexicon definitions for what a relay should implement. three gaps:

1. **`getHostStatus` was missing.** the lexicon says "implemented by relays" — zlay had `listHosts` but not the single-host query. straightforward handler: look up host, count accounts, map internal status values to the lexicon's `hostStatus` enum.

2. **admin takedowns didn't emit `#account` events.** `/admin/repo/ban` zeroed payloads on disk but never told downstream consumers the account was taken down. the spec says a relay's own takedown should produce an `#account` event. fix: build a CBOR frame (`active: false, status: "takendown"`), persist it, broadcast it.

3. **DID migration was unvalidated.** when an account appeared from a different PDS host, zlay blindly updated the host_id. now it queues a migration check — the validator's background threads resolve the DID document, check `pdsEndpoint()`, and only update if the new host matches.
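
the takedown fix (gap 2) in sketch form — the field names follow the `com.atproto.sync.subscribeRepos#account` event shape, but `encodeAccountEvent` and the surrounding names are assumptions, not zlay's actual API:

```zig
// build the #account frame, persist it for cursor replay, then
// fan it out so downstream consumers learn about the takedown
const evt = try encodeAccountEvent(allocator, .{
    .seq = next_seq,
    .did = did,
    .time = timestampNow(), // ISO 8601, per the lexicon
    .active = false,
    .status = "takendown",
});
try event_log.persist(next_seq, evt); // replayable from any earlier cursor
broadcaster.publish(evt);
```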

## what's next

the backfill is complete — 1,287 collections indexed, 61M DIDs. the next step is a correctness audit — diff `listReposByCollection` results across a sample of collections against bsky.network's collectiondir and verify the sets match.

longer term: full commit diff verification via MST inversion. zlay already handles `#sync` frames and validates signatures, but the inductive firehose check (`verifyCommitDiff`) isn't wired into the hot path yet. the primitives exist in zat — it's a throughput tradeoff.