atproto relay implementation in zig zlay.waow.tech

docs: reflect ReleaseSafe in production, add CLAUDE.md

ReleaseSafe deployed successfully — ~1.1 GiB RSS at 2,255 hosts
(vs ~2.7 GiB debug). update all docs to reflect current state:
- README: remove "debug" default, remove "not in prod" caveat
- deployment.md: ReleaseSafe is production default, add frame pool
to memory tuning section, update resource usage
- design.md: 8 MiB stacks, updated RSS numbers, per-thread breakdown
- incident doc: current state shows successful ReleaseSafe deploy
- CLAUDE.md: operational context for AI-assisted development

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zzstoatzz c2531ab1 579c9622

+99 -25
+71
CLAUDE.md
···
+ # zlay
+
+ AT Protocol relay in zig 0.15.2. subscribes to every PDS, validates commit
+ signatures, serves merged firehose to downstream consumers.
+
+ ## architecture
+
+ - **reader thread per PDS** (~2,750) — lightweight: TLS read, header decode,
+ cursor tracking, rate limiting. submits raw frames to pool.
+ - **frame pool** (16 workers) — CBOR decode, ECDSA verify, DB persist, broadcast.
+ - **validator** (4-8 resolver threads) — background DID resolution, signing key cache.
+ - dependencies: zat (AT proto primitives), websocket.zig, pg.zig, rocksdb-zig.
+
+ ## build
+
+ ```bash
+ zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu # production
+ zig build test # run tests
+ zig fmt --check . # CI checks this
+ ```
+
+ always run `zig fmt --check .` and `zig build test` before pushing.
+
+ ## deploy
+
+ deploy configs live at `../@zzstoatzz.io/relay/` (justfile, helm, terraform).
+
+ ```bash
+ cd ../relay
+ just zlay-publish-remote ReleaseSafe # build on server, deploy
+ just zlay-publish-remote # debug build (fallback)
+ ```
+
+ **critical rules:**
+ - MUST use `-Dtarget=x86_64-linux-gnu` — musl produces illegal instructions in RocksDB
+ - MUST set `KUBECONFIG="$(pwd)/zlay-kubeconfig.yaml"` — default context is docker-desktop
+ - never change probe paths and image in separate operations
+ - immutable SHA tags, never `:latest`
+ - ReleaseFast has a known double-free — do not use
+
+ ## key files
+
+ | file | purpose |
+ |---|---|
+ | `src/main.zig` | entry point, thread stacks (8 MiB), signal handling |
+ | `src/subscriber.zig` | per-PDS reader thread, FrameHandler |
+ | `src/frame_worker.zig` | pool worker: decode, validate, persist, broadcast |
+ | `src/thread_pool.zig` | generic ring-buffer thread pool |
+ | `src/validator.zig` | signature verify, DID resolver, LRU key cache |
+ | `src/broadcaster.zig` | fan-out to consumers, prometheus metrics |
+ | `src/event_log.zig` | disk persist, postgres, cursor replay |
+ | `src/collection_index.zig` | RocksDB (collection, did) index |
+ | `src/slurper.zig` | multi-host crawl manager |
+ | `src/api.zig` | HTTP API endpoints (served via httpFallback on WS port) |
+
+ ## current production state
+
+ - running at zlay.waow.tech, ~2,750 PDS hosts
+ - ReleaseSafe, 8 MiB thread stacks, ~1.1 GiB RSS
+ - ports: 3000 (WS + HTTP), 3001 (metrics only)
+ - probes: `/_healthz` (liveness), `/_readyz` (readiness, DB check)
+ - `relay_build_info{git_sha,optimize}` metric confirms what's running
+
+ ## zig gotchas relevant here
+
+ - `&.{...}` in loops creates stack-local arrays that alias — heap-allocate instead
+ - operator precedence: `(byte & 0xf0) == 0` not `byte & 0xf0 == 0`
+ - `ArrayList` is unmanaged in 0.15 — pass allocator to each method
+ - rocksdb-zig iterator Data: do NOT call `.deinit()` on entries (SIGABRT)
+ - rocksdb-zig DB.open: path must be null-terminated
+ - pg.zig `QueryRow.deinit()` returns `!void` — use `defer row.deinit() catch {}`
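
the CLAUDE.md above says the `relay_build_info{git_sha,optimize}` metric confirms what's running. a minimal sketch of reading one label out of a Prometheus text-format line follows. the sample line and its label values are hypothetical, and `label_value` is a helper invented for illustration, not something from the repo.

```bash
#!/usr/bin/env bash
# Extract the value of one label from a Prometheus text-format sample line.
# usage: label_value <line> <label>
label_value() {
  local line=$1 label=$2
  # Match label="value" preceded by '{' or ',' and print only the value.
  printf '%s\n' "$line" | sed -n "s/.*[{,]${label}=\"\([^\"]*\)\".*/\1/p"
}

# Hypothetical sample line; real label values come from the running relay.
sample='relay_build_info{git_sha="abc1234",optimize="ReleaseSafe"} 1'

label_value "$sample" optimize   # prints: ReleaseSafe
label_value "$sample" git_sha    # prints: abc1234
```

against a live relay the input line would come from the metrics port (3001 per the notes above). the exact scrape path is not stated in this commit, so it is left out here.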
+3 -3
README.md
···

  - **inline collection index** — indexes `(DID, collection)` pairs in the event processing pipeline using RocksDB. serves `listReposByCollection` from the relay process — no sidecar. the index design draws on [fig](https://tangled.org/microcosm.blue)'s work on [lightrail](https://tangled.org/microcosm.blue/lightrail).

- - **reader thread per PDS + frame processing pool** — each PDS gets a lightweight reader thread (cursor tracking, rate limiting, header decode). heavy work (full CBOR decode, validation, DB persist, broadcast) runs on a shared pool of frame workers (configurable, default 16). thread stacks are 4 MB (zig's default is 16 MB).
+ - **reader thread per PDS + frame processing pool** — each PDS gets a lightweight reader thread (cursor tracking, rate limiting, header decode). heavy work (full CBOR decode, validation, DB persist, broadcast) runs on a shared pool of frame workers (configurable, default 16).

  ## spec compliance

···
  requires zig 0.15 and a C/C++ toolchain (for RocksDB).

  ```bash
- zig build # build (debug — current production default)
+ zig build # build (debug)
  zig build test # run tests
- zig build -Doptimize=ReleaseSafe # release build (not used in prod yet, see docs/incident-2026-03-04.md)
+ zig build -Doptimize=ReleaseSafe # release build (production default)
  ```

  ## configuration
+8 -6
docs/deployment.md
···
  this SSHs into the server and:

  1. `git pull --ff-only` in `/opt/zlay`
- 2. `zig build -Dtarget=x86_64-linux-gnu` — debug build (no `-Doptimize`). see [incident-2026-03-04.md](incident-2026-03-04.md) for why ReleaseSafe is not used.
- 3. `buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime .` — thin runtime image
+ 2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu`
+ 3. `buildah bud -f Dockerfile.runtime .` — thin runtime image with SHA tag
  4. pushes to k3s containerd via `buildah push` → `ctr images import`
  5. `kubectl set image deployment/zlay -n zlay main=<sha-tagged-image>` + `kubectl rollout status`

···

  - `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
  - `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote` since it builds natively).
- - `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. **currently not used in production** — ReleaseSafe inflates per-thread RSS ~10x (see [incident-2026-03-04.md](incident-2026-03-04.md)). the frame pool reduces per-reader-thread work, opening a path back to ReleaseSafe.
+ - `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. production default since 2026-03-05. previously caused OOM (see [incident-2026-03-04.md](incident-2026-03-04.md)) — resolved by the frame pool moving heavy work off reader threads.

  ## initial setup

···

  ## memory tuning

- three changes brought steady-state memory from ~6.6 GiB down to ~1.2 GiB at ~2,750 connected hosts:
+ four changes brought steady-state memory from ~6.6 GiB down to ~1.1 GiB at ~2,250 connected hosts (ReleaseSafe):

  **shared TLS CA bundle.** the biggest single win. websocket.zig's TLS client calls `Bundle.rescan()` per connection, loading the system CA certificates into a per-connection arena. with ~2,750 PDS connections, that's ~2,750 copies of the CA bundle in memory (~800 KB each = ~2.2 GiB). fix: load the bundle once in the slurper, pass it to all subscribers via `config.ca_bundle`. memory dropped from ~3.3 GiB to ~1.2 GiB (~65% reduction).

- **thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most reader threads just read websockets and submit frames to the processing pool — 4 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 4 * 1024 * 1024 }`. the constant is defined in `main.zig` as `default_stack_size` for the threads spawned there; other modules use the literal directly.
+ **thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. all `Thread.spawn` calls use `main.default_stack_size` (8 MB). this is virtual memory — only touched pages count as RSS. 8 MB supports ReleaseSafe's TLS handshake path (~134 KiB peak stack).

  **c_allocator instead of GeneralPurposeAllocator.** GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (`build.zig:42`), `std.heap.c_allocator` gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.

+ **frame processing pool.** reader threads (one per PDS) now only do TLS read, header decode, cursor tracking, and rate limiting — then queue raw frames to a shared pool of 16 workers. this dramatically reduced per-thread RSS in ReleaseSafe (from ~3.9 MiB to ~0.45 MiB) by keeping crypto, DB, and broadcast off reader thread stacks.
+
  ## resource usage

  | metric | value |
  |--------|-------|
- | memory | ~2.7 GiB steady state (~2,750 hosts, with frame pool) |
+ | memory | ~1.1 GiB at ~2,250 hosts (ReleaseSafe), projected ~1.3 GiB steady state |
  | CPU | ~1.5 cores peak |
  | requests | 1 GiB memory, 1000m CPU |
  | limits | 8 GiB memory |
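
the stack-size and per-thread RSS figures in the memory tuning section above are plain multiplication. a small sketch that reproduces them with shell integer arithmetic follows. both function names are made up for illustration, and the rounding (decimal GB for virtual memory, whole MiB for RSS) is an assumption about how the docs round.

```bash
#!/usr/bin/env bash
# Virtual address space reserved for thread stacks, in decimal GB.
# usage: stack_virtual_gb <threads> <stack_mb>
stack_virtual_gb() {
  local threads=$1 stack_mb=$2
  echo $(( threads * stack_mb / 1000 ))
}

# Resident memory attributable to reader threads, in MiB.
# per-thread RSS is passed in hundredths of a MiB to stay in integer math.
# usage: reader_rss_mib <threads> <per_thread_centi_mib>
reader_rss_mib() {
  local threads=$1 centi_mib=$2
  echo $(( threads * centi_mib / 100 ))
}

stack_virtual_gb 2750 16   # prints: 44  (zig default stack)
stack_virtual_gb 2750 8    # prints: 22  (current setting)
stack_virtual_gb 2750 4    # prints: 11  (previous setting)

reader_rss_mib 2750 390    # prints: 10725  (~3.9 MiB per reader, before the pool)
reader_rss_mib 2750 45     # prints: 1237   (~0.45 MiB per reader, after the pool)
```

the first group is address space only. the second group is what actually counts against the memory limit, which is why doubling the stack size from 4 MB to 8 MB did not double RSS.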
+15 -14
docs/design.md
···

  | thread type | count at ~2,750 PDS | stack size | responsibility |
  |---|---|---|---|
- | subscriber readers | ~2,750 | 4 MB | one per host — lightweight WebSocket read loop, header decode, cursor tracking, rate limiting, submit to pool |
- | frame pool workers | 16 (env: `FRAME_WORKERS`) | 4 MB | CBOR decode, validation, DB persist, broadcast |
- | resolver threads | 4–8 (env: `RESOLVER_THREADS`) | 4 MB | DID document resolution, signing key extraction, cache population |
- | consumer write threads | 1 per downstream consumer | 4 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
- | flush thread | 1 | 4 MB | batched fsync of event log (100ms or 400 events) |
- | GC thread | 1 | 4 MB | event log file cleanup every 10 minutes |
- | crawl queue thread | 1 | 4 MB | process `requestCrawl` — validate hostname, describeServer, spawn worker |
- | metrics server | 1 | 4 MB | HTTP on internal port, prometheus scrape |
+ | subscriber readers | ~2,750 | 8 MB | one per host — lightweight WebSocket read loop, header decode, cursor tracking, rate limiting, submit to pool |
+ | frame pool workers | 16 (env: `FRAME_WORKERS`) | 8 MB | CBOR decode, validation, DB persist, broadcast |
+ | resolver threads | 4–8 (env: `RESOLVER_THREADS`) | 8 MB | DID document resolution, signing key extraction, cache population |
+ | consumer write threads | 1 per downstream consumer | 8 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
+ | flush thread | 1 | 8 MB | batched fsync of event log (100ms or 400 events) |
+ | GC thread | 1 | 8 MB | event log file cleanup every 10 minutes |
+ | crawl queue thread | 1 | 8 MB | process `requestCrawl` — validate hostname, describeServer, spawn worker |
+ | metrics server | 1 | 8 MB | HTTP on internal port, prometheus scrape |
  | main thread | 1 | default | signal handling, shutdown coordination |

  total: ~2,770 + consumers. subscriber reader threads run blocking WebSocket
···
  pool worker for full processing. this double decode costs ~1–2μs per frame,
  far cheaper than serializing parsed state across threads.

- the 4 MB stack size (vs zig's 16 MB default) is the key to fitting ~3K threads
- in memory. reader thread actual stack usage is well below 4 MB since the
- deepest paths (crypto, CBOR, DB) now run on pool workers.
+ the 8 MB stack size (vs zig's 16 MB default) supports ReleaseSafe's TLS
+ handshake path (~134 KiB peak stack from `tls.Client.init` + `KeyShare.init`).
+ only touched pages count as RSS — reader threads use ~0.45 MiB RSS each since
+ the deepest paths (crypto, CBOR, DB) now run on pool workers.

  ## memory model

···
  ## scaling limits

  current deployment: ~2,780 PDS hosts, running on a 32 GB / 16 CPU node.
- steady-state memory: ~2.7 GiB (with frame pool). postgres alongside
+ steady-state memory: ~1.1 GiB (ReleaseSafe, with frame pool). postgres alongside
  at ~240 MiB. resource limits: 8 GiB memory, 1 GiB request, 1000m CPU.

  | component | current (~2,750 PDS) | at 10x (~27,500 PDS) | status |
  |---|---|---|---|
- | thread stacks | ~11 GB virtual (2,750 × 4 MB) | ~110 GB virtual | **breaks** — exceeds 32 GB node. reader RSS is lower since heavy work moved to pool workers |
+ | thread stacks | ~22 GB virtual (2,750 × 8 MB), ~1.2 GiB RSS | ~220 GB virtual | **breaks** — virtual exceeds 32 GB node, but RSS scales sublinearly (~0.45 MiB/thread) |
  | pg pool | 5 connections (hardcoded) | 5 connections | **breaks** — saturates under concurrent UID lookups |
  | resolver queue | `ArrayList` + dedupe set | same | **ok** — dedupe prevents unbounded growth from duplicate DIDs |
  | validator cache | 250K entries, ~19 MB | same (capped) | **degrades** — miss rate climbs with more unique DIDs |
···
  | RocksDB | manageable write rate | ~1.4M writes/sec projected | **needs** compaction tuning |
  | event log | buffered, 100ms flush | fine — sequential I/O | ok |
  | kernel threads | ~2,800 (below 30K default) | ~28,000 (near default max) | **breaks** without `sysctl` tuning |
- | RSS | ~1.2 GiB | ~5–8 GiB projected (shared CA bundle, malloc overhead scales sublinearly) | ok — fits 32 GB node |
+ | RSS | ~1.1 GiB (ReleaseSafe) | ~4–6 GiB projected | ok — fits 32 GB node |

  ### what breaks first

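
the kernel threads row in the scaling table above compares a thread count against a system limit. a minimal sketch of that check follows. `threads_fit` is a hypothetical helper, the 30000 figure is the table's own "30K default", and reading `/proc/sys/kernel/threads-max` is an assumption about which limit the table means (the docs only say `sysctl` tuning).

```bash
#!/usr/bin/env bash
# Succeeds when the needed thread count is below the limit.
# Without an explicit limit, reads the Linux system-wide cap
# from /proc/sys/kernel/threads-max (assumed to be the relevant sysctl).
# usage: threads_fit <needed_threads> [limit]
threads_fit() {
  local needed=$1
  local limit=${2:-$(cat /proc/sys/kernel/threads-max)}
  [ "$needed" -lt "$limit" ]
}

threads_fit 2800 30000  && echo "current load fits"
threads_fit 28000 30000 && echo "10x fits, with little headroom"
threads_fit 30001 30000 || echo "over the limit"
```

on the relay node itself, calling `threads_fit 28000` with no second argument would compare against the live kernel setting instead of the table's figure.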
+2 -2
docs/incident-2026-03-04.md
···

  ## current state (2026-03-05)

- - **optimize**: Debug (no optimization) — preparing ReleaseSafe deploy
- - **thread stacks**: 8 MiB (bumped from 4 MiB to support ReleaseSafe handshake path)
+ - **optimize**: ReleaseSafe — deployed successfully, ~1.1 GiB RSS at ~2,255 hosts
+ - **thread stacks**: 8 MiB (supports ReleaseSafe handshake path; only touched pages count as RSS)
  - **probes**: liveness `/_healthz:3000`, readiness `/_readyz:3000` (helm synced)
  - **helm**: values applied via `helm upgrade`, no longer drifted from kubectl patches
  - **reconnect cronjob**: deployed — reconciles PDS host list every 4 hours