prepare ReleaseSafe deploy: 8 MiB stacks, doc updates

+6 -4

README.md

    ···

     - **inline collection index** — indexes `(DID, collection)` pairs in the event processing pipeline using RocksDB. serves `listReposByCollection` from the relay process — no sidecar. the index design draws on [fig](https://tangled.org/microcosm.blue)'s work on [lightrail](https://tangled.org/microcosm.blue/lightrail).

    -- **one OS thread per PDS** — predictable memory, no garbage collector. thread stacks are set to 2 MB (zig's default is 16 MB).
    +- **reader thread per PDS + frame processing pool** — each PDS gets a lightweight reader thread (cursor tracking, rate limiting, header decode). heavy work (full CBOR decode, validation, DB persist, broadcast) runs on a shared pool of frame workers (configurable, default 16). thread stacks are 4 MB (zig's default is 16 MB).

     ## spec compliance

    ···
     requires zig 0.15 and a C/C++ toolchain (for RocksDB).

     ```bash
    -zig build # build
    +zig build # build (debug — current production default)
     zig build test # run tests
    -zig build -Doptimize=ReleaseSafe # release build
    +zig build -Doptimize=ReleaseSafe # release build (not used in prod yet, see docs/incident-2026-03-04.md)
     ```

     ## configuration
    ···
     | `DATABASE_URL` | — | PostgreSQL connection string |
     | `RELAY_ADMIN_PASSWORD` | — | bearer token for admin endpoints |
     | `RESOLVER_THREADS` | `4` | background DID resolution threads |
    -| `VALIDATOR_CACHE_SIZE` | `500000` | max cached signing keys before eviction |
    +| `FRAME_WORKERS` | `16` | frame processing pool worker count |
    +| `FRAME_QUEUE_CAPACITY` | `4096` | max queued frames before backpressure |
    +| `VALIDATOR_CACHE_SIZE` | `250000` | max cached signing keys before eviction |

     see [docs/deployment.md](docs/deployment.md) for production deployment and [docs/backfill.md](docs/backfill.md) for collection index backfill.

+6 -6

docs/deployment.md

    ···
     this SSHs into the server and:

     1. `git pull --ff-only` in `/opt/zlay`
    -2. `zig build -Doptimize=ReleaseSafe -Dtarget=x86_64-linux-gnu` — native x86_64 build
    +2. `zig build -Dtarget=x86_64-linux-gnu` — debug build (no `-Doptimize`). see [incident-2026-03-04.md](incident-2026-03-04.md) for why ReleaseSafe is not used.
     3. `buildah bud -t atcr.io/zzstoatzz.io/zlay:latest -f Dockerfile.runtime .` — thin runtime image
     4. pushes to k3s containerd via `buildah push` → `ctr images import`
    -5. `kubectl rollout restart deployment/zlay -n zlay`
    +5. `kubectl set image deployment/zlay -n zlay main=<sha-tagged-image>` + `kubectl rollout status`

     the runtime image (`Dockerfile.runtime`) is minimal: debian bookworm-slim + ca-certificates + the binary.

    ···

     - `-Dtarget=x86_64-linux-gnu` — **must use glibc**, not musl. zig 0.15's C++ codegen for musl produces illegal instructions in RocksDB's LRU cache.
     - `-Dcpu=baseline` — required when building inside Docker/QEMU (not needed for `zlay-publish-remote` since it builds natively).
    -- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on.
    +- `-Doptimize=ReleaseSafe` — safety checks on, optimizations on. **currently not used in production** — ReleaseSafe inflates per-thread RSS ~10x (see [incident-2026-03-04.md](incident-2026-03-04.md)). the frame pool reduces per-reader-thread work, opening a path back to ReleaseSafe.

     ## initial setup

    ···

     **shared TLS CA bundle.** the biggest single win. websocket.zig's TLS client calls `Bundle.rescan()` per connection, loading the system CA certificates into a per-connection arena. with ~2,750 PDS connections, that's ~2,750 copies of the CA bundle in memory (~800 KB each = ~2.2 GiB). fix: load the bundle once in the slurper, pass it to all subscribers via `config.ca_bundle`. memory dropped from ~3.3 GiB to ~1.2 GiB (~65% reduction).

    -**thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most threads just read websockets and decode CBOR — 2 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 2 * 1024 * 1024 }`. the constant is defined in `main.zig` as `default_stack_size` for the threads spawned there; other modules use the literal directly.
    +**thread stack sizes.** zig's default thread stack is 16 MB. with ~2,750 subscriber threads that maps 44 GB of virtual memory. most reader threads just read websockets and submit frames to the processing pool — 4 MB is generous. all `Thread.spawn` calls now pass `.{ .stack_size = 4 * 1024 * 1024 }`. the constant is defined in `main.zig` as `default_stack_size` for the threads spawned there; other modules use the literal directly.

     **c_allocator instead of GeneralPurposeAllocator.** GPA is a debug allocator — it tracks per-allocation metadata and never returns freed small allocations to the OS. since zlay links glibc (`build.zig:42`), `std.heap.c_allocator` gives us glibc malloc with per-thread arenas, madvise-based page return, and production-grade fragmentation mitigation.

    ···

     | metric | value |
     |--------|-------|
    -| memory | ~1.2 GiB steady state (~2,750 hosts) |
    +| memory | ~2.7 GiB steady state (~2,750 hosts, with frame pool) |
     | CPU | ~1.5 cores peak |
     | requests | 1 GiB memory, 1000m CPU |
    -| limits | 3 GiB memory |
    +| limits | 8 GiB memory |
     | PVC | 20 GiB (events + RocksDB collection index) |
     | postgres | ~238 MiB |

+51 -45

docs/design.md

    ···
     PDS instances (N hosts)
       │
       ▼
    -Subscriber (one OS thread per host)
    -  │  decodes CBOR frames, tracks cursor, rate-limits per host
    -  │  resolves DID → numeric UID via postgres
    -  ▼
    -Validator
    -  │  cache lookup: DID → signing key (secp256k1 / p256)
    -  │  cache hit → verify commit signature (zat SDK)
    -  │  cache miss → skip, queue background resolution
    -  ▼
    -DiskPersist
    -  │  append to event log (28-byte LE header + CBOR payload)
    -  │  assign relay sequence number (monotonic, relay-scoped)
    -  │  write postgres metadata (account state, host cursor)
    +Subscriber (one reader thread per host)
    +  │  header decode, cursor tracking, rate limiting
    +  │  submits raw frame to processing pool
       ▼
    -Broadcaster
    -  │  resequence frame with relay seq
    -  │  fan out to all connected consumers via SharedFrame (ref-counted)
    -  │  per-consumer ring buffer (8,192 frames) + write thread
    +Frame Pool (configurable workers, default 16)
    +  │  full CBOR decode, DID → UID resolution via postgres
    +  │    ▼
    +  │  Validator
    +  │    │  cache lookup: DID → signing key (secp256k1 / p256)
    +  │    │  cache hit → verify commit signature (zat SDK)
    +  │    │  cache miss → skip, queue background resolution
    +  │    ▼
    +  │  DiskPersist
    +  │    │  append to event log (28-byte LE header + CBOR payload)
    +  │    │  assign relay sequence number (monotonic, relay-scoped)
    +  │    │  write postgres metadata (account state, host cursor)
    +  │    ▼
    +  │  Broadcaster
    +  │       resequence frame with relay seq
    +  │       fan out to all connected consumers via SharedFrame (ref-counted)
    +  │       per-consumer ring buffer (8,192 frames) + write thread
       ▼
     Downstream consumers (WebSocket)
     ```
    ···

     | thread type | count at ~2,750 PDS | stack size | responsibility |
     |---|---|---|---|
    -| subscriber workers | ~2,750 | 2 MB | one per host — WebSocket read loop, CBOR decode, validation call |
    -| resolver threads | 4–8 (env: `RESOLVER_THREADS`) | 2 MB | DID document resolution, signing key extraction, cache population |
    -| consumer write threads | 1 per downstream consumer | 2 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
    -| flush thread | 1 | 2 MB | batched fsync of event log (100ms or 400 events) |
    -| GC thread | 1 | 2 MB | event log file cleanup every 10 minutes |
    -| crawl queue thread | 1 | 2 MB | process `requestCrawl` — validate hostname, describeServer, spawn worker |
    -| metrics server | 1 | 2 MB | HTTP on internal port, prometheus scrape |
    +| subscriber readers | ~2,750 | 4 MB | one per host — lightweight WebSocket read loop, header decode, cursor tracking, rate limiting, submit to pool |
    +| frame pool workers | 16 (env: `FRAME_WORKERS`) | 4 MB | CBOR decode, validation, DB persist, broadcast |
    +| resolver threads | 4–8 (env: `RESOLVER_THREADS`) | 4 MB | DID document resolution, signing key extraction, cache population |
    +| consumer write threads | 1 per downstream consumer | 4 MB | drain ring buffer → WebSocket write, ping/pong keepalive |
    +| flush thread | 1 | 4 MB | batched fsync of event log (100ms or 400 events) |
    +| GC thread | 1 | 4 MB | event log file cleanup every 10 minutes |
    +| crawl queue thread | 1 | 4 MB | process `requestCrawl` — validate hostname, describeServer, spawn worker |
    +| metrics server | 1 | 4 MB | HTTP on internal port, prometheus scrape |
     | main thread | 1 | default | signal handling, shutdown coordination |

    -total: ~2,760 + consumers. each subscriber thread runs a blocking WebSocket
    -read loop — simple, no async runtime, no event loop. this works because each
    -thread does minimal work per frame (CBOR decode + optional signature verify)
    -and spends most of its time blocked in `recv()`.
    +total: ~2,770 + consumers. subscriber reader threads run blocking WebSocket
    +read loops — lightweight, no async runtime, no event loop. each reader does
    +minimal work per frame (header decode, cursor update, rate limit check) and
    +spends most time blocked in `recv()`. heavy processing (full CBOR decode, CAR
    +parse, ECDSA verify, DB persist, broadcast) runs on pool workers. the header
    +is intentionally decoded twice — once by the reader for routing, once by the
    +pool worker for full processing. this double decode costs ~1–2μs per frame,
    +far cheaper than serializing parsed state across threads.

    -the 2 MB stack size (vs zig's 16 MB default) is the key to fitting ~3K threads
    -in memory. actual stack usage is far below 2 MB — the deepest path is CBOR
    -decode → CAR parse → ECDSA verify, which uses ~50 KB of stack at peak.
    +the 4 MB stack size (vs zig's 16 MB default) is the key to fitting ~3K threads
    +in memory. reader thread actual stack usage is well below 4 MB since the
    +deepest paths (crypto, CBOR, DB) now run on pool workers.

     ## memory model

    ···

     **validator cache**: `StringHashMap(CachedKey)` — DID string → 75-byte
     fixed-size struct (key type + 33-byte compressed pubkey + resolve timestamp).
    -capped at 500K entries (env: `VALIDATOR_CACHE_SIZE`), LRU-ish eviction of
    +capped at 250K entries (env: `VALIDATOR_CACHE_SIZE`), LRU-ish eviction of
     oldest 10% when full. ~37 MB at capacity. the resolve queue uses a
     `StringHashMapUnmanaged(void)` as a dedupe set to prevent the same DID from
     being queued multiple times. migration checks are interleaved with DID
    ···
     ## scaling limits

     current deployment: ~2,780 PDS hosts, running on a 32 GB / 16 CPU node.
    -steady-state memory: ~1.2 GiB (after shared CA bundle fix). postgres alongside
    -at ~240 MiB. resource limits: 3 GiB memory, 1 GiB request, 1000m CPU.
    +steady-state memory: ~2.7 GiB (with frame pool). postgres alongside
    +at ~240 MiB. resource limits: 8 GiB memory, 1 GiB request, 1000m CPU.

     | component | current (~2,750 PDS) | at 10x (~27,500 PDS) | status |
     |---|---|---|---|
    -| thread stacks | ~5.5 GB virtual (2,750 × 2 MB) | ~55 GB virtual | **breaks** — exceeds 32 GB node |
    +| thread stacks | ~11 GB virtual (2,750 × 4 MB) | ~110 GB virtual | **breaks** — exceeds 32 GB node. reader RSS is lower since heavy work moved to pool workers |
     | pg pool | 5 connections (hardcoded) | 5 connections | **breaks** — saturates under concurrent UID lookups |
     | resolver queue | `ArrayList` + dedupe set | same | **ok** — dedupe prevents unbounded growth from duplicate DIDs |
    -| validator cache | 500K entries, ~37 MB | same (capped) | **degrades** — miss rate climbs with more unique DIDs |
    -| broadcaster | O(n consumers) under mutex | same | **risk** — lock contention at high consumer count |
    +| validator cache | 250K entries, ~19 MB | same (capped) | **degrades** — miss rate climbs with more unique DIDs |
    +| broadcaster | O(n consumers) under mutex, contention from 16 pool workers (not ~2,750 threads) | same | **improved** — contention reduced from ~2,750 threads to N workers |
     | RocksDB | manageable write rate | ~1.4M writes/sec projected | **needs** compaction tuning |
     | event log | buffered, 100ms flush | fine — sequential I/O | ok |
     | kernel threads | ~2,800 (below 30K default) | ~28,000 (near default max) | **breaks** without `sysctl` tuning |
    ···
        at 10x, queue contention becomes the bottleneck — every frame touches
        `uidForDidFromHost`.

    -3. **validator cache miss rate**: 500K cache with 60M+ DIDs means ~99% miss
    +3. **validator cache miss rate**: 250K cache with 60M+ DIDs means ~99% miss
        rate for cold starts. resolver threads (4–8) can't keep up with the
        resolution queue at 10x ingest rate.

    ···

     ### near-term (no architecture change)
     - expose `PG_POOL_SIZE` env var, increase from 5 to 20–50
    -- expose `VALIDATOR_CACHE_SIZE`, increase to 2M+ (costs ~150 MB)
    +- increase `VALIDATOR_CACHE_SIZE` from 250K to 2M+ (costs ~150 MB)
     - tune `RESOLVER_THREADS` to 16–32 for higher resolution throughput
     - `sysctl kernel.threads-max=65536` on deploy node

    -### mid-term (thread pool)
    -- replace one-thread-per-host with a thread pool of N workers (N = CPU cores × 2)
    -- each worker runs an epoll/kqueue loop over multiple host connections
    -- subscriber becomes a state machine: connect → read → decode → validate → persist
    -- reduces thread count from O(hosts) to O(cores), eliminates the stack memory wall
    -- websocket.zig would need a non-blocking client mode or replacement
    +### mid-term (thread pool — **done**, commit f0c7baf)
    +- ~~replace one-thread-per-host with a thread pool of N workers~~ — **done**: reader threads (one per PDS) submit raw frames to a shared processing pool (default 16 workers). heavy work (CBOR decode, validation, persist, broadcast) runs on pool workers.
    +- thread count is still O(hosts) for readers, but reader threads are lightweight — no crypto, no DB, no broadcast contention.
    +- remaining: IO multiplexing to reduce reader thread count from O(hosts) to O(cores)

     ### long-term (async I/O)
     - zig 0.16 introduces `Io` (io_uring on linux, kqueue on darwin)
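
the `FRAME_QUEUE_CAPACITY` setting in this commit caps queued frames "before backpressure", but the diff does not show how the pool applies it. what follows is a minimal sketch of one possible approach, assuming backpressure means a reader blocks while the queue is full. `BoundedQueue`, `worker`, the `u32` frame type, the capacity of 4, and the zero stop sentinel are all hypothetical and not taken from the relay's source.

```zig
const std = @import("std");

/// hypothetical fixed-capacity ring queue; not the relay's actual type.
fn BoundedQueue(comptime T: type, comptime capacity: usize) type {
    return struct {
        const Self = @This();

        items: [capacity]T = undefined,
        head: usize = 0,
        len: usize = 0,
        mutex: std.Thread.Mutex = .{},
        not_full: std.Thread.Condition = .{},
        not_empty: std.Thread.Condition = .{},

        /// blocks the caller (a reader thread) while the queue is full.
        /// this is the assumed form of backpressure.
        pub fn push(self: *Self, item: T) void {
            self.mutex.lock();
            defer self.mutex.unlock();
            while (self.len == capacity) self.not_full.wait(&self.mutex);
            self.items[(self.head + self.len) % capacity] = item;
            self.len += 1;
            self.not_empty.signal();
        }

        /// blocks the caller (a pool worker) while the queue is empty.
        pub fn pop(self: *Self) T {
            self.mutex.lock();
            defer self.mutex.unlock();
            while (self.len == 0) self.not_empty.wait(&self.mutex);
            const item = self.items[self.head];
            self.head = (self.head + 1) % capacity;
            self.len -= 1;
            self.not_full.signal();
            return item;
        }
    };
}

const Queue = BoundedQueue(u32, 4);

/// hypothetical pool worker: drains frames until it sees the 0 sentinel.
fn worker(q: *Queue, sum: *std.atomic.Value(u32)) void {
    while (true) {
        const frame = q.pop();
        if (frame == 0) break;
        _ = sum.fetchAdd(frame, .monotonic);
    }
}

pub fn main() !void {
    var q: Queue = .{};
    var sum = std.atomic.Value(u32).init(0);
    const t = try std.Thread.spawn(.{}, worker, .{ &q, &sum });

    // the "reader" side: pushes block whenever 4 frames are already queued
    var i: u32 = 1;
    while (i <= 10) : (i += 1) q.push(i);
    q.push(0); // stop sentinel
    t.join();

    std.debug.print("sum of processed frames = {d}\n", .{sum.load(.monotonic)});
}
```

a blocking push keeps memory bounded at the cost of stalling the reader's `recv()` loop; dropping frames or disconnecting a slow host would be alternative policies, and the diff does not say which one the relay uses.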

+51 -3

docs/incident-2026-03-04.md

    ···

     ## current state (2026-03-05)

    -- **optimize**: Debug (no optimization)
    -- **thread stacks**: 4 MiB (reduced from 8 MiB — 8 MiB caused OOM at ~2,500 threads)
    +- **optimize**: Debug (no optimization) — preparing ReleaseSafe deploy
    +- **thread stacks**: 8 MiB (bumped from 4 MiB to support ReleaseSafe handshake path)
     - **probes**: liveness `/_healthz:3000`, readiness `/_readyz:3000` (helm synced)
     - **helm**: values applied via `helm upgrade`, no longer drifted from kubectl patches
     - **reconnect cronjob**: deployed — reconciles PDS host list every 4 hours
     - **grafana dashboard**: configmap synced from `deploy/zlay-dashboard.json`
     - **metrics**: working — RSS, threads, disk, counters, caches all emitting

    +## ReleaseSafe readiness analysis (2026-03-05)
    +
    +two changes since the incident should fix the per-thread RSS problem:
    +1. **thread pool** (f0c7baf) — reader threads no longer do validation, DB persist, or broadcast
    +2. **migration queue dedup** (37fa194) — eliminated unbounded heap growth
    +
    +### symbol-level evidence
    +
    +cross-compiled x86_64-linux-gnu binaries, compared via `nm --size-sort` and `objdump`:
    +
    +**binary sizes**: Debug 313M, ReleaseSafe 143M, ReleaseSmall 8M
    +
    +**reader thread stack frames (ReleaseSafe)** — the per-frame hot path:
    +
    +| function | stack frame |
    +|---|---|
    +| `proto.Reader.read` | 360 bytes |
    +| `Io.Reader.fillUnbuffered` | 48 bytes |
    +| `tls.Client.readVec` | 24 bytes |
    +| `tls.Client.readIndirect` | 2,264 bytes |
    +| **total read chain** | **~2.7 KiB** |
    +
    +**TLS handshake (once per connection, not per frame)**:
    +
    +| function | stack frame |
    +|---|---|
    +| `tls.Client.init` | 84 KiB (0x15058) |
    +| `tls.Client.KeyShare.init` | 48 KiB (0xc208) |
    +| **total handshake** | **~134 KiB** |
    +
    +the largest static stack frame in ReleaseSafe outside of RocksDB C++ init is 4,072 bytes
    +(`crypto.Certificate.rsa.encrypt`). all RocksDB frames (53 KiB, 41 KiB) are in C++ global
    +init, not on reader threads.
    +
    +### why 3.9 MiB/thread should not recur
    +
    +the old architecture had reader threads doing: TLS read → CBOR decode → ECDSA verify →
    +DB persist → broadcast. each step involved heap allocations and deep call stacks.
    +
    +the new architecture: TLS read → CBOR header peek → dupe → queue to pool. the pool's 16
    +workers do all the heavy lifting on shared threads, not per-PDS threads.
    +
    +### ReleaseSmall as fallback
    +
    +ReleaseSmall produces an 8M binary (vs 143M ReleaseSafe) by avoiding inlining entirely.
    +however, it **disables runtime safety checks** — same risk category as ReleaseFast, which
    +has a known double-free. only use if ReleaseSafe RSS is still too high.
    +
     ## notable commits

     see `git log` for full history. key milestones:
    ···

     ### investigate (not in prod)
     1. figure out why `VmHWM`/`RssAnon`/`mallinfo` metrics silently fail
    -2. **thread pool architecture** — multiplex connections on fewer threads (e.g. 64 threads, async I/O) to eliminate the 2,750-thread RSS problem. this is the real fix for ReleaseSafe.
    +2. **thread pool architecture** — **done** (commit f0c7baf). reader threads (one per PDS) are now lightweight; heavy work (CBOR decode, validation, persist, broadcast) runs on a shared frame pool (default 16 workers). this doesn't reduce thread count but enables a path to ReleaseSafe — reader stacks no longer need space for crypto/DB call chains.
     3. investigate the double-free in ReleaseFast — where is the consumer lifecycle bug?
     4. try `noinline` on hot TLS/crypto paths to reduce ReleaseSafe per-thread RSS
     5. consider filing a zig issue about `tls.Client.init` stack usage in ReleaseSafe

+4 -3

src/main.zig

    ···
     const log = std.log.scoped(.relay);

     /// zig's default thread stack is 16 MB. with ~2,750 subscriber threads that's
    -/// 44 GB of virtual memory. 4 MB is 2x the proven debug floor (2 MB worked,
    -/// 1 MB overflowed). ReleaseSafe needs 8 MB — but we only run debug builds.
    -pub const default_stack_size = 4 * 1024 * 1024;
    +/// 44 GB of virtual memory. 8 MB supports ReleaseSafe — tls.Client.init alone
    +/// needs ~134 KiB of stack, and deep call chains under inline-else cipher
    +/// dispatch need headroom. only touched pages count as RSS.
    +pub const default_stack_size = 8 * 1024 * 1024;

     var shutdown_flag: std.atomic.Value(bool) = .{ .raw = false };
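
a minimal sketch of how a constant like `default_stack_size` might be passed to `std.Thread.spawn`, using the `.{ .stack_size = ... }` form that docs/deployment.md describes. `reader` and `runReaders` are hypothetical helpers for illustration, not functions from the relay, and the thread count of 4 is arbitrary.

```zig
const std = @import("std");

// same value as `default_stack_size` in the src/main.zig change
const stack_size: usize = 8 * 1024 * 1024;

// hypothetical stand-in for a per-PDS reader loop: just records that it ran
fn reader(counter: *std.atomic.Value(u32)) void {
    _ = counter.fetchAdd(1, .monotonic);
}

// spawns `n` threads with an explicit stack size, joins them all,
// and returns how many ran.
fn runReaders(comptime n: usize) !u32 {
    var counter = std.atomic.Value(u32).init(0);
    var threads: [n]std.Thread = undefined;
    var spawned: usize = 0;

    // if a later spawn fails, still join the threads already started
    errdefer for (threads[0..spawned]) |t| t.join();

    while (spawned < n) : (spawned += 1) {
        threads[spawned] = try std.Thread.spawn(
            .{ .stack_size = stack_size },
            reader,
            .{&counter},
        );
    }

    for (threads) |t| t.join();
    return counter.load(.monotonic);
}

pub fn main() !void {
    const ran = try runReaders(4);
    std.debug.print("{d} threads ran with {d} MiB stacks\n", .{
        ran,
        stack_size / (1024 * 1024),
    });
}
```

the stack size only sets the virtual reservation per thread. as the doc comment in the change notes, only touched pages count as RSS, so a larger reservation raises virtual memory rather than resident memory for threads that stay shallow.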
