add NOTES.md: 0.16 migration status, crashes, known issues

+221

1 changed file

NOTES.md

# zlay 0.16 migration — status and known issues

last updated: 2026-04-02
stable production build: `a931853` (zig 0.15)
latest 0.16 build: `0f11cfc` (not yet deployed successfully)

## what happened

zlay is an AT Protocol relay (~8,400 LOC, 15 source files). it crawls ~2,750
PDS instances, validates frames, persists to disk, and fans out to downstream
consumers over WebSocket. the zig 0.16 migration rewrote all I/O to use
`std.Io` primitives.

three separate crashes were found and fixed during deployment attempts:

### crash 1: SIGSEGV on startup (fixed in `f996812`)

**symptom**: exit code 139 immediately after printing "io backend: Evented"

**cause**: the build auto-selected `Io.Evented` (io_uring on linux). evented
futex wait/wake access fiber-local `Thread.current()` state. but zlay's frame
pool (`thread_pool.zig`) spawns plain `std.Thread` workers — they don't have
fiber state. any `Io.Mutex` operation from a plain thread segfaulted.

**fix**: force `Backend = Io.Threaded` in `src/main.zig:55`. threaded futex
uses direct kernel syscalls that work from any execution context. all 0.16 API
benefits are preserved; `io.concurrent` under Threaded spawns real OS threads.

**future path**: either migrate frame workers to `io.concurrent`, or give
cross-boundary modules a dedicated `Threaded` sync_io for mutex operations.
then Evented can be re-enabled for the network I/O layer.

### crash 2: pool acquire panic under load (fixed in `e5ed0d1`)

**symptom**: `unreachable` panic in `Io.Event.waitTimeout` after 30-60s of
processing. stack: `event_log.zig uidForDid → pg pool.zig acquire → Io.zig`

**cause**: `pg.zig` pool used `Io.Event` + `reset()` as a connection-available
signal. `Io.Event.reset()` has a stdlib invariant (Io.zig:1857): **assumes no
pending call to wait**.
with 16 frame workers contending for 5 connections:
`set()` wakes all waiters, one calls `reset()`, others hit `unreachable`.

**fix (pg.zig `5ce2355` on dev branch)**: replaced `Io.Event` with a monotonic
`u32` futex counter. `release()` increments counter + `futexWake(1)` (wake one).
`acquire()` snapshots counter under mutex + `futexWaitTimeout()` with snapshot.
no `reset()`, no single-waiter constraint. also fixed a pre-existing double-lock
race in acquire and made zlay's pool size configurable via `DB_POOL_SIZE` env
(default 20, was hardcoded 5).

### crash 3: GPF in websocket client write (fixed in `0f11cfc`)

**symptom**: general protection exception in `memcpy → Writer.zig → writeAll →
websocket client.zig writeFrame`. happens after 30-60s of normal processing.

**cause**: the websocket `Client` struct had no write serialization. three
concurrent writers on the same client:

1. `pingLoop` — `client.writeFrame(.ping, ...)` every 30s (`subscriber.zig:378`)
2. `readLoop` auto-pong — `self.writeFrame(.pong, ...)` on upstream ping
   (inside the library, `client.zig:272`)
3. close path — `client.writeFrame(.close, ...)` on failure (`subscriber.zig:383`)

interleaved frame headers/payloads corrupt the shared stream/TLS writer state.
the server-side `Conn` already had `lock: Io.Mutex` around all writes; the
client was missing it.

**fix (websocket.zig `0261b7d`)**: added `_write_lock: Io.Mutex` to `Client`,
acquired in `writeFrame()` around the two `writeAll` calls. header construction
and payload masking happen outside the lock.

since zat also depends on websocket.zig, a new zat alpha (`v0.3.0-alpha.9`) was
cut to resolve the diamond dependency.

## known issue: health probes on port 3000 return 400

**not yet fixed.** this will cause k8s rollout to fail even if the relay itself
is healthy.

### what's happening

zlay serves the firehose WebSocket + HTTP API on port 3000 via the websocket
library's `Server`. plain HTTP requests (health probes, XRPC endpoints) are
supposed to reach zlay through an `httpFallback` mechanism:

```
main.zig:275 → bc.http_fallback = api.handleHttpRequest
broadcaster.Handler.httpFallback() delegates to it
```

but the websocket server never calls `httpFallback`. when a non-upgrade HTTP
request arrives:

1. `server.zig:1738` — `Handshake.parse()` fails (no websocket headers)
2. `server.zig:1738` — `respondToHandshakeError()` sends `400 missingheaders`
3. connection closed. the app handler never sees the request.

the server code (`server.zig`) has **no code path** that detects "valid HTTP
but not a websocket upgrade" and dispatches to `Handler.httpFallback()`. the
method signature exists on the Handler, the router is fully implemented
(`api/router.zig`), but the wiring inside the websocket library is missing.

this worked on zig 0.15 — the old websocket library version had this path. it
was lost during the 0.16 fork migration.

### workaround options

1. **switch k8s probes to port 3001** — the MetricsServer (`main.zig:67-142`)
   serves `/_healthz` and `/_readyz` directly via `std.http.Server`, no
   websocket library involved. the handlers are identical to the router ones.
   this is the fastest path to a working deploy.

2. **fix the websocket server** — add a code path in `server.zig` between
   handshake parse failure and error response that checks for valid HTTP,
   parses method/url/headers/body, and calls `H.httpFallback()` if it exists
   (using `comptime std.meta.hasFn`). this is the correct long-term fix.

the probes were on port 3000 since initial deploy (commit `e111e47` split them
to `/_healthz` / `/_readyz`).
they worked fine on 0.15. the 400 is a 0.16
regression in the websocket fork.

## where things live

### repos and their roles

| repo | location | what |
|---|---|---|
| **zlay** | `tangled.org:zzstoatzz.io/zlay` | the relay. this repo. |
| **zat** | `tangled.org:zat.dev/zat` | AT Protocol primitives (CBOR, CAR, firehose, DID, etc.) |
| **websocket.zig** | `github.com/zzstoatzz/websocket.zig` | fork of karlseguin/websocket.zig |
| **pg.zig** | `github.com/zzstoatzz/pg.zig` (dev branch) | fork of karlseguin/pg.zig |
| **rocksdb-zig** | `github.com/zzstoatzz/rocksdb-zig` | RocksDB bindings |
| **relay ops** | `tangled.org:zzstoatzz.io/relay` | helm chart, terraform, justfile |

### zlay source map

| file | LOC | role |
|---|---|---|
| `main.zig` | 404 | Io bootstrap, signal handlers, GC loop, MetricsServer |
| `subscriber.zig` | ~1021 | per-host PDS connection: connect, readLoop, pingLoop, backoff |
| `slurper.zig` | ~663 | multi-host crawl manager: host discovery, worker lifecycle |
| `broadcaster.zig` | ~1319 | downstream consumer management, broadcast, history, stats |
| `frame_worker.zig` | ~311 | per-frame processing: CBOR decode, validate, persist, broadcast |
| `thread_pool.zig` | ~357 | key-partitioned worker pool (plain `std.Thread`, **not** io.concurrent) |
| `event_log.zig` | ~1446 | disk persistence (indigo-compatible) + Postgres index |
| `validator.zig` | ~1002 | DID resolution, signature verification, validation cache |
| `lru.zig` | ~289 | generic LRU cache (used by validator + event_log) |
| `ring_buffer.zig` | ~256 | per-consumer ring buffer for broadcast fanout |
| `collection_index.zig` | ~200 | RocksDB collection index (listReposByCollection) |
| `backfill.zig` | ~396 | collection index backfill from source relay |
| `resync.zig` | ~315 | collection index resync on #sync events |
| `cleaner.zig` | ~114 | stale entry cleanup from collection index |
| `api/router.zig` | 142 | HTTP request dispatch (health, XRPC, admin) |
| `api/xrpc.zig` | — | XRPC endpoint handlers |
| `api/admin.zig` | — | admin endpoint handlers |
| `api/http.zig` | — | HTTP response helpers |

### concurrency architecture

two layers, by design:

**network I/O → `io.concurrent` tasks (Io.Threaded fibers)**
- upstream PDS subscribers (subscriber read loops, ping loops)
- downstream consumer write loops
- DID resolver loops (validator)
- background tasks: backfill, resync, cleaner, GC, metrics server
- slurper coordination

**CPU-bound ordered processing → explicit `std.Thread` workers**
- `thread_pool.zig`: `workers[host_id % N]` ensures per-key FIFO ordering
- `frame_worker.zig`: CBOR decode, DID resolution, signature verify, DB persist
- bounded backpressure: blocking submit when queue full → TCP backpressure

the frame pool workers use `Io.Mutex` / `Io.Condition` for synchronization.
under `Io.Threaded`, these use direct kernel futex syscalls — safe from plain
threads. under `Io.Evented`, they would segfault (see crash 1).
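
the key-partitioned pool described above can be illustrated with a minimal sketch. this is Python, not the Zig in `thread_pool.zig`, and every name in it (`KeyedPool`, `submit`, `close`, `demo`) is hypothetical. it only shows the two properties the notes rely on: `host_id % N` pins a host to one worker so its frames stay in FIFO order, and a bounded queue makes `submit` block when a worker falls behind.

```python
import queue
import threading


class KeyedPool:
    """N workers, one bounded FIFO queue each; a key always maps to the same worker.

    Hypothetical illustration only; not the structure of zlay's thread_pool.zig.
    """

    def __init__(self, workers: int, capacity: int, process):
        self._queues = [queue.Queue(maxsize=capacity) for _ in range(workers)]
        self._threads = [
            threading.Thread(target=self._run, args=(q, process), daemon=True)
            for q in self._queues
        ]
        for t in self._threads:
            t.start()

    def submit(self, host_id: int, frame) -> None:
        # Same host_id -> same queue -> frames for one host are processed in order.
        # put() blocks while that queue is full, which is the backpressure.
        self._queues[host_id % len(self._queues)].put((host_id, frame))

    def close(self) -> None:
        for q in self._queues:
            q.put(None)  # sentinel: worker exits after draining earlier items
        for t in self._threads:
            t.join()

    @staticmethod
    def _run(q, process):
        while True:
            item = q.get()
            if item is None:
                return
            process(*item)


def demo():
    """Submit 100 sequenced frames for each of 10 hosts; return frames seen per host."""
    seen = {}
    lock = threading.Lock()

    def process(host_id, frame):
        with lock:
            seen.setdefault(host_id, []).append(frame)

    pool = KeyedPool(workers=4, capacity=8, process=process)
    for seq in range(100):
        for host_id in range(10):
            pool.submit(host_id, seq)
    pool.close()
    return seen
```

ordering holds per host only; frames from different hosts that land on different workers can complete in any relative order.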

### dependency versions (current `0f11cfc`)

```
zat            v0.3.0-alpha.9   (tangled.org)
websocket.zig  0261b7d          (github, master)
pg.zig         5ce2355          (github, dev branch)
rocksdb-zig    cdef67b          (github)
zig            0.16.0-dev.3059+42e33db9d
```

### key env vars

| var | default | what |
|---|---|---|
| `RELAY_PORT` | 3000 | WebSocket firehose + HTTP API |
| `RELAY_METRICS_PORT` | 3001 | internal metrics + health probes |
| `FRAME_WORKERS` | 16 | CPU worker threads for frame processing |
| `FRAME_QUEUE_CAPACITY` | 4096 | per-worker queue depth |
| `DB_POOL_SIZE` | 20 | Postgres connection pool size (new) |
| `RELAY_UPSTREAM` | bsky.network | seed host for PDS discovery |
| `RELAY_DATA_DIR` | data/events | disk persistence directory |
| `RELAY_RETENTION_HOURS` | 72 | event log retention |
| `RELAY_MAX_EVENTS_GB` | 100 | max disk usage |
| `DATABASE_URL` | postgres://relay:relay@localhost:5432/relay | Postgres connection |

## what needs to happen next

1. **decide on health probes** — either switch k8s probes to port 3001
   (immediate, works today) or fix the websocket server's httpFallback dispatch
   (correct but more work). both are valid. check ops repo at
   `relay/zlay/deploy/zlay-values.yaml` lines 28-45.

2. **deploy `0f11cfc`** — once probe issue is addressed. all three crashes are
   fixed. native build, linux cross-compile, fmt, tests all pass.

3. **monitor after deploy** — compare against 0.15 baseline:
   - thread count (2,903 on 0.15 — should be similar under Threaded)
   - memory (24.9 GiB VmSize, 1.44 GiB RSS on 0.15)
   - throughput, reconnect behavior, ConsumerTooSlow rate
   - verify no new crashes after extended run (hours, not seconds)

4. **follow-up work** (not blocking deploy):
   - fix websocket server httpFallback dispatch for port 3000 HTTP
   - investigate Evented backend viability (frame workers → io.concurrent?)
   - consider upstreaming the client write lock to karlseguin/websocket.zig
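
for the httpFallback follow-up, the missing decision can be sketched roughly as below. this is a Python illustration of the control flow only, not the code in `server.zig`. `parse_request`, `is_websocket_upgrade`, `dispatch`, `RelayHandler`, and the status strings are all hypothetical, and the upgrade check is simplified (it assumes an `Upgrade: websocket` header plus `Sec-WebSocket-Key` is enough to identify an upgrade). `getattr` stands in for the `comptime std.meta.hasFn` check mentioned in the workaround options.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    target: str
    headers: dict = field(default_factory=dict)


def parse_request(raw: bytes):
    """Parse the request line and headers; return None if the bytes are not valid HTTP."""
    head, sep, _body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None
    lines = head.decode("latin-1").split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if not colon:
            return None
        headers[name.strip().lower()] = value.strip()
    return Request(parts[0], parts[1], headers)


def is_websocket_upgrade(req: Request) -> bool:
    # Simplified assumption: these two headers mark an upgrade request.
    return (
        req.headers.get("upgrade", "").lower() == "websocket"
        and "sec-websocket-key" in req.headers
    )


def dispatch(raw: bytes, handler) -> str:
    req = parse_request(raw)
    if req is None:
        return "400 bad request"       # not HTTP at all
    if is_websocket_upgrade(req):
        return "101 upgrade"           # continue with the normal handshake
    # Valid HTTP but not an upgrade: hand it to the app if it opted in.
    fallback = getattr(handler, "http_fallback", None)
    if fallback is None:
        return "400 missing headers"   # the behavior the probes see today
    return fallback(req)


class RelayHandler:
    """Hypothetical app handler that answers the two probe paths."""

    def http_fallback(self, req: Request) -> str:
        if req.target in ("/_healthz", "/_readyz"):
            return "200 ok"
        return "404 not found"
```

with a handler that defines the fallback, a plain `GET /_healthz` reaches the app. without one, the request falls through to the same 400 the probes receive now.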
