atproto relay implementation in zig (zlay.waow.tech)


update NOTES.md: document crashes 6-8, stdlib patches, cross-Io rule

+165 -22
NOTES.md
  # zlay 0.16 migration — status and known issues

- last updated: 2026-04-02
+ last updated: 2026-04-04
  stable production build: `a931853` (zig 0.15)
- latest 0.16 build: `0f11cfc` (not yet deployed successfully)
+ latest 0.16 build: `b433403` (not yet deployed — includes all fixes through fix 9)

  ## what happened

···
  consumers over WebSocket. the zig 0.16 migration rewrote all I/O to use
  `std.Io` primitives.

- three separate crashes were found and fixed during deployment attempts:
+ nine separate crashes/bugs were found and fixed during deployment attempts:

  ### crash 1: SIGSEGV on startup (fixed in `f996812`)

···
  `hasFn` check). other handshake errors (e.g. `InvalidVersion`) still get 400.
  10 tests cover all request patterns.

+ ### crash 6: SIGSEGV — Evented fiber calling Threaded Io.Mutex (fixed in `6674812`)
+
+ **symptom**: SIGSEGV at startup after switching to the `Io.Evented` backend.
+ crash in `Uring.zig` at a `Thread` struct field offset from NULL.
+
+ **cause**: the resyncer was spawned via `io.concurrent()`, which creates an
+ Evented fiber, but resync work calls `DiskPersist` methods that take
+ `self.mutex.lockUncancelable(self.io)` where `self.io` is `pool_io` (Threaded).
+ a Threaded futex wait from an Evented fiber dereferences the `Thread.current()`
+ threadlocal, which is never set on Evented fibers — NULL deref.
+
+ **fix**: run the resyncer on a plain `std.Thread` with `pool_io`, not as an
+ Evented fiber. the thread checks `shutdown_flag` to exit.
+
+ ### fix 7: startup connection ramp throttling (fixed in `f9bf515`)
+
+ **symptom**: event loop starvation during the initial PDS connection burst
+ (~2,750 simultaneous connects).
+
+ **fix**: throttled startup to connect in batches, preventing io_uring
+ submission queue overflow.
+
+ ### crash 8: cross-Io heap corruption in GC + health checks (fixed in `2156d08` + `b433403`)
+
+ **symptom**: steady-state heap corruption / SIGSEGV with zero downstream
+ consumers. dmesg shows crash addresses in `Uring.zig` at `Thread` struct field
+ offsets from NULL — same signature as crash 6, but in the GC and health check paths.
+
+ **cause**: two cross-Io violations active during steady-state operation:
+
+ 1. **GC loop** ran as an Evented fiber (`io.concurrent(gcLoop, ...)`) but called
+    `dp.gc()`, which takes `self.mutex.lockUncancelable(self.io)` where `self.io`
+    is `pool_io` (Threaded), and queries `pg.Pool` (also Threaded). Threaded futex
+    from an Evented fiber → NULL `Thread.current()` → heap corruption.
+
+ 2. **health check endpoints** (`/_readyz`, `/_health`, `/xrpc/_health`) on both
+    the metrics server and the API router executed `db.exec("SELECT 1")` through
+    `pg.Pool` (Threaded) from Evented HTTP handler context. same cross-Io violation.
+
+ **fix**:
+ - GC loop moved from `io.concurrent()` to `std.Thread.spawn(.{}, gcLoop, .{&dp, pool_io})`.
+   the thread is joined during shutdown before `dp.deinit()` runs (dp is stack-owned).
+ - health checks replaced with `isDbHealthy()`, as sketched below — it reads an
+   atomic `last_db_success` timestamp set by Threaded workers after successful
+   DB queries. safe from any Io.
+ - `markDbSuccess()` is called from `uidForDid()` (every incoming event) and
+   `gc()` (every 10 minutes).
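a minimal sketch of that health flag. `markDbSuccess`, `isDbHealthy`, and `last_db_success` are the names from the notes; the 15-minute staleness window and second-resolution timestamps are assumptions:

```zig
const std = @import("std");

var last_db_success = std.atomic.Value(i64).init(0);

/// called from Threaded workers (uidForDid, gc) after a successful query.
/// a plain atomic store touches no futex, so it is safe from any Io.
pub fn markDbSuccess() void {
    last_db_success.store(std.time.timestamp(), .monotonic);
}

/// called from Evented handlers (/_readyz, /_health, /xrpc/_health).
/// read-only atomic load: no pg.Pool, no Io.Mutex, no cross-Io hazard.
pub fn isDbHealthy() bool {
    const last = last_db_success.load(.monotonic);
    if (last == 0) return false; // nothing has succeeded yet
    // assumed window: gc() refreshes the stamp every 10 minutes, so 15 is lenient
    return std.time.timestamp() - last < 15 * std.time.s_per_min;
}
```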
+
+ ### fix 9: broadcaster double-destroy (fixed in `72ba680`)
+
+ **symptom**: use-after-free in the broadcaster. `broadcast()` was destroying
+ consumers that `Handler.close()` still referenced.
+
+ **cause**: `broadcast()` detected dead consumers (via the `alive` atomic) and
+ called `consumer.shutdown()` + `self.allocator.destroy(consumer)` inline, but
+ the consumer's `Handler.close()` callback was still running and would later
+ call `removeConsumer()` on the already-freed pointer.
+
+ **fix**: `broadcast()` now only does `swapRemove` + count decrement for dead
+ consumers. `removeConsumer()` (called from `Handler.close()`) is the sole owner
+ of `shutdown()` + `destroy()`.
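a sketch of that ownership split. only the `swapRemove`-vs-`destroy` division is from the notes; `Broadcaster` and `Consumer` here are stripped-down stand-ins, with frame fan-out elided:

```zig
const std = @import("std");

const Consumer = struct {
    alive: std.atomic.Value(bool),
    fn shutdown(self: *Consumer) void {
        _ = self; // stand-in for closing the socket, freeing queues, etc.
    }
};

const Broadcaster = struct {
    allocator: std.mem.Allocator,
    consumers: std.ArrayList(*Consumer),
    count: usize,

    fn broadcast(self: *Broadcaster) void {
        var i: usize = 0;
        while (i < self.consumers.items.len) {
            const c = self.consumers.items[i];
            if (!c.alive.load(.acquire)) {
                // drop only our reference; Handler.close() still holds one
                _ = self.consumers.swapRemove(i);
                self.count -= 1;
                continue; // swapRemove moved the tail into slot i
            }
            // ... send the frame to c here ...
            i += 1;
        }
    }

    // called from Handler.close(): the sole owner of shutdown() + destroy()
    fn removeConsumer(self: *Broadcaster, c: *Consumer) void {
        for (self.consumers.items, 0..) |item, i| {
            if (item == c) {
                _ = self.consumers.swapRemove(i);
                self.count -= 1;
                break; // may already be gone if broadcast() dropped it first
            }
        }
        c.shutdown();
        self.allocator.destroy(c);
    }
};
```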
+
+ ## the cross-Io rule
+
+ the single most important lesson from this migration:
+
+ **`Io.Mutex`, `Io.Condition`, `io.sleep()`, and any `pg.Pool` operation must be
+ called from the same Io type they were initialized with.**
+
+ - Threaded futex on an Evented fiber → dereferences the NULL `Thread.current()`
+   threadlocal → SIGSEGV or heap corruption (crashes 6, 8)
+ - Evented futex on a plain thread → same NULL deref in the other direction (crash 1)
+
+ **safe cross-Io patterns**: raw atomics (`Value`, `fetchAdd`, CAS), `tryLock`
+ (non-blocking CAS, no futex), MPSC ring buffers with atomic spinlocks.
+
+ **unsafe cross-Io patterns**: `Io.Mutex.lockUncancelable(wrong_io)`,
+ `Io.Condition`, `io.sleep(wrong_io, ...)`, `pg.Pool` queries from the wrong context.
+
+ **fix pattern**: components that use Threaded resources (mutexes, pg.Pool) must
+ run on a plain `std.Thread`, not as Evented `io.concurrent()` fibers. examples:
+ resyncer (`439c678`), GC loop (`2156d08`); a sketch follows.
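a minimal sketch of the pattern, with a stubbed `DiskPersist`. the 10-minute interval and the names `gcLoop`/`shutdown_flag` are from the notes; the `std.Io` parameter type and all signatures are assumptions against 0.16-dev:

```zig
const std = @import("std");

const DiskPersist = struct {
    // stand-in: the real struct holds an Io.Mutex and a pg.Pool handle
    fn gc(self: *DiskPersist, io: std.Io) !void {
        _ = self;
        _ = io;
    }
};

var shutdown_flag = std.atomic.Value(bool).init(false);

fn gcLoop(dp: *DiskPersist, pool_io: std.Io) void {
    while (!shutdown_flag.load(.acquire)) {
        // gc() locks its Io.Mutex and queries pg.Pool via pool_io (Threaded).
        // safe here because this is a real OS thread, not an Evented fiber.
        dp.gc(pool_io) catch |err| std.log.err("gc failed: {}", .{err});
        // plain thread sleep; the real loop would also wake early on shutdown
        std.Thread.sleep(10 * std.time.ns_per_min);
    }
}

pub fn startGc(dp: *DiskPersist, pool_io: std.Io) !std.Thread {
    // plain std.Thread.spawn, never io.concurrent() (see crash 8)
    return std.Thread.spawn(.{}, gcLoop, .{ dp, pool_io });
}

pub fn stopGc(gc_thread: std.Thread) void {
    shutdown_flag.store(true, .release);
    gc_thread.join(); // join before dp.deinit(); dp is stack-owned
}
```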
+
+ **known remaining**: XRPC and admin API handlers run on Evented fibers and
+ access `pg.Pool` (Threaded) on external HTTP requests. not steady-state, but
+ real. fixing requires either running API handlers on pool_io or making pg.Pool
+ Io-agnostic.
+
+ ## stdlib patches
+
+ zlay patches `lib/std/Io/Uring.zig` at build time (see `patches/uring-networking.patch`,
+ applied in the `Dockerfile`). this is necessary because the upstream zig 0.16
+ stdlib ships these networking operations as `*Unavailable` stubs that return
+ `error.NetworkDown`.
+
+ ### what's patched
+
+ | function | implementation | why stubbed upstream |
+ |---|---|---|
+ | `netListenIp` | sync `bind()` + `listen()` | IORING_OP_BIND/LISTEN require kernel 6.11+ |
+ | `netAccept` | `IORING_OP_ACCEPT` | was unimplemented |
+ | `netConnectIp` | `IORING_OP_CONNECT` | was unimplemented |
+ | `netSend` | `IORING_OP_SENDMSG` | was unimplemented |
+ | `netRead` | `IORING_OP_READV` / `IORING_OP_READ` | was unimplemented |
+ | `netWrite` | `IORING_OP_SENDMSG` | was unimplemented |
+
+ the patch also adds a `connect()` helper and `netSendOne()` for individual
+ message sending.
+
+ ### why not upstream yet
+
+ the Uring networking layer is under active development (see zig issue #31723).
+ the patch uses sync syscalls for `bind`/`listen` (not io_uring opcodes) because
+ those opcodes require kernel 6.11+ and production runs on older kernels. this
+ is a pragmatic choice that may not match upstream's desired API shape.
+
+ ### how it's applied
+
+ ```dockerfile
+ # Dockerfile, lines 11-13
+ COPY patches/ patches/
+ RUN patch /opt/zig-.../lib/std/Io/Uring.zig < patches/uring-networking.patch
+ ```
+
+ pinned to zig `0.16.0-dev.3059+42e33db9d`. any zig version bump requires
+ regenerating the patch.
+
+ ### other stdlib issues hit (not patched, worked around)
+
+ | issue | workaround |
+ |---|---|
+ | `Io.Event.reset()` assumes no pending waiters — panics under contention | replaced with a futex counter in the pg.zig fork (crash 2) |
+ | `Io.Uring` GPFs under `ReleaseSafe` (aggressive inlining + fiber context) | build with `ReleaseFast` only (`Dockerfile` line 21) |
+ | `Io.Mutex` / futex cannot cross Io types | run Threaded workloads on plain `std.Thread` (crashes 1, 6, 8) |
+ | `std_options.debug_io` is single-threaded by default | override in the root source file for multi-threaded contexts |

  ## where things live

  ### repos and their roles

···

  ### concurrency architecture

- two layers, by design:
+ three layers, shaped by the cross-Io constraint:

- **network I/O → `io.concurrent` tasks (Io.Threaded fibers)**
- - upstream PDS subscribers (subscriber read loops, ping loops)
+ **Evented fibers (`io.concurrent` on `Io.Evented`)**
+ - upstream PDS subscribers (read loops, ping loops)
  - downstream consumer write loops
  - DID resolver loops (validator)
- - background tasks: backfill, resync, cleaner, GC, metrics server
+ - broadcast loop (drains the queue, fans out to consumers)
  - slurper coordination
+ - metrics server, backfill, cleaner
+
+ **plain `std.Thread` with `pool_io` (`Io.Threaded`)**
+ - GC loop — uses `DiskPersist.mutex` + `pg.Pool` (crash 8)
+ - resyncer — uses `DiskPersist` + HTTP client (crash 6)
+ - these MUST NOT run as Evented fibers (see "the cross-Io rule")

  **CPU-bound ordered processing → explicit `std.Thread` workers**
  - `thread_pool.zig`: `workers[host_id % N]` ensures per-key FIFO ordering
  - `frame_worker.zig`: CBOR decode, DID resolution, signature verify, DB persist
  - bounded backpressure: blocking submit when queue full → TCP backpressure
+ - uses `Io.Mutex` / `Io.Condition` with `pool_io` (Threaded futex)

- the frame pool workers use `Io.Mutex` / `Io.Condition` for synchronization.
- under `Io.Threaded`, these use direct kernel futex syscalls — safe from plain
- threads. under `Io.Evented`, they would segfault (see crash 1).
-
- ### dependency versions (current `6d6c832`)
+ ### dependency versions (current `b433403`)

  ```
- zat v0.3.0-alpha.11 (tangled.org)
- websocket.zig 104608b (github, master)
+ zat v0.3.0-alpha.16 (tangled.org)
+ websocket.zig 80c6434 (github, master)
  pg.zig 5ce2355 (github, dev branch)
  rocksdb-zig cdef67b (github)
- zig 0.16.0-dev.3059+42e33db9d
+ zig 0.16.0-dev.3059+42e33db9d (patched Uring networking)
  ```

  ### key env vars

···

  ## what needs to happen next

- 1. **deploy `6d6c832`** — all four crashes are fixed, httpFallback dispatch
-    is in. health probes on port 3000 should work. native build, linux
-    cross-compile, fmt all pass.
+ 1. **deploy `b433403`** — all nine crashes/fixes are in. the steady-state heap
+    corruption (cross-Io GC + health checks) is fixed. the broadcaster
+    double-destroy is fixed. health probes use the atomic flag, not a cross-Io
+    DB query.

  2. **monitor after deploy** — compare against 0.15 baseline:
     - thread count (2,903 on 0.15 — should be similar under Threaded)
     - memory (24.9 GiB VmSize, 1.44 GiB RSS on 0.15)
     - throughput, reconnect behavior, ConsumerTooSlow rate
-    - verify no new crashes after extended run (hours, not seconds)
-    - specifically watch for any remaining GPF — if crash 4 fix is correct,
-      there should be zero GPFs even after hours of operation
+    - verify zero crashes after an extended run (hours). the GC cross-Io bug
+      fired every 10 minutes, so 30+ minutes clean is a strong signal.
+    - watch `/_health` — it should report healthy once `uidForDid()` or `gc()`
+      succeeds (within 10 minutes of startup at the latest)
+
+ 3. **known issue — XRPC/admin cross-Io** (not blocking deploy):
+    - API handlers run on Evented fibers but some query `pg.Pool` (Threaded)
+    - only triggered by external HTTP requests, not steady-state
+    - fix: either run API handlers on pool_io, or make pg.Pool Io-agnostic
+      (one interim handoff shape is sketched at the end of these notes)

- 3. **follow-up work** (not blocking deploy):
+ 4. **follow-up work**:
     - investigate Evented backend viability (frame workers → io.concurrent?)
     - consider upstreaming the client write lock to karlseguin/websocket.zig
+    - consider upstreaming the Uring networking patch (zig#31723)
+    - evaluate whether pg.Pool can be made Io-agnostic (would eliminate the
+      remaining cross-Io issues)
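for the known XRPC/admin cross-Io issue (item 3 above), one interim handoff shape that composes only the safe patterns from "the cross-Io rule": raw atomics cross the Io boundary, and a plain Threaded thread owns all pg.Pool access. every name here is hypothetical, the job queue is elided, and the fiber-side wait is left as a stub because these notes do not pin down the 0.16-dev `std.Io` sleep/yield API:

```zig
const std = @import("std");

const DbJob = struct {
    done: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    ok: bool = false, // written by the worker before `done` is published
};

// runs on a plain std.Thread with pool_io and owns all pg.Pool access
fn dbWorker(job: *DbJob) void {
    // ... execute the query through pg.Pool (Threaded) here ...
    job.ok = true;
    job.done.store(true, .release); // publish only after the result is written
}

// runs on an Evented fiber; only raw atomics cross the Io boundary
fn awaitJob(job: *DbJob) bool {
    while (!job.done.load(.acquire)) {
        // stub: a real version would park this fiber on its OWN Evented io
        // between polls; parking on pool_io would recreate the cross-Io bug
        std.atomic.spinLoopHint();
    }
    return job.ok;
}
```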