atproto utils for zig zat.dev
atproto sdk zig

devlog 008: add crash 9 (websocket TCP split), ReleaseSafe investigation

the production SIGSEGV we blamed on the fiber context switch was actually
a one-line off-by-one in the websocket handshake reader. ReleaseSafe
caught it immediately after we switched build modes. updated the devlog
with the full story, the disassembly findings, and the lesson about
assuming two crashes are the same bug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+67 -4
devlog/008-the-io-migration.md
···
# the io migration

- zig 0.16 replaced the networking and concurrency primitives. `std.net`, `std.Thread.Pool`, `std.Thread.Mutex` — all gone, replaced by `std.Io`. this is the story of migrating zat and zlay to the new system, the eight crashes that followed, and the rule we discovered that isn't documented anywhere.
+ zig 0.16 replaced the networking and concurrency primitives. `std.net`, `std.Thread.Pool`, `std.Thread.Mutex` — all gone, replaced by `std.Io`. this is the story of migrating zat and zlay to the new system, the nine crashes that followed, and what we learned about the gap between a bug and what you think the bug is.

## what changed in 0.16

···

[zlay](https://tangled.sh/zzstoatzz.io/zlay) is an AT Protocol relay — ~8,400 lines of zig, ~2,750 PDS subscribers, WebSocket fan-out to downstream consumers. it was the heaviest consumer of the 0.15 API surface. migrating it to 0.16 compiled on the first try.

- eight crashes followed.
+ nine crashes followed.

## crash 1: SIGSEGV on startup

···

## the cross-Io rule

- the central discovery from eight crashes across five days:
+ the central discovery from crashes 1, 6, and 8:

**`Io.Mutex`, `Io.Condition`, `io.sleep()`, and any library that uses them internally (pg.Pool, etc.) must be called from the same Io backend they were initialized with.**

···

```
backfiller (pool_io)
```

## crash 9: the ghost in the fiber

after fixing crashes 1–8, the relay ran on Evented with ReleaseFast. (ReleaseSafe had a separate problem — more on that below.) it stayed up for hours at a time, processing the full AT Protocol firehose across ~2,800 PDS connections. then, every 30–90 minutes: SIGSEGV. exit code 139. no stack trace — ReleaseFast strips safety checks.

the logs showed nothing unusual.
chain breaks (expected after restarts when cursor positions are stale), normal reconnection cycles, then sudden death. 13 restarts in 12 hours.

we had a separate observation that was shaping our thinking: a minimal repro (`scripts/repro_evented.zig`) that spawns a single fiber which returns immediately, and GPFs under ReleaseSafe. the crash lands in `fiber.zig:contextSwitch` → `Uring.zig:mainIdle`. so we had a confirmed fiber context-switch bug under one build mode, and a mystery SIGSEGV under another. the natural conclusion: same bug, different manifestation. ReleaseFast just hides it longer because the optimizer arranges code differently.

we spent time investigating the fiber machinery, reading disassembly of the context switch, checking for upstream fixes (fiber.zig was unchanged across 32 dev builds). we considered patching the context switch ourselves. we checked upstream Uring networking implementation status — still fully stubbed. we read the zig team's position on Evented — "experimental," "important followup work to be done."

we concluded the fiber machinery was broken and reverted to `Io.Threaded`. thread-per-PDS, ~2,800 threads, same as 0.15 but on the 0.16 API. the relay stopped crashing.

then we switched the build back to ReleaseSafe.

```
thread 543 panic: start index 1370 is larger than end index 1369
websocket.zig/src/client/client.zig:766
```

it was never the fibers.

the websocket client's HTTP handshake reader parses response headers line by line. when it finds a `\r`, it advances `line_start` past the `\r\n` to the next line. but TCP can deliver the `\r` at the end of one read and the `\n` at the start of the next. when that happens, `line_start` overshoots `pos`, and the next `buf[line_start..pos]` slice has start > end. under ReleaseSafe, that's a bounds-check panic with a stack trace.
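
to make the failure concrete, here's a minimal sketch of the hazard. this is illustrative, not the actual websocket.zig code: a scanner that advances past `\r\n` has to handle the case where a read ends exactly on the `\r`.

```zig
const std = @import("std");

// illustrative sketch (not the actual websocket.zig code): scan buffered
// bytes for a header line and advance past its `\r\n`.
fn nextLineStart(buf: []const u8, line_start: usize, pos: usize) ?usize {
    const line = buf[line_start..pos];
    const cr = std.mem.indexOfScalar(u8, line, '\r') orelse return null;
    const next = line_start + cr + 2; // skip past the `\r\n` pair
    // the one-line guard: a TCP read can end exactly on the `\r`, with the
    // `\n` still in flight. without this check, the caller's next
    // buf[next..pos] slice has start > end.
    if (next > pos) return null; // incomplete line, wait for more bytes
    return next;
}

pub fn main() void {
    const response = "HTTP/1.1 101 Switching Protocols\r\nUpgrade: websocket\r\n";
    // simulate a read that ends exactly on the `\r` of the first line
    const pos = 33; // bytes received so far; response[32] == '\r'
    std.debug.assert(response[pos - 1] == '\r');
    // with the guard, the scanner reports "incomplete" instead of
    // producing an inverted slice (start 34, end 33)
    std.debug.assert(nextLineStart(response, 0, pos) == null);
}
```

the guard is the same shape as the actual fix quoted below: compare the advanced index against `pos` before slicing.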
under ReleaseFast, there's no bounds check — it indexes into garbage memory and eventually corrupts something downstream.

with ~2,800 connections doing TLS handshakes, the probability of a TCP split landing on the exact `\r\n` boundary is low per-connection but high in aggregate. once every 30–90 minutes, some PDS reconnection handshake hits it.

the fix was one line:

```zig
line_start = line_end + 2;
if (line_start > pos) break; // ← TCP split mid-CRLF, read more
```

this is the bug that ReleaseSafe would have caught on the first occurrence, with a stack trace pointing directly at the line. instead, we ran ReleaseFast for weeks, saw silent SIGSEGVs, and blamed the fiber scheduler.

## the ReleaseSafe problem

so why were we on ReleaseFast in the first place?

because Evented + ReleaseSafe GPFs on startup. the minimal repro — a fiber that returns without yielding — crashes deterministically in `fiber.zig:contextSwitch`. Debug, ReleaseFast, and ReleaseSmall all pass. only ReleaseSafe triggers it. this reproduces on completely unpatched zig (our Uring networking patch is not involved).

comparing the disassembly of `Uring.idle` between modes, the difference is in how the SwitchMessage address reaches the inline asm:

```
// fiber.zig contextSwitch, x86_64:
asm volatile (
    \\ movq 0(%%rsi), %%rax   // rax = Switch.old
    \\ movq 8(%%rsi), %%rcx   // rcx = Switch.new
    ...
    : [message_to_send] "{rsi}" (s), // input: s must be in %rsi
```

under ReleaseFast, there's a `lea` that loads the SwitchMessage stack address into `%rsi` before the asm. under ReleaseSafe, that `lea` appears to be missing — `%rsi` holds a stale value from a prior function call. the ReleaseSafe prologue adds stack probing (`__zig_probe_stack`) and a canary (`fs:0x28`), which change the code layout surrounding the inline asm.
we think this is why the register allocation differs, but we're not certain — there may be something else going on.

we've written this up as a [bug report](scripts/fiber_gpf_issue.md) with a standalone reproduction.

this is a real problem for the ecosystem. ReleaseSafe is the mode designed for production services that want optimization with safety checks. TigerBeetle uses it. the zig compiler's own nightlies recently switched to it. Ghostty and Bun use ReleaseFast, but both have noted they'd prefer ReleaseSafe if the performance cost were lower. for `Io.Evented` to be a viable production backend, it needs to work with ReleaseSafe.

## what this means for zat

the library held up. CBOR, CAR, commit parsing, verification, multibase — all chain correctly through the Io migration. the API change was mechanical: add `io` as first parameter, thread it through.

···

the streaming client redesign — `subscribe(handler)` instead of `connect()` + `next()` — was the right call. the handler pattern gives the library control over reconnection, backoff, and host rotation. the caller implements `onEvent` and gets reliable delivery without managing connection lifecycle.

- six patches were needed against the zig stdlib or its Uring backend for zlay to run on Evented. `netLookup` is still unimplemented. the cross-Io hazard is still undocumented. but the Io abstraction itself — write once, pick your scheduler — delivered on its promise. the same relay code runs on Threaded (production) and Evented (local development on macOS via GCD) without conditional compilation.
+ six patches were needed against the zig stdlib or its Uring backend for zlay to run on Evented. `netLookup` is still unimplemented. the cross-Io hazard is still undocumented. but the Io abstraction itself — write once, pick your scheduler — delivered on its promise.
the relay runs on Evented with ReleaseFast in production, processes the full firehose at ~2,800 PDS connections, and has been stable since the websocket fix.

the biggest lesson wasn't technical. we had a confirmed bug in the fiber context switch (the ReleaseSafe GPF) and a mystery crash in production (the SIGSEGV under ReleaseFast). we assumed they were the same bug because the symptoms overlapped — both were crashes in the Evented code path. we spent time investigating fiber machinery, reading disassembly, checking upstream. the actual bug was a one-line off-by-one in a dependency, in a function that had nothing to do with fibers.

the thing that found it was switching to ReleaseSafe. not to fix the crash — we'd already reverted to Threaded for that — but because reverting happened to re-enable the build mode that had the safety checks. the bounds check caught the real bug on the first handshake that split on `\r\n`.

there are two bugs here and they're both real. the websocket off-by-one was the production crash. the ReleaseSafe GPF is a separate issue that blocks Evented from running with safety checks. we're filing the latter upstream. in the meantime, ReleaseFast works, and we know what to look for when it doesn't.

zat is v0.3.0. the Io parameter is the only breaking change.
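
the handler pattern the devlog describes can be sketched in a few lines. only the `subscribe` and `onEvent` names come from the post; every other name here (Event, Client, CountingHandler) is invented for illustration and is not zat's actual API.

```zig
const std = @import("std");

// hypothetical sketch of the subscribe(handler) shape; not zat's real types
const Event = struct { seq: u64, kind: []const u8 };

const Client = struct {
    // the library owns the connection loop: reconnection, backoff, and
    // host rotation happen in here, not in the caller.
    fn subscribe(self: *Client, handler: anytype) void {
        _ = self;
        // a real loop would read frames off the wire; we fake two events
        handler.onEvent(.{ .seq = 1, .kind = "#commit" });
        handler.onEvent(.{ .seq = 2, .kind = "#identity" });
    }
};

const CountingHandler = struct {
    seen: u64 = 0,
    fn onEvent(self: *CountingHandler, event: Event) void {
        self.seen += 1;
        std.debug.print("event {d}: {s}\n", .{ event.seq, event.kind });
    }
};

pub fn main() void {
    var client = Client{};
    var handler = CountingHandler{};
    client.subscribe(&handler);
    std.debug.assert(handler.seen == 2);
}
```

the caller never sees a connection object; it only implements `onEvent`, which is what lets the library retry and rotate hosts without breaking delivery.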