# back to threads

the last devlog ended with: "the relay runs on Evented with ReleaseFast in production, processes the full firehose at ~2,800 PDS connections, and has been stable since the websocket fix." that was april 5. by april 9, we'd shipped three wrong fixes for a problem we couldn't diagnose, rolled back to a week-old build, and ultimately abandoned the Evented backend entirely. the relay is back on `Io.Threaded` — one OS thread per PDS, ~2,800 threads, the same model as 0.15. this is what happened.

## the slow bleed

after the websocket CRLF fix and the Evented re-enable on april 5, relay-eval coverage started dropping. not dramatically — from 99% to roughly 85-90%. there was always a plausible local explanation. the `host_authority` resolver pool might be poisoned. the consumer buffer might be too small. the reconnect cron might be disrupting things.

we chased each one. none of them were wrong, exactly — they were all real issues. but none of them explained the gap.

two things happened at once:

1. external HTTP stopped responding after 10-40 minutes of uptime
2. `host_authority` rejection spiked to 88-99%

the HTTP hang was the worse problem. `/_health` would return 200 for a while, then start timing out. internal metrics kept ticking — `frames_received_total` climbed, workers were active, CPU was low (~0.26 cores), all threads sleeping. but external consumers couldn't complete a WebSocket handshake. k8s marked the pod `NotReady`. relay-eval reported 0% coverage.

we had five commits in the suspect window between the last good build ([`b91382b`](https://tangled.org/zzstoatzz.io/zlay/commit/b91382b)) and the broken state:

- [`1eec324`](https://tangled.org/zzstoatzz.io/zlay/commit/1eec324) — fix UAF: dupe `FrameWork.hostname` per submit instead of borrowing
- [`31825b2`](https://tangled.org/zzstoatzz.io/zlay/commit/31825b2) — subscriber: extract `prepareFrameWork` + add UAF regression test
- [`168d9f1`](https://tangled.org/zzstoatzz.io/zlay/commit/168d9f1) — bump websocket.zig + zat: fix `requestCrawl` POST hang
- [`3dc21b9`](https://tangled.org/zzstoatzz.io/zlay/commit/3dc21b9) — fix `gcLoop`: silently exited after one tick
- [`fbdffbe`](https://tangled.org/zzstoatzz.io/zlay/commit/fbdffbe) — mark DB success on `did_cache` hits

we shipped three hypotheses in one day:

1. **"zig 0.16 `std.http.Client` stale keep-alive handling"** — falsified by a standalone reproduction that worked fine.
2. **"broadcaster `writeLoop` scheduler starvation"** — the operator caught this one. the proposed fix (move `writeLoop` to `pool_io`) would re-enter the cross-Io crash class from [devlog 008](008-the-io-migration.md). do not call `Threaded` Io primitives from `Evented` fibers.
3. **"`gcLoop` + `malloc_trim` stalls every 10 minutes"** — disabled `malloc_trim`, bumped gc interval from 10 minutes to 1 hour. the symptom returned at 35 minutes instead of 10. delayed, not fixed.

the situation report we wrote at this point opened with: "we need help getting to a correct fourth hypothesis rather than shipping a fourth wrong one."

## the rollback

the operator rolled back to [`b91382b`](https://tangled.org/zzstoatzz.io/zlay/commit/b91382b). the measurements, ten minutes apart on the same cluster:

| signal | `4f3d1d4` (broken) | `b91382b` (rollback) |
|---|---|---|
| external `/_health` | 503 (ingress, pod `NotReady`) | 200 in 0.29s |
| 15s websocket consumer | 6 frames in 170s (0.035 fps) | 5,896 frames (395 fps) |
| delivery ratio | ~5% of received frames reaching broadcast | 99.4% (6,801/6,843 in 15s) |
| `host_authority` reject rate | 88% | 1.3% |

same code paths, same PDS pool, same cluster. `b91382b` was the april 6 build — pre-outage, still on `Evented`. it worked. every build after it didn't.

## the reframe

a reviewer read the full git history and the operator's measurements and made one observation: the coverage degradation didn't start with the april 8 commits. it started with [`9cc1ba3`](https://tangled.org/zzstoatzz.io/zlay/commit/9cc1ba3) — the 0.16 migration itself. the later bugs were real, but they were layered on top of an already-degraded `Evented` baseline.

the reviewer recommended Tracy-based tracing to diagnose fiber scheduling inside the `Evented` runtime. trace cold-start behavior, steady-state with a consumer attached, find where fibers are monopolizing time or where yields are missing.

we went a different direction.

## the question

is the `Evented` model defensible?

the full ledger, as of april 9:

- 9 crash classes in 8 days, 3 of them silent heap corruption
- a custom io_uring networking patch against an experimental backend where all 6 networking operations were stubbed upstream
- forced into `ReleaseFast` by a codegen bug in fiber context-switching (the same bug that hid the websocket CRLF crash in [devlog 008](008-the-io-migration.md))
- a persistent ~10-15% coverage degradation nobody could trace
- the zig team's [own position](https://ziglang.org/devlog/2026/#2026-02-13): the `Evented` backends "should be considered **experimental** because there is important followup work to be done before they can be used reliably and robustly"

the relay ran on 0.15 from late february through early april — about five weeks — at 2,800 threads and 99%+ coverage. the benefit of `Evented` was thread count: ~35 instead of ~2,800. the cost was everything else.

## the fix

```zig
const Backend = Io.Threaded; // was Io.Evented
```

this works because the relay was already written against the [`Io`](https://ziglang.org/documentation/master/std/#std.Io) abstraction, not against `Io.Evented` directly. [`io.concurrent()`](https://ziglang.org/documentation/master/std/#std.Io.concurrent) — which the stdlib describes as calling a function "such that the return value is not guaranteed to be available until `await` is called, allowing the caller to progress" — spawns [fibers](https://ziglang.org/devlog/2026/#2026-02-13) under `Evented` and OS threads under `Threaded`. `Io.Mutex`, `Io.Condition`, `Io.Future` — all backend-agnostic. the same code runs on both, which is the promise 0.16 made and actually delivered on.

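a minimal sketch of that shape, with invented names. `runRelay` is not zlay's entry point, and the backend construction below is an assumption based on the published `std.Io` examples rather than anything shown in this devlog:

```zig
const std = @import("std");
const Io = std.Io;

// the single backend-specific decision; everything below it names only `Io`.
const Backend = Io.Threaded; // was Io.Evented

pub fn main() !void {
    // assumption: init/deinit/io() follow the published std.Io examples.
    var backend: Backend = .init(std.heap.smp_allocator);
    defer backend.deinit();
    const io = backend.io();

    try runRelay(io);
}

// every subscriber, broadcaster, and db path takes `io: Io` like this, so
// io.concurrent(), Io.Mutex, Io.Condition, and Io.Future resolve to fibers
// under Evented and to OS threads under Threaded without any call-site changes.
fn runRelay(io: Io) !void {
    _ = io;
}
```
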
one line ([`e6cdf84`](https://tangled.org/zzstoatzz.io/zlay/commit/e6cdf84)). tests pass, formatting clean, production build succeeds. and because the fiber context-switch GPF was `Evented`-specific, we could switch back to `ReleaseSafe` — safety checks in production for the first time since the migration.

## what ReleaseSafe found in 80 minutes

```
/websocket/src/server/server.zig:647:33: in readLoop
```

a pre-existing bug in the consumer drop path. when a downstream consumer falls behind, the broadcaster calls `dropSlowConsumer`, which closes the socket to unblock the consumer's read loop. but the websocket server's `readLoop` runs on a different thread and was about to call `setsockopt` on the same fd. `setsockopt` gets `EBADF`, zig's stdlib wrapper hits `unreachable`, `ReleaseSafe` turns that into a panic with a stack trace.

under `ReleaseFast`, `unreachable` is undefined behavior. this bug existed on every prior build — every `Evented` deploy, every 0.15 deploy. it manifested as... nothing, usually. occasionally, whatever the compiler feels like.

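a toy illustration of that difference (the language rule, not zlay code):

```zig
// `unreachable` asserts "this branch cannot happen". Debug and ReleaseSafe
// verify the assertion and panic with a stack trace when it's violated;
// ReleaseFast and ReleaseSmall treat it as undefined behavior and optimize
// as if the branch never executes.
fn setOption(fd_still_open: bool) void {
    if (!fd_still_open) {
        // the stdlib setsockopt wrapper makes essentially this bet on EBADF:
        // "we own this fd, nobody can close it under us", which holds until
        // another thread does exactly that.
        unreachable;
    }
    // ... call setsockopt ...
}

test "only safe builds catch the broken assumption" {
    setOption(true); // fine in every build mode
    // setOption(false); // panic under ReleaseSafe, silent UB under ReleaseFast
}
```
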
the fix ([`4735725`](https://tangled.org/zzstoatzz.io/zlay/commit/4735725)): move socket close ownership from the broadcast thread to the consumer's `writeLoop` thread. `dropSlowConsumer` signals the consumer to stop (`alive.store(false, .release)`). the `writeLoop` checks `alive`, exits its loop, drains remaining frames, and closes the connection itself. no race.

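the shape of that ownership change, sketched with invented names (zlay's actual consumer struct carries more state than this):

```zig
const std = @import("std");

const Consumer = struct {
    alive: std.atomic.Value(bool) = std.atomic.Value(bool).init(true),
    conn: std.net.Stream,

    // called from the broadcast thread when the consumer falls behind:
    // signal only, never touch the fd from here.
    fn dropSlowConsumer(self: *Consumer) void {
        self.alive.store(false, .release);
    }

    // runs on the consumer's own thread, the sole owner of the socket.
    fn writeLoop(self: *Consumer) void {
        while (self.alive.load(.acquire)) {
            // ... pop the next frame from the ring buffer and write it ...
        }
        // drain whatever is still queued, then close. no other thread can
        // race this close, so readLoop never operates on a dead or recycled fd.
        self.conn.close();
    }
};
```
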
we also bumped the consumer ring buffer from 8,192 to 65,536 entries — ~3 minutes of headroom at current ingest rates instead of ~24 seconds. fewer `ConsumerTooSlow` kicks means the close path fires less often.

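the headroom arithmetic, for anyone checking: assuming a steady ingest rate of roughly 340 frames/sec (the rate implied by the old ~24-second figure; the tables in this devlog show peaks above that), the two capacities work out like this.

```zig
const std = @import("std");

test "consumer ring buffer headroom" {
    const fps: f64 = 340.0; // assumed steady-state ingest, not a measured value
    try std.testing.expectApproxEqAbs(24.0, 8_192.0 / fps, 1.0); // old buffer, seconds
    try std.testing.expectApproxEqAbs(3.2, 65_536.0 / fps / 60.0, 0.1); // new buffer, minutes
}
```
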
the pattern from [devlog 008](008-the-io-migration.md) repeats: `ReleaseSafe` finds the bug on the first occurrence with a stack trace. `ReleaseFast` hides it until something unrelated goes wrong and you blame the wrong thing.

## the numbers

[`4735725`](https://tangled.org/zzstoatzz.io/zlay/commit/4735725) (`Threaded` + `ReleaseSafe` + consumer fix), 16 hours uptime:

| signal | value | vs `Evented` |
|---|---|---|
| uptime | 16h, 0 restarts | crashed at 10-80 min |
| health | 200 in 0.31s | hung after 10-40 min |
| delivery | 460 fps peak | 0 fps at failure |
| `persist_order_spins` | 38.6k/s | 33M/s (850x higher) |
| `broadcast_queue_depth_hwm` | 36 | 8,191 |
| RSS | 2.09 GiB, flat | comparable |
| threads | ~2,700 | ~35 |

the `persist_order_spins` metric tells the real story. this is a spinlock counter on the frame persistence ordering mutex — same mutex, same code, same workload. under `Evented`, 33 million spins per second. under `Threaded`, 38 thousand. the `Evented` scheduler was creating contention that didn't exist under normal OS thread scheduling.

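roughly what that counter measures. this sketch uses a plain `std.Thread.Mutex` and invented names; zlay's real ordering lock sits in the persistence path behind the `Io` abstraction, so treat this as an illustration of the metric, not the implementation:

```zig
const std = @import("std");

var persist_order_mutex: std.Thread.Mutex = .{};
var persist_order_spins: std.atomic.Value(u64) = std.atomic.Value(u64).init(0);

// each failed acquisition attempt bumps the counter before yielding. a healthy
// scheduler keeps this low because whoever holds the mutex gets to run and
// release it promptly; 33M/s means waiters were spinning instead of the holder
// making progress.
fn lockInPersistOrder() void {
    while (!persist_order_mutex.tryLock()) {
        _ = persist_order_spins.fetchAdd(1, .monotonic);
        std.Thread.yield() catch {};
    }
    // caller holds the mutex here and unlocks after persisting the frame.
}
```
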
## what this means for the Io abstraction

the good news: `std.Io`'s abstraction held. one line changed the backend, the relay kept running, every `io.concurrent()` call site worked. the dual-Io architecture we built (`Evented` for network orchestration, `Threaded` for blocking work) was unnecessary but harmless — under all-`Threaded`, `pool_io` and the [`DbRequestQueue`](https://tangled.org/zzstoatzz.io/zlay/blob/main/src/event_log.zig) bridge are redundant but functional. we can clean them up later without urgency.

the bad news: the backend-agnostic promise has a sharp edge. `Io.Mutex.lock(io)` compiles and type-checks regardless of whether `io` matches the calling thread's execution context. the cross-Io crash class — the central lesson of [devlog 008](008-the-io-migration.md) — only manifests at runtime, under load, as memory corruption. the abstraction is correct in the sense that the same code runs on both backends. it's dangerous in the sense that mixing backends within a single process produces no compile error and corrupts your heap.

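the sharp edge in miniature. the names are invented, and the `Io.Mutex` calls stay in comments because this devlog only confirms the call form, not the full signatures:

```zig
const std = @import("std");

// `net_io` might come from an Evented backend and `pool_io` from a Threaded
// one, but both are plain `std.Io` values. nothing in this signature, or in
// Io.Mutex's, can tell them apart at compile time:
//
//     ordering.lock(net_io)   // correct for code running on an Evented fiber
//     ordering.lock(pool_io)  // type-checks identically, corrupts at runtime
//
fn persistFrame(ordering: *std.Io.Mutex, net_io: std.Io, pool_io: std.Io) void {
    _ = .{ ordering, net_io, pool_io };
}
```
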
for zat: the library doesn't care. `Io` is threaded through as a parameter. callers pick the backend. the streaming client's `subscribe(handler)` works identically on both backends.

for [zlay](https://tangled.org/zzstoatzz.io/zlay): we're on `Threaded` for the foreseeable future. the `Evented` code paths, the [uring networking patch](https://tangled.org/zzstoatzz.io/zlay/blob/main/patches/uring-networking.patch), and the cross-Io segregation rules stay in the tree but inert. we'll revisit when zig's `Evented` backend is no longer experimental and when the `ReleaseSafe` GPF is [fixed upstream](https://tangled.org/zzstoatzz.io/zlay/blob/main/scripts/fiber_gpf_issue.md).

for anyone else considering `Io.Evented` in production: the `Evented` backends are ["based on userspace stack switching, sometimes called 'fibers', 'stackful coroutines', or 'green threads'"](https://ziglang.org/devlog/2026/#2026-02-13). they can work. they did work, for stretches. but the support surface is thin — you'll need to patch the networking layer, work around DNS, build cross-Io bridges for any `Threaded` dependency, and run `ReleaseFast` (hiding real bugs). the thread-count savings are real. whether they're worth the cost depends on how much time you have to spend on runtime issues that aren't your application logic.

zat is v0.3.0-alpha. no API changes from this — the `Io` parameter is already the interface.