declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay


docs: log 04-09 zlay delivery regression + 04-10 resolution

full-day journey from canary1 failure through external review,
broadcaster starvation investigation, rollback to b91382b, and
engineer handoff. 04-10 threaded-resolution doc captures the fix
path (frame queueing rebase without reintroducing the host_authority
reject regression).

individual writeups:
- canary1-failure — initial regression observation
- zlay-external-review — reviewer-facing narrative of the
pre-rollback state (spawn path + resolver pool contention)
- zlay-broadcaster-starvation — wrong mitigation (move writeLoop
to pool_io, vetoed as cross-Io crash class) but symptom
measurements are accurate
- zlay-handoff-rollback — narrowed bug window + rollback rationale
- zlay-threaded-resolution — 04-10 engineer-side resolution

ops-changelog 04-09 section stitches them together chronologically.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

authored by

zzstoatzz
Claude Opus 4 (1M context)
and committed by
Tangled
ddb9fc7f 9a10ac96

+1938
+174
docs/canary1-failure-2026-04-09.md
··· 1 + # canary 1 (b6a52a0) failure report — 2026-04-09 2 + 3 + operator report for the zlay engineer. canary 1 was `b91382b + 1eec324` 4 + (FrameWork.hostname UAF fix). diagnostic question: does adding ONLY 5 + the UAF fix to the running `zat21-b91382b` reintroduce the HTTP / delivery 6 + failure seen on bbba92c / 795cc41 / 4f3d1d4? 7 + 8 + ## headline 9 + 10 + **canary 1 failed at T+40 min with a new failure shape, distinct from 11 + both the HTTP-fiber-hang we were hunting and from any of the 04-09 12 + failure modes i have documented previously.** 13 + 14 + | aspect | broken 04-09 pods (bbba92c / 795cc41 / 4f3d1d4) | canary 1 at T+40 min | 15 + |---|---|---| 16 + | pod `Ready` condition | flips to `False`, service endpoints emptied | stays `True`, pod in endpoints | 17 + | external `/_health` | timeout, ingress returns 503 | **200→500 in 0.3s** with body `{"status":"error","msg":"database unavailable"}` | 18 + | HTTP response latency | hangs past 10s | **fast** specific errors | 19 + | thread state (`/proc/1/task`) | not captured during failure on broken pods | **47 threads, 2 R / 45 S, 0 D-state** (same as healthy) | 20 + | listen socket accept queues (`/proc/net/tcp`) | not captured | **clean, rx=0** | 21 + | `frames_received` from upstream | healthy | **healthy, ~400 fps** | 22 + | `frames_broadcast` to connected consumer | frozen ~0 fps | **frozen, 0 frames in 15s** | 23 + | `pool_queued_bytes` | unknown | **0 → 63.86 MiB** during the ~5 min preceding failure | 24 + | `persist_order_spins_total` rate | unknown | **~18k/sec at T+32 → ~33M/sec at T+40** (≈2000× spike) | 25 + | `host_authority` reject rate | 99.5% cumulative | **99.1% cumulative** + new rejects continuing at ~5/sec | 26 + | postgres pod itself | not the suspect | **healthy — direct `psql` works instantly** | 27 + 28 + this is NOT an HTTP fiber hang. HTTP is answering fast with 29 + application-specific errors. threads are not stuck in uninterruptible 30 + I/O. accept queues are empty. the broadcaster is not getting scheduled 31 + because something upstream of it is wedged — most consistent with 32 + **DbRequestQueue deadlock or saturation**. 33 + 34 + ## what i think is happening (hypothesis, not proven) 35 + 36 + the logs just before failure were flooded with this pattern: 37 + 38 + ``` 39 + info(relay): account did:plc:zbvmafs7jjq3teplzcfbezg6 (uid=241368) host mismatch: current=1316 new=97392 40 + info(relay): host user.eurosky.social: dropping event, host authority failed uid=241368 did=did:plc:zbvmafs7jjq3teplzcfbezg6 41 + ``` 42 + 43 + the key observation: `current=1316 new=97392`. `current` is a small, 44 + valid host id. `new` is a very large host id. this is **not** the 45 + `host_id=0` orphan case from the DB audit. these are accounts where 46 + zlay's `account.host_id` points at a real, valid host row, but the 47 + DID doc resolves to a *different* valid host. 48 + 49 + the same ~5-10 UIDs (241368, 478114, 608246, 54156, 676874, ...) 50 + appear in log lines over and over. every event from those accounts 51 + triggers a `host_changed`, runs `resolveHostAuthority`, gets a 52 + mismatch, drops the event. the account's `host_id` field in the DB 53 + is never updated to the new value, so the next event from the same 54 + account does the same dance. 55 + 56 + if `resolveHostAuthority` enqueues an UPDATE through `DbRequestQueue` 57 + on a hit-the-new-host resolution, and the queue gets wedged (either 58 + by pure rate of incoming work or by a deadlock), then: 59 + 60 + 1. the UPDATE never commits 61 + 2. 
`account.host_id` stays stale 62 + 3. the same 5-10 accounts keep hitting the path 63 + 4. `host_authority_checks_total` and `host_authority_reject` climb 64 + together at the same rate 65 + 5. the broadcaster's frame-release path, which also goes through 66 + DbRequestQueue (i think — for cursor persist?), starves 67 + 6. `pool_queued_bytes` climbs as pending frame work accumulates 68 + 7. something spinning on a persist-order mutex explodes 69 + `persist_order_spins_total` from ~18k/sec to ~33M/sec (seen in 70 + metrics; don't yet know WHICH mutex or WHICH code path) 71 + 8. `/_health`'s DB probe times out waiting for DbRequestQueue, returns 72 + its "database unavailable" error fast (this is not a timeout on 73 + postgres — it's a timeout on the queue's processing) 74 + 9. connected firehose consumers receive 0 frames because the 75 + broadcaster fiber is waiting on the same wedge 76 + 77 + **i'm not confident about step 7**. i have no attribution of which 78 + persist-order mutex is spinning, and the memory notes don't define 79 + the metric. if you can tell me which code path increments 80 + `relay_persist_order_spins_total`, that probably points straight at 81 + the wedge. 82 + 83 + ## the two half-answers to the diagnostic question 84 + 85 + **"does 1eec324 reintroduce the delivery collapse?"** — inconclusive 86 + but leans *somewhat yes*. canary 1 collapsed at T+40. the rollback 87 + pod running the exact same code minus 1eec324 is currently at T+3h 88 + and is still serving 270 fps to external consumers. BUT: 89 + 90 + - the rollback pod's `pool_queued_bytes` spiked to 20.7k in one 91 + snapshot and dropped back to 0 in the next — it's a current-depth 92 + gauge that oscillates under burst load, not a monotonic climb. 93 + correction from an earlier wording. the gauge is mostly at 0 at 94 + T+3h. the signal to watch is the canary 1 case where it jumped 95 + to **63.86 MiB** and stayed there, 3 orders of magnitude above any 96 + oscillation i've seen on the rollback pod. 97 + - `persist_order_spins_total` rate on the rollback pod at T+3h is 98 + ~33k/sec (vs canary 1's 33M/sec). 1000× lower. this may be the 99 + most specific distinguishing signal between "functional" and 100 + "wedged" that we've seen. 101 + - i have not yet run the rollback pod long enough to rule out that 102 + it too wedges eventually — just on a slower timescale. 103 + 104 + **"does 1eec324 reintroduce the host_authority 99% reject rate?"** — 105 + **no.** the rollback pod at T+3h has `validation_failed{reason="host_authority"} = 127.3k / 128.5k checks = 99.1%`. this is the same 106 + pattern as bbba92c-pre-workaround. the host_authority reject bug is 107 + present on b91382b, i just never measured b91382b long enough to see 108 + it. my earlier "b91382b predates the host_authority bug" claim from 109 + the rollback doc was wrong — i was measuring at T+6 min with 110 + `workers_count ≈ 700`, pre-ramp. 111 + 112 + so i'm retracting the part of the earlier rollback doc that pinned 113 + the host_authority bug's introduction to the 04-06 → 04-07 window. 114 + that half is wrong. the broadcaster delivery collapse introduction 115 + window is STILL unclear — canary 1 data is suggestive that 1eec324 116 + makes it worse, but not conclusive. 
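for completeness, the rate numbers above come from diffing two `/metrics` snapshots, not from reading lifetime counters. a minimal sketch of that delta check (python; assumes an existing `kubectl port-forward` to the pod and that the forwarded local port serves `/metrics` — the port number here is an assumption, and the metric names are the ones cited in this report):

```python
# two-snapshot rate check for the counters this report leans on.
# sketch only: localhost:3001 stands in for whatever local port the
# kubectl port-forward maps to the relay's metrics endpoint.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:3001/metrics"  # assumption, adjust to your forward
COUNTERS = (
    "relay_persist_order_spins_total",
    "relay_frames_broadcast_total",
    "relay_host_authority_checks_total",
)

def scrape() -> dict[str, float]:
    """sum every sample of each watched metric, ignoring labels and # comments."""
    text = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    totals: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip # HELP / # TYPE comment lines
        m = re.match(r"([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)", line)
        if m and m.group(1) in COUNTERS:
            totals[m.group(1)] = totals.get(m.group(1), 0.0) + float(m.group(2))
    return totals

interval = 10.0
before = scrape()
time.sleep(interval)
after = scrape()
for name in COUNTERS:
    rate = (after.get(name, 0.0) - before.get(name, 0.0)) / interval
    print(f"{name}: {rate:,.1f}/s")
```

the point is the one this report keeps circling: a single lifetime value of `persist_order_spins_total` cannot distinguish the ~33k/sec functional shape from the ~33M/sec wedged shape; only the delta can.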
117 + 118 + ## what i am confident about 119 + 120 + - canary 1 failure at T+40 min is a real delivery-collapse event 121 + reproducing in our current production setup, with thread/socket 122 + evidence that distinguishes it from HTTP fiber hang 123 + - the persist_order_spins rate discontinuity (33k/sec → 33M/sec) 124 + within minutes of broadcaster wedge is a real signal and is the 125 + best hang-detection counter we've seen 126 + - the orphaned `account.host_id=0` bucket (239,422 accounts) is 127 + real and distinct from the stale-valid-host-id pattern in the 128 + canary 1 logs. two different DB states that both hit host_authority 129 + - the rollback pod at T+3h is functional ENOUGH for evaluators 130 + (270 fps external delivery) despite 99% host_authority reject, 131 + because host_authority rejects are bounded (~10/sec) and most 132 + events don't trigger the check at all (account cache hit) 133 + - baseline `/proc` snapshots are valuable: 47 threads, 0 D, clean 134 + accept queues is the healthy shape. future canary failures 135 + should capture this same shape for comparison 136 + 137 + ## what i want from you 138 + 139 + 1. **an explanation of `relay_persist_order_spins_total`** — which 140 + mutex, what code path, and whether 33M/sec is the "many 141 + threads spinning because the owner is stuck in DB" shape i think 142 + it is. 143 + 2. **confirmation of the DbRequestQueue deadlock-on-wedge hypothesis** 144 + — does the health check's DB probe go through DbRequestQueue 145 + too? if yes, then the fast 500 is consistent. if it's a separate 146 + connection pool, my hypothesis is wrong and i need to look 147 + elsewhere. 148 + 3. **guidance for canary 2**. i think the signal to add is per- 149 + DbRequestQueue-operation timing or depth. if we can see the 150 + queue depth climb and stall, we catch the wedge before it 151 + propagates to the broadcaster. the `pool_queued_bytes` climb is 152 + already a candidate signal but i'm not sure what "pool" it 153 + measures (resolver pool? broadcast pool? frame worker pool?). 154 + 4. **a fix for the orphaned account.host_id=0 rows**. one-off 155 + SQL to either delete or re-resolve those 239k accounts would 156 + reduce cold-start host_authority pressure by ~4% independently 157 + of any code change. not a root-cause fix, but cheap to land. 158 + 159 + ## what i am doing on the operator side 160 + 161 + - running the rollback pod (`zat21-b91382b`) until you ship something 162 + better. external evaluators can evaluate zlay from this state at 163 + ~270 fps. 164 + - actively monitoring via the new `scripts/zlay-probe` skill (see 165 + `.claude/skills/zlay-diagnose/SKILL.md`). i will reach out if any 166 + of the pressure signals (pool_queued_bytes, persist_order_spins 167 + rate, D-state threads, accept queue non-zero) start climbing. 168 + - NOT writing any zlay code. operator purview only. 169 + - evidence snapshots saved at: 170 + - `/tmp/canary1-healthy-{ss,ps}.txt` — T+2 baseline 171 + - `/tmp/canary1-baseline.txt` — T+28 metrics 172 + - `/tmp/canary1-fail-{ss,ps,logs,metrics}.txt` — T+40 failure 173 + - `/tmp/canary1-db-audit.txt` — host table + orphan audit 174 + - `/tmp/b91382b-3h-baseline-metrics.txt` — rollback at T+3h
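the `/proc` evidence referenced throughout this report (thread states, listen-socket accept queues) was collected without `ss`/`ps` in the image. a sketch of the fallback, assuming it runs inside the container where the relay is pid 1:

```python
# /proc fallback for the thread-state and accept-queue checks.
# sketch only — run inside the pod (e.g. via `kubectl exec`).
from collections import Counter
from pathlib import Path

def thread_states(pid: int = 1) -> Counter:
    """count threads by state letter (R running, S sleeping, D uninterruptible)."""
    states: Counter = Counter()
    for stat in Path(f"/proc/{pid}/task").glob("*/stat"):
        # the state field comes right after the closing paren of the comm field
        states[stat.read_text().rsplit(")", 1)[1].split()[0]] += 1
    return states

def listen_rx_queues() -> list[tuple[str, int]]:
    """return (local hex addr:port, rx accept-queue depth) for LISTEN sockets."""
    rows = []
    for line in Path("/proc/net/tcp").read_text().splitlines()[1:]:
        f = line.split()
        if f[3] == "0A":                      # state 0A = LISTEN
            rx = int(f[4].split(":")[1], 16)  # tx_queue:rx_queue, hex-encoded
            rows.append((f[1], rx))           # 0BB8 / 0BB9 = ports 3000 / 3001
    return rows

print(thread_states())     # healthy shape in this report: ~47 threads, 0 'D'
print(listen_rx_queues())  # healthy shape: rx queue 0 on both listen sockets
```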
+884
docs/ops-changelog.md
··· 5 5 6 6 --- 7 7 8 + ## 2026-04-09 9 + 10 + ### zlay host_authority fix, crash loop, and probe workaround 11 + 12 + long day. three deploys, one rollback, one re-deploy with a k8s patch. 13 + final state is stable but the root cause is still not identified. 14 + 15 + **1. diagnosis (morning).** investigation into why zlay sat at ~98.1% relay-eval 16 + coverage while pulsar showed 54% exposed two problems. the pulsar gap turned 17 + out to be the `ConsumerTooSlow` issue documented on 2026-04-01 — pulsar's 18 + 60-min window accumulated repeated kicks against the 8192-entry per-consumer 19 + ring buffer (~33s of headroom at 250 fps). the relay-eval gap was a separate, 20 + larger bug: `relay_validation_failed{reason="host_authority"}` was climbing 21 + at ~10/sec, and cumulative failure rate was 99.54% (1,210,698 / 1,214,857 22 + lifetime). sampling log lines + manually resolving 15 of the rejected DIDs 23 + against plc.directory showed 13/15 had DID docs whose PDS hostname exactly 24 + matched the incoming host — zlay was dropping legitimate events. 25 + 26 + **2. ee4e368 — buffer fix + host_authority reject breakdown counters.** 27 + bumped `Consumer.BUFFER_CAP` from 8192 → 65536 in `src/broadcaster.zig` (~4.4 28 + min of headroom at 250 fps instead of ~34s). split the single 29 + `failed_host_authority` counter into six per-branch counters in 30 + `src/validator.zig checkPdsHost`: 31 + `relay_host_authority_reject{branch="parse_did|resolve|no_endpoint|bad_url|unknown_host|host_mismatch"}`. 32 + 48 minutes after deploy: 39,621 / 40,072 rejects in the `resolve` branch — 33 + **100% of failures were in `resolver.resolve(parsed)` throwing on both 34 + the first attempt and the retry.** no rejects in any other branch. 35 + `sampleLogReject` was wired into the four `checkPdsHost` branches but NOT 36 + into the resolve branch, so there were zero sampled warn lines — we knew 37 + WHICH branch was failing but not WHY. the underlying error was swallowed 38 + twice (`zat/transport.zig:70` → `error.RequestFailed` → 39 + `zat/did_resolver.zig:98` → `error.DidResolutionFailed` → validator's catch 40 + → `.reject`), so no information made it to logs. 41 + 42 + **3. hypothesis (wrong, but led to a working workaround).** zlay git history 43 + shows a pattern: `keep_alive` was disabled at one point (`3bddabd` — TLS 44 + connection pool was eating ~3 GiB RSS), re-enabled (`d470580`) after the 45 + `toArrayList` leak fix in zat v0.2.14, and the resolver pool of 4 `DidResolver` 46 + instances was added in `1639565` (2026-03-18). then the zig 0.16 migration 47 + landed in `9cc1ba3` (2026-04-05). host_authority failures became 100% from 48 + that point on. framed this as "zig 0.16's `std.http.Client.fetch` keep-alive 49 + handling doesn't recover from stale connections when a long-lived client sits 50 + idle between calls" and asked the engineer to test keep_alive=false on the 51 + pool. this framing was wrong — see section 6. 52 + 53 + **4. 584571a (build fail) → bbba92c (build fix).** engineer shipped 54 + `keep_alive=false` on `host_resolvers` at `src/validator.zig:126` plus wired 55 + `sampleLogReject` into the resolve and parse_did branches. first attempt 56 + (584571a) failed to compile under `zig build` with 57 + `src/validator.zig:591:21: error: error set is discarded` — `_ = err1;` is 58 + not how zig 0.16 discards an error value; the outer `catch |err1|` capture 59 + needs to not exist at all. 
`zig build test` didn't catch this because no 60 + test referenced `resolveHostAuthority`, so lazy analysis skipped the function 61 + body. `bbba92c` dropped the `|err1|` binding entirely. engineer added a rule: 62 + for validator/subscriber/frame_worker changes, run `zig build` (host exe) 63 + in addition to `zig build test` so the exe reference graph forces analysis. 64 + 65 + **5. bbba92c deployed and initially worked.** 14 minutes after deploy: 66 + 1,259 checks, 3 rejects total (2 `unknown_host` + 1 `host_mismatch` — all 67 + legitimate policy drops), 0 `resolve` branch rejects. **reject rate 68 + collapsed from 98.87% to 0.24%.** empirical workaround confirmed. 69 + 70 + **6. engineer's local repro falsified the hypothesis.** after shipping, the 71 + engineer wrote a standalone repro: 1,624 calls × ReleaseFast/Debug × zig 3059 72 + and 3070 × serial/parallel × single-client/multi-client × 0-10s idle — all 73 + green. **none of those dimensions could reproduce the failure.** the key 74 + observation that surfaced: `resolveLoop` at `validator.zig:449` is still on 75 + `keep_alive=true` and works fine in the same process, same zig, same kernel. 76 + 77 + **better hypothesis — no zig stdlib involved**: the pool (`validator.zig:65`, 78 + `host_resolvers: [4]zat.DidResolver`) creates its 4 `DidResolver` instances 79 + once at init (line 143) and **never recreates them**. `acquireHostResolver` / 80 + `releaseHostResolver` (600, 612) just toggle availability atomics; no 81 + rebuild on failure. `resolveHostAuthority` does two calls on the same 82 + resolver, catches both, returns reject — it never destroys and re-inits 83 + the slot. hypothesis: **one transient failure puts a resolver's http client 84 + state into a bad condition (half-finished handshake, stale TLS session, 85 + server-side close the client didn't notice), and because nothing ever 86 + recreates the slot, it stays poisoned.** over minutes, all 4 slots 87 + accumulate poison and every subsequent check fails twice in a row and 88 + rejects. `keep_alive=false` fixes the symptom because the http client 89 + doesn't reuse any state between calls — nothing to poison. 90 + 91 + `resolveLoop` likely "works" under keep_alive=true for a mundane reason: 92 + it's a single-threaded worker that makes ONE call per DID and `continue`s 93 + on any error, no spin-lock contention, no rapid-retry-on-same-client 94 + pattern. if its resolver ever went bad, the cache would silently stop 95 + growing — worth measuring `validator_cache_entries` across a long window 96 + to confirm it's genuinely resilient and not just quietly broken too. 97 + 98 + **correction worth recording: I initially framed this as a "zig 0.16 99 + std.http.Client stale keep-alive" bug and asked for an upstream zig issue. 100 + that framing was wrong — I was blaming the standard library for what is 101 + almost certainly a missing-recovery-path in zlay's own pool code. the 102 + proper fix is in `src/validator.zig`, not in zig stdlib: on any 103 + `resolve()` failure in the pool, destroy and re-init the slot before 104 + releasing it back. the workaround happened to fix the symptom but we now 105 + have a fix and no root cause — the worst state to leave a bug in.** 106 + 107 + **7. bbba92c crash loop (afternoon).** ~9 hours after the bbba92c deploy the 108 + pod had 25 container restarts. 
exit code 137 (SIGKILL), pattern: container 109 + runs ~20 min, kubelet events show 110 + `Container main failed liveness probe, will be restarted`, 549 readiness 111 + probe timeouts and 265 liveness probe timeouts across the window, all 112 + `context deadline exceeded`. cause: `keep_alive=false` means every 113 + host_authority check does a fresh DNS + TCP + TLS + HTTPS round-trip to 114 + plc.directory (~400-900ms per call). on a cold pod, 2,800 PDS subscribers 115 + all reconnect concurrently and generate an `is_new` storm on the 4-slot 116 + resolver pool. frame workers spin-blocking on `acquireHostResolver` starve 117 + the HTTP server fiber, `/_healthz` misses its 5s probe timeout, kubelet 118 + SIGKILLs after 10 consecutive failures (5 min), pod restarts, storm repeats. 119 + infinite loop. **the old keep_alive=true pool was slow in a different way 120 + but fit within the probe budget — the correctness workaround broke the 121 + performance budget.** 122 + 123 + **8. rollback → reapply with probe patch.** rolled back to 124 + `ReleaseFast-ee4e368` momentarily (brings back 100% host_authority reject 125 + but kills the crash loop, relay goes back to ~98% relay-eval coverage). 126 + then immediately patched the Deployment's liveness/readiness probes to 127 + tolerate the cold-start storm and re-applied `ReleaseFast-bbba92c`: 128 + 129 + | field | old | new | 130 + |---|---:|---:| 131 + | liveness `initialDelaySeconds` | 30 | 300 | 132 + | liveness `timeoutSeconds` | 5 | 15 | 133 + | liveness `failureThreshold` | 10 | 20 | 134 + | readiness `initialDelaySeconds` | 10 | 60 | 135 + | readiness `timeoutSeconds` | 5 | 15 | 136 + | readiness `failureThreshold` | 5 | 20 | 137 + 138 + this is a k8s Deployment config change, **not a code change** — same 139 + `bbba92c` binary, same keep_alive=false workaround still active. 140 + 141 + **9. verification after reapply.** new pod stayed at 0 restarts through the 142 + full 5-minute warmup window and beyond. at 6m21s uptime: 143 + 144 + | metric | value | 145 + |---|---| 146 + | `host_authority_checks_total` | 274 | 147 + | `host_authority_reject{resolve}` | **0** | 148 + | `host_authority_reject{unknown_host}` | 2 (legit policy drops) | 149 + | mean host_authority latency | 356ms/check | 150 + | `workers_count` | 1,060 (ramping toward ~2,800) | 151 + | `frames_received` | 164k at ~413 fps | 152 + | `frames_broadcast` | 76k (46% of received, consumer attached) | 153 + | `slow_consumers_total` | 0 | 154 + | `rss` | 716 MiB | 155 + 156 + ### open items 157 + 158 + - **host_authority root cause still not identified.** current best guess: 159 + pool's missing recovery path — a resolver slot gets into a bad state 160 + once and stays there forever because nothing recreates it. proper fix 161 + is in `src/validator.zig` pool code, not in zig stdlib. planned 162 + production experiment: init 3 of 4 pool slots with `keep_alive=false` 163 + (safe), 1 slot with `keep_alive=true` (broken path). captures 164 + `@errorName` from the actual failure via the sampled warn log that 165 + `bbba92c` wires up. small blast radius, fully reversible, NOT YET RUN. 166 + also worth measuring: does `resolveLoop`'s resolver ever silently go 167 + bad (cache fill stops growing)? if yes, the pool isn't unique and the 168 + fix needs to touch both. 169 + 170 + - **workaround-on-a-workaround.** `keep_alive=false` costs ~350-900ms per 171 + check. the 4-slot pool saturates under load. 
the probe relaxation is 172 + not a fix — it's a band-aid over the performance consequence of the 173 + correctness workaround. proper long-term fixes (any one of these): 174 + 1. **add slot recovery** — destroy and re-init a resolver on any 175 + `resolve()` failure before releasing it back to the pool. this is 176 + the most direct fix and it's entirely in zlay code. 177 + 2. bump pool size from 4 → 32+ so it can absorb reconnect storms 178 + 3. make host_authority check async / bounded — don't block frame 179 + workers on the hot path 180 + 4. skip host_authority for the first N minutes of uptime (warmup grace) 181 + 182 + - **tight liveness probe**. the 5-minute initial delay means if the pod ever 183 + genuinely hangs during startup, it takes 5 min before kubelet notices. 184 + re-tighten once the pool recovery path is added and `keep_alive=true` 185 + can be restored. 186 + 187 + - **stability dropouts (still open from 2026-04-08).** ~33% historical 188 + dropout rate on relay-eval, suspension experiment showed some 189 + correlation with the 4h reconnect cronjob but not all dropouts lined 190 + up. worth re-evaluating now that the ConsumerTooSlow buffer is bumped — 191 + possible that some of those "stability" drops were actually consumer 192 + kicks on relay-eval's longer windows. 193 + 194 + commits: `ee4e368`, `584571a` (build fail), `bbba92c`, k8s deployment patch 195 + (not a commit — lives in cluster state, should be captured in 196 + `zlay/deploy/` if we want it persistent across helm re-installs). 197 + 198 + ### probe relaxation committed to source; external review + 795cc41 bundle 199 + 200 + after the bbba92c crash loop was stabilized via the probe relaxation, did 201 + three more things: 202 + 203 + **1. captured the probe settings in `zlay/deploy/zlay-values.yaml`** with a 204 + NOTE block explaining why they're relaxed and what to check before tightening 205 + again. this closes the "cluster-only probe patch" gap — a helm re-apply will 206 + no longer revert to the old tight probes. 207 + 208 + **2. wrote `docs/zlay-external-review-2026-04-09.md`**: a comprehensive 209 + write-up of both problems (resolver pool failure + HTTP fiber starvation), 210 + what we measured, what the engineer's standalone repro ruled out, what we 211 + explicitly don't know, and a code-pointer appendix. framed as a request for 212 + external review rather than a "here's the fix" document. explicit about the 213 + fact that we've been wrong about the bug's shape once already (the 214 + "zig 0.16 stale keep-alive" framing that got falsified by the local repro). 215 + 216 + **3. engineer shipped `795cc41`** — a bundled PR addressing four items from 217 + the review doc: 218 + 219 + - **cold-start per-host DB wait eliminated.** `spawnWorker` (`slurper.zig:535`) 220 + was doing a blocking `DbRequestQueue` round-trip for 221 + `getEffectiveAccountCount(host_id)` per host, ~2,770 round-trips during 222 + cold start each yielding the spawn fiber. folded the `COALESCE + JOIN + 223 + COUNT + GROUP BY` into the existing batch `listActiveHostsImpl` query 224 + (`event_log.zig:711`). `Host` struct gains an `effective_account_count` 225 + field. per-host fetch kept inline at the `addHost` path (one-off 226 + `requestCrawl` handler, rare). removes ~2,770 blocking DB waits from the 227 + Evented startup fiber. 
228 + - **resolver slot recovery.** `resolveHostAuthority` (`validator.zig:568`) 229 + used to retry `resolve()` on the same pool slot after failure — if the 230 + slot's http client was in a bad state, the retry was wasted and the slot 231 + stayed bad forever. now: on first-attempt failure, `deinit` + re-init the 232 + slot before the retry. directly tests the "poisoned slot" hypothesis 233 + without making any zig stdlib claims. 234 + - **zat transport error visibility.** bumped zat to `v0.3.0-alpha.23` 235 + (commit `93d97be` on zat main): `DidResolver.resolve` and 236 + `HttpTransport.fetch` now propagate the underlying `std.http.Client.fetch` 237 + error instead of collapsing everything to `error.DidResolutionFailed` / 238 + `error.RequestFailed`. the existing `sampleLogReject("resolve", did, 239 + @errorName(err), ...)` call in the resolver pool will now log the actual 240 + transport error kind (`UnknownHostName`, `ConnectionRefused`, `TlsAlert`, 241 + etc.) instead of always `DidResolutionFailed`. no zlay code change needed — 242 + free upgrade from the dep bump. 243 + - **six new pool/loop metrics** in `broadcaster.zig`, emitted on `/metrics`: 244 + `relay_host_resolver_acquire_wait_us_total`, 245 + `relay_host_resolver_in_use`, 246 + `relay_host_resolver_resets_total`, 247 + `relay_host_resolver_resolve_fail_total`, 248 + `relay_resolve_loop_resolve_ok_total`, 249 + `relay_resolve_loop_resolve_fail_total`. 250 + - **configurable pool size.** `HOST_RESOLVER_POOL_SIZE` env var (default 4, 251 + hard cap 64), replacing the hard-coded constant. default unchanged for this 252 + deploy — the dial exists for tuning based on the new `acquire_wait_us_total` 253 + signal, not to bake in a speculative number. 254 + 255 + verified against `zig build`, `zig build test --summary all` (261/296 pass, 256 + 35 skipped, 0 fail), `zig fmt --check`. zat alpha bump verified empirically 257 + against three structurally distinct failure modes (HTTP 404, DNS NXDOMAIN, 258 + TCP refused) producing three distinct error names through the resolver API, 259 + with a regression test added to zat's `did_resolver.zig` to lock the 260 + invariant. zero breakage across all seven sibling zat consumers (zlay, 261 + find-bufo, coral, prefect-server, pollz, labelz, bot). 262 + 263 + ### reviewer's interpretation notes for the 795cc41 metrics 264 + 265 + - **leave `HOST_RESOLVER_POOL_SIZE` at default 4 for the first deploy.** 266 + `795cc41` adds observability and a tuning knob; it should not also silently 267 + change concurrency on the first measurement run. decide whether to bump 268 + based on the new `acquire_wait_us_total` signal, not a guess. 269 + - **`relay_host_resolver_resets_total` should stay near zero.** with 270 + `keep_alive=false`, slots can't go bad in the keep-alive cached-connection 271 + sense, so the recovery path is dormant infrastructure. **this is NOT 272 + validation that the fix works** — that comes from a canary run (3 slots 273 + `keep_alive=false` + 1 slot `keep_alive=true`) where we'd expect the 274 + recovery counter to actually fire. if the counter rises anyway in the 275 + default config, it means slot recovery is catching something other than 276 + stale keep-alive state, which is a real signal worth chasing. 277 + - **`relay_resolve_loop_resolve_fail_total` is a new baseline, not a 278 + regression.** the background `resolveLoop` was previously failing at 279 + whatever rate it was, silently, with `log.debug + continue`. 
whatever number 280 + shows up post-deploy is visibility, not a new problem. if it's a meaningful 281 + fraction of `resolve_loop_resolve_ok_total` that's something we didn't know 282 + we had. if it's ~0, the "resolveLoop is healthy" assumption is finally 283 + proven rather than hand-waved. 284 + 285 + ### what 795cc41 explicitly does NOT do 286 + 287 + - re-enable `keep_alive=true` anywhere (canary is next, after a few hours of 288 + clean steady-state on 795cc41) 289 + - touch the spawn batch loop (50 hosts / 100ms yield at `slurper.zig:665`) — 290 + the reviewer was explicit: slim the per-host work first, re-measure, 291 + THEN decide whether batching needs tuning. this commit does the slim half. 292 + - split liveness probes onto a dedicated thread — fallback if the slimmed 293 + spawn fiber alone doesn't stop the flap 294 + - identify the root cause of the 2026-04-08 ~99% rejection rate. slot 295 + recovery + error visibility + pool metrics set up the infrastructure to 296 + find out on the next canary; they don't diagnose retroactively. 297 + 298 + ### known non-blockers (reviewer note) 299 + 300 + - backfill/cleaner are joinable but still not cooperatively cancellable. 301 + operationally rough during shutdown but not a blocker for 795cc41. 302 + 303 + ### 795cc41 deploy results (2026-04-09 ~17:15 UTC) 304 + 305 + deployed `ReleaseFast-795cc41` at 17:15:56Z. pod `zlay-65f7c4cbc9-ptkb4`. 306 + observed the full ~35 min post-deploy window: 307 + 308 + **spawn speedup (review item 1):** worked as predicted. from pod start to 309 + `startup complete: 2764 host(s) spawned` was **96 seconds**, vs the ~20+ 310 + minutes on `bbba92c`. eliminating the per-host `DbRequest.wait()` for 311 + `getEffectiveAccountCount` was ~20x on cold-start. single most impactful 312 + change of the day. 313 + 314 + **stability:** pod ran cleanly for the first ~11 minutes, then readiness 315 + flapped to 0/1 for ~2m36s (11m → 14m uptime), then recovered to 1/1 and 316 + **stayed stable for the next 20+ minutes with 0 container restarts**. the 317 + relaxed probes (`initialDelay=300s timeout=15s failureThreshold=20`) absorbed 318 + the flap without kubelet killing anything. the flap window corresponds 319 + exactly to the post-spawn reconnect storm — 2,764 subscribers all came 320 + online within 96 seconds and started pumping cold events into the 321 + host_authority pool simultaneously, producing temporary Evented scheduler 322 + pressure until the initial is_new rate drained. 323 + 324 + **reviewer's predictions on the new metrics — all confirmed:** 325 + 326 + - `resolve_loop_resolve_ok_total = 68,326` / `resolve_loop_resolve_fail_total = 0`. 327 + this is the first time we've *proved* the background signing-key 328 + `resolveLoop` is healthy instead of assuming it from cache fill rate. 329 + zero failures over 68k resolves over 34 min. 330 + - `host_resolver_resets_total = 1` over 34 min. slot recovery is dormant, 331 + as the reviewer predicted with `keep_alive=false`. if the counter starts 332 + rising in default config, that's a real signal worth chasing. 333 + - `host_resolver_resolve_fail_total = 1` over 34 min. first-attempt pool 334 + failures are essentially zero with `keep_alive=false`, confirming the 335 + 2026-04-08 workaround holds in steady state. 
336 + 337 + **pool contention during the burst:** 338 + 339 + - `host_resolver_acquire_wait_us_total = 14,557,094,400 µs` (~4 hours 340 + cumulative wait) over ~17,500 checks during the cold-start window 341 + - that's ~831ms average wait per slot acquire, on top of ~387ms per 342 + resolve → ~1.2s per host_authority check from the frame worker's side 343 + - after the burst drained (~15m uptime onward), `host_resolver_in_use` went 344 + to 0 and `acquire_wait_us_total` stopped climbing — pool is idle in 345 + steady state 346 + - the 4-slot default is under-provisioned for a 2,764-subscriber cold-start 347 + burst. `HOST_RESOLVER_POOL_SIZE` env var exists now for tuning on the 348 + next deploy. 349 + 350 + **host_mismatch burst (new observation, not a regression):** 351 + 352 + - `host_authority_reject{branch="host_mismatch"} = 15,198` cumulative over the 353 + first ~15 minutes, then stopped climbing. at steady state the rate is 354 + ~0/sec. 355 + - all 15k rejects landed in the `host_mismatch` branch (not `resolve` like 356 + the 2026-04-08 bug). meaning: the resolver *succeeded*, returned a valid 357 + DID doc, and `pds_host_id != incoming_host_id` at the comparison. 358 + - this is qualitatively different from the 2026-04-08 resolve-branch bug. 359 + that was "DID doc lookup was failing." this is "DID doc lookup succeeded 360 + but the resolved PDS id doesn't match the subscriber id we received the 361 + event from." 362 + - no sampled warn logs captured for this burst — the log buffer rotated out 363 + during the 71k+ chain-break message flood and `kubectl logs` only retains 364 + the most recent period. next occurrence we'll need to pull logs during 365 + the burst, not after. 366 + - plausible causes (unverified): stale `account.host_id` values in the DB 367 + from a previous pod triggering `host_changed=true` on events where the 368 + current DID doc still resolves to the current subscriber's hostname; or 369 + duplicate rows in the `host` table with the same hostname and different 370 + ids causing `getHostIdForHostname` (no ORDER BY) to return inconsistent 371 + values; or something else. **not investigated yet.** none of the 372 + candidates are new in 795cc41 — this is a pre-existing behavior the 373 + faster spawn just made more visible by concentrating the is_new load. 374 + 375 + **operator error to record:** during the post-deploy window, I (claude) 376 + repeatedly read cumulative lifetime counters (`acquire_wait_us_total`, 377 + `host_authority_reject{host_mismatch}`) as steady-state rate indicators 378 + and concluded the deploy was regressing. the reviewer's note said 379 + explicitly to treat the new metrics as visibility, not as regression 380 + signals; I ignored that and nearly recommended a rollback. the fix was 381 + to pull a 10-second delta on the same counters, which immediately showed 382 + the rates were flat and the pod was in healthy steady state. rule for 383 + next time: **cumulative counters require a delta to interpret. 
never 384 + suggest a rollback based on a single snapshot of a lifetime counter.** 385 + 386 + ### net state at end of 2026-04-09 387 + 388 + - deployed image: `ReleaseFast-795cc41` 389 + - host_authority pool: `keep_alive=false`, 4 slots (default) 390 + - `zlay/deploy/zlay-values.yaml`: probes relaxed to 391 + `initialDelay=300s/60s timeout=15s failureThreshold=20` — 392 + captured in source control, no cluster-only drift 393 + - steady state: 1/1 Running, 0 restarts, `/metrics` responsive, pool idle, 394 + `resolve_loop` healthy, no host_authority rejects currently firing 395 + - known open items: root cause of 2026-04-08 99% reject rate still not 396 + identified (infrastructure for next canary is in place); the 397 + host_mismatch burst pattern is a new observation worth investigating 398 + on the next cold-start; `HOST_RESOLVER_POOL_SIZE` dial exists but 399 + left at default 4 per reviewer guidance 400 + 401 + commits (795cc41 bundle): `zlay/795cc41`, `zat/93d97be` (tagged as `v0.3.0-alpha.23`). 402 + 403 + ### 4f3d1d4 gcLoop stabilization + honest handoff to next operator 404 + 405 + shipped `4f3d1d4` on top of `795cc41`: disabled `malloc_trim(0)` inside 406 + `gcLoop`, bumped the gc interval from 10 min to 1 hour, added timing 407 + instrumentation around `dp.gc()`. hypothesis at the time was that the 408 + recurring ~10-minute "/metrics unresponsive" cycle on `bbba92c` and 409 + `795cc41` was caused by `gcLoop` holding `DiskPersist.mutex` + 410 + `malloc_trim` holding the glibc arena lock. full write-up in 411 + `docs/zlay-gcloop-stall-2026-04-09.md`. 412 + 413 + **the hypothesis did not survive the deploy.** the `4f3d1d4` pod 414 + (`zlay-6c776bf9b9-zv9pl`) exhibits the same symptom starting at ~10-12 415 + minutes of uptime, well before the new 1-hour gc interval would have 416 + fired. so whatever the root cause is, it is not `gcLoop` and not 417 + `malloc_trim`. those changes may still be worth keeping for other 418 + reasons, but they are not the fix. 419 + 420 + ### current deployed state (2026-04-09 ~20:40 UTC, at handoff) 421 + 422 + - image: `atcr.io/zzstoatzz.io/zlay:ReleaseFast-4f3d1d4` 423 + - pod: 1/1 Running, 0 restarts, ~20 min uptime 424 + - probes: relaxed (`initialDelay=300s timeout=15s failureThreshold=20`), 425 + captured in `zlay/deploy/zlay-values.yaml` 426 + - `/metrics`: does not return within any reasonable curl timeout (>45s 427 + tested) 428 + - `/_healthz`: ~11 s response 429 + - `/_readyz`: ~17 s response (above the 15 s probe timeout — the pod is 430 + surviving because kubelet probes happen to hit faster windows between 431 + slower ones) 432 + - prometheus scrapes: `health=down` with `context deadline exceeded` at 433 + the 10 s scrape timeout, continuous since the deploy 434 + - grafana: no data (prometheus has nothing to show) 435 + - relay-eval: has been reporting zlay coverage near 0% for much of the 436 + last several hours while other relays are at 99%+ 437 + - node: load avg 67 on 8 cores, but zlay itself uses ~0.26 cores of CPU 438 + and all 47 zlay threads are in S-state (sleeping) — nothing is 439 + obviously burning CPU 440 + - reconnect cronjob: active, default schedule `0 */4 * * *`, next fire at 441 + 00:00 UTC 442 + - host_authority pool: `keep_alive=false`, 4 slots, slot recovery present 443 + (from `795cc41`) 444 + 445 + ### what has been falsified today 446 + 447 + 1. **"zig 0.16 `std.http.Client` stale keep-alive handling is broken"** — 448 + the original framing for the host_authority 99.5% reject rate. 
engineer's 449 + standalone repro (1,624 calls across many dimensions) could not 450 + reproduce. `keep_alive=false` fixes the symptom, but the diagnosis was 451 + not proven. root cause still open. 452 + 2. **"scheduler contention from 2,800 subscriber fibers starves 453 + `Consumer.writeLoop`"** — I wrote this up in 454 + `docs/zlay-broadcaster-starvation-2026-04-09.md` and proposed moving 455 + writeLoop to `pool_io`. engineer correctly pointed out that this 456 + re-enters the cross-Io crash class fixed in commit `6674812` — the 457 + underlying websocket conn is Evented, so driving writeLoop from a 458 + plain thread would NULL-deref `Thread.current()` under ReleaseFast. 459 + do not do this. 460 + 3. **"`gcLoop` + `malloc_trim` every 10 min is the root cause of the 461 + flap pattern"** — the 4f3d1d4 deploy with gc bumped to 1 hour shows 462 + the same symptom at ~10 min uptime. whatever is causing the HTTP 463 + server slowness fires before the new gc interval, so gc is not it. 464 + 465 + ### what is still real and probably worth fixing 466 + 467 + - **`Consumer.writeLoop` uses `io.sleep(100ms)` polling instead of 468 + `cond.wait`** (`broadcaster.zig:439-477`). this caps per-consumer drain 469 + at 10 frames/sec even with zero contention, and makes `cond.signal` 470 + in `Consumer.enqueue` (line 413) a no-op. this is a real bug 471 + independent of everything else above, but it is not the cause of the 472 + HTTP server being slow. zlay engineer's Option A in the 2026-04-09 473 + review exchange is the right fix: use `cond.wait`, keep everything on 474 + Evented, move ping scheduling out-of-band. do NOT move writeLoop to 475 + `pool_io`. 476 + - **`DiskPersist.gc()` holds the persist mutex for its entire duration** 477 + (`event_log.zig:977-1033`). this blocks every frame worker during gc. 478 + narrowing the mutex scope is a real follow-up regardless of whether 479 + gc turns out to be the primary cause of anything. 480 + - **zlay's HTTP server fibers live on main `Io.Evented` io**, shared with 481 + ~2,800 PDS subscriber fibers. this is still a plausible source of 482 + trouble for short-lived consumer subscriptions (relay-eval, pulsar), 483 + but I do not have direct evidence that the scheduler is the bottleneck 484 + vs. something else in the Evented io backend. the engineer's Option A 485 + should be tried before drawing more conclusions about the scheduler. 486 + 487 + ### what the next operator should know 488 + 489 + - **do not trust a single steady-state `/metrics` snapshot as evidence of 490 + stability.** I made this mistake twice today. the pattern is: pod runs 491 + cleanly for ~10 minutes, then degrades. any probe taken in the first 492 + 10 minutes will look fine. 493 + - **cumulative counters are not the same as rate deltas.** I spent hours 494 + today interpreting lifetime counter values as ongoing problems when 495 + they were cold-start artifacts, and vice versa. pull two snapshots at 496 + least a few seconds apart and diff them before drawing conclusions. 497 + - **the relay-eval runs with `connected=True` + low DID counts are a 498 + distinct failure mode from `connected=False` runs.** the former means 499 + zlay accepted the subscription but delivered few events; the latter 500 + means the handshake itself failed or the connection dropped. 
501 + - **there is a structured monitor timeline in `/tmp/zlay-diag/`** on my 502 + laptop from today's session — it contains a ~2-hour window of pod 503 + state + metrics availability + key counter deltas at 30 s cadence 504 + across the bbba92c rollback, the 4f3d1d4 deploy, and the current stuck 505 + state. includes the full Phase 1/2/3/etc timeline for the bbba92c pod 506 + that eventually got kubelet-killed. 507 + - **the most recent pod I remember reading `/metrics` from reliably in 508 + this session was `31825b2`** (the 2026-04-07 FrameWork-UAF-fix pod, 509 + with the then-current 99.5% host_authority reject bug). every pod I 510 + shipped on top of that has the "HTTP server stops responding" symptom. 511 + I do not know whether the symptom was also present on 31825b2 and I 512 + just happened to miss it, or whether something I shipped today 513 + triggered it. rolling back to 31825b2 would re-break host_authority 514 + but might restore observability — worth considering purely as a 515 + diagnostic step. 516 + 517 + ### open items at handoff 518 + 519 + - Consumer.writeLoop cond.wait fix (engineer's Option A, not shipped) 520 + - DiskPersist.gc() mutex scope narrowing 521 + - actual root cause of the ~10-minute "/metrics unresponsive" cycle 522 + - why relay-eval reports 0% for zlay during healthy-looking pod windows 523 + - whether the cold-start is_new host_authority storm (seen at pod start 524 + with 4/4 pool saturation and 800+ms acquire wait) ever fully drains on 525 + the 4-slot default pool, or whether there's a permanent backlog state 526 + 527 + commits this session: `zlay/ee4e368`, `zlay/584571a` (build fail), 528 + `zlay/bbba92c`, `zlay/795cc41`, `zlay/4f3d1d4`, `zat/93d97be`. docs 529 + added: `docs/zlay-external-review-2026-04-09.md`, 530 + `docs/zlay-broadcaster-starvation-2026-04-09.md` (wrong diagnosis, 531 + preserved for context), `docs/zlay-gcloop-stall-2026-04-09.md` (also 532 + wrong diagnosis). `zlay/deploy/zlay-values.yaml` updated with relaxed 533 + probes. 534 + 535 + ### operator rollback to b91382b — restored external service 536 + 537 + picked this back up later in the day as a new session. goal stated 538 + explicitly: get zlay into a state where external evaluators 539 + (relay-eval, pulsar, tap, hydrant) can actually reach it. code fixes 540 + are out of scope for this session; rollback and dial-turning are in 541 + scope. 542 + 543 + **confirmed the symptom is live on the `4f3d1d4` pod** 544 + (`zlay-6c776bf9b9-zv9pl`, 28m uptime): 545 + 546 + | probe | result | 547 + |---|---| 548 + | internal `/metrics` (port-forward, t=0) | 200 in 0.5s | 549 + | internal `/metrics` (port-forward, t=+15s) | **hung >10s** | 550 + | external `https://zlay.waow.tech/_health` | **503** (ingress sees backend dead) | 551 + | external `describeServer` | **503** | 552 + | `endpoints/zlay` in k8s api | **empty** | 553 + | pod `Ready` condition | flipped `True→False` at 20:59:34Z | 554 + 555 + so the reported ~10-min degradation cycle is reproducing. pod is 556 + cycling in and out of `Ready` and service endpoints follow. external 557 + consumers cannot handshake, which is why relay-eval shows 0%. 558 + 559 + also captured: 560 + `relay_frames_received_total = 742988`, 561 + `relay_frames_broadcast_total = 478` (frozen at cold-start value), 562 + `host_authority_reject{branch="host_mismatch"} = 15,257` on 17,236 563 + checks (88% reject on the host_mismatch branch specifically — the 564 + new burst pattern the 795cc41 entry flagged; still unexplained). 
565 + 566 + **rollback attempt 1 — `ReleaseFast-b91382b` → ImagePullBackOff.** 567 + tried rolling back to the 04-06 last-known-good image via 568 + `kubectl set image`. the tag is not in the remote registry; zlay's 569 + `publish-remote` uses `ctr images import` straight into k3s 570 + containerd on the server, so older builds only live in the server's 571 + local image store and the registry may have GC'd older pushes. 2m6s 572 + in `ImagePullBackOff` before I caught it and reverted. 573 + 574 + **rollback attempt 2 — reverted back to `4f3d1d4` momentarily**, then 575 + SSH'd to the server and ran `ctr -n k8s.io images ls | grep zlay` to 576 + see what's actually cached locally. found **27 `ReleaseFast-*` images 577 + and 15+ `ReleaseSafe-*` images** cached. the 04-06 build exists, but 578 + under a non-standard tag: 579 + `atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b` (the `zat21-` 580 + prefix was added to mark it as the zat v0.3.0-alpha.21 build). 581 + 582 + **rollback attempt 3 — `ReleaseFast-zat21-b91382b`** rolled out 583 + cleanly. new pod `zlay-679c966669-6xg5v` was `1/1 Ready` in ~60s. 584 + probed immediately: 585 + 586 + | probe | result | 587 + |---|---| 588 + | external `https://zlay.waow.tech/_health` | `200` `{"status":"ok"}` in 0.29s | 589 + | external `xrpc/_health` | `200` in 0.23s | 590 + | external `describeServer` | `404` (endpoint not implemented on this old build; not a blocker) | 591 + | raw `wss://.../subscribeRepos` (python websockets, 15s) | **5,896 frames received, ~395 fps** | 592 + | `just zlay test-tap 30` | connected, ran full 30s without error | 593 + 594 + the raw-websocket delivery measurement is the headline. the 595 + broadcaster-starvation doc from earlier in the day measured 596 + **6 frames in 170s (0.035 fps)** on `bbba92c` under the same 597 + conditions. rollback improved downstream delivery by **~11,000×**. 598 + 599 + **metrics delta with an active consumer attached (15s window):** 600 + 601 + | metric | t0 | t1 | Δ | 602 + |---|---:|---:|---:| 603 + | `frames_received_total` | 472534 | 479377 | +6843 | 604 + | `frames_broadcast_total` | 20828 | 27629 | +6801 | 605 + | `consumers_active` | 1 | 1 | — | 606 + 607 + **delivery ratio 6801/6843 = 99.4%.** compared to the broken pods 608 + where `frames_broadcast_total` stayed stuck at its lifetime value 609 + even during an active consumer subscription. this is conclusive: on 610 + b91382b, when a consumer connects, the broadcaster fans out to them 611 + at ingest rate. on 4f3d1d4/795cc41/bbba92c it does not. 612 + 613 + **host_authority on b91382b: `validation_failed{reason="host_authority"} = 1 / 513` = 0.19%.** 614 + the 99.5% reject rate that was the first headline bug of the day is 615 + **not present on b91382b**. this build does not have the per-branch 616 + reject breakdown (that lexicon came in ee4e368), so we only have the 617 + aggregate counter — but the aggregate is fine. this narrows the 618 + introduction window for the host_authority bug to **04-06 → 04-07**, 619 + the same window the broadcaster-starvation doc identified as when 620 + downstream delivery broke. two bugs, same window, possibly the same 621 + commit. the broadcaster doc already flagged `1eec324` (FrameWork UAF 622 + fix, 04-07) as suspicious because it changed how FrameWorks are 623 + queued. 
624 + 625 + **current deployed state (end of handoff-2):** 626 + 627 + - image: `atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b` 628 + (04-06, Evented/ReleaseFast, zat v0.3.0-alpha.21) 629 + - pod: `zlay-679c966669-6xg5v`, 1/1 Ready, 0 restarts 630 + - public `/health`: 200 in ~0.3s 631 + - firehose delivery: ~400 fps to connected consumers (99.4% of ingest) 632 + - `relay_validation_failed{reason="host_authority"} = 7 / 551 = 1.3%` steady state — within normal legitimate-policy-drop range 633 + - k8s probes: still relaxed (`initialDelay=300s timeout=15s 634 + failureThreshold=20`) from the `bbba92c` era. safe to leave — 635 + they don't hurt anything on a fast-responding pod 636 + - reconnect cronjob: still active (`0 */4 * * *`, next fire 00:00 UTC) 637 + 638 + **known risks I am actively watching on b91382b:** 639 + 640 + 1. **FrameWork.hostname UAF** (fixed in 1eec324 on 04-07; b91382b 641 + pre-dates the fix). the 04-07 ops entry documents that this UAF 642 + can trigger a pod restart when "subscribers churn faster than the 643 + frame pool can drain" — which is exactly what the 4h reconnect 644 + cron does. if the next cron fire crashes the pod, rollback has to 645 + move forward to a build that has the UAF fix, and we lose the 646 + broadcaster-delivery win. 647 + 2. **unknown host_authority behavior under load.** it's 1.3% right 648 + now at workers_count ~700, but workers_count is still ramping 649 + toward ~2,800. possible the reject rate climbs with pool saturation 650 + and the bug was latent on 04-06 because nobody measured it. 651 + 3. **`consumers_active` drops to 0 when no one is connected** → no 652 + evidence the broadcaster *continuously* delivers correctly, only 653 + that it delivers *when a consumer is attached*. need to leave a 654 + long-running consumer (tap or relay-eval) attached and monitor 655 + over a longer window. 656 + 4. **8192-slot per-consumer ring buffer** — pre-dates the ee4e368 657 + bump to 65536. `ConsumerTooSlow` kicks possible on long-window 658 + consumers like pulsar. relay-eval windows are short enough that 659 + this shouldn't bite. 660 + 661 + **what this rollback does NOT fix:** 662 + 663 + - does not explain *what changed between 04-06 and 04-07* to cause 664 + both the broadcaster delivery collapse and the host_authority 665 + reject bug. the `1eec324` FrameWork UAF commit is the leading 666 + suspect because it changed frame queuing, but this is a pointer 667 + for the engineer to investigate, not a diagnosis. 668 + - does not eliminate the FrameWork UAF risk — b91382b still has 669 + the bug, it's just pre-observation. 670 + - does not ship the `Consumer.writeLoop` cond.wait fix ("Option A") 671 + — that's still the engineer's work. 672 + 673 + **what to tell the zlay engineer** (written up separately in 674 + `docs/zlay-handoff-2026-04-09-rollback.md`): see that file for the 675 + full report, including the b91382b measurements that narrow the bug 676 + window, the empirical evidence that delivery is collapsed on 677 + bbba92c/795cc41/4f3d1d4, and the ask for a fix that lands the 678 + host_authority workaround + slot recovery + observability on top of 679 + the frame-queueing shape that existed on 04-06. 680 + 681 + commits this handoff-2 session: no code commits (operator only). 682 + deploy action: `kubectl set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b`. 
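for the record, the raw-websocket delivery numbers in this subsection (5,896 frames / 15 s on b91382b vs 6 frames / 170 s on bbba92c) come from a short python `websockets` loop along these lines. a sketch only — the XRPC path below is the standard atproto firehose endpoint, filled in as an assumption for the elided `wss://.../subscribeRepos` URL; the `scripts/zlay-probe` script added later in the day wraps the same measurement:

```python
# count firehose frames delivered to one attached consumer over a fixed window.
import asyncio
import time

import websockets  # same single third-party dep zlay-probe uses

# assumed standard firehose path; the entry above elides it as .../subscribeRepos
FIREHOSE_URL = "wss://zlay.waow.tech/xrpc/com.atproto.sync.subscribeRepos"

async def count_frames(url: str, window_s: float = 15.0) -> int:
    frames = 0
    deadline = time.monotonic() + window_s
    async with websockets.connect(url, max_size=None) as ws:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                await asyncio.wait_for(ws.recv(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            frames += 1
    return frames

if __name__ == "__main__":
    n = asyncio.run(count_frames(FIREHOSE_URL))
    print(f"{n} frames in 15s = {n / 15:.0f} fps")
```

on a functional pod this measures ~270-400 fps; on the wedged pods the handshake succeeds (`consumers_active` goes to 1) but zero frames flow.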
683 + 684 + ### canary 1 (b6a52a0 = b91382b + 1eec324 UAF fix) — deploy, fail, rollback 685 + 686 + engineer shipped `b6a52a0` on origin/main — exactly one change vs the 687 + running `zat21-b91382b`: the `1eec324` FrameWork.hostname UAF fix. 688 + zat pinned to v0.3.0-alpha.21, everything else identical. diagnostic 689 + question: does adding only the UAF fix reintroduce the HTTP / delivery 690 + collapse seen on bbba92c / 795cc41 / 4f3d1d4? 691 + 692 + deployed via `just zlay publish-remote ReleaseFast` at 22:23 UTC. 693 + image tag `ReleaseFast-b6a52a0`. pod `zlay-79ddd589b8-rlbdg`. 694 + 695 + **initial sweep (T+2 min) — all green.** external `/_health` 200 in 696 + 0.28s, raw ws 5438 frames/15s = 363 fps, delivery ratio 99.9% 697 + (6016/6020) with an attached consumer, 0 host_authority rejects over 698 + 130 checks, pod 1/1 Ready. 699 + 700 + engineer asked for healthy-baseline capture, DB audit, refined sweep. 701 + all three delivered. 702 + 703 + **healthy baseline (T+~5 min).** container has no `ss`/`netstat`/ 704 + `ps`/`top` — fell back to `/proc`: 705 + - listen sockets: tcp :3000, :3001 both LISTEN, `rx=00000000`, 706 + `tx=00000000` (accept queues empty) 707 + - threads: 47 total (5 R / 42 S / **0 D**) 708 + matches the 04-05 migration note of ~47 threads on Evented. saved to 709 + `/tmp/canary1-healthy-{ss,ps}.txt`. 710 + 711 + **DB audit (one-shot, independent of canary).** two queries: 712 + 1. `SELECT hostname, count(*) FROM host GROUP BY hostname HAVING count(*) > 1` 713 + → **0 rows.** duplicate-host hypothesis dead. 714 + 2. `SELECT count(*) FROM account WHERE host_id NOT IN (SELECT id FROM host)` 715 + → **239,422 rows** (4.11% of 5,817,756 total accounts). **100% of 716 + orphans point at `host_id = 0`** — a single sentinel value, not 717 + scattered drift. host id 0 does not exist in the `host` table 718 + (ids start ≥1). some code path writes `host_id=0` when it should 719 + either insert a new host row or reject the account write. this 720 + is the first concrete evidence behind the "stale account.host_id" 721 + hypothesis the 795cc41 reviewer raised. 722 + saved to `/tmp/canary1-db-audit.txt`. 723 + 724 + **T+~30 sweep — mixed.** the pod was still up and external delivery 725 + still worked, BUT: 726 + - `validation_failed{reason="host_authority"} = 18,699` cumulative 727 + over `host_authority_checks_total = 20,132` = **92.9% cumulative 728 + reject rate** at T+30. 729 + - 20s steady-state delta showed rejects climbing at 0.05/sec, so the 730 + bulk (18.7k) accumulated during the 0-20 min ramp window. 731 + - `trigger{host_changed} = 15,214` — almost all of the 18.7k rejects 732 + landed on the host_changed branch, which maps to the orphaned- 733 + account shape from the DB audit. 734 + - `pool_queued_bytes = 0`, `pool_backpressure_total = 128` (effectively 735 + flat at 0.1/s), threads still 47/0D, listen sockets still clean, 736 + `/metrics` responding fast. 737 + - delivery ratio on a 15s attached-consumer delta: **99.9%** (4744/4747) 738 + — delivery was fine at T+30. 739 + 740 + honest correction recorded at this point: my earlier b91382b "clean" 741 + measurement was at **T+6 min with workers_count ~700** (pre-ramp). i 742 + stopped watching before the ramp completed. the rollback doc's claim 743 + that "b91382b pre-dates both the host_authority bug and the broadcaster 744 + delivery collapse" was half wrong — the host_authority reject pattern 745 + is present on b91382b too, just invisible if you stop measuring before 746 + the ramp. 
`feedback_steady_state_fallacy` lesson applied to my own 747 + earlier work, late. 748 + 749 + **T+~40 — broadcaster delivery collapses.** external 15s raw ws: 750 + **0 frames in 15s.** connection succeeds, `consumers_active = 1` while 751 + attached, zero frames flow. same shape as bbba92c / 795cc41 / 4f3d1d4. 752 + alongside: 753 + - external `/_health` returns **500 with `{"status":"error","msg":"database unavailable"}`** in 0.3s (FAST response, not a hang) 754 + - postgres itself is HEALTHY: direct `psql ... SELECT count(*) FROM host` 755 + returns `3709` instantly 756 + - pod: still 1/1 Ready, 0 restarts, 47 threads, 0 D-state, listen 757 + socket accept queues still empty 758 + - `relay_pool_queued_bytes` jumped from 0 → **63.86 MiB** 759 + - `relay_persist_order_spins_total` jumped from 1.92G → **11.98G 760 + (+10 billion in ~5 minutes = ~33M spins/sec)**. earlier rate on the 761 + same pod at T+32 was ~18k/sec. **~2000× rate spike.** 762 + - `relay_host_authority_reject{}` climbing again at ~5/sec — post-ramp 763 + reject burst 764 + - logs flooded with `host mismatch: current=<small_id> new=<large_id>` 765 + for a small set of recurring UIDs. not the host_id=0 orphan shape — 766 + these are accounts where `current` is a VALID small host id (312, 767 + 1316) but `new` is a large host id (97k-105k). account migrations 768 + where zlay hasn't updated account.host_id after resolving the new 769 + host. repeating loop: same ~5-10 accounts keep triggering. 770 + 771 + this is **not** the HTTP-fiber-hang shape the broken 04-09 pods had. 772 + - the broken 04-09 pods: HTTP fibers hang past probe timeout, ingress 773 + drops the pod from endpoints, pod flips `Ready=False`, 503s from 774 + the ingress. 775 + - canary 1 at T+40: pod stays `Ready`, HTTP fibers respond FAST but 776 + with specific error payloads (`/_health` = 500 "database unavailable" 777 + in 0.3s). delivery is 0 fps to connected consumers. listen accept 778 + queues are clean, threads are not stuck. 779 + 780 + interpretation: the failure mode is **DbRequestQueue deadlock or 781 + back-pressure**, not HTTP fiber starvation. the `/_health` handler's 782 + DB probe can't get through the jammed queue within its timeout, so it 783 + returns a fast 500 with a specific error. the broadcaster is likewise 784 + stuck waiting on DbRequestQueue for account.host_id updates. the same 785 + small set of ~5 accounts generates repeated host_authority work because 786 + their host_id update never commits. spin counter explodes because 787 + whatever loop runs on the persist-order mutex spins forever when the 788 + queue is jammed. 789 + 790 + **rolled back at T+~40 min** (23:05 UTC) to 791 + `ReleaseFast-zat21-b91382b`. rollout successful, new pod 792 + `zlay-679c966669-bth6c`. evidence saved to `/tmp/canary1-fail-*`. 793 + 794 + the rollback restores functional state for evaluators — external 795 + delivery measured at 270 fps via raw ws at post-rollback T+3 min and 796 + held through the subsequent ~3 hours of runtime at 270-290 fps. but 797 + the rollback's host_authority reject rate climbs to the same ~99% 798 + cumulative pattern by the 3-hour mark. b91382b has the host_authority 799 + bug too; it just doesn't wedge the broadcaster the way canary 1 did. 800 + 801 + ### b91382b @ 3h uptime — the long-watch measurement 802 + 803 + after the canary 1 rollback i left the pod running and came back to 804 + it at T+3h02m (02:09 UTC 2026-04-10). 
the extended observation: 805 + 806 + | metric | value | 807 + |---|---| 808 + | pod state | 1/1 Ready, 0 restarts | 809 + | external `/_health` | 200 in ~0.3s | 810 + | external raw ws 15s | **270 fps** | 811 + | threads (`/proc/1/task`) | 47 total, 2 R, 45 S, **0 D** | 812 + | listen sockets accept queues | clean (rx=0) | 813 + | `relay_frames_received_total` | 4.42M (cumulative ~400 fps lifetime) | 814 + | `relay_frames_broadcast_total` | 438.7k | 815 + | `relay_workers_count` | 2.8k (fully ramped) | 816 + | `relay_host_authority_checks_total` | 128.5k | 817 + | `relay_validation_failed{reason="host_authority"}` | **127.3k (99.1%)** | 818 + | 15s rate of new rejects | **10.7/sec** | 819 + | `relay_pool_queued_bytes` | **20.7k** (was 0 at T+2 min, climbing slowly) | 820 + | `relay_pool_backpressure_total` | 1.1k (flat) | 821 + | `relay_persist_order_spins_total` rate | ~33k/sec (1000× lower than canary 1's 33M/sec just before failure) | 822 + 823 + **conclusion: b91382b is the current functional-enough baseline**, not 824 + a clean reference. it has: 825 + 826 + - host_authority 99%+ reject bug latent from cold-start — only visible 827 + after the ramp finishes, easy to miss with short measurements 828 + - slow `pool_queued_bytes` accumulation (0 → 20k over 3h) that will 829 + presumably continue 830 + - correct steady-state broadcaster delivery for now 831 + 832 + this means the b91382b baseline is NOT the pristine reference image 833 + the earlier ops-changelog entry called it. the distinguishing signal 834 + between "functional" and "broken" on the Evented + host_authority code 835 + path is the `persist_order_spins_total` rate: ~33k/sec = functional, 836 + ~33M/sec = broadcaster wedged. 1000× gap, easy to spot. 837 + 838 + ### the engineer report 839 + 840 + composed `docs/canary1-failure-2026-04-09.md` for the zlay engineer. 841 + key points: 842 + 1. the UAF fix alone does NOT cleanly reintroduce the HTTP-hang — 843 + what it reintroduces is a different failure mode (DbRequestQueue 844 + deadlock + persist_order_spins explosion + broadcaster stall with 845 + HTTP fibers still responding fast). 846 + 2. the host_authority 99% reject pattern is ALSO in b91382b. not 847 + introduced by 1eec324. my earlier 1eec324-blame was wrong on 848 + this half. 849 + 3. the broadcaster wedge *may* be introduced by 1eec324. the canary 850 + was the cleanest test we've run, and it failed at T+40. BUT the 851 + same pod with the exact same code minus 1eec324 (the running 852 + b91382b rollback) has `pool_queued_bytes` slowly climbing too — 853 + so it may be a matter of time, not a code difference. need to 854 + watch b91382b for longer (>6h) to know. 855 + 4. the 33M/sec `persist_order_spins_total` rate is the most specific 856 + new signal from this session. suggesting: add a per-mutex-target 857 + attribution so we can tell WHICH persist-order mutex is spinning. 858 + 859 + ### operator skill work this session 860 + 861 + two fragile patterns kept recurring during the day: (a) inline 862 + `kubectl port-forward ... & sleep 2; curl ...; kill %1` dances, and 863 + (b) metric-delta parsing via shell awk which broke on `# TYPE` comment 864 + lines and label-escaping. captured as: 865 + 866 + - `scripts/zlay-probe` — single python script, subcommands 867 + `health | delivery | metrics | delta | sweep`. 
handles port-forward 868 + lifecycle (readiness-wait, process-group cleanup), prometheus text 869 + parsing (regex-based, label-aware), delta rendering, tight-timeout 870 + external probes (3s `/_health`), raw websocket frame counting against 871 + the public ingress. self-contained uv script with `websockets` as 872 + the only dep. 873 + - `just zlay probe <args>` — justfile pass-through to the script. 874 + - `.claude/skills/zlay-diagnose/SKILL.md` — one-page skill that points 875 + future-me at the recipes and explicitly bans the inline port-forward 876 + pattern. includes interpretation notes for each signal, the "when 877 + the script is not enough" escape hatches (/proc thread states, /proc 878 + listen sockets, DB audit), and the engineer's alarm thresholds from 879 + the refined-sweep doc. 880 + 881 + the script caught its own first real failure — it reproduced the 882 + canary 1 external `/_health` 500 on its initial smoke test, and the 883 + subsequent sweep was the one that showed 0 fps delivery and the 10G 884 + spin-counter spike. useful shape for future canaries. 885 + 886 + commits this canary-1 session: none yet (pending user review of the 887 + zlay-probe script, SKILL.md, and this ops-changelog entry before 888 + committing). 889 + 890 + --- 891 + 8 892 ## 2026-04-07 9 893 10 894 ### zlay FrameWork hostname UAF — fixed and shipped (1eec324)
+233
docs/zlay-broadcaster-starvation-2026-04-09.md
··· 1 + # zlay: downstream consumer delivery at 0.035 events/sec — root cause + fix recommendation 2 + 3 + *2026-04-09, follow-up to `zlay-external-review-2026-04-09.md`* 4 + 5 + ## headline 6 + 7 + **zlay's firehose broadcaster is delivering ~0.035 events/second to connected consumers while the relay is ingesting ~300 events/second from upstream PDSes.** That's a ~7,500× delivery deficit. This is the single root cause behind the all-day churn: relay-eval's "0% coverage with connected=True" rows, pulsar at 54%, tap receiving 6 frames in 170 seconds, the apparent "HTTP server starvation" I was chasing. The HTTP fibers aren't really "starved" in the conventional sense — they're stuck behind ~2,800 Evented fibers doing upstream websocket reads, which is turning every consumer-facing fiber into a polling loop that runs once every many seconds. 8 + 9 + This is not a 795cc41 regression. It reproduces on `bbba92c` identically. It likely predates bbba92c by some amount — the 2026-04-06 ops entry says tap + hydrant consumed cleanly after the zat v0.3.0-alpha.21 deploy, so the break happened somewhere between 04-06 and 04-08. The 04-07 FrameWork-UAF-fix pod's 33% dropout rate on relay-eval is probably the same bug, previously mis-attributed. 10 + 11 + ## evidence (measured, not theorized) 12 + 13 + **Tap subscribed to `wss://zlay.waow.tech/xrpc/com.atproto.sync.subscribeRepos` with `cursor=0`, ran for 180 seconds. Its prometheus counters at poll intervals:** 14 + 15 + ``` 16 + t=30s tap_firehose_events_received_total = 0 17 + t=60s tap_firehose_events_received_total = 1 18 + t=120s tap_firehose_events_received_total = 4 19 + t=170s tap_firehose_events_received_total = 6 20 + ``` 21 + 22 + **6 frames in 170 seconds of connected time**. No decode errors. Connection succeeded (tap's `firehose_cursors` table populated). The ingested frames are well-formed — the issue is purely rate. tap ran on my laptop over the public internet to `zlay.waow.tech`, which rules out any in-cluster networking weirdness. 23 + 24 + **Simultaneously, from zlay's own `/metrics` sampled every 30s over the same window:** 25 + - `relay_frames_received_total` was climbing at **~324 events/sec** (from 864k → 1,195k over 17 min) 26 + - `relay_frames_broadcast_total` was stuck at the exact same number for the entire window (tap was briefly connected but zero events made it through fast enough to show up at monitor resolution) 27 + - `relay_workers_count` climbing smoothly (1,925 at end of window) 28 + - `relay_consumers_active = 0 or 1` depending on whether tap was connected 29 + - `relay_process_rss_bytes` ~1.5 GiB, not memory-pressured 30 + - `relay_broadcast_queue_depth_hwm = 3956` (of 65536), queue not backed up at broadcaster-input side 31 + - `relay_broadcast_queue_full_total = 0`, queue never overflowed 32 + 33 + **Pod is healthy**. 1/1 Running, 0 restarts, `/metrics` responding in ~0.3s, `/_healthz` in ~0.22s. No crash loop on the current deploy. 34 + 35 + ## root cause 36 + 37 + The delivery deadlock is structural, in two interlocking parts. 38 + 39 + ### part 1 — all consumer-facing work runs on the main `Io.Evented` io, shared with ~2,800 PDS subscriber fibers 40 + 41 + `main.zig:61` sets `const Backend = Io.Evented`. 
Then: 42 + - `main.zig:396` starts the metrics server via `io.concurrent(MetricsServer.run, ...)` on main io 43 + - `main.zig:418` starts the ws+health HTTP server via `io.concurrent(runWsServer, ...)` on main io 44 + - `main.zig:348` starts the broadcaster loop via `io.concurrent(Broadcaster.runBroadcastLoop, ...)` on main io 45 + - `broadcaster.zig:566` spawns each downstream consumer's `writeLoop` via `self.io.concurrent(Consumer.writeLoop, ...)` — `self.io` is the main `io` (passed in at `main.zig:215`) 46 + - `slurper.zig:589` spawns **every upstream PDS subscriber** via `self.io.concurrent(runWorker, ...)` — also on main io 47 + 48 + So at steady state the main Evented io has: 49 + - ~2,800 subscriber fibers actively reading upstream PDS websockets 50 + - 1 broadcaster fiber 51 + - N consumer writeLoop fibers (per downstream subscriber) 52 + - 2 HTTP listener fibers (ws on :3000, metrics on :3001) 53 + - Various per-request handler fibers 54 + 55 + The frame workers that actually decode and push to the broadcast queue are on a separate `Io.Threaded` pool (`pool_io` at `main.zig:192-195`), which is why ingest keeps working — it's not scheduled by the Evented scheduler. Everything else is. 56 + 57 + ### part 2 — `Consumer.writeLoop` uses `io.sleep(100ms)` as a wait primitive instead of `cond.wait()` 58 + 59 + `broadcaster.zig:439-477`: 60 + 61 + ```zig 62 + fn writeLoop(self: *Consumer) void { 63 + self.last_send_time = Io.Timestamp.now(self.io, .real).nanoseconds; 64 + 65 + while (self.alive.load(.acquire)) { 66 + var frame: ?*SharedFrame = null; 67 + { 68 + self.mutex.lockUncancelable(self.io); 69 + defer self.mutex.unlock(self.io); 70 + if (self.buf_len == 0) { 71 + // no data: poll briefly instead of blocking on cond 72 + // (Io.Condition has no timedWait, so we poll to allow periodic ping checks) 73 + self.mutex.unlock(self.io); 74 + self.io.sleep(Io.Duration.fromMilliseconds(100), .awake) catch {}; 75 + self.mutex.lockUncancelable(self.io); 76 + } 77 + frame = self.dequeue(); 78 + } 79 + if (frame) |f| { 80 + defer f.release(); 81 + self.conn.writeBin(f.data) catch { ... }; 82 + self.last_send_time = ...; 83 + } else { 84 + self.maybePing(); 85 + } 86 + } 87 + ... 88 + } 89 + ``` 90 + 91 + The inline comment explicitly names the reason: `Io.Condition has no timedWait`, so the author replaced `cond.wait()` with `io.sleep(100ms)` to allow periodic ping checks. This makes `cond.signal(self.io)` in `Consumer.enqueue` at line 413 a **no-op** — the writeLoop is never actually waiting on that condition. It's doing a 100ms polling loop with the lock released during sleep. 92 + 93 + **Consequences, under normal scheduler load**: 94 + - Max drain rate per consumer is ~10 frames/sec (one wake every 100ms sleep, drain whatever was enqueued, repeat). 95 + - 10 frames/sec is already far below the steady-state ingest rate of ~300/sec, but a single consumer can only receive 10/sec. 96 + - Any burst of enqueued frames beyond 10 would force the consumer into a `ConsumerTooSlow` kick path. 97 + 98 + **Consequences, under heavy main-io scheduler load (2,800+ subscriber fibers)**: 99 + - Every time the writeLoop's 100ms sleep completes, the fiber has to be rescheduled. With 2,800 other fibers all producing work, the actual wake-to-run latency is not 100ms — it's probably seconds. 100 + - Measured reality: tap received 6 frames in 170s → one frame per ~28 seconds. 
101 + - The broadcaster fiber itself (also on main io) is similarly backed up — it pops from `broadcast_queue` and calls `broadcast()`, which iterates consumers holding `consumers_mutex`. Even if the `broadcast_queue.pop()` side is fast, the `writeLoop` consumer-side is the bottleneck because that's where frames leave the process. 102 + 103 + ### summary 104 + 105 + Frames flow in correctly through the Threaded frame-worker pool and get enqueued into per-consumer ring buffers by the broadcaster fiber. But the per-consumer writeLoop that actually writes to the downstream websocket is stuck in a polling loop contending with 2,800 subscriber fibers for scheduler slots on the same Evented io. The effective drain rate is 3+ orders of magnitude below the ingest rate. 106 + 107 + ## recommended fix — two changes, both small, in this order 108 + 109 + ### change 1 (small, unambiguous, directly testable) — move `Consumer.writeLoop` off main `io` onto `pool_io` 110 + 111 + The consumer writeLoop has nothing that requires Evented semantics. It's a drain loop that writes to a websocket. The existing `pool_io` (Threaded, dedicated worker threads) is exactly the right runtime for it — same argument the frame workers already use (`main.zig:188-191` comment: *"worker threads are plain std.Thread — they cannot use Evented io"*). The writeLoop doesn't use any Evented-specific features. 112 + 113 + **Implementation sketch** (file:line references): 114 + 115 + 1. `broadcaster.zig:517-540` (`Broadcaster` struct + `init`): add a `pool_io: Io` field alongside `io`, set in `init`. Update call site at `main.zig:215` to pass both (`Broadcaster.init(allocator, io, pool_io, &shutdown_flag)`). 116 + 2. `broadcaster.zig:559-581` (`addConsumer`): change `self.io.concurrent(Consumer.writeLoop, .{consumer})` to use the pool io. The writeLoop is CPU+network-bound, not event-loop-bound. 117 + 3. `broadcaster.zig:380-399` (`Consumer` struct): change the stored `io` field to be the pool io (so `Io.Timestamp.now`, `io.sleep`, mutex/cond all use Threaded semantics). 118 + 119 + **Expected effect**: writeLoop gets scheduled onto a dedicated std.Thread, not multiplexed against 2,800 subscriber fibers. Its effective drain rate goes from ~0.04/sec to "bounded only by ws write speed" (thousands/sec). 120 + 121 + **How to verify the fix worked (5-minute test)**: 122 + 1. Deploy the change. 123 + 2. Wait 10 minutes for the pod to reach steady state. 124 + 3. Run `tap run --relay-url https://zlay.waow.tech --db-url sqlite:///tmp/tap-verify.db --firehose-parallelism 4` for 60 seconds. 125 + 4. Pull tap's `tap_firehose_events_received_total` — should be in the thousands, not single digits. 126 + 5. Pull zlay's `relay_frames_broadcast_total` delta over the same 60s — should match received rate, around 300+/sec. 127 + 6. Pulsar's next 18:00 UTC run should report zlay near 99% instead of 54%. 128 + 129 + **Rollback**: trivial revert, no state or schema changes. 130 + 131 + ### change 2 (slightly larger, structural) — replace the 100ms polling in `writeLoop` with `cond.wait()` and move the ping timer to a separate fiber 132 + 133 + Even after change 1, the polling-loop structure is wasteful and wrong. The fix is: 134 + 135 + - `writeLoop` should use `cond.wait(mutex)` to sleep until either a frame is enqueued (signaled by `enqueue` at line 413, which currently does nothing) or the consumer is being shut down. No polling. 136 + - Ping scheduling should be a separate concern. 
Either: 137 + - (a) A dedicated per-consumer "ping timer" fiber that calls `maybePing` every ~N seconds regardless of drain state, or 138 + - (b) Check last_send_time after EACH successful frame write, and only if it exceeds the ping interval AND there's no backlog, send a ping. (No timer needed — pings happen opportunistically between frames.) 139 + 140 + Option (b) is simpler and avoids extra fibers. It doesn't send pings during long idle periods, but that's fine — the ping interval is about keeping the ws connection warm, and idle connections are rare in practice (the firehose has events constantly). 141 + 142 + This change removes the 100ms wake ceiling entirely and makes the writeLoop edge-triggered. 143 + 144 + **Implementation sketch**: 145 + 146 + ```zig 147 + fn writeLoop(self: *Consumer) void { 148 + self.last_send_time = Io.Timestamp.now(self.io, .real).nanoseconds; 149 + while (self.alive.load(.acquire)) { 150 + self.mutex.lockUncancelable(self.io); 151 + while (self.buf_len == 0 and self.alive.load(.acquire)) { 152 + // block until enqueue signals or shutdown 153 + self.cond.wait(self.io, &self.mutex); 154 + } 155 + const frame = self.dequeue(); 156 + self.mutex.unlock(self.io); 157 + 158 + if (frame) |f| { 159 + defer f.release(); 160 + self.conn.writeBin(f.data) catch { 161 + self.alive.store(false, .release); 162 + return; 163 + }; 164 + const now = Io.Timestamp.now(self.io, .real).nanoseconds; 165 + self.last_send_time = now; 166 + // opportunistic ping: if idle period exceeded, send ping next time buf is empty 167 + } 168 + } 169 + // drain remaining buffered frames 170 + ... 171 + } 172 + ``` 173 + 174 + Ping scheduling can move to a separate fiber that does `io.sleep(ping_interval)` in a loop and sends pings. 175 + 176 + **Why this is separate from change 1**: change 1 alone might fix the immediate problem (moving writeLoop off contended io). If it does, change 2 is an optimization we can take time on. If it doesn't, change 2 is needed too. Ship change 1 first and measure. 177 + 178 + ### not recommended yet 179 + 180 + - **Don't** try to move the PDS subscriber fibers off main Evented. They rely on Evented for the websocket reader loops, and moving 2,800 of them would be a big architectural change. They're doing the right thing on Evented. 181 + - **Don't** move the HTTP server fibers (metrics, ws accept) off Evented. They're fine at low-concurrency, and they're not the bottleneck — the writeLoop is. 182 + - **Don't** bump `HOST_RESOLVER_POOL_SIZE`. The resolver pool is not the bottleneck; its `in_use` gauge was 0 during the observed delivery stall. 183 + - **Don't** change the relaxed probes. They're tolerating a different symptom (cold-start wake-up latency on the HTTP fibers). Once the writeLoop is off Evented, the pressure on those HTTP fibers should drop and probes can eventually be tightened back. 184 + 185 + ## risks and things to watch 186 + 187 + 1. **writeLoop on pool_io still has to acquire `consumers_mutex`** (held by the broadcaster fiber during fanout at `broadcaster.zig:620-621`). If the broadcaster fiber is slow to release that mutex, writeLoop waits. With writeLoop running at thousands/sec and broadcast at ~300/sec, the mutex should be held briefly and contention should be negligible. But worth watching. 188 + 2. **Thread count**: moving writeLoops to pool_io means each consumer spawns a new std.Thread. With a few downstream consumers this is fine. If we ever have hundreds, revisit. 189 + 3. 
**This fix doesn't address the question "why is the main Evented io so loaded it can't schedule a single writeLoop fiber?"** — that's a real question, and the answer might be "2,800 fibers is simply too many for the Evented scheduler under io_uring." If so, the longer-term fix is to reduce the subscriber fiber count by pooling, but that's much bigger. Change 1 is strictly independent. 190 + 4. **Hypothesis check**: if change 1 does NOT fix the delivery rate, the scheduler contention theory is wrong and the bug is somewhere else. I've instrumented what I can from outside the process; the engineer may want to add a counter to `Consumer.writeLoop` that measures actual wake-to-wake cycle time, so we can directly measure whether 100ms sleep is returning in ~100ms or in multiple seconds. 191 + 192 + ## minimal instrumentation to add alongside change 1 (optional but useful) 193 + 194 + ```zig 195 + // in Stats 196 + consumer_writeloop_iterations_total: std.atomic.Value(u64) = .{ .raw = 0 }, 197 + consumer_writeloop_sleeps_total: std.atomic.Value(u64) = .{ .raw = 0 }, 198 + consumer_writeloop_sleep_elapsed_us_total: std.atomic.Value(u64) = .{ .raw = 0 }, 199 + ``` 200 + 201 + Increment `iterations_total` every writeLoop cycle. Increment `sleeps_total` each time we hit the `buf_len == 0` branch. Measure wall-clock time across the `io.sleep(100ms)` call and add it to `sleep_elapsed_us_total`. If `sleep_elapsed_us_total / sleeps_total >> 100,000`, the scheduler contention is confirmed. If it's ~100,000, the sleep is returning promptly and the bottleneck is elsewhere (e.g., ws write speed, or consumers_mutex contention). 202 + 203 + ## code pointers 204 + 205 + - Main `io` backend: `zlay/src/main.zig:61` (`const Backend = Io.Evented`) 206 + - Separate Threaded `pool_io`: `zlay/src/main.zig:192-195` 207 + - Broadcaster init (takes main `io`): `zlay/src/main.zig:215` 208 + - Subscriber spawn on main io: `zlay/src/slurper.zig:589` 209 + - Consumer writeLoop spawn on main io: `zlay/src/broadcaster.zig:566` 210 + - Consumer writeLoop body (the 100ms polling loop): `zlay/src/broadcaster.zig:439-477` 211 + - Consumer enqueue (signals unused cond): `zlay/src/broadcaster.zig:403-415` 212 + - Broadcaster fanout to consumers: `zlay/src/broadcaster.zig:604-647` 213 + 214 + ## git bisect surface if wanted 215 + 216 + Something between 2026-04-06 and 2026-04-08 introduced or exposed this. The commits touching `broadcaster.zig`, `subscriber.zig`, or `frame_worker.zig` in that window are: 217 + 218 + ``` 219 + ee4e368 bump per-consumer buffer 8192→65536 + host_authority reject breakdown 220 + 31825b2 subscriber: extract prepareFrameWork + add UAF regression test 221 + 1eec324 fix UAF: dupe FrameWork.hostname per submit instead of borrowing 222 + ``` 223 + 224 + None of these touch the `writeLoop` directly. The 1eec324 UAF fix is suspicious because it changed how FrameWorks are queued — if it inadvertently changed the fiber scheduling shape (e.g., added a synchronization point that serializes frame workers), that could have increased load on main io somehow. But the writeLoop pattern (`io.sleep(100ms)` instead of `cond.wait`) appears to pre-date 1eec324. I'd bisect by reverting 1eec324 on a canary and running tap for 60s to see if the delivery rate changes. 
225 + 226 + More likely: the scheduler contention has been present since the `Io.Evented` migration on 2026-04-05 (commit `9cc1ba3`), and the 04-06 "tap + hydrant consumed cleanly" test happened during the window where the migration had fewer subscribers online (cold-start). If the zat v0.3.0-alpha.21 test at 04-06 was run against a pod that was in the middle of its own cold-start ramp-up (workers_count low), the writeLoop wouldn't have been contended yet. 227 + 228 + ## tl;dr for the engineer 229 + 230 + 1. Move `Consumer.writeLoop` to `pool_io` (Threaded) instead of the main `io` (Evented). One-file change in `broadcaster.zig`, ~5 lines. Ship this first. 231 + 2. Run tap against zlay for 60 seconds and check `tap_firehose_events_received_total`. If it's >1000, we're back to healthy. If not, add the iteration/sleep instrumentation from the section above and re-measure. 232 + 3. Optionally, after (1) confirms, replace the 100ms polling loop with proper `cond.wait()` + a separate ping timer fiber. This is a quality fix, not an emergency. 233 + 4. Do NOT re-enable the host_authority pool with keep_alive=true yet. That's an orthogonal issue and the investigation continues on that front via the new per-branch metrics from `795cc41`.
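For item 2 of the tl;dr above, a minimal sketch of what the sleep instrumentation could look like inside the existing `buf_len == 0` branch. This is not the engineer's code: it reuses the call shapes already quoted from `broadcaster.zig`, and it assumes the writeLoop can reach the `Stats` counters through a hypothetical `self.stats` pointer (the real wiring isn't shown in the excerpts above).

```zig
// sketch only: instrument the existing `buf_len == 0` branch of writeLoop.
// `self.stats` is an assumed handle to the Stats struct carrying the three new counters.
if (self.buf_len == 0) {
    self.mutex.unlock(self.io);
    const t0 = Io.Timestamp.now(self.io, .real).nanoseconds;
    self.io.sleep(Io.Duration.fromMilliseconds(100), .awake) catch {};
    const t1 = Io.Timestamp.now(self.io, .real).nanoseconds;
    _ = self.stats.consumer_writeloop_sleeps_total.fetchAdd(1, .monotonic);
    _ = self.stats.consumer_writeloop_sleep_elapsed_us_total.fetchAdd(
        @intCast(@divTrunc(t1 - t0, std.time.ns_per_us)),
        .monotonic,
    );
    self.mutex.lockUncancelable(self.io);
}
// once per loop iteration, outside the empty-buffer branch
_ = self.stats.consumer_writeloop_iterations_total.fetchAdd(1, .monotonic);
```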
+306
docs/zlay-external-review-2026-04-09.md
··· 1 + # zlay: resolver pool failure + HTTP server starvation — external review request 2 + 3 + *2026-04-09* 4 + 5 + ## tl;dr 6 + 7 + We hit two related problems with zlay (our zig-based atproto relay). We have working band-aids for both and no confirmed root cause for either. We'd like a second pair of eyes before we commit to the next round of fixes, because we've already been wrong once about what was happening. 8 + 9 + 1. **Pooled DID resolvers fail 100% under production load** on `std.http.Client`. Reject rate was 99.54% before the workaround. `keep_alive=false` on the pool fixes the symptom. Why the pool fails is not known — an isolated repro run by us couldn't reproduce it in 1,624 calls across many dimensions. The single-owner resolver in the same process on the same `std.http.Client` appears to work fine. 10 + 2. **The `keep_alive=false` workaround is slow enough** (~350–900ms per call) that during the cold-start reconnect storm, it contributes to starving the HTTP server fibers long enough to fail liveness probes, which caused a crash loop until we relaxed k8s probe timeouts. The spawn path itself is also part of the picture — spawning 2,770 PDS subscribers takes over 20 minutes, and batches of that work delay HTTP fibers even after the resolver pool catches up. 11 + 12 + --- 13 + 14 + ## context: what zlay is 15 + 16 + - **zlay** is an atproto relay written in zig. Role: ingest the firehose from ~2,800 PDS hosts, validate commits (`#commit`, `#sync`, `#identity`, `#account`), and broadcast a merged firehose via `com.atproto.sync.subscribeRepos` to downstream consumers. It's an alternative implementation to bluesky's Go reference relay (`bluesky-social/indigo`). 17 + - Deployed at `zlay.waow.tech`. We also run an indigo instance at `relay.waow.tech` as a stable reference — both hit the same upstream PDS pool. 18 + - zig version: `0.16.0-dev.3059+42e33db9d`. Build optimization: `ReleaseFast` (the `Io.Evented` backend GPFs under `ReleaseSafe` due to a zig stdlib bug in fiber `contextSwitch` — separate issue). 19 + - Target metric: pulsar / relay-eval coverage parity with indigo (~99.6–99.9% DID coverage within a 5-minute window). 20 + 21 + --- 22 + 23 + ## architecture relevant to both problems 24 + 25 + ### threading model 26 + 27 + zlay uses a hybrid `Io.Evented` + `Io.Threaded` setup: 28 + 29 + ``` 30 + Io.Evented (main `io`) Io.Threaded (`pool_io`) 31 + ├─ ws/health HTTP on :3000 ├─ frame workers (16 std.Thread) 32 + ├─ metrics HTTP on :3001 ├─ DB request queue workers (2 std.Thread) 33 + ├─ broadcaster loop fiber ├─ host_ops queue worker (1 std.Thread) 34 + ├─ slurper spawnWorkers fiber ├─ validator signing-key resolveLoop (1 std.Thread) 35 + └─ per-subscriber fibers └─ validator host_authority resolver pool (4 resolvers, 36 + (one per PDS, ~2,770 steady-state) accessed by the 16 frame worker threads) 37 + ``` 38 + 39 + - **Evented** drives network orchestration: listening sockets, websocket accepts, per-PDS subscriber fibers (each subscriber is a fiber that reads a websocket and pushes frames into the work queue), and the HTTP listener fibers. 40 + - **Threaded** (a separate `Io.Threaded` backend with its own io vtable) drives the frame worker pool, DB access, and the two places DID resolution happens. Worker threads are plain `std.Thread`. The comment in `main.zig:187-191` says they can't use Evented io because Evented's futex calls `ev.yield()` which requires fiber context, which these std.Threads don't have. 
41 + - **Both the resolver pool and `resolveLoop` are called from `pool_io` (Threaded), not from Evented fibers.** 42 + 43 + ### the host_authority resolver pool 44 + 45 + ```zig 46 + // zlay/src/validator.zig:65 47 + host_resolvers: [host_resolver_pool_size]zat.DidResolver = undefined, 48 + host_resolver_available: [host_resolver_pool_size]std.atomic.Value(bool) = ..., 49 + // const host_resolver_pool_size: usize = 4; 50 + ``` 51 + 52 + - 4 `zat.DidResolver` instances created once at init (`validator.zig:143`). Each wraps a `std.http.Client`. 53 + - Callers acquire a slot via `acquireHostResolver` (line 600), which spins on `cmpxchgStrong` until a slot's `available` flag goes `true→false`. Release flips it back to `true`. No mutex, no condvar, no recovery path. 54 + - **On a `resolver.resolve(parsed)` failure, nothing destroys or re-initializes the slot.** The resolver stays in whatever state it landed in after the failure, and the next frame worker thread to acquire the slot gets the same instance. 55 + - Called from frame workers (`frame_worker.zig:98-128`) on every `is_new` or `host_changed` DID — i.e. every first-seen account and every account migration. 56 + 57 + ### the single-owner `resolveLoop` 58 + 59 + ```zig 60 + // zlay/src/validator.zig:449 61 + fn resolveLoop(self: *Validator) void { 62 + var resolver = zat.DidResolver.initWithOptions(self.io, self.allocator, .{ .keep_alive = true }); 63 + defer resolver.deinit(); 64 + while (self.alive.load(.acquire)) { 65 + // pop did from queue, resolve, cache signing key, continue on error 66 + } 67 + } 68 + ``` 69 + 70 + - 1 `DidResolver`, 1 `std.http.Client`, owned for the lifetime of a single dedicated `pool_io` worker thread. 71 + - Same underlying library (`zat.DidResolver`), same `std.http.Client`, same process, same kernel, same kubernetes. The only difference from the pool is single-owner vs 4-slot-shared. 72 + - Catches resolver errors with `log.debug` + `continue`. No retry on the same resolver. 73 + - **This path works in production.** The validator signing-key cache grows to steady-state capacity (~250k of 524k cap as observed on a prior pod). We have not instrumented whether it's silently failing some percentage of resolves, so we can't say it's definitively healthy — only that it's not catastrophically broken the way the pool is. 74 + 75 + ### zat's DID resolver (wraps `std.http.Client`) 76 + 77 + ```zig 78 + // zat/src/internal/identity/did_resolver.zig:41 79 + pub fn resolve(self: *DidResolver, did: Did) !DidDocument { 80 + return switch (did.method()) { 81 + .plc => try self.resolvePlc(did), 82 + .web => try self.resolveWeb(did), 83 + .other => error.UnsupportedDidMethod, 84 + }; 85 + } 86 + // line 50: resolvePlc builds "https://plc.directory/{did}" and calls fetchDidDocument 87 + // line 98: fetchDidDocument does transport.fetch(...) catch return error.DidResolutionFailed 88 + ``` 89 + 90 + ```zig 91 + // zat/src/internal/xrpc/transport.zig:27 92 + pub fn fetch(self: *HttpTransport, options: FetchOptions) !FetchResult { 93 + var aw: std.Io.Writer.Allocating = .init(self.allocator); 94 + defer aw.deinit(); 95 + // ... build headers ... 
96 + const result = self.http_client.fetch(.{ 97 + .location = .{ .url = options.url }, 98 + .response_writer = &aw.writer, 99 + .headers = headers, 100 + .keep_alive = self.keep_alive, // configurable; default true 101 + }) catch return error.RequestFailed; // <-- swallow #1 102 + return .{ .status = result.status, .body = try self.allocator.dupe(u8, aw.written()) }; 103 + } 104 + ``` 105 + 106 + Two layers of error flattening: 107 + - `http_client.fetch(...) catch return error.RequestFailed` 108 + - Upstream: `transport.fetch(...) catch return error.DidResolutionFailed` 109 + 110 + By the time the error reaches zlay's validator, we just know "something failed." The specific error kind from `std.http.Client.fetch` is gone. We've since added a sampled warn log in `bbba92c` that captures `@errorName(err2)` on the retry path, but by the time it was shipping, the `keep_alive=false` workaround had already changed the behavior. **So we have no production log data showing the actual error kind from the broken state.** 111 + 112 + --- 113 + 114 + ## problem 1: pool rejects 100% of DID resolves 115 + 116 + ### what we measured 117 + 118 + - Cumulative reject rate on a pre-workaround pod: **896,043 / 900,201 = 99.54%** over ~30 hours. 119 + - Steady-state reject rate was ~10/sec, consistent across a fresh 20-second metrics delta (`202 checks / 202 rejects`). 120 + - Per-branch breakdown after `ee4e368` split `failed_host_authority` into six reject counters: **100% of rejects (39,621 / 40,072 in a 48-minute window) were the `resolve` branch** — `resolver.resolve(parsed)` throwing on both the first call and the retry on the same resolver. Zero rejects in `parse_did`, `no_endpoint`, `bad_url`, `unknown_host`, `host_mismatch`. 121 + 122 + ### what the rejected events looked like 123 + 124 + Manual probe: we sampled 15 DIDs from `dropping event, host authority failed` log lines and resolved each via `curl https://plc.directory/{did}` from a laptop. **13 of 15 DID docs had their `AtprotoPersonalDataServer` endpoint hostname exactly matching the incoming host** — these are events zlay should have accepted, legitimate firehose traffic from bluesky mushroom-shard PDSes. 2 of 15 were the same DID appearing on two wrong hosts — legitimate policy rejects. 125 + 126 + So the 100% rejection was not events that were "sort of invalid" or "in a migration." They were valid events from correctly-authoritative PDSes. 127 + 128 + ### what wasn't wrong (ruled out) 129 + 130 + - **DB lookup**: we verified `SELECT id, hostname FROM host` on zlay's postgres returned exactly the expected rows with stable ids. `getHostIdForHostname` works in isolation. 131 + - **Network**: the pod's own slurper is holding ~2,800 concurrent websocket connections to PDSes, including the ones whose events were being rejected. Outbound TLS works. 132 + - **CA certificates**: container image is `debian:bookworm-slim` with `ca-certificates` installed. 133 + - **plc.directory availability**: responds `200` in ~100ms to `curl` from any vantage point, including for the exact DIDs that were being rejected. 134 + - **zat DID doc parser**: `resolve` fails, not any downstream parsing. The `no_endpoint` and `bad_url` counters are flat at 0. 135 + 136 + ### the workaround that shipped 137 + 138 + `bbba92c` flipped the pool init to `DidResolver.initWithOptions(alloc, .{ .keep_alive = false })` (`validator.zig:143`). 
**Reject rate collapsed from 99.54% → 0.24% within 14 minutes of deploy.** The remaining 0.24% are `unknown_host` and `host_mismatch` — legitimate policy drops. Confirmed by per-branch counters at 48-min uptime: `resolve = 0`, `unknown_host = 2`, `host_mismatch = 1`, out of 1,259 checks. 139 + 140 + ### what the engineer's local repro couldn't find 141 + 142 + Standalone zig program outside zlay, calling `zat.DidResolver.resolve()` against `plc.directory` with `keep_alive=true`: 143 + 144 + - ReleaseFast and Debug builds 145 + - zig 0.16.0-dev.3059 and 0.16.0-dev.3070 146 + - Serial calls (one after another) and parallel (spawned threads, shared vs per-thread clients) 147 + - Single `std.http.Client` and multi-client 148 + - 0s to 10s idle between calls 149 + - 1,624 total calls across all combinations 150 + 151 + **None reproduced the failure.** All 1,624 calls succeeded. 152 + 153 + ### what we explicitly don't know 154 + 155 + 1. What error `resolver.resolve(parsed)` was throwing in production. Swallowed twice. No production log data from the broken state because the sampled-warn-log fix shipped in the same commit as the workaround. 156 + 2. Whether `resolveLoop` (single-owner, still `keep_alive=true` in prod) is genuinely healthy or silently failing some percentage of resolves. Signing-key cache fill rate is consistent with "working" but we haven't proven it. 157 + 3. Why the pool fails when single-owner doesn't. The structural differences are: pool has 4 slots vs 1; pool is acquired by 16 frame worker threads via atomic handoff vs 1 dedicated thread; pool does `resolve()` twice in rapid succession on failure (retry on same resolver) vs single-owner which does 1 call + continue. 158 + 4. Whether the bug is in zlay's pool code, in zat's `DidResolver`, in `std.http.Client`, or some interaction between them. 159 + 160 + ### the guess we're working from (not a diagnosis) 161 + 162 + **The pool has no slot-recovery path.** Whatever causes a resolver to go into a bad state (one transient network hiccup, a server-side connection close the client didn't notice, TLS session expiry, anything), the slot stays bad forever because nothing destroys and re-initializes it. Over a few minutes, all 4 slots drift into bad states and everything rejects. 163 + 164 + This would be consistent with: 165 + - Isolated repros not reproducing — a fresh test creates a fresh client per run, so "resolver lives long enough to hit a transient failure and never recovers" is unreachable. 166 + - `keep_alive=false` fixing it — no reused connection state, nothing to break. 167 + - Single-owner `resolveLoop` appearing to work — it has the same reuse pattern but it only does one call per loop iteration and `continue`s on errors, and we haven't measured hard enough to be sure it's not also silently degraded. 168 + 169 + But we're calling this a **guess**, not a root cause. We've already been wrong about this bug's shape once (we originally framed it as "zig 0.16 `std.http.Client` stale keep-alive handling is broken" which got falsified by the local repro). We don't want to ship another confident fix built on another unverified hypothesis. 170 + 171 + ### proposed fix (unimplemented) 172 + 173 + **Add slot recovery**: on any `resolver.resolve(parsed)` failure in the pool, `resolver.deinit()` + `resolver.* = DidResolver.init...` the slot before releasing it back to the pool. This is entirely in zlay code, requires no stdlib claims, and should make the pool self-heal regardless of what "bad state" means. 
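To make the review concrete, a minimal sketch of that shape. It is not implemented, and some names are assumed: `acquireHostResolver` is real (line 600) but is assumed here to return a slot index, `releaseHostResolver` stands in for whatever the release side is actually called, `parsed` is the already-parsed DID from `resolveHostAuthority`, and the `initWithOptions` arity follows the `resolveLoop` excerpt above.

```zig
// sketch, not shipped: re-init a pool slot on resolve failure so one bad state
// cannot poison every later acquire of that slot.
const slot = self.acquireHostResolver(); // assumed to return the slot index
defer self.releaseHostResolver(slot); // assumed name for the release side

const doc = self.host_resolvers[slot].resolve(parsed) catch |err| {
    // destroy-and-reinit before the slot goes back to the pool
    self.host_resolvers[slot].deinit();
    self.host_resolvers[slot] = zat.DidResolver.initWithOptions(
        self.io,
        self.allocator,
        .{ .keep_alive = false }, // or keep_alive=true again, once recovery makes reuse safe
    );
    log.warn("host_authority: reset resolver slot {d} after {s}", .{ slot, @errorName(err) });
    return error.DidResolutionFailed;
};
return doc; // caller continues with the endpoint-vs-host comparison as today
```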
174 + 175 + We haven't implemented this yet because we want to know if it's structurally sound or if there's a gotcha we're not seeing. 176 + 177 + ### questions on problem 1 178 + 179 + 1. **Is there a mental model for what could break `std.http.Client` in a pooled-across-threads pattern that fresh-per-call and single-owner both avoid?** Our guess is "slot gets poisoned, no recovery path," but we can't explain *what* poisons it. Do you know of a structural thing in `std.http.Client` (connection pool, TLS session cache, request buffer state) that could get into a bad state and then silently fail every subsequent request without surfacing a clear error? 180 + 2. **Is destroy-and-reinit the right shape of recovery, or is there a lighter-weight reset?** We'd prefer to clear the client's internal connection pool or trigger a reconnect without throwing away the whole struct. Does zig stdlib expose anything like that? 181 + 3. **Is it a red flag that `resolveHostAuthority` does two calls on the same client in rapid succession (first attempt + immediate retry on failure)?** Is there a client-internal state that doesn't fully reset between calls that a retry without a brief delay could trip over? 182 + 4. **Any known stdlib gotchas around `std.http.Client` when the calling thread changes?** Our 16 frame workers are plain `std.Thread`, and any of them can acquire any of the 4 resolver slots. We can't find documentation on whether `std.http.Client` holds per-calling-thread state. 183 + 184 + --- 185 + 186 + ## problem 2: HTTP server fiber starvation during cold-start spawn 187 + 188 + ### the setup 189 + 190 + ```zig 191 + // zlay/src/slurper.zig:303 — kicked off in Slurper.start 192 + self.startup_future = try self.io.concurrent(spawnWorkers, .{self}); 193 + ``` 194 + 195 + `spawnWorkers` is an Evented fiber running on the main `io`. It does this: 196 + 197 + ```zig 198 + // zlay/src/slurper.zig:670-683 (elided) 199 + for (hosts) |host| { 200 + self.spawnWorker(host.id, host.hostname, host.last_seq) catch |err| { ... }; 201 + spawned += 1; 202 + if (spawned % batch == 0 and spawned < hosts.len) { 203 + log.info("startup: spawned {d}/{d} hosts, yielding...", .{ spawned, hosts.len }); 204 + self.io.sleep(Io.Duration.fromMilliseconds(100), .awake) catch break; 205 + } 206 + } 207 + ``` 208 + 209 + `STARTUP_BATCH_SIZE` defaults to 50. With ~2,770 hosts that's ~55 batches × (work + 100ms yield). 210 + 211 + Each `spawnWorker` call creates a subscriber task (`Io.concurrent(Subscriber.run, ...)`) which immediately starts a websocket handshake to a different PDS. Under `Io.Evented`, these spawns should yield during network I/O — but the spawn setup itself (allocations, struct init, db request queue pushes, per-subscriber fiber creation) is synchronous CPU work. 212 + 213 + The HTTP server for health probes is `io.concurrent(runWsServer, ...)` on port 3000 (`main.zig:418`), and the metrics server is `io.concurrent(MetricsServer.run, ...)` on port 3001 (`main.zig:396`). Both are fibers on the **same main `io`** as the spawn loop. 214 + 215 + ### what we observe 216 + 217 + - With original k8s probes (`initialDelay=30s timeout=5s failureThreshold=10`): the pod crash-looped. Every ~20 minutes, kubelet saw 10 consecutive liveness failures, SIGKILLed the container, and the spawn + resolve storm started over. 25 restarts in 9 hours until we noticed. 218 + - With relaxed probes (`initialDelay=300s timeout=15s failureThreshold=20`): the pod survived the first 20 minutes. 
But **at 21 minutes of uptime, readiness still briefly flapped 0/1 for ~90 seconds** before recovering on its own, with no container restart. The corresponding log line at that time: `info(relay): startup: spawned 2400/2770 hosts, yielding...`. 219 + - So the spawn loop was 87% done at 21 minutes, and the batches it was still processing were heavy enough to delay the `/_readyz` fiber past a 15-second timeout. 220 + - `/metrics` curl attempts during these windows timed out at 20 seconds. 221 + 222 + ### the aggravating factor from problem 1 223 + 224 + `keep_alive=false` on the host_authority pool means every new-DID check during spawn-up does a fresh DNS + TCP + TLS handshake to plc.directory (~350–900ms per call). The 4-slot pool saturates as thousands of is_new events flood in from the newly-connected PDS subscribers. Frame workers spin-blocking on `acquireHostResolver` pile up CPU, which shares resources with everything else. 225 + 226 + The frame worker pool is on `Io.Threaded` (separate from Evented), so it shouldn't directly starve the HTTP fibers via the scheduler. But: 227 + - Both still share the same process + CPU cores 228 + - Both contend on shared state (atomic counters, the persist ordering mutex at `broadcaster.zig:492`, metrics map updates) 229 + - The HTTP handler itself touches some of the same metrics state it's reading for `/metrics` responses 230 + 231 + We can't cleanly separate "how much of the probe delay is spawn itself" from "how much is the resolver pool blocking frame workers which contend with HTTP." We just know both are happening. 232 + 233 + ### what we explicitly don't know 234 + 235 + 1. Which fiber is actually stuck when `/_healthz` times out. We haven't taken a fiber-level trace or `pprof`-equivalent. 236 + 2. Whether 100ms between batches is structurally the wrong yield, or whether the scheduler is fine and the batches themselves are much slower than we think. 237 + 3. Whether the right architectural answer is "isolate HTTP onto its own thread" vs "make spawn slower" vs "make spawn cheaper per-host." 238 + 239 + ### questions on problem 2 240 + 241 + 1. **Is there an idiomatic `Io.Evented` pattern for "CPU-intensive startup task that must not starve other fibers"?** Our current `io.sleep(100ms)` between batches is a fixed yield. Is there a "yield to anything pending" primitive, or a fair scheduler we should be using? 242 + 2. **Should the HTTP probe endpoints run on a dedicated OS thread with their own event loop, specifically so business-logic stalls can't make kubelet kill the process?** We've seen this pattern in other relay implementations. The cost is some duplication + inter-thread handoff for anything `/metrics` wants to read, but the benefit is decoupling liveness from workload. 243 + 3. **Is 50 hosts per batch × 100ms yield reasonable, or should we be looking at "spawn 1 at a time with 0ms yield" or "spawn 500 at a time with 5s yield" or something else entirely?** The batching math here feels hand-tuned in a way we can't defend. 244 + 4. **Does anyone else run a zig-ecosystem network service in production at this scale? 
Any design patterns we should steal?** 245 + 246 + --- 247 + 248 + ## what's deployed right now 249 + 250 + - Image: `atcr.io/zzstoatzz.io/zlay:ReleaseFast-bbba92c` 251 + - Contains: 252 + - Host_authority resolver pool with `keep_alive=false` (problem 1 workaround) 253 + - Per-consumer ring buffer 8192 → 65536 in `broadcaster.zig` (separate fix for `ConsumerTooSlow` kicks on downstream subscribers, not covered in this doc but shipped at the same time) 254 + - Per-branch `relay_host_authority_reject{branch=...}` counters + sampled warn logs 255 + - k8s Deployment probes (cluster state, not yet captured in `zlay/deploy/` yaml): 256 + - Liveness: `initialDelay=300s timeout=15s failureThreshold=20 period=30s` 257 + - Readiness: `initialDelay=60s timeout=15s failureThreshold=20 period=15s` 258 + 259 + Pod is currently 1/1 Running. Readiness has flapped briefly a few times during startup spawn but kubelet has not killed it. Host_authority reject rate is ~0.24% (all legitimate policy drops). We do not consider either problem solved. Both have band-aid workarounds. 260 + 261 + --- 262 + 263 + ## what we're asking for 264 + 265 + Priority order: 266 + 267 + 1. **Is the "pool has no slot recovery" shape of fix structurally sound?** If yes, we'll implement it. If you can poke holes in it, we'd rather hear about them before we ship and have to revisit. 268 + 2. **Any independent guesses about what's actually breaking the pool** that we can test. We have production telemetry we can instrument, including a planned "1 of 4 slots with `keep_alive=true`, 3 with `keep_alive=false`" experiment to capture the error kind from production. Open to better ideas. 269 + 3. **Architectural thoughts on HTTP probe isolation.** We think dedicated-thread probes might be the right long-term answer, but we want to hear if you've seen this go wrong. 270 + 4. **General sanity check on the Evented+Threaded hybrid** — is this architecture load-bearing correctly, or are we accumulating workarounds on top of a fundamentally confused threading model? 271 + 272 + We're not in crisis. The relay is running. We want to get to a state where both band-aids come off and are replaced with things we understand. 
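For ask 2, a minimal sketch of what the mixed keep_alive experiment could look like at the pool init (`validator.zig:143`), under the same naming assumptions as the slot-recovery sketch above. The only point is that slot 0 keeps connection reuse in production, so the sampled `@errorName` warn log on the retry path can finally capture the real error kind:

```zig
// sketch: 1 of 4 slots with keep_alive=true, the other 3 with keep_alive=false.
// if slot 0 starts rejecting while slots 1-3 stay clean, the reuse pattern is
// implicated and the sampled warn log gives us the production error kind.
for (&self.host_resolvers, 0..) |*resolver, i| {
    resolver.* = zat.DidResolver.initWithOptions(self.io, self.allocator, .{
        .keep_alive = (i == 0),
    });
    self.host_resolver_available[i] = std.atomic.Value(bool).init(true);
}
```

Telling the two populations apart would need a per-slot label (or at least a slot-0-only counter) on the reject metrics.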
273 + 274 + --- 275 + 276 + ## code pointers 277 + 278 + - Slurper spawn entry: [`zlay/src/slurper.zig:303`](https://tangled.org/zzstoatzz.io/zlay) (`io.concurrent(spawnWorkers, ...)`) 279 + - `spawnWorkers` body: `zlay/src/slurper.zig:625-686` 280 + - Frame worker hot path: `zlay/src/frame_worker.zig:85-128` (host_authority check at 98-128) 281 + - Resolver pool state + init: `zlay/src/validator.zig:65` + `:143` 282 + - Pool acquire/release: `zlay/src/validator.zig:600-615` 283 + - `resolveHostAuthority`: `zlay/src/validator.zig:568-599` 284 + - `resolveLoop` (single-owner, works): `zlay/src/validator.zig:449-500` 285 + - `checkPdsHost` (downstream branches, not implicated): `zlay/src/validator.zig:618-660` 286 + - HTTP/WS server: `zlay/src/main.zig:414-418` (`io.concurrent(runWsServer, ...)`) 287 + - Metrics server: `zlay/src/main.zig:384-396` (`io.concurrent(MetricsServer.run, ...)`) 288 + - zat transport (outer error swallow): `zat/src/internal/xrpc/transport.zig:62-75` 289 + - zat DID resolver (inner error swallow): `zat/src/internal/identity/did_resolver.zig:41-107` 290 + 291 + ## git context 292 + 293 + - zig 0.16 migration: commit `9cc1ba3` on 2026-04-05 (`migrate to zig 0.16: Io primitives, updated deps, timer regression fixes`) 294 + - Resolver pool added: commit `1639565` on 2026-03-18 (`host authority: reuse pooled resolvers + add diagnostics`) — was working on zig 0.15 295 + - Per-branch counters + buffer bump: commit `ee4e368` (2026-04-08) 296 + - Build failure on `_ = err1;`: commit `584571a` (zig 0.16 rejected the error-set discard; `zig build test` passed because the test module didn't reference `resolveHostAuthority` so lazy analysis skipped it) 297 + - Final workaround: commit `bbba92c` (dropped the `|err1|` capture, wired sampled log into resolve branch, flipped `keep_alive=false` on the pool) 298 + 299 + ## appendix: things we ruled out along the way 300 + 301 + - DB lookup of host id by hostname (`getHostIdForHostname`) — verified via direct psql 302 + - plc.directory availability — verified via curl from outside the pod for DIDs the pod was rejecting 303 + - ca-certificates — container has them 304 + - `extractHostFromUrl` edge cases on bsky shard hostnames — they're plain `https://{shard}.host.bsky.network`, nothing unusual 305 + - `zig build test` as a sufficient CI check for zlay validator changes — it's not (lazy analysis won't force-check functions that aren't referenced from tests); engineer rule added: run `zig build` (exe) in addition to `zig build test` 306 + - "zig 0.16 `std.http.Client` stale keep-alive handling is broken" — this was our first hypothesis, held in writing too confidently, falsified by the engineer's standalone repro. It had the right shape (workaround involves keep_alive) but the wrong causal claim.
+208
docs/zlay-handoff-2026-04-09-rollback.md
··· 1 + # zlay: rollback to b91382b restored downstream delivery — report for engineering 2 + 3 + *2026-04-09, operator handoff after the earlier review doc* 4 + 5 + this is a follow-up to `zlay-external-review-2026-04-09.md` and 6 + `zlay-broadcaster-starvation-2026-04-09.md`. it is written **for the 7 + zlay engineer, by the operator**. no code changes shipped in this 8 + session — the actions were all at the k8s / image-tag layer. 9 + 10 + ## tl;dr 11 + 12 + - rolled back the zlay deployment from `ReleaseFast-4f3d1d4` → 13 + `ReleaseFast-zat21-b91382b` (the 2026-04-06 build). 14 + - external consumer delivery went from **6 frames / 170s 15 + (0.035 fps)** on `bbba92c`/`795cc41`/`4f3d1d4` to 16 + **~400 fps at 99.4% of ingest** on b91382b. same code path, 17 + same cluster, same PDS pool, ~10 minutes apart. 18 + - `relay_validation_failed{reason="host_authority"}` on b91382b is 19 + **1.3% steady state** (7 / 551). the 99.5% reject rate you fixed 20 + with `keep_alive=false` on bbba92c is **not present** on b91382b. 21 + - those two measurements together narrow the introduction window 22 + for both bugs to **2026-04-06 → 2026-04-07**. the broadcaster- 23 + starvation doc already flagged commit `1eec324` (FrameWork UAF 24 + fix) as suspicious because it changed how FrameWorks are queued. 25 + i want to flag it again, harder: it is now the prime suspect for 26 + *both* the delivery collapse and the host_authority bug. 27 + - current pod is healthy and serving evaluators, but **b91382b 28 + still has the FrameWork.hostname UAF** that 1eec324 fixes. a 29 + reconnect storm (cron fires every 4h) may trip the UAF and 30 + restart the pod. i'm accepting that risk to keep the evaluators 31 + alive until you can ship a build that has the 1eec324 fix 32 + *without* the delivery regression. 33 + 34 + ## the measurements 35 + 36 + ### pre-rollback (pod: `zlay-6c776bf9b9-zv9pl`, `4f3d1d4`, 28m uptime) 37 + 38 + | signal | value | 39 + |---|---| 40 + | `frames_received_total` | 742988 | 41 + | `frames_broadcast_total` | **478** (frozen at cold-start value) | 42 + | `consumers_active` | 0 | 43 + | `host_authority_reject{branch="host_mismatch"}` | 15,257 / 17,236 (**88%**) | 44 + | `host_authority_reject{branch="resolve"}` | 0 (the `keep_alive=false` workaround holds) | 45 + | internal `/metrics` (first probe, port-forward) | 200 in 0.5s | 46 + | internal `/metrics` (second probe, 15s later) | hung >10s | 47 + | external `https://zlay.waow.tech/_health` | **503** (ingress) | 48 + | `endpoints/zlay` in k8s api | **empty** | 49 + | pod `Ready` condition | flipped `True→False` at 20:59:34Z | 50 + 51 + pod is `1/1 Running` in the kubelet's view but `Ready=False` and out 52 + of the service endpoints — the relaxed probes from your handoff 53 + (`initialDelay=300s failureThreshold=20`) are keeping the container 54 + alive but the readiness probe is still failing enough to drop it 55 + from the service. external consumers see 503. 
56 + 57 + ### post-rollback (pod: `zlay-679c966669-6xg5v`, `ReleaseFast-zat21-b91382b`) 58 + 59 + **probe sweep at ~90s uptime:** 60 + 61 + | probe | result | 62 + |---|---| 63 + | `https://zlay.waow.tech/_health` | 200 in 0.29s | 64 + | `https://zlay.waow.tech/xrpc/_health` | 200 in 0.23s | 65 + | raw `wss://.../subscribeRepos` (python websockets, 15s) | **5,896 frames, ~395 fps** | 66 + | `just zlay test-tap 30` | connected + ran full 30s without error | 67 + 68 + **metrics delta with an active consumer attached (15s window):** 69 + 70 + | metric | t0 | t1 | Δ | 71 + |---|---:|---:|---:| 72 + | `frames_received_total` | 472534 | 479377 | +6843 | 73 + | `frames_broadcast_total` | 20828 | 27629 | +6801 | 74 + | `consumers_active` | 1 | 1 | — | 75 + 76 + **delivery ratio = 6801/6843 = 99.4%.** in the same conditions, 77 + `4f3d1d4`/`bbba92c`/`795cc41` have `frames_broadcast_total` frozen 78 + at their cold-start value even while a consumer is actively 79 + subscribed. the 0.6%-ish gap on b91382b is within the range of 80 + normal prevData chain-break filtering you'd expect. 81 + 82 + **host_authority on b91382b (no active workaround, no slot 83 + recovery, no per-branch breakdown, just the aggregate counter):** 84 + 85 + ``` 86 + relay_host_authority_checks_total 513 87 + relay_validation_failed{reason="host_authority"} 1 # at uptime ~4 min 88 + relay_validation_failed{reason="host_authority"} 7 # at uptime ~6 min 89 + ``` 90 + 91 + **0.19% → 1.3%** over the ramp-up window. well within normal 92 + legitimate-policy-drop range. `workers_count` was still climbing 93 + (531 → 652 → 704) so this is not a full-load measurement — the 94 + reject rate may climb as the host table fully ramps, but there is 95 + no evidence yet of the 99.5% catastrophic rejection you saw on 96 + bbba92c-pre-workaround. 97 + 98 + ## what this tells us about when the bugs were introduced 99 + 100 + the ops-changelog 2026-04-06 entry documents that `b91382b` (your 101 + zat v0.3.0-alpha.21 release) was observed with tap and hydrant 102 + "consuming cleanly with no rejections". your 04-07 entry and the 103 + 2026-04-09 investigation confirmed that by the time `31825b2` 104 + was running, (a) downstream delivery was broken ("33% dropout rate 105 + on relay-eval") and (b) host_authority was rejecting at 99.5% 106 + (discovered retroactively on 04-09). both bugs were present and 107 + unobserved on 31825b2. 108 + 109 + my measurements today close the loop: b91382b does not have either 110 + bug. so both were introduced between `b91382b` and `31825b2`. the 111 + commits in that window are, per the broadcaster-starvation doc: 112 + 113 + ``` 114 + 1eec324 fix UAF: dupe FrameWork.hostname per submit instead of borrowing 115 + 31825b2 subscriber: extract prepareFrameWork + add UAF regression test 116 + ``` 117 + 118 + both touch FrameWork queueing. this is the same suspicion the 119 + broadcaster-starvation doc raised — it's now harder to dismiss. 120 + 121 + **possible mechanism (hypothesis, not proven):** if `1eec324` 122 + changed the sequence or timing of when a FrameWork is handed off 123 + to the frame worker pool vs when `host_authority` runs vs when the 124 + pool-slot resolver is touched, it could be doing something that: 125 + 126 + 1. puts the host_authority pool's `std.http.Client` into a state 127 + the single-owner `resolveLoop` never reaches (we never 128 + confirmed what actually poisoned the pool slots — the 129 + `keep_alive=false` workaround just avoided the reuse pattern) 130 + 2. 
creates backpressure against the broadcaster fiber on main 131 + `Io.Evented`, forcing it into the polling-loop starvation state 132 + the broadcaster-starvation doc described 133 + 134 + i am not claiming these are the mechanisms. i'm saying: the fact 135 + that a single commit window introduced both bugs at once, both 136 + affecting the frame-queueing hot path, is enough evidence to 137 + direct the next diagnostic at that commit specifically rather than 138 + at the resolver pool or the writeLoop in isolation. 139 + 140 + ## what i'd ask for (operator-side asks) 141 + 142 + 1. **treat `1eec324` and the 04-07 FrameWork queueing refactor as 143 + the suspect** for both the host_authority bug and the 144 + broadcaster delivery collapse. before shipping the Option A 145 + writeLoop fix on top of `4f3d1d4`, consider git-bisecting the 146 + 04-06 → 04-07 commit window against a simple 15-second 147 + websocket consumer that counts frames. i can run that check 148 + against any canary image you ship to atcr — it's a 10-line 149 + script. 150 + 2. **the real target build** is: `1eec324` + `bbba92c` (host_authority 151 + keep_alive=false workaround) + `795cc41` (slot recovery + 152 + observability) + the fix for whatever `1eec324` actually broke. 153 + none of the 04-09 builds have that combination because they 154 + build on top of the regression. 155 + 3. **don't assume the host_authority fix is sufficient.** the 156 + keep_alive=false workaround fixed the symptom of a poisoned 157 + pool, but the poisoning itself is likely the downstream effect 158 + of the same thing that broke delivery. if you fix delivery and 159 + the host_authority bug goes away without the workaround, that's 160 + strong evidence the two are the same bug. 161 + 162 + ## what i am doing on the operator side 163 + 164 + - leaving the pod on `ReleaseFast-zat21-b91382b` until you ship 165 + something better. external consumers can evaluate zlay again 166 + from this state. 167 + - watching for FrameWork UAF restarts over the next 4-hour 168 + reconnect cron cycle. if the pod restarts with a 169 + stack-pointer-shaped hostname in the logs, i'll roll forward to 170 + `31825b2` (which has the UAF fix but re-introduces the delivery 171 + bug) only as a last resort — a relay that crashes every 4h and 172 + serves 400 fps between crashes is still better for evaluators 173 + than a relay that stays up and serves 0 fps. 174 + - leaving the relaxed k8s probes in place. they don't hurt a 175 + fast-responding pod, and the engineer rule is "relax probes, 176 + don't fix the underlying bug with probes" — but on b91382b the 177 + HTTP fibers are fast anyway. 178 + - leaving `HOST_RESOLVER_POOL_SIZE` at its default (b91382b 179 + predates the env var entirely — it's just the hardcoded 4). 180 + - no code changes from me this session. that is intentional: the 181 + operator's purview is deploy + dial-turn + measurement + 182 + reporting. fixing zig code is yours. 
183 + 184 + ## reproducing the measurements 185 + 186 + for the next operator or for you: 187 + 188 + ```bash 189 + # external health + delivery rate, no port-forward needed 190 + curl --connect-timeout 3 -m 5 https://zlay.waow.tech/_health 191 + uvx --from 'websockets==13.*' python -c ' 192 + import asyncio, websockets, time 193 + async def main(): 194 + async with websockets.connect("wss://zlay.waow.tech/xrpc/com.atproto.sync.subscribeRepos", max_size=None) as ws: 195 + n = 0; t = time.time() 196 + while time.time() - t < 15: 197 + await ws.recv(); n += 1 198 + print(f"{n} frames in 15s = {n/15:.1f} fps") 199 + asyncio.run(main()) 200 + ' 201 + 202 + # metrics delta (needs port-forward to :3001 and a consumer attached for frames_broadcast to move) 203 + kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay port-forward deploy/zlay 3001:3001 & 204 + ``` 205 + 206 + i'll probably encode this as `just zlay probe` and `just zlay 207 + delta` recipes in a follow-up, so the next operator doesn't have 208 + to rediscover it.
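in the meantime, a rough sketch of what the `delta` half would wrap, run against the port-forward above. metric names are taken from the tables in this report; if the exposition adds labels or prints floats, adjust the awk accordingly.

```bash
# scrape the frame counters twice, 15s apart, and report the delta.
# only meaningful with a consumer attached (frames_broadcast_total won't move otherwise).
scrape() { curl -s localhost:3001/metrics | awk -v m="$1" '$1 == m { print int($2) }'; }

r0=$(scrape frames_received_total);  b0=$(scrape frames_broadcast_total)
sleep 15
r1=$(scrape frames_received_total);  b1=$(scrape frames_broadcast_total)

echo "received +$((r1 - r0)), broadcast +$((b1 - b0)) over 15s"
# b91382b measured ~99.4% here; the broken 04-09 builds sit at 0%
awk -v r="$((r1 - r0))" -v b="$((b1 - b0))" \
  'BEGIN { if (r > 0) printf "delivery ratio = %.1f%%\n", 100 * b / r }'
```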
+133
docs/zlay-threaded-resolution-2026-04-10.md
··· 1 + # zlay: Io.Threaded resolution — 2026-04-10 2 + 3 + closes the april 8-9 incident and the broader 0.16 coverage degradation. 4 + 5 + ## what we shipped 6 + 7 + two commits on 2026-04-09, deployed sequentially: 8 + 9 + ### 1. `e6cdf84` — switch to Io.Threaded backend 10 + 11 + one-line change: `const Backend = Io.Evented` → `Io.Threaded` in 12 + `src/main.zig:61`. switches from io_uring fiber scheduling (~35 threads) 13 + to thread-per-PDS (~2,800 OS threads). this was the proven model on 14 + 0.15, which ran at 99%+ coverage. 15 + 16 + also enables **ReleaseSafe** (was impossible under Evented due to a zig 17 + codegen bug in fiber context-switch). ReleaseSafe gives panic + stack 18 + trace on safety violations instead of silent undefined behavior. 19 + 20 + ### 2. `4735725` — fix socket-after-close race + buffer bump 21 + 22 + ReleaseSafe immediately surfaced a pre-existing bug: `dropSlowConsumer` 23 + (broadcast thread) called `conn.close()` while the websocket server's 24 + `readLoop` (server thread) was calling `posix.setsockopt` on the same 25 + fd. under ReleaseFast this was silent UB on every prior build. 26 + 27 + fix: socket close ownership moved from broadcast thread to writeLoop 28 + thread. `dropSlowConsumer` only sets `alive=false`; writeLoop exits, 29 + drains, then closes. also bumped consumer BUFFER_CAP 8192 → 65536 30 + (~3 min headroom at 335 fps instead of ~24 sec). 31 + 32 + ## why Io.Threaded 33 + 34 + switching to Io.Threaded eliminates the failure mode in production and 35 + restores delivery to expected levels. Evented is the leading regression 36 + source for the acute april 8-9 outage, and plausibly for the broader 37 + ~10-15% coverage degradation since `9cc1ba3`, but we did not fully 38 + root-cause every sub-symptom inside the Evented pipeline. the known 39 + issues under Evented: 40 + 41 + - 8 cross-Io crash classes (Threaded futex on Evented fiber → NULL 42 + threadlocal → SIGSEGV or heap corruption) 43 + - ReleaseSafe GPF from fiber context-switch codegen bug (forced 44 + ReleaseFast, hiding safety violations) 45 + - HTTP fiber starvation under load (the ~10-40 min degradation cycle) 46 + - persist_order_spins 33M/sec (mutex contention from Evented scheduler) 47 + - broadcast_queue_depth_hwm 8,191 (Evented scheduler couldn't drain) 48 + - zig team marks Evented as "experimental" with "important followup work" 49 + 50 + all of these vanish under Threaded. the only cost is thread count 51 + (~2,800 instead of ~35), which matches the 0.15 baseline that ran at 52 + 99%+ coverage for months. 
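a quick operator-side way to tell which backend a running pod is actually on is the OS thread count: thread-per-PDS lands in the thousands, the Evented fiber scheduler sat around ~35. kubeconfig, namespace, and deployment name below match the probe commands in the canary report; this assumes the image ships a shell and procfs is readable.

```bash
# count OS threads in the relay process (pid 1 in the container).
# ~2,800 => Io.Threaded (thread-per-PDS), ~35 => Io.Evented.
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec deploy/zlay -- \
  sh -c 'ls /proc/1/task | wc -l'
```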
53 + 54 + ## measurements 55 + 56 + ### `4735725` (Threaded + ReleaseSafe + consumer fix) at T+22 min, 2,700 workers 57 + 58 + | signal | value | vs Evented builds | 59 + |---|---|---| 60 + | /_health | 200 in 0.29s | Evented: hung after 10-40 min | 61 + | delivery (raw ws) | 152 fps (still ramping) | Evented canary 1: 0 fps at failure | 62 + | host_authority rejects | 0 new/15s steady state | Evented: 88-99% reject rate | 63 + | pool_queued_bytes | 0-13k oscillating | Evented: 64 MiB stuck | 64 + | persist_order_spins | **2.9k/sec** | Evented: **33M/sec** (11,000× higher) | 65 + | broadcast_queue_depth_hwm | 36 | Evented: 8,191 | 66 + | RSS | 1.89 GiB stable | comparable | 67 + | restarts | 0 | Evented: crashed at T+10-80 min | 68 + 69 + ### `e6cdf84` (first Threaded build, before consumer fix) fully ramped 70 + 71 + | signal | value | 72 + |---|---| 73 + | delivery ratio (consumer attached) | 97.3% (5.0k/5.1k in 15s) | 74 + | host_authority rejects | 93% cumulative (ramp residue), 1.5% steady | 75 + | persist_order_spins | 28.2k/sec | 76 + | RSS | 1.92 GiB stable | 77 + | crash | 1 restart at T+80 min (socket race, fixed in 4735725) | 78 + 79 + ### hydrant verification (4735725) 80 + 81 + 739,075 events consumed in 30s with full signature verification, 0 82 + errors. clean log. no crash, no ConsumerTooSlow. 83 + 84 + ## what the reviewer's analysis got right and wrong 85 + 86 + the reviewer's handoff (2026-04-09 evening) correctly identified: 87 + - the 0.16 migration (`9cc1ba3`) as the true coverage inflection point 88 + - `1eec324` (UAF fix) as NOT the prime suspect (canary 1 was healthy) 89 + - cross-Io mistakes as a recurring failure class 90 + 91 + the reviewer recommended tracing the Evented pipeline with Tracy and 92 + making narrow fixes. we chose a different path: abandon Evented entirely 93 + and switch to Threaded. this was faster, simpler, and eliminated the 94 + entire problem class rather than trying to trace scheduler behavior in 95 + an experimental runtime. 96 + 97 + the reviewer explicitly said "do not revert to thread-per-PDS — that 98 + defeats the architectural point." we respectfully disagreed. the 99 + architectural point (thread reduction) is subordinate to having a 100 + working, debuggable system. Evented can be revisited when zig's runtime 101 + matures. 102 + 103 + ## what's left 104 + 105 + ### fixed by this work 106 + - [x] HTTP fiber starvation / 10-40 min degradation cycle 107 + - [x] host_authority 88-99% reject rate (not reproducing under Threaded; Evented appears to have amplified or concentrated it, but root cause attribution is incomplete) 108 + - [x] persist_order_spins mutex contention (33M → 2.9k/sec) 109 + - [x] broadcast_queue backpressure (8,191 → 36 HWM) 110 + - [x] socket-after-close panic in consumer drop path 111 + - [x] ConsumerTooSlow frequency (8192 → 65536 buffer) 112 + - [x] ReleaseSafe enabled (was blocked by Evented GPF) 113 + 114 + ### still open (not blocking, not urgent) 115 + - orphaned `account.host_id = 0` sentinel (4.11% of accounts) — causes 116 + `is_new` burst during cold start, not a reject source 117 + - Consumer.writeLoop polling (100ms sleep instead of cond.wait) — caps 118 + per-consumer throughput at ~10 fps. real bug, low priority. 119 + - DiskPersist.gc() holds persist mutex for entire body — real bug, low 120 + priority now that gc runs on its own thread without Evented contention 121 + - pool_io redundancy — both runtimes are now Threaded, pool_io can be 122 + merged. cleanup, not urgent. 
123 + - DbRequestQueue bridge — unnecessary under all-Threaded. cleanup. 124 + - uring networking patch inert in Dockerfile — cleanup. 125 + 126 + ## deploy recipe 127 + 128 + ``` 129 + just zlay publish-remote ReleaseSafe 130 + ``` 131 + 132 + note: **ReleaseSafe**, not ReleaseFast. the fiber GPF that forced 133 + ReleaseFast was Evented-only.