# operator note: switch to Io.Threaded (2026-04-09)

## what changed

`e6cdf84` switches zlay from `Io.Evented` (io_uring fibers, ~35 threads)
to `Io.Threaded` (OS thread per PDS subscriber, ~2,800 threads). this is
the same execution model as the 0.15 baseline that ran at 99%+ coverage.

one-line change in `src/main.zig:61`. everything else is the same.

## why

the Evented backend was the root of every major issue since the 0.16
migration: 8 crash classes, a ReleaseSafe GPF, and a persistent 10-15%
coverage gap that nobody could explain. the zig team marks Evented as
experimental. rather than continuing to debug an unstable runtime, we're
reverting to the proven thread-per-PDS model and keeping all other 0.16
improvements.

## how to deploy

**build with ReleaseSafe** (not ReleaseFast). the fiber GPF that forced
ReleaseFast was an Evented-only bug. ReleaseSafe gives better error
messages and safety checks.

```
just zlay publish-remote ReleaseSafe
```

this is a change from previous deploys which used ReleaseFast.

## what to expect

- thread count: ~2,800-2,900 (same as 0.15 baseline, up from ~35-47)
- VmSize: ~22-25 GiB (same as 0.15 baseline)
- RSS: should be comparable to 0.15 (~1.4 GiB)
- coverage: targeting 99%+ (matching 0.15 and the b91382b rollback)
- the cross-Io crash class is eliminated entirely

## what to watch

1. external health: `curl https://zlay.waow.tech/_health` should respond
   in < 1s immediately, with no 10-minute degradation cycle
2. delivery: 15s websocket consumer should show ~395 fps once host table
   ramps (~20 min)
3. host_authority reject rate: should be ~1-2% steady state (same as
   b91382b), not the 88-99% seen on Evented builds
4. thread count: `ps -eLf | grep zlay | wc -l` should show ~2,800-2,900
5. no restarts through a full 4h reconnect-cron cycle
6. ReleaseSafe specific: if any safety check fires, you'll get a clear
   error message + stack trace instead of silent corruption

## rollback

if anything goes wrong, roll back to the known-good b91382b image:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b
```

## what this is NOT

this is not a feature change, a dependency bump, or a new fix for the
april 8-9 outage. it's a backend selection change that eliminates the
runtime layer that was causing all the problems. the UAF fix (1eec324),
dep bumps, gcLoop fix, and host_authority work are all still in the tree
and unaffected.
docs/zlay-canary-plan-2026-04-09.md
# zlay canary plan (2026-04-09)

**Goal**: isolate which commit in the `b91382b..31825b2` window regressed
external HTTP responsiveness + downstream delivery. Current production is
`ReleaseFast-zat21-b91382b` (known good: 99.4% delivery, responsive HTTP).
We ship three canaries, each adding exactly one behavioral commit on top
of the previous one, starting from `b91382b`, in the order the reviewer
recommended. Each canary answers one diagnostic question. No canary
attempts to "fix everything."

**Reading guide for operator**: each canary section below is self-contained.
You should only need the "deploy" / "watch for" / "on failure" / "rollback"
blocks. The narrative between them is for the engineer if something
weird happens.

## a note on the `zat21-` prefix in the production image tag

Production is running `atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b`.
The `zat21-` prefix looks like it means "built with zat v0.3.0-alpha.21",
but that's a misleading label. The committed `b91382b` pins zat
**v0.3.0-alpha.17**, and no commit in this repo (across any branch or tag)
ever pinned alpha.21. The image is built from committed b91382b, so its
actual zat is alpha.17. The only zat bump after b91382b is `168d9f1`,
which goes directly to alpha.22.

Consequence: **canary 1 (zat alpha.17) matches production exactly.** No
drift, no hidden variable. Canary 2 is where zat actually changes
(alpha.17 → alpha.22, via 168d9f1).

| canary | zat version |
|---|---|
| production `ReleaseFast-zat21-b91382b` | alpha.17 |
| canary 1 | alpha.17 (same) |
| canary 2 | alpha.22 |
| canary 3 | alpha.22 (same as canary 2) |

## branches ready to build

All three branches exist locally and build clean against
`-Dtarget=x86_64-linux-gnu -Doptimize=ReleaseFast`. `zig build test`
passes on each.

```
canary/1-uaf-only       d71379c  b91382b + 1eec324
canary/2-uaf-plus-deps  fa7b5c4  canary/1 + 168d9f1
canary/3-uaf-deps-gc    e9802df  canary/2 + 3dc21b9
```

None are pushed to any remote yet. Tell me if you want them pushed as
`origin/canary/N-...` before the operator builds.

Build command per canary (run on Hetzner server per the normal flow,
but validated locally first):

```
just zlay publish-remote ReleaseFast
# or whatever flag the publish script takes to target a specific SHA
```

Image tag expectation:
- canary 1 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-d71379c`
- canary 2 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-fa7b5c4`
- canary 3 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-e9802df`

(The exact naming depends on `just zlay publish-remote`. Adjust if needed.)

---

## Canary 1: `b91382b + 1eec324` (UAF fix only)

### what's in it

Exactly one cherry-pick on top of b91382b: `1eec324 fix UAF: dupe
FrameWork.hostname per submit instead of borrowing`. Touches
`src/frame_worker.zig` (+2 -1) and `src/subscriber.zig` (+10 -1). No
dependency bumps. No gcLoop. No other changes.

### diagnostic question

**Does adding the FrameWork UAF fix alone reintroduce the HTTP /
delivery failure?**

### hoped-for outcome

Canary 1 runs indistinguishably from current production:
- external `/health` responsive in < 1 s
- 15 s websocket consumer receives ~395 fps (99%+ of ingest)
- `frames_broadcast_total` advances with an attached consumer
- zero readiness flaps through 60 min
- zero container restarts through a 4-hour reconnect-cron cycle
- the corrupted-hostname log pattern from the 2026-04-07 incident is
  ABSENT under reconnect-storm load

If canary 1 is clean, the UAF fix is not the regressor and we roll
forward to canary 2. Canary 1 stays as the new production baseline
because it strictly improves on b91382b by closing the UAF.

### what to check

Run the operator's measurement recipe from
`../relay/docs/zlay-handoff-2026-04-09-rollback.md#reproducing-the-measurements`.
Specifically:

1. **External health** (every 5 min for first hour):
   ```
   curl --connect-timeout 3 -m 5 https://zlay.waow.tech/_health
   curl --connect-timeout 3 -m 5 https://zlay.waow.tech/xrpc/_health
   ```
   expected: `200 in < 1 s` every time.

2. **15 s websocket consumer** (every 15 min):
   ```
   uvx --from 'websockets==13.*' python -c '
   import asyncio, websockets, time
   async def main():
       async with websockets.connect("wss://zlay.waow.tech/xrpc/com.atproto.sync.subscribeRepos", max_size=None) as ws:
           n = 0; t = time.time()
           while time.time() - t < 15:
               await ws.recv(); n += 1
           print(f"{n} frames in 15s = {n/15:.1f} fps")
   asyncio.run(main())
   '
   ```
   expected: `~5,000+ frames, ~330+ fps` by 20 min uptime, climbing
   toward 395 fps as the host table ramps.

3. **15 s metrics delta** (with a consumer attached, port-forward to :3001):
   ```
   # t0: snapshot frames_received_total and frames_broadcast_total
   # wait 15 s
   # t1: snapshot again
   # compute delta. ratio should be >= 99%.
   ```

4. **Readiness state**:
   ```
   kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay get pod -l app.kubernetes.io/instance=zlay -o 'custom-columns=NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,RESTARTS:.status.containerStatuses[*].restartCount'
   ```
   expected: `True`, `0` restarts.

5. **Reconnect-cron survival**: the cron fires every 4h at 00/04/08/...
   UTC. Canary 1 should survive at least one fire cycle without
   restart or corrupted-hostname log lines. That's the UAF test.
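
Step 3 above leaves the delta computation as comments. A minimal Python sketch of that arithmetic follows. It assumes the two snapshots were saved as Prometheus text-format dumps and that both counters are exported without labels; neither detail is confirmed by this plan, and the function names are illustrative only.

```python
def parse_counters(text: str) -> dict:
    """Parse un-labelled samples from a Prometheus text-format dump."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and "{" not in parts[0]:
            try:
                counters[parts[0]] = float(parts[1])
            except ValueError:
                pass  # ignore anything that is not a plain number
    return counters


def delivery_ratio(t0: str, t1: str) -> float:
    """broadcast delta / received delta between two snapshots."""
    a, b = parse_counters(t0), parse_counters(t1)
    received = b["frames_received_total"] - a["frames_received_total"]
    broadcast = b["frames_broadcast_total"] - a["frames_broadcast_total"]
    if received <= 0:
        raise ValueError("no frames received between snapshots")
    return broadcast / received


if __name__ == "__main__":
    # Made-up absolute values; the deltas match the b91382b measurement (6801 / 6843).
    t0 = "frames_received_total 100000\nframes_broadcast_total 90000\n"
    t1 = "frames_received_total 106843\nframes_broadcast_total 96801\n"
    print(f"delivery ratio: {delivery_ratio(t0, t1):.1%}")
```

A ratio at or above 0.99 matches the expectation in step 3. The looser 90% floor in the success criteria is a separate threshold.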

### success criteria (all must hold for 60 min from deploy)

- external `/health` responds `200` in < 1 s on every check
- 15 s consumer snapshot shows delivery at or above 90% of
  `frames_received_total` delta
- `kubectl get pod` shows `Ready=True` throughout
- `restartCount=0`
- no corrupted-hostname patterns in `kubectl logs` (no DIDs in
  hostname-shaped log fields, no stack-pointer-shaped bytes)

### failure signals

ANY of these means canary 1 is bad, roll back:

- external `/health` hangs or returns 503 on any probe
- 15 s consumer delivers < 100 fps or disconnects
- `Ready=False` at any point
- `restartCount > 0`
- metrics port-forward hangs where a single probe + 15 s wait +
  second probe doesn't complete
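
The failure signals above are simple enough to encode as a checklist. The sketch below is one hypothetical way to do that in Python. The `Sample` record and its field names are assumptions made for illustration and are not part of any zlay tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    health_status: Optional[int]   # HTTP status; None means the probe timed out (hang)
    consumer_fps: float            # frames/sec from the 15 s websocket consumer
    consumer_disconnected: bool
    ready: bool                    # pod Ready condition
    restart_count: int
    metrics_probe_completed: bool  # probe + 15 s wait + second probe finished


def failure_signals(s: Sample) -> List[str]:
    """Return every failure signal that fired; an empty list means keep watching."""
    fired = []
    if s.health_status is None or s.health_status == 503:
        fired.append("health hang or 503")
    if s.consumer_fps < 100 or s.consumer_disconnected:
        fired.append("consumer < 100 fps or disconnected")
    if not s.ready:
        fired.append("Ready=False")
    if s.restart_count > 0:
        fired.append("restartCount > 0")
    if not s.metrics_probe_completed:
        fired.append("metrics port-forward hang")
    return fired


if __name__ == "__main__":
    healthy = Sample(200, 395.0, False, True, 0, True)
    print(failure_signals(healthy) or "no failure signals")
```

Any non-empty result means capture evidence first, then roll back.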

### on failure: capture evidence IMMEDIATELY then roll back

Before rolling back, capture (needs to happen BEFORE the pod is
recycled; evidence is gone after rollback):

```
# 0. pod name (assumes a single pod matches the selector used in the readiness check)
POD=$(kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay get pod -l app.kubernetes.io/instance=zlay -o jsonpath='{.items[0].metadata.name}')

# 1. socket state on the HTTP ports (are accepts queueing up?)
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'ss -ltn 2>&1 | grep -E "(3000|3001)"' > /tmp/zlay-canary1-ss.txt 2>&1

# 2. thread state (anyone stuck in R/D?)
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'ps -eLo pid,tid,stat,wchan,comm 2>&1' > /tmp/zlay-canary1-ps.txt 2>&1

# 3. final metrics snapshot (might hang, that's also a data point)
timeout 15 kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'wget -qO- http://localhost:3001/metrics' > /tmp/zlay-canary1-metrics.txt 2>&1

# 4. recent logs
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay logs --tail=2000 $POD > /tmp/zlay-canary1-logs.txt 2>&1
```

Then roll back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation if canary 1 fails:
- 1eec324's frame-queueing change (heap-duping hostname, freeing in
  `processFrame`) is implicated directly. The engineer's next move
  would be to study the allocation lifetime carefully.
- Alternatively: the production image was not actually built with the
  committed zat alpha.17 (the `zat21-` tag is unconfirmed, see the note
  above), and alpha.17 has a subtle issue that production's zat avoids.
  Easy to distinguish: build canary 1 with production's zat version
  manually pinned and retest.

---

## Canary 2: `b91382b + 1eec324 + 168d9f1` (add dep bump)

**Only run this after canary 1 is clean for ≥ 60 min AND has survived at
least one 4h reconnect-cron fire.**
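
The gate above combines a 60-minute soak with the 00/04/08/... UTC cron schedule. A small Python sketch of the earliest moment both conditions can be met is shown here. The function names are hypothetical, and in practice you would wait some minutes past the fire to confirm the pod actually survived it.

```python
from datetime import datetime, timedelta, timezone


def next_cron_fire(after: datetime) -> datetime:
    """First 4-hourly (00/04/08/... UTC) fire strictly after `after`."""
    after = after.astimezone(timezone.utc)
    floor = after.replace(hour=after.hour - after.hour % 4,
                          minute=0, second=0, microsecond=0)
    return floor + timedelta(hours=4)


def earliest_canary2_start(canary1_deployed: datetime,
                           soak: timedelta = timedelta(minutes=60)) -> datetime:
    """Earliest time at which both the soak and one cron fire have elapsed."""
    return max(canary1_deployed + soak, next_cron_fire(canary1_deployed))


if __name__ == "__main__":
    deployed = datetime(2026, 4, 9, 21, 30, tzinfo=timezone.utc)
    print(earliest_canary2_start(deployed).isoformat())
```

For a canary 1 deploy shortly after a cron fire, the cron condition dominates and the wait can approach four hours.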

### what's in it

Canary 1 plus `168d9f1 bump websocket.zig + zat`. Only change is
`build.zig.zon` (4 lines): zat v0.3.0-alpha.17 → v0.3.0-alpha.22 and
websocket.zig 9ac64da → 3c6794a. The functional change in 168d9f1 is
"fix requestCrawl POST hang in websocket.zig Handshake.parse", but
the reviewer points out this commit changes runtime dependency behavior
more than any other in the window, which makes it a plausible regression
carrier for reasons other than its stated fix.

### diagnostic question

**Does bumping websocket.zig + zat (alpha.17 → alpha.22) on top of
canary 1 reintroduce the HTTP / delivery failure?**

### hoped-for outcome

Same clean run as canary 1, with one additional expected improvement:
the zlay-reconnect cronjob (fires at 00/04/08/... UTC) should now
correctly re-announce PDS hosts (that's what 168d9f1 specifically
fixes). The cron log from the runner side should show a successful
POST response where it previously hung.

If canary 2 is clean, dep bump is not the regressor. Roll forward to
canary 3. Canary 2 becomes the new production baseline.

### what to check

Same 5 checks as canary 1. PLUS:

6. **Reconnect cron success**: the next fire after canary 2 deploys
   should complete normally, since the requestCrawl POST hang fix is in
   168d9f1. If this cron was failing before with canary 1, it should
   succeed now. If it was already succeeding on canary 1 (because
   b91382b's websocket lib was handling this path through a different
   code path), no change expected.

### success criteria

Same as canary 1: health, delivery, readiness, restarts, no corrupted
hostnames. Plus successful reconnect cron.

### failure signals

Same as canary 1.

### on failure

Capture (same 4 commands as canary 1, retitled `canary2`), then roll
back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-d71379c  # canary 1
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation: the dep bump is implicated. Either websocket.zig
3c6794a's Handshake.parse behavior has a regression we haven't seen
yet, or zat alpha.22's transport layer does. Both are cheap to further
bisect: websocket.zig and zat are both small repos with narrow change
windows between 9ac64da→3c6794a and alpha.17→alpha.22.

---

## Canary 3: `b91382b + 1eec324 + 168d9f1 + 3dc21b9` (add gcLoop re-enable)

**Only run this after canary 2 is clean for ≥ 60 min.**

### what's in it

Canary 2 plus `3dc21b9 fix gcLoop: silently exited after one tick`.
Touches only `src/main.zig` (+15 -9). The behavioral change: prior to
3dc21b9, `gcLoop` used `io.sleep` on pool_io from a plain std.Thread,
which failed on the second tick and silently exited via `catch return`.
So `gcLoop` actually ran ONE tick (+ malloc_trim) at ~10 min into each
pod's life, then was dead for the rest of the pod. 3dc21b9 replaced
`io.sleep` with `std.c.nanosleep` so gcLoop runs every 10 min for the
entire pod lifetime.

The gcLoop stabilization I shipped in 4f3d1d4 (disable malloc_trim,
bump interval to 1 hour) is intentionally NOT included in canary 3.
We want to isolate the effect of 3dc21b9 first. If 3dc21b9 is the
regressor, the 4f3d1d4-style mitigation becomes the next canary.

### diagnostic question

**Does re-enabling gcLoop at its original 10-minute cadence (with
`malloc_trim(0)` intact) reintroduce the HTTP / delivery failure?**

### hoped-for outcome

Same clean run as canary 2 for the first 10 minutes, then NO
degradation at the 10-minute mark when `gcLoop` first fires with
`dp.gc()` and `malloc_trim(0)`. If the 10-min mark is clean,
continue watching to ~60 min (by which point gcLoop has fired 5×).

If canary 3 is clean, 3dc21b9 is not the regressor and we've narrowed
the window further. Remaining suspects in the window are `fbdffbe`
(did_cache health-mark) and `31825b2` (subscriber prepareFrameWork
extraction + test). Both are lower-priority per the reviewer; we'd
run them as canary 4 and 5 or let them ride if canary 3 is fully
stable with all other changes layered.

### what to check

Same 5 checks as canary 1. PLUS special attention to the 10-minute mark:

7. **10-minute gc fire**: canary 3's gcLoop fires at pod uptime ~10
   min. Run:
   - external `/health` probe at 9:50, 10:00, 10:10, 10:30
   - websocket consumer at 10:00, 10:30
   - check `kubectl logs $POD | grep gc` for log evidence of the
     gc body running and malloc_trim completing
   - if any probe fails in the 10:00 ± 30s window, gcLoop is the
     smoking gun
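
The probe times in check 7 are offsets of -10 s, 0 s, +10 s and +30 s around the first gc fire. A Python sketch that generates the same schedule for every fire up to the 60-minute mark follows. Repeating the pattern at later fires is an assumption on my part; the plan only spells out the first one.

```python
def gc_probe_schedule(gc_interval_min: int = 10,
                      watch_min: int = 60,
                      around_s: tuple = (-10, 0, 10, 30)) -> list:
    """Pod-uptime offsets, in seconds, for health probes around each gc fire."""
    step = gc_interval_min * 60
    schedule = []
    for fire in range(step, watch_min * 60 + 1, step):
        schedule.extend(fire + delta for delta in around_s)
    return schedule


def as_uptime(seconds: int) -> str:
    """Format seconds of pod uptime as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


if __name__ == "__main__":
    print(", ".join(as_uptime(s) for s in gc_probe_schedule()))
```

The first four entries reproduce the 9:50, 10:00, 10:10, 10:30 probes listed above.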

### success criteria

Same as canary 1. PLUS clean behavior across the 10-min mark (and
the 20, 30, 40, 50, 60 min marks, each a gcLoop fire).

### failure signals

Same as canary 1, with heightened attention at ~10-min intervals.

### on failure

Capture (same as above), then roll back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-fa7b5c4  # canary 2
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation: gcLoop is the regressor. This would partially
rehabilitate the gcLoop-stall hypothesis from
`docs/zlay-gcloop-stall-2026-04-09.md`, but only "partially" because
4f3d1d4 (which disabled malloc_trim and bumped gc to 1 hour) still
exhibited the HTTP hang at ~35 min uptime. So if canary 3 fails at
the 10-min mark, the next diagnostic step is:
- canary 3b = canary 3 with just `malloc_trim(0)` disabled (gc still
  runs every 10 min). Isolates allocator stall from mutex-hold stall.
- canary 3c = canary 3b with gc interval bumped to 1 hour. Matches
  the 4f3d1d4 state exactly.

If canary 3c fails, we've exactly reproduced 4f3d1d4 from first
principles and learned that neither the allocator stall nor the
mutex-hold can be the sole cause; something else on top of 3dc21b9
is also in play. That's a bisection lead, not a conclusion.

---

## summary table

| canary | base | + cherry-pick | zat ver | diagnostic question | rollback to |
|---|---|---|---|---|---|
| 1 | b91382b | `1eec324` | alpha.17 | Is the UAF fix alone enough to regress? | `ReleaseFast-zat21-b91382b` |
| 2 | canary 1 | `168d9f1` | **alpha.22** | Does the dep bump regress on top of 1? | canary 1 image |
| 3 | canary 2 | `3dc21b9` | alpha.22 | Does gcLoop re-enable regress on top of 2? | canary 2 image |

## what's explicitly out of scope for these canaries

- **No host_authority work** (`bbba92c` keep_alive=false, `795cc41`
  slot recovery + observability). These add diagnostic noise and we
  don't need them to test delivery/HTTP. Reviewer was explicit.
- **No broadcaster writeLoop fix.** Real bug, but not the cause of
  the HTTP hang, and unrelated to what we're isolating.
- **No host_mismatch investigation in code.** Reviewer asked for a
  DB audit first: check for duplicate hostnames in the `host`
  table and stale `account.host_id` references before any
  `getHostIdForHostname` patch.
- **No additional "fix bundles."** Every canary is exactly one
  cherry-pick on top of the previous canary.

## what the engineer needs from the operator

1. **Confirmation of the zat version running in the current
   `ReleaseFast-zat21-b91382b` image.** If you can inspect the
   image or recall how you built it, knowing whether it's literally
   alpha.21 vs alpha.17 vs alpha.22 matters for interpreting canary 1
   vs canary 2 results.
2. **Push the canary branches to origin?** All three branches exist
   locally only. If your build pipeline pulls from remote, I need
   to push them. Let me know and I will.
3. **Evidence from the 4f3d1d4 dying window, if any was captured.**
   Specifically `ss -ltn` on ports 3000/3001 and `ps -eLo
   pid,tid,stat,wchan,comm` during the active hang. If none was
   captured, that's fine; we'll capture it on the first canary
   failure instead.
4. **DB audit results (independent of the canary sequence)**:
   ```
   SELECT hostname, count(*) FROM host GROUP BY hostname HAVING count(*) > 1;
   SELECT count(*) FROM account WHERE host_id NOT IN (SELECT id FROM host);
   ```
   If either is non-empty, that's data we need before touching any
   host_authority code.
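
To make the two audit queries in item 4 concrete, here is a self-contained Python sketch that runs them against a toy in-memory SQLite database. The two-table schema and the sample rows are invented for illustration and are not zlay's real schema; only the two `SELECT` statements come from this plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical minimal schema, NOT zlay's actual tables
    CREATE TABLE host (id INTEGER PRIMARY KEY, hostname TEXT NOT NULL);
    CREATE TABLE account (did TEXT PRIMARY KEY, host_id INTEGER);
    INSERT INTO host VALUES (1, 'pds-a.example'), (2, 'pds-b.example'),
                            (3, 'pds-a.example');      -- duplicate hostname
    INSERT INTO account VALUES ('did:plc:aaa', 1), ('did:plc:bbb', 2),
                               ('did:plc:ccc', 99);    -- dangling host_id
""")

# Audit 1: hostnames that map to more than one host row.
duplicates = conn.execute(
    "SELECT hostname, count(*) FROM host GROUP BY hostname HAVING count(*) > 1"
).fetchall()

# Audit 2: accounts whose host_id points at no existing host row.
stale = conn.execute(
    "SELECT count(*) FROM account WHERE host_id NOT IN (SELECT id FROM host)"
).fetchone()[0]

print("duplicate hostnames:", duplicates)
print("accounts with stale host_id:", stale)
```

Note that the second query always returns one row (a count), so the signal to look for there is a count greater than zero. Rows with a NULL `host_id` are not counted by `NOT IN` and would need a separate check if they can occur.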
docs/zlay-situation-2026-04-09-reviewer.md
# zlay situation report for reviewer (2026-04-09 evening)

This doc is a frank account of where zlay stands at end-of-day 2026-04-09
after a long debug session. It supersedes, as primary-cause narratives, both
`zlay-broadcaster-starvation-2026-04-09.md` and
`zlay-gcloop-stall-2026-04-09.md`. Those were hypotheses we shipped fixes
for, and both turned out to be wrong about root cause (they may still be
real secondary issues).

Purpose: help us stop ship-and-guess. We want the reviewer to sanity-check
the failure model, call out observations we're over-fitting to, and help
the next deploy be the one that works.

## update: rollback to b91382b succeeded, regression window confirmed

After this doc was first drafted, the operator completed the rollback to
`ReleaseFast-zat21-b91382b` (the 2026-04-06 build). Full write-up in
`../relay/docs/zlay-handoff-2026-04-09-rollback.md`. The load-bearing
measurements:

- external `https://zlay.waow.tech/_health` returns 200 in ~0.29 s
- a raw 15 s `subscribeRepos` websocket consumer receives **5,896 frames
  at ~395 fps**, vs 6 frames in 170 s (0.035 fps) on `4f3d1d4`
- with an attached consumer: `frames_broadcast` advances at **99.4% of
  `frames_received`** over a 15 s window (6801 / 6843)
- `host_authority` reject rate is **~1.3% steady state** (no workaround
  in place); the 99.54% catastrophic rejection that motivated the
  `keep_alive=false` work is **not present** on b91382b

Conclusion: **both bugs** (external HTTP unreachability + delivery
collapse, *and* the 99.5% host_authority reject rate) were introduced in
the commit window `b91382b..31825b2`. The five commits in that window:

```
3dc21b9 fix gcLoop: silently exited after one tick
fbdffbe mark DB success on did_cache hits
168d9f1 bump websocket.zig + zat: fix requestCrawl POST hang
1eec324 fix UAF: dupe FrameWork.hostname per submit instead of borrowing
31825b2 subscriber: extract prepareFrameWork + add UAF regression test
```

The operator's rollback doc flagged `1eec324` as the prime suspect on the
theory that it was the most invasive frame-path change. The reviewer has
pushed back on that framing: `1eec324` is a narrow UAF ownership fix and
`31825b2` is mostly test/extraction scaffolding; neither obviously
explains the outage by itself. The commits that most clearly change
runtime behavior are:

1. `168d9f1`: bumps `websocket.zig` and `zat` (runtime dependencies)
2. `3dc21b9`: re-enables `gcLoop` (it had been silently dead after one
   tick prior to this fix, so the gc body has effectively never run in
   production on b91382b either)

Any of the three (`1eec324`, `168d9f1`, `3dc21b9`) could plausibly be the
cause. The reviewer's recommended path is a controlled additive canary
sequence from `b91382b`, one commit at a time, so we isolate which one
introduces the regression, rather than shipping another multi-change branch.
The operator plan for those canaries is in
`docs/zlay-canary-plan-2026-04-09.md`. The rest of this doc stands as
background for the reviewer on *what* we're isolating and *why* the
per-commit attribution matters.

**Current production state**: `ReleaseFast-zat21-b91382b`, 1/1 Running,
delivering ~400 fps. Note: b91382b does NOT have the FrameWork.hostname
UAF fix (`1eec324` is 04-07), so a reconnect storm could trip the UAF
and restart the pod. The operator is accepting that risk to keep
evaluators served.

## high-level headline

**Every pod we've shipped after 31825b2 (2026-04-07) has exhibited a symptom
where zlay's external HTTP surface (`/health`, `describeServer`,
`/_readyz`, `/metrics` on port 3000 / 3001) stops responding past a
threshold of a few minutes to half an hour of uptime. Internal metrics
continue to be collected by the process (frames_in climbs, workers count is
stable, CPU usage is low), but external consumers cannot complete a
handshake, prometheus scrapes time out at 10s, and relay-eval reports 0%
coverage for zlay while indigo shows 99%+.**

We have been wrong three times in a row about what causes this:
1. "zig 0.16 `std.http.Client` stale keep-alive handling" (falsified by
   engineer's standalone repro).
2. "broadcaster writeLoop scheduler starvation, fix by moving to pool_io"
   (falsified by recognizing the cross-Io crash class documented in 6674812).
3. "gcLoop + malloc_trim every 10 min" (falsified by the 4f3d1d4 deploy
   today: symptom returned at ~35 min uptime despite gc bumped to 1 hour
   and `malloc_trim` fully removed).

We need help getting to a correct fourth hypothesis rather than shipping a
fourth wrong one.

## what we know the symptom looks like from the outside

- External curl to `https://zlay.waow.tech/health` and
  `com.atproto.sync.describeServer` hangs through a 3 s timeout, or returns
  503 from the ingress.
- prometheus `/metrics` scrapes fail with `context deadline exceeded` at
  the 10 s scrape timeout. this is continuous, not intermittent, once the
  symptom starts.
- internal port-forwarded `/metrics` hit can succeed for a single probe
  then hang >10 s on the next one 15 s later, i.e. the HTTP handler isn't
  dead, it's responding on a fraction of probes and timing out on the rest.
- `/_healthz` comes back in ~11 s in the same window. `/_readyz` in ~17 s
  (above the relaxed 15 s probe timeout; the pod only survives because
  some probes land in faster windows before kubelet's 20-failure
  threshold is reached).
- k8s eventually marks the pod `Ready=False` with
  `ContainersNotReady`. On 4f3d1d4 that happened at 20:59:34, ~35 min after
  pod start (vs ~10-14 min on the unmodified 795cc41 and bbba92c pods).
- service `endpoints/zlay` has no endpoints because Ready=False kicked the
  pod out of the service. ingress returns 503.
- pod process itself: ~0.26 CPU cores of usage, all ~47 threads in
  S-state (sleeping). nothing is hot-spinning. load average on the node is
  high (67/8) but **zlay is not the one consuming CPU**.

## what we know the symptom looks like from the inside

Representative snapshot from the live 4f3d1d4 pod at ~28m uptime just
before rollback:

| metric | value |
|---|---:|
| `frames_received_total` | 742,988 |
| `frames_broadcast_total` | 478 (lifetime, barely moved) |
| `broadcast_no_consumers_total` | 37,956 |
| `consumers_active` | 0 |
| `workers_count` | 2,761 (spawn complete) |
| `host_authority_checks_total` | 17,236 |
| `host_authority_reject{branch="host_mismatch"}` | 15,257 (88%) |
| `host_resolver_in_use` | 0 (pool idle at the moment of probe) |

Gap analysis:
- `frames_received = 742,988` vs `frames_broadcast + no_cons = 38,434`.
  That's ~5% of received frames reaching `broadcast()`. The rest is either
  still in-flight in the frame_worker pool, dropped on the validation
  branches (host_authority, sig, chain, etc), or lost to something we're
  not tracking. We did not capture the full validation counter delta in
  the incident; recovering that is one of the first things we need.
- `broadcast_no_consumers = 37,956` vs `frames_broadcast = 478`: ~99%
  of the frames that did reach `broadcast()` hit the "zero consumers" fast
  path (broadcaster.zig:611). Lifetime `frames_broadcast_total = 478`
  means nearly all frames ever delivered to a consumer came from one
  brief window long ago, which is consistent with "consumers couldn't
  attach for most of the pod's life" rather than "consumers attached but
  were kicked."
- `host_authority_reject{host_mismatch} = 88%` is a second independent
  bad signal on this pod (see "additional observation" below).
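
The percentages in the gap analysis can be re-derived directly from the snapshot table. A short Python check, using only the counter values quoted above:

```python
# Counter values copied from the 4f3d1d4 snapshot table.
frames_received = 742_988
frames_broadcast = 478
broadcast_no_consumers = 37_956
host_authority_checks = 17_236
host_mismatch_rejects = 15_257

# Frames that reached broadcast() either went out or hit the zero-consumer path.
reached_broadcast = frames_broadcast + broadcast_no_consumers
unaccounted = frames_received - reached_broadcast

reach_share = reached_broadcast / frames_received
no_consumer_share = broadcast_no_consumers / reached_broadcast
mismatch_share = host_mismatch_rejects / host_authority_checks

print(f"reached broadcast():     {reached_broadcast:,} ({reach_share:.1%} of received)")
print(f"never reached it:        {unaccounted:,}")
print(f"zero-consumer fast path: {no_consumer_share:.1%} of frames reaching broadcast()")
print(f"host_mismatch rejects:   {mismatch_share:.1%} of host_authority checks")
```

The `unaccounted` figure is the number of frames that the missing validation-counter delta would need to explain.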

## what commits are in play

Between the last reliably-observable pod (`31825b2`, 2026-04-07) and now:

| commit | date | what | risk flagged |
|---|---|---|---|
| `3dc21b9` | 04-06 | fix gcLoop silent-exit after one tick | gc actually ran again after this; masked previously |
| `fbdffbe` | 04-06 | mark DB success on did_cache hits | low |
| `168d9f1` | 04-06 | bump websocket.zig + zat (requestCrawl POST hang fix) | low but touches websocket lib |
| `ee4e368` | 04-08 | buffer 8192→65536 + per-branch counters | low |
| `bbba92c` | 04-08 | **`keep_alive=false` on host_authority pool** | **high: workaround, not root-caused** |
| `795cc41` | 04-09 | host_authority slot recovery + pool metrics + preload account count | medium: changes cold-start path |
| `4f3d1d4` | 04-09 | disable malloc_trim, bump gc to 1h, timing log | low, but didn't fix the issue |

The single most significant behavioral change in that set is `bbba92c`'s
`keep_alive=false`. Every `is_new`/`host_changed` event now does a fresh
DNS + TCP + TLS + HTTPS round-trip to plc.directory (~350-900 ms/call),
and the resolver pool is only 4 slots acquired by 16 frame_worker threads
via atomic spinlock. This is a known aggravating factor for cold-start
load, but ops-changelog claims it was also present on 795cc41 without
persistent HTTP death, which weakens the "keep_alive=false is the whole
story" claim.
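
As a rough back-of-envelope for why the 4-slot pool matters under `keep_alive=false`, the Python sketch below computes the resolution-rate ceiling implied by the figures above. It assumes each slot handles one round-trip at a time with no overlap or caching, which is a simplification and not a measured property of the resolver.

```python
def resolver_ceiling(slots: int, seconds_per_call: float) -> float:
    """Upper bound on resolutions/sec if every slot is busy back-to-back."""
    return slots / seconds_per_call


SLOTS = 4                       # resolver pool size quoted above
FAST_S, SLOW_S = 0.350, 0.900   # ~350-900 ms per plc.directory round-trip

best = resolver_ceiling(SLOTS, FAST_S)
worst = resolver_ceiling(SLOTS, SLOW_S)

print(f"ceiling at 350 ms/call: {best:.1f} resolutions/sec")
print(f"ceiling at 900 ms/call: {worst:.1f} resolutions/sec")
```

Under this simplified model the pool tops out at roughly 4 to 11 resolutions per second, so any burst of `is_new`/`host_changed` events above that rate would leave the 16 frame_worker threads contending for slots.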
169169+170170+## what we know is NOT the cause
171171+172172+1. **scheduler contention on writeLoop.** I wrote this hypothesis up
173173+ (`docs/zlay-broadcaster-starvation-2026-04-09.md`), proposed moving
174174+ writeLoop to `pool_io`. Zlay engineer correctly pointed out this
175175+ re-enters the `Thread.current() NULL deref on plain threads calling
176176+ Evented Io.Mutex` crash class fixed in `6674812`. **Do not do this.**
177177+ See the cross-Io rule in NOTES/stdlib-patches.md.
178178+2. **gcLoop mutex hold + malloc_trim.** I wrote this hypothesis up
179179+ (`docs/zlay-gcloop-stall-2026-04-09.md`), shipped 4f3d1d4 with gc
180180+ interval 10min → 1h and malloc_trim removed. **Pod still flapped**,
181181+ this time at ~35 min uptime instead of ~10 min. So those changes
182182+ delayed the symptom but did not prevent it. Longer uptime might
183183+ mean we moved some pressure off the critical path, or it might just
184184+ be process-state timing noise — we did not collect enough data to
185185+ distinguish.
186186+3. **broadcaster writeLoop `io.sleep(100ms)` polling**
187187+ (broadcaster.zig:447-453). This is a real bug: `cond.signal` on line 413
188188+ is a no-op and per-consumer drain is capped at 10 frames/sec in the best
189189+ case. But it is NOT the cause of the HTTP server being unreachable. It
190190+ affects throughput _to attached consumers_, and the problem we have
191191+ is that consumers cannot attach at all.
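For reference, the difference between sleep-polling and a real condition wait (item 3) can be sketched as below. This uses Python's `threading.Condition` as a stand-in. It is not the Zig `Io` API, and `FrameQueue` and `run_demo` are hypothetical names. The consumer blocks until `push` notifies it, then drains everything queued, so throughput is not tied to a fixed wake interval.

```python
import threading
from collections import deque
from typing import List, Optional


class FrameQueue:
    """Consumer blocks on a condition variable instead of sleeping 100 ms."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._frames = deque()
        self._closed = False

    def push(self, frame: bytes) -> None:
        with self._cond:
            self._frames.append(frame)
            self._cond.notify()          # wake the waiting consumer right away

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def drain(self) -> Optional[List[bytes]]:
        """Block until frames arrive; return all queued frames, or None once closed and empty."""
        with self._cond:
            while not self._frames and not self._closed:
                self._cond.wait()
            if not self._frames:
                return None
            batch = list(self._frames)
            self._frames.clear()
            return batch


def run_demo(n_frames: int = 50) -> List[bytes]:
    q = FrameQueue()
    received: List[bytes] = []

    def write_loop() -> None:
        while True:
            batch = q.drain()
            if batch is None:
                break
            received.extend(batch)

    t = threading.Thread(target=write_loop)
    t.start()
    for i in range(n_frames):
        q.push(f"frame-{i}".encode())
    q.close()
    t.join(timeout=5)
    return received
```

Calling `run_demo()` delivers every frame in order with no sleep anywhere in the consumer. Pings would need their own timer or a timed wait, which this sketch leaves out.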
192192+193193+## additional observation: host_mismatch burst
194194+195195+On both 795cc41 and 4f3d1d4 cold-starts, we see a large burst of
196196+`host_authority_reject{branch="host_mismatch"}`: 15k-17k rejects in the
197197+first ~15 min, then the rate drops to ~0/sec. This is qualitatively
198198+different from the 2026-04-08 `resolve`-branch 99% reject bug. It means:
199199+the DID resolver succeeded, returned a valid DID doc, and
200200+`pds_host_id != incoming_host_id` at the comparison step
201201+(`validator.zig checkPdsHost`).
202202+203203+Plausible causes not yet investigated:
204204+- Stale `account.host_id` values persisted from a previous pod
205205+ triggering `host_changed=true` on events where the current DID doc
206206+ still resolves to the current subscriber's hostname.
207207+- Duplicate rows in the `host` table with the same hostname and
208208+ different ids (`getHostIdForHostname` has no ORDER BY — lookup is
209209+ non-deterministic).
210210+- Something in how 795cc41's `preload_account_count` change interacts
211211+ with host_id assignment at spawn time.
212212+213213+None of these candidates are new in 4f3d1d4. The burst exists independent
214214+of the HTTP-hang symptom, but it's the same window, so we can't say for
215215+sure whether it's a contributor or an unrelated mess.
216216+217217+**We have not captured sampled warn logs for any of these host_mismatch
218218+rejects.** The log buffer has rotated out by the time we look. Any
219219+isolation experiment here needs to pull logs _during_ the burst, not
220220+after.
221221+222222+## rollback completed — see "update" at top of doc
223223+224224+Rollback is done. b91382b is responsive, delivery is healthy, host_authority
225225+reject rate is ~1.3% not 99.54%. See the "update" section at the top of this
226226+doc for the full measurements. The diagnostic path going forward is the
227227+canary sequence from b91382b, documented in `zlay-canary-plan-2026-04-09.md`.
228228+229229+## things that are REAL bugs regardless of which hypothesis wins
230230+231231+Independent of the HTTP-hang root cause, these are defects we want to
232232+fix eventually. Enumerated so the reviewer can weigh in on priority and
233233+shape:
235235+1. **`Consumer.writeLoop` polling** (broadcaster.zig:439-477). Replace
236236+ the `io.sleep(100ms)` with a proper `cond.wait(mutex, io)`. Schedule
237237+ pings via a separate timer fiber or opportunistically on wake. Do NOT
238238+ move writeLoop off Evented. Without this fix, per-consumer drain is
239239+ capped at 10 frames/sec even in perfect conditions.
240240+2. **`DiskPersist.gc()` holds the persist hot-path mutex for its
241241+ entire body** (event_log.zig:977-1033). The mutex is there to
242242+ protect `evtbuf`/`outbuf`/`cur_seq`/`current_file_path`/`flushLocked`.
243243+ Nothing gc actually does (DB iteration, per-file unlink) needs that
244244+ lock. Proposed fix: discover candidate files without the lock,
245245+ re-acquire briefly per-file only to re-check `current_file_path`
246246+ before unlinking. Same treatment for `gcBySize()` and
247247+ `takeDownUser()`.
248248+3. **Unexercised pool slot recovery** (validator.zig resolver pool). The
249249+ slot-recovery fix in 795cc41 exists but is dormant under
250250+ `keep_alive=false`. We haven't proven the recovery path works
251251+ because it has never been exercised in production. The planned
252252+ canary ("1 of 4 slots `keep_alive=true`") was never run.
253253+4. **`zig build test` is not sufficient CI.** Lazy analysis skips
254254+ functions not referenced from tests. Engineer added a rule: run
255255+ `zig build` (exe) in addition to `zig build test` for
256256+ validator/subscriber/frame_worker changes. The `584571a` build
257257+ break is the precedent.
258258+5. **Two different error types get swallowed in zat's transport path**
259259+ (zat/src/internal/xrpc/transport.zig, did_resolver.zig). Partially
260260+ fixed by zat 0.3.0-alpha.23 which propagates the underlying
261261+ `std.http.Client.fetch` error. Upgrade is shipped in 795cc41 but we
262262+ never saw a failure after it shipped because `keep_alive=false` was
263263+ still in place.
264264+6. **`getHostIdForHostname` has no `ORDER BY`** — if there are ever
265265+ duplicate rows for the same hostname, lookup is non-deterministic.
266266+ We ran the reviewer's DB audit on 2026-04-09: 0 duplicate hostnames
267267+ in production, so this is not currently a live bug — but the query
268268+ should still be deterministic as a hardening matter.
269269+7. **Dual host_authority call sites with asymmetric metrics.**
270270+ `resolveHostAuthority` is invoked from both `frame_worker.zig:107`
271271+ and `subscriber.zig:555`. Only the frame_worker path emits the
272272+ `relay_host_authority_trigger{reason=...}` and
273273+ `relay_host_authority_checks_total` counters; both paths increment
274274+ `relay_validation_failed{reason="host_authority"}` on reject. This
275275+ makes the ratio `trigger:reject` structurally misleading — trigger
276276+ is a lower bound on true authority-check count, while rejects
277277+ include both paths. Discovered during the 2026-04-09 canary 1
278278+ investigation when we tried to correlate the `host_id=0` sentinel
279279+ finding against the reject counter. Long-term fix is one of:
280280+ (a) consolidate authority checking to a single path, or
281281+ (b) have both paths emit the same counters. Not a blocker for any
282282+ current canary — flagged so it doesn't get forgotten.
283283+8. **`account.host_id = 0` is a sentinel ("host not set yet") by
284284+ design**, documented in `event_log.zig:513`. As of the 2026-04-09
285285+ DB audit, 239,422 out of 5,817,756 accounts (4.12%) are still at
286286+ the sentinel. Reviewer's semantic correction (2026-04-09): orphan
287287+ accounts produce `is_new=true` in `uidForDidFromHost`, which
288288+ triggers a host_authority CHECK — not automatically a reject. The
289289+ reject only happens if `resolveHostAuthority` returns `.reject`,
290290+ which depends on the DID doc's PDS vs the incoming host. So the
291291+ sentinel explains an `is_new` burst during cold-start ramp but
292292+ does NOT by itself explain the `host_mismatch` reject burst we
293293+ observed on 795cc41/4f3d1d4. Real fix is probably: `is_new`
294294+ accept-and-update path should not go through the full
295295+ host_authority resolution pipeline (too slow for first-seen
296296+ events), but that's an optimization, not an outage fix.
297297+9. **Relaxed k8s probes are a band-aid over "HTTP fibers can get
298298+ stuck for 10+ s under load."** With probes at
299299+ `initialDelay=300s, timeout=15s, failureThreshold=20`, the pod
300300+ effectively has ~5 minutes to get its act together at startup and
301301+ can be in a degraded state for up to 300 s at runtime before
302302+ kubelet notices. If we ever genuinely hang at startup, we now get
303303+ no feedback for 5 minutes. These should be tightened once the
304304+ underlying cause is fixed.
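The lock-narrowing shape proposed for `DiskPersist.gc()` in item 2 can be sketched as follows. This is a Python stand-in, not the Zig code. `DiskLog`, its fields, and `run_demo` are hypothetical names, and it assumes log files only rotate forward, so an old file never becomes the current one again.

```python
import os
import tempfile
import threading
from typing import List, Tuple


class DiskLog:
    """Hypothetical stand-in for the persist layer; names are illustrative."""

    def __init__(self, current_file_path: str) -> None:
        self.mutex = threading.Lock()            # guards hot-path state only
        self.current_file_path = current_file_path

    def gc(self, candidates: List[str]) -> List[str]:
        """Unlink expired files, holding the mutex only for the per-file re-check."""
        removed = []
        for path in candidates:                  # candidates were discovered lock-free
            with self.mutex:
                is_active = path == self.current_file_path
            if is_active:
                continue                         # never unlink the file being written
            try:
                os.unlink(path)                  # slow syscall runs outside the lock
                removed.append(path)
            except FileNotFoundError:
                pass                             # already gone, nothing to do
        return removed


def run_demo() -> Tuple[List[str], List[str]]:
    with tempfile.TemporaryDirectory() as d:
        paths = [os.path.join(d, f"evts-{i}.log") for i in range(3)]
        for p in paths:
            open(p, "w").close()
        log = DiskLog(current_file_path=paths[2])   # newest file is still being written
        removed = [os.path.basename(p) for p in log.gc(paths)]
        survivors = sorted(os.listdir(d))
    return removed, survivors
```

The hot path only ever waits for a string comparison, never for DB iteration or an unlink. The same shape would apply to `gcBySize()` and `takeDownUser()`, though those may touch state this sketch does not model.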
305305+306306+## what I'd want the reviewer to help us with
307307+308308+### 1. the core question — why external HTTP stops responding
309309+310310+The HTTP server fibers (`runWsServer`, `MetricsServer.run`) live on main
311311+`Io.Evented` io alongside ~2,800 per-subscriber fibers, the broadcaster
312312+loop fiber, and the slurper spawn fiber. On the 4f3d1d4 pod at the time
313313+of death, the pod was using ~0.26 cores, nothing was spinning, and all
314314+threads were in S-state. Yet HTTP accepts weren't completing and probes
315315+timed out.
316316+317317+Possible shapes of the failure we can't distinguish with the evidence we
318318+have:
319319+320320+a. **Accept-queue exhaustion**: listener fiber is running but the
321321+ kernel's SYN/accept backlog is full so new connections don't reach
322322+ the fiber. Check with `netstat -an | grep :3000` for SYN_RECV,
323323+ `ss -ltn` for Recv-Q on the listening socket.
324324+325325+b. **Single Evented runtime thread wedged**: if Evented is a work-stealing
326326+ scheduler across ~47 threads, one stuck fiber on one thread should
327327+ not freeze the whole thing. If it's a single-loop-per-thread design
328328+ and accepts are pinned to one loop, a stuck loop could freeze
329329+ accept specifically. Which of these does zig 0.16 `Io.Evented`
330330+ actually implement?
331331+332332+c. **malloc contention**: the pod runs with `MALLOC_ARENA_MAX=4`, and
333333+ 16 frame_worker threads (plus ~47 Evented runtime threads) are all
334334+ contending for 4 glibc arenas. Under sustained allocator pressure
335335+ (with keep_alive=false, every resolve allocates a TLS session), any
336336+ thread that calls malloc at the wrong time may block on arena
337337+ locks. The HTTP handler allocates for response bodies. Is this
338338+ plausible as a sustained-state cause rather than just a burst?
339339+340340+d. **Some fiber consistently monopolizes a CPU slice and the HTTP
341341+ accept fiber doesn't get scheduled often enough**: spawnWorkers
342342+ finished at 96 s but maybe something else has replaced it. Possible
343343+ candidates: Consumer.writeLoop fibers with their 100 ms polling,
344344+ the broadcaster loop fiber under push-lock contention (persist_order
345345+ spinlock).
346346+347347+We cannot distinguish these today. We have no fiber-level trace, no
348348+evented scheduler diagnostics, no per-fiber wake-latency metric. **This
349349+is the single biggest gap in our ability to debug this.**
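To make the missing wake-latency metric concrete, here is a minimal sketch of the bookkeeping it needs. It is written in Python purely for illustration. In zlay this would be a Zig counter exported through the metrics server, and the class name and bucket bounds are assumptions. The loop being measured calls `on_wake()` once per iteration, and the tracker buckets the gap since the previous wake.

```python
import time
from typing import List, Optional


class WakeGapTracker:
    """Records the gap between consecutive wakes of one loop (e.g. an accept loop)."""

    # Upper bucket bounds in seconds; one extra open-ended bucket follows the last bound.
    BOUNDS = (0.001, 0.010, 0.100, 1.0, 10.0)

    def __init__(self) -> None:
        self.counts: List[int] = [0] * (len(self.BOUNDS) + 1)
        self.max_gap = 0.0
        self._last: Optional[float] = None

    def on_wake(self, now: Optional[float] = None) -> None:
        """Call once per loop iteration; `now` is injectable for testing."""
        now = time.monotonic() if now is None else now
        if self._last is not None:
            gap = now - self._last
            self.max_gap = max(self.max_gap, gap)
            for i, bound in enumerate(self.BOUNDS):
                if gap <= bound:
                    self.counts[i] += 1
                    break
            else:
                self.counts[-1] += 1             # gap exceeded every bound
        self._last = now
```

A wedged pod would show counts piling up in the 1 s and 10 s buckets. One caveat: a tracker like this only updates when the loop wakes, so a fully stuck loop also needs the last-wake timestamp exposed for an outside observer to compare against wall-clock time.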
350350+351351+### 2. the falsifiable hypotheses we'd try next
352352+353353+In priority order, these are experiments where a positive result would
354354+strengthen one of (a)-(d) and a null result would rule it out. None of
355355+these are "ship a fix" — they're all "get diagnostic signal":
356356+357357+- **(a) accept-queue check**: on a live flapping pod, run
358358+ `ss -ltn | grep 3000`, `netstat -an | awk '$NF=="LISTEN"'`, and
359359+ `cat /proc/net/tcp` to look at socket state and queue depth. If
360360+ accept backlog is full, that's a scheduler-not-running-accept
361361+ problem. If it's empty, handshakes are reaching the fiber and
362362+ something inside the fiber is slow.
363363+364364+- **(b) Evented runtime design**: read zig 0.16 `std.Io.Uring` /
365365+ `Io.Evented` source to confirm whether it's single-loop or work-
366366+ stealing. If single-loop, accept-pinned-to-one-thread is a known
367367+ class of failure. If work-stealing, a single stuck fiber shouldn't
368368+ freeze the accept fiber — which would point away from
369369+ scheduler-starvation and toward something else.
370370+371371+- **(c) malloc contention experiment**: temporarily set
372372+ `MALLOC_ARENA_MAX=16` (or unset it entirely) on the Deployment and
373373+ see if the HTTP-hang cadence changes. Not a fix, just a signal.
374374+ Does not require a code change. Fully reversible.
376376+- **(d) fiber occupancy metric**: add a metric for the wake-to-wake
377377+ time of the HTTP accept fiber on Evented main io. On a healthy
378378+ pod it should be < 1ms at idle. On a wedged pod it should be
379379+ seconds. This is the one new instrumentation we most need.
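For experiment (a), a small parser makes the `/proc/net/tcp` output readable during a hang. A sketch under one assumption worth verifying: to my understanding, on Linux the `rx_queue` column of a LISTEN row reports the current accept-queue length (the same number `ss -ltn` shows as Recv-Q). The sample rows below are synthetic, not captured from a zlay pod.

```python
from typing import Dict

TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp


def listen_queue_depths(proc_net_tcp: str) -> Dict[int, int]:
    """Map local port -> rx_queue for every LISTEN row in /proc/net/tcp text."""
    depths: Dict[int, int] = {}
    for line in proc_net_tcp.splitlines()[1:]:        # skip the header row
        fields = line.split()
        if len(fields) < 5 or fields[3] != TCP_LISTEN:
            continue
        port = int(fields[1].rsplit(":", 1)[1], 16)   # local_address is HEXIP:HEXPORT
        rx_queue = int(fields[4].split(":")[1], 16)   # column is tx_queue:rx_queue
        depths[port] = rx_queue
    return depths


# Synthetic sample: two listeners (ports 3000 and 3001) and one established socket.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0BB8 00000000:0000 0A 00000000:00000081 00:00000000 00000000     0        0 1 1
   1: 00000000:0BB9 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 2 1
   2: 0100007F:0BB8 0100007F:D2F0 01 00000000:00000000 00:00000000 00000000     0        0 3 1
"""

print(listen_queue_depths(SAMPLE))
```

On a live pod this would be fed `open("/proc/net/tcp").read()` from inside the pod's network namespace. Listeners bound on IPv6 show up in `/proc/net/tcp6` instead, which uses the same column layout.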
380380+381381+### 3. questions on the host_mismatch burst
382382+383383+- Is there a known migration / spawn pattern that would cause
384384+ `host_changed=true` on events that, when the DID doc is re-resolved,
385385+ actually point back at the host the event came from? Is there a
386386+ race between `requestCrawl` assigning a host_id and a frame arriving
387387+ from a subscriber that already has a different host_id?
388388+- Could `preload_account_count` (new in 795cc41) have changed the
389389+ cold-start order such that subscribers start firing events before
390390+ `host.id` is stable?
391391+392392+### 4. zat 0.3.0-alpha.21 risk
393393+394394+This is a theory I don't have time to fully develop but want on the
395395+table: the CBOR/CAR/MST hardening that landed in b91382b could in
396396+principle slow down frame decode enough to cause frame workers to
397397+back up, which cascades into broadcast_queue pressure, which cascades
398398+into persist_order spinlock contention, which cascades into the
399399+Evented broadcaster fiber holding main io for longer chunks, which
400400+cascades into HTTP fibers getting less scheduler share.
401401+402402+We haven't measured decode latency. The ops-changelog says "expected
403403+throughput drop: ~290k → ~202k fps decode+verify. still 13x faster
404404+than Go." which, on 300 fps ingest, is nowhere near a bottleneck. So
405405+this is probably not it. But it's the one behavior change in b91382b
406406+that we haven't ruled out, and b91382b is what we're rolling back to.
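The headroom claim above is easy to check. A quick sketch using only the figures quoted from the ops-changelog and the ~300 fps ingest rate, and assuming the quoted decode+verify throughput is comparable to what frame workers get in production (which we have not measured):

```python
DECODE_FPS_BEFORE = 290_000   # decode+verify throughput before hardening (ops-changelog quote)
DECODE_FPS_AFTER = 202_000    # after hardening (same quote)
INGEST_FPS = 300              # ingest rate cited in the text

drop_pct = 100 * (DECODE_FPS_BEFORE - DECODE_FPS_AFTER) / DECODE_FPS_BEFORE
utilization_pct = 100 * INGEST_FPS / DECODE_FPS_AFTER
headroom = DECODE_FPS_AFTER / INGEST_FPS

print(f"throughput drop: {drop_pct:.1f}%")
print(f"decode utilization at ingest rate: {utilization_pct:.2f}%")
print(f"headroom: {headroom:.0f}x")
```

So the hardening cost about 30% of decode throughput, but ingest still uses well under 1% of what remains. That is why this theory is ranked low, and it only becomes interesting if measured per-frame decode latency in production turns out to be far from the benchmark.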
407407+408408+## open operator questions
409409+410410+Things I'd like data on as the canary sequence runs:
411411+412412+1. On each canary, if the HTTP-hang symptom returns: capture `ss -ltn`
413413+ on ports 3000/3001 (accept-queue depth), and
414414+ `ps -eLo pid,tid,stat,comm` on the main PID (thread state
415415+ distribution). We've never captured either during an active hang.
416416+2. Is there any preserved log output from the host_mismatch burst on
417417+ 795cc41 or 4f3d1d4 showing the actual rejected DIDs and the
418418+ `resolved_host` vs `incoming_host` that mismatched? None captured
419419+ yet — the log buffer rotates by the time we look.
420420+3. For the host_mismatch question specifically: the reviewer has asked
421421+ for a DB audit **before** any code change. Check for duplicate
422422+ hostnames in the `host` table and stale `account.host_id`
423423+ references. Do not speculatively patch `getHostIdForHostname` with
424424+ an `ORDER BY` before knowing whether duplicate rows exist.
425425+426426+## what we are NOT asking for
427427+428428+- Another hypothesis with a fix attached. We've shipped three of those
429429+ today.
430430+- A code review of the broadcaster polling or gcLoop mutex. Both are
431431+ real bugs, both are known, both should be fixed later. They are not
432432+ the cause of the outage.
433433+- Strong opinions on architecture. The Evented + Threaded hybrid is
434434+ load-bearing in ways we don't fully understand yet and this doc is
435435+ not the time to decide whether it should be replaced.
436436+437437+## what we ARE asking for
438438+439439+- A sanity check on "the HTTP-hang is the primary symptom worth
440440+ debugging, and everything else we've been chasing is secondary".
441441+- Help designing the next *measurement* (not fix), specifically
442442+ something that would distinguish between the four failure shapes
443443+ listed in section 1.
444444+- Fresh eyes on whether the commits between 31825b2 and 4f3d1d4
445445+ contain anything we've missed that could plausibly cause HTTP
446446+ fibers to stop getting scheduled.
447447+- Whether "roll forward to bbba92c without 795cc41's preload_account_count
448448+ and without 4f3d1d4's gc changes" is a safer diagnostic step than
449449+ "roll all the way back to b91382b", given that bbba92c is the last
450450+ image we saw responsive HTTP on (ops-changelog says so at the 14-min
451451+ mark, and we never collected data past that).