atproto relay implementation in zig zlay.waow.tech
notes?

zzstoatzz da33b819 85907d4c

+919
+67
docs/operator-note-2026-04-09-threaded.md
# operator note: switch to Io.Threaded — 2026-04-09

## what changed

`e6cdf84` switches zlay from `Io.Evented` (io_uring fibers, ~35 threads)
to `Io.Threaded` (OS thread per PDS subscriber, ~2,800 threads). this is
the same execution model as the 0.15 baseline that ran at 99%+ coverage.

one-line change in `src/main.zig:61`. everything else is the same.

## why

the Evented backend was the root of every major issue since the 0.16
migration: 8 crash classes, a ReleaseSafe GPF, and a persistent 10-15%
coverage gap that nobody could explain. the zig team marks Evented as
experimental. rather than continuing to debug an unstable runtime, we're
reverting to the proven thread-per-PDS model and keeping all other 0.16
improvements.

## how to deploy

**build with ReleaseSafe** (not ReleaseFast). the fiber GPF that forced
ReleaseFast was an Evented-only bug. ReleaseSafe gives better error
messages and safety checks.

```
just zlay publish-remote ReleaseSafe
```

this is a change from previous deploys which used ReleaseFast.

## what to expect

- thread count: ~2,800-2,900 (same as 0.15 baseline, up from ~35-47)
- VmSize: ~22-25 GiB (same as 0.15 baseline)
- RSS: should be comparable to 0.15 (~1.4 GiB)
- coverage: targeting 99%+ (matching 0.15 and the b91382b rollback)
- the cross-Io crash class is eliminated entirely

## what to watch

1. external health: `curl https://zlay.waow.tech/_health` — should respond
   in < 1s immediately, no 10-minute degradation cycle
2. delivery: 15s websocket consumer should show ~395 fps once host table
   ramps (~20 min)
3. host_authority reject rate: should be ~1-2% steady state (same as
   b91382b), not the 88-99% seen on Evented builds
4. thread count: `ps -eLf | grep zlay | wc -l` — expect ~2,800-2,900
5. no restarts through a full 4h reconnect-cron cycle
6. ReleaseSafe specific: if any safety check fires, you'll get a clear
   error message + stack trace instead of silent corruption

## rollback

if anything goes wrong, roll back to the known-good b91382b image:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b
```

## what this is NOT

this is not a feature change, a dependency bump, or a new fix for the
april 8-9 outage. it's a backend selection change that eliminates the
runtime layer that was causing all the problems. the UAF fix (1eec324),
dep bumps, gcLoop fix, and host_authority work are all still in the tree
and unaffected.
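
the thread-count check in "what to watch" uses `ps -eLf | grep zlay | wc -l`, which can also count the `grep` process itself. a minimal sketch of an alternative, assuming a Linux host where `/proc/<pid>/status` exposes a `Threads:` field; the helper names are hypothetical and not part of zlay, and the 2,800-2,900 range is the expectation stated above:

```python
import re
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Return the `Threads:` value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Threads: field found")
    return int(match.group(1))


def thread_count(pid: int) -> int:
    """Read the thread count of a live process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def in_expected_range(count: int, low: int = 2800, high: int = 2900) -> bool:
    """True when the count falls in the range this note expects on Io.Threaded."""
    return low <= count <= high
```

inside the pod this would be called as `thread_count(pid)` with the zlay process id; a count in the ~35-47 range would suggest the Evented build is still running.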
+401
docs/zlay-canary-plan-2026-04-09.md
# zlay canary plan — 2026-04-09

**Goal**: isolate which commit in the `b91382b..31825b2` window regressed
external HTTP responsiveness + downstream delivery. Current production is
`ReleaseFast-zat21-b91382b` (known good — 99.4% delivery, responsive HTTP).
We ship three canaries, each adding exactly one behavioral commit on top
of `b91382b`, in the order the reviewer recommended. Each canary answers
one diagnostic question. No canary attempts to "fix everything."

**Reading guide for operator**: each canary section below is self-contained.
You should only need the "deploy" / "watch for" / "on failure" / "rollback"
blocks. The narrative between them is for the engineer if something
weird happens.

## a note on the `zat21-` prefix in the production image tag

Production is running `atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b`.
The `zat21-` prefix looks like it means "built with zat v0.3.0-alpha.21",
but that's a misleading label. The committed `b91382b` pins zat
**v0.3.0-alpha.17**, and no commit in this repo (across any branch or tag)
ever pinned alpha.21. The image is built from committed b91382b, so its
actual zat is alpha.17. The only zat bump after b91382b is `168d9f1`,
which goes directly to alpha.22.

Consequence: **canary 1 (zat alpha.17) matches production exactly.** No
drift, no hidden variable. Canary 2 is where zat actually changes
(alpha.17 → alpha.22, via 168d9f1).

| canary | zat version |
|---|---|
| production `ReleaseFast-zat21-b91382b` | alpha.17 |
| canary 1 | alpha.17 (same) |
| canary 2 | alpha.22 |
| canary 3 | alpha.22 (same as canary 2) |

## branches ready to build

All three branches exist locally and build clean against
`-Dtarget=x86_64-linux-gnu -Doptimize=ReleaseFast`. `zig build test`
passes on each.

```
canary/1-uaf-only        d71379c   b91382b + 1eec324
canary/2-uaf-plus-deps   fa7b5c4   canary/1 + 168d9f1
canary/3-uaf-deps-gc     e9802df   canary/2 + 3dc21b9
```

None are pushed to any remote yet. Tell me if you want them pushed as
`origin/canary/N-...` before the operator builds.

Build command per canary (run on Hetzner server per the normal flow,
but validated locally first):

```
just zlay publish-remote ReleaseFast
# — or whatever flag the publish script takes to target a specific SHA
```

Image tag expectation:
- canary 1 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-d71379c`
- canary 2 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-fa7b5c4`
- canary 3 → `atcr.io/zzstoatzz.io/zlay:ReleaseFast-e9802df`

(The exact naming depends on `just zlay publish-remote`. Adjust if needed.)

---

## Canary 1 — `b91382b + 1eec324` (UAF fix only)

### what's in it

Exactly one cherry-pick on top of b91382b: `1eec324 fix UAF: dupe
FrameWork.hostname per submit instead of borrowing`. Touches
`src/frame_worker.zig` (+2 -1) and `src/subscriber.zig` (+10 -1). No
dependency bumps. No gcLoop. No other changes.

### diagnostic question

**Does adding the FrameWork UAF fix alone reintroduce the HTTP /
delivery failure?**

### hoped-for outcome

Canary 1 runs indistinguishably from current production:
- external `/health` responsive in < 1 s
- 15 s websocket consumer receives ~395 fps (99%+ of ingest)
- `frames_broadcast_total` advances with an attached consumer
- zero readiness flaps through 60 min
- zero container restarts through a 4-hour reconnect-cron cycle
- the corrupted-hostname log pattern from the 2026-04-07 incident is
  ABSENT under reconnect-storm load

If canary 1 is clean, the UAF fix is not the regressor and we roll
forward to canary 2.
Canary 1 stays as the new production baseline
because it strictly improves on b91382b by closing the UAF.

### what to check

Run the operator's measurement recipe from
`../relay/docs/zlay-handoff-2026-04-09-rollback.md#reproducing-the-measurements`.
Specifically:

1. **External health** (every 5 min for first hour):
   ```
   curl --connect-timeout 3 -m 5 https://zlay.waow.tech/_health
   curl --connect-timeout 3 -m 5 https://zlay.waow.tech/xrpc/_health
   ```
   expected: `200 in < 1 s` every time.

2. **15 s websocket consumer** (every 15 min):
   ```
   uvx --from 'websockets==13.*' python -c '
   import asyncio, websockets, time
   async def main():
       async with websockets.connect("wss://zlay.waow.tech/xrpc/com.atproto.sync.subscribeRepos", max_size=None) as ws:
           n = 0; t = time.time()
           while time.time() - t < 15:
               await ws.recv(); n += 1
           print(f"{n} frames in 15s = {n/15:.1f} fps")
   asyncio.run(main())
   '
   ```
   expected: `~5,000+ frames, ~330+ fps` by 20 min uptime, climbing
   toward 395 fps as the host table ramps.

3. **15 s metrics delta** (with a consumer attached, port-forward to :3001):
   ```
   # t0: snapshot frames_received_total and frames_broadcast_total
   # wait 15 s
   # t1: snapshot again
   # compute delta. ratio should be >= 99%.
   ```

4. **Readiness state** (the `custom-columns` expression is single-quoted
   so the shell does not interpret its brackets, parentheses, or quotes):
   ```
   kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay get pod -l app.kubernetes.io/instance=zlay -o 'custom-columns=NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,RESTARTS:.status.containerStatuses[*].restartCount'
   ```
   expected: `True`, `0` restarts.

5. **Reconnect-cron survival**: the cron fires every 4h at 00/04/08/...
   UTC. Canary 1 should survive at least one fire cycle without
   restart or corrupted-hostname log lines. That's the UAF test.
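
Check 3 above is written as comments only. A minimal sketch of the delta computation, assuming the metrics endpoint serves standard Prometheus text exposition and that the two counters appear under exactly the names used in this plan; `counter_value` and `delivery_ratio` are hypothetical helpers, not part of zlay:

```python
def counter_value(metrics_text: str, name: str) -> float:
    """Sum every sample of counter `name` in Prometheus text exposition output."""
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        if not line.startswith(name):
            continue  # comments (# HELP / # TYPE) and other metrics
        rest = line[len(name):]
        if not rest or rest[0] not in " {":
            continue  # a different metric that merely shares the prefix
        if rest[0] == "{":
            rest = rest.rpartition("}")[2]  # drop the label set
        total += float(rest.split()[0])
        found = True
    if not found:
        raise KeyError(name)
    return total


def delivery_ratio(t0_text: str, t1_text: str) -> float:
    """Broadcast delta divided by received delta between two snapshots."""
    received = (counter_value(t1_text, "frames_received_total")
                - counter_value(t0_text, "frames_received_total"))
    broadcast = (counter_value(t1_text, "frames_broadcast_total")
                 - counter_value(t0_text, "frames_broadcast_total"))
    if received <= 0:
        raise ValueError("frames_received_total did not advance")
    return broadcast / received
```

Feed it two copies of the port-forwarded `/metrics` output taken 15 s apart with a consumer attached; the check passes when the returned ratio is at least 0.99.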

### success criteria (all must hold for 60 min from deploy)

- external `/health` responds `200` in < 1 s on every check
- 15 s consumer snapshot shows delivery at or above 90% of
  `frames_received_total` delta
- `kubectl get pod` shows `Ready=True` throughout
- `restartCount=0`
- no corrupted-hostname patterns in `kubectl logs` (no DIDs in
  hostname-shaped log fields, no stack-pointer-shaped bytes)

### failure signals

ANY of these means canary 1 is bad, roll back:

- external `/health` hangs or returns 503 on any probe
- 15 s consumer delivers < 100 fps or disconnects
- `Ready=False` at any point
- `restartCount > 0`
- metrics port-forward hangs where a single probe + 15 s wait +
  second probe doesn't complete

### on failure: capture evidence IMMEDIATELY then roll back

Before rolling back, capture (needs to happen BEFORE the pod is
recycled — evidence is gone after rollback):

```
# 0. resolve the pod name (assumes the single pod matched by the label used in check 4)
POD=$(kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay get pod -l app.kubernetes.io/instance=zlay -o jsonpath='{.items[0].metadata.name}')

# 1. socket state on the HTTP ports (are accepts queueing up?)
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'ss -ltn 2>&1 | grep -E "(3000|3001)"' > /tmp/zlay-canary1-ss.txt 2>&1

# 2. thread state (anyone stuck in R/D?)
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'ps -eLo pid,tid,stat,wchan,comm 2>&1' > /tmp/zlay-canary1-ps.txt 2>&1

# 3. final metrics snapshot (might hang, that's also a data point)
timeout 15 kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay exec $POD -- sh -c 'wget -qO- http://localhost:3001/metrics' > /tmp/zlay-canary1-metrics.txt 2>&1

# 4. recent logs
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay logs --tail=2000 $POD > /tmp/zlay-canary1-logs.txt 2>&1
```

Then roll back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation if canary 1 fails:
- 1eec324's frame-queueing change (heap-duping hostname, freeing in
  `processFrame`) is implicated directly. The engineer's next move
  would be to study the allocation lifetime carefully.
- Alternatively: zat 17 has a subtle issue that zat 21 (production)
  fixes. Easy to distinguish — build canary 1 with zat 21 manually
  pinned (same as production) and retest.

---

## Canary 2 — `b91382b + 1eec324 + 168d9f1` (add dep bump)

**Only run this after canary 1 is clean for ≥ 60 min AND has survived at
least one 4h reconnect-cron fire.**

### what's in it

Canary 1 plus `168d9f1 bump websocket.zig + zat`. Only change is
`build.zig.zon` (4 lines): zat v0.3.0-alpha.17 → v0.3.0-alpha.22 and
websocket.zig 9ac64da → 3c6794a. The functional change in 168d9f1 is
"fix requestCrawl POST hang in websocket.zig Handshake.parse" — but
the reviewer points out this commit changes runtime dependency behavior
more than any other in the window, which makes it a plausible regression
carrier for reasons other than its stated fix.

### diagnostic question

**Does bumping websocket.zig + zat (alpha.17 → alpha.22) on top of
canary 1 reintroduce the HTTP / delivery failure?**

### hoped-for outcome

Same clean run as canary 1, with one additional expected improvement:
the zlay-reconnect cronjob (fires at 00/04/08/...
UTC) should now
correctly re-announce PDS hosts (that's what 168d9f1 specifically
fixes). The cron log from the runner side should show a successful
POST response where it previously hung.

If canary 2 is clean, dep bump is not the regressor. Roll forward to
canary 3. Canary 2 becomes the new production baseline.

### what to check

Same 5 checks as canary 1. PLUS:

6. **Reconnect cron success**: the next fire after canary 2 deploys
   should complete normally — the requestCrawl POST hang fix is in
   168d9f1. If this cron was failing before with canary 1, it should
   succeed now. If it was already succeeding on canary 1 (because
   b91382b's websocket lib was handling this path through a different
   code path), no change expected.

### success criteria

Same as canary 1: health, delivery, readiness, restarts, no corrupted
hostnames. Plus successful reconnect cron.

### failure signals

Same as canary 1.

### on failure

Capture (same 4 commands as canary 1, retitled `canary2`), then roll
back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-d71379c  # canary 1
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation: the dep bump is implicated. Either websocket.zig
3c6794a's Handshake.parse behavior has a regression we haven't seen
yet, or zat alpha.22's transport layer does. Both are cheap to further
bisect: websocket.zig and zat are both small repos with narrow change
windows between 9ac64da→3c6794a and alpha.17→alpha.22.

---

## Canary 3 — `b91382b + 1eec324 + 168d9f1 + 3dc21b9` (add gcLoop re-enable)

**Only run this after canary 2 is clean for ≥ 60 min.**

### what's in it

Canary 2 plus `3dc21b9 fix gcLoop: silently exited after one tick`.
Touches only `src/main.zig` (+15 -9). The behavioral change: prior to
3dc21b9, `gcLoop` used `io.sleep` on pool_io from a plain std.Thread,
which failed on the second tick and silently exited via `catch return`.
So `gcLoop` actually ran ONE tick (+ malloc_trim) at ~10 min into each
pod's life, then was dead for the rest of the pod. 3dc21b9 replaced
`io.sleep` with `std.c.nanosleep` so gcLoop runs every 10 min for the
entire pod lifetime.

The gcLoop stabilization I shipped in 4f3d1d4 (disable malloc_trim,
bump interval to 1 hour) is intentionally NOT included in canary 3.
We want to isolate the effect of 3dc21b9 first. If 3dc21b9 is the
regressor, the 4f3d1d4-style mitigation becomes the next canary.

### diagnostic question

**Does re-enabling gcLoop at its original 10-minute cadence (with
`malloc_trim(0)` intact) reintroduce the HTTP / delivery failure?**

### hoped-for outcome

Same clean run as canary 2 for the first 10 minutes, then NO
degradation at the 10-minute mark when `gcLoop` first fires with
`dp.gc()` and `malloc_trim(0)`. If the 10-min mark is clean,
continue watching to ~60 min (by which point gcLoop has fired 5×).

If canary 3 is clean, 3dc21b9 is not the regressor and we've narrowed
the window further. Remaining suspects in the window are `fbdffbe`
(did_cache health-mark) and `31825b2` (subscriber prepareFrameWork
extraction + test). Both are lower-priority per the reviewer; we'd
run them as canary 4 and 5 or let them ride if canary 3 is fully
stable with all other changes layered.
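
To line probe failures up against gcLoop fires, it can help to compute how far a failed probe was from the nearest fire. A minimal sketch, assuming gcLoop fires at exact multiples of 10 minutes of pod uptime (the plan only says "~10 min", so treat the alignment as approximate) and using the ± 30 s window from the canary 3 checks; the function names are hypothetical:

```python
GC_INTERVAL_S = 600  # 10-minute gcLoop cadence restored by 3dc21b9
WINDOW_S = 30        # the +/- 30 s window used in the canary 3 checks


def nearest_gc_fire(uptime_s: float, interval_s: int = GC_INTERVAL_S) -> int:
    """Pod uptime (seconds) of the gc fire closest to `uptime_s`.

    The first fire is assumed to happen one full interval after pod start.
    """
    k = max(1, round(uptime_s / interval_s))
    return k * interval_s


def near_gc_fire(uptime_s: float,
                 interval_s: int = GC_INTERVAL_S,
                 window_s: int = WINDOW_S) -> bool:
    """True when a probe at `uptime_s` falls inside the window around a gc fire."""
    return abs(uptime_s - nearest_gc_fire(uptime_s, interval_s)) <= window_s
```

A failed probe for which `near_gc_fire` returns True points at gcLoop; failures well outside every window suggest looking elsewhere.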

### what to check

Same 5 checks as canary 1. PLUS special attention to the 10-minute mark:

7. **10-minute gc fire**: canary 3's gcLoop fires at pod uptime ~10
   min. Run:
   - external `/health` probe at 9:50, 10:00, 10:10, 10:30
   - websocket consumer at 10:00, 10:30
   - check `kubectl logs $POD | grep gc` for log evidence of the
     gc body running and malloc_trim completing
   - if any probe fails in the 10:00 ± 30s window, gcLoop is the
     smoking gun

### success criteria

Same as canary 1. PLUS clean behavior across the 10-min mark (and
the 20, 30, 40, 50, 60 min marks — each a gcLoop fire).

### failure signals

Same as canary 1, with heightened attention at ~10-min intervals.

### on failure

Capture (same as above), then roll back:

```
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay set image deployment/zlay main=atcr.io/zzstoatzz.io/zlay:ReleaseFast-fa7b5c4  # canary 2
kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay rollout status deployment/zlay --timeout=300s
```

Interpretation: gcLoop is the regressor. This would partially
rehabilitate the gcLoop-stall hypothesis from
`docs/zlay-gcloop-stall-2026-04-09.md`, but only "partially" because
4f3d1d4 (which disabled malloc_trim and bumped gc to 1 hour) still
exhibited the HTTP hang at ~35 min uptime. So if canary 3 fails at
the 10-min mark, the next diagnostic step is:
- canary 3b = canary 3 with just `malloc_trim(0)` disabled (gc still
  runs every 10 min). Isolates allocator stall from mutex-hold stall.
- canary 3c = canary 3b with gc interval bumped to 1 hour. Matches
  the 4f3d1d4 state exactly.

If canary 3c fails, we've exactly reproduced 4f3d1d4 from first
principles and learned that neither the allocator stall nor the
mutex-hold can be the sole cause — something else on top of 3dc21b9
is also in play. That's a bisection lead, not a conclusion.

---

## summary table

| canary | base | + cherry-pick | zat ver | diagnostic question | rollback to |
|---|---|---|---|---|---|
| 1 | b91382b | `1eec324` | alpha.17 | Is the UAF fix alone enough to regress? | `ReleaseFast-zat21-b91382b` |
| 2 | canary 1 | `168d9f1` | **alpha.22** | Does the dep bump regress on top of 1? | canary 1 image |
| 3 | canary 2 | `3dc21b9` | alpha.22 | Does gcLoop re-enable regress on top of 2? | canary 2 image |

## what's explicitly out of scope for these canaries

- **No host_authority work** (`bbba92c` keep_alive=false, `795cc41`
  slot recovery + observability). These add diagnostic noise and we
  don't need them to test delivery/HTTP. Reviewer was explicit.
- **No broadcaster writeLoop fix.** Real bug, but not the cause of
  the HTTP hang, and unrelated to what we're isolating.
- **No host_mismatch investigation in code.** Reviewer asked for a
  DB audit first — check for duplicate hostnames in the `host`
  table and stale `account.host_id` references before any
  `getHostIdForHostname` patch.
- **No additional "fix bundles."** Every canary is exactly one
  cherry-pick on top of the previous canary.

## what the engineer needs from the operator

1. **Confirmation of the zat version running in the current
   `ReleaseFast-zat21-b91382b` image.** If you can inspect the
   image or recall how you built it, knowing whether it's literally
   alpha.21 vs alpha.17 vs alpha.22 matters for interpreting canary 1
   vs canary 2 results.
2. **Push the canary branches to origin?** All three branches exist
   locally only. If your build pipeline pulls from remote, I need
   to push them. Let me know and I will.
3. **Evidence from the 4f3d1d4 dying window, if any was captured.**
   Specifically `ss -ltn` on ports 3000/3001 and
   `ps -eLo pid,tid,stat,wchan,comm` during the active hang. If none was
   captured, that's fine — we'll capture it on the first canary
   failure instead.
4. **DB audit results (independent of the canary sequence)**:
   ```
   SELECT hostname, count(*) FROM host GROUP BY hostname HAVING count(*) > 1;
   SELECT count(*) FROM account WHERE host_id NOT IN (SELECT id FROM host);
   ```
   If the first query returns any rows or the second returns a
   non-zero count, that's data we need before touching any
   host_authority code.
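
To show what the two audit queries catch, here is a minimal sketch that runs them against an in-memory SQLite database. The two-table schema and the sample rows are assumptions made up for illustration (only the table and column names that appear in the queries come from this plan), and the production database is not necessarily SQLite:

```python
import sqlite3

DUPLICATE_HOSTNAMES = (
    "SELECT hostname, count(*) FROM host GROUP BY hostname HAVING count(*) > 1;"
)
ORPHANED_ACCOUNTS = (
    "SELECT count(*) FROM account WHERE host_id NOT IN (SELECT id FROM host);"
)


def audit(conn: sqlite3.Connection):
    """Return (duplicate hostname rows, count of accounts with a dangling host_id)."""
    dupes = conn.execute(DUPLICATE_HOSTNAMES).fetchall()
    orphans = conn.execute(ORPHANED_ACCOUNTS).fetchone()[0]
    return dupes, orphans


def demo():
    """Build a tiny hypothetical dataset with one duplicate and one orphan."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host (id INTEGER PRIMARY KEY, hostname TEXT NOT NULL);
        CREATE TABLE account (did TEXT PRIMARY KEY, host_id INTEGER);
        INSERT INTO host VALUES (1, 'pds.example.com'),
                                (2, 'pds.example.com'),
                                (3, 'other.example.com');
        INSERT INTO account VALUES ('did:plc:aaa', 1),
                                   ('did:plc:bbb', 3),
                                   ('did:plc:ccc', 99);
        """
    )
    return audit(conn)
```

In the demo data, hostname `pds.example.com` has two ids (the case where an unordered hostname lookup could return either one), and one account points at a host id that no longer exists.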
+451
docs/zlay-situation-2026-04-09-reviewer.md
# zlay situation report for reviewer — 2026-04-09 evening

This doc is a frank account of where zlay stands at end-of-day 2026-04-09
after a long debug session. It supersedes, as primary-cause narratives, both
`zlay-broadcaster-starvation-2026-04-09.md` and
`zlay-gcloop-stall-2026-04-09.md` — those were hypotheses we shipped fixes
for, and both turned out to be wrong about root cause (they may still be
real secondary issues).

Purpose: help us stop ship-and-guess. We want the reviewer to sanity-check
the failure model, call out observations we're over-fitting to, and help
the next deploy be the one that works.

## update: rollback to b91382b succeeded — regression window confirmed

After this doc was first drafted, the operator completed the rollback to
`ReleaseFast-zat21-b91382b` (the 2026-04-06 build). Full write-up in
`../relay/docs/zlay-handoff-2026-04-09-rollback.md`. The load-bearing
measurements:

- external `https://zlay.waow.tech/_health` returns 200 in ~0.29 s
- a raw 15 s `subscribeRepos` websocket consumer receives **5,896 frames
  at ~395 fps**, vs 6 frames in 170 s (0.035 fps) on `4f3d1d4`
- with an attached consumer: `frames_broadcast` advances at **99.4% of
  `frames_received`** over a 15 s window (6801 / 6843)
- `host_authority` reject rate is **~1.3% steady state** (no workaround
  in place) — the 99.54% catastrophic rejection that motivated the
  `keep_alive=false` work is **not present** on b91382b

Conclusion: **both bugs** (external HTTP unreachability + delivery
collapse, *and* the 99.5% host_authority reject rate) were introduced in
the commit window `b91382b..31825b2`.
The five commits in that window:

```
3dc21b9 fix gcLoop: silently exited after one tick
fbdffbe mark DB success on did_cache hits
168d9f1 bump websocket.zig + zat: fix requestCrawl POST hang
1eec324 fix UAF: dupe FrameWork.hostname per submit instead of borrowing
31825b2 subscriber: extract prepareFrameWork + add UAF regression test
```

The operator's rollback doc flagged `1eec324` as the prime suspect on the
theory that it was the most invasive frame-path change. The reviewer has
pushed back on that framing: `1eec324` is a narrow UAF ownership fix and
`31825b2` is mostly test/extraction scaffolding; neither obviously
explains the outage by itself. The commits that most clearly change
runtime behavior are:

1. `168d9f1` — bumps `websocket.zig` and `zat` (runtime dependencies)
2. `3dc21b9` — re-enables `gcLoop` (it had been silently dead after one
   tick prior to this fix, so the gc body has effectively never run in
   production on b91382b either)

Any of the three (`1eec324`, `168d9f1`, `3dc21b9`) could plausibly be the
cause. The reviewer's recommended path is a controlled additive canary
sequence from `b91382b`, one commit at a time, so we isolate which one
introduces the regression — rather than another multi-change branch.
The operator plan for those canaries is in
`docs/zlay-canary-plan-2026-04-09.md`. The rest of this doc stands as
background for the reviewer on *what* we're isolating and *why* the
per-commit attribution matters.

**Current production state**: `ReleaseFast-zat21-b91382b`, 1/1 Running,
delivering ~400 fps. Note: b91382b does NOT have the FrameWork.hostname
UAF fix (`1eec324` is 04-07), so a reconnect storm could trip the UAF
and restart the pod. The operator is accepting that risk to keep
evaluators served.

## high-level headline

**Every pod we've shipped after 31825b2 (2026-04-07) has exhibited a symptom
where zlay's external HTTP surface — `/health`, `describeServer`,
`/_readyz`, `/metrics` on port 3000 / 3001 — stops responding past a
threshold of a few minutes to half an hour of uptime. Internal metrics
continue to be collected by the process (frames_in climbs, workers count is
stable, CPU usage is low), but external consumers cannot complete a
handshake, prometheus scrapes time out at 10s, and relay-eval reports 0%
coverage for zlay while indigo shows 99%+.**

We have been wrong three times in a row about what causes this:
1. "zig 0.16 `std.http.Client` stale keep-alive handling" (falsified by
   engineer's standalone repro).
2. "broadcaster writeLoop scheduler starvation, fix by moving to pool_io"
   (falsified by recognizing the cross-Io crash class documented in 6674812).
3. "gcLoop + malloc_trim every 10 min" (falsified by the 4f3d1d4 deploy
   today — symptom returned at ~35 min uptime despite gc bumped to 1 hour
   and `malloc_trim` fully removed).

We need help getting to a correct fourth hypothesis rather than shipping a
fourth wrong one.

## what we know the symptom looks like from the outside

- External curl to `https://zlay.waow.tech/health` and
  `com.atproto.sync.describeServer` hangs through a 3 s timeout, or returns
  503 from the ingress.
- prometheus `/metrics` scrapes fail with `context deadline exceeded` at
  the 10 s scrape timeout. this is continuous, not intermittent, once the
  symptom starts.
- internal port-forwarded `/metrics` hit can succeed for a single probe
  then hang >10 s on the next one 15 s later — i.e. the HTTP handler isn't
  dead, it's responding on a fraction of probes and timing out on the rest.
- `/_healthz` comes back in ~11 s in the same window.
  `/_readyz` in ~17 s
  (above the relaxed 15 s probe timeout — pod only survives because
  kubelet's 20-failure threshold happens to hit faster windows sometimes).
- k8s eventually marks the pod `Ready=False` with
  `ContainersNotReady`. On 4f3d1d4 that happened at 20:59:34, ~35 min after
  pod start (vs ~10-14 min on the unmodified 795cc41 and bbba92c pods).
- service `endpoints/zlay` has no endpoints because Ready=False kicked the
  pod out of the service. ingress returns 503.
- pod process itself: ~0.26 CPU cores of usage, all ~47 threads in
  S-state (sleeping). nothing is hot-spinning. load average on the node is
  high (67/8) but **zlay is not the one consuming CPU**.

## what we know the symptom looks like from the inside

Representative snapshot from the live 4f3d1d4 pod at ~28m uptime just
before rollback:

| metric | value |
|---|---:|
| `frames_received_total` | 742,988 |
| `frames_broadcast_total` | 478 (lifetime, barely moved) |
| `broadcast_no_consumers_total` | 37,956 |
| `consumers_active` | 0 |
| `workers_count` | 2,761 (spawn complete) |
| `host_authority_checks_total` | 17,236 |
| `host_authority_reject{branch="host_mismatch"}` | 15,257 (88%) |
| `host_resolver_in_use` | 0 (pool idle at the moment of probe) |

Gap analysis:
- `frames_received = 742,988` vs `frames_broadcast + no_cons ≈ 38,434`.
  That's ~5% of received frames reaching `broadcast()`. The rest is either
  still in-flight in the frame_worker pool, dropped on the validation
  branches (host_authority, sig, chain, etc), or lost to something we're
  not tracking. We did not capture the full validation counter delta in
  the incident — recovering that is one of the first things we need.
- `broadcast_no_consumers = 37,956` vs `frames_broadcast = 478`: ~99%
  of the frames that did reach `broadcast()` hit the "zero consumers" fast
  path (broadcaster.zig:611). Lifetime `frames_broadcast_total = 478`
  means ~all of a consumer's ever-delivered frames were from a brief
  window long ago — consistent with "consumers couldn't attach for most
  of the pod's life" rather than "consumers attached but were kicked."
- `host_authority_reject{host_mismatch} = 88%` is a second independent
  bad signal on this pod (see "additional observation" below).

## what commits are in play

Between the last reliably-observable pod (`31825b2`, 2026-04-07) and now:

| commit | date | what | risk flagged |
|---|---|---|---|
| `3dc21b9` | 04-06 | fix gcLoop silent-exit after one tick | gc actually ran again after this; masked previously |
| `fbdffbe` | 04-06 | mark DB success on did_cache hits | low |
| `168d9f1` | 04-06 | bump websocket.zig + zat (requestCrawl POST hang fix) | low but touches websocket lib |
| `ee4e368` | 04-08 | buffer 8192→65536 + per-branch counters | low |
| `bbba92c` | 04-08 | **`keep_alive=false` on host_authority pool** | **high — workaround, not root-caused** |
| `795cc41` | 04-09 | host_authority slot recovery + pool metrics + preload account count | medium — changes cold-start path |
| `4f3d1d4` | 04-09 | disable malloc_trim, bump gc to 1h, timing log | low, but didn't fix the issue |

The single most significant behavioral change in that set is `bbba92c`'s
`keep_alive=false`. Every `is_new`/`host_changed` event now does a fresh
DNS + TCP + TLS + HTTPS round-trip to plc.directory (~350-900 ms/call),
and the resolver pool is only 4 slots acquired by 16 frame_worker threads
via atomic spinlock.
This is a known aggravating factor for cold-start
load, but ops-changelog claims it was also present on 795cc41 without
persistent HTTP death — which weakens the "keep_alive=false is the whole
story" claim.

## what we know is NOT the cause

1. **scheduler contention on writeLoop.** I wrote this hypothesis up
   (`docs/zlay-broadcaster-starvation-2026-04-09.md`), proposed moving
   writeLoop to `pool_io`. Zlay engineer correctly pointed out this
   re-enters the `Thread.current() NULL deref on plain threads calling
   Evented Io.Mutex` crash class fixed in `6674812`. **Do not do this.**
   See the cross-Io rule in NOTES/stdlib-patches.md.
2. **gcLoop mutex hold + malloc_trim.** I wrote this hypothesis up
   (`docs/zlay-gcloop-stall-2026-04-09.md`), shipped 4f3d1d4 with gc
   interval 10min → 1h and malloc_trim removed. **Pod still flapped**,
   this time at ~35 min uptime instead of ~10 min. So those changes
   delayed the symptom but did not prevent it. Longer uptime might
   mean we moved some pressure off the critical path, or it might just
   be process-state timing noise — we did not collect enough data to
   distinguish.
3. **broadcaster writeLoop `io.sleep(100ms)` polling**
   (broadcaster.zig:447-453). This is a real bug — `cond.signal` on
   line 413 is a no-op and per-consumer drain is capped at 10 frames/sec
   in the best case. But it is NOT the cause of the HTTP server being
   unreachable. It affects throughput _to attached consumers_, and the
   problem we have is that consumers cannot attach at all.

## additional observation: host_mismatch burst

On both 795cc41 and 4f3d1d4 cold-starts, we see a large burst of
`host_authority_reject{branch="host_mismatch"}`: 15k-17k rejects in the
first ~15 min, then the rate drops to ~0/sec.
This is qualitatively
different from the 2026-04-08 `resolve`-branch 99% reject bug. It means:
the DID resolver succeeded, returned a valid DID doc, and
`pds_host_id != incoming_host_id` at the comparison step
(`validator.zig checkPdsHost`).

Plausible causes not yet investigated:
- Stale `account.host_id` values persisted from a previous pod
  triggering `host_changed=true` on events where the current DID doc
  still resolves to the current subscriber's hostname.
- Duplicate rows in the `host` table with the same hostname and
  different ids (`getHostIdForHostname` has no ORDER BY; lookup is
  non-deterministic).
- Something in how 795cc41's `preload_account_count` change interacts
  with host_id assignment at spawn time.

None of these candidates are new in 4f3d1d4. The burst exists independent
of the HTTP-hang symptom, but it's the same window, so we can't say for
sure whether it's a contributor or an unrelated mess.

**We have not captured sampled warn logs for any of these host_mismatch
rejects.** The log buffer has rotated out by the time we look. Any
isolation experiment here needs to pull logs _during_ the burst, not
after.

## rollback completed: see "update" at top of doc

Rollback is done. b91382b is responsive, delivery is healthy, host_authority
reject rate is ~1.3% not 99.54%. See the "update" section at the top of this
doc for the full measurements. The diagnostic path going forward is the
canary sequence from b91382b, documented in `zlay-canary-plan-2026-04-09.md`.

## things that are REAL bugs regardless of which hypothesis wins

Independent of the HTTP-hang root cause, these are defects we want to
fix eventually. Enumerated so the reviewer can weigh in on priority and
shape:

1. **`Consumer.writeLoop` polling** (broadcaster.zig:439-477).
   Replace the `io.sleep(100ms)` with proper `cond.wait(mutex, io)`.
   Schedule pings via a separate timer fiber or opportunistically on
   wake. Do NOT move writeLoop off Evented. Without this change,
   per-consumer drain is capped at 10 frames/sec even in perfect
   conditions.
2. **`DiskPersist.gc()` holds the persist hot-path mutex for its
   entire body** (event_log.zig:977-1033). The mutex is there to
   protect `evtbuf`/`outbuf`/`cur_seq`/`current_file_path`/`flushLocked`.
   Nothing gc actually does (DB iteration, per-file unlink) needs that
   lock. Proposed fix: discover candidate files without the lock, then
   re-acquire it briefly per file only to re-check `current_file_path`
   before unlinking. Same treatment for `gcBySize()` and
   `takeDownUser()`.
3. **Unproven pool slot recovery** (validator.zig resolver pool). The
   slot-recovery fix in 795cc41 exists but is dormant under
   `keep_alive=false`. We haven't actually proven the recovery path
   works because it hasn't been exercised in production. The planned
   canary ("1 of 4 slots `keep_alive=true`") was never run.
4. **`zig build test` is not sufficient CI.** Lazy analysis skips
   functions not referenced from tests. The engineer added a rule: run
   `zig build` (exe) in addition to `zig build test` for
   validator/subscriber/frame_worker changes. The `584571a` build
   break is the precedent.
5. **Two different error types get swallowed in zat's transport path**
   (zat/src/internal/xrpc/transport.zig, did_resolver.zig). Partially
   fixed by zat 0.3.0-alpha.23, which propagates the underlying
   `std.http.Client.fetch` error. The upgrade shipped in 795cc41, but we
   never saw a failure after it shipped because `keep_alive=false` was
   still in place.
6. **`getHostIdForHostname` has no `ORDER BY`**: if there are ever
   duplicate rows for the same hostname, lookup is non-deterministic.
   We ran the reviewer's DB audit on 2026-04-09: 0 duplicate hostnames
   in production, so this is not currently a live bug, but the query
   should still be deterministic as a hardening matter.
7. **Dual host_authority call sites with asymmetric metrics.**
   `resolveHostAuthority` is invoked from both `frame_worker.zig:107`
   and `subscriber.zig:555`. Only the frame_worker path emits the
   `relay_host_authority_trigger{reason=...}` and
   `relay_host_authority_checks_total` counters; both paths increment
   `relay_validation_failed{reason="host_authority"}` on reject. This
   makes the ratio `trigger:reject` structurally misleading: trigger
   is a lower bound on true authority-check count, while rejects
   include both paths. Discovered during the 2026-04-09 canary 1
   investigation when we tried to correlate the `host_id=0` sentinel
   finding against the reject counter. Long-term fix is one of:
   (a) consolidate authority checking to a single path, or
   (b) have both paths emit the same counters. Not a blocker for any
   current canary; flagged so it doesn't get forgotten.
8. **`account.host_id = 0` is a sentinel ("host not set yet") by
   design**, documented in `event_log.zig:513`. As of the 2026-04-09
   DB audit, 239,422 out of 5,817,756 accounts (~4.12%) are still at
   the sentinel. Reviewer's semantic correction (2026-04-09): orphan
   accounts produce `is_new=true` in `uidForDidFromHost`, which
   triggers a host_authority CHECK, not automatically a reject. The
   reject only happens if `resolveHostAuthority` returns `.reject`,
   which depends on the DID doc's PDS vs the incoming host. So the
   sentinel explains an `is_new` burst during cold-start ramp but
   does NOT by itself explain the `host_mismatch` reject burst we
   observed on 795cc41/4f3d1d4.
   The real fix is probably that the `is_new` accept-and-update path
   should not go through the full host_authority resolution pipeline
   (too slow for first-seen events), but that's an optimization, not
   an outage fix.
9. **Relaxed k8s probes are a band-aid over "HTTP fibers can get
   stuck for 10+ s under load."** With probes at
   `initialDelay=300s, timeout=15s, failureThreshold=20`, the pod
   effectively has ~5 minutes to get its act together at startup and
   can be in a degraded state for up to 300 s at runtime before
   kubelet notices. If we ever genuinely hang at startup, we now get
   no feedback for 5 minutes. These should be tightened once the
   underlying cause is fixed.

## what I'd want the reviewer to help us with

### 1. the core question: why external HTTP stops responding

The HTTP server fibers (`runWsServer`, `MetricsServer.run`) live on main
`Io.Evented` io alongside ~2,800 per-subscriber fibers, the broadcaster
loop fiber, and the slurper spawn fiber. On the 4f3d1d4 pod at the time
of death, the pod was using ~0.26 cores, nothing was spinning, and all
threads were in S-state. Yet HTTP accepts weren't completing and probes
timed out.

Possible shapes of the failure we can't distinguish with the evidence we
have:

a. **Accept-queue exhaustion**: listener fiber is running but the
   kernel's SYN/accept backlog is full so new connections don't reach
   the fiber. Check with `netstat -an | grep :3000` for SYN_RECV,
   `ss -ltn` for Recv-Q on the listening socket.

b. **Single Evented runtime thread wedged**: if Evented is a work-stealing
   scheduler across ~47 threads, one stuck fiber on one thread should
   not freeze the whole thing. If it's a single-loop-per-thread design
   and accepts are pinned to one loop, a stuck loop could freeze
   accept specifically.
   Which of these does zig 0.16 `Io.Evented` actually implement?

c. **malloc contention**: the pod runs with `MALLOC_ARENA_MAX=4`, and
   16 frame_worker threads (plus ~47 Evented runtime threads) are all
   contending for 4 glibc arenas. Under sustained allocator pressure
   (every keep_alive=false resolve allocates a TLS session), any
   thread that calls malloc at the wrong time may block on arena
   locks. The HTTP handler allocates for response bodies. Is this
   plausible as a sustained-state cause rather than just a burst?

d. **Some fiber consistently monopolizes a CPU slice and the HTTP
   accept fiber doesn't get scheduled often enough**: spawnWorkers
   finished at 96 s but maybe something else has replaced it. Possible
   candidates: Consumer.writeLoop fibers with their 100 ms polling,
   the broadcaster loop fiber under push-lock contention (persist_order
   spinlock).

We cannot distinguish these today. We have no fiber-level trace, no
evented scheduler diagnostics, no per-fiber wake-latency metric. **This
is the single biggest gap in our ability to debug this.**

### 2. the falsifiable hypotheses we'd try next

In priority order, these are experiments where a positive result would
strengthen one of (a)-(d) and a null result would rule it out. None of
these are "ship a fix"; they're all "get diagnostic signal":

- **(a) accept-queue check**: on a live flapping pod, run
  `ss -ltn | grep 3000`, `netstat -an | awk '$NF=="LISTEN"'`, and
  `cat /proc/net/tcp` to look at socket state and queue depth. If
  accept backlog is full, that's a scheduler-not-running-accept
  problem. If it's empty, handshakes are reaching the fiber and
  something inside the fiber is slow.

- **(b) Evented runtime design**: read zig 0.16 `std.Io.Uring` /
  `Io.Evented` source to confirm whether it's single-loop or
  work-stealing. If single-loop, accept-pinned-to-one-thread is a known
  class of failure. If work-stealing, a single stuck fiber shouldn't
  freeze the accept fiber, which would point away from
  scheduler-starvation and toward something else.

- **(c) malloc contention experiment**: temporarily set
  `MALLOC_ARENA_MAX=16` (or unset it entirely) on the Deployment and
  see if the HTTP-hang cadence changes. Not a fix, just a signal.
  Does not require a code change. Fully reversible.

- **(d) fiber occupancy metric**: add a counter for Evented main-io
  fiber wake-to-wake time for the HTTP accept fiber. On a healthy
  pod it should be < 1ms at idle. On a wedged pod it should be
  seconds. This is the one new instrumentation we most need.

### 3. questions on the host_mismatch burst

- Is there a known migration / spawn pattern that would cause
  `host_changed=true` on events that, when the DID doc is re-resolved,
  actually point back at the host the event came from? Is there a
  race between `requestCrawl` assigning a host_id and a frame arriving
  from a subscriber that already has a different host_id?
- Could `preload_account_count` (new in 795cc41) have changed the
  cold-start order such that subscribers start firing events before
  `host.id` is stable?

### 4. zat 0.3.0-alpha.21 risk

This is a theory I don't have time to fully develop but want on the
table: the CBOR/CAR/MST hardening that landed in b91382b could in
principle slow down frame decode enough to cause frame workers to
back up, which cascades into broadcast_queue pressure, which cascades
into persist_order spinlock contention, which cascades into the
Evented broadcaster fiber holding main io for longer chunks, which
cascades into HTTP fibers getting less scheduler share.

We haven't measured decode latency. The ops-changelog says "expected
throughput drop: ~290k → ~202k fps decode+verify. still 13x faster
than Go." which, on 300 fps ingest, is nowhere near a bottleneck. So
this is probably not it. But it's the one behavior change in b91382b
that we haven't ruled out, and b91382b is what we're rolling back to.

## open operator questions

Things I'd like data on as the canary sequence runs:

1. On each canary, if the HTTP-hang symptom returns: capture `ss -ltn`
   on ports 3000/3001 (accept-queue depth), and
   `ps -eLo pid,tid,stat,comm` on the main PID (thread state
   distribution). We've never captured either during an active hang.
2. Is there any preserved log output from the host_mismatch burst on
   795cc41 or 4f3d1d4 showing the actual rejected DIDs and the
   `resolved_host` vs `incoming_host` that mismatched? None captured
   yet; the log buffer rotates by the time we look.
3. For the host_mismatch question specifically: the reviewer has asked
   for a DB audit **before** any code change. Check for duplicate
   hostnames in the `host` table and stale `account.host_id`
   references. Do not speculatively patch `getHostIdForHostname` with
   an `ORDER BY` before knowing whether duplicate rows exist.

## what we are NOT asking for

- Another hypothesis with a fix attached.
  We've shipped three of those today.
- A code review of the broadcaster polling or gcLoop mutex. Both are
  real bugs, both are known, both should be fixed later. They are not
  the cause of the outage.
- Strong opinions on architecture. The Evented + Threaded hybrid is
  load-bearing in ways we don't fully understand yet and this doc is
  not the time to decide whether it should be replaced.

## what we ARE asking for

- A sanity check on "the HTTP-hang is the primary symptom worth
  debugging, and everything else we've been chasing is secondary".
- Help designing the next *measurement* (not fix), specifically
  something that would distinguish between the four failure shapes
  listed in section 1.
- Fresh eyes on whether the commits between 31825b2 and 4f3d1d4
  contain anything we've missed that could plausibly cause HTTP
  fibers to stop getting scheduled.
- Whether "roll forward to bbba92c without 795cc41's preload_account_count
  and without 4f3d1d4's gc changes" is a safer diagnostic step than
  "roll all the way back to b91382b", given that bbba92c is the last
  image we saw responsive HTTP on (ops-changelog says so at the 14-min
  mark, and we never collected data past that).
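
As a starting point for the accept-queue measurement (failure shape (a),
and operator question 1), here is a minimal sketch of a helper that
reads accept-queue depth out of `/proc/net/tcp` text so it can be
sampled in a loop during a hang. It is Python rather than Zig because it
is a standalone diagnostic, not zlay code. Assumptions: the function
name `listen_backlog` and the sample rows are made up for illustration,
and the sketch relies on Linux reporting the current accept-queue depth
in the `rx_queue` column for LISTEN rows (the number `ss -ltn` shows as
Recv-Q). Cross-check against `ss -ltn` once before trusting it.

```python
from typing import Dict, Iterable

LISTEN = "0A"  # hex state code for TCP_LISTEN in /proc/net/tcp

# Illustrative sample only; not captured from a zlay pod.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0BB8 00000000:0000 0A 00000000:00000005 00:00000000 00000000     0        0 11111 1
   1: 00000000:0BB9 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11112 1
   2: 0100007F:0BB8 0100007F:D431 01 00000000:00000040 00:00000000 00000000     0        0 11113 1
"""


def listen_backlog(proc_net_tcp: str, ports: Iterable[int]) -> Dict[int, int]:
    """Return {port: rx_queue} for LISTEN sockets on the given ports.

    Assumption: for LISTEN rows, rx_queue is the current accept-queue depth.
    """
    wanted = set(ports)
    depths: Dict[int, int] = {}
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and sockets that are not listening.
        if len(fields) < 5 or fields[3] != LISTEN:
            continue
        # local_address is HEXIP:HEXPORT; only the port matters here.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port not in wanted:
            continue
        # Column 4 is "tx_queue:rx_queue", both hex.
        rx_queue = int(fields[4].split(":")[1], 16)
        # Keep the deepest queue if several listeners share a port.
        depths[port] = max(depths.get(port, 0), rx_queue)
    return depths


# On a pod: listen_backlog(open("/proc/net/tcp").read(), [3000, 3001])
```

If the server binds an IPv6 socket, the same parsing applies to
`/proc/net/tcp6`. A depth that stays pinned near the listen backlog
limit during a hang points at accept not being scheduled; a depth near
zero points at something slow after accept.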