declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay

docs: log zlay FrameWork hostname UAF fix + validation results

records the 2026-04-07 investigation: root cause (borrowed hostname
slice freed by slurper teardown while in-flight FrameWorks held it),
the dupe-at-submit fix shipped as 1eec324, the follow-up refactor
that added a regression test, and the stress-test validation
metrics (broadcast_queue_full_total 3B → 0 after fix).

also notes the known follow-ups (persist_order global lock,
SharedFrame allocation hot path) as non-urgent optimizations to
revisit after consumer-side canary.

+68
docs/ops-changelog.md
@@ -5,6 +5,74 @@

 ---

+## 2026-04-07
+
+### zlay FrameWork hostname UAF — fixed and shipped (1eec324)
+
+the reconnect cronjob triggered a pod restart after subscribers churned faster
+than the frame pool could drain. investigation surfaced a use-after-free on
+`FrameWork.hostname`:
+
+- `slurper.runWorker` (slurper.zig:617) frees `sub.options.hostname` immediately
+  after `sub.run()` returns and before `sub.destroy()`
+- `FrameWork.hostname` was a **borrowed** slice pointing into that buffer
+- in-flight FrameWorks queued in the frame pool kept reading the freed slice —
+  visible in logs as corrupted hostnames (`` `H���y ...host.bsky.network``,
+  DID strings where hostnames should be, stack-pointer-shaped bytes)
+- 4,537 corrupted log lines accumulated before the liveness probe flapped
+
+**fix**: `Subscriber.prepareFrameWork` (subscriber.zig) now heap-dupes both
+`data` and `hostname` at submit time. the worker frees both through
+`work.allocator` in `processFrame` (frame_worker.zig). FrameWork is now
+lifetime-independent from its submitting subscriber.
+
+**regression test**: `test "prepareFrameWork dupes hostname and data (UAF
+regression)"` (subscriber.zig) — builds a FrameWork from a to-be-freed
+hostname buffer, frees the source, reads the work item. asserts distinct
+pointers and content equality. would trip the testing allocator's UAF
+detection if the dupe were skipped.
+
+**production validation** (commit 1eec324, ReleaseFast):
+
+| metric | old pod (pre-fix) | new pod (post-fix, 13m + stress test) |
+|---|---|---|
+| `broadcast_queue_full_total` | 3,000,000,000+ | 0 |
+| `broadcast_queue_push_lock_spins_total` | (unrecorded) | 0 |
+| `broadcast_queue_depth_hwm` | high | 966 / 8192 |
+| `slow_consumers_total` | high | 0 |
+| corrupted log lines | 4,537 / ~77k | 0 / 15,920 |
+| probe latencies | 5s timeout | 326 ms / 225 ms |
+| pod restarts | 1 in 7m | 0 in 13m + stress |
+
+the stress test was a manual `zlay-reconnect-stress-1775573095` cronjob run
+(~1,839 `requestCrawl` POSTs over 2m09s) against the new pod. the pod stayed
+1/1 ready, `relay_workers_count` recovered to 1,527 inbound subscribers with
+no restarts and no corruption.
+
+**conclusion**: the 3B broadcast queue full events on the old pod were a
+symptom of the UAF, not a separate broadcaster bug. corrupted frame state
+cascaded into the push path. with clean state the broadcaster behaves
+normally even under stress.
+
+caveat: `relay_consumers_active = 0` on the new pod (no downstream firehose
+subscribers attached during the test), so this confirms the push side and
+drain loop but not the per-consumer enqueue fanout. push-side was the
+metric that blew up before, so the fix is high-confidence on the
+regression path.
+
+**known follow-ups** (separate from UAF, not bugs):
+
+- `relay_persist_order_spins_total = 91M / 13m` — 16 frame workers serialize
+  through one global lock that holds disk I/O. worth splitting persist and
+  broadcast queue insertion once real downstream consumers are validated.
+- `SharedFrame.create` does 2 heap allocs per broadcast — could transfer
+  ownership from the broadcast queue instead of copying.
+
+neither is urgent. the pipeline is stable with headroom; next disciplined
+step is consumer-side canary before touching the ordering lock.
+
+---
+
 ## 2026-04-06

 ### zat v0.3.0-alpha.21: CBOR/CAR/MST validation hardening