gcLoop: disable malloc_trim, bump interval 10min→1h, add timing

diagnosis of the 2026-04-09 ~10-minute pod-flap cycle. the bbba92c
pod pattern (healthy ~10 min → /metrics and /_readyz stuck → NotReady
→ restart → repeat) lines up exactly with gcLoop's 10-min cadence.

two separable suspects inside gcLoop, either alone sufficient to
flunk probes:

1. dp.gc() holds DiskPersist.mutex for its entire duration (DB
iteration + per-file unlinks, event_log.zig:977-1033). every
frame worker blocks on persist() during gc. this alone explains
the earlier "0.035 events/sec to consumers" measurement.

2. malloc_trim(0) on a ~1.5 GiB RSS process with MALLOC_ARENA_MAX=4.
glibc holds per-arena locks during the free-list walk, stalling
every allocator caller — including the Evented fiber serving
/metrics and /_readyz. long enough to trip probe timeouts.

this is a stabilization commit, not a root-cause fix:

- disable malloc_trim(0) entirely (comment preserved). prefer
MALLOC_MMAP_THRESHOLD_ tuning or an out-of-band maintenance window
if reclaim becomes an issue.
- bump gc_interval_s 10 min → 1 hour. bounds blast radius of the
persist-mutex hold until gc() is properly narrowed.
- add clock_gettime(.MONOTONIC) timing around dp.gc(). next incident
tells us whether dp.gc() itself or something adjacent is the stall.
- new doc: docs/zlay-gcloop-stall-2026-04-09.md with the hypothesis,
code pointers, validation plan, and follow-up work list (mutex
narrowing; broadcaster writeLoop polling as a separate bug).

3dc21b9 (2026-04-06 "fix gcLoop: silently exited after one tick")
is what unmasked this — before that fix gcLoop ran once and died,
so malloc_trim + gc ran exactly once per pod lifetime. after that
fix they run every 10 min, which is when zlay started flapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zzstoatzz 1 month ago 4f3d1d4c 795cc415

+131 -6

2 changed files

expand all

docs

zlay-gcloop-stall-2026-04-09.md

src

main.zig

+108

docs/zlay-gcloop-stall-2026-04-09.md

··· 1 + # zlay gcLoop stall — 2026-04-09 2 + 3 + ## tl;dr 4 + 5 + zlay pods were flapping on a ~10 minute cadence: ~10 min healthy, then /metrics 6 + and /_readyz stop responding, kubelet marks NotReady, pod eventually restarts, 7 + cycle repeats. Initial hypothesis was broadcaster writeLoop starvation 8 + (see [zlay-broadcaster-starvation-2026-04-09.md] — now superseded as primary 9 + cause). The actual cadence lines up precisely with `gcLoop` (main.zig), which 10 + became functional again on 2026-04-06 via commit 3dc21b9 ("fix gcLoop: silently 11 + exited after one tick"). Prior to that fix, gc was silently dead after one tick 12 + per pod, masking the underlying problem. 13 + 14 + ## causal chain (suspected) 15 + 16 + 1. `gcLoop` fires every 10 minutes (main.zig). 17 + 2. `dp.gc()` holds `DiskPersist.mutex` for its entire duration 18 + (event_log.zig:977-1033). That critical section includes: 19 + - `SELECT` of expired log_file_refs 20 + - per-file `DELETE FROM log_file_refs` + `unlink` 21 + - `gcBySize()` — another pass of queries + `stat` + `unlink` 22 + 3. `DiskPersist.persist()` (event_log.zig:864-899) takes the same mutex on the 23 + frame-worker hot path. For the duration of gc, every frame worker blocks on 24 + persist → the broadcast queue dries up → consumers see ~0 events/sec. This 25 + alone explains the "zlay only delivers 0.035 events/sec" symptom that was 26 + previously blamed on writeLoop polling. 27 + 4. After `dp.gc()` returns, `gcLoop` called `malloc_trim(0)`. The pod runs with 28 + `MALLOC_ARENA_MAX=4`, so glibc holds per-arena locks while walking free 29 + lists. On a ~1.5 GiB RSS process this can stall every allocator user for 30 + seconds. The Evented fiber serving /metrics and /_readyz would stall on its 31 + next malloc → kubelet liveness/readiness probes time out → pod marked 32 + NotReady → restart. 33 + 34 + Two separable suspects (important for isolation): 35 + - **`dp.gc()` mutex hold** freezes ingest via the persist hot path. 36 + - **`malloc_trim(0)`** freezes everything via arena locks. 37 + 38 + Either alone is sufficient to flunk probes. Don't conflate them when 39 + investigating. 40 + 41 + ## stabilization shipped 42 + 43 + Commit on top of 795cc41: 44 + 45 + - **Disabled `malloc_trim(0)`** in `gcLoop` (main.zig:502). Comment preserved 46 + so future maintainers know why. If RSS growth becomes an issue, prefer 47 + tuning `MALLOC_MMAP_THRESHOLD_` or running trim out-of-band. 48 + - **Bumped gc interval from 10 min → 1 hour** (main.zig:473). Reduces 49 + frequency and blast radius of the persist-mutex hold while the real fix 50 + (mutex narrowing) is prepared. Not a cure — a stall at hour boundaries is 51 + still possible if gc runs long. 52 + - **Added timing log** around `dp.gc()` using `clock_gettime(.MONOTONIC)` 53 + (plain thread — no Io available). Next incident will tell us how long 54 + `dp.gc()` actually runs on a production dataset. 55 + 56 + These changes are in main.zig only. No dependency or schema changes. Safe to 57 + revert by undoing the single commit. 58 + 59 + ## validation plan post-deploy 60 + 61 + After deploying on top of 795cc41: 62 + 63 + 1. Pod uptime should exceed 10 minutes with no NotReady flap. 64 + 2. Grep logs for `gc: dp.gc complete in` — verify gc runs and record its 65 + duration on production data. 66 + 3. `frames_broadcast_total` should climb at ingest rate (~300/sec), not 67 + the previously measured ~0.035/sec. 68 + 4. `tap run --relay-url https://zlay.waow.tech` for 60 seconds should 69 + deliver thousands of events, not single digits. 70 + 71 + If the pod still flaps after this change, the hypothesis is wrong or 72 + incomplete — do **not** proceed to the follow-up fixes until we re-diagnose. 73 + Most likely remaining suspect in that case is the `dp.gc()` mutex hold on 74 + a dataset big enough that even hourly gc takes long enough to trip probes. 75 + Mitigation in that case: temporarily comment out the `dp.gc()` call body to 76 + isolate. 77 + 78 + ## follow-up work (not in this change) 79 + 80 + 1. **Narrow `DiskPersist.mutex` scope in `gc()`**. The mutex protects 81 + `evtbuf`/`outbuf`/`cur_seq`/`current_file_path`/`flushLocked`. Nothing in 82 + gc's DB iteration or per-file unlink genuinely needs that lock. The only 83 + shared state is a read of `current_file_path` to skip the active file. 84 + Plan: do DB discovery and file discovery without the lock; acquire briefly 85 + only to re-check `current_file_path` against each candidate before unlink. 86 + Same treatment for `gcBySize()` and `takeDownUser()`. 87 + 2. **Broadcaster writeLoop polling** (broadcaster.zig:447-453). Real bug — 88 + `cond.signal` at line 413 is a no-op because writeLoop polls with 89 + `io.sleep(100ms)` instead of `cond.wait`. This caps per-consumer drain at 90 + ~10/sec even under zero contention. Fix: use `cond.wait`, schedule pings 91 + via a separate timer fiber or piggyback on next wakeup. 92 + **Do NOT** move writeLoop off Evented to pool_io — commit 6674812 documents 93 + that cross-Io path crashes via `Thread.current()` NULL deref. 94 + 3. **Consider whether `malloc_trim` should ever run on-process**. For a 95 + steady-state relay, the answer is probably no; mmap threshold tuning is a 96 + better choice. 97 + 98 + ## code pointers 99 + 100 + - main.zig:467-510 — `gcLoop` 101 + - event_log.zig:977-1033 — `DiskPersist.gc` 102 + - event_log.zig:1036-1099 — `DiskPersist.gcBySize` 103 + - event_log.zig:864-899 — `DiskPersist.persist` (hot path, same mutex) 104 + - broadcaster.zig:439-477 — `Consumer.writeLoop` (secondary bug, not fixed here) 105 + - commit 3dc21b9 — "fix gcLoop: silently exited after one tick" (the fix that 106 + unmasked this) 107 + - commit 6674812 — "fix SIGSEGV: plain threads calling Evented Io.Mutex" 108 + (cross-Io landmine, read before touching Consumer)

+23 -6

src/main.zig

··· 465 465 } 466 466 467 467 fn gcLoop(dp: *event_log_mod.DiskPersist) void { 468 - const gc_interval_s: u64 = 10 * 60; // 10 minutes 468 + // gc cadence: 1 hour (was 10 min before 2026-04-09 incident). DiskPersist.gc() 469 + // currently holds DiskPersist.mutex for its entire duration — covering DB 470 + // iteration and per-file unlinks — which blocks persist() on every frame 471 + // worker. until the mutex scope is narrowed (follow-up), run hourly to 472 + // bound blast radius. see docs/zlay-gcloop-stall-2026-04-09.md. 473 + const gc_interval_s: u64 = 60 * 60; // 1 hour 469 474 while (!shutdown_flag.load(.acquire)) { 470 475 // sleep in 1s ticks so shutdown is checked frequently. uses 471 476 // std.c.nanosleep directly — this is a plain OS thread, so calling ··· 480 485 } 481 486 if (shutdown_flag.load(.acquire)) return; 482 487 488 + // time the gc call so the next incident log tells us whether gc itself 489 + // or something around it is the stall. plain-thread context — use 490 + // clock_gettime directly rather than Io.Timestamp. 491 + var ts_start: std.c.timespec = undefined; 492 + _ = std.c.clock_gettime(.MONOTONIC, &ts_start); 483 493 dp.gc() catch |err| { 484 494 log.warn("event log GC failed: {s}", .{@errorName(err)}); 485 495 }; 496 + var ts_end: std.c.timespec = undefined; 497 + _ = std.c.clock_gettime(.MONOTONIC, &ts_end); 498 + const elapsed_ns: i64 = (@as(i64, ts_end.sec) - @as(i64, ts_start.sec)) * std.time.ns_per_s + 499 + (@as(i64, ts_end.nsec) - @as(i64, ts_start.nsec)); 500 + log.info("gc: dp.gc complete in {d}ms", .{@divTrunc(elapsed_ns, std.time.ns_per_ms)}); 486 501 487 - // return freed pages to OS (glibc-specific, no-op on other platforms) 488 - if (comptime malloc_trim) |trim| { 489 - _ = trim(0); 490 - log.info("gc: malloc_trim complete", .{}); 491 - } 502 + // NOTE: malloc_trim(0) disabled 2026-04-09. with MALLOC_ARENA_MAX=4 it 503 + // walks free lists holding per-arena locks, which on a ~1.5 GiB RSS 504 + // process can stall every allocator user (including the Evented http 505 + // fiber serving /metrics and /_readyz) long enough to flunk liveness 506 + // probes. if memory reclamation becomes an issue, prefer tuning 507 + // MALLOC_MMAP_THRESHOLD_ or calling trim from a dedicated maintenance 508 + // window rather than on a hot process. 492 509 } 493 510 } 494 511

Configure Feed

Configure Feed