gcLoop: disable malloc_trim, bump interval 10min→1h, add timing

diagnosis of the 2026-04-09 ~10-minute pod-flap cycle. the bbba92c
pod pattern (healthy ~10 min → /metrics and /_readyz stuck → NotReady
→ restart → repeat) lines up exactly with gcLoop's 10-min cadence.
two separable suspects inside gcLoop, either alone sufficient to
flunk probes:
1. dp.gc() holds DiskPersist.mutex for its entire duration (DB
iteration + per-file unlinks, event_log.zig:977-1033). every
frame worker blocks on persist() during gc. this alone would explain
the earlier "0.035 events/sec to consumers" measurement.
2. malloc_trim(0) on a ~1.5 GiB RSS process with MALLOC_ARENA_MAX=4.
glibc holds per-arena locks during the free-list walk, stalling
every allocator caller — including the Evented fiber serving
/metrics and /_readyz. long enough to trip probe timeouts.

this is a stabilization commit, not a root-cause fix:
- disable malloc_trim(0) entirely (comment preserved). prefer
MALLOC_MMAP_THRESHOLD_ tuning or an out-of-band maintenance window
if reclaim becomes an issue.
- bump gc_interval_s 10 min → 1 hour. bounds blast radius of the
persist-mutex hold until gc() is properly narrowed.
- add clock_gettime(.MONOTONIC) timing around dp.gc(). next incident
tells us whether dp.gc() itself or something adjacent is the stall.
- new doc: docs/zlay-gcloop-stall-2026-04-09.md with the hypothesis,
code pointers, validation plan, and follow-up work list (mutex
narrowing; broadcaster writeLoop polling as a separate bug).
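
the timing bullet above can be sketched roughly as follows. this is a minimal illustration in C, not the Zig code from this commit: timed_gc, gc_fn, and sleepy_gc are hypothetical names, sleepy_gc only stands in for dp.gc(), and the log line format is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Current reading of the monotonic clock in nanoseconds. CLOCK_MONOTONIC is
 * not affected by wall-clock adjustments, so differences are safe durations. */
static uint64_t monotonic_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* keep the variable used when NDEBUG strips the assert */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical callback type standing in for dp.gc(). */
typedef void (*gc_fn)(void *ctx);

/* Run one gc pass, log how long it took, and return the duration in ns. */
static uint64_t timed_gc(gc_fn gc, void *ctx) {
    uint64_t start = monotonic_ns();
    gc(ctx);
    uint64_t elapsed = monotonic_ns() - start;
    fprintf(stderr, "gcLoop: gc took %llu ms\n",
            (unsigned long long)(elapsed / 1000000ull));
    return elapsed;
}

/* Stand-in gc that just sleeps for *ctx milliseconds (illustration only). */
static void sleepy_gc(void *ctx) {
    long ms = *(const long *)ctx;
    struct timespec d = { .tv_sec = ms / 1000,
                          .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&d, NULL);
}
```

a logged duration near the probe timeout would point at dp.gc() itself; a short one would point at something adjacent.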

3dc21b9 (2026-04-06 "fix gcLoop: silently exited after one tick")
is what unmasked this — before that fix gcLoop ran once and died,
so malloc_trim + gc ran exactly once per pod lifetime. after that
fix they run every 10 min, which is when zlay started flapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>