commits
reverts all four host retention commits that caused production
restart loops. the interaction between reconciliation, dormant
logic, startup jitter, and cold-start ramp with ~2,800 hosts
was not testable via unit tests and each deploy regressed
relay-eval coverage (alternating 0%/97% from kubelet kills).
back to 80eca78 behavior: exhausted hosts stop after 15 failures,
cron handles re-discovery. stable baseline for 24+ hours at 97-99%.
the feature needs a local test harness that validates startup ramp
behavior against a realistic host table before any production deploy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all ~2,800 subscriber threads were starting DNS+TLS handshakes
simultaneously during startup, starving the HTTP server thread and
causing kubelet to kill the pod on liveness probe timeout.
each subscriber now sleeps a deterministic jitter (0-30s, based on
host_id hash) before its first connection attempt. threads still
spawn quickly (50/batch, 100ms yield) but actual handshakes are
spread across a 30-second window instead of hitting all at once.
jitter only applies to startup — requestCrawl and reconciliation
spawn workers with zero jitter since they're one-at-a-time.
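a minimal sketch of the jitter shape (helper name and hash choice are
assumptions, not the actual zlay code):

    const std = @import("std");

    /// deterministic per host: hash the id and spread first connects
    /// across a 0-30s window.
    fn startupJitterNs(host_id: u64) u64 {
        const h = std.hash.Wyhash.hash(0, std.mem.asBytes(&host_id));
        return (h % 30_000) * std.time.ns_per_ms;
    }

    // in the worker, before the first connection attempt:
    //     std.Thread.sleep(startupJitterNs(host_id));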
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the previous change kept dormant subscriber threads running forever,
meaning thread count could only go up. dormant now correctly stops
the worker thread (freeing resources) while preserving the DB row
for discovery to re-activate later. reconciliation loop queries only
active hosts — dormant hosts wait for requestCrawl.
separated "don't forget the host" (DB row persists) from "don't stop
the thread" (thread exits on dormancy). removed unused
listReconnectableHosts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Io.Future(void).wait() doesn't exist — futures use .await(io) and
are not threadsafe across fibers. replace startup_future.wait() with
a simple initial sleep (the first reconciliation pass runs 5 min
after startup anyway, well after spawnWorkers finishes).
also use 1-second sleep increments for shutdown responsiveness,
matching the subscriber backoff pattern.
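shape of the incremental sleep, as a sketch (the shutdown flag is an
assumption about how the loop is signalled):

    const std = @import("std");

    /// sleep `interval_s` in 1-second slices so a shutdown request is
    /// noticed within a second instead of at the end of the interval.
    fn sleepResponsive(interval_s: u64, shutdown: *std.atomic.Value(bool)) void {
        var remaining = interval_s;
        while (remaining > 0) : (remaining -= 1) {
            if (shutdown.load(.acquire)) return;
            std.Thread.sleep(std.time.ns_per_s);
        }
    }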
caught by `zig build` (exe target) — `zig build test` misses this
due to lazy analysis when no test references the code path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
subscribers now retry with exponential backoff capped at 30 min
(was 60s cap with hard kill at 15 failures). on successful connect,
backoff resets to 1s and host flips back to active. hosts that fail
15+ consecutive times are marked dormant (observable) but the
subscriber keeps retrying. a reconciliation loop every 5 min
respawns any active/dormant host missing from the worker map.
this eliminates dependence on the external reconnect cron for host
retention — it can be reduced to discovery-only.
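the retry curve, as a sketch (assumes a per-subscriber consecutive-failure
counter; not the actual code):

    const std = @import("std");

    /// 1s, 2s, 4s, ... doubling per consecutive failure, capped at 30 min.
    /// reset the counter on a successful connect.
    fn backoffNs(consecutive_failures: u32) u64 {
        const base_ns: u64 = std.time.ns_per_s;
        const max_ns: u64 = 30 * std.time.ns_per_min;
        const shift: u6 = @intCast(@min(consecutive_failures, 11));
        return @min(base_ns << shift, max_ns);
    }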
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
now that Backend is Io.Threaded, remove references to cross-Io
constraints, Evented fiber context requirements, and Uring thread
warnings that no longer apply. the historical context is preserved
in docs/evented-attempt.md and docs/notes.md.
no behavioral changes — comments only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ReleaseSafe caught an unreachable panic in posix.setsockopt when a
ConsumerTooSlow kick raced with the websocket server's readLoop thread.
the race: dropSlowConsumer called conn.close() from the broadcast thread
while readLoop (server thread) was about to call setsockopt on the same
socket fd. setsockopt got EBADF, zig stdlib hit unreachable → panic.
under ReleaseFast this was silent undefined behavior — likely existed on
every prior build.
fix: move conn.close() from dropSlowConsumer (broadcast thread) to the
end of writeLoop (consumer's own thread). writeLoop exits when alive is
set to false, drains remaining frames, then closes the connection. this
unblocks readLoop's pending read without racing on socket state.
also bump BUFFER_CAP 8192 → 65536 to reduce ConsumerTooSlow frequency
(cherry-pick of the ee4e368 change onto the current tree).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the Evented (io_uring fiber) backend has been the source of every major
issue since the 0.16 migration: 8 cross-Io crash classes, a ReleaseSafe
GPF from a zig codegen bug, and a persistent ~10-15% coverage degradation
that nobody could trace. the zig team marks Evented as experimental.
Io.Threaded restores thread-per-PDS (~2,800 OS threads instead of ~35
fibers), which was the proven model on 0.15 at 99%+ coverage. the entire
cross-Io problem class vanishes. ReleaseSafe works again. DNS works
natively. the uring networking patch becomes inert.
one-line change: const Backend = Io.Evented → Io.Threaded.
all io.concurrent() call sites, Io.Future, Io.Mutex, Io.Condition are
backend-agnostic through the std.Io abstraction. pool_io becomes
redundant but harmless (both runtimes are now Threaded).
builds clean: zig build test, zig fmt, and
zig build -Dtarget=x86_64-linux-gnu -Doptimize=ReleaseSafe all pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the image atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b was built
with zat alpha.21 via a locally-modified build.zig.zon on the Hetzner
build server that was never committed back. this commit reproduces that
state on top of b91382b so the canary behavioral delta vs production is
exactly one commit (1eec324, the FrameWork hostname UAF fix).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FrameWork.hostname was a borrowed slice from sub.options.hostname,
documented as "stable lifetime". it isn't: slurper.runWorker frees
sub.options.hostname after sub.run() returns, but FrameWorks for
that subscriber may still be queued in the frame pool. once the
allocator reuses that memory, pool workers read garbage when
logging chain breaks, host authority decisions, etc.
repro: zlay-reconnect cronjob spawns ~1839 hosts in 134s. some
subscribers churn within that window. corrupted hostnames appear
in logs as DIDs (the freed slot got reused for a DID dup) or with
stack-pointer-shaped bytes overlaying the suffix.
fix: dupe hostname alongside data when submitting to the pool,
free both in processFrame. one extra alloc/free per frame.
forward-only rewind: every commit on main between b91382b and 4f3d1d4
has been superseded or is suspected of being implicated in the
2026-04-09 HTTP / delivery outage. rather than force-pushing history
backward, this commit creates a new snapshot whose tree matches b91382b
exactly, parented on 4f3d1d4. git pull --ff-only continues to work.
the superseded commits remain in ancestry and can be referenced via
the ops-changelog:
- 4f3d1d4 gcLoop: disable malloc_trim, bump interval 10min→1h
- 795cc41 host_authority: slot recovery + pool metrics + preload account count
- bbba92c fix build: drop unused err1 capture in resolveHostAuthority
- 584571a disable keep_alive on host authority resolver pool + log resolve errors
- ee4e368 bump per-consumer buffer 8192→65536 + host_authority reject breakdown
- 31825b2 subscriber: extract prepareFrameWork + add UAF regression test
- 1eec324 fix UAF: dupe FrameWork.hostname per submit (will be re-applied on top)
- 168d9f1 bump websocket.zig + zat: fix requestCrawl POST hang
- fbdffbe mark DB success on did_cache hits
- 3dc21b9 fix gcLoop: silently exited after one tick
- e5f415f update README, CLAUDE.md, Dockerfile for current state
this commit and the two following it (cherry-pick 1eec324 + pin zat
alpha.21) constitute canary 1 per docs/zlay-canary-plan-2026-04-09.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diagnosis of the 2026-04-09 ~10-minute pod-flap cycle. the bbba92c
pod pattern (healthy ~10 min → /metrics and /_readyz stuck → NotReady
→ restart → repeat) lines up exactly with gcLoop's 10-min cadence.
two separable suspects inside gcLoop, either alone sufficient to
flunk probes:
1. dp.gc() holds DiskPersist.mutex for its entire duration (DB
iteration + per-file unlinks, event_log.zig:977-1033). every
frame worker blocks on persist() during gc. this alone explains
the earlier "0.035 events/sec to consumers" measurement.
2. malloc_trim(0) on a ~1.5 GiB RSS process with MALLOC_ARENA_MAX=4.
glibc holds per-arena locks during the free-list walk, stalling
every allocator caller — including the Evented fiber serving
/metrics and /_readyz. long enough to trip probe timeouts.
this is a stabilization commit, not a root-cause fix:
- disable malloc_trim(0) entirely (comment preserved). prefer
MALLOC_MMAP_THRESHOLD_ tuning or an out-of-band maintenance window
if reclaim becomes an issue.
- bump gc_interval_s 10 min → 1 hour. bounds blast radius of the
persist-mutex hold until gc() is properly narrowed.
- add clock_gettime(.MONOTONIC) timing around dp.gc(). next incident
tells us whether dp.gc() itself or something adjacent is the stall.
- new doc: docs/zlay-gcloop-stall-2026-04-09.md with the hypothesis,
code pointers, validation plan, and follow-up work list (mutex
narrowing; broadcaster writeLoop polling as a separate bug).
3dc21b9 (2026-04-06 "fix gcLoop: silently exited after one tick")
is what unmasked this — before that fix gcLoop ran once and died,
so malloc_trim + gc ran exactly once per pod lifetime. after that
fix they run every 10 min, which is when zlay started flapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bundles four changes from the 2026-04-09 external review (relay
docs/zlay-external-review-2026-04-09.md). all four target the two
unresolved problems from the 2026-04-08 incident: ~99% host_authority
pool rejection and HTTP probe starvation during cold-start spawn.
1. preload effective_account_count in listActiveHostsImpl (item 1)
spawnWorker was doing a per-host blocking DbRequest for
getEffectiveAccountCount(host_id) on every host during cold-start
— ~2,770 round-trips through the DbRequestQueue, each yielding the
spawn fiber. fold the COUNT/JOIN/GROUP BY into the existing batch
query so the value is preloaded into Host.effective_account_count.
addHost (one-off requestCrawl path) keeps the inline fetch since
it's not in the cold-start hot path.
2. resolver slot recovery (item 2)
resolveHostAuthority used to retry resolve() on the SAME pool slot
after a failure. if a slot's std.http.Client gets into any kind of
bad state, the retry is wasted and the slot stays bad forever. on
first-attempt failure, deinit + re-init the slot via the new
recycleHostResolver helper before retrying. directly tests the
leading "poisoned slot" hypothesis without making any zig stdlib
claims. dormant in production while keep_alive=false (no persistent
connections to corrupt) but ready for the next canary.
3. pool/loop metrics (item 4)
six new counters/gauges in broadcaster.zig Stats:
host_resolver_acquire_wait_us_total — pool contention timing
host_resolver_in_use — current slot count held
host_resolver_resets_total — slot recovery firings
host_resolver_resolve_fail_total — first-attempt failures
resolve_loop_resolve_ok_total — background loop ok
resolve_loop_resolve_fail_total — background loop fail
the resolve_loop counters reveal something we've been operating
blind on: the background signing-key resolveLoop has been
log.debug+continue on errors, never measured. when these first
ship, whatever number resolve_loop_resolve_fail shows is the
baseline, NOT a regression — that's the whole point.
4. configurable host_resolver_pool_size (item 5)
was const = 4. heap-allocate from start() based on env var
HOST_RESOLVER_POOL_SIZE (default 4, max 64). with keep_alive=false,
pool width is a real startup throughput knob — bumping it lets more
is_new checks run concurrently during reconnect storms. tune from
   ops based on the new acquire_wait_us metric. (env parsing for this
   knob is sketched at the end of this message.)
5. zat dep bumped to v0.3.0-alpha.23 — surfaces the underlying
std.http.Client.fetch error kind through resolver.resolve, so the
existing sampleLogReject("resolve", did, @errorName(err), ...) call
in resolveHostAuthority will print the actual transport error
(UnknownHostName, ConnectionRefused, TlsAlert, etc.) instead of
always DidResolutionFailed.
not in this batch (per reviewer's "do not yet" list):
- spawn batch loop tuning — slim per-host work first, re-measure
- re-enabling keep_alive=true globally — canary first, after this
metrics shipment lets us see what the broken path returns
- splitting liveness onto a dedicated thread — see if probe flap
survives the slimmed startup fiber first
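sketch of item 4's env parsing (default 4, max 64; function name and the
allocator plumbing are assumptions):

    const std = @import("std");

    fn hostResolverPoolSize(alloc: std.mem.Allocator) usize {
        const raw = std.process.getEnvVarOwned(alloc, "HOST_RESOLVER_POOL_SIZE") catch return 4;
        defer alloc.free(raw);
        const n = std.fmt.parseInt(usize, raw, 10) catch return 4;
        return @min(@max(n, 1), 64);
    }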
584571a tried to discard the first-attempt error via `_ = err1` to
document intent, but zig 0.16 rejects that pattern with "error set is
discarded". build failed on the operator's ReleaseFast pipeline and
slipped past my local `zig build test` because the test binary is
lazy — no test references resolveHostAuthority, so zig never analyzed
the function body. `zig build` (the exe) does reach it via
frame_worker.processFrame and trips the error immediately.
just drop the capture. the comment explaining why only err2 is logged
is preserved.
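the shape of the change, as a self-contained toy (flakyResolve stands in
for the real resolver call):

    const std = @import("std");

    fn flakyResolve() !u32 {
        return error.DidResolutionFailed;
    }

    // zig 0.16 rejects `catch |err1| { _ = err1; ... }`; dropping the
    // capture compiles and keeps the retry-once intent.
    fn resolveWithRetry() !u32 {
        return flakyResolve() catch {
            // first-attempt error deliberately unnamed — only the retry's
            // error (err2 in the real code) is surfaced.
            return flakyResolve();
        };
    }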
effectively all host_authority rejects on 2026-04-08 were in the resolve
branch (39,621 / 40,072 over 48 min). plc.directory is reachable from the pod,
cold resolvers in resolveLoop work fine, and websockets to 2785 PDSes
are healthy — isolates the failure to the pooled + long-lived keep_alive
HTTP path. pool was added on 0.15 (1639565) and never re-validated
after the 0.16 migration (9cc1ba3).
workaround: disable keep_alive on the pool. cost is one TLS handshake
per is_new / host_changed DID, which is low-rate enough to absorb.
keep the pool itself for socket churn savings across fiber callers.
also wire sampleLogReject into the resolve and parse_did branches with
@errorName of the resolver error — previous commit incremented counters
for those branches but never logged, so we had no diagnostic data when
the reject rate spiked. if the workaround doesn't fully fix it we now
see the actual error kind without a second redeploy cycle.
the 8192-entry per-consumer ring = ~33s of headroom at 250fps. pulsar's
60-min snapshot accumulated repeated ConsumerTooSlow kicks within that
window (ops_changelog 2026-04-01). bumping to 65536 gives ~4.4min of
headroom — enough to absorb transient write stalls without dropping
consumers mid-run.
checkPdsHost had five silent-reject branches collapsed into one
failed_host_authority counter, which hid the 100% rejection rate
diagnosed 2026-04-08. split into per-branch counters emitted as
relay_host_authority_reject{branch=...} alongside a 1-in-2048 sampled
warn log so we can tell whether the DID doc lookup is failing, the
endpoint is unparseable, the host isn't in our table, or the resolved
host genuinely differs from the incoming host.
follow-up to 1eec324 (fix UAF: dupe FrameWork.hostname per submit).
the dupe-at-submit logic was inline in FrameHandler.onMessage, which
made it hard to regression-test the invariant. extracted a small
Subscriber method that returns a FrameWork with heap-owned data +
hostname, and added a unit test that:
- builds a FrameWork from a to-be-freed hostname buffer
- asserts the returned slices have distinct pointers from the inputs
- simulates slurper.runWorker teardown by freeing the source hostname
- reads the FrameWork.hostname again — would trip the testing
allocator's use-after-free detection if the dupe was elided
no behavior change at the submit site. tested via zig build test.
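the test's shape, roughly (FrameWork and prepareFrameWork here are
stand-ins, not the real definitions):

    const std = @import("std");

    const FrameWork = struct { data: []u8, hostname: []u8 };

    fn prepareFrameWork(alloc: std.mem.Allocator, data: []const u8, hostname: []const u8) !FrameWork {
        const data_copy = try alloc.dupe(u8, data);
        errdefer alloc.free(data_copy);
        return .{ .data = data_copy, .hostname = try alloc.dupe(u8, hostname) };
    }

    test "FrameWork owns its hostname after the source is freed" {
        const alloc = std.testing.allocator;
        const src = try alloc.dupe(u8, "example.host.test");
        const fw = try prepareFrameWork(alloc, "frame-bytes", src);
        defer alloc.free(fw.data);
        defer alloc.free(fw.hostname);
        try std.testing.expect(fw.hostname.ptr != src.ptr);
        alloc.free(src); // simulate slurper.runWorker teardown
        try std.testing.expectEqualStrings("example.host.test", fw.hostname);
    }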
FrameWork.hostname was a borrowed slice from sub.options.hostname,
documented as "stable lifetime". it isn't: slurper.runWorker frees
sub.options.hostname after sub.run() returns, but FrameWorks for
that subscriber may still be queued in the frame pool. once the
allocator reuses that memory, pool workers read garbage when
logging chain breaks, host authority decisions, etc.
repro: zlay-reconnect cronjob spawns ~1839 hosts in 134s. some
subscribers churn within that window. corrupted hostnames appear
in logs as DIDs (the freed slot got reused for a DID dup) or with
stack-pointer-shaped bytes overlaying the suffix.
fix: dupe hostname alongside data when submitting to the pool,
free both in processFrame. one extra alloc/free per frame.
websocket.zig 3c6794a fixes Handshake.parse hanging on POST requests
with bodies. The previous endsWith("\r\n\r\n") check only matched
header-only GETs, so any POST handed to httpFallback (e.g.
requestCrawl) caused parse to return null indefinitely and the worker
to read the same data forever until the connection's idle timeout.
zat alpha.22 carries the same websocket bump so the transitive
dependency resolves to a single module (otherwise zig links both
hashes and errors with `file exists in modules 'build' and 'build0'`).
Symptom: zlay-reconnect cronjob has been failing for ~12h, never
re-announcing the ~2,950 PDS hosts from atproto-scraping.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
isDbHealthy() is a 30s freshness check on last_db_success, but the
markDbSuccess() call sites only fire on cache misses + the 10-min GC
tick. with a hot did_cache in steady state, miss rate can dip for 30+
seconds, leaving the health flag stale and tripping k8s liveness probes
even though the relay is healthy.
cache hits use DB-derived data, so they're a valid signal that the
data path is functioning. mark success on the fast path. cost is one
clock_gettime + atomic store per ingestion event (~20 ns vDSO).
the GC tick still provides real DB liveness as a backstop.
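the health signal, as a sketch (the real fields live on DiskPersist;
names assumed):

    const std = @import("std");

    var last_db_success = std.atomic.Value(i64).init(0);

    /// called from the did_cache hit path and from the GC tick.
    fn markDbSuccess() void {
        last_db_success.store(std.time.timestamp(), .monotonic);
    }

    /// 30s freshness check read by the liveness handler.
    fn isDbHealthy() bool {
        return std.time.timestamp() - last_db_success.load(.monotonic) < 30;
    }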
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gcLoop was using io.sleep on pool_io (Threaded) from a plain std.Thread.
the first tick happened to succeed, the second hit an error path, and
catch return swallowed it — silently killing the loop. one malloc_trim
fired at the 10-min mark and then nothing for 13.5+ hours.
fix: switch to std.c.nanosleep directly. plain threads can't safely
call into Io scheduler primitives, even on the matching backend, because
they aren't registered with that backend's scheduler.
drop the io parameter from gcLoop since dp.gc() uses its own bound io
internally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: document Evented backend, cross-Io architecture, DbRequestQueue,
link to devlog 008, fix zat dep link, note ReleaseFast requirement
- CLAUDE.md: Evented not Threaded, ReleaseFast not ReleaseSafe
- Dockerfile: fix build flag to ReleaseFast (comments said ReleaseFast
but flag was still ReleaseSafe)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM register allocator bug: under ReleaseSafe, the stack probe and
canary instrumentation cause LLVM to skip materializing the SwitchMessage
address into %rsi before fiber.zig's inline asm context switch. %rsi is
left holding a stale value from Thread.current(), causing a GPF.
Debug, ReleaseFast, and ReleaseSmall all pass. Only ReleaseSafe triggers
it — the combination of optimization + safety instrumentation changes
the code layout enough to expose the miscompilation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Evented + ReleaseSafe GPFs immediately on startup in fiber.zig
contextSwitch — confirmed in production deploy, same as the repro.
this was already documented in devlog 008 and stdlib-patches.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the SIGSEGV that prompted the Threaded revert was a TCP split mid-CRLF
in the websocket handshake reader (fixed in 9ac64da), not a fiber
context-switch issue. re-enabling Evented + ReleaseSafe to see it through.
if the repro GPF (scripts/repro_evented.zig) hits production code paths,
fallback is ReleaseFast or one-line flip back to Threaded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
websocket client read() panics when TCP delivers \r at buffer boundary
without \n — line_start advances past pos, causing start > end slice.
under ReleaseFast this was the silent SIGSEGV every 30-90 min.
websocket.zig 9ac64da, zat v0.3.0-alpha.17.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Io.Evented (io_uring fibers) has a probabilistic SIGSEGV in
std.Io.fiber.contextSwitch that crashes every 30-90 min under load.
After 28 commits fixing cross-Io crashes, heap corruption, and mutex
incompatibilities, this stdlib bug is the remaining blocker — and
upstream fiber.zig is unchanged as of dev.3091 with no fix in sight.
Switching to Io.Threaded restores the thread-per-PDS model (~2,800
threads, stable) and lets us use ReleaseSafe again. The Evented work
is preserved in patches/, scripts/repro_evented.zig, and the new
docs/evented-attempt.md for when upstream catches up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
some PDSes omit the tooBig field from commit events. the lexicon marks it
required, and consumers like hydrant reject frames without it.
resequenceFrame now detects #commit frames and injects tooBig: false when
the field is missing. non-commit frames are unaffected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the Evented pg.Pool (ev_db) approach was broken — io_uring netLookup is
unimplemented upstream, so no DNS and no outbound TCP from Evented fibers.
this replaces ev_db with a DbRequestQueue (MPSC FIFO ring buffer) that
routes general DB traffic through pool_io (Threaded) worker threads:
- add DbRequest + DbRequestQueue to event_log.zig (4096-slot ring,
spinlock push, CAS pop, 2 worker threads)
- convert xrpc/admin handlers to typed DbRequest structs with
@fieldParentPtr callbacks that write JSON into stack buffers
- convert slurper: pullHosts on own std.Thread (parallel with
spawnWorkers), addHost phased (DB via queue, HTTP via temp thread)
- convert broadcaster firstSeq to DbRequest
- convert backfiller/cleaner from Io.Future to std.Thread + direct
persist.db access, with cooperative shutdown checks
- remove all *Ev methods and ev_db infrastructure from DiskPersist
- explicit join paths for backfiller/cleaner/db-workers on shutdown
DbRequest.wait() never returns before done — preserves stack lifetime
of caller-embedded request structs during shutdown drain.
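the callback pattern, sketched (type and field names are approximations
of the real DbRequest machinery):

    const std = @import("std");

    const DbRequest = struct {
        run: *const fn (*DbRequest) void,
        done: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),

        /// never returns before done — keeps caller-embedded requests
        /// valid on the caller's stack during shutdown drain.
        fn wait(req: *DbRequest) void {
            while (!req.done.load(.acquire)) std.Thread.yield() catch {};
        }
    };

    const AccountCountRequest = struct {
        base: DbRequest = .{ .run = &run },
        host_id: u64,
        count: u64 = 0,

        fn run(base: *DbRequest) void {
            const self: *AccountCountRequest = @fieldParentPtr("base", base);
            // worker thread: run the pg query here, write into the
            // caller-owned struct, then publish completion.
            self.count = 0; // placeholder for the real query result
            self.base.done.store(true, .release);
        }
    };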
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ensureEvDb() used @panic on initUri failure, so a transient postgres
hiccup during lazy init crashed the entire relay. it now returns !*pg.Pool
and resets state to uninit on failure so the next call retries.
callers handle the error gracefully:
- xrpc handlers: respond 503 ServiceUnavailable
- slurper: skip host on db error (safe default)
- broadcaster firstSeq: fall back to memory history
- admin: use 0 for account count on error
- backfiller/cleaner: log and bail from current run
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pg.Pool.initUri does TCP connects via io_uring, which requires the Evented
event loop to be running. creating it during main() init (before the
scheduler starts) fails with NetworkDown — io_uring can't submit ops yet.
fix: store ev_io/db_url/pool_size config on DiskPersist, create the pool
on first use via ensureEvDb(). uses CAS-based init-once — first fiber to
call it creates the pool, concurrent fibers yield-wait via ev_io.sleep().
also changes backfiller/cleaner from storing *pg.Pool to *DiskPersist,
accessing the pool lazily via self.persist.ensureEvDb().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the relay SIGSEGV'd at ~396 hosts during startup — DiskPersist's pg.Pool
was created on Threaded Io, but ~40 call sites across slurper, API handlers,
broadcaster, backfiller, and cleaner called it from Evented fibers, triggering
Thread.current() on a NULL threadlocal.
Part A — dual pg.Pool:
add ev_db (Evented pg.Pool) to DiskPersist. pure DB methods get *Impl(db)
internal implementations + thin *Ev wrappers. Evented callers use ev_db;
pool_io callers keep using self.db. 13 methods refactored.
Part B — Threaded service paths:
- admin ban: uidForDidEv on Evented side, takedown routed through host_ops
queue (fire-and-forget). worker executes takeDownUser + persist + broadcast
on pool_io under persist_order.
- playback: cross-Io request/reply via MPSC Treiber stack. Evented fiber
posts PlaybackRequest, pool_io worker executes playback() under mutex,
sets done atomic. fiber spin-waits on failure to prevent use-after-free
on stack-local request.
persist_order held through full persist → resequence → broadcast_queue.push()
to guarantee insertion order matches seq assignment order.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move resequenceFrame, heap dupe, and broadcast_queue.push() outside
the ordering lock. persist_order now covers only the DB persist call
and seq store — the minimum needed for monotonic sequence assignment.
this eliminates the cascade where producers spin on persist_order
while another producer is blocked in a full broadcast_queue.push().
slight out-of-order in the ring is acceptable — seq is embedded in
frame data and consumers/history track by seq.
metrics showed persist_order_spins_total dominating at ~1,100 hosts
(548M spins) while push_lock_spins was zero — confirming the critical
section width was the bottleneck.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
separates push-lock contention from queue-full contention so operator
can distinguish: producers fighting for the CAS lock vs producers
blocked on a full ring buffer.
new metric: relay_broadcast_queue_push_lock_spins_total
clarified: relay_broadcast_queue_full_total HELP text (ring capacity)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
instrumentation for the ~1,300 host CPU cliff:
- relay_persist_order_spins_total: spin iterations on the ordering lock
- relay_broadcast_queue_full_total: spin iterations on full broadcast queue
- relay_broadcast_queue_depth_hwm: high-water mark of queue depth
- relay_broadcast_no_consumers_total: frames that skipped SharedFrame alloc
zero-consumer fast path: when no consumers are connected, broadcast()
returns after history.push() without allocating SharedFrame or taking
consumers_mutex. saves one heap alloc + one mutex per frame.
also includes the cursor coalesce fix (CursorMap) and slot reuse
(free list with unregister on subscriber exit) from previous commits.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the host_ops MPSC queue was processing every cursor flush (~348/sec at
1,391 hosts) as an individual DB write. when the queue filled, Evented
producers busy-spun — causing 5.4 cores sustained CPU.
split into two mechanisms:
- CursorMap: subscribers atomically store latest seq (one store, no lock).
worker thread sweeps every 5s, batch-flushes only changed cursors.
- HostOpsQueue: kept for rare ops only (failures, status updates).
also adds slot reuse via free list — unregister on subscriber exit
prevents slot exhaustion from host churn.
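the subscriber-side store and the sweep, sketched (slot layout assumed):

    const std = @import("std");

    const CursorSlot = struct {
        host_id: u64 = 0,
        latest_seq: std.atomic.Value(i64) = std.atomic.Value(i64).init(0),
        flushed_seq: i64 = 0, // only touched by the sweeper thread
    };

    /// subscriber side: one atomic store, no lock, no queue slot.
    fn recordCursor(slot: *CursorSlot, seq: i64) void {
        slot.latest_seq.store(seq, .monotonic);
    }

    /// sweeper thread, every 5s: batch-flush only the cursors that moved.
    fn sweep(slots: []CursorSlot) void {
        for (slots) |*slot| {
            const seq = slot.latest_seq.load(.monotonic);
            if (seq != slot.flushed_seq) {
                slot.flushed_seq = seq;
                // add (slot.host_id, seq) to the batched DB write here
            }
        }
    }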
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- NOTES.md → docs/notes.md
- repro_evented.zig → scripts/repro_evented.zig
- update references in Dockerfile, main.zig, stdlib-patches.md
- delete stale build artifacts (check_mutex.o, repro_evented binary)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
subscriber fibers (Evented) were calling DiskPersist methods that acquire
pg.Pool connections via Threaded futex — NULL Thread.current() on Evented
fibers caused heap corruption (~16min crash cycle).
add MPSC host_ops queue (atomic spinlock, same pattern as BroadcastQueue):
- subscriber pushes ops instead of calling dp.* directly
- single background thread (std.Thread on pool_io) pops and executes
- covers: cursor flush (~450/s), failure tracking, status updates
- cursor loaded at spawn time (slurper passes last_seq through)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
covers the Uring networking patch, DNS fallback, ReleaseSafe GPF,
debug_io override, Io.Event single-waiter, and cross-Io Mutex issues.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- keep gc_thread handle and join it during shutdown before dp.deinit() runs —
dp is stack-owned, detaching left a use-after-free window
- add markDbSuccess() call at end of gc() so the health signal isn't solely
dependent on uidForDid (event ingestion path)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- move GC loop from io.concurrent() (Evented fiber) to std.Thread.spawn() with pool_io (Threaded)
— dp.gc() takes Threaded mutex + queries pg.Pool, which dereferences NULL Thread.current()
threadlocal when called from Evented context → heap corruption / SIGSEGV
- replace direct pg.Pool "SELECT 1" health checks with atomic last_db_success timestamp
— metrics server and API router both run on Evented, pg.Pool runs on Threaded
— isDbHealthy() reads an atomic set by Threaded workers, safe from any Io context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
broadcast() removed dead consumers from the list AND called shutdown()+destroy(),
but Handler.close() still held the pointer and later called removeConsumer() on
freed memory. Now broadcast() only unlinks from the list — removeConsumer() is
the sole owner of shutdown + destroy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the previous fix (6674812) correctly identified that Evented Io.Mutex
from a plain thread causes SIGSEGV, but the fix used Threaded futex
from within an Evented fiber. Threaded futexWait blocks the Uring OS
thread, preventing it from processing io_uring completions for other
fibers — deadlocking the event loop during CA bundle scan.
fix: the resyncer now runs entirely on pool_io (Threaded) via a plain
std.Thread. no Evented io involvement at all. this is correct because:
- enqueue() from frame workers: Threaded futex on plain thread ✓
- dequeue() in worker: Threaded futex on plain thread ✓
- HTTP client: blocking I/O on plain thread ✓
the fundamental constraint: Io.Mutex cannot be shared across Io types.
Threaded futex on Evented fiber → blocks Uring thread → deadlock.
Evented futex on plain thread → NULL threadlocal → SIGSEGV.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
root cause: Resyncer.enqueue() called from frame worker threads (plain
std.Thread) used Evented Io for mutex/cond ops. when contended,
futexWaitUncancelable enters the Uring fiber scheduler which calls
Thread.current() — a threadlocal only set on Uring threads. on plain
threads it's null, and ReleaseFast silently dereferences NULL at struct
field offsets (0x28, 0x30, 0x38) → SIGSEGV.
fix: add queue_io (pool_io/Threaded) to Resyncer for cross-thread
synchronization. Evented io kept for HTTP client and fiber spawning.
also fixes two consumer bugs:
- dropSlowConsumer spawned plain std.Thread that called Evented future
cancel → same NULL deref class. removed cleanup thread, deferred
destruction to Handler.close → removeConsumer.
- removeConsumer unconditionally decremented consumer count after
dropSlowConsumer already did → double-decrement. now only decrements
when consumer is found in the list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spawning ~2,800 TLS handshakes simultaneously starves the single
Evented event loop — health checks on both :3000 and :3001 time out,
liveness probe kills the pod. batch-spawn with io.sleep yields between
batches so the event loop stays responsive during ramp.
STARTUP_BATCH_SIZE env var (default 50), 100ms yield between batches.
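the ramp, as a self-contained toy (spawnOne and the plain sleep stand in
for the real spawn call and the io.sleep yield):

    const std = @import("std");

    fn rampUp(total: usize, batch_size: usize, spawnOne: *const fn (usize) void) void {
        var i: usize = 0;
        while (i < total) : (i += 1) {
            spawnOne(i);
            if ((i + 1) % batch_size == 0) {
                // stand-in for the io.sleep yield back to the event loop
                std.Thread.sleep(100 * std.time.ns_per_ms);
            }
        }
    }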
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IORING_OP_BIND and IORING_OP_LISTEN require kernel 6.11+. Our server
is on 6.8 — the kernel returns EINVAL for unknown opcodes, surfacing
as error.Unexpected. Use direct linux.bind()/linux.listen() syscalls
instead (instant, non-blocking — same approach as getsockname).
IORING_OP_ACCEPT (5.5), IORING_OP_CONNECT (5.5), IORING_OP_SOCKET
(5.19), and IORING_OP_SENDMSG/RECVMSG/READV (5.3-5.6) all work fine.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements 6 io_uring network vtable entries (listen, accept, connect,
read, write, send) that are stubbed as Unavailable upstream (zig#31723).
The patch is applied at build time inside the Docker container only —
it modifies the zig stdlib bundled in the container image, not the host
zig installation or any downstream consumer of zat/websocket.zig.
DNS (netLookup) is deliberately NOT patched — subscribers resolve
hostnames through pool_io (Threaded) and pass the connected stream
to the websocket client for TLS + framing via Evented io.
Dep bumps:
- websocket.zig 80c6434: initWithStream() respects config.tls
(was hardcoded to null). Non-breaking — existing callers default
to tls=false and get identical behavior.
- zat v0.3.0-alpha.16: picks up the websocket bump so both zlay
and zat resolve to the same websocket version.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Io.Uring fiber context-switch GPFs under ReleaseSafe on x86_64-linux.
Reproduced with a 30-line minimal program (repro_evented.zig):
- Debug (safety ON, opts OFF): passes
- ReleaseFast (safety OFF, opts ON): passes
- ReleaseSmall (safety OFF, opts ON): passes
- ReleaseSafe (safety ON, opts ON): GPF in fiber.zig contextSwitch
This is a zig 0.16-dev stdlib bug (confirmed in both dev.3059 and
dev.3066) — the LLVM optimizer miscompiles safety-checked code in
the fiber context switch path.
Fix: build with ReleaseFast instead of ReleaseSafe. Backend stays
on Io.Evented (fibers on io_uring), eliminating thread-per-PDS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flip backend from Io.Threaded to Io.Evented. frame workers run on a
dedicated Io.Threaded (pool_io) for CPU-heavy decode/validate, then
hand off to the broadcaster fiber via an MPSC ring buffer.
key design: one publication path for all relay-sequenced events.
workers, subscriber inline, and admin all use the same pattern:
persist_order spinlock → dp.persist → resequence → queue.push.
the broadcaster fiber drains FIFO (= seq order) and fans out.
persist_order is an atomic spinlock (not Io.Mutex) so both Threaded
workers and Evented admin can participate — Io.Mutex futex
implementations are incompatible across domains.
- DiskPersist initialized with pool_io (workers are its callers)
- BroadcastQueue: lossless spin-wait push (matches Indigo semantics)
- admin ban: same ordered path as workers, no direct broadcast()
- pool_io wired to HttpContext for future cross-domain use
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pingLoop swallowed error.Canceled from io.sleep(), preventing
ping_future.cancel() from stopping the task before client.deinit()
freed the stream/TLS buffers. next writeFrame hit freed memory → GPF.
two changes:
- io.sleep() catch {} → catch return (cancellation-cooperative)
- check client.isClosed() before writeFrame (defense-in-depth)
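the loop shape, as a toy (sleepFn stands in for io.sleep, which returns
error.Canceled when the future is cancelled):

    fn pingLoop(sleepFn: *const fn () anyerror!void, sendPing: *const fn () void) void {
        while (true) {
            sleepFn() catch return; // was `catch {}`, which swallowed Canceled
            sendPing();
        }
    }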
also bumps zat to v0.3.0-alpha.11 and websocket.zig to 104608b.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zat alpha.10: pins websocket with httpFallback fix
websocket 4222f98: dispatch non-upgrade HTTP to httpFallback handler,
restoring /_healthz and /_readyz on port 3000
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
documents the three deployment crashes and their fixes, the unresolved
health probe issue (httpFallback not wired in websocket server), where
all repos/files/env vars live, and what needs to happen next.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixes GPF in websocket client write path: concurrent writes from ping
loop, auto-pong, and close were interleaving frame headers/payloads on
the shared stream, corrupting memory during memcpy in Writer.zig.
websocket.zig 0261b7d adds _write_lock: Io.Mutex to the client, matching
the server-side Conn.lock pattern. zat alpha.9 pins the same websocket
commit, resolving the diamond dependency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pg.zig pool used Io.Event with reset() for connection-available signaling.
Io.Event.reset() assumes no pending waiters — violated when 16 frame
workers contend for 5 connections. Updated pg.zig replaces Event with a
monotonic futex counter (safe for any number of concurrent waiters).
Also:
- make DB pool size configurable via DB_POOL_SIZE env (default 20)
- previous hardcoded 5 guaranteed constant contention with 16 workers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evented (io_uring) futexWait/Wake dispatch through fiber-local
Thread.current() state. Frame pool workers are plain std.Thread —
they lack that state, so Io.Mutex contention segfaults immediately.
Threaded backend uses direct kernel futex syscalls that work from
any execution context. io.concurrent still spawns real OS threads
(same concurrency model as 0.15, just 0.16 APIs).
Evented can be revisited once frame workers either move to
io.concurrent or cross-boundary mutexes get a dedicated Threaded
sync_io.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
reverts all four host retention commits that caused production
restart loops. the interaction between reconciliation, dormant
logic, startup jitter, and cold-start ramp with ~2,800 hosts
was not testable via unit tests and each deploy regressed
relay-eval coverage (alternating 0%/97% from kubelet kills).
back to 80eca78 behavior: exhausted hosts stop after 15 failures,
cron handles re-discovery. stable baseline for 24+ hours at 97-99%.
the feature needs a local test harness that validates startup ramp
behavior against a realistic host table before any production deploy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all ~2,800 subscriber threads were starting DNS+TLS handshakes
simultaneously during startup, starving the HTTP server thread and
causing kubelet to kill the pod on liveness probe timeout.
each subscriber now sleeps a deterministic jitter (0-30s, based on
host_id hash) before its first connection attempt. threads still
spawn quickly (50/batch, 100ms yield) but actual handshakes are
spread across a 30-second window instead of hitting all at once.
jitter only applies to startup — requestCrawl and reconciliation
spawn workers with zero jitter since they're one-at-a-time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the previous change kept dormant subscriber threads running forever,
meaning thread count could only go up. dormant now correctly stops
the worker thread (freeing resources) while preserving the DB row
for discovery to re-activate later. reconciliation loop queries only
active hosts — dormant hosts wait for requestCrawl.
separated "don't forget the host" (DB row persists) from "don't stop
the thread" (thread exits on dormancy). removed unused
listReconnectableHosts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Io.Future(void).wait() doesn't exist — futures use .await(io) and
are not threadsafe across fibers. replace startup_future.wait() with
a simple initial sleep (the first reconciliation pass runs 5 min
after startup anyway, well after spawnWorkers finishes).
also use 1-second sleep increments for shutdown responsiveness,
matching the subscriber backoff pattern.
caught by `zig build` (exe target) — `zig build test` misses this
due to lazy analysis when no test references the code path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
subscribers now retry with exponential backoff capped at 30 min
(was 60s cap with hard kill at 15 failures). on successful connect,
backoff resets to 1s and host flips back to active. hosts that fail
15+ consecutive times are marked dormant (observable) but the
subscriber keeps retrying. a reconciliation loop every 5 min
respawns any active/dormant host missing from the worker map.
this eliminates dependence on the external reconnect cron for host
retention — it can be reduced to discovery-only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
now that Backend is Io.Threaded, remove references to cross-Io
constraints, Evented fiber context requirements, and Uring thread
warnings that no longer apply. the historical context is preserved
in docs/evented-attempt.md and docs/notes.md.
no behavioral changes — comments only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ReleaseSafe caught an unreachable panic in posix.setsockopt when a
ConsumerTooSlow kick raced with the websocket server's readLoop thread.
the race: dropSlowConsumer called conn.close() from the broadcast thread
while readLoop (server thread) was about to call setsockopt on the same
socket fd. setsockopt got EBADF, zig stdlib hit unreachable → panic.
under ReleaseFast this was silent undefined behavior — likely existed on
every prior build.
fix: move conn.close() from dropSlowConsumer (broadcast thread) to the
end of writeLoop (consumer's own thread). writeLoop exits when alive is
set to false, drains remaining frames, then closes the connection. this
unblocks readLoop's pending read without racing on socket state.
also bump BUFFER_CAP 8192 → 65536 to reduce ConsumerTooSlow frequency
(cherry-pick of the ee4e368 change onto the current tree).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the Evented (io_uring fiber) backend has been the source of every major
issue since the 0.16 migration: 8 cross-Io crash classes, a ReleaseSafe
GPF from a zig codegen bug, and a persistent ~10-15% coverage degradation
that nobody could trace. the zig team marks Evented as experimental.
Io.Threaded restores thread-per-PDS (~2,800 OS threads instead of ~35
fibers), which was the proven model on 0.15 at 99%+ coverage. the entire
cross-Io problem class vanishes. ReleaseSafe works again. DNS works
natively. the uring networking patch becomes inert.
one-line change: const Backend = Io.Evented → Io.Threaded.
all io.concurrent() call sites, Io.Future, Io.Mutex, Io.Condition are
backend-agnostic through the std.Io abstraction. pool_io becomes
redundant but harmless (both runtimes are now Threaded).
builds clean: zig build test, zig fmt, and
zig build -Dtarget=x86_64-linux-gnu -Doptimize=ReleaseSafe all pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the image atcr.io/zzstoatzz.io/zlay:ReleaseFast-zat21-b91382b was built
with zat alpha.21 via a locally-modified build.zig.zon on the Hetzner
build server that was never committed back. this commit reproduces that
state on top of b91382b so the canary behavioral delta vs production is
exactly one commit (1eec324, the FrameWork hostname UAF fix).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FrameWork.hostname was a borrowed slice from sub.options.hostname,
documented as "stable lifetime". it isn't: slurper.runWorker frees
sub.options.hostname after sub.run() returns, but FrameWorks for
that subscriber may still be queued in the frame pool. once the
allocator reuses that memory, pool workers read garbage when
logging chain breaks, host authority decisions, etc.
repro: zlay-reconnect cronjob spawns ~1839 hosts in 134s. some
subscribers churn within that window. corrupted hostnames appear
in logs as DIDs (the freed slot got reused for a DID dup) or with
stack-pointer-shaped bytes overlaying the suffix.
fix: dupe hostname alongside data when submitting to the pool,
free both in processFrame. one extra alloc/free per frame.
forward-only rewind: every commit on main between b91382b and 4f3d1d4
has been superseded or is suspected of being implicated in the
2026-04-09 HTTP / delivery outage. rather than force-pushing history
backward, this commit creates a new snapshot whose tree matches b91382b
exactly, parented on 4f3d1d4. git pull --ff-only continues to work.
the superseded commits remain in ancestry and can be referenced via
the ops-changelog:
- 4f3d1d4 gcLoop: disable malloc_trim, bump interval 10min→1h
- 795cc41 host_authority: slot recovery + pool metrics + preload account count
- bbba92c fix build: drop unused err1 capture in resolveHostAuthority
- 584571a disable keep_alive on host authority resolver pool + log resolve errors
- ee4e368 bump per-consumer buffer 8192→65536 + host_authority reject breakdown
- 31825b2 subscriber: extract prepareFrameWork + add UAF regression test
- 1eec324 fix UAF: dupe FrameWork.hostname per submit (will be re-applied on top)
- 168d9f1 bump websocket.zig + zat: fix requestCrawl POST hang
- fbdffbe mark DB success on did_cache hits
- 3dc21b9 fix gcLoop: silently exited after one tick
- e5f415f update README, CLAUDE.md, Dockerfile for current state
this commit and the two following it (cherry-pick 1eec324 + pin zat
alpha.21) constitute canary 1 per docs/zlay-canary-plan-2026-04-09.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diagnosis of the 2026-04-09 ~10-minute pod-flap cycle. the bbba92c
pod pattern (healthy ~10 min → /metrics and /_readyz stuck → NotReady
→ restart → repeat) lines up exactly with gcLoop's 10-min cadence.
two separable suspects inside gcLoop, either alone sufficient to
flunk probes:
1. dp.gc() holds DiskPersist.mutex for its entire duration (DB
iteration + per-file unlinks, event_log.zig:977-1033). every
frame worker blocks on persist() during gc. this alone explains
the earlier "0.035 events/sec to consumers" measurement.
2. malloc_trim(0) on a ~1.5 GiB RSS process with MALLOC_ARENA_MAX=4.
glibc holds per-arena locks during the free-list walk, stalling
every allocator caller — including the Evented fiber serving
/metrics and /_readyz. long enough to trip probe timeouts.
this is a stabilization commit, not a root-cause fix:
- disable malloc_trim(0) entirely (comment preserved). prefer
MALLOC_MMAP_THRESHOLD_ tuning or an out-of-band maintenance window
if reclaim becomes an issue.
- bump gc_interval_s 10 min → 1 hour. bounds blast radius of the
persist-mutex hold until gc() is properly narrowed.
- add clock_gettime(.MONOTONIC) timing around dp.gc(). next incident
tells us whether dp.gc() itself or something adjacent is the stall.
- new doc: docs/zlay-gcloop-stall-2026-04-09.md with the hypothesis,
code pointers, validation plan, and follow-up work list (mutex
narrowing; broadcaster writeLoop polling as a separate bug).
3dc21b9 (2026-04-06 "fix gcLoop: silently exited after one tick")
is what unmasked this — before that fix gcLoop ran once and died,
so malloc_trim + gc ran exactly once per pod lifetime. after that
fix they run every 10 min, which is when zlay started flapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bundles four changes from the 2026-04-09 external review (relay
docs/zlay-external-review-2026-04-09.md). all four target the two
unresolved problems from the 2026-04-08 incident: ~99% host_authority
pool rejection and HTTP probe starvation during cold-start spawn.
1. preload effective_account_count in listActiveHostsImpl (item 1)
spawnWorker was doing a per-host blocking DbRequest for
getEffectiveAccountCount(host_id) on every host during cold-start
— ~2,770 round-trips through the DbRequestQueue, each yielding the
spawn fiber. fold the COUNT/JOIN/GROUP BY into the existing batch
query so the value is preloaded into Host.effective_account_count.
addHost (one-off requestCrawl path) keeps the inline fetch since
it's not in the cold-start hot path.
2. resolver slot recovery (item 2)
resolveHostAuthority used to retry resolve() on the SAME pool slot
after a failure. if a slot's std.http.Client gets into any kind of
bad state, the retry is wasted and the slot stays bad forever. on
first-attempt failure, deinit + re-init the slot via the new
recycleHostResolver helper before retrying. directly tests the
leading "poisoned slot" hypothesis without making any zig stdlib
claims. dormant in production while keep_alive=false (no persistent
connections to corrupt) but ready for the next canary.
3. pool/loop metrics (item 4)
six new counters/gauges in broadcaster.zig Stats:
host_resolver_acquire_wait_us_total — pool contention timing
host_resolver_in_use — current slot count held
host_resolver_resets_total — slot recovery firings
host_resolver_resolve_fail_total — first-attempt failures
resolve_loop_resolve_ok_total — background loop ok
resolve_loop_resolve_fail_total — background loop fail
the resolve_loop counters reveal something we've been operating
blind on: the background signing-key resolveLoop has been
log.debug+continue on errors, never measured. when these first
ship, whatever number resolve_loop_resolve_fail shows is the
baseline, NOT a regression — that's the whole point.
4. configurable host_resolver_pool_size (item 5)
was const = 4. heap-allocate from start() based on env var
HOST_RESOLVER_POOL_SIZE (default 4, max 64). with keep_alive=false,
pool width is a real startup throughput knob — bumping it lets more
is_new checks run concurrently during reconnect storms. tune from
ops based on the new acquire_wait_us metric.
5. zat dep bumped to v0.3.0-alpha.23 — surfaces the underlying
std.http.Client.fetch error kind through resolver.resolve, so the
existing sampleLogReject("resolve", did, @errorName(err), ...) call
in resolveHostAuthority will print the actual transport error
(UnknownHostName, ConnectionRefused, TlsAlert, etc.) instead of
always DidResolutionFailed.
not in this batch (per reviewer's "do not yet" list):
- spawn batch loop tuning — slim per-host work first, re-measure
- re-enabling keep_alive=true globally — canary first, after this
metrics shipment lets us see what the broken path returns
- splitting liveness onto a dedicated thread — see if probe flap
survives the slimmed startup fiber first
584571a tried to discard the first-attempt error via `_ = err1` to
document intent, but zig 0.16 rejects that pattern with "error set is
discarded". build failed on the operator's ReleaseFast pipeline and
slipped past my local `zig build test` because the test binary is
lazy — no test references resolveHostAuthority, so zig never analyzed
the function body. `zig build` (the exe) does reach it via
frame_worker.processFrame and trips the error immediately.
just drop the capture. the comment explaining why only err2 is logged
is preserved.
100% of host_authority rejects on 2026-04-08 were in the resolve branch
(39,621 / 40,072 over 48min). plc.directory is reachable from the pod,
cold resolvers in resolveLoop work fine, and websockets to 2785 PDSes
are healthy — isolates the failure to the pooled + long-lived keep_alive
HTTP path. pool was added on 0.15 (1639565) and never re-validated
after the 0.16 migration (9cc1ba3).
workaround: disable keep_alive on the pool. cost is one TLS handshake
per is_new / host_changed DID, which is low-rate enough to absorb.
keep the pool itself for socket churn savings across fiber callers.
also wire sampleLogReject into the resolve and parse_did branches with
@errorName of the resolver error — previous commit incremented counters
for those branches but never logged, so we had no diagnostic data when
the reject rate spiked. if the workaround doesn't fully fix it we now
see the actual error kind without a second redeploy cycle.
the 8192-entry per-consumer ring = ~33s of headroom at 250fps. pulsar's
60-min snapshot accumulated repeated ConsumerTooSlow kicks within that
window (ops_changelog 2026-04-01). bumping to 65536 gives ~4.4min of
headroom — enough to absorb transient write stalls without dropping
consumers mid-run.
checkPdsHost had five silent-reject branches collapsed into one
failed_host_authority counter, which hid the 100% rejection rate
diagnosed 2026-04-08. split into per-branch counters emitted as
relay_host_authority_reject{branch=...} alongside a 1-in-2048 sampled
warn log so we can tell whether the DID doc lookup is failing, the
endpoint is unparseable, the host isn't in our table, or the resolved
host genuinely differs from the incoming host.
follow-up to 1eec324 (fix UAF: dupe FrameWork.hostname per submit).
the dupe-at-submit logic was inline in FrameHandler.onMessage, which
made it hard to regression-test the invariant. extracted a small
Subscriber method that returns a FrameWork with heap-owned data +
hostname, and added a unit test that:
- builds a FrameWork from a to-be-freed hostname buffer
- asserts the returned slices have distinct pointers from the inputs
- simulates slurper.runWorker teardown by freeing the source hostname
- reads the FrameWork.hostname again — would trip the testing
allocator's use-after-free detection if the dupe was elided
no behavior change at the submit site. tested via zig build test.
FrameWork.hostname was a borrowed slice from sub.options.hostname,
documented as "stable lifetime". it isn't: slurper.runWorker frees
sub.options.hostname after sub.run() returns, but FrameWorks for
that subscriber may still be queued in the frame pool. once the
allocator reuses that memory, pool workers read garbage when
logging chain breaks, host authority decisions, etc.
repro: zlay-reconnect cronjob spawns ~1839 hosts in 134s. some
subscribers churn within that window. corrupted hostnames appear
in logs as DIDs (the freed slot got reused for a DID dup) or with
stack-pointer-shaped bytes overlaying the suffix.
fix: dupe hostname alongside data when submitting to the pool,
free both in processFrame. one extra alloc/free per frame.
websocket.zig 3c6794a fixes Handshake.parse hanging on POST requests
with bodies. The previous endsWith("\r\n\r\n") check only matched
header-only GETs, so any POST handed to httpFallback (e.g.
requestCrawl) caused parse to return null indefinitely and the worker
to read the same data forever until the connection's idle timeout.
zat alpha.22 carries the same websocket bump so the transitive
dependency resolves to a single module (otherwise zig links both
hashes and errors with `file exists in modules 'build' and 'build0'`).
Symptom: zlay-reconnect cronjob has been failing for ~12h, never
re-announcing the ~2,950 PDS hosts from atproto-scraping.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
isDbHealthy() is a 30s freshness check on last_db_success, but the
markDbSuccess() call sites only fire on cache misses + the 10-min GC
tick. with a hot did_cache in steady state, miss rate can dip for 30+
seconds, leaving the health flag stale and tripping k8s liveness probes
even though the relay is healthy.
cache hits use DB-derived data, so they're a valid signal that the
data path is functioning. mark success on the fast path. cost is one
clock_gettime + atomic store per ingestion event (~20 ns vDSO).
the GC tick still provides real DB liveness as a backstop.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gcLoop was using io.sleep on pool_io (Threaded) from a plain std.Thread.
the first tick happened to succeed, the second hit an error path, and
catch return swallowed it — silently killing the loop. one malloc_trim
fired at the 10-min mark and then nothing for 13.5+ hours.
fix: switch to std.c.nanosleep directly. plain threads can't safely
call into Io scheduler primitives, even on the matching backend, because
they aren't registered with that backend's scheduler.
drop the io parameter from gcLoop since dp.gc() uses its own bound io
internally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- README: document Evented backend, cross-Io architecture, DbRequestQueue,
link to devlog 008, fix zat dep link, note ReleaseFast requirement
- CLAUDE.md: Evented not Threaded, ReleaseFast not ReleaseSafe
- Dockerfile: fix build flag to ReleaseFast (comments said ReleaseFast
but flag was still ReleaseSafe)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM register allocator bug: under ReleaseSafe, the stack probe and
canary instrumentation cause LLVM to skip materializing the SwitchMessage
address into %rsi before fiber.zig's inline asm context switch. %rsi is
left holding a stale value from Thread.current(), causing a GPF.
Debug, ReleaseFast, and ReleaseSmall all pass. Only ReleaseSafe triggers
it — the combination of optimization + safety instrumentation changes
the code layout enough to expose the miscompilation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
the SIGSEGV that prompted the Threaded revert was a TCP split mid-CRLF
in the websocket handshake reader (fixed in 9ac64da), not a fiber
context-switch issue. re-enabling Evented + ReleaseSafe to see it through.
if the repro GPF (scripts/repro_evented.zig) hits production code paths,
fallback is ReleaseFast or one-line flip back to Threaded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
websocket client read() panics when TCP delivers \r at buffer boundary
without \n — line_start advances past pos, causing start > end slice.
under ReleaseFast this was the silent SIGSEGV every 30-90 min.
websocket.zig 9ac64da, zat v0.3.0-alpha.17.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Io.Evented (io_uring fibers) has a probabilistic SIGSEGV in
std.Io.fiber.contextSwitch that crashes every 30-90 min under load.
After 28 commits fixing cross-Io crashes, heap corruption, and mutex
incompatibilities, this stdlib bug is the remaining blocker — and
upstream fiber.zig is unchanged as of dev.3091 with no fix in sight.
Switching to Io.Threaded restores the thread-per-PDS model (~2,800
threads, stable) and lets us use ReleaseSafe again. The Evented work
is preserved in patches/, scripts/repro_evented.zig, and the new
docs/evented-attempt.md for when upstream catches up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
some PDSes omit the tooBig field from commit events. the lexicon marks it
required, and consumers like hydrant reject frames without it.
resequenceFrame now detects #commit frames and injects tooBig: false when
the field is missing. non-commit frames are unaffected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the Evented pg.Pool (ev_db) approach was broken — io_uring netLookup is
unimplemented upstream, so no DNS and no outbound TCP from Evented fibers.
this replaces ev_db with a DbRequestQueue (MPSC FIFO ring buffer) that
routes general DB traffic through pool_io (Threaded) worker threads:
- add DbRequest + DbRequestQueue to event_log.zig (4096-slot ring,
spinlock push, CAS pop, 2 worker threads)
- convert xrpc/admin handlers to typed DbRequest structs with
@fieldParentPtr callbacks that write JSON into stack buffers
- convert slurper: pullHosts on own std.Thread (parallel with
spawnWorkers), addHost phased (DB via queue, HTTP via temp thread)
- convert broadcaster firstSeq to DbRequest
- convert backfiller/cleaner from Io.Future to std.Thread + direct
persist.db access, with cooperative shutdown checks
- remove all *Ev methods and ev_db infrastructure from DiskPersist
- explicit join paths for backfiller/cleaner/db-workers on shutdown
DbRequest.wait() never returns before done — preserves stack lifetime
of caller-embedded request structs during shutdown drain.
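sketch of that guarantee, with the typed payload and @fieldParentPtr
callback omitted; the real wait may block rather than spin:
    const DbRequest = struct {
        done: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
        // ... typed payload + callback that writes JSON into a stack buffer ...

        // worker thread side, after executing the request
        fn complete(self: *DbRequest) void {
            self.done.store(true, .release);
        }

        // caller side: never returns before complete(), so the request can live
        // on the caller's stack even while the queue is drained at shutdown
        fn wait(self: *DbRequest) void {
            while (!self.done.load(.acquire)) std.atomic.spinLoopHint();
        }
    };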
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ensureEvDb() used @panic on initUri failure, meaning a transient postgres
hiccup during lazy init would crash the entire relay. now returns !*pg.Pool
and resets state to uninit on failure so the next call retries.
callers handle the error gracefully:
- xrpc handlers: respond 503 ServiceUnavailable
- slurper: skip host on db error (safe default)
- broadcaster firstSeq: fall back to memory history
- admin: use 0 for account count on error
- backfiller/cleaner: log and bail from current run
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pg.Pool.initUri does TCP connects via io_uring, which requires the Evented
event loop to be running. creating it during main() init (before the
scheduler starts) fails with NetworkDown — io_uring can't submit ops yet.
fix: store ev_io/db_url/pool_size config on DiskPersist, create the pool
on first use via ensureEvDb(). uses CAS-based init-once — first fiber to
call it creates the pool, concurrent fibers yield-wait via ev_io.sleep().
also changes backfiller/cleaner from storing *pg.Pool to *DiskPersist,
accessing the pool lazily via self.persist.ensureEvDb().
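sketch of the init-once shape; the state field, createPool helper, and
yieldBriefly are stand-ins, and the error-reset path shown is the behavior
described in the commit above:
    const State = enum(u8) { uninit, initializing, ready };

    fn ensureEvDb(self: *DiskPersist) !*pg.Pool {
        while (true) {
            switch (@atomicLoad(State, &self.ev_db_state, .acquire)) {
                .ready => return self.ev_db.?,
                .uninit => {
                    // null result means this fiber won the CAS and owns creation
                    const won = @cmpxchgStrong(
                        State,
                        &self.ev_db_state,
                        .uninit,
                        .initializing,
                        .acq_rel,
                        .acquire,
                    ) == null;
                    if (won) {
                        const pool = createPool(self) catch |err| {
                            // reset so the next caller retries instead of the relay dying
                            @atomicStore(State, &self.ev_db_state, .uninit, .release);
                            return err;
                        };
                        self.ev_db = pool;
                        @atomicStore(State, &self.ev_db_state, .ready, .release);
                        return pool;
                    }
                },
                // another fiber is mid-init: yield and re-check
                .initializing => yieldBriefly(self), // ev_io.sleep() in the real code
            }
        }
    }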
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the relay SIGSEGV'd at ~396 hosts during startup — DiskPersist's pg.Pool
was created on Threaded Io, but ~40 call sites across slurper, API handlers,
broadcaster, backfiller, and cleaner called it from Evented fibers, triggering
Thread.current() on a NULL threadlocal.
Part A — dual pg.Pool:
add ev_db (Evented pg.Pool) to DiskPersist. pure DB methods get *Impl(db)
internal implementations + thin *Ev wrappers. Evented callers use ev_db;
pool_io callers keep using self.db. 13 methods refactored.
Part B — Threaded service paths:
- admin ban: uidForDidEv on Evented side, takedown routed through host_ops
queue (fire-and-forget). worker executes takeDownUser + persist + broadcast
on pool_io under persist_order.
- playback: cross-Io request/reply via MPSC Treiber stack. Evented fiber
posts PlaybackRequest, pool_io worker executes playback() under mutex,
sets a done atomic. the fiber spin-waits on that flag even on failure
paths, preventing use-after-free of the stack-local request.
persist_order held through full persist → resequence → broadcast_queue.push()
to guarantee insertion order matches seq assignment order.
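sketch of the Treiber-stack hop; request fields and the worker's reply
handling are elided, and it assumes atomic ops on an optional pointer:
    const PlaybackRequest = struct {
        next: ?*PlaybackRequest = null,
        done: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
        // ... cursor, reply sink, error slot ...
    };

    const RequestStack = struct {
        head: std.atomic.Value(?*PlaybackRequest) = std.atomic.Value(?*PlaybackRequest).init(null),

        // Evented fiber side: lock-free push, no Io involvement
        fn push(self: *RequestStack, req: *PlaybackRequest) void {
            var old = self.head.load(.acquire);
            while (true) {
                req.next = old;
                old = self.head.cmpxchgWeak(old, req, .release, .acquire) orelse return;
            }
        }

        // pool_io worker side (single consumer): grab the whole stack at once
        fn takeAll(self: *RequestStack) ?*PlaybackRequest {
            return self.head.swap(null, .acq_rel);
        }
    };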
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move resequenceFrame, heap dupe, and broadcast_queue.push() outside
the ordering lock. persist_order now covers only the DB persist call
and seq store — the minimum needed for monotonic sequence assignment.
this eliminates the cascade where producers spin on persist_order
while another producer is blocked in a full broadcast_queue.push().
slight out-of-order in the ring is acceptable — seq is embedded in
frame data and consumers/history track by seq.
metrics showed persist_order_spins_total dominating at ~1,100 hosts
(548M spins) while push_lock_spins was zero — confirming the critical
section width was the bottleneck.
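roughly the publication path after the change (names from this log; the
persist/resequence signatures and the last_seq field are assumptions):
    fn publish(self: *Broadcaster, frame: []const u8) !void {
        // inside persist_order: just enough to tie seq assignment to insertion order
        self.persist_order.lock();
        const seq = self.dp.persist(frame) catch |err| {
            self.persist_order.unlock();
            return err;
        };
        self.last_seq.store(seq, .release);
        self.persist_order.unlock();

        // outside the lock: the expensive parts that used to serialize every producer
        const reseq = try resequenceFrame(self.alloc, frame, seq);
        const copy = try self.alloc.dupe(u8, reseq);
        self.broadcast_queue.push(copy); // may spin on a full ring, no longer under persist_order
    }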
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
separates push-lock contention from queue-full contention so an operator
can distinguish: producers fighting for the CAS lock vs producers
blocked on a full ring buffer.
new metric: relay_broadcast_queue_push_lock_spins_total
clarified: relay_broadcast_queue_full_total HELP text (ring capacity)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
instrumentation for the ~1,300 host CPU cliff:
- relay_persist_order_spins_total: spin iterations on the ordering lock
- relay_broadcast_queue_full_total: spin iterations on full broadcast queue
- relay_broadcast_queue_depth_hwm: high-water mark of queue depth
- relay_broadcast_no_consumers_total: frames that skipped SharedFrame alloc
zero-consumer fast path: when no consumers are connected, broadcast()
returns after history.push() without allocating SharedFrame or taking
consumers_mutex. saves one heap alloc + one mutex per frame.
also includes the cursor coalesce fix (CursorMap) and slot reuse
(free list with unregister on subscriber exit) from previous commits.
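sketch of the fast path, assuming a consumer count kept in an atomic next
to the consumers list and a SharedFrame.create helper:
    fn broadcast(self: *Broadcaster, frame: []const u8) !void {
        try self.history.push(frame);

        // nobody connected: skip the SharedFrame alloc and consumers_mutex entirely
        if (self.consumer_count.load(.acquire) == 0) {
            _ = self.no_consumers_total.fetchAdd(1, .monotonic); // relay_broadcast_no_consumers_total
            return;
        }

        const shared = try SharedFrame.create(self.alloc, frame);
        self.consumers_mutex.lock();
        defer self.consumers_mutex.unlock();
        _ = shared; // fan-out to each registered consumer elided
    }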
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the host_ops MPSC queue was processing every cursor flush (~348/sec at
1,391 hosts) as an individual DB write. when the queue filled, Evented
producers busy-spun — causing 5.4 cores sustained CPU.
split into two mechanisms:
- CursorMap: subscribers atomically store latest seq (one store, no lock).
worker thread sweeps every 5s, batch-flushes only changed cursors.
- HostOpsQueue: kept for rare ops only (failures, status updates).
also adds slot reuse via free list — unregister on subscriber exit
prevents slot exhaustion from host churn.
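sketch of the cursor coalescing; the slot count, dirty bit, and flush
callback are assumptions, the atomic-store + periodic sweep split is what
the commit describes:
    pub const CursorMap = struct {
        const max_slots = 4096; // assumed capacity

        const Slot = struct {
            seq: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),
            dirty: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
        };

        slots: [max_slots]Slot = [_]Slot{.{}} ** max_slots,

        // subscriber side: one atomic store per cursor update, no lock, no queue
        pub fn record(self: *CursorMap, slot: usize, seq: u64) void {
            self.slots[slot].seq.store(seq, .release);
            self.slots[slot].dirty.store(true, .release);
        }

        // worker side, every ~5s: batch-flush only slots that changed since the last sweep
        pub fn sweep(self: *CursorMap, flush: *const fn (slot: usize, seq: u64) void) void {
            for (&self.slots, 0..) |*s, i| {
                if (s.dirty.swap(false, .acq_rel)) flush(i, s.seq.load(.acquire));
            }
        }
    };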
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
subscriber fibers (Evented) were calling DiskPersist methods that acquire
pg.Pool connections via Threaded futex — NULL Thread.current() on Evented
fibers caused heap corruption (~16min crash cycle).
add MPSC host_ops queue (atomic spinlock, same pattern as BroadcastQueue):
- subscriber pushes ops instead of calling dp.* directly
- single background thread (std.Thread on pool_io) pops and executes
- covers: cursor flush (~450/s), failure tracking, status updates
- cursor loaded at spawn time (slurper passes last_seq through)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- keep gc_thread handle and join it during shutdown before dp.deinit() runs —
dp is stack-owned, detaching left a use-after-free window
- add markDbSuccess() call at end of gc() so the health signal isn't solely
dependent on uidForDid (event ingestion path)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- move GC loop from io.concurrent() (Evented fiber) to std.Thread.spawn() with pool_io (Threaded)
— dp.gc() takes Threaded mutex + queries pg.Pool, which dereferences NULL Thread.current()
threadlocal when called from Evented context → heap corruption / SIGSEGV
- replace direct pg.Pool "SELECT 1" health checks with atomic last_db_success timestamp
— metrics server and API router both run on Evented, pg.Pool runs on Threaded
— isDbHealthy() reads an atomic set by Threaded workers, safe from any Io context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
broadcast() removed dead consumers from the list AND called shutdown()+destroy(),
but Handler.close() still held the pointer and later called removeConsumer() on
freed memory. Now broadcast() only unlinks from the list — removeConsumer() is
the sole owner of shutdown + destroy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the previous fix (6674812) correctly identified that Evented Io.Mutex
from a plain thread causes SIGSEGV, but the fix used Threaded futex
from within an Evented fiber. Threaded futexWait blocks the Uring OS
thread, preventing it from processing io_uring completions for other
fibers — deadlocking the event loop during CA bundle scan.
fix: the resyncer now runs entirely on pool_io (Threaded) via a plain
std.Thread. no Evented io involvement at all. this is correct because:
- enqueue() from frame workers: Threaded futex on plain thread ✓
- dequeue() in worker: Threaded futex on plain thread ✓
- HTTP client: blocking I/O on plain thread ✓
the fundamental constraint: Io.Mutex cannot be shared across Io types.
Threaded futex on Evented fiber → blocks Uring thread → deadlock.
Evented futex on plain thread → NULL threadlocal → SIGSEGV.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
root cause: Resyncer.enqueue() called from frame worker threads (plain
std.Thread) used Evented Io for mutex/cond ops. when contended,
futexWaitUncancelable enters the Uring fiber scheduler which calls
Thread.current() — a threadlocal only set on Uring threads. on plain
threads it's null, and ReleaseFast silently dereferences NULL at struct
field offsets (0x28, 0x30, 0x38) → SIGSEGV.
fix: add queue_io (pool_io/Threaded) to Resyncer for cross-thread
synchronization. Evented io kept for HTTP client and fiber spawning.
also fixes two consumer bugs:
- dropSlowConsumer spawned plain std.Thread that called Evented future
cancel → same NULL deref class. removed cleanup thread, deferred
destruction to Handler.close → removeConsumer.
- removeConsumer unconditionally decremented consumer count after
dropSlowConsumer already did → double-decrement. now only decrements
when consumer is found in the list.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
spawning ~2,800 TLS handshakes simultaneously starves the single
Evented event loop — health checks on both :3000 and :3001 time out,
liveness probe kills the pod. batch-spawn with io.sleep yields between
batches so the event loop stays responsive during ramp.
STARTUP_BATCH_SIZE env var (default 50), 100ms yield between batches.
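sketch of the ramp loop; the io parameter type and sleep call are
assumptions about the 0.16-dev std.Io interface, and the host/spawn types
are stand-ins:
    fn spawnWorkers(self: *Slurper, io: std.Io, hosts: []const Host) !void {
        const batch = self.startup_batch_size; // STARTUP_BATCH_SIZE, default 50
        for (hosts, 0..) |host, i| {
            try self.spawnSubscriber(io, host);
            if ((i + 1) % batch == 0) {
                // hand the event loop back so :3000/:3001 health checks keep answering
                io.sleep(100 * std.time.ns_per_ms) catch return;
            }
        }
    }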
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IORING_OP_BIND and IORING_OP_LISTEN require kernel 6.11+. Our server
is on 6.8 — the kernel returns EINVAL for unknown opcodes, surfacing
as error.Unexpected. Use direct linux.bind()/linux.listen() syscalls
instead (instant, non-blocking — same approach as getsockname).
IORING_OP_ACCEPT (5.5), IORING_OP_CONNECT (5.5), IORING_OP_SOCKET
(5.19), and IORING_OP_SENDMSG/RECVMSG/READV (5.3-5.6) all work fine.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements 6 io_uring network vtable entries (listen, accept, connect,
read, write, send) that are stubbed as Unavailable upstream (zig#31723).
The patch is applied at build time inside the Docker container only —
it modifies the zig stdlib bundled in the container image, not the host
zig installation or any downstream consumer of zat/websocket.zig.
DNS (netLookup) is deliberately NOT patched — subscribers resolve
hostnames through pool_io (Threaded) and pass the connected stream
to the websocket client for TLS + framing via Evented io.
Dep bumps:
- websocket.zig 80c6434: initWithStream() respects config.tls
(was hardcoded to null). Non-breaking — existing callers default
to tls=false and get identical behavior.
- zat v0.3.0-alpha.16: picks up the websocket bump so both zlay
and zat resolve to the same websocket version.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Io.Uring fiber context-switch GPFs under ReleaseSafe on x86_64-linux.
Reproduced with a 30-line minimal program (repro_evented.zig):
- Debug (safety ON, opts OFF): passes
- ReleaseFast (safety OFF, opts ON): passes
- ReleaseSmall (safety OFF, opts ON): passes
- ReleaseSafe (safety ON, opts ON): GPF in fiber.zig contextSwitch
This is a zig 0.16-dev stdlib bug (confirmed in both dev.3059 and
dev.3066) — the LLVM optimizer miscompiles safety-checked code in
the fiber context switch path.
Fix: build with ReleaseFast instead of ReleaseSafe. Backend stays
on Io.Evented (fibers on io_uring), eliminating thread-per-PDS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flip backend from Io.Threaded to Io.Evented. frame workers run on a
dedicated Io.Threaded (pool_io) for CPU-heavy decode/validate, then
hand off to the broadcaster fiber via an MPSC ring buffer.
key design: one publication path for all relay-sequenced events.
workers, subscriber inline, and admin all use the same pattern:
persist_order spinlock → dp.persist → resequence → queue.push.
the broadcaster fiber drains FIFO (= seq order) and fans out.
persist_order is an atomic spinlock (not Io.Mutex) so both Threaded
workers and Evented admin can participate — Io.Mutex futex
implementations are incompatible across domains.
- DiskPersist initialized with pool_io (workers are its callers)
- BroadcastQueue: lossless spin-wait push (matches Indigo semantics)
- admin ban: same ordered path as workers, no direct broadcast()
- pool_io wired to HttpContext for future cross-domain use
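sketch of an Io-agnostic spinlock of the kind described for persist_order:
no futex and no Io, so Threaded workers and Evented fibers can both take it:
    const SpinLock = struct {
        locked: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),

        fn lock(self: *SpinLock) void {
            // plain atomic exchange; never calls into any Io scheduler
            while (self.locked.swap(true, .acquire)) std.atomic.spinLoopHint();
        }

        fn unlock(self: *SpinLock) void {
            self.locked.store(false, .release);
        }
    };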
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pingLoop swallowed error.Canceled from io.sleep(), preventing
ping_future.cancel() from stopping the task before client.deinit()
freed the stream/TLS buffers. next writeFrame hit freed memory → GPF.
two changes:
- io.sleep() catch {} → catch return (cancellation-cooperative)
- check client.isClosed() before writeFrame (defense-in-depth)
also bumps zat to v0.3.0-alpha.11 and websocket.zig to 104608b.
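the loop shape after the fix, as a sketch (interval, sleep signature, and
the ping helper are stand-ins):
    fn pingLoop(self: *Subscriber, io: std.Io) void {
        while (true) {
            // catch return, not catch {}: error.Canceled has to end the task so
            // ping_future.cancel() wins before client.deinit() frees the buffers
            io.sleep(30 * std.time.ns_per_s) catch return;
            if (self.client.isClosed()) return; // defense-in-depth from this commit
            self.sendPing() catch return; // hypothetical wrapper around the ws write
        }
    }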
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixes GPF in websocket client write path: concurrent writes from ping
loop, auto-pong, and close were interleaving frame headers/payloads on
the shared stream, corrupting memory during memcpy in Writer.zig.
websocket.zig 0261b7d adds _write_lock: Io.Mutex to the client, matching
the server-side Conn.lock pattern. zat alpha.9 pins the same websocket
commit, resolving the diamond dependency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pg.zig pool used Io.Event with reset() for connection-available signaling.
Io.Event.reset() assumes no pending waiters — violated when 16 frame
workers contend for 5 connections. Updated pg.zig replaces Event with a
monotonic futex counter (safe for any number of concurrent waiters).
Also:
- make DB pool size configurable via DB_POOL_SIZE env (default 20)
- previous hardcoded 5 guaranteed constant contention with 16 workers
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evented (io_uring) futexWait/Wake dispatch through fiber-local
Thread.current() state. Frame pool workers are plain std.Thread —
they lack that state, so Io.Mutex contention segfaults immediately.
Threaded backend uses direct kernel futex syscalls that work from
any execution context. io.concurrent still spawns real OS threads
(same concurrency model as 0.15, just 0.16 APIs).
Evented can be revisited once frame workers either move to
io.concurrent or cross-boundary mutexes get a dedicated Threaded
sync_io.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>