host_authority: slot recovery + pool metrics + preload account count
bundles four changes from the 2026-04-09 external review (relay
docs/zlay-external-review-2026-04-09.md) plus a zat dependency bump.
all four review items target the two unresolved problems from the
2026-04-08 incident: ~99% host_authority pool rejection and HTTP
probe starvation during cold-start spawn.
1. preload effective_account_count in listActiveHostsImpl (item 1)
spawnWorker issued a blocking DbRequest for
getEffectiveAccountCount(host_id) once per host during cold-start:
~2,770 round-trips through the DbRequestQueue, each yielding the
spawn fiber. fold the COUNT/JOIN/GROUP BY into the existing batch
query so the value is preloaded into Host.effective_account_count.
addHost (one-off requestCrawl path) keeps the inline fetch since
it's not in the cold-start hot path.
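a sketch of the fold; the table and column names are assumptions for
illustration, not the real schema:

    // hypothetical shape of the listActiveHostsImpl batch query with
    // the per-host count folded in; hosts/accounts names are assumed.
    const list_active_hosts_sql =
        \\SELECT h.*, COUNT(a.id) AS effective_account_count
        \\FROM hosts h
        \\LEFT JOIN accounts a ON a.host_id = h.id
        \\WHERE h.active
        \\GROUP BY h.id
    ;
    // row mapping fills Host.effective_account_count up front;
    // spawnWorker reads the field instead of issuing its own DbRequest.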
2. resolver slot recovery (item 2)
resolveHostAuthority used to retry resolve() on the SAME pool slot
after a failure. if a slot's std.http.Client gets into any kind of
bad state, the retry is wasted and the slot stays bad forever. on
first-attempt failure, deinit + re-init the slot via the new
recycleHostResolver helper before retrying. this directly tests the
leading "poisoned slot" hypothesis without making any claims about
the zig stdlib. dormant in production while keep_alive=false (no persistent
connections to corrupt) but ready for the next canary.
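a minimal sketch of the recovery path; the pool layout, Resolver
type, and Broadcaster receiver are assumptions about the surrounding
code, and the counter comes from item 3 below:

    // deinit + re-init one pool slot so a wedged std.http.Client is
    // discarded rather than retried on.
    fn recycleHostResolver(self: *Broadcaster, slot_index: usize) void {
        self.host_resolver_pool[slot_index].deinit();
        self.host_resolver_pool[slot_index] = Resolver.init(self.allocator);
        _ = self.stats.host_resolver_resets_total.fetchAdd(1, .monotonic);
    }

    // first-attempt failure path in resolveHostAuthority, roughly:
    //     resolver.resolve(...) catch {
    //         self.recycleHostResolver(slot_index); // drop poisoned state
    //         // retry once on the freshly re-initialized slot
    //     };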
3. pool/loop metrics (item 4)
six new counters/gauges in broadcaster.zig Stats:
host_resolver_acquire_wait_us_total — pool contention timing
host_resolver_in_use — current slot count held
host_resolver_resets_total — slot recovery firings
host_resolver_resolve_fail_total — first-attempt failures
resolve_loop_resolve_ok_total — background loop ok
resolve_loop_resolve_fail_total — background loop fail
the resolve_loop counters reveal something we've been operating
blind on: the background signing-key resolveLoop has only ever done
log.debug+continue on errors, so its failure rate was never measured.
when these counters first ship, whatever number
resolve_loop_resolve_fail_total shows is the baseline, NOT a
regression; that's the whole point.
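how the six fields might sit in Stats, assuming the existing
counters are std.atomic.Value(u64); the Counter alias and the
acquire() call are illustrative, not the real code:

    const std = @import("std");

    const Counter = std.atomic.Value(u64);

    pub const Stats = struct {
        // ...existing fields elided...
        host_resolver_acquire_wait_us_total: Counter = Counter.init(0),
        host_resolver_in_use: Counter = Counter.init(0), // gauge, rises and falls
        host_resolver_resets_total: Counter = Counter.init(0),
        host_resolver_resolve_fail_total: Counter = Counter.init(0),
        resolve_loop_resolve_ok_total: Counter = Counter.init(0),
        resolve_loop_resolve_fail_total: Counter = Counter.init(0),
    };

    // contention timing around pool acquire (acquire() is assumed):
    //     const t0 = std.time.microTimestamp();
    //     const slot = pool.acquire();
    //     stats.host_resolver_acquire_wait_us_total.fetchAdd(
    //         @intCast(std.time.microTimestamp() - t0), .monotonic);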
4. configurable host_resolver_pool_size (item 5)
was a hardcoded const = 4. now heap-allocated from start() based on
the HOST_RESOLVER_POOL_SIZE env var (default 4, max 64). with
keep_alive=false, pool width is a real startup throughput knob:
widening it lets more is_new checks run concurrently during
reconnect storms. ops can tune it against the new
host_resolver_acquire_wait_us_total metric.
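roughly what the env read could look like; the helper name is
invented for illustration and the real option plumbing may differ:

    // clamp to [1, 64]; unset or unparseable falls back to the old default 4.
    fn hostResolverPoolSize(allocator: std.mem.Allocator) usize {
        const raw = std.process.getEnvVarOwned(allocator, "HOST_RESOLVER_POOL_SIZE") catch
            return 4;
        defer allocator.free(raw);
        const n = std.fmt.parseUnsigned(usize, raw, 10) catch return 4;
        return @min(@max(n, 1), 64);
    }

    // start() then heap-allocates the pool at that width instead of
    // the old fixed-size array:
    //     self.host_resolver_pool = try allocator.alloc(Resolver, size);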
5. zat dep bumped to v0.3.0-alpha.23, which surfaces the underlying
std.http.Client.fetch error kind through resolver.resolve. the
existing sampleLogReject("resolve", did, @errorName(err), ...) call
in resolveHostAuthority will now print the actual transport error
(UnknownHostName, ConnectionRefused, TlsAlert, etc.) instead of
always DidResolutionFailed.
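a self-contained illustration of the shape change (not zat's actual
code): the old resolve collapsed every failure into
DidResolutionFailed, the new one lets the fetch error set through:

    const std = @import("std");

    // stand-in for the std.http.Client.fetch failures we care about
    const FetchError = error{ UnknownHostName, ConnectionRefused, TlsAlert };

    fn fetchDid() FetchError!void {
        return error.ConnectionRefused;
    }

    // old shape: every transport failure becomes one opaque error
    fn resolveOld() error{DidResolutionFailed}!void {
        fetchDid() catch return error.DidResolutionFailed;
    }

    // new shape: the underlying error propagates to the caller
    fn resolveNew() FetchError!void {
        return fetchDid();
    }

    pub fn main() void {
        resolveOld() catch |e| std.debug.print("{s}\n", .{@errorName(e)}); // DidResolutionFailed
        resolveNew() catch |e| std.debug.print("{s}\n", .{@errorName(e)}); // ConnectionRefused
    }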
not in this batch (per reviewer's "do not yet" list):
- spawn batch loop tuning — slim per-host work first, re-measure
- re-enabling keep_alive=true globally — canary first, once this
metrics shipment shows what the broken path actually returns
- splitting liveness onto a dedicated thread — first see whether the
probe flap persists after the startup fiber is slimmed