disable keep_alive on host authority resolver pool + log resolve errors

100% of host_authority rejects on 2026-04-08 were in the resolve branch
(39,621 / 40,072 over 48min). plc.directory is reachable from the pod,
cold resolvers in resolveLoop work fine, and websockets to 2785 PDSes
are healthy — isolates the failure to the pooled + long-lived keep_alive
HTTP path. pool was added on 0.15 (1639565) and never re-validated
after the 0.16 migration (9cc1ba3).

workaround: disable keep_alive on the pool. cost is one TLS handshake
per is_new / host_changed DID, which is low-rate enough to absorb.
keep the pool itself for socket churn savings across fiber callers.

also wire sampleLogReject into the resolve and parse_did branches with
@errorName of the resolver error — previous commit incremented counters
for those branches but never logged, so we had no diagnostic data when
the reject rate spiked. if the workaround doesn't fully fix it we now
see the actual error kind without a second redeploy cycle.
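
a minimal sketch of the sampled-logging idea above, not the actual sampleLogReject from validator.zig (its body is not part of this diff). RejectSampler, logReject, the `every` interval and the ResolveError set are all hypothetical names for illustration. it shows an atomic counter gating a warn log so a reject spike still leaves an @errorName trail without flooding the log:

```zig
const std = @import("std");

/// Hypothetical sampler: passes the first event and every Nth one after it.
const RejectSampler = struct {
    seen: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),
    every: u64,

    /// Returns true when the current event should be logged.
    fn shouldLog(self: *RejectSampler) bool {
        // fetchAdd returns the value before the increment, so event 0 logs.
        const n = self.seen.fetchAdd(1, .monotonic);
        return n % self.every == 0;
    }
};

/// Assumed error set; the real resolver's errors may differ.
const ResolveError = error{ DidResolutionFailed, ConnectionResetByPeer };

/// Logs a sampled reject with the error kind as the detail field.
fn logReject(sampler: *RejectSampler, branch: []const u8, did: []const u8, err: ResolveError) void {
    if (!sampler.shouldLog()) return;
    std.log.warn("host_authority reject branch={s} did={s} detail={s}", .{ branch, did, @errorName(err) });
}
```

counters still increment on every reject; only the log line is sampled, so the reject rate stays exact while the error kind stays cheap to capture.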

zzstoatzz 1 month ago 584571aa ee4e3682

+27 -4

1 changed file

src/validator.zig

@@ -122,9 +122,25 @@
             slot.* = try self.io.concurrent(resolveLoop, .{self});
         }
 
-        // init host authority resolver pool (reused across calls)
+        // init host authority resolver pool (reused across calls).
+        //
+        // keep_alive = false: workaround for 100% rejection rate observed
+        // 2026-04-08. hypothesis is that zig 0.16 std.http.Client doesn't
+        // recover stale keep-alive connections on the pooled resolvers —
+        // pool was added 2026-03-18 on zig 0.15, never re-validated after
+        // the 0.16 migration on 2026-04-05. plc.directory is reachable
+        // from the pod and cold resolvers (resolveLoop) work fine, so it's
+        // specifically the pooled + long-lived keep_alive path.
+        //
+        // cost: one TLS handshake per host authority check (~tens of ms).
+        // host authority checks only fire on is_new or host_changed, so the
+        // steady-state rate is low. keep the pool for the socket churn
+        // savings across multiple fiber callers even without keep_alive.
+        //
+        // TODO: remove once upstream zig fix lands. file issue when we
+        // have the actual error kind from the sampled warn logs below.
         for (&self.host_resolvers) |*r| {
-            r.* = zat.DidResolver.initWithOptions(self.io, self.allocator, .{});
+            r.* = zat.DidResolver.initWithOptions(self.io, self.allocator, .{ .keep_alive = false });
         }
         for (&self.host_resolver_available) |*a| {
             a.store(true, .release);
@@ -553,6 +569,7 @@
         const persist = self.persist orelse return .migrate; // no DB — can't check
         const parsed = zat.Did.parse(did) orelse {
             _ = self.stats.host_authority_reject_parse_did.fetchAdd(1, .monotonic);
+            self.sampleLogReject("parse_did", did, "", incoming_host_id, 0);
             return .reject;
         };
 
@@ -562,10 +579,16 @@
         var resolver = &self.host_resolvers[idx];
 
         // first resolve attempt
-        var doc = resolver.resolve(parsed) catch {
+        var doc = resolver.resolve(parsed) catch |err1| {
             // retry once on network failure
-            var doc2 = resolver.resolve(parsed) catch {
+            var doc2 = resolver.resolve(parsed) catch |err2| {
                 _ = self.stats.host_authority_reject_resolve.fetchAdd(1, .monotonic);
+                // log the second-attempt error kind — first-attempt kind is
+                // dropped because resolver.resolve already swallows it into
+                // DidResolutionFailed upstream, so both errors look the same
+                // here. detail field captures @errorName for upstream triage.
+                self.sampleLogReject("resolve", did, @errorName(err2), incoming_host_id, 0);
+                _ = err1;
                 return .reject;
             };
             defer doc2.deinit();
