declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay

ops-changelog: document tabled strict validation change

record the strict-validation-on-cache-miss work, external review
findings, and rationale for tabling. change lives on zlay branch
nate/strict-validation-on-cache-miss, not deployed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zzstoatzz 56bda9b9 d0e2bb8e

+77
docs/ops-changelog.md
···
 
 ---
 
+## 2026-03-12
+
+### tabled: strict validation on cache miss (branch: nate/strict-validation-on-cache-miss)
+
+**observation**: pulsar showed zlay with ~200 more unique DIDs than bsky.network in
+a 1-hour window. validator metrics confirmed 19% of commits (1.67M of 8.8M) were
+broadcast without signature verification — the optimistic pass-through on cache miss.
+
+**change built**: replaced optimistic pass-through with synchronous inline DID
+resolution on cache miss. added `resolveInline` and `cacheKeyFromDoc` helpers.
+re-resolve narrowed to `SignatureVerificationFailed` only (not structural errors).
+fixed cache hit/miss telemetry double-counting. external review by fig.
+
+**why tabled**:
+- this is a policy divergence from indigo, which is availability-biased — indigo
+tolerates identity lookup failure and skips verification rather than dropping
+- `resolveInline` collapses all resolver failures (including transient PLC/network)
+into `null` → frame drop. legitimate traffic gets dropped during resolver flakiness
+- pulsar shows an output divergence but not the cause — the ~200 extra DIDs could be
+invalid, or could be edge coverage differences. need DID-level diff + independent
+validation to distinguish
+
+**what's needed before deploying**:
+- relay evaluator tool that classifies differing DIDs (valid-but-transient vs
+permanently-unresolvable vs malformed)
+- hybrid policy: hard-drop permanently unresolvable, quarantine/retry on transient
+resolver failures
+- canary deployment before primary relay
+
+files: zlay/src/validator.zig (branch only, not on main)
+
+### feat: per-host configurable account limits for zlay (d717282)
+
+**problem**: pulsar showed zlay at 99.82% coverage — missing ~420 users. DID-level
+diffing traced the gap to high-volume third-party PDS hosts (blacksky.app 57 missing,
+eurosky.social 19, northsky.social 14) whose events were rate-limited by zlay's
+per-host formula (`2500 + account_count` evt/hr). same problem bryan identified on
+indigo and fixed via `changeLimits` API — zlay had no equivalent.
+
+**fix**: added admin API and supporting infrastructure:
+- `account_limit` nullable column on host table (NULL = use actual COUNT(*))
+- `getEffectiveAccountCount` uses COALESCE(override, actual_count) for rate scaling
+- `POST /admin/hosts/changeLimits` — set override (`{"host":"...","account_limit":N}`)
+or clear it (`{"account_limit":null}`) to revert to auto-mode
+- atomic rate limiter fields (`std.atomic.Value(u64)`) for safe cross-thread updates
+from admin HTTP handler to subscriber worker threads
+- `updateHostLimits` on Slurper applies new limits to running subscribers immediately
+- `GET /admin/hosts` response now includes `account_limit` field (nullable)
+
+files: event_log.zig, subscriber.zig, slurper.zig, api/admin.zig, api/router.zig
+(+145/-32 lines across 5 files)
+
+**applied limits**: blacksky.app:100K, eurosky.social:100K, northsky.social:50K
+
+**result (5-min pulsar measurement)**:
+
+| relay | events | users |
+|-------|--------|-------|
+| zlay.waow.tech | 77,862 | 21,090 |
+| bsky.network | 66,526 | 19,214 |
+| relay.waow.tech | 67,207 | 19,073 |
+
+zlay now sees ~110% of reference relay users — the previously rate-limited hosts are
+delivering events that other relays don't surface in this time window.
+
+### gotcha: debug vs ReleaseSafe deploy
+
+**mistake**: ran `just zlay publish-remote` (no args) which defaults to a debug build.
+all previous deploys were `just zlay publish-remote ReleaseSafe`. debug binary used
+~2.5 GiB RSS vs ~1.5 GiB for ReleaseSafe — appeared as a memory regression on grafana.
+
+**lesson**: ALWAYS pass `ReleaseSafe` to `publish-remote`. the replicaset history
+(`kubectl get rs`) shows image tags — check them if unsure which optimization level
+was last deployed.
+
+---
+
 ## 2026-03-11
 
 ### fix: indigo relay coverage gap (98.44% → targeting 99.9%)
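
the per-host limit arithmetic described in the changelog entry (`2500 + account_count` evt/hr, with `account_limit` overriding the counted accounts when it is not NULL) can be sketched as below. this is a minimal illustration in Python, not the zlay implementation (which is Zig). the function names, the constant name, and the example account count of 4,000 are hypothetical. only the formula and the COALESCE semantics come from the changelog.

```python
from typing import Optional

# Constant term of the per-host formula quoted in the changelog.
# The name is hypothetical; zlay's actual identifier is not shown in the diff.
BASE_EVENTS_PER_HOUR = 2500


def effective_account_count(override: Optional[int], actual_count: int) -> int:
    """Mirror COALESCE(override, actual_count).

    A None override (SQL NULL) means auto-mode: use the counted accounts.
    Any non-None override wins, including 0, as COALESCE would behave.
    """
    return actual_count if override is None else override


def hourly_event_limit(override: Optional[int], actual_count: int) -> int:
    """Per-host limit in events/hour: 2500 + effective account count."""
    return BASE_EVENTS_PER_HOUR + effective_account_count(override, actual_count)


if __name__ == "__main__":
    # Hypothetical host with 4,000 counted accounts.
    print(hourly_event_limit(None, 4_000))     # auto-mode
    print(hourly_event_limit(100_000, 4_000))  # override of 100K, as applied to blacksky.app
```

under these assumptions, an override of 100K lifts the hypothetical host from 6,500 to 102,500 events per hour, and clearing the override (None) returns it to the count-based limit.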