commits
reported by ptr.pet — https://relay.t4tlabs.net/ advertises hydrant.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
three changes based on phi-integration feedback:
1. drop the /api/phi/ namespace
   - /api/phi/monitors → /api/relays
   - /api/phi/history → /api/relays/history
   these are a general observability surface, not phi-specific.
   clean rename; no aliases kept. response shapes unchanged.
2. bound history with timestamps
   GET /api/relays/history?name=<host>&since=<iso>&until=<iso>
   answers "what was happening at 3am Tuesday" instead of only
   "last N points". when since/until are set, returns every point
   in the inclusive range (no cap); limit stays as the fallback
   for recent-N queries. validates inputs before binding to SQL.
3. new /api/relays/events transition log
   GET /api/relays/events?since=<iso>&until=<iso>&name=<host>
   returns status-transition rows with from/to status and the
   headline captured at transition time. added a monitor_transitions
   table; /api/relays appends a row whenever a monitor's status
   differs from its prior state. default window 24h; name filter
   optional. sorted ascending by ts.
with all three, a caller can ask "what's the state now"
(/api/relays), "what did this host look like over time"
(/api/relays/history), and "what changed, when"
(/api/relays/events) without re-deriving state from raw points.
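a minimal sketch of how a caller might build the history and events queries. only the paths and parameter names come from this change; the base URL, the helper names, and the use of `datetime.fromisoformat` as the timestamp check are assumptions for illustration.

```python
from datetime import datetime
from urllib.parse import urlencode

BASE = "https://relay-eval.example"  # hypothetical host, not from the source


def _iso(value):
    """Reject anything that does not parse as an ISO 8601 timestamp."""
    datetime.fromisoformat(value)  # raises ValueError on malformed input
    return value


def history_url(name, since=None, until=None, limit=None):
    """Build a /api/relays/history query; name is required."""
    params = {"name": name}
    if since is not None:
        params["since"] = _iso(since)
    if until is not None:
        params["until"] = _iso(until)
    if limit is not None:
        params["limit"] = int(limit)
    return f"{BASE}/api/relays/history?{urlencode(params)}"


def events_url(since=None, until=None, name=None):
    """Build a /api/relays/events query; every parameter is optional."""
    params = {}
    if since is not None:
        params["since"] = _iso(since)
    if until is not None:
        params["until"] = _iso(until)
    if name is not None:
        params["name"] = name
    query = urlencode(params)
    return f"{BASE}/api/relays/events" + (f"?{query}" if query else "")
```

validating on the client side mirrors the server's "validate before binding" rule, so a malformed timestamp fails before any request is sent.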
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
single-host coverage history with precomputed coverage_pct + a
summary block (mean/min/max coverage, connected-run count).
complements /api/phi/monitors — monitors for "what is the state
now", history for "what was the state over the last N runs".
usage:
GET /api/phi/history?name=<host>&limit=<n>
default limit = 288 (~24h at 5-min eval cadence), max = 2016 (~7d).
response shape:
{
  "name": "...",
  "limit": N,
  "points": [ {ts, coverage_pct, events, dids, connected}, ... ],
  "summary": {
    "mean_coverage_pct": N,
    "min_coverage_pct": N,
    "max_coverage_pct": N,
    "connected_runs": N,
    "total_runs": N
  }
}
coverage semantics match /api/phi/monitors: unique_dids / MAX(unique_dids)
per run, self-normalizing against replay. summary stats are computed
over connected points only; connected_runs and total_runs are reported
alongside them so phi can say "alive 272/288 runs".
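a sketch of the summary block under those rules, in Python for illustration only. the field names follow the response shape above; returning None for the stats when no run connected is an assumption, since the source does not say what happens in that case.

```python
def summarize(points):
    """Compute the summary block from a list of history points.

    Each point is a dict with at least "coverage_pct" and "connected".
    Mean/min/max consider connected points only.
    """
    connected = [p for p in points if p["connected"]]
    covs = [p["coverage_pct"] for p in connected]
    return {
        # Assumption: stats are None when there is nothing to average.
        "mean_coverage_pct": sum(covs) / len(covs) if covs else None,
        "min_coverage_pct": min(covs) if covs else None,
        "max_coverage_pct": max(covs) if covs else None,
        "connected_runs": len(connected),
        "total_runs": len(points),
    }
```

keeping disconnected runs out of the mean means an outage shows up as a lower connected_runs count, not as a misleadingly low average.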
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
per-relay monitor objects designed for phi (bluesky bot) to poll
and post about meaningful status transitions. shape:
[{
  "name": "...",
  "status": "nominal"|"degraded"|"critical",
  "headline": "...",
  "metrics": { ... },
  "last_changed": "<iso8601>",
  "checked_at": "<iso8601>"
}, ...]
status logic:
- critical: no connection in recent runs OR short/baseline < 0.70
- degraded: short/baseline < 0.90
- nominal: otherwise
coverage is computed as unique_dids / max(unique_dids) per run
(self-normalizing — replay-heavy relays inflate only their own
coverage above 1.0 instead of deflating everyone else's).
short window = last 3 valid runs (~15 min); baseline = last 24
(~2h). disconnected runs excluded from the mean.
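the status rules can be sketched as below. this is an illustration, not the shipped code: the function name and input shape are hypothetical, and "no connection in recent runs" is read here as "none of the last 3 runs connected", which the source does not spell out.

```python
from statistics import mean

SHORT_RUNS = 3      # ~15 min at a 5-minute cadence
BASELINE_RUNS = 24  # ~2 h


def classify(runs):
    """Classify one relay from an oldest-first list of (connected, coverage)."""
    recent = runs[-SHORT_RUNS:]
    # Disconnected runs are excluded from both means.
    valid = [cov for connected, cov in runs if connected]
    # Assumption: "recent" means the last SHORT_RUNS runs.
    if not valid or not any(connected for connected, _ in recent):
        return "critical"
    short = mean(valid[-SHORT_RUNS:])
    baseline = mean(valid[-BASELINE_RUNS:])
    ratio = short / baseline if baseline else 0.0
    if ratio < 0.70:
        return "critical"
    if ratio < 0.90:
        return "degraded"
    return "nominal"
```

comparing the short window against the relay's own baseline means a relay that always sits at 95% stays nominal, while a sudden drop from its usual level is flagged.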
status transitions are tracked in a new monitor_state table so
phi can judge "is this old news" via last_changed.
public read, no auth. adding a monitored relay happens naturally
when a host appears in a run — no phi-side config change needed.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
two specific zlay fixes surfaced during 2026-04-17 attack recovery:
1. listActiveHostsImpl should zero last_seq when updated_at is
   stale (> 1h). reactivating a long-dormant host with its last
   cursor triggers a multi-week replay storm that contends with
   the startup spawn loop via the shared DbRequestQueue.
   observed: 3.79M/s persist_order_spins, 3000 fps ingest,
   27 workers/min spawn rate instead of normal ~120/min.
2. host_authority should gate broadcast on cache miss, not flag
   post-hoc. current optimistic "cache miss = pass through +
   background resolve" lets forged DIDs get one broadcast each
   before rejection. observed via relay-eval: zlay emits ~740
   unresolvable DIDs/5min that every strict relay drops.
plus a minor observability nit (chain-break log rate-limit).
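the two proposed behaviours, sketched in Python purely to pin down the logic. zlay itself is not Python, and every name here (`resume_cursor`, `may_broadcast`, the `resolve` callback, the dict cache) is hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)


def resume_cursor(last_seq, updated_at, now):
    """Return the cursor to reconnect with; a stale cursor is zeroed."""
    return 0 if now - updated_at > STALE_AFTER else last_seq


def may_broadcast(did, cache, resolve):
    """Gate on cache miss: resolve first, then decide.

    `cache` maps DID -> bool verdict; `resolve` is a blocking lookup
    that returns the verdict for an unseen DID.
    """
    if did not in cache:
        cache[did] = resolve(did)
    return cache[did]
```

the second function is the strict variant: a forged DID is rejected on first sight, at the cost of a blocking resolve on every cache miss.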
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
argparse-style wrapper for zlay's /admin/* HTTP API. fetches the
bearer token from the k8s secret, manages a port-forward for the
call, exposes admin endpoints as subcommands:
  list-hosts [--status STATUS] [--json]
  block-host HOSTNAME
  unblock-host HOSTNAME
  change-limits HOSTNAME [--account-limit N]
  ban-repo DID
  resync DID HOSTNAME
  resync-status
  backfill-status
  audit [--json]    # status + worker/delivering summary
the token never touches argv or the shell. required when operating
zlay during incidents — used during 2026-04-17 attack recovery to
audit the host roster, identify exhausted legitimate PDSes, and
confirm worker-vs-active deltas.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
the rest of the tracked relays all run indigo. this completes the
impl column so every row links to its source repo instead of
rendering em-dash.
also adds a proper ops entry for relay.t4tlabs.net (was falling
through to the auto-derived default).
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
blacksky-algorithms/rsky is a Rust atproto implementation that
includes a relay (per the repo topics + description). adds rsky
to the impls lookup and attributes atproto.africa accordingly.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- adds relay.klbr.net to the scheduled eval relay list, attributed
to @klbr.net (operator-submitted via tangled issue #1)
- adds an `impl` field to ops entries and an implementation column
to the snapshot + all-time coverage tables. links to the impl's
source repo via the `impls` lookup at the top of the script
filled in initial impls where confident:
- zlay.waow.tech → zlay
- relay.waow.tech → indigo
- relay.klbr.net → hydrant (per issue reporter)
other hosts render an em-dash — future PRs can fill them in as
operators confirm their codebase.
deploy with `just relay-eval deploy` to sync the updated service
unit + embedded HTML to the relay-eval server.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
full day journey from canary1 failure through external review,
broadcaster starvation investigation, rollback to b91382b, and
engineer handoff. 04-10 threaded-resolution doc captures the fix
path (frame queueing rebase without reintroducing the host_authority
reject regression).
individual writeups:
- canary1-failure — initial regression observation
- zlay-external-review — reviewer-facing narrative of the
pre-rollback state (spawn path + resolver pool contention)
- zlay-broadcaster-starvation — wrong mitigation (move writeLoop
to pool_io, vetoed as cross-Io crash class) but symptom
measurements are accurate
- zlay-handoff-rollback — narrowed bug window + rollback rationale
- zlay-threaded-resolution — 04-10 engineer-side resolution
ops-changelog 04-09 section stitches them together chronologically.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
`just relay-eval eval <hosts> [window]` runs a single evaluation
against bsky.network without waiting for the scheduled timer —
useful for canary validation and quick deltas after a redeploy.
`just relay-eval coverage-history` dumps the full runs table as
CSV for offline analysis.
relay-eval and relay-history skills document the SSH vs public-API
split: the former needs the server (all-runs trigger), the latter
is pure curl against /api/trend.
docs/relay-eval-recipes.md collects the common query patterns in
one place so we stop rediscovering them.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
wraps the port-forward + curl + metrics-parsing dance behind
`just zlay probe {health|delivery|metrics|delta|sweep}` so the
operator diagnostic path is reproducible across sessions. the
hydrant smoke test recipe also gained a full-network flag, signature
verification, and PASS/FAIL stats parsing.
liveness/readiness probes in zlay-values.yaml are relaxed
(initialDelay 300s, timeout 15s, failureThreshold 20) to survive
the ~20min cold-start when the PDS subscriber spawn loop contends
with HTTP fibers. see docs/zlay-external-review-2026-04-09.md for
the full context; tighten again once the spawn/resolver path is
fixed.
.claude/skills/zlay-diagnose documents when to reach for each
probe subcommand. .claude/settings.local.json gitignored since
it's a per-host permissions file.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
records the 2026-04-07 investigation: root cause (borrowed hostname
slice freed by slurper teardown while in-flight FrameWorks held it),
the dupe-at-submit fix shipped as 1eec324, the follow-up refactor
that added a regression test, and the stress-test validation
metrics (broadcast_queue_full_total 3B → 0 after fix).
also notes the known follow-ups (persist_order global lock,
SharedFrame allocation hot path) as non-urgent optimizations to
revisit after consumer-side canary.
- ops-changelog: add 2026-04-06 entry (zat validation hardening), update
2026-04-05 entry with full websocket bug story and Evented redeploy
- architecture: update zlay to reflect Evented backend (~47 threads, ~1.2 GiB)
- zlay justfile: add test-hydrant, test-tap, test-consumers recipes
- zlay reconnect cronjob: fix IndexError on malformed HTTP response
- zlay dashboard: grafana panel improvements from previous session
- indigo reconnect cronjob: changes from previous session
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ops-changelog: add 2026-03-27 lightrail cutover entry
- architecture: replace collectiondir section with lightrail
- backfill: rewrite for lightrail's self-managed backfill
- deploying: remove `just indigo backfill` command
- README: update collectiondir row to lightrail
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lightrail (fig's Rust collection directory) replaces the Go collectiondir
sidecar, which had unbounded memory growth (~1.4 GiB and climbing toward
its OOM limit). lightrail validates sync 1.1 commit proofs, removes repos
on collection deletion, and has a configurable fjall cache.
- add Dockerfile.lightrail, helm values, and ServiceMonitor
- route listReposByCollection to lightrail in ingress
- replace collectiondir grafana panels with lightrail metrics
- remove collectiondir backfill recipe from justfile
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
upstream atproto-scraping moved from state.json to dist/instances.json,
leaving the old URL with only cursor data (no pdses key). phase 1 was
silently discovering 0 hosts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 99bc68ccd096841a549433f69fbf5ebe3ddfe29c.
This reverts commit 9bf6a88c3d6c778e9802c7d8b8bb308d9e823fdc.
single port definition since metrics are served on the same port.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- collectiondir-values.yaml: helm values for shadow collectiondir service
(port 2510, 10Gi PVC, connects to zlay firehose internally)
- collectiondir-servicemonitor.yaml: prometheus scrape config
- justfile: deploy-collectiondir, logs-collectiondir recipes +
publish-remote auto-updates collectiondir deployment if it exists
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- filter runs to only show those where >50% of relays stayed connected
- add getDiffCounts/getDiffSamples for aggregated diff reporting
- wrap run insertion in a transaction for atomicity
- add t4tlabs relay to eval list
- add sqlite indexes on run_id foreign keys
- ops changelog: AT Protocol account number context
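a sketch of the transactional insert and the run_id index, using Python's sqlite3 for illustration. the table names runs and relay_stats appear elsewhere in this log; the column names, the index name, and `insert_run` are assumptions.

```python
import sqlite3

# Hypothetical schema: only the table names come from the source.
SCHEMA = """
CREATE TABLE runs (id INTEGER PRIMARY KEY, started_at TEXT NOT NULL);
CREATE TABLE relay_stats (
    run_id INTEGER NOT NULL REFERENCES runs(id),
    host TEXT NOT NULL,
    unique_dids INTEGER NOT NULL
);
CREATE INDEX idx_relay_stats_run_id ON relay_stats(run_id);
"""


def insert_run(conn, started_at, stats):
    """Insert a run and its per-relay rows atomically.

    `stats` is a list of (host, unique_dids). The connection context
    manager commits on success and rolls back if any insert raises.
    """
    with conn:
        cur = conn.execute(
            "INSERT INTO runs (started_at) VALUES (?)", (started_at,)
        )
        run_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO relay_stats (run_id, host, unique_dids) VALUES (?, ?, ?)",
            [(run_id, host, dids) for host, dids in stats],
        )
    return run_id
```

with the transaction in place, a failure partway through leaves no run row without its stats.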
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
safety headroom after OOM investigation. MALLOC_ARENA_MAX=4 reduces
contention for 2800+ threads while limiting fragmentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
when a relay reports >1.3x the median event count (likely replaying
after a restart), exclude it from the effective union used to compute
coverage percentages. prevents a single replaying relay from dragging
all other relays' coverage scores down.
applies to both the dashboard snapshot view and the OG SVG image.
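a sketch of the exclusion rule, assuming each relay's run is summarised as an event count plus a set of DIDs. the data shape and function name are hypothetical, and scoring a replaying relay against the same effective union is an assumption; the source only says it is left out of the union.

```python
from statistics import median

REPLAY_FACTOR = 1.3


def coverage(relays):
    """Return {name: coverage fraction} with replaying relays excluded
    from the union that serves as the denominator.

    `relays` maps name -> {"events": int, "dids": set}.
    """
    med = median(r["events"] for r in relays.values())
    replaying = {
        name for name, r in relays.items() if r["events"] > REPLAY_FACTOR * med
    }
    union = set().union(
        *(r["dids"] for name, r in relays.items() if name not in replaying)
    )
    return {
        name: (len(r["dids"] & union) / len(union) if union else 0.0)
        for name, r in relays.items()
    }
```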
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- dynamic SVG at /og.svg with leaderboard from latest run data
- rasterized PNG at /og via rsvg-convert (ExecStartPost in eval timer)
- OG meta tags in dashboard HTML for link previews
- add relay-eval section + pulsar attribution to repo README
- add build.zig, terraform, systemd units, justfile, classifier
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
replace linear density-window scaling with quantile/rank mapping.
values are spaced by their rank in the distribution, not their
absolute distance — the dense cluster near 100% gets most of the
canvas while outliers are clearly separated at the bottom.
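a minimal sketch of rank mapping, assuming ties share one position and positions are spread evenly over [0, 1]. the function name and the tie handling are illustrative choices, not taken from the dashboard code.

```python
def rank_scale(values):
    """Map each distinct value to a Y position in [0, 1] by its rank.

    Spacing depends only on order, so a tight cluster and a distant
    outlier each take one equal step.
    """
    distinct = sorted(set(values))
    denom = max(len(distinct) - 1, 1)  # avoid dividing by zero for one value
    return {v: i / denom for i, v in enumerate(distinct)}
```

for example, the values 0, 46, 98, 98.5, 99 land at 0, 0.25, 0.5, 0.75, 1: the three readings near 100% get as much vertical room as the two outliers.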
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
find the tightest range containing 60% of data points and use
that for the Y scale. outliers (0%, 46%) get clamped to the
bottom edge instead of compressing the 98-99% cluster into
an unreadable band.
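one way to find that range is a sliding window over the sorted values. this is a sketch under the assumption that "60% of data points" rounds up to a whole count; the names are hypothetical.

```python
import math


def densest_window(values, fraction=0.6):
    """Return (lo, hi) of the narrowest range holding `fraction` of the points."""
    if not values:
        raise ValueError("no data points")
    xs = sorted(values)
    k = max(1, math.ceil(fraction * len(xs)))  # points the window must hold
    # Slide a k-point window across the sorted data and keep the narrowest.
    best = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


def clamp(value, lo, hi):
    """Pin out-of-range values to the nearest edge of the Y scale."""
    return max(lo, min(hi, value))
```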
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- trend lines use 3-layer glow (halo + medium + core) for subtle lightning energy
- hover on trend: crosshair, highlights full line, dims others, shows tooltip
with relay name, coverage %, and timestamp
- previous runs nav: horizontal scroll instead of awkward wrapping
- add .gitignore for zig build artifacts
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- trend canvas: fixed behind content, no background fill, lines-only
glow at low opacity (0.25 glow / 0.45 core) emerging from darkness
- glass panel: 80% opacity with backdrop-filter blur, no displacement
- mobile (<640px): hide run-by/bar/detail columns, smaller spacing,
operator legend 2-col, disable css tooltips, shorter trend canvas
- tables wrapped in overflow-x container for edge cases
- summary card uses semi-transparent bg to match glass aesthetic
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add /api/trend endpoint returning coverage data across recent runs
- store.zig: getTrendData() joins runs + relay_stats for last 48 runs
- full-width canvas at top draws multi-relay coverage lines with glow
- additive blending on fills creates stained-glass color layering
- glass-morphism panel (backdrop-filter, semi-transparent bg) floats
over the trend, letting the glow bleed through
- canvas auto-scales Y axis to data range, handles missing data points
- responsive: redraws on window resize with cached data
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- replace classification sentences with compact horizontal legend + tooltips
- ago() now shows "2h 10m ago" instead of "2h ago" for better run differentiation
- run nav shows local clock time alongside relative time
- subtle CSS refinements: box-shadow, transitions, antialiasing, smoother radii
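the new ago() format, sketched in Python. the dashboard's own helper is not shown in this log, so the signature and the handling of sub-hour and whole-hour values are assumptions.

```python
def ago(seconds):
    """Format an age in seconds as hours and minutes."""
    minutes = int(seconds) // 60
    hours, mins = divmod(minutes, 60)
    if hours and mins:
        return f"{hours}h {mins}m ago"
    if hours:
        return f"{hours}h ago"
    return f"{mins}m ago"
```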
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sort purely by unique_dids descending. operator symbols and run-by
column already convey who runs what — positional grouping was
distorting the leaderboard.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bluesky → Bluesky PBC, linked to bsky.app profile
- relay hostnames are now clickable links to https://{host}
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bluesky → Bluesky PBC, linked to safety.bsky.app profile
- relay hostnames are now clickable links to https://{host}
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
blackskyweb.xyz, not blacksky.dev
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flag relays with >1.5x median event count with a warning icon and
explanation. catches replay-after-restart scenarios where one relay
reports significantly more events than peers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- operator legend uses 3-column grid (no more horizontal overflow)
- all operator names link to bsky.app profiles
- added "run by" column to coverage table with linked names
- follows Pulsar's pattern: @handle → bsky.app/profile/handle
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
drop the "offline" word — dim rows with 0 events instead. the data
speaks for itself without a misleading status label.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
was_connected tracks whether the collector ever connected, since the
connected flag gets reset to false on shutdown before main reads it.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the web server couldn't start while an eval run held the SQLite lock.
WAL mode allows concurrent readers + one writer. busy_timeout gives
5 seconds of retry before failing.
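the same two settings expressed with Python's sqlite3, as an illustration of the pragmas involved. `open_db` is a hypothetical helper.

```python
import sqlite3


def open_db(path):
    """Open a database with WAL journaling and a 5-second busy timeout."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Retry for up to 5000 ms on a locked database before failing.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

journal_mode=WAL is stored in the database file, so it only needs to succeed once; busy_timeout is per connection and has to be set on every open.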
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- use unicode symbols (◆ ▲ ■ ● etc.) per operator instead of colored CSS dots
- remove confusing connection-status dots; offline relays get dimmed + tag
- human-readable timestamps ("3h ago") with exact UTC on hover
- plain english window ("5 minutes" not "300s")
- cleaner visual hierarchy, slimmer coverage bars
- better classification names in breakdown section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- group relays by operator with colored symbols (matches pulsar's "run by")
- "real gaps" → "missed (active)" — factual, not vague
- "can't resolve" → "unresolvable", "inactive" → "deactivated"
- operator legend, breakdown sorted by most missed first
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
record the strict-validation-on-cache-miss work, external review
findings, and rationale for tabling. change lives on zlay branch
nate/strict-validation-on-cache-miss, not deployed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- bumped per-host account limits for 16 over-limit hosts via changeLimits API
- fixed reconnect cronjob phase 2: /admin/pds/list fetch was missing auth headers
- added RELAY_DEFAULT_ACCOUNT_LIMIT=10000 env var to prevent recurrence
ref: https://bsky.app/profile/bnewbold.net/post/3mgtbaicg322f
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
remove root-level context.md and TODO.md (content migrated to
docs/ops-changelog.md). add evan's relay-compare tool. update
reconnect cronjobs (phase 2: bsky.network host sync), indigo
dashboard/values/justfile changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
chain breaks are expected during reconnects and cache misses, not
errors. move them out of the errors panel into a dedicated "chain
continuity" panel with a % of received overlay.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- add memory attribution + leak rate grafana panels, fix thread stack estimate (128K → 1M)
- add reconnect cronjob + publish-gpa recipe to justfile
- add ops-changelog.md documenting debugging sessions
- reduce indigo ident cache from 2M to 500K
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
the restructure made indigo and zlay first-class peers but the readme
still had indigo as the title and sole focus. rewrite the intro with a
comparison table showing both deployments side by side.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
move from flat deploy/ and infra/ directories with zlay-prefixed recipes
to symmetric indigo/ and zlay/ peer directories, each with their own
justfile, deploy configs, and terraform. shared configs (cluster-issuer,
postgres-values, grafana-ingress) go to shared/deploy/.
recipes are now invoked as `just indigo <recipe>` / `just zlay <recipe>`
using just's mod feature (1.36.0+). no helm values or terraform resources
changed — purely a repo organization change.
key fix: use source_directory() instead of justfile_directory() in module
justfiles, since justfile_directory() returns the root in modules.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>