declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay
14
fork

Configure Feed

Select the types of activity you want to include in your feed.

zlay: add zlay-probe tool + relaxed liveness for cold-start

wraps the port-forward + curl + metrics-parsing dance behind
`just zlay probe {health|delivery|metrics|delta|sweep}` so the
operator diagnostic path is reproducible across sessions. the
hydrant smoke test recipe also gained full-network flag, sig
verification, and PASS/FAIL stats parsing.

liveness/readiness probes in zlay-values.yaml are relaxed
(initialDelay 300s, timeout 15s, failureThreshold 20) to survive
the ~20min cold-start when the PDS subscriber spawn loop contends
with HTTP fibers. see docs/zlay-external-review-2026-04-09.md for
the full context; tighten again once the spawn/resolver path is
fixed.

.claude/skills/zlay-diagnose documents when to reach for each
probe subcommand. .claude/settings.local.json gitignored since
it's a per-host permissions file.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

authored by

zzstoatzz
Claude Opus 4 (1M context)
and committed by
Tangled
98366c1d 7f275078

+572 -15
+48
.claude/skills/zlay-diagnose/SKILL.md
··· 1 + --- 2 + name: zlay-diagnose 3 + description: probe zlay health, firehose delivery, and internal metrics via `just zlay probe`. use when checking if zlay is alive, delivering, or degrading. 4 + --- 5 + 6 + probe zlay via `just zlay probe {health|delivery|metrics|delta|sweep}`. context (if any): $ARGUMENTS 7 + 8 + ## commands 9 + 10 + ```bash 11 + just zlay probe health # public /_health + /xrpc/_health (no port-forward) 12 + just zlay probe delivery [sec] # raw wss frame count against public ingress 13 + just zlay probe metrics # port-forward + snapshot key counters 14 + just zlay probe delta [sec] # two snapshots, print rate deltas 15 + just zlay probe sweep [sec] # all of the above in one shot 16 + ``` 17 + 18 + default window is 15s. use 30-60 for reports. 19 + 20 + ## interpreting output 21 + 22 + ### health block 23 + - 200 in <1s = healthy 24 + - 500 `"database unavailable"` = DbRequestQueue jammed (postgres is probably fine) 25 + - 503 = ingress dropped the pod (readiness failing) 26 + - timeout = HTTP fiber hang 27 + 28 + ### delivery block 29 + - >100 fps = broadcaster delivering 30 + - 0 fps with successful connect = broadcaster starvation 31 + 32 + ### delta block — alarm thresholds 33 + | signal | healthy | alarm | 34 + |---|---|---| 35 + | `pool_queued_bytes` | 0 | non-zero and rising | 36 + | `persist_order_spins_total` rate | ~3-15k/s | >1M/s | 37 + | `broadcast_queue_depth_hwm` | flat | actively climbing | 38 + 39 + `delivery ratio` line is only meaningful when `consumers_active >= 1`. 40 + 41 + ## when the script isn't enough 42 + 43 + thread state (no ps/ss in container): 44 + ```bash 45 + POD=$(kubectl --kubeconfig=zlay/kubeconfig.yaml -n zlay get pod -l app.kubernetes.io/instance=zlay -o name | head -1) 46 + kubectl ... exec $POD -- sh -c 'for t in /proc/1/task/*; do read a < $t/stat; echo "$a" | awk "{for(i=NF;i>0;i--) if(\$i ~ /^[A-Z]\$/){print \$i; exit}}"; done | sort | uniq -c' 47 + ``` 48 + healthy: 47 threads, mostly S, 0 D.
+3
.gitignore
··· 21 21 # local scratch files (content folded into docs/ops-changelog.md) 22 22 context.md 23 23 TODO.md 24 + 25 + # user-local claude harness settings (shared skills live under .claude/skills/) 26 + .claude/settings.local.json
+446
scripts/zlay-probe
··· 1 + #!/usr/bin/env -S PYTHONUNBUFFERED=1 uv run --script --quiet 2 + # /// script 3 + # requires-python = ">=3.12" 4 + # dependencies = ["websockets>=13"] 5 + # /// 6 + """ 7 + zlay operator probe — the one command for "what is zlay doing right now". 8 + 9 + replaces the ad-hoc `kubectl port-forward ... & sleep 2; curl ...; kill %1` 10 + dance with a single script that handles port-forward lifecycle, metric 11 + parsing, delivery measurement, and delta rendering. 12 + 13 + subcommands: 14 + health public /_health + describeServer (no port-forward) 15 + delivery [seconds=15] raw wss frame count against public ingress 16 + metrics port-forward :3001, snapshot, print key table 17 + delta [seconds=15] port-forward :3001, two snapshots, print delta 18 + sweep [seconds=15] health + delivery + delta (the refined sweep) 19 + 20 + usage: 21 + just zlay probe health 22 + just zlay probe delivery 30 23 + just zlay probe metrics 24 + just zlay probe delta 20 25 + just zlay probe sweep 26 + 27 + env: 28 + ZLAY_DOMAIN (default: zlay.waow.tech) 29 + KUBECONFIG (default: <repo>/zlay/kubeconfig.yaml; set by the 30 + zlay justfile automatically) 31 + ZLAY_NAMESPACE (default: zlay) 32 + """ 33 + 34 + from __future__ import annotations 35 + 36 + import asyncio 37 + import os 38 + import re 39 + import signal 40 + import subprocess 41 + import sys 42 + import time 43 + import urllib.error 44 + import urllib.request 45 + from contextlib import contextmanager 46 + from dataclasses import dataclass 47 + from pathlib import Path 48 + 49 + # ---------------- config ---------------- 50 + 51 + DOMAIN = os.environ.get("ZLAY_DOMAIN", "zlay.waow.tech") 52 + NAMESPACE = os.environ.get("ZLAY_NAMESPACE", "zlay") 53 + METRICS_PORT = 3001 54 + 55 + # metrics captured in every snapshot. order = display order. 56 + KEY_METRICS: list[str] = [ 57 + # delivery 58 + "relay_frames_received_total", 59 + "relay_frames_broadcast_total", 60 + "relay_consumers_active", 61 + "relay_connected_inbound", 62 + "relay_workers_count", 63 + # host authority 64 + 'relay_host_authority_trigger{reason="is_new"}', 65 + 'relay_host_authority_trigger{reason="host_changed"}', 66 + "relay_host_authority_checks_total", 67 + 'relay_validation_failed{reason="host_authority"}', 68 + "relay_host_authority_time_us_total", 69 + # pipeline pressure 70 + "relay_pool_queued_bytes", 71 + "relay_pool_backpressure_total", 72 + "relay_persist_order_spins_total", 73 + "relay_broadcast_queue_depth_hwm", 74 + # process 75 + "relay_process_rss_bytes", 76 + "relay_vm_hwm_kb", 77 + "relay_rss_anon_kb", 78 + ] 79 + 80 + GREEN = "\033[32m" 81 + YELLOW = "\033[33m" 82 + RED = "\033[31m" 83 + DIM = "\033[2m" 84 + BOLD = "\033[1m" 85 + RESET = "\033[0m" 86 + 87 + 88 + # ---------------- util ---------------- 89 + 90 + 91 + def log(msg: str) -> None: 92 + print(msg, file=sys.stderr) 93 + 94 + 95 + def status(label: str, ok: bool, detail: str = "", warn: bool = False) -> None: 96 + if warn: 97 + mark = f"{YELLOW}warn{RESET}" 98 + elif ok: 99 + mark = f"{GREEN}ok{RESET}" 100 + else: 101 + mark = f"{RED}FAIL{RESET}" 102 + print(f" [{mark}] {label}{(' ' + detail) if detail else ''}") 103 + 104 + 105 + def find_kubeconfig() -> str | None: 106 + if "KUBECONFIG" in os.environ: 107 + return os.environ["KUBECONFIG"] 108 + # fallback: <repo>/zlay/kubeconfig.yaml relative to this script 109 + here = Path(__file__).resolve() 110 + candidate = here.parent.parent / "zlay" / "kubeconfig.yaml" 111 + return str(candidate) if candidate.exists() else None 112 + 113 + 114 + def parse_metrics(text: str) -> dict[str, float]: 115 + """Parse prometheus text format into {full_name_with_labels: value}.""" 116 + out: dict[str, float] = {} 117 + line_re = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+([-+\d.eE]+)") 118 + for line in text.splitlines(): 119 + if not line or line.startswith("#"): 120 + continue 121 + m = line_re.match(line) 122 + if not m: 123 + continue 124 + name = m.group(1) + (m.group(2) or "") 125 + try: 126 + out[name] = float(m.group(3)) 127 + except ValueError: 128 + continue 129 + return out 130 + 131 + 132 + # ---------------- port-forward ---------------- 133 + 134 + 135 + @contextmanager 136 + def port_forward(local_port: int, remote_port: int): 137 + """Launch kubectl port-forward, wait for readiness, tear down on exit.""" 138 + kubeconfig = find_kubeconfig() 139 + if not kubeconfig: 140 + raise SystemExit( 141 + "no kubeconfig — set KUBECONFIG or run via `just zlay probe ...`" 142 + ) 143 + cmd = [ 144 + "kubectl", 145 + f"--kubeconfig={kubeconfig}", 146 + "-n", 147 + NAMESPACE, 148 + "port-forward", 149 + "deploy/zlay", 150 + f"{local_port}:{remote_port}", 151 + ] 152 + proc = subprocess.Popen( 153 + cmd, 154 + stdout=subprocess.DEVNULL, 155 + stderr=subprocess.PIPE, 156 + start_new_session=True, # own process group; we kill the whole group 157 + ) 158 + try: 159 + # wait up to 5s for port to accept connections 160 + url = f"http://127.0.0.1:{local_port}/metrics" 161 + deadline = time.time() + 5.0 162 + while time.time() < deadline: 163 + if proc.poll() is not None: 164 + err = (proc.stderr.read() or b"").decode(errors="replace").strip() 165 + raise SystemExit( 166 + f"port-forward exited before becoming ready: {err or '(no stderr)'}" 167 + ) 168 + try: 169 + with urllib.request.urlopen(url, timeout=0.5) as r: 170 + if r.status == 200: 171 + break 172 + except (urllib.error.URLError, ConnectionError, TimeoutError): 173 + pass 174 + time.sleep(0.1) 175 + else: 176 + raise SystemExit( 177 + f"port-forward never became ready on :{local_port} within 5s" 178 + ) 179 + yield url 180 + finally: 181 + try: 182 + os.killpg(os.getpgid(proc.pid), signal.SIGTERM) 183 + except (ProcessLookupError, PermissionError): 184 + pass 185 + try: 186 + proc.wait(timeout=2) 187 + except subprocess.TimeoutExpired: 188 + try: 189 + os.killpg(os.getpgid(proc.pid), signal.SIGKILL) 190 + except ProcessLookupError: 191 + pass 192 + 193 + 194 + def fetch_metrics(url: str) -> dict[str, float]: 195 + with urllib.request.urlopen(url, timeout=5) as r: 196 + return parse_metrics(r.read().decode("utf-8", errors="replace")) 197 + 198 + 199 + # ---------------- subcommands ---------------- 200 + 201 + 202 + def cmd_health(_args: list[str]) -> int: 203 + """Tight-timeout probe of public endpoints. No port-forward.""" 204 + print(f"{BOLD}public health — https://{DOMAIN}{RESET}") 205 + targets = [ 206 + ("/_health", f"https://{DOMAIN}/_health"), 207 + ("/xrpc/_health", f"https://{DOMAIN}/xrpc/_health"), 208 + ( 209 + "describeServer", 210 + f"https://{DOMAIN}/xrpc/com.atproto.sync.describeServer", 211 + ), 212 + ] 213 + # 2xx = ok; 4xx = warn (server responded, endpoint missing/mismatched); 214 + # 5xx / timeout / conn refused = fail (HTTP fiber in trouble) 215 + any_fail = False 216 + for label, url in targets: 217 + t0 = time.time() 218 + try: 219 + with urllib.request.urlopen(url, timeout=5) as r: 220 + dt = time.time() - t0 221 + body = r.read(200).decode(errors="replace") 222 + status( 223 + f"{label:20s}", 224 + ok=True, 225 + detail=f"{r.status} in {dt:.2f}s{DIM} {body.strip()[:60]}{RESET}", 226 + ) 227 + except urllib.error.HTTPError as e: 228 + dt = time.time() - t0 229 + if 400 <= e.code < 500: 230 + status(f"{label:20s}", ok=False, warn=True, 231 + detail=f"HTTP {e.code} in {dt:.2f}s (endpoint not served)") 232 + else: 233 + status(f"{label:20s}", ok=False, 234 + detail=f"HTTP {e.code} in {dt:.2f}s") 235 + any_fail = True 236 + except Exception as e: 237 + dt = time.time() - t0 238 + status(f"{label:20s}", ok=False, 239 + detail=f"{type(e).__name__} after {dt:.2f}s") 240 + any_fail = True 241 + return 1 if any_fail else 0 242 + 243 + 244 + async def _delivery(seconds: int) -> tuple[int, float]: 245 + import websockets 246 + 247 + url = f"wss://{DOMAIN}/xrpc/com.atproto.sync.subscribeRepos" 248 + n = 0 249 + async with websockets.connect(url, max_size=None) as ws: 250 + # start the measurement window after handshake, not before — 251 + # otherwise a slow TLS connect inflates the denominator. 252 + t0 = time.time() 253 + 254 + async def count_frames() -> None: 255 + nonlocal n 256 + while True: 257 + await ws.recv() 258 + n += 1 259 + 260 + try: 261 + await asyncio.wait_for(count_frames(), timeout=seconds) 262 + except asyncio.TimeoutError: 263 + pass 264 + return n, time.time() - t0 265 + 266 + 267 + def cmd_delivery(args: list[str]) -> int: 268 + seconds = int(args[0]) if args else 15 269 + print( 270 + f"{BOLD}firehose delivery — wss://{DOMAIN}/xrpc/com.atproto.sync.subscribeRepos{RESET}" 271 + ) 272 + log(f" listening for {seconds}s...") 273 + try: 274 + n, elapsed = asyncio.run(_delivery(seconds)) 275 + except Exception as e: 276 + print(f" [{RED}FAIL{RESET}] connection: {type(e).__name__}: {e}") 277 + return 1 278 + fps = n / max(0.01, elapsed) 279 + ok = fps >= 50 # any meaningful delivery; broken pods show ~0.03 fps 280 + status( 281 + "received", 282 + ok, 283 + f"{n} frames / {elapsed:.1f}s = {BOLD}{fps:.0f} fps{RESET}", 284 + ) 285 + return 0 if ok else 1 286 + 287 + 288 + def _fmt_value(v: float) -> str: 289 + if v >= 1e9: 290 + return f"{v/1e9:.2f}G" 291 + if v >= 1e6: 292 + return f"{v/1e6:.2f}M" 293 + if v >= 1e3: 294 + return f"{v/1e3:.1f}k" 295 + if v == int(v): 296 + return str(int(v)) 297 + return f"{v:.2f}" 298 + 299 + 300 + def _print_snapshot(vals: dict[str, float]) -> None: 301 + print(f"{BOLD}{'metric':<52}{'value':>14}{RESET}") 302 + for m in KEY_METRICS: 303 + v = vals.get(m) 304 + if v is None: 305 + print(f" {DIM}{m:<50}{'— (missing)':>14}{RESET}") 306 + else: 307 + print(f" {m:<50}{_fmt_value(v):>14}") 308 + 309 + 310 + def cmd_metrics(_args: list[str]) -> int: 311 + print(f"{BOLD}zlay /metrics snapshot — ns={NAMESPACE}{RESET}") 312 + with port_forward(METRICS_PORT, METRICS_PORT) as url: 313 + vals = fetch_metrics(url) 314 + _print_snapshot(vals) 315 + return 0 316 + 317 + 318 + @dataclass 319 + class DeltaRow: 320 + name: str 321 + t0: float | None 322 + t1: float | None 323 + delta: float | None 324 + rate: float | None 325 + 326 + 327 + def _compute_delta( 328 + a: dict[str, float], b: dict[str, float], dt: float 329 + ) -> list[DeltaRow]: 330 + rows: list[DeltaRow] = [] 331 + for m in KEY_METRICS: 332 + va = a.get(m) 333 + vb = b.get(m) 334 + if va is None or vb is None: 335 + rows.append(DeltaRow(m, va, vb, None, None)) 336 + continue 337 + d = vb - va 338 + rows.append(DeltaRow(m, va, vb, d, d / max(0.001, dt))) 339 + return rows 340 + 341 + 342 + def _print_delta(rows: list[DeltaRow], dt: float) -> None: 343 + print( 344 + f"{BOLD}{'metric':<52}{'t0':>14}{'t1':>14}{'Δ':>12}{'rate/s':>12}{RESET}" 345 + ) 346 + for r in rows: 347 + if r.t0 is None or r.t1 is None: 348 + print(f" {DIM}{r.name:<50}{'—':>14}{'—':>14}{'—':>12}{'—':>12}{RESET}") 349 + continue 350 + # delta color by sign 351 + c = "" 352 + if r.delta is not None: 353 + if r.delta > 0: 354 + c = YELLOW 355 + elif r.delta < 0: 356 + c = RED 357 + print( 358 + f" {r.name:<50}" 359 + f"{_fmt_value(r.t0):>14}" 360 + f"{_fmt_value(r.t1):>14}" 361 + f"{c}{_fmt_value(r.delta):>12}{RESET}" 362 + f"{_fmt_value(r.rate):>12}" 363 + ) 364 + print(f" {DIM}(window: {dt:.1f}s){RESET}") 365 + 366 + 367 + def cmd_delta(args: list[str]) -> int: 368 + seconds = int(args[0]) if args else 15 369 + print(f"{BOLD}zlay /metrics delta — {seconds}s window{RESET}") 370 + with port_forward(METRICS_PORT, METRICS_PORT) as url: 371 + a = fetch_metrics(url) 372 + t0 = time.time() 373 + time.sleep(seconds) 374 + b = fetch_metrics(url) 375 + dt = time.time() - t0 376 + rows = _compute_delta(a, b, dt) 377 + _print_delta(rows, dt) 378 + # delivery ratio if consumer was attached at any point 379 + rcv_a = a.get("relay_frames_received_total") 380 + rcv_b = b.get("relay_frames_received_total") 381 + bcs_a = a.get("relay_frames_broadcast_total") 382 + bcs_b = b.get("relay_frames_broadcast_total") 383 + if ( 384 + rcv_a is not None 385 + and rcv_b is not None 386 + and bcs_a is not None 387 + and bcs_b is not None 388 + ): 389 + rcv_d = rcv_b - rcv_a 390 + bcs_d = bcs_b - bcs_a 391 + ratio = 100.0 * bcs_d / rcv_d if rcv_d > 0 else 0.0 392 + consumer = b.get("relay_consumers_active", 0) 393 + tag = "consumer attached" if consumer >= 1 else "no consumer" 394 + print( 395 + f"\n {BOLD}delivery ratio{RESET} ({tag}): " 396 + f"{_fmt_value(bcs_d)}/{_fmt_value(rcv_d)} = {ratio:.1f}%" 397 + ) 398 + return 0 399 + 400 + 401 + def cmd_sweep(args: list[str]) -> int: 402 + seconds = int(args[0]) if args else 15 403 + print(f"{BOLD}=== zlay sweep ==={RESET}\n") 404 + h = cmd_health([]) 405 + print() 406 + d = cmd_delivery([str(seconds)]) 407 + print() 408 + dl = cmd_delta([str(seconds)]) 409 + print() 410 + all_ok = (h == 0) and (d == 0) and (dl == 0) 411 + print( 412 + f"{BOLD}sweep: {GREEN if all_ok else RED}" 413 + f"{'GREEN' if all_ok else 'DEGRADED'}{RESET}" 414 + ) 415 + return 0 if all_ok else 1 416 + 417 + 418 + # ---------------- main ---------------- 419 + 420 + 421 + SUBCOMMANDS = { 422 + "health": cmd_health, 423 + "delivery": cmd_delivery, 424 + "metrics": cmd_metrics, 425 + "delta": cmd_delta, 426 + "sweep": cmd_sweep, 427 + } 428 + 429 + 430 + def main(argv: list[str]) -> int: 431 + if len(argv) < 2 or argv[1] in ("-h", "--help", "help"): 432 + print(__doc__, file=sys.stderr) 433 + return 0 if len(argv) >= 2 else 2 434 + sub = argv[1] 435 + if sub not in SUBCOMMANDS: 436 + print(f"unknown subcommand: {sub}", file=sys.stderr) 437 + print(f"available: {', '.join(SUBCOMMANDS)}", file=sys.stderr) 438 + return 2 439 + try: 440 + return SUBCOMMANDS[sub](argv[2:]) 441 + except KeyboardInterrupt: 442 + return 130 443 + 444 + 445 + if __name__ == "__main__": 446 + sys.exit(main(sys.argv))
+15 -6
zlay/deploy/zlay-values.yaml
··· 25 25 - secretRef: 26 26 name: zlay-secret 27 27 probes: 28 + # NOTE 2026-04-09: these values are relaxed beyond what `/_healthz` 29 + # would normally need. the cold-start spawn loop (~2,770 PDS 30 + # subscribers, batched 50/100ms, with keep_alive=false on the 31 + # host_authority resolver pool doing fresh ~500ms TLS handshakes 32 + # per is_new DID) contends with the HTTP server fibers enough to 33 + # delay probe responses past the default 5s timeout for the first 34 + # ~20+ minutes of uptime. see docs/zlay-external-review-2026-04-09.md 35 + # for the full context. tighten these again once the spawn path 36 + # and/or resolver pool are fixed so probes can assume fast response. 28 37 liveness: 29 38 enabled: true 30 39 custom: true ··· 32 41 httpGet: 33 42 path: /_healthz 34 43 port: 3000 35 - initialDelaySeconds: 30 44 + initialDelaySeconds: 300 36 45 periodSeconds: 30 37 - timeoutSeconds: 5 38 - failureThreshold: 10 46 + timeoutSeconds: 15 47 + failureThreshold: 20 39 48 readiness: 40 49 enabled: true 41 50 custom: true ··· 43 52 httpGet: 44 53 path: /_readyz 45 54 port: 3000 46 - initialDelaySeconds: 10 55 + initialDelaySeconds: 60 47 56 periodSeconds: 15 48 - timeoutSeconds: 5 49 - failureThreshold: 5 57 + timeoutSeconds: 15 58 + failureThreshold: 20 50 59 resources: 51 60 requests: 52 61 memory: 1Gi
+60 -9
zlay/justfile
··· 247 247 : "${ZLAY_DOMAIN:?set ZLAY_DOMAIN}" 248 248 curl -sf "https://$ZLAY_DOMAIN/_health" | jq . 249 249 250 + # operator probe — one tool for "what is zlay doing right now" 251 + # subcommands: health | delivery [sec] | metrics | delta [sec] | sweep [sec] 252 + # wraps port-forward + curl + parsing so you never have to retype the dance. 253 + # see .claude/skills/zlay-diagnose/SKILL.md for when to reach for which one. 254 + probe *ARGS: 255 + ../scripts/zlay-probe {{ ARGS }} 256 + 250 257 # get the grafana admin password from the cluster 251 258 grafana-password: 252 259 @kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d && echo 253 260 254 261 # --- consumer smoke tests --- 255 262 256 - # run hydrant (strict indexer) against zlay for N seconds (default 15) 257 - test-hydrant seconds="15": 263 + # run hydrant (strict Rust indexer, full sig verification) against zlay 264 + # verifies the firehose is delivering valid, properly signed events. 265 + # usage: 266 + # just zlay test-hydrant # 30s, default 267 + # just zlay test-hydrant 60 # longer run 268 + # requires: cargo build --release in /tmp/hydrant (tangled.org/ptr.pet/hydrant) 269 + test-hydrant seconds="30": 258 270 #!/usr/bin/env bash 259 271 set -euo pipefail 260 272 : "${ZLAY_DOMAIN:?set ZLAY_DOMAIN}" 273 + HYDRANT=/tmp/hydrant/target/release/hydrant 274 + if [ ! -x "$HYDRANT" ]; then 275 + echo "hydrant not built. run:" 276 + echo " git clone https://tangled.org/ptr.pet/hydrant /tmp/hydrant" 277 + echo " cd /tmp/hydrant && cargo build --release" 278 + exit 1 279 + fi 261 280 TMPDIR=$(mktemp -d) 262 - trap "rm -rf $TMPDIR" EXIT 263 - echo "==> running hydrant against wss://$ZLAY_DOMAIN for {{ seconds }}s" 281 + PID="" 282 + trap 'rm -rf $TMPDIR; [ -n "$PID" ] && kill $PID 2>/dev/null || true' EXIT 283 + PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()') 284 + echo "==> hydrant: consuming wss://$ZLAY_DOMAIN for {{ seconds }}s (full sig verification)" 264 285 HYDRANT_RELAY_HOSTS="wss://$ZLAY_DOMAIN" \ 265 286 HYDRANT_DATABASE_PATH="$TMPDIR/hydrant.db" \ 266 287 HYDRANT_EPHEMERAL=true \ 267 - HYDRANT_API_PORT=0 \ 288 + HYDRANT_FULL_NETWORK=true \ 289 + HYDRANT_API_PORT=$PORT \ 268 290 HYDRANT_ENABLE_CRAWLER=false \ 269 291 RUST_LOG=info \ 270 - /tmp/hydrant/target/release/hydrant & 292 + "$HYDRANT" > "$TMPDIR/hydrant.log" 2>&1 & 271 293 PID=$! 272 294 sleep {{ seconds }} 273 - kill $PID 2>/dev/null 274 - wait $PID 2>/dev/null || true 275 - echo "==> hydrant exited cleanly" 295 + echo "" 296 + echo "==> stats:" 297 + STATS=$(curl -sf "http://127.0.0.1:$PORT/stats" 2>/dev/null || echo '{}') 298 + EVENTS=$(echo "$STATS" | python3 -c "import json,sys; print(json.load(sys.stdin).get('counts',{}).get('events',0))" 2>/dev/null || echo "?") 299 + ERRORS=$(echo "$STATS" | python3 -c "import json,sys; c=json.load(sys.stdin).get('counts',{}); print(c.get('error_generic',0) + c.get('error_transport',0))" 2>/dev/null || echo "?") 300 + REPOS=$(echo "$STATS" | python3 -c "import json,sys; print(json.load(sys.stdin).get('counts',{}).get('repos',0))" 2>/dev/null || echo "?") 301 + echo " events indexed: $EVENTS" 302 + echo " repos touched: $REPOS" 303 + echo " errors: $ERRORS" 304 + FPS=$(python3 -c "print(f'{int(${EVENTS:-0})/{{ seconds }}:.0f}')" 2>/dev/null || echo "?") 305 + echo " rate: ~${FPS} events/sec" 306 + echo "" 307 + # filter known hydrant cosmetic bugs (count underflow is a hydrant issue, not zlay) 308 + LOG_ERRORS=$(grep -iE 'error|panic' "$TMPDIR/hydrant.log" | grep -vc 'count underflow' || true) 309 + LOG_WARNS=$(grep -c 'count underflow' "$TMPDIR/hydrant.log" || true) 310 + if [ "$LOG_ERRORS" -gt 0 ]; then 311 + echo "==> log errors ($LOG_ERRORS lines):" 312 + grep -iE 'error|panic' "$TMPDIR/hydrant.log" | grep -v 'count underflow' | head -10 313 + elif [ "$LOG_WARNS" -gt 0 ]; then 314 + echo "==> log: $LOG_WARNS known hydrant warnings (count underflow — not a zlay issue)" 315 + else 316 + echo "==> log: clean" 317 + fi 318 + kill $PID 2>/dev/null; wait $PID 2>/dev/null || true; PID="" 319 + if [ "${ERRORS:-0}" = "0" ] && [ "$LOG_ERRORS" -le 0 ]; then 320 + echo "" 321 + echo "==> PASS: hydrant consumed $EVENTS events with full sig verification, $ERRORS errors" 322 + else 323 + echo "" 324 + echo "==> FAIL: $ERRORS API errors + $LOG_ERRORS log errors" 325 + exit 1 326 + fi 276 327 277 328 # run tap (Go firehose consumer) against zlay for N seconds (default 15) 278 329 test-tap seconds="15":