declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay

docs + ops: update for Evented redeploy, zat v0.3.0-alpha.21, consumer smoke tests

- ops-changelog: add 2026-04-06 entry (zat validation hardening), update
2026-04-05 entry with full websocket bug story and Evented redeploy
- architecture: update zlay to reflect Evented backend (~47 threads, ~1.2 GiB)
- zlay justfile: add test-hydrant, test-tap, test-consumers recipes
- zlay reconnect cronjob: fix IndexError on malformed HTTP response
- zlay dashboard: grafana panel improvements from previous session
- indigo reconnect cronjob: changes from previous session

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zzstoatzz 932844f3 f84f63df

+559 -622
+5 -4
docs/architecture.md
···

 **split ports.** 3000 for the WebSocket firehose, 3001 for HTTP (health, stats, metrics, admin, XRPC). indigo serves everything on port 2470 (with metrics on 2471).

-**OS threads, not goroutines.** one thread per PDS host subscription, one per downstream consumer. predictable memory (no GC), but thread count scales linearly with host count.
+**fibers, not goroutines.** zig 0.16 `Io.Evented` backend runs ~2,800 subscriber tasks on ~47 OS threads via io_uring fibers. requires ReleaseFast due to a zig stdlib GPF in fiber context switching under ReleaseSafe (tracked via `scripts/repro_evented.zig`). predictable memory (no GC).

 ### deployment

···

 | metric | value |
 |--------|-------|
-| connected PDS hosts | ~2,780 |
+| connected PDS hosts | ~2,830 |
+| OS threads | ~47 (Evented backend, io_uring fibers) |
 | collection index DIDs | ~30.4M (backfill 1,017/1,287 collections) |
-| memory (steady state) | ~2.9 GiB |
-| memory limit | 5 GiB |
+| memory (steady state) | ~1.2 GiB (zig 0.16, Evented/ReleaseFast) |
+| memory limit | 10 GiB |
 | PVC | 20 GiB |
 | `listReposByCollection` max limit | 1000 |
+175 -3
docs/ops-changelog.md
···

 ---

+## 2026-04-06
+
+### zat v0.3.0-alpha.21: CBOR/CAR/MST validation hardening
+
+deployed b91382b with zat v0.3.0-alpha.21 (Evented/ReleaseFast). this release
+adds strict validation to all decode paths:
+
+- CBOR: rejects non-minimal encodings, validates UTF-8, enforces map key sort
+  order, rejects duplicate keys (per RFC 8949)
+- CAR: tighter v1 header validation, varint overflow protection
+- MST: integer overflow fix in findKey/deleteFromNode at layer 0
+- ECDSA: overflow fix in signature verification
+- safe integer arithmetic (std.math.add/cast) replaces raw + throughout
+
+expected throughput drop: ~290k → ~202k fps decode+verify. still 13x faster
+than Go.
+
+monitored for 15 minutes after deploy: 0 decode errors, 0 restarts. ran both
+hydrant (strict Rust indexer with full sig verification) and tap (Go firehose
+consumer) against the relay — both consumed cleanly with no rejections.
+
+also fixed: reconnect cronjob crash on empty/malformed HTTP responses from zlay
+(added IndexError/ValueError to the except clause in response parser).
+
+---
+
+## 2026-04-05
+
+### zlay 0.16 migration: Evented deployed, crashed, fixed, redeployed
+
+migrated zlay from zig 0.15 (`Io.Threaded`, thread-per-PDS) to zig 0.16
+(`Io.Evented`, io_uring fibers). the Evented backend runs ~2,800 subscriber
+tasks on ~47 OS threads — a 60x reduction in thread count.
+
+**deploy timeline:**
+
+| commit | backend | build | result |
+|--------|---------|-------|--------|
+| fe8a08c | Evented | ReleaseSafe | SIGSEGV on startup (zig stdlib GPF in fiber contextSwitch) |
+| 949e9a7 | Evented | ReleaseFast | boot loop — `queue_full` metric stuck at max, zero frames broadcast |
+| e9c7b96 | Evented | ReleaseFast | same — DB contention under Evented blocked the pipeline |
+| c3bc3be | Evented | ReleaseFast | **first stable Evented deploy** — DbRequestQueue decoupled DB from fibers |
+| b8ef148 | Evented | ReleaseFast | tooBig fix — stable initially, then 13 SIGSEGVs in 12 hours |
+| 42e1019 | Threaded | ReleaseSafe | **reverted to Threaded** — stable, but revealed real crash cause |
+| 02434de | Evented | ReleaseFast | **websocket fix** — TCP split mid-CRLF bug fixed, back on Evented |
+
+**the real crash cause**: the 13 SIGSEGVs on b8ef148 were NOT the fiber GPF —
+they were a websocket library bug. switching to Threaded/ReleaseSafe gave us the
+stack trace: `thread 543 panic: start index 1370 is larger than end index 1369`
+at `websocket.zig/src/client/client.zig:766`. the library assumed `\r\n` always
+arrives in a single TCP read. when TCP splits mid-CRLF, `line_start` advances
+past `pos` and the next `buf[line_start..pos]` slice has start > end. under
+ReleaseFast: silent memory corruption → SIGSEGV every 30-90 min. fix was
+`if (line_start > pos) break;` in the websocket parser.
+
+**the Evented GPF** (separate issue): `std.Io.fiber.contextSwitch` (x86_64
+inline asm saving/restoring rsp/rbp/rip) GPFs under ReleaseSafe immediately.
+confirmed with `scripts/repro_evented.zig` on kernel 6.8.0-101-generic.
+this forces Evented builds to use ReleaseFast, which loses bounds checking
+and safety checks. this is a zig stdlib bug, not zlay code.
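The mid-CRLF split from the crash analysis above can be illustrated with a small sketch. This is neither the websocket.zig code nor the one-line guard that shipped; it is a hypothetical Python line splitter (the name `split_header_lines` is invented for illustration) that keeps the same invariant in a different way: a header line is emitted only after both bytes of `\r\n` have arrived, so a read that ends on a lone `\r` leaves the partial line buffered instead of advancing past the valid data.

```python
from typing import Iterable, Iterator


def split_header_lines(chunks: Iterable[bytes]) -> Iterator[bytes]:
    # Hypothetical helper, not part of any real websocket library.
    # Each chunk stands for one TCP read; a chunk boundary may fall
    # between the CR and the LF of a line terminator.
    buf = b""
    for chunk in chunks:
        buf += chunk
        line_start = 0
        while True:
            end = buf.find(b"\r\n", line_start)
            if end == -1:
                # Terminator incomplete (possibly a trailing lone CR):
                # stop and wait for the next read.
                break
            yield buf[line_start:end]
            line_start = end + 2  # safe: both terminator bytes are present
        # Keep only the unconsumed tail for the next read.
        buf = buf[line_start:]


if __name__ == "__main__":
    # The first read ends right after the CR; the LF arrives in the second.
    reads = [b"HTTP/1.1 101 Switching Protocols\r", b"\nUpgrade: websocket\r\n\r\n"]
    for line in split_header_lines(reads):
        print(line)
```

Searching for the full two-byte terminator means `line_start` can never move beyond the bytes already received, which is the condition the shipped guard checks explicitly.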
+
+**tooBig protocol compliance** (b8ef148): `tooBig` is a required boolean
+in `#commit` frames per the `subscribeRepos` lexicon. both indigo and rsky
+always serialize it. zlay's passthrough relay could omit it if the upstream
+PDS omitted it. fix: `resequenceFrame` now injects `tooBig: false` when
+missing from `#commit` frames. verified with hydrant (0 decode errors over
+15 seconds).
+
+**metric naming issue**: `relay_workers_count` tracks active subscriber
+tasks (hashmap entries), not OS threads. under Evented, this read ~2,800
+while actual OS threads were 47. under Threaded, the numbers happen to
+match. the metric name is misleading — should be renamed to
+`relay_subscriber_tasks` or similar.
+
+**current state**: 02434de, Evented/ReleaseFast. stable at ~47 threads,
+~1.2 GiB RSS. the websocket fix resolved the intermittent SIGSEGVs.
+
+**what's needed for ReleaseSafe on Evented**: upstream zig fix for the
+fiber `contextSwitch` GPF. tracked via `scripts/repro_evented.zig`.
+
+### grafana dashboard improvements
+
+rewrote `zlay/deploy/zlay-dashboard.json`:
+
+1. **build info + uptime panel** — stat panel showing current git SHA,
+   optimization level, and uptime with color thresholds. uses `instant: true`
+   to avoid showing historical build_info series from previous deploys.
+
+2. **consumers_active in broadcast queue** — added as third series alongside
+   pipeline contention metrics.
+
+3. **collapsible row groups** — overview, memory, operations, pipeline.
+
+4. **fixed memory attribution panel** — the old model stacked
+   `arena + mmap + stacks` which double-counted (glibc mmap reports virtual
+   address space, not RSS; arena already includes in-use + free). replaced
+   with `malloc in-use + malloc free + other(RSS - arena)` which is
+   guaranteed to sum to RSS since `in_use + free = arena` exactly.
+
+5. **fixed malloc breakdown panel** — added RSS as white reference line,
+   made mmap dashed with "(virtual, not RSS)" label.
+
+### lightrail steady state on relay.waow.tech
+
+lightrail has been running since 2026-03-27. resync is complete (~6.8M
+repos). steady-state memory is settling — monitor over next week to
+confirm it drops from the ~4 GiB resync peak.
+
+---
+
+## 2026-04-01
+
+### identified: ConsumerTooSlow — root cause of zlay's persistent ~1% relay-eval gap
+
+**problem**: zlay consistently sits at ~99.0% on relay-eval while top relays
+(bsky.network, relay.waow.tech) hit 99.5-99.8%. pulsar testing confirmed the
+cause: zlay kicks consumers with `ConsumerTooSlow` when their per-consumer ring
+buffer fills up.
+
+`BUFFER_CAP = 8192` in `broadcaster.zig:239` — at ~240 events/sec that's ~34
+seconds of buffer. pulsar got kicked after ~57 seconds of connection with
+repeated `ConsumerTooSlow (consumer buffer full)` errors. relay-eval's 300s
+measurement window means any disconnection during the window shows as missing
+DIDs.
+
+**not yet fixed** — see zat 0.16 migration notes below.
+
+### completed: zat 0.16 migration (std.Io)
+
+**status: done.** zlay is on zig 0.16, zat v0.3.0-alpha.16. the migration
+replaced all `libc` calls with zig 0.16's `std.Io` vtable. the `Io.Evented`
+backend (io_uring fibers) was tested in production — reduced ~2,800 OS
+threads to ~47 — but reverted to `Io.Threaded` due to a zig stdlib GPF
+in fiber context switching. see 2026-04-05 entry above for full details.
+
+the migration is architecturally complete. switching back to Evented is a
+one-line change (`const Backend = Io.Evented` in main.zig) once zig fixes
+the fiber GPF upstream
+
+### fix: reconnect cronjobs (both relays)
+
+both reconnect cronjobs were broken:
+1. mary-ext `state.json` format changed — `data.get("pdses", {})` returned 0
+   hosts (key no longer exists)
+2. phase 2 only submitted hosts missing from `our_hosts`, not re-announcing all
+   known hosts
+
+**fix**: simplified both cronjobs to a single phase — pull all active/idle hosts
+from `bsky.network/xrpc/com.atproto.sync.listHosts` and submit all of them.
+
+- indigo: 1,647/1,778 ok (131 dead hosts). relay.waow.tech recovered from
+  90.9% → 98.7%
+- zlay: 1,809/1,809 ok. already at 99%+ but ensures ongoing host discovery
+
+files: `indigo/deploy/reconnect-cronjob.yaml`, `zlay/deploy/zlay-reconnect-cronjob.yaml`
+
+---
+
 ## 2026-03-27

 ### replaced collectiondir with lightrail on relay.waow.tech
···
 files: event_log.zig (flushLocked, persist), frame_worker.zig:227,
 subscriber.zig:633

-### getRepo redirect
+### ConsumerTooSlow buffer sizing

-router.zig:68 has no getRepo. should redirect to the PDS hosting the
-repo (like indigo service.go:153). needs DID → host lookup + new handler.
+`BUFFER_CAP = 8192` in `broadcaster.zig:239` is too small at current throughput
+(~240 events/sec ≈ 34s buffer). consumers like relay-eval and pulsar get kicked.
+now that the 0.16 migration is complete (still Threaded — Evented reverted),
+bumping to 32768 or 65536 is the straightforward fix.
+
+### zig Evented fiber GPF (upstream)
+
+`Io.Evented` fiber `contextSwitch` GPFs under ReleaseSafe on x86_64
+(zig 0.16.0-dev.3059). repro: `scripts/repro_evented.zig`. blocks using
+the Evented backend for production. revisit when zig ships a fix.
+
+### relay_workers_count metric rename
+
+`relay_workers_count` tracks active subscriber tasks (hashmap entries), not
+OS threads. under Evented it read ~2,800 while actual threads were 47. should
+be renamed to `relay_subscriber_tasks` or similar to avoid confusion.

 ---

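The buffer-sizing figures in the ConsumerTooSlow notes above are simple arithmetic, sketched here in Python. The sketch assumes one ring-buffer slot per event, a steady ~240 events/sec as quoted in the changelog, and a consumer that has stalled completely; the helper name `buffer_seconds` is hypothetical and does not come from `broadcaster.zig`.

```python
EVENTS_PER_SEC = 240   # approximate firehose rate quoted in the changelog
EVAL_WINDOW_SEC = 300  # relay-eval measurement window quoted in the changelog


def buffer_seconds(cap: int, events_per_sec: float = EVENTS_PER_SEC) -> float:
    """Seconds of backlog a buffer of `cap` events can absorb.

    Assumes a fully stalled consumer and one slot per event.
    """
    return cap / events_per_sec


if __name__ == "__main__":
    # Current capacity plus the two candidate values from the changelog.
    for cap in (8192, 32768, 65536):
        secs = buffer_seconds(cap)
        covers = secs >= EVAL_WINDOW_SEC
        print(f"BUFFER_CAP={cap}: ~{secs:.0f}s of headroom, covers eval window: {covers}")
```

Under these assumptions the three capacities work out to roughly 34 s, 137 s, and 273 s. A consumer that is merely slow rather than stalled drains part of the backlog, so real headroom would be longer than this worst-case figure.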
+8 -29
indigo/deploy/reconnect-cronjob.yaml
···
 - |
   import json, urllib.request, urllib.error, time, os, base64, sys

-  PDS_LIST_URL = "https://raw.githubusercontent.com/mary-ext/atproto-scraping/refs/heads/trunk/state.json"
   RELAY_URL = "http://relay.relay.svc.cluster.local:2470"
   PASSWORD = os.environ["RELAY_ADMIN_PASSWORD"]
   AUTH = base64.b64encode(f"admin:{PASSWORD}".encode()).decode()
···
           time.sleep(0.05)
       print(f"{label}: {ok} ok, {errors} errors, {time.time() - start:.0f}s")

-  # phase 1: mary-ext scraping list (reconnect existing hosts)
-  print(f"phase 1: fetching PDS list from {PDS_LIST_URL}...")
-  with urllib.request.urlopen(PDS_LIST_URL, timeout=30) as resp:
-      data = json.loads(resp.read())
-  hosts = [url.rstrip("/") for url in data.get("pdses", {}).keys() if url.startswith("https://")]
-  print(f" {len(hosts)} hosts")
-  submit_hosts(hosts, "phase 1 (mary-ext)")
-
-  # phase 2: discover new hosts from bsky.network
-  print("\nphase 2: pulling hosts from bsky.network...")
-  our_hosts = set()
-  try:
-      req = urllib.request.Request(
-          f"{RELAY_URL}/admin/pds/list",
-          headers=HEADERS,
-      )
-      resp = urllib.request.urlopen(req, timeout=30)
-      for rec in json.loads(resp.read()):
-          our_hosts.add(rec["Host"])
-  except Exception as e:
-      print(f" warning: could not fetch our host list: {e}")
-
-  bsky_hosts = {}
+  # pull all active/idle hosts from bsky.network and re-announce them all
+  # (indigo's slurper gives up on hosts after repeated failures —
+  # requestCrawl is the only way to wake it back up)
+  print("fetching hosts from bsky.network...")
+  hosts = []
   cursor = ""
   while True:
       url = f"https://bsky.network/xrpc/com.atproto.sync.listHosts?limit=1000"
···
       for h in page.get("hosts", []):
           status = h.get("status", "unknown")
           if status in ("active", "idle"):
-              bsky_hosts[h["hostname"]] = status
+              hosts.append(h["hostname"])
       cursor = page.get("cursor", "")
       if not cursor:
           break

-  missing = [h for h in bsky_hosts if h not in our_hosts]
-  print(f" bsky.network: {len(bsky_hosts)} active/idle hosts, {len(missing)} new")
-  if missing:
-      submit_hosts(missing, "phase 2 (bsky.network)")
+  print(f" {len(hosts)} active/idle hosts")
+  submit_hosts(hosts, "bsky.network")
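The single-phase host collection in the diff above follows a cursor-pagination loop that can be sketched as a standalone function. The endpoint, the `limit` value, and the `hosts`, `hostname`, `status`, and `cursor` fields are taken from the diff; the function names `fetch_page` and `collect_hosts`, the injectable `fetch` parameter, and the way the cursor is appended to the URL (those lines are elided in the diff) are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

LIST_HOSTS_URL = "https://bsky.network/xrpc/com.atproto.sync.listHosts"


def fetch_page(cursor: str = "", limit: int = 1000) -> dict:
    """Fetch one listHosts page (hypothetical helper; performs a real HTTP GET)."""
    params = {"limit": str(limit)}
    if cursor:
        # Assumption: the cursor is passed as a query parameter.
        params["cursor"] = cursor
    url = f"{LIST_HOSTS_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read())


def collect_hosts(fetch=fetch_page, wanted=("active", "idle")) -> list:
    """Walk every page and keep hostnames whose status is in `wanted`."""
    hosts = []
    cursor = ""
    while True:
        page = fetch(cursor)
        for h in page.get("hosts", []):
            if h.get("status", "unknown") in wanted:
                hosts.append(h["hostname"])
        cursor = page.get("cursor", "")
        if not cursor:
            break  # no cursor means this was the last page
    return hosts
```

Passing `fetch` as a parameter lets the loop be exercised against canned pages, for example `collect_hosts(fetch=lambda cursor: pages[cursor])` with a dict of fake responses, without contacting bsky.network.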
+279 -528
zlay/deploy/zlay-dashboard.json
··· 8 8 "links": [], 9 9 "panels": [ 10 10 { 11 - "title": "throughput", 12 - "type": "timeseries", 13 - "gridPos": { 14 - "h": 8, 15 - "w": 8, 16 - "x": 0, 17 - "y": 0 18 - }, 19 - "datasource": { 20 - "type": "prometheus", 21 - "uid": "prometheus" 22 - }, 11 + "title": "overview", 12 + "type": "row", 13 + "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, 14 + "collapsed": false, 15 + "panels": [] 16 + }, 17 + { 18 + "title": "build", 19 + "type": "stat", 20 + "gridPos": { "h": 8, "w": 4, "x": 0, "y": 1 }, 21 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 23 22 "fieldConfig": { 24 23 "defaults": { 25 - "unit": "ops", 26 - "color": { 27 - "mode": "palette-classic" 28 - }, 29 - "custom": { 30 - "fillOpacity": 15, 31 - "lineWidth": 2, 32 - "spanNulls": false 24 + "color": { "mode": "fixed", "fixedColor": "text" }, 25 + "thresholds": { "steps": [{ "color": "text", "value": null }] } 26 + }, 27 + "overrides": [ 28 + { 29 + "matcher": { "id": "byName", "options": "uptime" }, 30 + "properties": [ 31 + { "id": "unit", "value": "s" }, 32 + { "id": "color", "value": { "mode": "thresholds" } }, 33 + { "id": "thresholds", "value": { "steps": [ 34 + { "color": "red", "value": null }, 35 + { "color": "yellow", "value": 300 }, 36 + { "color": "green", "value": 3600 } 37 + ]}} 38 + ] 33 39 } 34 - }, 35 - "overrides": [] 40 + ] 41 + }, 42 + "options": { 43 + "colorMode": "value", 44 + "graphMode": "none", 45 + "textMode": "value_and_name", 46 + "reduceOptions": { "calcs": ["lastNotNull"] } 36 47 }, 37 48 "targets": [ 38 49 { 39 - "expr": "sum(rate(relay_frames_received_total{job=\"zlay\"}[5m]))", 40 - "legendFormat": "received", 41 - "refId": "A" 50 + "expr": "max(relay_build_info{job=\"zlay\"}) by (git_sha, optimize)", 51 + "legendFormat": "{{git_sha}} {{optimize}}", 52 + "refId": "A", 53 + "instant": true 42 54 }, 43 55 { 44 - "expr": "sum(rate(relay_frames_broadcast_total{job=\"zlay\"}[5m]))", 45 - "legendFormat": "broadcast", 46 - "refId": "B" 56 + "expr": 
"max(relay_uptime_seconds{job=\"zlay\"})", 57 + "legendFormat": "uptime", 58 + "refId": "B", 59 + "instant": true 47 60 } 48 61 ] 49 62 }, 50 63 { 51 64 "title": "connected PDS hosts", 52 65 "type": "stat", 53 - "gridPos": { 54 - "h": 8, 55 - "w": 8, 56 - "x": 8, 57 - "y": 0 58 - }, 59 - "datasource": { 60 - "type": "prometheus", 61 - "uid": "prometheus" 62 - }, 66 + "gridPos": { "h": 8, "w": 6, "x": 4, "y": 1 }, 67 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 63 68 "fieldConfig": { 64 69 "defaults": { 65 - "color": { 66 - "mode": "thresholds" 67 - }, 70 + "color": { "mode": "thresholds" }, 68 71 "thresholds": { 69 72 "steps": [ 70 - { 71 - "color": "red", 72 - "value": null 73 - }, 74 - { 75 - "color": "yellow", 76 - "value": 500 77 - }, 78 - { 79 - "color": "green", 80 - "value": 1000 81 - } 73 + { "color": "red", "value": null }, 74 + { "color": "yellow", "value": 500 }, 75 + { "color": "green", "value": 1000 } 82 76 ] 83 77 } 84 78 }, ··· 87 81 "options": { 88 82 "colorMode": "value", 89 83 "graphMode": "area", 90 - "reduceOptions": { 91 - "calcs": [ 92 - "lastNotNull" 93 - ] 94 - } 84 + "reduceOptions": { "calcs": ["lastNotNull"] } 95 85 }, 96 86 "targets": [ 97 87 { ··· 102 92 ] 103 93 }, 104 94 { 95 + "title": "throughput", 96 + "type": "timeseries", 97 + "gridPos": { "h": 8, "w": 8, "x": 10, "y": 1 }, 98 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 99 + "fieldConfig": { 100 + "defaults": { 101 + "unit": "ops", 102 + "color": { "mode": "palette-classic" }, 103 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 104 + }, 105 + "overrides": [] 106 + }, 107 + "targets": [ 108 + { 109 + "expr": "sum(rate(relay_frames_received_total{job=\"zlay\"}[5m]))", 110 + "legendFormat": "received", 111 + "refId": "A" 112 + }, 113 + { 114 + "expr": "sum(rate(relay_frames_broadcast_total{job=\"zlay\"}[5m]))", 115 + "legendFormat": "broadcast", 116 + "refId": "B" 117 + } 118 + ] 119 + }, 120 + { 105 121 "title": 
"validation/sec", 106 122 "type": "timeseries", 107 - "gridPos": { 108 - "h": 8, 109 - "w": 8, 110 - "x": 16, 111 - "y": 0 112 - }, 113 - "datasource": { 114 - "type": "prometheus", 115 - "uid": "prometheus" 116 - }, 123 + "gridPos": { "h": 8, "w": 6, "x": 18, "y": 1 }, 124 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 117 125 "fieldConfig": { 118 126 "defaults": { 119 127 "unit": "ops", 120 - "color": { 121 - "mode": "palette-classic" 122 - }, 123 - "custom": { 124 - "fillOpacity": 15, 125 - "lineWidth": 2, 126 - "spanNulls": false 127 - } 128 + "color": { "mode": "palette-classic" }, 129 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 128 130 }, 129 131 "overrides": [] 130 132 }, ··· 147 149 ] 148 150 }, 149 151 { 152 + "title": "memory", 153 + "type": "row", 154 + "gridPos": { "h": 1, "w": 24, "x": 0, "y": 9 }, 155 + "collapsed": false, 156 + "panels": [] 157 + }, 158 + { 150 159 "title": "malloc breakdown", 151 160 "type": "timeseries", 152 - "gridPos": { 153 - "h": 8, 154 - "w": 12, 155 - "x": 0, 156 - "y": 8 157 - }, 158 - "datasource": { 159 - "type": "prometheus", 160 - "uid": "prometheus" 161 - }, 161 + "gridPos": { "h": 8, "w": 12, "x": 0, "y": 10 }, 162 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 162 163 "fieldConfig": { 163 164 "defaults": { 164 165 "unit": "bytes", 165 - "color": { 166 - "mode": "palette-classic" 166 + "color": { "mode": "palette-classic" }, 167 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 168 + }, 169 + "overrides": [ 170 + { 171 + "matcher": { "id": "byName", "options": "RSS" }, 172 + "properties": [ 173 + { "id": "custom.fillOpacity", "value": 0 }, 174 + { "id": "custom.lineWidth", "value": 3 }, 175 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "white" } } 176 + ] 167 177 }, 168 - "custom": { 169 - "fillOpacity": 15, 170 - "lineWidth": 2, 171 - "spanNulls": false 178 + { 179 + "matcher": { "id": "byName", "options": "mmap (virtual, not 
RSS)" }, 180 + "properties": [ 181 + { "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [6, 6] } }, 182 + { "id": "custom.fillOpacity", "value": 0 } 183 + ] 172 184 } 173 - }, 174 - "overrides": [] 185 + ] 175 186 }, 176 187 "targets": [ 177 - { 178 - "expr": "max(relay_malloc_arena_bytes{job=\"zlay\"})", 179 - "legendFormat": "arena (claimed from OS)", 180 - "refId": "A" 181 - }, 182 - { 183 - "expr": "max(relay_malloc_in_use_bytes{job=\"zlay\"})", 184 - "legendFormat": "in-use (allocated)", 185 - "refId": "B" 186 - }, 187 - { 188 - "expr": "max(relay_malloc_free_bytes{job=\"zlay\"})", 189 - "legendFormat": "free (fragmentation)", 190 - "refId": "C" 191 - }, 192 - { 193 - "expr": "max(relay_malloc_mmap_bytes{job=\"zlay\"})", 194 - "legendFormat": "mmap (large blocks)", 195 - "refId": "D" 196 - } 188 + { "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", "legendFormat": "RSS", "refId": "E" }, 189 + { "expr": "max(relay_malloc_arena_bytes{job=\"zlay\"})", "legendFormat": "arena (sbrk pool)", "refId": "A" }, 190 + { "expr": "max(relay_malloc_in_use_bytes{job=\"zlay\"})", "legendFormat": "in-use (allocated)", "refId": "B" }, 191 + { "expr": "max(relay_malloc_free_bytes{job=\"zlay\"})", "legendFormat": "free (fragmentation)", "refId": "C" }, 192 + { "expr": "max(relay_malloc_mmap_bytes{job=\"zlay\"})", "legendFormat": "mmap (virtual, not RSS)", "refId": "D" } 197 193 ] 198 194 }, 199 195 { 200 196 "title": "process memory", 201 197 "type": "timeseries", 202 - "gridPos": { 203 - "h": 8, 204 - "w": 12, 205 - "x": 12, 206 - "y": 8 207 - }, 208 - "datasource": { 209 - "type": "prometheus", 210 - "uid": "prometheus" 211 - }, 198 + "gridPos": { "h": 8, "w": 12, "x": 12, "y": 10 }, 199 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 212 200 "fieldConfig": { 213 201 "defaults": { 214 202 "unit": "bytes", 215 - "color": { 216 - "mode": "palette-classic" 217 - }, 218 - "custom": { 219 - "fillOpacity": 15, 220 - "lineWidth": 2, 221 - "spanNulls": 
false 222 - } 203 + "color": { "mode": "palette-classic" }, 204 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 223 205 }, 224 206 "overrides": [ 225 207 { 226 - "matcher": { 227 - "id": "byName", 228 - "options": "limit" 229 - }, 208 + "matcher": { "id": "byName", "options": "limit" }, 230 209 "properties": [ 231 - { 232 - "id": "custom.lineStyle", 233 - "value": { 234 - "fill": "dash", 235 - "dash": [ 236 - 10, 237 - 10 238 - ] 239 - } 240 - }, 241 - { 242 - "id": "custom.fillOpacity", 243 - "value": 0 244 - }, 245 - { 246 - "id": "color", 247 - "value": { 248 - "mode": "fixed", 249 - "fixedColor": "red" 250 - } 251 - } 210 + { "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [10, 10] } }, 211 + { "id": "custom.fillOpacity", "value": 0 }, 212 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "red" } } 252 213 ] 253 214 } 254 215 ] 255 216 }, 256 217 "targets": [ 257 - { 258 - "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", 259 - "legendFormat": "RSS total", 260 - "refId": "A" 261 - }, 262 - { 263 - "expr": "max(relay_rss_anon_kb{job=\"zlay\"}) * 1024", 264 - "legendFormat": "RssAnon (heap+stack)", 265 - "refId": "B" 266 - }, 267 - { 268 - "expr": "max(relay_vm_hwm_kb{job=\"zlay\"}) * 1024", 269 - "legendFormat": "VmHWM (peak RSS)", 270 - "refId": "C" 271 - }, 272 - { 273 - "expr": "max(kube_pod_container_resource_limits{namespace=\"zlay\",pod=~\"zlay-[a-z0-9].*\",container=\"main\",resource=\"memory\"})", 274 - "legendFormat": "limit", 275 - "refId": "D" 276 - } 218 + { "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", "legendFormat": "RSS total", "refId": "A" }, 219 + { "expr": "max(relay_rss_anon_kb{job=\"zlay\"}) * 1024", "legendFormat": "RssAnon (heap+stack)", "refId": "B" }, 220 + { "expr": "max(relay_vm_hwm_kb{job=\"zlay\"}) * 1024", "legendFormat": "VmHWM (peak RSS)", "refId": "C" }, 221 + { "expr": 
"max(kube_pod_container_resource_limits{namespace=\"zlay\",pod=~\"zlay-[a-z0-9].*\",container=\"main\",resource=\"memory\"})", "legendFormat": "limit", "refId": "D" } 277 222 ] 278 223 }, 279 224 { 280 225 "title": "memory attribution", 281 226 "type": "timeseries", 282 - "gridPos": { 283 - "h": 8, 284 - "w": 12, 285 - "x": 0, 286 - "y": 16 287 - }, 288 - "datasource": { 289 - "type": "prometheus", 290 - "uid": "prometheus" 291 - }, 227 + "gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 }, 228 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 292 229 "fieldConfig": { 293 230 "defaults": { 294 231 "unit": "bytes", 295 - "color": { 296 - "mode": "palette-classic" 297 - }, 298 - "custom": { 299 - "fillOpacity": 15, 300 - "lineWidth": 2, 301 - "spanNulls": false, 302 - "stacking": { 303 - "mode": "normal" 304 - } 305 - } 232 + "color": { "mode": "palette-classic" }, 233 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false, "stacking": { "mode": "normal" } } 306 234 }, 307 235 "overrides": [ 308 236 { 309 - "matcher": { 310 - "id": "byName", 311 - "options": "RSS" 312 - }, 237 + "matcher": { "id": "byName", "options": "RSS" }, 313 238 "properties": [ 314 - { 315 - "id": "custom.fillOpacity", 316 - "value": 0 317 - }, 318 - { 319 - "id": "custom.lineWidth", 320 - "value": 3 321 - }, 322 - { 323 - "id": "custom.stacking", 324 - "value": { 325 - "mode": "none" 326 - } 327 - }, 328 - { 329 - "id": "color", 330 - "value": { 331 - "mode": "fixed", 332 - "fixedColor": "white" 333 - } 334 - } 239 + { "id": "custom.fillOpacity", "value": 0 }, 240 + { "id": "custom.lineWidth", "value": 3 }, 241 + { "id": "custom.stacking", "value": { "mode": "none" } }, 242 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "white" } } 335 243 ] 336 244 }, 337 245 { 338 - "matcher": { 339 - "id": "byName", 340 - "options": "unaccounted" 341 - }, 246 + "matcher": { "id": "byName", "options": "other (stacks + mmap resident)" }, 342 247 "properties": [ 343 - { 344 
- "id": "custom.fillOpacity", 345 - "value": 30 346 - }, 347 - { 348 - "id": "color", 349 - "value": { 350 - "mode": "fixed", 351 - "fixedColor": "red" 352 - } 353 - } 248 + { "id": "custom.fillOpacity", "value": 10 }, 249 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "semi-dark-blue" } } 354 250 ] 355 251 } 356 252 ] 357 253 }, 358 254 "targets": [ 359 - { 360 - "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", 361 - "legendFormat": "RSS", 362 - "refId": "A" 363 - }, 364 - { 365 - "expr": "max(relay_validator_cache_map_cap{job=\"zlay\"}) * 48", 366 - "legendFormat": "validator cache", 367 - "refId": "B" 368 - }, 369 - { 370 - "expr": "max(relay_did_cache_map_cap{job=\"zlay\"}) * 48", 371 - "legendFormat": "DID cache", 372 - "refId": "C" 373 - }, 374 - { 375 - "expr": "max(relay_queued_set_map_cap{job=\"zlay\"}) * 40", 376 - "legendFormat": "resolve set", 377 - "refId": "D" 378 - }, 379 - { 380 - "expr": "max(relay_outbuf_cap{job=\"zlay\"})", 381 - "legendFormat": "outbuf", 382 - "refId": "E" 383 - }, 384 - { 385 - "expr": "max(relay_evtbuf_cap{job=\"zlay\"}) * 256", 386 - "legendFormat": "evtbuf", 387 - "refId": "F" 388 - }, 389 - { 390 - "expr": "max(relay_workers_count{job=\"zlay\"}) * 1048576", 391 - "legendFormat": "thread stacks (~1M used each)", 392 - "refId": "G" 393 - }, 394 - { 395 - "expr": "max(relay_malloc_arena_bytes{job=\"zlay\"}) + max(relay_malloc_mmap_bytes{job=\"zlay\"})", 396 - "legendFormat": "malloc (arena+mmap)", 397 - "refId": "H" 398 - }, 399 - { 400 - "expr": "max(relay_process_rss_bytes{job=\"zlay\"}) - max(relay_malloc_arena_bytes{job=\"zlay\"}) - max(relay_malloc_mmap_bytes{job=\"zlay\"}) - max(relay_workers_count{job=\"zlay\"}) * 1048576", 401 - "legendFormat": "unaccounted", 402 - "refId": "I" 403 - } 255 + { "expr": "max(relay_process_rss_bytes{job=\"zlay\"})", "legendFormat": "RSS", "refId": "A" }, 256 + { "expr": "max(relay_malloc_in_use_bytes{job=\"zlay\"})", "legendFormat": "malloc in-use", "refId": "B" }, 257 
+ { "expr": "max(relay_malloc_free_bytes{job=\"zlay\"})", "legendFormat": "malloc free (fragmentation)", "refId": "C" }, 258 + { "expr": "max(relay_process_rss_bytes{job=\"zlay\"}) - max(relay_malloc_arena_bytes{job=\"zlay\"})", "legendFormat": "other (stacks + mmap resident)", "refId": "D" } 404 259 ] 405 260 }, 406 261 { 407 262 "title": "leak rate", 408 263 "type": "timeseries", 409 - "gridPos": { 410 - "h": 8, 411 - "w": 12, 412 - "x": 12, 413 - "y": 16 414 - }, 415 - "datasource": { 416 - "type": "prometheus", 417 - "uid": "prometheus" 418 - }, 264 + "gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 }, 265 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 419 266 "fieldConfig": { 420 267 "defaults": { 421 268 "unit": "Bps", 422 - "color": { 423 - "mode": "palette-classic" 424 - }, 425 - "custom": { 426 - "fillOpacity": 15, 427 - "lineWidth": 2, 428 - "spanNulls": false 429 - } 269 + "color": { "mode": "palette-classic" }, 270 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 430 271 }, 431 272 "overrides": [] 432 273 }, 433 274 "targets": [ 434 - { 435 - "expr": "deriv(relay_process_rss_bytes{job=\"zlay\"}[10m])", 436 - "legendFormat": "RSS growth", 437 - "refId": "A" 438 - }, 439 - { 440 - "expr": "deriv(relay_malloc_in_use_bytes{job=\"zlay\"}[10m])", 441 - "legendFormat": "malloc in_use growth", 442 - "refId": "B" 443 - }, 444 - { 445 - "expr": "deriv(relay_malloc_arena_bytes{job=\"zlay\"}[10m])", 446 - "legendFormat": "malloc arena growth", 447 - "refId": "C" 448 - } 275 + { "expr": "deriv(relay_process_rss_bytes{job=\"zlay\"}[10m])", "legendFormat": "RSS growth", "refId": "A" }, 276 + { "expr": "deriv(relay_malloc_in_use_bytes{job=\"zlay\"}[10m])", "legendFormat": "malloc in_use growth", "refId": "B" }, 277 + { "expr": "deriv(relay_malloc_arena_bytes{job=\"zlay\"}[10m])", "legendFormat": "malloc arena growth", "refId": "C" } 449 278 ] 279 + }, 280 + { 281 + "title": "operations", 282 + "type": "row", 283 + "gridPos": { "h": 1, 
"w": 24, "x": 0, "y": 26 }, 284 + "collapsed": false, 285 + "panels": [] 450 286 }, 451 287 { 452 288 "title": "caches", 453 289 "type": "timeseries", 454 - "gridPos": { 455 - "h": 8, 456 - "w": 12, 457 - "x": 0, 458 - "y": 24 459 - }, 460 - "datasource": { 461 - "type": "prometheus", 462 - "uid": "prometheus" 463 - }, 290 + "gridPos": { "h": 8, "w": 12, "x": 0, "y": 27 }, 291 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 464 292 "fieldConfig": { 465 293 "defaults": { 466 - "color": { 467 - "mode": "palette-classic" 468 - }, 469 - "custom": { 470 - "fillOpacity": 15, 471 - "lineWidth": 2, 472 - "spanNulls": false, 473 - "axisSoftMin": 0 474 - } 294 + "color": { "mode": "palette-classic" }, 295 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false, "axisSoftMin": 0 } 475 296 }, 476 297 "overrides": [ 477 298 { 478 - "matcher": { 479 - "id": "byRegexp", 480 - "options": "/(hits|misses|evictions)/" 481 - }, 299 + "matcher": { "id": "byRegexp", "options": "/(hits|misses|evictions)/" }, 482 300 "properties": [ 483 - { 484 - "id": "custom.axisPlacement", 485 - "value": "right" 486 - }, 487 - { 488 - "id": "unit", 489 - "value": "ops" 490 - }, 491 - { 492 - "id": "custom.fillOpacity", 493 - "value": 0 494 - }, 495 - { 496 - "id": "custom.lineWidth", 497 - "value": 1 498 - } 301 + { "id": "custom.axisPlacement", "value": "right" }, 302 + { "id": "unit", "value": "ops" }, 303 + { "id": "custom.fillOpacity", "value": 0 }, 304 + { "id": "custom.lineWidth", "value": 1 } 499 305 ] 500 306 } 501 307 ] 502 308 }, 503 309 "targets": [ 504 - { 505 - "expr": "sum(relay_validator_cache_entries{job=\"zlay\"})", 506 - "legendFormat": "validator cache", 507 - "refId": "A" 508 - }, 509 - { 510 - "expr": "sum(relay_did_cache_entries{job=\"zlay\"})", 511 - "legendFormat": "DID cache", 512 - "refId": "B" 513 - }, 514 - { 515 - "expr": "sum(rate(relay_cache_total{job=\"zlay\",result=\"hit\"}[5m]))", 516 - "legendFormat": "hits", 517 - "refId": "C" 518 - }, 519 
- { 520 - "expr": "sum(rate(relay_cache_total{job=\"zlay\",result=\"miss\"}[5m]))", 521 - "legendFormat": "misses", 522 - "refId": "D" 523 - }, 524 - { 525 - "expr": "sum(rate(relay_validator_cache_evictions_total{job=\"zlay\"}[5m]))", 526 - "legendFormat": "evictions", 527 - "refId": "E" 528 - } 310 + { "expr": "sum(relay_validator_cache_entries{job=\"zlay\"})", "legendFormat": "validator cache", "refId": "A" }, 311 + { "expr": "sum(relay_did_cache_entries{job=\"zlay\"})", "legendFormat": "DID cache", "refId": "B" }, 312 + { "expr": "sum(rate(relay_cache_total{job=\"zlay\",result=\"hit\"}[5m]))", "legendFormat": "hits", "refId": "C" }, 313 + { "expr": "sum(rate(relay_cache_total{job=\"zlay\",result=\"miss\"}[5m]))", "legendFormat": "misses", "refId": "D" }, 314 + { "expr": "sum(rate(relay_validator_cache_evictions_total{job=\"zlay\"}[5m]))", "legendFormat": "evictions", "refId": "E" } 529 315 ] 530 316 }, 531 317 { 532 318 "title": "errors", 533 319 "type": "timeseries", 534 - "gridPos": { 535 - "h": 8, 536 - "w": 12, 537 - "x": 12, 538 - "y": 24 539 - }, 540 - "datasource": { 541 - "type": "prometheus", 542 - "uid": "prometheus" 543 - }, 320 + "gridPos": { "h": 8, "w": 12, "x": 12, "y": 27 }, 321 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 544 322 "fieldConfig": { 545 323 "defaults": { 546 324 "unit": "ops", 547 - "color": { 548 - "mode": "palette-classic" 549 - }, 550 - "custom": { 551 - "fillOpacity": 15, 552 - "lineWidth": 2, 553 - "spanNulls": false 554 - } 325 + "color": { "mode": "palette-classic" }, 326 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 555 327 }, 556 328 "overrides": [] 557 329 }, 558 330 "targets": [ 559 - { 560 - "expr": "sum(rate(relay_decode_errors_total{job=\"zlay\"}[5m]))", 561 - "legendFormat": "decode errors", 562 - "refId": "A" 563 - }, 564 - { 565 - "expr": "sum(rate(relay_slow_consumers_total{job=\"zlay\"}[5m]))", 566 - "legendFormat": "slow consumer drops", 567 - "refId": "B" 568 - }, 569 
- { 570 - "expr": "sum(rate(relay_rate_limited_total{job=\"zlay\"}[5m]))", 571 - "legendFormat": "rate limited", 572 - "refId": "C" 573 - }, 574 - { 575 - "expr": "sum by (reason) (rate(relay_validation_failed{job=\"zlay\"}[5m]))", 576 - "legendFormat": "validation: {{reason}}", 577 - "refId": "D" 578 - } 331 + { "expr": "sum(rate(relay_decode_errors_total{job=\"zlay\"}[5m]))", "legendFormat": "decode errors", "refId": "A" }, 332 + { "expr": "sum(rate(relay_slow_consumers_total{job=\"zlay\"}[5m]))", "legendFormat": "slow consumer drops", "refId": "B" }, 333 + { "expr": "sum(rate(relay_rate_limited_total{job=\"zlay\"}[5m]))", "legendFormat": "rate limited", "refId": "C" }, 334 + { "expr": "sum by (reason) (rate(relay_validation_failed{job=\"zlay\"}[5m]))", "legendFormat": "validation: {{reason}}", "refId": "D" } 579 335 ] 580 336 }, 581 337 { 582 338 "title": "resolver", 583 339 "type": "timeseries", 584 - "gridPos": { 585 - "h": 8, 586 - "w": 8, 587 - "x": 0, 588 - "y": 32 589 - }, 590 - "datasource": { 591 - "type": "prometheus", 592 - "uid": "prometheus" 593 - }, 340 + "gridPos": { "h": 8, "w": 8, "x": 0, "y": 35 }, 341 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 594 342 "fieldConfig": { 595 343 "defaults": { 596 - "color": { 597 - "mode": "palette-classic" 598 - }, 599 - "custom": { 600 - "fillOpacity": 15, 601 - "lineWidth": 2, 602 - "spanNulls": false, 603 - "axisSoftMin": 0 604 - } 344 + "color": { "mode": "palette-classic" }, 345 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false, "axisSoftMin": 0 } 605 346 }, 606 347 "overrides": [] 607 348 }, 608 349 "targets": [ 609 - { 610 - "expr": "max(relay_resolve_queue_len{job=\"zlay\"})", 611 - "legendFormat": "queue length", 612 - "refId": "A" 613 - }, 614 - { 615 - "expr": "max(relay_resolve_queued_set_count{job=\"zlay\"})", 616 - "legendFormat": "dedup set", 617 - "refId": "B" 618 - } 350 + { "expr": "max(relay_resolve_queue_len{job=\"zlay\"})", "legendFormat": "queue length", 
"refId": "A" }, 351 + { "expr": "max(relay_resolve_queued_set_count{job=\"zlay\"})", "legendFormat": "dedup set", "refId": "B" } 619 352 ] 620 353 }, 621 354 { 622 355 "title": "chain continuity", 623 356 "type": "timeseries", 624 - "gridPos": { 625 - "h": 8, 626 - "w": 8, 627 - "x": 8, 628 - "y": 32 629 - }, 630 - "datasource": { 631 - "type": "prometheus", 632 - "uid": "prometheus" 633 - }, 357 + "gridPos": { "h": 8, "w": 8, "x": 8, "y": 35 }, 358 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 634 359 "fieldConfig": { 635 360 "defaults": { 636 361 "unit": "ops", 637 - "color": { 638 - "mode": "palette-classic" 639 - }, 640 - "custom": { 641 - "fillOpacity": 15, 642 - "lineWidth": 2, 643 - "spanNulls": false 644 - } 362 + "color": { "mode": "palette-classic" }, 363 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 645 364 }, 646 365 "overrides": [ 647 366 { 648 - "matcher": { 649 - "id": "byName", 650 - "options": "% of received" 651 - }, 367 + "matcher": { "id": "byName", "options": "% of received" }, 652 368 "properties": [ 653 - { 654 - "id": "custom.axisPlacement", 655 - "value": "right" 656 - }, 657 - { 658 - "id": "unit", 659 - "value": "percent" 660 - }, 661 - { 662 - "id": "custom.fillOpacity", 663 - "value": 0 664 - }, 665 - { 666 - "id": "custom.lineWidth", 667 - "value": 1 668 - } 369 + { "id": "custom.axisPlacement", "value": "right" }, 370 + { "id": "unit", "value": "percent" }, 371 + { "id": "custom.fillOpacity", "value": 0 }, 372 + { "id": "custom.lineWidth", "value": 1 } 669 373 ] 670 374 } 671 375 ] 672 376 }, 673 377 "targets": [ 674 - { 675 - "expr": "sum(rate(relay_chain_breaks_total{job=\"zlay\"}[5m]))", 676 - "legendFormat": "chain breaks/sec", 677 - "refId": "A" 678 - }, 679 - { 680 - "expr": "sum(rate(relay_chain_breaks_total{job=\"zlay\"}[5m])) / sum(rate(relay_frames_received_total{job=\"zlay\"}[5m])) * 100", 681 - "legendFormat": "% of received", 682 - "refId": "B" 683 - } 378 + { "expr": 
"sum(rate(relay_chain_breaks_total{job=\"zlay\"}[5m]))", "legendFormat": "chain breaks/sec", "refId": "A" }, 379 + { "expr": "sum(rate(relay_chain_breaks_total{job=\"zlay\"}[5m])) / sum(rate(relay_frames_received_total{job=\"zlay\"}[5m])) * 100", "legendFormat": "% of received", "refId": "B" } 684 380 ], 685 - "options": { 686 - "tooltip": { 687 - "mode": "multi" 688 - } 689 - } 381 + "options": { "tooltip": { "mode": "multi" } } 690 382 }, 691 383 { 692 384 "title": "disk usage", 693 385 "type": "timeseries", 694 - "gridPos": { 695 - "h": 8, 696 - "w": 8, 697 - "x": 16, 698 - "y": 32 386 + "gridPos": { "h": 8, "w": 8, "x": 16, "y": 35 }, 387 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 388 + "fieldConfig": { 389 + "defaults": { 390 + "unit": "bytes", 391 + "color": { "mode": "palette-classic" }, 392 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false } 393 + }, 394 + "overrides": [] 699 395 }, 700 - "datasource": { 701 - "type": "prometheus", 702 - "uid": "prometheus" 703 - }, 396 + "targets": [ 397 + { "expr": "max(relay_disk_total_bytes{job=\"zlay\"})", "legendFormat": "total", "refId": "A" }, 398 + { "expr": "max(relay_disk_total_bytes{job=\"zlay\"}) - max(relay_disk_available_bytes{job=\"zlay\"})", "legendFormat": "used", "refId": "B" } 399 + ] 400 + }, 401 + { 402 + "title": "pipeline", 403 + "type": "row", 404 + "gridPos": { "h": 1, "w": 24, "x": 0, "y": 43 }, 405 + "collapsed": false, 406 + "panels": [] 407 + }, 408 + { 409 + "title": "pipeline contention", 410 + "type": "timeseries", 411 + "gridPos": { "h": 10, "w": 16, "x": 0, "y": 44 }, 412 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 704 413 "fieldConfig": { 705 414 "defaults": { 706 - "unit": "bytes", 707 - "color": { 708 - "mode": "palette-classic" 415 + "unit": "ops", 416 + "color": { "mode": "palette-classic" }, 417 + "custom": { "fillOpacity": 10, "lineWidth": 2, "spanNulls": false, "axisSoftMin": 0 } 418 + }, 419 + "overrides": [ 420 + { 421 + 
"matcher": { "id": "byName", "options": "hosts" }, 422 + "properties": [ 423 + { "id": "custom.axisPlacement", "value": "right" }, 424 + { "id": "unit", "value": "short" }, 425 + { "id": "custom.fillOpacity", "value": 0 }, 426 + { "id": "custom.lineWidth", "value": 1 }, 427 + { "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [10, 10] } }, 428 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "white" } } 429 + ] 709 430 }, 710 - "custom": { 711 - "fillOpacity": 15, 712 - "lineWidth": 2, 713 - "spanNulls": false 431 + { 432 + "matcher": { "id": "byName", "options": "CPU cores" }, 433 + "properties": [ 434 + { "id": "custom.axisPlacement", "value": "right" }, 435 + { "id": "unit", "value": "short" }, 436 + { "id": "custom.fillOpacity", "value": 0 }, 437 + { "id": "custom.lineWidth", "value": 1 }, 438 + { "id": "custom.lineStyle", "value": { "fill": "dash", "dash": [4, 4] } }, 439 + { "id": "color", "value": { "mode": "fixed", "fixedColor": "yellow" } } 440 + ] 714 441 } 715 - }, 716 - "overrides": [] 442 + ] 717 443 }, 718 444 "targets": [ 719 - { 720 - "expr": "max(relay_disk_total_bytes{job=\"zlay\"})", 721 - "legendFormat": "total", 722 - "refId": "A" 445 + { "expr": "sum(rate(relay_persist_order_spins_total{job=\"zlay\"}[1m]))", "legendFormat": "persist_order spins/s", "refId": "A" }, 446 + { "expr": "sum(rate(relay_broadcast_queue_push_lock_spins_total{job=\"zlay\"}[1m]))", "legendFormat": "push_lock spins/s", "refId": "B" }, 447 + { "expr": "sum(rate(relay_broadcast_queue_full_total{job=\"zlay\"}[1m]))", "legendFormat": "queue full/s", "refId": "C" }, 448 + { "expr": "sum(relay_connected_inbound{job=\"zlay\"})", "legendFormat": "hosts", "refId": "D" }, 449 + { "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"zlay\",container=\"main\"}[1m]))", "legendFormat": "CPU cores", "refId": "E" } 450 + ], 451 + "options": { "tooltip": { "mode": "multi" } } 452 + }, 453 + { 454 + "title": "broadcast queue", 455 + "type": "timeseries", 
456 + "gridPos": { "h": 10, "w": 8, "x": 16, "y": 44 }, 457 + "datasource": { "type": "prometheus", "uid": "prometheus" }, 458 + "fieldConfig": { 459 + "defaults": { 460 + "color": { "mode": "palette-classic" }, 461 + "custom": { "fillOpacity": 15, "lineWidth": 2, "spanNulls": false, "axisSoftMin": 0 } 723 462 }, 724 - { 725 - "expr": "max(relay_disk_total_bytes{job=\"zlay\"}) - max(relay_disk_available_bytes{job=\"zlay\"})", 726 - "legendFormat": "used", 727 - "refId": "B" 728 - } 729 - ] 463 + "overrides": [ 464 + { 465 + "matcher": { "id": "byRegexp", "options": "/(no consumers|consumers active)/" }, 466 + "properties": [ 467 + { "id": "custom.axisPlacement", "value": "right" }, 468 + { "id": "unit", "value": "short" }, 469 + { "id": "custom.fillOpacity", "value": 0 }, 470 + { "id": "custom.lineWidth", "value": 1 } 471 + ] 472 + } 473 + ] 474 + }, 475 + "targets": [ 476 + { "expr": "max(relay_broadcast_queue_depth_hwm{job=\"zlay\"})", "legendFormat": "depth HWM", "refId": "A" }, 477 + { "expr": "sum(rate(relay_broadcast_no_consumers_total{job=\"zlay\"}[1m]))", "legendFormat": "no consumers/s", "refId": "B" }, 478 + { "expr": "max(relay_consumers_active{job=\"zlay\"})", "legendFormat": "consumers active", "refId": "C" } 479 + ], 480 + "options": { "tooltip": { "mode": "multi" } } 730 481 } 731 482 ], 732 483 "schemaVersion": 39,
+8 -36
zlay/deploy/zlay-reconnect-cronjob.yaml
··· 27 27 - | 28 28 import json, urllib.request, urllib.error, socket, time, sys 29 29 30 - PDS_LIST_URL = "https://raw.githubusercontent.com/mary-ext/atproto-scraping/refs/heads/trunk/dist/instances.json" 31 30 ZLAY_HOST = "zlay.zlay.svc.cluster.local" 32 31 ZLAY_PORT = 3000 33 32 ··· 58 57 sock.close() 59 58 status = int(resp.split(b"\r\n", 1)[0].split(b" ", 2)[1]) 60 59 return status == 200 61 - except (ConnectionError, OSError, socket.timeout): 60 + except (ConnectionError, OSError, socket.timeout, IndexError, ValueError): 62 61 return False 63 62 64 63 def submit_hosts(hosts, label): ··· 74 73 time.sleep(0.05) 75 74 print(f"{label}: {ok} ok, {errors} errors, {time.time() - start:.0f}s") 76 75 77 - # phase 1: mary-ext scraping list (reconnect existing hosts) 78 - print(f"phase 1: fetching PDS list from {PDS_LIST_URL}...") 79 - with urllib.request.urlopen(PDS_LIST_URL, timeout=30) as resp: 80 - data = json.loads(resp.read()) 81 - pds_urls = [url.rstrip("/") for url in data.get("pdses", {}).keys() if url.startswith("https://")] 82 - hosts = [url.replace("https://", "").replace("http://", "") for url in pds_urls] 83 - print(f" {len(hosts)} hosts") 84 - submit_hosts(hosts, "phase 1 (mary-ext)") 85 - 86 - # phase 2: discover new hosts from bsky.network 87 - print("\nphase 2: pulling hosts from bsky.network...") 88 - our_hosts = set() 89 - cursor = "" 90 - while True: 91 - url = f"https://zlay.waow.tech/xrpc/com.atproto.sync.listHosts?limit=1000" 92 - if cursor: 93 - url += f"&cursor={cursor}" 94 - try: 95 - with urllib.request.urlopen(url, timeout=30) as resp: 96 - page = json.loads(resp.read()) 97 - except Exception: 98 - break 99 - for h in page.get("hosts", []): 100 - our_hosts.add(h["hostname"]) 101 - cursor = page.get("cursor", "") 102 - if not cursor: 103 - break 104 - 105 - bsky_hosts = {} 76 + # pull all active/idle hosts from bsky.network and submit them all 77 + # (re-announces hosts zlay may have given up on + discovers new ones) 78 + print("fetching 
hosts from bsky.network...") 79 + hosts = [] 106 80 cursor = "" 107 81 while True: 108 82 url = f"https://bsky.network/xrpc/com.atproto.sync.listHosts?limit=1000" ··· 116 90 for h in page.get("hosts", []): 117 91 status = h.get("status", "unknown") 118 92 if status in ("active", "idle"): 119 - bsky_hosts[h["hostname"]] = status 93 + hosts.append(h["hostname"]) 120 94 cursor = page.get("cursor", "") 121 95 if not cursor: 122 96 break 123 97 124 - missing = [h for h in bsky_hosts if h not in our_hosts] 125 - print(f" bsky.network: {len(bsky_hosts)} active/idle hosts, {len(missing)} new") 126 - if missing: 127 - submit_hosts(missing, "phase 2 (bsky.network)") 98 + print(f" {len(hosts)} active/idle hosts") 99 + submit_hosts(hosts, "bsky.network")
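note on the `except` change above: `resp.split(b"\r\n", 1)[0].split(b" ", 2)[1]` raises `IndexError` when the status line contains no space (for example an empty or truncated response), and `int(...)` raises `ValueError` when the second token is not numeric. A minimal sketch of the same parsing in isolation, assuming a hypothetical helper name `parse_status` that is not part of the cronjob script:

```python
from typing import Optional


def parse_status(resp: bytes) -> Optional[int]:
    """Return the HTTP status code from a raw response, or None if malformed.

    `parse_status` is a hypothetical name used only for illustration; the
    cronjob does this parsing inline and returns False on any failure.
    """
    try:
        # First line of the response, e.g. b"HTTP/1.1 200 OK"
        status_line = resp.split(b"\r\n", 1)[0]
        # Second space-separated token is the status code
        return int(status_line.split(b" ", 2)[1])
    except (IndexError, ValueError):
        # IndexError: no space in the status line (empty/truncated response)
        # ValueError: status token is not a number
        return None


if __name__ == "__main__":
    print(parse_status(b"HTTP/1.1 200 OK\r\n\r\n"))
    print(parse_status(b""))
    print(parse_status(b"HTTP/1.1 abc\r\n"))
```

Treating both exceptions as "not ok" matches the cronjob's behavior of counting the host as an error and moving on.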
+84 -22
zlay/justfile
··· 151 151 publish-remote optimize="": 152 152 #!/usr/bin/env bash 153 153 set -euo pipefail 154 - ssh root@$(just server-ip) <<'DEPLOY' 154 + SERVER=$(just server-ip) 155 + OPTIMIZE="{{ optimize }}" 156 + LABEL="{{ if optimize != "" { optimize } else { "debug" } }}" 157 + ZIG_FLAG="{{ if optimize != "" { "-Doptimize=" + optimize + " " } else { "" } }}" 158 + 159 + # upload build script (avoids heredoc stdin buffering over SSH) 160 + ssh root@"$SERVER" "cat > /tmp/zlay-build.sh" <<SCRIPT 161 + #!/usr/bin/env bash 155 162 set -euo pipefail 156 163 cd /opt/zlay 157 164 git pull --ff-only 158 165 159 - TAG=$(git rev-parse --short HEAD) 160 - IMAGE="atcr.io/zzstoatzz.io/zlay:{{ if optimize != "" { optimize + "-" } else { "debug-" } }}${TAG}" 166 + TAG=\$(git rev-parse --short HEAD) 167 + IMAGE="atcr.io/zzstoatzz.io/zlay:${LABEL}-\${TAG}" 161 168 162 - echo "==> building binary (${TAG}{{ if optimize != "" { ", " + optimize } else { ", debug" } }})" 163 - zig build {{ if optimize != "" { "-Doptimize=" + optimize + " " } else { "" } }}-Dtarget=x86_64-linux-gnu 169 + echo "==> building binary (\${TAG}, ${LABEL})" 170 + zig build ${ZIG_FLAG}-Dtarget=x86_64-linux-gnu 164 171 165 - echo "==> building container image (${IMAGE})" 166 - buildah bud -t "${IMAGE}" -f Dockerfile.runtime . 172 + echo "==> building container image (\${IMAGE})" 173 + buildah bud -t "\${IMAGE}" -f Dockerfile.runtime . 
167 174 168 175 echo "==> importing into k3s containerd" 169 - buildah push "${IMAGE}" docker-archive:/tmp/zlay.tar:"${IMAGE}" 176 + buildah push "\${IMAGE}" docker-archive:/tmp/zlay.tar:"\${IMAGE}" 170 177 ctr -n k8s.io images import /tmp/zlay.tar 171 178 rm -f /tmp/zlay.tar 172 179 173 180 echo "==> updating deployment image" 174 - kubectl set image deployment/zlay -n zlay main="${IMAGE}" 181 + kubectl set image deployment/zlay -n zlay main="\${IMAGE}" 175 182 kubectl rollout status deployment/zlay -n zlay --timeout=120s 176 183 177 - echo "==> deployed ${IMAGE}" 178 - DEPLOY 184 + echo "==> deployed \${IMAGE}" 185 + SCRIPT 186 + 187 + ssh root@"$SERVER" "bash /tmp/zlay-build.sh" 179 188 180 189 # build with GPA leak detection enabled (exp-002). SIGTERM to get leak report. 181 190 # usage: just zlay publish-gpa ReleaseSafe 182 191 publish-gpa optimize="ReleaseSafe": 183 192 #!/usr/bin/env bash 184 193 set -euo pipefail 185 - ssh root@$(just server-ip) <<'DEPLOY' 194 + SERVER=$(just server-ip) 195 + OPTIMIZE="{{ optimize }}" 196 + 197 + ssh root@"$SERVER" "cat > /tmp/zlay-build.sh" <<SCRIPT 198 + #!/usr/bin/env bash 186 199 set -euo pipefail 187 200 cd /opt/zlay 188 201 git pull --ff-only 189 202 190 - TAG=$(git rev-parse --short HEAD) 191 - IMAGE="atcr.io/zzstoatzz.io/zlay:{{ optimize }}-gpa-${TAG}" 203 + TAG=\$(git rev-parse --short HEAD) 204 + IMAGE="atcr.io/zzstoatzz.io/zlay:${OPTIMIZE}-gpa-\${TAG}" 192 205 193 - echo "==> building binary (${TAG}, {{ optimize }}, GPA enabled)" 194 - zig build -Doptimize={{ optimize }} -Duse_gpa=true -Dtarget=x86_64-linux-gnu 206 + echo "==> building binary (\${TAG}, ${OPTIMIZE}, GPA enabled)" 207 + zig build -Doptimize=${OPTIMIZE} -Duse_gpa=true -Dtarget=x86_64-linux-gnu 195 208 196 - echo "==> building container image (${IMAGE})" 197 - buildah bud -t "${IMAGE}" -f Dockerfile.runtime . 209 + echo "==> building container image (\${IMAGE})" 210 + buildah bud -t "\${IMAGE}" -f Dockerfile.runtime . 
198 211 199 212 echo "==> importing into k3s containerd" 200 - buildah push "${IMAGE}" docker-archive:/tmp/zlay.tar:"${IMAGE}" 213 + buildah push "\${IMAGE}" docker-archive:/tmp/zlay.tar:"\${IMAGE}" 201 214 ctr -n k8s.io images import /tmp/zlay.tar 202 215 rm -f /tmp/zlay.tar 203 216 204 217 echo "==> updating deployment image" 205 - kubectl set image deployment/zlay -n zlay main="${IMAGE}" 218 + kubectl set image deployment/zlay -n zlay main="\${IMAGE}" 206 219 kubectl rollout status deployment/zlay -n zlay --timeout=120s 207 220 208 - echo "==> deployed ${IMAGE} (GPA enabled — SIGTERM to get leak report)" 209 - DEPLOY 221 + echo "==> deployed \${IMAGE} (GPA enabled — SIGTERM to get leak report)" 222 + SCRIPT 223 + 224 + ssh root@"$SERVER" "bash /tmp/zlay-build.sh" 210 225 211 226 # --- status --- 212 227 ··· 235 250 # get the grafana admin password from the cluster 236 251 grafana-password: 237 252 @kubectl get secret -n monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d && echo 253 + 254 + # --- consumer smoke tests --- 255 + 256 + # run hydrant (strict indexer) against zlay for N seconds (default 15) 257 + test-hydrant seconds="15": 258 + #!/usr/bin/env bash 259 + set -euo pipefail 260 + : "${ZLAY_DOMAIN:?set ZLAY_DOMAIN}" 261 + TMPDIR=$(mktemp -d) 262 + trap "rm -rf $TMPDIR" EXIT 263 + echo "==> running hydrant against wss://$ZLAY_DOMAIN for {{ seconds }}s" 264 + HYDRANT_RELAY_HOSTS="wss://$ZLAY_DOMAIN" \ 265 + HYDRANT_DATABASE_PATH="$TMPDIR/hydrant.db" \ 266 + HYDRANT_EPHEMERAL=true \ 267 + HYDRANT_API_PORT=0 \ 268 + HYDRANT_ENABLE_CRAWLER=false \ 269 + RUST_LOG=info \ 270 + /tmp/hydrant/target/release/hydrant & 271 + PID=$! 
272 + sleep {{ seconds }} 273 + kill $PID 2>/dev/null 274 + wait $PID 2>/dev/null || true 275 + echo "==> hydrant exited cleanly" 276 + 277 + # run tap (Go firehose consumer) against zlay for N seconds (default 15) 278 + test-tap seconds="15": 279 + #!/usr/bin/env bash 280 + set -euo pipefail 281 + : "${ZLAY_DOMAIN:?set ZLAY_DOMAIN}" 282 + TMPDIR=$(mktemp -d) 283 + trap "rm -rf $TMPDIR" EXIT 284 + echo "==> running tap against https://$ZLAY_DOMAIN for {{ seconds }}s" 285 + ~/go/bin/tap run \ 286 + --relay-url="https://$ZLAY_DOMAIN" \ 287 + --db-url="sqlite://$TMPDIR/tap.db" \ 288 + --bind=":0" \ 289 + --log-level=info & 290 + PID=$! 291 + sleep {{ seconds }} 292 + kill $PID 2>/dev/null 293 + wait $PID 2>/dev/null || true 294 + echo "==> tap exited cleanly" 295 + 296 + # run both hydrant and tap against zlay 297 + test-consumers seconds="15": 298 + just test-hydrant {{ seconds }} 299 + just test-tap {{ seconds }}
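note on the smoke-test recipes above: each one starts the consumer in the background, sleeps for N seconds, sends it a signal, and waits for it to exit. The same pattern can be sketched in Python with the "still running at the deadline" check made explicit. This is an illustration under assumptions, not part of the justfile; `smoke_run` is a hypothetical name and the command below is a stand-in for the real consumer binaries:

```python
import subprocess
import sys


def smoke_run(cmd, seconds):
    """Run `cmd` for up to `seconds`.

    Returns True if the process was still alive at the deadline (and was
    then terminated), False if it exited on its own before the deadline.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
        return False  # exited early, which a smoke test would treat as a failure
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX, same default signal as `kill $PID`
        proc.wait()       # reap the child so it does not linger
        return True


if __name__ == "__main__":
    # Stand-in for a long-running consumer process.
    long_running = [sys.executable, "-c", "import time; time.sleep(30)"]
    print("survived:", smoke_run(long_running, 1))
```

Returning a boolean keeps the decision about what counts as a pass with the caller, for example failing the run when the consumer exits before the deadline.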