declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay

docs: add zlay architecture, backfill strategy, and updated specs

covers how zlay differs from indigo (inline RocksDB collection index,
optimistic validation, split ports, OS threads), deployment recipes,
collection index backfill approaches (import/crawl/hybrid), verification
tools, and steady-state metrics for both relays.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zzstoatzz 266b2fa5 f0156ba2

+61 -2
docs/architecture.md
@@ -47,13 +47,72 @@
 
 fix: a k8s CronJob (`deploy/reconnect-cronjob.yaml`) runs every 4 hours, fetching the [community PDS list](https://github.com/mary-ext/atproto-scraping) and sending `requestCrawl` for each host. this can also be run manually via `just reconnect`.
 
-## steady-state specs
+## steady-state specs (indigo relay)
 
 | metric | value |
 |--------|-------|
 | storage (relay data) | ~21 GB |
 | storage (postgres) | ~2.4 GB |
-| storage (collectiondir pebble) | ~5 GB (post-backfill) |
+| storage (collectiondir pebble) | ~5 GB (post-backfill, ~5M DIDs indexed) |
 | CPU usage | 5-15% |
 | network throughput | ~600 events/sec typical, 2000 peak |
 | connected PDS hosts | ~2800 |
+| memory (relay) | ~6 GiB (plateaus at GOMEMLIMIT) |
+| memory (collectiondir) | ~470 MiB steady-state |
+
+---
+
+## zlay (zig relay)
+
+a second relay implementation in [Zig](https://ziglang.org/), deployed on a separate Hetzner node. source: [tangled.org/zzstoatzz.io/zlay](https://tangled.org/zzstoatzz.io/zlay). runs at `zlay.waow.tech`.
+
+### how it differs from indigo
+
+**same model, different internals.** zlay crawls PDS hosts directly — it is not a fan-out relay. `RELAY_UPSTREAM` (default: `bsky.network`) is a bootstrap seed used once at startup to populate the host list via `listHosts`. after that, all data flows directly from each PDS.
+
+**inline collection index.** instead of running collectiondir as a sidecar, zlay indexes `(DID, collection)` pairs directly in its event processing pipeline. storage is [RocksDB](https://rocksdb.org/) with two column families (`rbc` for collection→DID lookups, `cbr` for DID→collection cleanup). serves `listReposByCollection` from the relay's HTTP port — no separate service.
+
+**optimistic validation.** on a signing key cache miss, zlay passes the frame through immediately and queues the DID for background resolution. first commit from an unknown account is unvalidated; subsequent commits are verified. indigo blocks until resolution completes.
+
+**split ports.** 3000 for the WebSocket firehose, 3001 for HTTP (health, stats, metrics, admin, XRPC). indigo serves everything on port 2470 (with metrics on 2471).
+
+**OS threads, not goroutines.** one thread per PDS host subscription, one per downstream consumer. predictable memory (no GC), but thread count scales linearly with host count.
+
+### deployment
+
+separate Hetzner cpx41 in Hillsboro OR (`hil`), independent k3s cluster. all `zlay-*` justfile recipes use `ZLAY_KUBECONFIG`. terraform in `infra/zlay/`.
+
+```bash
+just zlay-init        # terraform init
+just zlay-infra       # create server
+just zlay-kubeconfig  # pull kubeconfig
+just zlay-deploy      # full deploy (cert-manager, postgres, relay, monitoring)
+just zlay-publish     # build and push image
+just zlay-status      # check pods + health
+just zlay-logs        # tail logs
+```
+
+### collection index backfill
+
+the collection index is live-only — it indexes `create` ops as they flow through the firehose. historical data requires a backfill. recommended approaches:
+
+1. **import from bsky.network** (fastest): paginate `listReposByCollection` on the reference relay for each collection, bulk-insert pairs into RocksDB. no PDS crawling, no rate limits. `addCollection` is idempotent.
+2. **describeRepo crawl** (independent): crawl the host table, calling `listRepos` + `describeRepo` per PDS. same rate limit gotchas as indigo collectiondir — see [backfill.md](backfill.md).
+3. **hybrid** (recommended): import from reference relay for immediate parity, then live indexing keeps current. optionally add a slow background verify-crawl later.
+
+### verification
+
+`scripts/zlay-smoketest` tests endpoint conformance, pagination, and set completeness against a reference relay. `scripts/collectiondir-diff` compares `listReposByCollection` results between any two endpoints (use `--limit` values ≤ 1000 for zlay).
+
+[pulsar](https://tangled.org/mackuba.eu/pulsar) (by @mackuba.eu) provides live firehose coverage comparison — subscribes to multiple relays simultaneously and counts unique DIDs over a time window.
+
+### steady-state specs (zlay)
+
+| metric | value |
+|--------|-------|
+| connected PDS hosts | ~2,749 |
+| collection index DIDs | ~497K (live-only, no backfill) |
+| memory request | 512 MiB |
+| memory limit | 8 GiB |
+| PVC | 20 GiB |
+| `listReposByCollection` max limit | 1000 |
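
a rough sketch of the "import from bsky.network" backfill approach described in the diff: follow the `listReposByCollection` cursor on a reference relay and collect DIDs for one collection. this is not code from the repo. the helper names (`http_fetch_page`, `iter_collection_dids`) are hypothetical, the XRPC path `com.atproto.sync.listReposByCollection` and the `{"repos": [{"did": ...}], "cursor": ...}` response shape are assumed from the public atproto lexicon, and the bulk insert into RocksDB is left out entirely.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

# Assumed page shape: {"repos": [{"did": "did:plc:..."}], "cursor": "..."}
FetchPage = Callable[[str, int, Optional[str]], dict]


def http_fetch_page(base_url: str) -> FetchPage:
    """Build a page fetcher for one relay (hypothetical helper)."""

    def fetch(collection: str, limit: int, cursor: Optional[str]) -> dict:
        params = {"collection": collection, "limit": str(limit)}
        if cursor:
            params["cursor"] = cursor
        # Assumed endpoint path; adjust if the relay exposes it elsewhere.
        url = (
            f"{base_url}/xrpc/com.atproto.sync.listReposByCollection?"
            + urllib.parse.urlencode(params)
        )
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)

    return fetch


def iter_collection_dids(
    fetch_page: FetchPage, collection: str, limit: int = 1000
) -> Iterator[str]:
    """Yield every DID for `collection`, following cursors until exhausted."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(collection, limit, cursor)
        for repo in page.get("repos", []):
            yield repo["did"]
        next_cursor = page.get("cursor")
        # Stop when the server returns no cursor, or repeats the same one.
        if not next_cursor or next_cursor == cursor:
            return
        cursor = next_cursor
```

usage would look something like `for did in iter_collection_dids(http_fetch_page("https://bsky.network"), "app.bsky.feed.post"): ...`, with each `(did, collection)` pair handed to whatever performs the insert. the fetcher is injected so the cursor loop can be exercised without network access, and the default `limit=1000` matches the zlay maximum noted in the specs table.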