move detailed docs out of readme into docs/ · zzstoatzz.io/relay@c4b47a8

+4 -94

README.md

···

  > **experimental** — this is a personal project for learning ATProto infrastructure. the endpoints below may go down, lose data, or change without notice. do not depend on them for anything that matters.

- a full-network [ATProto](https://atproto.com) relay running on a single Hetzner Cloud node with k3s. a [jetstream](https://github.com/bluesky-social/jetstream) instance runs alongside it, re-encoding the relay's CBOR firehose into plain JSON over websockets — easier to consume if you don't need the full atproto SDK. a [collectiondir](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) sidecar indexes `(DID, collection)` pairs from the firehose and serves `com.atproto.sync.listReposByCollection` — the endpoint TAP crawlers need to enumerate the network.
+ a full-network [ATProto](https://atproto.com) relay running on a single Hetzner Cloud node with k3s. a [jetstream](https://github.com/bluesky-social/jetstream) instance runs alongside it, re-encoding the relay's CBOR firehose into plain JSON over websockets. a [collectiondir](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) sidecar indexes `(DID, collection)` pairs from the firehose and serves `com.atproto.sync.listReposByCollection` for TAP crawlers.

  **relay endpoint:** `wss://relay.waow.tech` — raw CBOR firehose ([`com.atproto.sync.subscribeRepos`](https://docs.bsky.app/docs/advanced-guides/firehose))

···

  ```
  .
+ ├── docs/     # architecture, deployment guide, backfill
  ├── scripts/  # uv scripts — firehose, jetstream, backfill
  ├── justfile  # all commands: deploy, status, logs, backfill, etc.
  ├── infra/    # terraform — hetzner server + k3s
···

  running one is [surprisingly cheap](https://whtwnd.com/bnewbold.net/3lo7a2a4qxg2l) — the relay binary uses modest CPU and memory, and storage requirements are manageable. the main cost driver is bandwidth, which is why Hetzner (unlimited 1 Gbps) is a good fit.

- this repo is a template for deploying your own. everything is declarative: terraform for the VM, helm for the workloads, a justfile to tie it together.
-
- <details>
- <summary>deploying your own</summary>
-
- ### prerequisites
-
- - [terraform](https://www.terraform.io/) (or [opentofu](https://opentofu.org/))
- - [helm](https://helm.sh/)
- - [kubectl](https://kubernetes.io/docs/tasks/tools/)
- - [just](https://github.com/casey/just)
- - a [Hetzner Cloud](https://www.hetzner.com/cloud/) account
-
- ### setup
-
- create a `.env` file:
-
- ```bash
- export HCLOUD_TOKEN="your-hetzner-api-token"
- export RELAY_DOMAIN="relay.yourdomain.com"
- export RELAY_ADMIN_PASSWORD="something-secure"
- export POSTGRES_PASSWORD="something-else-secure"
- export LETSENCRYPT_EMAIL="you@example.com"
- ```
-
- then:
-
- ```bash
- source .env
-
- just init        # terraform init
- just infra       # creates a CPX31 in Ashburn (~$15/mo) with k3s via cloud-init
- just kubeconfig  # waits for k3s, pulls kubeconfig (~2 min)
- just deploy      # installs cert-manager, postgresql, relay, jetstream, monitoring
- ```
-
- point a DNS A record at the server IP (`just server-ip`) before running deploy, so the Let's Encrypt HTTP-01 challenge succeeds.
-
- after deploy, seed the relay with the network's PDS hosts:
-
- ```bash
- just bootstrap   # pulls hosts from upstream + restarts relay so slurper picks them up
- ```
-
- ### available commands
-
- ```bash
- just status      # nodes, pods, health check
- just logs        # tail relay logs
- just health      # curl the public health endpoint
- just firehose    # consume the firehose (passes args through)
- just jetstream   # consume the jetstream (passes args through)
- just ssh         # ssh into the server
- just destroy     # tear down everything
- ```
+ this repo is a template for deploying your own. everything is declarative: terraform for the VM, helm for the workloads, a justfile to tie it together. see [docs/deploying.md](docs/deploying.md) for setup instructions and [docs/architecture.md](docs/architecture.md) for how the pieces fit together.

- </details>
-
- <details>
- <summary>architecture</summary>
-
- ### infrastructure
-
- - **Hetzner Cloud CPX31** — 8 vCPU (AMD), 16 GB RAM, 160 GB NVMe, 20 TB bandwidth @ ~$15/mo
- - **k3s** — single-node kubernetes, installed via cloud-init
- - **traefik** — ingress controller (ships with k3s)
- - **cert-manager** — automatic TLS via Let's Encrypt
-
- ### workloads
-
- - **relay** — [`ghcr.io/bluesky-social/indigo`](https://github.com/bluesky-social/indigo/pkgs/container/indigo) (tagged per-commit, e.g. `relay-bf41e2ee...`), deployed via [bjw-s/app-template](https://github.com/bjw-s-labs/helm-charts) helm chart with `hostNetwork: true` for lower-overhead networking
- - **jetstream** — [`ghcr.io/bluesky-social/jetstream`](https://github.com/bluesky-social/jetstream) subscribes to the relay's firehose over localhost (`ws://relay:2470`) and re-serves it as JSON WebSocket events at [`jetstream.waow.tech/subscribe`](https://jetstream.waow.tech/subscribe). lightweight alternative for consumers that don't need CBOR/CAR decoding
- - **collectiondir** — [`atcr.io/zzstoatzz.io/collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) subscribes to the relay firehose, indexes `(DID, collection)` pairs in a pebble DB, and serves `com.atproto.sync.listReposByCollection` at `relay.waow.tech`. routed via ingress path matching so the relay's existing endpoints are unaffected
- - **postgresql** — relay's backing database, deployed via [bitnami/postgresql](https://github.com/bitnami/charts/tree/main/bitnami/postgresql) helm chart
- - **prometheus + grafana** — metrics collection and dashboards via [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack), public read-only access at [`relay-metrics.waow.tech`](https://relay-metrics.waow.tech)
-
- ### relay specs at steady state
-
- | metric | value |
- |--------|-------|
- | storage (relay data) | ~21 GB |
- | storage (postgres) | ~2.4 GB |
- | CPU usage | 5–15% |
- | network throughput | ~600 events/sec typical, 2000 peak |
- | connected PDS hosts | ~2200 |
-
- </details>
-
- <details>
- <summary>prior art</summary>
-
- this setup draws heavily from:
+ ## prior art

  - [a full-network relay for $34 a month](https://whtwnd.com/bnewbold.net/3lo7a2a4qxg2l) by bryan newbold — the definitive guide
  - [atproto relay any% speedrun](https://pdsls.dev/at://did:plc:uu5axsmbm2or2dngy4gwchec/com.whtwnd.blog.entry/3lkubavdilf2m) — proof it runs on a raspberry pi
  - [running a PDS in kubernetes](https://hayden.leaflet.pub/3m4vfjkr6gc2p) — the app-template helm pattern
  - [firehose.network](https://sri.leaflet.pub/3mddrqk5ays27) — 3 public relays deployed globally
-
- </details>

+1 -1

deploy/collectiondir-values.yaml

···
        memory: 128Mi
        cpu: 50m
      limits:
-       memory: 512Mi
+       memory: 1Gi

  defaultPodOptions:
    imagePullSecrets:

+53

docs/architecture.md

···
+ # architecture
+
+ ## infrastructure
+
+ - **Hetzner Cloud CPX31** — 8 vCPU (AMD), 16 GB RAM, 160 GB NVMe, 20 TB bandwidth @ ~$15/mo
+ - **k3s** — single-node kubernetes, installed via cloud-init
+ - **traefik** — ingress controller (ships with k3s)
+ - **cert-manager** — automatic TLS via Let's Encrypt
+
+ ## workloads
+
+ ### relay
+
+ the core service. [`ghcr.io/bluesky-social/indigo`](https://github.com/bluesky-social/indigo/pkgs/container/indigo), deployed via [bjw-s/app-template](https://github.com/bjw-s-labs/helm-charts) with `hostNetwork: true` for lower-overhead networking. connects to every PDS on the network and aggregates their writes into a single firehose stream (`com.atproto.sync.subscribeRepos`). backed by postgresql for state.
+
+ the relay maintains an in-process identity cache (hashicorp LRU, 5M entries, 24h TTL) — every event requires a DID document lookup, and this cache keeps the relay from hammering PLC. memory usage climbs over the first day as the cache fills, then plateaus once eviction matches insertion.
+
+ ### collectiondir
+
+ a sidecar, not part of the relay itself. [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) subscribes to the relay's firehose over localhost (`ws://relay:2470`), indexes `(DID, collection)` pairs in a [pebble](https://github.com/cockroachdb/pebble) key-value store, and serves `com.atproto.sync.listReposByCollection` — the endpoint TAP crawlers use to enumerate which accounts have records in a given collection.
+
+ **what pebble stores:** each key is a `(collection, DID)` pair. the value is minimal (just a marker). when a TAP crawler asks "who has `app.bsky.feed.post` records?", the collectiondir does a prefix scan over all keys starting with that collection and returns the DIDs, paginated.
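the prefix scan described in "what pebble stores" can be pictured with a small in-memory stand-in. this is a sketch, not the collectiondir's actual code: the key encoding (collection, a NUL separator, then the DID) and the names `make_key` and `list_repos_by_collection` are assumptions for illustration, and a sorted python list plays the role of pebble's ordered keyspace.

```python
import bisect

SEP = "\x00"  # hypothetical separator; the real key encoding is an assumption here


def make_key(collection: str, did: str) -> str:
    """Build a (collection, DID) key that sorts by collection first."""
    return f"{collection}{SEP}{did}"


def list_repos_by_collection(keys, collection, limit=2, cursor=None):
    """Return (dids, next_cursor) for one page of a prefix scan.

    `keys` must be sorted. `cursor` is the last DID of the previous page.
    """
    prefix = collection + SEP
    if cursor is None:
        # first key at or after the start of this collection's range
        i = bisect.bisect_left(keys, prefix)
    else:
        # first key strictly after the last DID already returned
        i = bisect.bisect_right(keys, make_key(collection, cursor))

    dids = []
    while i < len(keys) and keys[i].startswith(prefix) and len(dids) < limit:
        dids.append(keys[i][len(prefix):])
        i += 1

    # only hand back a cursor if more keys remain under the same prefix
    more = i < len(keys) and keys[i].startswith(prefix)
    return dids, (dids[-1] if more else None)
```

the separator matters: without it, a scan for `app.bsky.feed.post` would also match keys for `app.bsky.feed.postgate`.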
+
+ **live indexing vs historical data:** the collectiondir sees every new commit on the firehose in real time, so newly-created accounts and new record types are indexed immediately. but it has no knowledge of accounts that existed before it started running, or accounts that haven't posted since it came online. that gap is what the [backfill](backfill.md) covers.
+
+ **consequence of missing pairs:** if a `(DID, collection)` pair is absent, that DID won't appear in `listReposByCollection` responses for that collection. TAP crawlers won't discover the account through this endpoint. the relay's firehose is unaffected — the collectiondir is purely a directory service layered on top.
+
+ routed via traefik ingress path matching (`/xrpc/com.atproto.sync.listReposByCollection`) so the relay's existing endpoints are unaffected.
+
+ ### jetstream
+
+ [`ghcr.io/bluesky-social/jetstream`](https://github.com/bluesky-social/jetstream) subscribes to the relay's firehose over localhost and re-serves it as JSON websocket events. a lightweight alternative for consumers that don't need CBOR/CAR decoding.
+
+ ### postgresql
+
+ relay's backing database, deployed via [bitnami/postgresql](https://github.com/bitnami/charts/tree/main/bitnami/postgresql). stores relay state (PDS host list, cursor positions, etc.).
+
+ ### monitoring
+
+ prometheus + grafana via [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack). scrapes relay (`:2471/metrics`), jetstream, and collectiondir (`:2511/metrics`). kubelet scraping is enabled for container-level disk I/O metrics. public read-only access at `relay-metrics.waow.tech`.
+
+ the relay and collectiondir ServiceMonitors are standalone manifests (`kubectl apply -f`) rather than inline in the helm values — the `additionalServiceMonitors` field in kube-prometheus-stack silently fails when targeting services in a different namespace.
+
+ ## steady-state specs
+
+ | metric | value |
+ |--------|-------|
+ | storage (relay data) | ~21 GB |
+ | storage (postgres) | ~2.4 GB |
+ | storage (collectiondir pebble) | ~300 MB (pre-bsky-backfill) |
+ | CPU usage | 5-15% |
+ | network throughput | ~600 events/sec typical, 2000 peak |
+ | connected PDS hosts | ~2200 |

+56

docs/backfill.md

···
+ # backfilling the collection directory
+
+ the collectiondir indexes `(DID, collection)` pairs from the relay's live firehose, but it has no knowledge of accounts that existed before it started running. the backfill crawls PDS hosts to fill that gap.
+
+ ## what the crawl does
+
+ for each PDS host, the collectiondir:
+
+ 1. paginates `com.atproto.sync.listRepos` (1000 at a time) to get every DID on that host
+ 2. for each DID, calls `com.atproto.repo.describeRepo` to get the list of collections
+ 3. writes each `(DID, collection)` pair to pebble
+
+ the calls are sequential per host, rate-limited at 100 req/s (configurable via `--crawl-qps`). in practice, network latency (~170ms per call from Hetzner Ashburn to Bluesky's US shards) limits throughput to ~6 repos/second per host.
+
+ ## two categories of hosts
+
+ **indie PDS hosts** (~2200): independently-run servers, mostly small (1-100 accounts each). backfilling all of them takes minutes.
+
+ **bluesky shards** (~87): the mushroom-named hosts (`amanita.us-east.host.bsky.network`, `chanterelle.us-west.host.bsky.network`, etc.) that host the vast majority of accounts. ~14K repos per shard on average, up to ~40K for the largest. these take hours to crawl.
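a rough back-of-the-envelope for per-shard crawl times, using only the figures quoted above (~170 ms per call, ~14K and ~40K repos per shard, 100 req/s limit). it assumes one sequential `describeRepo` call per repo and ignores the comparatively few `listRepos` pages; `repos_per_second` and `crawl_hours` are names made up for this sketch.

```python
def repos_per_second(latency_s: float, qps_limit: float = 100.0) -> float:
    """Sequential calls: throughput is capped by latency or the rate limit,
    whichever is lower."""
    return min(1.0 / latency_s, qps_limit)


def crawl_hours(repos: int, latency_s: float = 0.170) -> float:
    """Estimated wall-clock hours to describe every repo on one host."""
    return repos / repos_per_second(latency_s) / 3600


print(f"{repos_per_second(0.170):.1f} repos/s")
print(f"average shard (14K repos): {crawl_hours(14_000):.1f} h")
print(f"largest shard (40K repos): {crawl_hours(40_000):.1f} h")
```

under these assumptions the estimate comes out to about 5.9 repos/s, roughly 0.7 h for an average shard and 1.9 h for the largest. latency, not the 100 req/s limit, is the binding constraint.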
+
+ ## running the backfill
+
+ the `just backfill` recipe handles port-forwarding and host list extraction:
+
+ ```bash
+ # backfill all connected hosts (extracts list from relay automatically)
+ just backfill
+
+ # backfill from a specific host list with custom batch size
+ just backfill --hosts /tmp/bsky-shards.txt --batch-size 10
+ ```
+
+ or run the script directly (requires a port-forward to the collectiondir):
+
+ ```bash
+ ./scripts/backfill --token "$COLLECTIONDIR_ADMIN_TOKEN" --hosts hosts.txt
+ ./scripts/backfill --token "$TOKEN" --hosts hosts.txt --batch-size 20 --pause 30
+ ```
+
+ the script sends batches of N hosts (default: 10) to `POST /admin/pds/requestCrawl`, then polls `GET /admin/crawlStatus` until active crawls drain before sending the next batch. ctrl-c stops after the current batch finishes.
+
+ ## batch sizing
+
+ each host in a batch crawls concurrently — they hit different PDS servers, so increasing batch size adds parallelism without increasing load on any individual server. the bottleneck is per-host sequential `describeRepo` calls, not the batch size. batch-size 10 is a reasonable default.
+
+ ## monitoring
+
+ watch progress via:
+ - the backfill script's stdout (hosts crawled, repos described, ETA)
+ - grafana: collectiondir panels show firehose events/sec, commits/sec, new pairs indexed/sec, and disk I/O
+ - `kubectl exec -n relay deploy/collectiondir -- df -h /data` for pebble disk usage
+ - crawl status API: `curl -H "Authorization: Bearer $TOKEN" localhost:2510/admin/crawlStatus`
+
+ ## storage impact
+
+ pebble stores one key per `(collection, DID)` pair. the indie host backfill brought the DB to ~300 MB. the full bsky shard backfill (millions of accounts, each with multiple collections) will likely grow it to a few GB. the collectiondir has a 10Gi PVC, so there's plenty of headroom.
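the batch-then-drain loop described under "running the backfill" can be sketched as below. this is not the repo's `scripts/backfill`. it is a minimal illustration with the HTTP calls hidden behind two hypothetical callables (`request_crawl`, `active_crawls`), since the request and response shapes of `/admin/pds/requestCrawl` and `/admin/crawlStatus` aren't shown in this doc.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batched(hosts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    batch: List[str] = []
    for host in hosts:
        batch.append(host)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_backfill(
    hosts: Iterable[str],
    request_crawl: Callable[[List[str]], None],  # stand-in for POST /admin/pds/requestCrawl
    active_crawls: Callable[[], int],            # stand-in for GET /admin/crawlStatus
    batch_size: int = 10,
    poll_interval: float = 5.0,                  # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send hosts in batches, waiting for active crawls to drain between batches.

    Returns the number of hosts submitted.
    """
    sent = 0
    for batch in batched(hosts, batch_size):
        request_crawl(batch)
        sent += len(batch)
        # block until the collectiondir reports no crawls in flight
        while active_crawls() > 0:
            sleep(poll_interval)
    return sent
```

because each batch waits for its slowest host, a batch's duration is set by the largest host in it rather than by the batch size.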

+58

docs/deploying.md

···
+ # deploying
+
+ ## prerequisites
+
+ - [terraform](https://www.terraform.io/) (or [opentofu](https://opentofu.org/))
+ - [helm](https://helm.sh/)
+ - [kubectl](https://kubernetes.io/docs/tasks/tools/)
+ - [just](https://github.com/casey/just)
+ - a [Hetzner Cloud](https://www.hetzner.com/cloud/) account
+
+ ## setup
+
+ create a `.env` file:
+
+ ```bash
+ export HCLOUD_TOKEN="your-hetzner-api-token"
+ export RELAY_DOMAIN="relay.yourdomain.com"
+ export RELAY_ADMIN_PASSWORD="something-secure"
+ export POSTGRES_PASSWORD="something-else-secure"
+ export LETSENCRYPT_EMAIL="you@example.com"
+ ```
+
+ then:
+
+ ```bash
+ source .env
+
+ just init        # terraform init
+ just infra       # creates a CPX31 in Ashburn (~$15/mo) with k3s via cloud-init
+ just kubeconfig  # waits for k3s, pulls kubeconfig (~2 min)
+ just deploy      # installs cert-manager, postgresql, relay, jetstream, monitoring
+ ```
+
+ point a DNS A record at the server IP (`just server-ip`) before running deploy, so the Let's Encrypt HTTP-01 challenge succeeds.
+
+ after deploy, seed the relay with the network's PDS hosts:
+
+ ```bash
+ just bootstrap   # pulls hosts from upstream + restarts relay so slurper picks them up
+ ```
+
+ ## available commands
+
+ ```bash
+ just status      # nodes, pods, health check
+ just logs        # tail relay logs
+ just health      # curl the public health endpoint
+ just firehose    # consume the firehose (passes args through)
+ just jetstream   # consume the jetstream (passes args through)
+ just ssh         # ssh into the server
+ just destroy     # tear down everything
+ ```
+
+ ## targeted deployments
+
+ `just deploy` deploys everything. for targeted updates:
+
+ - `just deploy-monitoring` — only the monitoring stack (prometheus, grafana, dashboards, ServiceMonitors). useful for dashboard changes or prometheus config tweaks without touching the relay.
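before running any of the deploy recipes, it can help to confirm that the five variables from the `.env` example in the setup section are actually exported in the current shell. the check below is a small sketch and not part of the repo; `missing_vars` is a hypothetical helper name.

```python
import os

# variable names taken from the .env example in the setup section
REQUIRED = [
    "HCLOUD_TOKEN",
    "RELAY_DOMAIN",
    "RELAY_ADMIN_PASSWORD",
    "POSTGRES_PASSWORD",
    "LETSENCRYPT_EMAIL",
]


def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED if not env.get(name)]
```

calling `missing_vars()` after `source .env` should return an empty list. anything it does return names a variable that still needs a value.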
