relay: updated docs and Makefile (#1048) · alice.mosphere.at/indigo@7594406

+2

.gitignore

··· 32 32 /stress 33 33 /supercollider 34 34 /hepa 35 + /relay 35 36 36 37 # Don't ignore this file itself, or other specific dotfiles 37 38 !.gitignore ··· 49 50 50 51 # Relay dash output 51 52 /public/ 53 + /cmd/relay/public

+9 -10

Makefile

··· 86 86 docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=0penSearch-Pal0mar" opensearch-palomar 87 87 88 88 .PHONY: run-dev-relay 89 - run-dev-relay: .env ## Runs 'bigsky' Relay for local dev 90 - GOLOG_LOG_LEVEL=info go run ./cmd/bigsky --admin-key localdev 91 - # --crawl-insecure-ws 89 + run-dev-relay: .env ## Runs relay for local dev 90 + LOG_LEVEL=info go run ./cmd/relay --admin-password localdev serve 92 91 93 92 .PHONY: run-dev-ident 94 93 run-dev-ident: .env ## Runs 'bluepages' identity directory for local dev 95 94 GOLOG_LOG_LEVEL=info go run ./cmd/bluepages serve 96 95 97 96 .PHONY: build-relay-image 98 - build-relay-image: ## Builds 'bigsky' Relay docker image 99 - docker build -t bigsky -f cmd/bigsky/Dockerfile . 97 + build-relay-image: ## Builds relay docker image 98 + docker build -t relay -f cmd/relay/Dockerfile . 100 99 101 - .PHONY: build-relay-ui 102 - build-relay-ui: ## Build Relay dash web app 103 - cd ts/bgs-dash; yarn install --frozen-lockfile; yarn build 100 + .PHONY: build-relay-admin-ui 101 + build-relay-admin-ui: ## Build relay admin web UI 102 + cd cmd/relay/relay-admin-ui; yarn install --frozen-lockfile; yarn build 104 103 mkdir -p public 105 - cp -r ts/bgs-dash/dist/* public/ 104 + cp -r cmd/relay/relay-admin-ui/dist/* public/ 106 105 107 106 .PHONY: run-relay-image 108 107 run-relay-image: 109 - docker run -p 2470:2470 bigsky /bigsky --admin-key localdev 108 + docker run -p 2470:2470 relay /relay serve --admin-password localdev 110 109 # --crawl-insecure-ws 111 110 112 111 .PHONY: run-dev-search

+3 -2

README.md

··· 8 8 9 9 **Go Services:** 10 10 11 - - **bigsky** ([README](./cmd/bigsky/README.md)): relay reference implementation, running at `bsky.network` 11 + - **relay** ([README](./cmd/relay/README.md)): relay reference implementation 12 + - **rainbow** ([README](./cmd/rainbow/README.md)): firehose "splitter" or "fan-out" service 12 13 - **palomar** ([README](./cmd/palomar/README.md)): fulltext search service for <https://bsky.app> 13 14 - **hepa** ([README](./cmd/hepa/README.md)): auto-moderation bot for [Ozone](https://ozone.tools) 14 15 ··· 47 48 48 49 Individual commands can be run like: 49 50 50 - go run ./cmd/bigsky 51 + go run ./cmd/relay 51 52 52 53 The [HACKING](./HACKING.md) file has a list of commands and packages in this repository and some other development tips. 53 54

+52

cmd/relay/HACKING.md

··· 1 + 2 + 3 + ## Behaviors 4 + 5 + Details about how the relay operates which might not be obvious! 6 + 7 + - unknown/unexpected fields on overall firehose messages (eg, `#commit`) are *not* passed-through, so it is critical to upgrade the relay when there are protocol changes 8 + - records and commit objects *are* passed through verbatim: they are serialized in `blocks` fields on `#commit` and `#sync` messages 9 + - some admin UI changes are persisted across restarts (stored in database), others are not (ephemeral) 10 + - ephemeral (but can be configured via env vars): new-hosts-per-day limit; enable/disable requestCrawl 11 + - persisted (in database): account takedowns, domain bans, host bans, host account limit 12 + - the "lenient mode" configuration flag is intended as a short-term migration tool for [atproto Sync 1.1](https://github.com/bluesky-social/proposals/tree/main/0006-sync-iteration) and will be removed over time 13 + - once an upstream host websocket is established, the sequence numbers on that socket must always increase; messages with lower sequence will be dropped. but this is only strictly enforced over the life of the socket connection; if the relay restarts and the host emits older sequence numbers, those messages will start coming through 14 + - for a new host (no known previous sequence number), the relay will connect at "current" firehose offset, not "oldest" offset and backfill 15 + - for a known host, the relay will attempt to reconnect (eg, after a drop or restart) at the last persisted sequence number. persisting should happen every few seconds, or at clean shutdown of the daemon, but it is possible for the cursor to be slightly out of sync, resulting in replay of messages 16 + - account-level `#commit` revisions must always increase, and these revisions are stored for every valid `#commit` or `#sync` message from the account. repeated or lower revision messages are dropped. 
messages with revisions corresponding to a TID "in the future" (beyond a fudge period of a few minutes) are also dropped 17 + - messages for an account (DID) which come from a host connection which is not the current PDS host for that account are dropped. If there is a mismatch, the relay will re-resolve the identity (DID document) and double-check before dropping the message, in case there was an account migration not reflected yet in local caches. 18 + - if a host sends no messages for a long period, the relay will drop the connection and set the host status to "idle"; this is common for low-traffic PDS instances (eg, handful of accounts). The expectation is that the host would then send a `requestCrawl` ping next time there is a new event. 19 + - when the relay restarts, it connects to all "active" hosts 20 + 21 + 22 + ## Internal Implementation Details 23 + 24 + - the parallel event scheduler prevents multiple tasks for the same account (DID) from being processed at the same time 25 + - note the potential for race-conditions with messages about the same account (DID) coming from different hosts around the same time: in this case there is no guarantee about ordering 26 + - the relay keeps track of which events have been received-but-not-processed by sequence number, and only increments the `lastSeq` for actually-processed events. the "inflight" set of messages (sequence numbers) can grow rather large for active hosts, if there are many events for a single account (only one processed per account at a time) 27 + 28 + 29 + ## Code Organization and History 30 + 31 + *Note: this was written in April 2025, and is likely to get out of date* 32 + 33 + This codebase started as a fork of the prior `bigsky` / "BGS" relay implementation. The host and account state management, and message validation, were re-written. The "slurper" got a refactor, and some event stream and disk persistence code got lighter changes. 
34 + 35 + - `Service` struct: overall service executable/daemon. Implements protocol and admin HTTP endpoints. 36 + - `relay.Relay` struct: core relay service logic, message validation and processing, state and database management 37 + - `relay.Slurper` struct: maintains active subscriptions (WebSocket connections) to upstream hosts (eg, PDS instances) 38 + - `relay/models` package: database models 39 + - `stream` package: fork of `indigo:events` package, including websocket "frame" type, listeners, and some event stream rate-limiting 40 + - `stream.XRPCStreamEvent` struct: relatively critical/central serialization type 41 + - `stream.eventmgr.EventManager`: manages output firehose: disk persistence, sequencing, etc 42 + - `testing` package: end-to-end integration tests 43 + 44 + The `stream` code should probably get merged back in with the `indigo:events` at some point, but there are many small differences so it won't be a quick/trivial change. 45 + 46 + 47 + ## Verification Tools and Tests 48 + 49 + - `goat` has several firehose verify flags 50 + - `./testing/` contains a framework for end-to-end relay integration tests 51 + - commit-level MST slice validation tests are in `indigo:atproto/repo` 52 + - there are some interop test resources at: https://github.com/bluesky-social/atproto-interop-tests
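The per-socket sequence rule described under "Behaviors" in `HACKING.md` can be illustrated with a minimal Go sketch. This is not the relay's actual code: the `hostCursor` type and its `Accept` method are hypothetical names invented here, and treating a repeated sequence number the same as a lower one is an assumption drawn from the "must always increase" wording.

```go
package main

import "fmt"

// hostCursor is a hypothetical type (not taken from the relay codebase).
// It tracks the highest sequence number seen on one upstream socket.
type hostCursor struct {
	lastSeq int64
	seen    bool
}

// Accept reports whether a message with the given sequence number should be
// processed. Sequence numbers must strictly increase for the life of the
// socket; lower (and, by assumption, repeated) numbers are dropped.
func (c *hostCursor) Accept(seq int64) bool {
	if c.seen && seq <= c.lastSeq {
		return false
	}
	c.lastSeq = seq
	c.seen = true
	return true
}

func main() {
	conn := &hostCursor{}
	for _, seq := range []int64{10, 11, 9, 11, 12} {
		fmt.Printf("seq=%d accepted=%v\n", seq, conn.Accept(seq))
	}

	// A fresh cursor models a relay restart with no in-memory state for the
	// socket: an older sequence number is accepted again.
	restarted := &hostCursor{}
	fmt.Printf("after restart: seq=3 accepted=%v\n", restarted.Accept(3))
}
```

The fresh cursor at the end mirrors the note that enforcement only holds over the life of a socket connection, which is why older messages can come through again after a restart.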

+65 -34

cmd/relay/README.md

··· 6 6 7 7 This is a reference implementation of an atproto relay, written and operated by Bluesky. 8 8 9 - In atproto, a relay subscribes to multiple PDS hosts and outputs a combined "firehose" event stream. Downstream services can subscribe to this single firehose a get all relevant events for the entire network, or a specific sub-graph of the network. The relay maintains a mirror of repo data from all accounts on the upstream PDS instances, and verifies repo data structure integrity and identity signatures. It is agnostic to applications, and does not validate data against atproto Lexicon schemas. 9 + In [atproto](https://atproto.com), a relay subscribes to multiple PDS hosts and outputs a combined "firehose" event stream. Downstream services can subscribe to this single firehose and get all relevant events for the entire network, or a specific sub-graph of the network. The relay verifies repo data structure integrity and identity signatures. It is application-agnostic, and does not validate data records against atproto Lexicon schemas. 10 10 11 11 This relay implementation is designed to subscribe to the entire global network. The current state of the codebase is informally expected to scale to around 100 million accounts in the network, and tens of thousands of repo events per second (peak). 
12 12 13 13 Features and design decisions: 14 14 15 - - runs on a single server 16 - - crawling and account state: stored in SQL database 17 - - SQL driver: gorm, with PostgreSQL in production and sqlite for testing 15 + - runs on a single server (not a distributed system) 16 + - upstream host and account state is stored in a SQL database 17 + - SQL driver: [gorm](https://gorm.io), supporting PostgreSQL in production and sqlite for testing 18 18 - highly concurrent: not particularly CPU intensive 19 19 - single golang binary for easy deployment 20 20 - observability: logging, prometheus metrics, OTEL traces 21 21 - admin web interface: configure limits, add upstream PDS instances, etc 22 22 23 - This software is not yet as packaged, documented, and supported for self-hosting as our PDS distribution or Ozone service. But it is relatively simple and inexpensive to get running. 23 + This daemon is relatively simple to self-host, though it isn't as well documented or supported as the PDS reference implementation (see details below). 24 24 25 - A note and reminder about relays in general are that they are more of a convenience in the protocol than a hard requirement. The "firehose" API is the exact same on the PDS and on a relay. Any service which subscribes to the relay could instead connect to one or more PDS instances directly. 25 + See `./HACKING.md` for more documentation of specific behaviors of this implementation. 26 26 27 27 28 28 ## Development Tips 29 29 30 30 The README and Makefile at the top level of this git repo have some generic helpers for testing, linting, formatting code, etc. 
31 31 32 - To re-build and run the relay locally: 32 + To build the admin web interface, and then build and run the relay locally: 33 33 34 + make build-relay-admin-ui 34 35 make run-dev-relay 35 36 36 - You can re-build and run the command directly to get a list of configuration flags and env vars; env vars will be loaded from `.env` if that file exists: 37 + You can run the command directly to get a list of configuration flags and environment variables. The environment will be loaded from a `.env` file if one exists: 37 38 38 - RELAY_ADMIN_PASSWORD=dummy go run ./cmd/relay/ --help 39 + go run ./cmd/relay/ --help 39 40 40 - By default, the daemon will use sqlite for databases (in the directory `./data/relay/`) and the HTTP API will be bound to localhost port 2470. 41 + You can also build and run the command directly: 42 + 43 + go build ./cmd/relay 44 + ./relay serve 45 + 46 + By default, the daemon will use sqlite for databases (in the directory `./data/relay/`), and the HTTP API will be bound to localhost port 2470. 41 47 42 48 When the daemon isn't running, sqlite database files can be inspected with: 43 49 ··· 45 51 [...] 46 52 sqlite> .schema 47 53 48 - Wipe all local data: 54 + To wipe all local data (careful!): 49 55 50 - # careful! double-check this destructive command 56 + # double-check before running this destructive command 51 57 rm -rf ./data/relay/* 52 58 53 - There is a basic web dashboard, though it will not be included unless built and copied to a local directory `./public/`. Run `make build-relay-ui`, and then when running the daemon the dashboard will be available at: <http://localhost:2470/dash/>. Paste in the admin key, eg `dummy`. 59 + There is a basic web dashboard, though it will not be included unless built and copied to a local directory `./public/`. Run `make build-relay-admin-ui`, and then when running the daemon the dashboard will be available at: <http://localhost:2470/dash/>. Paste in the admin key, eg `dummy`. 
54 60 55 61 The local admin routes can also be accessed by passing the admin password using HTTP Basic auth (with username `admin`), for example: 56 62 ··· 60 66 61 67 http post :2470/admin/pds/requestCrawl -a admin:dummy hostname=pds.example.com 62 68 69 + The `goat` command line tool (also part of the indigo git repository) includes helpers for administering, inspecting, and debugging relays: 63 70 64 - ## Docker Containers 71 + RELAY_HOST=http://localhost:2470 goat firehose --verify-mst 72 + RELAY_HOST=http://localhost:2470 goat relay admin host list 65 73 66 - One way to deploy is running a docker image. You can pull and/or run a specific version of relay, referenced by git commit, from the Bluesky Github container registry. For example: 74 + ## API Endpoints 67 75 68 - docker pull ghcr.io/bluesky-social/indigo:relay-fd66f93ce1412a3678a1dd3e6d53320b725978a6 69 - docker run ghcr.io/bluesky-social/indigo:relay-fd66f93ce1412a3678a1dd3e6d53320b725978a6 76 + This relay implements the core atproto "sync" API endpoints: 70 77 71 - There is a Dockerfile in this directory, which can be used to build customized/patched versions of the relay as a container, republish them, run locally, deploy to servers, deploy to an orchestrated cluster, etc. See docs and guides for docker and cluster management systems for details. 78 + - `GET /xrpc/com.atproto.sync.subscribeRepos` (WebSocket) 79 + - `GET /xrpc/com.atproto.sync.getRepo` (HTTP redirect to account's PDS) 80 + - `GET /xrpc/com.atproto.sync.getRepoStatus` 81 + - `GET /xrpc/com.atproto.sync.listRepos` (optional) 82 + - `GET /xrpc/com.atproto.sync.getLatestCommit` (optional) 72 83 84 + It also implements some relay-specific endpoints: 73 85 74 - ## Database Setup 86 + - `POST /xrpc/com.atproto.sync.requestCrawl` 87 + - `GET /xrpc/com.atproto.sync.listHosts` 88 + - `GET /xrpc/com.atproto.sync.getHostStatus` 75 89 76 - PostgreSQL and Sqlite are both supported. 
Database configuration is passed via the `DATABASE_URL` environment variable, or the corresponding CLI arg. 90 + Documentation can be found in the [atproto specifications](https://atproto.com/specs/sync) for repository synchronization, event streams, data formats, account status, etc. 77 91 78 - For PostgreSQL, the user and database must already be configured. Some example SQL commands are: 92 + This implementation also has some off-protocol admin endpoints under `/admin/`. These have legacy schemas from an earlier implementation, are not well documented, and should not be considered a stable API to build upon. The intention is to refactor them into Lexicon-specified APIs. 79 93 80 - CREATE DATABASE relay; 94 + ## Configuration and Operation 81 95 82 - CREATE USER ${username} WITH PASSWORD '${password}'; 83 - GRANT ALL PRIVILEGES ON DATABASE relay TO ${username}; 96 + *NOTE: this document is not a complete guide to operating a relay as a public service. That requires planning around acceptable use policies, financial sustainability, infrastructure selection, etc. This is just a quick overview of the mechanics of getting a relay up and running.* 84 97 85 - This service currently uses `gorm` to automatically run database migrations as the regular user. There is no concept of running a separate set of migrations under more privileged database user. 86 - 87 - 88 - ## Deployment 89 - 90 - *NOTE: this is not a complete guide to operating a relay. There are decisions to be made and communicated about policies, bandwidth use, PDS crawling and rate-limits, financial sustainability, etc, which are not covered here. This is just a quick overview of how to technically get a relay up and running.* 91 - 92 - In a real-world system, you will probably want to use PostgreSQL. 
93 - 94 - Some notable configuration env vars to set: 98 + Some notable configuration env vars: 95 99 96 100 - `RELAY_ADMIN_PASSWORD` 97 101 - `DATABASE_URL`: eg, `postgres://relay:CHANGEME@localhost:5432/relay` ··· 103 107 There is a health check endpoint at `/xrpc/_health`. Prometheus metrics are exposed by default on port 2471, path `/metrics`. The service logs fairly verbosely to stdout; use `LOG_LEVEL` to control log volume (`warn`, `info`, etc). 104 108 105 109 Be sure to double-check bandwidth usage and pricing if running a public relay! Bandwidth prices can vary widely between providers, and popular cloud services (AWS, Google Cloud, Azure) are very expensive compared to alternatives like OVH or Hetzner. 110 + 111 + The relay admin interface has flexibility for many situations, but in some operational incidents it may be necessary to run SQL commands to do cleanups. This should be done when the relay is not actively operating. It is also recommended to run SQL commands in a transaction that can be rolled back in case of a typo or mistake. 112 + 113 + ### PostgreSQL 114 + 115 + PostgreSQL is recommended for any non-trivial relay deployments. Database configuration is passed via the `DATABASE_URL` environment variable, or the corresponding CLI arg. 116 + 117 + The user and database must already be configured. For example: 118 + 119 + CREATE DATABASE relay; 120 + 121 + CREATE USER ${username} WITH PASSWORD '${password}'; 122 + GRANT ALL PRIVILEGES ON DATABASE relay TO ${username}; 123 + 124 + This service currently uses `gorm` to automatically run database migrations as the regular user. There is no support for running database migrations separately under a more privileged database user. 125 + 126 + ### Docker 127 + 128 + The relay is relatively easy to build and operate as a simple executable, but there is also a Dockerfile in this directory. 
It can be used to build customized/patched versions of the relay as a container, republish them, run locally, deploy to servers, deploy to an orchestrated cluster, etc. 129 + 130 + We strongly recommend running docker in "host networking" mode when operating a full-network relay. 131 + 132 + ### Bootstrapping Host List 133 + 134 + The relay comes with a helper command to pull a list of hosts from an existing relay. You should shut the relay down first and run this as a separate command: 135 + 136 + ./relay pull-hosts
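The README mentions a health check endpoint at `/xrpc/_health` on the HTTP API port (2470 by default). A small Go probe for it might look like the sketch below. The helper names `checkHealth` and `newStubRelay` are hypothetical, and treating "HTTP 200" as healthy is an assumption, since the response format is not documented in this diff. The stub server only stands in for a relay so the example runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkHealth (hypothetical helper) sends a GET to the health check path and
// returns an error unless the response status is 200 OK. Treating 200 as
// "healthy" is an assumption, not documented relay behavior.
func checkHealth(client *http.Client, baseURL string) error {
	resp, err := client.Get(baseURL + "/xrpc/_health")
	if err != nil {
		return fmt.Errorf("health request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected health status: %s", resp.Status)
	}
	return nil
}

// newStubRelay (hypothetical helper) starts a local stand-in server that
// answers 200 on the health path and 404 elsewhere.
func newStubRelay() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/xrpc/_health" {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
}

func main() {
	stub := newStubRelay()
	defer stub.Close()

	// Against a real local relay, baseURL would be "http://localhost:2470".
	if err := checkHealth(stub.Client(), stub.URL); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy")
}
```

A probe like this could be pointed at a local dev relay started with `make run-dev-relay` by swapping the stub URL for `http://localhost:2470`.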
