declarative relay deployment on hetzner relay-eval.waow.tech
atproto relay

docs: update for collectiondir → lightrail migration

- ops-changelog: add 2026-03-27 lightrail cutover entry
- architecture: replace collectiondir section with lightrail
- backfill: rewrite for lightrail's self-managed backfill
- deploying: remove `just indigo backfill` command
- README: update collectiondir row to lightrail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+61 -74
+1 -1
README.md
···
  |---|---|---|
  | **firehose** | `wss://relay.waow.tech` | `wss://zlay.waow.tech` |
  | **jetstream** | `wss://jetstream.waow.tech/subscribe` ([sidecar](https://github.com/bluesky-social/jetstream)) | — |
- | **collectiondir** | [sidecar](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) on same endpoint | built-in (inspired by [lightrail](https://tangled.org/microcosm.blue/lightrail)) |
+ | **collectiondir** | [lightrail](https://tangled.org/microcosm.blue/lightrail) sidecar on same endpoint | built-in (inspired by lightrail) |
  | **health** | [`relay.waow.tech/xrpc/_health`](https://relay.waow.tech/xrpc/_health) | [`zlay.waow.tech/_health`](https://zlay.waow.tech/_health) |
  | **metrics** | [`relay-metrics.waow.tech`](https://relay-metrics.waow.tech) | [`zlay-metrics.waow.tech`](https://zlay-metrics.waow.tech) |
  | **source** | [bluesky-social/indigo](https://github.com/bluesky-social/indigo) | [zzstoatzz.io/zlay](https://tangled.org/zzstoatzz.io/zlay) |
+9 -9
docs/architecture.md
···
  the relay maintains an in-process identity cache (hashicorp LRU, 5M entries, 24h TTL) — every event requires a DID document lookup, and this cache keeps the relay from hammering PLC. memory usage climbs over the first day as the cache fills, then plateaus once eviction matches insertion. `GOMEMLIMIT=6GiB` is set so the Go runtime returns memory to the OS under pressure rather than holding onto it indefinitely.

- ### collectiondir
+ ### lightrail

- a sidecar, not part of the relay itself. [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) subscribes to the relay's firehose over localhost (`ws://relay:2470`), indexes `(DID, collection)` pairs in a [pebble](https://github.com/cockroachdb/pebble) key-value store, and serves `com.atproto.sync.listReposByCollection` — the endpoint TAP crawlers use to enumerate which accounts have records in a given collection.
+ a sidecar serving `com.atproto.sync.listReposByCollection` — the endpoint TAP crawlers use to enumerate which accounts have records in a given collection. [lightrail](https://tangled.org/microcosm.blue/lightrail) is fig's Rust collection directory, replacing the previous Go-based collectiondir (which had unbounded memory growth).

- **what pebble stores:** each key is a `(collection, DID)` pair. the value is minimal (just a marker). when a TAP crawler asks "who has `app.bsky.feed.post` records?", the collectiondir does a prefix scan over all keys starting with that collection and returns the DIDs, paginated.
+ lightrail subscribes to the relay's firehose (`--subscribe https://relay.waow.tech`), indexes `(DID, collection)` pairs in [fjall](https://github.com/fjall-rs/fjall), and detects collection creation/deletion using MST adjacent key proofs from sync 1.1 commit ops — no `describeRepo` calls needed for most events.

- **live indexing vs historical data:** the collectiondir sees every new commit on the firehose in real time, so newly-created accounts and new record types are indexed immediately. but it has no knowledge of accounts that existed before it started running, or accounts that haven't posted since it came online. that gap is what the [backfill](backfill.md) covers.
+ **backfill:** lightrail handles its own via `--deep-crawl`, discovering hosts from the relay's `listHosts` and crawling each one's `listRepos`. no manual backfill step needed.

- **consequence of missing pairs:** if a `(DID, collection)` pair is absent, that DID won't appear in `listReposByCollection` responses for that collection. TAP crawlers won't discover the account through this endpoint. the relay's firehose is unaffected — the collectiondir is purely a directory service layered on top.
+ **admin:** `GET /admin` serves an HTML dashboard; `GET /admin/status` returns JSON. both require HTTP basic auth (password from `LIGHTRAIL_ADMIN_PASSWORD` env var).

  routed via traefik ingress path matching (`/xrpc/com.atproto.sync.listReposByCollection`) so the relay's existing endpoints are unaffected.
···
  ### monitoring

- prometheus + grafana via [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack). scrapes relay (`:2471/metrics`), jetstream, and collectiondir (`:2511/metrics`). kubelet scraping is enabled for container-level disk I/O metrics. public read-only access at `relay-metrics.waow.tech`.
+ prometheus + grafana via [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack). scrapes relay (`:2471/metrics`), jetstream, and lightrail (`:6789/metrics`). kubelet scraping is enabled for container-level disk I/O metrics. public read-only access at `relay-metrics.waow.tech`.

- the relay and collectiondir ServiceMonitors are standalone manifests (`kubectl apply -f`) rather than inline in the helm values — the `additionalServiceMonitors` field in kube-prometheus-stack silently fails when targeting services in a different namespace.
+ the relay and lightrail ServiceMonitors are standalone manifests (`kubectl apply -f`) rather than inline in the helm values — the `additionalServiceMonitors` field in kube-prometheus-stack silently fails when targeting services in a different namespace.

  ## PDS connection maintenance
···
  |--------|-------|
  | storage (relay data) | ~21 GB |
  | storage (postgres) | ~2.4 GB |
- | storage (collectiondir pebble) | ~5 GB (post-backfill, ~5M DIDs indexed) |
+ | storage (lightrail fjall) | ~3 GB (~6.8M repos indexed) |
  | CPU usage | 5-15% |
  | network throughput | ~600 events/sec typical, 2000 peak |
  | connected PDS hosts | ~2800 |
  | memory (relay) | ~6 GiB (plateaus at GOMEMLIMIT) |
- | memory (collectiondir) | ~470 MiB steady-state |
+ | memory (lightrail) | ~4 GiB during resync, expected lower at steady state |

  ---
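The `listReposByCollection` endpoint that lightrail serves is cursor-paginated like the other `com.atproto.sync.*` list endpoints. As a sketch of how a TAP crawler might build page URLs against it — the `collection`/`limit`/`cursor` parameter names follow the atproto sync convention, and the `build_url` helper is illustrative, not part of this repo:

```shell
#!/bin/sh
# Illustrative helper (not from the repo): build one page's query URL for
# com.atproto.sync.listReposByCollection. Pass an empty cursor for the
# first page; subsequent pages pass the cursor the previous page returned.
build_url() {
  collection=$1; limit=$2; cursor=$3
  url="https://relay.waow.tech/xrpc/com.atproto.sync.listReposByCollection?collection=${collection}&limit=${limit}"
  # append the cursor only once a previous page returned one
  [ -n "$cursor" ] && url="${url}&cursor=${cursor}"
  echo "$url"
}

build_url app.bsky.feed.post 500 ""
build_url app.bsky.feed.post 500 did:plc:ex4mple
```

A real crawler would `curl` each URL and loop until the response no longer includes a cursor.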
+12 -63
docs/backfill.md
···
- # backfilling the collection directory
+ # collection directory backfill

- the collectiondir indexes `(DID, collection)` pairs from the relay's live firehose, but it has no knowledge of accounts that existed before it started running. the backfill crawls PDS hosts to fill that gap.
-
- ## what the crawl does
+ ## lightrail (current)

- for each PDS host, the collectiondir:
+ lightrail handles its own backfill automatically via `--deep-crawl`. on startup it:

- 1. paginates `com.atproto.sync.listRepos` (1000 at a time) to get every DID on that host
- 2. for each DID, calls `com.atproto.repo.describeRepo` to get the list of collections
- 3. writes each `(DID, collection)` pair to pebble
-
- the calls are sequential per host, rate-limited at 100 req/s (configurable via `--crawl-qps`). in practice, network latency (~170ms per call from Hetzner Ashburn to Bluesky's US shards) limits throughput to ~6 repos/second per host.
-
- ## two categories of hosts
-
- **indie PDS hosts** (~2200): independently-run servers, mostly small (1-100 accounts each). backfilling all of them takes minutes.
-
- **bluesky shards** (~88): the mushroom-named hosts (`amanita.us-east.host.bsky.network`, `chanterelle.us-west.host.bsky.network`, etc.) that host the vast majority of accounts. ~30K-50K repos per shard on average, up to ~500K for the largest. these take days to crawl (see batch sizing below).
+ 1. discovers PDS hosts from the relay's `com.atproto.sync.listHosts` endpoint
+ 2. crawls each host's `com.atproto.sync.listRepos` to enumerate all DIDs
+ 3. resyncs each DID — fetches `describeRepo` to get collections, indexes `(DID, collection)` pairs in fjall

- ## running the backfill
+ progress is tracked internally (resync queue in fjall). pod restarts resume from where they left off.

- the `just indigo backfill` recipe handles port-forwarding and host list extraction:
+ **monitoring**: `GET /admin/status` (basic auth) returns `resync_queue_depth`, `resyncs_completed_total`, and `upstream_backfill_complete`. the grafana dashboard has panels for these metrics.

- ```bash
- # backfill all connected hosts (extracts list from relay automatically)
- just indigo backfill
+ **timing**: full resync of ~6.8M repos took ~2.5 days. rate is governed by `--crawl-qps` (default 8) and PDS response times.

- # backfill from a specific host list with custom batch size
- just indigo backfill --hosts /tmp/bsky-shards.txt --batch-size 10
- ```
+ **no manual intervention needed.** lightrail manages the entire lifecycle — host discovery, crawling, retry on failure, and rate-limit cooldown.

- or run the script directly (requires a port-forward to the collectiondir):
+ ## collectiondir (legacy, replaced 2026-03-27)

- ```bash
- ./scripts/backfill --token "$COLLECTIONDIR_ADMIN_TOKEN" --hosts hosts.txt
- ./scripts/backfill --token "$TOKEN" --hosts hosts.txt --batch-size 20 --pause 30
-
- # resume a run that died partway through (skip first 35 batches)
- ./scripts/backfill --token "$TOKEN" --hosts hosts.txt --batch-size 1 --skip 35
-
- # set a timeout per batch (useful for long-running shard crawls)
- ./scripts/backfill --token "$TOKEN" --hosts hosts.txt --batch-size 1 --batch-timeout 600
- ```
-
- the script sends batches of N hosts (default: 10) to `POST /admin/pds/requestCrawl`, then polls `GET /admin/crawlStatus` until active crawls drain before sending the next batch. if `--batch-timeout` is set, it moves on after that many seconds even if crawls are still active. ctrl-c stops after the current batch finishes.
-
- the script retries on transient connection errors (e.g. port-forward drops) with exponential backoff — up to 12 retries for status polling, 6 for crawl requests.
-
- ## batch sizing
-
- **indie PDS hosts:** each host in a batch crawls concurrently and they're all independent servers, so batch-size 10 is fine.
-
- **bsky shards:** all the mushroom-named hosts share an IP-based rate limit. more than ~2 concurrent crawls from our IP triggers HTTP 429, and the crawl code has no retry logic — a single rate limit kills the entire crawl for that host. use `--batch-size 1` for bsky shards. this means crawling all 87 shards takes days, not hours.
-
- ## monitoring
-
- watch progress via:
- - the backfill script's stdout (hosts crawled, repos described, ETA)
- - grafana: collectiondir panels show firehose events/sec, commits/sec, new pairs indexed/sec, and disk I/O
- - `kubectl exec -n relay deploy/collectiondir -- df -h /data` for pebble disk usage
- - crawl status API: `curl -H "Authorization: Bearer $TOKEN" localhost:2510/admin/crawlStatus`
-
- ## gotchas
-
- - **port-forwards die** after ~80 minutes. server-side crawls survive the disconnect, so progress isn't lost — but the script can't poll status or submit new batches until the port-forward is re-established. the retry logic handles brief drops; for longer outages, re-run with `--skip`.
- - **crawl state is in-memory.** a collectiondir pod restart loses all in-progress crawl goroutines. completed pairs are already in pebble and safe.
- - **no 429 retry in crawl code.** the collectiondir's crawl thread doesn't retry on HTTP 429. a single rate-limit response kills the entire crawl for that host. this is why bsky shards must be submitted one at a time.
-
- ## storage impact
-
- pebble stores one key per `(collection, DID)` pair. post-backfill (indie + all bsky shards, ~2.96M repos), the DB is ~5 GB. the collectiondir has a 10Gi PVC.
+ the previous Go-based collectiondir required manual backfill via `scripts/backfill`. that workflow is no longer needed. the collectiondir helm release is scaled to 0 but kept for rollback — see `just indigo collectiondir-publish`.
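The new doc's monitoring guidance (poll `/admin/status` for `resync_queue_depth`) could look like the sketch below in a script. The JSON sample is fabricated for illustration, and the `localhost:2510` address assumes a port-forward to lightrail's serving port; only the field names come from the doc:

```shell
#!/bin/sh
# In a live check the status JSON would come from something like:
#   curl -s -u "admin:$LIGHTRAIL_ADMIN_PASSWORD" http://localhost:2510/admin/status
# A fabricated sample response stands in here so the parsing runs offline.
status='{"resync_queue_depth": 1200, "resyncs_completed_total": 6798800, "upstream_backfill_complete": false}'

# extract resync_queue_depth with POSIX sed (no jq dependency)
depth=$(printf '%s' "$status" | sed -n 's/.*"resync_queue_depth": *\([0-9][0-9]*\).*/\1/p')

if [ "$depth" -eq 0 ]; then
  echo "resync drained"
else
  echo "still resyncing: $depth repos queued"
fi
# → still resyncing: 1200 repos queued
```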
-1
docs/deploying.md
···
  just indigo logs       # tail relay logs
  just indigo health     # curl the public health endpoint
  just indigo reconnect  # re-announce all known PDS hosts to the relay
- just indigo backfill   # backfill collectiondir with full network data
  just indigo firehose   # consume the firehose (passes args through)
  just indigo jetstream  # consume the jetstream (passes args through)
  just indigo ssh        # ssh into the server
+39
docs/ops-changelog.md
···

  ---

+ ## 2026-03-27
+
+ ### replaced collectiondir with lightrail on relay.waow.tech
+
+ collectiondir (Go, pebble-backed) had unbounded memory growth — 512 MiB → 1.4 GiB over 16 days, heading toward its 2.5 GiB OOM limit. pprof showed 75% in an LRU cache + pebble with default 8 MB cache and no tuning. no cleanup for deleted DIDs either (indigo #1276).
+
+ replaced with [lightrail](https://tangled.org/microcosm.blue/lightrail) — fig's Rust collection directory. lightrail validates sync 1.1 commit proofs, removes repos on collection deletion, and has a configurable fjall cache (`--fjall-cache-mb`).
+
+ **deployment**: lightrail subscribes to the relay's firehose over HTTPS (`--subscribe https://relay.waow.tech`), indexes `(DID, collection)` pairs in fjall, and serves `listReposByCollection` on port 2510. traefik ingress routes `/xrpc/com.atproto.sync.listReposByCollection` to lightrail.
+
+ **backfill**: lightrail does its own via `--deep-crawl` — discovers hosts from the relay's `listHosts` and crawls each one's `listRepos`. no manual backfill step needed. full resync took ~2.5 days for ~6.8M repos.
+
+ **memory**: peaked at ~4 GiB during full resync (identity cache 2M entries default + fjall 256 MB block cache + mmap pressure from 2.9 GB db on disk). should drop in steady state once resync scanning stops.
+
+ **coverage**: lightrail exceeds collectiondir on `site.standard.publication` (2,952 vs 1,679) and matches on `site.standard.document` (5,052 vs 5,091).
+
+ **rollback**: collectiondir scaled to 0 replicas, helm release + PVC kept. `just indigo collectiondir-publish` recipe retained.
+
+ files: `indigo/deploy/Dockerfile.lightrail` (new), `indigo/deploy/lightrail-values.yaml` (new), `indigo/deploy/lightrail-servicemonitor.yaml` (new), `indigo/deploy/ingress.yaml` (edit), `indigo/deploy/relay-dashboard.json` (edit), `indigo/justfile` (edit)
+
+ ---
+
  ## open items

  ### broadcast after flush (architectural)
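The path-based routing described in the changelog entry (one exact xrpc path diverted to lightrail, everything else to the relay) can be sketched as a standard Kubernetes Ingress. This is an illustrative fragment, not the contents of `indigo/deploy/ingress.yaml` — the resource name, namespace, service names, and relay port are assumptions; lightrail's port 2510 is from the entry above:

```yaml
# illustrative sketch — the real manifest lives in indigo/deploy/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: relay            # assumed name
  namespace: relay       # assumed namespace
spec:
  rules:
    - host: relay.waow.tech
      http:
        paths:
          # an Exact match takes precedence over the Prefix catch-all below,
          # so only this one endpoint is diverted to lightrail
          - path: /xrpc/com.atproto.sync.listReposByCollection
            pathType: Exact
            backend:
              service:
                name: lightrail
                port:
                  number: 2510
          - path: /
            pathType: Prefix
            backend:
              service:
                name: relay        # assumed service name
                port:
                  number: 2470     # relay port per the architecture doc
```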