docs: add zlay architecture, backfill strategy, and updated specs

+61 -2

1 changed file

docs/architecture.md

@@ -47,13 +47,72 @@

 fix: a k8s CronJob (`deploy/reconnect-cronjob.yaml`) runs every 4 hours, fetching the [community PDS list](https://github.com/mary-ext/atproto-scraping) and sending `requestCrawl` for each host. this can also be run manually via `just reconnect`.

-## steady-state specs
+## steady-state specs (indigo relay)

 | metric | value |
 |--------|-------|
 | storage (relay data) | ~21 GB |
 | storage (postgres) | ~2.4 GB |
-| storage (collectiondir pebble) | ~5 GB (post-backfill) |
+| storage (collectiondir pebble) | ~5 GB (post-backfill, ~5M DIDs indexed) |
 | CPU usage | 5-15% |
 | network throughput | ~600 events/sec typical, 2000 peak |
 | connected PDS hosts | ~2800 |
+| memory (relay) | ~6 GiB (plateaus at GOMEMLIMIT) |
+| memory (collectiondir) | ~470 MiB steady-state |
+
+---
+
+## zlay (zig relay)
+
+a second relay implementation in [Zig](https://ziglang.org/), deployed on a separate Hetzner node. source: [tangled.org/zzstoatzz.io/zlay](https://tangled.org/zzstoatzz.io/zlay). runs at `zlay.waow.tech`.
+
+### how it differs from indigo
+
+**same model, different internals.** zlay crawls PDS hosts directly — it is not a fan-out relay. `RELAY_UPSTREAM` (default: `bsky.network`) is a bootstrap seed used once at startup to populate the host list via `listHosts`. after that, all data flows directly from each PDS.
+
+**inline collection index.** instead of running collectiondir as a sidecar, zlay indexes `(DID, collection)` pairs directly in its event processing pipeline. storage is [RocksDB](https://rocksdb.org/) with two column families (`rbc` for collection→DID lookups, `cbr` for DID→collection cleanup). serves `listReposByCollection` from the relay's HTTP port — no separate service.
+
+**optimistic validation.** on a signing key cache miss, zlay passes the frame through immediately and queues the DID for background resolution. first commit from an unknown account is unvalidated; subsequent commits are verified. indigo blocks until resolution completes.
+
+**split ports.** 3000 for the WebSocket firehose, 3001 for HTTP (health, stats, metrics, admin, XRPC). indigo serves everything on port 2470 (with metrics on 2471).
+
+**OS threads, not goroutines.** one thread per PDS host subscription, one per downstream consumer. predictable memory (no GC), but thread count scales linearly with host count.
+
+### deployment
+
+separate Hetzner cpx41 in Hillsboro OR (`hil`), independent k3s cluster. all `zlay-*` justfile recipes use `ZLAY_KUBECONFIG`. terraform in `infra/zlay/`.
+
+```bash
+just zlay-init        # terraform init
+just zlay-infra       # create server
+just zlay-kubeconfig  # pull kubeconfig
+just zlay-deploy      # full deploy (cert-manager, postgres, relay, monitoring)
+just zlay-publish     # build and push image
+just zlay-status      # check pods + health
+just zlay-logs        # tail logs
+```
+
+### collection index backfill
+
+the collection index is live-only — it indexes `create` ops as they flow through the firehose. historical data requires a backfill. recommended approaches:
+
+1. **import from bsky.network** (fastest): paginate `listReposByCollection` on the reference relay for each collection, bulk-insert pairs into RocksDB. no PDS crawling, no rate limits. `addCollection` is idempotent.
+2. **describeRepo crawl** (independent): crawl the host table, calling `listRepos` + `describeRepo` per PDS. same rate limit gotchas as indigo collectiondir — see [backfill.md](backfill.md).
+3. **hybrid** (recommended): import from reference relay for immediate parity, then live indexing keeps current. optionally add a slow background verify-crawl later.
+
+### verification
+
+`scripts/zlay-smoketest` tests endpoint conformance, pagination, and set completeness against a reference relay. `scripts/collectiondir-diff` compares `listReposByCollection` results between any two endpoints (use `--limit` values ≤ 1000 for zlay).
+
+[pulsar](https://tangled.org/mackuba.eu/pulsar) (by @mackuba.eu) provides live firehose coverage comparison — subscribes to multiple relays simultaneously and counts unique DIDs over a time window.
+
+### steady-state specs (zlay)
+
+| metric | value |
+|--------|-------|
+| connected PDS hosts | ~2,749 |
+| collection index DIDs | ~497K (live-only, no backfill) |
+| memory request | 512 MiB |
+| memory limit | 8 GiB |
+| PVC | 20 GiB |
+| `listReposByCollection` max limit | 1000 |
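
note on backfill approach 1 (import from bsky.network): the core of it is a cursor-pagination loop over `listReposByCollection`. a minimal sketch in Python, not the zlay implementation. assumptions: the response is JSON with a `repos` array of `{"did": ...}` objects and an optional `cursor`; `iter_collection_dids`, `FetchPage`, and `fake_fetch` are hypothetical names; the default `limit` of 1000 is a placeholder. the page fetcher is injected so the loop can be exercised without a network.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page fetcher: (collection, limit, cursor) -> decoded JSON response.
# A real one would wrap an HTTP GET against the reference relay's
# listReposByCollection endpoint (assumed response shape: repos + cursor).
FetchPage = Callable[[str, int, Optional[str]], dict]


def iter_collection_dids(
    fetch_page: FetchPage, collection: str, limit: int = 1000
) -> Iterator[tuple[str, str]]:
    """Yield (did, collection) pairs, following cursors until none is returned."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(collection, limit, cursor)
        for repo in page.get("repos", []):
            yield repo["did"], collection
        cursor = page.get("cursor")
        if not cursor:
            break  # no cursor means this was the last page


def fake_fetch(collection: str, limit: int, cursor: Optional[str]) -> dict:
    """In-memory stand-in for the relay that serves two fixed pages."""
    if cursor is None:
        return {
            "repos": [{"did": "did:plc:aaa"}, {"did": "did:plc:bbb"}],
            "cursor": "page2",
        }
    return {"repos": [{"did": "did:plc:ccc"}]}


pairs = list(iter_collection_dids(fake_fetch, "app.bsky.feed.post"))
```

each yielded pair would then be handed to the bulk insert. since the doc states `addCollection` is idempotent, re-running an interrupted import from the first page should be safe, only slower.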
