···

### collection index backfill

-the collection index is live-only — it indexes `create` ops as they flow through the firehose. historical data requires a backfill. recommended approaches:
+the collection index is live-only — it indexes `create` ops as they flow through the firehose. historical data is backfilled by importing from a source relay (bsky.network) via `com.atproto.sync.listReposByCollection`.

-1. **import from bsky.network** (fastest): paginate `listReposByCollection` on the reference relay for each collection, bulk-insert pairs into RocksDB. no PDS crawling, no rate limits. `addCollection` is idempotent.
-2. **describeRepo crawl** (independent): crawl the host table, calling `listRepos` + `describeRepo` per PDS. same rate limit gotchas as indigo collectiondir — see [backfill.md](backfill.md).
-3. **hybrid** (recommended): import from reference relay for immediate parity, then live indexing keeps current. optionally add a slow background verify-crawl later.
+the backfiller discovers collections from two sources (lexicon garden llms.txt + RocksDB scan), then pages through each collection on the source relay, adding DIDs to RocksDB. progress is tracked in postgres for crash-resumability. triggered via `POST /admin/backfill-collections`, status via `GET`.
+
+see the [zlay backfill docs](https://tangled.org/zzstoatzz.io/zlay/tree/main/docs/backfill.md) for full details, or use `scripts/backfill-status` in this repo.
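
a minimal sketch of the paging step described above, in python purely for illustration. it is not the backfiller's implementation, which is not shown in this diff. it assumes the `com.atproto.sync.listReposByCollection` xrpc query takes `collection`, `limit`, and `cursor` parameters and returns `{"repos": [{"did": ...}], "cursor": ...}`. `fetch_page`, `iter_dids`, and `on_cursor` are hypothetical names. the `on_cursor` callback stands in for the postgres progress tracking.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

RELAY = "https://bsky.network"  # source relay named in the text


def fetch_page(collection: str, cursor: Optional[str], limit: int = 500) -> dict:
    """fetch one page of listReposByCollection from the source relay."""
    params = {"collection": collection, "limit": str(limit)}
    if cursor:
        params["cursor"] = cursor
    url = (
        f"{RELAY}/xrpc/com.atproto.sync.listReposByCollection?"
        + urllib.parse.urlencode(params)
    )
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_dids(
    collection: str,
    fetch: Callable[[str, Optional[str]], dict] = fetch_page,
    start_cursor: Optional[str] = None,
    on_cursor: Optional[Callable[[str, Optional[str]], None]] = None,
) -> Iterator[str]:
    """yield every DID for one collection, following cursors until exhausted.

    on_cursor (hypothetical hook) runs only after a whole page has been
    consumed, so a crash mid-page resumes from the previous cursor and
    re-reads that page rather than skipping it.
    """
    cursor = start_cursor
    while True:
        page = fetch(collection, cursor)
        for repo in page.get("repos", []):
            yield repo["did"]
        next_cursor = page.get("cursor")
        if on_cursor is not None:
            on_cursor(collection, next_cursor)
        # stop on a missing cursor, or a repeated one to avoid looping forever
        if not next_cursor or next_cursor == cursor:
            return
        cursor = next_cursor
```

a resumed run would pass the last saved cursor as `start_cursor`. because a page can be re-read after a crash, the store receiving the DIDs needs idempotent inserts, as the earlier text noted for `addCollection`.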

### verification

···
| metric | value |
|--------|-------|
| connected PDS hosts | ~2,749 |
-| collection index DIDs | ~497K (live-only, no backfill) |
+| collection index DIDs | ~13.6M+ (backfill in progress from bsky.network) |
| memory request | 512 MiB |
| memory limit | 8 GiB |
| PVC | 20 GiB |