···9393- [ ] lenient pre-sync1.1
9494 - [ ] *don't* allow non-validating commits that look like sync1.1
9595 - [ ] ratchet by PDS host: be lenient if we have never seen a sync1.1-looking commit, always strict after we see one.
9696- - [ ] boooo we probably need *even more* special handling for pre-sync1.1 repos since they don't include adjacent keys!!!
9696+ - [ ] boooo we might need more handling for pre-sync1.1 repos if they don't include adjacent keys
9797+- [ ] resync free hints from first phony getRecord
9898+ - [ ] short-circuit: tiny repos may incidentally return their entire CAR for getRecord
9999+ - [ ] estimate CAR size and `getRecord` if it's likely very small (bypass `describeRepo`)
100100+- [ ] add a `--heavy` mode that always uses `getRepo` and never `describeRepo`
101101+- [ ] commit CAR handling: generate a list of keys with gaps noted, to reliably detect missing adjacent keys
97102- [ ] account status convergence: if we receive commits from apparently-inactive accounts, should we check upstream status to make sure we're not stale?
98103- [ ] split the keyspace: put the rbc/cbr indexes on a second keyspace with larger block size, expect hits on main keyspace
99104- [ ] websocket ping/pong (unless jacquard is already doing it)
···106111- [ ] admin view of backfill state etc
107112- [ ] vanity stats for optimizations, like how many in-flight repos were saved from resync due to high-water-mark firehose cursor persistence
108113- [ ] if the upstream is a PDS (check with describeServer?) then only accept events for DIDs that have it as their PDS
114114+- [ ] use `since` on getRepo for resync to get a smaller partial export in many cases (and then more-carefully do the actual resync)
109115110116111117### special-casing
···116122## some choices
117123118124- tokio for async runtime: works good
119119-- iroh-car: robust, simple, async
120120-- manual CAR processing: since we need access to adjacent keys
121121- - TODO: repo-stream will expose this soon probably
122122- - TODO: right now we use jacquard_repo but i think it's easier in our case to handle it more manually.
125125+- jacquard almost everywhere: makes things *so much* easier
126126+- repo-stream for CAR processing
123127- fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
124128125129···150154151155taking [inspiration from tap](https://github.com/bluesky-social/indigo/blob/main/cmd/tap/models/models.go) here!
152156153153-TODO: fix outdated prefixes here
157157+see [src/storage/mod.rs](./src/storage/mod.rs) for an accurate key summary. rough overview:
154158155159```
156160main index:
···169173170174subscribeRepos (firehose) cursor:
171175172172- "subscribeRepos"||<subscribe_host>||"cursor" => u64
176176+ "sub"||<subscribe_host>||"cursor" => u64
173177174178175179subscribeRepos' host listRepos progress:
176180177177- "listRepos"||<subscribe_host> => {
181181+ "lsr"||<subscribe_host> => {
178182 cursor: String,
179183 completed: Option<DateTime>,
180184 }
···194198195199per-repo transient sync state:
196200197197- "repoPrev"||<did> => <rev:string>||<prevData:cid>
201201+ "rev"||<did> => <rev:string>||<prevData:cid>
198202199203 note: kept separate and small because it very frequently updates!
200204···206210207211resync queue:
208212209209- "repoResyncQueue"||<after:timestamp/u64_be>||<did> => {
213213+ "rsq"||<after:timestamp/u64_be>||<did> => {
210214 commit: cbor,
211215 retryCount: u16,
212216 retryReason: string,
···217221218222resync buffer:
219223220220- "resyncBuffer"||<did>||<seq_be:u64> => <raw firehose event:cbor>
224224+ "rsb"||<did>||<seq_be:u64> => <raw firehose event:cbor>
221225222226```
223227···248252249253## parallel work
250254251251-there are two implementations of worker pools: one for backfill, and one for firehose commits. they work slightly differently from Bluesky's parallel scheduler (used in tap, relay, jetstream, ..):
255255+there are several implementations of worker pools: one for backfill, one for firehose commits, etc. they work slightly differently from Bluesky's parallel scheduler (used in tap, relay, jetstream, ..):
252256253257Bluesky's parallel scheduler assigns work by sharding on the associated DID: each worker is essentially assigned a subset of DIDs it's responsible for. This is really nice and pretty simple, and upholds the important thing: work for a specific DID is never assigned to more than one worker, so all events for any specific DID are always handled sequentially.
254258
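The DID-sharding idea can be sketched roughly as below. This shows only the general technique, not Bluesky's scheduler code or lightrail's pools; `worker_for_did` is a hypothetical name and the std `DefaultHasher` is an arbitrary choice for the sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map a DID to one of `worker_count` workers.
/// The same DID always maps to the same worker, so its events are
/// handled by a single worker, in arrival order.
fn worker_for_did(did: &str, worker_count: usize) -> usize {
    assert!(worker_count > 0, "need at least one worker");
    let mut hasher = DefaultHasher::new();
    did.hash(&mut hasher);
    (hasher.finish() % worker_count as u64) as usize
}

fn main() {
    let workers = 4;
    let first = worker_for_did("did:plc:alice", workers);
    let second = worker_for_did("did:plc:alice", workers);
    // Stable assignment: repeated events for one DID go to the same worker.
    assert_eq!(first, second);
    assert!(first < workers);
}
```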
+147-20
readme.md
···11# lightrail: `listReposByCollection` service
2233-**status: almost working well but not stable yet!!**
33+**status: almost working well but _not stable yet!!_**
4455-lightrail uses the adjacent keys included in CAR slices from firehose commits to detect the first record added and last record removed from a collection in an atproto repo.
55+Lightrail uses the _adjacent keys_ in firehose commit CAR slices to detect first-record-added-to and last-record-removed-from collections in atproto repos, _statelessly_. Since most commits don't change repos' collection lists, this eliminates most of the work to maintain an accurate repos-by-collection index.
6677-compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
77+Compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
8899-- applies sync1.1 inductive proof validation to firehose commits
1010-- handles sync1.1 `#sync` events
1111-- avoids updating its index unless commits actually add or remove collections
1212-- removes repos from the index when the last record from a repo's collection is removed
99+- validates sync1.1 commit proofs for index integrity
1010+- handles sync1.1 `#sync` events, catching significant repo changes
1111+- actually _removes repos from the index_ when their last record from a collection is removed
1212+- while doing less work overall
13131414-lightrail's CAR slice techniques enable its lightweight implementation, but its primary focus is on accuracy and correctness.
1414+Lightrail's main priorities are accuracy and correctness.
15151616-for backfill, lightrail currently uses `com.atproto.repo.describeRepo`, like Bluesky's `collectiondir`. This is not a robust approach, and will hopefully be replaced by probing that authenticated repo contents (see [./authenticated-collection-list.md](./authenticated-collection-list.md)) soon.
17161717+## Backfill-by-collection assister
18181919-### wishlist features (probably doable?):
1919+Sync utilities in atproto like [Tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap#tap-atproto-sync-utility) and [Hydrant](https://tangled.org/ptr.pet/hydrant#hydrant) can typically synchronize subsets of the atmosphere, filtering repositories by collection. The `com.atproto.sync.listReposByCollection` query answers _"which repos already have content relevant to the filtered subset?"_, so the sync utility can backfill existing relevant network data.
20202121-- [x] DONE accept multiple collections for `listReposbyCollection` (merge + dedup by DID; works bc key is `<collection>||<did>`)
2222-- [x] DONE "wilcard" fo `listReposbyCollection` by omitting the `collection` query param entirely
2323-- ~~`listReposByCollectionPrefix`, either with additional indexes up the NSID hierarchy, or via merge+dedup.~~ not doing
2424-- subscribe to multiple relays
2525-- use authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo` (see [./authenticated-collection-list.md](./authenticated-collection-list.md))
2121+You usually want to call `listReposByCollection` on the [relay](https://atproto.com/guides/glossary#relay) you subscribe to, to filter the same view of the network that your firehose delivers. But relays don't usually implement `listReposByCollection` themselves: instead they proxy the request to a helper service, like lightrail!
2222+2323+```
2424+ ___________
2525+ ___________ [ lightrail ]......
2626+[ your app ] ‾‾‾‾^‾‾‾‾‾‾ :
2727+ ‾‾‾‾‾^‾‾‾‾‾ __|____ (subscribeRepos)
2828+ | .-listReposByCollection-->|--+ | :
2929+ __|__ ___/ | relay |<.......
3030+ [ tap ]<--------subscribeRepos-------| |
3131+ ‾‾‾‾‾ ‾‾‾‾‾‾‾
3232+```
26333434+Subscribing lightrail to the same relay it's assisting keeps its network view consistent.
27352828-### quirks
29363030-if you see a log line like
3737+### API
31383939+#### `com.atproto.sync.listReposByCollection`
4040+4141+[Query docs](https://docs.bsky.app/docs/api/com-atproto-sync-list-repos-by-collection)
4242+4343+Lightrail implements some [proposed changes](https://github.com/bluesky-social/atproto/pull/4733) to this query:
4444+4545+- `collection` parameter with zero values (absent) returns *all* repos
4646+- repeated `collection` parameter returns repos from *any* of the specified collections
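As an illustration of the repeated-`collection` semantics (repos from *any* listed collection, each DID at most once), here is a minimal sketch. It assumes the per-collection DID lists are already in memory and uses a hypothetical `repos_in_any` helper; lightrail's actual index scan and pagination are not shown, and the sorted output here is incidental.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: union the DIDs found under each requested
/// collection, dropping duplicates.
fn repos_in_any<'a>(per_collection: &[Vec<&'a str>]) -> Vec<&'a str> {
    let unique: BTreeSet<&'a str> = per_collection.iter().flatten().copied().collect();
    unique.into_iter().collect()
}

fn main() {
    // Made-up DIDs for two requested collections.
    let posts = vec!["did:plc:a", "did:plc:b"];
    let likes = vec!["did:plc:b", "did:plc:c"];
    let merged = repos_in_any(&[posts, likes]);
    // "did:plc:b" has records in both collections but appears once.
    assert_eq!(merged, vec!["did:plc:a", "did:plc:b", "did:plc:c"]);
}
```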
4747+4848+Quirks:
4949+5050+- `limit` can be up to 10,000 (lexicon specifies 2,000 max). This matches `collectiondir`'s limit.
5151+5252+5353+#### `com.atproto.sync.listRepos`
5454+5555+[Query docs](https://docs.bsky.app/docs/api/com-atproto-sync-list-repos)
5656+5757+5858+#### `com.atproto.sync.getRepoStatus`
5959+6060+[Query docs](https://docs.bsky.app/docs/api/com-atproto-sync-get-repo-status)
6161+6262+6363+## Lightrail server quick start
6464+6565+_(one day we'll have pre-built binaries)_
6666+6767+Lightrail is written in rust. Installing [rustup](https://rustup.rs/) will get you everything you need to build and run it.
6868+6969+```bash
7070+cargo run --release -- --upstream relay.fire.hose.cam
3271```
3333-... WARN ... error=identity resolution failed: jacquard: unsupported DID method: did:web:...
7272+7373+[`relay.fire.hose.cam`](https://relay.fire.hose.cam/) is one of [microcosm](https://www.microcosm.blue/)'s full-network relays. Lightrail works with a relay or PDS host upstream, or any other service that implements at least:
7474+7575+- `com.atproto.sync.subscribeRepos` and
7676+- `com.atproto.sync.listRepos`
7777+7878+7979+### Key configs
8080+8181+```bash
8282+# you can list all config options with:
8383+cargo run -- --help
3484```
35853636-it just means the did:web resolution failed. lightrail supports did:web, but a [tiny current bug in jacquard](https://tangled.org/nonbinary.computer/jacquard/issues/31) surfaces this message
8686+- **`--db-path`**, default `./lightrail.db`: where to write lightrail's [fjall](https://fjall-rs.github.io/) db
8787+- **`--listen`**, default `0.0.0.0:2511`: host and port to bind
8888+8989+9090+#### Atmosphere configs
9191+9292+- **`--plc-url`**, default: `https://plc.directory`: where to resolve `did:plc` identities. To use microcosm's mirror: `--plc-url https://plc.wtf`.
9393+- **`--slingshot-url`**, default: `https://slingshot.microcosm.blue`: enables slingshot for identity resolution (PLC directory acts as fallback).
9494+- **`--deep-crawl`**, default: `[unset]`. enumerate hosts from upstream with `com.atproto.sync.listHosts` and then crawl those hosts each directly with `com.atproto.sync.listRepos`.
9595+9696+9797+#### Operational configs
9999+- **`--metrics-listen`**, default: `[unset]` (`0.0.0.0:6789` if passed without a value): enable prometheus-style metrics collection and serving at this address
100100+- **`--max-resync-workers`**, default: `16`: max backfill and repo resync concurrency. increase to use more resources to speed up backfill.
101101+102102+103103+more knobs you can twist:
104104+105105+- **`--ident-cache-size`**, default: `2_000_000`: identity resolution provides repo signing keys and PDS hostnames. a larger cache reduces outbound resolution requests at the cost of more memory used.
106106+- **`--max-firehose-workers`**, default: `6`: max firehose event processing concurrency.
107107+- **`--cursor-save-interval-secs`**, default `1`
108108+- **`--describe-repo-fetch-timeout-secs`**, default `30`
109109+- **`--get-repo-fetch-timeout-secs`**, default `300`
110110+- **`--max-deep-crawl-workers`**, default `4`: host-crawling concurrency for `--deep-crawl`
111111+112112+113113+### quirks
114114+115115+- Lightrail's ordering of DIDs in the `listReposByCollection` response is different from `collectiondir`
116116+117117+ - `collectiondir` always inserts new DIDs at the end of the paginated response
118118+ - Lightrail makes no guarantee except that the response will not contain duplicates
119119+120120+121121+- If you see a log line like
122122+123123+ ```
124124+ ... WARN ... error=identity resolution failed: jacquard: unsupported DID method: did:web:...
125125+ ```
126126+127127+ it just means the did:web resolution failed. lightrail supports did:web, but a [tiny current bug in jacquard](https://tangled.org/nonbinary.computer/jacquard/issues/31) surfaces this message
128128+129129+130130+### Backfill
131131+132132+Lightrail currently uses `com.atproto.repo.describeRepo` for backfill, like Bluesky's `collectiondir`. This is not as robust as we'd like, and could soon be replaced by probing authenticated repo contents (see [./authenticated-collection-list.md](./authenticated-collection-list.md)).
133133+134134+The two reasons `describeRepo` isn't robust:
135135+136136+1. the results are not authenticated (PDS bugs or quirks could lead to incorrect index)
137137+2. the response lacks the repo `rev`, so even if the list is accurate, it's not possible to prove that the next firehose commit follows without gaps
138138+139139+To mitigate the second, we always call `com.atproto.sync.getRecord` *before* `describeRepo`. This establishes a `rev` prior to the list, for eventual-(usually fast)-consistency after cutting over to the firehose.
140140+141141+The `sync.getRecord` response also includes a CAR slice that we can use: for very small repos, it might actually include a full repository export, in which case we can resync directly (and robustly!) from that and exit early. If it's a partial CAR, it will still include some keys whose presence we can assert when processing the `describeRepo` response to *maybe* catch a PDS bug.
142142+143143+Future `sync.getRecord` work: since every provable partial CAR must contain at least the MST root node, we can make a very rough estimate of the full-repo export size, and go ahead and `sync.getRepo` instead of `describeRepo` when it's expected to be very small, for better accuracy without much additional bandwidth overhead.
144144+145145+When we call `sync.getRecord`, we provide a made-up collection and rkey, which works for our purposes because the response will contain a _proof of absence_ if the key doesn't exist in the repo: a CAR slice (with rev + data from the commit object!) containing adjacent keys (that we'll use!). Unfortunately, not every PDS implements proof-of-absence responses: notably, **bridgy** currently returns an error for non-existent keys.
146146+37147148148+#### `sync.getRepo` resync fallback
149149+150150+If the `describeRepo` approach fails for any reason, lightrail attempts to resync from a full repo export.
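The backfill order described in this section (phony `sync.getRecord` probe first, then `describeRepo`, with `sync.getRepo` as the fallback) can be summarized as a small decision function. This is a sketch only: the `Probe` and `Plan` types and `plan_backfill` are hypothetical names, and treating a failed probe as a reason to fall back to `getRepo` is an assumption, not something the text above states.

```rust
/// Outcome of the phony `sync.getRecord` probe (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Probe {
    /// The CAR happened to contain the entire (tiny) repo.
    FullRepo,
    /// A partial CAR: commit rev plus some adjacent keys.
    PartialProof,
    /// No usable proof, e.g. the host errors on non-existent keys.
    Failed,
}

/// What to do next for this repo (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Plan {
    ResyncFromProbeCar,
    IndexFromDescribeRepo,
    ResyncFromGetRepo,
}

fn plan_backfill(probe: Probe, describe_repo_ok: bool) -> Plan {
    match (probe, describe_repo_ok) {
        // Tiny repo: the probe already delivered a full export, exit early.
        (Probe::FullRepo, _) => Plan::ResyncFromProbeCar,
        // rev established by the probe, collection list from describeRepo.
        (Probe::PartialProof, true) => Plan::IndexFromDescribeRepo,
        // Assumption: any failure along the way falls back to a full export.
        _ => Plan::ResyncFromGetRepo,
    }
}

fn main() {
    assert_eq!(plan_backfill(Probe::FullRepo, false), Plan::ResyncFromProbeCar);
    assert_eq!(plan_backfill(Probe::PartialProof, true), Plan::IndexFromDescribeRepo);
    assert_eq!(plan_backfill(Probe::PartialProof, false), Plan::ResyncFromGetRepo);
    assert_eq!(plan_backfill(Probe::Failed, true), Plan::ResyncFromGetRepo);
}
```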
151151+152152+153153+### Sync1.1
154154+155155+plz remind fig to write this up: the strictness ratchet, any handling of lenient hosts we end up needing, and proof re: correctness of the adjacent keys approach.
156156+157157+158158+### wishlist features (probably doable?):
160160+- [x] DONE accept multiple collections for `listReposByCollection` (merge + dedup by DID; works bc key is `<collection>||<did>`)
161161+- [x] DONE "wildcard" for `listReposByCollection` by omitting the `collection` query param entirely
162162+- ~~`listReposByCollectionPrefix`, either with additional indexes up the NSID hierarchy, or via merge+dedup.~~ not doing
163163+- subscribe to multiple relays
164164+- use authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo` (see [./authenticated-collection-list.md](./authenticated-collection-list.md))
381653916640167## contributing
411684242-see ['./hacking.md'](./hacking.md)
169169+see ['./hacking.md'](./hacking.md) for style, implementation, and architecture notes.
431704417145172## license
+7-7
src/main.rs
···2424 db_path: PathBuf,
25252626 /// TCP address for the XRPC API server.
2727- #[arg(long, env = "LIGHTRAIL_LISTEN", default_value = "0.0.0.0:3000")]
2727+ #[arg(long, env = "LIGHTRAIL_LISTEN", default_value = "0.0.0.0:2511")]
2828 listen: SocketAddr,
29293030 /// PLC directory URL for did:plc resolution.
···4949 ident_cache_size: u64,
50505151 /// Maximum concurrent firehose commit worker tasks.
5252- #[arg(long, env = "LIGHTRAIL_MAX_FIREHOSE_WORKERS", default_value_t = 10)]
5252+ #[arg(long, env = "LIGHTRAIL_MAX_FIREHOSE_WORKERS", default_value_t = 6)]
5353 max_firehose_workers: usize,
54545555 /// Maximum concurrent resync worker tasks.
···74747575 /// TCP address for the Prometheus metrics HTTP endpoint.
7676 /// If not set, metrics are not exported.
7777- #[arg(long, env = "LIGHTRAIL_METRICS_BIND", num_args = 0..=1, default_missing_value = "0.0.0.0:6789")]
7878- metrics_bind: Option<SocketAddr>,
7777+ #[arg(long, env = "LIGHTRAIL_METRICS_LISTEN", num_args = 0..=1, default_missing_value = "0.0.0.0:6789")]
7878+ metrics_listen: Option<SocketAddr>,
79798080 /// Admin password for privileged API endpoints.
8181 #[arg(long, env = "LIGHTRAIL_ADMIN_PASSWORD")]
8282 admin_password: Option<String>,
83838484 /// Enable deep crawl: discover PDS hosts via listHosts and crawl each one's repos.
8585- #[arg(long, env = "LIGHTRAIL_DEEP_CRAWL")]
8585+ #[arg(long, action, env = "LIGHTRAIL_DEEP_CRAWL")]
8686 deep_crawl: bool,
87878888 /// Max concurrent per-PDS listRepos workers during deep crawl.
8989- #[arg(long, env = "LIGHTRAIL_MAX_DEEP_CRAWL_WORKERS", default_value_t = 4)]
8989+ #[arg(long, env = "LIGHTRAIL_MAX_DEEP_CRAWL_WORKERS", requires("deep_crawl"), default_value_t = 4)]
9090 max_deep_crawl_workers: usize,
9191}
9292···119119 ident_cache_size,
120120 ));
121121122122- if let Some(addr) = args.metrics_bind {
122122+ if let Some(addr) = args.metrics_listen {
123123 install_metrics(addr)?;
124124 }
125125