···55$ cargo run -- --subscribe <pds-or-relay>
66```
7788-in the project root and be up and running.
88+in the project root to be up and running!
991010- TODO: make local dev work without needing an actual live firehose and backfill etc.
11111212-before submitting a pull request, it's nice if you run rustfmt, clippy, and all tests
1212+before submitting a pull request, please run rustfmt, clippy, and all tests
13131414```bash
1515$ cargo test && cargo fmt && cargo clippy
1616```
1717+1818+- TODO: configure tangled CI to run these on PR
171918201921## some choices
···2325- manual CAR processing: since we need access to adjacent keys
2426 - TODO: repo-stream will expose this soon probably
2527- fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
2828+2929+3030+## state/db models
3131+3232+taking [inspiration from tap](https://github.com/bluesky-social/indigo/blob/main/cmd/tap/models/models.go) here!
3333+3434+```
3535+main index:
3636+3737+ "rbc"||<collection>||<did> => ()
3838+3939+ note: value unused for now
4040+4141+4242+reversed index:
4343+4444+ "cbr"||<did>||<collection> => ()
4545+4646+ note: supports `#sync` diffing and account deletion
4747+4848+4949+subscribeRepos (firehose) cursor:
5050+5151+ "subscribeRepos"||<subscribe_host>||"cursor" => u64
5252+5353+5454+subscribeRepos' host listRepos progress:
5555+5656+ "listRepos"||<subscribe_host> => {
5757+ cursor: String,
5858+ completed: Option<DateTime>,
5959+ }
6060+6161+ note: alternatively we could just delete the key when done. we'd know not to
6262+ restart because the subscribeRepos entry existing could mean that.
6363+6464+6565+per-repo state stuff:
6666+6767+ "repo"||<did> => {
6868+ state: RepoState,
6969+ status: RepoStatus,
7070+ error: Option<String>,
7171+ }
7272+7373+7474+per-repo transient sync state:
7575+7676+ "repoPrev"||<did> => <rev:string>||<prevData:cid>
7777+7878+ note: kept separate and small because it very frequently updates!
7979+8080+8181+resync queue:
8282+8383+ "repoResyncQueue"||<after:timestamp/u64_be>||<did> => {
8484+ commit: cbor,
8585+ retryCount: u16,
8686+ retryReason: string,
8787+ }
8888+8989+ TODO: per-did resync rate-limit? state might need to live somewhere
9090+9191+```
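
a minimal sketch of how the `"repoResyncQueue"||<after:timestamp/u64_be>||<did>` key could be built and parsed, assuming the store iterates keys in bytewise order. the `SEP` byte and both function names are hypothetical: the notes above only write `||` for concatenation and don't say how segments are delimited. the point is that a big-endian timestamp makes byte order match time order, so the queue can be drained with a plain prefix scan.

```rust
/// Hypothetical separator byte between the key prefix and the timestamp.
const SEP: u8 = 0x00;

/// Hypothetical prefix constant, taken from the key layout in the notes.
const PREFIX: &[u8] = b"repoResyncQueue";

/// Build a key shaped like `"repoResyncQueue"||<after:u64_be>||<did>`.
fn resync_queue_key(after: u64, did: &str) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(PREFIX);
    key.push(SEP);
    // Big-endian, so comparing keys bytewise compares timestamps numerically.
    key.extend_from_slice(&after.to_be_bytes());
    key.extend_from_slice(did.as_bytes());
    key
}

/// Split a key back into `(after, did)`, or `None` if it is malformed.
fn parse_resync_queue_key(key: &[u8]) -> Option<(u64, &str)> {
    let rest = key.strip_prefix(PREFIX)?;
    let rest = rest.strip_prefix(&[SEP])?;
    if rest.len() < 8 {
        return None;
    }
    let (ts, did) = rest.split_at(8);
    let mut buf = [0u8; 8];
    buf.copy_from_slice(ts);
    Some((u64::from_be_bytes(buf), std::str::from_utf8(did).ok()?))
}
```

with little-endian bytes, a timestamp of 256 would sort before a timestamp of 1, which would break the "process in `after` order" property.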
9292+9393+9494+### return order of `listReposByCollection`:
9595+9696+note: `collectiondir` indexes repos by discovery time, so you haven't missed any newly-added repos since you started paging through. if we cursor over DIDs, then there *can* be some added behind the cursor mid-paging. is that a problem?
9797+9898+i think clients should be listening to the firehose before they start walking here, which should help them avoid missing any repos (newly-added ones while paging would be seen in the firehose).
9999+100100+not sure if there would be value in it but we *could* also grab a keyspace snapshot so that a client has an exact consistent view while they page through... but full-network paging can take a long time, so holding a snapshot open for that long is maybe not such a space-friendly idea.
101101+102102+my current thinking is that ordering repos by did is probably ok
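
a small sketch of the DID-ordered paging tradeoff discussed above, using an in-memory `BTreeSet` as a stand-in for the index. the `page` function and its signature are hypothetical, not the service's API. it shows that a repo added *after* the cursor position is still returned, while one added *before* it is skipped (and would need to be picked up from the firehose).

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return up to `limit` DIDs strictly after `cursor`, plus the next cursor.
/// The next cursor is `None` when the page came back short (end of index).
fn page(
    index: &BTreeSet<String>,
    cursor: Option<&str>,
    limit: usize,
) -> (Vec<String>, Option<String>) {
    let lower = match cursor {
        Some(c) => Bound::Excluded(c.to_string()),
        None => Bound::Unbounded,
    };
    let items: Vec<String> = index
        .range::<String, _>((lower, Bound::Unbounded))
        .take(limit)
        .cloned()
        .collect();
    let next = if items.len() == limit {
        items.last().cloned()
    } else {
        None
    };
    (items, next)
}
```

for example, after paging past `did:plc:ddd`, a newly-inserted `did:plc:eee` shows up on the next page but a newly-inserted `did:plc:aaa` does not.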
+5-5
readme.md
···11-# lightrail: lightweight `com.atproto.sync.listReposByCollection` service
11+# lightrail: lightweight `listReposByCollection` service
2233**status: in development**
4455lightrail uses the adjacent keys included in CAR slices from firehose commits and `com.atproto.sync.getRecord` responses to detect the first record added and last record removed from a collection in an atproto repo.
6677-for backfill of for large repositories, lightrail probes the repo with `com.atproto.sync.getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum possible `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate all contained collections.
77+for backfill of large repositories, lightrail probes the repo with `getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate every collection.
8899-repo `#sync` events similarly probe the repository from the PDS to diff against the recorded repo collections.
99+repo `#sync` events similarly probe the repository to diff against the recorded repo collections.
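
a rough sketch of the boundary-probing idea, with a sorted `BTreeSet` of `<collection>/<rkey>` strings standing in for the repo and `next_key_after` standing in for the adjacent key a `getRecord` probe would reveal. all names here are hypothetical. as a simplification, the probe past a collection is `<collection>0` (`0` is the byte right after `/`) rather than a real maximum `rkey`.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Stand-in for a probe: the first repo key strictly after `probe`.
fn next_key_after<'a>(repo: &'a BTreeSet<String>, probe: &str) -> Option<&'a String> {
    repo.range::<str, _>((Bound::Excluded(probe), Bound::Unbounded))
        .next()
}

/// Enumerate collections by jumping from one collection boundary to the next.
/// Returns the collections found and the number of probes it took.
fn enumerate_collections(repo: &BTreeSet<String>) -> (Vec<String>, usize) {
    let mut collections = Vec::new();
    let mut probes = 0;
    let mut probe = String::new();
    loop {
        probes += 1;
        let key = match next_key_after(repo, &probe) {
            Some(k) => k,
            None => break, // nothing after the last boundary: done
        };
        // Keys look like `<collection>/<rkey>`; keep the collection part.
        let collection = match key.split_once('/') {
            Some((c, _rkey)) => c,
            None => key.as_str(),
        };
        collections.push(collection.to_string());
        // Skip every remaining key in this collection with one jump.
        probe = format!("{}0", collection);
    }
    (collections, probes)
}
```

the number of probes is the number of collections plus one, regardless of how many records each collection holds.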
10101111small repositories are detected by inspecting the MST nodes near the root of a CAR slice (e.g., from the first `getRecord` probe): small repos are statistically unlikely to contain keys with high MST levels (every CAR slice must include the maximum-level keys of a repository), providing a statistical basis for estimating total repo size. small repos are fetched in their entirety with `com.atproto.sync.getRepo` instead of probing with numerous `getRecord` requests.
12121313-key differences compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
1313+compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
14141515- applies sync1.1 inductive proof validation to firehose commits
1616- handles sync1.1 `#sync` events
1717-- reduces avoids updating the index unless firehose commits actually add or remove a collection
1717+- avoids updating its index unless commits actually add or remove collections
1818- removes repos from the index when the last record from a repo's collection is removed
1919- uses authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo`
2020