lightrail is written in rust. assuming you've got `cargo` installed, you should be able to run

```bash
$ cargo run -- --subscribe <pds-or-relay>
```

in the project root and be up and running.

- TODO: make local dev work without needing an actual live firehose and backfill etc.

before submitting a pull request, it's nice if you run rustfmt, clippy, and all tests:

```bash
$ cargo test && cargo fmt && cargo clippy
```


## some choices

- tokio for async runtime: works well
- iroh-car: robust, simple, async
- manual CAR processing: since we need access to adjacent keys
  - TODO: repo-stream will expose this soon probably
- fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
readme.md
# lightrail: lightweight `com.atproto.sync.listReposByCollection` service

**status: in development**

lightrail uses the adjacent keys included in CAR slices from firehose commits and `com.atproto.sync.getRecord` responses to detect the first record added and last record removed from a collection in an atproto repo.

for backfill of large repositories, lightrail probes the repo with `com.atproto.sync.getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum possible `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate all contained collections.
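
a rough sketch of the boundary-probing idea, not lightrail's actual code: the repo is modeled as a sorted `BTreeSet` of `<collection>/<rkey>` keys, and `next_key_after` is a hypothetical stand-in for the right-hand adjacent key that a `getRecord` probe would reveal. it assumes `~` is the highest character allowed in an `rkey` and 512 is the maximum length, and it starts from an empty probe key for simplicity.

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

// assumption: '~' is the highest character allowed in an rkey, 512 the max length
const MAX_RKEY_LEN: usize = 512;

/// hypothetical stand-in for a `getRecord` probe: the first key strictly after
/// `key`, whether or not `key` itself exists in the repo.
fn next_key_after<'a>(repo: &'a BTreeSet<String>, key: &str) -> Option<&'a String> {
    repo.range::<str, _>((Excluded(key), Unbounded)).next()
}

/// walk the repo one collection at a time: one probe per collection, plus a
/// final probe that finds nothing.
fn enumerate_collections(repo: &BTreeSet<String>) -> Vec<String> {
    let max_rkey = "~".repeat(MAX_RKEY_LEN);
    let mut collections = Vec::new();
    // the empty string sorts before every real key
    let mut cursor = String::new();
    while let Some(key) = next_key_after(repo, &cursor) {
        // keys look like `<collection>/<rkey>`
        let collection = key.split_once('/').map_or(key.as_str(), |(c, _)| c);
        collections.push(collection.to_string());
        // jump to the largest possible key in this collection, so the next
        // probe lands on the first key of the following collection
        cursor = format!("{collection}/{max_rkey}");
    }
    collections
}
```

in this model a repo with millions of records in four collections is enumerated with five probes, which is the point of probing along boundaries instead of listing records.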

on repo `#sync` events, lightrail similarly probes the repository from the PDS and diffs the result against the recorded repo collections.

small repositories are detected by inspecting the MST nodes near the root of a CAR slice (e.g., from the first `getRecord` probe): small repos are statistically unlikely to contain keys with high MST levels (every CAR slice must include the maximum-level keys of a repository), providing a statistical basis for estimating total repo size. small repos are fetched in their entirety with `com.atproto.sync.getRepo` instead of probing with numerous `getRecord` requests.
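
a minimal sketch of why the top MST level hints at repo size, assuming the atproto MST rule that a key's layer is the number of leading zero bits of its SHA-256 digest counted in 2-bit chunks (roughly 1 in 4 keys reaches each next layer). the function names and the `max_keys` threshold are made up for illustration, and lightrail's real estimator may differ. the digest is passed in directly so the example needs no hashing crate.

```rust
/// MST layer for a key, given the SHA-256 digest of that key:
/// leading zero bits, counted in 2-bit chunks.
fn mst_layer(digest: &[u8]) -> u32 {
    let mut zero_bits = 0;
    for byte in digest {
        zero_bits += byte.leading_zeros();
        if *byte != 0 {
            break; // first non-zero byte ends the run of leading zeros
        }
    }
    zero_bits / 2
}

/// each layer keeps about a quarter of the keys of the layer below, so a repo
/// whose highest key sits at layer `top_layer` holds on the order of
/// 4^top_layer keys (an order-of-magnitude guess, not a count).
fn rough_key_count(top_layer: u32) -> u64 {
    4u64.saturating_pow(top_layer)
}

/// hypothetical decision helper: fetch the whole repo if it is probably small.
fn looks_small(top_layer: u32, max_keys: u64) -> bool {
    rough_key_count(top_layer) <= max_keys
}
```

for example, with a `max_keys` of 1000, a root at layer 3 (about 64 keys) would count as small, while a root at layer 6 (about 4096 keys) would be probed instead.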

key differences from Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service. lightrail:

- applies sync1.1 inductive proof validation to firehose commits
- handles sync1.1 `#sync` events
- avoids updating the index unless firehose commits actually add or remove a collection
- removes repos from the index when the last record from a repo's collection is removed
- uses authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo`

lightrail's CAR slice techniques enable its lightweight implementation, but its primary focus is on accuracy and correctness.


### wishlist features (probably doable?):

- accept multiple collections for `listReposByCollection` (merge + dedup by DID; works bc key is `<collection>||<did>`)
- `listReposByCollectionPrefix`, either with additional indexes up the NSID hierarchy, or via merge+dedup.
- subscribe to multiple relays
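
the merge + dedup idea from the wishlist could look something like this. it's only a sketch: it assumes each collection's index scan yields DIDs already in sorted order (which follows from the `<collection>||<did>` key layout), and `merge_dedup` is a hypothetical helper, not an existing lightrail function.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// k-way merge of per-collection DID lists (each already sorted), emitting
/// every DID once even if it appears under several collections.
fn merge_dedup(lists: &[Vec<String>]) -> Vec<String> {
    // min-heap of (did, list index, position in that list)
    let mut heap = BinaryHeap::new();
    for (i, list) in lists.iter().enumerate() {
        if let Some(first) = list.first() {
            heap.push(Reverse((first.as_str(), i, 0usize)));
        }
    }

    let mut out: Vec<String> = Vec::new();
    while let Some(Reverse((did, i, pos))) = heap.pop() {
        // output is sorted, so a duplicate can only be the last DID emitted
        if out.last().map(String::as_str) != Some(did) {
            out.push(did.to_string());
        }
        // advance the list this DID came from
        if let Some(next) = lists[i].get(pos + 1) {
            heap.push(Reverse((next.as_str(), i, pos + 1)));
        }
    }
    out
}
```

a real implementation would presumably stream from index iterators instead of materialized `Vec`s, but the heap-based merge would stay the same shape.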


## contributing

see ['./hacking.md'](./hacking.md)