lightweight com.atproto.sync.listReposByCollection

init

phil 959a418a

+68
+1
.gitignore
/target
+6
Cargo.toml
[package]
name = "lightrail"
version = "0.1.0"
edition = "2024"

[dependencies]
+25
hacking.md
lightrail is written in rust. assuming you've got `cargo` installed, you should be able to run

```bash
$ cargo run -- --subscribe <pds-or-relay>
```

in the project root and be up and running.

- TODO: make local dev work without needing an actual live firehose and backfill etc.

before submitting a pull request, it's nice if you run all tests, rustfmt, and clippy

```bash
$ cargo test && cargo fmt && cargo clippy
```


## some choices

- tokio for async runtime: works good
- iroh-car: robust, simple, async
- manual CAR processing: since we need access to adjacent keys
  - TODO: repo-stream will expose this soon probably
- fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
+33
readme.md
# lightrail: lightweight `com.atproto.sync.listReposByCollection` service

**status: in development**

lightrail uses the adjacent keys included in CAR slices from firehose commits and `com.atproto.sync.getRecord` responses to detect the first record added and last record removed from a collection in an atproto repo.

for backfill of large repositories, lightrail probes the repo with `com.atproto.sync.getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum possible `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate all contained collections.

on repo `#sync` events, lightrail similarly probes the repository at the PDS and diffs the result against the recorded repo collections.

small repositories are detected by inspecting the MST nodes near the root of a CAR slice (e.g., from the first `getRecord` probe): small repos are statistically unlikely to contain keys with high MST levels (every CAR slice must include the maximum-level keys of a repository), providing a statistical basis for estimating total repo size. small repos are fetched in their entirety with `com.atproto.sync.getRepo` instead of probing with numerous `getRecord` requests.
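the boundary probing described above can be sketched against an in-memory stand-in for a repo. this is only an illustration under stated assumptions, not lightrail's implementation: `FakeRepo`, `next_key_after`, and `list_collections` are hypothetical names, the sorted set replaces the adjacent keys a real `getRecord` CAR slice would carry, and the probe key `<collection>0` is a simplification of "just past the maximum possible `rkey`" (it relies on `0` being the byte immediately after `/`).

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// Hypothetical in-memory stand-in for a repo: record paths of the form
/// `<collection>/<rkey>`, kept in byte-sorted order like MST keys.
struct FakeRepo {
    keys: BTreeSet<String>,
}

impl FakeRepo {
    /// Stand-in for the adjacent-key information in a probe response:
    /// the first key strictly greater than `probe`, whether or not
    /// `probe` itself exists in the repo.
    fn next_key_after(&self, probe: &str) -> Option<&String> {
        self.keys
            .range::<str, _>((Excluded(probe), Unbounded))
            .next()
    }
}

/// Enumerate every collection using one probe per collection (plus a
/// final probe that finds nothing), instead of walking every record.
fn list_collections(repo: &FakeRepo) -> Vec<String> {
    let mut collections = Vec::new();
    // The empty string sorts before every real key.
    let mut probe = String::new();
    while let Some(key) = repo.next_key_after(&probe) {
        // Everything before the first '/' is the collection NSID.
        let Some((collection, _rkey)) = key.split_once('/') else {
            break; // not a well-formed record path; stop the sketch here
        };
        collections.push(collection.to_string());
        // `<collection>0` sorts after every `<collection>/<rkey>` key and
        // before any key belonging to a later collection.
        probe = format!("{collection}0");
    }
    collections
}
```

for a repo holding thousands of `app.bsky.feed.like` records and a single `app.bsky.actor.profile/self`, this sketch issues three probes in total: one per collection, plus the last one that comes back empty.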
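to illustrate the statistical basis for the small-repo check: in the atproto MST, a key's level is the number of leading zero bits in the SHA-256 hash of the key, divided by two, so a key reaches level `h` with probability of roughly 4^-h. the sketch below assumes the digest has already been computed elsewhere, and the function names and the size threshold are hypothetical; it is not how lightrail necessarily makes the decision.

```rust
/// MST level for a key, given the SHA-256 digest of that key:
/// count leading zero bits, two bits per level (fanout of 4).
fn mst_level(digest: &[u8]) -> u32 {
    let mut zeros = 0;
    for byte in digest {
        zeros += byte.leading_zeros();
        if *byte != 0 {
            break; // first non-zero byte ends the run of leading zeros
        }
    }
    zeros / 2
}

/// Order-of-magnitude estimate only: a repo whose highest key level is
/// `max_level` holds on the order of 4^max_level keys.
fn rough_key_count(max_level: u32) -> u64 {
    4u64.saturating_pow(max_level)
}

/// Hypothetical decision rule: treat the repo as "small" (fetch it whole)
/// when the rough estimate is at or below `threshold_keys`.
fn looks_small(max_level: u32, threshold_keys: u64) -> bool {
    rough_key_count(max_level) <= threshold_keys
}
```

since the estimate is only good to within a factor of a few, a real threshold would presumably be chosen conservatively: misjudging a large repo as small costs one big `getRepo` download, while the opposite mistake costs some extra probes.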

key differences compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service. lightrail:

- applies sync1.1 inductive proof validation to firehose commits
- handles sync1.1 `#sync` events
- avoids updating the index unless firehose commits actually add or remove a collection
- removes repos from the index when the last record from a repo's collection is removed
- uses authenticated repo contents for backfill instead of `com.atproto.repo.describeRepo`

lightrail's CAR slice techniques enable its lightweight implementation, but its primary focus is on accuracy and correctness.


### wishlist features (probably doable?):

- accept multiple collections for `listReposByCollection` (merge + dedup by DID; works bc key is `<collection>||<did>`)
- `listReposByCollectionPrefix`, either with additional indexes up the NSID hierarchy, or via merge+dedup.
- subscribe to multiple relays


## contributing

see ['./hacking.md'](./hacking.md)
+3
src/main.rs
fn main() {
    println!("Hello, world!");
}