lightweight com.atproto.sync.listReposByCollection

more notes

phil 0291498e 959a418a

+84 -7
+79 -2
hacking.md
````diff
···
 $ cargo run -- --subscribe <pds-or-relay>
 ```

-in the project root and be up and running.
+in the project root to be up and running!

 - TODO: make local dev work without needing an actual live firehose and backfill etc.

-before submitting a pull request, it's nice if you run rustfmt, clippy, and all tests
+before submitting a pull request, please run rustfmt, clippy, and all tests

 ```bash
 $ cargo test && cargo fmt && cargo clippy
 ```
+
+- TODO: configure tangled CI to run these on PR


 ## some choices
···
 - manual CAR processing: since we need access to adjacent keys
 - TODO: repo-stream will expose this soon probably
 - fjall: workload is write-heavy so LSM is a good fit, space efficiency also very desirable
+
+
+## state/db models
+
+taking [inspiration from tap](https://github.com/bluesky-social/indigo/blob/main/cmd/tap/models/models.go) here!
+
+```
+main index:
+
+"rbc"||<collection>||<did> => ()
+
+note: value unused for now
+
+
+reversed index:
+
+"cbr"||<did>||<collection> => ()
+
+note: supports `#sync` diffing and account deletion
+
+
+subscribeRepos (firehose) cursor:
+
+"subscribeRepos"||<subscribe_host>||"cursor" => u64
+
+
+subscribeRepos' host listRepos progress:
+
+"listRepos"||<subscribe_host> => {
+    cursor: String,
+    completed: Option<DateTime>,
+}
+
+note: alternatively we could just delete the key when done. we'd know not to
+restart because the subscribeRepos entry existing could mean that.
+
+
+per-repo state stuff:
+
+"repo"||<did> => {
+    state: RepoState,
+    status: RepoStatus,
+    error: Option<String>,
+}
+
+
+per-repo transient sync state:
+
+"repoPrev"||<did> => <rev:string>||<prevData:cid>
+
+note: kept separate and small because it very frequently updates!
+
+
+resync queue:
+
+"repoResyncQueue"||<after:timestamp/u64_be>||<did> => {
+    commit: cbor,
+    retryCount: u16,
+    retryReason: string,
+}
+
+TODO: per-did resync rate-limit? state might need to live somewhere
+
+```
+
+
+### return order of `listReposByCollection`:
+
+note: `collectiondir` indexes repos by discovery time, so you haven't missed any newly-added repos since you starting paging through. if we cursor over DIDs, then there *can* be some added mid-paging. is that a problem?
+
+i think clients should be listening to the firehose before they start walking here, which should help them avoid missing any repos (newly-added ones while paging would be seen in the firehose).
+
+not sure if there would be value in it but we *could* also grab a keyspace snapshot so that a client has an exact consistent view while they page through... but for full-network paging that can take a long time, this is maybe not such a space-friendly idea.
+
+my current thinking is that ordering repos by did is probably ok
````
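
A rough sketch of how the key layouts in the hacking.md notes could be encoded, and how a DID-ordered, cursor-paged `listReposByCollection` scan falls out of the main index. Everything here is an assumption for illustration: the notes only write `||` for concatenation, so the NUL separator is a guess, the helper names (`join_key`, `rbc_key`, `cbr_key`, `resync_key`, `list_repos`) are hypothetical, and a `BTreeMap` stands in for the fjall keyspace so that no fjall API is implied.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Assumption: variable-length fields are joined with a NUL byte, which
// cannot appear in an NSID or a DID. The notes do not specify a separator.
const SEP: u8 = 0;

// Concatenate string fields with the separator between them.
fn join_key(parts: &[&str]) -> Vec<u8> {
    let mut key = Vec::new();
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            key.push(SEP);
        }
        key.extend_from_slice(part.as_bytes());
    }
    key
}

// Main index: "rbc"||<collection>||<did>
fn rbc_key(collection: &str, did: &str) -> Vec<u8> {
    join_key(&["rbc", collection, did])
}

// Reversed index: "cbr"||<did>||<collection>
fn cbr_key(did: &str, collection: &str) -> Vec<u8> {
    join_key(&["cbr", did, collection])
}

// Resync queue: "repoResyncQueue"||<after:u64_be>||<did>
// Big-endian bytes make byte-wise key order match numeric timestamp order.
fn resync_key(after: u64, did: &str) -> Vec<u8> {
    let mut key = b"repoResyncQueue".to_vec();
    key.push(SEP);
    key.extend_from_slice(&after.to_be_bytes());
    key.extend_from_slice(did.as_bytes());
    key
}

// Page through the DIDs of one collection in DID order, resuming strictly
// after `after` when a cursor is given.
fn list_repos(
    index: &BTreeMap<Vec<u8>, ()>,
    collection: &str,
    after: Option<&str>,
    limit: usize,
) -> Vec<String> {
    // The trailing separator keeps "app.bsky.feed.post" from also matching
    // "app.bsky.feed.postgate".
    let mut prefix = join_key(&["rbc", collection]);
    prefix.push(SEP);
    let start = match after {
        Some(did) => Bound::Excluded(rbc_key(collection, did)),
        None => Bound::Included(prefix.clone()),
    };
    index
        .range::<Vec<u8>, _>((start, Bound::Unbounded))
        .take_while(|(key, _)| key.starts_with(&prefix))
        .take(limit)
        .map(|(key, _)| String::from_utf8_lossy(&key[prefix.len()..]).into_owned())
        .collect()
}
```

With this layout the cursor is just the last DID returned. That is also why a repo added mid-paging can be missed: it is only missed when its DID sorts before the cursor.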
+5 -5
readme.md
````diff
···
-# lightrail: lightweight `com.atproto.sync.listReposByCollection` service
+# lightrail: lightweight `listReposByCollection` service

 **status: in development**

 lightrail uses the adjacent keys included in CAR slices from firehose commits and `com.atproto.sync.getRecord` responses to detect the first record added and last record removed from a collection in an atproto repo.

-for backfill of for large repositories, lightrail probes the repo with `com.atproto.sync.getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum possible `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate all contained collections.
+for backfill of large repositories, lightrail probes the repo with `getRecord` requests instead of trusting the `collections` property from `com.atproto.repo.describeRepo`. since there are concrete minimum and maximum `rkey`s for collections, and since `getRecord` always returns adjacent keys *even when a key is not found in a repo*, lightrail can precisely probe the repo along collection boundaries to enumerate every collection.

-repo `#sync` events similarly probe the repository from the PDS to diff against the recorded repo collections.
+repo `#sync` events similarly probe the repository to diff against the recorded repo collections.

 small repositories are detected by inspecting the MST nodes near the root of a CAR slice (eg., from the first `getRecord` probe): small repos are statistically unlikely to contain keys with high MST levels (every CAR slice must include the maximum-level keys of a repository), providing a statistical basis for estimating total repo size. Small repos are fetched in their entirety with `com.atproto.sync.getRepo` instead of probing with numerous `getRecord` requests.

-key differences compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:
+compared to Bluesky's [`collectiondir`](https://github.com/bluesky-social/indigo/tree/main/cmd/collectiondir) service, lightrail:

 - applies sync1.1 inductive proof validation to firehose commits
 - handles sync1.1 `#sync` events
-- reduces avoids updating the index unless firehose commits actually add or remove a collection
+- avoids updating its index unless commits actually add or remove collections
 - removes repos from the index when the last record from a repo's collection is removed
 - uses authenticated repo contents for backfill instead of `com.atprot.repo.describeRepo`

````
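
The boundary-probing idea from the readme can be simulated locally. This is only a sketch under stated assumptions: the repo is modelled as a sorted set of `<collection>/<rkey>` strings, `probe_successor` is a hypothetical stand-in for a `getRecord` probe that reports the key to the right of the probed one (the real service would read adjacent keys out of the returned CAR slice), and the maximum rkey is assumed to be 512 `~` characters. None of this is lightrail's actual code.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

// Assumption: no rkey within a collection sorts after 512 '~' characters.
const MAX_RKEY_LEN: usize = 512;

// Hypothetical stand-in for a `getRecord` probe: returns the first repo key
// strictly greater than `key`, whether or not `key` itself exists.
fn probe_successor<'a>(repo: &'a BTreeSet<String>, key: &str) -> Option<&'a String> {
    repo.range::<str, _>((Bound::Excluded(key), Bound::Unbounded))
        .next()
}

// Enumerate every collection with one probe per collection, plus a final
// probe that finds nothing.
fn enumerate_collections(repo: &BTreeSet<String>) -> Vec<String> {
    let mut found = Vec::new();
    // The empty string sorts before every real key.
    let mut probe = String::new();
    while let Some(next) = probe_successor(repo, &probe) {
        // Keys look like "<collection>/<rkey>"; keep the collection part.
        let collection = next.split('/').next().unwrap_or(next.as_str());
        found.push(collection.to_string());
        // Jump to the largest possible key in this collection, so the next
        // successor must belong to a different collection.
        probe = format!("{}/{}", collection, "~".repeat(MAX_RKEY_LEN));
    }
    found
}
```

All keys sharing the `<collection>/` prefix are contiguous in sorted order, so the walk costs a number of probes proportional to the number of collections rather than the number of records.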