lightweight com.atproto.sync.listReposByCollection

fix up key probing description

phil ba69cef4 fce3c3c0

+30 -13
+3 -1
src/backfill/mod.rs
@@ -3,7 +3,9 @@
 //! Walks `com.atproto.sync.listRepos` and probes each repository to populate
 //! the rbc/cbr index before or alongside the live firehose feed.
 //!
-//! Large repos are enumerated via binary-search `getRecord` probing (`probe`).
+//! Large repos are enumerated via sequential `getRecord` probing (`probe`):
+//! one request per collection, walking right-adjacent MST keys from the minimum
+//! legal key to the end of the repo.
 //! Small repos take the fast path of fetching the full repo CAR (`small_repo`).
 
 pub mod list_repos;
+27 -12
src/backfill/probe.rs
@@ -1,21 +1,36 @@
-//! Binary-search `getRecord` probing for large-repo backfill.
+//! `getRecord`-probing for large-repo backfill.
+//!
+//! MST keys have the form `<collection>/<rkey>`, where `collection` is an NSID
+//! and `rkey` is a Record Key, both subject to format restrictions and a total
+//! byte-length cap defined in the AT Protocol specs.
+//!
+//! `getRecord` always includes the keys adjacent to the queried key in its CAR
+//! slice response, even when the record does not exist. The probing algorithm
+//! exploits this to enumerate every collection with one request per collection:
+//!
+//! 1. Query `getRecord` with the **minimum legal MST key** — the
+//!    lexicographically lowest string that is a valid `<collection>/<rkey>`.
+//!    The record won't exist, but the right-adjacent key in the CAR slice is
+//!    the lowest key actually present in the repo, revealing the first
+//!    collection.
+//! 2. For that collection, compute the **maximum legal rkey** and query
+//!    `getRecord` with `<collection>/<max_rkey>`. The right-adjacent key in
+//!    the response is the first key of the *next* collection in the repo.
+//! 3. Repeat step 2 for each newly discovered collection until no right-adjacent
+//!    key is returned, signalling that all collections have been found.
 //!
-//! Since every ATProto collection has a known minimum and maximum possible rkey,
-//! `getRecord` returns adjacent keys even when the requested key does not exist.
-//! This lets us binary-search the MST to enumerate all collections without
-//! fetching the full repo CAR.
+//! Each discovered `(did, collection)` pair is written to the rbc/cbr index
+//! via `db::index::insert`.
 
 use crate::db::DbRef;
 use crate::error::Result;
 
-/// Probe `did` to enumerate its collections via `getRecord` binary search.
+/// Probe `did` to enumerate its collections via sequential `getRecord` requests.
 ///
-/// 1. Issue a `getRecord` for the midpoint of the NSID key space.
-/// 2. Feed the returned CAR slice to `mst::adjacent::extract_adjacent`.
-/// 3. Use adjacent keys to narrow the search and recurse until all collection
-///    boundaries are discovered.
-/// 4. Write results to the rbc/cbr index via `db::index::insert`.
+/// Starts from the minimum legal MST key and follows right-adjacent keys one
+/// collection at a time until the end of the repo is reached. One XRPC request
+/// is issued per collection present in the repo.
 pub async fn probe_repo(host: &str, did: &str, db: DbRef) -> Result<()> {
     let _ = (host, did, db);
-    todo!("binary-search getRecord probing to enumerate collections")
+    todo!("sequential getRecord probing: walk right-adjacent keys to enumerate collections")
 }
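
Since `probe_repo` is still a `todo!()`, the walk described in the new module docs can be sketched against an in-memory model to check the control flow. This is only an illustration under stated assumptions, not the crate's implementation: a sorted `BTreeSet` of keys stands in for the repo's MST, `right_adjacent` stands in for the right-adjacent key that a real `getRecord` CAR slice would reveal, and `MIN_KEY`, `max_key_for`, and `enumerate_collections` are hypothetical names. The empty-string minimum and the 512-tilde maximum rkey are placeholders that should be checked against the NSID and Record Key specs.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// ASSUMPTION: placeholder for the minimum legal MST key. The empty string sorts
/// before every key in this model; a real probe needs a valid `<collection>/<rkey>`.
const MIN_KEY: &str = "";

/// ASSUMPTION: stand-in for `<collection>/<max_rkey>`, taking the maximum legal
/// rkey to be 512 '~' characters. Verify against the Record Key spec.
fn max_key_for(collection: &str) -> String {
    format!("{collection}/{}", "~".repeat(512))
}

/// Models the right-adjacent key from a `getRecord` response: the smallest key
/// in the repo that is strictly greater than `probe`, if any.
fn right_adjacent<'a>(repo: &'a BTreeSet<String>, probe: &str) -> Option<&'a str> {
    let bounds: (Bound<&str>, Bound<&str>) = (Bound::Excluded(probe), Bound::Unbounded);
    repo.range::<str, _>(bounds).next().map(String::as_str)
}

/// Sequential walk: each probe that returns a key yields one new collection,
/// and one final probe returns nothing to signal the end of the repo.
fn enumerate_collections(repo: &BTreeSet<String>) -> Vec<String> {
    let mut found = Vec::new();
    let mut probe = MIN_KEY.to_string();
    while let Some(key) = right_adjacent(repo, &probe) {
        // MST keys have the form `<collection>/<rkey>`; stop on anything else.
        let Some((collection, _rkey)) = key.split_once('/') else { break };
        found.push(collection.to_string());
        // Jump past every record in this collection with a single probe.
        probe = max_key_for(collection);
    }
    found
}
```

In this model the loop issues one probe per discovered collection plus one final probe that comes back empty. A real implementation would replace `right_adjacent` with the XRPC call and CAR-slice parsing, and write each `(did, collection)` pair to the index instead of collecting into a `Vec`.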