lightweight com.atproto.sync.listReposByCollection
45
fork

Configure Feed

Select the types of activity you want to include in your feed.

at main 92 lines 4.4 kB view raw view rendered
1# authenticated collection listing (future work) 2 3right now we're just doing `describeRepo` to get collections, which *is* what 4`collectiondir` also, does but it's not what we want to stick with because: 5 6- the collections list isn't paginated. not clear what happens if a repo has a huge 7 number of collections -- will `describeRepo` eventually fail? 8- the contents of the collecitons list aren't authenticated. it's *possible* for a 9 PDS to lie and make our index incorrect, but the threat we're considering here is 10 more about just PDS bugs causing the list to be wrong. 11- there is no `commit` or even `rev` in the response, so actually we can't know if 12 firehose commits after `describeRepo` follow correctly/without gaps. 13 14there are a few ways we can do better. 15 16 17## `com.atproto.sync.getRepo` 18 19obviously we can just do full backfill of repo contents. but then we couldn't call 20ourselves *light*rail. 21 22what we can do is detect small repos and use `getRepo` just for them. repo size can 23be estimated from any CAR slice by measuring the root node height. we get a car slice 24from firehose commits and from any `sync.getRecord` request. 25 26 27## collection-boundary `com.atproto.sync.getRecord` probing 28 29mst keys have the form `<collection>/<rkey>` (lexicographic order). 30`com.atproto.sync.getRecord` returns a CAR proof path from the repo root to the 31queried key, and that usually includes keys immediately adjacent to the queried key. 32in particular, when the record does *not* exist, the proof path must include adjacent 33keys (required to prove the key is absent). 34 35we can exploit this: 36 371. query `getRecord` with the **minimum legal MST key** 38 (`a-----...0.0-----...0.A/-`). the record usually won't exist, but the right- 39 adjacent key in the CAR slice reveals the first collection present in the repo. 402. for that collection, compute the **maximum** legal rkey (`~` × 512) and query 41 `getRecord` with `<collection>/<max_rkey>`. The right-adjacent key, if present, is 42 the first key of the *next* collection. 433. if we don't have a immediate-right-adjacent key, we can *increment* the rkey to 44 minimum next legal key and retry until we do get the next collection. 454. repeat from step 2 until the end (no more right-adjacent collections). 46 47 48this probing costs ~one request per collection discovered. wrinkles: 49 50- on the first request, estimate repo size and just do `getRepo` if it's small. 51 probing requests count toward PDS rate-limit. 52 53- the repository can update while we are probing. this is easily detected because 54 every CAR slice response includes the commit object and MST root, which updates 55 for any update to the MST. the really nice way to deal with this is to maintain 56 a sparse MST tree built up from all the probe requests, which can usually be 57 *updated* directly from the upper changed nodes. at the end, we have a repo- 58 spanning valid-but-sparse MST that proves all collection boundaries simultaneously. 59 60 what do we do if a collection is added or removed by a mid-probe update? TODO! 61 62 63## skeleton shower from `com.atproto.sync.getBlocks` 64 65instead of scanning across the key range on collection boundaries, we could build 66our own sparse collection-boundary tree top-down: 67 681. make any `..sync.getRecord` query, to obtain the MST root node 692. request every MST child node that spans a collection change, using 70 `com.atproto.sync.getBlocks`. 713. continue down like this, layer by layer, until reaching the bottom layer. since 72 `getBlocks` accepts multiple CIDs, we can fetch everything required from each 73 layer together in one request per layer (unless we need too many blocks) to fit 74 in the querystring. 75 76we end up with a nice sparse tree that proves all collection boundaries. MSTs are not 77very tall so this might actually be pretty nice, and we directly build a consistent 78point-in-time snapshot. 79 80this fails when 81 82- any block we need is updated or removed while we're climbing down. in that case we 83 can retry or fall back to `getRepo`. 84- a PDS doesn't implement `getBlocks`. (i have no idea how common it is?) 85 86 87## we can dream: `"com.atproto.sync.getRepoCollections"` 88 89maybe one day a PDS endpoint like this will exist, which serves the sparse MST 90containing blocks on all collection boundary paths our approaches here end up 91building, proving the exact set of collections present in the repo assocaited with an 92exact commit.