lightweight
com.atproto.sync.listReposByCollection
1# authenticated collection listing (future work)
2
3right now we're just doing `describeRepo` to get collections, which *is* what
4`collectiondir` also, does but it's not what we want to stick with because:
5
6- the collections list isn't paginated. not clear what happens if a repo has a huge
7 number of collections -- will `describeRepo` eventually fail?
8- the contents of the collecitons list aren't authenticated. it's *possible* for a
9 PDS to lie and make our index incorrect, but the threat we're considering here is
10 more about just PDS bugs causing the list to be wrong.
11- there is no `commit` or even `rev` in the response, so actually we can't know if
12 firehose commits after `describeRepo` follow correctly/without gaps.
13
14there are a few ways we can do better.
15
16
17## `com.atproto.sync.getRepo`
18
19obviously we can just do full backfill of repo contents. but then we couldn't call
20ourselves *light*rail.
21
22what we can do is detect small repos and use `getRepo` just for them. repo size can
23be estimated from any CAR slice by measuring the root node height. we get a car slice
24from firehose commits and from any `sync.getRecord` request.
25
26
27## collection-boundary `com.atproto.sync.getRecord` probing
28
29mst keys have the form `<collection>/<rkey>` (lexicographic order).
30`com.atproto.sync.getRecord` returns a CAR proof path from the repo root to the
31queried key, and that usually includes keys immediately adjacent to the queried key.
32in particular, when the record does *not* exist, the proof path must include adjacent
33keys (required to prove the key is absent).
34
35we can exploit this:
36
371. query `getRecord` with the **minimum legal MST key**
38 (`a-----...0.0-----...0.A/-`). the record usually won't exist, but the right-
39 adjacent key in the CAR slice reveals the first collection present in the repo.
402. for that collection, compute the **maximum** legal rkey (`~` × 512) and query
41 `getRecord` with `<collection>/<max_rkey>`. The right-adjacent key, if present, is
42 the first key of the *next* collection.
433. if we don't have a immediate-right-adjacent key, we can *increment* the rkey to
44 minimum next legal key and retry until we do get the next collection.
454. repeat from step 2 until the end (no more right-adjacent collections).
46
47
48this probing costs ~one request per collection discovered. wrinkles:
49
50- on the first request, estimate repo size and just do `getRepo` if it's small.
51 probing requests count toward PDS rate-limit.
52
53- the repository can update while we are probing. this is easily detected because
54 every CAR slice response includes the commit object and MST root, which updates
55 for any update to the MST. the really nice way to deal with this is to maintain
56 a sparse MST tree built up from all the probe requests, which can usually be
57 *updated* directly from the upper changed nodes. at the end, we have a repo-
58 spanning valid-but-sparse MST that proves all collection boundaries simultaneously.
59
60 what do we do if a collection is added or removed by a mid-probe update? TODO!
61
62
63## skeleton shower from `com.atproto.sync.getBlocks`
64
65instead of scanning across the key range on collection boundaries, we could build
66our own sparse collection-boundary tree top-down:
67
681. make any `..sync.getRecord` query, to obtain the MST root node
692. request every MST child node that spans a collection change, using
70 `com.atproto.sync.getBlocks`.
713. continue down like this, layer by layer, until reaching the bottom layer. since
72 `getBlocks` accepts multiple CIDs, we can fetch everything required from each
73 layer together in one request per layer (unless we need too many blocks) to fit
74 in the querystring.
75
76we end up with a nice sparse tree that proves all collection boundaries. MSTs are not
77very tall so this might actually be pretty nice, and we directly build a consistent
78point-in-time snapshot.
79
80this fails when
81
82- any block we need is updated or removed while we're climbing down. in that case we
83 can retry or fall back to `getRepo`.
84- a PDS doesn't implement `getBlocks`. (i have no idea how common it is?)
85
86
87## we can dream: `"com.atproto.sync.getRepoCollections"`
88
89maybe one day a PDS endpoint like this will exist, which serves the sparse MST
90containing blocks on all collection boundary paths our approaches here end up
91building, proving the exact set of collections present in the repo assocaited with an
92exact commit.