lightweight com.atproto.sync.listReposByCollection

authenticated collection listing (future work)

right now we're just doing describeRepo to get collections, which is what collectiondir also does, but it's not what we want to stick with because:

  • the collections list isn't paginated. not clear what happens if a repo has a huge number of collections -- will describeRepo eventually fail?
  • the contents of the collections list aren't authenticated. it's possible for a PDS to lie and make our index incorrect, but the threat we're considering here is more about PDS bugs causing the list to be wrong.
  • there is no commit or even rev in the response, so we can't know whether the firehose commits we see after describeRepo follow on from it correctly, without gaps.
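
for reference, a minimal sketch of what the current approach boils down to. the helper names are made up for illustration; the only things taken from the lexicon are the `repo` query parameter and the `collections` field of the response.

```python
from urllib.parse import urlencode


def describe_repo_url(pds_base: str, did: str) -> str:
    """Build the XRPC URL for com.atproto.repo.describeRepo (hypothetical helper)."""
    query = urlencode({"repo": did})
    return f"{pds_base.rstrip('/')}/xrpc/com.atproto.repo.describeRepo?{query}"


def collections_from_describe_repo(body: dict) -> list[str]:
    """Pull the collection list out of a decoded describeRepo response.

    Nothing else in the response lets us page through the list or tie it
    to a commit, which is exactly the problem described above.
    """
    return sorted(set(body.get("collections", [])))
```

the list comes back as one unauthenticated array, so everything downstream has to take it on trust.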

there are a few ways we can do better.

com.atproto.sync.getRepo

obviously we can just do full backfill of repo contents. but then we couldn't call ourselves lightrail.

what we can do is detect small repos and use getRepo just for them. repo size can be estimated from any CAR slice by measuring the root node height. we get a CAR slice from firehose commits and from any sync.getRecord response.
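
a rough sketch of the size estimate, under a few assumptions: a key's MST layer is the number of leading zero bits of its SHA-256 hash, counted in 2-bit chunks (so roughly 1 in 4 keys survives to each higher layer), and the root node's height can be read off as the layer of any key stored in it. the `4 ** height` guess and the threshold are illustrative only, not tuned values.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a hash digest."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        # 8 - bit_length() is the number of zero bits above the highest set bit
        count += 8 - byte.bit_length()
        break
    return count


def key_layer(key: str) -> int:
    """MST layer of a key: leading zero bits of SHA-256(key), in 2-bit chunks."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // 2


def estimate_record_count(root_height: int) -> int:
    """Order-of-magnitude guess: each layer up keeps about 1 in 4 keys."""
    return 4 ** root_height


# Hypothetical cutoff below which a full getRepo is assumed to be cheap enough.
SMALL_REPO_MAX_RECORDS = 4 ** 3


def is_probably_small(root_height: int) -> bool:
    return estimate_record_count(root_height) <= SMALL_REPO_MAX_RECORDS
```

the estimate is noisy (one lucky hash can raise the root by a layer), so it's only good for a cheap "small or not" decision.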

collection-boundary com.atproto.sync.getRecord probing

mst keys have the form <collection>/<rkey> (lexicographic order). com.atproto.sync.getRecord returns a CAR proof path from the repo root to the queried key, and that usually includes keys immediately adjacent to the queried key. in particular, when the record does not exist, the proof path must include adjacent keys (required to prove the key is absent).
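
to make the adjacency idea concrete, here is a small sketch that uses a sorted in-memory list as a stand-in for the MST key space. `neighbors` and `collection_of` are hypothetical helpers, and python string ordering only matches MST byte ordering because legal keys are ASCII.

```python
from bisect import bisect_left
from typing import Optional


def collection_of(mst_key: str) -> str:
    """Everything before the first '/' in <collection>/<rkey>."""
    return mst_key.split("/", 1)[0]


def neighbors(sorted_keys: list[str], key: str) -> tuple[Optional[str], Optional[str]]:
    """Return the (left, right) keys adjacent to `key`, excluding `key` itself."""
    i = bisect_left(sorted_keys, key)
    left = sorted_keys[i - 1] if i > 0 else None
    # If the key is present, its right neighbor is one slot further along.
    j = i + 1 if i < len(sorted_keys) and sorted_keys[i] == key else i
    right = sorted_keys[j] if j < len(sorted_keys) else None
    return left, right


keys = [
    "app.bsky.actor.profile/self",
    "app.bsky.feed.like/3kaaa",
    "app.bsky.feed.post/3kbbb",
]
left, right = neighbors(keys, "app.bsky.feed.like/~")
```

here the probed key doesn't exist, `left` is the last like record, and `right` is the first post record, so `collection_of(right)` names the next collection.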

we can exploit this:

  1. query getRecord with the minimum legal MST key (a-----...0.0-----...0.A/-). the record usually won't exist, but the right-adjacent key in the CAR slice reveals the first collection present in the repo.
  2. for that collection, compute the maximum legal rkey (~ × 512) and query getRecord with <collection>/<max_rkey>. The right-adjacent key, if present, is the first key of the next collection.
  3. if we don't have an immediate right-adjacent key, we can increment the rkey to the next legal key and retry until we do get the next collection.
  4. repeat from step 2 until the end (no more right-adjacent collections).
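
the steps above can be sketched as a loop. this is a simulation, not a client: `right_neighbor` is a hypothetical callable standing in for "do a getRecord probe and read the right-adjacent key out of the CAR slice", and `fake_pds` backs it with a sorted list. the retry in step 3 is left out, so a `None` result is treated as the end of the repo.

```python
from bisect import bisect_right
from typing import Callable, Optional

MAX_RKEY = "~" * 512  # maximum legal rkey from step 2


def collection_of(mst_key: str) -> str:
    return mst_key.split("/", 1)[0]


def discover_collections(
    right_neighbor: Callable[[str], Optional[str]],
    min_key: str,
) -> list[str]:
    """Walk collection boundaries, one probe per collection discovered."""
    found = []
    nxt = right_neighbor(min_key)  # step 1: first key in the repo
    while nxt is not None:
        collection = collection_of(nxt)
        found.append(collection)
        # step 2: jump past every possible rkey in this collection
        nxt = right_neighbor(f"{collection}/{MAX_RKEY}")
    return found  # step 4: nothing to the right, we're done


def fake_pds(keys: list[str]) -> Callable[[str], Optional[str]]:
    """Simulated repo: returns the first stored key strictly after the probe."""
    ordered = sorted(keys)

    def right_neighbor(key: str) -> Optional[str]:
        i = bisect_right(ordered, key)
        return ordered[i] if i < len(ordered) else None

    return right_neighbor


probe = fake_pds([
    "app.bsky.feed.post/3kb",
    "app.bsky.actor.profile/self",
    "app.bsky.feed.like/3ka",
    "app.bsky.feed.post/3kc",
])
# "" is only a stand-in that sorts before everything; a real probe would
# use the minimum legal key from step 1.
collections = discover_collections(probe, "")
```

in this simulation the four stored keys are covered by four probes: one to find the first collection, then one per collection to hop over it.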

this probing costs ~one request per collection discovered. wrinkles:

  • on the first request, estimate repo size and just do getRepo if it's small. probing requests count toward PDS rate-limit.

  • the repository can update while we are probing. this is easily detected because every CAR slice response includes the commit object and MST root, which changes on any update to the MST. the really nice way to deal with this is to maintain a sparse MST built up from all the probe requests, which can usually be updated directly from the upper changed nodes. at the end, we have a repo-spanning valid-but-sparse MST that proves all collection boundaries simultaneously.

    what do we do if a collection is added or removed by a mid-probe update? TODO!
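
a much blunter alternative to the sparse-MST merge, sketched here only to show the detection part: remember the MST root CID from the first probe and throw everything away if a later probe reports a different one. `ProbeResult` and `ProbeSession` are hypothetical types, and restart-on-change sidesteps the TODO above rather than answering it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProbeResult:
    root_cid: str                  # MST root CID from the commit in the CAR slice
    right_neighbor: Optional[str]  # right-adjacent key, if the slice had one


@dataclass
class ProbeSession:
    root_cid: Optional[str] = None
    collections: list = field(default_factory=list)

    def record(self, result: ProbeResult) -> bool:
        """Add a probe result; return False if the repo changed mid-probe."""
        if self.root_cid is None:
            self.root_cid = result.root_cid
        elif result.root_cid != self.root_cid:
            # Earlier results describe an older snapshot: drop them and
            # let the caller restart (or fall back to getRepo).
            self.root_cid = None
            self.collections.clear()
            return False
        if result.right_neighbor is not None:
            collection = result.right_neighbor.split("/", 1)[0]
            if collection not in self.collections:
                self.collections.append(collection)
        return True
```

this wastes probes on busy repos, which is the cost the sparse-tree approach is meant to avoid.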

skeleton shower from com.atproto.sync.getBlocks

instead of scanning across the key range on collection boundaries, we could build our own sparse collection-boundary tree top-down:

  1. make any ..sync.getRecord query, to obtain the MST root node
  2. request every MST child node that spans a collection change, using com.atproto.sync.getBlocks.
  3. continue down like this, layer by layer, until reaching the bottom layer. since getBlocks accepts multiple CIDs, we can fetch everything required from each layer together in one request per layer (unless we need too many blocks to fit in the querystring).
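
a sketch of the two mechanical pieces of this descent, with assumptions stated up front: `Node` is a hypothetical already-decoded MST node with full keys reconstructed, a child subtree is worth fetching when the keys bounding it belong to different collections (or when it sits at an open edge of the key space), and the URL length limit is an arbitrary placeholder. the `did` and repeated `cids` query parameters are the getBlocks inputs.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class Node:
    left: Optional[str]  # CID of the leftmost subtree, if any
    entries: list        # [(full_key, right_subtree_cid_or_None), ...]


def collection_of(key: str) -> str:
    return key.split("/", 1)[0]


def children_to_fetch(node: Node, lo: Optional[str] = None, hi: Optional[str] = None) -> list:
    """Return (cid, lo, hi) for each child that may hide a collection boundary.

    lo and hi are the keys bounding this node from its ancestors (None = open).
    """
    cids = [node.left] + [tree for _, tree in node.entries]
    bounds = [lo] + [key for key, _ in node.entries] + [hi]
    wanted = []
    for i, cid in enumerate(cids):
        below, above = bounds[i], bounds[i + 1]
        if cid is None:
            continue
        # Keys between two keys of the same collection share that collection,
        # so such a subtree cannot contain a boundary and is skipped.
        if below is None or above is None or collection_of(below) != collection_of(above):
            wanted.append((cid, below, above))
    return wanted


MAX_URL_LEN = 2000  # hypothetical querystring budget


def _url(base: str, did: str, batch: list) -> str:
    return f"{base}?{urlencode([('did', did)] + [('cids', c) for c in batch])}"


def get_blocks_urls(pds_base: str, did: str, cids: list, max_len: int = MAX_URL_LEN) -> list:
    """Split one layer's CIDs into as few getBlocks URLs as fit the budget."""
    base = f"{pds_base.rstrip('/')}/xrpc/com.atproto.sync.getBlocks"
    urls, batch = [], []
    for cid in cids:
        if batch and len(_url(base, did, batch + [cid])) > max_len:
            urls.append(_url(base, did, batch))
            batch = [cid]
        else:
            batch.append(cid)
    if batch:
        urls.append(_url(base, did, batch))
    return urls
```

each layer is then: decode the fetched nodes, run `children_to_fetch` on every one with the bounds it was fetched under, and pass the resulting CIDs to `get_blocks_urls`.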

we end up with a nice sparse tree that proves all collection boundaries. MSTs are not very tall so this might actually be pretty nice, and we directly build a consistent point-in-time snapshot.

this fails when

  • any block we need is updated or removed while we're climbing down. in that case we can retry or fall back to getRepo.
  • a PDS doesn't implement getBlocks. (i have no idea how common that is.)

we can dream: "com.atproto.sync.getRepoCollections"

maybe one day a PDS endpoint like this will exist, serving the same sparse MST that our approaches here end up building: the blocks on all collection-boundary paths, proving the exact set of collections present in the repo at an exact commit.