STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car


iterate on spec stuff

phil 72b537d3 383dcb84

+145 -22
+1 -6
readme.md
···
  | strict | ❌ | ✅ | ✅ | ✅ |
  | deterministic | ❌ | ✅ | ✅ | ✅ |
  | slices, sparse | ✅ | ❌^4 | ✅ | ✅ |
- | subtree | ❌ | ✅^5 | ✅ | ✅ |
+ | subtree | ❌ | ✅ | ✅ | ✅ |


  Read more:
···
  3. STAR-lite values can be emitted immediately and trivially from its encoded form with zero buffering required. However, MST recovery (or pre-verification) requires either two passes or disk spilling -- but it's still more efficient than CAR.

  4. STAR-lite *could* support MST slices and probably sparse MSTs, but this is not specified yet. MST slices in particular would be valuable.
-
- 5. Work in progress (easy)
-
-
-
+144 -16
star-lite/readme.md
···
  STAR-lite is just a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header.

  ```
- |------ header ------| |------------------ data (records) -------------------|
- [ magic | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
+ |--------- header ---------| |------------------ data (records) -------------------|
+ [ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
  ```

  | name | type |
  | ----- | -------------------------------------- |
  | magic | three-byte mark to identify the format |
+ | cid | multiformats sha256 CID link |
  | len | unsigned varint |
  | str | utf-8 bytes |
  | cbor | cbor bytes |


- ### Magic
+ ### Header magic

  Three bytes: `0x2A 0x6C 0x00`, ASCII for `*l\0`: "star", "**l**ite", version 0.


- ### Header len + cbor: Commit
+ ### Header CID

- A length-prefixed CBOR blob containing an atproto signed Commit object. The CBOR format is the same as the atproto repo spec describes. The Commit may be ignored, but for archive content verification, its `data` field must be parsed at minimum.
+ A 36-byte CID link to the root of the repo MST, i.e., the `data` field from the repo's current [`Commit` object][commit].

- A parser may reject a commit if its `varint` length is greater than 4KiB (TODO: we could make this really tight, but 2KiB for the did:web and then 2K for the rest should prevent large reads and not be limiting)
+ Archive integrity can be verified by recovering the MST from its contents and computing the CID of the root MST node, which must match.

- TODO: like the other STAR formats we should actually define a slightly-modified commit object, specifically with a nullable data cid for empty repos. otherwise, we should include the magic CID of an empty atproto MST node's hash like you get with CARs.
+
+ ### Header len + cbor: optional partial commit object
+
+ When `len == 0`, no commit object is included in the archive. This is useful for archiving unsigned subtrees of a full repository tree -- the contents can still be verified from the preceding CID field.
+
+ When `len > 4096`, a parser may reject the commit object as being implausibly large. (TODO: we probably can set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).
+
+ Otherwise, when `len > 0`, a partial commit object of exactly `len` bytes follows, in CBOR format. The partial commit has the same fields as an [atproto Commit Object][commit] except that the `data` field must be omitted.
+
+ To verify the commit signature, use the Header CID (above) as the `data` field to compute the commit's signed CID.


  ### Data: keys and records
···

  While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, this usually carries high overhead (memory or i/o) for the most frequent operations on a STAR-lite file: verification, and conversion to stream-ordered CAR.

+ By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with lower overhead.


- By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with minimal overhead.
+ ### MST node stack
+
+ We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a small stack of in-progress MST nodes (one per layer of the tree) is sufficient.
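The stack has one slot per key layer. As a rough sketch of where a key's layer comes from, assuming the usual atproto MST rule (SHA-256 the key, count leading zero bits, two bits per layer): `leading_zero_bits` is a hypothetical helper name, and `layer_of` simply reuses the name from the pseudo-code in this change. This is an illustration under those assumptions, not the project's implementation.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits that precede the first set bit."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        # bit_length() gives the position of the highest set bit (1..8).
        count += 8 - byte.bit_length()
        break
    return count


def layer_of(key: str) -> int:
    """Assumed rule: every two leading zero bits of SHA-256(key) add one layer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // 2
```

Most keys land on layer 0, so the stack stays shallow: its depth is bounded by the highest layer seen so far, not by the number of records.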
+
+ When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
+
+ At this point, the newly frozen nodes can be:
+
+ - simply discarded, when verifying archive integrity,
+ - serialized into runs of CAR-format blocks,
+ - any other transformation
+
+ Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.


  ### Conversion to CAR

+ For preorder traversal block ordering of CAR files (aka "stream-friendly order"), each parent MST node must be included *before* all of its children. So while subtrees can be serialized into CAR-output-ready byte sequences as soon as they are frozen, they must be buffered until their parent node is frozen.
+
+ Since our depth-first walk finalizes children before parents, and the final parent finalizes last, we must unfortunately buffer all serialized CAR frames while the tree is walked. The good news is that a disk-spill-friendly byte log works well for this buffering.
+
+
+ #### some old intuition-y words that might go somewhere but not here now
+
  Stream-ordered CARs (in "preorder traversal" block order) are a depth-first walk over the Merkle Search Tree, and keys encountered during a depth-first MST walk are in strict lexicographic order.

  There is a useful symmetry here:
···

  So, any subtree-spanning range of keys (and records) can be materialized directly into its stream-ordered sequence of CAR blocks, independent of the rest of the archive.
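To make the node-stack walk from "MST node stack" concrete, here is a small runnable sketch. It is deliberately a toy: `Node`, `toy_hash`, and `root_hash` are hypothetical names, and `toy_hash` is a stand-in for real DAG-CBOR encoding plus CID computation, so its output will not match real atproto CIDs. The `layer_of` function is passed in so the walk can be exercised with hand-picked layers.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    left: Optional[bytes] = None                 # subtree before the first entry
    entries: list = field(default_factory=list)  # [key, record_hash, right_subtree]

    def is_empty(self) -> bool:
        return self.left is None and not self.entries

    def attach_subtree(self, child: bytes) -> None:
        # A frozen child hangs off the most recent entry, or off `left`
        # when this node has no entries yet.
        if self.entries:
            self.entries[-1][2] = child
        else:
            self.left = child


def toy_hash(node: Node) -> bytes:
    # Stand-in only: NOT the atproto MST node encoding.
    return hashlib.sha256(repr((node.left, node.entries)).encode()).digest()


def root_hash(pairs, layer_of) -> bytes:
    stack = []  # one in-progress Node per layer, lowest layer first
    for key, record in pairs:
        layer = layer_of(key)
        while len(stack) <= layer:
            stack.append(Node())
        # Every node below this key's layer is complete: freeze bottom-up.
        for i in range(layer):
            if not stack[i].is_empty():
                stack[i + 1].attach_subtree(toy_hash(stack[i]))
                stack[i] = Node()
        stack[layer].entries.append([key, hashlib.sha256(record).digest(), None])
    # End of input: fold whatever is left, lowest layer first.
    child = None
    for node in stack:
        if child is not None:
            node.attach_subtree(child)
            child = None
        if not node.is_empty():
            child = toy_hash(node)
    return child if child is not None else toy_hash(Node())
```

For example, with layers `{"a": 0, "b": 1, "c": 0}`, the key `b` freezes the node holding `a` into its left link, and the node holding `c` is folded into `b`'s right link at end of input. At no point does the stack hold more than one node per layer.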

- MST subtrees can't be *emitted* until the entire MST has been reconstructed, because stream-ordering requires that the very first CAR block is the MST root node, and that is the very last node we can serialize.

- But what we can do, is write serialized segments of the final CAR to disk temporarily as the entire MST is reconstructed, to stay within a strict memory budget. Streaming out the final stream-ordered CAR can use `copy_file_range` or equivalent to splice them in at the right places.
+ #### pseudo-code

- ```
- TODO: actual algorithm pseudocode
+ ```python
+ ## WIP!
+
+ def to_stream_ordered_car(key_record_pairs):
+     stack = []
+     byte_log = [] # disk spilling omitted from this example
+     prev_key_layer = 0
+
+     for (key, record) in key_record_pairs:
+         record_cid = compute_cid(record)
+
+         record_run = byte_log.append_car_frame(record_cid, record)
+
+         key_layer = layer_of(key)
+
+         extend stack with empty slots until len(stack) >= key_layer + 1
+
+         # every layer below key_layer that has content gets frozen. Its
+         # node frame is appended to the byte log, and the resulting
+         # subtree's emit plan is propagated up to layer L+1.
+         for lower_layer in range(0, key_layer):
+             if node := stack.get(lower_layer):
+                 (node_cid, node_bytes) = encode_mst_node(node)
+                 node_run = byte_log.append_car_frame(node_cid, node_bytes)
+                 subtree_emit_plan = build_emit_plan(node, node_run)
+                 push_subtree_with_plan(stack[lower_layer + 1], node_cid, subtree_emit_plan)
+                 stack[lower_layer] = None
+
+         # bleh, None handling kind of sucks. we should actually check nodes for .empty() and push/extend where needed
+
+         if stack.get(key_layer) is None:
+             stack[key_layer] = make_empty_node() # blehhh
+
+         stack[key_layer].entries.append(WhatIsThis(
+             key=key,
+             cid=record_cid,
+             car_run=record_run,
+             right=None,
+             right_emit_plan=None,
+         ))
+
+     # End of input: fold remaining stack bottom-up the same way.
+     node_cid, node_emit_plan = None, None
+     for node in stack:
+         if node_cid is not None:
+             push_subtree_with_plan(node, node_cid, node_emit_plan)
+             node_cid, node_emit_plan = None, None
+         if node is not empty:
+             (node_cid, node_bytes) = encode_mst_node(node)
+             node_run = byte_log.append_car_frame(node_cid, node_bytes)
+             node_emit_plan = build_emit_plan(node, node_run)
+             node_cid = node_cid
+
+     # Empty repo: emit the canonical empty MST node into the byte log.
+     if node_cid is None:
+         (node_cid, node_bytes) = encode_mst_node(empty stack-slot)
+         node_run = byte_log.append_car_frame(node_cid, node_bytes)
+         node_emit_plan = [node_run]
+
+     output = []
+     for run in node_emit_plan:
+         output.extend(byte_log[run.what:run.whattt])
+
+     return node_cid, output
  ```


···

  The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object.

- TODO: mention that this doesn't check the commit's signature but users can do that on the commit object first, directly, if they want. verification just asserts that the archive content is consistent with the commit object's `data` commitment.
+ Verification asserts the integrity of the repository contents: verifying the signature of the archive's [commit object][commit] (if present) is a separate process, outside the scope of STAR. See atproto [commit signatures][commit-sigs]


- ```
- TODO: actual algorithm pseudocode
+ ```python
+ ## WIP!!
+
+ def verify(key_record_pairs, expected_root_cid):
+     stack: list[MstNode] = []
+
+     for (key, record) in key_record_pairs:
+         record_cid = compute_cid(record)
+         key_layer = layer_of(key)
+         while len(stack) <= key_layer:
+             stack.append(MstNode())
+
+         for i in range(key_layer):
+             if stack[i].is_empty():
+                 continue
+             (node_cid, _) = encode_mst_node(stack[i])
+             stack[i + 1].attach_subtree(node_cid)
+             stack[i] = MstNode() # empty it
+
+         stack[key_layer].entries.append(Leaf(key, record_cid, car_run=None))
+
+     # Fold remaining stack bottom-up.
+     node_cid = None
+     for node in stack:
+         if node_cid is not None:
+             node.attach_subtree(node_cid)
+             node_cid = None
+         if not node.is_empty():
+             (node_cid, _) = encode_mst_node(node)
+
+     # Empty repo: canonical empty MST node CID.
+     if node_cid is None:
+         (node_cid, _) = encode_mst_node(MstNode())
+
+     return node_cid == expected_root_cid
  ```


  #### Empty repos

- a repo with zero keys is allowed: its commit object must use the magic CID `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4` (TODO: double-check this), which corresponds to the CID of a single empty atproto MST node - how atproto CARs represent empty repos.
+ A repo with no keys is allowed. Its header CID is always `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`, the CID of a single empty atproto MST node.
+
+ Note that when generating a CAR file, the empty MST node block must be included.
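Since the empty-repo CID is a hard-coded constant, it may be worth recomputing it independently. The sketch below assumes that an empty MST node is the DAG-CBOR map `{"e": [], "l": null}` and that the CID is a CIDv1 with the dag-cbor codec and a sha2-256 multihash; the byte string `EMPTY_NODE_CBOR` and the helper names `cid_bytes` and `cid_string` are my own illustration, not taken from the spec text in this change.

```python
import base64
import hashlib

# Assumed DAG-CBOR encoding of {"e": [], "l": null}:
# a2 (map of 2), 61 65 ("e"), 80 (empty array), 61 6c ("l"), f6 (null).
EMPTY_NODE_CBOR = bytes.fromhex("a2616580616cf6")


def cid_bytes(block: bytes) -> bytes:
    """Binary CIDv1: version 1, dag-cbor (0x71), sha2-256 (0x12), 32-byte digest."""
    return bytes([0x01, 0x71, 0x12, 0x20]) + hashlib.sha256(block).digest()


def cid_string(cid: bytes) -> str:
    """Multibase base32 text form: 'b' prefix, lowercase, no padding."""
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


empty_cid = cid_string(cid_bytes(EMPTY_NODE_CBOR))
print(empty_cid)  # compare against the constant quoted in the readme
```

The binary form is 36 bytes (4 prefix bytes plus the 32-byte digest), which lines up with the "36-byte CID link" in the header section.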


  ### Conversion from CAR
···
  [leb128]: https://en.wikipedia.org/wiki/LEB128
  [nsid]: https://atproto.com/specs/nsid
  [rkey]: https://atproto.com/specs/record-key
+ [commit]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#section-2.4
+ [commit-sigs]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-signatures
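For reference, the new header layout in this change (three-byte magic, 36-byte CID, varint `len`, optional CBOR) could be read roughly as follows. This is a sketch under assumptions: `read_exact`, `read_uvarint`, and `read_header` are hypothetical names, the 4096-byte limit is treated as a hard reject even though the text only says a parser *may* reject, and the varint overflow guard is my own addition. The partial commit is returned as raw CBOR bytes without decoding.

```python
import io

MAGIC = b"\x2a\x6c\x00"  # "*l\0": star, lite, version 0
CID_LEN = 36
MAX_COMMIT_LEN = 4096


def read_exact(stream, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of archive")
    return data


def read_uvarint(stream) -> int:
    """Unsigned LEB128: 7 data bits per byte, least significant group first."""
    value = shift = 0
    while True:
        byte = read_exact(stream, 1)[0]
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def read_header(stream):
    """Return (root_cid_bytes, partial_commit_cbor_or_None)."""
    if read_exact(stream, 3) != MAGIC:
        raise ValueError("not a STAR-lite v0 archive")
    root_cid = read_exact(stream, CID_LEN)
    length = read_uvarint(stream)
    if length > MAX_COMMIT_LEN:
        raise ValueError("implausibly large commit object")
    commit_cbor = read_exact(stream, length) if length else None
    return root_cid, commit_cbor
```

After `read_header` returns, the stream is positioned at the first `len | str | len | cbor` record, so the same `read_uvarint` and `read_exact` helpers can be reused for the data section.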