STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.

this probably can work

phil 2f22ac34 093cd403

+65 -53
readme.md
···
 - canonical (unlike CAR)
 - strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
 - the header simply is the commit, followed by the serialized tree.
-- CID link are implicit for blocks included in the STAR: linked blocks follow in a deterministic order and recompute CID from on their contents.
-- fewer edge cases: empty MST nodes strictly disallowed: `commit` omits `data` for an empty tree
+- layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order and recompute CID from on their contents.
+- small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header.

 the two primary motivations are

···

 2. reduced archive size.

-CIDs are large, compression-unfriendly, and redundant if you are including the CID's actual content.
+CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content.

 for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDS, almost 20%.

···

 These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.

-STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.
-
-Since omitting CIDs by making them implicit removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
+STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**.

 Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.

···
 STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).


-## problems
-
-It might be difficult to convert a STAR to stream-friendly (preorder traversal) CAR format, since the CID of each MST node block can only be computed after visiting all of its children. STAR (could)[https://bsky.app/profile/bad-example.com/post/3mcv4zxwtgs2w] require that MST nodes above a certain depth store their own CIDs which would be sad but pragmatic.
-
-Similarly, the validity of the commit signature cannot be known until the root node's CID is calculated. a parser might emit an entire repo-worth of keys-record pairs only to find out at the very end that none of it was valid. The same boring pragmatic fix as the last problem can probably also address this: small near-leaf subtrees need to be buffered for validation; as trees get larger this looks more like streaming. (see `verifying the whole tree` below)
-
-
 ## format

 ```
···

 - `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):

-  - `did` (string, nullable): same as repo spec
+  - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now.
   - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
-  - `data` (hash link, **nullable**): CID of the first (root) node in the MST. an empty tree is represented by the presence of a `null` here
-  - `rev` (string, required): same as repo spec
-  - `prev` (hash link, nullable): same as repo spec
-  - `sig` (byte array, **nullable**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be `null`.
+  - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key.
+  - `rev` (string, required): same as repo spec (may become optional)
+  - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping.
+  - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted.

 #### verifying a commit

 The `commit` object can be converted to a repo-spec compliant commit:

-- if `data` is null, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
+- if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
 - follow steps from the repo spec to resolve the identity and verify the signature.

-When `sig` is null (typically for archived MST sub-trees), this STAR cannot be converted to a repo-spec compliant commit.
+When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complimentary sparse STAR containing it.


 ### optional tree

-- `node`: TODO: we need a new node format. It must be convertible back to a repo-spec style node.
-
-- `record`: The atproto record. Its CID can be computed over the bytes of its `block` (see below).
+There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive.


-### node
+#### `node`

 ```
 |----- node -----|
···
 - `len` (varint): the length of the proceeding CBOR block, in bytes.

 - `mst node` (DAG-CBOR): object with the following schema
-  - `l` (hash link, **optional and nullable**): reference to a subtree at a lower depth containing only keys to the left of this node.
-    - when **absent**: there is no left subtree
-    - when **null**: the left subtree is present and will follow in the archive (implicit CID)
-    - when **non-null**: the left subtree exists but is abset from the archive
-  - `e` (array, required): ordered array of entry objects, each containing:
+  - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree.
+  - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present.
+  - `e` (array, required): ordered array of entry objects with length of at least one, each containing:
     - `p` (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
     - `k` (byte string, required): key suffix remaining
-    - `v` (hash link, **nullable**): reference to the record data for this key.
-      - when **null**: the record is included in the archive and will follow (implicit CID)
-      - when **non-null**: the record exists but is not included in the archive
-    - `t` (hash link, **optional and nullable**): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. same rules as `l`:
-      - when **absent**: there is no subtree here subtree
-      - when **null**: the subtree is present and will follow in the archive (implicit CID)
-      - when **non-null**: the subtree exists but is abset from the archive
+    - `v` (hash link, optional): reference to the record data for this key.
+      - for MST nodes at depth=0:
+        - `v` must be omitted when the record is included in the archive
+        - `v` mut not be omitted if the record is not included
+      - for MST nodes at depth>0:
+        - `v` is required (`V` signifies if it's in the archive)
+    - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present.
+    - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree.
+    - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present.

+for now see the atproto repo spec for key compression (`p` and `k`)

-### record
+#### `record`
+
+An atproto record. Its CID can be computed over its bytes of its `block` (see below).
+
+```
+|--- record --|
+[ len | block ]
+```

-```
-|--- record --|
-[ len | block ]
-```
+- `len` (varint): the length of the proceeding binary record block in bytes.
+
+- `block` (bytes): the raw bytes of the (DAG-CBOR) record.
+
+
+### verifying the whole tree
+
+Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes:
+
+For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization.
+
+For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes.
+
+To compute the CID of included records:
+
+The required bytes for the CID calculation are the exact included record bytes (sha256 over them).

-- `len` (varint): the length of the proceeding binary record block in bytes.

-- `block` (bytes): the raw bytes of the (DAG-CBOR) record
+## open questions

+### how far to go with implicit CIDs?

-### order of `node`s and `record`s
+there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR).

-The MST **must** be stored in key order, which for an MST is a depth-first walk across the tree.
+the problem with omitting *all* possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all!

-For each *included* child of a `node` (indicated by ?? in its entries. null for cid?), todo blah blah
+as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**:

-*excluded* children (indicated by a CID link being present in entries) are not included in the series of nodes and records.
+- 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links
+- MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never omits content that later is found to be unverifiable.
+- no special casing required for MST subtree links, their CIDs are always included

+this is probably the right balance: considering the 50% initial reduction compared CARs by dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**.

-#### key compression
+but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records:

-TODO
-- (but basically do what the repo-spec does but apply it across the whole stream)
-- (but also actually run some tests and measure how much this decreases file sizes post-normal-file-compression)
+- 89% overall reduction in CIDs in the STAR
+- a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes.
+- layer-dependent special-casing required for CID links as well as record links, across two MST layers

+with the 50% initial reduction, this would be **95% total CID reduction**

-### verifying the whole tree
+CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another.

-A STAR reader must compute CIDs for all MST nodes as they are encountered, so that parent node CIDs can be computed, until eventually the root node's CID is known and can be compared againt the commit object's `data` hash link. If the root node's CID does not match, the commit's signature is not valid for the archive.
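
The `record` framing (`[ len | block ]`) and the "sha256 over the exact included record bytes" rule from the new readme text can be illustrated with a minimal Python sketch. It is not taken from the repository and rests on assumptions: that `len` is an unsigned LEB128 varint as in CAR files (the readme only says "varint"), and that record CIDs are CIDv1 with the DAG-CBOR codec (`0x71`) and a sha2-256 multihash, rendered as lowercase base32 with a `b` prefix. The function names `read_varint`, `write_varint`, `cid_for_block` and `read_record` are hypothetical.

```python
import base64
import hashlib
import io


def read_varint(stream):
    """Read an unsigned LEB128 varint (assumed encoding of `len`)."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if byte & 0x80 == 0:              # high bit clear: last byte
            return result
        shift += 7


def write_varint(value):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def cid_for_block(block):
    """CIDv1 string for a DAG-CBOR block hashed with sha2-256 (assumed CID shape)."""
    digest = hashlib.sha256(block).digest()
    # version 1, codec dag-cbor (0x71), multihash sha2-256 (0x12), digest length 32
    raw = bytes([0x01, 0x71, 0x12, 0x20]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def read_record(stream):
    """Read one `[ len | block ]` record and recompute its implicit CID."""
    length = read_varint(stream)
    block = stream.read(length)
    if len(block) != length:
        raise EOFError("stream ended inside a record block")
    return block, cid_for_block(block)


# Usage: frame a stand-in block (0xA0 is an empty DAG-CBOR map) and read it back.
example_block = bytes([0xA0])
framed = write_varint(len(example_block)) + example_block
record, cid = read_record(io.BytesIO(framed))
```

A verifying reader for a depth=0 node would call something like `read_record` once per included record, then substitute the recomputed CIDs for the omitted `v` links before hashing the node in repo-spec form.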