STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.


update specs

phil 383dcb84 806197a0

+271 -219
+32 -161
readme.md
··· 1 - # STAR: Streaming Tree ARchive format 1 + # STAR: **ST**reaming **AR**chive formats 2 2 3 - _status: just thinking about it_ 3 + Stricter, verifiable, deterministic, highly compressible alternatives to [CAR][1] files for atproto repositories. 4 4 5 - STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs. 6 5 7 - - convertible to/from CAR (lossless except any out-of-tree blocks from a CAR) 8 - - extra garbage strictly not allowed (unlike CAR) 9 - - canonical (unlike CAR) 10 - - strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR) 11 - - the header simply is the commit, followed by the serialized tree. 12 - - layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order and recompute CID from on their contents. 13 - - small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header. 6 + | | [CAR][1] | [STAR-lite][2] | [STAR-L0][3] | [STAR-L1][3] | 7 + | -------------- | ------- | --------- | -------- | -------- | 8 + | verifiable | ✅ | ✅ | ✅ | ✅ | 9 + | existing tools | ✅ | ❌ | ❌ | ❌ | 10 + | archive size | worst | best | good | near-best | 11 + | streamable | ❌^1 | ✅^2 | ✅ best | ✅ | 12 + | bounded memory | ❌^1 | ✅^2 | ✅ best | ✅ | 13 + | speed | worst^1 | good/best^3 | best | better | 14 + | complexity | ✅ best | ✅ best | ok | tricky | 15 + | strict | ❌ | ✅ | ✅ | ✅ | 16 + | deterministic | ❌ | ✅ | ✅ | ✅ | 17 + | slices, sparse | ✅ | ❌^4 | ✅ | ✅ | 18 + | subtree | ❌ | ✅^5 | ✅ | ✅ | 14 19 15 - the two primary motivations are 16 20 17 - 1. bounded-resource streaming readers. 
21 + Read more: 18 22 19 - atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly. 23 + - **[CAR][1]**: best interoperability 20 24 21 - you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walkign the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them. 25 + > A standardized content-addressed block format 22 26 23 - since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through) 27 + - **[STAR-lite][2]**: best compression 24 28 25 - 2. reduced archive size. 29 + > A flat key-record encoding with no MST 26 30 27 - CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content. 31 + - **[STAR-L0/L1][3]**: best for streaming verification 28 32 29 - for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDS, almost 20%. 
33 + > A strictly-ordered block format with implicit CIDs and MST recovery at lower layers 30 34 31 - from a few more samples of various sizes from real atproto repos: 32 35 33 - ``` 34 - CIDs CAR potential savings 35 - 0.53KB / 3.4KB = 16% 36 - 23.2KB / 279KB = 8% 37 - 0.9MB / 5.0MB = 18% 38 - 25.9MB / 128MB = 20% 39 - 94.8MB / 449MB = 21% 40 - ``` 41 - 42 - These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression. 43 - 44 - STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**. 45 - 46 - Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it. 47 - 48 - 49 - ### scope 50 - 51 - STAR is specialized for atproto MST storage, and best-suited for serializing complete trees. 36 + --- 52 37 53 - - It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated) 54 - 55 - - It's desireable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary) 56 - 57 - - It *might* be interesting to allow arbitrary sparse trees. Not sure yet. 58 - 59 - - It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST. 60 - 61 - STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens). 38 + Notes: 62 39 40 + 1. 
See [this issue](https://github.com/bluesky-social/ietf-drafts/issues/18) on the ietf atproto repo draft: it's not possible in general to correctly treat a CAR repo as stream-ordered without knowing (out of band) that it was encoded that way, so parsers must buffer the entire repository. Disk spilling can bound memory usage, like [repo-stream](https://tangled.org/microcosm.blue/repo-stream) does, but requires many random i/o reads. Stream-ordered CARs are competitive with STAR variants on some axes, but given the unresolved issues, are not considered in this comparison. 63 41 64 - #### CARs 42 + 2. STAR-lite streaming verification or conversion-to-CAR requires disk spilling to achieve bounded memory, but the i/o is optimized for a small number of one-time in-order reads from disk. 65 43 66 - The IETF draft for AT repositories (referred to as "repo spec" in this document) can be found at https://github.com/bluesky-social/ietf-drafts/blob/main/draft-holmgren-at-repository.md or https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html 44 + 3. STAR-lite values can be emitted immediately and trivially from its encoded form with zero buffering required. However, MST recovery (or pre-verification) requires either two passes or disk spilling -- but it's still more efficient than CAR. 67 45 46 + 4. STAR-lite *could* support MST slices and probably sparse MSTs, but this is not specified yet. MST slices in particular would be valuable. 68 47 69 - ## format 48 + 5. Work in progress (easy) 70 49 71 - ``` 72 - |--------- header ---------| |---------------- optional tree ----------------| 73 - [ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] … 74 - ``` 75 50 76 - ### header 77 51 78 - - `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`). 79 52 80 - - `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format. 
81 53 82 - - `len` (varint): The length of the proceeding atproto commit object in bytes. 83 54 84 - - `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects): 85 55 86 - - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now. 87 - - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3` 88 - - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key. 89 - - `rev` (string, required): same as repo spec (may become optional) 90 - - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping. 91 - - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted. 92 - 93 - #### verifying a commit 94 - 95 - The `commit` object can be converted to a repo-spec compliant commit: 96 - 97 - - if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`) 98 - - follow steps from the repo spec to resolve the identity and verify the signature. 99 - 100 - When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complimentary sparse STAR containing it. 101 - 102 - 103 - ### optional tree 104 - 105 - There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. 
Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive. 106 - 107 - 108 - #### `node` 109 - 110 - ``` 111 - |----- node -----| 112 - [ len | mst node ] 113 - ``` 114 - 115 - - `len` (varint): the length of the proceeding CBOR block, in bytes. 116 - 117 - - `mst node` (DAG-CBOR): object with the following schema 118 - - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree. 119 - - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present. 120 - - `e` (array, required): ordered array of entry objects with length of at least one, each containing: 121 - - `p` (integer, required): number of bytes shared with the previous entry (TODO [key compression](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-mst-node-schema) actually) 122 - - `k` (byte string, required): key suffix remaining 123 - - `v` (hash link, optional): reference to the record data for this key. 124 - - for MST nodes at depth=0: 125 - - `v` must be omitted when the record is included in the archive 126 - - `v` mut not be omitted if the record is not included 127 - - for MST nodes at depth>0: 128 - - `v` is required (`V` signifies if it's in the archive) 129 - - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present. 130 - - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree. 131 - - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present. 
132 - 133 - for now see the atproto repo spec for key compression (`p` and `k`) 134 - 135 - #### `record` 136 - 137 - An atproto record. Its CID can be computed over its bytes of its `block` (see below). 138 - 139 - ``` 140 - |--- record --| 141 - [ len | block ] 142 - ``` 143 - 144 - - `len` (varint): the length of the proceeding binary record block in bytes. 145 - 146 - - `block` (bytes): the raw bytes of the (DAG-CBOR) record. 147 - 148 - 149 - ### verifying the whole tree 150 - 151 - Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes: 152 - 153 - For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization. 154 - 155 - For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes. 156 - 157 - To compute the CID of included records: 158 - 159 - The required bytes for the CID calculation are the exact included record bytes (sha256 over them). 160 - 161 - 162 - ## open questions 163 - 164 - ### how far to go with implicit CIDs? 165 - 166 - there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR). 167 - 168 - the problem with omitting *all* possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all! 
169 - 170 - as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**: 171 - 172 - - 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links 173 - - MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never omits content that later is found to be unverifiable. 174 - - no special casing required for MST subtree links, their CIDs are always included 175 - 176 - this is probably the right balance: considering the 50% initial reduction compared CARs by dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**. 177 - 178 - but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records: 179 - 180 - - 89% overall reduction in CIDs in the STAR 181 - - a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes. 182 - - layer-dependent special-casing required for CID links as well as record links, across two MST layers 183 - 184 - with the 50% initial reduction, this would be **95% total CID reduction** 185 - 186 - CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another. 187 - 56 + [1]: https://dasl.ing/car.html 57 + [2]: ./star-lite/ 58 + [3]: ./star-lN/
+189
star-lN/readme.md
··· 1 + # STAR: STreaming ARchive format 2 + 3 + _NOTE: this document needs updates to reflect STAR-L0 and STAR-L1 variants_ 4 + 5 + _status: just thinking about it_ 6 + 7 + STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs. 8 + 9 + - convertible to/from CAR (lossless except any out-of-tree blocks from a CAR) 10 + - extra garbage strictly not allowed (unlike CAR) 11 + - canonical (unlike CAR) 12 + - strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR) 13 + - the header simply is the commit, followed by the serialized tree. 14 + - layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order and their CIDs are recomputed from their contents. 15 + - small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header. 16 + 17 + the two primary motivations are 18 + 19 + 1. bounded-resource streaming readers. 20 + 21 + readers of atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly. 22 + 23 + you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walking the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them. 24 + 25 + since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. 
CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through) 26 + 27 + 2. reduced archive size. 28 + 29 + CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content. 30 + 31 + for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDs, almost 20%. 32 + 33 + from a few more samples of various sizes from real atproto repos: 34 + 35 + ``` 36 + CIDs CAR potential savings 37 + 0.53KB / 3.4KB = 16% 38 + 23.2KB / 279KB = 8% 39 + 0.9MB / 5.0MB = 18% 40 + 25.9MB / 128MB = 20% 41 + 94.8MB / 449MB = 21% 42 + ``` 43 + 44 + These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression. 45 + 46 + STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**. 47 + 48 + Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it. 49 + 50 + 51 + ### scope 52 + 53 + STAR is specialized for atproto MST storage, and best-suited for serializing complete trees. 54 + 55 + - It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDs referencing missing nodes at the boundaries cannot be eliminated) 56 + 57 + - It's desirable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. 
(subtrees could be stored as CAR slices so this may be unnecessary) 58 + 59 + - It *might* be interesting to allow arbitrary sparse trees. Not sure yet. 60 + 61 + - It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST. 62 + 63 + STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens). 64 + 65 + 66 + #### CARs 67 + 68 + The IETF draft for AT repositories (referred to as "repo spec" in this document) can be found at https://github.com/bluesky-social/ietf-drafts/blob/main/draft-holmgren-at-repository.md or https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html 69 + 70 + 71 + ## format 72 + 73 + ``` 74 + |--------- header ---------| |---------------- optional tree ----------------| 75 + [ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] … 76 + ``` 77 + 78 + ### header 79 + 80 + - `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`). 81 + 82 + - `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format. 83 + 84 + - `len` (varint): The length of the following atproto commit object in bytes. 85 + 86 + - `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects): 87 + 88 + - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now. 89 + - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3` 90 + - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key. 
91 + - `rev` (string, required): same as repo spec (may become optional) 92 + - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping. 93 + - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted. 94 + 95 + #### verifying a commit 96 + 97 + The `commit` object can be converted to a repo-spec compliant commit: 98 + 99 + - if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`) 100 + - follow steps from the repo spec to resolve the identity and verify the signature. 101 + 102 + When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complementary sparse STAR containing it. 103 + 104 + 105 + ### optional tree 106 + 107 + There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive. 108 + 109 + 110 + #### `node` 111 + 112 + ``` 113 + |----- node -----| 114 + [ len | mst node ] 115 + ``` 116 + 117 + - `len` (varint): the length of the following CBOR block, in bytes. 118 + 119 + - `mst node` (DAG-CBOR): object with the following schema 120 + - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree. 
121 + - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present. 122 + - `e` (array, required): ordered array of entry objects with length of at least one, each containing: 123 + - `p` (integer, required): number of bytes shared with the previous entry (TODO [key compression](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-mst-node-schema) actually) 124 + - `k` (byte string, required): key suffix remaining 125 + - `v` (hash link, optional): reference to the record data for this key. 126 + - for MST nodes at depth=0: 127 + - `v` must be omitted when the record is included in the archive 128 + - `v` must not be omitted if the record is not included 129 + - for MST nodes at depth>0: 130 + - `v` is required (`V` signifies if it's in the archive) 131 + - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present. 132 + - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree. 133 + - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present. 134 + 135 + for now see the atproto repo spec for key compression (`p` and `k`) 136 + 137 + #### `record` 138 + 139 + An atproto record. Its CID can be computed over the bytes of its `block` (see below). 140 + 141 + ``` 142 + |--- record --| 143 + [ len | block ] 144 + ``` 145 + 146 + - `len` (varint): the length of the following binary record block in bytes. 147 + 148 + - `block` (bytes): the raw bytes of the (DAG-CBOR) record. 149 + 150 + 151 + ### verifying the whole tree 152 + 153 + Each MST node, including the root, must be verified to match its expected CID. 
To compute the CID of MST nodes: 154 + 155 + For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization. 156 + 157 + For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes. 158 + 159 + To compute the CID of included records: 160 + 161 + The required bytes for the CID calculation are the exact included record bytes (sha256 over them). 162 + 163 + 164 + ## open questions 165 + 166 + ### how far to go with implicit CIDs? 167 + 168 + there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDs still omitted before the content blocks themselves for 50% fewer vs CAR). 169 + 170 + the problem with omitting *all* possible CIDs is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all! 171 + 172 + as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**: 173 + 174 + - 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links 175 + - MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never emits content that later is found to be unverifiable. 
176 + - no special casing required for MST subtree links, their CIDs are always included 177 + 178 + this is probably the right balance: considering the 50% initial reduction compared to CARs by dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**. 179 + 180 + but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records: 181 + 182 + - 89% overall reduction in CIDs in the STAR 183 + - a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes. 184 + - layer-dependent special-casing required for CID links as well as record links, across two MST layers 185 + 186 + with the 50% initial reduction, this would be **95% total CID reduction** 187 + 188 + CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; the second to 1%. however, CIDs are incompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another. 189 + 
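The "verifying the whole tree" steps in star-lN/readme.md rely on recomputing each record's CID from its raw block bytes. A minimal sketch of that step, assuming atproto-style CIDs (CIDv1, dag-cbor codec `0x71`, sha2-256 multihash) rendered as lowercase base32 with the `b` multibase prefix. The name `record_cid` is hypothetical and not part of the spec:

```python
import base64
import hashlib

# Assumed CID layout: version 1, dag-cbor codec (0x71),
# sha2-256 multihash code (0x12), digest length 32 (0x20).
CID_PREFIX = bytes([0x01, 0x71, 0x12, 0x20])


def record_cid(block: bytes) -> str:
    """Recompute the CID string for a record from its exact encoded bytes."""
    digest = hashlib.sha256(block).digest()
    raw_cid = CID_PREFIX + digest
    # Multibase base32: lowercase, no padding, 'b' prefix.
    b32 = base64.b32encode(raw_cid).decode("ascii").lower().rstrip("=")
    return "b" + b32
```

Because the 4-byte prefix is fixed, every CID produced this way starts with `bafyrei`, which is why only the 32 digest bytes count toward the size estimates above. The same function would apply to an MST node once it has been converted back to its repo-spec encoding.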
+50 -58
star-lite/readme.md
··· 2 2 3 3 **ST**reaming **AR**chive repository format (extra light version) 4 4 5 - A stricter, simpler, still verifiable, more compressible alternative to [CAR](https://ipld.io/specs/transport/car/carv1/#format-description). 5 + A stricter, simpler, still verifiable, highly compressible alternative to [CAR][car]. 6 6 7 - STAR-lite describes both the actual binary encoding, and its memory-bounded algorithm to convert any sequence of in-order key-record pairs into stream-ordered CAR files. 7 + STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files. 8 8 9 - This efficient conversion makes STAR-lite suitable as an efficient network transport format or for long-term archiving and backup, without sacrificing interoperability. 9 + Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability. 10 10 11 11 12 - ### compared to CARs: 12 + ### Compared to CARs: 13 13 14 - - All MST node blocks and all MST CIDs are omitted, eliminating the least-compressible content. 14 + - No MST node blocks or CIDs, eliminating the least-compressible content. 15 15 - Strict content ordering, deterministic encoding. 16 - - Bounded-memory conversion back to stream-ordered CAR. 16 + - Bounded-memory conversion to stream-ordered CAR. 17 17 18 18 STAR-lite files shine when zstd-compressed. 19 19 20 20 21 - ### compared to STAR-L0 and STAR-L1: 21 + ### Compared to [STAR-L0 and STAR-L1][ln]: 22 22 23 23 - Smallest archive format (with zstd compression) 24 24 - Content verification requires a complete scan of all content ··· 26 26 - Disk spilling required for memory-bounded streaming of large archives 27 27 28 28 29 - ## format 29 + ## Format 30 + 31 + TODO: subtree -- probably just set header cbor len to 0 to omit the commit? but we still probably want the `data` root hash... 
30 32 31 33 STAR-lite is just a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header. 32 34 ··· 43 45 | cbor | cbor bytes | 44 46 45 47 46 - ### magic 48 + ### Magic 47 49 48 50 Three bytes: `0x2A 0x6C 0x00`, ASCII for `*l\0`: "star", "**l**ite", version 0. 49 51 50 52 51 - ### header len + cbor: Commit 53 + ### Header len + cbor: Commit 52 54 53 55 A length-prefixed CBOR blob containing an atproto signed Commit object. The CBOR format is the same as the atproto repo spec describes. The Commit may be ignored, but for archive content verification, its `data` field must be parsed at minimum. 54 56 55 - TODO: specify maximum commit size 57 + A parser may reject a commit if its `varint` length is greater than 4KiB (TODO: we could make this really tight, but 2KiB for the did:web and then 2K for the rest should prevent large reads and not be limiting) 56 58 57 59 TODO: like the other STAR formats we should actually define a slightly-modified commit object, specifically with a nullable data cid for empty repos. otherwise, we should include the magic CID of an empty atproto MST node's hash like you get with CARs. 58 60 59 61 60 - ### data: keys and records 62 + ### Data: keys and records 61 63 62 64 zero or more records until EOF. Each is: 63 65 64 66 | field | type | 65 67 | ----------- | --------------------------------------- | 66 - | key len | varint (TODO: min and max) | 68 + | key len | varint, max: 830 | 67 69 | key str | utf-8 bytes, exactly `key len` length | 68 - | record len | varint, max: 1,048,576 (1MiB) | 70 + | record len | varint, max: [1,048,576][reclen] | 69 71 | record cbor | cbor bytes, exactly `record len` length | 70 72 73 + The maximum key length comes from the combined limits of the `<collection>/<rkey>` syntax for atproto repo paths: 317 for the [collection][nsid] + 1 for the `/` slash + 512 for the [rkey][rkey]. 
71 74 72 - ### varints 75 + The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto [*recommended data limits*][reclen]. 73 76 74 - Unsigned LEB128 / multiformats unsigned-varint: 7 bits per byte, MSB is the continuation flag, little-endian byte order. 75 77 76 - TODO: say we defer to that spec -- which one specifically? (match to CAR's) 78 + ### Varints 77 79 78 - TODO: do we need to resolve any ambiguities from the spec? eg., that encoders must use the minimum number of bytes (no leading `0x80` padding bytes)? (we want to be deterministic). Also any security notes, like most number of bytes before bailing on the varint read? 80 + Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128]) 79 81 80 82 81 - ### rules 83 + ### Rules 82 84 83 85 - keys must be in strict lexicographic byte order. 84 86 - duplicate keys are not allowed. 85 87 - keys must be valid atproto repo paths: the format specifies utf-8, but in practice the required `<collection>/<rkey>` repo path format currently restricts characters to a small subset of ASCII. 86 - - records must be encoded as [DRISL](https://dasl.ing/drisl.html), the deterministic subset of CBOR used by atproto. 88 + - records must be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto. 87 89 88 90 89 - ## efficient MST recovery 91 + ## Efficient MST-aware operations 90 92 91 - *for archive verification and conversion to CAR* 93 + While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, this usually carries high overhead (memory or i/o) for the most frequent operations on a STAR-lite file: verification, and conversion to stream-ordered CAR. 92 94 93 - The simple way to verify an archive is to insert each `(key, record)` pair into an atproto MST builder library to reconstruct the full MST. Then, assert that the MST's root `CID` matches the CID in the Commit's `data` field. 
94 95 95 - For large repositories, building the MST this way may require significant memory, or significant storage I/O. STAR-lite includes a bounded-memory, efficient disk-spilling algorithm to recover the MST for verification or conversion to other atproto formats. 96 96 97 - 98 - ### archive verification 99 - 100 - Verification uses the same MST recover technique as CAR conversion (below), but evicts subtrees by simply dropping them, rather than spilling to disk, since only the root MST node's CID is required for verification. 97 + By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with minimal overhead. 101 98 102 99 103 - ### conversion to CAR 100 + ### Conversion to CAR 104 101 105 102 Stream-ordered CARs (in "preorder traversal" block order) are a depth-first walk over the Merkle Search Tree, and keys encountered during a depth-first MST walk are in strict lexicographic order. 106 103 ··· 115 112 116 113 But what we can do is write serialized segments of the final CAR to disk temporarily as the entire MST is reconstructed, to stay within a strict memory budget. Streaming out the final stream-ordered CAR can use `copy_file_range` or equivalent to splice them in at the right places. 117 114 115 + ``` 116 + TODO: actual algorithm pseudocode 117 + ``` 118 118 119 - #### empty repos 120 119 121 - a repo with zero keys is allowed: its commit object must use the magic CID `(TODO)`, which corresponds to the CID of a single empty atproto MST node - how atproto CARs represent empty repos. 120 + ### Archive verification 122 121 122 + Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded. 123 123 124 - ### algorithm 124 + Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs.
The encoded node bytes (and referenced record CIDs) are discarded, since we only need the node CID to help materialize an MST node. 125 125 126 - ``` 127 - read magic 128 - read commit 129 - init mst 126 + The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object. 130 127 131 - // TODO: fix this up to eagerly serialize subtrees 132 - 133 - for (key, record) in star_lite_entries: 134 - mst.insert(key, record) 135 - if mst.memory_usage() > limit: 136 - // find the leftmost subtree whose rightmost key < `key` (its structure is now frozen) 137 - subtree := mst.evict_leftmost_finalized_subtree() 138 - root_cid := subtree.root_cid() 139 - segment_path := temp_dir.create_segment(root_cid) 140 - subtree.write_blocks_in_car_order(segment_path) 141 - // replace the in-memory subtree pointer with a marker 142 - mst.replace_with_marker(root_cid, segment_path) 128 + TODO: mention that this doesn't check the commit's signature but users can do that on the commit object first, directly, if they want. Verification just asserts that the archive content is consistent with the commit object's `data` commitment. 143 129 144 - // EOF 145 - root_cid := mst.finalize() 146 130 147 - // stream out 148 - init car := AtprotoCar(commit) 149 - for block_or_marker in mst.depth_first_walk(): 150 - match block_or_marker: 151 - Block(cid, bytes) => car.write_block(cid, bytes) 152 - Marker(segment) => car.splice_file(segment.path) 131 + ``` 132 + TODO: actual algorithm pseudocode 153 133 ``` 154 134 155 - Memory is bounded because there is a practical (low) limit to MST height. 156 135 136 + #### Empty repos 157 137 138 + A repo with zero keys is allowed: its commit object must use the magic CID `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4` (TODO: double-check this), which corresponds to the CID of a single empty atproto MST node - how atproto CARs represent empty repos.
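One way to double-check the empty-repo CID is to recompute it from the bytes of an empty MST node. This is a sketch under stated assumptions: it takes the empty node to be the map `{"e": [], "l": null}` encoded as DRISL/DAG-CBOR, and the CID to be CIDv1 with the `dag-cbor` codec (`0x71`), a SHA-256 multihash, and lowercase base32 multibase. The names `EMPTY_NODE` and `cid_v1_dag_cbor` are hypothetical.

```python
import base64
import hashlib

# Assumption: an empty MST node is {"e": [], "l": null}. With deterministic
# key ordering ("e" sorts before "l") this encodes to 7 bytes.
EMPTY_NODE = bytes([
    0xA2,        # map with 2 entries
    0x61, 0x65,  # text string "e"
    0x80,        # empty array
    0x61, 0x6C,  # text string "l"
    0xF6,        # null
])


def cid_v1_dag_cbor(block: bytes) -> str:
    """Return the base32 CIDv1 string (dag-cbor codec, sha2-256) for a block."""
    digest = hashlib.sha256(block).digest()
    # 0x01 = CID version 1, 0x71 = dag-cbor, 0x12 = sha2-256, 0x20 = 32-byte digest
    raw = bytes([0x01, 0x71, 0x12, 0x20]) + digest
    # "b" is the multibase prefix for unpadded lowercase base32
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


print(cid_v1_dag_cbor(EMPTY_NODE))
```

Comparing the printed value with the CID quoted above, and with the root CID of an empty repo exported as a CAR, would settle the TODO either way.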
158 139 159 - ### conversion from CAR 140 + 141 + ### Conversion from CAR 160 142 161 143 TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it *is* stream-ordered, any streaming walker will work and it's pretty simple to write out. 162 144 163 145 Note: it is **not possible** to know if an atproto CAR is stream-ordered except by either knowing that it was encoded that way in advance, or by reading the **entire** archive first to verify. 146 + 147 + 148 + [car]: https://dasl.ing/car.html 149 + [drisl]: https://dasl.ing/drisl.html 150 + [ln]: ../star-lN/ 151 + [reclen]: https://atproto.com/guides/data-validation#recommended-data-limits 152 + [varint]: https://github.com/multiformats/unsigned-varint 153 + [leb128]: https://en.wikipedia.org/wiki/LEB128 154 + [nsid]: https://atproto.com/specs/nsid 155 + [rkey]: https://atproto.com/specs/record-key