···11-# STAR: Streaming Tree ARchive format
11+# STAR: **ST**reaming **AR**chive formats
2233-_status: just thinking about it_
33+Stricter, verifiable, deterministic, highly compressible alternatives to [CAR][1] files for atproto repositories.
4455-STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.
6577-- convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
88-- extra garbage strictly not allowed (unlike CAR)
99-- canonical (unlike CAR)
1010-- strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
1111-- the header simply is the commit, followed by the serialized tree.
1212-- layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order, and their CIDs are recomputed from their contents.
1313-- small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header.
66+| | [CAR][1] | [STAR-lite][2] | [STAR-L0][3] | [STAR-L1][3] |
77+| -------------- | ------- | --------- | -------- | -------- |
88+| verifiable | ✅ | ✅ | ✅ | ✅ |
99+| existing tools | ✅ | ❌ | ❌ | ❌ |
1010+| archive size | worst | best | good | near-best |
1111+| streamable | ❌^1 | ✅^2 | ✅ best | ✅ |
1212+| bounded memory | ❌^1 | ✅^2 | ✅ best | ✅ |
1313+| speed | worst^1 | good/best^3 | best | better |
1414+| complexity | ✅ best | ✅ best | ok | tricky |
1515+| strict | ❌ | ✅ | ✅ | ✅ |
1616+| deterministic | ❌ | ✅ | ✅ | ✅ |
1717+| slices, sparse | ✅ | ❌^4 | ✅ | ✅ |
1818+| subtree | ❌ | ✅^5 | ✅ | ✅ |
14191515-the two primary motivations are
16201717-1. bounded-resource streaming readers.
2121+Read more:
18221919- atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly.
2323+- **[CAR][1]**: best interoperability
20242121- you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walking the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.
2525+ > A standardized content-addressed block format
22262323- since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)
2727+- **[STAR-lite][2]**: best compression
24282525-2. reduced archive size.
2929+ > A flat key-record encoding with no MST
26302727- CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content.
3131+- **[STAR-L0/L1][3]**: best for streaming verification
28322929- for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDS, almost 20%.
3333+ > A strictly-ordered block format with implicit CIDs and MST recovery at lower layers
30343131- from a few more samples of various sizes from real atproto repos:
32353333- ```
3434- CIDs CAR potential savings
3535- 0.53KB / 3.4KB = 16%
3636- 23.2KB / 279KB = 8%
3737- 0.9MB / 5.0MB = 18%
3838- 25.9MB / 128MB = 20%
3939- 94.8MB / 449MB = 21%
4040- ```
4141-4242- These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.
4343-4444- STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**.
4545-4646- Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.
4747-4848-4949-### scope
5050-5151-STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.
3636+---
52375353-- It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated)
5454-5555- - It's desirable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
5656-5757-- It *might* be interesting to allow arbitrary sparse trees. Not sure yet.
5858-5959-- It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.
6060-6161-STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).
3838+Notes:
62394040+1. See [this issue](https://github.com/bluesky-social/ietf-drafts/issues/18) on the ietf atproto repo draft: it's not possible in general to correctly treat a CAR repo as stream-ordered without knowing (out of band) that it was encoded that way, so parsers must buffer the entire repository. Disk spilling can bound memory usage, like [repo-stream](https://tangled.org/microcosm.blue/repo-stream) does, but requires many random i/o reads. Stream-ordered CARs are competitive with STAR variants on some axes, but given the unresolved issues, are not considered in this comparison.
63416464-#### CARs
4242+2. STAR-lite streaming verification or conversion-to-CAR requires disk spilling to achieve bounded memory, but the i/o is optimized for a small number of one-time in-order reads from disk.
65436666-The IETF draft for AT repositories (referred to as "repo spec" in this document) can be found at https://github.com/bluesky-social/ietf-drafts/blob/main/draft-holmgren-at-repository.md or https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html
4444+3. STAR-lite values can be emitted immediately and trivially from its encoded form with zero buffering required. However, MST recovery (or pre-verification) requires either two passes or disk spilling -- but it's still more efficient than CAR.
67454646+4. STAR-lite *could* support MST slices and probably sparse MSTs, but this is not specified yet. MST slices in particular would be valuable.
68476969-## format
4848+5. Work in progress (easy)
70497171-```
7272-|--------- header ---------| |---------------- optional tree ----------------|
7373-[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …
7474-```
75507676-### header
77517878-- `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`).
79528080-- `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format.
81538282-- `len` (varint): The length of the following atproto commit object in bytes.
83548484-- `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
85558686- - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now.
8787- - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
8888- - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key.
8989- - `rev` (string, required): same as repo spec (may become optional)
9090- - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping.
9191- - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted.
9292-9393-#### verifying a commit
9494-9595-The `commit` object can be converted to a repo-spec compliant commit:
9696-9797- - if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
9898- - follow steps from the repo spec to resolve the identity and verify the signature.
9999-100100-When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complementary sparse STAR containing it is available.
101101-102102-103103-### optional tree
104104-105105-There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive.
106106-107107-108108-#### `node`
109109-110110-```
111111-|----- node -----|
112112-[ len | mst node ]
113113-```
114114-115115-- `len` (varint): the length of the following CBOR block, in bytes.
116116-117117-- `mst node` (DAG-CBOR): object with the following schema
118118- - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree.
119119- - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present.
120120- - `e` (array, required): ordered array of entry objects with length of at least one, each containing:
121121- - `p` (integer, required): number of bytes shared with the previous entry (TODO [key compression](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-mst-node-schema) actually)
122122- - `k` (byte string, required): key suffix remaining
123123- - `v` (hash link, optional): reference to the record data for this key.
124124- - for MST nodes at depth=0:
125125- - `v` must be omitted when the record is included in the archive
126126- - `v` must not be omitted if the record is not included
127127- - for MST nodes at depth>0:
128128- - `v` is required (`V` signifies if it's in the archive)
129129- - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present.
130130- - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree.
131131- - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present.
132132-133133-for now see the atproto repo spec for key compression (`p` and `k`)
134134-135135-#### `record`
136136-137137-An atproto record. Its CID can be computed over the bytes of its `block` (see below).
138138-139139- ```
140140- |--- record --|
141141- [ len | block ]
142142- ```
143143-144144- - `len` (varint): the length of the following binary record block in bytes.
145145-146146- - `block` (bytes): the raw bytes of the (DAG-CBOR) record.
147147-148148-149149-### verifying the whole tree
150150-151151-Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes:
152152-153153-For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization.
154154-155155-For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes.
156156-157157-To compute the CID of included records:
158158-159159-The required bytes for the CID calculation are the exact included record bytes (sha256 over them).
160160-161161-162162-## open questions
163163-164164-### how far to go with implicit CIDs?
165165-166166-there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR).
167167-168168-the problem with omitting *all* possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all!
169169-170170-as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**:
171171-172172-- 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links
173173-- MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never emits content that is later found to be unverifiable.
174174-- no special casing required for MST subtree links, their CIDs are always included
175175-176176-this is probably the right balance: considering the 50% initial reduction compared to CARs from dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**.
177177-178178-but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records:
179179-180180-- 89% overall reduction in CIDs in the STAR
181181-- a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes.
182182-- layer-dependent special-casing required for CID links as well as record links, across two MST layers
183183-184184-with the 50% initial reduction, this would be **95% total CID reduction**
185185-186186-CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another.
187187-5656+[1]: https://dasl.ing/car.html
5757+[2]: ./star-lite/
5858+[3]: ./star-lN/
+189
star-lN/readme.md
···11+# STAR: STreaming ARchive format
22+33+_NOTE: this document needs updates to reflect STAR-L0 and STAR-L1 variants_
44+55+_status: just thinking about it_
66+77+STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.
88+99+- convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
1010+- extra garbage strictly not allowed (unlike CAR)
1111+- canonical (unlike CAR)
1212+- strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
1313+- the header simply is the commit, followed by the serialized tree.
1414+- layer 0 MST nodes leave record links implicit, and content blocks in the STAR are not prefixed by their content hash: linked blocks follow in a deterministic order, and their CIDs are recomputed from their contents.
1515+- small spec cleanups, eg: there are no exceptions to the no-empty-MST-nodes rule: an entirely-empty MST is serialized as `data: null` in the commit header.
1616+1717+the two primary motivations are
1818+1919+1. bounded-resource streaming readers.
2020+2121+ atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly.
2222+2323+ you also cannot reliably identify MST node blocks and record blocks in an atproto CAR without walking the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.
2424+2525+ since STAR is strictly stream-ordered, there is no node/block ambiguity, and extra garbage is not allowed. CIDs commit the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)
2626+2727+2. reduced archive size.
2828+2929+ CIDs are large, compression-unfriendly, and redundant if you have access to the CID's actual content.
3030+3131+ for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDS, almost 20%.
3232+3333+ from a few more samples of various sizes from real atproto repos:
3434+3535+ ```
3636+ CIDs CAR potential savings
3737+ 0.53KB / 3.4KB = 16%
3838+ 23.2KB / 279KB = 8%
3939+ 0.9MB / 5.0MB = 18%
4040+ 25.9MB / 128MB = 20%
4141+ 94.8MB / 449MB = 21%
4242+ ```
4343+4444+ These calculations don't include the 4-bytes-per-CID prefix size, since that overhead will already typically be eliminated by compression.
4545+4646+ STARs don't include content hashes before content blocks, reducing the number of CIDs immediately by half. They omit record link CIDs from layer-0 MST nodes as well, for an **overall reduction of CIDs by 80%**.
4747+4848+ Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.
4949+5050+5151+### scope
5252+5353+STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.
5454+5555+- It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDS referencing missing nodes at the boundaries cannot be eliminated)
5656+5757+ - It's desirable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
5858+5959+- It *might* be interesting to allow arbitrary sparse trees. Not sure yet.
6060+6161+- It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.
6262+6363+STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).
6464+6565+6666+#### CARs
6767+6868+The IETF draft for AT repositories (referred to as "repo spec" in this document) can be found at https://github.com/bluesky-social/ietf-drafts/blob/main/draft-holmgren-at-repository.md or https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html
6969+7070+7171+## format
7272+7373+```
7474+|--------- header ---------| |---------------- optional tree ----------------|
7575+[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …
7676+```
7777+7878+### header
7979+8080+- `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`).
8181+8282+- `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format.
8383+8484+- `len` (varint): The length of the following atproto commit object in bytes.
8585+8686+- `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
8787+8888+ - `did` (string, required): same as repo spec. may become optional for subtree archives, but it's nice to be able to inspect, for now.
8989+ - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
9090+ - `data` (hash link, **optional**): CID of the first (root) node in the MST. an empty tree is represented by the absence of this key.
9191+ - `rev` (string, required): same as repo spec (may become optional)
9292+ - `prev` (hash link, **optional**): same as repo spec, but optional instead of nullable. only included for lossless CAR round-tripping.
9393+ - `sig` (byte array, **optional**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be omitted.
9494+9595+#### verifying a commit
9696+9797+The `commit` object can be converted to a repo-spec compliant commit:
9898+9999+ - if `data` is absent, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
100100+ - follow steps from the repo spec to resolve the identity and verify the signature.
101101+102102+When `sig` is absent (typically for archived MST sub-trees), this STAR _cannot_ be converted to a repo-spec compliant CAR. however, if it's a subtree, it can be stitched back into a full MST that can be converted back to a compliant CAR, provided the complementary sparse STAR containing it is available.
103103+104104+105105+### optional tree
106106+107107+There are two kinds of blocks in the tree: `node` blocks and `record` blocks. If the tree is present, the first block must always be a `node` (the MST root node), and blocks follow in depth-first tree traversal order. Blocks may arbitrarily be omitted (for STAR slices, sparse trees, and `key -> CID`-only archives). `node`s have a flag for each link to indicate its upcoming presence (or absence) in the archive.
108108+109109+110110+#### `node`
111111+112112+```
113113+|----- node -----|
114114+[ len | mst node ]
115115+```
116116+117117+- `len` (varint): the length of the following CBOR block, in bytes.
118118+119119+- `mst node` (DAG-CBOR): object with the following schema
120120+ - `l` (hash link, optional): reference to a subtree at a lower depth containing only keys to the left of this node. if absent, there is no left subtree.
121121+ - `L` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `l` is not present.
122122+ - `e` (array, required): ordered array of entry objects with length of at least one, each containing:
123123+ - `p` (integer, required): number of bytes shared with the previous entry (TODO [key compression](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-mst-node-schema) actually)
124124+ - `k` (byte string, required): key suffix remaining
125125+ - `v` (hash link, optional): reference to the record data for this key.
126126+ - for MST nodes at depth=0:
127127+ - `v` must be omitted when the record is included in the archive
128128+ - `v` must not be omitted if the record is not included
129129+ - for MST nodes at depth>0:
130130+ - `v` is required (`V` signifies if it's in the archive)
131131+ - `V` (bool, optional): "archived": if `true`, the record is contained in this archive. must not be present when `v` is not present.
132132+ - `t` (hash link, optional): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. if absent, there is no subtree.
133133+ - `T` (bool, optional): "archived": if `true`, the subtree is contained in this archive. must not be present when `t` is not present.
134134+135135+for now see the atproto repo spec for key compression (`p` and `k`)
136136+137137+#### `record`
138138+139139+An atproto record. Its CID can be computed over the bytes of its `block` (see below).
140140+141141+ ```
142142+ |--- record --|
143143+ [ len | block ]
144144+ ```
145145+146146+ - `len` (varint): the length of the following binary record block in bytes.
147147+148148+ - `block` (bytes): the raw bytes of the (DAG-CBOR) record.
149149+150150+151151+### verifying the whole tree
152152+153153+Each MST node, including the root, must be verified to match its expected CID. To compute the CID of MST nodes:
154154+155155+For nodes at depth>0 (all child CID links are included): Convert the MST node into repo-spec format, compute its CID as the sha256 hash of its DAG-CBOR serialization.
156156+157157+For nodes at depth=0 (record CID links excluded unless omitted from archive): read all included linked records from the node into a buffer and compute their CIDs (they will immediately follow in the STAR since a depth=0 node cannot have any other children). With the record CIDs available, the MST node can be converted into repo-spec format, and its CID calculated as with depth>0 nodes.
158158+159159+To compute the CID of included records:
160160+161161+The required bytes for the CID calculation are the exact included record bytes (sha256 over them).
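As an illustration of the record CID step, here is a minimal Python sketch. It assumes the usual atproto CID shape (CIDv1, `dag-cbor` codec `0x71`, sha2-256 multihash) and renders the base32 string form for comparison; the function name `cid_for_block` is hypothetical and not part of any STAR spec.

```python
import base64
import hashlib


def cid_for_block(block: bytes) -> str:
    """Compute the string CID of a raw DAG-CBOR block.

    Assumed layout: CIDv1 (0x01), dag-cbor codec (0x71),
    sha2-256 multihash header (0x12, length 0x20), then the digest.
    """
    digest = hashlib.sha256(block).digest()
    raw = bytes([0x01, 0x71, 0x12, 0x20]) + digest
    # Multibase base32 (lowercase, unpadded) uses the "b" prefix.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


# Hash the exact bytes read from the archive, never a re-encoding of them.
record_bytes = b"\xa0"  # an empty CBOR map, standing in for a real record
print(cid_for_block(record_bytes))
```

The same function would apply to an MST node once it has been converted back into repo-spec format and serialized as DAG-CBOR.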
162162+163163+164164+## open questions
165165+166166+### how far to go with implicit CIDs?
167167+168168+there is a trade-off between going fully implicit on CIDs (possible as long as the content is present to compute the CID) vs fully explicit in MST node links (CIDS still omitted before the content blocks themselves for 50% fewer vs CAR).
169169+170170+the problem with omitting *all* possible CIDS is that you then cannot verify the root node until you finish walking the entire MST. a consumer might have already written data somewhere only to find out they need to undo it all!
171171+172172+as a compromise we're only omitting CIDs from MST nodes for **layer 0 record links**:
173173+174174+- 75% reduction in record CIDs written to the STAR; 60% overall CID reduction including subtree links
175175+- MST nodes contain four records on average; a verifying streamer can buffer this small number of records before emitting, so it never emits content that is later found to be unverifiable.
176176+- no special casing required for MST subtree links, their CIDs are always included
177177+178178+this is probably the right balance: considering the 50% initial reduction compared to CARs from dropping the hash prefix in front of blocks, the all-in CID reduction is **80%**.
179179+180180+but we could take it one step higher: have layer1 nodes do implicit CIDs for subtrees and records:
181181+182182+- 89% overall reduction in CIDs in the STAR
183183+- a verifying streamer needs to buffer 16 records on average. an attacker gets a 5x space amplification benefit if trying to generate extra-wide bufferable bottom-level tree nodes.
184184+- layer-dependent special-casing required for CID links as well as record links, across two MST layers
185185+186186+with the 50% initial reduction, this would be **95% total CID reduction**
187187+188188+CIDs make up around 20% of uncompressed CAR file sizes. the first approach gets that down to 4%; second 1%. however, CIDs are uncompressible, so it's probably worth measuring the real effect of both approaches on large repos post-compression before completely committing one way or another.
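The percentages above can be reproduced with a few lines of arithmetic. This sketch assumes, as the text implies, that half of a CAR's CIDs are block prefixes (always dropped) and the other half are MST links, to which the 60% and 89% figures apply; the function name is illustrative only.

```python
def total_cid_reduction(link_reduction: float, prefix_share: float = 0.5) -> float:
    """Fraction of a CAR's CIDs that a STAR avoids writing.

    prefix_share: share of CIDs that are block prefixes (all dropped).
    link_reduction: share of the remaining MST link CIDs that become implicit.
    """
    return prefix_share + (1 - prefix_share) * link_reduction


layer0 = total_cid_reduction(0.60)  # about 0.80
layer1 = total_cid_reduction(0.89)  # about 0.945, rounded to 95% in the text

cid_share_of_car = 0.20  # CIDs as a share of uncompressed CAR size
print(cid_share_of_car * (1 - layer0))  # roughly 4% of the original size
print(cid_share_of_car * (1 - layer1))  # roughly 1% of the original size
```

These are pre-compression figures; they say nothing about the post-compression effect that still needs measuring.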
189189+
+50-58
star-lite/readme.md
···2233**ST**reaming **AR**chive repository format (extra light version)
4455-A stricter, simpler, still verifiable, more compressible alternative to [CAR](https://ipld.io/specs/transport/car/carv1/#format-description).
55+A stricter, simpler, still verifiable, highly compressible alternative to [CAR][car].
6677-STAR-lite describes both the actual binary encoding, and its memory-bounded algorithm to convert any sequence of in-order key-record pairs into stream-ordered CAR files.
77+STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files.
8899-This efficient conversion makes STAR-lite suitable as an efficient network transport format or for long-term archiving and backup, without sacrificing interoperability.
99+Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.
101011111212-### compared to CARs:
1212+### Compared to CARs:
13131414-- All MST node blocks and all MST CIDs are omitted, eliminating the least-compressible content.
1414+- No MST node blocks or CIDs, eliminating the least-compressible content.
1515- Strict content ordering, deterministic encoding.
1616-- Bounded-memory conversion back to stream-ordered CAR.
1616+- Bounded-memory conversion to stream-ordered CAR.
17171818STAR-lite files shine when zstd-compressed.
191920202121-### compared to STAR-L0 and STAR-L1:
2121+### Compared to [STAR-L0 and STAR-L1][ln]:
22222323- Smallest archive format (with zstd compression)
2424- Content verification requires a complete scan of all content
···2626- Disk spilling required for memory-bounded streaming of large archives
272728282929-## format
2929+## Format
3030+3131+TODO: subtree -- probably just set header cbor len to 0 to omit the commit? but we still probably want the `data` root hash...
30323133STAR-lite is just a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header.
3234···4345| cbor | cbor bytes |
444645474646-### magic
4848+### Magic
47494850Three bytes: `0x2A 0x6C 0x00`, ASCII for `*l\0`: "star", "**l**ite", version 0.
495150525151-### header len + cbor: Commit
5353+### Header len + cbor: Commit
52545355A length-prefixed CBOR blob containing an atproto signed Commit object. The CBOR format is the same as the atproto repo spec describes. The Commit may be ignored, but for archive content verification, its `data` field must be parsed at minimum.
54565555-TODO: specify maximum commit size
5757+A parser may reject a commit if its `varint` length is greater than 4KiB (TODO: we could make this really tight, but 2KiB for the did:web and then 2K for the rest should prevent large reads and not be limiting)
56585759TODO: like the other STAR formats we should actually define a slightly-modified commit object, specifically with a nullable data cid for empty repos. otherwise, we should include the magic CID of an empty atproto MST node's hash like you get with CARs.
586059616060-### data: keys and records
6262+### Data: keys and records
61636264zero or more records until EOF. Each is:
63656466| field | type |
6567| ----------- | --------------------------------------- |
6666-| key len | varint (TODO: min and max) |
6868+| key len | varint, max: 830 |
6769| key str | utf-8 bytes, exactly `key len` length |
6868-| record len | varint, max: 1,048,576 (1MiB) |
7070+| record len | varint, max: [1,048,576][reclen] |
6971| record cbor | cbor bytes, exactly `record len` length |
70727373+The maximum key length comes from the combined limits of the `<collection>/<rkey>` syntax for atproto repo paths: 317 for the [collection][nsid] + 1 for the `/` slash + 512 for the [rkey][rkey].
71747272-### varints
7575+The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto [*recommended data limits*][reclen].
73767474-Unsigned LEB128 / multiformats unsigned-varint: 7 bits per byte, MSB is the continuation flag, little-endian byte order.
75777676-TODO: say we defer to that spec -- which one specifically? (match to CAR's)
7878+### Varints
77797878-TODO: do we need to resolve any ambiguities from the spec? eg., that encoders must use the minimum number of bytes (no leading `0x80` padding bytes)? (we want to be deterministic). Also any security notes, like most number of bytes before bailing on the varint read?
8080+Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128]).
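
A minimal sketch of reading and writing these length prefixes. The function names are hypothetical, and two behaviours are assumptions borrowed from the multiformats unsigned-varint spec rather than anything STAR has settled: a 9-byte cap, and rejecting non-minimal encodings so that output stays deterministic.

```python
import io

# Assumption: 9-byte cap, as in multiformats unsigned-varint.
MAX_VARINT_BYTES = 9


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int with the minimum number of bytes."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits first (little-endian groups)
        n >>= 7
        if n:
            out.append(byte | 0x80)  # MSB set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(stream) -> int:
    """Read one varint from a binary stream, rejecting padded encodings."""
    value = 0
    for i in range(MAX_VARINT_BYTES):
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated varint")
        byte = raw[0]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            # A trailing zero byte after the first means the encoder padded.
            if byte == 0 and i > 0:
                raise ValueError("non-minimal varint")
            return value
    raise ValueError("varint too long")
```

For example, `encode_varint(300)` produces the two bytes `ac 02`, and `decode_varint(io.BytesIO(b"\x80\x00"))` raises because `0` was padded to two bytes.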
798180828181-### rules
8383+### Rules
82848385- keys must be in strict lexicographic byte order.
8486- duplicate keys are not allowed.
8587- keys must be valid atproto repo paths: the format specifies utf-8, but in practice the required `<collection>/<rkey>` repo path format currently restricts characters to a small subset of ASCII.
8686-- records must be encoded as [DRISL](https://dasl.ing/drisl.html), the deterministic subset of CBOR used by atproto.
8888+- records must be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto.
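
The size limits and ordering rules can be enforced while streaming, keeping only the previous key in memory. A rough sketch, assuming entries have already been split into `(key, record)` byte pairs; `check_entries` is a hypothetical helper, and it only checks the overall shape of the repo path (exactly one `/`), not full NSID, rkey, or DRISL validity.

```python
MAX_KEY_LEN = 830            # 317 (collection) + 1 ("/") + 512 (rkey)
MAX_RECORD_LEN = 1_048_576   # 1MiB recommended record limit


def check_entries(entries):
    """Yield (key, record) pairs unchanged, raising on the first rule violation."""
    prev = None
    for key, record in entries:
        if not 0 < len(key) <= MAX_KEY_LEN:
            raise ValueError("key length out of range")
        if len(record) > MAX_RECORD_LEN:
            raise ValueError("record too large")
        # Python compares bytes lexicographically by byte value, so a strict
        # increase covers both the ordering rule and the no-duplicates rule.
        if prev is not None and key <= prev:
            raise ValueError("key out of order or duplicated")
        # Shape check only: "<collection>/<rkey>" with a single separator.
        collection, sep, rkey = key.partition(b"/")
        if not sep or not collection or not rkey or b"/" in rkey:
            raise ValueError("key is not a <collection>/<rkey> path")
        prev = key
        yield key, record
```

Because it is a generator, it can wrap whatever produces the entries and sit in front of verification or conversion without buffering.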
878988908989-## efficient MST recovery
9191+## Efficient MST-aware operations
90929191-*for archive verification and conversion to CAR*
9393+While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, this usually carries high overhead (memory or i/o) for the most frequent operations on a STAR-lite file: verification, and conversion to stream-ordered CAR.
92949393-The simple way to verify an archive is to insert each `(key, record)` pair into an atproto MST builder library to reconstruct the full MST. Then, assert that the MST's root `CID` matches the CID in the Commit's `data` field.
94959595-For large repositories, building the MST this way may require significant memory, or significant storage I/O. STAR-lite includes a bounded-memory, efficient disk-spilling algorithm to recover the MST for verification or conversion to other atproto formats.
96969797-9898-### archive verification
9999-100100-Verification uses the same MST recover technique as CAR conversion (below), but evicts subtrees by simply dropping them, rather than spilling to disk, since only the root MST node's CID is required for verification.
9797+By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with minimal overhead.
1019810299103103-### conversion to CAR
100100+### Conversion to CAR
104101105102Stream-ordered CARs (in "preorder traversal" block order) are a depth-first walk over the Merkle Search Tree, and keys encountered during a depth-first MST walk are in strict lexicographic order.
106103···115112116113What we can do is write serialized segments of the final CAR to disk temporarily as the entire MST is reconstructed, to stay within a strict memory budget. Streaming out the final stream-ordered CAR can use `copy_file_range` or equivalent to splice them in at the right places.
117114115115+```
116116+TODO: actual algorithm pseudocode
117117+```
118118119119-#### empty repos
120119121121-a repo with zero keys is allowed: its commit object must use the magic CID `(TODO)`, which corresponds to the CID of a single empty atproto MST node - how atproto CARs represent empty repos.
120120+### Archive verification
122121122122+Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded.
123123124124-### algorithm
124124+Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs. The encoded node bytes (and referenced record CIDs) are discarded, since we only need the node CID to help materialize its parent MST node.
125125126126-```
127127-read magic
128128-read commit
129129-init mst
126126+The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object.
130127131131-// TODO: fix this up to eagerly serialize subtrees
132132-133133-for (key, record) in star_lite_entries:
134134- mst.insert(key, record)
135135- if mst.memory_usage() > limit:
136136- // find the leftmost subtree whose rightmost key < `key` (its structure is now frozen)
137137- subtree := mst.evict_leftmost_finalized_subtree()
138138- root_cid := subtree.root_cid()
139139- segment_path := temp_dir.create_segment(root_cid)
140140- subtree.write_blocks_in_car_order(segment_path)
141141- // replace the in-memory subtree pointer with a marker
142142- mst.replace_with_marker(root_cid, segment_path)
128128+TODO: mention that this doesn't check the commit's signature but users can do that on the commit object first, directly, if they want. verification just asserts that the archive content is consistent with the commit object's `data` commitment.
143129144144-// EOF
145145-root_cid := mst.finalize()
146130147147-// stream out
148148-init car := AtprotoCar(commit)
149149-for block_or_marker in mst.depth_first_walk():
150150- match block_or_marker:
151151- Block(cid, bytes) => car.write_block(cid, bytes)
152152- Marker(segment) => car.splice_file(segment.path)
131131+```
132132+TODO: actual algorithm pseudocode
153133```
154134155155-Memory is bounded because there is a practical (low) limit to MST height.
156135136136+#### Empty repos
157137138138+A repo with zero keys is allowed: its commit object must use the magic CID `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4` (TODO: double-check this), which corresponds to the CID of a single empty atproto MST node. This is how atproto CARs represent empty repos.
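
One way to double-check the magic CID is to recompute it from the empty node's bytes. The sketch below makes two assumptions that should be confirmed against the atproto repository spec: that an empty MST node is the map `{"e": [], "l": null}`, and that its CID is a CIDv1 with the `dag-cbor` codec and a sha2-256 multihash, rendered as lowercase base32. The name `cid_for_dag_cbor` is hypothetical.

```python
import base64
import hashlib


def cid_for_dag_cbor(block: bytes) -> str:
    """CIDv1 string for a DAG-CBOR block hashed with sha2-256."""
    digest = hashlib.sha256(block).digest()
    # 0x01 = CIDv1, 0x71 = dag-cbor, 0x12 = sha2-256, 0x20 = 32-byte digest
    raw = bytes([0x01, 0x71, 0x12, 0x20]) + digest
    # "b" is the multibase prefix for lowercase base32 without padding.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


# Assumed canonical encoding of {"e": [], "l": null}:
#   a2        map with 2 entries
#   61 65 80  "e" -> empty array
#   61 6c f6  "l" -> null
EMPTY_MST_NODE = bytes.fromhex("a2616580616cf6")

if __name__ == "__main__":
    print(cid_for_dag_cbor(EMPTY_MST_NODE))
```

The same function applies to record blocks during verification, since each record's CID is derived from its encoded bytes alone.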
158139159159-### conversion from CAR
140140+141141+### Conversion from CAR
160142161143TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it *is* stream-ordered, any streaming walker will work and it's pretty simple to write out.
162144163145Note: it is **not possible** to know if an atproto CAR is stream-ordered except by either knowing that it was encoded that way in advance, or by reading the **entire** archive first to verify.
146146+147147+148148+[car]: https://dasl.ing/car.html
149149+[drisl]: https://dasl.ing/drisl.html
150150+[ln]: ../star-lN/
151151+[reclen]: https://atproto.com/guides/data-validation#recommended-data-limits
152152+[varint]: https://github.com/multiformats/unsigned-varint
153153+[leb128]: https://en.wikipedia.org/wiki/LEB128
154154+[nsid]: https://atproto.com/specs/nsid
155155+[rkey]: https://atproto.com/specs/record-key