···

 _status: just thinking about it_

+STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.

 - convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
 - extra garbage strictly not allowed (unlike CAR)
 - canonical (unlike CAR)
 - strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
-- the header simply is the commit, always including the root CID, followed by the serialized tree.
-- all CID links omitted for blocks included in the STAR: linked blocks follow in a deterministic order
+- the header simply is the commit, followed by the serialized tree.
+- CID links are implicit for blocks included in the STAR: linked blocks follow in a deterministic order, and their CIDs are recomputed from their contents.
+- fewer edge cases: empty MST nodes strictly disallowed: `commit` omits `data` for an empty tree

 the two primary motivations are

···

  STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.

- Since eliminating CIDs removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
+ Since omitting CIDs by making them implicit removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
+
+ Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.


 ### scope
···
 - It *might* be interesting to allow arbitrary sparse trees. Not sure yet.

 - It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.
+
+The STAR format does not aim to provide efficient random access to nodes, or to support other tree iteration patterns efficiently. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).
+
+
+## problems
+
+It might be difficult to convert a STAR to stream-friendly (preorder traversal) CAR format, since the CID of each MST node block can only be computed after visiting all of its children.
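
To make the ordering problem concrete, here is a toy sketch. It is not the real MST node serialization: `Node`, `preorder_with_ids`, and the "hash the payload plus the child hashes" rule are stand-ins invented for illustration. It only shows that a parent's identifier is unknown until every descendant has been hashed, so a preorder writer has to buffer (or re-read) the whole subtree first.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a payload plus ordered children."""
    payload: bytes
    children: list = field(default_factory=list)


def preorder_with_ids(root):
    """Return (identifier, payload) pairs in preorder.

    Identifiers depend on the children, so they are computed in a
    postorder pass first; only then can the preorder output begin.
    """
    ids = {}

    def hash_subtree(node):
        # Postorder: every child is hashed before its parent.
        h = hashlib.sha256(node.payload)
        for child in node.children:
            h.update(hash_subtree(child))
        ids[id(node)] = h.digest()
        return ids[id(node)]

    hash_subtree(root)

    out = []

    def emit(node):
        # Preorder: the parent is written before its children.
        out.append((ids[id(node)], node.payload))
        for child in node.children:
            emit(child)

    emit(root)
    return out
```

The `ids` table is the cost being described: for a real repository it would hold one entry per MST node, or the converter would need a second pass over the STAR.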
+
+
+## format
+
+```
+|--------- header ---------| |---------------- optional tree ----------------|
+[ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …
+```
+
+### header
+
+- `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`).
+
+- `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format.
+
+- `len` (varint): The length of the following atproto commit object in bytes.
+
+- `commit`: An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
+
+ - `did` (string, nullable): same as repo spec
+ - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
+ - `data` (hash link, **nullable**): CID of the first (root) node in the MST. an empty tree is represented by the presence of a `null` here
+ - `rev` (string, required): same as repo spec
+ - `prev` (hash link, nullable): same as repo spec
+ - `sig` (byte array, **nullable**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be `null`.
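
A minimal sketch of reading the header fields listed above, assuming the varints are unsigned LEB128. The function names `read_varint` and `read_star_header` are made up for this example, and the commit is returned as raw bytes: decoding the DAG-CBOR would need a third-party library, which is left out here.

```python
import io


def read_varint(stream):
    """Read one unsigned LEB128 varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of the varint
            return result
        shift += 7


def read_star_header(stream):
    """Return (version, commit_bytes) from the start of a STAR stream."""
    if stream.read(1) != b"*":
        raise ValueError("not a STAR: first byte must be 0x2A ('*')")
    version = read_varint(stream)
    if version != 1:
        raise ValueError(f"unsupported STAR version: {version}")
    length = read_varint(stream)
    commit = stream.read(length)
    if len(commit) != length:
        raise EOFError("stream ended inside the commit object")
    return version, commit
```

For example, `read_star_header(io.BytesIO(b"*\x01\x01\xa0"))` reads a version of `1` and a one-byte placeholder commit (`0xA0`, an empty CBOR map, which is not a valid commit and is used here only to exercise the framing).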
+
+#### verifying a commit
+
+The `commit` object can be converted to a repo-spec compliant commit:
+
+ - if `data` is null, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
+ - follow steps from the repo spec to resolve the identity and verify the signature.
+
+When `sig` is null (typically for archived MST sub-trees), this STAR cannot be converted to a repo-spec compliant commit.
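
The conversion rules above could look roughly like this. It assumes the commit has already been decoded into a plain `dict` with CIDs held as strings (a simplification for illustration; a real decoder would yield CID objects), and `to_repo_commit` is a hypothetical name. Identity resolution and signature verification are not shown.

```python
# Empty-MST CID exactly as quoted in this document.
EMPTY_MST_CID = "bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4"


def to_repo_commit(star_commit):
    """Turn a decoded STAR commit into a repo-spec style commit dict."""
    if star_commit.get("sig") is None:
        # Unsigned sub-tree archives have no repo-spec equivalent.
        raise ValueError("STAR commit has no signature; cannot convert")
    repo_commit = dict(star_commit)  # shallow copy, input is left untouched
    if repo_commit.get("data") is None:
        # An empty tree is stored as null; the repo spec wants a real CID.
        repo_commit["data"] = EMPTY_MST_CID
    return repo_commit
```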
+
+
+### optional tree
+
+- `node`: TODO: we need a new node format. It must be convertible back to a repo-spec style node.
+
+- `record`: The atproto record. Its CID can be computed over the bytes of its `block` (see below).
+
+### record
+
+```
+|--- record --|
+[ len | block ]
+```
+
+- `len` (varint): the length of the following binary record block in bytes.
+
+- `block` (bytes): the raw bytes of the (DAG-CBOR) record
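
A sketch of reading one `record` and recomputing its CID from the block bytes. It assumes unsigned LEB128 for `len`, and a CIDv1 with the `dag-cbor` codec (`0x71`) and a SHA-256 multihash, rendered as lowercase base32 with the `b` multibase prefix. `read_record` and `cid_for_block` are illustrative names, not part of any existing library.

```python
import base64
import hashlib
import io


def read_varint(stream):
    """Read one unsigned LEB128 varint from a binary stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7


def read_record(stream):
    """Read a `[ len | block ]` record and return the raw block bytes."""
    length = read_varint(stream)
    block = stream.read(length)
    if len(block) != length:
        raise EOFError("stream ended inside a record block")
    return block


def cid_for_block(block):
    """Recompute the implicit CID (as a string) from raw block bytes."""
    digest = hashlib.sha256(block).digest()
    # CIDv1 (0x01), dag-cbor (0x71), sha2-256 (0x12), 32-byte digest (0x20)
    raw = bytes([0x01, 0x71, 0x12, 0x20]) + digest
    encoded = base64.b32encode(raw).decode("ascii").lower().rstrip("=")
    return "b" + encoded
```

Because the CID is derived from the bytes as stored, a reader must hash `block` unchanged: decoding and re-encoding the CBOR first could alter the bytes and therefore the CID.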
+
+
+### order of `node`s and `record`s
+
+The MST **must** be stored in key order, which for an MST is a depth-first walk across the tree.
+
+For each *included* child of a `node` (indicated by ?? in its entries. null for cid?), todo blah blah
+
+*excluded* children (indicated by a CID link being present in entries) are not included in the series of nodes and records.
+
+
+#### key compression
+
+TODO
+- (but basically do what the repo-spec does but apply it across the whole stream)
+- (but also actually run some tests and measure how much this decreases file sizes post-normal-file-compression)