# STAR: Streaming Tree ARchive format

_status: just thinking about it_

- convertible to/from CAR (lossless, except for any out-of-tree blocks in the CAR)
- extra garbage strictly not allowed (unlike CAR)
- canonical (unlike CAR)
- strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
- the header is simply the commit, always including the root CID, followed by the serialized tree
- all CID links omitted for blocks included in the STAR: linked blocks follow in a deterministic order
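
As a rough illustration of how strict depth-first ordering and omitted links could work together, here is a toy sketch. It is not the STAR format (none is specified yet): the `("node", child_count)` / `("record", bytes)` item shape, the hashing scheme, and the names `record_cid`, `walk`, `root_of` and `stream_verify` are all made up for this example, and real MST nodes also carry keys. The point is only that a reader can recompute every omitted link while holding one stack entry per tree level.

```python
import hashlib
from typing import Iterable, Iterator, Tuple, Union

# Hypothetical stream item: ("node", child_count) or ("record", raw_bytes).
Item = Tuple[str, Union[int, bytes]]


def record_cid(data: bytes) -> bytes:
    """Stand-in for a record CID: a plain sha-256 over tagged bytes."""
    return hashlib.sha256(b"r" + data).digest()


def walk(tree) -> Iterator[Item]:
    """Emit a nested list (node) / bytes (record) tree depth-first, without links."""
    if isinstance(tree, bytes):
        yield ("record", tree)
    else:
        yield ("node", len(tree))
        for child in tree:
            yield from walk(child)


def root_of(tree) -> bytes:
    """Reference (non-streaming) hash of the whole tree, used as the header's root."""
    if isinstance(tree, bytes):
        return record_cid(tree)
    hasher = hashlib.sha256(b"n")
    for child in tree:
        hasher.update(root_of(child))
    return hasher.digest()


def stream_verify(items: Iterable[Item], expected_root: bytes) -> int:
    """Verify a depth-first stream against expected_root.

    Keeps only a stack of open nodes, so memory grows with tree depth rather
    than with the number of blocks. Returns the maximum stack depth reached.
    """
    stack = []  # one [children_remaining, hasher] entry per open node
    max_depth = 0
    root = None
    for kind, value in items:
        if root is not None:
            raise ValueError("extra item after the tree ended")
        if kind == "node":
            stack.append([value, hashlib.sha256(b"n")])
            max_depth = max(max_depth, len(stack))
            finished = None
        elif kind == "record":
            finished = record_cid(value)
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
        # Fold every completed subtree into its parent.
        while True:
            if finished is None:
                if stack and stack[-1][0] == 0:
                    finished = stack.pop()[1].digest()
                else:
                    break
            if not stack:
                root = finished
                break
            stack[-1][0] -= 1
            stack[-1][1].update(finished)
            finished = None
    if root != expected_root:
        raise ValueError("root mismatch or truncated stream")
    return max_depth


tree = [b"a", [b"b", b"c"], b"d"]
print("max stack depth:", stream_verify(walk(tree), root_of(tree)))
```

In this toy model an item arriving after the tree is complete is rejected as soon as it is read, so the reader never has to buffer it.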

the two primary motivations are

1. bounded-resource streaming readers.

   readers of atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly.

   you also cannot reliably tell MST node blocks from record blocks in an atproto CAR without walking the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.

   since STAR is strictly stream-ordered, there is no node/record ambiguity, and extra garbage is not allowed. CIDs commit to the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)

2. reduced archive size.

   CIDs are large, compression-unfriendly, and redundant if you are including the CID's actual content.

   for example, my atproto repo is around 5.0MB and contains 14,673 blocks, each prefixed with its CID, plus 14,675 CID links in its MST. Each CID hash is 32 bytes, so `(14,673 + 14,675) * 32 = 939,136 bytes ≈ 0.9MB` just for the CIDs, almost 20%.

   from a few more samples of various sizes from real atproto repos:

   ```
     CIDs     CAR   potential savings
   0.53KB / 3.4KB = 16%
   23.2KB / 279KB =  8%
    0.9MB / 5.0MB = 18%
   25.9MB / 128MB = 20%
   94.8MB / 449MB = 21%
   ```

   These calculations don't include the 4-byte prefix on each CID, since compression will typically eliminate that overhead already.

   STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.

   Since eliminating CIDs removes incompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
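
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `cid_overhead` is made up here, and the 5,000,000-byte CAR size is an assumption standing in for the rounded "around 5.0MB" figure.

```python
# Only the 32-byte hash of each CID is counted; the 4-byte prefix is left out,
# matching the calculation in the text.
CID_HASH_BYTES = 32


def cid_overhead(block_count: int, link_count: int, car_bytes: int):
    """Return (bytes spent on CID hashes, their share of the CAR size)."""
    cid_bytes = (block_count + link_count) * CID_HASH_BYTES
    return cid_bytes, cid_bytes / car_bytes


# Block and link counts are from the example repo; the exact CAR size is assumed.
cid_bytes, share = cid_overhead(14_673, 14_675, 5_000_000)
print(f"{cid_bytes:,} bytes of CID hashes, {share:.1%} of the CAR")
```

With the rounded 5.0MB size this comes out a little under 19%, consistent with the "almost 20%" estimate.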


### scope

STAR is specialized for atproto MST storage, and best suited for serializing complete trees.

- It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDs referencing missing nodes at the boundaries cannot be eliminated)

- It's desirable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)