STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.

hello

phil 5c5a3cf1

+54
readme.md
# STAR: Streaming Tree ARchive format

_status: just thinking about it_


- convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
- extra garbage strictly not allowed (unlike CAR)
- canonical (unlike CAR)
- strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
- the header simply is the commit, always including the root CID, followed by the serialized tree.
- all CID links omitted for blocks included in the STAR: linked blocks follow in a deterministic order

the two primary motivations are

1. bounded-resource streaming readers.

readers of atproto MSTs in CARs *have* to buffer and retain all record blocks, and typically buffer most MST node blocks, just to traverse the tree. even if a CAR appears to be in stream-friendly block ordering, you can only safely discard record blocks if you *know for sure* it's actually stream-friendly.

you also cannot reliably tell MST node blocks from record blocks in an atproto CAR without walking the tree, so you cannot discard *any* potentially garbage blocks from the buffered data before walking. A malicious PDS can serve a cheap-to-generate endless CAR stream of garbage blocks, and you just have to keep buffering them.

since STAR is strictly stream-ordered, there is no ambiguity between node blocks and record blocks, and extra garbage is not allowed. CIDs commit to the contents of subtrees and records, and since reading is the same as walking the tree, it *might* be possible to reject some kinds of malicious block-generation attacks early. (haven't thought this through)

2. reduced archive size.

CIDs are large, compression-unfriendly, and redundant if you are including the CID's actual content.

for example, my atproto repo is around 5.0MB and contains 14,673 blocks with a CID prefix plus 14,675 CID links in its MST. Each CID's hash is 32 bytes, so `(14,673 + 14,675) * 32 = 0.9MB` just for the CIDs, almost 20%.

from a few more samples of various sizes from real atproto repos:

```
CIDs     CAR      potential savings
0.53KB / 3.4KB  = 16%
23.2KB / 279KB  =  8%
0.9MB  / 5.0MB  = 18%
25.9MB / 128MB  = 20%
94.8MB / 449MB  = 21%
```

These calculations don't include the 4-byte-per-CID prefix, since that overhead will typically already be eliminated by compression.

STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.

Since eliminating CIDs removes incompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.


### scope

STAR is specialized for atproto MST storage, and best-suited for serializing complete trees.

- It should work for "CAR slices" -- contiguous narrowed partial trees that may omit content before and/or after a specific key. (CIDs referencing missing nodes at the boundaries cannot be eliminated)

- It's desirable to be able to archive complete *subtrees*, so enforcing a well-formed atproto commit as the header might not be sufficient on its own. (subtrees could be stored as CAR slices so this may be unnecessary)
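
The CID overhead estimate from the "reduced archive size" motivation can be re-derived in a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, "MB" is assumed to mean 10^6 bytes, and the 4-byte CID prefix is left out, as in the figures above.

```python
DIGEST_BYTES = 32  # hash bytes per CID; the 4-byte CID prefix is excluded, as in the text


def cid_overhead_bytes(block_count: int, link_count: int) -> int:
    """Bytes spent on CID hashes: one per block prefix plus one per MST link."""
    return (block_count + link_count) * DIGEST_BYTES


def savings_percent(cid_bytes: float, car_bytes: float) -> int:
    """Potential savings as a whole-number percentage of the CAR size."""
    return round(100 * cid_bytes / car_bytes)


# The example repo from the text: 14,673 blocks and 14,675 CID links.
# Assumption: "5.0MB" is taken as 5.0e6 bytes.
overhead = cid_overhead_bytes(14_673, 14_675)
print(overhead)                          # 939136 bytes, about 0.9MB
print(savings_percent(overhead, 5.0e6))  # 19, i.e. "almost 20%"

# Re-derive the sample table from its (CID size, CAR size) pairs, all in KB.
samples_kb = [(0.53, 3.4), (23.2, 279), (0.9e3, 5.0e3), (25.9e3, 128e3), (94.8e3, 449e3)]
table = [savings_percent(cids, car) for cids, car in samples_kb]
print(table)                             # [16, 8, 18, 20, 21]
```

The example repo comes out at 19% from the raw byte count but 18% in the table, because the table divides the already-rounded 0.9MB by 5.0MB.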