STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.

some format ideas

phil 94f45ce1 474181b8

+82 -3
readme.md
···

  _status: just thinking about it_

+ STAR is an archival format for Merkle Search Trees (MSTs) as implemented in atproto. It offers efficient key-ordered streaming of repository contents and reduced archive size compared to CARs.

  - convertible to/from CAR (lossless except any out-of-tree blocks from a CAR)
  - extra garbage strictly not allowed (unlike CAR)
  - canonical (unlike CAR)
  - strict depth-first (MST key-ordered) node and record ordering -> efficient reading (unlike CAR)
- - the header simply is the commit, always including the root CID, followed by the serialized tree.
- - all CID links omitted for blocks included in the STAR: linked blocks follow in a deterministic order
+ - the header simply is the commit, followed by the serialized tree.
+ - CID links are implicit for blocks included in the STAR: linked blocks follow in a deterministic order, and their CIDs are recomputed from their contents.
+ - fewer edge cases: empty MST nodes strictly disallowed: `commit` omits `data` for an empty tree

  the two primary motivations are

···

  STARs retain the raw CBOR serialization of records, but may use a new MST node serialization that further reduces this overhead.

- Since eliminating CIDs removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
+ Since omitting CIDs by making them implicit removes uncompressible content from CARs, I'm optimistic that real savings for compressed STARs vs CARs will be higher.
+
+ Note that all repository content in a STAR is still cryptographically bound to the signed root commit's CID: it's just a little more work to prove it.


  ### scope

···

  - It *might* be interesting to allow arbitrary sparse trees. Not sure yet.

  - It's not suitable for firehose commit CARs, which need to include blocks that aren't in a strict single MST.
+
+ STAR format does not aim to provide efficient access to random nodes or through other tree iteration patterns. Almost any kind of inspection requires a linear scan through the archive (especially if global key compression happens).
+
+
+ ## problems
+
+ It might be difficult to convert a STAR to stream-friendly (preorder traversal) CAR format, since the CID of each MST node block can only be computed after visiting all of its children.
+
+
+ ## format
+
+ ```
+ |--------- header ---------| |---------------- optional tree ----------------|
+ [ '*' | ver | len | commit ] [ node ] [ node OR record ] [ node OR record ] …
+ ```
+
+ ### header
+
+ - `*` (fixed u8): The first byte of the header is always `*` (hex `0x2A`).
+
+ - `ver` (varint): Next is an [`LEB128` `varint`](https://en.wikipedia.org/wiki/LEB128) specifying the `version`, with a fixed value of `1` for the current format.
+
+ - `len` (varint): The length in bytes of the atproto commit object that follows.
+
+ - `commit`: An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
+
+   - `did` (string, nullable): same as repo spec
+   - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
+   - `data` (hash link, **nullable**): CID of the first (root) node in the MST. an empty tree is represented by the presence of a `null` here
+   - `rev` (string, required): same as repo spec
+   - `prev` (hash link, nullable): same as repo spec
+   - `sig` (byte array, **nullable**): to enable archiving stable sub-trees which might be later stitched into full signed MSTs, the `sig` property is allowed to be `null`.
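The header layout above can be read with a few lines of code. The following is a minimal Python sketch, not part of the commit: the names `read_uvarint`, `parse_header`, and `StarHeader` are hypothetical, the varints are assumed to be unsigned LEB128, and the commit is returned as raw bytes rather than decoded from DAG-CBOR.

```python
from typing import NamedTuple, Tuple

MAGIC = 0x2A  # the '*' byte that opens every STAR


class StarHeader(NamedTuple):
    """Hypothetical container for the parsed header fields."""
    version: int
    commit: bytes      # raw DAG-CBOR bytes of the commit, left undecoded
    body_offset: int   # index of the first byte after the header


def read_uvarint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry the payload
        if not (byte & 0x80):             # high bit clear marks the last byte
            return value, pos
        shift += 7


def parse_header(buf: bytes) -> StarHeader:
    """Parse [ '*' | ver | len | commit ] from the start of buf."""
    if not buf or buf[0] != MAGIC:
        raise ValueError("not a STAR: first byte must be '*' (0x2A)")
    version, pos = read_uvarint(buf, 1)
    if version != 1:
        raise ValueError(f"unsupported STAR version {version}")
    length, pos = read_uvarint(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated commit")
    return StarHeader(version, buf[pos:end], end)
```

Anything after `body_offset` would be the optional tree. A stricter reader might also reject non-minimal varint encodings, since the format aims to be canonical, but the draft does not say so yet.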
+
+ #### verifying a commit
+
+ The `commit` object can be converted to a repo-spec compliant commit:
+
+ - if `data` is null, replace it with the CID of an empty repo-spec style MST (`bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`)
+ - follow steps from the repo spec to resolve the identity and verify the signature.
+
+ When `sig` is null (typically for archived MST sub-trees), this STAR cannot be converted to a repo-spec compliant commit.
+
+
+ ### optional tree
+
+ - `node`: TODO: we need a new node format. It must be convertible back to a repo-spec style node.
+
+ - `record`: The atproto record. Its CID can be computed over the bytes of its `block` (see below).
+
+ ### record
+
+ ```
+ |--- record --|
+ [ len | block ]
+ ```
+
+ - `len` (varint): the length in bytes of the binary record block that follows.
+
+ - `block` (bytes): the raw bytes of the (DAG-CBOR) record
+
+
+ ### order of `node`s and `record`s
+
+ The MST **must** be stored in key order, which for an MST is a depth-first walk across the tree.
+
+ For each *included* child of a `node` (indicated by ?? in its entries. null for cid?), todo blah blah
+
+ *excluded* children (indicated by a CID link being present in entries) are not included in the series of nodes and records.
+
+
+ #### key compression
+
+ TODO
+ - (but basically do what the repo-spec does but apply it across the whole stream)
+ - (but also actually run some tests and measure how much this decreases file sizes post-normal-file-compression)
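One possible reading of the key compression TODO is sketched below in Python. It assumes the repo-spec scheme, where each entry stores the number of leading bytes shared with the previous key plus the remaining suffix, and applies it to every key in stream order instead of resetting at each node. The function names are hypothetical, and how the pairs would be encoded inside a `node` is still undefined in the draft.

```python
from typing import Iterable, Iterator, Tuple


def compress_keys(keys: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Yield (prefix_len, suffix) for each key, relative to the previous key.

    Assumes keys arrive in MST key order, as they would in a STAR.
    """
    prev = b""
    for key in keys:
        limit = min(len(prev), len(key))
        n = 0
        while n < limit and prev[n] == key[n]:
            n += 1                      # count bytes shared with the previous key
        yield n, key[n:]
        prev = key


def decompress_keys(entries: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Rebuild full keys from (prefix_len, suffix) pairs."""
    prev = b""
    for prefix_len, suffix in entries:
        if prefix_len > len(prev):
            raise ValueError("prefix length exceeds previous key length")
        key = prev[:prefix_len] + suffix
        yield key
        prev = key


if __name__ == "__main__":
    keys = [
        b"app.bsky.feed.post/3jaaa",
        b"app.bsky.feed.post/3jaab",
        b"app.bsky.graph.follow/3jzzz",
    ]
    packed = list(compress_keys(keys))
    print(packed)
    assert list(decompress_keys(packed)) == keys
```

Because decoding a key depends on the key before it, this approach reinforces the linear-scan property noted under scope. Whether it still saves anything once the archive has been through a general-purpose compressor is the measurement the TODO asks for.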