STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car


spec tightening and cid fixup

phil c1c44d4a 84b09df3

+34 -41
star-lite/readme.md
···
  # STAR-lite

- **ST**reaming **AR**chive repository format (extra light version)
+ **ST**reaming **AR**chive repository format (extra light version): a stricter, simpler, still verifiable, highly compressible alternative to [CAR][car].

- A stricter, simpler, still verifiable, highly compressible alternative to [CAR][car].
+ STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files. Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.

- STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files.
+ STAR-lite files shine when zstd-compressed.

- Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.

-
- ### Compared to CARs:
+ ### Compared to [CARs][car]:

  - No MST node blocks or CIDs, eliminating the least-compressible content.
  - Strict content ordering, deterministic encoding.
  - Bounded-memory conversion to stream-ordered CAR.
-
- STAR-lite files shine when zstd-compressed.


  ### Compared to [STAR-L0 and STAR-L1][ln]:
···

  ## Format

- STAR-lite is a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header. It's suited for single-pass streaming.
+ STAR-lite is a flat list of every key/record pair in a repository, in lexicographic key order, with commit details in its header. It's suited for single-pass streaming.

  ```
  |--------- header ---------| |------------------ data (records) -------------------|
  [ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
  ```

- | name  | type                                   |
- | ----- | -------------------------------------- |
- | magic | three-byte mark to identify the format |
- | cid   | multiformats sha256 CID link           |
- | len   | unsigned varint                        |
- | str   | utf-8 bytes                            |
- | cbor  | cbor bytes                             |
+ | name  | type                                    |
+ | ----- | --------------------------------------- |
+ | magic | three-byte mark to identify the format  |
+ | cid   | atproto-format binary MST node CID link |
+ | len   | unsigned varint                         |
+ | str   | utf-8 bytes                             |
+ | cbor  | cbor bytes                              |


  ### Header magic
···

  ### Header CID

- A 36-byte CID link to the root of the repo MST, ie., the `data` field from the repo's current [`Commit` object][commit].
+ A 36-byte CID link to the root node of the repo's atproto Merkle Search Tree (MST) representation. This is the `data` field from a full atproto [`Commit` object][commit], and it verifies the entire archive's integrity (see **Archive Verification**, below).

- Archive integrity can be verified by recovering the MST from its contents and computing the CID of the root MST node, which must match.
+ It consists of a four-byte fixed prefix, `0x01711220`, followed by a 32-byte sha256 digest. See the [atproto repo spec][at-cid] and/or [DASL-CID][cid].
+
+ An empty repository (zero keys) must have the CID `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`, the CID of a single empty atproto MST node, which is how atproto CAR files represent empty repositories.


  ### Header len + cbor: optional partial commit object
···

  The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto [*recommended data limits*][reclen].
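
The header CID layout described above (a four-byte prefix `0x01711220` followed by a 32-byte sha256 digest) can be checked with a few lines of Python. This is a minimal sketch, not part of the spec: the helper names `split_header_cid` and `cid_string_to_bytes` are hypothetical, and the string decoder assumes the usual `b`-prefixed lowercase base32 CID text form. It decodes the empty-repository CID string quoted in the text and confirms that it has the 36-byte shape; it does not recompute the digest.

```python
# Sketch only: helper names are hypothetical, not from any STAR-lite library.
CID_PREFIX = bytes.fromhex("01711220")  # four-byte fixed prefix from the text
CID_LEN = 36                            # prefix + 32-byte sha256 digest
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


def split_header_cid(cid: bytes) -> bytes:
    """Validate a 36-byte binary header CID and return its 32-byte digest."""
    if len(cid) != CID_LEN:
        raise ValueError(f"expected {CID_LEN} CID bytes, got {len(cid)}")
    if not cid.startswith(CID_PREFIX):
        raise ValueError("unexpected CID prefix")
    return cid[len(CID_PREFIX):]


def cid_string_to_bytes(text: str) -> bytes:
    """Decode a 'b'-prefixed, lowercase, unpadded base32 CID string."""
    if not text.startswith("b"):
        raise ValueError("expected base32 multibase prefix 'b'")
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for ch in text[1:]:
        acc = (acc << 5) | BASE32_ALPHABET.index(ch)  # ValueError on bad char
        nbits += 5
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1  # keep only the leftover low bits
    return bytes(out)


# The empty-repository CID string quoted in the text.
EMPTY_REPO_CID = "bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4"
digest = split_header_cid(cid_string_to_bytes(EMPTY_REPO_CID))
print(len(digest))  # 32
```

A reader of the binary format would only need `split_header_cid`, applied to the 36 bytes that follow the magic; the string decoder is here to compare against CIDs in their text form.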

- Parsers must reject archives that exceed maximum values.
-

  ### Varints

- Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128])
+ Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128]).
+
+
+ ### Rules
+
+ - keys must be in strict lexicographic byte order.
+ - duplicate keys are not allowed.
+ - keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format `<collection>/<rkey>` restricts characters to a small subset of ASCII.
+ - records should be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
+ - any parse error should be treated as fatal for the entire archive.


  ### Compression

  STAR-lite is intended to be externally compressed with zstd in transport or for storage.

- TODO: include recommended zstd configs
+ TODO: include recommended zstd configs, and tables/graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings

- TODO: include an actual table or graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings

-
- ### Rules
+ ## STAR-lite algorithms

- - keys must be in strict lexicographic byte order.
- - duplicate keys are not allowed.
- - keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format `<collection>/<rkey>` restricts characters to a small subset of ASCII.
- - records should be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
- - any parse error should be treated as fatal for the entire archive.
+ While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once incurs significant memory or i/o overhead.

+ We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.

- ## Efficient MST-aware operations
+ This enables efficient transformations, such as verifying repository integrity or converting to a stream-ordered atproto CARv1 archive.

- While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, this usually carries high overhead (memory or i/o) for the most frequent operations on a STAR-lite file: verification, and conversion to stream-ordered CAR.

- By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with lower overhead.
+ ### MST state: node stack


- ### MST node stack

  We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.

···
  So, any subtree-spanning range of keys (and records) can be materialized directly into its stream-ordered sequence of CAR blocks, independent of the rest of the archive.


- #### Empty repos
-
- A repo with no keys is allowed. Its header CID is always `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`, the CID of a single empty atproto MST node.
-
- Note that when generating a CAR file, the empty MST node block must be included.
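
The data-section layout and Rules described earlier (varint length prefixes, strictly increasing keys, no duplicates, a 1 MiB record cap, any parse error being fatal) can be sketched as a single-pass reader. This is an illustration under stated assumptions rather than a reference implementation. The names `StarLiteError`, `read_uvarint`, and `iter_records` are hypothetical. The stream is assumed to be positioned just past the header. The 9-byte cap and minimal-encoding check follow the multiformats unsigned-varint convention. The key length limit is not shown in this excerpt, so it is left as an optional parameter.

```python
import io
from typing import BinaryIO, Iterator, Optional, Tuple

MAX_RECORD_LEN = 1_048_576  # 1 MiB record limit stated in the text


class StarLiteError(ValueError):
    """Any parse error; treated as fatal for the whole archive."""


def read_uvarint(stream: BinaryIO, allow_eof: bool = False) -> Optional[int]:
    """Read one unsigned varint (7 bits per byte, low group first)."""
    value = 0
    for i in range(9):  # assumption: unsigned-varint caps encodings at 9 bytes
        chunk = stream.read(1)
        if not chunk:
            if allow_eof and i == 0:
                return None  # clean end of the data section
            raise StarLiteError("truncated varint")
        byte = chunk[0]
        value |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:  # continuation bit clear: last byte
            if byte == 0 and i > 0:
                raise StarLiteError("non-minimal varint")
            return value
    raise StarLiteError("varint longer than 9 bytes")


def read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise StarLiteError("truncated archive")
    return data


def iter_records(
    stream: BinaryIO, max_key_len: Optional[int] = None
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (key, record) byte pairs, enforcing ordering and size rules."""
    prev_key: Optional[bytes] = None
    while True:
        key_len = read_uvarint(stream, allow_eof=True)
        if key_len is None:
            return
        if max_key_len is not None and key_len > max_key_len:
            raise StarLiteError("key too long")
        key = read_exact(stream, key_len)
        try:
            key.decode("utf-8")  # keys are specified as utf-8 bytes
        except UnicodeDecodeError as err:
            raise StarLiteError("key is not valid utf-8") from err
        # Byte-wise comparison; "<=" rejects both misordering and duplicates.
        if prev_key is not None and key <= prev_key:
            raise StarLiteError("keys not in strict lexicographic order")
        record_len = read_uvarint(stream)
        if record_len > MAX_RECORD_LEN:
            raise StarLiteError("record exceeds 1 MiB")
        record = read_exact(stream, record_len)  # record bytes left opaque
        prev_key = key
        yield key, record


# Two tiny records: keys "a/1" and "a/2", each with a one-byte CBOR body.
data = b"\x03a/1\x01\xa0" + b"\x03a/2\x01\xa0"
for key, record in iter_records(io.BytesIO(data)):
    print(key, len(record))
```

Only the previous key is held between iterations, so memory use is bounded by one key plus one record regardless of archive size.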
-
-
  ### Conversion from CAR

  TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it *is* stream-ordered, any streaming walker will work and it's pretty simple to write out.
···
  [car]: https://dasl.ing/car.html
  [drisl]: https://dasl.ing/drisl.html
  [ln]: ../star-lN/
+ [at-cid]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#section-2.7
+ [cid]: https://dasl.ing/cid.html
  [reclen]: https://atproto.com/guides/data-validation#recommended-data-limits
  [varint]: https://github.com/multiformats/unsigned-varint
  [leb128]: https://en.wikipedia.org/wiki/LEB128