STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car

wip readme cleanup

phil 603ca650 c1c44d4a

+69 -13
star-lite/readme.md
···
 34  34   | name | type |
 35  35   | ----- | --------------------------------------- |
 36  36   | magic | three-byte mark to identify the format |
 37     - | cid | atproto-format binary MST node CID link |
     37 + | cid | atproto-format binary CID link |
 38  38   | len | unsigned varint |
 39  39   | str | utf-8 bytes |
 40  40   | cbor | cbor bytes |
···
102 102   TODO: include recommended zstd configs, and tables/graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings
103 103
104 104
105     - ## STAR-lite algorithms
    105 + #### DATAaaaaaaaa
106 106
107     - While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once costs significant memory or i/o overhead.
    107 + Ratios are STAR/CAR (lower is better). "raw" baseline = uncompressed CAR; "coder" baseline = CAR compressed with the same coder.
108 108
109     - We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
    109 + #### overall
110 110
111     - This enables efficient transformations, like verifying repository integrity, or conversion to stream-ordered atproto CARv1 format archive.
    111 + N=4866, raw CAR=2.18 GiB, raw STAR=1.62 GiB.
112 112
    113 + | setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
    114 + |---|---:|---:|---:|---:|---:|---:|
    115 + | raw | 0.668 | 0.678 | 0.746 | — | — | — |
    116 + | gzip | 0.292 | 0.232 | 0.215 | 0.568 | 0.556 | 0.552 |
    117 + | zstd --fast 1 | 0.333 | 0.295 | 0.286 | 0.632 | 0.635 | 0.671 |
    118 + | zstd 3 | 0.281 | 0.224 | 0.195 | 0.566 | 0.551 | 0.553 |
    119 + | zstd 9 | 0.276 | 0.218 | 0.183 | 0.562 | 0.544 | 0.542 |
113 120
114     - ### MST state: node stack
    121 + #### < 10 KiB
115 122
    123 + N=2168, raw CAR=6.31 MiB, raw STAR=4.04 MiB.
116 124
    125 + | setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
    126 + |---|---:|---:|---:|---:|---:|---:|
    127 + | raw | 0.614 | 0.611 | 0.640 | — | — | — |
    128 + | gzip | 0.395 | 0.370 | 0.301 | 0.624 | 0.640 | 0.585 |
    129 + | zstd --fast 1 | 0.413 | 0.402 | 0.342 | 0.653 | 0.663 | 0.628 |
    130 + | zstd 3 | 0.383 | 0.363 | 0.295 | 0.621 | 0.630 | 0.578 |
    131 + | zstd 9 | 0.381 | 0.361 | 0.292 | 0.622 | 0.634 | 0.578 |
117 132
118     - We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
    133 + #### 10 KiB – 1 MiB
119 134
120     - When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
    135 + N=2346, raw CAR=379.92 MiB, raw STAR=276.84 MiB.
121 136
122     - At this point, the newly frozen nodes can be:
    137 + | setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
    138 + |---|---:|---:|---:|---:|---:|---:|
    139 + | raw | 0.706 | 0.710 | 0.729 | — | — | — |
    140 + | gzip | 0.209 | 0.210 | 0.208 | 0.517 | 0.520 | 0.526 |
    141 + | zstd --fast 1 | 0.266 | 0.270 | 0.270 | 0.607 | 0.615 | 0.629 |
    142 + | zstd 3 | 0.199 | 0.198 | 0.193 | 0.519 | 0.521 | 0.529 |
    143 + | zstd 9 | 0.193 | 0.191 | 0.184 | 0.511 | 0.513 | 0.519 |
123 144
124     - - simply discarded, when verifying archive integrity,
125     - - serialized into runs of CAR-format blocks,
126     - - any other transformation
    145 + #### 1 MiB – 100 MiB
    146 +
    147 + N=352, raw CAR=1.80 GiB, raw STAR=1.35 GiB.
127 148
128     - Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.
    149 + | setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
    150 + |---|---:|---:|---:|---:|---:|---:|
    151 + | raw | 0.749 | 0.746 | 0.750 | — | — | — |
    152 + | gzip | 0.216 | 0.216 | 0.216 | 0.556 | 0.557 | 0.557 |
    153 + | zstd --fast 1 | 0.288 | 0.290 | 0.289 | 0.676 | 0.675 | 0.680 |
    154 + | zstd 3 | 0.192 | 0.194 | 0.195 | 0.547 | 0.543 | 0.558 |
    155 + | zstd 9 | 0.180 | 0.182 | 0.183 | 0.534 | 0.532 | 0.547 |
    156 +
    157 +
    158 + ## STAR-lite algorithms
    159 +
    160 + While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once costs significant memory or i/o overhead.
    161 +
    162 + We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
    163 +
    164 + This enables efficient transformations, like verifying repository integrity, or conversion to stream-ordered atproto CARv1 format archive.
129 165
130 166
131 167   ### Archive verification
···
470 506   [rkey]: https://atproto.com/specs/record-key
471 507   [commit]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#section-2.4
472 508   [commit-sigs]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-signatures
    509 +
    510 +
    511 +
    512 +
    513 + ### MST state: node stack
    514 +
    515 +
    516 +
    517 + We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
    518 +
    519 + When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
    520 +
    521 + At this point, the newly frozen nodes can be:
    522 +
    523 + - simply discarded, when verifying archive integrity,
    524 + - serialized into runs of CAR-format blocks,
    525 + - any other transformation
    526 +
    527 + Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.
    528 +
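
The node-stack walk described in the `MST state: node stack` section of this diff can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation. The names `key_layer`, `walk`, `freeze`, `attach`, and `on_frozen` are hypothetical. `freeze` is a stand-in that hashes a JSON dump; a real walker would encode each node in the atproto MST node format (DAG-CBOR) and compute a real CID. The layer rule follows the atproto MST definition as I understand it (leading zero bits of the SHA-256 of the key, counted in pairs), and skipped layers are assumed to be filled with entry-less nodes that hold only a left link.

```python
import hashlib
import json


def key_layer(key: str) -> int:
    """Layer of a key: leading zero bits of SHA-256(key), counted two at a time."""
    zeros = 0
    for byte in hashlib.sha256(key.encode("utf-8")).digest():
        if byte:
            zeros += 8 - byte.bit_length()
            break
        zeros += 8
    return zeros // 2


def new_node(left=None):
    # entries are [key, value, right_subtree_link] triples
    return {"left": left, "entries": []}


def freeze(node) -> str:
    """Stand-in for 'encode node, compute CID'. NOT the atproto encoding."""
    payload = json.dumps([node["left"], node["entries"]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def attach(node, child_link):
    """Link a frozen child into the rightmost open slot of its parent."""
    if node["entries"]:
        node["entries"][-1][2] = child_link
    else:
        node["left"] = child_link


def freeze_below(stack, layer, on_frozen):
    """Freeze every in-progress node below `layer`, bottom-up."""
    child = None
    for lower in range(layer):
        node = stack.pop(lower, None)
        if node is None and child is None:
            continue  # nothing in progress at or below this layer
        if node is None:
            node = new_node()  # assumed entry-less node for a skipped layer
        if child is not None:
            attach(node, child)
        child = freeze(node)
        on_frozen(lower, child, node)
    if child is not None:
        attach(stack.setdefault(layer, new_node()), child)


def walk(pairs, layer_of=key_layer, on_frozen=lambda layer, link, node: None):
    """Stream lex-ordered (key, value) pairs and return the root link.

    Only one in-progress node per layer is held in `stack`.
    """
    stack = {}  # layer -> in-progress node
    prev = None
    for key, value in pairs:
        if prev is not None and key <= prev:
            raise ValueError("keys must be strictly increasing")
        prev = key
        layer = layer_of(key)
        freeze_below(stack, layer, on_frozen)
        stack.setdefault(layer, new_node())["entries"].append([key, value, None])
    if not stack:
        stack[0] = new_node()  # empty repo: a single empty node
    top = max(stack)
    freeze_below(stack, top, on_frozen)
    root = stack.pop(top)
    link = freeze(root)
    on_frozen(top, link, root)
    return link
```

For verification, the value returned by `walk` would be compared against the root CID in the file header (which requires the real node encoding rather than this stand-in). For CAR conversion, `on_frozen` is where each frozen node would be serialized as a block; passing no callback simply discards the nodes.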
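
The summary columns in the compression tables added by this commit (`mean`, `med`, `wt`) are not defined in the diff. One reading that is consistent with the reported totals is that `wt` is the size-weighted ratio, i.e. total STAR bytes over total CAR bytes: for the `< 10 KiB` bucket, 4.04 MiB / 6.31 MiB ≈ 0.640, which matches the `wt (raw)` cell. A minimal sketch under that assumption (the function name and the sample sizes are hypothetical, not taken from the benchmark):

```python
from statistics import mean, median


def ratio_summary(sizes):
    """Summarize STAR/CAR ratios for (star_bytes, car_bytes) pairs, one per repo.

    'wt' is ASSUMED to mean total STAR bytes / total CAR bytes.
    """
    sizes = list(sizes)
    ratios = [star / car for star, car in sizes]
    return {
        "mean": mean(ratios),      # every repo counts equally
        "med": median(ratios),     # robust to outlier repos
        "wt": sum(s for s, _ in sizes) / sum(c for _, c in sizes),  # large repos dominate
    }


# Two hypothetical repos: a small one that shrinks a lot, a large one that shrinks less.
summary = ratio_summary([(50, 100), (300, 400)])
```

Under this reading, a `wt` value above the `mean` (as in the overall `raw` row) would indicate that larger repositories see less relative benefit than smaller ones.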