···3434| name | type |
3535| ----- | --------------------------------------- |
3636| magic | three-byte mark to identify the format |
3737-| cid | atproto-format binary MST node CID link |
3737+| cid | atproto-format binary CID link |
3838| len | unsigned varint |
3939| str | utf-8 bytes |
4040| cbor | cbor bytes |
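
To make the `len` type concrete, here is a minimal sketch of an unsigned varint codec. It assumes the LEB128-style encoding (7 data bits per byte, least-significant group first, high bit set on every byte except the last) that CAR files use for their length prefixes; the table above does not state which varint flavor is meant, so treat that as an assumption. The function names are illustrative only.

```python
def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative int as an unsigned LEB128 varint (assumed flavor)."""
    if n < 0:
        raise ValueError("unsigned varint cannot encode negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F          # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)


def decode_uvarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at buf[pos]; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

A reader would decode a `len`, then take that many bytes as the following `str` or `cbor` payload.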
···102102TODO: include recommended zstd configs, plus tables/graphs of compression performance. These should compare STAR against CAR, and compare gzip (maybe brotli?) against zstd settings.
103103104104105105-## STAR-lite algorithms
105105+#### Data
106106107107-While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once costs significant memory or i/o overhead.
107107+Ratios are STAR/CAR (lower is better). "raw" baseline = uncompressed CAR; "coder" baseline = CAR compressed with the same coder.
108108109109-We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
109109+#### overall
110110111111-This enables efficient transformations, like verifying repository integrity, or conversion to stream-ordered atproto CARv1 format archive.
111111+N=4866, raw CAR=2.18 GiB, raw STAR=1.62 GiB.
112112113113+| setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
114114+|---|---:|---:|---:|---:|---:|---:|
115115+| raw | 0.668 | 0.678 | 0.746 | — | — | — |
116116+| gzip | 0.292 | 0.232 | 0.215 | 0.568 | 0.556 | 0.552 |
117117+| zstd --fast 1 | 0.333 | 0.295 | 0.286 | 0.632 | 0.635 | 0.671 |
118118+| zstd 3 | 0.281 | 0.224 | 0.195 | 0.566 | 0.551 | 0.553 |
119119+| zstd 9 | 0.276 | 0.218 | 0.183 | 0.562 | 0.544 | 0.542 |
113120114114-### MST state: node stack
121121+#### < 10 KiB
115122123123+N=2168, raw CAR=6.31 MiB, raw STAR=4.04 MiB.
116124125125+| setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
126126+|---|---:|---:|---:|---:|---:|---:|
127127+| raw | 0.614 | 0.611 | 0.640 | — | — | — |
128128+| gzip | 0.395 | 0.370 | 0.301 | 0.624 | 0.640 | 0.585 |
129129+| zstd --fast 1 | 0.413 | 0.402 | 0.342 | 0.653 | 0.663 | 0.628 |
130130+| zstd 3 | 0.383 | 0.363 | 0.295 | 0.621 | 0.630 | 0.578 |
131131+| zstd 9 | 0.381 | 0.361 | 0.292 | 0.622 | 0.634 | 0.578 |
117132118118-We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
133133+#### 10 KiB – 1 MiB
119134120120-When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
135135+N=2346, raw CAR=379.92 MiB, raw STAR=276.84 MiB.
121136122122-At this point, the newly frozen nodes can be:
137137+| setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
138138+|---|---:|---:|---:|---:|---:|---:|
139139+| raw | 0.706 | 0.710 | 0.729 | — | — | — |
140140+| gzip | 0.209 | 0.210 | 0.208 | 0.517 | 0.520 | 0.526 |
141141+| zstd --fast 1 | 0.266 | 0.270 | 0.270 | 0.607 | 0.615 | 0.629 |
142142+| zstd 3 | 0.199 | 0.198 | 0.193 | 0.519 | 0.521 | 0.529 |
143143+| zstd 9 | 0.193 | 0.191 | 0.184 | 0.511 | 0.513 | 0.519 |
123144124124-- simply discarded, when verifying archive integrity,
125125-- serialized into runs of CAR-format blocks,
126126-- any other transformation
145145+#### 1 MiB – 100 MiB
146146+147147+N=352, raw CAR=1.80 GiB, raw STAR=1.35 GiB.
127148128128-Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.
149149+| setting | mean (raw) | med (raw) | wt (raw) | mean (coder) | med (coder) | wt (coder) |
150150+|---|---:|---:|---:|---:|---:|---:|
151151+| raw | 0.749 | 0.746 | 0.750 | — | — | — |
152152+| gzip | 0.216 | 0.216 | 0.216 | 0.556 | 0.557 | 0.557 |
153153+| zstd --fast 1 | 0.288 | 0.290 | 0.289 | 0.676 | 0.675 | 0.680 |
154154+| zstd 3 | 0.192 | 0.194 | 0.195 | 0.547 | 0.543 | 0.558 |
155155+| zstd 9 | 0.180 | 0.182 | 0.183 | 0.534 | 0.532 | 0.547 |
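
As a rough illustration of how the summary columns in the tables above could be derived, the sketch below computes a mean, median, and size-weighted ratio from per-repo byte counts. Reading `wt` as total STAR bytes over total CAR bytes is an assumption, though it is consistent with the raw totals quoted above each table (for example 4.04 MiB / 6.31 MiB ≈ 0.640). The function name and input shape are hypothetical.

```python
from statistics import mean, median


def ratio_summary(sizes):
    """Summarize STAR/CAR ratios.

    sizes: list of (car_bytes, star_bytes) pairs, one per repo, both measured
    under the same setting (or CAR under the chosen baseline).
    """
    ratios = [star / car for car, star in sizes]
    total_car = sum(car for car, _ in sizes)
    total_star = sum(star for _, star in sizes)
    return {
        "mean": mean(ratios),            # every repo counts equally
        "med": median(ratios),           # robust to outlier repos
        "wt": total_star / total_car,    # assumed: weighted by repo size
    }
```

The per-repo mean and median are dominated by the many small repos, while the weighted figure is dominated by the few large ones, which is why the columns can differ noticeably in the overall table.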
156156+157157+158158+## STAR-lite algorithms
160160+While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once incurs significant memory or I/O overhead.
161161+162162+We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
164164+This enables efficient transformations, such as verifying repository integrity or converting to a stream-ordered atproto CARv1 archive.
129165130166131167### Archive verification
···470506[rkey]: https://atproto.com/specs/record-key
471507[commit]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#section-2.4
472508[commit-sigs]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-signatures
509509+510510+511511+512512+513513+### MST state: node stack
514514+515515+516516+517517+We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
518518+519519+When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
520520+521521+At this point, the newly frozen nodes can be:
522522+523523+- simply discarded, when verifying archive integrity,
524524+- serialized into runs of CAR-format blocks,
525525+- any other transformation
526526+527527+Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.
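
The node-stack walk described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `placeholder_cid` is a hypothetical stand-in for freezing (a real walker would encode the node in atproto MST node format and compute its actual CID), and `walk_mst`, `layer_of`, and `on_frozen` are names invented for this example. `key_layer` follows the atproto rule of counting leading zero bits of the SHA-256 of the key in 2-bit chunks.

```python
import hashlib


def key_layer(key: bytes) -> int:
    """MST layer: leading zero bits of SHA-256(key), counted in 2-bit chunks."""
    zeros = 0
    for byte in hashlib.sha256(key).digest():
        if byte:
            zeros += 8 - byte.bit_length()
            break
        zeros += 8
    return zeros // 2


def placeholder_cid(layer, entries) -> bytes:
    """Hypothetical stand-in for freezing; NOT the atproto node encoding."""
    return hashlib.sha256(repr((layer, entries)).encode()).digest()


def walk_mst(pairs, layer_of=key_layer, on_frozen=None):
    """Walk lex-ordered (key, record_cid) pairs; return the root placeholder CID."""
    stack = {}  # layer -> entries of the single in-progress node at that layer

    def freeze(layer):
        entries = stack.pop(layer)
        cid = placeholder_cid(layer, entries)
        if on_frozen is not None:
            on_frozen(layer, entries, cid)  # discard, emit a CAR block, etc.
        return cid

    def freeze_below(limit):
        # Freeze in-progress nodes under `limit`, lowest layer first. Each
        # frozen node becomes a link in the node one layer up, which also
        # creates link-only nodes for any skipped layers.
        if not stack:
            return
        for layer in range(min(stack), limit):
            if layer in stack:
                cid = freeze(layer)
                stack.setdefault(layer + 1, []).append(("link", cid))

    prev = None
    for key, record_cid in pairs:
        if prev is not None and key <= prev:
            raise ValueError("keys must be strictly lex-ordered")
        prev = key
        layer = layer_of(key)
        freeze_below(layer)  # nodes below this key's layer are complete
        stack.setdefault(layer, []).append(("leaf", key, record_cid))

    if not stack:
        stack[0] = []  # empty tree: a single empty node
    top = max(stack)
    freeze_below(top)
    return freeze(top)  # highest-layer node is the root
```

For verification, `on_frozen` can be left unset and the returned value compared against the root CID from the file header; for CAR conversion, `on_frozen` would receive each node as it is frozen. The stack never holds more than one node per layer, regardless of how many records the repo contains.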
528528+