···11# STAR-lite
2233-**ST**reaming **AR**chive repository format (extra light version)
33+**ST**reaming **AR**chive repository format (extra light version): a stricter, simpler, still verifiable, highly compressible alternative to [CAR][car].
4455-A stricter, simpler, still verifiable, highly compressible alternative to [CAR][car].
55+STAR-lite describes both a binary encoding and an efficient algorithm to verify sorted key-record pairs or transform them into stream-ordered CAR files. Together, they make STAR-lite suitable both as a network transport and for long-term repo archiving, without sacrificing interoperability.
6677-STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files.
77+STAR-lite files shine when zstd-compressed.
8899-Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.
1091111-1212-### Compared to CARs:
1010+### Compared to [CARs][car]:
13111412- No MST node blocks or CIDs, eliminating the least-compressible content.
1513- Strict content ordering, deterministic encoding.
1614- Bounded-memory conversion to stream-ordered CAR.
1717-1818-STAR-lite files shine when zstd-compressed.
191520162117### Compared to [STAR-L0 and STAR-L1][ln]:
···28242925## Format
30263131-STAR-lite is a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header. It's suited for single-pass streaming.
2727+STAR-lite is a flat list of every key/record pair in a repository, in lexicographic key order, with commit details in its header. It's suited for single-pass streaming.
32283329```
3430|--------- header ---------| |------------------ data (records) -------------------|
3531[ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
3632```
37333838-| name | type |
3939-| ----- | -------------------------------------- |
4040-| magic | three-byte mark to identify the format |
4141-| cid | multiformats sha256 CID link |
4242-| len | unsigned varint |
4343-| str | utf-8 bytes |
4444-| cbor | cbor bytes |
3434+| name | type |
3535+| ----- | --------------------------------------- |
3636+| magic | three-byte mark to identify the format |
3737+| cid | atproto-format binary MST node CID link |
3838+| len | unsigned varint |
3939+| str | utf-8 bytes |
4040+| cbor | cbor bytes |
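As a rough illustration of the data section layout (`[ len | str | len | cbor ]`, repeated), here is a minimal Python sketch of a single-pass record reader. The names `read_varint`, `read_exact`, and `iter_records` are hypothetical and not part of any STAR-lite library. The sketch assumes the header has already been consumed, enforces only the 1 MiB record limit, and treats record bytes as opaque.

```python
import io
from typing import BinaryIO, Iterator, Optional, Tuple

MAX_RECORD_LEN = 1_048_576  # 1 MiB record limit (atproto recommended data limits)


def read_varint(stream: BinaryIO) -> Optional[int]:
    """Read one unsigned varint; return None if the stream ended cleanly."""
    value, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # no bytes consumed: normal end of archive
            raise ValueError("truncated varint")
        value |= (byte[0] & 0x7F) << shift
        if byte[0] & 0x80 == 0:
            return value
        shift += 7
        if shift >= 63:  # assumption: cap varints at 9 bytes
            raise ValueError("varint too long")


def read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or fail; a short read means a truncated archive."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("truncated archive")
    return data


def iter_records(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, record_bytes) pairs from the data section, in file order."""
    while True:
        key_len = read_varint(stream)
        if key_len is None:
            return
        # Any key-length limit from the spec is not enforced in this sketch.
        key = read_exact(stream, key_len).decode("utf-8")
        rec_len = read_varint(stream)
        if rec_len is None:
            raise ValueError("key without a record")
        if rec_len > MAX_RECORD_LEN:
            raise ValueError("record exceeds 1 MiB")
        yield key, read_exact(stream, rec_len)
```

Because the reader is a generator over a byte stream, it holds at most one key and one record in memory at a time, which fits the single-pass streaming use described above. Any `ValueError` (including invalid utf-8) should abort processing of the whole archive.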
454146424743### Header magic
···51475248### Header CID
53495454-A 36-byte CID link to the root of the repo MST, ie., the `data` field from the repo's current [`Commit` object][commit].
5050+A 36-byte CID link to the root node of the repo's atproto Merkle Search Tree (MST) representation. This is the `data` field from a full atproto [`Commit` object][commit], and it verifies the entire archive's integrity (see **Archive Verification**, below).
55515656-Archive integrity can be verified by recovering the MST from its contents and computing the CID of the root MST node, which must match.
5252+It consists of a four-byte fixed prefix, `0x01711220`, followed by a 32-byte sha256 digest. See the [atproto repo spec][at-cid] and/or [DASL-CID][cid].
5353+5454+An empty repository (zero keys) must have the CID `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`, the CID of a single empty atproto MST node, which is how atproto CAR files represent empty repositories.
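The 36-byte layout described above (four-byte prefix `0x01711220` plus a 32-byte sha256 digest) can be checked with very little code. The following Python sketch is illustrative only; `parse_header_cid` and `node_cid` are hypothetical helper names, and `node_cid` assumes its input is the already-encoded bytes of an MST node.

```python
import hashlib

# 0x01 = CIDv1, 0x71 = dag-cbor codec, 0x12 = sha2-256, 0x20 = 32-byte digest
CID_PREFIX = bytes.fromhex("01711220")
CID_LEN = 36


def parse_header_cid(cid: bytes) -> bytes:
    """Validate a binary header CID and return its 32-byte sha256 digest."""
    if len(cid) != CID_LEN:
        raise ValueError(f"expected {CID_LEN} CID bytes, got {len(cid)}")
    if cid[:4] != CID_PREFIX:
        raise ValueError("unexpected CID prefix")
    return cid[4:]


def node_cid(node_bytes: bytes) -> bytes:
    """Compute the binary CID for an encoded node (hypothetical helper)."""
    return CID_PREFIX + hashlib.sha256(node_bytes).digest()
```

A verifier would compare the digest returned by `parse_header_cid` against the digest of the root MST node it reconstructs from the archive's contents.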
575558565957### Header len + cbor: optional partial commit object
···82808381The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto [*recommended data limits*][reclen].
84828585-Parsers must reject archives that exceed maximum values.
8686-87838884### Varints
89859090-Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128])
8686+Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128]).
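For readers unfamiliar with the encoding, here is a small Python sketch of unsigned varint encoding and decoding. The function names are hypothetical. The 9-byte cap is an assumption taken from the multiformats unsigned-varint spec, and this sketch does not reject non-minimal encodings.

```python
from typing import Tuple

MAX_VARINT_BYTES = 9  # assumption: cap from the multiformats unsigned-varint spec


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if value < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        low = value & 0x7F  # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one varint from buf at offset; return (value, next_offset)."""
    value = 0
    for i in range(MAX_VARINT_BYTES):
        if offset + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return value, offset + i + 1
    raise ValueError("varint longer than 9 bytes")
```

For example, `encode_varint(300)` yields the two bytes `0xAC 0x02`: the low seven bits with the continuation bit set, followed by the remaining bits.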
8787+8888+8989+### Rules
9090+9191+- keys must be in strict lexicographic byte order.
9292+- duplicate keys are not allowed.
9393+- keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format `<collection>/<rkey>` restricts characters to a small subset of ASCII.
9494+- records should be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
9595+- any parse error should be treated as fatal for the entire archive.
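The two ordering rules above (strict lexicographic byte order, no duplicates) collapse into one comparison per key. The following Python sketch shows one way to enforce them while streaming; `check_key_order` is a hypothetical name, and the sketch does not validate repo path syntax or record encoding.

```python
from typing import Iterable, Iterator, Optional, Tuple


def check_key_order(
    pairs: Iterable[Tuple[str, bytes]],
) -> Iterator[Tuple[str, bytes]]:
    """Pass (key, record) pairs through, failing on the first ordering violation."""
    prev: Optional[bytes] = None
    for key, record in pairs:
        key_bytes = key.encode("utf-8")  # compare as bytes, not as code points
        if prev is not None and key_bytes <= prev:
            kind = "duplicate" if key_bytes == prev else "out-of-order"
            # Per the rules, any violation is fatal for the entire archive.
            raise ValueError(f"{kind} key: {key!r}")
        prev = key_bytes
        yield key, record
```

Only the previous key is retained, so the check adds constant memory overhead to a single-pass reader.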
919692979398### Compression
949995100STAR-lite is intended to be externally compressed with zstd in transport or for storage.
961019797-TODO: include recommended zstd configs
102102+TODO: include recommended zstd configs, and tables/graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings
981039999-TODO: include an actual table or graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings
100104101101-102102-### Rules
105105+## STAR-lite algorithms
103106104104-- keys must be in strict lexicographic byte order.
105105-- duplicate keys are not allowed.
106106-- keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format `<collection>/<rkey>` restricts characters to a small subset of ASCII.
107107-- records should be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
108108-- any parse error should be treated as fatal for the entire archive.
107107+While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once costs significant memory or i/o overhead.
109108109109+We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
110110111111-## Efficient MST-aware operations
111111+This enables efficient transformations, such as verifying repository integrity or converting to a stream-ordered atproto CARv1 archive.
112112113113-While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, this usually carries high overhead (memory or i/o) for the most frequent operations on a STAR-lite file: verification, and conversion to stream-ordered CAR.
114113115115-By exploiting the strict lexicographic key ordering of STAR-lite files, we can implement these transformations directly and with lower overhead.
114114+### MST state: node stack
116115117116118118-### MST node stack
119117120118We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
121119···453451So, any subtree-spanning range of keys (and records) can be materialized directly into its stream-ordered sequence of CAR blocks, independent of the rest of the archive.
454452455453456456-#### Empty repos
457457-458458-A repo with no keys is allowed. Its header CID is always `bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4`, the CID of a single empty atproto MST node.
459459-460460-Note that when generating a CAR file, the empty MST node block must be included.
461461-462462-463454### Conversion from CAR
464455465456TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it *is* stream-ordered, any streaming walker will work and it's pretty simple to write out.
···470461[car]: https://dasl.ing/car.html
471462[drisl]: https://dasl.ing/drisl.html
472463[ln]: ../star-lN/
464464+[at-cid]: https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#section-2.7
465465+[cid]: https://dasl.ing/cid.html
473466[reclen]: https://atproto.com/guides/data-validation#recommended-data-limits
474467[varint]: https://github.com/multiformats/unsigned-varint
475468[leb128]: https://en.wikipedia.org/wiki/LEB128