···28282929## Format
30303131-TODO: subtree -- probably just set header cbor len to 0 to omit the commit? but we still probably want the `data` root hash...
3232-3333-STAR-lite is just a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header.
3131+STAR-lite is a flat list of every key/record pair in the repository, in lexicographic key order, with a commit object in its header. It's suited for single-pass streaming.
34323533```
3634|--------- header ---------| |------------------ data (records) -------------------|
···62606361When `len == 0`, no commit object is included in the archive. This is useful for archiving unsigned subtrees of a full repository tree -- the contents can still be verified from the preceding CID field.
64626565-When `len > 4096`, a parser may reject the commit object as being implausibly large. (TODO: we probably can set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).
6363+When `len > 4096`, a parser should reject the commit object as being implausibly large. (TODO: we can probably set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).
66646765Otherwise, when `len > 0`, a partial commit object of exactly `len` bytes follows, in CBOR format. The partial commit has the same fields as an [atproto Commit Object][commit] except that the `data` field must be omitted.
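
The three `len` cases above can be sketched as a small helper. This is only an illustration: `read_commit_bytes`, `MAX_COMMIT_LEN`, and the assumption that the caller has already decoded `len` and positioned a file-like `stream` right after it are hypothetical, not part of the format.

```python
import io

MAX_COMMIT_LEN = 4096  # plausibility limit described above (name is hypothetical)


def read_commit_bytes(stream, length):
    """Return the raw partial-commit CBOR bytes, or None when the commit is omitted.

    `length` is the already-decoded `len` prefix and `stream` is a binary
    file-like object positioned just after it (both are assumptions).
    """
    if length == 0:
        return None  # no commit object, e.g. an unsigned subtree archive
    if length > MAX_COMMIT_LEN:
        raise ValueError(f"commit length {length} is implausibly large")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("archive ended inside the commit object")
    return data
```

The returned bytes are left undecoded here; checking that the `data` field is absent would happen after CBOR decoding.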
6866···71697270### Data: keys and records
73717474-zero or more records until EOF. Each is:
7272+Zero or more records until EOF. Each is:
75737674| field | type |
7775| ----------- | --------------------------------------- |
···8381The maximum key length comes from the combined limits of the `<collection>/<rkey>` syntax for atproto repo paths: 317 for the [collection][nsid] + 1 for the `/` slash + 512 for the [rkey][rkey].
84828583The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto [*recommended data limits*][reclen].
8484+8585+Parsers must reject any archive containing a key or record that exceeds these maximum lengths.
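
As a rough sketch, the limits can be checked as soon as a length prefix is decoded, before any buffer is allocated for the key or record. The constant and function names here are hypothetical; only the numbers come from the text above.

```python
MAX_COLLECTION_LEN = 317  # NSID limit
MAX_RKEY_LEN = 512        # record key limit
MAX_KEY_LEN = MAX_COLLECTION_LEN + 1 + MAX_RKEY_LEN  # 830, including the "/"
MAX_RECORD_LEN = 1024 * 1024                         # 1 MiB = 1,048,576 bytes


def check_lengths(key_len, record_len):
    """Raise ValueError if a decoded length prefix exceeds the format limits."""
    if key_len > MAX_KEY_LEN:
        raise ValueError(f"key length {key_len} exceeds {MAX_KEY_LEN}")
    if record_len > MAX_RECORD_LEN:
        raise ValueError(f"record length {record_len} exceeds {MAX_RECORD_LEN}")
```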
868687878888### Varints
···9090Length prefixes in STAR are encoded as unsigned variable-length integers ([varint][varint], a variant of [LEB128][leb128]).
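
A minimal decoding sketch, assuming the usual LEB128 layout (7 payload bits per byte, least-significant group first, high bit set on every byte except the last). The `read_varint` name and the 9-byte cap are assumptions for illustration, not requirements stated by this document.

```python
import io


def read_varint(stream, max_bytes=9):
    """Decode one unsigned varint from a binary file-like object."""
    result = 0
    for i in range(max_bytes):
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        result |= (byte & 0x7F) << (7 * i)  # low 7 bits carry the payload
        if not byte & 0x80:                 # a clear high bit marks the last byte
            return result
    raise ValueError("varint is longer than max_bytes")
```

For example, `read_varint(io.BytesIO(b"\xac\x02"))` decodes to 300.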
919192929393+### Compression
9494+9595+STAR-lite is intended to be externally compressed with zstd in transport or for storage.
9696+9797+TODO: include recommended zstd configs
9898+9999+TODO: include an actual table or graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings
100100+101101+93102### Rules
9410395104- keys must be in strict lexicographic byte order.
96105- duplicate keys are not allowed.
9797-- keys must be valid atproto repo paths: the format specifies utf-8, but in practice the required `<collection>/<rkey>` repo path format currently restricts characters to a small subset of ASCII.
9898-- records must be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto.
106106+- keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format `<collection>/<rkey>` restricts characters to a small subset of ASCII.
107107+- records should be encoded as [DRISL][drisl], the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
108108+- any parse error should be treated as fatal for the entire archive.
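
The first two rules collapse into a single strict comparison against the previous key, so they can be enforced in a streaming pass. A minimal sketch, assuming keys arrive as `bytes` (Python compares `bytes` lexicographically by byte value); `check_key_order` is a hypothetical helper name.

```python
def check_key_order(keys):
    """Yield keys unchanged, raising ValueError on a duplicate or out-of-order key."""
    prev = None
    for key in keys:
        # `<=` rejects both duplicates (==) and misordered keys (<) in one test
        if prev is not None and key <= prev:
            raise ValueError(f"key {key!r} does not sort strictly after {prev!r}")
        prev = key
        yield key
```

Because this is a generator, the check only runs as the keys are consumed, and the raised error is meant to abort parsing of the whole archive.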
99109100110101111## Efficient MST-aware operations
···107117108118### MST node stack
109119110110-We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a small stack of in-progress MST nodes (one per layer of the tree) is sufficient.
120120+We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.
111121112122When a key's layer is *greater than the previous* key's layer, all in-progress MST nodes from lower layers are complete, and can be **frozen**: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.
113123···117127- serialized into runs of CAR-format blocks,
118128- any other transformation
119129120120-Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which be match the CID in a STAR-lite file's header.
130130+Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.
131131+132132+133133+### Archive verification
134134+135135+Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded.
136136+137137+Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs. The encoded node bytes (and referenced record CIDs) are discarded, since only the node CID is needed to materialize its parent MST node.
138138+139139+The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object.
140140+141141+Verification asserts the integrity of the repository contents only: verifying the signature of the archive's [commit object][commit] (if present) is a separate process, outside the scope of STAR. See atproto [commit signatures][commit-sigs].
142142+143143+144144+```python
145145+# MstNode interface:
146146+#   is_empty() => bool      true if the node has no subtree or value links
147147+#   reset_to_empty()        clears the node to `empty` state
148148+#   link_record(key, cid)   appends an entry with a key and value link
149149+#   link_subtree(cid)       inserts a node link as the "left" child (empty node),
150150+#                           or as the right-most entry's "right"
151151+#   to_cbor() => bytes      canonical DAG-CBOR encoding of the MST node
152152+153153+def reconstruct_root_cid(key_record_pairs):
154154+    """Compute the MST root CID from repo contents
155155+156156+    key_record_pairs must be in lexicographic key order (= depth-first mst walk)
157157+    """
158158+    stack: list[MstNode] = []
159159+    prev_layer = -1
160160+161161+    # the actual walk. everything left of the stack is finalized.
162162+    # anything remaining in the stack gets rolled up at the end.
163163+    for (key, record_cbor) in key_record_pairs:
164164+        key_layer = compute_mst_layer(key)
165165+166166+        # grow the stack if needed, init with empty nodes.
167167+        while len(stack) <= key_layer:
168168+            stack.append(MstNode())
169169+170170+        # finalize lower layers if this key is at a higher layer than the last.
171171+        # a higher-layer key means everything lower in the stack is to-our-left now.
172172+        if key_layer > prev_layer:
173173+            for node, parent in zip(stack[:key_layer], stack[1:]):
174174+                if node.is_empty():
175175+                    continue  # skip possible empty bottom-most nodes
176176+                parent.link_subtree(compute_cid(node.to_cbor()))
177177+                node.reset_to_empty()
178178+179179+        # add a node entry for the current record
180180+        stack[key_layer].link_record(key, compute_cid(record_cbor))
181181+182182+        prev_layer = key_layer
183183+184184+    # finalize remaining stack, bottom-up
185185+    for node, parent in zip(stack[:-1], stack[1:]):
186186+        if node.is_empty():
187187+            continue
188188+        parent.link_subtree(compute_cid(node.to_cbor()))
189189+        node.reset_to_empty()
190190+191191+    # get the finished root node, finally.
192192+    if len(stack) > 0:
193193+        root = stack[-1]
194194+    else:
195195+        root = MstNode()  # empty repo: atproto CAR writes one single empty node
196196+197197+    return compute_cid(root.to_cbor())
198198+```
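
The walk above relies on two helpers it does not define, `compute_mst_layer` and `compute_cid`. A possible sketch of both, under stated assumptions: the layer rule is the one from the atproto repository spec (leading zero bits of the key's SHA-256 hash, counted in 2-bit chunks), keys are passed as `bytes`, and CIDs are binary CIDv1 values with the dag-cbor codec and a sha2-256 multihash. A real implementation would probably use a CID library rather than concatenating the prefix by hand.

```python
import hashlib

# Binary CIDv1 prefix: version 1, dag-cbor codec (0x71),
# sha2-256 multihash code (0x12), 32-byte digest length (0x20).
CID_PREFIX = bytes([0x01, 0x71, 0x12, 0x20])


def compute_mst_layer(key: bytes) -> int:
    """Layer of a key: leading zero bits of sha256(key), in 2-bit chunks."""
    zeros = 0
    for byte in hashlib.sha256(key).digest():
        if byte == 0:
            zeros += 8  # whole byte is zero, keep counting
            continue
        zeros += 8 - byte.bit_length()  # leading zero bits of the first non-zero byte
        break
    return zeros // 2


def compute_cid(block: bytes) -> bytes:
    """Binary CIDv1 (dag-cbor, sha2-256) of an encoded record or MST node."""
    return CID_PREFIX + hashlib.sha256(block).digest()
```

Most keys land on layer 0, and each higher layer is roughly four times rarer, which is why the node stack stays shallow even for large repositories.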
121199122200123201### Conversion to CAR
···142220#### pseudo-code
143221144222```python
145145-## WIP!
223223+# wip!
146224147225def to_stream_ordered_car(key_record_pairs):
148226 stack = []
···205283 output.extend(byte_log[run.what:run.whattt])
206284207285 return node_cid, output
208208-```
209209-210210-211211-### Archive verification
212212-213213-Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded.
214214-215215-Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs. The encoded node bytes (and referenced record CIDs) are discarded, since we only need the node CID to help materialize a MST node.
216216-217217-The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object.
218218-219219-Verification asserts the integrity of the repository contents: verifying the signature of the archive's [commit object][commit] (if present) is a separate process, outside the scope of STAR. See atproto [commit signatures][commit-sigs]
220220-221221-222222-```python
223223-## WIP!!
224224-225225-def verify(key_record_pairs, expected_root_cid):
226226- stack: list[MstNode] = []
227227- prev_layer = -1
228228-229229- for (key, record) in key_record_pairs:
230230- record_cid = compute_cid(record)
231231- key_layer = layer_of(key)
232232-233233- # grow the stack if needed, init with empty nodes
234234- while len(stack) <= key_layer:
235235- stack.append(MstNode())
236236-237237- # when `key` is at a higher layer than last, freeze all layers below
238238- if key_layer > prev_layer:
239239- for i in range(key_layer):
240240- if stack[i].is_empty():
241241- continue
242242- (node_cid, _) = encode_mst_node(stack[i])
243243- stack[i + 1].attach_subtree(node_cid)
244244- stack[i] = MstNode() # empty it
245245-246246- # every key-record pair must insert to a node `entry`
247247- stack[key_layer].entries.append(Leaf(key, record_cid, car_run=None))
248248-249249- prev_layer = key_layer
250250-251251- # Fold remaining stack bottom-up.
252252- node_cid = None
253253- for node in stack:
254254- if node_cid is not None:
255255- node.attach_subtree(node_cid)
256256- node_cid = None
257257- if not node.is_empty():
258258- (node_cid, _) = encode_mst_node(node)
259259-260260- # Empty repo: canonical empty MST node CID.
261261- if node_cid is None:
262262- (node_cid, _) = encode_mst_node(MstNode())
263263-264264- return node_cid == expected_root_cid
265286```
266287267288