STAR-lite#

STreaming ARchive repository format (extra light version): a stricter, simpler, still verifiable, highly compressible alternative to CAR.

STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files. Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.

Compared to CARs:#

No MST node blocks or CIDs, eliminating the least-compressible content.
Strict content ordering, deterministic encoding.
Bounded-memory conversion to stream-ordered CAR.

Compared to STAR-L0 and STAR-L1:#

Smallest archive format (with zstd compression)
Content verification requires a complete scan of all content
No support for sparse archives or "CAR slice"-like proofs (yet)
Disk spilling required for memory-bounded streaming of large archives

Compression-oriented#

STAR-lite files shine when compressed: almost 2x smaller than CARs for the same compression settings; almost 6x smaller than raw CARs with zstd.

CAR vs STAR-lite for various compression configs:

CAR vs STAR-lite for identical compression settings, grouped by CAR size:

Format#

STAR-lite is a flat list of every key/record pair in a repository, in lexicographic key order, with commit details in its header. It's suited for single-pass streaming.

|--------- header ---------| |------------------ data (records) -------------------|
[ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …

name	type
magic	three-byte mark to identify the format
cid	atproto-format binary CID link
len	unsigned varint
str	utf-8 bytes
cbor	cbor bytes

Header magic#

Three bytes: 0x2A 0x6C 0x00, ASCII for *l\0: "star", "lite", version 0.

Header CID#

A 36-byte CID link to the root node of the repo's atproto Merkle Search Tree (MST) representation. This is the data field from a full atproto Commit object, and it verifies the entire archive's integrity (see Archive Verification, below).

It consists of a four-byte fixed prefix, 0x01711220, followed by a 32-byte sha256 digest. See the atproto repo spec and/or DASL-CID.

An empty repository (zero keys) must has CID bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4 — the CID of a single empty atproto MST node, which is how atproto CAR files represent empty repositories.

Header len + cbor: optional partial commit object#

When len == 0, no commit object is included in the archive. This is useful for archiving unsigned subtrees of a full repository tree -- the contents can still be verified from the preceeding CID field.

When len > 4096, a parser should reject the commit object as being implausibly large. (TODO: we can probably set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).

Otherwise, when len > 0, a partial commit object of exactly len bytes follows, in CBOR format. The partial commit has the same fields as an atproto Commit Object except that the data field must be omitted.

To verify the commit signature, use the Header CID (above) as the data field to compute the commit's signed CID.

Data: keys and records#

Zero or more records until EOF. Each is:

field	type
key len	varint, max: 830
key str	utf-8 bytes, exactly `key len` length
record len	varint, max: 1,048,576
record cbor	cbor bytes, exactly `record len` length

The maximum key length comes from the combined limits of the <collection>/<rkey> syntax for atproto repo paths: 317 for the collection + 1 for the / slash + 512 for the rkey.

The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto recommended data limits.

Varints#

Length prefixes in STAR are encoded as unsigned variable-length integers (varint, a variant of LEB128).

Rules#

keys must be in strict lexicographic byte order.
duplicate keys are not allowed.
keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format <collection>/<rkey> restricts characters to a small subset of ASCII.
records should be encoded as DRISL, the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
any parse error should be treated as fatal for the entire archive.

Compression#

STAR-lite is intended to be externally compressed with zstd in transport or for storage.

TODO: include recommended zstd configs, and tables/graphs showing compression performance. should show vs CAR, and also compare gzip (maybe brotli?) to zstd settings

see ./compression-summary.md

STAR-lite algorithms#

While any atproto MST library can reconstruct a full repo MST by simply inserting each (key, record) pair, materializing the entire MST at once costs significant memory or i/o overhead.

We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to walk a fully-reconstructed MST without holding the entire tree in memory.

This enables efficient transformations, like verifying repository integrity, or conversion to stream-ordered atproto CARv1 format archive.

Archive verification#

Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded.

Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs. The encoded node bytes (and referenced record CIDs) are discarded, since we only need the node CID to help materialize a MST node.

The final output is the root MST node's CID, which verifies the entire archive if it matches the data field from the commit object.

Verification asserts the integrity of the repository contents: verifying the signature of the archive's commit object (if present) is a separate process, outside the scope of STAR. See atproto commit signatures

Pseudo-code#

# MstNode interface:
#   is_empty() => bool     true if the node has no subtree or value links
#   reset_to_empty()       clears the node to `empty` state
#   link_record(key, cid)  appends an entry with a key and value link
#   link_subtree(cid)      inserts a node link as the "left" child (empty node),
#                          or as the right-most entry's "right"
#   to_cbor() => bytes     canonical DAG-CBOR encoding of the MST node

def reconstruct_root_cid(key_record_pairs):
    """Compute the MST root CID from repo contents

    key_record_pairs must be in lexicographic key order (= depth-first mst walk)
    """
    stack: list[MstNode] = []
    prev_layer = -1

    # the actual walk. everything to the left of the stack is finalized.
    # anything remaining in the stack gets rolled up at the end.
    for (key, record_cbor) in key_record_pairs:
        key_layer = compute_mst_layer(key)

        # grow the stack if needed, init with empty nodes.
        while len(stack) <= key_layer:
            stack.append(MstNode())

        # finalize lower levels if this key is at a higher level than last.
        # higher key means everything lower in the stack is left-of-us now.
        if key_layer > prev_layer:
            for node, parent in zip(stack[:key_layer], stack[1:]):
                if node.is_empty():
                    continue  # skip possible empty bottom-most nodes
                cid = compute_cid(node.to_cbor())
                parent.link_subtree(cid)
                node.reset_to_empty()

        # add a node entry for the current record
        record_cid = compute_cid(record_cbor)
        stack[key_layer].link_record(key, record_cid)

        prev_layer = key_layer

    # finalize remaining stack
    for node, parent in zip(stack[:-1], stack[1:]):
        if node.is_empty():
            continue
        cid = compute_cid(node.to_cbor())
        parent.link_subtree(cid)
        node.reset_to_empty()

    # get the finished root node, finally.
    if len(stack) > 0:
        root = stack[-1]
    else:
        root = MstNode()  # empty repo: atproto CAR writes one single empty node

    root_cid = compute_cid(root.to_cbor())
    return root_cid

Conversion to CAR#

For preorder traversal block ordering of CAR files (aka "stream-friendly order"), each parent MST node must be included before all of its children. So while subtrees can be serialized into CAR-output-ready byte sequences as soon as they are frozen, they must be buffered until their parent node is frozen

Since our depth-first walk finalizes children before parents, and the final parent finalizes last, we must unfortunately buffer all serialized CAR frames while the tree is walked. The good news is that a disk-spill-friendly byte log works well for this buffering.

Pseudo-code#

# MstNode interface changes:
#   entries                       list of (key, cid, frame position, right link)
#   left, entries[].right         optional subtree link + stashed emit plan
#   link_record(key, cid, frame_position)  stash the carv1 frame's byte_log spot
#   link_subtree(cid, subtree_emit_plan)  stash an emit plan with the link

def car_frame(data_bytes: bytes) -> tuple[Cid, bytes]:
    """CARv1 block framing: [ varint | CID | data ]"""
    cid = compute_cid(data_bytes)
    data = cid.to_bytes() + data_bytes  # wire-encoded CID bytes, not the digest
    return cid, varint_bytes(len(data)) + data

def frame_at(byte_log: bytes, position: int) -> bytes:
    """Get a logged CARv1 frame from `position` using its own varint length"""
    varint_size, payload_len = varint_read(byte_log, position)
    frame_end = position + varint_size + payload_len
    return byte_log[position:frame_end]


def build_subtree_emit_plan(node: MstNode, node_frame_position):
    """assemble the stream-ordered emit plan for finalized subtree

    this is the core of how we drive the CAR preorder traversal output!

    node_frame_position: offset in the byte log of this node's own CARv1 frame
    returns: ordered list of byte_log indexes to serialized CARv1 frames
    """
    plan = []

    # first: the (CBOR-encoded) parent node itself
    plan.append(node_frame_position)

    # next, the left sub-subtree, if present
    if node.left:
        plan.extend(node.left.subtree_emit_plan)

    # finally, each value and entire value-right-subtree, in order:
    for entry in node.entries:
        # value first (always present in an MST entry)
        plan.append(entry.frame_position)
        # then after-value right sub-subtree (if present)
        if entry.right:
            plan.extend(entry.right.subtree_emit_plan)

    return plan


def to_stream_ordered_car_body(key_record_pairs):
    """Get a stream-ordered atproto CAR body from repository contents

    returns (root_cid, output_bytes) -- does not write a CAR header or the
    commit object's block, which must come first in the body for stream-order.

    key_record_pairs must be in lexicographic key order (= depth-first mst walk)
    """
    stack: list[MstNode] = []
    byte_log = bytearray()  # append-only storage of CARv1 frames
    prev_layer = -1

    # the actual walk. everything to the left of the stack is finalized.
    # anything remaining in the stack gets rolled up at the end.
    # serialized CARv1 frames appended into byte_log as we go.
    for (key, record_cbor) in key_record_pairs:
        key_layer = compute_mst_layer(key)

        # grow the stack if needed, init with empty nodes.
        while len(stack) <= key_layer:
            stack.append(MstNode())

        # finalize lower levels if this key is at a higher level than last.
        # higher key means everything lower in the stack is left-of-us now.
        if key_layer > prev_layer:
            for node, parent in zip(stack[:key_layer], stack[1:]):
                if node.is_empty():
                    continue  # skip possible empty bottom-most nodes

                # put finalized (+serialized, CAR-framed) node into the byte log
                cid, framed = car_frame(node.to_cbor())
                frame_position = len(byte_log)
                byte_log.extend(framed)

                # link it from the parent node now it's finalized with a CID
                node_emit_plan = build_subtree_emit_plan(node, frame_position)
                parent.link_subtree(cid, node_emit_plan)
                node.reset_to_empty()

        # put the current record into the byte log
        record_cid, framed = car_frame(record_cbor)
        frame_position = len(byte_log)
        byte_log.extend(framed)

        # and link it from the MST node's entries at this layer
        stack[key_layer].link_record(key, record_cid, frame_position)

        prev_layer = key_layer

    # finalize remaining stack
    for node, parent in zip(stack[:-1], stack[1:]):
        if node.is_empty():
            continue

        cid, framed = car_frame(node.to_cbor())
        frame_position = len(byte_log)
        byte_log.extend(framed)

        node_emit_plan = build_subtree_emit_plan(node, frame_position)
        parent.link_subtree(cid, node_emit_plan)
        node.reset_to_empty()

    # get the finished root node, finally.
    if len(stack) > 0:
        root = stack[-1]
    else:
        root = MstNode()  # empty repo: atproto CAR writes one single empty node

    # frame the root and get it in the logggggggg
    root_cid, framed = car_frame(root.to_cbor())
    root_frame_position = len(byte_log)
    byte_log.extend(framed)

    # and pull together the final emit plan
    root_emit_plan = build_subtree_emit_plan(root, root_frame_position)

    # walk the plan into the final output!!!
    output = bytearray()
    for position in root_emit_plan:
        output.extend(frame_at(byte_log, position))

    return root_cid, output

Structural similarity#

To emphasize the core of the MST-reconstructing algorithm, here is a diff between the main routine for verification vs. conversion to stream-ordered CARv1.

-def reconstruct_root_cid(key_record_pairs):
+def to_stream_ordered_car_body(key_record_pairs):
-    """Compute the MST root CID from repo contents
+    """Get a stream-ordered atproto CAR body from repository contents

     key_record_pairs must be in lexicographic key order (= depth-first mst walk)
     """
     stack: list[MstNode] = []
+    byte_log = bytearray()  # append-only storage of CARv1 frames
     prev_layer = -1

     # the actual walk. everything to the left of the stack is finalized.
     # anything remaining in the stack gets rolled up at the end.
     for (key, record_cbor) in key_record_pairs:
         key_layer = compute_mst_layer(key)

         # grow the stack if needed, init with empty nodes.
         while len(stack) <= key_layer:
             stack.append(MstNode())

         # finalize lower levels if this key is at a higher level than last.
         # higher key means everything lower in the stack is left-of-us now.
         if key_layer > prev_layer:
             for node, parent in zip(stack[:key_layer], stack[1:]):
                 if node.is_empty():
                     continue  # skip possible empty bottom-most nodes
+
+                # put finalized (+serialized, CAR-framed) node into the byte log
-                cid = compute_cid(node.to_cbor())
+                cid, framed = car_frame(node.to_cbor())
+                frame_position = len(byte_log)
+                byte_log.extend(framed)
+
+                # link it from the parent node now it's finalized with a CID
+                node_emit_plan = build_subtree_emit_plan(node, frame_position)
-                parent.link_subtree(cid)
+                parent.link_subtree(cid, node_emit_plan)
                 node.reset_to_empty()

         # add a node entry for the current record
+        # and put it into the byte log
-        record_cid = compute_cid(record_cbor)
+        record_cid, framed = car_frame(record_cbor)
+        frame_position = len(byte_log)
+        byte_log.extend(framed)
+
-        stack[key_layer].link_record(key, record_cid)
+        stack[key_layer].link_record(key, record_cid, frame_position)

         prev_layer = key_layer

     # finalize remaining stack
     for node, parent in zip(stack[:-1], stack[1:]):
         if node.is_empty():
             continue
+
-        cid = compute_cid(node.to_cbor())
+        cid, framed = car_frame(node.to_cbor())
+        frame_position = len(byte_log)
+        byte_log.extend(framed)
+
+        node_emit_plan = build_subtree_emit_plan(node, frame_position)
-        parent.link_subtree(cid)
+        parent.link_subtree(cid, node_emit_plan)
         node.reset_to_empty()

     # get the finished root node, finally.
     if len(stack) > 0:
         root = stack[-1]
     else:
         root = MstNode()  # empty repo: atproto CAR writes one single empty node

+    # frame the root and get it in the log!
-    root_cid = compute_cid(root.to_cbor())
+    root_cid, framed = car_frame(root.to_cbor())
+    root_frame_position = len(byte_log)
+    byte_log.extend(framed)
+
+    # and pull together the final emit plan
+    root_emit_plan = build_subtree_emit_plan(root, root_frame_position)
+
+    # walk the plan into the final output!!!
+    output = bytearray()
+    for position in root_emit_plan:
+        output.extend(frame_at(byte_log, position))
+
-    return root_cid
+    return root_cid, output

some old intuition-y words that might go somewhere but not here now#

Stream-ordered CARs (in "preorder traversal" block order) are a depth-first walk over the Merkle Search Tree, and keys encountered during a depth-first MST walk are in strict lexicographic order.

There is a a useful symmetry here:

every subtree of an MST occupies a contiguous region of the stream-order serialized CAR
every subtree of an MST spans a contiguous range lexicographically-ordered keys

So, any subtree-spanning range of keys (and records) can be materialized directly into its stream-ordered sequence of CAR blocks, independent of the rest of the archive.

Conversion from CAR#

TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it is stream-ordered, any streaming walker will work and it's pretty simple to write out.

Note: it is not possible to know if an atproto CAR is stream-ordered except by either knowing that it was encoded that way in advance, or by reading the entire archive first to verify.

MST state: node stack#

We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.

When a key's layer is greater than the previous key's layer, all in-progress MST nodes from lower layers are complete, and can be frozen: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.

At this point, the newly frozen nodes can be:

simply discarded, when verifying archive integrity,
serialized into runs of CAR-format blocks,
any other transformation

Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.

Configure Feed

Configure Feed

STAR-lite#

Compared to CARs:#

Compared to STAR-L0 and STAR-L1:#

Compression-oriented#

Format#

Header magic#

Header CID#

Header len + cbor: optional partial commit object#

Data: keys and records#

Varints#

Rules#

Compression#

STAR-lite algorithms#

Archive verification#

Pseudo-code#

Conversion to CAR#

Pseudo-code#

Structural similarity#

some old intuition-y words that might go somewhere but not here now#

Conversion from CAR#

MST state: node stack#