STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car
9
fork

Configure Feed

Select the types of activity you want to include in your feed.

STAR-lite#

STreaming ARchive repository format (extra light version): a stricter, simpler, still verifiable, highly compressible alternative to CAR.

STAR-lite describes both a binary encoding, and an efficient algorithm to verify or transform sorted key-record pairs into stream-ordered CAR files. Together, they make STAR-lite suitable as both a network transport, and for long-term repo archiving, without sacrificing interoperability.

Compared to CARs:#

  • No MST node blocks or CIDs, eliminating the least-compressible content.
  • Strict content ordering, deterministic encoding.
  • Bounded-memory conversion to stream-ordered CAR.

Compared to STAR-L0 and STAR-L1:#

  • Smallest archive format (with zstd compression)
  • Content verification requires a complete scan of all content
  • No support for sparse archives or "CAR slice"-like proofs (yet)
  • Disk spilling required for memory-bounded streaming of large archives

Compression-oriented#

STAR-lite shines when compressed: 2x smaller than CARs for the same compression settings and over 6x smaller than raw CARs with zstd-22.

three bar charts, each with multiple series: Compressed size vs uncompressed CAR (series: CAR, blue; STAR-lite, red). Y-axis: 0–100%. X groups: uncompressed, zstd --fast 1, zstd 3, zstd 9, zstd 9 +long, zstd 22. CAR starts at 100% (by definition) and lands just under 33%. there isn't a big change from zstd3 through zstd 22. STAR-lite starts at a bit over 75% and lands around 16%. there's a bit more improvement at the higher zstd levels. The final bars of each series are labelled: 3.09x (CAR), 6.29x (STAR-lite). Bucketed STAR-lite / CAR by compressor. Same percentage Y-axis, series are each compression level including uncompressed. X groups are: "< 100KiB", "100 KiB–1MiB", "1–10MiB", "10–100MiB", ">= 100MiB", "weighted average" Uncompressed series: about 70% for <100KiB, levels off around 75% over the other buckets. zstd --fast 1: about half-way between uncompressed and zstd-3 for all buckets. zstd 3: just over 50% uncompressed, rising to perhaps 55% by 10–100MiB. bit over 50% weighted average. higher compressions start very flat (little benefit over zstd 3) at <100KiB, and show more spread as the CAR size increases. At >=100MiB, zstd22 takes a dramatic drop to about 34%. The weighted average shows a pretty clean spread: zstd3 is much better than zstd --fast 1, zstd9 is a little better, zstd 9 +long is similarly a little better, and zstd 22 is better again by a bit more. Bucketed count of repos & weight (total size on disk). X-groups: same buckets without weighted average. Left Y-axis and "count of CARs" series: power-curve rapid dropoff: about 350K repos < 100KiB, fewer than 100K 100K–1M, much less than 50K 1–10M, barely visible bar 10–100M; invisible at >= 100M. Right Y-axis and "total bucket size" series (the weight of the weighted average): <100KiB has the smallest total, less than 10GiB; then the last bucket, >=100MiB has slightly more. Middle buckets in order: 100K–1M a bit over 25GiB; 1–10M around 90GiB; 10–100M almost 100GiB

Read more about compression.

Format#

STAR-lite is a flat list of every key/record pair in a repository, in lexicographic key order, with commit details in its header. It's suited for single-pass streaming.

|--------- header ---------| |------------------ data (records) -------------------|
[ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
type
magic three-byte mark to identify the format
cid atproto-format binary CID link
len unsigned varint
str utf-8 bytes
cbor cbor bytes

Header magic#

Three bytes: 0x2A 0x6C 0x00, ASCII for *l\0: "star", "lite", version 0.

Header CID#

A 36-byte CID link to the root node of the repo's atproto Merkle Search Tree (MST) representation. This is the data field from a full atproto Commit object, and it verifies the entire archive's integrity (see Archive Verification, below).

It consists of a four-byte fixed prefix, 0x01711220, followed by a 32-byte sha256 digest. See the atproto repo spec and/or DASL-CID.

An empty repository (zero keys) must has CID bafyreihmh6lpqcmyus4kt4rsypvxgvnvzkmj4aqczyewol5rsf7pdzzta4 — the CID of a single empty atproto MST node, which is how atproto CAR files represent empty repositories.

Header len + cbor: optional partial commit object#

len == 0 means the commit object was omitted. The header CID (above) still proves repository integrity, but identity and cryptographic signature (among other metadata) are not included. One possible use-case is archiving subtrees of a repository, but note that this is not the same as a "CAR slice" which can prove a subset of a repository all the way to its root.

When len <= 4096, a partial commit object of exactly len bytes follows, in CBOR format. The partial commit has the same fields as an atproto Commit Object except that the data field must be omitted.

To verify the commit signature, use the Header CID (above) as the data field to compute the commit's signed CID.

When len > 4096, a parser should reject the commit object as being implausibly large. (TODO: we can probably set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).

Data: keys and records#

Zero or more records until EOF. Each is:

type
key len varint, max: 830
key str utf-8 bytes, exactly key len length
record len varint, max: 1,048,576
record cbor cbor bytes, exactly record len length

The maximum key length comes from the combined limits of the <collection>/<rkey> syntax for atproto repo paths: 317 for the collection + 1 for the / slash + 512 for the rkey.

The maximum record size of 1MiB (1,048,576 bytes) comes from the atproto recommended data limits.

Varints#

Length prefixes in STAR are encoded as unsigned variable-length integers (varint, a variant of LEB128).

Rules#

  • keys must be in strict lexicographic byte order.
  • duplicate keys are not allowed.
  • keys should be valid atproto repo paths: the format specifies utf-8, but in practice the required repo path format <collection>/<rkey> restricts characters to a small subset of ASCII.
  • records should be encoded as DRISL, the deterministic subset of CBOR used by atproto, though parsers are not required to interpret record bytes at all.
  • any parse error should be treated as fatal for the entire archive.

STAR-lite algorithms#

While any atproto MST library can reconstruct a full repo MST by simply inserting each (key, record) pair, materializing the entire MST at once costs significant memory or i/o overhead.

We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to walk a fully-reconstructed MST without holding the entire tree in memory. Only a narrow stack of MST nodes (one per layer) must be buffered.

This enables efficient transformations, like verifying repository integrity, or conversion to other formats, like stream-ordered atproto CARv1.

Archive verification#

Each record must be hashed to compute its CID, but its byte contents can be immediately discarded. MST nodes can also be discarded once finalized into CIDs.

The final output is the root MST node's CID. It must match the data CID field from the header, or else the archive is corrupt.

Note: as mentioned above, this is only an integrity check. Authenticity requires verifying the signature from the commit object by resolving a DID to find a public key -- the same process as with CAR files, and an external concern for STAR-lite.

Pseudo-code#

# MstNode interface:
#   is_empty() => bool     true if the node has no subtree or value links
#   reset_to_empty()       clears the node to `empty` state
#   link_record(key, cid)  appends an entry with a key and value link
#   link_subtree(cid)      inserts a node link as the "left" child (empty node),
#                          or as the right-most entry's "right"
#   to_cbor() => bytes     canonical DAG-CBOR encoding of the MST node

def reconstruct_root_cid(key_record_pairs):
    """Compute the MST root CID from repo contents

    key_record_pairs must be in lexicographic key order (= depth-first mst walk)
    """
    stack: list[MstNode] = []
    prev_layer = -1

    # the actual walk. everything to the left of the stack is finalized.
    # anything remaining in the stack gets rolled up at the end.
    for (key, record_cbor) in key_record_pairs:
        key_layer = compute_mst_layer(key)

        # grow the stack if needed, init with empty nodes.
        while len(stack) <= key_layer:
            stack.append(MstNode())

        # finalize lower levels if this key is at a higher level than last.
        # higher key means everything lower in the stack is left-of-us now.
        if key_layer > prev_layer:
            for node, parent in zip(stack[:key_layer], stack[1:]):
                if node.is_empty():
                    continue  # skip possible empty bottom-most nodes
                cid = compute_cid(node.to_cbor())
                parent.link_subtree(cid)
                node.reset_to_empty()

        # add a node entry for the current record
        record_cid = compute_cid(record_cbor)
        stack[key_layer].link_record(key, record_cid)

        prev_layer = key_layer

    # finalize remaining stack
    for node, parent in zip(stack[:-1], stack[1:]):
        if node.is_empty():
            continue
        cid = compute_cid(node.to_cbor())
        parent.link_subtree(cid)
        node.reset_to_empty()

    # get the finished root node, finally.
    if len(stack) > 0:
        root = stack[-1]
    else:
        root = MstNode()  # empty repo: atproto CAR writes one single empty node

    root_cid = compute_cid(root.to_cbor())
    return root_cid

Conversion to CAR#

For preorder traversal block ordering of CAR files (aka "stream-friendly order"), each parent MST node must be included before all of its children. So while subtrees can be serialized into CAR-output-ready byte sequences as soon as they are frozen, they must be buffered until their parent node is frozen

Since our depth-first walk finalizes children before parents, and the final parent finalizes last, we must unfortunately buffer all serialized CAR frames while the tree is walked. The good news is that a disk-spill-friendly byte log works well for this buffering.

There is a nice optimization available if read-out performance suffers from the random access patterns (eg., on HDDs or networked block storage): keep a separate byte log for records and MST nodes. Most of the data is in the records, and they are inserted in slow-storage-friendly read-out order.

Pseudo-code#

# MstNode interface changes:
#   entries                       list of (key, cid, frame position, right link)
#   left, entries[].right         optional subtree link + stashed emit plan
#   link_record(key, cid, frame_position)  stash the carv1 frame's byte_log spot
#   link_subtree(cid, subtree_emit_plan)  stash an emit plan with the link

def car_frame(data_bytes: bytes) -> tuple[Cid, bytes]:
    """CARv1 block framing: [ varint | CID | data ]"""
    cid = compute_cid(data_bytes)
    data = cid.to_bytes() + data_bytes  # wire-encoded CID bytes, not the digest
    return cid, varint_bytes(len(data)) + data

def frame_at(byte_log: bytes, position: int) -> bytes:
    """Get a logged CARv1 frame from `position` using its own varint length"""
    varint_size, payload_len = varint_read(byte_log, position)
    frame_end = position + varint_size + payload_len
    return byte_log[position:frame_end]


def build_subtree_emit_plan(node: MstNode, node_frame_position):
    """assemble the stream-ordered emit plan for finalized subtree

    this is the core of how we drive the CAR preorder traversal output!

    node_frame_position: offset in the byte log of this node's own CARv1 frame
    returns: ordered list of byte_log indexes to serialized CARv1 frames
    """
    plan = []

    # first: the (CBOR-encoded) parent node itself
    plan.append(node_frame_position)

    # next, the left sub-subtree, if present
    if node.left:
        plan.extend(node.left.subtree_emit_plan)

    # finally, each value and entire value-right-subtree, in order:
    for entry in node.entries:
        # value first (always present in an MST entry)
        plan.append(entry.frame_position)
        # then after-value right sub-subtree (if present)
        if entry.right:
            plan.extend(entry.right.subtree_emit_plan)

    return plan


def to_stream_ordered_car_body(key_record_pairs):
    """Get a stream-ordered atproto CAR body from repository contents

    returns (root_cid, output_bytes) -- does not write a CAR header or the
    commit object's block, which must come first in the body for stream-order.

    key_record_pairs must be in lexicographic key order (= depth-first mst walk)
    """
    stack: list[MstNode] = []
    byte_log = bytearray()  # append-only storage of CARv1 frames
    prev_layer = -1

    # the actual walk. everything to the left of the stack is finalized.
    # anything remaining in the stack gets rolled up at the end.
    # serialized CARv1 frames appended into byte_log as we go.
    for (key, record_cbor) in key_record_pairs:
        key_layer = compute_mst_layer(key)

        # grow the stack if needed, init with empty nodes.
        while len(stack) <= key_layer:
            stack.append(MstNode())

        # finalize lower levels if this key is at a higher level than last.
        # higher key means everything lower in the stack is left-of-us now.
        if key_layer > prev_layer:
            for node, parent in zip(stack[:key_layer], stack[1:]):
                if node.is_empty():
                    continue  # skip possible empty bottom-most nodes

                # put finalized (+serialized, CAR-framed) node into the byte log
                cid, framed = car_frame(node.to_cbor())
                frame_position = len(byte_log)
                byte_log.extend(framed)

                # link it from the parent node now it's finalized with a CID
                node_emit_plan = build_subtree_emit_plan(node, frame_position)
                parent.link_subtree(cid, node_emit_plan)
                node.reset_to_empty()

        # put the current record into the byte log
        record_cid, framed = car_frame(record_cbor)
        frame_position = len(byte_log)
        byte_log.extend(framed)

        # and link it from the MST node's entries at this layer
        stack[key_layer].link_record(key, record_cid, frame_position)

        prev_layer = key_layer

    # finalize remaining stack
    for node, parent in zip(stack[:-1], stack[1:]):
        if node.is_empty():
            continue

        cid, framed = car_frame(node.to_cbor())
        frame_position = len(byte_log)
        byte_log.extend(framed)

        node_emit_plan = build_subtree_emit_plan(node, frame_position)
        parent.link_subtree(cid, node_emit_plan)
        node.reset_to_empty()

    # get the finished root node, finally.
    if len(stack) > 0:
        root = stack[-1]
    else:
        root = MstNode()  # empty repo: atproto CAR writes one single empty node

    # frame the root and get it in the logggggggg
    root_cid, framed = car_frame(root.to_cbor())
    root_frame_position = len(byte_log)
    byte_log.extend(framed)

    # and pull together the final emit plan
    root_emit_plan = build_subtree_emit_plan(root, root_frame_position)

    # walk the plan into the final output!!!
    output = bytearray()
    for position in root_emit_plan:
        output.extend(frame_at(byte_log, position))

    return root_cid, output

Structural similarity#

To emphasize the core of the MST-reconstructing algorithm, here is a diff between the main routine for verification vs. conversion to stream-ordered CARv1.

-def reconstruct_root_cid(key_record_pairs):
+def to_stream_ordered_car_body(key_record_pairs):
-    """Compute the MST root CID from repo contents
+    """Get a stream-ordered atproto CAR body from repository contents

     key_record_pairs must be in lexicographic key order (= depth-first mst walk)
     """
     stack: list[MstNode] = []
+    byte_log = bytearray()  # append-only storage of CARv1 frames
     prev_layer = -1

     # the actual walk. everything to the left of the stack is finalized.
     # anything remaining in the stack gets rolled up at the end.
     for (key, record_cbor) in key_record_pairs:
         key_layer = compute_mst_layer(key)

         # grow the stack if needed, init with empty nodes.
         while len(stack) <= key_layer:
             stack.append(MstNode())

         # finalize lower levels if this key is at a higher level than last.
         # higher key means everything lower in the stack is left-of-us now.
         if key_layer > prev_layer:
             for node, parent in zip(stack[:key_layer], stack[1:]):
                 if node.is_empty():
                     continue  # skip possible empty bottom-most nodes
+
+                # put finalized (+serialized, CAR-framed) node into the byte log
-                cid = compute_cid(node.to_cbor())
+                cid, framed = car_frame(node.to_cbor())
+                frame_position = len(byte_log)
+                byte_log.extend(framed)
+
+                # link it from the parent node now it's finalized with a CID
+                node_emit_plan = build_subtree_emit_plan(node, frame_position)
-                parent.link_subtree(cid)
+                parent.link_subtree(cid, node_emit_plan)
                 node.reset_to_empty()

         # add a node entry for the current record
+        # and put it into the byte log
-        record_cid = compute_cid(record_cbor)
+        record_cid, framed = car_frame(record_cbor)
+        frame_position = len(byte_log)
+        byte_log.extend(framed)
+
-        stack[key_layer].link_record(key, record_cid)
+        stack[key_layer].link_record(key, record_cid, frame_position)

         prev_layer = key_layer

     # finalize remaining stack
     for node, parent in zip(stack[:-1], stack[1:]):
         if node.is_empty():
             continue
+
-        cid = compute_cid(node.to_cbor())
+        cid, framed = car_frame(node.to_cbor())
+        frame_position = len(byte_log)
+        byte_log.extend(framed)
+
+        node_emit_plan = build_subtree_emit_plan(node, frame_position)
-        parent.link_subtree(cid)
+        parent.link_subtree(cid, node_emit_plan)
         node.reset_to_empty()

     # get the finished root node, finally.
     if len(stack) > 0:
         root = stack[-1]
     else:
         root = MstNode()  # empty repo: atproto CAR writes one single empty node

+    # frame the root and get it in the log!
-    root_cid = compute_cid(root.to_cbor())
+    root_cid, framed = car_frame(root.to_cbor())
+    root_frame_position = len(byte_log)
+    byte_log.extend(framed)
+
+    # and pull together the final emit plan
+    root_emit_plan = build_subtree_emit_plan(root, root_frame_position)
+
+    # walk the plan into the final output!!!
+    output = bytearray()
+    for position in root_emit_plan:
+        output.extend(frame_at(byte_log, position))
+
-    return root_cid
+    return root_cid, output

some old intuition-y words that might go somewhere but not here now#

Stream-ordered CARs (in "preorder traversal" block order) are a depth-first walk over the Merkle Search Tree, and keys encountered during a depth-first MST walk are in strict lexicographic order.

There is a a useful symmetry here:

  • every subtree of an MST occupies a contiguous region of the stream-order serialized CAR
  • every subtree of an MST spans a contiguous range lexicographically-ordered keys

So, any subtree-spanning range of keys (and records) can be materialized directly into its stream-ordered sequence of CAR blocks, independent of the rest of the archive.

Conversion from CAR#

TODO: but basically: use repo-stream for a bounded-memory MST walk if you can't be certain that the CAR is stream-ordered. If it is stream-ordered, any streaming walker will work and it's pretty simple to write out.

Note: it is not possible to know if an atproto CAR is stream-ordered except by either knowing that it was encoded that way in advance, or by reading the entire archive first to verify.

MST state: node stack#

We don't need to materialize the entire MST at once for a depth-first tree-reconstructing walk across it: a narrow stack of MST nodes (one per layer of the tree) is sufficient state.

When a key's layer is greater than the previous key's layer, all in-progress MST nodes from lower layers are complete, and can be frozen: encoded in atproto MST node format to compute their CIDs, recursively resolving into a CID link from the current key's node.

At this point, the newly frozen nodes can be:

  • simply discarded, when verifying archive integrity,
  • serialized into runs of CAR-format blocks,
  • any other transformation

Once the entire tree has been walked and frozen, the highest-layer MST node can finally be considered frozen to produce the root node CID, which must match the CID in a STAR-lite file's header.