STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car

more spec cleanup

phil b8868ad4 77457b05

+9 -13
star-lite/readme.md
@@ -38,8 +38,6 @@
 [ magic | cid | len | cbor ] [ len | str | len | cbor ] [ len | str | len | cbor] …
 ```
 
-| name | type |
-| ----- | --------------------------------------- |
 | magic | three-byte mark to identify the format |
 | cid | atproto-format binary CID link |
 | len | unsigned varint |
@@ -63,13 +61,13 @@
 
 ### Header len + cbor: optional partial commit object
 
-When `len == 0`, no commit object is included in the archive. This is useful for archiving unsigned subtrees of a full repository tree -- the contents can still be verified from the preceeding CID field.
-
-When `len > 4096`, a parser should reject the commit object as being implausibly large. (TODO: we can probably set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).
+`len == 0` means the commit object was omitted. The header `CID` (above) still proves repository integrity, but identity and cryptographic signature (among other metadata) are not included. One possible use-case is archiving subtrees of a repository, but note that this is not the same as a "CAR slice" which can prove a subset of a repository all the way to its root.
 
-Otherwise, when `len > 0`, a partial commit object of exactly `len` bytes follows, in CBOR format. The partial commit has the same fields as an [atproto Commit Object][commit] except that the `data` field must be omitted.
+When `len <= 4096`, a partial commit object of exactly `len` bytes follows, in CBOR format. The partial commit has the same fields as an [atproto Commit Object][commit] except that the `data` field must be omitted.
 
 To verify the commit signature, use the Header CID (above) as the `data` field to compute the commit's signed CID.
+
+When `len > 4096`, a parser should reject the commit object as being implausibly large. (TODO: we can probably set an exact limit. DID max is 2048 in atproto, rev must be TID format, etc).
 
 
 ### Data: keys and records
@@ -106,20 +104,18 @@
 
 While any atproto MST library can reconstruct a full repo MST by simply inserting each `(key, record)` pair, materializing the entire MST at once costs significant memory or i/o overhead.
 
-We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**.
+We exploit the lexicographic key ordering of STAR-lite files (or any stream of lex-ordered key-record pairs) to **walk a fully-reconstructed MST without holding the entire tree in memory**. Only a narrow *stack* of MST nodes (one per layer) must be buffered.
 
-This enables efficient transformations, like verifying repository integrity, or conversion to stream-ordered atproto CARv1 format archive.
+This enables efficient transformations, like verifying repository integrity, or conversion to other formats, like stream-ordered atproto CARv1.
 
 
 ### Archive verification
 
-Verification requires MST reconstruction just like CAR conversion, but never requires temporary disk storage. Each record must be hashed to compute its CID, but its byte contents can be immediately discarded.
+Each record must be hashed to compute its CID, but its byte contents can be immediately discarded. MST nodes can also be discarded once finalized into CIDs.
 
-Layer-0 MST nodes are materialized with computed record CIDs, then encoded, then hashed, to produce node CIDs. The encoded node bytes (and referenced record CIDs) are discarded, since we only need the node CID to help materialize a MST node.
+The final output is the root MST node's CID. It must match the data `CID` field from the header, or else the archive is corrupt.
 
-The final output is the root MST node's CID, which verifies the entire archive if it matches the `data` field from the commit object.
-
-Verification asserts the integrity of the repository contents: verifying the signature of the archive's [commit object][commit] (if present) is a separate process, outside the scope of STAR. See atproto [commit signatures][commit-sigs]
+_Note: as mentioned above, this is only an integrity check. **Authenticity** requires verifying the signature from the [commit object][commit] by resolving a DID to find a public key -- the [same process as with CAR files][commit-sigs], and an external concern for STAR-lite._
 
 
 #### Pseudo-code
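Reviewer sketch (not part of the commit): the reworded header `len` rules in the "Header len + cbor" hunk can be read as a three-way branch. The Python below is a minimal illustration of that reading, not the project's parser. It assumes the "unsigned varint" is the LEB128-style encoding used by multiformats, and that the stream is already positioned just after the magic bytes and header CID. The names `read_uvarint`, `read_partial_commit`, and `MAX_COMMIT_LEN` are hypothetical.

```python
import io

# Limit quoted in the readme text; the readme itself marks it as provisional (TODO).
MAX_COMMIT_LEN = 4096


def read_uvarint(stream):
    """Decode an unsigned varint from a binary stream.

    Assumption: LEB128-style encoding (7 data bits per byte, high bit set
    means "more bytes follow"), capped at 9 bytes.
    """
    value = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        value |= (byte[0] & 0x7F) << shift
        if byte[0] & 0x80 == 0:
            return value
        shift += 7
        if shift >= 63:
            raise ValueError("varint is longer than 9 bytes")


def read_partial_commit(stream):
    """Return the raw CBOR bytes of the partial commit, or None if omitted.

    Assumes the stream is positioned right after the header CID.
    """
    length = read_uvarint(stream)
    if length == 0:
        # Commit object omitted: integrity can still be checked via the header CID.
        return None
    if length > MAX_COMMIT_LEN:
        # Implausibly large commit object: reject rather than read it.
        raise ValueError(f"commit object too large: {length} bytes")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("stream ended inside the commit object")
    return data
```

The returned bytes are left undecoded on purpose: CBOR decoding and signature verification are separate steps, and this sketch only covers the length handling that the hunk reorders.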