STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car

bunch of scattered thoughts about nodes

phil 71131d74 4571667c

+70 -1
readme.md
···

  - `len` (varint): The length of the proceeding atproto commit object in bytes.

- - `commit`: An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):
+ - `commit` (DAG-CBOR): An atproto commit object in `DAG-CBOR` derived from the [repo spec](https://www.ietf.org/archive/id/draft-holmgren-at-repository-00.html#name-commit-objects):

    - `did` (string, nullable): same as repo spec
    - `version` (integer, required): corresponding CAR repo format version, currently fixed value of `3`
···
  - `node`: TODO: we need a new node format. It must be convertible back to a repo-spec style node.

  - `record`: The atproto record. Its CID can be computed over the bytes of its `block` (see below).
+
+ ### node / base
+
+ ```
+ |----- node -----|
+ [ len | mst node ]
+ ```
+
+ - `len` (varint): the length of the proceeding CBOR block, in bytes.
+
+ - `mst node` (DAG-CBOR): object with the following schema
+   - `l` (hash link, nullable)
+
+ note1: it's a bit tempting to redesign the MST nodes, because the _reason_ (and lack of special-ness) for `l` being separate from the entries in `e` took a long time for me to understand. but the existing format definitely works so maybe sticking close to it is the move?
+
+ note2: a magic special zero hash-link is a pretty gross way to shoehorn in a sentinel! null was already taken because subtrees always are optional
+
+ (this section is very much in flux)
+
+ was thinking of making base (depth=0) nodes special (implicit cid) and then further simplifying to a simple array of entries since they can't have subtrees (`l` or `t`s).
+
+ buuuutttt it's probably simpler just to give the node a nullable `cid` property that's required when depth=0.
+
+ on the other track, i was thinking nodes could be rewritten as a pair of arrays
+
+ ```
+ index:    [        0        ,        1        ,        2        ,       3       ]
+
+ new
+ entries:  [  (keyA, linkA)  ,  (keyB, linkB)  ,  (keyC, linkC)  ] xxxxxxxxxxxxxxx
+ trees:    [ * tree before A , * tree before B ,     <null>      , *tree after C ]
+
+ vs old repo spec
+ mst node: [   tree in `l`   ,   keyA's `t`    , keyB's null `t` ,  keyC's `t`   ]
+ ```
+
+ i think most languages can handle a pair of arrays ok with zip? but the equal-or-one-shorter length of `entries` compared to `trees` seems like asking for bugs.
+
+ so let's keep it simple (similar to the repo spec), trying again:
+
+
+ ```
+ |----- node -----|
+ [ len | mst node ]
+ ```
+
+ - `len` (varint): the length of the proceeding CBOR block, in bytes.
+
+ - `mst node` (DAG-CBOR): object with the following schema
+   - `cid` (hash link, nullable): the CID of this MST node. must be `null` for nodes at `depth=0`; required to be non-null for nodes at any higher `depth`.
+   - `l` (hash link, nullable): reference to a subtree at a lower depth containing only keys to the left of this node. when the referenced node is included in the archive, it must be given a special zeroed-out link reference (all zero bytes (deal with hash link prefixes or whatever... probably can assume sha256 but careful for lossless reversibility back to CAR))
+   - `e` (array, required): ordered array of entry objects, each containing:
+     - `p` (integer, required): number of bytes shared with the previous entry (TODO key compression actually)
+     - `k` (byte string, required): key suffix remaining
+     - `v` (hash link, **nullable**): reference to the record data for this key. must be null if the STAR includes the record; must _not_ be null if the record is not included in the STAR
+     - `t` (hash link, nullable): link to a subtree that sorts to the right of this entry's key and to the left of the next entry's key. see `l` above.
+
+ NOTE: the option to not include `v` (and requiring its hash link to be present in that case) keeps the option open for `key->CID`-only archives, which can be nice for things like diffing a repo to handle a firehose `#sync` event, or perhaps to exclude large records specifically from the archive. (make this cohesive with optional vs null handling if using that)
+
+ TODO: nullable vs optional? (in general??)
+
+ tempting to do something like:
+
+ - omitted means there is no subtree
+ - null means there is a subtree and it's included (CID to-calculate)
+ - non-null means there is a subtree and it's *not* included (MST slice or sparse tree)
+
+ hmmm: having separate optional and null cases might make deserializing into some languages tricky. i'm not sure if serde can handle that well? omitempty + nullable => `Option<Option<T>>`? should probably check other languages.
+

  ### record
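
On the "nullable vs optional" TODO in the added node section: a minimal Python sketch of how the three proposed states (omitted, null, non-null) could be told apart once an `mst node` has been decoded into a plain mapping. This is only an illustration of the tri-state idea, not part of the format. The names `SubtreeState` and `classify_link` are hypothetical, and the node is assumed to arrive as a `dict` in which an omitted field is simply a missing key and a hash link is represented as raw bytes.

```python
from enum import Enum
from typing import Any, Mapping, Optional, Tuple


class SubtreeState(Enum):
    """The three states proposed for a subtree link such as `l` or `t`."""
    NONE = "omitted: there is no subtree"
    INCLUDED = "null: subtree exists and is included in the archive"
    EXTERNAL = "non-null: subtree exists but is not included"


def classify_link(
    node: Mapping[str, Any], field: str
) -> Tuple[SubtreeState, Optional[bytes]]:
    """Classify one link field of a decoded node (hypothetical helper).

    Assumes `node` is an already-decoded mapping where an omitted field
    is a missing key, null is None, and a hash link is raw bytes.
    """
    if field not in node:
        # Key absent entirely: no subtree at this position.
        return SubtreeState.NONE, None
    value = node[field]
    if value is None:
        # Explicit null: subtree is in the archive, CID still to be computed.
        return SubtreeState.INCLUDED, None
    # Anything else is treated as the link to a subtree that was left out.
    return SubtreeState.EXTERNAL, value


if __name__ == "__main__":
    for node in ({}, {"l": None}, {"l": b"\x01\x02"}):
        state, link = classify_link(node, "l")
        print(node, "->", state.name, link)
```

Note that `node.get("l")` returns `None` for both the omitted and the null case, so a decoder that relies on it silently merges two of the three states. That is the same hazard the diff raises for typed languages, where the distinction needs something like `Option<Option<T>>`.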