STreaming ARchives: stricter, verifiable, deterministic, highly compressible alternatives to CAR files for atproto repositories.
atproto car

throw in a diff of the two main pseudo-codes

was kinda hoping this would be easier to read but ehh

phil 444ef0ef 0373a334

+113 -11
star-lite/readme.md
···
 Verification asserts the integrity of the repository contents: verifying the signature of the archive's [commit object][commit] (if present) is a separate process, outside the scope of STAR. See atproto [commit signatures][commit-sigs]


+#### Pseudo-code
+
 ```python
 # MstNode interface:
 #   is_empty() => bool   true if the node has no subtree or value links
···
             for node, parent in zip(stack[:key_layer], stack[1:]):
                 if node.is_empty():
                     continue  # skip possible empty bottom-most nodes
-                parent.link_subtree(compute_cid(node.to_cbor()))
+                cid = compute_cid(node.to_cbor())
+                parent.link_subtree(cid)
                 node.reset_to_empty()

         # add a node entry for the current record
-        stack[key_layer].link_record(key, compute_cid(record_cbor))
+        record_cid = compute_cid(record_cbor)
+        stack[key_layer].link_record(key, record_cid)

         prev_layer = key_layer

···
     for node, parent in zip(stack[:-1], stack[1:]):
         if node.is_empty():
             continue
-        parent.link_subtree(compute_cid(node.to_cbor()))
+        cid = compute_cid(node.to_cbor())
+        parent.link_subtree(cid)
         node.reset_to_empty()

     # get the finished root node, finally.
···
     else:
         root = MstNode()  # empty repo: atproto CAR writes one single empty node

-    return compute_cid(root.to_cbor())
+    root_cid = compute_cid(root.to_cbor())
+    return root_cid
 ```


···

 Since our depth-first walk finalizes children before parents, and the final parent finalizes last, we must unfortunately buffer all serialized CAR frames while the tree is walked. The good news is that a disk-spill-friendly byte log works well for this buffering.

-#### pseudo-code
+#### Pseudo-code

 ```python
 # MstNode interface changes:
 #   entries                                list of (key, cid, frame position, right link)
 #   left, entries[].right                  optional subtree link + stashed emit plan
-#   link_record(key, cid, frame_pos)       stash the carv1 frame's byte log position
+#   link_record(key, cid, frame_position)  stash the carv1 frame's byte_log spot
 #   link_subtree(cid, subtree_emit_plan)   stash an emit plan with the link

 def car_frame(data_bytes: bytes) -> tuple[Cid, bytes]:
···
     key_record_pairs must be in lexicographic key order (= depth-first mst walk)
     """
     stack: list[MstNode] = []
-    byte_log = bytearray()
+    byte_log = bytearray()  # append-only storage of CARv1 frames
     prev_layer = -1

     # the actual walk. everything to the left of the stack is finalized.
···
                     continue  # skip possible empty bottom-most nodes

                 # put finalized (+serialized, CAR-framed) node into the byte log
-                frame_position = len(byte_log)
                 cid, framed = car_frame(node.to_cbor())
+                frame_position = len(byte_log)
                 byte_log.extend(framed)

                 # link it from the parent node now it's finalized with a CID
···
                 node.reset_to_empty()

         # put the current record into the byte log
-        frame_position = len(byte_log)
         record_cid, framed = car_frame(record_cbor)
+        frame_position = len(byte_log)
         byte_log.extend(framed)

         # and link it from the MST node's entries at this layer
···
         if node.is_empty():
             continue

-        frame_position = len(byte_log)
         cid, framed = car_frame(node.to_cbor())
+        frame_position = len(byte_log)
         byte_log.extend(framed)

         node_emit_plan = build_subtree_emit_plan(node, frame_position)
···
         root = MstNode()  # empty repo: atproto CAR writes one single empty node

     # frame the root and get it in the logggggggg
-    root_frame_position = len(byte_log)
     root_cid, framed = car_frame(root.to_cbor())
+    root_frame_position = len(byte_log)
     byte_log.extend(framed)

     # and pull together the final emit plan
···
         output.extend(frame_at(byte_log, position))

     return root_cid, output
+```
+
+
+#### Structural similarity
+
+To emphasize the core of the MST-reconstructing algorithm, here is a diff between the main routine for verification vs. conversion to stream-ordered CARv1.
+
+```diff,python
+-def reconstruct_root_cid(key_record_pairs):
++def to_stream_ordered_car_body(key_record_pairs):
+-    """Compute the MST root CID from repo contents
++    """Get a stream-ordered atproto CAR body from repository contents
+
+     key_record_pairs must be in lexicographic key order (= depth-first mst walk)
+     """
+     stack: list[MstNode] = []
++    byte_log = bytearray()  # append-only storage of CARv1 frames
+     prev_layer = -1
+
+     # the actual walk. everything to the left of the stack is finalized.
+     # anything remaining in the stack gets rolled up at the end.
+     for (key, record_cbor) in key_record_pairs:
+         key_layer = compute_mst_layer(key)
+
+         # grow the stack if needed, init with empty nodes.
+         while len(stack) <= key_layer:
+             stack.append(MstNode())
+
+         # finalize lower levels if this key is at a higher level than last.
+         # higher key means everything lower in the stack is left-of-us now.
+         if key_layer > prev_layer:
+             for node, parent in zip(stack[:key_layer], stack[1:]):
+                 if node.is_empty():
+                     continue  # skip possible empty bottom-most nodes
++
++                # put finalized (+serialized, CAR-framed) node into the byte log
+-                cid = compute_cid(node.to_cbor())
++                cid, framed = car_frame(node.to_cbor())
++                frame_position = len(byte_log)
++                byte_log.extend(framed)
++
++                # link it from the parent node now it's finalized with a CID
++                node_emit_plan = build_subtree_emit_plan(node, frame_position)
+-                parent.link_subtree(cid)
++                parent.link_subtree(cid, node_emit_plan)
+                 node.reset_to_empty()
+
+         # add a node entry for the current record
++        # and put it into the byte log
+-        record_cid = compute_cid(record_cbor)
++        record_cid, framed = car_frame(record_cbor)
++        frame_position = len(byte_log)
++        byte_log.extend(framed)
++
+-        stack[key_layer].link_record(key, record_cid)
++        stack[key_layer].link_record(key, record_cid, frame_position)
+
+         prev_layer = key_layer
+
+     # finalize remaining stack
+     for node, parent in zip(stack[:-1], stack[1:]):
+         if node.is_empty():
+             continue
++
+-        cid = compute_cid(node.to_cbor())
++        cid, framed = car_frame(node.to_cbor())
++        frame_position = len(byte_log)
++        byte_log.extend(framed)
++
++        node_emit_plan = build_subtree_emit_plan(node, frame_position)
+-        parent.link_subtree(cid)
++        parent.link_subtree(cid, node_emit_plan)
+         node.reset_to_empty()
+
+     # get the finished root node, finally.
+     if len(stack) > 0:
+         root = stack[-1]
+     else:
+         root = MstNode()  # empty repo: atproto CAR writes one single empty node
+
++    # frame the root and get it in the log!
+-    root_cid = compute_cid(root.to_cbor())
++    root_cid, framed = car_frame(root.to_cbor())
++    root_frame_position = len(byte_log)
++    byte_log.extend(framed)
++
++    # and pull together the final emit plan
++    root_emit_plan = build_subtree_emit_plan(root, root_frame_position)
++
++    # walk the plan into the final output!!!
++    output = bytearray()
++    for position in root_emit_plan:
++        output.extend(frame_at(byte_log, position))
++
+-    return root_cid
++    return root_cid, output
 ```
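
Both routines lean on helpers whose bodies this diff does not show: `compute_mst_layer`, `compute_cid`, `car_frame` and `frame_at`. Below is a rough sketch of how they might look, not the readme's actual definitions. It assumes the usual atproto conventions: CIDv1 with the dag-cbor codec and a sha2-256 multihash, MST layer = leading zero bits of the key's SHA-256 divided by two, and CARv1 frames of the form varint(length) + CID + data. CIDs are plain `bytes` here rather than the readme's `Cid` type, and `leading_zero_bits`, `encode_varint` and `decode_varint` are hypothetical helper names.

```python
import hashlib

# CIDv1 prefix (assumed): version 1, dag-cbor (0x71), sha2-256 (0x12), 32-byte digest
CID_PREFIX = bytes([0x01, 0x71, 0x12, 0x20])


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits before the first set bit (hypothetical helper)."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count


def compute_mst_layer(key: str) -> int:
    """MST layer of a key: leading zero bits of sha256(key), in 2-bit chunks."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // 2


def compute_cid(data_bytes: bytes) -> bytes:
    """Binary CIDv1 for a dag-cbor block (raw bytes, not a Cid object)."""
    return CID_PREFIX + hashlib.sha256(data_bytes).digest()


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128, as used for CARv1 frame lengths."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf, pos: int) -> tuple[int, int]:
    """Return (value, position just past the varint)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def car_frame(data_bytes: bytes) -> tuple[bytes, bytes]:
    """Return (cid, frame) where frame = varint(len(cid + data)) + cid + data."""
    cid = compute_cid(data_bytes)
    body = cid + data_bytes
    return cid, encode_varint(len(body)) + body


def frame_at(byte_log, position: int) -> bytes:
    """Read back one whole frame (length prefix included) from the byte log."""
    length, body_start = decode_varint(byte_log, position)
    return bytes(byte_log[position:body_start + length])
```

Because every frame carries its own length prefix, the byte log only needs to remember start positions, which is all the emit plans have to stash. The same `frame_at` logic would work over a file opened for random access if the log is spilled to disk.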