# consuming the firehose, then benchmarking it
since the last devlog (self-publishing docs), zat grew from a collection of string parsers and HTTP clients into something that can consume the full AT Protocol event stream — both jetstream (JSON) and the raw firehose (binary DAG-CBOR). then we benchmarked it against every other AT Protocol SDK.
## what we built
we built [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) — a cross-SDK benchmark that captures ~10 seconds of live firehose traffic, then decodes the full corpus with four SDKs.
every SDK does the same work per frame: decode CBOR header → decode CBOR payload → parse CAR → decode every CAR block as DAG-CBOR. block counts and error counts are reported per SDK so you can verify parity. per-pass variance (min/median/max) is reported so you can see how stable the numbers are.
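in Zig-ish pseudocode, the measured per-frame work has roughly this shape. the stage functions here are hypothetical stand-ins, not zat's actual API:

```zig
const std = @import("std");

// hypothetical stand-ins for the decode stages (zat's real API differs);
// they exist only so the shape of the work compiles.
fn cborDecodeHeader(frame: []const u8) !usize {
    _ = frame;
    return 0; // bytes consumed by the header map
}

fn cborDecodePayload(rest: []const u8) ![]const u8 {
    return rest; // the payload carries the CAR bytes
}

const CarIterator = struct {
    rest: []const u8,

    fn nextBlock(self: *CarIterator) ?[]const u8 {
        if (self.rest.len == 0) return null;
        const block = self.rest; // a real parser would read one varint-prefixed block
        self.rest = &.{};
        return block;
    }
};

fn dagCborDecodeBlock(allocator: std.mem.Allocator, block: []const u8) !void {
    _ = allocator;
    _ = block;
}

// the four steps the benchmark measures for every frame
fn decodeOneFrame(allocator: std.mem.Allocator, frame: []const u8) !usize {
    const header_len = try cborDecodeHeader(frame);               // 1. CBOR header
    const car_bytes = try cborDecodePayload(frame[header_len..]); // 2. CBOR payload
    var car = CarIterator{ .rest = car_bytes };                   // 3. parse CAR
    var blocks: usize = 0;
    while (car.nextBlock()) |block| {                             // 4. decode every block
        try dagCborDecodeBlock(allocator, block);                 //    as DAG-CBOR
        blocks += 1;
    }
    return blocks; // block counts are compared across SDKs to verify parity
}
```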
the corpus is captured with a CBOR header peek (check `t == "#commit"` and `ops` is non-empty) using zat's CBOR decoder. this is standard CBOR parsing — not zat's typed firehose decoder — but it does mean frames that zat's CBOR decoder rejects won't appear in the corpus.
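the filter itself is tiny. a sketch, with a hypothetical `FramePeek` struct standing in for the peeked header/payload fields:

```zig
const std = @import("std");

// hypothetical minimal view of the peeked frame; the real peek decodes
// just enough CBOR to read these two fields.
const FramePeek = struct {
    t: []const u8, // header "t" field
    ops_len: usize, // length of the payload's "ops" array
};

fn shouldCapture(peek: FramePeek) bool {
    return std.mem.eql(u8, peek.t, "#commit") and peek.ops_len > 0;
}

test "only non-empty commits enter the corpus" {
    try std.testing.expect(shouldCapture(.{ .t = "#commit", .ops_len = 3 }));
    try std.testing.expect(!shouldCapture(.{ .t = "#commit", .ops_len = 0 }));
    try std.testing.expect(!shouldCapture(.{ .t = "#identity", .ops_len = 1 }));
}
```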
the original version of these benchmarks had work asymmetry: zat only decoded op-linked blocks (~2.3k per corpus), while rust and go decoded all CAR blocks (~23k). python parsed CAR structure but didn't iterate blocks. the numbers below are from the corrected version where all SDKs decode every block.
### results
3,298 frames (16.2 MB), 5 measured passes, macOS arm64 (M3 Max):
| SDK | frames/sec (median) | MB/s | blocks/frame |
|-----|--------------------:|-----:|-------------:|
| zig (zat, arena reuse) | 461,827 | 2,268.9 | 9.98 |
| zig (zat, alloc per frame) | 395,485 | 1,890.0 | 9.98 |
| rust (jacquard) | 42,023 | 203.5 | 9.98 |
| python (atproto) | 24,026 | 118.0 | 9.98 |
| go (indigo) | 10,896 | 53.3 | 9.98 |
all SDKs: 0 errors. run-to-run variance is ~30-40% — compare ratios within a single run, not across runs.
### why zat is fast
**block decode cardinality.** each firehose frame contains a CAR with ~10 blocks (MST nodes + records). decoding every block as DAG-CBOR is the dominant cost — it's where most of the per-frame CPU time goes across all SDKs.
**arena allocation.** zat uses one arena per frame — a single `malloc` on the first frame, then `reset` (no syscall) on every subsequent frame. the "alloc per frame" variant creates and destroys an arena per frame (one `malloc` + one `free`), which is the fair comparison to what the other SDKs do. the "arena reuse" variant is the production pattern.
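a sketch of the two variants, assuming a recent Zig std (`ArenaAllocator.reset` with `.retain_capacity`) and a hypothetical `decodeOneFrame` standing in for the full decode path:

```zig
const std = @import("std");

// hypothetical stand-in for the full per-frame decode (header -> payload
// -> CAR -> every block as DAG-CBOR); all allocations go into `allocator`.
fn decodeOneFrame(allocator: std.mem.Allocator, frame: []const u8) !void {
    _ = allocator;
    _ = frame;
}

// "alloc per frame": one ArenaAllocator init + deinit per frame
// (one malloc/free pair), the fair comparison to the other SDKs.
fn benchAllocPerFrame(gpa: std.mem.Allocator, frames: []const []const u8) !void {
    for (frames) |frame| {
        var arena = std.heap.ArenaAllocator.init(gpa);
        defer arena.deinit();
        try decodeOneFrame(arena.allocator(), frame);
    }
}

// "arena reuse": init once, reset(.retain_capacity) between frames.
// reset keeps the backing buffer, so steady-state frames never touch
// the OS allocator at all; this is the production pattern.
fn benchArenaReuse(gpa: std.mem.Allocator, frames: []const []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit();
    for (frames) |frame| {
        _ = arena.reset(.retain_capacity);
        try decodeOneFrame(arena.allocator(), frame);
    }
}
```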
### rust and python
jacquard uses iroh-car for CAR parsing, which is async — every `next_block().await` goes through tokio's poll/wake state machine even though the I/O is an in-memory buffer. ~10 awaits per frame adds up.
python's atproto SDK uses libipld (Rust via PyO3) under the hood, which does the entire CAR parse + per-block DAG-CBOR decode in one synchronous C-extension call. this is a different (and for this workload, faster) Rust library than what the rust benchmark uses. python beats rust here because libipld avoids the async overhead entirely.
### go
indigo — bluesky's own production relay — is the slowest. go-car is synchronous (no async overhead excuse), and cbor-gen is code-generated (no reflection). the cost is GC pressure: every string, byte slice, and block is a heap allocation that the garbage collector has to sweep. at ~10 blocks/frame, that's a lot of short-lived objects per decode.
indigo handles the live firehose fine at ~1k events/sec. but the decode cost explains why bluesky runs beefy relay infrastructure: the decode path has no room to spare at scale.
### does this matter?
for live firehose consumption: no. the network delivers ~500-1000 events/sec. any of these SDKs handle that.
where it matters: backfill (replaying months of data), relays (fanning out to many consumers), and anything where you're processing stored firehose data as fast as possible.