# consuming the firehose, then benchmarking it
since the last devlog (self-publishing docs), zat grew from a collection of string parsers and HTTP clients into something that can consume the full AT Protocol event stream — both jetstream (JSON) and the raw firehose (binary DAG-CBOR). then we benchmarked it against every other AT Protocol SDK.
## what we built
we built [atproto-bench](https://tangled.sh/@zzstoatzz.io/atproto-bench) — a cross-SDK benchmark that captures ~10 seconds of live firehose traffic, then decodes the full corpus with four SDKs.
every SDK does the same work per frame: decode CBOR header → decode CBOR payload → parse CAR → decode every CAR block as DAG-CBOR. block counts and error counts are reported per SDK so you can verify parity. per-pass variance (min/median/max) is reported so you can see how stable the numbers are.
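in Zig-ish pseudocode, the measured per-frame work has roughly this shape. the stage functions here are hypothetical stand-ins, not zat's actual API:

```zig
const std = @import("std");

// hypothetical stand-ins for the decode stages (zat's real API differs);
// they exist only so the shape of the work compiles.
fn cborDecodeHeader(frame: []const u8) !usize {
    _ = frame;
    return 0; // bytes consumed by the header map
}

fn cborDecodePayload(rest: []const u8) ![]const u8 {
    return rest; // the payload carries the CAR bytes
}

const CarIterator = struct {
    rest: []const u8,

    fn nextBlock(self: *CarIterator) ?[]const u8 {
        if (self.rest.len == 0) return null;
        const block = self.rest; // a real parser would read one varint-prefixed block
        self.rest = &.{};
        return block;
    }
};

fn dagCborDecodeBlock(allocator: std.mem.Allocator, block: []const u8) !void {
    _ = allocator;
    _ = block;
}

// the four steps the benchmark measures for every frame
fn decodeOneFrame(allocator: std.mem.Allocator, frame: []const u8) !usize {
    const header_len = try cborDecodeHeader(frame);               // 1. CBOR header
    const car_bytes = try cborDecodePayload(frame[header_len..]); // 2. CBOR payload
    var car = CarIterator{ .rest = car_bytes };                   // 3. parse CAR
    var blocks: usize = 0;
    while (car.nextBlock()) |block| {                             // 4. decode every block
        try dagCborDecodeBlock(allocator, block);                 //    as DAG-CBOR
        blocks += 1;
    }
    return blocks; // block counts are compared across SDKs to verify parity
}
```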
the corpus is captured with a CBOR header peek (check `t == "#commit"` and `ops` is non-empty) using zat's CBOR decoder. this is standard CBOR parsing — not zat's typed firehose decoder — but it does mean frames that zat's CBOR decoder rejects won't appear in the corpus.
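the filter itself is tiny. a sketch, with a hypothetical `FramePeek` struct standing in for the peeked header/payload fields:

```zig
const std = @import("std");

// hypothetical minimal view of the peeked frame; the real peek decodes
// just enough CBOR to read these two fields.
const FramePeek = struct {
    t: []const u8, // header "t" field
    ops_len: usize, // length of the payload's "ops" array
};

fn shouldCapture(peek: FramePeek) bool {
    return std.mem.eql(u8, peek.t, "#commit") and peek.ops_len > 0;
}

test "only non-empty commits enter the corpus" {
    try std.testing.expect(shouldCapture(.{ .t = "#commit", .ops_len = 3 }));
    try std.testing.expect(!shouldCapture(.{ .t = "#commit", .ops_len = 0 }));
    try std.testing.expect(!shouldCapture(.{ .t = "#identity", .ops_len = 1 }));
}
```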
the original version of these benchmarks had work asymmetry: zat only decoded op-linked blocks (~2.3k per corpus), while rust and go decoded all CAR blocks (~23k). python parsed CAR structure but didn't iterate blocks. the numbers below are from the corrected version where all SDKs decode every block.
### results
3,298 frames (16.2 MB), 5 measured passes, macOS arm64 (M3 Max):
| SDK | frames/sec (median) | MB/s | blocks/frame |
|-----|--------------------:|-----:|-------------:|
| zig (zat, arena reuse) | 461,827 | 2,268.9 | 9.98 |
| zig (zat, alloc per frame) | 395,485 | 1,890.0 | 9.98 |
| rust (jacquard) | 42,023 | 203.5 | 9.98 |
| python (atproto) | 24,026 | 118.0 | 9.98 |
| go (indigo) | 10,896 | 53.3 | 9.98 |
all SDKs: 0 errors. run-to-run variance is ~30-40% — compare ratios within a single run, not across runs.
### why zat is fast
**block decode cardinality.** each firehose frame contains a CAR with ~10 blocks (MST nodes + records). decoding every block as DAG-CBOR is the dominant cost — it's where most of the per-frame CPU time goes across all SDKs.
**arena allocation.** zat uses one arena per frame — a single `malloc` on the first frame, then `reset` (no syscall) on every subsequent frame. the "alloc per frame" variant creates and destroys an arena per frame (one `malloc` + one `free`), which is the fair comparison to what the other SDKs do. the "arena reuse" variant is the production pattern.
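a sketch of the two variants, assuming a recent Zig std (`ArenaAllocator.reset` with `.retain_capacity`) and a hypothetical `decodeOneFrame` standing in for the full decode path:

```zig
const std = @import("std");

// hypothetical stand-in for the full per-frame decode (header -> payload
// -> CAR -> every block as DAG-CBOR); all allocations go into `allocator`.
fn decodeOneFrame(allocator: std.mem.Allocator, frame: []const u8) !void {
    _ = allocator;
    _ = frame;
}

// "alloc per frame": one ArenaAllocator init + deinit per frame
// (one malloc/free pair), the fair comparison to the other SDKs.
fn benchAllocPerFrame(gpa: std.mem.Allocator, frames: []const []const u8) !void {
    for (frames) |frame| {
        var arena = std.heap.ArenaAllocator.init(gpa);
        defer arena.deinit();
        try decodeOneFrame(arena.allocator(), frame);
    }
}

// "arena reuse": init once, reset(.retain_capacity) between frames.
// reset keeps the backing buffer, so steady-state frames never touch
// the OS allocator at all; this is the production pattern.
fn benchArenaReuse(gpa: std.mem.Allocator, frames: []const []const u8) !void {
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit();
    for (frames) |frame| {
        _ = arena.reset(.retain_capacity);
        try decodeOneFrame(arena.allocator(), frame);
    }
}
```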
### rust and python
jacquard uses iroh-car for CAR parsing, which is async — every `next_block().await` goes through tokio's poll/wake state machine even though the I/O is an in-memory buffer. ~10 awaits per frame adds up.
python's atproto SDK uses libipld (Rust via PyO3) under the hood, which does the entire CAR parse + per-block DAG-CBOR decode in one synchronous C-extension call. this is a different (and for this workload, faster) Rust library than what the rust benchmark uses. python beats rust here because libipld avoids the async overhead entirely.
### go
indigo — bluesky's own production relay — is the slowest. go-car is synchronous (no async overhead excuse), and cbor-gen is code-generated (no reflection). the cost is GC pressure: every string, byte slice, and block is a heap allocation that the garbage collector has to sweep. at ~10 blocks/frame, that's a lot of short-lived objects per decode.
indigo handles the live firehose fine at ~1k events/sec. but the decode cost explains why bluesky runs beefy relay infrastructure: the decode path has no room to spare at scale.
### does this matter?
for live firehose consumption: no. the network delivers ~500-1000 events/sec. any of these SDKs handle that.
where it matters: backfill (replaying months of data), relays (fanning out to many consumers), and anything where you're processing stored firehose data as fast as possible.