stream getRepo body to /tmp and mmap it for the CAR walk
ken's old walker buffered the entire sync.getRepo response body in heap via
zat.HttpTransport.fetch (which always dupes into std.Io.Writer.Allocating),
then handed it to zat.car.readWithOptions which eagerly materialized a
StringHashMap of every CID → block content. that combination capped ken at
repos with <200k blocks and kept the whole CAR resident for the duration of
the walk. pfrazee.com (196k records, 72 MB CAR, 248k blocks counting MST
internals) sat just past the cliff.
this path now:
1. talks to std.http.Client directly for the one call that needs it,
   streaming the response body straight into /tmp/ken-car-{seq}-{did}.car
   via std.Io.File.Writer.initStreaming, with no heap staging
2. mmaps the temp file read-only via std.Io.File.MemoryMap (the kernel pages
   in what we touch and evicts what we don't)
3. feeds the mmap slice to zat.car.streamBlocks (v0.3.0-alpha.24) and
   builds a CID → {offset, len} index into the buffer via pointer
   arithmetic: no block content duplication, and ~16 bytes of value per
   entry instead of 48
4. walks the MST through that index; delete-on-destroy removes the
   temp file whether the walk succeeds or errors out
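a minimal sketch of the offset index from step 3, not the code in
repo_walk.zig. BlockRef, refFor and bytesFor are hypothetical names, the
string keys stand in for real CIDs, and a constant buffer stands in for the
mmap slice. two u64 fields give the 16-byte value:

```zig
const std = @import("std");

/// Hypothetical index value: where a block's bytes live inside the mapped CAR.
const BlockRef = struct {
    offset: u64,
    len: u64,
};

comptime {
    // Two u64 fields, so each map value is 16 bytes.
    std.debug.assert(@sizeOf(BlockRef) == 16);
}

/// Turn a sub-slice of `buf` into a BlockRef without copying its contents.
fn refFor(buf: []const u8, block: []const u8) BlockRef {
    const base = @intFromPtr(buf.ptr);
    const start = @intFromPtr(block.ptr);
    // The block must lie entirely inside the buffer.
    std.debug.assert(start >= base and start + block.len <= base + buf.len);
    return .{ .offset = start - base, .len = block.len };
}

/// Resolve a BlockRef back to bytes in the same buffer.
fn bytesFor(buf: []const u8, ref: BlockRef) []const u8 {
    const start: usize = @intCast(ref.offset);
    const len: usize = @intCast(ref.len);
    return buf[start .. start + len];
}

test "index stores offsets, not copies" {
    const buf: []const u8 = "aaaaBLOCKzzzz"; // stand-in for the mmap'd CAR
    var index = std.StringHashMap(BlockRef).init(std.testing.allocator);
    defer index.deinit();

    try index.put("cid-1", refFor(buf, buf[4..9]));

    const ref = index.get("cid-1").?;
    try std.testing.expectEqual(@as(u64, 4), ref.offset);
    try std.testing.expectEqualStrings("BLOCK", bytesFor(buf, ref));
}
```

the values are only meaningful while the mapping is alive, so in a design
like this the index has to be dropped before the unmap.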
every other ken call still uses zat.HttpTransport — only this one endpoint
needs streaming. bumps zat to v0.3.0-alpha.24 for car.streamBlocks.
smoke tested against two real repos via a standalone /tmp/ken_smoke.zig that
imports repo_walk.zig directly:
  zzstoatzz.io: 17,348 records, 200 collections, 8.2 MB CAR
  pfrazee.com: 195,904 records, 39 collections, 72.0 MB CAR
pfrazee walks end-to-end in ~11.5s on my laptop. 0 lingering /tmp/ken-car-*
files after either run. verified fly's /tmp is on the rootfs overlay (7.4G
free on the current machine), not tmpfs, so streaming to disk does not
compete with the 4 GB memory budget.
not yet addressed: indexer.zig still holds the full records[] + extracted
text + 384-dim vectors for every record simultaneously during embedding,
which for pfrazee would be ~300 MB of vectors on top of the mmap. walking
pfrazee is unblocked; embedding pfrazee needs a separate, record-at-a-time
pipeline that's planned as a follow-up.
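the ~300 MB figure can be checked with a quick sketch, assuming the vectors
are stored as f32 (this message does not state the element type).
vectorBytes is a hypothetical helper, not something in indexer.zig:

```zig
const std = @import("std");

/// Bytes needed to hold one embedding per record, assuming f32 elements.
fn vectorBytes(records: u64, dims: u64) u64 {
    return records * dims * @sizeOf(f32);
}

test "pfrazee embedding footprint" {
    // 195,904 records x 384 dims x 4 bytes = 300,908,544 bytes (~301 MB)
    try std.testing.expectEqual(@as(u64, 300_908_544), vectorBytes(195_904, 384));
}
```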
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>