filter noise collections and auto time-cutoff large repos
the streaming CAR walker unblocked very large repos at fetch/parse time,
but the embed pipeline still choked: pfrazee's 196k records (mostly
likes/follows/reposts) burned transient memory + embed time on records
with no semantic text. this was never going to scale beyond me.
two transparent concessions, surfaced honestly in the UI:
1. collection-level filter. records in DEFAULT_SKIP_COLLECTIONS (likes,
follows, reposts, blocks, listitems, threadgate, postgate, actor
status, chat declaration, sh.tangled graph follow/star) are dropped
before CBOR value decode — skipped records cost only the MST entry
iteration. applied in both the CAR walker and the listRecords
fallback for consistency.
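the filter is roughly this (a sketch; `should_skip` is a hypothetical name, and the sh.tangled NSIDs are assumed rather than copied from the code):

```python
# hedged sketch of the collection-level filter applied before CBOR
# value decode; the two sh.tangled NSIDs below are assumptions
DEFAULT_SKIP_COLLECTIONS = {
    "app.bsky.feed.like",
    "app.bsky.graph.follow",
    "app.bsky.feed.repost",
    "app.bsky.graph.block",
    "app.bsky.graph.listitem",
    "app.bsky.feed.threadgate",
    "app.bsky.feed.postgate",
    "app.bsky.actor.status",
    "chat.bsky.actor.declaration",
    "sh.tangled.graph.follow",  # assumed NSID
    "sh.tangled.feed.star",     # assumed NSID
}

def should_skip(mst_key: str) -> bool:
    """MST keys look like '<collection>/<rkey>'; skipping here means the
    record costs only the MST entry iteration, never a value decode."""
    collection, _, _ = mst_key.partition("/")
    return collection in DEFAULT_SKIP_COLLECTIONS
```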
2. auto time cutoff. if post-collection-filter count still exceeds
LARGE_REPO_THRESHOLD (30k), enable a 2-year TID cutoff. implemented
as a cheap count-only MST walk before the full walk — we learn the
post-filter size without decoding record values, then decide. TIDs
decode from base32-sortable rkeys in ~15 lines; non-TID rkeys (self,
etc.) are always kept.
pipeline shape becomes: openRepo → countOpened → decide filter →
walkOpened → close. the open/walk split keeps the mmap alive across
both passes so the count pass is essentially free.
pfrazee smoke: 195,908 total → 37,611 kept post collection filter →
cutoff kicks in → 35,682 final. zzstoatzz.io regression-clean: 17,350
total → 5,145 kept, 12,205 skipped, no cutoff.
status response gains skipped_by_collection, skipped_by_time,
applied_tid_cutoff_ms. pack-meta line in the UI shows the honest
breakdown: "5,145 records · 190 collections · skipped 12,205
likes/follows/reposts" for normal repos; "35,682 records · 30
collections · skipped 158,297 likes/follows/reposts · indexed records
after 2023-04-01 (1,929 older records skipped)" for pfrazee.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>