# Benchmarks
This directory is organized as:

- `benchmarks/inputs/`: datasets + configs (hand-edited, source of truth)
- `benchmarks/models/`: trained artifacts (router models + reports)
- `benchmarks/runs/`: benchmark outputs (generated)
`benchmarks/inputs/datasets/internal_eval_template.json` is the minimal smoke-test dataset.

`benchmarks/inputs/datasets/internal_eval_starter.json` is the first non-toy internal eval set. It is still small enough to edit by hand, but it includes:
- all required query categories from the MVP plan
- distractor memories
- stale/new fact conflicts
- no-hit queries
- multi-evidence queries
- `timeline_id` on both memories and queries, so split enforcement can avoid timeline leakage

`benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json` is the starter slice for non-question memory activation and passive recall support.
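To make the dataset shape concrete, here is a minimal sketch of one memory/query pair. Only `timeline_id`, `split`, and `objective = passive_recall` are attested in this README; every other field name and value below is an assumption about the schema, not the real format.

```python
import json

# Hypothetical dataset record pair; the real schema may differ.
dataset = {
    "memories": [
        {
            "id": 1,
            "text": "switched the embedding model to a smaller one",
            "timeline_id": "embeddings_migration",  # groups related memories
        }
    ],
    "queries": [
        {
            "id": "q_dev_1",
            "text": "which embedding model are we on now?",
            "split": "dev",                 # harness filters queries by split
            "objective": "passive_recall",  # selects the passive-recall lane
            "timeline_id": "embeddings_migration",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```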
Current harness behavior:

- `klbr-bench` filters queries by `split`
- `cargo run -p klbr-bench -- passive-recall ...` runs the passive memory-activation benchmark on queries with `objective = passive_recall`
  - if the dataset includes timeline metadata and the experiment selects a split, it also filters the memory corpus by `timeline_id`
  - mixed-split datasets are treated strictly: a split with timeline metadata must give every query in that split a `timeline_id`
- `cargo run -p klbr-bench -- sweep ...` reruns the real expand-or-abstain policy over a grid of raw rerank score, raw margin, and support thresholds, then writes `sweep_report.json`, `sweep_report.md`, and `sweep_results.csv`
  - per-query traces now include `window_attempts`, so each attempted time window records the raw rerank score, raw margin, top-candidate support score, candidate ids, and the policy decision at that window
- `benchmarks/inputs/configs/mvp_rerank_calibrated_balanced.json` and `benchmarks/inputs/configs/mvp_rerank_calibrated_conservative.json` are provisional threshold presets from the 2026-04-22 sweeps; use them as candidate operating modes rather than treating them as final defaults
- `benchmarks/inputs/configs/mvp_rerank_support_calibrated.json` is the first support-assisted preset from the focused 2026-04-22 support sweep
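As a rough sketch of the per-window decision that those traces record: the sweep varies three thresholds (raw rerank score, raw margin, support), and at each window the policy either answers, expands to the next window, or abstains. The field names, decision labels, and the exact rule below are assumptions, not the real implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowAttempt:
    # Mirrors the fields this README says each window_attempts entry records.
    raw_score: float      # raw rerank score of the top candidate
    raw_margin: float     # score gap between top-1 and top-2 candidates
    support: float        # top-candidate support score
    candidate_ids: list   # candidates considered at this window
    decision: str         # "answer" | "expand" | "abstain" (assumed labels)


def decide(raw_score, raw_margin, support,
           score_min, margin_min, support_min, can_expand):
    """Hypothetical expand-or-abstain rule over the three swept thresholds."""
    if raw_score >= score_min and raw_margin >= margin_min and support >= support_min:
        return "answer"
    if can_expand:
        return "expand"  # try the next, wider time window
    return "abstain"
```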
If a dataset has no `timeline_id` metadata at all, the harness falls back to the full corpus for backward compatibility.
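The split/timeline filtering described above can be sketched as follows. This is a simplified model, not the harness code: it assumes queries and memories are plain dicts that may carry a `"timeline_id"` key, and it checks metadata per split rather than dataset-wide.

```python
def select_corpus(memories, queries, split):
    """Sketch: filter queries by split, then the memory corpus by timeline_id."""
    in_split = [q for q in queries if q.get("split") == split]
    tids = {q["timeline_id"] for q in in_split if "timeline_id" in q}
    if not tids:
        # no timeline metadata: fall back to the full corpus
        return in_split, memories
    # strict mixed-split rule: timeline metadata must cover every query
    missing = [q for q in in_split if "timeline_id" not in q]
    if missing:
        raise ValueError(f"split {split!r} has queries without timeline_id")
    return in_split, [m for m in memories if m.get("timeline_id") in tids]
```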
Recommended workflow:
- add new memories in timeline order
- group related memories under one `timeline_id`
- add queries that reference only the intended memories
- keep hard distractors near the gold memories in wording and time
- make sure each split contains no-hit and conflict/update cases
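The last workflow bullet is easy to get wrong when editing by hand; a quick sanity check could look like this. The `"kind"` tag and its values are hypothetical; the real dataset may encode no-hit and conflict/update cases differently.

```python
def check_split(queries, split):
    """Verify a split has at least one no-hit and one conflict/update case.

    Assumes each query carries a hypothetical "kind" tag.
    """
    kinds = {q.get("kind") for q in queries if q.get("split") == split}
    problems = []
    if "no_hit" not in kinds:
        problems.append(f"{split}: no no-hit queries")
    if not kinds & {"conflict", "update"}:
        problems.append(f"{split}: no conflict/update queries")
    return problems
```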
Passive recall planning now lives in:
`docs/klbr_passive_recall_benchmark_plan.md`
`benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json` is the expanded v3 passive-recall slice. Added on top of v2:
- 16 new memories (ids 123–138) across 6 new timelines
- 12 new queries (5 dev, 7 test) covering:
  - temporal ordering cues — "before we switched embeddings", time-anchored update references
  - ambiguous time references — "the other day I moved back to neovim"
  - update/staleness signals — `interaction_mode: "update"` cases where the user signals a prior memory is now stale
  - abstention / no-recall false technical cues — sentences with technical words that should still produce no recall (e.g. "internet keeps dropping")
  - multi-session back-references — referencing a decision made in a prior session
  - context-switch no-recall — the user switches to an entirely different project
`benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json` is the same operating point as `benchmarks/inputs/configs/mvp_rerank_support_calibrated.json` with `dataset_split: "test"`.
Run the retrieval dev split and the v3 passive-recall dev and test splits:
```bash
cargo run -p klbr-bench -- retrieval \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-retrieval-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-v3-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json \
  benchmarks/runs/manual-passive-recall-v3-test
```
Known false activation on the test split: `pr_test_14` ("working on the dotfiles syncer today, not touching klbr") is triggered because memory 131 ("side project: rust cli for syncing dotfiles") sits in the same corpus and gets a rerank score above the abstain threshold. This is a real system weakness: cross-project memory bleed when the dotfiles project shares vocabulary with klbr sessions.
## Run Everything Into One Folder
Use `benchmarks/run_all.py` to run the standard suite (retrieval dev/test, passive dev/test, tools-lane dev/test) into a single run directory under `benchmarks/runs/` and write a consolidated `summary.md`.
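The internals of `run_all.py` are not documented here; as a hedged sketch, a driver in that spirit only has to turn a list of lanes into the `cargo run` invocations shown above, all writing under one run directory. The `build_suite` helper and its lane tuple shape are hypothetical, not the real script's interface.

```python
from pathlib import Path


def build_suite(run_dir, lanes):
    """Sketch of a run_all-style driver.

    lanes: (subcommand, dataset, config, name) tuples; the split is chosen
    by the config file, as with the presets above. The real
    benchmarks/run_all.py may be organized differently.
    """
    commands = []
    for subcommand, dataset, config, name in lanes:
        out = str(Path(run_dir) / name)  # one output folder per lane
        commands.append(["cargo", "run", "-p", "klbr-bench", "--",
                         subcommand, dataset, config, out])
    return commands
```

Each returned command could then be executed with `subprocess.run`, and a consolidated `summary.md` assembled by concatenating the per-lane reports.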