# Benchmarks
This directory is organized as:

- `benchmarks/inputs/`: datasets + configs (hand-edited, source of truth)
- `benchmarks/models/`: trained artifacts (router models + reports)
- `benchmarks/runs/`: benchmark outputs (generated)
`benchmarks/inputs/datasets/internal_eval_template.json` is the minimal smoke-test dataset.

`benchmarks/inputs/datasets/internal_eval_starter.json` is the first non-toy internal eval set. It is still small enough to edit by hand, but it includes:
- all required query categories from the MVP plan
- distractor memories
- stale/new fact conflicts
- no-hit queries
- multi-evidence queries
- `timeline_id` on both memories and queries, so split enforcement can avoid timeline leakage

`benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json` is the starter slice for non-question memory activation and passive recall support.
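To make the dataset shape concrete, here is a minimal sketch of one memory/query pair. Only `timeline_id`, `split`, and `objective = passive_recall` are attested in this README; every other field name and value below is an assumption about the schema, not the real format.

```python
import json

# Hypothetical dataset record pair; the real schema may differ.
dataset = {
    "memories": [
        {
            "id": 1,
            "text": "switched the embedding model to a smaller one",
            "timeline_id": "embeddings_migration",  # groups related memories
        }
    ],
    "queries": [
        {
            "id": "q_dev_1",
            "text": "which embedding model are we on now?",
            "split": "dev",                 # harness filters queries by split
            "objective": "passive_recall",  # selects the passive-recall lane
            "timeline_id": "embeddings_migration",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```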
Current harness behavior:

- `klbr-bench` filters queries by `split`
- `cargo run -p klbr-bench -- passive-recall ...` runs the passive memory-activation benchmark on queries with `objective = passive_recall`
  - if the dataset includes timeline metadata and the experiment selects a split, it also filters the memory corpus by `timeline_id`
  - mixed-split datasets are treated strictly: a split with timeline metadata must give every query in that split a `timeline_id`
- `cargo run -p klbr-bench -- sweep ...` reruns the real expand-or-abstain policy over a grid of raw rerank score, raw margin, and support thresholds, then writes `sweep_report.json`, `sweep_report.md`, and `sweep_results.csv`
  - per-query traces now include `window_attempts`, so each attempted time window records the raw rerank score, raw margin, top-candidate support score, candidate ids, and the policy decision at that window
- `benchmarks/inputs/configs/mvp_rerank_calibrated_balanced.json` and `benchmarks/inputs/configs/mvp_rerank_calibrated_conservative.json` are provisional threshold presets from the 2026-04-22 sweeps; use them as candidate operating modes rather than treating them as final defaults
- `benchmarks/inputs/configs/mvp_rerank_support_calibrated.json` is the first support-assisted preset from the focused 2026-04-22 support sweep
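As a rough sketch of the per-window decision that those traces record: the sweep varies three thresholds (raw rerank score, raw margin, support), and at each window the policy either answers, expands to the next window, or abstains. The field names, decision labels, and the exact rule below are assumptions, not the real implementation.

```python
from dataclasses import dataclass


@dataclass
class WindowAttempt:
    # Mirrors the fields this README says each window_attempts entry records.
    raw_score: float      # raw rerank score of the top candidate
    raw_margin: float     # score gap between top-1 and top-2 candidates
    support: float        # top-candidate support score
    candidate_ids: list   # candidates considered at this window
    decision: str         # "answer" | "expand" | "abstain" (assumed labels)


def decide(raw_score, raw_margin, support,
           score_min, margin_min, support_min, can_expand):
    """Hypothetical expand-or-abstain rule over the three swept thresholds."""
    if raw_score >= score_min and raw_margin >= margin_min and support >= support_min:
        return "answer"
    if can_expand:
        return "expand"  # try the next, wider time window
    return "abstain"
```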
If a dataset has no `timeline_id` metadata at all, the harness falls back to the full corpus for backward compatibility.
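The split/timeline filtering described above can be sketched as follows. This is a simplified model, not the harness code: it assumes queries and memories are plain dicts that may carry a `"timeline_id"` key, and it checks metadata per split rather than dataset-wide.

```python
def select_corpus(memories, queries, split):
    """Sketch: filter queries by split, then the memory corpus by timeline_id."""
    in_split = [q for q in queries if q.get("split") == split]
    tids = {q["timeline_id"] for q in in_split if "timeline_id" in q}
    if not tids:
        # no timeline metadata: fall back to the full corpus
        return in_split, memories
    # strict mixed-split rule: timeline metadata must cover every query
    missing = [q for q in in_split if "timeline_id" not in q]
    if missing:
        raise ValueError(f"split {split!r} has queries without timeline_id")
    return in_split, [m for m in memories if m.get("timeline_id") in tids]
```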
Recommended workflow:
- add new memories in timeline order
- group related memories under one `timeline_id`
- add queries that reference only the intended memories
- keep hard distractors near the gold memories in wording and time
- make sure each split contains no-hit and conflict/update cases
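The last workflow bullet is easy to get wrong when editing by hand; a quick sanity check could look like this. The `"kind"` tag and its values are hypothetical; the real dataset may encode no-hit and conflict/update cases differently.

```python
def check_split(queries, split):
    """Verify a split has at least one no-hit and one conflict/update case.

    Assumes each query carries a hypothetical "kind" tag.
    """
    kinds = {q.get("kind") for q in queries if q.get("split") == split}
    problems = []
    if "no_hit" not in kinds:
        problems.append(f"{split}: no no-hit queries")
    if not kinds & {"conflict", "update"}:
        problems.append(f"{split}: no conflict/update queries")
    return problems
```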
Passive recall planning now lives in:
`docs/klbr_passive_recall_benchmark_plan.md`
`benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json` is the expanded v3 passive-recall slice. Added on top of v2:
- 16 new memories (ids 123–138) across 6 new timelines
- 12 new queries (5 dev, 7 test) covering:
  - temporal ordering cues — "before we switched embeddings", time-anchored update references
  - ambiguous time references — "the other day I moved back to neovim"
  - update/staleness signals — `interaction_mode: "update"` cases where the user signals a prior memory is now stale
  - abstention / no-recall false technical cues — sentences with technical words that should still produce no recall (e.g. "internet keeps dropping")
  - multi-session back-references — referencing a decision made in a prior session
  - context-switch no-recall — the user switches to an entirely different project
`benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json` is the same operating point as `benchmarks/inputs/configs/mvp_rerank_support_calibrated.json` with `dataset_split: "test"`.
Run the retrieval dev split and the v3 passive-recall dev and test splits:
```bash
cargo run -p klbr-bench -- retrieval \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-retrieval-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-v3-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json \
  benchmarks/runs/manual-passive-recall-v3-test
```
Known false activation on the test split: `pr_test_14` ("working on the dotfiles syncer today, not touching klbr") is triggered because memory 131 ("side project: rust cli for syncing dotfiles") sits in the same corpus and gets a rerank score above the abstain threshold. This is a real system weakness: cross-project memory bleed when the dotfiles project shares vocabulary with klbr sessions.
## Run Everything Into One Folder
Use `benchmarks/run_all.py` to run the standard suite (retrieval dev/test, passive dev/test, tools-lane dev/test) into a single run directory under `benchmarks/runs/` and write a consolidated `summary.md`.
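The internals of `run_all.py` are not documented here; as a hedged sketch, a driver in that spirit only has to turn a list of lanes into the `cargo run` invocations shown above, all writing under one run directory. The `build_suite` helper and its lane tuple shape are hypothetical, not the real script's interface.

```python
from pathlib import Path


def build_suite(run_dir, lanes):
    """Sketch of a run_all-style driver.

    lanes: (subcommand, dataset, config, name) tuples; the split is chosen
    by the config file, as with the presets above. The real
    benchmarks/run_all.py may be organized differently.
    """
    commands = []
    for subcommand, dataset, config, name in lanes:
        out = str(Path(run_dir) / name)  # one output folder per lane
        commands.append(["cargo", "run", "-p", "klbr-bench", "--",
                         subcommand, dataset, config, out])
    return commands
```

Each returned command could then be executed with `subprocess.run`, and a consolidated `summary.md` assembled by concatenating the per-lane reports.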