
Benchmarks

This directory is organized as:

  • benchmarks/inputs/: datasets + configs (hand-edited, source of truth)
  • benchmarks/models/: trained artifacts (router models + reports)
  • benchmarks/runs/: benchmark outputs (generated)

benchmarks/inputs/datasets/internal_eval_template.json is the minimal smoke-test dataset.

benchmarks/inputs/datasets/internal_eval_starter.json is the first non-toy internal eval set. It is still small enough to edit by hand, but it includes (an example entry is sketched after this list):

  • all required query categories from the MVP plan
  • distractor memories
  • stale/new fact conflicts
  • no-hit queries
  • multi-evidence queries
  • timeline_id on both memories and queries so split enforcement can avoid timeline leakage
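
To make those properties concrete, here is a minimal sketch of a stale/new fact conflict pair plus one query. Only timeline_id is a field name confirmed by this README; the other keys (memories, queries, id, text, split, expected_memory_ids) and all of the data are illustrative assumptions:

{
  "memories": [
    {
      "id": 41,
      "timeline_id": "editor-setup",
      "text": "switched from vscode to helix for daily editing"
    },
    {
      "id": 42,
      "timeline_id": "editor-setup",
      "text": "moved back to neovim; the helix config is retired"
    }
  ],
  "queries": [
    {
      "id": "q_dev_07",
      "split": "dev",
      "timeline_id": "editor-setup",
      "text": "which editor am I using day to day?",
      "expected_memory_ids": [42],
      "_note": "all key names except timeline_id are assumptions"
    }
  ]
}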

benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json is the starter slice for non-question memory activation and passive recall support.

Current harness behavior:

  • klbr-bench filters queries by split
  • cargo run -p klbr-bench -- passive-recall ... runs the passive memory-activation benchmark on queries with objective = passive_recall
  • if the dataset includes timeline metadata and the experiment selects a split, it also filters the memory corpus by timeline_id
  • mixed-split datasets are treated strictly: if a split carries timeline metadata, every query in that split must have a timeline_id
  • cargo run -p klbr-bench -- sweep ... reruns the real expand-or-abstain policy over a grid of raw rerank score, raw margin, and support thresholds, then writes sweep_report.json, sweep_report.md, and sweep_results.csv
  • per-query traces now include window_attempts: each attempted time window records its raw rerank score, raw margin, top-candidate support score, candidate ids, and the policy decision at that window (see the sketch after this list)
  • benchmarks/inputs/configs/mvp_rerank_calibrated_balanced.json and benchmarks/inputs/configs/mvp_rerank_calibrated_conservative.json are provisional threshold presets from the 2026-04-22 sweeps; use them as candidate operating modes rather than treating them as final defaults
  • benchmarks/inputs/configs/mvp_rerank_support_calibrated.json is the first support-assisted preset from the focused 2026-04-22 support sweep
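
A sketch of the shape one window_attempts entry might take. The recorded quantities are the ones named in the bullet above; window_attempts is the confirmed name, but the per-field key names and the values are assumptions:

{
  "window_attempts": [
    {
      "window": "2026-03-01..2026-03-31",
      "raw_rerank_score": 0.71,
      "raw_margin": 0.18,
      "top_candidate_support": 0.42,
      "candidate_ids": [41, 202],
      "decision": "expand",
      "_note": "per-field key names are assumptions, not the harness schema"
    }
  ]
}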

If a dataset has no timeline_id metadata at all, the harness falls back to the full corpus for backward compatibility.

Recommended workflow:

  1. add new memories in timeline order
  2. group related memories under one timeline_id
  3. add queries that reference only the intended memories
  4. keep hard distractors near the gold memories in wording and time (see the sketch after this list)
  5. make sure each split contains no-hit and conflict/update cases
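
A sketch of steps 4 and 5, reusing the assumed field names from the earlier sketch: memory 202 is a hard distractor for 201 (close in wording and time), and the second query is a deliberate no-hit case. How a no-hit is encoded is also an assumption here:

{
  "memories": [
    { "id": 201, "timeline_id": "queue-backend",
      "text": "decided to move the job queue from redis to postgres" },
    { "id": 202, "timeline_id": "queue-backend",
      "text": "considered moving the cache off redis too, decided against it" }
  ],
  "queries": [
    { "id": "q_dev_11", "split": "dev", "timeline_id": "queue-backend",
      "text": "where did we land on the job queue backend?",
      "expected_memory_ids": [201] },
    { "id": "q_dev_12", "split": "dev", "timeline_id": "queue-backend",
      "text": "did we ever pick a message broker?",
      "expected_memory_ids": [],
      "_note": "empty expected_memory_ids as a no-hit encoding is an assumption" }
  ]
}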

Passive recall planning now lives in:

  • docs/klbr_passive_recall_benchmark_plan.md

benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json is the expanded v3 passive-recall slice. Added on top of v2:

  • 16 new memories (ids 123–138) across 6 new timelines
  • 12 new queries (5 dev, 7 test) covering:
    • temporal ordering cues — "before we switched embeddings", time-anchored update references
    • ambiguous time references — "the other day I moved back to neovim"
    • update/staleness signals — interaction_mode: "update" cases where the user signals a prior memory is now stale (a query of this shape is sketched after this list)
    • abstention / no-recall with false technical cues — sentences with technical words that should still produce no recall (e.g. "internet keeps dropping")
    • multi-session back-references — referencing a decision made in a prior session
    • context-switch no-recall — user switches to an entirely different project
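
For orientation, a sketched v3-style query. The values objective = passive_recall and interaction_mode: "update" appear elsewhere in this README; the remaining key names, the id, and the expected memory are illustrative assumptions:

{
  "id": "pr_dev_21",
  "split": "dev",
  "objective": "passive_recall",
  "interaction_mode": "update",
  "timeline_id": "editor-setup",
  "text": "the other day I moved back to neovim, so ignore the helix notes",
  "expected_memory_ids": [118],
  "_note": "only the objective and interaction_mode values are confirmed; other keys are assumptions"
}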

benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json is the same operating point as benchmarks/inputs/configs/mvp_rerank_support_calibrated.json with dataset_split: "test".
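
A sketch of what a config of this shape might contain. dataset_split is the only key confirmed by this README; the threshold key names and values below stand in for the calibrated operating point and are assumptions:

{
  "dataset_split": "test",
  "rerank_score_threshold": 0.55,
  "rerank_margin_threshold": 0.10,
  "support_threshold": 0.30,
  "_note": "only dataset_split is a confirmed key; the rest are illustrative"
}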

Run the retrieval dev split and the v3 passive-recall dev and test splits:

cargo run -p klbr-bench -- retrieval \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-retrieval-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-v3-dev

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_v3.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated_test.json \
  benchmarks/runs/manual-passive-recall-v3-test

Known false activation on the test split: pr_test_14 ("working on the dotfiles syncer today, not touching klbr") is triggered because memory 131 ("side project: rust cli for syncing dotfiles") sits in the same corpus and gets a rerank score above the abstain threshold. This is a real system weakness — cross-project memory bleed when the dotfiles project shares vocabulary with klbr sessions.

Run Everything Into One Folder

Use benchmarks/run_all.py to run the standard suite (retrieval dev/test, passive dev/test, tools-lane dev/test) into a single run directory under benchmarks/runs/ and write a consolidated summary.md.
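
The script defines its own CLI; assuming it takes a run-directory argument in the style of the commands above (this invocation shape is an assumption, not the script's documented interface):

python benchmarks/run_all.py benchmarks/runs/2026-04-22-full-suite

See the script itself for the actual arguments and flags.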