# KLBR Passive Recall Benchmark Plan
This document defines the next benchmark slice for memory use on non-question inputs.
## Why This Exists
The current retrieval benchmark mostly measures:
- answerable factual lookup
- rerank confidence
- expand vs abstain decisions
That is necessary, but not sufficient for KLBR.
KLBR also needs to decide whether to recall memory for inputs like:
- "still working on the benchmark stuff"
- "i'm back in zed again"
- "reranker on 8003"
- "lol yeah that makes sense"
Those are not all explicit factual questions, but the system still needs a memory-use policy.
## Two Separate Decisions
We should treat memory use as two linked decisions:
- **activation**: Should memory be recalled at all for this input?
- **trust**: If memory is recalled, are the selected memories actually the right support?
The current question-oriented benchmark mostly covers trust. Passive recall needs to evaluate both.
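As a shape sketch, the split could be modeled as two separate fields on the decision record. The type names below are illustrative, not the actual klbr-bench API:

```rust
// Hypothetical types separating the two decisions; names are illustrative.
#[derive(Debug, PartialEq)]
enum MemoryAction {
    Recall,
    NoRecall,
}

struct MemoryDecision {
    // Activation: should memory be recalled at all for this input?
    action: MemoryAction,
    // Trust: if recalling, which memories are injected as support?
    injected_memory_ids: Vec<String>,
}
```

Keeping the two fields separate lets the benchmark grade activation even on items where the trust decision is vacuous (nothing was recalled).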
## Schema
The shared eval schema now supports passive-recall cases through optional fields on EvalQuery:
- `interaction_mode`: `question` | `statement` | `update` | `request` | `fragment`
- `objective`: `answer_query` | `passive_recall`
- `expected_memory_action`: `recall` | `no_recall`
Existing retrieval datasets do not need to set these fields. They default to question-answer behavior.
## Dataset Semantics
For passive-recall items:
- `text` is the user input as it would arrive to the agent
- `gold_memory_ids` are the memories that should be recalled if memory activation is correct
- `no_hit = true` means the item should not recall any memory
- `expected_memory_action = no_recall` is the preferred explicit label for passive-recall no-hit cases
For multi-evidence passive cases:
- `gold_memory_ids` may contain several acceptable support memories
- the goal is not necessarily single-memory top-1
- the benchmark should allow support-style grading later
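To make the semantics concrete, here is a sketch of one recall item and one no-recall item. Field names follow the schema above; the memory ids are invented placeholders, not real dataset entries:

```json
[
  {
    "text": "reranker on 8003",
    "interaction_mode": "fragment",
    "objective": "passive_recall",
    "expected_memory_action": "recall",
    "gold_memory_ids": ["mem_reranker_port", "mem_bench_services"]
  },
  {
    "text": "lol yeah that makes sense",
    "interaction_mode": "statement",
    "objective": "passive_recall",
    "expected_memory_action": "no_recall",
    "no_hit": true,
    "gold_memory_ids": []
  }
]
```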
## First Metrics To Add
The passive-recall benchmark should report:
- activation precision: fraction of recall-triggered inputs that should actually recall
- activation recall: fraction of should-recall inputs where recall was triggered
- no-recall false activation rate: fraction of no-recall inputs where memory was still injected
- support precision@k: fraction of injected memories that are in the gold support set
- support recall@k: fraction of gold support memories that were injected
These are deliberately different from answer/abstain metrics.
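A minimal sketch of how these metrics could be computed, assuming each item carries a gold activation label and an observed decision. Type and function names here are sketches, not the klbr-bench API:

```rust
// Illustrative metric computation; names are sketches, not the klbr-bench API.
struct ActivationOutcome {
    should_recall: bool, // gold label: expected_memory_action == recall
    did_recall: bool,    // observed: the system injected memory for this input
}

// Returns (activation precision, activation recall, no-recall false activation rate).
fn activation_metrics(items: &[ActivationOutcome]) -> (f64, f64, f64) {
    let (mut triggered, mut should, mut no_recall) = (0.0, 0.0, 0.0);
    let (mut true_pos, mut false_act) = (0.0, 0.0);
    for i in items {
        if i.did_recall { triggered += 1.0; }
        if i.should_recall { should += 1.0; } else { no_recall += 1.0; }
        if i.did_recall && i.should_recall { true_pos += 1.0; }
        if i.did_recall && !i.should_recall { false_act += 1.0; }
    }
    let safe = |n: f64, d: f64| if d > 0.0 { n / d } else { 0.0 };
    (safe(true_pos, triggered), safe(true_pos, should), safe(false_act, no_recall))
}

// Returns (support precision@k, support recall@k) for one recalled item,
// grading the injected memory ids against the gold support set.
fn support_metrics(injected: &[&str], gold: &[&str]) -> (f64, f64) {
    let hits = injected.iter().filter(|m| gold.contains(*m)).count() as f64;
    let precision = if injected.is_empty() { 0.0 } else { hits / injected.len() as f64 };
    let recall = if gold.is_empty() { 0.0 } else { hits / gold.len() as f64 };
    (precision, recall)
}
```

Note that the activation metrics are computed over all items, while the support metrics only apply to items where recall was triggered.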
## Starter Dataset
The first starter dataset is:
`benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json`
It includes:
- passive recent-work cues
- elliptical requests
- declarative state cues
- endpoint fragments
- multi-evidence infra cues
- pure chatter that should not trigger recall
## What Not To Do
Do not solve passive recall by inventing many brittle query classes.
The benchmark should push us toward generic features that work for both questions and non-questions, such as:
- semantic relevance
- lexical/entity support
- agreement across support memories
- confidence calibrated against no-recall cases
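One way to read this list is as inputs to a single generic activation scorer rather than per-class rules. The sketch below is hypothetical: the feature names, weights, and threshold are invented for illustration, and a real system would calibrate the threshold against the no-recall cases in the benchmark:

```rust
// Hypothetical generic activation scorer; weights and names are illustrative.
struct ActivationFeatures {
    semantic_relevance: f64, // e.g. best embedding similarity across candidates
    lexical_support: f64,    // entity/term overlap with the top candidates
    support_agreement: f64,  // agreement across the top support memories
}

fn activation_score(f: &ActivationFeatures) -> f64 {
    // Fixed weights for illustration only; calibration against no-recall
    // cases would set these and the threshold in practice.
    0.5 * f.semantic_relevance + 0.3 * f.lexical_support + 0.2 * f.support_agreement
}

fn should_recall(f: &ActivationFeatures, threshold: f64) -> bool {
    activation_score(f) >= threshold
}
```

The point of this shape is that nothing in it depends on whether the input was a question, which is what keeps the policy from fragmenting into brittle query classes.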
## Current Implementation
The starter implementation now exists as:
```sh
cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-starter
```
It:
- embeds the input text
- retrieves candidates through the same exact-search path used by the main retrieval benchmark
- reranks and scores support the same way as the answer/abstain benchmark
- decides `recall` vs `no_recall`
- if recalling, grades the injected support set against `gold_memory_ids`
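The decide-and-grade steps at the end of that pipeline can be sketched as follows, with the embedding, retrieval, and rerank stages abstracted into a pre-scored candidate list. `Candidate` and `grade_item` are illustrative names, not the klbr-bench API:

```rust
// Minimal sketch of the per-item decide-and-grade step; names are illustrative.
struct Candidate {
    memory_id: String,
    support_score: f64, // reranker/support score from the earlier stages
}

// Returns (recalled?, gold hits, injected count) for one input.
fn grade_item(candidates: &[Candidate], gold_memory_ids: &[&str], threshold: f64) -> (bool, usize, usize) {
    let injected: Vec<&str> = candidates
        .iter()
        .filter(|c| c.support_score >= threshold)
        .map(|c| c.memory_id.as_str())
        .collect();
    // An empty injected set is the no_recall decision; otherwise grade
    // the injected memories against the gold support set.
    let hits = injected.iter().filter(|m| gold_memory_ids.contains(*m)).count();
    (!injected.is_empty(), hits, injected.len())
}
```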
The current starter run is intentionally small, so the next real step after implementation is to expand the passive-recall dataset rather than overfitting to the 8-query slice.