# KLBR Passive Recall Benchmark Plan
This document defines the next benchmark slice for memory use on non-question inputs.
## Why This Exists
The current retrieval benchmark mostly measures:
- answerable factual lookup
- rerank confidence
- expand vs abstain decisions
That is necessary, but not sufficient for KLBR.
KLBR also needs to decide whether to recall memory for inputs like:
- "still working on the benchmark stuff"
- "i'm back in zed again"
- "reranker on 8003"
- "lol yeah that makes sense"
Those are not all explicit factual questions, but the system still needs a memory-use policy.
## Two Separate Decisions
We should treat memory use as two linked decisions:
- **activation**: Should memory be recalled at all for this input?
- **trust**: If memory is recalled, are the selected memories actually the right support?
The current question-oriented benchmark mostly covers trust. Passive recall needs to evaluate both.
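As a shape sketch, the split could be modeled as two separate fields on the decision record. The type names below are illustrative, not the actual klbr-bench API:

```rust
// Hypothetical types separating the two decisions; names are illustrative.
#[derive(Debug, PartialEq)]
enum MemoryAction {
    Recall,
    NoRecall,
}

struct MemoryDecision {
    // Activation: should memory be recalled at all for this input?
    action: MemoryAction,
    // Trust: if recalling, which memories are injected as support?
    injected_memory_ids: Vec<String>,
}
```

Keeping the two fields separate lets the benchmark grade activation even on items where the trust decision is vacuous (nothing was recalled).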
## Schema
The shared eval schema now supports passive-recall cases through optional fields on EvalQuery:
- `interaction_mode`: `question` | `statement` | `update` | `request` | `fragment`
- `objective`: `answer_query` | `passive_recall`
- `expected_memory_action`: `recall` | `no_recall`
Existing retrieval datasets do not need to set these fields. They default to question-answer behavior.
## Dataset Semantics
For passive-recall items:
- `text` is the user input as it would arrive to the agent
- `gold_memory_ids` are the memories that should be recalled if memory activation is correct
- `no_hit = true` means the item should not recall any memory
- `expected_memory_action = no_recall` is the preferred explicit label for passive-recall no-hit cases
For multi-evidence passive cases:
- `gold_memory_ids` may contain several acceptable support memories
- the goal is not necessarily single-memory top-1
- the benchmark should allow support-style grading later
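To make the semantics concrete, here is a sketch of one recall item and one no-recall item. Field names follow the schema above; the memory ids are invented placeholders, not real dataset entries:

```json
[
  {
    "text": "reranker on 8003",
    "interaction_mode": "fragment",
    "objective": "passive_recall",
    "expected_memory_action": "recall",
    "gold_memory_ids": ["mem_reranker_port", "mem_bench_services"]
  },
  {
    "text": "lol yeah that makes sense",
    "interaction_mode": "statement",
    "objective": "passive_recall",
    "expected_memory_action": "no_recall",
    "no_hit": true,
    "gold_memory_ids": []
  }
]
```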
## First Metrics To Add
The passive-recall benchmark should report:
- activation precision: fraction of recall-triggered inputs that should actually recall
- activation recall: fraction of should-recall inputs where recall was triggered
- no-recall false activation rate: fraction of no-recall inputs where memory was still injected
- support precision@k: fraction of injected memories that are in the gold support set
- support recall@k: fraction of gold support memories that were injected
These are deliberately different from answer/abstain metrics.
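A minimal sketch of how these metrics could be computed, assuming each item carries a gold activation label and an observed decision. Type and function names here are sketches, not the klbr-bench API:

```rust
// Illustrative metric computation; names are sketches, not the klbr-bench API.
struct ActivationOutcome {
    should_recall: bool, // gold label: expected_memory_action == recall
    did_recall: bool,    // observed: the system injected memory for this input
}

// Returns (activation precision, activation recall, no-recall false activation rate).
fn activation_metrics(items: &[ActivationOutcome]) -> (f64, f64, f64) {
    let (mut triggered, mut should, mut no_recall) = (0.0, 0.0, 0.0);
    let (mut true_pos, mut false_act) = (0.0, 0.0);
    for i in items {
        if i.did_recall { triggered += 1.0; }
        if i.should_recall { should += 1.0; } else { no_recall += 1.0; }
        if i.did_recall && i.should_recall { true_pos += 1.0; }
        if i.did_recall && !i.should_recall { false_act += 1.0; }
    }
    let safe = |n: f64, d: f64| if d > 0.0 { n / d } else { 0.0 };
    (safe(true_pos, triggered), safe(true_pos, should), safe(false_act, no_recall))
}

// Returns (support precision@k, support recall@k) for one recalled item,
// grading the injected memory ids against the gold support set.
fn support_metrics(injected: &[&str], gold: &[&str]) -> (f64, f64) {
    let hits = injected.iter().filter(|m| gold.contains(*m)).count() as f64;
    let precision = if injected.is_empty() { 0.0 } else { hits / injected.len() as f64 };
    let recall = if gold.is_empty() { 0.0 } else { hits / gold.len() as f64 };
    (precision, recall)
}
```

Note that the activation metrics are computed over all items, while the support metrics only apply to items where recall was triggered.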
## Starter Dataset
The first starter dataset is:
`benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json`
It includes:
- passive recent-work cues
- elliptical requests
- declarative state cues
- endpoint fragments
- multi-evidence infra cues
- pure chatter that should not trigger recall
## What Not To Do
Do not solve passive recall by inventing many brittle query classes.
The benchmark should push us toward generic features that work for both questions and non-questions, such as:
- semantic relevance
- lexical/entity support
- agreement across support memories
- confidence calibrated against no-recall cases
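One way to read this list is as inputs to a single generic activation scorer rather than per-class rules. The sketch below is hypothetical: the feature names, weights, and threshold are invented for illustration, and a real system would calibrate the threshold against the no-recall cases in the benchmark:

```rust
// Hypothetical generic activation scorer; weights and names are illustrative.
struct ActivationFeatures {
    semantic_relevance: f64, // e.g. best embedding similarity across candidates
    lexical_support: f64,    // entity/term overlap with the top candidates
    support_agreement: f64,  // agreement across the top support memories
}

fn activation_score(f: &ActivationFeatures) -> f64 {
    // Fixed weights for illustration only; calibration against no-recall
    // cases would set these and the threshold in practice.
    0.5 * f.semantic_relevance + 0.3 * f.lexical_support + 0.2 * f.support_agreement
}

fn should_recall(f: &ActivationFeatures, threshold: f64) -> bool {
    activation_score(f) >= threshold
}
```

The point of this shape is that nothing in it depends on whether the input was a question, which is what keeps the policy from fragmenting into brittle query classes.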
## Current Implementation
The starter implementation now exists as:
```sh
cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-starter
```
It:
- embeds the input text
- retrieves candidates through the same exact-search path used by the main retrieval benchmark
- reranks and scores support the same way as the answer/abstain benchmark
- decides `recall` vs `no_recall`
- if recalling, grades the injected support set against `gold_memory_ids`
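The decide-and-grade steps at the end of that pipeline can be sketched as follows, with the embedding, retrieval, and rerank stages abstracted into a pre-scored candidate list. `Candidate` and `grade_item` are illustrative names, not the klbr-bench API:

```rust
// Minimal sketch of the per-item decide-and-grade step; names are illustrative.
struct Candidate {
    memory_id: String,
    support_score: f64, // reranker/support score from the earlier stages
}

// Returns (recalled?, gold hits, injected count) for one input.
fn grade_item(candidates: &[Candidate], gold_memory_ids: &[&str], threshold: f64) -> (bool, usize, usize) {
    let injected: Vec<&str> = candidates
        .iter()
        .filter(|c| c.support_score >= threshold)
        .map(|c| c.memory_id.as_str())
        .collect();
    // An empty injected set is the no_recall decision; otherwise grade
    // the injected memories against the gold support set.
    let hits = injected.iter().filter(|m| gold_memory_ids.contains(*m)).count();
    (!injected.is_empty(), hits, injected.len())
}
```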
The current starter run is intentionally small, so the next real step after implementation is to expand the passive-recall dataset rather than overfitting to the 8-query slice.