KLBR Passive Recall Benchmark Plan

This document defines the next benchmark slice for memory use on non-question inputs.

Why This Exists

The current retrieval benchmark mostly measures:

  • answerable factual lookup
  • rerank confidence
  • expand vs abstain decisions

That is necessary, but not sufficient for KLBR.

KLBR also needs to decide whether to recall memory for inputs like:

  • "still working on the benchmark stuff"
  • "i'm back in zed again"
  • "reranker on 8003"
  • "lol yeah that makes sense"

Those are not explicit factual questions, but the system still needs a memory-use policy for them.

Two Separate Decisions

We should treat memory use as two linked decisions:

  1. activation: should memory be recalled at all for this input?
  2. trust: if memory is recalled, are the selected memories actually the right support?

The current question-oriented benchmark mostly covers trust. Passive recall needs to evaluate both.
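The split between the two decisions can be sketched in a few lines of Rust. This is a minimal illustration, not the klbr-bench implementation: the names `activation`, `trust`, and the score threshold are hypothetical stand-ins for whatever policy the real system uses.

```rust
#[derive(Debug, PartialEq)]
enum MemoryAction {
    Recall,
    NoRecall,
}

/// Decision 1 (activation): should memory be recalled at all?
/// A real policy would use calibrated retrieval signals; this
/// placeholder threshold is for illustration only.
fn activation(top_score: f32, threshold: f32) -> MemoryAction {
    if top_score >= threshold {
        MemoryAction::Recall
    } else {
        MemoryAction::NoRecall
    }
}

/// Decision 2 (trust): are the recalled memories the right support?
/// Here: every injected memory must be in the gold support set.
fn trust(injected: &[&str], gold: &[&str]) -> bool {
    !injected.is_empty() && injected.iter().all(|m| gold.contains(m))
}

fn main() {
    assert_eq!(activation(0.82, 0.5), MemoryAction::Recall);
    assert_eq!(activation(0.12, 0.5), MemoryAction::NoRecall);
    assert!(trust(&["m1", "m2"], &["m1", "m2", "m3"]));
    assert!(!trust(&["m9"], &["m1", "m2"]));
}
```

The point of keeping them as separate functions is that a system can fail either one independently: it can recall when it should stay quiet, or recall the wrong memories when recalling was correct.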

Schema

The shared eval schema now supports passive-recall cases through optional fields on EvalQuery:

  • interaction_mode
    • question
    • statement
    • update
    • request
    • fragment
  • objective
    • answer_query
    • passive_recall
  • expected_memory_action
    • recall
    • no_recall

Existing retrieval datasets do not need to set these fields. They default to question-answer behavior.
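A rough sketch of how those optional fields could hang off EvalQuery, with the default-to-question-answer behavior made explicit. Enum and field names follow the list above, but the actual types in the repo may differ, and real datasets would be deserialized from JSON rather than built by hand.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum InteractionMode { Question, Statement, Update, Request, Fragment }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Objective { AnswerQuery, PassiveRecall }

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExpectedMemoryAction { Recall, NoRecall }

#[derive(Debug, Default)]
struct EvalQuery {
    text: String,
    // All three are optional so existing retrieval datasets keep working.
    interaction_mode: Option<InteractionMode>,
    objective: Option<Objective>,
    expected_memory_action: Option<ExpectedMemoryAction>,
}

impl EvalQuery {
    /// A missing objective defaults to question-answer behavior.
    fn effective_objective(&self) -> Objective {
        self.objective.unwrap_or(Objective::AnswerQuery)
    }
}

fn main() {
    // Legacy retrieval item: no new fields set.
    let legacy = EvalQuery { text: "who wrote the reranker?".into(), ..Default::default() };
    assert_eq!(legacy.effective_objective(), Objective::AnswerQuery);

    // Passive-recall item from the new slice.
    let passive = EvalQuery {
        text: "reranker on 8003".into(),
        interaction_mode: Some(InteractionMode::Fragment),
        objective: Some(Objective::PassiveRecall),
        expected_memory_action: Some(ExpectedMemoryAction::Recall),
    };
    assert_eq!(passive.effective_objective(), Objective::PassiveRecall);
}
```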

Dataset Semantics

For passive-recall items:

  • text is the user input exactly as it would arrive at the agent
  • gold_memory_ids lists the memories that should be recalled if memory activation is correct
  • no_hit = true means the item should not recall any memory
  • expected_memory_action = no_recall is the preferred explicit label for passive-recall no-hit cases

For multi-evidence passive cases:

  • gold_memory_ids may contain several acceptable support memories
  • the goal is not necessarily single-memory top-1
  • the benchmark should allow support-style grading later
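Put together, a multi-evidence passive item might look like the following JSON fragment. The field names match the semantics above; the id and memory ids are hypothetical examples, not entries from the actual starter dataset.

```json
{
  "id": "passive_multi_evidence_example",
  "text": "reranker on 8003",
  "interaction_mode": "fragment",
  "objective": "passive_recall",
  "expected_memory_action": "recall",
  "no_hit": false,
  "gold_memory_ids": ["mem_reranker_port", "mem_bench_services"]
}
```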

First Metrics To Add

The passive-recall benchmark should report:

  • activation precision: fraction of recall-triggered inputs that should actually recall
  • activation recall: fraction of should-recall inputs where recall was triggered
  • no-recall false activation rate: fraction of no-recall inputs where memory was still injected
  • support precision@k: fraction of injected memories that are in the gold support set
  • support recall@k: fraction of gold support memories that were injected

These are deliberately different from answer/abstain metrics.
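The three activation metrics are straightforward counting; a sketch makes the denominators unambiguous. The `Item` struct here is a minimal stand-in for the real per-query results, not a klbr-bench type.

```rust
/// Minimal per-item result: gold label vs system decision.
struct Item {
    should_recall: bool, // gold: expected_memory_action == recall
    did_recall: bool,    // system: recall was actually triggered
}

/// Of the inputs where recall was triggered, how many should recall?
fn activation_precision(items: &[Item]) -> f32 {
    let triggered: Vec<_> = items.iter().filter(|i| i.did_recall).collect();
    if triggered.is_empty() { return 0.0; }
    triggered.iter().filter(|i| i.should_recall).count() as f32 / triggered.len() as f32
}

/// Of the should-recall inputs, how many actually triggered recall?
fn activation_recall(items: &[Item]) -> f32 {
    let should: Vec<_> = items.iter().filter(|i| i.should_recall).collect();
    if should.is_empty() { return 0.0; }
    should.iter().filter(|i| i.did_recall).count() as f32 / should.len() as f32
}

/// Of the no-recall inputs, how many still had memory injected?
fn false_activation_rate(items: &[Item]) -> f32 {
    let quiet: Vec<_> = items.iter().filter(|i| !i.should_recall).collect();
    if quiet.is_empty() { return 0.0; }
    quiet.iter().filter(|i| i.did_recall).count() as f32 / quiet.len() as f32
}

fn main() {
    let items = vec![
        Item { should_recall: true,  did_recall: true  },
        Item { should_recall: true,  did_recall: false },
        Item { should_recall: false, did_recall: true  },
        Item { should_recall: false, did_recall: false },
    ];
    assert_eq!(activation_precision(&items), 0.5); // 1 of 2 triggered were correct
    assert_eq!(activation_recall(&items), 0.5);    // 1 of 2 should-recall triggered
    assert_eq!(false_activation_rate(&items), 0.5); // 1 of 2 quiet inputs misfired
}
```

Note that false activation rate is not 1 minus activation precision: its denominator is the no-recall gold set, so it stays meaningful even when the system rarely triggers.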

Starter Dataset

The first starter dataset is:

  • benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json

It includes:

  • passive recent-work cues
  • elliptical requests
  • declarative state cues
  • endpoint fragments
  • multi-evidence infra cues
  • pure chatter that should not trigger recall

What Not To Do

Do not solve passive recall by inventing many brittle query classes.

The benchmark should push us toward generic features that work for both questions and non-questions, such as:

  • semantic relevance
  • lexical/entity support
  • agreement across support memories
  • confidence calibrated against no-recall cases
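For intuition, those generic features could feed a single recall confidence rather than per-class rules. The feature names, weights, and threshold below are entirely illustrative; a real system would calibrate them against the no-recall cases in the benchmark.

```rust
/// Hypothetical blend of generic features into one recall confidence.
/// Inputs are assumed to be normalized to [0, 1].
fn recall_confidence(semantic: f32, lexical: f32, agreement: f32) -> f32 {
    // Illustrative weights only; calibration against no-recall cases
    // is what makes numbers like these trustworthy.
    (0.5 * semantic + 0.3 * lexical + 0.2 * agreement).clamp(0.0, 1.0)
}

fn main() {
    let c = recall_confidence(0.9, 0.6, 0.8);
    assert!(c > 0.78 && c < 0.80);
    // A low-evidence chatter input should land well below any
    // reasonable recall threshold.
    assert!(recall_confidence(0.2, 0.0, 0.0) < 0.2);
}
```

The same scoring path then serves questions and non-questions alike, which is the point of avoiding brittle query classes.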

Current Implementation

The starter implementation now exists as:

cargo run -p klbr-bench -- passive-recall \
  benchmarks/inputs/datasets/internal_eval_passive_recall_starter.json \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/runs/manual-passive-recall-starter

It:

  1. embeds the input text
  2. retrieves candidates through the same exact-search path used by the main retrieval benchmark
  3. reranks and scores support the same way as the answer/abstain benchmark
  4. decides recall vs no_recall
  5. if recalling, grades the injected support set against gold_memory_ids
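Step 5 above is the support-grading half. A minimal sketch of that grading, computing support precision@k and recall@k against gold_memory_ids (the function name and memory ids are illustrative, not taken from the repo):

```rust
/// Grade an injected support set against the gold support set.
/// Returns (support precision@k, support recall@k), where k is
/// implicitly the number of injected memories.
fn grade_support(injected: &[&str], gold: &[&str]) -> (f32, f32) {
    let hits = injected.iter().filter(|m| gold.contains(*m)).count() as f32;
    let precision = if injected.is_empty() { 0.0 } else { hits / injected.len() as f32 };
    let recall = if gold.is_empty() { 0.0 } else { hits / gold.len() as f32 };
    (precision, recall)
}

fn main() {
    // 2 of 3 injected memories are gold; both gold memories covered.
    let (p, r) = grade_support(&["m1", "m2", "m9"], &["m1", "m2"]);
    assert!((p - 2.0 / 3.0).abs() < 1e-6);
    assert_eq!(r, 1.0);

    // A no_recall item that injected nothing grades cleanly as (0, 0).
    assert_eq!(grade_support(&[], &[]), (0.0, 0.0));
}
```

Multi-evidence items fall out naturally here: any subset of gold_memory_ids counts toward recall, which matches the earlier point that the goal is not necessarily single-memory top-1.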

The current starter run is intentionally small, so the next step is to expand the passive-recall dataset rather than overfit to the 8-query slice.