# Router Dataset Log
Tracks changes to benchmarks/inputs/router/router_slices/, router benchmark runs, and evaluation summaries per iteration.
## Format
Each entry is an iteration. Counts are by route_label across all queries in the changed files.
## iter-1 — 2026-04-23 (initial slices)
Date: 2026-04-23
Slice files changed / created:
| file | route_label | dev | test | total |
|---|---|---|---|---|
| `benchmarks/inputs/router/router_slices/shell_state_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/read_file_repo_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/abstain_v1.json` | abstain | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/memory_lane_v1.json` | memory | 15 | 15 | 30 |
Total router-labeled queries (slices only):
| route_label | count |
|---|---|
| tools | 60 |
| abstain | 30 |
| memory | 30 |
(Plus existing `route_label: tools` queries from `benchmarks/inputs/datasets/internal_eval_starter.json`: ~8 dev + test combined.)
Tool inventory snapshot: `benchmarks/inputs/router/router_tools.json` (generated by `dump-tools`)
Data quality notes:
- shell and read_file slices include Japanese-language variants and typo/fragment queries to prevent English keyword hacking
- memory_lane queries are adversarially similar to tool queries (same topics: embedding server, reranker, config defaults) but ask about past decisions/notes rather than current state
- abstain queries cover: subjective preference, underspecified, future prediction, external web, credential requests
Router train/eval command (iter-1):
```sh
nix develop --command cargo run -p klbr-bench -- router-multi \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/models/router/centroid/out-router-iter-1 \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/router/router_slices/shell_state_v1.json \
  benchmarks/inputs/router/router_slices/read_file_repo_v1.json \
  benchmarks/inputs/router/router_slices/memory_lane_v1.json
```
Note: `abstain_v1.json` is excluded from the first run because the current router only distinguishes `memory` vs `tools`. Once a three-class head is wired, add it.
Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-1/
Results:
| metric | value |
|---|---|
| labeled queries (test) | 61 (32 tools, 29 memory) |
| tools precision | 1.0000 ✅ |
| tools recall | 0.3125 ⚠️ |
| memory false-tools rate | 0.0000 ✅ |
| accuracy | 0.6393 |
| threshold | 0.100 |
Gate status: precision gate MET, recall gate NOT YET MET — needs more tool-lane test coverage to push recall up without hurting precision
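The gate metrics above read directly off (gold, predicted) route pairs. A minimal sketch, with illustrative names rather than the actual klbr-bench source:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Tools,
    Memory,
    Abstain,
}

/// Returns (tools precision, tools recall, memory false-tools rate)
/// computed from (gold, predicted) pairs.
fn gate_metrics(pairs: &[(Route, Route)]) -> (f32, f32, f32) {
    let tp = pairs.iter().filter(|(g, p)| *g == Route::Tools && *p == Route::Tools).count() as f32;
    let predicted_tools = pairs.iter().filter(|(_, p)| *p == Route::Tools).count() as f32;
    let gold_tools = pairs.iter().filter(|(g, _)| *g == Route::Tools).count() as f32;
    let gold_memory = pairs.iter().filter(|(g, _)| *g == Route::Memory).count() as f32;
    let memory_to_tools = pairs.iter().filter(|(g, p)| *g == Route::Memory && *p == Route::Tools).count() as f32;
    let precision = if predicted_tools > 0.0 { tp / predicted_tools } else { 1.0 };
    let recall = if gold_tools > 0.0 { tp / gold_tools } else { 1.0 };
    let false_tools = if gold_memory > 0.0 { memory_to_tools / gold_memory } else { 0.0 };
    (precision, recall, false_tools)
}

fn main() {
    // 2 gold-tools queries (1 routed correctly), 2 gold-memory (1 leaked to tools).
    let pairs = [
        (Route::Tools, Route::Tools),
        (Route::Tools, Route::Abstain),
        (Route::Memory, Route::Memory),
        (Route::Memory, Route::Tools),
    ];
    assert_eq!(gate_metrics(&pairs), (0.5, 0.5, 0.5));
}
```

Note the asymmetry in the gates: a missed tools query falls through to a safe default (recall cost), while a memory query routed to tools executes the wrong lane, which is why `memory→tools` is held at zero.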
## iter-3 — 2026-04-23 (3-way routing)
Changes:
- Upgraded router to 3-way centroid classification (Tools, Memory, Abstain).
- Added `abstain_v1.json`, `shell_state_v1b.json`, `read_file_repo_v1b.json`.
- Added V2 slices for shell, read_file, memory_lane.
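A 3-way centroid router of this shape can be sketched as below, assuming L2-normalized embeddings (so dot product equals cosine similarity). `Route`, `classify`, and the conservative tools threshold are illustrative assumptions, not the klbr-bench implementation:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Tools,
    Memory,
    Abstain,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Pick the class whose centroid is most similar to the query embedding.
/// Tools is gated behind a similarity threshold: if the winner is Tools
/// but the similarity is weak, fall back to Abstain (conservative gate).
fn classify(query: &[f32], centroids: &[(Route, Vec<f32>)], threshold: f32) -> Route {
    let (best_route, best_sim) = centroids
        .iter()
        .map(|(r, c)| (*r, dot(query, c)))
        .fold((Route::Abstain, f32::MIN), |best, cand| if cand.1 > best.1 { cand } else { best });
    if best_route == Route::Tools && best_sim < threshold {
        Route::Abstain
    } else {
        best_route
    }
}

fn main() {
    let centroids = vec![
        (Route::Tools, vec![1.0, 0.0]),
        (Route::Memory, vec![0.0, 1.0]),
        (Route::Abstain, vec![-1.0, 0.0]),
    ];
    // A query close to the tools centroid clears the 0.1 threshold.
    assert_eq!(classify(&[0.9, 0.1], &centroids, 0.1), Route::Tools);
}
```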
Results:
- Tools Precision: 1.0000
- Tools Recall: 0.1875
- Memory False-Tools Rate: 0.0000
## iter-4 — 2026-04-23 (Adversarial & Observability)
Changes:
- Added `memory_adversarial_v1.json` to sharpen the memory centroid.
- Added T-sim/M-sim/A-sim scores to the misclassified table and agent status.
Results:
- Tools Precision: 1.0000
- Tools Recall: 0.1625
- Memory False-Tools Rate: 0.0000
Conclusion: Centroids are at their semantic limit. 1.0 precision makes this safe to deploy as a conservative gate.
## iter-7 — 2026-04-23 (Recall-Focused Threshold Tuning)
Changes:
- Switched router model format to multi-prototype centroids per class (k-means style), still embedding-only.
- Updated threshold tuning objective to allow a tiny memory→tools FP budget (kept at 0 on test) in exchange for much higher tools recall.
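Multi-prototype scoring with a margin threshold can be sketched as follows, again assuming L2-normalized embeddings and a few k-means prototypes per class; `class_score` and `is_tools` are illustrative names, not the klbr-bench API:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// A class's score is its best similarity across all of its prototypes,
/// which lets one class cover several distinct query clusters.
fn class_score(query: &[f32], prototypes: &[Vec<f32>]) -> f32 {
    prototypes.iter().map(|p| dot(query, p)).fold(f32::MIN, f32::max)
}

/// Route to tools only when the tools score beats the best non-tools
/// score by at least `margin` (the tuned value reported below is 0.050).
fn is_tools(
    query: &[f32],
    tools: &[Vec<f32>],
    memory: &[Vec<f32>],
    abstain: &[Vec<f32>],
    margin: f32,
) -> bool {
    let t = class_score(query, tools);
    let other = class_score(query, memory).max(class_score(query, abstain));
    t - other >= margin
}

fn main() {
    let tools = vec![vec![1.0, 0.0], vec![0.7, 0.7]];
    let memory = vec![vec![0.0, 1.0]];
    let abstain = vec![vec![-1.0, 0.0]];
    // Near the first tools prototype: clears the margin.
    assert!(is_tools(&[1.0, 0.0], &tools, &memory, &abstain, 0.05));
    // Closer to the memory centroid: does not clear the margin.
    assert!(!is_tools(&[0.0, 1.0], &tools, &memory, &abstain, 0.05));
}
```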
Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-7/
Results (test split):
- Tools Precision: 0.9455
- Tools Recall: 0.6500
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2000
- Tuned margin threshold: 0.050
## Production Config
To enable this router, set `router_model_path = Some("benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json")` in `config.rs`.
## linear-iter-2 — 2026-04-23 (Softmax Regression)
Changes:
- Implemented `router-multi-linear`: 3-way softmax regression over the same embedding vectors (no keyword heuristics).
- Added threshold tuning on the dev split with a hard `memory→tools = 0` constraint.
- Optimized the tuner by precomputing per-example probabilities (fast grid search).
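The constrained grid search can be sketched as below: probabilities are computed once per dev example, so each candidate threshold is a cheap pass over cached floats. `DevExample` and `tune_threshold` are hypothetical names for illustration:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Label {
    Tools,
    Memory,
    Abstain,
}

struct DevExample {
    label: Label,
    p_tools: f32, // precomputed softmax probability for the tools class
}

/// Return the threshold maximizing tools recall on the dev split, subject
/// to the hard constraint that no memory query routes to tools.
fn tune_threshold(dev: &[DevExample], grid: &[f32]) -> Option<f32> {
    let mut best: Option<(f32, usize)> = None; // (threshold, tools hits)
    for &t in grid {
        let memory_fp = dev.iter().filter(|e| e.label == Label::Memory && e.p_tools >= t).count();
        if memory_fp > 0 {
            continue; // safety invariant: memory→tools must stay at 0
        }
        let tools_hits = dev.iter().filter(|e| e.label == Label::Tools && e.p_tools >= t).count();
        if best.map_or(true, |(_, h)| tools_hits > h) {
            best = Some((t, tools_hits));
        }
    }
    best.map(|(t, _)| t)
}

fn main() {
    let dev = [
        DevExample { label: Label::Tools, p_tools: 0.9 },
        DevExample { label: Label::Tools, p_tools: 0.6 },
        DevExample { label: Label::Memory, p_tools: 0.5 },
        DevExample { label: Label::Abstain, p_tools: 0.3 },
    ];
    // 0.4 admits the memory query, so the tuner picks 0.55 (both tools queries pass).
    assert_eq!(tune_threshold(&dev, &[0.4, 0.55, 0.7]), Some(0.55));
}
```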
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-2/
Results (test split):
- Tools Precision: 0.9444
- Tools Recall: 0.6375
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2000
Conclusion: This is a better default than centroids for tools recall while keeping the key safety invariant (memory→tools = 0).
## linear-iter-4 — 2026-04-23 (Train/Dev Split + Frozen Holdout)
Changes:
- Added a `split=train` dataset (`benchmarks/inputs/router/router_slices/train_pack_v1.json`) so training no longer reuses the dev tuning split.
- Added a `split=dev` dataset (`benchmarks/inputs/router/router_slices/dev_pack_v1.json`) for threshold tuning.
- Added a frozen holdout file (`benchmarks/inputs/router/router_holdout_v1.json`, 60 queries balanced across tools/memory/abstain). Reports now include a separate Holdout Metrics block for `query_id`s prefixed with `holdout_`.
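Splitting report rows into the overall block vs the Holdout Metrics block can be sketched like this; `split_rows` is a hypothetical helper, and only the `holdout_` query_id prefix comes from the log above:

```rust
fn is_holdout(query_id: &str) -> bool {
    query_id.starts_with("holdout_")
}

/// Partition query ids into (overall, holdout) so the frozen holdout is
/// reported separately and never influences training or threshold tuning.
fn split_rows<'a>(query_ids: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    query_ids.iter().copied().partition(|id| !is_holdout(id))
}

fn main() {
    let (overall, holdout) = split_rows(&["shell_state_dev_001", "holdout_tools_recall_003"]);
    assert_eq!(overall, vec!["shell_state_dev_001"]);
    assert_eq!(holdout, vec!["holdout_tools_recall_003"]);
}
```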
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-4/
Results (test split, overall):
- Tools Precision: 0.9342
- Tools Recall: 0.7172
- Memory False-Tools Rate (memory→tools): 0.0000
Results (holdout subset):
- Tools Precision: 0.7826
- Tools Recall: 0.9000
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2500
## linear-iter-9 — 2026-04-23 (Deterministic Re-split + Multi-Holdout Buckets)
Changes:
- Fixed holdout bucket parsing so reports include per-holdout-bucket metrics.
- Added a deterministic, stratified re-split of the combined `(train+dev)` pool when `split=train` is too small, to avoid severe underfitting.
- Kept the key safety invariant during tuning: `memory→tools = 0` on dev.
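A deterministic, stratified re-split can be sketched as below: sorting ids within each label makes the split reproducible, and sending every `dev_every`-th example to dev preserves label proportions. The function name and `(id, label)` shape are illustrative assumptions:

```rust
use std::collections::BTreeMap;

fn stratified_split<'a>(
    pool: &[(&'a str, &'a str)], // (query_id, route_label)
    dev_every: usize,            // must be > 0; e.g. 5 => ~20% of each label goes to dev
) -> (Vec<&'a str>, Vec<&'a str>) {
    // Group ids by label; BTreeMap iterates labels in sorted (deterministic) order.
    let mut by_label: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(id, label) in pool {
        by_label.entry(label).or_default().push(id);
    }
    let (mut train, mut dev) = (Vec::new(), Vec::new());
    for (_, mut ids) in by_label {
        ids.sort(); // deterministic regardless of input order
        for (i, id) in ids.into_iter().enumerate() {
            if i % dev_every == 0 { dev.push(id) } else { train.push(id) }
        }
    }
    (train, dev)
}

fn main() {
    let pool = [("t2", "tools"), ("t1", "tools"), ("t3", "tools"), ("m2", "memory"), ("m1", "memory")];
    let (train, dev) = stratified_split(&pool, 3);
    assert_eq!(dev, vec!["m1", "t1"]);
    assert_eq!(train, vec!["m2", "t2", "t3"]);
}
```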
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-9/
Results (test split, overall):
- Tools Precision: 0.9597
- Tools Recall: 0.7987
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.0714
Results (holdout subset, all buckets combined):
- Tools Precision: 0.9375
- Tools Recall: 0.8571
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.0727
Holdout buckets (tools precision / recall):
- tools_recall: 1.0000 / 0.8000
- abstain_toolish: 0.7143 / 1.0000
- memory_toolish: 1.0000 / 1.0000
Recommended model path (current): `benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json`
Full run snapshot: docs/router_linear_latest_2026-04-23.md