# Router Dataset Log
Tracks changes to benchmarks/inputs/router/router_slices/, router benchmark runs, and evaluation summaries per iteration.
## Format
Each entry is an iteration. Counts are by route_label across all queries in the changed files.
## iter-1 — 2026-04-23 (initial slices)
Date: 2026-04-23
Slice files changed / created:
| file | route_label | dev | test | total |
|---|---|---|---|---|
| `benchmarks/inputs/router/router_slices/shell_state_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/read_file_repo_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/abstain_v1.json` | abstain | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/memory_lane_v1.json` | memory | 15 | 15 | 30 |
Total router-labeled queries (slices only):
| route_label | count |
|---|---|
| tools | 60 |
| abstain | 30 |
| memory | 30 |
(Plus existing `route_label: tools` queries from `benchmarks/inputs/datasets/internal_eval_starter.json`: ~8 dev + test combined.)
Tool inventory snapshot: `benchmarks/inputs/router/router_tools.json` (generated by `dump-tools`)
Data quality notes:
- shell and read_file slices include Japanese-language variants and typo/fragment queries to prevent English keyword hacking
- memory_lane queries are adversarially similar to tool queries (same topics: embedding server, reranker, config defaults) but ask about past decisions/notes rather than current state
- abstain queries cover: subjective preference, underspecified, future prediction, external web, credential requests
Router train/eval command (iter-1):
```sh
nix develop --command cargo run -p klbr-bench -- router-multi \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/models/router/centroid/out-router-iter-1 \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/router/router_slices/shell_state_v1.json \
  benchmarks/inputs/router/router_slices/read_file_repo_v1.json \
  benchmarks/inputs/router/router_slices/memory_lane_v1.json
```
Note: `abstain_v1.json` is excluded from the first run because the current router only distinguishes `memory` vs `tools`. Once a three-class head is wired, add it.
Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-1/
Results:
| metric | value |
|---|---|
| labeled queries (test) | 61 (32 tools, 29 memory) |
| tools precision | 1.0000 ✅ |
| tools recall | 0.3125 ⚠️ |
| memory false-tools rate | 0.0000 ✅ |
| accuracy | 0.6393 |
| threshold | 0.100 |
Gate status: precision gate MET, recall gate NOT YET MET — needs more tool-lane test coverage to push recall up without hurting precision
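The gate metrics above read directly off (gold, predicted) route pairs. A minimal sketch, with illustrative names rather than the actual klbr-bench source:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Tools,
    Memory,
    Abstain,
}

/// Returns (tools precision, tools recall, memory false-tools rate)
/// computed from (gold, predicted) pairs.
fn gate_metrics(pairs: &[(Route, Route)]) -> (f32, f32, f32) {
    let tp = pairs.iter().filter(|(g, p)| *g == Route::Tools && *p == Route::Tools).count() as f32;
    let predicted_tools = pairs.iter().filter(|(_, p)| *p == Route::Tools).count() as f32;
    let gold_tools = pairs.iter().filter(|(g, _)| *g == Route::Tools).count() as f32;
    let gold_memory = pairs.iter().filter(|(g, _)| *g == Route::Memory).count() as f32;
    let memory_to_tools = pairs.iter().filter(|(g, p)| *g == Route::Memory && *p == Route::Tools).count() as f32;
    let precision = if predicted_tools > 0.0 { tp / predicted_tools } else { 1.0 };
    let recall = if gold_tools > 0.0 { tp / gold_tools } else { 1.0 };
    let false_tools = if gold_memory > 0.0 { memory_to_tools / gold_memory } else { 0.0 };
    (precision, recall, false_tools)
}

fn main() {
    // 2 gold-tools queries (1 routed correctly), 2 gold-memory (1 leaked to tools).
    let pairs = [
        (Route::Tools, Route::Tools),
        (Route::Tools, Route::Abstain),
        (Route::Memory, Route::Memory),
        (Route::Memory, Route::Tools),
    ];
    assert_eq!(gate_metrics(&pairs), (0.5, 0.5, 0.5));
}
```

Note the asymmetry in the gates: a missed tools query falls through to a safe default (recall cost), while a memory query routed to tools executes the wrong lane, which is why `memory→tools` is held at zero.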
## iter-3 — 2026-04-23 (3-way routing)
Changes:
- Upgraded router to 3-way centroid classification (Tools, Memory, Abstain).
- Added `abstain_v1.json`, `shell_state_v1b.json`, `read_file_repo_v1b.json`.
- Added V2 slices for shell, read_file, memory_lane.
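A 3-way centroid router of this shape can be sketched as below, assuming L2-normalized embeddings (so dot product equals cosine similarity). `Route`, `classify`, and the conservative tools threshold are illustrative assumptions, not the klbr-bench implementation:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Tools,
    Memory,
    Abstain,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Pick the class whose centroid is most similar to the query embedding.
/// Tools is gated behind a similarity threshold: if the winner is Tools
/// but the similarity is weak, fall back to Abstain (conservative gate).
fn classify(query: &[f32], centroids: &[(Route, Vec<f32>)], threshold: f32) -> Route {
    let (best_route, best_sim) = centroids
        .iter()
        .map(|(r, c)| (*r, dot(query, c)))
        .fold((Route::Abstain, f32::MIN), |best, cand| if cand.1 > best.1 { cand } else { best });
    if best_route == Route::Tools && best_sim < threshold {
        Route::Abstain
    } else {
        best_route
    }
}

fn main() {
    let centroids = vec![
        (Route::Tools, vec![1.0, 0.0]),
        (Route::Memory, vec![0.0, 1.0]),
        (Route::Abstain, vec![-1.0, 0.0]),
    ];
    // A query close to the tools centroid clears the 0.1 threshold.
    assert_eq!(classify(&[0.9, 0.1], &centroids, 0.1), Route::Tools);
}
```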
Results:
- Tools Precision: 1.0000
- Tools Recall: 0.1875
- Memory False-Tools Rate: 0.0000
## iter-4 — 2026-04-23 (Adversarial & Observability)
Changes:
- Added `memory_adversarial_v1.json` to sharpen the memory centroid.
- Added T-sim/M-sim/A-sim scores to the misclassified table and agent status.
Results:
- Tools Precision: 1.0000
- Tools Recall: 0.1625
- Memory False-Tools Rate: 0.0000
Conclusion: Centroids are at their semantic limit. 1.0 precision makes this safe to deploy as a conservative gate.
## iter-7 — 2026-04-23 (Recall-Focused Threshold Tuning)
Changes:
- Switched router model format to multi-prototype centroids per class (k-means style), still embedding-only.
- Updated threshold tuning objective to allow a tiny memory→tools FP budget (kept at 0 on test) in exchange for much higher tools recall.
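Multi-prototype scoring with a margin threshold can be sketched as follows, again assuming L2-normalized embeddings and a few k-means prototypes per class; `class_score` and `is_tools` are illustrative names, not the klbr-bench API:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// A class's score is its best similarity across all of its prototypes,
/// which lets one class cover several distinct query clusters.
fn class_score(query: &[f32], prototypes: &[Vec<f32>]) -> f32 {
    prototypes.iter().map(|p| dot(query, p)).fold(f32::MIN, f32::max)
}

/// Route to tools only when the tools score beats the best non-tools
/// score by at least `margin` (the tuned value reported below is 0.050).
fn is_tools(
    query: &[f32],
    tools: &[Vec<f32>],
    memory: &[Vec<f32>],
    abstain: &[Vec<f32>],
    margin: f32,
) -> bool {
    let t = class_score(query, tools);
    let other = class_score(query, memory).max(class_score(query, abstain));
    t - other >= margin
}

fn main() {
    let tools = vec![vec![1.0, 0.0], vec![0.7, 0.7]];
    let memory = vec![vec![0.0, 1.0]];
    let abstain = vec![vec![-1.0, 0.0]];
    // Near the first tools prototype: clears the margin.
    assert!(is_tools(&[1.0, 0.0], &tools, &memory, &abstain, 0.05));
    // Closer to the memory centroid: does not clear the margin.
    assert!(!is_tools(&[0.0, 1.0], &tools, &memory, &abstain, 0.05));
}
```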
Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-7/
Results (test split):
- Tools Precision: 0.9455
- Tools Recall: 0.6500
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2000
- Tuned margin threshold: 0.050
## Production Config
To enable this router, set `router_model_path = Some("benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json")` in `config.rs`.
## linear-iter-2 — 2026-04-23 (Softmax Regression)
Changes:
- Implemented `router-multi-linear`: 3-way softmax regression over the same embedding vectors (no keyword heuristics).
- Added threshold tuning on the dev split with a hard `memory→tools = 0` constraint.
- Optimized the tuner by precomputing per-example probabilities (fast grid search).
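The constrained grid search can be sketched as below: probabilities are computed once per dev example, so each candidate threshold is a cheap pass over cached floats. `DevExample` and `tune_threshold` are hypothetical names for illustration:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Label {
    Tools,
    Memory,
    Abstain,
}

struct DevExample {
    label: Label,
    p_tools: f32, // precomputed softmax probability for the tools class
}

/// Return the threshold maximizing tools recall on the dev split, subject
/// to the hard constraint that no memory query routes to tools.
fn tune_threshold(dev: &[DevExample], grid: &[f32]) -> Option<f32> {
    let mut best: Option<(f32, usize)> = None; // (threshold, tools hits)
    for &t in grid {
        let memory_fp = dev.iter().filter(|e| e.label == Label::Memory && e.p_tools >= t).count();
        if memory_fp > 0 {
            continue; // safety invariant: memory→tools must stay at 0
        }
        let tools_hits = dev.iter().filter(|e| e.label == Label::Tools && e.p_tools >= t).count();
        if best.map_or(true, |(_, h)| tools_hits > h) {
            best = Some((t, tools_hits));
        }
    }
    best.map(|(t, _)| t)
}

fn main() {
    let dev = [
        DevExample { label: Label::Tools, p_tools: 0.9 },
        DevExample { label: Label::Tools, p_tools: 0.6 },
        DevExample { label: Label::Memory, p_tools: 0.5 },
        DevExample { label: Label::Abstain, p_tools: 0.3 },
    ];
    // 0.4 admits the memory query, so the tuner picks 0.55 (both tools queries pass).
    assert_eq!(tune_threshold(&dev, &[0.4, 0.55, 0.7]), Some(0.55));
}
```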
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-2/
Results (test split):
- Tools Precision: 0.9444
- Tools Recall: 0.6375
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2000
Conclusion: This is a better default than centroids for tools recall while keeping the key safety invariant (memory→tools = 0).
## linear-iter-4 — 2026-04-23 (Train/Dev Split + Frozen Holdout)
Changes:
- Added a `split=train` dataset (`benchmarks/inputs/router/router_slices/train_pack_v1.json`) so training no longer reuses the dev tuning split.
- Added a `split=dev` dataset (`benchmarks/inputs/router/router_slices/dev_pack_v1.json`) for threshold tuning.
- Added a frozen holdout file (`benchmarks/inputs/router/router_holdout_v1.json`, 60 queries balanced across tools/memory/abstain). Reports now include a separate Holdout Metrics block for `query_id`s prefixed with `holdout_`.
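Splitting report rows into the overall block vs the Holdout Metrics block can be sketched like this; `split_rows` is a hypothetical helper, and only the `holdout_` query_id prefix comes from the log above:

```rust
fn is_holdout(query_id: &str) -> bool {
    query_id.starts_with("holdout_")
}

/// Partition query ids into (overall, holdout) so the frozen holdout is
/// reported separately and never influences training or threshold tuning.
fn split_rows<'a>(query_ids: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    query_ids.iter().copied().partition(|id| !is_holdout(id))
}

fn main() {
    let (overall, holdout) = split_rows(&["shell_state_dev_001", "holdout_tools_recall_003"]);
    assert_eq!(overall, vec!["shell_state_dev_001"]);
    assert_eq!(holdout, vec!["holdout_tools_recall_003"]);
}
```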
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-4/
Results (test split, overall):
- Tools Precision: 0.9342
- Tools Recall: 0.7172
- Memory False-Tools Rate (memory→tools): 0.0000
Results (holdout subset):
- Tools Precision: 0.7826
- Tools Recall: 0.9000
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.2500
## linear-iter-9 — 2026-04-23 (Deterministic Re-split + Multi-Holdout Buckets)
Changes:
- Fixed holdout bucket parsing so reports include per-holdout-bucket metrics.
- Added a deterministic, stratified re-split of the combined `(train+dev)` pool when `split=train` is too small, to avoid severe underfitting.
- Kept the key safety invariant during tuning: `memory→tools = 0` on dev.
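A deterministic, stratified re-split can be sketched as below: sorting ids within each label makes the split reproducible, and sending every `dev_every`-th example to dev preserves label proportions. The function name and `(id, label)` shape are illustrative assumptions:

```rust
use std::collections::BTreeMap;

fn stratified_split<'a>(
    pool: &[(&'a str, &'a str)], // (query_id, route_label)
    dev_every: usize,            // must be > 0; e.g. 5 => ~20% of each label goes to dev
) -> (Vec<&'a str>, Vec<&'a str>) {
    // Group ids by label; BTreeMap iterates labels in sorted (deterministic) order.
    let mut by_label: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(id, label) in pool {
        by_label.entry(label).or_default().push(id);
    }
    let (mut train, mut dev) = (Vec::new(), Vec::new());
    for (_, mut ids) in by_label {
        ids.sort(); // deterministic regardless of input order
        for (i, id) in ids.into_iter().enumerate() {
            if i % dev_every == 0 { dev.push(id) } else { train.push(id) }
        }
    }
    (train, dev)
}

fn main() {
    let pool = [("t2", "tools"), ("t1", "tools"), ("t3", "tools"), ("m2", "memory"), ("m1", "memory")];
    let (train, dev) = stratified_split(&pool, 3);
    assert_eq!(dev, vec!["m1", "t1"]);
    assert_eq!(train, vec!["m2", "t2", "t3"]);
}
```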
Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-9/
Results (test split, overall):
- Tools Precision: 0.9597
- Tools Recall: 0.7987
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.0714
Results (holdout subset, all buckets combined):
- Tools Precision: 0.9375
- Tools Recall: 0.8571
- Memory False-Tools Rate (memory→tools): 0.0000
- Abstain False-Tools Rate (abstain→tools): 0.0727
Holdout buckets (tools precision / recall):
- tools_recall: 1.0000 / 0.8000
- abstain_toolish: 0.7143 / 1.0000
- memory_toolish: 1.0000 / 1.0000
Recommended model path (current): `benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json`
Full run snapshot: docs/router_linear_latest_2026-04-23.md