# Router Dataset Log

Tracks changes to `benchmarks/inputs/router/router_slices/`, router benchmark runs, and evaluation summaries per iteration.

---

## Format

Each entry is an iteration. Counts are by `route_label` across all queries in the changed files.

---

## iter-1 — 2026-04-23 (initial slices)

**Date:** 2026-04-23

**Slice files changed / created:**

| file | route_label | dev | test | total |
|------|------------|-----|------|-------|
| `benchmarks/inputs/router/router_slices/shell_state_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/read_file_repo_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/abstain_v1.json` | abstain | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/memory_lane_v1.json` | memory | 15 | 15 | 30 |

**Total router-labeled queries (slices only):**

| route_label | count |
|------------|-------|
| tools | 60 |
| abstain | 30 |
| memory | 30 |

(Plus existing `route_label: tools` queries from `benchmarks/inputs/datasets/internal_eval_starter.json`: ~8 dev + test combined.)

**Tool inventory snapshot:** `benchmarks/inputs/router/router_tools.json` (generated by `dump-tools`)

**Data quality notes:**

- shell and read_file slices include Japanese-language variants and typo/fragment queries to prevent English keyword hacking
- memory_lane queries are adversarially similar to tool queries (same topics: embedding server, reranker, config defaults) but ask about past decisions/notes rather than current state
- abstain queries cover: subjective preference, underspecified, future prediction, external web, credential requests

**Router train/eval command (iter-1):**

```
nix develop --command cargo run -p klbr-bench -- router-multi \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/models/router/centroid/out-router-iter-1 \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/router/router_slices/shell_state_v1.json \
  benchmarks/inputs/router/router_slices/read_file_repo_v1.json \
  benchmarks/inputs/router/router_slices/memory_lane_v1.json
```

> Note: `abstain_v1.json` is excluded from the first run because the current router only distinguishes `memory` vs `tools`. Once a three-class head is wired, add it.

**Benchmark output dir:** `benchmarks/models/router/centroid/out-router-iter-1/`

**Results:**

| metric | value |
|--------|-------|
| labeled queries (test) | 61 (32 tools, 29 memory) |
| tools precision | **1.0000** ✅ |
| tools recall | 0.3125 ⚠️ |
| memory false-tools rate | **0.0000** ✅ |
| accuracy | 0.6393 |
| threshold | 0.100 |

**Gate status:** precision gate MET, recall gate NOT YET MET. Needs more tool-lane test coverage to push recall up without hurting precision.

---

## iter-3 — 2026-04-23 (3-way routing)

**Changes:**

- Upgraded router to 3-way centroid classification (Tools, Memory, Abstain).
- Added `abstain_v1.json`, `shell_state_v1b.json`, `read_file_repo_v1b.json`.
- Added V2 slices for shell, read_file, memory_lane.

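As a rough illustration of what 3-way centroid classification over embeddings can look like, here is a minimal Python sketch. It is not the project's implementation: the function names, the toy 2-D vectors, and the rule that `tools` must beat the runner-up by a `margin` are all assumptions made for this example. The log does not specify how the real router applies its threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def route(embedding, centroids, margin=0.1):
    """Return (label, sims) for one query embedding.

    ASSUMPTION (hypothetical rule): 'tools' is chosen only when its similarity
    exceeds the best non-tools class by at least `margin`; otherwise the best
    non-tools class wins. This keeps the tools lane conservative.
    """
    sims = {label: cosine(embedding, c) for label, c in centroids.items()}
    best_other = max((l for l in sims if l != "tools"), key=sims.get)
    if sims["tools"] - sims[best_other] >= margin:
        return "tools", sims
    return best_other, sims

# Toy 2-D "embeddings"; real vectors would come from the embedding model.
centroids = {
    "tools": centroid([[1.0, 0.0], [0.9, 0.1]]),
    "memory": centroid([[0.0, 1.0], [0.1, 0.9]]),
    "abstain": centroid([[-1.0, 0.0], [-0.9, -0.1]]),
}

label_clear, _ = route([1.0, 0.05], centroids)           # clearly tool-like
label_close, sims_close = route([0.7, 0.7], centroids)   # equally close to tools and memory
```

In this sketch the ambiguous query falls back to `memory` because the tools margin is not met, which mirrors the precision-first behavior the log aims for.
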
**Results:**

- Tools Precision: **1.0000**
- Tools Recall: 0.1875
- Memory False-Tools Rate: **0.0000**

---

## iter-4 — 2026-04-23 (Adversarial & Observability)

**Changes:**

- Added `memory_adversarial_v1.json` to sharpen memory centroid.
- Added T-sim/M-sim/A-sim scores to misclassified table and agent status.

**Results:**

- Tools Precision: **1.0000**
- Tools Recall: 0.1625
- Memory False-Tools Rate: **0.0000**

**Conclusion:** Centroids are at their semantic limit: tools recall has not improved since iter-1 (0.3125, then 0.1875, now 0.1625). Tools precision of 1.0000 makes this safe to deploy as a conservative gate.

---

## iter-7 — 2026-04-23 (Recall-Focused Threshold Tuning)

**Changes:**

- Switched router model format to multi-prototype centroids per class (k-means style), still embedding-only.
- Updated threshold tuning objective to allow a tiny memory→tools FP budget (kept at 0 on test) in exchange for much higher tools recall.

**Benchmark output dir:** `benchmarks/models/router/centroid/out-router-iter-7/`

**Results (test split):**

- Tools Precision: `0.9455`
- Tools Recall: **`0.6500`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2000`
- Tuned margin threshold: `0.050`

---

## Production Config

To enable the currently recommended router (linear-iter-9), set `router_model_path = Some("benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json")` in `config.rs`.

---

## linear-iter-2 — 2026-04-23 (Softmax Regression)

**Changes:**

- Implemented `router-multi-linear`: 3-way softmax regression over the same embedding vectors (no keyword heuristics).
- Added threshold tuning on dev split with a hard `memory→tools = 0` constraint.
- Optimized the tuner by precomputing per-example probabilities (fast grid search).

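The constrained threshold search described in the changes above can be sketched as follows. This is a hedged Python illustration, not the `router-multi-linear` code: the function names, the toy logits, the grid, and the decision rule "route to tools when p(tools) >= threshold" are assumptions. The key ideas it shows are (1) computing softmax probabilities once up front so each grid step is only comparisons, and (2) rejecting any threshold that sends a dev `memory` query to `tools`.

```python
import math

LABELS = ("tools", "memory", "abstain")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tune_tools_threshold(dev_logits, dev_labels, grid=None):
    """Grid-search a tools threshold on the dev split.

    ASSUMPTION: a query is routed to tools iff p(tools) >= threshold.
    Hard constraint: zero memory->tools errors on dev.
    Returns {"threshold", "tools_recall"} or None if no threshold is feasible.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    # Precompute p(tools) once; the loop below never re-runs the model.
    p_tools = [softmax(z)[LABELS.index("tools")] for z in dev_logits]
    n_tools = sum(1 for y in dev_labels if y == "tools")
    best = None
    for t in grid:
        hits = [p >= t for p in p_tools]
        if any(h and y == "memory" for h, y in zip(hits, dev_labels)):
            continue  # violates memory->tools = 0
        tp = sum(1 for h, y in zip(hits, dev_labels) if h and y == "tools")
        recall = tp / n_tools if n_tools else 0.0
        # Strict '>' keeps the lowest threshold among equal-recall candidates.
        if best is None or recall > best["tools_recall"]:
            best = {"threshold": t, "tools_recall": recall}
    return best

# Toy dev set: logits are ordered as LABELS (tools, memory, abstain).
dev_logits = [[3, 0, 0], [1, 0, 0], [0, 1, 0], [0.5, 1, 0], [0, 3, 0], [1, 0, 1]]
dev_labels = ["tools", "tools", "tools", "memory", "memory", "abstain"]
result = tune_tools_threshold(dev_logits, dev_labels)
```

On this toy data the search settles just above the highest `memory` tools-probability, trading one missed tools query for the safety invariant. An abstain query can still leak into tools, which is consistent with the non-zero abstain→tools rates reported in this log.
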
**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-2/`

**Results (test split):**

- Tools Precision: `0.9444`
- Tools Recall: `0.6375`
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2000`

**Conclusion:** This is a better default than centroids for tools recall while keeping the key safety invariant (`memory→tools = 0`).

---

## linear-iter-4 — 2026-04-23 (Train/Dev Split + Frozen Holdout)

**Changes:**

- Added `split=train` dataset (`benchmarks/inputs/router/router_slices/train_pack_v1.json`) so training no longer reuses the dev tuning split.
- Added `split=dev` dataset (`benchmarks/inputs/router/router_slices/dev_pack_v1.json`) for threshold tuning.
- Added frozen holdout file (`benchmarks/inputs/router/router_holdout_v1.json`, 60 queries balanced across tools/memory/abstain). Reports now include a separate Holdout Metrics block for `query_id` prefixed with `holdout_`.

**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-4/`

**Results (test split, overall):**

- Tools Precision: `0.9342`
- Tools Recall: `0.7172`
- Memory False-Tools Rate (memory→tools): **`0.0000`**

**Results (holdout subset):**

- Tools Precision: `0.7826`
- Tools Recall: `0.9000`
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2500`

---

## linear-iter-9 — 2026-04-23 (Deterministic Re-split + Multi-Holdout Buckets)

**Changes:**

- Fixed holdout bucket parsing so reports include per-holdout-bucket metrics.
- Added deterministic, stratified re-split of the combined `(train+dev)` pool when `split=train` is too small, to avoid severe underfitting.
- Kept key safety invariant during tuning: `memory→tools = 0` on dev.

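One way to make a stratified re-split deterministic is to order each label group by a stable hash of the query ID instead of using a random shuffle. The Python sketch below illustrates that idea only; it is not the project's code. The `salt` value, the `dev_fraction` default, and the record shape (`query_id`, `route_label`) are assumptions, and the "train is too small" trigger mentioned above is not modeled.

```python
import hashlib
from collections import defaultdict

def stable_key(query_id, salt="resplit-v1"):
    """Pseudo-random but reproducible sort key derived from SHA-256."""
    return hashlib.sha256(f"{salt}:{query_id}".encode("utf-8")).hexdigest()

def stratified_resplit(examples, dev_fraction=0.2):
    """Split a combined (train+dev) pool into (train, dev).

    Each route_label contributes roughly `dev_fraction` of its examples to dev.
    The result depends only on the query IDs, not on input order or RNG state.
    """
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["route_label"]].append(ex)

    train, dev = [], []
    for label in sorted(by_label):
        group = sorted(by_label[label], key=lambda ex: stable_key(ex["query_id"]))
        if len(group) > 1:
            # At least one example in dev, and at least one left for train.
            n_dev = min(len(group) - 1, max(1, round(len(group) * dev_fraction)))
        else:
            n_dev = 0
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev

# Toy pool: 10 queries per label with hypothetical IDs.
pool = [
    {"query_id": f"{label}_{i}", "route_label": label}
    for label in ("tools", "memory", "abstain")
    for i in range(10)
]
train, dev = stratified_resplit(pool)
# Re-running on a reordered pool yields the same dev membership.
_, dev_again = stratified_resplit(list(reversed(pool)))
```

Hash-based ordering has the side benefit that adding new queries does not reshuffle the relative order of existing ones within a label.
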
**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-9/`

**Results (test split, overall):**

- Tools Precision: `0.9597`
- Tools Recall: **`0.7987`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.0714`

**Results (holdout subset, all buckets combined):**

- Tools Precision: `0.9375`
- Tools Recall: **`0.8571`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.0727`

**Holdout buckets (tools precision / recall):**

- `tools_recall`: `1.0000` / `0.8000`
- `abstain_toolish`: `0.7143` / `1.0000`
- `memory_toolish`: `1.0000` / `1.0000`

**Recommended model path (current):** `benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json`

**Full run snapshot:** `docs/router_linear_latest_2026-04-23.md`

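For readers cross-checking the numbers in this log, the metrics can be computed from (gold, predicted) label pairs roughly as below. The definitions are assumptions inferred from the metric names, and `router_metrics` is a hypothetical helper, not part of `klbr-bench`. The example counts are not taken from a report: they are the only counts consistent with the iter-1 table (32 tools and 29 memory test queries, precision 1.0000, recall 0.3125, accuracy 0.6393).

```python
def router_metrics(pairs):
    """Compute router metrics from an iterable of (gold_label, predicted_label).

    ASSUMED definitions:
      tools precision          = tools predicted correctly / all predicted tools
      tools recall             = tools predicted correctly / all gold tools
      <label> false-tools rate = gold <label> predicted as tools / all gold <label>
    Ratios with an empty denominator are reported as 0.0 (a convention chosen here).
    """
    pairs = list(pairs)
    gold_tools = sum(1 for g, _ in pairs if g == "tools")
    pred_tools = sum(1 for _, p in pairs if p == "tools")
    tp = sum(1 for g, p in pairs if g == "tools" and p == "tools")

    def false_tools_rate(label):
        total = sum(1 for g, _ in pairs if g == label)
        wrong = sum(1 for g, p in pairs if g == label and p == "tools")
        return wrong / total if total else 0.0

    correct = sum(1 for g, p in pairs if g == p)
    return {
        "tools_precision": tp / pred_tools if pred_tools else 0.0,
        "tools_recall": tp / gold_tools if gold_tools else 0.0,
        "memory_false_tools_rate": false_tools_rate("memory"),
        "abstain_false_tools_rate": false_tools_rate("abstain"),
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }

# Counts inferred from the iter-1 table: 10 of 32 tools queries routed to tools,
# the other 22 routed to memory, and all 29 memory queries kept in memory.
iter1_pairs = (
    [("tools", "tools")] * 10
    + [("tools", "memory")] * 22
    + [("memory", "memory")] * 29
)
iter1 = router_metrics(iter1_pairs)
```

Under these definitions the iter-1 accuracy works out to 39 correct out of 61, which rounds to the reported 0.6393.
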