# Router Dataset Log

Tracks changes to `benchmarks/inputs/router/router_slices/`, router benchmark runs, and evaluation summaries per iteration.

---

## Format

Each entry is an iteration. Counts are by `route_label` across all queries in the changed files.
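
The per-label counts in each entry could be tallied with something like the sketch below. It assumes each query is an object with `route_label` and `split` fields; that layout and the helper name `count_labels` are assumptions for illustration, not the actual slice schema.

```python
from collections import Counter


def count_labels(queries):
    """Tally queries by (route_label, split). Field names are assumed."""
    counts = Counter()
    for q in queries:
        counts[(q["route_label"], q["split"])] += 1
    return counts


# Hypothetical in-memory stand-in for the contents of one slice file.
sample = [
    {"query_id": "q1", "route_label": "tools", "split": "dev"},
    {"query_id": "q2", "route_label": "tools", "split": "test"},
    {"query_id": "q3", "route_label": "memory", "split": "dev"},
]
counts = count_labels(sample)
```

Summing the `dev` and `test` counts for a label gives the `total` column used in the tables below.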

---

## iter-1 — 2026-04-23 (initial slices)

**Date:** 2026-04-23

**Slice files changed / created:**

| file | route_label | dev | test | total |
|------|------------|-----|------|-------|
| `benchmarks/inputs/router/router_slices/shell_state_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/read_file_repo_v1.json` | tools | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/abstain_v1.json` | abstain | 15 | 15 | 30 |
| `benchmarks/inputs/router/router_slices/memory_lane_v1.json` | memory | 15 | 15 | 30 |

**Total router-labeled queries (slices only):**

| route_label | count |
|------------|-------|
| tools | 60 |
| abstain | 30 |
| memory | 30 |

(In addition, `benchmarks/inputs/datasets/internal_eval_starter.json` already contains roughly 8 `route_label: tools` queries, dev and test combined.)

**Tool inventory snapshot:** `benchmarks/inputs/router/router_tools.json` (generated by `dump-tools`)

**Data quality notes:**
- shell and read_file slices include Japanese-language variants and typo/fragment queries to prevent English keyword hacking
- memory_lane queries are adversarially similar to tool queries (same topics: embedding server, reranker, config defaults) but ask about past decisions/notes rather than current state
- abstain queries cover: subjective preference, underspecified, future prediction, external web, credential requests

**Router train/eval command (iter-1):**

```
nix develop --command cargo run -p klbr-bench -- router-multi \
  benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
  benchmarks/models/router/centroid/out-router-iter-1 \
  benchmarks/inputs/datasets/internal_eval_starter.json \
  benchmarks/inputs/router/router_slices/shell_state_v1.json \
  benchmarks/inputs/router/router_slices/read_file_repo_v1.json \
  benchmarks/inputs/router/router_slices/memory_lane_v1.json
```

> Note: `abstain_v1.json` is excluded from this first run because the router at this point only distinguishes `memory` from `tools`. Add it once a three-class head is wired in.

**Benchmark output dir:** `benchmarks/models/router/centroid/out-router-iter-1/`

**Results:**

| metric | value |
|--------|-------|
| labeled queries (test) | 61 (32 tools, 29 memory) |
| tools precision | **1.0000** ✅ |
| tools recall | 0.3125 ⚠️ |
| memory false-tools rate | **0.0000** ✅ |
| accuracy | 0.6393 |
| threshold | 0.100 |

**Gate status:** precision gate MET, recall gate NOT YET MET. A tools recall of 0.3125 means only 10 of the 32 tools test queries were routed to tools; this needs more tool-lane test coverage to push recall up without hurting precision.
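
As a sanity check, the reported rates can be turned back into confusion counts. The sketch below assumes the iter-1 router is strictly binary, so every memory query is predicted as either `memory` or `tools`; the variable names are illustrative only.

```python
# Test-set sizes and rates taken from the results table above.
tools_total, memory_total = 32, 29
tools_recall = 0.3125
memory_false_tools_rate = 0.0

tp = round(tools_recall * tools_total)               # tools routed to tools
fn = tools_total - tp                                # tools routed to memory
fp = round(memory_false_tools_rate * memory_total)   # memory routed to tools
tn = memory_total - fp                               # memory routed to memory

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tools_total + memory_total)
```

Under that assumption the counts are 10 true positives and 22 missed tools queries, and the derived accuracy rounds to the reported 0.6393.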

---

## iter-3 — 2026-04-23 (3-way routing)

**Changes:**
- Upgraded router to 3-way centroid classification (Tools, Memory, Abstain).
- Added `abstain_v1.json` to the run (created in iter-1 but excluded there), plus new slices `shell_state_v1b.json` and `read_file_repo_v1b.json`.
- Added V2 slices for shell, read_file, memory_lane.

**Results:**
- Tools Precision: **1.0000**
- Tools Recall: 0.1875
- Memory False-Tools Rate: **0.0000**
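
For orientation, 3-way centroid classification can be sketched as follows: average the training embeddings per class, then assign a query to the class whose centroid has the highest cosine similarity. This is a generic illustration with toy 2-D vectors, not the `klbr-bench` implementation; the function names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(embedding, centroids):
    """Return (best_label, per-class similarities)."""
    sims = {label: cosine(embedding, c) for label, c in centroids.items()}
    return max(sims, key=sims.get), sims


# Toy embeddings standing in for real query vectors.
train = {
    "tools": [[1.0, 0.0], [0.9, 0.1]],
    "memory": [[0.0, 1.0], [0.1, 0.9]],
    "abstain": [[-1.0, -1.0], [-0.9, -1.1]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
label, sims = classify([0.8, 0.2], centroids)
```

With one centroid per class, queries that sit between two classes tend to fall to whichever mean is closer, which is one plausible reason recall can stay low even when precision is perfect.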

---

## iter-4 — 2026-04-23 (Adversarial & Observability)

**Changes:**
- Added `memory_adversarial_v1.json` to sharpen memory centroid.
- Added T-sim/M-sim/A-sim scores to misclassified table and agent status.

**Results:**
- Tools Precision: **1.0000**
- Tools Recall: 0.1625
- Memory False-Tools Rate: **0.0000**

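**Conclusion:** Centroids appear to be at their semantic limit: tools recall has dropped from 0.3125 (iter-1) to 0.1875 (iter-3) to 0.1625 here, even with the added slices. Because tools precision stays at 1.0000 and the memory false-tools rate at 0.0000, the router is still safe to deploy as a conservative gate.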

---

## iter-7 — 2026-04-23 (Recall-Focused Threshold Tuning)

**Changes:**
- Switched router model format to multi-prototype centroids per class (k-means style), still embedding-only.
- Changed the threshold-tuning objective to permit a small memory→tools false-positive budget in exchange for much higher tools recall; the observed memory→tools rate on test stayed at 0.

**Benchmark output dir:** `benchmarks/models/router/centroid/out-router-iter-7/`

**Results (test split):**
- Tools Precision: `0.9455`
- Tools Recall: **`0.6500`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2000`
- Tuned margin threshold: `0.050`
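
One way to read "multi-prototype centroids with a margin threshold" is sketched below. Each class is scored by its best-matching prototype, and a query is routed to `tools` only when the tools score beats the best other class by at least the margin. That decision rule is an assumption made for illustration; the log does not spell out how the tuned margin is applied, and the prototypes here are toy 2-D vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route(embedding, prototypes, margin=0.05):
    """prototypes maps label -> list of prototype vectors (must include 'tools')."""
    # Score each class by its closest prototype.
    scores = {
        label: max(cosine(embedding, p) for p in protos)
        for label, protos in prototypes.items()
    }
    best_other = max((l for l in scores if l != "tools"), key=scores.get)
    # Assumed rule: tools must win by at least `margin`, else fall back.
    if scores["tools"] - scores[best_other] >= margin:
        return "tools"
    return best_other


# Hypothetical prototypes: two for tools, one each for the other classes.
prototypes = {
    "tools": [[1.0, 0.0], [0.6, 0.8]],
    "memory": [[0.0, 1.0]],
    "abstain": [[-1.0, 0.0]],
}
borderline = route([0.4, 1.0], prototypes)            # tools leads by < 0.05
no_margin = route([0.4, 1.0], prototypes, margin=0.0)
```

In this toy setup the borderline query falls back to `memory` at the 0.050 margin but would be routed to `tools` with no margin, which shows how the margin trades recall for safety.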

---

## Production Config

To enable the currently recommended router (linear-iter-9, the last entry in this log), set `router_model_path = Some("benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json")` in `config.rs`.

---

## linear-iter-2 — 2026-04-23 (Softmax Regression)

**Changes:**
- Implemented `router-multi-linear`: 3-way softmax regression over the same embedding vectors (no keyword heuristics).
- Added threshold tuning on dev split with a hard `memory→tools = 0` constraint.
- Optimized the tuner by precomputing per-example probabilities (fast grid search).

**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-2/`

**Results (test split):**
- Tools Precision: `0.9444`
- Tools Recall: `0.6375`
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2000`

**Conclusion:** This is a better default than centroids for tools recall while keeping the key safety invariant (`memory→tools = 0`).
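
The dev-split tuning described above can be sketched as a grid search over precomputed probabilities. The sketch assumes the decision rule "route to tools when `p_tools >= threshold`" and picks the feasible threshold with the highest tools recall. The function name, the rule, and the toy data are assumptions, not the `router-multi-linear` code.

```python
def tune_threshold(examples, grid):
    """examples: list of (p_tools, true_label), with probabilities computed once.

    Returns (threshold, tools_recall) for the best feasible threshold,
    or None if every threshold sends at least one memory query to tools.
    """
    n_tools = sum(1 for _, y in examples if y == "tools")
    best = None
    for t in grid:
        mem_fp = sum(1 for p, y in examples if p >= t and y == "memory")
        if mem_fp > 0:
            continue  # hard constraint: memory→tools must be 0 on dev
        tp = sum(1 for p, y in examples if p >= t and y == "tools")
        recall = tp / n_tools if n_tools else 0.0
        if best is None or recall > best[1]:
            best = (t, recall)
    return best


# Hypothetical dev examples: (softmax probability of tools, gold label).
dev = [
    (0.9, "tools"), (0.6, "tools"), (0.3, "tools"),
    (0.5, "memory"), (0.2, "memory"), (0.7, "abstain"),
]
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold, recall = tune_threshold(dev, grid)
```

Because the probabilities are fixed up front, each grid step is only a pass over floats, which is what makes the search cheap. Note that this objective does not penalize abstain→tools errors.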

---

## linear-iter-4 — 2026-04-23 (Train/Dev Split + Frozen Holdout)

**Changes:**
- Added `split=train` dataset (`benchmarks/inputs/router/router_slices/train_pack_v1.json`) so training no longer reuses the dev tuning split.
- Added `split=dev` dataset (`benchmarks/inputs/router/router_slices/dev_pack_v1.json`) for threshold tuning.
- Added frozen holdout file (`benchmarks/inputs/router/router_holdout_v1.json`, 60 queries balanced across tools/memory/abstain). Reports now include a separate Holdout Metrics block covering queries whose `query_id` starts with `holdout_`.

**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-4/`

**Results (test split, overall):**
- Tools Precision: `0.9342`
- Tools Recall: `0.7172`
- Memory False-Tools Rate (memory→tools): **`0.0000`**

**Results (holdout subset):**
- Tools Precision: `0.7826`
- Tools Recall: `0.9000`
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.2500`
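
The holdout block can be reproduced in miniature by filtering on the `holdout_` prefix and recomputing the tools metrics. The records below are synthetic: they assume "balanced" means 20 queries per class and use counts chosen to match the reported rates (18 of 20 tools queries routed to tools, 5 of 20 abstain queries routed to tools). They are not the actual run output, and the field names `label` and `pred` are hypothetical.

```python
def holdout_only(records):
    """Keep records whose query_id starts with the holdout prefix."""
    return [r for r in records if r["query_id"].startswith("holdout_")]


def tools_metrics(records):
    """Tools precision and recall over (label, pred) records."""
    tp = sum(1 for r in records if r["pred"] == "tools" and r["label"] == "tools")
    pred_tools = sum(1 for r in records if r["pred"] == "tools")
    true_tools = sum(1 for r in records if r["label"] == "tools")
    return {
        "precision": tp / pred_tools if pred_tools else 0.0,
        "recall": tp / true_tools if true_tools else 0.0,
    }


def false_tools_rate(records, label):
    """Share of queries with the given gold label that were routed to tools."""
    rows = [r for r in records if r["label"] == label]
    return sum(1 for r in rows if r["pred"] == "tools") / len(rows) if rows else 0.0


def synth(label, pred, n):
    return [{"label": label, "pred": pred} for _ in range(n)]


# Synthetic rows consistent with the reported holdout rates.
rows = (
    synth("tools", "tools", 18) + synth("tools", "abstain", 2)
    + synth("abstain", "tools", 5) + synth("abstain", "abstain", 15)
    + synth("memory", "memory", 20)
)
records = [dict(r, query_id=f"holdout_{i:03d}") for i, r in enumerate(rows)]
# A non-holdout row that the filter should drop.
records.append({"query_id": "shell_001", "label": "tools", "pred": "memory"})

holdout = holdout_only(records)
metrics = tools_metrics(holdout)
```

Under these assumptions the lower holdout precision comes entirely from abstain queries: 18 / (18 + 5) rounds to 0.7826.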

---

## linear-iter-9 — 2026-04-23 (Deterministic Re-split + Multi-Holdout Buckets)

**Changes:**
- Fixed holdout bucket parsing so reports include per-holdout-bucket metrics.
- Added deterministic, stratified re-split of the combined `(train+dev)` pool when `split=train` is too small, to avoid severe underfitting.
- Kept key safety invariant during tuning: `memory→tools = 0` on dev.
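
A deterministic, stratified re-split can be sketched by ordering each class's queries by a hash of `query_id` and taking a fixed fraction for dev. The hashing scheme, the salt, the `dev_fraction` value, and the field names are all assumptions; the log does not say how the actual re-split is seeded, or what size counts as "too small".

```python
import hashlib
from collections import defaultdict


def stable_key(query_id, salt="resplit-v1"):
    """Order-independent sort key derived from the query id (salt is hypothetical)."""
    return hashlib.sha256(f"{salt}:{query_id}".encode("utf-8")).hexdigest()


def stratified_resplit(pool, dev_fraction=0.25):
    """Split a combined train+dev pool per label, independent of input order."""
    by_label = defaultdict(list)
    for q in pool:
        by_label[q["route_label"]].append(q)
    train, dev = [], []
    for label in sorted(by_label):
        rows = sorted(by_label[label], key=lambda q: stable_key(q["query_id"]))
        n_dev = round(len(rows) * dev_fraction)
        dev.extend(rows[:n_dev])
        train.extend(rows[n_dev:])
    return train, dev


# Hypothetical pool: 8 tools, 4 memory, 4 abstain queries.
pool = (
    [{"query_id": f"t{i}", "route_label": "tools"} for i in range(8)]
    + [{"query_id": f"m{i}", "route_label": "memory"} for i in range(4)]
    + [{"query_id": f"a{i}", "route_label": "abstain"} for i in range(4)]
)
train, dev = stratified_resplit(pool)
```

Hashing the id rather than shuffling with a seeded RNG keeps each query's assignment stable when files are reordered, although adding queries can still shift which rows fall inside the dev fraction.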

**Benchmark output dir:** `benchmarks/models/router/linear/out-router-linear-iter-9/`

**Results (test split, overall):**
- Tools Precision: `0.9597`
- Tools Recall: **`0.7987`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.0714`

**Results (holdout subset, all buckets combined):**
- Tools Precision: `0.9375`
- Tools Recall: **`0.8571`**
- Memory False-Tools Rate (memory→tools): **`0.0000`**
- Abstain False-Tools Rate (abstain→tools): `0.0727`

**Holdout buckets (tools precision / recall):**
- `tools_recall`: `1.0000` / `0.8000`
- `abstain_toolish`: `0.7143` / `1.0000`
- `memory_toolish`: `1.0000` / `1.0000`

**Recommended model path (current):**
`benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json`

**Full run snapshot:** `docs/router_linear_latest_2026-04-23.md`