i've harnessed the harness

Router Dataset Log

Tracks changes to benchmarks/inputs/router/router_slices/, router benchmark runs, and evaluation summaries per iteration.


Format

Each entry is an iteration. Counts are by route_label across all queries in the changed files.
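The slice schema itself isn't reproduced in this log. Based on the fields it references elsewhere (query_id, split, route_label), a single entry might look roughly like this; the field layout and the example query text are invented for illustration:

```json
{
  "query_id": "shell_state_v1_dev_001",
  "split": "dev",
  "route_label": "tools",
  "query": "what directory is the shell currently in?"
}
```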


iter-1 — 2026-04-23 (initial slices)

Date: 2026-04-23

Slice files changed / created:

| file | route_label | dev | test | total |
| --- | --- | --- | --- | --- |
| benchmarks/inputs/router/router_slices/shell_state_v1.json | tools | 15 | 15 | 30 |
| benchmarks/inputs/router/router_slices/read_file_repo_v1.json | tools | 15 | 15 | 30 |
| benchmarks/inputs/router/router_slices/abstain_v1.json | abstain | 15 | 15 | 30 |
| benchmarks/inputs/router/router_slices/memory_lane_v1.json | memory | 15 | 15 | 30 |

Total router-labeled queries (slices only):

| route_label | count |
| --- | --- |
| tools | 60 |
| abstain | 30 |
| memory | 30 |

(Plus existing route_label: tools queries from benchmarks/inputs/datasets/internal_eval_starter.json: ~8 dev + test combined.)

Tool inventory snapshot: benchmarks/inputs/router/router_tools.json (generated by dump-tools)

Data quality notes:

  • shell and read_file slices include Japanese-language variants and typo/fragment queries to prevent English keyword hacking
  • memory_lane queries are adversarially similar to tool queries (same topics: embedding server, reranker, config defaults) but ask about past decisions/notes rather than current state
  • abstain queries cover: subjective preference, underspecified, future prediction, external web, credential requests

Router train/eval command (iter-1):

  nix develop --command cargo run -p klbr-bench -- router-multi \
    benchmarks/inputs/configs/mvp_rerank_support_calibrated.json \
    benchmarks/models/router/centroid/out-router-iter-1 \
    benchmarks/inputs/datasets/internal_eval_starter.json \
    benchmarks/inputs/router/router_slices/shell_state_v1.json \
    benchmarks/inputs/router/router_slices/read_file_repo_v1.json \
    benchmarks/inputs/router/router_slices/memory_lane_v1.json

Note: abstain_v1.json is excluded from the first run because the current router only distinguishes memory vs. tools. Once a three-class head is wired in, add it.

Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-1/

Results:

| metric | value |
| --- | --- |
| labeled queries (test) | 61 (32 tools, 29 memory) |
| tools precision | 1.0000 |
| tools recall | 0.3125 ⚠️ |
| memory false-tools rate | 0.0000 |
| accuracy | 0.6393 |
| threshold | 0.100 |

Gate status: precision gate MET, recall gate NOT YET MET — needs more tool-lane test coverage to push recall up without hurting precision
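For reference, the three gate metrics can be computed directly from (gold, predicted) label pairs. This is a minimal sketch, not the actual klbr-bench implementation, and the function name is made up:

```rust
/// Compute tools precision, tools recall, and the memory→tools false
/// rate from (gold_label, predicted_label) pairs. Hypothetical helper.
fn tools_gate_metrics(pairs: &[(&str, &str)]) -> (f64, f64, f64) {
    let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
    let (mut mem_total, mut mem_to_tools) = (0.0, 0.0);
    for &(gold, pred) in pairs {
        match (gold == "tools", pred == "tools") {
            (true, true) => tp += 1.0,   // correctly routed to tools
            (false, true) => fp += 1.0,  // non-tools query sent to tools
            (true, false) => fn_ += 1.0, // tools query missed
            _ => {}
        }
        if gold == "memory" {
            mem_total += 1.0;
            if pred == "tools" {
                mem_to_tools += 1.0;
            }
        }
    }
    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 1.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    let false_tools = if mem_total > 0.0 { mem_to_tools / mem_total } else { 0.0 };
    (precision, recall, false_tools)
}
```

The precision gate tolerates low recall by design: a missed tools query degrades gracefully, while a memory query misrouted to tools does not.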


iter-3 — 2026-04-23 (3-way routing)

Changes:

  • Upgraded router to 3-way centroid classification (Tools, Memory, Abstain).
  • Added abstain_v1.json, shell_state_v1b.json, read_file_repo_v1b.json.
  • Added V2 slices for shell, read_file, memory_lane.
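As a sketch of what 3-way centroid classification looks like, assuming cosine similarity against one centroid per class with a margin-based fallback to abstain (the real router's scoring details aren't spelled out in this log):

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Route to the class whose centroid is most similar to the query
/// embedding; fall back to "abstain" when the best-vs-runner-up margin
/// is below the tuned threshold. Hypothetical sketch.
fn route<'a>(query: &[f64], centroids: &'a [(&'a str, Vec<f64>)], margin: f64) -> &'a str {
    let mut sims: Vec<(&str, f64)> = centroids
        .iter()
        .map(|(label, c)| (*label, cosine(query, c)))
        .collect();
    sims.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    if sims.len() > 1 && sims[0].1 - sims[1].1 < margin {
        "abstain"
    } else {
        sims[0].0
    }
}
```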

Results:

  • Tools Precision: 1.0000
  • Tools Recall: 0.1875
  • Memory False-Tools Rate: 0.0000

iter-4 — 2026-04-23 (Adversarial & Observability)

Changes:

  • Added memory_adversarial_v1.json to sharpen memory centroid.
  • Added T-sim/M-sim/A-sim scores to misclassified table and agent status.

Results:

  • Tools Precision: 1.0000
  • Tools Recall: 0.1625
  • Memory False-Tools Rate: 0.0000

Conclusion: Centroids are at their semantic limit. 1.0 precision makes this safe to deploy as a conservative gate.


iter-7 — 2026-04-23 (Recall-Focused Threshold Tuning)

Changes:

  • Switched router model format to multi-prototype centroids per class (k-means style), still embedding-only.
  • Updated threshold tuning objective to allow a tiny memory→tools FP budget (kept at 0 on test) in exchange for much higher tools recall.
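The multi-prototype idea, roughly: instead of one centroid per class, run a small k-means over each class's training embeddings and score a query against its nearest prototype. A toy sketch with deterministic initialization from the first k points; the actual klbr-bench initialization and distance choices are assumptions:

```rust
/// Index of the prototype nearest to `p` by squared Euclidean distance.
fn nearest(p: &[f64], protos: &[Vec<f64>]) -> usize {
    let mut best = 0;
    let mut best_d = f64::INFINITY;
    for (i, c) in protos.iter().enumerate() {
        let d: f64 = p.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
        if d < best_d {
            best_d = d;
            best = i;
        }
    }
    best
}

/// Tiny k-means: deterministic init from the first `k` points, then a
/// fixed number of assign/average iterations. Hypothetical sketch.
fn kmeans(points: &[Vec<f64>], k: usize, iters: usize) -> Vec<Vec<f64>> {
    let dim = points[0].len();
    let mut protos: Vec<Vec<f64>> = points.iter().take(k).cloned().collect();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0; dim]; protos.len()];
        let mut counts = vec![0usize; protos.len()];
        for p in points {
            let i = nearest(p, &protos);
            counts[i] += 1;
            for (s, x) in sums[i].iter_mut().zip(p) {
                *s += x;
            }
        }
        for (i, c) in protos.iter_mut().enumerate() {
            if counts[i] > 0 {
                for (cv, s) in c.iter_mut().zip(&sums[i]) {
                    *cv = s / counts[i] as f64;
                }
            }
        }
    }
    protos
}
```

Multiple prototypes let one class cover several distinct query clusters (e.g. shell-state vs. read-file phrasings) that a single mean vector would blur together.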

Benchmark output dir: benchmarks/models/router/centroid/out-router-iter-7/

Results (test split):

  • Tools Precision: 0.9455
  • Tools Recall: 0.6500
  • Memory False-Tools Rate (memory→tools): 0.0000
  • Abstain False-Tools Rate (abstain→tools): 0.2000
  • Tuned margin threshold: 0.050

Production Config

To enable this router, set router_model_path = Some("benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json") in config.rs.


linear-iter-2 — 2026-04-23 (Softmax Regression)

Changes:

  • Implemented router-multi-linear: 3-way softmax regression over the same embedding vectors (no keyword heuristics).
  • Added threshold tuning on dev split with a hard memory→tools = 0 constraint.
  • Optimized the tuner by precomputing per-example probabilities (fast grid search).
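In outline (a sketch, not the actual router-multi-linear code): compute class probabilities once per dev example, then grid-search a tools-probability threshold that keeps memory→tools at zero while maximizing tools recall. Class index order is an assumption here:

```rust
/// Numerically stable softmax over class logits.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let m = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|z| (z - m).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Grid-search a tools-probability threshold over precomputed dev
/// probabilities (index 0 = tools by assumption). Enforces the hard
/// memory→tools = 0 constraint while maximizing tools recall.
/// Hypothetical sketch of the tuning objective.
fn tune_tools_threshold(dev: &[(&str, Vec<f64>)]) -> f64 {
    let mut best_t = 1.0;
    let mut best_recall = -1.0;
    for step in 0..=100 {
        let t = step as f64 / 100.0;
        let (mut tp, mut tools_total, mut mem_fp) = (0.0, 0.0, 0);
        for (gold, probs) in dev {
            let as_tools = probs[0] >= t;
            if *gold == "tools" {
                tools_total += 1.0;
                if as_tools {
                    tp += 1.0;
                }
            } else if *gold == "memory" && as_tools {
                mem_fp += 1;
            }
        }
        let recall = if tools_total > 0.0 { tp / tools_total } else { 0.0 };
        if mem_fp == 0 && recall > best_recall {
            best_recall = recall;
            best_t = t;
        }
    }
    best_t
}
```

Precomputing the probabilities is what makes the grid search cheap: each candidate threshold is then a single pass of comparisons, with no re-embedding or re-scoring.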

Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-2/

Results (test split):

  • Tools Precision: 0.9444
  • Tools Recall: 0.6375
  • Memory False-Tools Rate (memory→tools): 0.0000
  • Abstain False-Tools Rate (abstain→tools): 0.2000

Conclusion: This is a better default than centroids for tools recall while keeping the key safety invariant (memory→tools = 0).


linear-iter-4 — 2026-04-23 (Train/Dev Split + Frozen Holdout)

Changes:

  • Added split=train dataset (benchmarks/inputs/router/router_slices/train_pack_v1.json) so training no longer reuses the dev tuning split.
  • Added split=dev dataset (benchmarks/inputs/router/router_slices/dev_pack_v1.json) for threshold tuning.
  • Added frozen holdout file (benchmarks/inputs/router/router_holdout_v1.json, 60 queries balanced across tools/memory/abstain). Reports now include a separate Holdout Metrics block for query_id prefixed with holdout_.
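The holdout separation in reporting can be as simple as partitioning results by the query_id prefix; a sketch, since the real report code is not shown here:

```rust
/// Split query ids into (holdout, main) by the "holdout_" prefix.
/// Hypothetical helper mirroring the report's Holdout Metrics block.
fn split_by_holdout<'a>(ids: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    ids.iter().copied().partition(|id| id.starts_with("holdout_"))
}
```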

Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-4/

Results (test split, overall):

  • Tools Precision: 0.9342
  • Tools Recall: 0.7172
  • Memory False-Tools Rate (memory→tools): 0.0000

Results (holdout subset):

  • Tools Precision: 0.7826
  • Tools Recall: 0.9000
  • Memory False-Tools Rate (memory→tools): 0.0000
  • Abstain False-Tools Rate (abstain→tools): 0.2500

linear-iter-9 — 2026-04-23 (Deterministic Re-split + Multi-Holdout Buckets)

Changes:

  • Fixed holdout bucket parsing so reports include per-holdout-bucket metrics.
  • Added deterministic, stratified re-split of the combined (train+dev) pool when split=train is too small, to avoid severe underfitting.
  • Kept key safety invariant during tuning: memory→tools = 0 on dev.
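One simple way to make a re-split deterministic and stratified (a sketch; the actual klbr-bench strategy is not specified in this log): sort query ids within each label, then deal out every n-th query to dev.

```rust
use std::collections::BTreeMap;

/// Deterministic stratified split of (query_id, route_label) pairs:
/// within each label, ids are sorted and every `dev_every`-th one goes
/// to dev, the rest to train. Hypothetical sketch; no RNG involved, so
/// repeated runs produce identical splits.
fn stratified_split<'a>(pool: &[(&'a str, &'a str)], dev_every: usize) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut by_label: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(id, label) in pool {
        by_label.entry(label).or_default().push(id);
    }
    let (mut train, mut dev) = (Vec::new(), Vec::new());
    // BTreeMap iterates labels in sorted order, keeping output stable.
    for (_label, mut ids) in by_label {
        ids.sort();
        for (i, id) in ids.into_iter().enumerate() {
            if i % dev_every == 0 {
                dev.push(id);
            } else {
                train.push(id);
            }
        }
    }
    (train, dev)
}
```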

Benchmark output dir: benchmarks/models/router/linear/out-router-linear-iter-9/

Results (test split, overall):

  • Tools Precision: 0.9597
  • Tools Recall: 0.7987
  • Memory False-Tools Rate (memory→tools): 0.0000
  • Abstain False-Tools Rate (abstain→tools): 0.0714

Results (holdout subset, all buckets combined):

  • Tools Precision: 0.9375
  • Tools Recall: 0.8571
  • Memory False-Tools Rate (memory→tools): 0.0000
  • Abstain False-Tools Rate (abstain→tools): 0.0727

Holdout buckets (tools precision / recall):

  • tools_recall: 1.0000 / 0.8000
  • abstain_toolish: 0.7143 / 1.0000
  • memory_toolish: 1.0000 / 1.0000

Recommended model path (current): benchmarks/models/router/linear/out-router-linear-iter-9/router_model.json

Full run snapshot: docs/router_linear_latest_2026-04-23.md