Complete canonicalization v2 Sprints 3-4: LLM stability, anchor diff, eval harness
SPRINT 3: LLM Stabilization + Anchors
--------------------------------------
Self-consistency (k=3 medoid):
- src/canonicalizer-llm.ts: LLMCanonOptions.selfConsistencyK parameter
- Generate k samples (first at temp=0, rest at temp=0.3)
- Select lexical medoid (most similar to all others by token Jaccard)
- Ties broken alphabetically for determinism
- Exported selectMedoid() for testing
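
For reviewers, the medoid selection above can be sketched roughly as follows. This is a minimal illustration, not the code in src/canonicalizer-llm.ts: the tokenize() and jaccard() helpers, the regex tokenizer, and the exact signature of selectMedoid() are assumptions made for the example.

```typescript
// Hypothetical sketch: the exported selectMedoid() may differ in signature
// and tokenization. tokenize() and jaccard() are illustrative helpers.

/** Split a sample into a set of lowercase word tokens (assumed tokenizer). */
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
}

/** Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical. */
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

/** Return the sample with the highest summed similarity to all other samples. */
function selectMedoid(samples: string[]): string {
  if (samples.length === 0) {
    throw new Error("selectMedoid needs at least one sample");
  }
  const tokenSets = samples.map(tokenize);
  let best = samples[0];
  let bestScore = -Infinity;
  samples.forEach((sample, i) => {
    let score = 0;
    tokenSets.forEach((other, j) => {
      if (i !== j) score += jaccard(tokenSets[i], other);
    });
    // A strictly higher score wins; on a tie the alphabetically smaller
    // sample wins, so the result does not depend on sample order.
    if (score > bestScore || (score === bestScore && sample < best)) {
      best = sample;
      bestScore = score;
    }
  });
  return best;
}
```

With k=3, a sample that shares most tokens with the other two wins; when two samples score the same, the alphabetical tie-break keeps the choice stable across runs.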
Anchor-based diff in classifier:
- src/classifier.ts: computeAnchorOverlap() compares canon_anchor sets
  between before/after clauses
- When anchors match (>50% overlap), high-edit-distance changes are
  downgraded from class D to class B (same concept, different wording)
- Reduces phantom D-class results caused by LLM rephrasing
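
A rough sketch of the downgrade rule described above. The overlap metric is assumed here to be Jaccard over anchor sets, which this message does not confirm. applyAnchorDowngrade(), the ChangeClass type, and the array-based signature are hypothetical names for illustration only.

```typescript
// Hypothetical sketch: the real computeAnchorOverlap() in src/classifier.ts
// may use a different overlap definition and signature.

// Only classes B and D are named in this message; A and C are assumed.
type ChangeClass = "A" | "B" | "C" | "D";

/** Overlap of two anchor sets as |A ∩ B| / |A ∪ B| (assumed metric). */
function computeAnchorOverlap(before: string[], after: string[]): number {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  // With no anchors on either side there is no evidence of a shared concept.
  if (beforeSet.size === 0 || afterSet.size === 0) return 0;
  let shared = 0;
  beforeSet.forEach((anchor) => {
    if (afterSet.has(anchor)) shared++;
  });
  return shared / (beforeSet.size + afterSet.size - shared);
}

/** Downgrade a D-class change to B when anchors overlap above the threshold. */
function applyAnchorDowngrade(
  initial: ChangeClass,
  beforeAnchors: string[],
  afterAnchors: string[],
  threshold = 0.5, // the ">50% overlap" rule
): ChangeClass {
  if (initial !== "D") return initial;
  return computeAnchorOverlap(beforeAnchors, afterAnchors) > threshold ? "B" : "D";
}
```

The comparison is strict, so an overlap of exactly 50% leaves the change in class D.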
SPRINT 4: Evaluation + Polish
------------------------------
Evaluation harness:
- tests/eval/gold-standard.ts: 6 annotated specs with expected nodes,
  types, edges, coverage bounds, and node count ranges
- tests/eval/canonicalization-eval.test.ts: 40 tests measuring:
  - Extraction recall (per-spec and aggregate)
  - Type accuracy (per-spec and aggregate)
  - Coverage (per-spec bounds)
  - Linking precision (for specs with expected edges)
  - Node count bounds
  - Max degree enforcement
  - Hierarchy coverage
- Baseline report table printed to stdout
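
To illustrate how the recall and type-accuracy metrics could be computed per spec, here is a minimal sketch. The ExpectedNode and ExtractedNode shapes, the substring matching rule, and scoreSpec() are assumptions for the example. The actual gold-standard format and matching logic in tests/eval/ may differ.

```typescript
// Hypothetical sketch of per-spec scoring; not the harness code itself.

interface ExpectedNode {
  contains: string; // text the extracted statement should include (assumed rule)
  type: string;     // e.g. "REQUIREMENT", "CONSTRAINT", "CONTEXT"
}

interface ExtractedNode {
  statement: string;
  type: string;
}

interface SpecScore {
  recall: number;       // share of expected nodes that were found
  typeAccuracy: number; // share of found nodes whose type matches (assumed)
}

function scoreSpec(expected: ExpectedNode[], extracted: ExtractedNode[]): SpecScore {
  let found = 0;
  let typed = 0;
  for (const exp of expected) {
    const needle = exp.contains.toLowerCase();
    const match = extracted.find((node) => node.statement.toLowerCase().includes(needle));
    if (match) {
      found++;
      if (match.type === exp.type) typed++;
    }
  }
  return {
    // Empty denominators are treated as perfect scores to avoid NaN.
    recall: expected.length === 0 ? 1 : found / expected.length,
    typeAccuracy: found === 0 ? 1 : typed / found,
  };
}
```

Under these assumptions, a spec where all three expected nodes are found but one has the wrong type scores 100% recall and 67% type accuracy after rounding.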
Results (rule-based, no LLM):
┌──────────────────┬────────┬─────────┬───────┬───────┬───────┬───────┐
│ Spec             │ Recall │ TypeAcc │ Cover │ ResD% │ Hier% │ Nodes │
├──────────────────┼────────┼─────────┼───────┼───────┼───────┼───────┤
│ Auth v1          │ 100%   │ 100%    │ 86%   │ 50%   │ 100%  │ 11    │
│ Auth v2          │ 100%   │ 67%     │ 88%   │ 50%   │ 100%  │ 14    │
│ Notifications    │ 100%   │ 100%    │ 100%  │ 60%   │ 100%  │ 15    │
│ Gateway          │ 100%   │ 100%    │ 100%  │ 78%   │ 100%  │ 21    │
│ TaskFlow: tasks  │ 100%   │ 100%    │ 100%  │ 100%  │ 100%  │ 19    │
│ TaskFlow: analyt │ 100%   │ 100%    │ 100%  │ 57%   │ 100%  │ 11    │
├──────────────────┼────────┼─────────┼───────┼───────┼───────┼───────┤
│ AVERAGE          │ 100%   │ 94%     │ 96%   │       │       │       │
└──────────────────┴────────┴─────────┴───────┴───────┴───────┴───────┘
Averages vs. targets: Recall ≥95% ✅, TypeAcc ≥90% ✅, Coverage ≥95% ✅
Phoenix status enhancements:
- Canon type breakdown (e.g., '18 REQUIREMENT, 3 CONSTRAINT, 1 CONTEXT')
- Resolution metrics: edge count, relates_to %, max degree, hierarchy %
- Extraction coverage %, with a per-clause warning when coverage is below 80%
- Low-coverage clauses appear as info diagnostics
Phoenix inspect enhancements:
- CanonNodeInfo: confidence, anchor, parentId, linkTypes, extractionMethod
- Edge type passed through for canon→canon edges
- Parent edges (canon→parent) for hierarchy visualization
- CONTEXT badge color (yellow) distinct from CONSTRAINT (red)
- Canon subtitle shows confidence score and extraction method
New files:
- tests/eval/gold-standard.ts (6 annotated specs)
- tests/eval/canonicalization-eval.test.ts (40 tests)
- tests/unit/self-consistency.test.ts (5 tests)
- tests/unit/anchor-diff.test.ts (3 tests)
305 tests passing across 33 files (48 new tests since Sprint 2)