Add CANONICALIZATION-PLAN.md — architecture plan for canonicalization v2

+702

2 changed files

docs

+504

docs/CANONICALIZATION-PLAN.md

··· 1 + # Canonicalization v2 — Architecture Plan 2 + 3 + **Version:** 2026-02-19 4 + **Status:** Decision document — ready for team sign-off 5 + **Inputs:** CANONICALIZATION.md (internal deep-dive), Codex automated code review, research advisor feedback 6 + **Decision required by:** Team leads 7 + 8 + --- 9 + 10 + ## 0. The Core Insight 11 + 12 + Canonicalization is not one problem. It is two: 13 + 14 + 1. **Extraction** — turning clause text into structured candidate nodes (per-clause, parallelizable, should be deterministic) 15 + 2. **Resolution** — linking, deduplicating, typing, and structuring those candidates into a coherent graph (global, explicitly versioned, may be probabilistic) 16 + 17 + These are currently tangled in a single pass. Untangling them is the organizing principle of this plan. Extraction becomes a pure function. Resolution becomes a separate, versioned graph pass with its own quality metrics — essentially its own D-rate. 18 + 19 + --- 20 + 21 + ## 1. Sacred Invariants (Non-Negotiable) 22 + 23 + Every change in this plan must preserve these properties. If a proposed approach violates one, it's rejected. 24 + 25 + | Invariant | Why | 26 + |-----------|-----| 27 + | **Content-addressed identity** | `canon_id` must be a deterministic function of content. Same input → same ID, always. This is the foundation of selective invalidation. | 28 + | **Deterministic fallback** | The system must produce correct output with zero external dependencies. Rule-based extraction is the floor, not the ceiling. | 29 + | **Explicit provenance** | Every canon node must trace to specific source clause(s). No node exists without justification. Broken provenance = broken system. | 30 + | **Graceful degradation** | LLM unavailable → rules. Resolution fails → extraction output is still valid. Each layer fails independently. 
| 31 + 32 + These are **not** sacred: 33 + 34 + | Not Sacred | Why | 35 + |------------|-----| 36 + | O(n²) linking | Obviously replaceable | 37 + | Line-level extraction granularity | Historical artifact | 38 + | Four-type taxonomy (R/C/I/D) | Can be extended | 39 + | Single-pass architecture | The thing we're replacing | 40 + | Statement text as sole identity signal | The core tension we're resolving | 41 + 42 + --- 43 + 44 + ## 2. New Type Taxonomy 45 + 46 + The current four types (REQUIREMENT, CONSTRAINT, INVARIANT, DEFINITION) miss an important category. A line like *"Tasks support three assignment modes"* isn't a requirement — it's a framing statement that gives meaning to what follows. Dropping it silently is the root cause of the coverage problem. 47 + 48 + **v2 taxonomy:** 49 + 50 + | Type | What It Captures | Example | 51 + |------|-----------------|---------| 52 + | **REQUIREMENT** | Something the system must do | "Tasks must support status transitions" | 53 + | **CONSTRAINT** | A limitation, bound, or prohibition | "Task titles must not exceed 200 characters" | 54 + | **INVARIANT** | Something that must always/never hold | "Every task must have exactly one assignee at all times" | 55 + | **DEFINITION** | A term or concept definition | "A 'task' is a unit of work with a title, description, status, and assignee" | 56 + | **CONTEXT** | Framing text that gives meaning to other nodes but isn't actionable on its own | "Tasks support three assignment modes" / "The system handles user login, registration, and session management" | 57 + 58 + CONTEXT nodes: 59 + - Don't generate code directly (IU planner skips them) 60 + - Do participate in provenance (they're extracted, not dropped) 61 + - Do influence linking (CONTEXT frames are parents of the nodes they introduce) 62 + - Solve the coverage problem: instead of "35% coverage, 4 statements not canonicalized," we get "100% classified: 3 requirements, 1 context frame" 63 + - Have low confidence by default 
(heading-context or no-keyword match → CONTEXT rather than dropped) 64 + 65 + --- 66 + 67 + ## 3. Architecture: Two-Phase Pipeline 68 + 69 + ``` 70 + PHASE 1: EXTRACTION PHASE 2: RESOLUTION 71 + (deterministic, per-clause) (versioned, global, graph-level) 72 + 73 + Clause[] ──────────▶ ┌─────────────────────┐ ┌──────────────────────────┐ 74 + │ 1. Sentence segment │ │ 5. Dedup / merge │ 75 + │ 2. Classify type │────────▶ │ 6. Typed edge inference │ ──▶ CanonicalGraph 76 + │ 3. Normalize + hash │ │ 7. Hierarchy proposal │ 77 + │ 4. Tag + confidence │ │ 8. Anchor computation │ 78 + └─────────────────────┘ └──────────────────────────┘ 79 + 80 + Pure function. Versioned pipeline. 81 + No global state. Has its own shadow diff. 82 + Deterministic. Resolution-D-rate tracked. 83 + Falls back to rules. Falls back to extraction-only. 84 + ``` 85 + 86 + ### Phase 1 output: `CandidateNode[]` 87 + 88 + ```typescript 89 + interface CandidateNode { 90 + candidate_id: string; // SHA-256(type + statement + source_clause_id) 91 + type: CanonicalType; // REQUIREMENT | CONSTRAINT | INVARIANT | DEFINITION | CONTEXT 92 + statement: string; // Normalized sentence 93 + confidence: number; // 0.0–1.0 extraction confidence 94 + source_clause_ids: string[]; // Provenance (may be >1 for LLM path) 95 + tags: string[]; // Extracted terms + keyphrases 96 + sentence_index: number; // Position within clause for coverage tracking 97 + extraction_method: 'rule' | 'llm'; 98 + } 99 + ``` 100 + 101 + ### Phase 2 output: `CanonicalNode[]` (extended) 102 + 103 + ```typescript 104 + interface CanonicalNode { 105 + canon_id: string; // Content-addressed (same formula as today) 106 + canon_anchor: string; // Soft identity — survives minor rephrasing 107 + type: CanonicalType; // May be upgraded by resolution (e.g., REQUIREMENT → CONSTRAINT) 108 + statement: string; 109 + confidence: number; // May be adjusted by resolution context 110 + source_clause_ids: string[]; // May be expanded by dedup/merge 111 + 
linked_canon_ids: string[]; // Only meaningful links (typed, idf-filtered) 112 + link_types: Record<string, EdgeType>; // canon_id → edge type 113 + parent_canon_id?: string; // Hierarchy 114 + tags: string[]; 115 + extraction_method: 'rule' | 'llm'; 116 + } 117 + 118 + type EdgeType = 'constrains' | 'defines' | 'refines' | 'invariant_of' | 'duplicates' | 'relates_to'; 119 + ``` 120 + 121 + --- 122 + 123 + ## 4. Phase 1: Extraction (Detailed Design) 124 + 125 + ### 4.1 Sentence Segmentation 126 + 127 + Replace line-level splitting with sentence-level. This is the highest-impact single change — it fixes prose extraction and compound statements simultaneously. 128 + 129 + ``` 130 + Current: clause.raw_text → split('\n') → classify per line 131 + Proposed: clause.raw_text → segmentSentences() → classify per sentence 132 + ``` 133 + 134 + Segmentation rules: 135 + - Split on sentence-ending punctuation (`. `, `! `, `? `) followed by uppercase or list marker 136 + - Split compound modals: "must A and must B" → two sentences 137 + - Preserve list items as atomic units (each `- ` item is one sentence regardless of internal periods) 138 + - Preserve transition sequences as atomic (lines with `→`, `->` are not split) 139 + 140 + **File:** New `src/sentence-segmenter.ts` (~80 lines) 141 + 142 + ### 4.2 Enhanced Type Classification 143 + 144 + Replace binary regex matching with a **scoring rubric**. Each sentence gets a score per type; highest score wins. Ties go to CONTEXT (safe default — extracted but not actionable). 
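A minimal sketch of how such a rubric could be scored, using the weights tabulated in this section. The regex patterns, the `heading` parameter, and the treatment of "exceed" as a numeric-bound cue are illustrative assumptions, not the planned `scoreSentence()` implementation; the "short sentence, no verb" signal is omitted because it needs verb detection.

```typescript
type CanonicalType = "REQUIREMENT" | "CONSTRAINT" | "INVARIANT" | "DEFINITION" | "CONTEXT";
type Weights = Partial<Record<CanonicalType, number>>;

const TYPES: CanonicalType[] = ["REQUIREMENT", "CONSTRAINT", "INVARIANT", "DEFINITION", "CONTEXT"];

// Sentence-level signals. Weights follow the rubric; patterns are hypothetical.
const SENTENCE_SIGNALS: { pattern: RegExp; weights: Weights }[] = [
  { pattern: /\b(must|shall)\b/i, weights: { REQUIREMENT: 2, CONSTRAINT: 1, INVARIANT: 1 } },
  { pattern: /\b(must not|cannot|forbidden)\b/i, weights: { CONSTRAINT: 4 } },
  { pattern: /\b(always|never|at all times)\b/i, weights: { INVARIANT: 4 } },
  { pattern: /\b(is defined as|means|refers to)\b/i, weights: { DEFINITION: 4 } },
  // Assumption: "exceed" is treated as a numeric-bound cue alongside the listed phrases.
  { pattern: /\b(at most|maximum|exceed)\b|≤/i, weights: { CONSTRAINT: 3 } },
];

// Heading-context signals, matched against the enclosing heading text.
const HEADING_SIGNALS: { pattern: RegExp; weights: Weights }[] = [
  { pattern: /constraint|security/i, weights: { CONSTRAINT: 2 } },
  { pattern: /requirement|feature/i, weights: { REQUIREMENT: 2 } },
  { pattern: /definition|glossary/i, weights: { DEFINITION: 2 } },
];

function scoreSentence(sentence: string, heading = "") {
  const scores: Record<CanonicalType, number> = {
    REQUIREMENT: 0, CONSTRAINT: 0, INVARIANT: 0, DEFINITION: 0, CONTEXT: 0,
  };
  const add = (w: Weights) => { for (const t of TYPES) scores[t] += w[t] ?? 0; };

  let keywordHit = false;
  for (const s of SENTENCE_SIGNALS) {
    if (s.pattern.test(sentence)) { keywordHit = true; add(s.weights); }
  }
  for (const s of HEADING_SIGNALS) {
    if (s.pattern.test(heading)) add(s.weights);
  }
  // "No modal verb, no keyword match" row: +2 for CONTEXT.
  if (!keywordHit) scores.CONTEXT += 2;

  const ranked = [...TYPES].sort((a, b) => scores[b] - scores[a]);
  const top = scores[ranked[0]];
  const runnerUp = scores[ranked[1]];
  // Ties go to CONTEXT (extracted but not actionable).
  const type: CanonicalType = top === runnerUp ? "CONTEXT" : ranked[0];
  // Confidence = (winner - runner-up) / winner, clamped to [0.3, 1.0].
  const confidence = Math.min(1, Math.max(0.3, (top - runnerUp) / top));
  return { type, confidence, scores };
}
```

Keeping the signals in data tables rather than branching code means the weights can be tuned without touching the scoring loop.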
145 + 146 + | Signal | REQ | CON | INV | DEF | CTX | 147 + |--------|-----|-----|-----|-----|-----| 148 + | "must", "shall" | +2 | +1 | +1 | 0 | 0 | 149 + | "must not", "cannot", "forbidden" | 0 | +4 | 0 | 0 | 0 | 150 + | "always", "never", "at all times" | 0 | 0 | +4 | 0 | 0 | 151 + | "is defined as", "means", "refers to" | 0 | 0 | 0 | +4 | 0 | 152 + | Numeric bound ("at most N", "≤", "maximum") | 0 | +3 | 0 | 0 | 0 | 153 + | Heading context: "constraint"/"security" | 0 | +2 | 0 | 0 | 0 | 154 + | Heading context: "requirement"/"feature" | +2 | 0 | 0 | 0 | 0 | 155 + | Heading context: "definition"/"glossary" | 0 | 0 | 0 | +2 | 0 | 156 + | No modal verb, no keyword match | 0 | 0 | 0 | 0 | +2 | 157 + | Short sentence (< 10 words, no verb) | 0 | 0 | 0 | +1 | +1 | 158 + 159 + **Confidence** = `(winning_score - runner_up_score) / winning_score`, clamped to [0.3, 1.0]. 160 + 161 + This means (signals are additive, and the plain "must" signal also fires inside "must not"): 162 + - "must not exceed 200 characters", no heading bonus → CON=8 (must +1, must not +4, numeric bound +3), REQ=2 (must +2), INV=1 → CON wins, confidence=(8-2)/8=0.75 163 + - "must authenticate with email" under a "requirement"/"feature" heading → REQ=4 (must +2, heading +2), CON=1 → REQ wins, confidence=(4-1)/4=0.75 164 + - "Tasks support three assignment modes" → CTX=2, all others 0 → CTX, confidence=1.0 165 + 166 + **File:** Replace `classifyLine()` in `src/canonicalizer.ts` with `scoreSentence()` (~60 lines) 167 + 168 + ### 4.3 Term Extraction Fix 169 + 170 + Two immediate fixes: 171 + 172 + 1. **Acronym whitelist**: Allow short tokens that are domain terms: `id`, `ui`, `api`, `jwt`, `sso`, `otp`, `ip`, `db`, `tls`, `rsa`, `aes`, `rs256`, `hs256`, `oidc`, `oauth`, `2fa`, `url`, `uri`, `http`, `sql`, `css`, `html`. 173 + 174 + 2. **Preserve hyphenated compounds**: `rate-limit`, `cross-origin`, `in-progress` stay as single tags, not split into parts. 175 + 176 + **File:** Modify `extractTerms()` in `src/canonicalizer.ts` (~15 lines changed) 177 + 178 + ### 4.4 Normalizer Fix: Ordered Sequences 179 + 180 + The list-sorting behavior in `normalizeText()` is a **correctness bug**. 
The sequence `open → in_progress → review → done` is order-dependent. Sorting it alphabetically changes the meaning. 181 + 182 + Fix: 183 + - **Numbered lists**: Never sort (ordering is explicit). 184 + - **Bullet lists with sequence indicators**: Don't sort if any item contains `→`, `->`, `=>`, ordinals (1st, 2nd), or comma-delimited sequences. 185 + - **All other bullet lists**: Continue sorting (preserves stability for unordered lists). 186 + 187 + **File:** Modify `normalizeText()` in `src/normalizer.ts` (~20 lines changed) 188 + 189 + ### 4.5 Coverage Reporting 190 + 191 + After extraction, compute per-clause coverage: 192 + 193 + ```typescript 194 + interface ExtractionCoverage { 195 + clause_id: string; 196 + total_sentences: number; 197 + extracted_sentences: number; 198 + coverage_pct: number; 199 + uncovered: { text: string; reason: 'no_match' | 'too_short' | 'meta_text' }[]; 200 + } 201 + ``` 202 + 203 + Emit as diagnostics in `phoenix status`: 204 + ``` 205 + INFO canon spec/tasks.md L1-4 Coverage: 1/3 sentences (33%) — 2 classified as CONTEXT 206 + WARN canon spec/auth.md L15-20 Coverage: 2/5 sentences (40%) — 3 uncovered (no keyword match) 207 + ``` 208 + 209 + **File:** New `src/extraction-coverage.ts` (~50 lines), wired into CLI status 210 + 211 + --- 212 + 213 + ## 5. Phase 2: Resolution (Detailed Design) 214 + 215 + Resolution is a **global graph pass** that operates on the flat list of candidate nodes from Phase 1 and produces the final canonical graph. It is explicitly versioned (`resolution_pipeline_id`) and has its own shadow diff mechanism. 216 + 217 + ### 5.1 Deduplication / Merge 218 + 219 + Two candidates from different clauses may express the same requirement. Resolution detects this and merges them. 220 + 221 + **Algorithm:** 222 + 1. Build inverted index: tag → candidate_ids 223 + 2. For each pair of candidates sharing ≥ 3 rare tags (idf-weighted): compute statement similarity (Jaccard on token trigrams) 224 + 3. 
If similarity > 0.7 and types are compatible (same type, or one is CONTEXT): merge into one node with both `source_clause_ids` 225 + 4. Merged node gets `canon_id` from the higher-confidence candidate (preserves identity stability) 226 + 227 + This solves the dedup problem (§5.9) and multi-clause provenance (§5.8) simultaneously. 228 + 229 + ### 5.2 Typed Edge Inference 230 + 231 + Replace untyped `linked_canon_ids` with typed edges. Infer edge types from node types and tag relationships: 232 + 233 + | From Type | To Type | Inferred Edge | Condition | 234 + |-----------|---------|--------------|-----------| 235 + | CONSTRAINT | REQUIREMENT | `constrains` | Shared head noun or tags | 236 + | INVARIANT | REQUIREMENT | `invariant_of` | Shared domain terms | 237 + | DEFINITION | any | `defines` | Definition's head term appears in target's statement | 238 + | CONTEXT | REQUIREMENT | `refines` | CONTEXT is in parent heading; REQUIREMENT is in child heading | 239 + | any | any (same statement, different clause) | `duplicates` | Caught by dedup but retained as edge | 240 + | any | any | `relates_to` | Shared rare tags above threshold (fallback) | 241 + 242 + **Key constraint:** Only create `relates_to` edges for pairs sharing ≥ 2 tags where at least one tag has IDF > median. This replaces the current noisy ≥ 2 shared tags rule. 243 + 244 + **Max degree cap:** No node may have more than 8 outgoing edges (excluding `duplicates`). If over cap, keep edges with highest tag-idf overlap. This prevents the "linked to everything" problem. 245 + 246 + ### 5.3 Hierarchy Inference 247 + 248 + Use heading structure from clause `section_path` to propose parent-child relationships: 249 + 250 + ``` 251 + Heading: "Task Management Service" → CONTEXT node (top-level) 252 + Heading: "Task Lifecycle" → CONTEXT node (parent) 253 + Bullet: "Tasks must support..." → REQUIREMENT (child of Task Lifecycle) 254 + Bullet: "Invalid transitions..." 
→ CONSTRAINT (child of Task Lifecycle) 255 + ``` 256 + 257 + Algorithm: 258 + 1. For each clause, record its `section_path` depth 259 + 2. For each candidate node, inherit the depth of its source clause 260 + 3. CONTEXT nodes at depth N are potential parents of nodes at depth N+1 from the same document 261 + 4. Set `parent_canon_id` based on heading containment 262 + 5. If no CONTEXT node exists at the parent depth, leave `parent_canon_id` unset 263 + 264 + This gives us hierarchical invalidation for free: changing a child requirement only invalidates its subtree, not the sibling requirements under the same section. 265 + 266 + ### 5.4 Anchor Computation 267 + 268 + Add `canon_anchor` — a soft identity that survives minor statement rephrasing: 269 + 270 + ```typescript 271 + canon_anchor = SHA-256( 272 + type + 273 + sorted(tags).join(',') + 274 + source_clause_ids.sort().join(',') 275 + ) 276 + ``` 277 + 278 + The anchor is stable because: 279 + - `type` changes rarely (and the scoring rubric is deterministic) 280 + - `tags` are extracted from the same source text (minor rephrasing keeps most tags) 281 + - `source_clause_ids` are content-addressed and stable for unchanged clauses 282 + 283 + **Usage in diff:** When comparing old and new canonical graphs: 284 + 1. Match by `canon_anchor` first (find "same concept" pairs) 285 + 2. Compare `canon_id` within matched pairs (detect "reworded" vs "identical") 286 + 3. Unmatched anchors → truly new or removed nodes 287 + 288 + This separates "the LLM rephrased this slightly" (anchor matches, id differs → class A/B change) from "this is a genuinely new requirement" (no anchor match → class C/D change). 
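The anchor formula in this section can be made runnable roughly as follows. This is a sketch under assumptions: `AnchorInput` and `computeCanonAnchor` are hypothetical names, and the inputs are concatenated without separators exactly as the formula is written. The arrays are copied before sorting so the node's own `tags` and `source_clause_ids` are not reordered as a side effect.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: the fields of a node that feed the anchor.
interface AnchorInput {
  type: string;
  tags: string[];
  source_clause_ids: string[];
}

function computeCanonAnchor(node: AnchorInput): string {
  // Copy before sorting: Array.prototype.sort mutates in place.
  const tags = [...node.tags].sort().join(",");
  const clauses = [...node.source_clause_ids].sort().join(",");
  // SHA-256 over type + sorted tags + sorted clause ids, hex-encoded.
  return createHash("sha256")
    .update(node.type + tags + clauses, "utf8")
    .digest("hex");
}
```

Because the statement text is not an input, two rewordings of the same sentence that yield the same type, tags, and source clauses share an anchor even though their `canon_id` values differ.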
289 + 290 + ### 5.5 Resolution Quality Metrics 291 + 292 + Resolution gets its own health metrics, separate from the extraction D-rate: 293 + 294 + | Metric | Definition | Target | 295 + |--------|-----------|--------| 296 + | **Resolution-D-rate** | % of edges where type couldn't be inferred (fell back to `relates_to`) | ≤ 20% | 297 + | **Dedup rate** | % of candidates that were merged | Report only (no target — depends on spec style) | 298 + | **Orphan rate** | % of nodes with zero outgoing edges (no connections) | ≤ 10% | 299 + | **Max degree** | Highest node degree in the graph | ≤ 8 (enforced by cap) | 300 + | **Hierarchy coverage** | % of non-CONTEXT nodes that have a parent | ≥ 50% | 301 + 302 + These appear in `phoenix status` as resolution-specific diagnostics. 303 + 304 + --- 305 + 306 + ## 6. LLM Integration (Stabilized) 307 + 308 + The research advisor feedback and Codex review converge on the same recommendation: **don't use the LLM for extraction. Use it for normalization only.** 309 + 310 + ### 6.1 LLM-as-Normalizer Architecture 311 + 312 + ``` 313 + Sentence (raw) → Rule-based extraction → CandidateNode (draft) 314 + │ 315 + ▼ 316 + LLM Normalizer 317 + (if available) 318 + │ 319 + ▼ 320 + CandidateNode (normalized statement) 321 + ``` 322 + 323 + The LLM receives a single sentence + its type classification and returns a normalized canonical form. This is a **constrained rewrite**, not open-ended generation — variance is much lower than full extraction. 324 + 325 + **Prompt:** 326 + ``` 327 + Rewrite this {type} statement in canonical form. 328 + Rules: one clear sentence, present tense, active voice, no pronouns, no ambiguity. 
329 + Input: "{raw_statement}" 330 + Output (JSON): {"statement": "..."} 331 + ``` 332 + 333 + - Temperature: 0 334 + - Max tokens: 100 335 + - JSON schema enforced 336 + - If output is empty, malformed, or LLM is unavailable → use rule-normalized statement 337 + 338 + ### 6.2 Self-Consistency for Stability 339 + 340 + For high-risk IUs (where canon stability matters most), run the normalizer k=3 times and select the **lexical medoid** (the output most similar to all other outputs, by token Jaccard). This is deterministic given a tie-breaking rule (alphabetical). 341 + 342 + ### 6.3 Explicit Provenance in LLM Path 343 + 344 + If the LLM is used for full extraction (retained as an option behind `--llm-extract` flag): 345 + - Prompt must require `source_section` field in output 346 + - Response is validated: if `source_section` doesn't match any clause's `section_path`, the node is rejected 347 + - Attribution to multiple clauses is allowed if the LLM identifies them 348 + - Positional fallback is **removed entirely** — if provenance can't be established, the node is dropped 349 + 350 + --- 351 + 352 + ## 7. Implementation Roadmap 353 + 354 + ### Sprint 1 (Week 1–2): Foundation + Quick Wins 355 + 356 + **Goal:** Fix correctness bugs, establish extraction/resolution split, get coverage reporting. 
357 + 358 + | Task | File(s) | Size | Risk | 359 + |------|---------|------|------| 360 + | Fix normalizer list sorting for ordered sequences | `src/normalizer.ts` | S | Low — tests exist for sort behavior | 361 + | Add acronym whitelist + hyphenated compound preservation | `src/canonicalizer.ts` | S | Low | 362 + | Add CONTEXT type to `CanonicalType` enum | `src/models/canonical.ts` | S | Low — additive | 363 + | Build sentence segmenter | New: `src/sentence-segmenter.ts` | M | Medium — needs good test fixtures | 364 + | Replace line-level extraction with sentence-level | `src/canonicalizer.ts` | M | Medium — all canon tests will shift | 365 + | Implement scoring rubric for type classification | `src/canonicalizer.ts` | M | Medium | 366 + | Add `confidence` field to `CanonicalNode` | `src/models/canonical.ts` | S | Low — additive, optional | 367 + | Add extraction coverage reporting | New: `src/extraction-coverage.ts` | S | Low | 368 + | Wire coverage into `phoenix status` | `src/cli.ts` | S | Low | 369 + | Update all tests for new extraction behavior | `tests/` | L | High — most canon tests need updating | 370 + 371 + **Deliverable:** Sentence-level extraction with scoring rubric, CONTEXT type, coverage diagnostics, normalizer fix. All existing pipeline tests pass (updated for new node counts/types). 372 + 373 + ### Sprint 2 (Week 3–4): Resolution Phase 374 + 375 + **Goal:** Build the global graph pass. Typed edges, dedup, hierarchy. 
376 + 377 + | Task | File(s) | Size | Risk | 378 + |------|---------|------|------| 379 + | Create `CandidateNode` type and extraction→resolution interface | `src/models/canonical.ts` | S | Low | 380 + | Build inverted index + idf-weighted tag utility | New: `src/tag-index.ts` | M | Low | 381 + | Implement dedup/merge (similarity on trigrams, idf-filtered) | New: `src/resolution.ts` | M | Medium | 382 + | Implement typed edge inference | `src/resolution.ts` | M | Medium | 383 + | Implement hierarchy inference from section_path depth | `src/resolution.ts` | M | Low | 384 + | Add `canon_anchor` computation | `src/resolution.ts` | S | Low | 385 + | Add `link_types`, `parent_canon_id`, `canon_anchor` to model | `src/models/canonical.ts` | S | Low — additive | 386 + | Add max-degree cap to linking | `src/resolution.ts` | S | Low | 387 + | Resolution quality metrics in `phoenix status` | `src/cli.ts` | S | Low | 388 + | Update warm-hasher to use only typed edges | `src/warm-hasher.ts` | S | Medium — affects context hash values | 389 + | Update `phoenix inspect` for new edge types + hierarchy | `src/inspect.ts` | M | Low | 390 + | Comprehensive tests for resolution | `tests/functional/resolution.test.ts` | L | — | 391 + 392 + **Deliverable:** Two-phase pipeline producing hierarchical, typed canonical graph. Resolution metrics in status. Inspect shows hierarchy and edge types. 393 + 394 + ### Sprint 3 (Week 5–6): LLM Stabilization + Anchors 395 + 396 + **Goal:** LLM-as-normalizer, anchor-based diff, self-consistency. 
397 + 398 + | Task | File(s) | Size | Risk | 399 + |------|---------|------|------| 400 + | LLM-as-normalizer: single-sentence rewrite | `src/canonicalizer-llm.ts` (rewrite) | M | Medium | 401 + | Self-consistency (k=3, medoid selection) | `src/canonicalizer-llm.ts` | S | Low | 402 + | Anchor-based diff in classifier | `src/classifier.ts` | M | Medium — changes diff semantics | 403 + | Require explicit provenance in LLM-extract mode | `src/canonicalizer-llm.ts` | M | Medium | 404 + | Remove positional fallback | `src/canonicalizer-llm.ts` | S | Low | 405 + | Update shadow pipeline to compare resolution graphs | `src/pipeline.ts` | M | Medium | 406 + | Tests for LLM normalization stability | `tests/unit/` | M | — | 407 + 408 + **Deliverable:** LLM path uses normalizer-only by default. Self-consistent. Anchor-based diff reduces phantom invalidation. Shadow pipeline covers resolution. 409 + 410 + ### Sprint 4 (Week 7–8): Polish + Evaluation 411 + 412 + **Goal:** Measure improvements against baselines. Decide on optional enhancements. 413 + 414 + | Task | File(s) | Size | Risk | 415 + |------|---------|------|------| 416 + | Build evaluation harness (gold-standard annotated specs) | `tests/eval/` | L | — | 417 + | Measure extraction recall, type accuracy, linking precision | `tests/eval/` | M | — | 418 + | Compare baselines (old vs new) across all fixtures | `tests/eval/` | M | — | 419 + | Decide: local embeddings for linking (transformers.js) | Decision doc | — | Depends on eval results | 420 + | Decide: TextRank keyphrases for tag enrichment | Decision doc | — | Depends on eval results | 421 + | Decide: weakly-supervised type classifier | Decision doc | — | Depends on eval results | 422 + | Documentation update | `docs/` | M | — | 423 + 424 + **Deliverable:** Quantified improvement report. Go/no-go decisions on mid-term enhancements. Updated documentation. 425 + 426 + --- 427 + 428 + ## 8. 
Risk Register 429 + 430 + | Risk | Likelihood | Impact | Mitigation | 431 + |------|-----------|--------|-----------| 432 + | Sentence segmenter handles edge cases poorly | Medium | High | Extensive test fixtures; keep line-level as fallback for list items | 433 + | Scoring rubric under-fits on novel spec styles | Medium | Medium | CONTEXT as safe default; rubric is tunable weights not hard rules | 434 + | Dedup merges nodes that shouldn't be merged | Low | High | Conservative threshold (0.7 similarity); require ≥3 rare shared tags; easy to back out | 435 + | Hierarchy inference creates wrong parent-child links | Medium | Low | Only propose hierarchy from heading structure; don't infer from content | 436 + | Warm-hasher change causes cascade of hash changes | High | Medium | Feature-flag new warm hash; transition gradually; shadow diff before switching | 437 + | LLM normalizer produces unstable output despite temp=0 | Medium | Medium | Self-consistency k=3; fall back to rule normalization | 438 + | Test suite disruption from extraction granularity change | High | Medium | Sprint 1 allocates time explicitly; use snapshot testing for transition | 439 + 440 + --- 441 + 442 + ## 9. 
Measurement Targets 443 + 444 + | Metric | Current Baseline | Sprint 1 Target | Sprint 4 Target | 445 + |--------|-----------------|-----------------|-----------------| 446 + | Extraction recall | ~70% | 85% (sentence-level + CONTEXT) | 95% | 447 + | Type accuracy | ~60% (over-REQUIREMENT) | 80% (scoring rubric) | 90% (with resolution type upgrade) | 448 + | Provenance accuracy (rule) | 100% | 100% | 100% | 449 + | Provenance accuracy (LLM) | ~80% | 90% (explicit provenance) | 95% (validated) | 450 + | Linking precision | ~40% | 70% (idf-filtered, typed) | 80% | 451 + | Identity stability (LLM) | ~90% | 95% (normalizer-only) | 98% (self-consistency + anchors) | 452 + | Coverage visibility | None | Per-clause % in status | Per-clause + uncovered sentence detail | 453 + | Linking scalability | O(n²) | O(n·k) via inverted index | O(n·k) confirmed at 5K nodes | 454 + 455 + --- 456 + 457 + ## 10. Decisions Needed 458 + 459 + | # | Decision | Options | Recommendation | Deadline | 460 + |---|----------|---------|----------------|----------| 461 + | 1 | Accept CONTEXT as 5th canonical type? | Yes / No (keep dropping) | **Yes** — solves coverage and prose extraction simultaneously | Before Sprint 1 | 462 + | 2 | Accept extraction/resolution split? | Two-phase / Keep single-pass | **Two-phase** — enables independent versioning and quality metrics | Before Sprint 1 | 463 + | 3 | Accept normalizer fix for ordered sequences? | Fix sorting / Keep current | **Fix** — current behavior is a correctness bug | Before Sprint 1 | 464 + | 4 | Add `confidence`, `canon_anchor`, `link_types`, `parent_canon_id` to data model? | Now (optional fields) / Later | **Now, as optional fields** — avoids migration later, no breaking change | Before Sprint 2 | 465 + | 5 | LLM default mode: normalizer-only or full extraction? 
| Normalizer / Extraction / Both behind flag | **Normalizer default**, extraction behind `--llm-extract` flag | Before Sprint 3 | 466 + | 6 | Invest in local embeddings (transformers.js)? | Yes (Sprint 5–6) / Defer | **Defer** — evaluate after Sprint 4 baselines; only pursue if idf-linking doesn't hit 80% precision | After Sprint 4 eval | 467 + 468 + --- 469 + 470 + ## Appendix A: Superseded Documents 471 + 472 + This plan supersedes and incorporates: 473 + - `docs/CANONICALIZATION.md` — retained as the technical reference for current implementation 474 + - `docs/CANONICALIZATION-REVIEW.md` — Codex automated review; findings incorporated into this plan's design 475 + 476 + Both documents remain in the repo as historical reference. 477 + 478 + ## Appendix B: References 479 + 480 + ### Requirements engineering 481 + - Mavin et al. (2009/2010). EARS: The Easy Approach to Requirements Syntax. 482 + - Ferrari et al. (2017). NLP for Requirements Engineering: Systematic Literature Review. 483 + - Banko et al. (2007). Open Information Extraction from the Web. 484 + - He et al. (2017). Deep Semantic Role Labeling. 485 + 486 + ### Similarity, linking, identity 487 + - Reimers & Gurevych (2019). Sentence-BERT. arXiv:1908.10084. 488 + - Broder (1997). MinHash for document resemblance. 489 + - Charikar (2002). SimHash for similarity estimation. 490 + - Malkov & Yashunin (2018). HNSW for approximate nearest neighbors. 491 + 492 + ### Keyphrases and NLP tooling 493 + - Mihalcea & Tarau (2004). TextRank for keyphrase extraction. 494 + - RAKE (Rose et al., 2010). Rapid Automatic Keyword Extraction. 495 + - `xenova/transformers.js` — local embeddings in Node. 496 + - `wink-nlp` / `compromise` — JS POS and noun-phrase extraction. 497 + 498 + ### LLM stability 499 + - Wang et al. (2023). Self-Consistency improves chain-of-thought reasoning. 500 + - Structured/JSON-constrained decoding (outlines, guidance). 501 + 502 + --- 503 + 504 + *This is a decision document. 
It becomes the plan of record once all §10 decisions are signed off.*

+198

docs/CANONICALIZATION-REVIEW.md

··· 1 + # Phoenix Canonicalization — Review, Gaps, and Refinement Plan 2 + 3 + **Version:** 2026-02-19 4 + **Source:** Automated code review (OpenAI Codex) of `docs/CANONICALIZATION.md` against codebase 5 + **Audience:** Phoenix core team (research + engineering) 6 + **Scope:** Independent review of current canonicalization with prioritized improvements and references 7 + 8 + --- 9 + 10 + ## Executive Summary 11 + 12 + Canonicalization is central to Phoenix. The current rule-based engine is fast, deterministic, and preserves provenance; the LLM path adds recall but weakens determinism and provenance. Key pain points are over-typing to REQUIREMENT, missed statements in prose, noisy O(n²) linking, lack of coverage/confidence, and identity instability when statements are rephrased. We recommend a staged refinement: 13 + 14 + - **Near term** (no external deps): sentence-level extraction with coverage diagnostics, better typing rules, preserve ordered sequences in normalization, less noisy/scalable linking, and multi-clause provenance. 15 + - **Mid term** (local-only ML): deterministic local embeddings for linking/anchors, unsupervised keyphrase extraction, and weakly-supervised type classifier. 16 + - **LLM, stabilized**: use LLM for normalization only or require explicit clause references; add self-consistency to reduce variance. 17 + - **Graph/identity**: introduce typed edges, optional hierarchy, and a soft "anchor" identity alongside strict hash. 18 + 19 + These changes maintain Phoenix's core properties (content-addressed identity, explicit provenance, deterministic fallback) while measurably improving extraction quality, type accuracy, and linking precision. 20 + 21 + --- 22 + 23 + ## Strengths (What Works Today) 24 + 25 + - **Determinism and fallback**: Rule-based path is stable and always available; LLM path fully degrades to rules on failure. 
26 + - **Explicit provenance**: Rule extractor maps nodes to clauses with high fidelity; leveraged by `warm-hasher` and `canonical-store`. 27 + - **Normalization for stability**: Formatting-only changes do not churn `clause_semhash`; downstream diffs remain clean. 28 + - **Test coverage at pipeline level**: Functional tests exercise parse → canonicalize → warm hash → classify flows. 29 + 30 + --- 31 + 32 + ## Gaps and Risks (Beyond the Known Issues) 33 + 34 + These are issues identified through code review that go beyond what CANONICALIZATION.md already documents: 35 + 36 + - **Normalizer reorders lists**: Sorting list items breaks ordered semantics (e.g., transition sequences like `open → in_progress → review → done`), causing meaning drift and brittle hashes. This is a **correctness bug**, not just a quality issue. 37 + - **Acronym loss in tags**: Tokens ≤2 chars (`id`, `ui`, `api`, `jwt`, `sso`, `otp`) are dropped by `extractTerms`, degrading tags and links in domains with heavy acronym usage. 38 + - **No typed relations**: `linked_canon_ids` are untyped, undirected, based on term overlap; downstream systems can't use relation semantics (constrains, defines, refines). 39 + - **Statement-as-identity tension**: Small rephrasings create new IDs; including `source_clause_id` in the hash prevents dedup across sections even when the same requirement appears twice. 40 + 41 + --- 42 + 43 + ## Quick Wins (Low-Risk, High-Impact) 44 + 45 + ### 1. Sentence-first extraction and coverage 46 + - Segment clause text into sentences; split compound sentences with coordinated modals ("must A and must B" → two candidates). 47 + - Classify per sentence; compute coverage per clause (extracted / total sentences) and emit diagnostics for uncovered sentences. 48 + 49 + ### 2. Stronger rule features for typing 50 + - **Constraints**: negation ("must not", "may not", "cannot"), bound phrases ("at most/least", "no more than/fewer than", numeric ranges). 
51 + - **Invariants**: temporal adverbs and phrases ("always", "never", "at all times", "regardless", "must remain"). 52 + - **Definitions**: copular patterns ("X is/means/refers to …"), colon heuristics with noun-phrase guard to avoid enumerations. 53 + 54 + ### 3. Preserve ordered sequences in normalization 55 + - Do not sort numbered lists; avoid sorting bullet lists that contain arrows (→, ->), ordinals, or comma-delimited sequences. 56 + - Treat transition lines as atomic sequences (preserve order in normalized text). 57 + 58 + ### 4. Reduce noisy linking without embeddings 59 + - Build an inverted index over tags; generate candidate links only for pairs sharing ≥2 high-idf (rare) tags; cap node degree. 60 + - Optionally use MinHash/SimHash on tag sets for candidate generation, then exact-check overlap. 61 + 62 + ### 5. Multi-clause provenance (deterministic) 63 + - Attribute a node to top-2 clauses by similarity over `normalized_text` (BM25/cosine on tf-idf), above a threshold; drop positional fallback. 64 + 65 + ### 6. Domain term retention 66 + - Whitelist short acronyms in `extractTerms`: `id`, `ui`, `api`, `jwt`, `sso`, `otp`, `ip`, `db`, `tls`, `rsa`, `aes`, `rs256`, `hs256`, `oidc`, `oauth`, `2fa`. 67 + 68 + ### 7. Confidence scoring 69 + - Add `confidence` to `CanonicalNode` (e.g., 1.0 for strict pattern match, 0.7 for mixed cues, 0.3 for heading-only); use to filter/downweight low-confidence nodes in warm context and planners. 70 + 71 + --- 72 + 73 + ## Mid-Term Enhancements (Local, Deterministic) 74 + 75 + ### Local sentence embeddings 76 + - Use `transformers.js` (xenova/transformers.js) to run MiniLM/E5 embeddings locally in Node for linking and anchor identity; keep thresholds conservative. 77 + - Introduce `canon_anchor` as an LSH/SimHash over embeddings; maintain `canon_id` as strict identity. 78 + 79 + ### Unsupervised keyphrases 80 + - Implement TextRank/RAKE to extract multi-word phrases; supplement token tags with phrases for better linking. 
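To make the keyphrase idea concrete, here is a simplified RAKE-style sketch: candidate phrases are the runs of words between stopwords or punctuation, each word is scored by degree divided by frequency, and a phrase scores the sum of its words. The stopword list and the `rakeKeyphrases` name are assumptions for illustration; a real implementation would need a fuller stopword list and is not implied to match any existing Phoenix code.

```typescript
// Hypothetical, deliberately small stopword list (includes modal verbs).
const STOPWORDS = new Set([
  "a", "an", "the", "and", "or", "of", "to", "in", "for", "with",
  "on", "by", "is", "are", "be", "must", "shall", "should", "may",
]);

function rakeKeyphrases(text: string, topN = 5): string[] {
  // 1. Split into candidate phrases at punctuation and stopwords.
  const phrases: string[][] = [];
  for (const fragment of text.toLowerCase().split(/[.,;:!?()\n]+/)) {
    let current: string[] = [];
    for (const word of fragment.split(/\s+/).filter(Boolean)) {
      if (STOPWORDS.has(word)) {
        if (current.length > 0) phrases.push(current);
        current = [];
      } else {
        current.push(word);
      }
    }
    if (current.length > 0) phrases.push(current);
  }

  // 2. Word frequency and degree (sum of lengths of phrases containing the word).
  const freq = new Map<string, number>();
  const degree = new Map<string, number>();
  for (const phrase of phrases) {
    for (const word of phrase) {
      freq.set(word, (freq.get(word) ?? 0) + 1);
      degree.set(word, (degree.get(word) ?? 0) + phrase.length);
    }
  }

  // 3. Phrase score = sum over words of degree / frequency.
  const scored = new Map<string, number>();
  for (const phrase of phrases) {
    const score = phrase.reduce((sum, w) => sum + degree.get(w)! / freq.get(w)!, 0);
    scored.set(phrase.join(" "), score);
  }

  // Sort by score, breaking ties alphabetically so output is deterministic.
  return [...scored.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topN)
    .map(([phrase]) => phrase);
}
```

The alphabetical tie-break matters here: tags feed identity and linking, so the extractor should return the same phrases in the same order on every run.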

### Weakly-supervised type classifier
- Train a small, explainable classifier (tf-idf + logistic regression/SVM) with labeling functions (negation, bounds, temporal, definitional cues) to improve type accuracy deterministically.

---

## LLM Usage, Stabilized

### LLM as normalizer only
- Keep deterministic rule-based candidate extraction; use LLM to rewrite each sentence into a canonical form. Validate with JSON schema, temperature 0; hash normalized sentence.

### LLM with explicit provenance
- Require `source_clauses` (headings/line spans) in output; reject responses without attribution; fallback to rules if validation fails.

### Self-consistency
- Query k times at temp 0 per sentence; select lexical medoid/majority for type/tags to reduce variance; deterministic tie-breakers.

---

## Graph and Identity Evolution

### Typed edges and optional hierarchy
- Relation types: `defines`, `refines`, `constrains`, `invariant_of`, `duplicates`, `relates_to`.
- Add optional `parent_canon_id`; use heading depth + similarity to propose hierarchies; use typed edges in `warm-hasher` context.

### Soft identity anchor
- Add `canon_anchor` (e.g., SimHash/LSH of embedding or MinHash of tags) to match across minor rephrasing.
- Diff logic: match by anchor first, then compare `canon_id` to decide "same concept, changed wording" vs "new node."
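
The anchor-then-ID diff logic can be sketched with the MinHash-of-tags variant. This is a rough illustration under assumptions, not the planned implementation: `CanonRef`, `minhashSignature`, `estimatedJaccard`, and `classifyNode` are hypothetical names, the 16-hash signature and the 0.6 threshold are arbitrary, and seeded FNV-1a stands in for a proper hash family. Only `canon_id` and the idea of a tag-based anchor come from the plan.

```typescript
// Hypothetical minimal view of a canon node for this sketch.
interface CanonRef {
  canon_id: string;
  tags: string[];
}

type DiffKind = "unchanged" | "reworded" | "new";

// FNV-1a 32-bit hash over "<seed>:<input>", giving one cheap hash per seed.
function fnv1a(input: string, seed: number): number {
  const text = `${seed}:${input}`;
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// MinHash signature: for each seed, the minimum hash over the tag set.
// Independent of tag order. Assumes a non-empty tag list; two empty lists
// would produce identical signatures.
function minhashSignature(tags: string[], numHashes: number = 16): number[] {
  const signature: number[] = [];
  for (let seed = 0; seed < numHashes; seed++) {
    let min = 0xffffffff;
    for (const tag of tags) {
      min = Math.min(min, fnv1a(tag, seed));
    }
    signature.push(min);
  }
  return signature;
}

// Fraction of matching signature slots estimates Jaccard similarity of the tag sets.
function estimatedJaccard(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
}

// Match by anchor first, then compare canon_id:
// close anchor + same ID → unchanged; close anchor + different ID → reworded;
// no close anchor → new.
function classifyNode(next: CanonRef, previous: CanonRef[], threshold: number = 0.6): DiffKind {
  const signature = minhashSignature(next.tags);
  let best: CanonRef | undefined;
  let bestScore = 0;
  for (const prev of previous) {
    const score = estimatedJaccard(signature, minhashSignature(prev.tags));
    const prefersExactId = score === bestScore && prev.canon_id === next.canon_id;
    if (score > bestScore || prefersExactId) {
      best = prev;
      bestScore = score;
    }
  }
  if (!best || bestScore < threshold) return "new";
  return best.canon_id === next.canon_id ? "unchanged" : "reworded";
}
```

In a real pass the signatures would be computed once per node and stored as `canon_anchor` rather than recomputed per comparison. An embedding-based SimHash could replace `minhashSignature` without changing `classifyNode`.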

---

## Implementation Sketch (Files and Changes)

| File | Changes |
|------|---------|
| `src/normalizer.ts` | Don't sort numbered lists; protect bullet items with arrows/ordinals; preserve transition sequences; preserve hyphenated compounds as single tokens |
| `src/canonicalizer.ts` | Add sentence segmentation; split coordinated modals; expand type patterns with scoring rubric; add `confidence` field; extend `extractTerms` with acronym whitelist and phrase retention |
| `src/canonicalizer-llm.ts` | Require `source_clauses` in schema; validate and attribute to multiple clauses; remove positional fallback; fall back to rule-based if schema invalid |
| `src/warm-hasher.ts` | Include only typed edges and/or high-confidence nodes in warm context; cap linked IDs to reduce incidental invalidations |
| New: tag inverted index utility | Build idf-weighted candidate generator; cap max neighbors; replace O(n²) linking in both paths |
| CLI and diagnostics | Print per-clause coverage %, uncovered sentence snippets, and top reasons for drop |

---

## Measurement and Targets

| Metric | Current | Target |
|--------|---------|--------|
| Extraction recall | ~70% | +15–25% on prose-heavy specs |
| Type accuracy | ~60% | +20–30% absolute with enhanced rules/weak supervision |
| Linking precision | ~40% | +25–40% absolute using idf-weighted candidates or embeddings |
| Provenance accuracy (rule) | 100% | 100% (maintain) |
| Provenance accuracy (LLM) | ~80% | ≥95% with explicit sources; multi-clause enabled |
| Identity stability (rule) | 100% | 100% (maintain) |
| Identity stability (LLM) | ~90% | Stabilized via normalization + anchors |
| Scalability | O(n²) linking | Near-linear with candidate pruning; supports 5–10k nodes |

---

## References (Academic & OSS)

### Semantic extraction and typing
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084.
- Mavin et al. (2009/2010). EARS: The Easy Approach to Requirements Syntax.
- Ferrari et al. (2017). NLP for Requirements Engineering: Systematic Literature Review.
- Banko et al. (2007). Open Information Extraction from the Web. (OpenIE/OLLIE line of work).
- He et al. (2017). Deep Semantic Role Labeling.

### Linking, similarity, anchors
- Broder (1997). On the resemblance and containment of documents (MinHash).
- Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms (SimHash).
- Malkov & Yashunin (2018). HNSW for Approximate Nearest Neighbors.

### Keyphrases and JS tooling
- Mihalcea & Tarau (2004). TextRank for keyphrase extraction.
- RAKE (Rose et al., 2010). Rapid Automatic Keyword Extraction.
- KeyBERT (Grootendorst, 2020). Keyphrase extraction via embeddings.
- `xenova/transformers.js`: run MiniLM/E5 locally in Node.
- `wink-nlp` / `compromise`: JS POS and noun-phrase extraction.

### LLM stability and structured output
- Structured/JSON-constrained decoding (OSS: outlines, guidance-style libs).
- Wang et al. (2023). Self-Consistency improves chain-of-thought (ensembling for stability).

---

## Proposed Roadmap (6–8 weeks)

| Week | Focus |
|------|-------|
| 1–2 | **Quick wins**: Sentence segmentation; coverage reporting; improved rule patterns; acronym whitelist; stop sorting sequences. Inverted-index linking with idf filtering; cap node degree; update warm-hasher to use high-confidence links only. |
| 3–4 | **Multi-clause provenance and typed edges**: Add multi-source attribution; heuristics-based typed relations; CLI diagnostics. |
| 5–6 | **Local embeddings and anchors** (optional flag): Integrate transformers.js; compute anchors; evaluate on fixtures; keep behind feature flag. |
| 7–8 | **LLM stabilization** (optional): LLM normalization path; explicit clause references schema; self-consistency; fallbacks and tests. |

---

## Risks and Mitigations

| Risk | Mitigation |
|------|-----------|
| Semantic drift from normalizer changes | Guard with tests on ordered lists/transitions; feature flag if needed |
| Over-linking regressions | Cap links per node; require idf-weighted overlap; toggle embedding/anchor features behind flags |
| LLM nondeterminism | Keep rule path primary; use LLM only for normalization with strict schema; add self-consistency |
| Performance regressions | Benchmark linking pre/post; ensure near-linear behavior with candidate pruning |

---

## Open Questions for the Team

1. Confirm acceptance of sentence-level extraction and coverage reporting in CLI.
2. Approve normalization change to preserve ordered sequences.
3. Choose initial path for linking: idf-filtered tags vs local embeddings.
4. Decide whether to introduce `confidence`, typed edges, and `canon_anchor` in the data model now (fields can be optional, defaulted).

---

*Automated code review generated 2026-02-19. Source: OpenAI Codex analysis of Phoenix VCS codebase.*
