Implement canonicalization v2: two-phase extraction/resolution pipeline
ARCHITECTURE CHANGES:
- Split canonicalization into Phase 1 (Extraction) and Phase 2 (Resolution)
- Phase 1 is deterministic, per-clause, parallelizable
- Phase 2 is a versioned global graph pass
NEW FILES:
- src/sentence-segmenter.ts — sentence-level text segmentation
- src/resolution.ts — dedup, typed edges, hierarchy, anchors, IDF linking
- tests/unit/sentence-segmenter.test.ts — 9 tests
- tests/unit/resolution.test.ts — 13 tests
MODEL CHANGES (src/models/canonical.ts):
- Added CONTEXT as 5th CanonicalType (framing text, not actionable)
- Added CandidateNode interface (Phase 1 output)
- Added ExtractionCoverage interface
- Added EdgeType union: constrains | defines | refines | invariant_of | duplicates | relates_to
- Added optional fields to CanonicalNode: canon_anchor, confidence,
link_types, parent_canon_id, extraction_method
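The model changes above can be sketched as follows. Only the names listed in the bullets are from this change; everything else (the pre-existing type names beyond REQ/CON/INV visible in the results, DECISION as a placeholder for the fourth pre-existing type, and all field types) is an assumption:

```typescript
// Sketch of the updated model surface in src/models/canonical.ts.
// DECISION is a hypothetical placeholder; the real fourth type may differ.
type CanonicalType = "REQUIREMENT" | "CONSTRAINT" | "INVARIANT" | "DECISION" | "CONTEXT";

type EdgeType =
  | "constrains" | "defines" | "refines"
  | "invariant_of" | "duplicates" | "relates_to";

// Phase 1 output: one candidate per sentence, before global resolution.
interface CandidateNode {
  text: string;
  type: CanonicalType;
  confidence: number;          // margin between winning and runner-up scores
  source_clause_id: string;
  tags: string[];
}

interface ExtractionCoverage {
  totalSentences: number;
  classifiedSentences: number; // CONTEXT counts as classified, not dropped
}

interface CanonicalNode extends CandidateNode {
  canon_id: string;
  canon_anchor?: string;       // SHA-256 over type + sorted tags + clause ids
  parent_canon_id?: string;    // from heading hierarchy
  link_types?: EdgeType[];
  extraction_method?: "rules" | "llm";
}

const sample: CandidateNode = {
  text: "Sessions expire after 30 minutes",
  type: "REQUIREMENT",
  confidence: 0.5,
  source_clause_id: "c1",
  tags: ["session", "expiry"],
};
```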
EXTRACTION (src/canonicalizer.ts rewrite):
- Sentence-level segmentation replaces line-level splitting
- Scoring rubric replaces binary regex matching (scores across all 5 types)
- CONTEXT type catches non-actionable text (previously dropped silently)
- Confidence scores: margin between the winning and runner-up type scores
- Acronym whitelist: id, api, jwt, sso, otp, etc. are no longer dropped
- Hyphenated compounds preserved as single tags (rate-limit, in-progress)
- extractCandidates() exposed as public API with coverage metrics
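The margin-based confidence above can be sketched as below, assuming the rubric produces a score per type for each sentence; the normalization by the winning score is an illustrative choice, not necessarily the real one:

```typescript
// Sketch: pick the highest-scoring type and derive a confidence value
// from its margin over the runner-up. TypeScores shape is assumed.
type TypeScores = Record<string, number>;

function classifyWithConfidence(scores: TypeScores): { type: string; confidence: number } {
  const ranked = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  const [winner, runnerUp] = ranked;
  const margin = winner[1] - (runnerUp ? runnerUp[1] : 0);
  // Normalize the margin into [0, 1] relative to the winning score;
  // a zero winning score means nothing matched (CONTEXT-like fallback).
  const confidence = winner[1] > 0 ? margin / winner[1] : 0;
  return { type: winner[0], confidence };
}
```

A near-tie between two types yields low confidence, which the warm hasher can later use as a filter threshold.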
RESOLUTION (src/resolution.ts new):
- Deduplication: token-trigram fingerprinting + Jaccard similarity >0.7
- Typed edge inference: constrains, defines, refines, invariant_of
- IDF-weighted inverted index replaces O(n²) pairwise linking
- Hierarchy from heading structure (parent_canon_id)
- canon_anchor: SHA-256(type + sorted_tags + sorted_source_clause_ids)
- Max degree cap of 8 per node (enforced by IDF-scored pruning)
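The dedup and anchor steps above can be sketched as follows. The 0.7 threshold and the anchor recipe come from this change; the tokenization details are assumptions:

```typescript
import { createHash } from "node:crypto";

// Sketch: token-trigram fingerprint of a sentence (tokenization is assumed).
function trigrams(text: string): Set<string> {
  const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= tokens.length; i++) {
    grams.add(tokens.slice(i, i + 3).join(" "));
  }
  return grams;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const g of a) if (b.has(g)) inter++;
  return inter / (a.size + b.size - inter);
}

// Two candidates are merged when trigram overlap exceeds 0.7.
const isDuplicate = (a: string, b: string) =>
  jaccard(trigrams(a), trigrams(b)) > 0.7;

// canon_anchor: content-derived identity, stable across tag/clause ordering.
function canonAnchor(type: string, tags: string[], clauseIds: string[]): string {
  const material = [type, ...[...tags].sort(), ...[...clauseIds].sort()].join("|");
  return createHash("sha256").update(material).digest("hex");
}
```

Sorting the tags and clause ids before hashing is what makes the anchor order-independent, so re-extraction in a different order produces the same identity.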
NORMALIZER FIX (src/normalizer.ts):
- Numbered lists no longer sorted (correctness bug — order matters)
- Bullet lists with sequence indicators (→, ->, ordinals) preserved
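The ordering fix above can be sketched as a guard in front of the sort, assuming the normalizer previously sorted list items for stable output; the ordinal word list is illustrative:

```typescript
// Sketch: detect sequence indicators so ordered content is never re-sorted.
const ORDINALS = /^(first|second|third|then|next|finally)\b/i;

function isSequenced(items: string[]): boolean {
  return items.some((it) => /(→|->)/.test(it) || ORDINALS.test(it.trim()));
}

// Only unordered, unsequenced bullet lists are safe to sort for stability;
// numbered lists and sequenced bullets keep their original order.
function normalizeList(items: string[], numbered: boolean): string[] {
  if (numbered || isSequenced(items)) return items;
  return [...items].sort();
}
```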
LLM CANONICALIZER (src/canonicalizer-llm.ts rewrite):
- Default mode: LLM-as-normalizer (rule-based extraction + LLM statement rewriting)
- Temperature 0, JSON schema enforced, per-sentence (not batch)
- CONTEXT nodes skipped (not worth LLM cost)
- Full extraction mode behind extractWithLLMFull() with explicit provenance
- Positional fallback removed — nodes without valid provenance are dropped
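The provenance guard above can be sketched as a filter; the node shape and field names here are assumptions:

```typescript
// Sketch: instead of positionally guessing which clause an LLM node came
// from, drop any node whose claimed provenance can't be verified.
interface LLMNode {
  text: string;
  source_clause_id?: string;
}

function keepProvenanced(nodes: LLMNode[], knownClauseIds: Set<string>): LLMNode[] {
  return nodes.filter(
    (n) => n.source_clause_id !== undefined && knownClauseIds.has(n.source_clause_id)
  );
}
```

Dropping is deliberate: a node with fabricated provenance is worse than a missing node, since downstream hashing and planning trust the clause link.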
WARM HASHER (src/warm-hasher.ts):
- Uses only typed edges (excludes weak 'relates_to') in context hash
- Filters by confidence threshold (≥0.3)
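The two warm-hasher filters above can be sketched together; the node/edge shapes are assumptions, while the `relates_to` exclusion and the 0.3 threshold come from this change:

```typescript
// Sketch: restrict context-hash inputs to confident nodes and strong edges,
// so weak relates_to links and low-margin nodes don't perturb the hash.
interface Edge { type: string; from: string; to: string }
interface Node { canon_id: string; confidence?: number }

function hashInputs(nodes: Node[], edges: Edge[]) {
  const kept = nodes.filter((n) => (n.confidence ?? 1) >= 0.3);
  const ids = new Set(kept.map((n) => n.canon_id));
  const typedEdges = edges.filter(
    (e) => e.type !== "relates_to" && ids.has(e.from) && ids.has(e.to)
  );
  return { nodes: kept, edges: typedEdges };
}
```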
IU PLANNER (src/iu-planner.ts):
- CONTEXT nodes filtered out (don't generate code)
RESULTS (before → after):
TaskFlow tasks.md:
Types: {REQ:18} → {CTX:1, REQ:18}
Coverage: unmeasured → 100%
Hierarchy: none → 18/19 nodes have parents
Edges: 24 untyped → 26 (all typed)
Auth v1:
Types: {REQ:6, CON:2} → {CTX:5, REQ:3, CON:3}
Edges: 2 untyped → 8 (4 refines, 4 relates_to)
Notifications:
Types: {REQ:12, CON:1, INV:1} → {CTX:1, REQ:10, CON:3, INV:1}
Edges: 14 untyped → 10 (2 constrains, 2 refines, 6 relates_to)
257 tests passing across 30 files (22 new tests).