# Phoenix Canonicalization — Review, Gaps, and Refinement Plan

**Version:** 2026-02-19
**Source:** Automated code review (OpenAI Codex) of `docs/CANONICALIZATION.md` against codebase
**Audience:** Phoenix core team (research + engineering)
**Scope:** Independent review of current canonicalization with prioritized improvements and references

---

## Executive Summary

Canonicalization is central to Phoenix. The current rule-based engine is fast, deterministic, and preserves provenance; the LLM path adds recall but weakens determinism and provenance. Key pain points are over-typing to REQUIREMENT, missed statements in prose, noisy O(n²) linking, the lack of coverage/confidence signals, and identity instability when statements are rephrased. We recommend a staged refinement:

- **Near term** (no external deps): sentence-level extraction with coverage diagnostics, better typing rules, preservation of ordered sequences in normalization, less noisy and more scalable linking, and multi-clause provenance.
- **Mid term** (local-only ML): deterministic local embeddings for linking/anchors, unsupervised keyphrase extraction, and a weakly-supervised type classifier.
- **LLM, stabilized**: use the LLM for normalization only, or require explicit clause references; add self-consistency to reduce variance.
- **Graph/identity**: introduce typed edges, optional hierarchy, and a soft "anchor" identity alongside the strict hash.

These changes maintain Phoenix's core properties (content-addressed identity, explicit provenance, deterministic fallback) while measurably improving extraction quality, type accuracy, and linking precision.

---

## Strengths (What Works Today)

- **Determinism and fallback**: The rule-based path is stable and always available; the LLM path falls back fully to rules on failure.
- **Explicit provenance**: The rule extractor maps nodes to clauses with high fidelity; this is leveraged by `warm-hasher` and `canonical-store`.
- **Normalization for stability**: Formatting-only changes do not churn `clause_semhash`; downstream diffs remain clean.
- **Test coverage at the pipeline level**: Functional tests exercise the parse → canonicalize → warm hash → classify flow.

---

## Gaps and Risks (Beyond the Known Issues)

These are issues identified through code review that go beyond what CANONICALIZATION.md already documents:

- **Normalizer reorders lists**: Sorting list items breaks ordered semantics (e.g., transition sequences like `open → in_progress → review → done`), causing meaning drift and brittle hashes. This is a **correctness bug**, not just a quality issue.
- **Acronym loss in tags**: Tokens of ≤2 characters (`id`, `ui`, `api`, `jwt`, `sso`, `otp`) are dropped by `extractTerms`, degrading tags and links in domains with heavy acronym usage.
- **No typed relations**: `linked_canon_ids` are untyped and undirected, based only on term overlap; downstream systems cannot use relation semantics (constrains, defines, refines).
- **Statement-as-identity tension**: Small rephrasings create new IDs; including `source_clause_id` in the hash prevents dedup across sections even when the same requirement appears twice.

---

## Quick Wins (Low-Risk, High-Impact)

### 1. Sentence-first extraction and coverage
- Segment clause text into sentences; split compound sentences with coordinated modals ("must A and must B" → two candidates).
- Classify per sentence; compute coverage per clause (extracted / total sentences) and emit diagnostics for uncovered sentences.
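As a sketch, the two bullets above could be implemented as follows; `segmentSentences`, `splitCoordinatedModals`, and `coverage` are illustrative helper names under assumed behavior, not the current canonicalizer API:

```typescript
/** Naive sentence segmenter: splits on ., !, ? followed by whitespace. */
function segmentSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

/** Split "X must A and must B" into two candidates, re-attaching the subject. */
function splitCoordinatedModals(sentence: string): string[] {
  const parts = sentence.split(/\s+and\s+(?=(?:must|shall|should)\b)/i);
  if (parts.length < 2) return [sentence];
  // Text before the first modal is the shared subject of later conjuncts.
  const m = parts[0].match(/^(.*?)\b(?:must|shall|should)\b/i);
  const subject = m ? m[1].trim() : "";
  return parts.map((p, i) => (i === 0 ? p.trim() : `${subject} ${p.trim()}`));
}

/** Per-clause coverage: fraction of sentences that produced a node. */
function coverage(extracted: number, total: number): number {
  return total === 0 ? 1 : extracted / total;
}
```

Uncovered sentences (those contributing nothing to `extracted`) would then be emitted as diagnostics rather than silently dropped.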

### 2. Stronger rule features for typing
- **Constraints**: negation ("must not", "may not", "cannot"), bound phrases ("at most/least", "no more than/fewer than", numeric ranges).
- **Invariants**: adverbs ("always", "never", "at all times", "regardless", "must remain").
- **Definitions**: copular patterns ("X is/means/refers to …"), colon heuristics with a noun-phrase guard to avoid enumerations.
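A minimal scoring rubric built from these cues might look like this; the pattern lists are an illustrative subset, not the exhaustive production rules:

```typescript
type NodeType = "CONSTRAINT" | "INVARIANT" | "DEFINITION" | "REQUIREMENT";

// Illustrative cue patterns per type; REQUIREMENT is the fallback.
const CUES: Record<Exclude<NodeType, "REQUIREMENT">, RegExp[]> = {
  CONSTRAINT: [
    /\b(must not|may not|cannot|shall not)\b/i,
    /\b(at (most|least)|no (more|fewer) than)\b/i,
    /\b\d+\s*(-|–|to)\s*\d+\b/, // numeric ranges like "5-10"
  ],
  INVARIANT: [/\b(always|never|at all times|regardless|must remain)\b/i],
  DEFINITION: [/\b\w[\w\s]*\s(is|means|refers to)\s/i],
};

/** Score each type by cue hits; fall back to REQUIREMENT when nothing fires. */
function classify(sentence: string): NodeType {
  let best: NodeType = "REQUIREMENT";
  let bestScore = 0;
  for (const [type, patterns] of Object.entries(CUES)) {
    const score = patterns.filter((p) => p.test(sentence)).length;
    if (score > bestScore) {
      bestScore = score;
      best = type as NodeType;
    }
  }
  return best;
}
```

The hit count doubles as a raw input to the `confidence` field proposed in quick win 7.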

### 3. Preserve ordered sequences in normalization
- Do not sort numbered lists; avoid sorting bullet lists that contain arrows (→, ->), ordinals, or comma-delimited sequences.
- Treat transition lines as atomic sequences (preserve order in normalized text).
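A guard along these lines would fix the correctness bug flagged under Gaps; `isOrderSensitive` and `normalizeList` are hypothetical names, not the current `src/normalizer.ts` signatures:

```typescript
// Cues indicating a list whose order carries meaning and must not be sorted.
const ORDERED_CUES: RegExp[] = [
  /(→|->)/,                                 // transition arrows
  /\b(first|second|third|then|finally)\b/i, // sequence adverbs / ordinals
  /^\s*\d+[.)]\s/,                          // numbered items
];

function isOrderSensitive(items: string[]): boolean {
  return items.some((item) => ORDERED_CUES.some((p) => p.test(item)));
}

/** Sort only when safe; otherwise preserve the authored order. */
function normalizeList(items: string[]): string[] {
  const trimmed = items.map((i) => i.trim());
  return isOrderSensitive(trimmed) ? trimmed : [...trimmed].sort();
}
```

Unordered lists still get the canonical sort (so `clause_semhash` stays stable under reordering), while transition sequences hash in authored order.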

### 4. Reduce noisy linking without embeddings
- Build an inverted index over tags; generate candidate links only for pairs sharing ≥2 informative (high-idf, i.e., rare) tags; cap node degree.
- Optionally use MinHash/SimHash on tag sets for candidate generation, then exact-check overlap.
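The inverted-index approach can be sketched as follows; here a document-frequency cutoff stands in for full idf weighting, and `candidatePairs` is an illustrative name for the proposed utility:

```typescript
interface Node { id: string; tags: string[] }

/** Generate candidate link pairs via an inverted index, avoiding O(n²) scans. */
function candidatePairs(
  nodes: Node[],
  minShared = 2,
  maxDf = 0.5, // ignore tags appearing in more than half the nodes (low idf)
): Array<[string, string]> {
  const index = new Map<string, string[]>(); // tag -> node ids
  for (const n of nodes) {
    for (const t of new Set(n.tags)) {
      const ids = index.get(t) ?? [];
      ids.push(n.id);
      index.set(t, ids);
    }
  }
  const dfLimit = Math.max(2, Math.floor(nodes.length * maxDf));
  const shared = new Map<string, number>(); // "a|b" -> shared informative tags
  for (const ids of index.values()) {
    if (ids.length < 2 || ids.length > dfLimit) continue; // skip ubiquitous tags
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = `${ids[i]}|${ids[j]}`;
        shared.set(key, (shared.get(key) ?? 0) + 1);
      }
    }
  }
  return [...shared.entries()]
    .filter(([, count]) => count >= minShared)
    .map(([key]) => key.split("|") as [string, string]);
}
```

Only nodes that co-occur under a rare tag are ever compared, so cost tracks the number of true near-neighbors rather than all pairs.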

### 5. Multi-clause provenance (deterministic)
- Attribute a node to the top-2 clauses by similarity over `normalized_text` (BM25/cosine on tf-idf), above a threshold; drop the positional fallback.
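A sketch of the attribution step, using plain term-frequency cosine as a stand-in for the BM25/tf-idf scoring named above; `attribute` and the threshold value are illustrative:

```typescript
/** Bag-of-words vector over lowercase word tokens. */
function bow(text: string): Map<string, number> {
  const v = new Map<string, number>();
  for (const t of text.toLowerCase().match(/[a-z0-9_]+/g) ?? []) {
    v.set(t, (v.get(t) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

/** Attribute a node to its top-2 source clauses above a similarity threshold. */
function attribute(
  nodeText: string,
  clauses: Array<{ id: string; text: string }>,
  threshold = 0.2,
): string[] {
  const nv = bow(nodeText);
  return clauses
    .map((c) => ({ id: c.id, score: cosine(nv, bow(c.text)) }))
    .filter((c) => c.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, 2)
    .map((c) => c.id);
}
```

Because no positional fallback remains, a node below threshold against every clause is reported as unattributed instead of being guessed.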

### 6. Domain term retention
- Whitelist short acronyms in `extractTerms`: `id`, `ui`, `api`, `jwt`, `sso`, `otp`, `ip`, `db`, `tls`, `rsa`, `aes`, `rs256`, `hs256`, `oidc`, `oauth`, `2fa`.
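A minimal sketch of the whitelist applied during term extraction; the real `extractTerms` in `src/canonicalizer.ts` may tokenize differently, and stopword handling is omitted here:

```typescript
// Short tokens normally dropped by the length filter but kept as domain terms.
const SHORT_TERM_WHITELIST = new Set([
  "id", "ui", "api", "jwt", "sso", "otp", "ip", "db", "tls",
  "rsa", "aes", "rs256", "hs256", "oidc", "oauth", "2fa",
]);

function extractTerms(text: string): string[] {
  const tokens = text.toLowerCase().match(/[a-z0-9][a-z0-9_-]*/g) ?? [];
  // Keep tokens longer than 2 chars, plus whitelisted short acronyms.
  return [...new Set(
    tokens.filter((t) => t.length > 2 || SHORT_TERM_WHITELIST.has(t)),
  )];
}
```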

### 7. Confidence scoring
- Add `confidence` to `CanonicalNode` (e.g., 1.0 for strict pattern match, 0.7 for mixed cues, 0.3 for heading-only); use it to filter/downweight low-confidence nodes in warm context and planners.

---

## Mid-Term Enhancements (Local, Deterministic)

### Local sentence embeddings
- Use `transformers.js` (xenova/transformers.js) to run MiniLM/E5 embeddings locally in Node for linking and anchor identity; keep thresholds conservative.
- Introduce `canon_anchor` as an LSH/SimHash over embeddings; maintain `canon_id` as strict identity.
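To show the anchor idea without the embedding dependency, here is a simplified 32-bit SimHash over tag sets; a real `canon_anchor` would hash embedding dimensions and use 64+ bits, so treat this as an illustration of the mechanism only:

```typescript
/** FNV-1a 32-bit hash of a string. */
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

/** SimHash: sum per-bit votes across feature hashes, keep the sign. */
function simhash(tags: string[]): number {
  const acc = new Array(32).fill(0);
  for (const tag of tags) {
    const h = fnv1a(tag);
    for (let b = 0; b < 32; b++) acc[b] += ((h >>> b) & 1) ? 1 : -1;
  }
  let out = 0;
  for (let b = 0; b < 32; b++) if (acc[b] > 0) out |= 1 << b;
  return out >>> 0;
}

/** Hamming distance between anchors; small distance => likely same concept. */
function hamming(a: number, b: number): number {
  let x = (a ^ b) >>> 0, n = 0;
  while (x) { n += x & 1; x >>>= 1; }
  return n;
}
```

Unlike `canon_id`, the anchor degrades gracefully: a small rephrasing flips a few feature hashes and moves the anchor only a few bits.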

### Unsupervised keyphrases
- Implement TextRank/RAKE to extract multi-word phrases; supplement token tags with phrases for better linking.

### Weakly-supervised type classifier
- Train a small, explainable classifier (tf-idf + logistic regression/SVM) with labeling functions (negation, bounds, temporal, definitional cues) to improve type accuracy deterministically.

---

## LLM Usage, Stabilized

### LLM as normalizer only
- Keep deterministic rule-based candidate extraction; use the LLM only to rewrite each sentence into a canonical form. Validate with a JSON schema at temperature 0; hash the normalized sentence.

### LLM with explicit provenance
- Require `source_clauses` (headings/line spans) in the output; reject responses without attribution; fall back to rules if validation fails.

### Self-consistency
- Query k times per sentence (low temperature, since providers can be nondeterministic even at 0); select the lexical medoid/majority for type and tags to reduce variance; apply deterministic tie-breakers.
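The medoid selection can be sketched as below; token-set Jaccard stands in for whatever lexical distance is ultimately chosen, and `lexicalMedoid` is a hypothetical helper:

```typescript
/** Jaccard distance between the word-token sets of two strings. */
function tokenJaccardDist(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : 1 - inter / union;
}

/**
 * Pick the sample with the smallest total distance to all others
 * (the medoid). Sorting first makes tie-breaking deterministic.
 */
function lexicalMedoid(samples: string[]): string {
  const sorted = [...samples].sort();
  let best = sorted[0];
  let bestCost = Infinity;
  for (const s of sorted) {
    const cost = sorted.reduce((acc, t) => acc + tokenJaccardDist(s, t), 0);
    if (cost < bestCost) { bestCost = cost; best = s; }
  }
  return best;
}
```

An outlier sample (a one-off LLM wobble) accumulates a large total distance and is never selected, so the hashed output stays stable across runs.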

---

## Graph and Identity Evolution

### Typed edges and optional hierarchy
- Relation types: `defines`, `refines`, `constrains`, `invariant_of`, `duplicates`, `relates_to`.
- Add an optional `parent_canon_id`; use heading depth + similarity to propose hierarchies; use typed edges in the `warm-hasher` context.

### Soft identity anchor
- Add `canon_anchor` (e.g., SimHash/LSH of the embedding or MinHash of the tags) to match nodes across minor rephrasings.
- Diff logic: match by anchor first, then compare `canon_id` to decide "same concept, changed wording" vs. "new node."

---

## Implementation Sketch (Files and Changes)

| File | Changes |
|------|---------|
| `src/normalizer.ts` | Don't sort numbered lists; protect bullet items with arrows/ordinals; preserve transition sequences; preserve hyphenated compounds as single tokens |
| `src/canonicalizer.ts` | Add sentence segmentation; split coordinated modals; expand type patterns with a scoring rubric; add `confidence` field; extend `extractTerms` with acronym whitelist and phrase retention |
| `src/canonicalizer-llm.ts` | Require `source_clauses` in schema; validate and attribute to multiple clauses; remove positional fallback; fall back to rule-based if schema invalid |
| `src/warm-hasher.ts` | Include only typed edges and/or high-confidence nodes in warm context; cap linked IDs to reduce incidental invalidations |
| New: tag inverted index utility | Build idf-weighted candidate generator; cap max neighbors; replace O(n²) linking in both paths |
| CLI and diagnostics | Print per-clause coverage %, uncovered sentence snippets, and top reasons for drop |

---

## Measurement and Targets

| Metric | Current | Target |
|--------|---------|--------|
| Extraction recall | ~70% | +15–25% on prose-heavy specs |
| Type accuracy | ~60% | +20–30% absolute with enhanced rules/weak supervision |
| Linking precision | ~40% | +25–40% absolute using idf-weighted candidates or embeddings |
| Provenance accuracy (rule) | 100% | 100% (maintain) |
| Provenance accuracy (LLM) | ~80% | ≥95% with explicit sources; multi-clause enabled |
| Identity stability (rule) | 100% | 100% (maintain) |
| Identity stability (LLM) | ~90% | Stabilized via normalization + anchors |
| Scalability | O(n²) linking | Near-linear with candidate pruning; supports 5–10k nodes |

---

## References (Academic & OSS)

### Semantic extraction and typing
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084.
- Mavin et al. (2009/2010). EARS: The Easy Approach to Requirements Syntax.
- Ferrari et al. (2017). Natural Language Processing for Requirements Engineering: A Systematic Literature Review.
- Banko et al. (2007). Open Information Extraction from the Web. (OpenIE/OLLIE line of work.)
- He et al. (2017). Deep Semantic Role Labeling: What Works and What's Next.

### Linking, similarity, anchors
- Broder (1997). On the Resemblance and Containment of Documents. (MinHash.)
- Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms. (SimHash.)
- Malkov & Yashunin (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. (HNSW.)

### Keyphrases and JS tooling
- Mihalcea & Tarau (2004). TextRank: Bringing Order into Texts.
- Rose et al. (2010). Automatic Keyword Extraction from Individual Documents. (RAKE.)
- Grootendorst (2020). KeyBERT: Minimal keyword extraction with BERT.
- `xenova/transformers.js` — run MiniLM/E5 locally in Node.
- `wink-nlp` / `compromise` — JS POS tagging and noun-phrase extraction.

### LLM stability and structured output
- Structured/JSON-constrained decoding (OSS: outlines, guidance-style libraries).
- Wang et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models.

---

## Proposed Roadmap (6–8 weeks)

| Week | Focus |
|------|-------|
| 1–2 | **Quick wins**: Sentence segmentation; coverage reporting; improved rule patterns; acronym whitelist; stop sorting sequences. Inverted-index linking with idf filtering; cap node degree; update warm-hasher to use high-confidence links only. |
| 3–4 | **Multi-clause provenance and typed edges**: Add multi-source attribution; heuristics-based typed relations; CLI diagnostics. |
| 5–6 | **Local embeddings and anchors** (optional flag): Integrate transformers.js; compute anchors; evaluate on fixtures; keep behind a feature flag. |
| 7–8 | **LLM stabilization** (optional): LLM normalization path; explicit clause-reference schema; self-consistency; fallbacks and tests. |

---

## Risks and Mitigations

| Risk | Mitigation |
|------|-----------|
| Semantic drift from normalizer changes | Guard with tests on ordered lists/transitions; feature flag if needed |
| Over-linking regressions | Cap links per node; require idf-weighted overlap; toggle embedding/anchor features behind flags |
| LLM nondeterminism | Keep rule path primary; use LLM only for normalization with strict schema; add self-consistency |
| Performance regressions | Benchmark linking pre/post; ensure near-linear behavior with candidate pruning |

---

## Open Questions for the Team

1. Confirm acceptance of sentence-level extraction and coverage reporting in the CLI.
2. Approve the normalization change to preserve ordered sequences.
3. Choose the initial path for linking: idf-filtered tags vs. local embeddings.
4. Decide whether to introduce `confidence`, typed edges, and `canon_anchor` in the data model now (fields can be optional, defaulted).

---

*Automated code review generated 2026-02-19. Source: OpenAI Codex analysis of Phoenix VCS codebase.*