Reference implementation for the Phoenix Architecture. Work in progress. aicoding.leaflet.pub/
# Phoenix Canonicalization — Review, Gaps, and Refinement Plan

**Version:** 2026-02-19
**Source:** Automated code review (OpenAI Codex) of `docs/CANONICALIZATION.md` against codebase
**Audience:** Phoenix core team (research + engineering)
**Scope:** Independent review of current canonicalization with prioritized improvements and references

---

## Executive Summary

Canonicalization is central to Phoenix. The current rule-based engine is fast, deterministic, and preserves provenance; the LLM path adds recall but weakens determinism and provenance. Key pain points are over-typing to REQUIREMENT, missed statements in prose, noisy O(n²) linking, lack of coverage/confidence reporting, and identity instability when statements are rephrased. We recommend a staged refinement:

- **Near term** (no external deps): sentence-level extraction with coverage diagnostics, better typing rules, preservation of ordered sequences in normalization, less noisy and more scalable linking, and multi-clause provenance.
- **Mid term** (local-only ML): deterministic local embeddings for linking/anchors, unsupervised keyphrase extraction, and a weakly supervised type classifier.
- **LLM, stabilized**: use the LLM for normalization only, or require explicit clause references; add self-consistency to reduce variance.
- **Graph/identity**: introduce typed edges, optional hierarchy, and a soft "anchor" identity alongside the strict hash.

These changes maintain Phoenix's core properties (content-addressed identity, explicit provenance, deterministic fallback) while measurably improving extraction quality, type accuracy, and linking precision.

---

## Strengths (What Works Today)

- **Determinism and fallback**: The rule-based path is stable and always available; the LLM path fully degrades to rules on failure.
- **Explicit provenance**: The rule extractor maps nodes to clauses with high fidelity; this is leveraged by `warm-hasher` and `canonical-store`.
- **Normalization for stability**: Formatting-only changes do not churn `clause_semhash`; downstream diffs remain clean.
- **Test coverage at pipeline level**: Functional tests exercise parse → canonicalize → warm hash → classify flows.

---

## Gaps and Risks (Beyond the Known Issues)

These are issues identified through code review that go beyond what CANONICALIZATION.md already documents:

- **Normalizer reorders lists**: Sorting list items breaks ordered semantics (e.g., transition sequences like `open → in_progress → review → done`), causing meaning drift and brittle hashes. This is a **correctness bug**, not just a quality issue.
- **Acronym loss in tags**: Short tokens (2–3 chars such as `id`, `ui`, `api`, `jwt`, `sso`, `otp`) are dropped by `extractTerms`, degrading tags and links in domains with heavy acronym usage.
- **No typed relations**: `linked_canon_ids` are untyped, undirected, and based on term overlap; downstream systems cannot use relation semantics (constrains, defines, refines).
- **Statement-as-identity tension**: Small rephrasings create new IDs; including `source_clause_id` in the hash prevents dedup across sections even when the same requirement appears twice.

---

## Quick Wins (Low-Risk, High-Impact)

### 1. Sentence-first extraction and coverage

- Segment clause text into sentences; split compound sentences with coordinated modals ("must A and must B" → two candidates).
- Classify per sentence; compute coverage per clause (extracted / total sentences) and emit diagnostics for uncovered sentences.

### 2. Stronger rule features for typing

- **Constraints**: negation ("must not", "may not", "cannot"), bound phrases ("at most/least", "no more than/fewer than", numeric ranges).
- **Invariants**: adverbs ("always", "never", "at all times", "regardless", "must remain").
- **Definitions**: copular patterns ("X is/means/refers to …"), colon heuristics with a noun-phrase guard to avoid enumerations.

### 3. Preserve ordered sequences in normalization

- Do not sort numbered lists; avoid sorting bullet lists that contain arrows (→, ->), ordinals, or comma-delimited sequences.
- Treat transition lines as atomic sequences (preserve order in normalized text).

### 4. Reduce noisy linking without embeddings

- Build an inverted index over tags; generate candidate links only for pairs sharing ≥2 high-idf (sufficiently rare) tags; cap node degree.
- Optionally use MinHash/SimHash on tag sets for candidate generation, then exact-check overlap.

### 5. Multi-clause provenance (deterministic)

- Attribute a node to the top-2 clauses by similarity over `normalized_text` (BM25 or cosine over tf-idf), above a threshold; drop the positional fallback.

### 6. Domain term retention

- Whitelist short acronyms in `extractTerms`: `id`, `ui`, `api`, `jwt`, `sso`, `otp`, `ip`, `db`, `tls`, `rsa`, `aes`, `rs256`, `hs256`, `oidc`, `oauth`, `2fa`.

### 7. Confidence scoring

- Add `confidence` to `CanonicalNode` (e.g., 1.0 for a strict pattern match, 0.7 for mixed cues, 0.3 for heading-only); use it to filter or downweight low-confidence nodes in warm context and planners.

---

## Mid-Term Enhancements (Local, Deterministic)

### Local sentence embeddings

- Use `transformers.js` (xenova/transformers.js) to run MiniLM/E5 embeddings locally in Node for linking and anchor identity; keep thresholds conservative.
- Introduce `canon_anchor` as an LSH/SimHash over embeddings; maintain `canon_id` as strict identity.

### Unsupervised keyphrases

- Implement TextRank/RAKE to extract multi-word phrases; supplement token tags with phrases for better linking.

### Weakly supervised type classifier

- Train a small, explainable classifier (tf-idf + logistic regression/SVM) with labeling functions (negation, bounds, temporal, definitional cues) to improve type accuracy deterministically.
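The idf-filtered candidate generation proposed under quick win 4 can be sketched deterministically. This is a minimal illustration under assumptions, not Phoenix code: `CanonNodeLite`, `candidatePairs`, and the default thresholds are hypothetical names and values.

```typescript
// Sketch of inverted-index candidate generation for linking (quick win 4).
// Hypothetical helper: `CanonNodeLite` and `candidatePairs` are not Phoenix APIs.

interface CanonNodeLite {
  canon_id: string;
  tags: string[];
}

// Emit candidate link pairs: nodes sharing >= minShared sufficiently rare tags.
// Tags appearing in more than a maxDf fraction of nodes are skipped as low-idf
// noise. Cost scales with postings-list sizes, not O(n^2) over all node pairs.
function candidatePairs(
  nodes: CanonNodeLite[],
  minShared = 2,
  maxDf = 0.25,
): Array<[string, string]> {
  // Inverted index: tag -> ids of nodes carrying it.
  const postings = new Map<string, string[]>();
  for (const n of nodes) {
    for (const t of new Set(n.tags)) {
      let ids = postings.get(t);
      if (!ids) postings.set(t, (ids = []));
      ids.push(n.canon_id);
    }
  }
  const dfCap = Math.max(1, Math.floor(nodes.length * maxDf));
  // Count shared discriminative tags per unordered pair of node ids.
  const shared = new Map<string, number>();
  for (const ids of postings.values()) {
    if (ids.length > dfCap) continue; // too common to be discriminative
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const key = [ids[i], ids[j]].sort().join("|");
        shared.set(key, (shared.get(key) ?? 0) + 1);
      }
    }
  }
  const pairs: Array<[string, string]> = [];
  for (const [key, count] of shared) {
    if (count >= minShared) pairs.push(key.split("|") as [string, string]);
  }
  return pairs;
}
```

Degree caps and true idf weighting (rather than the binary df cutoff above) would slot in where the postings lists are scanned.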
---

## LLM Usage, Stabilized

### LLM as normalizer only

- Keep deterministic rule-based candidate extraction; use the LLM to rewrite each sentence into a canonical form. Validate with a JSON schema at temperature 0; hash the normalized sentence.

### LLM with explicit provenance

- Require `source_clauses` (headings/line spans) in the output; reject responses without attribution; fall back to rules if validation fails.

### Self-consistency

- Query k times at temperature 0 per sentence; select the lexical medoid/majority for type and tags to reduce variance; use deterministic tie-breakers.

---

## Graph and Identity Evolution

### Typed edges and optional hierarchy

- Relation types: `defines`, `refines`, `constrains`, `invariant_of`, `duplicates`, `relates_to`.
- Add an optional `parent_canon_id`; use heading depth plus similarity to propose hierarchies; use typed edges in `warm-hasher` context.

### Soft identity anchor

- Add `canon_anchor` (e.g., SimHash/LSH of an embedding, or MinHash of tags) to match nodes across minor rephrasing.
- Diff logic: match by anchor first, then compare `canon_id` to decide "same concept, changed wording" vs. "new node."
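As one concrete, fully deterministic option for the soft anchor, a SimHash over a node's tag set can be sketched as below. This is an illustrative sketch of the SimHash idea only; `fnv1a`, `simhashAnchor`, `hammingDistance`, and the 32-bit width are assumptions, not part of the Phoenix data model.

```typescript
// Sketch of a soft identity anchor: a 32-bit SimHash over a node's tag set.
// All three helpers are hypothetical, not existing Phoenix functions.

// FNV-1a, 32-bit: a small deterministic string hash.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// SimHash: each tag votes +1/-1 per bit position of its hash; the sign of
// each column becomes the anchor bit. Similar tag sets yield nearby anchors.
function simhashAnchor(tags: string[]): number {
  const votes = new Array(32).fill(0);
  for (const tag of tags) {
    const h = fnv1a(tag);
    for (let b = 0; b < 32; b++) votes[b] += ((h >>> b) & 1) ? 1 : -1;
  }
  let anchor = 0;
  for (let b = 0; b < 32; b++) if (votes[b] > 0) anchor |= 1 << b;
  return anchor >>> 0;
}

// Hamming distance between anchors; "same concept, changed wording" when
// the distance is below a small threshold even though canon_id differs.
function hammingDistance(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let d = 0;
  while (x) { d += x & 1; x >>>= 1; }
  return d;
}
```

Because the anchor is a sum of per-tag votes, it is order-insensitive, and rephrasings that preserve most tags move only a few bits, which is exactly what the anchor-first diff logic above needs.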
---

## Implementation Sketch (Files and Changes)

| File | Changes |
|------|---------|
| `src/normalizer.ts` | Don't sort numbered lists; protect bullet items with arrows/ordinals; preserve transition sequences; preserve hyphenated compounds as single tokens |
| `src/canonicalizer.ts` | Add sentence segmentation; split coordinated modals; expand type patterns with a scoring rubric; add a `confidence` field; extend `extractTerms` with an acronym whitelist and phrase retention |
| `src/canonicalizer-llm.ts` | Require `source_clauses` in the schema; validate and attribute to multiple clauses; remove the positional fallback; fall back to rule-based extraction if the schema is invalid |
| `src/warm-hasher.ts` | Include only typed edges and/or high-confidence nodes in warm context; cap linked IDs to reduce incidental invalidations |
| New: tag inverted-index utility | Build an idf-weighted candidate generator; cap max neighbors; replace O(n²) linking in both paths |
| CLI and diagnostics | Print per-clause coverage %, uncovered sentence snippets, and top reasons for drops |

---

## Measurement and Targets

| Metric | Current | Target |
|--------|---------|--------|
| Extraction recall | ~70% | +15–25% on prose-heavy specs |
| Type accuracy | ~60% | +20–30% absolute with enhanced rules/weak supervision |
| Linking precision | ~40% | +25–40% absolute using idf-weighted candidates or embeddings |
| Provenance accuracy (rule) | 100% | 100% (maintain) |
| Provenance accuracy (LLM) | ~80% | ≥95% with explicit sources; multi-clause enabled |
| Identity stability (rule) | 100% | 100% (maintain) |
| Identity stability (LLM) | ~90% | Stabilized via normalization + anchors |
| Scalability | O(n²) linking | Near-linear with candidate pruning; supports 5–10k nodes |

---

## References (Academic & OSS)

### Semantic extraction and typing

- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084.
- Mavin et al. (2009/2010). EARS: The Easy Approach to Requirements Syntax.
- Ferrari et al. (2017). NLP for Requirements Engineering: Systematic Literature Review.
- Banko et al. (2007). Open Information Extraction from the Web. (OpenIE/OLLIE line of work.)
- He et al. (2017). Deep Semantic Role Labeling.

### Linking, similarity, anchors

- Broder (1997). On the Resemblance and Containment of Documents (MinHash).
- Charikar (2002). Similarity Estimation Techniques from Rounding Algorithms (SimHash).
- Malkov & Yashunin (2018). HNSW for Approximate Nearest Neighbors.

### Keyphrases and JS tooling

- Mihalcea & Tarau (2004). TextRank for keyphrase extraction.
- Rose et al. (2010). RAKE: Rapid Automatic Keyword Extraction.
- Grootendorst (2020). KeyBERT: keyphrase extraction via embeddings.
- `xenova/transformers.js` — run MiniLM/E5 locally in Node.
- `wink-nlp` / `compromise` — JS POS tagging and noun-phrase extraction.

### LLM stability and structured output

- Structured/JSON-constrained decoding (OSS: outlines, guidance-style libraries).
- Wang et al. (2023). Self-Consistency Improves Chain-of-Thought Reasoning (ensembling for stability).

---

## Proposed Roadmap (6–8 weeks)

| Week | Focus |
|------|-------|
| 1–2 | **Quick wins**: sentence segmentation; coverage reporting; improved rule patterns; acronym whitelist; stop sorting sequences. Inverted-index linking with idf filtering; cap node degree; update `warm-hasher` to use high-confidence links only. |
| 3–4 | **Multi-clause provenance and typed edges**: multi-source attribution; heuristics-based typed relations; CLI diagnostics. |
| 5–6 | **Local embeddings and anchors** (optional flag): integrate transformers.js; compute anchors; evaluate on fixtures; keep behind a feature flag. |
| 7–8 | **LLM stabilization** (optional): LLM normalization path; explicit clause-reference schema; self-consistency; fallbacks and tests. |

---

## Risks and Mitigations

| Risk | Mitigation |
|------|-----------|
| Semantic drift from normalizer changes | Guard with tests on ordered lists/transitions; feature-flag if needed |
| Over-linking regressions | Cap links per node; require idf-weighted overlap; toggle embedding/anchor features behind flags |
| LLM nondeterminism | Keep the rule path primary; use the LLM only for normalization with a strict schema; add self-consistency |
| Performance regressions | Benchmark linking pre/post; ensure near-linear behavior with candidate pruning |

---

## Open Questions for the Team

1. Confirm acceptance of sentence-level extraction and coverage reporting in the CLI.
2. Approve the normalization change to preserve ordered sequences.
3. Choose the initial path for linking: idf-filtered tags vs. local embeddings.
4. Decide whether to introduce `confidence`, typed edges, and `canon_anchor` in the data model now (fields can be optional and defaulted).

---

*Automated code review generated 2026-02-19. Source: OpenAI Codex analysis of Phoenix VCS codebase.*