A loose federation of distributed, typed datasets

docs: add comprehensive architectural assessment of atproto integration design

Added detailed architectural appraisal (Grade A-) covering:
- Strengths and synergies across design decisions
- Trade-offs and implementation considerations
- Risk analysis and mitigation strategies
- Long-term architectural trajectory
- Immediate next steps and phasing recommendations

Updated decision status: all core decisions (#45-49) complete

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

+333 -8
.chainlink/issues.db

This is a binary file and will not be displayed.

+20 -8
.planning/decisions/README.md
@@ -48,19 +48,31 @@
 - **Question**: How to validate finalized Lexicon designs?
 - **Process**: Validation checklist, example records, tests
 - **Deliverables**: Finalized Lexicon JSON files, validation report
-- **Blocked By**: Issues #45, #46, #47, #48, #49
+- **Blocked By**: Issues #45, #46, #47, #48, #49 (all completed ✅)
 - **Blocks**: Phase 1 completion (Issue #17)
+- **Status**: Ready to proceed
+
+### Architectural Assessment
+
+7. **[assessment.md](assessment.md)** (Issue #51) ✅ **Complete**
+- **Comprehensive appraisal** of all finalized design decisions
+- **Overall Grade**: A- (Excellent with caveats)
+- **Analysis**: Strengths, synergies, trade-offs, risks, long-term trajectory
+- **Recommendations**: Immediate next steps and phasing guidance
 
 ## Decision Status
 
-| Issue | Decision | Status | Recommendation |
+| Issue | Decision | Status | Final Decision |
 |-------|----------|--------|----------------|
-| #45 | Schema format | ⏳ Needs decision | Custom format |
-| #46 | Lens code storage | ⏳ Needs decision | Code references only |
-| #47 | WebDataset storage | ⏳ Needs decision | External URLs |
-| #48 | Schema evolution | ⏳ Needs decision | Semantic versioning |
-| #49 | Lexicon namespace | ⏳ Needs decision | `io.atdata.*` |
-| #50 | Validation process | ⏳ Blocked | (After #45-49) |
+| #45 | Schema format | ✅ Decided | JSON Schema with NDArray shim |
+| #46 | Lens code storage | ✅ Decided | External repos (GitHub + tangled.org) |
+| #47 | WebDataset storage | ✅ Decided | Hybrid (URLs + blobs from start) |
+| #48 | Schema evolution | ✅ Decided | rkey={NSID}@{semver} + migration Lenses |
+| #49 | Lexicon namespace | ✅ Decided | `ac.foundation.dataset.*` |
+| #50 | Validation process | ⏳ Ready | Proceed with finalized decisions |
+| #51 | Architectural appraisal | ✅ Complete | See [assessment.md](assessment.md) |
+
+**Overall Assessment**: Grade A- (Excellent with caveats) - See [assessment.md](assessment.md) for detailed analysis
 
 ## How to Use These Documents
 
+313
.planning/decisions/assessment.md
@@ -0,0 +1,313 @@
+# Architectural Assessment of Design Decisions
+
+**Issue**: #51
+**Date**: 2026-01-07
+**Status**: Complete
+
+## Overall Impression: **Ambitious but Coherent**
+
+The finalized design decisions prioritize **flexibility and future-proofing** over initial simplicity. This is a deliberate trade-off that makes sense given the scope of building a distributed dataset federation.
+
+---
+
+## Decision Summary
+
+1. **Schema Format (#45)**: JSON Schema with NDArray shim, extensible via open union
+2. **Lens Code (#46)**: External repos (GitHub + tangled.org), language metadata, future attestation
+3. **Storage (#47)**: Hybrid (URLs + blobs) from start, AppView proxy for blobs
+4. **Evolution (#48)**: rkey as {NSID}@{semver}, getLatestSchema query, optional migration Lenses
+5. **Namespace (#49)**: `ac.foundation.dataset.*` (sampleSchema, record, lens)
+
+---
+
+## Key Strengths
+
+### 1. **Ecosystem Integration** (JSON Schema + External Repos)
+
+**Decision**: JSON Schema for type definitions, external repos for code storage
+
+**Strength**: Leveraging existing ecosystems rather than building in isolation. JSON Schema brings:
+- Extensive tooling (validators, codegen, IDE support)
+- Multi-language support out of the box
+- Familiarity for developers
+
+Pairing this with GitHub/tangled.org for Lenses means developers can use existing workflows.
+
+**Implication**: Lower barrier to entry, faster time to value. The NDArray shim is the only custom piece, which is appropriate since that's the unique requirement.
+
+---
+
+### 2. **Progressive Decentralization** (Hybrid Storage)
+
+**Decision**: Hybrid storage from day one (URLs + PDS blobs)
+
+**Strength**: This is pragmatic yet principled. Not forcing decentralization where it doesn't make sense (TB-scale datasets), but enabling it where it does (smaller datasets, self-hosters).
+
+**Key Insight**: The AppView proxy for blobs is clever - it means users can work with a unified WebDataset URL interface regardless of backend storage. This abstraction is powerful.
+
+**Implication**: More implementation complexity upfront, but avoids a painful migration later. The open union pattern makes this clean.
+
+---
+
+### 3. **Versioning as Identity** (rkey = NSID@semver)
+
+**Decision**: Embed version in record key, use NSID for permanent identity
+
+**Strength**: This is elegant. By making versioning part of the identity (rkey), you get:
+- Immutable version records (can't accidentally update a published version)
+- Natural query pattern (`getLatestSchema` Lexicon)
+- Clear semantic versioning enforcement
+
+**Synergy**: Combining this with Lenses for migration is brilliant. The rkey structure makes it trivial to discover what migrations exist (e.g., "show me all versions of schema X").
+
+**Implication**: This requires custom rkey handling (type `any` in Lexicon), which ATProto supports but isn't the default pattern. Need to ensure tooling understands this convention.
+
+---
+
+### 4. **Trust Layer** (Attestation + Verification)
+
+**Decision**: Language metadata + future attestation/verification records for Lenses
+
+**Strength**: Thinking ahead about the trust problem. In a distributed system, trust is critical. This approach:
+- Short-term: Language metadata helps users understand what they're running
+- Long-term: Attestation (formal correctness proofs) + verification (trusted DIDs)
+
+This is a **strong security model** that's missing from many distributed systems.
+
+**Implication**: This is a research-level feature (formal verification of Lenses). Starting with language metadata is right, but the attestation system will require significant design work. Consider this Phase 6+.
+
+---
+
+## Architectural Tensions (Intentional Trade-offs)
+
+### 1. **Complexity Budget**
+
+**Observation**: Sophisticated solutions across the board:
+- JSON Schema (standard but verbose)
+- Hybrid storage (two code paths)
+- Custom rkey scheme (non-standard)
+- Future attestation system (advanced)
+
+**Assessment**: This increases initial implementation cost significantly. However, each choice is justified:
+- JSON Schema: Ecosystem benefits outweigh verbosity
+- Hybrid storage: Essential for real-world use cases
+- Custom rkey: Enables clean versioning
+- Attestation: Future-proofing for trust
+
+**Recommendation**: ✅ Accept the complexity, but **phase implementation carefully**:
+- Phase 1-2: Core functionality (schemas, datasets, basic lenses)
+- Phase 3: Hybrid storage in AppView
+- Phase 4: Codegen for JSON Schema
+- Phase 5+: Attestation/verification system
+
+---
+
+### 2. **ATProto Conventions vs. Custom Patterns**
+
+**Observation**: Using some non-standard ATProto patterns:
+- rkey type `any` (not typical)
+- Custom versioning scheme in rkey
+- `getLatestSchema` query Lexicon (not standard CRUD)
+
+**Assessment**: This is **justified innovation**. ATProto is designed to support custom use cases. The versioning scheme in particular is a good use of flexible rkey.
+
+**Caveat**: Need to document these conventions clearly, since they won't match typical ATProto examples.
+
+---
+
+### 3. **JSON Schema for NDArray**
+
+**Observation**: JSON Schema wasn't designed for NDArray types. The shim approach treats them as "serialized bytes" with metadata.
+
+**Assessment**: This is **pragmatic but leaky**. The abstraction leaks because:
+- JSON Schema describes serialized form (bytes), not semantic form (array with dtype/shape)
+- Codegen will need custom handling for NDArray types
+- Validation happens at deserialization, not schema level
+
+**Alternative Considered**: Custom format would give cleaner NDArray representation, but traded that for ecosystem benefits.
+
+**Mitigation**: Ensure the NDArray shim is well-documented and becomes a de facto standard within the atdata ecosystem. Consider publishing it as a reusable JSON Schema extension.
+
+---
+
+## Synergies (Where Decisions Reinforce Each Other)
+
+### 1. **Versioning + Lenses + rkey Scheme**
+
+This trilogy works beautifully together:
+- rkey embeds version → easy to list all versions
+- Lenses enable migration → versions can evolve safely
+- `getLatestSchema` query → discoverable latest version
+
+This creates a **complete version management story** that's rare in distributed systems.
+
+---
+
+### 2. **Hybrid Storage + AppView Proxy**
+
+The hybrid storage decision unlocks the proxy pattern:
+- Large datasets stay on S3/R2 (practical)
+- Small datasets can use PDS blobs (decentralized)
+- AppView proxies both → uniform interface
+
+This means the **client code is simple** (just WebDataset URLs) even though the backend is sophisticated.
+
+---
+
+### 3. **JSON Schema + Attestation + Language Metadata**
+
+This builds a **tiered trust model**:
+1. Base layer: JSON Schema validates structure
+2. Language metadata: Users know what they're executing
+3. Attestation (future): Formal proofs of correctness
+4. Verification (future): Social trust (trusted DIDs)
+
+Each layer adds security without requiring the next layer to exist.
+
+---
+
+## Implementation Risks & Mitigations
+
+### Risk 1: JSON Schema Complexity
+
+**Risk**: JSON Schema is verbose and can be confusing for users defining NDArray-heavy schemas.
+
+**Mitigation**:
+- Build **high-quality codegen** that hides the complexity (users write Python, get JSON Schema)
+- Provide **NDArray shim library** that handles the serialization/deserialization
+- Create **examples and templates** for common patterns
+
+---
+
+### Risk 2: Hybrid Storage Code Paths
+
+**Risk**: Two storage backends means 2x testing, 2x bugs, 2x maintenance.
+
+**Mitigation**:
+- Use **abstraction layer** in Dataset class (already planned)
+- **Prioritize external URLs** for Phase 1-2 (blob support can be added incrementally)
+- Test both paths from the start (CI/CD)
+
+---
+
+### Risk 3: Custom rkey Convention
+
+**Risk**: Tools that expect standard TID-based rkeys might break.
+
+**Mitigation**:
+- **Document clearly** in all Lexicon definitions
+- Provide **helper functions** in SDK (`parseSchemaRkey`, `formatSchemaRkey`)
+- Ensure `getLatestSchema` query is the primary discovery mechanism (hides rkey complexity)
+
+---
+
+### Risk 4: Attestation System Scope Creep
+
+**Risk**: Formal verification and trust systems are research-level hard. Could delay entire project.
+
+**Mitigation**:
+- Mark as **explicitly future work** (Phase 6+)
+- Start with **language metadata only** (low-hanging fruit)
+- Consider **social trust first** (verified DIDs, reputation) before formal verification
+- Partner with PL/verification researchers if pursuing formal proofs
+
+---
+
+## Long-Term Trajectory
+
+The decisions set up a compelling long-term vision:
+
+**Year 1**: Core dataset federation
+- Publish/discover datasets
+- JSON Schema for types
+- External URL storage
+- Basic Lenses
+
+**Year 2**: Decentralization
+- PDS blob storage for small datasets
+- AppView with proxy
+- Migration Lenses widely used
+- Community schemas emerging
+
+**Year 3**: Trust & verification
+- Language metadata standard
+- Verified DID system (social trust)
+- Attestation for critical Lenses
+- Cross-language support (TypeScript, Rust)
+
+**Year 4+**: Research frontier
+- Formal verification of Lenses
+- Advanced query capabilities
+- Federated learning on distributed datasets
+- Integration with compute-over-data systems
+
+---
+
+## Concrete Recommendations
+
+### 1. **Immediate** (Before Phase 1 Implementation)
+
+- [ ] Define the **NDArray JSON Schema shim** precisely (schema structure, examples)
+- [ ] Spec out the **rkey format** (`{NSID}@{semver}` - what's valid NSID here? full NSID or partial?)
+- [ ] Design the **`getLatestSchema` query Lexicon** (parameters, return type)
+- [ ] Define the **storage union type** (external URL variant vs PDS blob variant)
+
+### 2. **Phase 1-2** (Lexicon + Python Client)
+
+- [ ] Implement **external URLs only** for storage (defer blobs to Phase 3)
+- [ ] Build **NDArray shim library** (serialize/deserialize with metadata)
+- [ ] Create **basic codegen** (Python dataclass ↔ JSON Schema)
+- [ ] Defer **language metadata** on Lenses to Phase 2 (start with just repo reference)
+
+### 3. **Phase 3** (AppView)
+
+- [ ] Implement **hybrid storage support** in AppView
+- [ ] Build **proxy for PDS blobs** (unified WebDataset URL interface)
+- [ ] Add **getLatestSchema endpoint**
+
+### 4. **Phase 4+** (Future Work)
+
+- [ ] Add **language metadata** to Lens records
+- [ ] Design **attestation Lexicon** (separate from Lens records)
+- [ ] Design **verification Lexicon** (trusted DIDs)
+- [ ] Research formal verification feasibility
+
+---
+
+## Summary Assessment
+
+**Grade: A-** (Excellent with caveats)
+
+### Strengths
+- ✅ Leverages existing ecosystems (JSON Schema, GitHub)
+- ✅ Future-proof (extensible via open unions, versioning built-in)
+- ✅ Pragmatic decentralization (hybrid storage)
+- ✅ Innovative versioning (rkey scheme)
+- ✅ Strong security model (multi-layered trust)
+
+### Concerns
+- ⚠️ High implementation complexity (manageable with phasing)
+- ⚠️ JSON Schema for NDArray is a leaky abstraction (acceptable trade-off)
+- ⚠️ Custom rkey convention requires good documentation
+- ⚠️ Attestation system is ambitious (defer to future)
+
+### Overall Assessment
+
+This is a **well-considered architecture** that makes intentional trade-offs. The bet is on ecosystem integration and flexibility over simplicity, which is appropriate for a distributed dataset federation. The key to success will be **disciplined phasing** - implement the core first, add sophistication incrementally.
+
+The decisions form a **coherent whole** where each piece reinforces the others. The versioning scheme, Lenses, and hybrid storage create a system that's greater than the sum of its parts.
+
+**Recommendation**: ✅ **Proceed with these decisions**. Document the NDArray shim and rkey conventions thoroughly, and commit to incremental implementation.
+
+---
+
+## Next Steps
+
+1. Close decision issues #45-49 as decided
+2. Update planning documents with finalized decisions
+3. Proceed to Issue #50 (Lexicon validation) with:
+- NDArray JSON Schema shim definition
+- rkey format specification
+- `getLatestSchema` query Lexicon design
+- Storage union type definition
+4. Begin Phase 1 implementation after validation complete
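
Review note: the assessment names `parseSchemaRkey` and `formatSchemaRkey` helpers and a `{NSID}@{semver}` rkey, but leaves the exact format open. The sketch below is one possible reading, not the project's implementation. The snake_case function names, the `latest_rkey` helper, the restriction to plain `MAJOR.MINOR.PATCH` versions, and the example NSID `com.example.imageSample` are all assumptions made for illustration. The separator lives in a single constant because the format still has to be checked against ATProto's record-key syntax during the spec-out step.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Separator taken from the assessment's `{NSID}@{semver}` convention.
SEPARATOR = "@"

# Assumption: plain MAJOR.MINOR.PATCH only, no prerelease or build metadata.
_SEMVER_RE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


class SchemaRkey(NamedTuple):
    nsid: str
    version: tuple  # (major, minor, patch) as integers


def parse_schema_rkey(rkey: str) -> SchemaRkey:
    """Split an rkey into its NSID part and a numeric version tuple."""
    nsid, sep, semver = rkey.rpartition(SEPARATOR)
    match = _SEMVER_RE.match(semver)
    if not sep or not nsid or match is None:
        raise ValueError(f"not a schema rkey: {rkey!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return SchemaRkey(nsid, (major, minor, patch))


def format_schema_rkey(nsid: str, version: tuple) -> str:
    """Inverse of parse_schema_rkey."""
    major, minor, patch = version
    return f"{nsid}{SEPARATOR}{major}.{minor}.{patch}"


def latest_rkey(rkeys: Iterable[str], nsid: str) -> Optional[str]:
    """Return the rkey with the highest version for `nsid`, or None."""
    matching = [r for r in map(parse_schema_rkey, rkeys) if r.nsid == nsid]
    if not matching:
        return None
    # Tuples compare numerically, so 1.10.0 sorts above 1.2.0.
    newest = max(matching, key=lambda r: r.version)
    return format_schema_rkey(newest.nsid, newest.version)
```

Comparing parsed integer tuples rather than raw strings is the design point here: a string comparison would rank `1.2.0` above `1.10.0`, which is the kind of mistake a `getLatestSchema` query would need to avoid.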