docs: add comprehensive atproto integration planning documentation

+204

.planning/01_overview.md

··· 1 + # ATProto Integration - Overview 2 + 3 + ## Vision 4 + 5 + Transform `atdata` from a local/centralized dataset library into a **distributed dataset federation** built on AT Protocol. Datasets, schemas, and transformations become discoverable, versioned records on the ATProto network, enabling: 6 + 7 + - **Decentralized dataset publishing**: Anyone can publish datasets without centralized infrastructure 8 + - **Schema sharing & reuse**: Sample type definitions become reusable records with automatic code generation 9 + - **Discoverable transformations**: Lens transformations are published as bidirectional mappings between schemas 10 + - **Interoperability**: Different tools and languages can consume the same datasets using generated code 11 + - **Versioning & provenance**: Immutable records provide audit trails for dataset evolution 12 + 13 + ## High-Level Architecture 14 + 15 + ``` 16 + ┌─────────────────────────────────────────────────────────────────┐ 17 + │ AT Protocol Network │ 18 + │ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │ 19 + │ │ Schema Records │ │ Dataset Records │ │ Lens Records │ │ 20 + │ │ (Lexicon) │ │ (Lexicon) │ │ (Lexicon) │ │ 21 + │ └──────────────────┘ └──────────────────┘ └───────────────┘ │ 22 + │ ▲ ▲ ▲ │ 23 + │ │ │ │ │ 24 + └─────────┼──────────────────────┼─────────────────────┼──────────┘ 25 + │ │ │ 26 + │ publish/query │ │ 27 + │ │ │ 28 + ┌─────┴──────────────────────┴─────────────────────┴─────┐ 29 + │ Python Client Library (atdata) │ 30 + │ │ 31 + │ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │ 32 + │ │ ATProto │ │ Schema │ │ Dataset │ │ 33 + │ │ Auth │ │ Publisher │ │ Loader │ │ 34 + │ └────────────┘ └────────────┘ └──────────────────┘ │ 35 + │ │ 36 + │ Existing: │ 37 + │ - PackableSample, Dataset, Lens │ 38 + │ - WebDataset integration │ 39 + └──────────────────────────────────────────────────────────┘ 40 + │ 41 + │ queries (optional) 42 + ▼ 43 + ┌─────────────────────┐ 44 + │ AppView Service │ 45 + │ (Index Aggregator) │ 46 + │ │ 47 + │ - Fast search │ 48 + │ - Schema browser │ 49 + │ - Metadata cache │ 50 + └─────────────────────┘ 51 + ``` 52 + 53 + ## Core Concepts 54 + 55 + ### 1. Schema Records (PackableSample definitions) 56 + 57 + Published ATProto records containing: 58 + - Field names and types (with special handling for NDArray) 59 + - Serialization metadata 60 + - Version information 61 + - Author/provenance 62 + 63 + These become the **source of truth** for sample types across the network. 64 + 65 + ### 2. Dataset Index Records 66 + 67 + Published ATProto records containing: 68 + - Reference to schema record (the sample type) 69 + - WebDataset URL(s) using brace notation (e.g., `s3://bucket/data-{000000..000099}.tar`) 70 + - Msgpack-encoded metadata (arbitrary key-value pairs) 71 + - Dataset description, tags, author 72 + 73 + Users discover datasets by querying these records, then load them using existing `Dataset` class. 74 + 75 + ### 3. Lens Transformation Records 76 + 77 + Published ATProto records containing: 78 + - Source schema reference 79 + - Target schema reference 80 + - Transformation code (or reference to code) 81 + - Bidirectional mapping metadata (getter/putter) 82 + 83 + Enables building a **network of transformations** between schemas. 84 + 85 + ## Integration with Existing `atdata` 86 + 87 + The ATProto integration is **additive**: 88 + 89 + 1. **Existing functionality unchanged**: `PackableSample`, `Dataset`, `Lens` continue to work as-is 90 + 2. **New methods added**: 91 + - `sample_type.publish_to_atproto(client)` - Publish schema 92 + - `dataset.publish_to_atproto(client)` - Publish index record 93 + - `Dataset.from_atproto(client, record_uri)` - Load from published record 94 + - `lens.publish_to_atproto(client)` - Publish transformation 95 + 3. **Optional AppView**: Query service for faster discovery (like Bluesky's AppView) 96 + 97 + ## Development Phases 98 + 99 + ### Phase 1: Lexicon Design (Issues #17, #22-25) 100 + - Design three Lexicon schemas (sample, dataset, lens) 101 + - Evaluate schema representation formats 102 + - Create reference documentation 103 + 104 + **Deliverable**: Lexicon JSON definitions ready for use 105 + 106 + ### Phase 2: Python Client Library (Issues #18, #26-31) 107 + - ATProto SDK integration (auth, session management) 108 + - Publishing implementations for all three record types 109 + - Query/discovery functionality 110 + - Extend `Dataset` class with `from_atproto()` method 111 + 112 + **Deliverable**: Working Python library that can publish/load from ATProto 113 + 114 + ### Phase 3: AppView Service (Issues #19, #32-35) 115 + - Optional aggregation service 116 + - Firehose ingestion 117 + - Search/query API 118 + - Performance optimization 119 + 120 + **Deliverable**: Hosted service for fast dataset discovery 121 + 122 + ### Phase 4: Code Generation (Issues #20, #36-39) 123 + - Template system for Python codegen 124 + - CLI tool for generating classes from schema records 125 + - Type validation and compatibility checking 126 + 127 + **Deliverable**: Tool to generate Python code from published schemas 128 + 129 + ### Phase 5: Integration & Testing (Issues #21, #40-43) 130 + - End-to-end workflows and examples 131 + - Integration test suite 132 + - Documentation and guides 133 + - Performance benchmarks 134 + 135 + **Deliverable**: Production-ready feature with complete documentation 136 + 137 + ## Open Design Questions 138 + 139 + ### Schema Representation Format 140 + **Question**: How should we represent `PackableSample` schemas in Lexicon records? 141 + 142 + **Options**: 143 + 1. **JSON Schema** - Standard, well-supported, validation tools exist 144 + 2. **Protobuf** - Compact, has codegen ecosystem, good for cross-language 145 + 3. **Custom format** - Tailored to `PackableSample` specifics (NDArray handling, msgpack serialization) 146 + 147 + **Considerations**: 148 + - Need to represent `NDArray` types specially (dtype, shape constraints?) 149 + - Should support future extensions (constraints, validation rules) 150 + - Must be human-readable and machine-processable 151 + - Codegen tooling needs to parse it 152 + 153 + **Decision needed**: See Issue #25 154 + 155 + ### WebDataset Storage Location 156 + **Question**: Should actual WebDataset `.tar` files be stored on ATProto, or just references to external storage? 157 + 158 + **Current approach**: References only (S3, HTTP URLs, etc.) 159 + - Pros: No storage limits, existing infrastructure works 160 + - Cons: Centralization risk if datasets disappear 161 + 162 + **Future consideration**: ATProto blob storage for datasets 163 + - Pros: Truly decentralized 164 + - Cons: Storage costs, size limits, performance 165 + 166 + ### Lens Code Storage 167 + **Question**: How should Lens transformation code be stored? 168 + 169 + **Options**: 170 + 1. Python code as string in record (security concerns!) 171 + 2. Reference to GitHub/GitLab repo + commit hash 172 + 3. Bytecode or AST representation 173 + 4. Only store metadata, expect manual implementation 174 + 175 + **Decision needed**: See Phase 1 planning 176 + 177 + ## Success Metrics 178 + 179 + - **Functionality**: Can publish schema, publish dataset, discover, load end-to-end 180 + - **Performance**: Dataset discovery <100ms (with AppView), load time unchanged 181 + - **Adoption**: Easy enough that external users publish datasets 182 + - **Interop**: Schema records usable from other languages (future) 183 + 184 + ## Timeline & Dependencies 185 + 186 + ``` 187 + Phase 1 (Lexicon Design) 188 + ↓ 189 + Phase 2 (Python Client) ← CRITICAL PATH 190 + ↓ 191 + ├── Phase 3 (AppView) [parallel, optional] 192 + └── Phase 4 (Codegen) [parallel] 193 + ↓ 194 + Phase 5 (Integration & Testing) 195 + ``` 196 + 197 + Phase 2 is the critical path. Phases 3 & 4 can proceed in parallel once Phase 2 foundations are in place. 198 + 199 + ## Related Documents 200 + 201 + - `02_lexicon_design.md` - Detailed Lexicon schema specifications 202 + - `03_python_client.md` - Python library architecture and API design 203 + - `04_appview.md` - AppView service architecture 204 + - `05_codegen.md` - Code generation approach and templates

+576

.planning/02_lexicon_design.md

··· 1 + # Lexicon Design for ATProto Integration 2 + 3 + ## Overview 4 + 5 + This document specifies the three Lexicon schemas needed for `atdata` ATProto integration: 6 + 7 + 1. **Schema Record** (`app.bsky.atdata.schema`) - Defines PackableSample types 8 + 2. **Dataset Record** (`app.bsky.atdata.dataset`) - Index records pointing to WebDataset files 9 + 3. **Lens Record** (`app.bsky.atdata.lens`) - Transformation mappings between schemas 10 + 11 + ## Design Principles 12 + 13 + - **Self-describing**: Records contain all necessary metadata 14 + - **Versioned**: Schema evolution supported through versioning 15 + - **Lightweight**: Minimal overhead, fast to parse 16 + - **Extensible**: Future additions don't break existing records 17 + - **Language-agnostic**: Usable from Python, TypeScript, Rust, etc. 18 + 19 + ## 1. Schema Record Lexicon 20 + 21 + **NSID**: `app.bsky.atdata.schema` (tentative namespace) 22 + 23 + **Purpose**: Define a reusable PackableSample type that can be instantiated via codegen 24 + 25 + ### Proposed Structure 26 + 27 + ```json 28 + { 29 + "lexicon": 1, 30 + "id": "app.bsky.atdata.schema", 31 + "defs": { 32 + "main": { 33 + "type": "record", 34 + "description": "Definition of a PackableSample-compatible sample type", 35 + "key": "tid", 36 + "record": { 37 + "type": "object", 38 + "required": ["name", "version", "fields", "createdAt"], 39 + "properties": { 40 + "name": { 41 + "type": "string", 42 + "description": "Human-readable name for this sample type", 43 + "maxLength": 100 44 + }, 45 + "version": { 46 + "type": "string", 47 + "description": "Semantic version (e.g., '1.0.0')", 48 + "maxLength": 20 49 + }, 50 + "description": { 51 + "type": "string", 52 + "description": "Human-readable description", 53 + "maxLength": 1000 54 + }, 55 + "fields": { 56 + "type": "array", 57 + "description": "List of fields in this sample type", 58 + "items": { 59 + "type": "ref", 60 + "ref": "#field" 61 + } 62 + }, 63 + "metadata": { 64 + "type": "object", 65 + "description": "Arbitrary metadata (author, license, etc.)" 66 + }, 67 + "createdAt": { 68 + "type": "string", 69 + "format": "datetime" 70 + } 71 + } 72 + } 73 + }, 74 + "field": { 75 + "type": "object", 76 + "description": "A field within a sample type", 77 + "required": ["name", "type"], 78 + "properties": { 79 + "name": { 80 + "type": "string", 81 + "description": "Field name (Python identifier)", 82 + "maxLength": 100 83 + }, 84 + "type": { 85 + "type": "ref", 86 + "ref": "#fieldType" 87 + }, 88 + "optional": { 89 + "type": "boolean", 90 + "description": "Whether field can be None", 91 + "default": false 92 + }, 93 + "description": { 94 + "type": "string", 95 + "description": "Field documentation", 96 + "maxLength": 500 97 + } 98 + } 99 + }, 100 + "fieldType": { 101 + "type": "union", 102 + "refs": [ 103 + "#primitiveType", 104 + "#arrayType", 105 + "#nestedType" 106 + ] 107 + }, 108 + "primitiveType": { 109 + "type": "object", 110 + "required": ["kind", "primitive"], 111 + "properties": { 112 + "kind": { 113 + "type": "string", 114 + "const": "primitive" 115 + }, 116 + "primitive": { 117 + "type": "string", 118 + "enum": ["str", "int", "float", "bool", "bytes"] 119 + } 120 + } 121 + }, 122 + "arrayType": { 123 + "type": "object", 124 + "required": ["kind", "dtype"], 125 + "properties": { 126 + "kind": { 127 + "type": "string", 128 + "const": "ndarray" 129 + }, 130 + "dtype": { 131 + "type": "string", 132 + "description": "Numpy dtype string (e.g., 'float32', 'uint8')", 133 + "maxLength": 20 134 + }, 135 + "shape": { 136 + "type": "array", 137 + "description": "Optional shape constraint (null for dynamic dimensions)", 138 + "items": { 139 + "type": "integer" 140 + } 141 + } 142 + } 143 + }, 144 + "nestedType": { 145 + "type": "object", 146 + "required": ["kind", "schemaRef"], 147 + "properties": { 148 + "kind": { 149 + "type": "string", 150 + "const": "nested" 151 + }, 152 + "schemaRef": { 153 + "type": "string", 154 + "description": "AT-URI reference to another schema record" 155 + } 156 + } 157 + } 158 + } 159 + } 160 + ``` 161 + 162 + ### Example Schema Record 163 + 164 + ```json 165 + { 166 + "$type": "app.bsky.atdata.schema", 167 + "name": "ImageSample", 168 + "version": "1.0.0", 169 + "description": "Sample containing an image with label", 170 + "fields": [ 171 + { 172 + "name": "image", 173 + "type": { 174 + "kind": "ndarray", 175 + "dtype": "uint8", 176 + "shape": [null, null, 3] 177 + }, 178 + "description": "RGB image with variable height/width" 179 + }, 180 + { 181 + "name": "label", 182 + "type": { 183 + "kind": "primitive", 184 + "primitive": "str" 185 + }, 186 + "description": "Human-readable label" 187 + }, 188 + { 189 + "name": "confidence", 190 + "type": { 191 + "kind": "primitive", 192 + "primitive": "float" 193 + }, 194 + "optional": true, 195 + "description": "Optional confidence score" 196 + } 197 + ], 198 + "metadata": { 199 + "author": "alice.bsky.social", 200 + "license": "MIT" 201 + }, 202 + "createdAt": "2025-01-06T12:00:00Z" 203 + } 204 + ``` 205 + 206 + ### Design Questions 207 + 208 + 1. **Shape constraints**: Should we enforce shape constraints, or just document them? 209 + - Option A: Runtime validation against shape 210 + - Option B: Documentation only, actual shapes can vary 211 + - **Recommendation**: Documentation only initially, validation in future versions 212 + 213 + 2. **Custom types**: Should we support custom serialization hooks? 214 + - Current approach: Only primitive + NDArray 215 + - Future: Allow references to custom serialization functions? 216 + 217 + 3. **Schema inheritance**: Should schemas support inheritance/composition? 218 + - Could reference parent schema and add fields 219 + - **Defer to future version** 220 + 221 + ## 2. Dataset Record Lexicon 222 + 223 + **NSID**: `app.bsky.atdata.dataset` 224 + 225 + **Purpose**: Index record pointing to WebDataset files with associated metadata 226 + 227 + ### Proposed Structure 228 + 229 + ```json 230 + { 231 + "lexicon": 1, 232 + "id": "app.bsky.atdata.dataset", 233 + "defs": { 234 + "main": { 235 + "type": "record", 236 + "description": "Index record for a WebDataset-backed dataset", 237 + "key": "tid", 238 + "record": { 239 + "type": "object", 240 + "required": ["name", "schemaRef", "urls", "createdAt"], 241 + "properties": { 242 + "name": { 243 + "type": "string", 244 + "description": "Human-readable dataset name", 245 + "maxLength": 200 246 + }, 247 + "schemaRef": { 248 + "type": "string", 249 + "description": "AT-URI reference to the schema record for this dataset's samples" 250 + }, 251 + "urls": { 252 + "type": "array", 253 + "description": "WebDataset URLs (supports brace notation)", 254 + "items": { 255 + "type": "string", 256 + "format": "uri", 257 + "maxLength": 1000 258 + }, 259 + "minLength": 1 260 + }, 261 + "description": { 262 + "type": "string", 263 + "description": "Human-readable description", 264 + "maxLength": 5000 265 + }, 266 + "metadata": { 267 + "type": "bytes", 268 + "description": "Msgpack-encoded metadata dict", 269 + "maxLength": 100000 270 + }, 271 + "tags": { 272 + "type": "array", 273 + "description": "Searchable tags", 274 + "items": { 275 + "type": "string", 276 + "maxLength": 50 277 + }, 278 + "maxLength": 20 279 + }, 280 + "size": { 281 + "type": "object", 282 + "description": "Dataset size information", 283 + "properties": { 284 + "samples": { 285 + "type": "integer", 286 + "description": "Total number of samples" 287 + }, 288 + "bytes": { 289 + "type": "integer", 290 + "description": "Total size in bytes" 291 + } 292 + } 293 + }, 294 + "license": { 295 + "type": "string", 296 + "description": "License (SPDX identifier preferred)", 297 + "maxLength": 100 298 + }, 299 + "createdAt": { 300 + "type": "string", 301 + "format": "datetime" 302 + } 303 + } 304 + } 305 + } 306 + } 307 + } 308 + ``` 309 + 310 + ### Example Dataset Record 311 + 312 + ```json 313 + { 314 + "$type": "app.bsky.atdata.dataset", 315 + "name": "CIFAR-10 Training Set", 316 + "schemaRef": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 317 + "urls": [ 318 + "s3://my-bucket/cifar10-train-{000000..000049}.tar" 319 + ], 320 + "description": "CIFAR-10 training images (50,000 samples) stored as WebDataset shards", 321 + "metadata": "<msgpack bytes>", 322 + "tags": ["computer-vision", "classification", "cifar10"], 323 + "size": { 324 + "samples": 50000, 325 + "bytes": 178456789 326 + }, 327 + "license": "MIT", 328 + "createdAt": "2025-01-06T12:00:00Z" 329 + } 330 + ``` 331 + 332 + ### Design Questions 333 + 334 + 1. **WebDataset storage**: Where are the actual `.tar` files? 335 + - Phase 1: External storage (S3, HTTP, etc.) - just store URLs 336 + - Future: Could use ATProto blob storage for smaller datasets 337 + - **Recommendation**: External only for now 338 + 339 + 2. **Metadata size limit**: What's reasonable for msgpack metadata? 340 + - Could store large metadata as separate blob 341 + - **Recommendation**: 100KB limit, use blob for larger 342 + 343 + 3. **Versioning**: Should we support dataset versioning? 344 + - Could link to previous version 345 + - **Defer to future version** 346 + 347 + ## 3. Lens Record Lexicon 348 + 349 + **NSID**: `app.bsky.atdata.lens` 350 + 351 + **Purpose**: Define bidirectional transformations between sample types 352 + 353 + ### Proposed Structure 354 + 355 + ```json 356 + { 357 + "lexicon": 1, 358 + "id": "app.bsky.atdata.lens", 359 + "defs": { 360 + "main": { 361 + "type": "record", 362 + "description": "Bidirectional transformation between two sample types", 363 + "key": "tid", 364 + "record": { 365 + "type": "object", 366 + "required": ["name", "sourceSchema", "targetSchema", "createdAt"], 367 + "properties": { 368 + "name": { 369 + "type": "string", 370 + "description": "Human-readable lens name", 371 + "maxLength": 100 372 + }, 373 + "sourceSchema": { 374 + "type": "string", 375 + "description": "AT-URI reference to source schema" 376 + }, 377 + "targetSchema": { 378 + "type": "string", 379 + "description": "AT-URI reference to target schema" 380 + }, 381 + "description": { 382 + "type": "string", 383 + "description": "What this transformation does", 384 + "maxLength": 1000 385 + }, 386 + "getterCode": { 387 + "type": "ref", 388 + "ref": "#transformCode" 389 + }, 390 + "putterCode": { 391 + "type": "ref", 392 + "ref": "#transformCode" 393 + }, 394 + "metadata": { 395 + "type": "object", 396 + "description": "Arbitrary metadata" 397 + }, 398 + "createdAt": { 399 + "type": "string", 400 + "format": "datetime" 401 + } 402 + } 403 + } 404 + }, 405 + "transformCode": { 406 + "type": "union", 407 + "refs": [ 408 + "#pythonCode", 409 + "#codeReference" 410 + ] 411 + }, 412 + "pythonCode": { 413 + "type": "object", 414 + "required": ["kind", "source"], 415 + "properties": { 416 + "kind": { 417 + "type": "string", 418 + "const": "python" 419 + }, 420 + "source": { 421 + "type": "string", 422 + "description": "Python function source code", 423 + "maxLength": 50000 424 + } 425 + } 426 + }, 427 + "codeReference": { 428 + "type": "object", 429 + "required": ["kind", "repository", "path"], 430 + "properties": { 431 + "kind": { 432 + "type": "string", 433 + "const": "reference" 434 + }, 435 + "repository": { 436 + "type": "string", 437 + "description": "Git repository URL", 438 + "maxLength": 500 439 + }, 440 + "commit": { 441 + "type": "string", 442 + "description": "Git commit hash", 443 + "maxLength": 40 444 + }, 445 + "path": { 446 + "type": "string", 447 + "description": "Path to function within repo", 448 + "maxLength": 500 449 + } 450 + } 451 + } 452 + } 453 + } 454 + ``` 455 + 456 + ### Example Lens Record 457 + 458 + ```json 459 + { 460 + "$type": "app.bsky.atdata.lens", 461 + "name": "image_to_grayscale", 462 + "sourceSchema": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 463 + "targetSchema": "at://did:plc:def456/app.bsky.atdata.schema/7mn8op56pqr", 464 + "description": "Convert RGB images to grayscale", 465 + "getterCode": { 466 + "kind": "reference", 467 + "repository": "https://github.com/alice/lenses", 468 + "commit": "a1b2c3d4e5f6", 469 + "path": "lenses/vision.py:image_to_grayscale" 470 + }, 471 + "putterCode": { 472 + "kind": "reference", 473 + "repository": "https://github.com/alice/lenses", 474 + "commit": "a1b2c3d4e5f6", 475 + "path": "lenses/vision.py:grayscale_to_image" 476 + }, 477 + "metadata": { 478 + "author": "alice.bsky.social" 479 + }, 480 + "createdAt": "2025-01-06T12:00:00Z" 481 + } 482 + ``` 483 + 484 + ### Design Questions - CRITICAL 485 + 486 + 1. **Code storage security**: Storing executable code is dangerous! 487 + - **Option A**: Code reference only (GitHub + commit hash) - safer 488 + - **Option B**: Allow inline code but require manual approval - flexible 489 + - **Option C**: AST/bytecode representation - complex 490 + - **Recommendation**: Start with references only (Option A), defer inline code 491 + 492 + 2. **Lens verification**: How to verify well-behavedness? 493 + - Could store test cases 494 + - Could require proof of GetPut/PutGet laws 495 + - **Defer to future** 496 + 497 + 3. **Lens composition**: Should lenses be composable? 498 + - Network could auto-compose transformations 499 + - **Defer to future** 500 + 501 + ## Schema Representation Format Decision 502 + 503 + **Question**: What format should we use to represent field types internally? 504 + 505 + ### Option 1: JSON Schema 506 + **Pros**: 507 + - Standard, widely supported 508 + - Validation tooling exists 509 + - Human-readable 510 + 511 + **Cons**: 512 + - Not designed for codegen 513 + - NDArray representation awkward 514 + - Overly complex for our needs 515 + 516 + ### Option 2: Protobuf 517 + **Pros**: 518 + - Designed for codegen 519 + - Compact binary format 520 + - Cross-language support excellent 521 + 522 + **Cons**: 523 + - Not ATProto-native 524 + - Requires compilation step 525 + - Less human-readable 526 + 527 + ### Option 3: Custom Format (as shown above) 528 + **Pros**: 529 + - Tailored exactly to PackableSample needs 530 + - Native ATProto Lexicon 531 + - Clean NDArray representation 532 + - Easy to extend 533 + 534 + **Cons**: 535 + - Need to write our own codegen 536 + - Less ecosystem tooling 537 + 538 + ### Recommendation: Option 3 (Custom Format) 539 + 540 + **Rationale**: 541 + 1. PackableSample has specific needs (NDArray, msgpack serialization) 542 + 2. ATProto Lexicon provides all the structure we need 543 + 3. Writing our own codegen gives us full control 544 + 4. Can still use JSON Schema for validation if needed 545 + 546 + The proposed Lexicon structure above uses this approach. 547 + 548 + ## Implementation Checklist (Phase 1) 549 + 550 + - [ ] Finalize Lexicon JSON definitions for all three record types 551 + - [ ] Create reference documentation with examples 552 + - [ ] Decide on schema representation format (recommendation: custom) 553 + - [ ] Resolve open questions (code storage, versioning, etc.) 554 + - [ ] Validate Lexicons against ATProto spec 555 + - [ ] Create example records for testing 556 + 557 + ## Future Extensions 558 + 559 + ### Schema Evolution 560 + - Support schema versioning with migration paths 561 + - Compatibility checking (backward/forward compatible) 562 + 563 + ### Advanced Types 564 + - Generic/parameterized types 565 + - Union types for polymorphic samples 566 + - Schema composition/inheritance 567 + 568 + ### Lens Network 569 + - Automatic lens composition 570 + - Lens verification and testing 571 + - Performance metadata (transformation cost) 572 + 573 + ### Dataset Features 574 + - Dataset splitting (train/val/test) references 575 + - Dataset versioning and diffs 576 + - Access control and permissions

+690

.planning/03_python_client.md

··· 1 + # Python Client Library Architecture 2 + 3 + ## Overview 4 + 5 + This document specifies the Python library extensions to `atdata` for ATProto integration. The goal is to add ATProto publishing and discovery capabilities while maintaining backward compatibility with existing code. 6 + 7 + ## Design Principles 8 + 9 + - **Backward compatible**: Existing code continues to work unchanged 10 + - **Optional integration**: ATProto features are opt-in 11 + - **Pythonic API**: Follows Python conventions and `atdata` style 12 + - **Type-safe**: Full type hints with generics 13 + - **Testable**: Mockable dependencies, unit testable 14 + 15 + ## Module Structure 16 + 17 + ``` 18 + src/atdata/ 19 + __init__.py # Existing exports 20 + dataset.py # Existing Dataset, PackableSample 21 + lens.py # Existing Lens, LensNetwork 22 + _helpers.py # Existing serialization helpers 23 + atproto/ # NEW: ATProto integration 24 + __init__.py # Public API exports 25 + client.py # ATProtoClient for auth/session 26 + schema.py # Schema publishing/loading 27 + dataset.py # Dataset publishing/loading 28 + lens.py # Lens publishing/loading 29 + _lexicon.py # Lexicon record builders 30 + _types.py # Type definitions for records 31 + ``` 32 + 33 + ## Core Components 34 + 35 + ### 1. ATProtoClient - Authentication & Session Management 36 + 37 + **File**: `src/atdata/atproto/client.py` 38 + 39 + ```python 40 + from typing import Optional 41 + from atproto import Client as ATProtoSDKClient 42 + 43 + class ATProtoClient: 44 + """Wrapper around atproto SDK client with atdata-specific helpers.""" 45 + 46 + def __init__(self, client: Optional[ATProtoSDKClient] = None): 47 + """ 48 + Initialize ATProto client. 49 + 50 + Args: 51 + client: Optional pre-configured atproto Client. If None, creates new client. 52 + """ 53 + self._client = client or ATProtoSDKClient() 54 + self._session: Optional[dict] = None 55 + 56 + def login(self, handle: str, password: str) -> None: 57 + """Authenticate with ATProto PDS.""" 58 + self._session = self._client.login(handle, password) 59 + 60 + def login_with_token(self, access_token: str, refresh_token: str) -> None: 61 + """Authenticate using existing tokens.""" 62 + # Implementation 63 + pass 64 + 65 + @property 66 + def is_authenticated(self) -> bool: 67 + """Check if client has valid session.""" 68 + return self._session is not None 69 + 70 + @property 71 + def did(self) -> str: 72 + """Get DID of authenticated user.""" 73 + if not self._session: 74 + raise ValueError("Not authenticated") 75 + return self._session['did'] 76 + 77 + # Low-level record operations 78 + def create_record(self, collection: str, record: dict) -> str: 79 + """Create a record and return its AT-URI.""" 80 + # Implementation using self._client 81 + pass 82 + 83 + def get_record(self, uri: str) -> dict: 84 + """Fetch a record by AT-URI.""" 85 + # Implementation 86 + pass 87 + 88 + def list_records(self, collection: str, did: Optional[str] = None) -> list[dict]: 89 + """List records in a collection.""" 90 + # Implementation 91 + pass 92 + ``` 93 + 94 + **Usage**: 95 + ```python 96 + from atdata.atproto import ATProtoClient 97 + 98 + client = ATProtoClient() 99 + client.login("alice.bsky.social", "password") 100 + ``` 101 + 102 + ### 2. Schema Publishing & Loading 103 + 104 + **File**: `src/atdata/atproto/schema.py` 105 + 106 + ```python 107 + from typing import Type, TypeVar, get_type_hints 108 + from dataclasses import fields, is_dataclass 109 + import atdata 110 + from .client import ATProtoClient 111 + from ._lexicon import build_schema_record 112 + 113 + ST = TypeVar('ST', bound=atdata.PackableSample) 114 + 115 + class SchemaPublisher: 116 + """Handles publishing PackableSample schemas to ATProto.""" 117 + 118 + def __init__(self, client: ATProtoClient): 119 + self.client = client 120 + 121 + def publish_schema( 122 + self, 123 + sample_type: Type[ST], 124 + *, 125 + name: Optional[str] = None, 126 + version: str = "1.0.0", 127 + description: Optional[str] = None, 128 + metadata: Optional[dict] = None 129 + ) -> str: 130 + """ 131 + Publish a PackableSample schema to ATProto. 132 + 133 + Args: 134 + sample_type: The PackableSample class to publish 135 + name: Human-readable name (defaults to class name) 136 + version: Semantic version 137 + description: Human-readable description 138 + metadata: Arbitrary metadata dict 139 + 140 + Returns: 141 + AT-URI of the created schema record 142 + """ 143 + if not self.client.is_authenticated: 144 + raise ValueError("Client must be authenticated") 145 + 146 + # Extract field information from dataclass 147 + schema_record = self._build_schema_record( 148 + sample_type, name, version, description, metadata 149 + ) 150 + 151 + # Publish to ATProto 152 + uri = self.client.create_record("app.bsky.atdata.schema", schema_record) 153 + return uri 154 + 155 + def _build_schema_record( 156 + self, 157 + sample_type: Type[ST], 158 + name: Optional[str], 159 + version: str, 160 + description: Optional[str], 161 + metadata: Optional[dict] 162 + ) -> dict: 163 + """Build schema record dict from PackableSample class.""" 164 + if not is_dataclass(sample_type): 165 + raise ValueError(f"{sample_type} must be a dataclass") 166 + 167 + field_defs = [] 168 + type_hints = get_type_hints(sample_type) 169 + 170 + for field in fields(sample_type): 171 + field_type = type_hints[field.name] 172 + field_def = self._field_to_record(field.name, field_type) 173 + field_defs.append(field_def) 174 + 175 + return { 176 + "$type": "app.bsky.atdata.schema", 177 + "name": name or sample_type.__name__, 178 + "version": version, 179 + "description": description or "", 180 + "fields": field_defs, 181 + "metadata": metadata or {}, 182 + "createdAt": datetime.now(timezone.utc).isoformat() 183 + } 184 + 185 + def _field_to_record(self, name: str, field_type) -> dict: 186 + """Convert Python type annotation to schema field record.""" 187 + # Handle Optional types 188 + is_optional = False 189 + if hasattr(field_type, '__origin__') and field_type.__origin__ is Union: 190 + args = field_type.__args__ 191 + if type(None) in args: 192 + is_optional = True 193 + field_type = next(arg for arg in args if arg is not type(None)) 194 + 195 + # Map Python types to schema types 196 + type_def = self._python_type_to_schema_type(field_type) 197 + 198 + return { 199 + "name": name, 200 + "type": type_def, 201 + "optional": is_optional 202 + } 203 + 204 + def _python_type_to_schema_type(self, python_type) -> dict: 205 + """Map Python type to schema type definition.""" 206 + # Handle primitives 207 + if python_type is str: 208 + return {"kind": "primitive", "primitive": "str"} 209 + elif python_type is int: 210 + return {"kind": "primitive", "primitive": "int"} 211 + elif python_type is float: 212 + return {"kind": "primitive", "primitive": "float"} 213 + elif python_type is bool: 214 + return {"kind": "primitive", "primitive": "bool"} 215 + elif python_type is bytes: 216 + return {"kind": "primitive", "primitive": "bytes"} 217 + 218 + # Handle NDArray - this is the key special case 219 + # In atdata, NDArray is used as a type annotation 220 + if hasattr(python_type, '__origin__'): 221 + origin = python_type.__origin__ 222 + if origin.__name__ == 'NDArray' or str(origin) == 'numpy.ndarray': 223 + # Extract dtype from annotation if available 224 + # For now, default to float32 225 + return { 226 + "kind": "ndarray", 227 + "dtype": "float32", # TODO: extract from annotation 228 + "shape": None 229 + } 230 + 231 + # If it's another PackableSample, create nested reference 232 + if is_dataclass(python_type) and issubclass(python_type, atdata.PackableSample): 233 + # This would require publishing the nested type first 234 + raise NotImplementedError("Nested PackableSample types not yet supported") 235 + 236 + raise ValueError(f"Unsupported type: {python_type}") 237 + 238 + class SchemaLoader: 239 + """Handles loading PackableSample schemas from ATProto.""" 240 + 241 + def __init__(self, client: ATProtoClient): 242 + self.client = client 243 + 244 + def get_schema(self, uri: str) -> dict: 245 + """Fetch a schema record by AT-URI.""" 246 + record = self.client.get_record(uri) 247 + if record.get('$type') != 'app.bsky.atdata.schema': 248 + raise ValueError(f"Record at {uri} is not a schema record") 249 + return record 250 + 251 + def list_schemas(self, did: Optional[str] = None) -> list[dict]: 252 + """List available schema records.""" 253 + return self.client.list_records("app.bsky.atdata.schema", did) 254 + ``` 255 + 256 + **Usage**: 257 + ```python 258 + from atdata.atproto import ATProtoClient, SchemaPublisher 259 + 260 + @atdata.packable 261 + class MySample: 262 + image: NDArray 263 + label: str 264 + 265 + client = ATProtoClient() 266 + client.login("alice.bsky.social", "password") 267 + 268 + publisher = SchemaPublisher(client) 269 + schema_uri = publisher.publish_schema( 270 + MySample, 271 + description="My sample type", 272 + version="1.0.0" 273 + ) 274 + print(f"Published schema at {schema_uri}") 275 + ``` 276 + 277 + ### 3. Dataset Publishing & Loading 278 + 279 + **File**: `src/atdata/atproto/dataset.py` 280 + 281 + ```python 282 + from typing import Type, TypeVar, Optional 283 + import msgpack 284 + import atdata 285 + from .client import ATProtoClient 286 + from .schema import SchemaPublisher 287 + 288 + ST = TypeVar('ST', bound=atdata.PackableSample) 289 + 290 + class DatasetPublisher: 291 + """Handles publishing Dataset index records to ATProto.""" 292 + 293 + def __init__(self, client: ATProtoClient): 294 + self.client = client 295 + self.schema_publisher = SchemaPublisher(client) 296 + 297 + def publish_dataset( 298 + self, 299 + dataset: atdata.Dataset[ST], 300 + *, 301 + name: str, 302 + schema_uri: Optional[str] = None, 303 + description: Optional[str] = None, 304 + tags: Optional[list[str]] = None, 305 + license: Optional[str] = None, 306 + auto_publish_schema: bool = True 307 + ) -> str: 308 + """ 309 + Publish a dataset index record to ATProto. 310 + 311 + Args: 312 + dataset: The Dataset to publish 313 + name: Human-readable dataset name 314 + schema_uri: AT-URI of the schema record (required if auto_publish_schema=False) 315 + description: Human-readable description 316 + tags: Searchable tags 317 + license: License identifier (SPDX preferred) 318 + auto_publish_schema: If True and schema_uri not provided, publish schema automatically 319 + 320 + Returns: 321 + AT-URI of the created dataset record 322 + """ 323 + if not self.client.is_authenticated: 324 + raise ValueError("Client must be authenticated") 325 + 326 + # Ensure schema is published 327 + if schema_uri is None: 328 + if not auto_publish_schema: 329 + raise ValueError("schema_uri required when auto_publish_schema=False") 330 + schema_uri = self.schema_publisher.publish_schema(dataset.sample_type) 331 + 332 + # Build dataset record 333 + dataset_record = { 334 + "$type": "app.bsky.atdata.dataset", 335 + "name": name, 336 + "schemaRef": schema_uri, 337 + "urls": [dataset.url], # Single URL for now 338 + "description": description or "", 339 + "metadata": msgpack.packb(dataset.metadata), 340 + "tags": tags or [], 341 + "license": license or "", 342 + "createdAt": datetime.now(timezone.utc).isoformat() 343 + } 344 + 345 + # Add size information if available 346 + # (would need to iterate dataset or have metadata about size) 347 + 348 + # Publish to ATProto 349 + uri = self.client.create_record("app.bsky.atdata.dataset", dataset_record) 350 + return uri 351 + 352 + class DatasetLoader: 353 + """Handles loading Datasets from ATProto records.""" 354 + 355 + def __init__(self, client: ATProtoClient): 356 + self.client = client 357 + 358 + def load_dataset(self, uri: str) -> atdata.Dataset: 359 + """ 360 + Load a Dataset from an ATProto record. 361 + 362 + Args: 363 + uri: AT-URI of the dataset record 364 + 365 + Returns: 366 + Dataset instance configured from the record 367 + """ 368 + # Fetch the dataset record 369 + record = self.client.get_record(uri) 370 + if record.get('$type') != 'app.bsky.atdata.dataset': 371 + raise ValueError(f"Record at {uri} is not a dataset record") 372 + 373 + # For now, we still need the Python class for the sample type 374 + # In the future, this could use codegen 375 + # TODO: Implement dynamic type loading via codegen 376 + 377 + # Extract URLs and metadata 378 + urls = record['urls'] 379 + metadata = msgpack.unpackb(record.get('metadata', b'')) 380 + 381 + # We need the schema to instantiate the Dataset with correct type 382 + # This is a limitation - we need codegen to create the type dynamically 383 + # For now, raise an error 384 + raise NotImplementedError( 385 + "Loading datasets requires code generation to instantiate sample types. " 386 + f"Schema URI: {record['schemaRef']}\n" 387 + "Use the codegen tool to generate the Python class first." 388 + ) 389 + 390 + def list_datasets(self, did: Optional[str] = None) -> list[dict]: 391 + """List available dataset records.""" 392 + return self.client.list_records("app.bsky.atdata.dataset", did) 393 + 394 + def search_datasets(self, tags: Optional[list[str]] = None, query: Optional[str] = None) -> list[dict]: 395 + """ 396 + Search for datasets. 397 + 398 + Args: 399 + tags: Filter by tags 400 + query: Text search query 401 + 402 + Returns: 403 + List of matching dataset records 404 + """ 405 + # This would use AppView in production 406 + # For now, fetch all and filter client-side 407 + all_datasets = self.list_records("app.bsky.atdata.dataset") 408 + 409 + filtered = all_datasets 410 + if tags: 411 + filtered = [d for d in filtered if any(t in d.get('tags', []) for t in tags)] 412 + if query: 413 + filtered = [d for d in filtered if query.lower() in d.get('name', '').lower() or 414 + query.lower() in d.get('description', '').lower()] 415 + 416 + return filtered 417 + ``` 418 + 419 + **Usage**: 420 + ```python 421 + from atdata.atproto import ATProtoClient, DatasetPublisher 422 + 423 + # Create dataset 424 + dataset = atdata.Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar") 425 + 426 + # Publish 427 + client = ATProtoClient() 428 + client.login("alice.bsky.social", "password") 429 + 430 + publisher = DatasetPublisher(client) 431 + dataset_uri = publisher.publish_dataset( 432 + dataset, 433 + name="My Training Data", 434 + description="Training data for my model", 435 + tags=["computer-vision", "training"], 436 + license="MIT" 437 + ) 438 + print(f"Published dataset at {dataset_uri}") 439 + ``` 440 + 441 + ### 4. Lens Publishing 442 + 443 + **File**: `src/atdata/atproto/lens.py` 444 + 445 + ```python 446 + from typing import Callable, Optional 447 + import inspect 448 + from .client import ATProtoClient 449 + 450 + class LensPublisher: 451 + """Handles publishing Lens transformations to ATProto.""" 452 + 453 + def __init__(self, client: ATProtoClient): 454 + self.client = client 455 + 456 + def publish_lens( 457 + self, 458 + lens_getter: Callable, 459 + lens_putter: Callable, 460 + *, 461 + name: str, 462 + source_schema_uri: str, 463 + target_schema_uri: str, 464 + description: Optional[str] = None, 465 + code_repository: Optional[str] = None, 466 + code_commit: Optional[str] = None 467 + ) -> str: 468 + """ 469 + Publish a Lens transformation to ATProto. 470 + 471 + Args: 472 + lens_getter: The getter function (Source -> Target) 473 + lens_putter: The putter function (Target, Source -> Source) 474 + name: Human-readable lens name 475 + source_schema_uri: AT-URI of source schema 476 + target_schema_uri: AT-URI of target schema 477 + description: What this transformation does 478 + code_repository: Git repository URL 479 + code_commit: Git commit hash 480 + 481 + Returns: 482 + AT-URI of the created lens record 483 + """ 484 + if not self.client.is_authenticated: 485 + raise ValueError("Client must be authenticated") 486 + 487 + # Build lens record 488 + lens_record = { 489 + "$type": "app.bsky.atdata.lens", 490 + "name": name, 491 + "sourceSchema": source_schema_uri, 492 + "targetSchema": target_schema_uri, 493 + "description": description or "", 494 + "createdAt": datetime.now(timezone.utc).isoformat() 495 + } 496 + 497 + # Add code references 498 + if code_repository and code_commit: 499 + getter_name = lens_getter.__name__ 500 + putter_name = lens_putter.__name__ 501 + 502 + lens_record["getterCode"] = { 503 + "kind": "reference", 504 + "repository": code_repository, 505 + "commit": code_commit, 506 + "path": f"{getter_name}" # Simplified - would need module path 507 + } 508 + lens_record["putterCode"] = { 509 + "kind": "reference", 510 + "repository": code_repository, 511 + "commit": code_commit, 512 + "path": f"{putter_name}" 513 + } 514 + else: 515 + # For initial version, we could store source code directly 516 + # But this is DANGEROUS - security review required 517 + raise NotImplementedError( 518 + "Inline code storage not yet supported. " 519 + "Please provide code_repository and code_commit." 520 + ) 521 + 522 + # Publish to ATProto 523 + uri = self.client.create_record("app.bsky.atdata.lens", lens_record) 524 + return uri 525 + ``` 526 + 527 + ## Extension to Existing Classes 528 + 529 + ### Adding ATProto methods to Dataset 530 + 531 + **Approach**: Add methods directly to `Dataset` class in `src/atdata/dataset.py` 532 + 533 + ```python 534 + class Dataset[ST: PackableSample]: 535 + # ... existing implementation ... 536 + 537 + def publish_to_atproto( 538 + self, 539 + client: 'ATProtoClient', # Forward reference to avoid circular import 540 + *, 541 + name: str, 542 + **kwargs 543 + ) -> str: 544 + """ 545 + Publish this dataset to ATProto. 546 + 547 + This is a convenience method that wraps DatasetPublisher. 548 + """ 549 + from .atproto import DatasetPublisher 550 + publisher = DatasetPublisher(client) 551 + return publisher.publish_dataset(self, name=name, **kwargs) 552 + 553 + @classmethod 554 + def from_atproto( 555 + cls, 556 + client: 'ATProtoClient', 557 + uri: str 558 + ) -> 'Dataset': 559 + """ 560 + Load a dataset from an ATProto record. 561 + 562 + Note: This requires the sample type to be available in Python. 563 + Use codegen to generate types from schema records. 564 + """ 565 + from .atproto import DatasetLoader 566 + loader = DatasetLoader(client) 567 + return loader.load_dataset(uri) 568 + ``` 569 + 570 + **Usage**: 571 + ```python 572 + # Publishing 573 + dataset = atdata.Dataset[MySample](url="s3://...") 574 + uri = dataset.publish_to_atproto(client, name="My Dataset") 575 + 576 + # Loading (future, requires codegen) 577 + dataset = atdata.Dataset.from_atproto(client, uri) 578 + ``` 579 + 580 + ## Public API Exports 581 + 582 + **File**: `src/atdata/atproto/__init__.py` 583 + 584 + ```python 585 + from .client import ATProtoClient 586 + from .schema import SchemaPublisher, SchemaLoader 587 + from .dataset import DatasetPublisher, DatasetLoader 588 + from .lens import LensPublisher 589 + 590 + __all__ = [ 591 + "ATProtoClient", 592 + "SchemaPublisher", 593 + "SchemaLoader", 594 + "DatasetPublisher", 595 + "DatasetLoader", 596 + "LensPublisher", 597 + ] 598 + ``` 599 + 600 + ## Testing Strategy 601 + 602 + ### Unit Tests 603 + - Mock `ATProtoClient` to avoid network calls 604 + - Test schema record building from various PackableSample types 605 + - Test error handling (auth failures, invalid types, etc.) 606 + 607 + ### Integration Tests 608 + - Use ATProto test server or sandbox 609 + - Test full publish/query cycle 610 + - Verify record structure matches Lexicon 611 + 612 + ### Example Test 613 + ```python 614 + import pytest 615 + from unittest.mock import Mock 616 + import atdata 617 + from atdata.atproto import SchemaPublisher 618 + 619 + @atdata.packable 620 + class TestSample: 621 + field1: str 622 + field2: int 623 + 624 + def test_schema_publisher(): 625 + # Mock client 626 + mock_client = Mock() 627 + mock_client.is_authenticated = True 628 + mock_client.create_record = Mock(return_value="at://did:example/app.bsky.atdata.schema/abc123") 629 + 630 + # Publish schema 631 + publisher = SchemaPublisher(mock_client) 632 + uri = publisher.publish_schema(TestSample, version="1.0.0") 633 + 634 + # Verify 635 + assert uri == "at://did:example/app.bsky.atdata.schema/abc123" 636 + mock_client.create_record.assert_called_once() 637 + 638 + # Check the record structure 639 + call_args = mock_client.create_record.call_args 640 + collection, record = call_args[0] 641 + assert collection == "app.bsky.atdata.schema" 642 + assert record["name"] == "TestSample" 643 + assert len(record["fields"]) == 2 644 + ``` 645 + 646 + ## Dependencies 647 + 648 + **New dependencies** (to be added to `pyproject.toml`): 649 + 650 + ```toml 651 + [project] 652 + dependencies = [ 653 + # ... existing ... 654 + "atproto>=0.0.40", # ATProto Python SDK 655 + ] 656 + ``` 657 + 658 + ## Implementation Checklist (Phase 2) 659 + 660 + - [ ] Set up `atdata/atproto/` module structure 661 + - [ ] Implement `ATProtoClient` wrapper 662 + - [ ] Implement `SchemaPublisher` with type introspection 663 + - [ ] Implement `DatasetPublisher` 664 + - [ ] Implement `LensPublisher` (code reference only) 665 + - [ ] Add convenience methods to `Dataset` class 666 + - [ ] Write unit tests for all publishers 667 + - [ ] Write integration tests with test server 668 + - [ ] Update documentation with examples 669 + 670 + ## Future Enhancements 671 + 672 + ### Better NDArray Type Handling 673 + - Parse `NDArray[DType, Shape]` annotations for accurate dtype/shape 674 + - Support for shape constraints in schema 675 + 676 + ### Dynamic Type Loading 677 + - Use codegen to create types at runtime from schema records 678 + - Enable `Dataset.from_atproto()` without pre-existing Python classes 679 + 680 + ### Caching 681 + - Cache schema lookups to avoid repeated network calls 682 + - Local schema registry 683 + 684 + ### Batch Operations 685 + - Publish multiple schemas/datasets in one call 686 + - Bulk import/export 687 + 688 + ### AppView Integration 689 + - Use AppView for fast search instead of client-side filtering 690 + - Streaming results for large queries

+578

.planning/04_appview.md

··· 1 + # AppView Service Architecture 2 + 3 + ## Overview 4 + 5 + The AppView is an **optional aggregation service** that indexes dataset records from across the ATProto network, providing fast search and discovery. Think of it as the "search engine" for atdata datasets. 6 + 7 + ## Why AppView? 8 + 9 + Without AppView, discovering datasets requires: 10 + - Querying each user's Personal Data Server (PDS) individually 11 + - No global search across all published datasets 12 + - Slow, inefficient discovery 13 + 14 + With AppView: 15 + - **Fast global search** across all datasets 16 + - **Rich metadata browsing** (schemas, tags, authors) 17 + - **Recommendation systems** (similar datasets, popular datasets) 18 + - **Analytics** (dataset usage, trends) 19 + 20 + ## Architecture 21 + 22 + ``` 23 + ┌─────────────────────────────────────────────────────────────┐ 24 + │ ATProto Network │ 25 + │ │ 26 + │ ┌─────┐ ┌─────┐ ┌─────┐ ┌──────────────┐ │ 27 + │ │ PDS │ │ PDS │ │ PDS │ ────────▶ │ Relay/ │ │ 28 + │ │ 1 │ │ 2 │ │ 3 │ │ Firehose │ │ 29 + │ └─────┘ └─────┘ └─────┘ └──────────────┘ │ 30 + │ │ │ │ │ │ 31 + │ └─────────┴─────────┴────────────────────┘ │ 32 + │ (publish records) │ │ 33 + └────────────────────────────────────────────────┼──────────────┘ 34 + │ 35 + │ (subscribe) 36 + ▼ 37 + ┌─────────────────────────┐ 38 + │ AppView Service │ 39 + │ │ 40 + │ ┌──────────────────┐ │ 41 + │ │ Firehose │ │ 42 + │ │ Consumer │ │ 43 + │ └────────┬─────────┘ │ 44 + │ │ │ 45 + │ ▼ │ 46 + │ ┌──────────────────┐ │ 47 + │ │ Record │ │ 48 + │ │ Processor │ │ 49 + │ └────────┬─────────┘ │ 50 + │ │ │ 51 + │ ▼ │ 52 + │ ┌──────────────────┐ │ 53 + │ │ PostgreSQL │ │ 54 + │ │ Database │ │ 55 + │ └──────────────────┘ │ 56 + │ │ 57 + │ ┌──────────────────┐ │ 58 + │ │ Search Index │ │ 59 + │ │ (ElasticSearch) │ │ 60 + │ └──────────────────┘ │ 61 + │ │ 62 + │ ┌──────────────────┐ │ 63 + │ │ HTTP API │ │ 64 + │ │ (FastAPI) │ │ 65 + │ └──────────────────┘ │ 66 + └─────────────────────────┘ 67 + │ 68 + │ (query API) 69 + ▼ 70 + ┌─────────────────────────┐ 71 + │ Python Client │ 72 + │ (atdata.atproto) │ 73 + └─────────────────────────┘ 74 + ``` 75 + 76 + ## Components 77 + 78 + ### 1. Firehose Consumer 79 + 80 + **Purpose**: Subscribe to ATProto firehose and receive real-time record updates 81 + 82 + **Technology**: Python + `atproto` SDK 83 + 84 + **Responsibilities**: 85 + - Connect to ATProto relay/firehose 86 + - Filter for relevant Lexicon types: 87 + - `app.bsky.atdata.schema` 88 + - `app.bsky.atdata.dataset` 89 + - `app.bsky.atdata.lens` 90 + - Handle reconnection and backpressure 91 + - Forward records to processor 92 + 93 + **Implementation**: 94 + ```python 95 + from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message 96 + 97 + class AtdataFirehoseConsumer: 98 + def __init__(self, processor: RecordProcessor): 99 + self.processor = processor 100 + self.client = FirehoseSubscribeReposClient() 101 + 102 + def start(self): 103 + """Start consuming firehose.""" 104 + def on_message_handler(message): 105 + commit = parse_subscribe_repos_message(message) 106 + if not commit: 107 + return 108 + 109 + for op in commit.ops: 110 + if op.action == 'create' or op.action == 'update': 111 + if op.path.startswith('app.bsky.atdata.'): 112 + # Extract record 113 + record = op.record 114 + self.processor.process_record( 115 + uri=op.uri, 116 + cid=op.cid, 117 + record=record 118 + ) 119 + 120 + self.client.start(on_message_handler) 121 + ``` 122 + 123 + ### 2. Record Processor 124 + 125 + **Purpose**: Parse and validate incoming records, update database and search index 126 + 127 + **Responsibilities**: 128 + - Validate records against Lexicon schemas 129 + - Extract searchable fields 130 + - Resolve references (schema URIs, etc.) 131 + - Update PostgreSQL and ElasticSearch 132 + - Handle deletions and updates 133 + 134 + **Data Model**: 135 + 136 + **PostgreSQL Tables**: 137 + ```sql 138 + -- Schema records 139 + CREATE TABLE schemas ( 140 + uri TEXT PRIMARY KEY, 141 + cid TEXT NOT NULL, 142 + did TEXT NOT NULL, 143 + name TEXT NOT NULL, 144 + version TEXT NOT NULL, 145 + description TEXT, 146 + fields JSONB NOT NULL, 147 + metadata JSONB, 148 + created_at TIMESTAMP NOT NULL, 149 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 150 + ); 151 + CREATE INDEX idx_schemas_did ON schemas(did); 152 + CREATE INDEX idx_schemas_name ON schemas(name); 153 + 154 + -- Dataset records 155 + CREATE TABLE datasets ( 156 + uri TEXT PRIMARY KEY, 157 + cid TEXT NOT NULL, 158 + did TEXT NOT NULL, 159 + name TEXT NOT NULL, 160 + schema_ref TEXT NOT NULL REFERENCES schemas(uri), 161 + urls TEXT[] NOT NULL, 162 + description TEXT, 163 + metadata BYTEA, 164 + tags TEXT[], 165 + license TEXT, 166 + size_samples INTEGER, 167 + size_bytes BIGINT, 168 + created_at TIMESTAMP NOT NULL, 169 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 170 + ); 171 + CREATE INDEX idx_datasets_did ON datasets(did); 172 + CREATE INDEX idx_datasets_schema ON datasets(schema_ref); 173 + CREATE INDEX idx_datasets_tags ON datasets USING GIN(tags); 174 + 175 + -- Lens records 176 + CREATE TABLE lenses ( 177 + uri TEXT PRIMARY KEY, 178 + cid TEXT NOT NULL, 179 + did TEXT NOT NULL, 180 + name TEXT NOT NULL, 181 + source_schema TEXT NOT NULL REFERENCES schemas(uri), 182 + target_schema TEXT NOT NULL REFERENCES schemas(uri), 183 + description TEXT, 184 + created_at TIMESTAMP NOT NULL, 185 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 186 + ); 187 + CREATE INDEX idx_lenses_source ON lenses(source_schema); 188 + CREATE INDEX idx_lenses_target ON lenses(target_schema); 189 + 190 + -- Lens network view (for finding transformation paths) 191 + CREATE MATERIALIZED VIEW lens_network AS 192 + SELECT 193 + source_schema, 194 + target_schema, 195 + uri, 196 + name 197 + FROM lenses; 198 + CREATE INDEX idx_lens_network_source ON lens_network(source_schema); 199 + CREATE INDEX idx_lens_network_target ON lens_network(target_schema); 200 + ``` 201 + 202 + **ElasticSearch Index**: 203 + ```json 204 + { 205 + "mappings": { 206 + "properties": { 207 + "uri": { "type": "keyword" }, 208 + "type": { "type": "keyword" }, 209 + "did": { "type": "keyword" }, 210 + "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, 211 + "description": { "type": "text" }, 212 + "tags": { "type": "keyword" }, 213 + "created_at": { "type": "date" }, 214 + "schema_ref": { "type": "keyword" }, 215 + "license": { "type": "keyword" } 216 + } 217 + } 218 + } 219 + ``` 220 + 221 + ### 3. HTTP API 222 + 223 + **Purpose**: Expose search and query endpoints for clients 224 + 225 + **Technology**: FastAPI + Pydantic 226 + 227 + **Endpoints**: 228 + 229 + ```python 230 + from fastapi import FastAPI, Query 231 + from pydantic import BaseModel 232 + 233 + app = FastAPI() 234 + 235 + # Search datasets 236 + @app.get("/api/v1/datasets/search") 237 + async def search_datasets( 238 + q: str = Query(None, description="Text search query"), 239 + tags: list[str] = Query(None, description="Filter by tags"), 240 + schema_uri: str = Query(None, description="Filter by schema"), 241 + author_did: str = Query(None, description="Filter by author DID"), 242 + limit: int = Query(20, le=100), 243 + offset: int = Query(0) 244 + ) -> list[dict]: 245 + """Search for datasets.""" 246 + # Query ElasticSearch + PostgreSQL 247 + pass 248 + 249 + # Get dataset details 250 + @app.get("/api/v1/datasets/{uri:path}") 251 + async def get_dataset(uri: str) -> dict: 252 + """Get dataset record by URI.""" 253 + # Query PostgreSQL 254 + pass 255 + 256 + # List schemas 257 + @app.get("/api/v1/schemas") 258 + async def list_schemas( 259 + limit: int = Query(20, le=100), 260 + offset: int = Query(0) 261 + ) -> list[dict]: 262 + """List available schemas.""" 263 + pass 264 + 265 + # Get schema details 266 + @app.get("/api/v1/schemas/{uri:path}") 267 + async def get_schema(uri: str) -> dict: 268 + """Get schema record by URI.""" 269 + pass 270 + 271 + # Find lens path between schemas 272 + @app.get("/api/v1/lenses/path") 273 + async def find_lens_path( 274 + source: str = Query(..., description="Source schema URI"), 275 + target: str = Query(..., description="Target schema URI") 276 + ) -> list[dict]: 277 + """Find transformation path between two schemas.""" 278 + # Graph search on lens_network 279 + pass 280 + 281 + # Stats and analytics 282 + @app.get("/api/v1/stats") 283 + async def get_stats() -> dict: 284 + """Get aggregate statistics.""" 285 + return { 286 + "total_datasets": await count_datasets(), 287 + "total_schemas": await count_schemas(), 288 + "total_lenses": await count_lenses() 289 + } 290 + ``` 291 + 292 + ### 4. Caching Layer 293 + 294 + **Purpose**: Reduce database load for frequent queries 295 + 296 + **Technology**: Redis 297 + 298 + **Cached Items**: 299 + - Popular dataset queries 300 + - Schema lookups (high read frequency) 301 + - Search results (with short TTL) 302 + - Aggregate statistics 303 + 304 + **Implementation**: 305 + ```python 306 + import redis 307 + import json 308 + from functools import wraps 309 + 310 + redis_client = redis.Redis(host='localhost', port=6379, db=0) 311 + 312 + def cache_result(ttl: int = 300): 313 + """Decorator to cache function results in Redis.""" 314 + def decorator(func): 315 + @wraps(func) 316 + async def wrapper(*args, **kwargs): 317 + # Generate cache key from function name and args 318 + cache_key = f"{func.__name__}:{hash((args, frozenset(kwargs.items())))}" 319 + 320 + # Check cache 321 + cached = redis_client.get(cache_key) 322 + if cached: 323 + return json.loads(cached) 324 + 325 + # Compute result 326 + result = await func(*args, **kwargs) 327 + 328 + # Store in cache 329 + redis_client.setex(cache_key, ttl, json.dumps(result)) 330 + 331 + return result 332 + return wrapper 333 + return decorator 334 + 335 + @cache_result(ttl=60) 336 + async def get_popular_datasets(): 337 + """Get popular datasets (cached for 1 minute).""" 338 + # Query database 339 + pass 340 + ``` 341 + 342 + ## Deployment 343 + 344 + ### Infrastructure 345 + 346 + **Option 1: Simple (single server)** 347 + ``` 348 + - PostgreSQL (datasets, schemas, lenses) 349 + - ElasticSearch (search index) 350 + - Redis (cache) 351 + - FastAPI app (HTTP API) 352 + - Firehose consumer (background process) 353 + ``` 354 + 355 + **Option 2: Scalable (cloud)** 356 + ``` 357 + - AWS RDS PostgreSQL (managed database) 358 + - AWS OpenSearch (managed ElasticSearch) 359 + - AWS ElastiCache (managed Redis) 360 + - AWS ECS/Fargate (containerized FastAPI app) 361 + - AWS ECS/Fargate (containerized firehose consumer) 362 + - AWS ALB (load balancer) 363 + ``` 364 + 365 + ### Docker Compose (Development) 366 + 367 + ```yaml 368 + version: '3.8' 369 + 370 + services: 371 + postgres: 372 + image: postgres:15 373 + environment: 374 + POSTGRES_DB: atdata_appview 375 + POSTGRES_USER: atdata 376 + POSTGRES_PASSWORD: password 377 + volumes: 378 + - postgres_data:/var/lib/postgresql/data 379 + ports: 380 + - "5432:5432" 381 + 382 + elasticsearch: 383 + image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0 384 + environment: 385 + - discovery.type=single-node 386 + - xpack.security.enabled=false 387 + volumes: 388 + - es_data:/usr/share/elasticsearch/data 389 + ports: 390 + - "9200:9200" 391 + 392 + redis: 393 + image: redis:7 394 + ports: 395 + - "6379:6379" 396 + 397 + appview-api: 398 + build: 399 + context: . 400 + dockerfile: Dockerfile.api 401 + environment: 402 + DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview 403 + ELASTICSEARCH_URL: http://elasticsearch:9200 404 + REDIS_URL: redis://redis:6379 405 + depends_on: 406 + - postgres 407 + - elasticsearch 408 + - redis 409 + ports: 410 + - "8000:8000" 411 + 412 + appview-firehose: 413 + build: 414 + context: . 415 + dockerfile: Dockerfile.firehose 416 + environment: 417 + DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview 418 + ELASTICSEARCH_URL: http://elasticsearch:9200 419 + REDIS_URL: redis://redis:6379 420 + FIREHOSE_URL: wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos 421 + depends_on: 422 + - postgres 423 + - elasticsearch 424 + - redis 425 + 426 + volumes: 427 + postgres_data: 428 + es_data: 429 + ``` 430 + 431 + ## Client Integration 432 + 433 + ### Python Client Updates 434 + 435 + Add AppView support to `atdata.atproto.dataset.DatasetLoader`: 436 + 437 + ```python 438 + class DatasetLoader: 439 + def __init__( 440 + self, 441 + client: ATProtoClient, 442 + appview_url: Optional[str] = None 443 + ): 444 + self.client = client 445 + self.appview_url = appview_url or "https://appview.atdata.network" 446 + 447 + def search_datasets( 448 + self, 449 + query: Optional[str] = None, 450 + tags: Optional[list[str]] = None, 451 + schema_uri: Optional[str] = None, 452 + limit: int = 20 453 + ) -> list[dict]: 454 + """Search datasets using AppView.""" 455 + import httpx 456 + 457 + params = {"limit": limit} 458 + if query: 459 + params["q"] = query 460 + if tags: 461 + params["tags"] = tags 462 + if schema_uri: 463 + params["schema_uri"] = schema_uri 464 + 465 + response = httpx.get(f"{self.appview_url}/api/v1/datasets/search", params=params) 466 + response.raise_for_status() 467 + return response.json() 468 + ``` 469 + 470 + **Usage**: 471 + ```python 472 + from atdata.atproto import ATProtoClient, DatasetLoader 473 + 474 + client = ATProtoClient() 475 + loader = DatasetLoader(client, appview_url="https://appview.atdata.network") 476 + 477 + # Search for computer vision datasets 478 + results = loader.search_datasets( 479 + tags=["computer-vision"], 480 + limit=10 481 + ) 482 + 483 + for dataset in results: 484 + print(f"{dataset['name']}: {dataset['description']}") 485 + ``` 486 + 487 + ## Performance Considerations 488 + 489 + ### Indexing Speed 490 + - **Goal**: Index records in <1 second from firehose receipt 491 + - **Approach**: Async processing, batch inserts 492 + 493 + ### Search Performance 494 + - **Goal**: Search queries return in <100ms 495 + - **Approach**: ElasticSearch indexing, query optimization, caching 496 + 497 + ### Scalability 498 + - **Goal**: Handle 1000+ datasets, 100+ schemas 499 + - **Approach**: Horizontal scaling of API servers, read replicas for PostgreSQL 500 + 501 + ## Monitoring & Observability 502 + 503 + ### Metrics 504 + - Firehose lag (time behind current) 505 + - Indexing throughput (records/second) 506 + - API request latency (p50, p95, p99) 507 + - Cache hit rate 508 + - Database query performance 509 + 510 + ### Logging 511 + - Structured JSON logs 512 + - Log aggregation (e.g., CloudWatch, Datadog) 513 + - Error tracking (e.g., Sentry) 514 + 515 + ### Health Checks 516 + ```python 517 + @app.get("/health") 518 + async def health_check(): 519 + """Check service health.""" 520 + return { 521 + "status": "healthy", 522 + "components": { 523 + "database": await check_db_health(), 524 + "elasticsearch": await check_es_health(), 525 + "redis": await check_redis_health(), 526 + "firehose": await check_firehose_health() 527 + } 528 + } 529 + ``` 530 + 531 + ## Implementation Checklist (Phase 3) 532 + 533 + - [ ] Design database schema (PostgreSQL) 534 + - [ ] Design search index (ElasticSearch) 535 + - [ ] Implement firehose consumer 536 + - [ ] Implement record processor with validation 537 + - [ ] Implement HTTP API with FastAPI 538 + - [ ] Add caching layer (Redis) 539 + - [ ] Create Docker Compose for local development 540 + - [ ] Write integration tests 541 + - [ ] Set up monitoring and logging 542 + - [ ] Deploy to staging environment 543 + - [ ] Performance testing and optimization 544 + 545 + ## Future Enhancements 546 + 547 + ### Advanced Search 548 + - Fuzzy matching 549 + - Relevance scoring 550 + - Autocomplete for tags/names 551 + 552 + ### Recommendations 553 + - "Datasets similar to this one" 554 + - "Popular datasets in this category" 555 + - "Datasets by authors you follow" 556 + 557 + ### Analytics 558 + - Dataset usage tracking (downloads, views) 559 + - Trending datasets 560 + - Schema adoption statistics 561 + 562 + ### Social Features 563 + - Dataset comments/reviews 564 + - Ratings 565 + - Curation lists (e.g., "Best datasets for X") 566 + 567 + ### Federation 568 + - Multiple AppView instances 569 + - Cross-AppView search 570 + - Regional AppViews for performance 571 + 572 + ## Security Considerations 573 + 574 + - **Rate limiting**: Prevent abuse of search API 575 + - **Input validation**: Sanitize all query parameters 576 + - **DDoS protection**: Use CloudFlare or similar 577 + - **Authentication** (optional): API keys for heavy users 578 + - **Data validation**: Verify record signatures from ATProto

+799

.planning/05_codegen.md

··· 1 + # Code Generation Tooling 2 + 3 + ## Overview 4 + 5 + Code generation enables users to create `PackableSample` classes from schema records published on ATProto, making datasets truly interoperable across different codebases and even languages. 6 + 7 + ## Goals 8 + 9 + 1. **Automatic class generation**: Convert schema records to Python classes 10 + 2. **Type safety**: Generate proper type hints and validation 11 + 3. **Maintainability**: Generated code should be readable and maintainable 12 + 4. **Cross-language support** (future): TypeScript, Rust, etc. 13 + 14 + ## Python Code Generation 15 + 16 + ### Input: Schema Record 17 + 18 + ```json 19 + { 20 + "$type": "app.bsky.atdata.schema", 21 + "name": "ImageSample", 22 + "version": "1.0.0", 23 + "description": "Sample containing an image with label", 24 + "fields": [ 25 + { 26 + "name": "image", 27 + "type": { "kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3] }, 28 + "description": "RGB image with variable height/width" 29 + }, 30 + { 31 + "name": "label", 32 + "type": { "kind": "primitive", "primitive": "str" }, 33 + "description": "Human-readable label" 34 + }, 35 + { 36 + "name": "confidence", 37 + "type": { "kind": "primitive", "primitive": "float" }, 38 + "optional": true, 39 + "description": "Optional confidence score" 40 + } 41 + ] 42 + } 43 + ``` 44 + 45 + ### Output: Python Code 46 + 47 + ```python 48 + """ 49 + ImageSample 50 + 51 + Sample containing an image with label 52 + 53 + Schema Version: 1.0.0 54 + Schema URI: at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm 55 + Generated: 2025-01-06T12:00:00Z 56 + """ 57 + 58 + from dataclasses import dataclass 59 + from typing import Optional 60 + from numpy.typing import NDArray 61 + import atdata 62 + 63 + 64 + @atdata.packable 65 + class ImageSample: 66 + """Sample containing an image with label""" 67 + 68 + #: RGB image with variable height/width 69 + image: NDArray # uint8, shape: [*, *, 3] 70 + 71 + #: Human-readable label 72 + label: str 73 + 74 + #: Optional confidence score 75 + confidence: Optional[float] = None 76 + ``` 77 + 78 + ## Code Generator Architecture 79 + 80 + ### Module Structure 81 + 82 + ``` 83 + src/atdata/codegen/ 84 + __init__.py # Public API 85 + generator.py # Core code generation logic 86 + templates/ # Template files 87 + python.jinja2 # Python class template 88 + cli.py # CLI interface 89 + _validators.py # Schema validation 90 + ``` 91 + 92 + ### Core Generator 93 + 94 + **File**: `src/atdata/codegen/generator.py` 95 + 96 + ```python 97 + from typing import Optional 98 + from datetime import datetime, timezone 99 + from jinja2 import Environment, PackageLoader 100 + import atdata 101 + from ..atproto import ATProtoClient, SchemaLoader 102 + 103 + class PythonGenerator: 104 + """Generate Python PackableSample classes from schema records.""" 105 + 106 + def __init__(self): 107 + # Set up Jinja2 environment 108 + self.env = Environment( 109 + loader=PackageLoader('atdata.codegen', 'templates'), 110 + trim_blocks=True, 111 + lstrip_blocks=True 112 + ) 113 + 114 + # Register custom filters 115 + self.env.filters['python_type'] = self._python_type_filter 116 + self.env.filters['python_default'] = self._python_default_filter 117 + 118 + def generate_from_uri( 119 + self, 120 + client: ATProtoClient, 121 + schema_uri: str, 122 + output_path: Optional[str] = None 123 + ) -> str: 124 + """ 125 + Generate Python code from a schema URI. 126 + 127 + Args: 128 + client: ATProto client 129 + schema_uri: URI of the schema record 130 + output_path: Optional path to write output file 131 + 132 + Returns: 133 + Generated Python code as string 134 + """ 135 + # Load schema record 136 + loader = SchemaLoader(client) 137 + schema = loader.get_schema(schema_uri) 138 + 139 + # Generate code 140 + code = self.generate_from_record(schema, schema_uri) 141 + 142 + # Write to file if requested 143 + if output_path: 144 + with open(output_path, 'w') as f: 145 + f.write(code) 146 + 147 + return code 148 + 149 + def generate_from_record( 150 + self, 151 + schema: dict, 152 + schema_uri: str 153 + ) -> str: 154 + """ 155 + Generate Python code from a schema record dict. 156 + 157 + Args: 158 + schema: Schema record dict 159 + schema_uri: URI of the schema (for documentation) 160 + 161 + Returns: 162 + Generated Python code 163 + """ 164 + # Validate schema 165 + self._validate_schema(schema) 166 + 167 + # Prepare template context 168 + context = { 169 + 'schema': schema, 170 + 'schema_uri': schema_uri, 171 + 'generated_at': datetime.now(timezone.utc).isoformat(), 172 + 'fields': self._prepare_fields(schema['fields']) 173 + } 174 + 175 + # Render template 176 + template = self.env.get_template('python.jinja2') 177 + code = template.render(**context) 178 + 179 + return code 180 + 181 + def _prepare_fields(self, fields: list[dict]) -> list[dict]: 182 + """Prepare fields for template rendering.""" 183 + prepared = [] 184 + 185 + for field in fields: 186 + prepared.append({ 187 + 'name': field['name'], 188 + 'type_annotation': self._field_type_to_python(field['type']), 189 + 'optional': field.get('optional', False), 190 + 'description': field.get('description', ''), 191 + 'type_comment': self._type_comment(field['type']) 192 + }) 193 + 194 + return prepared 195 + 196 + def _field_type_to_python(self, field_type: dict) -> str: 197 + """Convert schema field type to Python type annotation.""" 198 + kind = field_type['kind'] 199 + 200 + if kind == 'primitive': 201 + primitive_map = { 202 + 'str': 'str', 203 + 'int': 'int', 204 + 'float': 'float', 205 + 'bool': 'bool', 206 + 'bytes': 'bytes' 207 + } 208 + return primitive_map[field_type['primitive']] 209 + 210 + elif kind == 'ndarray': 211 + return 'NDArray' 212 + 213 + elif kind == 'nested': 214 + # Extract class name from schema ref 215 + # For now, just use a placeholder 216 + ref = field_type['schemaRef'] 217 + return f'NestedType' # TODO: resolve nested types 218 + 219 + else: 220 + raise ValueError(f"Unknown field type kind: {kind}") 221 + 222 + def _type_comment(self, field_type: dict) -> Optional[str]: 223 + """Generate type comment for NDArray types.""" 224 + if field_type['kind'] == 'ndarray': 225 + dtype = field_type['dtype'] 226 + shape = field_type.get('shape') 227 + if shape: 228 + shape_str = ', '.join('*' if s is None else str(s) for s in shape) 229 + return f"{dtype}, shape: [{shape_str}]" 230 + else: 231 + return f"{dtype}" 232 + return None 233 + 234 + def _python_type_filter(self, field: dict) -> str: 235 + """Jinja2 filter to get Python type annotation.""" 236 + type_str = self._field_type_to_python(field['type']) 237 + if field.get('optional'): 238 + return f'Optional[{type_str}]' 239 + return type_str 240 + 241 + def _python_default_filter(self, field: dict) -> Optional[str]: 242 + """Jinja2 filter to get Python default value.""" 243 + if field.get('optional'): 244 + return 'None' 245 + return None 246 + 247 + def _validate_schema(self, schema: dict) -> None: 248 + """Validate schema record structure.""" 249 + required = ['name', 'version', 'fields'] 250 + for field in required: 251 + if field not in schema: 252 + raise ValueError(f"Schema missing required field: {field}") 253 + 254 + if not isinstance(schema['fields'], list): 255 + raise ValueError("Schema fields must be a list") 256 + 257 + for field in schema['fields']: 258 + if 'name' not in field or 'type' not in field: 259 + raise ValueError(f"Field missing name or type: {field}") 260 + ``` 261 + 262 + ### Template File 263 + 264 + **File**: `src/atdata/codegen/templates/python.jinja2` 265 + 266 + ```jinja2 267 + """ 268 + {{ schema.name }} 269 + 270 + {{ schema.description }} 271 + 272 + Schema Version: {{ schema.version }} 273 + Schema URI: {{ schema_uri }} 274 + Generated: {{ generated_at }} 275 + 276 + ⚠️ This file was automatically generated from an ATProto schema record. 277 + Do not edit manually - regenerate using `atdata codegen` instead. 278 + """ 279 + 280 + from dataclasses import dataclass 281 + {%- if fields | selectattr('optional') | list %} 282 + from typing import Optional 283 + {%- endif %} 284 + {%- if fields | selectattr('type.kind', 'equalto', 'ndarray') | list %} 285 + from numpy.typing import NDArray 286 + {%- endif %} 287 + import atdata 288 + 289 + 290 + @atdata.packable 291 + class {{ schema.name }}: 292 + """{{ schema.description }}""" 293 + 294 + {% for field in fields %} 295 + {%- if field.description %} 296 + #: {{ field.description }} 297 + {%- endif %} 298 + {{ field.name }}: {{ field | python_type }} 299 + {%- if field.type_comment %} # {{ field.type_comment }}{% endif %} 300 + {%- if field | python_default %} = {{ field | python_default }}{% endif %} 301 + 302 + {% endfor %} 303 + ``` 304 + 305 + ### CLI Interface 306 + 307 + **File**: `src/atdata/codegen/cli.py` 308 + 309 + ```python 310 + import click 311 + from pathlib import Path 312 + from ..atproto import ATProtoClient 313 + from .generator import PythonGenerator 314 + 315 + 316 + @click.group() 317 + def codegen(): 318 + """Code generation tools for atdata.""" 319 + pass 320 + 321 + 322 + @codegen.command() 323 + @click.argument('schema_uri') 324 + @click.option('--output', '-o', type=click.Path(), help='Output file path') 325 + @click.option('--handle', '-u', help='ATProto handle for authentication') 326 + @click.option('--password', '-p', help='ATProto password') 327 + @click.option('--language', '-l', default='python', type=click.Choice(['python']), help='Output language') 328 + def generate(schema_uri: str, output: str, handle: str, password: str, language: str): 329 + """Generate code from a schema URI. 330 + 331 + Example: 332 + atdata codegen generate at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm -o my_sample.py 333 + """ 334 + # Initialize client 335 + client = ATProtoClient() 336 + 337 + # Authenticate if credentials provided 338 + if handle and password: 339 + client.login(handle, password) 340 + 341 + # Generate code 342 + generator = PythonGenerator() 343 + 344 + try: 345 + code = generator.generate_from_uri(client, schema_uri, output) 346 + 347 + if output: 348 + click.echo(f"Generated {language} code written to {output}") 349 + else: 350 + click.echo(code) 351 + 352 + except Exception as e: 353 + click.echo(f"Error generating code: {e}", err=True) 354 + raise click.Abort() 355 + 356 + 357 + @codegen.command() 358 + @click.argument('schema_uris', nargs=-1, required=True) 359 + @click.option('--output-dir', '-d', type=click.Path(), required=True, help='Output directory') 360 + @click.option('--handle', '-u', help='ATProto handle for authentication') 361 + @click.option('--password', '-p', help='ATProto password') 362 + def batch(schema_uris: tuple, output_dir: str, handle: str, password: str): 363 + """Generate code for multiple schemas. 364 + 365 + Example: 366 + atdata codegen batch schema1_uri schema2_uri -d ./generated 367 + """ 368 + # Create output directory 369 + output_path = Path(output_dir) 370 + output_path.mkdir(parents=True, exist_ok=True) 371 + 372 + # Initialize client 373 + client = ATProtoClient() 374 + if handle and password: 375 + client.login(handle, password) 376 + 377 + # Generate code for each schema 378 + generator = PythonGenerator() 379 + 380 + for schema_uri in schema_uris: 381 + try: 382 + # Load schema to get name 383 + from ..atproto import SchemaLoader 384 + loader = SchemaLoader(client) 385 + schema = loader.get_schema(schema_uri) 386 + 387 + # Generate output path from schema name 388 + filename = f"{schema['name'].lower()}.py" 389 + output_file = output_path / filename 390 + 391 + # Generate code 392 + generator.generate_from_uri(client, schema_uri, str(output_file)) 393 + 394 + click.echo(f"Generated {filename}") 395 + 396 + except Exception as e: 397 + click.echo(f"Error generating code for {schema_uri}: {e}", err=True) 398 + 399 + 400 + if __name__ == '__main__': 401 + codegen() 402 + ``` 403 + 404 + ### Integration with Main CLI 405 + 406 + **File**: `src/atdata/cli.py` (new or extend existing) 407 + 408 + ```python 409 + import click 410 + from .codegen.cli import codegen as codegen_group 411 + 412 + @click.group() 413 + def main(): 414 + """atdata command-line interface.""" 415 + pass 416 + 417 + # Add codegen subcommand 418 + main.add_command(codegen_group) 419 + 420 + if __name__ == '__main__': 421 + main() 422 + ``` 423 + 424 + **Update** `pyproject.toml`: 425 + 426 + ```toml 427 + [project.scripts] 428 + atdata = "atdata.cli:main" 429 + ``` 430 + 431 + ## Usage Examples 432 + 433 + ### Generate Single Schema 434 + 435 + ```bash 436 + # Generate Python code from schema URI 437 + atdata codegen generate \ 438 + at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm \ 439 + -o image_sample.py 440 + 441 + # Output to stdout instead 442 + atdata codegen generate \ 443 + at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm 444 + ``` 445 + 446 + ### Batch Generation 447 + 448 + ```bash 449 + # Generate multiple schemas to a directory 450 + atdata codegen batch \ 451 + at://did:plc:abc123/app.bsky.atdata.schema/schema1 \ 452 + at://did:plc:abc123/app.bsky.atdata.schema/schema2 \ 453 + at://did:plc:abc123/app.bsky.atdata.schema/schema3 \ 454 + -d ./generated_schemas 455 + ``` 456 + 457 + ### Programmatic Usage 458 + 459 + ```python 460 + from atdata.atproto import ATProtoClient 461 + from atdata.codegen import PythonGenerator 462 + 463 + # Initialize 464 + client = ATProtoClient() 465 + client.login("alice.bsky.social", "password") 466 + 467 + # Generate code 468 + generator = PythonGenerator() 469 + code = generator.generate_from_uri( 470 + client, 471 + "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 472 + output_path="my_sample.py" 473 + ) 474 + 475 + # Now can import and use the generated class 476 + from my_sample import ImageSample 477 + 478 + # Use with Dataset 479 + dataset = atdata.Dataset[ImageSample](url="s3://bucket/data-{000000..000009}.tar") 480 + ``` 481 + 482 + ## Type Validation 483 + 484 + ### Schema Compatibility Checking 485 + 486 + ```python 487 + from atdata.codegen import SchemaValidator 488 + 489 + class SchemaValidator: 490 + """Validate schema compatibility and evolution.""" 491 + 492 + def is_compatible(self, old_schema: dict, new_schema: dict) -> tuple[bool, list[str]]: 493 + """ 494 + Check if new_schema is compatible with old_schema. 495 + 496 + Returns: 497 + (is_compatible, list_of_incompatibilities) 498 + """ 499 + incompatibilities = [] 500 + 501 + # Check for removed fields 502 + old_fields = {f['name']: f for f in old_schema['fields']} 503 + new_fields = {f['name']: f for f in new_schema['fields']} 504 + 505 + for name in old_fields: 506 + if name not in new_fields: 507 + incompatibilities.append(f"Field removed: {name}") 508 + 509 + # Check for type changes 510 + for name in old_fields: 511 + if name in new_fields: 512 + old_type = old_fields[name]['type'] 513 + new_type = new_fields[name]['type'] 514 + if old_type != new_type: 515 + incompatibilities.append( 516 + f"Field type changed: {name} from {old_type} to {new_type}" 517 + ) 518 + 519 + # Check for optional -> required changes 520 + for name in old_fields: 521 + if name in new_fields: 522 + was_optional = old_fields[name].get('optional', False) 523 + is_optional = new_fields[name].get('optional', False) 524 + if was_optional and not is_optional: 525 + incompatibilities.append( 526 + f"Field changed from optional to required: {name}" 527 + ) 528 + 529 + return len(incompatibilities) == 0, incompatibilities 530 + 531 + def validate_evolution(self, old_version: str, new_version: str) -> bool: 532 + """Validate that version numbers follow semantic versioning.""" 533 + # Parse versions 534 + old_major, old_minor, old_patch = map(int, old_version.split('.')) 535 + new_major, new_minor, new_patch = map(int, new_version.split('.')) 536 + 537 + # Major version should increment for breaking changes 538 + # Minor version should increment for compatible additions 539 + # Patch version should increment for bug fixes 540 + 541 + return new_major >= old_major 542 + ``` 543 + 544 + ### Runtime Type Validation 545 + 546 + ```python 547 + from atdata.codegen import TypeValidator 548 + 549 + class TypeValidator: 550 + """Validate sample instances against schemas.""" 551 + 552 + def validate(self, sample: atdata.PackableSample, schema: dict) -> tuple[bool, list[str]]: 553 + """ 554 + Validate that a sample instance conforms to a schema. 555 + 556 + Returns: 557 + (is_valid, list_of_errors) 558 + """ 559 + errors = [] 560 + 561 + # Check all required fields present 562 + schema_fields = {f['name']: f for f in schema['fields']} 563 + 564 + for field_name, field_def in schema_fields.items(): 565 + if not field_def.get('optional', False): 566 + if not hasattr(sample, field_name): 567 + errors.append(f"Missing required field: {field_name}") 568 + 569 + # Check field types 570 + for field_name, field_def in schema_fields.items(): 571 + if hasattr(sample, field_name): 572 + value = getattr(sample, field_name) 573 + if value is not None: 574 + type_valid = self._validate_field_type(value, field_def['type']) 575 + if not type_valid: 576 + errors.append( 577 + f"Invalid type for field {field_name}: " 578 + f"expected {field_def['type']}, got {type(value)}" 579 + ) 580 + 581 + return len(errors) == 0, errors 582 + 583 + def _validate_field_type(self, value, field_type: dict) -> bool: 584 + """Validate that value matches field type.""" 585 + kind = field_type['kind'] 586 + 587 + if kind == 'primitive': 588 + primitive_types = { 589 + 'str': str, 590 + 'int': int, 591 + 'float': float, 592 + 'bool': bool, 593 + 'bytes': bytes 594 + } 595 + expected_type = primitive_types[field_type['primitive']] 596 + return isinstance(value, expected_type) 597 + 598 + elif kind == 'ndarray': 599 + import numpy as np 600 + if not isinstance(value, np.ndarray): 601 + return False 602 + 603 + # Check dtype if specified 604 + if 'dtype' in field_type: 605 + expected_dtype = np.dtype(field_type['dtype']) 606 + if value.dtype != expected_dtype: 607 + return False 608 + 609 + # Check shape if specified 610 + if 'shape' in field_type and field_type['shape']: 611 + expected_shape = field_type['shape'] 612 + if len(value.shape) != len(expected_shape): 613 + return False 614 + for actual_dim, expected_dim in zip(value.shape, expected_shape): 615 + if expected_dim is not None and actual_dim != expected_dim: 616 + return False 617 + 618 + return True 619 + 620 + return True 621 + ``` 622 + 623 + ## Testing 624 + 625 + ### Unit Tests 626 + 627 + ```python 628 + import pytest 629 + from atdata.codegen import PythonGenerator 630 + 631 + def test_generate_simple_schema(): 632 + """Test generating code from a simple schema.""" 633 + schema = { 634 + "name": "TestSample", 635 + "version": "1.0.0", 636 + "description": "Test sample", 637 + "fields": [ 638 + { 639 + "name": "field1", 640 + "type": {"kind": "primitive", "primitive": "str"} 641 + } 642 + ] 643 + } 644 + 645 + generator = PythonGenerator() 646 + code = generator.generate_from_record(schema, "at://test/schema/123") 647 + 648 + # Check that code contains expected elements 649 + assert "@atdata.packable" in code 650 + assert "class TestSample:" in code 651 + assert "field1: str" in code 652 + 653 + 654 + def test_generate_ndarray_field(): 655 + """Test generating code with NDArray fields.""" 656 + schema = { 657 + "name": "ImageSample", 658 + "version": "1.0.0", 659 + "description": "Image sample", 660 + "fields": [ 661 + { 662 + "name": "image", 663 + "type": { 664 + "kind": "ndarray", 665 + "dtype": "uint8", 666 + "shape": [None, None, 3] 667 + } 668 + } 669 + ] 670 + } 671 + 672 + generator = PythonGenerator() 673 + code = generator.generate_from_record(schema, "at://test/schema/456") 674 + 675 + assert "from numpy.typing import NDArray" in code 676 + assert "image: NDArray" in code 677 + assert "# uint8, shape: [*, *, 3]" in code 678 + 679 + 680 + def test_optional_fields(): 681 + """Test generating code with optional fields.""" 682 + schema = { 683 + "name": "OptionalSample", 684 + "version": "1.0.0", 685 + "description": "Sample with optional fields", 686 + "fields": [ 687 + { 688 + "name": "required_field", 689 + "type": {"kind": "primitive", "primitive": "str"} 690 + }, 691 + { 692 + "name": "optional_field", 693 + "type": {"kind": "primitive", "primitive": "int"}, 694 + "optional": True 695 + } 696 + ] 697 + } 698 + 699 + generator = PythonGenerator() 700 + code = generator.generate_from_record(schema, "at://test/schema/789") 701 + 702 + assert "from typing import Optional" in code 703 + assert "required_field: str" in code 704 + assert "optional_field: Optional[int] = None" in code 705 + ``` 706 + 707 + ### Integration Tests 708 + 709 + ```python 710 + def test_generate_and_import(): 711 + """Test that generated code can be imported and used.""" 712 + import tempfile 713 + import importlib.util 714 + 715 + schema = { 716 + "name": "GeneratedSample", 717 + "version": "1.0.0", 718 + "description": "Generated sample", 719 + "fields": [ 720 + {"name": "x", "type": {"kind": "primitive", "primitive": "int"}} 721 + ] 722 + } 723 + 724 + generator = PythonGenerator() 725 + 726 + # Generate code to temp file 727 + with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f: 728 + code = generator.generate_from_record(schema, "at://test/schema/123") 729 + f.write(code) 730 + temp_path = f.name 731 + 732 + # Import the generated module 733 + spec = importlib.util.spec_from_file_location("generated", temp_path) 734 + module = importlib.util.module_from_spec(spec) 735 + spec.loader.exec_module(module) 736 + 737 + # Test instantiation 738 + sample = module.GeneratedSample(x=42) 739 + assert sample.x == 42 740 + 741 + # Test serialization 742 + assert isinstance(sample, atdata.PackableSample) 743 + packed = sample.packed 744 + assert isinstance(packed, bytes) 745 + ``` 746 + 747 + ## Implementation Checklist (Phase 4) 748 + 749 + - [ ] Implement `PythonGenerator` core logic 750 + - [ ] Create Jinja2 template for Python classes 751 + - [ ] Add CLI commands (`generate`, `batch`) 752 + - [ ] Implement schema validation 753 + - [ ] Implement type compatibility checking 754 + - [ ] Write unit tests for generator 755 + - [ ] Write integration tests (generate + import) 756 + - [ ] Add documentation and examples 757 + - [ ] Consider edge cases (nested types, complex shapes) 758 + 759 + ## Future Extensions 760 + 761 + ### Multi-Language Support 762 + 763 + **TypeScript Generator**: 764 + ```typescript 765 + // Generated from schema 766 + export interface ImageSample { 767 + image: number[][][]; // uint8, [*, *, 3] 768 + label: string; 769 + confidence?: number; 770 + } 771 + ``` 772 + 773 + **Rust Generator**: 774 + ```rust 775 + // Generated from schema 776 + #[derive(Debug, Clone, Serialize, Deserialize)] 777 + pub struct ImageSample { 778 + /// RGB image with variable height/width 779 + pub image: ndarray::Array3<u8>, 780 + /// Human-readable label 781 + pub label: String, 782 + /// Optional confidence score 783 + pub confidence: Option<f64>, 784 + } 785 + ``` 786 + 787 + ### Advanced Features 788 + 789 + - **Backwards compatibility checks**: Ensure schema updates don't break existing code 790 + - **Migration generators**: Generate migration code for schema evolution 791 + - **Validation decorators**: Runtime validation of generated classes 792 + - **Documentation generation**: Generate API docs from schemas 793 + - **IDE support**: Language server protocol support for autocomplete 794 + 795 + ### Code Quality 796 + 797 + - **Formatting**: Run `black` on generated Python code 798 + - **Linting**: Ensure generated code passes `ruff`/`flake8` 799 + - **Type checking**: Ensure generated code passes `mypy`

+195

.planning/README.md

··· 1 + # ATProto Integration Planning 2 + 3 + This directory contains comprehensive planning documents for integrating AT Protocol into the `atdata` library, transforming it into a distributed dataset federation. 4 + 5 + ## Planning Documents 6 + 7 + ### Design Decisions 8 + 9 + 📋 **[decisions/](decisions/)** - Critical design decisions with detailed analysis 10 + - Each decision has its own document with options, recommendations, and rationale 11 + - See [decisions/README.md](decisions/README.md) for navigation guide 12 + - **Must be reviewed and finalized before Phase 1 implementation** 13 + 14 + ### Architecture & Design 15 + 16 + 1. **[01_overview.md](01_overview.md)** - High-level vision, architecture, and project roadmap 17 + - Overall vision for distributed datasets on ATProto 18 + - System architecture diagram 19 + - Development phases and dependencies 20 + - Open design questions 21 + 22 + 2. **[02_lexicon_design.md](02_lexicon_design.md)** - Detailed Lexicon schema specifications 23 + - Schema Record Lexicon (for PackableSample types) 24 + - Dataset Record Lexicon (for dataset indexes) 25 + - Lens Record Lexicon (for transformations) 26 + - Schema representation format decision 27 + - Example records 28 + 29 + 3. **[03_python_client.md](03_python_client.md)** - Python library architecture and API design 30 + - ATProtoClient for authentication 31 + - SchemaPublisher/Loader 32 + - DatasetPublisher/Loader 33 + - LensPublisher 34 + - Integration with existing Dataset class 35 + - Testing strategy 36 + 37 + 4. **[04_appview.md](04_appview.md)** - AppView aggregation service design 38 + - Service architecture 39 + - Database schema (PostgreSQL, ElasticSearch) 40 + - HTTP API endpoints 41 + - Firehose consumer 42 + - Deployment options 43 + - Performance considerations 44 + 45 + 5. **[05_codegen.md](05_codegen.md)** - Code generation tooling 46 + - Python code generator from schema records 47 + - CLI interface 48 + - Template system 49 + - Type validation and compatibility checking 50 + - Future multi-language support 51 + 52 + ## Milestone Tracking 53 + 54 + **Milestone**: ATProto Integration (Milestone #1) 55 + **Total Issues**: 34 (6 parent issues + 28 subissues) 56 + 57 + ### Planning Phase (Issue #44) 58 + 59 + **Status**: In progress 60 + **Priority**: High (blocks Phase 1) 61 + 62 + Critical decisions needed before implementation: 63 + - Decide on schema representation format (#45) 64 + - Decide on Lens code storage approach (#46) 65 + - Decide on WebDataset storage strategy (#47) 66 + - Design schema evolution and versioning strategy (#48) 67 + - Finalize Lexicon namespace and NSID structure (#49) 68 + - Review and validate Lexicon JSON definitions (#50) 69 + 70 + **All decisions have detailed analysis in planning documents with recommendations.** 71 + 72 + ### Phase Breakdown 73 + 74 + #### Phase 1: Lexicon Design & Schema Definition (Issue #17) 75 + - Design Lexicon for PackableSample schema storage (#22) 76 + - Design Lexicon for dataset index records (#23) 77 + - Design Lexicon for Lens transformation records (#24) 78 + - Evaluate schema representation formats (#25) 79 + 80 + **Status**: Blocked by Planning (#44) 81 + **Priority**: High (blocks all other phases) 82 + 83 + #### Phase 2: Python Client Library (Issue #18) 84 + - Implement ATProto authentication and session management (#26) 85 + - Implement schema publishing to ATProto (#27) 86 + - Implement dataset index record publishing (#28) 87 + - Implement Lens transformation publishing (#29) 88 + - Implement querying and discovery of datasets (#30) 89 + - Extend Dataset class to load from ATProto records (#31) 90 + 91 + **Status**: Blocked by Phase 1 92 + **Priority**: High (critical path) 93 + 94 + #### Phase 3: AppView & Index Aggregation Service (Issue #19) 95 + - Design AppView architecture and data model (#32) 96 + - Implement record ingestion from ATProto firehose (#33) 97 + - Implement search and query API (#34) 98 + - Add caching and indexing for performance (#35) 99 + 100 + **Status**: Blocked by Phase 2 101 + **Priority**: Medium (optional infrastructure) 102 + 103 + #### Phase 4: Code Generation Tooling (Issue #20) 104 + - Design code generation template system (#36) 105 + - Implement Python code generator from schema records (#37) 106 + - Add CLI for code generation (#38) 107 + - Support type validation and compatibility checking (#39) 108 + 109 + **Status**: Blocked by Phase 2 110 + **Priority**: Medium (can run parallel with Phase 3) 111 + 112 + #### Phase 5: End-to-End Integration & Testing (Issue #21) 113 + - Create end-to-end example workflows (#40) 114 + - Write integration tests for full publish/discover/load cycle (#41) 115 + - Create comprehensive documentation (#42) 116 + - Performance testing and optimization (#43) 117 + 118 + **Status**: Blocked by Phase 2 119 + **Priority**: High (required for production release) 120 + 121 + ## Getting Started 122 + 123 + To begin implementation: 124 + 125 + 1. **Review design decisions** in `decisions/` directory - these need your input first 126 + 2. **Review architecture documents** (01-05) to understand the full scope 127 + 3. **Provide feedback** on the design decisions and open questions 128 + 4. **Finalize decisions** for issues #45-49 129 + 5. **Validate Lexicons** (issue #50) once decisions are made 130 + 6. **Begin Phase 1 implementation** after validation 131 + 7. **Track progress** using chainlink issues 132 + 133 + ### Quick Start for Decision Review 134 + 135 + 1. Read [decisions/README.md](decisions/README.md) for overview 136 + 2. Review each decision document (01-06) 137 + 3. For each decision: 138 + - Agree with recommendation? → Comment on issue 139 + - Disagree? → Propose alternative in issue 140 + - Unsure? → Discuss open questions 141 + 4. Once all decisions made → Proceed to issue #50 (validation) 142 + 143 + ## Key Design Decisions Needed 144 + 145 + Before starting implementation, we need decisions on (see Issue #44 and subissues #45-50): 146 + 147 + 1. **Schema representation format** (Issue #45) 148 + - Recommendation: Custom format within ATProto Lexicon 149 + - Alternative: JSON Schema or Protobuf 150 + - Details in `02_lexicon_design.md` 151 + 152 + 2. **Lens code storage** (Issue #46) 153 + - Recommendation: Code references (GitHub + commit) only 154 + - Alternative: Allow inline code (security concerns) 155 + - Details in `02_lexicon_design.md` 156 + 157 + 3. **WebDataset storage location** (Issue #47) 158 + - Phase 1: External storage (S3, HTTP) - just URLs 159 + - Future: ATProto blob storage for smaller datasets 160 + - Details in `02_lexicon_design.md` 161 + 162 + 4. **Schema evolution strategy** (Issue #48) 163 + - How to handle versioning and compatibility 164 + - Migration path for breaking changes 165 + - Details in `05_codegen.md` 166 + 167 + 5. **Lexicon namespace** (Issue #49) 168 + - Current proposal: `app.bsky.atdata.*` 169 + - May need to coordinate with ATProto/Bluesky team 170 + - Details in `02_lexicon_design.md` 171 + 172 + 6. **Lexicon validation** (Issue #50) 173 + - Validate all Lexicon JSON against ATProto spec 174 + - Create example records for testing 175 + - Blocked by decisions #45-49 176 + 177 + ## Questions for Discussion 178 + 179 + Review the "Open Design Questions" sections in each planning document, particularly: 180 + 181 + - `01_overview.md` - Overall architecture questions 182 + - `02_lexicon_design.md` - Lexicon-specific design questions (CRITICAL for Phase 1) 183 + 184 + ## Next Steps 185 + 186 + 1. Review planning documents 187 + 2. Discuss and finalize design decisions 188 + 3. Begin Phase 1 implementation 189 + 4. Iterate and refine as we learn 190 + 191 + --- 192 + 193 + **Milestone Created**: 2026-01-07 194 + **Last Updated**: 2026-01-07 195 + **Status**: Planning complete, ready for review

+239

.planning/decisions/01_schema_representation_format.md

··· 1 + # Decision: Schema Representation Format 2 + 3 + **Issue**: #45 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + Let's go with the **JSON schema** approach; the only real issue we have to worry about here is the `NDArray` support, and we can solve that by 11 + 12 + * Adding a standardized JSON Schema shim to represent an `NDArray` as its serialized bytes 13 + * Referencing this as the type within other schemas, and making this the standard we use 14 + 15 + We'll make this decision future-proof by adding a property in the Lexicon for schemas that gives the type of schema definition, with one currently supported value (for JSON Schema), and then leave the standard overall as an open union, as is standard for atproto lexicons. 16 + 17 + --- 18 + 19 + ## Problem Statement 20 + 21 + We need to decide how to represent `PackableSample` type definitions within ATProto Lexicon records. This affects: 22 + - How schemas are stored and transmitted 23 + - Code generation complexity 24 + - Cross-language interoperability 25 + - Tooling ecosystem availability 26 + 27 + ## Context 28 + 29 + `PackableSample` types have specific requirements: 30 + - Support for primitive types (str, int, float, bool, bytes) 31 + - **Special handling for `NDArray` types** with dtype and shape information 32 + - Msgpack serialization metadata 33 + - Optional/required field semantics 34 + - Future extensibility (constraints, validation, nested types) 35 + 36 + ## Options 37 + 38 + ### Option 1: Custom Format within ATProto Lexicon ⭐ RECOMMENDED 39 + 40 + **Description**: Define our own type system using ATProto Lexicon primitives 41 + 42 + **Example**: 43 + ```json 44 + { 45 + "name": "image", 46 + "type": { 47 + "kind": "ndarray", 48 + "dtype": "uint8", 49 + "shape": [null, null, 3] 50 + }, 51 + "optional": false, 52 + "description": "RGB image with variable height/width" 53 + } 54 + ``` 55 + 56 + **Pros**: 57 + - ✅ Native to ATProto - no external dependencies 58 + - ✅ Tailored exactly to `PackableSample` needs 59 + - ✅ Clean representation of NDArray (dtype, shape constraints) 60 + - ✅ Full control over codegen implementation 61 + - ✅ Can evolve independently 62 + - ✅ Easy to extend (add constraints, validation rules, etc.) 63 + 64 + **Cons**: 65 + - ❌ Need to implement our own codegen tooling 66 + - ❌ Less ecosystem tooling available 67 + - ❌ Need to maintain custom parsers 68 + 69 + **Implementation Effort**: Medium 70 + - Lexicon design: ~2-3 days 71 + - Python codegen: ~5-7 days 72 + - Validation: ~2-3 days 73 + 74 + --- 75 + 76 + ### Option 2: JSON Schema 77 + 78 + **Description**: Use JSON Schema as the type definition format 79 + 80 + **Example**: 81 + ```json 82 + { 83 + "type": "object", 84 + "properties": { 85 + "image": { 86 + "type": "object", 87 + "x-atdata-type": "ndarray", 88 + "x-dtype": "uint8", 89 + "x-shape": [null, null, 3] 90 + } 91 + }, 92 + "required": ["image"] 93 + } 94 + ``` 95 + 96 + **Pros**: 97 + - ✅ Industry standard, widely understood 98 + - ✅ Extensive validation tooling exists 99 + - ✅ Many language implementations 100 + 101 + **Cons**: 102 + - ❌ Not designed for code generation 103 + - ❌ Awkward NDArray representation (need custom extensions like `x-atdata-type`) 104 + - ❌ Overly complex for our needs 105 + - ❌ Still need custom codegen despite standard format 106 + - ❌ Doesn't map cleanly to Python dataclasses 107 + 108 + **Implementation Effort**: Medium-High 109 + - Still need custom codegen despite standard format 110 + - JSON Schema parsers available but adaptation needed 111 + 112 + --- 113 + 114 + ### Option 3: Protobuf (Protocol Buffers) 115 + 116 + **Description**: Use Protobuf schema definitions 117 + 118 + **Example**: 119 + ```protobuf 120 + message ImageSample { 121 + bytes image = 1; // NDArray serialized 122 + string label = 2; 123 + optional float confidence = 3; 124 + } 125 + ``` 126 + 127 + **Pros**: 128 + - ✅ Excellent codegen ecosystem (Python, TypeScript, Rust, etc.) 129 + - ✅ Compact binary format 130 + - ✅ Strong cross-language support 131 + - ✅ Built-in versioning/evolution support 132 + 133 + **Cons**: 134 + - ❌ Not ATProto-native (different ecosystem) 135 + - ❌ NDArray handling is awkward (just bytes, lose dtype/shape info) 136 + - ❌ Requires compilation step 137 + - ❌ Less human-readable than JSON 138 + - ❌ Doesn't integrate well with msgpack serialization we already use 139 + - ❌ Would need to convert between Protobuf and our existing serialization 140 + 141 + **Implementation Effort**: High 142 + - Need to bridge Protobuf and PackableSample worlds 143 + - Complexity of maintaining two serialization systems 144 + 145 + ## Recommendation: Option 1 (Custom Format) 146 + 147 + **Rationale**: 148 + 149 + 1. **Perfect fit for PackableSample**: Our custom format can represent NDArray types with full dtype and shape information, which is critical for ML/data applications. 150 + 151 + 2. **ATProto-native**: Using Lexicon primitives means everything stays within the ATProto ecosystem. No external schema dependencies. 152 + 153 + 3. **Full control**: We can optimize the codegen for our exact use case. Want to generate dataclasses with specific decorators? Easy. Want to add custom validation? We control it. 154 + 155 + 4. **Simplicity**: Despite being "custom", it's actually simpler than adapting JSON Schema or Protobuf to our needs. Less impedance mismatch. 156 + 157 + 5. **Future-proof**: Easy to add features like: 158 + - Shape constraints and validation 159 + - Custom serialization hooks 160 + - Nested PackableSample types 161 + - Union types for polymorphic samples 162 + 163 + ## Implementation Plan 164 + 165 + If we choose Option 1: 166 + 167 + 1. **Finalize Lexicon structure** (see `02_lexicon_design.md`) 168 + - Field type definitions (primitive, ndarray, nested) 169 + - Union types for extensibility 170 + - Metadata fields 171 + 172 + 2. **Implement Python codegen** (see `05_codegen.md`) 173 + - Jinja2 templates for dataclass generation 174 + - Type annotation mapping 175 + - NDArray handling with dtype/shape comments 176 + 177 + 3. **Build validation tooling** 178 + - Schema validator (ensure schemas are well-formed) 179 + - Sample validator (ensure samples match schemas) 180 + - Compatibility checker (schema evolution) 181 + 182 + 4. **Document the format** 183 + - Clear spec for the type system 184 + - Examples for common patterns 185 + - Migration guide from JSON Schema if needed 186 + 187 + ## Alternative Approaches Considered 188 + 189 + **Hybrid approach**: Use JSON Schema for validation + custom codegen 190 + - Still has awkward NDArray representation 191 + - Added complexity of two systems 192 + - Not recommended 193 + 194 + **Defer decision**: Use simple types only, add NDArray later 195 + - Defeats the purpose - NDArray is core to ML datasets 196 + - Would require breaking changes later 197 + - Not recommended 198 + 199 + ## Impact on Other Decisions 200 + 201 + - **Code generation (#36-39)**: Custom format means we fully control codegen 202 + - **Validation (#50)**: Need to implement custom validators 203 + - **Cross-language support (future)**: Need to write codegen for each language, but format is language-agnostic 204 + 205 + ## Success Criteria 206 + 207 + After implementing this decision: 208 + - ✅ Can represent all current PackableSample types 209 + - ✅ NDArray types include dtype and shape information 210 + - ✅ Generated code is idiomatic Python (dataclasses with type hints) 211 + - ✅ Schema records are human-readable 212 + - ✅ Codegen is fast (<1s for typical schemas) 213 + 214 + ## Open Questions 215 + 216 + 1. **Should we support shape constraints beyond documentation?** 217 + - e.g., should [224, 224, 3] be enforced at runtime? 218 + - Recommendation: Document only initially, add validation later 219 + 220 + 2. **How to handle nested PackableSample types?** 221 + - Reference by schema URI? 222 + - Inline nested schema? 223 + - Recommendation: URI reference for Phase 1 224 + 225 + 3. **Should we generate both classes and validators?** 226 + - Just classes, or also Pydantic models? 227 + - Recommendation: Start with dataclasses, add Pydantic later if needed 228 + 229 + ## References 230 + 231 + - Full Lexicon design: `../02_lexicon_design.md` 232 + - Code generation plan: `../05_codegen.md` 233 + - Example schemas: `../02_lexicon_design.md` (Schema Record Lexicon section) 234 + 235 + --- 236 + 237 + **Decision Needed By**: Before starting Phase 1 Issue #22 (Lexicon design) 238 + **Decision Maker**: Project maintainer (max) 239 + **Date Created**: 2026-01-07

+345

.planning/decisions/02_lens_code_storage.md

··· 1 + # Decision: Lens Code Storage Approach 2 + 3 + **Issue**: #46 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + ... 11 + 12 + --- 13 + 14 + ## Problem Statement 15 + 16 + We need to decide how to store the transformation code for Lens records on ATProto. Lenses define bidirectional transformations between sample types (getter: Source → Target, putter: Target × Source → Source). 17 + 18 + This is a **critical security decision** because we're dealing with executable code. 19 + 20 + ## Context 21 + 22 + Lens transformations are functions that: 23 + - Take samples of one type and transform them to another 24 + - Are bidirectional (getter + putter) 25 + - Need to be reproducible and verifiable 26 + - Potentially execute on untrusted data 27 + 28 + Example Lens: 29 + ```python 30 + @atdata.lens 31 + def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample: 32 + gray = cv2.cvtColor(rgb_sample.image, cv2.COLOR_RGB2GRAY) 33 + return GrayscaleSample(image=gray, label=rgb_sample.label) 34 + 35 + @rgb_to_grayscale.putter 36 + def grayscale_to_rgb(gray: GrayscaleSample, rgb: RGBSample) -> RGBSample: 37 + # Convert back to RGB (approximate) 38 + rgb_img = cv2.cvtColor(gray.image, cv2.COLOR_GRAY2RGB) 39 + return RGBSample(image=rgb_img, label=gray.label) 40 + ``` 41 + 42 + ## Options 43 + 44 + ### Option 1: Code References Only (GitHub/GitLab + Commit Hash) ⭐ RECOMMENDED 45 + 46 + **Description**: Store only references to code in version control repositories 47 + 48 + **Record Format**: 49 + ```json 50 + { 51 + "getterCode": { 52 + "kind": "reference", 53 + "repository": "https://github.com/alice/lenses", 54 + "commit": "a1b2c3d4e5f6789...", 55 + "path": "lenses/vision.py:rgb_to_grayscale" 56 + }, 57 + "putterCode": { 58 + "kind": "reference", 59 + "repository": "https://github.com/alice/lenses", 60 + "commit": "a1b2c3d4e5f6789...", 61 + "path": "lenses/vision.py:grayscale_to_rgb" 62 + } 63 + } 64 + ``` 65 + 66 + **Pros**: 67 + - ✅ **Secure**: No arbitrary code execution from ATProto records 68 + - ✅ **Verifiable**: Commit hash ensures immutability 69 + - ✅ **Auditable**: Users can review code before using 70 + - ✅ **Version controlled**: Natural versioning through git 71 + - ✅ **Professional workflow**: Encourages proper development practices 72 + 73 + **Cons**: 74 + - ❌ External dependency (repo could disappear) 75 + - ❌ Requires users to have code in public/accessible repos 76 + - ❌ Need to clone/fetch repos to use lenses 77 + - ❌ Less convenient than self-contained records 78 + 79 + **Security**: ⭐⭐⭐⭐⭐ Excellent 80 + **Convenience**: ⭐⭐⭐ Good 81 + **Implementation Effort**: Low-Medium 82 + 83 + --- 84 + 85 + ### Option 2: Inline Python Code with Sandboxing 86 + 87 + **Description**: Store Python source code directly in records, execute in sandbox 88 + 89 + **Record Format**: 90 + ```json 91 + { 92 + "getterCode": { 93 + "kind": "python", 94 + "source": "def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample:\n ..." 95 + } 96 + } 97 + ``` 98 + 99 + **Pros**: 100 + - ✅ Self-contained records 101 + - ✅ No external dependencies 102 + - ✅ More convenient for users 103 + - ✅ Easier discovery and exploration 104 + 105 + **Cons**: 106 + - ❌ **MAJOR SECURITY RISK**: Executing untrusted code 107 + - ❌ Sandboxing Python is extremely difficult 108 + - ❌ Even with sandboxing, attack surface is large 109 + - ❌ `eval()`/`exec()` considered harmful 110 + - ❌ Would need extensive review and testing 111 + - ❌ Potential for malicious code injection 112 + 113 + **Security**: ⭐ Very Poor (even with sandboxing) 114 + **Convenience**: ⭐⭐⭐⭐⭐ Excellent 115 + **Implementation Effort**: Very High (sandboxing is complex) 116 + 117 + **Why Sandboxing is Hard**: 118 + - Python has many ways to break out of sandboxes 119 + - Import system, file I/O, network access all need blocking 120 + - `__import__`, `eval`, `exec`, `compile`, `open`, etc. 121 + - Even readonly access can leak sensitive data 122 + - See: [PyPy sandbox](https://doc.pypy.org/en/latest/sandbox.html) - discontinued 123 + 124 + --- 125 + 126 + ### Option 3: Bytecode or AST Representation 127 + 128 + **Description**: Store compiled bytecode or AST instead of source 129 + 130 + **Pros**: 131 + - ✅ Slightly safer than raw source (no syntax injection) 132 + - ✅ Self-contained 133 + 134 + **Cons**: 135 + - ❌ Still executes arbitrary code - same security issues 136 + - ❌ Harder to audit than source 137 + - ❌ Platform/version dependent (Python bytecode changes) 138 + - ❌ Complex to implement 139 + - ❌ Doesn't solve the fundamental problem 140 + 141 + **Security**: ⭐⭐ Poor 142 + **Convenience**: ⭐⭐ Poor (less readable) 143 + **Implementation Effort**: High 144 + 145 + --- 146 + 147 + ### Option 4: Metadata Only (Manual Implementation) 148 + 149 + **Description**: Store only metadata about transformations, require manual implementation 150 + 151 + **Record Format**: 152 + ```json 153 + { 154 + "description": "Converts RGB images to grayscale", 155 + "getterSignature": "(RGBSample) -> GrayscaleSample", 156 + "putterSignature": "(GrayscaleSample, RGBSample) -> RGBSample" 157 + } 158 + ``` 159 + 160 + **Pros**: 161 + - ✅ Completely safe 162 + - ✅ Simple to implement 163 + 164 + **Cons**: 165 + - ❌ Lenses not actually usable 166 + - ❌ Defeats the purpose of publishing transformations 167 + - ❌ No network effect (can't compose lenses) 168 + 169 + **Security**: ⭐⭐⭐⭐⭐ Excellent 170 + **Convenience**: ⭐ Very Poor 171 + **Implementation Effort**: Very Low 172 + 173 + ## Recommendation: Option 1 (Code References Only) 174 + 175 + **Rationale**: 176 + 177 + 1. **Security First**: We cannot compromise on security. Publishing executable code to a public network is extremely dangerous without proper safeguards. 178 + 179 + 2. **Verifiable and Auditable**: With commit hashes, users can: 180 + - Review the exact code before execution 181 + - Verify it hasn't been tampered with 182 + - Make informed trust decisions 183 + 184 + 3. **Professional Workflow**: Requiring code in version control: 185 + - Encourages good practices (testing, documentation) 186 + - Makes lens development collaborative 187 + - Enables code review 188 + 189 + 4. **Future Extensibility**: We can add inline code later if we solve sandboxing, but we can't easily remove it once added. 190 + 191 + ## Implementation Plan 192 + 193 + If we choose Option 1: 194 + 195 + 1. **Lexicon Design** (Phase 1) 196 + ```json 197 + "transformCode": { 198 + "type": "union", 199 + "refs": ["#codeReference"] 200 + }, 201 + "codeReference": { 202 + "type": "object", 203 + "required": ["kind", "repository", "commit", "path"], 204 + "properties": { 205 + "kind": {"type": "string", "const": "reference"}, 206 + "repository": {"type": "string", "maxLength": 500}, 207 + "commit": {"type": "string", "maxLength": 40}, 208 + "path": {"type": "string", "maxLength": 500} 209 + } 210 + } 211 + ``` 212 + 213 + 2. **Lens Publisher** (Phase 2) 214 + - Automatically detect git repo and commit from function location 215 + - Validate that repo is accessible 216 + - Include function name and module path 217 + 218 + 3. **Lens Loader** (Phase 2) 219 + - Clone/fetch repository at specified commit 220 + - Import function from specified path 221 + - Cache cloned repos locally 222 + - Verify function signatures match schema 223 + 224 + 4. **Trust Model** 225 + - Users explicitly approve which repos to trust 226 + - Whitelist/blacklist mechanism 227 + - Warn on first use of any lens 228 + 229 + ## Alternative Approaches Considered 230 + 231 + **Signed inline code**: Store inline code with cryptographic signatures 232 + - Still has execution risk 233 + - Signature only proves authorship, not safety 234 + - Not recommended 235 + 236 + **WASM modules**: Compile transformations to WebAssembly 237 + - More sandboxed than Python 238 + - Very complex to implement 239 + - Would require rewriting lenses in Rust/C++ 240 + - Interesting future direction but not for Phase 1 241 + 242 + ## User Experience Implications 243 + 244 + **Publishing a Lens**: 245 + ```python 246 + # 1. Write lens code in your repo 247 + # lenses/vision.py 248 + @atdata.lens 249 + def rgb_to_grayscale(rgb: RGBSample) -> GrayscaleSample: 250 + ... 251 + 252 + # 2. Commit and push 253 + git add lenses/vision.py 254 + git commit -m "Add RGB to grayscale lens" 255 + git push 256 + 257 + # 3. Publish to ATProto (automatically detects git info) 258 + client = ATProtoClient() 259 + client.login("alice.bsky.social", "password") 260 + 261 + lens_publisher = LensPublisher(client) 262 + lens_uri = lens_publisher.publish_lens( 263 + rgb_to_grayscale, 264 + source_schema_uri="at://alice/schema/rgb", 265 + target_schema_uri="at://alice/schema/gray" 266 + ) 267 + ``` 268 + 269 + **Using a Lens**: 270 + ```python 271 + # 1. Discover lens 272 + loader = LensLoader(client) 273 + lenses = loader.search_lenses( 274 + source_schema="at://alice/schema/rgb", 275 + target_schema="at://alice/schema/gray" 276 + ) 277 + 278 + # 2. User reviews the repo/code (outside tool) 279 + # 3. User approves the repo 280 + 281 + # 4. Load and use lens 282 + rgb_to_gray = loader.load_lens(lenses[0]['uri']) 283 + gray_sample = rgb_to_gray(rgb_sample) 284 + ``` 285 + 286 + ## Security Considerations 287 + 288 + Even with code references: 289 + - **Malicious repos**: Users could reference repos with malicious code 290 + - **Mitigation**: Explicit user approval, warnings, sandboxing (future) 291 + 292 + - **Repo compromise**: Git repos could be hacked 293 + - **Mitigation**: Commit hash pins exact version, users can audit 294 + 295 + - **Dependency injection**: Lens code could import malicious packages 296 + - **Mitigation**: Users review code, standard Python security practices 297 + 298 + ## Future Enhancements 299 + 300 + **If we want inline code later**: 301 + 1. Build robust Python sandbox (e.g., using PyPy, restrictedpython) 302 + 2. Add extensive security testing 303 + 3. Implement strict permissions model 304 + 4. Use WebAssembly for true isolation 305 + 5. Add code signing and reputation system 306 + 307 + **For now**: Start with references, prove the concept, add inline code only if there's strong demand and we can do it safely. 308 + 309 + ## Open Questions 310 + 311 + 1. **Private repositories**: How to handle lenses in private repos? 312 + - Could support auth tokens (stored locally, not in record) 313 + - Could use SSH keys 314 + - Recommendation: Public repos only for Phase 1 315 + 316 + 2. **Repository availability**: What if repo goes offline? 317 + - Could encourage mirrors 318 + - Could cache code (with user permission) 319 + - Recommendation: Accept the risk, it's part of decentralization 320 + 321 + 3. **Non-Python lenses**: What about TypeScript, Rust, etc.? 322 + - References work for any language 323 + - Each language would need its own loader 324 + - Recommendation: Python-only for Phase 1 325 + 326 + ## Success Criteria 327 + 328 + After implementing this decision: 329 + - ✅ Lenses can be published with code references 330 + - ✅ Users can load and execute lenses from approved repos 331 + - ✅ No arbitrary code execution from untrusted sources 332 + - ✅ Lens records include immutable commit hashes 333 + - ✅ Clear warnings when using external code 334 + 335 + ## References 336 + 337 + - Lexicon design: `../02_lexicon_design.md` (Lens Record Lexicon) 338 + - Python client implementation: `../03_python_client.md` (LensPublisher) 339 + - Security best practices: Python security guide 340 + 341 + --- 342 + 343 + **Decision Needed By**: Before starting Phase 1 Issue #24 (Lens Lexicon design) 344 + **Decision Maker**: Project maintainer (max) 345 + **Date Created**: 2026-01-07

+366

.planning/decisions/03_webdataset_storage.md

··· 1 + # Decision: WebDataset Storage Strategy 2 + 3 + **Issue**: #47 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + Let's build the hybrid approach in from the beginning. Critically: 11 + 12 + * We'll keep track of whether dataset index records are referencing an external storage (S3, R2, etc) by URL or a PDS blob using an open union to define the data location 13 + * In the AppView implementation, we can proxy WDS urls for datasets across individual stored blobs, which streamlines some of the design. 14 + 15 + This will help us be robust from the start -- particularly for those self-hosting. 16 + 17 + --- 18 + 19 + ## Problem Statement 20 + 21 + We need to decide where the actual WebDataset `.tar` files are stored and how dataset records reference them. This affects decentralization, reliability, and scalability. 22 + 23 + ## Context 24 + 25 + WebDataset files are: 26 + - **Large**: Typically gigabytes to terabytes 27 + - **Immutable**: Once created, datasets rarely change 28 + - **Sharded**: Split across multiple `.tar` files (e.g., `data-{000000..000099}.tar`) 29 + - **Binary**: Contain msgpack-serialized samples with images/arrays 30 + 31 + Current `atdata` usage: 32 + ```python 33 + # External storage (S3, HTTP, etc.) 34 + dataset = Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar") 35 + ``` 36 + 37 + ## Options 38 + 39 + ### Option 1: External Storage with URL References ⭐ RECOMMENDED (Phase 1) 40 + 41 + **Description**: Store WebDataset files on existing storage (S3, HTTP, IPFS, etc.), record only contains URLs 42 + 43 + **Record Format**: 44 + ```json 45 + { 46 + "$type": "app.bsky.atdata.dataset", 47 + "name": "CIFAR-10 Training Set", 48 + "urls": [ 49 + "s3://my-bucket/cifar10-train-{000000..000049}.tar" 50 + ], 51 + "schemaRef": "at://alice/schema/image", 52 + ... 53 + } 54 + ``` 55 + 56 + **Supported URL Schemes**: 57 + - `s3://` - AWS S3 and compatible (MinIO, DigitalOcean Spaces) 58 + - `https://` - HTTP/HTTPS servers 59 + - `gs://` - Google Cloud Storage 60 + - `ipfs://` - IPFS (decentralized, content-addressed) 61 + - `file://` - Local files (for development) 62 + 63 + **Pros**: 64 + - ✅ **No size limits**: Store datasets of any size 65 + - ✅ **Existing infrastructure**: Leverage proven storage solutions 66 + - ✅ **No ATProto storage costs**: Publishers pay for their own storage 67 + - ✅ **Performance**: Use CDNs, regional endpoints, etc. 68 + - ✅ **Compatibility**: Works with current `atdata` code 69 + - ✅ **Flexibility**: Different storage for different use cases 70 + 71 + **Cons**: 72 + - ❌ **Centralization risk**: If storage provider goes down, dataset unavailable 73 + - ❌ **URL rot**: Links can break over time 74 + - ❌ **No permanence guarantee**: Publisher can delete files 75 + - ❌ **Access control complexity**: Need to handle auth for private datasets 76 + 77 + **Decentralization**: ⭐⭐ Fair (better with IPFS) 78 + **Reliability**: ⭐⭐⭐ Good (depends on storage provider) 79 + **Cost**: ⭐⭐⭐⭐ Excellent (publishers pay storage costs) 80 + **Implementation Effort**: ⭐⭐⭐⭐⭐ Very Low (already supported) 81 + 82 + --- 83 + 84 + ### Option 2: ATProto Blob Storage 85 + 86 + **Description**: Store WebDataset files as ATProto blobs, record contains blob CIDs 87 + 88 + **Record Format**: 89 + ```json 90 + { 91 + "$type": "app.bsky.atdata.dataset", 92 + "name": "Small Dataset", 93 + "blobs": [ 94 + {"$type": "blob", "ref": {"$link": "bafyrei..."}}, 95 + {"$type": "blob", "ref": {"$link": "bafyrei..."}} 96 + ], 97 + "schemaRef": "at://alice/schema/image", 98 + ... 99 + } 100 + ``` 101 + 102 + **Pros**: 103 + - ✅ **True decentralization**: Data lives on ATProto network 104 + - ✅ **Content-addressed**: CIDs guarantee immutability 105 + - ✅ **Permanence**: As permanent as ATProto itself 106 + - ✅ **No external dependencies**: Self-contained 107 + 108 + **Cons**: 109 + - ❌ **Size limits**: ATProto may have blob size restrictions (need to verify) 110 + - ❌ **Storage costs**: Who pays for storing large datasets? 111 + - ❌ **Performance**: May be slower than specialized data storage 112 + - ❌ **Scalability**: Not designed for TB-scale datasets 113 + - ❌ **Unknown limitations**: ATProto blob storage is less proven for this use case 114 + 115 + **Decentralization**: ⭐⭐⭐⭐⭐ Excellent 116 + **Reliability**: ⭐⭐⭐⭐ Very Good (ATProto network) 117 + **Cost**: ⭐ Poor (storage costs for large datasets) 118 + **Implementation Effort**: ⭐⭐⭐ Medium (need to implement blob upload/download) 119 + 120 + --- 121 + 122 + ### Option 3: Hybrid Approach 123 + 124 + **Description**: Support both external URLs and ATProto blobs 125 + 126 + **Record Format**: 127 + ```json 128 + { 129 + "$type": "app.bsky.atdata.dataset", 130 + "name": "Hybrid Dataset", 131 + "storage": { 132 + "kind": "external", 133 + "urls": ["s3://bucket/data-{000000..000009}.tar"] 134 + }, 135 + // OR 136 + "storage": { 137 + "kind": "blobs", 138 + "blobs": [{"$type": "blob", "ref": {"$link": "bafyrei..."}}] 139 + }, 140 + ... 141 + } 142 + ``` 143 + 144 + **Pros**: 145 + - ✅ Best of both worlds 146 + - ✅ Flexibility for different use cases 147 + - ✅ Can migrate between storage types 148 + 149 + **Cons**: 150 + - ❌ More complex Lexicon and implementation 151 + - ❌ Confusing for users (which to choose?) 152 + - ❌ Testing burden (need to test both paths) 153 + 154 + **Implementation Effort**: ⭐⭐ High (two systems to maintain) 155 + 156 + ## Recommendation: Option 1 (External URLs) for Phase 1, Option 3 (Hybrid) for Future 157 + 158 + **Rationale**: 159 + 160 + 1. **Pragmatism**: Most ML datasets are huge (10GB-10TB). ATProto blob storage is not designed for this scale. 161 + 162 + 2. **Existing Infrastructure**: S3, GCS, HTTP are battle-tested for large file storage. Why reinvent the wheel? 163 + 164 + 3. **Cost Model**: Publishers pay for their own storage. This is sustainable and aligns incentives. 165 + 166 + 4. **IPFS for Decentralization**: Users who want decentralization can use `ipfs://` URLs, which are content-addressed and distributed. 167 + 168 + 5. **Future-Proof**: We can add blob storage later for small datasets (<100MB) without breaking existing datasets. 169 + 170 + ## Implementation Plan 171 + 172 + ### Phase 1: External URLs Only 173 + 174 + **Lexicon Design**: 175 + ```json 176 + { 177 + "urls": { 178 + "type": "array", 179 + "description": "WebDataset URLs (supports brace notation)", 180 + "items": { 181 + "type": "string", 182 + "format": "uri", 183 + "maxLength": 1000 184 + }, 185 + "minLength": 1 186 + } 187 + } 188 + ``` 189 + 190 + **Publisher Implementation**: 191 + ```python 192 + publisher = DatasetPublisher(client) 193 + dataset_uri = publisher.publish_dataset( 194 + dataset, 195 + name="My Dataset", 196 + description="Training data for my model" 197 + ) 198 + # dataset.url is used directly, no upload needed 199 + ``` 200 + 201 + **Loader Implementation**: 202 + ```python 203 + loader = DatasetLoader(client) 204 + dataset = loader.load_dataset("at://alice/dataset/123") 205 + # Creates Dataset with URL from record 206 + # Actual data loading happens lazily via WebDataset 207 + ``` 208 + 209 + **Validation**: 210 + - Check URL format (scheme + netloc + path) 211 + - Support brace notation for sharded datasets 212 + - Don't validate URL accessibility (too slow, may be private) 213 + 214 + ### Future: Add Blob Storage Option 215 + 216 + When ATProto blob storage is more mature and we understand limits: 217 + 218 + 1. **Add blob support to Lexicon**: 219 + ```json 220 + "storage": { 221 + "type": "union", 222 + "refs": ["#urlStorage", "#blobStorage"] 223 + } 224 + ``` 225 + 226 + 2. **Implement blob upload**: 227 + - Chunk large files 228 + - Upload shards as separate blobs 229 + - Update record with blob CIDs 230 + 231 + 3. **Size recommendations**: 232 + - Datasets <100MB → Consider blobs 233 + - Datasets >100MB → Use external URLs 234 + - Datasets >10GB → Definitely external URLs 235 + 236 + ## URL Scheme Support 237 + 238 + | Scheme | Support | Notes | 239 + |--------|---------|-------| 240 + | `s3://` | ✅ Phase 1 | AWS S3 and compatible services | 241 + | `https://` | ✅ Phase 1 | Public HTTP/HTTPS servers | 242 + | `http://` | ✅ Phase 1 | Upgraded to HTTPS when possible | 243 + | `gs://` | ✅ Phase 1 | Google Cloud Storage | 244 + | `ipfs://` | ✅ Phase 1 | Decentralized storage via IPFS | 245 + | `file://` | ✅ Phase 1 | Local development only | 246 + | `at://` | ⏳ Future | ATProto blob references | 247 + 248 + ## Decentralization Strategy 249 + 250 + For users who want decentralization without ATProto blobs: 251 + 252 + **IPFS + Pinning Services**: 253 + 1. Upload dataset to IPFS 254 + 2. Pin with service (Pinata, Infura, Web3.Storage) 255 + 3. Publish dataset with `ipfs://` URL 256 + 4. IPFS ensures content-addressed, distributed storage 257 + 258 + **Example**: 259 + ```python 260 + # Upload to IPFS (using ipfs client) 261 + ipfs_hash = upload_to_ipfs("data-000000.tar") 262 + 263 + # Publish dataset 264 + dataset_uri = publisher.publish_dataset( 265 + dataset, 266 + name="My Dataset", 267 + urls=[f"ipfs://{ipfs_hash}"] 268 + ) 269 + ``` 270 + 271 + **Benefits**: 272 + - Content-addressed (CID in URL) 273 + - Distributed (IPFS network) 274 + - Permanent (with pinning) 275 + - No ATProto blob limits 276 + 277 + ## Access Control Considerations 278 + 279 + **Public datasets**: URLs point to public storage 280 + - S3 public buckets 281 + - Public HTTP servers 282 + - IPFS (inherently public) 283 + 284 + **Private datasets**: URL points to private storage 285 + - S3 with authentication (pre-signed URLs? credentials?) 286 + - Private HTTP servers (auth tokens?) 287 + - Recommendation: Public datasets only for Phase 1 288 + 289 + **Future**: Could add access control metadata to records 290 + ```json 291 + { 292 + "access": { 293 + "kind": "authenticated", 294 + "requiredRole": "subscriber" 295 + } 296 + } 297 + ``` 298 + 299 + ## Storage Cost Implications 300 + 301 + | Storage Type | Cost Responsibility | Pros | Cons | 302 + |-------------|-------------------|------|------| 303 + | S3 | Publisher | Industry standard, reliable | Ongoing costs | 304 + | IPFS + Pinning | Publisher | Decentralized | Need pinning service | 305 + | HTTP Server | Publisher | Full control | Maintenance burden | 306 + | ATProto Blobs | Publisher? ATProto? | Simple | Unknown cost model | 307 + 308 + **Recommendation**: Let publishers choose based on their needs and budget. 309 + 310 + ## Alternative Approaches Considered 311 + 312 + **Torrents**: Use BitTorrent protocol 313 + - Pros: Decentralized, efficient for large files 314 + - Cons: Need seeders, not as well integrated 315 + - Could add in future with `torrent://` scheme 316 + 317 + **Arweave**: Permanent storage blockchain 318 + - Pros: True permanence, one-time payment 319 + - Cons: Expensive for large datasets 320 + - Could add in future for critical datasets 321 + 322 + ## Open Questions 323 + 324 + 1. **Should we validate URL accessibility when publishing?** 325 + - Pro: Catch broken links early 326 + - Con: Slow, may fail for private URLs 327 + - Recommendation: No validation, trust publishers 328 + 329 + 2. **Should we mirror datasets automatically?** 330 + - Could create community mirrors for popular datasets 331 + - Recommendation: Not for Phase 1, community can organize 332 + 333 + 3. **What about dataset versioning?** 334 + - New version = new record with new URLs 335 + - Could link to previous version in metadata 336 + - Recommendation: Simple versioning via new records 337 + 338 + 4. **Should we support multi-region URLs?** 339 + ```json 340 + "urls": [ 341 + {"region": "us-east-1", "url": "s3://..."}, 342 + {"region": "eu-west-1", "url": "s3://..."} 343 + ] 344 + ``` 345 + - Recommendation: Defer to future if needed 346 + 347 + ## Success Criteria 348 + 349 + After implementing this decision: 350 + - ✅ Datasets can reference external URLs (S3, HTTPS, IPFS) 351 + - ✅ WebDataset brace notation is preserved 352 + - ✅ Loading datasets works with existing `Dataset` class 353 + - ✅ No breaking changes to current `atdata` usage 354 + - ✅ Path clear for future blob storage support 355 + 356 + ## References 357 + 358 + - Lexicon design: `../02_lexicon_design.md` (Dataset Record Lexicon) 359 + - Python client: `../03_python_client.md` (DatasetPublisher/Loader) 360 + - WebDataset documentation: https://webdataset.github.io/webdataset/ 361 + 362 + --- 363 + 364 + **Decision Needed By**: Before starting Phase 1 Issue #23 (Dataset Lexicon design) 365 + **Decision Maker**: Project maintainer (max) 366 + **Date Created**: 2026-01-07

+497

.planning/decisions/04_schema_evolution.md

··· 1 + # Decision: Schema Evolution and Versioning Strategy 2 + 3 + **Issue**: #48 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation), #39 (Type validation) 6 + **Priority**: High 7 + 8 + ## Problem Statement 9 + 10 + We need to define how PackableSample schemas can evolve over time without breaking existing datasets or code. This includes: 11 + - Version numbering scheme 12 + - Compatibility rules (what changes are allowed?) 13 + - Migration strategies 14 + - Runtime validation 15 + 16 + ## Context 17 + 18 + Schemas will evolve: 19 + - **Adding new fields** (e.g., adding optional metadata) 20 + - **Removing deprecated fields** 21 + - **Changing field types** (e.g., int → float) 22 + - **Changing field constraints** (e.g., making field optional) 23 + 24 + Real-world example: 25 + ```python 26 + # Version 1.0.0 27 + @atdata.packable 28 + class ImageSample: 29 + image: NDArray 30 + label: str 31 + 32 + # Version 1.1.0 - add optional field (backward compatible) 33 + @atdata.packable 34 + class ImageSample: 35 + image: NDArray 36 + label: str 37 + confidence: Optional[float] = None # NEW 38 + 39 + # Version 2.0.0 - remove field (breaking change) 40 + @atdata.packable 41 + class ImageSample: 42 + image: NDArray 43 + # label removed - BREAKING 44 + class_id: int # NEW, replaces label 45 + ``` 46 + 47 + ## Goals 48 + 49 + 1. **Backward compatibility**: Old code can read new data (when possible) 50 + 2. **Forward compatibility**: New code can read old data (when possible) 51 + 3. **Clear breaking changes**: Users know when they need to update 52 + 4. **Safe migrations**: Data transformations are explicit and verifiable 53 + 5. **Developer-friendly**: Easy to understand and use 54 + 55 + ## Versioning Scheme 56 + 57 + ### Semantic Versioning (MAJOR.MINOR.PATCH) 58 + 59 + **Recommendation**: Use semantic versioning for schemas 60 + 61 + ``` 62 + 1.0.0 → 1.0.1 → 1.1.0 → 2.0.0 63 + ``` 64 + 65 + **Version Components**: 66 + - **MAJOR**: Breaking changes (incompatible with previous versions) 67 + - **MINOR**: Backward-compatible additions (new optional fields) 68 + - **PATCH**: Documentation, clarifications, no functional changes 69 + 70 + ### Examples 71 + 72 + ```python 73 + # 1.0.0 → 1.0.1 (PATCH) 74 + # Change: Fixed documentation, added field description 75 + # Compatible: ✅ Yes 76 + # Action: None needed 77 + 78 + # 1.0.0 → 1.1.0 (MINOR) 79 + # Change: Added optional field 'metadata' 80 + # Compatible: ✅ Yes (backward compatible) 81 + # Action: Old code works, new code can use new field 82 + 83 + # 1.0.0 → 2.0.0 (MAJOR) 84 + # Change: Removed field 'old_field' 85 + # Compatible: ❌ No (breaking change) 86 + # Action: Users must migrate or use conversion lens 87 + ``` 88 + 89 + ## Compatibility Rules 90 + 91 + ### Backward-Compatible Changes (MINOR version bump) 92 + 93 + **Allowed**: 94 + - ✅ Adding optional fields 95 + - ✅ Making required field optional 96 + - ✅ Widening type constraints (e.g., relaxing shape requirements) 97 + - ✅ Adding documentation 98 + - ✅ Adding metadata 99 + 100 + **Example**: 101 + ```python 102 + # v1.0.0 103 + class Sample: 104 + x: int 105 + 106 + # v1.1.0 - backward compatible 107 + class Sample: 108 + x: int 109 + y: Optional[int] = None # Added optional field 110 + ``` 111 + 112 + **Guarantee**: Code written for v1.0.0 continues to work with v1.1.0 schemas 113 + 114 + --- 115 + 116 + ### Breaking Changes (MAJOR version bump) 117 + 118 + **Required**: 119 + - ❌ Removing fields 120 + - ❌ Changing field types (str → int) 121 + - ❌ Making optional field required 122 + - ❌ Narrowing type constraints (e.g., restricting shape) 123 + - ❌ Renaming fields 124 + 125 + **Example**: 126 + ```python 127 + # v1.0.0 128 + class Sample: 129 + x: int 130 + y: int 131 + 132 + # v2.0.0 - breaking changes 133 + class Sample: 134 + x: float # Type changed 135 + # y removed 136 + z: int # New required field 137 + ``` 138 + 139 + **Guarantee**: Code written for v1.0.0 will NOT work with v2.0.0 without updates 140 + 141 + --- 142 + 143 + ### Non-Breaking Changes (PATCH version bump) 144 + 145 + **Allowed**: 146 + - ✅ Documentation updates 147 + - ✅ Metadata changes 148 + - ✅ Clarifications 149 + - ✅ Bug fixes in schema definition (not structure) 150 + 151 + **No functional changes to schema structure** 152 + 153 + ## Compatibility Checking 154 + 155 + ### Automatic Compatibility Checker 156 + 157 + Implement `SchemaValidator` to check compatibility: 158 + 159 + ```python 160 + from atdata.codegen import SchemaValidator 161 + 162 + validator = SchemaValidator() 163 + 164 + old_schema = load_schema("at://alice/schema/sample/v1.0.0") 165 + new_schema = load_schema("at://alice/schema/sample/v1.1.0") 166 + 167 + is_compatible, issues = validator.is_compatible(old_schema, new_schema) 168 + 169 + if not is_compatible: 170 + print("Incompatibilities found:") 171 + for issue in issues: 172 + print(f" - {issue}") 173 + ``` 174 + 175 + **Checks**: 176 + 1. Field additions/removals 177 + 2. Type changes 178 + 3. Optional → Required changes 179 + 4. Shape constraint changes 180 + 181 + See `../05_codegen.md` for implementation details. 182 + 183 + ### Version Constraints in Dataset Records 184 + 185 + Datasets can specify schema version constraints: 186 + 187 + ```json 188 + { 189 + "$type": "app.bsky.atdata.dataset", 190 + "schemaRef": "at://alice/schema/sample/v1.0.0", 191 + "schemaVersionConstraint": ">=1.0.0,<2.0.0", 192 + ... 193 + } 194 + ``` 195 + 196 + **Semantics**: 197 + - Dataset created with v1.0.0 198 + - Compatible with v1.x.x (minor/patch updates) 199 + - NOT compatible with v2.x.x (breaking changes) 200 + 201 + ## Migration Strategies 202 + 203 + ### Option 1: Lenses as Migration Paths ⭐ RECOMMENDED 204 + 205 + **Concept**: Use Lens transformations to migrate between schema versions 206 + 207 + ```python 208 + # Migration lens: v1.0.0 → v2.0.0 209 + @atdata.lens 210 + def sample_v1_to_v2(v1: SampleV1) -> SampleV2: 211 + """Migrate from v1.0.0 to v2.0.0""" 212 + return SampleV2( 213 + x=float(v1.x), # int → float 214 + z=hash(v1.y) % 100 # derive z from removed y 215 + ) 216 + 217 + @sample_v1_to_v2.putter 218 + def sample_v2_to_v1(v2: SampleV2, v1: SampleV1) -> SampleV1: 219 + """Reverse migration (lossy)""" 220 + return SampleV1( 221 + x=int(v2.x), 222 + y=0 # Can't recover removed field 223 + ) 224 + ``` 225 + 226 + **Benefits**: 227 + - ✅ Reuses existing Lens infrastructure 228 + - ✅ Explicit transformation logic 229 + - ✅ Bidirectional (when possible) 230 + - ✅ Publishable and discoverable 231 + 232 + **Limitations**: 233 + - ❌ May be lossy (can't always reverse) 234 + - ❌ Requires manual implementation 235 + 236 + --- 237 + 238 + ### Option 2: Automatic Migration 239 + 240 + **Concept**: Generate migrations automatically based on schema diff 241 + 242 + ```python 243 + migrator = SchemaM migrator() 244 + v2_sample = migrator.migrate(v1_sample, target_version="2.0.0") 245 + ``` 246 + 247 + **Benefits**: 248 + - ✅ Convenient for users 249 + - ✅ No manual code needed 250 + 251 + **Limitations**: 252 + - ❌ Only works for simple changes (add/remove optional fields) 253 + - ❌ Can't handle complex transformations (type changes) 254 + - ❌ Risk of incorrect assumptions 255 + 256 + **Recommendation**: Could implement for simple cases, but Lenses are more general 257 + 258 + --- 259 + 260 + ### Option 3: Manual Migration Scripts 261 + 262 + **Concept**: Users write custom migration scripts 263 + 264 + **Benefits**: 265 + - ✅ Full control 266 + 267 + **Limitations**: 268 + - ❌ Not publishable/discoverable 269 + - ❌ No standardization 270 + 271 + **Recommendation**: Allow as fallback, but encourage Lenses 272 + 273 + ## Runtime Validation 274 + 275 + ### Sample Validation Against Schema 276 + 277 + ```python 278 + from atdata.codegen import TypeValidator 279 + 280 + validator = TypeValidator() 281 + schema = load_schema("at://alice/schema/sample/v1.0.0") 282 + 283 + # Validate sample 284 + sample = SampleV1(x=42, y=100) 285 + is_valid, errors = validator.validate(sample, schema) 286 + 287 + if not is_valid: 288 + print("Validation errors:") 289 + for error in errors: 290 + print(f" - {error}") 291 + ``` 292 + 293 + **Checks**: 294 + 1. All required fields present 295 + 2. Field types match 296 + 3. NDArray dtypes match (if specified) 297 + 4. NDArray shapes match (if specified) 298 + 299 + **When to validate**: 300 + - ❓ Every sample creation? (slow) 301 + - ✅ On dataset write? (good balance) 302 + - ✅ On user request (explicit validation) 303 + 304 + **Recommendation**: Validate on write, make runtime validation optional 305 + 306 + ## Schema Record Versioning 307 + 308 + ### Version Field in Schema Records 309 + 310 + ```json 311 + { 312 + "$type": "app.bsky.atdata.schema", 313 + "name": "ImageSample", 314 + "version": "1.1.0", # Semantic version 315 + ... 316 + } 317 + ``` 318 + 319 + ### Publishing New Versions 320 + 321 + **Option A**: New record for each version (RECOMMENDED) 322 + ``` 323 + at://alice/schema/imagesample/v1.0.0 # Version 1.0.0 324 + at://alice/schema/imagesample/v1.1.0 # Version 1.1.0 325 + at://alice/schema/imagesample/v2.0.0 # Version 2.0.0 326 + ``` 327 + 328 + **Pros**: 329 + - ✅ Immutable versions 330 + - ✅ Easy to reference specific versions 331 + - ✅ No breaking changes to existing references 332 + 333 + **Cons**: 334 + - ❌ More records to manage 335 + - ❌ Harder to find "latest" version 336 + 337 + **Option B**: Update existing record 338 + ``` 339 + at://alice/schema/imagesample # Always points to latest 340 + ``` 341 + 342 + **Pros**: 343 + - ✅ Single canonical reference 344 + - ✅ Easy to find latest 345 + 346 + **Cons**: 347 + - ❌ Breaks immutability 348 + - ❌ References become ambiguous over time 349 + 350 + **Recommendation**: Option A (new record per version), with metadata linking to previous versions 351 + 352 + ### Linking Versions 353 + 354 + ```json 355 + { 356 + "$type": "app.bsky.atdata.schema", 357 + "name": "ImageSample", 358 + "version": "2.0.0", 359 + "metadata": { 360 + "previousVersion": "at://alice/schema/imagesample/v1.1.0", 361 + "migrationLens": "at://alice/lens/imagesample-v1-to-v2" 362 + }, 363 + ... 364 + } 365 + ``` 366 + 367 + ## Developer Workflow 368 + 369 + ### Publishing a New Schema Version 370 + 371 + ```python 372 + # 1. Define new version 373 + @atdata.packable 374 + class ImageSampleV2: 375 + image: NDArray 376 + label: str 377 + confidence: Optional[float] = None # NEW 378 + 379 + # 2. Publish with version 380 + schema_uri = publisher.publish_schema( 381 + ImageSampleV2, 382 + name="ImageSample", 383 + version="1.1.0", # MINOR bump 384 + metadata={ 385 + "previousVersion": "at://alice/schema/imagesample/v1.0.0" 386 + } 387 + ) 388 + 389 + # 3. Optionally publish migration lens 390 + migration_lens = publisher.publish_lens( 391 + v1_to_v2_lens, 392 + source_schema_uri="at://alice/schema/imagesample/v1.0.0", 393 + target_schema_uri=schema_uri, 394 + name="ImageSample v1→v2 Migration" 395 + ) 396 + ``` 397 + 398 + ### Using Versioned Schemas 399 + 400 + ```python 401 + # Load specific version 402 + schema = loader.get_schema("at://alice/schema/imagesample/v1.0.0") 403 + 404 + # Check compatibility 405 + is_compatible = validator.is_compatible( 406 + "at://alice/schema/imagesample/v1.0.0", 407 + "at://alice/schema/imagesample/v2.0.0" 408 + ) 409 + 410 + # Find migration path 411 + migration = loader.find_migration( 412 + source="at://alice/schema/imagesample/v1.0.0", 413 + target="at://alice/schema/imagesample/v2.0.0" 414 + ) 415 + ``` 416 + 417 + ## Tooling Support 418 + 419 + ### CLI Commands 420 + 421 + ```bash 422 + # Check schema compatibility 423 + atdata schema diff \ 424 + at://alice/schema/sample/v1.0.0 \ 425 + at://alice/schema/sample/v2.0.0 426 + 427 + # Validate sample against schema 428 + atdata validate mysample.msgpack \ 429 + --schema at://alice/schema/sample/v1.0.0 430 + 431 + # Find migration path 432 + atdata schema migrate \ 433 + --from at://alice/schema/sample/v1.0.0 \ 434 + --to at://alice/schema/sample/v2.0.0 435 + ``` 436 + 437 + ### IDE Support (Future) 438 + 439 + - Autocomplete for schema versions 440 + - Warnings for compatibility issues 441 + - Quick fixes for migrations 442 + 443 + ## Open Questions 444 + 445 + 1. **Should we auto-bump versions on publish?** 446 + - Detect changes, suggest version bump? 447 + - Recommendation: Manual for Phase 1, auto-suggest later 448 + 449 + 2. **How to handle shape evolution for NDArray?** 450 + ```python 451 + # v1: image shape [224, 224, 3] 452 + # v2: image shape [256, 256, 3] # Breaking or not? 453 + ``` 454 + - If shape is documented (not enforced), this could be minor 455 + - If shape is validated, this is breaking 456 + - Recommendation: Document only initially 457 + 458 + 3. **Should we support version ranges in schema refs?** 459 + ```json 460 + "schemaRef": "at://alice/schema/sample@^1.0.0" # npm-style 461 + ``` 462 + - Pro: More flexible 463 + - Con: Ambiguous (which exact version?) 464 + - Recommendation: Explicit versions only for Phase 1 465 + 466 + 4. **What about deprecated fields?** 467 + ```python 468 + class Sample: 469 + x: int 470 + y: int # @deprecated: Use z instead 471 + z: Optional[int] = None 472 + ``` 473 + - Could add deprecation warnings 474 + - Could track in schema metadata 475 + - Recommendation: Metadata only for Phase 1 476 + 477 + ## Success Criteria 478 + 479 + After implementing this decision: 480 + - ✅ Schemas use semantic versioning 481 + - ✅ Compatibility rules are clear and documented 482 + - ✅ Compatibility checker validates schema changes 483 + - ✅ Lenses can be used for migrations 484 + - ✅ Dataset records can specify version constraints 485 + - ✅ Breaking changes require major version bump 486 + 487 + ## References 488 + 489 + - Code generation: `../05_codegen.md` (SchemaValidator, TypeValidator) 490 + - Lexicon design: `../02_lexicon_design.md` (Schema versioning) 491 + - Lens transformations: `02_lens_code_storage.md` 492 + 493 + --- 494 + 495 + **Decision Needed By**: Before Phase 4 Issue #39 (Type validation) 496 + **Decision Maker**: Project maintainer (max) 497 + **Date Created**: 2026-01-07

+380

.planning/decisions/05_lexicon_namespace.md

··· 1 + # Decision: Lexicon Namespace and NSID Structure 2 + 3 + **Issue**: #49 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + We're going to use an org NSID for the steward organization as the base: 11 + 12 + ``` 13 + ac.foundation.data.* 14 + ``` 15 + 16 + --- 17 + 18 + ## Problem Statement 19 + 20 + We need to finalize the namespace (NSID - Namespaced Identifier) for atdata Lexicons. This is a critical decision because: 21 + - NSIDs are permanent and hard to change 22 + - They affect discoverability and organization 23 + - They may require coordination with ATProto/Bluesky team 24 + 25 + ## Context 26 + 27 + ATProto NSIDs follow reverse domain notation: 28 + ``` 29 + app.bsky.feed.post # Bluesky official feed posts 30 + com.example.myapp.record # Third-party app 31 + ``` 32 + 33 + We need NSIDs for three record types: 34 + 1. Schema records (PackableSample definitions) 35 + 2. Dataset records (dataset indexes) 36 + 3. Lens records (transformations) 37 + 38 + ## Current Proposal 39 + 40 + ``` 41 + app.bsky.atdata.schema # PackableSample schema records 42 + app.bsky.atdata.dataset # Dataset index records 43 + app.bsky.atdata.lens # Lens transformation records 44 + ``` 45 + 46 + ## Options 47 + 48 + ### Option 1: `app.bsky.atdata.*` (Current Proposal) 49 + 50 + **Full NSIDs**: 51 + - `app.bsky.atdata.schema` 52 + - `app.bsky.atdata.dataset` 53 + - `app.bsky.atdata.lens` 54 + 55 + **Pros**: 56 + - ✅ Under Bluesky ecosystem umbrella 57 + - ✅ High visibility and discoverability 58 + - ✅ Official-looking namespace 59 + - ✅ Good for adoption 60 + 61 + **Cons**: 62 + - ❌ May require approval from Bluesky team 63 + - ❌ `app.bsky.*` typically for official Bluesky apps 64 + - ❌ Could be rejected or need to change later 65 + - ❌ Implies Bluesky endorsement/ownership 66 + 67 + **Risk**: ⚠️ Medium (may need to change if not approved) 68 + 69 + --- 70 + 71 + ### Option 2: `io.atdata.*` or `org.atdata.*` 72 + 73 + **Full NSIDs**: 74 + - `io.atdata.schema` 75 + - `io.atdata.dataset` 76 + - `io.atdata.lens` 77 + 78 + **Pros**: 79 + - ✅ Independent namespace 80 + - ✅ No approval needed 81 + - ✅ Clear ownership (atdata project) 82 + - ✅ Can use immediately 83 + 84 + **Cons**: 85 + - ❌ Less discoverable (not under Bluesky) 86 + - ❌ Appears less "official" 87 + - ❌ Need to own atdata.io domain (or just use anyway?) 88 + 89 + **Risk**: ⭐ Low (we control it) 90 + 91 + --- 92 + 93 + ### Option 3: `app.bsky.atproto.atdata.*` (Nested) 94 + 95 + **Full NSIDs**: 96 + - `app.bsky.atproto.atdata.schema` 97 + - `app.bsky.atproto.atdata.dataset` 98 + - `app.bsky.atproto.atdata.lens` 99 + 100 + **Pros**: 101 + - ✅ Still under Bluesky but more specific 102 + - ✅ Groups with other ATProto-related Lexicons 103 + - ✅ Less likely to conflict 104 + 105 + **Cons**: 106 + - ❌ Longer NSIDs 107 + - ❌ Awkward naming (`atproto.atdata`?) 108 + - ❌ Still may need approval 109 + 110 + **Risk**: ⚠️ Medium 111 + 112 + --- 113 + 114 + ### Option 4: Personal/Org namespace (e.g., `com.github.username.atdata.*`) 115 + 116 + **Example with your GitHub**: 117 + - `com.github.maxineishere.atdata.schema` (if that's your GH username) 118 + - Or: `com.yourorg.atdata.schema` 119 + 120 + **Pros**: 121 + - ✅ Guaranteed to work (it's your namespace) 122 + - ✅ No approval needed 123 + - ✅ Clear ownership 124 + 125 + **Cons**: 126 + - ❌ Looks very unofficial 127 + - ❌ Hard to discover 128 + - ❌ Tied to individual/org, not project 129 + - ❌ May need to migrate later if project grows 130 + 131 + **Risk**: ⭐ Very Low (but not ideal for adoption) 132 + 133 + ## Recommendation: Start with Option 2 (`io.atdata.*`), Keep Option 1 as Goal 134 + 135 + **Phased Approach**: 136 + 137 + ### Phase 1: Use `io.atdata.*` immediately 138 + - No approvals needed 139 + - Can start development right away 140 + - Professional-looking namespace 141 + - Independent from Bluesky governance 142 + 143 + ### Future: Request `app.bsky.atdata.*` if appropriate 144 + - Once atdata has users and proven value 145 + - Submit formal request to Bluesky/ATProto team 146 + - Migrate if approved (see migration plan below) 147 + 148 + **Rationale**: 149 + 1. **Speed**: Don't block development waiting for approval 150 + 2. **Safety**: If denied `app.bsky.*`, we haven't committed to it 151 + 3. **Flexibility**: Can migrate namespaces if needed 152 + 4. **Independence**: atdata can exist independently of Bluesky 153 + 154 + ## Implementation Details 155 + 156 + ### Namespace Structure 157 + 158 + ``` 159 + io.atdata 160 + ├── schema # PackableSample schema definitions 161 + ├── dataset # Dataset index records 162 + └── lens # Lens transformations 163 + ``` 164 + 165 + **Lexicon IDs**: 166 + ```json 167 + { 168 + "lexicon": 1, 169 + "id": "io.atdata.schema", 170 + ... 171 + } 172 + ``` 173 + 174 + ```json 175 + { 176 + "lexicon": 1, 177 + "id": "io.atdata.dataset", 178 + ... 179 + } 180 + ``` 181 + 182 + ```json 183 + { 184 + "lexicon": 1, 185 + "id": "io.atdata.lens", 186 + ... 187 + } 188 + ``` 189 + 190 + ### Record URIs 191 + 192 + ``` 193 + at://did:plc:abc123/io.atdata.schema/3jk2lo34klm 194 + at://did:plc:abc123/io.atdata.dataset/7mn8op56pqr 195 + at://did:plc:abc123/io.atdata.lens/2fg4hi78jkl 196 + ``` 197 + 198 + ### Python Constants 199 + 200 + ```python 201 + # src/atdata/atproto/_constants.py 202 + 203 + SCHEMA_NSID = "io.atdata.schema" 204 + DATASET_NSID = "io.atdata.dataset" 205 + LENS_NSID = "io.atdata.lens" 206 + 207 + # Can be changed in one place if we migrate namespaces 208 + ``` 209 + 210 + ## Domain Ownership 211 + 212 + **Question**: Do we need to own `atdata.io`? 213 + 214 + **ATProto Spec**: NSIDs don't require domain ownership, but it's recommended for credibility. 215 + 216 + **Options**: 217 + 1. **Register `atdata.io`** (~$12/year) 218 + - Pro: Professional, verifiable ownership 219 + - Con: Small cost 220 + - Recommendation: ✅ Do this 221 + 222 + 2. **Use without owning** 223 + - Pro: Free 224 + - Con: Someone else could register it and claim the namespace 225 + - Recommendation: ❌ Too risky 226 + 227 + **Decision**: Register `atdata.io` domain 228 + 229 + ## Versioning in NSIDs 230 + 231 + **Question**: Should version be part of NSID? 232 + 233 + ### Option A: Version in record (RECOMMENDED) 234 + ``` 235 + NSIDs: io.atdata.schema (constant) 236 + Versions: In schema record "version" field 237 + ``` 238 + 239 + **Pros**: 240 + - ✅ Stable NSIDs 241 + - ✅ Versions can evolve independently 242 + - ✅ Single collection for all versions 243 + 244 + **Cons**: 245 + - ❌ Need to look up version from record 246 + 247 + ### Option B: Version in NSID 248 + ``` 249 + NSIDs: io.atdata.schema.v1, io.atdata.schema.v2 250 + ``` 251 + 252 + **Pros**: 253 + - ✅ Version explicit in URI 254 + 255 + **Cons**: 256 + - ❌ New NSID for each major version 257 + - ❌ More Lexicons to maintain 258 + - ❌ Harder to query across versions 259 + 260 + **Recommendation**: Option A (version in record) 261 + 262 + ## Namespace Migration Plan 263 + 264 + If we need to migrate from `io.atdata.*` to `app.bsky.atdata.*`: 265 + 266 + ### Migration Steps 267 + 268 + 1. **Dual Publishing** (transition period) 269 + ```python 270 + # Publish to both namespaces 271 + publisher.publish_schema( 272 + sample_type, 273 + nsid="io.atdata.schema" # Old 274 + ) 275 + publisher.publish_schema( 276 + sample_type, 277 + nsid="app.bsky.atdata.schema" # New 278 + ) 279 + ``` 280 + 281 + 2. **Deprecation Notice** 282 + - Announce migration timeline 283 + - Update documentation 284 + - Add warnings to old namespace 285 + 286 + 3. **Update Client** 287 + - Default to new namespace 288 + - Still support old namespace (read-only) 289 + 290 + 4. **Sunset Old Namespace** 291 + - After 6-12 months, stop publishing to old namespace 292 + - Keep reading old records for compatibility 293 + 294 + ### Record Linking 295 + 296 + Add migration metadata: 297 + ```json 298 + { 299 + "$type": "app.bsky.atdata.schema", 300 + "metadata": { 301 + "migratedFrom": "at://did:plc:abc123/io.atdata.schema/3jk2lo34klm" 302 + }, 303 + ... 304 + } 305 + ``` 306 + 307 + ## Additional Lexicons (Future) 308 + 309 + Should we reserve NSIDs for future use? 310 + 311 + **Potential Additions**: 312 + - `io.atdata.collection` - Group multiple datasets 313 + - `io.atdata.benchmark` - Evaluation results 314 + - `io.atdata.annotation` - User comments/ratings 315 + - `io.atdata.pipeline` - Data processing pipelines 316 + 317 + **Recommendation**: Don't create yet, but document reserved names 318 + 319 + ## Community Input 320 + 321 + **Before finalizing**: 322 + 1. Check if `io.atdata.*` is available (no conflicts) 323 + 2. Reach out to ATProto community (Discord, GitHub) 324 + 3. Ask Bluesky team about `app.bsky.atdata.*` feasibility 325 + 4. Document decision and rationale 326 + 327 + ## Open Questions 328 + 329 + 1. **Should we create a demo namespace first?** 330 + - `io.atdata.dev.schema` for testing? 331 + - Pro: Keeps production namespace clean 332 + - Con: More namespaces to manage 333 + - Recommendation: Not needed, use test DIDs instead 334 + 335 + 2. **What about language-specific namespaces?** 336 + - `io.atdata.py.schema` for Python-specific schemas? 337 + - Pro: Allows language-specific features 338 + - Con: Fragments ecosystem 339 + - Recommendation: ❌ Keep language-agnostic 340 + 341 + 3. **Should we namespace by domain (vision, NLP, etc.)?** 342 + - `io.atdata.vision.schema`, `io.atdata.nlp.schema`? 343 + - Pro: Better organization for large ecosystems 344 + - Con: Premature optimization 345 + - Recommendation: ❌ Not for Phase 1 346 + 347 + ## Success Criteria 348 + 349 + After implementing this decision: 350 + - ✅ NSIDs are finalized and documented 351 + - ✅ Lexicon JSON files use correct NSIDs 352 + - ✅ Python code uses constant definitions (easy to change) 353 + - ✅ Migration plan exists if needed 354 + - ✅ Domain `atdata.io` is registered (or plan to register) 355 + 356 + ## References 357 + 358 + - ATProto NSID spec: https://atproto.com/specs/nsid 359 + - Lexicon design: `../02_lexicon_design.md` 360 + - All three Lexicon definitions need this decision 361 + 362 + --- 363 + 364 + **Decision Needed By**: Before starting Phase 1 Issue #22, #23, #24 (all Lexicon designs) 365 + **Decision Maker**: Project maintainer (max) 366 + **Date Created**: 2026-01-07 367 + 368 + ## Recommended Action 369 + 370 + **Immediate**: 371 + 1. ✅ Decide on `io.atdata.*` as working namespace 372 + 2. ✅ Plan to register `atdata.io` domain 373 + 3. ✅ Document migration path to `app.bsky.atdata.*` if desired later 374 + 375 + **Before Phase 2**: 376 + 1. Register `atdata.io` domain 377 + 2. Optional: Reach out to Bluesky about `app.bsky.atdata.*` for future 378 + 379 + **Phase 1**: 380 + Use `io.atdata.*` in all Lexicon designs

+459

.planning/decisions/06_lexicon_validation.md

··· 1 + # Decision: Lexicon Validation Process 2 + 3 + **Issue**: #50 4 + **Status**: Needs decision 5 + **Blocked By**: #45, #46, #47, #48, #49 (all design decisions) 6 + **Priority**: Critical - Final step before Phase 1 completion 7 + 8 + ## Problem Statement 9 + 10 + Once we've finalized all design decisions, we need to validate that our Lexicon JSON definitions: 11 + 1. Follow ATProto Lexicon specification correctly 12 + 2. Are internally consistent 13 + 3. Support all our use cases 14 + 4. Can be implemented as designed 15 + 16 + This is the final checkpoint before Phase 1 (Lexicon Design) is complete and we move to Phase 2 (Implementation). 17 + 18 + ## What Needs Validation 19 + 20 + ### 1. Schema Record Lexicon (`io.atdata.schema`) 21 + - Field type system (primitive, ndarray, nested) 22 + - Type unions are properly structured 23 + - Required vs optional fields 24 + - Constraints (maxLength, etc.) are reasonable 25 + - Example schema records validate against the Lexicon 26 + 27 + ### 2. Dataset Record Lexicon (`io.atdata.dataset`) 28 + - URL array handling 29 + - Metadata blob size limits 30 + - Schema reference format 31 + - Tag array constraints 32 + - Example dataset records validate against the Lexicon 33 + 34 + ### 3. Lens Record Lexicon (`io.atdata.lens`) 35 + - Code reference structure 36 + - Schema reference handling 37 + - Union types for different code storage options (if applicable) 38 + - Example lens records validate against the Lexicon 39 + 40 + ## Validation Checklist 41 + 42 + ### ATProto Spec Compliance 43 + 44 + **Lexicon Structure**: 45 + - [ ] All Lexicons have required fields: `lexicon`, `id`, `defs` 46 + - [ ] `lexicon` field is set to `1` (current version) 47 + - [ ] `id` follows NSID format (reverse domain notation) 48 + - [ ] `defs.main` exists and has `type: "record"` 49 + - [ ] Record `key` is set appropriately (`tid` for time-ordered) 50 + 51 + **Field Types**: 52 + - [ ] All field types are valid ATProto types 53 + - `string`, `integer`, `boolean`, `bytes`, `object`, `array` 54 + - `ref`, `union` for complex types 55 + - [ ] String fields have appropriate `maxLength` 56 + - [ ] Array fields have `items` definition 57 + - [ ] Object fields have `properties` definition 58 + - [ ] Refs point to valid def names (e.g., `#fieldType`) 59 + 60 + **Constraints**: 61 + - [ ] `maxLength` values are reasonable (not too small, not too large) 62 + - [ ] `minLength` constraints make sense 63 + - [ ] Required fields are marked correctly 64 + - [ ] Optional fields have appropriate defaults 65 + 66 + ### Internal Consistency 67 + 68 + **Cross-References**: 69 + - [ ] Schema refs (e.g., `schemaRef` in datasets) use correct format 70 + - Should be AT-URI format: `at://did:plc:.../io.atdata.schema/...` 71 + - [ ] Union refs point to existing defs 72 + - [ ] No circular references 73 + 74 + **Type System**: 75 + - [ ] Field types are well-defined 76 + - Primitive types map clearly (str, int, float, bool, bytes) 77 + - NDArray type includes dtype and optional shape 78 + - Nested types have schema reference 79 + - [ ] Optional vs required semantics are clear 80 + 81 + **Metadata**: 82 + - [ ] Descriptions are present and helpful 83 + - [ ] Examples match the schema 84 + - [ ] Deprecations are noted (if any) 85 + 86 + ### Use Case Coverage 87 + 88 + **Can we represent...**: 89 + - [ ] All current PackableSample types? 90 + - [ ] NDArray with dtype and shape information? 91 + - [ ] Optional fields? 92 + - [ ] Nested PackableSample types (future)? 93 + - [ ] Dataset metadata (arbitrary key-value)? 94 + - [ ] Multiple WebDataset shard URLs? 95 + - [ ] Lens code references (repo + commit + path)? 96 + 97 + **Can we implement...**: 98 + - [ ] Python codegen from schema records? 99 + - [ ] Dataset publishing with external URLs? 100 + - [ ] Dataset loading from records? 101 + - [ ] Lens publishing with code references? 102 + - [ ] Schema versioning (version field present)? 103 + 104 + ## Validation Methods 105 + 106 + ### 1. Schema Validation Tools 107 + 108 + **Use ATProto Tools** (if available): 109 + ```bash 110 + # If ATProto has a Lexicon validator 111 + atproto-lexicon validate io.atdata.schema.json 112 + atproto-lexicon validate io.atdata.dataset.json 113 + atproto-lexicon validate io.atdata.lens.json 114 + ``` 115 + 116 + **Create Custom Validator**: 117 + ```python 118 + # src/atdata/atproto/validation.py 119 + from jsonschema import validate, ValidationError 120 + 121 + def validate_lexicon(lexicon_json: dict) -> tuple[bool, list[str]]: 122 + """Validate Lexicon against ATProto spec.""" 123 + errors = [] 124 + 125 + # Check required fields 126 + if 'lexicon' not in lexicon_json: 127 + errors.append("Missing 'lexicon' field") 128 + if 'id' not in lexicon_json: 129 + errors.append("Missing 'id' field") 130 + if 'defs' not in lexicon_json: 131 + errors.append("Missing 'defs' field") 132 + 133 + # Check NSID format 134 + nsid = lexicon_json.get('id', '') 135 + if not is_valid_nsid(nsid): 136 + errors.append(f"Invalid NSID: {nsid}") 137 + 138 + # More validations... 139 + 140 + return len(errors) == 0, errors 141 + ``` 142 + 143 + ### 2. Example Record Validation 144 + 145 + **Create Example Records**: 146 + 147 + ```python 148 + # examples/schema_record.json 149 + { 150 + "$type": "io.atdata.schema", 151 + "name": "ImageSample", 152 + "version": "1.0.0", 153 + "description": "Sample with image and label", 154 + "fields": [ 155 + { 156 + "name": "image", 157 + "type": {"kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3]}, 158 + "optional": false 159 + }, 160 + { 161 + "name": "label", 162 + "type": {"kind": "primitive", "primitive": "str"}, 163 + "optional": false 164 + } 165 + ], 166 + "metadata": {"author": "alice"}, 167 + "createdAt": "2025-01-06T12:00:00Z" 168 + } 169 + ``` 170 + 171 + **Validate Against Lexicon**: 172 + ```python 173 + def validate_record(record: dict, lexicon: dict) -> tuple[bool, list[str]]: 174 + """Validate a record against its Lexicon.""" 175 + errors = [] 176 + 177 + # Check $type matches Lexicon id 178 + record_type = record.get('$type') 179 + lexicon_id = lexicon.get('id') 180 + if record_type != lexicon_id: 181 + errors.append(f"Type mismatch: {record_type} != {lexicon_id}") 182 + 183 + # Validate required fields 184 + main_def = lexicon['defs']['main']['record'] 185 + required = main_def.get('required', []) 186 + for field in required: 187 + if field not in record: 188 + errors.append(f"Missing required field: {field}") 189 + 190 + # Validate field types 191 + properties = main_def.get('properties', {}) 192 + for field, value in record.items(): 193 + if field in properties: 194 + # Type checking logic 195 + pass 196 + 197 + return len(errors) == 0, errors 198 + ``` 199 + 200 + ### 3. Roundtrip Testing 201 + 202 + **Test Full Cycle**: 203 + 1. Create PackableSample class 204 + 2. Generate schema record from class 205 + 3. Validate schema record against Lexicon 206 + 4. Generate code from schema record 207 + 5. Verify generated code matches original class 208 + 209 + ```python 210 + def test_roundtrip(): 211 + # 1. Original class 212 + @atdata.packable 213 + class TestSample: 214 + x: int 215 + y: str 216 + 217 + # 2. Generate schema record 218 + generator = SchemaRecordGenerator() 219 + record = generator.from_class(TestSample) 220 + 221 + # 3. Validate against Lexicon 222 + is_valid, errors = validate_record(record, SCHEMA_LEXICON) 223 + assert is_valid, f"Validation failed: {errors}" 224 + 225 + # 4. Generate code from record 226 + codegen = PythonGenerator() 227 + code = codegen.generate_from_record(record) 228 + 229 + # 5. Execute generated code and compare 230 + exec_globals = {} 231 + exec(code, exec_globals) 232 + GeneratedClass = exec_globals['TestSample'] 233 + 234 + # Should be equivalent 235 + original_instance = TestSample(x=1, y="test") 236 + generated_instance = GeneratedClass(x=1, y="test") 237 + 238 + assert original_instance.packed == generated_instance.packed 239 + ``` 240 + 241 + ### 4. Edge Case Testing 242 + 243 + **Test Corner Cases**: 244 + - [ ] Empty optional fields 245 + - [ ] Very long strings (maxLength boundary) 246 + - [ ] Large arrays (maxItems boundary) 247 + - [ ] Complex nested types 248 + - [ ] Unicode in strings 249 + - [ ] Special characters in names 250 + - [ ] Large metadata blobs 251 + 252 + ## Validation Artifacts 253 + 254 + After validation, we should have: 255 + 256 + ### 1. Finalized Lexicon JSON Files 257 + 258 + ``` 259 + .planning/lexicons/ 260 + io.atdata.schema.json 261 + io.atdata.dataset.json 262 + io.atdata.lens.json 263 + ``` 264 + 265 + Each file: 266 + - Validates against ATProto Lexicon spec 267 + - Has complete documentation 268 + - Includes examples 269 + 270 + ### 2. Example Records 271 + 272 + ``` 273 + .planning/examples/ 274 + schema_example.json 275 + dataset_example.json 276 + lens_example.json 277 + ``` 278 + 279 + Each example: 280 + - Validates against its Lexicon 281 + - Demonstrates all key features 282 + - Includes comments explaining choices 283 + 284 + ### 3. Validation Test Suite 285 + 286 + ```python 287 + # tests/test_lexicons.py 288 + 289 + def test_schema_lexicon_valid(): 290 + """Test schema Lexicon is valid.""" 291 + with open('.planning/lexicons/io.atdata.schema.json') as f: 292 + lexicon = json.load(f) 293 + is_valid, errors = validate_lexicon(lexicon) 294 + assert is_valid, errors 295 + 296 + def test_schema_example_valid(): 297 + """Test schema example validates against Lexicon.""" 298 + with open('.planning/lexicons/io.atdata.schema.json') as f: 299 + lexicon = json.load(f) 300 + with open('.planning/examples/schema_example.json') as f: 301 + example = json.load(f) 302 + is_valid, errors = validate_record(example, lexicon) 303 + assert is_valid, errors 304 + 305 + # Similar tests for dataset and lens 306 + ``` 307 + 308 + ### 4. Validation Report 309 + 310 + ```markdown 311 + # Lexicon Validation Report 312 + 313 + ## Summary 314 + - Schema Lexicon: ✅ Valid 315 + - Dataset Lexicon: ✅ Valid 316 + - Lens Lexicon: ✅ Valid 317 + 318 + ## Validation Results 319 + 320 + ### io.atdata.schema 321 + - ATProto compliance: ✅ Pass 322 + - Internal consistency: ✅ Pass 323 + - Example validation: ✅ Pass 324 + - Edge cases: ✅ Pass 325 + 326 + ### io.atdata.dataset 327 + ... 328 + 329 + ## Issues Found 330 + None 331 + 332 + ## Recommendations 333 + 1. Consider adding X field to Y 334 + 2. Might want to increase maxLength for Z 335 + ... 336 + ``` 337 + 338 + ## Implementation Plan 339 + 340 + ### Step 1: Create Lexicon JSON Files (depends on decisions #45-49) 341 + 342 + Based on finalized decisions: 343 + - Schema representation format (#45) 344 + - Lens code storage (#46) 345 + - WebDataset storage (#47) 346 + - Schema evolution (#48) 347 + - Lexicon namespace (#49) 348 + 349 + Create three JSON files with complete Lexicon definitions. 350 + 351 + ### Step 2: Create Example Records 352 + 353 + For each Lexicon, create 2-3 example records demonstrating: 354 + - Minimal record 355 + - Full-featured record 356 + - Edge cases 357 + 358 + ### Step 3: Write Validation Tests 359 + 360 + Implement validation test suite that: 361 + - Validates Lexicons against ATProto spec 362 + - Validates examples against Lexicons 363 + - Tests roundtrip (class → record → code → class) 364 + 365 + ### Step 4: Manual Review 366 + 367 + Have team members review: 368 + - Lexicon designs 369 + - Example records 370 + - Any edge cases or concerns 371 + 372 + ### Step 5: Document Issues and Resolutions 373 + 374 + Track any issues found: 375 + - What was wrong? 376 + - How was it fixed? 377 + - Why was this decision made? 378 + 379 + ### Step 6: Final Sign-off 380 + 381 + Once all validation passes: 382 + - Mark Issue #50 as complete 383 + - Unblock Phase 1 (Issue #17) 384 + - Proceed to Phase 2 implementation 385 + 386 + ## Tools and Resources 387 + 388 + **ATProto Resources**: 389 + - Lexicon specification: https://atproto.com/specs/lexicon 390 + - NSID specification: https://atproto.com/specs/nsid 391 + - Example Lexicons: https://github.com/bluesky-social/atproto/tree/main/lexicons 392 + 393 + **Validation Tools**: 394 + - JSON Schema validator (jsonschema library) 395 + - ATProto SDK validation (if available) 396 + - Custom validators (we'll write) 397 + 398 + **Documentation**: 399 + - All planning docs in `.planning/` 400 + - Decision docs in `.planning/decisions/` 401 + - Lexicon design in `02_lexicon_design.md` 402 + 403 + ## Success Criteria 404 + 405 + Phase 1 Issue #17 is complete when: 406 + - ✅ All three Lexicons are finalized and validated 407 + - ✅ Example records validate against Lexicons 408 + - ✅ Roundtrip tests pass 409 + - ✅ Team has reviewed and approved 410 + - ✅ Documentation is complete 411 + - ✅ Ready to begin Phase 2 implementation 412 + 413 + ## Next Steps After Validation 414 + 415 + Once Issue #50 is complete: 416 + 1. Close Issue #50 417 + 2. Unblock and close Issue #17 (Phase 1) 418 + 3. Begin Phase 2 (Issue #18) - Python Client implementation 419 + 4. Reference finalized Lexicons during implementation 420 + 421 + ## Open Questions 422 + 423 + 1. **Should we submit Lexicons to ATProto for official review?** 424 + - Pro: Get expert feedback 425 + - Con: Delays, may not be necessary 426 + - Recommendation: Optional, do if time permits 427 + 428 + 2. **Should we create a Lexicon registry/index?** 429 + - Pro: Makes discovery easier 430 + - Con: Extra infrastructure 431 + - Recommendation: Defer to Phase 3 (AppView) 432 + 433 + 3. **How do we handle Lexicon updates after publication?** 434 + - Once records exist, changing Lexicons is breaking 435 + - Need clear versioning for Lexicons themselves 436 + - Recommendation: Lexicons are v1 for all Phase 1-5 437 + 438 + ## References 439 + 440 + - All design decisions: `01-05_*.md` in this directory 441 + - Lexicon design: `../02_lexicon_design.md` 442 + - ATProto Lexicon spec: https://atproto.com/specs/lexicon 443 + 444 + --- 445 + 446 + **Decision Needed By**: After all decisions #45-49 are finalized 447 + **Decision Maker**: Project maintainer (max) + team review 448 + **Date Created**: 2026-01-07 449 + 450 + ## Recommended Action 451 + 452 + **After all design decisions are made**: 453 + 1. Create three Lexicon JSON files 454 + 2. Create example records for each 455 + 3. Write and run validation test suite 456 + 4. Review as team 457 + 5. Document any issues and fixes 458 + 6. Get final sign-off 459 + 7. Mark Phase 1 complete ✅

+146

.planning/decisions/README.md

··· 1 + # Critical Design Decisions for ATProto Integration 2 + 3 + This directory contains detailed analysis and recommendations for the critical design decisions needed before implementing ATProto integration in `atdata`. 4 + 5 + ## Decision Documents (In Dependency Order) 6 + 7 + ### Core Design Decisions (Can be made in parallel) 8 + 9 + 1. **[01_schema_representation_format.md](01_schema_representation_format.md)** (Issue #45) 10 + - **Question**: How to represent PackableSample types in Lexicon records? 11 + - **Options**: Custom format, JSON Schema, Protobuf 12 + - **Recommendation**: Custom format within ATProto Lexicon 13 + - **Impact**: Code generation, cross-language support 14 + - **Blocks**: Issue #50 (validation) 15 + 16 + 2. **[02_lens_code_storage.md](02_lens_code_storage.md)** (Issue #46) 17 + - **Question**: How to store Lens transformation code? 18 + - **Options**: Code references, inline code, metadata only 19 + - **Recommendation**: Code references (GitHub + commit hash) only 20 + - **Impact**: Security, usability, trust model 21 + - **Blocks**: Issue #50 (validation) 22 + - ⚠️ **CRITICAL SECURITY DECISION** 23 + 24 + 3. **[03_webdataset_storage.md](03_webdataset_storage.md)** (Issue #47) 25 + - **Question**: Where to store actual WebDataset .tar files? 26 + - **Options**: External URLs, ATProto blobs, hybrid 27 + - **Recommendation**: External URLs (Phase 1), hybrid (future) 28 + - **Impact**: Decentralization, scalability, costs 29 + - **Blocks**: Issue #50 (validation) 30 + 31 + 4. **[04_schema_evolution.md](04_schema_evolution.md)** (Issue #48) 32 + - **Question**: How do schemas evolve without breaking changes? 33 + - **Options**: Semantic versioning, compatibility rules, migrations 34 + - **Recommendation**: Semantic versioning + Lenses for migration 35 + - **Impact**: Long-term maintainability, compatibility 36 + - **Blocks**: Issue #50 (validation), Issue #39 (type validation) 37 + 38 + 5. **[05_lexicon_namespace.md](05_lexicon_namespace.md)** (Issue #49) 39 + - **Question**: What namespace (NSID) to use for Lexicons? 40 + - **Options**: `app.bsky.atdata.*`, `io.atdata.*`, others 41 + - **Recommendation**: `io.atdata.*` (Phase 1), request `app.bsky.*` later 42 + - **Impact**: Discoverability, ownership, migration 43 + - **Blocks**: Issue #50 (validation) 44 + 45 + ### Final Validation (Depends on all above) 46 + 47 + 6. **[06_lexicon_validation.md](06_lexicon_validation.md)** (Issue #50) 48 + - **Question**: How to validate finalized Lexicon designs? 49 + - **Process**: Validation checklist, example records, tests 50 + - **Deliverables**: Finalized Lexicon JSON files, validation report 51 + - **Blocked By**: Issues #45, #46, #47, #48, #49 52 + - **Blocks**: Phase 1 completion (Issue #17) 53 + 54 + ## Decision Status 55 + 56 + | Issue | Decision | Status | Recommendation | 57 + |-------|----------|--------|----------------| 58 + | #45 | Schema format | ⏳ Needs decision | Custom format | 59 + | #46 | Lens code storage | ⏳ Needs decision | Code references only | 60 + | #47 | WebDataset storage | ⏳ Needs decision | External URLs | 61 + | #48 | Schema evolution | ⏳ Needs decision | Semantic versioning | 62 + | #49 | Lexicon namespace | ⏳ Needs decision | `io.atdata.*` | 63 + | #50 | Validation process | ⏳ Blocked | (After #45-49) | 64 + 65 + ## How to Use These Documents 66 + 67 + ### For Review 68 + 69 + 1. **Read in order** (01 through 06) to understand dependencies 70 + 2. **Focus on recommendations** - detailed analysis supports them 71 + 3. **Check open questions** - some need your input 72 + 4. **Provide feedback** - comment on issues or update documents 73 + 74 + ### For Implementation 75 + 76 + 1. **After decisions made** - use as reference during coding 77 + 2. **Check success criteria** - ensure implementation meets goals 78 + 3. **Follow recommendations** - they're based on thorough analysis 79 + 4. **Update as needed** - decisions can evolve with learning 80 + 81 + ## Key Insights 82 + 83 + ### Security First 84 + - **Issue #46** (Lens code storage) is a critical security decision 85 + - Recommendation: Code references only (no arbitrary code execution) 86 + - Can add inline code later if we solve sandboxing 87 + 88 + ### Pragmatic Approach 89 + - Start with what works (external URLs, custom format) 90 + - Add sophistication later (ATProto blobs, advanced features) 91 + - Don't block on perfect solutions 92 + 93 + ### Independence 94 + - Use `io.atdata.*` namespace (don't wait for Bluesky approval) 95 + - Can migrate to `app.bsky.atdata.*` later if desired 96 + - Maintain control over project direction 97 + 98 + ### Future-Proof 99 + - Semantic versioning enables evolution 100 + - Hybrid storage approach allows flexibility 101 + - Custom format gives us full control 102 + 103 + ## Decision Dependencies 104 + 105 + ``` 106 + ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ 107 + │ #45 │ │ #46 │ │ #47 │ │ #48 │ │ #49 │ 108 + │ Format │ │ Lens │ │ Storage │ │Evolution│ │Namespace│ 109 + └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ 110 + │ │ │ │ │ 111 + └────────────┴────────────┴────────────┴────────────┘ 112 + │ 113 + ┌────▼────┐ 114 + │ #50 │ 115 + │Validate │ 116 + └────┬────┘ 117 + │ 118 + ┌────▼────┐ 119 + │ Phase 1 │ 120 + │Complete │ 121 + └─────────┘ 122 + ``` 123 + 124 + All decisions #45-49 can be made in parallel, then #50 validates everything before Phase 1 completion. 125 + 126 + ## Timeline 127 + 128 + **Recommended**: 129 + 1. **Week 1**: Review and decide on #45-49 (can be done in parallel) 130 + 2. **Week 2**: Validation (#50) - create Lexicon JSON files and examples 131 + 3. **Week 3**: Begin Phase 2 implementation 132 + 133 + **Flexible**: Can make decisions incrementally, but all needed before #50 134 + 135 + ## Questions? 136 + 137 + - Review individual decision documents for detailed analysis 138 + - Check "Open Questions" sections for items needing input 139 + - See "References" sections for related planning documents 140 + - Consult `../02_lexicon_design.md` for technical details 141 + 142 + --- 143 + 144 + **Created**: 2026-01-07 145 + **Status**: All decisions pending review 146 + **Next Step**: Review decision documents and provide feedback

Configure Feed

Configure Feed