A loose federation of distributed, typed datasets
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Merge pull request #33 from foundation-ac/release/v0.2.0a1

feat: v0.2.0a1 - Local Repository & ATProto Integration

authored by

Maxine Levesque and committed by
GitHub
9512c4c9 b9f01e3e

+15057 -276
+11 -1
.github/workflows/uv-test.yml
··· 17 17 runs-on: ubuntu-latest 18 18 environment: 19 19 name: test 20 + strategy: 21 + matrix: 22 + python-version: [3.12, 3.13, 3.14] 23 + redis-version: [6, 7] 20 24 21 25 steps: 22 26 - uses: actions/checkout@v5 ··· 24 28 - name: "Set up Python" 25 29 uses: actions/setup-python@v5 26 30 with: 27 - python-version-file: "pyproject.toml" 31 + python-version: ${{ matrix.python-version }} 32 + # python-version-file: "pyproject.toml" 28 33 29 34 - name: Install uv 30 35 uses: astral-sh/setup-uv@v6 ··· 33 38 run: uv sync --all-extras --dev 34 39 # TODO Better to use --locked for author control over versions? 35 40 # run: uv sync --locked --all-extras --dev 41 + 42 + - name: Start Redis 43 + uses: supercharge/redis-github-action@1.8.1 44 + with: 45 + redis-version: ${{ matrix.redis-version }} 36 46 37 47 - name: Run tests with coverage 38 48 run: uv run pytest --cov=atdata --cov-report=xml --cov-report=term
+3
.gitignore
··· 6 6 **/*.env 7 7 # Don't commit `uv` lockfiles 8 8 **/uv.lock 9 + # Development tooling (keep local, not in upstream) 10 + .chainlink/ 11 + .claude/ 9 12 10 13 ## 11 14
+204
.planning/01_overview.md
··· 1 + # ATProto Integration - Overview 2 + 3 + ## Vision 4 + 5 + Transform `atdata` from a local/centralized dataset library into a **distributed dataset federation** built on AT Protocol. Datasets, schemas, and transformations become discoverable, versioned records on the ATProto network, enabling: 6 + 7 + - **Decentralized dataset publishing**: Anyone can publish datasets without centralized infrastructure 8 + - **Schema sharing & reuse**: Sample type definitions become reusable records with automatic code generation 9 + - **Discoverable transformations**: Lens transformations are published as bidirectional mappings between schemas 10 + - **Interoperability**: Different tools and languages can consume the same datasets using generated code 11 + - **Versioning & provenance**: Immutable records provide audit trails for dataset evolution 12 + 13 + ## High-Level Architecture 14 + 15 + ``` 16 + ┌─────────────────────────────────────────────────────────────────┐ 17 + │ AT Protocol Network │ 18 + │ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │ 19 + │ │ Schema Records │ │ Dataset Records │ │ Lens Records │ │ 20 + │ │ (Lexicon) │ │ (Lexicon) │ │ (Lexicon) │ │ 21 + │ └──────────────────┘ └──────────────────┘ └───────────────┘ │ 22 + │ ▲ ▲ ▲ │ 23 + │ │ │ │ │ 24 + └─────────┼──────────────────────┼─────────────────────┼──────────┘ 25 + │ │ │ 26 + │ publish/query │ │ 27 + │ │ │ 28 + ┌─────┴──────────────────────┴─────────────────────┴─────┐ 29 + │ Python Client Library (atdata) │ 30 + │ │ 31 + │ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │ 32 + │ │ ATProto │ │ Schema │ │ Dataset │ │ 33 + │ │ Auth │ │ Publisher │ │ Loader │ │ 34 + │ └────────────┘ └────────────┘ └──────────────────┘ │ 35 + │ │ 36 + │ Existing: │ 37 + │ - PackableSample, Dataset, Lens │ 38 + │ - WebDataset integration │ 39 + └──────────────────────────────────────────────────────────┘ 40 + 41 + │ queries (optional) 42 + 43 + ┌─────────────────────┐ 44 + │ AppView Service │ 45 + │ (Index Aggregator) │ 46 + │ │ 47 + │ - Fast search │ 48 + │ - Schema browser │ 49 + │ - Metadata cache │ 50 + └─────────────────────┘ 51 + ``` 52 + 53 + ## Core Concepts 54 + 55 + ### 1. Schema Records (PackableSample definitions) 56 + 57 + Published ATProto records containing: 58 + - Field names and types (with special handling for NDArray) 59 + - Serialization metadata 60 + - Version information 61 + - Author/provenance 62 + 63 + These become the **source of truth** for sample types across the network. 64 + 65 + ### 2. Dataset Index Records 66 + 67 + Published ATProto records containing: 68 + - Reference to schema record (the sample type) 69 + - WebDataset URL(s) using brace notation (e.g., `s3://bucket/data-{000000..000099}.tar`) 70 + - Msgpack-encoded metadata (arbitrary key-value pairs) 71 + - Dataset description, tags, author 72 + 73 + Users discover datasets by querying these records, then load them using existing `Dataset` class. 74 + 75 + ### 3. Lens Transformation Records 76 + 77 + Published ATProto records containing: 78 + - Source schema reference 79 + - Target schema reference 80 + - Transformation code (or reference to code) 81 + - Bidirectional mapping metadata (getter/putter) 82 + 83 + Enables building a **network of transformations** between schemas. 84 + 85 + ## Integration with Existing `atdata` 86 + 87 + The ATProto integration is **additive**: 88 + 89 + 1. **Existing functionality unchanged**: `PackableSample`, `Dataset`, `Lens` continue to work as-is 90 + 2. **New methods added**: 91 + - `sample_type.publish_to_atproto(client)` - Publish schema 92 + - `dataset.publish_to_atproto(client)` - Publish index record 93 + - `Dataset.from_atproto(client, record_uri)` - Load from published record 94 + - `lens.publish_to_atproto(client)` - Publish transformation 95 + 3. **Optional AppView**: Query service for faster discovery (like Bluesky's AppView) 96 + 97 + ## Development Phases 98 + 99 + ### Phase 1: Lexicon Design (Issues #17, #22-25) 100 + - Design three Lexicon schemas (sample, dataset, lens) 101 + - Evaluate schema representation formats 102 + - Create reference documentation 103 + 104 + **Deliverable**: Lexicon JSON definitions ready for use 105 + 106 + ### Phase 2: Python Client Library (Issues #18, #26-31) 107 + - ATProto SDK integration (auth, session management) 108 + - Publishing implementations for all three record types 109 + - Query/discovery functionality 110 + - Extend `Dataset` class with `from_atproto()` method 111 + 112 + **Deliverable**: Working Python library that can publish/load from ATProto 113 + 114 + ### Phase 3: AppView Service (Issues #19, #32-35) 115 + - Optional aggregation service 116 + - Firehose ingestion 117 + - Search/query API 118 + - Performance optimization 119 + 120 + **Deliverable**: Hosted service for fast dataset discovery 121 + 122 + ### Phase 4: Code Generation (Issues #20, #36-39) 123 + - Template system for Python codegen 124 + - CLI tool for generating classes from schema records 125 + - Type validation and compatibility checking 126 + 127 + **Deliverable**: Tool to generate Python code from published schemas 128 + 129 + ### Phase 5: Integration & Testing (Issues #21, #40-43) 130 + - End-to-end workflows and examples 131 + - Integration test suite 132 + - Documentation and guides 133 + - Performance benchmarks 134 + 135 + **Deliverable**: Production-ready feature with complete documentation 136 + 137 + ## Open Design Questions 138 + 139 + ### Schema Representation Format 140 + **Question**: How should we represent `PackableSample` schemas in Lexicon records? 141 + 142 + **Options**: 143 + 1. **JSON Schema** - Standard, well-supported, validation tools exist 144 + 2. **Protobuf** - Compact, has codegen ecosystem, good for cross-language 145 + 3. **Custom format** - Tailored to `PackableSample` specifics (NDArray handling, msgpack serialization) 146 + 147 + **Considerations**: 148 + - Need to represent `NDArray` types specially (dtype, shape constraints?) 149 + - Should support future extensions (constraints, validation rules) 150 + - Must be human-readable and machine-processable 151 + - Codegen tooling needs to parse it 152 + 153 + **Decision needed**: See Issue #25 154 + 155 + ### WebDataset Storage Location 156 + **Question**: Should actual WebDataset `.tar` files be stored on ATProto, or just references to external storage? 157 + 158 + **Current approach**: References only (S3, HTTP URLs, etc.) 159 + - Pros: No storage limits, existing infrastructure works 160 + - Cons: Centralization risk if datasets disappear 161 + 162 + **Future consideration**: ATProto blob storage for datasets 163 + - Pros: Truly decentralized 164 + - Cons: Storage costs, size limits, performance 165 + 166 + ### Lens Code Storage 167 + **Question**: How should Lens transformation code be stored? 168 + 169 + **Options**: 170 + 1. Python code as string in record (security concerns!) 171 + 2. Reference to GitHub/GitLab repo + commit hash 172 + 3. Bytecode or AST representation 173 + 4. Only store metadata, expect manual implementation 174 + 175 + **Decision needed**: See Phase 1 planning 176 + 177 + ## Success Metrics 178 + 179 + - **Functionality**: Can publish schema, publish dataset, discover, load end-to-end 180 + - **Performance**: Dataset discovery <100ms (with AppView), load time unchanged 181 + - **Adoption**: Easy enough that external users publish datasets 182 + - **Interop**: Schema records usable from other languages (future) 183 + 184 + ## Timeline & Dependencies 185 + 186 + ``` 187 + Phase 1 (Lexicon Design) 188 + 189 + Phase 2 (Python Client) ← CRITICAL PATH 190 + 191 + ├── Phase 3 (AppView) [parallel, optional] 192 + └── Phase 4 (Codegen) [parallel] 193 + 194 + Phase 5 (Integration & Testing) 195 + ``` 196 + 197 + Phase 2 is the critical path. Phases 3 & 4 can proceed in parallel once Phase 2 foundations are in place. 198 + 199 + ## Related Documents 200 + 201 + - `02_lexicon_design.md` - Detailed Lexicon schema specifications 202 + - `03_python_client.md` - Python library architecture and API design 203 + - `04_appview.md` - AppView service architecture 204 + - `05_codegen.md` - Code generation approach and templates
+576
.planning/02_lexicon_design.md
··· 1 + # Lexicon Design for ATProto Integration 2 + 3 + ## Overview 4 + 5 + This document specifies the three Lexicon schemas needed for `atdata` ATProto integration: 6 + 7 + 1. **Schema Record** (`app.bsky.atdata.schema`) - Defines PackableSample types 8 + 2. **Dataset Record** (`app.bsky.atdata.dataset`) - Index records pointing to WebDataset files 9 + 3. **Lens Record** (`app.bsky.atdata.lens`) - Transformation mappings between schemas 10 + 11 + ## Design Principles 12 + 13 + - **Self-describing**: Records contain all necessary metadata 14 + - **Versioned**: Schema evolution supported through versioning 15 + - **Lightweight**: Minimal overhead, fast to parse 16 + - **Extensible**: Future additions don't break existing records 17 + - **Language-agnostic**: Usable from Python, TypeScript, Rust, etc. 18 + 19 + ## 1. Schema Record Lexicon 20 + 21 + **NSID**: `app.bsky.atdata.schema` (tentative namespace) 22 + 23 + **Purpose**: Define a reusable PackableSample type that can be instantiated via codegen 24 + 25 + ### Proposed Structure 26 + 27 + ```json 28 + { 29 + "lexicon": 1, 30 + "id": "app.bsky.atdata.schema", 31 + "defs": { 32 + "main": { 33 + "type": "record", 34 + "description": "Definition of a PackableSample-compatible sample type", 35 + "key": "tid", 36 + "record": { 37 + "type": "object", 38 + "required": ["name", "version", "fields", "createdAt"], 39 + "properties": { 40 + "name": { 41 + "type": "string", 42 + "description": "Human-readable name for this sample type", 43 + "maxLength": 100 44 + }, 45 + "version": { 46 + "type": "string", 47 + "description": "Semantic version (e.g., '1.0.0')", 48 + "maxLength": 20 49 + }, 50 + "description": { 51 + "type": "string", 52 + "description": "Human-readable description", 53 + "maxLength": 1000 54 + }, 55 + "fields": { 56 + "type": "array", 57 + "description": "List of fields in this sample type", 58 + "items": { 59 + "type": "ref", 60 + "ref": "#field" 61 + } 62 + }, 63 + "metadata": { 64 + "type": "object", 65 + "description": "Arbitrary metadata (author, license, etc.)" 66 + }, 67 + "createdAt": { 68 + "type": "string", 69 + "format": "datetime" 70 + } 71 + } 72 + } 73 + }, 74 + "field": { 75 + "type": "object", 76 + "description": "A field within a sample type", 77 + "required": ["name", "type"], 78 + "properties": { 79 + "name": { 80 + "type": "string", 81 + "description": "Field name (Python identifier)", 82 + "maxLength": 100 83 + }, 84 + "type": { 85 + "type": "ref", 86 + "ref": "#fieldType" 87 + }, 88 + "optional": { 89 + "type": "boolean", 90 + "description": "Whether field can be None", 91 + "default": false 92 + }, 93 + "description": { 94 + "type": "string", 95 + "description": "Field documentation", 96 + "maxLength": 500 97 + } 98 + } 99 + }, 100 + "fieldType": { 101 + "type": "union", 102 + "refs": [ 103 + "#primitiveType", 104 + "#arrayType", 105 + "#nestedType" 106 + ] 107 + }, 108 + "primitiveType": { 109 + "type": "object", 110 + "required": ["kind", "primitive"], 111 + "properties": { 112 + "kind": { 113 + "type": "string", 114 + "const": "primitive" 115 + }, 116 + "primitive": { 117 + "type": "string", 118 + "enum": ["str", "int", "float", "bool", "bytes"] 119 + } 120 + } 121 + }, 122 + "arrayType": { 123 + "type": "object", 124 + "required": ["kind", "dtype"], 125 + "properties": { 126 + "kind": { 127 + "type": "string", 128 + "const": "ndarray" 129 + }, 130 + "dtype": { 131 + "type": "string", 132 + "description": "Numpy dtype string (e.g., 'float32', 'uint8')", 133 + "maxLength": 20 134 + }, 135 + "shape": { 136 + "type": "array", 137 + "description": "Optional shape constraint (null for dynamic dimensions)", 138 + "items": { 139 + "type": "integer" 140 + } 141 + } 142 + } 143 + }, 144 + "nestedType": { 145 + "type": "object", 146 + "required": ["kind", "schemaRef"], 147 + "properties": { 148 + "kind": { 149 + "type": "string", 150 + "const": "nested" 151 + }, 152 + "schemaRef": { 153 + "type": "string", 154 + "description": "AT-URI reference to another schema record" 155 + } 156 + } 157 + } 158 + } 159 + } 160 + ``` 161 + 162 + ### Example Schema Record 163 + 164 + ```json 165 + { 166 + "$type": "app.bsky.atdata.schema", 167 + "name": "ImageSample", 168 + "version": "1.0.0", 169 + "description": "Sample containing an image with label", 170 + "fields": [ 171 + { 172 + "name": "image", 173 + "type": { 174 + "kind": "ndarray", 175 + "dtype": "uint8", 176 + "shape": [null, null, 3] 177 + }, 178 + "description": "RGB image with variable height/width" 179 + }, 180 + { 181 + "name": "label", 182 + "type": { 183 + "kind": "primitive", 184 + "primitive": "str" 185 + }, 186 + "description": "Human-readable label" 187 + }, 188 + { 189 + "name": "confidence", 190 + "type": { 191 + "kind": "primitive", 192 + "primitive": "float" 193 + }, 194 + "optional": true, 195 + "description": "Optional confidence score" 196 + } 197 + ], 198 + "metadata": { 199 + "author": "alice.bsky.social", 200 + "license": "MIT" 201 + }, 202 + "createdAt": "2025-01-06T12:00:00Z" 203 + } 204 + ``` 205 + 206 + ### Design Questions 207 + 208 + 1. **Shape constraints**: Should we enforce shape constraints, or just document them? 209 + - Option A: Runtime validation against shape 210 + - Option B: Documentation only, actual shapes can vary 211 + - **Recommendation**: Documentation only initially, validation in future versions 212 + 213 + 2. **Custom types**: Should we support custom serialization hooks? 214 + - Current approach: Only primitive + NDArray 215 + - Future: Allow references to custom serialization functions? 216 + 217 + 3. **Schema inheritance**: Should schemas support inheritance/composition? 218 + - Could reference parent schema and add fields 219 + - **Defer to future version** 220 + 221 + ## 2. Dataset Record Lexicon 222 + 223 + **NSID**: `app.bsky.atdata.dataset` 224 + 225 + **Purpose**: Index record pointing to WebDataset files with associated metadata 226 + 227 + ### Proposed Structure 228 + 229 + ```json 230 + { 231 + "lexicon": 1, 232 + "id": "app.bsky.atdata.dataset", 233 + "defs": { 234 + "main": { 235 + "type": "record", 236 + "description": "Index record for a WebDataset-backed dataset", 237 + "key": "tid", 238 + "record": { 239 + "type": "object", 240 + "required": ["name", "schemaRef", "urls", "createdAt"], 241 + "properties": { 242 + "name": { 243 + "type": "string", 244 + "description": "Human-readable dataset name", 245 + "maxLength": 200 246 + }, 247 + "schemaRef": { 248 + "type": "string", 249 + "description": "AT-URI reference to the schema record for this dataset's samples" 250 + }, 251 + "urls": { 252 + "type": "array", 253 + "description": "WebDataset URLs (supports brace notation)", 254 + "items": { 255 + "type": "string", 256 + "format": "uri", 257 + "maxLength": 1000 258 + }, 259 + "minLength": 1 260 + }, 261 + "description": { 262 + "type": "string", 263 + "description": "Human-readable description", 264 + "maxLength": 5000 265 + }, 266 + "metadata": { 267 + "type": "bytes", 268 + "description": "Msgpack-encoded metadata dict", 269 + "maxLength": 100000 270 + }, 271 + "tags": { 272 + "type": "array", 273 + "description": "Searchable tags", 274 + "items": { 275 + "type": "string", 276 + "maxLength": 50 277 + }, 278 + "maxLength": 20 279 + }, 280 + "size": { 281 + "type": "object", 282 + "description": "Dataset size information", 283 + "properties": { 284 + "samples": { 285 + "type": "integer", 286 + "description": "Total number of samples" 287 + }, 288 + "bytes": { 289 + "type": "integer", 290 + "description": "Total size in bytes" 291 + } 292 + } 293 + }, 294 + "license": { 295 + "type": "string", 296 + "description": "License (SPDX identifier preferred)", 297 + "maxLength": 100 298 + }, 299 + "createdAt": { 300 + "type": "string", 301 + "format": "datetime" 302 + } 303 + } 304 + } 305 + } 306 + } 307 + } 308 + ``` 309 + 310 + ### Example Dataset Record 311 + 312 + ```json 313 + { 314 + "$type": "app.bsky.atdata.dataset", 315 + "name": "CIFAR-10 Training Set", 316 + "schemaRef": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 317 + "urls": [ 318 + "s3://my-bucket/cifar10-train-{000000..000049}.tar" 319 + ], 320 + "description": "CIFAR-10 training images (50,000 samples) stored as WebDataset shards", 321 + "metadata": "<msgpack bytes>", 322 + "tags": ["computer-vision", "classification", "cifar10"], 323 + "size": { 324 + "samples": 50000, 325 + "bytes": 178456789 326 + }, 327 + "license": "MIT", 328 + "createdAt": "2025-01-06T12:00:00Z" 329 + } 330 + ``` 331 + 332 + ### Design Questions 333 + 334 + 1. **WebDataset storage**: Where are the actual `.tar` files? 335 + - Phase 1: External storage (S3, HTTP, etc.) - just store URLs 336 + - Future: Could use ATProto blob storage for smaller datasets 337 + - **Recommendation**: External only for now 338 + 339 + 2. **Metadata size limit**: What's reasonable for msgpack metadata? 340 + - Could store large metadata as separate blob 341 + - **Recommendation**: 100KB limit, use blob for larger 342 + 343 + 3. **Versioning**: Should we support dataset versioning? 344 + - Could link to previous version 345 + - **Defer to future version** 346 + 347 + ## 3. Lens Record Lexicon 348 + 349 + **NSID**: `app.bsky.atdata.lens` 350 + 351 + **Purpose**: Define bidirectional transformations between sample types 352 + 353 + ### Proposed Structure 354 + 355 + ```json 356 + { 357 + "lexicon": 1, 358 + "id": "app.bsky.atdata.lens", 359 + "defs": { 360 + "main": { 361 + "type": "record", 362 + "description": "Bidirectional transformation between two sample types", 363 + "key": "tid", 364 + "record": { 365 + "type": "object", 366 + "required": ["name", "sourceSchema", "targetSchema", "createdAt"], 367 + "properties": { 368 + "name": { 369 + "type": "string", 370 + "description": "Human-readable lens name", 371 + "maxLength": 100 372 + }, 373 + "sourceSchema": { 374 + "type": "string", 375 + "description": "AT-URI reference to source schema" 376 + }, 377 + "targetSchema": { 378 + "type": "string", 379 + "description": "AT-URI reference to target schema" 380 + }, 381 + "description": { 382 + "type": "string", 383 + "description": "What this transformation does", 384 + "maxLength": 1000 385 + }, 386 + "getterCode": { 387 + "type": "ref", 388 + "ref": "#transformCode" 389 + }, 390 + "putterCode": { 391 + "type": "ref", 392 + "ref": "#transformCode" 393 + }, 394 + "metadata": { 395 + "type": "object", 396 + "description": "Arbitrary metadata" 397 + }, 398 + "createdAt": { 399 + "type": "string", 400 + "format": "datetime" 401 + } 402 + } 403 + } 404 + }, 405 + "transformCode": { 406 + "type": "union", 407 + "refs": [ 408 + "#pythonCode", 409 + "#codeReference" 410 + ] 411 + }, 412 + "pythonCode": { 413 + "type": "object", 414 + "required": ["kind", "source"], 415 + "properties": { 416 + "kind": { 417 + "type": "string", 418 + "const": "python" 419 + }, 420 + "source": { 421 + "type": "string", 422 + "description": "Python function source code", 423 + "maxLength": 50000 424 + } 425 + } 426 + }, 427 + "codeReference": { 428 + "type": "object", 429 + "required": ["kind", "repository", "path"], 430 + "properties": { 431 + "kind": { 432 + "type": "string", 433 + "const": "reference" 434 + }, 435 + "repository": { 436 + "type": "string", 437 + "description": "Git repository URL", 438 + "maxLength": 500 439 + }, 440 + "commit": { 441 + "type": "string", 442 + "description": "Git commit hash", 443 + "maxLength": 40 444 + }, 445 + "path": { 446 + "type": "string", 447 + "description": "Path to function within repo", 448 + "maxLength": 500 449 + } 450 + } 451 + } 452 + } 453 + } 454 + ``` 455 + 456 + ### Example Lens Record 457 + 458 + ```json 459 + { 460 + "$type": "app.bsky.atdata.lens", 461 + "name": "image_to_grayscale", 462 + "sourceSchema": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 463 + "targetSchema": "at://did:plc:def456/app.bsky.atdata.schema/7mn8op56pqr", 464 + "description": "Convert RGB images to grayscale", 465 + "getterCode": { 466 + "kind": "reference", 467 + "repository": "https://github.com/alice/lenses", 468 + "commit": "a1b2c3d4e5f6", 469 + "path": "lenses/vision.py:image_to_grayscale" 470 + }, 471 + "putterCode": { 472 + "kind": "reference", 473 + "repository": "https://github.com/alice/lenses", 474 + "commit": "a1b2c3d4e5f6", 475 + "path": "lenses/vision.py:grayscale_to_image" 476 + }, 477 + "metadata": { 478 + "author": "alice.bsky.social" 479 + }, 480 + "createdAt": "2025-01-06T12:00:00Z" 481 + } 482 + ``` 483 + 484 + ### Design Questions - CRITICAL 485 + 486 + 1. **Code storage security**: Storing executable code is dangerous! 487 + - **Option A**: Code reference only (GitHub + commit hash) - safer 488 + - **Option B**: Allow inline code but require manual approval - flexible 489 + - **Option C**: AST/bytecode representation - complex 490 + - **Recommendation**: Start with references only (Option A), defer inline code 491 + 492 + 2. **Lens verification**: How to verify well-behavedness? 493 + - Could store test cases 494 + - Could require proof of GetPut/PutGet laws 495 + - **Defer to future** 496 + 497 + 3. **Lens composition**: Should lenses be composable? 498 + - Network could auto-compose transformations 499 + - **Defer to future** 500 + 501 + ## Schema Representation Format Decision 502 + 503 + **Question**: What format should we use to represent field types internally? 504 + 505 + ### Option 1: JSON Schema 506 + **Pros**: 507 + - Standard, widely supported 508 + - Validation tooling exists 509 + - Human-readable 510 + 511 + **Cons**: 512 + - Not designed for codegen 513 + - NDArray representation awkward 514 + - Overly complex for our needs 515 + 516 + ### Option 2: Protobuf 517 + **Pros**: 518 + - Designed for codegen 519 + - Compact binary format 520 + - Cross-language support excellent 521 + 522 + **Cons**: 523 + - Not ATProto-native 524 + - Requires compilation step 525 + - Less human-readable 526 + 527 + ### Option 3: Custom Format (as shown above) 528 + **Pros**: 529 + - Tailored exactly to PackableSample needs 530 + - Native ATProto Lexicon 531 + - Clean NDArray representation 532 + - Easy to extend 533 + 534 + **Cons**: 535 + - Need to write our own codegen 536 + - Less ecosystem tooling 537 + 538 + ### Recommendation: Option 3 (Custom Format) 539 + 540 + **Rationale**: 541 + 1. PackableSample has specific needs (NDArray, msgpack serialization) 542 + 2. ATProto Lexicon provides all the structure we need 543 + 3. Writing our own codegen gives us full control 544 + 4. Can still use JSON Schema for validation if needed 545 + 546 + The proposed Lexicon structure above uses this approach. 547 + 548 + ## Implementation Checklist (Phase 1) 549 + 550 + - [ ] Finalize Lexicon JSON definitions for all three record types 551 + - [ ] Create reference documentation with examples 552 + - [ ] Decide on schema representation format (recommendation: custom) 553 + - [ ] Resolve open questions (code storage, versioning, etc.) 554 + - [ ] Validate Lexicons against ATProto spec 555 + - [ ] Create example records for testing 556 + 557 + ## Future Extensions 558 + 559 + ### Schema Evolution 560 + - Support schema versioning with migration paths 561 + - Compatibility checking (backward/forward compatible) 562 + 563 + ### Advanced Types 564 + - Generic/parameterized types 565 + - Union types for polymorphic samples 566 + - Schema composition/inheritance 567 + 568 + ### Lens Network 569 + - Automatic lens composition 570 + - Lens verification and testing 571 + - Performance metadata (transformation cost) 572 + 573 + ### Dataset Features 574 + - Dataset splitting (train/val/test) references 575 + - Dataset versioning and diffs 576 + - Access control and permissions
+690
.planning/03_python_client.md
··· 1 + # Python Client Library Architecture 2 + 3 + ## Overview 4 + 5 + This document specifies the Python library extensions to `atdata` for ATProto integration. The goal is to add ATProto publishing and discovery capabilities while maintaining backward compatibility with existing code. 6 + 7 + ## Design Principles 8 + 9 + - **Backward compatible**: Existing code continues to work unchanged 10 + - **Optional integration**: ATProto features are opt-in 11 + - **Pythonic API**: Follows Python conventions and `atdata` style 12 + - **Type-safe**: Full type hints with generics 13 + - **Testable**: Mockable dependencies, unit testable 14 + 15 + ## Module Structure 16 + 17 + ``` 18 + src/atdata/ 19 + __init__.py # Existing exports 20 + dataset.py # Existing Dataset, PackableSample 21 + lens.py # Existing Lens, LensNetwork 22 + _helpers.py # Existing serialization helpers 23 + atproto/ # NEW: ATProto integration 24 + __init__.py # Public API exports 25 + client.py # ATProtoClient for auth/session 26 + schema.py # Schema publishing/loading 27 + dataset.py # Dataset publishing/loading 28 + lens.py # Lens publishing/loading 29 + _lexicon.py # Lexicon record builders 30 + _types.py # Type definitions for records 31 + ``` 32 + 33 + ## Core Components 34 + 35 + ### 1. ATProtoClient - Authentication & Session Management 36 + 37 + **File**: `src/atdata/atproto/client.py` 38 + 39 + ```python 40 + from typing import Optional 41 + from atproto import Client as ATProtoSDKClient 42 + 43 + class ATProtoClient: 44 + """Wrapper around atproto SDK client with atdata-specific helpers.""" 45 + 46 + def __init__(self, client: Optional[ATProtoSDKClient] = None): 47 + """ 48 + Initialize ATProto client. 49 + 50 + Args: 51 + client: Optional pre-configured atproto Client. If None, creates new client. 52 + """ 53 + self._client = client or ATProtoSDKClient() 54 + self._session: Optional[dict] = None 55 + 56 + def login(self, handle: str, password: str) -> None: 57 + """Authenticate with ATProto PDS.""" 58 + self._session = self._client.login(handle, password) 59 + 60 + def login_with_token(self, access_token: str, refresh_token: str) -> None: 61 + """Authenticate using existing tokens.""" 62 + # Implementation 63 + pass 64 + 65 + @property 66 + def is_authenticated(self) -> bool: 67 + """Check if client has valid session.""" 68 + return self._session is not None 69 + 70 + @property 71 + def did(self) -> str: 72 + """Get DID of authenticated user.""" 73 + if not self._session: 74 + raise ValueError("Not authenticated") 75 + return self._session['did'] 76 + 77 + # Low-level record operations 78 + def create_record(self, collection: str, record: dict) -> str: 79 + """Create a record and return its AT-URI.""" 80 + # Implementation using self._client 81 + pass 82 + 83 + def get_record(self, uri: str) -> dict: 84 + """Fetch a record by AT-URI.""" 85 + # Implementation 86 + pass 87 + 88 + def list_records(self, collection: str, did: Optional[str] = None) -> list[dict]: 89 + """List records in a collection.""" 90 + # Implementation 91 + pass 92 + ``` 93 + 94 + **Usage**: 95 + ```python 96 + from atdata.atproto import ATProtoClient 97 + 98 + client = ATProtoClient() 99 + client.login("alice.bsky.social", "password") 100 + ``` 101 + 102 + ### 2. Schema Publishing & Loading 103 + 104 + **File**: `src/atdata/atproto/schema.py` 105 + 106 + ```python 107 + from typing import Type, TypeVar, get_type_hints 108 + from dataclasses import fields, is_dataclass 109 + import atdata 110 + from .client import ATProtoClient 111 + from ._lexicon import build_schema_record 112 + 113 + ST = TypeVar('ST', bound=atdata.PackableSample) 114 + 115 + class SchemaPublisher: 116 + """Handles publishing PackableSample schemas to ATProto.""" 117 + 118 + def __init__(self, client: ATProtoClient): 119 + self.client = client 120 + 121 + def publish_schema( 122 + self, 123 + sample_type: Type[ST], 124 + *, 125 + name: Optional[str] = None, 126 + version: str = "1.0.0", 127 + description: Optional[str] = None, 128 + metadata: Optional[dict] = None 129 + ) -> str: 130 + """ 131 + Publish a PackableSample schema to ATProto. 132 + 133 + Args: 134 + sample_type: The PackableSample class to publish 135 + name: Human-readable name (defaults to class name) 136 + version: Semantic version 137 + description: Human-readable description 138 + metadata: Arbitrary metadata dict 139 + 140 + Returns: 141 + AT-URI of the created schema record 142 + """ 143 + if not self.client.is_authenticated: 144 + raise ValueError("Client must be authenticated") 145 + 146 + # Extract field information from dataclass 147 + schema_record = self._build_schema_record( 148 + sample_type, name, version, description, metadata 149 + ) 150 + 151 + # Publish to ATProto 152 + uri = self.client.create_record("app.bsky.atdata.schema", schema_record) 153 + return uri 154 + 155 + def _build_schema_record( 156 + self, 157 + sample_type: Type[ST], 158 + name: Optional[str], 159 + version: str, 160 + description: Optional[str], 161 + metadata: Optional[dict] 162 + ) -> dict: 163 + """Build schema record dict from PackableSample class.""" 164 + if not is_dataclass(sample_type): 165 + raise ValueError(f"{sample_type} must be a dataclass") 166 + 167 + field_defs = [] 168 + type_hints = get_type_hints(sample_type) 169 + 170 + for field in fields(sample_type): 171 + field_type = type_hints[field.name] 172 + field_def = self._field_to_record(field.name, field_type) 173 + field_defs.append(field_def) 174 + 175 + return { 176 + "$type": "app.bsky.atdata.schema", 177 + "name": name or sample_type.__name__, 178 + "version": version, 179 + "description": description or "", 180 + "fields": field_defs, 181 + "metadata": metadata or {}, 182 + "createdAt": datetime.now(timezone.utc).isoformat() 183 + } 184 + 185 + def _field_to_record(self, name: str, field_type) -> dict: 186 + """Convert Python type annotation to schema field record.""" 187 + # Handle Optional types 188 + is_optional = False 189 + if hasattr(field_type, '__origin__') and field_type.__origin__ is Union: 190 + args = field_type.__args__ 191 + if type(None) in args: 192 + is_optional = True 193 + field_type = next(arg for arg in args if arg is not type(None)) 194 + 195 + # Map Python types to schema types 196 + type_def = self._python_type_to_schema_type(field_type) 197 + 198 + return { 199 + "name": name, 200 + "type": type_def, 201 + "optional": is_optional 202 + } 203 + 204 + def _python_type_to_schema_type(self, python_type) -> dict: 205 + """Map Python type to schema type definition.""" 206 + # Handle primitives 207 + if python_type is str: 208 + return {"kind": "primitive", "primitive": "str"} 209 + elif python_type is int: 210 + return {"kind": "primitive", "primitive": "int"} 211 + elif python_type is float: 212 + return {"kind": "primitive", "primitive": "float"} 213 + elif python_type is bool: 214 + return {"kind": "primitive", "primitive": "bool"} 215 + elif python_type is bytes: 216 + return {"kind": "primitive", "primitive": "bytes"} 217 + 218 + # Handle NDArray - this is the key special case 219 + # In atdata, NDArray is used as a type annotation 220 + if hasattr(python_type, '__origin__'): 221 + origin = python_type.__origin__ 222 + if origin.__name__ == 'NDArray' or str(origin) == 'numpy.ndarray': 223 + # Extract dtype from annotation if available 224 + # For now, default to float32 225 + return { 226 + "kind": "ndarray", 227 + "dtype": "float32", # TODO: extract from annotation 228 + "shape": None 229 + } 230 + 231 + # If it's another PackableSample, create nested reference 232 + if is_dataclass(python_type) and issubclass(python_type, atdata.PackableSample): 233 + # This would require publishing the nested type first 234 + raise NotImplementedError("Nested PackableSample types not yet supported") 235 + 236 + raise ValueError(f"Unsupported type: {python_type}") 237 + 238 + class SchemaLoader: 239 + """Handles loading PackableSample schemas from ATProto.""" 240 + 241 + def __init__(self, client: ATProtoClient): 242 + self.client = client 243 + 244 + def get_schema(self, uri: str) -> dict: 245 + """Fetch a schema record by AT-URI.""" 246 + record = self.client.get_record(uri) 247 + if record.get('$type') != 'app.bsky.atdata.schema': 248 + raise ValueError(f"Record at {uri} is not a schema record") 249 + return record 250 + 251 + def list_schemas(self, did: Optional[str] = None) -> list[dict]: 252 + """List available schema records.""" 253 + return self.client.list_records("app.bsky.atdata.schema", did) 254 + ``` 255 + 256 + **Usage**: 257 + ```python 258 + from atdata.atproto import ATProtoClient, SchemaPublisher 259 + 260 + @atdata.packable 261 + class MySample: 262 + image: NDArray 263 + label: str 264 + 265 + client = ATProtoClient() 266 + client.login("alice.bsky.social", "password") 267 + 268 + publisher = SchemaPublisher(client) 269 + schema_uri = publisher.publish_schema( 270 + MySample, 271 + description="My sample type", 272 + version="1.0.0" 273 + ) 274 + print(f"Published schema at {schema_uri}") 275 + ``` 276 + 277 + ### 3. Dataset Publishing & Loading 278 + 279 + **File**: `src/atdata/atproto/dataset.py` 280 + 281 + ```python 282 + from typing import Type, TypeVar, Optional 283 + import msgpack 284 + import atdata 285 + from .client import ATProtoClient 286 + from .schema import SchemaPublisher 287 + 288 + ST = TypeVar('ST', bound=atdata.PackableSample) 289 + 290 + class DatasetPublisher: 291 + """Handles publishing Dataset index records to ATProto.""" 292 + 293 + def __init__(self, client: ATProtoClient): 294 + self.client = client 295 + self.schema_publisher = SchemaPublisher(client) 296 + 297 + def publish_dataset( 298 + self, 299 + dataset: atdata.Dataset[ST], 300 + *, 301 + name: str, 302 + schema_uri: Optional[str] = None, 303 + description: Optional[str] = None, 304 + tags: Optional[list[str]] = None, 305 + license: Optional[str] = None, 306 + auto_publish_schema: bool = True 307 + ) -> str: 308 + """ 309 + Publish a dataset index record to ATProto. 310 + 311 + Args: 312 + dataset: The Dataset to publish 313 + name: Human-readable dataset name 314 + schema_uri: AT-URI of the schema record (required if auto_publish_schema=False) 315 + description: Human-readable description 316 + tags: Searchable tags 317 + license: License identifier (SPDX preferred) 318 + auto_publish_schema: If True and schema_uri not provided, publish schema automatically 319 + 320 + Returns: 321 + AT-URI of the created dataset record 322 + """ 323 + if not self.client.is_authenticated: 324 + raise ValueError("Client must be authenticated") 325 + 326 + # Ensure schema is published 327 + if schema_uri is None: 328 + if not auto_publish_schema: 329 + raise ValueError("schema_uri required when auto_publish_schema=False") 330 + schema_uri = self.schema_publisher.publish_schema(dataset.sample_type) 331 + 332 + # Build dataset record 333 + dataset_record = { 334 + "$type": "app.bsky.atdata.dataset", 335 + "name": name, 336 + "schemaRef": schema_uri, 337 + "urls": [dataset.url], # Single URL for now 338 + "description": description or "", 339 + "metadata": msgpack.packb(dataset.metadata), 340 + "tags": tags or [], 341 + "license": license or "", 342 + "createdAt": datetime.now(timezone.utc).isoformat() 343 + } 344 + 345 + # Add size information if available 346 + # (would need to iterate dataset or have metadata about size) 347 + 348 + # Publish to ATProto 349 + uri = self.client.create_record("app.bsky.atdata.dataset", dataset_record) 350 + return uri 351 + 352 + class DatasetLoader: 353 + """Handles loading Datasets from ATProto records.""" 354 + 355 + def __init__(self, client: ATProtoClient): 356 + self.client = client 357 + 358 + def load_dataset(self, uri: str) -> atdata.Dataset: 359 + """ 360 + Load a Dataset from an ATProto record. 361 + 362 + Args: 363 + uri: AT-URI of the dataset record 364 + 365 + Returns: 366 + Dataset instance configured from the record 367 + """ 368 + # Fetch the dataset record 369 + record = self.client.get_record(uri) 370 + if record.get('$type') != 'app.bsky.atdata.dataset': 371 + raise ValueError(f"Record at {uri} is not a dataset record") 372 + 373 + # For now, we still need the Python class for the sample type 374 + # In the future, this could use codegen 375 + # TODO: Implement dynamic type loading via codegen 376 + 377 + # Extract URLs and metadata 378 + urls = record['urls'] 379 + metadata = msgpack.unpackb(record.get('metadata', b'')) 380 + 381 + # We need the schema to instantiate the Dataset with correct type 382 + # This is a limitation - we need codegen to create the type dynamically 383 + # For now, raise an error 384 + raise NotImplementedError( 385 + "Loading datasets requires code generation to instantiate sample types. " 386 + f"Schema URI: {record['schemaRef']}\n" 387 + "Use the codegen tool to generate the Python class first." 388 + ) 389 + 390 + def list_datasets(self, did: Optional[str] = None) -> list[dict]: 391 + """List available dataset records.""" 392 + return self.client.list_records("app.bsky.atdata.dataset", did) 393 + 394 + def search_datasets(self, tags: Optional[list[str]] = None, query: Optional[str] = None) -> list[dict]: 395 + """ 396 + Search for datasets. 397 + 398 + Args: 399 + tags: Filter by tags 400 + query: Text search query 401 + 402 + Returns: 403 + List of matching dataset records 404 + """ 405 + # This would use AppView in production 406 + # For now, fetch all and filter client-side 407 + all_datasets = self.list_records("app.bsky.atdata.dataset") 408 + 409 + filtered = all_datasets 410 + if tags: 411 + filtered = [d for d in filtered if any(t in d.get('tags', []) for t in tags)] 412 + if query: 413 + filtered = [d for d in filtered if query.lower() in d.get('name', '').lower() or 414 + query.lower() in d.get('description', '').lower()] 415 + 416 + return filtered 417 + ``` 418 + 419 + **Usage**: 420 + ```python 421 + from atdata.atproto import ATProtoClient, DatasetPublisher 422 + 423 + # Create dataset 424 + dataset = atdata.Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar") 425 + 426 + # Publish 427 + client = ATProtoClient() 428 + client.login("alice.bsky.social", "password") 429 + 430 + publisher = DatasetPublisher(client) 431 + dataset_uri = publisher.publish_dataset( 432 + dataset, 433 + name="My Training Data", 434 + description="Training data for my model", 435 + tags=["computer-vision", "training"], 436 + license="MIT" 437 + ) 438 + print(f"Published dataset at {dataset_uri}") 439 + ``` 440 + 441 + ### 4. Lens Publishing 442 + 443 + **File**: `src/atdata/atproto/lens.py` 444 + 445 + ```python 446 + from typing import Callable, Optional 447 + import inspect 448 + from .client import ATProtoClient 449 + 450 + class LensPublisher: 451 + """Handles publishing Lens transformations to ATProto.""" 452 + 453 + def __init__(self, client: ATProtoClient): 454 + self.client = client 455 + 456 + def publish_lens( 457 + self, 458 + lens_getter: Callable, 459 + lens_putter: Callable, 460 + *, 461 + name: str, 462 + source_schema_uri: str, 463 + target_schema_uri: str, 464 + description: Optional[str] = None, 465 + code_repository: Optional[str] = None, 466 + code_commit: Optional[str] = None 467 + ) -> str: 468 + """ 469 + Publish a Lens transformation to ATProto. 470 + 471 + Args: 472 + lens_getter: The getter function (Source -> Target) 473 + lens_putter: The putter function (Target, Source -> Source) 474 + name: Human-readable lens name 475 + source_schema_uri: AT-URI of source schema 476 + target_schema_uri: AT-URI of target schema 477 + description: What this transformation does 478 + code_repository: Git repository URL 479 + code_commit: Git commit hash 480 + 481 + Returns: 482 + AT-URI of the created lens record 483 + """ 484 + if not self.client.is_authenticated: 485 + raise ValueError("Client must be authenticated") 486 + 487 + # Build lens record 488 + lens_record = { 489 + "$type": "app.bsky.atdata.lens", 490 + "name": name, 491 + "sourceSchema": source_schema_uri, 492 + "targetSchema": target_schema_uri, 493 + "description": description or "", 494 + "createdAt": datetime.now(timezone.utc).isoformat() 495 + } 496 + 497 + # Add code references 498 + if code_repository and code_commit: 499 + getter_name = lens_getter.__name__ 500 + putter_name = lens_putter.__name__ 501 + 502 + lens_record["getterCode"] = { 503 + "kind": "reference", 504 + "repository": code_repository, 505 + "commit": code_commit, 506 + "path": f"{getter_name}" # Simplified - would need module path 507 + } 508 + lens_record["putterCode"] = { 509 + "kind": "reference", 510 + "repository": code_repository, 511 + "commit": code_commit, 512 + "path": f"{putter_name}" 513 + } 514 + else: 515 + # For initial version, we could store source code directly 516 + # But this is DANGEROUS - security review required 517 + raise NotImplementedError( 518 + "Inline code storage not yet supported. " 519 + "Please provide code_repository and code_commit." 520 + ) 521 + 522 + # Publish to ATProto 523 + uri = self.client.create_record("app.bsky.atdata.lens", lens_record) 524 + return uri 525 + ``` 526 + 527 + ## Extension to Existing Classes 528 + 529 + ### Adding ATProto methods to Dataset 530 + 531 + **Approach**: Add methods directly to `Dataset` class in `src/atdata/dataset.py` 532 + 533 + ```python 534 + class Dataset[ST: PackableSample]: 535 + # ... existing implementation ... 536 + 537 + def publish_to_atproto( 538 + self, 539 + client: 'ATProtoClient', # Forward reference to avoid circular import 540 + *, 541 + name: str, 542 + **kwargs 543 + ) -> str: 544 + """ 545 + Publish this dataset to ATProto. 546 + 547 + This is a convenience method that wraps DatasetPublisher. 548 + """ 549 + from .atproto import DatasetPublisher 550 + publisher = DatasetPublisher(client) 551 + return publisher.publish_dataset(self, name=name, **kwargs) 552 + 553 + @classmethod 554 + def from_atproto( 555 + cls, 556 + client: 'ATProtoClient', 557 + uri: str 558 + ) -> 'Dataset': 559 + """ 560 + Load a dataset from an ATProto record. 561 + 562 + Note: This requires the sample type to be available in Python. 563 + Use codegen to generate types from schema records. 564 + """ 565 + from .atproto import DatasetLoader 566 + loader = DatasetLoader(client) 567 + return loader.load_dataset(uri) 568 + ``` 569 + 570 + **Usage**: 571 + ```python 572 + # Publishing 573 + dataset = atdata.Dataset[MySample](url="s3://...") 574 + uri = dataset.publish_to_atproto(client, name="My Dataset") 575 + 576 + # Loading (future, requires codegen) 577 + dataset = atdata.Dataset.from_atproto(client, uri) 578 + ``` 579 + 580 + ## Public API Exports 581 + 582 + **File**: `src/atdata/atproto/__init__.py` 583 + 584 + ```python 585 + from .client import ATProtoClient 586 + from .schema import SchemaPublisher, SchemaLoader 587 + from .dataset import DatasetPublisher, DatasetLoader 588 + from .lens import LensPublisher 589 + 590 + __all__ = [ 591 + "ATProtoClient", 592 + "SchemaPublisher", 593 + "SchemaLoader", 594 + "DatasetPublisher", 595 + "DatasetLoader", 596 + "LensPublisher", 597 + ] 598 + ``` 599 + 600 + ## Testing Strategy 601 + 602 + ### Unit Tests 603 + - Mock `ATProtoClient` to avoid network calls 604 + - Test schema record building from various PackableSample types 605 + - Test error handling (auth failures, invalid types, etc.) 606 + 607 + ### Integration Tests 608 + - Use ATProto test server or sandbox 609 + - Test full publish/query cycle 610 + - Verify record structure matches Lexicon 611 + 612 + ### Example Test 613 + ```python 614 + import pytest 615 + from unittest.mock import Mock 616 + import atdata 617 + from atdata.atproto import SchemaPublisher 618 + 619 + @atdata.packable 620 + class TestSample: 621 + field1: str 622 + field2: int 623 + 624 + def test_schema_publisher(): 625 + # Mock client 626 + mock_client = Mock() 627 + mock_client.is_authenticated = True 628 + mock_client.create_record = Mock(return_value="at://did:example/app.bsky.atdata.schema/abc123") 629 + 630 + # Publish schema 631 + publisher = SchemaPublisher(mock_client) 632 + uri = publisher.publish_schema(TestSample, version="1.0.0") 633 + 634 + # Verify 635 + assert uri == "at://did:example/app.bsky.atdata.schema/abc123" 636 + mock_client.create_record.assert_called_once() 637 + 638 + # Check the record structure 639 + call_args = mock_client.create_record.call_args 640 + collection, record = call_args[0] 641 + assert collection == "app.bsky.atdata.schema" 642 + assert record["name"] == "TestSample" 643 + assert len(record["fields"]) == 2 644 + ``` 645 + 646 + ## Dependencies 647 + 648 + **New dependencies** (to be added to `pyproject.toml`): 649 + 650 + ```toml 651 + [project] 652 + dependencies = [ 653 + # ... existing ... 654 + "atproto>=0.0.40", # ATProto Python SDK 655 + ] 656 + ``` 657 + 658 + ## Implementation Checklist (Phase 2) 659 + 660 + - [ ] Set up `atdata/atproto/` module structure 661 + - [ ] Implement `ATProtoClient` wrapper 662 + - [ ] Implement `SchemaPublisher` with type introspection 663 + - [ ] Implement `DatasetPublisher` 664 + - [ ] Implement `LensPublisher` (code reference only) 665 + - [ ] Add convenience methods to `Dataset` class 666 + - [ ] Write unit tests for all publishers 667 + - [ ] Write integration tests with test server 668 + - [ ] Update documentation with examples 669 + 670 + ## Future Enhancements 671 + 672 + ### Better NDArray Type Handling 673 + - Parse `NDArray[DType, Shape]` annotations for accurate dtype/shape 674 + - Support for shape constraints in schema 675 + 676 + ### Dynamic Type Loading 677 + - Use codegen to create types at runtime from schema records 678 + - Enable `Dataset.from_atproto()` without pre-existing Python classes 679 + 680 + ### Caching 681 + - Cache schema lookups to avoid repeated network calls 682 + - Local schema registry 683 + 684 + ### Batch Operations 685 + - Publish multiple schemas/datasets in one call 686 + - Bulk import/export 687 + 688 + ### AppView Integration 689 + - Use AppView for fast search instead of client-side filtering 690 + - Streaming results for large queries
+578
.planning/04_appview.md
··· 1 + # AppView Service Architecture 2 + 3 + ## Overview 4 + 5 + The AppView is an **optional aggregation service** that indexes dataset records from across the ATProto network, providing fast search and discovery. Think of it as the "search engine" for atdata datasets. 6 + 7 + ## Why AppView? 8 + 9 + Without AppView, discovering datasets requires: 10 + - Querying each user's Personal Data Server (PDS) individually 11 + - No global search across all published datasets 12 + - Slow, inefficient discovery 13 + 14 + With AppView: 15 + - **Fast global search** across all datasets 16 + - **Rich metadata browsing** (schemas, tags, authors) 17 + - **Recommendation systems** (similar datasets, popular datasets) 18 + - **Analytics** (dataset usage, trends) 19 + 20 + ## Architecture 21 + 22 + ``` 23 + ┌─────────────────────────────────────────────────────────────┐ 24 + │ ATProto Network │ 25 + │ │ 26 + │ ┌─────┐ ┌─────┐ ┌─────┐ ┌──────────────┐ │ 27 + │ │ PDS │ │ PDS │ │ PDS │ ────────▶ │ Relay/ │ │ 28 + │ │ 1 │ │ 2 │ │ 3 │ │ Firehose │ │ 29 + │ └─────┘ └─────┘ └─────┘ └──────────────┘ │ 30 + │ │ │ │ │ │ 31 + │ └─────────┴─────────┴────────────────────┘ │ 32 + │ (publish records) │ │ 33 + └────────────────────────────────────────────────┼──────────────┘ 34 + 35 + │ (subscribe) 36 + 37 + ┌─────────────────────────┐ 38 + │ AppView Service │ 39 + │ │ 40 + │ ┌──────────────────┐ │ 41 + │ │ Firehose │ │ 42 + │ │ Consumer │ │ 43 + │ └────────┬─────────┘ │ 44 + │ │ │ 45 + │ ▼ │ 46 + │ ┌──────────────────┐ │ 47 + │ │ Record │ │ 48 + │ │ Processor │ │ 49 + │ └────────┬─────────┘ │ 50 + │ │ │ 51 + │ ▼ │ 52 + │ ┌──────────────────┐ │ 53 + │ │ PostgreSQL │ │ 54 + │ │ Database │ │ 55 + │ └──────────────────┘ │ 56 + │ │ 57 + │ ┌──────────────────┐ │ 58 + │ │ Search Index │ │ 59 + │ │ (ElasticSearch) │ │ 60 + │ └──────────────────┘ │ 61 + │ │ 62 + │ ┌──────────────────┐ │ 63 + │ │ HTTP API │ │ 64 + │ │ (FastAPI) │ │ 65 + │ └──────────────────┘ │ 66 + └─────────────────────────┘ 67 + 68 + │ (query API) 69 + 70 + ┌─────────────────────────┐ 71 + │ Python Client │ 72 + │ (atdata.atproto) │ 73 + └─────────────────────────┘ 74 + ``` 75 + 76 + ## Components 77 + 78 + ### 1. Firehose Consumer 79 + 80 + **Purpose**: Subscribe to ATProto firehose and receive real-time record updates 81 + 82 + **Technology**: Python + `atproto` SDK 83 + 84 + **Responsibilities**: 85 + - Connect to ATProto relay/firehose 86 + - Filter for relevant Lexicon types: 87 + - `app.bsky.atdata.schema` 88 + - `app.bsky.atdata.dataset` 89 + - `app.bsky.atdata.lens` 90 + - Handle reconnection and backpressure 91 + - Forward records to processor 92 + 93 + **Implementation**: 94 + ```python 95 + from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message 96 + 97 + class AtdataFirehoseConsumer: 98 + def __init__(self, processor: RecordProcessor): 99 + self.processor = processor 100 + self.client = FirehoseSubscribeReposClient() 101 + 102 + def start(self): 103 + """Start consuming firehose.""" 104 + def on_message_handler(message): 105 + commit = parse_subscribe_repos_message(message) 106 + if not commit: 107 + return 108 + 109 + for op in commit.ops: 110 + if op.action == 'create' or op.action == 'update': 111 + if op.path.startswith('app.bsky.atdata.'): 112 + # Extract record 113 + record = op.record 114 + self.processor.process_record( 115 + uri=op.uri, 116 + cid=op.cid, 117 + record=record 118 + ) 119 + 120 + self.client.start(on_message_handler) 121 + ``` 122 + 123 + ### 2. Record Processor 124 + 125 + **Purpose**: Parse and validate incoming records, update database and search index 126 + 127 + **Responsibilities**: 128 + - Validate records against Lexicon schemas 129 + - Extract searchable fields 130 + - Resolve references (schema URIs, etc.) 131 + - Update PostgreSQL and ElasticSearch 132 + - Handle deletions and updates 133 + 134 + **Data Model**: 135 + 136 + **PostgreSQL Tables**: 137 + ```sql 138 + -- Schema records 139 + CREATE TABLE schemas ( 140 + uri TEXT PRIMARY KEY, 141 + cid TEXT NOT NULL, 142 + did TEXT NOT NULL, 143 + name TEXT NOT NULL, 144 + version TEXT NOT NULL, 145 + description TEXT, 146 + fields JSONB NOT NULL, 147 + metadata JSONB, 148 + created_at TIMESTAMP NOT NULL, 149 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 150 + ); 151 + CREATE INDEX idx_schemas_did ON schemas(did); 152 + CREATE INDEX idx_schemas_name ON schemas(name); 153 + 154 + -- Dataset records 155 + CREATE TABLE datasets ( 156 + uri TEXT PRIMARY KEY, 157 + cid TEXT NOT NULL, 158 + did TEXT NOT NULL, 159 + name TEXT NOT NULL, 160 + schema_ref TEXT NOT NULL REFERENCES schemas(uri), 161 + urls TEXT[] NOT NULL, 162 + description TEXT, 163 + metadata BYTEA, 164 + tags TEXT[], 165 + license TEXT, 166 + size_samples INTEGER, 167 + size_bytes BIGINT, 168 + created_at TIMESTAMP NOT NULL, 169 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 170 + ); 171 + CREATE INDEX idx_datasets_did ON datasets(did); 172 + CREATE INDEX idx_datasets_schema ON datasets(schema_ref); 173 + CREATE INDEX idx_datasets_tags ON datasets USING GIN(tags); 174 + 175 + -- Lens records 176 + CREATE TABLE lenses ( 177 + uri TEXT PRIMARY KEY, 178 + cid TEXT NOT NULL, 179 + did TEXT NOT NULL, 180 + name TEXT NOT NULL, 181 + source_schema TEXT NOT NULL REFERENCES schemas(uri), 182 + target_schema TEXT NOT NULL REFERENCES schemas(uri), 183 + description TEXT, 184 + created_at TIMESTAMP NOT NULL, 185 + indexed_at TIMESTAMP NOT NULL DEFAULT NOW() 186 + ); 187 + CREATE INDEX idx_lenses_source ON lenses(source_schema); 188 + CREATE INDEX idx_lenses_target ON lenses(target_schema); 189 + 190 + -- Lens network view (for finding transformation paths) 191 + CREATE MATERIALIZED VIEW lens_network AS 192 + SELECT 193 + source_schema, 194 + target_schema, 195 + uri, 196 + name 197 + FROM lenses; 198 + CREATE INDEX idx_lens_network_source ON lens_network(source_schema); 199 + CREATE INDEX idx_lens_network_target ON lens_network(target_schema); 200 + ``` 201 + 202 + **ElasticSearch Index**: 203 + ```json 204 + { 205 + "mappings": { 206 + "properties": { 207 + "uri": { "type": "keyword" }, 208 + "type": { "type": "keyword" }, 209 + "did": { "type": "keyword" }, 210 + "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, 211 + "description": { "type": "text" }, 212 + "tags": { "type": "keyword" }, 213 + "created_at": { "type": "date" }, 214 + "schema_ref": { "type": "keyword" }, 215 + "license": { "type": "keyword" } 216 + } 217 + } 218 + } 219 + ``` 220 + 221 + ### 3. HTTP API 222 + 223 + **Purpose**: Expose search and query endpoints for clients 224 + 225 + **Technology**: FastAPI + Pydantic 226 + 227 + **Endpoints**: 228 + 229 + ```python 230 + from fastapi import FastAPI, Query 231 + from pydantic import BaseModel 232 + 233 + app = FastAPI() 234 + 235 + # Search datasets 236 + @app.get("/api/v1/datasets/search") 237 + async def search_datasets( 238 + q: str = Query(None, description="Text search query"), 239 + tags: list[str] = Query(None, description="Filter by tags"), 240 + schema_uri: str = Query(None, description="Filter by schema"), 241 + author_did: str = Query(None, description="Filter by author DID"), 242 + limit: int = Query(20, le=100), 243 + offset: int = Query(0) 244 + ) -> list[dict]: 245 + """Search for datasets.""" 246 + # Query ElasticSearch + PostgreSQL 247 + pass 248 + 249 + # Get dataset details 250 + @app.get("/api/v1/datasets/{uri:path}") 251 + async def get_dataset(uri: str) -> dict: 252 + """Get dataset record by URI.""" 253 + # Query PostgreSQL 254 + pass 255 + 256 + # List schemas 257 + @app.get("/api/v1/schemas") 258 + async def list_schemas( 259 + limit: int = Query(20, le=100), 260 + offset: int = Query(0) 261 + ) -> list[dict]: 262 + """List available schemas.""" 263 + pass 264 + 265 + # Get schema details 266 + @app.get("/api/v1/schemas/{uri:path}") 267 + async def get_schema(uri: str) -> dict: 268 + """Get schema record by URI.""" 269 + pass 270 + 271 + # Find lens path between schemas 272 + @app.get("/api/v1/lenses/path") 273 + async def find_lens_path( 274 + source: str = Query(..., description="Source schema URI"), 275 + target: str = Query(..., description="Target schema URI") 276 + ) -> list[dict]: 277 + """Find transformation path between two schemas.""" 278 + # Graph search on lens_network 279 + pass 280 + 281 + # Stats and analytics 282 + @app.get("/api/v1/stats") 283 + async def get_stats() -> dict: 284 + """Get aggregate statistics.""" 285 + return { 286 + "total_datasets": await count_datasets(), 287 + "total_schemas": await count_schemas(), 288 + "total_lenses": await count_lenses() 289 + } 290 + ``` 291 + 292 + ### 4. Caching Layer 293 + 294 + **Purpose**: Reduce database load for frequent queries 295 + 296 + **Technology**: Redis 297 + 298 + **Cached Items**: 299 + - Popular dataset queries 300 + - Schema lookups (high read frequency) 301 + - Search results (with short TTL) 302 + - Aggregate statistics 303 + 304 + **Implementation**: 305 + ```python 306 + import redis 307 + import json 308 + from functools import wraps 309 + 310 + redis_client = redis.Redis(host='localhost', port=6379, db=0) 311 + 312 + def cache_result(ttl: int = 300): 313 + """Decorator to cache function results in Redis.""" 314 + def decorator(func): 315 + @wraps(func) 316 + async def wrapper(*args, **kwargs): 317 + # Generate cache key from function name and args 318 + cache_key = f"{func.__name__}:{hash((args, frozenset(kwargs.items())))}" 319 + 320 + # Check cache 321 + cached = redis_client.get(cache_key) 322 + if cached: 323 + return json.loads(cached) 324 + 325 + # Compute result 326 + result = await func(*args, **kwargs) 327 + 328 + # Store in cache 329 + redis_client.setex(cache_key, ttl, json.dumps(result)) 330 + 331 + return result 332 + return wrapper 333 + return decorator 334 + 335 + @cache_result(ttl=60) 336 + async def get_popular_datasets(): 337 + """Get popular datasets (cached for 1 minute).""" 338 + # Query database 339 + pass 340 + ``` 341 + 342 + ## Deployment 343 + 344 + ### Infrastructure 345 + 346 + **Option 1: Simple (single server)** 347 + ``` 348 + - PostgreSQL (datasets, schemas, lenses) 349 + - ElasticSearch (search index) 350 + - Redis (cache) 351 + - FastAPI app (HTTP API) 352 + - Firehose consumer (background process) 353 + ``` 354 + 355 + **Option 2: Scalable (cloud)** 356 + ``` 357 + - AWS RDS PostgreSQL (managed database) 358 + - AWS OpenSearch (managed ElasticSearch) 359 + - AWS ElastiCache (managed Redis) 360 + - AWS ECS/Fargate (containerized FastAPI app) 361 + - AWS ECS/Fargate (containerized firehose consumer) 362 + - AWS ALB (load balancer) 363 + ``` 364 + 365 + ### Docker Compose (Development) 366 + 367 + ```yaml 368 + version: '3.8' 369 + 370 + services: 371 + postgres: 372 + image: postgres:15 373 + environment: 374 + POSTGRES_DB: atdata_appview 375 + POSTGRES_USER: atdata 376 + POSTGRES_PASSWORD: password 377 + volumes: 378 + - postgres_data:/var/lib/postgresql/data 379 + ports: 380 + - "5432:5432" 381 + 382 + elasticsearch: 383 + image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0 384 + environment: 385 + - discovery.type=single-node 386 + - xpack.security.enabled=false 387 + volumes: 388 + - es_data:/usr/share/elasticsearch/data 389 + ports: 390 + - "9200:9200" 391 + 392 + redis: 393 + image: redis:7 394 + ports: 395 + - "6379:6379" 396 + 397 + appview-api: 398 + build: 399 + context: . 400 + dockerfile: Dockerfile.api 401 + environment: 402 + DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview 403 + ELASTICSEARCH_URL: http://elasticsearch:9200 404 + REDIS_URL: redis://redis:6379 405 + depends_on: 406 + - postgres 407 + - elasticsearch 408 + - redis 409 + ports: 410 + - "8000:8000" 411 + 412 + appview-firehose: 413 + build: 414 + context: . 415 + dockerfile: Dockerfile.firehose 416 + environment: 417 + DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview 418 + ELASTICSEARCH_URL: http://elasticsearch:9200 419 + REDIS_URL: redis://redis:6379 420 + FIREHOSE_URL: wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos 421 + depends_on: 422 + - postgres 423 + - elasticsearch 424 + - redis 425 + 426 + volumes: 427 + postgres_data: 428 + es_data: 429 + ``` 430 + 431 + ## Client Integration 432 + 433 + ### Python Client Updates 434 + 435 + Add AppView support to `atdata.atproto.dataset.DatasetLoader`: 436 + 437 + ```python 438 + class DatasetLoader: 439 + def __init__( 440 + self, 441 + client: ATProtoClient, 442 + appview_url: Optional[str] = None 443 + ): 444 + self.client = client 445 + self.appview_url = appview_url or "https://appview.atdata.network" 446 + 447 + def search_datasets( 448 + self, 449 + query: Optional[str] = None, 450 + tags: Optional[list[str]] = None, 451 + schema_uri: Optional[str] = None, 452 + limit: int = 20 453 + ) -> list[dict]: 454 + """Search datasets using AppView.""" 455 + import httpx 456 + 457 + params = {"limit": limit} 458 + if query: 459 + params["q"] = query 460 + if tags: 461 + params["tags"] = tags 462 + if schema_uri: 463 + params["schema_uri"] = schema_uri 464 + 465 + response = httpx.get(f"{self.appview_url}/api/v1/datasets/search", params=params) 466 + response.raise_for_status() 467 + return response.json() 468 + ``` 469 + 470 + **Usage**: 471 + ```python 472 + from atdata.atproto import ATProtoClient, DatasetLoader 473 + 474 + client = ATProtoClient() 475 + loader = DatasetLoader(client, appview_url="https://appview.atdata.network") 476 + 477 + # Search for computer vision datasets 478 + results = loader.search_datasets( 479 + tags=["computer-vision"], 480 + limit=10 481 + ) 482 + 483 + for dataset in results: 484 + print(f"{dataset['name']}: {dataset['description']}") 485 + ``` 486 + 487 + ## Performance Considerations 488 + 489 + ### Indexing Speed 490 + - **Goal**: Index records in <1 second from firehose receipt 491 + - **Approach**: Async processing, batch inserts 492 + 493 + ### Search Performance 494 + - **Goal**: Search queries return in <100ms 495 + - **Approach**: ElasticSearch indexing, query optimization, caching 496 + 497 + ### Scalability 498 + - **Goal**: Handle 1000+ datasets, 100+ schemas 499 + - **Approach**: Horizontal scaling of API servers, read replicas for PostgreSQL 500 + 501 + ## Monitoring & Observability 502 + 503 + ### Metrics 504 + - Firehose lag (time behind current) 505 + - Indexing throughput (records/second) 506 + - API request latency (p50, p95, p99) 507 + - Cache hit rate 508 + - Database query performance 509 + 510 + ### Logging 511 + - Structured JSON logs 512 + - Log aggregation (e.g., CloudWatch, Datadog) 513 + - Error tracking (e.g., Sentry) 514 + 515 + ### Health Checks 516 + ```python 517 + @app.get("/health") 518 + async def health_check(): 519 + """Check service health.""" 520 + return { 521 + "status": "healthy", 522 + "components": { 523 + "database": await check_db_health(), 524 + "elasticsearch": await check_es_health(), 525 + "redis": await check_redis_health(), 526 + "firehose": await check_firehose_health() 527 + } 528 + } 529 + ``` 530 + 531 + ## Implementation Checklist (Phase 3) 532 + 533 + - [ ] Design database schema (PostgreSQL) 534 + - [ ] Design search index (ElasticSearch) 535 + - [ ] Implement firehose consumer 536 + - [ ] Implement record processor with validation 537 + - [ ] Implement HTTP API with FastAPI 538 + - [ ] Add caching layer (Redis) 539 + - [ ] Create Docker Compose for local development 540 + - [ ] Write integration tests 541 + - [ ] Set up monitoring and logging 542 + - [ ] Deploy to staging environment 543 + - [ ] Performance testing and optimization 544 + 545 + ## Future Enhancements 546 + 547 + ### Advanced Search 548 + - Fuzzy matching 549 + - Relevance scoring 550 + - Autocomplete for tags/names 551 + 552 + ### Recommendations 553 + - "Datasets similar to this one" 554 + - "Popular datasets in this category" 555 + - "Datasets by authors you follow" 556 + 557 + ### Analytics 558 + - Dataset usage tracking (downloads, views) 559 + - Trending datasets 560 + - Schema adoption statistics 561 + 562 + ### Social Features 563 + - Dataset comments/reviews 564 + - Ratings 565 + - Curation lists (e.g., "Best datasets for X") 566 + 567 + ### Federation 568 + - Multiple AppView instances 569 + - Cross-AppView search 570 + - Regional AppViews for performance 571 + 572 + ## Security Considerations 573 + 574 + - **Rate limiting**: Prevent abuse of search API 575 + - **Input validation**: Sanitize all query parameters 576 + - **DDoS protection**: Use CloudFlare or similar 577 + - **Authentication** (optional): API keys for heavy users 578 + - **Data validation**: Verify record signatures from ATProto
+799
.planning/05_codegen.md
··· 1 + # Code Generation Tooling 2 + 3 + ## Overview 4 + 5 + Code generation enables users to create `PackableSample` classes from schema records published on ATProto, making datasets truly interoperable across different codebases and even languages. 6 + 7 + ## Goals 8 + 9 + 1. **Automatic class generation**: Convert schema records to Python classes 10 + 2. **Type safety**: Generate proper type hints and validation 11 + 3. **Maintainability**: Generated code should be readable and maintainable 12 + 4. **Cross-language support** (future): TypeScript, Rust, etc. 13 + 14 + ## Python Code Generation 15 + 16 + ### Input: Schema Record 17 + 18 + ```json 19 + { 20 + "$type": "app.bsky.atdata.schema", 21 + "name": "ImageSample", 22 + "version": "1.0.0", 23 + "description": "Sample containing an image with label", 24 + "fields": [ 25 + { 26 + "name": "image", 27 + "type": { "kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3] }, 28 + "description": "RGB image with variable height/width" 29 + }, 30 + { 31 + "name": "label", 32 + "type": { "kind": "primitive", "primitive": "str" }, 33 + "description": "Human-readable label" 34 + }, 35 + { 36 + "name": "confidence", 37 + "type": { "kind": "primitive", "primitive": "float" }, 38 + "optional": true, 39 + "description": "Optional confidence score" 40 + } 41 + ] 42 + } 43 + ``` 44 + 45 + ### Output: Python Code 46 + 47 + ```python 48 + """ 49 + ImageSample 50 + 51 + Sample containing an image with label 52 + 53 + Schema Version: 1.0.0 54 + Schema URI: at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm 55 + Generated: 2025-01-06T12:00:00Z 56 + """ 57 + 58 + from dataclasses import dataclass 59 + from typing import Optional 60 + from numpy.typing import NDArray 61 + import atdata 62 + 63 + 64 + @atdata.packable 65 + class ImageSample: 66 + """Sample containing an image with label""" 67 + 68 + #: RGB image with variable height/width 69 + image: NDArray # uint8, shape: [*, *, 3] 70 + 71 + #: Human-readable label 72 + label: str 73 + 74 + #: Optional confidence score 75 + confidence: Optional[float] = None 76 + ``` 77 + 78 + ## Code Generator Architecture 79 + 80 + ### Module Structure 81 + 82 + ``` 83 + src/atdata/codegen/ 84 + __init__.py # Public API 85 + generator.py # Core code generation logic 86 + templates/ # Template files 87 + python.jinja2 # Python class template 88 + cli.py # CLI interface 89 + _validators.py # Schema validation 90 + ``` 91 + 92 + ### Core Generator 93 + 94 + **File**: `src/atdata/codegen/generator.py` 95 + 96 + ```python 97 + from typing import Optional 98 + from datetime import datetime, timezone 99 + from jinja2 import Environment, PackageLoader 100 + import atdata 101 + from ..atproto import ATProtoClient, SchemaLoader 102 + 103 + class PythonGenerator: 104 + """Generate Python PackableSample classes from schema records.""" 105 + 106 + def __init__(self): 107 + # Set up Jinja2 environment 108 + self.env = Environment( 109 + loader=PackageLoader('atdata.codegen', 'templates'), 110 + trim_blocks=True, 111 + lstrip_blocks=True 112 + ) 113 + 114 + # Register custom filters 115 + self.env.filters['python_type'] = self._python_type_filter 116 + self.env.filters['python_default'] = self._python_default_filter 117 + 118 + def generate_from_uri( 119 + self, 120 + client: ATProtoClient, 121 + schema_uri: str, 122 + output_path: Optional[str] = None 123 + ) -> str: 124 + """ 125 + Generate Python code from a schema URI. 126 + 127 + Args: 128 + client: ATProto client 129 + schema_uri: URI of the schema record 130 + output_path: Optional path to write output file 131 + 132 + Returns: 133 + Generated Python code as string 134 + """ 135 + # Load schema record 136 + loader = SchemaLoader(client) 137 + schema = loader.get_schema(schema_uri) 138 + 139 + # Generate code 140 + code = self.generate_from_record(schema, schema_uri) 141 + 142 + # Write to file if requested 143 + if output_path: 144 + with open(output_path, 'w') as f: 145 + f.write(code) 146 + 147 + return code 148 + 149 + def generate_from_record( 150 + self, 151 + schema: dict, 152 + schema_uri: str 153 + ) -> str: 154 + """ 155 + Generate Python code from a schema record dict. 156 + 157 + Args: 158 + schema: Schema record dict 159 + schema_uri: URI of the schema (for documentation) 160 + 161 + Returns: 162 + Generated Python code 163 + """ 164 + # Validate schema 165 + self._validate_schema(schema) 166 + 167 + # Prepare template context 168 + context = { 169 + 'schema': schema, 170 + 'schema_uri': schema_uri, 171 + 'generated_at': datetime.now(timezone.utc).isoformat(), 172 + 'fields': self._prepare_fields(schema['fields']) 173 + } 174 + 175 + # Render template 176 + template = self.env.get_template('python.jinja2') 177 + code = template.render(**context) 178 + 179 + return code 180 + 181 + def _prepare_fields(self, fields: list[dict]) -> list[dict]: 182 + """Prepare fields for template rendering.""" 183 + prepared = [] 184 + 185 + for field in fields: 186 + prepared.append({ 187 + 'name': field['name'], 188 + 'type_annotation': self._field_type_to_python(field['type']), 189 + 'optional': field.get('optional', False), 190 + 'description': field.get('description', ''), 191 + 'type_comment': self._type_comment(field['type']) 192 + }) 193 + 194 + return prepared 195 + 196 + def _field_type_to_python(self, field_type: dict) -> str: 197 + """Convert schema field type to Python type annotation.""" 198 + kind = field_type['kind'] 199 + 200 + if kind == 'primitive': 201 + primitive_map = { 202 + 'str': 'str', 203 + 'int': 'int', 204 + 'float': 'float', 205 + 'bool': 'bool', 206 + 'bytes': 'bytes' 207 + } 208 + return primitive_map[field_type['primitive']] 209 + 210 + elif kind == 'ndarray': 211 + return 'NDArray' 212 + 213 + elif kind == 'nested': 214 + # Extract class name from schema ref 215 + # For now, just use a placeholder 216 + ref = field_type['schemaRef'] 217 + return f'NestedType' # TODO: resolve nested types 218 + 219 + else: 220 + raise ValueError(f"Unknown field type kind: {kind}") 221 + 222 + def _type_comment(self, field_type: dict) -> Optional[str]: 223 + """Generate type comment for NDArray types.""" 224 + if field_type['kind'] == 'ndarray': 225 + dtype = field_type['dtype'] 226 + shape = field_type.get('shape') 227 + if shape: 228 + shape_str = ', '.join('*' if s is None else str(s) for s in shape) 229 + return f"{dtype}, shape: [{shape_str}]" 230 + else: 231 + return f"{dtype}" 232 + return None 233 + 234 + def _python_type_filter(self, field: dict) -> str: 235 + """Jinja2 filter to get Python type annotation.""" 236 + type_str = self._field_type_to_python(field['type']) 237 + if field.get('optional'): 238 + return f'Optional[{type_str}]' 239 + return type_str 240 + 241 + def _python_default_filter(self, field: dict) -> Optional[str]: 242 + """Jinja2 filter to get Python default value.""" 243 + if field.get('optional'): 244 + return 'None' 245 + return None 246 + 247 + def _validate_schema(self, schema: dict) -> None: 248 + """Validate schema record structure.""" 249 + required = ['name', 'version', 'fields'] 250 + for field in required: 251 + if field not in schema: 252 + raise ValueError(f"Schema missing required field: {field}") 253 + 254 + if not isinstance(schema['fields'], list): 255 + raise ValueError("Schema fields must be a list") 256 + 257 + for field in schema['fields']: 258 + if 'name' not in field or 'type' not in field: 259 + raise ValueError(f"Field missing name or type: {field}") 260 + ``` 261 + 262 + ### Template File 263 + 264 + **File**: `src/atdata/codegen/templates/python.jinja2` 265 + 266 + ```jinja2 267 + """ 268 + {{ schema.name }} 269 + 270 + {{ schema.description }} 271 + 272 + Schema Version: {{ schema.version }} 273 + Schema URI: {{ schema_uri }} 274 + Generated: {{ generated_at }} 275 + 276 + ⚠️ This file was automatically generated from an ATProto schema record. 277 + Do not edit manually - regenerate using `atdata codegen` instead. 278 + """ 279 + 280 + from dataclasses import dataclass 281 + {%- if fields | selectattr('optional') | list %} 282 + from typing import Optional 283 + {%- endif %} 284 + {%- if fields | selectattr('type.kind', 'equalto', 'ndarray') | list %} 285 + from numpy.typing import NDArray 286 + {%- endif %} 287 + import atdata 288 + 289 + 290 + @atdata.packable 291 + class {{ schema.name }}: 292 + """{{ schema.description }}""" 293 + 294 + {% for field in fields %} 295 + {%- if field.description %} 296 + #: {{ field.description }} 297 + {%- endif %} 298 + {{ field.name }}: {{ field | python_type }} 299 + {%- if field.type_comment %} # {{ field.type_comment }}{% endif %} 300 + {%- if field | python_default %} = {{ field | python_default }}{% endif %} 301 + 302 + {% endfor %} 303 + ``` 304 + 305 + ### CLI Interface 306 + 307 + **File**: `src/atdata/codegen/cli.py` 308 + 309 + ```python 310 + import click 311 + from pathlib import Path 312 + from ..atproto import ATProtoClient 313 + from .generator import PythonGenerator 314 + 315 + 316 + @click.group() 317 + def codegen(): 318 + """Code generation tools for atdata.""" 319 + pass 320 + 321 + 322 + @codegen.command() 323 + @click.argument('schema_uri') 324 + @click.option('--output', '-o', type=click.Path(), help='Output file path') 325 + @click.option('--handle', '-u', help='ATProto handle for authentication') 326 + @click.option('--password', '-p', help='ATProto password') 327 + @click.option('--language', '-l', default='python', type=click.Choice(['python']), help='Output language') 328 + def generate(schema_uri: str, output: str, handle: str, password: str, language: str): 329 + """Generate code from a schema URI. 330 + 331 + Example: 332 + atdata codegen generate at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm -o my_sample.py 333 + """ 334 + # Initialize client 335 + client = ATProtoClient() 336 + 337 + # Authenticate if credentials provided 338 + if handle and password: 339 + client.login(handle, password) 340 + 341 + # Generate code 342 + generator = PythonGenerator() 343 + 344 + try: 345 + code = generator.generate_from_uri(client, schema_uri, output) 346 + 347 + if output: 348 + click.echo(f"Generated {language} code written to {output}") 349 + else: 350 + click.echo(code) 351 + 352 + except Exception as e: 353 + click.echo(f"Error generating code: {e}", err=True) 354 + raise click.Abort() 355 + 356 + 357 + @codegen.command() 358 + @click.argument('schema_uris', nargs=-1, required=True) 359 + @click.option('--output-dir', '-d', type=click.Path(), required=True, help='Output directory') 360 + @click.option('--handle', '-u', help='ATProto handle for authentication') 361 + @click.option('--password', '-p', help='ATProto password') 362 + def batch(schema_uris: tuple, output_dir: str, handle: str, password: str): 363 + """Generate code for multiple schemas. 364 + 365 + Example: 366 + atdata codegen batch schema1_uri schema2_uri -d ./generated 367 + """ 368 + # Create output directory 369 + output_path = Path(output_dir) 370 + output_path.mkdir(parents=True, exist_ok=True) 371 + 372 + # Initialize client 373 + client = ATProtoClient() 374 + if handle and password: 375 + client.login(handle, password) 376 + 377 + # Generate code for each schema 378 + generator = PythonGenerator() 379 + 380 + for schema_uri in schema_uris: 381 + try: 382 + # Load schema to get name 383 + from ..atproto import SchemaLoader 384 + loader = SchemaLoader(client) 385 + schema = loader.get_schema(schema_uri) 386 + 387 + # Generate output path from schema name 388 + filename = f"{schema['name'].lower()}.py" 389 + output_file = output_path / filename 390 + 391 + # Generate code 392 + generator.generate_from_uri(client, schema_uri, str(output_file)) 393 + 394 + click.echo(f"Generated {filename}") 395 + 396 + except Exception as e: 397 + click.echo(f"Error generating code for {schema_uri}: {e}", err=True) 398 + 399 + 400 + if __name__ == '__main__': 401 + codegen() 402 + ``` 403 + 404 + ### Integration with Main CLI 405 + 406 + **File**: `src/atdata/cli.py` (new or extend existing) 407 + 408 + ```python 409 + import click 410 + from .codegen.cli import codegen as codegen_group 411 + 412 + @click.group() 413 + def main(): 414 + """atdata command-line interface.""" 415 + pass 416 + 417 + # Add codegen subcommand 418 + main.add_command(codegen_group) 419 + 420 + if __name__ == '__main__': 421 + main() 422 + ``` 423 + 424 + **Update** `pyproject.toml`: 425 + 426 + ```toml 427 + [project.scripts] 428 + atdata = "atdata.cli:main" 429 + ``` 430 + 431 + ## Usage Examples 432 + 433 + ### Generate Single Schema 434 + 435 + ```bash 436 + # Generate Python code from schema URI 437 + atdata codegen generate \ 438 + at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm \ 439 + -o image_sample.py 440 + 441 + # Output to stdout instead 442 + atdata codegen generate \ 443 + at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm 444 + ``` 445 + 446 + ### Batch Generation 447 + 448 + ```bash 449 + # Generate multiple schemas to a directory 450 + atdata codegen batch \ 451 + at://did:plc:abc123/app.bsky.atdata.schema/schema1 \ 452 + at://did:plc:abc123/app.bsky.atdata.schema/schema2 \ 453 + at://did:plc:abc123/app.bsky.atdata.schema/schema3 \ 454 + -d ./generated_schemas 455 + ``` 456 + 457 + ### Programmatic Usage 458 + 459 + ```python 460 + from atdata.atproto import ATProtoClient 461 + from atdata.codegen import PythonGenerator 462 + 463 + # Initialize 464 + client = ATProtoClient() 465 + client.login("alice.bsky.social", "password") 466 + 467 + # Generate code 468 + generator = PythonGenerator() 469 + code = generator.generate_from_uri( 470 + client, 471 + "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm", 472 + output_path="my_sample.py" 473 + ) 474 + 475 + # Now can import and use the generated class 476 + from my_sample import ImageSample 477 + 478 + # Use with Dataset 479 + dataset = atdata.Dataset[ImageSample](url="s3://bucket/data-{000000..000009}.tar") 480 + ``` 481 + 482 + ## Type Validation 483 + 484 + ### Schema Compatibility Checking 485 + 486 + ```python 487 + from atdata.codegen import SchemaValidator 488 + 489 + class SchemaValidator: 490 + """Validate schema compatibility and evolution.""" 491 + 492 + def is_compatible(self, old_schema: dict, new_schema: dict) -> tuple[bool, list[str]]: 493 + """ 494 + Check if new_schema is compatible with old_schema. 495 + 496 + Returns: 497 + (is_compatible, list_of_incompatibilities) 498 + """ 499 + incompatibilities = [] 500 + 501 + # Check for removed fields 502 + old_fields = {f['name']: f for f in old_schema['fields']} 503 + new_fields = {f['name']: f for f in new_schema['fields']} 504 + 505 + for name in old_fields: 506 + if name not in new_fields: 507 + incompatibilities.append(f"Field removed: {name}") 508 + 509 + # Check for type changes 510 + for name in old_fields: 511 + if name in new_fields: 512 + old_type = old_fields[name]['type'] 513 + new_type = new_fields[name]['type'] 514 + if old_type != new_type: 515 + incompatibilities.append( 516 + f"Field type changed: {name} from {old_type} to {new_type}" 517 + ) 518 + 519 + # Check for optional -> required changes 520 + for name in old_fields: 521 + if name in new_fields: 522 + was_optional = old_fields[name].get('optional', False) 523 + is_optional = new_fields[name].get('optional', False) 524 + if was_optional and not is_optional: 525 + incompatibilities.append( 526 + f"Field changed from optional to required: {name}" 527 + ) 528 + 529 + return len(incompatibilities) == 0, incompatibilities 530 + 531 + def validate_evolution(self, old_version: str, new_version: str) -> bool: 532 + """Validate that version numbers follow semantic versioning.""" 533 + # Parse versions 534 + old_major, old_minor, old_patch = map(int, old_version.split('.')) 535 + new_major, new_minor, new_patch = map(int, new_version.split('.')) 536 + 537 + # Major version should increment for breaking changes 538 + # Minor version should increment for compatible additions 539 + # Patch version should increment for bug fixes 540 + 541 + return new_major >= old_major 542 + ``` 543 + 544 + ### Runtime Type Validation 545 + 546 + ```python 547 + from atdata.codegen import TypeValidator 548 + 549 + class TypeValidator: 550 + """Validate sample instances against schemas.""" 551 + 552 + def validate(self, sample: atdata.PackableSample, schema: dict) -> tuple[bool, list[str]]: 553 + """ 554 + Validate that a sample instance conforms to a schema. 555 + 556 + Returns: 557 + (is_valid, list_of_errors) 558 + """ 559 + errors = [] 560 + 561 + # Check all required fields present 562 + schema_fields = {f['name']: f for f in schema['fields']} 563 + 564 + for field_name, field_def in schema_fields.items(): 565 + if not field_def.get('optional', False): 566 + if not hasattr(sample, field_name): 567 + errors.append(f"Missing required field: {field_name}") 568 + 569 + # Check field types 570 + for field_name, field_def in schema_fields.items(): 571 + if hasattr(sample, field_name): 572 + value = getattr(sample, field_name) 573 + if value is not None: 574 + type_valid = self._validate_field_type(value, field_def['type']) 575 + if not type_valid: 576 + errors.append( 577 + f"Invalid type for field {field_name}: " 578 + f"expected {field_def['type']}, got {type(value)}" 579 + ) 580 + 581 + return len(errors) == 0, errors 582 + 583 + def _validate_field_type(self, value, field_type: dict) -> bool: 584 + """Validate that value matches field type.""" 585 + kind = field_type['kind'] 586 + 587 + if kind == 'primitive': 588 + primitive_types = { 589 + 'str': str, 590 + 'int': int, 591 + 'float': float, 592 + 'bool': bool, 593 + 'bytes': bytes 594 + } 595 + expected_type = primitive_types[field_type['primitive']] 596 + return isinstance(value, expected_type) 597 + 598 + elif kind == 'ndarray': 599 + import numpy as np 600 + if not isinstance(value, np.ndarray): 601 + return False 602 + 603 + # Check dtype if specified 604 + if 'dtype' in field_type: 605 + expected_dtype = np.dtype(field_type['dtype']) 606 + if value.dtype != expected_dtype: 607 + return False 608 + 609 + # Check shape if specified 610 + if 'shape' in field_type and field_type['shape']: 611 + expected_shape = field_type['shape'] 612 + if len(value.shape) != len(expected_shape): 613 + return False 614 + for actual_dim, expected_dim in zip(value.shape, expected_shape): 615 + if expected_dim is not None and actual_dim != expected_dim: 616 + return False 617 + 618 + return True 619 + 620 + return True 621 + ``` 622 + 623 + ## Testing 624 + 625 + ### Unit Tests 626 + 627 + ```python 628 + import pytest 629 + from atdata.codegen import PythonGenerator 630 + 631 + def test_generate_simple_schema(): 632 + """Test generating code from a simple schema.""" 633 + schema = { 634 + "name": "TestSample", 635 + "version": "1.0.0", 636 + "description": "Test sample", 637 + "fields": [ 638 + { 639 + "name": "field1", 640 + "type": {"kind": "primitive", "primitive": "str"} 641 + } 642 + ] 643 + } 644 + 645 + generator = PythonGenerator() 646 + code = generator.generate_from_record(schema, "at://test/schema/123") 647 + 648 + # Check that code contains expected elements 649 + assert "@atdata.packable" in code 650 + assert "class TestSample:" in code 651 + assert "field1: str" in code 652 + 653 + 654 + def test_generate_ndarray_field(): 655 + """Test generating code with NDArray fields.""" 656 + schema = { 657 + "name": "ImageSample", 658 + "version": "1.0.0", 659 + "description": "Image sample", 660 + "fields": [ 661 + { 662 + "name": "image", 663 + "type": { 664 + "kind": "ndarray", 665 + "dtype": "uint8", 666 + "shape": [None, None, 3] 667 + } 668 + } 669 + ] 670 + } 671 + 672 + generator = PythonGenerator() 673 + code = generator.generate_from_record(schema, "at://test/schema/456") 674 + 675 + assert "from numpy.typing import NDArray" in code 676 + assert "image: NDArray" in code 677 + assert "# uint8, shape: [*, *, 3]" in code 678 + 679 + 680 + def test_optional_fields(): 681 + """Test generating code with optional fields.""" 682 + schema = { 683 + "name": "OptionalSample", 684 + "version": "1.0.0", 685 + "description": "Sample with optional fields", 686 + "fields": [ 687 + { 688 + "name": "required_field", 689 + "type": {"kind": "primitive", "primitive": "str"} 690 + }, 691 + { 692 + "name": "optional_field", 693 + "type": {"kind": "primitive", "primitive": "int"}, 694 + "optional": True 695 + } 696 + ] 697 + } 698 + 699 + generator = PythonGenerator() 700 + code = generator.generate_from_record(schema, "at://test/schema/789") 701 + 702 + assert "from typing import Optional" in code 703 + assert "required_field: str" in code 704 + assert "optional_field: Optional[int] = None" in code 705 + ``` 706 + 707 + ### Integration Tests 708 + 709 + ```python 710 + def test_generate_and_import(): 711 + """Test that generated code can be imported and used.""" 712 + import tempfile 713 + import importlib.util 714 + 715 + schema = { 716 + "name": "GeneratedSample", 717 + "version": "1.0.0", 718 + "description": "Generated sample", 719 + "fields": [ 720 + {"name": "x", "type": {"kind": "primitive", "primitive": "int"}} 721 + ] 722 + } 723 + 724 + generator = PythonGenerator() 725 + 726 + # Generate code to temp file 727 + with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f: 728 + code = generator.generate_from_record(schema, "at://test/schema/123") 729 + f.write(code) 730 + temp_path = f.name 731 + 732 + # Import the generated module 733 + spec = importlib.util.spec_from_file_location("generated", temp_path) 734 + module = importlib.util.module_from_spec(spec) 735 + spec.loader.exec_module(module) 736 + 737 + # Test instantiation 738 + sample = module.GeneratedSample(x=42) 739 + assert sample.x == 42 740 + 741 + # Test serialization 742 + assert isinstance(sample, atdata.PackableSample) 743 + packed = sample.packed 744 + assert isinstance(packed, bytes) 745 + ``` 746 + 747 + ## Implementation Checklist (Phase 4) 748 + 749 + - [ ] Implement `PythonGenerator` core logic 750 + - [ ] Create Jinja2 template for Python classes 751 + - [ ] Add CLI commands (`generate`, `batch`) 752 + - [ ] Implement schema validation 753 + - [ ] Implement type compatibility checking 754 + - [ ] Write unit tests for generator 755 + - [ ] Write integration tests (generate + import) 756 + - [ ] Add documentation and examples 757 + - [ ] Consider edge cases (nested types, complex shapes) 758 + 759 + ## Future Extensions 760 + 761 + ### Multi-Language Support 762 + 763 + **TypeScript Generator**: 764 + ```typescript 765 + // Generated from schema 766 + export interface ImageSample { 767 + image: number[][][]; // uint8, [*, *, 3] 768 + label: string; 769 + confidence?: number; 770 + } 771 + ``` 772 + 773 + **Rust Generator**: 774 + ```rust 775 + // Generated from schema 776 + #[derive(Debug, Clone, Serialize, Deserialize)] 777 + pub struct ImageSample { 778 + /// RGB image with variable height/width 779 + pub image: ndarray::Array3<u8>, 780 + /// Human-readable label 781 + pub label: String, 782 + /// Optional confidence score 783 + pub confidence: Option<f64>, 784 + } 785 + ``` 786 + 787 + ### Advanced Features 788 + 789 + - **Backwards compatibility checks**: Ensure schema updates don't break existing code 790 + - **Migration generators**: Generate migration code for schema evolution 791 + - **Validation decorators**: Runtime validation of generated classes 792 + - **Documentation generation**: Generate API docs from schemas 793 + - **IDE support**: Language server protocol support for autocomplete 794 + 795 + ### Code Quality 796 + 797 + - **Formatting**: Run `black` on generated Python code 798 + - **Linting**: Ensure generated code passes `ruff`/`flake8` 799 + - **Type checking**: Ensure generated code passes `mypy`
+195
.planning/README.md
··· 1 + # ATProto Integration Planning 2 + 3 + This directory contains comprehensive planning documents for integrating AT Protocol into the `atdata` library, transforming it into a distributed dataset federation. 4 + 5 + ## Planning Documents 6 + 7 + ### Design Decisions 8 + 9 + 📋 **[decisions/](decisions/)** - Critical design decisions with detailed analysis 10 + - Each decision has its own document with options, recommendations, and rationale 11 + - See [decisions/README.md](decisions/README.md) for navigation guide 12 + - **Must be reviewed and finalized before Phase 1 implementation** 13 + 14 + ### Architecture & Design 15 + 16 + 1. **[01_overview.md](01_overview.md)** - High-level vision, architecture, and project roadmap 17 + - Overall vision for distributed datasets on ATProto 18 + - System architecture diagram 19 + - Development phases and dependencies 20 + - Open design questions 21 + 22 + 2. **[02_lexicon_design.md](02_lexicon_design.md)** - Detailed Lexicon schema specifications 23 + - Schema Record Lexicon (for PackableSample types) 24 + - Dataset Record Lexicon (for dataset indexes) 25 + - Lens Record Lexicon (for transformations) 26 + - Schema representation format decision 27 + - Example records 28 + 29 + 3. **[03_python_client.md](03_python_client.md)** - Python library architecture and API design 30 + - ATProtoClient for authentication 31 + - SchemaPublisher/Loader 32 + - DatasetPublisher/Loader 33 + - LensPublisher 34 + - Integration with existing Dataset class 35 + - Testing strategy 36 + 37 + 4. **[04_appview.md](04_appview.md)** - AppView aggregation service design 38 + - Service architecture 39 + - Database schema (PostgreSQL, ElasticSearch) 40 + - HTTP API endpoints 41 + - Firehose consumer 42 + - Deployment options 43 + - Performance considerations 44 + 45 + 5. **[05_codegen.md](05_codegen.md)** - Code generation tooling 46 + - Python code generator from schema records 47 + - CLI interface 48 + - Template system 49 + - Type validation and compatibility checking 50 + - Future multi-language support 51 + 52 + ## Milestone Tracking 53 + 54 + **Milestone**: ATProto Integration (Milestone #1) 55 + **Total Issues**: 34 (6 parent issues + 28 subissues) 56 + 57 + ### Planning Phase (Issue #44) 58 + 59 + **Status**: In progress 60 + **Priority**: High (blocks Phase 1) 61 + 62 + Critical decisions needed before implementation: 63 + - Decide on schema representation format (#45) 64 + - Decide on Lens code storage approach (#46) 65 + - Decide on WebDataset storage strategy (#47) 66 + - Design schema evolution and versioning strategy (#48) 67 + - Finalize Lexicon namespace and NSID structure (#49) 68 + - Review and validate Lexicon JSON definitions (#50) 69 + 70 + **All decisions have detailed analysis in planning documents with recommendations.** 71 + 72 + ### Phase Breakdown 73 + 74 + #### Phase 1: Lexicon Design & Schema Definition (Issue #17) 75 + - Design Lexicon for PackableSample schema storage (#22) 76 + - Design Lexicon for dataset index records (#23) 77 + - Design Lexicon for Lens transformation records (#24) 78 + - Evaluate schema representation formats (#25) 79 + 80 + **Status**: Blocked by Planning (#44) 81 + **Priority**: High (blocks all other phases) 82 + 83 + #### Phase 2: Python Client Library (Issue #18) 84 + - Implement ATProto authentication and session management (#26) 85 + - Implement schema publishing to ATProto (#27) 86 + - Implement dataset index record publishing (#28) 87 + - Implement Lens transformation publishing (#29) 88 + - Implement querying and discovery of datasets (#30) 89 + - Extend Dataset class to load from ATProto records (#31) 90 + 91 + **Status**: Blocked by Phase 1 92 + **Priority**: High (critical path) 93 + 94 + #### Phase 3: AppView & Index Aggregation Service (Issue #19) 95 + - Design AppView architecture and data model (#32) 96 + - Implement record ingestion from ATProto firehose (#33) 97 + - Implement search and query API (#34) 98 + - Add caching and indexing for performance (#35) 99 + 100 + **Status**: Blocked by Phase 2 101 + **Priority**: Medium (optional infrastructure) 102 + 103 + #### Phase 4: Code Generation Tooling (Issue #20) 104 + - Design code generation template system (#36) 105 + - Implement Python code generator from schema records (#37) 106 + - Add CLI for code generation (#38) 107 + - Support type validation and compatibility checking (#39) 108 + 109 + **Status**: Blocked by Phase 2 110 + **Priority**: Medium (can run parallel with Phase 3) 111 + 112 + #### Phase 5: End-to-End Integration & Testing (Issue #21) 113 + - Create end-to-end example workflows (#40) 114 + - Write integration tests for full publish/discover/load cycle (#41) 115 + - Create comprehensive documentation (#42) 116 + - Performance testing and optimization (#43) 117 + 118 + **Status**: Blocked by Phase 2 119 + **Priority**: High (required for production release) 120 + 121 + ## Getting Started 122 + 123 + To begin implementation: 124 + 125 + 1. **Review design decisions** in `decisions/` directory - these need your input first 126 + 2. **Review architecture documents** (01-05) to understand the full scope 127 + 3. **Provide feedback** on the design decisions and open questions 128 + 4. **Finalize decisions** for issues #45-49 129 + 5. **Validate Lexicons** (issue #50) once decisions are made 130 + 6. **Begin Phase 1 implementation** after validation 131 + 7. **Track progress** using chainlink issues 132 + 133 + ### Quick Start for Decision Review 134 + 135 + 1. Read [decisions/README.md](decisions/README.md) for overview 136 + 2. Review each decision document (01-06) 137 + 3. For each decision: 138 + - Agree with recommendation? → Comment on issue 139 + - Disagree? → Propose alternative in issue 140 + - Unsure? → Discuss open questions 141 + 4. Once all decisions made → Proceed to issue #50 (validation) 142 + 143 + ## Key Design Decisions Needed 144 + 145 + Before starting implementation, we need decisions on (see Issue #44 and subissues #45-50): 146 + 147 + 1. **Schema representation format** (Issue #45) 148 + - Recommendation: Custom format within ATProto Lexicon 149 + - Alternative: JSON Schema or Protobuf 150 + - Details in `02_lexicon_design.md` 151 + 152 + 2. **Lens code storage** (Issue #46) 153 + - Recommendation: Code references (GitHub + commit) only 154 + - Alternative: Allow inline code (security concerns) 155 + - Details in `02_lexicon_design.md` 156 + 157 + 3. **WebDataset storage location** (Issue #47) 158 + - Phase 1: External storage (S3, HTTP) - just URLs 159 + - Future: ATProto blob storage for smaller datasets 160 + - Details in `02_lexicon_design.md` 161 + 162 + 4. **Schema evolution strategy** (Issue #48) 163 + - How to handle versioning and compatibility 164 + - Migration path for breaking changes 165 + - Details in `05_codegen.md` 166 + 167 + 5. **Lexicon namespace** (Issue #49) 168 + - Current proposal: `app.bsky.atdata.*` 169 + - May need to coordinate with ATProto/Bluesky team 170 + - Details in `02_lexicon_design.md` 171 + 172 + 6. **Lexicon validation** (Issue #50) 173 + - Validate all Lexicon JSON against ATProto spec 174 + - Create example records for testing 175 + - Blocked by decisions #45-49 176 + 177 + ## Questions for Discussion 178 + 179 + Review the "Open Design Questions" sections in each planning document, particularly: 180 + 181 + - `01_overview.md` - Overall architecture questions 182 + - `02_lexicon_design.md` - Lexicon-specific design questions (CRITICAL for Phase 1) 183 + 184 + ## Next Steps 185 + 186 + 1. Review planning documents 187 + 2. Discuss and finalize design decisions 188 + 3. Begin Phase 1 implementation 189 + 4. Iterate and refine as we learn 190 + 191 + --- 192 + 193 + **Milestone Created**: 2026-01-07 194 + **Last Updated**: 2026-01-07 195 + **Status**: Planning complete, ready for review
+19
.planning/atproto_integration.md
··· 1 + # Planning for full atproto integration 2 + 3 + The overall goal for `atdata` is that the index for datasets is actually present on the atproto distributed repository, with one type of Lexicon schema for actually containing information about `PackableSample` schemas that can be reproduced with code gen, and one type of Lexicon schema designed for the main functionality: records holding the links to the WDS dataset for samples and the msgpack metadata (that can be plugged into the `Dataset` class) as well as a reference to the atproto record containing the schema for the appropriate sample type for the dataset. 4 + 5 + ## Thoughts on functionality 6 + 7 + * Lexicons 8 + * Definition of a `PackableSample`-compatible sample type schema, that can be used to reconstitute the code in appropriate languages using code gen toolilng 9 + * Index records that contain links to the actual WebDataset data, as well as to the records with the corresponding sample schema. 10 + * `Lenses` between defined sample type schemas across the network. 11 + * Python library functionality 12 + * Logging in with the atproto sdk 13 + * Posting sample schemas and dataset index records to the appropriate lexicons for the user 14 + * AppView functionality 15 + * Aggregating index records, making an index of those that is quick to query on 16 + 17 + ## Questions for implementation 18 + 19 + * What is the best way to store the sample type schemas within atproto Lexicons? I've thought about using JSON schema or protobuf, but want to think through possibilities.
+239
.planning/decisions/01_schema_representation_format.md
··· 1 + # Decision: Schema Representation Format 2 + 3 + **Issue**: #45 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + Let's go with the **JSON schema** approach; the only real issue we have to worry about here is the `NDArray` support, and we can solve that by 11 + 12 + * Adding a standardized JSON Schema shim to represent an `NDArray` as its serialized bytes 13 + * Referencing this as the type within other schemas, and making this the standard we use 14 + 15 + We'll make this decision future-proof by adding a property in the Lexicon for schemas that gives the type of schema definition, with one currently supported value (for JSON Schema), and then leave the standard overall as an open union, as is standard for atproto lexicons. 16 + 17 + --- 18 + 19 + ## Problem Statement 20 + 21 + We need to decide how to represent `PackableSample` type definitions within ATProto Lexicon records. This affects: 22 + - How schemas are stored and transmitted 23 + - Code generation complexity 24 + - Cross-language interoperability 25 + - Tooling ecosystem availability 26 + 27 + ## Context 28 + 29 + `PackableSample` types have specific requirements: 30 + - Support for primitive types (str, int, float, bool, bytes) 31 + - **Special handling for `NDArray` types** with dtype and shape information 32 + - Msgpack serialization metadata 33 + - Optional/required field semantics 34 + - Future extensibility (constraints, validation, nested types) 35 + 36 + ## Options 37 + 38 + ### Option 1: Custom Format within ATProto Lexicon ⭐ RECOMMENDED 39 + 40 + **Description**: Define our own type system using ATProto Lexicon primitives 41 + 42 + **Example**: 43 + ```json 44 + { 45 + "name": "image", 46 + "type": { 47 + "kind": "ndarray", 48 + "dtype": "uint8", 49 + "shape": [null, null, 3] 50 + }, 51 + "optional": false, 52 + "description": "RGB image with variable height/width" 53 + } 54 + ``` 55 + 56 + **Pros**: 57 + - ✅ Native to ATProto - no external dependencies 58 + - ✅ Tailored exactly to `PackableSample` needs 59 + - ✅ Clean representation of NDArray (dtype, shape constraints) 60 + - ✅ Full control over codegen implementation 61 + - ✅ Can evolve independently 62 + - ✅ Easy to extend (add constraints, validation rules, etc.) 63 + 64 + **Cons**: 65 + - ❌ Need to implement our own codegen tooling 66 + - ❌ Less ecosystem tooling available 67 + - ❌ Need to maintain custom parsers 68 + 69 + **Implementation Effort**: Medium 70 + - Lexicon design: ~2-3 days 71 + - Python codegen: ~5-7 days 72 + - Validation: ~2-3 days 73 + 74 + --- 75 + 76 + ### Option 2: JSON Schema 77 + 78 + **Description**: Use JSON Schema as the type definition format 79 + 80 + **Example**: 81 + ```json 82 + { 83 + "type": "object", 84 + "properties": { 85 + "image": { 86 + "type": "object", 87 + "x-atdata-type": "ndarray", 88 + "x-dtype": "uint8", 89 + "x-shape": [null, null, 3] 90 + } 91 + }, 92 + "required": ["image"] 93 + } 94 + ``` 95 + 96 + **Pros**: 97 + - ✅ Industry standard, widely understood 98 + - ✅ Extensive validation tooling exists 99 + - ✅ Many language implementations 100 + 101 + **Cons**: 102 + - ❌ Not designed for code generation 103 + - ❌ Awkward NDArray representation (need custom extensions like `x-atdata-type`) 104 + - ❌ Overly complex for our needs 105 + - ❌ Still need custom codegen despite standard format 106 + - ❌ Doesn't map cleanly to Python dataclasses 107 + 108 + **Implementation Effort**: Medium-High 109 + - Still need custom codegen despite standard format 110 + - JSON Schema parsers available but adaptation needed 111 + 112 + --- 113 + 114 + ### Option 3: Protobuf (Protocol Buffers) 115 + 116 + **Description**: Use Protobuf schema definitions 117 + 118 + **Example**: 119 + ```protobuf 120 + message ImageSample { 121 + bytes image = 1; // NDArray serialized 122 + string label = 2; 123 + optional float confidence = 3; 124 + } 125 + ``` 126 + 127 + **Pros**: 128 + - ✅ Excellent codegen ecosystem (Python, TypeScript, Rust, etc.) 129 + - ✅ Compact binary format 130 + - ✅ Strong cross-language support 131 + - ✅ Built-in versioning/evolution support 132 + 133 + **Cons**: 134 + - ❌ Not ATProto-native (different ecosystem) 135 + - ❌ NDArray handling is awkward (just bytes, lose dtype/shape info) 136 + - ❌ Requires compilation step 137 + - ❌ Less human-readable than JSON 138 + - ❌ Doesn't integrate well with msgpack serialization we already use 139 + - ❌ Would need to convert between Protobuf and our existing serialization 140 + 141 + **Implementation Effort**: High 142 + - Need to bridge Protobuf and PackableSample worlds 143 + - Complexity of maintaining two serialization systems 144 + 145 + ## Recommendation: Option 1 (Custom Format) 146 + 147 + **Rationale**: 148 + 149 + 1. **Perfect fit for PackableSample**: Our custom format can represent NDArray types with full dtype and shape information, which is critical for ML/data applications. 150 + 151 + 2. **ATProto-native**: Using Lexicon primitives means everything stays within the ATProto ecosystem. No external schema dependencies. 152 + 153 + 3. **Full control**: We can optimize the codegen for our exact use case. Want to generate dataclasses with specific decorators? Easy. Want to add custom validation? We control it. 154 + 155 + 4. **Simplicity**: Despite being "custom", it's actually simpler than adapting JSON Schema or Protobuf to our needs. Less impedance mismatch. 156 + 157 + 5. **Future-proof**: Easy to add features like: 158 + - Shape constraints and validation 159 + - Custom serialization hooks 160 + - Nested PackableSample types 161 + - Union types for polymorphic samples 162 + 163 + ## Implementation Plan 164 + 165 + If we choose Option 1: 166 + 167 + 1. **Finalize Lexicon structure** (see `02_lexicon_design.md`) 168 + - Field type definitions (primitive, ndarray, nested) 169 + - Union types for extensibility 170 + - Metadata fields 171 + 172 + 2. **Implement Python codegen** (see `05_codegen.md`) 173 + - Jinja2 templates for dataclass generation 174 + - Type annotation mapping 175 + - NDArray handling with dtype/shape comments 176 + 177 + 3. **Build validation tooling** 178 + - Schema validator (ensure schemas are well-formed) 179 + - Sample validator (ensure samples match schemas) 180 + - Compatibility checker (schema evolution) 181 + 182 + 4. **Document the format** 183 + - Clear spec for the type system 184 + - Examples for common patterns 185 + - Migration guide from JSON Schema if needed 186 + 187 + ## Alternative Approaches Considered 188 + 189 + **Hybrid approach**: Use JSON Schema for validation + custom codegen 190 + - Still has awkward NDArray representation 191 + - Added complexity of two systems 192 + - Not recommended 193 + 194 + **Defer decision**: Use simple types only, add NDArray later 195 + - Defeats the purpose - NDArray is core to ML datasets 196 + - Would require breaking changes later 197 + - Not recommended 198 + 199 + ## Impact on Other Decisions 200 + 201 + - **Code generation (#36-39)**: Custom format means we fully control codegen 202 + - **Validation (#50)**: Need to implement custom validators 203 + - **Cross-language support (future)**: Need to write codegen for each language, but format is language-agnostic 204 + 205 + ## Success Criteria 206 + 207 + After implementing this decision: 208 + - ✅ Can represent all current PackableSample types 209 + - ✅ NDArray types include dtype and shape information 210 + - ✅ Generated code is idiomatic Python (dataclasses with type hints) 211 + - ✅ Schema records are human-readable 212 + - ✅ Codegen is fast (<1s for typical schemas) 213 + 214 + ## Open Questions 215 + 216 + 1. **Should we support shape constraints beyond documentation?** 217 + - e.g., should [224, 224, 3] be enforced at runtime? 218 + - Recommendation: Document only initially, add validation later 219 + 220 + 2. **How to handle nested PackableSample types?** 221 + - Reference by schema URI? 222 + - Inline nested schema? 223 + - Recommendation: URI reference for Phase 1 224 + 225 + 3. **Should we generate both classes and validators?** 226 + - Just classes, or also Pydantic models? 227 + - Recommendation: Start with dataclasses, add Pydantic later if needed 228 + 229 + ## References 230 + 231 + - Full Lexicon design: `../02_lexicon_design.md` 232 + - Code generation plan: `../05_codegen.md` 233 + - Example schemas: `../02_lexicon_design.md` (Schema Record Lexicon section) 234 + 235 + --- 236 + 237 + **Decision Needed By**: Before starting Phase 1 Issue #22 (Lexicon design) 238 + **Decision Maker**: Project maintainer (max) 239 + **Date Created**: 2026-01-07
+352
.planning/decisions/02_lens_code_storage.md
··· 1 + # Decision: Lens Code Storage Approach 2 + 3 + **Issue**: #46 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + Let's go with Option 1, using external repositories. We can actually make this work for 11 + 12 + * GitHub 13 + * tangled.org (the native ATProto git repository system) 14 + 15 + Additionally, we'll want to keep track of metadata for lenses giving the language the referenced code is implemented in. 16 + 17 + Longer-term, it will also be good to add another Lexicon specification for attestation of `Lens` formal correctness (where possible), as this will enable filtering lens implementations by provability. We'll also want to add our own `verification` records that give attestation of individual atproto DIDs (user identities) as being "trusted" for creating `Lens`es, etc. 18 + 19 + --- 20 + 21 + ## Problem Statement 22 + 23 + We need to decide how to store the transformation code for Lens records on ATProto. Lenses define bidirectional transformations between sample types (getter: Source → Target, putter: Target × Source → Source). 24 + 25 + This is a **critical security decision** because we're dealing with executable code. 26 + 27 + ## Context 28 + 29 + Lens transformations are functions that: 30 + - Take samples of one type and transform them to another 31 + - Are bidirectional (getter + putter) 32 + - Need to be reproducible and verifiable 33 + - Potentially execute on untrusted data 34 + 35 + Example Lens: 36 + ```python 37 + @atdata.lens 38 + def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample: 39 + gray = cv2.cvtColor(rgb_sample.image, cv2.COLOR_RGB2GRAY) 40 + return GrayscaleSample(image=gray, label=rgb_sample.label) 41 + 42 + @rgb_to_grayscale.putter 43 + def grayscale_to_rgb(gray: GrayscaleSample, rgb: RGBSample) -> RGBSample: 44 + # Convert back to RGB (approximate) 45 + rgb_img = cv2.cvtColor(gray.image, cv2.COLOR_GRAY2RGB) 46 + return RGBSample(image=rgb_img, label=gray.label) 47 + ``` 48 + 49 + ## Options 50 + 51 + ### Option 1: Code References Only (GitHub/GitLab + Commit Hash) ⭐ RECOMMENDED 52 + 53 + **Description**: Store only references to code in version control repositories 54 + 55 + **Record Format**: 56 + ```json 57 + { 58 + "getterCode": { 59 + "kind": "reference", 60 + "repository": "https://github.com/alice/lenses", 61 + "commit": "a1b2c3d4e5f6789...", 62 + "path": "lenses/vision.py:rgb_to_grayscale" 63 + }, 64 + "putterCode": { 65 + "kind": "reference", 66 + "repository": "https://github.com/alice/lenses", 67 + "commit": "a1b2c3d4e5f6789...", 68 + "path": "lenses/vision.py:grayscale_to_rgb" 69 + } 70 + } 71 + ``` 72 + 73 + **Pros**: 74 + - ✅ **Secure**: No arbitrary code execution from ATProto records 75 + - ✅ **Verifiable**: Commit hash ensures immutability 76 + - ✅ **Auditable**: Users can review code before using 77 + - ✅ **Version controlled**: Natural versioning through git 78 + - ✅ **Professional workflow**: Encourages proper development practices 79 + 80 + **Cons**: 81 + - ❌ External dependency (repo could disappear) 82 + - ❌ Requires users to have code in public/accessible repos 83 + - ❌ Need to clone/fetch repos to use lenses 84 + - ❌ Less convenient than self-contained records 85 + 86 + **Security**: ⭐⭐⭐⭐⭐ Excellent 87 + **Convenience**: ⭐⭐⭐ Good 88 + **Implementation Effort**: Low-Medium 89 + 90 + --- 91 + 92 + ### Option 2: Inline Python Code with Sandboxing 93 + 94 + **Description**: Store Python source code directly in records, execute in sandbox 95 + 96 + **Record Format**: 97 + ```json 98 + { 99 + "getterCode": { 100 + "kind": "python", 101 + "source": "def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample:\n ..." 102 + } 103 + } 104 + ``` 105 + 106 + **Pros**: 107 + - ✅ Self-contained records 108 + - ✅ No external dependencies 109 + - ✅ More convenient for users 110 + - ✅ Easier discovery and exploration 111 + 112 + **Cons**: 113 + - ❌ **MAJOR SECURITY RISK**: Executing untrusted code 114 + - ❌ Sandboxing Python is extremely difficult 115 + - ❌ Even with sandboxing, attack surface is large 116 + - ❌ `eval()`/`exec()` considered harmful 117 + - ❌ Would need extensive review and testing 118 + - ❌ Potential for malicious code injection 119 + 120 + **Security**: ⭐ Very Poor (even with sandboxing) 121 + **Convenience**: ⭐⭐⭐⭐⭐ Excellent 122 + **Implementation Effort**: Very High (sandboxing is complex) 123 + 124 + **Why Sandboxing is Hard**: 125 + - Python has many ways to break out of sandboxes 126 + - Import system, file I/O, network access all need blocking 127 + - `__import__`, `eval`, `exec`, `compile`, `open`, etc. 128 + - Even readonly access can leak sensitive data 129 + - See: [PyPy sandbox](https://doc.pypy.org/en/latest/sandbox.html) - discontinued 130 + 131 + --- 132 + 133 + ### Option 3: Bytecode or AST Representation 134 + 135 + **Description**: Store compiled bytecode or AST instead of source 136 + 137 + **Pros**: 138 + - ✅ Slightly safer than raw source (no syntax injection) 139 + - ✅ Self-contained 140 + 141 + **Cons**: 142 + - ❌ Still executes arbitrary code - same security issues 143 + - ❌ Harder to audit than source 144 + - ❌ Platform/version dependent (Python bytecode changes) 145 + - ❌ Complex to implement 146 + - ❌ Doesn't solve the fundamental problem 147 + 148 + **Security**: ⭐⭐ Poor 149 + **Convenience**: ⭐⭐ Poor (less readable) 150 + **Implementation Effort**: High 151 + 152 + --- 153 + 154 + ### Option 4: Metadata Only (Manual Implementation) 155 + 156 + **Description**: Store only metadata about transformations, require manual implementation 157 + 158 + **Record Format**: 159 + ```json 160 + { 161 + "description": "Converts RGB images to grayscale", 162 + "getterSignature": "(RGBSample) -> GrayscaleSample", 163 + "putterSignature": "(GrayscaleSample, RGBSample) -> RGBSample" 164 + } 165 + ``` 166 + 167 + **Pros**: 168 + - ✅ Completely safe 169 + - ✅ Simple to implement 170 + 171 + **Cons**: 172 + - ❌ Lenses not actually usable 173 + - ❌ Defeats the purpose of publishing transformations 174 + - ❌ No network effect (can't compose lenses) 175 + 176 + **Security**: ⭐⭐⭐⭐⭐ Excellent 177 + **Convenience**: ⭐ Very Poor 178 + **Implementation Effort**: Very Low 179 + 180 + ## Recommendation: Option 1 (Code References Only) 181 + 182 + **Rationale**: 183 + 184 + 1. **Security First**: We cannot compromise on security. Publishing executable code to a public network is extremely dangerous without proper safeguards. 185 + 186 + 2. **Verifiable and Auditable**: With commit hashes, users can: 187 + - Review the exact code before execution 188 + - Verify it hasn't been tampered with 189 + - Make informed trust decisions 190 + 191 + 3. **Professional Workflow**: Requiring code in version control: 192 + - Encourages good practices (testing, documentation) 193 + - Makes lens development collaborative 194 + - Enables code review 195 + 196 + 4. **Future Extensibility**: We can add inline code later if we solve sandboxing, but we can't easily remove it once added. 197 + 198 + ## Implementation Plan 199 + 200 + If we choose Option 1: 201 + 202 + 1. **Lexicon Design** (Phase 1) 203 + ```json 204 + "transformCode": { 205 + "type": "union", 206 + "refs": ["#codeReference"] 207 + }, 208 + "codeReference": { 209 + "type": "object", 210 + "required": ["kind", "repository", "commit", "path"], 211 + "properties": { 212 + "kind": {"type": "string", "const": "reference"}, 213 + "repository": {"type": "string", "maxLength": 500}, 214 + "commit": {"type": "string", "maxLength": 40}, 215 + "path": {"type": "string", "maxLength": 500} 216 + } 217 + } 218 + ``` 219 + 220 + 2. **Lens Publisher** (Phase 2) 221 + - Automatically detect git repo and commit from function location 222 + - Validate that repo is accessible 223 + - Include function name and module path 224 + 225 + 3. **Lens Loader** (Phase 2) 226 + - Clone/fetch repository at specified commit 227 + - Import function from specified path 228 + - Cache cloned repos locally 229 + - Verify function signatures match schema 230 + 231 + 4. **Trust Model** 232 + - Users explicitly approve which repos to trust 233 + - Whitelist/blacklist mechanism 234 + - Warn on first use of any lens 235 + 236 + ## Alternative Approaches Considered 237 + 238 + **Signed inline code**: Store inline code with cryptographic signatures 239 + - Still has execution risk 240 + - Signature only proves authorship, not safety 241 + - Not recommended 242 + 243 + **WASM modules**: Compile transformations to WebAssembly 244 + - More sandboxed than Python 245 + - Very complex to implement 246 + - Would require rewriting lenses in Rust/C++ 247 + - Interesting future direction but not for Phase 1 248 + 249 + ## User Experience Implications 250 + 251 + **Publishing a Lens**: 252 + ```python 253 + # 1. Write lens code in your repo 254 + # lenses/vision.py 255 + @atdata.lens 256 + def rgb_to_grayscale(rgb: RGBSample) -> GrayscaleSample: 257 + ... 258 + 259 + # 2. Commit and push 260 + git add lenses/vision.py 261 + git commit -m "Add RGB to grayscale lens" 262 + git push 263 + 264 + # 3. Publish to ATProto (automatically detects git info) 265 + client = ATProtoClient() 266 + client.login("alice.bsky.social", "password") 267 + 268 + lens_publisher = LensPublisher(client) 269 + lens_uri = lens_publisher.publish_lens( 270 + rgb_to_grayscale, 271 + source_schema_uri="at://alice/schema/rgb", 272 + target_schema_uri="at://alice/schema/gray" 273 + ) 274 + ``` 275 + 276 + **Using a Lens**: 277 + ```python 278 + # 1. Discover lens 279 + loader = LensLoader(client) 280 + lenses = loader.search_lenses( 281 + source_schema="at://alice/schema/rgb", 282 + target_schema="at://alice/schema/gray" 283 + ) 284 + 285 + # 2. User reviews the repo/code (outside tool) 286 + # 3. User approves the repo 287 + 288 + # 4. Load and use lens 289 + rgb_to_gray = loader.load_lens(lenses[0]['uri']) 290 + gray_sample = rgb_to_gray(rgb_sample) 291 + ``` 292 + 293 + ## Security Considerations 294 + 295 + Even with code references: 296 + - **Malicious repos**: Users could reference repos with malicious code 297 + - **Mitigation**: Explicit user approval, warnings, sandboxing (future) 298 + 299 + - **Repo compromise**: Git repos could be hacked 300 + - **Mitigation**: Commit hash pins exact version, users can audit 301 + 302 + - **Dependency injection**: Lens code could import malicious packages 303 + - **Mitigation**: Users review code, standard Python security practices 304 + 305 + ## Future Enhancements 306 + 307 + **If we want inline code later**: 308 + 1. Build robust Python sandbox (e.g., using PyPy, restrictedpython) 309 + 2. Add extensive security testing 310 + 3. Implement strict permissions model 311 + 4. Use WebAssembly for true isolation 312 + 5. Add code signing and reputation system 313 + 314 + **For now**: Start with references, prove the concept, add inline code only if there's strong demand and we can do it safely. 315 + 316 + ## Open Questions 317 + 318 + 1. **Private repositories**: How to handle lenses in private repos? 319 + - Could support auth tokens (stored locally, not in record) 320 + - Could use SSH keys 321 + - Recommendation: Public repos only for Phase 1 322 + 323 + 2. **Repository availability**: What if repo goes offline? 324 + - Could encourage mirrors 325 + - Could cache code (with user permission) 326 + - Recommendation: Accept the risk, it's part of decentralization 327 + 328 + 3. **Non-Python lenses**: What about TypeScript, Rust, etc.? 329 + - References work for any language 330 + - Each language would need its own loader 331 + - Recommendation: Python-only for Phase 1 332 + 333 + ## Success Criteria 334 + 335 + After implementing this decision: 336 + - ✅ Lenses can be published with code references 337 + - ✅ Users can load and execute lenses from approved repos 338 + - ✅ No arbitrary code execution from untrusted sources 339 + - ✅ Lens records include immutable commit hashes 340 + - ✅ Clear warnings when using external code 341 + 342 + ## References 343 + 344 + - Lexicon design: `../02_lexicon_design.md` (Lens Record Lexicon) 345 + - Python client implementation: `../03_python_client.md` (LensPublisher) 346 + - Security best practices: Python security guide 347 + 348 + --- 349 + 350 + **Decision Needed By**: Before starting Phase 1 Issue #24 (Lens Lexicon design) 351 + **Decision Maker**: Project maintainer (max) 352 + **Date Created**: 2026-01-07
+366
.planning/decisions/03_webdataset_storage.md
··· 1 + # Decision: WebDataset Storage Strategy 2 + 3 + **Issue**: #47 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + Let's build the hybrid approach in from the beginning. Critically: 11 + 12 + * We'll keep track of whether dataset index records are referencing an external storage (S3, R2, etc) by URL or a PDS blob using an open union to define the data location 13 + * In the AppView implementation, we can proxy WDS urls for datasets across individual stored blobs, which streamlines some of the design. 14 + 15 + This will help us be robust from the start -- particularly for those self-hosting. 16 + 17 + --- 18 + 19 + ## Problem Statement 20 + 21 + We need to decide where the actual WebDataset `.tar` files are stored and how dataset records reference them. This affects decentralization, reliability, and scalability. 22 + 23 + ## Context 24 + 25 + WebDataset files are: 26 + - **Large**: Typically gigabytes to terabytes 27 + - **Immutable**: Once created, datasets rarely change 28 + - **Sharded**: Split across multiple `.tar` files (e.g., `data-{000000..000099}.tar`) 29 + - **Binary**: Contain msgpack-serialized samples with images/arrays 30 + 31 + Current `atdata` usage: 32 + ```python 33 + # External storage (S3, HTTP, etc.) 34 + dataset = Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar") 35 + ``` 36 + 37 + ## Options 38 + 39 + ### Option 1: External Storage with URL References ⭐ RECOMMENDED (Phase 1) 40 + 41 + **Description**: Store WebDataset files on existing storage (S3, HTTP, IPFS, etc.), record only contains URLs 42 + 43 + **Record Format**: 44 + ```json 45 + { 46 + "$type": "app.bsky.atdata.dataset", 47 + "name": "CIFAR-10 Training Set", 48 + "urls": [ 49 + "s3://my-bucket/cifar10-train-{000000..000049}.tar" 50 + ], 51 + "schemaRef": "at://alice/schema/image", 52 + ... 53 + } 54 + ``` 55 + 56 + **Supported URL Schemes**: 57 + - `s3://` - AWS S3 and compatible (MinIO, DigitalOcean Spaces) 58 + - `https://` - HTTP/HTTPS servers 59 + - `gs://` - Google Cloud Storage 60 + - `ipfs://` - IPFS (decentralized, content-addressed) 61 + - `file://` - Local files (for development) 62 + 63 + **Pros**: 64 + - ✅ **No size limits**: Store datasets of any size 65 + - ✅ **Existing infrastructure**: Leverage proven storage solutions 66 + - ✅ **No ATProto storage costs**: Publishers pay for their own storage 67 + - ✅ **Performance**: Use CDNs, regional endpoints, etc. 68 + - ✅ **Compatibility**: Works with current `atdata` code 69 + - ✅ **Flexibility**: Different storage for different use cases 70 + 71 + **Cons**: 72 + - ❌ **Centralization risk**: If storage provider goes down, dataset unavailable 73 + - ❌ **URL rot**: Links can break over time 74 + - ❌ **No permanence guarantee**: Publisher can delete files 75 + - ❌ **Access control complexity**: Need to handle auth for private datasets 76 + 77 + **Decentralization**: ⭐⭐ Fair (better with IPFS) 78 + **Reliability**: ⭐⭐⭐ Good (depends on storage provider) 79 + **Cost**: ⭐⭐⭐⭐ Excellent (publishers pay storage costs) 80 + **Implementation Effort**: ⭐⭐⭐⭐⭐ Very Low (already supported) 81 + 82 + --- 83 + 84 + ### Option 2: ATProto Blob Storage 85 + 86 + **Description**: Store WebDataset files as ATProto blobs, record contains blob CIDs 87 + 88 + **Record Format**: 89 + ```json 90 + { 91 + "$type": "app.bsky.atdata.dataset", 92 + "name": "Small Dataset", 93 + "blobs": [ 94 + {"$type": "blob", "ref": {"$link": "bafyrei..."}}, 95 + {"$type": "blob", "ref": {"$link": "bafyrei..."}} 96 + ], 97 + "schemaRef": "at://alice/schema/image", 98 + ... 99 + } 100 + ``` 101 + 102 + **Pros**: 103 + - ✅ **True decentralization**: Data lives on ATProto network 104 + - ✅ **Content-addressed**: CIDs guarantee immutability 105 + - ✅ **Permanence**: As permanent as ATProto itself 106 + - ✅ **No external dependencies**: Self-contained 107 + 108 + **Cons**: 109 + - ❌ **Size limits**: ATProto may have blob size restrictions (need to verify) 110 + - ❌ **Storage costs**: Who pays for storing large datasets? 111 + - ❌ **Performance**: May be slower than specialized data storage 112 + - ❌ **Scalability**: Not designed for TB-scale datasets 113 + - ❌ **Unknown limitations**: ATProto blob storage is less proven for this use case 114 + 115 + **Decentralization**: ⭐⭐⭐⭐⭐ Excellent 116 + **Reliability**: ⭐⭐⭐⭐ Very Good (ATProto network) 117 + **Cost**: ⭐ Poor (storage costs for large datasets) 118 + **Implementation Effort**: ⭐⭐⭐ Medium (need to implement blob upload/download) 119 + 120 + --- 121 + 122 + ### Option 3: Hybrid Approach 123 + 124 + **Description**: Support both external URLs and ATProto blobs 125 + 126 + **Record Format**: 127 + ```json 128 + { 129 + "$type": "app.bsky.atdata.dataset", 130 + "name": "Hybrid Dataset", 131 + "storage": { 132 + "kind": "external", 133 + "urls": ["s3://bucket/data-{000000..000009}.tar"] 134 + }, 135 + // OR 136 + "storage": { 137 + "kind": "blobs", 138 + "blobs": [{"$type": "blob", "ref": {"$link": "bafyrei..."}}] 139 + }, 140 + ... 141 + } 142 + ``` 143 + 144 + **Pros**: 145 + - ✅ Best of both worlds 146 + - ✅ Flexibility for different use cases 147 + - ✅ Can migrate between storage types 148 + 149 + **Cons**: 150 + - ❌ More complex Lexicon and implementation 151 + - ❌ Confusing for users (which to choose?) 152 + - ❌ Testing burden (need to test both paths) 153 + 154 + **Implementation Effort**: ⭐⭐ High (two systems to maintain) 155 + 156 + ## Recommendation: Option 1 (External URLs) for Phase 1, Option 3 (Hybrid) for Future 157 + 158 + **Rationale**: 159 + 160 + 1. **Pragmatism**: Most ML datasets are huge (10GB-10TB). ATProto blob storage is not designed for this scale. 161 + 162 + 2. **Existing Infrastructure**: S3, GCS, HTTP are battle-tested for large file storage. Why reinvent the wheel? 163 + 164 + 3. **Cost Model**: Publishers pay for their own storage. This is sustainable and aligns incentives. 165 + 166 + 4. **IPFS for Decentralization**: Users who want decentralization can use `ipfs://` URLs, which are content-addressed and distributed. 167 + 168 + 5. **Future-Proof**: We can add blob storage later for small datasets (<100MB) without breaking existing datasets. 169 + 170 + ## Implementation Plan 171 + 172 + ### Phase 1: External URLs Only 173 + 174 + **Lexicon Design**: 175 + ```json 176 + { 177 + "urls": { 178 + "type": "array", 179 + "description": "WebDataset URLs (supports brace notation)", 180 + "items": { 181 + "type": "string", 182 + "format": "uri", 183 + "maxLength": 1000 184 + }, 185 + "minLength": 1 186 + } 187 + } 188 + ``` 189 + 190 + **Publisher Implementation**: 191 + ```python 192 + publisher = DatasetPublisher(client) 193 + dataset_uri = publisher.publish_dataset( 194 + dataset, 195 + name="My Dataset", 196 + description="Training data for my model" 197 + ) 198 + # dataset.url is used directly, no upload needed 199 + ``` 200 + 201 + **Loader Implementation**: 202 + ```python 203 + loader = DatasetLoader(client) 204 + dataset = loader.load_dataset("at://alice/dataset/123") 205 + # Creates Dataset with URL from record 206 + # Actual data loading happens lazily via WebDataset 207 + ``` 208 + 209 + **Validation**: 210 + - Check URL format (scheme + netloc + path) 211 + - Support brace notation for sharded datasets 212 + - Don't validate URL accessibility (too slow, may be private) 213 + 214 + ### Future: Add Blob Storage Option 215 + 216 + When ATProto blob storage is more mature and we understand limits: 217 + 218 + 1. **Add blob support to Lexicon**: 219 + ```json 220 + "storage": { 221 + "type": "union", 222 + "refs": ["#urlStorage", "#blobStorage"] 223 + } 224 + ``` 225 + 226 + 2. **Implement blob upload**: 227 + - Chunk large files 228 + - Upload shards as separate blobs 229 + - Update record with blob CIDs 230 + 231 + 3. **Size recommendations**: 232 + - Datasets <100MB → Consider blobs 233 + - Datasets >100MB → Use external URLs 234 + - Datasets >10GB → Definitely external URLs 235 + 236 + ## URL Scheme Support 237 + 238 + | Scheme | Support | Notes | 239 + |--------|---------|-------| 240 + | `s3://` | ✅ Phase 1 | AWS S3 and compatible services | 241 + | `https://` | ✅ Phase 1 | Public HTTP/HTTPS servers | 242 + | `http://` | ✅ Phase 1 | Upgraded to HTTPS when possible | 243 + | `gs://` | ✅ Phase 1 | Google Cloud Storage | 244 + | `ipfs://` | ✅ Phase 1 | Decentralized storage via IPFS | 245 + | `file://` | ✅ Phase 1 | Local development only | 246 + | `at://` | ⏳ Future | ATProto blob references | 247 + 248 + ## Decentralization Strategy 249 + 250 + For users who want decentralization without ATProto blobs: 251 + 252 + **IPFS + Pinning Services**: 253 + 1. Upload dataset to IPFS 254 + 2. Pin with service (Pinata, Infura, Web3.Storage) 255 + 3. Publish dataset with `ipfs://` URL 256 + 4. IPFS ensures content-addressed, distributed storage 257 + 258 + **Example**: 259 + ```python 260 + # Upload to IPFS (using ipfs client) 261 + ipfs_hash = upload_to_ipfs("data-000000.tar") 262 + 263 + # Publish dataset 264 + dataset_uri = publisher.publish_dataset( 265 + dataset, 266 + name="My Dataset", 267 + urls=[f"ipfs://{ipfs_hash}"] 268 + ) 269 + ``` 270 + 271 + **Benefits**: 272 + - Content-addressed (CID in URL) 273 + - Distributed (IPFS network) 274 + - Permanent (with pinning) 275 + - No ATProto blob limits 276 + 277 + ## Access Control Considerations 278 + 279 + **Public datasets**: URLs point to public storage 280 + - S3 public buckets 281 + - Public HTTP servers 282 + - IPFS (inherently public) 283 + 284 + **Private datasets**: URL points to private storage 285 + - S3 with authentication (pre-signed URLs? credentials?) 286 + - Private HTTP servers (auth tokens?) 287 + - Recommendation: Public datasets only for Phase 1 288 + 289 + **Future**: Could add access control metadata to records 290 + ```json 291 + { 292 + "access": { 293 + "kind": "authenticated", 294 + "requiredRole": "subscriber" 295 + } 296 + } 297 + ``` 298 + 299 + ## Storage Cost Implications 300 + 301 + | Storage Type | Cost Responsibility | Pros | Cons | 302 + |-------------|-------------------|------|------| 303 + | S3 | Publisher | Industry standard, reliable | Ongoing costs | 304 + | IPFS + Pinning | Publisher | Decentralized | Need pinning service | 305 + | HTTP Server | Publisher | Full control | Maintenance burden | 306 + | ATProto Blobs | Publisher? ATProto? | Simple | Unknown cost model | 307 + 308 + **Recommendation**: Let publishers choose based on their needs and budget. 309 + 310 + ## Alternative Approaches Considered 311 + 312 + **Torrents**: Use BitTorrent protocol 313 + - Pros: Decentralized, efficient for large files 314 + - Cons: Need seeders, not as well integrated 315 + - Could add in future with `torrent://` scheme 316 + 317 + **Arweave**: Permanent storage blockchain 318 + - Pros: True permanence, one-time payment 319 + - Cons: Expensive for large datasets 320 + - Could add in future for critical datasets 321 + 322 + ## Open Questions 323 + 324 + 1. **Should we validate URL accessibility when publishing?** 325 + - Pro: Catch broken links early 326 + - Con: Slow, may fail for private URLs 327 + - Recommendation: No validation, trust publishers 328 + 329 + 2. **Should we mirror datasets automatically?** 330 + - Could create community mirrors for popular datasets 331 + - Recommendation: Not for Phase 1, community can organize 332 + 333 + 3. **What about dataset versioning?** 334 + - New version = new record with new URLs 335 + - Could link to previous version in metadata 336 + - Recommendation: Simple versioning via new records 337 + 338 + 4. **Should we support multi-region URLs?** 339 + ```json 340 + "urls": [ 341 + {"region": "us-east-1", "url": "s3://..."}, 342 + {"region": "eu-west-1", "url": "s3://..."} 343 + ] 344 + ``` 345 + - Recommendation: Defer to future if needed 346 + 347 + ## Success Criteria 348 + 349 + After implementing this decision: 350 + - ✅ Datasets can reference external URLs (S3, HTTPS, IPFS) 351 + - ✅ WebDataset brace notation is preserved 352 + - ✅ Loading datasets works with existing `Dataset` class 353 + - ✅ No breaking changes to current `atdata` usage 354 + - ✅ Path clear for future blob storage support 355 + 356 + ## References 357 + 358 + - Lexicon design: `../02_lexicon_design.md` (Dataset Record Lexicon) 359 + - Python client: `../03_python_client.md` (DatasetPublisher/Loader) 360 + - WebDataset documentation: https://webdataset.github.io/webdataset/ 361 + 362 + --- 363 + 364 + **Decision Needed By**: Before starting Phase 1 Issue #23 (Dataset Lexicon design) 365 + **Decision Maker**: Project maintainer (max) 366 + **Date Created**: 2026-01-07
+509
.planning/decisions/04_schema_evolution.md
··· 1 + # Decision: Schema Evolution and Versioning Strategy 2 + 3 + **Issue**: #48 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation), #39 (Type validation) 6 + **Priority**: High 7 + 8 + ## DECISION 9 + 10 + For this, let's take the following approach: 11 + 12 + 1. Let's make the `rkey` for the `ac.foundation.dataset.sampleSchema` records be of type `any`. 13 + 2. Then, we can have our own standard for the `rkey` being of the format `{NSID}@{semver}`, where `{NSID}` gives an NSID for the permanent identifier of this sample schema type. 14 + * This allows us to bookkeep on the version updates 15 + * We can make a `ac.foundation.dataset.getLatestSchema` `query` Lexicon that will provide the record for the latest version of a given schema, as well 16 + 3. We can build into the `atdata` SDK that whenever users update their own sample schema types, they can pass in optional `Lens`es between the two versions that give transformations to downgrade / upgrade records, so that there's an easy dev-facing way to auto-update any existing datasets using an older schema and maintain compatibility with older code for newer data. 17 + 18 + --- 19 + 20 + ## Problem Statement 21 + 22 + We need to define how PackableSample schemas can evolve over time without breaking existing datasets or code. This includes: 23 + - Version numbering scheme 24 + - Compatibility rules (what changes are allowed?) 25 + - Migration strategies 26 + - Runtime validation 27 + 28 + ## Context 29 + 30 + Schemas will evolve: 31 + - **Adding new fields** (e.g., adding optional metadata) 32 + - **Removing deprecated fields** 33 + - **Changing field types** (e.g., int → float) 34 + - **Changing field constraints** (e.g., making field optional) 35 + 36 + Real-world example: 37 + ```python 38 + # Version 1.0.0 39 + @atdata.packable 40 + class ImageSample: 41 + image: NDArray 42 + label: str 43 + 44 + # Version 1.1.0 - add optional field (backward compatible) 45 + @atdata.packable 46 + class ImageSample: 47 + image: NDArray 48 + label: str 49 + confidence: Optional[float] = None # NEW 50 + 51 + # Version 2.0.0 - remove field (breaking change) 52 + @atdata.packable 53 + class ImageSample: 54 + image: NDArray 55 + # label removed - BREAKING 56 + class_id: int # NEW, replaces label 57 + ``` 58 + 59 + ## Goals 60 + 61 + 1. **Backward compatibility**: Old code can read new data (when possible) 62 + 2. **Forward compatibility**: New code can read old data (when possible) 63 + 3. **Clear breaking changes**: Users know when they need to update 64 + 4. **Safe migrations**: Data transformations are explicit and verifiable 65 + 5. **Developer-friendly**: Easy to understand and use 66 + 67 + ## Versioning Scheme 68 + 69 + ### Semantic Versioning (MAJOR.MINOR.PATCH) 70 + 71 + **Recommendation**: Use semantic versioning for schemas 72 + 73 + ``` 74 + 1.0.0 → 1.0.1 → 1.1.0 → 2.0.0 75 + ``` 76 + 77 + **Version Components**: 78 + - **MAJOR**: Breaking changes (incompatible with previous versions) 79 + - **MINOR**: Backward-compatible additions (new optional fields) 80 + - **PATCH**: Documentation, clarifications, no functional changes 81 + 82 + ### Examples 83 + 84 + ```python 85 + # 1.0.0 → 1.0.1 (PATCH) 86 + # Change: Fixed documentation, added field description 87 + # Compatible: ✅ Yes 88 + # Action: None needed 89 + 90 + # 1.0.0 → 1.1.0 (MINOR) 91 + # Change: Added optional field 'metadata' 92 + # Compatible: ✅ Yes (backward compatible) 93 + # Action: Old code works, new code can use new field 94 + 95 + # 1.0.0 → 2.0.0 (MAJOR) 96 + # Change: Removed field 'old_field' 97 + # Compatible: ❌ No (breaking change) 98 + # Action: Users must migrate or use conversion lens 99 + ``` 100 + 101 + ## Compatibility Rules 102 + 103 + ### Backward-Compatible Changes (MINOR version bump) 104 + 105 + **Allowed**: 106 + - ✅ Adding optional fields 107 + - ✅ Making required field optional 108 + - ✅ Widening type constraints (e.g., relaxing shape requirements) 109 + - ✅ Adding documentation 110 + - ✅ Adding metadata 111 + 112 + **Example**: 113 + ```python 114 + # v1.0.0 115 + class Sample: 116 + x: int 117 + 118 + # v1.1.0 - backward compatible 119 + class Sample: 120 + x: int 121 + y: Optional[int] = None # Added optional field 122 + ``` 123 + 124 + **Guarantee**: Code written for v1.0.0 continues to work with v1.1.0 schemas 125 + 126 + --- 127 + 128 + ### Breaking Changes (MAJOR version bump) 129 + 130 + **Required**: 131 + - ❌ Removing fields 132 + - ❌ Changing field types (str → int) 133 + - ❌ Making optional field required 134 + - ❌ Narrowing type constraints (e.g., restricting shape) 135 + - ❌ Renaming fields 136 + 137 + **Example**: 138 + ```python 139 + # v1.0.0 140 + class Sample: 141 + x: int 142 + y: int 143 + 144 + # v2.0.0 - breaking changes 145 + class Sample: 146 + x: float # Type changed 147 + # y removed 148 + z: int # New required field 149 + ``` 150 + 151 + **Guarantee**: Code written for v1.0.0 will NOT work with v2.0.0 without updates 152 + 153 + --- 154 + 155 + ### Non-Breaking Changes (PATCH version bump) 156 + 157 + **Allowed**: 158 + - ✅ Documentation updates 159 + - ✅ Metadata changes 160 + - ✅ Clarifications 161 + - ✅ Bug fixes in schema definition (not structure) 162 + 163 + **No functional changes to schema structure** 164 + 165 + ## Compatibility Checking 166 + 167 + ### Automatic Compatibility Checker 168 + 169 + Implement `SchemaValidator` to check compatibility: 170 + 171 + ```python 172 + from atdata.codegen import SchemaValidator 173 + 174 + validator = SchemaValidator() 175 + 176 + old_schema = load_schema("at://alice/schema/sample/v1.0.0") 177 + new_schema = load_schema("at://alice/schema/sample/v1.1.0") 178 + 179 + is_compatible, issues = validator.is_compatible(old_schema, new_schema) 180 + 181 + if not is_compatible: 182 + print("Incompatibilities found:") 183 + for issue in issues: 184 + print(f" - {issue}") 185 + ``` 186 + 187 + **Checks**: 188 + 1. Field additions/removals 189 + 2. Type changes 190 + 3. Optional → Required changes 191 + 4. Shape constraint changes 192 + 193 + See `../05_codegen.md` for implementation details. 194 + 195 + ### Version Constraints in Dataset Records 196 + 197 + Datasets can specify schema version constraints: 198 + 199 + ```json 200 + { 201 + "$type": "app.bsky.atdata.dataset", 202 + "schemaRef": "at://alice/schema/sample/v1.0.0", 203 + "schemaVersionConstraint": ">=1.0.0,<2.0.0", 204 + ... 205 + } 206 + ``` 207 + 208 + **Semantics**: 209 + - Dataset created with v1.0.0 210 + - Compatible with v1.x.x (minor/patch updates) 211 + - NOT compatible with v2.x.x (breaking changes) 212 + 213 + ## Migration Strategies 214 + 215 + ### Option 1: Lenses as Migration Paths ⭐ RECOMMENDED 216 + 217 + **Concept**: Use Lens transformations to migrate between schema versions 218 + 219 + ```python 220 + # Migration lens: v1.0.0 → v2.0.0 221 + @atdata.lens 222 + def sample_v1_to_v2(v1: SampleV1) -> SampleV2: 223 + """Migrate from v1.0.0 to v2.0.0""" 224 + return SampleV2( 225 + x=float(v1.x), # int → float 226 + z=hash(v1.y) % 100 # derive z from removed y 227 + ) 228 + 229 + @sample_v1_to_v2.putter 230 + def sample_v2_to_v1(v2: SampleV2, v1: SampleV1) -> SampleV1: 231 + """Reverse migration (lossy)""" 232 + return SampleV1( 233 + x=int(v2.x), 234 + y=0 # Can't recover removed field 235 + ) 236 + ``` 237 + 238 + **Benefits**: 239 + - ✅ Reuses existing Lens infrastructure 240 + - ✅ Explicit transformation logic 241 + - ✅ Bidirectional (when possible) 242 + - ✅ Publishable and discoverable 243 + 244 + **Limitations**: 245 + - ❌ May be lossy (can't always reverse) 246 + - ❌ Requires manual implementation 247 + 248 + --- 249 + 250 + ### Option 2: Automatic Migration 251 + 252 + **Concept**: Generate migrations automatically based on schema diff 253 + 254 + ```python 255 + migrator = SchemaM migrator() 256 + v2_sample = migrator.migrate(v1_sample, target_version="2.0.0") 257 + ``` 258 + 259 + **Benefits**: 260 + - ✅ Convenient for users 261 + - ✅ No manual code needed 262 + 263 + **Limitations**: 264 + - ❌ Only works for simple changes (add/remove optional fields) 265 + - ❌ Can't handle complex transformations (type changes) 266 + - ❌ Risk of incorrect assumptions 267 + 268 + **Recommendation**: Could implement for simple cases, but Lenses are more general 269 + 270 + --- 271 + 272 + ### Option 3: Manual Migration Scripts 273 + 274 + **Concept**: Users write custom migration scripts 275 + 276 + **Benefits**: 277 + - ✅ Full control 278 + 279 + **Limitations**: 280 + - ❌ Not publishable/discoverable 281 + - ❌ No standardization 282 + 283 + **Recommendation**: Allow as fallback, but encourage Lenses 284 + 285 + ## Runtime Validation 286 + 287 + ### Sample Validation Against Schema 288 + 289 + ```python 290 + from atdata.codegen import TypeValidator 291 + 292 + validator = TypeValidator() 293 + schema = load_schema("at://alice/schema/sample/v1.0.0") 294 + 295 + # Validate sample 296 + sample = SampleV1(x=42, y=100) 297 + is_valid, errors = validator.validate(sample, schema) 298 + 299 + if not is_valid: 300 + print("Validation errors:") 301 + for error in errors: 302 + print(f" - {error}") 303 + ``` 304 + 305 + **Checks**: 306 + 1. All required fields present 307 + 2. Field types match 308 + 3. NDArray dtypes match (if specified) 309 + 4. NDArray shapes match (if specified) 310 + 311 + **When to validate**: 312 + - ❓ Every sample creation? (slow) 313 + - ✅ On dataset write? (good balance) 314 + - ✅ On user request (explicit validation) 315 + 316 + **Recommendation**: Validate on write, make runtime validation optional 317 + 318 + ## Schema Record Versioning 319 + 320 + ### Version Field in Schema Records 321 + 322 + ```json 323 + { 324 + "$type": "app.bsky.atdata.schema", 325 + "name": "ImageSample", 326 + "version": "1.1.0", # Semantic version 327 + ... 328 + } 329 + ``` 330 + 331 + ### Publishing New Versions 332 + 333 + **Option A**: New record for each version (RECOMMENDED) 334 + ``` 335 + at://alice/schema/imagesample/v1.0.0 # Version 1.0.0 336 + at://alice/schema/imagesample/v1.1.0 # Version 1.1.0 337 + at://alice/schema/imagesample/v2.0.0 # Version 2.0.0 338 + ``` 339 + 340 + **Pros**: 341 + - ✅ Immutable versions 342 + - ✅ Easy to reference specific versions 343 + - ✅ No breaking changes to existing references 344 + 345 + **Cons**: 346 + - ❌ More records to manage 347 + - ❌ Harder to find "latest" version 348 + 349 + **Option B**: Update existing record 350 + ``` 351 + at://alice/schema/imagesample # Always points to latest 352 + ``` 353 + 354 + **Pros**: 355 + - ✅ Single canonical reference 356 + - ✅ Easy to find latest 357 + 358 + **Cons**: 359 + - ❌ Breaks immutability 360 + - ❌ References become ambiguous over time 361 + 362 + **Recommendation**: Option A (new record per version), with metadata linking to previous versions 363 + 364 + ### Linking Versions 365 + 366 + ```json 367 + { 368 + "$type": "app.bsky.atdata.schema", 369 + "name": "ImageSample", 370 + "version": "2.0.0", 371 + "metadata": { 372 + "previousVersion": "at://alice/schema/imagesample/v1.1.0", 373 + "migrationLens": "at://alice/lens/imagesample-v1-to-v2" 374 + }, 375 + ... 376 + } 377 + ``` 378 + 379 + ## Developer Workflow 380 + 381 + ### Publishing a New Schema Version 382 + 383 + ```python 384 + # 1. Define new version 385 + @atdata.packable 386 + class ImageSampleV2: 387 + image: NDArray 388 + label: str 389 + confidence: Optional[float] = None # NEW 390 + 391 + # 2. Publish with version 392 + schema_uri = publisher.publish_schema( 393 + ImageSampleV2, 394 + name="ImageSample", 395 + version="1.1.0", # MINOR bump 396 + metadata={ 397 + "previousVersion": "at://alice/schema/imagesample/v1.0.0" 398 + } 399 + ) 400 + 401 + # 3. Optionally publish migration lens 402 + migration_lens = publisher.publish_lens( 403 + v1_to_v2_lens, 404 + source_schema_uri="at://alice/schema/imagesample/v1.0.0", 405 + target_schema_uri=schema_uri, 406 + name="ImageSample v1→v2 Migration" 407 + ) 408 + ``` 409 + 410 + ### Using Versioned Schemas 411 + 412 + ```python 413 + # Load specific version 414 + schema = loader.get_schema("at://alice/schema/imagesample/v1.0.0") 415 + 416 + # Check compatibility 417 + is_compatible = validator.is_compatible( 418 + "at://alice/schema/imagesample/v1.0.0", 419 + "at://alice/schema/imagesample/v2.0.0" 420 + ) 421 + 422 + # Find migration path 423 + migration = loader.find_migration( 424 + source="at://alice/schema/imagesample/v1.0.0", 425 + target="at://alice/schema/imagesample/v2.0.0" 426 + ) 427 + ``` 428 + 429 + ## Tooling Support 430 + 431 + ### CLI Commands 432 + 433 + ```bash 434 + # Check schema compatibility 435 + atdata schema diff \ 436 + at://alice/schema/sample/v1.0.0 \ 437 + at://alice/schema/sample/v2.0.0 438 + 439 + # Validate sample against schema 440 + atdata validate mysample.msgpack \ 441 + --schema at://alice/schema/sample/v1.0.0 442 + 443 + # Find migration path 444 + atdata schema migrate \ 445 + --from at://alice/schema/sample/v1.0.0 \ 446 + --to at://alice/schema/sample/v2.0.0 447 + ``` 448 + 449 + ### IDE Support (Future) 450 + 451 + - Autocomplete for schema versions 452 + - Warnings for compatibility issues 453 + - Quick fixes for migrations 454 + 455 + ## Open Questions 456 + 457 + 1. **Should we auto-bump versions on publish?** 458 + - Detect changes, suggest version bump? 459 + - Recommendation: Manual for Phase 1, auto-suggest later 460 + 461 + 2. **How to handle shape evolution for NDArray?** 462 + ```python 463 + # v1: image shape [224, 224, 3] 464 + # v2: image shape [256, 256, 3] # Breaking or not? 465 + ``` 466 + - If shape is documented (not enforced), this could be minor 467 + - If shape is validated, this is breaking 468 + - Recommendation: Document only initially 469 + 470 + 3. **Should we support version ranges in schema refs?** 471 + ```json 472 + "schemaRef": "at://alice/schema/sample@^1.0.0" # npm-style 473 + ``` 474 + - Pro: More flexible 475 + - Con: Ambiguous (which exact version?) 476 + - Recommendation: Explicit versions only for Phase 1 477 + 478 + 4. **What about deprecated fields?** 479 + ```python 480 + class Sample: 481 + x: int 482 + y: int # @deprecated: Use z instead 483 + z: Optional[int] = None 484 + ``` 485 + - Could add deprecation warnings 486 + - Could track in schema metadata 487 + - Recommendation: Metadata only for Phase 1 488 + 489 + ## Success Criteria 490 + 491 + After implementing this decision: 492 + - ✅ Schemas use semantic versioning 493 + - ✅ Compatibility rules are clear and documented 494 + - ✅ Compatibility checker validates schema changes 495 + - ✅ Lenses can be used for migrations 496 + - ✅ Dataset records can specify version constraints 497 + - ✅ Breaking changes require major version bump 498 + 499 + ## References 500 + 501 + - Code generation: `../05_codegen.md` (SchemaValidator, TypeValidator) 502 + - Lexicon design: `../02_lexicon_design.md` (Schema versioning) 503 + - Lens transformations: `02_lens_code_storage.md` 504 + 505 + --- 506 + 507 + **Decision Needed By**: Before Phase 4 Issue #39 (Type validation) 508 + **Decision Maker**: Project maintainer (max) 509 + **Date Created**: 2026-01-07
+388
.planning/decisions/05_lexicon_namespace.md
··· 1 + # Decision: Lexicon Namespace and NSID Structure 2 + 3 + **Issue**: #49 4 + **Status**: Needs decision 5 + **Blocks**: #50 (Lexicon validation) 6 + **Priority**: Critical for Phase 1 7 + 8 + ## DECISION 9 + 10 + We're going to use an org NSID for the steward organization as the base: 11 + 12 + ``` 13 + ac.foundation.dataset.* 14 + ``` 15 + 16 + The choices we have then are 17 + 18 + ``` 19 + ac.foundation.dataset.sampleSchema 20 + ac.foundation.dataset.record 21 + ac.foundation.dataset.lens 22 + ``` 23 + 24 + --- 25 + 26 + ## Problem Statement 27 + 28 + We need to finalize the namespace (NSID - Namespaced Identifier) for atdata Lexicons. This is a critical decision because: 29 + - NSIDs are permanent and hard to change 30 + - They affect discoverability and organization 31 + - They may require coordination with ATProto/Bluesky team 32 + 33 + ## Context 34 + 35 + ATProto NSIDs follow reverse domain notation: 36 + ``` 37 + app.bsky.feed.post # Bluesky official feed posts 38 + com.example.myapp.record # Third-party app 39 + ``` 40 + 41 + We need NSIDs for three record types: 42 + 1. Schema records (PackableSample definitions) 43 + 2. Dataset records (dataset indexes) 44 + 3. Lens records (transformations) 45 + 46 + ## Current Proposal 47 + 48 + ``` 49 + app.bsky.atdata.schema # PackableSample schema records 50 + app.bsky.atdata.dataset # Dataset index records 51 + app.bsky.atdata.lens # Lens transformation records 52 + ``` 53 + 54 + ## Options 55 + 56 + ### Option 1: `app.bsky.atdata.*` (Current Proposal) 57 + 58 + **Full NSIDs**: 59 + - `app.bsky.atdata.schema` 60 + - `app.bsky.atdata.dataset` 61 + - `app.bsky.atdata.lens` 62 + 63 + **Pros**: 64 + - ✅ Under Bluesky ecosystem umbrella 65 + - ✅ High visibility and discoverability 66 + - ✅ Official-looking namespace 67 + - ✅ Good for adoption 68 + 69 + **Cons**: 70 + - ❌ May require approval from Bluesky team 71 + - ❌ `app.bsky.*` typically for official Bluesky apps 72 + - ❌ Could be rejected or need to change later 73 + - ❌ Implies Bluesky endorsement/ownership 74 + 75 + **Risk**: ⚠️ Medium (may need to change if not approved) 76 + 77 + --- 78 + 79 + ### Option 2: `io.atdata.*` or `org.atdata.*` 80 + 81 + **Full NSIDs**: 82 + - `io.atdata.schema` 83 + - `io.atdata.dataset` 84 + - `io.atdata.lens` 85 + 86 + **Pros**: 87 + - ✅ Independent namespace 88 + - ✅ No approval needed 89 + - ✅ Clear ownership (atdata project) 90 + - ✅ Can use immediately 91 + 92 + **Cons**: 93 + - ❌ Less discoverable (not under Bluesky) 94 + - ❌ Appears less "official" 95 + - ❌ Need to own atdata.io domain (or just use anyway?) 96 + 97 + **Risk**: ⭐ Low (we control it) 98 + 99 + --- 100 + 101 + ### Option 3: `app.bsky.atproto.atdata.*` (Nested) 102 + 103 + **Full NSIDs**: 104 + - `app.bsky.atproto.atdata.schema` 105 + - `app.bsky.atproto.atdata.dataset` 106 + - `app.bsky.atproto.atdata.lens` 107 + 108 + **Pros**: 109 + - ✅ Still under Bluesky but more specific 110 + - ✅ Groups with other ATProto-related Lexicons 111 + - ✅ Less likely to conflict 112 + 113 + **Cons**: 114 + - ❌ Longer NSIDs 115 + - ❌ Awkward naming (`atproto.atdata`?) 116 + - ❌ Still may need approval 117 + 118 + **Risk**: ⚠️ Medium 119 + 120 + --- 121 + 122 + ### Option 4: Personal/Org namespace (e.g., `com.github.username.atdata.*`) 123 + 124 + **Example with your GitHub**: 125 + - `com.github.maxineishere.atdata.schema` (if that's your GH username) 126 + - Or: `com.yourorg.atdata.schema` 127 + 128 + **Pros**: 129 + - ✅ Guaranteed to work (it's your namespace) 130 + - ✅ No approval needed 131 + - ✅ Clear ownership 132 + 133 + **Cons**: 134 + - ❌ Looks very unofficial 135 + - ❌ Hard to discover 136 + - ❌ Tied to individual/org, not project 137 + - ❌ May need to migrate later if project grows 138 + 139 + **Risk**: ⭐ Very Low (but not ideal for adoption) 140 + 141 + ## Recommendation: Start with Option 2 (`io.atdata.*`), Keep Option 1 as Goal 142 + 143 + **Phased Approach**: 144 + 145 + ### Phase 1: Use `io.atdata.*` immediately 146 + - No approvals needed 147 + - Can start development right away 148 + - Professional-looking namespace 149 + - Independent from Bluesky governance 150 + 151 + ### Future: Request `app.bsky.atdata.*` if appropriate 152 + - Once atdata has users and proven value 153 + - Submit formal request to Bluesky/ATProto team 154 + - Migrate if approved (see migration plan below) 155 + 156 + **Rationale**: 157 + 1. **Speed**: Don't block development waiting for approval 158 + 2. **Safety**: If denied `app.bsky.*`, we haven't committed to it 159 + 3. **Flexibility**: Can migrate namespaces if needed 160 + 4. **Independence**: atdata can exist independently of Bluesky 161 + 162 + ## Implementation Details 163 + 164 + ### Namespace Structure 165 + 166 + ``` 167 + io.atdata 168 + ├── schema # PackableSample schema definitions 169 + ├── dataset # Dataset index records 170 + └── lens # Lens transformations 171 + ``` 172 + 173 + **Lexicon IDs**: 174 + ```json 175 + { 176 + "lexicon": 1, 177 + "id": "io.atdata.schema", 178 + ... 179 + } 180 + ``` 181 + 182 + ```json 183 + { 184 + "lexicon": 1, 185 + "id": "io.atdata.dataset", 186 + ... 187 + } 188 + ``` 189 + 190 + ```json 191 + { 192 + "lexicon": 1, 193 + "id": "io.atdata.lens", 194 + ... 195 + } 196 + ``` 197 + 198 + ### Record URIs 199 + 200 + ``` 201 + at://did:plc:abc123/io.atdata.schema/3jk2lo34klm 202 + at://did:plc:abc123/io.atdata.dataset/7mn8op56pqr 203 + at://did:plc:abc123/io.atdata.lens/2fg4hi78jkl 204 + ``` 205 + 206 + ### Python Constants 207 + 208 + ```python 209 + # src/atdata/atproto/_constants.py 210 + 211 + SCHEMA_NSID = "io.atdata.schema" 212 + DATASET_NSID = "io.atdata.dataset" 213 + LENS_NSID = "io.atdata.lens" 214 + 215 + # Can be changed in one place if we migrate namespaces 216 + ``` 217 + 218 + ## Domain Ownership 219 + 220 + **Question**: Do we need to own `atdata.io`? 221 + 222 + **ATProto Spec**: NSIDs don't require domain ownership, but it's recommended for credibility. 223 + 224 + **Options**: 225 + 1. **Register `atdata.io`** (~$12/year) 226 + - Pro: Professional, verifiable ownership 227 + - Con: Small cost 228 + - Recommendation: ✅ Do this 229 + 230 + 2. **Use without owning** 231 + - Pro: Free 232 + - Con: Someone else could register it and claim the namespace 233 + - Recommendation: ❌ Too risky 234 + 235 + **Decision**: Register `atdata.io` domain 236 + 237 + ## Versioning in NSIDs 238 + 239 + **Question**: Should version be part of NSID? 240 + 241 + ### Option A: Version in record (RECOMMENDED) 242 + ``` 243 + NSIDs: io.atdata.schema (constant) 244 + Versions: In schema record "version" field 245 + ``` 246 + 247 + **Pros**: 248 + - ✅ Stable NSIDs 249 + - ✅ Versions can evolve independently 250 + - ✅ Single collection for all versions 251 + 252 + **Cons**: 253 + - ❌ Need to look up version from record 254 + 255 + ### Option B: Version in NSID 256 + ``` 257 + NSIDs: io.atdata.schema.v1, io.atdata.schema.v2 258 + ``` 259 + 260 + **Pros**: 261 + - ✅ Version explicit in URI 262 + 263 + **Cons**: 264 + - ❌ New NSID for each major version 265 + - ❌ More Lexicons to maintain 266 + - ❌ Harder to query across versions 267 + 268 + **Recommendation**: Option A (version in record) 269 + 270 + ## Namespace Migration Plan 271 + 272 + If we need to migrate from `io.atdata.*` to `app.bsky.atdata.*`: 273 + 274 + ### Migration Steps 275 + 276 + 1. **Dual Publishing** (transition period) 277 + ```python 278 + # Publish to both namespaces 279 + publisher.publish_schema( 280 + sample_type, 281 + nsid="io.atdata.schema" # Old 282 + ) 283 + publisher.publish_schema( 284 + sample_type, 285 + nsid="app.bsky.atdata.schema" # New 286 + ) 287 + ``` 288 + 289 + 2. **Deprecation Notice** 290 + - Announce migration timeline 291 + - Update documentation 292 + - Add warnings to old namespace 293 + 294 + 3. **Update Client** 295 + - Default to new namespace 296 + - Still support old namespace (read-only) 297 + 298 + 4. **Sunset Old Namespace** 299 + - After 6-12 months, stop publishing to old namespace 300 + - Keep reading old records for compatibility 301 + 302 + ### Record Linking 303 + 304 + Add migration metadata: 305 + ```json 306 + { 307 + "$type": "app.bsky.atdata.schema", 308 + "metadata": { 309 + "migratedFrom": "at://did:plc:abc123/io.atdata.schema/3jk2lo34klm" 310 + }, 311 + ... 312 + } 313 + ``` 314 + 315 + ## Additional Lexicons (Future) 316 + 317 + Should we reserve NSIDs for future use? 318 + 319 + **Potential Additions**: 320 + - `io.atdata.collection` - Group multiple datasets 321 + - `io.atdata.benchmark` - Evaluation results 322 + - `io.atdata.annotation` - User comments/ratings 323 + - `io.atdata.pipeline` - Data processing pipelines 324 + 325 + **Recommendation**: Don't create yet, but document reserved names 326 + 327 + ## Community Input 328 + 329 + **Before finalizing**: 330 + 1. Check if `io.atdata.*` is available (no conflicts) 331 + 2. Reach out to ATProto community (Discord, GitHub) 332 + 3. Ask Bluesky team about `app.bsky.atdata.*` feasibility 333 + 4. Document decision and rationale 334 + 335 + ## Open Questions 336 + 337 + 1. **Should we create a demo namespace first?** 338 + - `io.atdata.dev.schema` for testing? 339 + - Pro: Keeps production namespace clean 340 + - Con: More namespaces to manage 341 + - Recommendation: Not needed, use test DIDs instead 342 + 343 + 2. **What about language-specific namespaces?** 344 + - `io.atdata.py.schema` for Python-specific schemas? 345 + - Pro: Allows language-specific features 346 + - Con: Fragments ecosystem 347 + - Recommendation: ❌ Keep language-agnostic 348 + 349 + 3. **Should we namespace by domain (vision, NLP, etc.)?** 350 + - `io.atdata.vision.schema`, `io.atdata.nlp.schema`? 351 + - Pro: Better organization for large ecosystems 352 + - Con: Premature optimization 353 + - Recommendation: ❌ Not for Phase 1 354 + 355 + ## Success Criteria 356 + 357 + After implementing this decision: 358 + - ✅ NSIDs are finalized and documented 359 + - ✅ Lexicon JSON files use correct NSIDs 360 + - ✅ Python code uses constant definitions (easy to change) 361 + - ✅ Migration plan exists if needed 362 + - ✅ Domain `atdata.io` is registered (or plan to register) 363 + 364 + ## References 365 + 366 + - ATProto NSID spec: https://atproto.com/specs/nsid 367 + - Lexicon design: `../02_lexicon_design.md` 368 + - All three Lexicon definitions need this decision 369 + 370 + --- 371 + 372 + **Decision Needed By**: Before starting Phase 1 Issue #22, #23, #24 (all Lexicon designs) 373 + **Decision Maker**: Project maintainer (max) 374 + **Date Created**: 2026-01-07 375 + 376 + ## Recommended Action 377 + 378 + **Immediate**: 379 + 1. ✅ Decide on `io.atdata.*` as working namespace 380 + 2. ✅ Plan to register `atdata.io` domain 381 + 3. ✅ Document migration path to `app.bsky.atdata.*` if desired later 382 + 383 + **Before Phase 2**: 384 + 1. Register `atdata.io` domain 385 + 2. Optional: Reach out to Bluesky about `app.bsky.atdata.*` for future 386 + 387 + **Phase 1**: 388 + Use `io.atdata.*` in all Lexicon designs
+459
.planning/decisions/06_lexicon_validation.md
··· 1 + # Decision: Lexicon Validation Process 2 + 3 + **Issue**: #50 4 + **Status**: Needs decision 5 + **Blocked By**: #45, #46, #47, #48, #49 (all design decisions) 6 + **Priority**: Critical - Final step before Phase 1 completion 7 + 8 + ## Problem Statement 9 + 10 + Once we've finalized all design decisions, we need to validate that our Lexicon JSON definitions: 11 + 1. Follow ATProto Lexicon specification correctly 12 + 2. Are internally consistent 13 + 3. Support all our use cases 14 + 4. Can be implemented as designed 15 + 16 + This is the final checkpoint before Phase 1 (Lexicon Design) is complete and we move to Phase 2 (Implementation). 17 + 18 + ## What Needs Validation 19 + 20 + ### 1. Schema Record Lexicon (`io.atdata.schema`) 21 + - Field type system (primitive, ndarray, nested) 22 + - Type unions are properly structured 23 + - Required vs optional fields 24 + - Constraints (maxLength, etc.) are reasonable 25 + - Example schema records validate against the Lexicon 26 + 27 + ### 2. Dataset Record Lexicon (`io.atdata.dataset`) 28 + - URL array handling 29 + - Metadata blob size limits 30 + - Schema reference format 31 + - Tag array constraints 32 + - Example dataset records validate against the Lexicon 33 + 34 + ### 3. Lens Record Lexicon (`io.atdata.lens`) 35 + - Code reference structure 36 + - Schema reference handling 37 + - Union types for different code storage options (if applicable) 38 + - Example lens records validate against the Lexicon 39 + 40 + ## Validation Checklist 41 + 42 + ### ATProto Spec Compliance 43 + 44 + **Lexicon Structure**: 45 + - [ ] All Lexicons have required fields: `lexicon`, `id`, `defs` 46 + - [ ] `lexicon` field is set to `1` (current version) 47 + - [ ] `id` follows NSID format (reverse domain notation) 48 + - [ ] `defs.main` exists and has `type: "record"` 49 + - [ ] Record `key` is set appropriately (`tid` for time-ordered) 50 + 51 + **Field Types**: 52 + - [ ] All field types are valid ATProto types 53 + - `string`, `integer`, `boolean`, `bytes`, `object`, `array` 54 + - `ref`, `union` for complex types 55 + - [ ] String fields have appropriate `maxLength` 56 + - [ ] Array fields have `items` definition 57 + - [ ] Object fields have `properties` definition 58 + - [ ] Refs point to valid def names (e.g., `#fieldType`) 59 + 60 + **Constraints**: 61 + - [ ] `maxLength` values are reasonable (not too small, not too large) 62 + - [ ] `minLength` constraints make sense 63 + - [ ] Required fields are marked correctly 64 + - [ ] Optional fields have appropriate defaults 65 + 66 + ### Internal Consistency 67 + 68 + **Cross-References**: 69 + - [ ] Schema refs (e.g., `schemaRef` in datasets) use correct format 70 + - Should be AT-URI format: `at://did:plc:.../io.atdata.schema/...` 71 + - [ ] Union refs point to existing defs 72 + - [ ] No circular references 73 + 74 + **Type System**: 75 + - [ ] Field types are well-defined 76 + - Primitive types map clearly (str, int, float, bool, bytes) 77 + - NDArray type includes dtype and optional shape 78 + - Nested types have schema reference 79 + - [ ] Optional vs required semantics are clear 80 + 81 + **Metadata**: 82 + - [ ] Descriptions are present and helpful 83 + - [ ] Examples match the schema 84 + - [ ] Deprecations are noted (if any) 85 + 86 + ### Use Case Coverage 87 + 88 + **Can we represent...**: 89 + - [ ] All current PackableSample types? 90 + - [ ] NDArray with dtype and shape information? 91 + - [ ] Optional fields? 92 + - [ ] Nested PackableSample types (future)? 93 + - [ ] Dataset metadata (arbitrary key-value)? 94 + - [ ] Multiple WebDataset shard URLs? 95 + - [ ] Lens code references (repo + commit + path)? 96 + 97 + **Can we implement...**: 98 + - [ ] Python codegen from schema records? 99 + - [ ] Dataset publishing with external URLs? 100 + - [ ] Dataset loading from records? 101 + - [ ] Lens publishing with code references? 102 + - [ ] Schema versioning (version field present)? 103 + 104 + ## Validation Methods 105 + 106 + ### 1. Schema Validation Tools 107 + 108 + **Use ATProto Tools** (if available): 109 + ```bash 110 + # If ATProto has a Lexicon validator 111 + atproto-lexicon validate io.atdata.schema.json 112 + atproto-lexicon validate io.atdata.dataset.json 113 + atproto-lexicon validate io.atdata.lens.json 114 + ``` 115 + 116 + **Create Custom Validator**: 117 + ```python 118 + # src/atdata/atproto/validation.py 119 + from jsonschema import validate, ValidationError 120 + 121 + def validate_lexicon(lexicon_json: dict) -> tuple[bool, list[str]]: 122 + """Validate Lexicon against ATProto spec.""" 123 + errors = [] 124 + 125 + # Check required fields 126 + if 'lexicon' not in lexicon_json: 127 + errors.append("Missing 'lexicon' field") 128 + if 'id' not in lexicon_json: 129 + errors.append("Missing 'id' field") 130 + if 'defs' not in lexicon_json: 131 + errors.append("Missing 'defs' field") 132 + 133 + # Check NSID format 134 + nsid = lexicon_json.get('id', '') 135 + if not is_valid_nsid(nsid): 136 + errors.append(f"Invalid NSID: {nsid}") 137 + 138 + # More validations... 139 + 140 + return len(errors) == 0, errors 141 + ``` 142 + 143 + ### 2. Example Record Validation 144 + 145 + **Create Example Records**: 146 + 147 + ```python 148 + # examples/schema_record.json 149 + { 150 + "$type": "io.atdata.schema", 151 + "name": "ImageSample", 152 + "version": "1.0.0", 153 + "description": "Sample with image and label", 154 + "fields": [ 155 + { 156 + "name": "image", 157 + "type": {"kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3]}, 158 + "optional": false 159 + }, 160 + { 161 + "name": "label", 162 + "type": {"kind": "primitive", "primitive": "str"}, 163 + "optional": false 164 + } 165 + ], 166 + "metadata": {"author": "alice"}, 167 + "createdAt": "2025-01-06T12:00:00Z" 168 + } 169 + ``` 170 + 171 + **Validate Against Lexicon**: 172 + ```python 173 + def validate_record(record: dict, lexicon: dict) -> tuple[bool, list[str]]: 174 + """Validate a record against its Lexicon.""" 175 + errors = [] 176 + 177 + # Check $type matches Lexicon id 178 + record_type = record.get('$type') 179 + lexicon_id = lexicon.get('id') 180 + if record_type != lexicon_id: 181 + errors.append(f"Type mismatch: {record_type} != {lexicon_id}") 182 + 183 + # Validate required fields 184 + main_def = lexicon['defs']['main']['record'] 185 + required = main_def.get('required', []) 186 + for field in required: 187 + if field not in record: 188 + errors.append(f"Missing required field: {field}") 189 + 190 + # Validate field types 191 + properties = main_def.get('properties', {}) 192 + for field, value in record.items(): 193 + if field in properties: 194 + # Type checking logic 195 + pass 196 + 197 + return len(errors) == 0, errors 198 + ``` 199 + 200 + ### 3. Roundtrip Testing 201 + 202 + **Test Full Cycle**: 203 + 1. Create PackableSample class 204 + 2. Generate schema record from class 205 + 3. Validate schema record against Lexicon 206 + 4. Generate code from schema record 207 + 5. Verify generated code matches original class 208 + 209 + ```python 210 + def test_roundtrip(): 211 + # 1. Original class 212 + @atdata.packable 213 + class TestSample: 214 + x: int 215 + y: str 216 + 217 + # 2. Generate schema record 218 + generator = SchemaRecordGenerator() 219 + record = generator.from_class(TestSample) 220 + 221 + # 3. Validate against Lexicon 222 + is_valid, errors = validate_record(record, SCHEMA_LEXICON) 223 + assert is_valid, f"Validation failed: {errors}" 224 + 225 + # 4. Generate code from record 226 + codegen = PythonGenerator() 227 + code = codegen.generate_from_record(record) 228 + 229 + # 5. Execute generated code and compare 230 + exec_globals = {} 231 + exec(code, exec_globals) 232 + GeneratedClass = exec_globals['TestSample'] 233 + 234 + # Should be equivalent 235 + original_instance = TestSample(x=1, y="test") 236 + generated_instance = GeneratedClass(x=1, y="test") 237 + 238 + assert original_instance.packed == generated_instance.packed 239 + ``` 240 + 241 + ### 4. Edge Case Testing 242 + 243 + **Test Corner Cases**: 244 + - [ ] Empty optional fields 245 + - [ ] Very long strings (maxLength boundary) 246 + - [ ] Large arrays (maxItems boundary) 247 + - [ ] Complex nested types 248 + - [ ] Unicode in strings 249 + - [ ] Special characters in names 250 + - [ ] Large metadata blobs 251 + 252 + ## Validation Artifacts 253 + 254 + After validation, we should have: 255 + 256 + ### 1. Finalized Lexicon JSON Files 257 + 258 + ``` 259 + .planning/lexicons/ 260 + io.atdata.schema.json 261 + io.atdata.dataset.json 262 + io.atdata.lens.json 263 + ``` 264 + 265 + Each file: 266 + - Validates against ATProto Lexicon spec 267 + - Has complete documentation 268 + - Includes examples 269 + 270 + ### 2. Example Records 271 + 272 + ``` 273 + .planning/examples/ 274 + schema_example.json 275 + dataset_example.json 276 + lens_example.json 277 + ``` 278 + 279 + Each example: 280 + - Validates against its Lexicon 281 + - Demonstrates all key features 282 + - Includes comments explaining choices 283 + 284 + ### 3. Validation Test Suite 285 + 286 + ```python 287 + # tests/test_lexicons.py 288 + 289 + def test_schema_lexicon_valid(): 290 + """Test schema Lexicon is valid.""" 291 + with open('.planning/lexicons/io.atdata.schema.json') as f: 292 + lexicon = json.load(f) 293 + is_valid, errors = validate_lexicon(lexicon) 294 + assert is_valid, errors 295 + 296 + def test_schema_example_valid(): 297 + """Test schema example validates against Lexicon.""" 298 + with open('.planning/lexicons/io.atdata.schema.json') as f: 299 + lexicon = json.load(f) 300 + with open('.planning/examples/schema_example.json') as f: 301 + example = json.load(f) 302 + is_valid, errors = validate_record(example, lexicon) 303 + assert is_valid, errors 304 + 305 + # Similar tests for dataset and lens 306 + ``` 307 + 308 + ### 4. Validation Report 309 + 310 + ```markdown 311 + # Lexicon Validation Report 312 + 313 + ## Summary 314 + - Schema Lexicon: ✅ Valid 315 + - Dataset Lexicon: ✅ Valid 316 + - Lens Lexicon: ✅ Valid 317 + 318 + ## Validation Results 319 + 320 + ### io.atdata.schema 321 + - ATProto compliance: ✅ Pass 322 + - Internal consistency: ✅ Pass 323 + - Example validation: ✅ Pass 324 + - Edge cases: ✅ Pass 325 + 326 + ### io.atdata.dataset 327 + ... 328 + 329 + ## Issues Found 330 + None 331 + 332 + ## Recommendations 333 + 1. Consider adding X field to Y 334 + 2. Might want to increase maxLength for Z 335 + ... 336 + ``` 337 + 338 + ## Implementation Plan 339 + 340 + ### Step 1: Create Lexicon JSON Files (depends on decisions #45-49) 341 + 342 + Based on finalized decisions: 343 + - Schema representation format (#45) 344 + - Lens code storage (#46) 345 + - WebDataset storage (#47) 346 + - Schema evolution (#48) 347 + - Lexicon namespace (#49) 348 + 349 + Create three JSON files with complete Lexicon definitions. 350 + 351 + ### Step 2: Create Example Records 352 + 353 + For each Lexicon, create 2-3 example records demonstrating: 354 + - Minimal record 355 + - Full-featured record 356 + - Edge cases 357 + 358 + ### Step 3: Write Validation Tests 359 + 360 + Implement validation test suite that: 361 + - Validates Lexicons against ATProto spec 362 + - Validates examples against Lexicons 363 + - Tests roundtrip (class → record → code → class) 364 + 365 + ### Step 4: Manual Review 366 + 367 + Have team members review: 368 + - Lexicon designs 369 + - Example records 370 + - Any edge cases or concerns 371 + 372 + ### Step 5: Document Issues and Resolutions 373 + 374 + Track any issues found: 375 + - What was wrong? 376 + - How was it fixed? 377 + - Why was this decision made? 378 + 379 + ### Step 6: Final Sign-off 380 + 381 + Once all validation passes: 382 + - Mark Issue #50 as complete 383 + - Unblock Phase 1 (Issue #17) 384 + - Proceed to Phase 2 implementation 385 + 386 + ## Tools and Resources 387 + 388 + **ATProto Resources**: 389 + - Lexicon specification: https://atproto.com/specs/lexicon 390 + - NSID specification: https://atproto.com/specs/nsid 391 + - Example Lexicons: https://github.com/bluesky-social/atproto/tree/main/lexicons 392 + 393 + **Validation Tools**: 394 + - JSON Schema validator (jsonschema library) 395 + - ATProto SDK validation (if available) 396 + - Custom validators (we'll write) 397 + 398 + **Documentation**: 399 + - All planning docs in `.planning/` 400 + - Decision docs in `.planning/decisions/` 401 + - Lexicon design in `02_lexicon_design.md` 402 + 403 + ## Success Criteria 404 + 405 + Phase 1 Issue #17 is complete when: 406 + - ✅ All three Lexicons are finalized and validated 407 + - ✅ Example records validate against Lexicons 408 + - ✅ Roundtrip tests pass 409 + - ✅ Team has reviewed and approved 410 + - ✅ Documentation is complete 411 + - ✅ Ready to begin Phase 2 implementation 412 + 413 + ## Next Steps After Validation 414 + 415 + Once Issue #50 is complete: 416 + 1. Close Issue #50 417 + 2. Unblock and close Issue #17 (Phase 1) 418 + 3. Begin Phase 2 (Issue #18) - Python Client implementation 419 + 4. Reference finalized Lexicons during implementation 420 + 421 + ## Open Questions 422 + 423 + 1. **Should we submit Lexicons to ATProto for official review?** 424 + - Pro: Get expert feedback 425 + - Con: Delays, may not be necessary 426 + - Recommendation: Optional, do if time permits 427 + 428 + 2. **Should we create a Lexicon registry/index?** 429 + - Pro: Makes discovery easier 430 + - Con: Extra infrastructure 431 + - Recommendation: Defer to Phase 3 (AppView) 432 + 433 + 3. **How do we handle Lexicon updates after publication?** 434 + - Once records exist, changing Lexicons is breaking 435 + - Need clear versioning for Lexicons themselves 436 + - Recommendation: Lexicons are v1 for all Phase 1-5 437 + 438 + ## References 439 + 440 + - All design decisions: `01-05_*.md` in this directory 441 + - Lexicon design: `../02_lexicon_design.md` 442 + - ATProto Lexicon spec: https://atproto.com/specs/lexicon 443 + 444 + --- 445 + 446 + **Decision Needed By**: After all decisions #45-49 are finalized 447 + **Decision Maker**: Project maintainer (max) + team review 448 + **Date Created**: 2026-01-07 449 + 450 + ## Recommended Action 451 + 452 + **After all design decisions are made**: 453 + 1. Create three Lexicon JSON files 454 + 2. Create example records for each 455 + 3. Write and run validation test suite 456 + 4. Review as team 457 + 5. Document any issues and fixes 458 + 6. Get final sign-off 459 + 7. Mark Phase 1 complete ✅
+158
.planning/decisions/README.md
··· 1 + # Critical Design Decisions for ATProto Integration 2 + 3 + This directory contains detailed analysis and recommendations for the critical design decisions needed before implementing ATProto integration in `atdata`. 4 + 5 + ## Decision Documents (In Dependency Order) 6 + 7 + ### Core Design Decisions (Can be made in parallel) 8 + 9 + 1. **[01_schema_representation_format.md](01_schema_representation_format.md)** (Issue #45) 10 + - **Question**: How to represent PackableSample types in Lexicon records? 11 + - **Options**: Custom format, JSON Schema, Protobuf 12 + - **Recommendation**: Custom format within ATProto Lexicon 13 + - **Impact**: Code generation, cross-language support 14 + - **Blocks**: Issue #50 (validation) 15 + 16 + 2. **[02_lens_code_storage.md](02_lens_code_storage.md)** (Issue #46) 17 + - **Question**: How to store Lens transformation code? 18 + - **Options**: Code references, inline code, metadata only 19 + - **Recommendation**: Code references (GitHub + commit hash) only 20 + - **Impact**: Security, usability, trust model 21 + - **Blocks**: Issue #50 (validation) 22 + - ⚠️ **CRITICAL SECURITY DECISION** 23 + 24 + 3. **[03_webdataset_storage.md](03_webdataset_storage.md)** (Issue #47) 25 + - **Question**: Where to store actual WebDataset .tar files? 26 + - **Options**: External URLs, ATProto blobs, hybrid 27 + - **Recommendation**: External URLs (Phase 1), hybrid (future) 28 + - **Impact**: Decentralization, scalability, costs 29 + - **Blocks**: Issue #50 (validation) 30 + 31 + 4. **[04_schema_evolution.md](04_schema_evolution.md)** (Issue #48) 32 + - **Question**: How do schemas evolve without breaking changes? 33 + - **Options**: Semantic versioning, compatibility rules, migrations 34 + - **Recommendation**: Semantic versioning + Lenses for migration 35 + - **Impact**: Long-term maintainability, compatibility 36 + - **Blocks**: Issue #50 (validation), Issue #39 (type validation) 37 + 38 + 5. **[05_lexicon_namespace.md](05_lexicon_namespace.md)** (Issue #49) 39 + - **Question**: What namespace (NSID) to use for Lexicons? 40 + - **Options**: `app.bsky.atdata.*`, `io.atdata.*`, others 41 + - **Recommendation**: `io.atdata.*` (Phase 1), request `app.bsky.*` later 42 + - **Impact**: Discoverability, ownership, migration 43 + - **Blocks**: Issue #50 (validation) 44 + 45 + ### Final Validation (Depends on all above) 46 + 47 + 6. **[06_lexicon_validation.md](06_lexicon_validation.md)** (Issue #50) 48 + - **Question**: How to validate finalized Lexicon designs? 49 + - **Process**: Validation checklist, example records, tests 50 + - **Deliverables**: Finalized Lexicon JSON files, validation report 51 + - **Blocked By**: Issues #45, #46, #47, #48, #49 (all completed ✅) 52 + - **Blocks**: Phase 1 completion (Issue #17) 53 + - **Status**: Ready to proceed 54 + 55 + ### Architectural Assessment 56 + 57 + 7. **[assessment.md](assessment.md)** (Issue #51) ✅ **Complete** 58 + - **Comprehensive appraisal** of all finalized design decisions 59 + - **Overall Grade**: A- (Excellent with caveats) 60 + - **Analysis**: Strengths, synergies, trade-offs, risks, long-term trajectory 61 + - **Recommendations**: Immediate next steps and phasing guidance 62 + 63 + ## Decision Status 64 + 65 + | Issue | Decision | Status | Final Decision | 66 + |-------|----------|--------|----------------| 67 + | #45 | Schema format | ✅ Decided | JSON Schema with NDArray shim | 68 + | #46 | Lens code storage | ✅ Decided | External repos (GitHub + tangled.org) | 69 + | #47 | WebDataset storage | ✅ Decided | Hybrid (URLs + blobs from start) | 70 + | #48 | Schema evolution | ✅ Decided | rkey={NSID}@{semver} + migration Lenses | 71 + | #49 | Lexicon namespace | ✅ Decided | `ac.foundation.dataset.*` | 72 + | #50 | Validation process | ⏳ Ready | Proceed with finalized decisions | 73 + | #51 | Architectural appraisal | ✅ Complete | See [assessment.md](assessment.md) | 74 + 75 + **Overall Assessment**: Grade A- (Excellent with caveats) - See [assessment.md](assessment.md) for detailed analysis 76 + 77 + ## How to Use These Documents 78 + 79 + ### For Review 80 + 81 + 1. **Read in order** (01 through 06) to understand dependencies 82 + 2. **Focus on recommendations** - detailed analysis supports them 83 + 3. **Check open questions** - some need your input 84 + 4. **Provide feedback** - comment on issues or update documents 85 + 86 + ### For Implementation 87 + 88 + 1. **After decisions made** - use as reference during coding 89 + 2. **Check success criteria** - ensure implementation meets goals 90 + 3. **Follow recommendations** - they're based on thorough analysis 91 + 4. **Update as needed** - decisions can evolve with learning 92 + 93 + ## Key Insights 94 + 95 + ### Security First 96 + - **Issue #46** (Lens code storage) is a critical security decision 97 + - Recommendation: Code references only (no arbitrary code execution) 98 + - Can add inline code later if we solve sandboxing 99 + 100 + ### Pragmatic Approach 101 + - Start with what works (external URLs, custom format) 102 + - Add sophistication later (ATProto blobs, advanced features) 103 + - Don't block on perfect solutions 104 + 105 + ### Independence 106 + - Use `io.atdata.*` namespace (don't wait for Bluesky approval) 107 + - Can migrate to `app.bsky.atdata.*` later if desired 108 + - Maintain control over project direction 109 + 110 + ### Future-Proof 111 + - Semantic versioning enables evolution 112 + - Hybrid storage approach allows flexibility 113 + - Custom format gives us full control 114 + 115 + ## Decision Dependencies 116 + 117 + ``` 118 + ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ 119 + │ #45 │ │ #46 │ │ #47 │ │ #48 │ │ #49 │ 120 + │ Format │ │ Lens │ │ Storage │ │Evolution│ │Namespace│ 121 + └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ 122 + │ │ │ │ │ 123 + └────────────┴────────────┴────────────┴────────────┘ 124 + 125 + ┌────▼────┐ 126 + │ #50 │ 127 + │Validate │ 128 + └────┬────┘ 129 + 130 + ┌────▼────┐ 131 + │ Phase 1 │ 132 + │Complete │ 133 + └─────────┘ 134 + ``` 135 + 136 + All decisions #45-49 can be made in parallel, then #50 validates everything before Phase 1 completion. 137 + 138 + ## Timeline 139 + 140 + **Recommended**: 141 + 1. **Week 1**: Review and decide on #45-49 (can be done in parallel) 142 + 2. **Week 2**: Validation (#50) - create Lexicon JSON files and examples 143 + 3. **Week 3**: Begin Phase 2 implementation 144 + 145 + **Flexible**: Can make decisions incrementally, but all needed before #50 146 + 147 + ## Questions? 148 + 149 + - Review individual decision documents for detailed analysis 150 + - Check "Open Questions" sections for items needing input 151 + - See "References" sections for related planning documents 152 + - Consult `../02_lexicon_design.md` for technical details 153 + 154 + --- 155 + 156 + **Created**: 2026-01-07 157 + **Status**: All decisions pending review 158 + **Next Step**: Review decision documents and provide feedback
+313
.planning/decisions/assessment.md
··· 1 + # Architectural Assessment of Design Decisions 2 + 3 + **Issue**: #51 4 + **Date**: 2026-01-07 5 + **Status**: Complete 6 + 7 + ## Overall Impression: **Ambitious but Coherent** 8 + 9 + The finalized design decisions prioritize **flexibility and future-proofing** over initial simplicity. This is a deliberate trade-off that makes sense given the scope of building a distributed dataset federation. 10 + 11 + --- 12 + 13 + ## Decision Summary 14 + 15 + 1. **Schema Format (#45)**: JSON Schema with NDArray shim, extensible via open union 16 + 2. **Lens Code (#46)**: External repos (GitHub + tangled.org), language metadata, future attestation 17 + 3. **Storage (#47)**: Hybrid (URLs + blobs) from start, AppView proxy for blobs 18 + 4. **Evolution (#48)**: rkey as {NSID}@{semver}, getLatestSchema query, optional migration Lenses 19 + 5. **Namespace (#49)**: `ac.foundation.dataset.*` (sampleSchema, record, lens) 20 + 21 + --- 22 + 23 + ## Key Strengths 24 + 25 + ### 1. **Ecosystem Integration** (JSON Schema + External Repos) 26 + 27 + **Decision**: JSON Schema for type definitions, external repos for code storage 28 + 29 + **Strength**: Leveraging existing ecosystems rather than building in isolation. JSON Schema brings: 30 + - Extensive tooling (validators, codegen, IDE support) 31 + - Multi-language support out of the box 32 + - Familiarity for developers 33 + 34 + Pairing this with GitHub/tangled.org for Lenses means developers can use existing workflows. 35 + 36 + **Implication**: Lower barrier to entry, faster time to value. The NDArray shim is the only custom piece, which is appropriate since that's the unique requirement. 37 + 38 + --- 39 + 40 + ### 2. **Progressive Decentralization** (Hybrid Storage) 41 + 42 + **Decision**: Hybrid storage from day one (URLs + PDS blobs) 43 + 44 + **Strength**: This is pragmatic yet principled. Not forcing decentralization where it doesn't make sense (TB-scale datasets), but enabling it where it does (smaller datasets, self-hosters). 45 + 46 + **Key Insight**: The AppView proxy for blobs is clever - it means users can work with a unified WebDataset URL interface regardless of backend storage. This abstraction is powerful. 47 + 48 + **Implication**: More implementation complexity upfront, but avoids a painful migration later. The open union pattern makes this clean. 49 + 50 + --- 51 + 52 + ### 3. **Versioning as Identity** (rkey = NSID@semver) 53 + 54 + **Decision**: Embed version in record key, use NSID for permanent identity 55 + 56 + **Strength**: This is elegant. By making versioning part of the identity (rkey), you get: 57 + - Immutable version records (can't accidentally update a published version) 58 + - Natural query pattern (`getLatestSchema` Lexicon) 59 + - Clear semantic versioning enforcement 60 + 61 + **Synergy**: Combining this with Lenses for migration is brilliant. The rkey structure makes it trivial to discover what migrations exist (e.g., "show me all versions of schema X"). 62 + 63 + **Implication**: This requires custom rkey handling (type `any` in Lexicon), which ATProto supports but isn't the default pattern. Need to ensure tooling understands this convention. 64 + 65 + --- 66 + 67 + ### 4. **Trust Layer** (Attestation + Verification) 68 + 69 + **Decision**: Language metadata + future attestation/verification records for Lenses 70 + 71 + **Strength**: Thinking ahead about the trust problem. In a distributed system, trust is critical. This approach: 72 + - Short-term: Language metadata helps users understand what they're running 73 + - Long-term: Attestation (formal correctness proofs) + verification (trusted DIDs) 74 + 75 + This is a **strong security model** that's missing from many distributed systems. 76 + 77 + **Implication**: This is a research-level feature (formal verification of Lenses). Starting with language metadata is right, but the attestation system will require significant design work. Consider this Phase 6+. 78 + 79 + --- 80 + 81 + ## Architectural Tensions (Intentional Trade-offs) 82 + 83 + ### 1. **Complexity Budget** 84 + 85 + **Observation**: Sophisticated solutions across the board: 86 + - JSON Schema (standard but verbose) 87 + - Hybrid storage (two code paths) 88 + - Custom rkey scheme (non-standard) 89 + - Future attestation system (advanced) 90 + 91 + **Assessment**: This increases initial implementation cost significantly. However, each choice is justified: 92 + - JSON Schema: Ecosystem benefits outweigh verbosity 93 + - Hybrid storage: Essential for real-world use cases 94 + - Custom rkey: Enables clean versioning 95 + - Attestation: Future-proofing for trust 96 + 97 + **Recommendation**: ✅ Accept the complexity, but **phase implementation carefully**: 98 + - Phase 1-2: Core functionality (schemas, datasets, basic lenses) 99 + - Phase 3: Hybrid storage in AppView 100 + - Phase 4: Codegen for JSON Schema 101 + - Phase 5+: Attestation/verification system 102 + 103 + --- 104 + 105 + ### 2. **ATProto Conventions vs. Custom Patterns** 106 + 107 + **Observation**: Using some non-standard ATProto patterns: 108 + - rkey type `any` (not typical) 109 + - Custom versioning scheme in rkey 110 + - `getLatestSchema` query Lexicon (not standard CRUD) 111 + 112 + **Assessment**: This is **justified innovation**. ATProto is designed to support custom use cases. The versioning scheme in particular is a good use of flexible rkey. 113 + 114 + **Caveat**: Need to document these conventions clearly, since they won't match typical ATProto examples. 115 + 116 + --- 117 + 118 + ### 3. **JSON Schema for NDArray** 119 + 120 + **Observation**: JSON Schema wasn't designed for NDArray types. The shim approach treats them as "serialized bytes" with metadata. 121 + 122 + **Assessment**: This is **pragmatic but leaky**. The abstraction leaks because: 123 + - JSON Schema describes serialized form (bytes), not semantic form (array with dtype/shape) 124 + - Codegen will need custom handling for NDArray types 125 + - Validation happens at deserialization, not schema level 126 + 127 + **Alternative Considered**: Custom format would give cleaner NDArray representation, but traded that for ecosystem benefits. 128 + 129 + **Mitigation**: Ensure the NDArray shim is well-documented and becomes a de facto standard within the atdata ecosystem. Consider publishing it as a reusable JSON Schema extension. 130 + 131 + --- 132 + 133 + ## Synergies (Where Decisions Reinforce Each Other) 134 + 135 + ### 1. **Versioning + Lenses + rkey Scheme** 136 + 137 + This trilogy works beautifully together: 138 + - rkey embeds version → easy to list all versions 139 + - Lenses enable migration → versions can evolve safely 140 + - `getLatestSchema` query → discoverable latest version 141 + 142 + This creates a **complete version management story** that's rare in distributed systems. 143 + 144 + --- 145 + 146 + ### 2. **Hybrid Storage + AppView Proxy** 147 + 148 + The hybrid storage decision unlocks the proxy pattern: 149 + - Large datasets stay on S3/R2 (practical) 150 + - Small datasets can use PDS blobs (decentralized) 151 + - AppView proxies both → uniform interface 152 + 153 + This means the **client code is simple** (just WebDataset URLs) even though the backend is sophisticated. 154 + 155 + --- 156 + 157 + ### 3. **JSON Schema + Attestation + Language Metadata** 158 + 159 + This builds a **tiered trust model**: 160 + 1. Base layer: JSON Schema validates structure 161 + 2. Language metadata: Users know what they're executing 162 + 3. Attestation (future): Formal proofs of correctness 163 + 4. Verification (future): Social trust (trusted DIDs) 164 + 165 + Each layer adds security without requiring the next layer to exist. 166 + 167 + --- 168 + 169 + ## Implementation Risks & Mitigations 170 + 171 + ### Risk 1: JSON Schema Complexity 172 + 173 + **Risk**: JSON Schema is verbose and can be confusing for users defining NDArray-heavy schemas. 174 + 175 + **Mitigation**: 176 + - Build **high-quality codegen** that hides the complexity (users write Python, get JSON Schema) 177 + - Provide **NDArray shim library** that handles the serialization/deserialization 178 + - Create **examples and templates** for common patterns 179 + 180 + --- 181 + 182 + ### Risk 2: Hybrid Storage Code Paths 183 + 184 + **Risk**: Two storage backends means 2x testing, 2x bugs, 2x maintenance. 185 + 186 + **Mitigation**: 187 + - Use **abstraction layer** in Dataset class (already planned) 188 + - **Prioritize external URLs** for Phase 1-2 (blob support can be added incrementally) 189 + - Test both paths from the start (CI/CD) 190 + 191 + --- 192 + 193 + ### Risk 3: Custom rkey Convention 194 + 195 + **Risk**: Tools that expect standard TID-based rkeys might break. 196 + 197 + **Mitigation**: 198 + - **Document clearly** in all Lexicon definitions 199 + - Provide **helper functions** in SDK (`parseSchemaRkey`, `formatSchemaRkey`) 200 + - Ensure `getLatestSchema` query is the primary discovery mechanism (hides rkey complexity) 201 + 202 + --- 203 + 204 + ### Risk 4: Attestation System Scope Creep 205 + 206 + **Risk**: Formal verification and trust systems are research-level hard. Could delay entire project. 207 + 208 + **Mitigation**: 209 + - Mark as **explicitly future work** (Phase 6+) 210 + - Start with **language metadata only** (low-hanging fruit) 211 + - Consider **social trust first** (verified DIDs, reputation) before formal verification 212 + - Partner with PL/verification researchers if pursuing formal proofs 213 + 214 + --- 215 + 216 + ## Long-Term Trajectory 217 + 218 + The decisions set up a compelling long-term vision: 219 + 220 + **Year 1**: Core dataset federation 221 + - Publish/discover datasets 222 + - JSON Schema for types 223 + - External URL storage 224 + - Basic Lenses 225 + 226 + **Year 2**: Decentralization 227 + - PDS blob storage for small datasets 228 + - AppView with proxy 229 + - Migration Lenses widely used 230 + - Community schemas emerging 231 + 232 + **Year 3**: Trust & verification 233 + - Language metadata standard 234 + - Verified DID system (social trust) 235 + - Attestation for critical Lenses 236 + - Cross-language support (TypeScript, Rust) 237 + 238 + **Year 4+**: Research frontier 239 + - Formal verification of Lenses 240 + - Advanced query capabilities 241 + - Federated learning on distributed datasets 242 + - Integration with compute-over-data systems 243 + 244 + --- 245 + 246 + ## Concrete Recommendations 247 + 248 + ### 1. **Immediate** (Before Phase 1 Implementation) 249 + 250 + - [ ] Define the **NDArray JSON Schema shim** precisely (schema structure, examples) 251 + - [ ] Spec out the **rkey format** (`{NSID}@{semver}` - what's valid NSID here? full NSID or partial?) 252 + - [ ] Design the **`getLatestSchema` query Lexicon** (parameters, return type) 253 + - [ ] Define the **storage union type** (external URL variant vs PDS blob variant) 254 + 255 + ### 2. **Phase 1-2** (Lexicon + Python Client) 256 + 257 + - [ ] Implement **external URLs only** for storage (defer blobs to Phase 3) 258 + - [ ] Build **NDArray shim library** (serialize/deserialize with metadata) 259 + - [ ] Create **basic codegen** (Python dataclass ↔ JSON Schema) 260 + - [ ] Defer **language metadata** on Lenses to Phase 2 (start with just repo reference) 261 + 262 + ### 3. **Phase 3** (AppView) 263 + 264 + - [ ] Implement **hybrid storage support** in AppView 265 + - [ ] Build **proxy for PDS blobs** (unified WebDataset URL interface) 266 + - [ ] Add **getLatestSchema endpoint** 267 + 268 + ### 4. **Phase 4+** (Future Work) 269 + 270 + - [ ] Add **language metadata** to Lens records 271 + - [ ] Design **attestation Lexicon** (separate from Lens records) 272 + - [ ] Design **verification Lexicon** (trusted DIDs) 273 + - [ ] Research formal verification feasibility 274 + 275 + --- 276 + 277 + ## Summary Assessment 278 + 279 + **Grade: A-** (Excellent with caveats) 280 + 281 + ### Strengths 282 + - ✅ Leverages existing ecosystems (JSON Schema, GitHub) 283 + - ✅ Future-proof (extensible via open unions, versioning built-in) 284 + - ✅ Pragmatic decentralization (hybrid storage) 285 + - ✅ Innovative versioning (rkey scheme) 286 + - ✅ Strong security model (multi-layered trust) 287 + 288 + ### Concerns 289 + - ⚠️ High implementation complexity (manageable with phasing) 290 + - ⚠️ JSON Schema for NDArray is a leaky abstraction (acceptable trade-off) 291 + - ⚠️ Custom rkey convention requires good documentation 292 + - ⚠️ Attestation system is ambitious (defer to future) 293 + 294 + ### Overall Assessment 295 + 296 + This is a **well-considered architecture** that makes intentional trade-offs. The bet is on ecosystem integration and flexibility over simplicity, which is appropriate for a distributed dataset federation. The key to success will be **disciplined phasing** - implement the core first, add sophistication incrementally. 297 + 298 + The decisions form a **coherent whole** where each piece reinforces the others. The versioning scheme, Lenses, and hybrid storage create a system that's greater than the sum of its parts. 299 + 300 + **Recommendation**: ✅ **Proceed with these decisions**. Document the NDArray shim and rkey conventions thoroughly, and commit to incremental implementation. 301 + 302 + --- 303 + 304 + ## Next Steps 305 + 306 + 1. Close decision issues #45-49 as decided 307 + 2. Update planning documents with finalized decisions 308 + 3. Proceed to Issue #50 (Lexicon validation) with: 309 + - NDArray JSON Schema shim definition 310 + - rkey format specification 311 + - `getLatestSchema` query Lexicon design 312 + - Storage union type definition 313 + 4. Begin Phase 1 implementation after validation complete
+468
.planning/decisions/record_lexicon_assessment.md
··· 1 + # Record Lexicon Assessment 2 + 3 + ## Overview 4 + 5 + Comprehensive assessment of `ac.foundation.dataset.record` Lexicon design against ATProto standards and atdata project requirements. 6 + 7 + **Assessment Date:** 2026-01-07 8 + **Lexicon Version:** Initial design 9 + **Assessor:** Claude Sonnet 4.5 10 + 11 + --- 12 + 13 + ## Executive Summary 14 + 15 + **Grade: B+** (Good with improvements needed) 16 + 17 + The record Lexicon provides a solid foundation for dataset indexing with hybrid storage support. Key strengths include clean union-based storage design and appropriate use of ATProto primitives. However, several issues need addressing: 18 + 19 + - ⚠️ **Critical**: schemaRef should use format validation 20 + - ⚠️ **High**: Metadata structure inconsistency with sampleSchema pattern 21 + - ⚠️ **Medium**: Missing $type discriminators in union variants 22 + - ✅ **Strength**: Clean storage union design 23 + - ✅ **Strength**: Appropriate use of tid keys for datasets 24 + 25 + --- 26 + 27 + ## Detailed Analysis 28 + 29 + ### 1. Key Type Choice ✅ **Appropriate** 30 + 31 + ```json 32 + "key": "tid" 33 + ``` 34 + 35 + **Assessment:** Correct choice for dataset records. 36 + 37 + **Rationale:** 38 + - TIDs provide temporal ordering (useful for "recent datasets" queries) 39 + - Auto-generated, no collision risk 40 + - Appropriate for records without natural semantic keys 41 + - Consistent with ATProto patterns for user-generated content 42 + 43 + **Comparison to sampleSchema:** 44 + - sampleSchema uses `"key": "any"` for versioned rkeys like `{NSID}@{semver}` 45 + - record uses `"key": "tid"` for chronological dataset entries 46 + - Both choices are appropriate for their use cases 47 + 48 + --- 49 + 50 + ### 2. Field Validation Issues 51 + 52 + #### Issue 2.1: schemaRef Missing Format Validation ⚠️ **Critical** 53 + 54 + ```json 55 + "schemaRef": { 56 + "type": "string", 57 + "description": "AT-URI reference...", 58 + "maxLength": 500 59 + } 60 + ``` 61 + 62 + **Problem:** Should use `"format": "at-uri"` like we did for sampleSchema fields. 63 + 64 + **Fix:** 65 + ```json 66 + "schemaRef": { 67 + "type": "string", 68 + "format": "at-uri", 69 + "description": "AT-URI reference to the sampleSchema record", 70 + "maxLength": 500 71 + } 72 + ``` 73 + 74 + **Impact:** Without format validation, malformed references could be stored. 75 + 76 + --- 77 + 78 + #### Issue 2.2: License Field Inconsistency ⚠️ **Medium** 79 + 80 + sampleSchema metadata: 81 + ```json 82 + "license": { 83 + "type": "string", 84 + "description": "... SPDX identifiers recommended ... or full SPDX URLs ...", 85 + "maxLength": 200 86 + } 87 + ``` 88 + 89 + record: 90 + ```json 91 + "license": { 92 + "type": "string", 93 + "description": "License (SPDX identifier preferred)", 94 + "maxLength": 100 95 + } 96 + ``` 97 + 98 + **Problem:** Inconsistent maxLength and less detailed guidance. 99 + 100 + **Recommendation:** Align with sampleSchema: 101 + - maxLength: 200 (to support full URLs) 102 + - Enhanced description with examples 103 + - Reference Schema.org license property 104 + 105 + --- 106 + 107 + #### Issue 2.3: Tags Field Inconsistency ⚠️ **Medium** 108 + 109 + sampleSchema metadata: 110 + ```json 111 + "tags": { 112 + "type": "array", 113 + "items": {"type": "string", "maxLength": 150}, 114 + "maxLength": 30 115 + } 116 + ``` 117 + 118 + record: 119 + ```json 120 + "tags": { 121 + "type": "array", 122 + "items": {"type": "string", "maxLength": 50}, 123 + "maxLength": 20 124 + } 125 + ``` 126 + 127 + **Problem:** Different limits with no clear rationale. 128 + 129 + **Recommendation:** Use consistent limits or document why datasets need different constraints than schemas. 130 + 131 + --- 132 + 133 + ### 3. Metadata Structure ⚠️ **High Priority** 134 + 135 + #### Current Design 136 + 137 + record: 138 + ```json 139 + "metadata": { 140 + "type": "bytes", 141 + "description": "Msgpack-encoded metadata dict", 142 + "maxLength": 100000 143 + }, 144 + "tags": {...}, 145 + "license": {...} 146 + ``` 147 + 148 + sampleSchema: 149 + ```json 150 + "metadata": { 151 + "type": "object", 152 + "properties": { 153 + "license": {...}, 154 + "tags": {...} 155 + } 156 + } 157 + ``` 158 + 159 + **Problem:** Inconsistent approach between lexicons. 160 + 161 + **Analysis:** 162 + 163 + **Option A: Keep Separate (Current)** 164 + - Pros: More discoverable (top-level fields, indexed/searchable) 165 + - Pros: Validated by Lexicon 166 + - Cons: Duplicates structure with metadata blob 167 + - Cons: Inconsistent with sampleSchema pattern 168 + 169 + **Option B: Unified Metadata Object** 170 + - Pros: Consistent with sampleSchema 171 + - Pros: Single source of truth 172 + - Cons: Less discoverable for search 173 + - Cons: Can't validate blob contents 174 + 175 + **Recommendation:** Keep current approach but clarify relationship: 176 + - Top-level fields: Core, searchable metadata (license, tags, size) 177 + - metadata blob: Extended, arbitrary key-value pairs 178 + - Update descriptions to explain this pattern 179 + 180 + --- 181 + 182 + ### 4. Storage Union Design ✅ **Excellent** 183 + 184 + ```json 185 + "storage": { 186 + "type": "union", 187 + "refs": ["#storageExternal", "#storageBlobs"] 188 + } 189 + ``` 190 + 191 + **Strengths:** 192 + - Clean separation of storage types 193 + - Extensible (closed: false by default) 194 + - Well-defined variants 195 + 196 + #### Issue 4.1: Missing $type in Union Variants ⚠️ **Critical** 197 + 198 + storageExternal: 199 + ```json 200 + { 201 + "type": "object", 202 + "required": ["type", "urls"], 203 + "properties": { 204 + "type": {"type": "string", "const": "external"} 205 + } 206 + } 207 + ``` 208 + 209 + **Problem:** Uses `type` field as discriminator instead of ATProto's `$type`. 210 + 211 + **ATProto Spec:** "Unions require discriminator fields... union variants: Always include `$type`" 212 + 213 + **Fix:** 214 + ```json 215 + { 216 + "type": "object", 217 + "required": ["$type", "urls"], 218 + "properties": { 219 + "$type": { 220 + "type": "string", 221 + "const": "ac.foundation.dataset.record#storageExternal" 222 + } 223 + } 224 + } 225 + ``` 226 + 227 + **Impact:** Current design violates ATProto conventions and may cause issues with SDKs. 228 + 229 + --- 230 + 231 + ### 5. Size Information ✅ **Good Design** 232 + 233 + ```json 234 + "size": { 235 + "type": "ref", 236 + "ref": "#datasetSize", 237 + "description": "Dataset size information (optional)" 238 + } 239 + ``` 240 + 241 + **Strengths:** 242 + - Optional (appropriate, not all datasets track this) 243 + - Structured with useful fields (samples, bytes, shards) 244 + - Uses ref for reusability 245 + 246 + **Minor Suggestion:** Consider renaming `datasetSize` to `sizeInfo` or `datasetSizeInfo` for clarity. 247 + 248 + --- 249 + 250 + ### 6. Blob Storage Design ⚠️ **Needs Verification** 251 + 252 + ```json 253 + "blobs": { 254 + "type": "array", 255 + "items": { 256 + "type": "blob", 257 + "description": "Blob reference to a WebDataset tar archive" 258 + } 259 + } 260 + ``` 261 + 262 + **Questions:** 263 + 1. Does ATProto Lexicon support `"type": "blob"` for array items? 264 + 2. Should this be a ref like `"type": "ref", "ref": "#blobRef"`? 265 + 3. Are blob mime types validated? 266 + 267 + **Example shows:** 268 + ```json 269 + { 270 + "$type": "blob", 271 + "ref": {"$link": "..."}, 272 + "mimeType": "application/x-tar", 273 + "size": 1234567 274 + } 275 + ``` 276 + 277 + **Recommendation:** Verify against ATProto blob specification and potentially add validation constraints (maxSize, accept mimeType patterns). 278 + 279 + --- 280 + 281 + ### 7. Closed Union Consideration 🤔 282 + 283 + ```json 284 + "storage": { 285 + "type": "union", 286 + "refs": ["#storageExternal", "#storageBlobs"] 287 + } 288 + ``` 289 + 290 + **Current:** `closed: false` (default) 291 + 292 + **Question:** Should storage union be closed? 293 + 294 + **Arguments for closed: true:** 295 + - Core storage types unlikely to change frequently 296 + - Breaking change to add new storage after launch 297 + - More predictable for clients 298 + 299 + **Arguments for closed: false (current):** 300 + - Future extensibility (e.g., IPFS-native, Filecoin, Arweave) 301 + - Consistent with sampleSchema schema union pattern 302 + - Graceful degradation for unknown types 303 + 304 + **Recommendation:** Keep open but document in description that external/blobs are the canonical types maintained by foundation.ac. 305 + 306 + --- 307 + 308 + ### 8. Missing Fields from Standard Patterns 309 + 310 + Comparing to Schema.org Dataset and sampleSchema patterns: 311 + 312 + **Consider Adding:** 313 + 314 + 1. **Publisher/Creator** - Who published this dataset? 315 + - Could use top-level `creator` field (DID/handle) 316 + - Or rely on record author (implicit in AT-URI) 317 + 318 + 2. **Version** - Dataset versioning? 319 + - Current approach: New record per version (via tid) 320 + - Alternative: Add explicit `version` field like sampleSchema 321 + - **Recommendation:** Document that versioning is via new records, reference via AT-URI with tid 322 + 323 + 3. **Citation** - How to cite this dataset? 324 + - Optional field for academic datasets 325 + - Could go in metadata blob for now 326 + 327 + 4. **Related Datasets** - Links to variants, subsets, etc. 328 + - Could be array of AT-URIs 329 + - Or handle via separate "collection" Lexicon later 330 + 331 + **Recommendation:** Current fields are sufficient for v1. Document these as future extensions. 332 + 333 + --- 334 + 335 + ### 9. ATProto Compliance Checklist 336 + 337 + | Requirement | Status | Notes | 338 + |-------------|--------|-------| 339 + | Valid Lexicon version | ✅ | lexicon: 1 | 340 + | NSID format | ✅ | ac.foundation.dataset.record | 341 + | Key type specified | ✅ | tid (appropriate) | 342 + | Required fields present | ✅ | name, schemaRef, storage, createdAt | 343 + | Union discriminators | ⚠️ | Missing $type in variants | 344 + | Format validators | ⚠️ | Missing at-uri format | 345 + | Blob type usage | ⚠️ | Needs verification | 346 + | Description fields | ✅ | All fields documented | 347 + | maxLength constraints | ✅ | Present on strings | 348 + | Datetime format | ✅ | createdAt uses datetime | 349 + 350 + --- 351 + 352 + ### 10. Example Record Validation 353 + 354 + #### External Storage Example ✅ 355 + 356 + ```json 357 + { 358 + "$type": "ac.foundation.dataset.record", 359 + "name": "CIFAR-10 Training Set", 360 + "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0", 361 + "storage": {"type": "external", "urls": ["..."]} 362 + } 363 + ``` 364 + 365 + **Issues:** 366 + - schemaRef is well-formed but not validated (missing format check) 367 + - storage.type should be $type 368 + - Otherwise structurally correct 369 + 370 + #### Blob Storage Example ⚠️ 371 + 372 + ```json 373 + { 374 + "storage": { 375 + "type": "blobs", 376 + "blobs": [{ 377 + "$type": "blob", 378 + "ref": {"$link": "..."}, 379 + "mimeType": "application/x-tar" 380 + }] 381 + } 382 + } 383 + ``` 384 + 385 + **Issues:** 386 + - storage.type should be $type 387 + - Blob structure needs verification against ATProto spec 388 + - mimeType not validated in Lexicon 389 + 390 + --- 391 + 392 + ## Priority Issues Summary 393 + 394 + ### Critical (Must Fix) 395 + 396 + 1. **Add format validation to schemaRef** - Use `"format": "at-uri"` 397 + 2. **Fix union discriminators** - Use `$type` instead of `type` in storage variants 398 + 3. **Verify blob type usage** - Confirm ATProto compliance 399 + 400 + ### High Priority (Should Fix) 401 + 402 + 4. **Align metadata pattern** - Clarify relationship between top-level fields and metadata blob 403 + 5. **Standardize license field** - Match sampleSchema maxLength and description 404 + 6. **Standardize tags field** - Use consistent limits or document rationale 405 + 406 + ### Medium Priority (Consider) 407 + 408 + 7. **Add $type requirement to union variants** - Make explicit in required array 409 + 8. **Document versioning strategy** - Clarify that new versions = new records 410 + 9. **Add blob validation** - Consider maxSize, mimeType constraints 411 + 412 + ### Low Priority (Future) 413 + 414 + 10. **Consider closed union** - Evaluate after Phase 1 usage patterns 415 + 11. **Add creator field** - If needed based on user feedback 416 + 12. **Collection/relationship fields** - Phase 2 feature 417 + 418 + --- 419 + 420 + ## Consistency Matrix 421 + 422 + Comparison of patterns between sampleSchema and record Lexicons: 423 + 424 + | Pattern | sampleSchema | record | Status | 425 + |---------|--------------|--------|--------| 426 + | AT-URI format | ✅ Uses format | ❌ Missing | **Fix** | 427 + | License field | 200 chars, detailed | 100 chars, basic | **Align** | 428 + | Tags limits | 150/30 | 50/20 | **Decide** | 429 + | Metadata structure | Structured object | Blob + top-level | **Document** | 430 + | Union discriminator | Uses $type | Uses type | **Fix** | 431 + | Versioning | Explicit version field | Implicit (tid) | **Different OK** | 432 + | Key type | any (semantic) | tid (temporal) | **Both OK** | 433 + 434 + --- 435 + 436 + ## Recommendations 437 + 438 + ### Immediate Actions 439 + 440 + 1. Add `"format": "at-uri"` to schemaRef field 441 + 2. Change storage union variants to use `$type` discriminator 442 + 3. Verify blob array item type with ATProto specification 443 + 4. Align license field with sampleSchema (maxLength: 200, enhanced description) 444 + 5. Decide on tags limits (recommend matching sampleSchema: 150/30) 445 + 446 + ### Documentation Improvements 447 + 448 + 6. Add description clarifying metadata blob vs top-level fields relationship 449 + 7. Document that dataset versioning is via new records (tids) 450 + 8. Add note about storage union extensibility 451 + 9. Cross-reference with sampleSchema Lexicon 452 + 453 + ### Consider for Phase 2 454 + 455 + 10. Add creator/publisher field if user feedback indicates need 456 + 11. Evaluate closed union after observing extension patterns 457 + 12. Consider collection/relationship Lexicon for dataset hierarchies 458 + 459 + --- 460 + 461 + ## Conclusion 462 + 463 + The record Lexicon provides a solid foundation but needs refinement for ATProto compliance and consistency with sampleSchema patterns. The storage union design is excellent, and the use of tids is appropriate. Primary concerns are format validation, union discriminators, and metadata pattern clarity. 464 + 465 + **Estimated effort to address critical issues:** 2-3 hours 466 + **Recommended timeline:** Before Phase 1 completion 467 + 468 + After fixes, expected grade: **A-** (Excellent and production-ready)
+166
.planning/decisions/sampleSchema_design_questions.md
··· 1 + # sampleSchema Lexicon Design Questions 2 + 3 + This document captures open design questions for the `ac.foundation.dataset.sampleSchema` Lexicon that require user decisions before implementation. 4 + 5 + ## Q1: Key Format Validation 6 + 7 + **Context:** 8 + - Schema uses `"key": "any"` in Lexicon 9 + - Documentation says rkey format is `{NSID}@{semver}` 10 + - ATProto might not support regex validation on rkey in Lexicons 11 + 12 + **Question:** 13 + Should we add validation for the rkey format in the Lexicon definition, or is this enforced elsewhere? 14 + 15 + **Options:** 16 + 1. Add rkey pattern validation if ATProto Lexicons support it 17 + 2. Document expected format but rely on application-level validation 18 + 3. Use a structured key type instead of "any" 19 + 20 + **Impact:** 21 + - Option 1: Strongest validation, prevents malformed rkeys 22 + - Option 2: Simpler, but allows invalid rkeys to be created 23 + - Option 3: May not be compatible with ATProto Lexicon spec 24 + 25 + **Decision:** [TBD] 26 + 27 + --- 28 + 29 + ## Q2: Required Fields in JSON Schema 30 + 31 + **Context:** 32 + - The `jsonSchema` field accepts any JSON Schema object 33 + - JSON Schemas can have zero required fields (all optional) 34 + - PackableSample types in atdata typically have at least one field 35 + 36 + **Question:** 37 + Should we enforce that JSON Schemas must have at least one required field? 38 + 39 + **Options:** 40 + 1. No constraint - allow empty required arrays 41 + 2. Require at least one field in required array 42 + 3. No constraint but document best practices 43 + 44 + **Impact:** 45 + - Option 1: Maximum flexibility, but allows degenerate schemas 46 + - Option 2: Forces meaningful sample definitions 47 + - Option 3: Middle ground - guidance without enforcement 48 + 49 + **Recommendation:** Option 3 (document best practices) 50 + 51 + **Decision:** [TBD] 52 + 53 + --- 54 + 55 + ## Q3: Schema Type Extension Path 56 + 57 + **Context:** 58 + - `schemaType` field has `enum: ["jsonschema"]` only 59 + - Future may want to support other formats (Avro, Protobuf, etc.) 60 + - Lexicon schema evolution unclear 61 + 62 + **Question:** 63 + How should we design for future schema format support? 64 + 65 + **Options:** 66 + 1. Keep enum as-is, add new formats in major version bump 67 + 2. Use open union type instead of closed enum 68 + 3. Add `schemaFormat` union field alongside `jsonSchema` 69 + 70 + **Example for Option 3:** 71 + ```json 72 + { 73 + "schemaFormat": { 74 + "type": "union", 75 + "refs": ["#jsonSchemaFormat", "#avroSchemaFormat", "#protobufSchemaFormat"] 76 + } 77 + } 78 + ``` 79 + 80 + **Impact:** 81 + - Option 1: Breaking change required for new formats 82 + - Option 2: No validation of format string 83 + - Option 3: Clean extensibility but more complex now 84 + 85 + **Recommendation:** Option 1 (YAGNI - wait for actual need) 86 + 87 + **Decision:** [TBD] 88 + 89 + --- 90 + 91 + ## Q4: Metadata Field Structure 92 + 93 + **Context:** 94 + - `metadata` is currently `"type": "object"` with no structure 95 + - Common fields like `author`, `license`, `tags` are documented in examples 96 + - No validation on these fields 97 + 98 + **Question:** 99 + Should we define a structured schema for common metadata fields? 100 + 101 + **Options:** 102 + 1. Keep fully unstructured (current) 103 + 2. Define optional but structured fields (author, license, tags, etc.) 104 + 3. Create separate metadata Lexicon type and reference it 105 + 106 + **Example for Option 2:** 107 + ```json 108 + { 109 + "metadata": { 110 + "type": "object", 111 + "properties": { 112 + "author": {"type": "string", "maxLength": 200}, 113 + "license": {"type": "string", "maxLength": 100}, 114 + "tags": {"type": "array", "items": {"type": "string"}, "maxItems": 20} 115 + } 116 + } 117 + } 118 + ``` 119 + 120 + **Impact:** 121 + - Option 1: Maximum flexibility, no validation 122 + - Option 2: Standardization with optional compliance 123 + - Option 3: Reusability but added complexity 124 + 125 + **Recommendation:** Option 2 (structured but optional) 126 + 127 + **Decision:** [TBD] 128 + 129 + --- 130 + 131 + ## Q5: NDArray Shim URI Default 132 + 133 + **Context:** 134 + - `ndarrayShimUri` is optional with default mentioned in description 135 + - Standard shim is at `https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0` 136 + - No explicit default value in Lexicon 137 + 138 + **Question:** 139 + Should we add an explicit default value for `ndarrayShimUri`? 140 + 141 + **Options:** 142 + 1. Add `"default": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0"` 143 + 2. Keep as optional, codegen assumes standard shim if missing 144 + 3. Make required - always explicit 145 + 146 + **Impact:** 147 + - Option 1: Clearest behavior, but locks in URI 148 + - Option 2: Flexibility for future shim versions 149 + - Option 3: Most explicit but verbose 150 + 151 + **Recommendation:** Option 2 (implicit default in codegen) 152 + 153 + **Decision:** [TBD] 154 + 155 + --- 156 + 157 + ## Notes 158 + 159 + These questions should be resolved before finalizing the sampleSchema Lexicon design. Some can be deferred to Phase 2 implementation based on priority. 160 + 161 + **Priority:** 162 + - Q1: High (affects rkey strategy) 163 + - Q2: Low (can document later) 164 + - Q3: Low (YAGNI until needed) 165 + - Q4: Medium (affects metadata usage patterns) 166 + - Q5: Medium (affects codegen implementation)
+252
.planning/examples/code/ndarray_roundtrip.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + Demonstration of NDArray JSON Schema shim roundtrip. 4 + 5 + This script demonstrates: 6 + 1. Creating numpy arrays 7 + 2. Serializing to bytes (numpy .npy format) 8 + 3. Storing in JSON-compatible structure 9 + 4. Validating against JSON Schema 10 + 5. Deserializing back to numpy arrays 11 + 12 + This proves the NDArray shim design works end-to-end. 13 + """ 14 + 15 + import json 16 + import base64 17 + from io import BytesIO 18 + from pathlib import Path 19 + 20 + import numpy as np 21 + from jsonschema import validate, ValidationError 22 + 23 + 24 + ## 25 + # Step 1: Define helper functions (same as atdata._helpers) 26 + 27 + def array_to_bytes(x: np.ndarray) -> bytes: 28 + """Convert numpy array to bytes using .npy format.""" 29 + np_bytes = BytesIO() 30 + np.save(np_bytes, x, allow_pickle=True) 31 + return np_bytes.getvalue() 32 + 33 + 34 + def bytes_to_array(b: bytes) -> np.ndarray: 35 + """Convert bytes back to numpy array.""" 36 + np_bytes = BytesIO(b) 37 + return np.load(np_bytes, allow_pickle=True) 38 + 39 + 40 + ## 41 + # Step 2: Load the JSON Schema for ImageSample 42 + 43 + # Get path to the schema example 44 + schema_path = Path(__file__).parent.parent / "sampleSchema_example.json" 45 + with open(schema_path) as f: 46 + schema_record = json.load(f) 47 + 48 + # Extract just the jsonSchema part 49 + json_schema = schema_record["jsonSchema"] 50 + 51 + print("=" * 80) 52 + print("JSON Schema for ImageSample") 53 + print("=" * 80) 54 + print(json.dumps(json_schema, indent=2)) 55 + print() 56 + 57 + 58 + ## 59 + # Step 3: Create sample data matching the schema 60 + 61 + print("=" * 80) 62 + print("Creating Sample Data") 63 + print("=" * 80) 64 + 65 + # Create a numpy array (simulating an image) 66 + image_array = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8) 67 + print(f"Created image array: shape={image_array.shape}, dtype={image_array.dtype}") 68 + 69 + # Serialize to bytes (this is what atdata does) 70 + image_bytes = array_to_bytes(image_array) 71 + print(f"Serialized to bytes: {len(image_bytes)} bytes") 72 + print(f"First 100 bytes (hex): {image_bytes[:100].hex()}") 73 + print() 74 + 75 + 76 + ## 77 + # Step 4: Create JSON-compatible representation 78 + 79 + print("=" * 80) 80 + print("Creating JSON-Compatible Representation") 81 + print("=" * 80) 82 + 83 + # For JSON, bytes need to be base64-encoded 84 + image_base64 = base64.b64encode(image_bytes).decode('utf-8') 85 + print(f"Base64 encoded: {len(image_base64)} characters") 86 + print(f"First 100 chars: {image_base64[:100]}...") 87 + 88 + # Create a sample object matching the schema 89 + sample_data = { 90 + "image": image_base64, # NDArray as base64 string 91 + "label": "cat", # Regular string field 92 + "confidence": 0.95 # Optional number field 93 + } 94 + 95 + print() 96 + print("Sample data structure:") 97 + print(json.dumps({ 98 + "image": f"<{len(image_base64)} chars of base64>", 99 + "label": sample_data["label"], 100 + "confidence": sample_data["confidence"] 101 + }, indent=2)) 102 + print() 103 + 104 + 105 + ## 106 + # Step 5: Validate against JSON Schema 107 + 108 + print("=" * 80) 109 + print("Validating Against JSON Schema") 110 + print("=" * 80) 111 + 112 + try: 113 + validate(instance=sample_data, schema=json_schema) 114 + print("✅ VALID: Sample data validates against JSON Schema!") 115 + except ValidationError as e: 116 + print(f"❌ INVALID: {e.message}") 117 + print(f"Failed at: {list(e.path)}") 118 + 119 + print() 120 + 121 + 122 + ## 123 + # Step 6: Deserialize back to numpy 124 + 125 + print("=" * 80) 126 + print("Deserializing Back to Numpy") 127 + print("=" * 80) 128 + 129 + # Decode from base64 130 + recovered_bytes = base64.b64decode(sample_data["image"]) 131 + print(f"Decoded from base64: {len(recovered_bytes)} bytes") 132 + 133 + # Deserialize to numpy array 134 + recovered_array = bytes_to_array(recovered_bytes) 135 + print(f"Deserialized to array: shape={recovered_array.shape}, dtype={recovered_array.dtype}") 136 + 137 + # Verify it matches the original 138 + arrays_equal = np.array_equal(image_array, recovered_array) 139 + print(f"Arrays equal: {arrays_equal}") 140 + 141 + if arrays_equal: 142 + print("✅ SUCCESS: Full roundtrip successful!") 143 + else: 144 + print("❌ FAILURE: Arrays don't match") 145 + print(f"Max difference: {np.max(np.abs(image_array.astype(float) - recovered_array.astype(float)))}") 146 + 147 + print() 148 + 149 + 150 + ## 151 + # Step 7: Demonstrate validation of dtype/shape metadata 152 + 153 + print("=" * 80) 154 + print("Validating NDArray Metadata (dtype, shape)") 155 + print("=" * 80) 156 + 157 + # Extract metadata from schema 158 + image_schema = json_schema["properties"]["image"] 159 + expected_dtype = image_schema.get("x-atdata-dtype") 160 + expected_shape = image_schema.get("x-atdata-shape") 161 + 162 + print(f"Expected dtype: {expected_dtype}") 163 + print(f"Expected shape: {expected_shape}") 164 + print(f"Actual dtype: {recovered_array.dtype}") 165 + print(f"Actual shape: {recovered_array.shape}") 166 + 167 + # Validate dtype 168 + dtype_match = str(recovered_array.dtype) == expected_dtype 169 + print(f"Dtype matches: {dtype_match}") 170 + 171 + # Validate shape (with None/null for dynamic dimensions) 172 + def validate_shape(actual_shape, expected_shape): 173 + """Validate shape with support for dynamic dimensions (None/null).""" 174 + if len(actual_shape) != len(expected_shape): 175 + return False 176 + for actual_dim, expected_dim in zip(actual_shape, expected_shape): 177 + if expected_dim is not None and actual_dim != expected_dim: 178 + return False 179 + return True 180 + 181 + shape_match = validate_shape(recovered_array.shape, expected_shape) 182 + print(f"Shape matches: {shape_match}") 183 + 184 + if dtype_match and shape_match: 185 + print("✅ SUCCESS: Array metadata matches schema expectations!") 186 + else: 187 + print("❌ FAILURE: Metadata mismatch") 188 + 189 + print() 190 + 191 + 192 + ## 193 + # Step 8: Demonstrate msgpack (actual atdata format) 194 + 195 + print("=" * 80) 196 + print("Msgpack Serialization (Actual atdata Format)") 197 + print("=" * 80) 198 + 199 + try: 200 + import msgpack 201 + 202 + # In atdata, the sample would be stored in msgpack, not JSON 203 + # The image field would be raw bytes, not base64 204 + msgpack_data = { 205 + "image": image_bytes, # Raw bytes (not base64) 206 + "label": "cat", 207 + "confidence": 0.95 208 + } 209 + 210 + # Serialize to msgpack 211 + msgpack_bytes = msgpack.packb(msgpack_data) 212 + print(f"Msgpack size: {len(msgpack_bytes)} bytes") 213 + 214 + # Deserialize from msgpack 215 + recovered_msgpack = msgpack.unpackb(msgpack_bytes, raw=False) 216 + recovered_array_msgpack = bytes_to_array(recovered_msgpack["image"]) 217 + 218 + print(f"Recovered from msgpack: shape={recovered_array_msgpack.shape}, dtype={recovered_array_msgpack.dtype}") 219 + print(f"Arrays equal: {np.array_equal(image_array, recovered_array_msgpack)}") 220 + print("✅ SUCCESS: Msgpack roundtrip successful!") 221 + 222 + except ImportError: 223 + print("⚠️ msgpack not installed, skipping msgpack demonstration") 224 + print(" (atdata uses msgpack for actual serialization)") 225 + 226 + print() 227 + 228 + 229 + ## 230 + # Summary 231 + 232 + print("=" * 80) 233 + print("SUMMARY") 234 + print("=" * 80) 235 + print(""" 236 + ✅ The NDArray JSON Schema shim works correctly: 237 + 1. JSON Schema validates structure (field is present, is base64 string) 238 + 2. Binary .npy format preserves dtype and shape 239 + 3. Extension properties (x-atdata-*) provide metadata for validation 240 + 4. Full roundtrip: numpy → bytes → base64 → JSON → validate → deserialize → numpy 241 + 5. Msgpack format (actual atdata) uses raw bytes instead of base64 242 + 243 + ⚠️ Validation happens at two levels: 244 + - JSON Schema: Structural validation (field present, correct type) 245 + - Deserialization: Semantic validation (dtype/shape match expectations) 246 + 247 + 📝 This design is a pragmatic compromise: 248 + - Leverages existing .npy serialization (proven, self-describing) 249 + - Uses standard JSON Schema conventions (format: byte, contentEncoding) 250 + - Adds metadata via extension properties (x-atdata-*) 251 + - Works with both JSON (base64) and msgpack (raw bytes) 252 + """)
+316
.planning/examples/code/validate_ndarray_shim.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + Validate base64-encoded numpy arrays against the standalone ndarray_shim.json schema. 4 + 5 + This demonstrates that the NDArray shim schema definition works correctly as a 6 + standalone, reusable schema component that can be referenced from other schemas. 7 + 8 + Note: This tests the JSON representation (base64-encoded bytes). In actual atdata 9 + usage, WebDatasets store raw bytes directly in msgpack format without base64 encoding. 10 + """ 11 + 12 + import json 13 + import base64 14 + from io import BytesIO 15 + from pathlib import Path 16 + 17 + import numpy as np 18 + from jsonschema import validate, ValidationError, Draft7Validator 19 + 20 + 21 + ## 22 + # Helper functions 23 + 24 + def array_to_bytes(x: np.ndarray) -> bytes: 25 + """Convert numpy array to bytes using .npy format.""" 26 + np_bytes = BytesIO() 27 + np.save(np_bytes, x, allow_pickle=True) 28 + return np_bytes.getvalue() 29 + 30 + 31 + def bytes_to_array(b: bytes) -> np.ndarray: 32 + """Convert bytes back to numpy array.""" 33 + np_bytes = BytesIO(b) 34 + return np.load(np_bytes, allow_pickle=True) 35 + 36 + 37 + ## 38 + # Load the standalone ndarray shim schema 39 + 40 + shim_path = Path(__file__).parent.parent.parent / "lexicons" / "ndarray_shim.json" 41 + with open(shim_path) as f: 42 + ndarray_shim = json.load(f) 43 + 44 + print("=" * 80) 45 + print("Loaded NDArray Shim Schema") 46 + print("=" * 80) 47 + print(f"Schema ID: {ndarray_shim['$id']}") 48 + print(f"Version: {ndarray_shim['version']}") 49 + print() 50 + print("NDArray definition:") 51 + print(json.dumps(ndarray_shim["$defs"]["ndarray"], indent=2)) 52 + print() 53 + 54 + 55 + ## 56 + # Test Case 1: Simple 1D array 57 + 58 + print("=" * 80) 59 + print("Test Case 1: Simple 1D Array") 60 + print("=" * 80) 61 + 62 + array_1d = np.array([1, 2, 3, 4, 5], dtype=np.int32) 63 + print(f"Created array: {array_1d}") 64 + print(f"Shape: {array_1d.shape}, dtype: {array_1d.dtype}") 65 + 66 + # Serialize and encode 67 + bytes_1d = array_to_bytes(array_1d) 68 + base64_1d = base64.b64encode(bytes_1d).decode('utf-8') 69 + print(f"Serialized to {len(bytes_1d)} bytes") 70 + print(f"Base64: {len(base64_1d)} characters") 71 + 72 + # Validate against the ndarray schema definition directly 73 + ndarray_schema = { 74 + "$schema": "http://json-schema.org/draft-07/schema#", 75 + "$defs": ndarray_shim["$defs"], 76 + "$ref": "#/$defs/ndarray" 77 + } 78 + 79 + try: 80 + validate(instance=base64_1d, schema=ndarray_schema) 81 + print("✅ VALID: 1D array validates against ndarray schema") 82 + except ValidationError as e: 83 + print(f"❌ INVALID: {e.message}") 84 + 85 + # Verify roundtrip 86 + recovered_1d = bytes_to_array(base64.b64decode(base64_1d)) 87 + print(f"Recovered: {recovered_1d}") 88 + print(f"Arrays equal: {np.array_equal(array_1d, recovered_1d)}") 89 + print() 90 + 91 + 92 + ## 93 + # Test Case 2: 2D array (matrix) 94 + 95 + print("=" * 80) 96 + print("Test Case 2: 2D Array (Matrix)") 97 + print("=" * 80) 98 + 99 + array_2d = np.random.randn(3, 4).astype(np.float32) 100 + print(f"Created array shape: {array_2d.shape}, dtype: {array_2d.dtype}") 101 + print(f"Sample values:\n{array_2d}") 102 + 103 + bytes_2d = array_to_bytes(array_2d) 104 + base64_2d = base64.b64encode(bytes_2d).decode('utf-8') 105 + print(f"Serialized to {len(bytes_2d)} bytes") 106 + 107 + try: 108 + validate(instance=base64_2d, schema=ndarray_schema) 109 + print("✅ VALID: 2D array validates against ndarray schema") 110 + except ValidationError as e: 111 + print(f"❌ INVALID: {e.message}") 112 + 113 + recovered_2d = bytes_to_array(base64.b64decode(base64_2d)) 114 + print(f"Arrays equal: {np.array_equal(array_2d, recovered_2d)}") 115 + print() 116 + 117 + 118 + ## 119 + # Test Case 3: 3D array (image-like) 120 + 121 + print("=" * 80) 122 + print("Test Case 3: 3D Array (Image-like)") 123 + print("=" * 80) 124 + 125 + array_3d = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8) 126 + print(f"Created array shape: {array_3d.shape}, dtype: {array_3d.dtype}") 127 + print(f"Total elements: {array_3d.size}") 128 + 129 + bytes_3d = array_to_bytes(array_3d) 130 + base64_3d = base64.b64encode(bytes_3d).decode('utf-8') 131 + print(f"Serialized to {len(bytes_3d)} bytes ({len(bytes_3d) / 1024:.1f} KB)") 132 + print(f"Base64 string: {len(base64_3d)} characters ({len(base64_3d) / 1024:.1f} KB)") 133 + 134 + try: 135 + validate(instance=base64_3d, schema=ndarray_schema) 136 + print("✅ VALID: 3D array validates against ndarray schema") 137 + except ValidationError as e: 138 + print(f"❌ INVALID: {e.message}") 139 + 140 + recovered_3d = bytes_to_array(base64.b64decode(base64_3d)) 141 + print(f"Recovered shape: {recovered_3d.shape}, dtype: {recovered_3d.dtype}") 142 + print(f"Arrays equal: {np.array_equal(array_3d, recovered_3d)}") 143 + print() 144 + 145 + 146 + ## 147 + # Test Case 4: Different dtypes 148 + 149 + print("=" * 80) 150 + print("Test Case 4: Various Dtypes") 151 + print("=" * 80) 152 + 153 + dtypes_to_test = [ 154 + np.int8, 155 + np.int16, 156 + np.int32, 157 + np.int64, 158 + np.uint8, 159 + np.uint16, 160 + np.uint32, 161 + np.uint64, 162 + np.float16, 163 + np.float32, 164 + np.float64, 165 + np.complex64, 166 + np.complex128, 167 + ] 168 + 169 + print(f"Testing {len(dtypes_to_test)} different dtypes...") 170 + all_passed = True 171 + 172 + for dtype in dtypes_to_test: 173 + array = np.array([1, 2, 3], dtype=dtype) 174 + array_bytes = array_to_bytes(array) 175 + array_base64 = base64.b64encode(array_bytes).decode('utf-8') 176 + 177 + try: 178 + validate(instance=array_base64, schema=ndarray_schema) 179 + recovered = bytes_to_array(base64.b64decode(array_base64)) 180 + match = np.array_equal(array, recovered) 181 + status = "✅" if match else "❌" 182 + print(f" {status} {str(dtype):12s} - valid and {'matches' if match else 'MISMATCH'}") 183 + if not match: 184 + all_passed = False 185 + except ValidationError as e: 186 + print(f" ❌ {str(dtype):12s} - validation failed: {e.message}") 187 + all_passed = False 188 + 189 + if all_passed: 190 + print("✅ SUCCESS: All dtypes validated and roundtripped correctly") 191 + else: 192 + print("❌ FAILURE: Some dtypes failed") 193 + print() 194 + 195 + 196 + ## 197 + # Test Case 5: Invalid data (should fail validation) 198 + 199 + print("=" * 80) 200 + print("Test Case 5: Invalid Data (Negative Tests)") 201 + print("=" * 80) 202 + 203 + # Test invalid types 204 + invalid_cases = [ 205 + ("plain string", "not base64 encoded array data"), 206 + ("number", 12345), 207 + ("object", {"dtype": "uint8", "data": "fake"}), 208 + ("array", [1, 2, 3]), 209 + ("null", None), 210 + ] 211 + 212 + print("Testing invalid inputs (should fail validation):") 213 + for name, invalid_data in invalid_cases: 214 + try: 215 + validate(instance=invalid_data, schema=ndarray_schema) 216 + print(f" ❌ {name:15s} - SHOULD HAVE FAILED but passed") 217 + except ValidationError: 218 + print(f" ✅ {name:15s} - correctly rejected") 219 + 220 + print() 221 + 222 + 223 + ## 224 + # Test Case 6: Using the schema as a $ref in another schema (inline) 225 + 226 + print("=" * 80) 227 + print("Test Case 6: Using NDArray Shim as $ref (Inline)") 228 + print("=" * 80) 229 + 230 + # Create a schema that inlines the ndarray shim definition 231 + sample_schema = { 232 + "$schema": "http://json-schema.org/draft-07/schema#", 233 + "title": "TestSample", 234 + "type": "object", 235 + "required": ["data", "label"], 236 + "properties": { 237 + "data": { 238 + "$ref": "#/$defs/ndarray", 239 + "description": "Numpy array data", 240 + "x-atdata-dtype": "float32", 241 + "x-atdata-shape": [None, 10] 242 + }, 243 + "label": { 244 + "type": "string", 245 + "description": "Label for this sample" 246 + } 247 + }, 248 + "$defs": { 249 + "ndarray": ndarray_shim["$defs"]["ndarray"] 250 + } 251 + } 252 + 253 + print("Created schema that uses inlined ndarray shim:") 254 + print(json.dumps({ 255 + "title": sample_schema["title"], 256 + "required": sample_schema["required"], 257 + "properties": { 258 + "data": {"$ref": "#/$defs/ndarray", "x-atdata-dtype": "float32"}, 259 + "label": {"type": "string"} 260 + } 261 + }, indent=2)) 262 + print() 263 + 264 + # Create sample data 265 + test_array = np.random.randn(5, 10).astype(np.float32) 266 + test_data = { 267 + "data": base64.b64encode(array_to_bytes(test_array)).decode('utf-8'), 268 + "label": "test sample" 269 + } 270 + 271 + print(f"Created test sample with array shape {test_array.shape}") 272 + 273 + # Validate with inline $ref 274 + validator = Draft7Validator(sample_schema) 275 + 276 + try: 277 + validator.validate(test_data) 278 + print("✅ VALID: Sample with $ref to ndarray shim validates correctly") 279 + except ValidationError as e: 280 + print(f"❌ INVALID: {e.message}") 281 + 282 + print() 283 + 284 + 285 + ## 286 + # Summary 287 + 288 + print("=" * 80) 289 + print("SUMMARY") 290 + print("=" * 80) 291 + print(""" 292 + ✅ The standalone ndarray_shim.json schema works correctly: 293 + 1. Validates base64-encoded .npy bytes as strings 294 + 2. Works with all standard numpy dtypes 295 + 3. Supports arrays of any dimensionality (1D, 2D, 3D, etc.) 296 + 4. Can be used as $ref in other schemas 297 + 5. Correctly rejects invalid data 298 + 299 + ✅ The shim is a proper JSON Schema Draft 7 definition: 300 + - Uses standard type/format (string/byte) 301 + - Uses contentEncoding/contentMediaType properly 302 + - Works with standard validators (jsonschema library) 303 + - Can be stored at a canonical URI and referenced 304 + 305 + 📝 Key points: 306 + - Base64 encoding adds ~33% overhead (150KB → 200KB) 307 + - In actual atdata, WebDatasets store raw bytes (no base64) 308 + - JSON representation useful for: APIs, validation, examples 309 + - Msgpack representation used in practice: more efficient 310 + 311 + 🎯 Design validated: 312 + - Shim definition is sound and reusable 313 + - Works as both inline $def and external $ref 314 + - Compatible with JSON Schema tooling 315 + - Ready for use in ac.foundation.dataset.sampleSchema Lexicon 316 + """)
+39
.planning/examples/dataset_blob_storage.json
··· 1 + { 2 + "$type": "ac.foundation.dataset.record", 3 + "name": "Small Sample Dataset", 4 + "schemaRef": "at://did:plc:def456/ac.foundation.dataset.sampleSchema/textsample@2.1.0", 5 + "storage": { 6 + "$type": "ac.foundation.dataset.storageBlobs", 7 + "blobs": [ 8 + { 9 + "$type": "blob", 10 + "ref": { 11 + "$link": "bafyreig4rvsqx3vfzdchq2qx7xr2nq2y4vjvd4w5pqtjwkqiw7h5e6vf7e" 12 + }, 13 + "mimeType": "application/x-tar", 14 + "size": 1234567 15 + }, 16 + { 17 + "$type": "blob", 18 + "ref": { 19 + "$link": "bafyreig5saabc3defghijklmnopqrstuvwxyz123456789abcdefghijk" 20 + }, 21 + "mimeType": "application/x-tar", 22 + "size": 2345678 23 + } 24 + ] 25 + }, 26 + "description": "Small text dataset stored directly on PDS for maximum decentralization", 27 + "tags": [ 28 + "nlp", 29 + "text", 30 + "small-dataset" 31 + ], 32 + "size": { 33 + "samples": 1000, 34 + "bytes": 3580245, 35 + "shards": 2 36 + }, 37 + "license": "CC-BY-4.0", 38 + "createdAt": "2025-01-07T10:30:00Z" 39 + }
+26
.planning/examples/dataset_external_storage.json
··· 1 + { 2 + "$type": "ac.foundation.dataset.record", 3 + "name": "CIFAR-10 Training Set", 4 + "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0", 5 + "storage": { 6 + "$type": "ac.foundation.dataset.storageExternal", 7 + "urls": [ 8 + "s3://my-bucket/cifar10-train-{000000..000049}.tar" 9 + ] 10 + }, 11 + "description": "CIFAR-10 training images (50,000 samples) stored as WebDataset shards on S3", 12 + "metadata": "<msgpack-encoded bytes here>", 13 + "tags": [ 14 + "computer-vision", 15 + "classification", 16 + "cifar10", 17 + "training" 18 + ], 19 + "size": { 20 + "samples": 50000, 21 + "bytes": 178456789, 22 + "shards": 50 23 + }, 24 + "license": "MIT", 25 + "createdAt": "2025-01-06T12:00:00Z" 26 + }
+27
.planning/examples/lens_example.json
··· 1 + { 2 + "$type": "ac.foundation.dataset.lens", 3 + "name": "RGB to Grayscale Conversion", 4 + "sourceSchema": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/rgbimage@1.0.0", 5 + "targetSchema": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/grayscaleimage@1.0.0", 6 + "description": "Converts RGB images to grayscale using standard luminosity formula", 7 + "getterCode": { 8 + "repository": "https://github.com/alice/vision-lenses", 9 + "commit": "a1b2c3d4e5f6789abcdef0123456789abcdef012", 10 + "path": "lenses/color.py:rgb_to_grayscale", 11 + "branch": "main" 12 + }, 13 + "putterCode": { 14 + "repository": "https://github.com/alice/vision-lenses", 15 + "commit": "a1b2c3d4e5f6789abcdef0123456789abcdef012", 16 + "path": "lenses/color.py:grayscale_to_rgb", 17 + "branch": "main" 18 + }, 19 + "language": "python", 20 + "metadata": { 21 + "author": "alice.bsky.social", 22 + "performance": "O(n) where n is number of pixels", 23 + "reversible": false, 24 + "notes": "Putter creates approximate RGB by duplicating grayscale channel" 25 + }, 26 + "createdAt": "2025-01-07T14:00:00Z" 27 + }
+53
.planning/examples/sampleSchema_example.json
··· 1 + { 2 + "$type": "ac.foundation.dataset.sampleSchema", 3 + "name": "ImageSample", 4 + "version": "1.0.0", 5 + "schemaType": "jsonSchema", 6 + "schema": { 7 + "$type": "ac.foundation.dataset.sampleSchema#jsonSchemaFormat", 8 + "$schema": "http://json-schema.org/draft-07/schema#", 9 + "title": "ImageSample", 10 + "type": "object", 11 + "arrayFormatVersions": { 12 + "ndarrayBytes": "1.0.0" 13 + }, 14 + "required": [ 15 + "image", 16 + "label" 17 + ], 18 + "properties": { 19 + "image": { 20 + "$ref": "#/$defs/ndarray", 21 + "description": "RGB image with variable height/width", 22 + "x-atdata-dtype": "uint8", 23 + "x-atdata-shape": [null, null, 3], 24 + "x-atdata-notes": "Images must have 3 color channels (RGB)" 25 + }, 26 + "label": { 27 + "type": "string", 28 + "description": "Human-readable label for the image" 29 + }, 30 + "confidence": { 31 + "type": "number", 32 + "description": "Optional confidence score", 33 + "minimum": 0, 34 + "maximum": 1 35 + } 36 + }, 37 + "$defs": { 38 + "ndarray": { 39 + "type": "string", 40 + "format": "byte", 41 + "description": "Numpy array serialized using numpy .npy format (includes dtype and shape in binary header)", 42 + "contentEncoding": "base64", 43 + "contentMediaType": "application/octet-stream" 44 + } 45 + } 46 + }, 47 + "description": "Sample type for images with labels, commonly used for computer vision datasets", 48 + "metadata": { 49 + "license": "MIT", 50 + "tags": ["computer-vision", "image-classification"] 51 + }, 52 + "createdAt": "2025-01-06T12:00:00Z" 53 + }
+259
.planning/lexicons/README.md
··· 1 + # ATProto Lexicon Definitions for atdata 2 + 3 + This directory contains the ATProto Lexicon JSON definitions for the distributed dataset federation system. 4 + 5 + ## Lexicons 6 + 7 + ### Core Record Types 8 + 9 + 1. **[ac.foundation.dataset.sampleSchema](ac.foundation.dataset.sampleSchema.json)** 10 + - Defines PackableSample-compatible sample types using JSON Schema 11 + - Supports versioning via rkey format: `{NSID}@{semver}` 12 + - Includes NDArray shim for ML/scientific data types 13 + - Example: [sampleSchema_example.json](../examples/sampleSchema_example.json) 14 + 15 + 2. **[ac.foundation.dataset.record](ac.foundation.dataset.record.json)** 16 + - Index records for WebDataset-backed datasets 17 + - Hybrid storage support (external URLs + PDS blobs) 18 + - References sampleSchema for type information 19 + - Examples: 20 + - [External storage](../examples/dataset_external_storage.json) 21 + - [Blob storage](../examples/dataset_blob_storage.json) 22 + 23 + 3. **[ac.foundation.dataset.lens](ac.foundation.dataset.lens.json)** 24 + - Bidirectional transformations between sample types 25 + - External code references (GitHub, tangled.org) 26 + - Language metadata for multi-language support 27 + - Example: [lens_example.json](../examples/lens_example.json) 28 + 29 + ### Query APIs 30 + 31 + 4. **[ac.foundation.dataset.getLatestSchema](ac.foundation.dataset.getLatestSchema.json)** 32 + - Query to get the latest version of a schema by NSID 33 + - Returns full record + all available versions 34 + - Handles the custom rkey versioning scheme 35 + 36 + ## Key Design Decisions 37 + 38 + ### 1. Namespace 39 + 40 + All Lexicons use the `ac.foundation.dataset.*` namespace: 41 + - `ac.foundation` - Organization namespace 42 + - `dataset` - Domain (distributed datasets) 43 + - Specific record types: `sampleSchema`, `record`, `lens` 44 + 45 + ### 2. Schema Versioning (rkey Convention) 46 + 47 + **Custom rkey format**: `{NSID}@{semver}` 48 + 49 + **Example**: `com.example.myschema@1.2.3` 50 + 51 + - `{NSID}`: Permanent identifier for the schema type (e.g., `com.example.myschema`) 52 + - `{semver}`: Semantic version (e.g., `1.2.3`) 53 + 54 + **Benefits**: 55 + - Immutable version records 56 + - Easy to list all versions of a schema 57 + - Natural query pattern via `getLatestSchema` 58 + - Clear semantic versioning enforcement 59 + 60 + **Implementation**: The sampleSchema Lexicon uses `"key": "any"` to support this custom format. 61 + 62 + ### 3. JSON Schema with NDArray Shim 63 + 64 + **Decision**: Use standard JSON Schema for type definitions with a custom NDArray shim. 65 + 66 + **NDArray Shim Structure**: 67 + ```json 68 + { 69 + "$defs": { 70 + "ndarray": { 71 + "type": "object", 72 + "required": ["dtype", "shape", "data"], 73 + "properties": { 74 + "dtype": { 75 + "type": "string", 76 + "description": "Numpy dtype string (e.g., 'float32', 'uint8')" 77 + }, 78 + "shape": { 79 + "type": "array", 80 + "items": {"type": "integer"}, 81 + "description": "Array shape" 82 + }, 83 + "data": { 84 + "type": "string", 85 + "format": "byte", 86 + "description": "Array data as base64-encoded bytes" 87 + } 88 + } 89 + } 90 + } 91 + } 92 + ``` 93 + 94 + **Usage in schemas**: 95 + ```json 96 + { 97 + "properties": { 98 + "image": { 99 + "$ref": "#/$defs/ndarray", 100 + "dtype": "uint8", 101 + "shape": [null, null, 3] 102 + } 103 + } 104 + } 105 + ``` 106 + 107 + **Benefits**: 108 + - Leverages JSON Schema ecosystem (validators, tooling) 109 + - Custom NDArray handling for ML/scientific data 110 + - Extensible via `schemaType` field (future: Protobuf, etc.) 111 + 112 + ### 4. Hybrid Storage 113 + 114 + **Open union** for storage location: 115 + - `storageExternal`: External URLs (S3, HTTP, IPFS, etc.) 116 + - `storageBlobs`: ATProto PDS blobs 117 + 118 + **Benefits**: 119 + - Flexibility: Use external storage for large datasets 120 + - Decentralization: Use blobs for small datasets or self-hosting 121 + - AppView can proxy both types uniformly 122 + 123 + ### 5. External Code References 124 + 125 + **Lenses use code references** instead of inline code for security: 126 + - Repository URL (GitHub, tangled.org) 127 + - Commit hash (immutability) 128 + - Function path (e.g., `lenses/vision.py:rgb_to_grayscale`) 129 + 130 + **Benefits**: 131 + - Secure: No arbitrary code execution 132 + - Verifiable: Commit hash ensures immutability 133 + - Auditable: Users can review code before use 134 + 135 + ## Example Workflows 136 + 137 + ### Publishing a Schema 138 + 139 + ```python 140 + from atdata.atproto import SchemaPublisher 141 + 142 + @atdata.packable 143 + class ImageSample: 144 + image: NDArray # uint8, [H, W, 3] 145 + label: str 146 + 147 + publisher = SchemaPublisher(client) 148 + schema_uri = publisher.publish_schema( 149 + ImageSample, 150 + name="ImageSample", 151 + version="1.0.0", 152 + description="RGB image with label" 153 + ) 154 + # Result: at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imagesample@1.0.0 155 + ``` 156 + 157 + ### Publishing a Dataset 158 + 159 + ```python 160 + from atdata.atproto import DatasetPublisher 161 + 162 + dataset = atdata.Dataset[ImageSample]( 163 + url="s3://my-bucket/dataset-{000000..000009}.tar" 164 + ) 165 + 166 + publisher = DatasetPublisher(client) 167 + dataset_uri = publisher.publish_dataset( 168 + dataset, 169 + name="My Image Dataset", 170 + schema_uri=schema_uri, 171 + tags=["computer-vision", "training"] 172 + ) 173 + ``` 174 + 175 + ### Discovering Datasets 176 + 177 + ```python 178 + from atdata.atproto import DatasetLoader 179 + 180 + loader = DatasetLoader(client) 181 + 182 + # Search by tags 183 + datasets = loader.search_datasets(tags=["computer-vision"]) 184 + 185 + # Load dataset 186 + dataset = loader.load_dataset(datasets[0]['uri']) 187 + ``` 188 + 189 + ## Migration & Versioning 190 + 191 + ### Publishing a New Schema Version 192 + 193 + ```python 194 + # Publish v2.0.0 with migration lens 195 + schema_uri_v2 = publisher.publish_schema( 196 + ImageSampleV2, 197 + name="ImageSample", 198 + version="2.0.0", 199 + previous_version=schema_uri_v1, 200 + migration_lens=migration_lens_uri 201 + ) 202 + ``` 203 + 204 + ### Getting Latest Schema 205 + 206 + ```python 207 + from atdata.atproto import query_latest_schema 208 + 209 + latest = query_latest_schema( 210 + client, 211 + schema_id="imagesample" # Just the NSID part 212 + ) 213 + # Returns: { 214 + # "uri": "at://.../imagesample@2.0.0", 215 + # "version": "2.0.0", 216 + # "record": {...}, 217 + # "allVersions": [...] 218 + # } 219 + ``` 220 + 221 + ## Validation 222 + 223 + See [06_lexicon_validation.md](../decisions/06_lexicon_validation.md) for validation process. 224 + 225 + ### Quick Validation 226 + 227 + ```bash 228 + # Validate Lexicon JSON (requires ATProto tooling) 229 + atproto-lexicon validate ac.foundation.dataset.sampleSchema.json 230 + 231 + # Validate example records 232 + python scripts/validate_examples.py 233 + ``` 234 + 235 + ## Future Extensions 236 + 237 + ### Potential Additional Lexicons 238 + 239 + - `ac.foundation.dataset.collection` - Group multiple datasets 240 + - `ac.foundation.dataset.benchmark` - Evaluation results on datasets 241 + - `ac.foundation.dataset.attestation` - Formal correctness proofs for Lenses 242 + - `ac.foundation.dataset.verification` - Trusted DID attestations 243 + 244 + ### Schema Type Extensions 245 + 246 + Current: `"schemaType": "jsonschema"` 247 + 248 + Future possibilities: 249 + - `"schemaType": "protobuf"` - Protocol Buffers definitions 250 + - `"schemaType": "avro"` - Apache Avro schemas 251 + - Custom domain-specific schema languages 252 + 253 + ## References 254 + 255 + - Planning documents: `../*.md` 256 + - Design decisions: `../decisions/*.md` 257 + - Architectural assessment: `../decisions/assessment.md` 258 + - ATProto Lexicon spec: https://atproto.com/specs/lexicon 259 + - ATProto NSID spec: https://atproto.com/specs/nsid
+178
.planning/lexicons/README_ARRAY_FORMATS.md
··· 1 + # Array Format Registry 2 + 3 + This document explains the token-based registry pattern for atdata array serialization formats. 4 + 5 + ## Overview 6 + 7 + Array formats define how numpy NDArray fields are serialized in atdata sample types. The system provides: 8 + 9 + 1. **Token-based registry**: `ac.foundation.dataset.arrayFormat` Lexicon 10 + 2. **Version tracking**: Each schema declares which format versions it uses 11 + 3. **Canonical shim schemas**: Foundation.ac maintains standard JSON Schema shims at predictable URLs 12 + 13 + ## Pattern 14 + 15 + ### arrayFormat Lexicon Structure 16 + 17 + ```json 18 + { 19 + "lexicon": 1, 20 + "id": "ac.foundation.dataset.arrayFormat", 21 + "defs": { 22 + "main": { 23 + "type": "string", 24 + "knownValues": ["ndarrayBytes"], 25 + "maxLength": 50 26 + }, 27 + "ndarrayBytes": { 28 + "type": "token", 29 + "description": "Numpy .npy binary format..." 30 + } 31 + } 32 + } 33 + ``` 34 + 35 + ### Usage in sampleSchema 36 + 37 + Schema records declare format versions in `arrayFormatVersions` field: 38 + 39 + ```json 40 + { 41 + "$type": "ac.foundation.dataset.sampleSchema", 42 + "schemaType": "jsonSchema", 43 + "schema": { 44 + "$type": "ac.foundation.dataset.sampleSchema#jsonSchemaFormat", 45 + "arrayFormatVersions": { 46 + "ndarrayBytes": "1.0.0" 47 + }, 48 + "properties": { 49 + "image": { 50 + "$ref": "#/$defs/ndarray", 51 + "x-atdata-dtype": "uint8" 52 + } 53 + }, 54 + "$defs": { 55 + "ndarray": { 56 + "type": "string", 57 + "format": "byte", 58 + ... 59 + } 60 + } 61 + } 62 + } 63 + ``` 64 + 65 + ## Canonical Shim Schema URLs 66 + 67 + Foundation.ac maintains JSON Schema shims at canonical URLs: 68 + 69 + ``` 70 + https://foundation.ac/schemas/atdata-{format}-bytes/{version}/ 71 + ``` 72 + 73 + Examples: 74 + - `https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0/` 75 + - `https://foundation.ac/schemas/atdata-arrow-bytes/1.0.0/` (future) 76 + 77 + These shim schemas define the JSON Schema representation (base64-encoded bytes) for each format. 78 + 79 + ## Default Behavior 80 + 81 + If `arrayFormatVersions` is omitted, the system defaults to: 82 + 83 + ```json 84 + { 85 + "ndarrayBytes": "1.0.0" 86 + } 87 + ``` 88 + 89 + This ensures backward compatibility and simplifies common cases. 90 + 91 + ## Current Array Formats 92 + 93 + | Token Def | knownValue | Current Version | Description | 94 + |-----------|------------|-----------------|-------------| 95 + | `#ndarrayBytes` | `"ndarrayBytes"` | `1.0.0` | Numpy .npy binary format with dtype/shape header | 96 + 97 + ## Adding New Array Formats 98 + 99 + To add support for a new array format (e.g., Apache Arrow): 100 + 101 + ### 1. Add token def to arrayFormat Lexicon 102 + 103 + Edit `ac.foundation.dataset.arrayFormat.json`: 104 + 105 + ```json 106 + { 107 + "defs": { 108 + "main": { 109 + "knownValues": ["ndarrayBytes", "arrowBytes"] 110 + }, 111 + "arrowBytes": { 112 + "type": "token", 113 + "description": "Apache Arrow IPC format for array serialization..." 114 + } 115 + } 116 + } 117 + ``` 118 + 119 + ### 2. Publish shim schema at canonical URL 120 + 121 + Create and publish JSON Schema shim at: 122 + ``` 123 + https://foundation.ac/schemas/atdata-arrow-bytes/1.0.0/ 124 + ``` 125 + 126 + ### 3. Use in sample schemas 127 + 128 + Declare format version in schema records: 129 + 130 + ```json 131 + { 132 + "arrayFormatVersions": { 133 + "arrowBytes": "1.0.0" 134 + } 135 + } 136 + ``` 137 + 138 + ## Version Evolution 139 + 140 + ### Minor/Patch Updates 141 + 142 + For backward-compatible changes: 143 + - Publish new version at new URL (e.g., `1.1.0`) 144 + - Update `arrayFormatVersions` in schema records 145 + - Old versions remain accessible 146 + 147 + ### Major Updates 148 + 149 + For breaking changes: 150 + - Consider new format name (e.g., `ndarrayBytes2`) 151 + - Or use major version in URL structure 152 + - Schemas can migrate via Lens transformations 153 + 154 + ## Design Rationale 155 + 156 + This pattern provides: 157 + 158 + 1. **Centralized Discovery**: Query `ac.foundation.dataset.arrayFormat` to see all supported formats 159 + 2. **Explicit Versioning**: Each schema declares exactly which format versions it uses 160 + 3. **Canonical References**: Predictable URLs for shim schemas maintained by foundation.ac 161 + 4. **Extensibility**: New formats added via tokens without breaking existing schemas 162 + 5. **Flexibility**: Schemas can use multiple formats simultaneously (if needed) 163 + 164 + ## Relationship to Codegen 165 + 166 + When atdata codegen processes a sampleSchema: 167 + 168 + 1. Reads `arrayFormatVersions` to know which formats are used 169 + 2. Fetches canonical shim schemas from foundation.ac URLs 170 + 3. Generates Python dataclasses with proper NDArray type hints 171 + 4. Implements serialization using appropriate format (currently `.npy` via `_helpers.py`) 172 + 173 + ## References 174 + 175 + - [ac.foundation.dataset.arrayFormat Lexicon](./ac.foundation.dataset.arrayFormat.json) 176 + - [ac.foundation.dataset.sampleSchema Lexicon](./ac.foundation.dataset.sampleSchema.json) 177 + - [NDArray Shim Specification](../.planning/ndarray_shim_spec.md) 178 + - [ATProto Lexicon Token Type](https://atproto.com/guides/lexicon)
+150
.planning/lexicons/README_SCHEMA_TYPES.md
··· 1 + # Schema Type Registry 2 + 3 + This document explains the token-based registry pattern for atdata schema types. 4 + 5 + ## Pattern 6 + 7 + Schema types in atdata are managed through the `ac.foundation.dataset.schemaType` Lexicon: 8 + 9 + 1. **Single Lexicon file**: `ac.foundation.dataset.schemaType.json` 10 + 2. **Main def**: String type with `knownValues` listing supported schema types 11 + 3. **Token defs**: Each schema type has a corresponding token def (e.g., `#jsonSchema`) 12 + 4. **Reference in sampleSchema**: The `schemaType` field refs to `ac.foundation.dataset.schemaType` 13 + 14 + ## Structure 15 + 16 + ```json 17 + { 18 + "lexicon": 1, 19 + "id": "ac.foundation.dataset.schemaType", 20 + "defs": { 21 + "main": { 22 + "type": "string", 23 + "knownValues": ["jsonSchema"], 24 + "maxLength": 50 25 + }, 26 + "jsonSchema": { 27 + "type": "token", 28 + "description": "JSON Schema Draft 7 format..." 29 + } 30 + } 31 + } 32 + ``` 33 + 34 + ## Usage in sampleSchema 35 + 36 + The `schemaType` field references the schemaType Lexicon: 37 + 38 + ```json 39 + { 40 + "$type": "ac.foundation.dataset.sampleSchema", 41 + "name": "ImageSample", 42 + "version": "1.0.0", 43 + "schemaType": "jsonSchema", 44 + "schema": { 45 + "$type": "ac.foundation.dataset.sampleSchema#jsonSchemaFormat", 46 + ... 47 + } 48 + } 49 + ``` 50 + 51 + In the Lexicon definition: 52 + 53 + ```json 54 + { 55 + "schemaType": { 56 + "type": "ref", 57 + "ref": "ac.foundation.dataset.schemaType" 58 + } 59 + } 60 + ``` 61 + 62 + ## Adding New Schema Types 63 + 64 + To add support for a new schema format (e.g., Avro, Protobuf): 65 + 66 + ### 1. Add token def to schemaType Lexicon 67 + 68 + Edit `ac.foundation.dataset.schemaType.json`: 69 + 70 + ```json 71 + { 72 + "defs": { 73 + "main": { 74 + "type": "string", 75 + "knownValues": ["jsonSchema", "avro"], 76 + "maxLength": 50 77 + }, 78 + "avro": { 79 + "type": "token", 80 + "description": "Apache Avro schema format..." 81 + } 82 + } 83 + } 84 + ``` 85 + 86 + ### 2. Add format def to sampleSchema Lexicon 87 + 88 + Edit `ac.foundation.dataset.sampleSchema.json`: 89 + 90 + ```json 91 + { 92 + "defs": { 93 + "avroFormat": { 94 + "type": "object", 95 + "description": "Apache Avro schema format...", 96 + "required": ["$type", "type"], 97 + "properties": { 98 + "$type": { 99 + "type": "string", 100 + "const": "ac.foundation.dataset.sampleSchema#avroFormat" 101 + }, 102 + "type": { 103 + "type": "string" 104 + }, 105 + "fields": { 106 + "type": "array" 107 + } 108 + } 109 + } 110 + } 111 + } 112 + ``` 113 + 114 + ### 3. Update schema union refs 115 + 116 + In sampleSchema main record: 117 + 118 + ```json 119 + { 120 + "schema": { 121 + "type": "union", 122 + "refs": [ 123 + "ac.foundation.dataset.sampleSchema#jsonSchemaFormat", 124 + "ac.foundation.dataset.sampleSchema#avroFormat" 125 + ], 126 + "closed": false 127 + } 128 + } 129 + ``` 130 + 131 + ## Current Schema Types 132 + 133 + | Token Def | knownValue | Format Def | Description | 134 + |-----------|------------|------------|-------------| 135 + | `#jsonSchema` | `"jsonSchema"` | `#jsonSchemaFormat` | JSON Schema Draft 7 | 136 + 137 + ## Design Rationale 138 + 139 + This pattern provides: 140 + 141 + 1. **Centralized Registry**: Single Lexicon (`schemaType`) lists all supported types 142 + 2. **Type Safety**: Token defs provide canonical documentation for each schema type 143 + 3. **Extensibility**: New types added to `knownValues` + token defs without breaking changes 144 + 4. **Validation**: Refs ensure schemaType values are validated against known types 145 + 5. **Discoverability**: Query `ac.foundation.dataset.schemaType` to see all supported types 146 + 147 + ## References 148 + 149 + - [ATProto Lexicon Token Type](https://atproto.com/guides/lexicon) 150 + - [ATProto Lexicon Spec](.reference/atproto_lexicon_spec.md)
+16
.planning/lexicons/ac.foundation.dataset.arrayFormat.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.arrayFormat", 4 + "defs": { 5 + "main": { 6 + "type": "string", 7 + "description": "Array serialization format identifier for NDArray fields in sample schemas. Known values correspond to token definitions in this Lexicon. Each format has versioned specifications maintained by foundation.ac at canonical URLs.", 8 + "knownValues": ["ndarrayBytes"], 9 + "maxLength": 50 10 + }, 11 + "ndarrayBytes": { 12 + "type": "token", 13 + "description": "Numpy .npy binary format for NDArray serialization. Stores arrays with dtype and shape in binary header. Versions maintained at https://foundation.ac/schemas/atdata-ndarray-bytes/{version}/" 14 + } 15 + } 16 + }
+78
.planning/lexicons/ac.foundation.dataset.getLatestSchema.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.getLatestSchema", 4 + "defs": { 5 + "main": { 6 + "type": "query", 7 + "description": "Get the latest version of a sample schema by its permanent NSID identifier", 8 + "parameters": { 9 + "type": "params", 10 + "required": [ 11 + "schemaId" 12 + ], 13 + "properties": { 14 + "schemaId": { 15 + "type": "string", 16 + "description": "The permanent NSID identifier for the schema (the {NSID} part of the rkey {NSID}@{semver})", 17 + "maxLength": 500 18 + } 19 + } 20 + }, 21 + "output": { 22 + "encoding": "application/json", 23 + "schema": { 24 + "type": "object", 25 + "required": [ 26 + "uri", 27 + "version", 28 + "record" 29 + ], 30 + "properties": { 31 + "uri": { 32 + "type": "string", 33 + "description": "AT-URI of the latest schema version", 34 + "maxLength": 500 35 + }, 36 + "version": { 37 + "type": "string", 38 + "description": "Semantic version of the latest schema", 39 + "maxLength": 20 40 + }, 41 + "record": { 42 + "type": "ref", 43 + "ref": "ac.foundation.dataset.sampleSchema", 44 + "description": "The full schema record" 45 + }, 46 + "allVersions": { 47 + "type": "array", 48 + "description": "All available versions (optional, sorted by semver descending)", 49 + "items": { 50 + "type": "object", 51 + "required": [ 52 + "uri", 53 + "version" 54 + ], 55 + "properties": { 56 + "uri": { 57 + "type": "string", 58 + "maxLength": 500 59 + }, 60 + "version": { 61 + "type": "string", 62 + "maxLength": 20 63 + } 64 + } 65 + } 66 + } 67 + } 68 + } 69 + }, 70 + "errors": [ 71 + { 72 + "name": "SchemaNotFound", 73 + "description": "No schema found with the given NSID" 74 + } 75 + ] 76 + } 77 + } 78 + }
+99
.planning/lexicons/ac.foundation.dataset.lens.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.lens", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Bidirectional transformation (Lens) between two sample types, with code stored in external repositories", 8 + "key": "tid", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "sourceSchema", 14 + "targetSchema", 15 + "getterCode", 16 + "putterCode", 17 + "createdAt" 18 + ], 19 + "properties": { 20 + "name": { 21 + "type": "string", 22 + "description": "Human-readable lens name", 23 + "maxLength": 100 24 + }, 25 + "sourceSchema": { 26 + "type": "string", 27 + "description": "AT-URI reference to source sampleSchema", 28 + "maxLength": 500 29 + }, 30 + "targetSchema": { 31 + "type": "string", 32 + "description": "AT-URI reference to target sampleSchema", 33 + "maxLength": 500 34 + }, 35 + "description": { 36 + "type": "string", 37 + "description": "What this transformation does", 38 + "maxLength": 1000 39 + }, 40 + "getterCode": { 41 + "type": "ref", 42 + "ref": "#codeReference", 43 + "description": "Code reference for getter function (Source -> Target)" 44 + }, 45 + "putterCode": { 46 + "type": "ref", 47 + "ref": "#codeReference", 48 + "description": "Code reference for putter function (Target, Source -> Source)" 49 + }, 50 + "language": { 51 + "type": "string", 52 + "description": "Programming language of the lens implementation (e.g., 'python', 'typescript')", 53 + "maxLength": 50 54 + }, 55 + "metadata": { 56 + "type": "object", 57 + "description": "Arbitrary metadata (author, performance notes, etc.)" 58 + }, 59 + "createdAt": { 60 + "type": "string", 61 + "format": "datetime", 62 + "description": "Timestamp when this lens was created" 63 + } 64 + } 65 + } 66 + }, 67 + "codeReference": { 68 + "type": "object", 69 + "description": "Reference to code in an external repository (GitHub, tangled.org, etc.)", 70 + "required": [ 71 + "repository", 72 + "commit", 73 + "path" 74 + ], 75 + "properties": { 76 + "repository": { 77 + "type": "string", 78 + "description": "Repository URL (e.g., 'https://github.com/user/repo' or 'at://did/tangled.repo/...')", 79 + "maxLength": 500 80 + }, 81 + "commit": { 82 + "type": "string", 83 + "description": "Git commit hash (ensures immutability)", 84 + "maxLength": 40 85 + }, 86 + "path": { 87 + "type": "string", 88 + "description": "Path to function within repository (e.g., 'lenses/vision.py:rgb_to_grayscale')", 89 + "maxLength": 500 90 + }, 91 + "branch": { 92 + "type": "string", 93 + "description": "Optional branch name (for reference, commit hash is authoritative)", 94 + "maxLength": 100 95 + } 96 + } 97 + } 98 + } 99 + }
+96
.planning/lexicons/ac.foundation.dataset.record.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.record", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Index record for a WebDataset-backed dataset with references to storage location and sample schema", 8 + "key": "tid", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "schemaRef", 14 + "storage", 15 + "createdAt" 16 + ], 17 + "properties": { 18 + "name": { 19 + "type": "string", 20 + "description": "Human-readable dataset name", 21 + "maxLength": 200 22 + }, 23 + "schemaRef": { 24 + "type": "string", 25 + "format": "at-uri", 26 + "description": "AT-URI reference to the sampleSchema record for this dataset's samples", 27 + "maxLength": 500 28 + }, 29 + "storage": { 30 + "type": "union", 31 + "description": "Storage location for dataset files (WebDataset tar archives)", 32 + "refs": [ 33 + "ac.foundation.dataset.storageExternal", 34 + "ac.foundation.dataset.storageBlobs" 35 + ] 36 + }, 37 + "description": { 38 + "type": "string", 39 + "description": "Human-readable description of the dataset", 40 + "maxLength": 5000 41 + }, 42 + "metadata": { 43 + "type": "bytes", 44 + "description": "Msgpack-encoded metadata dict for arbitrary extended key-value pairs. Use this for additional metadata beyond the core top-level fields (license, tags, size). Top-level fields are preferred for discoverable/searchable metadata.", 45 + "maxLength": 100000 46 + }, 47 + "tags": { 48 + "type": "array", 49 + "description": "Searchable tags for dataset discovery. Aligns with Schema.org keywords property.", 50 + "items": { 51 + "type": "string", 52 + "maxLength": 150 53 + }, 54 + "maxLength": 30 55 + }, 56 + "size": { 57 + "type": "ref", 58 + "ref": "#datasetSize", 59 + "description": "Dataset size information (optional)" 60 + }, 61 + "license": { 62 + "type": "string", 63 + "description": "License identifier or URL. SPDX identifiers recommended (e.g., MIT, Apache-2.0, CC-BY-4.0) or full SPDX URLs (e.g., http://spdx.org/licenses/MIT). Aligns with Schema.org license property.", 64 + "maxLength": 200 65 + }, 66 + "createdAt": { 67 + "type": "string", 68 + "format": "datetime", 69 + "description": "Timestamp when this dataset record was created" 70 + } 71 + } 72 + } 73 + }, 74 + "datasetSize": { 75 + "type": "object", 76 + "description": "Information about dataset size", 77 + "properties": { 78 + "samples": { 79 + "type": "integer", 80 + "description": "Total number of samples in the dataset", 81 + "minimum": 0 82 + }, 83 + "bytes": { 84 + "type": "integer", 85 + "description": "Total size in bytes", 86 + "minimum": 0 87 + }, 88 + "shards": { 89 + "type": "integer", 90 + "description": "Number of WebDataset shards", 91 + "minimum": 1 92 + } 93 + } 94 + } 95 + } 96 + }
+107
.planning/lexicons/ac.foundation.dataset.sampleSchema.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.sampleSchema", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Definition of a PackableSample-compatible sample type. Supports versioning via rkey format: {NSID}@{semver}. Schema format is extensible via union type.", 8 + "key": "any", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "version", 14 + "schemaType", 15 + "schema", 16 + "createdAt" 17 + ], 18 + "properties": { 19 + "name": { 20 + "type": "string", 21 + "description": "Human-readable display name for this sample type. Used for documentation and UI. The NSID in the record URI provides unique identification; name collisions across NSIDs are acceptable.", 22 + "maxLength": 100 23 + }, 24 + "version": { 25 + "type": "string", 26 + "description": "Semantic version (e.g., '1.0.0')", 27 + "pattern": "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$", 28 + "maxLength": 100 29 + }, 30 + "schemaType": { 31 + "type": "ref", 32 + "ref": "ac.foundation.dataset.schemaType", 33 + "description": "Type of schema definition. This field indicates which union member is present in the schema field." 34 + }, 35 + "schema": { 36 + "type": "union", 37 + "refs": ["ac.foundation.dataset.sampleSchema#jsonSchemaFormat"], 38 + "closed": false, 39 + "description": "Schema definition for this sample type. Currently supports JSON Schema Draft 7. Union allows for future schema formats (Avro, Protobuf, etc.) without breaking changes." 40 + }, 41 + "description": { 42 + "type": "string", 43 + "description": "Human-readable description of what this sample type represents", 44 + "maxLength": 5000 45 + }, 46 + "metadata": { 47 + "type": "object", 48 + "description": "Optional metadata about this schema. Common fields include license and tags, but any additional fields are permitted.", 49 + "maxProperties": 50, 50 + "properties": { 51 + "license": { 52 + "type": "string", 53 + "description": "License identifier or URL. SPDX identifiers recommended (e.g., MIT, Apache-2.0, CC-BY-4.0) or full SPDX URLs (e.g., http://spdx.org/licenses/MIT). Aligns with Schema.org license property.", 54 + "maxLength": 200 55 + }, 56 + "tags": { 57 + "type": "array", 58 + "description": "Categorization keywords for discovery. Aligns with Schema.org keywords property.", 59 + "items": { 60 + "type": "string", 61 + "maxLength": 150 62 + }, 63 + "maxLength": 30 64 + } 65 + } 66 + }, 67 + "createdAt": { 68 + "type": "string", 69 + "format": "datetime", 70 + "description": "Timestamp when this schema version was created. Immutable once set (ATProto records are permanent)." 71 + } 72 + } 73 + } 74 + }, 75 + "jsonSchemaFormat": { 76 + "type": "object", 77 + "description": "JSON Schema Draft 7 format for sample type definitions. Used with NDArray shim for array types.", 78 + "required": ["$type", "$schema", "type", "properties"], 79 + "properties": { 80 + "$type": { 81 + "type": "string", 82 + "const": "ac.foundation.dataset.sampleSchema#jsonSchemaFormat" 83 + }, 84 + "$schema": { 85 + "type": "string", 86 + "const": "http://json-schema.org/draft-07/schema#", 87 + "description": "JSON Schema version identifier" 88 + }, 89 + "type": { 90 + "type": "string", 91 + "const": "object", 92 + "description": "Sample types must be objects" 93 + }, 94 + "properties": { 95 + "type": "object", 96 + "description": "Field definitions for the sample type", 97 + "minProperties": 1 98 + }, 99 + "arrayFormatVersions": { 100 + "type": "object", 101 + "description": "Mapping from array format identifiers to semantic versions. Keys are ac.foundation.dataset.arrayFormat values (e.g., 'ndarrayBytes'), values are semver strings (e.g., '1.0.0'). Foundation.ac maintains canonical shim schemas at https://foundation.ac/schemas/atdata-{format}-bytes/{version}/.", 102 + "maxProperties": 10 103 + } 104 + } 105 + } 106 + } 107 + }
+16
.planning/lexicons/ac.foundation.dataset.schemaType.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.schemaType", 4 + "defs": { 5 + "main": { 6 + "type": "string", 7 + "description": "Schema type identifier for atdata sample definitions. Known values correspond to token definitions in this Lexicon. New schema types can be added as tokens without breaking changes.", 8 + "knownValues": ["jsonSchema"], 9 + "maxLength": 50 10 + }, 11 + "jsonSchema": { 12 + "type": "token", 13 + "description": "JSON Schema Draft 7 format for sample type definitions. When schemaType is 'jsonSchema', the schema field must contain an object conforming to ac.foundation.dataset.sampleSchema#jsonSchemaFormat." 14 + } 15 + } 16 + }
+24
.planning/lexicons/ac.foundation.dataset.storageBlobs.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.storageBlobs", 4 + "defs": { 5 + "main": { 6 + "type": "object", 7 + "description": "Storage via ATProto PDS blobs for WebDataset tar archives. Each blob contains one or more tar files. Used in ac.foundation.dataset.record storage union for maximum decentralization.", 8 + "required": [ 9 + "blobs" 10 + ], 11 + "properties": { 12 + "blobs": { 13 + "type": "array", 14 + "description": "Array of blob references for WebDataset tar files", 15 + "items": { 16 + "type": "blob", 17 + "description": "Blob reference to a WebDataset tar archive" 18 + }, 19 + "minLength": 1 20 + } 21 + } 22 + } 23 + } 24 + }
+25
.planning/lexicons/ac.foundation.dataset.storageExternal.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.storageExternal", 4 + "defs": { 5 + "main": { 6 + "type": "object", 7 + "description": "External storage via URLs (S3, HTTP, IPFS, etc.) for WebDataset tar archives. URLs support brace notation for sharding (e.g., 'data-{000000..000099}.tar'). Used in ac.foundation.dataset.record storage union.", 8 + "required": [ 9 + "urls" 10 + ], 11 + "properties": { 12 + "urls": { 13 + "type": "array", 14 + "description": "WebDataset URLs with optional brace notation for sharded tar files", 15 + "items": { 16 + "type": "string", 17 + "format": "uri", 18 + "maxLength": 1000 19 + }, 20 + "minLength": 1 21 + } 22 + } 23 + } 24 + } 25 + }
+16
.planning/lexicons/ndarray_shim.json
··· 1 + { 2 + "$schema": "http://json-schema.org/draft-07/schema#", 3 + "$id": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0", 4 + "title": "ATDataNDArrayBytes", 5 + "description": "Standard definition for numpy NDArray types in JSON Schema, compatible with atdata WebDataset serialization. This type's contents are interpreted as containing the raw bytes data for a serialized numpy NDArray, and serve as a marker for atdata-based code generation to use standard numpy types, rather than generated dataclasses.", 6 + "version": "1.0.0", 7 + "$defs": { 8 + "ndarray": { 9 + "type": "string", 10 + "format": "byte", 11 + "description": "Numpy array serialized using numpy `.npy` format via `np.save` (includes dtype and shape in binary header). When represented in JSON, this is a base64-encoded string. In msgpack, this is raw bytes.", 12 + "contentEncoding": "base64", 13 + "contentMediaType": "application/octet-stream" 14 + } 15 + } 16 + }
+386
.planning/ndarray_shim_spec.md
··· 1 + # NDArray JSON Schema Shim Specification 2 + 3 + **Issue**: #52 4 + **Version**: 1.0 5 + **Status**: Draft 6 + 7 + ## Problem Statement 8 + 9 + We need a standard way to represent numpy NDArray types in JSON Schema that: 10 + 1. Works with existing atdata msgpack serialization (numpy `.npy` format) 11 + 2. Can be validated (where practical) 12 + 3. Can be used for code generation 13 + 4. Is compatible with JSON Schema tooling 14 + 5. Preserves dtype and shape information 15 + 16 + ## Current Serialization Format 17 + 18 + atdata uses `_helpers.array_to_bytes()` which serializes arrays using numpy's native `.npy` format: 19 + 20 + ```python 21 + def array_to_bytes(x: np.ndarray) -> bytes: 22 + np_bytes = BytesIO() 23 + np.save(np_bytes, x, allow_pickle=True) 24 + return np_bytes.getvalue() 25 + ``` 26 + 27 + **Result**: A bytes object containing: 28 + - Magic bytes (`\x93NUMPY`) 29 + - Version info 30 + - Header with dtype and shape 31 + - Array data 32 + 33 + **Key insight**: The .npy format is self-describing - dtype and shape are already in the bytes! 34 + 35 + ## Design Approach 36 + 37 + ### Option 1: Pure Metadata (REJECTED) 38 + 39 + Describe the semantic array only: 40 + ```json 41 + { 42 + "type": "object", 43 + "x-atdata-ndarray": true, 44 + "x-dtype": "uint8", 45 + "x-shape": [null, null, 3] 46 + } 47 + ``` 48 + 49 + **Problem**: Doesn't match actual msgpack structure (which stores bytes, not objects) 50 + 51 + ### Option 2: Bytes with Extension Properties (REJECTED) 52 + 53 + Describe the bytes with metadata: 54 + ```json 55 + { 56 + "type": "string", 57 + "format": "byte", 58 + "x-dtype": "uint8", 59 + "x-shape": [null, null, 3] 60 + } 61 + ``` 62 + 63 + **Problem**: 64 + - Non-standard use of extension properties 65 + - JSON Schema doesn't know how to validate these 66 + - Codegen tools won't understand x- properties 67 + 68 + ### Option 3: Reusable Schema Definition (RECOMMENDED) 69 + 70 + Create a standard NDArray schema definition that can be $ref'd, with controlled vocabulary for metadata. 71 + 72 + ## Recommended Specification 73 + 74 + ### Base NDArray Schema Definition 75 + 76 + This should be included in every JSON Schema that uses NDArray types: 77 + 78 + ```json 79 + { 80 + "$schema": "http://json-schema.org/draft-07/schema#", 81 + "$defs": { 82 + "ndarray": { 83 + "type": "string", 84 + "format": "byte", 85 + "description": "Numpy array serialized using numpy .npy format (includes dtype and shape in binary header)", 86 + "contentEncoding": "base64", 87 + "contentMediaType": "application/octet-stream" 88 + } 89 + } 90 + } 91 + ``` 92 + 93 + ### Using NDArray in Properties 94 + 95 + Properties that are NDArray types reference the base definition and add metadata as **sibling properties**: 96 + 97 + ```json 98 + { 99 + "properties": { 100 + "image": { 101 + "$ref": "#/$defs/ndarray", 102 + "description": "RGB image with variable height/width", 103 + "x-atdata-dtype": "uint8", 104 + "x-atdata-shape": [null, null, 3] 105 + } 106 + } 107 + } 108 + ``` 109 + 110 + ### Metadata Convention 111 + 112 + **Extension properties** (prefixed with `x-atdata-`): 113 + - `x-atdata-dtype`: Numpy dtype string (e.g., "uint8", "float32", "int64") 114 + - `x-atdata-shape`: Array of integers and null (null = dynamic dimension) 115 + - `x-atdata-notes`: Optional human-readable notes about the array 116 + 117 + **Standard JSON Schema properties** (used normally): 118 + - `description`: Human-readable description of what the array represents 119 + - `title`: Short name for the field 120 + 121 + ## Complete Example 122 + 123 + ```json 124 + { 125 + "$schema": "http://json-schema.org/draft-07/schema#", 126 + "title": "ImageSample", 127 + "type": "object", 128 + "required": ["image", "label"], 129 + "properties": { 130 + "image": { 131 + "$ref": "#/$defs/ndarray", 132 + "description": "RGB image with variable height/width", 133 + "x-atdata-dtype": "uint8", 134 + "x-atdata-shape": [null, null, 3], 135 + "x-atdata-notes": "Images must have 3 color channels (RGB)" 136 + }, 137 + "depth_map": { 138 + "$ref": "#/$defs/ndarray", 139 + "description": "Depth map corresponding to the image", 140 + "x-atdata-dtype": "float32", 141 + "x-atdata-shape": [null, null], 142 + "x-atdata-notes": "Same height and width as image, but single channel" 143 + }, 144 + "label": { 145 + "type": "string", 146 + "description": "Human-readable label" 147 + } 148 + }, 149 + "$defs": { 150 + "ndarray": { 151 + "type": "string", 152 + "format": "byte", 153 + "description": "Numpy array serialized using numpy .npy format", 154 + "contentEncoding": "base64", 155 + "contentMediaType": "application/octet-stream" 156 + } 157 + } 158 + } 159 + ``` 160 + 161 + ## Rationale 162 + 163 + ### Why `type: "string", format: "byte"`? 164 + 165 + In msgpack serialization: 166 + - The NDArray field is stored as raw bytes (the .npy format) 167 + - When represented in JSON (for validation/transport), bytes become base64 strings 168 + - JSON Schema's `type: "string", format: "byte"` is the standard way to represent binary data 169 + 170 + ### Why extension properties (`x-atdata-*`)? 171 + 172 + JSON Schema allows custom properties starting with `x-`. Benefits: 173 + 1. **Standard**: Well-established convention in JSON Schema ecosystem 174 + 2. **Ignored by validators**: Won't cause validation errors 175 + 3. **Accessible to codegen**: Tools can parse these for type generation 176 + 4. **Self-documenting**: Clear what they mean 177 + 178 + ### Why not validate dtype/shape at JSON Schema level? 179 + 180 + **Technical limitation**: JSON Schema can't validate binary .npy format internals. 181 + 182 + **Solution**: Validation happens at **deserialization time**: 183 + 1. JSON Schema validates overall structure (field is present, is bytes) 184 + 2. When bytes are deserialized to NDArray, check dtype/shape match expectations 185 + 186 + ## Usage in atdata 187 + 188 + ### Publishing Schemas 189 + 190 + When publishing a PackableSample with NDArray fields: 191 + 192 + ```python 193 + @atdata.packable 194 + class ImageSample: 195 + image: NDArray # Will be annotated with dtype/shape hints 196 + label: str 197 + 198 + # SDK extracts type hints and generates JSON Schema 199 + schema_json = { 200 + "properties": { 201 + "image": { 202 + "$ref": "#/$defs/ndarray", 203 + "x-atdata-dtype": "uint8", # From annotation or default 204 + "x-atdata-shape": [null, null, 3] # From annotation or None 205 + } 206 + } 207 + } 208 + ``` 209 + 210 + ### Type Annotations for NDArray 211 + 212 + Python type hints to specify dtype/shape: 213 + 214 + ```python 215 + from typing import Annotated 216 + from numpy.typing import NDArray 217 + 218 + # Option 1: Generic NDArray (dtype/shape inferred or not specified) 219 + image: NDArray 220 + 221 + # Option 2: With dtype (using numpy typing) 222 + image: NDArray[np.uint8] 223 + 224 + # Option 3: With full metadata (using Annotated) 225 + image: Annotated[NDArray[np.uint8], {"shape": [None, None, 3]}] 226 + ``` 227 + 228 + ### Code Generation 229 + 230 + Codegen reads JSON Schema and produces: 231 + 232 + ```python 233 + @atdata.packable 234 + class ImageSample: 235 + image: NDArray # uint8, shape: [*, *, 3] 236 + label: str 237 + ``` 238 + 239 + Comment indicates dtype/shape from `x-atdata-*` properties. 240 + 241 + ## Validation Strategy 242 + 243 + ### JSON Schema Level (Structural) 244 + ✅ Validate field is present (if required) 245 + ✅ Validate field is bytes/string (in JSON) 246 + ✅ Validate base64 encoding (if in JSON) 247 + 248 + ### Deserialization Level (Semantic) 249 + ✅ Validate .npy format is valid 250 + ✅ Validate dtype matches expected (if specified) 251 + ✅ Validate shape matches expected (if specified) 252 + ✅ Validate shape constraints (e.g., must be 3D) 253 + 254 + ### Implementation 255 + 256 + ```python 257 + from atdata.validation import validate_ndarray 258 + 259 + def validate_ndarray( 260 + array: np.ndarray, 261 + expected_dtype: Optional[str] = None, 262 + expected_shape: Optional[List[Optional[int]]] = None 263 + ) -> tuple[bool, list[str]]: 264 + """Validate array against expectations.""" 265 + errors = [] 266 + 267 + # Check dtype 268 + if expected_dtype and str(array.dtype) != expected_dtype: 269 + errors.append(f"Expected dtype {expected_dtype}, got {array.dtype}") 270 + 271 + # Check shape 272 + if expected_shape: 273 + if len(array.shape) != len(expected_shape): 274 + errors.append(f"Expected {len(expected_shape)}D array, got {len(array.shape)}D") 275 + for i, (actual, expected) in enumerate(zip(array.shape, expected_shape)): 276 + if expected is not None and actual != expected: 277 + errors.append(f"Dimension {i}: expected {expected}, got {actual}") 278 + 279 + return len(errors) == 0, errors 280 + ``` 281 + 282 + ## Standard NDArray Shim URI 283 + 284 + The NDArray shim definition should be published at a canonical URI: 285 + 286 + **Proposed**: `at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0` 287 + 288 + This allows schemas to reference a standard definition: 289 + 290 + ```json 291 + { 292 + "properties": { 293 + "image": { 294 + "$ref": "at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0#/$defs/ndarray", 295 + "x-atdata-dtype": "uint8" 296 + } 297 + } 298 + } 299 + ``` 300 + 301 + Or schemas can inline the definition (recommended for Phase 1): 302 + 303 + ```json 304 + { 305 + "$defs": { 306 + "ndarray": { /* inline definition */ } 307 + } 308 + } 309 + ``` 310 + 311 + ## Alternative: Describe Deserialized Structure 312 + 313 + For reference, an alternative approach that describes the "unpacked" structure: 314 + 315 + ```json 316 + { 317 + "$defs": { 318 + "ndarray": { 319 + "type": "object", 320 + "description": "Numpy array (deserialized representation)", 321 + "required": ["dtype", "shape", "data"], 322 + "properties": { 323 + "dtype": {"type": "string"}, 324 + "shape": {"type": "array", "items": {"type": "integer"}}, 325 + "data": {"type": "string", "format": "byte"} 326 + } 327 + } 328 + } 329 + } 330 + ``` 331 + 332 + **Problem**: This doesn't match the actual msgpack structure (which is just bytes, not an object with dtype/shape/data fields). The .npy format is opaque bytes, not a structured object. 333 + 334 + **Conclusion**: Stick with the recommended approach (bytes with metadata). 335 + 336 + ## Implementation Checklist 337 + 338 + - [ ] Update sampleSchema Lexicon to reference this spec 339 + - [ ] Create standard NDArray shim definition 340 + - [ ] Update schema examples to use the shim correctly 341 + - [ ] Implement validation helpers in Python SDK 342 + - [ ] Add type annotation support for dtype/shape hints 343 + - [ ] Update codegen to read x-atdata-* properties 344 + - [ ] Document in user-facing docs 345 + 346 + ## Open Questions 347 + 348 + 1. **Should we support other array libraries?** (PyTorch tensors, JAX arrays, etc.) 349 + - Could use `x-atdata-array-type: "numpy"|"torch"|"jax"` 350 + - Recommendation: NumPy only for Phase 1 351 + 352 + 2. **Should shape constraints be enforced at runtime?** 353 + - Pro: Catch errors early 354 + - Con: Performance overhead, flexibility lost 355 + - Recommendation: Optional validation, off by default 356 + 357 + 3. **Should we support sparse arrays?** 358 + - scipy.sparse has different serialization format 359 + - Recommendation: Defer to future 360 + 361 + 4. **What about array of arrays?** (ragged arrays) 362 + - Can be represented as Python lists of NDArrays 363 + - Recommendation: Not a priority for Phase 1 364 + 365 + ## Summary 366 + 367 + **Recommended Approach**: 368 + - NDArray fields represented as `{"$ref": "#/$defs/ndarray"}` (bytes) 369 + - Dtype and shape specified via `x-atdata-dtype` and `x-atdata-shape` 370 + - Standard `ndarray` definition inlined in every schema 371 + - Validation happens at deserialization, not JSON Schema level 372 + - Codegen reads extension properties to generate proper types 373 + 374 + **Benefits**: 375 + - ✅ Compatible with existing msgpack serialization 376 + - ✅ Works with JSON Schema tooling 377 + - ✅ Clear metadata for codegen 378 + - ✅ Flexible (dtype/shape optional) 379 + - ✅ Extensible (can add more x-atdata-* properties) 380 + 381 + **Trade-offs**: 382 + - ⚠️ Leaky abstraction (JSON Schema describes bytes, not semantic array) 383 + - ⚠️ Validation split across two layers 384 + - ⚠️ Extension properties not universally understood 385 + 386 + **Grade**: B+ (Good practical solution)
+336
.reference/atproto_lexicon_guide.md
··· 1 + # AT Protocol Lexicon Guide 2 + 3 + > **Source**: [AT Protocol Lexicon Documentation](https://atproto.com/guides/lexicon) 4 + 5 + ## Overview 6 + 7 + Lexicon is a JSON-based schema language that defines RPC methods and record types for AT Protocol. It enables interoperability by establishing agreed-upon behaviors and semantics across the open network. 8 + 9 + ## Key Concepts 10 + 11 + ### NSIDs (Namespaced Identifiers) 12 + 13 + Schemas use reverse-DNS format identifiers indicating ownership: 14 + 15 + ``` 16 + com.atproto.repo.createRecord # Core ATProto API 17 + app.bsky.feed.post # Bluesky app record type 18 + ac.foundation.dataset.sampleSchema # Our custom namespace 19 + ``` 20 + 21 + Format: `authority.name` where authority is reverse-DNS 22 + 23 + ### Why Not RDF? 24 + 25 + Lexicon prioritizes: 26 + - Schema enforcement (not optional metadata) 27 + - Code generation with types and validation 28 + - Practical developer experience 29 + 30 + ## Schema Types 31 + 32 + ### 1. Record Types 33 + 34 + Define the structure of data stored in repositories: 35 + 36 + ```json 37 + { 38 + "lexicon": 1, 39 + "id": "com.example.follow", 40 + "defs": { 41 + "main": { 42 + "type": "record", 43 + "key": "tid", 44 + "record": { 45 + "type": "object", 46 + "required": ["subject", "createdAt"], 47 + "properties": { 48 + "subject": { "type": "string", "format": "did" }, 49 + "createdAt": { "type": "string", "format": "datetime" } 50 + } 51 + } 52 + } 53 + } 54 + } 55 + ``` 56 + 57 + Records stored in repos have a `$type` field mapping to their schema. 58 + 59 + ### 2. Query Methods 60 + 61 + Define HTTP GET endpoints: 62 + 63 + ```json 64 + { 65 + "lexicon": 1, 66 + "id": "com.example.getProfile", 67 + "defs": { 68 + "main": { 69 + "type": "query", 70 + "parameters": { 71 + "type": "params", 72 + "required": ["actor"], 73 + "properties": { 74 + "actor": { "type": "string", "format": "at-identifier" } 75 + } 76 + }, 77 + "output": { 78 + "encoding": "application/json", 79 + "schema": { "$ref": "#/defs/profileView" } 80 + } 81 + } 82 + } 83 + } 84 + ``` 85 + 86 + Maps to: `GET /xrpc/com.example.getProfile?actor=...` 87 + 88 + ### 3. Procedure Methods 89 + 90 + Define HTTP POST endpoints: 91 + 92 + ```json 93 + { 94 + "lexicon": 1, 95 + "id": "com.example.updateProfile", 96 + "defs": { 97 + "main": { 98 + "type": "procedure", 99 + "input": { 100 + "encoding": "application/json", 101 + "schema": { ... } 102 + }, 103 + "output": { 104 + "encoding": "application/json", 105 + "schema": { ... } 106 + } 107 + } 108 + } 109 + } 110 + ``` 111 + 112 + ### 4. Tokens 113 + 114 + Declare reusable global identifiers for extensible enums: 115 + 116 + ```json 117 + { 118 + "lexicon": 1, 119 + "id": "com.example.status.active", 120 + "defs": { 121 + "main": { 122 + "type": "token", 123 + "description": "User is active" 124 + } 125 + } 126 + } 127 + ``` 128 + 129 + Instead of hardcoding enum values, use tokens. Teams can add values without collisions. 130 + 131 + ## Field Types 132 + 133 + ### Primitives 134 + 135 + | Type | Description | 136 + |------|-------------| 137 + | `string` | Text, with optional format/length constraints | 138 + | `integer` | Whole numbers | 139 + | `boolean` | true/false | 140 + | `bytes` | Binary data (base64 encoded in JSON) | 141 + | `cid-link` | Content identifier reference | 142 + | `unknown` | Any JSON value | 143 + 144 + ### String Formats 145 + 146 + | Format | Description | 147 + |--------|-------------| 148 + | `at-uri` | AT Protocol URI | 149 + | `at-identifier` | Handle or DID | 150 + | `did` | Decentralized identifier | 151 + | `handle` | User handle | 152 + | `datetime` | ISO 8601 timestamp | 153 + | `uri` | Generic URI | 154 + | `language` | BCP 47 language tag | 155 + 156 + ### Complex Types 157 + 158 + ```json 159 + // Object 160 + { 161 + "type": "object", 162 + "required": ["field1"], 163 + "properties": { 164 + "field1": { "type": "string" }, 165 + "field2": { "type": "integer" } 166 + } 167 + } 168 + 169 + // Array 170 + { 171 + "type": "array", 172 + "items": { "type": "string" }, 173 + "maxLength": 100 174 + } 175 + 176 + // Union (discriminated) 177 + { 178 + "type": "union", 179 + "refs": [ 180 + "#defs/typeA", 181 + "#defs/typeB" 182 + ] 183 + } 184 + 185 + // Reference to another schema 186 + { 187 + "type": "ref", 188 + "ref": "com.example.otherSchema#defs/thing" 189 + } 190 + ``` 191 + 192 + ### Blob References 193 + 194 + For binary data stored separately: 195 + 196 + ```json 197 + { 198 + "type": "blob", 199 + "accept": ["image/jpeg", "image/png"], 200 + "maxSize": 1000000 201 + } 202 + ``` 203 + 204 + ## Versioning Rules 205 + 206 + **Published schemas are immutable regarding constraints.** 207 + 208 + - Loosening constraints breaks old software validation 209 + - Tightening constraints breaks new software validation 210 + - Only **optional** constraints may be added to previously unconstrained fields 211 + - Major changes require **new NSIDs** 212 + 213 + ## Schema Distribution 214 + 215 + Schemas should be published as machine-readable, network-accessible resources: 216 + 217 + 1. Host at well-known URL: `https://authority.com/.well-known/lexicons/` 218 + 2. Or embed in documentation 219 + 3. Ensure canonical representation exists for consumers 220 + 221 + ## Record Keys (rkeys) 222 + 223 + Records in collections are identified by keys: 224 + 225 + | Key Type | Description | 226 + |----------|-------------| 227 + | `tid` | Timestamp-based ID (sortable, unique) | 228 + | `literal:self` | Singleton record (e.g., profile) | 229 + | `any` | Any valid string | 230 + 231 + TID format: 13-character base32-sortable timestamp 232 + 233 + ## Example: Complete Lexicon 234 + 235 + ```json 236 + { 237 + "lexicon": 1, 238 + "id": "ac.foundation.dataset.sampleSchema", 239 + "revision": 1, 240 + "description": "Schema definition for a PackableSample type", 241 + "defs": { 242 + "main": { 243 + "type": "record", 244 + "key": "tid", 245 + "description": "A sample schema record", 246 + "record": { 247 + "type": "object", 248 + "required": ["name", "version", "fields"], 249 + "properties": { 250 + "name": { 251 + "type": "string", 252 + "description": "Human-readable schema name" 253 + }, 254 + "version": { 255 + "type": "string", 256 + "description": "Semantic version" 257 + }, 258 + "fields": { 259 + "type": "array", 260 + "items": { "type": "ref", "ref": "#defs/fieldDef" } 261 + }, 262 + "createdAt": { 263 + "type": "string", 264 + "format": "datetime" 265 + } 266 + } 267 + } 268 + }, 269 + "fieldDef": { 270 + "type": "object", 271 + "required": ["name", "fieldType"], 272 + "properties": { 273 + "name": { "type": "string" }, 274 + "fieldType": { "type": "ref", "ref": "#defs/fieldType" }, 275 + "optional": { "type": "boolean", "default": false } 276 + } 277 + }, 278 + "fieldType": { 279 + "type": "union", 280 + "refs": [ 281 + "#defs/primitiveType", 282 + "#defs/arrayType" 283 + ] 284 + }, 285 + "primitiveType": { 286 + "type": "object", 287 + "required": ["kind"], 288 + "properties": { 289 + "kind": { 290 + "type": "string", 291 + "knownValues": ["string", "int", "float", "bool", "bytes"] 292 + } 293 + } 294 + }, 295 + "arrayType": { 296 + "type": "object", 297 + "required": ["kind", "elementType"], 298 + "properties": { 299 + "kind": { "type": "string", "const": "ndarray" }, 300 + "elementType": { "type": "string" }, 301 + "shape": { 302 + "type": "array", 303 + "items": { "type": "integer" } 304 + } 305 + } 306 + } 307 + } 308 + } 309 + ``` 310 + 311 + ## XRPC (Cross-Server RPC) 312 + 313 + Lexicons map to HTTP endpoints: 314 + 315 + ``` 316 + com.example.getProfile() 317 + → GET /xrpc/com.example.getProfile 318 + 319 + com.example.createPost() 320 + → POST /xrpc/com.example.createPost 321 + ``` 322 + 323 + ## Validation Behavior 324 + 325 + The PDS can validate records against lexicons, but: 326 + 327 + 1. PDS is lexicon-agnostic by default 328 + 2. Validation can be disabled: `validate: false` 329 + 3. Unknown lexicons are stored without validation 330 + 4. Rate limits prevent abuse (not schema enforcement) 331 + 332 + ## Resources 333 + 334 + - **Lexicon Specification**: https://atproto.com/specs/lexicon 335 + - **Lexicon Guide**: https://atproto.com/guides/lexicon 336 + - **Bluesky Lexicons**: https://github.com/bluesky-social/atproto/tree/main/lexicons
+230
.reference/atproto_lexicon_spec.md
··· 1 + # Lexicon Specification - AT Protocol 2 + 3 + ## Overview 4 + 5 + "Lexicon is a schema definition language used to describe atproto records, HTTP endpoints (XRPC), and event stream messages." 6 + 7 + The language builds on the atproto Data Model and incorporates concepts similar to JSON Schema and OpenAPI, while adding protocol-specific features. This specification covers version 1 of the Lexicon language. 8 + 9 + ## Type Categories 10 + 11 + Lexicon types fall into several categories: 12 + 13 + **Concrete Types:** boolean, integer, string, bytes, cid-link, blob 14 + 15 + **Container Types:** array, object 16 + 17 + **Sub-types:** params, permission 18 + 19 + **Meta Types:** token, ref, union, unknown 20 + 21 + **Primary Types:** record, query, procedure, subscription, permission-set 22 + 23 + ## Lexicon Files 24 + 25 + Lexicon schemas are JSON files associated with a single NSID containing one or more definitions. Required file fields: 26 + 27 + - `lexicon` (integer): Language version, currently fixed at 1 28 + - `id` (string): The NSID identifier 29 + - `defs` (object): Named definitions with distinct keys 30 + - `description` (string, optional): Overview text 31 + 32 + "References to specific definitions within a Lexicon use fragment syntax, like `com.example.defs#someView`." 33 + 34 + ## Primary Type Definitions 35 + 36 + ### Record Type 37 + 38 + Specifies data objects stored in repositories. Type-specific fields: 39 + 40 + - `key` (string): Record key type specification 41 + - `record` (object): Schema with type object describing the record structure 42 + 43 + ### Query and Procedure (HTTP API) 44 + 45 + Describes XRPC endpoints. Fields: 46 + 47 + - `parameters`: Optional params schema for query parameters 48 + - `output`: Response body with encoding (MIME type) and optional schema 49 + - `input`: Request body (procedures only) 50 + - `errors`: Array of possible error codes with descriptions 51 + 52 + ### Subscription (Event Stream) 53 + 54 + Defines WebSocket endpoint messages. Fields: 55 + 56 + - `parameters`: Optional HTTP parameters 57 + - `message`: Required specification with union schema 58 + - `errors`: Optional error definitions 59 + 60 + "Subscription schemas must be a `union` of refs, not an `object` type." 61 + 62 + ### Permission Set 63 + 64 + Bundles permissions for OAuth scopes. Fields: 65 + 66 + - `title` / `title:langs`: Display name with localization 67 + - `detail` / `detail:langs`: Human-readable scope description 68 + - `permissions`: Array of permission definitions 69 + 70 + ## Field Type Definitions 71 + 72 + ### Primitive Types 73 + 74 + **boolean:** Optional `default` and `const` fields 75 + 76 + **integer:** Supports `minimum`, `maximum`, `enum`, `default`, `const` 77 + 78 + **string:** Supports `format`, `maxLength`, `minLength`, `maxGraphemes`, `minGraphemes`, `knownValues`, `enum`, `default`, `const` 79 + 80 + "Strings are Unicode. For non-Unicode encodings, use `bytes` instead." 81 + 82 + **bytes:** Raw binary data with optional `minLength` and `maxLength` 83 + 84 + **cid-link:** Content identifier links with no type-specific fields 85 + 86 + ### Container Types 87 + 88 + **array:** Contains `items` (required schema) and optional `minLength`/`maxLength` 89 + 90 + **object:** 91 + - `properties`: Named field schemas 92 + - `required`: Array of required field names 93 + - `nullable`: Array of fields accepting null values 94 + 95 + "There is a semantic difference in data between omitting a field; including the field with value `null`; and including the field with a falsy value." 96 + 97 + **blob:** Binary large objects with: 98 + - `accept`: MIME type restrictions (glob patterns supported) 99 + - `maxSize`: Maximum bytes 100 + 101 + ### Specialized Types 102 + 103 + **params:** Limited to HTTP query parameters, supporting only boolean, integer, string, or arrays of these types. Cannot be top-level named definitions. 104 + 105 + **permission:** Defines access permissions with `resource` field. Current resources: 106 + 107 + - `repo`: Repository write permissions with collection and optional action fields 108 + - `rpc`: Remote API calls with lxm (endpoints), aud (audience), and inheritAud fields 109 + 110 + "Permission declarations with unsupported resource types must be ignored by services implementing access control." 111 + 112 + **token:** Empty values referenced by name, used for symbolic enumerations. Cannot be used in refs, unions, or as object fields. 113 + 114 + ### Reference and Union Types 115 + 116 + **ref:** References another schema definition globally (by NSID) or locally (by fragment). Reduces schema duplication for reusable definitions. 117 + 118 + **union:** Declares multiple possible types at a location. Fields: 119 + 120 + - `refs`: Array of schema references 121 + - `closed`: Boolean indicating if type list is fixed (default: false) 122 + 123 + "Unions represent that multiple possible types could be present at this location in the schema." 124 + 125 + **unknown:** Accepts any data object with no specific validation, but must be a CBOR map. Data may contain optional `$type` field. 126 + 127 + ## String Formats 128 + 129 + Lexicon supports format-constrained strings: 130 + 131 + - `at-identifier`: Handle or DID 132 + - `at-uri`: AT-URI reference 133 + - `at-uri-regex`: "Lenient" version accepting unresolved at-identifier 134 + - `cid`: Content identifier 135 + - `datetime`: RFC 3339 timestamp 136 + - `did`: Decentralized identifier 137 + - `handle`: Handle identifier 138 + - `nsid`: Namespaced identifier 139 + - `tid`: Timestamp identifier 140 + - `record-key`: Record key syntax 141 + - `uri`: Generic URI (RFC 3986) 142 + - `language`: IETF language tag (BCP 47) 143 + 144 + ### Datetime Format 145 + 146 + Required elements: 147 + - Intersection of RFC 3339, ISO 8601, and WHATWG HTML standards 148 + - Uppercase T separator between date and time 149 + - Timezone specification (preferably Z for UTC) 150 + - Whole seconds precision (millisecond precision recommended) 151 + 152 + Valid example: `1985-04-12T23:20:50.123Z` 153 + 154 + Invalid: Missing timezone, lowercase t, insufficient precision, or invalid day/month values 155 + 156 + ### AT-URI Format 157 + 158 + "at-uri": Represents an AT-URI following the AT-URI scheme specification. Examples: 159 + - `at://did:plc:abc123/com.example.record/rkey123` 160 + - `at://alice.bsky.social/app.bsky.feed.post/3k4i5j6k` 161 + 162 + "at-uri-regex": "Lenient" version that accepts AT-URIs with unresolved at-identifiers. 163 + 164 + ### URI Format 165 + 166 + "uri": "Flexible to any URI schema, following the generic RFC-3986 on URIs." Supports did, https, wss, ipfs, dns, and at schemes. Maximum length is 8 KBytes. 167 + 168 + ### Language Format 169 + 170 + "language": "An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646." Examples include `ja` (Japanese) and `pt-BR` (Brazilian Portuguese). 171 + 172 + ## Validation Approach 173 + 174 + "For the various identifier formats, when doing Lexicon schema validation the most expansive identifier syntax format should be permitted." Application-level validation of specific identifier methods occurs separately from schema validation. 175 + 176 + ## When to Use `$type` 177 + 178 + Data objects sometimes require a `$type` field for disambiguation: 179 + 180 + - `record` objects: Always include `$type` 181 + - `union` variants: Always include `$type` (except top-level subscription messages) 182 + - `blob` objects: Always include `$type` 183 + 184 + "Main types must be referenced in `$type` fields as just the NSID, not including a `#main` suffix." 185 + 186 + ## Validation Options 187 + 188 + Three PDS validation approaches: 189 + 190 + 1. **Explicit validation:** Record must validate against known Lexicon; fails if unavailable 191 + 2. **No validation:** Record bypasses Lexicon validation (still validates data model rules) 192 + 3. **Optimistic validation (default):** Validates if Lexicon known; allows creation if unavailable 193 + 194 + ## Lexicon Evolution 195 + 196 + Compatibility rules for schema updates: 197 + 198 + - New fields must be optional 199 + - Non-optional fields cannot be removed 200 + - Field types cannot change 201 + - Fields cannot be renamed 202 + 203 + "If larger breaking changes are necessary, a new Lexicon name must be used." 204 + 205 + Lexicon publication occurs through atproto repositories using `com.atproto.lexicon.schema` record types, linked via DNS TXT records for authority resolution. 206 + 207 + ## Authority and Control 208 + 209 + NSID authority derives from DNS domain control. Domain authorities maintain Lexicon definitions with ultimate responsibility for maintenance and distribution. Protocol implementations should treat data failing Lexicon validation as entirely invalid. 210 + 211 + "Unexpected fields in data which otherwise conforms to the Lexicon should be ignored." 212 + 213 + ## Usage Guidelines 214 + 215 + Implementations should support translation to JSON Schema and OpenAPI formats for cross-ecosystem compatibility. Care must be taken when deserializing/reserializing to avoid losing unexpected fields that may represent newer schema versions. 216 + 217 + ## Record Key Types 218 + 219 + The `key` field in record definitions specifies the format of record keys (rkeys). Options: 220 + 221 + - `"any"`: Any string matching general record-key syntax 222 + - `"tid"`: Must be a valid timestamp identifier 223 + - `"literal:{value}"`: Fixed literal string (e.g., `"literal:self"` for profile records) 224 + 225 + ## Notes on Implementation 226 + 227 + - String grapheme counting should follow Unicode extended grapheme cluster boundaries 228 + - Unknown fields should be preserved during serialization/deserialization when possible 229 + - Services should be permissive with format validation but strict with structural requirements 230 + - Breaking schema changes require new NSIDs rather than version updates
+347
.reference/python_atproto_sdk.md
··· 1 + # Python ATProto SDK Reference 2 + 3 + > **Source**: [MarshalX/atproto](https://github.com/MarshalX/atproto) | [Documentation](https://atproto.blue/) | [PyPI](https://pypi.org/project/atproto/) 4 + 5 + ## Overview 6 + 7 + The `atproto` package is the community Python SDK for AT Protocol (Bluesky). It provides: 8 + 9 + - Autogenerated models from lexicons with full type hints 10 + - Synchronous and asynchronous XRPC clients 11 + - Firehose data streaming 12 + - Identity resolution (DID/Handle) 13 + - Cryptographic utilities 14 + - **Code generator for custom lexicon schemes** 15 + 16 + **Version**: 0.0.65 (Dec 2025) 17 + **Python**: 3.9 - 3.14 18 + **License**: MIT 19 + 20 + > Note: Until 1.0.0, compatibility between versions is not guaranteed. 21 + 22 + ## Installation 23 + 24 + ```bash 25 + pip install atproto 26 + ``` 27 + 28 + ## Package Structure 29 + 30 + | Package | Purpose | 31 + |---------|---------| 32 + | `atproto_client` | XRPC client, data models, utilities | 33 + | `atproto_core` | NSID, AT URIs, CID, CAR files, DID documents | 34 + | `atproto_crypto` | Multibase, signature verification, DID keys | 35 + | `atproto_firehose` | Real-time data streaming | 36 + | `atproto_identity` | DID and handle resolution | 37 + | `atproto_lexicon` | Lexicon parsing (parser, models) | 38 + | `atproto_codegen` | Code generator for models/clients from lexicons | 39 + | `atproto_cli` | CLI tool for code generation | 40 + | `atproto_server` | Server-side JWT utilities | 41 + 42 + ## Authentication 43 + 44 + ### Basic Login 45 + 46 + ```python 47 + from atproto import Client 48 + 49 + # Synchronous 50 + client = Client() 51 + client.login('handle.bsky.social', 'app-password') 52 + 53 + # Asynchronous 54 + from atproto import AsyncClient 55 + client = AsyncClient() 56 + await client.login('handle.bsky.social', 'app-password') 57 + ``` 58 + 59 + ### Session Persistence 60 + 61 + Sessions can be exported/imported to avoid repeated authentication: 62 + 63 + ```python 64 + # Export session 65 + session_string = client.export_session_string() 66 + 67 + # Import session later 68 + client = Client() 69 + client.login(session_string=session_string) 70 + ``` 71 + 72 + ### Custom PDS 73 + 74 + ```python 75 + client = Client(base_url='https://my-pds.example.com') 76 + ``` 77 + 78 + ## Namespaced API Access 79 + 80 + The SDK mirrors AT Protocol's NSID structure: 81 + 82 + ```python 83 + # Core atproto methods 84 + client.com.atproto.server.create_session(...) 85 + client.com.atproto.repo.create_record(...) 86 + client.com.atproto.repo.put_record(...) 87 + client.com.atproto.repo.get_record(...) 88 + client.com.atproto.repo.delete_record(...) 89 + 90 + # Bluesky app methods 91 + client.app.bsky.feed.get_timeline(...) 92 + client.app.bsky.actor.get_profile(...) 93 + 94 + # Chat methods 95 + client.chat.bsky.convo.send_message(...) 96 + ``` 97 + 98 + ## Creating Custom Records 99 + 100 + This is the key functionality for atdata's ATProto integration. 101 + 102 + ### Using com.atproto.repo.createRecord 103 + 104 + ```python 105 + from atproto import Client 106 + 107 + client = Client() 108 + client.login('handle', 'password') 109 + 110 + # Create a record with a custom collection 111 + response = client.com.atproto.repo.create_record( 112 + data={ 113 + 'repo': client.me.did, # Your DID 114 + 'collection': 'ac.foundation.dataset.sampleSchema', # Custom NSID 115 + 'record': { 116 + '$type': 'ac.foundation.dataset.sampleSchema', 117 + # ... your record fields 118 + }, 119 + 'validate': False # Skip lexicon validation for custom schemas 120 + } 121 + ) 122 + 123 + # Response contains: 124 + # - uri: AT URI for the record (at://did:plc:.../ac.foundation.dataset.sampleSchema/...) 125 + # - cid: Content hash of the record 126 + ``` 127 + 128 + ### Using com.atproto.repo.putRecord (Create or Update) 129 + 130 + ```python 131 + response = client.com.atproto.repo.put_record( 132 + data={ 133 + 'repo': client.me.did, 134 + 'collection': 'ac.foundation.dataset.sampleSchema', 135 + 'rkey': 'my-schema-key', # Explicit record key 136 + 'record': { 137 + '$type': 'ac.foundation.dataset.sampleSchema', 138 + # ... fields 139 + }, 140 + 'validate': False 141 + } 142 + ) 143 + ``` 144 + 145 + ### Getting a Record 146 + 147 + ```python 148 + response = client.com.atproto.repo.get_record( 149 + params={ 150 + 'repo': 'did:plc:...', 151 + 'collection': 'ac.foundation.dataset.sampleSchema', 152 + 'rkey': 'my-schema-key' 153 + } 154 + ) 155 + # response.value contains the record data 156 + ``` 157 + 158 + ### Listing Records in a Collection 159 + 160 + ```python 161 + response = client.com.atproto.repo.list_records( 162 + params={ 163 + 'repo': 'did:plc:...', 164 + 'collection': 'ac.foundation.dataset.sampleSchema', 165 + 'limit': 100 166 + } 167 + ) 168 + # response.records is a list of records 169 + ``` 170 + 171 + ### Deleting a Record 172 + 173 + ```python 174 + client.com.atproto.repo.delete_record( 175 + data={ 176 + 'repo': client.me.did, 177 + 'collection': 'ac.foundation.dataset.sampleSchema', 178 + 'rkey': 'my-schema-key' 179 + } 180 + ) 181 + ``` 182 + 183 + ## Key Insight: PDS is Lexicon-Agnostic 184 + 185 + From [GitHub Discussion #3116](https://github.com/bluesky-social/atproto/discussions/3116): 186 + 187 + > "You don't need the lexicon to parse a record, only to validate the schema. Validation can be disabled." 188 + 189 + The PDS stores any JSON data in any collection without requiring prior knowledge of the schema. This means: 190 + 191 + 1. We can publish `ac.foundation.dataset.*` records immediately 192 + 2. Set `validate: False` to bypass lexicon validation 193 + 3. Rate limits and account bans prevent abuse, not schema enforcement 194 + 195 + ## AT URIs 196 + 197 + Records are addressed using AT URIs: 198 + 199 + ``` 200 + at://did:plc:abcd1234/ac.foundation.dataset.sampleSchema/record-key 201 + └──────────────────────┘ └──────────────────────────────────┘ └────────┘ 202 + authority collection rkey 203 + ``` 204 + 205 + ### Parsing AT URIs 206 + 207 + ```python 208 + from atproto_core import AtUri 209 + 210 + uri = AtUri.from_str('at://did:plc:abc/com.example.record/key123') 211 + print(uri.hostname) # did:plc:abc 212 + print(uri.collection) # com.example.record 213 + print(uri.rkey) # key123 214 + ``` 215 + 216 + ## Core Utilities (atproto_core) 217 + 218 + ### NSID (Namespaced Identifier) 219 + 220 + ```python 221 + from atproto_core import NSID 222 + 223 + nsid = NSID.from_str('ac.foundation.dataset.sampleSchema') 224 + print(nsid.authority) # ac.foundation.dataset 225 + print(nsid.name) # sampleSchema 226 + ``` 227 + 228 + ### CID (Content Identifier) 229 + 230 + ```python 231 + from atproto_core import CID 232 + 233 + cid = CID.decode('bafyrei...') 234 + print(cid.version) 235 + print(cid.codec) 236 + ``` 237 + 238 + ### DID Document 239 + 240 + ```python 241 + from atproto_core import DidDocument 242 + 243 + doc = DidDocument(...) 244 + pds_endpoint = doc.get_pds_endpoint() 245 + handle = doc.get_handle() 246 + ``` 247 + 248 + ## Identity Resolution 249 + 250 + ```python 251 + from atproto_identity import IdentityResolver 252 + 253 + resolver = IdentityResolver() 254 + 255 + # Resolve handle to DID 256 + did = await resolver.resolve_handle('handle.bsky.social') 257 + 258 + # Resolve DID to DID document 259 + doc = await resolver.resolve_did('did:plc:...') 260 + ``` 261 + 262 + ## Firehose Streaming 263 + 264 + ```python 265 + from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message 266 + 267 + client = FirehoseSubscribeReposClient() 268 + 269 + def on_message(message): 270 + commit = parse_subscribe_repos_message(message) 271 + # Process commits... 272 + 273 + client.start(on_message) 274 + ``` 275 + 276 + ## Blob Upload 277 + 278 + ```python 279 + # Upload binary data 280 + with open('image.jpg', 'rb') as f: 281 + upload = client.upload_blob(f.read()) 282 + 283 + # upload.blob can be used in record fields 284 + ``` 285 + 286 + ## Error Handling 287 + 288 + ```python 289 + from atproto import exceptions 290 + 291 + try: 292 + client.com.atproto.repo.get_record(...) 293 + except exceptions.AtProtocolError as e: 294 + print(f"AT Protocol error: {e}") 295 + except exceptions.NetworkError as e: 296 + print(f"Network error: {e}") 297 + ``` 298 + 299 + ## Code Generation for Custom Lexicons 300 + 301 + The SDK supports generating Python models from custom lexicon schemas: 302 + 303 + ```bash 304 + # Install CLI 305 + pip install atproto[cli] 306 + 307 + # Generate code from lexicons (exact CLI usage TBD) 308 + atproto codegen --lexicons ./my-lexicons --output ./generated 309 + ``` 310 + 311 + The `atproto_codegen` package can generate: 312 + - Data models for records 313 + - Client namespaces for queries/procedures 314 + - Validation functions 315 + 316 + ## Relevant API Endpoints for atdata 317 + 318 + | Endpoint | Purpose | 319 + |----------|---------| 320 + | `com.atproto.repo.createRecord` | Publish new schema/dataset/lens record | 321 + | `com.atproto.repo.putRecord` | Create or update by explicit rkey | 322 + | `com.atproto.repo.getRecord` | Fetch a specific record | 323 + | `com.atproto.repo.listRecords` | List all records in a collection | 324 + | `com.atproto.repo.deleteRecord` | Remove a record | 325 + | `com.atproto.sync.getRepo` | Download full repository (CAR file) | 326 + | `com.atproto.identity.resolveHandle` | Resolve handle to DID | 327 + 328 + ## Resources 329 + 330 + - **Documentation**: https://atproto.blue/ 331 + - **GitHub**: https://github.com/MarshalX/atproto 332 + - **Examples**: https://github.com/MarshalX/atproto/tree/main/examples 333 + - **PyPI**: https://pypi.org/project/atproto/ 334 + - **Discord**: https://discord.gg/PCyVJXU9jN 335 + 336 + ## AT Protocol Specification 337 + 338 + - **Lexicon Guide**: https://atproto.com/guides/lexicon 339 + - **Application Guide**: https://atproto.com/guides/applications 340 + - **SDK List**: https://atproto.com/sdks 341 + - **API Reference**: https://docs.bsky.app/docs/api/ 342 + 343 + ## Version History 344 + 345 + - 0.0.65 (Dec 8, 2025) - Latest 346 + - 0.0.64 (Dec 1, 2025) 347 + - 0.0.63 (Oct 22, 2025)
+16 -1
.vscode/settings.json
··· 1 1 { 2 2 "cSpell.words": [ 3 3 "atdata", 4 + "atlocal", 5 + "atproto", 6 + "creds", 7 + "dtype", 4 8 "getattr", 9 + "hgetall", 10 + "hset", 11 + "maxcount", 12 + "minioadmin", 5 13 "msgpack", 14 + "ndarray", 15 + "NSID", 6 16 "pypi", 7 17 "pyproject", 8 - "pytest" 18 + "pytest", 19 + "randn", 20 + "rkey", 21 + "schemamodels", 22 + "unpackb", 23 + "webdataset" 9 24 ] 10 25 }
+17
CHANGELOG.md
··· 1 + # Changelog 2 + 3 + All notable changes to this project will be documented in this file. 4 + 5 + The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). 6 + 7 + ## [Unreleased] 8 + 9 + ### Added 10 + 11 + ### Fixed 12 + 13 + ### Changed 14 + - Investigate test-bucket directory creation issue (#105) 15 + - Add remaining Dataset edge case tests (#104) 16 + - Improve test coverage for edge cases (#103) 17 + - Phase 1: Lexicon Design & Schema Definition (#17)
+48 -7
CLAUDE.md
··· 22 22 23 23 ### Testing 24 24 ```bash 25 + # Always run tests through uv to use the correct virtual environment 25 26 # Run all tests with coverage 26 - pytest 27 + uv run pytest 27 28 28 29 # Run specific test file 29 - pytest tests/test_dataset.py 30 - pytest tests/test_lens.py 30 + uv run pytest tests/test_dataset.py 31 + uv run pytest tests/test_lens.py 31 32 32 33 # Run single test 33 - pytest tests/test_dataset.py::test_create_sample 34 - pytest tests/test_lens.py::test_lens 34 + uv run pytest tests/test_dataset.py::test_create_sample 35 + uv run pytest tests/test_lens.py::test_lens 35 36 ``` 36 37 37 38 ### Building ··· 136 137 137 138 **WebDataset Integration** 138 139 139 - - Uses `wds.ShardWriter` / `wds.TarWriter` for writing 140 + - Uses `wds.writer.ShardWriter` / `wds.writer.TarWriter` for writing 141 + - **Important:** Always import from `wds.writer` (e.g., `wds.writer.TarWriter`) instead of `wds.TarWriter` 142 + - This avoids linting issues while functionally equivalent 140 143 - Dataset iteration via `wds.DataPipeline` with custom `wrap()` / `wrap_batch()` methods 141 144 - Supports `ordered()` and `shuffled()` iteration modes 142 145 ··· 146 149 - Test cases cover both decorator and inheritance syntax 147 150 - Temporary WebDataset tar files created in `tmp_path` fixture 148 151 - Tests verify both serialization and batch aggregation behavior 149 - - Lens tests verify well-behavedness (GetPut/PutGet laws) 152 + - Lens tests verify well-behavedness (GetPut/PutGet/PutPut laws) 153 + 154 + ### Warning Suppression Convention 155 + 156 + **Keep warning suppression local to individual tests, not global.** 157 + 158 + When tests generate expected warnings (e.g., from third-party library incompatibilities), suppress them using `@pytest.mark.filterwarnings` decorators on each affected test rather than global suppression in `conftest.py`. This: 159 + - Documents which specific tests have known warning behaviors 160 + - Makes it easier to track when warnings appear in unexpected places 161 + - Avoids masking genuine warnings from new code 162 + 163 + Example for s3fs/moto async incompatibility warnings: 164 + ```python 165 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 166 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 167 + def test_repo_insert_with_s3(mock_s3, clean_redis): 168 + ... 169 + ``` 170 + 171 + ## Git Workflow 172 + 173 + ### Committing Changes 174 + 175 + When using the `/commit` command or creating commits: 176 + - **Always include `.chainlink/issues.db`** in commits alongside code changes 177 + - This ensures issue tracking history is preserved across sessions 178 + - The issues.db file tracks all chainlink issues, comments, and status changes 179 + 180 + ### Planning Documents 181 + 182 + - **Track `.planning/` directory in git** - Do not ignore planning documents 183 + - Planning documents in `.planning/` should be committed to preserve design history 184 + - This includes architecture notes, implementation plans, and design decisions 185 + 186 + ### Reference Materials 187 + 188 + - **Track `.reference/` directory in git** - Include reference documentation in commits 189 + - The `.reference/` directory contains external specifications and reference materials 190 + - This includes API specs, lexicon definitions, and other reference documentation used for development
+368
examples/atmosphere_demo.py
··· 1 + #!/usr/bin/env python3 2 + """Demonstration of atdata.atmosphere ATProto integration. 3 + 4 + This script demonstrates how to use the atmosphere module to publish 5 + and discover datasets on the AT Protocol network. 6 + 7 + Usage: 8 + # Dry run (no actual ATProto connection): 9 + python atmosphere_demo.py 10 + 11 + # With actual ATProto connection: 12 + python atmosphere_demo.py --handle your.handle.social --password your-app-password 13 + 14 + Requirements: 15 + pip install atdata[atmosphere] 16 + 17 + Note: 18 + Use an app-specific password, not your main Bluesky password. 19 + Create app passwords at: https://bsky.app/settings/app-passwords 20 + """ 21 + 22 + import argparse 23 + import sys 24 + from dataclasses import asdict, fields, is_dataclass 25 + from datetime import datetime 26 + 27 + import numpy as np 28 + from numpy.typing import NDArray 29 + 30 + import atdata 31 + from atdata.atmosphere import ( 32 + AtmosphereClient, 33 + SchemaPublisher, 34 + SchemaLoader, 35 + DatasetPublisher, 36 + DatasetLoader, 37 + AtUri, 38 + ) 39 + 40 + 41 + # ============================================================================= 42 + # Define sample types using @packable decorator 43 + # ============================================================================= 44 + 45 + @atdata.packable 46 + class ImageSample: 47 + """A sample containing image data with metadata.""" 48 + image: NDArray 49 + label: str 50 + confidence: float 51 + 52 + 53 + @atdata.packable 54 + class TextEmbeddingSample: 55 + """A sample containing text with embedding vectors.""" 56 + text: str 57 + embedding: NDArray 58 + source: str 59 + 60 + 61 + # ============================================================================= 62 + # Demo functions 63 + # ============================================================================= 64 + 65 + def demo_type_introspection(): 66 + """Demonstrate how atmosphere introspects PackableSample types.""" 67 + print("\n" + "=" * 60) 68 + print("Type Introspection Demo") 69 + print("=" * 60) 70 + 71 + # Show what information is available from a PackableSample type 72 + print(f"\nSample type: {ImageSample.__name__}") 73 + print(f"Is dataclass: {is_dataclass(ImageSample)}") 74 + 75 + print("\nFields:") 76 + for field in fields(ImageSample): 77 + print(f" - {field.name}: {field.type}") 78 + 79 + # Create a sample instance 80 + sample = ImageSample( 81 + image=np.random.rand(224, 224, 3).astype(np.float32), 82 + label="cat", 83 + confidence=0.95, 84 + ) 85 + 86 + print(f"\nSample instance:") 87 + print(f" image shape: {sample.image.shape}") 88 + print(f" image dtype: {sample.image.dtype}") 89 + print(f" label: {sample.label}") 90 + print(f" confidence: {sample.confidence}") 91 + 92 + # Demonstrate serialization 93 + packed = sample.packed 94 + print(f"\nSerialized size: {len(packed):,} bytes") 95 + 96 + # Round-trip 97 + restored = ImageSample.from_bytes(packed) 98 + print(f"Round-trip successful: {np.allclose(sample.image, restored.image)}") 99 + 100 + 101 + def demo_at_uri_parsing(): 102 + """Demonstrate AT URI parsing.""" 103 + print("\n" + "=" * 60) 104 + print("AT URI Parsing Demo") 105 + print("=" * 60) 106 + 107 + # Example AT URIs 108 + uris = [ 109 + "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789", 110 + "at://alice.bsky.social/ac.foundation.dataset.record/my-dataset", 111 + ] 112 + 113 + for uri_str in uris: 114 + print(f"\nParsing: {uri_str}") 115 + uri = AtUri.parse(uri_str) 116 + print(f" Authority: {uri.authority}") 117 + print(f" Collection: {uri.collection}") 118 + print(f" Rkey: {uri.rkey}") 119 + print(f" Roundtrip: {str(uri)}") 120 + 121 + 122 + def demo_schema_record_building(): 123 + """Demonstrate building schema records from PackableSample types.""" 124 + print("\n" + "=" * 60) 125 + print("Schema Record Building Demo") 126 + print("=" * 60) 127 + 128 + from atdata.atmosphere._types import SchemaRecord, FieldDef, FieldType 129 + 130 + # Build a schema record manually (what SchemaPublisher does internally) 131 + schema = SchemaRecord( 132 + name="ImageSample", 133 + version="1.0.0", 134 + description="A sample containing image data with metadata", 135 + fields=[ 136 + FieldDef( 137 + name="image", 138 + field_type=FieldType(kind="ndarray", dtype="float32", shape=[224, 224, 3]), 139 + optional=False, 140 + ), 141 + FieldDef( 142 + name="label", 143 + field_type=FieldType(kind="primitive", primitive="str"), 144 + optional=False, 145 + ), 146 + FieldDef( 147 + name="confidence", 148 + field_type=FieldType(kind="primitive", primitive="float"), 149 + optional=False, 150 + ), 151 + ], 152 + ) 153 + 154 + # Convert to ATProto record format 155 + record = schema.to_record() 156 + 157 + print("\nSchema record structure:") 158 + print(f" $type: {record['$type']}") 159 + print(f" name: {record['name']}") 160 + print(f" version: {record['version']}") 161 + print(f" description: {record.get('description', 'N/A')}") 162 + print(f" fields: {len(record['fields'])} fields") 163 + 164 + for field in record["fields"]: 165 + print(f" - {field['name']}: {field['fieldType']}") 166 + 167 + 168 + def demo_mock_client(): 169 + """Demonstrate the AtmosphereClient interface with a mock.""" 170 + print("\n" + "=" * 60) 171 + print("Mock Client Demo (no network)") 172 + print("=" * 60) 173 + 174 + from unittest.mock import Mock, MagicMock 175 + 176 + # Create a mock atproto client 177 + mock_atproto = Mock() 178 + mock_atproto.me = MagicMock() 179 + mock_atproto.me.did = "did:plc:demo123456789" 180 + mock_atproto.me.handle = "demo.bsky.social" 181 + 182 + # Mock the login response 183 + mock_profile = Mock() 184 + mock_profile.did = "did:plc:demo123456789" 185 + mock_profile.handle = "demo.bsky.social" 186 + mock_atproto.login.return_value = mock_profile 187 + 188 + # Mock create_record response 189 + mock_response = Mock() 190 + mock_response.uri = "at://did:plc:demo123456789/ac.foundation.dataset.sampleSchema/abc123" 191 + mock_atproto.com.atproto.repo.create_record.return_value = mock_response 192 + 193 + # Create our client with the mock 194 + client = AtmosphereClient(_client=mock_atproto) 195 + client.login("demo.bsky.social", "fake-password") 196 + 197 + print(f"\nAuthenticated as: {client.handle}") 198 + print(f"DID: {client.did}") 199 + 200 + # Demonstrate schema publishing with mock 201 + publisher = SchemaPublisher(client) 202 + uri = publisher.publish( 203 + ImageSample, 204 + name="ImageSample", 205 + version="1.0.0", 206 + description="Demo image sample type", 207 + ) 208 + 209 + print(f"\nPublished schema at: {uri}") 210 + print(f" Authority: {uri.authority}") 211 + print(f" Collection: {uri.collection}") 212 + print(f" Rkey: {uri.rkey}") 213 + 214 + 215 + def demo_live_connection(handle: str, password: str): 216 + """Demonstrate actual ATProto connection. 217 + 218 + Args: 219 + handle: Bluesky handle (e.g., 'alice.bsky.social') 220 + password: App-specific password 221 + """ 222 + print("\n" + "=" * 60) 223 + print("Live ATProto Connection Demo") 224 + print("=" * 60) 225 + 226 + # Create client and authenticate 227 + print(f"\nConnecting as {handle}...") 228 + client = AtmosphereClient() 229 + client.login(handle, password) 230 + 231 + print(f"Authenticated!") 232 + print(f" DID: {client.did}") 233 + print(f" Handle: {client.handle}") 234 + 235 + # Publish a schema 236 + print("\nPublishing ImageSample schema...") 237 + schema_publisher = SchemaPublisher(client) 238 + schema_uri = schema_publisher.publish( 239 + ImageSample, 240 + name="ImageSample", 241 + version="1.0.0", 242 + description="Demo: Image sample with label and confidence", 243 + ) 244 + print(f" Schema URI: {schema_uri}") 245 + 246 + # List schemas we've published 247 + print("\nListing your published schemas...") 248 + schema_loader = SchemaLoader(client) 249 + schemas = schema_loader.list_all(limit=10) 250 + print(f" Found {len(schemas)} schema(s)") 251 + for schema in schemas: 252 + print(f" - {schema.get('name', 'Unknown')}: v{schema.get('version', '?')}") 253 + 254 + # Publish a dataset record (pointing to example URLs) 255 + print("\nPublishing dataset record...") 256 + dataset_publisher = DatasetPublisher(client) 257 + dataset_uri = dataset_publisher.publish_with_urls( 258 + urls=["s3://example-bucket/demo-data-{000000..000009}.tar"], 259 + schema_uri=str(schema_uri), 260 + name="Demo Image Dataset", 261 + description="Example dataset demonstrating atmosphere publishing", 262 + tags=["demo", "images", "atdata"], 263 + license="MIT", 264 + ) 265 + print(f" Dataset URI: {dataset_uri}") 266 + 267 + # List datasets 268 + print("\nListing your published datasets...") 269 + dataset_loader = DatasetLoader(client) 270 + datasets = dataset_loader.list_all(limit=10) 271 + print(f" Found {len(datasets)} dataset(s)") 272 + for ds in datasets: 273 + print(f" - {ds.get('name', 'Unknown')}") 274 + print(f" Schema: {ds.get('schemaRef', 'N/A')}") 275 + tags = ds.get('tags', []) 276 + if tags: 277 + print(f" Tags: {', '.join(tags)}") 278 + 279 + 280 + def demo_dataset_loading(): 281 + """Demonstrate loading a dataset from an ATProto record.""" 282 + print("\n" + "=" * 60) 283 + print("Dataset Loading Demo (conceptual)") 284 + print("=" * 60) 285 + 286 + print(""" 287 + Once you have published a dataset, others can load it like this: 288 + 289 + from atdata.atmosphere import AtmosphereClient, DatasetLoader 290 + 291 + client = AtmosphereClient() 292 + # Note: reading public records doesn't require authentication 293 + 294 + loader = DatasetLoader(client) 295 + 296 + # Get the dataset record 297 + record = loader.get("at://did:plc:abc123/ac.foundation.dataset.record/xyz") 298 + 299 + # Get the WebDataset URLs 300 + urls = loader.get_urls("at://did:plc:abc123/ac.foundation.dataset.record/xyz") 301 + print(f"Dataset URLs: {urls}") 302 + 303 + # If you have the sample type class, create a Dataset directly 304 + dataset = loader.to_dataset( 305 + "at://did:plc:abc123/ac.foundation.dataset.record/xyz", 306 + sample_type=ImageSample, 307 + ) 308 + 309 + # Now iterate as usual 310 + for batch in dataset.shuffled(batch_size=32): 311 + images = batch.image # (32, 224, 224, 3) 312 + labels = batch.label # list of 32 strings 313 + process(images, labels) 314 + """) 315 + 316 + 317 + # ============================================================================= 318 + # Main 319 + # ============================================================================= 320 + 321 + def main(): 322 + parser = argparse.ArgumentParser( 323 + description="Demonstrate atdata.atmosphere ATProto integration", 324 + formatter_class=argparse.RawDescriptionHelpFormatter, 325 + epilog=__doc__, 326 + ) 327 + parser.add_argument( 328 + "--handle", 329 + help="Bluesky handle for live demo (e.g., alice.bsky.social)", 330 + ) 331 + parser.add_argument( 332 + "--password", 333 + help="App-specific password for live demo", 334 + ) 335 + 336 + args = parser.parse_args() 337 + 338 + print("=" * 60) 339 + print("atdata.atmosphere Demo") 340 + print("=" * 60) 341 + print(f"\nTime: {datetime.now().isoformat()}") 342 + print(f"atdata version: {atdata.__name__}") 343 + 344 + # Always run these demos (no network required) 345 + demo_type_introspection() 346 + demo_at_uri_parsing() 347 + demo_schema_record_building() 348 + demo_mock_client() 349 + demo_dataset_loading() 350 + 351 + # Run live demo if credentials provided 352 + if args.handle and args.password: 353 + demo_live_connection(args.handle, args.password) 354 + else: 355 + print("\n" + "=" * 60) 356 + print("Live Demo Skipped") 357 + print("=" * 60) 358 + print("\nTo run with actual ATProto connection:") 359 + print(" python atmosphere_demo.py --handle your.handle --password your-app-password") 360 + print("\nCreate app passwords at: https://bsky.app/settings/app-passwords") 361 + 362 + print("\n" + "=" * 60) 363 + print("Demo Complete!") 364 + print("=" * 60) 365 + 366 + 367 + if __name__ == "__main__": 368 + main()
+1
prototyping/.credentials/.gitignore
··· 1 + *.env
+1
prototyping/data/.gitignore
··· 1 + *.tar
+15 -1
pyproject.toml
··· 1 1 [project] 2 2 name = "atdata" 3 - version = "0.1.3b4" 3 + version = "0.2.0a1" 4 4 description = "A loose federation of distributed, typed datasets" 5 5 readme = "README.md" 6 6 authors = [ ··· 8 8 ] 9 9 requires-python = ">=3.12" 10 10 dependencies = [ 11 + "atproto>=0.0.65", 11 12 "fastparquet>=2024.11.0", 12 13 "msgpack>=1.1.2", 13 14 "numpy>=2.3.4", 14 15 "ormsgpack>=1.11.0", 15 16 "pandas>=2.3.3", 17 + "pydantic>=2.12.5", 18 + "python-dotenv>=1.2.1", 19 + "redis-om>=0.3.5", 20 + "requests>=2.32.5", 21 + "s3fs>=2025.12.0", 22 + "schemamodels>=0.9.1", 16 23 "tqdm>=4.67.1", 17 24 "webdataset>=1.0.2", 25 + ] 26 + 27 + [project.optional-dependencies] 28 + atmosphere = [ 29 + "atproto>=0.0.55", 18 30 ] 19 31 20 32 [project.scripts] ··· 29 41 30 42 [dependency-groups] 31 43 dev = [ 44 + "jupyter>=1.1.1", 45 + "moto[s3]>=5.0.29", 32 46 "pytest>=8.4.2", 33 47 "pytest-cov>=7.0.0", 34 48 ]
+3
src/atdata/__init__.py
··· 51 51 lens, 52 52 ) 53 53 54 + # ATProto integration (lazy import to avoid requiring atproto package) 55 + from . import atmosphere 56 + 54 57 55 58 #
+61
src/atdata/atmosphere/__init__.py
··· 1 + """ATProto integration for distributed dataset federation. 2 + 3 + This module provides ATProto publishing and discovery capabilities for atdata, 4 + enabling a loose federation of distributed, typed datasets on the AT Protocol 5 + network. 6 + 7 + Key components: 8 + 9 + - ``AtmosphereClient``: Authentication and session management for ATProto 10 + - ``SchemaPublisher``: Publish PackableSample schemas as ATProto records 11 + - ``DatasetPublisher``: Publish dataset index records with WebDataset URLs 12 + - ``LensPublisher``: Publish lens transformation records 13 + 14 + The ATProto integration is additive - existing atdata functionality continues 15 + to work unchanged. These features are opt-in for users who want to publish 16 + or discover datasets on the ATProto network. 17 + 18 + Example: 19 + >>> from atdata.atmosphere import AtmosphereClient, SchemaPublisher 20 + >>> 21 + >>> client = AtmosphereClient() 22 + >>> client.login("handle.bsky.social", "app-password") 23 + >>> 24 + >>> publisher = SchemaPublisher(client) 25 + >>> schema_uri = publisher.publish(MySampleType, version="1.0.0") 26 + 27 + Note: 28 + This module requires the ``atproto`` package to be installed:: 29 + 30 + pip install atproto 31 + """ 32 + 33 + from .client import AtmosphereClient 34 + from .schema import SchemaPublisher, SchemaLoader 35 + from .records import DatasetPublisher, DatasetLoader 36 + from .lens import LensPublisher, LensLoader 37 + from ._types import ( 38 + AtUri, 39 + SchemaRecord, 40 + DatasetRecord, 41 + LensRecord, 42 + ) 43 + 44 + __all__ = [ 45 + # Client 46 + "AtmosphereClient", 47 + # Schema operations 48 + "SchemaPublisher", 49 + "SchemaLoader", 50 + # Dataset operations 51 + "DatasetPublisher", 52 + "DatasetLoader", 53 + # Lens operations 54 + "LensPublisher", 55 + "LensLoader", 56 + # Types 57 + "AtUri", 58 + "SchemaRecord", 59 + "DatasetRecord", 60 + "LensRecord", 61 + ]
+329
src/atdata/atmosphere/_types.py
··· 1 + """Type definitions for ATProto record structures. 2 + 3 + This module defines the data structures used to represent ATProto records 4 + for schemas, datasets, and lenses. These types map to the Lexicon definitions 5 + in the ``ac.foundation.dataset.*`` namespace. 6 + """ 7 + 8 + from dataclasses import dataclass, field 9 + from datetime import datetime, timezone 10 + from typing import Optional, Literal, Any 11 + 12 + # Lexicon namespace for atdata records 13 + LEXICON_NAMESPACE = "ac.foundation.dataset" 14 + 15 + 16 + @dataclass 17 + class AtUri: 18 + """Parsed AT Protocol URI. 19 + 20 + AT URIs follow the format: at://<authority>/<collection>/<rkey> 21 + 22 + Example: 23 + >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 24 + >>> uri.authority 25 + 'did:plc:abc123' 26 + >>> uri.collection 27 + 'ac.foundation.dataset.sampleSchema' 28 + >>> uri.rkey 29 + 'xyz' 30 + """ 31 + 32 + authority: str 33 + """The DID or handle of the repository owner.""" 34 + 35 + collection: str 36 + """The NSID of the record collection.""" 37 + 38 + rkey: str 39 + """The record key within the collection.""" 40 + 41 + @classmethod 42 + def parse(cls, uri: str) -> "AtUri": 43 + """Parse an AT URI string into components. 44 + 45 + Args: 46 + uri: AT URI string in format ``at://<authority>/<collection>/<rkey>`` 47 + 48 + Returns: 49 + Parsed AtUri instance. 50 + 51 + Raises: 52 + ValueError: If the URI format is invalid. 53 + """ 54 + if not uri.startswith("at://"): 55 + raise ValueError(f"Invalid AT URI: must start with 'at://': {uri}") 56 + 57 + parts = uri[5:].split("/") 58 + if len(parts) < 3: 59 + raise ValueError(f"Invalid AT URI: expected authority/collection/rkey: {uri}") 60 + 61 + return cls( 62 + authority=parts[0], 63 + collection=parts[1], 64 + rkey="/".join(parts[2:]), # rkey may contain slashes 65 + ) 66 + 67 + def __str__(self) -> str: 68 + """Format as AT URI string.""" 69 + return f"at://{self.authority}/{self.collection}/{self.rkey}" 70 + 71 + 72 + @dataclass 73 + class FieldType: 74 + """Schema field type definition. 75 + 76 + Represents a type in the schema type system, supporting primitives, 77 + ndarrays, and references to other schemas. 78 + """ 79 + 80 + kind: Literal["primitive", "ndarray", "ref", "array"] 81 + """The category of type.""" 82 + 83 + primitive: Optional[str] = None 84 + """For kind='primitive': one of 'str', 'int', 'float', 'bool', 'bytes'.""" 85 + 86 + dtype: Optional[str] = None 87 + """For kind='ndarray': numpy dtype string (e.g., 'float32').""" 88 + 89 + shape: Optional[list[int | None]] = None 90 + """For kind='ndarray': shape constraints (None for any dimension).""" 91 + 92 + ref: Optional[str] = None 93 + """For kind='ref': AT URI of referenced schema.""" 94 + 95 + items: Optional["FieldType"] = None 96 + """For kind='array': type of array elements.""" 97 + 98 + 99 + @dataclass 100 + class FieldDef: 101 + """Schema field definition.""" 102 + 103 + name: str 104 + """Field name.""" 105 + 106 + field_type: FieldType 107 + """Type of this field.""" 108 + 109 + optional: bool = False 110 + """Whether this field can be None.""" 111 + 112 + description: Optional[str] = None 113 + """Human-readable description.""" 114 + 115 + 116 + @dataclass 117 + class SchemaRecord: 118 + """ATProto record for a PackableSample schema. 119 + 120 + Maps to the ``ac.foundation.dataset.sampleSchema`` Lexicon. 121 + """ 122 + 123 + name: str 124 + """Human-readable schema name.""" 125 + 126 + version: str 127 + """Semantic version string (e.g., '1.0.0').""" 128 + 129 + fields: list[FieldDef] 130 + """List of field definitions.""" 131 + 132 + description: Optional[str] = None 133 + """Human-readable description.""" 134 + 135 + created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) 136 + """When this record was created.""" 137 + 138 + metadata: Optional[dict] = None 139 + """Arbitrary metadata as msgpack-encoded bytes.""" 140 + 141 + def to_record(self) -> dict: 142 + """Convert to ATProto record dict for publishing.""" 143 + record = { 144 + "$type": f"{LEXICON_NAMESPACE}.sampleSchema", 145 + "name": self.name, 146 + "version": self.version, 147 + "fields": [self._field_to_dict(f) for f in self.fields], 148 + "createdAt": self.created_at.isoformat(), 149 + } 150 + if self.description: 151 + record["description"] = self.description 152 + if self.metadata: 153 + record["metadata"] = self.metadata 154 + return record 155 + 156 + def _field_to_dict(self, field_def: FieldDef) -> dict: 157 + """Convert a field definition to dict.""" 158 + result = { 159 + "name": field_def.name, 160 + "fieldType": self._type_to_dict(field_def.field_type), 161 + "optional": field_def.optional, 162 + } 163 + if field_def.description: 164 + result["description"] = field_def.description 165 + return result 166 + 167 + def _type_to_dict(self, field_type: FieldType) -> dict: 168 + """Convert a field type to dict.""" 169 + result: dict = {"$type": f"{LEXICON_NAMESPACE}.schemaType#{field_type.kind}"} 170 + 171 + if field_type.kind == "primitive": 172 + result["primitive"] = field_type.primitive 173 + elif field_type.kind == "ndarray": 174 + result["dtype"] = field_type.dtype 175 + if field_type.shape: 176 + result["shape"] = field_type.shape 177 + elif field_type.kind == "ref": 178 + result["ref"] = field_type.ref 179 + elif field_type.kind == "array": 180 + if field_type.items: 181 + result["items"] = self._type_to_dict(field_type.items) 182 + 183 + return result 184 + 185 + 186 + @dataclass 187 + class StorageLocation: 188 + """Dataset storage location specification.""" 189 + 190 + kind: Literal["external", "blobs"] 191 + """Storage type: external URLs or ATProto blobs.""" 192 + 193 + urls: Optional[list[str]] = None 194 + """For kind='external': WebDataset URLs with brace notation.""" 195 + 196 + blob_refs: Optional[list[dict]] = None 197 + """For kind='blobs': ATProto blob references.""" 198 + 199 + 200 + @dataclass 201 + class DatasetRecord: 202 + """ATProto record for a dataset index. 203 + 204 + Maps to the ``ac.foundation.dataset.record`` Lexicon. 205 + """ 206 + 207 + name: str 208 + """Human-readable dataset name.""" 209 + 210 + schema_ref: str 211 + """AT URI of the schema record.""" 212 + 213 + storage: StorageLocation 214 + """Where the dataset data is stored.""" 215 + 216 + description: Optional[str] = None 217 + """Human-readable description.""" 218 + 219 + tags: list[str] = field(default_factory=list) 220 + """Searchable tags.""" 221 + 222 + license: Optional[str] = None 223 + """SPDX license identifier.""" 224 + 225 + created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) 226 + """When this record was created.""" 227 + 228 + metadata: Optional[bytes] = None 229 + """Arbitrary metadata as msgpack-encoded bytes.""" 230 + 231 + def to_record(self) -> dict: 232 + """Convert to ATProto record dict for publishing.""" 233 + record = { 234 + "$type": f"{LEXICON_NAMESPACE}.record", 235 + "name": self.name, 236 + "schemaRef": self.schema_ref, 237 + "storage": self._storage_to_dict(), 238 + "createdAt": self.created_at.isoformat(), 239 + } 240 + if self.description: 241 + record["description"] = self.description 242 + if self.tags: 243 + record["tags"] = self.tags 244 + if self.license: 245 + record["license"] = self.license 246 + if self.metadata: 247 + record["metadata"] = self.metadata 248 + return record 249 + 250 + def _storage_to_dict(self) -> dict: 251 + """Convert storage location to dict.""" 252 + if self.storage.kind == "external": 253 + return { 254 + "$type": f"{LEXICON_NAMESPACE}.storageExternal", 255 + "urls": self.storage.urls or [], 256 + } 257 + else: 258 + return { 259 + "$type": f"{LEXICON_NAMESPACE}.storageBlobs", 260 + "blobs": self.storage.blob_refs or [], 261 + } 262 + 263 + 264 + @dataclass 265 + class CodeReference: 266 + """Reference to lens code in a git repository.""" 267 + 268 + repository: str 269 + """Git repository URL.""" 270 + 271 + commit: str 272 + """Git commit hash.""" 273 + 274 + path: str 275 + """Path to the code file/function.""" 276 + 277 + 278 + @dataclass 279 + class LensRecord: 280 + """ATProto record for a lens transformation. 281 + 282 + Maps to the ``ac.foundation.dataset.lens`` Lexicon. 283 + """ 284 + 285 + name: str 286 + """Human-readable lens name.""" 287 + 288 + source_schema: str 289 + """AT URI of the source schema.""" 290 + 291 + target_schema: str 292 + """AT URI of the target schema.""" 293 + 294 + description: Optional[str] = None 295 + """What this transformation does.""" 296 + 297 + getter_code: Optional[CodeReference] = None 298 + """Reference to getter function code.""" 299 + 300 + putter_code: Optional[CodeReference] = None 301 + """Reference to putter function code.""" 302 + 303 + created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc)) 304 + """When this record was created.""" 305 + 306 + def to_record(self) -> dict: 307 + """Convert to ATProto record dict for publishing.""" 308 + record: dict[str, Any] = { 309 + "$type": f"{LEXICON_NAMESPACE}.lens", 310 + "name": self.name, 311 + "sourceSchema": self.source_schema, 312 + "targetSchema": self.target_schema, 313 + "createdAt": self.created_at.isoformat(), 314 + } 315 + if self.description: 316 + record["description"] = self.description 317 + if self.getter_code: 318 + record["getterCode"] = { 319 + "repository": self.getter_code.repository, 320 + "commit": self.getter_code.commit, 321 + "path": self.getter_code.path, 322 + } 323 + if self.putter_code: 324 + record["putterCode"] = { 325 + "repository": self.putter_code.repository, 326 + "commit": self.putter_code.commit, 327 + "path": self.putter_code.path, 328 + } 329 + return record
+393
src/atdata/atmosphere/client.py
··· 1 + """ATProto client wrapper for atdata. 2 + 3 + This module provides the ``AtmosphereClient`` class which wraps the atproto SDK 4 + client with atdata-specific helpers for publishing and querying records. 5 + """ 6 + 7 + from typing import Optional, Any 8 + 9 + from ._types import AtUri, LEXICON_NAMESPACE 10 + 11 + # Lazy import to avoid requiring atproto if not using atmosphere features 12 + _atproto_client_class: Optional[type] = None 13 + 14 + 15 + def _get_atproto_client_class(): 16 + """Lazily import the atproto Client class.""" 17 + global _atproto_client_class 18 + if _atproto_client_class is None: 19 + try: 20 + from atproto import Client 21 + _atproto_client_class = Client 22 + except ImportError as e: 23 + raise ImportError( 24 + "The 'atproto' package is required for ATProto integration. " 25 + "Install it with: pip install atproto" 26 + ) from e 27 + return _atproto_client_class 28 + 29 + 30 + class AtmosphereClient: 31 + """ATProto client wrapper for atdata operations. 32 + 33 + This class wraps the atproto SDK client and provides higher-level methods 34 + for working with atdata records (schemas, datasets, lenses). 35 + 36 + Example: 37 + >>> client = AtmosphereClient() 38 + >>> client.login("alice.bsky.social", "app-password") 39 + >>> print(client.did) 40 + 'did:plc:...' 41 + 42 + Note: 43 + The password should be an app-specific password, not your main account 44 + password. Create app passwords in your Bluesky account settings. 45 + """ 46 + 47 + def __init__( 48 + self, 49 + base_url: Optional[str] = None, 50 + *, 51 + _client: Optional[Any] = None, 52 + ): 53 + """Initialize the ATProto client. 54 + 55 + Args: 56 + base_url: Optional PDS base URL. Defaults to bsky.social. 57 + _client: Optional pre-configured atproto Client for testing. 58 + """ 59 + if _client is not None: 60 + self._client = _client 61 + else: 62 + Client = _get_atproto_client_class() 63 + self._client = Client(base_url=base_url) if base_url else Client() 64 + 65 + self._session: Optional[dict] = None 66 + 67 + def login(self, handle: str, password: str) -> None: 68 + """Authenticate with the ATProto PDS. 69 + 70 + Args: 71 + handle: Your Bluesky handle (e.g., 'alice.bsky.social'). 72 + password: App-specific password (not your main password). 73 + 74 + Raises: 75 + atproto.exceptions.AtProtocolError: If authentication fails. 76 + """ 77 + profile = self._client.login(handle, password) 78 + self._session = { 79 + "did": profile.did, 80 + "handle": profile.handle, 81 + } 82 + 83 + def login_with_session(self, session_string: str) -> None: 84 + """Authenticate using an exported session string. 85 + 86 + This allows reusing a session without re-authenticating, which helps 87 + avoid rate limits on session creation. 88 + 89 + Args: 90 + session_string: Session string from ``export_session()``. 91 + """ 92 + self._client.login(session_string=session_string) 93 + self._session = { 94 + "did": self._client.me.did, 95 + "handle": self._client.me.handle, 96 + } 97 + 98 + def export_session(self) -> str: 99 + """Export the current session for later reuse. 100 + 101 + Returns: 102 + Session string that can be passed to ``login_with_session()``. 103 + 104 + Raises: 105 + ValueError: If not authenticated. 106 + """ 107 + if not self.is_authenticated: 108 + raise ValueError("Not authenticated") 109 + return self._client.export_session_string() 110 + 111 + @property 112 + def is_authenticated(self) -> bool: 113 + """Check if the client has a valid session.""" 114 + return self._session is not None 115 + 116 + @property 117 + def did(self) -> str: 118 + """Get the DID of the authenticated user. 119 + 120 + Returns: 121 + The DID string (e.g., 'did:plc:...'). 122 + 123 + Raises: 124 + ValueError: If not authenticated. 125 + """ 126 + if not self._session: 127 + raise ValueError("Not authenticated") 128 + return self._session["did"] 129 + 130 + @property 131 + def handle(self) -> str: 132 + """Get the handle of the authenticated user. 133 + 134 + Returns: 135 + The handle string (e.g., 'alice.bsky.social'). 136 + 137 + Raises: 138 + ValueError: If not authenticated. 139 + """ 140 + if not self._session: 141 + raise ValueError("Not authenticated") 142 + return self._session["handle"] 143 + 144 + def _ensure_authenticated(self) -> None: 145 + """Raise if not authenticated.""" 146 + if not self.is_authenticated: 147 + raise ValueError("Client must be authenticated to perform this operation") 148 + 149 + # Low-level record operations 150 + 151 + def create_record( 152 + self, 153 + collection: str, 154 + record: dict, 155 + *, 156 + rkey: Optional[str] = None, 157 + validate: bool = False, 158 + ) -> AtUri: 159 + """Create a record in the user's repository. 160 + 161 + Args: 162 + collection: The NSID of the record collection 163 + (e.g., 'ac.foundation.dataset.sampleSchema'). 164 + record: The record data. Must include a '$type' field. 165 + rkey: Optional explicit record key. If not provided, a TID is generated. 166 + validate: Whether to validate against the Lexicon schema. Set to False 167 + for custom lexicons that the PDS doesn't know about. 168 + 169 + Returns: 170 + The AT URI of the created record. 171 + 172 + Raises: 173 + ValueError: If not authenticated. 174 + atproto.exceptions.AtProtocolError: If record creation fails. 175 + """ 176 + self._ensure_authenticated() 177 + 178 + response = self._client.com.atproto.repo.create_record( 179 + data={ 180 + "repo": self.did, 181 + "collection": collection, 182 + "record": record, 183 + "rkey": rkey, 184 + "validate": validate, 185 + } 186 + ) 187 + 188 + return AtUri.parse(response.uri) 189 + 190 + def put_record( 191 + self, 192 + collection: str, 193 + rkey: str, 194 + record: dict, 195 + *, 196 + validate: bool = False, 197 + swap_commit: Optional[str] = None, 198 + ) -> AtUri: 199 + """Create or update a record at a specific key. 200 + 201 + Args: 202 + collection: The NSID of the record collection. 203 + rkey: The record key. 204 + record: The record data. Must include a '$type' field. 205 + validate: Whether to validate against the Lexicon schema. 206 + swap_commit: Optional CID for compare-and-swap update. 207 + 208 + Returns: 209 + The AT URI of the record. 210 + 211 + Raises: 212 + ValueError: If not authenticated. 213 + atproto.exceptions.AtProtocolError: If operation fails. 214 + """ 215 + self._ensure_authenticated() 216 + 217 + data: dict[str, Any] = { 218 + "repo": self.did, 219 + "collection": collection, 220 + "rkey": rkey, 221 + "record": record, 222 + "validate": validate, 223 + } 224 + if swap_commit: 225 + data["swapCommit"] = swap_commit 226 + 227 + response = self._client.com.atproto.repo.put_record(data=data) 228 + 229 + return AtUri.parse(response.uri) 230 + 231 + def get_record( 232 + self, 233 + uri: str | AtUri, 234 + ) -> dict: 235 + """Fetch a record by AT URI. 236 + 237 + Args: 238 + uri: The AT URI of the record. 239 + 240 + Returns: 241 + The record data as a dictionary. 242 + 243 + Raises: 244 + atproto.exceptions.AtProtocolError: If record not found. 245 + """ 246 + if isinstance(uri, str): 247 + uri = AtUri.parse(uri) 248 + 249 + response = self._client.com.atproto.repo.get_record( 250 + params={ 251 + "repo": uri.authority, 252 + "collection": uri.collection, 253 + "rkey": uri.rkey, 254 + } 255 + ) 256 + 257 + return response.value 258 + 259 + def delete_record( 260 + self, 261 + uri: str | AtUri, 262 + *, 263 + swap_commit: Optional[str] = None, 264 + ) -> None: 265 + """Delete a record. 266 + 267 + Args: 268 + uri: The AT URI of the record to delete. 269 + swap_commit: Optional CID for compare-and-swap delete. 270 + 271 + Raises: 272 + ValueError: If not authenticated. 273 + atproto.exceptions.AtProtocolError: If deletion fails. 274 + """ 275 + self._ensure_authenticated() 276 + 277 + if isinstance(uri, str): 278 + uri = AtUri.parse(uri) 279 + 280 + data: dict[str, Any] = { 281 + "repo": self.did, 282 + "collection": uri.collection, 283 + "rkey": uri.rkey, 284 + } 285 + if swap_commit: 286 + data["swapCommit"] = swap_commit 287 + 288 + self._client.com.atproto.repo.delete_record(data=data) 289 + 290 + def list_records( 291 + self, 292 + collection: str, 293 + *, 294 + repo: Optional[str] = None, 295 + limit: int = 100, 296 + cursor: Optional[str] = None, 297 + ) -> tuple[list[dict], Optional[str]]: 298 + """List records in a collection. 299 + 300 + Args: 301 + collection: The NSID of the record collection. 302 + repo: The DID of the repository to query. Defaults to the 303 + authenticated user's repository. 304 + limit: Maximum number of records to return (default 100). 305 + cursor: Pagination cursor from a previous call. 306 + 307 + Returns: 308 + A tuple of (records, next_cursor). The cursor is None if there 309 + are no more records. 310 + 311 + Raises: 312 + ValueError: If repo is None and not authenticated. 313 + """ 314 + if repo is None: 315 + self._ensure_authenticated() 316 + repo = self.did 317 + 318 + response = self._client.com.atproto.repo.list_records( 319 + params={ 320 + "repo": repo, 321 + "collection": collection, 322 + "limit": limit, 323 + "cursor": cursor, 324 + } 325 + ) 326 + 327 + records = [r.value for r in response.records] 328 + return records, response.cursor 329 + 330 + # Convenience methods for atdata collections 331 + 332 + def list_schemas( 333 + self, 334 + repo: Optional[str] = None, 335 + limit: int = 100, 336 + ) -> list[dict]: 337 + """List schema records. 338 + 339 + Args: 340 + repo: The DID to query. Defaults to authenticated user. 341 + limit: Maximum number to return. 342 + 343 + Returns: 344 + List of schema records. 345 + """ 346 + records, _ = self.list_records( 347 + f"{LEXICON_NAMESPACE}.sampleSchema", 348 + repo=repo, 349 + limit=limit, 350 + ) 351 + return records 352 + 353 + def list_datasets( 354 + self, 355 + repo: Optional[str] = None, 356 + limit: int = 100, 357 + ) -> list[dict]: 358 + """List dataset records. 359 + 360 + Args: 361 + repo: The DID to query. Defaults to authenticated user. 362 + limit: Maximum number to return. 363 + 364 + Returns: 365 + List of dataset records. 366 + """ 367 + records, _ = self.list_records( 368 + f"{LEXICON_NAMESPACE}.record", 369 + repo=repo, 370 + limit=limit, 371 + ) 372 + return records 373 + 374 + def list_lenses( 375 + self, 376 + repo: Optional[str] = None, 377 + limit: int = 100, 378 + ) -> list[dict]: 379 + """List lens records. 380 + 381 + Args: 382 + repo: The DID to query. Defaults to authenticated user. 383 + limit: Maximum number to return. 384 + 385 + Returns: 386 + List of lens records. 387 + """ 388 + records, _ = self.list_records( 389 + f"{LEXICON_NAMESPACE}.lens", 390 + repo=repo, 391 + limit=limit, 392 + ) 393 + return records
+280
src/atdata/atmosphere/lens.py
··· 1 + """Lens transformation publishing for ATProto. 2 + 3 + This module provides classes for publishing Lens transformation records to 4 + ATProto. Lenses are published as ``ac.foundation.dataset.lens`` records. 5 + 6 + Note: 7 + For security reasons, lens code is stored as references to git repositories 8 + rather than inline code. Users must manually install and import lens 9 + implementations. 10 + """ 11 + 12 + from typing import Optional, Callable 13 + 14 + from .client import AtmosphereClient 15 + from ._types import ( 16 + AtUri, 17 + LensRecord, 18 + CodeReference, 19 + LEXICON_NAMESPACE, 20 + ) 21 + 22 + # Import for type checking only 23 + from typing import TYPE_CHECKING 24 + if TYPE_CHECKING: 25 + from ..lens import Lens 26 + 27 + 28 + class LensPublisher: 29 + """Publishes Lens transformation records to ATProto. 30 + 31 + This class creates lens records that reference source and target schemas 32 + and point to the transformation code in a git repository. 33 + 34 + Example: 35 + >>> @atdata.lens 36 + ... def my_lens(source: SourceType) -> TargetType: 37 + ... return TargetType(field=source.other_field) 38 + >>> 39 + >>> client = AtmosphereClient() 40 + >>> client.login("handle", "password") 41 + >>> 42 + >>> publisher = LensPublisher(client) 43 + >>> uri = publisher.publish( 44 + ... name="my_lens", 45 + ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 46 + ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 47 + ... code_repository="https://github.com/user/repo", 48 + ... code_commit="abc123def456", 49 + ... getter_path="mymodule.lenses:my_lens", 50 + ... putter_path="mymodule.lenses:my_lens_putter", 51 + ... ) 52 + 53 + Security Note: 54 + Lens code is stored as references to git repositories rather than 55 + inline code. This prevents arbitrary code execution from ATProto 56 + records. Users must manually install and trust lens implementations. 57 + """ 58 + 59 + def __init__(self, client: AtmosphereClient): 60 + """Initialize the lens publisher. 61 + 62 + Args: 63 + client: Authenticated AtmosphereClient instance. 64 + """ 65 + self.client = client 66 + 67 + def publish( 68 + self, 69 + *, 70 + name: str, 71 + source_schema_uri: str, 72 + target_schema_uri: str, 73 + description: Optional[str] = None, 74 + code_repository: Optional[str] = None, 75 + code_commit: Optional[str] = None, 76 + getter_path: Optional[str] = None, 77 + putter_path: Optional[str] = None, 78 + rkey: Optional[str] = None, 79 + ) -> AtUri: 80 + """Publish a lens transformation record to ATProto. 81 + 82 + Args: 83 + name: Human-readable lens name. 84 + source_schema_uri: AT URI of the source schema. 85 + target_schema_uri: AT URI of the target schema. 86 + description: What this transformation does. 87 + code_repository: Git repository URL containing the lens code. 88 + code_commit: Git commit hash for reproducibility. 89 + getter_path: Module path to the getter function 90 + (e.g., 'mymodule.lenses:my_getter'). 91 + putter_path: Module path to the putter function 92 + (e.g., 'mymodule.lenses:my_putter'). 93 + rkey: Optional explicit record key. 94 + 95 + Returns: 96 + The AT URI of the created lens record. 97 + 98 + Raises: 99 + ValueError: If code references are incomplete. 100 + """ 101 + # Build code references if provided 102 + getter_code: Optional[CodeReference] = None 103 + putter_code: Optional[CodeReference] = None 104 + 105 + if code_repository and code_commit: 106 + if getter_path: 107 + getter_code = CodeReference( 108 + repository=code_repository, 109 + commit=code_commit, 110 + path=getter_path, 111 + ) 112 + if putter_path: 113 + putter_code = CodeReference( 114 + repository=code_repository, 115 + commit=code_commit, 116 + path=putter_path, 117 + ) 118 + 119 + lens_record = LensRecord( 120 + name=name, 121 + source_schema=source_schema_uri, 122 + target_schema=target_schema_uri, 123 + description=description, 124 + getter_code=getter_code, 125 + putter_code=putter_code, 126 + ) 127 + 128 + return self.client.create_record( 129 + collection=f"{LEXICON_NAMESPACE}.lens", 130 + record=lens_record.to_record(), 131 + rkey=rkey, 132 + validate=False, 133 + ) 134 + 135 + def publish_from_lens( 136 + self, 137 + lens_obj: "Lens", 138 + *, 139 + name: str, 140 + source_schema_uri: str, 141 + target_schema_uri: str, 142 + code_repository: str, 143 + code_commit: str, 144 + description: Optional[str] = None, 145 + rkey: Optional[str] = None, 146 + ) -> AtUri: 147 + """Publish a lens record from an existing Lens object. 148 + 149 + This method extracts the getter and putter function names from 150 + the Lens object and publishes a record referencing them. 151 + 152 + Args: 153 + lens_obj: The Lens object to publish. 154 + name: Human-readable lens name. 155 + source_schema_uri: AT URI of the source schema. 156 + target_schema_uri: AT URI of the target schema. 157 + code_repository: Git repository URL. 158 + code_commit: Git commit hash. 159 + description: What this transformation does. 160 + rkey: Optional explicit record key. 161 + 162 + Returns: 163 + The AT URI of the created lens record. 164 + """ 165 + # Extract function names from the lens 166 + getter_name = lens_obj._getter.__name__ 167 + putter_name = lens_obj._putter.__name__ 168 + 169 + # Get module info if available 170 + getter_module = getattr(lens_obj._getter, "__module__", "") 171 + putter_module = getattr(lens_obj._putter, "__module__", "") 172 + 173 + getter_path = f"{getter_module}:{getter_name}" if getter_module else getter_name 174 + putter_path = f"{putter_module}:{putter_name}" if putter_module else putter_name 175 + 176 + return self.publish( 177 + name=name, 178 + source_schema_uri=source_schema_uri, 179 + target_schema_uri=target_schema_uri, 180 + description=description, 181 + code_repository=code_repository, 182 + code_commit=code_commit, 183 + getter_path=getter_path, 184 + putter_path=putter_path, 185 + rkey=rkey, 186 + ) 187 + 188 + 189 + class LensLoader: 190 + """Loads lens records from ATProto. 191 + 192 + This class fetches lens transformation records. Note that actually 193 + using a lens requires installing the referenced code and importing 194 + it manually. 195 + 196 + Example: 197 + >>> client = AtmosphereClient() 198 + >>> loader = LensLoader(client) 199 + >>> 200 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 201 + >>> print(record["name"]) 202 + >>> print(record["sourceSchema"]) 203 + >>> print(record.get("getterCode", {}).get("repository")) 204 + """ 205 + 206 + def __init__(self, client: AtmosphereClient): 207 + """Initialize the lens loader. 208 + 209 + Args: 210 + client: AtmosphereClient instance. 211 + """ 212 + self.client = client 213 + 214 + def get(self, uri: str | AtUri) -> dict: 215 + """Fetch a lens record by AT URI. 216 + 217 + Args: 218 + uri: The AT URI of the lens record. 219 + 220 + Returns: 221 + The lens record as a dictionary. 222 + 223 + Raises: 224 + ValueError: If the record is not a lens record. 225 + """ 226 + record = self.client.get_record(uri) 227 + 228 + expected_type = f"{LEXICON_NAMESPACE}.lens" 229 + if record.get("$type") != expected_type: 230 + raise ValueError( 231 + f"Record at {uri} is not a lens record. " 232 + f"Expected $type='{expected_type}', got '{record.get('$type')}'" 233 + ) 234 + 235 + return record 236 + 237 + def list_all( 238 + self, 239 + repo: Optional[str] = None, 240 + limit: int = 100, 241 + ) -> list[dict]: 242 + """List lens records from a repository. 243 + 244 + Args: 245 + repo: The DID of the repository. Defaults to authenticated user. 246 + limit: Maximum number of records to return. 247 + 248 + Returns: 249 + List of lens records. 250 + """ 251 + return self.client.list_lenses(repo=repo, limit=limit) 252 + 253 + def find_by_schemas( 254 + self, 255 + source_schema_uri: str, 256 + target_schema_uri: Optional[str] = None, 257 + repo: Optional[str] = None, 258 + ) -> list[dict]: 259 + """Find lenses that transform between specific schemas. 260 + 261 + Args: 262 + source_schema_uri: AT URI of the source schema. 263 + target_schema_uri: Optional AT URI of the target schema. 264 + If not provided, returns all lenses from the source. 265 + repo: The DID of the repository to search. 266 + 267 + Returns: 268 + List of matching lens records. 269 + """ 270 + all_lenses = self.list_all(repo=repo, limit=1000) 271 + 272 + matches = [] 273 + for lens_record in all_lenses: 274 + if lens_record.get("sourceSchema") == source_schema_uri: 275 + if target_schema_uri is None: 276 + matches.append(lens_record) 277 + elif lens_record.get("targetSchema") == target_schema_uri: 278 + matches.append(lens_record) 279 + 280 + return matches
+342
src/atdata/atmosphere/records.py
··· 1 + """Dataset record publishing and loading for ATProto. 2 + 3 + This module provides classes for publishing dataset index records to ATProto 4 + and loading them back. Dataset records are published as 5 + ``ac.foundation.dataset.record`` records. 6 + """ 7 + 8 + from typing import Type, TypeVar, Optional 9 + import msgpack 10 + 11 + from .client import AtmosphereClient 12 + from .schema import SchemaPublisher 13 + from ._types import ( 14 + AtUri, 15 + DatasetRecord, 16 + StorageLocation, 17 + LEXICON_NAMESPACE, 18 + ) 19 + 20 + # Import for type checking only to avoid circular imports 21 + from typing import TYPE_CHECKING 22 + if TYPE_CHECKING: 23 + from ..dataset import PackableSample, Dataset 24 + 25 + ST = TypeVar("ST", bound="PackableSample") 26 + 27 + 28 + class DatasetPublisher: 29 + """Publishes dataset index records to ATProto. 30 + 31 + This class creates dataset records that reference a schema and point to 32 + external storage (WebDataset URLs) or ATProto blobs. 33 + 34 + Example: 35 + >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 36 + >>> 37 + >>> client = AtmosphereClient() 38 + >>> client.login("handle", "password") 39 + >>> 40 + >>> publisher = DatasetPublisher(client) 41 + >>> uri = publisher.publish( 42 + ... dataset, 43 + ... name="My Training Data", 44 + ... description="Training data for my model", 45 + ... tags=["computer-vision", "training"], 46 + ... ) 47 + """ 48 + 49 + def __init__(self, client: AtmosphereClient): 50 + """Initialize the dataset publisher. 51 + 52 + Args: 53 + client: Authenticated AtmosphereClient instance. 54 + """ 55 + self.client = client 56 + self._schema_publisher = SchemaPublisher(client) 57 + 58 + def publish( 59 + self, 60 + dataset: "Dataset[ST]", 61 + *, 62 + name: str, 63 + schema_uri: Optional[str] = None, 64 + description: Optional[str] = None, 65 + tags: Optional[list[str]] = None, 66 + license: Optional[str] = None, 67 + auto_publish_schema: bool = True, 68 + schema_version: str = "1.0.0", 69 + rkey: Optional[str] = None, 70 + ) -> AtUri: 71 + """Publish a dataset index record to ATProto. 72 + 73 + Args: 74 + dataset: The Dataset to publish. 75 + name: Human-readable dataset name. 76 + schema_uri: AT URI of the schema record. If not provided and 77 + auto_publish_schema is True, the schema will be published. 78 + description: Human-readable description. 79 + tags: Searchable tags for discovery. 80 + license: SPDX license identifier (e.g., 'MIT', 'Apache-2.0'). 81 + auto_publish_schema: If True and schema_uri not provided, 82 + automatically publish the schema first. 83 + schema_version: Version for auto-published schema. 84 + rkey: Optional explicit record key. 85 + 86 + Returns: 87 + The AT URI of the created dataset record. 88 + 89 + Raises: 90 + ValueError: If schema_uri is not provided and auto_publish_schema is False. 91 + """ 92 + # Ensure we have a schema reference 93 + if schema_uri is None: 94 + if not auto_publish_schema: 95 + raise ValueError( 96 + "schema_uri is required when auto_publish_schema=False" 97 + ) 98 + # Auto-publish the schema 99 + schema_uri_obj = self._schema_publisher.publish( 100 + dataset.sample_type, 101 + version=schema_version, 102 + ) 103 + schema_uri = str(schema_uri_obj) 104 + 105 + # Build the storage location 106 + storage = StorageLocation( 107 + kind="external", 108 + urls=[dataset.url], 109 + ) 110 + 111 + # Build dataset record 112 + metadata_bytes: Optional[bytes] = None 113 + if dataset.metadata is not None: 114 + metadata_bytes = msgpack.packb(dataset.metadata) 115 + 116 + dataset_record = DatasetRecord( 117 + name=name, 118 + schema_ref=schema_uri, 119 + storage=storage, 120 + description=description, 121 + tags=tags or [], 122 + license=license, 123 + metadata=metadata_bytes, 124 + ) 125 + 126 + # Publish to ATProto 127 + return self.client.create_record( 128 + collection=f"{LEXICON_NAMESPACE}.record", 129 + record=dataset_record.to_record(), 130 + rkey=rkey, 131 + validate=False, 132 + ) 133 + 134 + def publish_with_urls( 135 + self, 136 + urls: list[str], 137 + schema_uri: str, 138 + *, 139 + name: str, 140 + description: Optional[str] = None, 141 + tags: Optional[list[str]] = None, 142 + license: Optional[str] = None, 143 + metadata: Optional[dict] = None, 144 + rkey: Optional[str] = None, 145 + ) -> AtUri: 146 + """Publish a dataset record with explicit URLs. 147 + 148 + This method allows publishing a dataset record without having a 149 + Dataset object, useful for registering existing WebDataset files. 150 + 151 + Args: 152 + urls: List of WebDataset URLs with brace notation. 153 + schema_uri: AT URI of the schema record. 154 + name: Human-readable dataset name. 155 + description: Human-readable description. 156 + tags: Searchable tags for discovery. 157 + license: SPDX license identifier. 158 + metadata: Arbitrary metadata dictionary. 159 + rkey: Optional explicit record key. 160 + 161 + Returns: 162 + The AT URI of the created dataset record. 163 + """ 164 + storage = StorageLocation( 165 + kind="external", 166 + urls=urls, 167 + ) 168 + 169 + metadata_bytes: Optional[bytes] = None 170 + if metadata is not None: 171 + metadata_bytes = msgpack.packb(metadata) 172 + 173 + dataset_record = DatasetRecord( 174 + name=name, 175 + schema_ref=schema_uri, 176 + storage=storage, 177 + description=description, 178 + tags=tags or [], 179 + license=license, 180 + metadata=metadata_bytes, 181 + ) 182 + 183 + return self.client.create_record( 184 + collection=f"{LEXICON_NAMESPACE}.record", 185 + record=dataset_record.to_record(), 186 + rkey=rkey, 187 + validate=False, 188 + ) 189 + 190 + 191 + class DatasetLoader: 192 + """Loads dataset records from ATProto. 193 + 194 + This class fetches dataset index records and can create Dataset objects 195 + from them. Note that loading a dataset requires having the corresponding 196 + Python class for the sample type. 197 + 198 + Example: 199 + >>> client = AtmosphereClient() 200 + >>> loader = DatasetLoader(client) 201 + >>> 202 + >>> # List available datasets 203 + >>> datasets = loader.list() 204 + >>> for ds in datasets: 205 + ... print(ds["name"], ds["schemaRef"]) 206 + >>> 207 + >>> # Get a specific dataset record 208 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz") 209 + """ 210 + 211 + def __init__(self, client: AtmosphereClient): 212 + """Initialize the dataset loader. 213 + 214 + Args: 215 + client: AtmosphereClient instance. 216 + """ 217 + self.client = client 218 + 219 + def get(self, uri: str | AtUri) -> dict: 220 + """Fetch a dataset record by AT URI. 221 + 222 + Args: 223 + uri: The AT URI of the dataset record. 224 + 225 + Returns: 226 + The dataset record as a dictionary. 227 + 228 + Raises: 229 + ValueError: If the record is not a dataset record. 230 + """ 231 + record = self.client.get_record(uri) 232 + 233 + expected_type = f"{LEXICON_NAMESPACE}.record" 234 + if record.get("$type") != expected_type: 235 + raise ValueError( 236 + f"Record at {uri} is not a dataset record. " 237 + f"Expected $type='{expected_type}', got '{record.get('$type')}'" 238 + ) 239 + 240 + return record 241 + 242 + def list_all( 243 + self, 244 + repo: Optional[str] = None, 245 + limit: int = 100, 246 + ) -> list[dict]: 247 + """List dataset records from a repository. 248 + 249 + Args: 250 + repo: The DID of the repository. Defaults to authenticated user. 251 + limit: Maximum number of records to return. 252 + 253 + Returns: 254 + List of dataset records. 255 + """ 256 + return self.client.list_datasets(repo=repo, limit=limit) 257 + 258 + def get_urls(self, uri: str | AtUri) -> list[str]: 259 + """Get the WebDataset URLs from a dataset record. 260 + 261 + Args: 262 + uri: The AT URI of the dataset record. 263 + 264 + Returns: 265 + List of WebDataset URLs. 266 + 267 + Raises: 268 + ValueError: If the storage type is not external URLs. 269 + """ 270 + record = self.get(uri) 271 + storage = record.get("storage", {}) 272 + 273 + storage_type = storage.get("$type", "") 274 + if "storageExternal" in storage_type: 275 + return storage.get("urls", []) 276 + elif "storageBlobs" in storage_type: 277 + raise ValueError( 278 + "Dataset uses blob storage, not external URLs. " 279 + "Use get_blobs() instead." 280 + ) 281 + else: 282 + raise ValueError(f"Unknown storage type: {storage_type}") 283 + 284 + def get_metadata(self, uri: str | AtUri) -> Optional[dict]: 285 + """Get the metadata from a dataset record. 286 + 287 + Args: 288 + uri: The AT URI of the dataset record. 289 + 290 + Returns: 291 + The metadata dictionary, or None if no metadata. 292 + """ 293 + record = self.get(uri) 294 + metadata_bytes = record.get("metadata") 295 + 296 + if metadata_bytes is None: 297 + return None 298 + 299 + return msgpack.unpackb(metadata_bytes, raw=False) 300 + 301 + def to_dataset( 302 + self, 303 + uri: str | AtUri, 304 + sample_type: Type[ST], 305 + ) -> "Dataset[ST]": 306 + """Create a Dataset object from an ATProto record. 307 + 308 + This method creates a Dataset instance from a published record. 309 + You must provide the sample type class, which should match the 310 + schema referenced by the record. 311 + 312 + Args: 313 + uri: The AT URI of the dataset record. 314 + sample_type: The Python class for the sample type. 315 + 316 + Returns: 317 + A Dataset instance configured from the record. 318 + 319 + Raises: 320 + ValueError: If the storage type is not external URLs. 321 + 322 + Example: 323 + >>> loader = DatasetLoader(client) 324 + >>> dataset = loader.to_dataset(uri, MySampleType) 325 + >>> for batch in dataset.shuffled(batch_size=32): 326 + ... process(batch) 327 + """ 328 + # Import here to avoid circular import 329 + from ..dataset import Dataset 330 + 331 + urls = self.get_urls(uri) 332 + if not urls: 333 + raise ValueError("Dataset record has no URLs") 334 + 335 + # Use the first URL (multi-URL support could be added later) 336 + url = urls[0] 337 + 338 + # Get metadata URL if available 339 + record = self.get(uri) 340 + metadata_url = record.get("metadataUrl") 341 + 342 + return Dataset[sample_type](url, metadata_url=metadata_url)
+296
src/atdata/atmosphere/schema.py
··· 1 + """Schema publishing and loading for ATProto. 2 + 3 + This module provides classes for publishing PackableSample schemas to ATProto 4 + and loading them back. Schemas are published as ``ac.foundation.dataset.sampleSchema`` 5 + records. 6 + """ 7 + 8 + from dataclasses import fields, is_dataclass 9 + from typing import Type, TypeVar, Optional, Union, get_type_hints, get_origin, get_args 10 + import types 11 + 12 + from .client import AtmosphereClient 13 + from ._types import ( 14 + AtUri, 15 + SchemaRecord, 16 + FieldDef, 17 + FieldType, 18 + LEXICON_NAMESPACE, 19 + ) 20 + 21 + # Import for type checking only to avoid circular imports 22 + from typing import TYPE_CHECKING 23 + if TYPE_CHECKING: 24 + from ..dataset import PackableSample 25 + 26 + ST = TypeVar("ST", bound="PackableSample") 27 + 28 + 29 + class SchemaPublisher: 30 + """Publishes PackableSample schemas to ATProto. 31 + 32 + This class introspects a PackableSample class to extract its field 33 + definitions and publishes them as an ATProto schema record. 34 + 35 + Example: 36 + >>> @atdata.packable 37 + ... class MySample: 38 + ... image: NDArray 39 + ... label: str 40 + ... 41 + >>> client = AtmosphereClient() 42 + >>> client.login("handle", "password") 43 + >>> 44 + >>> publisher = SchemaPublisher(client) 45 + >>> uri = publisher.publish(MySample, version="1.0.0") 46 + >>> print(uri) 47 + at://did:plc:.../ac.foundation.dataset.sampleSchema/... 48 + """ 49 + 50 + def __init__(self, client: AtmosphereClient): 51 + """Initialize the schema publisher. 52 + 53 + Args: 54 + client: Authenticated AtmosphereClient instance. 55 + """ 56 + self.client = client 57 + 58 + def publish( 59 + self, 60 + sample_type: Type[ST], 61 + *, 62 + name: Optional[str] = None, 63 + version: str = "1.0.0", 64 + description: Optional[str] = None, 65 + metadata: Optional[dict] = None, 66 + rkey: Optional[str] = None, 67 + ) -> AtUri: 68 + """Publish a PackableSample schema to ATProto. 69 + 70 + Args: 71 + sample_type: The PackableSample class to publish. 72 + name: Human-readable name. Defaults to the class name. 73 + version: Semantic version string (e.g., '1.0.0'). 74 + description: Human-readable description. 75 + metadata: Arbitrary metadata dictionary. 76 + rkey: Optional explicit record key. If not provided, a TID is generated. 77 + 78 + Returns: 79 + The AT URI of the created schema record. 80 + 81 + Raises: 82 + ValueError: If sample_type is not a dataclass or client is not authenticated. 83 + TypeError: If a field type is not supported. 84 + """ 85 + if not is_dataclass(sample_type): 86 + raise ValueError(f"{sample_type.__name__} must be a dataclass (use @packable)") 87 + 88 + # Build the schema record 89 + schema_record = self._build_schema_record( 90 + sample_type, 91 + name=name, 92 + version=version, 93 + description=description, 94 + metadata=metadata, 95 + ) 96 + 97 + # Publish to ATProto 98 + return self.client.create_record( 99 + collection=f"{LEXICON_NAMESPACE}.sampleSchema", 100 + record=schema_record.to_record(), 101 + rkey=rkey, 102 + validate=False, # PDS doesn't know our lexicon 103 + ) 104 + 105 + def _build_schema_record( 106 + self, 107 + sample_type: Type[ST], 108 + *, 109 + name: Optional[str], 110 + version: str, 111 + description: Optional[str], 112 + metadata: Optional[dict], 113 + ) -> SchemaRecord: 114 + """Build a SchemaRecord from a PackableSample class.""" 115 + field_defs = [] 116 + type_hints = get_type_hints(sample_type) 117 + 118 + for f in fields(sample_type): 119 + field_type = type_hints.get(f.name, f.type) 120 + field_def = self._field_to_def(f.name, field_type) 121 + field_defs.append(field_def) 122 + 123 + return SchemaRecord( 124 + name=name or sample_type.__name__, 125 + version=version, 126 + description=description, 127 + fields=field_defs, 128 + metadata=metadata, 129 + ) 130 + 131 + def _field_to_def(self, name: str, python_type) -> FieldDef: 132 + """Convert a Python field to a FieldDef.""" 133 + # Check for Optional types (Union with None) 134 + is_optional = False 135 + origin = get_origin(python_type) 136 + 137 + # Handle Union types (including Optional which is Union[T, None]) 138 + if origin is Union or isinstance(python_type, types.UnionType): 139 + args = get_args(python_type) 140 + non_none_args = [a for a in args if a is not type(None)] 141 + if type(None) in args or len(non_none_args) < len(args): 142 + is_optional = True 143 + if len(non_none_args) == 1: 144 + python_type = non_none_args[0] 145 + elif len(non_none_args) > 1: 146 + # Complex union type - not fully supported yet 147 + raise TypeError(f"Complex union types not supported: {python_type}") 148 + 149 + field_type = self._python_type_to_field_type(python_type) 150 + 151 + return FieldDef( 152 + name=name, 153 + field_type=field_type, 154 + optional=is_optional, 155 + ) 156 + 157 + def _python_type_to_field_type(self, python_type) -> FieldType: 158 + """Map a Python type to a FieldType.""" 159 + # Handle primitives 160 + if python_type is str: 161 + return FieldType(kind="primitive", primitive="str") 162 + elif python_type is int: 163 + return FieldType(kind="primitive", primitive="int") 164 + elif python_type is float: 165 + return FieldType(kind="primitive", primitive="float") 166 + elif python_type is bool: 167 + return FieldType(kind="primitive", primitive="bool") 168 + elif python_type is bytes: 169 + return FieldType(kind="primitive", primitive="bytes") 170 + 171 + # Check for NDArray 172 + # NDArray from numpy.typing is a special generic alias 173 + type_str = str(python_type) 174 + if "NDArray" in type_str or "ndarray" in type_str.lower(): 175 + # Try to extract dtype info if available 176 + dtype = "float32" # Default 177 + args = get_args(python_type) 178 + if args: 179 + # NDArray[np.float64] or similar 180 + dtype_arg = args[-1] if args else None 181 + if dtype_arg is not None: 182 + dtype = self._numpy_dtype_to_string(dtype_arg) 183 + 184 + return FieldType(kind="ndarray", dtype=dtype, shape=None) 185 + 186 + # Check for list/array types 187 + origin = get_origin(python_type) 188 + if origin is list: 189 + args = get_args(python_type) 190 + if args: 191 + items = self._python_type_to_field_type(args[0]) 192 + return FieldType(kind="array", items=items) 193 + else: 194 + # Untyped list 195 + return FieldType(kind="array", items=FieldType(kind="primitive", primitive="str")) 196 + 197 + # Check for nested PackableSample (not yet supported) 198 + if is_dataclass(python_type): 199 + raise TypeError( 200 + f"Nested dataclass types not yet supported: {python_type.__name__}. " 201 + "Publish nested types separately and use references." 202 + ) 203 + 204 + raise TypeError(f"Unsupported type for schema field: {python_type}") 205 + 206 + def _numpy_dtype_to_string(self, dtype) -> str: 207 + """Convert a numpy dtype annotation to a string.""" 208 + dtype_str = str(dtype) 209 + # Handle common numpy dtypes 210 + dtype_map = { 211 + "float16": "float16", 212 + "float32": "float32", 213 + "float64": "float64", 214 + "int8": "int8", 215 + "int16": "int16", 216 + "int32": "int32", 217 + "int64": "int64", 218 + "uint8": "uint8", 219 + "uint16": "uint16", 220 + "uint32": "uint32", 221 + "uint64": "uint64", 222 + "bool": "bool", 223 + "complex64": "complex64", 224 + "complex128": "complex128", 225 + } 226 + 227 + for key, value in dtype_map.items(): 228 + if key in dtype_str: 229 + return value 230 + 231 + return "float32" # Default fallback 232 + 233 + 234 + class SchemaLoader: 235 + """Loads PackableSample schemas from ATProto. 236 + 237 + This class fetches schema records from ATProto and can list available 238 + schemas from a repository. 239 + 240 + Example: 241 + >>> client = AtmosphereClient() 242 + >>> client.login("handle", "password") 243 + >>> 244 + >>> loader = SchemaLoader(client) 245 + >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 246 + >>> print(schema["name"]) 247 + 'MySample' 248 + """ 249 + 250 + def __init__(self, client: AtmosphereClient): 251 + """Initialize the schema loader. 252 + 253 + Args: 254 + client: AtmosphereClient instance (authentication optional for reads). 255 + """ 256 + self.client = client 257 + 258 + def get(self, uri: str | AtUri) -> dict: 259 + """Fetch a schema record by AT URI. 260 + 261 + Args: 262 + uri: The AT URI of the schema record. 263 + 264 + Returns: 265 + The schema record as a dictionary. 266 + 267 + Raises: 268 + ValueError: If the record is not a schema record. 269 + atproto.exceptions.AtProtocolError: If record not found. 270 + """ 271 + record = self.client.get_record(uri) 272 + 273 + expected_type = f"{LEXICON_NAMESPACE}.sampleSchema" 274 + if record.get("$type") != expected_type: 275 + raise ValueError( 276 + f"Record at {uri} is not a schema record. " 277 + f"Expected $type='{expected_type}', got '{record.get('$type')}'" 278 + ) 279 + 280 + return record 281 + 282 + def list_all( 283 + self, 284 + repo: Optional[str] = None, 285 + limit: int = 100, 286 + ) -> list[dict]: 287 + """List schema records from a repository. 288 + 289 + Args: 290 + repo: The DID of the repository. Defaults to authenticated user. 291 + limit: Maximum number of records to return. 292 + 293 + Returns: 294 + List of schema records. 295 + """ 296 + return self.client.list_schemas(repo=repo, limit=limit)
+40 -169
src/atdata/dataset.py
··· 32 32 33 33 from pathlib import Path 34 34 import uuid 35 - import functools 36 35 37 36 import dataclasses 38 37 import types ··· 40 39 dataclass, 41 40 asdict, 42 41 ) 43 - from abc import ( 44 - ABC, 45 - abstractmethod, 46 - ) 42 + from abc import ABC 47 43 48 44 from tqdm import tqdm 49 45 import numpy as np 50 46 import pandas as pd 47 + import requests 51 48 52 49 import typing 53 50 from typing import ( ··· 65 62 TypeVar, 66 63 TypeAlias, 67 64 ) 68 - # from typing_inspect import get_bound, get_parameters 69 - from numpy.typing import ( 70 - NDArray, 71 - ArrayLike, 72 - ) 73 - 74 - # 75 - 76 - # import ekumen.atmosphere as eat 65 + from numpy.typing import NDArray 77 66 78 67 import msgpack 79 68 import ormsgpack ··· 96 85 ## 97 86 # Main base classes 98 87 99 - # TODO Check for best way to ensure this typevar is used as a dataclass type 100 - # DT = TypeVar( 'DT', bound = dataclass.__class__ ) 101 88 DT = TypeVar( 'DT' ) 102 89 103 90 MsgpackRawSample: TypeAlias = Dict[str, Any] 104 91 105 - # @dataclass 106 - # class ArrayBytes: 107 - # """Annotates bytes that should be interpreted as the raw contents of a 108 - # numpy NDArray""" 109 - 110 - # raw_bytes: bytes 111 - # """The raw bytes of the corresponding NDArray""" 112 - 113 - # def __init__( self, 114 - # array: Optional[ArrayLike] = None, 115 - # raw: Optional[bytes] = None, 116 - # ): 117 - # """TODO""" 118 - 119 - # if array is not None: 120 - # array = np.array( array ) 121 - # self.raw_bytes = eh.array_to_bytes( array ) 122 - 123 - # elif raw is not None: 124 - # self.raw_bytes = raw 125 - 126 - # else: 127 - # raise ValueError( 'Must provide either `array` or `raw` bytes' ) 128 - 129 - # @property 130 - # def to_numpy( self ) -> NDArray: 131 - # """Return the `raw_bytes` data as an NDArray""" 132 - # return eh.bytes_to_array( self.raw_bytes ) 133 92 134 93 def _make_packable( x ): 135 94 """Convert a value to a msgpack-compatible format. ··· 141 100 Returns: 142 101 The value in a format suitable for msgpack serialization. 143 102 """ 144 - # if isinstance( x, ArrayBytes ): 145 - # return x.raw_bytes 146 103 if isinstance( x, np.ndarray ): 147 104 return eh.array_to_bytes( x ) 148 105 return x ··· 226 183 # based on what is provided 227 184 228 185 if isinstance( var_cur_value, np.ndarray ): 229 - # we're good! 230 - pass 231 - 232 - # elif isinstance( var_cur_value, ArrayBytes ): 233 - # setattr( self, var_name, var_cur_value.to_numpy ) 186 + # Already the correct type, no conversion needed 187 + continue 234 188 235 189 elif isinstance( var_cur_value, bytes ): 236 190 # TODO This does create a constraint that serialized bytes ··· 411 365 raise AttributeError( f'No sample attribute named {name}' ) 412 366 413 367 414 - # class AnySample( BaseModel ): 415 - # """A sample that can hold anything""" 416 - # value: Any 417 - 418 - # class AnyBatch( BaseModel ): 419 - # """A batch of `AnySample`s""" 420 - # values: list[AnySample] 421 - 422 - 423 368 ST = TypeVar( 'ST', bound = PackableSample ) 424 - # BT = TypeVar( 'BT' ) 425 - 426 369 RT = TypeVar( 'RT', bound = PackableSample ) 427 370 428 - # TODO For python 3.13 429 - # BT = TypeVar( 'BT', default = None ) 430 - # IT = TypeVar( 'IT', default = Any ) 431 - 432 371 class Dataset( Generic[ST] ): 433 372 """A typed dataset built on WebDataset with lens transformations. 434 373 ··· 456 395 ... 457 396 >>> # Transform to a different view 458 397 >>> ds_view = ds.as_type(MyDataView) 398 + 459 399 """ 460 400 461 - # sample_class: Type = get_parameters( ) 462 - # """The type of each returned sample from this `Dataset`'s iterator""" 463 - # batch_class: Type = get_bound( BT ) 464 - # """The type of a batch built from `sample_class`""" 465 - 466 401 @property 467 402 def sample_type( self ) -> Type: 468 403 """The type of each returned sample from this dataset's iterator. ··· 482 417 Returns: 483 418 ``SampleBatch[ST]`` where ``ST`` is this dataset's sample type. 484 419 """ 485 - # return self.__orig_class__.__args__[1] 486 420 return SampleBatch[self.sample_type] 487 421 488 - 489 - # _schema_registry_sample: dict[str, Type] 490 - # _schema_registry_batch: dict[str, Type | None] 491 - 492 - # 493 - 494 - def __init__( self, url: str ) -> None: 422 + def __init__( self, url: str, 423 + metadata_url: str | None = None, 424 + ) -> None: 495 425 """Create a dataset from a WebDataset URL. 496 426 497 427 Args: ··· 501 431 """ 502 432 super().__init__() 503 433 self.url = url 434 + """WebDataset brace-notation URL pointing to tar files, e.g., 435 + ``"path/to/file-{000000..000009}.tar"`` for multiple shards or 436 + ``"path/to/file-000000.tar"`` for a single shard. 437 + """ 438 + 439 + self._metadata: dict[str, Any] | None = None 440 + self.metadata_url: str | None = metadata_url 441 + """Optional URL to msgpack-encoded metadata for this dataset.""" 504 442 505 443 # Allow addition of automatic transformation of raw underlying data 506 444 self._output_lens: Lens | None = None ··· 527 465 ret._output_lens = lenses.transform( self.sample_type, ret.sample_type ) 528 466 return ret 529 467 530 - # @classmethod 531 - # def register( cls, uri: str, 532 - # sample_class: Type, 533 - # batch_class: Optional[Type] = None, 534 - # ): 535 - # """Register an `ekumen` schema to use a particular dataset sample class""" 536 - # cls._schema_registry_sample[uri] = sample_class 537 - # cls._schema_registry_batch[uri] = batch_class 538 - 539 - # @classmethod 540 - # def at( cls, uri: str ) -> 'Dataset': 541 - # """Create a Dataset for the `ekumen` index entry at `uri`""" 542 - # client = eat.Client() 543 - # return cls( ) 544 - 545 - # Common functionality 546 - 547 468 @property 548 469 def shard_list( self ) -> list[str]: 549 470 """List of individual dataset shards ··· 557 478 wds.filters.map( lambda x: x['url'] ) 558 479 ) 559 480 return list( pipe ) 481 + 482 + @property 483 + def metadata( self ) -> dict[str, Any] | None: 484 + """Fetch and cache metadata from metadata_url. 485 + 486 + Returns: 487 + Deserialized metadata dictionary, or None if no metadata_url is set. 488 + 489 + Raises: 490 + requests.HTTPError: If metadata fetch fails. 491 + """ 492 + if self.metadata_url is None: 493 + return None 494 + 495 + if self._metadata is None: 496 + with requests.get( self.metadata_url, stream = True ) as response: 497 + response.raise_for_status() 498 + self._metadata = msgpack.unpackb( response.content, raw = False ) 499 + 500 + # Use our cached values 501 + return self._metadata 560 502 561 503 def ordered( self, 562 504 batch_size: int | None = 1, ··· 575 517 """ 576 518 577 519 if batch_size is None: 578 - # TODO Duplication here 579 520 return wds.pipeline.DataPipeline( 580 521 wds.shardlists.SimpleShardList( self.url ), 581 522 wds.shardlists.split_by_worker, 582 - # 583 523 wds.tariterators.tarfile_to_samples(), 584 - # wds.map( self.preprocess ), 585 524 wds.filters.map( self.wrap ), 586 525 ) 587 526 588 527 return wds.pipeline.DataPipeline( 589 528 wds.shardlists.SimpleShardList( self.url ), 590 529 wds.shardlists.split_by_worker, 591 - # 592 530 wds.tariterators.tarfile_to_samples(), 593 - # wds.map( self.preprocess ), 594 531 wds.filters.batched( batch_size ), 595 532 wds.filters.map( self.wrap_batch ), 596 533 ) ··· 618 555 ``SampleBatch[ST]`` instances; otherwise yields individual ``ST`` 619 556 samples. 620 557 """ 621 - 622 558 if batch_size is None: 623 - # TODO Duplication here 624 559 return wds.pipeline.DataPipeline( 625 560 wds.shardlists.SimpleShardList( self.url ), 626 561 wds.filters.shuffle( buffer_shards ), 627 562 wds.shardlists.split_by_worker, 628 - # 629 563 wds.tariterators.tarfile_to_samples(), 630 - # wds.shuffle( buffer_samples ), 631 - # wds.map( self.preprocess ), 632 564 wds.filters.shuffle( buffer_samples ), 633 565 wds.filters.map( self.wrap ), 634 566 ) ··· 637 569 wds.shardlists.SimpleShardList( self.url ), 638 570 wds.filters.shuffle( buffer_shards ), 639 571 wds.shardlists.split_by_worker, 640 - # 641 572 wds.tariterators.tarfile_to_samples(), 642 - # wds.shuffle( buffer_samples ), 643 - # wds.map( self.preprocess ), 644 573 wds.filters.shuffle( buffer_samples ), 645 574 wds.filters.batched( batch_size ), 646 575 wds.filters.map( self.wrap_batch ), ··· 683 612 684 613 cur_segment = 0 685 614 cur_buffer = [] 686 - path_template = (path.parent / f'{path.stem}-%06d.{path.suffix}').as_posix() 615 + path_template = (path.parent / f'{path.stem}-{{:06d}}{path.suffix}').as_posix() 687 616 688 617 for x in self.ordered( batch_size = None ): 689 618 cur_buffer.append( sample_map( x ) ) 690 - 619 + 691 620 if len( cur_buffer ) >= maxcount: 692 621 # Write current segment 693 622 cur_path = path_template.format( cur_segment ) ··· 703 632 df = pd.DataFrame( cur_buffer ) 704 633 df.to_parquet( cur_path, **kwargs ) 705 634 706 - 707 - # Implemented by specific subclasses 708 - 709 - # @property 710 - # @abstractmethod 711 - # def url( self ) -> str: 712 - # """str: Brace-notation URL of the underlying full WebDataset""" 713 - # pass 714 - 715 - # @classmethod 716 - # # TODO replace Any with IT 717 - # def preprocess( cls, sample: WDSRawSample ) -> Any: 718 - # """Pre-built preprocessor for a raw `sample` from the given dataset""" 719 - # return sample 720 - 721 - # @classmethod 722 - # TODO replace Any with IT 723 635 def wrap( self, sample: MsgpackRawSample ) -> ST: 724 636 """Wrap a raw msgpack sample into the appropriate dataset-specific type. 725 637 ··· 739 651 740 652 source_sample = self._output_lens.source_type.from_bytes( sample['msgpack'] ) 741 653 return self._output_lens( source_sample ) 742 - 743 - # try: 744 - # assert type( sample ) == dict 745 - # return cls.sample_class( **{ 746 - # k: v 747 - # for k, v in sample.items() if k != '__key__' 748 - # } ) 749 - 750 - # except Exception as e: 751 - # # Sample constructor failed -- revert to default 752 - # return AnySample( 753 - # value = sample, 754 - # ) 755 654 756 655 def wrap_batch( self, batch: WDSRawBatch ) -> SampleBatch[ST]: 757 656 """Wrap a batch of raw msgpack samples into a typed SampleBatch. ··· 782 681 for s in batch_source ] 783 682 return SampleBatch[self.sample_type]( batch_view ) 784 683 785 - # # @classmethod 786 - # def wrap_batch( self, batch: WDSRawBatch ) -> BT: 787 - # """Wrap a `batch` of samples into the appropriate dataset-specific type 788 - 789 - # This default implementation simply creates a list one sample at a time 790 - # """ 791 - # assert cls.batch_class is not None, 'No batch class specified' 792 - # return cls.batch_class( **batch ) 793 - 794 - 795 - ## 796 - # Shortcut decorators 797 - 798 - # def packable( cls ): 799 - # """TODO""" 800 - 801 - # def decorator( cls ): 802 - # # Create a new class dynamically 803 - # # The new class inherits from the new_parent_class first, then the original cls 804 - # new_bases = (PackableSample,) + cls.__bases__ 805 - # new_cls = type(cls.__name__, new_bases, dict(cls.__dict__)) 806 - 807 - # # Optionally, update __module__ and __qualname__ for better introspection 808 - # new_cls.__module__ = cls.__module__ 809 - # new_cls.__qualname__ = cls.__qualname__ 810 - 811 - # return new_cls 812 - # return decorator 813 684 814 685 def packable( cls ): 815 686 """Decorator to convert a regular class into a ``PackableSample``.
+2 -55
src/atdata/lens.py
··· 201 201 """ 202 202 return self._getter( s ) 203 203 204 - # TODO Figure out how to properly parameterize this 205 - # def _lens_factory[S, V]( register: bool = True ): 206 - # """Register the annotated function `f` as the getter of a sample lens""" 207 - 208 - # # The actual lens decorator taking a lens getter function to a lens object 209 - # def _decorator( f: LensGetter[S, V] ) -> Lens[S, V]: 210 - # ret = Lens[S, V]( f ) 211 - # if register: 212 - # _network.register( ret ) 213 - # return ret 214 - 215 - # # Return the lens decorator 216 - # return _decorator 217 - 218 - # # For convenience 219 - # lens = _lens_factory 220 204 221 205 def lens( f: LensGetter[S, V] ) -> Lens[S, V]: 222 206 """Decorator to create and register a lens transformation. ··· 245 229 _network.register( ret ) 246 230 return ret 247 231 248 - 249 - ## 250 - # Global registry of used lenses 251 - 252 - # _registered_lenses: Dict[LensSignature, Lens] = dict() 253 - # """TODO""" 254 232 255 233 class LensNetwork: 256 234 """Global registry for lens transformations between sample types. ··· 292 270 If a lens already exists for the same type pair, it will be 293 271 overwritten. 294 272 """ 295 - 296 - # sig = inspect.signature( _lens.get ) 297 - # input_types = list( sig.parameters.values() ) 298 - # assert len( input_types ) == 1, \ 299 - # 'Wrong number of input args for lens: should only have one' 300 - 301 - # input_type = input_types[0].annotation 302 - # print( input_type ) 303 - # output_type = sig.return_annotation 304 - 305 - # self._registry[input_type, output_type] = _lens 306 - # print( _lens.source_type ) 307 273 self._registry[_lens.source_type, _lens.view_type] = _lens 308 274 309 275 def transform( self, source: DatasetType, view: DatasetType ) -> Lens: ··· 323 289 Currently only supports direct transformations. Compositional 324 290 transformations (chaining multiple lenses) are not yet implemented. 325 291 """ 326 - 327 - # TODO Handle compositional closure 328 292 ret = self._registry.get( (source, view), None ) 329 293 if ret is None: 330 294 raise ValueError( f'No registered lens from source {source} to view {view}' ) ··· 332 296 return ret 333 297 334 298 335 - # Create global singleton registry instance 336 - _network = LensNetwork() 337 - 338 - # def lens( f: LensPutter ) -> Lens: 339 - # """Register the annotated function `f` as a sample lens""" 340 - # ## 341 - 342 - # sig = inspect.signature( f ) 343 - 344 - # input_types = list( sig.parameters.values() ) 345 - # output_type = sig.return_annotation 346 - 347 - # _registered_lenses[] 348 - 349 - # f.lens = Lens( 350 - 351 - # ) 352 - 353 - # return f 299 + # Global singleton registry instance 300 + _network = LensNetwork()
+492
src/atdata/local.py
··· 1 + """Local repository storage for atdata datasets. 2 + 3 + This module provides a local storage backend for atdata datasets using: 4 + - S3-compatible object storage for dataset tar files and metadata 5 + - Redis for indexing and tracking datasets 6 + 7 + The main classes are: 8 + - Repo: Manages dataset storage in S3 with Redis indexing 9 + - Index: Redis-backed index for tracking dataset metadata 10 + - BasicIndexEntry: Index entry representing a stored dataset 11 + 12 + This is intended for development and small-scale deployment before 13 + migrating to the full atproto PDS infrastructure. 14 + """ 15 + 16 + ## 17 + # Imports 18 + 19 + from atdata import ( 20 + PackableSample, 21 + Dataset, 22 + ) 23 + 24 + import os 25 + from pathlib import Path 26 + from uuid import uuid4 27 + from tempfile import TemporaryDirectory 28 + from dotenv import dotenv_values 29 + import msgpack 30 + 31 + from redis import Redis 32 + 33 + from s3fs import ( 34 + S3FileSystem, 35 + ) 36 + 37 + import webdataset as wds 38 + 39 + from dataclasses import ( 40 + dataclass, 41 + asdict, 42 + field, 43 + ) 44 + from typing import ( 45 + Any, 46 + Optional, 47 + Dict, 48 + Type, 49 + TypeVar, 50 + Generator, 51 + BinaryIO, 52 + cast, 53 + ) 54 + 55 + T = TypeVar( 'T', bound = PackableSample ) 56 + 57 + 58 + ## 59 + # Helpers 60 + 61 + def _kind_str_for_sample_type( st: Type[PackableSample] ) -> str: 62 + """Convert a sample type to a fully-qualified string identifier. 63 + 64 + Args: 65 + st: The sample type class. 66 + 67 + Returns: 68 + A string in the format 'module.name' identifying the sample type. 69 + """ 70 + return f'{st.__module__}.{st.__name__}' 71 + 72 + def _decode_bytes_dict( d: dict[bytes, bytes] ) -> dict[str, str]: 73 + """Decode a dictionary with byte keys and values to strings. 74 + 75 + Redis returns dictionaries with bytes keys/values, this converts them to strings. 76 + 77 + Args: 78 + d: Dictionary with bytes keys and values. 79 + 80 + Returns: 81 + Dictionary with UTF-8 decoded string keys and values. 82 + """ 83 + return { 84 + k.decode('utf-8'): v.decode('utf-8') 85 + for k, v in d.items() 86 + } 87 + 88 + 89 + ## 90 + # Redis object model 91 + 92 + @dataclass 93 + class BasicIndexEntry: 94 + """Index entry for a dataset stored in the repository. 95 + 96 + Tracks metadata about a dataset stored in S3, including its location, 97 + type, and unique identifier. 98 + """ 99 + ## 100 + 101 + wds_url: str 102 + """WebDataset URL for the dataset tar files, for use with atdata.Dataset.""" 103 + 104 + sample_kind: str 105 + """Fully-qualified sample type name (e.g., 'module.ClassName').""" 106 + 107 + metadata_url: str | None 108 + """S3 URL to the dataset's metadata msgpack file, if any.""" 109 + 110 + uuid: str = field( default_factory = lambda: str( uuid4() ) ) 111 + """Unique identifier for this dataset entry. Defaults to a new UUID if not provided.""" 112 + 113 + def write_to( self, redis: Redis ): 114 + """Persist this index entry to Redis. 115 + 116 + Stores the entry as a Redis hash with key 'BasicIndexEntry:{uuid}'. 117 + 118 + Args: 119 + redis: Redis connection to write to. 120 + """ 121 + save_key = f'BasicIndexEntry:{self.uuid}' 122 + # Filter out None values - Redis doesn't accept None 123 + data = {k: v for k, v in asdict(self).items() if v is not None} 124 + # redis-py typing uses untyped dict, so type checker complains about dict[str, Any] 125 + redis.hset( save_key, mapping = data ) # type: ignore[arg-type] 126 + 127 + def _s3_env( credentials_path: str | Path ) -> dict[str, Any]: 128 + """Load S3 credentials from a .env file. 129 + 130 + Args: 131 + credentials_path: Path to .env file containing S3 credentials. 132 + 133 + Returns: 134 + Dictionary with AWS_ENDPOINT, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. 135 + 136 + Raises: 137 + AssertionError: If required credentials are missing from the file. 138 + """ 139 + ## 140 + credentials_path = Path( credentials_path ) 141 + env_values = dotenv_values( credentials_path ) 142 + assert 'AWS_ENDPOINT' in env_values 143 + assert 'AWS_ACCESS_KEY_ID' in env_values 144 + assert 'AWS_SECRET_ACCESS_KEY' in env_values 145 + 146 + return { 147 + k: env_values[k] 148 + for k in ( 149 + 'AWS_ENDPOINT', 150 + 'AWS_ACCESS_KEY_ID', 151 + 'AWS_SECRET_ACCESS_KEY', 152 + ) 153 + } 154 + 155 + def _s3_from_credentials( creds: str | Path | dict ) -> S3FileSystem: 156 + """Create an S3FileSystem from credentials. 157 + 158 + Args: 159 + creds: Either a path to a .env file with credentials, or a dict 160 + containing AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally 161 + AWS_ENDPOINT. 162 + 163 + Returns: 164 + Configured S3FileSystem instance. 165 + """ 166 + ## 167 + if not isinstance( creds, dict ): 168 + creds = _s3_env( creds ) 169 + 170 + # Build kwargs, making endpoint_url optional 171 + kwargs = { 172 + 'key': creds['AWS_ACCESS_KEY_ID'], 173 + 'secret': creds['AWS_SECRET_ACCESS_KEY'] 174 + } 175 + if 'AWS_ENDPOINT' in creds: 176 + kwargs['endpoint_url'] = creds['AWS_ENDPOINT'] 177 + 178 + return S3FileSystem(**kwargs) 179 + 180 + 181 + ## 182 + # Classes 183 + 184 + class Repo: 185 + """Repository for storing and managing atdata datasets. 186 + 187 + Provides storage of datasets in S3-compatible object storage with Redis-based 188 + indexing. Datasets are stored as WebDataset tar files with optional metadata. 189 + 190 + Attributes: 191 + s3_credentials: S3 credentials dictionary or None. 192 + bucket_fs: S3FileSystem instance or None. 193 + hive_path: Path within S3 bucket for storing datasets. 194 + hive_bucket: Name of the S3 bucket. 195 + index: Index instance for tracking datasets. 196 + """ 197 + 198 + ## 199 + 200 + def __init__( self, 201 + # 202 + s3_credentials: str | Path | dict[str, Any] | None = None, 203 + hive_path: str | Path | None = None, 204 + redis: Redis | None = None, 205 + # 206 + # 207 + **kwargs 208 + ) -> None: 209 + """Initialize a repository. 210 + 211 + Args: 212 + s3_credentials: Path to .env file with S3 credentials, or dict with 213 + AWS_ENDPOINT, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY. 214 + If None, S3 functionality will be disabled. 215 + hive_path: Path within the S3 bucket to store datasets. 216 + Required if s3_credentials is provided. 217 + redis: Redis connection for indexing. If None, creates a new connection. 218 + **kwargs: Additional arguments (reserved for future use). 219 + 220 + Raises: 221 + ValueError: If hive_path is not provided when s3_credentials is set. 222 + """ 223 + 224 + if s3_credentials is None: 225 + self.s3_credentials = None 226 + elif isinstance( s3_credentials, dict ): 227 + self.s3_credentials = s3_credentials 228 + else: 229 + self.s3_credentials = _s3_env( s3_credentials ) 230 + 231 + if self.s3_credentials is None: 232 + self.bucket_fs = None 233 + else: 234 + self.bucket_fs = _s3_from_credentials( self.s3_credentials ) 235 + 236 + if self.bucket_fs is not None: 237 + if hive_path is None: 238 + raise ValueError( 'Must specify hive path within bucket' ) 239 + self.hive_path = Path( hive_path ) 240 + self.hive_bucket = self.hive_path.parts[0] 241 + else: 242 + self.hive_path = None 243 + self.hive_bucket = None 244 + 245 + # 246 + 247 + self.index = Index( redis = redis ) 248 + 249 + ## 250 + 251 + def insert( self, ds: Dataset[T], 252 + # 253 + cache_local: bool = False, 254 + # 255 + **kwargs 256 + ) -> tuple[BasicIndexEntry, Dataset[T]]: 257 + """Insert a dataset into the repository. 258 + 259 + Writes the dataset to S3 as WebDataset tar files, stores metadata, 260 + and creates an index entry in Redis. 261 + 262 + Args: 263 + ds: The dataset to insert. 264 + cache_local: If True, write to local temporary storage first, then 265 + copy to S3. This can be faster for some workloads. 266 + **kwargs: Additional arguments passed to wds.ShardWriter. 267 + 268 + Returns: 269 + A tuple of (index_entry, new_dataset) where: 270 + - index_entry: BasicIndexEntry for the stored dataset 271 + - new_dataset: Dataset object pointing to the stored copy 272 + 273 + Raises: 274 + AssertionError: If S3 credentials or hive_path are not configured. 275 + RuntimeError: If no shards were written. 276 + """ 277 + 278 + assert self.s3_credentials is not None 279 + assert self.hive_bucket is not None 280 + assert self.hive_path is not None 281 + 282 + new_uuid = str( uuid4() ) 283 + 284 + hive_fs = _s3_from_credentials( self.s3_credentials ) 285 + 286 + # Write metadata 287 + metadata_path = ( 288 + self.hive_path 289 + / 'metadata' 290 + / f'atdata-metadata--{new_uuid}.msgpack' 291 + ) 292 + # Note: S3 doesn't need directories created beforehand - s3fs handles this 293 + 294 + if ds.metadata is not None: 295 + # Use s3:// prefix to ensure s3fs treats this as an S3 path 296 + with cast( BinaryIO, hive_fs.open( f's3://{metadata_path.as_posix()}', 'wb' ) ) as f: 297 + meta_packed = msgpack.packb( ds.metadata ) 298 + assert meta_packed is not None 299 + f.write( cast( bytes, meta_packed ) ) 300 + 301 + 302 + # Write data 303 + shard_pattern = ( 304 + self.hive_path 305 + / f'atdata--{new_uuid}--%06d.tar' 306 + ).as_posix() 307 + 308 + with TemporaryDirectory() as temp_dir: 309 + 310 + if cache_local: 311 + # For cache_local, we need to use boto3 directly to avoid s3fs async issues with moto 312 + import boto3 313 + 314 + # Create boto3 client from credentials 315 + s3_client_kwargs = { 316 + 'aws_access_key_id': self.s3_credentials['AWS_ACCESS_KEY_ID'], 317 + 'aws_secret_access_key': self.s3_credentials['AWS_SECRET_ACCESS_KEY'] 318 + } 319 + if 'AWS_ENDPOINT' in self.s3_credentials: 320 + s3_client_kwargs['endpoint_url'] = self.s3_credentials['AWS_ENDPOINT'] 321 + s3_client = boto3.client('s3', **s3_client_kwargs) 322 + 323 + def _writer_opener( p: str ): 324 + local_cache_path = Path( temp_dir ) / p 325 + local_cache_path.parent.mkdir( parents = True, exist_ok = True ) 326 + return open( local_cache_path, 'wb' ) 327 + writer_opener = _writer_opener 328 + 329 + def _writer_post( p: str ): 330 + local_cache_path = Path( temp_dir ) / p 331 + 332 + # Copy to S3 using boto3 client (avoids s3fs async issues) 333 + path_parts = Path( p ).parts 334 + bucket = path_parts[0] 335 + key = str( Path( *path_parts[1:] ) ) 336 + 337 + with open( local_cache_path, 'rb' ) as f_in: 338 + s3_client.put_object( Bucket=bucket, Key=key, Body=f_in.read() ) 339 + 340 + # Delete local cache file 341 + local_cache_path.unlink() 342 + 343 + written_shards.append( p ) 344 + writer_post = _writer_post 345 + 346 + else: 347 + # Use s3:// prefix to ensure s3fs treats paths as S3 paths 348 + writer_opener = lambda s: cast( BinaryIO, hive_fs.open( f's3://{s}', 'wb' ) ) 349 + writer_post = lambda s: written_shards.append( s ) 350 + 351 + written_shards = [] 352 + with wds.writer.ShardWriter( 353 + shard_pattern, 354 + opener = writer_opener, 355 + post = writer_post, 356 + **kwargs, 357 + ) as sink: 358 + for sample in ds.ordered( batch_size = None ): 359 + sink.write( sample.as_wds ) 360 + 361 + # Make a new Dataset object for the written dataset copy 362 + if len( written_shards ) == 0: 363 + raise RuntimeError( 'Cannot form new dataset entry -- did not write any shards' ) 364 + 365 + elif len( written_shards ) < 2: 366 + new_dataset_url = ( 367 + self.hive_path 368 + / ( Path( written_shards[0] ).name ) 369 + ).as_posix() 370 + 371 + else: 372 + shard_s3_format = ( 373 + ( 374 + self.hive_path 375 + / f'atdata--{new_uuid}' 376 + ).as_posix() 377 + ) + '--{shard_id}.tar' 378 + shard_id_braced = '{' + f'{0:06d}..{len( written_shards ) - 1:06d}' + '}' 379 + new_dataset_url = shard_s3_format.format( shard_id = shard_id_braced ) 380 + 381 + new_dataset = Dataset[ds.sample_type]( 382 + url = new_dataset_url, 383 + metadata_url = metadata_path.as_posix(), 384 + ) 385 + 386 + # Add to index 387 + new_entry = self.index.add_entry( new_dataset, uuid = new_uuid ) 388 + 389 + return new_entry, new_dataset 390 + 391 + 392 + class Index: 393 + """Redis-backed index for tracking datasets in a repository. 394 + 395 + Maintains a registry of BasicIndexEntry objects in Redis, allowing 396 + enumeration and lookup of stored datasets. 397 + 398 + Attributes: 399 + _redis: Redis connection for index storage. 400 + """ 401 + 402 + ## 403 + 404 + def __init__( self, 405 + redis: Redis | None = None, 406 + **kwargs 407 + ) -> None: 408 + """Initialize an index. 409 + 410 + Args: 411 + redis: Redis connection to use. If None, creates a new connection 412 + using the provided kwargs. 413 + **kwargs: Additional arguments passed to Redis() constructor if 414 + redis is None. 415 + """ 416 + ## 417 + 418 + if redis is not None: 419 + self._redis = redis 420 + else: 421 + self._redis: Redis = Redis( **kwargs ) 422 + 423 + @property 424 + def all_entries( self ) -> list[BasicIndexEntry]: 425 + """Get all index entries as a list. 426 + 427 + Returns: 428 + List of all BasicIndexEntry objects in the index. 429 + """ 430 + return list( self.entries ) 431 + 432 + @property 433 + def entries( self ) -> Generator[BasicIndexEntry, None, None]: 434 + """Iterate over all index entries. 435 + 436 + Scans Redis for all BasicIndexEntry keys and yields them one at a time. 437 + 438 + Yields: 439 + BasicIndexEntry objects from the index. 440 + """ 441 + ## 442 + for key in self._redis.scan_iter( match = 'BasicIndexEntry:*' ): 443 + # hgetall returns dict[bytes, bytes] which we decode to dict[str, str] 444 + cur_entry_data = _decode_bytes_dict( cast(dict[bytes, bytes], self._redis.hgetall( key )) ) 445 + 446 + # Provide default None for optional fields that may be missing 447 + # Type checker complains about None in dict[str, str], but BasicIndexEntry accepts it 448 + cur_entry_data: dict[str, Any] = dict( **cur_entry_data ) 449 + cur_entry_data.setdefault('metadata_url', None) 450 + 451 + cur_entry = BasicIndexEntry( **cur_entry_data ) 452 + yield cur_entry 453 + 454 + return 455 + 456 + def add_entry( self, ds: Dataset, 457 + uuid: str | None = None, 458 + ) -> BasicIndexEntry: 459 + """Add a dataset to the index. 460 + 461 + Creates a BasicIndexEntry for the dataset and persists it to Redis. 462 + 463 + Args: 464 + ds: The dataset to add to the index. 465 + uuid: Optional UUID for the entry. If None, a new UUID is generated. 466 + 467 + Returns: 468 + The created BasicIndexEntry object. 469 + """ 470 + ## 471 + temp_sample_kind = _kind_str_for_sample_type( ds.sample_type ) 472 + 473 + if uuid is None: 474 + ret_data = BasicIndexEntry( 475 + wds_url = ds.url, 476 + sample_kind = temp_sample_kind, 477 + metadata_url = ds.metadata_url, 478 + ) 479 + else: 480 + ret_data = BasicIndexEntry( 481 + wds_url = ds.url, 482 + sample_kind = temp_sample_kind, 483 + metadata_url = ds.metadata_url, 484 + uuid = uuid, 485 + ) 486 + 487 + ret_data.write_to( self._redis ) 488 + 489 + return ret_data 490 + 491 + 492 + #
+1
tests/conftest.py
··· 1 + """Pytest configuration for atdata tests."""
+1363
tests/test_atmosphere.py
··· 1 + """Tests for the atdata.atmosphere module. 2 + 3 + This module contains comprehensive tests for ATProto integration including: 4 + - Type definitions (_types.py) 5 + - Client wrapper (client.py) 6 + - Schema publishing/loading (schema.py) 7 + - Dataset publishing/loading (records.py) 8 + - Lens publishing/loading (lens.py) 9 + """ 10 + 11 + from datetime import datetime, timezone 12 + from typing import Optional 13 + from unittest.mock import Mock, MagicMock, patch 14 + import pytest 15 + 16 + import numpy as np 17 + from numpy.typing import NDArray 18 + 19 + import atdata 20 + from atdata.atmosphere import ( 21 + AtmosphereClient, 22 + SchemaPublisher, 23 + SchemaLoader, 24 + DatasetPublisher, 25 + DatasetLoader, 26 + LensPublisher, 27 + LensLoader, 28 + AtUri, 29 + SchemaRecord, 30 + DatasetRecord, 31 + LensRecord, 32 + ) 33 + from atdata.atmosphere._types import ( 34 + FieldType, 35 + FieldDef, 36 + StorageLocation, 37 + CodeReference, 38 + LEXICON_NAMESPACE, 39 + ) 40 + 41 + 42 + # ============================================================================= 43 + # Test Fixtures 44 + # ============================================================================= 45 + 46 + @pytest.fixture 47 + def mock_atproto_client(): 48 + """Create a mock atproto SDK client.""" 49 + mock = Mock() 50 + mock.me = MagicMock() 51 + mock.me.did = "did:plc:test123456789" 52 + mock.me.handle = "test.bsky.social" 53 + 54 + # Mock login 55 + mock_profile = Mock() 56 + mock_profile.did = "did:plc:test123456789" 57 + mock_profile.handle = "test.bsky.social" 58 + mock.login.return_value = mock_profile 59 + 60 + # Mock export_session_string 61 + mock.export_session_string.return_value = "test-session-string" 62 + 63 + return mock 64 + 65 + 66 + @pytest.fixture 67 + def authenticated_client(mock_atproto_client): 68 + """Create an authenticated AtmosphereClient with mocked backend.""" 69 + client = AtmosphereClient(_client=mock_atproto_client) 70 + client.login("test.bsky.social", "test-password") 71 + return client 72 + 73 + 74 + @atdata.packable 75 + class BasicSample: 76 + """Simple sample type for testing.""" 77 + name: str 78 + value: int 79 + 80 + 81 + @atdata.packable 82 + class NumpySample: 83 + """Sample type with NDArray field.""" 84 + data: NDArray 85 + label: str 86 + 87 + 88 + @atdata.packable 89 + class OptionalSample: 90 + """Sample type with optional fields.""" 91 + required_field: str 92 + optional_field: Optional[int] 93 + optional_array: Optional[NDArray] 94 + 95 + 96 + @atdata.packable 97 + class AllTypesSample: 98 + """Sample type with all primitive types.""" 99 + str_field: str 100 + int_field: int 101 + float_field: float 102 + bool_field: bool 103 + bytes_field: bytes 104 + 105 + 106 + # ============================================================================= 107 + # Tests for _types.py - AtUri 108 + # ============================================================================= 109 + 110 + class TestAtUri: 111 + """Tests for AtUri parsing and formatting.""" 112 + 113 + def test_parse_valid_uri_with_did(self): 114 + """Parse a valid AT URI with a DID authority.""" 115 + uri = AtUri.parse("at://did:plc:abc123/com.example.record/key456") 116 + 117 + assert uri.authority == "did:plc:abc123" 118 + assert uri.collection == "com.example.record" 119 + assert uri.rkey == "key456" 120 + 121 + def test_parse_valid_uri_with_handle(self): 122 + """Parse a valid AT URI with a handle authority.""" 123 + uri = AtUri.parse("at://alice.bsky.social/app.bsky.feed.post/abc123") 124 + 125 + assert uri.authority == "alice.bsky.social" 126 + assert uri.collection == "app.bsky.feed.post" 127 + assert uri.rkey == "abc123" 128 + 129 + def test_parse_uri_with_slashes_in_rkey(self): 130 + """Parse a URI where rkey contains slashes.""" 131 + uri = AtUri.parse("at://did:plc:abc/collection/path/to/key") 132 + 133 + assert uri.authority == "did:plc:abc" 134 + assert uri.collection == "collection" 135 + assert uri.rkey == "path/to/key" 136 + 137 + def test_parse_invalid_uri_no_protocol(self): 138 + """Reject URIs without at:// protocol.""" 139 + with pytest.raises(ValueError, match="must start with 'at://'"): 140 + AtUri.parse("https://example.com/path") 141 + 142 + def test_parse_invalid_uri_missing_parts(self): 143 + """Reject URIs with missing components.""" 144 + with pytest.raises(ValueError, match="expected authority/collection/rkey"): 145 + AtUri.parse("at://did:plc:abc/collection") 146 + 147 + def test_str_roundtrip(self): 148 + """Verify __str__ produces valid URI that can be re-parsed.""" 149 + original = "at://did:plc:test123/ac.foundation.dataset.sampleSchema/xyz789" 150 + uri = AtUri.parse(original) 151 + assert str(uri) == original 152 + 153 + def test_parse_atdata_namespace(self): 154 + """Parse URIs in the atdata namespace.""" 155 + uri = AtUri.parse(f"at://did:plc:abc/{LEXICON_NAMESPACE}.sampleSchema/test") 156 + 157 + assert uri.collection == f"{LEXICON_NAMESPACE}.sampleSchema" 158 + 159 + 160 + # ============================================================================= 161 + # Tests for _types.py - FieldType 162 + # ============================================================================= 163 + 164 + class TestFieldType: 165 + """Tests for FieldType dataclass.""" 166 + 167 + def test_primitive_type(self): 168 + """Create a primitive field type.""" 169 + ft = FieldType(kind="primitive", primitive="str") 170 + 171 + assert ft.kind == "primitive" 172 + assert ft.primitive == "str" 173 + assert ft.dtype is None 174 + assert ft.shape is None 175 + 176 + def test_ndarray_type(self): 177 + """Create an ndarray field type.""" 178 + ft = FieldType(kind="ndarray", dtype="float32", shape=[224, 224, 3]) 179 + 180 + assert ft.kind == "ndarray" 181 + assert ft.dtype == "float32" 182 + assert ft.shape == [224, 224, 3] 183 + 184 + def test_ref_type(self): 185 + """Create a reference field type.""" 186 + ft = FieldType(kind="ref", ref="at://did:plc:abc/collection/key") 187 + 188 + assert ft.kind == "ref" 189 + assert ft.ref == "at://did:plc:abc/collection/key" 190 + 191 + def test_array_type(self): 192 + """Create an array field type with items.""" 193 + items = FieldType(kind="primitive", primitive="str") 194 + ft = FieldType(kind="array", items=items) 195 + 196 + assert ft.kind == "array" 197 + assert ft.items is not None 198 + assert ft.items.kind == "primitive" 199 + 200 + 201 + # ============================================================================= 202 + # Tests for _types.py - FieldDef 203 + # ============================================================================= 204 + 205 + class TestFieldDef: 206 + """Tests for FieldDef dataclass.""" 207 + 208 + def test_required_field(self): 209 + """Create a required field definition.""" 210 + fd = FieldDef( 211 + name="test_field", 212 + field_type=FieldType(kind="primitive", primitive="str"), 213 + optional=False, 214 + ) 215 + 216 + assert fd.name == "test_field" 217 + assert fd.optional is False 218 + 219 + def test_optional_field(self): 220 + """Create an optional field definition.""" 221 + fd = FieldDef( 222 + name="optional_field", 223 + field_type=FieldType(kind="primitive", primitive="int"), 224 + optional=True, 225 + ) 226 + 227 + assert fd.optional is True 228 + 229 + def test_field_with_description(self): 230 + """Create a field with description.""" 231 + fd = FieldDef( 232 + name="described_field", 233 + field_type=FieldType(kind="primitive", primitive="float"), 234 + optional=False, 235 + description="A field with a description", 236 + ) 237 + 238 + assert fd.description == "A field with a description" 239 + 240 + 241 + # ============================================================================= 242 + # Tests for _types.py - SchemaRecord 243 + # ============================================================================= 244 + 245 + class TestSchemaRecord: 246 + """Tests for SchemaRecord dataclass and to_record().""" 247 + 248 + def test_to_record_basic(self): 249 + """Convert a basic schema record to dict.""" 250 + schema = SchemaRecord( 251 + name="TestSchema", 252 + version="1.0.0", 253 + fields=[ 254 + FieldDef( 255 + name="field1", 256 + field_type=FieldType(kind="primitive", primitive="str"), 257 + optional=False, 258 + ), 259 + ], 260 + ) 261 + 262 + record = schema.to_record() 263 + 264 + assert record["$type"] == f"{LEXICON_NAMESPACE}.sampleSchema" 265 + assert record["name"] == "TestSchema" 266 + assert record["version"] == "1.0.0" 267 + assert len(record["fields"]) == 1 268 + assert "createdAt" in record 269 + 270 + def test_to_record_with_description(self): 271 + """Convert schema record with description.""" 272 + schema = SchemaRecord( 273 + name="DescribedSchema", 274 + version="2.0.0", 275 + description="A schema with description", 276 + fields=[], 277 + ) 278 + 279 + record = schema.to_record() 280 + 281 + assert record["description"] == "A schema with description" 282 + 283 + def test_to_record_with_metadata(self): 284 + """Convert schema record with metadata.""" 285 + schema = SchemaRecord( 286 + name="MetaSchema", 287 + version="1.0.0", 288 + fields=[], 289 + metadata={"author": "test", "tags": ["demo"]}, 290 + ) 291 + 292 + record = schema.to_record() 293 + 294 + assert record["metadata"] == {"author": "test", "tags": ["demo"]} 295 + 296 + def test_to_record_field_types(self): 297 + """Verify field type serialization in to_record().""" 298 + schema = SchemaRecord( 299 + name="TypesSchema", 300 + version="1.0.0", 301 + fields=[ 302 + FieldDef( 303 + name="primitive_field", 304 + field_type=FieldType(kind="primitive", primitive="int"), 305 + optional=False, 306 + ), 307 + FieldDef( 308 + name="array_field", 309 + field_type=FieldType(kind="ndarray", dtype="float32"), 310 + optional=True, 311 + ), 312 + ], 313 + ) 314 + 315 + record = schema.to_record() 316 + 317 + # Check primitive field 318 + prim_field = record["fields"][0] 319 + assert prim_field["name"] == "primitive_field" 320 + assert prim_field["fieldType"]["$type"] == f"{LEXICON_NAMESPACE}.schemaType#primitive" 321 + assert prim_field["fieldType"]["primitive"] == "int" 322 + assert prim_field["optional"] is False 323 + 324 + # Check ndarray field 325 + arr_field = record["fields"][1] 326 + assert arr_field["name"] == "array_field" 327 + assert arr_field["fieldType"]["$type"] == f"{LEXICON_NAMESPACE}.schemaType#ndarray" 328 + assert arr_field["fieldType"]["dtype"] == "float32" 329 + assert arr_field["optional"] is True 330 + 331 + 332 + # ============================================================================= 333 + # Tests for _types.py - StorageLocation 334 + # ============================================================================= 335 + 336 + class TestStorageLocation: 337 + """Tests for StorageLocation dataclass.""" 338 + 339 + def test_external_storage(self): 340 + """Create external URL storage location.""" 341 + storage = StorageLocation( 342 + kind="external", 343 + urls=["s3://bucket/data-{000000..000009}.tar"], 344 + ) 345 + 346 + assert storage.kind == "external" 347 + assert storage.urls == ["s3://bucket/data-{000000..000009}.tar"] 348 + assert storage.blob_refs is None 349 + 350 + def test_blob_storage(self): 351 + """Create ATProto blob storage location.""" 352 + storage = StorageLocation( 353 + kind="blobs", 354 + blob_refs=[{"cid": "bafyabc", "mimeType": "application/octet-stream"}], 355 + ) 356 + 357 + assert storage.kind == "blobs" 358 + assert storage.blob_refs is not None 359 + assert len(storage.blob_refs) == 1 360 + 361 + 362 + # ============================================================================= 363 + # Tests for _types.py - DatasetRecord 364 + # ============================================================================= 365 + 366 + class TestDatasetRecord: 367 + """Tests for DatasetRecord dataclass and to_record().""" 368 + 369 + def test_to_record_external_storage(self): 370 + """Convert dataset record with external storage.""" 371 + dataset = DatasetRecord( 372 + name="TestDataset", 373 + schema_ref="at://did:plc:abc/ac.foundation.dataset.sampleSchema/xyz", 374 + storage=StorageLocation( 375 + kind="external", 376 + urls=["s3://bucket/data.tar"], 377 + ), 378 + ) 379 + 380 + record = dataset.to_record() 381 + 382 + assert record["$type"] == f"{LEXICON_NAMESPACE}.record" 383 + assert record["name"] == "TestDataset" 384 + assert record["schemaRef"] == "at://did:plc:abc/ac.foundation.dataset.sampleSchema/xyz" 385 + assert record["storage"]["$type"] == f"{LEXICON_NAMESPACE}.storageExternal" 386 + assert record["storage"]["urls"] == ["s3://bucket/data.tar"] 387 + 388 + def test_to_record_blob_storage(self): 389 + """Convert dataset record with blob storage.""" 390 + dataset = DatasetRecord( 391 + name="BlobDataset", 392 + schema_ref="at://did:plc:abc/collection/key", 393 + storage=StorageLocation( 394 + kind="blobs", 395 + blob_refs=[{"cid": "bafytest"}], 396 + ), 397 + ) 398 + 399 + record = dataset.to_record() 400 + 401 + assert record["storage"]["$type"] == f"{LEXICON_NAMESPACE}.storageBlobs" 402 + assert record["storage"]["blobs"] == [{"cid": "bafytest"}] 403 + 404 + def test_to_record_with_tags_and_license(self): 405 + """Convert dataset record with tags and license.""" 406 + dataset = DatasetRecord( 407 + name="TaggedDataset", 408 + schema_ref="at://did:plc:abc/collection/key", 409 + storage=StorageLocation(kind="external", urls=[]), 410 + tags=["ml", "vision", "demo"], 411 + license="MIT", 412 + ) 413 + 414 + record = dataset.to_record() 415 + 416 + assert record["tags"] == ["ml", "vision", "demo"] 417 + assert record["license"] == "MIT" 418 + 419 + def test_to_record_with_metadata(self): 420 + """Convert dataset record with msgpack metadata.""" 421 + import msgpack 422 + 423 + metadata_bytes = msgpack.packb({"size": 1000, "split": "train"}) 424 + dataset = DatasetRecord( 425 + name="MetaDataset", 426 + schema_ref="at://did:plc:abc/collection/key", 427 + storage=StorageLocation(kind="external", urls=[]), 428 + metadata=metadata_bytes, 429 + ) 430 + 431 + record = dataset.to_record() 432 + 433 + assert record["metadata"] == metadata_bytes 434 + 435 + 436 + # ============================================================================= 437 + # Tests for _types.py - LensRecord 438 + # ============================================================================= 439 + 440 + class TestLensRecord: 441 + """Tests for LensRecord dataclass and to_record().""" 442 + 443 + def test_to_record_basic(self): 444 + """Convert basic lens record.""" 445 + lens = LensRecord( 446 + name="TestLens", 447 + source_schema="at://did:plc:abc/collection/source", 448 + target_schema="at://did:plc:abc/collection/target", 449 + ) 450 + 451 + record = lens.to_record() 452 + 453 + assert record["$type"] == f"{LEXICON_NAMESPACE}.lens" 454 + assert record["name"] == "TestLens" 455 + assert record["sourceSchema"] == "at://did:plc:abc/collection/source" 456 + assert record["targetSchema"] == "at://did:plc:abc/collection/target" 457 + assert "createdAt" in record 458 + 459 + def test_to_record_with_description(self): 460 + """Convert lens record with description.""" 461 + lens = LensRecord( 462 + name="DescribedLens", 463 + source_schema="at://a", 464 + target_schema="at://b", 465 + description="Transforms A to B", 466 + ) 467 + 468 + record = lens.to_record() 469 + 470 + assert record["description"] == "Transforms A to B" 471 + 472 + def test_to_record_with_code_references(self): 473 + """Convert lens record with code references.""" 474 + lens = LensRecord( 475 + name="CodeLens", 476 + source_schema="at://a", 477 + target_schema="at://b", 478 + getter_code=CodeReference( 479 + repository="https://github.com/user/repo", 480 + commit="abc123def456", 481 + path="module.lenses:getter_func", 482 + ), 483 + putter_code=CodeReference( 484 + repository="https://github.com/user/repo", 485 + commit="abc123def456", 486 + path="module.lenses:putter_func", 487 + ), 488 + ) 489 + 490 + record = lens.to_record() 491 + 492 + assert record["getterCode"]["repository"] == "https://github.com/user/repo" 493 + assert record["getterCode"]["commit"] == "abc123def456" 494 + assert record["getterCode"]["path"] == "module.lenses:getter_func" 495 + assert record["putterCode"]["path"] == "module.lenses:putter_func" 496 + 497 + 498 + # ============================================================================= 499 + # Tests for client.py - AtmosphereClient 500 + # ============================================================================= 501 + 502 + class TestAtmosphereClient: 503 + """Tests for AtmosphereClient.""" 504 + 505 + def test_init_default(self): 506 + """Initialize client with defaults.""" 507 + with patch("atdata.atmosphere.client._get_atproto_client_class") as mock_get: 508 + mock_class = Mock() 509 + mock_get.return_value = mock_class 510 + 511 + client = AtmosphereClient() 512 + 513 + mock_class.assert_called_once() 514 + assert not client.is_authenticated 515 + 516 + def test_init_with_base_url(self): 517 + """Initialize client with custom base URL.""" 518 + with patch("atdata.atmosphere.client._get_atproto_client_class") as mock_get: 519 + mock_class = Mock() 520 + mock_get.return_value = mock_class 521 + 522 + client = AtmosphereClient(base_url="https://custom.pds.example") 523 + 524 + mock_class.assert_called_once_with(base_url="https://custom.pds.example") 525 + 526 + def test_init_with_mock_client(self, mock_atproto_client): 527 + """Initialize with pre-configured mock client.""" 528 + client = AtmosphereClient(_client=mock_atproto_client) 529 + 530 + assert client._client is mock_atproto_client 531 + 532 + def test_login_success(self, mock_atproto_client): 533 + """Successful login sets session.""" 534 + client = AtmosphereClient(_client=mock_atproto_client) 535 + 536 + client.login("test.bsky.social", "password123") 537 + 538 + assert client.is_authenticated 539 + assert client.did == "did:plc:test123456789" 540 + assert client.handle == "test.bsky.social" 541 + mock_atproto_client.login.assert_called_once_with("test.bsky.social", "password123") 542 + 543 + def test_login_with_session(self, mock_atproto_client): 544 + """Login with exported session string.""" 545 + client = AtmosphereClient(_client=mock_atproto_client) 546 + 547 + client.login_with_session("test-session-string") 548 + 549 + assert client.is_authenticated 550 + mock_atproto_client.login.assert_called_once_with(session_string="test-session-string") 551 + 552 + def test_export_session(self, authenticated_client, mock_atproto_client): 553 + """Export session string.""" 554 + session = authenticated_client.export_session() 555 + 556 + assert session == "test-session-string" 557 + mock_atproto_client.export_session_string.assert_called_once() 558 + 559 + def test_export_session_not_authenticated(self, mock_atproto_client): 560 + """Export session raises when not authenticated.""" 561 + client = AtmosphereClient(_client=mock_atproto_client) 562 + 563 + with pytest.raises(ValueError, match="Not authenticated"): 564 + client.export_session() 565 + 566 + def test_did_not_authenticated(self, mock_atproto_client): 567 + """Accessing did raises when not authenticated.""" 568 + client = AtmosphereClient(_client=mock_atproto_client) 569 + 570 + with pytest.raises(ValueError, match="Not authenticated"): 571 + _ = client.did 572 + 573 + def test_handle_not_authenticated(self, mock_atproto_client): 574 + """Accessing handle raises when not authenticated.""" 575 + client = AtmosphereClient(_client=mock_atproto_client) 576 + 577 + with pytest.raises(ValueError, match="Not authenticated"): 578 + _ = client.handle 579 + 580 + def test_create_record(self, authenticated_client, mock_atproto_client): 581 + """Create a record via the client.""" 582 + mock_response = Mock() 583 + mock_response.uri = "at://did:plc:test123456789/collection/newkey" 584 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 585 + 586 + uri = authenticated_client.create_record( 587 + collection="collection", 588 + record={"$type": "collection", "data": "test"}, 589 + ) 590 + 591 + assert isinstance(uri, AtUri) 592 + assert uri.authority == "did:plc:test123456789" 593 + assert uri.collection == "collection" 594 + assert uri.rkey == "newkey" 595 + 596 + def test_create_record_not_authenticated(self, mock_atproto_client): 597 + """Create record raises when not authenticated.""" 598 + client = AtmosphereClient(_client=mock_atproto_client) 599 + 600 + with pytest.raises(ValueError, match="must be authenticated"): 601 + client.create_record(collection="test", record={}) 602 + 603 + def test_put_record(self, authenticated_client, mock_atproto_client): 604 + """Put (create or update) a record.""" 605 + mock_response = Mock() 606 + mock_response.uri = "at://did:plc:test123456789/collection/specific-key" 607 + mock_atproto_client.com.atproto.repo.put_record.return_value = mock_response 608 + 609 + uri = authenticated_client.put_record( 610 + collection="collection", 611 + rkey="specific-key", 612 + record={"$type": "collection", "data": "test"}, 613 + ) 614 + 615 + assert uri.rkey == "specific-key" 616 + 617 + def test_get_record(self, authenticated_client, mock_atproto_client): 618 + """Get a record by URI.""" 619 + mock_response = Mock() 620 + mock_response.value = {"$type": "test", "field": "value"} 621 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 622 + 623 + record = authenticated_client.get_record("at://did:plc:abc/collection/key") 624 + 625 + assert record["field"] == "value" 626 + 627 + def test_get_record_with_aturi_object(self, authenticated_client, mock_atproto_client): 628 + """Get a record using AtUri object.""" 629 + mock_response = Mock() 630 + mock_response.value = {"$type": "test", "data": 123} 631 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 632 + 633 + uri = AtUri(authority="did:plc:abc", collection="collection", rkey="key") 634 + record = authenticated_client.get_record(uri) 635 + 636 + assert record["data"] == 123 637 + 638 + def test_delete_record(self, authenticated_client, mock_atproto_client): 639 + """Delete a record.""" 640 + authenticated_client.delete_record("at://did:plc:test123456789/collection/key") 641 + 642 + mock_atproto_client.com.atproto.repo.delete_record.assert_called_once() 643 + 644 + def test_list_records(self, authenticated_client, mock_atproto_client): 645 + """List records in a collection.""" 646 + mock_record1 = Mock() 647 + mock_record1.value = {"name": "record1"} 648 + mock_record2 = Mock() 649 + mock_record2.value = {"name": "record2"} 650 + 651 + mock_response = Mock() 652 + mock_response.records = [mock_record1, mock_record2] 653 + mock_response.cursor = "next-page" 654 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 655 + 656 + records, cursor = authenticated_client.list_records("collection", limit=10) 657 + 658 + assert len(records) == 2 659 + assert records[0]["name"] == "record1" 660 + assert cursor == "next-page" 661 + 662 + def test_list_schemas_convenience(self, authenticated_client, mock_atproto_client): 663 + """Test list_schemas convenience method.""" 664 + mock_response = Mock() 665 + mock_response.records = [] 666 + mock_response.cursor = None 667 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 668 + 669 + schemas = authenticated_client.list_schemas() 670 + 671 + call_args = mock_atproto_client.com.atproto.repo.list_records.call_args 672 + assert f"{LEXICON_NAMESPACE}.sampleSchema" in str(call_args) 673 + 674 + 675 + # ============================================================================= 676 + # Tests for schema.py - SchemaPublisher 677 + # ============================================================================= 678 + 679 + class TestSchemaPublisher: 680 + """Tests for SchemaPublisher.""" 681 + 682 + def test_publish_basic_sample(self, authenticated_client, mock_atproto_client): 683 + """Publish a basic sample type schema.""" 684 + mock_response = Mock() 685 + mock_response.uri = f"at://did:plc:test123456789/{LEXICON_NAMESPACE}.sampleSchema/abc" 686 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 687 + 688 + publisher = SchemaPublisher(authenticated_client) 689 + uri = publisher.publish(BasicSample, version="1.0.0") 690 + 691 + assert isinstance(uri, AtUri) 692 + assert uri.collection == f"{LEXICON_NAMESPACE}.sampleSchema" 693 + 694 + # Verify the record structure 695 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 696 + record = call_args.kwargs["data"]["record"] 697 + assert record["name"] == "BasicSample" 698 + assert record["version"] == "1.0.0" 699 + assert len(record["fields"]) == 2 700 + 701 + def test_publish_with_custom_name(self, authenticated_client, mock_atproto_client): 702 + """Publish with custom name override.""" 703 + mock_response = Mock() 704 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/abc" 705 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 706 + 707 + publisher = SchemaPublisher(authenticated_client) 708 + publisher.publish(BasicSample, name="CustomName", version="2.0.0") 709 + 710 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 711 + record = call_args.kwargs["data"]["record"] 712 + assert record["name"] == "CustomName" 713 + 714 + def test_publish_numpy_sample(self, authenticated_client, mock_atproto_client): 715 + """Publish sample type with NDArray field.""" 716 + mock_response = Mock() 717 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/abc" 718 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 719 + 720 + publisher = SchemaPublisher(authenticated_client) 721 + publisher.publish(NumpySample, version="1.0.0") 722 + 723 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 724 + record = call_args.kwargs["data"]["record"] 725 + 726 + # Find the data field 727 + data_field = next(f for f in record["fields"] if f["name"] == "data") 728 + assert "ndarray" in data_field["fieldType"]["$type"] 729 + 730 + def test_publish_optional_fields(self, authenticated_client, mock_atproto_client): 731 + """Publish sample type with optional fields.""" 732 + mock_response = Mock() 733 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/abc" 734 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 735 + 736 + publisher = SchemaPublisher(authenticated_client) 737 + publisher.publish(OptionalSample, version="1.0.0") 738 + 739 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 740 + record = call_args.kwargs["data"]["record"] 741 + 742 + # Check optional field marking 743 + required = next(f for f in record["fields"] if f["name"] == "required_field") 744 + optional = next(f for f in record["fields"] if f["name"] == "optional_field") 745 + 746 + assert required["optional"] is False 747 + assert optional["optional"] is True 748 + 749 + def test_publish_all_primitive_types(self, authenticated_client, mock_atproto_client): 750 + """Publish sample with all primitive types.""" 751 + mock_response = Mock() 752 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/abc" 753 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 754 + 755 + publisher = SchemaPublisher(authenticated_client) 756 + publisher.publish(AllTypesSample, version="1.0.0") 757 + 758 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 759 + record = call_args.kwargs["data"]["record"] 760 + 761 + # Verify each primitive type 762 + type_map = {f["name"]: f["fieldType"]["primitive"] for f in record["fields"]} 763 + assert type_map["str_field"] == "str" 764 + assert type_map["int_field"] == "int" 765 + assert type_map["float_field"] == "float" 766 + assert type_map["bool_field"] == "bool" 767 + assert type_map["bytes_field"] == "bytes" 768 + 769 + def test_publish_not_dataclass_error(self, authenticated_client): 770 + """Publishing non-dataclass raises error.""" 771 + publisher = SchemaPublisher(authenticated_client) 772 + 773 + class NotADataclass: 774 + pass 775 + 776 + with pytest.raises(ValueError, match="must be a dataclass"): 777 + publisher.publish(NotADataclass, version="1.0.0") 778 + 779 + 780 + class TestSchemaLoader: 781 + """Tests for SchemaLoader.""" 782 + 783 + def test_get_schema(self, authenticated_client, mock_atproto_client): 784 + """Get a schema by URI.""" 785 + mock_response = Mock() 786 + mock_response.value = { 787 + "$type": f"{LEXICON_NAMESPACE}.sampleSchema", 788 + "name": "TestSchema", 789 + "version": "1.0.0", 790 + "fields": [], 791 + } 792 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 793 + 794 + loader = SchemaLoader(authenticated_client) 795 + schema = loader.get(f"at://did:plc:abc/{LEXICON_NAMESPACE}.sampleSchema/xyz") 796 + 797 + assert schema["name"] == "TestSchema" 798 + 799 + def test_get_schema_wrong_type(self, authenticated_client, mock_atproto_client): 800 + """Get raises error for wrong record type.""" 801 + mock_response = Mock() 802 + mock_response.value = { 803 + "$type": "app.bsky.feed.post", 804 + "text": "Not a schema", 805 + } 806 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 807 + 808 + loader = SchemaLoader(authenticated_client) 809 + 810 + with pytest.raises(ValueError, match="not a schema record"): 811 + loader.get("at://did:plc:abc/app.bsky.feed.post/xyz") 812 + 813 + def test_list_all_schemas(self, authenticated_client, mock_atproto_client): 814 + """List all schemas.""" 815 + mock_record = Mock() 816 + mock_record.value = {"name": "Schema1"} 817 + 818 + mock_response = Mock() 819 + mock_response.records = [mock_record] 820 + mock_response.cursor = None 821 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 822 + 823 + loader = SchemaLoader(authenticated_client) 824 + schemas = loader.list_all() 825 + 826 + assert len(schemas) == 1 827 + assert schemas[0]["name"] == "Schema1" 828 + 829 + 830 + # ============================================================================= 831 + # Tests for records.py - DatasetPublisher 832 + # ============================================================================= 833 + 834 + class TestDatasetPublisher: 835 + """Tests for DatasetPublisher.""" 836 + 837 + def test_publish_with_urls(self, authenticated_client, mock_atproto_client): 838 + """Publish dataset with explicit URLs.""" 839 + mock_response = Mock() 840 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.record/abc" 841 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 842 + 843 + publisher = DatasetPublisher(authenticated_client) 844 + uri = publisher.publish_with_urls( 845 + urls=["s3://bucket/data-{000000..000009}.tar"], 846 + schema_uri="at://did:plc:abc/schema/xyz", 847 + name="TestDataset", 848 + description="A test dataset", 849 + tags=["test", "demo"], 850 + license="MIT", 851 + ) 852 + 853 + assert isinstance(uri, AtUri) 854 + 855 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 856 + record = call_args.kwargs["data"]["record"] 857 + assert record["name"] == "TestDataset" 858 + assert record["schemaRef"] == "at://did:plc:abc/schema/xyz" 859 + assert record["tags"] == ["test", "demo"] 860 + assert record["license"] == "MIT" 861 + 862 + def test_publish_auto_schema(self, authenticated_client, mock_atproto_client): 863 + """Publish dataset with auto schema publishing.""" 864 + # Mock for schema creation 865 + schema_response = Mock() 866 + schema_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/schema123" 867 + 868 + # Mock for dataset creation 869 + dataset_response = Mock() 870 + dataset_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.record/dataset456" 871 + 872 + mock_atproto_client.com.atproto.repo.create_record.side_effect = [ 873 + schema_response, 874 + dataset_response, 875 + ] 876 + 877 + # Create a mock dataset 878 + mock_dataset = Mock() 879 + mock_dataset.url = "s3://bucket/data.tar" 880 + mock_dataset.sample_type = BasicSample 881 + mock_dataset.metadata = None 882 + 883 + publisher = DatasetPublisher(authenticated_client) 884 + uri = publisher.publish( 885 + mock_dataset, 886 + name="AutoSchemaDataset", 887 + auto_publish_schema=True, 888 + ) 889 + 890 + # Should have called create_record twice (schema + dataset) 891 + assert mock_atproto_client.com.atproto.repo.create_record.call_count == 2 892 + 893 + def test_publish_explicit_schema_uri(self, authenticated_client, mock_atproto_client): 894 + """Publish dataset with explicit schema URI (no auto publish).""" 895 + mock_response = Mock() 896 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.record/abc" 897 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 898 + 899 + mock_dataset = Mock() 900 + mock_dataset.url = "s3://bucket/data.tar" 901 + mock_dataset.metadata = None 902 + 903 + publisher = DatasetPublisher(authenticated_client) 904 + publisher.publish( 905 + mock_dataset, 906 + name="ExplicitSchemaDataset", 907 + schema_uri="at://did:plc:existing/schema/xyz", 908 + auto_publish_schema=False, 909 + ) 910 + 911 + # Should have called create_record only once (dataset only) 912 + assert mock_atproto_client.com.atproto.repo.create_record.call_count == 1 913 + 914 + def test_publish_no_schema_error(self, authenticated_client): 915 + """Publish without schema_uri and auto_publish_schema=False raises.""" 916 + mock_dataset = Mock() 917 + mock_dataset.url = "s3://bucket/data.tar" 918 + 919 + publisher = DatasetPublisher(authenticated_client) 920 + 921 + with pytest.raises(ValueError, match="schema_uri is required"): 922 + publisher.publish( 923 + mock_dataset, 924 + name="NoSchemaDataset", 925 + auto_publish_schema=False, 926 + ) 927 + 928 + 929 + class TestDatasetLoader: 930 + """Tests for DatasetLoader.""" 931 + 932 + def test_get_dataset(self, authenticated_client, mock_atproto_client): 933 + """Get a dataset record.""" 934 + mock_response = Mock() 935 + mock_response.value = { 936 + "$type": f"{LEXICON_NAMESPACE}.record", 937 + "name": "TestDataset", 938 + "schemaRef": "at://schema", 939 + "storage": { 940 + "$type": f"{LEXICON_NAMESPACE}.storageExternal", 941 + "urls": ["s3://bucket/data.tar"], 942 + }, 943 + } 944 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 945 + 946 + loader = DatasetLoader(authenticated_client) 947 + record = loader.get(f"at://did:plc:abc/{LEXICON_NAMESPACE}.record/xyz") 948 + 949 + assert record["name"] == "TestDataset" 950 + 951 + def test_get_dataset_wrong_type(self, authenticated_client, mock_atproto_client): 952 + """Get raises error for wrong record type.""" 953 + mock_response = Mock() 954 + mock_response.value = { 955 + "$type": f"{LEXICON_NAMESPACE}.sampleSchema", 956 + "name": "NotADataset", 957 + } 958 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 959 + 960 + loader = DatasetLoader(authenticated_client) 961 + 962 + with pytest.raises(ValueError, match="not a dataset record"): 963 + loader.get("at://did:plc:abc/collection/xyz") 964 + 965 + def test_get_urls(self, authenticated_client, mock_atproto_client): 966 + """Get WebDataset URLs from a dataset record.""" 967 + mock_response = Mock() 968 + mock_response.value = { 969 + "$type": f"{LEXICON_NAMESPACE}.record", 970 + "name": "TestDataset", 971 + "schemaRef": "at://schema", 972 + "storage": { 973 + "$type": f"{LEXICON_NAMESPACE}.storageExternal", 974 + "urls": ["s3://bucket/data-{000000..000009}.tar", "s3://bucket/extra.tar"], 975 + }, 976 + } 977 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 978 + 979 + loader = DatasetLoader(authenticated_client) 980 + urls = loader.get_urls(f"at://did:plc:abc/{LEXICON_NAMESPACE}.record/xyz") 981 + 982 + assert len(urls) == 2 983 + assert "data-{000000..000009}.tar" in urls[0] 984 + 985 + def test_get_urls_blob_storage_error(self, authenticated_client, mock_atproto_client): 986 + """Get URLs raises for blob storage datasets.""" 987 + mock_response = Mock() 988 + mock_response.value = { 989 + "$type": f"{LEXICON_NAMESPACE}.record", 990 + "name": "BlobDataset", 991 + "schemaRef": "at://schema", 992 + "storage": { 993 + "$type": f"{LEXICON_NAMESPACE}.storageBlobs", 994 + "blobs": [{"cid": "bafytest"}], 995 + }, 996 + } 997 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 998 + 999 + loader = DatasetLoader(authenticated_client) 1000 + 1001 + with pytest.raises(ValueError, match="blob storage"): 1002 + loader.get_urls(f"at://did:plc:abc/{LEXICON_NAMESPACE}.record/xyz") 1003 + 1004 + def test_get_metadata(self, authenticated_client, mock_atproto_client): 1005 + """Get metadata from dataset record.""" 1006 + import msgpack 1007 + 1008 + metadata_bytes = msgpack.packb({"split": "train", "samples": 10000}) 1009 + 1010 + mock_response = Mock() 1011 + mock_response.value = { 1012 + "$type": f"{LEXICON_NAMESPACE}.record", 1013 + "name": "MetaDataset", 1014 + "schemaRef": "at://schema", 1015 + "storage": {"$type": f"{LEXICON_NAMESPACE}.storageExternal", "urls": []}, 1016 + "metadata": metadata_bytes, 1017 + } 1018 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 1019 + 1020 + loader = DatasetLoader(authenticated_client) 1021 + metadata = loader.get_metadata(f"at://did:plc:abc/{LEXICON_NAMESPACE}.record/xyz") 1022 + 1023 + assert metadata["split"] == "train" 1024 + assert metadata["samples"] == 10000 1025 + 1026 + def test_get_metadata_none(self, authenticated_client, mock_atproto_client): 1027 + """Get metadata returns None when not present.""" 1028 + mock_response = Mock() 1029 + mock_response.value = { 1030 + "$type": f"{LEXICON_NAMESPACE}.record", 1031 + "name": "NoMetaDataset", 1032 + "schemaRef": "at://schema", 1033 + "storage": {"$type": f"{LEXICON_NAMESPACE}.storageExternal", "urls": []}, 1034 + } 1035 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 1036 + 1037 + loader = DatasetLoader(authenticated_client) 1038 + metadata = loader.get_metadata(f"at://did:plc:abc/{LEXICON_NAMESPACE}.record/xyz") 1039 + 1040 + assert metadata is None 1041 + 1042 + def test_list_all(self, authenticated_client, mock_atproto_client): 1043 + """List all datasets.""" 1044 + mock_record = Mock() 1045 + mock_record.value = {"name": "Dataset1"} 1046 + 1047 + mock_response = Mock() 1048 + mock_response.records = [mock_record] 1049 + mock_response.cursor = None 1050 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 1051 + 1052 + loader = DatasetLoader(authenticated_client) 1053 + datasets = loader.list_all() 1054 + 1055 + assert len(datasets) == 1 1056 + 1057 + 1058 + # ============================================================================= 1059 + # Tests for lens.py - LensPublisher 1060 + # ============================================================================= 1061 + 1062 + class TestLensPublisher: 1063 + """Tests for LensPublisher.""" 1064 + 1065 + def test_publish_with_code_refs(self, authenticated_client, mock_atproto_client): 1066 + """Publish lens with code references.""" 1067 + mock_response = Mock() 1068 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.lens/abc" 1069 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 1070 + 1071 + publisher = LensPublisher(authenticated_client) 1072 + uri = publisher.publish( 1073 + name="TestLens", 1074 + source_schema_uri="at://did:plc:abc/schema/source", 1075 + target_schema_uri="at://did:plc:abc/schema/target", 1076 + description="Transforms source to target", 1077 + code_repository="https://github.com/user/repo", 1078 + code_commit="abc123def456", 1079 + getter_path="module.lenses:my_getter", 1080 + putter_path="module.lenses:my_putter", 1081 + ) 1082 + 1083 + assert isinstance(uri, AtUri) 1084 + 1085 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 1086 + record = call_args.kwargs["data"]["record"] 1087 + assert record["name"] == "TestLens" 1088 + assert record["sourceSchema"] == "at://did:plc:abc/schema/source" 1089 + assert record["targetSchema"] == "at://did:plc:abc/schema/target" 1090 + assert record["getterCode"]["repository"] == "https://github.com/user/repo" 1091 + 1092 + def test_publish_without_code_refs(self, authenticated_client, mock_atproto_client): 1093 + """Publish lens without code references.""" 1094 + mock_response = Mock() 1095 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.lens/abc" 1096 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 1097 + 1098 + publisher = LensPublisher(authenticated_client) 1099 + uri = publisher.publish( 1100 + name="MetadataOnlyLens", 1101 + source_schema_uri="at://source", 1102 + target_schema_uri="at://target", 1103 + ) 1104 + 1105 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 1106 + record = call_args.kwargs["data"]["record"] 1107 + assert "getterCode" not in record 1108 + assert "putterCode" not in record 1109 + 1110 + def test_publish_from_lens_object(self, authenticated_client, mock_atproto_client): 1111 + """Publish lens from an atdata Lens object.""" 1112 + mock_response = Mock() 1113 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.lens/abc" 1114 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 1115 + 1116 + # Create a real lens 1117 + @atdata.lens 1118 + def test_lens(source: BasicSample) -> NumpySample: 1119 + return NumpySample( 1120 + data=np.array([source.value]), 1121 + label=source.name, 1122 + ) 1123 + 1124 + publisher = LensPublisher(authenticated_client) 1125 + uri = publisher.publish_from_lens( 1126 + test_lens, 1127 + name="FromObjectLens", 1128 + source_schema_uri="at://source", 1129 + target_schema_uri="at://target", 1130 + code_repository="https://github.com/user/repo", 1131 + code_commit="abc123", 1132 + ) 1133 + 1134 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 1135 + record = call_args.kwargs["data"]["record"] 1136 + assert "test_lens" in record["getterCode"]["path"] 1137 + 1138 + 1139 + class TestLensLoader: 1140 + """Tests for LensLoader.""" 1141 + 1142 + def test_get_lens(self, authenticated_client, mock_atproto_client): 1143 + """Get a lens record.""" 1144 + mock_response = Mock() 1145 + mock_response.value = { 1146 + "$type": f"{LEXICON_NAMESPACE}.lens", 1147 + "name": "TestLens", 1148 + "sourceSchema": "at://source", 1149 + "targetSchema": "at://target", 1150 + } 1151 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 1152 + 1153 + loader = LensLoader(authenticated_client) 1154 + record = loader.get(f"at://did:plc:abc/{LEXICON_NAMESPACE}.lens/xyz") 1155 + 1156 + assert record["name"] == "TestLens" 1157 + 1158 + def test_get_lens_wrong_type(self, authenticated_client, mock_atproto_client): 1159 + """Get raises error for wrong record type.""" 1160 + mock_response = Mock() 1161 + mock_response.value = { 1162 + "$type": f"{LEXICON_NAMESPACE}.record", 1163 + "name": "NotALens", 1164 + } 1165 + mock_atproto_client.com.atproto.repo.get_record.return_value = mock_response 1166 + 1167 + loader = LensLoader(authenticated_client) 1168 + 1169 + with pytest.raises(ValueError, match="not a lens record"): 1170 + loader.get("at://did:plc:abc/collection/xyz") 1171 + 1172 + def test_list_all(self, authenticated_client, mock_atproto_client): 1173 + """List all lens records.""" 1174 + mock_record = Mock() 1175 + mock_record.value = {"name": "Lens1"} 1176 + 1177 + mock_response = Mock() 1178 + mock_response.records = [mock_record] 1179 + mock_response.cursor = None 1180 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 1181 + 1182 + loader = LensLoader(authenticated_client) 1183 + lenses = loader.list_all() 1184 + 1185 + assert len(lenses) == 1 1186 + 1187 + def test_find_by_schemas_source_only(self, authenticated_client, mock_atproto_client): 1188 + """Find lenses by source schema only.""" 1189 + mock_records = [ 1190 + Mock(value={"sourceSchema": "at://schema/a", "targetSchema": "at://schema/b"}), 1191 + Mock(value={"sourceSchema": "at://schema/a", "targetSchema": "at://schema/c"}), 1192 + Mock(value={"sourceSchema": "at://schema/x", "targetSchema": "at://schema/y"}), 1193 + ] 1194 + 1195 + mock_response = Mock() 1196 + mock_response.records = mock_records 1197 + mock_response.cursor = None 1198 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 1199 + 1200 + loader = LensLoader(authenticated_client) 1201 + matches = loader.find_by_schemas(source_schema_uri="at://schema/a") 1202 + 1203 + assert len(matches) == 2 1204 + 1205 + def test_find_by_schemas_both(self, authenticated_client, mock_atproto_client): 1206 + """Find lenses by both source and target schema.""" 1207 + mock_records = [ 1208 + Mock(value={"sourceSchema": "at://schema/a", "targetSchema": "at://schema/b"}), 1209 + Mock(value={"sourceSchema": "at://schema/a", "targetSchema": "at://schema/c"}), 1210 + ] 1211 + 1212 + mock_response = Mock() 1213 + mock_response.records = mock_records 1214 + mock_response.cursor = None 1215 + mock_atproto_client.com.atproto.repo.list_records.return_value = mock_response 1216 + 1217 + loader = LensLoader(authenticated_client) 1218 + matches = loader.find_by_schemas( 1219 + source_schema_uri="at://schema/a", 1220 + target_schema_uri="at://schema/b", 1221 + ) 1222 + 1223 + assert len(matches) == 1 1224 + assert matches[0]["targetSchema"] == "at://schema/b" 1225 + 1226 + 1227 + # ============================================================================= 1228 + # Additional Edge Case Tests for Coverage 1229 + # ============================================================================= 1230 + 1231 + 1232 + class TestFieldTypeEdgeCases: 1233 + """Tests for FieldType and FieldDef edge cases.""" 1234 + 1235 + def test_field_with_description(self): 1236 + """Test FieldDef with description is included in dict output.""" 1237 + field_type = FieldType(kind="primitive", primitive="str") 1238 + field_def = FieldDef( 1239 + name="described_field", 1240 + field_type=field_type, 1241 + optional=False, 1242 + description="This is a description", 1243 + ) 1244 + 1245 + # Create a SchemaRecord to test _field_to_dict 1246 + schema = SchemaRecord( 1247 + name="TestSchema", 1248 + version="1.0.0", 1249 + fields=[field_def], 1250 + ) 1251 + record = schema.to_record() 1252 + 1253 + # Check that description is included 1254 + field = record["fields"][0] 1255 + assert field["description"] == "This is a description" 1256 + 1257 + def test_ndarray_type_with_shape(self): 1258 + """Test FieldType for ndarray with shape.""" 1259 + field_type = FieldType( 1260 + kind="ndarray", 1261 + dtype="float32", 1262 + shape=[224, 224, 3], 1263 + ) 1264 + 1265 + schema = SchemaRecord( 1266 + name="ShapedArraySchema", 1267 + version="1.0.0", 1268 + fields=[FieldDef(name="image", field_type=field_type, optional=False)], 1269 + ) 1270 + record = schema.to_record() 1271 + 1272 + field = record["fields"][0] 1273 + assert field["fieldType"]["shape"] == [224, 224, 3] 1274 + 1275 + def test_ref_type(self): 1276 + """Test FieldType for reference type.""" 1277 + field_type = FieldType( 1278 + kind="ref", 1279 + ref="at://did:plc:abc/atdata.sampleSchema/xyz", 1280 + ) 1281 + 1282 + schema = SchemaRecord( 1283 + name="RefSchema", 1284 + version="1.0.0", 1285 + fields=[FieldDef(name="reference", field_type=field_type, optional=False)], 1286 + ) 1287 + record = schema.to_record() 1288 + 1289 + field = record["fields"][0] 1290 + assert "ref" in field["fieldType"]["$type"] 1291 + assert field["fieldType"]["ref"] == "at://did:plc:abc/atdata.sampleSchema/xyz" 1292 + 1293 + def test_array_type_with_items(self): 1294 + """Test FieldType for array with typed items.""" 1295 + items_type = FieldType(kind="primitive", primitive="int") 1296 + field_type = FieldType(kind="array", items=items_type) 1297 + 1298 + schema = SchemaRecord( 1299 + name="ArraySchema", 1300 + version="1.0.0", 1301 + fields=[FieldDef(name="numbers", field_type=field_type, optional=False)], 1302 + ) 1303 + record = schema.to_record() 1304 + 1305 + field = record["fields"][0] 1306 + assert "array" in field["fieldType"]["$type"] 1307 + assert field["fieldType"]["items"]["primitive"] == "int" 1308 + 1309 + 1310 + class TestSchemaPublisherEdgeCases: 1311 + """Additional edge case tests for SchemaPublisher.""" 1312 + 1313 + def test_publish_list_field(self, authenticated_client, mock_atproto_client): 1314 + """Publish sample type with List[str] field.""" 1315 + from typing import List 1316 + 1317 + @atdata.packable 1318 + class ListSample: 1319 + tags: List[str] 1320 + values: List[int] 1321 + 1322 + mock_response = Mock() 1323 + mock_response.uri = f"at://did:plc:test/{LEXICON_NAMESPACE}.sampleSchema/abc" 1324 + mock_atproto_client.com.atproto.repo.create_record.return_value = mock_response 1325 + 1326 + publisher = SchemaPublisher(authenticated_client) 1327 + publisher.publish(ListSample, version="1.0.0") 1328 + 1329 + call_args = mock_atproto_client.com.atproto.repo.create_record.call_args 1330 + record = call_args.kwargs["data"]["record"] 1331 + 1332 + # Find the tags field 1333 + tags_field = next(f for f in record["fields"] if f["name"] == "tags") 1334 + assert "array" in tags_field["fieldType"]["$type"] 1335 + 1336 + def test_publish_nested_dataclass_error(self, authenticated_client): 1337 + """Publishing sample with nested dataclass raises error.""" 1338 + from dataclasses import dataclass 1339 + 1340 + @dataclass 1341 + class Inner: 1342 + value: int 1343 + 1344 + @atdata.packable 1345 + class Outer: 1346 + nested: Inner 1347 + 1348 + publisher = SchemaPublisher(authenticated_client) 1349 + 1350 + with pytest.raises(TypeError, match="Nested dataclass types not yet supported"): 1351 + publisher.publish(Outer, version="1.0.0") 1352 + 1353 + def test_publish_unsupported_type_error(self, authenticated_client): 1354 + """Publishing sample with unsupported type raises error.""" 1355 + 1356 + @atdata.packable 1357 + class UnsupportedSample: 1358 + value: complex # complex is not a supported type 1359 + 1360 + publisher = SchemaPublisher(authenticated_client) 1361 + 1362 + with pytest.raises(TypeError, match="Unsupported type"): 1363 + publisher.publish(UnsupportedSample, version="1.0.0")
+213 -35
tests/test_dataset.py
··· 148 148 assert cur_assertion, \ 149 149 f'Did not properly incorporate property {k} of test type {SampleType}' 150 150 151 - # 152 - 153 - # def test_decorator_syntax(): 154 - # """Test use of decorator syntax for sample types""" 155 - 156 - # @atdata.packable 157 - # class BasicTestSampleDecorated: 158 - # name: str 159 - # position: int 160 - # value: float 161 - 162 - # @atdata.packable 163 - # class NumpyTestSampleDecorated: 164 - # label: int 165 - # image: NDArray 166 - 167 - # ## 168 - 169 - # test_create_sample( BasicTestSampleDecorated, { 170 - # 'name': 'Hello, world!', 171 - # 'position': 42, 172 - # 'value': 1024.768, 173 - # } ) 174 - 175 - # test_create_sample( NumpyTestSampleDecorated, { 176 - # 'label': 9_001, 177 - # 'image': np.random.randn( 1024, 1024 ), 178 - # } ) 179 - 180 - # 181 151 182 152 @pytest.mark.parametrize( 183 153 ('SampleType', 'sample_data', 'sample_wds_stem'), ··· 301 271 break 302 272 303 273 assert iterations_run == n_iterate, \ 304 - "Only found {iterations_run} samples, not {n_iterate}" 274 + f"Only found {iterations_run} samples, not {n_iterate}" 305 275 306 276 307 277 ## Shuffled ··· 353 323 break 354 324 355 325 assert iterations_run == n_iterate, \ 356 - "Only found {iterations_run} samples, not {n_iterate}" 326 + f"Only found {iterations_run} samples, not {n_iterate}" 357 327 358 328 # 359 329 ··· 402 372 parquet_filename = tmp_path / f'{sample_wds_stem}-segments.parquet' 403 373 dataset.to_parquet( parquet_filename, maxcount = n_per_file ) 404 374 405 - ## Double-check our `parquet` export 406 - 407 - # TODO 375 + 376 + ## 377 + # Edge case tests for coverage 378 + 379 + 380 + def test_batch_aggregate_empty(): 381 + """Test _batch_aggregate with empty list returns empty list.""" 382 + result = atds._batch_aggregate([]) 383 + assert result == [], "Empty input should return empty list" 384 + 385 + 386 + def test_sample_batch_attribute_error(): 387 + """Test SampleBatch raises AttributeError for non-existent attributes.""" 388 + @atdata.packable 389 + class SimpleSample: 390 + name: str 391 + value: int 392 + 393 + samples = [SimpleSample(name="test", value=1)] 394 + batch = atdata.SampleBatch[SimpleSample](samples) 395 + 396 + with pytest.raises(AttributeError, match="No sample attribute named"): 397 + _ = batch.nonexistent_attribute 398 + 399 + 400 + def test_sample_batch_type_property(): 401 + """Test SampleBatch.sample_type property.""" 402 + @atdata.packable 403 + class TypedSample: 404 + data: str 405 + 406 + samples = [TypedSample(data="hello")] 407 + batch = atdata.SampleBatch[TypedSample](samples) 408 + 409 + assert batch.sample_type == TypedSample 410 + 411 + 412 + def test_dataset_batch_type_property(tmp_path): 413 + """Test Dataset.batch_type property.""" 414 + @atdata.packable 415 + class BatchTypeSample: 416 + value: int 417 + 418 + # Create a simple dataset 419 + wds_filename = (tmp_path / "batch_type_test.tar").as_posix() 420 + with wds.writer.TarWriter(wds_filename) as sink: 421 + sample = BatchTypeSample(value=42) 422 + sink.write(sample.as_wds) 423 + 424 + dataset = atdata.Dataset[BatchTypeSample](wds_filename) 425 + batch_type = dataset.batch_type 426 + 427 + # batch_type should be SampleBatch parameterized with the sample type 428 + assert batch_type.__origin__ == atdata.SampleBatch 429 + 430 + 431 + def test_dataset_shard_list_property(tmp_path): 432 + """Test Dataset.shard_list property returns list of shard URLs.""" 433 + @atdata.packable 434 + class ShardListSample: 435 + value: int 436 + 437 + # Create multiple shards 438 + file_pattern = (tmp_path / "shards_test-%06d.tar").as_posix() 439 + with wds.writer.ShardWriter(pattern=file_pattern, maxcount=5) as sink: 440 + for i in range(15): 441 + sample = ShardListSample(value=i) 442 + sink.write(sample.as_wds) 443 + 444 + # Read with brace pattern 445 + brace_pattern = (tmp_path / "shards_test-{000000..000002}.tar").as_posix() 446 + dataset = atdata.Dataset[ShardListSample](brace_pattern) 447 + 448 + shard_list = dataset.shard_list 449 + assert isinstance(shard_list, list) 450 + assert len(shard_list) == 3 451 + 452 + 453 + def test_dataset_metadata_property(tmp_path): 454 + """Test Dataset.metadata property fetches and caches metadata from URL.""" 455 + from unittest.mock import patch, Mock 456 + import msgpack 457 + 458 + @atdata.packable 459 + class MetadataSample: 460 + value: int 461 + 462 + # Create a simple dataset 463 + wds_filename = (tmp_path / "metadata_test.tar").as_posix() 464 + with wds.writer.TarWriter(wds_filename) as sink: 465 + sample = MetadataSample(value=42) 466 + sink.write(sample.as_wds) 467 + 468 + # Mock the requests.get call 469 + mock_metadata = {"key": "value", "count": 100} 470 + mock_response = Mock() 471 + mock_response.content = msgpack.packb(mock_metadata) 472 + mock_response.raise_for_status = Mock() 473 + mock_response.__enter__ = Mock(return_value=mock_response) 474 + mock_response.__exit__ = Mock(return_value=False) 475 + 476 + with patch("atdata.dataset.requests.get", return_value=mock_response) as mock_get: 477 + dataset = atdata.Dataset[MetadataSample]( 478 + wds_filename, 479 + metadata_url="http://example.com/metadata.msgpack" 480 + ) 481 + 482 + # First call should fetch 483 + metadata = dataset.metadata 484 + assert metadata == mock_metadata 485 + mock_get.assert_called_once_with("http://example.com/metadata.msgpack", stream=True) 486 + 487 + # Second call should use cache 488 + metadata2 = dataset.metadata 489 + assert metadata2 == mock_metadata 490 + assert mock_get.call_count == 1 # Still only one call 491 + 492 + 493 + def test_dataset_metadata_property_none(tmp_path): 494 + """Test Dataset.metadata returns None when no metadata_url is set.""" 495 + @atdata.packable 496 + class NoMetadataSample: 497 + value: int 498 + 499 + wds_filename = (tmp_path / "no_metadata_test.tar").as_posix() 500 + with wds.writer.TarWriter(wds_filename) as sink: 501 + sample = NoMetadataSample(value=42) 502 + sink.write(sample.as_wds) 503 + 504 + dataset = atdata.Dataset[NoMetadataSample](wds_filename) 505 + assert dataset.metadata is None 506 + 507 + 508 + def test_parquet_export_with_remainder(tmp_path): 509 + """Test parquet export with maxcount that doesn't divide evenly.""" 510 + @atdata.packable 511 + class RemainderSample: 512 + name: str 513 + value: int 514 + 515 + # Create dataset with 25 samples 516 + n_samples = 25 517 + maxcount = 10 # Will create 3 segments: 10, 10, 5 518 + 519 + wds_filename = (tmp_path / "remainder_test.tar").as_posix() 520 + with wds.writer.TarWriter(wds_filename) as sink: 521 + for i in range(n_samples): 522 + sample = RemainderSample(name=f"sample_{i}", value=i) 523 + sink.write(sample.as_wds) 524 + 525 + dataset = atdata.Dataset[RemainderSample](wds_filename) 526 + parquet_path = tmp_path / "remainder_output.parquet" 527 + dataset.to_parquet(parquet_path, maxcount=maxcount) 528 + 529 + # Should have created 3 segment files 530 + import pandas as pd 531 + segment_files = list(tmp_path.glob("remainder_output-*.parquet")) 532 + assert len(segment_files) == 3 533 + 534 + # Check total row count 535 + total_rows = sum(len(pd.read_parquet(f)) for f in segment_files) 536 + assert total_rows == n_samples 537 + 538 + 539 + def test_dataset_with_lens_batched(tmp_path): 540 + """Test dataset iteration with lens transformation in batch mode.""" 541 + from dataclasses import dataclass 542 + 543 + @dataclass 544 + class SourceSample(atdata.PackableSample): 545 + name: str 546 + age: int 547 + score: float 548 + 549 + @dataclass 550 + class ViewSample(atdata.PackableSample): 551 + name: str 552 + score: float 553 + 554 + @atdata.lens 555 + def extract_view(s: SourceSample) -> ViewSample: 556 + return ViewSample(name=s.name, score=s.score) 557 + 558 + # Create dataset 559 + n_samples = 20 560 + batch_size = 4 561 + wds_filename = (tmp_path / "lens_batch_test.tar").as_posix() 562 + 563 + with wds.writer.TarWriter(wds_filename) as sink: 564 + for i in range(n_samples): 565 + sample = SourceSample(name=f"person_{i}", age=20 + i, score=float(i) * 1.5) 566 + sink.write(sample.as_wds) 567 + 568 + # Read with lens transformation in batch mode 569 + dataset = atdata.Dataset[SourceSample](wds_filename).as_type(ViewSample) 570 + 571 + batches_seen = 0 572 + for batch in dataset.ordered(batch_size=batch_size): 573 + assert isinstance(batch, atdata.SampleBatch) 574 + assert batch.sample_type == ViewSample 575 + 576 + # Check that samples are ViewSample type (not SourceSample) 577 + for sample in batch.samples: 578 + assert isinstance(sample, ViewSample) 579 + assert hasattr(sample, "name") 580 + assert hasattr(sample, "score") 581 + assert not hasattr(sample, "age") # age is not in ViewSample 582 + 583 + batches_seen += 1 584 + 585 + assert batches_seen == n_samples // batch_size 408 586 409 587 410 588 ##
+94
tests/test_helpers.py
··· 1 + """Tests for atdata._helpers module.""" 2 + 3 + import numpy as np 4 + import pytest 5 + 6 + from atdata._helpers import array_to_bytes, bytes_to_array 7 + 8 + 9 + class TestArraySerialization: 10 + """Test array_to_bytes and bytes_to_array round-trip serialization.""" 11 + 12 + @pytest.mark.parametrize("dtype", [ 13 + np.float32, 14 + np.float64, 15 + np.int32, 16 + np.int64, 17 + np.uint8, 18 + np.bool_, 19 + np.complex64, 20 + ]) 21 + def test_dtype_preservation(self, dtype): 22 + """Verify dtype is preserved through serialization.""" 23 + original = np.array([1, 2, 3], dtype=dtype) 24 + serialized = array_to_bytes(original) 25 + restored = bytes_to_array(serialized) 26 + 27 + assert restored.dtype == original.dtype 28 + np.testing.assert_array_equal(restored, original) 29 + 30 + @pytest.mark.parametrize("shape", [ 31 + (10,), 32 + (3, 4), 33 + (2, 3, 4), 34 + (1, 1, 1, 1), 35 + ]) 36 + def test_shape_preservation(self, shape): 37 + """Verify shape is preserved through serialization.""" 38 + original = np.random.rand(*shape).astype(np.float32) 39 + serialized = array_to_bytes(original) 40 + restored = bytes_to_array(serialized) 41 + 42 + assert restored.shape == original.shape 43 + np.testing.assert_array_almost_equal(restored, original) 44 + 45 + def test_empty_array(self): 46 + """Verify empty arrays serialize correctly.""" 47 + original = np.array([], dtype=np.float32) 48 + serialized = array_to_bytes(original) 49 + restored = bytes_to_array(serialized) 50 + 51 + assert restored.shape == (0,) 52 + assert restored.dtype == np.float32 53 + 54 + def test_scalar_array(self): 55 + """Verify 0-dimensional arrays serialize correctly.""" 56 + original = np.array(42.0) 57 + serialized = array_to_bytes(original) 58 + restored = bytes_to_array(serialized) 59 + 60 + assert restored.shape == () 61 + assert restored == 42.0 62 + 63 + def test_large_array(self): 64 + """Verify large arrays serialize correctly.""" 65 + original = np.random.rand(100, 100).astype(np.float32) 66 + serialized = array_to_bytes(original) 67 + restored = bytes_to_array(serialized) 68 + 69 + np.testing.assert_array_almost_equal(restored, original) 70 + 71 + def test_contiguous_and_noncontiguous(self): 72 + """Verify non-contiguous arrays serialize correctly.""" 73 + original = np.random.rand(10, 10).astype(np.float32) 74 + non_contiguous = original[::2, ::2] # Strided view 75 + 76 + assert not non_contiguous.flags['C_CONTIGUOUS'] 77 + 78 + serialized = array_to_bytes(non_contiguous) 79 + restored = bytes_to_array(serialized) 80 + 81 + np.testing.assert_array_almost_equal(restored, non_contiguous) 82 + 83 + def test_bytes_output_type(self): 84 + """Verify array_to_bytes returns bytes.""" 85 + arr = np.array([1, 2, 3]) 86 + result = array_to_bytes(arr) 87 + assert isinstance(result, bytes) 88 + 89 + def test_ndarray_output_type(self): 90 + """Verify bytes_to_array returns ndarray.""" 91 + arr = np.array([1, 2, 3]) 92 + serialized = array_to_bytes(arr) 93 + result = bytes_to_array(serialized) 94 + assert isinstance(result, np.ndarray)
+81 -7
tests/test_lens.py
··· 78 78 y = polite.put( polite( test_source ), test_source ) 79 79 assert y == test_source, \ 80 80 f'Violation of PutGet: {y} =/= {test_source}' 81 - 82 - # TODO Test PutPut 81 + 82 + # PutPut law: put(v2, put(v1, s)) = put(v2, s) 83 + another_view = View( 84 + name = 'Different Name', 85 + height = 165.0, 86 + ) 87 + z1 = polite.put( another_view, polite.put( update_view, test_source ) ) 88 + z2 = polite.put( another_view, test_source ) 89 + assert z1 == z2, \ 90 + f'Violation of PutPut: {z1} =/= {z2}' 83 91 84 92 def test_conversion( tmp_path ): 85 93 """Test automatic interconversion between sample types""" ··· 104 112 favorite_pizza = s.favorite_pizza, 105 113 favorite_image = s.favorite_image, 106 114 ) 107 - 108 - lens_network = atdata.LensNetwork() 109 - print( lens_network._registry ) 110 115 111 116 # Map a test sample through the view 112 117 test_source = Source( ··· 156 161 157 162 assert sample.name == test_view.name, \ 158 163 f'Divergence on auto-mapped dataset: `name` should be {test_view.name}, but is {sample.name}' 159 - # assert sample.height == test_view.height, \ 160 - # f'Divergence on auto-mapped dataset: `height` should be {test_view.height}, but is {sample.height}' 161 164 assert sample.favorite_pizza == test_view.favorite_pizza, \ 162 165 f'Divergence on auto-mapped dataset: `favorite_pizza` should be {test_view.favorite_pizza}, but is {sample.favorite_pizza}' 163 166 assert np.all( sample.favorite_image == test_view.favorite_image ), \ 164 167 f'Divergence on auto-mapped dataset: `favorite_image`' 168 + 169 + 170 + ## 171 + # Edge case tests for coverage 172 + 173 + 174 + def test_lens_get_method(): 175 + """Test calling lens.get() explicitly instead of lens().""" 176 + @atdata.packable 177 + class GetSource: 178 + value: int 179 + 180 + @atdata.packable 181 + class GetView: 182 + doubled: int 183 + 184 + @atdata.lens 185 + def doubler(s: GetSource) -> GetView: 186 + return GetView(doubled=s.value * 2) 187 + 188 + source = GetSource(value=5) 189 + 190 + # Test both calling conventions 191 + result_call = doubler(source) 192 + result_get = doubler.get(source) 193 + 194 + assert result_call == result_get 195 + assert result_get.doubled == 10 196 + 197 + 198 + def test_lens_trivial_putter(): 199 + """Test lens without explicit putter uses trivial putter.""" 200 + @atdata.packable 201 + class TrivialSource: 202 + a: int 203 + b: str 204 + 205 + @atdata.packable 206 + class TrivialView: 207 + a: int 208 + 209 + # Create lens without putter 210 + @atdata.lens 211 + def extract_a(s: TrivialSource) -> TrivialView: 212 + return TrivialView(a=s.a) 213 + 214 + source = TrivialSource(a=10, b="hello") 215 + view = TrivialView(a=99) 216 + 217 + # Trivial putter should return source unchanged 218 + result = extract_a.put(view, source) 219 + assert result == source, "Trivial putter should return source unchanged" 220 + 221 + 222 + def test_lens_network_missing_lens(): 223 + """Test LensNetwork raises ValueError for unregistered lens.""" 224 + from atdata.lens import LensNetwork 225 + 226 + @atdata.packable 227 + class UnregisteredSource: 228 + x: int 229 + 230 + @atdata.packable 231 + class UnregisteredView: 232 + y: int 233 + 234 + network = LensNetwork() 235 + 236 + with pytest.raises(ValueError, match="No registered lens"): 237 + network.transform(UnregisteredSource, UnregisteredView) 238 + 165 239 166 240 ##
+1032
tests/test_local.py
··· 1 + """Test local repository storage functionality.""" 2 + 3 + ## 4 + # Imports 5 + 6 + import pytest 7 + 8 + # System 9 + from dataclasses import dataclass 10 + from pathlib import Path 11 + from uuid import UUID 12 + 13 + # External 14 + import numpy as np 15 + from redis import Redis 16 + from moto import mock_aws 17 + 18 + # Local 19 + import atdata 20 + import atdata.local as atlocal 21 + import webdataset as wds 22 + 23 + # Typing 24 + from numpy.typing import NDArray 25 + from typing import Any 26 + 27 + 28 + ## 29 + # Test fixtures 30 + 31 + @pytest.fixture 32 + def redis_connection(): 33 + """Provide a Redis connection, skip test if Redis is not available.""" 34 + try: 35 + redis = Redis() 36 + redis.ping() 37 + yield redis 38 + except Exception: 39 + pytest.skip("Redis server not available") 40 + 41 + 42 + @pytest.fixture 43 + def clean_redis(redis_connection): 44 + """Provide a Redis connection with automatic BasicIndexEntry cleanup. 45 + 46 + Clears all BasicIndexEntry keys before and after each test to ensure 47 + test isolation. 48 + """ 49 + def _clear_entries(): 50 + for key in redis_connection.scan_iter(match='BasicIndexEntry:*'): 51 + redis_connection.delete(key) 52 + 53 + _clear_entries() 54 + yield redis_connection 55 + _clear_entries() 56 + 57 + 58 + @pytest.fixture 59 + def mock_s3(): 60 + """Provide a mock S3 environment using moto. 61 + 62 + Note: Tests using this fixture may generate warnings due to s3fs/moto async 63 + incompatibility. These are suppressed via @pytest.mark.filterwarnings on 64 + individual tests that use this fixture. 65 + """ 66 + with mock_aws(): 67 + # Create S3 credentials dict (no endpoint_url for moto) 68 + creds = { 69 + 'AWS_ACCESS_KEY_ID': 'testing', 70 + 'AWS_SECRET_ACCESS_KEY': 'testing' 71 + } 72 + 73 + # Create S3 client and bucket 74 + import boto3 75 + s3_client = boto3.client( 76 + 's3', 77 + aws_access_key_id=creds['AWS_ACCESS_KEY_ID'], 78 + aws_secret_access_key=creds['AWS_SECRET_ACCESS_KEY'], 79 + region_name='us-east-1' 80 + ) 81 + 82 + bucket_name = 'test-bucket' 83 + s3_client.create_bucket(Bucket=bucket_name) 84 + 85 + yield { 86 + 'credentials': creds, 87 + 'bucket': bucket_name, 88 + 'hive_path': f'{bucket_name}/datasets', 89 + 's3_client': s3_client 90 + } 91 + 92 + 93 + @pytest.fixture 94 + def sample_dataset(tmp_path): 95 + """Create a sample WebDataset for testing.""" 96 + # Create a temporary WebDataset 97 + dataset_path = tmp_path / "test-dataset-000000.tar" 98 + 99 + with wds.writer.TarWriter(str(dataset_path)) as sink: 100 + for i in range(10): 101 + sample = SimpleTestSample(name=f"sample_{i}", value=i * 10) 102 + sink.write(sample.as_wds) 103 + 104 + ds = atdata.Dataset[SimpleTestSample](url=str(dataset_path)) 105 + return ds 106 + 107 + 108 + @dataclass 109 + class SimpleTestSample(atdata.PackableSample): 110 + """Simple test sample for repository tests.""" 111 + name: str 112 + value: int 113 + 114 + 115 + @dataclass 116 + class ArrayTestSample(atdata.PackableSample): 117 + """Test sample with numpy array for repository tests.""" 118 + label: str 119 + data: NDArray 120 + 121 + 122 + def make_simple_dataset(tmp_path: Path, num_samples: int = 10, name: str = "test") -> atdata.Dataset: 123 + """Create a SimpleTestSample dataset for testing.""" 124 + dataset_path = tmp_path / f"{name}-dataset-000000.tar" 125 + with wds.writer.TarWriter(str(dataset_path)) as sink: 126 + for i in range(num_samples): 127 + sample = SimpleTestSample(name=f"sample_{i}", value=i * 10) 128 + sink.write(sample.as_wds) 129 + return atdata.Dataset[SimpleTestSample](url=str(dataset_path)) 130 + 131 + 132 + def make_array_dataset(tmp_path: Path, num_samples: int = 3, array_shape: tuple = (10, 10)) -> atdata.Dataset: 133 + """Create an ArrayTestSample dataset for testing.""" 134 + dataset_path = tmp_path / "array-dataset-000000.tar" 135 + with wds.writer.TarWriter(str(dataset_path)) as sink: 136 + for i in range(num_samples): 137 + arr = np.random.randn(*array_shape) 138 + sample = ArrayTestSample(label=f"array_{i}", data=arr) 139 + sink.write(sample.as_wds) 140 + return atdata.Dataset[ArrayTestSample](url=str(dataset_path)) 141 + 142 + 143 + ## 144 + # Helper function tests 145 + 146 + def test_kind_str_for_sample_type(): 147 + """Test that sample types are converted to correct fully-qualified string identifiers. 148 + 149 + Should produce strings in format 'module.name' that uniquely identify the sample type. 150 + """ 151 + result = atlocal._kind_str_for_sample_type(SimpleTestSample) 152 + assert result == f"{SimpleTestSample.__module__}.SimpleTestSample" 153 + 154 + result2 = atlocal._kind_str_for_sample_type(ArrayTestSample) 155 + assert result2 == f"{ArrayTestSample.__module__}.ArrayTestSample" 156 + 157 + 158 + def test_decode_bytes_dict(): 159 + """Test that byte dictionaries from Redis are correctly decoded to strings. 160 + 161 + Should handle UTF-8 decoding of both keys and values from Redis response format. 162 + """ 163 + bytes_dict = { 164 + b'wds_url': b's3://bucket/dataset.tar', 165 + b'sample_kind': b'module.Sample', 166 + b'metadata_url': b's3://bucket/metadata.msgpack', 167 + b'uuid': b'12345678-1234-1234-1234-123456789abc' 168 + } 169 + 170 + result = atlocal._decode_bytes_dict(bytes_dict) 171 + 172 + assert result == { 173 + 'wds_url': 's3://bucket/dataset.tar', 174 + 'sample_kind': 'module.Sample', 175 + 'metadata_url': 's3://bucket/metadata.msgpack', 176 + 'uuid': '12345678-1234-1234-1234-123456789abc' 177 + } 178 + assert all(isinstance(k, str) for k in result.keys()) 179 + assert all(isinstance(v, str) for v in result.values()) 180 + 181 + 182 + def test_s3_env_valid_credentials(tmp_path): 183 + """Test loading S3 credentials from a valid .env file. 184 + 185 + Should successfully parse AWS_ENDPOINT, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY 186 + from a properly formatted .env file. 187 + """ 188 + env_file = tmp_path / ".env" 189 + env_file.write_text( 190 + "AWS_ENDPOINT=http://localhost:9000\n" 191 + "AWS_ACCESS_KEY_ID=minioadmin\n" 192 + "AWS_SECRET_ACCESS_KEY=minioadmin\n" 193 + ) 194 + 195 + result = atlocal._s3_env(env_file) 196 + 197 + assert result == { 198 + 'AWS_ENDPOINT': 'http://localhost:9000', 199 + 'AWS_ACCESS_KEY_ID': 'minioadmin', 200 + 'AWS_SECRET_ACCESS_KEY': 'minioadmin' 201 + } 202 + 203 + 204 + @pytest.mark.parametrize("missing_field,env_content", [ 205 + ("AWS_ENDPOINT", "AWS_ACCESS_KEY_ID=minioadmin\nAWS_SECRET_ACCESS_KEY=minioadmin\n"), 206 + ("AWS_ACCESS_KEY_ID", "AWS_ENDPOINT=http://localhost:9000\nAWS_SECRET_ACCESS_KEY=minioadmin\n"), 207 + ("AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINT=http://localhost:9000\nAWS_ACCESS_KEY_ID=minioadmin\n"), 208 + ]) 209 + def test_s3_env_missing_required_field(tmp_path, missing_field, env_content): 210 + """Test that loading S3 credentials fails when a required field is missing. 211 + 212 + Should raise AssertionError when .env file lacks any of the required fields: 213 + AWS_ENDPOINT, AWS_ACCESS_KEY_ID, or AWS_SECRET_ACCESS_KEY. 214 + """ 215 + env_file = tmp_path / ".env" 216 + env_file.write_text(env_content) 217 + 218 + with pytest.raises(AssertionError): 219 + atlocal._s3_env(env_file) 220 + 221 + 222 + def test_s3_from_credentials_with_dict(): 223 + """Test creating S3FileSystem from a credentials dictionary. 224 + 225 + Should create a properly configured S3FileSystem instance using dict credentials. 226 + """ 227 + creds = { 228 + 'AWS_ENDPOINT': 'http://localhost:9000', 229 + 'AWS_ACCESS_KEY_ID': 'minioadmin', 230 + 'AWS_SECRET_ACCESS_KEY': 'minioadmin' 231 + } 232 + 233 + fs = atlocal._s3_from_credentials(creds) 234 + 235 + assert isinstance(fs, atlocal.S3FileSystem) 236 + assert fs.endpoint_url == 'http://localhost:9000' 237 + assert fs.key == 'minioadmin' 238 + assert fs.secret == 'minioadmin' 239 + 240 + 241 + def test_s3_from_credentials_with_path(tmp_path): 242 + """Test creating S3FileSystem from a .env file path. 243 + 244 + Should load credentials from file and create S3FileSystem instance. 245 + """ 246 + env_file = tmp_path / ".env" 247 + env_file.write_text( 248 + "AWS_ENDPOINT=http://localhost:9000\n" 249 + "AWS_ACCESS_KEY_ID=minioadmin\n" 250 + "AWS_SECRET_ACCESS_KEY=minioadmin\n" 251 + ) 252 + 253 + fs = atlocal._s3_from_credentials(env_file) 254 + 255 + assert isinstance(fs, atlocal.S3FileSystem) 256 + assert fs.endpoint_url == 'http://localhost:9000' 257 + assert fs.key == 'minioadmin' 258 + assert fs.secret == 'minioadmin' 259 + 260 + 261 + ## 262 + # BasicIndexEntry tests 263 + 264 + def test_basic_index_entry_creation(): 265 + """Test creating a BasicIndexEntry with explicit values. 266 + 267 + Should create an entry with provided wds_url, sample_kind, metadata_url, and uuid. 268 + """ 269 + entry = atlocal.BasicIndexEntry( 270 + wds_url="s3://bucket/dataset.tar", 271 + sample_kind="test_module.TestSample", 272 + metadata_url="s3://bucket/metadata.msgpack", 273 + uuid="12345678-1234-1234-1234-123456789abc" 274 + ) 275 + 276 + assert entry.wds_url == "s3://bucket/dataset.tar" 277 + assert entry.sample_kind == "test_module.TestSample" 278 + assert entry.metadata_url == "s3://bucket/metadata.msgpack" 279 + assert entry.uuid == "12345678-1234-1234-1234-123456789abc" 280 + 281 + 282 + def test_basic_index_entry_default_uuid(): 283 + """Test that BasicIndexEntry generates a valid UUID by default. 284 + 285 + Should auto-generate a unique UUID when none is provided, and it should be 286 + parsable as a valid UUID. 287 + """ 288 + entry = atlocal.BasicIndexEntry( 289 + wds_url="s3://bucket/dataset.tar", 290 + sample_kind="test_module.TestSample", 291 + metadata_url="s3://bucket/metadata.msgpack" 292 + ) 293 + 294 + assert entry.uuid is not None 295 + # Verify it's a valid UUID by parsing it 296 + parsed_uuid = UUID(entry.uuid) 297 + assert str(parsed_uuid) == entry.uuid 298 + 299 + 300 + def test_basic_index_entry_write_to_redis(clean_redis): 301 + """Test persisting a BasicIndexEntry to Redis. 302 + 303 + Should write the entry to Redis as a hash with key 'BasicIndexEntry:{uuid}' 304 + and all fields should be retrievable with correct values. 305 + """ 306 + test_uuid = "12345678-1234-1234-1234-123456789abc" 307 + 308 + entry = atlocal.BasicIndexEntry( 309 + wds_url="s3://bucket/dataset.tar", 310 + sample_kind="test_module.TestSample", 311 + metadata_url="s3://bucket/metadata.msgpack", 312 + uuid=test_uuid 313 + ) 314 + 315 + entry.write_to(clean_redis) 316 + 317 + # Retrieve and verify actual stored values 318 + stored_data = atlocal._decode_bytes_dict(clean_redis.hgetall(f"BasicIndexEntry:{test_uuid}")) 319 + assert stored_data['wds_url'] == "s3://bucket/dataset.tar" 320 + assert stored_data['sample_kind'] == "test_module.TestSample" 321 + assert stored_data['metadata_url'] == "s3://bucket/metadata.msgpack" 322 + assert stored_data['uuid'] == test_uuid 323 + 324 + 325 + def test_basic_index_entry_round_trip_redis(clean_redis): 326 + """Test writing and reading a BasicIndexEntry from Redis. 327 + 328 + Should be able to write an entry to Redis and read it back with all fields 329 + intact and matching the original values. 330 + """ 331 + test_uuid = "12345678-1234-1234-1234-123456789abc" 332 + 333 + original_entry = atlocal.BasicIndexEntry( 334 + wds_url="s3://bucket/dataset.tar", 335 + sample_kind="test_module.TestSample", 336 + metadata_url="s3://bucket/metadata.msgpack", 337 + uuid=test_uuid 338 + ) 339 + 340 + original_entry.write_to(clean_redis) 341 + 342 + # Read back from Redis 343 + stored_data = atlocal._decode_bytes_dict(clean_redis.hgetall(f"BasicIndexEntry:{test_uuid}")) 344 + retrieved_entry = atlocal.BasicIndexEntry(**stored_data) 345 + 346 + assert retrieved_entry.wds_url == original_entry.wds_url 347 + assert retrieved_entry.sample_kind == original_entry.sample_kind 348 + assert retrieved_entry.metadata_url == original_entry.metadata_url 349 + assert retrieved_entry.uuid == original_entry.uuid 350 + 351 + 352 + ## 353 + # Index tests 354 + 355 + def test_index_init_default_redis(): 356 + """Test creating an Index with default Redis connection. 357 + 358 + Should create a new Redis connection using default parameters when no 359 + redis argument is provided. 360 + """ 361 + index = atlocal.Index() 362 + 363 + assert index._redis is not None 364 + assert isinstance(index._redis, Redis) 365 + 366 + 367 + def test_index_init_with_redis_connection(): 368 + """Test creating an Index with an existing Redis connection. 369 + 370 + Should use the provided Redis connection instead of creating a new one. 371 + """ 372 + redis = Redis() 373 + index = atlocal.Index(redis=redis) 374 + 375 + assert index._redis is redis 376 + 377 + 378 + def test_index_init_with_redis_kwargs(): 379 + """Test creating an Index with Redis connection kwargs. 380 + 381 + Should pass custom kwargs to Redis constructor when creating a new connection. 382 + """ 383 + index = atlocal.Index(host='localhost', port=6379, db=0) 384 + 385 + assert index._redis is not None 386 + assert isinstance(index._redis, Redis) 387 + 388 + 389 + def test_index_add_entry_without_uuid(clean_redis): 390 + """Test adding a dataset entry to the index without specifying UUID. 391 + 392 + Should create a BasicIndexEntry with auto-generated UUID and persist it to Redis. 393 + """ 394 + index = atlocal.Index(redis=clean_redis) 395 + 396 + ds = atdata.Dataset[SimpleTestSample]( 397 + url="s3://bucket/dataset.tar", 398 + metadata_url="s3://bucket/metadata.msgpack" 399 + ) 400 + 401 + entry = index.add_entry(ds) 402 + 403 + assert entry.uuid is not None 404 + assert entry.wds_url == ds.url 405 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 406 + assert entry.metadata_url == ds.metadata_url 407 + 408 + # Verify it was persisted to Redis 409 + stored_data = clean_redis.hgetall(f"BasicIndexEntry:{entry.uuid}") 410 + assert len(stored_data) > 0 411 + 412 + 413 + def test_index_add_entry_with_uuid(clean_redis): 414 + """Test adding a dataset entry to the index with a specified UUID. 415 + 416 + Should create a BasicIndexEntry with the provided UUID and persist it to Redis. 417 + """ 418 + index = atlocal.Index(redis=clean_redis) 419 + test_uuid = "12345678-1234-1234-1234-123456789abc" 420 + 421 + ds = atdata.Dataset[SimpleTestSample]( 422 + url="s3://bucket/dataset.tar", 423 + metadata_url="s3://bucket/metadata.msgpack" 424 + ) 425 + 426 + entry = index.add_entry(ds, uuid=test_uuid) 427 + 428 + assert entry.uuid == test_uuid 429 + assert entry.wds_url == ds.url 430 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 431 + assert entry.metadata_url == ds.metadata_url 432 + 433 + 434 + def test_index_entries_generator_empty(clean_redis): 435 + """Test iterating over entries in an empty index. 436 + 437 + Should yield no entries when the index is empty. 438 + """ 439 + index = atlocal.Index(redis=clean_redis) 440 + 441 + entries = list(index.entries) 442 + assert len(entries) == 0 443 + 444 + 445 + def test_index_entries_generator_multiple(clean_redis): 446 + """Test iterating over multiple entries in the index. 447 + 448 + Should yield all BasicIndexEntry objects that have been added to the index. 449 + """ 450 + index = atlocal.Index(redis=clean_redis) 451 + 452 + ds1 = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset1.tar") 453 + ds2 = atdata.Dataset[ArrayTestSample](url="s3://bucket/dataset2.tar") 454 + 455 + entry1 = index.add_entry(ds1) 456 + entry2 = index.add_entry(ds2) 457 + 458 + entries = list(index.entries) 459 + assert len(entries) == 2 460 + 461 + uuids = {entry.uuid for entry in entries} 462 + assert entry1.uuid in uuids 463 + assert entry2.uuid in uuids 464 + 465 + 466 + def test_index_all_entries_empty(clean_redis): 467 + """Test getting all entries as a list from an empty index. 468 + 469 + Should return an empty list when no entries exist. 470 + """ 471 + index = atlocal.Index(redis=clean_redis) 472 + 473 + entries = index.all_entries 474 + assert isinstance(entries, list) 475 + assert len(entries) == 0 476 + 477 + 478 + def test_index_all_entries_multiple(clean_redis): 479 + """Test getting all entries as a list with multiple entries. 480 + 481 + Should return a list containing all BasicIndexEntry objects in the index. 482 + """ 483 + index = atlocal.Index(redis=clean_redis) 484 + 485 + ds1 = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset1.tar") 486 + ds2 = atdata.Dataset[ArrayTestSample](url="s3://bucket/dataset2.tar") 487 + 488 + entry1 = index.add_entry(ds1) 489 + entry2 = index.add_entry(ds2) 490 + 491 + entries = index.all_entries 492 + assert isinstance(entries, list) 493 + assert len(entries) == 2 494 + 495 + 496 + def test_index_entries_filtering(clean_redis): 497 + """Test that index only returns BasicIndexEntry objects. 498 + 499 + Should only iterate over keys matching 'BasicIndexEntry:*' pattern and 500 + ignore any other Redis keys. 501 + """ 502 + index = atlocal.Index(redis=clean_redis) 503 + 504 + # Add a BasicIndexEntry 505 + ds = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset.tar") 506 + entry = index.add_entry(ds) 507 + 508 + # Add some other Redis keys that should be ignored 509 + clean_redis.set("other_key", "value") 510 + clean_redis.hset("other_hash", "field", "value") 511 + 512 + entries = list(index.entries) 513 + assert len(entries) == 1 514 + assert entries[0].uuid == entry.uuid 515 + 516 + # Clean up non-BasicIndexEntry keys (fixture only cleans BasicIndexEntry:*) 517 + clean_redis.delete("other_key") 518 + clean_redis.delete("other_hash") 519 + 520 + 521 + ## 522 + # Repo tests - Initialization 523 + 524 + def test_repo_init_no_s3(): 525 + """Test creating a Repo without S3 credentials. 526 + 527 + Should create a Repo with s3_credentials=None, bucket_fs=None, and working index. 528 + """ 529 + repo = atlocal.Repo() 530 + 531 + assert repo.s3_credentials is None 532 + assert repo.bucket_fs is None 533 + assert repo.hive_path is None 534 + assert repo.hive_bucket is None 535 + assert repo.index is not None 536 + assert isinstance(repo.index, atlocal.Index) 537 + 538 + 539 + def test_repo_init_with_s3_dict(): 540 + """Test creating a Repo with S3 credentials as a dictionary. 541 + 542 + Should create a Repo with S3FileSystem and set hive_path and hive_bucket. 543 + """ 544 + creds = { 545 + 'AWS_ENDPOINT': 'http://localhost:9000', 546 + 'AWS_ACCESS_KEY_ID': 'minioadmin', 547 + 'AWS_SECRET_ACCESS_KEY': 'minioadmin' 548 + } 549 + 550 + repo = atlocal.Repo(s3_credentials=creds, hive_path="test-bucket/datasets") 551 + 552 + assert repo.s3_credentials == creds 553 + assert repo.bucket_fs is not None 554 + assert isinstance(repo.bucket_fs, atlocal.S3FileSystem) 555 + assert repo.hive_path == Path("test-bucket/datasets") 556 + assert repo.hive_bucket == "test-bucket" 557 + 558 + 559 + def test_repo_init_with_s3_path(tmp_path): 560 + """Test creating a Repo with S3 credentials from a .env file. 561 + 562 + Should load credentials from file and create S3FileSystem with hive configuration. 563 + """ 564 + env_file = tmp_path / ".env" 565 + env_file.write_text( 566 + "AWS_ENDPOINT=http://localhost:9000\n" 567 + "AWS_ACCESS_KEY_ID=minioadmin\n" 568 + "AWS_SECRET_ACCESS_KEY=minioadmin\n" 569 + ) 570 + 571 + repo = atlocal.Repo(s3_credentials=env_file, hive_path="test-bucket/datasets") 572 + 573 + assert repo.s3_credentials is not None 574 + assert repo.bucket_fs is not None 575 + assert isinstance(repo.bucket_fs, atlocal.S3FileSystem) 576 + assert repo.hive_path == Path("test-bucket/datasets") 577 + assert repo.hive_bucket == "test-bucket" 578 + 579 + 580 + def test_repo_init_s3_without_hive_path(): 581 + """Test that creating a Repo with S3 but no hive_path raises ValueError. 582 + 583 + Should raise ValueError when s3_credentials is provided but hive_path is None. 584 + """ 585 + creds = { 586 + 'AWS_ENDPOINT': 'http://localhost:9000', 587 + 'AWS_ACCESS_KEY_ID': 'minioadmin', 588 + 'AWS_SECRET_ACCESS_KEY': 'minioadmin' 589 + } 590 + 591 + with pytest.raises(ValueError, match="Must specify hive path"): 592 + atlocal.Repo(s3_credentials=creds) 593 + 594 + 595 + def test_repo_init_hive_path_parsing(): 596 + """Test that hive_path is correctly parsed to extract bucket name. 597 + 598 + Should set hive_bucket to the first component of hive_path. 599 + """ 600 + creds = { 601 + 'AWS_ENDPOINT': 'http://localhost:9000', 602 + 'AWS_ACCESS_KEY_ID': 'minioadmin', 603 + 'AWS_SECRET_ACCESS_KEY': 'minioadmin' 604 + } 605 + 606 + repo = atlocal.Repo(s3_credentials=creds, hive_path="my-bucket/path/to/datasets") 607 + 608 + assert repo.hive_bucket == "my-bucket" 609 + assert repo.hive_path == Path("my-bucket/path/to/datasets") 610 + 611 + 612 + def test_repo_init_with_custom_redis(): 613 + """Test creating a Repo with a custom Redis connection. 614 + 615 + Should pass the Redis connection to the Index instance. 616 + """ 617 + custom_redis = Redis() 618 + repo = atlocal.Repo(redis=custom_redis) 619 + 620 + assert repo.index._redis is custom_redis 621 + 622 + 623 + ## 624 + # Repo tests - Insert functionality 625 + 626 + def test_repo_insert_without_s3(): 627 + """Test that inserting a dataset without S3 configured raises AssertionError. 628 + 629 + Should fail with assertion error when trying to insert without S3 credentials. 630 + """ 631 + repo = atlocal.Repo() 632 + ds = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset.tar") 633 + 634 + with pytest.raises(AssertionError): 635 + repo.insert(ds) 636 + 637 + 638 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 639 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 640 + def test_repo_insert_single_shard(mock_s3, clean_redis, sample_dataset): 641 + """Test inserting a small dataset that fits in a single shard. 642 + 643 + Should write the dataset to S3, create metadata, add index entry, and return 644 + a new Dataset pointing to the stored copy with correct URL format. 645 + """ 646 + repo = atlocal.Repo( 647 + s3_credentials=mock_s3['credentials'], 648 + hive_path=mock_s3['hive_path'], 649 + redis=clean_redis 650 + ) 651 + 652 + entry, new_ds = repo.insert(sample_dataset, maxcount=100) 653 + 654 + assert entry.uuid is not None 655 + assert entry.wds_url is not None 656 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 657 + assert len(repo.index.all_entries) == 1 658 + assert '.tar' in new_ds.url 659 + assert new_ds.url.startswith(mock_s3['hive_path']) 660 + 661 + 662 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 663 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 664 + def test_repo_insert_multiple_shards(mock_s3, clean_redis, tmp_path): 665 + """Test inserting a large dataset that spans multiple shards. 666 + 667 + Should write multiple tar files to S3, use brace notation in returned URL, 668 + and correctly format the shard range. 669 + """ 670 + ds = make_simple_dataset(tmp_path, num_samples=50, name="large") 671 + repo = atlocal.Repo( 672 + s3_credentials=mock_s3['credentials'], 673 + hive_path=mock_s3['hive_path'], 674 + redis=clean_redis 675 + ) 676 + 677 + entry, new_ds = repo.insert(ds, maxcount=10) 678 + 679 + assert entry.uuid is not None 680 + assert entry.wds_url is not None 681 + assert '{' in new_ds.url and '}' in new_ds.url 682 + 683 + 684 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 685 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 686 + def test_repo_insert_with_metadata(mock_s3, clean_redis, tmp_path): 687 + """Test inserting a dataset with metadata. 688 + 689 + Should write metadata as msgpack to S3 and include metadata_url in the 690 + returned Dataset and BasicIndexEntry. 691 + """ 692 + ds = make_simple_dataset(tmp_path, num_samples=5) 693 + ds._metadata = {"description": "test dataset", "version": "1.0"} 694 + 695 + repo = atlocal.Repo( 696 + s3_credentials=mock_s3['credentials'], 697 + hive_path=mock_s3['hive_path'], 698 + redis=clean_redis 699 + ) 700 + 701 + entry, new_ds = repo.insert(ds, maxcount=100) 702 + 703 + assert entry.metadata_url is not None 704 + assert new_ds.metadata_url is not None 705 + assert 'metadata' in entry.metadata_url 706 + 707 + 708 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 709 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 710 + def test_repo_insert_without_metadata(mock_s3, clean_redis, tmp_path): 711 + """Test inserting a dataset without metadata. 712 + 713 + Should handle None metadata gracefully and not write a metadata file. 714 + """ 715 + ds = make_simple_dataset(tmp_path, num_samples=5) 716 + repo = atlocal.Repo( 717 + s3_credentials=mock_s3['credentials'], 718 + hive_path=mock_s3['hive_path'], 719 + redis=clean_redis 720 + ) 721 + 722 + entry, new_ds = repo.insert(ds, maxcount=100) 723 + 724 + assert entry.uuid is not None 725 + assert len(repo.index.all_entries) == 1 726 + 727 + 728 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 729 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 730 + def test_repo_insert_cache_local_false(mock_s3, clean_redis, sample_dataset): 731 + """Test inserting with cache_local=False (direct S3 write). 732 + 733 + Should write tar shards directly to S3 without local caching. 734 + """ 735 + repo = atlocal.Repo( 736 + s3_credentials=mock_s3['credentials'], 737 + hive_path=mock_s3['hive_path'], 738 + redis=clean_redis 739 + ) 740 + 741 + entry, new_ds = repo.insert(sample_dataset, cache_local=False, maxcount=100) 742 + 743 + assert entry.uuid is not None 744 + assert entry.wds_url is not None 745 + 746 + 747 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 748 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 749 + def test_repo_insert_cache_local_true(mock_s3, clean_redis, sample_dataset): 750 + """Test inserting with cache_local=True (local cache then copy). 751 + 752 + Should write to temporary local storage first, then copy to S3, and clean up 753 + local cache files after copying. 754 + """ 755 + repo = atlocal.Repo( 756 + s3_credentials=mock_s3['credentials'], 757 + hive_path=mock_s3['hive_path'], 758 + redis=clean_redis 759 + ) 760 + 761 + entry, new_ds = repo.insert(sample_dataset, cache_local=True, maxcount=100) 762 + 763 + assert entry.uuid is not None 764 + assert entry.wds_url is not None 765 + 766 + 767 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 768 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 769 + def test_repo_insert_creates_index_entry(mock_s3, clean_redis, sample_dataset): 770 + """Test that insert() creates a valid index entry. 771 + 772 + Should add a BasicIndexEntry to the index with correct wds_url, sample_kind, 773 + metadata_url, and UUID. 774 + """ 775 + repo = atlocal.Repo( 776 + s3_credentials=mock_s3['credentials'], 777 + hive_path=mock_s3['hive_path'], 778 + redis=clean_redis 779 + ) 780 + 781 + entry, new_ds = repo.insert(sample_dataset, maxcount=100) 782 + 783 + assert entry.uuid is not None 784 + assert entry.wds_url == new_ds.url 785 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 786 + 787 + all_entries = repo.index.all_entries 788 + assert len(all_entries) == 1 789 + assert all_entries[0].uuid == entry.uuid 790 + 791 + 792 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 793 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 794 + def test_repo_insert_uuid_generation(mock_s3, clean_redis, sample_dataset): 795 + """Test that insert() generates a unique UUID for each dataset. 796 + 797 + Should create a new UUID for the dataset and use it consistently in filenames, 798 + index entry, and returned Dataset. 799 + """ 800 + repo = atlocal.Repo( 801 + s3_credentials=mock_s3['credentials'], 802 + hive_path=mock_s3['hive_path'], 803 + redis=clean_redis 804 + ) 805 + 806 + entry1, new_ds1 = repo.insert(sample_dataset, maxcount=100) 807 + entry2, new_ds2 = repo.insert(sample_dataset, maxcount=100) 808 + 809 + assert entry1.uuid != entry2.uuid 810 + assert entry1.uuid in new_ds1.url 811 + assert entry2.uuid in new_ds2.url 812 + assert len(repo.index.all_entries) == 2 813 + 814 + 815 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 816 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 817 + def test_repo_insert_empty_dataset(mock_s3, clean_redis, tmp_path): 818 + """Test inserting an empty dataset. 819 + 820 + WebDataset's ShardWriter creates a shard file even with no samples, 821 + so empty datasets succeed (creating an empty shard) rather than raising 822 + RuntimeError. 823 + """ 824 + dataset_path = tmp_path / "empty-dataset-000000.tar" 825 + with wds.writer.TarWriter(str(dataset_path)) as sink: 826 + pass # Write no samples 827 + 828 + ds = atdata.Dataset[SimpleTestSample](url=str(dataset_path)) 829 + repo = atlocal.Repo( 830 + s3_credentials=mock_s3['credentials'], 831 + hive_path=mock_s3['hive_path'], 832 + redis=clean_redis 833 + ) 834 + 835 + # Empty datasets succeed because WebDataset creates a shard file regardless 836 + entry, new_ds = repo.insert(ds, maxcount=100) 837 + assert entry.uuid is not None 838 + assert '.tar' in new_ds.url 839 + 840 + 841 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 842 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 843 + def test_repo_insert_preserves_sample_type(mock_s3, clean_redis, sample_dataset): 844 + """Test that the returned Dataset preserves the original sample type. 845 + 846 + Should return a Dataset[T] with the same sample type as the input dataset. 847 + """ 848 + repo = atlocal.Repo( 849 + s3_credentials=mock_s3['credentials'], 850 + hive_path=mock_s3['hive_path'], 851 + redis=clean_redis 852 + ) 853 + 854 + entry, new_ds = repo.insert(sample_dataset, maxcount=100) 855 + 856 + assert new_ds.sample_type == SimpleTestSample 857 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 858 + 859 + 860 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 861 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 862 + def test_repo_insert_round_trip(mock_s3, clean_redis, tmp_path): 863 + """Test full round-trip: insert dataset, then load and compare samples. 864 + 865 + Should be able to insert a dataset and then load it back from the returned 866 + URL with all samples intact and matching the original. 867 + """ 868 + pytest.skip("Reading from moto-mocked S3 requires additional s3fs/WebDataset configuration") 869 + 870 + 871 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 872 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 873 + def test_repo_insert_with_shard_writer_kwargs(mock_s3, clean_redis, tmp_path): 874 + """Test that insert() passes additional kwargs to ShardWriter. 875 + 876 + Should forward kwargs like maxcount, maxsize to the underlying ShardWriter. 877 + """ 878 + ds = make_simple_dataset(tmp_path, num_samples=30, name="large") 879 + repo = atlocal.Repo( 880 + s3_credentials=mock_s3['credentials'], 881 + hive_path=mock_s3['hive_path'], 882 + redis=clean_redis 883 + ) 884 + 885 + entry, new_ds = repo.insert(ds, maxcount=5) 886 + 887 + assert '{' in new_ds.url and '}' in new_ds.url 888 + 889 + 890 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 891 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 892 + def test_repo_insert_numpy_arrays(mock_s3, clean_redis, tmp_path): 893 + """Test inserting a dataset containing samples with numpy arrays. 894 + 895 + Should correctly serialize and store numpy arrays. 896 + """ 897 + ds = make_array_dataset(tmp_path, num_samples=3, array_shape=(10, 10)) 898 + repo = atlocal.Repo( 899 + s3_credentials=mock_s3['credentials'], 900 + hive_path=mock_s3['hive_path'], 901 + redis=clean_redis 902 + ) 903 + 904 + entry, new_ds = repo.insert(ds, maxcount=100) 905 + 906 + assert entry.uuid is not None 907 + assert entry.sample_kind == f"{ArrayTestSample.__module__}.ArrayTestSample" 908 + 909 + 910 + ## 911 + # Integration tests 912 + 913 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 914 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 915 + def test_repo_index_integration(mock_s3, clean_redis, sample_dataset): 916 + """Test that Repo and Index work together correctly. 917 + 918 + Should be able to insert datasets into Repo and retrieve their entries 919 + from the Index. 920 + """ 921 + repo = atlocal.Repo( 922 + s3_credentials=mock_s3['credentials'], 923 + hive_path=mock_s3['hive_path'], 924 + redis=clean_redis 925 + ) 926 + 927 + entry, new_ds = repo.insert(sample_dataset, maxcount=100) 928 + 929 + all_entries = repo.index.all_entries 930 + assert len(all_entries) == 1 931 + assert all_entries[0].uuid == entry.uuid 932 + assert all_entries[0].wds_url == entry.wds_url 933 + 934 + 935 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 936 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 937 + def test_multiple_datasets_same_type(mock_s3, clean_redis, sample_dataset): 938 + """Test inserting multiple datasets of the same sample type. 939 + 940 + Should create separate entries with different UUIDs and all should be 941 + retrievable from the index. 942 + """ 943 + repo = atlocal.Repo( 944 + s3_credentials=mock_s3['credentials'], 945 + hive_path=mock_s3['hive_path'], 946 + redis=clean_redis 947 + ) 948 + 949 + entry1, _ = repo.insert(sample_dataset, maxcount=100) 950 + entry2, _ = repo.insert(sample_dataset, maxcount=100) 951 + entry3, _ = repo.insert(sample_dataset, maxcount=100) 952 + 953 + uuids = {entry1.uuid, entry2.uuid, entry3.uuid} 954 + assert len(uuids) == 3 955 + 956 + all_entries = repo.index.all_entries 957 + assert len(all_entries) == 3 958 + 959 + for entry in all_entries: 960 + assert entry.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 961 + 962 + 963 + @pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning") 964 + @pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning") 965 + def test_multiple_datasets_different_types(mock_s3, clean_redis, tmp_path): 966 + """Test inserting datasets with different sample types. 967 + 968 + Should correctly track sample_kind for each dataset and create distinct 969 + index entries. 970 + """ 971 + simple_ds = make_simple_dataset(tmp_path, num_samples=3, name="simple") 972 + array_ds = make_array_dataset(tmp_path, num_samples=3, array_shape=(5, 5)) 973 + 974 + repo = atlocal.Repo( 975 + s3_credentials=mock_s3['credentials'], 976 + hive_path=mock_s3['hive_path'], 977 + redis=clean_redis 978 + ) 979 + 980 + entry1, _ = repo.insert(simple_ds, maxcount=100) 981 + entry2, _ = repo.insert(array_ds, maxcount=100) 982 + 983 + assert entry1.sample_kind == f"{SimpleTestSample.__module__}.SimpleTestSample" 984 + assert entry2.sample_kind == f"{ArrayTestSample.__module__}.ArrayTestSample" 985 + assert entry1.sample_kind != entry2.sample_kind 986 + assert len(repo.index.all_entries) == 2 987 + 988 + 989 + def test_index_persistence_across_instances(clean_redis): 990 + """Test that index entries persist across Index instance recreations. 991 + 992 + Should be able to create an Index, add entries, create a new Index instance 993 + with the same Redis connection, and retrieve the same entries. 994 + """ 995 + index1 = atlocal.Index(redis=clean_redis) 996 + ds = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset.tar") 997 + entry1 = index1.add_entry(ds) 998 + 999 + index2 = atlocal.Index(redis=clean_redis) 1000 + entries = index2.all_entries 1001 + 1002 + assert len(entries) == 1 1003 + assert entries[0].uuid == entry1.uuid 1004 + assert entries[0].wds_url == entry1.wds_url 1005 + 1006 + 1007 + def test_concurrent_index_access(clean_redis): 1008 + """Test that multiple Index instances can access the same Redis store. 1009 + 1010 + Should handle concurrent access to the same Redis index from multiple 1011 + Index instances. 1012 + """ 1013 + index1 = atlocal.Index(redis=clean_redis) 1014 + index2 = atlocal.Index(redis=clean_redis) 1015 + 1016 + ds1 = atdata.Dataset[SimpleTestSample](url="s3://bucket/dataset1.tar") 1017 + ds2 = atdata.Dataset[ArrayTestSample](url="s3://bucket/dataset2.tar") 1018 + 1019 + entry1 = index1.add_entry(ds1) 1020 + entry2 = index2.add_entry(ds2) 1021 + 1022 + entries1 = index1.all_entries 1023 + entries2 = index2.all_entries 1024 + 1025 + assert len(entries1) == 2 1026 + assert len(entries2) == 2 1027 + 1028 + uuids1 = {e.uuid for e in entries1} 1029 + uuids2 = {e.uuid for e in entries2} 1030 + 1031 + assert entry1.uuid in uuids1 and entry2.uuid in uuids1 1032 + assert entry1.uuid in uuids2 and entry2.uuid in uuids2