A loose federation of distributed, typed datasets


docs: add comprehensive atproto lexicon definitions with examples and NDArray spec

Added complete set of atproto lexicon schemas for dataset federation including:
- Dataset record, sampleSchema, and lens lexicons
- NDArray shim specification for numpy array serialization
- JSON examples demonstrating all schema types
- Design questions document for schema refinement

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

+1947
.chainlink/issues.db

This is a binary file and will not be displayed.

+166
.planning/decisions/sampleSchema_design_questions.md
# sampleSchema Lexicon Design Questions

This document captures open design questions for the `ac.foundation.dataset.sampleSchema` Lexicon that require user decisions before implementation.

## Q1: Key Format Validation

**Context:**
- Schema uses `"key": "any"` in the Lexicon
- Documentation says the rkey format is `{NSID}@{semver}`
- ATProto might not support regex validation on rkeys in Lexicons

**Question:**
Should we add validation for the rkey format in the Lexicon definition, or is this enforced elsewhere?

**Options:**
1. Add rkey pattern validation if ATProto Lexicons support it
2. Document the expected format but rely on application-level validation
3. Use a structured key type instead of `"any"`

**Impact:**
- Option 1: Strongest validation; prevents malformed rkeys
- Option 2: Simpler, but allows invalid rkeys to be created
- Option 3: May not be compatible with the ATProto Lexicon spec

**Decision:** [TBD]

---

## Q2: Required Fields in JSON Schema

**Context:**
- The `jsonSchema` field accepts any JSON Schema object
- JSON Schemas can have zero required fields (all optional)
- PackableSample types in atdata typically have at least one field

**Question:**
Should we enforce that JSON Schemas must have at least one required field?

**Options:**
1. No constraint - allow empty required arrays
2. Require at least one field in the required array
3. No constraint, but document best practices

**Impact:**
- Option 1: Maximum flexibility, but allows degenerate schemas
- Option 2: Forces meaningful sample definitions
- Option 3: Middle ground - guidance without enforcement

**Recommendation:** Option 3 (document best practices)

**Decision:** [TBD]

---

## Q3: Schema Type Extension Path

**Context:**
- The `schemaType` field currently has `enum: ["jsonschema"]` only
- We may want to support other formats in the future (Avro, Protobuf, etc.)
- Lexicon schema evolution is unclear

**Question:**
How should we design for future schema format support?

**Options:**
1. Keep the enum as-is; add new formats in a major version bump
2. Use an open union type instead of a closed enum
3. Add a `schemaFormat` union field alongside `jsonSchema`

**Example for Option 3:**
```json
{
  "schemaFormat": {
    "type": "union",
    "refs": ["#jsonSchemaFormat", "#avroSchemaFormat", "#protobufSchemaFormat"]
  }
}
```

**Impact:**
- Option 1: Breaking change required for new formats
- Option 2: No validation of the format string
- Option 3: Clean extensibility, but more complex now

**Recommendation:** Option 1 (YAGNI - wait for actual need)

**Decision:** [TBD]

---

## Q4: Metadata Field Structure

**Context:**
- `metadata` is currently `"type": "object"` with no structure
- Common fields like `author`, `license`, and `tags` are documented in examples
- There is no validation on these fields

**Question:**
Should we define a structured schema for common metadata fields?

**Options:**
1. Keep fully unstructured (current)
2. Define optional but structured fields (author, license, tags, etc.)
3. Create a separate metadata Lexicon type and reference it

**Example for Option 2:**
```json
{
  "metadata": {
    "type": "object",
    "properties": {
      "author": {"type": "string", "maxLength": 200},
      "license": {"type": "string", "maxLength": 100},
      "tags": {"type": "array", "items": {"type": "string"}, "maxItems": 20}
    }
  }
}
```

**Impact:**
- Option 1: Maximum flexibility, no validation
- Option 2: Standardization with optional compliance
- Option 3: Reusability, but added complexity

**Recommendation:** Option 2 (structured but optional)

**Decision:** [TBD]

---

## Q5: NDArray Shim URI Default

**Context:**
- `ndarrayShimUri` is optional, with a default mentioned only in its description
- The standard shim lives at `https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0`
- There is no explicit default value in the Lexicon

**Question:**
Should we add an explicit default value for `ndarrayShimUri`?

**Options:**
1. Add `"default": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0"`
2. Keep it optional; codegen assumes the standard shim if missing
3. Make it required - always explicit

**Impact:**
- Option 1: Clearest behavior, but locks in the URI
- Option 2: Flexibility for future shim versions
- Option 3: Most explicit, but verbose

**Recommendation:** Option 2 (implicit default in codegen)

**Decision:** [TBD]

---

## Notes

These questions should be resolved before finalizing the sampleSchema Lexicon design. Some can be deferred to Phase 2 implementation based on priority.

**Priority:**
- Q1: High (affects rkey strategy)
- Q2: Low (can document later)
- Q3: Low (YAGNI until needed)
- Q4: Medium (affects metadata usage patterns)
- Q5: Medium (affects codegen implementation)
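If Q1 lands on Option 2 (application-level validation), the check could look something like the sketch below. The regexes are simplified stand-ins, not the normative ATProto NSID or SemVer grammars, and `parse_schema_rkey` is a hypothetical helper, not part of atdata.

```python
import re

# Application-level check for the {NSID}@{semver} rkey convention (Q1,
# Option 2). Both grammars are deliberately simplified sketches.
NSID_RE = r"[a-z][a-z0-9-]*(?:\.[a-z][a-z0-9-]*){2,}"
SEMVER_RE = r"\d+\.\d+\.\d+"
RKEY_RE = re.compile("^(" + NSID_RE + ")@(" + SEMVER_RE + ")$")


def parse_schema_rkey(rkey: str) -> tuple[str, str]:
    """Split a schema rkey into (nsid, semver), rejecting malformed input."""
    m = RKEY_RE.match(rkey)
    if m is None:
        raise ValueError(f"rkey does not match {{NSID}}@{{semver}}: {rkey!r}")
    return m.group(1), m.group(2)
```

Rejecting malformed rkeys at publish time keeps the repo clean even though the Lexicon itself only declares `"key": "any"`.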
+252
.planning/examples/code/ndarray_roundtrip.py
#!/usr/bin/env python3
"""
Demonstration of NDArray JSON Schema shim roundtrip.

This script demonstrates:
1. Creating numpy arrays
2. Serializing to bytes (numpy .npy format)
3. Storing in JSON-compatible structure
4. Validating against JSON Schema
5. Deserializing back to numpy arrays

This proves the NDArray shim design works end-to-end.
"""

import json
import base64
from io import BytesIO
from pathlib import Path

import numpy as np
from jsonschema import validate, ValidationError


##
# Step 1: Define helper functions (same as atdata._helpers)

def array_to_bytes(x: np.ndarray) -> bytes:
    """Convert numpy array to bytes using .npy format."""
    np_bytes = BytesIO()
    np.save(np_bytes, x, allow_pickle=True)
    return np_bytes.getvalue()


def bytes_to_array(b: bytes) -> np.ndarray:
    """Convert bytes back to numpy array."""
    np_bytes = BytesIO(b)
    return np.load(np_bytes, allow_pickle=True)


##
# Step 2: Load the JSON Schema for ImageSample

# Get path to the schema example
schema_path = Path(__file__).parent.parent / "sampleSchema_example.json"
with open(schema_path) as f:
    schema_record = json.load(f)

# Extract just the jsonSchema part
json_schema = schema_record["jsonSchema"]

print("=" * 80)
print("JSON Schema for ImageSample")
print("=" * 80)
print(json.dumps(json_schema, indent=2))
print()


##
# Step 3: Create sample data matching the schema

print("=" * 80)
print("Creating Sample Data")
print("=" * 80)

# Create a numpy array (simulating an image)
image_array = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(f"Created image array: shape={image_array.shape}, dtype={image_array.dtype}")

# Serialize to bytes (this is what atdata does)
image_bytes = array_to_bytes(image_array)
print(f"Serialized to bytes: {len(image_bytes)} bytes")
print(f"First 100 bytes (hex): {image_bytes[:100].hex()}")
print()


##
# Step 4: Create JSON-compatible representation

print("=" * 80)
print("Creating JSON-Compatible Representation")
print("=" * 80)

# For JSON, bytes need to be base64-encoded
image_base64 = base64.b64encode(image_bytes).decode('utf-8')
print(f"Base64 encoded: {len(image_base64)} characters")
print(f"First 100 chars: {image_base64[:100]}...")

# Create a sample object matching the schema
sample_data = {
    "image": image_base64,   # NDArray as base64 string
    "label": "cat",          # Regular string field
    "confidence": 0.95       # Optional number field
}

print()
print("Sample data structure:")
print(json.dumps({
    "image": f"<{len(image_base64)} chars of base64>",
    "label": sample_data["label"],
    "confidence": sample_data["confidence"]
}, indent=2))
print()


##
# Step 5: Validate against JSON Schema

print("=" * 80)
print("Validating Against JSON Schema")
print("=" * 80)

try:
    validate(instance=sample_data, schema=json_schema)
    print("✅ VALID: Sample data validates against JSON Schema!")
except ValidationError as e:
    print(f"❌ INVALID: {e.message}")
    print(f"Failed at: {list(e.path)}")

print()


##
# Step 6: Deserialize back to numpy

print("=" * 80)
print("Deserializing Back to Numpy")
print("=" * 80)

# Decode from base64
recovered_bytes = base64.b64decode(sample_data["image"])
print(f"Decoded from base64: {len(recovered_bytes)} bytes")

# Deserialize to numpy array
recovered_array = bytes_to_array(recovered_bytes)
print(f"Deserialized to array: shape={recovered_array.shape}, dtype={recovered_array.dtype}")

# Verify it matches the original
arrays_equal = np.array_equal(image_array, recovered_array)
print(f"Arrays equal: {arrays_equal}")

if arrays_equal:
    print("✅ SUCCESS: Full roundtrip successful!")
else:
    print("❌ FAILURE: Arrays don't match")
    print(f"Max difference: {np.max(np.abs(image_array.astype(float) - recovered_array.astype(float)))}")

print()


##
# Step 7: Demonstrate validation of dtype/shape metadata

print("=" * 80)
print("Validating NDArray Metadata (dtype, shape)")
print("=" * 80)

# Extract metadata from schema
image_schema = json_schema["properties"]["image"]
expected_dtype = image_schema.get("x-atdata-dtype")
expected_shape = image_schema.get("x-atdata-shape")

print(f"Expected dtype: {expected_dtype}")
print(f"Expected shape: {expected_shape}")
print(f"Actual dtype: {recovered_array.dtype}")
print(f"Actual shape: {recovered_array.shape}")

# Validate dtype
dtype_match = str(recovered_array.dtype) == expected_dtype
print(f"Dtype matches: {dtype_match}")

# Validate shape (with None/null for dynamic dimensions)
def validate_shape(actual_shape, expected_shape):
    """Validate shape with support for dynamic dimensions (None/null)."""
    if len(actual_shape) != len(expected_shape):
        return False
    for actual_dim, expected_dim in zip(actual_shape, expected_shape):
        if expected_dim is not None and actual_dim != expected_dim:
            return False
    return True

shape_match = validate_shape(recovered_array.shape, expected_shape)
print(f"Shape matches: {shape_match}")

if dtype_match and shape_match:
    print("✅ SUCCESS: Array metadata matches schema expectations!")
else:
    print("❌ FAILURE: Metadata mismatch")

print()


##
# Step 8: Demonstrate msgpack (actual atdata format)

print("=" * 80)
print("Msgpack Serialization (Actual atdata Format)")
print("=" * 80)

try:
    import msgpack

    # In atdata, the sample would be stored in msgpack, not JSON
    # The image field would be raw bytes, not base64
    msgpack_data = {
        "image": image_bytes,   # Raw bytes (not base64)
        "label": "cat",
        "confidence": 0.95
    }

    # Serialize to msgpack
    msgpack_bytes = msgpack.packb(msgpack_data)
    print(f"Msgpack size: {len(msgpack_bytes)} bytes")

    # Deserialize from msgpack
    recovered_msgpack = msgpack.unpackb(msgpack_bytes, raw=False)
    recovered_array_msgpack = bytes_to_array(recovered_msgpack["image"])

    print(f"Recovered from msgpack: shape={recovered_array_msgpack.shape}, dtype={recovered_array_msgpack.dtype}")
    print(f"Arrays equal: {np.array_equal(image_array, recovered_array_msgpack)}")
    print("✅ SUCCESS: Msgpack roundtrip successful!")

except ImportError:
    print("⚠️ msgpack not installed, skipping msgpack demonstration")
    print("   (atdata uses msgpack for actual serialization)")

print()


##
# Summary

print("=" * 80)
print("SUMMARY")
print("=" * 80)
print("""
✅ The NDArray JSON Schema shim works correctly:
1. JSON Schema validates structure (field is present, is base64 string)
2. Binary .npy format preserves dtype and shape
3. Extension properties (x-atdata-*) provide metadata for validation
4. Full roundtrip: numpy → bytes → base64 → JSON → validate → deserialize → numpy
5. Msgpack format (actual atdata) uses raw bytes instead of base64

⚠️ Validation happens at two levels:
- JSON Schema: Structural validation (field present, correct type)
- Deserialization: Semantic validation (dtype/shape match expectations)

📝 This design is a pragmatic compromise:
- Leverages existing .npy serialization (proven, self-describing)
- Uses standard JSON Schema conventions (format: byte, contentEncoding)
- Adds metadata via extension properties (x-atdata-*)
- Works with both JSON (base64) and msgpack (raw bytes)
""")
+316
.planning/examples/code/validate_ndarray_shim.py
#!/usr/bin/env python3
"""
Validate base64-encoded numpy arrays against the standalone ndarray_shim.json schema.

This demonstrates that the NDArray shim schema definition works correctly as a
standalone, reusable schema component that can be referenced from other schemas.

Note: This tests the JSON representation (base64-encoded bytes). In actual atdata
usage, WebDatasets store raw bytes directly in msgpack format without base64 encoding.
"""

import json
import base64
from io import BytesIO
from pathlib import Path

import numpy as np
from jsonschema import validate, ValidationError, Draft7Validator


##
# Helper functions

def array_to_bytes(x: np.ndarray) -> bytes:
    """Convert numpy array to bytes using .npy format."""
    np_bytes = BytesIO()
    np.save(np_bytes, x, allow_pickle=True)
    return np_bytes.getvalue()


def bytes_to_array(b: bytes) -> np.ndarray:
    """Convert bytes back to numpy array."""
    np_bytes = BytesIO(b)
    return np.load(np_bytes, allow_pickle=True)


##
# Load the standalone ndarray shim schema

shim_path = Path(__file__).parent.parent.parent / "lexicons" / "ndarray_shim.json"
with open(shim_path) as f:
    ndarray_shim = json.load(f)

print("=" * 80)
print("Loaded NDArray Shim Schema")
print("=" * 80)
print(f"Schema ID: {ndarray_shim['$id']}")
print(f"Version: {ndarray_shim['version']}")
print()
print("NDArray definition:")
print(json.dumps(ndarray_shim["$defs"]["ndarray"], indent=2))
print()


##
# Test Case 1: Simple 1D array

print("=" * 80)
print("Test Case 1: Simple 1D Array")
print("=" * 80)

array_1d = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(f"Created array: {array_1d}")
print(f"Shape: {array_1d.shape}, dtype: {array_1d.dtype}")

# Serialize and encode
bytes_1d = array_to_bytes(array_1d)
base64_1d = base64.b64encode(bytes_1d).decode('utf-8')
print(f"Serialized to {len(bytes_1d)} bytes")
print(f"Base64: {len(base64_1d)} characters")

# Validate against the ndarray schema definition directly
ndarray_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "$defs": ndarray_shim["$defs"],
    "$ref": "#/$defs/ndarray"
}

try:
    validate(instance=base64_1d, schema=ndarray_schema)
    print("✅ VALID: 1D array validates against ndarray schema")
except ValidationError as e:
    print(f"❌ INVALID: {e.message}")

# Verify roundtrip
recovered_1d = bytes_to_array(base64.b64decode(base64_1d))
print(f"Recovered: {recovered_1d}")
print(f"Arrays equal: {np.array_equal(array_1d, recovered_1d)}")
print()


##
# Test Case 2: 2D array (matrix)

print("=" * 80)
print("Test Case 2: 2D Array (Matrix)")
print("=" * 80)

array_2d = np.random.randn(3, 4).astype(np.float32)
print(f"Created array shape: {array_2d.shape}, dtype: {array_2d.dtype}")
print(f"Sample values:\n{array_2d}")

bytes_2d = array_to_bytes(array_2d)
base64_2d = base64.b64encode(bytes_2d).decode('utf-8')
print(f"Serialized to {len(bytes_2d)} bytes")

try:
    validate(instance=base64_2d, schema=ndarray_schema)
    print("✅ VALID: 2D array validates against ndarray schema")
except ValidationError as e:
    print(f"❌ INVALID: {e.message}")

recovered_2d = bytes_to_array(base64.b64decode(base64_2d))
print(f"Arrays equal: {np.array_equal(array_2d, recovered_2d)}")
print()


##
# Test Case 3: 3D array (image-like)

print("=" * 80)
print("Test Case 3: 3D Array (Image-like)")
print("=" * 80)

array_3d = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(f"Created array shape: {array_3d.shape}, dtype: {array_3d.dtype}")
print(f"Total elements: {array_3d.size}")

bytes_3d = array_to_bytes(array_3d)
base64_3d = base64.b64encode(bytes_3d).decode('utf-8')
print(f"Serialized to {len(bytes_3d)} bytes ({len(bytes_3d) / 1024:.1f} KB)")
print(f"Base64 string: {len(base64_3d)} characters ({len(base64_3d) / 1024:.1f} KB)")

try:
    validate(instance=base64_3d, schema=ndarray_schema)
    print("✅ VALID: 3D array validates against ndarray schema")
except ValidationError as e:
    print(f"❌ INVALID: {e.message}")

recovered_3d = bytes_to_array(base64.b64decode(base64_3d))
print(f"Recovered shape: {recovered_3d.shape}, dtype: {recovered_3d.dtype}")
print(f"Arrays equal: {np.array_equal(array_3d, recovered_3d)}")
print()


##
# Test Case 4: Different dtypes

print("=" * 80)
print("Test Case 4: Various Dtypes")
print("=" * 80)

dtypes_to_test = [
    np.int8,
    np.int16,
    np.int32,
    np.int64,
    np.uint8,
    np.uint16,
    np.uint32,
    np.uint64,
    np.float16,
    np.float32,
    np.float64,
    np.complex64,
    np.complex128,
]

print(f"Testing {len(dtypes_to_test)} different dtypes...")
all_passed = True

for dtype in dtypes_to_test:
    array = np.array([1, 2, 3], dtype=dtype)
    array_bytes = array_to_bytes(array)
    array_base64 = base64.b64encode(array_bytes).decode('utf-8')

    try:
        validate(instance=array_base64, schema=ndarray_schema)
        recovered = bytes_to_array(base64.b64decode(array_base64))
        match = np.array_equal(array, recovered)
        status = "✅" if match else "❌"
        print(f"  {status} {str(dtype):12s} - valid and {'matches' if match else 'MISMATCH'}")
        if not match:
            all_passed = False
    except ValidationError as e:
        print(f"  ❌ {str(dtype):12s} - validation failed: {e.message}")
        all_passed = False

if all_passed:
    print("✅ SUCCESS: All dtypes validated and roundtripped correctly")
else:
    print("❌ FAILURE: Some dtypes failed")
print()


##
# Test Case 5: Invalid data (should fail validation)

print("=" * 80)
print("Test Case 5: Invalid Data (Negative Tests)")
print("=" * 80)

# Test invalid types
invalid_cases = [
    ("plain string", "not base64 encoded array data"),
    ("number", 12345),
    ("object", {"dtype": "uint8", "data": "fake"}),
    ("array", [1, 2, 3]),
    ("null", None),
]

print("Testing invalid inputs (should fail validation):")
for name, invalid_data in invalid_cases:
    try:
        validate(instance=invalid_data, schema=ndarray_schema)
        print(f"  ❌ {name:15s} - SHOULD HAVE FAILED but passed")
    except ValidationError:
        print(f"  ✅ {name:15s} - correctly rejected")

print()


##
# Test Case 6: Using the schema as a $ref in another schema (inline)

print("=" * 80)
print("Test Case 6: Using NDArray Shim as $ref (Inline)")
print("=" * 80)

# Create a schema that inlines the ndarray shim definition
sample_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "TestSample",
    "type": "object",
    "required": ["data", "label"],
    "properties": {
        "data": {
            "$ref": "#/$defs/ndarray",
            "description": "Numpy array data",
            "x-atdata-dtype": "float32",
            "x-atdata-shape": [None, 10]
        },
        "label": {
            "type": "string",
            "description": "Label for this sample"
        }
    },
    "$defs": {
        "ndarray": ndarray_shim["$defs"]["ndarray"]
    }
}

print("Created schema that uses inlined ndarray shim:")
print(json.dumps({
    "title": sample_schema["title"],
    "required": sample_schema["required"],
    "properties": {
        "data": {"$ref": "#/$defs/ndarray", "x-atdata-dtype": "float32"},
        "label": {"type": "string"}
    }
}, indent=2))
print()

# Create sample data
test_array = np.random.randn(5, 10).astype(np.float32)
test_data = {
    "data": base64.b64encode(array_to_bytes(test_array)).decode('utf-8'),
    "label": "test sample"
}

print(f"Created test sample with array shape {test_array.shape}")

# Validate with inline $ref
validator = Draft7Validator(sample_schema)

try:
    validator.validate(test_data)
    print("✅ VALID: Sample with $ref to ndarray shim validates correctly")
except ValidationError as e:
    print(f"❌ INVALID: {e.message}")

print()


##
# Summary

print("=" * 80)
print("SUMMARY")
print("=" * 80)
print("""
✅ The standalone ndarray_shim.json schema works correctly:
1. Validates base64-encoded .npy bytes as strings
2. Works with all standard numpy dtypes
3. Supports arrays of any dimensionality (1D, 2D, 3D, etc.)
4. Can be used as $ref in other schemas
5. Correctly rejects invalid data

✅ The shim is a proper JSON Schema Draft 7 definition:
- Uses standard type/format (string/byte)
- Uses contentEncoding/contentMediaType properly
- Works with standard validators (jsonschema library)
- Can be stored at a canonical URI and referenced

📝 Key points:
- Base64 encoding adds ~33% overhead (150KB → 200KB)
- In actual atdata, WebDatasets store raw bytes (no base64)
- JSON representation useful for: APIs, validation, examples
- Msgpack representation used in practice: more efficient

🎯 Design validated:
- Shim definition is sound and reusable
- Works as both inline $def and external $ref
- Compatible with JSON Schema tooling
- Ready for use in ac.foundation.dataset.sampleSchema Lexicon
""")
+39
.planning/examples/dataset_blob_storage.json
{
  "$type": "ac.foundation.dataset.record",
  "name": "Small Sample Dataset",
  "schemaRef": "at://did:plc:def456/ac.foundation.dataset.sampleSchema/textsample@2.1.0",
  "storage": {
    "type": "blobs",
    "blobs": [
      {
        "$type": "blob",
        "ref": {
          "$link": "bafyreig4rvsqx3vfzdchq2qx7xr2nq2y4vjvd4w5pqtjwkqiw7h5e6vf7e"
        },
        "mimeType": "application/x-tar",
        "size": 1234567
      },
      {
        "$type": "blob",
        "ref": {
          "$link": "bafyreig5saabc3defghijklmnopqrstuvwxyz123456789abcdefghijk"
        },
        "mimeType": "application/x-tar",
        "size": 2345678
      }
    ]
  },
  "description": "Small text dataset stored directly on PDS for maximum decentralization",
  "tags": [
    "nlp",
    "text",
    "small-dataset"
  ],
  "size": {
    "samples": 1000,
    "bytes": 3580245,
    "shards": 2
  },
  "license": "CC-BY-4.0",
  "createdAt": "2025-01-07T10:30:00Z"
}
+26
.planning/examples/dataset_external_storage.json
{
  "$type": "ac.foundation.dataset.record",
  "name": "CIFAR-10 Training Set",
  "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0",
  "storage": {
    "type": "external",
    "urls": [
      "s3://my-bucket/cifar10-train-{000000..000049}.tar"
    ]
  },
  "description": "CIFAR-10 training images (50,000 samples) stored as WebDataset shards on S3",
  "metadata": "<msgpack-encoded bytes here>",
  "tags": [
    "computer-vision",
    "classification",
    "cifar10",
    "training"
  ],
  "size": {
    "samples": 50000,
    "bytes": 178456789,
    "shards": 50
  },
  "license": "MIT",
  "createdAt": "2025-01-06T12:00:00Z"
}
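The `urls` entry above uses WebDataset-style brace notation to describe 50 shards in one string. A minimal sketch of how a consumer might expand that notation (real pipelines would lean on webdataset's own shard-listing utilities; `expand_shards` is an illustrative helper, not part of atdata):

```python
import re

def expand_shards(url: str) -> list[str]:
    """Expand a single {NNNNNN..MMMMMM} range into concrete shard URLs."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", url)
    if m is None:
        return [url]  # no brace range; treat as a single shard
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve zero-padding width
    return [
        url[:m.start()] + str(i).zfill(width) + url[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]
```

For the record above, this yields `cifar10-train-000000.tar` through `cifar10-train-000049.tar`, matching the declared `"shards": 50`.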
+27
.planning/examples/lens_example.json
{
  "$type": "ac.foundation.dataset.lens",
  "name": "RGB to Grayscale Conversion",
  "sourceSchema": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/rgbimage@1.0.0",
  "targetSchema": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/grayscaleimage@1.0.0",
  "description": "Converts RGB images to grayscale using standard luminosity formula",
  "getterCode": {
    "repository": "https://github.com/alice/vision-lenses",
    "commit": "a1b2c3d4e5f6789abcdef0123456789abcdef012",
    "path": "lenses/color.py:rgb_to_grayscale",
    "branch": "main"
  },
  "putterCode": {
    "repository": "https://github.com/alice/vision-lenses",
    "commit": "a1b2c3d4e5f6789abcdef0123456789abcdef012",
    "path": "lenses/color.py:grayscale_to_rgb",
    "branch": "main"
  },
  "language": "python",
  "metadata": {
    "author": "alice.bsky.social",
    "performance": "O(n) where n is number of pixels",
    "reversible": false,
    "notes": "Putter creates approximate RGB by duplicating grayscale channel"
  },
  "createdAt": "2025-01-07T14:00:00Z"
}
+50
.planning/examples/sampleSchema_example.json
{
  "$type": "ac.foundation.dataset.sampleSchema",
  "name": "ImageSample",
  "version": "1.0.0",
  "schemaType": "jsonschema",
  "jsonSchema": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "ImageSample",
    "type": "object",
    "required": [
      "image",
      "label"
    ],
    "properties": {
      "image": {
        "$ref": "#/$defs/ndarray",
        "description": "RGB image with variable height/width",
        "x-atdata-dtype": "uint8",
        "x-atdata-shape": [null, null, 3],
        "x-atdata-notes": "Images must have 3 color channels (RGB)"
      },
      "label": {
        "type": "string",
        "description": "Human-readable label for the image"
      },
      "confidence": {
        "type": "number",
        "description": "Optional confidence score",
        "minimum": 0,
        "maximum": 1
      }
    },
    "$defs": {
      "ndarray": {
        "type": "string",
        "format": "byte",
        "description": "Numpy array serialized using numpy .npy format (includes dtype and shape in binary header)",
        "contentEncoding": "base64",
        "contentMediaType": "application/octet-stream"
      }
    }
  },
  "description": "Sample type for images with labels, commonly used for computer vision datasets",
  "metadata": {
    "author": "alice.bsky.social",
    "license": "MIT",
    "tags": ["computer-vision", "image-classification"]
  },
  "createdAt": "2025-01-06T12:00:00Z"
}
+259
.planning/lexicons/README.md
# ATProto Lexicon Definitions for atdata

This directory contains the ATProto Lexicon JSON definitions for the distributed dataset federation system.

## Lexicons

### Core Record Types

1. **[ac.foundation.dataset.sampleSchema](ac.foundation.dataset.sampleSchema.json)**
   - Defines PackableSample-compatible sample types using JSON Schema
   - Supports versioning via rkey format: `{NSID}@{semver}`
   - Includes NDArray shim for ML/scientific data types
   - Example: [sampleSchema_example.json](../examples/sampleSchema_example.json)

2. **[ac.foundation.dataset.record](ac.foundation.dataset.record.json)**
   - Index records for WebDataset-backed datasets
   - Hybrid storage support (external URLs + PDS blobs)
   - References sampleSchema for type information
   - Examples:
     - [External storage](../examples/dataset_external_storage.json)
     - [Blob storage](../examples/dataset_blob_storage.json)

3. **[ac.foundation.dataset.lens](ac.foundation.dataset.lens.json)**
   - Bidirectional transformations between sample types
   - External code references (GitHub, tangled.org)
   - Language metadata for multi-language support
   - Example: [lens_example.json](../examples/lens_example.json)

### Query APIs

4. **[ac.foundation.dataset.getLatestSchema](ac.foundation.dataset.getLatestSchema.json)**
   - Query to get the latest version of a schema by NSID
   - Returns full record + all available versions
   - Handles the custom rkey versioning scheme

## Key Design Decisions

### 1. Namespace

All Lexicons use the `ac.foundation.dataset.*` namespace:
- `ac.foundation` - Organization namespace
- `dataset` - Domain (distributed datasets)
- Specific record types: `sampleSchema`, `record`, `lens`

### 2. Schema Versioning (rkey Convention)

**Custom rkey format**: `{NSID}@{semver}`

**Example**: `com.example.myschema@1.2.3`

- `{NSID}`: Permanent identifier for the schema type (e.g., `com.example.myschema`)
- `{semver}`: Semantic version (e.g., `1.2.3`)

**Benefits**:
- Immutable version records
- Easy to list all versions of a schema
- Natural query pattern via `getLatestSchema`
- Clear semantic versioning enforcement

**Implementation**: The sampleSchema Lexicon uses `"key": "any"` to support this custom format.

### 3. JSON Schema with NDArray Shim

**Decision**: Use standard JSON Schema for type definitions with a custom NDArray shim.

**NDArray Shim Structure**:
```json
{
  "$defs": {
    "ndarray": {
      "type": "object",
      "required": ["dtype", "shape", "data"],
      "properties": {
        "dtype": {
          "type": "string",
          "description": "Numpy dtype string (e.g., 'float32', 'uint8')"
        },
        "shape": {
          "type": "array",
          "items": {"type": "integer"},
          "description": "Array shape"
        },
        "data": {
          "type": "string",
          "format": "byte",
          "description": "Array data as base64-encoded bytes"
        }
      }
    }
  }
}
```

**Usage in schemas**:
```json
{
  "properties": {
    "image": {
      "$ref": "#/$defs/ndarray",
      "dtype": "uint8",
      "shape": [null, null, 3]
    }
  }
}
```

**Benefits**:
- Leverages JSON Schema ecosystem (validators, tooling)
- Custom NDArray handling for ML/scientific data
- Extensible via `schemaType` field (future: Protobuf, etc.)

### 4. Hybrid Storage

**Open union** for storage location:
- `storageExternal`: External URLs (S3, HTTP, IPFS, etc.)
- `storageBlobs`: ATProto PDS blobs

**Benefits**:
- Flexibility: Use external storage for large datasets
- Decentralization: Use blobs for small datasets or self-hosting
- AppView can proxy both types uniformly

### 5. External Code References

**Lenses use code references** instead of inline code for security:
- Repository URL (GitHub, tangled.org)
- Commit hash (immutability)
- Function path (e.g., `lenses/vision.py:rgb_to_grayscale`)

**Benefits**:
- Secure: No arbitrary code execution
- Verifiable: Commit hash ensures immutability
- Auditable: Users can review code before use

## Example Workflows

### Publishing a Schema

```python
from atdata.atproto import SchemaPublisher

@atdata.packable
class ImageSample:
    image: NDArray  # uint8, [H, W, 3]
    label: str

publisher = SchemaPublisher(client)
schema_uri = publisher.publish_schema(
    ImageSample,
    name="ImageSample",
    version="1.0.0",
    description="RGB image with label"
)
# Result: at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imagesample@1.0.0
```

### Publishing a Dataset

```python
from atdata.atproto import DatasetPublisher

dataset = atdata.Dataset[ImageSample](
    url="s3://my-bucket/dataset-{000000..000009}.tar"
)

publisher = DatasetPublisher(client)
dataset_uri = publisher.publish_dataset(
    dataset,
    name="My Image Dataset",
    schema_uri=schema_uri,
    tags=["computer-vision", "training"]
)
```

### Discovering Datasets

```python
from atdata.atproto import DatasetLoader

loader = DatasetLoader(client)

# Search by tags
datasets = loader.search_datasets(tags=["computer-vision"])

# Load dataset
dataset = loader.load_dataset(datasets[0]['uri'])
```

## Migration & Versioning

### Publishing a New Schema Version

```python
# Publish v2.0.0 with migration lens
schema_uri_v2 = publisher.publish_schema(
    ImageSampleV2,
    name="ImageSample",
    version="2.0.0",
    previous_version=schema_uri_v1,
    migration_lens=migration_lens_uri
)
```

### Getting Latest Schema

```python
from atdata.atproto import query_latest_schema

latest = query_latest_schema(
    client,
    schema_id="imagesample"  # Just the NSID part
)
# Returns: {
#   "uri": "at://.../imagesample@2.0.0",
#   "version": "2.0.0",
#   "record": {...},
#   "allVersions": [...]
# }
```

## Validation

See [06_lexicon_validation.md](../decisions/06_lexicon_validation.md) for the validation process.

### Quick Validation

```bash
# Validate Lexicon JSON (requires ATProto tooling)
atproto-lexicon validate ac.foundation.dataset.sampleSchema.json

# Validate example records
python scripts/validate_examples.py
```

## Future Extensions

### Potential Additional Lexicons

- `ac.foundation.dataset.collection` - Group multiple datasets
- `ac.foundation.dataset.benchmark` - Evaluation results on datasets
- `ac.foundation.dataset.attestation` - Formal correctness proofs for Lenses
- `ac.foundation.dataset.verification` - Trusted DID attestations

### Schema Type Extensions

Current: `"schemaType": "jsonschema"`

Future possibilities:
- `"schemaType": "protobuf"` - Protocol Buffers definitions
- `"schemaType": "avro"` - Apache Avro schemas
- Custom domain-specific schema languages

## References

- Planning documents: `../*.md`
- Design decisions: 
`../decisions/*.md` 257 + - Architectural assessment: `../decisions/assessment.md` 258 + - ATProto Lexicon spec: https://atproto.com/specs/lexicon 259 + - ATProto NSID spec: https://atproto.com/specs/nsid
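The `{NSID}@{semver}` rkey convention described in the README can be sketched in a few lines of Python. The helper names (`make_rkey`, `parse_rkey`) are illustrative, not part of the atdata SDK:

```python
import re

# Hypothetical helpers for the {NSID}@{semver} rkey convention.
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

def make_rkey(nsid: str, version: str) -> str:
    """Build a schema rkey like 'com.example.myschema@1.2.3'."""
    if not SEMVER.match(version):
        raise ValueError(f"not a semantic version: {version}")
    return f"{nsid}@{version}"

def parse_rkey(rkey: str) -> tuple[str, tuple[int, int, int]]:
    """Split an rkey into its NSID and a numerically sortable version tuple."""
    nsid, _, version = rkey.rpartition("@")
    m = SEMVER.match(version)
    if not nsid or m is None:
        raise ValueError(f"malformed rkey: {rkey}")
    return nsid, tuple(int(p) for p in m.groups())

# Latest version of a schema = max over parsed version tuples,
# which is exactly the query pattern getLatestSchema relies on.
rkeys = ["com.example.myschema@1.2.3", "com.example.myschema@1.10.0"]
latest = max(rkeys, key=lambda rk: parse_rkey(rk)[1])
```

Comparing version *tuples* rather than strings is what makes `1.10.0` sort above `1.9.0`; lexicographic string comparison would get this wrong.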
+78
.planning/lexicons/ac.foundation.dataset.getLatestSchema.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.getLatestSchema", 4 + "defs": { 5 + "main": { 6 + "type": "query", 7 + "description": "Get the latest version of a sample schema by its permanent NSID identifier", 8 + "parameters": { 9 + "type": "params", 10 + "required": [ 11 + "schemaId" 12 + ], 13 + "properties": { 14 + "schemaId": { 15 + "type": "string", 16 + "description": "The permanent NSID identifier for the schema (the {NSID} part of the rkey {NSID}@{semver})", 17 + "maxLength": 500 18 + } 19 + } 20 + }, 21 + "output": { 22 + "encoding": "application/json", 23 + "schema": { 24 + "type": "object", 25 + "required": [ 26 + "uri", 27 + "version", 28 + "record" 29 + ], 30 + "properties": { 31 + "uri": { 32 + "type": "string", 33 + "description": "AT-URI of the latest schema version", 34 + "maxLength": 500 35 + }, 36 + "version": { 37 + "type": "string", 38 + "description": "Semantic version of the latest schema", 39 + "maxLength": 20 40 + }, 41 + "record": { 42 + "type": "ref", 43 + "ref": "ac.foundation.dataset.sampleSchema", 44 + "description": "The full schema record" 45 + }, 46 + "allVersions": { 47 + "type": "array", 48 + "description": "All available versions (optional, sorted by semver descending)", 49 + "items": { 50 + "type": "object", 51 + "required": [ 52 + "uri", 53 + "version" 54 + ], 55 + "properties": { 56 + "uri": { 57 + "type": "string", 58 + "maxLength": 500 59 + }, 60 + "version": { 61 + "type": "string", 62 + "maxLength": 20 63 + } 64 + } 65 + } 66 + } 67 + } 68 + } 69 + }, 70 + "errors": [ 71 + { 72 + "name": "SchemaNotFound", 73 + "description": "No schema found with the given NSID" 74 + } 75 + ] 76 + } 77 + } 78 + }
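A client consuming this query's output can reproduce the "sorted by semver descending" ordering of `allVersions` itself. A minimal sketch with illustrative data (the naive key below ignores pre-release precedence):

```python
# Sort a getLatestSchema-style allVersions array by semver, descending.
def semver_key(version: str) -> tuple[int, int, int]:
    # Simplification: drops pre-release/build metadata before comparing.
    major, minor, patch = (int(p) for p in version.split("-")[0].split("."))
    return (major, minor, patch)

all_versions = [
    {"uri": "at://did:plc:ex/ac.foundation.dataset.sampleSchema/s@1.9.0", "version": "1.9.0"},
    {"uri": "at://did:plc:ex/ac.foundation.dataset.sampleSchema/s@2.0.0", "version": "2.0.0"},
    {"uri": "at://did:plc:ex/ac.foundation.dataset.sampleSchema/s@1.10.0", "version": "1.10.0"},
]
all_versions.sort(key=lambda v: semver_key(v["version"]), reverse=True)
# The head of the sorted list is what the query returns as uri/version.
latest = all_versions[0]
```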
+99
.planning/lexicons/ac.foundation.dataset.lens.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.lens", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Bidirectional transformation (Lens) between two sample types, with code stored in external repositories", 8 + "key": "tid", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "sourceSchema", 14 + "targetSchema", 15 + "getterCode", 16 + "putterCode", 17 + "createdAt" 18 + ], 19 + "properties": { 20 + "name": { 21 + "type": "string", 22 + "description": "Human-readable lens name", 23 + "maxLength": 100 24 + }, 25 + "sourceSchema": { 26 + "type": "string", 27 + "description": "AT-URI reference to source sampleSchema", 28 + "maxLength": 500 29 + }, 30 + "targetSchema": { 31 + "type": "string", 32 + "description": "AT-URI reference to target sampleSchema", 33 + "maxLength": 500 34 + }, 35 + "description": { 36 + "type": "string", 37 + "description": "What this transformation does", 38 + "maxLength": 1000 39 + }, 40 + "getterCode": { 41 + "type": "ref", 42 + "ref": "#codeReference", 43 + "description": "Code reference for getter function (Source -> Target)" 44 + }, 45 + "putterCode": { 46 + "type": "ref", 47 + "ref": "#codeReference", 48 + "description": "Code reference for putter function (Target, Source -> Source)" 49 + }, 50 + "language": { 51 + "type": "string", 52 + "description": "Programming language of the lens implementation (e.g., 'python', 'typescript')", 53 + "maxLength": 50 54 + }, 55 + "metadata": { 56 + "type": "object", 57 + "description": "Arbitrary metadata (author, performance notes, etc.)" 58 + }, 59 + "createdAt": { 60 + "type": "string", 61 + "format": "datetime", 62 + "description": "Timestamp when this lens was created" 63 + } 64 + } 65 + } 66 + }, 67 + "codeReference": { 68 + "type": "object", 69 + "description": "Reference to code in an external repository (GitHub, tangled.org, etc.)", 70 + "required": [ 71 + "repository", 72 + "commit", 73 + "path" 74 + ], 75 + "properties": { 76 + 
"repository": { 77 + "type": "string", 78 + "description": "Repository URL (e.g., 'https://github.com/user/repo' or 'at://did/tangled.repo/...')", 79 + "maxLength": 500 80 + }, 81 + "commit": { 82 + "type": "string", 83 + "description": "Git commit hash (ensures immutability)", 84 + "maxLength": 40 85 + }, 86 + "path": { 87 + "type": "string", 88 + "description": "Path to function within repository (e.g., 'lenses/vision.py:rgb_to_grayscale')", 89 + "maxLength": 500 90 + }, 91 + "branch": { 92 + "type": "string", 93 + "description": "Optional branch name (for reference, commit hash is authoritative)", 94 + "maxLength": 100 95 + } 96 + } 97 + } 98 + } 99 + }
+142
.planning/lexicons/ac.foundation.dataset.record.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.record", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Index record for a WebDataset-backed dataset with references to storage location and sample schema", 8 + "key": "tid", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "schemaRef", 14 + "storage", 15 + "createdAt" 16 + ], 17 + "properties": { 18 + "name": { 19 + "type": "string", 20 + "description": "Human-readable dataset name", 21 + "maxLength": 200 22 + }, 23 + "schemaRef": { 24 + "type": "string", 25 + "description": "AT-URI reference to the sampleSchema record for this dataset's samples", 26 + "maxLength": 500 27 + }, 28 + "storage": { 29 + "type": "union", 30 + "description": "Storage location for dataset files (WebDataset tar archives)", 31 + "refs": [ 32 + "#storageExternal", 33 + "#storageBlobs" 34 + ] 35 + }, 36 + "description": { 37 + "type": "string", 38 + "description": "Human-readable description of the dataset", 39 + "maxLength": 5000 40 + }, 41 + "metadata": { 42 + "type": "bytes", 43 + "description": "Msgpack-encoded metadata dict (arbitrary key-value pairs)", 44 + "maxLength": 100000 45 + }, 46 + "tags": { 47 + "type": "array", 48 + "description": "Searchable tags for dataset discovery", 49 + "items": { 50 + "type": "string", 51 + "maxLength": 50 52 + }, 53 + "maxLength": 20 54 + }, 55 + "size": { 56 + "type": "ref", 57 + "ref": "#datasetSize", 58 + "description": "Dataset size information (optional)" 59 + }, 60 + "license": { 61 + "type": "string", 62 + "description": "License (SPDX identifier preferred)", 63 + "maxLength": 100 64 + }, 65 + "createdAt": { 66 + "type": "string", 67 + "format": "datetime", 68 + "description": "Timestamp when this dataset record was created" 69 + } 70 + } 71 + } 72 + }, 73 + "storageExternal": { 74 + "type": "object", 75 + "description": "External storage via URLs (S3, HTTP, IPFS, etc.)", 76 + "required": [ 77 + "type", 78 + "urls" 79 + ], 80 + 
"properties": { 81 + "type": { 82 + "type": "string", 83 + "const": "external" 84 + }, 85 + "urls": { 86 + "type": "array", 87 + "description": "WebDataset URLs (supports brace notation like 'data-{000000..000099}.tar')", 88 + "items": { 89 + "type": "string", 90 + "format": "uri", 91 + "maxLength": 1000 92 + }, 93 + "minLength": 1 94 + } 95 + } 96 + }, 97 + "storageBlobs": { 98 + "type": "object", 99 + "description": "Storage via ATProto PDS blobs", 100 + "required": [ 101 + "type", 102 + "blobs" 103 + ], 104 + "properties": { 105 + "type": { 106 + "type": "string", 107 + "const": "blobs" 108 + }, 109 + "blobs": { 110 + "type": "array", 111 + "description": "Array of blob references for WebDataset tar files", 112 + "items": { 113 + "type": "blob", 114 + "description": "Blob reference to a WebDataset tar archive" 115 + }, 116 + "minLength": 1 117 + } 118 + } 119 + }, 120 + "datasetSize": { 121 + "type": "object", 122 + "description": "Information about dataset size", 123 + "properties": { 124 + "samples": { 125 + "type": "integer", 126 + "description": "Total number of samples in the dataset", 127 + "minimum": 0 128 + }, 129 + "bytes": { 130 + "type": "integer", 131 + "description": "Total size in bytes", 132 + "minimum": 0 133 + }, 134 + "shards": { 135 + "type": "integer", 136 + "description": "Number of WebDataset shards", 137 + "minimum": 1 138 + } 139 + } 140 + } 141 + } 142 + }
+91
.planning/lexicons/ac.foundation.dataset.sampleSchema.json
··· 1 + { 2 + "lexicon": 1, 3 + "id": "ac.foundation.dataset.sampleSchema", 4 + "defs": { 5 + "main": { 6 + "type": "record", 7 + "description": "Definition of a PackableSample-compatible sample type using JSON Schema. Supports versioning via rkey format: {NSID}@{semver}", 8 + "key": "any", 9 + "record": { 10 + "type": "object", 11 + "required": [ 12 + "name", 13 + "version", 14 + "schemaType", 15 + "jsonSchema", 16 + "createdAt" 17 + ], 18 + "properties": { 19 + "name": { 20 + "type": "string", 21 + "description": "Human-readable name for this sample type", 22 + "maxLength": 100 23 + }, 24 + "version": { 25 + "type": "string", 26 + "description": "Semantic version (e.g., '1.0.0')", 27 + "pattern": "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$", 28 + "maxLength": 100 29 + }, 30 + "schemaType": { 31 + "type": "string", 32 + "description": "Type of schema definition (currently only 'jsonschema' supported)", 33 + "enum": ["jsonschema"], 34 + "default": "jsonschema" 35 + }, 36 + "jsonSchema": { 37 + "type": "object", 38 + "description": "JSON Schema Draft 7 definition for this sample type. 
Use standard JSON Schema with NDArray shim for array types.", 39 + "required": ["$schema", "type", "properties"], 40 + "properties": { 41 + "$schema": { 42 + "type": "string", 43 + "const": "http://json-schema.org/draft-07/schema#" 44 + }, 45 + "type": { 46 + "type": "string", 47 + "const": "object" 48 + }, 49 + "properties": { 50 + "type": "object", 51 + "minProperties": 1 52 + } 53 + } 54 + }, 55 + "description": { 56 + "type": "string", 57 + "description": "Human-readable description of what this sample type represents", 58 + "maxLength": 5000 59 + }, 60 + "ndarrayShimUri": { 61 + "type": "string", 62 + "format": "uri", 63 + "description": "URI to the NDArray JSON Schema shim definition (optional, defaults to standard shim at https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0)", 64 + "maxLength": 500 65 + }, 66 + "metadata": { 67 + "type": "object", 68 + "description": "Arbitrary metadata (author, license, tags, etc.)" 69 + }, 70 + "previousVersion": { 71 + "type": "string", 72 + "format": "at-uri", 73 + "description": "AT-URI reference to the previous version of this schema (if any)", 74 + "maxLength": 500 75 + }, 76 + "migrationLens": { 77 + "type": "string", 78 + "format": "at-uri", 79 + "description": "AT-URI reference to a Lens for migrating from previous version (if applicable)", 80 + "maxLength": 500 81 + }, 82 + "createdAt": { 83 + "type": "string", 84 + "format": "datetime", 85 + "description": "Timestamp when this schema version was created" 86 + } 87 + } 88 + } 89 + } 90 + } 91 + }
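The `version` field's regex is the semver.org recommended pattern; a client can apply the same check before publishing rather than waiting for PDS-side validation:

```python
import re

# The exact pattern from the sampleSchema lexicon's `version` field.
SEMVER_PATTERN = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)

assert SEMVER_PATTERN.match("1.0.0")
assert SEMVER_PATTERN.match("2.1.0-rc.1+build.5")   # pre-release + build metadata
assert not SEMVER_PATTERN.match("01.0.0")           # leading zeros rejected
assert not SEMVER_PATTERN.match("1.0")              # patch component required
```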
+16
.planning/lexicons/ndarray_shim.json
··· 1 + { 2 + "$schema": "http://json-schema.org/draft-07/schema#", 3 + "$id": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0", 4 + "title": "ATDataNDArrayBytes", 5 + "description": "Standard definition for numpy NDArray types in JSON Schema, compatible with atdata WebDataset serialization. This type's contents are interpreted as containing the raw bytes data for a serialized numpy NDArray, and serve as a marker for atdata-based code generation to use standard numpy types, rather than generated dataclasses.", 6 + "version": "1.0.0", 7 + "$defs": { 8 + "ndarray": { 9 + "type": "string", 10 + "format": "byte", 11 + "description": "Numpy array serialized using numpy `.npy` format via `np.save` (includes dtype and shape in binary header). When represented in JSON, this is a base64-encoded string. In msgpack, this is raw bytes.", 12 + "contentEncoding": "base64", 13 + "contentMediaType": "application/octet-stream" 14 + } 15 + } 16 + }
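A round-trip of the shim's wire format: raw `.npy` bytes in msgpack, base64-encoded when a sample is represented as JSON. `array_to_bytes` mirrors the helper quoted in the accompanying spec (here without `allow_pickle` for plain arrays):

```python
import base64
from io import BytesIO

import numpy as np

def array_to_bytes(x: np.ndarray) -> bytes:
    buf = BytesIO()
    np.save(buf, x)  # .npy header carries dtype and shape
    return buf.getvalue()

def bytes_to_array(b: bytes) -> np.ndarray:
    return np.load(BytesIO(b))

arr = np.arange(12, dtype=np.uint8).reshape(3, 4)
raw = array_to_bytes(arr)
assert raw.startswith(b"\x93NUMPY")  # magic bytes noted in the spec

# JSON representation per the shim: contentEncoding "base64".
as_json_value = base64.b64encode(raw).decode("ascii")
restored = bytes_to_array(base64.b64decode(as_json_value))
assert restored.dtype == np.uint8 and restored.shape == (3, 4)
```

Because the `.npy` header is self-describing, no dtype or shape fields need to travel alongside the bytes.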
+386
.planning/ndarray_shim_spec.md
··· 1 + # NDArray JSON Schema Shim Specification 2 + 3 + **Issue**: #52 4 + **Version**: 1.0 5 + **Status**: Draft 6 + 7 + ## Problem Statement 8 + 9 + We need a standard way to represent numpy NDArray types in JSON Schema that: 10 + 1. Works with existing atdata msgpack serialization (numpy `.npy` format) 11 + 2. Can be validated (where practical) 12 + 3. Can be used for code generation 13 + 4. Is compatible with JSON Schema tooling 14 + 5. Preserves dtype and shape information 15 + 16 + ## Current Serialization Format 17 + 18 + atdata uses `_helpers.array_to_bytes()` which serializes arrays using numpy's native `.npy` format: 19 + 20 + ```python 21 + def array_to_bytes(x: np.ndarray) -> bytes: 22 + np_bytes = BytesIO() 23 + np.save(np_bytes, x, allow_pickle=True) 24 + return np_bytes.getvalue() 25 + ``` 26 + 27 + **Result**: A bytes object containing: 28 + - Magic bytes (`\x93NUMPY`) 29 + - Version info 30 + - Header with dtype and shape 31 + - Array data 32 + 33 + **Key insight**: The .npy format is self-describing - dtype and shape are already in the bytes! 
34 + 35 + ## Design Approach 36 + 37 + ### Option 1: Pure Metadata (REJECTED) 38 + 39 + Describe the semantic array only: 40 + ```json 41 + { 42 + "type": "object", 43 + "x-atdata-ndarray": true, 44 + "x-dtype": "uint8", 45 + "x-shape": [null, null, 3] 46 + } 47 + ``` 48 + 49 + **Problem**: Doesn't match actual msgpack structure (which stores bytes, not objects) 50 + 51 + ### Option 2: Bytes with Extension Properties (REJECTED) 52 + 53 + Describe the bytes with metadata: 54 + ```json 55 + { 56 + "type": "string", 57 + "format": "byte", 58 + "x-dtype": "uint8", 59 + "x-shape": [null, null, 3] 60 + } 61 + ``` 62 + 63 + **Problem**: 64 + - Non-standard use of extension properties 65 + - JSON Schema doesn't know how to validate these 66 + - Codegen tools won't understand x- properties 67 + 68 + ### Option 3: Reusable Schema Definition (RECOMMENDED) 69 + 70 + Create a standard NDArray schema definition that can be $ref'd, with controlled vocabulary for metadata. 71 + 72 + ## Recommended Specification 73 + 74 + ### Base NDArray Schema Definition 75 + 76 + This should be included in every JSON Schema that uses NDArray types: 77 + 78 + ```json 79 + { 80 + "$schema": "http://json-schema.org/draft-07/schema#", 81 + "$defs": { 82 + "ndarray": { 83 + "type": "string", 84 + "format": "byte", 85 + "description": "Numpy array serialized using numpy .npy format (includes dtype and shape in binary header)", 86 + "contentEncoding": "base64", 87 + "contentMediaType": "application/octet-stream" 88 + } 89 + } 90 + } 91 + ``` 92 + 93 + ### Using NDArray in Properties 94 + 95 + Properties that are NDArray types reference the base definition and add metadata as **sibling properties**: 96 + 97 + ```json 98 + { 99 + "properties": { 100 + "image": { 101 + "$ref": "#/$defs/ndarray", 102 + "description": "RGB image with variable height/width", 103 + "x-atdata-dtype": "uint8", 104 + "x-atdata-shape": [null, null, 3] 105 + } 106 + } 107 + } 108 + ``` 109 + 110 + ### Metadata Convention 111 + 112 
+ **Extension properties** (prefixed with `x-atdata-`): 113 + - `x-atdata-dtype`: Numpy dtype string (e.g., "uint8", "float32", "int64") 114 + - `x-atdata-shape`: Array of integers and null (null = dynamic dimension) 115 + - `x-atdata-notes`: Optional human-readable notes about the array 116 + 117 + **Standard JSON Schema properties** (used normally): 118 + - `description`: Human-readable description of what the array represents 119 + - `title`: Short name for the field 120 + 121 + ## Complete Example 122 + 123 + ```json 124 + { 125 + "$schema": "http://json-schema.org/draft-07/schema#", 126 + "title": "ImageSample", 127 + "type": "object", 128 + "required": ["image", "label"], 129 + "properties": { 130 + "image": { 131 + "$ref": "#/$defs/ndarray", 132 + "description": "RGB image with variable height/width", 133 + "x-atdata-dtype": "uint8", 134 + "x-atdata-shape": [null, null, 3], 135 + "x-atdata-notes": "Images must have 3 color channels (RGB)" 136 + }, 137 + "depth_map": { 138 + "$ref": "#/$defs/ndarray", 139 + "description": "Depth map corresponding to the image", 140 + "x-atdata-dtype": "float32", 141 + "x-atdata-shape": [null, null], 142 + "x-atdata-notes": "Same height and width as image, but single channel" 143 + }, 144 + "label": { 145 + "type": "string", 146 + "description": "Human-readable label" 147 + } 148 + }, 149 + "$defs": { 150 + "ndarray": { 151 + "type": "string", 152 + "format": "byte", 153 + "description": "Numpy array serialized using numpy .npy format", 154 + "contentEncoding": "base64", 155 + "contentMediaType": "application/octet-stream" 156 + } 157 + } 158 + } 159 + ``` 160 + 161 + ## Rationale 162 + 163 + ### Why `type: "string", format: "byte"`? 
164 + 165 + In msgpack serialization: 166 + - The NDArray field is stored as raw bytes (the .npy format) 167 + - When represented in JSON (for validation/transport), bytes become base64 strings 168 + - JSON Schema's `type: "string", format: "byte"` is the standard way to represent binary data 169 + 170 + ### Why extension properties (`x-atdata-*`)? 171 + 172 + JSON Schema allows custom properties starting with `x-`. Benefits: 173 + 1. **Standard**: Well-established convention in JSON Schema ecosystem 174 + 2. **Ignored by validators**: Won't cause validation errors 175 + 3. **Accessible to codegen**: Tools can parse these for type generation 176 + 4. **Self-documenting**: Clear what they mean 177 + 178 + ### Why not validate dtype/shape at JSON Schema level? 179 + 180 + **Technical limitation**: JSON Schema can't validate binary .npy format internals. 181 + 182 + **Solution**: Validation happens at **deserialization time**: 183 + 1. JSON Schema validates overall structure (field is present, is bytes) 184 + 2. 
When bytes are deserialized to NDArray, check dtype/shape match expectations 185 + 186 + ## Usage in atdata 187 + 188 + ### Publishing Schemas 189 + 190 + When publishing a PackableSample with NDArray fields: 191 + 192 + ```python 193 + @atdata.packable 194 + class ImageSample: 195 + image: NDArray # Will be annotated with dtype/shape hints 196 + label: str 197 + 198 + # SDK extracts type hints and generates JSON Schema 199 + schema_json = { 200 + "properties": { 201 + "image": { 202 + "$ref": "#/$defs/ndarray", 203 + "x-atdata-dtype": "uint8", # From annotation or default 204 + "x-atdata-shape": [null, null, 3] # From annotation or None 205 + } 206 + } 207 + } 208 + ``` 209 + 210 + ### Type Annotations for NDArray 211 + 212 + Python type hints to specify dtype/shape: 213 + 214 + ```python 215 + from typing import Annotated 216 + from numpy.typing import NDArray 217 + 218 + # Option 1: Generic NDArray (dtype/shape inferred or not specified) 219 + image: NDArray 220 + 221 + # Option 2: With dtype (using numpy typing) 222 + image: NDArray[np.uint8] 223 + 224 + # Option 3: With full metadata (using Annotated) 225 + image: Annotated[NDArray[np.uint8], {"shape": [None, None, 3]}] 226 + ``` 227 + 228 + ### Code Generation 229 + 230 + Codegen reads JSON Schema and produces: 231 + 232 + ```python 233 + @atdata.packable 234 + class ImageSample: 235 + image: NDArray # uint8, shape: [*, *, 3] 236 + label: str 237 + ``` 238 + 239 + Comment indicates dtype/shape from `x-atdata-*` properties. 
240 + 241 + ## Validation Strategy 242 + 243 + ### JSON Schema Level (Structural) 244 + ✅ Validate field is present (if required) 245 + ✅ Validate field is bytes/string (in JSON) 246 + ✅ Validate base64 encoding (if in JSON) 247 + 248 + ### Deserialization Level (Semantic) 249 + ✅ Validate .npy format is valid 250 + ✅ Validate dtype matches expected (if specified) 251 + ✅ Validate shape matches expected (if specified) 252 + ✅ Validate shape constraints (e.g., must be 3D) 253 + 254 + ### Implementation 255 + 256 + ```python 257 + from atdata.validation import validate_ndarray 258 + 259 + def validate_ndarray( 260 + array: np.ndarray, 261 + expected_dtype: Optional[str] = None, 262 + expected_shape: Optional[List[Optional[int]]] = None 263 + ) -> tuple[bool, list[str]]: 264 + """Validate array against expectations.""" 265 + errors = [] 266 + 267 + # Check dtype 268 + if expected_dtype and str(array.dtype) != expected_dtype: 269 + errors.append(f"Expected dtype {expected_dtype}, got {array.dtype}") 270 + 271 + # Check shape 272 + if expected_shape: 273 + if len(array.shape) != len(expected_shape): 274 + errors.append(f"Expected {len(expected_shape)}D array, got {len(array.shape)}D") 275 + for i, (actual, expected) in enumerate(zip(array.shape, expected_shape)): 276 + if expected is not None and actual != expected: 277 + errors.append(f"Dimension {i}: expected {expected}, got {actual}") 278 + 279 + return len(errors) == 0, errors 280 + ``` 281 + 282 + ## Standard NDArray Shim URI 283 + 284 + The NDArray shim definition should be published at a canonical URI: 285 + 286 + **Proposed**: `at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0` 287 + 288 + This allows schemas to reference a standard definition: 289 + 290 + ```json 291 + { 292 + "properties": { 293 + "image": { 294 + "$ref": "at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0#/$defs/ndarray", 295 + "x-atdata-dtype": "uint8" 296 + } 297 + } 298 + } 299 + ``` 300 + 301 + Or schemas can 
inline the definition (recommended for Phase 1): 302 + 303 + ```json 304 + { 305 + "$defs": { 306 + "ndarray": { /* inline definition */ } 307 + } 308 + } 309 + ``` 310 + 311 + ## Alternative: Describe Deserialized Structure 312 + 313 + For reference, an alternative approach that describes the "unpacked" structure: 314 + 315 + ```json 316 + { 317 + "$defs": { 318 + "ndarray": { 319 + "type": "object", 320 + "description": "Numpy array (deserialized representation)", 321 + "required": ["dtype", "shape", "data"], 322 + "properties": { 323 + "dtype": {"type": "string"}, 324 + "shape": {"type": "array", "items": {"type": "integer"}}, 325 + "data": {"type": "string", "format": "byte"} 326 + } 327 + } 328 + } 329 + } 330 + ``` 331 + 332 + **Problem**: This doesn't match the actual msgpack structure (which is just bytes, not an object with dtype/shape/data fields). The .npy format is opaque bytes, not a structured object. 333 + 334 + **Conclusion**: Stick with the recommended approach (bytes with metadata). 335 + 336 + ## Implementation Checklist 337 + 338 + - [ ] Update sampleSchema Lexicon to reference this spec 339 + - [ ] Create standard NDArray shim definition 340 + - [ ] Update schema examples to use the shim correctly 341 + - [ ] Implement validation helpers in Python SDK 342 + - [ ] Add type annotation support for dtype/shape hints 343 + - [ ] Update codegen to read x-atdata-* properties 344 + - [ ] Document in user-facing docs 345 + 346 + ## Open Questions 347 + 348 + 1. **Should we support other array libraries?** (PyTorch tensors, JAX arrays, etc.) 349 + - Could use `x-atdata-array-type: "numpy"|"torch"|"jax"` 350 + - Recommendation: NumPy only for Phase 1 351 + 352 + 2. **Should shape constraints be enforced at runtime?** 353 + - Pro: Catch errors early 354 + - Con: Performance overhead, flexibility lost 355 + - Recommendation: Optional validation, off by default 356 + 357 + 3. 
**Should we support sparse arrays?** 358 + - scipy.sparse has different serialization format 359 + - Recommendation: Defer to future 360 + 361 + 4. **What about array of arrays?** (ragged arrays) 362 + - Can be represented as Python lists of NDArrays 363 + - Recommendation: Not a priority for Phase 1 364 + 365 + ## Summary 366 + 367 + **Recommended Approach**: 368 + - NDArray fields represented as `{"$ref": "#/$defs/ndarray"}` (bytes) 369 + - Dtype and shape specified via `x-atdata-dtype` and `x-atdata-shape` 370 + - Standard `ndarray` definition inlined in every schema 371 + - Validation happens at deserialization, not JSON Schema level 372 + - Codegen reads extension properties to generate proper types 373 + 374 + **Benefits**: 375 + - ✅ Compatible with existing msgpack serialization 376 + - ✅ Works with JSON Schema tooling 377 + - ✅ Clear metadata for codegen 378 + - ✅ Flexible (dtype/shape optional) 379 + - ✅ Extensible (can add more x-atdata-* properties) 380 + 381 + **Trade-offs**: 382 + - ⚠️ Leaky abstraction (JSON Schema describes bytes, not semantic array) 383 + - ⚠️ Validation split across two layers 384 + - ⚠️ Extension properties not universally understood 385 + 386 + **Grade**: B+ (Good practical solution)
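The two-layer strategy can be tied together in a few lines: read the `x-atdata-*` hints from a schema property and check a deserialized array against them. The `check` helper is hypothetical, mirroring the spec's `validate_ndarray`:

```python
import numpy as np

# A schema property as it would appear in a published sampleSchema
# (JSON null for a dynamic dimension becomes Python None).
prop = {
    "$ref": "#/$defs/ndarray",
    "x-atdata-dtype": "uint8",
    "x-atdata-shape": [None, None, 3],
}

def check(array: np.ndarray, prop: dict) -> list[str]:
    """Semantic-layer validation: compare a deserialized array to hints."""
    errors = []
    dtype = prop.get("x-atdata-dtype")
    if dtype and str(array.dtype) != dtype:
        errors.append(f"dtype: expected {dtype}, got {array.dtype}")
    shape = prop.get("x-atdata-shape")
    if shape is not None:
        if array.ndim != len(shape):
            errors.append(f"ndim: expected {len(shape)}, got {array.ndim}")
        else:
            for i, (got, want) in enumerate(zip(array.shape, shape)):
                if want is not None and got != want:
                    errors.append(f"dim {i}: expected {want}, got {got}")
    return errors

assert check(np.zeros((480, 640, 3), dtype=np.uint8), prop) == []
assert check(np.zeros((480, 640), dtype=np.uint8), prop) != []  # wrong ndim
```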