A loose federation of distributed, typed datasets

refactor: add extensible array format system with versioned registry and canonical URLs

Replaced the single ndarrayShimUri field with an extensible array format system:

Array Format Changes:
- Removed 'ndarrayShimUri' (single URI, not extensible)
- Added 'arrayFormatVersions' object mapping format names to semver
- Example: {"ndarrayBytes": "1.0.0"}
- Supports multiple formats simultaneously if needed

ArrayFormat Registry:
- Created ac.foundation.dataset.arrayFormat lexicon
- Token-based registry with knownValues pattern (like schemaType)
- Current formats: ndarrayBytes (numpy .npy binary format)
- Enables adding new formats (Arrow, Protobuf, etc.) without breaking changes

Canonical URLs:
- Foundation.ac maintains shim schemas at predictable URLs
- Pattern: https://foundation.ac/schemas/atdata-{format}-bytes/{version}/
- Example: https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0/

Metadata Refinements:
- Removed 'author' field (ATProto records have creator DIDs)
- Enhanced 'license' to support SPDX identifiers/URLs (Schema.org alignment)
- Enhanced 'tags' description (Schema.org keywords alignment)

Documentation:
- Added README_ARRAY_FORMATS.md explaining registry pattern
- Documents version evolution and codegen integration

This follows the ATProto token pattern for extensible type systems.

Closes #68

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

+207 -17
.chainlink/issues.db

This is a binary file and will not be displayed.

+3 -1
.planning/examples/sampleSchema_example.json
···
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "ImageSample",
    "type": "object",
+   "arrayFormatVersions": {
+     "ndarrayBytes": "1.0.0"
+   },
    "required": [
      "image",
      "label"
···
    },
    "description": "Sample type for images with labels, commonly used for computer vision datasets",
    "metadata": {
-     "author": "alice.bsky.social",
      "license": "MIT",
      "tags": ["computer-vision", "image-classification"]
    },
+178
.planning/lexicons/README_ARRAY_FORMATS.md
···
+ # Array Format Registry
+
+ This document explains the token-based registry pattern for atdata array serialization formats.
+
+ ## Overview
+
+ Array formats define how numpy NDArray fields are serialized in atdata sample types. The system provides:
+
+ 1. **Token-based registry**: `ac.foundation.dataset.arrayFormat` Lexicon
+ 2. **Version tracking**: Each schema declares which format versions it uses
+ 3. **Canonical shim schemas**: Foundation.ac maintains standard JSON Schema shims at predictable URLs
+
+ ## Pattern
+
+ ### arrayFormat Lexicon Structure
+
+ ```json
+ {
+   "lexicon": 1,
+   "id": "ac.foundation.dataset.arrayFormat",
+   "defs": {
+     "main": {
+       "type": "string",
+       "knownValues": ["ndarrayBytes"],
+       "maxLength": 50
+     },
+     "ndarrayBytes": {
+       "type": "token",
+       "description": "Numpy .npy binary format..."
+     }
+   }
+ }
+ ```
+
+ ### Usage in sampleSchema
+
+ Schema records declare format versions in `arrayFormatVersions` field:
+
+ ```json
+ {
+   "$type": "ac.foundation.dataset.sampleSchema",
+   "schemaType": "jsonSchema",
+   "schema": {
+     "$type": "ac.foundation.dataset.sampleSchema#jsonSchemaFormat",
+     "arrayFormatVersions": {
+       "ndarrayBytes": "1.0.0"
+     },
+     "properties": {
+       "image": {
+         "$ref": "#/$defs/ndarray",
+         "x-atdata-dtype": "uint8"
+       }
+     },
+     "$defs": {
+       "ndarray": {
+         "type": "string",
+         "format": "byte",
+         ...
+       }
+     }
+   }
+ }
+ ```
+
+ ## Canonical Shim Schema URLs
+
+ Foundation.ac maintains JSON Schema shims at canonical URLs:
+
+ ```
+ https://foundation.ac/schemas/atdata-{format}-bytes/{version}/
+ ```
+
+ Examples:
+ - `https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0/`
+ - `https://foundation.ac/schemas/atdata-arrow-bytes/1.0.0/` (future)
+
+ These shim schemas define the JSON Schema representation (base64-encoded bytes) for each format.
+
+ ## Default Behavior
+
+ If `arrayFormatVersions` is omitted, the system defaults to:
+
+ ```json
+ {
+   "ndarrayBytes": "1.0.0"
+ }
+ ```
+
+ This ensures backward compatibility and simplifies common cases.
+
+ ## Current Array Formats
+
+ | Token Def | knownValue | Current Version | Description |
+ |-----------|------------|-----------------|-------------|
+ | `#ndarrayBytes` | `"ndarrayBytes"` | `1.0.0` | Numpy .npy binary format with dtype/shape header |
+
+ ## Adding New Array Formats
+
+ To add support for a new array format (e.g., Apache Arrow):
+
+ ### 1. Add token def to arrayFormat Lexicon
+
+ Edit `ac.foundation.dataset.arrayFormat.json`:
+
+ ```json
+ {
+   "defs": {
+     "main": {
+       "knownValues": ["ndarrayBytes", "arrowBytes"]
+     },
+     "arrowBytes": {
+       "type": "token",
+       "description": "Apache Arrow IPC format for array serialization..."
+     }
+   }
+ }
+ ```
+
+ ### 2. Publish shim schema at canonical URL
+
+ Create and publish JSON Schema shim at:
+ ```
+ https://foundation.ac/schemas/atdata-arrow-bytes/1.0.0/
+ ```
+
+ ### 3. Use in sample schemas
+
+ Declare format version in schema records:
+
+ ```json
+ {
+   "arrayFormatVersions": {
+     "arrowBytes": "1.0.0"
+   }
+ }
+ ```
+
+ ## Version Evolution
+
+ ### Minor/Patch Updates
+
+ For backward-compatible changes:
+ - Publish new version at new URL (e.g., `1.1.0`)
+ - Update `arrayFormatVersions` in schema records
+ - Old versions remain accessible
+
+ ### Major Updates
+
+ For breaking changes:
+ - Consider new format name (e.g., `ndarrayBytes2`)
+ - Or use major version in URL structure
+ - Schemas can migrate via Lens transformations
+
+ ## Design Rationale
+
+ This pattern provides:
+
+ 1. **Centralized Discovery**: Query `ac.foundation.dataset.arrayFormat` to see all supported formats
+ 2. **Explicit Versioning**: Each schema declares exactly which format versions it uses
+ 3. **Canonical References**: Predictable URLs for shim schemas maintained by foundation.ac
+ 4. **Extensibility**: New formats added via tokens without breaking existing schemas
+ 5. **Flexibility**: Schemas can use multiple formats simultaneously (if needed)
+
+ ## Relationship to Codegen
+
+ When atdata codegen processes a sampleSchema:
+
+ 1. Reads `arrayFormatVersions` to know which formats are used
+ 2. Fetches canonical shim schemas from foundation.ac URLs
+ 3. Generates Python dataclasses with proper NDArray type hints
+ 4. Implements serialization using appropriate format (currently `.npy` via `_helpers.py`)
+
+ ## References
+
+ - [ac.foundation.dataset.arrayFormat Lexicon](./ac.foundation.dataset.arrayFormat.json)
+ - [ac.foundation.dataset.sampleSchema Lexicon](./ac.foundation.dataset.sampleSchema.json)
+ - [NDArray Shim Specification](../.planning/ndarray_shim_spec.md)
+ - [ATProto Lexicon Token Type](https://atproto.com/guides/lexicon)
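
The default rule and the canonical URL pattern from the README can be sketched in Python. This is a minimal illustration under stated assumptions, not the atdata codegen: the function names are hypothetical, and the mapping from a token such as `ndarrayBytes` to the URL segment `ndarray` is inferred from the two example URLs rather than spelled out in the source.

```python
# Hypothetical helpers; these names are illustrative and not part of atdata.
DEFAULT_ARRAY_FORMAT_VERSIONS = {"ndarrayBytes": "1.0.0"}
SCHEMA_BASE_URL = "https://foundation.ac/schemas"


def resolve_array_format_versions(schema: dict) -> dict:
    """Return the declared format versions, or the documented default if the field is omitted."""
    if "arrayFormatVersions" not in schema:
        return dict(DEFAULT_ARRAY_FORMAT_VERSIONS)
    return dict(schema["arrayFormatVersions"])


def canonical_shim_url(format_token: str, version: str) -> str:
    """Build the shim URL for a format token and version.

    Assumption: the URL segment is the token with its "Bytes" suffix removed,
    lowercased ("ndarrayBytes" -> "ndarray", "arrowBytes" -> "arrow").
    """
    suffix = "Bytes"
    if not format_token.endswith(suffix) or len(format_token) == len(suffix):
        raise ValueError(f"unexpected array format token: {format_token!r}")
    stem = format_token[: -len(suffix)].lower()
    return f"{SCHEMA_BASE_URL}/atdata-{stem}-bytes/{version}/"


def shim_urls_for(schema: dict) -> dict:
    """Map each format used by a schema to the URL of its shim schema."""
    versions = resolve_array_format_versions(schema)
    return {token: canonical_shim_url(token, ver) for token, ver in versions.items()}
```

Calling `shim_urls_for({})` falls back to the default and yields the `atdata-ndarray-bytes/1.0.0/` URL, which matches the first example in the README.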
+16
.planning/lexicons/ac.foundation.dataset.arrayFormat.json
···
+ {
+   "lexicon": 1,
+   "id": "ac.foundation.dataset.arrayFormat",
+   "defs": {
+     "main": {
+       "type": "string",
+       "description": "Array serialization format identifier for NDArray fields in sample schemas. Known values correspond to token definitions in this Lexicon. Each format has versioned specifications maintained by foundation.ac at canonical URLs.",
+       "knownValues": ["ndarrayBytes"],
+       "maxLength": 50
+     },
+     "ndarrayBytes": {
+       "type": "token",
+       "description": "Numpy .npy binary format for NDArray serialization. Stores arrays with dtype and shape in binary header. Versions maintained at https://foundation.ac/schemas/atdata-ndarray-bytes/{version}/"
+     }
+   }
+ }
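
The lexicon description says that known values correspond to token definitions in the same file. A small consistency check for that invariant might look like the sketch below. It is illustrative only: `known_values_without_token` is a hypothetical helper, and the embedded JSON is a shortened copy of the lexicon above.

```python
import json

# Shortened copy of the lexicon shown above (descriptions trimmed).
LEXICON_JSON = """
{
  "lexicon": 1,
  "id": "ac.foundation.dataset.arrayFormat",
  "defs": {
    "main": {"type": "string", "knownValues": ["ndarrayBytes"], "maxLength": 50},
    "ndarrayBytes": {"type": "token", "description": "Numpy .npy binary format"}
  }
}
"""


def known_values_without_token(lexicon: dict) -> list:
    """Return knownValues entries that lack a matching def of type "token"."""
    defs = lexicon["defs"]
    known = defs["main"].get("knownValues", [])
    return [name for name in known if defs.get(name, {}).get("type") != "token"]


lexicon = json.loads(LEXICON_JSON)
missing = known_values_without_token(lexicon)
```

Such a check would catch the case where a new value (say `arrowBytes`) is appended to `knownValues` but its token def is forgotten.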
+10 -16
.planning/lexicons/ac.foundation.dataset.sampleSchema.json
···
    },
    "metadata": {
      "type": "object",
-     "description": "Optional metadata about this schema. Common fields include author, license, and tags, but any additional fields are permitted.",
+     "description": "Optional metadata about this schema. Common fields include license and tags, but any additional fields are permitted.",
      "maxProperties": 50,
      "properties": {
-       "author": {
+       "license": {
          "type": "string",
-         "description": "Creator of this schema (DID, handle, or name)",
+         "description": "License identifier or URL. SPDX identifiers recommended (e.g., MIT, Apache-2.0, CC-BY-4.0) or full SPDX URLs (e.g., http://spdx.org/licenses/MIT). Aligns with Schema.org license property.",
          "maxLength": 200
        },
-       "license": {
-         "type": "string",
-         "description": "License identifier (e.g., MIT, Apache-2.0, CC-BY-4.0)",
-         "maxLength": 100
-       },
        "tags": {
          "type": "array",
-         "description": "Categorization tags for discovery",
+         "description": "Categorization keywords for discovery. Aligns with Schema.org keywords property.",
          "items": {
            "type": "string",
-           "maxLength": 50
+           "maxLength": 150
          },
-         "maxLength": 20
+         "maxLength": 30
        }
      }
    },
···
        "description": "Field definitions for the sample type",
        "minProperties": 1
      },
-     "ndarrayShimUri": {
-       "type": "string",
-       "format": "uri",
-       "description": "URI to the NDArray JSON Schema shim definition. Optional, defaults to https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0",
-       "maxLength": 500
+     "arrayFormatVersions": {
+       "type": "object",
+       "description": "Mapping from array format identifiers to semantic versions. Keys are ac.foundation.dataset.arrayFormat values (e.g., 'ndarrayBytes'), values are semver strings (e.g., '1.0.0'). Foundation.ac maintains canonical shim schemas at https://foundation.ac/schemas/atdata-{format}-bytes/{version}/.",
+       "maxProperties": 10
      }
    }
  }
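
The constraints that the new `arrayFormatVersions` field describes (an object with at most 10 entries, format identifiers as keys, semver strings as values) could be checked along the lines of the sketch below. It is a hedged illustration, not the project's validator: the function name is hypothetical, the semver pattern is a simplified form of the official grammar, and the 50-byte key limit is borrowed from the `maxLength` of the arrayFormat lexicon on the assumption that keys must be valid arrayFormat values.

```python
import re

# Simplified semver pattern: MAJOR.MINOR.PATCH with optional pre-release and
# build metadata. It is looser than the official semver.org grammar.
SEMVER_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)
MAX_PROPERTIES = 10          # from the sampleSchema lexicon
MAX_FORMAT_NAME_BYTES = 50   # assumed: maxLength of the arrayFormat string def


def validate_array_format_versions(value) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(value, dict):
        return ["arrayFormatVersions must be an object"]
    errors = []
    if len(value) > MAX_PROPERTIES:
        errors.append(f"too many formats: {len(value)} > {MAX_PROPERTIES}")
    for name, version in value.items():
        # Lexicon maxLength counts UTF-8 bytes, so measure the encoded key.
        if not isinstance(name, str) or len(name.encode("utf-8")) > MAX_FORMAT_NAME_BYTES:
            errors.append(f"invalid format identifier: {name!r}")
        if not isinstance(version, str) or not SEMVER_RE.match(version):
            errors.append(f"{name!r}: version {version!r} is not a semver string")
    return errors
```

Because `knownValues` is an open set in ATProto lexicons, the sketch deliberately does not reject format names it has not seen before.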