A loose federation of distributed, typed datasets

refactor: extract storage union to separate lexicons and add validation improvements

Refactored dataset.record storage union for modularity and consistency:

Storage Union Extraction:
- Extracted #storageExternal to ac.foundation.dataset.storageExternal lexicon
- Extracted #storageBlobs to ac.foundation.dataset.storageBlobs lexicon
- Removed inline storage defs from record lexicon
- Changed union refs to point to external lexicons
- Removed 'type' field from storage objects (using $type discriminator)

This follows the pattern established for schemaType and arrayFormat,
providing better modularity and enabling future storage types without
modifying the record lexicon.
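
One way a consumer might take advantage of this modularity is to dispatch on the `$type` discriminator through a registry, so that supporting a new storage lexicon means registering one more handler. The sketch below is illustrative only: `HANDLERS`, `register`, and `storage_locations` are hypothetical names, not part of this repository, and returning an empty list for unknown variants is just one possible policy for an open union.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: maps a $type discriminator to a handler function.
Handler = Callable[[Dict[str, Any]], List[str]]
HANDLERS: Dict[str, Handler] = {}


def register(type_id: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one storage lexicon NSID."""
    def wrap(fn: Handler) -> Handler:
        HANDLERS[type_id] = fn
        return fn
    return wrap


@register("ac.foundation.dataset.storageExternal")
def external_locations(storage: Dict[str, Any]) -> List[str]:
    # External storage carries its locations directly as URLs.
    return list(storage["urls"])


@register("ac.foundation.dataset.storageBlobs")
def blob_locations(storage: Dict[str, Any]) -> List[str]:
    # Blob storage is identified here by each blob's CID link.
    return [blob["ref"]["$link"] for blob in storage["blobs"]]


def storage_locations(storage: Dict[str, Any]) -> List[str]:
    """Dispatch on $type; unknown variants yield no locations (assumed policy)."""
    handler = HANDLERS.get(storage.get("$type"))
    if handler is None:
        return []
    return handler(storage)
```

A future storage type would only need another `@register(...)` function; `storage_locations` itself stays unchanged.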

Validation Improvements:
- Added format: "at-uri" validation to schemaRef field
- Enhanced metadata field description (clarifies role vs top-level fields)
- Aligned license/tags descriptions with Schema.org (like sampleSchema)
- Increased maxLength for license (100 → 200, to fit full SPDX URLs) and for tag items (50 → 150); raised the tags array limit from 20 to 30
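
As a rough illustration of the new limits, a client-side pre-check could look like the sketch below. It assumes that string `maxLength` is counted in UTF-8 bytes and array `maxLength` in items, which is my reading of the Lexicon specification; `check_limits` and the constant names are hypothetical and not part of this repository.

```python
from typing import Any, Dict, List

# Limits taken from the updated record lexicon in this commit.
LICENSE_MAX_BYTES = 200
TAG_MAX_BYTES = 150
TAGS_MAX_ITEMS = 30


def check_limits(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the limits hold."""
    problems: List[str] = []

    license_value = record.get("license")
    if license_value is not None:
        # Assumption: string maxLength counts UTF-8 bytes, not characters.
        if len(license_value.encode("utf-8")) > LICENSE_MAX_BYTES:
            problems.append("license exceeds 200 bytes")

    tags = record.get("tags", [])
    if len(tags) > TAGS_MAX_ITEMS:
        problems.append("more than 30 tags")
    for index, tag in enumerate(tags):
        if len(tag.encode("utf-8")) > TAG_MAX_BYTES:
            problems.append(f"tag {index} exceeds 150 bytes")

    return problems
```

A full SPDX URL such as `http://spdx.org/licenses/MIT` passes comfortably under the 200-byte limit.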

Documentation:
- Added record_lexicon_assessment.md with design rationale
- Updated examples to use $type discriminators

Examples now use:
- "$type": "ac.foundation.dataset.storageExternal" (was "type": "external")
- "$type": "ac.foundation.dataset.storageBlobs" (was "type": "blobs")
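
For records written against the earlier draft, the rename above amounts to a small mechanical migration. A minimal sketch, assuming storage objects are plain dicts; `migrate_storage` and `LEGACY_TO_TYPE` are hypothetical helpers, not part of this commit:

```python
from typing import Any, Dict

# Mapping from the old inline "type" values to the new lexicon NSIDs.
LEGACY_TO_TYPE = {
    "external": "ac.foundation.dataset.storageExternal",
    "blobs": "ac.foundation.dataset.storageBlobs",
}


def migrate_storage(storage: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a storage object that uses the $type discriminator."""
    if "$type" in storage:
        # Already in the new form; return a shallow copy unchanged.
        return dict(storage)

    legacy = storage.get("type")
    if legacy not in LEGACY_TO_TYPE:
        raise ValueError(f"unrecognized legacy storage type: {legacy!r}")

    # Put $type first, then carry over every field except the old "type".
    migrated: Dict[str, Any] = {"$type": LEGACY_TO_TYPE[legacy]}
    migrated.update({k: v for k, v in storage.items() if k != "type"})
    return migrated
```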

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

+528 -57
.chainlink/issues.db

This is a binary file and will not be displayed.

+468
.planning/decisions/record_lexicon_assessment.md
···
+ # Record Lexicon Assessment
+
+ ## Overview
+
+ Comprehensive assessment of `ac.foundation.dataset.record` Lexicon design against ATProto standards and atdata project requirements.
+
+ **Assessment Date:** 2026-01-07
+ **Lexicon Version:** Initial design
+ **Assessor:** Claude Sonnet 4.5
+
+ ---
+
+ ## Executive Summary
+
+ **Grade: B+** (Good with improvements needed)
+
+ The record Lexicon provides a solid foundation for dataset indexing with hybrid storage support. Key strengths include clean union-based storage design and appropriate use of ATProto primitives. However, several issues need addressing:
+
+ - ⚠️ **Critical**: schemaRef should use format validation
+ - ⚠️ **High**: Metadata structure inconsistency with sampleSchema pattern
+ - ⚠️ **Medium**: Missing $type discriminators in union variants
+ - ✅ **Strength**: Clean storage union design
+ - ✅ **Strength**: Appropriate use of tid keys for datasets
+
+ ---
+
+ ## Detailed Analysis
+
+ ### 1. Key Type Choice ✅ **Appropriate**
+
+ ```json
+ "key": "tid"
+ ```
+
+ **Assessment:** Correct choice for dataset records.
+
+ **Rationale:**
+ - TIDs provide temporal ordering (useful for "recent datasets" queries)
+ - Auto-generated, no collision risk
+ - Appropriate for records without natural semantic keys
+ - Consistent with ATProto patterns for user-generated content
+
+ **Comparison to sampleSchema:**
+ - sampleSchema uses `"key": "any"` for versioned rkeys like `{NSID}@{semver}`
+ - record uses `"key": "tid"` for chronological dataset entries
+ - Both choices are appropriate for their use cases
+
+ ---
+
+ ### 2. Field Validation Issues
+
+ #### Issue 2.1: schemaRef Missing Format Validation ⚠️ **Critical**
+
+ ```json
+ "schemaRef": {
+   "type": "string",
+   "description": "AT-URI reference...",
+   "maxLength": 500
+ }
+ ```
+
+ **Problem:** Should use `"format": "at-uri"` like we did for sampleSchema fields.
+
+ **Fix:**
+ ```json
+ "schemaRef": {
+   "type": "string",
+   "format": "at-uri",
+   "description": "AT-URI reference to the sampleSchema record",
+   "maxLength": 500
+ }
+ ```
+
+ **Impact:** Without format validation, malformed references could be stored.
+
+ ---
+
+ #### Issue 2.2: License Field Inconsistency ⚠️ **Medium**
+
+ sampleSchema metadata:
+ ```json
+ "license": {
+   "type": "string",
+   "description": "... SPDX identifiers recommended ... or full SPDX URLs ...",
+   "maxLength": 200
+ }
+ ```
+
+ record:
+ ```json
+ "license": {
+   "type": "string",
+   "description": "License (SPDX identifier preferred)",
+   "maxLength": 100
+ }
+ ```
+
+ **Problem:** Inconsistent maxLength and less detailed guidance.
+
+ **Recommendation:** Align with sampleSchema:
+ - maxLength: 200 (to support full URLs)
+ - Enhanced description with examples
+ - Reference Schema.org license property
+
+ ---
+
+ #### Issue 2.3: Tags Field Inconsistency ⚠️ **Medium**
+
+ sampleSchema metadata:
+ ```json
+ "tags": {
+   "type": "array",
+   "items": {"type": "string", "maxLength": 150},
+   "maxLength": 30
+ }
+ ```
+
+ record:
+ ```json
+ "tags": {
+   "type": "array",
+   "items": {"type": "string", "maxLength": 50},
+   "maxLength": 20
+ }
+ ```
+
+ **Problem:** Different limits with no clear rationale.
+
+ **Recommendation:** Use consistent limits or document why datasets need different constraints than schemas.
+
+ ---
+
+ ### 3. Metadata Structure ⚠️ **High Priority**
+
+ #### Current Design
+
+ record:
+ ```json
+ "metadata": {
+   "type": "bytes",
+   "description": "Msgpack-encoded metadata dict",
+   "maxLength": 100000
+ },
+ "tags": {...},
+ "license": {...}
+ ```
+
+ sampleSchema:
+ ```json
+ "metadata": {
+   "type": "object",
+   "properties": {
+     "license": {...},
+     "tags": {...}
+   }
+ }
+ ```
+
+ **Problem:** Inconsistent approach between lexicons.
+
+ **Analysis:**
+
+ **Option A: Keep Separate (Current)**
+ - Pros: More discoverable (top-level fields, indexed/searchable)
+ - Pros: Validated by Lexicon
+ - Cons: Duplicates structure with metadata blob
+ - Cons: Inconsistent with sampleSchema pattern
+
+ **Option B: Unified Metadata Object**
+ - Pros: Consistent with sampleSchema
+ - Pros: Single source of truth
+ - Cons: Less discoverable for search
+ - Cons: Can't validate blob contents
+
+ **Recommendation:** Keep current approach but clarify relationship:
+ - Top-level fields: Core, searchable metadata (license, tags, size)
+ - metadata blob: Extended, arbitrary key-value pairs
+ - Update descriptions to explain this pattern
+
+ ---
+
+ ### 4. Storage Union Design ✅ **Excellent**
+
+ ```json
+ "storage": {
+   "type": "union",
+   "refs": ["#storageExternal", "#storageBlobs"]
+ }
+ ```
+
+ **Strengths:**
+ - Clean separation of storage types
+ - Extensible (closed: false by default)
+ - Well-defined variants
+
+ #### Issue 4.1: Missing $type in Union Variants ⚠️ **Critical**
+
+ storageExternal:
+ ```json
+ {
+   "type": "object",
+   "required": ["type", "urls"],
+   "properties": {
+     "type": {"type": "string", "const": "external"}
+   }
+ }
+ ```
+
+ **Problem:** Uses `type` field as discriminator instead of ATProto's `$type`.
+
+ **ATProto Spec:** "Unions require discriminator fields... union variants: Always include `$type`"
+
+ **Fix:**
+ ```json
+ {
+   "type": "object",
+   "required": ["$type", "urls"],
+   "properties": {
+     "$type": {
+       "type": "string",
+       "const": "ac.foundation.dataset.record#storageExternal"
+     }
+   }
+ }
+ ```
+
+ **Impact:** Current design violates ATProto conventions and may cause issues with SDKs.
+
+ ---
+
+ ### 5. Size Information ✅ **Good Design**
+
+ ```json
+ "size": {
+   "type": "ref",
+   "ref": "#datasetSize",
+   "description": "Dataset size information (optional)"
+ }
+ ```
+
+ **Strengths:**
+ - Optional (appropriate, not all datasets track this)
+ - Structured with useful fields (samples, bytes, shards)
+ - Uses ref for reusability
+
+ **Minor Suggestion:** Consider renaming `datasetSize` to `sizeInfo` or `datasetSizeInfo` for clarity.
+
+ ---
+
+ ### 6. Blob Storage Design ⚠️ **Needs Verification**
+
+ ```json
+ "blobs": {
+   "type": "array",
+   "items": {
+     "type": "blob",
+     "description": "Blob reference to a WebDataset tar archive"
+   }
+ }
+ ```
+
+ **Questions:**
+ 1. Does ATProto Lexicon support `"type": "blob"` for array items?
+ 2. Should this be a ref like `"type": "ref", "ref": "#blobRef"`?
+ 3. Are blob mime types validated?
+
+ **Example shows:**
+ ```json
+ {
+   "$type": "blob",
+   "ref": {"$link": "..."},
+   "mimeType": "application/x-tar",
+   "size": 1234567
+ }
+ ```
+
+ **Recommendation:** Verify against ATProto blob specification and potentially add validation constraints (maxSize, accept mimeType patterns).
+
+ ---
+
+ ### 7. Closed Union Consideration 🤔
+
+ ```json
+ "storage": {
+   "type": "union",
+   "refs": ["#storageExternal", "#storageBlobs"]
+ }
+ ```
+
+ **Current:** `closed: false` (default)
+
+ **Question:** Should storage union be closed?
+
+ **Arguments for closed: true:**
+ - Core storage types unlikely to change frequently
+ - Breaking change to add new storage after launch
+ - More predictable for clients
+
+ **Arguments for closed: false (current):**
+ - Future extensibility (e.g., IPFS-native, Filecoin, Arweave)
+ - Consistent with sampleSchema schema union pattern
+ - Graceful degradation for unknown types
+
+ **Recommendation:** Keep open but document in description that external/blobs are the canonical types maintained by foundation.ac.
+
+ ---
+
+ ### 8. Missing Fields from Standard Patterns
+
+ Comparing to Schema.org Dataset and sampleSchema patterns:
+
+ **Consider Adding:**
+
+ 1. **Publisher/Creator** - Who published this dataset?
+    - Could use top-level `creator` field (DID/handle)
+    - Or rely on record author (implicit in AT-URI)
+
+ 2. **Version** - Dataset versioning?
+    - Current approach: New record per version (via tid)
+    - Alternative: Add explicit `version` field like sampleSchema
+    - **Recommendation:** Document that versioning is via new records, reference via AT-URI with tid
+
+ 3. **Citation** - How to cite this dataset?
+    - Optional field for academic datasets
+    - Could go in metadata blob for now
+
+ 4. **Related Datasets** - Links to variants, subsets, etc.
+    - Could be array of AT-URIs
+    - Or handle via separate "collection" Lexicon later
+
+ **Recommendation:** Current fields are sufficient for v1. Document these as future extensions.
+
+ ---
+
+ ### 9. ATProto Compliance Checklist
+
+ | Requirement | Status | Notes |
+ |-------------|--------|-------|
+ | Valid Lexicon version | ✅ | lexicon: 1 |
+ | NSID format | ✅ | ac.foundation.dataset.record |
+ | Key type specified | ✅ | tid (appropriate) |
+ | Required fields present | ✅ | name, schemaRef, storage, createdAt |
+ | Union discriminators | ⚠️ | Missing $type in variants |
+ | Format validators | ⚠️ | Missing at-uri format |
+ | Blob type usage | ⚠️ | Needs verification |
+ | Description fields | ✅ | All fields documented |
+ | maxLength constraints | ✅ | Present on strings |
+ | Datetime format | ✅ | createdAt uses datetime |
+
+ ---
+
+ ### 10. Example Record Validation
+
+ #### External Storage Example ✅
+
+ ```json
+ {
+   "$type": "ac.foundation.dataset.record",
+   "name": "CIFAR-10 Training Set",
+   "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0",
+   "storage": {"type": "external", "urls": ["..."]}
+ }
+ ```
+
+ **Issues:**
+ - schemaRef is well-formed but not validated (missing format check)
+ - storage.type should be $type
+ - Otherwise structurally correct
+
+ #### Blob Storage Example ⚠️
+
+ ```json
+ {
+   "storage": {
+     "type": "blobs",
+     "blobs": [{
+       "$type": "blob",
+       "ref": {"$link": "..."},
+       "mimeType": "application/x-tar"
+     }]
+   }
+ }
+ ```
+
+ **Issues:**
+ - storage.type should be $type
+ - Blob structure needs verification against ATProto spec
+ - mimeType not validated in Lexicon
+
+ ---
+
+ ## Priority Issues Summary
+
+ ### Critical (Must Fix)
+
+ 1. **Add format validation to schemaRef** - Use `"format": "at-uri"`
+ 2. **Fix union discriminators** - Use `$type` instead of `type` in storage variants
+ 3. **Verify blob type usage** - Confirm ATProto compliance
+
+ ### High Priority (Should Fix)
+
+ 4. **Align metadata pattern** - Clarify relationship between top-level fields and metadata blob
+ 5. **Standardize license field** - Match sampleSchema maxLength and description
+ 6. **Standardize tags field** - Use consistent limits or document rationale
+
+ ### Medium Priority (Consider)
+
+ 7. **Add $type requirement to union variants** - Make explicit in required array
+ 8. **Document versioning strategy** - Clarify that new versions = new records
+ 9. **Add blob validation** - Consider maxSize, mimeType constraints
+
+ ### Low Priority (Future)
+
+ 10. **Consider closed union** - Evaluate after Phase 1 usage patterns
+ 11. **Add creator field** - If needed based on user feedback
+ 12. **Collection/relationship fields** - Phase 2 feature
+
+ ---
+
+ ## Consistency Matrix
+
+ Comparison of patterns between sampleSchema and record Lexicons:
+
+ | Pattern | sampleSchema | record | Status |
+ |---------|--------------|--------|--------|
+ | AT-URI format | ✅ Uses format | ❌ Missing | **Fix** |
+ | License field | 200 chars, detailed | 100 chars, basic | **Align** |
+ | Tags limits | 150/30 | 50/20 | **Decide** |
+ | Metadata structure | Structured object | Blob + top-level | **Document** |
+ | Union discriminator | Uses $type | Uses type | **Fix** |
+ | Versioning | Explicit version field | Implicit (tid) | **Different OK** |
+ | Key type | any (semantic) | tid (temporal) | **Both OK** |
+
+ ---
+
+ ## Recommendations
+
+ ### Immediate Actions
+
+ 1. Add `"format": "at-uri"` to schemaRef field
+ 2. Change storage union variants to use `$type` discriminator
+ 3. Verify blob array item type with ATProto specification
+ 4. Align license field with sampleSchema (maxLength: 200, enhanced description)
+ 5. Decide on tags limits (recommend matching sampleSchema: 150/30)
+
+ ### Documentation Improvements
+
+ 6. Add description clarifying metadata blob vs top-level fields relationship
+ 7. Document that dataset versioning is via new records (tids)
+ 8. Add note about storage union extensibility
+ 9. Cross-reference with sampleSchema Lexicon
+
+ ### Consider for Phase 2
+
+ 10. Add creator/publisher field if user feedback indicates need
+ 11. Evaluate closed union after observing extension patterns
+ 12. Consider collection/relationship Lexicon for dataset hierarchies
+
+ ---
+
+ ## Conclusion
+
+ The record Lexicon provides a solid foundation but needs refinement for ATProto compliance and consistency with sampleSchema patterns. The storage union design is excellent, and the use of tids is appropriate. Primary concerns are format validation, union discriminators, and metadata pattern clarity.
+
+ **Estimated effort to address critical issues:** 2-3 hours
+ **Recommended timeline:** Before Phase 1 completion
+
+ After fixes, expected grade: **A-** (Excellent and production-ready)
+1 -1
.planning/examples/dataset_blob_storage.json
···
    "name": "Small Sample Dataset",
    "schemaRef": "at://did:plc:def456/ac.foundation.dataset.sampleSchema/textsample@2.1.0",
    "storage": {
-     "type": "blobs",
+     "$type": "ac.foundation.dataset.storageBlobs",
      "blobs": [
        {
          "$type": "blob",
+1 -1
.planning/examples/dataset_external_storage.json
···
    "name": "CIFAR-10 Training Set",
    "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0",
    "storage": {
-     "type": "external",
+     "$type": "ac.foundation.dataset.storageExternal",
      "urls": [
        "s3://my-bucket/cifar10-train-{000000..000049}.tar"
      ]
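
The brace notation in the `urls` entry of this example stands for a run of shard files. A minimal sketch of how a reader might expand it, assuming bash-style inclusive numeric ranges that keep the zero padding of the start value; `expand_shards` is a hypothetical helper and not the expansion routine of any particular WebDataset library:

```python
import re
from typing import List

# Matches a numeric range such as {000000..000049}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(url: str) -> List[str]:
    """Expand every {start..end} range in a URL into concrete shard URLs."""
    match = _RANGE.search(url)
    if match is None:
        return [url]

    start_text, end_text = match.group(1), match.group(2)
    width = len(start_text)  # preserve the zero padding of the start value

    expanded: List[str] = []
    for index in range(int(start_text), int(end_text) + 1):
        candidate = url[:match.start()] + str(index).zfill(width) + url[match.end():]
        # Recurse so that URLs containing several ranges are fully expanded.
        expanded.extend(expand_shards(candidate))
    return expanded
```

Applied to the URL in this example, the range `{000000..000049}` expands to 50 shard URLs, from `cifar10-train-000000.tar` through `cifar10-train-000049.tar`.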
+9 -55
.planning/lexicons/ac.foundation.dataset.record.json
···
            },
            "schemaRef": {
              "type": "string",
+             "format": "at-uri",
              "description": "AT-URI reference to the sampleSchema record for this dataset's samples",
              "maxLength": 500
            },
···
              "type": "union",
              "description": "Storage location for dataset files (WebDataset tar archives)",
              "refs": [
-               "#storageExternal",
-               "#storageBlobs"
+               "ac.foundation.dataset.storageExternal",
+               "ac.foundation.dataset.storageBlobs"
              ]
            },
            "description": {
···
            },
            "metadata": {
              "type": "bytes",
-             "description": "Msgpack-encoded metadata dict (arbitrary key-value pairs)",
+             "description": "Msgpack-encoded metadata dict for arbitrary extended key-value pairs. Use this for additional metadata beyond the core top-level fields (license, tags, size). Top-level fields are preferred for discoverable/searchable metadata.",
              "maxLength": 100000
            },
            "tags": {
              "type": "array",
-             "description": "Searchable tags for dataset discovery",
+             "description": "Searchable tags for dataset discovery. Aligns with Schema.org keywords property.",
              "items": {
                "type": "string",
-               "maxLength": 50
+               "maxLength": 150
              },
-             "maxLength": 20
+             "maxLength": 30
            },
            "size": {
              "type": "ref",
···
            },
            "license": {
              "type": "string",
-             "description": "License (SPDX identifier preferred)",
-             "maxLength": 100
+             "description": "License identifier or URL. SPDX identifiers recommended (e.g., MIT, Apache-2.0, CC-BY-4.0) or full SPDX URLs (e.g., http://spdx.org/licenses/MIT). Aligns with Schema.org license property.",
+             "maxLength": 200
            },
            "createdAt": {
              "type": "string",
              "format": "datetime",
              "description": "Timestamp when this dataset record was created"
            }
-         }
-       }
-     },
-     "storageExternal": {
-       "type": "object",
-       "description": "External storage via URLs (S3, HTTP, IPFS, etc.)",
-       "required": [
-         "type",
-         "urls"
-       ],
-       "properties": {
-         "type": {
-           "type": "string",
-           "const": "external"
-         },
-         "urls": {
-           "type": "array",
-           "description": "WebDataset URLs (supports brace notation like 'data-{000000..000099}.tar')",
-           "items": {
-             "type": "string",
-             "format": "uri",
-             "maxLength": 1000
-           },
-           "minLength": 1
-         }
-       }
-     },
-     "storageBlobs": {
-       "type": "object",
-       "description": "Storage via ATProto PDS blobs",
-       "required": [
-         "type",
-         "blobs"
-       ],
-       "properties": {
-         "type": {
-           "type": "string",
-           "const": "blobs"
-         },
-         "blobs": {
-           "type": "array",
-           "description": "Array of blob references for WebDataset tar files",
-           "items": {
-             "type": "blob",
-             "description": "Blob reference to a WebDataset tar archive"
-           },
-           "minLength": 1
          }
        }
      },
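
To make the shape of a `schemaRef` value concrete, the sketch below splits one into its parts, assuming the `{name}@{semver}` record-key convention used in the example records. It performs a structural split only and does not validate the full AT-URI grammar that `"format": "at-uri"` implies; `SchemaRef` and `split_schema_ref` are hypothetical names.

```python
from typing import NamedTuple


class SchemaRef(NamedTuple):
    authority: str   # DID or handle of the repository owner
    collection: str  # NSID of the record collection
    name: str        # schema name portion of the record key
    version: str     # version portion of the record key


def split_schema_ref(uri: str) -> SchemaRef:
    """Split an at:// reference of the form authority/collection/name@version."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError("schemaRef must start with at://")

    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected authority/collection/rkey")
    authority, collection, rkey = parts

    # Assumed convention from the examples: rkey is '<name>@<version>'.
    name, separator, version = rkey.partition("@")
    if not separator or not name or not version:
        raise ValueError("record key is not of the form name@version")

    return SchemaRef(authority, collection, name, version)
```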
+24
.planning/lexicons/ac.foundation.dataset.storageBlobs.json
···
+ {
+   "lexicon": 1,
+   "id": "ac.foundation.dataset.storageBlobs",
+   "defs": {
+     "main": {
+       "type": "object",
+       "description": "Storage via ATProto PDS blobs for WebDataset tar archives. Each blob contains one or more tar files. Used in ac.foundation.dataset.record storage union for maximum decentralization.",
+       "required": [
+         "blobs"
+       ],
+       "properties": {
+         "blobs": {
+           "type": "array",
+           "description": "Array of blob references for WebDataset tar files",
+           "items": {
+             "type": "blob",
+             "description": "Blob reference to a WebDataset tar archive"
+           },
+           "minLength": 1
+         }
+       }
+     }
+   }
+ }
+25
.planning/lexicons/ac.foundation.dataset.storageExternal.json
···
+ {
+   "lexicon": 1,
+   "id": "ac.foundation.dataset.storageExternal",
+   "defs": {
+     "main": {
+       "type": "object",
+       "description": "External storage via URLs (S3, HTTP, IPFS, etc.) for WebDataset tar archives. URLs support brace notation for sharding (e.g., 'data-{000000..000099}.tar'). Used in ac.foundation.dataset.record storage union.",
+       "required": [
+         "urls"
+       ],
+       "properties": {
+         "urls": {
+           "type": "array",
+           "description": "WebDataset URLs with optional brace notation for sharded tar files",
+           "items": {
+             "type": "string",
+             "format": "uri",
+             "maxLength": 1000
+           },
+           "minLength": 1
+         }
+       }
+     }
+   }
+ }