**/*.env
# Don't commit `uv` lockfiles
**/uv.lock
+# Development tooling (keep local, not in upstream)
+.chainlink/
+.claude/

##
+204
.planning/01_overview.md
···11+# ATProto Integration - Overview
22+33+## Vision
44+55+Transform `atdata` from a local/centralized dataset library into a **distributed dataset federation** built on AT Protocol. Datasets, schemas, and transformations become discoverable, versioned records on the ATProto network, enabling:
66+77+- **Decentralized dataset publishing**: Anyone can publish datasets without centralized infrastructure
88+- **Schema sharing & reuse**: Sample type definitions become reusable records with automatic code generation
99+- **Discoverable transformations**: Lens transformations are published as bidirectional mappings between schemas
1010+- **Interoperability**: Different tools and languages can consume the same datasets using generated code
1111+- **Versioning & provenance**: Immutable records provide audit trails for dataset evolution
1212+1313+## High-Level Architecture
1414+1515+```
1616+┌─────────────────────────────────────────────────────────────────┐
1717+│ AT Protocol Network │
1818+│ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │
1919+│ │ Schema Records │ │ Dataset Records │ │ Lens Records │ │
2020+│ │ (Lexicon) │ │ (Lexicon) │ │ (Lexicon) │ │
2121+│ └──────────────────┘ └──────────────────┘ └───────────────┘ │
2222+│ ▲ ▲ ▲ │
2323+│ │ │ │ │
2424+└─────────┼──────────────────────┼─────────────────────┼──────────┘
2525+ │ │ │
2626+ │ publish/query │ │
2727+ │ │ │
2828+ ┌─────┴──────────────────────┴─────────────────────┴─────┐
2929+ │ Python Client Library (atdata) │
3030+ │ │
3131+ │ ┌────────────┐ ┌────────────┐ ┌──────────────────┐ │
3232+ │ │ ATProto │ │ Schema │ │ Dataset │ │
3333+ │ │ Auth │ │ Publisher │ │ Loader │ │
3434+ │ └────────────┘ └────────────┘ └──────────────────┘ │
3535+ │ │
3636+ │ Existing: │
3737+ │ - PackableSample, Dataset, Lens │
3838+ │ - WebDataset integration │
3939+ └──────────────────────────────────────────────────────────┘
4040+ │
4141+ │ queries (optional)
4242+ ▼
4343+ ┌─────────────────────┐
4444+ │ AppView Service │
4545+ │ (Index Aggregator) │
4646+ │ │
4747+ │ - Fast search │
4848+ │ - Schema browser │
4949+ │ - Metadata cache │
5050+ └─────────────────────┘
5151+```
5252+5353+## Core Concepts
5454+5555+### 1. Schema Records (PackableSample definitions)
5656+5757+Published ATProto records containing:
5858+- Field names and types (with special handling for NDArray)
5959+- Serialization metadata
6060+- Version information
6161+- Author/provenance
6262+6363+These become the **source of truth** for sample types across the network.
6464+6565+### 2. Dataset Index Records
6666+6767+Published ATProto records containing:
6868+- Reference to schema record (the sample type)
6969+- WebDataset URL(s) using brace notation (e.g., `s3://bucket/data-{000000..000099}.tar`)
7070+- Msgpack-encoded metadata (arbitrary key-value pairs)
7171+- Dataset description, tags, author
+
+Users discover datasets by querying these records, then load them with the existing `Dataset` class.
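The brace notation in the URL field can be illustrated with a small expansion routine. This is only a sketch for numeric `{start..end}` ranges: WebDataset performs its own brace expansion when it opens shards, and the helper name `expand_braces` is hypothetical rather than part of `atdata`.

```python
import re
from itertools import product

# Matches a zero-padded numeric range such as {000000..000099}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_braces(url: str) -> list[str]:
    """Expand every {start..end} numeric range in url, keeping zero padding."""
    # With two capture groups, split() yields:
    # [literal, start, end, literal, start, end, ..., literal]
    parts = _RANGE.split(url)
    literals = parts[0::3]
    ranges = zip(parts[1::3], parts[2::3])

    choices = []
    for start, end in ranges:
        width = len(start)  # padding width taken from the start token
        choices.append([f"{i:0{width}d}" for i in range(int(start), int(end) + 1)])

    urls = []
    for combo in product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls


shards = expand_braces("s3://bucket/data-{000000..000099}.tar")
print(len(shards), shards[0], shards[-1])
```

A URL without any range is returned unchanged as a single-element list, so the same code path handles single-file datasets.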
7474+7575+### 3. Lens Transformation Records
7676+7777+Published ATProto records containing:
7878+- Source schema reference
7979+- Target schema reference
8080+- Transformation code (or reference to code)
8181+- Bidirectional mapping metadata (getter/putter)
8282+8383+Enables building a **network of transformations** between schemas.
8484+8585+## Integration with Existing `atdata`
8686+8787+The ATProto integration is **additive**:
8888+8989+1. **Existing functionality unchanged**: `PackableSample`, `Dataset`, `Lens` continue to work as-is
9090+2. **New methods added**:
9191+ - `sample_type.publish_to_atproto(client)` - Publish schema
9292+ - `dataset.publish_to_atproto(client)` - Publish index record
9393+ - `Dataset.from_atproto(client, record_uri)` - Load from published record
9494+ - `lens.publish_to_atproto(client)` - Publish transformation
9595+3. **Optional AppView**: Query service for faster discovery (like Bluesky's AppView)
9696+9797+## Development Phases
9898+9999+### Phase 1: Lexicon Design (Issues #17, #22-25)
100100+- Design three Lexicon schemas (sample, dataset, lens)
101101+- Evaluate schema representation formats
102102+- Create reference documentation
103103+104104+**Deliverable**: Lexicon JSON definitions ready for use
105105+106106+### Phase 2: Python Client Library (Issues #18, #26-31)
107107+- ATProto SDK integration (auth, session management)
108108+- Publishing implementations for all three record types
109109+- Query/discovery functionality
110110+- Extend `Dataset` class with `from_atproto()` method
111111+112112+**Deliverable**: Working Python library that can publish/load from ATProto
113113+114114+### Phase 3: AppView Service (Issues #19, #32-35)
115115+- Optional aggregation service
116116+- Firehose ingestion
117117+- Search/query API
118118+- Performance optimization
119119+120120+**Deliverable**: Hosted service for fast dataset discovery
121121+122122+### Phase 4: Code Generation (Issues #20, #36-39)
123123+- Template system for Python codegen
124124+- CLI tool for generating classes from schema records
125125+- Type validation and compatibility checking
126126+127127+**Deliverable**: Tool to generate Python code from published schemas
128128+129129+### Phase 5: Integration & Testing (Issues #21, #40-43)
130130+- End-to-end workflows and examples
131131+- Integration test suite
132132+- Documentation and guides
133133+- Performance benchmarks
134134+135135+**Deliverable**: Production-ready feature with complete documentation
136136+137137+## Open Design Questions
138138+139139+### Schema Representation Format
140140+**Question**: How should we represent `PackableSample` schemas in Lexicon records?
141141+142142+**Options**:
143143+1. **JSON Schema** - Standard, well-supported, validation tools exist
144144+2. **Protobuf** - Compact, has codegen ecosystem, good for cross-language
145145+3. **Custom format** - Tailored to `PackableSample` specifics (NDArray handling, msgpack serialization)
146146+147147+**Considerations**:
148148+- Need to represent `NDArray` types specially (dtype, shape constraints?)
149149+- Should support future extensions (constraints, validation rules)
150150+- Must be human-readable and machine-processable
151151+- Codegen tooling needs to parse it
152152+153153+**Decision needed**: See Issue #25
154154+155155+### WebDataset Storage Location
156156+**Question**: Should actual WebDataset `.tar` files be stored on ATProto, or just references to external storage?
157157+158158+**Current approach**: References only (S3, HTTP URLs, etc.)
159159+- Pros: No storage limits, existing infrastructure works
160160+- Cons: Centralization risk if datasets disappear
161161+162162+**Future consideration**: ATProto blob storage for datasets
163163+- Pros: Truly decentralized
164164+- Cons: Storage costs, size limits, performance
165165+166166+### Lens Code Storage
167167+**Question**: How should Lens transformation code be stored?
168168+169169+**Options**:
170170+1. Python code as string in record (security concerns!)
171171+2. Reference to GitHub/GitLab repo + commit hash
172172+3. Bytecode or AST representation
173173+4. Only store metadata, expect manual implementation
174174+175175+**Decision needed**: See Phase 1 planning
176176+177177+## Success Metrics
178178+179179+- **Functionality**: Can publish schema, publish dataset, discover, load end-to-end
180180+- **Performance**: Dataset discovery <100ms (with AppView), load time unchanged
181181+- **Adoption**: Easy enough that external users publish datasets
182182+- **Interop**: Schema records usable from other languages (future)
183183+184184+## Timeline & Dependencies
185185+186186+```
187187+Phase 1 (Lexicon Design)
188188+ ↓
189189+Phase 2 (Python Client) ← CRITICAL PATH
190190+ ↓
191191+ ├── Phase 3 (AppView) [parallel, optional]
192192+ └── Phase 4 (Codegen) [parallel]
193193+ ↓
194194+Phase 5 (Integration & Testing)
195195+```
196196+197197+Phase 2 is the critical path. Phases 3 & 4 can proceed in parallel once Phase 2 foundations are in place.
198198+199199+## Related Documents
200200+201201+- `02_lexicon_design.md` - Detailed Lexicon schema specifications
202202+- `03_python_client.md` - Python library architecture and API design
203203+- `04_appview.md` - AppView service architecture
204204+- `05_codegen.md` - Code generation approach and templates
+576
.planning/02_lexicon_design.md
···11+# Lexicon Design for ATProto Integration
22+33+## Overview
44+55+This document specifies the three Lexicon schemas needed for `atdata` ATProto integration:
66+77+1. **Schema Record** (`app.bsky.atdata.schema`) - Defines PackableSample types
88+2. **Dataset Record** (`app.bsky.atdata.dataset`) - Index records pointing to WebDataset files
99+3. **Lens Record** (`app.bsky.atdata.lens`) - Transformation mappings between schemas
1010+1111+## Design Principles
1212+1313+- **Self-describing**: Records contain all necessary metadata
1414+- **Versioned**: Schema evolution supported through versioning
1515+- **Lightweight**: Minimal overhead, fast to parse
1616+- **Extensible**: Future additions don't break existing records
1717+- **Language-agnostic**: Usable from Python, TypeScript, Rust, etc.
1818+1919+## 1. Schema Record Lexicon
+
+**NSID**: `app.bsky.atdata.schema` (tentative; the `app.bsky.*` namespace belongs to Bluesky, so the final NSID should live under a domain the project controls)
2222+2323+**Purpose**: Define a reusable PackableSample type that can be instantiated via codegen
2424+2525+### Proposed Structure
2626+2727+```json
2828+{
2929+ "lexicon": 1,
3030+ "id": "app.bsky.atdata.schema",
3131+ "defs": {
3232+ "main": {
3333+ "type": "record",
3434+ "description": "Definition of a PackableSample-compatible sample type",
3535+ "key": "tid",
3636+ "record": {
3737+ "type": "object",
3838+ "required": ["name", "version", "fields", "createdAt"],
3939+ "properties": {
4040+ "name": {
4141+ "type": "string",
4242+ "description": "Human-readable name for this sample type",
4343+ "maxLength": 100
4444+ },
4545+ "version": {
4646+ "type": "string",
4747+ "description": "Semantic version (e.g., '1.0.0')",
4848+ "maxLength": 20
4949+ },
5050+ "description": {
5151+ "type": "string",
5252+ "description": "Human-readable description",
5353+ "maxLength": 1000
5454+ },
5555+ "fields": {
5656+ "type": "array",
5757+ "description": "List of fields in this sample type",
5858+ "items": {
5959+ "type": "ref",
6060+ "ref": "#field"
6161+ }
6262+ },
6363+ "metadata": {
6464+ "type": "object",
6565+ "description": "Arbitrary metadata (author, license, etc.)"
6666+ },
6767+ "createdAt": {
6868+ "type": "string",
6969+ "format": "datetime"
7070+ }
7171+ }
7272+ }
7373+ },
7474+ "field": {
7575+ "type": "object",
7676+ "description": "A field within a sample type",
7777+ "required": ["name", "type"],
7878+ "properties": {
7979+ "name": {
8080+ "type": "string",
8181+ "description": "Field name (Python identifier)",
8282+ "maxLength": 100
8383+ },
8484+ "type": {
8585+ "type": "ref",
8686+ "ref": "#fieldType"
8787+ },
8888+ "optional": {
8989+ "type": "boolean",
9090+ "description": "Whether field can be None",
9191+ "default": false
9292+ },
9393+ "description": {
9494+ "type": "string",
9595+ "description": "Field documentation",
9696+ "maxLength": 500
9797+ }
9898+ }
9999+ },
100100+ "fieldType": {
101101+ "type": "union",
102102+ "refs": [
103103+ "#primitiveType",
104104+ "#arrayType",
105105+ "#nestedType"
106106+ ]
107107+ },
108108+ "primitiveType": {
109109+ "type": "object",
110110+ "required": ["kind", "primitive"],
111111+ "properties": {
112112+ "kind": {
113113+ "type": "string",
114114+ "const": "primitive"
115115+ },
116116+ "primitive": {
117117+ "type": "string",
118118+ "enum": ["str", "int", "float", "bool", "bytes"]
119119+ }
120120+ }
121121+ },
122122+ "arrayType": {
123123+ "type": "object",
124124+ "required": ["kind", "dtype"],
125125+ "properties": {
126126+ "kind": {
127127+ "type": "string",
128128+ "const": "ndarray"
129129+ },
130130+ "dtype": {
131131+ "type": "string",
132132+ "description": "Numpy dtype string (e.g., 'float32', 'uint8')",
133133+ "maxLength": 20
134134+ },
135135+ "shape": {
136136+ "type": "array",
137137+ "description": "Optional shape constraint (null for dynamic dimensions)",
138138+ "items": {
139139+ "type": "integer"
140140+ }
141141+ }
142142+ }
143143+ },
144144+ "nestedType": {
145145+ "type": "object",
146146+ "required": ["kind", "schemaRef"],
147147+ "properties": {
148148+ "kind": {
149149+ "type": "string",
150150+ "const": "nested"
151151+ },
152152+ "schemaRef": {
153153+ "type": "string",
154154+ "description": "AT-URI reference to another schema record"
155155+ }
156156+ }
157157+ }
158158+ }
159159+}
160160+```
161161+162162+### Example Schema Record
163163+164164+```json
165165+{
166166+ "$type": "app.bsky.atdata.schema",
167167+ "name": "ImageSample",
168168+ "version": "1.0.0",
169169+ "description": "Sample containing an image with label",
170170+ "fields": [
171171+ {
172172+ "name": "image",
173173+ "type": {
174174+ "kind": "ndarray",
175175+ "dtype": "uint8",
176176+ "shape": [null, null, 3]
177177+ },
178178+ "description": "RGB image with variable height/width"
179179+ },
180180+ {
181181+ "name": "label",
182182+ "type": {
183183+ "kind": "primitive",
184184+ "primitive": "str"
185185+ },
186186+ "description": "Human-readable label"
187187+ },
188188+ {
189189+ "name": "confidence",
190190+ "type": {
191191+ "kind": "primitive",
192192+ "primitive": "float"
193193+ },
194194+ "optional": true,
195195+ "description": "Optional confidence score"
196196+ }
197197+ ],
198198+ "metadata": {
199199+ "author": "alice.bsky.social",
200200+ "license": "MIT"
201201+ },
202202+ "createdAt": "2025-01-06T12:00:00Z"
203203+}
204204+```
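To make the codegen goal concrete, here is a rough sketch of how a record shaped like the example above might be rendered into Python dataclass source. The function names (`annotation_for`, `render_dataclass`) and the choice to emit a bare `NDArray` annotation are assumptions for illustration, not the planned Phase 4 design; nested types are left unsupported.

```python
from typing import Any

# Lexicon primitive names map one-to-one onto Python builtins.
_PRIMITIVES = {"str", "int", "float", "bool", "bytes"}


def annotation_for(field_type: dict[str, Any]) -> str:
    """Return the Python annotation (as source text) for a fieldType object."""
    kind = field_type["kind"]
    if kind == "primitive" and field_type["primitive"] in _PRIMITIVES:
        return field_type["primitive"]
    if kind == "ndarray":
        return "NDArray"  # dtype/shape are dropped in this sketch
    raise ValueError(f"Unsupported field type: {field_type!r}")


def render_dataclass(record: dict[str, Any]) -> str:
    """Render dataclass source for a schema record; names must be identifiers."""
    names = [record["name"]] + [f["name"] for f in record["fields"]]
    for name in names:
        if not name.isidentifier():
            raise ValueError(f"Not a valid Python identifier: {name!r}")

    lines = ["@dataclass", f"class {record['name']}:"]
    # Fields without defaults must precede fields with defaults.
    required = [f for f in record["fields"] if not f.get("optional", False)]
    optional = [f for f in record["fields"] if f.get("optional", False)]
    for f in required:
        lines.append(f"    {f['name']}: {annotation_for(f['type'])}")
    for f in optional:
        lines.append(f"    {f['name']}: Optional[{annotation_for(f['type'])}] = None")
    if not record["fields"]:
        lines.append("    pass")
    return "\n".join(lines) + "\n"


example = {
    "name": "ImageSample",
    "fields": [
        {"name": "image", "type": {"kind": "ndarray", "dtype": "uint8"}},
        {"name": "label", "type": {"kind": "primitive", "primitive": "str"}},
        {"name": "confidence", "optional": True,
         "type": {"kind": "primitive", "primitive": "float"}},
    ],
}
print(render_dataclass(example))
```

The identifier check matters because record contents come from the network; generated source should never interpolate unvalidated strings.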
205205+206206+### Design Questions
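+
+1. **Shape constraints**: Should we enforce shape constraints, or just document them?
+   - Option A: Runtime validation against shape
+   - Option B: Documentation only, actual shapes can vary
+   - **Recommendation**: Documentation only initially, validation in future versions
+   - Note: the example record uses `null` for dynamic dimensions, but `shape` is declared as an array of `integer`, which does not admit `null`. The definition needs a sentinel value (e.g., `-1`) or a different item type before the Lexicon will validate.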
212212+213213+2. **Custom types**: Should we support custom serialization hooks?
214214+ - Current approach: Only primitive + NDArray
215215+ - Future: Allow references to custom serialization functions?
216216+217217+3. **Schema inheritance**: Should schemas support inheritance/composition?
218218+ - Could reference parent schema and add fields
219219+ - **Defer to future version**
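If runtime shape validation (Option A above) is adopted later, the check itself is small. The sketch below assumes a constraint is a sequence in which `None` marks a dynamic dimension, mirroring the `[null, null, 3]` example; the function name `shape_matches` is hypothetical.

```python
from typing import Optional, Sequence


def shape_matches(
    actual: Sequence[int],
    constraint: Optional[Sequence[Optional[int]]],
) -> bool:
    """Return True if actual satisfies constraint; None entries match any size."""
    if constraint is None:
        return True  # no constraint recorded for this field
    if len(actual) != len(constraint):
        return False  # rank mismatch
    return all(want is None or want == got for got, want in zip(actual, constraint))


print(shape_matches((32, 32, 3), [None, None, 3]))
print(shape_matches((32, 32, 1), [None, None, 3]))
```

With NumPy arrays the first argument would simply be `array.shape`.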
220220+221221+## 2. Dataset Record Lexicon
222222+223223+**NSID**: `app.bsky.atdata.dataset`
224224+225225+**Purpose**: Index record pointing to WebDataset files with associated metadata
226226+227227+### Proposed Structure
228228+229229+```json
230230+{
231231+ "lexicon": 1,
232232+ "id": "app.bsky.atdata.dataset",
233233+ "defs": {
234234+ "main": {
235235+ "type": "record",
236236+ "description": "Index record for a WebDataset-backed dataset",
237237+ "key": "tid",
238238+ "record": {
239239+ "type": "object",
240240+ "required": ["name", "schemaRef", "urls", "createdAt"],
241241+ "properties": {
242242+ "name": {
243243+ "type": "string",
244244+ "description": "Human-readable dataset name",
245245+ "maxLength": 200
246246+ },
247247+ "schemaRef": {
248248+ "type": "string",
249249+ "description": "AT-URI reference to the schema record for this dataset's samples"
250250+ },
251251+ "urls": {
252252+ "type": "array",
253253+ "description": "WebDataset URLs (supports brace notation)",
254254+ "items": {
255255+ "type": "string",
256256+ "format": "uri",
257257+ "maxLength": 1000
258258+ },
259259+ "minLength": 1
260260+ },
261261+ "description": {
262262+ "type": "string",
263263+ "description": "Human-readable description",
264264+ "maxLength": 5000
265265+ },
266266+ "metadata": {
267267+ "type": "bytes",
268268+ "description": "Msgpack-encoded metadata dict",
269269+ "maxLength": 100000
270270+ },
271271+ "tags": {
272272+ "type": "array",
273273+ "description": "Searchable tags",
274274+ "items": {
275275+ "type": "string",
276276+ "maxLength": 50
277277+ },
278278+ "maxLength": 20
279279+ },
280280+ "size": {
281281+ "type": "object",
282282+ "description": "Dataset size information",
283283+ "properties": {
284284+ "samples": {
285285+ "type": "integer",
286286+ "description": "Total number of samples"
287287+ },
288288+ "bytes": {
289289+ "type": "integer",
290290+ "description": "Total size in bytes"
291291+ }
292292+ }
293293+ },
294294+ "license": {
295295+ "type": "string",
296296+ "description": "License (SPDX identifier preferred)",
297297+ "maxLength": 100
298298+ },
299299+ "createdAt": {
300300+ "type": "string",
301301+ "format": "datetime"
302302+ }
303303+ }
304304+ }
305305+ }
306306+ }
307307+}
308308+```
309309+310310+### Example Dataset Record
311311+312312+```json
313313+{
314314+ "$type": "app.bsky.atdata.dataset",
315315+ "name": "CIFAR-10 Training Set",
316316+ "schemaRef": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm",
317317+ "urls": [
318318+ "s3://my-bucket/cifar10-train-{000000..000049}.tar"
319319+ ],
320320+ "description": "CIFAR-10 training images (50,000 samples) stored as WebDataset shards",
321321+ "metadata": "<msgpack bytes>",
322322+ "tags": ["computer-vision", "classification", "cifar10"],
323323+ "size": {
324324+ "samples": 50000,
325325+ "bytes": 178456789
326326+ },
327327+ "license": "MIT",
328328+ "createdAt": "2025-01-06T12:00:00Z"
329329+}
330330+```
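The `schemaRef` value above is an AT-URI of the form `at://<authority>/<collection>/<rkey>`. A minimal parser, useful for checking that a reference points at the expected collection before fetching it, might look like the following; the `AtUri` tuple and `parse_at_uri` helper are illustrative names, and the sketch deliberately handles only the three-segment record form.

```python
from typing import NamedTuple


class AtUri(NamedTuple):
    authority: str   # DID or handle of the repository owner
    collection: str  # NSID of the record collection
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> AtUri:
    """Split a record-level AT-URI into authority, collection, and rkey."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected at://authority/collection/rkey, got {uri!r}")
    return AtUri(*parts)


ref = parse_at_uri("at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm")
print(ref.collection)
```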
331331+332332+### Design Questions
333333+334334+1. **WebDataset storage**: Where are the actual `.tar` files?
335335+ - Phase 1: External storage (S3, HTTP, etc.) - just store URLs
336336+ - Future: Could use ATProto blob storage for smaller datasets
337337+ - **Recommendation**: External only for now
338338+339339+2. **Metadata size limit**: What's reasonable for msgpack metadata?
340340+ - Could store large metadata as separate blob
341341+ - **Recommendation**: 100KB limit, use blob for larger
342342+343343+3. **Versioning**: Should we support dataset versioning?
344344+ - Could link to previous version
345345+ - **Defer to future version**
346346+347347+## 3. Lens Record Lexicon
348348+349349+**NSID**: `app.bsky.atdata.lens`
350350+351351+**Purpose**: Define bidirectional transformations between sample types
352352+353353+### Proposed Structure
354354+355355+```json
356356+{
357357+ "lexicon": 1,
358358+ "id": "app.bsky.atdata.lens",
359359+ "defs": {
360360+ "main": {
361361+ "type": "record",
362362+ "description": "Bidirectional transformation between two sample types",
363363+ "key": "tid",
364364+ "record": {
365365+ "type": "object",
366366+ "required": ["name", "sourceSchema", "targetSchema", "createdAt"],
367367+ "properties": {
368368+ "name": {
369369+ "type": "string",
370370+ "description": "Human-readable lens name",
371371+ "maxLength": 100
372372+ },
373373+ "sourceSchema": {
374374+ "type": "string",
375375+ "description": "AT-URI reference to source schema"
376376+ },
377377+ "targetSchema": {
378378+ "type": "string",
379379+ "description": "AT-URI reference to target schema"
380380+ },
381381+ "description": {
382382+ "type": "string",
383383+ "description": "What this transformation does",
384384+ "maxLength": 1000
385385+ },
386386+ "getterCode": {
387387+ "type": "ref",
388388+ "ref": "#transformCode"
389389+ },
390390+ "putterCode": {
391391+ "type": "ref",
392392+ "ref": "#transformCode"
393393+ },
394394+ "metadata": {
395395+ "type": "object",
396396+ "description": "Arbitrary metadata"
397397+ },
398398+ "createdAt": {
399399+ "type": "string",
400400+ "format": "datetime"
401401+ }
402402+ }
403403+ }
404404+ },
405405+ "transformCode": {
406406+ "type": "union",
407407+ "refs": [
408408+ "#pythonCode",
409409+ "#codeReference"
410410+ ]
411411+ },
412412+ "pythonCode": {
413413+ "type": "object",
414414+ "required": ["kind", "source"],
415415+ "properties": {
416416+ "kind": {
417417+ "type": "string",
418418+ "const": "python"
419419+ },
420420+ "source": {
421421+ "type": "string",
422422+ "description": "Python function source code",
423423+ "maxLength": 50000
424424+ }
425425+ }
426426+ },
427427+ "codeReference": {
428428+ "type": "object",
429429+ "required": ["kind", "repository", "path"],
430430+ "properties": {
431431+ "kind": {
432432+ "type": "string",
433433+ "const": "reference"
434434+ },
435435+ "repository": {
436436+ "type": "string",
437437+ "description": "Git repository URL",
438438+ "maxLength": 500
439439+ },
440440+ "commit": {
441441+ "type": "string",
442442+ "description": "Git commit hash",
443443+ "maxLength": 40
444444+ },
445445+ "path": {
446446+ "type": "string",
447447+ "description": "Path to function within repo",
448448+ "maxLength": 500
449449+ }
450450+ }
451451+ }
452452+ }
453453+}
454454+```
455455+456456+### Example Lens Record
457457+458458+```json
459459+{
460460+ "$type": "app.bsky.atdata.lens",
461461+ "name": "image_to_grayscale",
462462+ "sourceSchema": "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm",
463463+ "targetSchema": "at://did:plc:def456/app.bsky.atdata.schema/7mn8op56pqr",
464464+ "description": "Convert RGB images to grayscale",
465465+ "getterCode": {
466466+ "kind": "reference",
467467+ "repository": "https://github.com/alice/lenses",
468468+ "commit": "a1b2c3d4e5f6",
469469+ "path": "lenses/vision.py:image_to_grayscale"
470470+ },
471471+ "putterCode": {
472472+ "kind": "reference",
473473+ "repository": "https://github.com/alice/lenses",
474474+ "commit": "a1b2c3d4e5f6",
475475+ "path": "lenses/vision.py:grayscale_to_image"
476476+ },
477477+ "metadata": {
478478+ "author": "alice.bsky.social"
479479+ },
480480+ "createdAt": "2025-01-06T12:00:00Z"
481481+}
482482+```
483483+484484+### Design Questions - CRITICAL
485485+486486+1. **Code storage security**: Storing executable code is dangerous!
487487+ - **Option A**: Code reference only (GitHub + commit hash) - safer
488488+ - **Option B**: Allow inline code but require manual approval - flexible
489489+ - **Option C**: AST/bytecode representation - complex
490490+ - **Recommendation**: Start with references only (Option A), defer inline code
491491+492492+2. **Lens verification**: How to verify well-behavedness?
493493+ - Could store test cases
494494+ - Could require proof of GetPut/PutGet laws
495495+ - **Defer to future**
496496+497497+3. **Lens composition**: Should lenses be composable?
498498+ - Network could auto-compose transformations
499499+ - **Defer to future**
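The GetPut and PutGet laws mentioned above can be spot-checked against stored test cases. The sketch below assumes a getter `get(source)` and a putter `put(view, source)`; that argument order is an assumption and may differ from the existing `Lens` class in `atdata`. Passing such a check is evidence, not proof, of well-behavedness.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")
V = TypeVar("V")


def check_lens_laws(
    get: Callable[[S], V],
    put: Callable[[V, S], S],
    sources: Iterable[S],
    views: Iterable[V],
) -> bool:
    """Check GetPut and PutGet on the given sample sources and views."""
    sources = list(sources)
    views = list(views)
    # GetPut: writing back what was just read leaves the source unchanged.
    get_put = all(put(get(s), s) == s for s in sources)
    # PutGet: reading after a write returns exactly what was written.
    put_get = all(get(put(v, s)) == v for s in sources for v in views)
    return get_put and put_get


people = [{"name": "alice", "age": 30}, {"name": "bob", "age": 41}]
get_name = lambda s: s["name"]
put_name = lambda v, s: {**s, "name": v}
print(check_lens_laws(get_name, put_name, people, ["carol"]))
```

A putter that silently resets unrelated fields would fail the GetPut half of the check.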
500500+501501+## Schema Representation Format Decision
502502+503503+**Question**: What format should we use to represent field types internally?
504504+505505+### Option 1: JSON Schema
506506+**Pros**:
507507+- Standard, widely supported
508508+- Validation tooling exists
509509+- Human-readable
510510+511511+**Cons**:
512512+- Not designed for codegen
513513+- NDArray representation awkward
514514+- Overly complex for our needs
515515+516516+### Option 2: Protobuf
517517+**Pros**:
518518+- Designed for codegen
519519+- Compact binary format
520520+- Cross-language support excellent
521521+522522+**Cons**:
523523+- Not ATProto-native
524524+- Requires compilation step
525525+- Less human-readable
526526+527527+### Option 3: Custom Format (as shown above)
528528+**Pros**:
529529+- Tailored exactly to PackableSample needs
530530+- Native ATProto Lexicon
531531+- Clean NDArray representation
532532+- Easy to extend
533533+534534+**Cons**:
535535+- Need to write our own codegen
536536+- Less ecosystem tooling
537537+538538+### Recommendation: Option 3 (Custom Format)
539539+540540+**Rationale**:
541541+1. PackableSample has specific needs (NDArray, msgpack serialization)
542542+2. ATProto Lexicon provides all the structure we need
543543+3. Writing our own codegen gives us full control
544544+4. Can still use JSON Schema for validation if needed
545545+546546+The proposed Lexicon structure above uses this approach.
547547+548548+## Implementation Checklist (Phase 1)
549549+550550+- [ ] Finalize Lexicon JSON definitions for all three record types
551551+- [ ] Create reference documentation with examples
552552+- [ ] Decide on schema representation format (recommendation: custom)
553553+- [ ] Resolve open questions (code storage, versioning, etc.)
554554+- [ ] Validate Lexicons against ATProto spec
555555+- [ ] Create example records for testing
556556+557557+## Future Extensions
558558+559559+### Schema Evolution
560560+- Support schema versioning with migration paths
561561+- Compatibility checking (backward/forward compatible)
562562+563563+### Advanced Types
564564+- Generic/parameterized types
565565+- Union types for polymorphic samples
566566+- Schema composition/inheritance
567567+568568+### Lens Network
569569+- Automatic lens composition
570570+- Lens verification and testing
571571+- Performance metadata (transformation cost)
572572+573573+### Dataset Features
574574+- Dataset splitting (train/val/test) references
575575+- Dataset versioning and diffs
576576+- Access control and permissions
+690
.planning/03_python_client.md
···11+# Python Client Library Architecture
22+33+## Overview
44+55+This document specifies the Python library extensions to `atdata` for ATProto integration. The goal is to add ATProto publishing and discovery capabilities while maintaining backward compatibility with existing code.
66+77+## Design Principles
88+99+- **Backward compatible**: Existing code continues to work unchanged
1010+- **Optional integration**: ATProto features are opt-in
1111+- **Pythonic API**: Follows Python conventions and `atdata` style
1212+- **Type-safe**: Full type hints with generics
1313+- **Testable**: Mockable dependencies, unit testable
1414+1515+## Module Structure
1616+1717+```
1818+src/atdata/
1919+ __init__.py # Existing exports
2020+ dataset.py # Existing Dataset, PackableSample
2121+ lens.py # Existing Lens, LensNetwork
2222+ _helpers.py # Existing serialization helpers
2323+ atproto/ # NEW: ATProto integration
2424+ __init__.py # Public API exports
2525+ client.py # ATProtoClient for auth/session
2626+ schema.py # Schema publishing/loading
2727+ dataset.py # Dataset publishing/loading
2828+ lens.py # Lens publishing/loading
2929+ _lexicon.py # Lexicon record builders
3030+ _types.py # Type definitions for records
3131+```
3232+3333+## Core Components
3434+3535+### 1. ATProtoClient - Authentication & Session Management
3636+3737+**File**: `src/atdata/atproto/client.py`
3838+3939+```python
+from typing import Optional
+from atproto import Client as ATProtoSDKClient
+
+class ATProtoClient:
+    """Wrapper around atproto SDK client with atdata-specific helpers."""
+
+    def __init__(self, client: Optional[ATProtoSDKClient] = None):
+        """
+        Initialize ATProto client.
+
+        Args:
+            client: Optional pre-configured atproto Client. If None, creates new client.
+        """
+        self._client = client or ATProtoSDKClient()
+        self._session: Optional[dict] = None
+
+    def login(self, handle: str, password: str) -> None:
+        """Authenticate with ATProto PDS."""
+        self._session = self._client.login(handle, password)
+
+    def login_with_token(self, access_token: str, refresh_token: str) -> None:
+        """Authenticate using existing tokens."""
+        # Implementation
+        pass
+
+    @property
+    def is_authenticated(self) -> bool:
+        """Check if client has valid session."""
+        return self._session is not None
+
+    @property
+    def did(self) -> str:
+        """Get DID of authenticated user."""
+        if not self._session:
+            raise ValueError("Not authenticated")
+        return self._session['did']
+
+    # Low-level record operations
+    def create_record(self, collection: str, record: dict) -> str:
+        """Create a record and return its AT-URI."""
+        # Implementation using self._client
+        pass
+
+    def get_record(self, uri: str) -> dict:
+        """Fetch a record by AT-URI."""
+        # Implementation
+        pass
+
+    def list_records(self, collection: str, did: Optional[str] = None) -> list[dict]:
+        """List records in a collection."""
+        # Implementation
+        pass
9292+```
9393+9494+**Usage**:
9595+```python
9696+from atdata.atproto import ATProtoClient
9797+9898+client = ATProtoClient()
9999+client.login("alice.bsky.social", "password")
100100+```
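Because the publishers and loaders only need the record operations, unit tests can substitute an in-memory stand-in for the real client. The class below is a hypothetical test double, not part of the planned API: it mirrors the `create_record`, `get_record`, and `list_records` signatures shown above and uses a simple counter where a real PDS would assign a TID record key.

```python
import itertools
from typing import Optional


class InMemoryClient:
    """Test double that stores records in a dict instead of a PDS."""

    def __init__(self, did: str = "did:plc:test"):
        self.did = did
        self.is_authenticated = True
        self._records: dict[str, dict] = {}
        self._counter = itertools.count(1)  # stand-in for TID generation

    def create_record(self, collection: str, record: dict) -> str:
        """Store a copy of the record and return its AT-URI."""
        uri = f"at://{self.did}/{collection}/{next(self._counter)}"
        self._records[uri] = dict(record)
        return uri

    def get_record(self, uri: str) -> dict:
        """Return the record stored at uri (KeyError if absent)."""
        return self._records[uri]

    def list_records(self, collection: str, did: Optional[str] = None) -> list[dict]:
        """List records in a collection for the given (or own) DID."""
        prefix = f"at://{did or self.did}/{collection}/"
        return [rec for uri, rec in self._records.items() if uri.startswith(prefix)]


client = InMemoryClient()
uri = client.create_record("app.bsky.atdata.schema", {"name": "ImageSample"})
print(uri, client.get_record(uri))
```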
101101+102102+### 2. Schema Publishing & Loading
103103+104104+**File**: `src/atdata/atproto/schema.py`
105105+106106+```python
+from datetime import datetime, timezone
+from typing import Optional, Type, TypeVar, Union, get_args, get_origin, get_type_hints
+from dataclasses import fields, is_dataclass
+import atdata
+from .client import ATProtoClient
+from ._lexicon import build_schema_record
+
+ST = TypeVar('ST', bound=atdata.PackableSample)
+
+class SchemaPublisher:
+    """Handles publishing PackableSample schemas to ATProto."""
+
+    def __init__(self, client: ATProtoClient):
+        self.client = client
+
+    def publish_schema(
+        self,
+        sample_type: Type[ST],
+        *,
+        name: Optional[str] = None,
+        version: str = "1.0.0",
+        description: Optional[str] = None,
+        metadata: Optional[dict] = None
+    ) -> str:
+        """
+        Publish a PackableSample schema to ATProto.
+
+        Args:
+            sample_type: The PackableSample class to publish
+            name: Human-readable name (defaults to class name)
+            version: Semantic version
+            description: Human-readable description
+            metadata: Arbitrary metadata dict
+
+        Returns:
+            AT-URI of the created schema record
+        """
+        if not self.client.is_authenticated:
+            raise ValueError("Client must be authenticated")
+
+        # Extract field information from dataclass
+        schema_record = self._build_schema_record(
+            sample_type, name, version, description, metadata
+        )
+
+        # Publish to ATProto
+        uri = self.client.create_record("app.bsky.atdata.schema", schema_record)
+        return uri
+
+    def _build_schema_record(
+        self,
+        sample_type: Type[ST],
+        name: Optional[str],
+        version: str,
+        description: Optional[str],
+        metadata: Optional[dict]
+    ) -> dict:
+        """Build schema record dict from PackableSample class."""
+        if not is_dataclass(sample_type):
+            raise ValueError(f"{sample_type} must be a dataclass")
+
+        field_defs = []
+        # Resolve string/forward-reference annotations to real types
+        type_hints = get_type_hints(sample_type)
+
+        for field in fields(sample_type):
+            field_type = type_hints[field.name]
+            field_def = self._field_to_record(field.name, field_type)
+            field_defs.append(field_def)
+
+        return {
+            "$type": "app.bsky.atdata.schema",
+            "name": name or sample_type.__name__,
+            "version": version,
+            "description": description or "",
+            "fields": field_defs,
+            "metadata": metadata or {},
+            "createdAt": datetime.now(timezone.utc).isoformat()
+        }
+
+    def _field_to_record(self, name: str, field_type) -> dict:
+        """Convert Python type annotation to schema field record."""
+        # Handle Optional[X], i.e. Union[X, None]
+        is_optional = False
+        if get_origin(field_type) is Union:
+            args = get_args(field_type)
+            if type(None) in args:
+                is_optional = True
+                field_type = next(arg for arg in args if arg is not type(None))
+
+        # Map Python types to schema types
+        type_def = self._python_type_to_schema_type(field_type)
+
+        return {
+            "name": name,
+            "type": type_def,
+            "optional": is_optional
+        }
+
+    def _python_type_to_schema_type(self, python_type) -> dict:
+        """Map Python type to schema type definition."""
+        # Handle primitives (identity checks, so bool is never mistaken for int)
+        if python_type is str:
+            return {"kind": "primitive", "primitive": "str"}
+        elif python_type is int:
+            return {"kind": "primitive", "primitive": "int"}
+        elif python_type is float:
+            return {"kind": "primitive", "primitive": "float"}
+        elif python_type is bool:
+            return {"kind": "primitive", "primitive": "bool"}
+        elif python_type is bytes:
+            return {"kind": "primitive", "primitive": "bytes"}
+
+        # Handle NDArray - this is the key special case
+        # In atdata, NDArray is used as a type annotation
+        origin = getattr(python_type, '__origin__', None)
+        # numpy.typing.NDArray[...] has numpy.ndarray as its origin, whose
+        # __name__ is 'ndarray'; getattr avoids AttributeError on other origins
+        if getattr(origin, '__name__', None) in ('NDArray', 'ndarray'):
+            # Extract dtype from annotation if available
+            # For now, default to float32
+            return {
+                "kind": "ndarray",
+                "dtype": "float32",  # TODO: extract from annotation
+                "shape": None
+            }
+
+        # If it's another PackableSample, create nested reference
+        if is_dataclass(python_type) and issubclass(python_type, atdata.PackableSample):
+            # This would require publishing the nested type first
+            raise NotImplementedError("Nested PackableSample types not yet supported")
+
+        raise ValueError(f"Unsupported type: {python_type}")
+
+class SchemaLoader:
+    """Handles loading PackableSample schemas from ATProto."""
+
+    def __init__(self, client: ATProtoClient):
+        self.client = client
+
+    def get_schema(self, uri: str) -> dict:
+        """Fetch a schema record by AT-URI."""
+        record = self.client.get_record(uri)
+        if record.get('$type') != 'app.bsky.atdata.schema':
+            raise ValueError(f"Record at {uri} is not a schema record")
+        return record
+
+    def list_schemas(self, did: Optional[str] = None) -> list[dict]:
+        """List available schema records."""
+        return self.client.list_records("app.bsky.atdata.schema", did)
254254+```
255255+256256+**Usage**:
```python
import atdata
from numpy.typing import NDArray  # assuming atdata uses numpy's NDArray annotation
from atdata.atproto import ATProtoClient, SchemaPublisher

@atdata.packable
class MySample:
    image: NDArray
    label: str

client = ATProtoClient()
# Prefer an app password over the main account password
client.login("alice.bsky.social", "password")

publisher = SchemaPublisher(client)
schema_uri = publisher.publish_schema(
    MySample,
    description="My sample type",
    version="1.0.0"
)
print(f"Published schema at {schema_uri}")
```
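
The publisher returns an AT-URI and the loader fetches records by AT-URI. As a minimal sketch of how such a URI splits into repository, collection, and record key, the helper below may be useful for validation or logging. The names `parse_at_uri` and `AtUriParts` are hypothetical and are not part of `atdata` or the `atproto` SDK.

```python
from typing import NamedTuple


class AtUriParts(NamedTuple):
    """Hypothetical container for the three parts of a record AT-URI."""
    repo: str        # DID or handle of the repository owner
    collection: str  # NSID of the record type, e.g. app.bsky.atdata.schema
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> AtUriParts:
    """Split at://<repo>/<collection>/<rkey> into its parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an AT-URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected at://<repo>/<collection>/<rkey>, got {uri!r}")
    return AtUriParts(*parts)


parts = parse_at_uri("at://did:plc:example/app.bsky.atdata.schema/abc123")
print(parts.collection)  # app.bsky.atdata.schema
```

This only handles full record URIs; repository-only or collection-only URIs would need a more permissive parser.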
276276+277277+### 3. Dataset Publishing & Loading
278278+279279+**File**: `src/atdata/atproto/dataset.py`
280280+281281+```python
from datetime import datetime, timezone  # needed for the createdAt timestamp below
from typing import Type, TypeVar, Optional
import msgpack
import atdata
from .client import ATProtoClient
from .schema import SchemaPublisher
287287+288288+ST = TypeVar('ST', bound=atdata.PackableSample)
289289+290290+class DatasetPublisher:
291291+ """Handles publishing Dataset index records to ATProto."""
292292+293293+ def __init__(self, client: ATProtoClient):
294294+ self.client = client
295295+ self.schema_publisher = SchemaPublisher(client)
296296+297297+ def publish_dataset(
298298+ self,
299299+ dataset: atdata.Dataset[ST],
300300+ *,
301301+ name: str,
302302+ schema_uri: Optional[str] = None,
303303+ description: Optional[str] = None,
304304+ tags: Optional[list[str]] = None,
305305+ license: Optional[str] = None,
306306+ auto_publish_schema: bool = True
307307+ ) -> str:
308308+ """
309309+ Publish a dataset index record to ATProto.
310310+311311+ Args:
312312+ dataset: The Dataset to publish
313313+ name: Human-readable dataset name
314314+ schema_uri: AT-URI of the schema record (required if auto_publish_schema=False)
315315+ description: Human-readable description
316316+ tags: Searchable tags
317317+ license: License identifier (SPDX preferred)
318318+ auto_publish_schema: If True and schema_uri not provided, publish schema automatically
319319+320320+ Returns:
321321+ AT-URI of the created dataset record
322322+ """
323323+ if not self.client.is_authenticated:
324324+ raise ValueError("Client must be authenticated")
325325+326326+ # Ensure schema is published
327327+ if schema_uri is None:
328328+ if not auto_publish_schema:
329329+ raise ValueError("schema_uri required when auto_publish_schema=False")
330330+ schema_uri = self.schema_publisher.publish_schema(dataset.sample_type)
331331+332332+ # Build dataset record
333333+ dataset_record = {
334334+ "$type": "app.bsky.atdata.dataset",
335335+ "name": name,
336336+ "schemaRef": schema_uri,
337337+ "urls": [dataset.url], # Single URL for now
338338+ "description": description or "",
339339+ "metadata": msgpack.packb(dataset.metadata),
340340+ "tags": tags or [],
341341+ "license": license or "",
342342+ "createdAt": datetime.now(timezone.utc).isoformat()
343343+ }
344344+345345+ # Add size information if available
346346+ # (would need to iterate dataset or have metadata about size)
347347+348348+ # Publish to ATProto
349349+ uri = self.client.create_record("app.bsky.atdata.dataset", dataset_record)
350350+ return uri
351351+352352+class DatasetLoader:
353353+ """Handles loading Datasets from ATProto records."""
354354+355355+ def __init__(self, client: ATProtoClient):
356356+ self.client = client
357357+358358+ def load_dataset(self, uri: str) -> atdata.Dataset:
359359+ """
360360+ Load a Dataset from an ATProto record.
361361+362362+ Args:
363363+ uri: AT-URI of the dataset record
364364+365365+ Returns:
366366+ Dataset instance configured from the record
367367+ """
368368+ # Fetch the dataset record
369369+ record = self.client.get_record(uri)
370370+ if record.get('$type') != 'app.bsky.atdata.dataset':
371371+ raise ValueError(f"Record at {uri} is not a dataset record")
372372+373373+ # For now, we still need the Python class for the sample type
374374+ # In the future, this could use codegen
375375+ # TODO: Implement dynamic type loading via codegen

        # Extract URLs and metadata
        urls = record['urls']
        # msgpack.unpackb() raises on empty input, so decode only when metadata is present
        raw_metadata = record.get('metadata')
        metadata = msgpack.unpackb(raw_metadata) if raw_metadata else {}

        # We need the schema to instantiate the Dataset with the correct type.
        # This is a limitation: codegen is required to create the type dynamically.
        # For now, raise an error.
        raise NotImplementedError(
            "Loading datasets requires code generation to instantiate sample types. "
            f"Schema URI: {record['schemaRef']}\n"
            "Use the codegen tool to generate the Python class first."
        )
389389+390390+ def list_datasets(self, did: Optional[str] = None) -> list[dict]:
391391+ """List available dataset records."""
392392+ return self.client.list_records("app.bsky.atdata.dataset", did)
393393+394394+ def search_datasets(self, tags: Optional[list[str]] = None, query: Optional[str] = None) -> list[dict]:
395395+ """
396396+ Search for datasets.
397397+398398+ Args:
399399+ tags: Filter by tags
400400+ query: Text search query
401401+402402+ Returns:
403403+ List of matching dataset records
404404+ """
        # This would use AppView in production
        # For now, fetch all and filter client-side
        # (DatasetLoader has no list_records(); go through list_datasets(), which uses the client)
        all_datasets = self.list_datasets()

        filtered = all_datasets
        if tags:
            filtered = [d for d in filtered if any(t in d.get('tags', []) for t in tags)]
        if query:
            needle = query.lower()
            filtered = [d for d in filtered if needle in d.get('name', '').lower() or
                        needle in d.get('description', '').lower()]

        return filtered
417417+```
418418+419419+**Usage**:
420420+```python
421421+from atdata.atproto import ATProtoClient, DatasetPublisher
422422+423423+# Create dataset
424424+dataset = atdata.Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar")
425425+426426+# Publish
427427+client = ATProtoClient()
428428+client.login("alice.bsky.social", "password")
429429+430430+publisher = DatasetPublisher(client)
431431+dataset_uri = publisher.publish_dataset(
432432+ dataset,
433433+ name="My Training Data",
434434+ description="Training data for my model",
435435+ tags=["computer-vision", "training"],
436436+ license="MIT"
437437+)
438438+print(f"Published dataset at {dataset_uri}")
439439+```
440440+441441+### 4. Lens Publishing
442442+443443+**File**: `src/atdata/atproto/lens.py`
444444+445445+```python
from datetime import datetime, timezone  # needed for the createdAt timestamp below
from typing import Callable, Optional
import inspect
from .client import ATProtoClient
449449+450450+class LensPublisher:
451451+ """Handles publishing Lens transformations to ATProto."""
452452+453453+ def __init__(self, client: ATProtoClient):
454454+ self.client = client
455455+456456+ def publish_lens(
457457+ self,
458458+ lens_getter: Callable,
459459+ lens_putter: Callable,
460460+ *,
461461+ name: str,
462462+ source_schema_uri: str,
463463+ target_schema_uri: str,
464464+ description: Optional[str] = None,
465465+ code_repository: Optional[str] = None,
466466+ code_commit: Optional[str] = None
467467+ ) -> str:
468468+ """
469469+ Publish a Lens transformation to ATProto.
470470+471471+ Args:
472472+ lens_getter: The getter function (Source -> Target)
473473+ lens_putter: The putter function (Target, Source -> Source)
474474+ name: Human-readable lens name
475475+ source_schema_uri: AT-URI of source schema
476476+ target_schema_uri: AT-URI of target schema
477477+ description: What this transformation does
478478+ code_repository: Git repository URL
479479+ code_commit: Git commit hash
480480+481481+ Returns:
482482+ AT-URI of the created lens record
483483+ """
484484+ if not self.client.is_authenticated:
485485+ raise ValueError("Client must be authenticated")
486486+487487+ # Build lens record
488488+ lens_record = {
489489+ "$type": "app.bsky.atdata.lens",
490490+ "name": name,
491491+ "sourceSchema": source_schema_uri,
492492+ "targetSchema": target_schema_uri,
493493+ "description": description or "",
494494+ "createdAt": datetime.now(timezone.utc).isoformat()
495495+ }
496496+497497+ # Add code references
498498+ if code_repository and code_commit:
499499+ getter_name = lens_getter.__name__
500500+ putter_name = lens_putter.__name__
501501+502502+ lens_record["getterCode"] = {
503503+ "kind": "reference",
504504+ "repository": code_repository,
505505+ "commit": code_commit,
506506+ "path": f"{getter_name}" # Simplified - would need module path
507507+ }
508508+ lens_record["putterCode"] = {
509509+ "kind": "reference",
510510+ "repository": code_repository,
511511+ "commit": code_commit,
512512+ "path": f"{putter_name}"
513513+ }
514514+ else:
515515+ # For initial version, we could store source code directly
516516+ # But this is DANGEROUS - security review required
517517+ raise NotImplementedError(
518518+ "Inline code storage not yet supported. "
519519+ "Please provide code_repository and code_commit."
520520+ )
521521+522522+ # Publish to ATProto
523523+ uri = self.client.create_record("app.bsky.atdata.lens", lens_record)
524524+ return uri
525525+```
526526+527527+## Extension to Existing Classes
528528+529529+### Adding ATProto methods to Dataset
530530+531531+**Approach**: Add methods directly to `Dataset` class in `src/atdata/dataset.py`
532532+533533+```python
534534+class Dataset[ST: PackableSample]:
535535+ # ... existing implementation ...
536536+537537+ def publish_to_atproto(
538538+ self,
539539+ client: 'ATProtoClient', # Forward reference to avoid circular import
540540+ *,
541541+ name: str,
542542+ **kwargs
543543+ ) -> str:
544544+ """
545545+ Publish this dataset to ATProto.
546546+547547+ This is a convenience method that wraps DatasetPublisher.
548548+ """
549549+ from .atproto import DatasetPublisher
550550+ publisher = DatasetPublisher(client)
551551+ return publisher.publish_dataset(self, name=name, **kwargs)
552552+553553+ @classmethod
554554+ def from_atproto(
555555+ cls,
556556+ client: 'ATProtoClient',
557557+ uri: str
558558+ ) -> 'Dataset':
559559+ """
560560+ Load a dataset from an ATProto record.
561561+562562+ Note: This requires the sample type to be available in Python.
563563+ Use codegen to generate types from schema records.
564564+ """
565565+ from .atproto import DatasetLoader
566566+ loader = DatasetLoader(client)
567567+ return loader.load_dataset(uri)
568568+```
569569+570570+**Usage**:
571571+```python
572572+# Publishing
573573+dataset = atdata.Dataset[MySample](url="s3://...")
574574+uri = dataset.publish_to_atproto(client, name="My Dataset")
575575+576576+# Loading (future, requires codegen)
577577+dataset = atdata.Dataset.from_atproto(client, uri)
578578+```
579579+580580+## Public API Exports
581581+582582+**File**: `src/atdata/atproto/__init__.py`
583583+584584+```python
585585+from .client import ATProtoClient
586586+from .schema import SchemaPublisher, SchemaLoader
587587+from .dataset import DatasetPublisher, DatasetLoader
588588+from .lens import LensPublisher
589589+590590+__all__ = [
591591+ "ATProtoClient",
592592+ "SchemaPublisher",
593593+ "SchemaLoader",
594594+ "DatasetPublisher",
595595+ "DatasetLoader",
596596+ "LensPublisher",
597597+]
598598+```
599599+600600+## Testing Strategy
601601+602602+### Unit Tests
603603+- Mock `ATProtoClient` to avoid network calls
604604+- Test schema record building from various PackableSample types
605605+- Test error handling (auth failures, invalid types, etc.)
606606+607607+### Integration Tests
608608+- Use ATProto test server or sandbox
609609+- Test full publish/query cycle
610610+- Verify record structure matches Lexicon
611611+612612+### Example Test
```python
from unittest.mock import Mock
import atdata
from atdata.atproto import SchemaPublisher

# Not named Test*, so pytest does not try to collect it as a test class
@atdata.packable
class ExampleSample:
    field1: str
    field2: int

def test_schema_publisher():
    # Mock client
    mock_client = Mock()
    mock_client.is_authenticated = True
    mock_client.create_record = Mock(return_value="at://did:example/app.bsky.atdata.schema/abc123")

    # Publish schema
    publisher = SchemaPublisher(mock_client)
    uri = publisher.publish_schema(ExampleSample, version="1.0.0")

    # Verify
    assert uri == "at://did:example/app.bsky.atdata.schema/abc123"
    mock_client.create_record.assert_called_once()

    # Check the record structure
    call_args = mock_client.create_record.call_args
    collection, record = call_args[0]
    assert collection == "app.bsky.atdata.schema"
    assert record["name"] == "ExampleSample"
    assert len(record["fields"]) == 2
```
645645+646646+## Dependencies
647647+648648+**New dependencies** (to be added to `pyproject.toml`):
649649+650650+```toml
651651+[project]
652652+dependencies = [
653653+ # ... existing ...
654654+ "atproto>=0.0.40", # ATProto Python SDK
655655+]
656656+```
657657+658658+## Implementation Checklist (Phase 2)
659659+660660+- [ ] Set up `atdata/atproto/` module structure
661661+- [ ] Implement `ATProtoClient` wrapper
662662+- [ ] Implement `SchemaPublisher` with type introspection
663663+- [ ] Implement `DatasetPublisher`
664664+- [ ] Implement `LensPublisher` (code reference only)
665665+- [ ] Add convenience methods to `Dataset` class
666666+- [ ] Write unit tests for all publishers
667667+- [ ] Write integration tests with test server
668668+- [ ] Update documentation with examples
669669+670670+## Future Enhancements
671671+672672+### Better NDArray Type Handling
673673+- Parse `NDArray[DType, Shape]` annotations for accurate dtype/shape
674674+- Support for shape constraints in schema
675675+676676+### Dynamic Type Loading
677677+- Use codegen to create types at runtime from schema records
678678+- Enable `Dataset.from_atproto()` without pre-existing Python classes
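
As a rough illustration of runtime type creation, the sketch below builds a plain dataclass from a schema record using the standard library's `make_dataclass`. It assumes the record layout produced by `_build_schema_record` above (`name`, `fields`, and per-field `type` and `optional`). The helpers `field_type_from_def` and `class_from_schema` are hypothetical, and turning the result into a real `PackableSample` is left out.

```python
from dataclasses import field, fields, make_dataclass
from typing import Any, Optional

# Mirrors the primitive names emitted by the schema publisher sketch.
PRIMITIVES = {"str": str, "int": int, "float": float, "bool": bool, "bytes": bytes}


def field_type_from_def(type_def: dict) -> Any:
    """Map a schema type definition back to a Python annotation."""
    if type_def.get("kind") == "primitive":
        return PRIMITIVES[type_def["primitive"]]
    # ndarray and nested types need real codegen; Any is only a placeholder.
    return Any


def class_from_schema(record: dict) -> type:
    """Build a plain dataclass from a schema record (hypothetical helper)."""
    required, optional = [], []
    for f in record["fields"]:
        py_type = field_type_from_def(f["type"])
        if f.get("optional"):
            optional.append((f["name"], Optional[py_type], field(default=None)))
        else:
            required.append((f["name"], py_type))
    # Dataclass fields with defaults must follow fields without them.
    return make_dataclass(record["name"], required + optional)


record = {
    "name": "MySample",
    "fields": [
        {"name": "label", "type": {"kind": "primitive", "primitive": "str"}, "optional": False},
        {"name": "score", "type": {"kind": "primitive", "primitive": "float"}, "optional": True},
    ],
}
MySample = class_from_schema(record)
sample = MySample(label="cat")
print([f.name for f in fields(MySample)])  # ['label', 'score']
```

Note that optional fields are moved after required ones, so the generated field order can differ from the order in the record.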
679679+680680+### Caching
681681+- Cache schema lookups to avoid repeated network calls
682682+- Local schema registry
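
One possible shape for the schema cache is a small in-memory wrapper with a time-to-live around whatever function fetches a schema (for example `SchemaLoader.get_schema`). This is only a sketch: `SchemaCache`, its parameters, and the 300-second default are assumptions, not an existing `atdata` API.

```python
import time
from typing import Callable, Dict, Tuple


class SchemaCache:
    """Hypothetical in-memory cache in front of a schema fetch function."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, uri: str) -> dict:
        """Return the cached record for uri, fetching on a miss or after expiry."""
        now = self._clock()
        hit = self._entries.get(uri)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        record = self._fetch(uri)  # network call in a real client
        self._entries[uri] = (now, record)
        return record
```

A loader could then be wrapped as `SchemaCache(loader.get_schema)`. The TTL matters because a record at a given AT-URI can be updated in place.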
683683+684684+### Batch Operations
685685+- Publish multiple schemas/datasets in one call
686686+- Bulk import/export
687687+688688+### AppView Integration
689689+- Use AppView for fast search instead of client-side filtering
690690+- Streaming results for large queries
+578
.planning/04_appview.md
···11+# AppView Service Architecture
22+33+## Overview
44+55+The AppView is an **optional aggregation service** that indexes dataset records from across the ATProto network, providing fast search and discovery. Think of it as the "search engine" for atdata datasets.
66+77+## Why AppView?
88+99+Without AppView, discovering datasets requires:
1010+- Querying each user's Personal Data Server (PDS) individually
1111+- No global search across all published datasets
1212+- Slow, inefficient discovery
1313+1414+With AppView:
1515+- **Fast global search** across all datasets
1616+- **Rich metadata browsing** (schemas, tags, authors)
1717+- **Recommendation systems** (similar datasets, popular datasets)
1818+- **Analytics** (dataset usage, trends)
1919+2020+## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       ATProto Network                       │
│                                                             │
│  ┌─────┐   ┌─────┐   ┌─────┐           ┌──────────────┐     │
│  │ PDS │   │ PDS │   │ PDS │ ────────▶ │    Relay/    │     │
│  │  1  │   │  2  │   │  3  │           │   Firehose   │     │
│  └─────┘   └─────┘   └─────┘           └──────────────┘     │
│     │         │         │                     │             │
│     └─────────┴─────────┴─────────────────────┘             │
│              (publish records)                │             │
└───────────────────────────────────────────────┼─────────────┘
                                                │
                                                │ (subscribe)
                                                ▼
                                   ┌─────────────────────────┐
                                   │     AppView Service     │
                                   │                         │
                                   │  ┌──────────────────┐   │
                                   │  │     Firehose     │   │
                                   │  │     Consumer     │   │
                                   │  └────────┬─────────┘   │
                                   │           │             │
                                   │           ▼             │
                                   │  ┌──────────────────┐   │
                                   │  │      Record      │   │
                                   │  │    Processor     │   │
                                   │  └────────┬─────────┘   │
                                   │           │             │
                                   │           ▼             │
                                   │  ┌──────────────────┐   │
                                   │  │    PostgreSQL    │   │
                                   │  │     Database     │   │
                                   │  └──────────────────┘   │
                                   │                         │
                                   │  ┌──────────────────┐   │
                                   │  │   Search Index   │   │
                                   │  │ (ElasticSearch)  │   │
                                   │  └──────────────────┘   │
                                   │                         │
                                   │  ┌──────────────────┐   │
                                   │  │     HTTP API     │   │
                                   │  │    (FastAPI)     │   │
                                   │  └──────────────────┘   │
                                   └─────────────────────────┘
                                                │
                                                │ (query API)
                                                ▼
                                   ┌─────────────────────────┐
                                   │      Python Client      │
                                   │    (atdata.atproto)     │
                                   └─────────────────────────┘
```
7575+7676+## Components
7777+7878+### 1. Firehose Consumer
7979+8080+**Purpose**: Subscribe to ATProto firehose and receive real-time record updates
8181+8282+**Technology**: Python + `atproto` SDK
8383+8484+**Responsibilities**:
8585+- Connect to ATProto relay/firehose
8686+- Filter for relevant Lexicon types:
8787+ - `app.bsky.atdata.schema`
8888+ - `app.bsky.atdata.dataset`
8989+ - `app.bsky.atdata.lens`
9090+- Handle reconnection and backpressure
9191+- Forward records to processor
9292+9393+**Implementation**:
```python
# Sketch based on the atproto SDK's firehose examples; verify names against the installed version.
from atproto import (
    CAR,
    FirehoseSubscribeReposClient,
    models,
    parse_subscribe_repos_message,
)

ATDATA_PREFIX = "app.bsky.atdata."

class AtdataFirehoseConsumer:
    def __init__(self, processor: "RecordProcessor"):
        self.processor = processor
        self.client = FirehoseSubscribeReposClient()

    def start(self):
        """Start consuming the firehose (blocks while the client runs)."""
        def on_message_handler(message):
            commit = parse_subscribe_repos_message(message)
            # Only commit messages carry record operations
            if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit):
                return
            if not commit.blocks:
                return

            # Record bodies are not attached to the op; they live in the commit's CAR blocks
            car = CAR.from_bytes(commit.blocks)
            for op in commit.ops:
                if op.action not in ("create", "update") or not op.cid:
                    continue
                if not op.path.startswith(ATDATA_PREFIX):
                    continue
                record = car.blocks.get(op.cid)
                if record is None:
                    continue
                self.processor.process_record(
                    uri=f"at://{commit.repo}/{op.path}",  # op.path is "<collection>/<rkey>"
                    cid=str(op.cid),
                    record=record
                )

        self.client.start(on_message_handler)
```
122122+123123+### 2. Record Processor
124124+125125+**Purpose**: Parse and validate incoming records, update database and search index
126126+127127+**Responsibilities**:
128128+- Validate records against Lexicon schemas
129129+- Extract searchable fields
130130+- Resolve references (schema URIs, etc.)
131131+- Update PostgreSQL and ElasticSearch
132132+- Handle deletions and updates
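
To make the "extract searchable fields" step concrete, here is a minimal sketch that flattens a dataset record into a search document. The input keys follow the dataset record built by `DatasetPublisher`, and the output keys follow the search index mapping in this document. The function name `dataset_search_doc` and the choice of required keys are assumptions; a real processor would validate against the Lexicon instead.

```python
from typing import Optional

# Assumed minimum set of keys for a record to be indexable.
REQUIRED_DATASET_FIELDS = ("name", "schemaRef", "urls", "createdAt")


def dataset_search_doc(uri: str, record: dict) -> Optional[dict]:
    """Return a flat search document, or None if the record is not indexable."""
    if not uri.startswith("at://"):
        return None
    if record.get("$type") != "app.bsky.atdata.dataset":
        return None
    if any(key not in record for key in REQUIRED_DATASET_FIELDS):
        return None
    # The authority segment of the AT-URI identifies the publishing repository.
    did = uri[len("at://"):].split("/", 1)[0]
    return {
        "uri": uri,
        "type": "dataset",
        "did": did,
        "name": record["name"],
        "description": record.get("description", ""),
        "tags": record.get("tags", []),
        "created_at": record["createdAt"],
        "schema_ref": record["schemaRef"],
        "license": record.get("license", ""),
    }
```

Returning `None` rather than raising keeps a malformed record from stopping the firehose loop; whether to log or count such records is a separate decision.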
133133+134134+**Data Model**:
135135+136136+**PostgreSQL Tables**:
137137+```sql
138138+-- Schema records
139139+CREATE TABLE schemas (
140140+ uri TEXT PRIMARY KEY,
141141+ cid TEXT NOT NULL,
142142+ did TEXT NOT NULL,
143143+ name TEXT NOT NULL,
144144+ version TEXT NOT NULL,
145145+ description TEXT,
146146+ fields JSONB NOT NULL,
147147+ metadata JSONB,
148148+ created_at TIMESTAMP NOT NULL,
149149+ indexed_at TIMESTAMP NOT NULL DEFAULT NOW()
150150+);
151151+CREATE INDEX idx_schemas_did ON schemas(did);
152152+CREATE INDEX idx_schemas_name ON schemas(name);
153153+154154+-- Dataset records
155155+CREATE TABLE datasets (
156156+ uri TEXT PRIMARY KEY,
157157+ cid TEXT NOT NULL,
158158+ did TEXT NOT NULL,
159159+ name TEXT NOT NULL,
160160+ schema_ref TEXT NOT NULL REFERENCES schemas(uri),
161161+ urls TEXT[] NOT NULL,
162162+ description TEXT,
163163+ metadata BYTEA,
164164+ tags TEXT[],
165165+ license TEXT,
166166+ size_samples INTEGER,
167167+ size_bytes BIGINT,
168168+ created_at TIMESTAMP NOT NULL,
169169+ indexed_at TIMESTAMP NOT NULL DEFAULT NOW()
170170+);
171171+CREATE INDEX idx_datasets_did ON datasets(did);
172172+CREATE INDEX idx_datasets_schema ON datasets(schema_ref);
173173+CREATE INDEX idx_datasets_tags ON datasets USING GIN(tags);
174174+175175+-- Lens records
176176+CREATE TABLE lenses (
177177+ uri TEXT PRIMARY KEY,
178178+ cid TEXT NOT NULL,
179179+ did TEXT NOT NULL,
180180+ name TEXT NOT NULL,
181181+ source_schema TEXT NOT NULL REFERENCES schemas(uri),
182182+ target_schema TEXT NOT NULL REFERENCES schemas(uri),
183183+ description TEXT,
184184+ created_at TIMESTAMP NOT NULL,
185185+ indexed_at TIMESTAMP NOT NULL DEFAULT NOW()
186186+);
187187+CREATE INDEX idx_lenses_source ON lenses(source_schema);
188188+CREATE INDEX idx_lenses_target ON lenses(target_schema);
189189+190190+-- Lens network view (for finding transformation paths)
191191+CREATE MATERIALIZED VIEW lens_network AS
192192+SELECT
193193+ source_schema,
194194+ target_schema,
195195+ uri,
196196+ name
197197+FROM lenses;
198198+CREATE INDEX idx_lens_network_source ON lens_network(source_schema);
199199+CREATE INDEX idx_lens_network_target ON lens_network(target_schema);
200200+```
201201+202202+**ElasticSearch Index**:
203203+```json
204204+{
205205+ "mappings": {
206206+ "properties": {
207207+ "uri": { "type": "keyword" },
208208+ "type": { "type": "keyword" },
209209+ "did": { "type": "keyword" },
210210+ "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
211211+ "description": { "type": "text" },
212212+ "tags": { "type": "keyword" },
213213+ "created_at": { "type": "date" },
214214+ "schema_ref": { "type": "keyword" },
215215+ "license": { "type": "keyword" }
216216+ }
217217+ }
218218+}
219219+```
220220+221221+### 3. HTTP API
222222+223223+**Purpose**: Expose search and query endpoints for clients
224224+225225+**Technology**: FastAPI + Pydantic
226226+227227+**Endpoints**:
228228+229229+```python
230230+from fastapi import FastAPI, Query
231231+from pydantic import BaseModel
232232+233233+app = FastAPI()
234234+235235+# Search datasets
236236+@app.get("/api/v1/datasets/search")
237237+async def search_datasets(
238238+ q: str = Query(None, description="Text search query"),
239239+ tags: list[str] = Query(None, description="Filter by tags"),
240240+ schema_uri: str = Query(None, description="Filter by schema"),
241241+ author_did: str = Query(None, description="Filter by author DID"),
242242+ limit: int = Query(20, le=100),
243243+ offset: int = Query(0)
244244+) -> list[dict]:
245245+ """Search for datasets."""
246246+ # Query ElasticSearch + PostgreSQL
247247+ pass
248248+249249+# Get dataset details
250250+@app.get("/api/v1/datasets/{uri:path}")
251251+async def get_dataset(uri: str) -> dict:
252252+ """Get dataset record by URI."""
253253+ # Query PostgreSQL
254254+ pass
255255+256256+# List schemas
257257+@app.get("/api/v1/schemas")
258258+async def list_schemas(
259259+ limit: int = Query(20, le=100),
260260+ offset: int = Query(0)
261261+) -> list[dict]:
262262+ """List available schemas."""
263263+ pass
264264+265265+# Get schema details
266266+@app.get("/api/v1/schemas/{uri:path}")
267267+async def get_schema(uri: str) -> dict:
268268+ """Get schema record by URI."""
269269+ pass
270270+271271+# Find lens path between schemas
272272+@app.get("/api/v1/lenses/path")
273273+async def find_lens_path(
274274+ source: str = Query(..., description="Source schema URI"),
275275+ target: str = Query(..., description="Target schema URI")
276276+) -> list[dict]:
277277+ """Find transformation path between two schemas."""
278278+ # Graph search on lens_network
279279+ pass
280280+281281+# Stats and analytics
282282+@app.get("/api/v1/stats")
283283+async def get_stats() -> dict:
284284+ """Get aggregate statistics."""
285285+ return {
286286+ "total_datasets": await count_datasets(),
287287+ "total_schemas": await count_schemas(),
288288+ "total_lenses": await count_lenses()
289289+ }
290290+```
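
The `/api/v1/lenses/path` endpoint is described as a graph search over `lens_network`. One way this could work is a breadth-first search over `(source_schema, target_schema, uri)` rows, sketched below. The function `find_lens_path` here is a standalone helper rather than the endpoint itself, and it assumes lenses are followed only in the getter direction (source to target).

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple

# (source_schema, target_schema, lens_uri), e.g. rows read from lens_network
LensEdge = Tuple[str, str, str]


def find_lens_path(
    edges: Iterable[LensEdge], source: str, target: str
) -> Optional[List[str]]:
    """Return lens URIs forming a shortest source -> target chain, or None."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for src, dst, lens_uri in edges:
        graph.setdefault(src, []).append((dst, lens_uri))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        schema, path = queue.popleft()
        if schema == target:
            return path
        for nxt, lens_uri in graph.get(schema, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [lens_uri]))
    return None


edges = [("A", "B", "lens1"), ("B", "C", "lens2"), ("A", "D", "lens3")]
print(find_lens_path(edges, "A", "C"))  # ['lens1', 'lens2']
```

Breadth-first search returns a path with the fewest lenses, which is a reasonable default when every extra transformation risks losing information.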
291291+292292+### 4. Caching Layer
293293+294294+**Purpose**: Reduce database load for frequent queries
295295+296296+**Technology**: Redis
297297+298298+**Cached Items**:
299299+- Popular dataset queries
300300+- Schema lookups (high read frequency)
301301+- Search results (with short TTL)
302302+- Aggregate statistics
303303+304304+**Implementation**:
```python
import hashlib
import json
from functools import wraps

import redis.asyncio as redis  # async client, so cache I/O does not block the event loop

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def cache_result(ttl: int = 300):
    """Decorator to cache JSON-serializable results of async functions in Redis."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Built-in hash() is salted per process for strings, so keys would not
            # match across workers; derive a stable digest from the arguments instead
            raw_key = json.dumps([args, kwargs], sort_keys=True, default=str)
            digest = hashlib.sha256(raw_key.encode()).hexdigest()
            cache_key = f"{func.__name__}:{digest}"

            # Check cache
            cached = await redis_client.get(cache_key)
            if cached is not None:
                return json.loads(cached)

            # Compute result
            result = await func(*args, **kwargs)

            # Store in cache
            await redis_client.setex(cache_key, ttl, json.dumps(result))

            return result
        return wrapper
    return decorator

@cache_result(ttl=60)
async def get_popular_datasets():
    """Get popular datasets (cached for 1 minute)."""
    # Query database
    pass
```
341341+342342+## Deployment
343343+344344+### Infrastructure
345345+346346+**Option 1: Simple (single server)**
347347+```
348348+- PostgreSQL (datasets, schemas, lenses)
349349+- ElasticSearch (search index)
350350+- Redis (cache)
351351+- FastAPI app (HTTP API)
352352+- Firehose consumer (background process)
353353+```
354354+355355+**Option 2: Scalable (cloud)**
356356+```
357357+- AWS RDS PostgreSQL (managed database)
358358+- AWS OpenSearch (managed ElasticSearch)
359359+- AWS ElastiCache (managed Redis)
360360+- AWS ECS/Fargate (containerized FastAPI app)
361361+- AWS ECS/Fargate (containerized firehose consumer)
362362+- AWS ALB (load balancer)
363363+```
364364+365365+### Docker Compose (Development)
366366+367367+```yaml
368368+version: '3.8'
369369+370370+services:
371371+ postgres:
372372+ image: postgres:15
373373+ environment:
374374+ POSTGRES_DB: atdata_appview
375375+ POSTGRES_USER: atdata
376376+ POSTGRES_PASSWORD: password
377377+ volumes:
378378+ - postgres_data:/var/lib/postgresql/data
379379+ ports:
380380+ - "5432:5432"
381381+382382+ elasticsearch:
383383+ image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
384384+ environment:
385385+ - discovery.type=single-node
386386+ - xpack.security.enabled=false
387387+ volumes:
388388+ - es_data:/usr/share/elasticsearch/data
389389+ ports:
390390+ - "9200:9200"
391391+392392+ redis:
393393+ image: redis:7
394394+ ports:
395395+ - "6379:6379"
396396+397397+ appview-api:
398398+ build:
399399+ context: .
400400+ dockerfile: Dockerfile.api
401401+ environment:
402402+ DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview
403403+ ELASTICSEARCH_URL: http://elasticsearch:9200
404404+ REDIS_URL: redis://redis:6379
405405+ depends_on:
406406+ - postgres
407407+ - elasticsearch
408408+ - redis
409409+ ports:
410410+ - "8000:8000"
411411+412412+ appview-firehose:
413413+ build:
414414+ context: .
415415+ dockerfile: Dockerfile.firehose
416416+ environment:
417417+ DATABASE_URL: postgresql://atdata:password@postgres/atdata_appview
418418+ ELASTICSEARCH_URL: http://elasticsearch:9200
419419+ REDIS_URL: redis://redis:6379
420420+ FIREHOSE_URL: wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos
421421+ depends_on:
422422+ - postgres
423423+ - elasticsearch
424424+ - redis
425425+426426+volumes:
427427+ postgres_data:
428428+ es_data:
429429+```
430430+431431+## Client Integration
432432+433433+### Python Client Updates
434434+435435+Add AppView support to `atdata.atproto.dataset.DatasetLoader`:
436436+437437+```python
438438+class DatasetLoader:
439439+ def __init__(
440440+ self,
441441+ client: ATProtoClient,
442442+ appview_url: Optional[str] = None
443443+ ):
444444+ self.client = client
445445+ self.appview_url = appview_url or "https://appview.atdata.network"
446446+447447+ def search_datasets(
448448+ self,
449449+ query: Optional[str] = None,
450450+ tags: Optional[list[str]] = None,
451451+ schema_uri: Optional[str] = None,
452452+ limit: int = 20
453453+ ) -> list[dict]:
454454+ """Search datasets using AppView."""
455455+ import httpx
456456+457457+ params = {"limit": limit}
458458+ if query:
459459+ params["q"] = query
460460+ if tags:
461461+ params["tags"] = tags
462462+ if schema_uri:
463463+ params["schema_uri"] = schema_uri
464464+465465+ response = httpx.get(f"{self.appview_url}/api/v1/datasets/search", params=params)
466466+ response.raise_for_status()
467467+ return response.json()
468468+```
469469+470470+**Usage**:
471471+```python
472472+from atdata.atproto import ATProtoClient, DatasetLoader
473473+474474+client = ATProtoClient()
475475+loader = DatasetLoader(client, appview_url="https://appview.atdata.network")
476476+477477+# Search for computer vision datasets
478478+results = loader.search_datasets(
479479+ tags=["computer-vision"],
480480+ limit=10
481481+)
482482+483483+for dataset in results:
484484+ print(f"{dataset['name']}: {dataset['description']}")
485485+```
486486+487487+## Performance Considerations
488488+489489+### Indexing Speed
490490+- **Goal**: Index records in <1 second from firehose receipt
491491+- **Approach**: Async processing, batch inserts
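
As a sketch of the batch-insert idea, the buffer below collects processed records and hands them to a flush callback in groups. `BatchBuffer` and its `max_size` default are hypothetical. To stay within the one-second goal, a real consumer would also need a timer-based flush so that a partially filled batch does not wait indefinitely.

```python
from typing import Callable, List


class BatchBuffer:
    """Collect records and flush them in groups (hypothetical helper)."""

    def __init__(self, flush: Callable[[List[dict]], None], max_size: int = 100):
        self._flush = flush
        self._max_size = max_size
        self._items: List[dict] = []

    def add(self, item: dict) -> None:
        """Queue one record, flushing automatically once the batch is full."""
        self._items.append(item)
        if len(self._items) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        """Hand any queued records to the callback and reset the buffer."""
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

The flush callback is where a multi-row `INSERT` or a bulk index request would go.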
492492+493493+### Search Performance
494494+- **Goal**: Search queries return in <100ms
495495+- **Approach**: ElasticSearch indexing, query optimization, caching
496496+497497+### Scalability
498498+- **Goal**: Handle 1000+ datasets, 100+ schemas
499499+- **Approach**: Horizontal scaling of API servers, read replicas for PostgreSQL
500500+501501+## Monitoring & Observability
502502+503503+### Metrics
504504+- Firehose lag (time behind current)
505505+- Indexing throughput (records/second)
506506+- API request latency (p50, p95, p99)
507507+- Cache hit rate
508508+- Database query performance
509509+510510+### Logging
511511+- Structured JSON logs
512512+- Log aggregation (e.g., CloudWatch, Datadog)
513513+- Error tracking (e.g., Sentry)
514514+515515+### Health Checks
516516+```python
517517+@app.get("/health")
518518+async def health_check():
519519+ """Check service health."""
520520+ return {
521521+ "status": "healthy",
522522+ "components": {
523523+ "database": await check_db_health(),
524524+ "elasticsearch": await check_es_health(),
525525+ "redis": await check_redis_health(),
526526+ "firehose": await check_firehose_health()
527527+ }
528528+ }
529529+```
530530+531531+## Implementation Checklist (Phase 3)
532532+533533+- [ ] Design database schema (PostgreSQL)
534534+- [ ] Design search index (Elasticsearch)
535535+- [ ] Implement firehose consumer
536536+- [ ] Implement record processor with validation
537537+- [ ] Implement HTTP API with FastAPI
538538+- [ ] Add caching layer (Redis)
539539+- [ ] Create Docker Compose for local development
540540+- [ ] Write integration tests
541541+- [ ] Set up monitoring and logging
542542+- [ ] Deploy to staging environment
543543+- [ ] Performance testing and optimization
544544+545545+## Future Enhancements
546546+547547+### Advanced Search
548548+- Fuzzy matching
549549+- Relevance scoring
550550+- Autocomplete for tags/names
551551+552552+### Recommendations
553553+- "Datasets similar to this one"
554554+- "Popular datasets in this category"
555555+- "Datasets by authors you follow"
556556+557557+### Analytics
558558+- Dataset usage tracking (downloads, views)
559559+- Trending datasets
560560+- Schema adoption statistics
561561+562562+### Social Features
563563+- Dataset comments/reviews
564564+- Ratings
565565+- Curation lists (e.g., "Best datasets for X")
566566+567567+### Federation
568568+- Multiple AppView instances
569569+- Cross-AppView search
570570+- Regional AppViews for performance
571571+572572+## Security Considerations
573573+574574+- **Rate limiting**: Prevent abuse of search API
575575+- **Input validation**: Sanitize all query parameters
576576+- **DDoS protection**: Use Cloudflare or similar
577577+- **Authentication** (optional): API keys for heavy users
578578+- **Data validation**: Verify record signatures from ATProto
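For the rate-limiting item, a token bucket is one common option. The sketch below is an in-process illustration only: `TokenBucket` and its parameters are hypothetical, and a multi-server AppView would need shared state (for example in Redis) rather than a per-process object.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Passing the clock in as a parameter keeps the limiter testable without sleeping.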
+799
.planning/05_codegen.md
···11+# Code Generation Tooling
22+33+## Overview
44+55+Code generation enables users to create `PackableSample` classes from schema records published on ATProto, making datasets truly interoperable across different codebases and even languages.
66+77+## Goals
88+99+1. **Automatic class generation**: Convert schema records to Python classes
1010+2. **Type safety**: Generate proper type hints and validation
1111+3. **Maintainability**: Generated code should be readable and maintainable
1212+4. **Cross-language support** (future): TypeScript, Rust, etc.
1313+1414+## Python Code Generation
1515+1616+### Input: Schema Record
1717+1818+```json
1919+{
2020+ "$type": "app.bsky.atdata.schema",
2121+ "name": "ImageSample",
2222+ "version": "1.0.0",
2323+ "description": "Sample containing an image with label",
2424+ "fields": [
2525+ {
2626+ "name": "image",
2727+ "type": { "kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3] },
2828+ "description": "RGB image with variable height/width"
2929+ },
3030+ {
3131+ "name": "label",
3232+ "type": { "kind": "primitive", "primitive": "str" },
3333+ "description": "Human-readable label"
3434+ },
3535+ {
3636+ "name": "confidence",
3737+ "type": { "kind": "primitive", "primitive": "float" },
3838+ "optional": true,
3939+ "description": "Optional confidence score"
4040+ }
4141+ ]
4242+}
4343+```
4444+4545+### Output: Python Code
4646+4747+```python
4848+"""
4949+ImageSample
5050+5151+Sample containing an image with label
5252+5353+Schema Version: 1.0.0
5454+Schema URI: at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm
5555+Generated: 2025-01-06T12:00:00Z
5656+"""
5757+5858+from dataclasses import dataclass
5959+from typing import Optional
6060+from numpy.typing import NDArray
6161+import atdata
6262+6363+6464+@atdata.packable
6565+class ImageSample:
6666+    """Sample containing an image with label"""
6767+6868+    #: RGB image with variable height/width
6969+    image: NDArray  # uint8, shape: [*, *, 3]
7070+7171+    #: Human-readable label
7272+    label: str
7373+7474+    #: Optional confidence score
7575+    confidence: Optional[float] = None
7676+```
7777+7878+## Code Generator Architecture
7979+8080+### Module Structure
8181+8282+```
8383+src/atdata/codegen/
8484+    __init__.py          # Public API
8585+    generator.py         # Core code generation logic
8686+    templates/           # Template files
8787+        python.jinja2    # Python class template
8888+    cli.py               # CLI interface
8989+    _validators.py       # Schema validation
9090+```
9191+9292+### Core Generator
9393+9494+**File**: `src/atdata/codegen/generator.py`
9595+9696+```python
9797+from typing import Optional
9898+from datetime import datetime, timezone
9999+from jinja2 import Environment, PackageLoader
100100+import atdata
101101+from ..atproto import ATProtoClient, SchemaLoader
102102+103103+class PythonGenerator:
104104+ """Generate Python PackableSample classes from schema records."""
105105+106106+ def __init__(self):
107107+ # Set up Jinja2 environment
108108+ self.env = Environment(
109109+ loader=PackageLoader('atdata.codegen', 'templates'),
110110+ trim_blocks=True,
111111+ lstrip_blocks=True
112112+ )
113113+114114+ # Register custom filters
115115+ self.env.filters['python_type'] = self._python_type_filter
116116+ self.env.filters['python_default'] = self._python_default_filter
117117+118118+ def generate_from_uri(
119119+ self,
120120+ client: ATProtoClient,
121121+ schema_uri: str,
122122+ output_path: Optional[str] = None
123123+ ) -> str:
124124+ """
125125+ Generate Python code from a schema URI.
126126+127127+ Args:
128128+ client: ATProto client
129129+ schema_uri: URI of the schema record
130130+ output_path: Optional path to write output file
131131+132132+ Returns:
133133+ Generated Python code as string
134134+ """
135135+ # Load schema record
136136+ loader = SchemaLoader(client)
137137+ schema = loader.get_schema(schema_uri)
138138+139139+ # Generate code
140140+ code = self.generate_from_record(schema, schema_uri)
141141+142142+ # Write to file if requested
143143+ if output_path:
144144+ with open(output_path, 'w') as f:
145145+ f.write(code)
146146+147147+ return code
148148+149149+ def generate_from_record(
150150+ self,
151151+ schema: dict,
152152+ schema_uri: str
153153+ ) -> str:
154154+ """
155155+ Generate Python code from a schema record dict.
156156+157157+ Args:
158158+ schema: Schema record dict
159159+ schema_uri: URI of the schema (for documentation)
160160+161161+ Returns:
162162+ Generated Python code
163163+ """
164164+ # Validate schema
165165+ self._validate_schema(schema)
166166+167167+ # Prepare template context
168168+ context = {
169169+ 'schema': schema,
170170+ 'schema_uri': schema_uri,
171171+ 'generated_at': datetime.now(timezone.utc).isoformat(),
172172+ 'fields': self._prepare_fields(schema['fields'])
173173+ }
174174+175175+ # Render template
176176+ template = self.env.get_template('python.jinja2')
177177+ code = template.render(**context)
178178+179179+ return code
180180+181181+ def _prepare_fields(self, fields: list[dict]) -> list[dict]:
182182+ """Prepare fields for template rendering."""
183183+ prepared = []
184184+185185+ for field in fields:
186186+ prepared.append({
187187+ 'name': field['name'],
188188+ 'type_annotation': self._field_type_to_python(field['type']),
189189+ 'optional': field.get('optional', False),
190190+ 'description': field.get('description', ''),
191191+ 'type_comment': self._type_comment(field['type'])
192192+ })
193193+194194+ return prepared
195195+196196+ def _field_type_to_python(self, field_type: dict) -> str:
197197+ """Convert schema field type to Python type annotation."""
198198+ kind = field_type['kind']
199199+200200+ if kind == 'primitive':
201201+ primitive_map = {
202202+ 'str': 'str',
203203+ 'int': 'int',
204204+ 'float': 'float',
205205+ 'bool': 'bool',
206206+ 'bytes': 'bytes'
207207+ }
208208+ return primitive_map[field_type['primitive']]
209209+210210+ elif kind == 'ndarray':
211211+ return 'NDArray'
212212+213213+ elif kind == 'nested':
214214+ # Extract class name from schema ref
215215+ # For now, just use a placeholder
216216+ ref = field_type['schemaRef']
217217+ return 'NestedType'  # TODO: resolve the class name from `ref`
218218+219219+ else:
220220+ raise ValueError(f"Unknown field type kind: {kind}")
221221+222222+ def _type_comment(self, field_type: dict) -> Optional[str]:
223223+ """Generate type comment for NDArray types."""
224224+ if field_type['kind'] == 'ndarray':
225225+ dtype = field_type['dtype']
226226+ shape = field_type.get('shape')
227227+ if shape:
228228+ shape_str = ', '.join('*' if s is None else str(s) for s in shape)
229229+ return f"{dtype}, shape: [{shape_str}]"
230230+ else:
231231+ return f"{dtype}"
232232+ return None
233233+234234+ def _python_type_filter(self, field: dict) -> str:
235235+ """Jinja2 filter to get Python type annotation."""
236236+ type_str = field['type_annotation']  # set by _prepare_fields(); prepared fields carry no raw 'type'
237237+ if field.get('optional'):
238238+ return f'Optional[{type_str}]'
239239+ return type_str
240240+241241+ def _python_default_filter(self, field: dict) -> Optional[str]:
242242+ """Jinja2 filter to get Python default value."""
243243+ if field.get('optional'):
244244+ return 'None'
245245+ return None
246246+247247+ def _validate_schema(self, schema: dict) -> None:
248248+ """Validate schema record structure."""
249249+ required = ['name', 'version', 'fields']
250250+ for field in required:
251251+ if field not in schema:
252252+ raise ValueError(f"Schema missing required field: {field}")
253253+254254+ if not isinstance(schema['fields'], list):
255255+ raise ValueError("Schema fields must be a list")
256256+257257+ for field in schema['fields']:
258258+ if 'name' not in field or 'type' not in field:
259259+ raise ValueError(f"Field missing name or type: {field}")
260260+```
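Because `schema['name']` and each field name are pasted directly into generated source, `_validate_schema` may also want to confirm that they are plain Python identifiers. The following is a sketch of such a check; `require_identifier` and `check_schema_names` are hypothetical helpers, not part of the design above.

```python
import keyword


def require_identifier(name: object, what: str = "name") -> str:
    """Return `name` if it is usable as a Python identifier, else raise ValueError."""
    if not isinstance(name, str) or not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"{what} is not a usable Python identifier: {name!r}")
    return name


def check_schema_names(schema: dict) -> None:
    """Reject schema and field names that would produce invalid or unsafe code."""
    require_identifier(schema["name"], "schema name")
    seen = set()
    for field in schema["fields"]:
        field_name = require_identifier(field["name"], "field name")
        if field_name in seen:
            raise ValueError(f"duplicate field name: {field_name!r}")
        seen.add(field_name)
```

Since schema records come from arbitrary publishers, rejecting names such as `"Foo:\n    import os"` before rendering keeps the template from emitting attacker-controlled statements.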
261261+262262+### Template File
263263+264264+**File**: `src/atdata/codegen/templates/python.jinja2`
265265+266266+```jinja2
267267+"""
268268+{{ schema.name }}
269269+270270+{{ schema.description }}
271271+272272+Schema Version: {{ schema.version }}
273273+Schema URI: {{ schema_uri }}
274274+Generated: {{ generated_at }}
275275+276276+⚠️ This file was automatically generated from an ATProto schema record.
277277+ Do not edit manually - regenerate using `atdata codegen` instead.
278278+"""
280280+from dataclasses import dataclass
281281+{% if fields | selectattr('optional') | list %}
282282+from typing import Optional
283283+{% endif %}
284284+{% if fields | selectattr('type_annotation', 'equalto', 'NDArray') | list %}
285285+from numpy.typing import NDArray
286286+{% endif %}
287287+import atdata
288288+289289+290290+@atdata.packable
291291+class {{ schema.name }}:
292292+    """{{ schema.description }}"""
293293+294294+{% for field in fields %}
295295+    {% if field.description %}
296296+    #: {{ field.description }}
297297+    {% endif %}
298298+    {{ field.name }}: {{ field | python_type }}
299299+    {{- (' = ' ~ (field | python_default)) if field.optional else '' }}
300300+    {{- ('  # ' ~ field.type_comment) if field.type_comment else '' }}
301301+302302+{% endfor %}
303303+```
304304+305305+### CLI Interface
306306+307307+**File**: `src/atdata/codegen/cli.py`
308308+309309+```python
310310+import click
311311+from pathlib import Path
312312+from ..atproto import ATProtoClient
313313+from .generator import PythonGenerator
314314+315315+316316+@click.group()
317317+def codegen():
318318+ """Code generation tools for atdata."""
319319+ pass
320320+321321+322322+@codegen.command()
323323+@click.argument('schema_uri')
324324+@click.option('--output', '-o', type=click.Path(), help='Output file path')
325325+@click.option('--handle', '-u', help='ATProto handle for authentication')
326326+@click.option('--password', '-p', help='ATProto password')
327327+@click.option('--language', '-l', default='python', type=click.Choice(['python']), help='Output language')
328328+def generate(schema_uri: str, output: str, handle: str, password: str, language: str):
329329+ """Generate code from a schema URI.
330330+331331+ Example:
332332+ atdata codegen generate at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm -o my_sample.py
333333+ """
334334+ # Initialize client
335335+ client = ATProtoClient()
336336+337337+ # Authenticate if credentials provided
338338+ if handle and password:
339339+ client.login(handle, password)
340340+341341+ # Generate code
342342+ generator = PythonGenerator()
343343+344344+ try:
345345+ code = generator.generate_from_uri(client, schema_uri, output)
346346+347347+ if output:
348348+ click.echo(f"Generated {language} code written to {output}")
349349+ else:
350350+ click.echo(code)
351351+352352+ except Exception as e:
353353+ click.echo(f"Error generating code: {e}", err=True)
354354+ raise click.Abort()
355355+356356+357357+@codegen.command()
358358+@click.argument('schema_uris', nargs=-1, required=True)
359359+@click.option('--output-dir', '-d', type=click.Path(), required=True, help='Output directory')
360360+@click.option('--handle', '-u', help='ATProto handle for authentication')
361361+@click.option('--password', '-p', help='ATProto password')
362362+def batch(schema_uris: tuple, output_dir: str, handle: str, password: str):
363363+ """Generate code for multiple schemas.
364364+365365+ Example:
366366+ atdata codegen batch schema1_uri schema2_uri -d ./generated
367367+ """
368368+ # Create output directory
369369+ output_path = Path(output_dir)
370370+ output_path.mkdir(parents=True, exist_ok=True)
371371+372372+ # Initialize client
373373+ client = ATProtoClient()
374374+ if handle and password:
375375+ client.login(handle, password)
376376+377377+ # Generate code for each schema
378378+ generator = PythonGenerator()
379379+380380+ for schema_uri in schema_uris:
381381+ try:
382382+ # Load schema to get name
383383+ from ..atproto import SchemaLoader
384384+ loader = SchemaLoader(client)
385385+ schema = loader.get_schema(schema_uri)
386386+387387+ # Generate output path from schema name
388388+ filename = f"{schema['name'].lower()}.py"
389389+ output_file = output_path / filename
390390+391391+ # Generate code
392392+ generator.generate_from_uri(client, schema_uri, str(output_file))
393393+394394+ click.echo(f"Generated {filename}")
395395+396396+ except Exception as e:
397397+ click.echo(f"Error generating code for {schema_uri}: {e}", err=True)
398398+399399+400400+if __name__ == '__main__':
401401+ codegen()
402402+```
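The `batch` command derives filenames with `schema['name'].lower()`, so `ImageSample` becomes `imagesample.py`. If snake_case module names are preferred, a conversion along these lines could be used instead. This is a sketch of one alternative, and `module_filename` is a hypothetical helper.

```python
import re


def module_filename(schema_name: str) -> str:
    """Convert a CamelCase schema name into a snake_case module filename."""
    # Break between a lowercase letter or digit and the uppercase letter after it.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", schema_name)
    # Break between an acronym and the capitalized word that follows it.
    name = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", name)
    return name.lower() + ".py"
```

Either naming scheme works as long as it is applied consistently, since the filename becomes the import path for the generated class.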
403403+404404+### Integration with Main CLI
405405+406406+**File**: `src/atdata/cli.py` (new or extend existing)
407407+408408+```python
409409+import click
410410+from .codegen.cli import codegen as codegen_group
411411+412412+@click.group()
413413+def main():
414414+ """atdata command-line interface."""
415415+ pass
416416+417417+# Add codegen subcommand
418418+main.add_command(codegen_group)
419419+420420+if __name__ == '__main__':
421421+ main()
422422+```
423423+424424+**Update** `pyproject.toml`:
425425+426426+```toml
427427+[project.scripts]
428428+atdata = "atdata.cli:main"
429429+```
430430+431431+## Usage Examples
432432+433433+### Generate Single Schema
434434+435435+```bash
436436+# Generate Python code from schema URI
437437+atdata codegen generate \
438438+ at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm \
439439+ -o image_sample.py
440440+441441+# Output to stdout instead
442442+atdata codegen generate \
443443+ at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm
444444+```
445445+446446+### Batch Generation
447447+448448+```bash
449449+# Generate multiple schemas to a directory
450450+atdata codegen batch \
451451+ at://did:plc:abc123/app.bsky.atdata.schema/schema1 \
452452+ at://did:plc:abc123/app.bsky.atdata.schema/schema2 \
453453+ at://did:plc:abc123/app.bsky.atdata.schema/schema3 \
454454+ -d ./generated_schemas
455455+```
456456+457457+### Programmatic Usage
458458+459459+```python
460460+from atdata.atproto import ATProtoClient
461461+from atdata.codegen import PythonGenerator
462462+import atdata
463463+# Initialize
464464+client = ATProtoClient()
465465+client.login("alice.bsky.social", "password")
466466+467467+# Generate code
468468+generator = PythonGenerator()
469469+code = generator.generate_from_uri(
470470+ client,
471471+ "at://did:plc:abc123/app.bsky.atdata.schema/3jk2lo34klm",
472472+ output_path="my_sample.py"
473473+)
475475+# The generated class can now be imported and used
476476+from my_sample import ImageSample
477477+478478+# Use with Dataset
479479+dataset = atdata.Dataset[ImageSample](url="s3://bucket/data-{000000..000009}.tar")
480480+```
481481+482482+## Type Validation
483483+484484+### Schema Compatibility Checking
485485+486486+```python
487487+# Proposed home for this class: src/atdata/codegen/_validators.py
488488+489489+class SchemaValidator:
490490+ """Validate schema compatibility and evolution."""
491491+492492+ def is_compatible(self, old_schema: dict, new_schema: dict) -> tuple[bool, list[str]]:
493493+ """
494494+ Check if new_schema is compatible with old_schema.
495495+496496+ Returns:
497497+ (is_compatible, list_of_incompatibilities)
498498+ """
499499+ incompatibilities = []
500500+501501+ # Check for removed fields
502502+ old_fields = {f['name']: f for f in old_schema['fields']}
503503+ new_fields = {f['name']: f for f in new_schema['fields']}
504504+505505+ for name in old_fields:
506506+ if name not in new_fields:
507507+ incompatibilities.append(f"Field removed: {name}")
508508+509509+ # Check for type changes
510510+ for name in old_fields:
511511+ if name in new_fields:
512512+ old_type = old_fields[name]['type']
513513+ new_type = new_fields[name]['type']
514514+ if old_type != new_type:
515515+ incompatibilities.append(
516516+ f"Field type changed: {name} from {old_type} to {new_type}"
517517+ )
518518+519519+ # Check for optional -> required changes
520520+ for name in old_fields:
521521+ if name in new_fields:
522522+ was_optional = old_fields[name].get('optional', False)
523523+ is_optional = new_fields[name].get('optional', False)
524524+ if was_optional and not is_optional:
525525+ incompatibilities.append(
526526+ f"Field changed from optional to required: {name}"
527527+ )
528528+529529+ return len(incompatibilities) == 0, incompatibilities
530530+531531+ def validate_evolution(self, old_version: str, new_version: str) -> bool:
532532+ """Validate that version numbers follow semantic versioning."""
533533+ # Parse versions
534534+ old_major, old_minor, old_patch = map(int, old_version.split('.'))
535535+ new_major, new_minor, new_patch = map(int, new_version.split('.'))
536536+537537+ # Major version should increment for breaking changes
538538+ # Minor version should increment for compatible additions
539539+ # Patch version should increment for bug fixes
540540+541541+ return new_major >= old_major
542542+```
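`validate_evolution` above only checks that the major version does not decrease. A stricter variant could tie the required version bump to the kind of change detected between two schema records. The sketch below is one hypothetical policy: `required_bump` and `bump_is_sufficient` are illustrative names, and treating a newly added required field as a breaking change is an assumption that goes beyond `is_compatible`.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def required_bump(old_schema: dict, new_schema: dict) -> str:
    """Classify a schema change as needing a 'major', 'minor', or 'patch' bump."""
    old = {f["name"]: f for f in old_schema["fields"]}
    new = {f["name"]: f for f in new_schema["fields"]}
    for name, old_field in old.items():
        if name not in new:
            return "major"  # field removed
        if new[name]["type"] != old_field["type"]:
            return "major"  # field type changed
        if old_field.get("optional", False) and not new[name].get("optional", False):
            return "major"  # optional field became required
    added = [f for name, f in new.items() if name not in old]
    if any(not f.get("optional", False) for f in added):
        return "major"  # assumption: existing data lacks the new required field
    return "minor" if added else "patch"


def bump_is_sufficient(old_version: str, new_version: str, required: str) -> bool:
    """Check that new_version moves forward by at least the required level."""
    old, new = parse_version(old_version), parse_version(new_version)
    if new <= old:
        return False
    if required == "major":
        return new[0] > old[0]
    if required == "minor":
        return new[:2] > old[:2]
    return True
```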
543543+544544+### Runtime Type Validation
545545+546546+```python
547547+import atdata
548548+549549+class TypeValidator:
550550+ """Validate sample instances against schemas."""
551551+552552+ def validate(self, sample: atdata.PackableSample, schema: dict) -> tuple[bool, list[str]]:
553553+ """
554554+ Validate that a sample instance conforms to a schema.
555555+556556+ Returns:
557557+ (is_valid, list_of_errors)
558558+ """
559559+ errors = []
560560+561561+ # Check all required fields present
562562+ schema_fields = {f['name']: f for f in schema['fields']}
563563+564564+ for field_name, field_def in schema_fields.items():
565565+ if not field_def.get('optional', False):
566566+ if not hasattr(sample, field_name):
567567+ errors.append(f"Missing required field: {field_name}")
568568+569569+ # Check field types
570570+ for field_name, field_def in schema_fields.items():
571571+ if hasattr(sample, field_name):
572572+ value = getattr(sample, field_name)
573573+ if value is not None:
574574+ type_valid = self._validate_field_type(value, field_def['type'])
575575+ if not type_valid:
576576+ errors.append(
577577+ f"Invalid type for field {field_name}: "
578578+ f"expected {field_def['type']}, got {type(value)}"
579579+ )
580580+581581+ return len(errors) == 0, errors
582582+583583+ def _validate_field_type(self, value, field_type: dict) -> bool:
584584+ """Validate that value matches field type."""
585585+ kind = field_type['kind']
586586+587587+ if kind == 'primitive':
588588+ primitive_types = {
589589+ 'str': str,
590590+ 'int': int,
591591+ 'float': float,
592592+ 'bool': bool,
593593+ 'bytes': bytes
594594+ }
595595+ expected_type = primitive_types[field_type['primitive']]
596596+ return isinstance(value, expected_type)
597597+598598+ elif kind == 'ndarray':
599599+ import numpy as np
600600+ if not isinstance(value, np.ndarray):
601601+ return False
602602+603603+ # Check dtype if specified
604604+ if 'dtype' in field_type:
605605+ expected_dtype = np.dtype(field_type['dtype'])
606606+ if value.dtype != expected_dtype:
607607+ return False
608608+609609+ # Check shape if specified
610610+ if 'shape' in field_type and field_type['shape']:
611611+ expected_shape = field_type['shape']
612612+ if len(value.shape) != len(expected_shape):
613613+ return False
614614+ for actual_dim, expected_dim in zip(value.shape, expected_shape):
615615+ if expected_dim is not None and actual_dim != expected_dim:
616616+ return False
617617+618618+ return True
619619+620620+ return True
621621+```
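One pitfall in the primitive check above: `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is true, while an integer value fails `isinstance(value, float)`. A stricter comparison might look like the sketch below. `matches_primitive` is a hypothetical helper, and accepting integers for `float` fields is a policy assumption rather than something the schema format specifies.

```python
def matches_primitive(value: object, primitive: str) -> bool:
    """Check a value against a schema primitive name without bool/int confusion."""
    if primitive == "bool":
        return isinstance(value, bool)
    if primitive == "int":
        # Exclude bool explicitly, since bool is a subclass of int.
        return isinstance(value, int) and not isinstance(value, bool)
    if primitive == "float":
        # Assumed policy: ints are acceptable where a float is expected.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if primitive == "str":
        return isinstance(value, str)
    if primitive == "bytes":
        return isinstance(value, bytes)
    raise ValueError(f"Unknown primitive: {primitive}")
```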
622622+623623+## Testing
624624+625625+### Unit Tests
626626+627627+```python
628628+import pytest
629629+from atdata.codegen import PythonGenerator
630630+631631+def test_generate_simple_schema():
632632+ """Test generating code from a simple schema."""
633633+ schema = {
634634+ "name": "TestSample",
635635+ "version": "1.0.0",
636636+ "description": "Test sample",
637637+ "fields": [
638638+ {
639639+ "name": "field1",
640640+ "type": {"kind": "primitive", "primitive": "str"}
641641+ }
642642+ ]
643643+ }
644644+645645+ generator = PythonGenerator()
646646+ code = generator.generate_from_record(schema, "at://test/schema/123")
647647+648648+ # Check that code contains expected elements
649649+ assert "@atdata.packable" in code
650650+ assert "class TestSample:" in code
651651+ assert "field1: str" in code
652652+653653+654654+def test_generate_ndarray_field():
655655+ """Test generating code with NDArray fields."""
656656+ schema = {
657657+ "name": "ImageSample",
658658+ "version": "1.0.0",
659659+ "description": "Image sample",
660660+ "fields": [
661661+ {
662662+ "name": "image",
663663+ "type": {
664664+ "kind": "ndarray",
665665+ "dtype": "uint8",
666666+ "shape": [None, None, 3]
667667+ }
668668+ }
669669+ ]
670670+ }
671671+672672+ generator = PythonGenerator()
673673+ code = generator.generate_from_record(schema, "at://test/schema/456")
674674+675675+ assert "from numpy.typing import NDArray" in code
676676+ assert "image: NDArray" in code
677677+ assert "# uint8, shape: [*, *, 3]" in code
678678+679679+680680+def test_optional_fields():
681681+ """Test generating code with optional fields."""
682682+ schema = {
683683+ "name": "OptionalSample",
684684+ "version": "1.0.0",
685685+ "description": "Sample with optional fields",
686686+ "fields": [
687687+ {
688688+ "name": "required_field",
689689+ "type": {"kind": "primitive", "primitive": "str"}
690690+ },
691691+ {
692692+ "name": "optional_field",
693693+ "type": {"kind": "primitive", "primitive": "int"},
694694+ "optional": True
695695+ }
696696+ ]
697697+ }
698698+699699+ generator = PythonGenerator()
700700+ code = generator.generate_from_record(schema, "at://test/schema/789")
701701+702702+ assert "from typing import Optional" in code
703703+ assert "required_field: str" in code
704704+ assert "optional_field: Optional[int] = None" in code
705705+```
706706+707707+### Integration Tests
708708+709709+```python
710710+def test_generate_and_import():
711711+ """Test that generated code can be imported and used."""
712712+ import tempfile
713713+ import importlib.util
714714+ import atdata
715715+ schema = {
716716+ "name": "GeneratedSample",
717717+ "version": "1.0.0",
718718+ "description": "Generated sample",
719719+ "fields": [
720720+ {"name": "x", "type": {"kind": "primitive", "primitive": "int"}}
721721+ ]
722722+ }
723723+724724+ generator = PythonGenerator()
725725+726726+ # Generate code to temp file
727727+ with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
728728+ code = generator.generate_from_record(schema, "at://test/schema/123")
729729+ f.write(code)
730730+ temp_path = f.name
731731+732732+ # Import the generated module
733733+ spec = importlib.util.spec_from_file_location("generated", temp_path)
734734+ module = importlib.util.module_from_spec(spec)
735735+ spec.loader.exec_module(module)
736736+737737+ # Test instantiation
738738+ sample = module.GeneratedSample(x=42)
739739+ assert sample.x == 42
740740+741741+ # Test serialization
742742+ assert isinstance(sample, atdata.PackableSample)
743743+ packed = sample.packed
744744+ assert isinstance(packed, bytes)
745745+```
746746+747747+## Implementation Checklist (Phase 4)
748748+749749+- [ ] Implement `PythonGenerator` core logic
750750+- [ ] Create Jinja2 template for Python classes
751751+- [ ] Add CLI commands (`generate`, `batch`)
752752+- [ ] Implement schema validation
753753+- [ ] Implement type compatibility checking
754754+- [ ] Write unit tests for generator
755755+- [ ] Write integration tests (generate + import)
756756+- [ ] Add documentation and examples
757757+- [ ] Consider edge cases (nested types, complex shapes)
758758+759759+## Future Extensions
760760+761761+### Multi-Language Support
762762+763763+**TypeScript Generator**:
764764+```typescript
765765+// Generated from schema
766766+export interface ImageSample {
767767+ image: number[][][]; // uint8, [*, *, 3]
768768+ label: string;
769769+ confidence?: number;
770770+}
771771+```
772772+773773+**Rust Generator**:
774774+```rust
775775+// Generated from schema
776776+#[derive(Debug, Clone, Serialize, Deserialize)]
777777+pub struct ImageSample {
778778+ /// RGB image with variable height/width
779779+ pub image: ndarray::Array3<u8>,
780780+ /// Human-readable label
781781+ pub label: String,
782782+ /// Optional confidence score
783783+ pub confidence: Option<f64>,
784784+}
785785+```
786786+787787+### Advanced Features
788788+789789+- **Backwards compatibility checks**: Ensure schema updates don't break existing code
790790+- **Migration generators**: Generate migration code for schema evolution
791791+- **Validation decorators**: Runtime validation of generated classes
792792+- **Documentation generation**: Generate API docs from schemas
793793+- **IDE support**: Language server protocol support for autocomplete
794794+795795+### Code Quality
796796+797797+- **Formatting**: Run `black` on generated Python code
798798+- **Linting**: Ensure generated code passes `ruff`/`flake8`
799799+- **Type checking**: Ensure generated code passes `mypy`
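Before running external formatters or type checkers, the generator could perform a cheap standard-library sanity check on its own output. The sketch below parses the generated source and flags a required field that follows a field with a default, which dataclasses reject at class-creation time. It assumes `@atdata.packable` applies dataclass semantics; `check_generated_source` is a hypothetical helper.

```python
import ast


def check_generated_source(code: str, expected_class: str) -> list[str]:
    """Return a list of problems found in generated source (empty if none)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    classes = {node.name: node for node in tree.body if isinstance(node, ast.ClassDef)}
    if expected_class not in classes:
        return [f"class {expected_class!r} not found"]

    problems = []
    seen_default = False
    for stmt in classes[expected_class].body:
        if not isinstance(stmt, ast.AnnAssign) or not isinstance(stmt.target, ast.Name):
            continue
        if stmt.value is not None:
            seen_default = True
        elif seen_default:
            problems.append(
                f"required field {stmt.target.id!r} follows a field with a default"
            )
    return problems
```

A schema that lists an optional field before a required one would trigger the ordering problem, so the generator may need to sort fields or reject such schemas.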
+195
.planning/README.md
···11+# ATProto Integration Planning
22+33+This directory contains comprehensive planning documents for integrating AT Protocol into the `atdata` library, transforming it into a distributed dataset federation.
44+55+## Planning Documents
66+77+### Design Decisions
88+99+📋 **[decisions/](decisions/)** - Critical design decisions with detailed analysis
1010+- Each decision has its own document with options, recommendations, and rationale
1111+- See [decisions/README.md](decisions/README.md) for navigation guide
1212+- **Must be reviewed and finalized before Phase 1 implementation**
1313+1414+### Architecture & Design
1515+1616+1. **[01_overview.md](01_overview.md)** - High-level vision, architecture, and project roadmap
1717+ - Overall vision for distributed datasets on ATProto
1818+ - System architecture diagram
1919+ - Development phases and dependencies
2020+ - Open design questions
2121+2222+2. **[02_lexicon_design.md](02_lexicon_design.md)** - Detailed Lexicon schema specifications
2323+ - Schema Record Lexicon (for PackableSample types)
2424+ - Dataset Record Lexicon (for dataset indexes)
2525+ - Lens Record Lexicon (for transformations)
2626+ - Schema representation format decision
2727+ - Example records
2828+2929+3. **[03_python_client.md](03_python_client.md)** - Python library architecture and API design
3030+ - ATProtoClient for authentication
3131+ - SchemaPublisher/Loader
3232+ - DatasetPublisher/Loader
3333+ - LensPublisher
3434+ - Integration with existing Dataset class
3535+ - Testing strategy
3636+3737+4. **[04_appview.md](04_appview.md)** - AppView aggregation service design
3838+ - Service architecture
3939+ - Database schema (PostgreSQL, Elasticsearch)
4040+ - HTTP API endpoints
4141+ - Firehose consumer
4242+ - Deployment options
4343+ - Performance considerations
4444+4545+5. **[05_codegen.md](05_codegen.md)** - Code generation tooling
4646+ - Python code generator from schema records
4747+ - CLI interface
4848+ - Template system
4949+ - Type validation and compatibility checking
5050+ - Future multi-language support
5151+5252+## Milestone Tracking
5353+5454+**Milestone**: ATProto Integration (Milestone #1)
5555+**Total Issues**: 34 (6 parent issues + 28 subissues)
5656+5757+### Planning Phase (Issue #44)
5858+5959+**Status**: In progress
6060+**Priority**: High (blocks Phase 1)
6161+6262+Critical decisions needed before implementation:
6363+- Decide on schema representation format (#45)
6464+- Decide on Lens code storage approach (#46)
6565+- Decide on WebDataset storage strategy (#47)
6666+- Design schema evolution and versioning strategy (#48)
6767+- Finalize Lexicon namespace and NSID structure (#49)
6868+- Review and validate Lexicon JSON definitions (#50)
6969+7070+**All decisions have detailed analysis in planning documents with recommendations.**
7171+7272+### Phase Breakdown
7373+7474+#### Phase 1: Lexicon Design & Schema Definition (Issue #17)
7575+- Design Lexicon for PackableSample schema storage (#22)
7676+- Design Lexicon for dataset index records (#23)
7777+- Design Lexicon for Lens transformation records (#24)
7878+- Evaluate schema representation formats (#25)
7979+8080+**Status**: Blocked by Planning (#44)
8181+**Priority**: High (blocks all other phases)
8282+8383+#### Phase 2: Python Client Library (Issue #18)
8484+- Implement ATProto authentication and session management (#26)
8585+- Implement schema publishing to ATProto (#27)
8686+- Implement dataset index record publishing (#28)
8787+- Implement Lens transformation publishing (#29)
8888+- Implement querying and discovery of datasets (#30)
8989+- Extend Dataset class to load from ATProto records (#31)
9090+9191+**Status**: Blocked by Phase 1
9292+**Priority**: High (critical path)
9393+9494+#### Phase 3: AppView & Index Aggregation Service (Issue #19)
9595+- Design AppView architecture and data model (#32)
9696+- Implement record ingestion from ATProto firehose (#33)
9797+- Implement search and query API (#34)
9898+- Add caching and indexing for performance (#35)
9999+100100+**Status**: Blocked by Phase 2
101101+**Priority**: Medium (optional infrastructure)
102102+103103+#### Phase 4: Code Generation Tooling (Issue #20)
104104+- Design code generation template system (#36)
105105+- Implement Python code generator from schema records (#37)
106106+- Add CLI for code generation (#38)
107107+- Support type validation and compatibility checking (#39)
108108+109109+**Status**: Blocked by Phase 2
110110+**Priority**: Medium (can run parallel with Phase 3)
111111+112112+#### Phase 5: End-to-End Integration & Testing (Issue #21)
113113+- Create end-to-end example workflows (#40)
114114+- Write integration tests for full publish/discover/load cycle (#41)
115115+- Create comprehensive documentation (#42)
116116+- Performance testing and optimization (#43)
117117+118118+**Status**: Blocked by Phase 2
119119+**Priority**: High (required for production release)
120120+121121+## Getting Started
122122+123123+To begin implementation:
124124+125125+1. **Review design decisions** in `decisions/` directory - these need your input first
126126+2. **Review architecture documents** (01-05) to understand the full scope
127127+3. **Provide feedback** on the design decisions and open questions
128128+4. **Finalize decisions** for issues #45-49
129129+5. **Validate Lexicons** (issue #50) once decisions are made
130130+6. **Begin Phase 1 implementation** after validation
131131+7. **Track progress** using chainlink issues
132132+133133+### Quick Start for Decision Review
134134+135135+1. Read [decisions/README.md](decisions/README.md) for overview
136136+2. Review each decision document (01-06)
137137+3. For each decision:
138138+ - Agree with recommendation? → Comment on issue
139139+ - Disagree? → Propose alternative in issue
140140+ - Unsure? → Discuss open questions
141141+4. Once all decisions made → Proceed to issue #50 (validation)
142142+143143+## Key Design Decisions Needed
144144+145145+Before starting implementation, we need decisions on (see Issue #44 and subissues #45-50):
146146+147147+1. **Schema representation format** (Issue #45)
148148+ - Recommendation: Custom format within ATProto Lexicon
149149+ - Alternative: JSON Schema or Protobuf
150150+ - Details in `02_lexicon_design.md`
151151+152152+2. **Lens code storage** (Issue #46)
153153+ - Recommendation: Code references (GitHub + commit) only
154154+ - Alternative: Allow inline code (security concerns)
155155+ - Details in `02_lexicon_design.md`
156156+157157+3. **WebDataset storage location** (Issue #47)
158158+ - Phase 1: External storage (S3, HTTP) - just URLs
159159+ - Future: ATProto blob storage for smaller datasets
160160+ - Details in `02_lexicon_design.md`
161161+162162+4. **Schema evolution strategy** (Issue #48)
163163+ - How to handle versioning and compatibility
164164+ - Migration path for breaking changes
165165+ - Details in `05_codegen.md`
166166+167167+5. **Lexicon namespace** (Issue #49)
168168+ - Current proposal: `app.bsky.atdata.*`
169169+ - May need to coordinate with ATProto/Bluesky team
170170+ - Details in `02_lexicon_design.md`
171171+172172+6. **Lexicon validation** (Issue #50)
173173+ - Validate all Lexicon JSON against ATProto spec
174174+ - Create example records for testing
175175+ - Blocked by decisions #45-49
176176+177177+## Questions for Discussion
178178+179179+Review the "Open Design Questions" sections in each planning document, particularly:
180180+181181+- `01_overview.md` - Overall architecture questions
182182+- `02_lexicon_design.md` - Lexicon-specific design questions (CRITICAL for Phase 1)
183183+184184+## Next Steps
185185+186186+1. Review planning documents
187187+2. Discuss and finalize design decisions
188188+3. Begin Phase 1 implementation
189189+4. Iterate and refine as we learn
190190+191191+---
192192+193193+**Milestone Created**: 2026-01-07
194194+**Last Updated**: 2026-01-07
195195+**Status**: Planning complete, ready for review
+19
.planning/atproto_integration.md
···11+# Planning for full atproto integration
22+33+The overall goal for `atdata` is for the dataset index to live on the distributed atproto repository. One type of Lexicon schema holds the information about `PackableSample` schemas, which can be reproduced with code gen. Another is designed for the main functionality: records holding the links to the WDS dataset for samples and the msgpack metadata (which can be plugged into the `Dataset` class), as well as a reference to the atproto record containing the schema for the dataset's sample type.
44+55+## Thoughts on functionality
66+77+* Lexicons
88+ * Definition of a `PackableSample`-compatible sample type schema that can be used to reconstitute the code in appropriate languages using code gen tooling
99+ * Index records that contain links to the actual WebDataset data, as well as to the records with the corresponding sample schema.
1010+ * `Lenses` between defined sample type schemas across the network.
1111+* Python library functionality
1212+ * Logging in with the atproto sdk
1313+ * Posting sample schemas and dataset index records to the appropriate lexicons for the user
1414+* AppView functionality
1515+ * Aggregating index records, making an index of those that is quick to query on
1616+1717+## Questions for implementation
1919+* What is the best way to store the sample type schemas within atproto Lexicons? I've thought about using JSON Schema or Protobuf, but want to think through the possibilities.
···11+# Decision: Schema Representation Format
22+33+**Issue**: #45
44+**Status**: Needs decision
55+**Blocks**: #50 (Lexicon validation)
66+**Priority**: Critical for Phase 1
77+88+## DECISION
1010+Let's go with the **JSON Schema** approach; the only real issue we have to worry about here is the `NDArray` support, and we can solve that by:
1111+1212+* Adding a standardized JSON Schema shim to represent an `NDArray` as its serialized bytes
1313+* Referencing this as the type within other schemas, and making this the standard we use
1414+1515+We'll make this decision future-proof by adding a property in the Lexicon for schemas that gives the type of schema definition, with one currently supported value (for JSON Schema), and then leave the standard overall as an open union, as is standard for atproto lexicons.
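As a rough illustration of this decision, the sketch below builds a schema record as a Python dict. The field names `schemaType` and `schema`, the tag value `jsonSchema`, the `$defs/ndarray` layout, and the helper `make_schema_record` are all assumptions made for the example, not settled Lexicon design. Since JSON Schema has no bytes type, the shim here assumes the serialized array is carried as a base64 string.

```python
import json

# Hypothetical shim: an NDArray is represented as its serialized bytes.
# Assumption: the bytes travel as a base64-encoded string.
NDARRAY_SHIM = {
    "type": "string",
    "contentEncoding": "base64",
    "description": "NDArray stored as its serialized bytes",
}


def make_schema_record(name: str, properties: dict, required: list[str]) -> dict:
    """Wrap a JSON Schema body in a record tagged with its schema type."""
    return {
        "name": name,
        # Assumed tag; JSON Schema is the only supported value for now.
        "schemaType": "jsonSchema",
        "schema": {
            "type": "object",
            "$defs": {"ndarray": NDARRAY_SHIM},
            "properties": properties,
            "required": required,
        },
    }


record = make_schema_record(
    "ImageSample",
    {
        # Other schemas reference the shim instead of redefining it.
        "image": {"$ref": "#/$defs/ndarray"},
        "label": {"type": "string"},
    },
    required=["image", "label"],
)
print(json.dumps(record, indent=2))
```

Keeping the schema type as an explicit tag means a later format could be added as a new tag value without changing existing records.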
1616+1717+---
1818+1919+## Problem Statement
2020+2121+We need to decide how to represent `PackableSample` type definitions within ATProto Lexicon records. This affects:
2222+- How schemas are stored and transmitted
2323+- Code generation complexity
2424+- Cross-language interoperability
2525+- Tooling ecosystem availability
2626+2727+## Context
2828+2929+`PackableSample` types have specific requirements:
3030+- Support for primitive types (str, int, float, bool, bytes)
3131+- **Special handling for `NDArray` types** with dtype and shape information
3232+- Msgpack serialization metadata
3333+- Optional/required field semantics
3434+- Future extensibility (constraints, validation, nested types)
3535+3636+## Options
3737+3838+### Option 1: Custom Format within ATProto Lexicon ⭐ RECOMMENDED
3939+4040+**Description**: Define our own type system using ATProto Lexicon primitives
4141+4242+**Example**:
4343+```json
4444+{
4545+ "name": "image",
4646+ "type": {
4747+ "kind": "ndarray",
4848+ "dtype": "uint8",
4949+ "shape": [null, null, 3]
5050+ },
5151+ "optional": false,
5252+ "description": "RGB image with variable height/width"
5353+}
5454+```
5555+5656+**Pros**:
5757+- ✅ Native to ATProto - no external dependencies
5858+- ✅ Tailored exactly to `PackableSample` needs
5959+- ✅ Clean representation of NDArray (dtype, shape constraints)
6060+- ✅ Full control over codegen implementation
6161+- ✅ Can evolve independently
6262+- ✅ Easy to extend (add constraints, validation rules, etc.)
6363+6464+**Cons**:
6565+- ❌ Need to implement our own codegen tooling
6666+- ❌ Less ecosystem tooling available
6767+- ❌ Need to maintain custom parsers
6868+6969+**Implementation Effort**: Medium
7070+- Lexicon design: ~2-3 days
7171+- Python codegen: ~5-7 days
7272+- Validation: ~2-3 days
7373+7474+---
7575+7676+### Option 2: JSON Schema
7777+7878+**Description**: Use JSON Schema as the type definition format
7979+8080+**Example**:
8181+```json
8282+{
8383+ "type": "object",
8484+ "properties": {
8585+ "image": {
8686+ "type": "object",
8787+ "x-atdata-type": "ndarray",
8888+ "x-dtype": "uint8",
8989+ "x-shape": [null, null, 3]
9090+ }
9191+ },
9292+ "required": ["image"]
9393+}
9494+```
9595+9696+**Pros**:
9797+- ✅ Industry standard, widely understood
9898+- ✅ Extensive validation tooling exists
9999+- ✅ Many language implementations
100100+101101+**Cons**:
102102+- ❌ Not designed for code generation
103103+- ❌ Awkward NDArray representation (need custom extensions like `x-atdata-type`)
104104+- ❌ Overly complex for our needs
105105+- ❌ Still need custom codegen despite standard format
106106+- ❌ Doesn't map cleanly to Python dataclasses
107107+108108+**Implementation Effort**: Medium-High
109109+- Still need custom codegen despite standard format
110110+- JSON Schema parsers available but adaptation needed
111111+112112+---
113113+114114+### Option 3: Protobuf (Protocol Buffers)
115115+116116+**Description**: Use Protobuf schema definitions
117117+118118+**Example**:
119119+```protobuf
120120+message ImageSample {
121121+ bytes image = 1; // NDArray serialized
122122+ string label = 2;
123123+ optional float confidence = 3;
124124+}
125125+```
126126+127127+**Pros**:
128128+- ✅ Excellent codegen ecosystem (Python, TypeScript, Rust, etc.)
129129+- ✅ Compact binary format
130130+- ✅ Strong cross-language support
131131+- ✅ Built-in versioning/evolution support
132132+133133+**Cons**:
134134+- ❌ Not ATProto-native (different ecosystem)
135135+- ❌ NDArray handling is awkward (just bytes, lose dtype/shape info)
136136+- ❌ Requires compilation step
137137+- ❌ Less human-readable than JSON
138138+- ❌ Doesn't integrate well with msgpack serialization we already use
139139+- ❌ Would need to convert between Protobuf and our existing serialization
140140+141141+**Implementation Effort**: High
142142+- Need to bridge Protobuf and PackableSample worlds
143143+- Complexity of maintaining two serialization systems
144144+145145+## Recommendation: Option 1 (Custom Format)
146146+147147+**Rationale**:
148148+149149+1. **Perfect fit for PackableSample**: Our custom format can represent NDArray types with full dtype and shape information, which is critical for ML/data applications.
150150+151151+2. **ATProto-native**: Using Lexicon primitives means everything stays within the ATProto ecosystem. No external schema dependencies.
152152+153153+3. **Full control**: We can optimize the codegen for our exact use case. Want to generate dataclasses with specific decorators? Easy. Want to add custom validation? We control it.
154154+155155+4. **Simplicity**: Despite being "custom", it's actually simpler than adapting JSON Schema or Protobuf to our needs. Less impedance mismatch.
156156+157157+5. **Future-proof**: Easy to add features like:
158158+ - Shape constraints and validation
159159+ - Custom serialization hooks
160160+ - Nested PackableSample types
161161+ - Union types for polymorphic samples
162162+163163+## Implementation Plan
164164+165165+If we choose Option 1:
166166+167167+1. **Finalize Lexicon structure** (see `02_lexicon_design.md`)
168168+ - Field type definitions (primitive, ndarray, nested)
169169+ - Union types for extensibility
170170+ - Metadata fields
171171+172172+2. **Implement Python codegen** (see `05_codegen.md`)
173173+ - Jinja2 templates for dataclass generation
174174+ - Type annotation mapping
175175+ - NDArray handling with dtype/shape comments
176176+177177+3. **Build validation tooling**
178178+ - Schema validator (ensure schemas are well-formed)
179179+ - Sample validator (ensure samples match schemas)
180180+ - Compatibility checker (schema evolution)
181181+182182+4. **Document the format**
183183+ - Clear spec for the type system
184184+ - Examples for common patterns
185185+ - Migration guide from JSON Schema if needed
186186+187187+## Alternative Approaches Considered
188188+189189+**Hybrid approach**: Use JSON Schema for validation + custom codegen
190190+- Still has awkward NDArray representation
191191+- Added complexity of two systems
192192+- Not recommended
193193+194194+**Defer decision**: Use simple types only, add NDArray later
195195+- Defeats the purpose - NDArray is core to ML datasets
196196+- Would require breaking changes later
197197+- Not recommended
198198+199199+## Impact on Other Decisions
200200+201201+- **Code generation (#36-39)**: Custom format means we fully control codegen
202202+- **Validation (#50)**: Need to implement custom validators
203203+- **Cross-language support (future)**: Need to write codegen for each language, but format is language-agnostic
204204+205205+## Success Criteria
206206+207207+After implementing this decision:
208208+- ✅ Can represent all current PackableSample types
209209+- ✅ NDArray types include dtype and shape information
210210+- ✅ Generated code is idiomatic Python (dataclasses with type hints)
211211+- ✅ Schema records are human-readable
212212+- ✅ Codegen is fast (<1s for typical schemas)
213213+214214+## Open Questions
215215+216216+1. **Should we support shape constraints beyond documentation?**
217217+ - e.g., should [224, 224, 3] be enforced at runtime?
218218+ - Recommendation: Document only initially, add validation later
219219+220220+2. **How to handle nested PackableSample types?**
221221+ - Reference by schema URI?
222222+ - Inline nested schema?
223223+ - Recommendation: URI reference for Phase 1
224224+225225+3. **Should we generate both classes and validators?**
226226+ - Just classes, or also Pydantic models?
227227+ - Recommendation: Start with dataclasses, add Pydantic later if needed
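If shape validation is added later (question 1), a runtime check could be as small as the sketch below. It assumes the `null` entries in a schema's `shape` list become `None` in Python and mean "any size"; the function name `shape_matches` is hypothetical.

```python
from typing import Optional, Sequence


def shape_matches(spec: Sequence[Optional[int]], actual: Sequence[int]) -> bool:
    """Return True if `actual` satisfies `spec`; None in `spec` matches any size."""
    if len(spec) != len(actual):
        return False
    return all(want is None or want == got for want, got in zip(spec, actual))


print(shape_matches([None, None, 3], (480, 640, 3)))  # True
print(shape_matches([224, 224, 3], (480, 640, 3)))    # False
```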
228228+229229+## References
230230+231231+- Full Lexicon design: `../02_lexicon_design.md`
232232+- Code generation plan: `../05_codegen.md`
233233+- Example schemas: `../02_lexicon_design.md` (Schema Record Lexicon section)
234234+235235+---
236236+237237+**Decision Needed By**: Before starting Phase 1 Issue #22 (Lexicon design)
238238+**Decision Maker**: Project maintainer (max)
239239+**Date Created**: 2026-01-07
+352
.planning/decisions/02_lens_code_storage.md
···11+# Decision: Lens Code Storage Approach
22+33+**Issue**: #46
44+**Status**: Needs decision
55+**Blocks**: #50 (Lexicon validation)
66+**Priority**: Critical for Phase 1
77+88+## DECISION
99+1010+Let's go with Option 1, using external repositories. We can actually make this work for
1111+1212+* GitHub
1313+* tangled.org (the native ATProto git repository system)
1414+1515+Additionally, we'll want to keep track of metadata for lenses giving the language the referenced code is implemented in.
1616+1717+Longer-term, it will also be good to add another Lexicon specification for attestation of `Lens` formal correctness (where possible), as this will enable filtering lens implementations by provability. We'll also want to add our own `verification` records that give attestation of individual atproto DIDs (user identities) as being "trusted" for creating `Lens`es, etc.
1818+1919+---
2020+2121+## Problem Statement
2222+2323+We need to decide how to store the transformation code for Lens records on ATProto. Lenses define bidirectional transformations between sample types (getter: Source → Target, putter: Target × Source → Source).
2424+2525+This is a **critical security decision** because we're dealing with executable code.
2626+2727+## Context
2828+2929+Lens transformations are functions that:
3030+- Take samples of one type and transform them to another
3131+- Are bidirectional (getter + putter)
3232+- Need to be reproducible and verifiable
3333+- Potentially execute on untrusted data
3434+3535+Example Lens:
3636+```python
3737+@atdata.lens
3838+def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample:
3939+ gray = cv2.cvtColor(rgb_sample.image, cv2.COLOR_RGB2GRAY)
4040+ return GrayscaleSample(image=gray, label=rgb_sample.label)
4141+4242+@rgb_to_grayscale.putter
4343+def grayscale_to_rgb(gray: GrayscaleSample, rgb: RGBSample) -> RGBSample:
4444+ # Convert back to RGB (approximate)
4545+ rgb_img = cv2.cvtColor(gray.image, cv2.COLOR_GRAY2RGB)
4646+ return RGBSample(image=rgb_img, label=gray.label)
4747+```
4848+4949+## Options
5050+5151+### Option 1: Code References Only (GitHub/GitLab + Commit Hash) ⭐ RECOMMENDED
5252+5353+**Description**: Store only references to code in version control repositories
5454+5555+**Record Format**:
5656+```json
5757+{
5858+ "getterCode": {
5959+ "kind": "reference",
6060+ "repository": "https://github.com/alice/lenses",
6161+ "commit": "a1b2c3d4e5f6789...",
6262+ "path": "lenses/vision.py:rgb_to_grayscale"
6363+ },
6464+ "putterCode": {
6565+ "kind": "reference",
6666+ "repository": "https://github.com/alice/lenses",
6767+ "commit": "a1b2c3d4e5f6789...",
6868+ "path": "lenses/vision.py:grayscale_to_rgb"
6969+ }
7070+}
7171+```
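A loader would need to validate such a reference before touching any repository. The sketch below checks the required keys and splits `path` into a file and a function name. The function `parse_code_reference`, the requirement of a full 40-character hash, and the placeholder commit value are assumptions for illustration.

```python
import re

REQUIRED_KEYS = ("kind", "repository", "commit", "path")
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def parse_code_reference(ref: dict) -> tuple[str, str]:
    """Validate a reference and split `path` into (file_path, function_name)."""
    missing = [k for k in REQUIRED_KEYS if k not in ref]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if ref["kind"] != "reference":
        raise ValueError(f"unsupported kind: {ref['kind']!r}")
    # Assumption: require a full SHA-1 so the pin is unambiguous.
    if not FULL_SHA1.fullmatch(ref["commit"]):
        raise ValueError("commit must be a full 40-character hex hash")
    file_path, sep, function_name = ref["path"].rpartition(":")
    if not sep or not file_path or not function_name:
        raise ValueError("path must look like 'file.py:function_name'")
    return file_path, function_name


example = {
    "kind": "reference",
    "repository": "https://github.com/alice/lenses",
    "commit": "0123456789abcdef0123456789abcdef01234567",  # placeholder hash
    "path": "lenses/vision.py:rgb_to_grayscale",
}
print(parse_code_reference(example))
```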
7272+7373+**Pros**:
7474+- ✅ **Secure**: No arbitrary code execution from ATProto records
7575+- ✅ **Verifiable**: Commit hash ensures immutability
7676+- ✅ **Auditable**: Users can review code before using
7777+- ✅ **Version controlled**: Natural versioning through git
7878+- ✅ **Professional workflow**: Encourages proper development practices
7979+8080+**Cons**:
8181+- ❌ External dependency (repo could disappear)
8282+- ❌ Requires users to have code in public/accessible repos
8383+- ❌ Need to clone/fetch repos to use lenses
8484+- ❌ Less convenient than self-contained records
8585+8686+**Security**: ⭐⭐⭐⭐⭐ Excellent
8787+**Convenience**: ⭐⭐⭐ Good
8888+**Implementation Effort**: Low-Medium
8989+9090+---
9191+9292+### Option 2: Inline Python Code with Sandboxing
9393+9494+**Description**: Store Python source code directly in records, execute in sandbox
9595+9696+**Record Format**:
9797+```json
9898+{
9999+ "getterCode": {
100100+ "kind": "python",
101101+ "source": "def rgb_to_grayscale(rgb_sample: RGBSample) -> GrayscaleSample:\n ..."
102102+ }
103103+}
104104+```
105105+106106+**Pros**:
107107+- ✅ Self-contained records
108108+- ✅ No external dependencies
109109+- ✅ More convenient for users
110110+- ✅ Easier discovery and exploration
111111+112112+**Cons**:
113113+- ❌ **MAJOR SECURITY RISK**: Executing untrusted code
114114+- ❌ Sandboxing Python is extremely difficult
115115+- ❌ Even with sandboxing, attack surface is large
116116+- ❌ `eval()`/`exec()` considered harmful
117117+- ❌ Would need extensive review and testing
118118+- ❌ Potential for malicious code injection
119119+120120+**Security**: ⭐ Very Poor (even with sandboxing)
121121+**Convenience**: ⭐⭐⭐⭐⭐ Excellent
122122+**Implementation Effort**: Very High (sandboxing is complex)
123123+124124+**Why Sandboxing is Hard**:
125125+- Python has many ways to break out of sandboxes
126126+- Import system, file I/O, network access all need blocking
127127+- Builtins such as `__import__`, `eval`, `exec`, `compile`, and `open` are all potential escape routes
128128+- Even readonly access can leak sensitive data
129129+- See: [PyPy sandbox](https://doc.pypy.org/en/latest/sandbox.html) - discontinued
130130+131131+---
132132+133133+### Option 3: Bytecode or AST Representation
134134+135135+**Description**: Store compiled bytecode or AST instead of source
136136+137137+**Pros**:
138138+- ✅ Slightly safer than raw source (no syntax injection)
139139+- ✅ Self-contained
140140+141141+**Cons**:
142142+- ❌ Still executes arbitrary code - same security issues
143143+- ❌ Harder to audit than source
144144+- ❌ Platform/version dependent (Python bytecode changes)
145145+- ❌ Complex to implement
146146+- ❌ Doesn't solve the fundamental problem
147147+148148+**Security**: ⭐⭐ Poor
149149+**Convenience**: ⭐⭐ Poor (less readable)
150150+**Implementation Effort**: High
151151+152152+---
153153+154154+### Option 4: Metadata Only (Manual Implementation)
155155+156156+**Description**: Store only metadata about transformations, require manual implementation
157157+158158+**Record Format**:
159159+```json
160160+{
161161+ "description": "Converts RGB images to grayscale",
162162+ "getterSignature": "(RGBSample) -> GrayscaleSample",
163163+ "putterSignature": "(GrayscaleSample, RGBSample) -> RGBSample"
164164+}
165165+```
166166+167167+**Pros**:
168168+- ✅ Completely safe
169169+- ✅ Simple to implement
170170+171171+**Cons**:
172172+- ❌ Lenses not actually usable
173173+- ❌ Defeats the purpose of publishing transformations
174174+- ❌ No network effect (can't compose lenses)
175175+176176+**Security**: ⭐⭐⭐⭐⭐ Excellent
177177+**Convenience**: ⭐ Very Poor
178178+**Implementation Effort**: Very Low
179179+180180+## Recommendation: Option 1 (Code References Only)
181181+182182+**Rationale**:
183183+184184+1. **Security First**: We cannot compromise on security. Publishing executable code to a public network is extremely dangerous without proper safeguards.
185185+186186+2. **Verifiable and Auditable**: With commit hashes, users can:
187187+ - Review the exact code before execution
188188+ - Verify it hasn't been tampered with
189189+ - Make informed trust decisions
190190+191191+3. **Professional Workflow**: Requiring code in version control:
192192+ - Encourages good practices (testing, documentation)
193193+ - Makes lens development collaborative
194194+ - Enables code review
195195+196196+4. **Future Extensibility**: We can add inline code later if we solve sandboxing, but we can't easily remove it once added.
197197+198198+## Implementation Plan
199199+200200+If we choose Option 1:
201201+202202+1. **Lexicon Design** (Phase 1)
203203+ ```json
204204+ "transformCode": {
205205+ "type": "union",
206206+ "refs": ["#codeReference"]
207207+ },
208208+ "codeReference": {
209209+ "type": "object",
210210+ "required": ["kind", "repository", "commit", "path"],
211211+ "properties": {
212212+ "kind": {"type": "string", "const": "reference"},
213213+ "repository": {"type": "string", "maxLength": 500},
214214+ "commit": {"type": "string", "maxLength": 40},
215215+ "path": {"type": "string", "maxLength": 500}
216216+ }
217217+ }
218218+ ```
219219+220220+2. **Lens Publisher** (Phase 2)
221221+ - Automatically detect git repo and commit from function location
222222+ - Validate that repo is accessible
223223+ - Include function name and module path
224224+225225+3. **Lens Loader** (Phase 2)
226226+ - Clone/fetch repository at specified commit
227227+ - Import function from specified path
228228+ - Cache cloned repos locally
229229+ - Verify function signatures match schema
230230+231231+4. **Trust Model**
232232+ - Users explicitly approve which repos to trust
233233+ - Whitelist/blacklist mechanism
234234+ - Warn on first use of any lens
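The trust model in step 4 could start as a small in-memory store like the sketch below. The class `RepoTrustStore`, its status strings, and the URL normalization are hypothetical; a real implementation would presumably persist approvals locally.

```python
class RepoTrustStore:
    """In-memory allowlist/blocklist of repository URLs (hypothetical helper)."""

    def __init__(self) -> None:
        self.allowed: set[str] = set()
        self.blocked: set[str] = set()
        self._used: set[str] = set()

    @staticmethod
    def normalize(repo_url: str) -> str:
        """Treat '.../lenses', '.../lenses/' and '.../lenses.git' as one repo."""
        url = repo_url.strip().rstrip("/")
        return url[:-4] if url.endswith(".git") else url

    def approve(self, repo_url: str) -> None:
        url = self.normalize(repo_url)
        self.blocked.discard(url)
        self.allowed.add(url)

    def block(self, repo_url: str) -> None:
        url = self.normalize(repo_url)
        self.allowed.discard(url)
        self.blocked.add(url)

    def check(self, repo_url: str) -> str:
        """Return 'blocked', 'needs-approval', 'trusted-first-use', or 'trusted'."""
        url = self.normalize(repo_url)
        if url in self.blocked:
            return "blocked"
        if url not in self.allowed:
            return "needs-approval"
        if url not in self._used:
            # First use of an approved repo: the caller should still warn.
            self._used.add(url)
            return "trusted-first-use"
        return "trusted"


store = RepoTrustStore()
store.approve("https://github.com/alice/lenses.git")
print(store.check("https://github.com/alice/lenses"))
```

A loader would call `check` before cloning and refuse to proceed on anything other than the two trusted states.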
235235+236236+## Alternative Approaches Considered
237237+238238+**Signed inline code**: Store inline code with cryptographic signatures
239239+- Still has execution risk
240240+- Signature only proves authorship, not safety
241241+- Not recommended
242242+243243+**WASM modules**: Compile transformations to WebAssembly
244244+- More sandboxed than Python
245245+- Very complex to implement
246246+- Would require rewriting lenses in Rust/C++
247247+- Interesting future direction but not for Phase 1
248248+249249+## User Experience Implications
250250+251251+**Publishing a Lens**:
252252+```python
253253+# 1. Write lens code in your repo
254254+# lenses/vision.py
255255+@atdata.lens
256256+def rgb_to_grayscale(rgb: RGBSample) -> GrayscaleSample:
257257+ ...
258258+259259+# 2. Commit and push (shell commands, run outside Python):
260260+#    git add lenses/vision.py
261261+#    git commit -m "Add RGB to grayscale lens"
262262+#    git push
263263+264264+# 3. Publish to ATProto (automatically detects git info)
265265+client = ATProtoClient()
266266+client.login("alice.bsky.social", "password")
267267+268268+lens_publisher = LensPublisher(client)
269269+lens_uri = lens_publisher.publish_lens(
270270+ rgb_to_grayscale,
271271+ source_schema_uri="at://alice/schema/rgb",
272272+ target_schema_uri="at://alice/schema/gray"
273273+)
274274+```
275275+276276+**Using a Lens**:
277277+```python
278278+# 1. Discover lens
279279+loader = LensLoader(client)
280280+lenses = loader.search_lenses(
281281+ source_schema="at://alice/schema/rgb",
282282+ target_schema="at://alice/schema/gray"
283283+)
284284+285285+# 2. User reviews the repo/code (outside tool)
286286+# 3. User approves the repo
287287+288288+# 4. Load and use lens
289289+rgb_to_gray = loader.load_lens(lenses[0]['uri'])
290290+gray_sample = rgb_to_gray(rgb_sample)
291291+```
292292+293293+## Security Considerations
294294+295295+Even with code references:
296296+- **Malicious repos**: Users could reference repos with malicious code
297297+- **Mitigation**: Explicit user approval, warnings, sandboxing (future)
298298+299299+- **Repo compromise**: Git repos could be hacked
300300+- **Mitigation**: Commit hash pins exact version, users can audit
301301+302302+- **Dependency injection**: Lens code could import malicious packages
303303+- **Mitigation**: Users review code, standard Python security practices
304304+305305+## Future Enhancements
306306+307307+**If we want inline code later**:
308308+1. Build robust Python sandbox (e.g., using PyPy, RestrictedPython)
309309+2. Add extensive security testing
310310+3. Implement strict permissions model
311311+4. Use WebAssembly for true isolation
312312+5. Add code signing and reputation system
313313+314314+**For now**: Start with references, prove the concept, add inline code only if there's strong demand and we can do it safely.
315315+316316+## Open Questions
317317+318318+1. **Private repositories**: How to handle lenses in private repos?
319319+ - Could support auth tokens (stored locally, not in record)
320320+ - Could use SSH keys
321321+ - Recommendation: Public repos only for Phase 1
322322+323323+2. **Repository availability**: What if repo goes offline?
324324+ - Could encourage mirrors
325325+ - Could cache code (with user permission)
326326+ - Recommendation: Accept the risk, it's part of decentralization
327327+328328+3. **Non-Python lenses**: What about TypeScript, Rust, etc.?
329329+ - References work for any language
330330+ - Each language would need its own loader
331331+ - Recommendation: Python-only for Phase 1
332332+333333+## Success Criteria
334334+335335+After implementing this decision:
336336+- ✅ Lenses can be published with code references
337337+- ✅ Users can load and execute lenses from approved repos
338338+- ✅ No arbitrary code execution from untrusted sources
339339+- ✅ Lens records include immutable commit hashes
340340+- ✅ Clear warnings when using external code
341341+342342+## References
343343+344344+- Lexicon design: `../02_lexicon_design.md` (Lens Record Lexicon)
345345+- Python client implementation: `../03_python_client.md` (LensPublisher)
346346+- Security best practices: Python security guide
347347+348348+---
349349+350350+**Decision Needed By**: Before starting Phase 1 Issue #24 (Lens Lexicon design)
351351+**Decision Maker**: Project maintainer (max)
352352+**Date Created**: 2026-01-07
+366
.planning/decisions/03_webdataset_storage.md
···11+# Decision: WebDataset Storage Strategy
22+33+**Issue**: #47
44+**Status**: Needs decision
55+**Blocks**: #50 (Lexicon validation)
66+**Priority**: Critical for Phase 1
77+88+## DECISION
99+1010+Let's build the hybrid approach in from the beginning. Critically:
1111+1212+* We'll use an open union to define the data location, so each dataset index record states whether it references external storage (S3, R2, etc.) by URL or a PDS blob
1313+* In the AppView implementation, we can proxy WDS URLs for datasets across individual stored blobs, which streamlines some of the design.
1414+1515+This will help us be robust from the start -- particularly for those self-hosting.
1616+1717+---
1818+1919+## Problem Statement
2020+2121+We need to decide where the actual WebDataset `.tar` files are stored and how dataset records reference them. This affects decentralization, reliability, and scalability.
2222+2323+## Context
2424+2525+WebDataset files are:
2626+- **Large**: Typically gigabytes to terabytes
2727+- **Immutable**: Once created, datasets rarely change
2828+- **Sharded**: Split across multiple `.tar` files (e.g., `data-{000000..000099}.tar`)
2929+- **Binary**: Contain msgpack-serialized samples with images/arrays
3030+3131+Current `atdata` usage:
3232+```python
3333+# External storage (S3, HTTP, etc.)
3434+dataset = Dataset[MySample](url="s3://bucket/data-{000000..000009}.tar")
3535+```
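For readers unfamiliar with the brace notation, the sketch below shows what a sharded URL stands for. WebDataset performs this expansion itself when loading; the helper `expand_shards` is hypothetical and only handles a single numeric `{start..end}` range.

```python
import re

BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(url: str) -> list[str]:
    """Expand the first `{start..end}` range in `url` into one URL per shard."""
    match = BRACE_RANGE.search(url)
    if match is None:
        return [url]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding, e.g. 000007
    return [
        url[: match.start()] + str(i).zfill(width) + url[match.end():]
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_shards("s3://bucket/data-{000000..000009}.tar")
print(len(shards), shards[0], shards[-1])
# 10 s3://bucket/data-000000.tar s3://bucket/data-000009.tar
```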
3636+3737+## Options
3838+3939+### Option 1: External Storage with URL References ⭐ RECOMMENDED (Phase 1)
4040+4141+**Description**: Store WebDataset files on existing storage (S3, HTTP, IPFS, etc.), record only contains URLs
4242+4343+**Record Format**:
4444+```json
4545+{
4646+ "$type": "app.bsky.atdata.dataset",
4747+ "name": "CIFAR-10 Training Set",
4848+ "urls": [
4949+ "s3://my-bucket/cifar10-train-{000000..000049}.tar"
5050+ ],
5151+ "schemaRef": "at://alice/schema/image",
5252+ ...
5353+}
5454+```
5555+5656+**Supported URL Schemes**:
5757+- `s3://` - AWS S3 and compatible (MinIO, DigitalOcean Spaces)
5858+- `https://` - HTTP/HTTPS servers
5959+- `gs://` - Google Cloud Storage
6060+- `ipfs://` - IPFS (decentralized, content-addressed)
6161+- `file://` - Local files (for development)
6262+6363+**Pros**:
6464+- ✅ **No size limits**: Store datasets of any size
6565+- ✅ **Existing infrastructure**: Leverage proven storage solutions
6666+- ✅ **No ATProto storage costs**: Publishers pay for their own storage
6767+- ✅ **Performance**: Use CDNs, regional endpoints, etc.
6868+- ✅ **Compatibility**: Works with current `atdata` code
6969+- ✅ **Flexibility**: Different storage for different use cases
7070+7171+**Cons**:
7272+- ❌ **Centralization risk**: If storage provider goes down, dataset unavailable
7373+- ❌ **URL rot**: Links can break over time
7474+- ❌ **No permanence guarantee**: Publisher can delete files
7575+- ❌ **Access control complexity**: Need to handle auth for private datasets
7676+7777+**Decentralization**: ⭐⭐ Fair (better with IPFS)
7878+**Reliability**: ⭐⭐⭐ Good (depends on storage provider)
7979+**Cost**: ⭐⭐⭐⭐ Excellent (publishers pay storage costs)
8080+**Implementation Effort**: ⭐⭐⭐⭐⭐ Very Low (already supported)
8181+8282+---
8383+8484+### Option 2: ATProto Blob Storage
8585+8686+**Description**: Store WebDataset files as ATProto blobs, record contains blob CIDs
8787+8888+**Record Format**:
8989+```json
9090+{
9191+ "$type": "app.bsky.atdata.dataset",
9292+ "name": "Small Dataset",
9393+ "blobs": [
9494+ {"$type": "blob", "ref": {"$link": "bafyrei..."}},
9595+ {"$type": "blob", "ref": {"$link": "bafyrei..."}}
9696+ ],
9797+ "schemaRef": "at://alice/schema/image",
9898+ ...
9999+}
100100+```
101101+102102+**Pros**:
103103+- ✅ **True decentralization**: Data lives on ATProto network
104104+- ✅ **Content-addressed**: CIDs guarantee immutability
105105+- ✅ **Permanence**: As permanent as ATProto itself
106106+- ✅ **No external dependencies**: Self-contained
107107+108108+**Cons**:
109109+- ❌ **Size limits**: ATProto may have blob size restrictions (need to verify)
110110+- ❌ **Storage costs**: Who pays for storing large datasets?
111111+- ❌ **Performance**: May be slower than specialized data storage
112112+- ❌ **Scalability**: Not designed for TB-scale datasets
113113+- ❌ **Unknown limitations**: ATProto blob storage is less proven for this use case
114114+115115+**Decentralization**: ⭐⭐⭐⭐⭐ Excellent
116116+**Reliability**: ⭐⭐⭐⭐ Very Good (ATProto network)
117117+**Cost**: ⭐ Poor (storage costs for large datasets)
118118+**Implementation Effort**: ⭐⭐⭐ Medium (need to implement blob upload/download)
119119+120120+---
121121+122122+### Option 3: Hybrid Approach
123123+124124+**Description**: Support both external URLs and ATProto blobs
125125+126126+**Record Format**:
127127+```json
128128+{
129129+ "$type": "app.bsky.atdata.dataset",
130130+ "name": "Hybrid Dataset",
131131+ "storage": {
132132+ "kind": "external",
133133+ "urls": ["s3://bucket/data-{000000..000009}.tar"]
134134+ },
135135+ // OR
136136+ "storage": {
137137+ "kind": "blobs",
138138+ "blobs": [{"$type": "blob", "ref": {"$link": "bafyrei..."}}]
139139+ },
140140+ ...
141141+}
142142+```
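A loader consuming this format would branch on `storage.kind`. The sketch below assumes the `storage`, `kind`, `urls`, and `blobs` names from the record format above; the function `resolve_storage` and the placeholder CID are hypothetical.

```python
def resolve_storage(record: dict) -> tuple[str, list[str]]:
    """Return (kind, locations) for a dataset record's storage entry."""
    storage = record["storage"]
    kind = storage.get("kind")
    if kind == "external":
        return kind, list(storage["urls"])
    if kind == "blobs":
        # Each blob entry carries its CID under ref.$link.
        return kind, [blob["ref"]["$link"] for blob in storage["blobs"]]
    # New storage kinds may appear later; fail loudly rather than guess.
    raise NotImplementedError(f"unsupported storage kind: {kind!r}")


external = {"storage": {"kind": "external", "urls": ["s3://bucket/data-{000000..000009}.tar"]}}
blobs = {
    "storage": {
        "kind": "blobs",
        "blobs": [{"$type": "blob", "ref": {"$link": "placeholder-cid"}}],
    }
}
print(resolve_storage(external))
print(resolve_storage(blobs))
```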
143143+144144+**Pros**:
145145+- ✅ Best of both worlds
146146+- ✅ Flexibility for different use cases
147147+- ✅ Can migrate between storage types
148148+149149+**Cons**:
150150+- ❌ More complex Lexicon and implementation
151151+- ❌ Confusing for users (which to choose?)
152152+- ❌ Testing burden (need to test both paths)
153153+154154+**Implementation Effort**: ⭐⭐ High (two systems to maintain)
155155+156156+## Recommendation: Option 1 (External URLs) for Phase 1, Option 3 (Hybrid) for Future
157157+158158+**Rationale**:
159159+160160+1. **Pragmatism**: Most ML datasets are huge (10GB-10TB). ATProto blob storage is not designed for this scale.
161161+162162+2. **Existing Infrastructure**: S3, GCS, HTTP are battle-tested for large file storage. Why reinvent the wheel?
163163+164164+3. **Cost Model**: Publishers pay for their own storage. This is sustainable and aligns incentives.
165165+166166+4. **IPFS for Decentralization**: Users who want decentralization can use `ipfs://` URLs, which are content-addressed and distributed.
167167+168168+5. **Future-Proof**: We can add blob storage later for small datasets (<100MB) without breaking existing datasets.
169169+170170+## Implementation Plan
171171+172172+### Phase 1: External URLs Only
173173+174174+**Lexicon Design**:
175175+```json
176176+{
177177+ "urls": {
178178+ "type": "array",
179179+ "description": "WebDataset URLs (supports brace notation)",
180180+ "items": {
181181+ "type": "string",
182182+ "format": "uri",
183183+ "maxLength": 1000
184184+ },
185185+ "minLength": 1
186186+ }
187187+}
188188+```
189189+190190+**Publisher Implementation**:
191191+```python
192192+publisher = DatasetPublisher(client)
193193+dataset_uri = publisher.publish_dataset(
194194+ dataset,
195195+ name="My Dataset",
196196+ description="Training data for my model"
197197+)
198198+# dataset.url is used directly, no upload needed
199199+```
200200+201201+**Loader Implementation**:
202202+```python
203203+loader = DatasetLoader(client)
204204+dataset = loader.load_dataset("at://alice/dataset/123")
205205+# Creates Dataset with URL from record
206206+# Actual data loading happens lazily via WebDataset
207207+```
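Because the record stores the brace pattern rather than every shard URL, it helps to see what the pattern stands for. The sketch below expands numeric ranges the way the loader would expect WebDataset to; `expand_shards` is a hypothetical helper written for illustration and is not part of `atdata` or WebDataset.

```python
import re

# Matches a numeric brace range such as {000000..000009}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(url: str) -> list[str]:
    """Expand every {start..end} range in a URL into concrete shard URLs."""
    match = _RANGE.search(url)
    if match is None:
        return [url]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding used in the pattern
    expanded = []
    for index in range(int(start), int(end) + 1):
        candidate = url[: match.start()] + str(index).zfill(width) + url[match.end():]
        # Recurse so that URLs with more than one range are fully expanded.
        expanded.extend(expand_shards(candidate))
    return expanded


shards = expand_shards("s3://bucket/data-{000000..000009}.tar")
```

Storing the unexpanded pattern keeps the record small even for datasets with thousands of shards.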
208208+209209+**Validation**:
210210+- Check URL format (scheme + netloc + path)
211211+- Support brace notation for sharded datasets
212212+- Don't validate URL accessibility (too slow, may be private)
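The validation rules above can be sketched with the standard library alone. This is a minimal illustration, assuming the Phase 1 scheme list from this document; `check_dataset_url` and `PHASE1_SCHEMES` are hypothetical names, and the function deliberately makes no network request.

```python
import re
from urllib.parse import urlparse

# Assumption: the schemes this document marks as supported in Phase 1.
PHASE1_SCHEMES = {"s3", "https", "http", "gs", "ipfs", "file"}
_BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def check_dataset_url(url: str) -> list[str]:
    """Return a list of format problems; an empty list means the URL looks fine."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in PHASE1_SCHEMES:
        problems.append(f"unsupported scheme: {parsed.scheme or '(none)'}")
    # file:// URLs legitimately have an empty host part.
    if parsed.scheme != "file" and not parsed.netloc:
        problems.append("missing host or bucket")
    if url.count("{") != url.count("}"):
        problems.append("unbalanced braces")
    for start, end in _BRACE_RANGE.findall(url):
        if int(start) > int(end):
            problems.append(f"empty shard range: {{{start}..{end}}}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a publisher report every issue in a single pass.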
213213+214214+### Future: Add Blob Storage Option
215215+216216+When ATProto blob storage is more mature and we understand limits:
217217+218218+1. **Add blob support to Lexicon**:
219219+ ```json
220220+ "storage": {
221221+ "type": "union",
222222+ "refs": ["#urlStorage", "#blobStorage"]
223223+ }
224224+ ```
225225+226226+2. **Implement blob upload**:
227227+ - Chunk large files
228228+ - Upload shards as separate blobs
229229+ - Update record with blob CIDs
230230+231231+3. **Size recommendations**:
232232+ - Datasets <100MB → Consider blobs
233233+ - Datasets >100MB → Use external URLs
234234+ - Datasets >10GB → Definitely external URLs
235235+236236+## URL Scheme Support
237237+238238+| Scheme | Support | Notes |
239239+|--------|---------|-------|
240240+| `s3://` | ✅ Phase 1 | AWS S3 and compatible services |
241241+| `https://` | ✅ Phase 1 | Public HTTP/HTTPS servers |
242242+| `http://` | ✅ Phase 1 | Upgraded to HTTPS when possible |
243243+| `gs://` | ✅ Phase 1 | Google Cloud Storage |
244244+| `ipfs://` | ✅ Phase 1 | Decentralized storage via IPFS |
245245+| `file://` | ✅ Phase 1 | Local development only |
246246+| `at://` | ⏳ Future | ATProto blob references |
247247+248248+## Decentralization Strategy
249249+250250+For users who want decentralization without ATProto blobs:
251251+252252+**IPFS + Pinning Services**:
253253+1. Upload dataset to IPFS
254254+2. Pin with service (Pinata, Infura, Web3.Storage)
255255+3. Publish dataset with `ipfs://` URL
256256+4. IPFS ensures content-addressed, distributed storage
257257+258258+**Example**:
259259+```python
260260+# Upload to IPFS (using ipfs client)
261261+ipfs_hash = upload_to_ipfs("data-000000.tar")
262262+263263+# Publish dataset
264264+dataset_uri = publisher.publish_dataset(
265265+ dataset,
266266+ name="My Dataset",
267267+ urls=[f"ipfs://{ipfs_hash}"]
268268+)
269269+```
270270+271271+**Benefits**:
272272+- Content-addressed (CID in URL)
273273+- Distributed (IPFS network)
274274+- Persistent (for as long as at least one node keeps the content pinned)
275275+- No ATProto blob limits
276276+277277+## Access Control Considerations
278278+279279+**Public datasets**: URLs point to public storage
280280+- S3 public buckets
281281+- Public HTTP servers
282282+- IPFS (inherently public)
283283+284284+**Private datasets**: URL points to private storage
285285+- S3 with authentication (pre-signed URLs? credentials?)
286286+- Private HTTP servers (auth tokens?)
287287+- Recommendation: Public datasets only for Phase 1
288288+289289+**Future**: Could add access control metadata to records
290290+```json
291291+{
292292+ "access": {
293293+ "kind": "authenticated",
294294+ "requiredRole": "subscriber"
295295+ }
296296+}
297297+```
298298+299299+## Storage Cost Implications
300300+301301+| Storage Type | Cost Responsibility | Pros | Cons |
302302+|-------------|-------------------|------|------|
303303+| S3 | Publisher | Industry standard, reliable | Ongoing costs |
304304+| IPFS + Pinning | Publisher | Decentralized | Need pinning service |
305305+| HTTP Server | Publisher | Full control | Maintenance burden |
306306+| ATProto Blobs | Publisher? ATProto? | Simple | Unknown cost model |
307307+308308+**Recommendation**: Let publishers choose based on their needs and budget.
309309+310310+## Alternative Approaches Considered
311311+312312+**Torrents**: Use BitTorrent protocol
313313+- Pros: Decentralized, efficient for large files
314314+- Cons: Need seeders, not as well integrated
315315+- Could add in future with `torrent://` scheme
316316+317317+**Arweave**: Permanent storage blockchain
318318+- Pros: True permanence, one-time payment
319319+- Cons: Expensive for large datasets
320320+- Could add in future for critical datasets
321321+322322+## Open Questions
323323+324324+1. **Should we validate URL accessibility when publishing?**
325325+ - Pro: Catch broken links early
326326+ - Con: Slow, may fail for private URLs
327327+ - Recommendation: No validation, trust publishers
328328+329329+2. **Should we mirror datasets automatically?**
330330+ - Could create community mirrors for popular datasets
331331+ - Recommendation: Not for Phase 1, community can organize
332332+333333+3. **What about dataset versioning?**
334334+ - New version = new record with new URLs
335335+ - Could link to previous version in metadata
336336+ - Recommendation: Simple versioning via new records
337337+338338+4. **Should we support multi-region URLs?**
339339+ ```json
340340+ "urls": [
341341+ {"region": "us-east-1", "url": "s3://..."},
342342+ {"region": "eu-west-1", "url": "s3://..."}
343343+ ]
344344+ ```
345345+ - Recommendation: Defer to future if needed
346346+347347+## Success Criteria
348348+349349+After implementing this decision:
350350+- ✅ Datasets can reference external URLs (S3, HTTPS, IPFS)
351351+- ✅ WebDataset brace notation is preserved
352352+- ✅ Loading datasets works with existing `Dataset` class
353353+- ✅ No breaking changes to current `atdata` usage
354354+- ✅ Path clear for future blob storage support
355355+356356+## References
357357+358358+- Lexicon design: `../02_lexicon_design.md` (Dataset Record Lexicon)
359359+- Python client: `../03_python_client.md` (DatasetPublisher/Loader)
360360+- WebDataset documentation: https://webdataset.github.io/webdataset/
361361+362362+---
363363+364364+**Decision Needed By**: Before starting Phase 1 Issue #23 (Dataset Lexicon design)
365365+**Decision Maker**: Project maintainer (max)
366366+**Date Created**: 2026-01-07
+509
.planning/decisions/04_schema_evolution.md
···11+# Decision: Schema Evolution and Versioning Strategy
22+33+**Issue**: #48
44+**Status**: Needs decision
55+**Blocks**: #50 (Lexicon validation), #39 (Type validation)
66+**Priority**: High
77+88+## DECISION
99+1010+For this, let's take the following approach:
1111+1212+1. Let's make the `rkey` for the `ac.foundation.dataset.sampleSchema` records be of type `any`.
1313+2. Then, we can have our own standard for the `rkey` being of the format `{NSID}@{semver}`, where `{NSID}` gives an NSID for the permanent identifier of this sample schema type.
1414+ * This lets us keep track of version updates
1515+ * We can also define an `ac.foundation.dataset.getLatestSchema` `query` Lexicon that returns the record for the latest version of a given schema
1616+3. We can build into the `atdata` SDK that whenever users update their own sample schema types, they can pass in optional `Lens`es between the two versions that give transformations to downgrade / upgrade records, so that there's an easy dev-facing way to auto-update any existing datasets using an older schema and maintain compatibility with older code for newer data.
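To make the `{NSID}@{semver}` convention concrete, here is a minimal sketch of how a client might parse such keys and pick the latest version, roughly what a `getLatestSchema` query would need to do. The helper names and the example NSID `com.example.imageSample` are assumptions made for illustration only.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    nsid: str
    version: tuple[int, ...]  # (major, minor, patch)


def parse_schema_rkey(rkey: str) -> SchemaKey:
    """Split '{NSID}@{semver}' into its NSID and a comparable version tuple."""
    nsid, sep, semver = rkey.rpartition("@")
    parts = semver.split(".")
    if not sep or not nsid or len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not of the form NSID@MAJOR.MINOR.PATCH: {rkey!r}")
    return SchemaKey(nsid, tuple(int(p) for p in parts))


def latest_rkey(rkeys: list[str], nsid: str) -> str:
    """Return the rkey with the highest version among those matching the NSID."""
    matching = [k for k in rkeys if parse_schema_rkey(k).nsid == nsid]
    if not matching:
        raise LookupError(f"no versions published for {nsid}")
    # Compare numerically, so 1.10.0 sorts after 1.2.0.
    return max(matching, key=lambda k: parse_schema_rkey(k).version)
```

Comparing integer tuples rather than strings is the important detail: a plain string sort would rank `1.2.0` above `1.10.0`.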
1717+1818+---
1919+2020+## Problem Statement
2121+2222+We need to define how PackableSample schemas can evolve over time without breaking existing datasets or code. This includes:
2323+- Version numbering scheme
2424+- Compatibility rules (what changes are allowed?)
2525+- Migration strategies
2626+- Runtime validation
2727+2828+## Context
2929+3030+Schemas will evolve:
3131+- **Adding new fields** (e.g., adding optional metadata)
3232+- **Removing deprecated fields**
3333+- **Changing field types** (e.g., int → float)
3434+- **Changing field constraints** (e.g., making field optional)
3535+3636+Real-world example:
3737+```python
3838+# Version 1.0.0
3939+@atdata.packable
4040+class ImageSample:
4141+ image: NDArray
4242+ label: str
4343+4444+# Version 1.1.0 - add optional field (backward compatible)
4545+@atdata.packable
4646+class ImageSample:
4747+ image: NDArray
4848+ label: str
4949+ confidence: Optional[float] = None # NEW
5050+5151+# Version 2.0.0 - remove field (breaking change)
5252+@atdata.packable
5353+class ImageSample:
5454+ image: NDArray
5555+ # label removed - BREAKING
5656+ class_id: int # NEW, replaces label
5757+```
5858+5959+## Goals
6060+6161+1. **Backward compatibility**: Old code can read new data (when possible)
6262+2. **Forward compatibility**: New code can read old data (when possible)
6363+3. **Clear breaking changes**: Users know when they need to update
6464+4. **Safe migrations**: Data transformations are explicit and verifiable
6565+5. **Developer-friendly**: Easy to understand and use
6666+6767+## Versioning Scheme
6868+6969+### Semantic Versioning (MAJOR.MINOR.PATCH)
7070+7171+**Recommendation**: Use semantic versioning for schemas
7272+7373+```
7474+1.0.0 → 1.0.1 → 1.1.0 → 2.0.0
7575+```
7676+7777+**Version Components**:
7878+- **MAJOR**: Breaking changes (incompatible with previous versions)
7979+- **MINOR**: Backward-compatible additions (new optional fields)
8080+- **PATCH**: Documentation, clarifications, no functional changes
8181+8282+### Examples
8383+8484+```python
8585+# 1.0.0 → 1.0.1 (PATCH)
8686+# Change: Fixed documentation, added field description
8787+# Compatible: ✅ Yes
8888+# Action: None needed
8989+9090+# 1.0.0 → 1.1.0 (MINOR)
9191+# Change: Added optional field 'metadata'
9292+# Compatible: ✅ Yes (backward compatible)
9393+# Action: Old code works, new code can use new field
9494+9595+# 1.0.0 → 2.0.0 (MAJOR)
9696+# Change: Removed field 'old_field'
9797+# Compatible: ❌ No (breaking change)
9898+# Action: Users must migrate or use conversion lens
9999+```
100100+101101+## Compatibility Rules
102102+103103+### Backward-Compatible Changes (MINOR version bump)
104104+105105+**Allowed**:
106106+- ✅ Adding optional fields
107107+- ✅ Making required field optional
108108+- ✅ Widening type constraints (e.g., relaxing shape requirements)
109109+- ✅ Adding documentation
110110+- ✅ Adding metadata
111111+112112+**Example**:
113113+```python
114114+# v1.0.0
115115+class Sample:
116116+ x: int
117117+118118+# v1.1.0 - backward compatible
119119+class Sample:
120120+ x: int
121121+ y: Optional[int] = None # Added optional field
122122+```
123123+124124+**Guarantee**: Code written for v1.0.0 continues to work with v1.1.0 schemas
125125+126126+---
127127+128128+### Breaking Changes (MAJOR version bump)
129129+130130+**Requires a MAJOR bump**:
131131+- ❌ Removing fields
132132+- ❌ Changing field types (str → int)
133133+- ❌ Making optional field required
134134+- ❌ Narrowing type constraints (e.g., restricting shape)
135135+- ❌ Renaming fields
136136+137137+**Example**:
138138+```python
139139+# v1.0.0
140140+class Sample:
141141+ x: int
142142+ y: int
143143+144144+# v2.0.0 - breaking changes
145145+class Sample:
146146+ x: float # Type changed
147147+ # y removed
148148+ z: int # New required field
149149+```
150150+151151+**Guarantee**: Code written for v1.0.0 will NOT work with v2.0.0 without updates
152152+153153+---
154154+155155+### Non-Breaking Changes (PATCH version bump)
156156+157157+**Allowed**:
158158+- ✅ Documentation updates
159159+- ✅ Metadata changes
160160+- ✅ Clarifications
161161+- ✅ Bug fixes in schema definition (not structure)
162162+163163+**No functional changes to schema structure**
164164+165165+## Compatibility Checking
166166+167167+### Automatic Compatibility Checker
168168+169169+Implement `SchemaValidator` to check compatibility:
170170+171171+```python
172172+from atdata.codegen import SchemaValidator
173173+174174+validator = SchemaValidator()
175175+176176+old_schema = load_schema("at://alice/schema/sample/v1.0.0")
177177+new_schema = load_schema("at://alice/schema/sample/v1.1.0")
178178+179179+is_compatible, issues = validator.is_compatible(old_schema, new_schema)
180180+181181+if not is_compatible:
182182+ print("Incompatibilities found:")
183183+ for issue in issues:
184184+ print(f" - {issue}")
185185+```
186186+187187+**Checks**:
188188+1. Field additions/removals
189189+2. Type changes
190190+3. Optional → Required changes
191191+4. Shape constraint changes
192192+193193+See `../05_codegen.md` for implementation details.
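As a rough illustration of the checks listed above, the sketch below compares two schemas described as plain field maps and suggests a version bump. It is not the planned `SchemaValidator`; `FieldSpec`, `find_breaking_changes`, and `suggest_bump` are hypothetical names, and shape constraints are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    type_name: str
    required: bool = True


def find_breaking_changes(old: dict[str, FieldSpec], new: dict[str, FieldSpec]) -> list[str]:
    """List the changes that would require a MAJOR version bump."""
    issues = []
    for name, old_field in old.items():
        new_field = new.get(name)
        if new_field is None:
            issues.append(f"field '{name}' removed")
            continue
        if new_field.type_name != old_field.type_name:
            issues.append(f"field '{name}' changed type: {old_field.type_name} -> {new_field.type_name}")
        if new_field.required and not old_field.required:
            issues.append(f"field '{name}' became required")
    for name, new_field in new.items():
        if name not in old and new_field.required:
            issues.append(f"required field '{name}' added")
    return issues


def suggest_bump(old: dict[str, FieldSpec], new: dict[str, FieldSpec]) -> str:
    """Map a schema diff onto MAJOR / MINOR / PATCH."""
    if find_breaking_changes(old, new):
        return "major"
    return "minor" if old != new else "patch"
```

A structural diff like this only covers the mechanical rules; renames still show up as a removal plus an addition, which is why they count as breaking.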
194194+195195+### Version Constraints in Dataset Records
196196+197197+Datasets can specify schema version constraints:
198198+199199+```json
200200+{
201201+ "$type": "app.bsky.atdata.dataset",
202202+ "schemaRef": "at://alice/schema/sample/v1.0.0",
203203+ "schemaVersionConstraint": ">=1.0.0,<2.0.0",
204204+ ...
205205+}
206206+```
207207+208208+**Semantics**:
209209+- Dataset created with v1.0.0
210210+- Compatible with v1.x.x (minor/patch updates)
211211+- NOT compatible with v2.x.x (breaking changes)
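A constraint string such as `>=1.0.0,<2.0.0` can be evaluated with a few lines of standard-library code. The sketch below assumes comma-separated clauses that must all hold and plain `MAJOR.MINOR.PATCH` versions; `satisfies` is a hypothetical helper, not an existing `atdata` function.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def _parse_version(text: str) -> tuple[int, ...]:
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Return True if the version meets every comma-separated clause."""
    current = _parse_version(version)
    for clause in constraint.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                if not compare(current, _parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unrecognized constraint clause: {clause!r}")
    return True
```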
212212+213213+## Migration Strategies
214214+215215+### Option 1: Lenses as Migration Paths ⭐ RECOMMENDED
216216+217217+**Concept**: Use Lens transformations to migrate between schema versions
218218+219219+```python
220220+# Migration lens: v1.0.0 → v2.0.0
221221+@atdata.lens
222222+def sample_v1_to_v2(v1: SampleV1) -> SampleV2:
223223+ """Migrate from v1.0.0 to v2.0.0"""
224224+ return SampleV2(
225225+ x=float(v1.x), # int → float
226226+ z=hash(v1.y) % 100 # derive z from removed y
227227+ )
228228+229229+@sample_v1_to_v2.putter
230230+def sample_v2_to_v1(v2: SampleV2, v1: SampleV1) -> SampleV1:
231231+ """Reverse migration (lossy)"""
232232+ return SampleV1(
233233+ x=int(v2.x),
234234+ y=0 # Can't recover removed field
235235+ )
236236+```
237237+238238+**Benefits**:
239239+- ✅ Reuses existing Lens infrastructure
240240+- ✅ Explicit transformation logic
241241+- ✅ Bidirectional (when possible)
242242+- ✅ Publishable and discoverable
243243+244244+**Limitations**:
245245+- ❌ May be lossy (can't always reverse)
246246+- ❌ Requires manual implementation
247247+248248+---
249249+250250+### Option 2: Automatic Migration
251251+252252+**Concept**: Generate migrations automatically based on schema diff
253253+254254+```python
255255+migrator = SchemaMigrator()
256256+v2_sample = migrator.migrate(v1_sample, target_version="2.0.0")
257257+```
258258+259259+**Benefits**:
260260+- ✅ Convenient for users
261261+- ✅ No manual code needed
262262+263263+**Limitations**:
264264+- ❌ Only works for simple changes (add/remove optional fields)
265265+- ❌ Can't handle complex transformations (type changes)
266266+- ❌ Risk of incorrect assumptions
267267+268268+**Recommendation**: Could implement for simple cases, but Lenses are more general
269269+270270+---
271271+272272+### Option 3: Manual Migration Scripts
273273+274274+**Concept**: Users write custom migration scripts
275275+276276+**Benefits**:
277277+- ✅ Full control
278278+279279+**Limitations**:
280280+- ❌ Not publishable/discoverable
281281+- ❌ No standardization
282282+283283+**Recommendation**: Allow as fallback, but encourage Lenses
284284+285285+## Runtime Validation
286286+287287+### Sample Validation Against Schema
288288+289289+```python
290290+from atdata.codegen import TypeValidator
291291+292292+validator = TypeValidator()
293293+schema = load_schema("at://alice/schema/sample/v1.0.0")
294294+295295+# Validate sample
296296+sample = SampleV1(x=42, y=100)
297297+is_valid, errors = validator.validate(sample, schema)
298298+299299+if not is_valid:
300300+ print("Validation errors:")
301301+ for error in errors:
302302+ print(f" - {error}")
303303+```
304304+305305+**Checks**:
306306+1. All required fields present
307307+2. Field types match
308308+3. NDArray dtypes match (if specified)
309309+4. NDArray shapes match (if specified)
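The first two checks can be illustrated without any `atdata` machinery. In this sketch the schema is a plain mapping from field name to an expected Python type and a required flag, which is an assumption made for the example; the NDArray dtype and shape checks are omitted.

```python
from typing import Any


def validate_sample(sample: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Check required fields and field types; return a list of error messages."""
    errors = []
    for name, (expected_type, required) in schema.items():
        value = sample.get(name)
        if value is None:
            # Optional fields may be absent or None.
            if required:
                errors.append(f"missing required field '{name}'")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"field '{name}' expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return errors


# Hypothetical schema mirroring SampleV1 plus one optional field.
sample_v1_schema = {"x": (int, True), "y": (int, True), "note": (str, False)}
```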
310310+311311+**When to validate**:
312312+- ❓ Every sample creation? (slow)
313313+- ✅ On dataset write? (good balance)
314314+- ✅ On user request (explicit validation)
315315+316316+**Recommendation**: Validate on write, make runtime validation optional
317317+318318+## Schema Record Versioning
319319+320320+### Version Field in Schema Records
321321+322322+```json
323323+{
324324+ "$type": "app.bsky.atdata.schema",
325325+ "name": "ImageSample",
326326+ "version": "1.1.0", # Semantic version
327327+ ...
328328+}
329329+```
330330+331331+### Publishing New Versions
332332+333333+**Option A**: New record for each version (RECOMMENDED)
334334+```
335335+at://alice/schema/imagesample/v1.0.0 # Version 1.0.0
336336+at://alice/schema/imagesample/v1.1.0 # Version 1.1.0
337337+at://alice/schema/imagesample/v2.0.0 # Version 2.0.0
338338+```
339339+340340+**Pros**:
341341+- ✅ Immutable versions
342342+- ✅ Easy to reference specific versions
343343+- ✅ No breaking changes to existing references
344344+345345+**Cons**:
346346+- ❌ More records to manage
347347+- ❌ Harder to find "latest" version
348348+349349+**Option B**: Update existing record
350350+```
351351+at://alice/schema/imagesample # Always points to latest
352352+```
353353+354354+**Pros**:
355355+- ✅ Single canonical reference
356356+- ✅ Easy to find latest
357357+358358+**Cons**:
359359+- ❌ Breaks immutability
360360+- ❌ References become ambiguous over time
361361+362362+**Recommendation**: Option A (new record per version), with metadata linking to previous versions
363363+364364+### Linking Versions
365365+366366+```json
367367+{
368368+ "$type": "app.bsky.atdata.schema",
369369+ "name": "ImageSample",
370370+ "version": "2.0.0",
371371+ "metadata": {
372372+ "previousVersion": "at://alice/schema/imagesample/v1.1.0",
373373+ "migrationLens": "at://alice/lens/imagesample-v1-to-v2"
374374+ },
375375+ ...
376376+}
377377+```
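With `previousVersion` links in place, a client can recover a schema's history by following the chain backwards. The sketch below works on an in-memory mapping from URI to record; in practice each lookup would be a network fetch, and `version_chain` is a hypothetical helper.

```python
from typing import Any


def version_chain(uri: str, records: dict[str, dict[str, Any]]) -> list[str]:
    """Follow metadata.previousVersion links, newest first."""
    chain = []
    seen = set()
    current = uri
    # The 'seen' set guards against accidental cycles in published metadata.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        record = records.get(current, {})
        current = record.get("metadata", {}).get("previousVersion")
    return chain


records = {
    "at://alice/schema/imagesample/v2.0.0": {
        "version": "2.0.0",
        "metadata": {"previousVersion": "at://alice/schema/imagesample/v1.1.0"},
    },
    "at://alice/schema/imagesample/v1.1.0": {
        "version": "1.1.0",
        "metadata": {"previousVersion": "at://alice/schema/imagesample/v1.0.0"},
    },
    "at://alice/schema/imagesample/v1.0.0": {"version": "1.0.0"},
}
history = version_chain("at://alice/schema/imagesample/v2.0.0", records)
```

Walking the same chain while collecting each record's `migrationLens` would give the sequence of lenses needed to migrate across several versions.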
378378+379379+## Developer Workflow
380380+381381+### Publishing a New Schema Version
382382+383383+```python
384384+# 1. Define new version
385385+@atdata.packable
386386+class ImageSampleV2:
387387+ image: NDArray
388388+ label: str
389389+ confidence: Optional[float] = None # NEW
390390+391391+# 2. Publish with version
392392+schema_uri = publisher.publish_schema(
393393+ ImageSampleV2,
394394+ name="ImageSample",
395395+ version="1.1.0", # MINOR bump
396396+ metadata={
397397+ "previousVersion": "at://alice/schema/imagesample/v1.0.0"
398398+ }
399399+)
400400+401401+# 3. Optionally publish migration lens
402402+migration_lens = publisher.publish_lens(
403403+ v1_to_v2_lens,
404404+ source_schema_uri="at://alice/schema/imagesample/v1.0.0",
405405+ target_schema_uri=schema_uri,
406406+ name="ImageSample v1→v2 Migration"
407407+)
408408+```
409409+410410+### Using Versioned Schemas
411411+412412+```python
413413+# Load specific version
414414+schema = loader.get_schema("at://alice/schema/imagesample/v1.0.0")
415415+416416+# Check compatibility
417417+is_compatible, issues = validator.is_compatible(
418418+ "at://alice/schema/imagesample/v1.0.0",
419419+ "at://alice/schema/imagesample/v2.0.0"
420420+)
421421+422422+# Find migration path
423423+migration = loader.find_migration(
424424+ source="at://alice/schema/imagesample/v1.0.0",
425425+ target="at://alice/schema/imagesample/v2.0.0"
426426+)
427427+```
428428+429429+## Tooling Support
430430+431431+### CLI Commands
432432+433433+```bash
434434+# Check schema compatibility
435435+atdata schema diff \
436436+ at://alice/schema/sample/v1.0.0 \
437437+ at://alice/schema/sample/v2.0.0
438438+439439+# Validate sample against schema
440440+atdata validate mysample.msgpack \
441441+ --schema at://alice/schema/sample/v1.0.0
442442+443443+# Find migration path
444444+atdata schema migrate \
445445+ --from at://alice/schema/sample/v1.0.0 \
446446+ --to at://alice/schema/sample/v2.0.0
447447+```
448448+449449+### IDE Support (Future)
450450+451451+- Autocomplete for schema versions
452452+- Warnings for compatibility issues
453453+- Quick fixes for migrations
454454+455455+## Open Questions
456456+457457+1. **Should we auto-bump versions on publish?**
458458+ - Detect changes, suggest version bump?
459459+ - Recommendation: Manual for Phase 1, auto-suggest later
460460+461461+2. **How to handle shape evolution for NDArray?**
462462+ ```python
463463+ # v1: image shape [224, 224, 3]
464464+ # v2: image shape [256, 256, 3] # Breaking or not?
465465+ ```
466466+ - If shape is documented (not enforced), this could be minor
467467+ - If shape is validated, this is breaking
468468+ - Recommendation: Document only initially
469469+470470+3. **Should we support version ranges in schema refs?**
471471+ ```json
472472+ "schemaRef": "at://alice/schema/sample@^1.0.0" # npm-style
473473+ ```
474474+ - Pro: More flexible
475475+ - Con: Ambiguous (which exact version?)
476476+ - Recommendation: Explicit versions only for Phase 1
477477+478478+4. **What about deprecated fields?**
479479+ ```python
480480+ class Sample:
481481+ x: int
482482+ y: int # @deprecated: Use z instead
483483+ z: Optional[int] = None
484484+ ```
485485+ - Could add deprecation warnings
486486+ - Could track in schema metadata
487487+ - Recommendation: Metadata only for Phase 1
488488+489489+## Success Criteria
490490+491491+After implementing this decision:
492492+- ✅ Schemas use semantic versioning
493493+- ✅ Compatibility rules are clear and documented
494494+- ✅ Compatibility checker validates schema changes
495495+- ✅ Lenses can be used for migrations
496496+- ✅ Dataset records can specify version constraints
497497+- ✅ Breaking changes require major version bump
498498+499499+## References
500500+501501+- Code generation: `../05_codegen.md` (SchemaValidator, TypeValidator)
502502+- Lexicon design: `../02_lexicon_design.md` (Schema versioning)
503503+- Lens transformations: `02_lens_code_storage.md`
504504+505505+---
506506+507507+**Decision Needed By**: Before Phase 4 Issue #39 (Type validation)
508508+**Decision Maker**: Project maintainer (max)
509509+**Date Created**: 2026-01-07
+388
.planning/decisions/05_lexicon_namespace.md
···11+# Decision: Lexicon Namespace and NSID Structure
22+33+**Issue**: #49
44+**Status**: Needs decision
55+**Blocks**: #50 (Lexicon validation)
66+**Priority**: Critical for Phase 1
77+88+## DECISION
99+1010+We're going to use an org NSID for the steward organization as the base:
1111+1212+```
1313+ac.foundation.dataset.*
1414+```
1515+1616+That gives us the following record NSIDs:
1717+1818+```
1919+ac.foundation.dataset.sampleSchema
2020+ac.foundation.dataset.record
2121+ac.foundation.dataset.lens
2222+```
2323+2424+---
2525+2626+## Problem Statement
2727+2828+We need to finalize the namespace (NSID - Namespaced Identifier) for atdata Lexicons. This is a critical decision because:
2929+- NSIDs are permanent and hard to change
3030+- They affect discoverability and organization
3131+- They may require coordination with ATProto/Bluesky team
3232+3333+## Context
3434+3535+ATProto NSIDs follow reverse domain notation:
3636+```
3737+app.bsky.feed.post # Bluesky official feed posts
3838+com.example.myapp.record # Third-party app
3939+```
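Reverse domain notation means that every segment except the last forms the domain authority, written backwards, and the last segment is the name. A small sketch makes the mapping explicit; `split_nsid` is a hypothetical helper and does not validate the full NSID grammar.

```python
def split_nsid(nsid: str) -> tuple[str, str]:
    """Return (domain authority, name) for an NSID."""
    segments = nsid.split(".")
    if len(segments) < 3:
        raise ValueError(f"an NSID needs at least three segments: {nsid!r}")
    *authority, name = segments
    # Reversing the authority segments yields the ordinary domain name.
    return ".".join(reversed(authority)), name
```

This is why the choice of namespace is tied to a domain: the authority half of the NSID reads as a domain name that someone can own.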
4040+4141+We need NSIDs for three record types:
4242+1. Schema records (PackableSample definitions)
4343+2. Dataset records (dataset indexes)
4444+3. Lens records (transformations)
4545+4646+## Current Proposal
4747+4848+```
4949+app.bsky.atdata.schema # PackableSample schema records
5050+app.bsky.atdata.dataset # Dataset index records
5151+app.bsky.atdata.lens # Lens transformation records
5252+```
5353+5454+## Options
5555+5656+### Option 1: `app.bsky.atdata.*` (Current Proposal)
5757+5858+**Full NSIDs**:
5959+- `app.bsky.atdata.schema`
6060+- `app.bsky.atdata.dataset`
6161+- `app.bsky.atdata.lens`
6262+6363+**Pros**:
6464+- ✅ Under Bluesky ecosystem umbrella
6565+- ✅ High visibility and discoverability
6666+- ✅ Official-looking namespace
6767+- ✅ Good for adoption
6868+6969+**Cons**:
7070+- ❌ May require approval from Bluesky team
7171+- ❌ `app.bsky.*` typically for official Bluesky apps
7272+- ❌ Could be rejected or need to change later
7373+- ❌ Implies Bluesky endorsement/ownership
7474+7575+**Risk**: ⚠️ Medium (may need to change if not approved)
7676+7777+---
7878+7979+### Option 2: `io.atdata.*` or `org.atdata.*`
8080+8181+**Full NSIDs**:
8282+- `io.atdata.schema`
8383+- `io.atdata.dataset`
8484+- `io.atdata.lens`
8585+8686+**Pros**:
8787+- ✅ Independent namespace
8888+- ✅ No approval needed
8989+- ✅ Clear ownership (atdata project)
9090+- ✅ Can use immediately
9191+9292+**Cons**:
9393+- ❌ Less discoverable (not under Bluesky)
9494+- ❌ Appears less "official"
9595+- ❌ Need to own atdata.io domain (or just use anyway?)
9696+9797+**Risk**: ⭐ Low (we control it)
9898+9999+---
100100+101101+### Option 3: `app.bsky.atproto.atdata.*` (Nested)
102102+103103+**Full NSIDs**:
104104+- `app.bsky.atproto.atdata.schema`
105105+- `app.bsky.atproto.atdata.dataset`
106106+- `app.bsky.atproto.atdata.lens`
107107+108108+**Pros**:
109109+- ✅ Still under Bluesky but more specific
110110+- ✅ Groups with other ATProto-related Lexicons
111111+- ✅ Less likely to conflict
112112+113113+**Cons**:
114114+- ❌ Longer NSIDs
115115+- ❌ Awkward naming (`atproto.atdata`?)
116116+- ❌ Still may need approval
117117+118118+**Risk**: ⚠️ Medium
119119+120120+---
121121+122122+### Option 4: Personal/Org namespace (e.g., `com.github.username.atdata.*`)
123123+124124+**Example with your GitHub**:
125125+- `com.github.maxineishere.atdata.schema` (if that's your GH username)
126126+- Or: `com.yourorg.atdata.schema`
127127+128128+**Pros**:
129129+- ✅ Guaranteed to work (it's your namespace)
130130+- ✅ No approval needed
131131+- ✅ Clear ownership
132132+133133+**Cons**:
134134+- ❌ Looks very unofficial
135135+- ❌ Hard to discover
136136+- ❌ Tied to individual/org, not project
137137+- ❌ May need to migrate later if project grows
138138+139139+**Risk**: ⭐ Very Low (but not ideal for adoption)
140140+141141+## Recommendation: Start with Option 2 (`io.atdata.*`), Keep Option 1 as Goal
142142+143143+**Phased Approach**:
144144+145145+### Phase 1: Use `io.atdata.*` immediately
146146+- No approvals needed
147147+- Can start development right away
148148+- Professional-looking namespace
149149+- Independent from Bluesky governance
150150+151151+### Future: Request `app.bsky.atdata.*` if appropriate
152152+- Once atdata has users and proven value
153153+- Submit formal request to Bluesky/ATProto team
154154+- Migrate if approved (see migration plan below)
155155+156156+**Rationale**:
157157+1. **Speed**: Don't block development waiting for approval
158158+2. **Safety**: If denied `app.bsky.*`, we haven't committed to it
159159+3. **Flexibility**: Can migrate namespaces if needed
160160+4. **Independence**: atdata can exist independently of Bluesky
161161+162162+## Implementation Details
163163+164164+### Namespace Structure
165165+166166+```
167167+io.atdata
168168+ ├── schema # PackableSample schema definitions
169169+ ├── dataset # Dataset index records
170170+ └── lens # Lens transformations
171171+```
172172+173173+**Lexicon IDs**:
174174+```json
175175+{
176176+ "lexicon": 1,
177177+ "id": "io.atdata.schema",
178178+ ...
179179+}
180180+```
181181+182182+```json
183183+{
184184+ "lexicon": 1,
185185+ "id": "io.atdata.dataset",
186186+ ...
187187+}
188188+```
189189+190190+```json
191191+{
192192+ "lexicon": 1,
193193+ "id": "io.atdata.lens",
194194+ ...
195195+}
196196+```
197197+198198+### Record URIs
199199+200200+```
201201+at://did:plc:abc123/io.atdata.schema/3jk2lo34klm
202202+at://did:plc:abc123/io.atdata.dataset/7mn8op56pqr
203203+at://did:plc:abc123/io.atdata.lens/2fg4hi78jkl
204204+```
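Each record URI has three parts: the repository authority (a DID here), the collection NSID, and the record key. The sketch below splits a URI into those parts with plain string handling, since the colons in a DID do not fit a generic URL parser well; `parse_record_uri` and `RecordRef` are hypothetical names.

```python
from typing import NamedTuple


class RecordRef(NamedTuple):
    authority: str   # DID (or handle) of the repository
    collection: str  # Lexicon NSID of the record type
    rkey: str        # record key within the collection


def parse_record_uri(uri: str) -> RecordRef:
    """Split 'at://<authority>/<collection>/<rkey>' into its parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected authority/collection/rkey: {uri!r}")
    return RecordRef(*parts)


ref = parse_record_uri("at://did:plc:abc123/io.atdata.schema/3jk2lo34klm")
```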
205205+206206+### Python Constants
207207+208208+```python
209209+# src/atdata/atproto/_constants.py
210210+211211+SCHEMA_NSID = "io.atdata.schema"
212212+DATASET_NSID = "io.atdata.dataset"
213213+LENS_NSID = "io.atdata.lens"
214214+215215+# Can be changed in one place if we migrate namespaces
216216+```
217217+218218+## Domain Ownership
219219+220220+**Question**: Do we need to own `atdata.io`?
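221221+222222+**ATProto Spec**: Nothing technically stops us from publishing records under an NSID whose domain we don't control, but NSID authority is meant to follow control of the corresponding domain name, so owning the domain is strongly recommended.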
223223+224224+**Options**:
225225+1. **Register `atdata.io`** (~$12/year)
226226+ - Pro: Professional, verifiable ownership
227227+ - Con: Small cost
228228+ - Recommendation: ✅ Do this
229229+230230+2. **Use without owning**
231231+ - Pro: Free
232232+ - Con: Someone else could register it and claim the namespace
233233+ - Recommendation: ❌ Too risky
234234+235235+**Decision**: Register `atdata.io` domain
236236+237237+## Versioning in NSIDs
238238+239239+**Question**: Should version be part of NSID?
240240+241241+### Option A: Version in record (RECOMMENDED)
242242+```
243243+NSIDs: io.atdata.schema (constant)
244244+Versions: In schema record "version" field
245245+```
246246+247247+**Pros**:
248248+- ✅ Stable NSIDs
249249+- ✅ Versions can evolve independently
250250+- ✅ Single collection for all versions
251251+252252+**Cons**:
253253+- ❌ Need to look up version from record
254254+255255+### Option B: Version in NSID
256256+```
257257+NSIDs: io.atdata.schema.v1, io.atdata.schema.v2
258258+```
259259+260260+**Pros**:
261261+- ✅ Version explicit in URI
262262+263263+**Cons**:
264264+- ❌ New NSID for each major version
265265+- ❌ More Lexicons to maintain
266266+- ❌ Harder to query across versions
267267+268268+**Recommendation**: Option A (version in record)
269269+270270+## Namespace Migration Plan
271271+272272+If we need to migrate from `io.atdata.*` to `app.bsky.atdata.*`:
273273+274274+### Migration Steps
275275+276276+1. **Dual Publishing** (transition period)
277277+ ```python
278278+ # Publish to both namespaces
279279+ publisher.publish_schema(
280280+ sample_type,
281281+ nsid="io.atdata.schema" # Old
282282+ )
283283+ publisher.publish_schema(
284284+ sample_type,
285285+ nsid="app.bsky.atdata.schema" # New
286286+ )
287287+ ```
288288+289289+2. **Deprecation Notice**
290290+ - Announce migration timeline
291291+ - Update documentation
292292+ - Add warnings to old namespace
293293+294294+3. **Update Client**
295295+ - Default to new namespace
296296+ - Still support old namespace (read-only)
297297+298298+4. **Sunset Old Namespace**
299299+ - After 6-12 months, stop publishing to old namespace
300300+ - Keep reading old records for compatibility
301301+302302+### Record Linking
303303+304304+Add migration metadata:
305305+```json
306306+{
307307+ "$type": "app.bsky.atdata.schema",
308308+ "metadata": {
309309+ "migratedFrom": "at://did:plc:abc123/io.atdata.schema/3jk2lo34klm"
310310+ },
311311+ ...
312312+}
313313+```
314314+315315+## Additional Lexicons (Future)
316316+317317+Should we reserve NSIDs for future use?
318318+319319+**Potential Additions**:
320320+- `io.atdata.collection` - Group multiple datasets
321321+- `io.atdata.benchmark` - Evaluation results
322322+- `io.atdata.annotation` - User comments/ratings
323323+- `io.atdata.pipeline` - Data processing pipelines
324324+325325+**Recommendation**: Don't create yet, but document reserved names
326326+327327+## Community Input
328328+329329+**Before finalizing**:
330330+1. Check if `io.atdata.*` is available (no conflicts)
331331+2. Reach out to ATProto community (Discord, GitHub)
332332+3. Ask Bluesky team about `app.bsky.atdata.*` feasibility
333333+4. Document decision and rationale
334334+335335+## Open Questions
336336+337337+1. **Should we create a demo namespace first?**
338338+ - `io.atdata.dev.schema` for testing?
339339+ - Pro: Keeps production namespace clean
340340+ - Con: More namespaces to manage
341341+ - Recommendation: Not needed, use test DIDs instead
342342+343343+2. **What about language-specific namespaces?**
344344+ - `io.atdata.py.schema` for Python-specific schemas?
345345+ - Pro: Allows language-specific features
346346+ - Con: Fragments ecosystem
347347+ - Recommendation: ❌ Keep language-agnostic
348348+349349+3. **Should we namespace by domain (vision, NLP, etc.)?**
350350+ - `io.atdata.vision.schema`, `io.atdata.nlp.schema`?
351351+ - Pro: Better organization for large ecosystems
352352+ - Con: Premature optimization
353353+ - Recommendation: ❌ Not for Phase 1
354354+355355+## Success Criteria
356356+357357+After implementing this decision:
358358+- ✅ NSIDs are finalized and documented
359359+- ✅ Lexicon JSON files use correct NSIDs
360360+- ✅ Python code uses constant definitions (easy to change)
361361+- ✅ Migration plan exists if needed
362362+- ✅ Domain `atdata.io` is registered (or plan to register)
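The "constant definitions" criterion above could be met with something like the following. This is a minimal sketch, assuming the working `io.atdata.*` namespace recommended in this document; the module layout and the names `NSID_PREFIX`, `SCHEMA_NSID`, `DATASET_NSID`, `LENS_NSID`, and `record_type` are hypothetical, not existing `atdata` code.

```python
# Hypothetical constants module (illustrative; not an existing atdata file).
# Changing the namespace later should only require editing NSID_PREFIX.
NSID_PREFIX = "io.atdata"  # working namespace recommended in this document

SCHEMA_NSID = f"{NSID_PREFIX}.schema"
DATASET_NSID = f"{NSID_PREFIX}.dataset"
LENS_NSID = f"{NSID_PREFIX}.lens"


def record_type(kind: str) -> str:
    """Return the `$type` string for one of the three record kinds."""
    known = {"schema": SCHEMA_NSID, "dataset": DATASET_NSID, "lens": LENS_NSID}
    if kind not in known:
        raise ValueError(f"Unknown record kind: {kind!r}")
    return known[kind]
```

Publishing and validation code would then refer to `SCHEMA_NSID` (or `record_type("schema")`) instead of repeating string literals.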
363363+364364+## References
365365+366366+- ATProto NSID spec: https://atproto.com/specs/nsid
367367+- Lexicon design: `../02_lexicon_design.md`
368368+- All three Lexicon definitions need this decision
369369+370370+---
372372+**Decision Needed By**: Before starting Phase 1 Issues #22, #23, and #24 (all Lexicon designs)
373373+**Decision Maker**: Project maintainer (max)
374374+**Date Created**: 2026-01-07
375375+376376+## Recommended Action
377377+378378+**Immediate**:
379379+1. ✅ Decide on `io.atdata.*` as working namespace
380380+2. ✅ Plan to register `atdata.io` domain
381381+3. ✅ Document migration path to `app.bsky.atdata.*` if desired later
382382+383383+**Before Phase 2**:
384384+1. Register `atdata.io` domain
385385+2. Optional: Reach out to Bluesky about `app.bsky.atdata.*` for future
386386+387387+**Phase 1**:
388388+Use `io.atdata.*` in all Lexicon designs
+459
.planning/decisions/06_lexicon_validation.md
···11+# Decision: Lexicon Validation Process
22+33+**Issue**: #50
44+**Status**: Needs decision
55+**Blocked By**: #45, #46, #47, #48, #49 (all design decisions)
66+**Priority**: Critical - Final step before Phase 1 completion
77+88+## Problem Statement
99+1010+Once we've finalized all design decisions, we need to validate that our Lexicon JSON definitions:
1111+1. Follow ATProto Lexicon specification correctly
1212+2. Are internally consistent
1313+3. Support all our use cases
1414+4. Can be implemented as designed
1515+1616+This is the final checkpoint before Phase 1 (Lexicon Design) is complete and we move to Phase 2 (Implementation).
1717+1818+## What Needs Validation
1919+2020+### 1. Schema Record Lexicon (`io.atdata.schema`)
2121+- Field type system (primitive, ndarray, nested)
2222+- Type unions are properly structured
2323+- Required vs optional fields
2424+- Constraints (maxLength, etc.) are reasonable
2525+- Example schema records validate against the Lexicon
2626+2727+### 2. Dataset Record Lexicon (`io.atdata.dataset`)
2828+- URL array handling
2929+- Metadata blob size limits
3030+- Schema reference format
3131+- Tag array constraints
3232+- Example dataset records validate against the Lexicon
3333+3434+### 3. Lens Record Lexicon (`io.atdata.lens`)
3535+- Code reference structure
3636+- Schema reference handling
3737+- Union types for different code storage options (if applicable)
3838+- Example lens records validate against the Lexicon
3939+4040+## Validation Checklist
4141+4242+### ATProto Spec Compliance
4343+4444+**Lexicon Structure**:
4545+- [ ] All Lexicons have required fields: `lexicon`, `id`, `defs`
4646+- [ ] `lexicon` field is set to `1` (current version)
4747+- [ ] `id` follows NSID format (reverse domain notation)
4848+- [ ] `defs.main` exists and has `type: "record"`
4949+- [ ] Record `key` is set appropriately (`tid` for time-ordered)
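The structural items in this checklist are mechanical enough to automate. The following is a rough sketch only; `check_lexicon_structure` is a hypothetical helper name, and it covers just the `lexicon`, `defs.main`, and `key` items, not NSID syntax.

```python
def check_lexicon_structure(lexicon: dict) -> list[str]:
    """Return a list of structural problems found in a Lexicon document."""
    errors = []

    # The Lexicon language version is expected to be the integer 1.
    if lexicon.get("lexicon") != 1:
        errors.append("'lexicon' must be 1")

    defs = lexicon.get("defs")
    main = defs.get("main") if isinstance(defs, dict) else None
    if not isinstance(main, dict):
        errors.append("Missing 'defs.main'")
        return errors

    # Record Lexicons declare type "record" and a record key type such as "tid".
    if main.get("type") != "record":
        errors.append(f"'defs.main.type' should be 'record', got {main.get('type')!r}")
    if "key" not in main:
        errors.append("'defs.main.key' is missing")

    return errors
```

An empty list means every structural check in this sketch passed; each entry otherwise maps back to one checklist item.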
5050+5151+**Field Types**:
5252+- [ ] All field types are valid ATProto types
5353+ - `string`, `integer`, `boolean`, `bytes`, `object`, `array`
5454+ - `ref`, `union` for complex types
5555+- [ ] String fields have appropriate `maxLength`
5656+- [ ] Array fields have `items` definition
5757+- [ ] Object fields have `properties` definition
5858+- [ ] Refs point to valid def names (e.g., `#fieldType`)
5959+6060+**Constraints**:
6161+- [ ] `maxLength` values are reasonable (not too small, not too large)
6262+- [ ] `minLength` constraints make sense
6363+- [ ] Required fields are marked correctly
6464+- [ ] Optional fields have appropriate defaults
6565+6666+### Internal Consistency
6767+6868+**Cross-References**:
6969+- [ ] Schema refs (e.g., `schemaRef` in datasets) use correct format
7070+ - Should be AT-URI format: `at://did:plc:.../io.atdata.schema/...`
7171+- [ ] Union refs point to existing defs
7272+- [ ] No circular references
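The AT-URI format check for schema references might be sketched as below. This is an assumption-laden approximation: the pattern is deliberately loose, it accepts only DID authorities (not handles), and `check_schema_ref` is a hypothetical name.

```python
import re

# Loose pattern for at://<did>/<collection>/<rkey>.
# Assumption: schema refs always use a DID authority, never a handle.
_AT_URI_RE = re.compile(
    r"at://(?P<did>did:[a-z]+:[A-Za-z0-9._:%-]+)"
    r"/(?P<collection>[A-Za-z0-9.-]+)"
    r"/(?P<rkey>[A-Za-z0-9._:~-]+)"
)


def check_schema_ref(uri: str, expected_collection: str = "io.atdata.schema") -> list[str]:
    """Check that a schema reference is an AT-URI pointing at the schema collection."""
    match = _AT_URI_RE.fullmatch(uri)
    if match is None:
        return [f"Not a well-formed AT-URI: {uri}"]
    if match["collection"] != expected_collection:
        return [f"Expected collection {expected_collection}, got {match['collection']}"]
    return []
```

Detecting circular references would need the actual records and is not covered here.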
7373+7474+**Type System**:
7575+- [ ] Field types are well-defined
7676+ - Primitive types map clearly (str, int, float, bool, bytes)
7777+ - NDArray type includes dtype and optional shape
7878+ - Nested types have schema reference
7979+- [ ] Optional vs required semantics are clear
8080+8181+**Metadata**:
8282+- [ ] Descriptions are present and helpful
8383+- [ ] Examples match the schema
8484+- [ ] Deprecations are noted (if any)
8585+8686+### Use Case Coverage
8787+8888+**Can we represent...**:
8989+- [ ] All current PackableSample types?
9090+- [ ] NDArray with dtype and shape information?
9191+- [ ] Optional fields?
9292+- [ ] Nested PackableSample types (future)?
9393+- [ ] Dataset metadata (arbitrary key-value)?
9494+- [ ] Multiple WebDataset shard URLs?
9595+- [ ] Lens code references (repo + commit + path)?
9696+9797+**Can we implement...**:
9898+- [ ] Python codegen from schema records?
9999+- [ ] Dataset publishing with external URLs?
100100+- [ ] Dataset loading from records?
101101+- [ ] Lens publishing with code references?
102102+- [ ] Schema versioning (version field present)?
103103+104104+## Validation Methods
105105+106106+### 1. Schema Validation Tools
107107+108108+**Use ATProto Tools** (if available):
109109+```bash
110110+# If ATProto has a Lexicon validator
111111+atproto-lexicon validate io.atdata.schema.json
112112+atproto-lexicon validate io.atdata.dataset.json
113113+atproto-lexicon validate io.atdata.lens.json
114114+```
115115+116116+**Create Custom Validator**:
117117+```python
118118+# src/atdata/atproto/validation.py
119119+from jsonschema import validate, ValidationError
120120+121121+def validate_lexicon(lexicon_json: dict) -> tuple[bool, list[str]]:
122122+ """Validate Lexicon against ATProto spec."""
123123+ errors = []
124124+125125+ # Check required fields
126126+ if 'lexicon' not in lexicon_json:
127127+ errors.append("Missing 'lexicon' field")
128128+ if 'id' not in lexicon_json:
129129+ errors.append("Missing 'id' field")
130130+ if 'defs' not in lexicon_json:
131131+ errors.append("Missing 'defs' field")
132132+133133+ # Check NSID format
134134+ nsid = lexicon_json.get('id', '')
135135+ if not is_valid_nsid(nsid):
136136+ errors.append(f"Invalid NSID: {nsid}")
137137+138138+ # More validations...
139139+140140+ return len(errors) == 0, errors
141141+```
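`validate_lexicon` above calls `is_valid_nsid`, which the snippet does not define. A minimal sketch follows, assuming a regex that approximates the NSID rules (at least three dot-separated segments, a name segment starting with a letter, 317 characters overall); the NSID spec remains the authority on the exact syntax.

```python
import re

# Approximation of NSID syntax; consult the NSID spec for the exact rules.
_NSID_RE = re.compile(
    r"[a-zA-Z][a-zA-Z0-9-]{0,62}"           # first authority segment
    r"(\.[a-zA-Z0-9][a-zA-Z0-9-]{0,62})+"   # remaining authority segments
    r"\.[a-zA-Z][a-zA-Z0-9]{0,62}"          # final name segment
)


def is_valid_nsid(nsid: str) -> bool:
    """Return True if `nsid` looks like a syntactically valid NSID."""
    return len(nsid) <= 317 and _NSID_RE.fullmatch(nsid) is not None
```

Using `fullmatch` avoids accepting strings with trailing characters such as a newline.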
142142+143143+### 2. Example Record Validation
145145+**Create Example Records** (e.g. `examples/schema_record.json`):
146146+147147+```json
149149+{
150150+ "$type": "io.atdata.schema",
151151+ "name": "ImageSample",
152152+ "version": "1.0.0",
153153+ "description": "Sample with image and label",
154154+ "fields": [
155155+ {
156156+ "name": "image",
157157+ "type": {"kind": "ndarray", "dtype": "uint8", "shape": [null, null, 3]},
158158+ "optional": false
159159+ },
160160+ {
161161+ "name": "label",
162162+ "type": {"kind": "primitive", "primitive": "str"},
163163+ "optional": false
164164+ }
165165+ ],
166166+ "metadata": {"author": "alice"},
167167+ "createdAt": "2025-01-06T12:00:00Z"
168168+}
169169+```
170170+171171+**Validate Against Lexicon**:
172172+```python
173173+def validate_record(record: dict, lexicon: dict) -> tuple[bool, list[str]]:
174174+ """Validate a record against its Lexicon."""
175175+ errors = []
176176+177177+ # Check $type matches Lexicon id
178178+ record_type = record.get('$type')
179179+ lexicon_id = lexicon.get('id')
180180+ if record_type != lexicon_id:
181181+ errors.append(f"Type mismatch: {record_type} != {lexicon_id}")
182182+183183+ # Validate required fields
184184+ main_def = lexicon['defs']['main']['record']
185185+ required = main_def.get('required', [])
186186+ for field in required:
187187+ if field not in record:
188188+ errors.append(f"Missing required field: {field}")
189189+190190+ # Validate field types
191191+ properties = main_def.get('properties', {})
192192+ for field, value in record.items():
193193+ if field in properties:
194194+ # Type checking logic
195195+ pass
196196+197197+ return len(errors) == 0, errors
198198+```
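The "Type checking logic" placeholder in `validate_record` could be filled in along these lines. This is a sketch under stated assumptions: `check_field` and the `_PY_TYPES` mapping are hypothetical, only a few Lexicon types are handled, and `ref`, `union`, and `bytes` are skipped. Lexicon string `maxLength` is measured in UTF-8 bytes, so the value is encoded before comparing.

```python
# Assumed mapping from Lexicon type names to Python types (subset only).
_PY_TYPES = {
    "string": str,
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_field(name: str, value, spec: dict) -> list[str]:
    """Check one record field against its Lexicon property definition."""
    lex_type = spec.get("type")
    expected = _PY_TYPES.get(lex_type)
    if expected is None:
        return []  # ref, union, bytes, etc. are out of scope for this sketch

    # bool is a subclass of int in Python, so reject it explicitly for integers.
    is_bool_as_int = lex_type == "integer" and isinstance(value, bool)
    if not isinstance(value, expected) or is_bool_as_int:
        return [f"{name}: expected {lex_type}, got {type(value).__name__}"]

    if lex_type == "string" and "maxLength" in spec:
        byte_len = len(value.encode("utf-8"))
        if byte_len > spec["maxLength"]:
            return [f"{name}: {byte_len} bytes exceeds maxLength {spec['maxLength']}"]
    return []
```

Inside the loop, `errors.extend(check_field(field, value, properties[field]))` would replace the `pass`.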
199199+200200+### 3. Roundtrip Testing
201201+202202+**Test Full Cycle**:
203203+1. Create PackableSample class
204204+2. Generate schema record from class
205205+3. Validate schema record against Lexicon
206206+4. Generate code from schema record
207207+5. Verify generated code matches original class
208208+209209+```python
210210+def test_roundtrip():
211211+ # 1. Original class
212212+ @atdata.packable
213213+ class TestSample:
214214+ x: int
215215+ y: str
216216+217217+ # 2. Generate schema record
218218+ generator = SchemaRecordGenerator()
219219+ record = generator.from_class(TestSample)
220220+221221+ # 3. Validate against Lexicon
222222+ is_valid, errors = validate_record(record, SCHEMA_LEXICON)
223223+ assert is_valid, f"Validation failed: {errors}"
224224+225225+ # 4. Generate code from record
226226+ codegen = PythonGenerator()
227227+ code = codegen.generate_from_record(record)
228228+229229+ # 5. Execute generated code and compare
230230+ exec_globals = {}
231231+ exec(code, exec_globals)
232232+ GeneratedClass = exec_globals['TestSample']
233233+234234+ # Should be equivalent
235235+ original_instance = TestSample(x=1, y="test")
236236+ generated_instance = GeneratedClass(x=1, y="test")
237237+238238+ assert original_instance.packed == generated_instance.packed
239239+```
240240+241241+### 4. Edge Case Testing
242242+243243+**Test Corner Cases**:
244244+- [ ] Empty optional fields
245245+- [ ] Very long strings (maxLength boundary)
246246+- [ ] Large arrays (maxItems boundary)
247247+- [ ] Complex nested types
248248+- [ ] Unicode in strings
249249+- [ ] Special characters in names
250250+- [ ] Large metadata blobs
251251+252252+## Validation Artifacts
253253+254254+After validation, we should have:
255255+256256+### 1. Finalized Lexicon JSON Files
257257+258258+```
259259+.planning/lexicons/
260260+ io.atdata.schema.json
261261+ io.atdata.dataset.json
262262+ io.atdata.lens.json
263263+```
264264+265265+Each file:
266266+- Validates against ATProto Lexicon spec
267267+- Has complete documentation
268268+- Includes examples
269269+270270+### 2. Example Records
271271+272272+```
273273+.planning/examples/
274274+ schema_example.json
275275+ dataset_example.json
276276+ lens_example.json
277277+```
278278+279279+Each example:
280280+- Validates against its Lexicon
281281+- Demonstrates all key features
282282+- Includes comments explaining choices
283283+284284+### 3. Validation Test Suite
285285+286286+```python
287287+# tests/test_lexicons.py
288288+import json
289289+def test_schema_lexicon_valid():
290290+ """Test schema Lexicon is valid."""
291291+ with open('.planning/lexicons/io.atdata.schema.json') as f:
292292+ lexicon = json.load(f)
293293+ is_valid, errors = validate_lexicon(lexicon)
294294+ assert is_valid, errors
295295+296296+def test_schema_example_valid():
297297+ """Test schema example validates against Lexicon."""
298298+ with open('.planning/lexicons/io.atdata.schema.json') as f:
299299+ lexicon = json.load(f)
300300+ with open('.planning/examples/schema_example.json') as f:
301301+ example = json.load(f)
302302+ is_valid, errors = validate_record(example, lexicon)
303303+ assert is_valid, errors
304304+305305+# Similar tests for dataset and lens
306306+```
307307+308308+### 4. Validation Report
309309+310310+```markdown
311311+# Lexicon Validation Report
312312+313313+## Summary
314314+- Schema Lexicon: ✅ Valid
315315+- Dataset Lexicon: ✅ Valid
316316+- Lens Lexicon: ✅ Valid
317317+318318+## Validation Results
319319+320320+### io.atdata.schema
321321+- ATProto compliance: ✅ Pass
322322+- Internal consistency: ✅ Pass
323323+- Example validation: ✅ Pass
324324+- Edge cases: ✅ Pass
325325+326326+### io.atdata.dataset
327327+...
328328+329329+## Issues Found
330330+None
331331+332332+## Recommendations
333333+1. Consider adding X field to Y
334334+2. Might want to increase maxLength for Z
335335+...
336336+```
337337+338338+## Implementation Plan
339339+340340+### Step 1: Create Lexicon JSON Files (depends on decisions #45-49)
341341+342342+Based on finalized decisions:
343343+- Schema representation format (#45)
344344+- Lens code storage (#46)
345345+- WebDataset storage (#47)
346346+- Schema evolution (#48)
347347+- Lexicon namespace (#49)
348348+349349+Create three JSON files with complete Lexicon definitions.
350350+351351+### Step 2: Create Example Records
352352+353353+For each Lexicon, create 2-3 example records demonstrating:
354354+- Minimal record
355355+- Full-featured record
356356+- Edge cases
357357+358358+### Step 3: Write Validation Tests
359359+360360+Implement validation test suite that:
361361+- Validates Lexicons against ATProto spec
362362+- Validates examples against Lexicons
363363+- Tests roundtrip (class → record → code → class)
364364+365365+### Step 4: Manual Review
366366+367367+Have team members review:
368368+- Lexicon designs
369369+- Example records
370370+- Any edge cases or concerns
371371+372372+### Step 5: Document Issues and Resolutions
373373+374374+Track any issues found:
375375+- What was wrong?
376376+- How was it fixed?
377377+- Why was this decision made?
378378+379379+### Step 6: Final Sign-off
380380+381381+Once all validation passes:
382382+- Mark Issue #50 as complete
383383+- Unblock Phase 1 (Issue #17)
384384+- Proceed to Phase 2 implementation
385385+386386+## Tools and Resources
387387+388388+**ATProto Resources**:
389389+- Lexicon specification: https://atproto.com/specs/lexicon
390390+- NSID specification: https://atproto.com/specs/nsid
391391+- Example Lexicons: https://github.com/bluesky-social/atproto/tree/main/lexicons
392392+393393+**Validation Tools**:
394394+- JSON Schema validator (jsonschema library)
395395+- ATProto SDK validation (if available)
396396+- Custom validators (we'll write)
397397+398398+**Documentation**:
399399+- All planning docs in `.planning/`
400400+- Decision docs in `.planning/decisions/`
401401+- Lexicon design in `02_lexicon_design.md`
402402+403403+## Success Criteria
404404+405405+Phase 1 Issue #17 is complete when:
406406+- ✅ All three Lexicons are finalized and validated
407407+- ✅ Example records validate against Lexicons
408408+- ✅ Roundtrip tests pass
409409+- ✅ Team has reviewed and approved
410410+- ✅ Documentation is complete
411411+- ✅ Ready to begin Phase 2 implementation
412412+413413+## Next Steps After Validation
414414+415415+Once Issue #50 is complete:
416416+1. Close Issue #50
417417+2. Unblock and close Issue #17 (Phase 1)
418418+3. Begin Phase 2 (Issue #18) - Python Client implementation
419419+4. Reference finalized Lexicons during implementation
420420+421421+## Open Questions
422422+423423+1. **Should we submit Lexicons to ATProto for official review?**
424424+ - Pro: Get expert feedback
425425+ - Con: Delays, may not be necessary
426426+ - Recommendation: Optional, do if time permits
427427+428428+2. **Should we create a Lexicon registry/index?**
429429+ - Pro: Makes discovery easier
430430+ - Con: Extra infrastructure
431431+ - Recommendation: Defer to Phase 3 (AppView)
432432+433433+3. **How do we handle Lexicon updates after publication?**
434434+ - Once records exist, changing Lexicons is breaking
435435+ - Need clear versioning for Lexicons themselves
435435+ - Recommendation: Keep Lexicons at v1 throughout Phases 1-5
437437+438438+## References
439439+440440+- All design decisions: `01-05_*.md` in this directory
441441+- Lexicon design: `../02_lexicon_design.md`
442442+- ATProto Lexicon spec: https://atproto.com/specs/lexicon
443443+444444+---
445445+446446+**Decision Needed By**: After all decisions #45-49 are finalized
447447+**Decision Maker**: Project maintainer (max) + team review
448448+**Date Created**: 2026-01-07
449449+450450+## Recommended Action
451451+452452+**After all design decisions are made**:
453453+1. Create three Lexicon JSON files
454454+2. Create example records for each
455455+3. Write and run validation test suite
456456+4. Review as team
457457+5. Document any issues and fixes
458458+6. Get final sign-off
459459+7. Mark Phase 1 complete ✅
+158
.planning/decisions/README.md
···11+# Critical Design Decisions for ATProto Integration
22+33+This directory contains detailed analysis and recommendations for the critical design decisions needed before implementing ATProto integration in `atdata`.
44+55+## Decision Documents (In Dependency Order)
66+77+### Core Design Decisions (Can be made in parallel)
88+99+1. **[01_schema_representation_format.md](01_schema_representation_format.md)** (Issue #45)
1010+ - **Question**: How to represent PackableSample types in Lexicon records?
1111+ - **Options**: Custom format, JSON Schema, Protobuf
1212+ - **Recommendation**: Custom format within ATProto Lexicon
1313+ - **Impact**: Code generation, cross-language support
1414+ - **Blocks**: Issue #50 (validation)
1515+1616+2. **[02_lens_code_storage.md](02_lens_code_storage.md)** (Issue #46)
1717+ - **Question**: How to store Lens transformation code?
1818+ - **Options**: Code references, inline code, metadata only
1919+ - **Recommendation**: Code references (GitHub + commit hash) only
2020+ - **Impact**: Security, usability, trust model
2121+ - **Blocks**: Issue #50 (validation)
2222+ - ⚠️ **CRITICAL SECURITY DECISION**
2323+2424+3. **[03_webdataset_storage.md](03_webdataset_storage.md)** (Issue #47)
2525+ - **Question**: Where to store actual WebDataset .tar files?
2626+ - **Options**: External URLs, ATProto blobs, hybrid
2727+ - **Recommendation**: External URLs (Phase 1), hybrid (future)
2828+ - **Impact**: Decentralization, scalability, costs
2929+ - **Blocks**: Issue #50 (validation)
3030+3131+4. **[04_schema_evolution.md](04_schema_evolution.md)** (Issue #48)
3232+ - **Question**: How do schemas evolve without breaking changes?
3333+ - **Options**: Semantic versioning, compatibility rules, migrations
3434+ - **Recommendation**: Semantic versioning + Lenses for migration
3535+ - **Impact**: Long-term maintainability, compatibility
3636+ - **Blocks**: Issue #50 (validation), Issue #39 (type validation)
3737+3838+5. **[05_lexicon_namespace.md](05_lexicon_namespace.md)** (Issue #49)
3939+ - **Question**: What namespace (NSID) to use for Lexicons?
4040+ - **Options**: `app.bsky.atdata.*`, `io.atdata.*`, others
4141+ - **Recommendation**: `io.atdata.*` (Phase 1), request `app.bsky.*` later
4242+ - **Impact**: Discoverability, ownership, migration
4343+ - **Blocks**: Issue #50 (validation)
4444+4545+### Final Validation (Depends on all above)
4646+4747+6. **[06_lexicon_validation.md](06_lexicon_validation.md)** (Issue #50)
4848+ - **Question**: How to validate finalized Lexicon designs?
4949+ - **Process**: Validation checklist, example records, tests
5050+ - **Deliverables**: Finalized Lexicon JSON files, validation report
5151+ - **Blocked By**: Issues #45, #46, #47, #48, #49 (all completed ✅)
5252+ - **Blocks**: Phase 1 completion (Issue #17)
5353+ - **Status**: Ready to proceed
5454+5555+### Architectural Assessment
5656+5757+7. **[assessment.md](assessment.md)** (Issue #51) ✅ **Complete**
5858+ - **Comprehensive appraisal** of all finalized design decisions
5959+ - **Overall Grade**: A- (Excellent with caveats)
6060+ - **Analysis**: Strengths, synergies, trade-offs, risks, long-term trajectory
6161+ - **Recommendations**: Immediate next steps and phasing guidance
6262+6363+## Decision Status
6464+6565+| Issue | Decision | Status | Final Decision |
6666+|-------|----------|--------|----------------|
6767+| #45 | Schema format | ✅ Decided | JSON Schema with NDArray shim |
6868+| #46 | Lens code storage | ✅ Decided | External repos (GitHub + tangled.org) |
6969+| #47 | WebDataset storage | ✅ Decided | Hybrid (URLs + blobs from start) |
7070+| #48 | Schema evolution | ✅ Decided | rkey={NSID}@{semver} + migration Lenses |
7171+| #49 | Lexicon namespace | ✅ Decided | `ac.foundation.dataset.*` |
7272+| #50 | Validation process | ⏳ Ready | Proceed with finalized decisions |
7373+| #51 | Architectural appraisal | ✅ Complete | See [assessment.md](assessment.md) |
7474+7575+**Overall Assessment**: Grade A- (Excellent with caveats) - See [assessment.md](assessment.md) for detailed analysis
7676+7777+## How to Use These Documents
7878+7979+### For Review
8080+8181+1. **Read in order** (01 through 06) to understand dependencies
8282+2. **Focus on recommendations** - detailed analysis supports them
8383+3. **Check open questions** - some need your input
8484+4. **Provide feedback** - comment on issues or update documents
8585+8686+### For Implementation
8787+8888+1. **After decisions made** - use as reference during coding
8989+2. **Check success criteria** - ensure implementation meets goals
9090+3. **Follow recommendations** - they're based on thorough analysis
9191+4. **Update as needed** - decisions can evolve with learning
9292+9393+## Key Insights
9494+9595+### Security First
9696+- **Issue #46** (Lens code storage) is a critical security decision
9797+- Recommendation: Code references only (no arbitrary code execution)
9898+- Can add inline code later if we solve sandboxing
9999+100100+### Pragmatic Approach
101101+- Start with what works (external URLs, custom format)
102102+- Add sophistication later (ATProto blobs, advanced features)
103103+- Don't block on perfect solutions
104104+105105+### Independence
106106+- Use `io.atdata.*` namespace (don't wait for Bluesky approval)
107107+- Can migrate to `app.bsky.atdata.*` later if desired
108108+- Maintain control over project direction
109109+110110+### Future-Proof
111111+- Semantic versioning enables evolution
112112+- Hybrid storage approach allows flexibility
113113+- Custom format gives us full control
114114+115115+## Decision Dependencies
116116+117117+```
118118+┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
119119+│ #45 │ │ #46 │ │ #47 │ │ #48 │ │ #49 │
120120+│ Format │ │ Lens │ │ Storage │ │Evolution│ │Namespace│
121121+└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
122122+ │ │ │ │ │
123123+ └────────────┴────────────┴────────────┴────────────┘
124124+ │
125125+ ┌────▼────┐
126126+ │ #50 │
127127+ │Validate │
128128+ └────┬────┘
129129+ │
130130+ ┌────▼────┐
131131+ │ Phase 1 │
132132+ │Complete │
133133+ └─────────┘
134134+```
135135+136136+All decisions #45-49 can be made in parallel, then #50 validates everything before Phase 1 completion.
137137+138138+## Timeline
139139+140140+**Recommended**:
141141+1. **Week 1**: Review and decide on #45-49 (can be done in parallel)
142142+2. **Week 2**: Validation (#50) - create Lexicon JSON files and examples
143143+3. **Week 3**: Begin Phase 2 implementation
144144+145145+**Flexible**: Can make decisions incrementally, but all needed before #50
146146+147147+## Questions?
148148+149149+- Review individual decision documents for detailed analysis
150150+- Check "Open Questions" sections for items needing input
151151+- See "References" sections for related planning documents
152152+- Consult `../02_lexicon_design.md` for technical details
153153+154154+---
155155+156156+**Created**: 2026-01-07
157157+**Status**: Decisions #45-49 finalized; validation (#50) ready to proceed
158158+**Next Step**: Proceed with Lexicon validation (#50) using the finalized decisions
+313
.planning/decisions/assessment.md
···11+# Architectural Assessment of Design Decisions
22+33+**Issue**: #51
44+**Date**: 2026-01-07
55+**Status**: Complete
66+77+## Overall Impression: **Ambitious but Coherent**
88+99+The finalized design decisions prioritize **flexibility and future-proofing** over initial simplicity. This is a deliberate trade-off that makes sense given the scope of building a distributed dataset federation.
1010+1111+---
1212+1313+## Decision Summary
1414+1515+1. **Schema Format (#45)**: JSON Schema with NDArray shim, extensible via open union
1616+2. **Lens Code (#46)**: External repos (GitHub + tangled.org), language metadata, future attestation
1717+3. **Storage (#47)**: Hybrid (URLs + blobs) from start, AppView proxy for blobs
1818+4. **Evolution (#48)**: rkey as {NSID}@{semver}, getLatestSchema query, optional migration Lenses
1919+5. **Namespace (#49)**: `ac.foundation.dataset.*` (sampleSchema, record, lens)
2020+2121+---
2222+2323+## Key Strengths
2424+2525+### 1. **Ecosystem Integration** (JSON Schema + External Repos)
2626+2727+**Decision**: JSON Schema for type definitions, external repos for code storage
2828+2929+**Strength**: Leveraging existing ecosystems rather than building in isolation. JSON Schema brings:
3030+- Extensive tooling (validators, codegen, IDE support)
3131+- Multi-language support out of the box
3232+- Familiarity for developers
3333+3434+Pairing this with GitHub/tangled.org for Lenses means developers can use existing workflows.
3535+3636+**Implication**: Lower barrier to entry, faster time to value. The NDArray shim is the only custom piece, which is appropriate since that's the unique requirement.
3737+3838+---
3939+4040+### 2. **Progressive Decentralization** (Hybrid Storage)
4141+4242+**Decision**: Hybrid storage from day one (URLs + PDS blobs)
4444+**Strength**: This is pragmatic yet principled. It avoids forcing decentralization where it doesn't make sense (TB-scale datasets) while enabling it where it does (smaller datasets, self-hosters).
4545+4646+**Key Insight**: The AppView proxy for blobs is clever - it means users can work with a unified WebDataset URL interface regardless of backend storage. This abstraction is powerful.
4747+4848+**Implication**: More implementation complexity upfront, but avoids a painful migration later. The open union pattern makes this clean.
4949+5050+---
5151+5252+### 3. **Versioning as Identity** (rkey = NSID@semver)
5353+5454+**Decision**: Embed version in record key, use NSID for permanent identity
5555+5656+**Strength**: This is elegant. By making versioning part of the identity (rkey), you get:
5757+- Immutable version records (can't accidentally update a published version)
5858+- Natural query pattern (`getLatestSchema` Lexicon)
5959+- Clear semantic versioning enforcement
6060+6161+**Synergy**: Combining this with Lenses for migration is brilliant. The rkey structure makes it trivial to discover what migrations exist (e.g., "show me all versions of schema X").
6262+6363+**Implication**: This requires custom rkey handling (type `any` in Lexicon), which ATProto supports but isn't the default pattern. Need to ensure tooling understands this convention.
6464+6565+---
6666+6767+### 4. **Trust Layer** (Attestation + Verification)
6868+6969+**Decision**: Language metadata + future attestation/verification records for Lenses
7070+7171+**Strength**: Thinking ahead about the trust problem. In a distributed system, trust is critical. This approach:
7272+- Short-term: Language metadata helps users understand what they're running
7373+- Long-term: Attestation (formal correctness proofs) + verification (trusted DIDs)
7474+7575+This is a **strong security model** that's missing from many distributed systems.
7676+7777+**Implication**: This is a research-level feature (formal verification of Lenses). Starting with language metadata is right, but the attestation system will require significant design work. Consider this Phase 6+.
7878+7979+---
8080+8181+## Architectural Tensions (Intentional Trade-offs)
8282+8383+### 1. **Complexity Budget**
8484+8585+**Observation**: Sophisticated solutions across the board:
8686+- JSON Schema (standard but verbose)
8787+- Hybrid storage (two code paths)
8888+- Custom rkey scheme (non-standard)
8989+- Future attestation system (advanced)
9090+9191+**Assessment**: This increases initial implementation cost significantly. However, each choice is justified:
9292+- JSON Schema: Ecosystem benefits outweigh verbosity
9393+- Hybrid storage: Essential for real-world use cases
9494+- Custom rkey: Enables clean versioning
9595+- Attestation: Future-proofing for trust
9696+9797+**Recommendation**: ✅ Accept the complexity, but **phase implementation carefully**:
9898+- Phase 1-2: Core functionality (schemas, datasets, basic lenses)
9999+- Phase 3: Hybrid storage in AppView
100100+- Phase 4: Codegen for JSON Schema
101101+- Phase 5+: Attestation/verification system
102102+103103+---
104104+105105+### 2. **ATProto Conventions vs. Custom Patterns**
106106+107107+**Observation**: Using some non-standard ATProto patterns:
108108+- rkey type `any` (not typical)
109109+- Custom versioning scheme in rkey
110110+- `getLatestSchema` query Lexicon (not standard CRUD)
111111+112112+**Assessment**: This is **justified innovation**. ATProto is designed to support custom use cases. The versioning scheme in particular is a good use of flexible rkey.
113113+114114+**Caveat**: Need to document these conventions clearly, since they won't match typical ATProto examples.
115115+116116+---
117117+118118+### 3. **JSON Schema for NDArray**
119119+120120+**Observation**: JSON Schema wasn't designed for NDArray types. The shim approach treats them as "serialized bytes" with metadata.
121121+122122+**Assessment**: This is **pragmatic but leaky**. The abstraction leaks because:
123123+- JSON Schema describes serialized form (bytes), not semantic form (array with dtype/shape)
124124+- Codegen will need custom handling for NDArray types
125125+- Validation happens at deserialization, not schema level
126126+**Alternative Considered**: A custom format would give a cleaner NDArray representation, but that was traded for ecosystem benefits.
128128+129129+**Mitigation**: Ensure the NDArray shim is well-documented and becomes a de facto standard within the atdata ecosystem. Consider publishing it as a reusable JSON Schema extension.
130130+131131+---
132132+133133+## Synergies (Where Decisions Reinforce Each Other)
134134+135135+### 1. **Versioning + Lenses + rkey Scheme**
136136+This trio works beautifully together:
138138+- rkey embeds version → easy to list all versions
139139+- Lenses enable migration → versions can evolve safely
140140+- `getLatestSchema` query → discoverable latest version
141141+142142+This creates a **complete version management story** that's rare in distributed systems.
143143+144144+---
145145+146146+### 2. **Hybrid Storage + AppView Proxy**
147147+148148+The hybrid storage decision unlocks the proxy pattern:
149149+- Large datasets stay on S3/R2 (practical)
150150+- Small datasets can use PDS blobs (decentralized)
151151+- AppView proxies both → uniform interface
152152+153153+This means the **client code is simple** (just WebDataset URLs) even though the backend is sophisticated.
154154+155155+---
156156+157157+### 3. **JSON Schema + Attestation + Language Metadata**
158158+159159+This builds a **tiered trust model**:
160160+1. Base layer: JSON Schema validates structure
161161+2. Language metadata: Users know what they're executing
162162+3. Attestation (future): Formal proofs of correctness
163163+4. Verification (future): Social trust (trusted DIDs)
164164+165165+Each layer adds security without requiring the next layer to exist.
166166+167167+---
168168+169169+## Implementation Risks & Mitigations
170170+171171+### Risk 1: JSON Schema Complexity
172172+173173+**Risk**: JSON Schema is verbose and can be confusing for users defining NDArray-heavy schemas.
174174+175175+**Mitigation**:
176176+- Build **high-quality codegen** that hides the complexity (users write Python, get JSON Schema)
177177+- Provide **NDArray shim library** that handles the serialization/deserialization
178178+- Create **examples and templates** for common patterns
179179+180180+---
181181+182182+### Risk 2: Hybrid Storage Code Paths
183183+184184+**Risk**: Two storage backends mean 2x testing, 2x bugs, 2x maintenance.
185185+186186+**Mitigation**:
187187+- Use **abstraction layer** in Dataset class (already planned)
188188+- **Prioritize external URLs** for Phase 1-2 (blob support can be added incrementally)
189189+- Test both paths from the start (CI/CD)
190190+191191+---
192192+193193+### Risk 3: Custom rkey Convention
194194+195195+**Risk**: Tools that expect standard TID-based rkeys might break.
196196+197197+**Mitigation**:
198198+- **Document clearly** in all Lexicon definitions
199199+- Provide **helper functions** in SDK (`parseSchemaRkey`, `formatSchemaRkey`)
200200+- Ensure `getLatestSchema` query is the primary discovery mechanism (hides rkey complexity)
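
The helper functions mentioned above could look roughly like the following. This is a minimal sketch, assuming the `{NSID}@{semver}` convention described in this document; the Python names `format_schema_rkey` and `parse_schema_rkey` are hypothetical stand-ins for `formatSchemaRkey` and `parseSchemaRkey`, and the patterns are simplified (they accept a single-segment name such as `imageclassification`, and they do not check ATProto's own record-key syntax rules).

```python
import re
from typing import NamedTuple

# Simplified patterns for illustration only; not the full NSID or semver grammars.
_NAME = r"[A-Za-z][A-Za-z0-9-]*(?:\.[A-Za-z][A-Za-z0-9-]*)*"
_SEMVER = r"(?:0|[1-9]\d*)\.(?:0|[1-9]\d*)\.(?:0|[1-9]\d*)"
_RKEY_RE = re.compile(rf"^(?P<nsid>{_NAME})@(?P<version>{_SEMVER})$")


class SchemaRkey(NamedTuple):
    nsid: str
    version: tuple[int, int, int]


def parse_schema_rkey(rkey: str) -> SchemaRkey:
    """Split an rkey of the assumed form '{NSID}@{semver}' into its parts."""
    match = _RKEY_RE.match(rkey)
    if match is None:
        raise ValueError(f"not a schema rkey: {rkey!r}")
    major, minor, patch = (int(p) for p in match.group("version").split("."))
    return SchemaRkey(match.group("nsid"), (major, minor, patch))


def format_schema_rkey(nsid: str, version: tuple[int, int, int]) -> str:
    """Build an rkey and re-parse it so malformed input fails early."""
    rkey = f"{nsid}@{'.'.join(str(part) for part in version)}"
    parse_schema_rkey(rkey)
    return rkey


def latest_rkey(rkeys: list[str]) -> str:
    """Pick the highest version by numeric comparison, not string order."""
    return max(rkeys, key=lambda r: parse_schema_rkey(r).version)
```

Comparing versions as integer tuples matters here: a plain string sort would rank `1.9.3` above `1.10.0`.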
201201+202202+---
203203+204204+### Risk 4: Attestation System Scope Creep
205205+206206+**Risk**: Formal verification and trust systems are research-level problems and could delay the entire project.
207207+208208+**Mitigation**:
209209+- Mark as **explicitly future work** (Phase 6+)
210210+- Start with **language metadata only** (low-hanging fruit)
211211+- Consider **social trust first** (verified DIDs, reputation) before formal verification
212212+- Partner with PL/verification researchers if pursuing formal proofs
213213+214214+---
215215+216216+## Long-Term Trajectory
217217+218218+The decisions set up a compelling long-term vision:
219219+220220+**Year 1**: Core dataset federation
221221+- Publish/discover datasets
222222+- JSON Schema for types
223223+- External URL storage
224224+- Basic Lenses
225225+226226+**Year 2**: Decentralization
227227+- PDS blob storage for small datasets
228228+- AppView with proxy
229229+- Migration Lenses widely used
230230+- Community schemas emerging
231231+232232+**Year 3**: Trust & verification
233233+- Language metadata standard
234234+- Verified DID system (social trust)
235235+- Attestation for critical Lenses
236236+- Cross-language support (TypeScript, Rust)
237237+238238+**Year 4+**: Research frontier
239239+- Formal verification of Lenses
240240+- Advanced query capabilities
241241+- Federated learning on distributed datasets
242242+- Integration with compute-over-data systems
243243+244244+---
245245+246246+## Concrete Recommendations
247247+248248+### 1. **Immediate** (Before Phase 1 Implementation)
249249+250250+- [ ] Define the **NDArray JSON Schema shim** precisely (schema structure, examples)
251251+- [ ] Spec out the **rkey format** (`{NSID}@{semver}` - what's valid NSID here? full NSID or partial?)
252252+- [ ] Design the **`getLatestSchema` query Lexicon** (parameters, return type)
253253+- [ ] Define the **storage union type** (external URL variant vs PDS blob variant)
254254+255255+### 2. **Phase 1-2** (Lexicon + Python Client)
256256+257257+- [ ] Implement **external URLs only** for storage (defer blobs to Phase 3)
258258+- [ ] Build **NDArray shim library** (serialize/deserialize with metadata)
259259+- [ ] Create **basic codegen** (Python dataclass ↔ JSON Schema)
260260+- [ ] Defer **language metadata** on Lenses to Phase 2 (start with just repo reference)
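
As a rough illustration of the "basic codegen" item, the sketch below maps a Python dataclass to a JSON Schema object. It is a minimal sketch under stated assumptions: the function name `dataclass_to_json_schema` is hypothetical, only `str`, `int`, `float`, and `bool` fields are handled, a field counts as required when it has no default, and NDArray fields (which would need the shim) are deliberately left out.

```python
import dataclasses
import typing

# Assumed mapping from Python primitives to JSON Schema type names.
_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def dataclass_to_json_schema(cls: type) -> dict:
    """Build a JSON Schema object for a dataclass with primitive fields."""
    hints = typing.get_type_hints(cls)
    properties: dict[str, dict] = {}
    required: list[str] = []
    for field in dataclasses.fields(cls):
        annotation = hints[field.name]
        if annotation not in _PRIMITIVES:
            raise TypeError(f"unsupported field type for {field.name!r}: {annotation!r}")
        properties[field.name] = {"type": _PRIMITIVES[annotation]}
        # A field without any default is treated as required.
        has_default = (
            field.default is not dataclasses.MISSING
            or field.default_factory is not dataclasses.MISSING
        )
        if not has_default:
            required.append(field.name)
    return {"type": "object", "properties": properties, "required": required}


@dataclasses.dataclass
class LabelSample:
    label: str
    confidence: float = 1.0


print(dataclass_to_json_schema(LabelSample))
```

The reverse direction (JSON Schema to dataclass) is not shown.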
261261+262262+### 3. **Phase 3** (AppView)
263263+264264+- [ ] Implement **hybrid storage support** in AppView
265265+- [ ] Build **proxy for PDS blobs** (unified WebDataset URL interface)
266266+- [ ] Add **getLatestSchema endpoint**
267267+268268+### 4. **Phase 4+** (Future Work)
269269+270270+- [ ] Add **language metadata** to Lens records
271271+- [ ] Design **attestation Lexicon** (separate from Lens records)
272272+- [ ] Design **verification Lexicon** (trusted DIDs)
273273+- [ ] Research formal verification feasibility
274274+275275+---
276276+277277+## Summary Assessment
278278+279279+**Grade: A-** (Excellent with caveats)
280280+281281+### Strengths
282282+- ✅ Leverages existing ecosystems (JSON Schema, GitHub)
283283+- ✅ Future-proof (extensible via open unions, versioning built-in)
284284+- ✅ Pragmatic decentralization (hybrid storage)
285285+- ✅ Innovative versioning (rkey scheme)
286286+- ✅ Strong security model (multi-layered trust)
287287+288288+### Concerns
289289+- ⚠️ High implementation complexity (manageable with phasing)
290290+- ⚠️ JSON Schema for NDArray is a leaky abstraction (acceptable trade-off)
291291+- ⚠️ Custom rkey convention requires good documentation
292292+- ⚠️ Attestation system is ambitious (defer to future)
293293+294294+### Overall Assessment
295295+296296+This is a **well-considered architecture** that makes intentional trade-offs. The bet is on ecosystem integration and flexibility over simplicity, which is appropriate for a distributed dataset federation. The key to success will be **disciplined phasing** - implement the core first, add sophistication incrementally.
297297+298298+The decisions form a **coherent whole** where each piece reinforces the others. The versioning scheme, Lenses, and hybrid storage create a system that's greater than the sum of its parts.
299299+300300+**Recommendation**: ✅ **Proceed with these decisions**. Document the NDArray shim and rkey conventions thoroughly, and commit to incremental implementation.
301301+302302+---
303303+304304+## Next Steps
305305+306306+1. Close decision issues #45-49 as decided
307307+2. Update planning documents with finalized decisions
308308+3. Proceed to Issue #50 (Lexicon validation) with:
309309+ - NDArray JSON Schema shim definition
310310+ - rkey format specification
311311+ - `getLatestSchema` query Lexicon design
312312+ - Storage union type definition
313313+4. Begin Phase 1 implementation after validation complete
+468
.planning/decisions/record_lexicon_assessment.md
···11+# Record Lexicon Assessment
22+33+## Overview
44+55+Comprehensive assessment of `ac.foundation.dataset.record` Lexicon design against ATProto standards and atdata project requirements.
66+77+**Assessment Date:** 2026-01-07
88+**Lexicon Version:** Initial design
99+**Assessor:** Claude Sonnet 4.5
1010+1111+---
1212+1313+## Executive Summary
1414+1515+**Grade: B+** (Good with improvements needed)
1616+1717+The record Lexicon provides a solid foundation for dataset indexing with hybrid storage support. Key strengths include clean union-based storage design and appropriate use of ATProto primitives. However, several issues need addressing:
1818+1919+- ⚠️ **Critical**: schemaRef should use format validation
2020+- ⚠️ **High**: Metadata structure inconsistency with sampleSchema pattern
2121+- ⚠️ **Medium**: Missing $type discriminators in union variants
2222+- ✅ **Strength**: Clean storage union design
2323+- ✅ **Strength**: Appropriate use of tid keys for datasets
2424+2525+---
2626+2727+## Detailed Analysis
2828+2929+### 1. Key Type Choice ✅ **Appropriate**
3030+3131+```json
3232+"key": "tid"
3333+```
3434+3535+**Assessment:** Correct choice for dataset records.
3636+3737+**Rationale:**
3838+- TIDs provide temporal ordering (useful for "recent datasets" queries)
3939+- Auto-generated, no collision risk
4040+- Appropriate for records without natural semantic keys
4141+- Consistent with ATProto patterns for user-generated content
4242+4343+**Comparison to sampleSchema:**
4444+- sampleSchema uses `"key": "any"` for versioned rkeys like `{NSID}@{semver}`
4545+- record uses `"key": "tid"` for chronological dataset entries
4646+- Both choices are appropriate for their use cases
4747+4848+---
4949+5050+### 2. Field Validation Issues
5151+5252+#### Issue 2.1: schemaRef Missing Format Validation ⚠️ **Critical**
5353+5454+```json
5555+"schemaRef": {
5656+ "type": "string",
5757+ "description": "AT-URI reference...",
5858+ "maxLength": 500
5959+}
6060+```
6161+6262+**Problem:** Should use `"format": "at-uri"` like we did for sampleSchema fields.
6363+6464+**Fix:**
6565+```json
6666+"schemaRef": {
6767+ "type": "string",
6868+ "format": "at-uri",
6969+ "description": "AT-URI reference to the sampleSchema record",
7070+ "maxLength": 500
7171+}
7272+```
7373+7474+**Impact:** Without format validation, malformed references could be stored.
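
Until the Lexicon enforces the format, a client could reject obviously malformed references itself. The following is a minimal sketch, assuming a simplified AT-URI shape (`at://authority/collection/rkey`); the function name `looks_like_at_uri` is hypothetical and the regular expression is looser than the full AT-URI grammar, so it is a sanity check rather than a validator.

```python
import re

# Simplified AT-URI shape: at://<did-or-handle>[/<collection>[/<rkey>]]
_AT_URI_RE = re.compile(
    r"^at://"
    r"(?P<authority>did:[a-z]+:[A-Za-z0-9._:%-]+|[A-Za-z0-9.-]+)"
    r"(?:/(?P<collection>[A-Za-z0-9.-]+)"
    r"(?:/(?P<rkey>[^/?#\s]+))?)?$"
)


def looks_like_at_uri(value: str, max_length: int = 500) -> bool:
    """Cheap syntactic check; the 500 limit mirrors schemaRef's maxLength."""
    # Length is measured in UTF-8 bytes to match Lexicon string limits.
    if len(value.encode("utf-8")) > max_length:
        return False
    return _AT_URI_RE.match(value) is not None
```

A check like this only catches syntax problems; it says nothing about whether the referenced record exists.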
7575+7676+---
7777+7878+#### Issue 2.2: License Field Inconsistency ⚠️ **Medium**
7979+8080+sampleSchema metadata:
8181+```json
8282+"license": {
8383+ "type": "string",
8484+ "description": "... SPDX identifiers recommended ... or full SPDX URLs ...",
8585+ "maxLength": 200
8686+}
8787+```
8888+8989+record:
9090+```json
9191+"license": {
9292+ "type": "string",
9393+ "description": "License (SPDX identifier preferred)",
9494+ "maxLength": 100
9595+}
9696+```
9797+9898+**Problem:** Inconsistent maxLength and less detailed guidance.
9999+100100+**Recommendation:** Align with sampleSchema:
101101+- maxLength: 200 (to support full URLs)
102102+- Enhanced description with examples
103103+- Reference Schema.org license property
104104+105105+---
106106+107107+#### Issue 2.3: Tags Field Inconsistency ⚠️ **Medium**
108108+109109+sampleSchema metadata:
110110+```json
111111+"tags": {
112112+ "type": "array",
113113+ "items": {"type": "string", "maxLength": 150},
114114+ "maxLength": 30
115115+}
116116+```
117117+118118+record:
119119+```json
120120+"tags": {
121121+ "type": "array",
122122+ "items": {"type": "string", "maxLength": 50},
123123+ "maxLength": 20
124124+}
125125+```
126126+127127+**Problem:** Different limits with no clear rationale.
128128+129129+**Recommendation:** Use consistent limits or document why datasets need different constraints than schemas.
130130+131131+---
132132+133133+### 3. Metadata Structure ⚠️ **High Priority**
134134+135135+#### Current Design
136136+137137+record:
138138+```json
139139+"metadata": {
140140+ "type": "bytes",
141141+ "description": "Msgpack-encoded metadata dict",
142142+ "maxLength": 100000
143143+},
144144+"tags": {...},
145145+"license": {...}
146146+```
147147+148148+sampleSchema:
149149+```json
150150+"metadata": {
151151+ "type": "object",
152152+ "properties": {
153153+ "license": {...},
154154+ "tags": {...}
155155+ }
156156+}
157157+```
158158+159159+**Problem:** Inconsistent approach between lexicons.
160160+161161+**Analysis:**
162162+163163+**Option A: Keep Separate (Current)**
164164+- Pros: More discoverable (top-level fields, indexed/searchable)
165165+- Pros: Validated by Lexicon
166166+- Cons: Duplicates structure with metadata blob
167167+- Cons: Inconsistent with sampleSchema pattern
168168+169169+**Option B: Unified Metadata Object**
170170+- Pros: Consistent with sampleSchema
171171+- Pros: Single source of truth
172172+- Cons: Less discoverable for search
173173+- Cons: Can't validate blob contents
174174+175175+**Recommendation:** Keep current approach but clarify relationship:
176176+- Top-level fields: Core, searchable metadata (license, tags, size)
177177+- metadata blob: Extended, arbitrary key-value pairs
178178+- Update descriptions to explain this pattern
179179+180180+---
181181+182182+### 4. Storage Union Design ✅ **Excellent**
183183+184184+```json
185185+"storage": {
186186+ "type": "union",
187187+ "refs": ["#storageExternal", "#storageBlobs"]
188188+}
189189+```
190190+191191+**Strengths:**
192192+- Clean separation of storage types
193193+- Extensible (closed: false by default)
194194+- Well-defined variants
195195+196196+#### Issue 4.1: Missing $type in Union Variants ⚠️ **Critical**
197197+198198+storageExternal:
199199+```json
200200+{
201201+ "type": "object",
202202+ "required": ["type", "urls"],
203203+ "properties": {
204204+ "type": {"type": "string", "const": "external"}
205205+ }
206206+}
207207+```
208208+209209+**Problem:** Uses `type` field as discriminator instead of ATProto's `$type`.
210210+211211+**ATProto Spec:** "Unions require discriminator fields... union variants: Always include `$type`"
212212+213213+**Fix:**
214214+```json
215215+{
216216+ "type": "object",
217217+ "required": ["$type", "urls"],
218218+ "properties": {
219219+ "$type": {
220220+ "type": "string",
221221+ "const": "ac.foundation.dataset.record#storageExternal"
222222+ }
223223+ }
224224+}
225225+```
226226+227227+**Impact:** Current design violates ATProto conventions and may cause issues with SDKs.
228228+229229+---
230230+231231+### 5. Size Information ✅ **Good Design**
232232+233233+```json
234234+"size": {
235235+ "type": "ref",
236236+ "ref": "#datasetSize",
237237+ "description": "Dataset size information (optional)"
238238+}
239239+```
240240+241241+**Strengths:**
242242+- Optional (appropriate, not all datasets track this)
243243+- Structured with useful fields (samples, bytes, shards)
244244+- Uses ref for reusability
245245+246246+**Minor Suggestion:** Consider renaming `datasetSize` to `sizeInfo` or `datasetSizeInfo` for clarity.
247247+248248+---
249249+250250+### 6. Blob Storage Design ⚠️ **Needs Verification**
251251+252252+```json
253253+"blobs": {
254254+ "type": "array",
255255+ "items": {
256256+ "type": "blob",
257257+ "description": "Blob reference to a WebDataset tar archive"
258258+ }
259259+}
260260+```
261261+262262+**Questions:**
263263+1. Does ATProto Lexicon support `"type": "blob"` for array items?
264264+2. Should this be a ref like `"type": "ref", "ref": "#blobRef"`?
265265+3. Are blob mime types validated?
266266+267267+**Example shows:**
268268+```json
269269+{
270270+ "$type": "blob",
271271+ "ref": {"$link": "..."},
272272+ "mimeType": "application/x-tar",
273273+ "size": 1234567
274274+}
275275+```
276276+277277+**Recommendation:** Verify against ATProto blob specification and potentially add validation constraints (maxSize, accept mimeType patterns).
278278+279279+---
280280+281281+### 7. Closed Union Consideration 🤔
282282+283283+```json
284284+"storage": {
285285+ "type": "union",
286286+ "refs": ["#storageExternal", "#storageBlobs"]
287287+}
288288+```
289289+290290+**Current:** `closed: false` (default)
291291+292292+**Question:** Should storage union be closed?
293293+294294+**Arguments for closed: true:**
295295+- Core storage types unlikely to change frequently
296296+- Breaking change to add new storage after launch
297297+- More predictable for clients
298298+299299+**Arguments for closed: false (current):**
300300+- Future extensibility (e.g., IPFS-native, Filecoin, Arweave)
301301+- Consistent with sampleSchema schema union pattern
302302+- Graceful degradation for unknown types
303303+304304+**Recommendation:** Keep open but document in description that external/blobs are the canonical types maintained by foundation.ac.
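
On the client side, an open union mostly means tolerating variants the client does not recognize. A minimal sketch of that dispatch, assuming the `$type` discriminator proposed in Issue 4.1; the `#storageBlobs` type string is extrapolated from the union refs, and `describe_storage` is a hypothetical helper name:

```python
from typing import Any

# Assumed fully qualified variant names, following the fix proposed in Issue 4.1.
STORAGE_EXTERNAL = "ac.foundation.dataset.record#storageExternal"
STORAGE_BLOBS = "ac.foundation.dataset.record#storageBlobs"


def describe_storage(storage: dict[str, Any]) -> tuple[str, list[Any]]:
    """Return (kind, locations) for a storage union value.

    Unknown variants are reported as 'unknown' rather than raising, so a
    client keeps working when new storage types appear in an open union.
    """
    kind = storage.get("$type")
    if kind == STORAGE_EXTERNAL:
        return ("external", list(storage.get("urls", [])))
    if kind == STORAGE_BLOBS:
        return ("blobs", list(storage.get("blobs", [])))
    return ("unknown", [])
```

A caller can then skip or flag datasets whose storage kind is `unknown` instead of failing on them.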
305305+306306+---
307307+308308+### 8. Missing Fields from Standard Patterns
309309+310310+Comparing to Schema.org Dataset and sampleSchema patterns:
311311+312312+**Consider Adding:**
313313+314314+1. **Publisher/Creator** - Who published this dataset?
315315+ - Could use top-level `creator` field (DID/handle)
316316+ - Or rely on record author (implicit in AT-URI)
317317+318318+2. **Version** - Dataset versioning?
319319+ - Current approach: New record per version (via tid)
320320+ - Alternative: Add explicit `version` field like sampleSchema
321321+ - **Recommendation:** Document that versioning is via new records, reference via AT-URI with tid
322322+323323+3. **Citation** - How to cite this dataset?
324324+ - Optional field for academic datasets
325325+ - Could go in metadata blob for now
326326+327327+4. **Related Datasets** - Links to variants, subsets, etc.
328328+ - Could be array of AT-URIs
329329+ - Or handle via separate "collection" Lexicon later
330330+331331+**Recommendation:** Current fields are sufficient for v1. Document these as future extensions.
332332+333333+---
334334+335335+### 9. ATProto Compliance Checklist
336336+337337+| Requirement | Status | Notes |
338338+|-------------|--------|-------|
339339+| Valid Lexicon version | ✅ | lexicon: 1 |
340340+| NSID format | ✅ | ac.foundation.dataset.record |
341341+| Key type specified | ✅ | tid (appropriate) |
342342+| Required fields present | ✅ | name, schemaRef, storage, createdAt |
343343+| Union discriminators | ⚠️ | Missing $type in variants |
344344+| Format validators | ⚠️ | Missing at-uri format |
345345+| Blob type usage | ⚠️ | Needs verification |
346346+| Description fields | ✅ | All fields documented |
347347+| maxLength constraints | ✅ | Present on strings |
348348+| Datetime format | ✅ | createdAt uses datetime |
349349+350350+---
351351+352352+### 10. Example Record Validation
353353+354354+#### External Storage Example ✅
355355+356356+```json
357357+{
358358+ "$type": "ac.foundation.dataset.record",
359359+ "name": "CIFAR-10 Training Set",
360360+ "schemaRef": "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/imageclassification@1.0.0",
361361+ "storage": {"type": "external", "urls": ["..."]}
362362+}
363363+```
364364+365365+**Issues:**
366366+- schemaRef is well-formed but not validated (missing format check)
367367+- storage.type should be $type
368368+- Otherwise structurally correct
369369+370370+#### Blob Storage Example ⚠️
371371+372372+```json
373373+{
374374+ "storage": {
375375+ "type": "blobs",
376376+ "blobs": [{
377377+ "$type": "blob",
378378+ "ref": {"$link": "..."},
379379+ "mimeType": "application/x-tar"
380380+ }]
381381+ }
382382+}
383383+```
384384+385385+**Issues:**
386386+- storage.type should be $type
387387+- Blob structure needs verification against ATProto spec
388388+- mimeType not validated in Lexicon
389389+390390+---
391391+392392+## Priority Issues Summary
393393+394394+### Critical (Must Fix)
395395+396396+1. **Add format validation to schemaRef** - Use `"format": "at-uri"`
397397+2. **Fix union discriminators** - Use `$type` instead of `type` in storage variants
398398+3. **Verify blob type usage** - Confirm ATProto compliance
399399+400400+### High Priority (Should Fix)
401401+402402+4. **Align metadata pattern** - Clarify relationship between top-level fields and metadata blob
403403+5. **Standardize license field** - Match sampleSchema maxLength and description
404404+6. **Standardize tags field** - Use consistent limits or document rationale
405405+406406+### Medium Priority (Consider)
407407+408408+7. **Add $type requirement to union variants** - Make explicit in required array
409409+8. **Document versioning strategy** - Clarify that new versions = new records
410410+9. **Add blob validation** - Consider maxSize, mimeType constraints
411411+412412+### Low Priority (Future)
413413+414414+10. **Consider closed union** - Evaluate after Phase 1 usage patterns
415415+11. **Add creator field** - If needed based on user feedback
416416+12. **Collection/relationship fields** - Phase 2 feature
417417+418418+---
419419+420420+## Consistency Matrix
421421+422422+Comparison of patterns between sampleSchema and record Lexicons:
423423+424424+| Pattern | sampleSchema | record | Status |
425425+|---------|--------------|--------|--------|
426426+| AT-URI format | ✅ Uses format | ❌ Missing | **Fix** |
427427+| License field | 200 chars, detailed | 100 chars, basic | **Align** |
428428+| Tags limits | 150/30 | 50/20 | **Decide** |
429429+| Metadata structure | Structured object | Blob + top-level | **Document** |
430430+| Union discriminator | Uses $type | Uses type | **Fix** |
431431+| Versioning | Explicit version field | Implicit (tid) | **Different OK** |
432432+| Key type | any (semantic) | tid (temporal) | **Both OK** |
433433+434434+---
435435+436436+## Recommendations
437437+438438+### Immediate Actions
439439+440440+1. Add `"format": "at-uri"` to schemaRef field
441441+2. Change storage union variants to use `$type` discriminator
442442+3. Verify blob array item type with ATProto specification
443443+4. Align license field with sampleSchema (maxLength: 200, enhanced description)
444444+5. Decide on tags limits (recommend matching sampleSchema: 150/30)
445445+446446+### Documentation Improvements
447447+448448+6. Add description clarifying metadata blob vs top-level fields relationship
449449+7. Document that dataset versioning is via new records (tids)
450450+8. Add note about storage union extensibility
451451+9. Cross-reference with sampleSchema Lexicon
452452+453453+### Consider for Phase 2
454454+455455+10. Add creator/publisher field if user feedback indicates need
456456+11. Evaluate closed union after observing extension patterns
457457+12. Consider collection/relationship Lexicon for dataset hierarchies
458458+459459+---
460460+461461+## Conclusion
462462+463463+The record Lexicon provides a solid foundation but needs refinement for ATProto compliance and consistency with sampleSchema patterns. The storage union design is excellent, and the use of tids is appropriate. Primary concerns are format validation, union discriminators, and metadata pattern clarity.
464464+465465+**Estimated effort to address critical issues:** 2-3 hours
466466+**Recommended timeline:** Before Phase 1 completion
467467+468468+After fixes, expected grade: **A-** (Excellent and production-ready)
···11+# sampleSchema Lexicon Design Questions
22+33+This document captures open design questions for the `ac.foundation.dataset.sampleSchema` Lexicon that require user decisions before implementation.
44+55+## Q1: Key Format Validation
66+77+**Context:**
88+- Schema uses `"key": "any"` in Lexicon
99+- Documentation says rkey format is `{NSID}@{semver}`
1010+- ATProto might not support regex validation on rkey in Lexicons
1111+1212+**Question:**
1313+Should we add validation for the rkey format in the Lexicon definition, or is this enforced elsewhere?
1414+1515+**Options:**
1616+1. Add rkey pattern validation if ATProto Lexicons support it
1717+2. Document expected format but rely on application-level validation
1818+3. Use a structured key type instead of "any"
1919+2020+**Impact:**
2121+- Option 1: Strongest validation, prevents malformed rkeys
2222+- Option 2: Simpler, but allows invalid rkeys to be created
2323+- Option 3: May not be compatible with ATProto Lexicon spec
2424+2525+**Decision:** [TBD]
2626+2727+---
2828+2929+## Q2: Required Fields in JSON Schema
3030+3131+**Context:**
3232+- The `jsonSchema` field accepts any JSON Schema object
3333+- JSON Schemas can have zero required fields (all optional)
3434+- PackableSample types in atdata typically have at least one field
3535+3636+**Question:**
3737+Should we enforce that JSON Schemas must have at least one required field?
3838+3939+**Options:**
4040+1. No constraint - allow empty required arrays
4141+2. Require at least one field in required array
4242+3. No constraint but document best practices
4343+4444+**Impact:**
4545+- Option 1: Maximum flexibility, but allows degenerate schemas
4646+- Option 2: Forces meaningful sample definitions
4747+- Option 3: Middle ground - guidance without enforcement
4848+4949+**Recommendation:** Option 3 (document best practices)
5050+5151+**Decision:** [TBD]
5252+5353+---
5454+5555+## Q3: Schema Type Extension Path
5656+5757+**Context:**
5858+- `schemaType` field has `enum: ["jsonschema"]` only
5959+- Future may want to support other formats (Avro, Protobuf, etc.)
6060+- Lexicon schema evolution unclear
6161+6262+**Question:**
6363+How should we design for future schema format support?
6464+6565+**Options:**
6666+1. Keep enum as-is, add new formats in major version bump
6767+2. Use open union type instead of closed enum
6868+3. Add `schemaFormat` union field alongside `jsonSchema`
6969+7070+**Example for Option 3:**
7171+```json
7272+{
7373+ "schemaFormat": {
7474+ "type": "union",
7575+ "refs": ["#jsonSchemaFormat", "#avroSchemaFormat", "#protobufSchemaFormat"]
7676+ }
7777+}
7878+```
7979+8080+**Impact:**
8181+- Option 1: Breaking change required for new formats
8282+- Option 2: No validation of format string
8383+- Option 3: Clean extensibility but more complex now
8484+8585+**Recommendation:** Option 1 (YAGNI - wait for actual need)
8686+8787+**Decision:** [TBD]
8888+8989+---
9090+9191+## Q4: Metadata Field Structure
9292+9393+**Context:**
9494+- `metadata` is currently `"type": "object"` with no structure
9595+- Common fields like `author`, `license`, `tags` are documented in examples
9696+- No validation on these fields
9797+9898+**Question:**
9999+Should we define a structured schema for common metadata fields?
100100+101101+**Options:**
102102+1. Keep fully unstructured (current)
103103+2. Define optional but structured fields (author, license, tags, etc.)
104104+3. Create separate metadata Lexicon type and reference it
105105+106106+**Example for Option 2:**
107107+```json
108108+{
109109+ "metadata": {
110110+ "type": "object",
111111+ "properties": {
112112+ "author": {"type": "string", "maxLength": 200},
113113+ "license": {"type": "string", "maxLength": 100},
114114+ "tags": {"type": "array", "items": {"type": "string"}, "maxItems": 20}
115115+ }
116116+ }
117117+}
118118+```
119119+120120+**Impact:**
121121+- Option 1: Maximum flexibility, no validation
122122+- Option 2: Standardization with optional compliance
123123+- Option 3: Reusability but added complexity
124124+125125+**Recommendation:** Option 2 (structured but optional)
126126+127127+**Decision:** [TBD]
128128+129129+---
130130+131131+## Q5: NDArray Shim URI Default
132132+133133+**Context:**
134134+- `ndarrayShimUri` is optional with default mentioned in description
135135+- Standard shim is at `https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0`
136136+- No explicit default value in Lexicon
137137+138138+**Question:**
139139+Should we add an explicit default value for `ndarrayShimUri`?
140140+141141+**Options:**
142142+1. Add `"default": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0"`
143143+2. Keep as optional, codegen assumes standard shim if missing
144144+3. Make required - always explicit
145145+146146+**Impact:**
147147+- Option 1: Clearest behavior, but locks in URI
148148+- Option 2: Flexibility for future shim versions
149149+- Option 3: Most explicit but verbose
150150+151151+**Recommendation:** Option 2 (implicit default in codegen)
152152+153153+**Decision:** [TBD]
154154+155155+---
156156+157157+## Notes
158158+159159+These questions should be resolved before finalizing the sampleSchema Lexicon design. Some can be deferred to Phase 2 implementation based on priority.
160160+161161+**Priority:**
162162+- Q1: High (affects rkey strategy)
163163+- Q2: Low (can document later)
164164+- Q3: Low (YAGNI until needed)
165165+- Q4: Medium (affects metadata usage patterns)
166166+- Q5: Medium (affects codegen implementation)
+252
.planning/examples/code/ndarray_roundtrip.py
···11+#!/usr/bin/env python3
22+"""
33+Demonstration of NDArray JSON Schema shim roundtrip.
44+55+This script demonstrates:
66+1. Creating numpy arrays
77+2. Serializing to bytes (numpy .npy format)
88+3. Storing in JSON-compatible structure
99+4. Validating against JSON Schema
1010+5. Deserializing back to numpy arrays
1111+1212+This proves the NDArray shim design works end-to-end.
1313+"""
1414+1515+import json
1616+import base64
1717+from io import BytesIO
1818+from pathlib import Path
1919+2020+import numpy as np
2121+from jsonschema import validate, ValidationError
2222+2323+2424+##
2525+# Step 1: Define helper functions (same as atdata._helpers)
2626+2727+def array_to_bytes(x: np.ndarray) -> bytes:
2828+ """Convert numpy array to bytes using .npy format."""
2929+ np_bytes = BytesIO()
3030+ np.save(np_bytes, x, allow_pickle=True)
3131+ return np_bytes.getvalue()
3232+3333+3434+def bytes_to_array(b: bytes) -> np.ndarray:
3535+ """Convert bytes back to numpy array."""
3636+ np_bytes = BytesIO(b)
3737+ return np.load(np_bytes, allow_pickle=True)
3838+3939+4040+##
4141+# Step 2: Load the JSON Schema for ImageSample
4242+4343+# Get path to the schema example
4444+schema_path = Path(__file__).parent.parent / "sampleSchema_example.json"
4545+with open(schema_path) as f:
4646+ schema_record = json.load(f)
4747+4848+# Extract just the jsonSchema part
4949+json_schema = schema_record["jsonSchema"]
5050+5151+print("=" * 80)
5252+print("JSON Schema for ImageSample")
5353+print("=" * 80)
5454+print(json.dumps(json_schema, indent=2))
5555+print()
5656+5757+5858+##
5959+# Step 3: Create sample data matching the schema
6060+6161+print("=" * 80)
6262+print("Creating Sample Data")
6363+print("=" * 80)
6464+6565+# Create a numpy array (simulating an image)
6666+image_array = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
6767+print(f"Created image array: shape={image_array.shape}, dtype={image_array.dtype}")
6868+6969+# Serialize to bytes (this is what atdata does)
7070+image_bytes = array_to_bytes(image_array)
7171+print(f"Serialized to bytes: {len(image_bytes)} bytes")
7272+print(f"First 100 bytes (hex): {image_bytes[:100].hex()}")
7373+print()
7474+7575+7676+##
7777+# Step 4: Create JSON-compatible representation
7878+7979+print("=" * 80)
8080+print("Creating JSON-Compatible Representation")
8181+print("=" * 80)
8282+8383+# For JSON, bytes need to be base64-encoded
8484+image_base64 = base64.b64encode(image_bytes).decode('utf-8')
8585+print(f"Base64 encoded: {len(image_base64)} characters")
8686+print(f"First 100 chars: {image_base64[:100]}...")
8787+8888+# Create a sample object matching the schema
8989+sample_data = {
9090+ "image": image_base64, # NDArray as base64 string
9191+ "label": "cat", # Regular string field
9292+ "confidence": 0.95 # Optional number field
9393+}
9494+9595+print()
9696+print("Sample data structure:")
9797+print(json.dumps({
9898+ "image": f"<{len(image_base64)} chars of base64>",
9999+ "label": sample_data["label"],
100100+ "confidence": sample_data["confidence"]
101101+}, indent=2))
102102+print()
103103+104104+105105+##
106106+# Step 5: Validate against JSON Schema
107107+108108+print("=" * 80)
109109+print("Validating Against JSON Schema")
110110+print("=" * 80)
111111+112112+try:
113113+ validate(instance=sample_data, schema=json_schema)
114114+ print("✅ VALID: Sample data validates against JSON Schema!")
115115+except ValidationError as e:
116116+ print(f"❌ INVALID: {e.message}")
117117+ print(f"Failed at: {list(e.path)}")
118118+119119+print()
120120+121121+122122+##
123123+# Step 6: Deserialize back to numpy
124124+125125+print("=" * 80)
126126+print("Deserializing Back to Numpy")
127127+print("=" * 80)
128128+129129+# Decode from base64
130130+recovered_bytes = base64.b64decode(sample_data["image"])
131131+print(f"Decoded from base64: {len(recovered_bytes)} bytes")
132132+133133+# Deserialize to numpy array
134134+recovered_array = bytes_to_array(recovered_bytes)
135135+print(f"Deserialized to array: shape={recovered_array.shape}, dtype={recovered_array.dtype}")
136136+137137+# Verify it matches the original
138138+arrays_equal = np.array_equal(image_array, recovered_array)
139139+print(f"Arrays equal: {arrays_equal}")
140140+141141+if arrays_equal:
142142+ print("✅ SUCCESS: Full roundtrip successful!")
143143+else:
144144+ print("❌ FAILURE: Arrays don't match")
145145+ print(f"Max difference: {np.max(np.abs(image_array.astype(float) - recovered_array.astype(float)))}")
146146+147147+print()
148148+149149+150150+##
151151+# Step 7: Demonstrate validation of dtype/shape metadata
152152+153153+print("=" * 80)
154154+print("Validating NDArray Metadata (dtype, shape)")
155155+print("=" * 80)
156156+157157+# Extract metadata from schema
158158+image_schema = json_schema["properties"]["image"]
159159+expected_dtype = image_schema.get("x-atdata-dtype")
160160+expected_shape = image_schema.get("x-atdata-shape")
161161+162162+print(f"Expected dtype: {expected_dtype}")
163163+print(f"Expected shape: {expected_shape}")
164164+print(f"Actual dtype: {recovered_array.dtype}")
165165+print(f"Actual shape: {recovered_array.shape}")
166166+167167+# Validate dtype
168168+dtype_match = str(recovered_array.dtype) == expected_dtype
169169+print(f"Dtype matches: {dtype_match}")
170170+171171+# Validate shape (with None/null for dynamic dimensions)
172172+def validate_shape(actual_shape, expected_shape):
173173+    """Validate shape with support for dynamic dimensions (None/null)."""
174174+    if len(actual_shape) != len(expected_shape):
175175+        return False
176176+    for actual_dim, expected_dim in zip(actual_shape, expected_shape):
177177+        if expected_dim is not None and actual_dim != expected_dim:
178178+            return False
179179+    return True
180180+181181+shape_match = validate_shape(recovered_array.shape, expected_shape)
182182+print(f"Shape matches: {shape_match}")
183183+184184+if dtype_match and shape_match:
185185+ print("✅ SUCCESS: Array metadata matches schema expectations!")
186186+else:
187187+ print("❌ FAILURE: Metadata mismatch")
188188+189189+print()
190190+191191+192192+##
193193+# Step 8: Demonstrate msgpack (actual atdata format)
194194+195195+print("=" * 80)
196196+print("Msgpack Serialization (Actual atdata Format)")
197197+print("=" * 80)
198198+199199+try:
200200+ import msgpack
201201+202202+ # In atdata, the sample would be stored in msgpack, not JSON
203203+ # The image field would be raw bytes, not base64
204204+ msgpack_data = {
205205+ "image": image_bytes, # Raw bytes (not base64)
206206+ "label": "cat",
207207+ "confidence": 0.95
208208+ }
209209+210210+ # Serialize to msgpack
211211+ msgpack_bytes = msgpack.packb(msgpack_data)
212212+ print(f"Msgpack size: {len(msgpack_bytes)} bytes")
213213+214214+ # Deserialize from msgpack
215215+ recovered_msgpack = msgpack.unpackb(msgpack_bytes, raw=False)
216216+ recovered_array_msgpack = bytes_to_array(recovered_msgpack["image"])
217217+218218+ print(f"Recovered from msgpack: shape={recovered_array_msgpack.shape}, dtype={recovered_array_msgpack.dtype}")
219219+ print(f"Arrays equal: {np.array_equal(image_array, recovered_array_msgpack)}")
220220+ print("✅ SUCCESS: Msgpack roundtrip successful!")
221221+222222+except ImportError:
223223+ print("⚠️ msgpack not installed, skipping msgpack demonstration")
224224+ print(" (atdata uses msgpack for actual serialization)")
225225+226226+print()
227227+228228+229229+##
230230+# Summary
231231+232232+print("=" * 80)
233233+print("SUMMARY")
234234+print("=" * 80)
235235+print("""
236236+✅ The NDArray JSON Schema shim works correctly:
237237+ 1. JSON Schema validates structure (field is present, is a string; the base64 content itself is not checked)
238238+ 2. Binary .npy format preserves dtype and shape
239239+ 3. Extension properties (x-atdata-*) provide metadata for validation
240240+ 4. Full roundtrip: numpy → bytes → base64 → JSON → validate → deserialize → numpy
241241+ 5. Msgpack format (actual atdata) uses raw bytes instead of base64
242242+243243+⚠️ Validation happens at two levels:
244244+ - JSON Schema: Structural validation (field present, correct type)
245245+ - Deserialization: Semantic validation (dtype/shape match expectations)
246246+247247+📝 This design is a pragmatic compromise:
248248+ - Leverages existing .npy serialization (proven, self-describing)
249249+ - Uses standard JSON Schema conventions (format: byte, contentEncoding)
250250+ - Adds metadata via extension properties (x-atdata-*)
251251+ - Works with both JSON (base64) and msgpack (raw bytes)
252252+""")
+316
.planning/examples/code/validate_ndarray_shim.py
···11+#!/usr/bin/env python3
22+"""
33+Validate base64-encoded numpy arrays against the standalone ndarray_shim.json schema.
44+55+This demonstrates that the NDArray shim schema definition works correctly as a
66+standalone, reusable schema component that can be referenced from other schemas.
77+88+Note: This tests the JSON representation (base64-encoded bytes). In actual atdata
99+usage, WebDatasets store raw bytes directly in msgpack format without base64 encoding.
1010+"""
1111+1212+import json
1313+import base64
1414+from io import BytesIO
1515+from pathlib import Path
1616+1717+import numpy as np
1818+from jsonschema import validate, ValidationError, Draft7Validator
1919+2020+2121+##
2222+# Helper functions
2323+2424+def array_to_bytes(x: np.ndarray) -> bytes:
2525+    """Convert numpy array to bytes using .npy format."""
2626+    np_bytes = BytesIO()
2727+    np.save(np_bytes, x, allow_pickle=True)
2828+    return np_bytes.getvalue()
2929+3030+3131+def bytes_to_array(b: bytes) -> np.ndarray:
3232+    """Convert bytes back to numpy array."""
3333+    np_bytes = BytesIO(b)
3434+    return np.load(np_bytes, allow_pickle=True)
3535+3636+3737+##
3838+# Load the standalone ndarray shim schema
3939+4040+shim_path = Path(__file__).parent.parent.parent / "lexicons" / "ndarray_shim.json"
4141+with open(shim_path) as f:
4242+ ndarray_shim = json.load(f)
4343+4444+print("=" * 80)
4545+print("Loaded NDArray Shim Schema")
4646+print("=" * 80)
4747+print(f"Schema ID: {ndarray_shim['$id']}")
4848+print(f"Version: {ndarray_shim['version']}")
4949+print()
5050+print("NDArray definition:")
5151+print(json.dumps(ndarray_shim["$defs"]["ndarray"], indent=2))
5252+print()
5353+5454+5555+##
5656+# Test Case 1: Simple 1D array
5757+5858+print("=" * 80)
5959+print("Test Case 1: Simple 1D Array")
6060+print("=" * 80)
6161+6262+array_1d = np.array([1, 2, 3, 4, 5], dtype=np.int32)
6363+print(f"Created array: {array_1d}")
6464+print(f"Shape: {array_1d.shape}, dtype: {array_1d.dtype}")
6565+6666+# Serialize and encode
6767+bytes_1d = array_to_bytes(array_1d)
6868+base64_1d = base64.b64encode(bytes_1d).decode('utf-8')
6969+print(f"Serialized to {len(bytes_1d)} bytes")
7070+print(f"Base64: {len(base64_1d)} characters")
7171+7272+# Validate against the ndarray schema definition directly
7373+ndarray_schema = {
7474+ "$schema": "http://json-schema.org/draft-07/schema#",
7575+ "$defs": ndarray_shim["$defs"],
7676+ "$ref": "#/$defs/ndarray"
7777+}
7878+7979+try:
8080+ validate(instance=base64_1d, schema=ndarray_schema)
8181+ print("✅ VALID: 1D array validates against ndarray schema")
8282+except ValidationError as e:
8383+ print(f"❌ INVALID: {e.message}")
8484+8585+# Verify roundtrip
8686+recovered_1d = bytes_to_array(base64.b64decode(base64_1d))
8787+print(f"Recovered: {recovered_1d}")
8888+print(f"Arrays equal: {np.array_equal(array_1d, recovered_1d)}")
8989+print()
9090+9191+9292+##
9393+# Test Case 2: 2D array (matrix)
9494+9595+print("=" * 80)
9696+print("Test Case 2: 2D Array (Matrix)")
9797+print("=" * 80)
9898+9999+array_2d = np.random.randn(3, 4).astype(np.float32)
100100+print(f"Created array shape: {array_2d.shape}, dtype: {array_2d.dtype}")
101101+print(f"Sample values:\n{array_2d}")
102102+103103+bytes_2d = array_to_bytes(array_2d)
104104+base64_2d = base64.b64encode(bytes_2d).decode('utf-8')
105105+print(f"Serialized to {len(bytes_2d)} bytes")
106106+107107+try:
108108+ validate(instance=base64_2d, schema=ndarray_schema)
109109+ print("✅ VALID: 2D array validates against ndarray schema")
110110+except ValidationError as e:
111111+ print(f"❌ INVALID: {e.message}")
112112+113113+recovered_2d = bytes_to_array(base64.b64decode(base64_2d))
114114+print(f"Arrays equal: {np.array_equal(array_2d, recovered_2d)}")
115115+print()
116116+117117+118118+##
119119+# Test Case 3: 3D array (image-like)
120120+121121+print("=" * 80)
122122+print("Test Case 3: 3D Array (Image-like)")
123123+print("=" * 80)
124124+125125+array_3d = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
126126+print(f"Created array shape: {array_3d.shape}, dtype: {array_3d.dtype}")
127127+print(f"Total elements: {array_3d.size}")
128128+129129+bytes_3d = array_to_bytes(array_3d)
130130+base64_3d = base64.b64encode(bytes_3d).decode('utf-8')
131131+print(f"Serialized to {len(bytes_3d)} bytes ({len(bytes_3d) / 1024:.1f} KB)")
132132+print(f"Base64 string: {len(base64_3d)} characters ({len(base64_3d) / 1024:.1f} KB)")
133133+134134+try:
135135+ validate(instance=base64_3d, schema=ndarray_schema)
136136+ print("✅ VALID: 3D array validates against ndarray schema")
137137+except ValidationError as e:
138138+ print(f"❌ INVALID: {e.message}")
139139+140140+recovered_3d = bytes_to_array(base64.b64decode(base64_3d))
141141+print(f"Recovered shape: {recovered_3d.shape}, dtype: {recovered_3d.dtype}")
142142+print(f"Arrays equal: {np.array_equal(array_3d, recovered_3d)}")
143143+print()
144144+145145+146146+##
147147+# Test Case 4: Different dtypes
148148+149149+print("=" * 80)
150150+print("Test Case 4: Various Dtypes")
151151+print("=" * 80)
152152+153153+dtypes_to_test = [
154154+ np.int8,
155155+ np.int16,
156156+ np.int32,
157157+ np.int64,
158158+ np.uint8,
159159+ np.uint16,
160160+ np.uint32,
161161+ np.uint64,
162162+ np.float16,
163163+ np.float32,
164164+ np.float64,
165165+ np.complex64,
166166+ np.complex128,
167167+]
168168+169169+print(f"Testing {len(dtypes_to_test)} different dtypes...")
170170+all_passed = True
171171+172172+for dtype in dtypes_to_test:
173173+    array = np.array([1, 2, 3], dtype=dtype)
174174+    array_bytes = array_to_bytes(array)
175175+    array_base64 = base64.b64encode(array_bytes).decode('utf-8')
176176+177177+    try:
178178+        validate(instance=array_base64, schema=ndarray_schema)
179179+        recovered = bytes_to_array(base64.b64decode(array_base64))
180180+        match = np.array_equal(array, recovered)
181181+        status = "✅" if match else "❌"
182182+        print(f" {status} {str(dtype):12s} - valid and {'matches' if match else 'MISMATCH'}")
183183+        if not match:
184184+            all_passed = False
185185+    except ValidationError as e:
186186+        print(f" ❌ {str(dtype):12s} - validation failed: {e.message}")
187187+        all_passed = False
188188+189189+if all_passed:
190190+    print("✅ SUCCESS: All dtypes validated and roundtripped correctly")
191191+else:
192192+    print("❌ FAILURE: Some dtypes failed")
193193+print()
194194+195195+196196+##
197197+# Test Case 5: Invalid data (should fail validation)
198198+199199+print("=" * 80)
200200+print("Test Case 5: Invalid Data (Negative Tests)")
201201+print("=" * 80)
202202+203203+# Test invalid types
204204+invalid_cases = [
205205+ ("plain string", "not base64 encoded array data"), # passes unless the validator asserts format/contentEncoding
206206+ ("number", 12345),
207207+ ("object", {"dtype": "uint8", "data": "fake"}),
208208+ ("array", [1, 2, 3]),
209209+ ("null", None),
210210+]
211211+212212+print("Testing invalid inputs (should fail validation):")
213213+for name, invalid_data in invalid_cases:
214214+    try:
215215+        validate(instance=invalid_data, schema=ndarray_schema)
216216+        print(f" ❌ {name:15s} - SHOULD HAVE FAILED but passed")
217217+    except ValidationError:
218218+        print(f" ✅ {name:15s} - correctly rejected")
219219+220220+print()
221221+222222+223223+##
224224+# Test Case 6: Using the schema as a $ref in another schema (inline)
225225+226226+print("=" * 80)
227227+print("Test Case 6: Using NDArray Shim as $ref (Inline)")
228228+print("=" * 80)
229229+230230+# Create a schema that inlines the ndarray shim definition
231231+sample_schema = {
232232+ "$schema": "http://json-schema.org/draft-07/schema#",
233233+ "title": "TestSample",
234234+ "type": "object",
235235+ "required": ["data", "label"],
236236+ "properties": {
237237+ "data": {
238238+ "$ref": "#/$defs/ndarray",
239239+ "description": "Numpy array data",
240240+ "x-atdata-dtype": "float32",
241241+ "x-atdata-shape": [None, 10]
242242+ },
243243+ "label": {
244244+ "type": "string",
245245+ "description": "Label for this sample"
246246+ }
247247+ },
248248+ "$defs": {
249249+ "ndarray": ndarray_shim["$defs"]["ndarray"]
250250+ }
251251+}
252252+253253+print("Created schema that uses inlined ndarray shim:")
254254+print(json.dumps({
255255+ "title": sample_schema["title"],
256256+ "required": sample_schema["required"],
257257+ "properties": {
258258+ "data": {"$ref": "#/$defs/ndarray", "x-atdata-dtype": "float32"},
259259+ "label": {"type": "string"}
260260+ }
261261+}, indent=2))
262262+print()
263263+264264+# Create sample data
265265+test_array = np.random.randn(5, 10).astype(np.float32)
266266+test_data = {
267267+ "data": base64.b64encode(array_to_bytes(test_array)).decode('utf-8'),
268268+ "label": "test sample"
269269+}
270270+271271+print(f"Created test sample with array shape {test_array.shape}")
272272+273273+# Validate with inline $ref
274274+validator = Draft7Validator(sample_schema)
275275+276276+try:
277277+ validator.validate(test_data)
278278+ print("✅ VALID: Sample with $ref to ndarray shim validates correctly")
279279+except ValidationError as e:
280280+ print(f"❌ INVALID: {e.message}")
281281+282282+print()
283283+284284+285285+##
286286+# Summary
287287+288288+print("=" * 80)
289289+print("SUMMARY")
290290+print("=" * 80)
291291+print("""
292292+✅ The standalone ndarray_shim.json schema works correctly:
293293+ 1. Validates base64-encoded .npy bytes as strings
294294+ 2. Works with all standard numpy dtypes
295295+ 3. Supports arrays of any dimensionality (1D, 2D, 3D, etc.)
296296+ 4. Can be used as $ref in other schemas
297297+ 5. Rejects non-string data (numbers, objects, arrays, null); arbitrary strings still pass structural validation
298298+299299+✅ The shim is a proper JSON Schema Draft 7 definition:
300300+ - Uses standard type/format (string/byte)
301301+ - Uses contentEncoding/contentMediaType properly
302302+ - Works with standard validators (jsonschema library)
303303+ - Can be stored at a canonical URI and referenced
304304+305305+📝 Key points:
306306+ - Base64 encoding adds ~33% overhead (150KB → 200KB)
307307+ - In actual atdata, WebDatasets store raw bytes (no base64)
308308+ - JSON representation useful for: APIs, validation, examples
309309+ - Msgpack representation used in practice: more efficient
310310+311311+🎯 Design validated:
312312+ - Shim definition is sound and reusable
313313+ - Works as both inline $def and external $ref
314314+ - Compatible with JSON Schema tooling
315315+ - Ready for use in ac.foundation.dataset.sampleSchema Lexicon
316316+""")
···11+{
22+ "lexicon": 1,
33+ "id": "ac.foundation.dataset.arrayFormat",
44+ "defs": {
55+ "main": {
66+ "type": "string",
77+ "description": "Array serialization format identifier for NDArray fields in sample schemas. Known values correspond to token definitions in this Lexicon. Each format has versioned specifications maintained by foundation.ac at canonical URLs.",
88+ "knownValues": ["ndarrayBytes"],
99+ "maxLength": 50
1010+ },
1111+ "ndarrayBytes": {
1212+ "type": "token",
1313+ "description": "Numpy .npy binary format for NDArray serialization. Stores arrays with dtype and shape in binary header. Versions maintained at https://foundation.ac/schemas/atdata-ndarray-bytes/{version}/"
1414+ }
1515+ }
1616+}
···11+{
22+ "lexicon": 1,
33+ "id": "ac.foundation.dataset.schemaType",
44+ "defs": {
55+ "main": {
66+ "type": "string",
77+ "description": "Schema type identifier for atdata sample definitions. Known values correspond to token definitions in this Lexicon. New schema types can be added as tokens without breaking changes.",
88+ "knownValues": ["jsonSchema"],
99+ "maxLength": 50
1010+ },
1111+ "jsonSchema": {
1212+ "type": "token",
1313+ "description": "JSON Schema Draft 7 format for sample type definitions. When schemaType is 'jsonSchema', the schema field must contain an object conforming to ac.foundation.dataset.sampleSchema#jsonSchemaFormat."
1414+ }
1515+ }
1616+}
···11+{
22+ "lexicon": 1,
33+ "id": "ac.foundation.dataset.storageExternal",
44+ "defs": {
55+ "main": {
66+ "type": "object",
77+ "description": "External storage via URLs (S3, HTTP, IPFS, etc.) for WebDataset tar archives. URLs support brace notation for sharding (e.g., 'data-{000000..000099}.tar'). Used in ac.foundation.dataset.record storage union.",
88+ "required": [
99+ "urls"
1010+ ],
1111+ "properties": {
1212+ "urls": {
1313+ "type": "array",
1414+ "description": "WebDataset URLs with optional brace notation for sharded tar files",
1515+ "items": {
1616+ "type": "string",
1717+ "format": "uri",
1818+ "maxLength": 1000
1919+ },
2020+ "minLength": 1
2121+ }
2222+ }
2323+ }
2424+ }
2525+}
+16
.planning/lexicons/ndarray_shim.json
···11+{
22+ "$schema": "http://json-schema.org/draft-07/schema#",
33+ "$id": "https://foundation.ac/schemas/atdata-ndarray-bytes/1.0.0",
44+ "title": "ATDataNDArrayBytes",
55+ "description": "Standard definition for numpy NDArray types in JSON Schema, compatible with atdata WebDataset serialization. This type's contents are interpreted as containing the raw bytes data for a serialized numpy NDArray, and serve as a marker for atdata-based code generation to use standard numpy types, rather than generated dataclasses.",
66+ "version": "1.0.0",
77+ "$defs": {
88+ "ndarray": {
99+ "type": "string",
1010+ "format": "byte",
1111+ "description": "Numpy array serialized using numpy `.npy` format via `np.save` (includes dtype and shape in binary header). When represented in JSON, this is a base64-encoded string. In msgpack, this is raw bytes.",
1212+ "contentEncoding": "base64",
1313+ "contentMediaType": "application/octet-stream"
1414+ }
1515+ }
1616+}
+386
.planning/ndarray_shim_spec.md
···11+# NDArray JSON Schema Shim Specification
22+33+**Issue**: #52
44+**Version**: 1.0
55+**Status**: Draft
66+77+## Problem Statement
88+99+We need a standard way to represent numpy NDArray types in JSON Schema that:
1010+1. Works with existing atdata msgpack serialization (numpy `.npy` format)
1111+2. Can be validated (where practical)
1212+3. Can be used for code generation
1313+4. Is compatible with JSON Schema tooling
1414+5. Preserves dtype and shape information
1515+1616+## Current Serialization Format
1717+1818+atdata uses `_helpers.array_to_bytes()` which serializes arrays using numpy's native `.npy` format:
1919+2020+```python
2121+def array_to_bytes(x: np.ndarray) -> bytes:
2222+    np_bytes = BytesIO()
2323+    np.save(np_bytes, x, allow_pickle=True)
2424+    return np_bytes.getvalue()
2525+```
2626+2727+**Result**: A bytes object containing:
2828+- Magic bytes (`\x93NUMPY`)
2929+- Version info
3030+- Header with dtype and shape
3131+- Array data
3232+3333+**Key insight**: The .npy format is self-describing - dtype and shape are already in the bytes!
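To make the self-describing property concrete, here is a minimal sketch that reads dtype and shape from the `.npy` header without loading the array data. The helper name `peek_npy_header` is hypothetical (it is not part of atdata); the `numpy.lib.format` functions it calls are NumPy's own header readers.

```python
from io import BytesIO

import numpy as np
from numpy.lib import format as npy_format


def peek_npy_header(data: bytes):
    """Return (shape, dtype) from .npy bytes without reading the array body.

    Hypothetical helper for illustration only.
    """
    buf = BytesIO(data)
    # read_magic consumes the magic string and returns (major, minor)
    version = npy_format.read_magic(buf)
    if version == (1, 0):
        shape, _fortran_order, dtype = npy_format.read_array_header_1_0(buf)
    else:
        # Format 2.0 uses a wider header-length field than 1.0
        shape, _fortran_order, dtype = npy_format.read_array_header_2_0(buf)
    return shape, dtype


# Serialize a small array and inspect its header
out = BytesIO()
np.save(out, np.zeros((4, 5, 3), dtype=np.uint8), allow_pickle=False)
shape, dtype = peek_npy_header(out.getvalue())
print(shape, dtype)
```

Because the header is read separately from the data, a check like this could reject a sample with the wrong dtype or rank before paying the cost of a full load.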
3434+3535+## Design Approach
3636+3737+### Option 1: Pure Metadata (REJECTED)
3838+3939+Describe the semantic array only:
4040+```json
4141+{
4242+ "type": "object",
4343+ "x-atdata-ndarray": true,
4444+ "x-dtype": "uint8",
4545+ "x-shape": [null, null, 3]
4646+}
4747+```
4848+4949+**Problem**: Doesn't match actual msgpack structure (which stores bytes, not objects)
5050+5151+### Option 2: Bytes with Extension Properties (REJECTED)
5252+5353+Describe the bytes with metadata:
5454+```json
5555+{
5656+ "type": "string",
5757+ "format": "byte",
5858+ "x-dtype": "uint8",
5959+ "x-shape": [null, null, 3]
6060+}
6161+```
6262+6363+**Problem**:
6464+- Non-standard use of extension properties
6565+- JSON Schema doesn't know how to validate these
6666+- Codegen tools won't understand x- properties
6767+6868+### Option 3: Reusable Schema Definition (RECOMMENDED)
6969+7070+Create a standard NDArray schema definition that can be $ref'd, with controlled vocabulary for metadata.
7171+7272+## Recommended Specification
7373+7474+### Base NDArray Schema Definition
7575+7676+This should be included in every JSON Schema that uses NDArray types:
7777+7878+```json
7979+{
8080+ "$schema": "http://json-schema.org/draft-07/schema#",
8181+ "$defs": {
8282+ "ndarray": {
8383+ "type": "string",
8484+ "format": "byte",
8585+ "description": "Numpy array serialized using numpy .npy format (includes dtype and shape in binary header)",
8686+ "contentEncoding": "base64",
8787+ "contentMediaType": "application/octet-stream"
8888+ }
8989+ }
9090+}
9191+```
9292+9393+### Using NDArray in Properties
9494+9595+Properties that are NDArray types reference the base definition and add metadata as **sibling properties**:
9696+9797+```json
9898+{
9999+ "properties": {
100100+ "image": {
101101+ "$ref": "#/$defs/ndarray",
102102+ "description": "RGB image with variable height/width",
103103+ "x-atdata-dtype": "uint8",
104104+ "x-atdata-shape": [null, null, 3]
105105+ }
106106+ }
107107+}
108108+```
109109+110110+### Metadata Convention
111111+112112+**Extension properties** (prefixed with `x-atdata-`):
113113+- `x-atdata-dtype`: Numpy dtype string (e.g., "uint8", "float32", "int64")
114114+- `x-atdata-shape`: Array of integers and null (null = dynamic dimension)
115115+- `x-atdata-notes`: Optional human-readable notes about the array
116116+117117+**Standard JSON Schema properties** (used normally):
118118+- `description`: Human-readable description of what the array represents
119119+- `title`: Short name for the field
120120+121121+## Complete Example
122122+123123+```json
124124+{
125125+ "$schema": "http://json-schema.org/draft-07/schema#",
126126+ "title": "ImageSample",
127127+ "type": "object",
128128+ "required": ["image", "label"],
129129+ "properties": {
130130+ "image": {
131131+ "$ref": "#/$defs/ndarray",
132132+ "description": "RGB image with variable height/width",
133133+ "x-atdata-dtype": "uint8",
134134+ "x-atdata-shape": [null, null, 3],
135135+ "x-atdata-notes": "Images must have 3 color channels (RGB)"
136136+ },
137137+ "depth_map": {
138138+ "$ref": "#/$defs/ndarray",
139139+ "description": "Depth map corresponding to the image",
140140+ "x-atdata-dtype": "float32",
141141+ "x-atdata-shape": [null, null],
142142+ "x-atdata-notes": "Same height and width as image, but single channel"
143143+ },
144144+ "label": {
145145+ "type": "string",
146146+ "description": "Human-readable label"
147147+ }
148148+ },
149149+ "$defs": {
150150+ "ndarray": {
151151+ "type": "string",
152152+ "format": "byte",
153153+ "description": "Numpy array serialized using numpy .npy format",
154154+ "contentEncoding": "base64",
155155+ "contentMediaType": "application/octet-stream"
156156+ }
157157+ }
158158+}
159159+```
160160+161161+## Rationale
162162+163163+### Why `type: "string", format: "byte"`?
164164+165165+In msgpack serialization:
166166+- The NDArray field is stored as raw bytes (the .npy format)
167167+- When represented in JSON (for validation/transport), bytes become base64 strings
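168168+- `type: "string"` with `contentEncoding: "base64"` is the JSON Schema Draft 7 way to describe binary data; `format: "byte"` is the equivalent OpenAPI convention, included for tooling compatibility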
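The cost of the JSON representation can be estimated directly: padded base64 emits 4 characters for every 3 input bytes. The sketch below uses a zero-filled buffer as a stand-in for a serialized array; the function name `base64_length` is illustrative only.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


raw = bytes(150_000)  # stand-in for .npy bytes of an image-sized array
encoded = base64.b64encode(raw)

# The closed-form length matches what the encoder actually produces
assert len(encoded) == base64_length(len(raw))

overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} bytes -> {len(encoded)} characters ({overhead:.0%} overhead)")
```

This is why the msgpack path, which stores the raw bytes, is the more compact of the two representations.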
169169+170170+### Why extension properties (`x-atdata-*`)?
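171171+172172+JSON Schema validators ignore keywords they do not recognize, so custom properties are safe to add; the `x-` prefix is a convention borrowed from OpenAPI that marks them as extensions. Benefits:
173173+1. **Conventional**: The `x-` prefix is widely recognized across JSON Schema and OpenAPI tooling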
174174+2. **Ignored by validators**: Won't cause validation errors
175175+3. **Accessible to codegen**: Tools can parse these for type generation
176176+4. **Self-documenting**: Clear what they mean
177177+178178+### Why not validate dtype/shape at JSON Schema level?
179179+180180+**Technical limitation**: JSON Schema can't validate binary .npy format internals.
181181+182182+**Solution**: Validation happens at **deserialization time**:
183183+1. JSON Schema validates overall structure (field is present, is bytes)
184184+2. When bytes are deserialized to NDArray, check dtype/shape match expectations
185185+186186+## Usage in atdata
187187+188188+### Publishing Schemas
189189+190190+When publishing a PackableSample with NDArray fields:
191191+192192+```python
193193+@atdata.packable
194194+class ImageSample:
195195+    image: NDArray # Will be annotated with dtype/shape hints
196196+    label: str
197197+198198+# SDK extracts type hints and generates JSON Schema
199199+schema_json = {
200200+    "properties": {
201201+        "image": {
202202+            "$ref": "#/$defs/ndarray",
203203+            "x-atdata-dtype": "uint8", # From annotation or default
204204+            "x-atdata-shape": [None, None, 3] # From annotation or None (serialized as JSON null)
205205+        }
206206+    }
207207+}
208208+```
209209+210210+### Type Annotations for NDArray
211211+212212+Python type hints to specify dtype/shape:
213213+214214+```python
215215+from typing import Annotated
216216+from numpy.typing import NDArray
217217+218218+# Option 1: Generic NDArray (dtype/shape inferred or not specified)
219219+image: NDArray
220220+221221+# Option 2: With dtype (using numpy typing)
222222+image: NDArray[np.uint8]
223223+224224+# Option 3: With full metadata (using Annotated)
225225+image: Annotated[NDArray[np.uint8], {"shape": [None, None, 3]}]
226226+```
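One way the SDK might recover this metadata from the annotations is sketched below. It assumes Option 3's convention of a plain dict inside `Annotated`; the function `ndarray_metadata` is hypothetical and not an existing atdata API.

```python
from typing import Annotated, get_args, get_origin, get_type_hints

import numpy as np
from numpy.typing import NDArray


class ImageSample:
    image: Annotated[NDArray[np.uint8], {"shape": [None, None, 3]}]
    label: str


def ndarray_metadata(cls, field: str) -> dict:
    """Collect x-atdata-* metadata for one field (hypothetical helper)."""
    hint = get_type_hints(cls, include_extras=True)[field]
    meta = {}
    if get_origin(hint) is Annotated:
        # First argument is the underlying type, the rest are the extras
        hint, *extras = get_args(hint)
        for extra in extras:
            if isinstance(extra, dict) and "shape" in extra:
                meta["x-atdata-shape"] = list(extra["shape"])
    # NDArray[T] expands to np.ndarray[<shape type>, np.dtype[T]]
    args = get_args(hint)
    if get_origin(hint) is np.ndarray and len(args) == 2:
        dtype_args = get_args(args[1])
        # A bare NDArray leaves a TypeVar here, which is skipped
        if dtype_args and isinstance(dtype_args[0], type):
            meta["x-atdata-dtype"] = np.dtype(dtype_args[0]).name
    return meta


print(ndarray_metadata(ImageSample, "image"))
```

Fields without an `NDArray` annotation, such as `label`, yield an empty dict, so the same function can be run over every field of a sample class.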
227227+228228+### Code Generation
229229+230230+Codegen reads JSON Schema and produces:
231231+232232+```python
233233+@atdata.packable
234234+class ImageSample:
235235+    image: NDArray # uint8, shape: [*, *, 3]
236236+    label: str
237237+```
238238+239239+Comment indicates dtype/shape from `x-atdata-*` properties.
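A minimal sketch of how codegen could render that comment from a property's extension keys follows. Both function names are hypothetical; only the `x-atdata-dtype` and `x-atdata-shape` keys and the `[*, *, 3]` notation come from this spec.

```python
from typing import Any, Dict


def ndarray_comment(prop: Dict[str, Any]) -> str:
    """Build the trailing comment text from x-atdata-* properties."""
    parts = []
    dtype = prop.get("x-atdata-dtype")
    if dtype is not None:
        parts.append(dtype)
    shape = prop.get("x-atdata-shape")
    if shape is not None:
        # null (None) marks a dynamic dimension and is rendered as "*"
        dims = ", ".join("*" if d is None else str(d) for d in shape)
        parts.append(f"shape: [{dims}]")
    return ", ".join(parts)


def render_field(name: str, prop: Dict[str, Any]) -> str:
    """Render one class-body line for an NDArray field."""
    line = f"    {name}: NDArray"
    comment = ndarray_comment(prop)
    return f"{line}  # {comment}" if comment else line


prop = {"$ref": "#/$defs/ndarray", "x-atdata-dtype": "uint8", "x-atdata-shape": [None, None, 3]}
print(render_field("image", prop))
```

When neither key is present the field is emitted as a bare `NDArray`, which matches the spec's position that dtype and shape are optional.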
240240+241241+## Validation Strategy
242242+243243+### JSON Schema Level (Structural)
244244+✅ Validate field is present (if required)
245245+✅ Validate field is bytes/string (in JSON)
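246246+⚠️ Base64 encoding (if in JSON) is not checked by most Draft 7 validators, which treat `format` and `contentEncoding` as annotations; decode explicitly to verify it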
247247+248248+### Deserialization Level (Semantic)
249249+✅ Validate .npy format is valid
250250+✅ Validate dtype matches expected (if specified)
251251+✅ Validate shape matches expected (if specified)
252252+✅ Validate shape constraints (e.g., must be 3D)
253253+254254+### Implementation
255255+256256+```python
257257+# Proposed helper for atdata.validation; assumes `import numpy as np` and `from typing import List, Optional`
258258+259259+def validate_ndarray(
260260+    array: np.ndarray,
261261+    expected_dtype: Optional[str] = None,
262262+    expected_shape: Optional[List[Optional[int]]] = None
263263+) -> tuple[bool, list[str]]:
264264+    """Validate array against expectations."""
265265+    errors = []
266266+267267+    # Check dtype
268268+    if expected_dtype and str(array.dtype) != expected_dtype:
269269+        errors.append(f"Expected dtype {expected_dtype}, got {array.dtype}")
270270+271271+    # Check shape (compare against None so an empty list, meaning 0-dimensional, is still checked)
272272+    if expected_shape is not None:
273273+        if len(array.shape) != len(expected_shape):
274274+            errors.append(f"Expected {len(expected_shape)}D array, got {len(array.shape)}D")
275275+        for i, (actual, expected) in enumerate(zip(array.shape, expected_shape)):
276276+            if expected is not None and actual != expected:
277277+                errors.append(f"Dimension {i}: expected {expected}, got {actual}")
278278+279279+    return len(errors) == 0, errors
280280+```
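The two validation levels can be combined at the point where a JSON value is decoded. The sketch below is an assumption about how that might look, not the SDK's implementation: `decode_and_check` is a hypothetical name, and it loads with `allow_pickle=False`, which is stricter than the current `array_to_bytes` helper.

```python
import base64
from io import BytesIO
from typing import Any, Dict, List, Tuple

import numpy as np


def decode_and_check(value: str, prop: Dict[str, Any]) -> Tuple[np.ndarray, List[str]]:
    """Decode a base64 .npy string and compare it with x-atdata-* metadata."""
    # validate=True raises binascii.Error on characters outside the base64 alphabet
    raw = base64.b64decode(value, validate=True)
    array = np.load(BytesIO(raw), allow_pickle=False)

    errors: List[str] = []
    expected_dtype = prop.get("x-atdata-dtype")
    if expected_dtype is not None and str(array.dtype) != expected_dtype:
        errors.append(f"Expected dtype {expected_dtype}, got {array.dtype}")

    expected_shape = prop.get("x-atdata-shape")
    if expected_shape is not None:
        if array.ndim != len(expected_shape):
            errors.append(f"Expected {len(expected_shape)}D array, got {array.ndim}D")
        else:
            for i, (actual, expected) in enumerate(zip(array.shape, expected_shape)):
                # None marks a dynamic dimension and matches any size
                if expected is not None and actual != expected:
                    errors.append(f"Dimension {i}: expected {expected}, got {actual}")
    return array, errors


# Usage: encode a (2, 3) float32 array, then check it against two property schemas
buf = BytesIO()
np.save(buf, np.zeros((2, 3), dtype=np.float32), allow_pickle=False)
value = base64.b64encode(buf.getvalue()).decode("ascii")

_, ok_errors = decode_and_check(value, {"x-atdata-dtype": "float32", "x-atdata-shape": [None, 3]})
_, bad_errors = decode_and_check(value, {"x-atdata-dtype": "uint8", "x-atdata-shape": [None, None, 3]})
print(ok_errors, bad_errors)
```

Skipping the per-dimension comparison when the rank already differs keeps the error list short; reporting both would also be a defensible choice.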
281281+282282+## Standard NDArray Shim URI
283283+284284+The NDArray shim definition should be published at a canonical URI:
285285+286286+**Proposed**: `at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0`
287287+288288+This allows schemas to reference a standard definition:
289289+290290+```json
291291+{
292292+ "properties": {
293293+ "image": {
294294+ "$ref": "at://did:plc:foundation/ac.foundation.dataset.ndarray-shim/1.0.0#/$defs/ndarray",
295295+ "x-atdata-dtype": "uint8"
296296+ }
297297+ }
298298+}
299299+```
300300+301301+Or schemas can inline the definition (recommended for Phase 1):
302302+303303+```json
304304+{
305305+ "$defs": {
306306+ "ndarray": { /* inline definition */ }
307307+ }
308308+}
309309+```
310310+311311+## Alternative: Describe Deserialized Structure
312312+313313+For reference, an alternative approach that describes the "unpacked" structure:
314314+315315+```json
316316+{
317317+ "$defs": {
318318+ "ndarray": {
319319+ "type": "object",
320320+ "description": "Numpy array (deserialized representation)",
321321+ "required": ["dtype", "shape", "data"],
322322+ "properties": {
323323+ "dtype": {"type": "string"},
324324+ "shape": {"type": "array", "items": {"type": "integer"}},
325325+ "data": {"type": "string", "format": "byte"}
326326+ }
327327+ }
328328+ }
329329+}
330330+```
331331+332332+**Problem**: This doesn't match the actual msgpack structure (which is just bytes, not an object with dtype/shape/data fields). The .npy format is opaque bytes, not a structured object.
333333+334334+**Conclusion**: Stick with the recommended approach (bytes with metadata).
335335+336336+## Implementation Checklist
337337+338338+- [ ] Update sampleSchema Lexicon to reference this spec
339339+- [ ] Create standard NDArray shim definition
340340+- [ ] Update schema examples to use the shim correctly
341341+- [ ] Implement validation helpers in Python SDK
342342+- [ ] Add type annotation support for dtype/shape hints
343343+- [ ] Update codegen to read x-atdata-* properties
344344+- [ ] Document in user-facing docs
345345+346346+## Open Questions
347347+348348+1. **Should we support other array libraries?** (PyTorch tensors, JAX arrays, etc.)
349349+ - Could use `x-atdata-array-type: "numpy"|"torch"|"jax"`
350350+ - Recommendation: NumPy only for Phase 1
351351+352352+2. **Should shape constraints be enforced at runtime?**
353353+ - Pro: Catch errors early
354354+ - Con: Performance overhead, flexibility lost
355355+ - Recommendation: Optional validation, off by default
356356+357357+3. **Should we support sparse arrays?**
358358+ - scipy.sparse has different serialization format
359359+ - Recommendation: Defer to future
360360+361361+4. **What about array of arrays?** (ragged arrays)
362362+ - Can be represented as Python lists of NDArrays
363363+ - Recommendation: Not a priority for Phase 1
364364+365365+## Summary
366366+367367+**Recommended Approach**:
368368+- NDArray fields represented as `{"$ref": "#/$defs/ndarray"}` (bytes)
369369+- Dtype and shape specified via `x-atdata-dtype` and `x-atdata-shape`
370370+- Standard `ndarray` definition inlined in every schema
371371+- Validation happens at deserialization, not JSON Schema level
372372+- Codegen reads extension properties to generate proper types
373373+374374+**Benefits**:
375375+- ✅ Compatible with existing msgpack serialization
376376+- ✅ Works with JSON Schema tooling
377377+- ✅ Clear metadata for codegen
378378+- ✅ Flexible (dtype/shape optional)
379379+- ✅ Extensible (can add more x-atdata-* properties)
380380+381381+**Trade-offs**:
382382+- ⚠️ Leaky abstraction (JSON Schema describes bytes, not semantic array)
383383+- ⚠️ Validation split across two layers
384384+- ⚠️ Extension properties not universally understood
385385+386386+**Grade**: B+ (Good practical solution)
+336
.reference/atproto_lexicon_guide.md
···11+# AT Protocol Lexicon Guide
22+33+> **Source**: [AT Protocol Lexicon Documentation](https://atproto.com/guides/lexicon)
44+55+## Overview
66+77+Lexicon is a JSON-based schema language that defines RPC methods and record types for AT Protocol. It enables interoperability by establishing agreed-upon behaviors and semantics across the open network.
88+99+## Key Concepts
1010+1111+### NSIDs (Namespaced Identifiers)
1212+1313+Schemas use reverse-DNS format identifiers indicating ownership:
1414+1515+```
1616+com.atproto.repo.createRecord # Core ATProto API
1717+app.bsky.feed.post # Bluesky app record type
1818+ac.foundation.dataset.sampleSchema # Our custom namespace
1919+```
2020+2121+Format: `authority.name` where authority is reverse-DNS
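As a rough illustration of that format, the sketch below splits an NSID at its last dot and recovers the owning domain by reversing the authority. It is not a full NSID validator, and `parse_nsid` is a hypothetical helper name.

```python
from typing import NamedTuple


class Nsid(NamedTuple):
    authority: str  # reverse-DNS part, e.g. "ac.foundation.dataset"
    name: str       # final segment, e.g. "sampleSchema"

    @property
    def domain(self) -> str:
        """Authority with segments reversed back into DNS order."""
        return ".".join(reversed(self.authority.split(".")))


def parse_nsid(nsid: str) -> Nsid:
    """Split an NSID into authority and name (no character-level validation)."""
    authority, sep, name = nsid.rpartition(".")
    if not sep or not authority or not name:
        raise ValueError(f"not an NSID: {nsid!r}")
    return Nsid(authority, name)


parsed = parse_nsid("ac.foundation.dataset.sampleSchema")
print(parsed.authority, parsed.name, parsed.domain)
```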
2222+2323+### Why Not RDF?
2424+2525+Lexicon prioritizes:
2626+- Schema enforcement (not optional metadata)
2727+- Code generation with types and validation
2828+- Practical developer experience
2929+3030+## Schema Types
3131+3232+### 1. Record Types
3333+3434+Define the structure of data stored in repositories:
3535+3636+```json
3737+{
3838+ "lexicon": 1,
3939+ "id": "com.example.follow",
4040+ "defs": {
4141+ "main": {
4242+ "type": "record",
4343+ "key": "tid",
4444+ "record": {
4545+ "type": "object",
4646+ "required": ["subject", "createdAt"],
4747+ "properties": {
4848+ "subject": { "type": "string", "format": "did" },
4949+ "createdAt": { "type": "string", "format": "datetime" }
5050+ }
5151+ }
5252+ }
5353+ }
5454+}
5555+```
5656+5757+Records stored in repos have a `$type` field mapping to their schema.
5858+5959+### 2. Query Methods
6060+6161+Define HTTP GET endpoints:
6262+6363+```json
6464+{
6565+ "lexicon": 1,
6666+ "id": "com.example.getProfile",
6767+ "defs": {
6868+ "main": {
6969+ "type": "query",
7070+ "parameters": {
7171+ "type": "params",
7272+ "required": ["actor"],
7373+ "properties": {
7474+ "actor": { "type": "string", "format": "at-identifier" }
7575+ }
7676+ },
7777+ "output": {
7878+ "encoding": "application/json",
7979+ "schema": { "type": "ref", "ref": "#profileView" }
8080+ }
8181+ }
8282+ }
8383+}
8484+```
8585+8686+Maps to: `GET /xrpc/com.example.getProfile?actor=...`
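The mapping from a query Lexicon to a URL can be sketched with the standard library alone. The host name and the helper `xrpc_query_url` below are placeholders for illustration; only the `/xrpc/<nsid>?<params>` shape comes from the text above.

```python
from urllib.parse import urlencode, urlunsplit


def xrpc_query_url(host: str, nsid: str, params: dict) -> str:
    """Build the GET URL for a query method (hypothetical helper)."""
    query = urlencode(params)  # percent-encodes values such as DIDs
    return urlunsplit(("https", host, f"/xrpc/{nsid}", query, ""))


url = xrpc_query_url(
    "pds.example.com",  # placeholder host
    "com.example.getProfile",
    {"actor": "alice.example.com"},
)
print(url)
```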
8787+8888+### 3. Procedure Methods
8989+9090+Define HTTP POST endpoints:
9191+9292+```json
9393+{
9494+ "lexicon": 1,
9595+ "id": "com.example.updateProfile",
9696+ "defs": {
9797+ "main": {
9898+ "type": "procedure",
9999+ "input": {
100100+ "encoding": "application/json",
101101+ "schema": { ... }
102102+ },
103103+ "output": {
104104+ "encoding": "application/json",
105105+ "schema": { ... }
106106+ }
107107+ }
108108+ }
109109+}
110110+```
111111+112112+### 4. Tokens
113113+114114+Declare reusable global identifiers for extensible enums:
115115+116116+```json
117117+{
118118+ "lexicon": 1,
119119+ "id": "com.example.status.active",
120120+ "defs": {
121121+ "main": {
122122+ "type": "token",
123123+ "description": "User is active"
124124+ }
125125+ }
126126+}
127127+```
128128+129129+Instead of hardcoding enum values, use tokens. Teams can add values without collisions.
130130+131131+## Field Types
132132+133133+### Primitives
134134+135135+| Type | Description |
136136+|------|-------------|
137137+| `string` | Text, with optional format/length constraints |
138138+| `integer` | Whole numbers |
139139+| `boolean` | true/false |
140140+| `bytes` | Binary data (base64 encoded in JSON) |
141141+| `cid-link` | Content identifier reference |
142142+| `unknown` | Any JSON object, not validated against a schema |
143143+144144+### String Formats
145145+146146+| Format | Description |
147147+|--------|-------------|
148148+| `at-uri` | AT Protocol URI |
149149+| `at-identifier` | Handle or DID |
150150+| `did` | Decentralized identifier |
151151+| `handle` | User handle |
152152+| `datetime` | ISO 8601 timestamp |
153153+| `uri` | Generic URI |
154154+| `language` | BCP 47 language tag |
155155+156156+### Complex Types
157157+158158+```json
159159+// Object
160160+{
161161+ "type": "object",
162162+ "required": ["field1"],
163163+ "properties": {
164164+ "field1": { "type": "string" },
165165+ "field2": { "type": "integer" }
166166+ }
167167+}
168168+169169+// Array
170170+{
171171+ "type": "array",
172172+ "items": { "type": "string" },
173173+ "maxLength": 100
174174+}
175175+176176+// Union (discriminated)
177177+{
178178+ "type": "union",
179179+ "refs": [
180180+ "#typeA",
181181+ "#typeB"
182182+ ]
183183+}
184184+185185+// Reference to another schema
186186+{
187187+ "type": "ref",
188188+ "ref": "com.example.otherSchema#thing"
189189+}
190190+```
191191+192192+### Blob References
193193+194194+For binary data stored separately:
195195+196196+```json
197197+{
198198+ "type": "blob",
199199+ "accept": ["image/jpeg", "image/png"],
200200+ "maxSize": 1000000
201201+}
202202+```
203203+204204+## Versioning Rules
205205+206206+**Constraints on published schemas are effectively immutable.**
207207+208208+- Loosening a constraint breaks validation in old software when it reads new data
209209+- Tightening a constraint breaks validation in new software when it reads old data
210210+- Only **optional** constraints may be added to previously unconstrained fields
211211+- Major changes require **new NSIDs**
212212+213213+## Schema Distribution
214214+215215+Schemas should be published as machine-readable, network-accessible resources:
216216+217217+1. Host at well-known URL: `https://authority.com/.well-known/lexicons/`
218218+2. Or embed in documentation
219219+3. Ensure canonical representation exists for consumers
220220+221221+## Record Keys (rkeys)
222222+223223+Records in collections are identified by keys:
224224+225225+| Key Type | Description |
226226+|----------|-------------|
227227+| `tid` | Timestamp-based ID (sortable, unique) |
228228+| `literal:self` | Singleton record (e.g., profile) |
229229+| `any` | Any valid string |
230230+231231+TID format: 13-character base32-sortable timestamp
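A sketch of how a TID could be encoded and decoded. It assumes the layout from the TID specification: a 64-bit integer with a zero top bit, 53 bits of microseconds since the Unix epoch, and a 10-bit clock identifier, written as 13 base32-sortable characters. `encode_tid` and `decode_tid` are hypothetical helper names.

```python
import time

# Base32-sortable alphabet: character order matches numeric order.
B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"


def encode_tid(timestamp_us: int, clock_id: int) -> str:
    """Pack a microsecond timestamp and clock id into a 13-character TID."""
    value = ((timestamp_us & ((1 << 53) - 1)) << 10) | (clock_id & 0x3FF)
    chars = []
    for _ in range(13):
        chars.append(B32_SORTABLE[value & 0x1F])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))


def decode_tid(tid: str) -> tuple[int, int]:
    """Return (timestamp_us, clock_id) from a TID string."""
    value = 0
    for ch in tid:
        value = (value << 5) | B32_SORTABLE.index(ch)
    return value >> 10, value & 0x3FF


now_us = time.time_ns() // 1000
tid = encode_tid(now_us, clock_id=1)
print(tid, decode_tid(tid))
```

Because the alphabet is in ASCII order, sorting TIDs as plain strings also sorts them by creation time.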
232232+233233+## Example: Complete Lexicon
234234+235235+```json
236236+{
237237+ "lexicon": 1,
238238+ "id": "ac.foundation.dataset.sampleSchema",
239239+ "revision": 1,
240240+ "description": "Schema definition for a PackableSample type",
241241+ "defs": {
242242+ "main": {
243243+ "type": "record",
244244+ "key": "tid",
245245+ "description": "A sample schema record",
246246+ "record": {
247247+ "type": "object",
248248+ "required": ["name", "version", "fields"],
249249+ "properties": {
250250+ "name": {
251251+ "type": "string",
252252+ "description": "Human-readable schema name"
253253+ },
254254+ "version": {
255255+ "type": "string",
256256+ "description": "Semantic version"
257257+ },
258258+ "fields": {
259259+ "type": "array",
260260+ "items": { "type": "ref", "ref": "#fieldDef" }
261261+ },
262262+ "createdAt": {
263263+ "type": "string",
264264+ "format": "datetime"
265265+ }
266266+ }
267267+ }
268268+ },
269269+ "fieldDef": {
270270+ "type": "object",
271271+ "required": ["name", "fieldType"],
272272+ "properties": {
273273+ "name": { "type": "string" },
274274+ "fieldType": { "type": "ref", "ref": "#fieldType" },
275275+ "optional": { "type": "boolean", "default": false }
276276+ }
277277+ },
278278+ "fieldType": {
279279+ "type": "union",
280280+ "refs": [
281281+ "#primitiveType",
282282+ "#arrayType"
283283+ ]
284284+ },
285285+ "primitiveType": {
286286+ "type": "object",
287287+ "required": ["kind"],
288288+ "properties": {
289289+ "kind": {
290290+ "type": "string",
291291+ "knownValues": ["string", "int", "float", "bool", "bytes"]
292292+ }
293293+ }
294294+ },
295295+ "arrayType": {
296296+ "type": "object",
297297+ "required": ["kind", "elementType"],
298298+ "properties": {
299299+ "kind": { "type": "string", "const": "ndarray" },
300300+ "elementType": { "type": "string" },
301301+ "shape": {
302302+ "type": "array",
303303+ "items": { "type": "integer" }
304304+ }
305305+ }
306306+ }
307307+ }
308308+}
309309+```
310310+311311+## XRPC (Cross-Server RPC)
312312+313313+Lexicons map to HTTP endpoints:
314314+315315+```
316316+com.example.getProfile()
317317+ → GET /xrpc/com.example.getProfile
318318+319319+com.example.createPost()
320320+ → POST /xrpc/com.example.createPost
321321+```
322322+323323+## Validation Behavior
324324+325325+The PDS can validate records against lexicons, but:
326326+327327+1. PDS is lexicon-agnostic by default
328328+2. Validation can be disabled: `validate: false`
329329+3. Unknown lexicons are stored without validation
330330+4. Rate limits prevent abuse (not schema enforcement)
331331+332332+## Resources
333333+334334+- **Lexicon Specification**: https://atproto.com/specs/lexicon
335335+- **Lexicon Guide**: https://atproto.com/guides/lexicon
336336+- **Bluesky Lexicons**: https://github.com/bluesky-social/atproto/tree/main/lexicons
+230
.reference/atproto_lexicon_spec.md
···11+# Lexicon Specification - AT Protocol
22+33+## Overview
44+55+"Lexicon is a schema definition language used to describe atproto records, HTTP endpoints (XRPC), and event stream messages."
66+77+The language builds on the atproto Data Model and incorporates concepts similar to JSON Schema and OpenAPI, while adding protocol-specific features. This specification covers version 1 of the Lexicon language.
88+99+## Type Categories
1010+1111+Lexicon types fall into several categories:
1212+1313+**Concrete Types:** boolean, integer, string, bytes, cid-link, blob
1414+1515+**Container Types:** array, object
1616+1717+**Sub-types:** params, permission
1818+1919+**Meta Types:** token, ref, union, unknown
2020+2121+**Primary Types:** record, query, procedure, subscription, permission-set
2222+2323+## Lexicon Files
2424+2525+Lexicon schemas are JSON files associated with a single NSID containing one or more definitions. Required file fields:
2626+2727+- `lexicon` (integer): Language version, currently fixed at 1
2828+- `id` (string): The NSID identifier
2929+- `defs` (object): Named definitions with distinct keys
3030+- `description` (string, optional): Overview text
3131+3232+"References to specific definitions within a Lexicon use fragment syntax, like `com.example.defs#someView`."
3333+3434+## Primary Type Definitions
3535+3636+### Record Type
3737+3838+Specifies data objects stored in repositories. Type-specific fields:
3939+4040+- `key` (string): Record key type specification
4141+- `record` (object): Schema with type object describing the record structure
4242+4343+### Query and Procedure (HTTP API)
4444+4545+Describes XRPC endpoints. Fields:
4646+4747+- `parameters`: Optional params schema for query parameters
4848+- `output`: Response body with encoding (MIME type) and optional schema
4949+- `input`: Request body (procedures only)
5050+- `errors`: Array of possible error codes with descriptions
5151+5252+### Subscription (Event Stream)
5353+5454+Defines WebSocket endpoint messages. Fields:
5555+5656+- `parameters`: Optional HTTP parameters
5757+- `message`: Required specification with union schema
5858+- `errors`: Optional error definitions
5959+6060+"Subscription schemas must be a `union` of refs, not an `object` type."
6161+6262+### Permission Set
6363+6464+Bundles permissions for OAuth scopes. Fields:
6565+6666+- `title` / `title:langs`: Display name with localization
6767+- `detail` / `detail:langs`: Human-readable scope description
6868+- `permissions`: Array of permission definitions
6969+7070+## Field Type Definitions
7171+7272+### Primitive Types
7373+7474+**boolean:** Optional `default` and `const` fields
7575+7676+**integer:** Supports `minimum`, `maximum`, `enum`, `default`, `const`
7777+7878+**string:** Supports `format`, `maxLength`, `minLength`, `maxGraphemes`, `minGraphemes`, `knownValues`, `enum`, `default`, `const`
7979+8080+"Strings are Unicode. For non-Unicode encodings, use `bytes` instead."
8181+8282+**bytes:** Raw binary data with optional `minLength` and `maxLength`
8383+8484+**cid-link:** Content identifier links with no type-specific fields
8585+8686+### Container Types
8787+8888+**array:** Contains `items` (required schema) and optional `minLength`/`maxLength`
8989+9090+**object:**
9191+- `properties`: Named field schemas
9292+- `required`: Array of required field names
9393+- `nullable`: Array of fields accepting null values
9494+9595+"There is a semantic difference in data between omitting a field; including the field with value `null`; and including the field with a falsy value."
9696+9797+**blob:** Binary large objects with:
9898+- `accept`: MIME type restrictions (glob patterns supported)
9999+- `maxSize`: Maximum bytes
100100+101101+### Specialized Types
102102+103103+**params:** Limited to HTTP query parameters, supporting only boolean, integer, string, or arrays of these types. Cannot be top-level named definitions.
104104+105105+**permission:** Defines access permissions with `resource` field. Current resources:
106106+107107+- `repo`: Repository write permissions with collection and optional action fields
108108+- `rpc`: Remote API calls with lxm (endpoints), aud (audience), and inheritAud fields
109109+110110+"Permission declarations with unsupported resource types must be ignored by services implementing access control."
111111+112112+**token:** Empty values referenced by name, used for symbolic enumerations. Cannot be used in refs, unions, or as object fields.
113113+114114+### Reference and Union Types
115115+116116+**ref:** References another schema definition globally (by NSID) or locally (by fragment). Reduces schema duplication for reusable definitions.
117117+118118+**union:** Declares multiple possible types at a location. Fields:
119119+120120+- `refs`: Array of schema references
121121+- `closed`: Boolean indicating if type list is fixed (default: false)
122122+123123+"Unions represent that multiple possible types could be present at this location in the schema."
124124+125125+**unknown:** Accepts any data object with no specific validation, but must be a CBOR map. Data may contain optional `$type` field.
126126+127127+## String Formats
128128+129129+Lexicon supports format-constrained strings:
130130+131131+- `at-identifier`: Handle or DID
132132+- `at-uri`: AT-URI reference
133133+- `at-uri-regex`: "Lenient" version accepting unresolved at-identifier
134134+- `cid`: Content identifier
135135+- `datetime`: RFC 3339 timestamp
136136+- `did`: Decentralized identifier
137137+- `handle`: Handle identifier
138138+- `nsid`: Namespaced identifier
139139+- `tid`: Timestamp identifier
140140+- `record-key`: Record key syntax
141141+- `uri`: Generic URI (RFC 3986)
142142+- `language`: IETF language tag (BCP 47)
143143+144144+### Datetime Format
145145+146146+Required elements:
147147+- Intersection of RFC 3339, ISO 8601, and WHATWG HTML standards
148148+- Uppercase T separator between date and time
149149+- Timezone specification (preferably Z for UTC)
150150+- Whole seconds precision (millisecond precision recommended)
151151+152152+Valid example: `1985-04-12T23:20:50.123Z`
153153+154154+Invalid: Missing timezone, lowercase t, insufficient precision, or invalid day/month values
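One way to produce a timestamp that meets the requirements above, using only the standard library. `lexicon_datetime` is a hypothetical helper, and `LOOSE_DATETIME` is a simplified pattern for quick sanity checks rather than the full grammar from the specification.

```python
import re
from datetime import datetime, timezone


def lexicon_datetime(moment: datetime) -> str:
    """Format a timezone-aware datetime as UTC with millisecond precision and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# Simplified check: uppercase T, whole seconds, optional fraction, explicit timezone.
LOOSE_DATETIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)

stamp = lexicon_datetime(datetime(1985, 4, 12, 23, 20, 50, 123000, tzinfo=timezone.utc))
print(stamp, bool(LOOSE_DATETIME.match(stamp)))
```

Passing a naive `datetime` would make `astimezone` assume local time, so callers should attach a timezone first.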
155155+156156+### AT-URI Format
157157+158158+"at-uri": Represents an AT-URI following the AT-URI scheme specification. Examples:
159159+- `at://did:plc:abc123/com.example.record/rkey123`
160160+- `at://alice.bsky.social/app.bsky.feed.post/3k4i5j6k`
161161+162162+"at-uri-regex": "Lenient" version that accepts AT-URIs with unresolved at-identifiers.
163163+164164+### URI Format
165165+166166+"uri": "Flexible to any URI schema, following the generic RFC-3986 on URIs." Supports did, https, wss, ipfs, dns, and at schemes. Maximum length is 8 KBytes.
167167+168168+### Language Format
169169+170170+"language": "An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646." Examples include `ja` (Japanese) and `pt-BR` (Brazilian Portuguese).
171171+172172+## Validation Approach
173173+174174+"For the various identifier formats, when doing Lexicon schema validation the most expansive identifier syntax format should be permitted." Application-level validation of specific identifier methods occurs separately from schema validation.
175175+176176+## When to Use `$type`
177177+178178+Data objects sometimes require a `$type` field for disambiguation:
179179+180180+- `record` objects: Always include `$type`
181181+- `union` variants: Always include `$type` (except top-level subscription messages)
182182+- `blob` objects: Always include `$type`
183183+184184+"Main types must be referenced in `$type` fields as just the NSID, not including a `#main` suffix."
185185+186186+## Validation Options
187187+188188+Three PDS validation approaches:
189189+190190+1. **Explicit validation:** Record must validate against known Lexicon; fails if unavailable
191191+2. **No validation:** Record bypasses Lexicon validation (still validates data model rules)
192192+3. **Optimistic validation (default):** Validates if Lexicon known; allows creation if unavailable
193193+194194+## Lexicon Evolution
195195+196196+Compatibility rules for schema updates:
197197+198198+- New fields must be optional
199199+- Non-optional fields cannot be removed
200200+- Field types cannot change
201201+- Fields cannot be renamed
202202+203203+"If larger breaking changes are necessary, a new Lexicon name must be used."
204204+205205+Lexicon publication occurs through atproto repositories using `com.atproto.lexicon.schema` record types, linked via DNS TXT records for authority resolution.
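The compatibility rules listed above can be approximated with a small diff over two `object` schemas. `evolution_problems` is a hypothetical helper; it compares only top-level properties and reports a rename as a removal plus an addition.

```python
def evolution_problems(old: dict, new: dict) -> list[str]:
    """Compare two Lexicon object schemas and list compatibility rule violations."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    old_required = set(old.get("required", []))
    new_required = set(new.get("required", []))
    problems = []

    # New fields must be optional.
    for name in set(new_props) - set(old_props):
        if name in new_required:
            problems.append(f"new field {name!r} must be optional")
    # Non-optional fields cannot be removed.
    for name in old_required - set(new_props):
        problems.append(f"required field {name!r} was removed")
    # Field types cannot change.
    for name in set(old_props) & set(new_props):
        if old_props[name].get("type") != new_props[name].get("type"):
            problems.append(f"field {name!r} changed type")
    return sorted(problems)


old = {"type": "object", "required": ["name"],
       "properties": {"name": {"type": "string"}}}
new = {"type": "object", "required": ["name"],
       "properties": {"name": {"type": "string"}, "note": {"type": "string"}}}
print(evolution_problems(old, new))
```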
206206+207207+## Authority and Control
208208+209209+NSID authority derives from DNS domain control. Domain authorities maintain Lexicon definitions with ultimate responsibility for maintenance and distribution. Protocol implementations should treat data failing Lexicon validation as entirely invalid.
210210+211211+"Unexpected fields in data which otherwise conforms to the Lexicon should be ignored."
212212+213213+## Usage Guidelines
214214+215215+Implementations should support translation to JSON Schema and OpenAPI formats for cross-ecosystem compatibility. Care must be taken when deserializing/reserializing to avoid losing unexpected fields that may represent newer schema versions.
216216+217217+## Record Key Types
218218+219219+The `key` field in record definitions specifies the format of record keys (rkeys). Options:
220220+221221+- `"any"`: Any string matching general record-key syntax
222222+- `"tid"`: Must be a valid timestamp identifier
223223+- `"literal:{value}"`: Fixed literal string (e.g., `"literal:self"` for profile records)
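A sketch of checking an rkey against the three key types above. The two regular expressions are assumptions based on the record-key and TID syntax rules as commonly stated (1 to 512 characters from a restricted ASCII set, and 13 base32-sortable characters respectively); `rkey_matches` is a hypothetical helper.

```python
import re

# Assumed general record-key syntax: restricted ASCII, 1 to 512 characters.
RKEY_RE = re.compile(r"^[A-Za-z0-9._:~-]{1,512}$")
# Assumed TID syntax: 13 base32-sortable characters with a zero top bit.
TID_RE = re.compile(r"^[234567abcdefghij][234567abcdefghijklmnopqrstuvwxyz]{12}$")


def rkey_matches(key_type: str, rkey: str) -> bool:
    """Check an rkey against a Lexicon record `key` type."""
    if rkey in (".", "..") or not RKEY_RE.match(rkey):
        return False
    if key_type == "any":
        return True
    if key_type == "tid":
        return bool(TID_RE.match(rkey))
    if key_type.startswith("literal:"):
        return rkey == key_type[len("literal:"):]
    raise ValueError(f"unsupported key type: {key_type!r}")


print(rkey_matches("literal:self", "self"), rkey_matches("tid", "my-schema-key"))
```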
224224+225225+## Notes on Implementation
226226+227227+- String grapheme counting should follow Unicode extended grapheme cluster boundaries
228228+- Unknown fields should be preserved during serialization/deserialization when possible
229229+- Services should be permissive with format validation but strict with structural requirements
230230+- Breaking schema changes require new NSIDs rather than version updates
+347
.reference/python_atproto_sdk.md
···11+# Python ATProto SDK Reference
22+33+> **Source**: [MarshalX/atproto](https://github.com/MarshalX/atproto) | [Documentation](https://atproto.blue/) | [PyPI](https://pypi.org/project/atproto/)
44+55+## Overview
66+77+The `atproto` package is the community Python SDK for AT Protocol (Bluesky). It provides:
88+99+- Autogenerated models from lexicons with full type hints
1010+- Synchronous and asynchronous XRPC clients
1111+- Firehose data streaming
1212+- Identity resolution (DID/Handle)
1313+- Cryptographic utilities
1414+- **Code generator for custom lexicon schemes**
1515+1616+**Version**: 0.0.65 (Dec 2025)
1717+**Python**: 3.9 - 3.14
1818+**License**: MIT
1919+2020+> Note: Until 1.0.0, compatibility between versions is not guaranteed.
2121+2222+## Installation
2323+2424+```bash
2525+pip install atproto
2626+```
2727+2828+## Package Structure
2929+3030+| Package | Purpose |
3131+|---------|---------|
3232+| `atproto_client` | XRPC client, data models, utilities |
3333+| `atproto_core` | NSID, AT URIs, CID, CAR files, DID documents |
3434+| `atproto_crypto` | Multibase, signature verification, DID keys |
3535+| `atproto_firehose` | Real-time data streaming |
3636+| `atproto_identity` | DID and handle resolution |
3737+| `atproto_lexicon` | Lexicon parsing (parser, models) |
3838+| `atproto_codegen` | Code generator for models/clients from lexicons |
3939+| `atproto_cli` | CLI tool for code generation |
4040+| `atproto_server` | Server-side JWT utilities |
4141+4242+## Authentication
4343+4444+### Basic Login
4545+4646+```python
4747+from atproto import Client
4848+4949+# Synchronous
5050+client = Client()
5151+client.login('handle.bsky.social', 'app-password')
5252+5353+# Asynchronous
5454+from atproto import AsyncClient
5555+client = AsyncClient()
5656+await client.login('handle.bsky.social', 'app-password')
5757+```
5858+5959+### Session Persistence
6060+6161+Sessions can be exported/imported to avoid repeated authentication:
6262+6363+```python
6464+# Export session
6565+session_string = client.export_session_string()
6666+6767+# Import session later
6868+client = Client()
6969+client.login(session_string=session_string)
7070+```
7171+7272+### Custom PDS
7373+7474+```python
7575+client = Client(base_url='https://my-pds.example.com')
7676+```
7777+7878+## Namespaced API Access
7979+8080+The SDK mirrors AT Protocol's NSID structure:
8181+8282+```python
8383+# Core atproto methods
8484+client.com.atproto.server.create_session(...)
8585+client.com.atproto.repo.create_record(...)
8686+client.com.atproto.repo.put_record(...)
8787+client.com.atproto.repo.get_record(...)
8888+client.com.atproto.repo.delete_record(...)
8989+9090+# Bluesky app methods
9191+client.app.bsky.feed.get_timeline(...)
9292+client.app.bsky.actor.get_profile(...)
9393+9494+# Chat methods
9595+client.chat.bsky.convo.send_message(...)
9696+```
9797+9898+## Creating Custom Records
9999+100100+This is the key functionality for atdata's ATProto integration.
101101+102102+### Using com.atproto.repo.createRecord
103103+104104+```python
105105+from atproto import Client
106106+107107+client = Client()
108108+client.login('handle', 'password')
109109+110110+# Create a record with a custom collection
111111+response = client.com.atproto.repo.create_record(
112112+ data={
113113+ 'repo': client.me.did, # Your DID
114114+ 'collection': 'ac.foundation.dataset.sampleSchema', # Custom NSID
115115+ 'record': {
116116+ '$type': 'ac.foundation.dataset.sampleSchema',
117117+ # ... your record fields
118118+ },
119119+ 'validate': False # Skip lexicon validation for custom schemas
120120+ }
121121+)
122122+123123+# Response contains:
124124+# - uri: AT URI for the record (at://did:plc:.../ac.foundation.dataset.sampleSchema/...)
125125+# - cid: Content hash of the record
126126+```
127127+128128+### Using com.atproto.repo.putRecord (Create or Update)
129129+130130+```python
131131+response = client.com.atproto.repo.put_record(
132132+ data={
133133+ 'repo': client.me.did,
134134+ 'collection': 'ac.foundation.dataset.sampleSchema',
135135+ 'rkey': 'my-schema-key', # Explicit record key
136136+ 'record': {
137137+ '$type': 'ac.foundation.dataset.sampleSchema',
138138+ # ... fields
139139+ },
140140+ 'validate': False
141141+ }
142142+)
143143+```
144144+145145+### Getting a Record
146146+147147+```python
148148+response = client.com.atproto.repo.get_record(
149149+ params={
150150+ 'repo': 'did:plc:...',
151151+ 'collection': 'ac.foundation.dataset.sampleSchema',
152152+ 'rkey': 'my-schema-key'
153153+ }
154154+)
155155+# response.value contains the record data
156156+```
157157+158158+### Listing Records in a Collection
159159+160160+```python
161161+response = client.com.atproto.repo.list_records(
162162+ params={
163163+ 'repo': 'did:plc:...',
164164+ 'collection': 'ac.foundation.dataset.sampleSchema',
165165+ 'limit': 100
166166+ }
167167+)
168168+# response.records is a list of records
169169+```
170170+171171+### Deleting a Record
172172+173173+```python
174174+client.com.atproto.repo.delete_record(
175175+ data={
176176+ 'repo': client.me.did,
177177+ 'collection': 'ac.foundation.dataset.sampleSchema',
178178+ 'rkey': 'my-schema-key'
179179+ }
180180+)
181181+```
182182+183183+## Key Insight: PDS is Lexicon-Agnostic
184184+185185+From [GitHub Discussion #3116](https://github.com/bluesky-social/atproto/discussions/3116):
186186+187187+> "You don't need the lexicon to parse a record, only to validate the schema. Validation can be disabled."
188188+189189+The PDS stores any JSON data in any collection without requiring prior knowledge of the schema. This means:
190190+191191+1. We can publish `ac.foundation.dataset.*` records immediately
192192+2. Set `validate: False` to bypass lexicon validation
193193+3. Rate limits and account bans prevent abuse, not schema enforcement
194194+195195+## AT URIs
196196+197197+Records are addressed using AT URIs:
198198+199199+```
200200+at://did:plc:abcd1234/ac.foundation.dataset.sampleSchema/record-key
201201+└──────────────────────┘ └──────────────────────────────────┘ └────────┘
202202+ authority collection rkey
203203+```
204204+205205+### Parsing AT URIs
206206+207207+```python
208208+from atproto_core import AtUri
209209+210210+uri = AtUri.from_str('at://did:plc:abc/com.example.record/key123')
211211+print(uri.hostname) # did:plc:abc
212212+print(uri.collection) # com.example.record
213213+print(uri.rkey) # key123
214214+```
215215+216216+## Core Utilities (atproto_core)
217217+218218+### NSID (Namespaced Identifier)
219219+220220+```python
221221+from atproto_core import NSID
222222+223223+nsid = NSID.from_str('ac.foundation.dataset.sampleSchema')
224224+print(nsid.authority) # dataset.foundation.ac (authority segments in domain order)
225225+print(nsid.name) # sampleSchema
226226+```
227227+228228+### CID (Content Identifier)
229229+230230+```python
231231+from atproto_core import CID
232232+233233+cid = CID.decode('bafyrei...')
234234+print(cid.version)
235235+print(cid.codec)
236236+```
237237+238238+### DID Document
239239+240240+```python
241241+from atproto_core import DidDocument
242242+243243+doc = DidDocument(...)
244244+pds_endpoint = doc.get_pds_endpoint()
245245+handle = doc.get_handle()
246246+```
247247+248248+## Identity Resolution
249249+250250+```python
251251+from atproto import IdResolver
252252+253253+resolver = IdResolver()
254254+255255+# Resolve handle to DID
256256+did = resolver.handle.resolve('handle.bsky.social')
257257+258258+# Resolve DID to DID document
259259+doc = resolver.did.resolve('did:plc:...')
260260+```
261261+262262+## Firehose Streaming
263263+264264+```python
265265+from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message
266266+267267+client = FirehoseSubscribeReposClient()
268268+269269+def on_message(message):
270270+ commit = parse_subscribe_repos_message(message)
271271+ # Process commits...
272272+273273+client.start(on_message)
274274+```
275275+276276+## Blob Upload
277277+278278+```python
279279+# Upload binary data
280280+with open('image.jpg', 'rb') as f:
281281+ upload = client.upload_blob(f.read())
282282+283283+# upload.blob can be used in record fields
284284+```
285285+286286+## Error Handling
287287+288288+```python
289289+from atproto import exceptions
290290+291291+try:
292292+ client.com.atproto.repo.get_record(...)
293293+except exceptions.NetworkError as e: # subclass of AtProtocolError, so catch it first
294294+ print(f"Network error: {e}")
295295+except exceptions.AtProtocolError as e:
296296+ print(f"AT Protocol error: {e}")
297297+```
298298+299299+## Code Generation for Custom Lexicons
300300+301301+The SDK supports generating Python models from custom lexicon schemas:
302302+303303+```bash
304304+# Install CLI
305305+pip install atproto[cli]
306306+307307+# Generate code from lexicons (exact CLI usage TBD)
308308+atproto codegen --lexicons ./my-lexicons --output ./generated
309309+```
310310+311311+The `atproto_codegen` package can generate:
312312+- Data models for records
313313+- Client namespaces for queries/procedures
314314+- Validation functions
315315+316316+## Relevant API Endpoints for atdata
317317+318318+| Endpoint | Purpose |
319319+|----------|---------|
320320+| `com.atproto.repo.createRecord` | Publish new schema/dataset/lens record |
321321+| `com.atproto.repo.putRecord` | Create or update by explicit rkey |
322322+| `com.atproto.repo.getRecord` | Fetch a specific record |
323323+| `com.atproto.repo.listRecords` | List all records in a collection |
324324+| `com.atproto.repo.deleteRecord` | Remove a record |
325325+| `com.atproto.sync.getRepo` | Download full repository (CAR file) |
326326+| `com.atproto.identity.resolveHandle` | Resolve handle to DID |
327327+328328+## Resources
329329+330330+- **Documentation**: https://atproto.blue/
331331+- **GitHub**: https://github.com/MarshalX/atproto
332332+- **Examples**: https://github.com/MarshalX/atproto/tree/main/examples
333333+- **PyPI**: https://pypi.org/project/atproto/
334334+- **Discord**: https://discord.gg/PCyVJXU9jN
335335+336336+## AT Protocol Specification
337337+338338+- **Lexicon Guide**: https://atproto.com/guides/lexicon
339339+- **Application Guide**: https://atproto.com/guides/applications
340340+- **SDK List**: https://atproto.com/sdks
341341+- **API Reference**: https://docs.bsky.app/docs/api/
342342+343343+## Version History
344344+345345+- 0.0.65 (Dec 8, 2025) - Latest
346346+- 0.0.64 (Dec 1, 2025)
347347+- 0.0.63 (Oct 22, 2025)
···11+# Changelog
22+33+All notable changes to this project will be documented in this file.
44+55+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
66+77+## [Unreleased]
88+99+### Added
1010+1111+### Fixed
1212+1313+### Changed
1414+- Investigate test-bucket directory creation issue (#105)
1515+- Add remaining Dataset edge case tests (#104)
1616+- Improve test coverage for edge cases (#103)
1717+- Phase 1: Lexicon Design & Schema Definition (#17)
+48-7
CLAUDE.md
···22222323### Testing
2424```bash
2525+# Always run tests through uv to use the correct virtual environment
2526# Run all tests with coverage
2626-pytest
2727+uv run pytest
27282829# Run specific test file
2929-pytest tests/test_dataset.py
3030-pytest tests/test_lens.py
3030+uv run pytest tests/test_dataset.py
3131+uv run pytest tests/test_lens.py
31323233# Run single test
3333-pytest tests/test_dataset.py::test_create_sample
3434-pytest tests/test_lens.py::test_lens
3434+uv run pytest tests/test_dataset.py::test_create_sample
3535+uv run pytest tests/test_lens.py::test_lens
3536```
36373738### Building
···136137137138**WebDataset Integration**
138139139139-- Uses `wds.ShardWriter` / `wds.TarWriter` for writing
140140+- Uses `wds.writer.ShardWriter` / `wds.writer.TarWriter` for writing
141141+ - **Important:** Always import from `wds.writer` (e.g., `wds.writer.TarWriter`) instead of `wds.TarWriter`
142142+ - This avoids linting issues while remaining functionally equivalent
140143- Dataset iteration via `wds.DataPipeline` with custom `wrap()` / `wrap_batch()` methods
141144- Supports `ordered()` and `shuffled()` iteration modes
142145···146149- Test cases cover both decorator and inheritance syntax
147150- Temporary WebDataset tar files created in `tmp_path` fixture
148151- Tests verify both serialization and batch aggregation behavior
149149-- Lens tests verify well-behavedness (GetPut/PutGet laws)
152152+- Lens tests verify well-behavedness (GetPut/PutGet/PutPut laws)
153153+154154+### Warning Suppression Convention
155155+156156+**Keep warning suppression local to individual tests, not global.**
157157+158158+When tests generate expected warnings (e.g., from third-party library incompatibilities), suppress them using `@pytest.mark.filterwarnings` decorators on each affected test rather than global suppression in `conftest.py`. This:
159159+- Documents which specific tests have known warning behaviors
160160+- Makes it easier to track when warnings appear in unexpected places
161161+- Avoids masking genuine warnings from new code
162162+163163+Example for s3fs/moto async incompatibility warnings:
164164+```python
165165+@pytest.mark.filterwarnings("ignore::pytest.PytestUnraisableExceptionWarning")
166166+@pytest.mark.filterwarnings("ignore:coroutine.*was never awaited:RuntimeWarning")
167167+def test_repo_insert_with_s3(mock_s3, clean_redis):
168168+ ...
169169+```
170170+171171+## Git Workflow
172172+173173+### Committing Changes
174174+175175+When using the `/commit` command or creating commits:
176176+- **Always include `.chainlink/issues.db`** in commits alongside code changes
177177+- This ensures issue tracking history is preserved across sessions
178178+- The issues.db file tracks all chainlink issues, comments, and status changes
179179+180180+### Planning Documents
181181+182182+- **Track `.planning/` directory in git** - Do not ignore planning documents
183183+- Planning documents in `.planning/` should be committed to preserve design history
184184+- This includes architecture notes, implementation plans, and design decisions
185185+186186+### Reference Materials
187187+188188+- **Track `.reference/` directory in git** - Include reference documentation in commits
189189+- The `.reference/` directory contains external specifications and reference materials
190190+- This includes API specs, lexicon definitions, and other reference documentation used for development
+368
examples/atmosphere_demo.py
···11+#!/usr/bin/env python3
22+"""Demonstration of atdata.atmosphere ATProto integration.
33+44+This script demonstrates how to use the atmosphere module to publish
55+and discover datasets on the AT Protocol network.
66+77+Usage:
88+ # Dry run (no actual ATProto connection):
99+ python atmosphere_demo.py
1010+1111+ # With actual ATProto connection:
1212+ python atmosphere_demo.py --handle your.handle.social --password your-app-password
1313+1414+Requirements:
1515+ pip install atdata[atmosphere]
1616+1717+Note:
1818+ Use an app-specific password, not your main Bluesky password.
1919+ Create app passwords at: https://bsky.app/settings/app-passwords
2020+"""
2121+2222+import argparse
2323+import sys
2424+from dataclasses import asdict, fields, is_dataclass
2525+from datetime import datetime
2626+2727+import numpy as np
2828+from numpy.typing import NDArray
2929+3030+import atdata
3131+from atdata.atmosphere import (
3232+ AtmosphereClient,
3333+ SchemaPublisher,
3434+ SchemaLoader,
3535+ DatasetPublisher,
3636+ DatasetLoader,
3737+ AtUri,
3838+)
3939+4040+4141+# =============================================================================
4242+# Define sample types using @packable decorator
4343+# =============================================================================
4444+4545+@atdata.packable
4646+class ImageSample:
4747+ """A sample containing image data with metadata."""
4848+ image: NDArray
4949+ label: str
5050+ confidence: float
5151+5252+5353+@atdata.packable
5454+class TextEmbeddingSample:
5555+ """A sample containing text with embedding vectors."""
5656+ text: str
5757+ embedding: NDArray
5858+ source: str
5959+6060+6161+# =============================================================================
6262+# Demo functions
6363+# =============================================================================
6464+6565+def demo_type_introspection():
6666+ """Demonstrate how atmosphere introspects PackableSample types."""
6767+ print("\n" + "=" * 60)
6868+ print("Type Introspection Demo")
6969+ print("=" * 60)
7070+7171+ # Show what information is available from a PackableSample type
7272+ print(f"\nSample type: {ImageSample.__name__}")
7373+ print(f"Is dataclass: {is_dataclass(ImageSample)}")
7474+7575+ print("\nFields:")
7676+ for field in fields(ImageSample):
7777+ print(f" - {field.name}: {field.type}")
7878+7979+ # Create a sample instance
8080+ sample = ImageSample(
8181+ image=np.random.rand(224, 224, 3).astype(np.float32),
8282+ label="cat",
8383+ confidence=0.95,
8484+ )
8585+8686+ print("\nSample instance:")
8787+ print(f" image shape: {sample.image.shape}")
8888+ print(f" image dtype: {sample.image.dtype}")
8989+ print(f" label: {sample.label}")
9090+ print(f" confidence: {sample.confidence}")
9191+9292+ # Demonstrate serialization
9393+ packed = sample.packed
9494+ print(f"\nSerialized size: {len(packed):,} bytes")
9595+9696+ # Round-trip
9797+ restored = ImageSample.from_bytes(packed)
9898+ print(f"Round-trip successful: {np.allclose(sample.image, restored.image)}")
9999+100100+101101+def demo_at_uri_parsing():
102102+ """Demonstrate AT URI parsing."""
103103+ print("\n" + "=" * 60)
104104+ print("AT URI Parsing Demo")
105105+ print("=" * 60)
106106+107107+ # Example AT URIs
108108+ uris = [
109109+ "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789",
110110+ "at://alice.bsky.social/ac.foundation.dataset.record/my-dataset",
111111+ ]
112112+113113+ for uri_str in uris:
114114+ print(f"\nParsing: {uri_str}")
115115+ uri = AtUri.parse(uri_str)
116116+ print(f" Authority: {uri.authority}")
117117+ print(f" Collection: {uri.collection}")
118118+ print(f" Rkey: {uri.rkey}")
119119+ print(f" Roundtrip: {str(uri)}")
120120+121121+122122+def demo_schema_record_building():
123123+ """Demonstrate building schema records from PackableSample types."""
124124+ print("\n" + "=" * 60)
125125+ print("Schema Record Building Demo")
126126+ print("=" * 60)
127127+128128+ from atdata.atmosphere._types import SchemaRecord, FieldDef, FieldType
129129+130130+ # Build a schema record manually (what SchemaPublisher does internally)
131131+ schema = SchemaRecord(
132132+ name="ImageSample",
133133+ version="1.0.0",
134134+ description="A sample containing image data with metadata",
135135+ fields=[
136136+ FieldDef(
137137+ name="image",
138138+ field_type=FieldType(kind="ndarray", dtype="float32", shape=[224, 224, 3]),
139139+ optional=False,
140140+ ),
141141+ FieldDef(
142142+ name="label",
143143+ field_type=FieldType(kind="primitive", primitive="str"),
144144+ optional=False,
145145+ ),
146146+ FieldDef(
147147+ name="confidence",
148148+ field_type=FieldType(kind="primitive", primitive="float"),
149149+ optional=False,
150150+ ),
151151+ ],
152152+ )
153153+154154+ # Convert to ATProto record format
155155+ record = schema.to_record()
156156+157157+ print("\nSchema record structure:")
158158+ print(f" $type: {record['$type']}")
159159+ print(f" name: {record['name']}")
160160+ print(f" version: {record['version']}")
161161+ print(f" description: {record.get('description', 'N/A')}")
162162+ print(f" fields: {len(record['fields'])} fields")
163163+164164+ for field in record["fields"]:
165165+ print(f" - {field['name']}: {field['fieldType']}")
166166+167167+168168+def demo_mock_client():
169169+ """Demonstrate the AtmosphereClient interface with a mock."""
170170+ print("\n" + "=" * 60)
171171+ print("Mock Client Demo (no network)")
172172+ print("=" * 60)
173173+174174+ from unittest.mock import Mock, MagicMock
175175+176176+ # Create a mock atproto client
177177+ mock_atproto = Mock()
178178+ mock_atproto.me = MagicMock()
179179+ mock_atproto.me.did = "did:plc:demo123456789"
180180+ mock_atproto.me.handle = "demo.bsky.social"
181181+182182+ # Mock the login response
183183+ mock_profile = Mock()
184184+ mock_profile.did = "did:plc:demo123456789"
185185+ mock_profile.handle = "demo.bsky.social"
186186+ mock_atproto.login.return_value = mock_profile
187187+188188+ # Mock create_record response
189189+ mock_response = Mock()
190190+ mock_response.uri = "at://did:plc:demo123456789/ac.foundation.dataset.sampleSchema/abc123"
191191+ mock_atproto.com.atproto.repo.create_record.return_value = mock_response
192192+193193+ # Create our client with the mock
194194+ client = AtmosphereClient(_client=mock_atproto)
195195+ client.login("demo.bsky.social", "fake-password")
196196+197197+ print(f"\nAuthenticated as: {client.handle}")
198198+ print(f"DID: {client.did}")
199199+200200+ # Demonstrate schema publishing with mock
201201+ publisher = SchemaPublisher(client)
202202+ uri = publisher.publish(
203203+ ImageSample,
204204+ name="ImageSample",
205205+ version="1.0.0",
206206+ description="Demo image sample type",
207207+ )
208208+209209+ print(f"\nPublished schema at: {uri}")
210210+ print(f" Authority: {uri.authority}")
211211+ print(f" Collection: {uri.collection}")
212212+ print(f" Rkey: {uri.rkey}")
213213+214214+215215+def demo_live_connection(handle: str, password: str):
216216+ """Demonstrate actual ATProto connection.
217217+218218+ Args:
219219+ handle: Bluesky handle (e.g., 'alice.bsky.social')
220220+ password: App-specific password
221221+ """
222222+ print("\n" + "=" * 60)
223223+ print("Live ATProto Connection Demo")
224224+ print("=" * 60)
225225+226226+ # Create client and authenticate
227227+ print(f"\nConnecting as {handle}...")
228228+ client = AtmosphereClient()
229229+ client.login(handle, password)
231231+ print("Authenticated!")
232232+ print(f" DID: {client.did}")
233233+ print(f" Handle: {client.handle}")
234234+235235+ # Publish a schema
236236+ print("\nPublishing ImageSample schema...")
237237+ schema_publisher = SchemaPublisher(client)
238238+ schema_uri = schema_publisher.publish(
239239+ ImageSample,
240240+ name="ImageSample",
241241+ version="1.0.0",
242242+ description="Demo: Image sample with label and confidence",
243243+ )
244244+ print(f" Schema URI: {schema_uri}")
245245+246246+ # List schemas we've published
247247+ print("\nListing your published schemas...")
248248+ schema_loader = SchemaLoader(client)
249249+ schemas = schema_loader.list_all(limit=10)
250250+ print(f" Found {len(schemas)} schema(s)")
251251+ for schema in schemas:
252252+ print(f" - {schema.get('name', 'Unknown')}: v{schema.get('version', '?')}")
253253+254254+ # Publish a dataset record (pointing to example URLs)
255255+ print("\nPublishing dataset record...")
256256+ dataset_publisher = DatasetPublisher(client)
257257+ dataset_uri = dataset_publisher.publish_with_urls(
258258+ urls=["s3://example-bucket/demo-data-{000000..000009}.tar"],
259259+ schema_uri=str(schema_uri),
260260+ name="Demo Image Dataset",
261261+ description="Example dataset demonstrating atmosphere publishing",
262262+ tags=["demo", "images", "atdata"],
263263+ license="MIT",
264264+ )
265265+ print(f" Dataset URI: {dataset_uri}")
266266+267267+ # List datasets
268268+ print("\nListing your published datasets...")
269269+ dataset_loader = DatasetLoader(client)
270270+ datasets = dataset_loader.list_all(limit=10)
271271+ print(f" Found {len(datasets)} dataset(s)")
272272+ for ds in datasets:
273273+ print(f" - {ds.get('name', 'Unknown')}")
274274+ print(f" Schema: {ds.get('schemaRef', 'N/A')}")
275275+ tags = ds.get('tags', [])
276276+ if tags:
277277+ print(f" Tags: {', '.join(tags)}")
278278+279279+280280+def demo_dataset_loading():
281281+ """Demonstrate loading a dataset from an ATProto record."""
282282+ print("\n" + "=" * 60)
283283+ print("Dataset Loading Demo (conceptual)")
284284+ print("=" * 60)
285285+286286+ print("""
287287+Once you have published a dataset, others can load it like this:
288288+289289+ from atdata.atmosphere import AtmosphereClient, DatasetLoader
290290+291291+ client = AtmosphereClient()
292292+ # Note: reading public records doesn't require authentication
293293+294294+ loader = DatasetLoader(client)
295295+296296+ # Get the dataset record
297297+ record = loader.get("at://did:plc:abc123/ac.foundation.dataset.record/xyz")
298298+299299+ # Get the WebDataset URLs
300300+ urls = loader.get_urls("at://did:plc:abc123/ac.foundation.dataset.record/xyz")
301301+ print(f"Dataset URLs: {urls}")
302302+303303+ # If you have the sample type class, create a Dataset directly
304304+ dataset = loader.to_dataset(
305305+ "at://did:plc:abc123/ac.foundation.dataset.record/xyz",
306306+ sample_type=ImageSample,
307307+ )
308308+309309+ # Now iterate as usual
310310+ for batch in dataset.shuffled(batch_size=32):
311311+ images = batch.image # (32, 224, 224, 3)
312312+ labels = batch.label # list of 32 strings
313313+ process(images, labels)
314314+""")
315315+316316+317317+# =============================================================================
318318+# Main
319319+# =============================================================================
320320+321321+def main():
322322+ parser = argparse.ArgumentParser(
323323+ description="Demonstrate atdata.atmosphere ATProto integration",
324324+ formatter_class=argparse.RawDescriptionHelpFormatter,
325325+ epilog=__doc__,
326326+ )
327327+ parser.add_argument(
328328+ "--handle",
329329+ help="Bluesky handle for live demo (e.g., alice.bsky.social)",
330330+ )
331331+ parser.add_argument(
332332+ "--password",
333333+ help="App-specific password for live demo",
334334+ )
335335+336336+ args = parser.parse_args()
337337+338338+ print("=" * 60)
339339+ print("atdata.atmosphere Demo")
340340+ print("=" * 60)
341341+ print(f"\nTime: {datetime.now().isoformat()}")
342342+ print(f"atdata version: {getattr(atdata, '__version__', 'unknown')}")
343343+344344+ # Always run these demos (no network required)
345345+ demo_type_introspection()
346346+ demo_at_uri_parsing()
347347+ demo_schema_record_building()
348348+ demo_mock_client()
349349+ demo_dataset_loading()
350350+351351+ # Run live demo if credentials provided
352352+ if args.handle and args.password:
353353+ demo_live_connection(args.handle, args.password)
354354+ else:
355355+ print("\n" + "=" * 60)
356356+ print("Live Demo Skipped")
357357+ print("=" * 60)
358358+ print("\nTo run with actual ATProto connection:")
359359+ print(" python atmosphere_demo.py --handle your.handle --password your-app-password")
360360+ print("\nCreate app passwords at: https://bsky.app/settings/app-passwords")
361361+362362+ print("\n" + "=" * 60)
363363+ print("Demo Complete!")
364364+ print("=" * 60)
365365+366366+367367+if __name__ == "__main__":
368368+ main()
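The mock-client demo in this script depends on the real `AtmosphereClient`. For readers who do not have `atdata` installed, the same stubbing pattern can be sketched with only the standard library. `TinyClient` and `make_mock_sdk` are hypothetical names invented for this sketch. They mirror the call shape the demo uses (`login`, then `com.atproto.repo.create_record(data=...)`) but are not part of atdata or the atproto SDK.

```python
from unittest.mock import Mock


class TinyClient:
    """Hypothetical stand-in for a client wrapper; NOT the real AtmosphereClient."""

    def __init__(self, sdk):
        self._sdk = sdk
        self._did = None

    def login(self, handle, password):
        # The wrapped SDK is assumed to return an object exposing `.did`.
        profile = self._sdk.login(handle, password)
        self._did = profile.did

    def create_record(self, collection, record):
        if self._did is None:
            raise ValueError("Not authenticated")
        response = self._sdk.com.atproto.repo.create_record(
            data={"repo": self._did, "collection": collection, "record": record}
        )
        return response.uri


def make_mock_sdk(did="did:plc:demo123456789"):
    """Build a Mock that answers the two calls TinyClient makes."""
    sdk = Mock()
    sdk.login.return_value = Mock(did=did, handle="demo.bsky.social")
    sdk.com.atproto.repo.create_record.return_value = Mock(
        uri=f"at://{did}/ac.foundation.dataset.sampleSchema/abc123"
    )
    return sdk


sdk = make_mock_sdk()
client = TinyClient(sdk)
client.login("demo.bsky.social", "fake-password")

collection = "ac.foundation.dataset.sampleSchema"
uri = client.create_record(collection, {"$type": collection, "name": "ImageSample"})

# Inspect what would have been sent over the network.
sent = sdk.com.atproto.repo.create_record.call_args.kwargs["data"]
print(uri)
print(sent["repo"], sent["collection"])
```

Checking `call_args` on the mock is the useful part: it verifies the payload the wrapper builds without touching a PDS.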
···11+"""ATProto integration for distributed dataset federation.
22+33+This module provides ATProto publishing and discovery capabilities for atdata,
44+enabling a loose federation of distributed, typed datasets on the AT Protocol
55+network.
66+77+Key components:
88+99+- ``AtmosphereClient``: Authentication and session management for ATProto
1010+- ``SchemaPublisher``: Publish PackableSample schemas as ATProto records
1111+- ``DatasetPublisher``: Publish dataset index records with WebDataset URLs
1212+- ``LensPublisher``: Publish lens transformation records
1313+1414+The ATProto integration is additive - existing atdata functionality continues
1515+to work unchanged. These features are opt-in for users who want to publish
1616+or discover datasets on the ATProto network.
1717+1818+Example:
1919+ >>> from atdata.atmosphere import AtmosphereClient, SchemaPublisher
2020+ >>>
2121+ >>> client = AtmosphereClient()
2222+ >>> client.login("handle.bsky.social", "app-password")
2323+ >>>
2424+ >>> publisher = SchemaPublisher(client)
2525+ >>> schema_uri = publisher.publish(MySampleType, version="1.0.0")
2626+2727+Note:
2828+ This module requires the ``atproto`` package to be installed::
2929+3030+ pip install atproto
3131+"""
3232+3333+from .client import AtmosphereClient
3434+from .schema import SchemaPublisher, SchemaLoader
3535+from .records import DatasetPublisher, DatasetLoader
3636+from .lens import LensPublisher, LensLoader
3737+from ._types import (
3838+ AtUri,
3939+ SchemaRecord,
4040+ DatasetRecord,
4141+ LensRecord,
4242+)
4343+4444+__all__ = [
4545+ # Client
4646+ "AtmosphereClient",
4747+ # Schema operations
4848+ "SchemaPublisher",
4949+ "SchemaLoader",
5050+ # Dataset operations
5151+ "DatasetPublisher",
5252+ "DatasetLoader",
5353+ # Lens operations
5454+ "LensPublisher",
5555+ "LensLoader",
5656+ # Types
5757+ "AtUri",
5858+ "SchemaRecord",
5959+ "DatasetRecord",
6060+ "LensRecord",
6161+]
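The module docstring says `atproto` must be installed, yet this package imports `.client` eagerly. A common way to reconcile the two is to defer the third-party import until first use and raise an actionable error at that point. The helper below is a generic sketch of that pattern using only `importlib`. `require_optional` and its cache are hypothetical names, not atdata API.

```python
import importlib
from typing import Any

# Cache of already-imported optional modules, keyed by module name.
_optional_modules: dict[str, Any] = {}


def require_optional(module_name: str, install_hint: str) -> Any:
    """Import an optional dependency on first use, with a helpful error."""
    if module_name not in _optional_modules:
        try:
            _optional_modules[module_name] = importlib.import_module(module_name)
        except ImportError as exc:
            raise ImportError(
                f"The '{module_name}' package is required for this feature. "
                f"Install it with: {install_hint}"
            ) from exc
    return _optional_modules[module_name]


# A stdlib module stands in for the optional dependency here.
json_mod = require_optional("json", "n/a (standard library)")
print(json_mod.dumps({"ok": True}))

try:
    require_optional("surely_not_installed_pkg_xyz", "pip install example")
except ImportError as err:
    message = str(err)
    print(message)
```

With this shape, importing the package stays cheap and users who never touch the optional feature never see the error.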
+329
src/atdata/atmosphere/_types.py
···11+"""Type definitions for ATProto record structures.
22+33+This module defines the data structures used to represent ATProto records
44+for schemas, datasets, and lenses. These types map to the Lexicon definitions
55+in the ``ac.foundation.dataset.*`` namespace.
66+"""
77+88+from dataclasses import dataclass, field
99+from datetime import datetime, timezone
1010+from typing import Optional, Literal, Any
1111+1212+# Lexicon namespace for atdata records
1313+LEXICON_NAMESPACE = "ac.foundation.dataset"
1414+1515+1616+@dataclass
1717+class AtUri:
1818+ """Parsed AT Protocol URI.
1919+2020+ AT URIs follow the format: at://<authority>/<collection>/<rkey>
2121+2222+ Example:
2323+ >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz")
2424+ >>> uri.authority
2525+ 'did:plc:abc123'
2626+ >>> uri.collection
2727+ 'ac.foundation.dataset.sampleSchema'
2828+ >>> uri.rkey
2929+ 'xyz'
3030+ """
3131+3232+ authority: str
3333+ """The DID or handle of the repository owner."""
3434+3535+ collection: str
3636+ """The NSID of the record collection."""
3737+3838+ rkey: str
3939+ """The record key within the collection."""
4040+4141+ @classmethod
4242+ def parse(cls, uri: str) -> "AtUri":
4343+ """Parse an AT URI string into components.
4444+4545+ Args:
4646+ uri: AT URI string in format ``at://<authority>/<collection>/<rkey>``
4747+4848+ Returns:
4949+ Parsed AtUri instance.
5050+5151+ Raises:
5252+ ValueError: If the URI format is invalid.
5353+ """
5454+ if not uri.startswith("at://"):
5555+ raise ValueError(f"Invalid AT URI: must start with 'at://': {uri}")
5656+5757+ parts = uri[5:].split("/")
5858+ if len(parts) < 3:
5959+ raise ValueError(f"Invalid AT URI: expected authority/collection/rkey: {uri}")
6060+6161+ return cls(
6262+ authority=parts[0],
6363+ collection=parts[1],
6464+ rkey="/".join(parts[2:]), # lenient: extra path segments are folded into rkey
6565+ )
6666+6767+ def __str__(self) -> str:
6868+ """Format as AT URI string."""
6969+ return f"at://{self.authority}/{self.collection}/{self.rkey}"
7070+7171+7272+@dataclass
7373+class FieldType:
7474+ """Schema field type definition.
7676+ Represents a type in the schema type system, supporting primitives,
7777+ ndarrays, arrays of another field type, and references to other schemas.
7878+ """
7979+8080+ kind: Literal["primitive", "ndarray", "ref", "array"]
8181+ """The category of type."""
8282+8383+ primitive: Optional[str] = None
8484+ """For kind='primitive': one of 'str', 'int', 'float', 'bool', 'bytes'."""
8585+8686+ dtype: Optional[str] = None
8787+ """For kind='ndarray': numpy dtype string (e.g., 'float32')."""
8888+8989+ shape: Optional[list[int | None]] = None
9090+ """For kind='ndarray': shape constraints (None for any dimension)."""
9191+9292+ ref: Optional[str] = None
9393+ """For kind='ref': AT URI of referenced schema."""
9494+9595+ items: Optional["FieldType"] = None
9696+ """For kind='array': type of array elements."""
9797+9898+9999+@dataclass
100100+class FieldDef:
101101+ """Schema field definition."""
102102+103103+ name: str
104104+ """Field name."""
105105+106106+ field_type: FieldType
107107+ """Type of this field."""
108108+109109+ optional: bool = False
110110+ """Whether this field can be None."""
111111+112112+ description: Optional[str] = None
113113+ """Human-readable description."""
114114+115115+116116+@dataclass
117117+class SchemaRecord:
118118+ """ATProto record for a PackableSample schema.
119119+120120+ Maps to the ``ac.foundation.dataset.sampleSchema`` Lexicon.
121121+ """
122122+123123+ name: str
124124+ """Human-readable schema name."""
125125+126126+ version: str
127127+ """Semantic version string (e.g., '1.0.0')."""
128128+129129+ fields: list[FieldDef]
130130+ """List of field definitions."""
131131+132132+ description: Optional[str] = None
133133+ """Human-readable description."""
134134+135135+ created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
136136+ """When this record was created."""
137137+138138+ metadata: Optional[dict] = None
139139+ """Arbitrary metadata as a dictionary, embedded in the record as-is."""
140140+141141+ def to_record(self) -> dict:
142142+ """Convert to ATProto record dict for publishing."""
143143+ record = {
144144+ "$type": f"{LEXICON_NAMESPACE}.sampleSchema",
145145+ "name": self.name,
146146+ "version": self.version,
147147+ "fields": [self._field_to_dict(f) for f in self.fields],
148148+ "createdAt": self.created_at.isoformat(),
149149+ }
150150+ if self.description:
151151+ record["description"] = self.description
152152+ if self.metadata:
153153+ record["metadata"] = self.metadata
154154+ return record
155155+156156+ def _field_to_dict(self, field_def: FieldDef) -> dict:
157157+ """Convert a field definition to dict."""
158158+ result = {
159159+ "name": field_def.name,
160160+ "fieldType": self._type_to_dict(field_def.field_type),
161161+ "optional": field_def.optional,
162162+ }
163163+ if field_def.description:
164164+ result["description"] = field_def.description
165165+ return result
166166+167167+ def _type_to_dict(self, field_type: FieldType) -> dict:
168168+ """Convert a field type to dict."""
169169+ result: dict = {"$type": f"{LEXICON_NAMESPACE}.schemaType#{field_type.kind}"}
170170+171171+ if field_type.kind == "primitive":
172172+ result["primitive"] = field_type.primitive
173173+ elif field_type.kind == "ndarray":
174174+ result["dtype"] = field_type.dtype
175175+ if field_type.shape:
176176+ result["shape"] = field_type.shape
177177+ elif field_type.kind == "ref":
178178+ result["ref"] = field_type.ref
179179+ elif field_type.kind == "array":
180180+ if field_type.items:
181181+ result["items"] = self._type_to_dict(field_type.items)
182182+183183+ return result
184184+185185+186186+@dataclass
187187+class StorageLocation:
188188+ """Dataset storage location specification."""
189189+190190+ kind: Literal["external", "blobs"]
191191+ """Storage type: external URLs or ATProto blobs."""
192192+193193+ urls: Optional[list[str]] = None
194194+ """For kind='external': WebDataset URLs with brace notation."""
195195+196196+ blob_refs: Optional[list[dict]] = None
197197+ """For kind='blobs': ATProto blob references."""
198198+199199+200200+@dataclass
201201+class DatasetRecord:
202202+ """ATProto record for a dataset index.
203203+204204+ Maps to the ``ac.foundation.dataset.record`` Lexicon.
205205+ """
206206+207207+ name: str
208208+ """Human-readable dataset name."""
209209+210210+ schema_ref: str
211211+ """AT URI of the schema record."""
212212+213213+ storage: StorageLocation
214214+ """Where the dataset data is stored."""
215215+216216+ description: Optional[str] = None
217217+ """Human-readable description."""
218218+219219+ tags: list[str] = field(default_factory=list)
220220+ """Searchable tags."""
221221+222222+ license: Optional[str] = None
223223+ """SPDX license identifier."""
224224+225225+ created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
226226+ """When this record was created."""
227227+228228+ metadata: Optional[bytes] = None
229229+ """Arbitrary metadata as msgpack-encoded bytes."""
230230+231231+ def to_record(self) -> dict:
232232+ """Convert to ATProto record dict for publishing."""
233233+ record = {
234234+ "$type": f"{LEXICON_NAMESPACE}.record",
235235+ "name": self.name,
236236+ "schemaRef": self.schema_ref,
237237+ "storage": self._storage_to_dict(),
238238+ "createdAt": self.created_at.isoformat(),
239239+ }
240240+ if self.description:
241241+ record["description"] = self.description
242242+ if self.tags:
243243+ record["tags"] = self.tags
244244+ if self.license:
245245+ record["license"] = self.license
246246+ if self.metadata:
247247+ record["metadata"] = self.metadata
248248+ return record
249249+250250+ def _storage_to_dict(self) -> dict:
251251+ """Convert storage location to dict."""
252252+ if self.storage.kind == "external":
253253+ return {
254254+ "$type": f"{LEXICON_NAMESPACE}.storageExternal",
255255+ "urls": self.storage.urls or [],
256256+ }
257257+ else:
258258+ return {
259259+ "$type": f"{LEXICON_NAMESPACE}.storageBlobs",
260260+ "blobs": self.storage.blob_refs or [],
261261+ }
262262+263263+264264+@dataclass
265265+class CodeReference:
266266+ """Reference to lens code in a git repository."""
267267+268268+ repository: str
269269+ """Git repository URL."""
270270+271271+ commit: str
272272+ """Git commit hash."""
273273+274274+ path: str
275275+ """Path to the code file/function."""
276276+277277+278278+@dataclass
279279+class LensRecord:
280280+ """ATProto record for a lens transformation.
281281+282282+ Maps to the ``ac.foundation.dataset.lens`` Lexicon.
283283+ """
284284+285285+ name: str
286286+ """Human-readable lens name."""
287287+288288+ source_schema: str
289289+ """AT URI of the source schema."""
290290+291291+ target_schema: str
292292+ """AT URI of the target schema."""
293293+294294+ description: Optional[str] = None
295295+ """What this transformation does."""
296296+297297+ getter_code: Optional[CodeReference] = None
298298+ """Reference to getter function code."""
299299+300300+ putter_code: Optional[CodeReference] = None
301301+ """Reference to putter function code."""
302302+303303+ created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
304304+ """When this record was created."""
305305+306306+ def to_record(self) -> dict:
307307+ """Convert to ATProto record dict for publishing."""
308308+ record: dict[str, Any] = {
309309+ "$type": f"{LEXICON_NAMESPACE}.lens",
310310+ "name": self.name,
311311+ "sourceSchema": self.source_schema,
312312+ "targetSchema": self.target_schema,
313313+ "createdAt": self.created_at.isoformat(),
314314+ }
315315+ if self.description:
316316+ record["description"] = self.description
317317+ if self.getter_code:
318318+ record["getterCode"] = {
319319+ "repository": self.getter_code.repository,
320320+ "commit": self.getter_code.commit,
321321+ "path": self.getter_code.path,
322322+ }
323323+ if self.putter_code:
324324+ record["putterCode"] = {
325325+ "repository": self.putter_code.repository,
326326+ "commit": self.putter_code.commit,
327327+ "path": self.putter_code.path,
328328+ }
329329+ return record
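`AtUri.parse` in this file is lenient: it accepts empty segments (for example `at://a//`) and joins any extra path segments into `rkey`. As far as I read the AT Protocol record-key rules, a record key cannot contain `/`, so a stricter parser may be preferable when validating untrusted input. The following is a sketch under that assumption. `StrictAtUri` and `is_valid_at_uri` are hypothetical names and do not replace the class defined in this file.

```python
from dataclasses import dataclass

AT_PREFIX = "at://"


@dataclass(frozen=True)
class StrictAtUri:
    """Hypothetical stricter variant: exactly three non-empty segments."""

    authority: str
    collection: str
    rkey: str

    @classmethod
    def parse(cls, uri: str) -> "StrictAtUri":
        if not uri.startswith(AT_PREFIX):
            raise ValueError(f"Invalid AT URI: must start with 'at://': {uri}")
        parts = uri[len(AT_PREFIX):].split("/")
        # Reject missing, empty, or surplus segments instead of guessing.
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"Invalid AT URI: expected authority/collection/rkey: {uri}"
            )
        authority, collection, rkey = parts
        return cls(authority, collection, rkey)

    def __str__(self) -> str:
        return f"{AT_PREFIX}{self.authority}/{self.collection}/{self.rkey}"


def is_valid_at_uri(text: str) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        StrictAtUri.parse(text)
    except ValueError:
        return False
    return True


example = "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789"
parsed = StrictAtUri.parse(example)
print(parsed.authority, parsed.collection, parsed.rkey)
print(is_valid_at_uri("at://did:plc:abc123/coll/a/b"))
```

This sketch does not validate the authority as a DID or handle, nor the collection as an NSID; it only tightens the segment count.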
+393
src/atdata/atmosphere/client.py
···11+"""ATProto client wrapper for atdata.
22+33+This module provides the ``AtmosphereClient`` class which wraps the atproto SDK
44+client with atdata-specific helpers for publishing and querying records.
55+"""
66+77+from typing import Optional, Any
88+99+from ._types import AtUri, LEXICON_NAMESPACE
1010+1111+# Lazy import to avoid requiring atproto if not using atmosphere features
1212+_atproto_client_class: Optional[type] = None
1313+1414+1515+def _get_atproto_client_class():
1616+ """Lazily import the atproto Client class."""
1717+ global _atproto_client_class
1818+ if _atproto_client_class is None:
1919+ try:
2020+ from atproto import Client
2121+ _atproto_client_class = Client
2222+ except ImportError as e:
2323+ raise ImportError(
2424+ "The 'atproto' package is required for ATProto integration. "
2525+ "Install it with: pip install atproto"
2626+ ) from e
2727+ return _atproto_client_class
2828+2929+3030+class AtmosphereClient:
3131+ """ATProto client wrapper for atdata operations.
3232+3333+ This class wraps the atproto SDK client and provides higher-level methods
3434+ for working with atdata records (schemas, datasets, lenses).
3535+3636+ Example:
3737+ >>> client = AtmosphereClient()
3838+ >>> client.login("alice.bsky.social", "app-password")
3939+ >>> print(client.did)
4040+ 'did:plc:...'
4141+4242+ Note:
4343+ The password should be an app-specific password, not your main account
4444+ password. Create app passwords in your Bluesky account settings.
4545+ """
4646+4747+ def __init__(
4848+ self,
4949+ base_url: Optional[str] = None,
5050+ *,
5151+ _client: Optional[Any] = None,
5252+ ):
5353+ """Initialize the ATProto client.
5454+5555+ Args:
5656+ base_url: Optional PDS base URL. Defaults to bsky.social.
5757+ _client: Optional pre-configured atproto Client for testing.
5858+ """
5959+ if _client is not None:
6060+ self._client = _client
6161+ else:
6262+ Client = _get_atproto_client_class()
6363+ self._client = Client(base_url=base_url) if base_url else Client()
6464+6565+ self._session: Optional[dict] = None
6666+6767+ def login(self, handle: str, password: str) -> None:
6868+ """Authenticate with the ATProto PDS.
6969+7070+ Args:
7171+ handle: Your Bluesky handle (e.g., 'alice.bsky.social').
7272+ password: App-specific password (not your main password).
7373+7474+ Raises:
7575+ atproto.exceptions.AtProtocolError: If authentication fails.
7676+ """
7777+ profile = self._client.login(handle, password)
7878+ self._session = {
7979+ "did": profile.did,
8080+ "handle": profile.handle,
8181+ }
8282+8383+ def login_with_session(self, session_string: str) -> None:
8484+ """Authenticate using an exported session string.
8585+8686+ This allows reusing a session without re-authenticating, which helps
8787+ avoid rate limits on session creation.
8888+8989+ Args:
9090+ session_string: Session string from ``export_session()``.
9191+ """
9292+ self._client.login(session_string=session_string)
9393+ self._session = {
9494+ "did": self._client.me.did,
9595+ "handle": self._client.me.handle,
9696+ }
9797+9898+ def export_session(self) -> str:
9999+ """Export the current session for later reuse.
100100+101101+ Returns:
102102+ Session string that can be passed to ``login_with_session()``.
103103+104104+ Raises:
105105+ ValueError: If not authenticated.
106106+ """
107107+ if not self.is_authenticated:
108108+ raise ValueError("Not authenticated")
109109+ return self._client.export_session_string()
110110+111111+ @property
112112+ def is_authenticated(self) -> bool:
113113+ """Check if the client has a valid session."""
114114+ return self._session is not None
115115+116116+ @property
117117+ def did(self) -> str:
118118+ """Get the DID of the authenticated user.
119119+120120+ Returns:
121121+ The DID string (e.g., 'did:plc:...').
122122+123123+ Raises:
124124+ ValueError: If not authenticated.
125125+ """
126126+ if not self._session:
127127+ raise ValueError("Not authenticated")
128128+ return self._session["did"]
129129+130130+ @property
131131+ def handle(self) -> str:
132132+ """Get the handle of the authenticated user.
133133+134134+ Returns:
135135+ The handle string (e.g., 'alice.bsky.social').
136136+137137+ Raises:
138138+ ValueError: If not authenticated.
139139+ """
140140+ if not self._session:
141141+ raise ValueError("Not authenticated")
142142+ return self._session["handle"]
143143+144144+ def _ensure_authenticated(self) -> None:
145145+ """Raise if not authenticated."""
146146+ if not self.is_authenticated:
147147+ raise ValueError("Client must be authenticated to perform this operation")
148148+149149+ # Low-level record operations
150150+151151+ def create_record(
152152+ self,
153153+ collection: str,
154154+ record: dict,
155155+ *,
156156+ rkey: Optional[str] = None,
157157+ validate: bool = False,
158158+ ) -> AtUri:
159159+ """Create a record in the user's repository.
160160+161161+ Args:
162162+ collection: The NSID of the record collection
163163+ (e.g., 'ac.foundation.dataset.sampleSchema').
164164+ record: The record data. Must include a '$type' field.
165165+ rkey: Optional explicit record key. If not provided, a TID is generated.
166166+ validate: Whether to validate against the Lexicon schema. Set to False
167167+ for custom lexicons that the PDS doesn't know about.
168168+169169+ Returns:
170170+ The AT URI of the created record.
171171+172172+ Raises:
173173+ ValueError: If not authenticated.
174174+ atproto.exceptions.AtProtocolError: If record creation fails.
175175+ """
176176+ self._ensure_authenticated()
177177+178178+ response = self._client.com.atproto.repo.create_record(
179179+ data={
180180+ "repo": self.did,
181181+ "collection": collection,
182182+ "record": record,
183183+ "rkey": rkey,
184184+ "validate": validate,
185185+ }
186186+ )
187187+188188+ return AtUri.parse(response.uri)
189189+190190+ def put_record(
191191+ self,
192192+ collection: str,
193193+ rkey: str,
194194+ record: dict,
195195+ *,
196196+ validate: bool = False,
197197+ swap_commit: Optional[str] = None,
198198+ ) -> AtUri:
199199+ """Create or update a record at a specific key.
200200+201201+ Args:
202202+ collection: The NSID of the record collection.
203203+ rkey: The record key.
204204+ record: The record data. Must include a '$type' field.
205205+ validate: Whether to validate against the Lexicon schema.
206206+ swap_commit: Optional CID for compare-and-swap update.
207207+208208+ Returns:
209209+ The AT URI of the record.
210210+211211+ Raises:
212212+ ValueError: If not authenticated.
213213+ atproto.exceptions.AtProtocolError: If operation fails.
214214+ """
215215+ self._ensure_authenticated()
216216+217217+ data: dict[str, Any] = {
218218+ "repo": self.did,
219219+ "collection": collection,
220220+ "rkey": rkey,
221221+ "record": record,
222222+ "validate": validate,
223223+ }
224224+ if swap_commit:
225225+ data["swapCommit"] = swap_commit
226226+227227+ response = self._client.com.atproto.repo.put_record(data=data)
228228+229229+ return AtUri.parse(response.uri)
230230+231231+ def get_record(
232232+ self,
233233+ uri: str | AtUri,
234234+ ) -> dict:
235235+ """Fetch a record by AT URI.
236236+237237+ Args:
238238+ uri: The AT URI of the record.
239239+240240+ Returns:
241241+ The record data as a dictionary.
242242+243243+ Raises:
244244+ atproto.exceptions.AtProtocolError: If record not found.
245245+ """
246246+ if isinstance(uri, str):
247247+ uri = AtUri.parse(uri)
248248+249249+ response = self._client.com.atproto.repo.get_record(
250250+ params={
251251+ "repo": uri.authority,
252252+ "collection": uri.collection,
253253+ "rkey": uri.rkey,
254254+ }
255255+ )
256256+257257+ return response.value
258258+259259+ def delete_record(
260260+ self,
261261+ uri: str | AtUri,
262262+ *,
263263+ swap_commit: Optional[str] = None,
264264+ ) -> None:
265265+ """Delete a record.
266266+267267+ Args:
268268+ uri: The AT URI of the record to delete.
269269+ swap_commit: Optional CID for compare-and-swap delete.
270270+271271+ Raises:
272272+ ValueError: If not authenticated.
273273+ atproto.exceptions.AtProtocolError: If deletion fails.
274274+ """
275275+ self._ensure_authenticated()
276276+277277+ if isinstance(uri, str):
278278+ uri = AtUri.parse(uri)
279279+280280+ data: dict[str, Any] = {
281281+ "repo": self.did,
282282+ "collection": uri.collection,
283283+ "rkey": uri.rkey,
284284+ }
285285+ if swap_commit:
286286+ data["swapCommit"] = swap_commit
287287+288288+ self._client.com.atproto.repo.delete_record(data=data)
289289+290290+ def list_records(
291291+ self,
292292+ collection: str,
293293+ *,
294294+ repo: Optional[str] = None,
295295+ limit: int = 100,
296296+ cursor: Optional[str] = None,
297297+ ) -> tuple[list[dict], Optional[str]]:
298298+ """List records in a collection.
299299+300300+ Args:
301301+ collection: The NSID of the record collection.
302302+ repo: The DID of the repository to query. Defaults to the
303303+ authenticated user's repository.
304304+ limit: Maximum number of records to return (default 100).
305305+ cursor: Pagination cursor from a previous call.
306306+307307+ Returns:
308308+ A tuple of (records, next_cursor). The cursor is None if there
309309+ are no more records.
310310+311311+ Raises:
312312+ ValueError: If repo is None and not authenticated.
313313+ """
314314+ if repo is None:
315315+ self._ensure_authenticated()
316316+ repo = self.did
317317+318318+ response = self._client.com.atproto.repo.list_records(
319319+ params={
320320+ "repo": repo,
321321+ "collection": collection,
322322+ "limit": limit,
323323+ "cursor": cursor,
324324+ }
325325+ )
326326+327327+ records = [r.value for r in response.records]
328328+ return records, response.cursor
329329+330330+ # Convenience methods for atdata collections
331331+332332+ def list_schemas(
333333+ self,
334334+ repo: Optional[str] = None,
335335+ limit: int = 100,
336336+ ) -> list[dict]:
337337+ """List schema records.
338338+339339+ Args:
340340+ repo: The DID to query. Defaults to authenticated user.
341341+ limit: Maximum number to return.
342342+343343+ Returns:
344344+ List of schema records.
345345+ """
346346+ records, _ = self.list_records(
347347+ f"{LEXICON_NAMESPACE}.sampleSchema",
348348+ repo=repo,
349349+ limit=limit,
350350+ )
351351+ return records
352352+353353+ def list_datasets(
354354+ self,
355355+ repo: Optional[str] = None,
356356+ limit: int = 100,
357357+ ) -> list[dict]:
358358+ """List dataset records.
359359+360360+ Args:
361361+ repo: The DID to query. Defaults to authenticated user.
362362+ limit: Maximum number to return.
363363+364364+ Returns:
365365+ List of dataset records.
366366+ """
367367+ records, _ = self.list_records(
368368+ f"{LEXICON_NAMESPACE}.record",
369369+ repo=repo,
370370+ limit=limit,
371371+ )
372372+ return records
373373+374374+ def list_lenses(
375375+ self,
376376+ repo: Optional[str] = None,
377377+ limit: int = 100,
378378+ ) -> list[dict]:
379379+ """List lens records.
380380+381381+ Args:
382382+ repo: The DID to query. Defaults to authenticated user.
383383+ limit: Maximum number to return.
384384+385385+ Returns:
386386+ List of lens records.
387387+ """
388388+ records, _ = self.list_records(
389389+ f"{LEXICON_NAMESPACE}.lens",
390390+ repo=repo,
391391+ limit=limit,
392392+ )
393393+ return records
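The convenience listers above (`list_schemas`, `list_datasets`, `list_lenses`) discard the cursor returned by `list_records`, so they only see the first page of a collection. A minimal sketch of walking every page follows. It assumes only the `(records, next_cursor)` return shape shown in `list_records`; `iter_all_records` and its `list_page` parameter are hypothetical names and are not part of this diff.

```python
from typing import Callable, Iterator, Optional

# Shape of AtmosphereClient.list_records as shown in the diff:
# (collection, *, repo, limit, cursor) -> (records, next_cursor)
ListPage = Callable[..., tuple[list[dict], Optional[str]]]


def iter_all_records(
    list_page: ListPage,
    collection: str,
    *,
    repo: Optional[str] = None,
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield every record in a collection by following pagination cursors."""
    cursor: Optional[str] = None
    while True:
        records, cursor = list_page(
            collection, repo=repo, limit=page_size, cursor=cursor
        )
        yield from records
        # Stop when there is no further cursor, or the page came back empty.
        if not cursor or not records:
            break
```

With a real client this could be called as `iter_all_records(client.list_records, f"{LEXICON_NAMESPACE}.lens")`, which keeps memory use flat because records are yielded page by page.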
+280
src/atdata/atmosphere/lens.py
···11+"""Lens transformation publishing for ATProto.
22+33+This module provides classes for publishing Lens transformation records to
44+ATProto. Lenses are published as ``ac.foundation.dataset.lens`` records.
55+66+Note:
77+ For security reasons, lens code is stored as references to git repositories
88+ rather than inline code. Users must manually install and import lens
99+ implementations.
1010+"""
1111+1212+from typing import Optional
1313+1414+from .client import AtmosphereClient
1515+from ._types import (
1616+ AtUri,
1717+ LensRecord,
1818+ CodeReference,
1919+ LEXICON_NAMESPACE,
2020+)
2121+2222+# Import for type checking only
2323+from typing import TYPE_CHECKING
2424+if TYPE_CHECKING:
2525+ from ..lens import Lens
2626+2727+2828+class LensPublisher:
2929+ """Publishes Lens transformation records to ATProto.
3030+3131+ This class creates lens records that reference source and target schemas
3232+ and point to the transformation code in a git repository.
3333+3434+ Example:
3535+ >>> @atdata.lens
3636+ ... def my_lens(source: SourceType) -> TargetType:
3737+ ... return TargetType(field=source.other_field)
3838+ >>>
3939+ >>> client = AtmosphereClient()
4040+ >>> client.login("handle", "password")
4141+ >>>
4242+ >>> publisher = LensPublisher(client)
4343+ >>> uri = publisher.publish(
4444+ ... name="my_lens",
4545+ ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source",
4646+ ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target",
4747+ ... code_repository="https://github.com/user/repo",
4848+ ... code_commit="abc123def456",
4949+ ... getter_path="mymodule.lenses:my_lens",
5050+ ... putter_path="mymodule.lenses:my_lens_putter",
5151+ ... )
5252+5353+ Security Note:
5454+ Lens code is stored as references to git repositories rather than
5555+ inline code. This prevents arbitrary code execution from ATProto
5656+ records. Users must manually install and trust lens implementations.
5757+ """
5858+5959+ def __init__(self, client: AtmosphereClient):
6060+ """Initialize the lens publisher.
6161+6262+ Args:
6363+ client: Authenticated AtmosphereClient instance.
6464+ """
6565+ self.client = client
6666+6767+ def publish(
6868+ self,
6969+ *,
7070+ name: str,
7171+ source_schema_uri: str,
7272+ target_schema_uri: str,
7373+ description: Optional[str] = None,
7474+ code_repository: Optional[str] = None,
7575+ code_commit: Optional[str] = None,
7676+ getter_path: Optional[str] = None,
7777+ putter_path: Optional[str] = None,
7878+ rkey: Optional[str] = None,
7979+ ) -> AtUri:
8080+ """Publish a lens transformation record to ATProto.
8181+8282+ Args:
8383+ name: Human-readable lens name.
8484+ source_schema_uri: AT URI of the source schema.
8585+ target_schema_uri: AT URI of the target schema.
8686+ description: What this transformation does.
8787+ code_repository: Git repository URL containing the lens code.
8888+ code_commit: Git commit hash for reproducibility.
8989+ getter_path: Module path to the getter function
9090+ (e.g., 'mymodule.lenses:my_getter').
9191+ putter_path: Module path to the putter function
9292+ (e.g., 'mymodule.lenses:my_putter').
9393+ rkey: Optional explicit record key.
9494+9595+ Returns:
9696+ The AT URI of the created lens record.
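9797+9898+ Note:
9999+ Code references are only recorded when both code_repository and code_commit are given; otherwise getter_path and putter_path are ignored and no error is raised.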
100100+ """
101101+ # Build code references if provided
102102+ getter_code: Optional[CodeReference] = None
103103+ putter_code: Optional[CodeReference] = None
104104+105105+ if code_repository and code_commit:
106106+ if getter_path:
107107+ getter_code = CodeReference(
108108+ repository=code_repository,
109109+ commit=code_commit,
110110+ path=getter_path,
111111+ )
112112+ if putter_path:
113113+ putter_code = CodeReference(
114114+ repository=code_repository,
115115+ commit=code_commit,
116116+ path=putter_path,
117117+ )
118118+119119+ lens_record = LensRecord(
120120+ name=name,
121121+ source_schema=source_schema_uri,
122122+ target_schema=target_schema_uri,
123123+ description=description,
124124+ getter_code=getter_code,
125125+ putter_code=putter_code,
126126+ )
127127+128128+ return self.client.create_record(
129129+ collection=f"{LEXICON_NAMESPACE}.lens",
130130+ record=lens_record.to_record(),
131131+ rkey=rkey,
132132+ validate=False,
133133+ )
134134+135135+ def publish_from_lens(
136136+ self,
137137+ lens_obj: "Lens",
138138+ *,
139139+ name: str,
140140+ source_schema_uri: str,
141141+ target_schema_uri: str,
142142+ code_repository: str,
143143+ code_commit: str,
144144+ description: Optional[str] = None,
145145+ rkey: Optional[str] = None,
146146+ ) -> AtUri:
147147+ """Publish a lens record from an existing Lens object.
148148+149149+ This method extracts the getter and putter function names from
150150+ the Lens object and publishes a record referencing them.
151151+152152+ Args:
153153+ lens_obj: The Lens object to publish.
154154+ name: Human-readable lens name.
155155+ source_schema_uri: AT URI of the source schema.
156156+ target_schema_uri: AT URI of the target schema.
157157+ code_repository: Git repository URL.
158158+ code_commit: Git commit hash.
159159+ description: What this transformation does.
160160+ rkey: Optional explicit record key.
161161+162162+ Returns:
163163+ The AT URI of the created lens record.
164164+ """
165165+ # Extract function names from the lens
166166+ getter_name = lens_obj._getter.__name__
167167+ putter_name = lens_obj._putter.__name__
168168+169169+ # Get module info if available
170170+ getter_module = getattr(lens_obj._getter, "__module__", "")
171171+ putter_module = getattr(lens_obj._putter, "__module__", "")
172172+173173+ getter_path = f"{getter_module}:{getter_name}" if getter_module else getter_name
174174+ putter_path = f"{putter_module}:{putter_name}" if putter_module else putter_name
175175+176176+ return self.publish(
177177+ name=name,
178178+ source_schema_uri=source_schema_uri,
179179+ target_schema_uri=target_schema_uri,
180180+ description=description,
181181+ code_repository=code_repository,
182182+ code_commit=code_commit,
183183+ getter_path=getter_path,
184184+ putter_path=putter_path,
185185+ rkey=rkey,
186186+ )
187187+188188+189189+class LensLoader:
190190+ """Loads lens records from ATProto.
191191+192192+ This class fetches lens transformation records. Note that actually
193193+ using a lens requires installing the referenced code and importing
194194+ it manually.
195195+196196+ Example:
197197+ >>> client = AtmosphereClient()
198198+ >>> loader = LensLoader(client)
199199+ >>>
200200+ >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz")
201201+ >>> print(record["name"])
202202+ >>> print(record["sourceSchema"])
203203+ >>> print(record.get("getterCode", {}).get("repository"))
204204+ """
205205+206206+ def __init__(self, client: AtmosphereClient):
207207+ """Initialize the lens loader.
208208+209209+ Args:
210210+ client: AtmosphereClient instance.
211211+ """
212212+ self.client = client
213213+214214+ def get(self, uri: str | AtUri) -> dict:
215215+ """Fetch a lens record by AT URI.
216216+217217+ Args:
218218+ uri: The AT URI of the lens record.
219219+220220+ Returns:
221221+ The lens record as a dictionary.
222222+223223+ Raises:
224224+ ValueError: If the record is not a lens record.
225225+ """
226226+ record = self.client.get_record(uri)
227227+228228+ expected_type = f"{LEXICON_NAMESPACE}.lens"
229229+ if record.get("$type") != expected_type:
230230+ raise ValueError(
231231+ f"Record at {uri} is not a lens record. "
232232+ f"Expected $type='{expected_type}', got '{record.get('$type')}'"
233233+ )
234234+235235+ return record
236236+237237+ def list_all(
238238+ self,
239239+ repo: Optional[str] = None,
240240+ limit: int = 100,
241241+ ) -> list[dict]:
242242+ """List lens records from a repository.
243243+244244+ Args:
245245+ repo: The DID of the repository. Defaults to authenticated user.
246246+ limit: Maximum number of records to return.
247247+248248+ Returns:
249249+ List of lens records.
250250+ """
251251+ return self.client.list_lenses(repo=repo, limit=limit)
252252+253253+ def find_by_schemas(
254254+ self,
255255+ source_schema_uri: str,
256256+ target_schema_uri: Optional[str] = None,
257257+ repo: Optional[str] = None,
258258+ ) -> list[dict]:
259259+ """Find lenses that transform between specific schemas.
260260+261261+ Args:
262262+ source_schema_uri: AT URI of the source schema.
263263+ target_schema_uri: Optional AT URI of the target schema.
264264+ If not provided, returns all lenses from the source.
265265+ repo: The DID of the repository to search.
266266+267267+ Returns:
268268+ List of matching lens records.
269269+ """
270270+ all_lenses = self.list_all(repo=repo, limit=100) # com.atproto.repo.listRecords caps limit at 100; only the first page is searched
271271+272272+ matches = []
273273+ for lens_record in all_lenses:
274274+ if lens_record.get("sourceSchema") == source_schema_uri:
275275+ if target_schema_uri is None:
276276+ matches.append(lens_record)
277277+ elif lens_record.get("targetSchema") == target_schema_uri:
278278+ matches.append(lens_record)
279279+280280+ return matches
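The lens records above store `getter_path` and `putter_path` as `module:attribute` strings (for example `mymodule.lenses:my_lens`). Once the referenced repository has been reviewed and installed at the pinned commit, such a path could be resolved to a callable roughly as sketched below. `resolve_code_path` is a hypothetical helper, not part of this diff, and it only imports modules that are already installed locally; it does not fetch anything.

```python
import importlib
from typing import Callable


def resolve_code_path(path: str) -> Callable:
    """Resolve a 'package.module:attr' string to an already-installed callable."""
    module_name, sep, attr_path = path.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"Expected 'module:attribute', got {path!r}")

    # Import the module by name; this fails if the code is not installed.
    obj = importlib.import_module(module_name)

    # Walk dotted attributes so 'module:Class.method' also works.
    for part in attr_path.split("."):
        obj = getattr(obj, part)

    if not callable(obj):
        raise TypeError(f"{path!r} does not refer to a callable")
    return obj
```

For a fetched record, the getter could then be looked up with `resolve_code_path(record["getterCode"]["path"])`, assuming the record carries a `path` field as built by `CodeReference`.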
+342
src/atdata/atmosphere/records.py
···11+"""Dataset record publishing and loading for ATProto.
22+33+This module provides classes for publishing dataset index records to ATProto
44+and loading them back. Dataset records are published as
55+``ac.foundation.dataset.record`` records.
66+"""
77+88+from typing import Type, TypeVar, Optional
99+import msgpack
1010+1111+from .client import AtmosphereClient
1212+from .schema import SchemaPublisher
1313+from ._types import (
1414+ AtUri,
1515+ DatasetRecord,
1616+ StorageLocation,
1717+ LEXICON_NAMESPACE,
1818+)
1919+2020+# Import for type checking only to avoid circular imports
2121+from typing import TYPE_CHECKING
2222+if TYPE_CHECKING:
2323+ from ..dataset import PackableSample, Dataset
2424+2525+ST = TypeVar("ST", bound="PackableSample")
2626+2727+2828+class DatasetPublisher:
2929+ """Publishes dataset index records to ATProto.
3030+3131+ This class creates dataset records that reference a schema and point to
3232+ external storage (WebDataset URLs) or ATProto blobs.
3333+3434+ Example:
3535+ >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar")
3636+ >>>
3737+ >>> client = AtmosphereClient()
3838+ >>> client.login("handle", "password")
3939+ >>>
4040+ >>> publisher = DatasetPublisher(client)
4141+ >>> uri = publisher.publish(
4242+ ... dataset,
4343+ ... name="My Training Data",
4444+ ... description="Training data for my model",
4545+ ... tags=["computer-vision", "training"],
4646+ ... )
4747+ """
4848+4949+ def __init__(self, client: AtmosphereClient):
5050+ """Initialize the dataset publisher.
5151+5252+ Args:
5353+ client: Authenticated AtmosphereClient instance.
5454+ """
5555+ self.client = client
5656+ self._schema_publisher = SchemaPublisher(client)
5757+5858+ def publish(
5959+ self,
6060+ dataset: "Dataset[ST]",
6161+ *,
6262+ name: str,
6363+ schema_uri: Optional[str] = None,
6464+ description: Optional[str] = None,
6565+ tags: Optional[list[str]] = None,
6666+ license: Optional[str] = None,
6767+ auto_publish_schema: bool = True,
6868+ schema_version: str = "1.0.0",
6969+ rkey: Optional[str] = None,
7070+ ) -> AtUri:
7171+ """Publish a dataset index record to ATProto.
7272+7373+ Args:
7474+ dataset: The Dataset to publish.
7575+ name: Human-readable dataset name.
7676+ schema_uri: AT URI of the schema record. If not provided and
7777+ auto_publish_schema is True, the schema will be published.
7878+ description: Human-readable description.
7979+ tags: Searchable tags for discovery.
8080+ license: SPDX license identifier (e.g., 'MIT', 'Apache-2.0').
8181+ auto_publish_schema: If True and schema_uri not provided,
8282+ automatically publish the schema first.
8383+ schema_version: Version for auto-published schema.
8484+ rkey: Optional explicit record key.
8585+8686+ Returns:
8787+ The AT URI of the created dataset record.
8888+8989+ Raises:
9090+ ValueError: If schema_uri is not provided and auto_publish_schema is False.
9191+ """
9292+ # Ensure we have a schema reference
9393+ if schema_uri is None:
9494+ if not auto_publish_schema:
9595+ raise ValueError(
9696+ "schema_uri is required when auto_publish_schema=False"
9797+ )
9898+ # Auto-publish the schema
9999+ schema_uri_obj = self._schema_publisher.publish(
100100+ dataset.sample_type,
101101+ version=schema_version,
102102+ )
103103+ schema_uri = str(schema_uri_obj)
104104+105105+ # Build the storage location
106106+ storage = StorageLocation(
107107+ kind="external",
108108+ urls=[dataset.url],
109109+ )
110110+111111+ # Build dataset record
112112+ metadata_bytes: Optional[bytes] = None
113113+ if dataset.metadata is not None:
114114+ metadata_bytes = msgpack.packb(dataset.metadata)
115115+116116+ dataset_record = DatasetRecord(
117117+ name=name,
118118+ schema_ref=schema_uri,
119119+ storage=storage,
120120+ description=description,
121121+ tags=tags or [],
122122+ license=license,
123123+ metadata=metadata_bytes,
124124+ )
125125+126126+ # Publish to ATProto
127127+ return self.client.create_record(
128128+ collection=f"{LEXICON_NAMESPACE}.record",
129129+ record=dataset_record.to_record(),
130130+ rkey=rkey,
131131+ validate=False,
132132+ )
133133+134134+ def publish_with_urls(
135135+ self,
136136+ urls: list[str],
137137+ schema_uri: str,
138138+ *,
139139+ name: str,
140140+ description: Optional[str] = None,
141141+ tags: Optional[list[str]] = None,
142142+ license: Optional[str] = None,
143143+ metadata: Optional[dict] = None,
144144+ rkey: Optional[str] = None,
145145+ ) -> AtUri:
146146+ """Publish a dataset record with explicit URLs.
147147+148148+ This method allows publishing a dataset record without having a
149149+ Dataset object, useful for registering existing WebDataset files.
150150+151151+ Args:
152152+ urls: List of WebDataset URLs with brace notation.
153153+ schema_uri: AT URI of the schema record.
154154+ name: Human-readable dataset name.
155155+ description: Human-readable description.
156156+ tags: Searchable tags for discovery.
157157+ license: SPDX license identifier.
158158+ metadata: Arbitrary metadata dictionary.
159159+ rkey: Optional explicit record key.
160160+161161+ Returns:
162162+ The AT URI of the created dataset record.
163163+ """
164164+ storage = StorageLocation(
165165+ kind="external",
166166+ urls=urls,
167167+ )
168168+169169+ metadata_bytes: Optional[bytes] = None
170170+ if metadata is not None:
171171+ metadata_bytes = msgpack.packb(metadata)
172172+173173+ dataset_record = DatasetRecord(
174174+ name=name,
175175+ schema_ref=schema_uri,
176176+ storage=storage,
177177+ description=description,
178178+ tags=tags or [],
179179+ license=license,
180180+ metadata=metadata_bytes,
181181+ )
182182+183183+ return self.client.create_record(
184184+ collection=f"{LEXICON_NAMESPACE}.record",
185185+ record=dataset_record.to_record(),
186186+ rkey=rkey,
187187+ validate=False,
188188+ )
189189+190190+191191+class DatasetLoader:
192192+ """Loads dataset records from ATProto.
193193+194194+ This class fetches dataset index records and can create Dataset objects
195195+ from them. Note that loading a dataset requires having the corresponding
196196+ Python class for the sample type.
197197+198198+ Example:
199199+ >>> client = AtmosphereClient()
200200+ >>> loader = DatasetLoader(client)
201201+ >>>
202202+ >>> # List available datasets
203203+ >>> datasets = loader.list_all()
204204+ >>> for ds in datasets:
205205+ ... print(ds["name"], ds["schemaRef"])
206206+ >>>
207207+ >>> # Get a specific dataset record
208208+ >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz")
209209+ """
210210+211211+ def __init__(self, client: AtmosphereClient):
212212+ """Initialize the dataset loader.
213213+214214+ Args:
215215+ client: AtmosphereClient instance.
216216+ """
217217+ self.client = client
218218+219219+ def get(self, uri: str | AtUri) -> dict:
220220+ """Fetch a dataset record by AT URI.
221221+222222+ Args:
223223+ uri: The AT URI of the dataset record.
224224+225225+ Returns:
226226+ The dataset record as a dictionary.
227227+228228+ Raises:
229229+ ValueError: If the record is not a dataset record.
230230+ """
231231+ record = self.client.get_record(uri)
232232+233233+ expected_type = f"{LEXICON_NAMESPACE}.record"
234234+ if record.get("$type") != expected_type:
235235+ raise ValueError(
236236+ f"Record at {uri} is not a dataset record. "
237237+ f"Expected $type='{expected_type}', got '{record.get('$type')}'"
238238+ )
239239+240240+ return record
241241+242242+ def list_all(
243243+ self,
244244+ repo: Optional[str] = None,
245245+ limit: int = 100,
246246+ ) -> list[dict]:
247247+ """List dataset records from a repository.
248248+249249+ Args:
250250+ repo: The DID of the repository. Defaults to authenticated user.
251251+ limit: Maximum number of records to return.
252252+253253+ Returns:
254254+ List of dataset records.
255255+ """
256256+ return self.client.list_datasets(repo=repo, limit=limit)
257257+258258+ def get_urls(self, uri: str | AtUri) -> list[str]:
259259+ """Get the WebDataset URLs from a dataset record.
260260+261261+ Args:
262262+ uri: The AT URI of the dataset record.
263263+264264+ Returns:
265265+ List of WebDataset URLs.
266266+267267+ Raises:
268268+ ValueError: If the storage type is not external URLs.
269269+ """
270270+ record = self.get(uri)
271271+ storage = record.get("storage", {})
272272+273273+ storage_type = storage.get("$type", "")
274274+ if "storageExternal" in storage_type:
275275+ return storage.get("urls", [])
276276+ elif "storageBlobs" in storage_type:
277277+ raise ValueError(
278278+ "Dataset uses blob storage, not external URLs. "
279279+ "Blob-backed datasets are not supported by this loader yet."
280280+ )
281281+ else:
282282+ raise ValueError(f"Unknown storage type: {storage_type}")
283283+284284+ def get_metadata(self, uri: str | AtUri) -> Optional[dict]:
285285+ """Get the metadata from a dataset record.
286286+287287+ Args:
288288+ uri: The AT URI of the dataset record.
289289+290290+ Returns:
291291+ The metadata dictionary, or None if no metadata.
292292+ """
293293+ record = self.get(uri)
294294+ metadata_bytes = record.get("metadata")
295295+296296+ if metadata_bytes is None:
297297+ return None
298298+299299+ return msgpack.unpackb(metadata_bytes, raw=False)
300300+301301+ def to_dataset(
302302+ self,
303303+ uri: str | AtUri,
304304+ sample_type: Type[ST],
305305+ ) -> "Dataset[ST]":
306306+ """Create a Dataset object from an ATProto record.
307307+308308+ This method creates a Dataset instance from a published record.
309309+ You must provide the sample type class, which should match the
310310+ schema referenced by the record.
311311+312312+ Args:
313313+ uri: The AT URI of the dataset record.
314314+ sample_type: The Python class for the sample type.
315315+316316+ Returns:
317317+ A Dataset instance configured from the record.
318318+319319+ Raises:
320320+ ValueError: If the storage type is not external URLs, or the record lists no URLs.
321321+322322+ Example:
323323+ >>> loader = DatasetLoader(client)
324324+ >>> dataset = loader.to_dataset(uri, MySampleType)
325325+ >>> for batch in dataset.shuffled(batch_size=32):
326326+ ... process(batch)
327327+ """
328328+ # Import here to avoid circular import
329329+ from ..dataset import Dataset
330330+331331+ urls = self.get_urls(uri)
332332+ if not urls:
333333+ raise ValueError("Dataset record has no URLs")
334334+335335+ # Use the first URL (multi-URL support could be added later)
336336+ url = urls[0]
337337+338338+ # Get metadata URL if available
339339+ record = self.get(uri)
340340+ metadata_url = record.get("metadataUrl")
341341+342342+ return Dataset[sample_type](url, metadata_url=metadata_url)
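Dataset records store WebDataset URLs in brace notation, such as `s3://bucket/data-{000000..000009}.tar`. To illustrate what one such URL stands for, here is a small sketch that expands a single zero-padded numeric range into concrete shard URLs. `expand_shard_range` is a hypothetical helper and handles only one `{start..end}` group; nested or multiple groups are out of scope.

```python
import re

# Matches a single numeric range such as {000000..000009}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shard_range(url: str) -> list[str]:
    """Expand one '{start..end}' numeric range into individual shard URLs."""
    match = _RANGE.search(url)
    if match is None:
        # No brace range: the URL already names a single shard.
        return [url]

    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve zero padding from the start value
    prefix = url[: match.start()]
    suffix = url[match.end():]
    return [
        f"{prefix}{index:0{width}d}{suffix}"
        for index in range(int(start), int(end) + 1)
    ]
```

A helper like this can be handy for sanity checks, for instance counting shards before publishing a record with `publish_with_urls`.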
+296
src/atdata/atmosphere/schema.py
···11+"""Schema publishing and loading for ATProto.
22+33+This module provides classes for publishing PackableSample schemas to ATProto
44+and loading them back. Schemas are published as ``ac.foundation.dataset.sampleSchema``
55+records.
66+"""
77+88+from dataclasses import fields, is_dataclass
99+from typing import Type, TypeVar, Optional, Union, get_type_hints, get_origin, get_args
1010+import types
1111+1212+from .client import AtmosphereClient
1313+from ._types import (
1414+ AtUri,
1515+ SchemaRecord,
1616+ FieldDef,
1717+ FieldType,
1818+ LEXICON_NAMESPACE,
1919+)
2020+2121+# Import for type checking only to avoid circular imports
2222+from typing import TYPE_CHECKING
2323+if TYPE_CHECKING:
2424+ from ..dataset import PackableSample
2525+2626+ST = TypeVar("ST", bound="PackableSample")
2727+2828+2929+class SchemaPublisher:
3030+ """Publishes PackableSample schemas to ATProto.
3131+3232+ This class introspects a PackableSample class to extract its field
3333+ definitions and publishes them as an ATProto schema record.
3434+3535+ Example:
3636+ >>> @atdata.packable
3737+ ... class MySample:
3838+ ... image: NDArray
3939+ ... label: str
4040+ ...
4141+ >>> client = AtmosphereClient()
4242+ >>> client.login("handle", "password")
4343+ >>>
4444+ >>> publisher = SchemaPublisher(client)
4545+ >>> uri = publisher.publish(MySample, version="1.0.0")
4646+ >>> print(uri)
4747+ at://did:plc:.../ac.foundation.dataset.sampleSchema/...
4848+ """
4949+5050+ def __init__(self, client: AtmosphereClient):
5151+ """Initialize the schema publisher.
5252+5353+ Args:
5454+ client: Authenticated AtmosphereClient instance.
5555+ """
5656+ self.client = client
5757+5858+ def publish(
5959+ self,
6060+ sample_type: Type[ST],
6161+ *,
6262+ name: Optional[str] = None,
6363+ version: str = "1.0.0",
6464+ description: Optional[str] = None,
6565+ metadata: Optional[dict] = None,
6666+ rkey: Optional[str] = None,
6767+ ) -> AtUri:
6868+ """Publish a PackableSample schema to ATProto.
6969+7070+ Args:
7171+ sample_type: The PackableSample class to publish.
7272+ name: Human-readable name. Defaults to the class name.
7373+ version: Semantic version string (e.g., '1.0.0').
7474+ description: Human-readable description.
7575+ metadata: Arbitrary metadata dictionary.
7676+ rkey: Optional explicit record key. If not provided, a TID is generated.
7777+7878+ Returns:
7979+ The AT URI of the created schema record.
8080+8181+ Raises:
8282+ ValueError: If sample_type is not a dataclass or client is not authenticated.
8383+ TypeError: If a field type is not supported.
8484+ """
8585+ if not is_dataclass(sample_type):
8686+ raise ValueError(f"{sample_type.__name__} must be a dataclass (use @packable)")
8787+8888+ # Build the schema record
8989+ schema_record = self._build_schema_record(
9090+ sample_type,
9191+ name=name,
9292+ version=version,
9393+ description=description,
9494+ metadata=metadata,
9595+ )
9696+9797+ # Publish to ATProto
9898+ return self.client.create_record(
9999+ collection=f"{LEXICON_NAMESPACE}.sampleSchema",
100100+ record=schema_record.to_record(),
101101+ rkey=rkey,
102102+ validate=False, # PDS doesn't know our lexicon
103103+ )
104104+105105+ def _build_schema_record(
106106+ self,
107107+ sample_type: Type[ST],
108108+ *,
109109+ name: Optional[str],
110110+ version: str,
111111+ description: Optional[str],
112112+ metadata: Optional[dict],
113113+ ) -> SchemaRecord:
114114+ """Build a SchemaRecord from a PackableSample class."""
115115+ field_defs = []
116116+ type_hints = get_type_hints(sample_type)
117117+118118+ for f in fields(sample_type):
119119+ field_type = type_hints.get(f.name, f.type)
120120+ field_def = self._field_to_def(f.name, field_type)
121121+ field_defs.append(field_def)
122122+123123+ return SchemaRecord(
124124+ name=name or sample_type.__name__,
125125+ version=version,
126126+ description=description,
127127+ fields=field_defs,
128128+ metadata=metadata,
129129+ )
130130+131131+ def _field_to_def(self, name: str, python_type) -> FieldDef:
132132+ """Convert a Python field to a FieldDef."""
133133+ # Check for Optional types (Union with None)
134134+ is_optional = False
135135+ origin = get_origin(python_type)
136136+137137+ # Handle Union types (including Optional which is Union[T, None])
138138+ if origin is Union or isinstance(python_type, types.UnionType):
139139+ args = get_args(python_type)
140140+ non_none_args = [a for a in args if a is not type(None)]
141141+ if len(non_none_args) < len(args):
142142+ is_optional = True
143143+ if len(non_none_args) == 1:
144144+ python_type = non_none_args[0]
145145+ elif len(non_none_args) > 1:
146146+ # Complex union type - not fully supported yet
147147+ raise TypeError(f"Complex union types not supported: {python_type}")
148148+149149+ field_type = self._python_type_to_field_type(python_type)
150150+151151+ return FieldDef(
152152+ name=name,
153153+ field_type=field_type,
154154+ optional=is_optional,
155155+ )
156156+157157+ def _python_type_to_field_type(self, python_type) -> FieldType:
158158+ """Map a Python type to a FieldType."""
159159+ # Handle primitives
160160+ if python_type is str:
161161+ return FieldType(kind="primitive", primitive="str")
162162+ elif python_type is int:
163163+ return FieldType(kind="primitive", primitive="int")
164164+ elif python_type is float:
165165+ return FieldType(kind="primitive", primitive="float")
166166+ elif python_type is bool:
167167+ return FieldType(kind="primitive", primitive="bool")
168168+ elif python_type is bytes:
169169+ return FieldType(kind="primitive", primitive="bytes")
170170+171171+ # Check for NDArray
172172+ # NDArray from numpy.typing is a special generic alias
173173+ type_str = str(python_type)
174174+ if "NDArray" in type_str or "ndarray" in type_str.lower():
175175+ # Try to extract dtype info if available
176176+ dtype = "float32" # Default
177177+ args = get_args(python_type)
178178+ if args:
179179+ # NDArray[np.float64] or similar
180180+ dtype_arg = args[-1] if args else None
181181+ if dtype_arg is not None:
182182+ dtype = self._numpy_dtype_to_string(dtype_arg)
183183+184184+ return FieldType(kind="ndarray", dtype=dtype, shape=None)
185185+186186+ # Check for list/array types
187187+ origin = get_origin(python_type)
188188+ if origin is list:
189189+ args = get_args(python_type)
190190+ if args:
191191+ items = self._python_type_to_field_type(args[0])
192192+ return FieldType(kind="array", items=items)
193193+ else:
194194+ # Untyped list
195195+ return FieldType(kind="array", items=FieldType(kind="primitive", primitive="str"))
196196+197197+ # Check for nested PackableSample (not yet supported)
198198+ if is_dataclass(python_type):
199199+ raise TypeError(
200200+ f"Nested dataclass types not yet supported: {python_type.__name__}. "
201201+ "Publish nested types separately and use references."
202202+ )
203203+204204+ raise TypeError(f"Unsupported type for schema field: {python_type}")
205205+206206+ def _numpy_dtype_to_string(self, dtype) -> str:
207207+ """Convert a numpy dtype annotation to a string."""
208208+ dtype_str = str(dtype)
209209+ # Handle common numpy dtypes; unsigned names come first because matching is by substring and "int8" is a substring of "uint8"
210210+ dtype_map = {
211211+ "float16": "float16",
212212+ "float32": "float32",
213213+ "float64": "float64",
214214+ "uint8": "uint8",
215215+ "uint16": "uint16",
216216+ "uint32": "uint32",
217217+ "uint64": "uint64",
218218+ "int8": "int8",
219219+ "int16": "int16",
220220+ "int32": "int32",
221221+ "int64": "int64",
222222+ "bool": "bool",
223223+ "complex64": "complex64",
224224+ "complex128": "complex128",
225225+ }
226226+227227+ for key, value in dtype_map.items():
228228+ if key in dtype_str:
229229+ return value
230230+231231+ return "float32" # Default fallback
232232+233233+234234+class SchemaLoader:
235235+ """Loads PackableSample schemas from ATProto.
236236+237237+ This class fetches schema records from ATProto and can list available
238238+ schemas from a repository.
239239+240240+ Example:
241241+ >>> client = AtmosphereClient()
242242+ >>> client.login("handle", "password")
243243+ >>>
244244+ >>> loader = SchemaLoader(client)
245245+ >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...")
246246+ >>> print(schema["name"])
247247+ MySample
248248+ """
249249+250250+ def __init__(self, client: AtmosphereClient):
251251+ """Initialize the schema loader.
252252+253253+ Args:
254254+ client: AtmosphereClient instance (authentication optional for reads).
255255+ """
256256+ self.client = client
257257+258258+ def get(self, uri: str | AtUri) -> dict:
259259+ """Fetch a schema record by AT URI.
260260+261261+ Args:
262262+ uri: The AT URI of the schema record.
263263+264264+ Returns:
265265+ The schema record as a dictionary.
266266+267267+ Raises:
268268+ ValueError: If the record is not a schema record.
269269+ atproto.exceptions.AtProtocolError: If record not found.
270270+ """
271271+ record = self.client.get_record(uri)
272272+273273+ expected_type = f"{LEXICON_NAMESPACE}.sampleSchema"
274274+ if record.get("$type") != expected_type:
275275+ raise ValueError(
276276+ f"Record at {uri} is not a schema record. "
277277+ f"Expected $type='{expected_type}', got '{record.get('$type')}'"
278278+ )
279279+280280+ return record
281281+282282+ def list_all(
283283+ self,
284284+ repo: Optional[str] = None,
285285+ limit: int = 100,
286286+ ) -> list[dict]:
287287+ """List schema records from a repository.
288288+289289+ Args:
290290+ repo: The DID of the repository. Defaults to authenticated user.
291291+ limit: Maximum number of records to return.
292292+293293+ Returns:
294294+ List of schema records.
295295+ """
296296+ return self.client.list_schemas(repo=repo, limit=limit)
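The `$type` check in `SchemaLoader.get` can be exercised without a network connection. The following is a minimal sketch, not the library's code: `StubClient`, `get_schema`, and the example URIs are hypothetical, and the value of `LEXICON_NAMESPACE` is assumed from the URI shown in the class docstring.

```python
from typing import Any

# Assumed value, inferred from the docstring URI; the real constant lives in the package.
LEXICON_NAMESPACE = "ac.foundation.dataset"


class StubClient:
    """Hypothetical stand-in for AtmosphereClient that serves records from memory."""

    def __init__(self, records: dict[str, dict[str, Any]]) -> None:
        self._records = records

    def get_record(self, uri: str) -> dict[str, Any]:
        return self._records[uri]


def get_schema(client: StubClient, uri: str) -> dict[str, Any]:
    """Same shape as SchemaLoader.get: fetch, then reject records of the wrong $type."""
    record = client.get_record(uri)
    expected_type = f"{LEXICON_NAMESPACE}.sampleSchema"
    if record.get("$type") != expected_type:
        raise ValueError(
            f"Record at {uri} is not a schema record. "
            f"Expected $type='{expected_type}', got '{record.get('$type')}'"
        )
    return record


# Made-up URIs, used only as dictionary keys here.
client = StubClient({
    "at://example/schema/1": {
        "$type": f"{LEXICON_NAMESPACE}.sampleSchema",
        "name": "MySample",
    },
    "at://example/other/1": {"$type": "app.example.other", "name": "NotASchema"},
})

print(get_schema(client, "at://example/schema/1")["name"])

try:
    get_schema(client, "at://example/other/1")
except ValueError as err:
    print(f"rejected: {err}")
```

Validating after the fetch means a caller who passes the URI of a dataset or lens record gets a clear `ValueError` rather than a dictionary with unexpected keys.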
+40-169
src/atdata/dataset.py
···32323333from pathlib import Path
3434import uuid
3535-import functools
36353736import dataclasses
3837import types
···4039 dataclass,
4140 asdict,
4241)
4343-from abc import (
4444- ABC,
4545- abstractmethod,
4646-)
4242+from abc import ABC
47434844from tqdm import tqdm
4945import numpy as np
5046import pandas as pd
4747+import requests
51485249import typing
5350from typing import (
···6562 TypeVar,
6663 TypeAlias,
6764)
6868-# from typing_inspect import get_bound, get_parameters
6969-from numpy.typing import (
7070- NDArray,
7171- ArrayLike,
7272-)
7373-7474-#
7575-7676-# import ekumen.atmosphere as eat
6565+from numpy.typing import NDArray
77667867import msgpack
7968import ormsgpack
···9685##
9786# Main base classes
98879999-# TODO Check for best way to ensure this typevar is used as a dataclass type
100100-# DT = TypeVar( 'DT', bound = dataclass.__class__ )
10188DT = TypeVar( 'DT' )
1028910390MsgpackRawSample: TypeAlias = Dict[str, Any]
10491105105-# @dataclass
106106-# class ArrayBytes:
107107-# """Annotates bytes that should be interpreted as the raw contents of a
108108-# numpy NDArray"""
109109-110110-# raw_bytes: bytes
111111-# """The raw bytes of the corresponding NDArray"""
112112-113113-# def __init__( self,
114114-# array: Optional[ArrayLike] = None,
115115-# raw: Optional[bytes] = None,
116116-# ):
117117-# """TODO"""
118118-119119-# if array is not None:
120120-# array = np.array( array )
121121-# self.raw_bytes = eh.array_to_bytes( array )
122122-123123-# elif raw is not None:
124124-# self.raw_bytes = raw
125125-126126-# else:
127127-# raise ValueError( 'Must provide either `array` or `raw` bytes' )
128128-129129-# @property
130130-# def to_numpy( self ) -> NDArray:
131131-# """Return the `raw_bytes` data as an NDArray"""
132132-# return eh.bytes_to_array( self.raw_bytes )
1339213493def _make_packable( x ):
13594 """Convert a value to a msgpack-compatible format.
···141100 Returns:
142101 The value in a format suitable for msgpack serialization.
143102 """
144144- # if isinstance( x, ArrayBytes ):
145145- # return x.raw_bytes
146103 if isinstance( x, np.ndarray ):
147104 return eh.array_to_bytes( x )
148105 return x
···226183 # based on what is provided
227184228185 if isinstance( var_cur_value, np.ndarray ):
229229- # we're good!
230230- pass
231231-232232- # elif isinstance( var_cur_value, ArrayBytes ):
233233- # setattr( self, var_name, var_cur_value.to_numpy )
186186+ # Already the correct type, no conversion needed
187187+ continue
234188235189 elif isinstance( var_cur_value, bytes ):
236190 # TODO This does create a constraint that serialized bytes
···411365 raise AttributeError( f'No sample attribute named {name}' )
412366413367414414-# class AnySample( BaseModel ):
415415-# """A sample that can hold anything"""
416416-# value: Any
417417-418418-# class AnyBatch( BaseModel ):
419419-# """A batch of `AnySample`s"""
420420-# values: list[AnySample]
421421-422422-423368ST = TypeVar( 'ST', bound = PackableSample )
424424-# BT = TypeVar( 'BT' )
425425-426369RT = TypeVar( 'RT', bound = PackableSample )
427370428428-# TODO For python 3.13
429429-# BT = TypeVar( 'BT', default = None )
430430-# IT = TypeVar( 'IT', default = Any )
431431-432371class Dataset( Generic[ST] ):
433372 """A typed dataset built on WebDataset with lens transformations.
434373···456395 ...
457396 >>> # Transform to a different view
458397 >>> ds_view = ds.as_type(MyDataView)
398398+459399 """
460400461461- # sample_class: Type = get_parameters( )
462462- # """The type of each returned sample from this `Dataset`'s iterator"""
463463- # batch_class: Type = get_bound( BT )
464464- # """The type of a batch built from `sample_class`"""
465465-466401 @property
467402 def sample_type( self ) -> Type:
468403 """The type of each returned sample from this dataset's iterator.
···482417 Returns:
483418 ``SampleBatch[ST]`` where ``ST`` is this dataset's sample type.
484419 """
485485- # return self.__orig_class__.__args__[1]
486420 return SampleBatch[self.sample_type]
487421488488-489489- # _schema_registry_sample: dict[str, Type]
490490- # _schema_registry_batch: dict[str, Type | None]
491491-492492- #
493493-494494- def __init__( self, url: str ) -> None:
422422+ def __init__( self, url: str,
423423+ metadata_url: str | None = None,
424424+ ) -> None:
495425 """Create a dataset from a WebDataset URL.
496426497427 Args:
···501431 """
502432 super().__init__()
503433 self.url = url
434434+ """WebDataset brace-notation URL pointing to tar files, e.g.,
435435+ ``"path/to/file-{000000..000009}.tar"`` for multiple shards or
436436+ ``"path/to/file-000000.tar"`` for a single shard.
437437+ """
438438+439439+ self._metadata: dict[str, Any] | None = None
440440+ self.metadata_url: str | None = metadata_url
441441+ """Optional URL to msgpack-encoded metadata for this dataset."""
504442505443 # Allow addition of automatic transformation of raw underlying data
506444 self._output_lens: Lens | None = None
···527465 ret._output_lens = lenses.transform( self.sample_type, ret.sample_type )
528466 return ret
529467530530- # @classmethod
531531- # def register( cls, uri: str,
532532- # sample_class: Type,
533533- # batch_class: Optional[Type] = None,
534534- # ):
535535- # """Register an `ekumen` schema to use a particular dataset sample class"""
536536- # cls._schema_registry_sample[uri] = sample_class
537537- # cls._schema_registry_batch[uri] = batch_class
538538-539539- # @classmethod
540540- # def at( cls, uri: str ) -> 'Dataset':
541541- # """Create a Dataset for the `ekumen` index entry at `uri`"""
542542- # client = eat.Client()
543543- # return cls( )
544544-545545- # Common functionality
546546-547468 @property
548469 def shard_list( self ) -> list[str]:
549470 """List of individual dataset shards
···557478 wds.filters.map( lambda x: x['url'] )
558479 )
559480 return list( pipe )
481481+482482+ @property
483483+ def metadata( self ) -> dict[str, Any] | None:
484484+ """Fetch and cache metadata from metadata_url.
485485+486486+ Returns:
487487+ Deserialized metadata dictionary, or None if no metadata_url is set.
488488+489489+ Raises:
490490+ requests.HTTPError: If metadata fetch fails.
491491+ """
492492+ if self.metadata_url is None:
493493+ return None
494494+495495+ if self._metadata is None:
496496+ with requests.get( self.metadata_url, stream = True ) as response:
497497+ response.raise_for_status()
498498+ self._metadata = msgpack.unpackb( response.content, raw = False )
499499+500500+ # Use our cached values
501501+ return self._metadata
560502561503 def ordered( self,
562504 batch_size: int | None = 1,
···575517 """
576518577519 if batch_size is None:
578578- # TODO Duplication here
579520 return wds.pipeline.DataPipeline(
580521 wds.shardlists.SimpleShardList( self.url ),
581522 wds.shardlists.split_by_worker,
582582- #
583523 wds.tariterators.tarfile_to_samples(),
584584- # wds.map( self.preprocess ),
585524 wds.filters.map( self.wrap ),
586525 )
587526588527 return wds.pipeline.DataPipeline(
589528 wds.shardlists.SimpleShardList( self.url ),
590529 wds.shardlists.split_by_worker,
591591- #
592530 wds.tariterators.tarfile_to_samples(),
593593- # wds.map( self.preprocess ),
594531 wds.filters.batched( batch_size ),
595532 wds.filters.map( self.wrap_batch ),
596533 )
···618555 ``SampleBatch[ST]`` instances; otherwise yields individual ``ST``
619556 samples.
620557 """
621621-622558 if batch_size is None:
623623- # TODO Duplication here
624559 return wds.pipeline.DataPipeline(
625560 wds.shardlists.SimpleShardList( self.url ),
626561 wds.filters.shuffle( buffer_shards ),
627562 wds.shardlists.split_by_worker,
628628- #
629563 wds.tariterators.tarfile_to_samples(),
630630- # wds.shuffle( buffer_samples ),
631631- # wds.map( self.preprocess ),
632564 wds.filters.shuffle( buffer_samples ),
633565 wds.filters.map( self.wrap ),
634566 )
···637569 wds.shardlists.SimpleShardList( self.url ),
638570 wds.filters.shuffle( buffer_shards ),
639571 wds.shardlists.split_by_worker,
640640- #
641572 wds.tariterators.tarfile_to_samples(),
642642- # wds.shuffle( buffer_samples ),
643643- # wds.map( self.preprocess ),
644573 wds.filters.shuffle( buffer_samples ),
645574 wds.filters.batched( batch_size ),
646575 wds.filters.map( self.wrap_batch ),
···683612684613 cur_segment = 0
685614 cur_buffer = []
686686- path_template = (path.parent / f'{path.stem}-%06d.{path.suffix}').as_posix()
615615+ path_template = (path.parent / f'{path.stem}-{{:06d}}{path.suffix}').as_posix()
687616688617 for x in self.ordered( batch_size = None ):
689618 cur_buffer.append( sample_map( x ) )
690690-619619+691620 if len( cur_buffer ) >= maxcount:
692621 # Write current segment
693622 cur_path = path_template.format( cur_segment )
···703632 df = pd.DataFrame( cur_buffer )
704633 df.to_parquet( cur_path, **kwargs )
705634706706-707707- # Implemented by specific subclasses
708708-709709- # @property
710710- # @abstractmethod
711711- # def url( self ) -> str:
712712- # """str: Brace-notation URL of the underlying full WebDataset"""
713713- # pass
714714-715715- # @classmethod
716716- # # TODO replace Any with IT
717717- # def preprocess( cls, sample: WDSRawSample ) -> Any:
718718- # """Pre-built preprocessor for a raw `sample` from the given dataset"""
719719- # return sample
720720-721721- # @classmethod
722722- # TODO replace Any with IT
723635 def wrap( self, sample: MsgpackRawSample ) -> ST:
724636 """Wrap a raw msgpack sample into the appropriate dataset-specific type.
725637···739651740652 source_sample = self._output_lens.source_type.from_bytes( sample['msgpack'] )
741653 return self._output_lens( source_sample )
742742-743743- # try:
744744- # assert type( sample ) == dict
745745- # return cls.sample_class( **{
746746- # k: v
747747- # for k, v in sample.items() if k != '__key__'
748748- # } )
749749-750750- # except Exception as e:
751751- # # Sample constructor failed -- revert to default
752752- # return AnySample(
753753- # value = sample,
754754- # )
755654756655 def wrap_batch( self, batch: WDSRawBatch ) -> SampleBatch[ST]:
757656 """Wrap a batch of raw msgpack samples into a typed SampleBatch.
···782681 for s in batch_source ]
783682 return SampleBatch[self.sample_type]( batch_view )
784683785785- # # @classmethod
786786- # def wrap_batch( self, batch: WDSRawBatch ) -> BT:
787787- # """Wrap a `batch` of samples into the appropriate dataset-specific type
788788-789789- # This default implementation simply creates a list one sample at a time
790790- # """
791791- # assert cls.batch_class is not None, 'No batch class specified'
792792- # return cls.batch_class( **batch )
793793-794794-795795-##
796796-# Shortcut decorators
797797-798798-# def packable( cls ):
799799-# """TODO"""
800800-801801-# def decorator( cls ):
802802-# # Create a new class dynamically
803803-# # The new class inherits from the new_parent_class first, then the original cls
804804-# new_bases = (PackableSample,) + cls.__bases__
805805-# new_cls = type(cls.__name__, new_bases, dict(cls.__dict__))
806806-807807-# # Optionally, update __module__ and __qualname__ for better introspection
808808-# new_cls.__module__ = cls.__module__
809809-# new_cls.__qualname__ = cls.__qualname__
810810-811811-# return new_cls
812812-# return decorator
813684814685def packable( cls ):
815686 """Decorator to convert a regular class into a ``PackableSample``.
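The `path_template` change in the parquet export above fixes two problems at once: the old template used a printf-style `%06d` placeholder that `str.format` never substitutes, and it added a second dot because `Path.suffix` already includes the leading one. A small sketch with an illustrative path (the file name is made up):

```python
from pathlib import Path

path = Path("out/data.parquet")

# Old template: "%06d" is ignored by str.format, and ".{path.suffix}"
# doubles the dot because path.suffix is already ".parquet".
old_template = (path.parent / f"{path.stem}-%06d.{path.suffix}").as_posix()

# New template: "{{:06d}}" inside an f-string leaves a literal "{:06d}"
# behind, which str.format then fills with the zero-padded segment index.
new_template = (path.parent / f"{path.stem}-{{:06d}}{path.suffix}").as_posix()

print(old_template.format(3))  # out/data-%06d..parquet
print(new_template.format(3))  # out/data-000003.parquet
```

With the old template every segment would have been written to the same file name, so later segments would overwrite earlier ones.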
+2-55
src/atdata/lens.py
···201201 """
202202 return self._getter( s )
203203204204-# TODO Figure out how to properly parameterize this
205205-# def _lens_factory[S, V]( register: bool = True ):
206206-# """Register the annotated function `f` as the getter of a sample lens"""
207207-208208-# # The actual lens decorator taking a lens getter function to a lens object
209209-# def _decorator( f: LensGetter[S, V] ) -> Lens[S, V]:
210210-# ret = Lens[S, V]( f )
211211-# if register:
212212-# _network.register( ret )
213213-# return ret
214214-215215-# # Return the lens decorator
216216-# return _decorator
217217-218218-# # For convenience
219219-# lens = _lens_factory
220204221205def lens( f: LensGetter[S, V] ) -> Lens[S, V]:
222206 """Decorator to create and register a lens transformation.
···245229 _network.register( ret )
246230 return ret
247231248248-249249-##
250250-# Global registry of used lenses
251251-252252-# _registered_lenses: Dict[LensSignature, Lens] = dict()
253253-# """TODO"""
254232255233class LensNetwork:
256234 """Global registry for lens transformations between sample types.
···292270 If a lens already exists for the same type pair, it will be
293271 overwritten.
294272 """
295295-296296- # sig = inspect.signature( _lens.get )
297297- # input_types = list( sig.parameters.values() )
298298- # assert len( input_types ) == 1, \
299299- # 'Wrong number of input args for lens: should only have one'
300300-301301- # input_type = input_types[0].annotation
302302- # print( input_type )
303303- # output_type = sig.return_annotation
304304-305305- # self._registry[input_type, output_type] = _lens
306306- # print( _lens.source_type )
307273 self._registry[_lens.source_type, _lens.view_type] = _lens
308274309275 def transform( self, source: DatasetType, view: DatasetType ) -> Lens:
···323289 Currently only supports direct transformations. Compositional
324290 transformations (chaining multiple lenses) are not yet implemented.
325291 """
326326-327327- # TODO Handle compositional closure
328292 ret = self._registry.get( (source, view), None )
329293 if ret is None:
330294 raise ValueError( f'No registered lens from source {source} to view {view}' )
···332296 return ret
333297334298335335-# Create global singleton registry instance
336336-_network = LensNetwork()
337337-338338-# def lens( f: LensPutter ) -> Lens:
339339-# """Register the annotated function `f` as a sample lens"""
340340-# ##
341341-342342-# sig = inspect.signature( f )
343343-344344-# input_types = list( sig.parameters.values() )
345345-# output_type = sig.return_annotation
346346-347347-# _registered_lenses[]
348348-349349-# f.lens = Lens(
350350-351351-# )
352352-353353-# return f299299+# Global singleton registry instance
300300+_network = LensNetwork()
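The registry that remains after this cleanup is a dictionary keyed by the `(source, view)` type pair. The sketch below reproduces that lookup with hypothetical stand-ins (`MiniLens`, `MiniLensNetwork`, `FullSample`, `NameOnly`); unlike the real `lens` decorator, it takes the source and view types as explicit arguments.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FullSample:
    name: str
    score: float


@dataclass
class NameOnly:
    name: str


class MiniLens:
    """Hypothetical stripped-down lens: a getter plus its source and view types."""

    def __init__(self, getter: Callable[[Any], Any], source_type: type, view_type: type) -> None:
        self._getter = getter
        self.source_type = source_type
        self.view_type = view_type

    def __call__(self, s: Any) -> Any:
        return self._getter(s)


class MiniLensNetwork:
    """Registry keyed by (source, view); direct lookups only, no chaining."""

    def __init__(self) -> None:
        self._registry: dict[tuple[type, type], MiniLens] = {}

    def register(self, lens: MiniLens) -> None:
        # A later registration for the same pair overwrites the earlier one.
        self._registry[lens.source_type, lens.view_type] = lens

    def transform(self, source: type, view: type) -> MiniLens:
        ret = self._registry.get((source, view), None)
        if ret is None:
            raise ValueError(f"No registered lens from source {source} to view {view}")
        return ret


network = MiniLensNetwork()
network.register(MiniLens(lambda s: NameOnly(name=s.name), FullSample, NameOnly))

to_name = network.transform(FullSample, NameOnly)
print(to_name(FullSample(name="a", score=0.5)))

try:
    network.transform(NameOnly, FullSample)  # no lens registered in this direction
except ValueError as err:
    print(f"lookup failed: {err}")
```

Because the key is an ordered pair, registering a lens in one direction does not make the reverse direction available.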
+492
src/atdata/local.py
···11+"""Local repository storage for atdata datasets.
22+33+This module provides a local storage backend for atdata datasets using:
44+- S3-compatible object storage for dataset tar files and metadata
55+- Redis for indexing and tracking datasets
66+77+The main classes are:
88+- Repo: Manages dataset storage in S3 with Redis indexing
99+- Index: Redis-backed index for tracking dataset metadata
1010+- BasicIndexEntry: Index entry representing a stored dataset
1111+1212+This is intended for development and small-scale deployment before
1313+migrating to the full atproto PDS infrastructure.
1414+"""
1515+1616+##
1717+# Imports
1818+1919+from atdata import (
2020+ PackableSample,
2121+ Dataset,
2222+)
2323+2424+import os
2525+from pathlib import Path
2626+from uuid import uuid4
2727+from tempfile import TemporaryDirectory
2828+from dotenv import dotenv_values
2929+import msgpack
3030+3131+from redis import Redis
3232+3333+from s3fs import (
3434+ S3FileSystem,
3535+)
3636+3737+import webdataset as wds
3838+3939+from dataclasses import (
4040+ dataclass,
4141+ asdict,
4242+ field,
4343+)
4444+from typing import (
4545+ Any,
4646+ Optional,
4747+ Dict,
4848+ Type,
4949+ TypeVar,
5050+ Generator,
5151+ BinaryIO,
5252+ cast,
5353+)
5454+5555+T = TypeVar( 'T', bound = PackableSample )
5656+5757+5858+##
5959+# Helpers
6060+6161+def _kind_str_for_sample_type( st: Type[PackableSample] ) -> str:
6262+ """Convert a sample type to a fully-qualified string identifier.
6363+6464+ Args:
6565+ st: The sample type class.
6666+6767+ Returns:
6868+ A string in the format 'module.name' identifying the sample type.
6969+ """
7070+ return f'{st.__module__}.{st.__name__}'
7171+7272+def _decode_bytes_dict( d: dict[bytes, bytes] ) -> dict[str, str]:
7373+ """Decode a dictionary with byte keys and values to strings.
7474+7575+ Redis returns dictionaries with bytes keys/values, this converts them to strings.
7676+7777+ Args:
7878+ d: Dictionary with bytes keys and values.
7979+8080+ Returns:
8181+ Dictionary with UTF-8 decoded string keys and values.
8282+ """
8383+ return {
8484+ k.decode('utf-8'): v.decode('utf-8')
8585+ for k, v in d.items()
8686+ }
8787+8888+8989+##
9090+# Redis object model
9191+9292+@dataclass
9393+class BasicIndexEntry:
9494+ """Index entry for a dataset stored in the repository.
9595+9696+ Tracks metadata about a dataset stored in S3, including its location,
9797+ type, and unique identifier.
9898+ """
9999+ ##
100100+101101+ wds_url: str
102102+ """WebDataset URL for the dataset tar files, for use with atdata.Dataset."""
103103+104104+ sample_kind: str
105105+ """Fully-qualified sample type name (e.g., 'module.ClassName')."""
106106+107107+ metadata_url: str | None
108108+ """S3 URL to the dataset's metadata msgpack file, if any."""
109109+110110+ uuid: str = field( default_factory = lambda: str( uuid4() ) )
111111+ """Unique identifier for this dataset entry. Defaults to a new UUID if not provided."""
112112+113113+ def write_to( self, redis: Redis ):
114114+ """Persist this index entry to Redis.
115115+116116+ Stores the entry as a Redis hash with key 'BasicIndexEntry:{uuid}'.
117117+118118+ Args:
119119+ redis: Redis connection to write to.
120120+ """
121121+ save_key = f'BasicIndexEntry:{self.uuid}'
122122+ # Filter out None values - Redis doesn't accept None
123123+ data = {k: v for k, v in asdict(self).items() if v is not None}
124124+ # redis-py typing uses untyped dict, so type checker complains about dict[str, Any]
125125+ redis.hset( save_key, mapping = data ) # type: ignore[arg-type]
126126+127127+def _s3_env( credentials_path: str | Path ) -> dict[str, Any]:
128128+ """Load S3 credentials from a .env file.
129129+130130+ Args:
131131+ credentials_path: Path to .env file containing S3 credentials.
132132+133133+ Returns:
134134+ Dictionary with AWS_ENDPOINT, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
135135+136136+ Raises:
137137+ AssertionError: If required credentials are missing from the file.
138138+ """
139139+ ##
140140+ credentials_path = Path( credentials_path )
141141+ env_values = dotenv_values( credentials_path )
142142+ assert 'AWS_ENDPOINT' in env_values
143143+ assert 'AWS_ACCESS_KEY_ID' in env_values
144144+ assert 'AWS_SECRET_ACCESS_KEY' in env_values
145145+146146+ return {
147147+ k: env_values[k]
148148+ for k in (
149149+ 'AWS_ENDPOINT',
150150+ 'AWS_ACCESS_KEY_ID',
151151+ 'AWS_SECRET_ACCESS_KEY',
152152+ )
153153+ }
154154+155155+def _s3_from_credentials( creds: str | Path | dict ) -> S3FileSystem:
156156+ """Create an S3FileSystem from credentials.
157157+158158+ Args:
159159+ creds: Either a path to a .env file with credentials, or a dict
160160+ containing AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally
161161+ AWS_ENDPOINT.
162162+163163+ Returns:
164164+ Configured S3FileSystem instance.
165165+ """
166166+ ##
167167+ if not isinstance( creds, dict ):
168168+ creds = _s3_env( creds )
169169+170170+ # Build kwargs, making endpoint_url optional
171171+ kwargs = {
172172+ 'key': creds['AWS_ACCESS_KEY_ID'],
173173+ 'secret': creds['AWS_SECRET_ACCESS_KEY']
174174+ }
175175+ if 'AWS_ENDPOINT' in creds:
176176+ kwargs['endpoint_url'] = creds['AWS_ENDPOINT']
177177+178178+ return S3FileSystem(**kwargs)
179179+180180+181181+##
182182+# Classes
183183+184184+class Repo:
185185+ """Repository for storing and managing atdata datasets.
186186+187187+ Provides storage of datasets in S3-compatible object storage with Redis-based
188188+ indexing. Datasets are stored as WebDataset tar files with optional metadata.
189189+190190+ Attributes:
191191+ s3_credentials: S3 credentials dictionary or None.
192192+ bucket_fs: S3FileSystem instance or None.
193193+ hive_path: Path within S3 bucket for storing datasets.
194194+ hive_bucket: Name of the S3 bucket.
195195+ index: Index instance for tracking datasets.
196196+ """
197197+198198+ ##
199199+200200+ def __init__( self,
201201+ #
202202+ s3_credentials: str | Path | dict[str, Any] | None = None,
203203+ hive_path: str | Path | None = None,
204204+ redis: Redis | None = None,
205205+ #
206206+ #
207207+ **kwargs
208208+ ) -> None:
209209+ """Initialize a repository.
210210+211211+ Args:
212212+ s3_credentials: Path to .env file with S3 credentials, or dict with
213213+ AWS_ENDPOINT, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
214214+ If None, S3 functionality will be disabled.
215215+ hive_path: Path within the S3 bucket to store datasets.
216216+ Required if s3_credentials is provided.
217217+ redis: Redis connection for indexing. If None, creates a new connection.
218218+ **kwargs: Additional arguments (reserved for future use).
219219+220220+ Raises:
221221+ ValueError: If hive_path is not provided when s3_credentials is set.
222222+ """
223223+224224+ if s3_credentials is None:
225225+ self.s3_credentials = None
226226+ elif isinstance( s3_credentials, dict ):
227227+ self.s3_credentials = s3_credentials
228228+ else:
229229+ self.s3_credentials = _s3_env( s3_credentials )
230230+231231+ if self.s3_credentials is None:
232232+ self.bucket_fs = None
233233+ else:
234234+ self.bucket_fs = _s3_from_credentials( self.s3_credentials )
235235+236236+ if self.bucket_fs is not None:
237237+ if hive_path is None:
238238+ raise ValueError( 'Must specify hive path within bucket' )
239239+ self.hive_path = Path( hive_path )
240240+ self.hive_bucket = self.hive_path.parts[0]
241241+ else:
242242+ self.hive_path = None
243243+ self.hive_bucket = None
244244+245245+ #
246246+247247+ self.index = Index( redis = redis )
248248+249249+ ##
250250+251251+ def insert( self, ds: Dataset[T],
252252+ #
253253+ cache_local: bool = False,
254254+ #
255255+ **kwargs
256256+ ) -> tuple[BasicIndexEntry, Dataset[T]]:
257257+ """Insert a dataset into the repository.
258258+259259+ Writes the dataset to S3 as WebDataset tar files, stores metadata,
260260+ and creates an index entry in Redis.
261261+262262+ Args:
263263+ ds: The dataset to insert.
264264+ cache_local: If True, write to local temporary storage first, then
265265+ copy to S3. This can be faster for some workloads.
266266+ **kwargs: Additional arguments passed to wds.ShardWriter.
267267+268268+ Returns:
269269+ A tuple of (index_entry, new_dataset) where:
270270+ - index_entry: BasicIndexEntry for the stored dataset
271271+ - new_dataset: Dataset object pointing to the stored copy
272272+273273+ Raises:
274274+ AssertionError: If S3 credentials or hive_path are not configured.
275275+ RuntimeError: If no shards were written.
276276+ """
277277+278278+ assert self.s3_credentials is not None
279279+ assert self.hive_bucket is not None
280280+ assert self.hive_path is not None
281281+282282+ new_uuid = str( uuid4() )
283283+284284+ hive_fs = _s3_from_credentials( self.s3_credentials )
285285+286286+ # Write metadata
287287+ metadata_path = (
288288+ self.hive_path
289289+ / 'metadata'
290290+ / f'atdata-metadata--{new_uuid}.msgpack'
291291+ )
292292+ # Note: S3 doesn't need directories created beforehand - s3fs handles this
293293+294294+ if ds.metadata is not None:
295295+ # Use s3:// prefix to ensure s3fs treats this as an S3 path
296296+ with cast( BinaryIO, hive_fs.open( f's3://{metadata_path.as_posix()}', 'wb' ) ) as f:
297297+ meta_packed = msgpack.packb( ds.metadata )
298298+ assert meta_packed is not None
299299+ f.write( cast( bytes, meta_packed ) )
300300+301301+302302+ # Write data
303303+ shard_pattern = (
304304+ self.hive_path
305305+ / f'atdata--{new_uuid}--%06d.tar'
306306+ ).as_posix()
307307+308308+ with TemporaryDirectory() as temp_dir:
309309+310310+ if cache_local:
311311+ # For cache_local, we need to use boto3 directly to avoid s3fs async issues with moto
312312+ import boto3
313313+314314+ # Create boto3 client from credentials
315315+ s3_client_kwargs = {
316316+ 'aws_access_key_id': self.s3_credentials['AWS_ACCESS_KEY_ID'],
317317+ 'aws_secret_access_key': self.s3_credentials['AWS_SECRET_ACCESS_KEY']
318318+ }
319319+ if 'AWS_ENDPOINT' in self.s3_credentials:
320320+ s3_client_kwargs['endpoint_url'] = self.s3_credentials['AWS_ENDPOINT']
321321+ s3_client = boto3.client('s3', **s3_client_kwargs)
322322+323323+ def _writer_opener( p: str ):
324324+ local_cache_path = Path( temp_dir ) / p
325325+ local_cache_path.parent.mkdir( parents = True, exist_ok = True )
326326+ return open( local_cache_path, 'wb' )
327327+ writer_opener = _writer_opener
328328+329329+ def _writer_post( p: str ):
330330+ local_cache_path = Path( temp_dir ) / p
331331+332332+ # Copy to S3 using boto3 client (avoids s3fs async issues)
333333+ path_parts = Path( p ).parts
334334+ bucket = path_parts[0]
335335+ key = Path( *path_parts[1:] ).as_posix() # S3 keys always use forward slashes
336336+337337+ with open( local_cache_path, 'rb' ) as f_in:
338338+ s3_client.put_object( Bucket=bucket, Key=key, Body=f_in.read() )
339339+340340+ # Delete local cache file
341341+ local_cache_path.unlink()
342342+343343+ written_shards.append( p )
344344+ writer_post = _writer_post
345345+346346+ else:
347347+ # Use s3:// prefix to ensure s3fs treats paths as S3 paths
348348+ writer_opener = lambda s: cast( BinaryIO, hive_fs.open( f's3://{s}', 'wb' ) )
349349+ writer_post = lambda s: written_shards.append( s )
350350+351351+ written_shards = []
352352+ with wds.writer.ShardWriter(
353353+ shard_pattern,
354354+ opener = writer_opener,
355355+ post = writer_post,
356356+ **kwargs,
357357+ ) as sink:
358358+ for sample in ds.ordered( batch_size = None ):
359359+ sink.write( sample.as_wds )
360360+361361+ # Make a new Dataset object for the written dataset copy
362362+ if len( written_shards ) == 0:
363363+ raise RuntimeError( 'Cannot form new dataset entry -- did not write any shards' )
364364+365365+ elif len( written_shards ) < 2:
366366+ new_dataset_url = (
367367+ self.hive_path
368368+ / ( Path( written_shards[0] ).name )
369369+ ).as_posix()
370370+371371+ else:
372372+ shard_s3_format = (
373373+ (
374374+ self.hive_path
375375+ / f'atdata--{new_uuid}'
376376+ ).as_posix()
377377+ ) + '--{shard_id}.tar'
378378+ shard_id_braced = '{' + f'{0:06d}..{len( written_shards ) - 1:06d}' + '}'
379379+ new_dataset_url = shard_s3_format.format( shard_id = shard_id_braced )
380380+381381+ new_dataset = Dataset[ds.sample_type](
382382+ url = new_dataset_url,
383383+ metadata_url = metadata_path.as_posix() if ds.metadata is not None else None,
384384+ )
385385+386386+ # Add to index
387387+ new_entry = self.index.add_entry( new_dataset, uuid = new_uuid )
388388+389389+ return new_entry, new_dataset
390390+391391+392392+class Index:
393393+ """Redis-backed index for tracking datasets in a repository.
394394+395395+ Maintains a registry of BasicIndexEntry objects in Redis, allowing
396396+ enumeration and lookup of stored datasets.
397397+398398+ Attributes:
399399+ _redis: Redis connection for index storage.
400400+ """
401401+402402+ ##
403403+404404+ def __init__( self,
405405+ redis: Redis | None = None,
406406+ **kwargs
407407+ ) -> None:
408408+ """Initialize an index.
409409+410410+ Args:
411411+ redis: Redis connection to use. If None, creates a new connection
412412+ using the provided kwargs.
413413+ **kwargs: Additional arguments passed to Redis() constructor if
414414+ redis is None.
415415+ """
416416+ ##
417417+418418+ if redis is not None:
419419+ self._redis = redis
420420+ else:
421421+ self._redis: Redis = Redis( **kwargs )
422422+423423+ @property
424424+ def all_entries( self ) -> list[BasicIndexEntry]:
425425+ """Get all index entries as a list.
426426+427427+ Returns:
428428+ List of all BasicIndexEntry objects in the index.
429429+ """
430430+ return list( self.entries )
431431+432432+ @property
433433+ def entries( self ) -> Generator[BasicIndexEntry, None, None]:
434434+ """Iterate over all index entries.
435435+436436+ Scans Redis for all BasicIndexEntry keys and yields them one at a time.
437437+438438+ Yields:
439439+ BasicIndexEntry objects from the index.
440440+ """
441441+ ##
442442+ for key in self._redis.scan_iter( match = 'BasicIndexEntry:*' ):
443443+ # hgetall returns dict[bytes, bytes] which we decode to dict[str, str]
444444+ cur_entry_data = _decode_bytes_dict( cast(dict[bytes, bytes], self._redis.hgetall( key )) )
445445+446446+ # Provide default None for optional fields that may be missing
447447+ # Type checker complains about None in dict[str, str], but BasicIndexEntry accepts it
448448+ cur_entry_data: dict[str, Any] = dict( **cur_entry_data )
449449+ cur_entry_data.setdefault('metadata_url', None)
450450+451451+ cur_entry = BasicIndexEntry( **cur_entry_data )
452452+ yield cur_entry
453453+454454+ return
455455+456456+ def add_entry( self, ds: Dataset,
457457+ uuid: str | None = None,
458458+ ) -> BasicIndexEntry:
459459+ """Add a dataset to the index.
460460+461461+ Creates a BasicIndexEntry for the dataset and persists it to Redis.
462462+463463+ Args:
464464+ ds: The dataset to add to the index.
465465+ uuid: Optional UUID for the entry. If None, a new UUID is generated.
466466+467467+ Returns:
468468+ The created BasicIndexEntry object.
469469+ """
470470+ ##
471471+ temp_sample_kind = _kind_str_for_sample_type( ds.sample_type )
472472+473473+ if uuid is None:
474474+ ret_data = BasicIndexEntry(
475475+ wds_url = ds.url,
476476+ sample_kind = temp_sample_kind,
477477+ metadata_url = ds.metadata_url,
478478+ )
479479+ else:
480480+ ret_data = BasicIndexEntry(
481481+ wds_url = ds.url,
482482+ sample_kind = temp_sample_kind,
483483+ metadata_url = ds.metadata_url,
484484+ uuid = uuid,
485485+ )
486486+487487+ ret_data.write_to( self._redis )
488488+489489+ return ret_data
490490+491491+492492+#
+1
tests/conftest.py
···11+"""Pytest configuration for atdata tests."""