# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

`atdata` is a Python library that implements a loose federation of distributed, typed datasets built on top of WebDataset. It provides:

- **Typed samples** with automatic serialization via msgpack
- **Lens-based transformations** between different dataset schemas
- **Batch aggregation** with automatic numpy array stacking
- **WebDataset integration** for efficient large-scale dataset storage

## Development Commands

### Environment Setup
```bash
# Uses uv for dependency management
python -m pip install uv # if not already installed
uv sync
```

### Testing
```bash
# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_dataset.py
pytest tests/test_lens.py

# Run single test
pytest tests/test_dataset.py::test_create_sample
pytest tests/test_lens.py::test_lens
```

### Building
```bash
# Build the package
uv build
```

## Architecture

### Core Components

The codebase has three main modules under `src/atdata/`:

1. **dataset.py** - Core dataset and sample infrastructure
   - `PackableSample`: Base class for samples that can be serialized with msgpack
   - `Dataset[ST]`: Generic typed dataset wrapping WebDataset tar files
   - `SampleBatch[DT]`: Automatic batching with attribute aggregation
   - `@packable` decorator: Converts dataclasses into PackableSample subclasses

2. **lens.py** - Type transformation system
   - `Lens[S, V]`: Bidirectional transformations between sample types (getter/putter)
   - `LensNetwork`: Singleton registry for lens transformations
   - `@lens` decorator: Registers lens getters globally

3. **_helpers.py** - Serialization utilities
   - `array_to_bytes()` / `bytes_to_array()`: numpy array serialization
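
The helper pair in `_helpers.py` can be pictured as a round trip through NumPy's `.npy` format. The following is a minimal sketch under that assumption, not the library's actual implementation: only the function names come from the module listing above, and the bodies are illustrative.

```python
import io

import numpy as np


def array_to_bytes(arr: np.ndarray) -> bytes:
    """Serialize an array, including its dtype and shape, to bytes."""
    buf = io.BytesIO()
    # allow_pickle=False restricts the payload to plain array data.
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()


def bytes_to_array(data: bytes) -> np.ndarray:
    """Inverse of array_to_bytes (assumed behavior)."""
    return np.load(io.BytesIO(data), allow_pickle=False)


original = np.arange(6, dtype=np.float32).reshape(2, 3)
restored = bytes_to_array(array_to_bytes(original))
```

Because the `.npy` header carries dtype and shape, the restored array should match the original without any extra metadata being stored alongside it.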

### Key Design Patterns

**Sample Type Definition**

Two approaches for defining sample types:

```python
from dataclasses import dataclass

import atdata
from numpy.typing import NDArray

# Approach 1: Explicit inheritance
@dataclass
class MySample(atdata.PackableSample):
    field1: str
    field2: NDArray

# Approach 2: Decorator (recommended)
@atdata.packable
class MySample:
    field1: str
    field2: NDArray
```

**NDArray Handling**

Fields annotated as `NDArray` or `NDArray | None` are automatically:
- Converted from bytes during deserialization
- Converted to bytes during serialization (via `_helpers.array_to_bytes`)
- Handled by the `_ensure_good()` method called from `PackableSample.__post_init__`

**Lens Transformations**

Lenses enable viewing datasets through different type schemas:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    return SourceType(...)

# Use with datasets
ds = atdata.Dataset[SourceType](url).as_type(ViewType)
```

The `LensNetwork` singleton (in `lens.py:183`) maintains a global registry of all lenses decorated with `@lens`.
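
To make the registry idea concrete, here is a rough, self-contained sketch of how a decorator could pair a getter with a putter and file the result under its source and view types. It is an assumption about the general pattern, not the code in `lens.py`: `Lens`, `_REGISTRY`, `lens`, `Full`, and `NameOnly` are all hypothetical stand-ins defined here.

```python
from dataclasses import dataclass, replace
from typing import Generic, TypeVar, get_type_hints

S = TypeVar("S")
V = TypeVar("V")


class Lens(Generic[S, V]):
    """Pairs a getter (source -> view) with an optional putter."""

    def __init__(self, getter):
        self.get = getter
        self.put = None

    def putter(self, fn):
        # Used as a decorator: attaches the view -> source direction.
        self.put = fn
        return fn


# Hypothetical global registry keyed by (source type, view type).
_REGISTRY: dict[tuple[type, type], Lens] = {}


def lens(getter):
    """Register a getter based on its type annotations."""
    hints = get_type_hints(getter)
    view_type = hints.pop("return")
    (source_type,) = hints.values()  # exactly one parameter expected
    wrapped = Lens(getter)
    _REGISTRY[(source_type, view_type)] = wrapped
    return wrapped


@dataclass
class Full:
    name: str
    age: int


@dataclass
class NameOnly:
    name: str


@lens
def name_only(source: Full) -> NameOnly:
    return NameOnly(name=source.name)


@name_only.putter
def name_only_put(view: NameOnly, source: Full) -> Full:
    return replace(source, name=view.name)


found = _REGISTRY[(Full, NameOnly)]
view = found.get(Full(name="ada", age=36))
```

A call such as `as_type(ViewType)` could then look up the pair `(SourceType, ViewType)` in a registry like this one, though how the real `LensNetwork` resolves lenses is not shown in this file.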

**Batch Aggregation**

`SampleBatch` implements `__getattr__` to aggregate sample attributes on access:
- For `NDArray` fields: stacks into numpy array with batch dimension
- For other fields: creates list
- Results are cached in `_aggregate_cache`
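
The aggregation rules above can be sketched in a few lines. This is an illustrative stand-in, not `SampleBatch` itself: the `Batch` and `Point` classes are hypothetical, and the sketch decides whether to stack by checking values with `isinstance` rather than by inspecting field annotations, which is an assumption.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Point:
    coords: np.ndarray
    label: str


class Batch:
    """Aggregates attributes across samples on first access, then caches."""

    def __init__(self, samples):
        self._samples = list(samples)
        self._aggregate_cache = {}

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so attributes set
        # in __init__ never reach this method. The guard avoids recursion
        # if a private attribute is requested before __init__ has run.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._aggregate_cache:
            values = [getattr(s, name) for s in self._samples]
            if values and all(isinstance(v, np.ndarray) for v in values):
                values = np.stack(values)  # adds a leading batch dimension
            self._aggregate_cache[name] = values
        return self._aggregate_cache[name]


batch = Batch(Point(np.full(3, i, dtype=float), f"p{i}") for i in range(4))
```

Here `batch.coords` has shape `(4, 3)` and `batch.label` is a plain list of four strings; repeated access returns the cached object rather than restacking.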

### Dataset URLs

Datasets use WebDataset brace-notation URLs:
- Single shard: `path/to/file-000000.tar`
- Multiple shards: `path/to/file-{000000..000009}.tar`
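
For readers unfamiliar with brace notation, the sketch below shows what a numeric range expands to. It is a simplified, standard-library illustration written for this note: `expand_shards` is a hypothetical helper that handles only a single `{start..end}` range, and it is not how WebDataset performs the expansion internally.

```python
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(url: str) -> list[str]:
    """Expand the first {start..end} numeric range, keeping zero padding."""
    match = _RANGE.search(url)
    if match is None:
        return [url]  # single shard, nothing to expand
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the padding width of the start value
    prefix = url[: match.start()]
    suffix = url[match.end():]
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_shards("path/to/file-{000000..000009}.tar")
```

The range is inclusive at both ends, so the example pattern names ten shards, from `file-000000.tar` through `file-000009.tar`.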

### Important Implementation Details

**Type Parameters**

The codebase uses Python 3.12+ generics heavily:
- `Dataset[ST]` where `ST` is the sample type
- `SampleBatch[DT]` where `DT` is the sample type
- Uses `__orig_class__.__args__[0]` at runtime to extract type parameters
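
The `__orig_class__` lookup has a timing subtlety worth seeing in isolation. The sketch below uses a hypothetical `TypedBox` class as a stand-in for something like `Dataset[ST]`, and uses `typing.Generic` with a `TypeVar` so that it also runs on interpreters older than 3.12.

```python
from typing import Generic, TypeVar

ST = TypeVar("ST")


class TypedBox(Generic[ST]):
    """Hypothetical stand-in for a generic class such as Dataset[ST]."""

    @property
    def sample_type(self) -> type:
        # typing sets __orig_class__ only after __init__ returns, and only
        # when the instance was created through a subscripted alias such
        # as TypedBox[int](). Reading it lazily avoids both pitfalls.
        return self.__orig_class__.__args__[0]


box = TypedBox[int]()
```

Accessing `sample_type` on an instance created as plain `TypedBox()` raises `AttributeError`, which is why code relying on this pattern needs the type parameter to be supplied at construction.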

**Serialization Flow**

1. Sample → `as_wds` property → dict with `__key__` and `msgpack` bytes
2. Msgpack bytes created by `packed` property calling `_make_packable()` on fields
3. Deserialization: `from_bytes()` → `from_data()` → `__init__` → `_ensure_good()`

**WebDataset Integration**

- Uses `wds.ShardWriter` / `wds.TarWriter` for writing
- Dataset iteration via `wds.DataPipeline` with custom `wrap()` / `wrap_batch()` methods
- Supports `ordered()` and `shuffled()` iteration modes

## Testing Notes

- Tests use parametrization heavily via `@pytest.mark.parametrize`
- Test cases cover both decorator and inheritance syntax
- Temporary WebDataset tar files created in `tmp_path` fixture
- Tests verify both serialization and batch aggregation behavior
- Lens tests verify well-behavedness (GetPut/PutGet laws)
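
For reference, the two laws named in the last bullet can be checked as follows. This is a minimal sketch with plain dataclasses and free functions, not the project's test code: `Person`, `NameView`, `get`, `put`, and `check_lens_laws` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Person:
    name: str
    age: int


@dataclass(frozen=True)
class NameView:
    name: str


def get(source: Person) -> NameView:
    return NameView(name=source.name)


def put(view: NameView, source: Person) -> Person:
    return replace(source, name=view.name)


def check_lens_laws(source: Person, view: NameView) -> bool:
    # GetPut: putting back what was just read leaves the source unchanged.
    get_put = put(get(source), source) == source
    # PutGet: reading after a put returns exactly the view that was written.
    put_get = get(put(view, source)) == view
    return get_put and put_get


ok = check_lens_laws(Person("ada", 36), NameView("grace"))
```

A putter that discarded fields outside the view (for example, resetting `age`) would break GetPut, which is the kind of regression such a test is meant to catch.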

README.md (+155 -1)

# atdata

[Codecov coverage report](https://codecov.io/gh/foundation-ac/atdata)

A loose federation of distributed, typed datasets built on WebDataset.

**atdata** provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.

## Features

- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
- **Automatic Batching** - Smart batch aggregation with numpy array stacking
- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets

## Installation

```bash
pip install atdata
```

Requires Python 3.12 or later.

## Quick Start

### Defining Sample Types

Use the `@packable` decorator to create typed dataset samples:

```python
import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
```

### Creating Datasets

```python
# Create a dataset
dataset = atdata.Dataset[ImageSample]("path/to/data-{000000..000009}.tar")

# Iterate over samples in order
for sample in dataset.ordered(batch_size=None):
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in dataset.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    # process_batch is a placeholder for your own processing step
    process_batch(batch.image, batch.label)
```

### Lens Transformations

Define reusable transformations between sample types:

```python
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    # extract_features is a placeholder for your own feature extractor
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
```

## Core Concepts

### PackableSample

Base class for serializable samples. Fields annotated as `NDArray` are serialized and deserialized automatically:

```python
@atdata.packable
class MySample:
    array_field: NDArray  # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
```

### Lens

Bidirectional transformations with getter/putter semantics:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
```

### Dataset URLs

Uses WebDataset brace expansion for sharded datasets:

- Single file: `"data/dataset-000000.tar"`
- Multiple shards: `"data/dataset-{000000..000099}.tar"`
- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`

## Development

### Setup

```bash
# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync
```

### Testing

```bash
# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_dataset.py

# Run single test
pytest tests/test_lens.py::test_lens
```

### Building

```bash
uv build
```

## Contributing

Contributions are welcome! This project is in beta, so the API may still evolve.

## License

This project is licensed under the Mozilla Public License 2.0. See [LICENSE](LICENSE) for details.