A loose federation of distributed, typed datasets
Merge branch 'feature/let-claude-do-the-docs'

+323 -4
+9 -3
.github/workflows/uv-test.yml
···
         # TODO Better to use --locked for author control over versions?
         # run: uv sync --locked --all-extras --dev

-      - name: Run tests
-        # For example, using `pytest`
-        run: uv run pytest tests
+      - name: Run tests with coverage
+        run: uv run pytest --cov=atdata --cov-report=xml --cov-report=term
+
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v5
+        with:
+          file: ./coverage.xml
+          fail_ci_if_error: false
+          token: ${{ secrets.CODECOV_TOKEN }}


       #
+10
.vscode/settings.json
···
+{
+    "cSpell.words": [
+        "atdata",
+        "getattr",
+        "msgpack",
+        "pypi",
+        "pyproject",
+        "pytest"
+    ]
+}
+149
CLAUDE.md
···
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+`atdata` is a Python library that implements a loose federation of distributed, typed datasets built on top of WebDataset. It provides:
+
+- **Typed samples** with automatic serialization via msgpack
+- **Lens-based transformations** between different dataset schemas
+- **Batch aggregation** with automatic numpy array stacking
+- **WebDataset integration** for efficient large-scale dataset storage
+
+## Development Commands
+
+### Environment Setup
+```bash
+# Uses uv for dependency management
+python -m pip install uv  # if not already installed
+uv sync
+```
+
+### Testing
+```bash
+# Run all tests with coverage
+pytest
+
+# Run specific test file
+pytest tests/test_dataset.py
+pytest tests/test_lens.py
+
+# Run single test
+pytest tests/test_dataset.py::test_create_sample
+pytest tests/test_lens.py::test_lens
+```
+
+### Building
+```bash
+# Build the package
+uv build
+```
+
+## Architecture
+
+### Core Components
+
+The codebase has three main modules under `src/atdata/`:
+
+1. **dataset.py** - Core dataset and sample infrastructure
+   - `PackableSample`: Base class for samples that can be serialized with msgpack
+   - `Dataset[ST]`: Generic typed dataset wrapping WebDataset tar files
+   - `SampleBatch[DT]`: Automatic batching with attribute aggregation
+   - `@packable` decorator: Converts dataclasses into PackableSample subclasses
+
+2. **lens.py** - Type transformation system
+   - `Lens[S, V]`: Bidirectional transformations between sample types (getter/putter)
+   - `LensNetwork`: Singleton registry for lens transformations
+   - `@lens` decorator: Registers lens getters globally
+
+3. **_helpers.py** - Serialization utilities
+   - `array_to_bytes()` / `bytes_to_array()`: numpy array serialization
+
+### Key Design Patterns
+
+**Sample Type Definition**
+
+Two approaches for defining sample types:
+
+```python
+# Approach 1: Explicit inheritance
+@dataclass
+class MySample(atdata.PackableSample):
+    field1: str
+    field2: NDArray
+
+# Approach 2: Decorator (recommended)
+@atdata.packable
+class MySample:
+    field1: str
+    field2: NDArray
+```
+
+**NDArray Handling**
+
+Fields annotated as `NDArray` or `NDArray | None` are automatically:
+- Converted from bytes during deserialization
+- Converted to bytes during serialization (via `_helpers.array_to_bytes`)
+- Handled by `_ensure_good()` method in `PackableSample.__post_init__`
+
+**Lens Transformations**
+
+Lenses enable viewing datasets through different type schemas:
+
+```python
+@atdata.lens
+def my_lens(source: SourceType) -> ViewType:
+    return ViewType(...)
+
+@my_lens.putter
+def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
+    return SourceType(...)
+
+# Use with datasets
+ds = atdata.Dataset[SourceType](url).as_type(ViewType)
+```
+
+The `LensNetwork` singleton (in `lens.py:183`) maintains a global registry of all lenses decorated with `@lens`.
+
+**Batch Aggregation**
+
+`SampleBatch` uses `__getattr__` magic to aggregate sample attributes:
+- For `NDArray` fields: stacks into numpy array with batch dimension
+- For other fields: creates list
+- Results are cached in `_aggregate_cache`
+
+### Dataset URLs
+
+Datasets use WebDataset brace-notation URLs:
+- Single shard: `path/to/file-000000.tar`
+- Multiple shards: `path/to/file-{000000..000009}.tar`
+
+### Important Implementation Details
+
+**Type Parameters**
+
+The codebase uses Python 3.12+ generics heavily:
+- `Dataset[ST]` where `ST` is the sample type
+- `SampleBatch[DT]` where `DT` is the sample type
+- Uses `__orig_class__.__args__[0]` at runtime to extract type parameters
+
+**Serialization Flow**
+
+1. Sample → `as_wds` property → dict with `__key__` and `msgpack` bytes
+2. Msgpack bytes created by `packed` property calling `_make_packable()` on fields
+3. Deserialization: `from_bytes()` → `from_data()` → `__init__` → `_ensure_good()`
+
+**WebDataset Integration**
+
+- Uses `wds.ShardWriter` / `wds.TarWriter` for writing
+- Dataset iteration via `wds.DataPipeline` with custom `wrap()` / `wrap_batch()` methods
+- Supports `ordered()` and `shuffled()` iteration modes
+
+## Testing Notes
+
+- Tests use parametrization heavily via `@pytest.mark.parametrize`
+- Test cases cover both decorator and inheritance syntax
+- Temporary WebDataset tar files created in `tmp_path` fixture
+- Tests verify both serialization and batch aggregation behavior
+- Lens tests verify well-behavedness (GetPut/PutGet laws)
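The GetPut and PutGet laws named in the testing notes above can be illustrated without `atdata` itself. The following is a minimal standalone sketch using plain dataclasses. `FullSample`, `LabelView`, `get_label`, and `put_label` are hypothetical names chosen for illustration; the sketch does not use the `@atdata.lens` decorator and is not taken from the repository's test suite. It only follows the putter argument order `(view, source)` shown in the diff.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FullSample:
    """Hypothetical source type."""
    label: str
    score: float


@dataclass(frozen=True)
class LabelView:
    """Hypothetical view type exposing only the label."""
    label: str


def get_label(source: FullSample) -> LabelView:
    # Getter: project the source onto the view.
    return LabelView(label=source.label)


def put_label(view: LabelView, source: FullSample) -> FullSample:
    # Putter: write the view back, keeping fields the view does not cover.
    return replace(source, label=view.label)


def satisfies_get_put(source: FullSample) -> bool:
    # GetPut: putting back what was just read leaves the source unchanged.
    return put_label(get_label(source), source) == source


def satisfies_put_get(view: LabelView, source: FullSample) -> bool:
    # PutGet: reading after a write returns exactly what was written.
    return get_label(put_label(view, source)) == view


source = FullSample(label="cat", score=0.9)
view = LabelView(label="dog")
print(satisfies_get_put(source), satisfies_put_get(view, source))
```

A putter that discarded `source` and rebuilt it from the view alone would still satisfy PutGet but would fail GetPut whenever the source carries fields the view omits, which is why both laws are worth checking.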
+155 -1
README.md
···
 # atdata
-A loose federation of distributed, typed datasets
+
+[![codecov](https://codecov.io/gh/foundation-ac/atdata/branch/main/graph/badge.svg)](https://codecov.io/gh/foundation-ac/atdata)
+
+A loose federation of distributed, typed datasets built on WebDataset.
+
+**atdata** provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.
+
+## Features
+
+- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
+- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
+- **Automatic Batching** - Smart batch aggregation with numpy array stacking
+- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets
+
+## Installation
+
+```bash
+pip install atdata
+```
+
+Requires Python 3.12 or later.
+
+## Quick Start
+
+### Defining Sample Types
+
+Use the `@packable` decorator to create typed dataset samples:
+
+```python
+import atdata
+from numpy.typing import NDArray
+
+@atdata.packable
+class ImageSample:
+    image: NDArray
+    label: str
+    metadata: dict
+```
+
+### Creating Datasets
+
+```python
+# Create a dataset
+dataset = atdata.Dataset[ImageSample]("path/to/data-{000000..000009}.tar")
+
+# Iterate over samples in order
+for sample in dataset.ordered(batch_size=None):
+    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")
+
+# Iterate with shuffling and batching
+for batch in dataset.shuffled(batch_size=32):
+    # batch.image is automatically stacked into shape (32, ...)
+    # batch.label is a list of 32 labels
+    process_batch(batch.image, batch.label)
+```
+
+### Lens Transformations
+
+Define reusable transformations between sample types:
+
+```python
+@atdata.packable
+class ProcessedSample:
+    features: NDArray
+    label: str
+
+@atdata.lens
+def preprocess(sample: ImageSample) -> ProcessedSample:
+    features = extract_features(sample.image)
+    return ProcessedSample(features=features, label=sample.label)
+
+# Apply lens to view dataset as ProcessedSample
+processed_ds = dataset.as_type(ProcessedSample)
+
+for sample in processed_ds.ordered(batch_size=None):
+    # sample is now a ProcessedSample
+    print(sample.features.shape)
+```
+
+## Core Concepts
+
+### PackableSample
+
+Base class for serializable samples. Fields annotated as `NDArray` are automatically handled:
+
+```python
+@atdata.packable
+class MySample:
+    array_field: NDArray  # Automatically serialized
+    optional_array: NDArray | None
+    regular_field: str
+```
+
+### Lens
+
+Bidirectional transformations with getter/putter semantics:
+
+```python
+@atdata.lens
+def my_lens(source: SourceType) -> ViewType:
+    # Transform source -> view
+    return ViewType(...)
+
+@my_lens.putter
+def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
+    # Transform view -> source
+    return SourceType(...)
+```
+
+### Dataset URLs
+
+Uses WebDataset brace expansion for sharded datasets:
+
+- Single file: `"data/dataset-000000.tar"`
+- Multiple shards: `"data/dataset-{000000..000099}.tar"`
+- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`
+
+## Development
+
+### Setup
+
+```bash
+# Install uv if not already available
+python -m pip install uv
+
+# Install dependencies
+uv sync
+```
+
+### Testing
+
+```bash
+# Run all tests with coverage
+pytest
+
+# Run specific test file
+pytest tests/test_dataset.py
+
+# Run single test
+pytest tests/test_lens.py::test_lens
+```
+
+### Building
+
+```bash
+uv build
+```
+
+## Contributing
+
+Contributions are welcome! This project is in beta, so the API may still evolve.
+
+## License
+
+This project is licensed under the Mozilla Public License 2.0. See [LICENSE](LICENSE) for details.
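To make the brace notation in the README's Dataset URLs section concrete, here is a rough sketch of how a numeric range pattern unfolds into explicit shard paths. `expand_shards` is a hypothetical helper written only for illustration. It handles a single zero-padded `{start..end}` range, ignores comma alternatives such as `{train,val}`, and should not be read as the way WebDataset or `atdata` actually performs expansion.

```python
import re


def expand_shards(pattern: str) -> list[str]:
    """Expand one zero-padded {start..end} range into explicit shard paths."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        # No range present: the pattern already names a single shard.
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero padding of the start index
    return [
        pattern[: match.start()] + str(i).zfill(width) + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_shards("data/dataset-{000000..000003}.tar")
print(shards)
```

The range is inclusive at both ends, so `{000000..000009}` names ten shards and `{000000..000099}` names one hundred.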