# atdata

[codecov](https://codecov.io/gh/foundation-ac/atdata)

A loose federation of distributed, typed datasets built on WebDataset.

**atdata** provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.
8
## Features

- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- **Schema-free Exploration** - Load datasets without defining a schema first using `DictSample`
- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
- **Automatic Batching** - Smart batch aggregation with numpy array stacking
- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets
- **Flexible Data Sources** - Stream from local files, HTTP URLs, or S3-compatible storage
- **HuggingFace-style API** - `load_dataset()` with path resolution and split handling
- **Local & Atmosphere Storage** - Index datasets locally with Redis or publish to the ATProto network
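The batch-aggregation behavior can be pictured in plain numpy: array-valued fields are stacked along a new leading axis, while other fields are collected into lists. This is a minimal sketch of the idea, not atdata's actual implementation; the `collate` helper and field names here are hypothetical:

```python
import numpy as np

def collate(samples: list[dict]) -> dict:
    """Aggregate samples: stack numpy arrays, collect other values into lists."""
    batch = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], np.ndarray):
            batch[key] = np.stack(values)  # adds a new leading batch axis
        else:
            batch[key] = values            # plain list, one entry per sample
    return batch

samples = [
    {"image": np.zeros((28, 28)), "label": "cat"},
    {"image": np.ones((28, 28)), "label": "dog"},
]
batch = collate(samples)
print(batch["image"].shape)  # (2, 28, 28)
print(batch["label"])        # ['cat', 'dog']
```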
19
## Installation

```bash
pip install atdata
```

Requires Python 3.12 or later.
27
## Quick Start

### Loading Datasets

The primary way to load datasets is with `load_dataset()`:

```python
from atdata import load_dataset

# Load without specifying a type - returns Dataset[DictSample]
ds = load_dataset("path/to/data.tar", split="train")

# Explore the data
for sample in ds.ordered():
    print(sample.keys())   # See available fields
    print(sample["text"])  # Dict-style access
    print(sample.label)    # Attribute access
    break
```
47
### Defining Typed Schemas

Once you understand your data, define a typed schema with `@packable`:

```python
import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
```
62
### Loading with Types

```python
# Load with explicit type
ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train")

# Or convert from DictSample
ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample)

# Iterate over samples
for sample in ds.ordered():
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in ds.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
```
82
### Lens Transformations

Define reusable transformations between sample types:

```python
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
```
105
## Core Concepts

### DictSample

The default sample type for schema-free exploration. Provides both attribute and dict-style access:

```python
ds = load_dataset("data.tar", split="train")

for sample in ds.ordered():
    # Dict-style access
    print(sample["field_name"])

    # Attribute access
    print(sample.field_name)

    # Introspection
    print(sample.keys())
    print(sample.to_dict())
```
126
### PackableSample

Base class for typed, serializable samples. Fields annotated as `NDArray` are automatically handled:

```python
@atdata.packable
class MySample:
    array_field: NDArray  # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
```

Every `@packable` class automatically registers a lens from `DictSample`, enabling seamless conversion via `.as_type()`.
140
### Lens

Bidirectional transformations with getter/putter semantics:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
```
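The getter projects a view out of a source; the putter writes a (possibly edited) view back while preserving the source's other fields. A self-contained sketch of these semantics in plain Python, using hypothetical `Source`/`View` types rather than atdata's classes:

```python
from dataclasses import dataclass, replace

@dataclass
class Source:
    name: str
    score: float

@dataclass
class View:
    name: str

def get(source: Source) -> View:
    # Getter: project the view out of the source
    return View(name=source.name)

def put(view: View, source: Source) -> Source:
    # Putter: write the view back, keeping unrelated fields intact
    return replace(source, name=view.name)

src = Source(name="a", score=0.5)
updated = put(View(name="b"), src)
print(updated)  # Source(name='b', score=0.5)

# Classic lens laws: put-get restores the source, get-put restores the view
assert put(get(src), src) == src
assert get(put(View(name="b"), src)) == View(name="b")
```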
156
### Data Sources

Datasets support multiple backends via the `DataSource` protocol:

```python
# String URLs (most common) - automatically wrapped in URLSource
dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")

# S3 with authentication (private buckets, Cloudflare R2, MinIO)
source = atdata.S3Source(
    bucket="my-bucket",
    keys=["data-000000.tar", "data-000001.tar"],
    endpoint="https://my-account.r2.cloudflarestorage.com",
    access_key="...",
    secret_key="...",
)
dataset = atdata.Dataset[ImageSample](source)
```
175
### Dataset URLs

Uses WebDataset brace expansion for sharded datasets:

- Single file: `"data/dataset-000000.tar"`
- Multiple shards: `"data/dataset-{000000..000099}.tar"`
- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`
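A numeric range like `{000000..000099}` expands to one URL per shard, preserving zero-padding. To make the expansion rule concrete, here is a rough illustration in plain Python that handles a single numeric range (WebDataset itself relies on the `braceexpand` package, which also covers comma lists and nesting):

```python
import re

def expand_range(pattern: str) -> list[str]:
    """Expand one zero-padded {start..end} range into concrete shard names."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]  # no range present: the pattern is a single file
    start, end = m.group(1), m.group(2)
    width = len(start)  # zero-padding width is taken from the start bound
    return [
        pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(start), int(end) + 1)
    ]

urls = expand_range("data/dataset-{000000..000002}.tar")
print(urls)
# ['data/dataset-000000.tar', 'data/dataset-000001.tar', 'data/dataset-000002.tar']
```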
183
### HuggingFace-style API

Load datasets with a familiar interface:

```python
from atdata import load_dataset

# Load without type for exploration (returns Dataset[DictSample])
ds = load_dataset("./data/train-*.tar", split="train")

# Load with explicit type
ds = load_dataset("./data/train-*.tar", ImageSample, split="train")

# Load from S3 with brace notation
ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train")

# Load all splits (returns DatasetDict)
ds_dict = load_dataset("./data", ImageSample)
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Convert DictSample to a typed schema
ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample)
```
208
## Development

### Setup

```bash
# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync
```

### Testing

```bash
# Run all tests with coverage
uv run pytest

# Run a specific test file
uv run pytest tests/test_dataset.py

# Run a single test
uv run pytest tests/test_lens.py::test_lens
```

### Building

```bash
uv build
```
239
## Contributing

Contributions are welcome! This project is in beta, so the API may still evolve.

## License

This project is licensed under the Mozilla Public License 2.0. See [LICENSE](LICENSE) for details.