# atdata

[![codecov](https://codecov.io/gh/foundation-ac/atdata/branch/main/graph/badge.svg)](https://codecov.io/gh/foundation-ac/atdata)

A loose federation of distributed, typed datasets built on WebDataset.

**atdata** provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.

## Features

- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- **Schema-free Exploration** - Load datasets without defining a schema first using `DictSample`
- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
- **Automatic Batching** - Smart batch aggregation with numpy array stacking
- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets
- **Flexible Data Sources** - Stream from local files, HTTP URLs, or S3-compatible storage
- **HuggingFace-style API** - `load_dataset()` with path resolution and split handling
- **Local & Atmosphere Storage** - Index datasets locally with Redis or publish to the ATProto network

## Installation

```bash
pip install atdata
```

Requires Python 3.12 or later.
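atdata stores data in WebDataset's tar-based shard format, in which a shard is an ordinary tar file and each sample is a group of entries sharing a basename key, distinguished by extension. A stdlib-only sketch of that layout (file names and fields here are illustrative, not produced by atdata itself):

```python
import io
import json
import tarfile

# Two toy samples; each field becomes one tar entry per sample.
samples = [
    {"txt": "a cat", "json": json.dumps({"label": "cat"})},
    {"txt": "a dog", "json": json.dumps({"label": "dog"})},
]

# Write a WebDataset-style shard in memory: entries are named
# "<key>.<ext>", and entries with the same key form one sample.
shard = io.BytesIO()
with tarfile.open(fileobj=shard, mode="w") as tar:
    for i, fields in enumerate(samples):
        for ext, value in fields.items():
            data = value.encode()
            info = tarfile.TarInfo(name=f"{i:06d}.{ext}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Read it back: consecutive entries sharing a basename belong together.
shard.seek(0)
with tarfile.open(fileobj=shard, mode="r") as tar:
    names = tar.getnames()

print(names)  # ['000000.txt', '000000.json', '000001.txt', '000001.json']
```

Because a shard is plain tar, any sample written this way can also be inspected with standard tools such as `tar -tf`.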
## Quick Start

### Loading Datasets

The primary way to load datasets is with `load_dataset()`:

```python
from atdata import load_dataset

# Load without specifying a type - returns Dataset[DictSample]
ds = load_dataset("path/to/data.tar", split="train")

# Explore the data
for sample in ds.ordered():
    print(sample.keys())   # See available fields
    print(sample["text"])  # Dict-style access
    print(sample.label)    # Attribute access
    break
```

### Defining Typed Schemas

Once you understand your data, define a typed schema with `@packable`:

```python
import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
```

### Loading with Types

```python
# Load with explicit type
ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train")

# Or convert from DictSample
ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample)

# Iterate over samples
for sample in ds.ordered():
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in ds.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
```
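The automatic stacking described above can be pictured as a simple collate step: numpy-array fields are stacked along a new leading batch axis, while everything else is collected into a plain list. A sketch of that behaviour (not atdata's actual implementation):

```python
import numpy as np

def collate(samples: list[dict]) -> dict:
    """Aggregate sample dicts into one batch dict: numpy arrays are
    stacked along a new leading axis, other fields become lists.
    (Illustrative sketch of the batching behaviour, not atdata code.)"""
    batch = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], np.ndarray):
            batch[key] = np.stack(values)  # shape (batch_size, ...)
        else:
            batch[key] = values
    return batch

samples = [{"image": np.zeros((8, 8)), "label": "cat"},
           {"image": np.ones((8, 8)), "label": "dog"}]
batch = collate(samples)
print(batch["image"].shape)  # (2, 8, 8)
print(batch["label"])        # ['cat', 'dog']
```

Note that `np.stack` requires every array in the batch to have the same shape, which is why image datasets are typically resized or padded before batching.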
### Lens Transformations

Define reusable transformations between sample types:

```python
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
```

## Core Concepts

### DictSample

The default sample type for schema-free exploration. Provides both attribute and dict-style access:

```python
ds = load_dataset("data.tar", split="train")

for sample in ds.ordered():
    # Dict-style access
    print(sample["field_name"])

    # Attribute access
    print(sample.field_name)

    # Introspection
    print(sample.keys())
    print(sample.to_dict())
```

### PackableSample

Base class for typed, serializable samples. Fields annotated as `NDArray` are automatically handled:

```python
@atdata.packable
class MySample:
    array_field: NDArray  # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
```

Every `@packable` class automatically registers a lens from `DictSample`, enabling seamless conversion via `.as_type()`.

### Lens

Bidirectional transformations with getter/putter semantics:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
```
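A getter/putter pair behaves predictably only when it satisfies the usual lens round-trip laws. A decorator-free sketch with hypothetical types, using plain dataclasses:

```python
from dataclasses import dataclass, replace

@dataclass
class Source:
    image: bytes
    label: str

@dataclass
class View:
    label: str

# Getter: project the source down to the view.
def get(source: Source) -> View:
    return View(label=source.label)

# Putter: write an (edited) view back into the original source,
# leaving fields outside the view untouched.
def put(view: View, source: Source) -> Source:
    return replace(source, label=view.label)

src = Source(image=b"...", label="cat")

# GetPut law: putting back an unmodified view restores the source.
assert put(get(src), src) == src

# PutGet law: getting right after a put returns the view that was written.
edited = View(label="dog")
assert get(put(edited, src)) == edited
```

These two laws are what make a lens safely bidirectional: edits made through the view round-trip without clobbering the parts of the source the view does not mention.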
### Data Sources

Datasets support multiple backends via the `DataSource` protocol:

```python
# String URLs (most common) - automatically wrapped in URLSource
dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")

# S3 with authentication (private buckets, Cloudflare R2, MinIO)
source = atdata.S3Source(
    bucket="my-bucket",
    keys=["data-000000.tar", "data-000001.tar"],
    endpoint="https://my-account.r2.cloudflarestorage.com",
    access_key="...",
    secret_key="...",
)
dataset = atdata.Dataset[ImageSample](source)
```

### Dataset URLs

Uses WebDataset brace expansion for sharded datasets:

- Single file: `"data/dataset-000000.tar"`
- Multiple shards: `"data/dataset-{000000..000099}.tar"`
- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`

### HuggingFace-style API

Load datasets with a familiar interface:

```python
from atdata import load_dataset

# Load without type for exploration (returns Dataset[DictSample])
ds = load_dataset("./data/train-*.tar", split="train")

# Load with explicit type
ds = load_dataset("./data/train-*.tar", ImageSample, split="train")

# Load from S3 with brace notation
ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train")

# Load all splits (returns DatasetDict)
ds_dict = load_dataset("./data", ImageSample)
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Convert DictSample to typed schema
ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample)
```

## Development

### Setup

```bash
# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync
```

### Testing

```bash
# Run all tests with coverage
uv run pytest

# Run specific test file
uv run pytest tests/test_dataset.py

# Run single test
uv run pytest tests/test_lens.py::test_lens
```
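Tests are ordinary pytest functions, which is why a single test can be selected by node ID as shown above. A minimal sketch of the shape a new test might take (the file name and helper functions are hypothetical, not part of atdata's real test suite):

```python
# tests/test_roundtrip.py -- hypothetical example file.
import json

def pack(sample: dict) -> bytes:
    # Stand-in for a real serializer such as msgpack.
    return json.dumps(sample).encode()

def unpack(data: bytes) -> dict:
    return json.loads(data.decode())

def test_pack_roundtrip():
    # pytest collects any test_* function; bare asserts are the checks.
    sample = {"label": "cat", "score": 0.9}
    assert unpack(pack(sample)) == sample
```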
test 231uv run pytest tests/test_lens.py::test_lens 232``` 233 234### Building 235 236```bash 237uv build 238``` 239 240## Contributing 241 242Contributions are welcome! This project is in beta, so the API may still evolve. 243 244## License 245 246This project is licensed under the Mozilla Public License 2.0. See [LICENSE](LICENSE) for details.