# atdata

[![codecov](https://codecov.io/gh/foundation-ac/atdata/branch/main/graph/badge.svg)](https://codecov.io/gh/foundation-ac/atdata)

A loose federation of distributed, typed datasets built on WebDataset.

**atdata** provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.

## Features

- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- **Schema-free Exploration** - Load datasets without defining a schema first using `DictSample`
- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
- **Automatic Batching** - Smart batch aggregation with numpy array stacking
- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets
- **Flexible Data Sources** - Stream from local files, HTTP URLs, or S3-compatible storage
- **HuggingFace-style API** - `load_dataset()` with path resolution and split handling
- **Local & Atmosphere Storage** - Index datasets locally with Redis or publish to the ATProto network

## Installation

```bash
pip install atdata
```

Requires Python 3.12 or later.
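atdata stores data in WebDataset's tar-based shard format, in which a shard is an ordinary tar file and each sample is a group of entries sharing a basename key, distinguished by extension. A stdlib-only sketch of that layout (file names and fields here are illustrative, not produced by atdata itself):

```python
import io
import json
import tarfile

# Two toy samples; each field becomes one tar entry per sample.
samples = [
    {"txt": "a cat", "json": json.dumps({"label": "cat"})},
    {"txt": "a dog", "json": json.dumps({"label": "dog"})},
]

# Write a WebDataset-style shard in memory: entries are named
# "<key>.<ext>", and entries with the same key form one sample.
shard = io.BytesIO()
with tarfile.open(fileobj=shard, mode="w") as tar:
    for i, fields in enumerate(samples):
        for ext, value in fields.items():
            data = value.encode()
            info = tarfile.TarInfo(name=f"{i:06d}.{ext}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Read it back: consecutive entries sharing a basename belong together.
shard.seek(0)
with tarfile.open(fileobj=shard, mode="r") as tar:
    names = tar.getnames()

print(names)  # ['000000.txt', '000000.json', '000001.txt', '000001.json']
```

Because a shard is plain tar, any sample written this way can also be inspected with standard tools such as `tar -tf`.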
## Quick Start

### Loading Datasets

The primary way to load datasets is with `load_dataset()`:

```python
from atdata import load_dataset

# Load without specifying a type - returns Dataset[DictSample]
ds = load_dataset("path/to/data.tar", split="train")

# Explore the data
for sample in ds.ordered():
    print(sample.keys())   # See available fields
    print(sample["text"])  # Dict-style access
    print(sample.label)    # Attribute access
    break
```

### Defining Typed Schemas

Once you understand your data, define a typed schema with `@packable`:

```python
import atdata
from numpy.typing import NDArray

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict
```

### Loading with Types

```python
# Load with explicit type
ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train")

# Or convert from DictSample
ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample)

# Iterate over samples
for sample in ds.ordered():
    print(f"Label: {sample.label}, Image shape: {sample.image.shape}")

# Iterate with shuffling and batching
for batch in ds.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
```
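The automatic stacking described above can be pictured as a simple collate step: numpy-array fields are stacked along a new leading batch axis, while everything else is collected into a plain list. A sketch of that behaviour (not atdata's actual implementation):

```python
import numpy as np

def collate(samples: list[dict]) -> dict:
    """Aggregate sample dicts into one batch dict: numpy arrays are
    stacked along a new leading axis, other fields become lists.
    (Illustrative sketch of the batching behaviour, not atdata code.)"""
    batch = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], np.ndarray):
            batch[key] = np.stack(values)  # shape (batch_size, ...)
        else:
            batch[key] = values
    return batch

samples = [{"image": np.zeros((8, 8)), "label": "cat"},
           {"image": np.ones((8, 8)), "label": "dog"}]
batch = collate(samples)
print(batch["image"].shape)  # (2, 8, 8)
print(batch["label"])        # ['cat', 'dog']
```

Note that `np.stack` requires every array in the batch to have the same shape, which is why image datasets are typically resized or padded before batching.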
### Lens Transformations

Define reusable transformations between sample types:

```python
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str

@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)

# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)

for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
```

## Core Concepts

### DictSample

The default sample type for schema-free exploration. Provides both attribute and dict-style access:

```python
ds = load_dataset("data.tar", split="train")

for sample in ds.ordered():
    # Dict-style access
    print(sample["field_name"])

    # Attribute access
    print(sample.field_name)

    # Introspection
    print(sample.keys())
    print(sample.to_dict())
```

### PackableSample

Base class for typed, serializable samples. Fields annotated as `NDArray` are automatically handled:

```python
@atdata.packable
class MySample:
    array_field: NDArray  # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
```

Every `@packable` class automatically registers a lens from `DictSample`, enabling seamless conversion via `.as_type()`.

### Lens

Bidirectional transformations with getter/putter semantics:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)

@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
```
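A getter/putter pair behaves predictably only when it satisfies the usual lens round-trip laws. A decorator-free sketch with hypothetical types, using plain dataclasses:

```python
from dataclasses import dataclass, replace

@dataclass
class Source:
    image: bytes
    label: str

@dataclass
class View:
    label: str

# Getter: project the source down to the view.
def get(source: Source) -> View:
    return View(label=source.label)

# Putter: write an (edited) view back into the original source,
# leaving fields outside the view untouched.
def put(view: View, source: Source) -> Source:
    return replace(source, label=view.label)

src = Source(image=b"...", label="cat")

# GetPut law: putting back an unmodified view restores the source.
assert put(get(src), src) == src

# PutGet law: getting right after a put returns the view that was written.
edited = View(label="dog")
assert get(put(edited, src)) == edited
```

These two laws are what make a lens safely bidirectional: edits made through the view round-trip without clobbering the parts of the source the view does not mention.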
### Data Sources

Datasets support multiple backends via the `DataSource` protocol:

```python
# String URLs (most common) - automatically wrapped in URLSource
dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")

# S3 with authentication (private buckets, Cloudflare R2, MinIO)
source = atdata.S3Source(
    bucket="my-bucket",
    keys=["data-000000.tar", "data-000001.tar"],
    endpoint="https://my-account.r2.cloudflarestorage.com",
    access_key="...",
    secret_key="...",
)
dataset = atdata.Dataset[ImageSample](source)
```

### Dataset URLs

Uses WebDataset brace expansion for sharded datasets:

- Single file: `"data/dataset-000000.tar"`
- Multiple shards: `"data/dataset-{000000..000099}.tar"`
- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`

### HuggingFace-style API

Load datasets with a familiar interface:

```python
from atdata import load_dataset

# Load without type for exploration (returns Dataset[DictSample])
ds = load_dataset("./data/train-*.tar", split="train")

# Load with explicit type
ds = load_dataset("./data/train-*.tar", ImageSample, split="train")

# Load from S3 with brace notation
ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train")

# Load all splits (returns DatasetDict)
ds_dict = load_dataset("./data", ImageSample)
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Convert DictSample to typed schema
ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample)
```

## Development

### Setup

```bash
# Install uv if not already available
python -m pip install uv

# Install dependencies
uv sync
```

### Testing

```bash
# Run all tests with coverage
uv run pytest

# Run specific test file
uv run pytest tests/test_dataset.py

# Run single test
uv run pytest tests/test_lens.py::test_lens
```
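Tests are ordinary pytest functions, which is why a single test can be selected by node ID as shown above. A minimal sketch of the shape a new test might take (the file name and helper functions are hypothetical, not part of atdata's real test suite):

```python
# tests/test_roundtrip.py -- hypothetical example file.
import json

def pack(sample: dict) -> bytes:
    # Stand-in for a real serializer such as msgpack.
    return json.dumps(sample).encode()

def unpack(data: bytes) -> dict:
    return json.loads(data.decode())

def test_pack_roundtrip():
    # pytest collects any test_* function; bare asserts are the checks.
    sample = {"label": "cat", "score": 0.9}
    assert unpack(pack(sample)) == sample
```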
test 231uv run pytest tests/test_lens.py::test_lens 232``` 233 234### Building 235 236```bash 237uv build 238``` 239 240## Contributing 241 242Contributions are welcome! This project is in beta, so the API may still evolve. 243 244## License 245 246This project is licensed under the Mozilla Public License 2.0. See [LICENSE](LICENSE) for details.