atdata

A loose federation of distributed, typed datasets built on WebDataset.

The Challenge

Machine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:

  • Runtime surprises: Discovering a field is missing or has the wrong type during training
  • Copy-paste schemas: Redefining the same sample structure across notebooks and scripts
  • Storage silos: Data stuck in one location, invisible to collaborators
  • Discovery friction: No standard way to find datasets across teams or organizations

atdata solves these problems with a simple idea: typed, serializable samples that flow seamlessly from local development to team storage to federated sharing.

What is atdata?

atdata is a Python library that combines:

Typed Samples

Define dataclass-based sample types with automatic msgpack serialization. Catch schema errors at definition time, not training time.
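The core idea can be sketched in plain Python. This is a conceptual illustration only, not atdata's implementation: the `pack`/`unpack` helpers are hypothetical, and JSON stands in for msgpack to keep the sketch dependency-free.

```python
from dataclasses import dataclass, asdict, fields
import json

# Hypothetical sketch of what a "packable" dataclass provides:
# a typed definition plus symmetric pack/unpack helpers.
@dataclass
class LabelSample:
    label: str
    confidence: float

def pack(sample) -> bytes:
    # atdata uses msgpack; JSON stands in here for the sketch.
    return json.dumps(asdict(sample)).encode()

def unpack(cls, data: bytes):
    obj = json.loads(data.decode())
    # Keys are checked against the dataclass fields, so a missing or
    # extra field fails at load time rather than deep inside training.
    expected = {f.name for f in fields(cls)}
    if set(obj) != expected:
        raise ValueError(f"schema mismatch: {set(obj)} != {expected}")
    return cls(**obj)

s = LabelSample(label="cat", confidence=0.95)
assert unpack(LabelSample, pack(s)) == s
```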

Efficient Storage

Built on WebDataset’s proven tar-based format. Stream large datasets without downloading everything first.

Lens Transformations

View datasets through different schemas without duplicating data. Perfect for feature extraction, schema migration, and multi-task learning.
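Conceptually, a lens pairs a projection with an update between two schemas. The sketch below models that idea with plain dicts; the `Lens`, `get`, and `put` names are illustrative, not atdata's API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source schema
V = TypeVar("V")  # view schema

# A lens: project a source sample into a view (get), and write an
# edited view back into the source (put), without copying the dataset.
@dataclass
class Lens(Generic[S, V]):
    get: Callable[[S], V]
    put: Callable[[S, V], S]

# Source schema: full sample; view schema: just the label.
full = {"image": [0.1, 0.2], "label": "cat", "confidence": 0.95}

label_lens = Lens(
    get=lambda s: s["label"],
    put=lambda s, v: {**s, "label": v},
)

assert label_lens.get(full) == "cat"
relabeled = label_lens.put(full, "dog")
assert relabeled["label"] == "dog" and relabeled["image"] == full["image"]
```

Because the view is computed on the fly, the same underlying shards can serve a relabeling pass, a feature-extraction pipeline, and a multi-task trainer simultaneously.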

Batch Aggregation

Automatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train.
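The behavior described above amounts to the collation sketched below: array fields are stacked into one batch array, everything else is gathered into lists. The `collate` function is illustrative, not atdata's internal code.

```python
import numpy as np

# Sketch of automatic batch aggregation: NDArray fields stack into a
# single array of shape (batch, *field_shape); scalar fields become lists.
def collate(samples: list[dict]) -> dict:
    batch = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], np.ndarray):
            batch[key] = np.stack(values)
        else:
            batch[key] = values
    return batch

samples = [
    {"image": np.zeros((4, 4), dtype=np.float32), "label": "cat"}
    for _ in range(8)
]
batch = collate(samples)
assert batch["image"].shape == (8, 4, 4)
assert batch["label"] == ["cat"] * 8
```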

Team Storage

Redis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery.

ATProto Federation

Publish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure.
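Content-addressed identifiers are what make decentralized discovery verifiable: an identifier is derived from the data's bytes, so any party can check a copy independently. Real ATProto CIDs use multihash/multibase encoding; this sketch shows only the underlying idea.

```python
import hashlib

# The idea behind content addressing: the identifier is a hash of the
# bytes, so every copy of a dataset can be verified without trusting
# the host that served it. (Not the actual CID wire format.)
def content_id(data: bytes) -> str:
    return "sha256-" + hashlib.sha256(data).hexdigest()

blob = b"shard bytes ..."
cid = content_id(blob)
assert content_id(blob) == cid          # deterministic
assert content_id(b"tampered") != cid   # any change yields a new id
```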

The Architecture

atdata provides a three-layer progression for your datasets:

┌─────────────────────────────────────────────────────────────┐
│  Federation: ATProto Atmosphere                             │
│  Decentralized discovery, cross-org sharing                 │
└─────────────────────────────────────────────────────────────┘
                              ↑ promote
┌─────────────────────────────────────────────────────────────┐
│  Team Storage: Redis + S3                                   │
│  Shared index, versioned schemas, S3 data                   │
└─────────────────────────────────────────────────────────────┘
                              ↑ insert
┌─────────────────────────────────────────────────────────────┐
│  Local Development                                          │
│  Typed samples, WebDataset files, fast iteration            │
└─────────────────────────────────────────────────────────────┘

Start local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs.

Installation

pip install atdata

# With ATProto support
pip install atdata[atmosphere]

Quick Example

1. Define a Sample Type

The @packable decorator creates a serializable dataclass:

import numpy as np
from numpy.typing import NDArray
import atdata

@atdata.packable
class ImageSample:
    image: NDArray      # Automatically handled as bytes
    label: str
    confidence: float

2. Create and Write Samples

Use WebDataset’s standard TarWriter:

import webdataset as wds

samples = [
    ImageSample(
        image=np.random.rand(224, 224, 3).astype(np.float32),
        label="cat",
        confidence=0.95,
    )
    for _ in range(100)
]

with wds.TarWriter("data-000000.tar") as sink:
    for i, sample in enumerate(samples):
        sink.write({**sample.as_wds, "__key__": f"sample_{i:06d}"})

3. Load and Iterate with Type Safety

The generic Dataset[T] provides typed access:

dataset = atdata.Dataset[ImageSample]("data-000000.tar")

for batch in dataset.shuffled(batch_size=32):
    images = batch.image      # numpy array (32, 224, 224, 3)
    labels = batch.label      # list of 32 strings
    confs = batch.confidence  # list of 32 floats

Scaling Up

Team Storage with Redis + S3

When you’re ready to share with your team:

from atdata.local import LocalIndex, S3DataStore

# Connect to team infrastructure
store = S3DataStore(
    credentials={"AWS_ENDPOINT": "http://localhost:9000", ...},
    bucket="team-datasets",
)
index = LocalIndex(data_store=store)

# Publish schema for consistency
index.publish_schema(ImageSample, version="1.0.0")

# Insert dataset (writes to S3, indexes in Redis)
dataset = atdata.Dataset[ImageSample]("data.tar")
entry = index.insert_dataset(dataset, name="training-images-v1")

# Team members can now discover and load
# ds = atdata.load_dataset("@local/training-images-v1", index=index)

Federation with ATProto

For public or cross-organization sharing:

from atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore
from atdata.promote import promote_to_atmosphere

# Authenticate with your ATProto identity
client = AtmosphereClient()
client.login("handle.bsky.social", "app-password")

# Option 1: Promote existing local dataset
entry = index.get_dataset("training-images-v1")
at_uri = promote_to_atmosphere(entry, index, client)

# Option 2: Publish directly with blob storage
store = PDSBlobStore(client)
atm_index = AtmosphereIndex(client, data_store=store)
# schema_ref points at a previously published schema record
atm_index.insert_dataset(dataset, name="public-images", schema_ref=schema_uri)

HuggingFace-Style Loading

For convenient access to datasets:

from atdata import load_dataset

# Load from local files
ds = load_dataset("path/to/data-{000000..000009}.tar")

# Load with split detection
ds_dict = load_dataset("path/to/data/")
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]

# Load from index
ds = load_dataset("@local/my-dataset", index=index)
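The `{000000..000009}` pattern above is WebDataset-style brace notation for a range of shard files. WebDataset expands it with the `braceexpand` package; the minimal stand-in below handles only the numeric range form, to show what the pattern resolves to.

```python
import re

# Expand a WebDataset-style numeric brace range, e.g.
# "data-{000000..000009}.tar" -> ["data-000000.tar", ..., "data-000009.tar"].
# Minimal sketch: handles a single {AAAA..BBBB} range, zero-padded.
def expand_shards(pattern: str) -> list[str]:
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)
    return [
        pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(lo), int(hi) + 1)
    ]

shards = expand_shards("data-{000000..000009}.tar")
assert len(shards) == 10
assert shards[0] == "data-000000.tar"
assert shards[-1] == "data-000009.tar"
```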

Why atdata?

Need                            Solution
----                            --------
Type-safe samples               @packable decorator, PackableSample base class
Efficient large-scale storage   WebDataset tar format, streaming iteration
Schema flexibility              Lens transformations, DictSample for exploration
Team collaboration              Redis index, S3 data store, schema registry
Public sharing                  ATProto federation, content-addressable CIDs
Multiple backends               Protocol abstractions (AbstractIndex, DataSource)

Next Steps

Getting Started

New to atdata? Start with the Quick Start Tutorial to learn the basics of typed samples and datasets.