A loose federation of distributed, typed datasets
1[ 2 { 3 "objectID": "reference/protocols.html", 4 "href": "reference/protocols.html", 5 "title": "Protocols", 6 "section": "", 7 "text": "The protocols module defines abstract interfaces that enable interchangeable index backends (local Redis vs ATProto), data stores (S3 vs PDS blobs), and data sources (URL, S3, etc.).", 8 "crumbs": [ 9 "Guide", 10 "Reference", 11 "Protocols" 12 ] 13 }, 14 { 15 "objectID": "reference/protocols.html#overview", 16 "href": "reference/protocols.html#overview", 17 "title": "Protocols", 18 "section": "Overview", 19 "text": "Overview\nBoth local and atmosphere implementations solve the same problem: indexed dataset storage with external data URLs. These protocols formalize that common interface:\n\nIndexEntry: Common interface for dataset index entries\nAbstractIndex: Protocol for index operations\nAbstractDataStore: Protocol for data storage operations\nDataSource: Protocol for streaming data from various backends", 20 "crumbs": [ 21 "Guide", 22 "Reference", 23 "Protocols" 24 ] 25 }, 26 { 27 "objectID": "reference/protocols.html#indexentry-protocol", 28 "href": "reference/protocols.html#indexentry-protocol", 29 "title": "Protocols", 30 "section": "IndexEntry Protocol", 31 "text": "IndexEntry Protocol\nRepresents a dataset entry in any index:\n\nfrom atdata._protocols import IndexEntry\n\ndef process_entry(entry: IndexEntry) -> None:\n print(f\"Name: {entry.name}\")\n print(f\"Schema: {entry.schema_ref}\")\n print(f\"URLs: {entry.data_urls}\")\n print(f\"Metadata: {entry.metadata}\")\n\n\nProperties\n\n\n\nProperty\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable dataset name\n\n\nschema_ref\nstr\nSchema reference (local:// or at://)\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data\n\n\nmetadata\ndict \\| None\nArbitrary metadata dictionary\n\n\n\n\n\nImplementations\n\nLocalDatasetEntry (from atdata.local)\nAtmosphereIndexEntry (from atdata.atmosphere)", 32 "crumbs": [ 33 "Guide", 34 "Reference", 35 "Protocols" 36 ] 37 }, 38 { 39 "objectID": "reference/protocols.html#abstractindex-protocol", 40 "href": "reference/protocols.html#abstractindex-protocol", 41 "title": "Protocols", 42 "section": "AbstractIndex Protocol", 43 "text": "AbstractIndex Protocol\nDefines operations for managing schemas and datasets:\n\nfrom atdata._protocols import AbstractIndex\n\ndef list_all_datasets(index: AbstractIndex) -> None:\n \"\"\"Works with LocalIndex or AtmosphereIndex.\"\"\"\n for entry in index.list_datasets():\n print(f\"{entry.name}: {entry.schema_ref}\")\n\n\nDataset Operations\n\n# Insert a dataset\nentry = index.insert_dataset(\n dataset,\n name=\"my-dataset\",\n schema_ref=\"local://schemas/MySample@1.0.0\", # optional\n)\n\n# Get by name/reference\nentry = index.get_dataset(\"my-dataset\")\n\n# List all datasets\nfor entry in index.list_datasets():\n print(entry.name)\n\n\n\nSchema Operations\n\n# Publish a schema\nschema_ref = index.publish_schema(\n MySample,\n version=\"1.0.0\",\n)\n\n# Get schema record\nschema = index.get_schema(schema_ref)\nprint(schema[\"name\"], schema[\"version\"])\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']}@{schema['version']}\")\n\n# Decode schema to Python type\nSampleType = index.decode_schema(schema_ref)\ndataset = atdata.Dataset[SampleType](entry.data_urls[0])\n\n\n\nImplementations\n\nLocalIndex / Index (from atdata.local)\nAtmosphereIndex (from atdata.atmosphere)", 44 "crumbs": [ 45 "Guide", 46 "Reference", 47 "Protocols" 48 ] 49 }, 50 { 51 "objectID": 
"reference/protocols.html#abstractdatastore-protocol", 52 "href": "reference/protocols.html#abstractdatastore-protocol", 53 "title": "Protocols", 54 "section": "AbstractDataStore Protocol", 55 "text": "AbstractDataStore Protocol\nAbstracts over different storage backends:\n\nfrom atdata._protocols import AbstractDataStore\n\ndef write_dataset(store: AbstractDataStore, dataset) -> list[str]:\n \"\"\"Works with S3DataStore or future PDS blob store.\"\"\"\n urls = store.write_shards(dataset, prefix=\"datasets/v1\")\n return urls\n\n\nMethods\n\n# Write dataset shards\nurls = store.write_shards(\n dataset,\n prefix=\"datasets/mnist/v1\",\n maxcount=10000, # samples per shard\n)\n\n# Resolve URL for reading\nreadable_url = store.read_url(\"s3://bucket/path.tar\")\n\n# Check streaming support\nif store.supports_streaming():\n # Can stream directly\n pass\n\n\n\nImplementations\n\nS3DataStore (from atdata.local)", 56 "crumbs": [ 57 "Guide", 58 "Reference", 59 "Protocols" 60 ] 61 }, 62 { 63 "objectID": "reference/protocols.html#datasource-protocol", 64 "href": "reference/protocols.html#datasource-protocol", 65 "title": "Protocols", 66 "section": "DataSource Protocol", 67 "text": "DataSource Protocol\nAbstracts over different data source backends for streaming dataset shards:\n\nfrom atdata._protocols import DataSource\n\ndef load_from_source(source: DataSource) -> None:\n \"\"\"Works with URLSource, S3Source, or custom implementations.\"\"\"\n print(f\"Shards: {source.shard_list}\")\n\n for shard_id, stream in source.shards():\n print(f\"Reading {shard_id}\")\n # stream is a file-like object\n\n\nMethods\n\n# Get list of shard identifiers\nshard_ids = source.shard_list # ['data-000000.tar', 'data-000001.tar', ...]\n\n# Iterate over all shards with streams\nfor shard_id, stream in source.shards():\n # stream is IO[bytes], can be passed to tar reader\n process_shard(stream)\n\n# Open a specific shard\nstream = source.open_shard(\"data-000001.tar\")\n\n\n\nImplementations\n\nURLSource (from atdata) - WebDataset-compatible URLs (local, HTTP, etc.)\nS3Source (from atdata) - S3 and S3-compatible storage with boto3\n\n\n\nCreating Custom Data Sources\nImplement the DataSource protocol for custom backends:\n\nfrom typing import Iterator, IO\nfrom atdata._protocols import DataSource\n\nclass MyCustomSource:\n \"\"\"Custom data source for proprietary storage.\"\"\"\n\n def __init__(self, config: dict):\n self._config = config\n self._shards = [\"shard-001.tar\", \"shard-002.tar\"]\n\n @property\n def shard_list(self) -> list[str]:\n return self._shards\n\n def shards(self) -> Iterator[tuple[str, IO[bytes]]]:\n for shard_id in self._shards:\n stream = self._open(shard_id)\n yield shard_id, stream\n\n def open_shard(self, shard_id: str) -> IO[bytes]:\n if shard_id not in self._shards:\n raise KeyError(f\"Shard not found: {shard_id}\")\n return self._open(shard_id)\n\n def _open(self, shard_id: str) -> IO[bytes]:\n # Implementation-specific logic\n ...\n\n# Use with Dataset\nsource = MyCustomSource({\"endpoint\": \"...\"})\ndataset = atdata.Dataset[MySample](source)", 68 "crumbs": [ 69 "Guide", 70 "Reference", 71 "Protocols" 72 ] 73 }, 74 { 75 "objectID": "reference/protocols.html#using-protocols-for-polymorphism", 76 "href": "reference/protocols.html#using-protocols-for-polymorphism", 77 "title": "Protocols", 78 "section": "Using Protocols for Polymorphism", 79 "text": "Using Protocols for Polymorphism\nWrite code that works with any backend:\n\nfrom atdata._protocols import AbstractIndex, IndexEntry\nfrom 
atdata import Dataset\n\ndef backup_all_datasets(\n source: AbstractIndex,\n target: AbstractIndex,\n) -> None:\n \"\"\"Copy all datasets from source index to target.\"\"\"\n for entry in source.list_datasets():\n # Decode schema from source\n SampleType = source.decode_schema(entry.schema_ref)\n\n # Publish schema to target\n target_schema = target.publish_schema(SampleType)\n\n # Load and re-insert dataset\n ds = Dataset[SampleType](entry.data_urls[0])\n target.insert_dataset(\n ds,\n name=entry.name,\n schema_ref=target_schema,\n )", 80 "crumbs": [ 81 "Guide", 82 "Reference", 83 "Protocols" 84 ] 85 }, 86 { 87 "objectID": "reference/protocols.html#schema-reference-formats", 88 "href": "reference/protocols.html#schema-reference-formats", 89 "title": "Protocols", 90 "section": "Schema Reference Formats", 91 "text": "Schema Reference Formats\nSchema references vary by backend:\n\n\n\n\n\n\n\n\nBackend\nFormat\nExample\n\n\n\n\nLocal\natdata://local/sampleSchema/{Class}@{version}\natdata://local/sampleSchema/ImageSample@1.0.0\n\n\nAtmosphere\nat://{did}/{collection}/{rkey}\nat://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nLegacy local://schemas/ URIs are still supported for backward compatibility.", 92 "crumbs": [ 93 "Guide", 94 "Reference", 95 "Protocols" 96 ] 97 }, 98 { 99 "objectID": "reference/protocols.html#type-checking", 100 "href": "reference/protocols.html#type-checking", 101 "title": "Protocols", 102 "section": "Type Checking", 103 "text": "Type Checking\nProtocols are runtime-checkable:\n\nfrom atdata._protocols import IndexEntry, AbstractIndex\n\n# Check if object implements protocol\nentry = index.get_dataset(\"test\")\nassert isinstance(entry, IndexEntry)\n\n# Type hints work with protocols\ndef process(index: AbstractIndex) -> None:\n ... 
# IDE provides autocomplete", 104 "crumbs": [ 105 "Guide", 106 "Reference", 107 "Protocols" 108 ] 109 }, 110 { 111 "objectID": "reference/protocols.html#complete-example", 112 "href": "reference/protocols.html#complete-example", 113 "title": "Protocols", 114 "section": "Complete Example", 115 "text": "Complete Example\n\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex\nfrom atdata._protocols import AbstractIndex\nimport numpy as np\nfrom numpy.typing import NDArray\n\n# Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# Function works with any index\ndef count_datasets(index: AbstractIndex) -> int:\n return sum(1 for _ in index.list_datasets())\n\n# Use with local index\nlocal_index = LocalIndex()\nprint(f\"Local datasets: {count_datasets(local_index)}\")\n\n# Use with atmosphere index\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\natm_index = AtmosphereIndex(client)\nprint(f\"Atmosphere datasets: {count_datasets(atm_index)}\")\n\n# Migrate from local to atmosphere\ndef migrate_dataset(\n name: str,\n source: AbstractIndex,\n target: AbstractIndex,\n) -> None:\n entry = source.get_dataset(name)\n SampleType = source.decode_schema(entry.schema_ref)\n\n # Publish schema\n schema_ref = target.publish_schema(SampleType)\n\n # Create dataset and insert\n ds = atdata.Dataset[SampleType](entry.data_urls[0])\n target.insert_dataset(ds, name=name, schema_ref=schema_ref)\n\nmigrate_dataset(\"my-features\", local_index, atm_index)", 116 "crumbs": [ 117 "Guide", 118 "Reference", 119 "Protocols" 120 ] 121 }, 122 { 123 "objectID": "reference/protocols.html#related", 124 "href": "reference/protocols.html#related", 125 "title": "Protocols", 126 "section": "Related", 127 "text": "Related\n\nLocal Storage - LocalIndex and S3DataStore\nAtmosphere - AtmosphereIndex\nPromotion - Local to atmosphere migration\nload_dataset - Using indexes with load_dataset()", 128 "crumbs": [ 129 "Guide", 130 "Reference", 131 "Protocols" 132 ] 133 }, 134 { 135 "objectID": "reference/datasets.html", 136 "href": "reference/datasets.html", 137 "title": "Datasets", 138 "section": "", 139 "text": "The Dataset class provides typed iteration over WebDataset tar files with automatic batching and lens transformations.", 140 "crumbs": [ 141 "Guide", 142 "Reference", 143 "Datasets" 144 ] 145 }, 146 { 147 "objectID": "reference/datasets.html#creating-a-dataset", 148 "href": "reference/datasets.html#creating-a-dataset", 149 "title": "Datasets", 150 "section": "Creating a Dataset", 151 "text": "Creating a Dataset\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\n# Single shard (string URL - most common)\ndataset = atdata.Dataset[ImageSample](\"data-000000.tar\")\n\n# Multiple shards with brace notation\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nThe type parameter [ImageSample] specifies what sample type the dataset contains. 
This enables type-safe iteration and automatic deserialization.", 152 "crumbs": [ 153 "Guide", 154 "Reference", 155 "Datasets" 156 ] 157 }, 158 { 159 "objectID": "reference/datasets.html#data-sources", 160 "href": "reference/datasets.html#data-sources", 161 "title": "Datasets", 162 "section": "Data Sources", 163 "text": "Data Sources\nDatasets can be created from different data sources using the DataSource protocol:\n\nURL Source (default)\nWhen you pass a string to Dataset, it automatically wraps it in a URLSource:\n\n# These are equivalent:\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\ndataset = atdata.Dataset[ImageSample](atdata.URLSource(\"data-{000000..000009}.tar\"))\n\n\n\nS3 Source\nFor private S3 buckets or S3-compatible storage (Cloudflare R2, MinIO), use S3Source:\n\n# From explicit credentials\nsource = atdata.S3Source(\n bucket=\"my-bucket\",\n keys=[\"data-000000.tar\", \"data-000001.tar\"],\n endpoint=\"https://my-r2-account.r2.cloudflarestorage.com\",\n access_key=\"AKID...\",\n secret_key=\"SECRET...\",\n)\ndataset = atdata.Dataset[ImageSample](source)\n\n# From S3 URLs\nsource = atdata.S3Source.from_urls([\n \"s3://my-bucket/data-000000.tar\",\n \"s3://my-bucket/data-000001.tar\",\n])\ndataset = atdata.Dataset[ImageSample](source)\n\n\n\n\n\n\n\nNote\n\n\n\nS3Source uses boto3 for streaming, enabling authentication with private buckets. For public S3 URLs, a string URL with URLSource works directly.", 164 "crumbs": [ 165 "Guide", 166 "Reference", 167 "Datasets" 168 ] 169 }, 170 { 171 "objectID": "reference/datasets.html#iteration-modes", 172 "href": "reference/datasets.html#iteration-modes", 173 "title": "Datasets", 174 "section": "Iteration Modes", 175 "text": "Iteration Modes\n\nOrdered Iteration\nIterate through samples in their original order:\n\n# With batching (default batch_size=1)\nfor batch in dataset.ordered(batch_size=32):\n images = batch.image # numpy array (32, H, W, C)\n labels = batch.label # list of 32 strings\n\n# Without batching (raw samples)\nfor sample in dataset.ordered(batch_size=None):\n print(sample.label)\n\n\n\nShuffled Iteration\nIterate with randomized order at both shard and sample levels:\n\nfor batch in dataset.shuffled(batch_size=32):\n # Samples are shuffled\n process(batch)\n\n# Control shuffle buffer sizes\nfor batch in dataset.shuffled(\n buffer_shards=100, # Shards to buffer (default: 100)\n buffer_samples=10000, # Samples to buffer (default: 10,000)\n batch_size=32,\n):\n process(batch)\n\n\n\n\n\n\n\nTip\n\n\n\nLarger buffer sizes increase randomness but use more memory. 
For training, buffer_samples=10000 is usually a good balance.", 176 "crumbs": [ 177 "Guide", 178 "Reference", 179 "Datasets" 180 ] 181 }, 182 { 183 "objectID": "reference/datasets.html#samplebatch", 184 "href": "reference/datasets.html#samplebatch", 185 "title": "Datasets", 186 "section": "SampleBatch", 187 "text": "SampleBatch\nWhen iterating with a batch_size, each iteration yields a SampleBatch with automatic attribute aggregation.\n\n@atdata.packable\nclass Sample:\n features: NDArray # shape (256,)\n label: str\n score: float\n\nfor batch in dataset.ordered(batch_size=16):\n # NDArray fields are stacked with a batch dimension\n features = batch.features # numpy array (16, 256)\n\n # Other fields become lists\n labels = batch.label # list of 16 strings\n scores = batch.score # list of 16 floats\n\nResults are cached, so accessing the same attribute multiple times is efficient.", 188 "crumbs": [ 189 "Guide", 190 "Reference", 191 "Datasets" 192 ] 193 }, 194 { 195 "objectID": "reference/datasets.html#type-transformations-with-lenses", 196 "href": "reference/datasets.html#type-transformations-with-lenses", 197 "title": "Datasets", 198 "section": "Type Transformations with Lenses", 199 "text": "Type Transformations with Lenses\nView a dataset through a different sample type using registered lenses:\n\n@atdata.packable\nclass SimplifiedSample:\n label: str\n\n@atdata.lens\ndef simplify(src: ImageSample) -> SimplifiedSample:\n return SimplifiedSample(label=src.label)\n\n# Transform dataset to different type\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nfor batch in simple_ds.ordered(batch_size=16):\n print(batch.label) # Only label field available\n\nSee Lenses for details on defining transformations.", 200 "crumbs": [ 201 "Guide", 202 "Reference", 203 "Datasets" 204 ] 205 }, 206 { 207 "objectID": "reference/datasets.html#dataset-properties", 208 "href": "reference/datasets.html#dataset-properties", 209 "title": "Datasets", 210 "section": "Dataset Properties", 211 "text": "Dataset Properties\n\nShard List\nGet the list of individual tar files:\n\ndataset = atdata.Dataset[Sample](\"data-{000000..000009}.tar\")\nshards = dataset.shard_list\n# ['data-000000.tar', 'data-000001.tar', ..., 'data-000009.tar']\n\n\n\nMetadata\nDatasets can have associated metadata from a URL:\n\ndataset = atdata.Dataset[Sample](\n \"data-{000000..000009}.tar\",\n metadata_url=\"https://example.com/metadata.msgpack\"\n)\n\n# Fetched and cached on first access\nmetadata = dataset.metadata # dict or None", 212 "crumbs": [ 213 "Guide", 214 "Reference", 215 "Datasets" 216 ] 217 }, 218 { 219 "objectID": "reference/datasets.html#writing-datasets", 220 "href": "reference/datasets.html#writing-datasets", 221 "title": "Datasets", 222 "section": "Writing Datasets", 223 "text": "Writing Datasets\nUse WebDataset’s TarWriter or ShardWriter to create datasets:\n\nimport webdataset as wds\nimport numpy as np\n\nsamples = [\n ImageSample(image=np.random.rand(224, 224, 3).astype(np.float32), label=\"cat\")\n for _ in range(100)\n]\n\n# Single tar file\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\n# Multiple shards with automatic splitting\nwith wds.writer.ShardWriter(\"data-%06d.tar\", maxcount=1000) as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})", 224 "crumbs": [ 225 "Guide", 226 "Reference", 227 "Datasets" 228 ] 229 }, 230 { 231 "objectID": 
"reference/datasets.html#parquet-export", 232 "href": "reference/datasets.html#parquet-export", 233 "title": "Datasets", 234 "section": "Parquet Export", 235 "text": "Parquet Export\nExport dataset contents to parquet format:\n\n# Export entire dataset\ndataset.to_parquet(\"output.parquet\")\n\n# Export with custom field mapping\ndef extract_fields(sample):\n return {\"label\": sample.label, \"score\": sample.confidence}\n\ndataset.to_parquet(\"output.parquet\", sample_map=extract_fields)\n\n# Export in segments\ndataset.to_parquet(\"output.parquet\", maxcount=10000)\n# Creates output-000000.parquet, output-000001.parquet, etc.", 236 "crumbs": [ 237 "Guide", 238 "Reference", 239 "Datasets" 240 ] 241 }, 242 { 243 "objectID": "reference/datasets.html#url-formats", 244 "href": "reference/datasets.html#url-formats", 245 "title": "Datasets", 246 "section": "URL Formats", 247 "text": "URL Formats\nWhen using string URLs (via URLSource), WebDataset supports various formats:\n\n\n\n\n\n\n\nFormat\nExample\n\n\n\n\nLocal files\n./data/file.tar, /absolute/path/file-{000000..000009}.tar\n\n\nHTTP/HTTPS\nhttps://example.com/data-{000000..000009}.tar\n\n\nGoogle Cloud\ngs://bucket/path/file.tar\n\n\n\nFor S3 with authentication, use S3Source instead of s3:// URLs.", 248 "crumbs": [ 249 "Guide", 250 "Reference", 251 "Datasets" 252 ] 253 }, 254 { 255 "objectID": "reference/datasets.html#dataset-properties-1", 256 "href": "reference/datasets.html#dataset-properties-1", 257 "title": "Datasets", 258 "section": "Dataset Properties", 259 "text": "Dataset Properties\n\nSource\nAccess the underlying DataSource:\n\ndataset = atdata.Dataset[Sample](\"data.tar\")\nsource = dataset.source # URLSource instance\nprint(source.shard_list) # ['data.tar']\n\n\n\nSample Type\nGet the type parameter used to create the dataset:\n\ndataset = atdata.Dataset[ImageSample](\"data.tar\")\nprint(dataset.sample_type) # <class 'ImageSample'>\nprint(dataset.batch_type) # SampleBatch[ImageSample]", 260 "crumbs": [ 261 "Guide", 262 "Reference", 263 "Datasets" 264 ] 265 }, 266 { 267 "objectID": "reference/datasets.html#related", 268 "href": "reference/datasets.html#related", 269 "title": "Datasets", 270 "section": "Related", 271 "text": "Related\n\nPackable Samples - Defining typed samples\nLenses - Type transformations\nload_dataset - HuggingFace-style loading API\nProtocols - DataSource protocol details", 272 "crumbs": [ 273 "Guide", 274 "Reference", 275 "Datasets" 276 ] 277 }, 278 { 279 "objectID": "reference/architecture.html", 280 "href": "reference/architecture.html", 281 "title": "Architecture Overview", 282 "section": "", 283 "text": "atdata is designed around a simple but powerful idea: typed, serializable samples that can flow seamlessly between local development, team storage, and a federated network. This page explains the architectural decisions and how the components work together.", 284 "crumbs": [ 285 "Guide", 286 "Reference", 287 "Architecture Overview" 288 ] 289 }, 290 { 291 "objectID": "reference/architecture.html#design-philosophy", 292 "href": "reference/architecture.html#design-philosophy", 293 "title": "Architecture Overview", 294 "section": "Design Philosophy", 295 "text": "Design Philosophy\n\nThe Problem\nMachine learning workflows involve datasets at every stage—training data, validation sets, embeddings, features, and model outputs. 
These datasets are often:\n\nUntyped: Raw files with implicit schemas, leading to runtime errors\nSiloed: Stuck in one location (local disk, team bucket, or cloud storage)\nUndiscoverable: No standard way to find and share datasets across teams or organizations\n\n\n\nThe Solution\natdata provides a three-layer architecture that addresses each problem:\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 3: Federation (ATProto Atmosphere) │\n│ - Decentralized discovery and sharing │\n│ - Content-addressable identifiers │\n│ - Cross-organization dataset federation │\n└─────────────────────────────────────────────────────────────┘\n ↑\n Promotion\n ↑\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 2: Team Storage (Redis + S3) │\n│ - Shared index for team discovery │\n│ - Scalable object storage for data │\n│ - Schema registry for type consistency │\n└─────────────────────────────────────────────────────────────┘\n ↑\n Insert\n ↑\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 1: Local Development │\n│ - Typed samples with automatic serialization │\n│ - WebDataset tar files for efficient storage │\n│ - Lens transformations for schema flexibility │\n└─────────────────────────────────────────────────────────────┘", 296 "crumbs": [ 297 "Guide", 298 "Reference", 299 "Architecture Overview" 300 ] 301 }, 302 { 303 "objectID": "reference/architecture.html#core-components", 304 "href": "reference/architecture.html#core-components", 305 "title": "Architecture Overview", 306 "section": "Core Components", 307 "text": "Core Components\n\nPackableSample: The Foundation\nEverything in atdata starts with PackableSample—a base class that makes Python dataclasses serializable with msgpack:\n\n@atdata.packable\nclass ImageSample:\n image: NDArray # Automatically converted to/from bytes\n label: str # Standard msgpack serialization\n confidence: float\n\nKey features:\n\nAutomatic NDArray handling: Numpy arrays are serialized efficiently\nType safety: Field types are preserved and validated\nRound-trip fidelity: Serialize → deserialize always produces identical data\n\nThe @packable decorator is syntactic sugar that:\n\nConverts your class to a dataclass\nAdds PackableSample as a base class\nRegisters a lens from DictSample for flexible loading\n\n\n\nDataset: Typed Iteration\nThe Dataset[T] class wraps WebDataset tar archives with type information:\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nfor batch in dataset.shuffled(batch_size=32):\n images = batch.image # Stacked NDArray: (32, H, W, C)\n labels = batch.label # List of 32 strings\n\nWhy WebDataset?\nWebDataset is a battle-tested format for large-scale ML training:\n\nStreaming: No need to download entire datasets\nSharding: Data split across multiple tar files for parallelism\nShuffling: Two-level shuffling (shard + sample) for training\n\natdata adds:\n\nType safety: Know the schema at compile time\nBatch aggregation: NDArrays are automatically stacked\nLens transformations: View data through different schemas\n\n\n\nSampleBatch: Automatic Aggregation\nWhen iterating with batch_size, atdata returns SampleBatch[T] objects that aggregate sample attributes:\n\nbatch = SampleBatch[ImageSample](samples)\n\n# NDArray fields → stacked numpy array with batch dimension\nbatch.image.shape # (batch_size, H, W, C)\n\n# Other fields → list\nbatch.label # [\"cat\", \"dog\", \"bird\", ...]\n\nThis eliminates boilerplate collation code and works automatically for any 
PackableSample type.\n\n\nLens: Schema Transformations\nLenses enable viewing datasets through different schemas without duplicating data:\n\n@atdata.packable\nclass SimplifiedSample:\n label: str\n\n@atdata.lens\ndef simplify(src: ImageSample) -> SimplifiedSample:\n return SimplifiedSample(label=src.label)\n\n# View dataset through simplified schema\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nWhen to use lenses:\n\nReducing fields: Drop unnecessary data for specific tasks\nTransforming data: Compute derived fields on-the-fly\nSchema migration: Handle version differences between datasets\n\nLenses are registered globally in a LensNetwork, enabling automatic discovery of transformation paths.", 308 "crumbs": [ 309 "Guide", 310 "Reference", 311 "Architecture Overview" 312 ] 313 }, 314 { 315 "objectID": "reference/architecture.html#storage-backends", 316 "href": "reference/architecture.html#storage-backends", 317 "title": "Architecture Overview", 318 "section": "Storage Backends", 319 "text": "Storage Backends\n\nLocal Index (Redis + S3)\nFor team-scale usage, atdata provides a two-component storage system:\nRedis Index: Stores metadata and enables fast lookups\n\nDataset entries (name, schema, URLs, metadata)\nSchema registry (type definitions)\nCID-based content addressing\n\nS3 DataStore: Stores actual data files\n\nWebDataset tar shards\nAny S3-compatible storage (AWS, MinIO, Cloudflare R2)\n\n\nstore = S3DataStore(credentials=creds, bucket=\"datasets\")\nindex = LocalIndex(data_store=store)\n\n# Insert dataset: writes to S3, indexes in Redis\nentry = index.insert_dataset(dataset, name=\"training-v1\")\n\nWhy this split?\n\nSeparation of concerns: Metadata queries don’t touch data storage\nFlexibility: Use any S3-compatible storage\nScalability: Redis handles high-throughput lookups; S3 handles large files\n\n\n\nAtmosphere Index (ATProto)\nFor public or cross-organization sharing, atdata integrates with the AT Protocol:\nATProto PDS: Your Personal Data Server stores records\n\nSchema definitions\nDataset index records\nLens transformation records\n\nPDSBlobStore: Optional blob storage on your PDS\n\nStore actual data shards as ATProto blobs\nFully decentralized—no external dependencies\n\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Publish: creates ATProto records, uploads blobs\nentry = index.insert_dataset(dataset, name=\"public-features\")", 320 "crumbs": [ 321 "Guide", 322 "Reference", 323 "Architecture Overview" 324 ] 325 }, 326 { 327 "objectID": "reference/architecture.html#protocol-abstractions", 328 "href": "reference/architecture.html#protocol-abstractions", 329 "title": "Architecture Overview", 330 "section": "Protocol Abstractions", 331 "text": "Protocol Abstractions\natdata uses protocols (structural typing) to enable backend interoperability:\n\nAbstractIndex\nCommon interface for both LocalIndex and AtmosphereIndex:\n\ndef process_dataset(index: AbstractIndex, name: str):\n entry = index.get_dataset(name)\n schema = index.decode_schema(entry.schema_ref)\n # Works with either LocalIndex or AtmosphereIndex\n\nKey methods:\n\ninsert_dataset() / get_dataset(): Dataset CRUD\npublish_schema() / decode_schema(): Schema management\nlist_datasets() / list_schemas(): Discovery\n\n\n\nAbstractDataStore\nCommon interface for S3DataStore and PDSBlobStore:\n\ndef write_to_store(store: AbstractDataStore, dataset: Dataset):\n urls = 
store.write_shards(dataset, prefix=\"data/v1\")\n # Works with S3 or PDS blob storage\n\n\n\nDataSource\nCommon interface for data streaming:\n\nURLSource: WebDataset-compatible URLs\nS3Source: S3 with explicit credentials\nBlobSource: ATProto PDS blobs", 332 "crumbs": [ 333 "Guide", 334 "Reference", 335 "Architecture Overview" 336 ] 337 }, 338 { 339 "objectID": "reference/architecture.html#data-flow-local-to-federation", 340 "href": "reference/architecture.html#data-flow-local-to-federation", 341 "title": "Architecture Overview", 342 "section": "Data Flow: Local to Federation", 343 "text": "Data Flow: Local to Federation\nA typical workflow progresses through three stages:\n\nStage 1: Local Development\n\n# Define type and create samples\n@atdata.packable\nclass MySample:\n features: NDArray\n label: str\n\n# Write to local tar\nwith wds.writer.TarWriter(\"data.tar\") as sink:\n for sample in samples:\n sink.write(sample.as_wds)\n\n# Iterate locally\ndataset = atdata.Dataset[MySample](\"data.tar\")\n\n\n\nStage 2: Team Storage\n\n# Set up team storage\nstore = S3DataStore(credentials=team_creds, bucket=\"team-datasets\")\nindex = LocalIndex(data_store=store)\n\n# Publish schema and insert\nindex.publish_schema(MySample, version=\"1.0.0\")\nentry = index.insert_dataset(dataset, name=\"my-features\")\n\n# Team members can now load via index\nds = load_dataset(\"@local/my-features\", index=index)\n\n\n\nStage 3: Federation\n\n# Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(entry, index, client)\n\n# Anyone can now discover and load\n# ds = load_dataset(\"@handle.bsky.social/my-features\")", 344 "crumbs": [ 345 "Guide", 346 "Reference", 347 "Architecture Overview" 348 ] 349 }, 350 { 351 "objectID": "reference/architecture.html#content-addressing", 352 "href": "reference/architecture.html#content-addressing", 353 "title": "Architecture Overview", 354 "section": "Content Addressing", 355 "text": "Content Addressing\natdata uses CIDs (Content Identifiers) for content-addressable storage:\n\nSchema CIDs: Hash of schema definition\nEntry CIDs: Hash of (schema_ref, data_urls)\nBlob CIDs: Hash of data content\n\nBenefits:\n\nDeduplication: Identical content has identical CID\nIntegrity: Verify data matches expected hash\nATProto compatibility: CIDs are native to the AT Protocol", 356 "crumbs": [ 357 "Guide", 358 "Reference", 359 "Architecture Overview" 360 ] 361 }, 362 { 363 "objectID": "reference/architecture.html#extension-points", 364 "href": "reference/architecture.html#extension-points", 365 "title": "Architecture Overview", 366 "section": "Extension Points", 367 "text": "Extension Points\natdata is designed for extensibility:\n\nCustom DataSources\nImplement the DataSource protocol to add new storage backends:\n\nclass MyCustomSource:\n def list_shards(self) -> list[str]: ...\n def open_shard(self, shard_id: str) -> IO[bytes]: ...\n\n @property\n def shards(self) -> Iterator[tuple[str, IO[bytes]]]: ...\n\n\n\nCustom Lenses\nRegister transformations between any PackableSample types:\n\n@atdata.lens\ndef my_transform(src: SourceType) -> TargetType:\n return TargetType(...)\n\n@my_transform.putter\ndef my_transform_put(view: TargetType, src: SourceType) -> SourceType:\n return SourceType(...)\n\n\n\nSchema Extensions\nThe schema format supports custom metadata for domain-specific needs:\n\nindex.publish_schema(\n MySample,\n version=\"1.0.0\",\n metadata={\"domain\": \"chemistry\", \"units\": 
\"mol/L\"},\n)", 368 "crumbs": [ 369 "Guide", 370 "Reference", 371 "Architecture Overview" 372 ] 373 }, 374 { 375 "objectID": "reference/architecture.html#summary", 376 "href": "reference/architecture.html#summary", 377 "title": "Architecture Overview", 378 "section": "Summary", 379 "text": "Summary\n\n\n\n\n\n\n\n\nComponent\nPurpose\nKey Classes\n\n\n\n\nSamples\nTyped, serializable data\nPackableSample, @packable\n\n\nDatasets\nTyped iteration over WebDataset\nDataset[T], SampleBatch[T]\n\n\nLenses\nSchema transformations\nLens, @lens, LensNetwork\n\n\nLocal Storage\nTeam-scale index + data\nLocalIndex, S3DataStore\n\n\nAtmosphere\nFederated sharing\nAtmosphereIndex, PDSBlobStore\n\n\nProtocols\nBackend abstraction\nAbstractIndex, AbstractDataStore, DataSource\n\n\n\nThe architecture enables a smooth progression from local experimentation to team collaboration to public federation, all while maintaining type safety and efficient data handling.", 380 "crumbs": [ 381 "Guide", 382 "Reference", 383 "Architecture Overview" 384 ] 385 }, 386 { 387 "objectID": "reference/architecture.html#related", 388 "href": "reference/architecture.html#related", 389 "title": "Architecture Overview", 390 "section": "Related", 391 "text": "Related\n\nPackable Samples - Defining sample types\nDatasets - Dataset iteration and batching\nLocal Storage - Redis + S3 backend\nAtmosphere - ATProto federation\nProtocols - Abstract interfaces", 392 "crumbs": [ 393 "Guide", 394 "Reference", 395 "Architecture Overview" 396 ] 397 }, 398 { 399 "objectID": "reference/atmosphere.html", 400 "href": "reference/atmosphere.html", 401 "title": "Atmosphere (ATProto Integration)", 402 "section": "", 403 "text": "The atmosphere module enables publishing and discovering datasets on the ATProto network, creating a federated ecosystem for typed datasets.", 404 "crumbs": [ 405 "Guide", 406 "Reference", 407 "Atmosphere (ATProto Integration)" 408 ] 409 }, 410 { 411 "objectID": "reference/atmosphere.html#installation", 412 "href": "reference/atmosphere.html#installation", 413 "title": "Atmosphere (ATProto Integration)", 414 "section": "Installation", 415 "text": "Installation\npip install atdata[atmosphere]\n# or\npip install atproto", 416 "crumbs": [ 417 "Guide", 418 "Reference", 419 "Atmosphere (ATProto Integration)" 420 ] 421 }, 422 { 423 "objectID": "reference/atmosphere.html#overview", 424 "href": "reference/atmosphere.html#overview", 425 "title": "Atmosphere (ATProto Integration)", 426 "section": "Overview", 427 "text": "Overview\nATProto integration publishes datasets, schemas, and lenses as records in the ac.foundation.dataset.* namespace. 
This enables:\n\nDiscovery through the ATProto network\nFederation across different hosts\nVerifiability through content-addressable records", 428 "crumbs": [ 429 "Guide", 430 "Reference", 431 "Atmosphere (ATProto Integration)" 432 ] 433 }, 434 { 435 "objectID": "reference/atmosphere.html#atmosphereclient", 436 "href": "reference/atmosphere.html#atmosphereclient", 437 "title": "Atmosphere (ATProto Integration)", 438 "section": "AtmosphereClient", 439 "text": "AtmosphereClient\nThe client handles authentication and record operations:\n\nfrom atdata.atmosphere import AtmosphereClient\n\nclient = AtmosphereClient()\n\n# Login with app-specific password (not your main password!)\nclient.login(\"alice.bsky.social\", \"app-password\")\n\nprint(client.did) # 'did:plc:...'\nprint(client.handle) # 'alice.bsky.social'\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password. Create app passwords at bsky.app/settings/app-passwords.\n\n\n\nSession Management\nSave and restore sessions to avoid re-authentication:\n\n# Export session for later\nsession_string = client.export_session()\n\n# Later: restore session\nnew_client = AtmosphereClient()\nnew_client.login_with_session(session_string)\n\n\n\nCustom PDS\nConnect to a custom PDS instead of bsky.social:\n\nclient = AtmosphereClient(base_url=\"https://pds.example.com\")", 440 "crumbs": [ 441 "Guide", 442 "Reference", 443 "Atmosphere (ATProto Integration)" 444 ] 445 }, 446 { 447 "objectID": "reference/atmosphere.html#pdsblobstore", 448 "href": "reference/atmosphere.html#pdsblobstore", 449 "title": "Atmosphere (ATProto Integration)", 450 "section": "PDSBlobStore", 451 "text": "PDSBlobStore\nStore dataset shards as ATProto blobs for fully decentralized storage:\n\nfrom atdata.atmosphere import AtmosphereClient, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\n\n# Write shards as blobs\nurls = store.write_shards(dataset, prefix=\"my-data/v1\")\n# Returns: ['at://did:plc:.../blob/bafyrei...', ...]\n\n# Transform AT URIs to HTTP URLs for reading\nhttp_url = store.read_url(urls[0])\n# Returns: 'https://pds.example.com/xrpc/com.atproto.sync.getBlob?...'\n\n# Create a BlobSource for streaming\nsource = store.create_source(urls)\nds = atdata.Dataset[MySample](source)\n\n\nSize Limits\nPDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use maxcount and maxsize parameters to control shard sizes:\n\nurls = store.write_shards(\n dataset,\n prefix=\"large-data/v1\",\n maxcount=5000, # Max 5000 samples per shard\n maxsize=50e6, # Max 50MB per shard\n)", 452 "crumbs": [ 453 "Guide", 454 "Reference", 455 "Atmosphere (ATProto Integration)" 456 ] 457 }, 458 { 459 "objectID": "reference/atmosphere.html#blobsource", 460 "href": "reference/atmosphere.html#blobsource", 461 "title": "Atmosphere (ATProto Integration)", 462 "section": "BlobSource", 463 "text": "BlobSource\nRead datasets stored as PDS blobs:\n\nfrom atdata import BlobSource\n\n# From blob references\nsource = BlobSource.from_refs([\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei111\"},\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei222\"},\n])\n\n# Or from PDSBlobStore\nsource = store.create_source(urls)\n\n# Use with Dataset\nds = atdata.Dataset[MySample](source)\nfor batch in ds.ordered(batch_size=32):\n process(batch)", 464 "crumbs": [ 465 "Guide", 466 "Reference", 467 "Atmosphere (ATProto Integration)" 468 ] 469 }, 470 { 471 "objectID": "reference/atmosphere.html#atmosphereindex", 472 "href": "reference/atmosphere.html#atmosphereindex", 473 "title": "Atmosphere (ATProto Integration)", 474 "section": "AtmosphereIndex", 475 "text": "AtmosphereIndex\nThe unified interface for ATProto operations, implementing the AbstractIndex protocol:\n\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Without blob storage (use external URLs)\nindex = AtmosphereIndex(client)\n\n# With PDS blob storage (recommended for full decentralization)\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n\nPublishing Schemas\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\n# Publish schema\nschema_uri = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image classification sample\",\n)\n# Returns: \"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\"\n\n\n\nPublishing Datasets\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.insert_dataset(\n dataset,\n name=\"imagenet-subset\",\n schema_ref=schema_uri, # Optional - auto-publishes if omitted\n description=\"ImageNet subset\",\n tags=[\"images\", \"classification\"],\n license=\"MIT\",\n)\n\nprint(entry.uri) # AT URI of the record\nprint(entry.data_urls) # WebDataset URLs\n\n\n\nListing and Retrieving\n\n# List your datasets\nfor entry in index.list_datasets():\n print(f\"{entry.name}: {entry.schema_ref}\")\n\n# List from another user\nfor entry in index.list_datasets(repo=\"did:plc:other-user\"):\n print(entry.name)\n\n# Get specific dataset\nentry = index.get_dataset(\"at://did:plc:.../ac.foundation.dataset.record/...\")\n\n# List schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']} v{schema['version']}\")\n\n# Decode schema to Python type\nSampleType = index.decode_schema(schema_uri)", 476 "crumbs": [ 477 "Guide", 478 "Reference", 479 "Atmosphere (ATProto Integration)" 480 ] 481 }, 482 { 483 "objectID": "reference/atmosphere.html#lower-level-publishers", 484 "href": "reference/atmosphere.html#lower-level-publishers", 485 "title": "Atmosphere (ATProto Integration)", 486 "section": "Lower-Level Publishers", 487 "text": "Lower-Level Publishers\nFor more control, use the individual publisher 
classes:\n\nSchemaPublisher\n\nfrom atdata.atmosphere import SchemaPublisher\n\npublisher = SchemaPublisher(client)\n\nuri = publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Image with label\",\n metadata={\"source\": \"training\"},\n)\n\n\n\nDatasetPublisher\n\nfrom atdata.atmosphere import DatasetPublisher\n\npublisher = DatasetPublisher(client)\n\nuri = publisher.publish(\n dataset,\n name=\"training-images\",\n schema_uri=schema_uri, # Required if auto_publish_schema=False\n auto_publish_schema=True, # Publish schema automatically\n description=\"Training images\",\n tags=[\"training\", \"images\"],\n license=\"MIT\",\n)\n\n\nBlob Storage\nThere are two approaches to storing data as ATProto blobs:\nApproach 1: PDSBlobStore (Recommended)\nUse PDSBlobStore with AtmosphereIndex for automatic shard management:\n\nfrom atdata.atmosphere import PDSBlobStore, AtmosphereIndex\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Dataset shards are automatically uploaded as blobs\nentry = index.insert_dataset(\n dataset,\n name=\"my-dataset\",\n schema_ref=schema_uri,\n)\n\n# Later: load using BlobSource\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[MySample](source)\n\nApproach 2: Manual Blob Publishing\nFor more control, use DatasetPublisher.publish_with_blobs() directly:\n\nimport io\nimport webdataset as wds\n\n# Create tar data in memory\ntar_buffer = io.BytesIO()\nwith wds.writer.TarWriter(tar_buffer) as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# Publish with blob storage\nuri = publisher.publish_with_blobs(\n blobs=[tar_buffer.getvalue()],\n schema_uri=schema_uri,\n name=\"small-dataset\",\n description=\"Dataset stored in ATProto blobs\",\n tags=[\"small\", \"demo\"],\n)\n\nLoading Blob-Stored Datasets\n\nfrom atdata.atmosphere import DatasetLoader\nfrom atdata import BlobSource\n\nloader = DatasetLoader(client)\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\nif storage_type == \"blobs\":\n # Get blob URLs and create BlobSource\n blob_urls = loader.get_blob_urls(uri)\n # Parse to blob refs for BlobSource\n # Or use loader.to_dataset() which handles this automatically\n\n# to_dataset() handles both storage types automatically\ndataset = loader.to_dataset(uri, MySample)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\n\nLensPublisher\n\nfrom atdata.atmosphere import LensPublisher\n\npublisher = LensPublisher(client)\n\n# With code references\nuri = publisher.publish(\n name=\"simplify\",\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n description=\"Extract label only\",\n getter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_getter\",\n },\n putter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_putter\",\n },\n)\n\n# Or publish from a Lens object\nfrom atdata.lens import lens\n\n@lens\ndef simplify(src: FullSample) -> SimpleSample:\n return SimpleSample(label=src.label)\n\nuri = publisher.publish_from_lens(\n simplify,\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n)", 488 "crumbs": [ 489 "Guide", 490 "Reference", 491 "Atmosphere (ATProto Integration)" 492 ] 493 }, 494 { 495 "objectID": "reference/atmosphere.html#lower-level-loaders", 
496 "href": "reference/atmosphere.html#lower-level-loaders", 497 "title": "Atmosphere (ATProto Integration)", 498 "section": "Lower-Level Loaders", 499 "text": "Lower-Level Loaders\nFor direct access to records, use the loader classes:\n\nSchemaLoader\n\nfrom atdata.atmosphere import SchemaLoader\n\nloader = SchemaLoader(client)\n\n# Get a specific schema\nschema = loader.get(\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/xyz\")\nprint(schema[\"name\"], schema[\"version\"])\n\n# List all schemas from a repository\nfor schema in loader.list_all(repo=\"did:plc:other-user\"):\n print(schema[\"name\"])\n\n\n\nDatasetLoader\n\nfrom atdata.atmosphere import DatasetLoader\n\nloader = DatasetLoader(client)\n\n# Get a specific dataset record\nrecord = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\n# Get URLs based on storage type\nif storage_type == \"external\":\n urls = loader.get_urls(uri)\nelse:\n urls = loader.get_blob_urls(uri)\n\n# Get metadata\nmetadata = loader.get_metadata(uri)\n\n# Create a Dataset object directly\ndataset = loader.to_dataset(uri, MySampleType)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\nLensLoader\n\nfrom atdata.atmosphere import LensLoader\n\nloader = LensLoader(client)\n\n# Get a specific lens record\nlens = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\nprint(lens[\"name\"])\nprint(lens[\"sourceSchema\"], \"->\", lens[\"targetSchema\"])\n\n# List all lenses from a repository\nfor lens in loader.list_all():\n print(lens[\"name\"])\n\n# Find lenses by schema\nlenses = loader.find_by_schemas(\n source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n)", 500 "crumbs": [ 501 "Guide", 502 "Reference", 503 "Atmosphere (ATProto Integration)" 504 ] 505 }, 506 { 507 "objectID": "reference/atmosphere.html#at-uris", 508 "href": "reference/atmosphere.html#at-uris", 509 "title": "Atmosphere (ATProto Integration)", 510 "section": "AT URIs", 511 "text": "AT URIs\nATProto records are identified by AT URIs:\n\nfrom atdata.atmosphere import AtUri\n\n# Parse an AT URI\nuri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n\nprint(uri.authority) # 'did:plc:abc123'\nprint(uri.collection) # 'ac.foundation.dataset.sampleSchema'\nprint(uri.rkey) # 'xyz'\n\n# Format back to string\nprint(str(uri)) # 'at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz'", 512 "crumbs": [ 513 "Guide", 514 "Reference", 515 "Atmosphere (ATProto Integration)" 516 ] 517 }, 518 { 519 "objectID": "reference/atmosphere.html#supported-field-types", 520 "href": "reference/atmosphere.html#supported-field-types", 521 "title": "Atmosphere (ATProto Integration)", 522 "section": "Supported Field Types", 523 "text": "Supported Field Types\nSchemas support these field types:\n\n\n\nPython Type\nATProto Type\n\n\n\n\nstr\nprimitive/str\n\n\nint\nprimitive/int\n\n\nfloat\nprimitive/float\n\n\nbool\nprimitive/bool\n\n\nbytes\nprimitive/bytes\n\n\nNDArray\nndarray (default dtype: float32)\n\n\nNDArray[np.float64]\nndarray (dtype: float64)\n\n\nlist[str]\narray with items\n\n\nT \\| None\nOptional field", 524 "crumbs": [ 525 "Guide", 526 "Reference", 527 "Atmosphere (ATProto Integration)" 528 ] 529 }, 530 { 531 "objectID": "reference/atmosphere.html#complete-example", 532 "href": 
"reference/atmosphere.html#complete-example", 533 "title": "Atmosphere (ATProto Integration)", 534 "section": "Complete Example", 535 "text": "Complete Example\nThis example shows the full workflow using PDSBlobStore for decentralized storage:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nimport webdataset as wds\n\n# 1. Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Authenticate and set up blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\")\n\n# 6. Later: discover and load from blobs\nfor dataset_entry in index.list_datasets():\n print(f\"Found: {dataset_entry.name}\")\n\n # Reconstruct type from schema\n SampleType = index.decode_schema(dataset_entry.schema_ref)\n\n # Create source from blob URLs\n source = store.create_source(dataset_entry.data_urls)\n\n # Load dataset from blobs\n ds = atdata.Dataset[SampleType](source)\n for batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n break\n\nFor external URL storage (without PDSBlobStore):\n\n# Use AtmosphereIndex without data_store\nindex = AtmosphereIndex(client)\n\n# Dataset URLs will be stored as-is (external references)\nentry = index.insert_dataset(\n dataset,\n name=\"external-features\",\n schema_ref=schema_uri,\n)\n\n# Load using standard URL source\nds = atdata.Dataset[FeatureSample](entry.data_urls[0])", 536 "crumbs": [ 537 "Guide", 538 "Reference", 539 "Atmosphere (ATProto Integration)" 540 ] 541 }, 542 { 543 "objectID": "reference/atmosphere.html#related", 544 "href": "reference/atmosphere.html#related", 545 "title": "Atmosphere (ATProto Integration)", 546 "section": "Related", 547 "text": "Related\n\nLocal Storage - Redis + S3 backend\nPromotion - Promoting local datasets to ATProto\nProtocols - AbstractIndex interface\nPackable Samples - Defining sample types", 548 "crumbs": [ 549 "Guide", 550 "Reference", 551 "Atmosphere (ATProto Integration)" 552 ] 553 }, 554 { 555 "objectID": "reference/local-storage.html", 556 "href": "reference/local-storage.html", 557 "title": "Local Storage", 558 "section": "", 559 "text": "The local storage module provides a Redis + S3 backend for storing and managing datasets before publishing to the ATProto federation.", 560 "crumbs": [ 561 "Guide", 562 "Reference", 563 "Local Storage" 564 ] 565 }, 566 { 567 "objectID": "reference/local-storage.html#overview", 568 "href": "reference/local-storage.html#overview", 569 "title": "Local Storage", 570 "section": "Overview", 571 "text": "Overview\nLocal 
storage uses:\n\nRedis for indexing and tracking dataset metadata\nS3-compatible storage for dataset tar files\n\nThis enables development and small-scale deployment before promoting to the full ATProto infrastructure.", 572 "crumbs": [ 573 "Guide", 574 "Reference", 575 "Local Storage" 576 ] 577 }, 578 { 579 "objectID": "reference/local-storage.html#localindex", 580 "href": "reference/local-storage.html#localindex", 581 "title": "Local Storage", 582 "section": "LocalIndex", 583 "text": "LocalIndex\nThe index tracks datasets in Redis:\n\nfrom atdata.local import LocalIndex\n\n# Default connection (localhost:6379)\nindex = LocalIndex()\n\n# Custom Redis connection\nimport redis\nr = redis.Redis(host='custom-host', port=6379)\nindex = LocalIndex(redis=r)\n\n# With connection kwargs\nindex = LocalIndex(host='custom-host', port=6379, db=1)\n\n\nAdding Entries\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.add_entry(\n dataset,\n name=\"my-dataset\",\n schema_ref=\"atdata://local/sampleSchema/ImageSample@1.0.0\", # optional\n metadata={\"description\": \"Training images\"}, # optional\n)\n\nprint(entry.cid) # Content identifier\nprint(entry.name) # \"my-dataset\"\nprint(entry.data_urls) # [\"data-{000000..000009}.tar\"]\n\n\n\nListing and Retrieving\n\n# Iterate all entries\nfor entry in index.entries:\n print(f\"{entry.name}: {entry.cid}\")\n\n# Get as list\nall_entries = index.all_entries\n\n# Get by name\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Get by CID\nentry = index.get_entry(\"bafyrei...\")", 584 "crumbs": [ 585 "Guide", 586 "Reference", 587 "Local Storage" 588 ] 589 }, 590 { 591 "objectID": "reference/local-storage.html#repo-deprecated", 592 "href": "reference/local-storage.html#repo-deprecated", 593 "title": "Local Storage", 594 "section": "Repo (Deprecated)", 595 "text": "Repo (Deprecated)\n\n\n\n\n\n\nWarning\n\n\n\nRepo is deprecated. 
Use LocalIndex with S3DataStore instead for new code.\n\n\nThe Repo class combines S3 storage with Redis indexing:\n\nfrom atdata.local import Repo\n\n# From credentials file\nrepo = Repo(\n s3_credentials=\"path/to/.env\",\n hive_path=\"my-bucket/datasets\",\n)\n\n# From credentials dict\nrepo = Repo(\n s3_credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n hive_path=\"my-bucket/datasets\",\n)\n\nPreferred approach - Use LocalIndex with S3DataStore:\n\nfrom atdata.local import LocalIndex, S3DataStore\n\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Insert dataset\nentry = index.insert_dataset(dataset, name=\"my-dataset\", prefix=\"datasets/v1\")\n\n\nCredentials File Format\nThe .env file should contain:\nAWS_ENDPOINT=http://localhost:9000\nAWS_ACCESS_KEY_ID=your-access-key\nAWS_SECRET_ACCESS_KEY=your-secret-key\n\n\n\n\n\n\nNote\n\n\n\nFor AWS S3, omit AWS_ENDPOINT to use the default endpoint.\n\n\n\n\nInserting Datasets\n\nimport webdataset as wds\nimport numpy as np\n\n# Create dataset from samples\nsamples = [ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"sample_{i}\"\n) for i in range(1000)]\n\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[ImageSample](\"temp.tar\")\n\n# Insert into repo (writes to S3 + indexes in Redis)\nentry, stored_dataset = repo.insert(\n dataset,\n name=\"training-images-v1\",\n cache_local=False, # Stream directly to S3\n)\n\nprint(entry.cid) # Content identifier\nprint(stored_dataset.url) # S3 URL for the stored data\nprint(stored_dataset.shard_list) # Individual shard URLs\n\n\n\nInsert Options\n\nentry, ds = repo.insert(\n dataset,\n name=\"my-dataset\",\n cache_local=True, # Write locally first, then copy (faster for some workloads)\n maxcount=10000, # Samples per shard\n maxsize=100_000_000, # Max shard size in bytes\n)", 596 "crumbs": [ 597 "Guide", 598 "Reference", 599 "Local Storage" 600 ] 601 }, 602 { 603 "objectID": "reference/local-storage.html#localdatasetentry", 604 "href": "reference/local-storage.html#localdatasetentry", 605 "title": "Local Storage", 606 "section": "LocalDatasetEntry", 607 "text": "LocalDatasetEntry\nIndex entries provide content-addressable identification:\n\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Core properties (IndexEntry protocol)\nentry.name # Human-readable name\nentry.schema_ref # Schema reference\nentry.data_urls # WebDataset URLs\nentry.metadata # Arbitrary metadata dict or None\n\n# Content addressing\nentry.cid # ATProto-compatible CID (content identifier)\n\n# Legacy compatibility\nentry.wds_url # First data URL\nentry.sample_kind # Same as schema_ref\n\n\n\n\n\n\n\nTip\n\n\n\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring identical data produces identical CIDs whether stored locally or in the atmosphere.", 608 "crumbs": [ 609 "Guide", 610 "Reference", 611 "Local Storage" 612 ] 613 }, 614 { 615 "objectID": "reference/local-storage.html#schema-storage", 616 "href": "reference/local-storage.html#schema-storage", 617 "title": "Local Storage", 618 "section": "Schema Storage", 619 "text": "Schema Storage\nSchemas can be stored 
and retrieved from the index:\n\n# Publish a schema\nschema_ref = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image with label annotation\",\n)\n# Returns: \"atdata://local/sampleSchema/ImageSample@1.0.0\"\n\n# Retrieve schema record\nschema = index.get_schema(schema_ref)\n# {\n# \"name\": \"ImageSample\",\n# \"version\": \"1.0.0\",\n# \"fields\": [...],\n# \"description\": \"...\",\n# \"createdAt\": \"...\",\n# }\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']}@{schema['version']}\")\n\n# Reconstruct sample type from schema\nSampleType = index.decode_schema(schema_ref)\ndataset = atdata.Dataset[SampleType](entry.data_urls[0])", 620 "crumbs": [ 621 "Guide", 622 "Reference", 623 "Local Storage" 624 ] 625 }, 626 { 627 "objectID": "reference/local-storage.html#s3datastore", 628 "href": "reference/local-storage.html#s3datastore", 629 "title": "Local Storage", 630 "section": "S3DataStore", 631 "text": "S3DataStore\nFor direct S3 operations without Redis indexing:\n\nfrom atdata.local import S3DataStore\n\nstore = S3DataStore(\n credentials=\"path/to/.env\",\n bucket=\"my-bucket\",\n)\n\n# Write dataset shards\nurls = store.write_shards(\n dataset,\n prefix=\"datasets/v1\",\n maxcount=10000,\n)\n# Returns: [\"s3://my-bucket/datasets/v1/data--uuid--000000.tar\", ...]\n\n# Check capabilities\nstore.supports_streaming() # True", 632 "crumbs": [ 633 "Guide", 634 "Reference", 635 "Local Storage" 636 ] 637 }, 638 { 639 "objectID": "reference/local-storage.html#complete-workflow-example", 640 "href": "reference/local-storage.html#complete-workflow-example", 641 "title": "Local Storage", 642 "section": "Complete Workflow Example", 643 "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass TrainingSample:\n features: NDArray\n label: int\n source: str\n\n# 2. Create samples\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(10000)\n]\n\n# 3. Write to local tar\nwith wds.writer.TarWriter(\"local-data.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nlocal_ds = atdata.Dataset[TrainingSample](\"local-data.tar\")\nentry = index.insert_dataset(local_ds, name=\"training-v1\", prefix=\"training\")\n\n# 5. 
Retrieve later\nentry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(batch.features.shape) # (32, 128)", 644 "crumbs": [ 645 "Guide", 646 "Reference", 647 "Local Storage" 648 ] 649 }, 650 { 651 "objectID": "reference/local-storage.html#related", 652 "href": "reference/local-storage.html#related", 653 "title": "Local Storage", 654 "section": "Related", 655 "text": "Related\n\nDatasets - Dataset iteration and batching\nProtocols - AbstractIndex and IndexEntry interfaces\nPromotion - Promoting local datasets to ATProto\nAtmosphere - ATProto federation", 656 "crumbs": [ 657 "Guide", 658 "Reference", 659 "Local Storage" 660 ] 661 }, 662 { 663 "objectID": "reference/uri-spec.html", 664 "href": "reference/uri-spec.html", 665 "title": "URI Specification", 666 "section": "", 667 "text": "The atdata:// URI scheme provides a unified way to address atdata resources across local development and the ATProto federation.", 668 "crumbs": [ 669 "Guide", 670 "Reference", 671 "URI Specification" 672 ] 673 }, 674 { 675 "objectID": "reference/uri-spec.html#overview", 676 "href": "reference/uri-spec.html#overview", 677 "title": "URI Specification", 678 "section": "Overview", 679 "text": "Overview\nThe atdata URI scheme:\n\nFollows RFC 3986 syntax\nProvides consistent addressing for local and atmosphere resources\nEnables seamless promotion from development to production", 680 "crumbs": [ 681 "Guide", 682 "Reference", 683 "URI Specification" 684 ] 685 }, 686 { 687 "objectID": "reference/uri-spec.html#uri-format", 688 "href": "reference/uri-spec.html#uri-format", 689 "title": "URI Specification", 690 "section": "URI Format", 691 "text": "URI Format\natdata://{authority}/{resource_type}/{name}@{version}\n\nAuthority\nThe authority identifies where the resource is stored:\n\n\n\nAuthority\nDescription\nExample\n\n\n\n\nlocal\nLocal Redis/S3 storage\natdata://local/...\n\n\n{handle}\nATProto handle\natdata://alice.bsky.social/...\n\n\n{did}\nATProto DID\natdata://did:plc:abc123/...\n\n\n\n\n\nResource Types\n\n\n\nResource Type\nDescription\n\n\n\n\nsampleSchema\nPackableSample type definitions\n\n\ndataset\nDataset entries (future)\n\n\nlens\nLens transformations (future)\n\n\n\n\n\nVersion Specifiers\nVersions follow semantic versioning and are specified with @:\n\n\n\nSpecifier\nDescription\nExample\n\n\n\n\n@{major}.{minor}.{patch}\nExact version\n@1.0.0, @2.1.3\n\n\n(none)\nLatest version\nResolves to highest semver", 692 "crumbs": [ 693 "Guide", 694 "Reference", 695 "URI Specification" 696 ] 697 }, 698 { 699 "objectID": "reference/uri-spec.html#examples", 700 "href": "reference/uri-spec.html#examples", 701 "title": "URI Specification", 702 "section": "Examples", 703 "text": "Examples\n\nLocal Development\n\nfrom atdata.local import Index\n\nindex = Index()\n\n# Publish a schema (returns atdata:// URI)\nref = index.publish_schema(MySample, version=\"1.0.0\")\n# => \"atdata://local/sampleSchema/MySample@1.0.0\"\n\n# Auto-increment version\nref = index.publish_schema(MySample)\n# => \"atdata://local/sampleSchema/MySample@1.0.1\"\n\n# Retrieve by URI\nschema = index.get_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n\n\n\nAtmosphere (ATProto Federation)\n\nfrom atdata.atmosphere import Client\n\nclient = Client()\n\n# Publish returns at:// URI that maps to atdata://\nref = client.publish_schema(MySample)\n# => \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\"\n\n# Can also be 
addressed as:\n# => \"atdata://did:plc:abc123/sampleSchema/MySample@1.0.0\"\n# => \"atdata://alice.bsky.social/sampleSchema/MySample@1.0.0\"", 704 "crumbs": [ 705 "Guide", 706 "Reference", 707 "URI Specification" 708 ] 709 }, 710 { 711 "objectID": "reference/uri-spec.html#relationship-to-at-protocol-uris", 712 "href": "reference/uri-spec.html#relationship-to-at-protocol-uris", 713 "title": "URI Specification", 714 "section": "Relationship to AT Protocol URIs", 715 "text": "Relationship to AT Protocol URIs\nThe atdata:// scheme is inspired by and maps to ATProto’s at:// scheme:\n\n\n\n\n\n\n\natdata://\nat://\n\n\n\n\natdata://{did}/sampleSchema/{name}@{version}\nat://{did}/ac.foundation.dataset.sampleSchema/{rkey}\n\n\natdata://local/...\n(local only, no at:// equivalent)\n\n\n\nWhen publishing to the atmosphere, atdata URIs are automatically resolved to their corresponding at:// URIs for federation compatibility.", 716 "crumbs": [ 717 "Guide", 718 "Reference", 719 "URI Specification" 720 ] 721 }, 722 { 723 "objectID": "reference/uri-spec.html#legacy-format", 724 "href": "reference/uri-spec.html#legacy-format", 725 "title": "URI Specification", 726 "section": "Legacy Format", 727 "text": "Legacy Format\nFor backwards compatibility, the local index also accepts the legacy format:\nlocal://schemas/{module.Class}@{version}\nThis format is deprecated and will be removed in a future version. Use atdata://local/sampleSchema/{name}@{version} instead.", 728 "crumbs": [ 729 "Guide", 730 "Reference", 731 "URI Specification" 732 ] 733 }, 734 { 735 "objectID": "tutorials/quickstart.html", 736 "href": "tutorials/quickstart.html", 737 "title": "Quick Start", 738 "section": "", 739 "text": "This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them. You’ll learn the foundational patterns that enable type-safe, efficient dataset handling—the first layer of atdata’s three-layer architecture.", 740 "crumbs": [ 741 "Guide", 742 "Getting Started", 743 "Quick Start" 744 ] 745 }, 746 { 747 "objectID": "tutorials/quickstart.html#where-this-fits", 748 "href": "tutorials/quickstart.html#where-this-fits", 749 "title": "Quick Start", 750 "section": "Where This Fits", 751 "text": "Where This Fits\natdata is built around a simple progression:\nLocal Development → Team Storage → Federation\nThis tutorial covers local development—the foundation. Everything you learn here (typed samples, efficient iteration, lens transformations) carries forward as you scale to team storage and federated sharing. The key insight is that your sample types remain the same across all three layers; only the storage backend changes.", 752 "crumbs": [ 753 "Guide", 754 "Getting Started", 755 "Quick Start" 756 ] 757 }, 758 { 759 "objectID": "tutorials/quickstart.html#installation", 760 "href": "tutorials/quickstart.html#installation", 761 "title": "Quick Start", 762 "section": "Installation", 763 "text": "Installation\npip install atdata\n\n# With ATProto support\npip install atdata[atmosphere]", 764 "crumbs": [ 765 "Guide", 766 "Getting Started", 767 "Quick Start" 768 ] 769 }, 770 { 771 "objectID": "tutorials/quickstart.html#define-a-sample-type", 772 "href": "tutorials/quickstart.html#define-a-sample-type", 773 "title": "Quick Start", 774 "section": "Define a Sample Type", 775 "text": "Define a Sample Type\nThe core abstraction in atdata is the PackableSample—a typed, serializable data structure. 
Unlike raw dictionaries or ad-hoc classes, PackableSamples provide:\n\nType safety: Know your schema at write time, not training time\nAutomatic serialization: msgpack encoding with efficient NDArray handling\nRound-trip fidelity: Data survives serialization without loss\n\nUse the @packable decorator to create a typed sample:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing an image with label and confidence.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\nThe @packable decorator:\n\nConverts your class into a dataclass\nAdds automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 776 "crumbs": [ 777 "Guide", 778 "Getting Started", 779 "Quick Start" 780 ] 781 }, 782 { 783 "objectID": "tutorials/quickstart.html#create-sample-instances", 784 "href": "tutorials/quickstart.html#create-sample-instances", 785 "title": "Quick Start", 786 "section": "Create Sample Instances", 787 "text": "Create Sample Instances\n\n# Create a single sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# Check serialization\npacked_bytes = sample.packed\nprint(f\"Serialized size: {len(packed_bytes):,} bytes\")\n\n# Verify round-trip\nrestored = ImageSample.from_bytes(packed_bytes)\nassert np.allclose(sample.image, restored.image)\nprint(\"Round-trip successful!\")", 788 "crumbs": [ 789 "Guide", 790 "Getting Started", 791 "Quick Start" 792 ] 793 }, 794 { 795 "objectID": "tutorials/quickstart.html#write-a-dataset", 796 "href": "tutorials/quickstart.html#write-a-dataset", 797 "title": "Quick Start", 798 "section": "Write a Dataset", 799 "text": "Write a Dataset\natdata uses WebDataset’s tar format for storage. This choice is deliberate:\n\nStreaming: Process data without downloading entire datasets\nSharding: Split large datasets across multiple files for parallel I/O\nProven: Battle-tested at scale by organizations like Google, NVIDIA, and OpenAI\n\nThe as_wds property on your sample provides the dictionary format WebDataset expects:\nUse WebDataset’s TarWriter to create dataset files:\n\nimport webdataset as wds\n\n# Create 100 samples\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"class_{i % 10}\",\n confidence=np.random.rand(),\n )\n for i in range(100)\n]\n\n# Write to tar file\nwith wds.writer.TarWriter(\"my-dataset-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\nprint(\"Wrote 100 samples to my-dataset-000000.tar\")", 800 "crumbs": [ 801 "Guide", 802 "Getting Started", 803 "Quick Start" 804 ] 805 }, 806 { 807 "objectID": "tutorials/quickstart.html#load-and-iterate", 808 "href": "tutorials/quickstart.html#load-and-iterate", 809 "title": "Quick Start", 810 "section": "Load and Iterate", 811 "text": "Load and Iterate\nThe generic Dataset[T] class connects your sample type to WebDataset’s streaming infrastructure. 
When you specify Dataset[ImageSample], atdata knows how to deserialize the msgpack bytes back into fully-typed objects.\nAutomatic batch aggregation is a key feature: when you iterate with batch_size, atdata returns SampleBatch objects that intelligently combine samples:\n\nNDArray fields are stacked into a single array with a batch dimension\nOther fields become lists of values\n\nThis eliminates boilerplate collation code and works automatically with any PackableSample type.\nCreate a typed Dataset and iterate with batching:\n\n# Load dataset with type\ndataset = atdata.Dataset[ImageSample](\"my-dataset-000000.tar\")\n\n# Iterate in order with batching\nfor batch in dataset.ordered(batch_size=16):\n # NDArray fields are stacked\n images = batch.image # shape: (16, 224, 224, 3)\n\n # Other fields become lists\n labels = batch.label # list of 16 strings\n confidences = batch.confidence # list of 16 floats\n\n print(f\"Batch shape: {images.shape}\")\n print(f\"Labels: {labels[:3]}...\")\n break", 812 "crumbs": [ 813 "Guide", 814 "Getting Started", 815 "Quick Start" 816 ] 817 }, 818 { 819 "objectID": "tutorials/quickstart.html#shuffled-iteration", 820 "href": "tutorials/quickstart.html#shuffled-iteration", 821 "title": "Quick Start", 822 "section": "Shuffled Iteration", 823 "text": "Shuffled Iteration\nProper shuffling is critical for training. WebDataset provides two-level shuffling:\n\nShard shuffling: Randomize the order of tar files\nSample shuffling: Randomize samples within a buffer\n\nThis approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.\nFor training, use shuffled iteration:\n\nfor batch in dataset.shuffled(batch_size=32):\n # Samples are shuffled at shard and sample level\n images = batch.image\n labels = batch.label\n\n # Train your model\n # model.train(images, labels)\n break", 824 "crumbs": [ 825 "Guide", 826 "Getting Started", 827 "Quick Start" 828 ] 829 }, 830 { 831 "objectID": "tutorials/quickstart.html#use-lenses-for-type-transformations", 832 "href": "tutorials/quickstart.html#use-lenses-for-type-transformations", 833 "title": "Quick Start", 834 "section": "Use Lenses for Type Transformations", 835 "text": "Use Lenses for Type Transformations\nLenses are bidirectional transformations between sample types. They solve a common problem: you have a dataset with a rich schema, but a particular task only needs a subset of fields—or needs derived fields computed on-the-fly.\nInstead of creating separate datasets for each use case (duplicating storage and maintenance burden), lenses let you view the same underlying data through different type schemas. 
This is inspired by functional programming concepts and enables:\n\nSchema reduction: Drop fields you don’t need\nSchema migration: Handle version differences between datasets\nDerived features: Compute fields on-the-fly during iteration\n\nView datasets through different schemas:\n\n# Define a simplified view type\n@atdata.packable\nclass SimplifiedSample:\n label: str\n confidence: float\n\n# Create a lens transformation\n@atdata.lens\ndef simplify(src: ImageSample) -> SimplifiedSample:\n return SimplifiedSample(label=src.label, confidence=src.confidence)\n\n# View dataset through lens\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nfor batch in simple_ds.ordered(batch_size=8):\n print(f\"Labels: {batch.label}\")\n print(f\"Confidences: {batch.confidence}\")\n break", 836 "crumbs": [ 837 "Guide", 838 "Getting Started", 839 "Quick Start" 840 ] 841 }, 842 { 843 "objectID": "tutorials/quickstart.html#what-youve-learned", 844 "href": "tutorials/quickstart.html#what-youve-learned", 845 "title": "Quick Start", 846 "section": "What You’ve Learned", 847 "text": "What You’ve Learned\nYou now understand atdata’s foundational concepts:\n\n\n\nConcept\nPurpose\n\n\n\n\n@packable\nCreate typed, serializable sample classes\n\n\nDataset[T]\nTyped iteration over WebDataset tar files\n\n\nSampleBatch[T]\nAutomatic aggregation with NDArray stacking\n\n\n@lens\nTransform between sample types without data duplication\n\n\n\nThese patterns work identically whether your data lives on local disk, in team S3 storage, or published to the ATProto network. The next tutorials show how to scale beyond local files.", 848 "crumbs": [ 849 "Guide", 850 "Getting Started", 851 "Quick Start" 852 ] 853 }, 854 { 855 "objectID": "tutorials/quickstart.html#next-steps", 856 "href": "tutorials/quickstart.html#next-steps", 857 "title": "Quick Start", 858 "section": "Next Steps", 859 "text": "Next Steps\n\n\n\n\n\n\nReady to Share with Your Team?\n\n\n\nThe Local Workflow tutorial shows how to set up Redis + S3 storage for team-wide dataset discovery and sharing.\n\n\n\nLocal Workflow - Store datasets with Redis + S3\nAtmosphere Publishing - Publish to ATProto federation\nPackable Samples - Deep dive into sample types\nDatasets - Advanced dataset operations", 860 "crumbs": [ 861 "Guide", 862 "Getting Started", 863 "Quick Start" 864 ] 865 }, 866 { 867 "objectID": "tutorials/atmosphere.html", 868 "href": "tutorials/atmosphere.html", 869 "title": "Atmosphere Publishing", 870 "section": "", 871 "text": "This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing. This is Layer 3 of atdata’s architecture—decentralized federation that enables cross-organization dataset sharing.", 872 "crumbs": [ 873 "Guide", 874 "Getting Started", 875 "Atmosphere Publishing" 876 ] 877 }, 878 { 879 "objectID": "tutorials/atmosphere.html#why-federation", 880 "href": "tutorials/atmosphere.html#why-federation", 881 "title": "Atmosphere Publishing", 882 "section": "Why Federation?", 883 "text": "Why Federation?\nTeam storage (Redis + S3) works well within an organization, but sharing across organizations introduces new challenges:\n\nDiscovery: How do researchers find relevant datasets across institutions?\nTrust: How do you verify a dataset is what it claims to be?\nDurability: What happens if the original publisher goes offline?\n\nThe AT Protocol (ATProto), developed by Bluesky, provides a foundation for decentralized social applications. 
atdata leverages ATProto’s infrastructure for dataset federation:\n\n\n\n\n\n\n\nATProto Feature\natdata Usage\n\n\n\n\nDIDs (Decentralized Identifiers)\nPublisher identity verification\n\n\nLexicons\nDataset/schema record schemas\n\n\nPDSes (Personal Data Servers)\nStorage for records and blobs\n\n\nRelays & AppViews\nDiscovery and aggregation\n\n\n\nThe key insight: your Bluesky identity (@handle.bsky.social) becomes your dataset publisher identity. Anyone can verify that a dataset was published by you, and can discover your datasets through the federated network.", 884 "crumbs": [ 885 "Guide", 886 "Getting Started", 887 "Atmosphere Publishing" 888 ] 889 }, 890 { 891 "objectID": "tutorials/atmosphere.html#prerequisites", 892 "href": "tutorials/atmosphere.html#prerequisites", 893 "title": "Atmosphere Publishing", 894 "section": "Prerequisites", 895 "text": "Prerequisites\n\npip install atdata[atmosphere]\nA Bluesky account with an app-specific password\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password.", 896 "crumbs": [ 897 "Guide", 898 "Getting Started", 899 "Atmosphere Publishing" 900 ] 901 }, 902 { 903 "objectID": "tutorials/atmosphere.html#setup", 904 "href": "tutorials/atmosphere.html#setup", 905 "title": "Atmosphere Publishing", 906 "section": "Setup", 907 "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import (\n AtmosphereClient,\n AtmosphereIndex,\n PDSBlobStore,\n SchemaPublisher,\n SchemaLoader,\n DatasetPublisher,\n DatasetLoader,\n AtUri,\n)\nfrom atdata import BlobSource\nimport webdataset as wds", 908 "crumbs": [ 909 "Guide", 910 "Getting Started", 911 "Atmosphere Publishing" 912 ] 913 }, 914 { 915 "objectID": "tutorials/atmosphere.html#define-sample-types", 916 "href": "tutorials/atmosphere.html#define-sample-types", 917 "title": "Atmosphere Publishing", 918 "section": "Define Sample Types", 919 "text": "Define Sample Types\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing image data with metadata.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\n@atdata.packable\nclass TextEmbeddingSample:\n \"\"\"A sample containing text with embedding vectors.\"\"\"\n text: str\n embedding: NDArray\n source: str", 920 "crumbs": [ 921 "Guide", 922 "Getting Started", 923 "Atmosphere Publishing" 924 ] 925 }, 926 { 927 "objectID": "tutorials/atmosphere.html#type-introspection", 928 "href": "tutorials/atmosphere.html#type-introspection", 929 "title": "Atmosphere Publishing", 930 "section": "Type Introspection", 931 "text": "Type Introspection\nSee what information is available from a PackableSample type:\n\nfrom dataclasses import fields, is_dataclass\n\nprint(f\"Sample type: {ImageSample.__name__}\")\nprint(f\"Is dataclass: {is_dataclass(ImageSample)}\")\n\nprint(\"\\nFields:\")\nfor field in fields(ImageSample):\n print(f\" - {field.name}: {field.type}\")\n\n# Create and serialize a sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\npacked = sample.packed\nprint(f\"\\nSerialized size: {len(packed):,} bytes\")\n\n# Round-trip\nrestored = ImageSample.from_bytes(packed)\nprint(f\"Round-trip successful: {np.allclose(sample.image, restored.image)}\")", 932 "crumbs": [ 933 "Guide", 934 "Getting Started", 935 "Atmosphere Publishing" 936 ] 937 }, 938 { 939 "objectID": "tutorials/atmosphere.html#at-uri-parsing", 940 "href": "tutorials/atmosphere.html#at-uri-parsing", 941 "title": 
"Atmosphere Publishing", 942 "section": "AT URI Parsing", 943 "text": "AT URI Parsing\nEvery record in ATProto is identified by an AT URI, which encodes:\n\nAuthority: The DID or handle of the record owner\nCollection: The Lexicon type (like a table name)\nRkey: The record key (unique within the collection)\n\nUnderstanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.\nATProto records are identified by AT URIs:\n\nuris = [\n \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789\",\n \"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset\",\n]\n\nfor uri_str in uris:\n print(f\"\\nParsing: {uri_str}\")\n uri = AtUri.parse(uri_str)\n print(f\" Authority: {uri.authority}\")\n print(f\" Collection: {uri.collection}\")\n print(f\" Rkey: {uri.rkey}\")", 944 "crumbs": [ 945 "Guide", 946 "Getting Started", 947 "Atmosphere Publishing" 948 ] 949 }, 950 { 951 "objectID": "tutorials/atmosphere.html#authentication", 952 "href": "tutorials/atmosphere.html#authentication", 953 "title": "Atmosphere Publishing", 954 "section": "Authentication", 955 "text": "Authentication\nThe AtmosphereClient handles ATProto authentication. When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).\nConnect to ATProto:\n\nclient = AtmosphereClient()\nclient.login(\"your.handle.social\", \"your-app-password\")\n\nprint(f\"Authenticated as: {client.handle}\")\nprint(f\"DID: {client.did}\")", 956 "crumbs": [ 957 "Guide", 958 "Getting Started", 959 "Atmosphere Publishing" 960 ] 961 }, 962 { 963 "objectID": "tutorials/atmosphere.html#publish-a-schema", 964 "href": "tutorials/atmosphere.html#publish-a-schema", 965 "title": "Atmosphere Publishing", 966 "section": "Publish a Schema", 967 "text": "Publish a Schema\nWhen you publish a schema to ATProto, it becomes a public, immutable record that others can reference. 
The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.\n\nschema_publisher = SchemaPublisher(client)\nschema_uri = schema_publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Demo: Image sample with label and confidence\",\n)\nprint(f\"Schema URI: {schema_uri}\")", 968 "crumbs": [ 969 "Guide", 970 "Getting Started", 971 "Atmosphere Publishing" 972 ] 973 }, 974 { 975 "objectID": "tutorials/atmosphere.html#list-your-schemas", 976 "href": "tutorials/atmosphere.html#list-your-schemas", 977 "title": "Atmosphere Publishing", 978 "section": "List Your Schemas", 979 "text": "List Your Schemas\n\nschema_loader = SchemaLoader(client)\nschemas = schema_loader.list_all(limit=10)\nprint(f\"Found {len(schemas)} schema(s)\")\n\nfor schema in schemas:\n print(f\" - {schema.get('name', 'Unknown')}: v{schema.get('version', '?')}\")", 980 "crumbs": [ 981 "Guide", 982 "Getting Started", 983 "Atmosphere Publishing" 984 ] 985 }, 986 { 987 "objectID": "tutorials/atmosphere.html#publish-a-dataset", 988 "href": "tutorials/atmosphere.html#publish-a-dataset", 989 "title": "Atmosphere Publishing", 990 "section": "Publish a Dataset", 991 "text": "Publish a Dataset\n\nWith PDS Blob Storage (Recommended)\nThe PDSBlobStore is the fully decentralized option: your dataset shards are stored as ATProto blobs directly in your PDS, alongside your other ATProto records. 
This means:\n\nNo external dependencies: Data lives in the same infrastructure as your identity\nContent-addressed: Blobs are identified by their CID, ensuring integrity\nFederated replication: Relays can mirror your blobs for availability\n\nFor fully decentralized storage, use PDSBlobStore to store dataset shards directly as ATProto blobs in your PDS:\n\n# Create store and index with blob storage\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# Create dataset in memory or from existing tar\nsamples = [FeatureSample(features=np.random.randn(64).astype(np.float32), label=i % 10) for i in range(100)]\n\n# Write to temporary tar\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[FeatureSample](\"temp.tar\")\n\n# Publish - shards are uploaded as blobs automatically\nschema_uri = index.publish_schema(FeatureSample, version=\"1.0.0\")\nentry = index.insert_dataset(\n dataset,\n name=\"blob-stored-features\",\n schema_ref=schema_uri,\n description=\"Features stored as PDS blobs\",\n)\n\nprint(f\"Dataset URI: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\") # at://did/blob/cid format\n\n\n\n\n\n\n\nReading Blob-Stored Datasets\n\n\n\nUse BlobSource to stream directly from PDS blobs:\n\n# Create source from the blob URLs\nsource = store.create_source(entry.data_urls)\n\n# Or manually from blob references\nsource = BlobSource.from_refs([\n {\"did\": client.did, \"cid\": \"bafyrei...\"},\n])\n\n# Load and iterate\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n\n\n\n\n\nWith External URLs\nFor larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. 
The ATProto record serves as the index entry while the actual data lives elsewhere.\nFor larger datasets or when using existing object storage:\n\ndataset_publisher = DatasetPublisher(client)\ndataset_uri = dataset_publisher.publish_with_urls(\n urls=[\"s3://example-bucket/demo-data-{000000..000009}.tar\"],\n schema_uri=str(schema_uri),\n name=\"Demo Image Dataset\",\n description=\"Example dataset demonstrating atmosphere publishing\",\n tags=[\"demo\", \"images\", \"atdata\"],\n license=\"MIT\",\n)\nprint(f\"Dataset URI: {dataset_uri}\")", 992 "crumbs": [ 993 "Guide", 994 "Getting Started", 995 "Atmosphere Publishing" 996 ] 997 }, 998 { 999 "objectID": "tutorials/atmosphere.html#list-and-load-datasets", 1000 "href": "tutorials/atmosphere.html#list-and-load-datasets", 1001 "title": "Atmosphere Publishing", 1002 "section": "List and Load Datasets", 1003 "text": "List and Load Datasets\n\ndataset_loader = DatasetLoader(client)\ndatasets = dataset_loader.list_all(limit=10)\nprint(f\"Found {len(datasets)} dataset(s)\")\n\nfor ds in datasets:\n print(f\" - {ds.get('name', 'Unknown')}\")\n print(f\" Schema: {ds.get('schemaRef', 'N/A')}\")\n tags = ds.get('tags', [])\n if tags:\n print(f\" Tags: {', '.join(tags)}\")", 1004 "crumbs": [ 1005 "Guide", 1006 "Getting Started", 1007 "Atmosphere Publishing" 1008 ] 1009 }, 1010 { 1011 "objectID": "tutorials/atmosphere.html#load-a-dataset", 1012 "href": "tutorials/atmosphere.html#load-a-dataset", 1013 "title": "Atmosphere Publishing", 1014 "section": "Load a Dataset", 1015 "text": "Load a Dataset\n\n# Check storage type\nstorage_type = dataset_loader.get_storage_type(str(blob_dataset_uri))\nprint(f\"Storage type: {storage_type}\")\n\nif storage_type == \"blobs\":\n blob_urls = dataset_loader.get_blob_urls(str(blob_dataset_uri))\n print(f\"Blob URLs: {len(blob_urls)} blob(s)\")\n\n# Load and iterate (works for both storage types)\nds = dataset_loader.to_dataset(str(blob_dataset_uri), DemoSample)\nfor batch in ds.ordered():\n print(f\"Sample id={batch.id}, text={batch.text}\")", 1016 "crumbs": [ 1017 "Guide", 1018 "Getting Started", 1019 "Atmosphere Publishing" 1020 ] 1021 }, 1022 { 1023 "objectID": "tutorials/atmosphere.html#complete-publishing-workflow", 1024 "href": "tutorials/atmosphere.html#complete-publishing-workflow", 1025 "title": "Atmosphere Publishing", 1026 "section": "Complete Publishing Workflow", 1027 "text": "Complete Publishing Workflow\nHere’s the end-to-end workflow for publishing a dataset to the atmosphere:\n\nDefine your sample type using @packable\nCreate samples and write to tar (same as local workflow)\nAuthenticate with your ATProto identity\nCreate index with blob storage (AtmosphereIndex + PDSBlobStore)\nPublish schema (creates ATProto record)\nInsert dataset (uploads blobs, creates dataset record)\n\nNotice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.\nThis example shows the recommended workflow using PDSBlobStore for fully decentralized storage:\n\n# 1. Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. 
Authenticate and create index with blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs automatically)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Data stored at: {entry.data_urls}\") # at://did/blob/cid URLs\n\n# 6. Later: load from blobs\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(f\"Loaded batch with {len(batch.label)} samples\")\n break", 1028 "crumbs": [ 1029 "Guide", 1030 "Getting Started", 1031 "Atmosphere Publishing" 1032 ] 1033 }, 1034 { 1035 "objectID": "tutorials/atmosphere.html#what-youve-learned", 1036 "href": "tutorials/atmosphere.html#what-youve-learned", 1037 "title": "Atmosphere Publishing", 1038 "section": "What You’ve Learned", 1039 "text": "What You’ve Learned\nYou now understand federated dataset publishing in atdata:\n\n\n\nConcept\nPurpose\n\n\n\n\nAtmosphereClient\nATProto authentication and record management\n\n\nAtmosphereIndex\nFederated index implementing AbstractIndex\n\n\nPDSBlobStore\nPDS blob storage implementing AbstractDataStore\n\n\nBlobSource\nStream datasets from PDS blobs\n\n\nAT URIs\nUniversal identifiers for schemas and datasets\n\n\n\nThe protocol abstractions (AbstractIndex, AbstractDataStore, DataSource) ensure your code works across all three layers of atdata—local files, team storage, and federated sharing.", 1040 "crumbs": [ 1041 "Guide", 1042 "Getting Started", 1043 "Atmosphere Publishing" 1044 ] 1045 }, 1046 { 1047 "objectID": "tutorials/atmosphere.html#the-full-picture", 1048 "href": "tutorials/atmosphere.html#the-full-picture", 1049 "title": "Atmosphere Publishing", 1050 "section": "The Full Picture", 1051 "text": "The Full Picture\nYou’ve now seen atdata’s complete architecture:\nLocal Development Team Storage Federation\n───────────────── ──────────── ──────────\ntar files Redis + S3 ATProto PDS\nDataset[T] LocalIndex AtmosphereIndex\n S3DataStore PDSBlobStore\nThe same @packable sample types, the same Dataset[T] iteration patterns, and the same lens transformations work at every layer. 
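For example, here is a minimal sketch of that backend swap (the handle, password, bucket, credentials path, MySample, and dataset below are placeholders standing in for the objects defined in the earlier examples):\n\n# Layer 2: team storage (Redis index + S3 shards)\nfrom atdata.local import LocalIndex, S3DataStore\nindex = LocalIndex(data_store=S3DataStore(credentials=\"path/to/.env\", bucket=\"my-bucket\"))\n\n# Layer 3: federation (ATProto records + PDS blobs)\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nclient = AtmosphereClient()\nclient.login(\"your.handle.social\", \"your-app-password\")\nindex = AtmosphereIndex(client, data_store=PDSBlobStore(client))\n\n# The calls against either index are the same\nindex.publish_schema(MySample, version=\"1.0.0\")\nindex.insert_dataset(dataset, name=\"my-dataset\")\n\n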
Only the storage backend changes.", 1052 "crumbs": [ 1053 "Guide", 1054 "Getting Started", 1055 "Atmosphere Publishing" 1056 ] 1057 }, 1058 { 1059 "objectID": "tutorials/atmosphere.html#next-steps", 1060 "href": "tutorials/atmosphere.html#next-steps", 1061 "title": "Atmosphere Publishing", 1062 "section": "Next Steps", 1063 "text": "Next Steps\n\n\n\n\n\n\nAlready Have Local Datasets?\n\n\n\nThe Promotion Workflow tutorial shows how to migrate existing datasets from local storage to the atmosphere without re-processing your data.\n\n\n\nPromotion Workflow - Migrate from local storage to atmosphere\nAtmosphere Reference - Complete API reference\nProtocols - Abstract interfaces", 1064 "crumbs": [ 1065 "Guide", 1066 "Getting Started", 1067 "Atmosphere Publishing" 1068 ] 1069 }, 1070 { 1071 "objectID": "api/SchemaLoader.html", 1072 "href": "api/SchemaLoader.html", 1073 "title": "SchemaLoader", 1074 "section": "", 1075 "text": "atmosphere.SchemaLoader(client)\nLoads PackableSample schemas from ATProto.\nThis class fetches schema records from ATProto and can list available schemas from a repository.\n\n\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> loader = SchemaLoader(client)\n>>> schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n>>> print(schema[\"name\"])\n'MySample'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 
1076 }, 1077 { 1078 "objectID": "api/SchemaLoader.html#examples", 1079 "href": "api/SchemaLoader.html#examples", 1080 "title": "SchemaLoader", 1081 "section": "", 1082 "text": ">>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> loader = SchemaLoader(client)\n>>> schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n>>> print(schema[\"name\"])\n'MySample'" 1083 }, 1084 { 1085 "objectID": "api/SchemaLoader.html#methods", 1086 "href": "api/SchemaLoader.html#methods", 1087 "title": "SchemaLoader", 1088 "section": "", 1089 "text": "Name\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 1090 }, 1091 { 1092 "objectID": "api/BlobSource.html", 1093 "href": "api/BlobSource.html", 1094 "title": "BlobSource", 1095 "section": "", 1096 "text": "BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())\nData source for ATProto PDS blob storage.\nStreams dataset shards stored as blobs on an ATProto Personal Data Server. Each shard is identified by a blob reference containing the DID and CID.\nThis source resolves blob references to HTTP URLs and streams the content directly, supporting efficient iteration over shards without downloading everything upfront.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID.\n\n\n\n\n\n\n>>> source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n>>> for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 1097 }, 1098 { 1099 "objectID": "api/BlobSource.html#attributes", 1100 "href": "api/BlobSource.html#attributes", 1101 "title": "BlobSource", 1102 "section": "", 1103 "text": "Name\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID." 1104 }, 1105 { 1106 "objectID": "api/BlobSource.html#examples", 1107 "href": "api/BlobSource.html#examples", 1108 "title": "BlobSource", 1109 "section": "", 1110 "text": ">>> source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n>>> for shard_id, stream in source.shards:\n... 
process(stream)" 1111 }, 1112 { 1113 "objectID": "api/BlobSource.html#methods", 1114 "href": "api/BlobSource.html#methods", 1115 "title": "BlobSource", 1116 "section": "", 1117 "text": "Name\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 1118 }, 1119 { 1120 "objectID": "api/AtmosphereClient.html", 1121 "href": "api/AtmosphereClient.html", 1122 "title": "AtmosphereClient", 1123 "section": "", 1124 "text": "atmosphere.AtmosphereClient(base_url=None, *, _client=None)\nATProto client wrapper for atdata operations.\nThis class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).\n\n\n>>> client = AtmosphereClient()\n>>> client.login(\"alice.bsky.social\", \"app-password\")\n>>> print(client.did)\n'did:plc:...'\n\n\n\nThe password should be an app-specific password, not your main account password. 
Create app passwords in your Bluesky account settings.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the 
blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1125 }, 1126 { 1127 "objectID": "api/AtmosphereClient.html#examples", 1128 "href": "api/AtmosphereClient.html#examples", 1129 "title": "AtmosphereClient", 1130 "section": "", 1131 "text": ">>> client = AtmosphereClient()\n>>> client.login(\"alice.bsky.social\", \"app-password\")\n>>> print(client.did)\n'did:plc:...'" 1132 }, 1133 { 1134 "objectID": "api/AtmosphereClient.html#note", 1135 "href": "api/AtmosphereClient.html#note", 1136 "title": "AtmosphereClient", 1137 "section": "", 1138 "text": "The password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings." 1139 }, 1140 { 1141 "objectID": "api/AtmosphereClient.html#attributes", 1142 "href": "api/AtmosphereClient.html#attributes", 1143 "title": "AtmosphereClient", 1144 "section": "", 1145 "text": "Name\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session." 
1146 }, 1147 { 1148 "objectID": "api/AtmosphereClient.html#methods", 1149 "href": "api/AtmosphereClient.html#methods", 1150 "title": "AtmosphereClient", 1151 "section": "", 1152 "text": "Name\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the 
blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1153 }, 1154 { 1155 "objectID": "api/load_dataset.html", 1156 "href": "api/load_dataset.html", 1157 "title": "load_dataset", 1158 "section": "", 1159 "text": "load_dataset(\n path,\n sample_type=None,\n *,\n split=None,\n data_files=None,\n streaming=False,\n index=None,\n)\nLoad a dataset from local files, remote URLs, or an index.\nThis function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.\nWhen no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. 
Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/*.tar” - Remote URL: “s3://bucket/path/data-*.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -> files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index.\n\n\n\n\n\n\n>>> # Load without type - get DictSample for exploration\n>>> ds = load_dataset(\"./data/train.tar\", split=\"train\")\n>>> for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n>>>\n>>> # Convert to typed schema\n>>> typed_ds = ds.as_type(TextData)\n>>>\n>>> # Or load with explicit type directly\n>>> train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n>>>\n>>> # Load from index with auto-type resolution\n>>> index = LocalIndex()\n>>> ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1160 }, 1161 { "objectID": "api/load_dataset.html#parameters", "href": "api/load_dataset.html#parameters", "title": "load_dataset", "section": "", "text": "Name\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/*.tar” - Remote URL: “s3://bucket/path/data-*.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits.
If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -> files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone" 1167 }, 1168 { 1169 "objectID": "api/load_dataset.html#returns", 1170 "href": "api/load_dataset.html#returns", 1171 "title": "load_dataset", 1172 "section": "", 1173 "text": "Name\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample." 1174 }, 1175 { 1176 "objectID": "api/load_dataset.html#raises", 1177 "href": "api/load_dataset.html#raises", 1178 "title": "load_dataset", 1179 "section": "", 1180 "text": "Name\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index." 1181 }, 1182 { 1183 "objectID": "api/load_dataset.html#examples", 1184 "href": "api/load_dataset.html#examples", 1185 "title": "load_dataset", 1186 "section": "", 1187 "text": ">>> # Load without type - get DictSample for exploration\n>>> ds = load_dataset(\"./data/train.tar\", split=\"train\")\n>>> for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n>>>\n>>> # Convert to typed schema\n>>> typed_ds = ds.as_type(TextData)\n>>>\n>>> # Or load with explicit type directly\n>>> train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n>>>\n>>> # Load from index with auto-type resolution\n>>> index = LocalIndex()\n>>> ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1188 }, 1189 { 1190 "objectID": "api/promote_to_atmosphere.html", 1191 "href": "api/promote_to_atmosphere.html", 1192 "title": "promote_to_atmosphere", 1193 "section": "", 1194 "text": "promote.promote_to_atmosphere(\n local_entry,\n local_index,\n atmosphere_client,\n *,\n data_store=None,\n name=None,\n description=None,\n tags=None,\n license=None,\n)\nPromote a local dataset to the atmosphere network.\nThis function takes a locally-indexed dataset and publishes it to ATProto, making it discoverable on the federated atmosphere network.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. 
Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs.\n\n\n\n\n\n\n>>> entry = local_index.get_dataset(\"mnist-train\")\n>>> uri = promote_to_atmosphere(entry, local_index, client)\n>>> print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1195 }, 1196 { 1197 "objectID": "api/promote_to_atmosphere.html#parameters", 1198 "href": "api/promote_to_atmosphere.html#parameters", 1199 "title": "promote_to_atmosphere", 1200 "section": "", 1201 "text": "Name\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone" 1202 }, 1203 { 1204 "objectID": "api/promote_to_atmosphere.html#returns", 1205 "href": "api/promote_to_atmosphere.html#returns", 1206 "title": "promote_to_atmosphere", 1207 "section": "", 1208 "text": "Name\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record." 1209 }, 1210 { 1211 "objectID": "api/promote_to_atmosphere.html#raises", 1212 "href": "api/promote_to_atmosphere.html#raises", 1213 "title": "promote_to_atmosphere", 1214 "section": "", 1215 "text": "Name\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs." 1216 }, 1217 { 1218 "objectID": "api/promote_to_atmosphere.html#examples", 1219 "href": "api/promote_to_atmosphere.html#examples", 1220 "title": "promote_to_atmosphere", 1221 "section": "", 1222 "text": ">>> entry = local_index.get_dataset(\"mnist-train\")\n>>> uri = promote_to_atmosphere(entry, local_index, client)\n>>> print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1223 }, 1224 { 1225 "objectID": "api/SchemaPublisher.html", 1226 "href": "api/SchemaPublisher.html", 1227 "title": "SchemaPublisher", 1228 "section": "", 1229 "text": "atmosphere.SchemaPublisher(client)\nPublishes PackableSample schemas to ATProto.\nThis class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.\n\n\n>>> @atdata.packable\n... class MySample:\n... image: NDArray\n... 
label: str\n...\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = SchemaPublisher(client)\n>>> uri = publisher.publish(MySample, version=\"1.0.0\")\n>>> print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/...\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 1230 }, 1231 { 1232 "objectID": "api/SchemaPublisher.html#examples", 1233 "href": "api/SchemaPublisher.html#examples", 1234 "title": "SchemaPublisher", 1235 "section": "", 1236 "text": ">>> @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = SchemaPublisher(client)\n>>> uri = publisher.publish(MySample, version=\"1.0.0\")\n>>> print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/..." 1237 }, 1238 { 1239 "objectID": "api/SchemaPublisher.html#methods", 1240 "href": "api/SchemaPublisher.html#methods", 1241 "title": "SchemaPublisher", 1242 "section": "", 1243 "text": "Name\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 
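As a complementary sketch to the example above, publish() also accepts the optional keyword parameters listed in its table; the values here are placeholders, and MySample and client are assumed from the earlier example:

>>> publisher = SchemaPublisher(client)
>>> uri = publisher.publish(
...     MySample,
...     name="Image-label sample",
...     version="1.1.0",
...     description="Image array plus string label",
...     metadata={"origin": "docs-example"},
... )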
1244 }, 1245 { 1246 "objectID": "api/DatasetPublisher.html", 1247 "href": "api/DatasetPublisher.html", 1248 "title": "DatasetPublisher", 1249 "section": "", 1250 "text": "atmosphere.DatasetPublisher(client)\nPublishes dataset index records to ATProto.\nThis class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.\n\n\n>>> dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n>>>\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = DatasetPublisher(client)\n>>> uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. 
Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 1251 }, 1252 { 1253 "objectID": "api/DatasetPublisher.html#examples", 1254 "href": "api/DatasetPublisher.html#examples", 1255 "title": "DatasetPublisher", 1256 "section": "", 1257 "text": ">>> dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n>>>\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = DatasetPublisher(client)\n>>> uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )" 1258 }, 1259 { 1260 "objectID": "api/DatasetPublisher.html#methods", 1261 "href": "api/DatasetPublisher.html#methods", 1262 "title": "DatasetPublisher", 1263 "section": "", 1264 "text": "Name\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. 
If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 1265 }, 1266 { 1267 "objectID": "api/URLSource.html", 1268 "href": "api/URLSource.html", 1269 "title": "URLSource", 1270 "section": "", 1271 "text": "URLSource(url)\nData source for WebDataset-compatible URLs.\nWraps WebDataset’s gopen to open URLs using built-in handlers for http, https, pipe, gs, hf, sftp, etc. 
Supports brace expansion for shard patterns like “data-{000..099}.tar”.\nThis is the default source type when a string URL is passed to Dataset.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards.\n\n\n\n\n\n\n>>> source = URLSource(\"https://example.com/train-{000..009}.tar\")\n>>> for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1272 }, 1273 { 1274 "objectID": "api/URLSource.html#attributes", 1275 "href": "api/URLSource.html#attributes", 1276 "title": "URLSource", 1277 "section": "", 1278 "text": "Name\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards." 1279 }, 1280 { 1281 "objectID": "api/URLSource.html#examples", 1282 "href": "api/URLSource.html#examples", 1283 "title": "URLSource", 1284 "section": "", 1285 "text": ">>> source = URLSource(\"https://example.com/train-{000..009}.tar\")\n>>> for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")" 1286 }, 1287 { 1288 "objectID": "api/URLSource.html#methods", 1289 "href": "api/URLSource.html#methods", 1290 "title": "URLSource", 1291 "section": "", 1292 "text": "Name\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 
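A short sketch of list_shards() and open_shard() under the brace pattern from the example above; reading a fixed number of bytes from the returned stream is purely illustrative:

>>> source = URLSource("https://example.com/train-{000..009}.tar")
>>> shards = source.list_shards()
>>> len(shards)
10
>>> stream = source.open_shard(shards[0])
>>> header = stream.read(512)  # raw tar bytes streamed via gopen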
1293 }, 1294 { 1295 "objectID": "api/index.html", 1296 "href": "api/index.html", 1297 "title": "API Reference", 1298 "section": "", 1299 "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances.\n\n\n\n\n\n\nAbstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset.\n\n\n\n\n\n\nData source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage.\n\n\n\n\n\n\nLocal Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore protocol.\n\n\n\n\n\n\nATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI.\n\n\n\n\n\n\nLocal to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 
1300 }, 1301 { 1302 "objectID": "api/index.html#core", 1303 "href": "api/index.html#core", 1304 "title": "API Reference", 1305 "section": "", 1306 "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances." 1307 }, 1308 { 1309 "objectID": "api/index.html#protocols", 1310 "href": "api/index.html#protocols", 1311 "title": "API Reference", 1312 "section": "", 1313 "text": "Abstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset." 1314 }, 1315 { 1316 "objectID": "api/index.html#data-sources", 1317 "href": "api/index.html#data-sources", 1318 "title": "API Reference", 1319 "section": "", 1320 "text": "Data source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage." 1321 }, 1322 { 1323 "objectID": "api/index.html#local-storage", 1324 "href": "api/index.html#local-storage", 1325 "title": "API Reference", 1326 "section": "", 1327 "text": "Local Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore protocol." 1328 }, 1329 { 1330 "objectID": "api/index.html#atmosphere", 1331 "href": "api/index.html#atmosphere", 1332 "title": "API Reference", 1333 "section": "", 1334 "text": "ATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI." 1335 }, 1336 { 1337 "objectID": "api/index.html#promotion", 1338 "href": "api/index.html#promotion", 1339 "title": "API Reference", 1340 "section": "", 1341 "text": "Local to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 
1342 }, 1343 { 1344 "objectID": "api/IndexEntry.html", 1345 "href": "api/IndexEntry.html", 1346 "title": "IndexEntry", 1347 "section": "", 1348 "text": "IndexEntry()\nCommon interface for index entries (local or atmosphere).\nBoth LocalDatasetEntry and atmosphere DatasetRecord-based entries should satisfy this protocol, enabling code that works with either.\n\n\nname: Human-readable dataset name schema_ref: Reference to schema (local:// path or AT URI) data_urls: WebDataset URLs for the data metadata: Arbitrary metadata dict, or None\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1349 }, 1350 { 1351 "objectID": "api/IndexEntry.html#properties", 1352 "href": "api/IndexEntry.html#properties", 1353 "title": "IndexEntry", 1354 "section": "", 1355 "text": "name: Human-readable dataset name schema_ref: Reference to schema (local:// path or AT URI) data_urls: WebDataset URLs for the data metadata: Arbitrary metadata dict, or None" 1356 }, 1357 { 1358 "objectID": "api/IndexEntry.html#attributes", 1359 "href": "api/IndexEntry.html#attributes", 1360 "title": "IndexEntry", 1361 "section": "", 1362 "text": "Name\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1363 }, 1364 { 1365 "objectID": "api/S3Source.html", 1366 "href": "api/S3Source.html", 1367 "title": "S3Source", 1368 "section": "", 1369 "text": "S3Source(\n bucket,\n keys,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n _client=None,\n)\nData source for S3-compatible storage with explicit credentials.\nUses boto3 to stream directly from S3, supporting: - Standard AWS S3 - S3-compatible endpoints (Cloudflare R2, MinIO, etc.) - Private buckets with credentials - IAM role authentication (when keys not provided)\nUnlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1).\n\n\n\n\n\n\n>>> source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n>>> for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n>>> creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n>>> source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n>>> source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1370 }, 1371 { 1372 "objectID": "api/S3Source.html#attributes", 1373 "href": "api/S3Source.html#attributes", 1374 "title": "S3Source", 1375 "section": "", 1376 "text": "Name\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1)." 1377 }, 1378 { 1379 "objectID": "api/S3Source.html#examples", 1380 "href": "api/S3Source.html#examples", 1381 "title": "S3Source", 1382 "section": "", 1383 "text": ">>> source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n>>> for shard_id, stream in source.shards:\n... 
process(stream)" 1384 }, 1385 { 1386 "objectID": "api/S3Source.html#methods", 1387 "href": "api/S3Source.html#methods", 1388 "title": "S3Source", 1389 "section": "", 1390 "text": "Name\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n>>> creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n>>> source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n>>> source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1391 }, 1392 { 1393 "objectID": "api/local.LocalDatasetEntry.html", 1394 "href": "api/local.LocalDatasetEntry.html", 1395 "title": "local.LocalDatasetEntry", 1396 "section": "", 1397 "text": "local.LocalDatasetEntry(\n name,\n schema_ref,\n data_urls,\n metadata=None,\n _cid=None,\n _legacy_uuid=None,\n)\nIndex entry for a dataset stored in the local repository.\nImplements the IndexEntry protocol for compatibility with AbstractIndex. Uses dual identity: a content-addressable CID (ATProto-compatible) and a human-readable name.\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring the same data produces the same CID whether stored locally or in the atmosphere. 
This enables seamless promotion from local to ATProto.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1398 }, 1399 { 1400 "objectID": "api/local.LocalDatasetEntry.html#attributes", 1401 "href": "api/local.LocalDatasetEntry.html#attributes", 1402 "title": "local.LocalDatasetEntry", 1403 "section": "", 1404 "text": "Name\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set." 1405 }, 1406 { 1407 "objectID": "api/local.LocalDatasetEntry.html#methods", 1408 "href": "api/local.LocalDatasetEntry.html#methods", 1409 "title": "local.LocalDatasetEntry", 1410 "section": "", 1411 "text": "Name\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1412 }, 1413 { 1414 "objectID": "api/AbstractIndex.html", 1415 "href": "api/AbstractIndex.html", 1416 "title": "AbstractIndex", 1417 "section": "", 1418 "text": "AbstractIndex()\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\nThis protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses\nA single index can hold datasets of many different sample types. The sample type is tracked via schema references, not as a generic parameter on the index.\n\n\nSome index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. 
If present, load_dataset will use it for S3 credential resolution.\n\n\n\n>>> def publish_and_list(index: AbstractIndex) -> None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -> {entry.schema_ref}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n>>> entry = index.get_dataset(\"my-dataset\")\n>>> SampleType = index.decode_schema(entry.schema_ref)\n>>> ds = Dataset[SampleType](entry.data_urls[0])\n>>> for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. 
If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1419 }, 1420 { 1421 "objectID": "api/AbstractIndex.html#optional-extensions", 1422 "href": "api/AbstractIndex.html#optional-extensions", 1423 "title": "AbstractIndex", 1424 "section": "", 1425 "text": "Some index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution." 1426 }, 1427 { 1428 "objectID": "api/AbstractIndex.html#examples", 1429 "href": "api/AbstractIndex.html#examples", 1430 "title": "AbstractIndex", 1431 "section": "", 1432 "text": ">>> def publish_and_list(index: AbstractIndex) -> None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -> {entry.schema_ref}\")" 1433 }, 1434 { 1435 "objectID": "api/AbstractIndex.html#attributes", 1436 "href": "api/AbstractIndex.html#attributes", 1437 "title": "AbstractIndex", 1438 "section": "", 1439 "text": "Name\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index." 
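A brief sketch, assuming an already-populated index, contrasting the lazy datasets and schemas attributes with the materialized list_datasets() and list_schemas() methods documented below:

>>> def summarize(index: AbstractIndex) -> None:
...     for schema in index.schemas:  # lazy iteration over schema records
...         print(schema["name"], schema["version"])
...     for entry in index.datasets:  # lazy iteration over dataset entries
...         print(entry.name, "->", entry.schema_ref)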
1440 }, 1441 { 1442 "objectID": "api/AbstractIndex.html#methods", 1443 "href": "api/AbstractIndex.html#methods", 1444 "title": "AbstractIndex", 1445 "section": "", 1446 "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n>>> entry = index.get_dataset(\"my-dataset\")\n>>> SampleType = index.decode_schema(entry.schema_ref)\n>>> ds = Dataset[SampleType](entry.data_urls[0])\n>>> for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. 
If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1447 }, 1448 { 1449 "objectID": "api/AtmosphereIndexEntry.html", 1450 "href": "api/AtmosphereIndexEntry.html", 1451 "title": "AtmosphereIndexEntry", 1452 "section": "", 1453 "text": "atmosphere.AtmosphereIndexEntry(uri, record)\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1454 }, 1455 { 1456 "objectID": "api/AtmosphereIndexEntry.html#attributes", 1457 "href": "api/AtmosphereIndexEntry.html#attributes", 1458 "title": "AtmosphereIndexEntry", 1459 "section": "", 1460 "text": "Name\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1461 }, 1462 { 1463 "objectID": "api/LensPublisher.html", 1464 "href": "api/LensPublisher.html", 1465 "title": "LensPublisher", 1466 "section": "", 1467 "text": "atmosphere.LensPublisher(client)\nPublishes Lens transformation records to ATProto.\nThis class creates lens records that reference source and target schemas and point to the transformation code in a git repository.\n\n\n>>> @atdata.lens\n... def my_lens(source: SourceType) -> TargetType:\n... return TargetType(field=source.other_field)\n>>>\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = LensPublisher(client)\n>>> uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )\n\n\n\nLens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. 
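A hedged end-to-end sketch tying the methods above together, with LocalIndex standing in for the protocol; the sample type, tar path, and dataset name are placeholders, and an AtmosphereIndex should work the same way through this interface:

import atdata
from atdata.local import LocalIndex

@atdata.packable
class ImageSample:
    label: str
    confidence: float

index = LocalIndex()
ds = atdata.Dataset[ImageSample]("data-000000.tar")

# Publish the schema once, then reference it explicitly on insert.
schema_ref = index.publish_schema(ImageSample, version="1.0.0")
entry = index.insert_dataset(ds, name="images-v1", schema_ref=schema_ref)

# A consumer that only knows the name can recover the sample type dynamically.
entry = index.get_dataset("images-v1")
SampleType = index.decode_schema(entry.schema_ref)
typed_ds = atdata.Dataset[SampleType](entry.data_urls[0])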
Users must manually install and trust lens implementations.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1468 }, 1469 { 1470 "objectID": "api/LensPublisher.html#examples", 1471 "href": "api/LensPublisher.html#examples", 1472 "title": "LensPublisher", 1473 "section": "", 1474 "text": ">>> @atdata.lens\n... def my_lens(source: SourceType) -> TargetType:\n... return TargetType(field=source.other_field)\n>>>\n>>> client = AtmosphereClient()\n>>> client.login(\"handle\", \"password\")\n>>>\n>>> publisher = LensPublisher(client)\n>>> uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... 
)" 1475 }, 1476 { 1477 "objectID": "api/LensPublisher.html#security-note", 1478 "href": "api/LensPublisher.html#security-note", 1479 "title": "LensPublisher", 1480 "section": "", 1481 "text": "Lens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations." 1482 }, 1483 { 1484 "objectID": "api/LensPublisher.html#methods", 1485 "href": "api/LensPublisher.html#methods", 1486 "title": "LensPublisher", 1487 "section": "", 1488 "text": "Name\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1489 }, 1490 { 1491 "objectID": "api/SampleBatch.html", 1492 "href": "api/SampleBatch.html", 1493 "title": "SampleBatch", 1494 "section": "", 1495 "text": "SampleBatch(samples)\nA batch of samples with automatic attribute aggregation.\nThis class wraps a sequence of samples and provides magic __getattr__ access to aggregate sample attributes. When you access an attribute that exists on the sample type, it automatically aggregates values across all samples in the batch.\nNDArray fields are stacked into a numpy array with a batch dimension. 
Other fields are aggregated into a list.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch.\n\n\n\n\n\n\n>>> batch = SampleBatch[MyData]([sample1, sample2, sample3])\n>>> batch.embeddings # Returns stacked numpy array of shape (3, ...)\n>>> batch.names # Returns list of names\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1496 }, 1497 { 1498 "objectID": "api/SampleBatch.html#parameters", 1499 "href": "api/SampleBatch.html#parameters", 1500 "title": "SampleBatch", 1501 "section": "", 1502 "text": "Name\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired" 1503 }, 1504 { 1505 "objectID": "api/SampleBatch.html#attributes", 1506 "href": "api/SampleBatch.html#attributes", 1507 "title": "SampleBatch", 1508 "section": "", 1509 "text": "Name\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch." 1510 }, 1511 { 1512 "objectID": "api/SampleBatch.html#examples", 1513 "href": "api/SampleBatch.html#examples", 1514 "title": "SampleBatch", 1515 "section": "", 1516 "text": ">>> batch = SampleBatch[MyData]([sample1, sample2, sample3])\n>>> batch.embeddings # Returns stacked numpy array of shape (3, ...)\n>>> batch.names # Returns list of names" 1517 }, 1518 { 1519 "objectID": "api/SampleBatch.html#note", 1520 "href": "api/SampleBatch.html#note", 1521 "title": "SampleBatch", 1522 "section": "", 1523 "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1524 }, 1525 { 1526 "objectID": "index.html", 1527 "href": "index.html", 1528 "title": "atdata", 1529 "section": "", 1530 "text": "A loose federation of distributed, typed datasets built on WebDataset.\nGet Started View on GitHub", 1531 "crumbs": [ 1532 "Guide", 1533 "atdata" 1534 ] 1535 }, 1536 { 1537 "objectID": "index.html#the-challenge", 1538 "href": "index.html#the-challenge", 1539 "title": "atdata", 1540 "section": "The Challenge", 1541 "text": "The Challenge\nMachine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:\n\nRuntime surprises: Discovering a field is missing or has the wrong type during training\nCopy-paste schemas: Redefining the same sample structure across notebooks and scripts\nStorage silos: Data stuck in one location, invisible to collaborators\nDiscovery friction: No standard way to find datasets across teams or organizations\n\natdata solves these problems with a simple idea: typed, serializable samples that flow seamlessly from local development to team storage to federated sharing.", 1542 "crumbs": [ 1543 "Guide", 1544 "atdata" 1545 ] 1546 }, 1547 { 1548 "objectID": "index.html#what-is-atdata", 1549 "href": "index.html#what-is-atdata", 1550 "title": "atdata", 1551 "section": "What is atdata?", 1552 "text": "What is atdata?\natdata is a Python library that combines:\n\n\nTyped Samples\nDefine dataclass-based sample types with automatic msgpack serialization. 
Catch schema errors at definition time, not training time.\n\n\nEfficient Storage\nBuilt on WebDataset’s proven tar-based format. Stream large datasets without downloading everything first.\n\n\nLens Transformations\nView datasets through different schemas without duplicating data. Perfect for feature extraction, schema migration, and multi-task learning.\n\n\nBatch Aggregation\nAutomatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train.\n\n\nTeam Storage\nRedis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery.\n\n\nATProto Federation\nPublish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure.", 1553 "crumbs": [ 1554 "Guide", 1555 "atdata" 1556 ] 1557 }, 1558 { 1559 "objectID": "index.html#the-architecture", 1560 "href": "index.html#the-architecture", 1561 "title": "atdata", 1562 "section": "The Architecture", 1563 "text": "The Architecture\natdata provides a three-layer progression for your datasets:\n┌─────────────────────────────────────────────────────────────┐\n│ Federation: ATProto Atmosphere │\n│ Decentralized discovery, cross-org sharing │\n└─────────────────────────────────────────────────────────────┘\n ↑ promote\n┌─────────────────────────────────────────────────────────────┐\n│ Team Storage: Redis + S3 │\n│ Shared index, versioned schemas, S3 data │\n└─────────────────────────────────────────────────────────────┘\n ↑ insert\n┌─────────────────────────────────────────────────────────────┐\n│ Local Development │\n│ Typed samples, WebDataset files, fast iteration │\n└─────────────────────────────────────────────────────────────┘\nStart local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs.", 1564 "crumbs": [ 1565 "Guide", 1566 "atdata" 1567 ] 1568 }, 1569 { 1570 "objectID": "index.html#installation", 1571 "href": "index.html#installation", 1572 "title": "atdata", 1573 "section": "Installation", 1574 "text": "Installation\n\npip install atdata\n\n# With ATProto support\npip install atdata[atmosphere]", 1575 "crumbs": [ 1576 "Guide", 1577 "atdata" 1578 ] 1579 }, 1580 { 1581 "objectID": "index.html#quick-example", 1582 "href": "index.html#quick-example", 1583 "title": "atdata", 1584 "section": "Quick Example", 1585 "text": "Quick Example\n\n1. Define a Sample Type\nThe @packable decorator creates a serializable dataclass:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray # Automatically handled as bytes\n label: str\n confidence: float\n\n\n\n2. Create and Write Samples\nUse WebDataset’s standard TarWriter:\n\nimport webdataset as wds\n\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n )\n for _ in range(100)\n]\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\n\n\n3. 
Load and Iterate with Type Safety\nThe generic Dataset[T] provides typed access:\n\ndataset = atdata.Dataset[ImageSample](\"data-000000.tar\")\n\nfor batch in dataset.shuffled(batch_size=32):\n images = batch.image # numpy array (32, 224, 224, 3)\n labels = batch.label # list of 32 strings\n confs = batch.confidence # list of 32 floats", 1586 "crumbs": [ 1587 "Guide", 1588 "atdata" 1589 ] 1590 }, 1591 { 1592 "objectID": "index.html#scaling-up", 1593 "href": "index.html#scaling-up", 1594 "title": "atdata", 1595 "section": "Scaling Up", 1596 "text": "Scaling Up\n\nTeam Storage with Redis + S3\nWhen you’re ready to share with your team:\n\nfrom atdata.local import LocalIndex, S3DataStore\n\n# Connect to team infrastructure\nstore = S3DataStore(\n credentials={\"AWS_ENDPOINT\": \"http://localhost:9000\", ...},\n bucket=\"team-datasets\",\n)\nindex = LocalIndex(data_store=store)\n\n# Publish schema for consistency\nindex.publish_schema(ImageSample, version=\"1.0.0\")\n\n# Insert dataset (writes to S3, indexes in Redis)\ndataset = atdata.Dataset[ImageSample](\"data.tar\")\nentry = index.insert_dataset(dataset, name=\"training-images-v1\")\n\n# Team members can now discover and load\n# ds = atdata.load_dataset(\"@local/training-images-v1\", index=index)\n\n\n\nFederation with ATProto\nFor public or cross-organization sharing:\n\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nfrom atdata.promote import promote_to_atmosphere\n\n# Authenticate with your ATProto identity\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Option 1: Promote existing local dataset\nentry = index.get_dataset(\"training-images-v1\")\nat_uri = promote_to_atmosphere(entry, index, client)\n\n# Option 2: Publish directly with blob storage\nstore = PDSBlobStore(client)\natm_index = AtmosphereIndex(client, data_store=store)\natm_index.insert_dataset(dataset, name=\"public-images\", schema_ref=schema_uri)", 1597 "crumbs": [ 1598 "Guide", 1599 "atdata" 1600 ] 1601 }, 1602 { 1603 "objectID": "index.html#huggingface-style-loading", 1604 "href": "index.html#huggingface-style-loading", 1605 "title": "atdata", 1606 "section": "HuggingFace-Style Loading", 1607 "text": "HuggingFace-Style Loading\nFor convenient access to datasets:\n\nfrom atdata import load_dataset\n\n# Load from local files\nds = load_dataset(\"path/to/data-{000000..000009}.tar\")\n\n# Load with split detection\nds_dict = load_dataset(\"path/to/data/\")\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]\n\n# Load from index\nds = load_dataset(\"@local/my-dataset\", index=index)", 1608 "crumbs": [ 1609 "Guide", 1610 "atdata" 1611 ] 1612 }, 1613 { 1614 "objectID": "index.html#why-atdata", 1615 "href": "index.html#why-atdata", 1616 "title": "atdata", 1617 "section": "Why atdata?", 1618 "text": "Why atdata?\n\n\n\n\n\n\n\nNeed\nSolution\n\n\n\n\nType-safe samples\n@packable decorator, PackableSample base class\n\n\nEfficient large-scale storage\nWebDataset tar format, streaming iteration\n\n\nSchema flexibility\nLens transformations, DictSample for exploration\n\n\nTeam collaboration\nRedis index, S3 data store, schema registry\n\n\nPublic sharing\nATProto federation, content-addressable CIDs\n\n\nMultiple backends\nProtocol abstractions (AbstractIndex, DataSource)", 1619 "crumbs": [ 1620 "Guide", 1621 "atdata" 1622 ] 1623 }, 1624 { 1625 "objectID": "index.html#next-steps", 1626 "href": "index.html#next-steps", 1627 "title": "atdata", 1628 "section": "Next Steps", 1629 "text": "Next 
Steps\n\n\n\n\n\n\nGetting Started\n\n\n\nNew to atdata? Start with the Quick Start Tutorial to learn the basics of typed samples and datasets.\n\n\n\nArchitecture Overview - Understand the design and how components fit together\nLocal Workflow - Set up team storage with Redis + S3\nAtmosphere Publishing - Share datasets on the ATProto network\nPackable Samples - Deep dive into sample type definitions\nDatasets - Master iteration, batching, and transformations", 1630 "crumbs": [ 1631 "Guide", 1632 "atdata" 1633 ] 1634 }, 1635 { 1636 "objectID": "api/packable.html", 1637 "href": "api/packable.html", 1638 "title": "packable", 1639 "section": "", 1640 "text": "packable(cls)\nDecorator to convert a regular class into a PackableSample.\nThis decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.\nThe resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures.\n\n\n\n\n\n\n>>> @packable\n... class MyData:\n... name: str\n... values: NDArray\n...\n>>> sample = MyData(name=\"test\", values=np.array([1, 2, 3]))\n>>> bytes_data = sample.packed\n>>> restored = MyData.from_bytes(bytes_data)\n>>>\n>>> # Works with Packable-typed APIs\n>>> index.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1641 }, 1642 { 1643 "objectID": "api/packable.html#parameters", 1644 "href": "api/packable.html#parameters", 1645 "title": "packable", 1646 "section": "", 1647 "text": "Name\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired" 1648 }, 1649 { 1650 "objectID": "api/packable.html#returns", 1651 "href": "api/packable.html#returns", 1652 "title": "packable", 1653 "section": "", 1654 "text": "Name\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures." 1655 }, 1656 { 1657 "objectID": "api/packable.html#examples", 1658 "href": "api/packable.html#examples", 1659 "title": "packable", 1660 "section": "", 1661 "text": ">>> @packable\n... class MyData:\n... name: str\n... 
values: NDArray\n...\n>>> sample = MyData(name=\"test\", values=np.array([1, 2, 3]))\n>>> bytes_data = sample.packed\n>>> restored = MyData.from_bytes(bytes_data)\n>>>\n>>> # Works with Packable-typed APIs\n>>> index.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1662 }, 1663 { 1664 "objectID": "api/Packable-protocol.html", 1665 "href": "api/Packable-protocol.html", 1666 "title": "Packable", 1667 "section": "", 1668 "text": "Packable()\nStructural protocol for packable sample types.\nThis protocol allows classes decorated with @packable to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.\nBoth PackableSample subclasses and @packable-decorated classes satisfy this protocol structurally.\nThe protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)\n\n\n>>> @packable\n... class MySample:\n... name: str\n... value: int\n...\n>>> def process(sample_type: Type[Packable]) -> None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1669 }, 1670 { 1671 "objectID": "api/Packable-protocol.html#examples", 1672 "href": "api/Packable-protocol.html#examples", 1673 "title": "Packable", 1674 "section": "", 1675 "text": ">>> @packable\n... class MySample:\n... name: str\n... value: int\n...\n>>> def process(sample_type: Type[Packable]) -> None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)" 1676 }, 1677 { 1678 "objectID": "api/Packable-protocol.html#attributes", 1679 "href": "api/Packable-protocol.html#attributes", 1680 "title": "Packable", 1681 "section": "", 1682 "text": "Name\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes." 1683 }, 1684 { 1685 "objectID": "api/Packable-protocol.html#methods", 1686 "href": "api/Packable-protocol.html#methods", 1687 "title": "Packable", 1688 "section": "", 1689 "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 
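To illustrate the structural typing described above, a small sketch; the Reading class and the roundtrip helper are hypothetical, not part of the library:

import atdata

@atdata.packable
class Reading:
    sensor: str
    value: float

def roundtrip(sample_type, sample):
    # Structural typing: anything exposing `packed` and `from_bytes` works,
    # whether it subclasses PackableSample or is @packable-decorated.
    raw = sample.packed
    return sample_type.from_bytes(raw)

restored = roundtrip(Reading, Reading(sensor="temp", value=21.5))
print(restored.sensor, restored.value)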
1690 }, 1691 { 1692 "objectID": "api/AtUri.html", 1693 "href": "api/AtUri.html", 1694 "title": "AtUri", 1695 "section": "", 1696 "text": "atmosphere.AtUri(authority, collection, rkey)\nParsed AT Protocol URI.\nAT URIs follow the format: at://<authority>/<collection>/<rkey>\n\n\n>>> uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n>>> uri.authority\n'did:plc:abc123'\n>>> uri.collection\n'ac.foundation.dataset.sampleSchema'\n>>> uri.rkey\n'xyz'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://<authority>/<collection>/<rkey>\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1697 }, 1698 { 1699 "objectID": "api/AtUri.html#examples", 1700 "href": "api/AtUri.html#examples", 1701 "title": "AtUri", 1702 "section": "", 1703 "text": ">>> uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n>>> uri.authority\n'did:plc:abc123'\n>>> uri.collection\n'ac.foundation.dataset.sampleSchema'\n>>> uri.rkey\n'xyz'" 1704 }, 1705 { 1706 "objectID": "api/AtUri.html#attributes", 1707 "href": "api/AtUri.html#attributes", 1708 "title": "AtUri", 1709 "section": "", 1710 "text": "Name\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection." 1711 }, 1712 { 1713 "objectID": "api/AtUri.html#methods", 1714 "href": "api/AtUri.html#methods", 1715 "title": "AtUri", 1716 "section": "", 1717 "text": "Name\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://<authority>/<collection>/<rkey>\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid."
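As a small supplement to the parse example, the components can be recombined into the original string; this sketch assumes AtUri is importable from atdata.atmosphere alongside the other atmosphere classes:

from atdata.atmosphere import AtUri

uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz")

# authority, collection, and rkey are plain strings and can be recombined.
rebuilt = f"at://{uri.authority}/{uri.collection}/{uri.rkey}"
assert rebuilt == "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz"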
1718 }, 1719 { 1720 "objectID": "api/local.S3DataStore.html", 1721 "href": "api/local.S3DataStore.html", 1722 "title": "local.S3DataStore", 1723 "section": "", 1724 "text": "local.S3DataStore(credentials, *, bucket)\nS3-compatible data store implementing AbstractDataStore protocol.\nHandles writing dataset shards to S3-compatible object storage and resolving URLs for reading.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -> ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1725 }, 1726 { 1727 "objectID": "api/local.S3DataStore.html#attributes", 1728 "href": "api/local.S3DataStore.html#attributes", 1729 "title": "local.S3DataStore", 1730 "section": "", 1731 "text": "Name\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance." 
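A hedged sketch of the write/read cycle against a MinIO-style endpoint; the credential keys other than AWS_ENDPOINT, the bucket name, and the sample type are assumptions for illustration:

import atdata
from atdata.local import S3DataStore

@atdata.packable
class ImageSample:
    label: str
    confidence: float

store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "http://localhost:9000",  # custom endpoint (e.g. MinIO)
        "AWS_ACCESS_KEY_ID": "minioadmin",        # assumed key name
        "AWS_SECRET_ACCESS_KEY": "minioadmin",    # assumed key name
    },
    bucket="team-datasets",
)

ds = atdata.Dataset[ImageSample]("data-000000.tar")
# maxcount is forwarded to wds.ShardWriter to cap samples per shard.
urls = store.write_shards(ds, prefix="datasets/images/v1", maxcount=10000)

# With a custom endpoint configured, read_url rewrites s3:// URLs into
# HTTPS URLs that WebDataset can stream directly.
print(store.read_url(urls[0]))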
1732 }, 1733 { 1734 "objectID": "api/local.S3DataStore.html#methods", 1735 "href": "api/local.S3DataStore.html#methods", 1736 "title": "local.S3DataStore", 1737 "section": "", 1738 "text": "Name\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -> ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1739 }, 1740 { 1741 "objectID": "api/AbstractDataStore.html", 1742 "href": "api/AbstractDataStore.html", 1743 "title": "AbstractDataStore", 1744 "section": "", 1745 "text": "AbstractDataStore()\nProtocol for data storage operations.\nThis protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)\nThe separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.\n\n\n>>> store = S3DataStore(credentials, bucket=\"my-bucket\")\n>>> urls = store.write_shards(dataset, prefix=\"training/v1\")\n>>> print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). 
This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1746 }, 1747 { 1748 "objectID": "api/AbstractDataStore.html#examples", 1749 "href": "api/AbstractDataStore.html#examples", 1750 "title": "AbstractDataStore", 1751 "section": "", 1752 "text": ">>> store = S3DataStore(credentials, bucket=\"my-bucket\")\n>>> urls = store.write_shards(dataset, prefix=\"training/v1\")\n>>> print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]" 1753 }, 1754 { 1755 "objectID": "api/AbstractDataStore.html#methods", 1756 "href": "api/AbstractDataStore.html#methods", 1757 "title": "AbstractDataStore", 1758 "section": "", 1759 "text": "Name\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1760 }, 1761 { 1762 "objectID": "api/Dataset.html", 1763 "href": "api/Dataset.html", 1764 "title": "Dataset", 1765 "section": "", 1766 "text": "Dataset(source=None, metadata_url=None, *, url=None)\nA typed dataset built on WebDataset with lens transformations.\nThis class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. 
Samples are stored as msgpack-serialized data within WebDataset shards.\nThe dataset supports: - Ordered and shuffled iteration - Automatic batching with SampleBatch - Type transformations via the lens system (as_type()) - Export to parquet format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s).\n\n\n\n\n\n\n>>> ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n>>> for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n>>> # Transform to a different view\n>>> ds_view = ds.as_type(MyDataView)\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. 
Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n>>> ds = Dataset[MySample](\"data.tar\")\n>>> # Small dataset - load all at once\n>>> ds.to_parquet(\"output.parquet\")\n>>>\n>>> # Large dataset - process in chunks\n>>> ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 
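A hedged sketch connecting the lens system to as_type() and batched iteration; MyData and MyDataView are placeholders, and whether a getter-only @atdata.lens registration is sufficient for as_type() depends on the lens network, so treat this as illustrative rather than definitive:

import atdata
from numpy.typing import NDArray

@atdata.packable
class MyData:
    name: str
    embeddings: NDArray

@atdata.packable
class MyDataView:
    name: str

@atdata.lens
def to_view(source: MyData) -> MyDataView:
    # Registered lens: lets as_type() re-view MyData shards as MyDataView.
    return MyDataView(name=source.name)

ds = atdata.Dataset[MyData]("path/to/data-{000000..000009}.tar")
view = ds.as_type(MyDataView)

# shuffled() with batch_size yields SampleBatch[MyDataView] instances;
# non-array fields aggregate into lists.
for batch in view.shuffled(buffer_samples=1000, batch_size=32):
    print(batch.name[:3])
    break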
1767 }, 1768 { 1769 "objectID": "api/Dataset.html#parameters", 1770 "href": "api/Dataset.html#parameters", 1771 "title": "Dataset", 1772 "section": "", 1773 "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired" 1774 }, 1775 { 1776 "objectID": "api/Dataset.html#attributes", 1777 "href": "api/Dataset.html#attributes", 1778 "title": "Dataset", 1779 "section": "", 1780 "text": "Name\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s)." 1781 }, 1782 { 1783 "objectID": "api/Dataset.html#examples", 1784 "href": "api/Dataset.html#examples", 1785 "title": "Dataset", 1786 "section": "", 1787 "text": ">>> ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n>>> for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n>>> # Transform to a different view\n>>> ds_view = ds.as_type(MyDataView)" 1788 }, 1789 { 1790 "objectID": "api/Dataset.html#note", 1791 "href": "api/Dataset.html#note", 1792 "title": "Dataset", 1793 "section": "", 1794 "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class." 1795 }, 1796 { 1797 "objectID": "api/Dataset.html#methods", 1798 "href": "api/Dataset.html#methods", 1799 "title": "Dataset", 1800 "section": "", 1801 "text": "Name\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). 
If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. 
For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n>>> ds = Dataset[MySample](\"data.tar\")\n>>> # Small dataset - load all at once\n>>> ds.to_parquet(\"output.parquet\")\n>>>\n>>> # Large dataset - process in chunks\n>>> ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1802 }, 1803 { 1804 "objectID": "api/local.Index.html", 1805 "href": "api/local.Index.html", 1806 "title": "local.Index", 1807 "section": "", 1808 "text": "local.Index(\n redis=None,\n data_store=None,\n auto_stubs=False,\n stub_dir=None,\n **kwargs,\n)\nRedis-backed index for tracking datasets in a repository.\nImplements the AbstractIndex protocol. Maintains a registry of LocalDatasetEntry objects in Redis, allowing enumeration and lookup of stored datasets.\nWhen initialized with a data_store, insert_dataset() will write dataset shards to storage before indexing. 
Without a data_store, insert_dataset() only indexes existing URLs.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. 
The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n>>> # After enabling auto_stubs and configuring IDE extraPaths:\n>>> from local.MySample_1_0_0 import MySample\n>>>\n>>> # This gives full IDE autocomplete:\n>>> DecodedType = index.decode_schema_as(ref, MySample)\n>>> sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n>>> index = LocalIndex(auto_stubs=True)\n>>> ref = index.publish_schema(MySample, version=\"1.0.0\")\n>>> index.load_schema(ref)\n>>> print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n>>> # Then 
in your code:\n>>> # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.<ClassName> after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n>>> # Load and use immediately\n>>> MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n>>> sample = MyType(name=\"hello\", value=42)\n>>>\n>>> # Or access later via namespace\n>>> index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n>>> other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1809 }, 1810 { 1811 "objectID": "api/local.Index.html#attributes", 1812 "href": "api/local.Index.html#attributes", 1813 "title": "local.Index", 1814 "section": "", 1815 "text": "Name\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards." 1816 }, 1817 { 1818 "objectID": "api/local.Index.html#methods", 1819 "href": "api/local.Index.html#methods", 1820 "title": "local.Index", 1821 "section": "", 1822 "text": "Name\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. 
If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n>>> # After enabling auto_stubs and configuring IDE extraPaths:\n>>> from local.MySample_1_0_0 import MySample\n>>>\n>>> # This gives full IDE autocomplete:\n>>> DecodedType = index.decode_schema_as(ref, MySample)\n>>> sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. 
Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n>>> index = LocalIndex(auto_stubs=True)\n>>> ref = index.publish_schema(MySample, version=\"1.0.0\")\n>>> index.load_schema(ref)\n>>> print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n>>> # Then in your code:\n>>> # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. 
Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. Also available via\n\n\n\nType[Packable]\nindex.types.<ClassName> after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n>>> # Load and use immediately\n>>> MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n>>> sample = MyType(name=\"hello\", value=42)\n>>>\n>>> # Or access later via namespace\n>>> index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n>>> other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1823 }, 1824 { 1825 "objectID": "api/Lens.html", 1826 "href": "api/Lens.html", 1827 "title": "lens", 1828 "section": "", 1829 "text": "lens\nLens-based type transformations for datasets.\nThis module implements a lens system for bidirectional transformations between different sample types. 
Lenses enable viewing a dataset through different type schemas without duplicating the underlying data.\nKey components:\n\nLens: Bidirectional transformation with getter (S -> V) and optional putter (V, S -> S)\nLensNetwork: Global singleton registry for lens transformations\n@lens: Decorator to create and register lens transformations\n\nLenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).\n\n\n>>> @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n>>> @packable\n... class NameOnly:\n... name: str\n...\n>>> @lens\n... def name_view(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age,\n... embedding=source.embedding)\n...\n>>> ds = Dataset[FullData](\"data.tar\")\n>>> ds_names = ds.as_type(NameOnly) # Uses registered lens\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -> V and an optional putter that transforms (V, S) -> S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n>>> @lens\n... def name_lens(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n>>> @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -> SourceType:\n... 
return SourceType(field=view.field, other=source.other)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented.\n\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n>>> @lens\n... def extract_name(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age)" 1830 }, 1831 { 1832 "objectID": "api/Lens.html#examples", 1833 "href": "api/Lens.html#examples", 1834 "title": "lens", 1835 "section": "", 1836 "text": ">>> @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n>>> @packable\n... class NameOnly:\n... name: str\n...\n>>> @lens\n... def name_view(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age,\n... 
embedding=source.embedding)\n...\n>>> ds = Dataset[FullData](\"data.tar\")\n>>> ds_names = ds.as_type(NameOnly) # Uses registered lens" 1837 }, 1838 { 1839 "objectID": "api/Lens.html#classes", 1840 "href": "api/Lens.html#classes", 1841 "title": "lens", 1842 "section": "", 1843 "text": "Name\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -> V and an optional putter that transforms (V, S) -> S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n>>> @lens\n... def name_lens(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n>>> @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -> SourceType:\n... return SourceType(field=view.field, other=source.other)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. 
It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented." 1844 }, 1845 { 1846 "objectID": "api/Lens.html#functions", 1847 "href": "api/Lens.html#functions", 1848 "title": "lens", 1849 "section": "", 1850 "text": "Name\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n>>> @lens\n... def extract_name(full: FullData) -> NameOnly:\n... return NameOnly(name=full.name)\n...\n>>> @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -> FullData:\n... return FullData(name=view.name, age=source.age)" 1851 }, 1852 { 1853 "objectID": "api/DatasetLoader.html", 1854 "href": "api/DatasetLoader.html", 1855 "title": "DatasetLoader", 1856 "section": "", 1857 "text": "atmosphere.DatasetLoader(client)\nLoads dataset records from ATProto.\nThis class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.\n\n\n>>> client = AtmosphereClient()\n>>> loader = DatasetLoader(client)\n>>>\n>>> # List available datasets\n>>> datasets = loader.list()\n>>> for ds in datasets:\n... 
print(ds[\"name\"], ds[\"schemaRef\"])\n>>>\n>>> # Get a specific dataset record\n>>> record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n>>> loader = DatasetLoader(client)\n>>> dataset = loader.to_dataset(uri, MySampleType)\n>>> for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1858 }, 1859 { 1860 "objectID": "api/DatasetLoader.html#examples", 1861 "href": "api/DatasetLoader.html#examples", 1862 "title": "DatasetLoader", 1863 "section": "", 1864 "text": ">>> client = AtmosphereClient()\n>>> loader = DatasetLoader(client)\n>>>\n>>> # List available datasets\n>>> datasets = loader.list()\n>>> for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n>>>\n>>> # Get a specific dataset record\n>>> record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")" 1865 }, 1866 { 1867 "objectID": "api/DatasetLoader.html#methods", 1868 "href": "api/DatasetLoader.html#methods", 1869 "title": "DatasetLoader", 1870 "section": "", 1871 "text": "Name\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob 
reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n>>> loader = DatasetLoader(client)\n>>> dataset = loader.to_dataset(uri, MySampleType)\n>>> for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1872 }, 1873 { 1874 "objectID": "api/DataSource.html", 1875 "href": "api/DataSource.html", 1876 "title": "DataSource", 1877 "section": "", 1878 "text": "DataSource()\nProtocol for data sources that provide streams to Dataset.\nA DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)\nThe key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects\n\n\n>>> source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... 
endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n>>> ds = Dataset[MySample](source)\n>>> for sample in ds.ordered():\n... print(sample)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1879 }, 1880 { 1881 "objectID": "api/DataSource.html#examples", 1882 "href": "api/DataSource.html#examples", 1883 "title": "DataSource", 1884 "section": "", 1885 "text": ">>> source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n>>> ds = Dataset[MySample](source)\n>>> for sample in ds.ordered():\n... print(sample)" 1886 }, 1887 { 1888 "objectID": "api/DataSource.html#attributes", 1889 "href": "api/DataSource.html#attributes", 1890 "title": "DataSource", 1891 "section": "", 1892 "text": "Name\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard." 1893 }, 1894 { 1895 "objectID": "api/DataSource.html#methods", 1896 "href": "api/DataSource.html#methods", 1897 "title": "DataSource", 1898 "section": "", 1899 "text": "Name\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 
1900 }, 1901 { 1902 "objectID": "api/AtmosphereIndex.html", 1903 "href": "api/AtmosphereIndex.html", 1904 "title": "AtmosphereIndex", 1905 "section": "", 1906 "text": "atmosphere.AtmosphereIndex(client, *, data_store=None)\nATProto index implementing AbstractIndex protocol.\nWraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.\nOptionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.\n\n\n>>> client = AtmosphereClient()\n>>> client.login(\"handle.bsky.social\", \"app-password\")\n>>>\n>>> # Without blob storage (external URLs only)\n>>> index = AtmosphereIndex(client)\n>>>\n>>> # With PDS blob storage\n>>> store = PDSBlobStore(client)\n>>> index = AtmosphereIndex(client, data_store=store)\n>>> entry = index.insert_dataset(dataset, name=\"my-data\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol).\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. 
If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1907 }, 1908 { 1909 "objectID": "api/AtmosphereIndex.html#examples", 1910 "href": "api/AtmosphereIndex.html#examples", 1911 "title": "AtmosphereIndex", 1912 "section": "", 1913 "text": ">>> client = AtmosphereClient()\n>>> client.login(\"handle.bsky.social\", \"app-password\")\n>>>\n>>> # Without blob storage (external URLs only)\n>>> index = AtmosphereIndex(client)\n>>>\n>>> # With PDS blob storage\n>>> store = PDSBlobStore(client)\n>>> index = AtmosphereIndex(client, data_store=store)\n>>> entry = index.insert_dataset(dataset, name=\"my-data\")" 1914 }, 1915 { 1916 "objectID": "api/AtmosphereIndex.html#attributes", 1917 "href": "api/AtmosphereIndex.html#attributes", 1918 "title": "AtmosphereIndex", 1919 "section": "", 1920 "text": "Name\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol)." 
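A hedged consumer-side sketch composing the pieces documented above: decode_schema on AtmosphereIndex rebuilds the sample type, and DatasetLoader.to_dataset opens the published record with it. The handle, app password, AT URIs, and train_step are placeholders, and this is one possible workflow rather than the only supported path:

>>> client = AtmosphereClient()
>>> client.login("handle.bsky.social", "app-password")
>>> index = AtmosphereIndex(client)
>>>
>>> # Placeholder AT URIs for a published schema record and dataset record
>>> schema_uri = "at://did:plc:abc/..."
>>> dataset_uri = "at://did:plc:abc/..."
>>>
>>> # Rebuild the sample type from its schema record
>>> SampleType = index.decode_schema(schema_uri)
>>>
>>> # Open the dataset record with the loader documented above
>>> loader = DatasetLoader(client)
>>> ds = loader.to_dataset(dataset_uri, SampleType)
>>> for batch in ds.shuffled(batch_size=32):
...     train_step(batch)  # train_step is a placeholder training function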
1921 }, 1922 { 1923 "objectID": "api/AtmosphereIndex.html#methods", 1924 "href": "api/AtmosphereIndex.html#methods", 1925 "title": "AtmosphereIndex", 1926 "section": "", 1927 "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. 
Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1928 }, 1929 { 1930 "objectID": "api/LensLoader.html", 1931 "href": "api/LensLoader.html", 1932 "title": "LensLoader", 1933 "section": "", 1934 "text": "atmosphere.LensLoader(client)\nLoads lens records from ATProto.\nThis class fetches lens transformation records. Note that actually using a lens requires installing the referenced code and importing it manually.\n\n\n>>> client = AtmosphereClient()\n>>> loader = LensLoader(client)\n>>>\n>>> record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n>>> print(record[\"name\"])\n>>> print(record[\"sourceSchema\"])\n>>> print(record.get(\"getterCode\", {}).get(\"repository\"))\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 
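A short sketch of lens discovery with find_by_schemas, using only the signature and record fields documented above; the AT URIs are placeholders for real schema record URIs:

>>> client = AtmosphereClient()
>>> loader = LensLoader(client)
>>>
>>> # Placeholder AT URIs; substitute the source and target schema record URIs
>>> source_schema = "at://did:plc:abc/..."
>>> target_schema = "at://did:plc:abc/..."
>>>
>>> # All lenses out of the source schema, optionally narrowed to one target
>>> for record in loader.find_by_schemas(source_schema, target_schema_uri=target_schema):
...     print(record["name"], record["sourceSchema"])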
1935 }, 1936 { 1937 "objectID": "api/LensLoader.html#examples", 1938 "href": "api/LensLoader.html#examples", 1939 "title": "LensLoader", 1940 "section": "", 1941 "text": ">>> client = AtmosphereClient()\n>>> loader = LensLoader(client)\n>>>\n>>> record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n>>> print(record[\"name\"])\n>>> print(record[\"sourceSchema\"])\n>>> print(record.get(\"getterCode\", {}).get(\"repository\"))" 1942 }, 1943 { 1944 "objectID": "api/LensLoader.html#methods", 1945 "href": "api/LensLoader.html#methods", 1946 "title": "LensLoader", 1947 "section": "", 1948 "text": "Name\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1949 }, 1950 { 1951 "objectID": "api/DictSample.html", 1952 "href": "api/DictSample.html", 1953 "title": "DictSample", 1954 "section": "", 1955 "text": "DictSample(_data=None, **kwargs)\nDynamic sample type providing dict-like access to raw msgpack data.\nThis class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (sample.field) and dict-style (sample[\"field\"]) access to fields.\nDictSample is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema\nTo convert to a typed schema, use Dataset.as_type() with a @packable-decorated class. Every @packable class automatically registers a lens from DictSample, making this conversion seamless.\n\n\n>>> ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n>>> for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n>>> # Convert to typed schema\n>>> typed_ds = ds.as_type(MyTypedSample)\n\n\n\nNDArray fields are stored as raw bytes in DictSample. 
They are only converted to numpy arrays when accessed through a typed sample class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1956 }, 1957 { 1958 "objectID": "api/DictSample.html#examples", 1959 "href": "api/DictSample.html#examples", 1960 "title": "DictSample", 1961 "section": "", 1962 "text": ">>> ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n>>> for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n>>> # Convert to typed schema\n>>> typed_ds = ds.as_type(MyTypedSample)" 1963 }, 1964 { 1965 "objectID": "api/DictSample.html#note", 1966 "href": "api/DictSample.html#note", 1967 "title": "DictSample", 1968 "section": "", 1969 "text": "NDArray fields are stored as raw bytes in DictSample. They are only converted to numpy arrays when accessed through a typed sample class." 1970 }, 1971 { 1972 "objectID": "api/DictSample.html#attributes", 1973 "href": "api/DictSample.html#attributes", 1974 "title": "DictSample", 1975 "section": "", 1976 "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 
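Because DictSample exposes the same as_wds/packed interface as typed samples, it can also be used on the write side when no schema exists yet. A small sketch under those assumptions; it presumes DictSample is importable from the top-level atdata package like the other sample types, and the records and filename are illustrative.

import webdataset as wds
from atdata import DictSample  # assumed top-level export

records = [
    {"text": "hello", "score": 0.5},
    {"text": "world", "score": 0.9},
]

with wds.writer.TarWriter("untyped-000000.tar") as sink:
    for i, rec in enumerate(records):
        sample = DictSample.from_data(rec)                     # wrap raw dict data
        sink.write({**sample.as_wds, "__key__": f"{i:06d}"})   # pack for WebDataset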
1977 }, 1978 { 1979 "objectID": "api/DictSample.html#methods", 1980 "href": "api/DictSample.html#methods", 1981 "title": "DictSample", 1982 "section": "", 1983 "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1984 }, 1985 { 1986 "objectID": "api/PDSBlobStore.html", 1987 "href": "api/PDSBlobStore.html", 1988 "title": "PDSBlobStore", 1989 "section": "", 1990 "text": "atmosphere.PDSBlobStore(client)\nPDS blob store implementing AbstractDataStore protocol.\nStores dataset shards as ATProto blobs, enabling decentralized dataset storage on the AT Protocol network.\nEach shard is written to a temporary tar file, then uploaded as a blob to the user’s PDS. 
The returned URLs are AT URIs that can be resolved to HTTP URLs for streaming.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance.\n\n\n\n\n\n\n>>> store = PDSBlobStore(client)\n>>> urls = store.write_shards(dataset, prefix=\"training/v1\")\n>>> # Returns AT URIs like:\n>>> # ['at://did:plc:abc/blob/bafyrei...', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1991 }, 1992 { 1993 "objectID": "api/PDSBlobStore.html#attributes", 1994 "href": "api/PDSBlobStore.html#attributes", 1995 "title": "PDSBlobStore", 1996 "section": "", 1997 "text": "Name\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance." 
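The methods above combine into a simple write-then-resolve flow. A minimal sketch, assuming an authenticated AtmosphereClient named client and an existing Dataset ds; the prefix and shard-count values are illustrative, not defaults.

from atdata.atmosphere import PDSBlobStore

store = PDSBlobStore(client)  # client: authenticated AtmosphereClient (assumed)

# Upload the dataset as PDS blobs; returns at://{did}/blob/{cid} URIs
at_uris = store.write_shards(ds, prefix="training/v1", maxcount=5_000)

# Resolve each AT URI to an HTTP URL that WebDataset can stream
http_urls = [store.read_url(uri) for uri in at_uris]

# Or build a ready-made DataSource over the same blobs
source = store.create_source(at_uris)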
1998 }, 1999 { 2000 "objectID": "api/PDSBlobStore.html#examples", 2001 "href": "api/PDSBlobStore.html#examples", 2002 "title": "PDSBlobStore", 2003 "section": "", 2004 "text": ">>> store = PDSBlobStore(client)\n>>> urls = store.write_shards(dataset, prefix=\"training/v1\")\n>>> # Returns AT URIs like:\n>>> # ['at://did:plc:abc/blob/bafyrei...', ...]" 2005 }, 2006 { 2007 "objectID": "api/PDSBlobStore.html#methods", 2008 "href": "api/PDSBlobStore.html#methods", 2009 "title": "PDSBlobStore", 2010 "section": "", 2011 "text": "Name\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 2012 }, 2013 { 2014 "objectID": "api/PackableSample.html", 2015 "href": "api/PackableSample.html", 2016 "title": "PackableSample", 2017 "section": "", 2018 "text": "PackableSample()\nBase class for samples that can be serialized with msgpack.\nThis abstract base class provides automatic serialization/deserialization for dataclass-based samples. 
Fields annotated as NDArray or NDArray | None are automatically converted between numpy arrays and bytes during packing/unpacking.\nSubclasses should be defined either by: 1. Direct inheritance with the @dataclass decorator 2. Using the @packable decorator (recommended)\n\n\n>>> @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n>>> sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n>>> packed = sample.packed # Serialize to bytes\n>>> restored = MyData.from_bytes(packed) # Deserialize\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 2019 }, 2020 { 2021 "objectID": "api/PackableSample.html#examples", 2022 "href": "api/PackableSample.html#examples", 2023 "title": "PackableSample", 2024 "section": "", 2025 "text": ">>> @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n>>> sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n>>> packed = sample.packed # Serialize to bytes\n>>> restored = MyData.from_bytes(packed) # Deserialize" 2026 }, 2027 { 2028 "objectID": "api/PackableSample.html#attributes", 2029 "href": "api/PackableSample.html#attributes", 2030 "title": "PackableSample", 2031 "section": "", 2032 "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 2033 }, 2034 { 2035 "objectID": "api/PackableSample.html#methods", 2036 "href": "api/PackableSample.html#methods", 2037 "title": "PackableSample", 2038 "section": "", 2039 "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 
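Beyond the in-memory packed/from_bytes round trip shown above, the same sample class plugs into the WebDataset write/read workflow used throughout the tutorials. A short sketch assuming numpy and webdataset are installed; the class, field names, and filename are illustrative.

import numpy as np
from numpy.typing import NDArray
import webdataset as wds
import atdata

@atdata.packable
class MyData:
    name: str
    embeddings: NDArray

# Write a handful of samples to a WebDataset shard
with wds.writer.TarWriter("mydata-000000.tar") as sink:
    for i in range(10):
        sample = MyData(name=f"item_{i}", embeddings=np.random.randn(8).astype(np.float32))
        sink.write({**sample.as_wds, "__key__": f"{i:06d}"})

# Read it back as a typed dataset; NDArray fields are restored from packed bytes
ds = atdata.Dataset[MyData]("mydata-000000.tar")
for batch in ds.ordered(batch_size=4):
    print(batch.embeddings.shape)
    break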
2040 }, 2041 { 2042 "objectID": "api/DatasetDict.html", 2043 "href": "api/DatasetDict.html", 2044 "title": "DatasetDict", 2045 "section": "", 2046 "text": "DatasetDict(splits=None, sample_type=None, streaming=False)\nA dictionary of split names to Dataset instances.\nSimilar to HuggingFace’s DatasetDict, this provides a container for multiple dataset splits (train, test, validation, etc.) with convenience methods that operate across all splits.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired\n\n\n\n\n\n\n>>> ds_dict = load_dataset(\"path/to/data\", MyData)\n>>> train = ds_dict[\"train\"]\n>>> test = ds_dict[\"test\"]\n>>>\n>>> # Iterate over all splits\n>>> for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 2047 }, 2048 { 2049 "objectID": "api/DatasetDict.html#parameters", 2050 "href": "api/DatasetDict.html#parameters", 2051 "title": "DatasetDict", 2052 "section": "", 2053 "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired" 2054 }, 2055 { 2056 "objectID": "api/DatasetDict.html#examples", 2057 "href": "api/DatasetDict.html#examples", 2058 "title": "DatasetDict", 2059 "section": "", 2060 "text": ">>> ds_dict = load_dataset(\"path/to/data\", MyData)\n>>> train = ds_dict[\"train\"]\n>>> test = ds_dict[\"test\"]\n>>>\n>>> # Iterate over all splits\n>>> for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")" 2061 }, 2062 { 2063 "objectID": "api/DatasetDict.html#attributes", 2064 "href": "api/DatasetDict.html#attributes", 2065 "title": "DatasetDict", 2066 "section": "", 2067 "text": "Name\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 2068 }, 2069 { 2070 "objectID": "tutorials/promotion.html", 2071 "href": "tutorials/promotion.html", 2072 "title": "Promotion Workflow", 2073 "section": "", 2074 "text": "This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network. Promotion is the bridge between Layer 2 (team storage) and Layer 3 (federation).", 2075 "crumbs": [ 2076 "Guide", 2077 "Getting Started", 2078 "Promotion Workflow" 2079 ] 2080 }, 2081 { 2082 "objectID": "tutorials/promotion.html#why-promotion", 2083 "href": "tutorials/promotion.html#why-promotion", 2084 "title": "Promotion Workflow", 2085 "section": "Why Promotion?", 2086 "text": "Why Promotion?\nA common pattern in data science:\n\nStart private: Develop and validate datasets within your team\nGo public: Share successful datasets with the broader community\n\nPromotion handles this transition without re-processing your data. 
Instead of creating a new dataset from scratch, you’re lifting an existing local dataset entry into the federated atmosphere.\nThe workflow handles several complexities automatically:\n\nSchema deduplication: If you’ve already published the same schema type and version, promotion reuses it\nURL preservation: Data stays in place (unless you explicitly want to copy it)\nCID consistency: Content identifiers remain valid across the transition", 2087 "crumbs": [ 2088 "Guide", 2089 "Getting Started", 2090 "Promotion Workflow" 2091 ] 2092 }, 2093 { 2094 "objectID": "tutorials/promotion.html#overview", 2095 "href": "tutorials/promotion.html#overview", 2096 "title": "Promotion Workflow", 2097 "section": "Overview", 2098 "text": "Overview\nThe promotion workflow moves datasets from local storage to the atmosphere:\nLOCAL ATMOSPHERE\n----- ----------\nRedis Index ATProto PDS\nS3 Storage --> (same S3 or new location)\nlocal://schemas/... at://did:plc:.../schema/...\nKey features:\n\nSchema deduplication: Won’t republish identical schemas\nFlexible data handling: Keep existing URLs or copy to new storage\nMetadata preservation: Local metadata carries over to atmosphere", 2099 "crumbs": [ 2100 "Guide", 2101 "Getting Started", 2102 "Promotion Workflow" 2103 ] 2104 }, 2105 { 2106 "objectID": "tutorials/promotion.html#setup", 2107 "href": "tutorials/promotion.html#setup", 2108 "title": "Promotion Workflow", 2109 "section": "Setup", 2110 "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds", 2111 "crumbs": [ 2112 "Guide", 2113 "Getting Started", 2114 "Promotion Workflow" 2115 ] 2116 }, 2117 { 2118 "objectID": "tutorials/promotion.html#prepare-a-local-dataset", 2119 "href": "tutorials/promotion.html#prepare-a-local-dataset", 2120 "title": "Promotion Workflow", 2121 "section": "Prepare a Local Dataset", 2122 "text": "Prepare a Local Dataset\nFirst, set up a dataset in local storage:\n\n# 1. Define sample type\n@atdata.packable\nclass ExperimentSample:\n \"\"\"A sample from a scientific experiment.\"\"\"\n measurement: NDArray\n timestamp: float\n sensor_id: str\n\n# 2. Create samples\nsamples = [\n ExperimentSample(\n measurement=np.random.randn(64).astype(np.float32),\n timestamp=float(i),\n sensor_id=f\"sensor_{i % 4}\",\n )\n for i in range(1000)\n]\n\n# 3. Write to tar\nwith wds.writer.TarWriter(\"experiment.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. Set up local index with S3 storage\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 5. Insert dataset into index\ndataset = atdata.Dataset[ExperimentSample](\"experiment.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"experiment-2024-001\", prefix=\"experiments\")\n\n# 6. 
Publish schema to local index\nlocal_index.publish_schema(ExperimentSample, version=\"1.0.0\")\n\nprint(f\"Local entry name: {local_entry.name}\")\nprint(f\"Local entry CID: {local_entry.cid}\")\nprint(f\"Data URLs: {local_entry.data_urls}\")", 2123 "crumbs": [ 2124 "Guide", 2125 "Getting Started", 2126 "Promotion Workflow" 2127 ] 2128 }, 2129 { 2130 "objectID": "tutorials/promotion.html#basic-promotion", 2131 "href": "tutorials/promotion.html#basic-promotion", 2132 "title": "Promotion Workflow", 2133 "section": "Basic Promotion", 2134 "text": "Basic Promotion\nPromote the dataset to ATProto:\n\n# Connect to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\n# Promote to atmosphere\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 2135 "crumbs": [ 2136 "Guide", 2137 "Getting Started", 2138 "Promotion Workflow" 2139 ] 2140 }, 2141 { 2142 "objectID": "tutorials/promotion.html#promotion-with-metadata", 2143 "href": "tutorials/promotion.html#promotion-with-metadata", 2144 "title": "Promotion Workflow", 2145 "section": "Promotion with Metadata", 2146 "text": "Promotion with Metadata\nAdd description, tags, and license:\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n name=\"experiment-2024-001-v2\", # Override name\n description=\"Sensor measurements from Lab 302\",\n tags=[\"experiment\", \"physics\", \"2024\"],\n license=\"CC-BY-4.0\",\n)\nprint(f\"Published with metadata: {at_uri}\")", 2147 "crumbs": [ 2148 "Guide", 2149 "Getting Started", 2150 "Promotion Workflow" 2151 ] 2152 }, 2153 { 2154 "objectID": "tutorials/promotion.html#schema-deduplication", 2155 "href": "tutorials/promotion.html#schema-deduplication", 2156 "title": "Promotion Workflow", 2157 "section": "Schema Deduplication", 2158 "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\nfrom atdata.promote import _find_existing_schema\n\n# Check if schema already exists\nexisting = _find_existing_schema(client, \"ExperimentSample\", \"1.0.0\")\nif existing:\n print(f\"Found existing schema: {existing}\")\n print(\"Will reuse instead of republishing\")\nelse:\n print(\"No existing schema found, will publish new one\")\n\nWhen you promote multiple datasets with the same sample type:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)", 2159 "crumbs": [ 2160 "Guide", 2161 "Getting Started", 2162 "Promotion Workflow" 2163 ] 2164 }, 2165 { 2166 "objectID": "tutorials/promotion.html#data-migration-options", 2167 "href": "tutorials/promotion.html#data-migration-options", 2168 "title": "Promotion Workflow", 2169 "section": "Data Migration Options", 2170 "text": "Data Migration Options\n\nKeep Existing URLsCopy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\n\nBenefits:\n\nFastest option, no data copying\nDataset record points to existing URLs\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n 
local_entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\nBenefits:\n\nData is copied to new bucket\nGood for moving from private to public storage\nOriginal storage can be retired", 2171 "crumbs": [ 2172 "Guide", 2173 "Getting Started", 2174 "Promotion Workflow" 2175 ] 2176 }, 2177 { 2178 "objectID": "tutorials/promotion.html#verify-on-atmosphere", 2179 "href": "tutorials/promotion.html#verify-on-atmosphere", 2180 "title": "Promotion Workflow", 2181 "section": "Verify on Atmosphere", 2182 "text": "Verify on Atmosphere\nAfter promotion, verify the dataset is accessible:\n\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\n\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")\n\n# Load and iterate\nSampleType = atm_index.decode_schema(entry.schema_ref)\nds = atdata.Dataset[SampleType](entry.data_urls[0])\n\nfor batch in ds.ordered(batch_size=32):\n print(f\"Measurement shape: {batch.measurement.shape}\")\n break", 2183 "crumbs": [ 2184 "Guide", 2185 "Getting Started", 2186 "Promotion Workflow" 2187 ] 2188 }, 2189 { 2190 "objectID": "tutorials/promotion.html#error-handling", 2191 "href": "tutorials/promotion.html#error-handling", 2192 "title": "Promotion Workflow", 2193 "section": "Error Handling", 2194 "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(local_entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\n print(\"Publish schema first: local_index.publish_schema(SampleType)\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 2195 "crumbs": [ 2196 "Guide", 2197 "Getting Started", 2198 "Promotion Workflow" 2199 ] 2200 }, 2201 { 2202 "objectID": "tutorials/promotion.html#requirements-checklist", 2203 "href": "tutorials/promotion.html#requirements-checklist", 2204 "title": "Promotion Workflow", 2205 "section": "Requirements Checklist", 2206 "text": "Requirements Checklist\nBefore promotion:\n\nDataset is in local index (via LocalIndex.insert_dataset() or LocalIndex.add_entry())\nSchema is published to local index (via LocalIndex.publish_schema())\nAtmosphereClient is authenticated\nData URLs are publicly accessible (or will be copied)", 2207 "crumbs": [ 2208 "Guide", 2209 "Getting Started", 2210 "Promotion Workflow" 2211 ] 2212 }, 2213 { 2214 "objectID": "tutorials/promotion.html#complete-workflow", 2215 "href": "tutorials/promotion.html#complete-workflow", 2216 "title": "Promotion Workflow", 2217 "section": "Complete Workflow", 2218 "text": "Complete Workflow\n\n# Complete local-to-atmosphere workflow\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. Create dataset tar\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. 
Store in local index with S3 backend\nstore = S3DataStore(credentials=\"creds.env\", bucket=\"bucket\")\nlocal_index = LocalIndex(data_store=store)\n\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 4. Publish schema locally\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\n\n# 5. Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Others can now discover and load\n# ds = atdata.load_dataset(\"@myhandle.bsky.social/feature-vectors-v1\")", 2219 "crumbs": [ 2220 "Guide", 2221 "Getting Started", 2222 "Promotion Workflow" 2223 ] 2224 }, 2225 { 2226 "objectID": "tutorials/promotion.html#what-youve-learned", 2227 "href": "tutorials/promotion.html#what-youve-learned", 2228 "title": "Promotion Workflow", 2229 "section": "What You’ve Learned", 2230 "text": "What You’ve Learned\nYou now understand the promotion workflow:\n\n\n\n\n\n\n\nConcept\nPurpose\n\n\n\n\npromote_to_atmosphere()\nLift local entries to federated network\n\n\nSchema deduplication\nAvoid publishing duplicate schemas\n\n\nData URL preservation\nKeep data in place or copy to new storage\n\n\nMetadata enrichment\nAdd description, tags, license during promotion\n\n\n\nPromotion completes atdata’s three-layer story: you can now move seamlessly from local experimentation to team collaboration to public sharing, all with the same typed sample definitions.", 2231 "crumbs": [ 2232 "Guide", 2233 "Getting Started", 2234 "Promotion Workflow" 2235 ] 2236 }, 2237 { 2238 "objectID": "tutorials/promotion.html#the-complete-journey", 2239 "href": "tutorials/promotion.html#the-complete-journey", 2240 "title": "Promotion Workflow", 2241 "section": "The Complete Journey", 2242 "text": "The Complete Journey\n┌──────────────────┐ insert ┌──────────────────┐ promote ┌──────────────────┐\n│ Local Files │ ────────────→ │ Team Storage │ ────────────→ │ Federation │\n│ │ │ │ │ │\n│ tar files │ │ Redis + S3 │ │ ATProto PDS │\n│ Dataset[T] │ │ LocalIndex │ │ AtmosphereIndex │\n└──────────────────┘ └──────────────────┘ └──────────────────┘", 2243 "crumbs": [ 2244 "Guide", 2245 "Getting Started", 2246 "Promotion Workflow" 2247 ] 2248 }, 2249 { 2250 "objectID": "tutorials/promotion.html#next-steps", 2251 "href": "tutorials/promotion.html#next-steps", 2252 "title": "Promotion Workflow", 2253 "section": "Next Steps", 2254 "text": "Next Steps\n\nAtmosphere Reference - Complete atmosphere API\nProtocols - Abstract interfaces\nLocal Storage - Local storage reference", 2255 "crumbs": [ 2256 "Guide", 2257 "Getting Started", 2258 "Promotion Workflow" 2259 ] 2260 }, 2261 { 2262 "objectID": "tutorials/local-workflow.html", 2263 "href": "tutorials/local-workflow.html", 2264 "title": "Local Workflow", 2265 "section": "", 2266 "text": "This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage. 
This is Layer 2 of atdata’s architecture—team-scale storage that bridges local development and federated sharing.", 2267 "crumbs": [ 2268 "Guide", 2269 "Getting Started", 2270 "Local Workflow" 2271 ] 2272 }, 2273 { 2274 "objectID": "tutorials/local-workflow.html#why-team-storage", 2275 "href": "tutorials/local-workflow.html#why-team-storage", 2276 "title": "Local Workflow", 2277 "section": "Why Team Storage?", 2278 "text": "Why Team Storage?\nLocal tar files work well for individual experiments, but teams need:\n\nDiscovery: “What datasets do we have? What schema does this one use?”\nConsistency: “Is everyone using the same version of this dataset?”\nDurability: “Where’s the canonical copy of our training data?”\n\natdata’s local storage module addresses these needs with a two-component architecture:\n\n\n\n\n\n\n\nComponent\nPurpose\n\n\n\n\nRedis Index\nFast metadata queries, schema registry, dataset discovery\n\n\nS3 DataStore\nScalable object storage for actual data files\n\n\n\nThis separation means metadata operations (listing datasets, resolving schemas) are fast and don’t touch large data files, while the data itself lives in battle-tested object storage.", 2279 "crumbs": [ 2280 "Guide", 2281 "Getting Started", 2282 "Local Workflow" 2283 ] 2284 }, 2285 { 2286 "objectID": "tutorials/local-workflow.html#prerequisites", 2287 "href": "tutorials/local-workflow.html#prerequisites", 2288 "title": "Local Workflow", 2289 "section": "Prerequisites", 2290 "text": "Prerequisites\n\nRedis server running (default: localhost:6379)\nS3-compatible storage (MinIO, AWS S3, etc.)\n\n\n\n\n\n\n\nTip\n\n\n\nFor local development, you can use MinIO:\ndocker run -p 9000:9000 minio/minio server /data", 2291 "crumbs": [ 2292 "Guide", 2293 "Getting Started", 2294 "Local Workflow" 2295 ] 2296 }, 2297 { 2298 "objectID": "tutorials/local-workflow.html#setup", 2299 "href": "tutorials/local-workflow.html#setup", 2300 "title": "Local Workflow", 2301 "section": "Setup", 2302 "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nimport webdataset as wds", 2303 "crumbs": [ 2304 "Guide", 2305 "Getting Started", 2306 "Local Workflow" 2307 ] 2308 }, 2309 { 2310 "objectID": "tutorials/local-workflow.html#define-sample-types", 2311 "href": "tutorials/local-workflow.html#define-sample-types", 2312 "title": "Local Workflow", 2313 "section": "Define Sample Types", 2314 "text": "Define Sample Types\n\n@atdata.packable\nclass TrainingSample:\n \"\"\"A sample containing features and label for training.\"\"\"\n features: NDArray\n label: int\n\n@atdata.packable\nclass TextSample:\n \"\"\"A sample containing text data.\"\"\"\n text: str\n category: str", 2315 "crumbs": [ 2316 "Guide", 2317 "Getting Started", 2318 "Local Workflow" 2319 ] 2320 }, 2321 { 2322 "objectID": "tutorials/local-workflow.html#localdatasetentry", 2323 "href": "tutorials/local-workflow.html#localdatasetentry", 2324 "title": "Local Workflow", 2325 "section": "LocalDatasetEntry", 2326 "text": "LocalDatasetEntry\nEvery dataset in the index is represented by a LocalDatasetEntry. A key design decision: entries use content-addressable CIDs (Content Identifiers) as their identity. 
This means:\n\nIdentical content always has the same CID\nYou can verify data integrity by checking the CID\nDeduplication happens automatically\n\nCIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.\nCreate entries with content-addressable CIDs:\n\n# Create an entry manually\nentry = LocalDatasetEntry(\n _name=\"my-dataset\",\n _schema_ref=\"local://schemas/examples.TrainingSample@1.0.0\",\n _data_urls=[\"s3://bucket/data-000000.tar\", \"s3://bucket/data-000001.tar\"],\n _metadata={\"source\": \"example\", \"samples\": 10000},\n)\n\nprint(f\"Entry name: {entry.name}\")\nprint(f\"Schema ref: {entry.schema_ref}\")\nprint(f\"Data URLs: {entry.data_urls}\")\nprint(f\"Metadata: {entry.metadata}\")\nprint(f\"CID: {entry.cid}\")\n\n\n\n\n\n\n\nNote\n\n\n\nCIDs are generated from content (schema_ref + data_urls), so identical data produces identical CIDs.", 2327 "crumbs": [ 2328 "Guide", 2329 "Getting Started", 2330 "Local Workflow" 2331 ] 2332 }, 2333 { 2334 "objectID": "tutorials/local-workflow.html#localindex", 2335 "href": "tutorials/local-workflow.html#localindex", 2336 "title": "Local Workflow", 2337 "section": "LocalIndex", 2338 "text": "LocalIndex\nThe LocalIndex is your team’s dataset registry. It implements the AbstractIndex protocol, meaning code written against LocalIndex will also work with AtmosphereIndex when you’re ready for federated sharing.\nThe index tracks datasets in Redis:\n\nfrom redis import Redis\n\n# Connect to Redis\nredis = Redis(host=\"localhost\", port=6379)\nindex = LocalIndex(redis=redis)\n\nprint(\"LocalIndex connected\")\n\n\nSchema Management\nSchema publishing is how you ensure type consistency across your team. When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.\nThis enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.\n\n# Publish a schema\nschema_ref = index.publish_schema(TrainingSample, version=\"1.0.0\")\nprint(f\"Published schema: {schema_ref}\")\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\" - {schema.get('name', 'Unknown')} v{schema.get('version', '?')}\")\n\n# Get schema record\nschema_record = index.get_schema(schema_ref)\nprint(f\"Schema fields: {[f['name'] for f in schema_record.get('fields', [])]}\")\n\n# Decode schema back to a PackableSample class\ndecoded_type = index.decode_schema(schema_ref)\nprint(f\"Decoded type: {decoded_type.__name__}\")", 2339 "crumbs": [ 2340 "Guide", 2341 "Getting Started", 2342 "Local Workflow" 2343 ] 2344 }, 2345 { 2346 "objectID": "tutorials/local-workflow.html#s3datastore", 2347 "href": "tutorials/local-workflow.html#s3datastore", 2348 "title": "Local Workflow", 2349 "section": "S3DataStore", 2350 "text": "S3DataStore\nThe S3DataStore implements the AbstractDataStore protocol for S3-compatible object storage. 
It works with:\n\nAWS S3: Production-scale cloud storage\nMinIO: Self-hosted S3-compatible storage (great for development)\nCloudflare R2: Cost-effective S3-compatible storage\n\nThe data store handles uploading tar shards and creating signed URLs for streaming access.\nFor direct S3 operations:\n\ncreds = {\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n}\n\nstore = S3DataStore(creds, bucket=\"my-bucket\")\n\nprint(f\"Bucket: {store.bucket}\")\nprint(f\"Supports streaming: {store.supports_streaming()}\")", 2351 "crumbs": [ 2352 "Guide", 2353 "Getting Started", 2354 "Local Workflow" 2355 ] 2356 }, 2357 { 2358 "objectID": "tutorials/local-workflow.html#complete-index-workflow", 2359 "href": "tutorials/local-workflow.html#complete-index-workflow", 2360 "title": "Local Workflow", 2361 "section": "Complete Index Workflow", 2362 "text": "Complete Index Workflow\nHere’s the typical workflow for publishing a dataset to your team:\n\nCreate samples using your @packable type\nWrite to local tar for staging\nCreate a Dataset wrapper\nConnect to index with data store\nPublish schema for type consistency\nInsert dataset (uploads to S3, indexes in Redis)\n\nThe index composition pattern (LocalIndex(data_store=S3DataStore(...))) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.\nUse LocalIndex with S3DataStore to store datasets with S3 storage and Redis indexing:\n\n# 1. Create sample data\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10\n )\n for i in range(1000)\n]\nprint(f\"Created {len(samples)} training samples\")\n\n# 2. Write to local tar file\nwith wds.writer.TarWriter(\"local-data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\nprint(\"Wrote samples to local tar file\")\n\n# 3. Create Dataset\nds = atdata.Dataset[TrainingSample](\"local-data-000000.tar\")\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(redis=redis, data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nentry = index.insert_dataset(ds, name=\"training-v1\", prefix=\"datasets\")\nprint(f\"Stored at: {entry.data_urls}\")\nprint(f\"CID: {entry.cid}\")\n\n# 5. Retrieve later\nretrieved_entry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](retrieved_entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(f\"Batch features shape: {batch.features.shape}\")\n break", 2363 "crumbs": [ 2364 "Guide", 2365 "Getting Started", 2366 "Local Workflow" 2367 ] 2368 }, 2369 { 2370 "objectID": "tutorials/local-workflow.html#using-load_dataset-with-index", 2371 "href": "tutorials/local-workflow.html#using-load_dataset-with-index", 2372 "title": "Local Workflow", 2373 "section": "Using load_dataset with Index", 2374 "text": "Using load_dataset with Index\nThe load_dataset() function provides a HuggingFace-style API that abstracts away the details of where data lives. 
When you pass an index, it can resolve @local/ prefixed paths to the actual data URLs and apply the correct credentials automatically.\nThe load_dataset() function supports index lookup:\n\nfrom atdata import load_dataset\n\n# Load from local index\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# The index resolves the dataset name to URLs and schema\nfor batch in ds.shuffled(batch_size=32):\n process(batch)\n break", 2375 "crumbs": [ 2376 "Guide", 2377 "Getting Started", 2378 "Local Workflow" 2379 ] 2380 }, 2381 { 2382 "objectID": "tutorials/local-workflow.html#what-youve-learned", 2383 "href": "tutorials/local-workflow.html#what-youve-learned", 2384 "title": "Local Workflow", 2385 "section": "What You’ve Learned", 2386 "text": "What You’ve Learned\nYou now understand team-scale storage in atdata:\n\n\n\n\n\n\n\nConcept\nPurpose\n\n\n\n\nLocalIndex\nRedis-backed dataset registry implementing AbstractIndex\n\n\nS3DataStore\nS3-compatible object storage implementing AbstractDataStore\n\n\nLocalDatasetEntry\nContent-addressed dataset entries with CIDs\n\n\nSchema publishing\nShared type definitions for team consistency\n\n\n\nThe same sample types you defined in the Quick Start work seamlessly here—the only change is where the data lives.", 2387 "crumbs": [ 2388 "Guide", 2389 "Getting Started", 2390 "Local Workflow" 2391 ] 2392 }, 2393 { 2394 "objectID": "tutorials/local-workflow.html#next-steps", 2395 "href": "tutorials/local-workflow.html#next-steps", 2396 "title": "Local Workflow", 2397 "section": "Next Steps", 2398 "text": "Next Steps\n\n\n\n\n\n\nReady for Public Sharing?\n\n\n\nThe Atmosphere Publishing tutorial shows how to publish datasets to the ATProto network for decentralized, cross-organization discovery.\n\n\n\nAtmosphere Publishing - Publish to ATProto federation\nPromotion Workflow - Migrate from local to atmosphere\nLocal Storage Reference - Complete API reference", 2399 "crumbs": [ 2400 "Guide", 2401 "Getting Started", 2402 "Local Workflow" 2403 ] 2404 }, 2405 { 2406 "objectID": "reference/promotion.html", 2407 "href": "reference/promotion.html", 2408 "title": "Promotion Workflow", 2409 "section": "", 2410 "text": "The promotion workflow migrates datasets from local storage (Redis + S3) to the ATProto atmosphere network, enabling federation and discovery.", 2411 "crumbs": [ 2412 "Guide", 2413 "Reference", 2414 "Promotion Workflow" 2415 ] 2416 }, 2417 { 2418 "objectID": "reference/promotion.html#overview", 2419 "href": "reference/promotion.html#overview", 2420 "title": "Promotion Workflow", 2421 "section": "Overview", 2422 "text": "Overview\nPromotion handles:\n\nSchema deduplication: Avoids publishing duplicate schemas\nData URL preservation: Keeps existing S3 URLs or copies to new storage\nMetadata transfer: Preserves tags, descriptions, and custom metadata", 2423 "crumbs": [ 2424 "Guide", 2425 "Reference", 2426 "Promotion Workflow" 2427 ] 2428 }, 2429 { 2430 "objectID": "reference/promotion.html#basic-usage", 2431 "href": "reference/promotion.html#basic-usage", 2432 "title": "Promotion Workflow", 2433 "section": "Basic Usage", 2434 "text": "Basic Usage\n\nfrom atdata.local import LocalIndex\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\n\n# Setup\nlocal_index = LocalIndex()\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Get local entry\nentry = local_index.get_entry_by_name(\"my-dataset\")\n\n# Promote to atmosphere\nat_uri = 
promote_to_atmosphere(entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 2435 "crumbs": [ 2436 "Guide", 2437 "Reference", 2438 "Promotion Workflow" 2439 ] 2440 }, 2441 { 2442 "objectID": "reference/promotion.html#with-metadata", 2443 "href": "reference/promotion.html#with-metadata", 2444 "title": "Promotion Workflow", 2445 "section": "With Metadata", 2446 "text": "With Metadata\n\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n name=\"my-dataset-v2\", # Override name\n description=\"Training images\", # Add description\n tags=[\"images\", \"training\"], # Add discovery tags\n license=\"MIT\", # Specify license\n)", 2447 "crumbs": [ 2448 "Guide", 2449 "Reference", 2450 "Promotion Workflow" 2451 ] 2452 }, 2453 { 2454 "objectID": "reference/promotion.html#schema-deduplication", 2455 "href": "reference/promotion.html#schema-deduplication", 2456 "title": "Promotion Workflow", 2457 "section": "Schema Deduplication", 2458 "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)\n\nSchema matching is based on:\n\n{module}.{class_name} (e.g., mymodule.ImageSample)\nVersion string (e.g., 1.0.0)", 2459 "crumbs": [ 2460 "Guide", 2461 "Reference", 2462 "Promotion Workflow" 2463 ] 2464 }, 2465 { 2466 "objectID": "reference/promotion.html#data-storage-options", 2467 "href": "reference/promotion.html#data-storage-options", 2468 "title": "Promotion Workflow", 2469 "section": "Data Storage Options", 2470 "text": "Data Storage Options\n\nKeep Existing URLs (Default)Copy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(entry, local_index, client)\n\n\nData stays in original S3 location\nDataset record points to existing URLs\nFastest option, no data copying\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\n\nData is copied to new bucket\nDataset record points to new URLs\nGood for moving from private to public storage", 2471 "crumbs": [ 2472 "Guide", 2473 "Reference", 2474 "Promotion Workflow" 2475 ] 2476 }, 2477 { 2478 "objectID": "reference/promotion.html#complete-workflow-example", 2479 "href": "reference/promotion.html#complete-workflow-example", 2480 "title": "Promotion Workflow", 2481 "section": "Complete Workflow Example", 2482 "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. 
Create local dataset\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Set up index with S3 data store\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 4. Publish schema and insert dataset\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 5. Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Verify on atmosphere\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")", 2483 "crumbs": [ 2484 "Guide", 2485 "Reference", 2486 "Promotion Workflow" 2487 ] 2488 }, 2489 { 2490 "objectID": "reference/promotion.html#error-handling", 2491 "href": "reference/promotion.html#error-handling", 2492 "title": "Promotion Workflow", 2493 "section": "Error Handling", 2494 "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 2495 "crumbs": [ 2496 "Guide", 2497 "Reference", 2498 "Promotion Workflow" 2499 ] 2500 }, 2501 { 2502 "objectID": "reference/promotion.html#requirements", 2503 "href": "reference/promotion.html#requirements", 2504 "title": "Promotion Workflow", 2505 "section": "Requirements", 2506 "text": "Requirements\nBefore promotion:\n\nDataset must be in local index (via Index.insert_dataset() or Index.add_entry())\nSchema must be published to local index (via Index.publish_schema())\nAtmosphereClient must be authenticated", 2507 "crumbs": [ 2508 "Guide", 2509 "Reference", 2510 "Promotion Workflow" 2511 ] 2512 }, 2513 { 2514 "objectID": "reference/promotion.html#related", 2515 "href": "reference/promotion.html#related", 2516 "title": "Promotion Workflow", 2517 "section": "Related", 2518 "text": "Related\n\nLocal Storage - Setting up local datasets\nAtmosphere - ATProto integration\nProtocols - AbstractIndex and AbstractDataStore", 2519 "crumbs": [ 2520 "Guide", 2521 "Reference", 2522 "Promotion Workflow" 2523 ] 2524 }, 2525 { 2526 "objectID": "reference/load-dataset.html", 2527 "href": "reference/load-dataset.html", 2528 "title": "load_dataset API", 2529 "section": "", 2530 "text": "The load_dataset() function provides a HuggingFace Datasets-style interface for loading typed datasets.", 2531 "crumbs": [ 2532 "Guide", 2533 "Reference", 2534 "load_dataset API" 2535 ] 2536 }, 2537 { 2538 "objectID": "reference/load-dataset.html#overview", 2539 "href": "reference/load-dataset.html#overview", 2540 "title": "load_dataset 
API", 2541 "section": "Overview", 2542 "text": "Overview\nKey differences from HuggingFace Datasets:\n\nRequires explicit sample_type parameter (typed dataclass) unless using index\nReturns atdata.Dataset[ST] instead of HF Dataset\nBuilt on WebDataset for efficient streaming\nNo Arrow caching layer", 2543 "crumbs": [ 2544 "Guide", 2545 "Reference", 2546 "load_dataset API" 2547 ] 2548 }, 2549 { 2550 "objectID": "reference/load-dataset.html#basic-usage", 2551 "href": "reference/load-dataset.html#basic-usage", 2552 "title": "load_dataset API", 2553 "section": "Basic Usage", 2554 "text": "Basic Usage\n\nimport atdata\nfrom atdata import load_dataset\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass TextSample:\n text: str\n label: int\n\n# Load a specific split\ntrain_ds = load_dataset(\"path/to/data.tar\", TextSample, split=\"train\")\n\n# Load all splits (returns DatasetDict)\nds_dict = load_dataset(\"path/to/data/\", TextSample)\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]", 2555 "crumbs": [ 2556 "Guide", 2557 "Reference", 2558 "load_dataset API" 2559 ] 2560 }, 2561 { 2562 "objectID": "reference/load-dataset.html#path-formats", 2563 "href": "reference/load-dataset.html#path-formats", 2564 "title": "load_dataset API", 2565 "section": "Path Formats", 2566 "text": "Path Formats\n\nWebDataset Brace Notation\n\n# Range notation\nds = load_dataset(\"data-{000000..000099}.tar\", MySample, split=\"train\")\n\n# List notation\nds = load_dataset(\"data-{train,test,val}.tar\", MySample, split=\"train\")\n\n\n\nGlob Patterns\n\n# Match all tar files\nds = load_dataset(\"path/to/*.tar\", MySample)\n\n# Match pattern\nds = load_dataset(\"path/to/train-*.tar\", MySample, split=\"train\")\n\n\n\nLocal Directory\n\n# Scans for .tar files\nds = load_dataset(\"./my-dataset/\", MySample)\n\n\n\nRemote URLs\n\n# S3 (public buckets)\nds = load_dataset(\"s3://bucket/data-{000..099}.tar\", MySample, split=\"train\")\n\n# HTTP/HTTPS\nds = load_dataset(\"https://example.com/data.tar\", MySample, split=\"train\")\n\n# Google Cloud Storage\nds = load_dataset(\"gs://bucket/data.tar\", MySample, split=\"train\")\n\n\n\n\n\n\n\nNote\n\n\n\nFor private S3 buckets or S3-compatible storage with authentication, use atdata.S3Source with Dataset directly. 
See Datasets for details.\n\n\n\n\nIndex Lookup\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# Load from local index (auto-resolves type from schema)\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# With explicit type\nds = load_dataset(\"@local/my-dataset\", MySample, index=index, split=\"train\")", 2567 "crumbs": [ 2568 "Guide", 2569 "Reference", 2570 "load_dataset API" 2571 ] 2572 }, 2573 { 2574 "objectID": "reference/load-dataset.html#split-detection", 2575 "href": "reference/load-dataset.html#split-detection", 2576 "title": "load_dataset API", 2577 "section": "Split Detection", 2578 "text": "Split Detection\nSplits are automatically detected from filenames and directories:\n\n\n\nPattern\nDetected Split\n\n\n\n\ntrain-*.tar, training-*.tar\ntrain\n\n\ntest-*.tar, testing-*.tar\ntest\n\n\nval-*.tar, valid-*.tar, validation-*.tar\nvalidation\n\n\ndev-*.tar, development-*.tar\nvalidation\n\n\ntrain/*.tar (directory)\ntrain\n\n\ntest/*.tar (directory)\ntest\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nFiles without a detected split default to “train”.", 2579 "crumbs": [ 2580 "Guide", 2581 "Reference", 2582 "load_dataset API" 2583 ] 2584 }, 2585 { 2586 "objectID": "reference/load-dataset.html#datasetdict", 2587 "href": "reference/load-dataset.html#datasetdict", 2588 "title": "load_dataset API", 2589 "section": "DatasetDict", 2590 "text": "DatasetDict\nWhen loading without split=, returns a DatasetDict:\n\nds_dict = load_dataset(\"path/to/data/\", MySample)\n\n# Access splits\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]\n\n# Iterate splits\nfor name, dataset in ds_dict.items():\n print(f\"{name}: {len(dataset.shard_list)} shards\")\n\n# Properties\nprint(ds_dict.num_shards) # {'train': 10, 'test': 2}\nprint(ds_dict.sample_type) # <class 'MySample'>\nprint(ds_dict.streaming) # False", 2591 "crumbs": [ 2592 "Guide", 2593 "Reference", 2594 "load_dataset API" 2595 ] 2596 }, 2597 { 2598 "objectID": "reference/load-dataset.html#explicit-data-files", 2599 "href": "reference/load-dataset.html#explicit-data-files", 2600 "title": "load_dataset API", 2601 "section": "Explicit Data Files", 2602 "text": "Explicit Data Files\nOverride automatic detection with data_files:\n\n# Single pattern\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files=\"custom-*.tar\",\n)\n\n# List of patterns\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files=[\"shard-000.tar\", \"shard-001.tar\"],\n)\n\n# Explicit split mapping\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files={\n \"train\": \"training-shards-*.tar\",\n \"test\": \"eval-data.tar\",\n },\n)", 2603 "crumbs": [ 2604 "Guide", 2605 "Reference", 2606 "load_dataset API" 2607 ] 2608 }, 2609 { 2610 "objectID": "reference/load-dataset.html#streaming-mode", 2611 "href": "reference/load-dataset.html#streaming-mode", 2612 "title": "load_dataset API", 2613 "section": "Streaming Mode", 2614 "text": "Streaming Mode\nThe streaming parameter signals intent for streaming mode:\n\n# Mark as streaming\nds_dict = load_dataset(\"path/to/data.tar\", MySample, streaming=True)\n\n# Check streaming status\nif ds_dict.streaming:\n print(\"Streaming mode\")\n\n\n\n\n\n\n\nTip\n\n\n\natdata datasets are always lazy/streaming via WebDataset pipelines. 
This parameter primarily signals intent.", 2615 "crumbs": [ 2616 "Guide", 2617 "Reference", 2618 "load_dataset API" 2619 ] 2620 }, 2621 { 2622 "objectID": "reference/load-dataset.html#auto-type-resolution", 2623 "href": "reference/load-dataset.html#auto-type-resolution", 2624 "title": "load_dataset API", 2625 "section": "Auto Type Resolution", 2626 "text": "Auto Type Resolution\nWhen using index lookup, the sample type can be resolved automatically:\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# No sample_type needed - resolved from schema\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# Type is inferred from the stored schema\nsample_type = ds.sample_type", 2627 "crumbs": [ 2628 "Guide", 2629 "Reference", 2630 "load_dataset API" 2631 ] 2632 }, 2633 { 2634 "objectID": "reference/load-dataset.html#error-handling", 2635 "href": "reference/load-dataset.html#error-handling", 2636 "title": "load_dataset API", 2637 "section": "Error Handling", 2638 "text": "Error Handling\n\ntry:\n ds = load_dataset(\"path/to/data.tar\", MySample, split=\"train\")\nexcept FileNotFoundError:\n print(\"No data files found\")\nexcept ValueError as e:\n if \"Split\" in str(e):\n print(\"Requested split not found\")\n else:\n print(f\"Invalid configuration: {e}\")\nexcept KeyError:\n print(\"Dataset not found in index\")", 2639 "crumbs": [ 2640 "Guide", 2641 "Reference", 2642 "load_dataset API" 2643 ] 2644 }, 2645 { 2646 "objectID": "reference/load-dataset.html#complete-example", 2647 "href": "reference/load-dataset.html#complete-example", 2648 "title": "load_dataset API", 2649 "section": "Complete Example", 2650 "text": "Complete Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata import load_dataset\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\n# 2. Create dataset files\nfor split in [\"train\", \"test\"]:\n with wds.writer.TarWriter(f\"{split}-000.tar\") as sink:\n for i in range(100):\n sample = ImageSample(\n image=np.random.rand(64, 64, 3).astype(np.float32),\n label=f\"sample_{i}\",\n )\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Load with split detection\nds_dict = load_dataset(\"./\", ImageSample)\nprint(ds_dict.keys()) # dict_keys(['train', 'test'])\n\n# 4. Iterate\nfor batch in ds_dict[\"train\"].ordered(batch_size=16):\n print(batch.image.shape) # (16, 64, 64, 3)\n print(batch.label) # ['sample_0', 'sample_1', ...]\n break\n\n# 5. 
Load specific split\ntrain_ds = load_dataset(\"./\", ImageSample, split=\"train\")\nfor batch in train_ds.ordered(batch_size=32):\n process(batch)", 2651 "crumbs": [ 2652 "Guide", 2653 "Reference", 2654 "load_dataset API" 2655 ] 2656 }, 2657 { 2658 "objectID": "reference/load-dataset.html#related", 2659 "href": "reference/load-dataset.html#related", 2660 "title": "load_dataset API", 2661 "section": "Related", 2662 "text": "Related\n\nDatasets - Dataset iteration and batching\nPackable Samples - Defining sample types\nLocal Storage - LocalIndex for index lookup\nProtocols - AbstractIndex interface", 2663 "crumbs": [ 2664 "Guide", 2665 "Reference", 2666 "load_dataset API" 2667 ] 2668 }, 2669 { 2670 "objectID": "reference/lenses.html", 2671 "href": "reference/lenses.html", 2672 "title": "Lenses", 2673 "section": "", 2674 "text": "Lenses provide bidirectional transformations between sample types, enabling datasets to be viewed through different schemas without duplicating data.", 2675 "crumbs": [ 2676 "Guide", 2677 "Reference", 2678 "Lenses" 2679 ] 2680 }, 2681 { 2682 "objectID": "reference/lenses.html#overview", 2683 "href": "reference/lenses.html#overview", 2684 "title": "Lenses", 2685 "section": "Overview", 2686 "text": "Overview\nA lens consists of:\n\nGetter: Transforms source type S to view type V\nPutter: Updates source based on a modified view (optional)", 2687 "crumbs": [ 2688 "Guide", 2689 "Reference", 2690 "Lenses" 2691 ] 2692 }, 2693 { 2694 "objectID": "reference/lenses.html#creating-a-lens", 2695 "href": "reference/lenses.html#creating-a-lens", 2696 "title": "Lenses", 2697 "section": "Creating a Lens", 2698 "text": "Creating a Lens\nUse the @lens decorator to define a getter:\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass FullSample:\n image: NDArray\n label: str\n confidence: float\n metadata: dict\n\n@atdata.packable\nclass SimpleSample:\n label: str\n confidence: float\n\n@atdata.lens\ndef simplify(src: FullSample) -> SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\nThe decorator:\n\nCreates a Lens object from the getter function\nRegisters it in the global LensNetwork registry\nExtracts source/view types from annotations", 2699 "crumbs": [ 2700 "Guide", 2701 "Reference", 2702 "Lenses" 2703 ] 2704 }, 2705 { 2706 "objectID": "reference/lenses.html#adding-a-putter", 2707 "href": "reference/lenses.html#adding-a-putter", 2708 "title": "Lenses", 2709 "section": "Adding a Putter", 2710 "text": "Adding a Putter\nTo enable bidirectional updates, add a putter:\n\n@simplify.putter\ndef simplify_put(view: SimpleSample, source: FullSample) -> FullSample:\n return FullSample(\n image=source.image,\n label=view.label,\n confidence=view.confidence,\n metadata=source.metadata,\n )\n\nThe putter receives:\n\nview: The modified view value\nsource: The original source value\n\nIt returns an updated source that reflects changes from the view.", 2711 "crumbs": [ 2712 "Guide", 2713 "Reference", 2714 "Lenses" 2715 ] 2716 }, 2717 { 2718 "objectID": "reference/lenses.html#using-lenses-with-datasets", 2719 "href": "reference/lenses.html#using-lenses-with-datasets", 2720 "title": "Lenses", 2721 "section": "Using Lenses with Datasets", 2722 "text": "Using Lenses with Datasets\nLenses integrate with Dataset.as_type():\n\ndataset = atdata.Dataset[FullSample](\"data-{000000..000009}.tar\")\n\n# View through a different type\nsimple_ds = dataset.as_type(SimpleSample)\n\nfor batch in simple_ds.ordered(batch_size=32):\n # Only SimpleSample fields 
available\n labels = batch.label\n scores = batch.confidence", 2723 "crumbs": [ 2724 "Guide", 2725 "Reference", 2726 "Lenses" 2727 ] 2728 }, 2729 { 2730 "objectID": "reference/lenses.html#direct-lens-usage", 2731 "href": "reference/lenses.html#direct-lens-usage", 2732 "title": "Lenses", 2733 "section": "Direct Lens Usage", 2734 "text": "Direct Lens Usage\nLenses can also be called directly:\n\nimport numpy as np\n\nfull = FullSample(\n image=np.zeros((224, 224, 3)),\n label=\"cat\",\n confidence=0.95,\n metadata={\"source\": \"training\"}\n)\n\n# Apply getter\nsimple = simplify(full)\n# Or: simple = simplify.get(full)\n\n# Apply putter\nmodified_simple = SimpleSample(label=\"dog\", confidence=0.87)\nupdated_full = simplify.put(modified_simple, full)\n# updated_full has label=\"dog\", confidence=0.87, but retains\n# original image and metadata", 2735 "crumbs": [ 2736 "Guide", 2737 "Reference", 2738 "Lenses" 2739 ] 2740 }, 2741 { 2742 "objectID": "reference/lenses.html#lens-laws", 2743 "href": "reference/lenses.html#lens-laws", 2744 "title": "Lenses", 2745 "section": "Lens Laws", 2746 "text": "Lens Laws\nWell-behaved lenses should satisfy these properties:\n\nGetPut\nIf you get a view and immediately put it back, the source is unchanged:\n\nview = lens.get(source)\nassert lens.put(view, source) == source\n\nPutGet\nIf you put a view, getting it back yields that view:\n\nupdated = lens.put(view, source)\nassert lens.get(updated) == view\n\nPutPut\nPutting twice is equivalent to putting once with the final value:\n\nresult1 = lens.put(v2, lens.put(v1, source))\nresult2 = lens.put(v2, source)\nassert result1 == result2", 2747 "crumbs": [ 2748 "Guide", 2749 "Reference", 2750 "Lenses" 2751 ] 2752 }, 2753 { 2754 "objectID": "reference/lenses.html#trivial-putter", 2755 "href": "reference/lenses.html#trivial-putter", 2756 "title": "Lenses", 2757 "section": "Trivial Putter", 2758 "text": "Trivial Putter\nIf no putter is defined, a trivial putter is used that ignores view updates:\n\n@atdata.lens\ndef extract_label(src: FullSample) -> SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\n# Without a putter, put() returns the original source unchanged\nview = SimpleSample(label=\"modified\", confidence=0.5)\nupdated = extract_label.put(view, original)\nassert updated == original # No changes applied", 2759 "crumbs": [ 2760 "Guide", 2761 "Reference", 2762 "Lenses" 2763 ] 2764 }, 2765 { 2766 "objectID": "reference/lenses.html#lensnetwork-registry", 2767 "href": "reference/lenses.html#lensnetwork-registry", 2768 "title": "Lenses", 2769 "section": "LensNetwork Registry", 2770 "text": "LensNetwork Registry\nThe LensNetwork is a singleton that stores all registered lenses:\n\nfrom atdata.lens import LensNetwork\n\nnetwork = LensNetwork()\n\n# Look up a specific lens\nlens = network.transform(FullSample, SimpleSample)\n\n# Raises ValueError if no lens exists\ntry:\n lens = network.transform(TypeA, TypeB)\nexcept ValueError:\n print(\"No lens registered for TypeA -> TypeB\")", 2771 "crumbs": [ 2772 "Guide", 2773 "Reference", 2774 "Lenses" 2775 ] 2776 }, 2777 { 2778 "objectID": "reference/lenses.html#example-feature-extraction", 2779 "href": "reference/lenses.html#example-feature-extraction", 2780 "title": "Lenses", 2781 "section": "Example: Feature Extraction", 2782 "text": "Example: Feature Extraction\n\n@atdata.packable\nclass RawSample:\n audio: NDArray\n text: str\n speaker_id: int\n\n@atdata.packable\nclass TextFeatures:\n text: str\n word_count: 
int\n\n@atdata.lens\ndef extract_text(src: RawSample) -> TextFeatures:\n return TextFeatures(\n text=src.text,\n word_count=len(src.text.split())\n )\n\n@extract_text.putter\ndef extract_text_put(view: TextFeatures, source: RawSample) -> RawSample:\n return RawSample(\n audio=source.audio,\n text=view.text,\n speaker_id=source.speaker_id\n )", 2783 "crumbs": [ 2784 "Guide", 2785 "Reference", 2786 "Lenses" 2787 ] 2788 }, 2789 { 2790 "objectID": "reference/lenses.html#related", 2791 "href": "reference/lenses.html#related", 2792 "title": "Lenses", 2793 "section": "Related", 2794 "text": "Related\n\nDatasets - Using lenses with Dataset.as_type()\nPackable Samples - Defining sample types\nAtmosphere - Publishing lenses to ATProto federation", 2795 "crumbs": [ 2796 "Guide", 2797 "Reference", 2798 "Lenses" 2799 ] 2800 }, 2801 { 2802 "objectID": "reference/packable-samples.html", 2803 "href": "reference/packable-samples.html", 2804 "title": "Packable Samples", 2805 "section": "", 2806 "text": "Packable samples are typed dataclasses that can be serialized with msgpack for storage in WebDataset tar files.", 2807 "crumbs": [ 2808 "Guide", 2809 "Reference", 2810 "Packable Samples" 2811 ] 2812 }, 2813 { 2814 "objectID": "reference/packable-samples.html#the-packable-decorator", 2815 "href": "reference/packable-samples.html#the-packable-decorator", 2816 "title": "Packable Samples", 2817 "section": "The @packable Decorator", 2818 "text": "The @packable Decorator\nThe recommended way to define a sample type is with the @packable decorator:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\nThis creates a dataclass that:\n\nInherits from PackableSample\nHas automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 2819 "crumbs": [ 2820 "Guide", 2821 "Reference", 2822 "Packable Samples" 2823 ] 2824 }, 2825 { 2826 "objectID": "reference/packable-samples.html#supported-field-types", 2827 "href": "reference/packable-samples.html#supported-field-types", 2828 "title": "Packable Samples", 2829 "section": "Supported Field Types", 2830 "text": "Supported Field Types\n\nPrimitives\n\n@atdata.packable\nclass PrimitiveSample:\n name: str\n count: int\n score: float\n active: bool\n data: bytes\n\n\n\nNumPy Arrays\nFields annotated as NDArray are automatically converted:\n\n@atdata.packable\nclass ArraySample:\n features: NDArray # Required array\n embeddings: NDArray | None # Optional array\n\n\n\n\n\n\n\nNote\n\n\n\nBytes in NDArray-typed fields are always interpreted as serialized arrays. 
Don’t use NDArray for raw binary data—use bytes instead.\n\n\n\n\nLists\n\n@atdata.packable\nclass ListSample:\n tags: list[str]\n scores: list[float]", 2831 "crumbs": [ 2832 "Guide", 2833 "Reference", 2834 "Packable Samples" 2835 ] 2836 }, 2837 { 2838 "objectID": "reference/packable-samples.html#serialization", 2839 "href": "reference/packable-samples.html#serialization", 2840 "title": "Packable Samples", 2841 "section": "Serialization", 2842 "text": "Serialization\n\nPacking to Bytes\n\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# Serialize to msgpack bytes\npacked_bytes = sample.packed\nprint(f\"Size: {len(packed_bytes)} bytes\")\n\n\n\nUnpacking from Bytes\n\n# Deserialize from bytes\nrestored = ImageSample.from_bytes(packed_bytes)\n\n# Arrays are automatically restored\nassert np.array_equal(sample.image, restored.image)\nassert sample.label == restored.label\n\n\n\nWebDataset Format\nThe as_wds property returns a dict ready for WebDataset:\n\nwds_dict = sample.as_wds\n# {'__key__': '1234...', 'msgpack': b'...'}\n\nWrite samples to a tar file:\n\nimport webdataset as wds\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n # Use custom key or let as_wds generate one\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})", 2843 "crumbs": [ 2844 "Guide", 2845 "Reference", 2846 "Packable Samples" 2847 ] 2848 }, 2849 { 2850 "objectID": "reference/packable-samples.html#direct-inheritance-alternative", 2851 "href": "reference/packable-samples.html#direct-inheritance-alternative", 2852 "title": "Packable Samples", 2853 "section": "Direct Inheritance (Alternative)", 2854 "text": "Direct Inheritance (Alternative)\nYou can also inherit directly from PackableSample:\n\nfrom dataclasses import dataclass\n\n@dataclass\nclass DirectSample(atdata.PackableSample):\n name: str\n values: NDArray\n\nThis is equivalent to using @packable but more verbose.", 2855 "crumbs": [ 2856 "Guide", 2857 "Reference", 2858 "Packable Samples" 2859 ] 2860 }, 2861 { 2862 "objectID": "reference/packable-samples.html#how-it-works", 2863 "href": "reference/packable-samples.html#how-it-works", 2864 "title": "Packable Samples", 2865 "section": "How It Works", 2866 "text": "How It Works\n\nSerialization Flow\n\nPacking\n\nNDArray fields → converted to bytes via array_to_bytes()\nOther fields → passed through unchanged\nAll fields → packed with msgpack\n\nUnpacking\n\nBytes → unpacked with ormsgpack\nDict → passed to __init__\n__post_init__ → calls _ensure_good()\nNDArray fields → bytes converted back to arrays\n\nThe _ensure_good() Method\nThis method runs automatically after construction and handles NDArray conversion:\n\ndef _ensure_good(self):\n for field in dataclasses.fields(self):\n if _is_possibly_ndarray_type(field.type):\n value = getattr(self, field.name)\n if isinstance(value, bytes):\n setattr(self, field.name, bytes_to_array(value))", 2867 "crumbs": [ 2868 "Guide", 2869 "Reference", 2870 "Packable Samples" 2871 ] 2872 }, 2873 { 2874 "objectID": "reference/packable-samples.html#best-practices", 2875 "href": "reference/packable-samples.html#best-practices", 2876 "title": "Packable Samples", 2877 "section": "Best Practices", 2878 "text": "Best Practices\n\nDo / Don’t\n\n@atdata.packable\nclass GoodSample:\n features: NDArray # Clear type annotation\n label: str # Simple primitives\n metadata: dict # Msgpack-compatible dicts\n scores: list[float] # Typed 
lists\n\n\n\n\n@atdata.packable\nclass BadSample:\n # DON'T: Nested dataclasses not supported\n nested: OtherSample\n\n # DON'T: Complex objects that aren't msgpack-serializable\n callback: Callable\n\n # DON'T: Use NDArray for raw bytes\n raw_data: NDArray # Use 'bytes' type instead", 2879 "crumbs": [ 2880 "Guide", 2881 "Reference", 2882 "Packable Samples" 2883 ] 2884 }, 2885 { 2886 "objectID": "reference/packable-samples.html#related", 2887 "href": "reference/packable-samples.html#related", 2888 "title": "Packable Samples", 2889 "section": "Related", 2890 "text": "Related\n\nDatasets - Loading and iterating samples\nLenses - Transforming between sample types", 2891 "crumbs": [ 2892 "Guide", 2893 "Reference", 2894 "Packable Samples" 2895 ] 2896 }, 2897 { 2898 "objectID": "reference/deployment.html", 2899 "href": "reference/deployment.html", 2900 "title": "Deployment Guide", 2901 "section": "", 2902 "text": "This guide covers deploying atdata in production environments, including Redis setup for LocalIndex, S3 storage configuration, and ATProto publishing considerations.", 2903 "crumbs": [ 2904 "Guide", 2905 "Reference", 2906 "Deployment Guide" 2907 ] 2908 }, 2909 { 2910 "objectID": "reference/deployment.html#local-storage-deployment", 2911 "href": "reference/deployment.html#local-storage-deployment", 2912 "title": "Deployment Guide", 2913 "section": "Local Storage Deployment", 2914 "text": "Local Storage Deployment\nThe local storage backend uses Redis for metadata indexing and S3-compatible storage for dataset files.\n\nRedis Setup\n\nRequirements\n\nRedis 6.0+ (for Redis-OM compatibility)\nSufficient memory for index metadata (typically < 100MB for most deployments)\n\n\n\nDocker Deployment\n# Basic Redis\ndocker run -d \\\n --name atdata-redis \\\n -p 6379:6379 \\\n -v redis-data:/data \\\n redis:7-alpine \\\n redis-server --appendonly yes\n\n# With password\ndocker run -d \\\n --name atdata-redis \\\n -p 6379:6379 \\\n -v redis-data:/data \\\n redis:7-alpine \\\n redis-server --appendonly yes --requirepass yourpassword\n\n\nConfiguration\nfrom redis import Redis\nfrom atdata.local import LocalIndex\n\n# Basic connection\nredis = Redis(host=\"localhost\", port=6379)\nindex = LocalIndex(redis=redis)\n\n# With authentication\nredis = Redis(\n host=\"redis.example.com\",\n port=6379,\n password=\"yourpassword\",\n ssl=True, # For production\n)\nindex = LocalIndex(redis=redis)\n\n\nRedis Clustering\nFor high-availability deployments:\nfrom redis.cluster import RedisCluster\n\n# Redis Cluster connection\nredis = RedisCluster(\n host=\"redis-cluster.example.com\",\n port=6379,\n password=\"yourpassword\",\n)\nindex = LocalIndex(redis=redis)\n\n\n\n\n\n\nNote\n\n\n\nRedis-OM (used internally) supports Redis Cluster mode. 
Ensure all nodes have the same configuration.\n\n\n\n\n\nS3 Storage Setup\n\nAWS S3\nfrom atdata.local import S3DataStore\n\n# Using environment credentials (recommended for AWS)\n# Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY\nstore = S3DataStore(\n bucket=\"my-atdata-bucket\",\n prefix=\"datasets/\",\n)\n\n# Explicit credentials\nstore = S3DataStore(\n bucket=\"my-atdata-bucket\",\n prefix=\"datasets/\",\n credentials={\n \"AWS_ACCESS_KEY_ID\": \"...\",\n \"AWS_SECRET_ACCESS_KEY\": \"...\",\n \"AWS_DEFAULT_REGION\": \"us-west-2\",\n },\n)\n\n\nS3-Compatible Storage (MinIO, Cloudflare R2, etc.)\nstore = S3DataStore(\n bucket=\"my-bucket\",\n prefix=\"datasets/\",\n endpoint_url=\"https://s3.example.com\",\n credentials={\n \"AWS_ACCESS_KEY_ID\": \"...\",\n \"AWS_SECRET_ACCESS_KEY\": \"...\",\n },\n)\n\n\nMinIO Deployment\n# Docker deployment\ndocker run -d \\\n --name minio \\\n -p 9000:9000 \\\n -p 9001:9001 \\\n -v minio-data:/data \\\n -e MINIO_ROOT_USER=minioadmin \\\n -e MINIO_ROOT_PASSWORD=minioadmin \\\n minio/minio server /data --console-address \":9001\"\nstore = S3DataStore(\n bucket=\"atdata\",\n endpoint_url=\"http://localhost:9000\",\n credentials={\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n)\n\n\n\nProduction Checklist\n\nRedis persistence enabled (appendonly yes)\nRedis password authentication configured\nRedis TLS enabled for remote connections\nS3 bucket access policies configured (least privilege)\nS3 bucket versioning enabled (for data recovery)\nMonitoring for Redis memory usage\nBackup strategy for Redis data", 2915 "crumbs": [ 2916 "Guide", 2917 "Reference", 2918 "Deployment Guide" 2919 ] 2920 }, 2921 { 2922 "objectID": "reference/deployment.html#atproto-deployment", 2923 "href": "reference/deployment.html#atproto-deployment", 2924 "title": "Deployment Guide", 2925 "section": "ATProto Deployment", 2926 "text": "ATProto Deployment\n\nAccount Setup\n\nCreate a Bluesky account or use your existing account\nGenerate an app-specific password at bsky.app/settings/app-passwords\nNever use your main account password in code\n\n\n\n\n\n\n\nWarning\n\n\n\nSecurity: Always use app passwords, never your main password. 
App passwords can be revoked without affecting your account.\n\n\n\n\nAuthentication Patterns\n\nEnvironment Variables (Recommended)\nimport os\nfrom atdata.atmosphere import AtmosphereClient\n\nclient = AtmosphereClient()\nclient.login(\n os.environ[\"ATPROTO_HANDLE\"],\n os.environ[\"ATPROTO_APP_PASSWORD\"],\n)\n\n\nSession Persistence\nFor long-running services, persist and reuse sessions:\nimport os\nfrom pathlib import Path\n\nSESSION_FILE = Path(\"~/.atdata/session\").expanduser()\n\nclient = AtmosphereClient()\n\nif SESSION_FILE.exists():\n # Restore existing session\n session_string = SESSION_FILE.read_text()\n try:\n client.login_with_session(session_string)\n except Exception:\n # Session expired, re-authenticate\n client.login(handle, app_password)\n SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)\n SESSION_FILE.write_text(client.export_session())\nelse:\n # Initial login\n client.login(handle, app_password)\n SESSION_FILE.parent.mkdir(parents=True, exist_ok=True)\n SESSION_FILE.write_text(client.export_session())\n\n\n\nCustom PDS Deployment\nFor self-hosted ATProto infrastructure:\nclient = AtmosphereClient(base_url=\"https://pds.example.com\")\nclient.login(\"handle.example.com\", \"app-password\")\nSee ATProto PDS documentation for self-hosting setup.\n\n\nRate Limiting Considerations\nATProto has rate limits. For bulk operations:\n\nSpace out record creation (1-2 per second for bulk uploads)\nUse batch operations where available\nImplement exponential backoff for retries\nConsider blob storage limits (~50MB per blob)\n\nimport time\n\nfor i, dataset in enumerate(datasets_to_publish):\n index.insert_dataset(dataset, name=f\"dataset-{i}\", ...)\n time.sleep(1) # Rate limiting", 2927 "crumbs": [ 2928 "Guide", 2929 "Reference", 2930 "Deployment Guide" 2931 ] 2932 }, 2933 { 2934 "objectID": "reference/deployment.html#docker-compose-example", 2935 "href": "reference/deployment.html#docker-compose-example", 2936 "title": "Deployment Guide", 2937 "section": "Docker Compose Example", 2938 "text": "Docker Compose Example\nComplete local deployment with Redis and MinIO:\n# docker-compose.yml\nversion: '3.8'\n\nservices:\n redis:\n image: redis:7-alpine\n command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}\n ports:\n - \"6379:6379\"\n volumes:\n - redis-data:/data\n\n minio:\n image: minio/minio\n command: server /data --console-address \":9001\"\n ports:\n - \"9000:9000\"\n - \"9001:9001\"\n environment:\n MINIO_ROOT_USER: ${MINIO_USER}\n MINIO_ROOT_PASSWORD: ${MINIO_PASSWORD}\n volumes:\n - minio-data:/data\n\nvolumes:\n redis-data:\n minio-data:\n# .env\nREDIS_PASSWORD=your-redis-password\nMINIO_USER=minioadmin\nMINIO_PASSWORD=your-minio-password", 2939 "crumbs": [ 2940 "Guide", 2941 "Reference", 2942 "Deployment Guide" 2943 ] 2944 }, 2945 { 2946 "objectID": "reference/deployment.html#monitoring", 2947 "href": "reference/deployment.html#monitoring", 2948 "title": "Deployment Guide", 2949 "section": "Monitoring", 2950 "text": "Monitoring\n\nRedis Metrics\nKey metrics to monitor:\n\nused_memory: Memory usage\nconnected_clients: Active connections\nkeyspace_hits/misses: Cache efficiency\naof_last_write_status: Persistence health\n\nredis-cli INFO | grep -E \"used_memory|connected_clients|keyspace\"\n\n\nS3 Metrics\n\nRequest counts and latency\nError rates (4xx, 5xx)\nStorage usage by prefix\nData transfer costs", 2951 "crumbs": [ 2952 "Guide", 2953 "Reference", 2954 "Deployment Guide" 2955 ] 2956 }, 2957 { 2958 "objectID": 
"reference/deployment.html#security-best-practices", 2959 "href": "reference/deployment.html#security-best-practices", 2960 "title": "Deployment Guide", 2961 "section": "Security Best Practices", 2962 "text": "Security Best Practices\n\nNetwork Isolation: Run Redis and S3 in private networks\nTLS Everywhere: Encrypt connections to Redis and S3\nCredential Rotation: Rotate API keys and passwords regularly\nAccess Logging: Enable S3 access logging for audit trails\nLeast Privilege: Use minimal IAM permissions for S3 access\n\n\nS3 IAM Policy Example\n{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Effect\": \"Allow\",\n \"Action\": [\n \"s3:GetObject\",\n \"s3:PutObject\",\n \"s3:ListBucket\"\n ],\n \"Resource\": [\n \"arn:aws:s3:::my-atdata-bucket\",\n \"arn:aws:s3:::my-atdata-bucket/*\"\n ]\n }\n ]\n}", 2963 "crumbs": [ 2964 "Guide", 2965 "Reference", 2966 "Deployment Guide" 2967 ] 2968 }, 2969 { 2970 "objectID": "reference/troubleshooting.html", 2971 "href": "reference/troubleshooting.html", 2972 "title": "Troubleshooting & FAQ", 2973 "section": "", 2974 "text": "This page covers common issues, error messages, and frequently asked questions when working with atdata.", 2975 "crumbs": [ 2976 "Guide", 2977 "Reference", 2978 "Troubleshooting & FAQ" 2979 ] 2980 }, 2981 { 2982 "objectID": "reference/troubleshooting.html#common-errors", 2983 "href": "reference/troubleshooting.html#common-errors", 2984 "title": "Troubleshooting & FAQ", 2985 "section": "Common Errors", 2986 "text": "Common Errors\n\nTypeError: ‘type’ object is not subscriptable\nError:\nTypeError: 'type' object is not subscriptable\nCause: Using Dataset or SampleBatch without subscripting the type parameter on Python < 3.9, or using an unsubscripted generic.\nSolution: Always use the subscripted form:\n# Correct\nds = Dataset[MySample](\"data.tar\")\nbatch = SampleBatch[MySample](samples)\n\n# Incorrect\nds = Dataset(\"data.tar\") # Missing type parameter\n\n\nAttributeError: ‘NoneType’ object has no attribute…\nError:\nAttributeError: 'NoneType' object has no attribute '__args__'\nCause: Creating a Dataset or SampleBatch without using the subscripted syntax Class[Type](...).\nSolution: These classes use Python’s __orig_class__ mechanism to extract type parameters at runtime. 
You must use:\nds = Dataset[MySample](url) # Correct\nNot:\nds = Dataset(url) # Wrong - no type information\n\n\nRuntimeError: msgpack field not found in sample\nError:\nRuntimeError: Malformed sample: 'msgpack' field not found\nCause: The tar file contains samples that weren’t written with atdata’s serialization format.\nSolution: Ensure samples are written using sample.as_wds:\nwith wds.writer.TarWriter(\"data.tar\") as sink:\n for sample in samples:\n sink.write(sample.as_wds) # Correct\n\n\nValueError: Field type not supported\nError:\nTypeError: Unsupported type for schema field: <class 'SomeType'>\nCause: Using an unsupported Python type in a PackableSample field.\nSupported types:\n\n\n\nPython Type\nNotes\n\n\n\n\nstr\nUnicode strings\n\n\nint\nIntegers\n\n\nfloat\nFloating point\n\n\nbool\nBoolean\n\n\nbytes\nBinary data\n\n\nNDArray\nNumpy arrays (any dtype)\n\n\nlist[T]\nLists of primitives\n\n\nT \\| None\nOptional fields\n\n\n\nNot supported: Nested dataclasses, dicts, custom classes.\n\n\nKeyError when iterating dataset\nError:\nKeyError: 'msgpack'\nCause: The WebDataset tar file structure doesn’t match expected format.\nSolution: Verify your tar file was created correctly:\n# Check tar contents\ntar -tvf data.tar | head -20\nEach sample should have a .msgpack extension in the tar file.", 2987 "crumbs": [ 2988 "Guide", 2989 "Reference", 2990 "Troubleshooting & FAQ" 2991 ] 2992 }, 2993 { 2994 "objectID": "reference/troubleshooting.html#faq", 2995 "href": "reference/troubleshooting.html#faq", 2996 "title": "Troubleshooting & FAQ", 2997 "section": "FAQ", 2998 "text": "FAQ\n\nHow do I check the sample type of a dataset?\nds = Dataset[MySample](\"data.tar\")\nprint(ds.sample_type) # <class 'MySample'>\n\n\nHow do I convert a dataset to a different type?\nUse the as_type() method with a registered lens:\n@atdata.lens\ndef my_lens(src: SourceType) -> TargetType:\n return TargetType(field=src.other_field)\n\nds_view = ds.as_type(TargetType)\n\n\nHow do I handle optional NDArray fields?\nUse NDArray | None annotation:\n@atdata.packable\nclass MySample:\n required_array: NDArray\n optional_array: NDArray | None = None\n\n\nWhy is my dataset iteration slow?\nCommon causes:\n\nNetwork latency: Use local caching for remote datasets\nSmall batch sizes: Increase batch_size in ordered() or shuffled()\nShuffle buffer: For shuffled(), the initial parameter controls buffer size\n\n# Larger batches = better throughput\nfor batch in ds.shuffled(batch_size=64, initial=1000):\n ...\n\n\nHow do I export to parquet?\nds = Dataset[MySample](\"data.tar\")\nds.to_parquet(\"output.parquet\")\n\n# With sample limit (for large datasets)\nds.to_parquet(\"output.parquet\", maxcount=10000)\n\n\n\n\n\n\nWarning\n\n\n\nto_parquet() loads the dataset into memory. 
For very large datasets, use maxcount to limit samples or process in chunks.\n\n\n\n\nHow do I handle multiple shards?\nUse WebDataset brace notation:\n# Single shard\nds = Dataset[MySample](\"data-000000.tar\")\n\n# Multiple shards (range)\nds = Dataset[MySample](\"data-{000000..000009}.tar\")\n\n# Multiple shards (list)\nds = Dataset[MySample](\"data-{000000,000005,000009}.tar\")\n\n\nCan I use S3 or other cloud storage?\nYes, use S3Source for S3-compatible storage:\nfrom atdata import S3Source, Dataset\n\nsource = S3Source.from_urls(\n [\"s3://bucket/data-000000.tar\", \"s3://bucket/data-000001.tar\"],\n endpoint_url=\"https://s3.example.com\", # Optional for non-AWS S3\n)\n\nds = Dataset[MySample](source)\n\n\nHow do I publish to ATProto/Atmosphere?\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\") # Use app password!\n\nindex = AtmosphereIndex(client)\n\n# Publish schema\nschema_uri = index.publish_schema(MySample, version=\"1.0.0\")\n\n# Publish dataset\nentry = index.insert_dataset(ds, name=\"my-dataset\", schema_ref=schema_uri)\n\n\nWhat’s the difference between LocalIndex and AtmosphereIndex?\n\n\n\nFeature\nLocalIndex\nAtmosphereIndex\n\n\n\n\nStorage\nRedis + S3\nATProto PDS\n\n\nDiscovery\nLocal only\nFederated network\n\n\nAuth\nNone required\nATProto account\n\n\nUse case\nDevelopment, private data\nPublic distribution\n\n\n\nBoth implement the AbstractIndex protocol, so code can work with either.", 2999 "crumbs": [ 3000 "Guide", 3001 "Reference", 3002 "Troubleshooting & FAQ" 3003 ] 3004 }, 3005 { 3006 "objectID": "reference/troubleshooting.html#getting-help", 3007 "href": "reference/troubleshooting.html#getting-help", 3008 "title": "Troubleshooting & FAQ", 3009 "section": "Getting Help", 3010 "text": "Getting Help\n\nGitHub Issues: github.com/your-org/atdata/issues\nDocumentation: Check the reference pages for detailed API documentation\nExamples: See the examples/ directory for working code samples", 3011 "crumbs": [ 3012 "Guide", 3013 "Reference", 3014 "Troubleshooting & FAQ" 3015 ] 3016 } 3017]