A loose federation of distributed, typed datasets

docs: add architecture overview and expand tutorials with design context

- Add reference/architecture.qmd with system design overview
- Expand quickstart, local-workflow, atmosphere, and promotion tutorials
- Expand index.qmd with architectural narrative
- Convert TODO comments to design notes in dataset.py
- Replace no-op `_ = None` with `pass` in _stub_manager.py
- Add tests for empty tar file iteration edge cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
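
One change above that benefits from a concrete picture is the empty-tar edge case (tracked as #393 in the CHANGELOG below). A minimal sketch of what such a test might look like, assuming pytest's tmp_path fixture and the atdata.Dataset[T] path-string constructor shown in the docs diff further down; the PointSample type and the test name are illustrative, not the actual code added in this commit:

    import tarfile

    import atdata


    @atdata.packable
    class PointSample:
        # Hypothetical sample type, defined only for this sketch.
        x: float
        y: float


    def test_dataset_iterates_empty_tar(tmp_path):
        # A structurally valid tar archive containing zero members.
        empty = tmp_path / "empty-000000.tar"
        with tarfile.open(empty, "w"):
            pass

        # Iteration should yield nothing rather than raise.
        dataset = atdata.Dataset[PointSample](str(empty))
        assert list(dataset) == []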

+4947 -1707
.chainlink/issues.db

This is a binary file and will not be displayed.

+12
CHANGELOG.md
···
25 25 - **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases
26 26
27 27 ### Changed
28 + - Expand Quarto documentation with architectural narrative (#395)
29 + - Expand atmosphere tutorial with federation context (#400)
30 + - Expand local-workflow tutorial with system narrative (#399)
31 + - Expand quickstart tutorial with design context (#398)
32 + - Expand index.qmd with architecture narrative (#397)
33 + - Add architecture overview page (reference/architecture.qmd) (#396)
34 + - Adversarial review: Post-PDSBlobStore comprehensive assessment (#389)
35 + - Remove deprecated shard_list property warnings if unused (#394)
36 + - Add test for Dataset iteration over empty tar file (#393)
37 + - Consolidate duplicate sample types in live atmosphere tests (#392)
38 + - Convert TODO comment in dataset.py to design note or remove (#391)
39 + - Remove redundant no-op statements in _stub_manager.py (#390)
28 40 - Update atmosphere example with blob storage case (#216)
29 41 - Implement PDSBlobStore for atmosphere data storage (#244)
30 42 - Update docs and examples to include PDSBlobStore (#384)
+4
docs/api/AbstractDataStore.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/AbstractIndex.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/AtUri.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/AtmosphereClient.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/AtmosphereIndex.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/AtmosphereIndexEntry.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/BlobSource.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/DataSource.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/Dataset.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/DatasetDict.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/DatasetLoader.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/DatasetPublisher.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/DictSample.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/IndexEntry.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/Lens.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/LensLoader.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/LensPublisher.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/PDSBlobStore.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/Packable-protocol.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/PackableSample.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/S3Source.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/SampleBatch.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/SchemaLoader.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/SchemaPublisher.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/URLSource.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/index.html
···
288 288 </a>
289 289 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
290 290 <li>
291 + <a class="dropdown-item" href="../reference/architecture.html">
292 + <span class="dropdown-text">Architecture Overview</span></a>
293 + </li>
294 + <li>
291 295 <a class="dropdown-item" href="../reference/packable-samples.html">
292 296 <span class="dropdown-text">Packable Samples</span></a>
293 297 </li>
+4
docs/api/load_dataset.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/local.Index.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/local.LocalDatasetEntry.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/local.S3DataStore.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/packable.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+4
docs/api/promote_to_atmosphere.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="../reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="../reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
+219 -102
docs/index.html
···
323 323 </a>
324 324 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference">
325 325 <li>
326 + <a class="dropdown-item" href="./reference/architecture.html">
327 + <span class="dropdown-text">Architecture Overview</span></a>
328 + </li>
329 + <li>
326 330 <a class="dropdown-item" href="./reference/packable-samples.html">
327 331 <span class="dropdown-text">Packable Samples</span></a>
328 332 </li>
···
455 459 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show">
456 460 <li class="sidebar-item">
457 461 <div class="sidebar-item-container">
462 + <a href="./reference/architecture.html" class="sidebar-item-text sidebar-link">
463 + <span class="menu-text">Architecture Overview</span></a>
464 + </div>
465 + </li>
466 + <li class="sidebar-item">
467 + <div class="sidebar-item-container">
458 468 <a href="./reference/packable-samples.html" class="sidebar-item-text sidebar-link">
459 469 <span class="menu-text">Packable Samples</span></a>
460 470 </div>
···
532 542
533 543 <ul>
534 544 <li><a href="#atdata" id="toc-atdata" class="nav-link active" data-scroll-target="#atdata">atdata</a></li>
545 + <li><a href="#the-challenge" id="toc-the-challenge" class="nav-link" data-scroll-target="#the-challenge">The Challenge</a></li>
535 546 <li><a href="#what-is-atdata" id="toc-what-is-atdata" class="nav-link" data-scroll-target="#what-is-atdata">What is atdata?</a></li>
547 + <li><a href="#the-architecture" id="toc-the-architecture" class="nav-link" data-scroll-target="#the-architecture">The Architecture</a></li>
536 548 <li><a href="#installation" id="toc-installation" class="nav-link" data-scroll-target="#installation">Installation</a></li>
537 549 <li><a href="#quick-example" id="toc-quick-example" class="nav-link" data-scroll-target="#quick-example">Quick Example</a>
538 550 <ul class="collapse">
539 - <li><a href="#define-a-sample-type" id="toc-define-a-sample-type" class="nav-link" data-scroll-target="#define-a-sample-type">Define a Sample Type</a></li>
540 - <li><a href="#create-and-write-samples" id="toc-create-and-write-samples" class="nav-link" data-scroll-target="#create-and-write-samples">Create and Write Samples</a></li>
541 - <li><a href="#load-and-iterate" id="toc-load-and-iterate" class="nav-link" data-scroll-target="#load-and-iterate">Load and Iterate</a></li>
551 + <li><a href="#define-a-sample-type" id="toc-define-a-sample-type" class="nav-link" data-scroll-target="#define-a-sample-type">1. Define a Sample Type</a></li>
552 + <li><a href="#create-and-write-samples" id="toc-create-and-write-samples" class="nav-link" data-scroll-target="#create-and-write-samples">2. Create and Write Samples</a></li>
553 + <li><a href="#load-and-iterate-with-type-safety" id="toc-load-and-iterate-with-type-safety" class="nav-link" data-scroll-target="#load-and-iterate-with-type-safety">3. Load and Iterate with Type Safety</a></li>
554 + </ul></li>
555 + <li><a href="#scaling-up" id="toc-scaling-up" class="nav-link" data-scroll-target="#scaling-up">Scaling Up</a>
556 + <ul class="collapse">
557 + <li><a href="#team-storage-with-redis-s3" id="toc-team-storage-with-redis-s3" class="nav-link" data-scroll-target="#team-storage-with-redis-s3">Team Storage with Redis + S3</a></li>
558 + <li><a href="#federation-with-atproto" id="toc-federation-with-atproto" class="nav-link" data-scroll-target="#federation-with-atproto">Federation with ATProto</a></li>
542 559 </ul></li>
543 560 <li><a href="#huggingface-style-loading" id="toc-huggingface-style-loading" class="nav-link" data-scroll-target="#huggingface-style-loading">HuggingFace-Style Loading</a></li>
544 - <li><a href="#local-storage-with-redis-s3" id="toc-local-storage-with-redis-s3" class="nav-link" data-scroll-target="#local-storage-with-redis-s3">Local Storage with Redis + S3</a></li>
545 - <li><a href="#publish-to-atproto-federation" id="toc-publish-to-atproto-federation" class="nav-link" data-scroll-target="#publish-to-atproto-federation">Publish to ATProto Federation</a></li>
561 + <li><a href="#why-atdata" id="toc-why-atdata" class="nav-link" data-scroll-target="#why-atdata">Why atdata?</a></li>
546 562 <li><a href="#next-steps" id="toc-next-steps" class="nav-link" data-scroll-target="#next-steps">Next Steps</a></li>
547 563 </ul>
548 564 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/index.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav>
···
576 592 <p>A loose federation of distributed, typed datasets built on WebDataset.</p>
577 593 <p><a href="./tutorials/quickstart.html" class="btn btn-primary btn-lg">Get Started</a> <a href="https://github.com/your-org/atdata" class="btn btn-outline-secondary btn-lg">View on GitHub</a></p>
578 594 </section>
595 + <section id="the-challenge" class="level2">
596 + <h2 class="anchored" data-anchor-id="the-challenge">The Challenge</h2>
597 + <p>Machine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:</p>
598 + <ul>
599 + <li><strong>Runtime surprises</strong>: Discovering a field is missing or has the wrong type during training</li>
600 + <li><strong>Copy-paste schemas</strong>: Redefining the same sample structure across notebooks and scripts</li>
601 + <li><strong>Storage silos</strong>: Data stuck in one location, invisible to collaborators</li>
602 + <li><strong>Discovery friction</strong>: No standard way to find datasets across teams or organizations</li>
603 + </ul>
604 + <p>atdata solves these problems with a simple idea: <strong>typed, serializable samples</strong> that flow seamlessly from local development to team storage to federated sharing.</p>
605 + </section>
579 606 <section id="what-is-atdata" class="level2">
580 607 <h2 class="anchored" data-anchor-id="what-is-atdata">What is atdata?</h2>
581 - <p>atdata provides a typed dataset abstraction for machine learning workflows with:</p>
608 + <p>atdata is a Python library that combines:</p>
582 609 <div class="feature-cards">
583 610 <section id="typed-samples" class="level3 feature-card">
584 611 <h3 class="anchored" data-anchor-id="typed-samples">Typed Samples</h3>
585 - <p>Define dataclass-based sample types with automatic msgpack serialization.</p>
612 + <p>Define dataclass-based sample types with automatic msgpack serialization. Catch schema errors at definition time, not training time.</p>
586 613 </section>
587 - <section id="ndarray-handling" class="level3 feature-card">
588 - <h3 class="anchored" data-anchor-id="ndarray-handling">NDArray Handling</h3>
589 - <p>Transparent numpy array conversion with efficient byte serialization.</p>
614 + <section id="efficient-storage" class="level3 feature-card">
615 + <h3 class="anchored" data-anchor-id="efficient-storage">Efficient Storage</h3>
616 + <p>Built on WebDataset’s proven tar-based format. Stream large datasets without downloading everything first.</p>
590 617 </section>
591 618 <section id="lens-transformations" class="level3 feature-card">
592 619 <h3 class="anchored" data-anchor-id="lens-transformations">Lens Transformations</h3>
593 - <p>View datasets through different schemas without duplicating data.</p>
620 + <p>View datasets through different schemas without duplicating data. Perfect for feature extraction, schema migration, and multi-task learning.</p>
594 621 </section>
595 622 <section id="batch-aggregation" class="level3 feature-card">
596 623 <h3 class="anchored" data-anchor-id="batch-aggregation">Batch Aggregation</h3>
597 - <p>Automatic numpy stacking for NDArray fields during iteration.</p>
624 + <p>Automatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train.</p>
598 625 </section>
599 - <section id="webdataset-integration" class="level3 feature-card">
600 - <h3 class="anchored" data-anchor-id="webdataset-integration">WebDataset Integration</h3>
601 - <p>Efficient large-scale storage with streaming tar file support.</p>
626 + <section id="team-storage" class="level3 feature-card">
627 + <h3 class="anchored" data-anchor-id="team-storage">Team Storage</h3>
628 + <p>Redis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery.</p>
602 629 </section>
603 630 <section id="atproto-federation" class="level3 feature-card">
604 631 <h3 class="anchored" data-anchor-id="atproto-federation">ATProto Federation</h3>
605 - <p>Publish and discover datasets on the decentralized AT Protocol network.</p>
632 + <p>Publish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure.</p>
606 633 </section>
607 634 </div>
608 635 </section>
636 + <section id="the-architecture" class="level2">
637 + <h2 class="anchored" data-anchor-id="the-architecture">The Architecture</h2>
638 + <p>atdata provides a three-layer progression for your datasets:</p>
639 + <pre><code>┌─────────────────────────────────────────────────────────────┐
640 + │ Federation: ATProto Atmosphere │
641 + │ Decentralized discovery, cross-org sharing │
642 + └─────────────────────────────────────────────────────────────┘
643 + ↑ promote
644 + ┌─────────────────────────────────────────────────────────────┐
645 + │ Team Storage: Redis + S3 │
646 + │ Shared index, versioned schemas, S3 data │
647 + └─────────────────────────────────────────────────────────────┘
648 + ↑ insert
649 + ┌─────────────────────────────────────────────────────────────┐
650 + │ Local Development │
651 + │ Typed samples, WebDataset files, fast iteration │
652 + └─────────────────────────────────────────────────────────────┘</code></pre>
653 + <p>Start local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs.</p>
654 + </section>
609 655 <section id="installation" class="level2">
610 656 <h2 class="anchored" data-anchor-id="installation">Installation</h2>
611 657 <div class="install-box">
612 - <div class="sourceCode" id="cb1"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata</span>
613 - <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
614 - <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="co"># With ATProto support</span></span>
615 - <span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata<span class="pp">[</span><span class="ss">atmosphere</span><span class="pp">]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
658 + <div class="sourceCode" id="cb2"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata</span>
659 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
660 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co"># With ATProto support</span></span>
661 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata<span class="pp">[</span><span class="ss">atmosphere</span><span class="pp">]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
616 662 </div>
617 663 </section>
618 664 <section id="quick-example" class="level2">
619 665 <h2 class="anchored" data-anchor-id="quick-example">Quick Example</h2>
620 666 <section id="define-a-sample-type" class="level3">
621 - <h3 class="anchored" data-anchor-id="define-a-sample-type">Define a Sample Type</h3>
622 - <div id="e139d8b7" class="cell">
623 - <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
624 - <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span>
625 - <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
626 - <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a></span>
627 - <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span>
628 - <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span>
629 - <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a> image: NDArray</span>
630 - <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span>
631 - <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
667 + <h3 class="anchored" data-anchor-id="define-a-sample-type">1. Define a Sample Type</h3>
668 + <p>The <code>@packable</code> decorator creates a serializable dataclass:</p>
669 + <div id="387a656e" class="cell">
670 + <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
671 + <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span>
672 + <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
673 + <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a></span>
674 + <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span>
675 + <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span>
676 + <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a> image: NDArray <span class="co"># Automatically handled as bytes</span></span>
677 + <span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span>
678 + <span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
632 679 </div>
633 680 </section>
634 681 <section id="create-and-write-samples" class="level3">
635 - <h3 class="anchored" data-anchor-id="create-and-write-samples">Create and Write Samples</h3>
636 - <div id="080b7c8f" class="cell">
637 - <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span>
638 - <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span>
639 - <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span>
640 - <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> ImageSample(</span>
641 - <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span>
642 - <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span>
643 - <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span><span class="fl">0.95</span>,</span>
644 - <span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a> )</span>
645 - <span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> _ <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>)</span>
646 - <span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a>]</span>
647 - <span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a></span>
648 - <span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data-000000.tar"</span>) <span class="im">as</span> sink:</span>
649 - <span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i, sample <span class="kw">in</span> <span class="bu">enumerate</span>(samples):</span>
650 - <span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a> sink.write({<span class="op">**</span>sample.as_wds, <span class="st">"__key__"</span>: <span class="ss">f"sample_</span><span class="sc">{</span>i<span class="sc">:06d}</span><span class="ss">"</span>})</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
682 + <h3 class="anchored" data-anchor-id="create-and-write-samples">2. Create and Write Samples</h3>
683 + <p>Use WebDataset’s standard TarWriter:</p>
684 + <div id="e31de370" class="cell">
685 + <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span>
686 + <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
687 + <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span>
688 + <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a> ImageSample(</span>
689 + <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span>
690 + <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span>
691 + <span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span><span class="fl">0.95</span>,</span>
692 + <span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a> )</span>
693 + <span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> _ <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>)</span>
694 + <span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a>]</span>
695 + <span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a></span>
696 + <span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data-000000.tar"</span>) <span class="im">as</span> sink:</span>
697 + <span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i, sample <span class="kw">in</span> <span class="bu">enumerate</span>(samples):</span>
698 + <span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a> sink.write({<span class="op">**</span>sample.as_wds, <span class="st">"__key__"</span>: <span class="ss">f"sample_</span><span class="sc">{</span>i<span class="sc">:06d}</span><span class="ss">"</span>})</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
651 699 </div>
652 700 </section>
653 - <section id="load-and-iterate" class="level3">
654 - <h3 class="anchored" data-anchor-id="load-and-iterate">Load and Iterate</h3>
655 - <div id="960dad1e" class="cell">
656 - <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-000000.tar"</span>)</span>
657 - <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
658 - <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate with batching</span></span>
659 - <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span>
660 - <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># numpy array (32, 224, 224, 3)</span></span>
661 - <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label <span class="co"># list of 32 strings</span></span>
662 - <span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a> confs <span class="op">=</span> batch.confidence <span class="co"># list of 32 floats</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
701 + <section id="load-and-iterate-with-type-safety" class="level3">
702 + <h3 class="anchored" data-anchor-id="load-and-iterate-with-type-safety">3. Load and Iterate with Type Safety</h3>
703 + <p>The generic <code>Dataset[T]</code> provides typed access:</p>
704 + <div id="d035a8cb" class="cell">
705 + <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-000000.tar"</span>)</span>
706 + <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
707 + <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span>
708 + <span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># numpy array (32, 224, 224, 3)</span></span>
709 + <span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label <span class="co"># list of 32 strings</span></span>
710 + <span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> confs <span class="op">=</span> batch.confidence <span class="co"># list of 32 floats</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
663 711 </div>
664 712 </section>
665 713 </section>
666 - <section id="huggingface-style-loading" class="level2">
667 - <h2 class="anchored" data-anchor-id="huggingface-style-loading">HuggingFace-Style Loading</h2>
668 - <div id="f199e76c" class="cell">
669 - <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local path</span></span>
670 - <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> atdata.load_dataset(<span class="st">"path/to/data-{000000..000009}.tar"</span>, split<span class="op">=</span><span class="st">"train"</span>)</span>
671 - <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span>
672 - <span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Load with split detection</span></span>
673 - <span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> atdata.load_dataset(<span class="st">"path/to/data/"</span>)</span>
674 - <span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>train_ds <span class="op">=</span> ds_dict[<span class="st">"train"</span>]</span>
675 - <span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>test_ds <span class="op">=</span> ds_dict[<span class="st">"test"</span>]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
676 - </div>
677 - </section>
678 - <section id="local-storage-with-redis-s3" class="level2">
679 - <h2 class="anchored" data-anchor-id="local-storage-with-redis-s3">Local Storage with Redis + S3</h2>
680 - <div id="4513b884" class="cell">
714 + <section id="scaling-up" class="level2">
715 + <h2 class="anchored" data-anchor-id="scaling-up">Scaling Up</h2>
716 + <section id="team-storage-with-redis-s3" class="level3">
717 + <h3 class="anchored" data-anchor-id="team-storage-with-redis-s3">Team Storage with Redis + S3</h3>
718 + <p>When you’re ready to share with your team:</p>
719 + <div id="5691aefd" class="cell">
681 720 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span>
682 - <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span>
683 - <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span>
684 - <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Create samples and write to local tar</span></span>
685 - <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data.tar"</span>) <span class="im">as</span> sink:</span>
686 - <span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i, sample <span class="kw">in</span> <span class="bu">enumerate</span>(samples):</span>
687 - <span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> sink.write({<span class="op">**</span>sample.as_wds, <span class="st">"__key__"</span>: <span class="ss">f"</span><span class="sc">{</span>i<span class="sc">:06d}</span><span class="ss">"</span>})</span>
688 - <span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a></span>
689 - <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="co"># Set up index with S3 data store</span></span>
690 - <span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span>
691 - <span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a> credentials<span class="op">=</span>{<span class="st">"AWS_ENDPOINT"</span>: <span class="st">"http://localhost:9000"</span>, ...},</span>
692 - <span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a> bucket<span class="op">=</span><span class="st">"my-bucket"</span>,</span>
693 - <span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a>)</span>
694 - <span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store) <span class="co"># Connects to Redis</span></span>
695 - <span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a></span>
696 - <span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert dataset (writes to S3, indexes in Redis)</span></span>
697 - <span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data.tar"</span>)</span>
698 - <span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"my-dataset"</span>)</span>
699 - <span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Stored at: </span><span class="sc">{</span>entry<span class="sc">.</span>data_urls<span class="sc">}</span><span class="ss">"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
721 + <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
722 + <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to team infrastructure</span></span>
723 + <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span>
724 + <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a> credentials<span class="op">=</span>{<span class="st">"AWS_ENDPOINT"</span>: <span class="st">"http://localhost:9000"</span>, ...},</span>
725 + <span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a> bucket<span class="op">=</span><span class="st">"team-datasets"</span>,</span>
726 + <span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>)</span>
727 + <span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span>
728 + <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a></span>
729 + <span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish schema for consistency</span></span>
730 + <span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a>index.publish_schema(ImageSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span>
731 + <span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a></span>
732 + <span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert dataset (writes to S3, indexes in Redis)</span></span>
733 + <span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data.tar"</span>)</span>
734 + <span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"training-images-v1"</span>)</span>
735 + <span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a></span>
736 + <span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a><span class="co"># Team members can now discover and load</span></span>
737 + <span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a><span class="co"># ds = atdata.load_dataset("@local/training-images-v1", index=index)</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
700 738 </div>
701 739 </section>
702 - <section id="publish-to-atproto-federation" class="level2">
703 - <h2 class="anchored" data-anchor-id="publish-to-atproto-federation">Publish to ATProto Federation</h2>
704 - <div id="8e79a745" class="cell">
705 - <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span>
740 + <section id="federation-with-atproto" class="level3">
741 + <h3 class="anchored" data-anchor-id="federation-with-atproto">Federation with ATProto</h3>
742 + <p>For public or cross-organization sharing:</p>
743 + <div id="97baea4b" class="cell">
744 + <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span>
706 745 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span>
707 746 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span>
708 - <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Authenticate</span></span>
747 + <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Authenticate with your ATProto identity</span></span>
709 748 <span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span>
710 749 <span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span>
711 750 <span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a></span>
712 - <span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Promote local dataset to federation</span></span>
713 - <span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.get_dataset(<span class="st">"my-dataset"</span>)</span>
751 + <span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Option 1: Promote existing local dataset</span></span>
752 + <span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.get_dataset(<span class="st">"training-images-v1"</span>)</span>
714 753 <span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(entry, index, client)</span>
715 - <span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Published at: </span><span class="sc">{</span>at_uri<span class="sc">}</span><span class="ss">"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
754 + <span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a></span>
755 + <span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Option 2: Publish directly with blob storage</span></span>
756 + <span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span>
757 + <span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a>atm_index <span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span>
758 + <span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a>atm_index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"public-images"</span>, schema_ref<span class="op">=</span>schema_uri)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
716 759 </div>
717 760 </section>
761 + </section>
762 + <section id="huggingface-style-loading" class="level2">
763 + <h2 class="anchored" data-anchor-id="huggingface-style-loading">HuggingFace-Style Loading</h2>
764 + <p>For convenient access to datasets:</p>
765 + <div id="a4e6c68a" class="cell">
766 + <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span>
767 + <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span>
768 + <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local files</span></span>
769 + <span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"path/to/data-{000000..000009}.tar"</span>)</span>
770 + <span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a></span>
771 + <span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a><span class="co"># Load with split detection</span></span>
772 + <span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data/"</span>)</span>
773 + <span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>train_ds <span class="op">=</span> ds_dict[<span class="st">"train"</span>]</span>
774 + <span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>test_ds <span class="op">=</span> ds_dict[<span class="st">"test"</span>]</span>
775 + <span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a></span>
776 + <span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from index</span></span>
777 + <span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"@local/my-dataset"</span>, index<span class="op">=</span>index)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
778 + </div>
779 + </section>
780 + <section id="why-atdata" class="level2">
781 + <h2 class="anchored" data-anchor-id="why-atdata">Why atdata?</h2>
782 + <table class="caption-top table">
783 + <colgroup>
784 + <col style="width: 37%">
785 + <col style="width: 62%">
786 + </colgroup>
787 + <thead>
788 + <tr class="header">
789 + <th>Need</th>
790 + <th>Solution</th>
791 + </tr>
792 + </thead>
793 + <tbody>
794 + <tr class="odd">
795 + <td>Type-safe samples</td>
796 + <td><code>@packable</code> decorator, <code>PackableSample</code> base class</td>
797 + </tr>
798 + <tr class="even">
799 + <td>Efficient large-scale storage</td>
800 + <td>WebDataset tar format, streaming iteration</td>
801 + </tr>
802 + <tr class="odd">
803 + <td>Schema flexibility</td>
804 + <td>Lens transformations, <code>DictSample</code> for exploration</td>
805 + </tr>
806 + <tr class="even">
807 + <td>Team collaboration</td>
808 + <td>Redis index, S3 data store, schema registry</td>
809 + </tr>
810 + <tr class="odd">
811 + <td>Public sharing</td>
812 + <td>ATProto federation, content-addressable CIDs</td>
813 + </tr>
814 + <tr class="even">
815 + <td>Multiple backends</td>
816 + <td>Protocol abstractions (<code>AbstractIndex</code>, <code>DataSource</code>)</td>
817 + </tr>
818 + </tbody>
819 + </table>
820 + </section>
718 821 <section id="next-steps" class="level2">
719 822 <h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2>
823 + <div class="callout callout-style-default callout-tip callout-titled">
824 + <div class="callout-header d-flex align-content-center">
825 + <div class="callout-icon-container">
826 + <i class="callout-icon"></i>
827 + </div>
828 + <div class="callout-title-container flex-fill">
829 + Getting Started
830 + </div>
831 + </div>
832 + <div class="callout-body-container callout-body">
833 + <p><strong>New to atdata?</strong> Start with the <a href="./tutorials/quickstart.html">Quick Start Tutorial</a> to learn the basics of typed samples and datasets.</p>
834 + </div>
835 + </div>
720 836 <ul>
721 - <li><strong><a href="./tutorials/quickstart.html">Quick Start Tutorial</a></strong> - Get up and running in 5 minutes</li>
722 - <li><strong><a href="./reference/packable-samples.html">Packable Samples</a></strong> - Learn about typed sample definitions</li>
723 - <li><strong><a href="./reference/datasets.html">Datasets</a></strong> - Master dataset iteration and batching</li>
724 - <li><strong><a href="./reference/atmosphere.html">Atmosphere</a></strong> - Publish to the ATProto federation</li>
837 + <li><strong><a href="./reference/architecture.html">Architecture Overview</a></strong> - Understand the design and how components fit together</li>
838 + <li><strong><a href="./tutorials/local-workflow.html">Local Workflow</a></strong> - Set up team storage with Redis + S3</li>
839 + <li><strong><a href="./tutorials/atmosphere.html">Atmosphere Publishing</a></strong> - Share datasets on the ATProto network</li>
840 + <li><strong><a href="./reference/packable-samples.html">Packable Samples</a></strong> - Deep dive into sample type definitions</li>
841 + <li><strong><a href="./reference/datasets.html">Datasets</a></strong> - Master iteration, batching, and transformations</li>
725 842 </ul>
726 843
727 844
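
Of the feature cards expanded above, Lens Transformations is the only one the new landing page still leaves without a code sample. As a conceptual sketch only, since the actual Lens API is documented in docs/api/Lens.html rather than in this diff, the schema-to-schema mapping a lens encapsulates might look like the following; the LabelOnly type and the plain mapping function are assumptions, not the library's real interface:

    import atdata
    from numpy.typing import NDArray


    @atdata.packable
    class ImageSample:
        image: NDArray
        label: str
        confidence: float


    @atdata.packable
    class LabelOnly:
        # A narrower schema over the same records (hypothetical).
        label: str


    def drop_pixels(sample: ImageSample) -> LabelOnly:
        # The conversion a lens would encapsulate: nothing is copied or
        # rewritten on disk; samples are converted as they stream past,
        # so the same tar shards can back both schemas.
        return LabelOnly(label=sample.label)

Registering such a mapping is what would let a consumer view the shards behind Dataset[ImageSample] as Dataset[LabelOnly] without materializing a second copy, which is also what the "Schema flexibility" row of the new comparison table points at.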
+1435
docs/reference/architecture.html
··· [Generated Quarto page scaffolding omitted: <head> metadata for "Architecture Overview – atdata" (description: "Understanding the design and components of atdata"), syntax-highlighting CSS, dark/light theme-toggle and search scripts, navbar, sidebar navigation, and the on-page table of contents.]
587 + <main class="content" id="quarto-document-content">
590 + <header id="title-block-header" class="quarto-title-block default">
591 + <div class="quarto-title">
592 + <h1 class="title">Architecture Overview</h1>
593 + </div>
595 + <div>
596 + <div class="description">
597 + Understanding the design and components of atdata
598 + </div>
599 + </div>
611 + </header>
614 + <p>atdata is designed around a simple but powerful idea: <strong>typed, serializable samples</strong> that can flow seamlessly between local development, team storage, and a federated network. This page explains the architectural decisions and how the components work together.</p>
615 + <section id="design-philosophy" class="level2">
616 + <h2 class="anchored" data-anchor-id="design-philosophy">Design Philosophy</h2>
617 + <section id="the-problem" class="level3">
618 + <h3 class="anchored" data-anchor-id="the-problem">The Problem</h3>
619 + <p>Machine learning workflows involve datasets at every stage—training data, validation sets, embeddings, features, and model outputs.
These datasets are often:</p> 620 + <ul> 621 + <li><strong>Untyped</strong>: Raw files with implicit schemas, leading to runtime errors</li> 622 + <li><strong>Siloed</strong>: Stuck in one location (local disk, team bucket, or cloud storage)</li> 623 + <li><strong>Undiscoverable</strong>: No standard way to find and share datasets across teams or organizations</li> 624 + </ul> 625 + </section> 626 + <section id="the-solution" class="level3"> 627 + <h3 class="anchored" data-anchor-id="the-solution">The Solution</h3> 628 + <p>atdata provides a three-layer architecture that addresses each problem:</p> 629 + <pre><code>┌─────────────────────────────────────────────────────────────┐ 630 + │ Layer 3: Federation (ATProto Atmosphere) │ 631 + │ - Decentralized discovery and sharing │ 632 + │ - Content-addressable identifiers │ 633 + │ - Cross-organization dataset federation │ 634 + └─────────────────────────────────────────────────────────────┘ 635 + 636 + Promotion 637 + 638 + ┌─────────────────────────────────────────────────────────────┐ 639 + │ Layer 2: Team Storage (Redis + S3) │ 640 + │ - Shared index for team discovery │ 641 + │ - Scalable object storage for data │ 642 + │ - Schema registry for type consistency │ 643 + └─────────────────────────────────────────────────────────────┘ 644 + 645 + Insert 646 + 647 + ┌─────────────────────────────────────────────────────────────┐ 648 + │ Layer 1: Local Development │ 649 + │ - Typed samples with automatic serialization │ 650 + │ - WebDataset tar files for efficient storage │ 651 + │ - Lens transformations for schema flexibility │ 652 + └─────────────────────────────────────────────────────────────┘</code></pre> 653 + </section> 654 + </section> 655 + <section id="core-components" class="level2"> 656 + <h2 class="anchored" data-anchor-id="core-components">Core Components</h2> 657 + <section id="packablesample-the-foundation" class="level3"> 658 + <h3 class="anchored" data-anchor-id="packablesample-the-foundation">PackableSample: The Foundation</h3> 659 + <p>Everything in atdata starts with <strong>PackableSample</strong>—a base class that makes Python dataclasses serializable with msgpack:</p> 660 + <div id="1d6c7714" class="cell"> 661 + <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 662 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 663 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> image: NDArray <span class="co"># Automatically converted to/from bytes</span></span> 664 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span> <span class="co"># Standard msgpack serialization</span></span> 665 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 666 + </div> 667 + <p>Key features:</p> 668 + <ul> 669 + <li><strong>Automatic NDArray handling</strong>: Numpy arrays are serialized efficiently</li> 670 + <li><strong>Type safety</strong>: Field types are preserved and validated</li> 671 + <li><strong>Round-trip fidelity</strong>: Serialize → deserialize always produces identical data</li> 672 + </ul> 673 + <p>The <code>@packable</code> 
decorator is syntactic sugar that:</p> 674 + <ol type="1"> 675 + <li>Converts your class to a dataclass</li> 676 + <li>Adds <code>PackableSample</code> as a base class</li> 677 + <li>Registers a lens from <code>DictSample</code> for flexible loading</li> 678 + </ol> 679 + </section> 680 + <section id="dataset-typed-iteration" class="level3"> 681 + <h3 class="anchored" data-anchor-id="dataset-typed-iteration">Dataset: Typed Iteration</h3> 682 + <p>The <code>Dataset[T]</code> class wraps WebDataset tar archives with type information:</p> 683 + <div id="750a8ebc" class="cell"> 684 + <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 685 + <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 686 + <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 687 + <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># Stacked NDArray: (32, H, W, C)</span></span> 688 + <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label <span class="co"># List of 32 strings</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 689 + </div> 690 + <p><strong>Why WebDataset?</strong></p> 691 + <p>WebDataset is a battle-tested format for large-scale ML training:</p> 692 + <ul> 693 + <li><strong>Streaming</strong>: No need to download entire datasets</li> 694 + <li><strong>Sharding</strong>: Data split across multiple tar files for parallelism</li> 695 + <li><strong>Shuffling</strong>: Two-level shuffling (shard + sample) for training</li> 696 + </ul> 697 + <p>atdata adds:</p> 698 + <ul> 699 + <li><strong>Type safety</strong>: Know the schema at compile time</li> 700 + <li><strong>Batch aggregation</strong>: NDArrays are automatically stacked</li> 701 + <li><strong>Lens transformations</strong>: View data through different schemas</li> 702 + </ul> 703 + </section> 704 + <section id="samplebatch-automatic-aggregation" class="level3"> 705 + <h3 class="anchored" data-anchor-id="samplebatch-automatic-aggregation">SampleBatch: Automatic Aggregation</h3> 706 + <p>When iterating with <code>batch_size</code>, atdata returns <code>SampleBatch[T]</code> objects that aggregate sample attributes:</p> 707 + <div id="36aa2bc8" class="cell"> 708 + <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>batch <span class="op">=</span> SampleBatch[ImageSample](samples)</span> 709 + <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 710 + <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># NDArray fields → stacked numpy array with batch dimension</span></span> 711 + <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>batch.image.shape <span class="co"># (batch_size, H, W, C)</span></span> 712 + <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a></span> 
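<span id="cb4-5a"><a href="#cb4-5a" aria-hidden="true" tabindex="-1"></a><span class="co"># (the stack above is effectively np.stack([s.image for s in samples]))</span></span>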
713 + <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="co"># Other fields → list</span></span> 714 + <span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a>batch.label <span class="co"># ["cat", "dog", "bird", ...]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 715 + </div> 716 + <p>This eliminates boilerplate collation code and works automatically for any <code>PackableSample</code> type.</p> 717 + </section> 718 + <section id="lens-schema-transformations" class="level3"> 719 + <h3 class="anchored" data-anchor-id="lens-schema-transformations">Lens: Schema Transformations</h3> 720 + <p>Lenses enable viewing datasets through different schemas without duplicating data:</p> 721 + <div id="a5b534a8" class="cell"> 722 + <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 723 + <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 724 + <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 725 + <span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span> 726 + <span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 727 + <span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify(src: ImageSample) <span class="op">-&gt;</span> SimplifiedSample:</span> 728 + <span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimplifiedSample(label<span class="op">=</span>src.label)</span> 729 + <span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a></span> 730 + <span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a><span class="co"># View dataset through simplified schema</span></span> 731 + <span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a>simple_ds <span class="op">=</span> dataset.as_type(SimplifiedSample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 732 + </div> 733 + <p><strong>When to use lenses:</strong></p> 734 + <ul> 735 + <li><strong>Reducing fields</strong>: Drop unnecessary data for specific tasks</li> 736 + <li><strong>Transforming data</strong>: Compute derived fields on-the-fly</li> 737 + <li><strong>Schema migration</strong>: Handle version differences between datasets</li> 738 + </ul> 739 + <p>Lenses are registered globally in a <code>LensNetwork</code>, enabling automatic discovery of transformation paths.</p> 740 + </section> 741 + </section> 742 + <section id="storage-backends" class="level2"> 743 + <h2 class="anchored" data-anchor-id="storage-backends">Storage Backends</h2> 744 + <section id="local-index-redis-s3" class="level3"> 745 + <h3 class="anchored" data-anchor-id="local-index-redis-s3">Local Index (Redis + S3)</h3> 746 + <p>For team-scale usage, atdata provides a two-component storage system:</p> 747 + <p><strong>Redis Index</strong>: Stores metadata and enables fast lookups</p> 748 + <ul> 749 + <li>Dataset entries (name, schema, URLs, metadata)</li> 750 + <li>Schema registry (type definitions)</li> 751 + <li>CID-based content addressing</li> 752 + </ul> 
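<p>Conceptually, a dataset entry bundles just enough metadata to locate and decode the data. A rough sketch of that shape (the field names here are illustrative, not the exact Redis layout):</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"># Illustrative sketch only: what an index entry conceptually carries.
entry = {
    "name": "training-v1",            # human-readable handle
    "schema_ref": "bafy...",          # CID of the schema definition
    "data_urls": [                    # where the WebDataset shards live
        "s3://team-datasets/training-v1/data-000000.tar",
    ],
    "metadata": {"created_by": "alice"},
}</code></pre></div>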
753 + <p><strong>S3 DataStore</strong>: Stores actual data files</p> 754 + <ul> 755 + <li>WebDataset tar shards</li> 756 + <li>Any S3-compatible storage (AWS, MinIO, Cloudflare R2)</li> 757 + </ul> 758 + <div id="8792eb75" class="cell"> 759 + <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>creds, bucket<span class="op">=</span><span class="st">"datasets"</span>)</span> 760 + <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> 761 + <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> 762 + <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert dataset: writes to S3, indexes in Redis</span></span> 763 + <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"training-v1"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 764 + </div> 765 + <p><strong>Why this split?</strong></p> 766 + <ul> 767 + <li><strong>Separation of concerns</strong>: Metadata queries don’t touch data storage</li> 768 + <li><strong>Flexibility</strong>: Use any S3-compatible storage</li> 769 + <li><strong>Scalability</strong>: Redis handles high-throughput lookups; S3 handles large files</li> 770 + </ul> 771 + </section> 772 + <section id="atmosphere-index-atproto" class="level3"> 773 + <h3 class="anchored" data-anchor-id="atmosphere-index-atproto">Atmosphere Index (ATProto)</h3> 774 + <p>For public or cross-organization sharing, atdata integrates with the AT Protocol:</p> 775 + <p><strong>ATProto PDS</strong>: Your Personal Data Server stores records</p> 776 + <ul> 777 + <li>Schema definitions</li> 778 + <li>Dataset index records</li> 779 + <li>Lens transformation records</li> 780 + </ul> 781 + <p><strong>PDSBlobStore</strong>: Optional blob storage on your PDS</p> 782 + <ul> 783 + <li>Store actual data shards as ATProto blobs</li> 784 + <li>Fully decentralized—no external dependencies</li> 785 + </ul> 786 + <div id="a5b8403d" class="cell"> 787 + <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 788 + <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 789 + <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> 790 + <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> 791 + <span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span> 792 + <span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a></span> 793 + <span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish: creates ATProto records, uploads blobs</span></span> 794 + <span id="cb7-8"><a 
href="#cb7-8" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"public-features"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 795 + </div> 796 + </section> 797 + </section> 798 + <section id="protocol-abstractions" class="level2"> 799 + <h2 class="anchored" data-anchor-id="protocol-abstractions">Protocol Abstractions</h2> 800 + <p>atdata uses <strong>protocols</strong> (structural typing) to enable backend interoperability:</p> 801 + <section id="abstractindex" class="level3"> 802 + <h3 class="anchored" data-anchor-id="abstractindex">AbstractIndex</h3> 803 + <p>Common interface for both <code>LocalIndex</code> and <code>AtmosphereIndex</code>:</p> 804 + <div id="53fdc4fb" class="cell"> 805 + <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_dataset(index: AbstractIndex, name: <span class="bu">str</span>):</span> 806 + <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> entry <span class="op">=</span> index.get_dataset(name)</span> 807 + <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> schema <span class="op">=</span> index.decode_schema(entry.schema_ref)</span> 808 + <span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a> <span class="co"># Works with either LocalIndex or AtmosphereIndex</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 809 + </div> 810 + <p>Key methods:</p> 811 + <ul> 812 + <li><code>insert_dataset()</code> / <code>get_dataset()</code>: Dataset CRUD</li> 813 + <li><code>publish_schema()</code> / <code>decode_schema()</code>: Schema management</li> 814 + <li><code>list_datasets()</code> / <code>list_schemas()</code>: Discovery</li> 815 + </ul> 816 + </section> 817 + <section id="abstractdatastore" class="level3"> 818 + <h3 class="anchored" data-anchor-id="abstractdatastore">AbstractDataStore</h3> 819 + <p>Common interface for <code>S3DataStore</code> and <code>PDSBlobStore</code>:</p> 820 + <div id="76134918" class="cell"> 821 + <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_to_store(store: AbstractDataStore, dataset: Dataset):</span> 822 + <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> urls <span class="op">=</span> store.write_shards(dataset, prefix<span class="op">=</span><span class="st">"data/v1"</span>)</span> 823 + <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># Works with S3 or PDS blob storage</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 824 + </div> 825 + </section> 826 + <section id="datasource" class="level3"> 827 + <h3 class="anchored" data-anchor-id="datasource">DataSource</h3> 828 + <p>Common interface for data streaming:</p> 829 + <ul> 830 + <li><code>URLSource</code>: WebDataset-compatible URLs</li> 831 + <li><code>S3Source</code>: S3 with explicit credentials</li> 832 + <li><code>BlobSource</code>: ATProto PDS blobs</li> 833 + </ul> 834 + </section> 835 + </section> 836 + 
<section id="data-flow-local-to-federation" class="level2"> 837 + <h2 class="anchored" data-anchor-id="data-flow-local-to-federation">Data Flow: Local to Federation</h2> 838 + <p>A typical workflow progresses through three stages:</p> 839 + <section id="stage-1-local-development" class="level3"> 840 + <h3 class="anchored" data-anchor-id="stage-1-local-development">Stage 1: Local Development</h3> 841 + <div id="9ea69426" class="cell"> 842 + <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define type and create samples</span></span> 843 + <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 844 + <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MySample:</span> 845 + <span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a> features: NDArray</span> 846 + <span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 847 + <span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a></span> 848 + <span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Write to local tar</span></span> 849 + <span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data.tar"</span>) <span class="im">as</span> sink:</span> 850 + <span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> sample <span class="kw">in</span> samples:</span> 851 + <span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a> sink.write(sample.as_wds)</span> 852 + <span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a></span> 853 + <span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate locally</span></span> 854 + <span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[MySample](<span class="st">"data.tar"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 855 + </div> 856 + </section> 857 + <section id="stage-2-team-storage" class="level3"> 858 + <h3 class="anchored" data-anchor-id="stage-2-team-storage">Stage 2: Team Storage</h3> 859 + <div id="c454dd0d" class="cell"> 860 + <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Set up team storage</span></span> 861 + <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>team_creds, bucket<span class="op">=</span><span class="st">"team-datasets"</span>)</span> 862 + <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> 863 + <span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a></span> 864 + <span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish schema and insert</span></span> 865 + <span id="cb11-6"><a href="#cb11-6" 
aria-hidden="true" tabindex="-1"></a>index.publish_schema(MySample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 866 + <span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"my-features"</span>)</span> 867 + <span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a></span> 868 + <span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a><span class="co"># Team members can now load via index</span></span> 869 + <span id="cb11-10"><a href="#cb11-10" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"@local/my-features"</span>, index<span class="op">=</span>index)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 870 + </div> 871 + </section> 872 + <section id="stage-3-federation" class="level3"> 873 + <h3 class="anchored" data-anchor-id="stage-3-federation">Stage 3: Federation</h3> 874 + <div id="9dfedd67" class="cell"> 875 + <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Promote to atmosphere</span></span> 876 + <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 877 + <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 878 + <span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a></span> 879 + <span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(entry, index, client)</span> 880 + <span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a></span> 881 + <span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Anyone can now discover and load</span></span> 882 + <span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="co"># ds = load_dataset("@handle.bsky.social/my-features")</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 883 + </div> 884 + </section> 885 + </section> 886 + <section id="content-addressing" class="level2"> 887 + <h2 class="anchored" data-anchor-id="content-addressing">Content Addressing</h2> 888 + <p>atdata uses <strong>CIDs</strong> (Content Identifiers) for content-addressable storage:</p> 889 + <ul> 890 + <li><strong>Schema CIDs</strong>: Hash of schema definition</li> 891 + <li><strong>Entry CIDs</strong>: Hash of (schema_ref, data_urls)</li> 892 + <li><strong>Blob CIDs</strong>: Hash of data content</li> 893 + </ul> 894 + <p>Benefits:</p> 895 + <ul> 896 + <li><strong>Deduplication</strong>: Identical content has identical CID</li> 897 + <li><strong>Integrity</strong>: Verify data matches expected hash</li> 898 + <li><strong>ATProto compatibility</strong>: CIDs are native to the AT Protocol</li> 899 + </ul> 900 + </section> 901 + <section id="extension-points" class="level2"> 902 + <h2 class="anchored" data-anchor-id="extension-points">Extension Points</h2> 903 + <p>atdata is designed for extensibility:</p> 904 + <section id="custom-datasources" class="level3"> 905 + <h3 
class="anchored" data-anchor-id="custom-datasources">Custom DataSources</h3> 906 + <p>Implement the <code>DataSource</code> protocol to add new storage backends:</p> 907 + <div id="e8e5bb8b" class="cell"> 908 + <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MyCustomSource:</span> 909 + <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> list_shards(<span class="va">self</span>) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]: ...</span> 910 + <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> open_shard(<span class="va">self</span>, shard_id: <span class="bu">str</span>) <span class="op">-&gt;</span> IO[<span class="bu">bytes</span>]: ...</span> 911 + <span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a></span> 912 + <span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a> <span class="at">@property</span></span> 913 + <span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> shards(<span class="va">self</span>) <span class="op">-&gt;</span> Iterator[<span class="bu">tuple</span>[<span class="bu">str</span>, IO[<span class="bu">bytes</span>]]]: ...</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 914 + </div> 915 + </section> 916 + <section id="custom-lenses" class="level3"> 917 + <h3 class="anchored" data-anchor-id="custom-lenses">Custom Lenses</h3> 918 + <p>Register transformations between any PackableSample types:</p> 919 + <div id="d20b964d" class="cell"> 920 + <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 921 + <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> my_transform(src: SourceType) <span class="op">-&gt;</span> TargetType:</span> 922 + <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> TargetType(...)</span> 923 + <span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a></span> 924 + <span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a><span class="at">@my_transform.putter</span></span> 925 + <span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> my_transform_put(view: TargetType, src: SourceType) <span class="op">-&gt;</span> SourceType:</span> 926 + <span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SourceType(...)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 927 + </div> 928 + </section> 929 + <section id="schema-extensions" class="level3"> 930 + <h3 class="anchored" data-anchor-id="schema-extensions">Schema Extensions</h3> 931 + <p>The schema format supports custom metadata for domain-specific needs:</p> 932 + <div id="c00ad4da" class="cell"> 933 + <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" 
tabindex="-1"></a>index.publish_schema(</span> 934 + <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> MySample,</span> 935 + <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> version<span class="op">=</span><span class="st">"1.0.0"</span>,</span> 936 + <span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a> metadata<span class="op">=</span>{<span class="st">"domain"</span>: <span class="st">"chemistry"</span>, <span class="st">"units"</span>: <span class="st">"mol/L"</span>},</span> 937 + <span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 938 + </div> 939 + </section> 940 + </section> 941 + <section id="summary" class="level2"> 942 + <h2 class="anchored" data-anchor-id="summary">Summary</h2> 943 + <table class="caption-top table"> 944 + <colgroup> 945 + <col style="width: 33%"> 946 + <col style="width: 27%"> 947 + <col style="width: 39%"> 948 + </colgroup> 949 + <thead> 950 + <tr class="header"> 951 + <th>Component</th> 952 + <th>Purpose</th> 953 + <th>Key Classes</th> 954 + </tr> 955 + </thead> 956 + <tbody> 957 + <tr class="odd"> 958 + <td><strong>Samples</strong></td> 959 + <td>Typed, serializable data</td> 960 + <td><code>PackableSample</code>, <code>@packable</code></td> 961 + </tr> 962 + <tr class="even"> 963 + <td><strong>Datasets</strong></td> 964 + <td>Typed iteration over WebDataset</td> 965 + <td><code>Dataset[T]</code>, <code>SampleBatch[T]</code></td> 966 + </tr> 967 + <tr class="odd"> 968 + <td><strong>Lenses</strong></td> 969 + <td>Schema transformations</td> 970 + <td><code>Lens</code>, <code>@lens</code>, <code>LensNetwork</code></td> 971 + </tr> 972 + <tr class="even"> 973 + <td><strong>Local Storage</strong></td> 974 + <td>Team-scale index + data</td> 975 + <td><code>LocalIndex</code>, <code>S3DataStore</code></td> 976 + </tr> 977 + <tr class="odd"> 978 + <td><strong>Atmosphere</strong></td> 979 + <td>Federated sharing</td> 980 + <td><code>AtmosphereIndex</code>, <code>PDSBlobStore</code></td> 981 + </tr> 982 + <tr class="even"> 983 + <td><strong>Protocols</strong></td> 984 + <td>Backend abstraction</td> 985 + <td><code>AbstractIndex</code>, <code>AbstractDataStore</code>, <code>DataSource</code></td> 986 + </tr> 987 + </tbody> 988 + </table> 989 + <p>The architecture enables a smooth progression from local experimentation to team collaboration to public federation, all while maintaining type safety and efficient data handling.</p> 990 + </section> 991 + <section id="related" class="level2"> 992 + <h2 class="anchored" data-anchor-id="related">Related</h2> 993 + <ul> 994 + <li><a href="../reference/packable-samples.html">Packable Samples</a> - Defining sample types</li> 995 + <li><a href="../reference/datasets.html">Datasets</a> - Dataset iteration and batching</li> 996 + <li><a href="../reference/local-storage.html">Local Storage</a> - Redis + S3 backend</li> 997 + <li><a href="../reference/atmosphere.html">Atmosphere</a> - ATProto federation</li> 998 + <li><a href="../reference/protocols.html">Protocols</a> - Abstract interfaces</li> 999 + </ul> 1000 + 1001 + 1002 + </section> 1003 + 1004 + </main> <!-- /main --> 1005 + <script id="quarto-html-after-body" type="application/javascript"> 1006 + window.document.addEventListener("DOMContentLoaded", function (event) { 1007 + // Ensure there is a toggle, if there isn't float one in the top right 1008 + if 
1417 + </div> <!-- /content --> 1418 + <footer class="footer"> 1419 + <div class="nav-footer"> 1420 + <div class="nav-footer-left"> 1421 + <p>Built with <a href="https://quarto.org/">Quarto</a></p> 1422 + </div> 1423 + <div class="nav-footer-center"> 1424 + &nbsp; 1425 + <div class="toc-actions d-sm-block d-md-none"><ul><li><a href="https://github.com/your-org/atdata/edit/main/reference/architecture.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></div> 1426 + <div class="nav-footer-right"> 1427 + <p>MIT License</p> 1428 + </div> 1429 + </div> 1430 + </footer> 1431 + 1432 + 1433 + 1434 + 1435 + </body></html>
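The Content Addressing section of the architecture page above describes CIDs and their benefits in prose only. A minimal sketch of the underlying idea, using a bare SHA-256 digest as a stand-in; the actual multiformat CID encoding (multicodec + multihash) that atdata uses is not shown on the page and is not reproduced here:

```python
# Illustrative content addressing with a plain SHA-256 digest.
# Real CIDs are multiformat values; "sha256:<hex>" is a stand-in, not
# atdata's encoding.
import hashlib

def toy_cid(content: bytes) -> str:
    return "sha256:" + hashlib.sha256(content).hexdigest()

shard_a = b"example shard bytes"
shard_b = b"example shard bytes"

# Deduplication: identical content yields an identical identifier.
assert toy_cid(shard_a) == toy_cid(shard_b)

# Integrity: re-hashing fetched bytes verifies they match the expected CID.
expected = toy_cid(shard_a)
assert toy_cid(shard_a) == expected
```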
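Likewise, the Custom DataSources section above shows only the protocol's method signatures. A hypothetical implementation written against that signature, serving shards from a local directory; `DirectorySource` and its `*.tar` layout are assumptions for illustration, not part of atdata:

```python
# Hypothetical DataSource implementation: shards as tar files in a directory.
from pathlib import Path
from typing import IO, Iterator

class DirectorySource:
    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def list_shards(self) -> list[str]:
        # Stable ordering so iteration is reproducible.
        return sorted(p.name for p in self._root.glob("*.tar"))

    def open_shard(self, shard_id: str) -> IO[bytes]:
        return open(self._root / shard_id, "rb")

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        for shard_id in self.list_shards():
            yield shard_id, self.open_shard(shard_id)
```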
+34 -24
docs/reference/atmosphere.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/atmosphere.html">Atmosphere (ATProto Integration)</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/atmosphere.html">Atmosphere (ATProto Integration)</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 573 583 <main class="content" id="quarto-document-content"> 574 584 575 585 576 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/atmosphere.html">Atmosphere (ATProto Integration)</a></li></ol></nav> 586 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/atmosphere.html">Atmosphere (ATProto Integration)</a></li></ol></nav> 577 587 <div class="quarto-title"> 578 588 <h1 class="title">Atmosphere (ATProto Integration)</h1> 579 589 </div> ··· 616 626 <section id="atmosphereclient" class="level2"> 617 627 <h2 class="anchored" data-anchor-id="atmosphereclient">AtmosphereClient</h2> 618 628 <p>The client handles authentication and record 
operations:</p> 619 - <div id="aaaea216" class="cell"> 629 + <div id="27b96b63" class="cell"> 620 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 621 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 622 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 643 653 <section id="session-management" class="level3"> 644 654 <h3 class="anchored" data-anchor-id="session-management">Session Management</h3> 645 655 <p>Save and restore sessions to avoid re-authentication:</p> 646 - <div id="8f1cfb03" class="cell"> 656 + <div id="51285d9a" class="cell"> 647 657 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export session for later</span></span> 648 658 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>session_string <span class="op">=</span> client.export_session()</span> 649 659 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 655 665 <section id="custom-pds" class="level3"> 656 666 <h3 class="anchored" data-anchor-id="custom-pds">Custom PDS</h3> 657 667 <p>Connect to a custom PDS instead of bsky.social:</p> 658 - <div id="458da4e2" class="cell"> 668 + <div id="5ff21d58" class="cell"> 659 669 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient(base_url<span class="op">=</span><span class="st">"https://pds.example.com"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 660 670 </div> 661 671 </section> ··· 663 673 <section id="pdsblobstore" class="level2"> 664 674 <h2 class="anchored" data-anchor-id="pdsblobstore">PDSBlobStore</h2> 665 675 <p>Store dataset shards as ATProto blobs for fully decentralized storage:</p> 666 - <div id="3223b941" class="cell"> 676 + <div id="e07fa7d1" class="cell"> 667 677 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, PDSBlobStore</span> 668 678 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 669 679 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 686 696 <section id="size-limits" class="level3"> 687 697 <h3 class="anchored" data-anchor-id="size-limits">Size Limits</h3> 688 698 <p>PDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use <code>maxcount</code> and <code>maxsize</code> parameters to control shard sizes:</p> 689 - <div id="d0ac7ae7" class="cell"> 699 + <div id="3b40ac87" class="cell"> 690 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 691 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 692 702 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> prefix<span class="op">=</span><span class="st">"large-data/v1"</span>,</span> ··· 699 709 <section id="blobsource" class="level2"> 700 710 <h2 class="anchored" data-anchor-id="blobsource">BlobSource</h2> 701 711 <p>Read datasets stored as PDS blobs:</p> 702 - <div id="1501abd5" class="cell"> 712 + <div id="fabb6a95" class="cell"> 703 713 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 704 714 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 705 715 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From blob references</span></span> ··· 720 730 <section id="atmosphereindex" class="level2"> 721 731 <h2 class="anchored" data-anchor-id="atmosphereindex">AtmosphereIndex</h2> 722 732 <p>The unified interface for ATProto operations, implementing the AbstractIndex protocol:</p> 723 - <div id="7a57a26c" class="cell"> 733 + <div id="25a6a7ea" class="cell"> 724 734 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span> 725 735 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 726 736 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 735 745 </div> 736 746 <section id="publishing-schemas" class="level3"> 737 747 <h3 class="anchored" data-anchor-id="publishing-schemas">Publishing Schemas</h3> 738 - <div id="87415c7d" class="cell"> 748 + <div id="fa209d4e" class="cell"> 739 749 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 740 750 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 741 751 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 756 766 </section> 757 767 <section id="publishing-datasets" class="level3"> 758 768 <h3 class="anchored" data-anchor-id="publishing-datasets">Publishing Datasets</h3> 759 - <div id="fe8a4a31" class="cell"> 769 + <div id="2263b2a7" class="cell"> 760 770 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> 
atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 761 771 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 762 772 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> ··· 774 784 </section> 775 785 <section id="listing-and-retrieving" class="level3"> 776 786 <h3 class="anchored" data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 777 - <div id="f5d05245" class="cell"> 787 + <div id="29a96ee6" class="cell"> 778 788 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># List your datasets</span></span> 779 789 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.list_datasets():</span> 780 790 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 800 810 <p>For more control, use the individual publisher classes:</p> 801 811 <section id="schemapublisher" class="level3"> 802 812 <h3 class="anchored" data-anchor-id="schemapublisher">SchemaPublisher</h3> 803 - <div id="1d16a999" class="cell"> 813 + <div id="d22262e0" class="cell"> 804 814 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaPublisher</span> 805 815 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span> 806 816 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> SchemaPublisher(client)</span> ··· 816 826 </section> 817 827 <section id="datasetpublisher" class="level3"> 818 828 <h3 class="anchored" data-anchor-id="datasetpublisher">DatasetPublisher</h3> 819 - <div id="bce70b2f" class="cell"> 829 + <div id="cb68b12c" class="cell"> 820 830 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetPublisher</span> 821 831 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a></span> 822 832 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> DatasetPublisher(client)</span> ··· 836 846 <p>There are two approaches to storing data as ATProto blobs:</p> 837 847 <p><strong>Approach 1: PDSBlobStore (Recommended)</strong></p> 838 848 <p>Use <code>PDSBlobStore</code> with <code>AtmosphereIndex</code> for automatic shard management:</p> 839 - <div id="38be7db2" class="cell"> 849 + <div id="c02c14c5" class="cell"> 840 850 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span 
class="im">from</span> atdata.atmosphere <span class="im">import</span> PDSBlobStore, AtmosphereIndex</span> 841 851 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a></span> 842 852 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> ··· 855 865 </div> 856 866 <p><strong>Approach 2: Manual Blob Publishing</strong></p> 857 867 <p>For more control, use <code>DatasetPublisher.publish_with_blobs()</code> directly:</p> 858 - <div id="60bbdb6f" class="cell"> 868 + <div id="da360a3c" class="cell"> 859 869 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> io</span> 860 870 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 861 871 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a></span> ··· 875 885 <span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 876 886 </div> 877 887 <p><strong>Loading Blob-Stored Datasets</strong></p> 878 - <div id="98785dee" class="cell"> 888 + <div id="21624628" class="cell"> 879 889 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 880 890 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 881 891 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a></span> ··· 899 909 </section> 900 910 <section id="lenspublisher" class="level3"> 901 911 <h3 class="anchored" data-anchor-id="lenspublisher">LensPublisher</h3> 902 - <div id="c6332a11" class="cell"> 912 + <div id="dcbf5ff3" class="cell"> 903 913 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> LensPublisher</span> 904 914 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a></span> 905 915 <span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> LensPublisher(client)</span> ··· 942 952 <p>For direct access to records, use the loader classes:</p> 943 953 <section id="schemaloader" class="level3"> 944 954 <h3 class="anchored" data-anchor-id="schemaloader">SchemaLoader</h3> 945 - <div id="c8e9a8ec" class="cell"> 955 + <div id="92c02f3f" class="cell"> 946 956 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaLoader</span> 947 957 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span> 948 958 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>loader <span 
class="op">=</span> SchemaLoader(client)</span> ··· 958 968 </section> 959 969 <section id="datasetloader" class="level3"> 960 970 <h3 class="anchored" data-anchor-id="datasetloader">DatasetLoader</h3> 961 - <div id="b663bd4f" class="cell"> 971 + <div id="f6464268" class="cell"> 962 972 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 963 973 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span> 964 974 <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> DatasetLoader(client)</span> ··· 986 996 </section> 987 997 <section id="lensloader" class="level3"> 988 998 <h3 class="anchored" data-anchor-id="lensloader">LensLoader</h3> 989 - <div id="479fac54" class="cell"> 999 + <div id="76e6eff4" class="cell"> 990 1000 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> LensLoader</span> 991 1001 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span> 992 1002 <span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> LensLoader(client)</span> ··· 1011 1021 <section id="at-uris" class="level2"> 1012 1022 <h2 class="anchored" data-anchor-id="at-uris">AT URIs</h2> 1013 1023 <p>ATProto records are identified by AT URIs:</p> 1014 - <div id="8b50cb03" class="cell"> 1024 + <div id="8e669d61" class="cell"> 1015 1025 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtUri</span> 1016 1026 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a></span> 1017 1027 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Parse an AT URI</span></span> ··· 1078 1088 <section id="complete-example" class="level2"> 1079 1089 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 1080 1090 <p>This example shows the full workflow using <code>PDSBlobStore</code> for decentralized storage:</p> 1081 - <div id="e67be904" class="cell"> 1091 + <div id="31a6584f" class="cell"> 1082 1092 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 1083 1093 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 1084 1094 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 1149 1159 <span id="cb22-68"><a href="#cb22-68" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 1150 1160 </div> 1151 1161 <p>For external URL storage 
(without <code>PDSBlobStore</code>):</p> 1152 - <div id="58ed78ba" class="cell"> 1162 + <div id="8e56d29e" class="cell"> 1153 1163 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Use AtmosphereIndex without data_store</span></span> 1154 1164 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client)</span> 1155 1165 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a></span>
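The AT URIs section in the page above parses identifiers with atdata's `AtUri` class. As plain-Python orientation for the `at://<authority>/<collection>/<rkey>` shape those records use (illustrative string handling only; the example DID, collection, and record key are made up):

```python
# at://<authority>/<collection>/<rkey>; use atdata.atmosphere.AtUri in practice.
def parse_at_uri(uri: str) -> tuple[str, str, str]:
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    authority, collection, rkey = uri[len(prefix):].split("/", 2)
    return authority, collection, rkey

did, collection, rkey = parse_at_uri(
    "at://did:plc:example123/com.example.dataset/3kabc"
)
```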
+25 -15
docs/reference/datasets.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/datasets.html">Datasets</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/datasets.html">Datasets</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 566 576 <main class="content" id="quarto-document-content"> 567 577 568 578 569 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/datasets.html">Datasets</a></li></ol></nav> 579 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/datasets.html">Datasets</a></li></ol></nav> 570 580 <div class="quarto-title"> 571 581 <h1 class="title">Datasets</h1> 572 582 </div> ··· 593 603 <p>The <code>Dataset</code> class provides typed iteration over WebDataset tar files with automatic batching and lens transformations.</p> 594 604 <section id="creating-a-dataset" class="level2"> 595 605 <h2 class="anchored" data-anchor-id="creating-a-dataset">Creating a Dataset</h2> 596 - <div id="a18fd27c" 
class="cell"> 606 + <div id="4eca6e9f" class="cell"> 597 607 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 598 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 599 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 616 626 <section id="url-source-default" class="level3"> 617 627 <h3 class="anchored" data-anchor-id="url-source-default">URL Source (default)</h3> 618 628 <p>When you pass a string to <code>Dataset</code>, it automatically wraps it in a <code>URLSource</code>:</p> 619 - <div id="9bdf912a" class="cell"> 629 + <div id="b7d503f8" class="cell"> 620 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># These are equivalent:</span></span> 621 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 622 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](atdata.URLSource(<span class="st">"data-{000000..000009}.tar"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 625 635 <section id="s3-source" class="level3"> 626 636 <h3 class="anchored" data-anchor-id="s3-source">S3 Source</h3> 627 637 <p>For private S3 buckets or S3-compatible storage (Cloudflare R2, MinIO), use <code>S3Source</code>:</p> 628 - <div id="acb40c51" class="cell"> 638 + <div id="50404edf" class="cell"> 629 639 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># From explicit credentials</span></span> 630 640 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> atdata.S3Source(</span> 631 641 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> bucket<span class="op">=</span><span class="st">"my-bucket"</span>,</span> ··· 663 673 <section id="ordered-iteration" class="level3"> 664 674 <h3 class="anchored" data-anchor-id="ordered-iteration">Ordered Iteration</h3> 665 675 <p>Iterate through samples in their original order:</p> 666 - <div id="3c6db39b" class="cell"> 676 + <div id="76ce4aad" class="cell"> 667 677 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># With batching (default batch_size=1)</span></span> 668 678 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.ordered(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 669 679 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># numpy array (32, H, W, C)</span></span> ··· 677 687 <section 
id="shuffled-iteration" class="level3"> 678 688 <h3 class="anchored" data-anchor-id="shuffled-iteration">Shuffled Iteration</h3> 679 689 <p>Iterate with randomized order at both shard and sample levels:</p> 680 - <div id="e4b30717" class="cell"> 690 + <div id="4d78e922" class="cell"> 681 691 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 682 692 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled</span></span> 683 693 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> process(batch)</span> ··· 708 718 <section id="samplebatch" class="level2"> 709 719 <h2 class="anchored" data-anchor-id="samplebatch">SampleBatch</h2> 710 720 <p>When iterating with a <code>batch_size</code>, each iteration yields a <code>SampleBatch</code> with automatic attribute aggregation.</p> 711 - <div id="dddda579" class="cell"> 721 + <div id="d15cc252" class="cell"> 712 722 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 713 723 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> Sample:</span> 714 724 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># shape (256,)</span></span> ··· 728 738 <section id="type-transformations-with-lenses" class="level2"> 729 739 <h2 class="anchored" data-anchor-id="type-transformations-with-lenses">Type Transformations with Lenses</h2> 730 740 <p>View a dataset through a different sample type using registered lenses:</p> 731 - <div id="bcb900f4" class="cell"> 741 + <div id="5141d751" class="cell"> 732 742 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 733 743 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 734 744 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> ··· 750 760 <section id="shard-list" class="level3"> 751 761 <h3 class="anchored" data-anchor-id="shard-list">Shard List</h3> 752 762 <p>Get the list of individual tar files:</p> 753 - <div id="5819836e" class="cell"> 763 + <div id="43085fc4" class="cell"> 754 764 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 755 765 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>shards <span class="op">=</span> dataset.shard_list</span> 756 766 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># ['data-000000.tar', 'data-000001.tar', ..., 'data-000009.tar']</span></span></code><button title="Copy to Clipboard" 
class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 759 769 <section id="metadata" class="level3"> 760 770 <h3 class="anchored" data-anchor-id="metadata">Metadata</h3> 761 771 <p>Datasets can have associated metadata from a URL:</p> 762 - <div id="d4dc6651" class="cell"> 772 + <div id="c3ff9553" class="cell"> 763 773 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](</span> 764 774 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"data-{000000..000009}.tar"</span>,</span> 765 775 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> metadata_url<span class="op">=</span><span class="st">"https://example.com/metadata.msgpack"</span></span> ··· 773 783 <section id="writing-datasets" class="level2"> 774 784 <h2 class="anchored" data-anchor-id="writing-datasets">Writing Datasets</h2> 775 785 <p>Use WebDataset’s <code>TarWriter</code> or <code>ShardWriter</code> to create datasets:</p> 776 - <div id="0ebfbcf7" class="cell"> 786 + <div id="2975692e" class="cell"> 777 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 778 788 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 779 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 796 806 <section id="parquet-export" class="level2"> 797 807 <h2 class="anchored" data-anchor-id="parquet-export">Parquet Export</h2> 798 808 <p>Export dataset contents to parquet format:</p> 799 - <div id="369a7e40" class="cell"> 809 + <div id="4cd9768b" class="cell"> 800 810 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export entire dataset</span></span> 801 811 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset.to_parquet(<span class="st">"output.parquet"</span>)</span> 802 812 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span> ··· 847 857 <section id="source" class="level3"> 848 858 <h3 class="anchored" data-anchor-id="source">Source</h3> 849 859 <p>Access the underlying <code>DataSource</code>:</p> 850 - <div id="4fce3c8d" class="cell"> 860 + <div id="9d3ef77e" class="cell"> 851 861 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data.tar"</span>)</span> 852 862 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> dataset.source <span class="co"># URLSource instance</span></span> 853 863 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(source.shard_list) <span class="co"># ['data.tar']</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i 
class="bi"></i></button></pre></div> ··· 856 866 <section id="sample-type" class="level3"> 857 867 <h3 class="anchored" data-anchor-id="sample-type">Sample Type</h3> 858 868 <p>Get the type parameter used to create the dataset:</p> 859 - <div id="67e26931" class="cell"> 869 + <div id="476a436b" class="cell"> 860 870 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data.tar"</span>)</span> 861 871 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.sample_type) <span class="co"># &lt;class 'ImageSample'&gt;</span></span> 862 872 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.batch_type) <span class="co"># SampleBatch[ImageSample]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+12 -2
docs/reference/deployment.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/deployment.html">Deployment Guide</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/deployment.html">Deployment Guide</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 562 572 <main class="content" id="quarto-document-content"> 563 573 564 574 565 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/deployment.html">Deployment Guide</a></li></ol></nav> 575 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/deployment.html">Deployment Guide</a></li></ol></nav> 566 576 <div class="quarto-title"> 567 577 <h1 class="title">Deployment Guide</h1> 568 578 </div>
+22 -12
docs/reference/lenses.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/lenses.html">Lenses</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/lenses.html">Lenses</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 549 559 <main class="content" id="quarto-document-content"> 550 560 551 561 552 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/lenses.html">Lenses</a></li></ol></nav> 562 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/lenses.html">Lenses</a></li></ol></nav> 553 563 <div class="quarto-title"> 554 564 <h1 class="title">Lenses</h1> 555 565 </div> ··· 585 595 <section id="creating-a-lens" class="level2"> 586 596 <h2 class="anchored" data-anchor-id="creating-a-lens">Creating a Lens</h2> 587 597 <p>Use the <code>@lens</code> decorator to define a getter:</p> 588 - <div id="8e469fe8" class="cell"> 598 + <div id="f050b984" class="cell"> 589 599 <div class="sourceCode cell-code" id="cb1"><pre 
class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 590 600 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 591 601 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 615 625 <section id="adding-a-putter" class="level2"> 616 626 <h2 class="anchored" data-anchor-id="adding-a-putter">Adding a Putter</h2> 617 627 <p>To enable bidirectional updates, add a putter:</p> 618 - <div id="23772595" class="cell"> 628 + <div id="aa1d9fcc" class="cell"> 619 629 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@simplify.putter</span></span> 620 630 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify_put(view: SimpleSample, source: FullSample) <span class="op">-&gt;</span> FullSample:</span> 621 631 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> FullSample(</span> ··· 635 645 <section id="using-lenses-with-datasets" class="level2"> 636 646 <h2 class="anchored" data-anchor-id="using-lenses-with-datasets">Using Lenses with Datasets</h2> 637 647 <p>Lenses integrate with <code>Dataset.as_type()</code>:</p> 638 - <div id="3fc1a303" class="cell"> 648 + <div id="4416cc21" class="cell"> 639 649 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[FullSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 640 650 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 641 651 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="co"># View through a different type</span></span> ··· 650 660 <section id="direct-lens-usage" class="level2"> 651 661 <h2 class="anchored" data-anchor-id="direct-lens-usage">Direct Lens Usage</h2> 652 662 <p>Lenses can also be called directly:</p> 653 - <div id="8ada7a56" class="cell"> 663 + <div id="b709ace4" class="cell"> 654 664 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 655 665 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 656 666 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>full <span class="op">=</span> FullSample(</span> ··· 679 689 <div class="tab-content"> 680 690 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 681 691 <p>If you get a view and immediately put it back, the source is unchanged:</p> 682 - <div id="6ac5163e" class="cell"> 692 + <div id="8159732e" class="cell"> 683 693 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>view <span class="op">=</span> lens.get(source)</span> 684 694 <span id="cb5-2"><a href="#cb5-2" 
aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.put(view, source) <span class="op">==</span> source</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 685 695 </div> 686 696 </div> 687 697 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 688 698 <p>If you put a view, getting it back yields that view:</p> 689 - <div id="86e70b00" class="cell"> 699 + <div id="7b0ac2e8" class="cell"> 690 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>updated <span class="op">=</span> lens.put(view, source)</span> 691 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.get(updated) <span class="op">==</span> view</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 692 702 </div> 693 703 </div> 694 704 <div id="tabset-1-3" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-3-tab"> 695 705 <p>Putting twice is equivalent to putting once with the final value:</p> 696 - <div id="545cbcf1" class="cell"> 706 + <div id="c9f3015d" class="cell"> 697 707 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>result1 <span class="op">=</span> lens.put(v2, lens.put(v1, source))</span> 698 708 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>result2 <span class="op">=</span> lens.put(v2, source)</span> 699 709 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> result1 <span class="op">==</span> result2</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 705 715 <section id="trivial-putter" class="level2"> 706 716 <h2 class="anchored" data-anchor-id="trivial-putter">Trivial Putter</h2> 707 717 <p>If no putter is defined, a trivial putter is used that ignores view updates:</p> 708 - <div id="0636b6ac" class="cell"> 718 + <div id="3f274fda" class="cell"> 709 719 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 710 720 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> extract_label(src: FullSample) <span class="op">-&gt;</span> SimpleSample:</span> 711 721 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimpleSample(label<span class="op">=</span>src.label, confidence<span class="op">=</span>src.confidence)</span> ··· 719 729 <section id="lensnetwork-registry" class="level2"> 720 730 <h2 class="anchored" data-anchor-id="lensnetwork-registry">LensNetwork Registry</h2> 721 731 <p>The <code>LensNetwork</code> is a singleton that stores all registered lenses:</p> 722 - <div id="c3b32f7d" class="cell"> 732 + <div id="2d560eec" class="cell"> 723 733 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.lens <span 
class="im">import</span> LensNetwork</span> 724 734 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 725 735 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>network <span class="op">=</span> LensNetwork()</span> ··· 736 746 </section> 737 747 <section id="example-feature-extraction" class="level2"> 738 748 <h2 class="anchored" data-anchor-id="example-feature-extraction">Example: Feature Extraction</h2> 739 - <div id="a4bc199f" class="cell"> 749 + <div id="f641a209" class="cell"> 740 750 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 741 751 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> RawSample:</span> 742 752 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> audio: NDArray</span>
+24 -14
docs/reference/load-dataset.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/load-dataset.html">load_dataset API</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/load-dataset.html">load_dataset API</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 557 567 <main class="content" id="quarto-document-content"> 558 568 559 569 560 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/load-dataset.html">load_dataset API</a></li></ol></nav> 570 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/load-dataset.html">load_dataset API</a></li></ol></nav> 561 571 <div class="quarto-title"> 562 572 <h1 class="title">load_dataset API</h1> 563 573 </div> ··· 594 604 </section> 595 605 <section id="basic-usage" class="level2"> 596 606 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 597 - <div id="f2161b26" class="cell"> 607 + <div id="7ed2b10d" class="cell"> 598 608 <div class="sourceCode cell-code" 
id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 599 609 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 600 610 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> ··· 617 627 <h2 class="anchored" data-anchor-id="path-formats">Path Formats</h2> 618 628 <section id="webdataset-brace-notation" class="level3"> 619 629 <h3 class="anchored" data-anchor-id="webdataset-brace-notation">WebDataset Brace Notation</h3> 620 - <div id="6ada3862" class="cell"> 630 + <div id="f28a7507" class="cell"> 621 631 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Range notation</span></span> 622 632 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"data-{000000..000099}.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 623 633 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 627 637 </section> 628 638 <section id="glob-patterns" class="level3"> 629 639 <h3 class="anchored" data-anchor-id="glob-patterns">Glob Patterns</h3> 630 - <div id="610d80d4" class="cell"> 640 + <div id="c187d3d7" class="cell"> 631 641 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Match all tar files</span></span> 632 642 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"path/to/*.tar"</span>, MySample)</span> 633 643 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 637 647 </section> 638 648 <section id="local-directory" class="level3"> 639 649 <h3 class="anchored" data-anchor-id="local-directory">Local Directory</h3> 640 - <div id="9e9b33de" class="cell"> 650 + <div id="dcd8bdec" class="cell"> 641 651 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Scans for .tar files</span></span> 642 652 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"./my-dataset/"</span>, MySample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 643 653 </div> 644 654 </section> 645 655 <section id="remote-urls" class="level3"> 646 656 <h3 class="anchored" data-anchor-id="remote-urls">Remote URLs</h3> 647 - <div id="57f4b2e2" class="cell"> 657 + <div id="b3a20847" class="cell"> 648 658 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># S3 (public buckets)</span></span> 649 659 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>ds 
<span class="op">=</span> load_dataset(<span class="st">"s3://bucket/data-{000..099}.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 650 660 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 670 680 </section> 671 681 <section id="index-lookup" class="level3"> 672 682 <h3 class="anchored" data-anchor-id="index-lookup">Index Lookup</h3> 673 - <div id="9131d1a1" class="cell"> 683 + <div id="079ac960" class="cell"> 674 684 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 675 685 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 676 686 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 737 747 <section id="datasetdict" class="level2"> 738 748 <h2 class="anchored" data-anchor-id="datasetdict">DatasetDict</h2> 739 749 <p>When loading without <code>split=</code>, returns a <code>DatasetDict</code>:</p> 740 - <div id="5ad65dd0" class="cell"> 750 + <div id="59cd381c" class="cell"> 741 751 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data/"</span>, MySample)</span> 742 752 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 743 753 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Access splits</span></span> ··· 757 767 <section id="explicit-data-files" class="level2"> 758 768 <h2 class="anchored" data-anchor-id="explicit-data-files">Explicit Data Files</h2> 759 769 <p>Override automatic detection with <code>data_files</code>:</p> 760 - <div id="44ab4049" class="cell"> 770 + <div id="82ab3caf" class="cell"> 761 771 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Single pattern</span></span> 762 772 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(</span> 763 773 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"path/to/"</span>,</span> ··· 786 796 <section id="streaming-mode" class="level2"> 787 797 <h2 class="anchored" data-anchor-id="streaming-mode">Streaming Mode</h2> 788 798 <p>The <code>streaming</code> parameter signals intent for streaming mode:</p> 789 - <div id="a0a85527" class="cell"> 799 + <div id="ac9975cc" class="cell"> 790 800 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Mark as streaming</span></span> 791 801 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, streaming<span class="op">=</span><span class="va">True</span>)</span> 792 802 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 811 821 <section 
id="auto-type-resolution" class="level2"> 812 822 <h2 class="anchored" data-anchor-id="auto-type-resolution">Auto Type Resolution</h2> 813 823 <p>When using index lookup, the sample type can be resolved automatically:</p> 814 - <div id="0b982c6c" class="cell"> 824 + <div id="a7679439" class="cell"> 815 825 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 816 826 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 817 827 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 825 835 </section> 826 836 <section id="error-handling" class="level2"> 827 837 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 828 - <div id="decfab55" class="cell"> 838 + <div id="c1b9e6e6" class="cell"> 829 839 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 830 840 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> ds <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 831 841 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">FileNotFoundError</span>:</span> ··· 841 851 </section> 842 852 <section id="complete-example" class="level2"> 843 853 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 844 - <div id="f793e66a" class="cell"> 854 + <div id="5e83c9a3" class="cell"> 845 855 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 846 856 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 847 857 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
+23 -13
docs/reference/local-storage.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/local-storage.html">Local Storage</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/local-storage.html">Local Storage</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 556 566 <main class="content" id="quarto-document-content"> 557 567 558 568 559 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/local-storage.html">Local Storage</a></li></ol></nav> 569 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/local-storage.html">Local Storage</a></li></ol></nav> 560 570 <div class="quarto-title"> 561 571 <h1 class="title">Local Storage</h1> 562 572 </div> ··· 593 603 <section id="localindex" class="level2"> 594 604 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 595 605 <p>The index tracks datasets in Redis:</p> 596 - <div id="c9920a8a" class="cell"> 606 + <div id="da288cbb" class="cell"> 597 607 <div class="sourceCode 
cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 598 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 599 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Default connection (localhost:6379)</span></span> ··· 609 619 </div> 610 620 <section id="adding-entries" class="level3"> 611 621 <h3 class="anchored" data-anchor-id="adding-entries">Adding Entries</h3> 612 - <div id="18fb47fd" class="cell"> 622 + <div id="084fdecb" class="cell"> 613 623 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 614 624 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 615 625 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 634 644 </section> 635 645 <section id="listing-and-retrieving" class="level3"> 636 646 <h3 class="anchored" data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 637 - <div id="b0fa453f" class="cell"> 647 + <div id="2279c444" class="cell"> 638 648 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate all entries</span></span> 639 649 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.entries:</span> 640 650 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>cid<span class="sc">}</span><span class="ss">"</span>)</span> ··· 666 676 </div> 667 677 </div> 668 678 <p>The Repo class combines S3 storage with Redis indexing:</p> 669 - <div id="9c79f00d" class="cell"> 679 + <div id="e3f423c3" class="cell"> 670 680 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Repo</span> 671 681 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 672 682 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From credentials file</span></span> ··· 686 696 <span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 687 697 </div> 688 698 <p><strong>Preferred approach</strong> - Use <code>LocalIndex</code> with <code>S3DataStore</code>:</p> 689 - <div id="812b9d3f" class="cell"> 699 + <div id="6ed3c7e6" class="cell"> 690 700 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" 
tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 691 701 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 692 702 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 724 734 </section> 725 735 <section id="inserting-datasets" class="level3"> 726 736 <h3 class="anchored" data-anchor-id="inserting-datasets">Inserting Datasets</h3> 727 - <div id="137b7e3b" class="cell"> 737 + <div id="2de48d7d" class="cell"> 728 738 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 729 739 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 730 740 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 754 764 </section> 755 765 <section id="insert-options" class="level3"> 756 766 <h3 class="anchored" data-anchor-id="insert-options">Insert Options</h3> 757 - <div id="11cfaf93" class="cell"> 767 + <div id="c60ddeb7" class="cell"> 758 768 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>entry, ds <span class="op">=</span> repo.insert(</span> 759 769 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 760 770 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 768 778 <section id="localdatasetentry" class="level2"> 769 779 <h2 class="anchored" data-anchor-id="localdatasetentry">LocalDatasetEntry</h2> 770 780 <p>Index entries provide content-addressable identification:</p> 771 - <div id="475245f1" class="cell"> 781 + <div id="e6593f23" class="cell"> 772 782 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.get_entry_by_name(<span class="st">"my-dataset"</span>)</span> 773 783 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 774 784 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Core properties (IndexEntry protocol)</span></span> ··· 801 811 <section id="schema-storage" class="level2"> 802 812 <h2 class="anchored" data-anchor-id="schema-storage">Schema Storage</h2> 803 813 <p>Schemas can be stored and retrieved from the index:</p> 804 - <div id="726131f5" class="cell"> 814 + <div id="0b531944" class="cell"> 805 815 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 806 816 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 807 817 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 832 842 <section id="s3datastore" class="level2"> 833 843 <h2 class="anchored" 
data-anchor-id="s3datastore">S3DataStore</h2> 834 844 <p>For direct S3 operations without Redis indexing:</p> 835 - <div id="9e2c3eda" class="cell"> 845 + <div id="33f50fcb" class="cell"> 836 846 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 837 847 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 838 848 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 854 864 </section> 855 865 <section id="complete-workflow-example" class="level2"> 856 866 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 857 - <div id="7ef04fda" class="cell"> 867 + <div id="846813f0" class="cell"> 858 868 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 859 869 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 860 870 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
+24 -14
docs/reference/packable-samples.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Packable Samples</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Packable Samples</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link active"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 560 570 <main class="content" id="quarto-document-content"> 561 571 562 572 563 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Packable Samples</a></li></ol></nav> 573 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Packable Samples</a></li></ol></nav> 564 574 <div class="quarto-title"> 565 575 <h1 class="title">Packable Samples</h1> 566 576 </div> ··· 588 598 <section id="the-packable-decorator" class="level2"> 589 599 <h2 class="anchored" data-anchor-id="the-packable-decorator">The <code>@packable</code> Decorator</h2> 590 600 <p>The recommended way to define a sample type is with the 
<code>@packable</code> decorator:</p> 591 - <div id="94abf55d" class="cell"> 601 + <div id="c5cea75b" class="cell"> 592 602 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 593 603 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 594 604 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 610 620 <h2 class="anchored" data-anchor-id="supported-field-types">Supported Field Types</h2> 611 621 <section id="primitives" class="level3"> 612 622 <h3 class="anchored" data-anchor-id="primitives">Primitives</h3> 613 - <div id="cc7f0a01" class="cell"> 623 + <div id="c8b5a276" class="cell"> 614 624 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 615 625 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PrimitiveSample:</span> 616 626 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> name: <span class="bu">str</span></span> ··· 623 633 <section id="numpy-arrays" class="level3"> 624 634 <h3 class="anchored" data-anchor-id="numpy-arrays">NumPy Arrays</h3> 625 635 <p>Fields annotated as <code>NDArray</code> are automatically converted:</p> 626 - <div id="1f91f2d7" class="cell"> 636 + <div id="59e8610f" class="cell"> 627 637 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 628 638 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ArraySample:</span> 629 639 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Required array</span></span> ··· 645 655 </section> 646 656 <section id="lists" class="level3"> 647 657 <h3 class="anchored" data-anchor-id="lists">Lists</h3> 648 - <div id="9302a515" class="cell"> 658 + <div id="e0a943cf" class="cell"> 649 659 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 650 660 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ListSample:</span> 651 661 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> tags: <span class="bu">list</span>[<span class="bu">str</span>]</span> ··· 657 667 <h2 class="anchored" data-anchor-id="serialization">Serialization</h2> 658 668 <section id="packing-to-bytes" class="level3"> 659 669 <h3 class="anchored" data-anchor-id="packing-to-bytes">Packing to Bytes</h3> 660 - <div id="90438dc8" class="cell"> 670 + <div id="15e89907" class="cell"> 661 671 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>sample <span 
class="op">=</span> ImageSample(</span> 662 672 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 663 673 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span> ··· 671 681 </section> 672 682 <section id="unpacking-from-bytes" class="level3"> 673 683 <h3 class="anchored" data-anchor-id="unpacking-from-bytes">Unpacking from Bytes</h3> 674 - <div id="b6063f2c" class="cell"> 684 + <div id="abf5f1f1" class="cell"> 675 685 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Deserialize from bytes</span></span> 676 686 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>restored <span class="op">=</span> ImageSample.from_bytes(packed_bytes)</span> 677 687 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 683 693 <section id="webdataset-format" class="level3"> 684 694 <h3 class="anchored" data-anchor-id="webdataset-format">WebDataset Format</h3> 685 695 <p>The <code>as_wds</code> property returns a dict ready for WebDataset:</p> 686 - <div id="47d071b5" class="cell"> 696 + <div id="e29904d9" class="cell"> 687 697 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>wds_dict <span class="op">=</span> sample.as_wds</span> 688 698 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="co"># {'__key__': '1234...', 'msgpack': b'...'}</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 689 699 </div> 690 700 <p>Write samples to a tar file:</p> 691 - <div id="b50b3d88" class="cell"> 701 + <div id="9a0b0a14" class="cell"> 692 702 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 693 703 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 694 704 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data-000000.tar"</span>) <span class="im">as</span> sink:</span> ··· 701 711 <section id="direct-inheritance-alternative" class="level2"> 702 712 <h2 class="anchored" data-anchor-id="direct-inheritance-alternative">Direct Inheritance (Alternative)</h2> 703 713 <p>You can also inherit directly from <code>PackableSample</code>:</p> 704 - <div id="4cbad78a" class="cell"> 714 + <div id="8a5579f0" class="cell"> 705 715 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span class="im">import</span> dataclass</span> 706 716 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 707 717 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span 
class="at">@dataclass</span></span> ··· 739 749 <section id="the-_ensure_good-method" class="level3"> 740 750 <h3 class="anchored" data-anchor-id="the-_ensure_good-method">The <code>_ensure_good()</code> Method</h3> 741 751 <p>This method runs automatically after construction and handles NDArray conversion:</p> 742 - <div id="60472d01" class="cell"> 752 + <div id="3a74bab4" class="cell"> 743 753 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> _ensure_good(<span class="va">self</span>):</span> 744 754 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> field <span class="kw">in</span> dataclasses.fields(<span class="va">self</span>):</span> 745 755 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> _is_possibly_ndarray_type(field.<span class="bu">type</span>):</span> ··· 755 765 <ul class="nav nav-tabs" role="tablist"><li class="nav-item" role="presentation"><a class="nav-link active" id="tabset-2-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-1" role="tab" aria-controls="tabset-2-1" aria-selected="true">Do</a></li><li class="nav-item" role="presentation"><a class="nav-link" id="tabset-2-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-2" role="tab" aria-controls="tabset-2-2" aria-selected="false">Don’t</a></li></ul> 756 766 <div class="tab-content"> 757 767 <div id="tabset-2-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-2-1-tab"> 758 - <div id="7d6b1360" class="cell"> 768 + <div id="5efd4f23" class="cell"> 759 769 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 760 770 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> GoodSample:</span> 761 771 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Clear type annotation</span></span> ··· 765 775 </div> 766 776 </div> 767 777 <div id="tabset-2-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-2-2-tab"> 768 - <div id="204cf068" class="cell"> 778 + <div id="dc1f3f7b" class="cell"> 769 779 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 770 780 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> BadSample:</span> 771 781 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># DON'T: Nested dataclasses not supported</span></span>
+19 -9
docs/reference/promotion.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/promotion.html">Promotion Workflow</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/promotion.html">Promotion Workflow</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 548 558 <main class="content" id="quarto-document-content"> 549 559 550 560 551 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/promotion.html">Promotion Workflow</a></li></ol></nav> 561 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/promotion.html">Promotion Workflow</a></li></ol></nav> 552 562 <div class="quarto-title"> 553 563 <h1 class="title">Promotion Workflow</h1> 554 564 </div> ··· 584 594 </section> 585 595 <section id="basic-usage" class="level2"> 586 596 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 587 - <div id="d2fdd123" class="cell"> 597 + <div id="ae4e3261" class="cell"> 588 598 <div class="sourceCode cell-code" 
id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 589 599 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 590 600 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span> ··· 604 614 </section> 605 615 <section id="with-metadata" class="level2"> 606 616 <h2 class="anchored" data-anchor-id="with-metadata">With Metadata</h2> 607 - <div id="f9d07277" class="cell"> 617 + <div id="cb98a72f" class="cell"> 608 618 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 609 619 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> entry,</span> 610 620 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 619 629 <section id="schema-deduplication" class="level2"> 620 630 <h2 class="anchored" data-anchor-id="schema-deduplication">Schema Deduplication</h2> 621 631 <p>The promotion workflow automatically checks for existing schemas:</p> 622 - <div id="f1f595bc" class="cell"> 632 + <div id="eb4c0866" class="cell"> 623 633 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 624 634 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 625 635 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 639 649 <div class="tab-content"> 640 650 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 641 651 <p>By default, promotion keeps the original data URLs:</p> 642 - <div id="810334d9" class="cell"> 652 + <div id="324fbcb3" class="cell"> 643 653 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 644 654 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 645 655 </div> ··· 652 662 </div> 653 663 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 654 664 <p>To copy data to a different storage location:</p> 655 - <div id="9fc50efb" class="cell"> 665 + <div id="96e6b253" class="cell"> 656 666 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 657 667 <span 
id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 658 668 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create new data store</span></span> ··· 680 690 </section> 681 691 <section id="complete-workflow-example" class="level2"> 682 692 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 683 - <div id="e5a75aed" class="cell"> 693 + <div id="761de7eb" class="cell"> 684 694 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 685 695 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 686 696 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 751 761 </section> 752 762 <section id="error-handling" class="level2"> 753 763 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 754 - <div id="cae0ca6d" class="cell"> 764 + <div id="62ff3365" class="cell"> 755 765 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 756 766 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span> 757 767 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span>
+24 -14
docs/reference/protocols.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/protocols.html">Protocols</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/protocols.html">Protocols</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 567 577 <main class="content" id="quarto-document-content"> 568 578 569 579 570 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/protocols.html">Protocols</a></li></ol></nav> 580 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/protocols.html">Protocols</a></li></ol></nav> 571 581 <div class="quarto-title"> 572 582 <h1 class="title">Protocols</h1> 573 583 </div> ··· 605 615 <section id="indexentry-protocol" class="level2"> 606 616 <h2 class="anchored" data-anchor-id="indexentry-protocol">IndexEntry Protocol</h2> 607 617 <p>Represents a dataset entry in any index:</p> 608 - <div id="aefd7844" class="cell"> 618 + <div id="2316ad53" class="cell"> 609 619 <div class="sourceCode 
cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry</span> 610 620 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 611 621 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_entry(entry: IndexEntry) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 659 669 <section id="abstractindex-protocol" class="level2"> 660 670 <h2 class="anchored" data-anchor-id="abstractindex-protocol">AbstractIndex Protocol</h2> 661 671 <p>Defines operations for managing schemas and datasets:</p> 662 - <div id="1e72213b" class="cell"> 672 + <div id="cbfe79b2" class="cell"> 663 673 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex</span> 664 674 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 665 675 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> list_all_datasets(index: AbstractIndex) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 669 679 </div> 670 680 <section id="dataset-operations" class="level3"> 671 681 <h3 class="anchored" data-anchor-id="dataset-operations">Dataset Operations</h3> 672 - <div id="d8eeea73" class="cell"> 682 + <div id="fe390f84" class="cell"> 673 683 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert a dataset</span></span> 674 684 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> 675 685 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 687 697 </section> 688 698 <section id="schema-operations" class="level3"> 689 699 <h3 class="anchored" data-anchor-id="schema-operations">Schema Operations</h3> 690 - <div id="4e007e1a" class="cell"> 700 + <div id="65361058" class="cell"> 691 701 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 692 702 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 693 703 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> MySample,</span> ··· 718 728 <section id="abstractdatastore-protocol" class="level2"> 719 729 <h2 class="anchored" data-anchor-id="abstractdatastore-protocol">AbstractDataStore Protocol</h2> 720 730 <p>Abstracts over different storage backends:</p> 721 - <div id="8481566e" class="cell"> 731 + <div id="ce04ec55" class="cell"> 722 732 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractDataStore</span> 723 733 
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 724 734 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_dataset(store: AbstractDataStore, dataset) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]:</span> ··· 728 738 </div> 729 739 <section id="methods" class="level3"> 730 740 <h3 class="anchored" data-anchor-id="methods">Methods</h3> 731 - <div id="6a693b70" class="cell"> 741 + <div id="2079a29f" class="cell"> 732 742 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Write dataset shards</span></span> 733 743 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 734 744 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 755 765 <section id="datasource-protocol" class="level2"> 756 766 <h2 class="anchored" data-anchor-id="datasource-protocol">DataSource Protocol</h2> 757 767 <p>Abstracts over different data source backends for streaming dataset shards:</p> 758 - <div id="692f8fa5" class="cell"> 768 + <div id="c4f4cfb3" class="cell"> 759 769 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 760 770 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 761 771 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> load_from_source(source: DataSource) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 768 778 </div> 769 779 <section id="methods-1" class="level3"> 770 780 <h3 class="anchored" data-anchor-id="methods-1">Methods</h3> 771 - <div id="d3ea7448" class="cell"> 781 + <div id="aa0ab84d" class="cell"> 772 782 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Get list of shard identifiers</span></span> 773 783 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>shard_ids <span class="op">=</span> source.shard_list <span class="co"># ['data-000000.tar', 'data-000001.tar', ...]</span></span> 774 784 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a></span> ··· 791 801 <section id="creating-custom-data-sources" class="level3"> 792 802 <h3 class="anchored" data-anchor-id="creating-custom-data-sources">Creating Custom Data Sources</h3> 793 803 <p>Implement the <code>DataSource</code> protocol for custom backends:</p> 794 - <div id="9372d738" class="cell"> 804 + <div id="8405d1b3" class="cell"> 795 805 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> typing <span class="im">import</span> Iterator, IO</span> 796 806 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 797 807 <span id="cb9-3"><a 
href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 829 839 <section id="using-protocols-for-polymorphism" class="level2"> 830 840 <h2 class="anchored" data-anchor-id="using-protocols-for-polymorphism">Using Protocols for Polymorphism</h2> 831 841 <p>Write code that works with any backend:</p> 832 - <div id="090796d5" class="cell"> 842 + <div id="efc52af8" class="cell"> 833 843 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex, IndexEntry</span> 834 844 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> Dataset</span> 835 845 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 900 910 <section id="type-checking" class="level2"> 901 911 <h2 class="anchored" data-anchor-id="type-checking">Type Checking</h2> 902 912 <p>Protocols are runtime-checkable:</p> 903 - <div id="09ca5138" class="cell"> 913 + <div id="e720c8ac" class="cell"> 904 914 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry, AbstractIndex</span> 905 915 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 906 916 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if object implements protocol</span></span> ··· 914 924 </section> 915 925 <section id="complete-example" class="level2"> 916 926 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 917 - <div id="4107bbb2" class="cell"> 927 + <div id="67ba0947" class="cell"> 918 928 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 919 929 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 920 930 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex</span>
+12 -2
docs/reference/troubleshooting.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/troubleshooting.html">Troubleshooting &amp; FAQ</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/troubleshooting.html">Troubleshooting &amp; FAQ</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 560 570 <main class="content" id="quarto-document-content"> 561 571 562 572 563 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/troubleshooting.html">Troubleshooting &amp; FAQ</a></li></ol></nav> 573 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/troubleshooting.html">Troubleshooting &amp; FAQ</a></li></ol></nav> 564 574 <div class="quarto-title"> 565 575 <h1 class="title">Troubleshooting &amp; FAQ</h1> 566 576 </div>
+14 -4
docs/reference/uri-spec.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 392 396 <button type="button" class="quarto-btn-toggle btn" data-bs-toggle="collapse" role="button" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 393 397 <i class="bi bi-layout-text-sidebar-reverse"></i> 394 398 </button> 395 - <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/uri-spec.html">URI Specification</a></li></ol></nav> 399 + <nav class="quarto-page-breadcrumbs" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/uri-spec.html">URI Specification</a></li></ol></nav> 396 400 <a class="flex-grow-1" role="navigation" data-bs-toggle="collapse" data-bs-target=".quarto-sidebar-collapse-item" aria-controls="quarto-sidebar" aria-expanded="false" aria-label="Toggle sidebar navigation" onclick="if (window.quartoToggleHeadroom) { window.quartoToggleHeadroom(); }"> 397 401 </a> 398 402 </div> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 553 563 <main class="content" id="quarto-document-content"> 554 564 555 565 556 - <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/packable-samples.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/uri-spec.html">URI Specification</a></li></ol></nav> 566 + <header id="title-block-header" class="quarto-title-block default"><nav class="quarto-page-breadcrumbs quarto-title-breadcrumbs d-none d-lg-block" aria-label="breadcrumb"><ol class="breadcrumb"><li class="breadcrumb-item"><a href="../reference/architecture.html">Reference</a></li><li class="breadcrumb-item"><a href="../reference/uri-spec.html">URI Specification</a></li></ol></nav> 557 567 <div class="quarto-title"> 558 568 <h1 class="title">URI Specification</h1> 559 569 </div> ··· 675 685 <h2 class="anchored" data-anchor-id="examples">Examples</h2> 676 686 <section id="local-development" class="level3"> 677 687 <h3 class="anchored" data-anchor-id="local-development">Local Development</h3> 678 - <div id="0ec519c5" class="cell"> 688 + <div id="c582b03f" 
class="cell"> 679 689 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Index</span> 680 690 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 681 691 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> Index()</span> ··· 694 704 </section> 695 705 <section id="atmosphere-atproto-federation" class="level3"> 696 706 <h3 class="anchored" data-anchor-id="atmosphere-atproto-federation">Atmosphere (ATProto Federation)</h3> 697 - <div id="80732fae" class="cell"> 707 + <div id="33a39e75" class="cell"> 698 708 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> Client</span> 699 709 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 700 710 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> Client()</span>
+1467 -1205
docs/search.json
··· 276 276 ] 277 277 }, 278 278 { 279 - "objectID": "reference/packable-samples.html", 280 - "href": "reference/packable-samples.html", 281 - "title": "Packable Samples", 279 + "objectID": "reference/architecture.html", 280 + "href": "reference/architecture.html", 281 + "title": "Architecture Overview", 282 282 "section": "", 283 - "text": "Packable samples are typed dataclasses that can be serialized with msgpack for storage in WebDataset tar files.", 283 + "text": "atdata is designed around a simple but powerful idea: typed, serializable samples that can flow seamlessly between local development, team storage, and a federated network. This page explains the architectural decisions and how the components work together.", 284 284 "crumbs": [ 285 285 "Guide", 286 286 "Reference", 287 - "Packable Samples" 287 + "Architecture Overview" 288 288 ] 289 289 }, 290 290 { 291 - "objectID": "reference/packable-samples.html#the-packable-decorator", 292 - "href": "reference/packable-samples.html#the-packable-decorator", 293 - "title": "Packable Samples", 294 - "section": "The @packable Decorator", 295 - "text": "The @packable Decorator\nThe recommended way to define a sample type is with the @packable decorator:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\nThis creates a dataclass that:\n\nInherits from PackableSample\nHas automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 291 + "objectID": "reference/architecture.html#design-philosophy", 292 + "href": "reference/architecture.html#design-philosophy", 293 + "title": "Architecture Overview", 294 + "section": "Design Philosophy", 295 + "text": "Design Philosophy\n\nThe Problem\nMachine learning workflows involve datasets at every stage—training data, validation sets, embeddings, features, and model outputs. 
These datasets are often:\n\nUntyped: Raw files with implicit schemas, leading to runtime errors\nSiloed: Stuck in one location (local disk, team bucket, or cloud storage)\nUndiscoverable: No standard way to find and share datasets across teams or organizations\n\n\n\nThe Solution\natdata provides a three-layer architecture that addresses each problem:\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 3: Federation (ATProto Atmosphere) │\n│ - Decentralized discovery and sharing │\n│ - Content-addressable identifiers │\n│ - Cross-organization dataset federation │\n└─────────────────────────────────────────────────────────────┘\n ↑\n Promotion\n ↑\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 2: Team Storage (Redis + S3) │\n│ - Shared index for team discovery │\n│ - Scalable object storage for data │\n│ - Schema registry for type consistency │\n└─────────────────────────────────────────────────────────────┘\n ↑\n Insert\n ↑\n┌─────────────────────────────────────────────────────────────┐\n│ Layer 1: Local Development │\n│ - Typed samples with automatic serialization │\n│ - WebDataset tar files for efficient storage │\n│ - Lens transformations for schema flexibility │\n└─────────────────────────────────────────────────────────────┘", 296 296 "crumbs": [ 297 297 "Guide", 298 298 "Reference", 299 - "Packable Samples" 299 + "Architecture Overview" 300 300 ] 301 301 }, 302 302 { 303 - "objectID": "reference/packable-samples.html#supported-field-types", 304 - "href": "reference/packable-samples.html#supported-field-types", 305 - "title": "Packable Samples", 306 - "section": "Supported Field Types", 307 - "text": "Supported Field Types\n\nPrimitives\n\n@atdata.packable\nclass PrimitiveSample:\n name: str\n count: int\n score: float\n active: bool\n data: bytes\n\n\n\nNumPy Arrays\nFields annotated as NDArray are automatically converted:\n\n@atdata.packable\nclass ArraySample:\n features: NDArray # Required array\n embeddings: NDArray | None # Optional array\n\n\n\n\n\n\n\nNote\n\n\n\nBytes in NDArray-typed fields are always interpreted as serialized arrays. 
Don’t use NDArray for raw binary data—use bytes instead.\n\n\n\n\nLists\n\n@atdata.packable\nclass ListSample:\n tags: list[str]\n scores: list[float]", 303 + "objectID": "reference/architecture.html#core-components", 304 + "href": "reference/architecture.html#core-components", 305 + "title": "Architecture Overview", 306 + "section": "Core Components", 307 + "text": "Core Components\n\nPackableSample: The Foundation\nEverything in atdata starts with PackableSample—a base class that makes Python dataclasses serializable with msgpack:\n\n@atdata.packable\nclass ImageSample:\n image: NDArray # Automatically converted to/from bytes\n label: str # Standard msgpack serialization\n confidence: float\n\nKey features:\n\nAutomatic NDArray handling: Numpy arrays are serialized efficiently\nType safety: Field types are preserved and validated\nRound-trip fidelity: Serialize → deserialize always produces identical data\n\nThe @packable decorator is syntactic sugar that:\n\nConverts your class to a dataclass\nAdds PackableSample as a base class\nRegisters a lens from DictSample for flexible loading\n\n\n\nDataset: Typed Iteration\nThe Dataset[T] class wraps WebDataset tar archives with type information:\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nfor batch in dataset.shuffled(batch_size=32):\n images = batch.image # Stacked NDArray: (32, H, W, C)\n labels = batch.label # List of 32 strings\n\nWhy WebDataset?\nWebDataset is a battle-tested format for large-scale ML training:\n\nStreaming: No need to download entire datasets\nSharding: Data split across multiple tar files for parallelism\nShuffling: Two-level shuffling (shard + sample) for training\n\natdata adds:\n\nType safety: Know the schema at compile time\nBatch aggregation: NDArrays are automatically stacked\nLens transformations: View data through different schemas\n\n\n\nSampleBatch: Automatic Aggregation\nWhen iterating with batch_size, atdata returns SampleBatch[T] objects that aggregate sample attributes:\n\nbatch = SampleBatch[ImageSample](samples)\n\n# NDArray fields → stacked numpy array with batch dimension\nbatch.image.shape # (batch_size, H, W, C)\n\n# Other fields → list\nbatch.label # [\"cat\", \"dog\", \"bird\", ...]\n\nThis eliminates boilerplate collation code and works automatically for any PackableSample type.\n\n\nLens: Schema Transformations\nLenses enable viewing datasets through different schemas without duplicating data:\n\n@atdata.packable\nclass SimplifiedSample:\n label: str\n\n@atdata.lens\ndef simplify(src: ImageSample) -&gt; SimplifiedSample:\n return SimplifiedSample(label=src.label)\n\n# View dataset through simplified schema\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nWhen to use lenses:\n\nReducing fields: Drop unnecessary data for specific tasks\nTransforming data: Compute derived fields on-the-fly\nSchema migration: Handle version differences between datasets\n\nLenses are registered globally in a LensNetwork, enabling automatic discovery of transformation paths.", 308 308 "crumbs": [ 309 309 "Guide", 310 310 "Reference", 311 - "Packable Samples" 311 + "Architecture Overview" 312 312 ] 313 313 }, 314 314 { 315 - "objectID": "reference/packable-samples.html#serialization", 316 - "href": "reference/packable-samples.html#serialization", 317 - "title": "Packable Samples", 318 - "section": "Serialization", 319 - "text": "Serialization\n\nPacking to Bytes\n\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# 
Serialize to msgpack bytes\npacked_bytes = sample.packed\nprint(f\"Size: {len(packed_bytes)} bytes\")\n\n\n\nUnpacking from Bytes\n\n# Deserialize from bytes\nrestored = ImageSample.from_bytes(packed_bytes)\n\n# Arrays are automatically restored\nassert np.array_equal(sample.image, restored.image)\nassert sample.label == restored.label\n\n\n\nWebDataset Format\nThe as_wds property returns a dict ready for WebDataset:\n\nwds_dict = sample.as_wds\n# {'__key__': '1234...', 'msgpack': b'...'}\n\nWrite samples to a tar file:\n\nimport webdataset as wds\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n # Use custom key or let as_wds generate one\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})", 315 + "objectID": "reference/architecture.html#storage-backends", 316 + "href": "reference/architecture.html#storage-backends", 317 + "title": "Architecture Overview", 318 + "section": "Storage Backends", 319 + "text": "Storage Backends\n\nLocal Index (Redis + S3)\nFor team-scale usage, atdata provides a two-component storage system:\nRedis Index: Stores metadata and enables fast lookups\n\nDataset entries (name, schema, URLs, metadata)\nSchema registry (type definitions)\nCID-based content addressing\n\nS3 DataStore: Stores actual data files\n\nWebDataset tar shards\nAny S3-compatible storage (AWS, MinIO, Cloudflare R2)\n\n\nstore = S3DataStore(credentials=creds, bucket=\"datasets\")\nindex = LocalIndex(data_store=store)\n\n# Insert dataset: writes to S3, indexes in Redis\nentry = index.insert_dataset(dataset, name=\"training-v1\")\n\nWhy this split?\n\nSeparation of concerns: Metadata queries don’t touch data storage\nFlexibility: Use any S3-compatible storage\nScalability: Redis handles high-throughput lookups; S3 handles large files\n\n\n\nAtmosphere Index (ATProto)\nFor public or cross-organization sharing, atdata integrates with the AT Protocol:\nATProto PDS: Your Personal Data Server stores records\n\nSchema definitions\nDataset index records\nLens transformation records\n\nPDSBlobStore: Optional blob storage on your PDS\n\nStore actual data shards as ATProto blobs\nFully decentralized—no external dependencies\n\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Publish: creates ATProto records, uploads blobs\nentry = index.insert_dataset(dataset, name=\"public-features\")", 320 320 "crumbs": [ 321 321 "Guide", 322 322 "Reference", 323 - "Packable Samples" 323 + "Architecture Overview" 324 324 ] 325 325 }, 326 326 { 327 - "objectID": "reference/packable-samples.html#direct-inheritance-alternative", 328 - "href": "reference/packable-samples.html#direct-inheritance-alternative", 329 - "title": "Packable Samples", 330 - "section": "Direct Inheritance (Alternative)", 331 - "text": "Direct Inheritance (Alternative)\nYou can also inherit directly from PackableSample:\n\nfrom dataclasses import dataclass\n\n@dataclass\nclass DirectSample(atdata.PackableSample):\n name: str\n values: NDArray\n\nThis is equivalent to using @packable but more verbose.", 327 + "objectID": "reference/architecture.html#protocol-abstractions", 328 + "href": "reference/architecture.html#protocol-abstractions", 329 + "title": "Architecture Overview", 330 + "section": "Protocol Abstractions", 331 + "text": "Protocol Abstractions\natdata uses protocols (structural typing) to enable backend interoperability:\n\nAbstractIndex\nCommon interface 
for both LocalIndex and AtmosphereIndex:\n\ndef process_dataset(index: AbstractIndex, name: str):\n entry = index.get_dataset(name)\n schema = index.decode_schema(entry.schema_ref)\n # Works with either LocalIndex or AtmosphereIndex\n\nKey methods:\n\ninsert_dataset() / get_dataset(): Dataset CRUD\npublish_schema() / decode_schema(): Schema management\nlist_datasets() / list_schemas(): Discovery\n\n\n\nAbstractDataStore\nCommon interface for S3DataStore and PDSBlobStore:\n\ndef write_to_store(store: AbstractDataStore, dataset: Dataset):\n urls = store.write_shards(dataset, prefix=\"data/v1\")\n # Works with S3 or PDS blob storage\n\n\n\nDataSource\nCommon interface for data streaming:\n\nURLSource: WebDataset-compatible URLs\nS3Source: S3 with explicit credentials\nBlobSource: ATProto PDS blobs", 332 332 "crumbs": [ 333 333 "Guide", 334 334 "Reference", 335 - "Packable Samples" 335 + "Architecture Overview" 336 336 ] 337 337 }, 338 338 { 339 - "objectID": "reference/packable-samples.html#how-it-works", 340 - "href": "reference/packable-samples.html#how-it-works", 341 - "title": "Packable Samples", 342 - "section": "How It Works", 343 - "text": "How It Works\n\nSerialization Flow\n\nPackingUnpacking\n\n\n\nNDArray fields → converted to bytes via array_to_bytes()\nOther fields → passed through unchanged\nAll fields → packed with msgpack\n\n\n\n\nBytes → unpacked with ormsgpack\nDict → passed to __init__\n__post_init__ → calls _ensure_good()\nNDArray fields → bytes converted back to arrays\n\n\n\n\n\n\nThe _ensure_good() Method\nThis method runs automatically after construction and handles NDArray conversion:\n\ndef _ensure_good(self):\n for field in dataclasses.fields(self):\n if _is_possibly_ndarray_type(field.type):\n value = getattr(self, field.name)\n if isinstance(value, bytes):\n setattr(self, field.name, bytes_to_array(value))", 339 + "objectID": "reference/architecture.html#data-flow-local-to-federation", 340 + "href": "reference/architecture.html#data-flow-local-to-federation", 341 + "title": "Architecture Overview", 342 + "section": "Data Flow: Local to Federation", 343 + "text": "Data Flow: Local to Federation\nA typical workflow progresses through three stages:\n\nStage 1: Local Development\n\n# Define type and create samples\n@atdata.packable\nclass MySample:\n features: NDArray\n label: str\n\n# Write to local tar\nwith wds.writer.TarWriter(\"data.tar\") as sink:\n for sample in samples:\n sink.write(sample.as_wds)\n\n# Iterate locally\ndataset = atdata.Dataset[MySample](\"data.tar\")\n\n\n\nStage 2: Team Storage\n\n# Set up team storage\nstore = S3DataStore(credentials=team_creds, bucket=\"team-datasets\")\nindex = LocalIndex(data_store=store)\n\n# Publish schema and insert\nindex.publish_schema(MySample, version=\"1.0.0\")\nentry = index.insert_dataset(dataset, name=\"my-features\")\n\n# Team members can now load via index\nds = load_dataset(\"@local/my-features\", index=index)\n\n\n\nStage 3: Federation\n\n# Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(entry, index, client)\n\n# Anyone can now discover and load\n# ds = load_dataset(\"@handle.bsky.social/my-features\")", 344 344 "crumbs": [ 345 345 "Guide", 346 346 "Reference", 347 - "Packable Samples" 347 + "Architecture Overview" 348 348 ] 349 349 }, 350 350 { 351 - "objectID": "reference/packable-samples.html#best-practices", 352 - "href": "reference/packable-samples.html#best-practices", 353 - "title": "Packable Samples", 354 - 
"section": "Best Practices", 355 - "text": "Best Practices\n\nDoDon’t\n\n\n\n@atdata.packable\nclass GoodSample:\n features: NDArray # Clear type annotation\n label: str # Simple primitives\n metadata: dict # Msgpack-compatible dicts\n scores: list[float] # Typed lists\n\n\n\n\n@atdata.packable\nclass BadSample:\n # DON'T: Nested dataclasses not supported\n nested: OtherSample\n\n # DON'T: Complex objects that aren't msgpack-serializable\n callback: Callable\n\n # DON'T: Use NDArray for raw bytes\n raw_data: NDArray # Use 'bytes' type instead", 351 + "objectID": "reference/architecture.html#content-addressing", 352 + "href": "reference/architecture.html#content-addressing", 353 + "title": "Architecture Overview", 354 + "section": "Content Addressing", 355 + "text": "Content Addressing\natdata uses CIDs (Content Identifiers) for content-addressable storage:\n\nSchema CIDs: Hash of schema definition\nEntry CIDs: Hash of (schema_ref, data_urls)\nBlob CIDs: Hash of data content\n\nBenefits:\n\nDeduplication: Identical content has identical CID\nIntegrity: Verify data matches expected hash\nATProto compatibility: CIDs are native to the AT Protocol", 356 356 "crumbs": [ 357 357 "Guide", 358 358 "Reference", 359 - "Packable Samples" 359 + "Architecture Overview" 360 360 ] 361 361 }, 362 362 { 363 - "objectID": "reference/packable-samples.html#related", 364 - "href": "reference/packable-samples.html#related", 365 - "title": "Packable Samples", 366 - "section": "Related", 367 - "text": "Related\n\nDatasets - Loading and iterating samples\nLenses - Transforming between sample types", 363 + "objectID": "reference/architecture.html#extension-points", 364 + "href": "reference/architecture.html#extension-points", 365 + "title": "Architecture Overview", 366 + "section": "Extension Points", 367 + "text": "Extension Points\natdata is designed for extensibility:\n\nCustom DataSources\nImplement the DataSource protocol to add new storage backends:\n\nclass MyCustomSource:\n def list_shards(self) -&gt; list[str]: ...\n def open_shard(self, shard_id: str) -&gt; IO[bytes]: ...\n\n @property\n def shards(self) -&gt; Iterator[tuple[str, IO[bytes]]]: ...\n\n\n\nCustom Lenses\nRegister transformations between any PackableSample types:\n\n@atdata.lens\ndef my_transform(src: SourceType) -&gt; TargetType:\n return TargetType(...)\n\n@my_transform.putter\ndef my_transform_put(view: TargetType, src: SourceType) -&gt; SourceType:\n return SourceType(...)\n\n\n\nSchema Extensions\nThe schema format supports custom metadata for domain-specific needs:\n\nindex.publish_schema(\n MySample,\n version=\"1.0.0\",\n metadata={\"domain\": \"chemistry\", \"units\": \"mol/L\"},\n)", 368 368 "crumbs": [ 369 369 "Guide", 370 370 "Reference", 371 - "Packable Samples" 371 + "Architecture Overview" 372 372 ] 373 373 }, 374 374 { 375 - "objectID": "reference/lenses.html", 376 - "href": "reference/lenses.html", 377 - "title": "Lenses", 378 - "section": "", 379 - "text": "Lenses provide bidirectional transformations between sample types, enabling datasets to be viewed through different schemas without duplicating data.", 375 + "objectID": "reference/architecture.html#summary", 376 + "href": "reference/architecture.html#summary", 377 + "title": "Architecture Overview", 378 + "section": "Summary", 379 + "text": "Summary\n\n\n\n\n\n\n\n\nComponent\nPurpose\nKey Classes\n\n\n\n\nSamples\nTyped, serializable data\nPackableSample, @packable\n\n\nDatasets\nTyped iteration over WebDataset\nDataset[T], SampleBatch[T]\n\n\nLenses\nSchema 
transformations\nLens, @lens, LensNetwork\n\n\nLocal Storage\nTeam-scale index + data\nLocalIndex, S3DataStore\n\n\nAtmosphere\nFederated sharing\nAtmosphereIndex, PDSBlobStore\n\n\nProtocols\nBackend abstraction\nAbstractIndex, AbstractDataStore, DataSource\n\n\n\nThe architecture enables a smooth progression from local experimentation to team collaboration to public federation, all while maintaining type safety and efficient data handling.", 380 380 "crumbs": [ 381 381 "Guide", 382 382 "Reference", 383 - "Lenses" 383 + "Architecture Overview" 384 384 ] 385 385 }, 386 386 { 387 - "objectID": "reference/lenses.html#overview", 388 - "href": "reference/lenses.html#overview", 389 - "title": "Lenses", 390 - "section": "Overview", 391 - "text": "Overview\nA lens consists of:\n\nGetter: Transforms source type S to view type V\nPutter: Updates source based on a modified view (optional)", 387 + "objectID": "reference/architecture.html#related", 388 + "href": "reference/architecture.html#related", 389 + "title": "Architecture Overview", 390 + "section": "Related", 391 + "text": "Related\n\nPackable Samples - Defining sample types\nDatasets - Dataset iteration and batching\nLocal Storage - Redis + S3 backend\nAtmosphere - ATProto federation\nProtocols - Abstract interfaces", 392 392 "crumbs": [ 393 393 "Guide", 394 394 "Reference", 395 - "Lenses" 395 + "Architecture Overview" 396 396 ] 397 397 }, 398 398 { 399 - "objectID": "reference/lenses.html#creating-a-lens", 400 - "href": "reference/lenses.html#creating-a-lens", 401 - "title": "Lenses", 402 - "section": "Creating a Lens", 403 - "text": "Creating a Lens\nUse the @lens decorator to define a getter:\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass FullSample:\n image: NDArray\n label: str\n confidence: float\n metadata: dict\n\n@atdata.packable\nclass SimpleSample:\n label: str\n confidence: float\n\n@atdata.lens\ndef simplify(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\nThe decorator:\n\nCreates a Lens object from the getter function\nRegisters it in the global LensNetwork registry\nExtracts source/view types from annotations", 399 + "objectID": "reference/atmosphere.html", 400 + "href": "reference/atmosphere.html", 401 + "title": "Atmosphere (ATProto Integration)", 402 + "section": "", 403 + "text": "The atmosphere module enables publishing and discovering datasets on the ATProto network, creating a federated ecosystem for typed datasets.", 404 404 "crumbs": [ 405 405 "Guide", 406 406 "Reference", 407 - "Lenses" 407 + "Atmosphere (ATProto Integration)" 408 408 ] 409 409 }, 410 410 { 411 - "objectID": "reference/lenses.html#adding-a-putter", 412 - "href": "reference/lenses.html#adding-a-putter", 413 - "title": "Lenses", 414 - "section": "Adding a Putter", 415 - "text": "Adding a Putter\nTo enable bidirectional updates, add a putter:\n\n@simplify.putter\ndef simplify_put(view: SimpleSample, source: FullSample) -&gt; FullSample:\n return FullSample(\n image=source.image,\n label=view.label,\n confidence=view.confidence,\n metadata=source.metadata,\n )\n\nThe putter receives:\n\nview: The modified view value\nsource: The original source value\n\nIt returns an updated source that reflects changes from the view.", 411 + "objectID": "reference/atmosphere.html#installation", 412 + "href": "reference/atmosphere.html#installation", 413 + "title": "Atmosphere (ATProto Integration)", 414 + "section": "Installation", 415 + "text": "Installation\npip install 
atdata[atmosphere]\n# or\npip install atproto", 416 416 "crumbs": [ 417 417 "Guide", 418 418 "Reference", 419 - "Lenses" 419 + "Atmosphere (ATProto Integration)" 420 420 ] 421 421 }, 422 422 { 423 - "objectID": "reference/lenses.html#using-lenses-with-datasets", 424 - "href": "reference/lenses.html#using-lenses-with-datasets", 425 - "title": "Lenses", 426 - "section": "Using Lenses with Datasets", 427 - "text": "Using Lenses with Datasets\nLenses integrate with Dataset.as_type():\n\ndataset = atdata.Dataset[FullSample](\"data-{000000..000009}.tar\")\n\n# View through a different type\nsimple_ds = dataset.as_type(SimpleSample)\n\nfor batch in simple_ds.ordered(batch_size=32):\n # Only SimpleSample fields available\n labels = batch.label\n scores = batch.confidence", 423 + "objectID": "reference/atmosphere.html#overview", 424 + "href": "reference/atmosphere.html#overview", 425 + "title": "Atmosphere (ATProto Integration)", 426 + "section": "Overview", 427 + "text": "Overview\nATProto integration publishes datasets, schemas, and lenses as records in the ac.foundation.dataset.* namespace. This enables:\n\nDiscovery through the ATProto network\nFederation across different hosts\nVerifiability through content-addressable records", 428 428 "crumbs": [ 429 429 "Guide", 430 430 "Reference", 431 - "Lenses" 431 + "Atmosphere (ATProto Integration)" 432 432 ] 433 433 }, 434 434 { 435 - "objectID": "reference/lenses.html#direct-lens-usage", 436 - "href": "reference/lenses.html#direct-lens-usage", 437 - "title": "Lenses", 438 - "section": "Direct Lens Usage", 439 - "text": "Direct Lens Usage\nLenses can also be called directly:\n\nimport numpy as np\n\nfull = FullSample(\n image=np.zeros((224, 224, 3)),\n label=\"cat\",\n confidence=0.95,\n metadata={\"source\": \"training\"}\n)\n\n# Apply getter\nsimple = simplify(full)\n# Or: simple = simplify.get(full)\n\n# Apply putter\nmodified_simple = SimpleSample(label=\"dog\", confidence=0.87)\nupdated_full = simplify.put(modified_simple, full)\n# updated_full has label=\"dog\", confidence=0.87, but retains\n# original image and metadata", 435 + "objectID": "reference/atmosphere.html#atmosphereclient", 436 + "href": "reference/atmosphere.html#atmosphereclient", 437 + "title": "Atmosphere (ATProto Integration)", 438 + "section": "AtmosphereClient", 439 + "text": "AtmosphereClient\nThe client handles authentication and record operations:\n\nfrom atdata.atmosphere import AtmosphereClient\n\nclient = AtmosphereClient()\n\n# Login with app-specific password (not your main password!)\nclient.login(\"alice.bsky.social\", \"app-password\")\n\nprint(client.did) # 'did:plc:...'\nprint(client.handle) # 'alice.bsky.social'\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password. 
Create app passwords at bsky.app/settings/app-passwords.\n\n\n\nSession Management\nSave and restore sessions to avoid re-authentication:\n\n# Export session for later\nsession_string = client.export_session()\n\n# Later: restore session\nnew_client = AtmosphereClient()\nnew_client.login_with_session(session_string)\n\n\n\nCustom PDS\nConnect to a custom PDS instead of bsky.social:\n\nclient = AtmosphereClient(base_url=\"https://pds.example.com\")", 440 440 "crumbs": [ 441 441 "Guide", 442 442 "Reference", 443 - "Lenses" 443 + "Atmosphere (ATProto Integration)" 444 444 ] 445 445 }, 446 446 { 447 - "objectID": "reference/lenses.html#lens-laws", 448 - "href": "reference/lenses.html#lens-laws", 449 - "title": "Lenses", 450 - "section": "Lens Laws", 451 - "text": "Lens Laws\nWell-behaved lenses should satisfy these properties:\n\nGetPutPutGetPutPut\n\n\nIf you get a view and immediately put it back, the source is unchanged:\n\nview = lens.get(source)\nassert lens.put(view, source) == source\n\n\n\nIf you put a view, getting it back yields that view:\n\nupdated = lens.put(view, source)\nassert lens.get(updated) == view\n\n\n\nPutting twice is equivalent to putting once with the final value:\n\nresult1 = lens.put(v2, lens.put(v1, source))\nresult2 = lens.put(v2, source)\nassert result1 == result2", 447 + "objectID": "reference/atmosphere.html#pdsblobstore", 448 + "href": "reference/atmosphere.html#pdsblobstore", 449 + "title": "Atmosphere (ATProto Integration)", 450 + "section": "PDSBlobStore", 451 + "text": "PDSBlobStore\nStore dataset shards as ATProto blobs for fully decentralized storage:\n\nfrom atdata.atmosphere import AtmosphereClient, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\n\n# Write shards as blobs\nurls = store.write_shards(dataset, prefix=\"my-data/v1\")\n# Returns: ['at://did:plc:.../blob/bafyrei...', ...]\n\n# Transform AT URIs to HTTP URLs for reading\nhttp_url = store.read_url(urls[0])\n# Returns: 'https://pds.example.com/xrpc/com.atproto.sync.getBlob?...'\n\n# Create a BlobSource for streaming\nsource = store.create_source(urls)\nds = atdata.Dataset[MySample](source)\n\n\nSize Limits\nPDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use maxcount and maxsize parameters to control shard sizes:\n\nurls = store.write_shards(\n dataset,\n prefix=\"large-data/v1\",\n maxcount=5000, # Max 5000 samples per shard\n maxsize=50e6, # Max 50MB per shard\n)", 452 452 "crumbs": [ 453 453 "Guide", 454 454 "Reference", 455 - "Lenses" 455 + "Atmosphere (ATProto Integration)" 456 456 ] 457 457 }, 458 458 { 459 - "objectID": "reference/lenses.html#trivial-putter", 460 - "href": "reference/lenses.html#trivial-putter", 461 - "title": "Lenses", 462 - "section": "Trivial Putter", 463 - "text": "Trivial Putter\nIf no putter is defined, a trivial putter is used that ignores view updates:\n\n@atdata.lens\ndef extract_label(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\n# Without a putter, put() returns the original source unchanged\nview = SimpleSample(label=\"modified\", confidence=0.5)\nupdated = extract_label.put(view, original)\nassert updated == original # No changes applied", 459 + "objectID": "reference/atmosphere.html#blobsource", 460 + "href": "reference/atmosphere.html#blobsource", 461 + "title": "Atmosphere (ATProto Integration)", 462 + "section": "BlobSource", 463 + "text": "BlobSource\nRead datasets stored as PDS blobs:\n\nfrom atdata import BlobSource\n\n# From blob references\nsource = BlobSource.from_refs([\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei111\"},\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei222\"},\n])\n\n# Or from PDSBlobStore\nsource = store.create_source(urls)\n\n# Use with Dataset\nds = atdata.Dataset[MySample](source)\nfor batch in ds.ordered(batch_size=32):\n process(batch)", 464 464 "crumbs": [ 465 465 "Guide", 466 466 "Reference", 467 - "Lenses" 467 + "Atmosphere (ATProto Integration)" 468 468 ] 469 469 }, 470 470 { 471 - "objectID": "reference/lenses.html#lensnetwork-registry", 472 - "href": "reference/lenses.html#lensnetwork-registry", 473 - "title": "Lenses", 474 - "section": "LensNetwork Registry", 475 - "text": "LensNetwork Registry\nThe LensNetwork is a singleton that stores all registered lenses:\n\nfrom atdata.lens import LensNetwork\n\nnetwork = LensNetwork()\n\n# Look up a specific lens\nlens = network.transform(FullSample, SimpleSample)\n\n# Raises ValueError if no lens exists\ntry:\n lens = network.transform(TypeA, TypeB)\nexcept ValueError:\n print(\"No lens registered for TypeA -&gt; TypeB\")", 471 + "objectID": "reference/atmosphere.html#atmosphereindex", 472 + "href": "reference/atmosphere.html#atmosphereindex", 473 + "title": "Atmosphere (ATProto Integration)", 474 + "section": "AtmosphereIndex", 475 + "text": "AtmosphereIndex\nThe unified interface for ATProto operations, implementing the AbstractIndex protocol:\n\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Without blob storage (use external URLs)\nindex = AtmosphereIndex(client)\n\n# With PDS blob storage (recommended for full decentralization)\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n\nPublishing Schemas\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\n# Publish schema\nschema_uri = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image classification sample\",\n)\n# Returns: \"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\"\n\n\n\nPublishing Datasets\n\ndataset = 
atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.insert_dataset(\n dataset,\n name=\"imagenet-subset\",\n schema_ref=schema_uri, # Optional - auto-publishes if omitted\n description=\"ImageNet subset\",\n tags=[\"images\", \"classification\"],\n license=\"MIT\",\n)\n\nprint(entry.uri) # AT URI of the record\nprint(entry.data_urls) # WebDataset URLs\n\n\n\nListing and Retrieving\n\n# List your datasets\nfor entry in index.list_datasets():\n print(f\"{entry.name}: {entry.schema_ref}\")\n\n# List from another user\nfor entry in index.list_datasets(repo=\"did:plc:other-user\"):\n print(entry.name)\n\n# Get specific dataset\nentry = index.get_dataset(\"at://did:plc:.../ac.foundation.dataset.record/...\")\n\n# List schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']} v{schema['version']}\")\n\n# Decode schema to Python type\nSampleType = index.decode_schema(schema_uri)", 476 476 "crumbs": [ 477 477 "Guide", 478 478 "Reference", 479 - "Lenses" 479 + "Atmosphere (ATProto Integration)" 480 480 ] 481 481 }, 482 482 { 483 - "objectID": "reference/lenses.html#example-feature-extraction", 484 - "href": "reference/lenses.html#example-feature-extraction", 485 - "title": "Lenses", 486 - "section": "Example: Feature Extraction", 487 - "text": "Example: Feature Extraction\n\n@atdata.packable\nclass RawSample:\n audio: NDArray\n text: str\n speaker_id: int\n\n@atdata.packable\nclass TextFeatures:\n text: str\n word_count: int\n\n@atdata.lens\ndef extract_text(src: RawSample) -&gt; TextFeatures:\n return TextFeatures(\n text=src.text,\n word_count=len(src.text.split())\n )\n\n@extract_text.putter\ndef extract_text_put(view: TextFeatures, source: RawSample) -&gt; RawSample:\n return RawSample(\n audio=source.audio,\n text=view.text,\n speaker_id=source.speaker_id\n )", 483 + "objectID": "reference/atmosphere.html#lower-level-publishers", 484 + "href": "reference/atmosphere.html#lower-level-publishers", 485 + "title": "Atmosphere (ATProto Integration)", 486 + "section": "Lower-Level Publishers", 487 + "text": "Lower-Level Publishers\nFor more control, use the individual publisher classes:\n\nSchemaPublisher\n\nfrom atdata.atmosphere import SchemaPublisher\n\npublisher = SchemaPublisher(client)\n\nuri = publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Image with label\",\n metadata={\"source\": \"training\"},\n)\n\n\n\nDatasetPublisher\n\nfrom atdata.atmosphere import DatasetPublisher\n\npublisher = DatasetPublisher(client)\n\nuri = publisher.publish(\n dataset,\n name=\"training-images\",\n schema_uri=schema_uri, # Required if auto_publish_schema=False\n auto_publish_schema=True, # Publish schema automatically\n description=\"Training images\",\n tags=[\"training\", \"images\"],\n license=\"MIT\",\n)\n\n\nBlob Storage\nThere are two approaches to storing data as ATProto blobs:\nApproach 1: PDSBlobStore (Recommended)\nUse PDSBlobStore with AtmosphereIndex for automatic shard management:\n\nfrom atdata.atmosphere import PDSBlobStore, AtmosphereIndex\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Dataset shards are automatically uploaded as blobs\nentry = index.insert_dataset(\n dataset,\n name=\"my-dataset\",\n schema_ref=schema_uri,\n)\n\n# Later: load using BlobSource\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[MySample](source)\n\nApproach 2: Manual Blob Publishing\nFor more control, use DatasetPublisher.publish_with_blobs() directly:\n\nimport io\nimport 
webdataset as wds\n\n# Create tar data in memory\ntar_buffer = io.BytesIO()\nwith wds.writer.TarWriter(tar_buffer) as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# Publish with blob storage\nuri = publisher.publish_with_blobs(\n blobs=[tar_buffer.getvalue()],\n schema_uri=schema_uri,\n name=\"small-dataset\",\n description=\"Dataset stored in ATProto blobs\",\n tags=[\"small\", \"demo\"],\n)\n\nLoading Blob-Stored Datasets\n\nfrom atdata.atmosphere import DatasetLoader\nfrom atdata import BlobSource\n\nloader = DatasetLoader(client)\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\nif storage_type == \"blobs\":\n # Get blob URLs and create BlobSource\n blob_urls = loader.get_blob_urls(uri)\n # Parse to blob refs for BlobSource\n # Or use loader.to_dataset() which handles this automatically\n\n# to_dataset() handles both storage types automatically\ndataset = loader.to_dataset(uri, MySample)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\n\nLensPublisher\n\nfrom atdata.atmosphere import LensPublisher\n\npublisher = LensPublisher(client)\n\n# With code references\nuri = publisher.publish(\n name=\"simplify\",\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n description=\"Extract label only\",\n getter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_getter\",\n },\n putter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_putter\",\n },\n)\n\n# Or publish from a Lens object\nfrom atdata.lens import lens\n\n@lens\ndef simplify(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label)\n\nuri = publisher.publish_from_lens(\n simplify,\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n)", 488 488 "crumbs": [ 489 489 "Guide", 490 490 "Reference", 491 - "Lenses" 491 + "Atmosphere (ATProto Integration)" 492 492 ] 493 493 }, 494 494 { 495 - "objectID": "reference/lenses.html#related", 496 - "href": "reference/lenses.html#related", 497 - "title": "Lenses", 498 - "section": "Related", 499 - "text": "Related\n\nDatasets - Using lenses with Dataset.as_type()\nPackable Samples - Defining sample types\nAtmosphere - Publishing lenses to ATProto federation", 495 + "objectID": "reference/atmosphere.html#lower-level-loaders", 496 + "href": "reference/atmosphere.html#lower-level-loaders", 497 + "title": "Atmosphere (ATProto Integration)", 498 + "section": "Lower-Level Loaders", 499 + "text": "Lower-Level Loaders\nFor direct access to records, use the loader classes:\n\nSchemaLoader\n\nfrom atdata.atmosphere import SchemaLoader\n\nloader = SchemaLoader(client)\n\n# Get a specific schema\nschema = loader.get(\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/xyz\")\nprint(schema[\"name\"], schema[\"version\"])\n\n# List all schemas from a repository\nfor schema in loader.list_all(repo=\"did:plc:other-user\"):\n print(schema[\"name\"])\n\n\n\nDatasetLoader\n\nfrom atdata.atmosphere import DatasetLoader\n\nloader = DatasetLoader(client)\n\n# Get a specific dataset record\nrecord = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\n# Get URLs based on storage type\nif storage_type == \"external\":\n urls = loader.get_urls(uri)\nelse:\n urls 
= loader.get_blob_urls(uri)\n\n# Get metadata\nmetadata = loader.get_metadata(uri)\n\n# Create a Dataset object directly\ndataset = loader.to_dataset(uri, MySampleType)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\nLensLoader\n\nfrom atdata.atmosphere import LensLoader\n\nloader = LensLoader(client)\n\n# Get a specific lens record\nlens = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\nprint(lens[\"name\"])\nprint(lens[\"sourceSchema\"], \"-&gt;\", lens[\"targetSchema\"])\n\n# List all lenses from a repository\nfor lens in loader.list_all():\n print(lens[\"name\"])\n\n# Find lenses by schema\nlenses = loader.find_by_schemas(\n source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n)", 500 500 "crumbs": [ 501 501 "Guide", 502 502 "Reference", 503 - "Lenses" 503 + "Atmosphere (ATProto Integration)" 504 504 ] 505 505 }, 506 506 { 507 - "objectID": "reference/load-dataset.html", 508 - "href": "reference/load-dataset.html", 509 - "title": "load_dataset API", 510 - "section": "", 511 - "text": "The load_dataset() function provides a HuggingFace Datasets-style interface for loading typed datasets.", 507 + "objectID": "reference/atmosphere.html#at-uris", 508 + "href": "reference/atmosphere.html#at-uris", 509 + "title": "Atmosphere (ATProto Integration)", 510 + "section": "AT URIs", 511 + "text": "AT URIs\nATProto records are identified by AT URIs:\n\nfrom atdata.atmosphere import AtUri\n\n# Parse an AT URI\nuri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n\nprint(uri.authority) # 'did:plc:abc123'\nprint(uri.collection) # 'ac.foundation.dataset.sampleSchema'\nprint(uri.rkey) # 'xyz'\n\n# Format back to string\nprint(str(uri)) # 'at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz'", 512 512 "crumbs": [ 513 513 "Guide", 514 514 "Reference", 515 - "load_dataset API" 515 + "Atmosphere (ATProto Integration)" 516 516 ] 517 517 }, 518 518 { 519 - "objectID": "reference/load-dataset.html#overview", 520 - "href": "reference/load-dataset.html#overview", 521 - "title": "load_dataset API", 522 - "section": "Overview", 523 - "text": "Overview\nKey differences from HuggingFace Datasets:\n\nRequires explicit sample_type parameter (typed dataclass) unless using index\nReturns atdata.Dataset[ST] instead of HF Dataset\nBuilt on WebDataset for efficient streaming\nNo Arrow caching layer", 519 + "objectID": "reference/atmosphere.html#supported-field-types", 520 + "href": "reference/atmosphere.html#supported-field-types", 521 + "title": "Atmosphere (ATProto Integration)", 522 + "section": "Supported Field Types", 523 + "text": "Supported Field Types\nSchemas support these field types:\n\n\n\nPython Type\nATProto Type\n\n\n\n\nstr\nprimitive/str\n\n\nint\nprimitive/int\n\n\nfloat\nprimitive/float\n\n\nbool\nprimitive/bool\n\n\nbytes\nprimitive/bytes\n\n\nNDArray\nndarray (default dtype: float32)\n\n\nNDArray[np.float64]\nndarray (dtype: float64)\n\n\nlist[str]\narray with items\n\n\nT \\| None\nOptional field", 524 524 "crumbs": [ 525 525 "Guide", 526 526 "Reference", 527 - "load_dataset API" 527 + "Atmosphere (ATProto Integration)" 528 528 ] 529 529 }, 530 530 { 531 - "objectID": "reference/load-dataset.html#basic-usage", 532 - "href": "reference/load-dataset.html#basic-usage", 533 - "title": "load_dataset API", 534 - "section": "Basic Usage", 535 - "text": "Basic Usage\n\nimport atdata\nfrom atdata import load_dataset\nfrom 
numpy.typing import NDArray\n\n@atdata.packable\nclass TextSample:\n text: str\n label: int\n\n# Load a specific split\ntrain_ds = load_dataset(\"path/to/data.tar\", TextSample, split=\"train\")\n\n# Load all splits (returns DatasetDict)\nds_dict = load_dataset(\"path/to/data/\", TextSample)\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]", 531 + "objectID": "reference/atmosphere.html#complete-example", 532 + "href": "reference/atmosphere.html#complete-example", 533 + "title": "Atmosphere (ATProto Integration)", 534 + "section": "Complete Example", 535 + "text": "Complete Example\nThis example shows the full workflow using PDSBlobStore for decentralized storage:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nimport webdataset as wds\n\n# 1. Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Authenticate and set up blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\")\n\n# 6. 
Later: discover and load from blobs\nfor dataset_entry in index.list_datasets():\n print(f\"Found: {dataset_entry.name}\")\n\n # Reconstruct type from schema\n SampleType = index.decode_schema(dataset_entry.schema_ref)\n\n # Create source from blob URLs\n source = store.create_source(dataset_entry.data_urls)\n\n # Load dataset from blobs\n ds = atdata.Dataset[SampleType](source)\n for batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n break\n\nFor external URL storage (without PDSBlobStore):\n\n# Use AtmosphereIndex without data_store\nindex = AtmosphereIndex(client)\n\n# Dataset URLs will be stored as-is (external references)\nentry = index.insert_dataset(\n dataset,\n name=\"external-features\",\n schema_ref=schema_uri,\n)\n\n# Load using standard URL source\nds = atdata.Dataset[FeatureSample](entry.data_urls[0])", 536 536 "crumbs": [ 537 537 "Guide", 538 538 "Reference", 539 - "load_dataset API" 539 + "Atmosphere (ATProto Integration)" 540 540 ] 541 541 }, 542 542 { 543 - "objectID": "reference/load-dataset.html#path-formats", 544 - "href": "reference/load-dataset.html#path-formats", 545 - "title": "load_dataset API", 546 - "section": "Path Formats", 547 - "text": "Path Formats\n\nWebDataset Brace Notation\n\n# Range notation\nds = load_dataset(\"data-{000000..000099}.tar\", MySample, split=\"train\")\n\n# List notation\nds = load_dataset(\"data-{train,test,val}.tar\", MySample, split=\"train\")\n\n\n\nGlob Patterns\n\n# Match all tar files\nds = load_dataset(\"path/to/*.tar\", MySample)\n\n# Match pattern\nds = load_dataset(\"path/to/train-*.tar\", MySample, split=\"train\")\n\n\n\nLocal Directory\n\n# Scans for .tar files\nds = load_dataset(\"./my-dataset/\", MySample)\n\n\n\nRemote URLs\n\n# S3 (public buckets)\nds = load_dataset(\"s3://bucket/data-{000..099}.tar\", MySample, split=\"train\")\n\n# HTTP/HTTPS\nds = load_dataset(\"https://example.com/data.tar\", MySample, split=\"train\")\n\n# Google Cloud Storage\nds = load_dataset(\"gs://bucket/data.tar\", MySample, split=\"train\")\n\n\n\n\n\n\n\nNote\n\n\n\nFor private S3 buckets or S3-compatible storage with authentication, use atdata.S3Source with Dataset directly. 
See Datasets for details.\n\n\n\n\nIndex Lookup\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# Load from local index (auto-resolves type from schema)\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# With explicit type\nds = load_dataset(\"@local/my-dataset\", MySample, index=index, split=\"train\")", 543 + "objectID": "reference/atmosphere.html#related", 544 + "href": "reference/atmosphere.html#related", 545 + "title": "Atmosphere (ATProto Integration)", 546 + "section": "Related", 547 + "text": "Related\n\nLocal Storage - Redis + S3 backend\nPromotion - Promoting local datasets to ATProto\nProtocols - AbstractIndex interface\nPackable Samples - Defining sample types", 548 548 "crumbs": [ 549 549 "Guide", 550 550 "Reference", 551 - "load_dataset API" 551 + "Atmosphere (ATProto Integration)" 552 552 ] 553 553 }, 554 554 { 555 - "objectID": "reference/load-dataset.html#split-detection", 556 - "href": "reference/load-dataset.html#split-detection", 557 - "title": "load_dataset API", 558 - "section": "Split Detection", 559 - "text": "Split Detection\nSplits are automatically detected from filenames and directories:\n\n\n\nPattern\nDetected Split\n\n\n\n\ntrain-*.tar, training-*.tar\ntrain\n\n\ntest-*.tar, testing-*.tar\ntest\n\n\nval-*.tar, valid-*.tar, validation-*.tar\nvalidation\n\n\ndev-*.tar, development-*.tar\nvalidation\n\n\ntrain/*.tar (directory)\ntrain\n\n\ntest/*.tar (directory)\ntest\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nFiles without a detected split default to “train”.", 555 + "objectID": "reference/local-storage.html", 556 + "href": "reference/local-storage.html", 557 + "title": "Local Storage", 558 + "section": "", 559 + "text": "The local storage module provides a Redis + S3 backend for storing and managing datasets before publishing to the ATProto federation.", 560 560 "crumbs": [ 561 561 "Guide", 562 562 "Reference", 563 - "load_dataset API" 563 + "Local Storage" 564 564 ] 565 565 }, 566 566 { 567 - "objectID": "reference/load-dataset.html#datasetdict", 568 - "href": "reference/load-dataset.html#datasetdict", 569 - "title": "load_dataset API", 570 - "section": "DatasetDict", 571 - "text": "DatasetDict\nWhen loading without split=, returns a DatasetDict:\n\nds_dict = load_dataset(\"path/to/data/\", MySample)\n\n# Access splits\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]\n\n# Iterate splits\nfor name, dataset in ds_dict.items():\n print(f\"{name}: {len(dataset.shard_list)} shards\")\n\n# Properties\nprint(ds_dict.num_shards) # {'train': 10, 'test': 2}\nprint(ds_dict.sample_type) # &lt;class 'MySample'&gt;\nprint(ds_dict.streaming) # False", 567 + "objectID": "reference/local-storage.html#overview", 568 + "href": "reference/local-storage.html#overview", 569 + "title": "Local Storage", 570 + "section": "Overview", 571 + "text": "Overview\nLocal storage uses:\n\nRedis for indexing and tracking dataset metadata\nS3-compatible storage for dataset tar files\n\nThis enables development and small-scale deployment before promoting to the full ATProto infrastructure.", 572 572 "crumbs": [ 573 573 "Guide", 574 574 "Reference", 575 - "load_dataset API" 575 + "Local Storage" 576 576 ] 577 577 }, 578 578 { 579 - "objectID": "reference/load-dataset.html#explicit-data-files", 580 - "href": "reference/load-dataset.html#explicit-data-files", 581 - "title": "load_dataset API", 582 - "section": "Explicit Data Files", 583 - "text": "Explicit Data Files\nOverride automatic detection with data_files:\n\n# Single pattern\nds = load_dataset(\n 
\"path/to/\",\n MySample,\n data_files=\"custom-*.tar\",\n)\n\n# List of patterns\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files=[\"shard-000.tar\", \"shard-001.tar\"],\n)\n\n# Explicit split mapping\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files={\n \"train\": \"training-shards-*.tar\",\n \"test\": \"eval-data.tar\",\n },\n)", 579 + "objectID": "reference/local-storage.html#localindex", 580 + "href": "reference/local-storage.html#localindex", 581 + "title": "Local Storage", 582 + "section": "LocalIndex", 583 + "text": "LocalIndex\nThe index tracks datasets in Redis:\n\nfrom atdata.local import LocalIndex\n\n# Default connection (localhost:6379)\nindex = LocalIndex()\n\n# Custom Redis connection\nimport redis\nr = redis.Redis(host='custom-host', port=6379)\nindex = LocalIndex(redis=r)\n\n# With connection kwargs\nindex = LocalIndex(host='custom-host', port=6379, db=1)\n\n\nAdding Entries\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.add_entry(\n dataset,\n name=\"my-dataset\",\n schema_ref=\"atdata://local/sampleSchema/ImageSample@1.0.0\", # optional\n metadata={\"description\": \"Training images\"}, # optional\n)\n\nprint(entry.cid) # Content identifier\nprint(entry.name) # \"my-dataset\"\nprint(entry.data_urls) # [\"data-{000000..000009}.tar\"]\n\n\n\nListing and Retrieving\n\n# Iterate all entries\nfor entry in index.entries:\n print(f\"{entry.name}: {entry.cid}\")\n\n# Get as list\nall_entries = index.all_entries\n\n# Get by name\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Get by CID\nentry = index.get_entry(\"bafyrei...\")", 584 584 "crumbs": [ 585 585 "Guide", 586 586 "Reference", 587 - "load_dataset API" 587 + "Local Storage" 588 588 ] 589 589 }, 590 590 { 591 - "objectID": "reference/load-dataset.html#streaming-mode", 592 - "href": "reference/load-dataset.html#streaming-mode", 593 - "title": "load_dataset API", 594 - "section": "Streaming Mode", 595 - "text": "Streaming Mode\nThe streaming parameter signals intent for streaming mode:\n\n# Mark as streaming\nds_dict = load_dataset(\"path/to/data.tar\", MySample, streaming=True)\n\n# Check streaming status\nif ds_dict.streaming:\n print(\"Streaming mode\")\n\n\n\n\n\n\n\nTip\n\n\n\natdata datasets are always lazy/streaming via WebDataset pipelines. This parameter primarily signals intent.", 591 + "objectID": "reference/local-storage.html#repo-deprecated", 592 + "href": "reference/local-storage.html#repo-deprecated", 593 + "title": "Local Storage", 594 + "section": "Repo (Deprecated)", 595 + "text": "Repo (Deprecated)\n\n\n\n\n\n\nWarning\n\n\n\nRepo is deprecated. 
Use LocalIndex with S3DataStore instead for new code.\n\n\nThe Repo class combines S3 storage with Redis indexing:\n\nfrom atdata.local import Repo\n\n# From credentials file\nrepo = Repo(\n s3_credentials=\"path/to/.env\",\n hive_path=\"my-bucket/datasets\",\n)\n\n# From credentials dict\nrepo = Repo(\n s3_credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n hive_path=\"my-bucket/datasets\",\n)\n\nPreferred approach - Use LocalIndex with S3DataStore:\n\nfrom atdata.local import LocalIndex, S3DataStore\n\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Insert dataset\nentry = index.insert_dataset(dataset, name=\"my-dataset\", prefix=\"datasets/v1\")\n\n\nCredentials File Format\nThe .env file should contain:\nAWS_ENDPOINT=http://localhost:9000\nAWS_ACCESS_KEY_ID=your-access-key\nAWS_SECRET_ACCESS_KEY=your-secret-key\n\n\n\n\n\n\nNote\n\n\n\nFor AWS S3, omit AWS_ENDPOINT to use the default endpoint.\n\n\n\n\nInserting Datasets\n\nimport webdataset as wds\nimport numpy as np\n\n# Create dataset from samples\nsamples = [ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"sample_{i}\"\n) for i in range(1000)]\n\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[ImageSample](\"temp.tar\")\n\n# Insert into repo (writes to S3 + indexes in Redis)\nentry, stored_dataset = repo.insert(\n dataset,\n name=\"training-images-v1\",\n cache_local=False, # Stream directly to S3\n)\n\nprint(entry.cid) # Content identifier\nprint(stored_dataset.url) # S3 URL for the stored data\nprint(stored_dataset.shard_list) # Individual shard URLs\n\n\n\nInsert Options\n\nentry, ds = repo.insert(\n dataset,\n name=\"my-dataset\",\n cache_local=True, # Write locally first, then copy (faster for some workloads)\n maxcount=10000, # Samples per shard\n maxsize=100_000_000, # Max shard size in bytes\n)", 596 596 "crumbs": [ 597 597 "Guide", 598 598 "Reference", 599 - "load_dataset API" 599 + "Local Storage" 600 600 ] 601 601 }, 602 602 { 603 - "objectID": "reference/load-dataset.html#auto-type-resolution", 604 - "href": "reference/load-dataset.html#auto-type-resolution", 605 - "title": "load_dataset API", 606 - "section": "Auto Type Resolution", 607 - "text": "Auto Type Resolution\nWhen using index lookup, the sample type can be resolved automatically:\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# No sample_type needed - resolved from schema\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# Type is inferred from the stored schema\nsample_type = ds.sample_type", 603 + "objectID": "reference/local-storage.html#localdatasetentry", 604 + "href": "reference/local-storage.html#localdatasetentry", 605 + "title": "Local Storage", 606 + "section": "LocalDatasetEntry", 607 + "text": "LocalDatasetEntry\nIndex entries provide content-addressable identification:\n\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Core properties (IndexEntry protocol)\nentry.name # Human-readable name\nentry.schema_ref # Schema reference\nentry.data_urls # WebDataset URLs\nentry.metadata # Arbitrary metadata dict or None\n\n# Content addressing\nentry.cid # 
ATProto-compatible CID (content identifier)\n\n# Legacy compatibility\nentry.wds_url # First data URL\nentry.sample_kind # Same as schema_ref\n\n\n\n\n\n\n\nTip\n\n\n\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring identical data produces identical CIDs whether stored locally or in the atmosphere.", 608 608 "crumbs": [ 609 609 "Guide", 610 610 "Reference", 611 - "load_dataset API" 611 + "Local Storage" 612 612 ] 613 613 }, 614 614 { 615 - "objectID": "reference/load-dataset.html#error-handling", 616 - "href": "reference/load-dataset.html#error-handling", 617 - "title": "load_dataset API", 618 - "section": "Error Handling", 619 - "text": "Error Handling\n\ntry:\n ds = load_dataset(\"path/to/data.tar\", MySample, split=\"train\")\nexcept FileNotFoundError:\n print(\"No data files found\")\nexcept ValueError as e:\n if \"Split\" in str(e):\n print(\"Requested split not found\")\n else:\n print(f\"Invalid configuration: {e}\")\nexcept KeyError:\n print(\"Dataset not found in index\")", 615 + "objectID": "reference/local-storage.html#schema-storage", 616 + "href": "reference/local-storage.html#schema-storage", 617 + "title": "Local Storage", 618 + "section": "Schema Storage", 619 + "text": "Schema Storage\nSchemas can be stored and retrieved from the index:\n\n# Publish a schema\nschema_ref = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image with label annotation\",\n)\n# Returns: \"atdata://local/sampleSchema/ImageSample@1.0.0\"\n\n# Retrieve schema record\nschema = index.get_schema(schema_ref)\n# {\n# \"name\": \"ImageSample\",\n# \"version\": \"1.0.0\",\n# \"fields\": [...],\n# \"description\": \"...\",\n# \"createdAt\": \"...\",\n# }\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']}@{schema['version']}\")\n\n# Reconstruct sample type from schema\nSampleType = index.decode_schema(schema_ref)\ndataset = atdata.Dataset[SampleType](entry.data_urls[0])", 620 620 "crumbs": [ 621 621 "Guide", 622 622 "Reference", 623 - "load_dataset API" 623 + "Local Storage" 624 624 ] 625 625 }, 626 626 { 627 - "objectID": "reference/load-dataset.html#complete-example", 628 - "href": "reference/load-dataset.html#complete-example", 629 - "title": "load_dataset API", 630 - "section": "Complete Example", 631 - "text": "Complete Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata import load_dataset\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\n# 2. Create dataset files\nfor split in [\"train\", \"test\"]:\n with wds.writer.TarWriter(f\"{split}-000.tar\") as sink:\n for i in range(100):\n sample = ImageSample(\n image=np.random.rand(64, 64, 3).astype(np.float32),\n label=f\"sample_{i}\",\n )\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Load with split detection\nds_dict = load_dataset(\"./\", ImageSample)\nprint(ds_dict.keys()) # dict_keys(['train', 'test'])\n\n# 4. Iterate\nfor batch in ds_dict[\"train\"].ordered(batch_size=16):\n print(batch.image.shape) # (16, 64, 64, 3)\n print(batch.label) # ['sample_0', 'sample_1', ...]\n break\n\n# 5. 
Load specific split\ntrain_ds = load_dataset(\"./\", ImageSample, split=\"train\")\nfor batch in train_ds.ordered(batch_size=32):\n process(batch)", 627 + "objectID": "reference/local-storage.html#s3datastore", 628 + "href": "reference/local-storage.html#s3datastore", 629 + "title": "Local Storage", 630 + "section": "S3DataStore", 631 + "text": "S3DataStore\nFor direct S3 operations without Redis indexing:\n\nfrom atdata.local import S3DataStore\n\nstore = S3DataStore(\n credentials=\"path/to/.env\",\n bucket=\"my-bucket\",\n)\n\n# Write dataset shards\nurls = store.write_shards(\n dataset,\n prefix=\"datasets/v1\",\n maxcount=10000,\n)\n# Returns: [\"s3://my-bucket/datasets/v1/data--uuid--000000.tar\", ...]\n\n# Check capabilities\nstore.supports_streaming() # True", 632 632 "crumbs": [ 633 633 "Guide", 634 634 "Reference", 635 - "load_dataset API" 635 + "Local Storage" 636 + ] 637 + }, 638 + { 639 + "objectID": "reference/local-storage.html#complete-workflow-example", 640 + "href": "reference/local-storage.html#complete-workflow-example", 641 + "title": "Local Storage", 642 + "section": "Complete Workflow Example", 643 + "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass TrainingSample:\n features: NDArray\n label: int\n source: str\n\n# 2. Create samples\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(10000)\n]\n\n# 3. Write to local tar\nwith wds.writer.TarWriter(\"local-data.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nlocal_ds = atdata.Dataset[TrainingSample](\"local-data.tar\")\nentry = index.insert_dataset(local_ds, name=\"training-v1\", prefix=\"training\")\n\n# 5. 
Retrieve later\nentry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(batch.features.shape) # (32, 128)", 644 + "crumbs": [ 645 + "Guide", 646 + "Reference", 647 + "Local Storage" 636 648 ] 637 649 }, 638 650 { 639 - "objectID": "reference/load-dataset.html#related", 640 - "href": "reference/load-dataset.html#related", 641 - "title": "load_dataset API", 651 + "objectID": "reference/local-storage.html#related", 652 + "href": "reference/local-storage.html#related", 653 + "title": "Local Storage", 642 654 "section": "Related", 643 - "text": "Related\n\nDatasets - Dataset iteration and batching\nPackable Samples - Defining sample types\nLocal Storage - LocalIndex for index lookup\nProtocols - AbstractIndex interface", 655 + "text": "Related\n\nDatasets - Dataset iteration and batching\nProtocols - AbstractIndex and IndexEntry interfaces\nPromotion - Promoting local datasets to ATProto\nAtmosphere - ATProto federation", 644 656 "crumbs": [ 645 657 "Guide", 646 658 "Reference", 647 - "load_dataset API" 659 + "Local Storage" 648 660 ] 649 661 }, 650 662 { 651 - "objectID": "reference/promotion.html", 652 - "href": "reference/promotion.html", 653 - "title": "Promotion Workflow", 663 + "objectID": "reference/uri-spec.html", 664 + "href": "reference/uri-spec.html", 665 + "title": "URI Specification", 654 666 "section": "", 655 - "text": "The promotion workflow migrates datasets from local storage (Redis + S3) to the ATProto atmosphere network, enabling federation and discovery.", 667 + "text": "The atdata:// URI scheme provides a unified way to address atdata resources across local development and the ATProto federation.", 656 668 "crumbs": [ 657 669 "Guide", 658 670 "Reference", 659 - "Promotion Workflow" 671 + "URI Specification" 660 672 ] 661 673 }, 662 674 { 663 - "objectID": "reference/promotion.html#overview", 664 - "href": "reference/promotion.html#overview", 665 - "title": "Promotion Workflow", 675 + "objectID": "reference/uri-spec.html#overview", 676 + "href": "reference/uri-spec.html#overview", 677 + "title": "URI Specification", 666 678 "section": "Overview", 667 - "text": "Overview\nPromotion handles:\n\nSchema deduplication: Avoids publishing duplicate schemas\nData URL preservation: Keeps existing S3 URLs or copies to new storage\nMetadata transfer: Preserves tags, descriptions, and custom metadata", 679 + "text": "Overview\nThe atdata URI scheme:\n\nFollows RFC 3986 syntax\nProvides consistent addressing for local and atmosphere resources\nEnables seamless promotion from development to production", 668 680 "crumbs": [ 669 681 "Guide", 670 682 "Reference", 671 - "Promotion Workflow" 683 + "URI Specification" 672 684 ] 673 685 }, 674 686 { 675 - "objectID": "reference/promotion.html#basic-usage", 676 - "href": "reference/promotion.html#basic-usage", 677 - "title": "Promotion Workflow", 678 - "section": "Basic Usage", 679 - "text": "Basic Usage\n\nfrom atdata.local import LocalIndex\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\n\n# Setup\nlocal_index = LocalIndex()\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Get local entry\nentry = local_index.get_entry_by_name(\"my-dataset\")\n\n# Promote to atmosphere\nat_uri = promote_to_atmosphere(entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 687 + "objectID": "reference/uri-spec.html#uri-format", 688 + "href": "reference/uri-spec.html#uri-format", 689 + "title": "URI Specification", 690 + "section": "URI Format", 691 + "text": "URI Format\natdata://{authority}/{resource_type}/{name}@{version}\n\nAuthority\nThe authority identifies where the resource is stored:\n\n\n\nAuthority\nDescription\nExample\n\n\n\n\nlocal\nLocal Redis/S3 storage\natdata://local/...\n\n\n{handle}\nATProto handle\natdata://alice.bsky.social/...\n\n\n{did}\nATProto DID\natdata://did:plc:abc123/...\n\n\n\n\n\nResource Types\n\n\n\nResource Type\nDescription\n\n\n\n\nsampleSchema\nPackableSample type definitions\n\n\ndataset\nDataset entries (future)\n\n\nlens\nLens transformations (future)\n\n\n\n\n\nVersion Specifiers\nVersions follow semantic versioning and are specified with @:\n\n\n\nSpecifier\nDescription\nExample\n\n\n\n\n@{major}.{minor}.{patch}\nExact version\n@1.0.0, @2.1.3\n\n\n(none)\nLatest version\nResolves to highest semver", 680 692 "crumbs": [ 681 693 "Guide", 682 694 "Reference", 683 - "Promotion Workflow" 695 + "URI Specification" 684 696 ] 685 697 }, 686 698 { 687 - "objectID": "reference/promotion.html#with-metadata", 688 - "href": "reference/promotion.html#with-metadata", 689 - "title": "Promotion Workflow", 690 - "section": "With Metadata", 691 - "text": "With Metadata\n\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n name=\"my-dataset-v2\", # Override name\n description=\"Training images\", # Add description\n tags=[\"images\", \"training\"], # Add discovery tags\n license=\"MIT\", # Specify license\n)", 699 + "objectID": "reference/uri-spec.html#examples", 700 + "href": "reference/uri-spec.html#examples", 701 + "title": "URI Specification", 702 + "section": "Examples", 703 + "text": "Examples\n\nLocal Development\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# Publish a schema (returns atdata:// URI)\nref = index.publish_schema(MySample, version=\"1.0.0\")\n# =&gt; \"atdata://local/sampleSchema/MySample@1.0.0\"\n\n# Auto-increment version\nref = index.publish_schema(MySample)\n# =&gt; \"atdata://local/sampleSchema/MySample@1.0.1\"\n\n# Retrieve by URI\nschema = index.get_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n\n\n\nAtmosphere (ATProto Federation)\n\nfrom atdata.atmosphere import AtmosphereClient\n\nclient = AtmosphereClient()\n\n# Publish returns at:// URI that maps to atdata://\nref = client.publish_schema(MySample)\n# =&gt; \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\"\n\n# Can also be addressed as:\n# =&gt; \"atdata://did:plc:abc123/sampleSchema/MySample@1.0.0\"\n# =&gt; \"atdata://alice.bsky.social/sampleSchema/MySample@1.0.0\"", 692 704 "crumbs": [ 693 705 "Guide", 694 706 "Reference", 695 - "Promotion Workflow" 707 + "URI Specification" 696 708 ] 697 709 }, 698 710 { 699 - "objectID": "reference/promotion.html#schema-deduplication", 700 - "href": "reference/promotion.html#schema-deduplication", 701 - "title": "Promotion Workflow", 702 - "section": "Schema Deduplication", 703 - "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)\n\nSchema matching is based on:\n\n{module}.{class_name} (e.g., mymodule.ImageSample)\nVersion string (e.g., 1.0.0)", 711 + "objectID": "reference/uri-spec.html#relationship-to-at-protocol-uris", 712 + "href": 
"reference/uri-spec.html#relationship-to-at-protocol-uris", 713 + "title": "URI Specification", 714 + "section": "Relationship to AT Protocol URIs", 715 + "text": "Relationship to AT Protocol URIs\nThe atdata:// scheme is inspired by and maps to ATProto’s at:// scheme:\n\n\n\n\n\n\n\natdata://\nat://\n\n\n\n\natdata://{did}/sampleSchema/{name}@{version}\nat://{did}/ac.foundation.dataset.sampleSchema/{rkey}\n\n\natdata://local/...\n(local only, no at:// equivalent)\n\n\n\nWhen publishing to the atmosphere, atdata URIs are automatically resolved to their corresponding at:// URIs for federation compatibility.", 704 716 "crumbs": [ 705 717 "Guide", 706 718 "Reference", 707 - "Promotion Workflow" 719 + "URI Specification" 708 720 ] 709 721 }, 710 722 { 711 - "objectID": "reference/promotion.html#data-storage-options", 712 - "href": "reference/promotion.html#data-storage-options", 713 - "title": "Promotion Workflow", 714 - "section": "Data Storage Options", 715 - "text": "Data Storage Options\n\nKeep Existing URLs (Default)Copy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(entry, local_index, client)\n\n\nData stays in original S3 location\nDataset record points to existing URLs\nFastest option, no data copying\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\n\nData is copied to new bucket\nDataset record points to new URLs\nGood for moving from private to public storage", 723 + "objectID": "reference/uri-spec.html#legacy-format", 724 + "href": "reference/uri-spec.html#legacy-format", 725 + "title": "URI Specification", 726 + "section": "Legacy Format", 727 + "text": "Legacy Format\nFor backwards compatibility, the local index also accepts the legacy format:\nlocal://schemas/{module.Class}@{version}\nThis format is deprecated and will be removed in a future version. Use atdata://local/sampleSchema/{name}@{version} instead.", 716 728 "crumbs": [ 717 729 "Guide", 718 730 "Reference", 719 - "Promotion Workflow" 731 + "URI Specification" 720 732 ] 721 733 }, 722 734 { 723 - "objectID": "reference/promotion.html#complete-workflow-example", 724 - "href": "reference/promotion.html#complete-workflow-example", 725 - "title": "Promotion Workflow", 726 - "section": "Complete Workflow Example", 727 - "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. Create local dataset\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. 
Set up index with S3 data store\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 4. Publish schema and insert dataset\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 5. Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Verify on atmosphere\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")", 735 + "objectID": "tutorials/quickstart.html", 736 + "href": "tutorials/quickstart.html", 737 + "title": "Quick Start", 738 + "section": "", 739 + "text": "This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them. You’ll learn the foundational patterns that enable type-safe, efficient dataset handling—the first layer of atdata’s three-layer architecture.", 728 740 "crumbs": [ 729 741 "Guide", 730 - "Reference", 731 - "Promotion Workflow" 742 + "Getting Started", 743 + "Quick Start" 732 744 ] 733 745 }, 734 746 { 735 - "objectID": "reference/promotion.html#error-handling", 736 - "href": "reference/promotion.html#error-handling", 737 - "title": "Promotion Workflow", 738 - "section": "Error Handling", 739 - "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 747 + "objectID": "tutorials/quickstart.html#where-this-fits", 748 + "href": "tutorials/quickstart.html#where-this-fits", 749 + "title": "Quick Start", 750 + "section": "Where This Fits", 751 + "text": "Where This Fits\natdata is built around a simple progression:\nLocal Development → Team Storage → Federation\nThis tutorial covers local development—the foundation. Everything you learn here (typed samples, efficient iteration, lens transformations) carries forward as you scale to team storage and federated sharing. 
The key insight is that your sample types remain the same across all three layers; only the storage backend changes.", 740 752 "crumbs": [ 741 753 "Guide", 742 - "Reference", 743 - "Promotion Workflow" 754 + "Getting Started", 755 + "Quick Start" 744 756 ] 745 757 }, 746 758 { 747 - "objectID": "reference/promotion.html#requirements", 748 - "href": "reference/promotion.html#requirements", 749 - "title": "Promotion Workflow", 750 - "section": "Requirements", 751 - "text": "Requirements\nBefore promotion:\n\nDataset must be in local index (via Index.insert_dataset() or Index.add_entry())\nSchema must be published to local index (via Index.publish_schema())\nAtmosphereClient must be authenticated", 759 + "objectID": "tutorials/quickstart.html#installation", 760 + "href": "tutorials/quickstart.html#installation", 761 + "title": "Quick Start", 762 + "section": "Installation", 763 + "text": "Installation\npip install atdata\n\n# With ATProto support\npip install atdata[atmosphere]", 752 764 "crumbs": [ 753 765 "Guide", 754 - "Reference", 755 - "Promotion Workflow" 766 + "Getting Started", 767 + "Quick Start" 756 768 ] 757 769 }, 758 770 { 759 - "objectID": "reference/promotion.html#related", 760 - "href": "reference/promotion.html#related", 761 - "title": "Promotion Workflow", 762 - "section": "Related", 763 - "text": "Related\n\nLocal Storage - Setting up local datasets\nAtmosphere - ATProto integration\nProtocols - AbstractIndex and AbstractDataStore", 771 + "objectID": "tutorials/quickstart.html#define-a-sample-type", 772 + "href": "tutorials/quickstart.html#define-a-sample-type", 773 + "title": "Quick Start", 774 + "section": "Define a Sample Type", 775 + "text": "Define a Sample Type\nThe core abstraction in atdata is the PackableSample—a typed, serializable data structure. 
Unlike raw dictionaries or ad-hoc classes, PackableSamples provide:\n\nType safety: Know your schema at write time, not training time\nAutomatic serialization: msgpack encoding with efficient NDArray handling\nRound-trip fidelity: Data survives serialization without loss\n\nUse the @packable decorator to create a typed sample:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing an image with label and confidence.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\nThe @packable decorator:\n\nConverts your class into a dataclass\nAdds automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 764 776 "crumbs": [ 765 777 "Guide", 766 - "Reference", 767 - "Promotion Workflow" 778 + "Getting Started", 779 + "Quick Start" 768 780 ] 769 781 }, 770 782 { 771 - "objectID": "tutorials/local-workflow.html", 772 - "href": "tutorials/local-workflow.html", 773 - "title": "Local Workflow", 774 - "section": "", 775 - "text": "This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage.", 783 + "objectID": "tutorials/quickstart.html#create-sample-instances", 784 + "href": "tutorials/quickstart.html#create-sample-instances", 785 + "title": "Quick Start", 786 + "section": "Create Sample Instances", 787 + "text": "Create Sample Instances\n\n# Create a single sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# Check serialization\npacked_bytes = sample.packed\nprint(f\"Serialized size: {len(packed_bytes):,} bytes\")\n\n# Verify round-trip\nrestored = ImageSample.from_bytes(packed_bytes)\nassert np.allclose(sample.image, restored.image)\nprint(\"Round-trip successful!\")", 776 788 "crumbs": [ 777 789 "Guide", 778 790 "Getting Started", 779 - "Local Workflow" 791 + "Quick Start" 780 792 ] 781 793 }, 782 794 { 783 - "objectID": "tutorials/local-workflow.html#prerequisites", 784 - "href": "tutorials/local-workflow.html#prerequisites", 785 - "title": "Local Workflow", 786 - "section": "Prerequisites", 787 - "text": "Prerequisites\n\nRedis server running (default: localhost:6379)\nS3-compatible storage (MinIO, AWS S3, etc.)\n\n\n\n\n\n\n\nTip\n\n\n\nFor local development, you can use MinIO:\ndocker run -p 9000:9000 minio/minio server /data", 795 + "objectID": "tutorials/quickstart.html#write-a-dataset", 796 + "href": "tutorials/quickstart.html#write-a-dataset", 797 + "title": "Quick Start", 798 + "section": "Write a Dataset", 799 + "text": "Write a Dataset\natdata uses WebDataset’s tar format for storage. 
This choice is deliberate:\n\nStreaming: Process data without downloading entire datasets\nSharding: Split large datasets across multiple files for parallel I/O\nProven: Battle-tested at scale by organizations like Google, NVIDIA, and OpenAI\n\nThe as_wds property on your sample provides the dictionary format WebDataset expects:\nUse WebDataset’s TarWriter to create dataset files:\n\nimport webdataset as wds\n\n# Create 100 samples\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"class_{i % 10}\",\n confidence=np.random.rand(),\n )\n for i in range(100)\n]\n\n# Write to tar file\nwith wds.writer.TarWriter(\"my-dataset-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\nprint(\"Wrote 100 samples to my-dataset-000000.tar\")", 788 800 "crumbs": [ 789 801 "Guide", 790 802 "Getting Started", 791 - "Local Workflow" 803 + "Quick Start" 792 804 ] 793 805 }, 794 806 { 795 - "objectID": "tutorials/local-workflow.html#setup", 796 - "href": "tutorials/local-workflow.html#setup", 797 - "title": "Local Workflow", 798 - "section": "Setup", 799 - "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nimport webdataset as wds", 807 + "objectID": "tutorials/quickstart.html#load-and-iterate", 808 + "href": "tutorials/quickstart.html#load-and-iterate", 809 + "title": "Quick Start", 810 + "section": "Load and Iterate", 811 + "text": "Load and Iterate\nThe generic Dataset[T] class connects your sample type to WebDataset’s streaming infrastructure. When you specify Dataset[ImageSample], atdata knows how to deserialize the msgpack bytes back into fully-typed objects.\nAutomatic batch aggregation is a key feature: when you iterate with batch_size, atdata returns SampleBatch objects that intelligently combine samples:\n\nNDArray fields are stacked into a single array with a batch dimension\nOther fields become lists of values\n\nThis eliminates boilerplate collation code and works automatically with any PackableSample type.\nCreate a typed Dataset and iterate with batching:\n\n# Load dataset with type\ndataset = atdata.Dataset[ImageSample](\"my-dataset-000000.tar\")\n\n# Iterate in order with batching\nfor batch in dataset.ordered(batch_size=16):\n # NDArray fields are stacked\n images = batch.image # shape: (16, 224, 224, 3)\n\n # Other fields become lists\n labels = batch.label # list of 16 strings\n confidences = batch.confidence # list of 16 floats\n\n print(f\"Batch shape: {images.shape}\")\n print(f\"Labels: {labels[:3]}...\")\n break", 800 812 "crumbs": [ 801 813 "Guide", 802 814 "Getting Started", 803 - "Local Workflow" 815 + "Quick Start" 804 816 ] 805 817 }, 806 818 { 807 - "objectID": "tutorials/local-workflow.html#define-sample-types", 808 - "href": "tutorials/local-workflow.html#define-sample-types", 809 - "title": "Local Workflow", 810 - "section": "Define Sample Types", 811 - "text": "Define Sample Types\n\n@atdata.packable\nclass TrainingSample:\n \"\"\"A sample containing features and label for training.\"\"\"\n features: NDArray\n label: int\n\n@atdata.packable\nclass TextSample:\n \"\"\"A sample containing text data.\"\"\"\n text: str\n category: str", 819 + "objectID": "tutorials/quickstart.html#shuffled-iteration", 820 + "href": "tutorials/quickstart.html#shuffled-iteration", 821 + "title": "Quick Start", 822 + "section": "Shuffled Iteration", 823 + "text": "Shuffled 
Iteration\nProper shuffling is critical for training. WebDataset provides two-level shuffling:\n\nShard shuffling: Randomize the order of tar files\nSample shuffling: Randomize samples within a buffer\n\nThis approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.\nFor training, use shuffled iteration:\n\nfor batch in dataset.shuffled(batch_size=32):\n # Samples are shuffled at shard and sample level\n images = batch.image\n labels = batch.label\n\n # Train your model\n # model.train(images, labels)\n break", 812 824 "crumbs": [ 813 825 "Guide", 814 826 "Getting Started", 815 - "Local Workflow" 827 + "Quick Start" 816 828 ] 817 829 }, 818 830 { 819 - "objectID": "tutorials/local-workflow.html#localdatasetentry", 820 - "href": "tutorials/local-workflow.html#localdatasetentry", 821 - "title": "Local Workflow", 822 - "section": "LocalDatasetEntry", 823 - "text": "LocalDatasetEntry\nCreate entries with content-addressable CIDs:\n\n# Create an entry manually\nentry = LocalDatasetEntry(\n _name=\"my-dataset\",\n _schema_ref=\"local://schemas/examples.TrainingSample@1.0.0\",\n _data_urls=[\"s3://bucket/data-000000.tar\", \"s3://bucket/data-000001.tar\"],\n _metadata={\"source\": \"example\", \"samples\": 10000},\n)\n\nprint(f\"Entry name: {entry.name}\")\nprint(f\"Schema ref: {entry.schema_ref}\")\nprint(f\"Data URLs: {entry.data_urls}\")\nprint(f\"Metadata: {entry.metadata}\")\nprint(f\"CID: {entry.cid}\")\n\n\n\n\n\n\n\nNote\n\n\n\nCIDs are generated from content (schema_ref + data_urls), so identical data produces identical CIDs.", 831 + "objectID": "tutorials/quickstart.html#use-lenses-for-type-transformations", 832 + "href": "tutorials/quickstart.html#use-lenses-for-type-transformations", 833 + "title": "Quick Start", 834 + "section": "Use Lenses for Type Transformations", 835 + "text": "Use Lenses for Type Transformations\nLenses are bidirectional transformations between sample types. They solve a common problem: you have a dataset with a rich schema, but a particular task only needs a subset of fields—or needs derived fields computed on-the-fly.\nInstead of creating separate datasets for each use case (duplicating storage and maintenance burden), lenses let you view the same underlying data through different type schemas. 
This is inspired by functional programming concepts and enables:\n\nSchema reduction: Drop fields you don’t need\nSchema migration: Handle version differences between datasets\nDerived features: Compute fields on-the-fly during iteration\n\nView datasets through different schemas:\n\n# Define a simplified view type\n@atdata.packable\nclass SimplifiedSample:\n label: str\n confidence: float\n\n# Create a lens transformation\n@atdata.lens\ndef simplify(src: ImageSample) -&gt; SimplifiedSample:\n return SimplifiedSample(label=src.label, confidence=src.confidence)\n\n# View dataset through lens\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nfor batch in simple_ds.ordered(batch_size=8):\n print(f\"Labels: {batch.label}\")\n print(f\"Confidences: {batch.confidence}\")\n break", 824 836 "crumbs": [ 825 837 "Guide", 826 838 "Getting Started", 827 - "Local Workflow" 839 + "Quick Start" 828 840 ] 829 841 }, 830 842 { 831 - "objectID": "tutorials/local-workflow.html#localindex", 832 - "href": "tutorials/local-workflow.html#localindex", 833 - "title": "Local Workflow", 834 - "section": "LocalIndex", 835 - "text": "LocalIndex\nThe index tracks datasets in Redis:\n\nfrom redis import Redis\n\n# Connect to Redis\nredis = Redis(host=\"localhost\", port=6379)\nindex = LocalIndex(redis=redis)\n\nprint(\"LocalIndex connected\")\n\n\nSchema Management\n\n# Publish a schema\nschema_ref = index.publish_schema(TrainingSample, version=\"1.0.0\")\nprint(f\"Published schema: {schema_ref}\")\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\" - {schema.get('name', 'Unknown')} v{schema.get('version', '?')}\")\n\n# Get schema record\nschema_record = index.get_schema(schema_ref)\nprint(f\"Schema fields: {[f['name'] for f in schema_record.get('fields', [])]}\")\n\n# Decode schema back to a PackableSample class\ndecoded_type = index.decode_schema(schema_ref)\nprint(f\"Decoded type: {decoded_type.__name__}\")", 843 + "objectID": "tutorials/quickstart.html#what-youve-learned", 844 + "href": "tutorials/quickstart.html#what-youve-learned", 845 + "title": "Quick Start", 846 + "section": "What You’ve Learned", 847 + "text": "What You’ve Learned\nYou now understand atdata’s foundational concepts:\n\n\n\nConcept\nPurpose\n\n\n\n\n@packable\nCreate typed, serializable sample classes\n\n\nDataset[T]\nTyped iteration over WebDataset tar files\n\n\nSampleBatch[T]\nAutomatic aggregation with NDArray stacking\n\n\n@lens\nTransform between sample types without data duplication\n\n\n\nThese patterns work identically whether your data lives on local disk, in team S3 storage, or published to the ATProto network. 
The next tutorials show how to scale beyond local files.", 836 848 "crumbs": [ 837 849 "Guide", 838 850 "Getting Started", 839 - "Local Workflow" 851 + "Quick Start" 840 852 ] 841 853 }, 842 854 { 843 - "objectID": "tutorials/local-workflow.html#s3datastore", 844 - "href": "tutorials/local-workflow.html#s3datastore", 845 - "title": "Local Workflow", 846 - "section": "S3DataStore", 847 - "text": "S3DataStore\nFor direct S3 operations:\n\ncreds = {\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n}\n\nstore = S3DataStore(creds, bucket=\"my-bucket\")\n\nprint(f\"Bucket: {store.bucket}\")\nprint(f\"Supports streaming: {store.supports_streaming()}\")", 855 + "objectID": "tutorials/quickstart.html#next-steps", 856 + "href": "tutorials/quickstart.html#next-steps", 857 + "title": "Quick Start", 858 + "section": "Next Steps", 859 + "text": "Next Steps\n\n\n\n\n\n\nReady to Share with Your Team?\n\n\n\nThe Local Workflow tutorial shows how to set up Redis + S3 storage for team-wide dataset discovery and sharing.\n\n\n\nLocal Workflow - Store datasets with Redis + S3\nAtmosphere Publishing - Publish to ATProto federation\nPackable Samples - Deep dive into sample types\nDatasets - Advanced dataset operations", 848 860 "crumbs": [ 849 861 "Guide", 850 862 "Getting Started", 851 - "Local Workflow" 863 + "Quick Start" 852 864 ] 853 865 }, 854 866 { 855 - "objectID": "tutorials/local-workflow.html#complete-index-workflow", 856 - "href": "tutorials/local-workflow.html#complete-index-workflow", 857 - "title": "Local Workflow", 858 - "section": "Complete Index Workflow", 859 - "text": "Complete Index Workflow\nUse LocalIndex with S3DataStore to store datasets with S3 storage and Redis indexing:\n\n# 1. Create sample data\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10\n )\n for i in range(1000)\n]\nprint(f\"Created {len(samples)} training samples\")\n\n# 2. Write to local tar file\nwith wds.writer.TarWriter(\"local-data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\nprint(\"Wrote samples to local tar file\")\n\n# 3. Create Dataset\nds = atdata.Dataset[TrainingSample](\"local-data-000000.tar\")\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(redis=redis, data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nentry = index.insert_dataset(ds, name=\"training-v1\", prefix=\"datasets\")\nprint(f\"Stored at: {entry.data_urls}\")\nprint(f\"CID: {entry.cid}\")\n\n# 5. Retrieve later\nretrieved_entry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](retrieved_entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(f\"Batch features shape: {batch.features.shape}\")\n break", 867 + "objectID": "tutorials/atmosphere.html", 868 + "href": "tutorials/atmosphere.html", 869 + "title": "Atmosphere Publishing", 870 + "section": "", 871 + "text": "This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing. 
This is Layer 3 of atdata’s architecture—decentralized federation that enables cross-organization dataset sharing.", 860 872 "crumbs": [ 861 873 "Guide", 862 874 "Getting Started", 863 - "Local Workflow" 875 + "Atmosphere Publishing" 864 876 ] 865 877 }, 866 878 { 867 - "objectID": "tutorials/local-workflow.html#using-load_dataset-with-index", 868 - "href": "tutorials/local-workflow.html#using-load_dataset-with-index", 869 - "title": "Local Workflow", 870 - "section": "Using load_dataset with Index", 871 - "text": "Using load_dataset with Index\nThe load_dataset() function supports index lookup:\n\nfrom atdata import load_dataset\n\n# Load from local index\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# The index resolves the dataset name to URLs and schema\nfor batch in ds.shuffled(batch_size=32):\n process(batch)\n break", 879 + "objectID": "tutorials/atmosphere.html#why-federation", 880 + "href": "tutorials/atmosphere.html#why-federation", 881 + "title": "Atmosphere Publishing", 882 + "section": "Why Federation?", 883 + "text": "Why Federation?\nTeam storage (Redis + S3) works well within an organization, but sharing across organizations introduces new challenges:\n\nDiscovery: How do researchers find relevant datasets across institutions?\nTrust: How do you verify a dataset is what it claims to be?\nDurability: What happens if the original publisher goes offline?\n\nThe AT Protocol (ATProto), developed by Bluesky, provides a foundation for decentralized social applications. atdata leverages ATProto’s infrastructure for dataset federation:\n\n\n\n\n\n\n\nATProto Feature\natdata Usage\n\n\n\n\nDIDs (Decentralized Identifiers)\nPublisher identity verification\n\n\nLexicons\nDataset/schema record schemas\n\n\nPDSes (Personal Data Servers)\nStorage for records and blobs\n\n\nRelays & AppViews\nDiscovery and aggregation\n\n\n\nThe key insight: your Bluesky identity (@handle.bsky.social) becomes your dataset publisher identity. 
Anyone can verify that a dataset was published by you, and can discover your datasets through the federated network.", 872 884 "crumbs": [ 873 885 "Guide", 874 886 "Getting Started", 875 - "Local Workflow" 887 + "Atmosphere Publishing" 876 888 ] 877 889 }, 878 890 { 879 - "objectID": "tutorials/local-workflow.html#next-steps", 880 - "href": "tutorials/local-workflow.html#next-steps", 881 - "title": "Local Workflow", 882 - "section": "Next Steps", 883 - "text": "Next Steps\n\nAtmosphere Publishing - Publish to ATProto federation\nPromotion Workflow - Migrate from local to atmosphere\nLocal Storage Reference - Complete API reference", 891 + "objectID": "tutorials/atmosphere.html#prerequisites", 892 + "href": "tutorials/atmosphere.html#prerequisites", 893 + "title": "Atmosphere Publishing", 894 + "section": "Prerequisites", 895 + "text": "Prerequisites\n\npip install atdata[atmosphere]\nA Bluesky account with an app-specific password\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password.", 884 896 "crumbs": [ 885 897 "Guide", 886 898 "Getting Started", 887 - "Local Workflow" 899 + "Atmosphere Publishing" 888 900 ] 889 901 }, 890 902 { 891 - "objectID": "tutorials/promotion.html", 892 - "href": "tutorials/promotion.html", 893 - "title": "Promotion Workflow", 894 - "section": "", 895 - "text": "This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network.", 903 + "objectID": "tutorials/atmosphere.html#setup", 904 + "href": "tutorials/atmosphere.html#setup", 905 + "title": "Atmosphere Publishing", 906 + "section": "Setup", 907 + "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import (\n AtmosphereClient,\n AtmosphereIndex,\n PDSBlobStore,\n SchemaPublisher,\n SchemaLoader,\n DatasetPublisher,\n DatasetLoader,\n AtUri,\n)\nfrom atdata import BlobSource\nimport webdataset as wds", 896 908 "crumbs": [ 897 909 "Guide", 898 910 "Getting Started", 899 - "Promotion Workflow" 911 + "Atmosphere Publishing" 900 912 ] 901 913 }, 902 914 { 903 - "objectID": "tutorials/promotion.html#overview", 904 - "href": "tutorials/promotion.html#overview", 905 - "title": "Promotion Workflow", 906 - "section": "Overview", 907 - "text": "Overview\nThe promotion workflow moves datasets from local storage to the atmosphere:\nLOCAL ATMOSPHERE\n----- ----------\nRedis Index ATProto PDS\nS3 Storage --&gt; (same S3 or new location)\nlocal://schemas/... 
at://did:plc:.../schema/...\nKey features:\n\nSchema deduplication: Won’t republish identical schemas\nFlexible data handling: Keep existing URLs or copy to new storage\nMetadata preservation: Local metadata carries over to atmosphere", 915 + "objectID": "tutorials/atmosphere.html#define-sample-types", 916 + "href": "tutorials/atmosphere.html#define-sample-types", 917 + "title": "Atmosphere Publishing", 918 + "section": "Define Sample Types", 919 + "text": "Define Sample Types\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing image data with metadata.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\n@atdata.packable\nclass TextEmbeddingSample:\n \"\"\"A sample containing text with embedding vectors.\"\"\"\n text: str\n embedding: NDArray\n source: str", 908 920 "crumbs": [ 909 921 "Guide", 910 922 "Getting Started", 911 - "Promotion Workflow" 923 + "Atmosphere Publishing" 912 924 ] 913 925 }, 914 926 { 915 - "objectID": "tutorials/promotion.html#setup", 916 - "href": "tutorials/promotion.html#setup", 917 - "title": "Promotion Workflow", 918 - "section": "Setup", 919 - "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds", 927 + "objectID": "tutorials/atmosphere.html#type-introspection", 928 + "href": "tutorials/atmosphere.html#type-introspection", 929 + "title": "Atmosphere Publishing", 930 + "section": "Type Introspection", 931 + "text": "Type Introspection\nSee what information is available from a PackableSample type:\n\nfrom dataclasses import fields, is_dataclass\n\nprint(f\"Sample type: {ImageSample.__name__}\")\nprint(f\"Is dataclass: {is_dataclass(ImageSample)}\")\n\nprint(\"\\nFields:\")\nfor field in fields(ImageSample):\n print(f\" - {field.name}: {field.type}\")\n\n# Create and serialize a sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\npacked = sample.packed\nprint(f\"\\nSerialized size: {len(packed):,} bytes\")\n\n# Round-trip\nrestored = ImageSample.from_bytes(packed)\nprint(f\"Round-trip successful: {np.allclose(sample.image, restored.image)}\")", 920 932 "crumbs": [ 921 933 "Guide", 922 934 "Getting Started", 923 - "Promotion Workflow" 935 + "Atmosphere Publishing" 924 936 ] 925 937 }, 926 938 { 927 - "objectID": "tutorials/promotion.html#prepare-a-local-dataset", 928 - "href": "tutorials/promotion.html#prepare-a-local-dataset", 929 - "title": "Promotion Workflow", 930 - "section": "Prepare a Local Dataset", 931 - "text": "Prepare a Local Dataset\nFirst, set up a dataset in local storage:\n\n# 1. Define sample type\n@atdata.packable\nclass ExperimentSample:\n \"\"\"A sample from a scientific experiment.\"\"\"\n measurement: NDArray\n timestamp: float\n sensor_id: str\n\n# 2. Create samples\nsamples = [\n ExperimentSample(\n measurement=np.random.randn(64).astype(np.float32),\n timestamp=float(i),\n sensor_id=f\"sensor_{i % 4}\",\n )\n for i in range(1000)\n]\n\n# 3. Write to tar\nwith wds.writer.TarWriter(\"experiment.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. 
Set up local index with S3 storage\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 5. Insert dataset into index\ndataset = atdata.Dataset[ExperimentSample](\"experiment.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"experiment-2024-001\", prefix=\"experiments\")\n\n# 6. Publish schema to local index\nlocal_index.publish_schema(ExperimentSample, version=\"1.0.0\")\n\nprint(f\"Local entry name: {local_entry.name}\")\nprint(f\"Local entry CID: {local_entry.cid}\")\nprint(f\"Data URLs: {local_entry.data_urls}\")", 939 + "objectID": "tutorials/atmosphere.html#at-uri-parsing", 940 + "href": "tutorials/atmosphere.html#at-uri-parsing", 941 + "title": "Atmosphere Publishing", 942 + "section": "AT URI Parsing", 943 + "text": "AT URI Parsing\nEvery record in ATProto is identified by an AT URI, which encodes:\n\nAuthority: The DID or handle of the record owner\nCollection: The Lexicon type (like a table name)\nRkey: The record key (unique within the collection)\n\nUnderstanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.\nATProto records are identified by AT URIs:\n\nuris = [\n \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789\",\n \"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset\",\n]\n\nfor uri_str in uris:\n print(f\"\\nParsing: {uri_str}\")\n uri = AtUri.parse(uri_str)\n print(f\" Authority: {uri.authority}\")\n print(f\" Collection: {uri.collection}\")\n print(f\" Rkey: {uri.rkey}\")", 932 944 "crumbs": [ 933 945 "Guide", 934 946 "Getting Started", 935 - "Promotion Workflow" 947 + "Atmosphere Publishing" 936 948 ] 937 949 }, 938 950 { 939 - "objectID": "tutorials/promotion.html#basic-promotion", 940 - "href": "tutorials/promotion.html#basic-promotion", 941 - "title": "Promotion Workflow", 942 - "section": "Basic Promotion", 943 - "text": "Basic Promotion\nPromote the dataset to ATProto:\n\n# Connect to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\n# Promote to atmosphere\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 951 + "objectID": "tutorials/atmosphere.html#authentication", 952 + "href": "tutorials/atmosphere.html#authentication", 953 + "title": "Atmosphere Publishing", 954 + "section": "Authentication", 955 + "text": "Authentication\nThe AtmosphereClient handles ATProto authentication. 
When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).\nConnect to ATProto:\n\nclient = AtmosphereClient()\nclient.login(\"your.handle.social\", \"your-app-password\")\n\nprint(f\"Authenticated as: {client.handle}\")\nprint(f\"DID: {client.did}\")", 944 956 "crumbs": [ 945 957 "Guide", 946 958 "Getting Started", 947 - "Promotion Workflow" 959 + "Atmosphere Publishing" 948 960 ] 949 961 }, 950 962 { 951 - "objectID": "tutorials/promotion.html#promotion-with-metadata", 952 - "href": "tutorials/promotion.html#promotion-with-metadata", 953 - "title": "Promotion Workflow", 954 - "section": "Promotion with Metadata", 955 - "text": "Promotion with Metadata\nAdd description, tags, and license:\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n name=\"experiment-2024-001-v2\", # Override name\n description=\"Sensor measurements from Lab 302\",\n tags=[\"experiment\", \"physics\", \"2024\"],\n license=\"CC-BY-4.0\",\n)\nprint(f\"Published with metadata: {at_uri}\")", 963 + "objectID": "tutorials/atmosphere.html#publish-a-schema", 964 + "href": "tutorials/atmosphere.html#publish-a-schema", 965 + "title": "Atmosphere Publishing", 966 + "section": "Publish a Schema", 967 + "text": "Publish a Schema\nWhen you publish a schema to ATProto, it becomes a public, immutable record that others can reference. The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.\n\nschema_publisher = SchemaPublisher(client)\nschema_uri = schema_publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Demo: Image sample with label and confidence\",\n)\nprint(f\"Schema URI: {schema_uri}\")", 956 968 "crumbs": [ 957 969 "Guide", 958 970 "Getting Started", 959 - "Promotion Workflow" 971 + "Atmosphere Publishing" 960 972 ] 961 973 }, 962 974 { 963 - "objectID": "tutorials/promotion.html#schema-deduplication", 964 - "href": "tutorials/promotion.html#schema-deduplication", 965 - "title": "Promotion Workflow", 966 - "section": "Schema Deduplication", 967 - "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\nfrom atdata.promote import _find_existing_schema\n\n# Check if schema already exists\nexisting = _find_existing_schema(client, \"ExperimentSample\", \"1.0.0\")\nif existing:\n print(f\"Found existing schema: {existing}\")\n print(\"Will reuse instead of republishing\")\nelse:\n print(\"No existing schema found, will publish new one\")\n\nWhen you promote multiple datasets with the same sample type:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)", 975 + "objectID": "tutorials/atmosphere.html#list-your-schemas", 976 + "href": "tutorials/atmosphere.html#list-your-schemas", 977 + "title": "Atmosphere Publishing", 978 + "section": "List Your Schemas", 979 + "text": "List Your Schemas\n\nschema_loader = SchemaLoader(client)\nschemas = schema_loader.list_all(limit=10)\nprint(f\"Found {len(schemas)} schema(s)\")\n\nfor schema in schemas:\n print(f\" - {schema.get('name', 'Unknown')}: v{schema.get('version', '?')}\")", 968 980 "crumbs": [ 969 981 "Guide", 970 982 "Getting Started", 971 - "Promotion Workflow" 983 + "Atmosphere Publishing" 972 984 ] 973 985 }, 974 986 { 975 - 
"objectID": "tutorials/promotion.html#data-migration-options", 976 - "href": "tutorials/promotion.html#data-migration-options", 977 - "title": "Promotion Workflow", 978 - "section": "Data Migration Options", 979 - "text": "Data Migration Options\n\nKeep Existing URLsCopy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\n\nBenefits:\n\nFastest option, no data copying\nDataset record points to existing URLs\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\nBenefits:\n\nData is copied to new bucket\nGood for moving from private to public storage\nOriginal storage can be retired", 987 + "objectID": "tutorials/atmosphere.html#publish-a-dataset", 988 + "href": "tutorials/atmosphere.html#publish-a-dataset", 989 + "title": "Atmosphere Publishing", 990 + "section": "Publish a Dataset", 991 + "text": "Publish a Dataset\n\nWith External URLs\n\ndataset_publisher = DatasetPublisher(client)\ndataset_uri = dataset_publisher.publish_with_urls(\n urls=[\"s3://example-bucket/demo-data-{000000..000009}.tar\"],\n schema_uri=str(schema_uri),\n name=\"Demo Image Dataset\",\n description=\"Example dataset demonstrating atmosphere publishing\",\n tags=[\"demo\", \"images\", \"atdata\"],\n license=\"MIT\",\n)\nprint(f\"Dataset URI: {dataset_uri}\")\n\n\n\nWith PDS Blob Storage (Recommended)\nThe PDSBlobStore is the fully decentralized option: your dataset shards are stored as ATProto blobs directly in your PDS, alongside your other ATProto records. 
This means:\n\nNo external dependencies: Data lives in the same infrastructure as your identity\nContent-addressed: Blobs are identified by their CID, ensuring integrity\nFederated replication: Relays can mirror your blobs for availability\n\nFor fully decentralized storage, use PDSBlobStore to store dataset shards directly as ATProto blobs in your PDS:\n\n# Create store and index with blob storage\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# Create dataset in memory or from existing tar\nsamples = [FeatureSample(features=np.random.randn(64).astype(np.float32), label=i % 10) for i in range(100)]\n\n# Write to temporary tar\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[FeatureSample](\"temp.tar\")\n\n# Publish - shards are uploaded as blobs automatically\nschema_uri = index.publish_schema(FeatureSample, version=\"1.0.0\")\nentry = index.insert_dataset(\n dataset,\n name=\"blob-stored-features\",\n schema_ref=schema_uri,\n description=\"Features stored as PDS blobs\",\n)\n\nprint(f\"Dataset URI: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\") # at://did/blob/cid format\n\n\n\n\n\n\n\nReading Blob-Stored Datasets\n\n\n\nUse BlobSource to stream directly from PDS blobs:\n\n# Create source from the blob URLs\nsource = store.create_source(entry.data_urls)\n\n# Or manually from blob references\nsource = BlobSource.from_refs([\n {\"did\": client.did, \"cid\": \"bafyrei...\"},\n])\n\n# Load and iterate\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n\n\n\n\n\nWith External URLs\nFor larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. 
The ATProto record serves as the index entry while the actual data lives elsewhere.\nFor larger datasets or when using existing object storage:\n\ndataset_publisher = DatasetPublisher(client)\ndataset_uri = dataset_publisher.publish_with_urls(\n urls=[\"s3://example-bucket/demo-data-{000000..000009}.tar\"],\n schema_uri=str(schema_uri),\n name=\"Demo Image Dataset\",\n description=\"Example dataset demonstrating atmosphere publishing\",\n tags=[\"demo\", \"images\", \"atdata\"],\n license=\"MIT\",\n)\nprint(f\"Dataset URI: {dataset_uri}\")", 980 992 "crumbs": [ 981 993 "Guide", 982 994 "Getting Started", 983 - "Promotion Workflow" 995 + "Atmosphere Publishing" 984 996 ] 985 997 }, 986 998 { 987 - "objectID": "tutorials/promotion.html#verify-on-atmosphere", 988 - "href": "tutorials/promotion.html#verify-on-atmosphere", 989 - "title": "Promotion Workflow", 990 - "section": "Verify on Atmosphere", 991 - "text": "Verify on Atmosphere\nAfter promotion, verify the dataset is accessible:\n\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\n\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")\n\n# Load and iterate\nSampleType = atm_index.decode_schema(entry.schema_ref)\nds = atdata.Dataset[SampleType](entry.data_urls[0])\n\nfor batch in ds.ordered(batch_size=32):\n print(f\"Measurement shape: {batch.measurement.shape}\")\n break", 999 + "objectID": "tutorials/atmosphere.html#list-and-load-datasets", 1000 + "href": "tutorials/atmosphere.html#list-and-load-datasets", 1001 + "title": "Atmosphere Publishing", 1002 + "section": "List and Load Datasets", 1003 + "text": "List and Load Datasets\n\ndataset_loader = DatasetLoader(client)\ndatasets = dataset_loader.list_all(limit=10)\nprint(f\"Found {len(datasets)} dataset(s)\")\n\nfor ds in datasets:\n print(f\" - {ds.get('name', 'Unknown')}\")\n print(f\" Schema: {ds.get('schemaRef', 'N/A')}\")\n tags = ds.get('tags', [])\n if tags:\n print(f\" Tags: {', '.join(tags)}\")", 992 1004 "crumbs": [ 993 1005 "Guide", 994 1006 "Getting Started", 995 - "Promotion Workflow" 1007 + "Atmosphere Publishing" 996 1008 ] 997 1009 }, 998 1010 { 999 - "objectID": "tutorials/promotion.html#error-handling", 1000 - "href": "tutorials/promotion.html#error-handling", 1001 - "title": "Promotion Workflow", 1002 - "section": "Error Handling", 1003 - "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(local_entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\n print(\"Publish schema first: local_index.publish_schema(SampleType)\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 1011 + "objectID": "tutorials/atmosphere.html#load-a-dataset", 1012 + "href": "tutorials/atmosphere.html#load-a-dataset", 1013 + "title": "Atmosphere Publishing", 1014 + "section": "Load a Dataset", 1015 + "text": "Load a Dataset\n\n# Check storage type\nstorage_type = dataset_loader.get_storage_type(str(blob_dataset_uri))\nprint(f\"Storage type: {storage_type}\")\n\nif storage_type == \"blobs\":\n blob_urls = dataset_loader.get_blob_urls(str(blob_dataset_uri))\n print(f\"Blob URLs: {len(blob_urls)} blob(s)\")\n\n# Load and iterate (works for both storage types)\nds = dataset_loader.to_dataset(str(blob_dataset_uri), DemoSample)\nfor batch in ds.ordered():\n print(f\"Sample id={batch.id}, text={batch.text}\")", 1016 + "crumbs": [ 1017 + "Guide", 1018 + "Getting 
Started", 1019 + "Atmosphere Publishing" 1020 + ] 1021 + }, 1022 + { 1023 + "objectID": "tutorials/atmosphere.html#complete-publishing-workflow", 1024 + "href": "tutorials/atmosphere.html#complete-publishing-workflow", 1025 + "title": "Atmosphere Publishing", 1026 + "section": "Complete Publishing Workflow", 1027 + "text": "Complete Publishing Workflow\nHere’s the end-to-end workflow for publishing a dataset to the atmosphere:\n\nDefine your sample type using @packable\nCreate samples and write to tar (same as local workflow)\nAuthenticate with your ATProto identity\nCreate index with blob storage (AtmosphereIndex + PDSBlobStore)\nPublish schema (creates ATProto record)\nInsert dataset (uploads blobs, creates dataset record)\n\nNotice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.\nThis example shows the recommended workflow using PDSBlobStore for fully decentralized storage:\n\n# 1. Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Authenticate and create index with blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs automatically)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Data stored at: {entry.data_urls}\") # at://did/blob/cid URLs\n\n# 6. 
Later: load from blobs\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(f\"Loaded batch with {len(batch.label)} samples\")\n break", 1004 1028 "crumbs": [ 1005 1029 "Guide", 1006 1030 "Getting Started", 1007 - "Promotion Workflow" 1031 + "Atmosphere Publishing" 1008 1032 ] 1009 1033 }, 1010 1034 { 1011 - "objectID": "tutorials/promotion.html#requirements-checklist", 1012 - "href": "tutorials/promotion.html#requirements-checklist", 1013 - "title": "Promotion Workflow", 1014 - "section": "Requirements Checklist", 1015 - "text": "Requirements Checklist\nBefore promotion:\n\nDataset is in local index (via LocalIndex.insert_dataset() or LocalIndex.add_entry())\nSchema is published to local index (via LocalIndex.publish_schema())\nAtmosphereClient is authenticated\nData URLs are publicly accessible (or will be copied)", 1035 + "objectID": "tutorials/atmosphere.html#what-youve-learned", 1036 + "href": "tutorials/atmosphere.html#what-youve-learned", 1037 + "title": "Atmosphere Publishing", 1038 + "section": "What You’ve Learned", 1039 + "text": "What You’ve Learned\nYou now understand federated dataset publishing in atdata:\n\n\n\nConcept\nPurpose\n\n\n\n\nAtmosphereClient\nATProto authentication and record management\n\n\nAtmosphereIndex\nFederated index implementing AbstractIndex\n\n\nPDSBlobStore\nPDS blob storage implementing AbstractDataStore\n\n\nBlobSource\nStream datasets from PDS blobs\n\n\nAT URIs\nUniversal identifiers for schemas and datasets\n\n\n\nThe protocol abstractions (AbstractIndex, AbstractDataStore, DataSource) ensure your code works across all three layers of atdata—local files, team storage, and federated sharing.", 1016 1040 "crumbs": [ 1017 1041 "Guide", 1018 1042 "Getting Started", 1019 - "Promotion Workflow" 1043 + "Atmosphere Publishing" 1020 1044 ] 1021 1045 }, 1022 1046 { 1023 - "objectID": "tutorials/promotion.html#complete-workflow", 1024 - "href": "tutorials/promotion.html#complete-workflow", 1025 - "title": "Promotion Workflow", 1026 - "section": "Complete Workflow", 1027 - "text": "Complete Workflow\n\n# Complete local-to-atmosphere workflow\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. Create dataset tar\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Store in local index with S3 backend\nstore = S3DataStore(credentials=\"creds.env\", bucket=\"bucket\")\nlocal_index = LocalIndex(data_store=store)\n\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 4. Publish schema locally\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\n\n# 5. 
Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Others can now discover and load\n# ds = atdata.load_dataset(\"@myhandle.bsky.social/feature-vectors-v1\")", 1047 + "objectID": "tutorials/atmosphere.html#the-full-picture", 1048 + "href": "tutorials/atmosphere.html#the-full-picture", 1049 + "title": "Atmosphere Publishing", 1050 + "section": "The Full Picture", 1051 + "text": "The Full Picture\nYou’ve now seen atdata’s complete architecture:\nLocal Development Team Storage Federation\n───────────────── ──────────── ──────────\ntar files Redis + S3 ATProto PDS\nDataset[T] LocalIndex AtmosphereIndex\n S3DataStore PDSBlobStore\nThe same @packable sample types, the same Dataset[T] iteration patterns, and the same lens transformations work at every layer. Only the storage backend changes.", 1028 1052 "crumbs": [ 1029 1053 "Guide", 1030 1054 "Getting Started", 1031 - "Promotion Workflow" 1055 + "Atmosphere Publishing" 1032 1056 ] 1033 1057 }, 1034 1058 { 1035 - "objectID": "tutorials/promotion.html#next-steps", 1036 - "href": "tutorials/promotion.html#next-steps", 1037 - "title": "Promotion Workflow", 1059 + "objectID": "tutorials/atmosphere.html#next-steps", 1060 + "href": "tutorials/atmosphere.html#next-steps", 1061 + "title": "Atmosphere Publishing", 1038 1062 "section": "Next Steps", 1039 - "text": "Next Steps\n\nAtmosphere Reference - Complete atmosphere API\nProtocols - Abstract interfaces\nLocal Storage - Local storage reference", 1063 + "text": "Next Steps\n\n\n\n\n\n\nAlready Have Local Datasets?\n\n\n\nThe Promotion Workflow tutorial shows how to migrate existing datasets from local storage to the atmosphere without re-processing your data.\n\n\n\nPromotion Workflow - Migrate from local storage to atmosphere\nAtmosphere Reference - Complete API reference\nProtocols - Abstract interfaces", 1040 1064 "crumbs": [ 1041 1065 "Guide", 1042 1066 "Getting Started", 1043 - "Promotion Workflow" 1067 + "Atmosphere Publishing" 1044 1068 ] 1045 1069 }, 1046 1070 { 1047 - "objectID": "api/DatasetDict.html", 1048 - "href": "api/DatasetDict.html", 1049 - "title": "DatasetDict", 1071 + "objectID": "api/SchemaLoader.html", 1072 + "href": "api/SchemaLoader.html", 1073 + "title": "SchemaLoader", 1050 1074 "section": "", 1051 - "text": "DatasetDict(splits=None, sample_type=None, streaming=False)\nA dictionary of split names to Dataset instances.\nSimilar to HuggingFace’s DatasetDict, this provides a container for multiple dataset splits (train, test, validation, etc.) with convenience methods that operate across all splits.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 
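The "Full Picture" entry above claims that only the storage backend changes between layers. Here is a minimal sketch of that claim, using only calls that appear in the tutorials above; the S3 credentials, dataset names, and handle are placeholders taken from those examples:

```python
import atdata
from atdata.local import LocalIndex, S3DataStore

# Layer 2: resolve a dataset name through a team index (Redis + S3), as in
# the Local Workflow tutorial; credentials and bucket are placeholder values.
store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "http://localhost:9000",
        "AWS_ACCESS_KEY_ID": "minioadmin",
        "AWS_SECRET_ACCESS_KEY": "minioadmin",
    },
    bucket="my-bucket",
)
index = LocalIndex(data_store=store)
ds = atdata.load_dataset("@local/my-dataset", index=index, split="train")

# Layer 3: the same entry point against the federation, as shown (commented)
# in the promotion tutorial.
# ds = atdata.load_dataset("@myhandle.bsky.social/feature-vectors-v1")

# Iteration is identical no matter which layer resolved the name.
for batch in ds.shuffled(batch_size=32):
    break
```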
1075 + "text": "atmosphere.SchemaLoader(client)\nLoads PackableSample schemas from ATProto.\nThis class fetches schema records from ATProto and can list available schemas from a repository.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 1052 1076 }, 1053 1077 { 1054 - "objectID": "api/DatasetDict.html#parameters", 1055 - "href": "api/DatasetDict.html#parameters", 1056 - "title": "DatasetDict", 1078 + "objectID": "api/SchemaLoader.html#example", 1079 + "href": "api/SchemaLoader.html#example", 1080 + "title": "SchemaLoader", 1057 1081 "section": "", 1058 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired" 1082 + "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'" 1059 1083 }, 1060 1084 { 1061 - "objectID": "api/DatasetDict.html#example", 1062 - "href": "api/DatasetDict.html#example", 1063 - "title": "DatasetDict", 1085 + "objectID": "api/SchemaLoader.html#methods", 1086 + "href": "api/SchemaLoader.html#methods", 1087 + "title": "SchemaLoader", 1064 1088 "section": "", 1065 - "text": "::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... 
print(f\"{split_name}: {len(dataset.shard_list)} shards\")" 1089 + "text": "Name\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 1066 1090 }, 1067 1091 { 1068 - "objectID": "api/DatasetDict.html#attributes", 1069 - "href": "api/DatasetDict.html#attributes", 1070 - "title": "DatasetDict", 1092 + "objectID": "api/BlobSource.html", 1093 + "href": "api/BlobSource.html", 1094 + "title": "BlobSource", 1071 1095 "section": "", 1072 - "text": "Name\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 1096 + "text": "BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())\nData source for ATProto PDS blob storage.\nStreams dataset shards stored as blobs on an ATProto Personal Data Server. Each shard is identified by a blob reference containing the DID and CID.\nThis source resolves blob references to HTTP URLs and streams the content directly, supporting efficient iteration over shards without downloading everything upfront.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 1073 1097 }, 1074 1098 { 1075 - "objectID": "api/PackableSample.html", 1076 - "href": "api/PackableSample.html", 1077 - "title": "PackableSample", 1099 + "objectID": "api/BlobSource.html#attributes", 1100 + "href": "api/BlobSource.html#attributes", 1101 + "title": "BlobSource", 1078 1102 "section": "", 1079 - "text": "PackableSample()\nBase class for samples that can be serialized with msgpack.\nThis abstract base class provides automatic serialization/deserialization for dataclass-based samples. Fields annotated as NDArray or NDArray | None are automatically converted between numpy arrays and bytes during packing/unpacking.\nSubclasses should be defined either by: 1. Direct inheritance with the @dataclass decorator 2. Using the @packable decorator (recommended)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... 
embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 1103 + "text": "Name\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID." 1080 1104 }, 1081 1105 { 1082 - "objectID": "api/PackableSample.html#example", 1083 - "href": "api/PackableSample.html#example", 1084 - "title": "PackableSample", 1106 + "objectID": "api/BlobSource.html#example", 1107 + "href": "api/BlobSource.html#example", 1108 + "title": "BlobSource", 1085 1109 "section": "", 1086 - "text": "::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize" 1110 + "text": "::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1087 1111 }, 1088 1112 { 1089 - "objectID": "api/PackableSample.html#attributes", 1090 - "href": "api/PackableSample.html#attributes", 1091 - "title": "PackableSample", 1113 + "objectID": "api/BlobSource.html#methods", 1114 + "href": "api/BlobSource.html#methods", 1115 + "title": "BlobSource", 1092 1116 "section": "", 1093 - "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 
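As a small supplement to the BlobSource example above, a sketch of per-shard access through the documented list_shards()/open_shard() methods; the DID and CID are the same placeholders the example uses:

```python
from atdata import BlobSource

# Build a source from simplified blob references (placeholder DID and CID).
source = BlobSource.from_refs(
    [{"did": "did:plc:abc123", "cid": "bafyrei..."}],
)

for shard_id in source.list_shards():
    # shard_id is an AT URI of the form at://did/blob/cid
    stream = source.open_shard(shard_id)  # streaming response body
    head = stream.read(512)               # e.g. peek at the first tar block
    print(shard_id, len(head))
    break
```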
1117 + "text": "Name\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 1094 1118 }, 1095 1119 { 1096 - "objectID": "api/PackableSample.html#methods", 1097 - "href": "api/PackableSample.html#methods", 1098 - "title": "PackableSample", 1120 + "objectID": "api/AtmosphereClient.html", 1121 + "href": "api/AtmosphereClient.html", 1122 + "title": "AtmosphereClient", 1099 1123 "section": "", 1100 - "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 1124 + "text": "atmosphere.AtmosphereClient(base_url=None, *, _client=None)\nATProto client wrapper for atdata operations.\nThis class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'\n\n\n\nThe password should be an app-specific password, not your main account password. 
Create app passwords in your Bluesky account settings.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the 
blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nThe records for this page (the first element of the returned (records, next_cursor) tuple).\n\n\n\nOptional[str]\nThe next pagination cursor; None if there are no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1101 1125 }, 1102 1126 { 1103 - "objectID": "api/PDSBlobStore.html", 1104 - "href": "api/PDSBlobStore.html", 1105 - "title": "PDSBlobStore", 1127 + "objectID": "api/AtmosphereClient.html#example", 1128 + "href": "api/AtmosphereClient.html#example", 1129 + "title": "AtmosphereClient", 1106 1130 "section": "", 1107 - "text": "atmosphere.PDSBlobStore(client)\nPDS blob store implementing AbstractDataStore protocol.\nStores dataset shards as ATProto blobs, enabling decentralized dataset storage on the AT Protocol network.\nEach shard is written to a temporary tar file, then uploaded as a blob to the user’s PDS. 
The returned URLs are AT URIs that can be resolved to HTTP URLs for streaming.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 
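The size-limit note above suggests tuning maxcount/maxsize; here is a sketch under the documented write_shards() signature, assuming an authenticated client and an existing dataset as in the tutorials:

```python
from atdata.atmosphere import PDSBlobStore

store = PDSBlobStore(client)  # assumes an authenticated AtmosphereClient
urls = store.write_shards(
    dataset,                  # an existing atdata.Dataset, as in the tutorials
    prefix="training/v1",
    maxcount=2_000,           # fewer samples per shard than the 10000 default
    maxsize=50_000_000,       # ~50 MB, the conservative end of the quoted range
)
print(urls[0])                # at://{did}/blob/{cid}
```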
1131 + "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'" 1108 1132 }, 1109 1133 { 1110 - "objectID": "api/PDSBlobStore.html#attributes", 1111 - "href": "api/PDSBlobStore.html#attributes", 1112 - "title": "PDSBlobStore", 1134 + "objectID": "api/AtmosphereClient.html#note", 1135 + "href": "api/AtmosphereClient.html#note", 1136 + "title": "AtmosphereClient", 1113 1137 "section": "", 1114 - "text": "Name\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance." 1138 + "text": "The password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings." 1115 1139 }, 1116 1140 { 1117 - "objectID": "api/PDSBlobStore.html#example", 1118 - "href": "api/PDSBlobStore.html#example", 1119 - "title": "PDSBlobStore", 1141 + "objectID": "api/AtmosphereClient.html#attributes", 1142 + "href": "api/AtmosphereClient.html#attributes", 1143 + "title": "AtmosphereClient", 1120 1144 "section": "", 1121 - "text": "::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]" 1145 + "text": "Name\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session." 1122 1146 }, 1123 1147 { 1124 - "objectID": "api/PDSBlobStore.html#methods", 1125 - "href": "api/PDSBlobStore.html#methods", 1126 - "title": "PDSBlobStore", 1148 + "objectID": "api/AtmosphereClient.html#methods", 1149 + "href": "api/AtmosphereClient.html#methods", 1150 + "title": "AtmosphereClient", 1127 1151 "section": "", 1128 - "text": "Name\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar 
archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1152 + "text": "Name\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. 
Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. 
Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1129 1153 }, 1130 1154 { 1131 - "objectID": "api/DictSample.html", 1132 - "href": "api/DictSample.html", 1133 - "title": "DictSample", 1155 + "objectID": "api/load_dataset.html", 1156 + "href": "api/load_dataset.html", 1157 + "title": "load_dataset", 1134 1158 "section": "", 1135 - "text": "DictSample(_data=None, **kwargs)\nDynamic sample type providing dict-like access to raw msgpack data.\nThis class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (sample.field) and dict-style (sample[\"field\"]) access to fields.\nDictSample is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema\nTo convert to a typed schema, use Dataset.as_type() with a @packable-decorated class. Every @packable class automatically registers a lens from DictSample, making this conversion seamless.\n\n\n::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)\n\n\n\nNDArray fields are stored as raw bytes in DictSample. 
They are only converted to numpy arrays when accessed through a typed sample class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1159 + "text": "load_dataset(\n path,\n sample_type=None,\n *,\n split=None,\n data_files=None,\n streaming=False,\n index=None,\n)\nLoad a dataset from local files, remote URLs, or an index.\nThis function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.\nWhen no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. 
Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1136 1160 }, 1137 1161 { 1138 - "objectID": "api/DictSample.html#example", 1139 - "href": "api/DictSample.html#example", 1140 - "title": "DictSample", 1162 + "objectID": "api/load_dataset.html#parameters", 1163 + "href": "api/load_dataset.html#parameters", 1164 + "title": "load_dataset", 1141 1165 "section": "", 1142 - "text": "::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)" 1166 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. 
Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone" 1143 1167 }, 1144 1168 { 1145 - "objectID": "api/DictSample.html#note", 1146 - "href": "api/DictSample.html#note", 1147 - "title": "DictSample", 1169 + "objectID": "api/load_dataset.html#returns", 1170 + "href": "api/load_dataset.html#returns", 1171 + "title": "load_dataset", 1148 1172 "section": "", 1149 - "text": "NDArray fields are stored as raw bytes in DictSample. They are only converted to numpy arrays when accessed through a typed sample class." 1173 + "text": "Name\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample." 1150 1174 }, 1151 1175 { 1152 - "objectID": "api/DictSample.html#attributes", 1153 - "href": "api/DictSample.html#attributes", 1154 - "title": "DictSample", 1176 + "objectID": "api/load_dataset.html#raises", 1177 + "href": "api/load_dataset.html#raises", 1178 + "title": "load_dataset", 1155 1179 "section": "", 1156 - "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 1180 + "text": "Name\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index." 
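The data_files parameter documented above accepts an explicit split-to-files mapping, which none of the bundled examples exercise. A hedged sketch of that form, assuming shards exist at these hypothetical paths and that the returned DatasetDict supports indexing by split name (HF-style, not confirmed here):

    from atdata import load_dataset   # import path assumed

    # Explicit split -> files mapping instead of split auto-detection.
    splits = load_dataset(
        "./data/",
        data_files={
            "train": "./data/train-{000..009}.tar",   # str pattern
            "test": ["./data/test-000.tar"],          # or a list of patterns
        },
    )
    train_ds = splits["train"]   # DatasetDict indexing assumed

    # Passing split= returns a single Dataset instead; per the Raises table,
    # naming a split absent from the mapping raises ValueError.
    test_ds = load_dataset(
        "./data/",
        data_files={"test": "./data/test-000.tar"},
        split="test",
    )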
1157 1181 }, 1158 1182 { 1159 - "objectID": "api/DictSample.html#methods", 1160 - "href": "api/DictSample.html#methods", 1161 - "title": "DictSample", 1183 + "objectID": "api/load_dataset.html#example", 1184 + "href": "api/load_dataset.html#example", 1185 + "title": "load_dataset", 1162 1186 "section": "", 1163 - "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1187 + "text": "::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1164 1188 }, 1165 1189 { 1166 - "objectID": "api/LensLoader.html", 1167 - "href": "api/LensLoader.html", 1168 - "title": "LensLoader", 1190 + "objectID": "api/promote_to_atmosphere.html", 1191 + "href": "api/promote_to_atmosphere.html", 1192 + "title": "promote_to_atmosphere", 1169 1193 "section": "", 1170 - "text": "atmosphere.LensLoader(client)\nLoads lens records from ATProto.\nThis class fetches lens transformation records. 
Note that actually using a lens requires installing the referenced code and importing it manually.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1194 + "text": "promote.promote_to_atmosphere(\n local_entry,\n local_index,\n atmosphere_client,\n *,\n data_store=None,\n name=None,\n description=None,\n tags=None,\n license=None,\n)\nPromote a local dataset to the atmosphere network.\nThis function takes a locally-indexed dataset and publishes it to ATProto, making it discoverable on the federated atmosphere network.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. 
Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1171 1195 }, 1172 1196 { 1173 - "objectID": "api/LensLoader.html#example", 1174 - "href": "api/LensLoader.html#example", 1175 - "title": "LensLoader", 1197 + "objectID": "api/promote_to_atmosphere.html#parameters", 1198 + "href": "api/promote_to_atmosphere.html#parameters", 1199 + "title": "promote_to_atmosphere", 1176 1200 "section": "", 1177 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))" 1201 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone" 1178 1202 }, 1179 1203 { 1180 - "objectID": "api/LensLoader.html#methods", 1181 - "href": "api/LensLoader.html#methods", 1182 - "title": "LensLoader", 1204 + "objectID": "api/promote_to_atmosphere.html#returns", 1205 + "href": "api/promote_to_atmosphere.html#returns", 1206 + "title": "promote_to_atmosphere", 1183 1207 "section": "", 1184 - "text": "Name\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. 
If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1208 + "text": "Name\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record." 1185 1209 }, 1186 1210 { 1187 - "objectID": "api/AtmosphereIndex.html", 1188 - "href": "api/AtmosphereIndex.html", 1189 - "title": "AtmosphereIndex", 1211 + "objectID": "api/promote_to_atmosphere.html#raises", 1212 + "href": "api/promote_to_atmosphere.html#raises", 1213 + "title": "promote_to_atmosphere", 1190 1214 "section": "", 1191 - "text": "atmosphere.AtmosphereIndex(client, *, data_store=None)\nATProto index implementing AbstractIndex protocol.\nWraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.\nOptionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol).\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT 
URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1215 + "text": "Name\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs." 
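A promotion run that strings the documented pieces together: resolve a LocalDatasetEntry from the local index, then publish it with the optional metadata fields. The entry name mirrors the doc example; the import paths and credentials are assumptions:

    from atdata import LocalIndex                        # import paths assumed
    from atdata.atmosphere import AtmosphereClient
    from atdata.promote import promote_to_atmosphere

    client = AtmosphereClient()
    client.login("alice.bsky.social", "app-password")

    local_index = LocalIndex()
    entry = local_index.get_dataset("mnist-train")       # LocalDatasetEntry

    # No data_store given: the entry's existing data_urls are published as-is.
    uri = promote_to_atmosphere(
        entry,
        local_index,
        client,
        description="MNIST training split",
        tags=["vision", "mnist"],
        license="MIT",
    )
    print(uri)   # at://did:plc:.../ac.foundation.dataset.datasetIndex/...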
1192 1216 }, 1193 1217 { 1194 - "objectID": "api/AtmosphereIndex.html#example", 1195 - "href": "api/AtmosphereIndex.html#example", 1196 - "title": "AtmosphereIndex", 1218 + "objectID": "api/promote_to_atmosphere.html#example", 1219 + "href": "api/promote_to_atmosphere.html#example", 1220 + "title": "promote_to_atmosphere", 1197 1221 "section": "", 1198 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")" 1222 + "text": "::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1199 1223 }, 1200 1224 { 1201 - "objectID": "api/AtmosphereIndex.html#attributes", 1202 - "href": "api/AtmosphereIndex.html#attributes", 1203 - "title": "AtmosphereIndex", 1225 + "objectID": "api/SchemaPublisher.html", 1226 + "href": "api/SchemaPublisher.html", 1227 + "title": "SchemaPublisher", 1204 1228 "section": "", 1205 - "text": "Name\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol)." 1229 + "text": "atmosphere.SchemaPublisher(client)\nPublishes PackableSample schemas to ATProto.\nThis class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.\n\n\n::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/...\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 
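The SchemaPublisher example above sticks to the defaults; per the signature, publish also takes description, metadata, and an explicit rkey. A sketch exercising those extras, reusing the example's MySample class (the NDArray import path is an assumption):

    import atdata
    from atdata import NDArray                            # import path assumed
    from atdata.atmosphere import AtmosphereClient, SchemaPublisher

    @atdata.packable
    class MySample:
        image: NDArray
        label: str

    client = AtmosphereClient()
    client.login("alice.bsky.social", "app-password")

    publisher = SchemaPublisher(client)
    uri = publisher.publish(
        MySample,
        version="1.1.0",
        description="Image/label pairs",
        metadata={"modalities": ["image", "text"]},   # arbitrary dict, per the signature
        rkey="my-sample-schema",                      # explicit record key instead of a TID
    )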
1206 1230 }, 1207 1231 { 1208 - "objectID": "api/AtmosphereIndex.html#methods", 1209 - "href": "api/AtmosphereIndex.html#methods", 1210 - "title": "AtmosphereIndex", 1232 + "objectID": "api/SchemaPublisher.html#example", 1233 + "href": "api/SchemaPublisher.html#example", 1234 + "title": "SchemaPublisher", 1211 1235 "section": "", 1212 - "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. 
Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1236 + "text": "::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/..." 1213 1237 }, 1214 1238 { 1215 - "objectID": "api/DataSource.html", 1216 - "href": "api/DataSource.html", 1217 - "title": "DataSource", 1239 + "objectID": "api/SchemaPublisher.html#methods", 1240 + "href": "api/SchemaPublisher.html#methods", 1241 + "title": "SchemaPublisher", 1218 1242 "section": "", 1219 - "text": "DataSource()\nProtocol for data sources that provide streams to Dataset.\nA DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)\nThe key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. 
Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1243 + "text": "Name\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 1220 1244 }, 1221 1245 { 1222 - "objectID": "api/DataSource.html#example", 1223 - "href": "api/DataSource.html#example", 1224 - "title": "DataSource", 1246 + "objectID": "api/DatasetPublisher.html", 1247 + "href": "api/DatasetPublisher.html", 1248 + "title": "DatasetPublisher", 1225 1249 "section": "", 1226 - "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)" 1250 + "text": "atmosphere.DatasetPublisher(client)\nPublishes dataset index records to ATProto.\nThis class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.\n\n\n::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. 
If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 
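publish_with_urls is the route for registering shards that already live in external storage, with no Dataset object in hand. A short sketch; client and schema_uri are assumed to come from a prior login and SchemaPublisher.publish call, and the bucket URL is hypothetical:

    publisher = DatasetPublisher(client)
    uri = publisher.publish_with_urls(
        ["s3://bucket/data-{000000..000009}.tar"],   # brace notation, as documented
        schema_uri,                                  # AT URI of a published schema
        name="existing-shards",
        description="Shards already hosted in object storage",
        license="Apache-2.0",
    )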
1227 1251 }, 1228 1252 { 1229 - "objectID": "api/DataSource.html#attributes", 1230 - "href": "api/DataSource.html#attributes", 1231 - "title": "DataSource", 1253 + "objectID": "api/DatasetPublisher.html#example", 1254 + "href": "api/DatasetPublisher.html#example", 1255 + "title": "DatasetPublisher", 1232 1256 "section": "", 1233 - "text": "Name\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard." 1257 + "text": "::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )" 1234 1258 }, 1235 1259 { 1236 - "objectID": "api/DataSource.html#methods", 1237 - "href": "api/DataSource.html#methods", 1238 - "title": "DataSource", 1260 + "objectID": "api/DatasetPublisher.html#methods", 1261 + "href": "api/DatasetPublisher.html#methods", 1262 + "title": "DatasetPublisher", 1239 1263 "section": "", 1240 - "text": "Name\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1264 + "text": "Name\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. 
If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 
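The DataSource protocol described in the entries above pins down three members: a lazy shards iterator of (identifier, stream) pairs, list_shards for metadata queries without opening streams, and open_shard for the random access that PyTorch DataLoader worker splitting needs. A hypothetical conforming implementation, assuming the protocol is structural; LocalDirSource is not part of the library:

    from pathlib import Path
    from typing import IO, Iterator

    class LocalDirSource:
        """Hypothetical DataSource: serves every .tar under a directory."""

        def __init__(self, root: str) -> None:
            self._paths = {p.name: p for p in sorted(Path(root).glob("*.tar"))}

        @property
        def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
            # Lazily yield (identifier, stream) pairs, per the protocol.
            for name in self._paths:
                yield name, self.open_shard(name)

        def list_shards(self) -> list[str]:
            # Identifiers must match what `shards` yields.
            return list(self._paths)

        def open_shard(self, shard_id: str) -> IO[bytes]:
            # KeyError on an unknown id is the documented contract.
            if shard_id not in self._paths:
                raise KeyError(shard_id)
            return self._paths[shard_id].open("rb")

If the protocol is indeed structural, such a source would be passed to a Dataset the same way as the built-in ones, e.g. Dataset[MySample](LocalDirSource("./data")).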
1241 1265 }, 1242 1266 { 1243 - "objectID": "api/DatasetLoader.html", 1244 - "href": "api/DatasetLoader.html", 1245 - "title": "DatasetLoader", 1267 + "objectID": "api/URLSource.html", 1268 + "href": "api/URLSource.html", 1269 + "title": "URLSource", 1246 1270 "section": "", 1247 - "text": "atmosphere.DatasetLoader(client)\nLoads dataset records from ATProto.\nThis class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or 
“blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1271 + "text": "URLSource(url)\nData source for WebDataset-compatible URLs.\nWraps WebDataset’s gopen to open URLs using built-in handlers for http, https, pipe, gs, hf, sftp, etc. Supports brace expansion for shard patterns like “data-{000..099}.tar”.\nThis is the default source type when a string URL is passed to Dataset.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 
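Because list_shards() and open_shard() give random access, a consumer can expand the brace pattern once and then stream only the shards assigned to it. A sketch under the documented contract; the worker split below is illustrative and not part of URLSource::

    source = URLSource("https://example.com/train-{000..009}.tar")

    shard_ids = source.list_shards()   # brace pattern expanded to ten URLs
    worker_id, num_workers = 0, 4      # placeholder worker assignment
    for shard_id in shard_ids[worker_id::num_workers]:
        stream = source.open_shard(shard_id)   # IO[bytes] from gopen
        header = stream.read(512)              # e.g. peek at the first tar header
        stream.close()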
1248 1272 }, 1249 1273 { 1250 - "objectID": "api/DatasetLoader.html#example", 1251 - "href": "api/DatasetLoader.html#example", 1252 - "title": "DatasetLoader", 1274 + "objectID": "api/URLSource.html#attributes", 1275 + "href": "api/URLSource.html#attributes", 1276 + "title": "URLSource", 1253 1277 "section": "", 1254 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")" 1278 + "text": "Name\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards." 1255 1279 }, 1256 1280 { 1257 - "objectID": "api/DatasetLoader.html#methods", 1258 - "href": "api/DatasetLoader.html#methods", 1259 - "title": "DatasetLoader", 1281 + "objectID": "api/URLSource.html#example", 1282 + "href": "api/URLSource.html#example", 1283 + "title": "URLSource", 1260 1284 "section": "", 1261 - "text": "Name\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset 
record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1285 + "text": "::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")" 1262 1286 }, 1263 1287 { 1264 - "objectID": "api/Lens.html", 1265 - "href": "api/Lens.html", 1266 - "title": "lens", 1288 + "objectID": "api/URLSource.html#methods", 1289 + "href": "api/URLSource.html#methods", 1290 + "title": "URLSource", 1267 1291 "section": "", 1268 - "text": "lens\nLens-based type transformations for datasets.\nThis module implements a lens system for bidirectional transformations between different sample types. Lenses enable viewing a dataset through different type schemas without duplicating the underlying data.\nKey components:\n\nLens: Bidirectional transformation with getter (S -&gt; V) and optional putter (V, S -&gt; S)\nLensNetwork: Global singleton registry for lens transformations\n@lens: Decorator to create and register lens transformations\n\nLenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).\n\n\n::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... 
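Since DatasetLoader.get_urls and get_blob_urls each raise ValueError for the other storage type, callers that accept arbitrary records can branch on get_storage_type first. A sketch, assuming an authenticated client and a record URI (both placeholders), with MySampleType standing in for the sample class matching the record's schema::

    loader = DatasetLoader(client)

    if loader.get_storage_type(uri) == "external":
        urls = loader.get_urls(uri)        # WebDataset URLs stored in the record itself
    else:
        urls = loader.get_blob_urls(uri)   # URLs resolved via the record's PDS endpoint

    # Or skip URL handling and materialize a typed Dataset directly:
    dataset = loader.to_dataset(uri, MySampleType)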
def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. 
It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented.\n\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1292 + "text": "Name\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 
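The GetPut and PutGet laws mentioned above can be checked directly once a getter/putter pair is registered. A sketch assuming packable and lens are importable from the package top level, as the API index lists them, and that @packable types keep dataclass equality::

    from atdata import lens, packable  # import path assumed from the API index

    @packable
    class FullData:
        name: str
        age: int

    @packable
    class NameOnly:
        name: str

    @lens
    def name_view(full: FullData) -> NameOnly:
        return NameOnly(name=full.name)

    @name_view.putter
    def name_view_put(view: NameOnly, source: FullData) -> FullData:
        return FullData(name=view.name, age=source.age)

    s = FullData(name="ada", age=36)
    # GetPut: writing back an unmodified view recovers the source.
    assert name_view.put(name_view.get(s), s) == s
    # PutGet: reading after a put returns exactly the view that was put.
    v = NameOnly(name="grace")
    assert name_view.get(name_view.put(v, s)) == v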
1269 1293 }, 1270 1294 { 1271 - "objectID": "api/Lens.html#example", 1272 - "href": "api/Lens.html#example", 1273 - "title": "lens", 1295 + "objectID": "api/index.html", 1296 + "href": "api/index.html", 1297 + "title": "API Reference", 1274 1298 "section": "", 1275 - "text": "::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens" 1299 + "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances.\n\n\n\n\n\n\nAbstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset.\n\n\n\n\n\n\nData source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage.\n\n\n\n\n\n\nLocal Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore protocol.\n\n\n\n\n\n\nATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI.\n\n\n\n\n\n\nLocal to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 
1276 1300 }, 1277 1301 { 1278 - "objectID": "api/Lens.html#classes", 1279 - "href": "api/Lens.html#classes", 1280 - "title": "lens", 1302 + "objectID": "api/index.html#core", 1303 + "href": "api/index.html#core", 1304 + "title": "API Reference", 1281 1305 "section": "", 1282 - "text": "Name\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. 
It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented." 1306 + "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances." 1283 1307 }, 1284 1308 { 1285 - "objectID": "api/Lens.html#functions", 1286 - "href": "api/Lens.html#functions", 1287 - "title": "lens", 1309 + "objectID": "api/index.html#protocols", 1310 + "href": "api/index.html#protocols", 1311 + "title": "API Reference", 1288 1312 "section": "", 1289 - "text": "Name\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... 
def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1313 + "text": "Abstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset." 1290 1314 }, 1291 1315 { 1292 - "objectID": "api/local.Index.html", 1293 - "href": "api/local.Index.html", 1294 - "title": "local.Index", 1316 + "objectID": "api/index.html#data-sources", 1317 + "href": "api/index.html#data-sources", 1318 + "title": "API Reference", 1295 1319 "section": "", 1296 - "text": "local.Index(\n redis=None,\n data_store=None,\n auto_stubs=False,\n stub_dir=None,\n **kwargs,\n)\nRedis-backed index for tracking datasets in a repository.\nImplements the AbstractIndex protocol. Maintains a registry of LocalDatasetEntry objects in Redis, allowing enumeration and lookup of stored datasets.\nWhen initialized with a data_store, insert_dataset() will write dataset shards to storage before indexing. Without a data_store, insert_dataset() only indexes existing URLs.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. 
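DataSource is a structural protocol, so a new backend only needs the members documented above (shards, list_shards, open_shard) and no base class. A self-contained sketch over a local directory of .tar shards; the class name and layout are illustrative, not part of the library::

    from pathlib import Path
    from typing import IO, Iterator

    class DirectorySource:
        """Sketch of the DataSource protocol over a directory of .tar shards."""

        def __init__(self, root: str) -> None:
            self._root = Path(root)

        def list_shards(self) -> list[str]:
            # Identifiers must match what `shards` yields.
            return sorted(p.name for p in self._root.glob("*.tar"))

        def open_shard(self, shard_id: str) -> IO[bytes]:
            if shard_id not in self.list_shards():
                raise KeyError(shard_id)  # documented contract for unknown ids
            return (self._root / shard_id).open("rb")

        @property
        def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
            # Lazily yield (identifier, stream) pairs, one shard at a time.
            return ((sid, self.open_shard(sid)) for sid in self.list_shards())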
If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. 
Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. 
Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 
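The publish_schema / load_schema round trip above needs no shared imports between writer and reader. A sketch assuming a reachable Redis behind local.Index's defaults and that `local` is importable from the package top level; the MySample fields mirror the text/value example in these docs::

    from atdata import local, packable  # import path assumed from the API index

    @packable
    class MySample:
        text: str
        value: int

    index = local.Index()  # assumes default Redis connection settings

    ref = index.publish_schema(MySample, version="1.0.0",
                               description="Demo schema")
    print(ref)  # atdata://local/sampleSchema/MySample@1.0.0

    # A process that never imported MySample can rebuild the type:
    Decoded = index.load_schema(ref)
    sample = Decoded(text="hello", value=42)
    # ...and the type is also reachable as index.types.MySample afterwards.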
1320 + "text": "Data source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage." 1297 1321 }, 1298 1322 { 1299 - "objectID": "api/local.Index.html#attributes", 1300 - "href": "api/local.Index.html#attributes", 1301 - "title": "local.Index", 1323 + "objectID": "api/index.html#local-storage", 1324 + "href": "api/index.html#local-storage", 1325 + "title": "API Reference", 1302 1326 "section": "", 1303 - "text": "Name\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards." 1327 + "text": "Local Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore protocol." 1304 1328 }, 1305 1329 { 1306 - "objectID": "api/local.Index.html#methods", 1307 - "href": "api/local.Index.html#methods", 1308 - "title": "local.Index", 1330 + "objectID": "api/index.html#atmosphere", 1331 + "href": "api/index.html#atmosphere", 1332 + "title": "API Reference", 1309 1333 "section": "", 1310 - "text": "Name\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. 
If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. 
Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. 
Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 
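With a data store attached, insert_dataset both uploads shards and indexes the resulting URLs; without one it only indexes URLs the dataset already has. A sketch, assuming `store` is an already-configured local.S3DataStore (its constructor arguments are omitted here) and MySample is a packable type as above::

    index = local.Index(data_store=store)

    ds = Dataset[MySample]("data-{000000..000009}.tar")
    entry = index.insert_dataset(
        ds,
        name="demo-set",
        prefix="demo",               # storage prefix; defaults to the dataset name
        metadata={"split": "train"},
    )

    # Entries round-trip through the documented IndexEntry properties.
    assert index.get_entry_by_name("demo-set").name == entry.name
    for e in index.list_datasets():
        print(e.name, e.schema_ref)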
1334 + "text": "ATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI." 1311 1335 }, 1312 1336 { 1313 - "objectID": "api/Dataset.html", 1314 - "href": "api/Dataset.html", 1315 - "title": "Dataset", 1337 + "objectID": "api/index.html#promotion", 1338 + "href": "api/index.html#promotion", 1339 + "title": "API Reference", 1316 1340 "section": "", 1317 - "text": "Dataset(source=None, metadata_url=None, *, url=None)\nA typed dataset built on WebDataset with lens transformations.\nThis class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. Samples are stored as msgpack-serialized data within WebDataset shards.\nThe dataset supports: - Ordered and shuffled iteration - Automatic batching with SampleBatch - Type transformations via the lens system (as_type()) - Export to parquet format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. 
Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. 
For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1341 + "text": "Local to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 1318 1342 }, 1319 1343 { 1320 - "objectID": "api/Dataset.html#parameters", 1321 - "href": "api/Dataset.html#parameters", 1322 - "title": "Dataset", 1344 + "objectID": "api/IndexEntry.html", 1345 + "href": "api/IndexEntry.html", 1346 + "title": "IndexEntry", 1323 1347 "section": "", 1324 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired" 1348 + "text": "IndexEntry()\nCommon interface for index entries (local or atmosphere).\nBoth LocalDatasetEntry and atmosphere DatasetRecord-based entries should satisfy this protocol, enabling code that works with either.\n\n\nname: Human-readable dataset name\nschema_ref: Reference to schema (local:// path or AT URI)\ndata_urls: WebDataset URLs for the data\nmetadata: Arbitrary metadata dict, or None\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1325 1349 }, 1326 1350 { 1327 - "objectID": "api/Dataset.html#attributes", 1328 - "href": "api/Dataset.html#attributes", 1329 - "title": "Dataset", 1351 + "objectID": "api/IndexEntry.html#properties", 1352 + "href": "api/IndexEntry.html#properties", 1353 + "title": "IndexEntry", 1330 1354 "section": "", 1331 - "text": "Name\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s)."
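Because both LocalDatasetEntry and the atmosphere-backed entries satisfy the four-member IndexEntry protocol described above, downstream code can stay backend-agnostic. A minimal sketch in the docs' doctest style (the summarize helper is hypothetical; name, schema_ref, and data_urls are the documented protocol members, and index is assumed to be any connected AbstractIndex implementation)::

>>> def summarize(entry: IndexEntry) -> str:
...     # relies only on protocol members, so it works with either backend
...     return f"{entry.name}: {len(entry.data_urls)} shard(s) -> {entry.schema_ref}"
>>> for entry in index.list_datasets():
...     print(summarize(entry))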
1355 + "text": "name: Human-readable dataset name\nschema_ref: Reference to schema (local:// path or AT URI)\ndata_urls: WebDataset URLs for the data\nmetadata: Arbitrary metadata dict, or None" 1332 1356 }, 1333 1357 { 1334 - "objectID": "api/Dataset.html#example", 1335 - "href": "api/Dataset.html#example", 1336 - "title": "Dataset", 1358 + "objectID": "api/IndexEntry.html#attributes", 1359 + "href": "api/IndexEntry.html#attributes", 1360 + "title": "IndexEntry", 1337 1361 "section": "", 1338 - "text": "::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)" 1362 + "text": "Name\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1339 1363 }, 1340 1364 { 1341 - "objectID": "api/Dataset.html#note", 1342 - "href": "api/Dataset.html#note", 1343 - "title": "Dataset", 1365 + "objectID": "api/S3Source.html", 1366 + "href": "api/S3Source.html", 1367 + "title": "S3Source", 1344 1368 "section": "", 1345 - "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class." 1369 + "text": "S3Source(\n    bucket,\n    keys,\n    endpoint=None,\n    access_key=None,\n    secret_key=None,\n    region=None,\n    _client=None,\n)\nData source for S3-compatible storage with explicit credentials.\nUses boto3 to stream directly from S3, supporting: - Standard AWS S3 - S3-compatible endpoints (Cloudflare R2, MinIO, etc.) - Private buckets with credentials - IAM role authentication (when keys not provided)\nUnlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n...     bucket=\"my-datasets\",\n...     keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n...     endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n...     access_key=\"AKIAIOSFODNN7EXAMPLE\",\n...     secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n...     )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1346 1370 }, 1347 1371 { 1348 - "objectID": "api/Dataset.html#methods", 1349 - "href": "api/Dataset.html#methods", 1350 - "title": "Dataset", 1372 + "objectID": "api/S3Source.html#attributes", 1373 + "href": "api/S3Source.html#attributes", 1374 + "title": "S3Source", 1351 1375 "section": "", 1352 - "text": "Name\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. 
Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. 
For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1376 + "text": "Name\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1)." 1353 1377 }, 1354 1378 { 1355 - "objectID": "api/AbstractDataStore.html", 1356 - "href": "api/AbstractDataStore.html", 1357 - "title": "AbstractDataStore", 1379 + "objectID": "api/S3Source.html#example", 1380 + "href": "api/S3Source.html#example", 1381 + "title": "S3Source", 1358 1382 "section": "", 1359 - "text": "AbstractDataStore()\nProtocol for data storage operations.\nThis protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)\nThe separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.\n\n\n::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). 
This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1383 + "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1360 1384 }, 1361 1385 { 1362 - "objectID": "api/AbstractDataStore.html#example", 1363 - "href": "api/AbstractDataStore.html#example", 1364 - "title": "AbstractDataStore", 1386 + "objectID": "api/S3Source.html#methods", 1387 + "href": "api/S3Source.html#methods", 1388 + "title": "S3Source", 1365 1389 "section": "", 1366 - "text": "::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]" 1390 + "text": "Name\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. 
All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1367 1391 }, 1368 1392 { 1369 - "objectID": "api/AbstractDataStore.html#methods", 1370 - "href": "api/AbstractDataStore.html#methods", 1371 - "title": "AbstractDataStore", 1393 + "objectID": "api/local.LocalDatasetEntry.html", 1394 + "href": "api/local.LocalDatasetEntry.html", 1395 + "title": "local.LocalDatasetEntry", 1372 1396 "section": "", 1373 - "text": "Name\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1397 + "text": "local.LocalDatasetEntry(\n name,\n schema_ref,\n data_urls,\n metadata=None,\n _cid=None,\n _legacy_uuid=None,\n)\nIndex entry for a dataset stored in the local repository.\nImplements the IndexEntry protocol for compatibility with AbstractIndex. 
Uses dual identity: a content-addressable CID (ATProto-compatible) and a human-readable name.\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring the same data produces the same CID whether stored locally or in the atmosphere. This enables seamless promotion from local to ATProto.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1374 1398 }, 1375 1399 { 1376 - "objectID": "api/local.S3DataStore.html", 1377 - "href": "api/local.S3DataStore.html", 1378 - "title": "local.S3DataStore", 1400 + "objectID": "api/local.LocalDatasetEntry.html#attributes", 1401 + "href": "api/local.LocalDatasetEntry.html#attributes", 1402 + "title": "local.LocalDatasetEntry", 1379 1403 "section": "", 1380 - "text": "local.S3DataStore(credentials, *, bucket)\nS3-compatible data store implementing AbstractDataStore protocol.\nHandles writing dataset shards to S3-compatible object storage and resolving URLs for reading.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -&gt; ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., 
‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1404 + "text": "Name\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set." 1381 1405 }, 1382 1406 { 1383 - "objectID": "api/local.S3DataStore.html#attributes", 1384 - "href": "api/local.S3DataStore.html#attributes", 1385 - "title": "local.S3DataStore", 1407 + "objectID": "api/local.LocalDatasetEntry.html#methods", 1408 + "href": "api/local.LocalDatasetEntry.html#methods", 1409 + "title": "local.LocalDatasetEntry", 1386 1410 "section": "", 1387 - "text": "Name\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance." 1411 + "text": "Name\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1388 1412 }, 1389 1413 { 1390 - "objectID": "api/local.S3DataStore.html#methods", 1391 - "href": "api/local.S3DataStore.html#methods", 1392 - "title": "local.S3DataStore", 1414 + "objectID": "api/AbstractIndex.html", 1415 + "href": "api/AbstractIndex.html", 1416 + "title": "AbstractIndex", 1393 1417 "section": "", 1394 - "text": "Name\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -&gt; ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset 
shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1418 + "text": "AbstractIndex()\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\nThis protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses\nA single index can hold datasets of many different sample types. The sample type is tracked via schema references, not as a generic parameter on the index.\n\n\nSome index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution.\n\n\n\n::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. 
The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). 
Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1395 1419 }, 1396 1420 { 1397 - "objectID": "api/AtUri.html", 1398 - "href": "api/AtUri.html", 1399 - "title": "AtUri", 1421 + "objectID": "api/AbstractIndex.html#optional-extensions", 1422 + "href": "api/AbstractIndex.html#optional-extensions", 1423 + "title": "AbstractIndex", 1400 1424 "section": "", 1401 - "text": "atmosphere.AtUri(authority, collection, rkey)\nParsed AT Protocol URI.\nAT URIs follow the format: at:////\n\n\n::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1425 + "text": "Some index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution." 1402 1426 }, 1403 1427 { 1404 - "objectID": "api/AtUri.html#example", 1405 - "href": "api/AtUri.html#example", 1406 - "title": "AtUri", 1428 + "objectID": "api/AbstractIndex.html#example", 1429 + "href": "api/AbstractIndex.html#example", 1430 + "title": "AbstractIndex", 1407 1431 "section": "", 1408 - "text": "::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'" 1432 + "text": "::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... 
print(f\"{entry.name} -&gt; {entry.schema_ref}\")" 1409 1433 }, 1410 1434 { 1411 - "objectID": "api/AtUri.html#attributes", 1412 - "href": "api/AtUri.html#attributes", 1413 - "title": "AtUri", 1435 + "objectID": "api/AbstractIndex.html#attributes", 1436 + "href": "api/AbstractIndex.html#attributes", 1437 + "title": "AbstractIndex", 1414 1438 "section": "", 1415 - "text": "Name\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection." 1439 + "text": "Name\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index." 1416 1440 }, 1417 1441 { 1418 - "objectID": "api/AtUri.html#methods", 1419 - "href": "api/AtUri.html#methods", 1420 - "title": "AtUri", 1442 + "objectID": "api/AbstractIndex.html#methods", 1443 + "href": "api/AbstractIndex.html#methods", 1444 + "title": "AbstractIndex", 1421 1445 "section": "", 1422 - "text": "Name\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1446 + "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... 
print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). 
Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1423 1447 }, 1424 1448 { 1425 - "objectID": "api/Packable-protocol.html", 1426 - "href": "api/Packable-protocol.html", 1427 - "title": "Packable", 1449 + "objectID": "api/AtmosphereIndexEntry.html", 1450 + "href": "api/AtmosphereIndexEntry.html", 1451 + "title": "AtmosphereIndexEntry", 1428 1452 "section": "", 1429 - "text": "Packable()\nStructural protocol for packable sample types.\nThis protocol allows classes decorated with @packable to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.\nBoth PackableSample subclasses and @packable-decorated classes satisfy this protocol structurally.\nThe protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1453 + "text": "atmosphere.AtmosphereIndexEntry(uri, record)\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1430 1454 }, 1431 1455 { 1432 - "objectID": "api/Packable-protocol.html#example", 1433 - "href": "api/Packable-protocol.html#example", 1434 - "title": "Packable", 1456 + "objectID": "api/AtmosphereIndexEntry.html#attributes", 1457 + "href": "api/AtmosphereIndexEntry.html#attributes", 1458 + "title": "AtmosphereIndexEntry", 1459 + "section": "", 1460 + "text": "Name\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1461 + }, 1462 + { 1463 + "objectID": "api/LensPublisher.html", 1464 + "href": "api/LensPublisher.html", 1465 + "title": "LensPublisher", 1466 + "section": "", 1467 + "text": "atmosphere.LensPublisher(client)\nPublishes Lens transformation records to ATProto.\nThis class creates lens records that reference source and target schemas and point to the transformation code in a git repository.\n\n\n::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... 
return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )\n\n\n\nLens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1468 + }, 1469 + { 1470 + "objectID": "api/LensPublisher.html#example", 1471 + "href": "api/LensPublisher.html#example", 1472 + "title": "LensPublisher", 1473 + "section": "", 1474 + "text": "::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... 
return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )" 1475 + }, 1476 + { 1477 + "objectID": "api/LensPublisher.html#security-note", 1478 + "href": "api/LensPublisher.html#security-note", 1479 + "title": "LensPublisher", 1435 1480 "section": "", 1436 - "text": "::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)" 1481 + "text": "Lens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations." 1437 1482 }, 1438 1483 { 1439 - "objectID": "api/Packable-protocol.html#attributes", 1440 - "href": "api/Packable-protocol.html#attributes", 1441 - "title": "Packable", 1484 + "objectID": "api/LensPublisher.html#methods", 1485 + "href": "api/LensPublisher.html#methods", 1486 + "title": "LensPublisher", 1442 1487 "section": "", 1443 - "text": "Name\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes." 
1488 + "text": "Name\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1444 1489 }, 1445 1490 { 1446 - "objectID": "api/Packable-protocol.html#methods", 1447 - "href": "api/Packable-protocol.html#methods", 1448 - "title": "Packable", 1491 + "objectID": "api/SampleBatch.html", 1492 + "href": "api/SampleBatch.html", 1493 + "title": "SampleBatch", 1449 1494 "section": "", 1450 - "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1495 + "text": "SampleBatch(samples)\nA batch of samples with automatic attribute aggregation.\nThis class wraps a sequence of samples and provides magic __getattr__ access to aggregate sample attributes. When you access an attribute that exists on the sample type, it automatically aggregates values across all samples in the batch.\nNDArray fields are stacked into a numpy array with a batch dimension. 
Other fields are aggregated into a list.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1451 1496 }, 1452 1497 { 1453 - "objectID": "api/packable.html", 1454 - "href": "api/packable.html", 1455 - "title": "packable", 1498 + "objectID": "api/SampleBatch.html#parameters", 1499 + "href": "api/SampleBatch.html#parameters", 1500 + "title": "SampleBatch", 1456 1501 "section": "", 1457 - "text": "packable(cls)\nDecorator to convert a regular class into a PackableSample.\nThis decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.\nThe resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures.\n\n\n\n\n\n\nThis is a test of the functionality::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1502 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired" 1458 1503 }, 1459 1504 { 1460 - "objectID": "api/packable.html#parameters", 1461 - "href": "api/packable.html#parameters", 1462 - "title": "packable", 1505 + "objectID": "api/SampleBatch.html#attributes", 1506 + "href": "api/SampleBatch.html#attributes", 1507 + "title": "SampleBatch", 1463 1508 "section": "", 1464 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired" 1509 + "text": "Name\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch." 1465 1510 }, 1466 1511 { 1467 - "objectID": "api/packable.html#returns", 1468 - "href": "api/packable.html#returns", 1469 - "title": "packable", 1512 + "objectID": "api/SampleBatch.html#example", 1513 + "href": "api/SampleBatch.html#example", 1514 + "title": "SampleBatch", 1470 1515 "section": "", 1471 - "text": "Name\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. 
The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures." 1516 + "text": "::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names" 1472 1517 }, 1473 1518 { 1474 - "objectID": "api/packable.html#examples", 1475 - "href": "api/packable.html#examples", 1476 - "title": "packable", 1519 + "objectID": "api/SampleBatch.html#note", 1520 + "href": "api/SampleBatch.html#note", 1521 + "title": "SampleBatch", 1477 1522 "section": "", 1478 - "text": "This is a test of the functionality::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1523 + "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1479 1524 }, 1480 1525 { 1481 1526 "objectID": "index.html", ··· 1489 1534 ] 1490 1535 }, 1491 1536 { 1537 + "objectID": "index.html#the-challenge", 1538 + "href": "index.html#the-challenge", 1539 + "title": "atdata", 1540 + "section": "The Challenge", 1541 + "text": "The Challenge\nMachine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means:\n\nRuntime surprises: Discovering a field is missing or has the wrong type during training\nCopy-paste schemas: Redefining the same sample structure across notebooks and scripts\nStorage silos: Data stuck in one location, invisible to collaborators\nDiscovery friction: No standard way to find datasets across teams or organizations\n\natdata solves these problems with a simple idea: typed, serializable samples that flow seamlessly from local development to team storage to federated sharing.", 1542 + "crumbs": [ 1543 + "Guide", 1544 + "atdata" 1545 + ] 1546 + }, 1547 + { 1492 1548 "objectID": "index.html#what-is-atdata", 1493 1549 "href": "index.html#what-is-atdata", 1494 1550 "title": "atdata", 1495 1551 "section": "What is atdata?", 1496 - "text": "What is atdata?\natdata provides a typed dataset abstraction for machine learning workflows with:\n\n\nTyped Samples\nDefine dataclass-based sample types with automatic msgpack serialization.\n\n\nNDArray Handling\nTransparent numpy array conversion with efficient byte serialization.\n\n\nLens Transformations\nView datasets through different schemas without duplicating data.\n\n\nBatch Aggregation\nAutomatic numpy stacking for NDArray fields during iteration.\n\n\nWebDataset Integration\nEfficient large-scale storage with streaming tar file support.\n\n\nATProto Federation\nPublish and discover datasets on the decentralized AT Protocol network.", 1552 + "text": "What is atdata?\natdata is a Python library that combines:\n\n\nTyped Samples\nDefine dataclass-based sample types with automatic msgpack serialization. Catch schema errors at definition time, not training time.\n\n\nEfficient Storage\nBuilt on WebDataset’s proven tar-based format. Stream large datasets without downloading everything first.\n\n\nLens Transformations\nView datasets through different schemas without duplicating data. 
Perfect for feature extraction, schema migration, and multi-task learning.\n\n\nBatch Aggregation\nAutomatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train.\n\n\nTeam Storage\nRedis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery.\n\n\nATProto Federation\nPublish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure.", 1553 + "crumbs": [ 1554 + "Guide", 1555 + "atdata" 1556 + ] 1557 + }, 1558 + { 1559 + "objectID": "index.html#the-architecture", 1560 + "href": "index.html#the-architecture", 1561 + "title": "atdata", 1562 + "section": "The Architecture", 1563 + "text": "The Architecture\natdata provides a three-layer progression for your datasets:\n┌─────────────────────────────────────────────────────────────┐\n│ Federation: ATProto Atmosphere │\n│ Decentralized discovery, cross-org sharing │\n└─────────────────────────────────────────────────────────────┘\n ↑ promote\n┌─────────────────────────────────────────────────────────────┐\n│ Team Storage: Redis + S3 │\n│ Shared index, versioned schemas, S3 data │\n└─────────────────────────────────────────────────────────────┘\n ↑ insert\n┌─────────────────────────────────────────────────────────────┐\n│ Local Development │\n│ Typed samples, WebDataset files, fast iteration │\n└─────────────────────────────────────────────────────────────┘\nStart local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs.", 1497 1564 "crumbs": [ 1498 1565 "Guide", 1499 1566 "atdata" ··· 1515 1582 "href": "index.html#quick-example", 1516 1583 "title": "atdata", 1517 1584 "section": "Quick Example", 1518 - "text": "Quick Example\n\nDefine a Sample Type\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\n\n\nCreate and Write Samples\n\nimport webdataset as wds\n\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n )\n for _ in range(100)\n]\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\n\n\nLoad and Iterate\n\ndataset = atdata.Dataset[ImageSample](\"data-000000.tar\")\n\n# Iterate with batching\nfor batch in dataset.shuffled(batch_size=32):\n images = batch.image # numpy array (32, 224, 224, 3)\n labels = batch.label # list of 32 strings\n confs = batch.confidence # list of 32 floats", 1585 + "text": "Quick Example\n\n1. Define a Sample Type\nThe @packable decorator creates a serializable dataclass:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray # Automatically handled as bytes\n label: str\n confidence: float\n\n\n\n2. Create and Write Samples\nUse WebDataset’s standard TarWriter:\n\nimport webdataset as wds\n\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n )\n for _ in range(100)\n]\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\n\n\n3. 
Load and Iterate with Type Safety\nThe generic Dataset[T] provides typed access:\n\ndataset = atdata.Dataset[ImageSample](\"data-000000.tar\")\n\nfor batch in dataset.shuffled(batch_size=32):\n images = batch.image # numpy array (32, 224, 224, 3)\n labels = batch.label # list of 32 strings\n confs = batch.confidence # list of 32 floats", 1519 1586 "crumbs": [ 1520 1587 "Guide", 1521 1588 "atdata" 1522 1589 ] 1523 1590 }, 1524 1591 { 1525 - "objectID": "index.html#huggingface-style-loading", 1526 - "href": "index.html#huggingface-style-loading", 1592 + "objectID": "index.html#scaling-up", 1593 + "href": "index.html#scaling-up", 1527 1594 "title": "atdata", 1528 - "section": "HuggingFace-Style Loading", 1529 - "text": "HuggingFace-Style Loading\n\n# Load from local path\nds = atdata.load_dataset(\"path/to/data-{000000..000009}.tar\", split=\"train\")\n\n# Load with split detection\nds_dict = atdata.load_dataset(\"path/to/data/\")\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]", 1595 + "section": "Scaling Up", 1596 + "text": "Scaling Up\n\nTeam Storage with Redis + S3\nWhen you’re ready to share with your team:\n\nfrom atdata.local import LocalIndex, S3DataStore\n\n# Connect to team infrastructure\nstore = S3DataStore(\n credentials={\"AWS_ENDPOINT\": \"http://localhost:9000\", ...},\n bucket=\"team-datasets\",\n)\nindex = LocalIndex(data_store=store)\n\n# Publish schema for consistency\nindex.publish_schema(ImageSample, version=\"1.0.0\")\n\n# Insert dataset (writes to S3, indexes in Redis)\ndataset = atdata.Dataset[ImageSample](\"data.tar\")\nentry = index.insert_dataset(dataset, name=\"training-images-v1\")\n\n# Team members can now discover and load\n# ds = atdata.load_dataset(\"@local/training-images-v1\", index=index)\n\n\n\nFederation with ATProto\nFor public or cross-organization sharing:\n\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nfrom atdata.promote import promote_to_atmosphere\n\n# Authenticate with your ATProto identity\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Option 1: Promote existing local dataset\nentry = index.get_dataset(\"training-images-v1\")\nat_uri = promote_to_atmosphere(entry, index, client)\n\n# Option 2: Publish directly with blob storage\nstore = PDSBlobStore(client)\natm_index = AtmosphereIndex(client, data_store=store)\natm_index.insert_dataset(dataset, name=\"public-images\", schema_ref=schema_uri)", 1530 1597 "crumbs": [ 1531 1598 "Guide", 1532 1599 "atdata" 1533 1600 ] 1534 1601 }, 1535 1602 { 1536 - "objectID": "index.html#local-storage-with-redis-s3", 1537 - "href": "index.html#local-storage-with-redis-s3", 1603 + "objectID": "index.html#huggingface-style-loading", 1604 + "href": "index.html#huggingface-style-loading", 1538 1605 "title": "atdata", 1539 - "section": "Local Storage with Redis + S3", 1540 - "text": "Local Storage with Redis + S3\n\nfrom atdata.local import LocalIndex, S3DataStore\nimport webdataset as wds\n\n# Create samples and write to local tar\nwith wds.writer.TarWriter(\"data.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# Set up index with S3 data store\nstore = S3DataStore(\n credentials={\"AWS_ENDPOINT\": \"http://localhost:9000\", ...},\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(data_store=store) # Connects to Redis\n\n# Insert dataset (writes to S3, indexes in Redis)\ndataset = atdata.Dataset[ImageSample](\"data.tar\")\nentry = index.insert_dataset(dataset, 
name=\"my-dataset\")\nprint(f\"Stored at: {entry.data_urls}\")", 1606 + "section": "HuggingFace-Style Loading", 1607 + "text": "HuggingFace-Style Loading\nFor convenient access to datasets:\n\nfrom atdata import load_dataset\n\n# Load from local files\nds = load_dataset(\"path/to/data-{000000..000009}.tar\")\n\n# Load with split detection\nds_dict = load_dataset(\"path/to/data/\")\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]\n\n# Load from index\nds = load_dataset(\"@local/my-dataset\", index=index)", 1541 1608 "crumbs": [ 1542 1609 "Guide", 1543 1610 "atdata" 1544 1611 ] 1545 1612 }, 1546 1613 { 1547 - "objectID": "index.html#publish-to-atproto-federation", 1548 - "href": "index.html#publish-to-atproto-federation", 1614 + "objectID": "index.html#why-atdata", 1615 + "href": "index.html#why-atdata", 1549 1616 "title": "atdata", 1550 - "section": "Publish to ATProto Federation", 1551 - "text": "Publish to ATProto Federation\n\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\n\n# Authenticate\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Promote local dataset to federation\nentry = index.get_dataset(\"my-dataset\")\nat_uri = promote_to_atmosphere(entry, index, client)\nprint(f\"Published at: {at_uri}\")", 1617 + "section": "Why atdata?", 1618 + "text": "Why atdata?\n\n\n\n\n\n\n\nNeed\nSolution\n\n\n\n\nType-safe samples\n@packable decorator, PackableSample base class\n\n\nEfficient large-scale storage\nWebDataset tar format, streaming iteration\n\n\nSchema flexibility\nLens transformations, DictSample for exploration\n\n\nTeam collaboration\nRedis index, S3 data store, schema registry\n\n\nPublic sharing\nATProto federation, content-addressable CIDs\n\n\nMultiple backends\nProtocol abstractions (AbstractIndex, DataSource)", 1552 1619 "crumbs": [ 1553 1620 "Guide", 1554 1621 "atdata" ··· 1559 1626 "href": "index.html#next-steps", 1560 1627 "title": "atdata", 1561 1628 "section": "Next Steps", 1562 - "text": "Next Steps\n\nQuick Start Tutorial - Get up and running in 5 minutes\nPackable Samples - Learn about typed sample definitions\nDatasets - Master dataset iteration and batching\nAtmosphere - Publish to the ATProto federation", 1629 + "text": "Next Steps\n\n\n\n\n\n\nGetting Started\n\n\n\nNew to atdata? Start with the Quick Start Tutorial to learn the basics of typed samples and datasets.\n\n\n\nArchitecture Overview - Understand the design and how components fit together\nLocal Workflow - Set up team storage with Redis + S3\nAtmosphere Publishing - Share datasets on the ATProto network\nPackable Samples - Deep dive into sample type definitions\nDatasets - Master iteration, batching, and transformations", 1563 1630 "crumbs": [ 1564 1631 "Guide", 1565 1632 "atdata" 1566 1633 ] 1567 1634 }, 1568 1635 { 1569 - "objectID": "api/SampleBatch.html", 1570 - "href": "api/SampleBatch.html", 1571 - "title": "SampleBatch", 1636 + "objectID": "api/packable.html", 1637 + "href": "api/packable.html", 1638 + "title": "packable", 1572 1639 "section": "", 1573 - "text": "SampleBatch(samples)\nA batch of samples with automatic attribute aggregation.\nThis class wraps a sequence of samples and provides magic __getattr__ access to aggregate sample attributes. When you access an attribute that exists on the sample type, it automatically aggregates values across all samples in the batch.\nNDArray fields are stacked into a numpy array with a batch dimension. 
Other fields are aggregated into a list.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1640 + "text": "packable(cls)\nDecorator to convert a regular class into a PackableSample.\nThis decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.\nThe resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures.\n\n\n\n\n\n\nBasic round-trip usage::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1574 1641 }, 1575 1642 { 1576 - "objectID": "api/SampleBatch.html#parameters", 1577 - "href": "api/SampleBatch.html#parameters", 1578 - "title": "SampleBatch", 1643 + "objectID": "api/packable.html#parameters", 1644 + "href": "api/packable.html#parameters", 1645 + "title": "packable", 1579 1646 "section": "", 1580 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired" 1647 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired" 1581 1648 }, 1582 1649 { 1583 - "objectID": "api/SampleBatch.html#attributes", 1584 - "href": "api/SampleBatch.html#attributes", 1585 - "title": "SampleBatch", 1650 + "objectID": "api/packable.html#returns", 1651 + "href": "api/packable.html#returns", 1652 + "title": "packable", 1586 1653 "section": "", 1587 - "text": "Name\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch." 1654 + "text": "Name\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures."
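The `@packable` round-trip documented above can be checked end to end using only the documented `packed`, `from_bytes`, and `as_wds` members; the `np.array_equal` assertions here are our addition to the docstring example:

```python
import numpy as np
from numpy.typing import NDArray

import atdata

@atdata.packable
class MyData:
    name: str
    values: NDArray

sample = MyData(name="test", values=np.array([1, 2, 3]))

# Round-trip through msgpack bytes; the NDArray field comes back as an array.
restored = MyData.from_bytes(sample.packed)
assert restored.name == sample.name
assert np.array_equal(restored.values, sample.values)

# as_wds yields the WebDataset-ready mapping (key plus msgpack payload).
wds_sample = sample.as_wds
```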
1588 1655 }, 1589 1656 { 1590 - "objectID": "api/SampleBatch.html#example", 1591 - "href": "api/SampleBatch.html#example", 1592 - "title": "SampleBatch", 1657 + "objectID": "api/packable.html#examples", 1658 + "href": "api/packable.html#examples", 1659 + "title": "packable", 1593 1660 "section": "", 1594 - "text": "::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names" 1661 + "text": "Basic round-trip usage::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1595 1662 }, 1596 1663 { 1597 - "objectID": "api/SampleBatch.html#note", 1598 - "href": "api/SampleBatch.html#note", 1599 - "title": "SampleBatch", 1664 + "objectID": "api/Packable-protocol.html", 1665 + "href": "api/Packable-protocol.html", 1666 + "title": "Packable", 1600 1667 "section": "", 1601 - "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1668 + "text": "Packable()\nStructural protocol for packable sample types.\nThis protocol allows classes decorated with @packable to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.\nBoth PackableSample subclasses and @packable-decorated classes satisfy this protocol structurally.\nThe protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1602 1669 }, 1603 1670 { 1604 - "objectID": "api/LensPublisher.html", 1605 - "href": "api/LensPublisher.html", 1606 - "title": "LensPublisher", 1671 + "objectID": "api/Packable-protocol.html#example", 1672 + "href": "api/Packable-protocol.html#example", 1673 + "title": "Packable", 1607 1674 "section": "", 1608 - "text": "atmosphere.LensPublisher(client)\nPublishes Lens transformation records to ATProto.\nThis class creates lens records that reference source and target schemas and point to the transformation code in a git repository.\n\n\n::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... 
return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )\n\n\n\nLens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1675 + "text": "::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... 
print(instance.packed)" 1609 1676 }, 1610 1677 { 1611 - "objectID": "api/LensPublisher.html#example", 1612 - "href": "api/LensPublisher.html#example", 1613 - "title": "LensPublisher", 1678 + "objectID": "api/Packable-protocol.html#attributes", 1679 + "href": "api/Packable-protocol.html#attributes", 1680 + "title": "Packable", 1614 1681 "section": "", 1615 - "text": "::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )" 1682 + "text": "Name\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes." 1616 1683 }, 1617 1684 { 1618 - "objectID": "api/LensPublisher.html#security-note", 1619 - "href": "api/LensPublisher.html#security-note", 1620 - "title": "LensPublisher", 1685 + "objectID": "api/Packable-protocol.html#methods", 1686 + "href": "api/Packable-protocol.html#methods", 1687 + "title": "Packable", 1621 1688 "section": "", 1622 - "text": "Lens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations." 1689 + "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 
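Since the protocol pairs the serializers (`packed`, `as_wds`) with the constructors documented above (`from_bytes`, `from_data`), generic helpers can be written once against `Type[Packable]`. A sketch; the top-level import location of `Packable` is an assumption based on the names used in these pages:

```python
from typing import Iterable, Type

from atdata import Packable  # import location assumed from these docs

def decode_all(sample_type: Type[Packable], blobs: Iterable[bytes]) -> list:
    """Decode raw msgpack payloads into typed samples via from_bytes."""
    return [sample_type.from_bytes(b) for b in blobs]

def reencode(sample_type: Type[Packable], data: dict) -> bytes:
    """Build a sample from an unpacked msgpack dict, then re-serialize it."""
    return sample_type.from_data(data).packed
```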
1623 1690 }, 1624 1691 { 1625 - "objectID": "api/LensPublisher.html#methods", 1626 - "href": "api/LensPublisher.html#methods", 1627 - "title": "LensPublisher", 1692 + "objectID": "api/AtUri.html", 1693 + "href": "api/AtUri.html", 1694 + "title": "AtUri", 1628 1695 "section": "", 1629 - "text": "Name\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 
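The AtUri record documented next exposes exactly three components, and `parse` raises `ValueError` on malformed input, so a defensive wrapper is short. A sketch; the import path is assumed from the `atmosphere.*` qualified names used in these pages:

```python
from atdata.atmosphere import AtUri  # import path assumed

def collection_of(uri: str) -> str | None:
    """Return the NSID collection of an AT URI, or None if malformed."""
    try:
        return AtUri.parse(uri).collection
    except ValueError:  # documented for invalid URI formats
        return None

assert collection_of(
    "at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz"
) == "ac.foundation.dataset.sampleSchema"
assert collection_of("not-an-at-uri") is None
```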
1696 + "text": "atmosphere.AtUri(authority, collection, rkey)\nParsed AT Protocol URI.\nAT URIs follow the format: at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\n\n\n::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1630 1697 }, 1631 1698 { 1632 - "objectID": "api/AtmosphereIndexEntry.html", 1633 - "href": "api/AtmosphereIndexEntry.html", 1634 - "title": "AtmosphereIndexEntry", 1699 + "objectID": "api/AtUri.html#example", 1700 + "href": "api/AtUri.html#example", 1701 + "title": "AtUri", 1635 1702 "section": "", 1636 - "text": "atmosphere.AtmosphereIndexEntry(uri, record)\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1703 + "text": "::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'" 1637 1704 }, 1638 1705 { 1639 - "objectID": "api/AtmosphereIndexEntry.html#attributes", 1640 - "href": "api/AtmosphereIndexEntry.html#attributes", 1641 - "title": "AtmosphereIndexEntry", 1706 + "objectID": "api/AtUri.html#attributes", 1707 + "href": "api/AtUri.html#attributes", 1708 + "title": "AtUri", 1642 1709 "section": "", 1643 - "text": "Name\nType\nDescription\n\n\n\n\n_uri\n\nAT URI of the record.\n\n\n_record\n\nRaw record dictionary." 1710 + "text": "Name\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection." 1644 1711 }, 1645 1712 { 1646 - "objectID": "api/AbstractIndex.html", 1647 - "href": "api/AbstractIndex.html", 1648 - "title": "AbstractIndex", 1713 + "objectID": "api/AtUri.html#methods", 1714 + "href": "api/AtUri.html#methods", 1715 + "title": "AtUri", 1649 1716 "section": "", 1650 - "text": "AbstractIndex()\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\nThis protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses\nA single index can hold datasets of many different sample types. The sample type is tracked via schema references, not as a generic parameter on the index.\n\n\nSome index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution.\n\n\n\n::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... 
# Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. 
If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1717 + "text": "Name\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1651 1718 }, 1652 1719 { 1653 - "objectID": "api/AbstractIndex.html#optional-extensions", 1654 - "href": "api/AbstractIndex.html#optional-extensions", 1655 - "title": "AbstractIndex", 1720 + "objectID": "api/local.S3DataStore.html", 1721 + "href": "api/local.S3DataStore.html", 1722 + "title": "local.S3DataStore", 1656 1723 "section": "", 1657 - "text": "Some index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution." 
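Tying the AbstractIndex methods above together, a sketch of the full publish/insert/recover cycle against a LocalIndex; the MinIO-style endpoint, placeholder credentials, and PointSample type are illustrative only:

```python
import atdata
from atdata.local import LocalIndex, S3DataStore

@atdata.packable
class PointSample:  # hypothetical sample type for this sketch
    x: float
    y: float

store = S3DataStore(
    credentials={
        "AWS_ACCESS_KEY_ID": "minioadmin",        # placeholder credentials
        "AWS_SECRET_ACCESS_KEY": "minioadmin",
        "AWS_ENDPOINT": "http://localhost:9000",  # MinIO-style endpoint
    },
    bucket="team-datasets",
)
index = LocalIndex(data_store=store)

# Publish the schema, then register a dataset against it.
index.publish_schema(PointSample, version="1.0.0")
entry = index.insert_dataset(
    atdata.Dataset[PointSample]("points-000000.tar"), name="points"
)

# Later, without importing PointSample: rebuild the type from its schema.
SampleType = index.decode_schema(index.get_dataset("points").schema_ref)
ds = atdata.Dataset[SampleType](entry.data_urls[0])
```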
1724 + "text": "local.S3DataStore(credentials, *, bucket)\nS3-compatible data store implementing AbstractDataStore protocol.\nHandles writing dataset shards to S3-compatible object storage and resolving URLs for reading.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -&gt; ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1658 1725 }, 1659 1726 { 1660 - "objectID": "api/AbstractIndex.html#example", 1661 - "href": "api/AbstractIndex.html#example", 1662 - "title": "AbstractIndex", 1727 + "objectID": "api/local.S3DataStore.html#attributes", 1728 + "href": "api/local.S3DataStore.html#attributes", 1729 + "title": "local.S3DataStore", 1663 1730 "section": "", 1664 - "text": "::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")" 1731 + "text": "Name\nType\nDescription\n\n\n\n\ncredentials\n\nS3 credentials dictionary.\n\n\nbucket\n\nTarget bucket name.\n\n\n_fs\n\nS3FileSystem instance." 
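The write/read split documented for local.S3DataStore above can be exercised end to end; a sketch in which the sample type, shard path, endpoint, and credentials are all placeholders:

```python
import atdata
from atdata.local import S3DataStore

@atdata.packable
class Reading:  # hypothetical sample type for this sketch
    value: float

dataset = atdata.Dataset[Reading]("readings-000000.tar")  # placeholder shard

store = S3DataStore(
    credentials={
        "AWS_ACCESS_KEY_ID": "minioadmin",        # placeholder credentials
        "AWS_SECRET_ACCESS_KEY": "minioadmin",
        "AWS_ENDPOINT": "http://localhost:9000",  # custom endpoint
    },
    bucket="team-datasets",
)

# write_shards returns s3:// URLs; maxcount is forwarded to wds.ShardWriter.
urls = store.write_shards(dataset, prefix="datasets/readings/v1", maxcount=10_000)

# With a custom endpoint configured, read_url rewrites s3:// to a
# streamable HTTPS URL; on plain AWS S3 it returns the URL unchanged.
readable = [store.read_url(u) for u in urls]
assert store.supports_streaming()
```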
1665 1732 }, 1666 1733 { 1667 - "objectID": "api/AbstractIndex.html#attributes", 1668 - "href": "api/AbstractIndex.html#attributes", 1669 - "title": "AbstractIndex", 1734 + "objectID": "api/local.S3DataStore.html#methods", 1735 + "href": "api/local.S3DataStore.html#methods", 1736 + "title": "local.S3DataStore", 1670 1737 "section": "", 1671 - "text": "Name\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index." 1738 + "text": "Name\nDescription\n\n\n\n\nread_url\nResolve an S3 URL for reading/streaming.\n\n\nsupports_streaming\nS3 supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to S3.\n\n\n\n\n\nlocal.S3DataStore.read_url(url)\nResolve an S3 URL for reading/streaming.\nFor S3-compatible stores with custom endpoints (like Cloudflare R2, MinIO, etc.), converts s3:// URLs to HTTPS URLs that WebDataset can stream directly.\nFor standard AWS S3 (no custom endpoint), URLs are returned unchanged since WebDataset’s built-in s3fs integration handles them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nS3 URL to resolve (e.g., ‘s3://bucket/path/file.tar’).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTPS URL if custom endpoint is configured, otherwise unchanged.\n\n\nExample\nstr\n‘s3://bucket/path’ -&gt; ‘https://endpoint.com/bucket/path’\n\n\n\n\n\n\n\nlocal.S3DataStore.supports_streaming()\nS3 supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\nlocal.S3DataStore.write_shards(ds, *, prefix, cache_local=False, **kwargs)\nWrite dataset shards to S3.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix within bucket (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\ncache_local\nbool\nIf True, write locally first then copy to S3.\nFalse\n\n\n**kwargs\n\nAdditional args passed to wds.ShardWriter (e.g., maxcount).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of S3 URLs for the written shards.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nRuntimeError\nIf no shards were written." 1672 1739 }, 1673 1740 { 1674 - "objectID": "api/AbstractIndex.html#methods", 1675 - "href": "api/AbstractIndex.html#methods", 1676 - "title": "AbstractIndex", 1741 + "objectID": "api/AbstractDataStore.html", 1742 + "href": "api/AbstractDataStore.html", 1743 + "title": "AbstractDataStore", 1677 1744 "section": "", 1678 - "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. 
The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). 
Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1679 - }, 1680 - { 1681 - "objectID": "api/local.LocalDatasetEntry.html", 1682 - "href": "api/local.LocalDatasetEntry.html", 1683 - "title": "local.LocalDatasetEntry", 1684 - "section": "", 1685 - "text": "local.LocalDatasetEntry(\n name,\n schema_ref,\n data_urls,\n metadata=None,\n _cid=None,\n _legacy_uuid=None,\n)\nIndex entry for a dataset stored in the local repository.\nImplements the IndexEntry protocol for compatibility with AbstractIndex. Uses dual identity: a content-addressable CID (ATProto-compatible) and a human-readable name.\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring the same data produces the same CID whether stored locally or in the atmosphere. This enables seamless promotion from local to ATProto.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1745 + "text": "AbstractDataStore()\nProtocol for data storage operations.\nThis protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)\nThe separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.\n\n\n::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). 
This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1686 1746 }, 1687 1747 { 1688 - "objectID": "api/local.LocalDatasetEntry.html#attributes", 1689 - "href": "api/local.LocalDatasetEntry.html#attributes", 1690 - "title": "local.LocalDatasetEntry", 1748 + "objectID": "api/AbstractDataStore.html#example", 1749 + "href": "api/AbstractDataStore.html#example", 1750 + "title": "AbstractDataStore", 1691 1751 "section": "", 1692 - "text": "Name\nType\nDescription\n\n\n\n\nname\nstr\nHuman-readable name for this dataset.\n\n\nschema_ref\nstr\nReference to the schema for this dataset.\n\n\ndata_urls\nlist[str]\nWebDataset URLs for the data.\n\n\nmetadata\ndict | None\nArbitrary metadata dictionary, or None if not set." 1752 + "text": "::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]" 1693 1753 }, 1694 1754 { 1695 - "objectID": "api/local.LocalDatasetEntry.html#methods", 1696 - "href": "api/local.LocalDatasetEntry.html#methods", 1697 - "title": "local.LocalDatasetEntry", 1755 + "objectID": "api/AbstractDataStore.html#methods", 1756 + "href": "api/AbstractDataStore.html#methods", 1757 + "title": "AbstractDataStore", 1698 1758 "section": "", 1699 - "text": "Name\nDescription\n\n\n\n\nfrom_redis\nLoad an entry from Redis by CID.\n\n\nwrite_to\nPersist this index entry to Redis.\n\n\n\n\n\nlocal.LocalDatasetEntry.from_redis(redis, cid)\nLoad an entry from Redis by CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to read from.\nrequired\n\n\ncid\nstr\nContent identifier of the entry to load.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry loaded from Redis.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.LocalDatasetEntry.write_to(redis)\nPersist this index entry to Redis.\nStores the entry as a Redis hash with key ‘{REDIS_KEY_DATASET_ENTRY}:{cid}’.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nredis\nRedis\nRedis connection to write to.\nrequired" 1759 + "text": "Name\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform 
URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1700 1760 }, 1701 1761 { 1702 - "objectID": "api/S3Source.html", 1703 - "href": "api/S3Source.html", 1704 - "title": "S3Source", 1762 + "objectID": "api/Dataset.html", 1763 + "href": "api/Dataset.html", 1764 + "title": "Dataset", 1705 1765 "section": "", 1706 - "text": "S3Source(\n bucket,\n keys,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n _client=None,\n)\nData source for S3-compatible storage with explicit credentials.\nUses boto3 to stream directly from S3, supporting: - Standard AWS S3 - S3-compatible endpoints (Cloudflare R2, MinIO, etc.) - Private buckets with credentials - IAM role authentication (when keys not provided)\nUnlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1766 + "text": "Dataset(source=None, metadata_url=None, *, url=None)\nA typed dataset built on WebDataset with lens transformations.\nThis class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. Samples are stored as msgpack-serialized data within WebDataset shards.\nThe dataset supports: - Ordered and shuffled iteration - Automatic batching with SampleBatch - Type transformations via the lens system (as_type()) - Export to parquet format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... 
embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). 
Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1707 1767 }, 1708 1768 { 1709 - "objectID": "api/S3Source.html#attributes", 1710 - "href": "api/S3Source.html#attributes", 1711 - "title": "S3Source", 1769 + "objectID": "api/Dataset.html#parameters", 1770 + "href": "api/Dataset.html#parameters", 1771 + "title": "Dataset", 1712 1772 "section": "", 1713 - "text": "Name\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1)." 
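The Dataset entries above describe typed construction via the subscripted Dataset[MyType](url) syntax and the chunked to_parquet export. As a minimal sketch of the memory-safe export path, assuming the top-level atdata namespace exports Dataset and packable, and using a hypothetical LogSample type and shard URL:

    from atdata import Dataset, packable  # assumed import path

    @packable
    class LogSample:  # hypothetical sample type, for illustration only
        message: str
        level: int

    # Subscripted construction is required: Dataset reads its type
    # parameter from __orig_class__ at runtime (see the note above).
    ds = Dataset[LogSample]("logs-{000000..000009}.tar")

    # With maxcount=None, to_parquet materializes the whole dataset as one
    # pandas DataFrame; capping samples per file keeps memory bounded and
    # writes output-000000.parquet, output-000001.parquet, ...
    ds.to_parquet("output.parquet", maxcount=10_000)

Each chunk is an independent parquet file, so downstream tools can read one segment at a time without loading the full dataset.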
1773 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired" 1714 1774 }, 1715 1775 { 1716 - "objectID": "api/S3Source.html#example", 1717 - "href": "api/S3Source.html#example", 1718 - "title": "S3Source", 1776 + "objectID": "api/Dataset.html#attributes", 1777 + "href": "api/Dataset.html#attributes", 1778 + "title": "Dataset", 1719 1779 "section": "", 1720 - "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1780 + "text": "Name\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s)." 1721 1781 }, 1722 1782 { 1723 - "objectID": "api/S3Source.html#methods", 1724 - "href": "api/S3Source.html#methods", 1725 - "title": "S3Source", 1783 + "objectID": "api/Dataset.html#example", 1784 + "href": "api/Dataset.html#example", 1785 + "title": "Dataset", 1726 1786 "section": "", 1727 - "text": "Name\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... 
)\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1787 + "text": "::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)" 1728 1788 }, 1729 1789 { 1730 - "objectID": "api/IndexEntry.html", 1731 - "href": "api/IndexEntry.html", 1732 - "title": "IndexEntry", 1790 + "objectID": "api/Dataset.html#note", 1791 + "href": "api/Dataset.html#note", 1792 + "title": "Dataset", 1733 1793 "section": "", 1734 - "text": "IndexEntry()\nCommon interface for index entries (local or atmosphere).\nBoth LocalDatasetEntry and atmosphere DatasetRecord-based entries should satisfy this protocol, enabling code that works with either.\n\n\nname: Human-readable dataset name schema_ref: Reference to schema (local:// path or AT URI) data_urls: WebDataset URLs for the data metadata: Arbitrary metadata dict, or None\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1794 + "text": "This class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class." 1735 1795 }, 1736 1796 { 1737 - "objectID": "api/IndexEntry.html#properties", 1738 - "href": "api/IndexEntry.html#properties", 1739 - "title": "IndexEntry", 1797 + "objectID": "api/Dataset.html#methods", 1798 + "href": "api/Dataset.html#methods", 1799 + "title": "Dataset", 1740 1800 "section": "", 1741 - "text": "name: Human-readable dataset name schema_ref: Reference to schema (local:// path or AT URI) data_urls: WebDataset URLs for the data metadata: Arbitrary metadata dict, or None" 1801 + "text": "Name\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. 
Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. 
For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1742 1802 }, 1743 1803 { 1744 - "objectID": "api/IndexEntry.html#attributes", 1745 - "href": "api/IndexEntry.html#attributes", 1746 - "title": "IndexEntry", 1804 + "objectID": "api/local.Index.html", 1805 + "href": "api/local.Index.html", 1806 + "title": "local.Index", 1747 1807 "section": "", 1748 - "text": "Name\nDescription\n\n\n\n\ndata_urls\nWebDataset URLs for the data.\n\n\nmetadata\nArbitrary metadata dictionary, or None if not set.\n\n\nname\nHuman-readable dataset name.\n\n\nschema_ref\nReference to the schema for this dataset." 1808 + "text": "local.Index(\n redis=None,\n data_store=None,\n auto_stubs=False,\n stub_dir=None,\n **kwargs,\n)\nRedis-backed index for tracking datasets in a repository.\nImplements the AbstractIndex protocol. Maintains a registry of LocalDatasetEntry objects in Redis, allowing enumeration and lookup of stored datasets.\nWhen initialized with a data_store, insert_dataset() will write dataset shards to storage before indexing. 
Without a data_store, insert_dataset() only indexes existing URLs.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. 
The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; 
index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1749 1809 }, 1750 1810 { 1751 - "objectID": "api/index.html", 1752 - "href": "api/index.html", 1753 - "title": "API Reference", 1811 + "objectID": "api/local.Index.html#attributes", 1812 + "href": "api/local.Index.html#attributes", 1813 + "title": "local.Index", 1754 1814 "section": "", 1755 - "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances.\n\n\n\n\n\n\nAbstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset.\n\n\n\n\n\n\nData source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage.\n\n\n\n\n\n\nLocal Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore 
protocol.\n\n\n\n\n\n\nATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI.\n\n\n\n\n\n\nLocal to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 1815 + "text": "Name\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards." 1756 1816 }, 1757 1817 { 1758 - "objectID": "api/index.html#core", 1759 - "href": "api/index.html#core", 1760 - "title": "API Reference", 1818 + "objectID": "api/local.Index.html#methods", 1819 + "href": "api/local.Index.html#methods", 1820 + "title": "local.Index", 1761 1821 "section": "", 1762 - "text": "Core types, decorators, and dataset classes\n\n\n\npackable\nDecorator to convert a regular class into a PackableSample.\n\n\nPackableSample\nBase class for samples that can be serialized with msgpack.\n\n\nDictSample\nDynamic sample type providing dict-like access to raw msgpack data.\n\n\nDataset\nA typed dataset built on WebDataset with lens transformations.\n\n\nSampleBatch\nA batch of samples with automatic attribute aggregation.\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nlens\nLens-based type transformations for datasets.\n\n\nload_dataset\nLoad a dataset from local files, remote URLs, or an index.\n\n\nDatasetDict\nA dictionary of split names to Dataset instances." 
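Pulling the local.Index methods documented above into a single flow, here is a hedged sketch of the schema round trip; the atdata.local import path and the MySample fields are assumptions for illustration, while the calls mirror the documented signatures:

    from atdata import packable            # assumed import path
    from atdata.local import Index         # assumed home of local.Index

    @packable
    class MySample:                        # illustrative schema only
        text: str
        value: int

    index = Index()  # all-default construction, per the documented signature

    # Returns a reference like 'atdata://local/sampleSchema/MySample@1.0.0';
    # omitting version would auto-increment from the latest published one.
    ref = index.publish_schema(MySample, version="1.0.0")

    # load_schema rebuilds the type from the stored record and also
    # registers it as index.types.MySample for later use.
    Decoded = index.load_schema(ref)
    sample = Decoded(text="hello", value=42)

Because load_schema needs only the reference string, a consumer can decode and instantiate samples without importing the original class definition.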
1822 + "text": "Name\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. 
Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. 
Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1763 1823 }, 1764 1824 { 1765 - "objectID": "api/index.html#protocols", 1766 - "href": "api/index.html#protocols", 1767 - "title": "API Reference", 1825 + "objectID": "api/Lens.html", 1826 + "href": "api/Lens.html", 1827 + "title": "lens", 1768 1828 "section": "", 1769 - "text": "Abstract protocols for storage backends\n\n\n\nPackable\nStructural protocol for packable sample types.\n\n\nIndexEntry\nCommon interface for index entries (local or atmosphere).\n\n\nAbstractIndex\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\n\n\nAbstractDataStore\nProtocol for data storage operations.\n\n\nDataSource\nProtocol for data sources that provide streams to Dataset." 1829 + "text": "lens\nLens-based type transformations for datasets.\nThis module implements a lens system for bidirectional transformations between different sample types. Lenses enable viewing a dataset through different type schemas without duplicating the underlying data.\nKey components:\n\nLens: Bidirectional transformation with getter (S -&gt; V) and optional putter (V, S -&gt; S)\nLensNetwork: Global singleton registry for lens transformations\n@lens: Decorator to create and register lens transformations\n\nLenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).\n\n\n::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... 
embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. 
Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented.\n\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1770 1830 }, 1771 1831 { 1772 - "objectID": "api/index.html#data-sources", 1773 - "href": "api/index.html#data-sources", 1774 - "title": "API Reference", 1832 + "objectID": "api/Lens.html#example", 1833 + "href": "api/Lens.html#example", 1834 + "title": "lens", 1775 1835 "section": "", 1776 - "text": "Data source implementations for streaming\n\n\n\nURLSource\nData source for WebDataset-compatible URLs.\n\n\nS3Source\nData source for S3-compatible storage with explicit credentials.\n\n\nBlobSource\nData source for ATProto PDS blob storage." 1836 + "text": "::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... 
embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens" 1777 1837 }, 1778 1838 { 1779 - "objectID": "api/index.html#local-storage", 1780 - "href": "api/index.html#local-storage", 1781 - "title": "API Reference", 1839 + "objectID": "api/Lens.html#classes", 1840 + "href": "api/Lens.html#classes", 1841 + "title": "lens", 1782 1842 "section": "", 1783 - "text": "Local Redis/S3 storage backend\n\n\n\nlocal.Index\nRedis-backed index for tracking datasets in a repository.\n\n\nlocal.LocalDatasetEntry\nIndex entry for a dataset stored in the local repository.\n\n\nlocal.S3DataStore\nS3-compatible data store implementing AbstractDataStore protocol." 1843 + "text": "Name\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. 
It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented." 1784 1844 }, 1785 1845 { 1786 - "objectID": "api/index.html#atmosphere", 1787 - "href": "api/index.html#atmosphere", 1788 - "title": "API Reference", 1846 + "objectID": "api/Lens.html#functions", 1847 + "href": "api/Lens.html#functions", 1848 + "title": "lens", 1789 1849 "section": "", 1790 - "text": "ATProto federation\n\n\n\nAtmosphereClient\nATProto client wrapper for atdata operations.\n\n\nAtmosphereIndex\nATProto index implementing AbstractIndex protocol.\n\n\nAtmosphereIndexEntry\nEntry wrapper for ATProto dataset records implementing IndexEntry protocol.\n\n\nPDSBlobStore\nPDS blob store implementing AbstractDataStore protocol.\n\n\nSchemaPublisher\nPublishes PackableSample schemas to ATProto.\n\n\nSchemaLoader\nLoads PackableSample schemas from ATProto.\n\n\nDatasetPublisher\nPublishes dataset index records to ATProto.\n\n\nDatasetLoader\nLoads dataset records from ATProto.\n\n\nLensPublisher\nPublishes Lens transformation records to ATProto.\n\n\nLensLoader\nLoads lens records from ATProto.\n\n\nAtUri\nParsed AT Protocol URI." 1850 + "text": "Name\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... 
return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1791 1851 }, 1792 1852 { 1793 - "objectID": "api/index.html#promotion", 1794 - "href": "api/index.html#promotion", 1795 - "title": "API Reference", 1853 + "objectID": "api/DatasetLoader.html", 1854 + "href": "api/DatasetLoader.html", 1855 + "title": "DatasetLoader", 1796 1856 "section": "", 1797 - "text": "Local to atmosphere migration\n\n\n\npromote_to_atmosphere\nPromote a local dataset to the atmosphere network." 1857 + "text": "atmosphere.DatasetLoader(client)\nLoads dataset records from ATProto.\nThis class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list_all()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no
metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1798 1858 }, 1799 1859 { 1800 - "objectID": "api/URLSource.html", 1801 - "href": "api/URLSource.html", 1802 - "title": "URLSource", 1860 + "objectID": "api/DatasetLoader.html#example", 1861 + "href": "api/DatasetLoader.html#example", 1862 + "title": "DatasetLoader", 1803 1863 "section": "", 1804 - "text": "URLSource(url)\nData source for WebDataset-compatible URLs.\nWraps WebDataset’s gopen to open URLs using built-in handlers for http, https, pipe, gs, hf, sftp, etc. Supports brace expansion for shard patterns like “data-{000..099}.tar”.\nThis is the default source type when a string URL is passed to Dataset.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
print(f\"Streaming {shard_id}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1864 + "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")" 1805 1865 }, 1806 1866 { 1807 - "objectID": "api/URLSource.html#attributes", 1808 - "href": "api/URLSource.html#attributes", 1809 - "title": "URLSource", 1867 + "objectID": "api/DatasetLoader.html#methods", 1868 + "href": "api/DatasetLoader.html#methods", 1869 + "title": "DatasetLoader", 1810 1870 "section": "", 1811 - "text": "Name\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards." 1871 + "text": "Name\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset 
record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1812 1872 }, 1813 1873 { 1814 - "objectID": "api/URLSource.html#example", 1815 - "href": "api/URLSource.html#example", 1816 - "title": "URLSource", 1874 + "objectID": "api/DataSource.html", 1875 + "href": "api/DataSource.html", 1876 + "title": "DataSource", 1817 1877 "section": "", 1818 - "text": "::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")" 1878 + "text": "DataSource()\nProtocol for data sources that provide streams to Dataset.\nA DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)\nThe key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. 
This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1819 1879 }, 1820 1880 { 1821 - "objectID": "api/URLSource.html#methods", 1822 - "href": "api/URLSource.html#methods", 1823 - "title": "URLSource", 1881 + "objectID": "api/DataSource.html#example", 1882 + "href": "api/DataSource.html#example", 1883 + "title": "DataSource", 1824 1884 "section": "", 1825 - "text": "Name\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1885 + "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... 
print(sample)" 1826 1886 }, 1827 1887 { 1828 - "objectID": "api/DatasetPublisher.html", 1829 - "href": "api/DatasetPublisher.html", 1830 - "title": "DatasetPublisher", 1888 + "objectID": "api/DataSource.html#attributes", 1889 + "href": "api/DataSource.html#attributes", 1890 + "title": "DataSource", 1831 1891 "section": "", 1832 - "text": "atmosphere.DatasetPublisher(client)\nPublishes dataset index records to ATProto.\nThis class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.\n\n\n::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. 
Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 1892 + "text": "Name\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard." 1833 1893 }, 1834 1894 { 1835 - "objectID": "api/DatasetPublisher.html#example", 1836 - "href": "api/DatasetPublisher.html#example", 1837 - "title": "DatasetPublisher", 1895 + "objectID": "api/DataSource.html#methods", 1896 + "href": "api/DataSource.html#methods", 1897 + "title": "DataSource", 1838 1898 "section": "", 1839 - "text": "::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )" 1899 + "text": "Name\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. 
Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1840 1900 }, 1841 1901 { 1842 - "objectID": "api/DatasetPublisher.html#methods", 1843 - "href": "api/DatasetPublisher.html#methods", 1844 - "title": "DatasetPublisher", 1902 + "objectID": "api/AtmosphereIndex.html", 1903 + "href": "api/AtmosphereIndex.html", 1904 + "title": "AtmosphereIndex", 1845 1905 "section": "", 1846 - "text": "Name\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. 
Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 
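The DatasetPublisher entry above documents publish_with_urls and publish_with_blobs, but only publish carries a worked example. Below is a minimal sketch of both storage paths, assuming an already-authenticated client; the handle, schema URI, and shard path are placeholders, and the atdata.atmosphere import path is an assumption rather than something this diff confirms.

    from pathlib import Path
    from atdata.atmosphere import AtmosphereClient, DatasetPublisher  # assumed import path

    client = AtmosphereClient()
    client.login("handle.bsky.social", "app-password")  # placeholder credentials

    publisher = DatasetPublisher(client)
    schema_uri = "at://did:plc:example/ac.foundation.dataset.sampleSchema/abc"  # hypothetical schema record

    # Register existing WebDataset shards by URL; no data is copied.
    uri = publisher.publish_with_urls(
        ["https://example.com/train-{000..009}.tar"],
        schema_uri,
        name="my-external-data",
        license="MIT",
    )

    # Or upload small shards directly as PDS blobs (per-blob size limits apply).
    shards = [Path("train-000.tar").read_bytes()]  # placeholder local shard
    uri = publisher.publish_with_blobs(
        shards,
        schema_uri,
        name="my-blob-data",
        mime_type="application/x-tar",
    )

Per the docstring, blob references only persist once they appear in a committed record, and publish_with_blobs handles that commit itself.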
1906 + "text": "atmosphere.AtmosphereIndex(client, *, data_store=None)\nATProto index implementing AbstractIndex protocol.\nWraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.\nOptionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol).\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. 
If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1847 1907 }, 1848 1908 { 1849 - "objectID": "api/SchemaPublisher.html", 1850 - "href": "api/SchemaPublisher.html", 1851 - "title": "SchemaPublisher", 1909 + "objectID": "api/AtmosphereIndex.html#example", 1910 + "href": "api/AtmosphereIndex.html#example", 1911 + "title": "AtmosphereIndex", 1852 1912 "section": "", 1853 - "text": "atmosphere.SchemaPublisher(client)\nPublishes PackableSample schemas to ATProto.\nThis class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.\n\n\n::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/...\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 
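SchemaPublisher.publish accepts optional name, description, metadata, and rkey arguments, none of which appear in its example above. A sketch with illustrative values only; every literal here is invented for the demonstration, and the import surface is assumed.

    from atdata import packable, NDArray  # assumed import surface
    from atdata.atmosphere import AtmosphereClient, SchemaPublisher  # assumed import path

    @packable
    class MySample:
        image: NDArray
        label: str

    client = AtmosphereClient()
    client.login("handle.bsky.social", "app-password")  # placeholder credentials

    publisher = SchemaPublisher(client)
    uri = publisher.publish(
        MySample,                                  # name defaults to the class name
        version="1.1.0",
        description="Image/label pairs, for illustration",
        metadata={"modality": "vision"},
        rkey="my-sample-schema",                   # explicit record key; omitting it generates a TID
    )
    print(uri)

The documented error contract applies: ValueError if the class is not a dataclass or the client is unauthenticated, TypeError for an unsupported field type.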
1913 + "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")" 1854 1914 }, 1855 1915 { 1856 - "objectID": "api/SchemaPublisher.html#example", 1857 - "href": "api/SchemaPublisher.html#example", 1858 - "title": "SchemaPublisher", 1916 + "objectID": "api/AtmosphereIndex.html#attributes", 1917 + "href": "api/AtmosphereIndex.html#attributes", 1918 + "title": "AtmosphereIndex", 1859 1919 "section": "", 1860 - "text": "::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/..." 1920 + "text": "Name\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol)." 1861 1921 }, 1862 1922 { 1863 - "objectID": "api/SchemaPublisher.html#methods", 1864 - "href": "api/SchemaPublisher.html#methods", 1865 - "title": "SchemaPublisher", 1923 + "objectID": "api/AtmosphereIndex.html#methods", 1924 + "href": "api/AtmosphereIndex.html#methods", 1925 + "title": "AtmosphereIndex", 1866 1926 "section": "", 1867 - "text": "Name\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 
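decode_schema is documented above but never shown against publish_schema. A round-trip sketch, continuing with the MySample class and authenticated client from the previous sketch; the import path is again an assumption.

    from atdata.atmosphere import AtmosphereIndex  # assumed import path

    index = AtmosphereIndex(client)

    # Publish the schema through the index, then rebuild a type from the record.
    ref = index.publish_schema(MySample, version="1.0.0")
    Rebuilt = index.decode_schema(ref)  # dynamically generated Packable type

    # list_schemas materializes every schema record in the repository.
    for record in index.list_schemas():
        print(record)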
1927 + "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 
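list_datasets and get_dataset close out the methods table without a usage example. A short discovery sketch; the DID and AT URI are placeholders, and since these entries do not spell out what AtmosphereIndexEntry exposes, the sketch only enumerates the entries.

    from atdata.atmosphere import AtmosphereClient, AtmosphereIndex  # assumed import path

    client = AtmosphereClient()
    client.login("handle.bsky.social", "app-password")  # placeholder credentials
    index = AtmosphereIndex(client)

    # Browse another repository by DID (placeholder value).
    for entry in index.list_datasets(repo="did:plc:example"):
        print(entry)  # AtmosphereIndexEntry; its fields are documented elsewhere

    # Or fetch one known dataset record directly by AT URI.
    entry = index.get_dataset("at://did:plc:example/ac.foundation.dataset.record/xyz")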
1868 1928 }, 1869 1929 { 1870 - "objectID": "api/promote_to_atmosphere.html", 1871 - "href": "api/promote_to_atmosphere.html", 1872 - "title": "promote_to_atmosphere", 1930 + "objectID": "api/LensLoader.html", 1931 + "href": "api/LensLoader.html", 1932 + "title": "LensLoader", 1873 1933 "section": "", 1874 - "text": "promote.promote_to_atmosphere(\n local_entry,\n local_index,\n atmosphere_client,\n *,\n data_store=None,\n name=None,\n description=None,\n tags=None,\n license=None,\n)\nPromote a local dataset to the atmosphere network.\nThis function takes a locally-indexed dataset and publishes it to ATProto, making it discoverable on the federated atmosphere network.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1934 + "text": "atmosphere.LensLoader(client)\nLoads lens records from ATProto.\nThis class fetches lens transformation records. Note that actually using a lens requires installing the referenced code and importing it manually.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. 
If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1875 1935 }, 1876 1936 { 1877 - "objectID": "api/promote_to_atmosphere.html#parameters", 1878 - "href": "api/promote_to_atmosphere.html#parameters", 1879 - "title": "promote_to_atmosphere", 1937 + "objectID": "api/LensLoader.html#example", 1938 + "href": "api/LensLoader.html#example", 1939 + "title": "LensLoader", 1880 1940 "section": "", 1881 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone" 1941 + "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))" 1882 1942 }, 1883 1943 { 1884 - "objectID": "api/promote_to_atmosphere.html#returns", 1885 - "href": "api/promote_to_atmosphere.html#returns", 1886 - "title": "promote_to_atmosphere", 1944 + "objectID": "api/LensLoader.html#methods", 1945 + "href": "api/LensLoader.html#methods", 1946 + "title": "LensLoader", 1887 1947 "section": "", 1888 - "text": "Name\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record." 1948 + "text": "Name\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. 
If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1889 1949 }, 1890 1950 { 1891 - "objectID": "api/promote_to_atmosphere.html#raises", 1892 - "href": "api/promote_to_atmosphere.html#raises", 1893 - "title": "promote_to_atmosphere", 1951 + "objectID": "api/DictSample.html", 1952 + "href": "api/DictSample.html", 1953 + "title": "DictSample", 1894 1954 "section": "", 1895 - "text": "Name\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs." 1955 + "text": "DictSample(_data=None, **kwargs)\nDynamic sample type providing dict-like access to raw msgpack data.\nThis class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (sample.field) and dict-style (sample[\"field\"]) access to fields.\nDictSample is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema\nTo convert to a typed schema, use Dataset.as_type() with a @packable-decorated class. Every @packable class automatically registers a lens from DictSample, making this conversion seamless.\n\n\n::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)\n\n\n\nNDArray fields are stored as raw bytes in DictSample. 
They are only converted to numpy arrays when accessed through a typed sample class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1896 1956 }, 1897 1957 { 1898 - "objectID": "api/promote_to_atmosphere.html#example", 1899 - "href": "api/promote_to_atmosphere.html#example", 1900 - "title": "promote_to_atmosphere", 1958 + "objectID": "api/DictSample.html#example", 1959 + "href": "api/DictSample.html#example", 1960 + "title": "DictSample", 1901 1961 "section": "", 1902 - "text": "::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1962 + "text": "::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)" 1903 1963 }, 1904 1964 { 1905 - "objectID": "api/load_dataset.html", 1906 - "href": "api/load_dataset.html", 1907 - "title": "load_dataset", 1965 + "objectID": "api/DictSample.html#note", 1966 + "href": "api/DictSample.html#note", 1967 + "title": "DictSample", 1908 1968 "section": "", 1909 - "text": "load_dataset(\n path,\n sample_type=None,\n *,\n split=None,\n data_files=None,\n streaming=False,\n index=None,\n)\nLoad a dataset from local files, remote URLs, or an index.\nThis function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. 
It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.\nWhen no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1969 + "text": "NDArray fields are stored as raw bytes in DictSample. They are only converted to numpy arrays when accessed through a typed sample class." 
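A minimal sketch of the explore-then-type flow that the DictSample and load_dataset entries above describe, assuming load_dataset and packable are exported at the package root as the surrounding examples suggest; the shard path and the TextData fields are illustrative, not part of the package:

import atdata
from atdata import load_dataset  # assumed package-root export

@atdata.packable
class TextData:
    text: str    # illustrative fields; match them to your data
    label: str

# Untyped load returns Dataset[DictSample] with dict-like field access
ds = load_dataset("./data/train.tar", split="train")  # illustrative path
for sample in ds.ordered():
    print(sample.keys())  # inspect available fields before committing to a schema
    break

# Every @packable class auto-registers a lens from DictSample,
# so converting to the typed schema is a single call
typed_ds = ds.as_type(TextData)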
1910 1970 }, 1911 1971 { 1912 - "objectID": "api/load_dataset.html#parameters", 1913 - "href": "api/load_dataset.html#parameters", 1914 - "title": "load_dataset", 1972 + "objectID": "api/DictSample.html#attributes", 1973 + "href": "api/DictSample.html#attributes", 1974 + "title": "DictSample", 1915 1975 "section": "", 1916 - "text": "Name\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone" 1976 + "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 1917 1977 }, 1918 1978 { 1919 - "objectID": "api/load_dataset.html#returns", 1920 - "href": "api/load_dataset.html#returns", 1921 - "title": "load_dataset", 1979 + "objectID": "api/DictSample.html#methods", 1980 + "href": "api/DictSample.html#methods", 1981 + "title": "DictSample", 1922 1982 "section": "", 1923 - "text": "Name\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample." 
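A short sketch of the msgpack round trip implied by the DictSample method docs above (from_data, get, packed, from_bytes, to_dict), assuming DictSample is importable from the package root; the field names are illustrative:

from atdata import DictSample  # assumed package-root export

sample = DictSample.from_data({"text": "hello", "label": "greeting"})
print(sample["text"], sample.label)   # dict-style and attribute-style access
print(sample.get("missing", "n/a"))   # defaulted lookup for absent fields

bs = sample.packed                    # serialize to msgpack bytes
restored = DictSample.from_bytes(bs)  # deserialize back into a DictSample
print(restored.to_dict())             # copy of the underlying data dict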
1983 + "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1924 1984 }, 1925 1985 { 1926 - "objectID": "api/load_dataset.html#raises", 1927 - "href": "api/load_dataset.html#raises", 1928 - "title": "load_dataset", 1986 + "objectID": "api/PDSBlobStore.html", 1987 + "href": "api/PDSBlobStore.html", 1988 + "title": "PDSBlobStore", 1929 1989 "section": "", 1930 - "text": "Name\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index." 1990 + "text": "atmosphere.PDSBlobStore(client)\nPDS blob store implementing AbstractDataStore protocol.\nStores dataset shards as ATProto blobs, enabling decentralized dataset storage on the AT Protocol network.\nEach shard is written to a temporary tar file, then uploaded as a blob to the user’s PDS. 
The returned URLs are AT URIs that can be resolved to HTTP URLs for streaming.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1931 1991 }, 1932 1992 { 1933 - "objectID": "api/load_dataset.html#example", 1934 - "href": "api/load_dataset.html#example", 1935 - "title": "load_dataset", 1993 + "objectID": "api/PDSBlobStore.html#attributes", 1994 + "href": "api/PDSBlobStore.html#attributes", 1995 + "title": "PDSBlobStore", 1936 1996 "section": "", 1937 - "text": "::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... 
print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1997 + "text": "Name\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance." 1938 1998 }, 1939 1999 { 1940 - "objectID": "api/AtmosphereClient.html", 1941 - "href": "api/AtmosphereClient.html", 1942 - "title": "AtmosphereClient", 2000 + "objectID": "api/PDSBlobStore.html#example", 2001 + "href": "api/PDSBlobStore.html#example", 2002 + "title": "PDSBlobStore", 1943 2003 "section": "", 1944 - "text": "atmosphere.AtmosphereClient(base_url=None, *, _client=None)\nATProto client wrapper for atdata operations.\nThis class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'\n\n\n\nThe password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. 
Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. 
Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 2004 + "text": "::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]" 1945 2005 }, 1946 2006 { 1947 - "objectID": "api/AtmosphereClient.html#example", 1948 - "href": "api/AtmosphereClient.html#example", 1949 - "title": "AtmosphereClient", 2007 + "objectID": "api/PDSBlobStore.html#methods", 2008 + "href": "api/PDSBlobStore.html#methods", 2009 + "title": "PDSBlobStore", 1950 2010 "section": "", 1951 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'" 2011 + "text": "Name\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives 
from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1952 2012 }, 1953 2013 { 1954 - "objectID": "api/AtmosphereClient.html#note", 1955 - "href": "api/AtmosphereClient.html#note", 1956 - "title": "AtmosphereClient", 2014 + "objectID": "api/PackableSample.html", 2015 + "href": "api/PackableSample.html", 2016 + "title": "PackableSample", 1957 2017 "section": "", 1958 - "text": "The password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings." 2018 + "text": "PackableSample()\nBase class for samples that can be serialized with msgpack.\nThis abstract base class provides automatic serialization/deserialization for dataclass-based samples. Fields annotated as NDArray or NDArray | None are automatically converted between numpy arrays and bytes during packing/unpacking.\nSubclasses should be defined either by: 1. Direct inheritance with the @dataclass decorator 2. Using the @packable decorator (recommended)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 
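A sketch of the write-then-stream round trip that the PDSBlobStore entries above describe, assuming an authenticated client and an existing typed `dataset`; the tightened maxcount/maxsize values are illustrative of staying under a conservative PDS blob limit:

from atdata.atmosphere import AtmosphereClient, PDSBlobStore

client = AtmosphereClient()
client.login("alice.bsky.social", "app-password")  # app password, never the main one

store = PDSBlobStore(client)
urls = store.write_shards(
    dataset,              # an existing atdata.Dataset (assumed)
    prefix="training/v1",
    maxcount=1_000,       # fewer samples per shard...
    maxsize=50_000_000,   # ...to stay under a small PDS blob limit
)

# Read back: build a BlobSource over the AT URIs, or resolve them to HTTP URLs
source = store.create_source(urls)
for shard_id in source.list_shards():
    print(shard_id, "->", store.read_url(shard_id))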
1959 2019 }, 1960 2020 { 1961 - "objectID": "api/AtmosphereClient.html#attributes", 1962 - "href": "api/AtmosphereClient.html#attributes", 1963 - "title": "AtmosphereClient", 2021 + "objectID": "api/PackableSample.html#example", 2022 + "href": "api/PackableSample.html#example", 2023 + "title": "PackableSample", 1964 2024 "section": "", 1965 - "text": "Name\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session." 2025 + "text": "::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize" 1966 2026 }, 1967 2027 { 1968 - "objectID": "api/AtmosphereClient.html#methods", 1969 - "href": "api/AtmosphereClient.html#methods", 1970 - "title": "AtmosphereClient", 2028 + "objectID": "api/PackableSample.html#attributes", 2029 + "href": "api/PackableSample.html#attributes", 2030 + "title": "PackableSample", 1971 2031 "section": "", 1972 - "text": "Name\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. 
Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. 
Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 2032 + "text": "Name\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes." 1973 2033 }, 1974 2034 { 1975 - "objectID": "api/BlobSource.html", 1976 - "href": "api/BlobSource.html", 1977 - "title": "BlobSource", 2035 + "objectID": "api/PackableSample.html#methods", 2036 + "href": "api/PackableSample.html#methods", 2037 + "title": "PackableSample", 1978 2038 "section": "", 1979 - "text": "BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())\nData source for ATProto PDS blob storage.\nStreams dataset shards stored as blobs on an ATProto Personal Data Server. Each shard is identified by a blob reference containing the DID and CID.\nThis source resolves blob references to HTTP URLs and streams the content directly, supporting efficient iteration over shards without downloading everything upfront.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
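One pattern the export_session / login_with_session methods documented above enable is caching a session between runs to avoid rate limits on session creation; a sketch under that assumption, with a hypothetical cache-file location:

import os
from atdata.atmosphere import AtmosphereClient

SESSION_FILE = "atproto-session.txt"  # hypothetical cache location

client = AtmosphereClient()
if os.path.exists(SESSION_FILE):
    # Reuse the exported session instead of creating a new one
    with open(SESSION_FILE) as f:
        client.login_with_session(f.read())
else:
    client.login("alice.bsky.social", "app-password")
    with open(SESSION_FILE, "w") as f:
        f.write(client.export_session())

print(client.did, client.handle)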
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 2039 + "text": "Name\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 1980 2040 }, 1981 2041 { 1982 - "objectID": "api/BlobSource.html#attributes", 1983 - "href": "api/BlobSource.html#attributes", 1984 - "title": "BlobSource", 2042 + "objectID": "api/DatasetDict.html", 2043 + "href": "api/DatasetDict.html", 2044 + "title": "DatasetDict", 1985 2045 "section": "", 1986 - "text": "Name\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID." 2046 + "text": "DatasetDict(splits=None, sample_type=None, streaming=False)\nA dictionary of split names to Dataset instances.\nSimilar to HuggingFace’s DatasetDict, this provides a container for multiple dataset splits (train, test, validation, etc.) 
with convenience methods that operate across all splits.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 1987 2047 }, 1988 2048 { 1989 - "objectID": "api/BlobSource.html#example", 1990 - "href": "api/BlobSource.html#example", 1991 - "title": "BlobSource", 2049 + "objectID": "api/DatasetDict.html#parameters", 2050 + "href": "api/DatasetDict.html#parameters", 2051 + "title": "DatasetDict", 1992 2052 "section": "", 1993 - "text": "::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 2053 + "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired" 1994 2054 }, 1995 2055 { 1996 - "objectID": "api/BlobSource.html#methods", 1997 - "href": "api/BlobSource.html#methods", 1998 - "title": "BlobSource", 2056 + "objectID": "api/DatasetDict.html#example", 2057 + "href": "api/DatasetDict.html#example", 2058 + "title": "DatasetDict", 1999 2059 "section": "", 2000 - "text": "Name\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 2060 + "text": "::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... 
print(f\"{split_name}: {len(dataset.shard_list)} shards\")" 2001 2061 }, 2002 2062 { 2003 - "objectID": "api/SchemaLoader.html", 2004 - "href": "api/SchemaLoader.html", 2005 - "title": "SchemaLoader", 2063 + "objectID": "api/DatasetDict.html#attributes", 2064 + "href": "api/DatasetDict.html#attributes", 2065 + "title": "DatasetDict", 2006 2066 "section": "", 2007 - "text": "atmosphere.SchemaLoader(client)\nLoads PackableSample schemas from ATProto.\nThis class fetches schema records from ATProto and can list available schemas from a repository.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 2067 + "text": "Name\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 2008 2068 }, 2009 2069 { 2010 - "objectID": "api/SchemaLoader.html#example", 2011 - "href": "api/SchemaLoader.html#example", 2012 - "title": "SchemaLoader", 2070 + "objectID": "tutorials/promotion.html", 2071 + "href": "tutorials/promotion.html", 2072 + "title": "Promotion Workflow", 2013 2073 "section": "", 2014 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'" 2074 + "text": "This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network. 
Promotion is the bridge between Layer 2 (team storage) and Layer 3 (federation).", 2075 + "crumbs": [ 2076 + "Guide", 2077 + "Getting Started", 2078 + "Promotion Workflow" 2079 + ] 2015 2080 }, 2016 2081 { 2017 - "objectID": "api/SchemaLoader.html#methods", 2018 - "href": "api/SchemaLoader.html#methods", 2019 - "title": "SchemaLoader", 2020 - "section": "", 2021 - "text": "Name\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 2082 + "objectID": "tutorials/promotion.html#why-promotion", 2083 + "href": "tutorials/promotion.html#why-promotion", 2084 + "title": "Promotion Workflow", 2085 + "section": "Why Promotion?", 2086 + "text": "Why Promotion?\nA common pattern in data science:\n\nStart private: Develop and validate datasets within your team\nGo public: Share successful datasets with the broader community\n\nPromotion handles this transition without re-processing your data. Instead of creating a new dataset from scratch, you’re lifting an existing local dataset entry into the federated atmosphere.\nThe workflow handles several complexities automatically:\n\nSchema deduplication: If you’ve already published the same schema type and version, promotion reuses it\nURL preservation: Data stays in place (unless you explicitly want to copy it)\nCID consistency: Content identifiers remain valid across the transition", 2087 + "crumbs": [ 2088 + "Guide", 2089 + "Getting Started", 2090 + "Promotion Workflow" 2091 + ] 2022 2092 }, 2023 2093 { 2024 - "objectID": "tutorials/atmosphere.html", 2025 - "href": "tutorials/atmosphere.html", 2026 - "title": "Atmosphere Publishing", 2027 - "section": "", 2028 - "text": "This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing.", 2094 + "objectID": "tutorials/promotion.html#overview", 2095 + "href": "tutorials/promotion.html#overview", 2096 + "title": "Promotion Workflow", 2097 + "section": "Overview", 2098 + "text": "Overview\nThe promotion workflow moves datasets from local storage to the atmosphere:\nLOCAL ATMOSPHERE\n----- ----------\nRedis Index ATProto PDS\nS3 Storage --&gt; (same S3 or new location)\nlocal://schemas/... 
at://did:plc:.../schema/...\nKey features:\n\nSchema deduplication: Won’t republish identical schemas\nFlexible data handling: Keep existing URLs or copy to new storage\nMetadata preservation: Local metadata carries over to atmosphere", 2029 2099 "crumbs": [ 2030 2100 "Guide", 2031 2101 "Getting Started", 2032 - "Atmosphere Publishing" 2102 + "Promotion Workflow" 2033 2103 ] 2034 2104 }, 2035 2105 { 2036 - "objectID": "tutorials/atmosphere.html#prerequisites", 2037 - "href": "tutorials/atmosphere.html#prerequisites", 2038 - "title": "Atmosphere Publishing", 2039 - "section": "Prerequisites", 2040 - "text": "Prerequisites\n\npip install atdata[atmosphere]\nA Bluesky account with an app-specific password\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password.", 2106 + "objectID": "tutorials/promotion.html#setup", 2107 + "href": "tutorials/promotion.html#setup", 2108 + "title": "Promotion Workflow", 2109 + "section": "Setup", 2110 + "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds", 2041 2111 "crumbs": [ 2042 2112 "Guide", 2043 2113 "Getting Started", 2044 - "Atmosphere Publishing" 2114 + "Promotion Workflow" 2045 2115 ] 2046 2116 }, 2047 2117 { 2048 - "objectID": "tutorials/atmosphere.html#setup", 2049 - "href": "tutorials/atmosphere.html#setup", 2050 - "title": "Atmosphere Publishing", 2051 - "section": "Setup", 2052 - "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import (\n AtmosphereClient,\n AtmosphereIndex,\n PDSBlobStore,\n SchemaPublisher,\n SchemaLoader,\n DatasetPublisher,\n DatasetLoader,\n AtUri,\n)\nfrom atdata import BlobSource\nimport webdataset as wds", 2118 + "objectID": "tutorials/promotion.html#prepare-a-local-dataset", 2119 + "href": "tutorials/promotion.html#prepare-a-local-dataset", 2120 + "title": "Promotion Workflow", 2121 + "section": "Prepare a Local Dataset", 2122 + "text": "Prepare a Local Dataset\nFirst, set up a dataset in local storage:\n\n# 1. Define sample type\n@atdata.packable\nclass ExperimentSample:\n \"\"\"A sample from a scientific experiment.\"\"\"\n measurement: NDArray\n timestamp: float\n sensor_id: str\n\n# 2. Create samples\nsamples = [\n ExperimentSample(\n measurement=np.random.randn(64).astype(np.float32),\n timestamp=float(i),\n sensor_id=f\"sensor_{i % 4}\",\n )\n for i in range(1000)\n]\n\n# 3. Write to tar\nwith wds.writer.TarWriter(\"experiment.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. Set up local index with S3 storage\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 5. Insert dataset into index\ndataset = atdata.Dataset[ExperimentSample](\"experiment.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"experiment-2024-001\", prefix=\"experiments\")\n\n# 6. 
Publish schema to local index\nlocal_index.publish_schema(ExperimentSample, version=\"1.0.0\")\n\nprint(f\"Local entry name: {local_entry.name}\")\nprint(f\"Local entry CID: {local_entry.cid}\")\nprint(f\"Data URLs: {local_entry.data_urls}\")", 2053 2123 "crumbs": [ 2054 2124 "Guide", 2055 2125 "Getting Started", 2056 - "Atmosphere Publishing" 2126 + "Promotion Workflow" 2057 2127 ] 2058 2128 }, 2059 2129 { 2060 - "objectID": "tutorials/atmosphere.html#define-sample-types", 2061 - "href": "tutorials/atmosphere.html#define-sample-types", 2062 - "title": "Atmosphere Publishing", 2063 - "section": "Define Sample Types", 2064 - "text": "Define Sample Types\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing image data with metadata.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\n@atdata.packable\nclass TextEmbeddingSample:\n \"\"\"A sample containing text with embedding vectors.\"\"\"\n text: str\n embedding: NDArray\n source: str", 2130 + "objectID": "tutorials/promotion.html#basic-promotion", 2131 + "href": "tutorials/promotion.html#basic-promotion", 2132 + "title": "Promotion Workflow", 2133 + "section": "Basic Promotion", 2134 + "text": "Basic Promotion\nPromote the dataset to ATProto:\n\n# Connect to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\n# Promote to atmosphere\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 2065 2135 "crumbs": [ 2066 2136 "Guide", 2067 2137 "Getting Started", 2068 - "Atmosphere Publishing" 2138 + "Promotion Workflow" 2069 2139 ] 2070 2140 }, 2071 2141 { 2072 - "objectID": "tutorials/atmosphere.html#type-introspection", 2073 - "href": "tutorials/atmosphere.html#type-introspection", 2074 - "title": "Atmosphere Publishing", 2075 - "section": "Type Introspection", 2076 - "text": "Type Introspection\nSee what information is available from a PackableSample type:\n\nfrom dataclasses import fields, is_dataclass\n\nprint(f\"Sample type: {ImageSample.__name__}\")\nprint(f\"Is dataclass: {is_dataclass(ImageSample)}\")\n\nprint(\"\\nFields:\")\nfor field in fields(ImageSample):\n print(f\" - {field.name}: {field.type}\")\n\n# Create and serialize a sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\npacked = sample.packed\nprint(f\"\\nSerialized size: {len(packed):,} bytes\")\n\n# Round-trip\nrestored = ImageSample.from_bytes(packed)\nprint(f\"Round-trip successful: {np.allclose(sample.image, restored.image)}\")", 2142 + "objectID": "tutorials/promotion.html#promotion-with-metadata", 2143 + "href": "tutorials/promotion.html#promotion-with-metadata", 2144 + "title": "Promotion Workflow", 2145 + "section": "Promotion with Metadata", 2146 + "text": "Promotion with Metadata\nAdd description, tags, and license:\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n name=\"experiment-2024-001-v2\", # Override name\n description=\"Sensor measurements from Lab 302\",\n tags=[\"experiment\", \"physics\", \"2024\"],\n license=\"CC-BY-4.0\",\n)\nprint(f\"Published with metadata: {at_uri}\")", 2077 2147 "crumbs": [ 2078 2148 "Guide", 2079 2149 "Getting Started", 2080 - "Atmosphere Publishing" 2150 + "Promotion Workflow" 2081 2151 ] 2082 2152 }, 2083 2153 { 2084 - "objectID": "tutorials/atmosphere.html#at-uri-parsing", 2085 - "href": "tutorials/atmosphere.html#at-uri-parsing", 2086 - "title": "Atmosphere Publishing", 2087 - "section": "AT URI Parsing", 2088 
- "text": "AT URI Parsing\nATProto records are identified by AT URIs:\n\nuris = [\n \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789\",\n \"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset\",\n]\n\nfor uri_str in uris:\n print(f\"\\nParsing: {uri_str}\")\n uri = AtUri.parse(uri_str)\n print(f\" Authority: {uri.authority}\")\n print(f\" Collection: {uri.collection}\")\n print(f\" Rkey: {uri.rkey}\")", 2154 + "objectID": "tutorials/promotion.html#schema-deduplication", 2155 + "href": "tutorials/promotion.html#schema-deduplication", 2156 + "title": "Promotion Workflow", 2157 + "section": "Schema Deduplication", 2158 + "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\nfrom atdata.promote import _find_existing_schema\n\n# Check if schema already exists\nexisting = _find_existing_schema(client, \"ExperimentSample\", \"1.0.0\")\nif existing:\n print(f\"Found existing schema: {existing}\")\n print(\"Will reuse instead of republishing\")\nelse:\n print(\"No existing schema found, will publish new one\")\n\nWhen you promote multiple datasets with the same sample type:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)", 2089 2159 "crumbs": [ 2090 2160 "Guide", 2091 2161 "Getting Started", 2092 - "Atmosphere Publishing" 2162 + "Promotion Workflow" 2093 2163 ] 2094 2164 }, 2095 2165 { 2096 - "objectID": "tutorials/atmosphere.html#authentication", 2097 - "href": "tutorials/atmosphere.html#authentication", 2098 - "title": "Atmosphere Publishing", 2099 - "section": "Authentication", 2100 - "text": "Authentication\nConnect to ATProto:\n\nclient = AtmosphereClient()\nclient.login(\"your.handle.social\", \"your-app-password\")\n\nprint(f\"Authenticated as: {client.handle}\")\nprint(f\"DID: {client.did}\")", 2166 + "objectID": "tutorials/promotion.html#data-migration-options", 2167 + "href": "tutorials/promotion.html#data-migration-options", 2168 + "title": "Promotion Workflow", 2169 + "section": "Data Migration Options", 2170 + "text": "Data Migration Options\n\nKeep Existing URLsCopy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(local_entry, local_index, client)\n\nBenefits:\n\nFastest option, no data copying\nDataset record points to existing URLs\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\nBenefits:\n\nData is copied to new bucket\nGood for moving from private to public storage\nOriginal storage can be retired", 2101 2171 "crumbs": [ 2102 2172 "Guide", 2103 2173 "Getting Started", 2104 - "Atmosphere Publishing" 2174 + "Promotion Workflow" 2105 2175 ] 2106 2176 }, 2107 2177 { 2108 - "objectID": "tutorials/atmosphere.html#publish-a-schema", 2109 - "href": "tutorials/atmosphere.html#publish-a-schema", 2110 - "title": "Atmosphere Publishing", 2111 - "section": "Publish a Schema", 2112 - "text": "Publish a Schema\n\nschema_publisher = SchemaPublisher(client)\nschema_uri = 
schema_publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Demo: Image sample with label and confidence\",\n)\nprint(f\"Schema URI: {schema_uri}\")", 2178 + "objectID": "tutorials/promotion.html#verify-on-atmosphere", 2179 + "href": "tutorials/promotion.html#verify-on-atmosphere", 2180 + "title": "Promotion Workflow", 2181 + "section": "Verify on Atmosphere", 2182 + "text": "Verify on Atmosphere\nAfter promotion, verify the dataset is accessible:\n\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\n\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")\n\n# Load and iterate\nSampleType = atm_index.decode_schema(entry.schema_ref)\nds = atdata.Dataset[SampleType](entry.data_urls[0])\n\nfor batch in ds.ordered(batch_size=32):\n print(f\"Measurement shape: {batch.measurement.shape}\")\n break", 2113 2183 "crumbs": [ 2114 2184 "Guide", 2115 2185 "Getting Started", 2116 - "Atmosphere Publishing" 2186 + "Promotion Workflow" 2117 2187 ] 2118 2188 }, 2119 2189 { 2120 - "objectID": "tutorials/atmosphere.html#list-your-schemas", 2121 - "href": "tutorials/atmosphere.html#list-your-schemas", 2122 - "title": "Atmosphere Publishing", 2123 - "section": "List Your Schemas", 2124 - "text": "List Your Schemas\n\nschema_loader = SchemaLoader(client)\nschemas = schema_loader.list_all(limit=10)\nprint(f\"Found {len(schemas)} schema(s)\")\n\nfor schema in schemas:\n print(f\" - {schema.get('name', 'Unknown')}: v{schema.get('version', '?')}\")", 2190 + "objectID": "tutorials/promotion.html#error-handling", 2191 + "href": "tutorials/promotion.html#error-handling", 2192 + "title": "Promotion Workflow", 2193 + "section": "Error Handling", 2194 + "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(local_entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\n print(\"Publish schema first: local_index.publish_schema(SampleType)\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 2125 2195 "crumbs": [ 2126 2196 "Guide", 2127 2197 "Getting Started", 2128 - "Atmosphere Publishing" 2198 + "Promotion Workflow" 2129 2199 ] 2130 2200 }, 2131 2201 { 2132 - "objectID": "tutorials/atmosphere.html#publish-a-dataset", 2133 - "href": "tutorials/atmosphere.html#publish-a-dataset", 2134 - "title": "Atmosphere Publishing", 2135 - "section": "Publish a Dataset", 2136 - "text": "Publish a Dataset\n\nWith External URLs\n\ndataset_publisher = DatasetPublisher(client)\ndataset_uri = dataset_publisher.publish_with_urls(\n urls=[\"s3://example-bucket/demo-data-{000000..000009}.tar\"],\n schema_uri=str(schema_uri),\n name=\"Demo Image Dataset\",\n description=\"Example dataset demonstrating atmosphere publishing\",\n tags=[\"demo\", \"images\", \"atdata\"],\n license=\"MIT\",\n)\nprint(f\"Dataset URI: {dataset_uri}\")\n\n\n\nWith PDS Blob Storage (Recommended)\nFor fully decentralized storage, use PDSBlobStore to store dataset shards directly as ATProto blobs in your PDS:\n\n# Create store and index with blob storage\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# Create dataset in memory or from existing tar\nsamples = [FeatureSample(features=np.random.randn(64).astype(np.float32), label=i % 10) for i in range(100)]\n\n# 
Write to temporary tar\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[FeatureSample](\"temp.tar\")\n\n# Publish - shards are uploaded as blobs automatically\nschema_uri = index.publish_schema(FeatureSample, version=\"1.0.0\")\nentry = index.insert_dataset(\n dataset,\n name=\"blob-stored-features\",\n schema_ref=schema_uri,\n description=\"Features stored as PDS blobs\",\n)\n\nprint(f\"Dataset URI: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\") # at://did/blob/cid format\n\n\n\n\n\n\n\nReading Blob-Stored Datasets\n\n\n\nUse BlobSource to stream directly from PDS blobs:\n\n# Create source from the blob URLs\nsource = store.create_source(entry.data_urls)\n\n# Or manually from blob references\nsource = BlobSource.from_refs([\n {\"did\": client.did, \"cid\": \"bafyrei...\"},\n])\n\n# Load and iterate\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n\n\n\n\n\nWith External URLs\nFor larger datasets or when using existing object storage:\n\ndataset_publisher = DatasetPublisher(client)\ndataset_uri = dataset_publisher.publish_with_urls(\n urls=[\"s3://example-bucket/demo-data-{000000..000009}.tar\"],\n schema_uri=str(schema_uri),\n name=\"Demo Image Dataset\",\n description=\"Example dataset demonstrating atmosphere publishing\",\n tags=[\"demo\", \"images\", \"atdata\"],\n license=\"MIT\",\n)\nprint(f\"Dataset URI: {dataset_uri}\")", 2202 + "objectID": "tutorials/promotion.html#requirements-checklist", 2203 + "href": "tutorials/promotion.html#requirements-checklist", 2204 + "title": "Promotion Workflow", 2205 + "section": "Requirements Checklist", 2206 + "text": "Requirements Checklist\nBefore promotion:\n\nDataset is in local index (via LocalIndex.insert_dataset() or LocalIndex.add_entry())\nSchema is published to local index (via LocalIndex.publish_schema())\nAtmosphereClient is authenticated\nData URLs are publicly accessible (or will be copied)", 2137 2207 "crumbs": [ 2138 2208 "Guide", 2139 2209 "Getting Started", 2140 - "Atmosphere Publishing" 2210 + "Promotion Workflow" 2141 2211 ] 2142 2212 }, 2143 2213 { 2144 - "objectID": "tutorials/atmosphere.html#list-and-load-datasets", 2145 - "href": "tutorials/atmosphere.html#list-and-load-datasets", 2146 - "title": "Atmosphere Publishing", 2147 - "section": "List and Load Datasets", 2148 - "text": "List and Load Datasets\n\ndataset_loader = DatasetLoader(client)\ndatasets = dataset_loader.list_all(limit=10)\nprint(f\"Found {len(datasets)} dataset(s)\")\n\nfor ds in datasets:\n print(f\" - {ds.get('name', 'Unknown')}\")\n print(f\" Schema: {ds.get('schemaRef', 'N/A')}\")\n tags = ds.get('tags', [])\n if tags:\n print(f\" Tags: {', '.join(tags)}\")", 2214 + "objectID": "tutorials/promotion.html#complete-workflow", 2215 + "href": "tutorials/promotion.html#complete-workflow", 2216 + "title": "Promotion Workflow", 2217 + "section": "Complete Workflow", 2218 + "text": "Complete Workflow\n\n# Complete local-to-atmosphere workflow\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. 
Create dataset tar\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Store in local index with S3 backend\nstore = S3DataStore(credentials=\"creds.env\", bucket=\"bucket\")\nlocal_index = LocalIndex(data_store=store)\n\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 4. Publish schema locally\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\n\n# 5. Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Others can now discover and load\n# ds = atdata.load_dataset(\"@myhandle.bsky.social/feature-vectors-v1\")", 2149 2219 "crumbs": [ 2150 2220 "Guide", 2151 2221 "Getting Started", 2152 - "Atmosphere Publishing" 2222 + "Promotion Workflow" 2153 2223 ] 2154 2224 }, 2155 2225 { 2156 - "objectID": "tutorials/atmosphere.html#load-a-dataset", 2157 - "href": "tutorials/atmosphere.html#load-a-dataset", 2158 - "title": "Atmosphere Publishing", 2159 - "section": "Load a Dataset", 2160 - "text": "Load a Dataset\n\n# Check storage type\nstorage_type = dataset_loader.get_storage_type(str(blob_dataset_uri))\nprint(f\"Storage type: {storage_type}\")\n\nif storage_type == \"blobs\":\n blob_urls = dataset_loader.get_blob_urls(str(blob_dataset_uri))\n print(f\"Blob URLs: {len(blob_urls)} blob(s)\")\n\n# Load and iterate (works for both storage types)\nds = dataset_loader.to_dataset(str(blob_dataset_uri), DemoSample)\nfor batch in ds.ordered():\n print(f\"Sample id={batch.id}, text={batch.text}\")", 2226 + "objectID": "tutorials/promotion.html#what-youve-learned", 2227 + "href": "tutorials/promotion.html#what-youve-learned", 2228 + "title": "Promotion Workflow", 2229 + "section": "What You’ve Learned", 2230 + "text": "What You’ve Learned\nYou now understand the promotion workflow:\n\n\n\n\n\n\n\nConcept\nPurpose\n\n\n\n\npromote_to_atmosphere()\nLift local entries to federated network\n\n\nSchema deduplication\nAvoid publishing duplicate schemas\n\n\nData URL preservation\nKeep data in place or copy to new storage\n\n\nMetadata enrichment\nAdd description, tags, license during promotion\n\n\n\nPromotion completes atdata’s three-layer story: you can now move seamlessly from local experimentation to team collaboration to public sharing, all with the same typed sample definitions.", 2161 2231 "crumbs": [ 2162 2232 "Guide", 2163 2233 "Getting Started", 2164 - "Atmosphere Publishing" 2234 + "Promotion Workflow" 2165 2235 ] 2166 2236 }, 2167 2237 { 2168 - "objectID": "tutorials/atmosphere.html#complete-publishing-workflow", 2169 - "href": "tutorials/atmosphere.html#complete-publishing-workflow", 2170 - "title": "Atmosphere Publishing", 2171 - "section": "Complete Publishing Workflow", 2172 - "text": "Complete Publishing Workflow\nThis example shows the recommended workflow using PDSBlobStore for fully decentralized storage:\n\n# 1. 
Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Authenticate and create index with blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs automatically)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Data stored at: {entry.data_urls}\") # at://did/blob/cid URLs\n\n# 6. Later: load from blobs\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[FeatureSample](source)\nfor batch in ds.ordered(batch_size=32):\n print(f\"Loaded batch with {len(batch.label)} samples\")\n break", 2238 + "objectID": "tutorials/promotion.html#the-complete-journey", 2239 + "href": "tutorials/promotion.html#the-complete-journey", 2240 + "title": "Promotion Workflow", 2241 + "section": "The Complete Journey", 2242 + "text": "The Complete Journey\n┌──────────────────┐ insert ┌──────────────────┐ promote ┌──────────────────┐\n│ Local Files │ ────────────→ │ Team Storage │ ────────────→ │ Federation │\n│ │ │ │ │ │\n│ tar files │ │ Redis + S3 │ │ ATProto PDS │\n│ Dataset[T] │ │ LocalIndex │ │ AtmosphereIndex │\n└──────────────────┘ └──────────────────┘ └──────────────────┘", 2173 2243 "crumbs": [ 2174 2244 "Guide", 2175 2245 "Getting Started", 2176 - "Atmosphere Publishing" 2246 + "Promotion Workflow" 2177 2247 ] 2178 2248 }, 2179 2249 { 2180 - "objectID": "tutorials/atmosphere.html#next-steps", 2181 - "href": "tutorials/atmosphere.html#next-steps", 2182 - "title": "Atmosphere Publishing", 2250 + "objectID": "tutorials/promotion.html#next-steps", 2251 + "href": "tutorials/promotion.html#next-steps", 2252 + "title": "Promotion Workflow", 2183 2253 "section": "Next Steps", 2184 - "text": "Next Steps\n\nPromotion Workflow - Migrate from local storage to atmosphere\nAtmosphere Reference - Complete API reference\nProtocols - Abstract interfaces", 2254 + "text": "Next Steps\n\nAtmosphere Reference - Complete atmosphere API\nProtocols - Abstract interfaces\nLocal Storage - Local storage reference", 2185 2255 "crumbs": [ 2186 2256 "Guide", 2187 2257 "Getting Started", 2188 - "Atmosphere Publishing" 2258 + "Promotion Workflow" 2189 2259 ] 2190 2260 }, 2191 2261 { 2192 - "objectID": "tutorials/quickstart.html", 2193 - "href": "tutorials/quickstart.html", 2194 - "title": "Quick Start", 2262 + "objectID": "tutorials/local-workflow.html", 2263 + "href": "tutorials/local-workflow.html", 2264 + "title": "Local Workflow", 2195 2265 "section": "", 2196 - "text": "This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them.", 2266 + "text": "This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage. 
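In code, that pairing is a single composition. A minimal sketch (the endpoint, credentials, and bucket are local-development placeholders; the full setup is walked through step by step below):

```python
# Sketch of the Redis + S3 pairing this tutorial builds toward.
# Endpoint, credentials, and bucket are local-development placeholders.
from redis import Redis
from atdata.local import LocalIndex, S3DataStore

store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "http://localhost:9000",
        "AWS_ACCESS_KEY_ID": "minioadmin",
        "AWS_SECRET_ACCESS_KEY": "minioadmin",
    },
    bucket="my-bucket",
)
index = LocalIndex(redis=Redis(host="localhost", port=6379), data_store=store)
```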
This is Layer 2 of atdata’s architecture—team-scale storage that bridges local development and federated sharing.", 2197 2267 "crumbs": [ 2198 2268 "Guide", 2199 2269 "Getting Started", 2200 - "Quick Start" 2270 + "Local Workflow" 2201 2271 ] 2202 2272 }, 2203 2273 { 2204 - "objectID": "tutorials/quickstart.html#installation", 2205 - "href": "tutorials/quickstart.html#installation", 2206 - "title": "Quick Start", 2207 - "section": "Installation", 2208 - "text": "Installation\npip install atdata\n\n# With ATProto support\npip install atdata[atmosphere]", 2274 + "objectID": "tutorials/local-workflow.html#why-team-storage", 2275 + "href": "tutorials/local-workflow.html#why-team-storage", 2276 + "title": "Local Workflow", 2277 + "section": "Why Team Storage?", 2278 + "text": "Why Team Storage?\nLocal tar files work well for individual experiments, but teams need:\n\nDiscovery: “What datasets do we have? What schema does this one use?”\nConsistency: “Is everyone using the same version of this dataset?”\nDurability: “Where’s the canonical copy of our training data?”\n\natdata’s local storage module addresses these needs with a two-component architecture:\n\n\n\n\n\n\n\nComponent\nPurpose\n\n\n\n\nRedis Index\nFast metadata queries, schema registry, dataset discovery\n\n\nS3 DataStore\nScalable object storage for actual data files\n\n\n\nThis separation means metadata operations (listing datasets, resolving schemas) are fast and don’t touch large data files, while the data itself lives in battle-tested object storage.", 2279 + "crumbs": [ 2280 + "Guide", 2281 + "Getting Started", 2282 + "Local Workflow" 2283 + ] 2284 + }, 2285 + { 2286 + "objectID": "tutorials/local-workflow.html#prerequisites", 2287 + "href": "tutorials/local-workflow.html#prerequisites", 2288 + "title": "Local Workflow", 2289 + "section": "Prerequisites", 2290 + "text": "Prerequisites\n\nRedis server running (default: localhost:6379)\nS3-compatible storage (MinIO, AWS S3, etc.)\n\n\n\n\n\n\n\nTip\n\n\n\nFor local development, you can use MinIO:\ndocker run -p 9000:9000 minio/minio server /data", 2209 2291 "crumbs": [ 2210 2292 "Guide", 2211 2293 "Getting Started", 2212 - "Quick Start" 2294 + "Local Workflow" 2213 2295 ] 2214 2296 }, 2215 2297 { 2216 - "objectID": "tutorials/quickstart.html#define-a-sample-type", 2217 - "href": "tutorials/quickstart.html#define-a-sample-type", 2218 - "title": "Quick Start", 2219 - "section": "Define a Sample Type", 2220 - "text": "Define a Sample Type\nUse the @packable decorator to create a typed sample:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n \"\"\"A sample containing an image with label and confidence.\"\"\"\n image: NDArray\n label: str\n confidence: float\n\nThe @packable decorator:\n\nConverts your class into a dataclass\nAdds automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 2298 + "objectID": "tutorials/local-workflow.html#setup", 2299 + "href": "tutorials/local-workflow.html#setup", 2300 + "title": "Local Workflow", 2301 + "section": "Setup", 2302 + "text": "Setup\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, LocalDatasetEntry, S3DataStore\nimport webdataset as wds", 2221 2303 "crumbs": [ 2222 2304 "Guide", 2223 2305 "Getting Started", 2224 - "Quick Start" 2306 + "Local Workflow" 2225 2307 ] 2226 2308 }, 2227 2309 { 2228 - "objectID": "tutorials/quickstart.html#create-sample-instances", 2229 - "href": 
"tutorials/quickstart.html#create-sample-instances", 2230 - "title": "Quick Start", 2231 - "section": "Create Sample Instances", 2232 - "text": "Create Sample Instances\n\n# Create a single sample\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# Check serialization\npacked_bytes = sample.packed\nprint(f\"Serialized size: {len(packed_bytes):,} bytes\")\n\n# Verify round-trip\nrestored = ImageSample.from_bytes(packed_bytes)\nassert np.allclose(sample.image, restored.image)\nprint(\"Round-trip successful!\")", 2310 + "objectID": "tutorials/local-workflow.html#define-sample-types", 2311 + "href": "tutorials/local-workflow.html#define-sample-types", 2312 + "title": "Local Workflow", 2313 + "section": "Define Sample Types", 2314 + "text": "Define Sample Types\n\n@atdata.packable\nclass TrainingSample:\n \"\"\"A sample containing features and label for training.\"\"\"\n features: NDArray\n label: int\n\n@atdata.packable\nclass TextSample:\n \"\"\"A sample containing text data.\"\"\"\n text: str\n category: str", 2233 2315 "crumbs": [ 2234 2316 "Guide", 2235 2317 "Getting Started", 2236 - "Quick Start" 2318 + "Local Workflow" 2237 2319 ] 2238 2320 }, 2239 2321 { 2240 - "objectID": "tutorials/quickstart.html#write-a-dataset", 2241 - "href": "tutorials/quickstart.html#write-a-dataset", 2242 - "title": "Quick Start", 2243 - "section": "Write a Dataset", 2244 - "text": "Write a Dataset\nUse WebDataset’s TarWriter to create dataset files:\n\nimport webdataset as wds\n\n# Create 100 samples\nsamples = [\n ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"class_{i % 10}\",\n confidence=np.random.rand(),\n )\n for i in range(100)\n]\n\n# Write to tar file\nwith wds.writer.TarWriter(\"my-dataset-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\n\nprint(\"Wrote 100 samples to my-dataset-000000.tar\")", 2322 + "objectID": "tutorials/local-workflow.html#localdatasetentry", 2323 + "href": "tutorials/local-workflow.html#localdatasetentry", 2324 + "title": "Local Workflow", 2325 + "section": "LocalDatasetEntry", 2326 + "text": "LocalDatasetEntry\nEvery dataset in the index is represented by a LocalDatasetEntry. A key design decision: entries use content-addressable CIDs (Content Identifiers) as their identity. 
This means:\n\nIdentical content always has the same CID\nYou can verify data integrity by checking the CID\nDeduplication happens automatically\n\nCIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.\nCreate entries with content-addressable CIDs:\n\n# Create an entry manually\nentry = LocalDatasetEntry(\n _name=\"my-dataset\",\n _schema_ref=\"local://schemas/examples.TrainingSample@1.0.0\",\n _data_urls=[\"s3://bucket/data-000000.tar\", \"s3://bucket/data-000001.tar\"],\n _metadata={\"source\": \"example\", \"samples\": 10000},\n)\n\nprint(f\"Entry name: {entry.name}\")\nprint(f\"Schema ref: {entry.schema_ref}\")\nprint(f\"Data URLs: {entry.data_urls}\")\nprint(f\"Metadata: {entry.metadata}\")\nprint(f\"CID: {entry.cid}\")\n\n\n\n\n\n\n\nNote\n\n\n\nCIDs are generated from content (schema_ref + data_urls), so identical data produces identical CIDs.", 2245 2327 "crumbs": [ 2246 2328 "Guide", 2247 2329 "Getting Started", 2248 - "Quick Start" 2330 + "Local Workflow" 2249 2331 ] 2250 2332 }, 2251 2333 { 2252 - "objectID": "tutorials/quickstart.html#load-and-iterate", 2253 - "href": "tutorials/quickstart.html#load-and-iterate", 2254 - "title": "Quick Start", 2255 - "section": "Load and Iterate", 2256 - "text": "Load and Iterate\nCreate a typed Dataset and iterate with batching:\n\n# Load dataset with type\ndataset = atdata.Dataset[ImageSample](\"my-dataset-000000.tar\")\n\n# Iterate in order with batching\nfor batch in dataset.ordered(batch_size=16):\n # NDArray fields are stacked\n images = batch.image # shape: (16, 224, 224, 3)\n\n # Other fields become lists\n labels = batch.label # list of 16 strings\n confidences = batch.confidence # list of 16 floats\n\n print(f\"Batch shape: {images.shape}\")\n print(f\"Labels: {labels[:3]}...\")\n break", 2334 + "objectID": "tutorials/local-workflow.html#localindex", 2335 + "href": "tutorials/local-workflow.html#localindex", 2336 + "title": "Local Workflow", 2337 + "section": "LocalIndex", 2338 + "text": "LocalIndex\nThe LocalIndex is your team’s dataset registry. It implements the AbstractIndex protocol, meaning code written against LocalIndex will also work with AtmosphereIndex when you’re ready for federated sharing.\nThe index tracks datasets in Redis:\n\nfrom redis import Redis\n\n# Connect to Redis\nredis = Redis(host=\"localhost\", port=6379)\nindex = LocalIndex(redis=redis)\n\nprint(\"LocalIndex connected\")\n\n\nSchema Management\nSchema publishing is how you ensure type consistency across your team. 
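For intuition, the record stored for a published schema looks roughly like the sketch below. The key layout mirrors the get_schema() output shown in the local-storage reference; the per-field "type" encoding here is illustrative, not the exact wire format:

```python
# Approximate shape of a stored schema record (illustrative values only).
schema_record = {
    "name": "TrainingSample",
    "version": "1.0.0",
    "fields": [
        {"name": "features", "type": "ndarray"},  # "type" encoding is a stand-in
        {"name": "label", "type": "int"},
    ],
    "description": "A sample containing features and label for training.",
    "createdAt": "2024-01-01T00:00:00Z",
}
```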
When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.\nThis enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.\n\n# Publish a schema\nschema_ref = index.publish_schema(TrainingSample, version=\"1.0.0\")\nprint(f\"Published schema: {schema_ref}\")\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\" - {schema.get('name', 'Unknown')} v{schema.get('version', '?')}\")\n\n# Get schema record\nschema_record = index.get_schema(schema_ref)\nprint(f\"Schema fields: {[f['name'] for f in schema_record.get('fields', [])]}\")\n\n# Decode schema back to a PackableSample class\ndecoded_type = index.decode_schema(schema_ref)\nprint(f\"Decoded type: {decoded_type.__name__}\")", 2257 2339 "crumbs": [ 2258 2340 "Guide", 2259 2341 "Getting Started", 2260 - "Quick Start" 2342 + "Local Workflow" 2261 2343 ] 2262 2344 }, 2263 2345 { 2264 - "objectID": "tutorials/quickstart.html#shuffled-iteration", 2265 - "href": "tutorials/quickstart.html#shuffled-iteration", 2266 - "title": "Quick Start", 2267 - "section": "Shuffled Iteration", 2268 - "text": "Shuffled Iteration\nFor training, use shuffled iteration:\n\nfor batch in dataset.shuffled(batch_size=32):\n # Samples are shuffled at shard and sample level\n images = batch.image\n labels = batch.label\n\n # Train your model\n # model.train(images, labels)\n break", 2346 + "objectID": "tutorials/local-workflow.html#s3datastore", 2347 + "href": "tutorials/local-workflow.html#s3datastore", 2348 + "title": "Local Workflow", 2349 + "section": "S3DataStore", 2350 + "text": "S3DataStore\nThe S3DataStore implements the AbstractDataStore protocol for S3-compatible object storage. 
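Sketched as a typing.Protocol, that interface looks roughly like this. The sketch is illustrative only: it names just the two methods these docs exercise (write_shards() and supports_streaming()), and atdata's actual AbstractDataStore protocol may declare more members:

```python
# Illustrative sketch, not atdata's real AbstractDataStore definition.
# Method names and arguments follow the calls shown in these docs.
from typing import Any, Protocol

class DataStoreLike(Protocol):
    def write_shards(self, dataset: Any, prefix: str, maxcount: int) -> list[str]:
        """Upload a dataset's shards and return their storage URLs."""
        ...

    def supports_streaming(self) -> bool:
        """Whether stored shards can be streamed back without a full download."""
        ...
```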
It works with:\n\nAWS S3: Production-scale cloud storage\nMinIO: Self-hosted S3-compatible storage (great for development)\nCloudflare R2: Cost-effective S3-compatible storage\n\nThe data store handles uploading tar shards and creating signed URLs for streaming access.\nFor direct S3 operations:\n\ncreds = {\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n}\n\nstore = S3DataStore(creds, bucket=\"my-bucket\")\n\nprint(f\"Bucket: {store.bucket}\")\nprint(f\"Supports streaming: {store.supports_streaming()}\")", 2269 2351 "crumbs": [ 2270 2352 "Guide", 2271 2353 "Getting Started", 2272 - "Quick Start" 2354 + "Local Workflow" 2273 2355 ] 2274 2356 }, 2275 2357 { 2276 - "objectID": "tutorials/quickstart.html#use-lenses-for-type-transformations", 2277 - "href": "tutorials/quickstart.html#use-lenses-for-type-transformations", 2278 - "title": "Quick Start", 2279 - "section": "Use Lenses for Type Transformations", 2280 - "text": "Use Lenses for Type Transformations\nView datasets through different schemas:\n\n# Define a simplified view type\n@atdata.packable\nclass SimplifiedSample:\n label: str\n confidence: float\n\n# Create a lens transformation\n@atdata.lens\ndef simplify(src: ImageSample) -&gt; SimplifiedSample:\n return SimplifiedSample(label=src.label, confidence=src.confidence)\n\n# View dataset through lens\nsimple_ds = dataset.as_type(SimplifiedSample)\n\nfor batch in simple_ds.ordered(batch_size=8):\n print(f\"Labels: {batch.label}\")\n print(f\"Confidences: {batch.confidence}\")\n break", 2358 + "objectID": "tutorials/local-workflow.html#complete-index-workflow", 2359 + "href": "tutorials/local-workflow.html#complete-index-workflow", 2360 + "title": "Local Workflow", 2361 + "section": "Complete Index Workflow", 2362 + "text": "Complete Index Workflow\nHere’s the typical workflow for publishing a dataset to your team:\n\nCreate samples using your @packable type\nWrite to local tar for staging\nCreate a Dataset wrapper\nConnect to index with data store\nPublish schema for type consistency\nInsert dataset (uploads to S3, indexes in Redis)\n\nThe index composition pattern (LocalIndex(data_store=S3DataStore(...))) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.\nUse LocalIndex with S3DataStore to store datasets with S3 storage and Redis indexing:\n\n# 1. Create sample data\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10\n )\n for i in range(1000)\n]\nprint(f\"Created {len(samples)} training samples\")\n\n# 2. Write to local tar file\nwith wds.writer.TarWriter(\"local-data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})\nprint(\"Wrote samples to local tar file\")\n\n# 3. Create Dataset\nds = atdata.Dataset[TrainingSample](\"local-data-000000.tar\")\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(redis=redis, data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nentry = index.insert_dataset(ds, name=\"training-v1\", prefix=\"datasets\")\nprint(f\"Stored at: {entry.data_urls}\")\nprint(f\"CID: {entry.cid}\")\n\n# 5. 
Retrieve later\nretrieved_entry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](retrieved_entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(f\"Batch features shape: {batch.features.shape}\")\n break", 2281 2363 "crumbs": [ 2282 2364 "Guide", 2283 2365 "Getting Started", 2284 - "Quick Start" 2366 + "Local Workflow" 2285 2367 ] 2286 2368 }, 2287 2369 { 2288 - "objectID": "tutorials/quickstart.html#next-steps", 2289 - "href": "tutorials/quickstart.html#next-steps", 2290 - "title": "Quick Start", 2370 + "objectID": "tutorials/local-workflow.html#using-load_dataset-with-index", 2371 + "href": "tutorials/local-workflow.html#using-load_dataset-with-index", 2372 + "title": "Local Workflow", 2373 + "section": "Using load_dataset with Index", 2374 + "text": "Using load_dataset with Index\nThe load_dataset() function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve @local/ prefixed paths to the actual data URLs and apply the correct credentials automatically.\nThe load_dataset() function supports index lookup:\n\nfrom atdata import load_dataset\n\n# Load from local index\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# The index resolves the dataset name to URLs and schema\nfor batch in ds.shuffled(batch_size=32):\n process(batch)\n break", 2375 + "crumbs": [ 2376 + "Guide", 2377 + "Getting Started", 2378 + "Local Workflow" 2379 + ] 2380 + }, 2381 + { 2382 + "objectID": "tutorials/local-workflow.html#what-youve-learned", 2383 + "href": "tutorials/local-workflow.html#what-youve-learned", 2384 + "title": "Local Workflow", 2385 + "section": "What You’ve Learned", 2386 + "text": "What You’ve Learned\nYou now understand team-scale storage in atdata:\n\n\n\n\n\n\n\nConcept\nPurpose\n\n\n\n\nLocalIndex\nRedis-backed dataset registry implementing AbstractIndex\n\n\nS3DataStore\nS3-compatible object storage implementing AbstractDataStore\n\n\nLocalDatasetEntry\nContent-addressed dataset entries with CIDs\n\n\nSchema publishing\nShared type definitions for team consistency\n\n\n\nThe same sample types you defined in the Quick Start work seamlessly here—the only change is where the data lives.", 2387 + "crumbs": [ 2388 + "Guide", 2389 + "Getting Started", 2390 + "Local Workflow" 2391 + ] 2392 + }, 2393 + { 2394 + "objectID": "tutorials/local-workflow.html#next-steps", 2395 + "href": "tutorials/local-workflow.html#next-steps", 2396 + "title": "Local Workflow", 2291 2397 "section": "Next Steps", 2292 - "text": "Next Steps\n\nLocal Workflow - Store datasets with Redis + S3\nAtmosphere Publishing - Publish to ATProto federation\nPackable Samples - Deep dive into sample types\nDatasets - Advanced dataset operations", 2398 + "text": "Next Steps\n\n\n\n\n\n\nReady for Public Sharing?\n\n\n\nThe Atmosphere Publishing tutorial shows how to publish datasets to the ATProto network for decentralized, cross-organization discovery.\n\n\n\nAtmosphere Publishing - Publish to ATProto federation\nPromotion Workflow - Migrate from local to atmosphere\nLocal Storage Reference - Complete API reference", 2293 2399 "crumbs": [ 2294 2400 "Guide", 2295 2401 "Getting Started", 2296 - "Quick Start" 2402 + "Local Workflow" 2297 2403 ] 2298 2404 }, 2299 2405 { 2300 - "objectID": "reference/uri-spec.html", 2301 - "href": "reference/uri-spec.html", 2302 - "title": "URI Specification", 2406 + "objectID": "reference/promotion.html", 2407 + "href": "reference/promotion.html", 2408 
+ "title": "Promotion Workflow", 2303 2409 "section": "", 2304 - "text": "The atdata:// URI scheme provides a unified way to address atdata resources across local development and the ATProto federation.", 2410 + "text": "The promotion workflow migrates datasets from local storage (Redis + S3) to the ATProto atmosphere network, enabling federation and discovery.", 2305 2411 "crumbs": [ 2306 2412 "Guide", 2307 2413 "Reference", 2308 - "URI Specification" 2414 + "Promotion Workflow" 2309 2415 ] 2310 2416 }, 2311 2417 { 2312 - "objectID": "reference/uri-spec.html#overview", 2313 - "href": "reference/uri-spec.html#overview", 2314 - "title": "URI Specification", 2418 + "objectID": "reference/promotion.html#overview", 2419 + "href": "reference/promotion.html#overview", 2420 + "title": "Promotion Workflow", 2315 2421 "section": "Overview", 2316 - "text": "Overview\nThe atdata URI scheme:\n\nFollows RFC 3986 syntax\nProvides consistent addressing for local and atmosphere resources\nEnables seamless promotion from development to production", 2422 + "text": "Overview\nPromotion handles:\n\nSchema deduplication: Avoids publishing duplicate schemas\nData URL preservation: Keeps existing S3 URLs or copies to new storage\nMetadata transfer: Preserves tags, descriptions, and custom metadata", 2423 + "crumbs": [ 2424 + "Guide", 2425 + "Reference", 2426 + "Promotion Workflow" 2427 + ] 2428 + }, 2429 + { 2430 + "objectID": "reference/promotion.html#basic-usage", 2431 + "href": "reference/promotion.html#basic-usage", 2432 + "title": "Promotion Workflow", 2433 + "section": "Basic Usage", 2434 + "text": "Basic Usage\n\nfrom atdata.local import LocalIndex\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\n\n# Setup\nlocal_index = LocalIndex()\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Get local entry\nentry = local_index.get_entry_by_name(\"my-dataset\")\n\n# Promote to atmosphere\nat_uri = promote_to_atmosphere(entry, local_index, client)\nprint(f\"Published: {at_uri}\")", 2435 + "crumbs": [ 2436 + "Guide", 2437 + "Reference", 2438 + "Promotion Workflow" 2439 + ] 2440 + }, 2441 + { 2442 + "objectID": "reference/promotion.html#with-metadata", 2443 + "href": "reference/promotion.html#with-metadata", 2444 + "title": "Promotion Workflow", 2445 + "section": "With Metadata", 2446 + "text": "With Metadata\n\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n name=\"my-dataset-v2\", # Override name\n description=\"Training images\", # Add description\n tags=[\"images\", \"training\"], # Add discovery tags\n license=\"MIT\", # Specify license\n)", 2447 + "crumbs": [ 2448 + "Guide", 2449 + "Reference", 2450 + "Promotion Workflow" 2451 + ] 2452 + }, 2453 + { 2454 + "objectID": "reference/promotion.html#schema-deduplication", 2455 + "href": "reference/promotion.html#schema-deduplication", 2456 + "title": "Promotion Workflow", 2457 + "section": "Schema Deduplication", 2458 + "text": "Schema Deduplication\nThe promotion workflow automatically checks for existing schemas:\n\n# First promotion: publishes schema\nuri1 = promote_to_atmosphere(entry1, local_index, client)\n\n# Second promotion with same schema type + version: reuses existing schema\nuri2 = promote_to_atmosphere(entry2, local_index, client)\n\nSchema matching is based on:\n\n{module}.{class_name} (e.g., mymodule.ImageSample)\nVersion string (e.g., 1.0.0)", 2459 + "crumbs": [ 2460 + "Guide", 2461 + "Reference", 2462 + "Promotion Workflow" 2463 + ] 2464 + 
}, 2465 + { 2466 + "objectID": "reference/promotion.html#data-storage-options", 2467 + "href": "reference/promotion.html#data-storage-options", 2468 + "title": "Promotion Workflow", 2469 + "section": "Data Storage Options", 2470 + "text": "Data Storage Options\n\nKeep Existing URLs (Default)Copy to New Storage\n\n\nBy default, promotion keeps the original data URLs:\n\n# Data stays in original S3 location\nat_uri = promote_to_atmosphere(entry, local_index, client)\n\n\nData stays in original S3 location\nDataset record points to existing URLs\nFastest option, no data copying\nRequires original storage to remain accessible\n\n\n\nTo copy data to a different storage location:\n\nfrom atdata.local import S3DataStore\n\n# Create new data store\nnew_store = S3DataStore(\n credentials=\"new-s3-creds.env\",\n bucket=\"public-datasets\",\n)\n\n# Promote with data copy\nat_uri = promote_to_atmosphere(\n entry,\n local_index,\n client,\n data_store=new_store, # Copy data to new storage\n)\n\n\nData is copied to new bucket\nDataset record points to new URLs\nGood for moving from private to public storage", 2317 2471 "crumbs": [ 2318 2472 "Guide", 2319 2473 "Reference", 2320 - "URI Specification" 2474 + "Promotion Workflow" 2321 2475 ] 2322 2476 }, 2323 2477 { 2324 - "objectID": "reference/uri-spec.html#uri-format", 2325 - "href": "reference/uri-spec.html#uri-format", 2326 - "title": "URI Specification", 2327 - "section": "URI Format", 2328 - "text": "URI Format\natdata://{authority}/{resource_type}/{name}@{version}\n\nAuthority\nThe authority identifies where the resource is stored:\n\n\n\nAuthority\nDescription\nExample\n\n\n\n\nlocal\nLocal Redis/S3 storage\natdata://local/...\n\n\n{handle}\nATProto handle\natdata://alice.bsky.social/...\n\n\n{did}\nATProto DID\natdata://did:plc:abc123/...\n\n\n\n\n\nResource Types\n\n\n\nResource Type\nDescription\n\n\n\n\nsampleSchema\nPackableSample type definitions\n\n\ndataset\nDataset entries (future)\n\n\nlens\nLens transformations (future)\n\n\n\n\n\nVersion Specifiers\nVersions follow semantic versioning and are specified with @:\n\n\n\nSpecifier\nDescription\nExample\n\n\n\n\n@{major}.{minor}.{patch}\nExact version\n@1.0.0, @2.1.3\n\n\n(none)\nLatest version\nResolves to highest semver", 2478 + "objectID": "reference/promotion.html#complete-workflow-example", 2479 + "href": "reference/promotion.html#complete-workflow-example", 2480 + "title": "Promotion Workflow", 2481 + "section": "Complete Workflow Example", 2482 + "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nfrom atdata.atmosphere import AtmosphereClient\nfrom atdata.promote import promote_to_atmosphere\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n\n# 2. Create local dataset\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n )\n for i in range(1000)\n]\n\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Set up index with S3 data store\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nlocal_index = LocalIndex(data_store=store)\n\n# 4. 
Publish schema and insert dataset\nlocal_index.publish_schema(FeatureSample, version=\"1.0.0\")\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nlocal_entry = local_index.insert_dataset(dataset, name=\"feature-vectors-v1\", prefix=\"features\")\n\n# 5. Promote to atmosphere\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nat_uri = promote_to_atmosphere(\n local_entry,\n local_index,\n client,\n description=\"Feature vectors for classification\",\n tags=[\"features\", \"embeddings\"],\n license=\"MIT\",\n)\n\nprint(f\"Dataset published: {at_uri}\")\n\n# 6. Verify on atmosphere\nfrom atdata.atmosphere import AtmosphereIndex\n\natm_index = AtmosphereIndex(client)\nentry = atm_index.get_dataset(at_uri)\nprint(f\"Name: {entry.name}\")\nprint(f\"Schema: {entry.schema_ref}\")\nprint(f\"URLs: {entry.data_urls}\")", 2329 2483 "crumbs": [ 2330 2484 "Guide", 2331 2485 "Reference", 2332 - "URI Specification" 2486 + "Promotion Workflow" 2333 2487 ] 2334 2488 }, 2335 2489 { 2336 - "objectID": "reference/uri-spec.html#examples", 2337 - "href": "reference/uri-spec.html#examples", 2338 - "title": "URI Specification", 2339 - "section": "Examples", 2340 - "text": "Examples\n\nLocal Development\n\nfrom atdata.local import Index\n\nindex = Index()\n\n# Publish a schema (returns atdata:// URI)\nref = index.publish_schema(MySample, version=\"1.0.0\")\n# =&gt; \"atdata://local/sampleSchema/MySample@1.0.0\"\n\n# Auto-increment version\nref = index.publish_schema(MySample)\n# =&gt; \"atdata://local/sampleSchema/MySample@1.0.1\"\n\n# Retrieve by URI\nschema = index.get_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n\n\n\nAtmosphere (ATProto Federation)\n\nfrom atdata.atmosphere import Client\n\nclient = Client()\n\n# Publish returns at:// URI that maps to atdata://\nref = client.publish_schema(MySample)\n# =&gt; \"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\"\n\n# Can also be addressed as:\n# =&gt; \"atdata://did:plc:abc123/sampleSchema/MySample@1.0.0\"\n# =&gt; \"atdata://alice.bsky.social/sampleSchema/MySample@1.0.0\"", 2490 + "objectID": "reference/promotion.html#error-handling", 2491 + "href": "reference/promotion.html#error-handling", 2492 + "title": "Promotion Workflow", 2493 + "section": "Error Handling", 2494 + "text": "Error Handling\n\ntry:\n at_uri = promote_to_atmosphere(entry, local_index, client)\nexcept KeyError as e:\n # Schema not found in local index\n print(f\"Missing schema: {e}\")\nexcept ValueError as e:\n # Entry has no data URLs\n print(f\"Invalid entry: {e}\")", 2341 2495 "crumbs": [ 2342 2496 "Guide", 2343 2497 "Reference", 2344 - "URI Specification" 2498 + "Promotion Workflow" 2345 2499 ] 2346 2500 }, 2347 2501 { 2348 - "objectID": "reference/uri-spec.html#relationship-to-at-protocol-uris", 2349 - "href": "reference/uri-spec.html#relationship-to-at-protocol-uris", 2350 - "title": "URI Specification", 2351 - "section": "Relationship to AT Protocol URIs", 2352 - "text": "Relationship to AT Protocol URIs\nThe atdata:// scheme is inspired by and maps to ATProto’s at:// scheme:\n\n\n\n\n\n\n\natdata://\nat://\n\n\n\n\natdata://{did}/sampleSchema/{name}@{version}\nat://{did}/ac.foundation.dataset.sampleSchema/{rkey}\n\n\natdata://local/...\n(local only, no at:// equivalent)\n\n\n\nWhen publishing to the atmosphere, atdata URIs are automatically resolved to their corresponding at:// URIs for federation compatibility.", 2502 + "objectID": "reference/promotion.html#requirements", 2503 + "href": 
"reference/promotion.html#requirements", 2504 + "title": "Promotion Workflow", 2505 + "section": "Requirements", 2506 + "text": "Requirements\nBefore promotion:\n\nDataset must be in local index (via Index.insert_dataset() or Index.add_entry())\nSchema must be published to local index (via Index.publish_schema())\nAtmosphereClient must be authenticated", 2353 2507 "crumbs": [ 2354 2508 "Guide", 2355 2509 "Reference", 2356 - "URI Specification" 2510 + "Promotion Workflow" 2357 2511 ] 2358 2512 }, 2359 2513 { 2360 - "objectID": "reference/uri-spec.html#legacy-format", 2361 - "href": "reference/uri-spec.html#legacy-format", 2362 - "title": "URI Specification", 2363 - "section": "Legacy Format", 2364 - "text": "Legacy Format\nFor backwards compatibility, the local index also accepts the legacy format:\nlocal://schemas/{module.Class}@{version}\nThis format is deprecated and will be removed in a future version. Use atdata://local/sampleSchema/{name}@{version} instead.", 2514 + "objectID": "reference/promotion.html#related", 2515 + "href": "reference/promotion.html#related", 2516 + "title": "Promotion Workflow", 2517 + "section": "Related", 2518 + "text": "Related\n\nLocal Storage - Setting up local datasets\nAtmosphere - ATProto integration\nProtocols - AbstractIndex and AbstractDataStore", 2365 2519 "crumbs": [ 2366 2520 "Guide", 2367 2521 "Reference", 2368 - "URI Specification" 2522 + "Promotion Workflow" 2369 2523 ] 2370 2524 }, 2371 2525 { 2372 - "objectID": "reference/local-storage.html", 2373 - "href": "reference/local-storage.html", 2374 - "title": "Local Storage", 2526 + "objectID": "reference/load-dataset.html", 2527 + "href": "reference/load-dataset.html", 2528 + "title": "load_dataset API", 2375 2529 "section": "", 2376 - "text": "The local storage module provides a Redis + S3 backend for storing and managing datasets before publishing to the ATProto federation.", 2530 + "text": "The load_dataset() function provides a HuggingFace Datasets-style interface for loading typed datasets.", 2377 2531 "crumbs": [ 2378 2532 "Guide", 2379 2533 "Reference", 2380 - "Local Storage" 2534 + "load_dataset API" 2381 2535 ] 2382 2536 }, 2383 2537 { 2384 - "objectID": "reference/local-storage.html#overview", 2385 - "href": "reference/local-storage.html#overview", 2386 - "title": "Local Storage", 2538 + "objectID": "reference/load-dataset.html#overview", 2539 + "href": "reference/load-dataset.html#overview", 2540 + "title": "load_dataset API", 2387 2541 "section": "Overview", 2388 - "text": "Overview\nLocal storage uses:\n\nRedis for indexing and tracking dataset metadata\nS3-compatible storage for dataset tar files\n\nThis enables development and small-scale deployment before promoting to the full ATProto infrastructure.", 2542 + "text": "Overview\nKey differences from HuggingFace Datasets:\n\nRequires explicit sample_type parameter (typed dataclass) unless using index\nReturns atdata.Dataset[ST] instead of HF Dataset\nBuilt on WebDataset for efficient streaming\nNo Arrow caching layer", 2389 2543 "crumbs": [ 2390 2544 "Guide", 2391 2545 "Reference", 2392 - "Local Storage" 2546 + "load_dataset API" 2393 2547 ] 2394 2548 }, 2395 2549 { 2396 - "objectID": "reference/local-storage.html#localindex", 2397 - "href": "reference/local-storage.html#localindex", 2398 - "title": "Local Storage", 2399 - "section": "LocalIndex", 2400 - "text": "LocalIndex\nThe index tracks datasets in Redis:\n\nfrom atdata.local import LocalIndex\n\n# Default connection (localhost:6379)\nindex = LocalIndex()\n\n# Custom Redis 
connection\nimport redis\nr = redis.Redis(host='custom-host', port=6379)\nindex = LocalIndex(redis=r)\n\n# With connection kwargs\nindex = LocalIndex(host='custom-host', port=6379, db=1)\n\n\nAdding Entries\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.add_entry(\n dataset,\n name=\"my-dataset\",\n schema_ref=\"atdata://local/sampleSchema/ImageSample@1.0.0\", # optional\n metadata={\"description\": \"Training images\"}, # optional\n)\n\nprint(entry.cid) # Content identifier\nprint(entry.name) # \"my-dataset\"\nprint(entry.data_urls) # [\"data-{000000..000009}.tar\"]\n\n\n\nListing and Retrieving\n\n# Iterate all entries\nfor entry in index.entries:\n print(f\"{entry.name}: {entry.cid}\")\n\n# Get as list\nall_entries = index.all_entries\n\n# Get by name\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Get by CID\nentry = index.get_entry(\"bafyrei...\")", 2550 + "objectID": "reference/load-dataset.html#basic-usage", 2551 + "href": "reference/load-dataset.html#basic-usage", 2552 + "title": "load_dataset API", 2553 + "section": "Basic Usage", 2554 + "text": "Basic Usage\n\nimport atdata\nfrom atdata import load_dataset\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass TextSample:\n text: str\n label: int\n\n# Load a specific split\ntrain_ds = load_dataset(\"path/to/data.tar\", TextSample, split=\"train\")\n\n# Load all splits (returns DatasetDict)\nds_dict = load_dataset(\"path/to/data/\", TextSample)\ntrain_ds = ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]", 2401 2555 "crumbs": [ 2402 2556 "Guide", 2403 2557 "Reference", 2404 - "Local Storage" 2558 + "load_dataset API" 2405 2559 ] 2406 2560 }, 2407 2561 { 2408 - "objectID": "reference/local-storage.html#repo-deprecated", 2409 - "href": "reference/local-storage.html#repo-deprecated", 2410 - "title": "Local Storage", 2411 - "section": "Repo (Deprecated)", 2412 - "text": "Repo (Deprecated)\n\n\n\n\n\n\nWarning\n\n\n\nRepo is deprecated. 
Use LocalIndex with S3DataStore instead for new code.\n\n\nThe Repo class combines S3 storage with Redis indexing:\n\nfrom atdata.local import Repo\n\n# From credentials file\nrepo = Repo(\n s3_credentials=\"path/to/.env\",\n hive_path=\"my-bucket/datasets\",\n)\n\n# From credentials dict\nrepo = Repo(\n s3_credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n hive_path=\"my-bucket/datasets\",\n)\n\nPreferred approach - Use LocalIndex with S3DataStore:\n\nfrom atdata.local import LocalIndex, S3DataStore\n\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"my-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Insert dataset\nentry = index.insert_dataset(dataset, name=\"my-dataset\", prefix=\"datasets/v1\")\n\n\nCredentials File Format\nThe .env file should contain:\nAWS_ENDPOINT=http://localhost:9000\nAWS_ACCESS_KEY_ID=your-access-key\nAWS_SECRET_ACCESS_KEY=your-secret-key\n\n\n\n\n\n\nNote\n\n\n\nFor AWS S3, omit AWS_ENDPOINT to use the default endpoint.\n\n\n\n\nInserting Datasets\n\nimport webdataset as wds\nimport numpy as np\n\n# Create dataset from samples\nsamples = [ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=f\"sample_{i}\"\n) for i in range(1000)]\n\nwith wds.writer.TarWriter(\"temp.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\ndataset = atdata.Dataset[ImageSample](\"temp.tar\")\n\n# Insert into repo (writes to S3 + indexes in Redis)\nentry, stored_dataset = repo.insert(\n dataset,\n name=\"training-images-v1\",\n cache_local=False, # Stream directly to S3\n)\n\nprint(entry.cid) # Content identifier\nprint(stored_dataset.url) # S3 URL for the stored data\nprint(stored_dataset.shard_list) # Individual shard URLs\n\n\n\nInsert Options\n\nentry, ds = repo.insert(\n dataset,\n name=\"my-dataset\",\n cache_local=True, # Write locally first, then copy (faster for some workloads)\n maxcount=10000, # Samples per shard\n maxsize=100_000_000, # Max shard size in bytes\n)", 2562 + "objectID": "reference/load-dataset.html#path-formats", 2563 + "href": "reference/load-dataset.html#path-formats", 2564 + "title": "load_dataset API", 2565 + "section": "Path Formats", 2566 + "text": "Path Formats\n\nWebDataset Brace Notation\n\n# Range notation\nds = load_dataset(\"data-{000000..000099}.tar\", MySample, split=\"train\")\n\n# List notation\nds = load_dataset(\"data-{train,test,val}.tar\", MySample, split=\"train\")\n\n\n\nGlob Patterns\n\n# Match all tar files\nds = load_dataset(\"path/to/*.tar\", MySample)\n\n# Match pattern\nds = load_dataset(\"path/to/train-*.tar\", MySample, split=\"train\")\n\n\n\nLocal Directory\n\n# Scans for .tar files\nds = load_dataset(\"./my-dataset/\", MySample)\n\n\n\nRemote URLs\n\n# S3 (public buckets)\nds = load_dataset(\"s3://bucket/data-{000..099}.tar\", MySample, split=\"train\")\n\n# HTTP/HTTPS\nds = load_dataset(\"https://example.com/data.tar\", MySample, split=\"train\")\n\n# Google Cloud Storage\nds = load_dataset(\"gs://bucket/data.tar\", MySample, split=\"train\")\n\n\n\n\n\n\n\nNote\n\n\n\nFor private S3 buckets or S3-compatible storage with authentication, use atdata.S3Source with Dataset directly. 
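A sketch of what that might look like. S3Source's constructor is not documented on this page, so the urls= and credentials= parameters below are assumptions rather than the real signature; the Dataset[T](source) pattern matches the BlobSource usage shown elsewhere in these docs:

```python
# Hypothetical S3Source usage for a private bucket. The parameter names
# (urls=, credentials=) are assumptions; see the Datasets reference
# for the actual signature.
import atdata
from numpy.typing import NDArray

@atdata.packable
class MySample:
    features: NDArray
    label: int

source = atdata.S3Source(
    urls=["s3://private-bucket/data-{000000..000009}.tar"],
    credentials={
        "AWS_ENDPOINT": "https://s3.example.com",
        "AWS_ACCESS_KEY_ID": "YOUR_KEY",
        "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET",
    },
)
ds = atdata.Dataset[MySample](source)
```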
See Datasets for details.\n\n\n\n\nIndex Lookup\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# Load from local index (auto-resolves type from schema)\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# With explicit type\nds = load_dataset(\"@local/my-dataset\", MySample, index=index, split=\"train\")", 2413 2567 "crumbs": [ 2414 2568 "Guide", 2415 2569 "Reference", 2416 - "Local Storage" 2570 + "load_dataset API" 2417 2571 ] 2418 2572 }, 2419 2573 { 2420 - "objectID": "reference/local-storage.html#localdatasetentry", 2421 - "href": "reference/local-storage.html#localdatasetentry", 2422 - "title": "Local Storage", 2423 - "section": "LocalDatasetEntry", 2424 - "text": "LocalDatasetEntry\nIndex entries provide content-addressable identification:\n\nentry = index.get_entry_by_name(\"my-dataset\")\n\n# Core properties (IndexEntry protocol)\nentry.name # Human-readable name\nentry.schema_ref # Schema reference\nentry.data_urls # WebDataset URLs\nentry.metadata # Arbitrary metadata dict or None\n\n# Content addressing\nentry.cid # ATProto-compatible CID (content identifier)\n\n# Legacy compatibility\nentry.wds_url # First data URL\nentry.sample_kind # Same as schema_ref\n\n\n\n\n\n\n\nTip\n\n\n\nThe CID is generated from the entry’s content (schema_ref + data_urls), ensuring identical data produces identical CIDs whether stored locally or in the atmosphere.", 2574 + "objectID": "reference/load-dataset.html#split-detection", 2575 + "href": "reference/load-dataset.html#split-detection", 2576 + "title": "load_dataset API", 2577 + "section": "Split Detection", 2578 + "text": "Split Detection\nSplits are automatically detected from filenames and directories:\n\n\n\nPattern\nDetected Split\n\n\n\n\ntrain-*.tar, training-*.tar\ntrain\n\n\ntest-*.tar, testing-*.tar\ntest\n\n\nval-*.tar, valid-*.tar, validation-*.tar\nvalidation\n\n\ndev-*.tar, development-*.tar\nvalidation\n\n\ntrain/*.tar (directory)\ntrain\n\n\ntest/*.tar (directory)\ntest\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nFiles without a detected split default to “train”.", 2425 2579 "crumbs": [ 2426 2580 "Guide", 2427 2581 "Reference", 2428 - "Local Storage" 2582 + "load_dataset API" 2429 2583 ] 2430 2584 }, 2431 2585 { 2432 - "objectID": "reference/local-storage.html#schema-storage", 2433 - "href": "reference/local-storage.html#schema-storage", 2434 - "title": "Local Storage", 2435 - "section": "Schema Storage", 2436 - "text": "Schema Storage\nSchemas can be stored and retrieved from the index:\n\n# Publish a schema\nschema_ref = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image with label annotation\",\n)\n# Returns: \"atdata://local/sampleSchema/ImageSample@1.0.0\"\n\n# Retrieve schema record\nschema = index.get_schema(schema_ref)\n# {\n# \"name\": \"ImageSample\",\n# \"version\": \"1.0.0\",\n# \"fields\": [...],\n# \"description\": \"...\",\n# \"createdAt\": \"...\",\n# }\n\n# List all schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']}@{schema['version']}\")\n\n# Reconstruct sample type from schema\nSampleType = index.decode_schema(schema_ref)\ndataset = atdata.Dataset[SampleType](entry.data_urls[0])", 2586 + "objectID": "reference/load-dataset.html#datasetdict", 2587 + "href": "reference/load-dataset.html#datasetdict", 2588 + "title": "load_dataset API", 2589 + "section": "DatasetDict", 2590 + "text": "DatasetDict\nWhen loading without split=, returns a DatasetDict:\n\nds_dict = load_dataset(\"path/to/data/\", MySample)\n\n# Access splits\ntrain_ds = 
ds_dict[\"train\"]\ntest_ds = ds_dict[\"test\"]\n\n# Iterate splits\nfor name, dataset in ds_dict.items():\n print(f\"{name}: {len(dataset.shard_list)} shards\")\n\n# Properties\nprint(ds_dict.num_shards) # {'train': 10, 'test': 2}\nprint(ds_dict.sample_type) # &lt;class 'MySample'&gt;\nprint(ds_dict.streaming) # False", 2437 2591 "crumbs": [ 2438 2592 "Guide", 2439 2593 "Reference", 2440 - "Local Storage" 2594 + "load_dataset API" 2441 2595 ] 2442 2596 }, 2443 2597 { 2444 - "objectID": "reference/local-storage.html#s3datastore", 2445 - "href": "reference/local-storage.html#s3datastore", 2446 - "title": "Local Storage", 2447 - "section": "S3DataStore", 2448 - "text": "S3DataStore\nFor direct S3 operations without Redis indexing:\n\nfrom atdata.local import S3DataStore\n\nstore = S3DataStore(\n credentials=\"path/to/.env\",\n bucket=\"my-bucket\",\n)\n\n# Write dataset shards\nurls = store.write_shards(\n dataset,\n prefix=\"datasets/v1\",\n maxcount=10000,\n)\n# Returns: [\"s3://my-bucket/datasets/v1/data--uuid--000000.tar\", ...]\n\n# Check capabilities\nstore.supports_streaming() # True", 2598 + "objectID": "reference/load-dataset.html#explicit-data-files", 2599 + "href": "reference/load-dataset.html#explicit-data-files", 2600 + "title": "load_dataset API", 2601 + "section": "Explicit Data Files", 2602 + "text": "Explicit Data Files\nOverride automatic detection with data_files:\n\n# Single pattern\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files=\"custom-*.tar\",\n)\n\n# List of patterns\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files=[\"shard-000.tar\", \"shard-001.tar\"],\n)\n\n# Explicit split mapping\nds = load_dataset(\n \"path/to/\",\n MySample,\n data_files={\n \"train\": \"training-shards-*.tar\",\n \"test\": \"eval-data.tar\",\n },\n)", 2603 + "crumbs": [ 2604 + "Guide", 2605 + "Reference", 2606 + "load_dataset API" 2607 + ] 2608 + }, 2609 + { 2610 + "objectID": "reference/load-dataset.html#streaming-mode", 2611 + "href": "reference/load-dataset.html#streaming-mode", 2612 + "title": "load_dataset API", 2613 + "section": "Streaming Mode", 2614 + "text": "Streaming Mode\nThe streaming parameter signals intent for streaming mode:\n\n# Mark as streaming\nds_dict = load_dataset(\"path/to/data.tar\", MySample, streaming=True)\n\n# Check streaming status\nif ds_dict.streaming:\n print(\"Streaming mode\")\n\n\n\n\n\n\n\nTip\n\n\n\natdata datasets are always lazy/streaming via WebDataset pipelines. 
This parameter primarily signals intent.", 2615 + "crumbs": [ 2616 + "Guide", 2617 + "Reference", 2618 + "load_dataset API" 2619 + ] 2620 + }, 2621 + { 2622 + "objectID": "reference/load-dataset.html#auto-type-resolution", 2623 + "href": "reference/load-dataset.html#auto-type-resolution", 2624 + "title": "load_dataset API", 2625 + "section": "Auto Type Resolution", 2626 + "text": "Auto Type Resolution\nWhen using index lookup, the sample type can be resolved automatically:\n\nfrom atdata.local import LocalIndex\n\nindex = LocalIndex()\n\n# No sample_type needed - resolved from schema\nds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")\n\n# Type is inferred from the stored schema\nsample_type = ds.sample_type", 2627 + "crumbs": [ 2628 + "Guide", 2629 + "Reference", 2630 + "load_dataset API" 2631 + ] 2632 + }, 2633 + { 2634 + "objectID": "reference/load-dataset.html#error-handling", 2635 + "href": "reference/load-dataset.html#error-handling", 2636 + "title": "load_dataset API", 2637 + "section": "Error Handling", 2638 + "text": "Error Handling\n\ntry:\n ds = load_dataset(\"path/to/data.tar\", MySample, split=\"train\")\nexcept FileNotFoundError:\n print(\"No data files found\")\nexcept ValueError as e:\n if \"Split\" in str(e):\n print(\"Requested split not found\")\n else:\n print(f\"Invalid configuration: {e}\")\nexcept KeyError:\n print(\"Dataset not found in index\")", 2449 2639 "crumbs": [ 2450 2640 "Guide", 2451 2641 "Reference", 2452 - "Local Storage" 2642 + "load_dataset API" 2453 2643 ] 2454 2644 }, 2455 2645 { 2456 - "objectID": "reference/local-storage.html#complete-workflow-example", 2457 - "href": "reference/local-storage.html#complete-workflow-example", 2458 - "title": "Local Storage", 2459 - "section": "Complete Workflow Example", 2460 - "text": "Complete Workflow Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.local import LocalIndex, S3DataStore\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass TrainingSample:\n features: NDArray\n label: int\n source: str\n\n# 2. Create samples\nsamples = [\n TrainingSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(10000)\n]\n\n# 3. Write to local tar\nwith wds.writer.TarWriter(\"local-data.tar\") as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 4. Set up index with S3 data store and insert\nstore = S3DataStore(\n credentials={\n \"AWS_ENDPOINT\": \"http://localhost:9000\",\n \"AWS_ACCESS_KEY_ID\": \"minioadmin\",\n \"AWS_SECRET_ACCESS_KEY\": \"minioadmin\",\n },\n bucket=\"datasets-bucket\",\n)\nindex = LocalIndex(data_store=store)\n\n# Publish schema and insert dataset\nindex.publish_schema(TrainingSample, version=\"1.0.0\")\nlocal_ds = atdata.Dataset[TrainingSample](\"local-data.tar\")\nentry = index.insert_dataset(local_ds, name=\"training-v1\", prefix=\"training\")\n\n# 5. 
Retrieve later\nentry = index.get_entry_by_name(\"training-v1\")\ndataset = atdata.Dataset[TrainingSample](entry.data_urls[0])\n\nfor batch in dataset.ordered(batch_size=32):\n print(batch.features.shape) # (32, 128)", 2646 + "objectID": "reference/load-dataset.html#complete-example", 2647 + "href": "reference/load-dataset.html#complete-example", 2648 + "title": "load_dataset API", 2649 + "section": "Complete Example", 2650 + "text": "Complete Example\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata import load_dataset\nimport webdataset as wds\n\n# 1. Define sample type\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n\n# 2. Create dataset files\nfor split in [\"train\", \"test\"]:\n with wds.writer.TarWriter(f\"{split}-000.tar\") as sink:\n for i in range(100):\n sample = ImageSample(\n image=np.random.rand(64, 64, 3).astype(np.float32),\n label=f\"sample_{i}\",\n )\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Load with split detection\nds_dict = load_dataset(\"./\", ImageSample)\nprint(ds_dict.keys()) # dict_keys(['train', 'test'])\n\n# 4. Iterate\nfor batch in ds_dict[\"train\"].ordered(batch_size=16):\n print(batch.image.shape) # (16, 64, 64, 3)\n print(batch.label) # ['sample_0', 'sample_1', ...]\n break\n\n# 5. Load specific split\ntrain_ds = load_dataset(\"./\", ImageSample, split=\"train\")\nfor batch in train_ds.ordered(batch_size=32):\n process(batch)", 2461 2651 "crumbs": [ 2462 2652 "Guide", 2463 2653 "Reference", 2464 - "Local Storage" 2654 + "load_dataset API" 2465 2655 ] 2466 2656 }, 2467 2657 { 2468 - "objectID": "reference/local-storage.html#related", 2469 - "href": "reference/local-storage.html#related", 2470 - "title": "Local Storage", 2658 + "objectID": "reference/load-dataset.html#related", 2659 + "href": "reference/load-dataset.html#related", 2660 + "title": "load_dataset API", 2471 2661 "section": "Related", 2472 - "text": "Related\n\nDatasets - Dataset iteration and batching\nProtocols - AbstractIndex and IndexEntry interfaces\nPromotion - Promoting local datasets to ATProto\nAtmosphere - ATProto federation", 2662 + "text": "Related\n\nDatasets - Dataset iteration and batching\nPackable Samples - Defining sample types\nLocal Storage - LocalIndex for index lookup\nProtocols - AbstractIndex interface", 2473 2663 "crumbs": [ 2474 2664 "Guide", 2475 2665 "Reference", 2476 - "Local Storage" 2666 + "load_dataset API" 2477 2667 ] 2478 2668 }, 2479 2669 { 2480 - "objectID": "reference/atmosphere.html", 2481 - "href": "reference/atmosphere.html", 2482 - "title": "Atmosphere (ATProto Integration)", 2670 + "objectID": "reference/lenses.html", 2671 + "href": "reference/lenses.html", 2672 + "title": "Lenses", 2483 2673 "section": "", 2484 - "text": "The atmosphere module enables publishing and discovering datasets on the ATProto network, creating a federated ecosystem for typed datasets.", 2674 + "text": "Lenses provide bidirectional transformations between sample types, enabling datasets to be viewed through different schemas without duplicating data.", 2485 2675 "crumbs": [ 2486 2676 "Guide", 2487 2677 "Reference", 2488 - "Atmosphere (ATProto Integration)" 2678 + "Lenses" 2489 2679 ] 2490 2680 }, 2491 2681 { 2492 - "objectID": "reference/atmosphere.html#installation", 2493 - "href": "reference/atmosphere.html#installation", 2494 - "title": "Atmosphere (ATProto Integration)", 2495 - "section": "Installation", 2496 - "text": "Installation\npip install atdata[atmosphere]\n# or\npip install atproto", 
2682 + "objectID": "reference/lenses.html#overview", 2683 + "href": "reference/lenses.html#overview", 2684 + "title": "Lenses", 2685 + "section": "Overview", 2686 + "text": "Overview\nA lens consists of:\n\nGetter: Transforms source type S to view type V\nPutter: Updates source based on a modified view (optional)", 2497 2687 "crumbs": [ 2498 2688 "Guide", 2499 2689 "Reference", 2500 - "Atmosphere (ATProto Integration)" 2690 + "Lenses" 2501 2691 ] 2502 2692 }, 2503 2693 { 2504 - "objectID": "reference/atmosphere.html#overview", 2505 - "href": "reference/atmosphere.html#overview", 2506 - "title": "Atmosphere (ATProto Integration)", 2507 - "section": "Overview", 2508 - "text": "Overview\nATProto integration publishes datasets, schemas, and lenses as records in the ac.foundation.dataset.* namespace. This enables:\n\nDiscovery through the ATProto network\nFederation across different hosts\nVerifiability through content-addressable records", 2694 + "objectID": "reference/lenses.html#creating-a-lens", 2695 + "href": "reference/lenses.html#creating-a-lens", 2696 + "title": "Lenses", 2697 + "section": "Creating a Lens", 2698 + "text": "Creating a Lens\nUse the @lens decorator to define a getter:\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass FullSample:\n image: NDArray\n label: str\n confidence: float\n metadata: dict\n\n@atdata.packable\nclass SimpleSample:\n label: str\n confidence: float\n\n@atdata.lens\ndef simplify(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\nThe decorator:\n\nCreates a Lens object from the getter function\nRegisters it in the global LensNetwork registry\nExtracts source/view types from annotations", 2509 2699 "crumbs": [ 2510 2700 "Guide", 2511 2701 "Reference", 2512 - "Atmosphere (ATProto Integration)" 2702 + "Lenses" 2513 2703 ] 2514 2704 }, 2515 2705 { 2516 - "objectID": "reference/atmosphere.html#atmosphereclient", 2517 - "href": "reference/atmosphere.html#atmosphereclient", 2518 - "title": "Atmosphere (ATProto Integration)", 2519 - "section": "AtmosphereClient", 2520 - "text": "AtmosphereClient\nThe client handles authentication and record operations:\n\nfrom atdata.atmosphere import AtmosphereClient\n\nclient = AtmosphereClient()\n\n# Login with app-specific password (not your main password!)\nclient.login(\"alice.bsky.social\", \"app-password\")\n\nprint(client.did) # 'did:plc:...'\nprint(client.handle) # 'alice.bsky.social'\n\n\n\n\n\n\n\nWarning\n\n\n\nAlways use an app-specific password, not your main Bluesky password. 
Create app passwords at bsky.app/settings/app-passwords.\n\n\n\nSession Management\nSave and restore sessions to avoid re-authentication:\n\n# Export session for later\nsession_string = client.export_session()\n\n# Later: restore session\nnew_client = AtmosphereClient()\nnew_client.login_with_session(session_string)\n\n\n\nCustom PDS\nConnect to a custom PDS instead of bsky.social:\n\nclient = AtmosphereClient(base_url=\"https://pds.example.com\")", 2706 + "objectID": "reference/lenses.html#adding-a-putter", 2707 + "href": "reference/lenses.html#adding-a-putter", 2708 + "title": "Lenses", 2709 + "section": "Adding a Putter", 2710 + "text": "Adding a Putter\nTo enable bidirectional updates, add a putter:\n\n@simplify.putter\ndef simplify_put(view: SimpleSample, source: FullSample) -&gt; FullSample:\n return FullSample(\n image=source.image,\n label=view.label,\n confidence=view.confidence,\n metadata=source.metadata,\n )\n\nThe putter receives:\n\nview: The modified view value\nsource: The original source value\n\nIt returns an updated source that reflects changes from the view.", 2521 2711 "crumbs": [ 2522 2712 "Guide", 2523 2713 "Reference", 2524 - "Atmosphere (ATProto Integration)" 2714 + "Lenses" 2525 2715 ] 2526 2716 }, 2527 2717 { 2528 - "objectID": "reference/atmosphere.html#pdsblobstore", 2529 - "href": "reference/atmosphere.html#pdsblobstore", 2530 - "title": "Atmosphere (ATProto Integration)", 2531 - "section": "PDSBlobStore", 2532 - "text": "PDSBlobStore\nStore dataset shards as ATProto blobs for fully decentralized storage:\n\nfrom atdata.atmosphere import AtmosphereClient, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\n\n# Write shards as blobs\nurls = store.write_shards(dataset, prefix=\"my-data/v1\")\n# Returns: ['at://did:plc:.../blob/bafyrei...', ...]\n\n# Transform AT URIs to HTTP URLs for reading\nhttp_url = store.read_url(urls[0])\n# Returns: 'https://pds.example.com/xrpc/com.atproto.sync.getBlob?...'\n\n# Create a BlobSource for streaming\nsource = store.create_source(urls)\nds = atdata.Dataset[MySample](source)\n\n\nSize Limits\nPDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use maxcount and maxsize parameters to control shard sizes:\n\nurls = store.write_shards(\n dataset,\n prefix=\"large-data/v1\",\n maxcount=5000, # Max 5000 samples per shard\n maxsize=50e6, # Max 50MB per shard\n)", 2718 + "objectID": "reference/lenses.html#using-lenses-with-datasets", 2719 + "href": "reference/lenses.html#using-lenses-with-datasets", 2720 + "title": "Lenses", 2721 + "section": "Using Lenses with Datasets", 2722 + "text": "Using Lenses with Datasets\nLenses integrate with Dataset.as_type():\n\ndataset = atdata.Dataset[FullSample](\"data-{000000..000009}.tar\")\n\n# View through a different type\nsimple_ds = dataset.as_type(SimpleSample)\n\nfor batch in simple_ds.ordered(batch_size=32):\n # Only SimpleSample fields available\n labels = batch.label\n scores = batch.confidence", 2533 2723 "crumbs": [ 2534 2724 "Guide", 2535 2725 "Reference", 2536 - "Atmosphere (ATProto Integration)" 2726 + "Lenses" 2537 2727 ] 2538 2728 }, 2539 2729 { 2540 - "objectID": "reference/atmosphere.html#blobsource", 2541 - "href": "reference/atmosphere.html#blobsource", 2542 - "title": "Atmosphere (ATProto Integration)", 2543 - "section": "BlobSource", 2544 - "text": "BlobSource\nRead datasets stored as PDS blobs:\n\nfrom atdata import BlobSource\n\n# From blob references\nsource = BlobSource.from_refs([\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei111\"},\n {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei222\"},\n])\n\n# Or from PDSBlobStore\nsource = store.create_source(urls)\n\n# Use with Dataset\nds = atdata.Dataset[MySample](source)\nfor batch in ds.ordered(batch_size=32):\n process(batch)", 2730 + "objectID": "reference/lenses.html#direct-lens-usage", 2731 + "href": "reference/lenses.html#direct-lens-usage", 2732 + "title": "Lenses", 2733 + "section": "Direct Lens Usage", 2734 + "text": "Direct Lens Usage\nLenses can also be called directly:\n\nimport numpy as np\n\nfull = FullSample(\n image=np.zeros((224, 224, 3)),\n label=\"cat\",\n confidence=0.95,\n metadata={\"source\": \"training\"}\n)\n\n# Apply getter\nsimple = simplify(full)\n# Or: simple = simplify.get(full)\n\n# Apply putter\nmodified_simple = SimpleSample(label=\"dog\", confidence=0.87)\nupdated_full = simplify.put(modified_simple, full)\n# updated_full has label=\"dog\", confidence=0.87, but retains\n# original image and metadata", 2545 2735 "crumbs": [ 2546 2736 "Guide", 2547 2737 "Reference", 2548 - "Atmosphere (ATProto Integration)" 2738 + "Lenses" 2549 2739 ] 2550 2740 }, 2551 2741 { 2552 - "objectID": "reference/atmosphere.html#atmosphereindex", 2553 - "href": "reference/atmosphere.html#atmosphereindex", 2554 - "title": "Atmosphere (ATProto Integration)", 2555 - "section": "AtmosphereIndex", 2556 - "text": "AtmosphereIndex\nThe unified interface for ATProto operations, implementing the AbstractIndex protocol:\n\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\n\nclient = AtmosphereClient()\nclient.login(\"handle.bsky.social\", \"app-password\")\n\n# Without blob storage (use external URLs)\nindex = AtmosphereIndex(client)\n\n# With PDS blob storage (recommended for full decentralization)\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n\nPublishing Schemas\n\nimport atdata\nfrom numpy.typing import NDArray\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\n# Publish schema\nschema_uri = index.publish_schema(\n ImageSample,\n version=\"1.0.0\",\n description=\"Image classification sample\",\n)\n# Returns: 
\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\"\n\n\n\nPublishing Datasets\n\ndataset = atdata.Dataset[ImageSample](\"data-{000000..000009}.tar\")\n\nentry = index.insert_dataset(\n dataset,\n name=\"imagenet-subset\",\n schema_ref=schema_uri, # Optional - auto-publishes if omitted\n description=\"ImageNet subset\",\n tags=[\"images\", \"classification\"],\n license=\"MIT\",\n)\n\nprint(entry.uri) # AT URI of the record\nprint(entry.data_urls) # WebDataset URLs\n\n\n\nListing and Retrieving\n\n# List your datasets\nfor entry in index.list_datasets():\n print(f\"{entry.name}: {entry.schema_ref}\")\n\n# List from another user\nfor entry in index.list_datasets(repo=\"did:plc:other-user\"):\n print(entry.name)\n\n# Get specific dataset\nentry = index.get_dataset(\"at://did:plc:.../ac.foundation.dataset.record/...\")\n\n# List schemas\nfor schema in index.list_schemas():\n print(f\"{schema['name']} v{schema['version']}\")\n\n# Decode schema to Python type\nSampleType = index.decode_schema(schema_uri)", 2742 + "objectID": "reference/lenses.html#lens-laws", 2743 + "href": "reference/lenses.html#lens-laws", 2744 + "title": "Lenses", 2745 + "section": "Lens Laws", 2746 + "text": "Lens Laws\nWell-behaved lenses should satisfy these properties:\n\nGetPutPutGetPutPut\n\n\nIf you get a view and immediately put it back, the source is unchanged:\n\nview = lens.get(source)\nassert lens.put(view, source) == source\n\n\n\nIf you put a view, getting it back yields that view:\n\nupdated = lens.put(view, source)\nassert lens.get(updated) == view\n\n\n\nPutting twice is equivalent to putting once with the final value:\n\nresult1 = lens.put(v2, lens.put(v1, source))\nresult2 = lens.put(v2, source)\nassert result1 == result2", 2557 2747 "crumbs": [ 2558 2748 "Guide", 2559 2749 "Reference", 2560 - "Atmosphere (ATProto Integration)" 2750 + "Lenses" 2561 2751 ] 2562 2752 }, 2563 2753 { 2564 - "objectID": "reference/atmosphere.html#lower-level-publishers", 2565 - "href": "reference/atmosphere.html#lower-level-publishers", 2566 - "title": "Atmosphere (ATProto Integration)", 2567 - "section": "Lower-Level Publishers", 2568 - "text": "Lower-Level Publishers\nFor more control, use the individual publisher classes:\n\nSchemaPublisher\n\nfrom atdata.atmosphere import SchemaPublisher\n\npublisher = SchemaPublisher(client)\n\nuri = publisher.publish(\n ImageSample,\n name=\"ImageSample\",\n version=\"1.0.0\",\n description=\"Image with label\",\n metadata={\"source\": \"training\"},\n)\n\n\n\nDatasetPublisher\n\nfrom atdata.atmosphere import DatasetPublisher\n\npublisher = DatasetPublisher(client)\n\nuri = publisher.publish(\n dataset,\n name=\"training-images\",\n schema_uri=schema_uri, # Required if auto_publish_schema=False\n auto_publish_schema=True, # Publish schema automatically\n description=\"Training images\",\n tags=[\"training\", \"images\"],\n license=\"MIT\",\n)\n\n\nBlob Storage\nThere are two approaches to storing data as ATProto blobs:\nApproach 1: PDSBlobStore (Recommended)\nUse PDSBlobStore with AtmosphereIndex for automatic shard management:\n\nfrom atdata.atmosphere import PDSBlobStore, AtmosphereIndex\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# Dataset shards are automatically uploaded as blobs\nentry = index.insert_dataset(\n dataset,\n name=\"my-dataset\",\n schema_ref=schema_uri,\n)\n\n# Later: load using BlobSource\nsource = store.create_source(entry.data_urls)\nds = atdata.Dataset[MySample](source)\n\nApproach 2: Manual Blob Publishing\nFor 
more control, use DatasetPublisher.publish_with_blobs() directly:\n\nimport io\nimport webdataset as wds\n\n# Create tar data in memory\ntar_buffer = io.BytesIO()\nwith wds.writer.TarWriter(tar_buffer) as sink:\n for i, sample in enumerate(samples):\n sink.write({**sample.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# Publish with blob storage\nuri = publisher.publish_with_blobs(\n blobs=[tar_buffer.getvalue()],\n schema_uri=schema_uri,\n name=\"small-dataset\",\n description=\"Dataset stored in ATProto blobs\",\n tags=[\"small\", \"demo\"],\n)\n\nLoading Blob-Stored Datasets\n\nfrom atdata.atmosphere import DatasetLoader\nfrom atdata import BlobSource\n\nloader = DatasetLoader(client)\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\nif storage_type == \"blobs\":\n # Get blob URLs and create BlobSource\n blob_urls = loader.get_blob_urls(uri)\n # Parse to blob refs for BlobSource\n # Or use loader.to_dataset() which handles this automatically\n\n# to_dataset() handles both storage types automatically\ndataset = loader.to_dataset(uri, MySample)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\n\nLensPublisher\n\nfrom atdata.atmosphere import LensPublisher\n\npublisher = LensPublisher(client)\n\n# With code references\nuri = publisher.publish(\n name=\"simplify\",\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n description=\"Extract label only\",\n getter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_getter\",\n },\n putter_code={\n \"repository\": \"https://github.com/org/repo\",\n \"commit\": \"abc123def...\",\n \"path\": \"transforms/simplify.py:simplify_putter\",\n },\n)\n\n# Or publish from a Lens object\nfrom atdata.lens import lens\n\n@lens\ndef simplify(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label)\n\nuri = publisher.publish_from_lens(\n simplify,\n source_schema=full_schema_uri,\n target_schema=simple_schema_uri,\n)", 2754 + "objectID": "reference/lenses.html#trivial-putter", 2755 + "href": "reference/lenses.html#trivial-putter", 2756 + "title": "Lenses", 2757 + "section": "Trivial Putter", 2758 + "text": "Trivial Putter\nIf no putter is defined, a trivial putter is used that ignores view updates:\n\n@atdata.lens\ndef extract_label(src: FullSample) -&gt; SimpleSample:\n return SimpleSample(label=src.label, confidence=src.confidence)\n\n# Without a putter, put() returns the original source unchanged\nview = SimpleSample(label=\"modified\", confidence=0.5)\nupdated = extract_label.put(view, original)\nassert updated == original # No changes applied", 2569 2759 "crumbs": [ 2570 2760 "Guide", 2571 2761 "Reference", 2572 - "Atmosphere (ATProto Integration)" 2762 + "Lenses" 2573 2763 ] 2574 2764 }, 2575 2765 { 2576 - "objectID": "reference/atmosphere.html#lower-level-loaders", 2577 - "href": "reference/atmosphere.html#lower-level-loaders", 2578 - "title": "Atmosphere (ATProto Integration)", 2579 - "section": "Lower-Level Loaders", 2580 - "text": "Lower-Level Loaders\nFor direct access to records, use the loader classes:\n\nSchemaLoader\n\nfrom atdata.atmosphere import SchemaLoader\n\nloader = SchemaLoader(client)\n\n# Get a specific schema\nschema = loader.get(\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/xyz\")\nprint(schema[\"name\"], schema[\"version\"])\n\n# List all schemas from a repository\nfor schema in loader.list_all(repo=\"did:plc:other-user\"):\n 
print(schema[\"name\"])\n\n\n\nDatasetLoader\n\nfrom atdata.atmosphere import DatasetLoader\n\nloader = DatasetLoader(client)\n\n# Get a specific dataset record\nrecord = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n# Check storage type\nstorage_type = loader.get_storage_type(uri) # \"external\" or \"blobs\"\n\n# Get URLs based on storage type\nif storage_type == \"external\":\n urls = loader.get_urls(uri)\nelse:\n urls = loader.get_blob_urls(uri)\n\n# Get metadata\nmetadata = loader.get_metadata(uri)\n\n# Create a Dataset object directly\ndataset = loader.to_dataset(uri, MySampleType)\nfor batch in dataset.ordered(batch_size=32):\n process(batch)\n\n\n\nLensLoader\n\nfrom atdata.atmosphere import LensLoader\n\nloader = LensLoader(client)\n\n# Get a specific lens record\nlens = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\nprint(lens[\"name\"])\nprint(lens[\"sourceSchema\"], \"-&gt;\", lens[\"targetSchema\"])\n\n# List all lenses from a repository\nfor lens in loader.list_all():\n print(lens[\"name\"])\n\n# Find lenses by schema\nlenses = loader.find_by_schemas(\n source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n)", 2766 + "objectID": "reference/lenses.html#lensnetwork-registry", 2767 + "href": "reference/lenses.html#lensnetwork-registry", 2768 + "title": "Lenses", 2769 + "section": "LensNetwork Registry", 2770 + "text": "LensNetwork Registry\nThe LensNetwork is a singleton that stores all registered lenses:\n\nfrom atdata.lens import LensNetwork\n\nnetwork = LensNetwork()\n\n# Look up a specific lens\nlens = network.transform(FullSample, SimpleSample)\n\n# Raises ValueError if no lens exists\ntry:\n lens = network.transform(TypeA, TypeB)\nexcept ValueError:\n print(\"No lens registered for TypeA -&gt; TypeB\")", 2771 + "crumbs": [ 2772 + "Guide", 2773 + "Reference", 2774 + "Lenses" 2775 + ] 2776 + }, 2777 + { 2778 + "objectID": "reference/lenses.html#example-feature-extraction", 2779 + "href": "reference/lenses.html#example-feature-extraction", 2780 + "title": "Lenses", 2781 + "section": "Example: Feature Extraction", 2782 + "text": "Example: Feature Extraction\n\n@atdata.packable\nclass RawSample:\n audio: NDArray\n text: str\n speaker_id: int\n\n@atdata.packable\nclass TextFeatures:\n text: str\n word_count: int\n\n@atdata.lens\ndef extract_text(src: RawSample) -&gt; TextFeatures:\n return TextFeatures(\n text=src.text,\n word_count=len(src.text.split())\n )\n\n@extract_text.putter\ndef extract_text_put(view: TextFeatures, source: RawSample) -&gt; RawSample:\n return RawSample(\n audio=source.audio,\n text=view.text,\n speaker_id=source.speaker_id\n )", 2783 + "crumbs": [ 2784 + "Guide", 2785 + "Reference", 2786 + "Lenses" 2787 + ] 2788 + }, 2789 + { 2790 + "objectID": "reference/lenses.html#related", 2791 + "href": "reference/lenses.html#related", 2792 + "title": "Lenses", 2793 + "section": "Related", 2794 + "text": "Related\n\nDatasets - Using lenses with Dataset.as_type()\nPackable Samples - Defining sample types\nAtmosphere - Publishing lenses to ATProto federation", 2795 + "crumbs": [ 2796 + "Guide", 2797 + "Reference", 2798 + "Lenses" 2799 + ] 2800 + }, 2801 + { 2802 + "objectID": "reference/packable-samples.html", 2803 + "href": "reference/packable-samples.html", 2804 + "title": "Packable Samples", 2805 + "section": "", 2806 + "text": "Packable samples are typed dataclasses that can be serialized with msgpack for storage 
in WebDataset tar files.", 2581 2807 "crumbs": [ 2582 2808 "Guide", 2583 2809 "Reference", 2584 - "Atmosphere (ATProto Integration)" 2810 + "Packable Samples" 2585 2811 ] 2586 2812 }, 2587 2813 { 2588 - "objectID": "reference/atmosphere.html#at-uris", 2589 - "href": "reference/atmosphere.html#at-uris", 2590 - "title": "Atmosphere (ATProto Integration)", 2591 - "section": "AT URIs", 2592 - "text": "AT URIs\nATProto records are identified by AT URIs:\n\nfrom atdata.atmosphere import AtUri\n\n# Parse an AT URI\nuri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n\nprint(uri.authority) # 'did:plc:abc123'\nprint(uri.collection) # 'ac.foundation.dataset.sampleSchema'\nprint(uri.rkey) # 'xyz'\n\n# Format back to string\nprint(str(uri)) # 'at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz'", 2814 + "objectID": "reference/packable-samples.html#the-packable-decorator", 2815 + "href": "reference/packable-samples.html#the-packable-decorator", 2816 + "title": "Packable Samples", 2817 + "section": "The @packable Decorator", 2818 + "text": "The @packable Decorator\nThe recommended way to define a sample type is with the @packable decorator:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\n\n@atdata.packable\nclass ImageSample:\n image: NDArray\n label: str\n confidence: float\n\nThis creates a dataclass that:\n\nInherits from PackableSample\nHas automatic msgpack serialization\nHandles NDArray conversion to/from bytes", 2593 2819 "crumbs": [ 2594 2820 "Guide", 2595 2821 "Reference", 2596 - "Atmosphere (ATProto Integration)" 2822 + "Packable Samples" 2597 2823 ] 2598 2824 }, 2599 2825 { 2600 - "objectID": "reference/atmosphere.html#supported-field-types", 2601 - "href": "reference/atmosphere.html#supported-field-types", 2602 - "title": "Atmosphere (ATProto Integration)", 2826 + "objectID": "reference/packable-samples.html#supported-field-types", 2827 + "href": "reference/packable-samples.html#supported-field-types", 2828 + "title": "Packable Samples", 2603 2829 "section": "Supported Field Types", 2604 - "text": "Supported Field Types\nSchemas support these field types:\n\n\n\nPython Type\nATProto Type\n\n\n\n\nstr\nprimitive/str\n\n\nint\nprimitive/int\n\n\nfloat\nprimitive/float\n\n\nbool\nprimitive/bool\n\n\nbytes\nprimitive/bytes\n\n\nNDArray\nndarray (default dtype: float32)\n\n\nNDArray[np.float64]\nndarray (dtype: float64)\n\n\nlist[str]\narray with items\n\n\nT \\| None\nOptional field", 2830 + "text": "Supported Field Types\n\nPrimitives\n\n@atdata.packable\nclass PrimitiveSample:\n name: str\n count: int\n score: float\n active: bool\n data: bytes\n\n\n\nNumPy Arrays\nFields annotated as NDArray are automatically converted:\n\n@atdata.packable\nclass ArraySample:\n features: NDArray # Required array\n embeddings: NDArray | None # Optional array\n\n\n\n\n\n\n\nNote\n\n\n\nBytes in NDArray-typed fields are always interpreted as serialized arrays. 
Don’t use NDArray for raw binary data—use bytes instead.\n\n\n\n\nLists\n\n@atdata.packable\nclass ListSample:\n tags: list[str]\n scores: list[float]", 2831 + "crumbs": [ 2832 + "Guide", 2833 + "Reference", 2834 + "Packable Samples" 2835 + ] 2836 + }, 2837 + { 2838 + "objectID": "reference/packable-samples.html#serialization", 2839 + "href": "reference/packable-samples.html#serialization", 2840 + "title": "Packable Samples", 2841 + "section": "Serialization", 2842 + "text": "Serialization\n\nPacking to Bytes\n\nsample = ImageSample(\n image=np.random.rand(224, 224, 3).astype(np.float32),\n label=\"cat\",\n confidence=0.95,\n)\n\n# Serialize to msgpack bytes\npacked_bytes = sample.packed\nprint(f\"Size: {len(packed_bytes)} bytes\")\n\n\n\nUnpacking from Bytes\n\n# Deserialize from bytes\nrestored = ImageSample.from_bytes(packed_bytes)\n\n# Arrays are automatically restored\nassert np.array_equal(sample.image, restored.image)\nassert sample.label == restored.label\n\n\n\nWebDataset Format\nThe as_wds property returns a dict ready for WebDataset:\n\nwds_dict = sample.as_wds\n# {'__key__': '1234...', 'msgpack': b'...'}\n\nWrite samples to a tar file:\n\nimport webdataset as wds\n\nwith wds.writer.TarWriter(\"data-000000.tar\") as sink:\n for i, sample in enumerate(samples):\n # Use custom key or let as_wds generate one\n sink.write({**sample.as_wds, \"__key__\": f\"sample_{i:06d}\"})", 2843 + "crumbs": [ 2844 + "Guide", 2845 + "Reference", 2846 + "Packable Samples" 2847 + ] 2848 + }, 2849 + { 2850 + "objectID": "reference/packable-samples.html#direct-inheritance-alternative", 2851 + "href": "reference/packable-samples.html#direct-inheritance-alternative", 2852 + "title": "Packable Samples", 2853 + "section": "Direct Inheritance (Alternative)", 2854 + "text": "Direct Inheritance (Alternative)\nYou can also inherit directly from PackableSample:\n\nfrom dataclasses import dataclass\n\n@dataclass\nclass DirectSample(atdata.PackableSample):\n name: str\n values: NDArray\n\nThis is equivalent to using @packable but more verbose.", 2855 + "crumbs": [ 2856 + "Guide", 2857 + "Reference", 2858 + "Packable Samples" 2859 + ] 2860 + }, 2861 + { 2862 + "objectID": "reference/packable-samples.html#how-it-works", 2863 + "href": "reference/packable-samples.html#how-it-works", 2864 + "title": "Packable Samples", 2865 + "section": "How It Works", 2866 + "text": "How It Works\n\nSerialization Flow\n\nPackingUnpacking\n\n\n\nNDArray fields → converted to bytes via array_to_bytes()\nOther fields → passed through unchanged\nAll fields → packed with msgpack\n\n\n\n\nBytes → unpacked with ormsgpack\nDict → passed to __init__\n__post_init__ → calls _ensure_good()\nNDArray fields → bytes converted back to arrays\n\n\n\n\n\n\nThe _ensure_good() Method\nThis method runs automatically after construction and handles NDArray conversion:\n\ndef _ensure_good(self):\n for field in dataclasses.fields(self):\n if _is_possibly_ndarray_type(field.type):\n value = getattr(self, field.name)\n if isinstance(value, bytes):\n setattr(self, field.name, bytes_to_array(value))", 2605 2867 "crumbs": [ 2606 2868 "Guide", 2607 2869 "Reference", 2608 - "Atmosphere (ATProto Integration)" 2870 + "Packable Samples" 2609 2871 ] 2610 2872 }, 2611 2873 { 2612 - "objectID": "reference/atmosphere.html#complete-example", 2613 - "href": "reference/atmosphere.html#complete-example", 2614 - "title": "Atmosphere (ATProto Integration)", 2615 - "section": "Complete Example", 2616 - "text": "Complete Example\nThis example shows the full workflow using 
PDSBlobStore for decentralized storage:\n\nimport numpy as np\nfrom numpy.typing import NDArray\nimport atdata\nfrom atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore\nimport webdataset as wds\n\n# 1. Define and create samples\n@atdata.packable\nclass FeatureSample:\n features: NDArray\n label: int\n source: str\n\nsamples = [\n FeatureSample(\n features=np.random.randn(128).astype(np.float32),\n label=i % 10,\n source=\"synthetic\",\n )\n for i in range(1000)\n]\n\n# 2. Write to tar\nwith wds.writer.TarWriter(\"features.tar\") as sink:\n for i, s in enumerate(samples):\n sink.write({**s.as_wds, \"__key__\": f\"{i:06d}\"})\n\n# 3. Authenticate and set up blob storage\nclient = AtmosphereClient()\nclient.login(\"myhandle.bsky.social\", \"app-password\")\n\nstore = PDSBlobStore(client)\nindex = AtmosphereIndex(client, data_store=store)\n\n# 4. Publish schema\nschema_uri = index.publish_schema(\n FeatureSample,\n version=\"1.0.0\",\n description=\"Feature vectors with labels\",\n)\n\n# 5. Publish dataset (shards uploaded as blobs)\ndataset = atdata.Dataset[FeatureSample](\"features.tar\")\nentry = index.insert_dataset(\n dataset,\n name=\"synthetic-features-v1\",\n schema_ref=schema_uri,\n tags=[\"features\", \"synthetic\"],\n)\n\nprint(f\"Published: {entry.uri}\")\nprint(f\"Blob URLs: {entry.data_urls}\")\n\n# 6. Later: discover and load from blobs\nfor dataset_entry in index.list_datasets():\n print(f\"Found: {dataset_entry.name}\")\n\n # Reconstruct type from schema\n SampleType = index.decode_schema(dataset_entry.schema_ref)\n\n # Create source from blob URLs\n source = store.create_source(dataset_entry.data_urls)\n\n # Load dataset from blobs\n ds = atdata.Dataset[SampleType](source)\n for batch in ds.ordered(batch_size=32):\n print(batch.features.shape)\n break\n\nFor external URL storage (without PDSBlobStore):\n\n# Use AtmosphereIndex without data_store\nindex = AtmosphereIndex(client)\n\n# Dataset URLs will be stored as-is (external references)\nentry = index.insert_dataset(\n dataset,\n name=\"external-features\",\n schema_ref=schema_uri,\n)\n\n# Load using standard URL source\nds = atdata.Dataset[FeatureSample](entry.data_urls[0])", 2874 + "objectID": "reference/packable-samples.html#best-practices", 2875 + "href": "reference/packable-samples.html#best-practices", 2876 + "title": "Packable Samples", 2877 + "section": "Best Practices", 2878 + "text": "Best Practices\n\nDoDon’t\n\n\n\n@atdata.packable\nclass GoodSample:\n features: NDArray # Clear type annotation\n label: str # Simple primitives\n metadata: dict # Msgpack-compatible dicts\n scores: list[float] # Typed lists\n\n\n\n\n@atdata.packable\nclass BadSample:\n # DON'T: Nested dataclasses not supported\n nested: OtherSample\n\n # DON'T: Complex objects that aren't msgpack-serializable\n callback: Callable\n\n # DON'T: Use NDArray for raw bytes\n raw_data: NDArray # Use 'bytes' type instead", 2617 2879 "crumbs": [ 2618 2880 "Guide", 2619 2881 "Reference", 2620 - "Atmosphere (ATProto Integration)" 2882 + "Packable Samples" 2621 2883 ] 2622 2884 }, 2623 2885 { 2624 - "objectID": "reference/atmosphere.html#related", 2625 - "href": "reference/atmosphere.html#related", 2626 - "title": "Atmosphere (ATProto Integration)", 2886 + "objectID": "reference/packable-samples.html#related", 2887 + "href": "reference/packable-samples.html#related", 2888 + "title": "Packable Samples", 2627 2889 "section": "Related", 2628 - "text": "Related\n\nLocal Storage - Redis + S3 backend\nPromotion - Promoting local datasets to 
ATProto\nProtocols - AbstractIndex interface\nPackable Samples - Defining sample types", 2890 + "text": "Related\n\nDatasets - Loading and iterating samples\nLenses - Transforming between sample types", 2629 2891 "crumbs": [ 2630 2892 "Guide", 2631 2893 "Reference", 2632 - "Atmosphere (ATProto Integration)" 2894 + "Packable Samples" 2633 2895 ] 2634 2896 }, 2635 2897 {
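The lens-laws entry in the search index content above states GetPut and PutGet only as pseudo-assertions, and plain `==` is ambiguous for dataclasses holding NDArray fields. Here is a minimal runnable check, a sketch assuming the `@atdata.lens` / `.putter` API exactly as documented in those entries; `FullSample` and `SimpleSample` mirror the reference examples:

```python
import numpy as np
from numpy.typing import NDArray
import atdata

@atdata.packable
class FullSample:
    image: NDArray
    label: str
    confidence: float

@atdata.packable
class SimpleSample:
    label: str
    confidence: float

@atdata.lens
def simplify(src: FullSample) -> SimpleSample:
    return SimpleSample(label=src.label, confidence=src.confidence)

@simplify.putter
def simplify_put(view: SimpleSample, source: FullSample) -> FullSample:
    return FullSample(image=source.image, label=view.label, confidence=view.confidence)

source = FullSample(image=np.zeros((4, 4), dtype=np.float32), label="cat", confidence=0.9)

# GetPut: putting back an unmodified view leaves the source unchanged.
# Compare field by field, since numpy arrays make dataclass `==` ambiguous.
roundtrip = simplify.put(simplify.get(source), source)
assert np.array_equal(roundtrip.image, source.image)
assert (roundtrip.label, roundtrip.confidence) == (source.label, source.confidence)

# PutGet: getting from an updated source yields exactly the view that was put.
view = SimpleSample(label="dog", confidence=0.5)
assert simplify.get(simplify.put(view, source)) == view
```

Comparing fields explicitly is the safe pattern whenever a packable type carries arrays; the law statements in the reference use `==` for brevity only.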
+89 -85
docs/sitemap.xml
··· 9 9 <lastmod>2026-01-22T19:31:03.722Z</lastmod> 10 10 </url> 11 11 <url> 12 - <loc>https://github.com/your-org/atdata/reference/packable-samples.html</loc> 13 - <lastmod>2026-01-18T03:31:39.824Z</lastmod> 12 + <loc>https://github.com/your-org/atdata/reference/architecture.html</loc> 13 + <lastmod>2026-01-27T06:13:33.690Z</lastmod> 14 14 </url> 15 15 <url> 16 - <loc>https://github.com/your-org/atdata/reference/lenses.html</loc> 17 - <lastmod>2026-01-18T03:31:39.823Z</lastmod> 16 + <loc>https://github.com/your-org/atdata/reference/atmosphere.html</loc> 17 + <lastmod>2026-01-27T05:32:25.227Z</lastmod> 18 18 </url> 19 19 <url> 20 - <loc>https://github.com/your-org/atdata/reference/load-dataset.html</loc> 21 - <lastmod>2026-01-22T19:31:03.722Z</lastmod> 20 + <loc>https://github.com/your-org/atdata/reference/local-storage.html</loc> 21 + <lastmod>2026-01-22T19:31:03.723Z</lastmod> 22 22 </url> 23 23 <url> 24 - <loc>https://github.com/your-org/atdata/reference/promotion.html</loc> 24 + <loc>https://github.com/your-org/atdata/reference/uri-spec.html</loc> 25 25 <lastmod>2026-01-22T19:31:03.723Z</lastmod> 26 26 </url> 27 27 <url> 28 - <loc>https://github.com/your-org/atdata/tutorials/local-workflow.html</loc> 29 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 28 + <loc>https://github.com/your-org/atdata/tutorials/quickstart.html</loc> 29 + <lastmod>2026-01-27T06:16:24.980Z</lastmod> 30 30 </url> 31 31 <url> 32 - <loc>https://github.com/your-org/atdata/tutorials/promotion.html</loc> 33 - <lastmod>2026-01-22T19:31:03.724Z</lastmod> 32 + <loc>https://github.com/your-org/atdata/tutorials/atmosphere.html</loc> 33 + <lastmod>2026-01-27T06:18:15.908Z</lastmod> 34 34 </url> 35 35 <url> 36 - <loc>https://github.com/your-org/atdata/api/DatasetDict.html</loc> 37 - <lastmod>2026-01-24T19:19:45.336Z</lastmod> 36 + <loc>https://github.com/your-org/atdata/api/SchemaLoader.html</loc> 37 + <lastmod>2026-01-23T23:20:15.746Z</lastmod> 38 38 </url> 39 39 <url> 40 - <loc>https://github.com/your-org/atdata/api/PackableSample.html</loc> 41 - <lastmod>2026-01-23T23:20:15.564Z</lastmod> 40 + <loc>https://github.com/your-org/atdata/api/BlobSource.html</loc> 41 + <lastmod>2026-01-27T05:36:00.209Z</lastmod> 42 42 </url> 43 43 <url> 44 - <loc>https://github.com/your-org/atdata/api/PDSBlobStore.html</loc> 45 - <lastmod>2026-01-27T05:36:00.303Z</lastmod> 44 + <loc>https://github.com/your-org/atdata/api/AtmosphereClient.html</loc> 45 + <lastmod>2026-01-23T23:20:15.723Z</lastmod> 46 46 </url> 47 47 <url> 48 - <loc>https://github.com/your-org/atdata/api/DictSample.html</loc> 49 - <lastmod>2026-01-23T23:20:15.573Z</lastmod> 48 + <loc>https://github.com/your-org/atdata/api/load_dataset.html</loc> 49 + <lastmod>2026-01-24T19:19:45.334Z</lastmod> 50 50 </url> 51 51 <url> 52 - <loc>https://github.com/your-org/atdata/api/LensLoader.html</loc> 53 - <lastmod>2026-01-23T23:20:15.788Z</lastmod> 52 + <loc>https://github.com/your-org/atdata/api/promote_to_atmosphere.html</loc> 53 + <lastmod>2026-01-24T19:19:45.514Z</lastmod> 54 54 </url> 55 55 <url> 56 - <loc>https://github.com/your-org/atdata/api/AtmosphereIndex.html</loc> 57 - <lastmod>2026-01-27T05:36:00.293Z</lastmod> 56 + <loc>https://github.com/your-org/atdata/api/SchemaPublisher.html</loc> 57 + <lastmod>2026-01-23T23:20:15.742Z</lastmod> 58 58 </url> 59 59 <url> 60 - <loc>https://github.com/your-org/atdata/api/DataSource.html</loc> 61 - <lastmod>2026-01-23T23:20:15.642Z</lastmod> 60 + <loc>https://github.com/your-org/atdata/api/DatasetPublisher.html</loc> 61 + 
<lastmod>2026-01-23T23:20:15.757Z</lastmod> 62 62 </url> 63 63 <url> 64 - <loc>https://github.com/your-org/atdata/api/DatasetLoader.html</loc> 65 - <lastmod>2026-01-23T23:20:15.773Z</lastmod> 64 + <loc>https://github.com/your-org/atdata/api/URLSource.html</loc> 65 + <lastmod>2026-01-24T19:19:45.367Z</lastmod> 66 66 </url> 67 67 <url> 68 - <loc>https://github.com/your-org/atdata/api/Lens.html</loc> 69 - <lastmod>2026-01-27T05:36:00.154Z</lastmod> 68 + <loc>https://github.com/your-org/atdata/api/index.html</loc> 69 + <lastmod>2026-01-27T06:24:20.044Z</lastmod> 70 70 </url> 71 71 <url> 72 - <loc>https://github.com/your-org/atdata/api/local.Index.html</loc> 73 - <lastmod>2026-01-27T05:36:00.238Z</lastmod> 72 + <loc>https://github.com/your-org/atdata/api/IndexEntry.html</loc> 73 + <lastmod>2026-01-23T23:03:53.795Z</lastmod> 74 74 </url> 75 75 <url> 76 - <loc>https://github.com/your-org/atdata/api/Dataset.html</loc> 77 - <lastmod>2026-01-23T23:20:15.588Z</lastmod> 76 + <loc>https://github.com/your-org/atdata/api/S3Source.html</loc> 77 + <lastmod>2026-01-24T19:19:45.376Z</lastmod> 78 78 </url> 79 79 <url> 80 - <loc>https://github.com/your-org/atdata/api/AbstractDataStore.html</loc> 81 - <lastmod>2026-01-23T23:20:15.638Z</lastmod> 80 + <loc>https://github.com/your-org/atdata/api/local.LocalDatasetEntry.html</loc> 81 + <lastmod>2026-01-23T23:03:53.862Z</lastmod> 82 82 </url> 83 83 <url> 84 - <loc>https://github.com/your-org/atdata/api/local.S3DataStore.html</loc> 85 - <lastmod>2026-01-23T23:03:53.869Z</lastmod> 84 + <loc>https://github.com/your-org/atdata/api/AbstractIndex.html</loc> 85 + <lastmod>2026-01-27T05:36:00.180Z</lastmod> 86 86 </url> 87 87 <url> 88 - <loc>https://github.com/your-org/atdata/api/AtUri.html</loc> 89 - <lastmod>2026-01-23T23:20:15.791Z</lastmod> 88 + <loc>https://github.com/your-org/atdata/api/AtmosphereIndexEntry.html</loc> 89 + <lastmod>2026-01-23T23:03:53.910Z</lastmod> 90 90 </url> 91 91 <url> 92 - <loc>https://github.com/your-org/atdata/api/Packable-protocol.html</loc> 93 - <lastmod>2026-01-23T23:20:15.617Z</lastmod> 92 + <loc>https://github.com/your-org/atdata/api/LensPublisher.html</loc> 93 + <lastmod>2026-01-23T23:20:15.781Z</lastmod> 94 94 </url> 95 95 <url> 96 - <loc>https://github.com/your-org/atdata/api/packable.html</loc> 97 - <lastmod>2026-01-23T23:21:24.522Z</lastmod> 96 + <loc>https://github.com/your-org/atdata/api/SampleBatch.html</loc> 97 + <lastmod>2026-01-23T23:20:15.589Z</lastmod> 98 98 </url> 99 99 <url> 100 100 <loc>https://github.com/your-org/atdata/index.html</loc> 101 - <lastmod>2026-01-22T19:31:03.722Z</lastmod> 101 + <lastmod>2026-01-27T06:14:32.068Z</lastmod> 102 102 </url> 103 103 <url> 104 - <loc>https://github.com/your-org/atdata/api/SampleBatch.html</loc> 105 - <lastmod>2026-01-23T23:20:15.589Z</lastmod> 104 + <loc>https://github.com/your-org/atdata/api/packable.html</loc> 105 + <lastmod>2026-01-23T23:21:24.522Z</lastmod> 106 106 </url> 107 107 <url> 108 - <loc>https://github.com/your-org/atdata/api/LensPublisher.html</loc> 109 - <lastmod>2026-01-23T23:20:15.781Z</lastmod> 108 + <loc>https://github.com/your-org/atdata/api/Packable-protocol.html</loc> 109 + <lastmod>2026-01-23T23:20:15.617Z</lastmod> 110 110 </url> 111 111 <url> 112 - <loc>https://github.com/your-org/atdata/api/AtmosphereIndexEntry.html</loc> 113 - <lastmod>2026-01-23T23:03:53.910Z</lastmod> 112 + <loc>https://github.com/your-org/atdata/api/AtUri.html</loc> 113 + <lastmod>2026-01-23T23:20:15.791Z</lastmod> 114 114 </url> 115 115 <url> 116 - 
<loc>https://github.com/your-org/atdata/api/AbstractIndex.html</loc> 117 - <lastmod>2026-01-27T05:36:00.180Z</lastmod> 116 + <loc>https://github.com/your-org/atdata/api/local.S3DataStore.html</loc> 117 + <lastmod>2026-01-23T23:03:53.869Z</lastmod> 118 118 </url> 119 119 <url> 120 - <loc>https://github.com/your-org/atdata/api/local.LocalDatasetEntry.html</loc> 121 - <lastmod>2026-01-23T23:03:53.862Z</lastmod> 120 + <loc>https://github.com/your-org/atdata/api/AbstractDataStore.html</loc> 121 + <lastmod>2026-01-23T23:20:15.638Z</lastmod> 122 122 </url> 123 123 <url> 124 - <loc>https://github.com/your-org/atdata/api/S3Source.html</loc> 125 - <lastmod>2026-01-24T19:19:45.376Z</lastmod> 124 + <loc>https://github.com/your-org/atdata/api/Dataset.html</loc> 125 + <lastmod>2026-01-23T23:20:15.588Z</lastmod> 126 126 </url> 127 127 <url> 128 - <loc>https://github.com/your-org/atdata/api/IndexEntry.html</loc> 129 - <lastmod>2026-01-23T23:03:53.795Z</lastmod> 128 + <loc>https://github.com/your-org/atdata/api/local.Index.html</loc> 129 + <lastmod>2026-01-27T05:36:00.238Z</lastmod> 130 130 </url> 131 131 <url> 132 - <loc>https://github.com/your-org/atdata/api/index.html</loc> 133 - <lastmod>2026-01-27T05:36:00.093Z</lastmod> 132 + <loc>https://github.com/your-org/atdata/api/Lens.html</loc> 133 + <lastmod>2026-01-27T06:24:20.108Z</lastmod> 134 134 </url> 135 135 <url> 136 - <loc>https://github.com/your-org/atdata/api/URLSource.html</loc> 137 - <lastmod>2026-01-24T19:19:45.367Z</lastmod> 136 + <loc>https://github.com/your-org/atdata/api/DatasetLoader.html</loc> 137 + <lastmod>2026-01-23T23:20:15.773Z</lastmod> 138 138 </url> 139 139 <url> 140 - <loc>https://github.com/your-org/atdata/api/DatasetPublisher.html</loc> 141 - <lastmod>2026-01-23T23:20:15.757Z</lastmod> 140 + <loc>https://github.com/your-org/atdata/api/DataSource.html</loc> 141 + <lastmod>2026-01-23T23:20:15.642Z</lastmod> 142 142 </url> 143 143 <url> 144 - <loc>https://github.com/your-org/atdata/api/SchemaPublisher.html</loc> 145 - <lastmod>2026-01-23T23:20:15.742Z</lastmod> 144 + <loc>https://github.com/your-org/atdata/api/AtmosphereIndex.html</loc> 145 + <lastmod>2026-01-27T05:36:00.293Z</lastmod> 146 146 </url> 147 147 <url> 148 - <loc>https://github.com/your-org/atdata/api/promote_to_atmosphere.html</loc> 149 - <lastmod>2026-01-24T19:19:45.514Z</lastmod> 148 + <loc>https://github.com/your-org/atdata/api/LensLoader.html</loc> 149 + <lastmod>2026-01-23T23:20:15.788Z</lastmod> 150 150 </url> 151 151 <url> 152 - <loc>https://github.com/your-org/atdata/api/load_dataset.html</loc> 153 - <lastmod>2026-01-24T19:19:45.334Z</lastmod> 152 + <loc>https://github.com/your-org/atdata/api/DictSample.html</loc> 153 + <lastmod>2026-01-23T23:20:15.573Z</lastmod> 154 154 </url> 155 155 <url> 156 - <loc>https://github.com/your-org/atdata/api/AtmosphereClient.html</loc> 157 - <lastmod>2026-01-23T23:20:15.723Z</lastmod> 156 + <loc>https://github.com/your-org/atdata/api/PDSBlobStore.html</loc> 157 + <lastmod>2026-01-27T05:36:00.303Z</lastmod> 158 158 </url> 159 159 <url> 160 - <loc>https://github.com/your-org/atdata/api/BlobSource.html</loc> 161 - <lastmod>2026-01-27T05:36:00.209Z</lastmod> 160 + <loc>https://github.com/your-org/atdata/api/PackableSample.html</loc> 161 + <lastmod>2026-01-23T23:20:15.564Z</lastmod> 162 162 </url> 163 163 <url> 164 - <loc>https://github.com/your-org/atdata/api/SchemaLoader.html</loc> 165 - <lastmod>2026-01-23T23:20:15.746Z</lastmod> 164 + <loc>https://github.com/your-org/atdata/api/DatasetDict.html</loc> 165 + 
<lastmod>2026-01-24T19:19:45.336Z</lastmod> 166 166 </url> 167 167 <url> 168 - <loc>https://github.com/your-org/atdata/tutorials/atmosphere.html</loc> 169 - <lastmod>2026-01-27T05:31:23.765Z</lastmod> 168 + <loc>https://github.com/your-org/atdata/tutorials/promotion.html</loc> 169 + <lastmod>2026-01-27T06:18:38.425Z</lastmod> 170 170 </url> 171 171 <url> 172 - <loc>https://github.com/your-org/atdata/tutorials/quickstart.html</loc> 173 - <lastmod>2026-01-18T03:31:39.825Z</lastmod> 172 + <loc>https://github.com/your-org/atdata/tutorials/local-workflow.html</loc> 173 + <lastmod>2026-01-27T06:17:20.489Z</lastmod> 174 174 </url> 175 175 <url> 176 - <loc>https://github.com/your-org/atdata/reference/uri-spec.html</loc> 176 + <loc>https://github.com/your-org/atdata/reference/promotion.html</loc> 177 177 <lastmod>2026-01-22T19:31:03.723Z</lastmod> 178 178 </url> 179 179 <url> 180 - <loc>https://github.com/your-org/atdata/reference/local-storage.html</loc> 181 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 180 + <loc>https://github.com/your-org/atdata/reference/load-dataset.html</loc> 181 + <lastmod>2026-01-22T19:31:03.722Z</lastmod> 182 182 </url> 183 183 <url> 184 - <loc>https://github.com/your-org/atdata/reference/atmosphere.html</loc> 185 - <lastmod>2026-01-27T05:32:25.227Z</lastmod> 184 + <loc>https://github.com/your-org/atdata/reference/lenses.html</loc> 185 + <lastmod>2026-01-18T03:31:39.823Z</lastmod> 186 + </url> 187 + <url> 188 + <loc>https://github.com/your-org/atdata/reference/packable-samples.html</loc> 189 + <lastmod>2026-01-18T03:31:39.824Z</lastmod> 186 190 </url> 187 191 <url> 188 192 <loc>https://github.com/your-org/atdata/reference/deployment.html</loc>
+154 -16
docs/tutorials/atmosphere.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 532 542 <h2 id="toc-title">On this page</h2> 533 543 534 544 <ul> 535 - <li><a href="#prerequisites" id="toc-prerequisites" class="nav-link active" data-scroll-target="#prerequisites">Prerequisites</a></li> 545 + <li><a href="#why-federation" id="toc-why-federation" class="nav-link active" data-scroll-target="#why-federation">Why Federation?</a></li> 546 + <li><a href="#prerequisites" id="toc-prerequisites" class="nav-link" data-scroll-target="#prerequisites">Prerequisites</a></li> 536 547 <li><a href="#setup" id="toc-setup" class="nav-link" data-scroll-target="#setup">Setup</a></li> 537 548 <li><a href="#define-sample-types" id="toc-define-sample-types" class="nav-link" data-scroll-target="#define-sample-types">Define Sample Types</a></li> 538 549 <li><a href="#type-introspection" id="toc-type-introspection" class="nav-link" data-scroll-target="#type-introspection">Type Introspection</a></li> ··· 549 560 <li><a href="#list-and-load-datasets" id="toc-list-and-load-datasets" class="nav-link" data-scroll-target="#list-and-load-datasets">List and Load Datasets</a></li> 550 561 <li><a href="#load-a-dataset" id="toc-load-a-dataset" class="nav-link" data-scroll-target="#load-a-dataset">Load a Dataset</a></li> 551 562 <li><a href="#complete-publishing-workflow" id="toc-complete-publishing-workflow" class="nav-link" data-scroll-target="#complete-publishing-workflow">Complete Publishing Workflow</a></li> 563 + <li><a href="#what-youve-learned" id="toc-what-youve-learned" class="nav-link" data-scroll-target="#what-youve-learned">What You’ve Learned</a></li> 564 + <li><a href="#the-full-picture" id="toc-the-full-picture" class="nav-link" data-scroll-target="#the-full-picture">The Full Picture</a></li> 552 565 <li><a href="#next-steps" id="toc-next-steps" class="nav-link" data-scroll-target="#next-steps">Next Steps</a></li> 553 566 </ul> 554 567 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/tutorials/atmosphere.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 581 594 </header> 582 595 583 596 584 - <p>This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing.</p> 597 + <p>This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated 
discovery and sharing. This is <strong>Layer 3</strong> of atdata’s architecture—decentralized federation that enables cross-organization dataset sharing.</p> 598 + <section id="why-federation" class="level2"> 599 + <h2 class="anchored" data-anchor-id="why-federation">Why Federation?</h2> 600 + <p>Team storage (Redis + S3) works well within an organization, but sharing across organizations introduces new challenges:</p> 601 + <ul> 602 + <li><strong>Discovery</strong>: How do researchers find relevant datasets across institutions?</li> 603 + <li><strong>Trust</strong>: How do you verify a dataset is what it claims to be?</li> 604 + <li><strong>Durability</strong>: What happens if the original publisher goes offline?</li> 605 + </ul> 606 + <p>The <strong>AT Protocol</strong> (ATProto), developed by Bluesky, provides a foundation for decentralized social applications. atdata leverages ATProto’s infrastructure for dataset federation:</p> 607 + <table class="caption-top table"> 608 + <colgroup> 609 + <col style="width: 53%"> 610 + <col style="width: 46%"> 611 + </colgroup> 612 + <thead> 613 + <tr class="header"> 614 + <th>ATProto Feature</th> 615 + <th>atdata Usage</th> 616 + </tr> 617 + </thead> 618 + <tbody> 619 + <tr class="odd"> 620 + <td><strong>DIDs</strong> (Decentralized Identifiers)</td> 621 + <td>Publisher identity verification</td> 622 + </tr> 623 + <tr class="even"> 624 + <td><strong>Lexicons</strong></td> 625 + <td>Dataset/schema record schemas</td> 626 + </tr> 627 + <tr class="odd"> 628 + <td><strong>PDSes</strong> (Personal Data Servers)</td> 629 + <td>Storage for records and blobs</td> 630 + </tr> 631 + <tr class="even"> 632 + <td><strong>Relays &amp; AppViews</strong></td> 633 + <td>Discovery and aggregation</td> 634 + </tr> 635 + </tbody> 636 + </table> 637 + <p>The key insight: your Bluesky identity (<code>@handle.bsky.social</code>) becomes your dataset publisher identity. 
Anyone can verify that a dataset was published by you, and can discover your datasets through the federated network.</p> 638 + </section> 585 639 <section id="prerequisites" class="level2"> 586 640 <h2 class="anchored" data-anchor-id="prerequisites">Prerequisites</h2> 587 641 <ul> ··· 604 658 </section> 605 659 <section id="setup" class="level2"> 606 660 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 607 - <div id="1598fc78" class="cell"> 661 + <div id="d57cc9af" class="cell"> 608 662 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 609 663 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 610 664 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 624 678 </section> 625 679 <section id="define-sample-types" class="level2"> 626 680 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 627 - <div id="284570d6" class="cell"> 681 + <div id="15a5c9ef" class="cell"> 628 682 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 629 683 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 630 684 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing image data with metadata."""</span></span> ··· 643 697 <section id="type-introspection" class="level2"> 644 698 <h2 class="anchored" data-anchor-id="type-introspection">Type Introspection</h2> 645 699 <p>See what information is available from a PackableSample type:</p> 646 - <div id="daae7f6e" class="cell"> 700 + <div id="d5d68ff6" class="cell"> 647 701 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span class="im">import</span> fields, is_dataclass</span> 648 702 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 649 703 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Sample type: </span><span class="sc">{</span>ImageSample<span class="sc">.</span><span class="va">__name__</span><span class="sc">}</span><span class="ss">"</span>)</span> ··· 670 724 </section> 671 725 <section id="at-uri-parsing" class="level2"> 672 726 <h2 class="anchored" data-anchor-id="at-uri-parsing">AT URI Parsing</h2> 727 + <p>Every record in ATProto is identified by an <strong>AT URI</strong>, which encodes:</p> 728 + <ul> 729 + <li><strong>Authority</strong>: The DID or handle of the record owner</li> 730 + <li><strong>Collection</strong>: The Lexicon type (like a table name)</li> 731 + <li><strong>Rkey</strong>: The record key (unique within the collection)</li> 732 + </ul> 733 + <p>Understanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.</p> 673 734 <p>ATProto records are identified by AT 
URIs:</p> 674 - <div id="ad781e91" class="cell"> 735 + <div id="e2fe328c" class="cell"> 675 736 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>uris <span class="op">=</span> [</span> 676 737 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789"</span>,</span> 677 738 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset"</span>,</span> ··· 687 748 </section> 688 749 <section id="authentication" class="level2"> 689 750 <h2 class="anchored" data-anchor-id="authentication">Authentication</h2> 751 + <p>The <code>AtmosphereClient</code> handles ATProto authentication. When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).</p> 690 752 <p>Connect to ATProto:</p> 691 - <div id="95a2266a" class="cell"> 753 + <div id="c05523d2" class="cell"> 692 754 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 693 755 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"your.handle.social"</span>, <span class="st">"your-app-password"</span>)</span> 694 756 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 698 760 </section> 699 761 <section id="publish-a-schema" class="level2"> 700 762 <h2 class="anchored" data-anchor-id="publish-a-schema">Publish a Schema</h2> 701 - <div id="fd62bf26" class="cell"> 763 + <p>When you publish a schema to ATProto, it becomes a <strong>public, immutable record</strong> that others can reference. 
The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.</p> 764 + <div id="db404d7a" class="cell"> 702 765 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>schema_publisher <span class="op">=</span> SchemaPublisher(client)</span> 703 766 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_uri <span class="op">=</span> schema_publisher.publish(</span> 704 767 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 711 774 </section> 712 775 <section id="list-your-schemas" class="level2"> 713 776 <h2 class="anchored" data-anchor-id="list-your-schemas">List Your Schemas</h2> 714 - <div id="f9bd719b" class="cell"> 777 + <div id="c351a57b" class="cell"> 715 778 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>schema_loader <span class="op">=</span> SchemaLoader(client)</span> 716 779 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>schemas <span class="op">=</span> schema_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 717 780 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(schemas)<span class="sc">}</span><span class="ss"> schema(s)"</span>)</span> ··· 724 787 <h2 class="anchored" data-anchor-id="publish-a-dataset">Publish a Dataset</h2> 725 788 <section id="with-external-urls" class="level3"> 726 789 <h3 class="anchored" data-anchor-id="with-external-urls">With External URLs</h3> 727 - <div id="25d3a845" class="cell"> 790 + <div id="df8318cd" class="cell"> 728 791 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 729 792 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span class="op">=</span> dataset_publisher.publish_with_urls(</span> 730 793 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 739 802 </section> 740 803 <section id="with-pds-blob-storage-recommended" class="level3"> 741 804 <h3 class="anchored" data-anchor-id="with-pds-blob-storage-recommended">With PDS Blob Storage (Recommended)</h3> 805 + <p>The <code>PDSBlobStore</code> is the <strong>fully decentralized</strong> option: your dataset shards are stored as ATProto blobs directly in your PDS, alongside your other ATProto records. 
This means:</p> 806 + <ul> 807 + <li><strong>No external dependencies</strong>: Data lives in the same infrastructure as your identity</li> 808 + <li><strong>Content-addressed</strong>: Blobs are identified by their CID, ensuring integrity</li> 809 + <li><strong>Federated replication</strong>: Relays can mirror your blobs for availability</li> 810 + </ul> 742 811 <p>For fully decentralized storage, use <code>PDSBlobStore</code> to store dataset shards directly as ATProto blobs in your PDS:</p> 743 - <div id="e4cf8aef" class="cell"> 812 + <div id="6708e8dc" class="cell"> 744 813 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create store and index with blob storage</span></span> 745 814 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> 746 815 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span> ··· 784 853 </div> 785 854 <div class="callout-body-container callout-body"> 786 855 <p>Use <code>BlobSource</code> to stream directly from PDS blobs:</p> 787 - <div id="13f07c85" class="cell"> 856 + <div id="824c87cf" class="cell"> 788 857 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create source from the blob URLs</span></span> 789 858 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> store.create_source(entry.data_urls)</span> 790 859 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 803 872 </section> 804 873 <section id="with-external-urls-1" class="level3"> 805 874 <h3 class="anchored" data-anchor-id="with-external-urls-1">With External URLs</h3> 875 + <p>For larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. 
The ATProto record serves as the <strong>index entry</strong> while the actual data lives elsewhere.</p> 806 876 <p>For larger datasets or when using existing object storage:</p> 807 - <div id="e5dc0c7d" class="cell"> 877 + <div id="65d324f2" class="cell"> 808 878 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 809 879 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span class="op">=</span> dataset_publisher.publish_with_urls(</span> 810 880 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 820 890 </section> 821 891 <section id="list-and-load-datasets" class="level2"> 822 892 <h2 class="anchored" data-anchor-id="list-and-load-datasets">List and Load Datasets</h2> 823 - <div id="1d53ab16" class="cell"> 893 + <div id="39c2452a" class="cell"> 824 894 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset_loader <span class="op">=</span> DatasetLoader(client)</span> 825 895 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>datasets <span class="op">=</span> dataset_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 826 896 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(datasets)<span class="sc">}</span><span class="ss"> dataset(s)"</span>)</span> ··· 835 905 </section> 836 906 <section id="load-a-dataset" class="level2"> 837 907 <h2 class="anchored" data-anchor-id="load-a-dataset">Load a Dataset</h2> 838 - <div id="3ec469cc" class="cell"> 908 + <div id="5adb2946" class="cell"> 839 909 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Check storage type</span></span> 840 910 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>storage_type <span class="op">=</span> dataset_loader.get_storage_type(<span class="bu">str</span>(blob_dataset_uri))</span> 841 911 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Storage type: </span><span class="sc">{</span>storage_type<span class="sc">}</span><span class="ss">"</span>)</span> ··· 852 922 </section> 853 923 <section id="complete-publishing-workflow" class="level2"> 854 924 <h2 class="anchored" data-anchor-id="complete-publishing-workflow">Complete Publishing Workflow</h2> 925 + <p>Here’s the end-to-end workflow for publishing a dataset to the atmosphere:</p> 926 + <ol type="1"> 927 + <li><strong>Define your sample type</strong> using <code>@packable</code></li> 928 + <li><strong>Create samples and write to tar</strong> (same as local workflow)</li> 929 + <li><strong>Authenticate</strong> with your ATProto identity</li> 930 + <li><strong>Create index with blob storage</strong> (<code>AtmosphereIndex</code> + <code>PDSBlobStore</code>)</li> 931 + <li><strong>Publish 
schema</strong> (creates ATProto record)</li> 932 + <li><strong>Insert dataset</strong> (uploads blobs, creates dataset record)</li> 933 + </ol> 934 + <p>Notice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.</p> 855 935 <p>This example shows the recommended workflow using <code>PDSBlobStore</code> for fully decentralized storage:</p> 856 - <div id="0d13f586" class="cell"> 936 + <div id="60e0b22d" class="cell"> 857 937 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Define and create samples</span></span> 858 938 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 859 939 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> FeatureSample:</span> ··· 909 989 <span id="cb14-53"><a href="#cb14-53" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 910 990 </div> 911 991 </section> 992 + <section id="what-youve-learned" class="level2"> 993 + <h2 class="anchored" data-anchor-id="what-youve-learned">What You’ve Learned</h2> 994 + <p>You now understand federated dataset publishing in atdata:</p> 995 + <table class="caption-top table"> 996 + <thead> 997 + <tr class="header"> 998 + <th>Concept</th> 999 + <th>Purpose</th> 1000 + </tr> 1001 + </thead> 1002 + <tbody> 1003 + <tr class="odd"> 1004 + <td><code>AtmosphereClient</code></td> 1005 + <td>ATProto authentication and record management</td> 1006 + </tr> 1007 + <tr class="even"> 1008 + <td><code>AtmosphereIndex</code></td> 1009 + <td>Federated index implementing <code>AbstractIndex</code></td> 1010 + </tr> 1011 + <tr class="odd"> 1012 + <td><code>PDSBlobStore</code></td> 1013 + <td>PDS blob storage implementing <code>AbstractDataStore</code></td> 1014 + </tr> 1015 + <tr class="even"> 1016 + <td><code>BlobSource</code></td> 1017 + <td>Stream datasets from PDS blobs</td> 1018 + </tr> 1019 + <tr class="odd"> 1020 + <td>AT URIs</td> 1021 + <td>Universal identifiers for schemas and datasets</td> 1022 + </tr> 1023 + </tbody> 1024 + </table> 1025 + <p>The protocol abstractions (<code>AbstractIndex</code>, <code>AbstractDataStore</code>, <code>DataSource</code>) ensure your code works across all three layers of atdata—local files, team storage, and federated sharing.</p> 1026 + </section> 1027 + <section id="the-full-picture" class="level2"> 1028 + <h2 class="anchored" data-anchor-id="the-full-picture">The Full Picture</h2> 1029 + <p>You’ve now seen atdata’s complete architecture:</p> 1030 + <pre><code>Local Development Team Storage Federation 1031 + ───────────────── ──────────── ────────── 1032 + tar files Redis + S3 ATProto PDS 1033 + Dataset[T] LocalIndex AtmosphereIndex 1034 + S3DataStore PDSBlobStore</code></pre> 1035 + <p>The same <code>@packable</code> sample types, the same <code>Dataset[T]</code> iteration patterns, and the same lens transformations work at every layer. 
Only the storage backend changes.</p> 1036 + </section> 912 1037 <section id="next-steps" class="level2"> 913 1038 <h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2> 1039 + <div class="callout callout-style-default callout-tip callout-titled"> 1040 + <div class="callout-header d-flex align-content-center"> 1041 + <div class="callout-icon-container"> 1042 + <i class="callout-icon"></i> 1043 + </div> 1044 + <div class="callout-title-container flex-fill"> 1045 + Already Have Local Datasets? 1046 + </div> 1047 + </div> 1048 + <div class="callout-body-container callout-body"> 1049 + <p>The <a href="../tutorials/promotion.html">Promotion Workflow</a> tutorial shows how to migrate existing datasets from local storage to the atmosphere without re-processing your data.</p> 1050 + </div> 1051 + </div> 914 1052 <ul> 915 1053 <li><strong><a href="../tutorials/promotion.html">Promotion Workflow</a></strong> - Migrate from local storage to atmosphere</li> 916 1054 <li><strong><a href="../reference/atmosphere.html">Atmosphere Reference</a></strong> - Complete API reference</li>
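<p>To make “only the storage backend changes” concrete, here is a minimal sketch (plain Python, not an atdata API) of a consumer loop that never mentions storage. Because every layer yields the same typed samples, the function below runs unchanged against a local tar file, a team index, or a PDS blob stream; the commented call sites use hypothetical names based on the examples above.</p>
<pre class="sourceCode python"><code>from typing import Iterable

def mean_confidence(samples: Iterable) -> float:
    """Average the `confidence` field over any stream of samples."""
    total = 0.0
    count = 0
    for sample in samples:          # same iteration pattern at every layer
        total += sample.confidence
        count += 1
    return total / count if count else 0.0

# Hypothetical call sites, one per layer:
#   mean_confidence(atdata.load_dataset("demo-data-{000000..000009}.tar"))  # local tar
#   mean_confidence(atdata.load_dataset("@local/my-dataset", index=index))  # Redis + S3
#   mean_confidence(ds)  # ds streamed from a BlobSource, as shown earlier  # federation</code></pre>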
+131 -10
docs/tutorials/local-workflow.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 532 542 <h2 id="toc-title">On this page</h2> 533 543 534 544 <ul> 535 - <li><a href="#prerequisites" id="toc-prerequisites" class="nav-link active" data-scroll-target="#prerequisites">Prerequisites</a></li> 545 + <li><a href="#why-team-storage" id="toc-why-team-storage" class="nav-link active" data-scroll-target="#why-team-storage">Why Team Storage?</a></li> 546 + <li><a href="#prerequisites" id="toc-prerequisites" class="nav-link" data-scroll-target="#prerequisites">Prerequisites</a></li> 536 547 <li><a href="#setup" id="toc-setup" class="nav-link" data-scroll-target="#setup">Setup</a></li> 537 548 <li><a href="#define-sample-types" id="toc-define-sample-types" class="nav-link" data-scroll-target="#define-sample-types">Define Sample Types</a></li> 538 549 <li><a href="#localdatasetentry" id="toc-localdatasetentry" class="nav-link" data-scroll-target="#localdatasetentry">LocalDatasetEntry</a></li> ··· 543 554 <li><a href="#s3datastore" id="toc-s3datastore" class="nav-link" data-scroll-target="#s3datastore">S3DataStore</a></li> 544 555 <li><a href="#complete-index-workflow" id="toc-complete-index-workflow" class="nav-link" data-scroll-target="#complete-index-workflow">Complete Index Workflow</a></li> 545 556 <li><a href="#using-load_dataset-with-index" id="toc-using-load_dataset-with-index" class="nav-link" data-scroll-target="#using-load_dataset-with-index">Using load_dataset with Index</a></li> 557 + <li><a href="#what-youve-learned" id="toc-what-youve-learned" class="nav-link" data-scroll-target="#what-youve-learned">What You’ve Learned</a></li> 546 558 <li><a href="#next-steps" id="toc-next-steps" class="nav-link" data-scroll-target="#next-steps">Next Steps</a></li> 547 559 </ul> 548 560 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/tutorials/local-workflow.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 575 587 </header> 576 588 577 589 578 - <p>This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage.</p> 590 + <p>This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage. 
This is <strong>Layer 2</strong> of atdata’s architecture—team-scale storage that bridges local development and federated sharing.</p> 591 + <section id="why-team-storage" class="level2"> 592 + <h2 class="anchored" data-anchor-id="why-team-storage">Why Team Storage?</h2> 593 + <p>Local tar files work well for individual experiments, but teams need:</p> 594 + <ul> 595 + <li><strong>Discovery</strong>: “What datasets do we have? What schema does this one use?”</li> 596 + <li><strong>Consistency</strong>: “Is everyone using the same version of this dataset?”</li> 597 + <li><strong>Durability</strong>: “Where’s the canonical copy of our training data?”</li> 598 + </ul> 599 + <p>atdata’s local storage module addresses these needs with a two-component architecture:</p> 600 + <table class="caption-top table"> 601 + <colgroup> 602 + <col style="width: 55%"> 603 + <col style="width: 45%"> 604 + </colgroup> 605 + <thead> 606 + <tr class="header"> 607 + <th>Component</th> 608 + <th>Purpose</th> 609 + </tr> 610 + </thead> 611 + <tbody> 612 + <tr class="odd"> 613 + <td><strong>Redis Index</strong></td> 614 + <td>Fast metadata queries, schema registry, dataset discovery</td> 615 + </tr> 616 + <tr class="even"> 617 + <td><strong>S3 DataStore</strong></td> 618 + <td>Scalable object storage for actual data files</td> 619 + </tr> 620 + </tbody> 621 + </table> 622 + <p>This separation means metadata operations (listing datasets, resolving schemas) are fast and don’t touch large data files, while the data itself lives in battle-tested object storage.</p> 623 + </section> 579 624 <section id="prerequisites" class="level2"> 580 625 <h2 class="anchored" data-anchor-id="prerequisites">Prerequisites</h2> 581 626 <ul> ··· 599 644 </section> 600 645 <section id="setup" class="level2"> 601 646 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 602 - <div id="cba2b198" class="cell"> 647 + <div id="abef9b71" class="cell"> 603 648 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 604 649 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 605 650 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 609 654 </section> 610 655 <section id="define-sample-types" class="level2"> 611 656 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 612 - <div id="8bf33c29" class="cell"> 657 + <div id="eb4a25ab" class="cell"> 613 658 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 614 659 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> TrainingSample:</span> 615 660 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing features and label for training."""</span></span> ··· 625 670 </section> 626 671 <section id="localdatasetentry" class="level2"> 627 672 <h2 class="anchored" data-anchor-id="localdatasetentry">LocalDatasetEntry</h2> 673 + <p>Every dataset in the index is represented by a <code>LocalDatasetEntry</code>. 
A key design decision: entries use <strong>content-addressable CIDs</strong> (Content Identifiers) as their identity. This means:</p> 674 + <ul> 675 + <li>Identical content always has the same CID</li> 676 + <li>You can verify data integrity by checking the CID</li> 677 + <li>Deduplication happens automatically</li> 678 + </ul> 679 + <p>CIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.</p> 628 680 <p>Create entries with content-addressable CIDs:</p> 629 - <div id="a93468d2" class="cell"> 681 + <div id="b26485ff" class="cell"> 630 682 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create an entry manually</span></span> 631 683 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> LocalDatasetEntry(</span> 632 684 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> _name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 657 709 </section> 658 710 <section id="localindex" class="level2"> 659 711 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 712 + <p>The <code>LocalIndex</code> is your team’s dataset registry. It implements the <code>AbstractIndex</code> protocol, meaning code written against <code>LocalIndex</code> will also work with <code>AtmosphereIndex</code> when you’re ready for federated sharing.</p> 660 713 <p>The index tracks datasets in Redis:</p> 661 - <div id="05315823" class="cell"> 714 + <div id="d23bcb72" class="cell"> 662 715 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> redis <span class="im">import</span> Redis</span> 663 716 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 664 717 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to Redis</span></span> ··· 669 722 </div> 670 723 <section id="schema-management" class="level3"> 671 724 <h3 class="anchored" data-anchor-id="schema-management">Schema Management</h3> 672 - <div id="a16e84e2" class="cell"> 725 + <p><strong>Schema publishing</strong> is how you ensure type consistency across your team. 
When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.</p> 726 + <p>This enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.</p> 727 + <div id="39ba8cff" class="cell"> 673 728 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 674 729 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(TrainingSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 675 730 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Published schema: </span><span class="sc">{</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 690 745 </section> 691 746 <section id="s3datastore" class="level2"> 692 747 <h2 class="anchored" data-anchor-id="s3datastore">S3DataStore</h2> 748 + <p>The <code>S3DataStore</code> implements the <code>AbstractDataStore</code> protocol for S3-compatible object storage. It works with:</p> 749 + <ul> 750 + <li><strong>AWS S3</strong>: Production-scale cloud storage</li> 751 + <li><strong>MinIO</strong>: Self-hosted S3-compatible storage (great for development)</li> 752 + <li><strong>Cloudflare R2</strong>: Cost-effective S3-compatible storage</li> 753 + </ul> 754 + <p>The data store handles uploading tar shards and creating signed URLs for streaming access.</p> 693 755 <p>For direct S3 operations:</p> 694 - <div id="0b93923c" class="cell"> 756 + <div id="d1cd901f" class="cell"> 695 757 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>creds <span class="op">=</span> {</span> 696 758 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ENDPOINT"</span>: <span class="st">"http://localhost:9000"</span>,</span> 697 759 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ACCESS_KEY_ID"</span>: <span class="st">"minioadmin"</span>,</span> ··· 706 768 </section> 707 769 <section id="complete-index-workflow" class="level2"> 708 770 <h2 class="anchored" data-anchor-id="complete-index-workflow">Complete Index Workflow</h2> 771 + <p>Here’s the typical workflow for publishing a dataset to your team:</p> 772 + <ol type="1"> 773 + <li><strong>Create samples</strong> using your <code>@packable</code> type</li> 774 + <li><strong>Write to local tar</strong> for staging</li> 775 + <li><strong>Create a Dataset</strong> wrapper</li> 776 + <li><strong>Connect to index with data store</strong></li> 777 + <li><strong>Publish schema</strong> for type consistency</li> 778 + <li><strong>Insert dataset</strong> (uploads to S3, indexes in Redis)</li> 779 + </ol> 780 + <p>The index composition pattern (<code>LocalIndex(data_store=S3DataStore(...))</code>) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.</p> 709 781 <p>Use <code>LocalIndex</code> with <code>S3DataStore</code> to store datasets with S3 storage and 
Redis indexing:</p> 710 - <div id="437797e0" class="cell"> 782 + <div id="6d91c4f2" class="cell"> 711 783 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Create sample data</span></span> 712 784 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> 713 785 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> TrainingSample(</span> ··· 755 827 </section> 756 828 <section id="using-load_dataset-with-index" class="level2"> 757 829 <h2 class="anchored" data-anchor-id="using-load_dataset-with-index">Using load_dataset with Index</h2> 830 + <p>The <code>load_dataset()</code> function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve <code>@local/</code> prefixed paths to the actual data URLs and apply the correct credentials automatically.</p> 758 831 <p>The <code>load_dataset()</code> function supports index lookup:</p> 759 - <div id="a2176738" class="cell"> 832 + <div id="4bf81ece" class="cell"> 760 833 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 761 834 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 762 835 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local index</span></span> ··· 768 841 <span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 769 842 </div> 770 843 </section> 844 + <section id="what-youve-learned" class="level2"> 845 + <h2 class="anchored" data-anchor-id="what-youve-learned">What You’ve Learned</h2> 846 + <p>You now understand team-scale storage in atdata:</p> 847 + <table class="caption-top table"> 848 + <colgroup> 849 + <col style="width: 50%"> 850 + <col style="width: 50%"> 851 + </colgroup> 852 + <thead> 853 + <tr class="header"> 854 + <th>Concept</th> 855 + <th>Purpose</th> 856 + </tr> 857 + </thead> 858 + <tbody> 859 + <tr class="odd"> 860 + <td><code>LocalIndex</code></td> 861 + <td>Redis-backed dataset registry implementing <code>AbstractIndex</code></td> 862 + </tr> 863 + <tr class="even"> 864 + <td><code>S3DataStore</code></td> 865 + <td>S3-compatible object storage implementing <code>AbstractDataStore</code></td> 866 + </tr> 867 + <tr class="odd"> 868 + <td><code>LocalDatasetEntry</code></td> 869 + <td>Content-addressed dataset entries with CIDs</td> 870 + </tr> 871 + <tr class="even"> 872 + <td>Schema publishing</td> 873 + <td>Shared type definitions for team consistency</td> 874 + </tr> 875 + </tbody> 876 + </table> 877 + <p>The same sample types you defined in the Quick Start work seamlessly here—the only change is where the data lives.</p> 878 + </section> 771 879 <section id="next-steps" class="level2"> 772 880 <h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2> 881 + <div class="callout callout-style-default callout-tip callout-titled"> 882 + <div class="callout-header d-flex align-content-center"> 883 + <div class="callout-icon-container"> 884 + <i 
class="callout-icon"></i> 885 + </div> 886 + <div class="callout-title-container flex-fill"> 887 + Ready for Public Sharing? 888 + </div> 889 + </div> 890 + <div class="callout-body-container callout-body"> 891 + <p>The <a href="../tutorials/atmosphere.html">Atmosphere Publishing</a> tutorial shows how to publish datasets to the ATProto network for decentralized, cross-organization discovery.</p> 892 + </div> 893 + </div> 773 894 <ul> 774 895 <li><strong><a href="../tutorials/atmosphere.html">Atmosphere Publishing</a></strong> - Publish to ATProto federation</li> 775 896 <li><strong><a href="../tutorials/promotion.html">Promotion Workflow</a></strong> - Migrate from local to atmosphere</li>
+85 -13
docs/tutorials/promotion.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 532 542 <h2 id="toc-title">On this page</h2> 533 543 534 544 <ul> 535 - <li><a href="#overview" id="toc-overview" class="nav-link active" data-scroll-target="#overview">Overview</a></li> 545 + <li><a href="#why-promotion" id="toc-why-promotion" class="nav-link active" data-scroll-target="#why-promotion">Why Promotion?</a></li> 546 + <li><a href="#overview" id="toc-overview" class="nav-link" data-scroll-target="#overview">Overview</a></li> 536 547 <li><a href="#setup" id="toc-setup" class="nav-link" data-scroll-target="#setup">Setup</a></li> 537 548 <li><a href="#prepare-a-local-dataset" id="toc-prepare-a-local-dataset" class="nav-link" data-scroll-target="#prepare-a-local-dataset">Prepare a Local Dataset</a></li> 538 549 <li><a href="#basic-promotion" id="toc-basic-promotion" class="nav-link" data-scroll-target="#basic-promotion">Basic Promotion</a></li> ··· 543 554 <li><a href="#error-handling" id="toc-error-handling" class="nav-link" data-scroll-target="#error-handling">Error Handling</a></li> 544 555 <li><a href="#requirements-checklist" id="toc-requirements-checklist" class="nav-link" data-scroll-target="#requirements-checklist">Requirements Checklist</a></li> 545 556 <li><a href="#complete-workflow" id="toc-complete-workflow" class="nav-link" data-scroll-target="#complete-workflow">Complete Workflow</a></li> 557 + <li><a href="#what-youve-learned" id="toc-what-youve-learned" class="nav-link" data-scroll-target="#what-youve-learned">What You’ve Learned</a></li> 558 + <li><a href="#the-complete-journey" id="toc-the-complete-journey" class="nav-link" data-scroll-target="#the-complete-journey">The Complete Journey</a></li> 546 559 <li><a href="#next-steps" id="toc-next-steps" class="nav-link" data-scroll-target="#next-steps">Next Steps</a></li> 547 560 </ul> 548 561 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/tutorials/promotion.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 575 588 </header> 576 589 577 590 578 - <p>This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network.</p> 591 + <p>This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network. 
Promotion is the bridge between <strong>Layer 2</strong> (team storage) and <strong>Layer 3</strong> (federation).</p> 592 + <section id="why-promotion" class="level2"> 593 + <h2 class="anchored" data-anchor-id="why-promotion">Why Promotion?</h2> 594 + <p>A common pattern in data science:</p> 595 + <ol type="1"> 596 + <li><strong>Start private</strong>: Develop and validate datasets within your team</li> 597 + <li><strong>Go public</strong>: Share successful datasets with the broader community</li> 598 + </ol> 599 + <p>Promotion handles this transition without re-processing your data. Instead of creating a new dataset from scratch, you’re <strong>lifting</strong> an existing local dataset entry into the federated atmosphere.</p> 600 + <p>The workflow handles several complexities automatically:</p> 601 + <ul> 602 + <li><strong>Schema deduplication</strong>: If you’ve already published the same schema type and version, promotion reuses it</li> 603 + <li><strong>URL preservation</strong>: Data stays in place (unless you explicitly want to copy it)</li> 604 + <li><strong>CID consistency</strong>: Content identifiers remain valid across the transition</li> 605 + </ul> 606 + </section> 579 607 <section id="overview" class="level2"> 580 608 <h2 class="anchored" data-anchor-id="overview">Overview</h2> 581 609 <p>The promotion workflow moves datasets from local storage to the atmosphere:</p> ··· 593 621 </section> 594 622 <section id="setup" class="level2"> 595 623 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 596 - <div id="a4944327" class="cell"> 624 + <div id="30d648e4" class="cell"> 597 625 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 598 626 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 599 627 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 606 634 <section id="prepare-a-local-dataset" class="level2"> 607 635 <h2 class="anchored" data-anchor-id="prepare-a-local-dataset">Prepare a Local Dataset</h2> 608 636 <p>First, set up a dataset in local storage:</p> 609 - <div id="5d9c3c9c" class="cell"> 637 + <div id="8f0a8e10" class="cell"> 610 638 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. 
Define sample type</span></span> 611 639 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 612 640 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ExperimentSample:</span> ··· 656 684 <section id="basic-promotion" class="level2"> 657 685 <h2 class="anchored" data-anchor-id="basic-promotion">Basic Promotion</h2> 658 686 <p>Promote the dataset to ATProto:</p> 659 - <div id="ad535d77" class="cell"> 687 + <div id="5b5a6c07" class="cell"> 660 688 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to atmosphere</span></span> 661 689 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 662 690 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"myhandle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> ··· 669 697 <section id="promotion-with-metadata" class="level2"> 670 698 <h2 class="anchored" data-anchor-id="promotion-with-metadata">Promotion with Metadata</h2> 671 699 <p>Add description, tags, and license:</p> 672 - <div id="dc02ee9b" class="cell"> 700 + <div id="dc7703ae" class="cell"> 673 701 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 674 702 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> local_entry,</span> 675 703 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 685 713 <section id="schema-deduplication" class="level2"> 686 714 <h2 class="anchored" data-anchor-id="schema-deduplication">Schema Deduplication</h2> 687 715 <p>The promotion workflow automatically checks for existing schemas:</p> 688 - <div id="0f305439" class="cell"> 716 + <div id="98721d33" class="cell"> 689 717 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> _find_existing_schema</span> 690 718 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 691 719 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if schema already exists</span></span> ··· 697 725 <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="st">"No existing schema found, will publish new one"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 698 726 </div> 699 727 <p>When you promote multiple datasets with the same sample type:</p> 700 - <div id="5f2af9ec" class="cell"> 728 + <div id="172b5ab2" class="cell"> 701 729 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 702 730 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" 
tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 703 731 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 712 740 <div class="tab-content"> 713 741 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 714 742 <p>By default, promotion keeps the original data URLs:</p> 715 - <div id="11816bc6" class="cell"> 743 + <div id="1c103291" class="cell"> 716 744 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 717 745 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 718 746 </div> ··· 725 753 </div> 726 754 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 727 755 <p>To copy data to a different storage location:</p> 728 - <div id="a09bd7f1" class="cell"> 756 + <div id="09e9306a" class="cell"> 729 757 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 730 758 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 731 759 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create new data store</span></span> ··· 755 783 <section id="verify-on-atmosphere" class="level2"> 756 784 <h2 class="anchored" data-anchor-id="verify-on-atmosphere">Verify on Atmosphere</h2> 757 785 <p>After promotion, verify the dataset is accessible:</p> 758 - <div id="9698b9c2" class="cell"> 786 + <div id="2be14e5f" class="cell"> 759 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereIndex</span> 760 788 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 761 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>atm_index <span class="op">=</span> AtmosphereIndex(client)</span> ··· 776 804 </section> 777 805 <section id="error-handling" class="level2"> 778 806 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 779 - <div id="10f3b8be" class="cell"> 807 + <div id="deda7bd9" class="cell"> 780 808 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 781 809 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span> 782 810 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span> ··· 800 828 </section> 801 829 <section id="complete-workflow" 
class="level2"> 802 830 <h2 class="anchored" data-anchor-id="complete-workflow">Complete Workflow</h2> 803 - <div id="d8dd7cbc" class="cell"> 831 + <div id="3fd9070a" class="cell"> 804 832 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Complete local-to-atmosphere workflow</span></span> 805 833 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 806 834 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> ··· 857 885 <span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a><span class="co"># 6. Others can now discover and load</span></span> 858 886 <span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a><span class="co"># ds = atdata.load_dataset("@myhandle.bsky.social/feature-vectors-v1")</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 859 887 </div> 888 + </section> 889 + <section id="what-youve-learned" class="level2"> 890 + <h2 class="anchored" data-anchor-id="what-youve-learned">What You’ve Learned</h2> 891 + <p>You now understand the promotion workflow:</p> 892 + <table class="caption-top table"> 893 + <colgroup> 894 + <col style="width: 50%"> 895 + <col style="width: 50%"> 896 + </colgroup> 897 + <thead> 898 + <tr class="header"> 899 + <th>Concept</th> 900 + <th>Purpose</th> 901 + </tr> 902 + </thead> 903 + <tbody> 904 + <tr class="odd"> 905 + <td><code>promote_to_atmosphere()</code></td> 906 + <td>Lift local entries to federated network</td> 907 + </tr> 908 + <tr class="even"> 909 + <td>Schema deduplication</td> 910 + <td>Avoid publishing duplicate schemas</td> 911 + </tr> 912 + <tr class="odd"> 913 + <td>Data URL preservation</td> 914 + <td>Keep data in place or copy to new storage</td> 915 + </tr> 916 + <tr class="even"> 917 + <td>Metadata enrichment</td> 918 + <td>Add description, tags, license during promotion</td> 919 + </tr> 920 + </tbody> 921 + </table> 922 + <p>Promotion completes atdata’s three-layer story: you can now move seamlessly from local experimentation to team collaboration to public sharing, all with the same typed sample definitions.</p> 923 + </section> 924 + <section id="the-complete-journey" class="level2"> 925 + <h2 class="anchored" data-anchor-id="the-complete-journey">The Complete Journey</h2> 926 + <pre><code>┌──────────────────┐ insert ┌──────────────────┐ promote ┌──────────────────┐ 927 + │ Local Files │ ────────────→ │ Team Storage │ ────────────→ │ Federation │ 928 + │ │ │ │ │ │ 929 + │ tar files │ │ Redis + S3 │ │ ATProto PDS │ 930 + │ Dataset[T] │ │ LocalIndex │ │ AtmosphereIndex │ 931 + └──────────────────┘ └──────────────────┘ └──────────────────┘</code></pre> 860 932 </section> 861 933 <section id="next-steps" class="level2"> 862 934 <h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2>
+191 -96
docs/tutorials/quickstart.html
··· 324 324 </a> 325 325 <ul class="dropdown-menu" aria-labelledby="nav-menu-reference"> 326 326 <li> 327 + <a class="dropdown-item" href="../reference/architecture.html"> 328 + <span class="dropdown-text">Architecture Overview</span></a> 329 + </li> 330 + <li> 327 331 <a class="dropdown-item" href="../reference/packable-samples.html"> 328 332 <span class="dropdown-text">Packable Samples</span></a> 329 333 </li> ··· 456 460 <ul id="quarto-sidebar-section-2" class="collapse list-unstyled sidebar-section depth1 show"> 457 461 <li class="sidebar-item"> 458 462 <div class="sidebar-item-container"> 463 + <a href="../reference/architecture.html" class="sidebar-item-text sidebar-link"> 464 + <span class="menu-text">Architecture Overview</span></a> 465 + </div> 466 + </li> 467 + <li class="sidebar-item"> 468 + <div class="sidebar-item-container"> 459 469 <a href="../reference/packable-samples.html" class="sidebar-item-text sidebar-link"> 460 470 <span class="menu-text">Packable Samples</span></a> 461 471 </div> ··· 532 542 <h2 id="toc-title">On this page</h2> 533 543 534 544 <ul> 535 - <li><a href="#installation" id="toc-installation" class="nav-link active" data-scroll-target="#installation">Installation</a></li> 545 + <li><a href="#where-this-fits" id="toc-where-this-fits" class="nav-link active" data-scroll-target="#where-this-fits">Where This Fits</a></li> 546 + <li><a href="#installation" id="toc-installation" class="nav-link" data-scroll-target="#installation">Installation</a></li> 536 547 <li><a href="#define-a-sample-type" id="toc-define-a-sample-type" class="nav-link" data-scroll-target="#define-a-sample-type">Define a Sample Type</a></li> 537 548 <li><a href="#create-sample-instances" id="toc-create-sample-instances" class="nav-link" data-scroll-target="#create-sample-instances">Create Sample Instances</a></li> 538 549 <li><a href="#write-a-dataset" id="toc-write-a-dataset" class="nav-link" data-scroll-target="#write-a-dataset">Write a Dataset</a></li> 539 550 <li><a href="#load-and-iterate" id="toc-load-and-iterate" class="nav-link" data-scroll-target="#load-and-iterate">Load and Iterate</a></li> 540 551 <li><a href="#shuffled-iteration" id="toc-shuffled-iteration" class="nav-link" data-scroll-target="#shuffled-iteration">Shuffled Iteration</a></li> 541 552 <li><a href="#use-lenses-for-type-transformations" id="toc-use-lenses-for-type-transformations" class="nav-link" data-scroll-target="#use-lenses-for-type-transformations">Use Lenses for Type Transformations</a></li> 553 + <li><a href="#what-youve-learned" id="toc-what-youve-learned" class="nav-link" data-scroll-target="#what-youve-learned">What You’ve Learned</a></li> 542 554 <li><a href="#next-steps" id="toc-next-steps" class="nav-link" data-scroll-target="#next-steps">Next Steps</a></li> 543 555 </ul> 544 556 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/tutorials/quickstart.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 571 583 </header> 572 584 573 585 574 - <p>This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them.</p> 586 + <p>This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them. 
You’ll learn the foundational patterns that enable type-safe, efficient dataset handling—the first layer of atdata’s three-layer architecture.</p> 587 + <section id="where-this-fits" class="level2"> 588 + <h2 class="anchored" data-anchor-id="where-this-fits">Where This Fits</h2> 589 + <p>atdata is built around a simple progression:</p> 590 + <pre><code>Local Development → Team Storage → Federation</code></pre> 591 + <p>This tutorial covers <strong>local development</strong>—the foundation. Everything you learn here (typed samples, efficient iteration, lens transformations) carries forward as you scale to team storage and federated sharing. The key insight is that your sample types remain the same across all three layers; only the storage backend changes.</p> 592 + </section> 575 593 <section id="installation" class="level2"> 576 594 <h2 class="anchored" data-anchor-id="installation">Installation</h2> 577 - <div class="sourceCode" id="cb1"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata</span> 578 - <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 579 - <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="co"># With ATProto support</span></span> 580 - <span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata<span class="pp">[</span><span class="ss">atmosphere</span><span class="pp">]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 595 + <div class="sourceCode" id="cb2"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata</span> 596 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 597 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co"># With ATProto support</span></span> 598 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="ex">pip</span> install atdata<span class="pp">[</span><span class="ss">atmosphere</span><span class="pp">]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 581 599 </section> 582 600 <section id="define-a-sample-type" class="level2"> 583 601 <h2 class="anchored" data-anchor-id="define-a-sample-type">Define a Sample Type</h2> 602 + <p>The core abstraction in atdata is the <strong>PackableSample</strong>—a typed, serializable data structure. 
Unlike raw dictionaries or ad-hoc classes, PackableSamples provide:</p> 603 + <ul> 604 + <li><strong>Type safety</strong>: Know your schema at write time, not training time</li> 605 + <li><strong>Automatic serialization</strong>: msgpack encoding with efficient NDArray handling</li> 606 + <li><strong>Round-trip fidelity</strong>: Data survives serialization without loss</li> 607 + </ul> 584 608 <p>Use the <code>@packable</code> decorator to create a typed sample:</p> 585 - <div id="3049c1b6" class="cell"> 586 - <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 587 - <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 588 - <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 589 - <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a></span> 590 - <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 591 - <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 592 - <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing an image with label and confidence."""</span></span> 593 - <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a> image: NDArray</span> 594 - <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 595 - <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 609 + <div id="e779f1c8" class="cell"> 610 + <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 611 + <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 612 + <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 613 + <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a></span> 614 + <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 615 + <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 616 + <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing an image with label and confidence."""</span></span> 617 + <span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a> image: NDArray</span> 618 + <span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 619 + <span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i 
class="bi"></i></button></pre></div> 596 620 </div> 597 621 <p>The <code>@packable</code> decorator:</p> 598 622 <ul> ··· 603 627 </section> 604 628 <section id="create-sample-instances" class="level2"> 605 629 <h2 class="anchored" data-anchor-id="create-sample-instances">Create Sample Instances</h2> 606 - <div id="e1f58bf7" class="cell"> 607 - <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a single sample</span></span> 608 - <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 609 - <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 610 - <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span> 611 - <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span><span class="fl">0.95</span>,</span> 612 - <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>)</span> 613 - <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a></span> 614 - <span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Check serialization</span></span> 615 - <span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a>packed_bytes <span class="op">=</span> sample.packed</span> 616 - <span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Serialized size: </span><span class="sc">{</span><span class="bu">len</span>(packed_bytes)<span class="sc">:,}</span><span class="ss"> bytes"</span>)</span> 617 - <span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a></span> 618 - <span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Verify round-trip</span></span> 619 - <span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a>restored <span class="op">=</span> ImageSample.from_bytes(packed_bytes)</span> 620 - <span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> np.allclose(sample.image, restored.image)</span> 621 - <span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Round-trip successful!"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 630 + <div id="9017c045" class="cell"> 631 + <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a single sample</span></span> 632 + <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 633 + <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 634 + <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a> label<span 
class="op">=</span><span class="st">"cat"</span>,</span> 635 + <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span><span class="fl">0.95</span>,</span> 636 + <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>)</span> 637 + <span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a></span> 638 + <span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="co"># Check serialization</span></span> 639 + <span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a>packed_bytes <span class="op">=</span> sample.packed</span> 640 + <span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Serialized size: </span><span class="sc">{</span><span class="bu">len</span>(packed_bytes)<span class="sc">:,}</span><span class="ss"> bytes"</span>)</span> 641 + <span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a></span> 642 + <span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Verify round-trip</span></span> 643 + <span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a>restored <span class="op">=</span> ImageSample.from_bytes(packed_bytes)</span> 644 + <span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> np.allclose(sample.image, restored.image)</span> 645 + <span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Round-trip successful!"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 622 646 </div> 623 647 </section> 624 648 <section id="write-a-dataset" class="level2"> 625 649 <h2 class="anchored" data-anchor-id="write-a-dataset">Write a Dataset</h2> 650 + <p>atdata uses <strong>WebDataset’s tar format</strong> for storage. 
This choice is deliberate:</p> 651 + <ul> 652 + <li><strong>Streaming</strong>: Process data without downloading entire datasets</li> 653 + <li><strong>Sharding</strong>: Split large datasets across multiple files for parallel I/O</li> 654 + <li><strong>Proven</strong>: Battle-tested at scale by organizations like Google, NVIDIA, and OpenAI</li> 655 + </ul> 656 + <p>The <code>as_wds</code> property on your sample provides the dictionary format WebDataset expects:</p> 626 657 <p>Use WebDataset’s <code>TarWriter</code> to create dataset files:</p> 627 - <div id="58d86c25" class="cell"> 628 - <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 629 - <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 630 - <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create 100 samples</span></span> 631 - <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> 632 - <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> ImageSample(</span> 633 - <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 634 - <span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="ss">f"class_</span><span class="sc">{</span>i <span class="op">%</span> <span class="dv">10</span><span class="sc">}</span><span class="ss">"</span>,</span> 635 - <span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span>np.random.rand(),</span> 636 - <span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a> )</span> 637 - <span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>)</span> 638 - <span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a>]</span> 639 - <span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a></span> 640 - <span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Write to tar file</span></span> 641 - <span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"my-dataset-000000.tar"</span>) <span class="im">as</span> sink:</span> 642 - <span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i, sample <span class="kw">in</span> <span class="bu">enumerate</span>(samples):</span> 643 - <span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a> sink.write({<span class="op">**</span>sample.as_wds, <span class="st">"__key__"</span>: <span class="ss">f"sample_</span><span class="sc">{</span>i<span class="sc">:06d}</span><span class="ss">"</span>})</span> 644 - <span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a></span> 645 - <span id="cb4-18"><a href="#cb4-18" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Wrote 100 samples to my-dataset-000000.tar"</span>)</span></code><button title="Copy to 
Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 658 + <div id="de114376" class="cell"> 659 + <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 660 + <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 661 + <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create 100 samples</span></span> 662 + <span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> 663 + <span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> ImageSample(</span> 664 + <span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 665 + <span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="ss">f"class_</span><span class="sc">{</span>i <span class="op">%</span> <span class="dv">10</span><span class="sc">}</span><span class="ss">"</span>,</span> 666 + <span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a> confidence<span class="op">=</span>np.random.rand(),</span> 667 + <span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a> )</span> 668 + <span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="dv">100</span>)</span> 669 + <span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>]</span> 670 + <span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a></span> 671 + <span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a><span class="co"># Write to tar file</span></span> 672 + <span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"my-dataset-000000.tar"</span>) <span class="im">as</span> sink:</span> 673 + <span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> i, sample <span class="kw">in</span> <span class="bu">enumerate</span>(samples):</span> 674 + <span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a> sink.write({<span class="op">**</span>sample.as_wds, <span class="st">"__key__"</span>: <span class="ss">f"sample_</span><span class="sc">{</span>i<span class="sc">:06d}</span><span class="ss">"</span>})</span> 675 + <span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a></span> 676 + <span id="cb5-18"><a href="#cb5-18" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="st">"Wrote 100 samples to my-dataset-000000.tar"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 646 677 </div> 647 678 </section> 648 679 <section id="load-and-iterate" class="level2"> 649 680 <h2 class="anchored" data-anchor-id="load-and-iterate">Load and Iterate</h2> 681 + <p>The generic <code>Dataset[T]</code> class connects your sample type to WebDataset’s streaming infrastructure. 
When you specify <code>Dataset[ImageSample]</code>, atdata knows how to deserialize the msgpack bytes back into fully-typed objects.</p> 682 + <p><strong>Automatic batch aggregation</strong> is a key feature: when you iterate with <code>batch_size</code>, atdata returns <code>SampleBatch</code> objects that intelligently combine samples:</p> 683 + <ul> 684 + <li>NDArray fields are <strong>stacked</strong> into a single array with a batch dimension</li> 685 + <li>Other fields become <strong>lists</strong> of values</li> 686 + </ul> 687 + <p>This eliminates boilerplate collation code and works automatically with any PackableSample type.</p> 650 688 <p>Create a typed <code>Dataset</code> and iterate with batching:</p> 651 - <div id="7bba76ea" class="cell"> 652 - <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load dataset with type</span></span> 653 - <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"my-dataset-000000.tar"</span>)</span> 654 - <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> 655 - <span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate in order with batching</span></span> 656 - <span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.ordered(batch_size<span class="op">=</span><span class="dv">16</span>):</span> 657 - <span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> <span class="co"># NDArray fields are stacked</span></span> 658 - <span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># shape: (16, 224, 224, 3)</span></span> 659 - <span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a></span> 660 - <span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a> <span class="co"># Other fields become lists</span></span> 661 - <span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label <span class="co"># list of 16 strings</span></span> 662 - <span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a> confidences <span class="op">=</span> batch.confidence <span class="co"># list of 16 floats</span></span> 663 - <span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a></span> 664 - <span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Batch shape: </span><span class="sc">{</span>images<span class="sc">.</span>shape<span class="sc">}</span><span class="ss">"</span>)</span> 665 - <span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Labels: </span><span class="sc">{</span>labels[:<span class="dv">3</span>]<span class="sc">}</span><span class="ss">..."</span>)</span> 666 - <span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 689 + <div id="a3152d0f" class="cell"> 690 + <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code 
class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load dataset with type</span></span> 691 + <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"my-dataset-000000.tar"</span>)</span> 692 + <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> 693 + <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate in order with batching</span></span> 694 + <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.ordered(batch_size<span class="op">=</span><span class="dv">16</span>):</span> 695 + <span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a> <span class="co"># NDArray fields are stacked</span></span> 696 + <span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># shape: (16, 224, 224, 3)</span></span> 697 + <span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a></span> 698 + <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> <span class="co"># Other fields become lists</span></span> 699 + <span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label <span class="co"># list of 16 strings</span></span> 700 + <span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a> confidences <span class="op">=</span> batch.confidence <span class="co"># list of 16 floats</span></span> 701 + <span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a></span> 702 + <span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Batch shape: </span><span class="sc">{</span>images<span class="sc">.</span>shape<span class="sc">}</span><span class="ss">"</span>)</span> 703 + <span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Labels: </span><span class="sc">{</span>labels[:<span class="dv">3</span>]<span class="sc">}</span><span class="ss">..."</span>)</span> 704 + <span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 667 705 </div> 668 706 </section> 669 707 <section id="shuffled-iteration" class="level2"> 670 708 <h2 class="anchored" data-anchor-id="shuffled-iteration">Shuffled Iteration</h2> 709 + <p>Proper shuffling is critical for training. 
WebDataset provides <strong>two-level shuffling</strong>:</p> 710 + <ol type="1"> 711 + <li><strong>Shard shuffling</strong>: Randomize the order of tar files</li> 712 + <li><strong>Sample shuffling</strong>: Randomize samples within a buffer</li> 713 + </ol> 714 + <p>This approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.</p> 671 715 <p>For training, use shuffled iteration:</p> 672 - <div id="e9cca4f6" class="cell"> 673 - <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 674 - <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled at shard and sample level</span></span> 675 - <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image</span> 676 - <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label</span> 677 - <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a></span> 678 - <span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a> <span class="co"># Train your model</span></span> 679 - <span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> <span class="co"># model.train(images, labels)</span></span> 680 - <span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 716 + <div id="64a443cd" class="cell"> 717 + <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 718 + <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled at shard and sample level</span></span> 719 + <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image</span> 720 + <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a> labels <span class="op">=</span> batch.label</span> 721 + <span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span> 722 + <span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a> <span class="co"># Train your model</span></span> 723 + <span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a> <span class="co"># model.train(images, labels)</span></span> 724 + <span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 681 725 </div> 682 726 </section> 683 727 <section id="use-lenses-for-type-transformations" class="level2"> 684 728 <h2 class="anchored" data-anchor-id="use-lenses-for-type-transformations">Use Lenses for Type Transformations</h2> 729 + <p><strong>Lenses</strong> are bidirectional transformations between sample types. 
They solve a common problem: you have a dataset with a rich schema, but a particular task only needs a subset of fields—or needs derived fields computed on-the-fly.</p> 730 + <p>Instead of creating separate datasets for each use case (duplicating storage and maintenance burden), lenses let you <strong>view</strong> the same underlying data through different type schemas. This is inspired by functional programming concepts and enables:</p> 731 + <ul> 732 + <li><strong>Schema reduction</strong>: Drop fields you don’t need</li> 733 + <li><strong>Schema migration</strong>: Handle version differences between datasets</li> 734 + <li><strong>Derived features</strong>: Compute fields on-the-fly during iteration</li> 735 + </ul> 685 736 <p>View datasets through different schemas:</p> 686 - <div id="4299fc09" class="cell"> 687 - <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define a simplified view type</span></span> 688 - <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 689 - <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 690 - <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 691 - <span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span> 692 - <span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a></span> 693 - <span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a lens transformation</span></span> 694 - <span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 695 - <span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify(src: ImageSample) <span class="op">-&gt;</span> SimplifiedSample:</span> 696 - <span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimplifiedSample(label<span class="op">=</span>src.label, confidence<span class="op">=</span>src.confidence)</span> 697 - <span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a></span> 698 - <span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="co"># View dataset through lens</span></span> 699 - <span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a>simple_ds <span class="op">=</span> dataset.as_type(SimplifiedSample)</span> 700 - <span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a></span> 701 - <span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> simple_ds.ordered(batch_size<span class="op">=</span><span class="dv">8</span>):</span> 702 - <span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Labels: </span><span class="sc">{</span>batch<span class="sc">.</span>label<span class="sc">}</span><span class="ss">"</span>)</span> 703 - <span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Confidences: </span><span class="sc">{</span>batch<span class="sc">.</span>confidence<span class="sc">}</span><span 
class="ss">"</span>)</span> 704 - <span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 737 + <div id="cdc9da8b" class="cell"> 738 + <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define a simplified view type</span></span> 739 + <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 740 + <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 741 + <span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> 742 + <span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a> confidence: <span class="bu">float</span></span> 743 + <span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a></span> 744 + <span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a lens transformation</span></span> 745 + <span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 746 + <span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify(src: ImageSample) <span class="op">-&gt;</span> SimplifiedSample:</span> 747 + <span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimplifiedSample(label<span class="op">=</span>src.label, confidence<span class="op">=</span>src.confidence)</span> 748 + <span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a></span> 749 + <span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a><span class="co"># View dataset through lens</span></span> 750 + <span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a>simple_ds <span class="op">=</span> dataset.as_type(SimplifiedSample)</span> 751 + <span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a></span> 752 + <span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> simple_ds.ordered(batch_size<span class="op">=</span><span class="dv">8</span>):</span> 753 + <span id="cb8-16"><a href="#cb8-16" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Labels: </span><span class="sc">{</span>batch<span class="sc">.</span>label<span class="sc">}</span><span class="ss">"</span>)</span> 754 + <span id="cb8-17"><a href="#cb8-17" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"Confidences: </span><span class="sc">{</span>batch<span class="sc">.</span>confidence<span class="sc">}</span><span class="ss">"</span>)</span> 755 + <span id="cb8-18"><a href="#cb8-18" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 705 756 </div> 706 757 </section> 758 + <section id="what-youve-learned" class="level2"> 759 + <h2 class="anchored" data-anchor-id="what-youve-learned">What You’ve Learned</h2> 760 + <p>You now understand atdata’s foundational concepts:</p> 761 + <table class="caption-top table"> 762 + 
<thead> 763 + <tr class="header"> 764 + <th>Concept</th> 765 + <th>Purpose</th> 766 + </tr> 767 + </thead> 768 + <tbody> 769 + <tr class="odd"> 770 + <td><code>@packable</code></td> 771 + <td>Create typed, serializable sample classes</td> 772 + </tr> 773 + <tr class="even"> 774 + <td><code>Dataset[T]</code></td> 775 + <td>Typed iteration over WebDataset tar files</td> 776 + </tr> 777 + <tr class="odd"> 778 + <td><code>SampleBatch[T]</code></td> 779 + <td>Automatic aggregation with NDArray stacking</td> 780 + </tr> 781 + <tr class="even"> 782 + <td><code>@lens</code></td> 783 + <td>Transform between sample types without data duplication</td> 784 + </tr> 785 + </tbody> 786 + </table> 787 + <p>These patterns work identically whether your data lives on local disk, in team S3 storage, or published to the ATProto network. The next tutorials show how to scale beyond local files.</p> 788 + </section> 707 789 <section id="next-steps" class="level2"> 708 790 <h2 class="anchored" data-anchor-id="next-steps">Next Steps</h2> 791 + <div class="callout callout-style-default callout-tip callout-titled"> 792 + <div class="callout-header d-flex align-content-center"> 793 + <div class="callout-icon-container"> 794 + <i class="callout-icon"></i> 795 + </div> 796 + <div class="callout-title-container flex-fill"> 797 + Ready to Share with Your Team? 798 + </div> 799 + </div> 800 + <div class="callout-body-container callout-body"> 801 + <p>The <a href="../tutorials/local-workflow.html">Local Workflow</a> tutorial shows how to set up Redis + S3 storage for team-wide dataset discovery and sharing.</p> 802 + </div> 803 + </div> 709 804 <ul> 710 805 <li><strong><a href="../tutorials/local-workflow.html">Local Workflow</a></strong> - Store datasets with Redis + S3</li> 711 806 <li><strong><a href="../tutorials/atmosphere.html">Atmosphere Publishing</a></strong> - Publish to ATProto federation</li>
+3
docs_src/_quarto.yml
··· 119 119 href: tutorials/promotion.qmd 120 120 - text: "Reference" 121 121 menu: 122 + - text: "Architecture Overview" 123 + href: reference/architecture.qmd 122 124 - text: "Packable Samples" 123 125 href: reference/packable-samples.qmd 124 126 - text: "Datasets" ··· 160 162 - tutorials/promotion.qmd 161 163 - section: "Reference" 162 164 contents: 165 + - reference/architecture.qmd 163 166 - reference/packable-samples.qmd 164 167 - reference/datasets.qmd 165 168 - reference/lenses.qmd
+120 -47
docs_src/index.qmd
··· 12 12 [View on GitHub](https://github.com/your-org/atdata){.btn .btn-outline-secondary .btn-lg} 13 13 ::: 14 14 15 + ## The Challenge 16 + 17 + Machine learning datasets are everywhere—training data, validation sets, embeddings, features, model outputs. Yet working with them often means: 18 + 19 + - **Runtime surprises**: Discovering a field is missing or has the wrong type during training 20 + - **Copy-paste schemas**: Redefining the same sample structure across notebooks and scripts 21 + - **Storage silos**: Data stuck in one location, invisible to collaborators 22 + - **Discovery friction**: No standard way to find datasets across teams or organizations 23 + 24 + atdata solves these problems with a simple idea: **typed, serializable samples** that flow seamlessly from local development to team storage to federated sharing. 25 + 15 26 ## What is atdata? 16 27 17 - atdata provides a typed dataset abstraction for machine learning workflows with: 28 + atdata is a Python library that combines: 18 29 19 30 ::: {.feature-cards} 20 31 21 32 ::: {.feature-card} 22 33 ### Typed Samples 23 - Define dataclass-based sample types with automatic msgpack serialization. 34 + Define dataclass-based sample types with automatic msgpack serialization. Catch schema errors at definition time, not training time. 24 35 ::: 25 36 26 37 ::: {.feature-card} 27 - ### NDArray Handling 28 - Transparent numpy array conversion with efficient byte serialization. 38 + ### Efficient Storage 39 + Built on WebDataset's proven tar-based format. Stream large datasets without downloading everything first. 29 40 ::: 30 41 31 42 ::: {.feature-card} 32 43 ### Lens Transformations 33 - View datasets through different schemas without duplicating data. 44 + View datasets through different schemas without duplicating data. Perfect for feature extraction, schema migration, and multi-task learning. 34 45 ::: 35 46 36 47 ::: {.feature-card} 37 48 ### Batch Aggregation 38 - Automatic numpy stacking for NDArray fields during iteration. 49 + Automatic numpy stacking for NDArray fields. No more manual collation code—just iterate and train. 39 50 ::: 40 51 41 52 ::: {.feature-card} 42 - ### WebDataset Integration 43 - Efficient large-scale storage with streaming tar file support. 53 + ### Team Storage 54 + Redis + S3 backend for shared dataset indexes. Publish schemas, track versions, and enable team discovery. 44 55 ::: 45 56 46 57 ::: {.feature-card} 47 58 ### ATProto Federation 48 - Publish and discover datasets on the decentralized AT Protocol network. 59 + Publish datasets to the decentralized AT Protocol network. Enable cross-organization discovery without centralized infrastructure. 
49 60 ::: 50 61 51 62 ::: 52 63 64 + ## The Architecture 65 + 66 + atdata provides a three-layer progression for your datasets: 67 + 68 + ``` 69 + ┌─────────────────────────────────────────────────────────────┐ 70 + │ Federation: ATProto Atmosphere │ 71 + │ Decentralized discovery, cross-org sharing │ 72 + └─────────────────────────────────────────────────────────────┘ 73 + ↑ promote 74 + ┌─────────────────────────────────────────────────────────────┐ 75 + │ Team Storage: Redis + S3 │ 76 + │ Shared index, versioned schemas, S3 data │ 77 + └─────────────────────────────────────────────────────────────┘ 78 + ↑ insert 79 + ┌─────────────────────────────────────────────────────────────┐ 80 + │ Local Development │ 81 + │ Typed samples, WebDataset files, fast iteration │ 82 + └─────────────────────────────────────────────────────────────┘ 83 + ``` 84 + 85 + Start local, scale to your team, and optionally share with the world—all with the same sample types and consistent APIs. 86 + 53 87 ## Installation 54 88 55 89 ::: {.install-box} ··· 63 97 64 98 ## Quick Example 65 99 66 - ### Define a Sample Type 100 + ### 1. Define a Sample Type 101 + 102 + The `@packable` decorator creates a serializable dataclass: 67 103 68 104 ```{python} 69 105 #| eval: false ··· 73 109 74 110 @atdata.packable 75 111 class ImageSample: 76 - image: NDArray 112 + image: NDArray # Automatically handled as bytes 77 113 label: str 78 114 confidence: float 79 115 ``` 80 116 81 - ### Create and Write Samples 117 + ### 2. Create and Write Samples 118 + 119 + Use WebDataset's standard TarWriter: 82 120 83 121 ```{python} 84 122 #| eval: false ··· 98 136 sink.write({**sample.as_wds, "__key__": f"sample_{i:06d}"}) 99 137 ``` 100 138 101 - ### Load and Iterate 139 + ### 3. Load and Iterate with Type Safety 140 + 141 + The generic `Dataset[T]` provides typed access: 102 142 103 143 ```{python} 104 144 #| eval: false 105 145 dataset = atdata.Dataset[ImageSample]("data-000000.tar") 106 146 107 - # Iterate with batching 108 147 for batch in dataset.shuffled(batch_size=32): 109 148 images = batch.image # numpy array (32, 224, 224, 3) 110 149 labels = batch.label # list of 32 strings 111 150 confs = batch.confidence # list of 32 floats 112 151 ``` 113 152 114 - ## HuggingFace-Style Loading 153 + ## Scaling Up 115 154 116 - ```{python} 117 - #| eval: false 118 - # Load from local path 119 - ds = atdata.load_dataset("path/to/data-{000000..000009}.tar", split="train") 155 + ### Team Storage with Redis + S3 120 156 121 - # Load with split detection 122 - ds_dict = atdata.load_dataset("path/to/data/") 123 - train_ds = ds_dict["train"] 124 - test_ds = ds_dict["test"] 125 - ``` 126 - 127 - ## Local Storage with Redis + S3 157 + When you're ready to share with your team: 128 158 129 159 ```{python} 130 160 #| eval: false 131 161 from atdata.local import LocalIndex, S3DataStore 132 - import webdataset as wds 133 162 134 - # Create samples and write to local tar 135 - with wds.writer.TarWriter("data.tar") as sink: 136 - for i, sample in enumerate(samples): 137 - sink.write({**sample.as_wds, "__key__": f"{i:06d}"}) 138 - 139 - # Set up index with S3 data store 163 + # Connect to team infrastructure 140 164 store = S3DataStore( 141 165 credentials={"AWS_ENDPOINT": "http://localhost:9000", ...}, 142 - bucket="my-bucket", 166 + bucket="team-datasets", 143 167 ) 144 - index = LocalIndex(data_store=store) # Connects to Redis 168 + index = LocalIndex(data_store=store) 169 + 170 + # Publish schema for consistency 171 + index.publish_schema(ImageSample, 
version="1.0.0") 145 172 146 173 # Insert dataset (writes to S3, indexes in Redis) 147 174 dataset = atdata.Dataset[ImageSample]("data.tar") 148 - entry = index.insert_dataset(dataset, name="my-dataset") 149 - print(f"Stored at: {entry.data_urls}") 175 + entry = index.insert_dataset(dataset, name="training-images-v1") 176 + 177 + # Team members can now discover and load 178 + # ds = atdata.load_dataset("@local/training-images-v1", index=index) 150 179 ``` 151 180 152 - ## Publish to ATProto Federation 181 + ### Federation with ATProto 182 + 183 + For public or cross-organization sharing: 153 184 154 185 ```{python} 155 186 #| eval: false 156 - from atdata.atmosphere import AtmosphereClient 187 + from atdata.atmosphere import AtmosphereClient, AtmosphereIndex, PDSBlobStore 157 188 from atdata.promote import promote_to_atmosphere 158 189 159 - # Authenticate 190 + # Authenticate with your ATProto identity 160 191 client = AtmosphereClient() 161 192 client.login("handle.bsky.social", "app-password") 162 193 163 - # Promote local dataset to federation 164 - entry = index.get_dataset("my-dataset") 194 + # Option 1: Promote existing local dataset 195 + entry = index.get_dataset("training-images-v1") 165 196 at_uri = promote_to_atmosphere(entry, index, client) 166 - print(f"Published at: {at_uri}") 197 + 198 + # Option 2: Publish directly with blob storage 199 + store = PDSBlobStore(client) 200 + atm_index = AtmosphereIndex(client, data_store=store) 201 + atm_index.insert_dataset(dataset, name="public-images", schema_ref=schema_uri) 202 + ``` 203 + 204 + ## HuggingFace-Style Loading 205 + 206 + For convenient access to datasets: 207 + 208 + ```{python} 209 + #| eval: false 210 + from atdata import load_dataset 211 + 212 + # Load from local files 213 + ds = load_dataset("path/to/data-{000000..000009}.tar") 214 + 215 + # Load with split detection 216 + ds_dict = load_dataset("path/to/data/") 217 + train_ds = ds_dict["train"] 218 + test_ds = ds_dict["test"] 219 + 220 + # Load from index 221 + ds = load_dataset("@local/my-dataset", index=index) 167 222 ``` 168 223 224 + ## Why atdata? 225 + 226 + | Need | Solution | 227 + |------|----------| 228 + | Type-safe samples | `@packable` decorator, `PackableSample` base class | 229 + | Efficient large-scale storage | WebDataset tar format, streaming iteration | 230 + | Schema flexibility | Lens transformations, `DictSample` for exploration | 231 + | Team collaboration | Redis index, S3 data store, schema registry | 232 + | Public sharing | ATProto federation, content-addressable CIDs | 233 + | Multiple backends | Protocol abstractions (`AbstractIndex`, `DataSource`) | 234 + 169 235 ## Next Steps 170 236 171 - - **[Quick Start Tutorial](tutorials/quickstart.qmd)** - Get up and running in 5 minutes 172 - - **[Packable Samples](reference/packable-samples.qmd)** - Learn about typed sample definitions 173 - - **[Datasets](reference/datasets.qmd)** - Master dataset iteration and batching 174 - - **[Atmosphere](reference/atmosphere.qmd)** - Publish to the ATProto federation 237 + ::: {.callout-tip} 238 + ## Getting Started 239 + 240 + **New to atdata?** Start with the [Quick Start Tutorial](tutorials/quickstart.qmd) to learn the basics of typed samples and datasets. 
241 + ::: 242 + 243 + - **[Architecture Overview](reference/architecture.qmd)** - Understand the design and how components fit together 244 + - **[Local Workflow](tutorials/local-workflow.qmd)** - Set up team storage with Redis + S3 245 + - **[Atmosphere Publishing](tutorials/atmosphere.qmd)** - Share datasets on the ATProto network 246 + - **[Packable Samples](reference/packable-samples.qmd)** - Deep dive into sample type definitions 247 + - **[Datasets](reference/datasets.qmd)** - Master iteration, batching, and transformations
+378
docs_src/reference/architecture.qmd
··· 1 + --- 2 + title: "Architecture Overview" 3 + description: "Understanding the design and components of atdata" 4 + --- 5 + 6 + atdata is designed around a simple but powerful idea: **typed, serializable samples** that can flow seamlessly between local development, team storage, and a federated network. This page explains the architectural decisions and how the components work together. 7 + 8 + ## Design Philosophy 9 + 10 + ### The Problem 11 + 12 + Machine learning workflows involve datasets at every stage—training data, validation sets, embeddings, features, and model outputs. These datasets are often: 13 + 14 + - **Untyped**: Raw files with implicit schemas, leading to runtime errors 15 + - **Siloed**: Stuck in one location (local disk, team bucket, or cloud storage) 16 + - **Undiscoverable**: No standard way to find and share datasets across teams or organizations 17 + 18 + ### The Solution 19 + 20 + atdata provides a three-layer architecture that addresses each problem: 21 + 22 + ``` 23 + ┌─────────────────────────────────────────────────────────────┐ 24 + │ Layer 3: Federation (ATProto Atmosphere) │ 25 + │ - Decentralized discovery and sharing │ 26 + │ - Content-addressable identifiers │ 27 + │ - Cross-organization dataset federation │ 28 + └─────────────────────────────────────────────────────────────┘ 29 + 30 + Promotion 31 + 32 + ┌─────────────────────────────────────────────────────────────┐ 33 + │ Layer 2: Team Storage (Redis + S3) │ 34 + │ - Shared index for team discovery │ 35 + │ - Scalable object storage for data │ 36 + │ - Schema registry for type consistency │ 37 + └─────────────────────────────────────────────────────────────┘ 38 + 39 + Insert 40 + 41 + ┌─────────────────────────────────────────────────────────────┐ 42 + │ Layer 1: Local Development │ 43 + │ - Typed samples with automatic serialization │ 44 + │ - WebDataset tar files for efficient storage │ 45 + │ - Lens transformations for schema flexibility │ 46 + └─────────────────────────────────────────────────────────────┘ 47 + ``` 48 + 49 + ## Core Components 50 + 51 + ### PackableSample: The Foundation 52 + 53 + Everything in atdata starts with **PackableSample**—a base class that makes Python dataclasses serializable with msgpack: 54 + 55 + ```{python} 56 + #| eval: false 57 + @atdata.packable 58 + class ImageSample: 59 + image: NDArray # Automatically converted to/from bytes 60 + label: str # Standard msgpack serialization 61 + confidence: float 62 + ``` 63 + 64 + Key features: 65 + 66 + - **Automatic NDArray handling**: Numpy arrays are serialized efficiently 67 + - **Type safety**: Field types are preserved and validated 68 + - **Round-trip fidelity**: Serialize → deserialize always produces identical data 69 + 70 + The `@packable` decorator is syntactic sugar that: 71 + 72 + 1. Converts your class to a dataclass 73 + 2. Adds `PackableSample` as a base class 74 + 3. 
Registers a lens from `DictSample` for flexible loading 75 + 76 + ### Dataset: Typed Iteration 77 + 78 + The `Dataset[T]` class wraps WebDataset tar archives with type information: 79 + 80 + ```{python} 81 + #| eval: false 82 + dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar") 83 + 84 + for batch in dataset.shuffled(batch_size=32): 85 + images = batch.image # Stacked NDArray: (32, H, W, C) 86 + labels = batch.label # List of 32 strings 87 + ``` 88 + 89 + **Why WebDataset?** 90 + 91 + WebDataset is a battle-tested format for large-scale ML training: 92 + 93 + - **Streaming**: No need to download entire datasets 94 + - **Sharding**: Data split across multiple tar files for parallelism 95 + - **Shuffling**: Two-level shuffling (shard + sample) for training 96 + 97 + atdata adds: 98 + 99 + - **Type safety**: Know the schema at compile time 100 + - **Batch aggregation**: NDArrays are automatically stacked 101 + - **Lens transformations**: View data through different schemas 102 + 103 + ### SampleBatch: Automatic Aggregation 104 + 105 + When iterating with `batch_size`, atdata returns `SampleBatch[T]` objects that aggregate sample attributes: 106 + 107 + ```{python} 108 + #| eval: false 109 + batch = SampleBatch[ImageSample](samples) 110 + 111 + # NDArray fields → stacked numpy array with batch dimension 112 + batch.image.shape # (batch_size, H, W, C) 113 + 114 + # Other fields → list 115 + batch.label # ["cat", "dog", "bird", ...] 116 + ``` 117 + 118 + This eliminates boilerplate collation code and works automatically for any `PackableSample` type. 119 + 120 + ### Lens: Schema Transformations 121 + 122 + Lenses enable viewing datasets through different schemas without duplicating data: 123 + 124 + ```{python} 125 + #| eval: false 126 + @atdata.packable 127 + class SimplifiedSample: 128 + label: str 129 + 130 + @atdata.lens 131 + def simplify(src: ImageSample) -> SimplifiedSample: 132 + return SimplifiedSample(label=src.label) 133 + 134 + # View dataset through simplified schema 135 + simple_ds = dataset.as_type(SimplifiedSample) 136 + ``` 137 + 138 + **When to use lenses:** 139 + 140 + - **Reducing fields**: Drop unnecessary data for specific tasks 141 + - **Transforming data**: Compute derived fields on-the-fly 142 + - **Schema migration**: Handle version differences between datasets 143 + 144 + Lenses are registered globally in a `LensNetwork`, enabling automatic discovery of transformation paths. 
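Path discovery means multi-hop conversions can work without an explicit end-to-end lens. Below is a minimal sketch of that idea, assuming registered lenses compose when `as_type` targets a type reachable through the network; the `UpperLabelSample` type and `upper_label` lens are illustrative, not part of the library:

```{python}
#| eval: false
@atdata.packable
class UpperLabelSample:
    label: str

# A second hop: SimplifiedSample -> UpperLabelSample, with a derived field
@atdata.lens
def upper_label(src: SimplifiedSample) -> UpperLabelSample:
    return UpperLabelSample(label=src.label.upper())

# With simplify (ImageSample -> SimplifiedSample) registered above, the
# LensNetwork can resolve the two-hop path ImageSample -> UpperLabelSample
upper_ds = dataset.as_type(UpperLabelSample)
```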
145 + 146 + ## Storage Backends 147 + 148 + ### Local Index (Redis + S3) 149 + 150 + For team-scale usage, atdata provides a two-component storage system: 151 + 152 + **Redis Index**: Stores metadata and enables fast lookups 153 + 154 + - Dataset entries (name, schema, URLs, metadata) 155 + - Schema registry (type definitions) 156 + - CID-based content addressing 157 + 158 + **S3 DataStore**: Stores actual data files 159 + 160 + - WebDataset tar shards 161 + - Any S3-compatible storage (AWS, MinIO, Cloudflare R2) 162 + 163 + ```{python} 164 + #| eval: false 165 + store = S3DataStore(credentials=creds, bucket="datasets") 166 + index = LocalIndex(data_store=store) 167 + 168 + # Insert dataset: writes to S3, indexes in Redis 169 + entry = index.insert_dataset(dataset, name="training-v1") 170 + ``` 171 + 172 + **Why this split?** 173 + 174 + - **Separation of concerns**: Metadata queries don't touch data storage 175 + - **Flexibility**: Use any S3-compatible storage 176 + - **Scalability**: Redis handles high-throughput lookups; S3 handles large files 177 + 178 + ### Atmosphere Index (ATProto) 179 + 180 + For public or cross-organization sharing, atdata integrates with the AT Protocol: 181 + 182 + **ATProto PDS**: Your Personal Data Server stores records 183 + 184 + - Schema definitions 185 + - Dataset index records 186 + - Lens transformation records 187 + 188 + **PDSBlobStore**: Optional blob storage on your PDS 189 + 190 + - Store actual data shards as ATProto blobs 191 + - Fully decentralized—no external dependencies 192 + 193 + ```{python} 194 + #| eval: false 195 + client = AtmosphereClient() 196 + client.login("handle.bsky.social", "app-password") 197 + 198 + store = PDSBlobStore(client) 199 + index = AtmosphereIndex(client, data_store=store) 200 + 201 + # Publish: creates ATProto records, uploads blobs 202 + entry = index.insert_dataset(dataset, name="public-features") 203 + ``` 204 + 205 + ## Protocol Abstractions 206 + 207 + atdata uses **protocols** (structural typing) to enable backend interoperability: 208 + 209 + ### AbstractIndex 210 + 211 + Common interface for both `LocalIndex` and `AtmosphereIndex`: 212 + 213 + ```{python} 214 + #| eval: false 215 + def process_dataset(index: AbstractIndex, name: str): 216 + entry = index.get_dataset(name) 217 + schema = index.decode_schema(entry.schema_ref) 218 + # Works with either LocalIndex or AtmosphereIndex 219 + ``` 220 + 221 + Key methods: 222 + 223 + - `insert_dataset()` / `get_dataset()`: Dataset CRUD 224 + - `publish_schema()` / `decode_schema()`: Schema management 225 + - `list_datasets()` / `list_schemas()`: Discovery 226 + 227 + ### AbstractDataStore 228 + 229 + Common interface for `S3DataStore` and `PDSBlobStore`: 230 + 231 + ```{python} 232 + #| eval: false 233 + def write_to_store(store: AbstractDataStore, dataset: Dataset): 234 + urls = store.write_shards(dataset, prefix="data/v1") 235 + # Works with S3 or PDS blob storage 236 + ``` 237 + 238 + ### DataSource 239 + 240 + Common interface for data streaming: 241 + 242 + - `URLSource`: WebDataset-compatible URLs 243 + - `S3Source`: S3 with explicit credentials 244 + - `BlobSource`: ATProto PDS blobs 245 + 246 + ## Data Flow: Local to Federation 247 + 248 + A typical workflow progresses through three stages: 249 + 250 + ### Stage 1: Local Development 251 + 252 + ```{python} 253 + #| eval: false 254 + # Define type and create samples 255 + @atdata.packable 256 + class MySample: 257 + features: NDArray 258 + label: str 259 + 260 + # Write to local tar 261 + with 
wds.writer.TarWriter("data.tar") as sink: 262 + for sample in samples: 263 + sink.write(sample.as_wds) 264 + 265 + # Iterate locally 266 + dataset = atdata.Dataset[MySample]("data.tar") 267 + ``` 268 + 269 + ### Stage 2: Team Storage 270 + 271 + ```{python} 272 + #| eval: false 273 + # Set up team storage 274 + store = S3DataStore(credentials=team_creds, bucket="team-datasets") 275 + index = LocalIndex(data_store=store) 276 + 277 + # Publish schema and insert 278 + index.publish_schema(MySample, version="1.0.0") 279 + entry = index.insert_dataset(dataset, name="my-features") 280 + 281 + # Team members can now load via index 282 + ds = load_dataset("@local/my-features", index=index) 283 + ``` 284 + 285 + ### Stage 3: Federation 286 + 287 + ```{python} 288 + #| eval: false 289 + # Promote to atmosphere 290 + client = AtmosphereClient() 291 + client.login("handle.bsky.social", "app-password") 292 + 293 + at_uri = promote_to_atmosphere(entry, index, client) 294 + 295 + # Anyone can now discover and load 296 + # ds = load_dataset("@handle.bsky.social/my-features") 297 + ``` 298 + 299 + ## Content Addressing 300 + 301 + atdata uses **CIDs** (Content Identifiers) for content-addressable storage: 302 + 303 + - **Schema CIDs**: Hash of schema definition 304 + - **Entry CIDs**: Hash of (schema_ref, data_urls) 305 + - **Blob CIDs**: Hash of data content 306 + 307 + Benefits: 308 + 309 + - **Deduplication**: Identical content has identical CID 310 + - **Integrity**: Verify data matches expected hash 311 + - **ATProto compatibility**: CIDs are native to the AT Protocol 312 + 313 + ## Extension Points 314 + 315 + atdata is designed for extensibility: 316 + 317 + ### Custom DataSources 318 + 319 + Implement the `DataSource` protocol to add new storage backends: 320 + 321 + ```{python} 322 + #| eval: false 323 + class MyCustomSource: 324 + def list_shards(self) -> list[str]: ... 325 + def open_shard(self, shard_id: str) -> IO[bytes]: ... 326 + 327 + @property 328 + def shards(self) -> Iterator[tuple[str, IO[bytes]]]: ... 329 + ``` 330 + 331 + ### Custom Lenses 332 + 333 + Register transformations between any PackableSample types: 334 + 335 + ```{python} 336 + #| eval: false 337 + @atdata.lens 338 + def my_transform(src: SourceType) -> TargetType: 339 + return TargetType(...) 340 + 341 + @my_transform.putter 342 + def my_transform_put(view: TargetType, src: SourceType) -> SourceType: 343 + return SourceType(...) 
344 + ``` 345 + 346 + ### Schema Extensions 347 + 348 + The schema format supports custom metadata for domain-specific needs: 349 + 350 + ```{python} 351 + #| eval: false 352 + index.publish_schema( 353 + MySample, 354 + version="1.0.0", 355 + metadata={"domain": "chemistry", "units": "mol/L"}, 356 + ) 357 + ``` 358 + 359 + ## Summary 360 + 361 + | Component | Purpose | Key Classes | 362 + |-----------|---------|-------------| 363 + | **Samples** | Typed, serializable data | `PackableSample`, `@packable` | 364 + | **Datasets** | Typed iteration over WebDataset | `Dataset[T]`, `SampleBatch[T]` | 365 + | **Lenses** | Schema transformations | `Lens`, `@lens`, `LensNetwork` | 366 + | **Local Storage** | Team-scale index + data | `LocalIndex`, `S3DataStore` | 367 + | **Atmosphere** | Federated sharing | `AtmosphereIndex`, `PDSBlobStore` | 368 + | **Protocols** | Backend abstraction | `AbstractIndex`, `AbstractDataStore`, `DataSource` | 369 + 370 + The architecture enables a smooth progression from local experimentation to team collaboration to public federation, all while maintaining type safety and efficient data handling. 371 + 372 + ## Related 373 + 374 + - [Packable Samples](packable-samples.qmd) - Defining sample types 375 + - [Datasets](datasets.qmd) - Dataset iteration and batching 376 + - [Local Storage](local-storage.qmd) - Redis + S3 backend 377 + - [Atmosphere](atmosphere.qmd) - ATProto federation 378 + - [Protocols](protocols.qmd) - Abstract interfaces
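The content-addressing behavior described above can be sketched with plain `hashlib`. This is a toy illustration only: atdata's real CIDs follow the ATProto multiformat convention, and the exact canonicalization scheme is not shown here.

```{python}
#| eval: false
import hashlib
import json

def toy_cid(payload: dict) -> str:
    # A hash of canonicalized content stands in for a real CID
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# Identical (schema_ref, data_urls) content yields identical identifiers,
# which is what enables deduplication and integrity verification
a = toy_cid({"schema_ref": "schema/v1", "data_urls": ["s3://bucket/shard-0.tar"]})
b = toy_cid({"schema_ref": "schema/v1", "data_urls": ["s3://bucket/shard-0.tar"]})
assert a == b
```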
+84 -1
docs_src/tutorials/atmosphere.qmd
··· 3 3 description: "Publish and discover datasets on the ATProto network" 4 4 --- 5 5 6 - This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing. 6 + This tutorial demonstrates how to use the atmosphere module to publish datasets to the AT Protocol network, enabling federated discovery and sharing. This is **Layer 3** of atdata's architecture—decentralized federation that enables cross-organization dataset sharing. 7 + 8 + ## Why Federation? 9 + 10 + Team storage (Redis + S3) works well within an organization, but sharing across organizations introduces new challenges: 11 + 12 + - **Discovery**: How do researchers find relevant datasets across institutions? 13 + - **Trust**: How do you verify a dataset is what it claims to be? 14 + - **Durability**: What happens if the original publisher goes offline? 15 + 16 + The **AT Protocol** (ATProto), developed by Bluesky, provides a foundation for decentralized social applications. atdata leverages ATProto's infrastructure for dataset federation: 17 + 18 + | ATProto Feature | atdata Usage | 19 + |----------------|--------------| 20 + | **DIDs** (Decentralized Identifiers) | Publisher identity verification | 21 + | **Lexicons** | Dataset/schema record schemas | 22 + | **PDSes** (Personal Data Servers) | Storage for records and blobs | 23 + | **Relays & AppViews** | Discovery and aggregation | 24 + 25 + The key insight: your Bluesky identity (`@handle.bsky.social`) becomes your dataset publisher identity. Anyone can verify that a dataset was published by you, and can discover your datasets through the federated network. 7 26 8 27 ## Prerequisites 9 28 ··· 86 105 87 106 ## AT URI Parsing 88 107 108 + Every record in ATProto is identified by an **AT URI**, which encodes: 109 + 110 + - **Authority**: The DID or handle of the record owner 111 + - **Collection**: The Lexicon type (like a table name) 112 + - **Rkey**: The record key (unique within the collection) 113 + 114 + Understanding AT URIs is essential for working with atmosphere datasets, as they're how you reference schemas, datasets, and lenses. 115 + 89 116 ATProto records are identified by AT URIs: 90 117 91 118 ```{python} ··· 105 132 106 133 ## Authentication 107 134 135 + The `AtmosphereClient` handles ATProto authentication. When you authenticate, you're proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS). 136 + 108 137 Connect to ATProto: 109 138 110 139 ```{python} ··· 117 146 ``` 118 147 119 148 ## Publish a Schema 149 + 150 + When you publish a schema to ATProto, it becomes a **public, immutable record** that others can reference. The schema CID ensures that anyone can verify they're using exactly the same type definition you published. 120 151 121 152 ```{python} 122 153 #| eval: false ··· 162 193 163 194 ### With PDS Blob Storage (Recommended) 164 195 196 + The `PDSBlobStore` is the **fully decentralized** option: your dataset shards are stored as ATProto blobs directly in your PDS, alongside your other ATProto records. 
This means: 197 + 198 + - **No external dependencies**: Data lives in the same infrastructure as your identity 199 + - **Content-addressed**: Blobs are identified by their CID, ensuring integrity 200 + - **Federated replication**: Relays can mirror your blobs for availability 201 + 165 202 For fully decentralized storage, use `PDSBlobStore` to store dataset shards directly as ATProto blobs in your PDS: 166 203 167 204 ```{python} ··· 223 260 224 261 ### With External URLs 225 262 263 + For larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. The ATProto record serves as the **index entry** while the actual data lives elsewhere. 264 + 226 265 For larger datasets or when using existing object storage: 227 266 228 267 ```{python} ··· 275 314 276 315 ## Complete Publishing Workflow 277 316 317 + Here's the end-to-end workflow for publishing a dataset to the atmosphere: 318 + 319 + 1. **Define your sample type** using `@packable` 320 + 2. **Create samples and write to tar** (same as local workflow) 321 + 3. **Authenticate** with your ATProto identity 322 + 4. **Create index with blob storage** (`AtmosphereIndex` + `PDSBlobStore`) 323 + 5. **Publish schema** (creates ATProto record) 324 + 6. **Insert dataset** (uploads blobs, creates dataset record) 325 + 326 + Notice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend. 327 + 278 328 This example shows the recommended workflow using `PDSBlobStore` for fully decentralized storage: 279 329 280 330 ```{python} ··· 334 384 break 335 385 ``` 336 386 387 + ## What You've Learned 388 + 389 + You now understand federated dataset publishing in atdata: 390 + 391 + | Concept | Purpose | 392 + |---------|---------| 393 + | `AtmosphereClient` | ATProto authentication and record management | 394 + | `AtmosphereIndex` | Federated index implementing `AbstractIndex` | 395 + | `PDSBlobStore` | PDS blob storage implementing `AbstractDataStore` | 396 + | `BlobSource` | Stream datasets from PDS blobs | 397 + | AT URIs | Universal identifiers for schemas and datasets | 398 + 399 + The protocol abstractions (`AbstractIndex`, `AbstractDataStore`, `DataSource`) ensure your code works across all three layers of atdata—local files, team storage, and federated sharing. 400 + 401 + ## The Full Picture 402 + 403 + You've now seen atdata's complete architecture: 404 + 405 + ``` 406 + Local Development Team Storage Federation 407 + ───────────────── ──────────── ────────── 408 + tar files Redis + S3 ATProto PDS 409 + Dataset[T] LocalIndex AtmosphereIndex 410 + S3DataStore PDSBlobStore 411 + ``` 412 + 413 + The same `@packable` sample types, the same `Dataset[T]` iteration patterns, and the same lens transformations work at every layer. Only the storage backend changes. 414 + 337 415 ## Next Steps 416 + 417 + ::: {.callout-tip} 418 + ## Already Have Local Datasets? 419 + The [Promotion Workflow](promotion.qmd) tutorial shows how to migrate existing datasets from local storage to the atmosphere without re-processing your data. 420 + ::: 338 421 339 422 - **[Promotion Workflow](promotion.qmd)** - Migrate from local storage to atmosphere 340 423 - **[Atmosphere Reference](../reference/atmosphere.qmd)** - Complete API reference
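A small standalone illustration of the AT URI anatomy this tutorial describes (authority, collection, rkey). This is a toy parser, not atdata's `AtUri` class, and the DID and collection NSID in the example are made up:

```python
from dataclasses import dataclass

@dataclass
class ParsedAtUri:
    authority: str   # DID or handle that owns the record
    collection: str  # Lexicon type, analogous to a table name
    rkey: str        # record key, unique within the collection

def parse_at_uri(uri: str) -> ParsedAtUri:
    # Shape described above: at://<authority>/<collection>/<rkey>
    if not uri.startswith("at://"):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri.removeprefix("at://").split("/", 2)
    if len(parts) != 3:
        raise ValueError(f"expected authority/collection/rkey in {uri!r}")
    return ParsedAtUri(*parts)

uri = parse_at_uri("at://did:plc:abc123/app.example.dataset/3kexamplekey")
assert uri.authority == "did:plc:abc123"
```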
+71 -1
docs_src/tutorials/local-workflow.qmd
···	3 3 description: "Store and manage datasets with Redis + S3" 4 4 --- 5 5 6 - This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage. 6 + This tutorial demonstrates how to use the local storage module to store and index datasets using Redis and S3-compatible storage. This is **Layer 2** of atdata's architecture—team-scale storage that bridges local development and federated sharing. 7 + 8 + ## Why Team Storage? 9 + 10 + Local tar files work well for individual experiments, but teams need: 11 + 12 + - **Discovery**: "What datasets do we have? What schema does this one use?" 13 + - **Consistency**: "Is everyone using the same version of this dataset?" 14 + - **Durability**: "Where's the canonical copy of our training data?" 15 + 16 + atdata's local storage module addresses these needs with a two-component architecture: 17 + 18 + | Component | Purpose | 19 + |-----------|---------| 20 + | **Redis Index** | Fast metadata queries, schema registry, dataset discovery | 21 + | **S3 DataStore** | Scalable object storage for actual data files | 22 + 23 + This separation means metadata operations (listing datasets, resolving schemas) are fast and don't touch large data files, while the data itself lives in battle-tested object storage. 7 24 8 25 ## Prerequisites 9 26 ··· 46 63 ``` 47 64 48 65 ## LocalDatasetEntry 66 + 67 + Every dataset in the index is represented by a `LocalDatasetEntry`. A key design decision: entries use **content-addressable CIDs** (Content Identifiers) as their identity. This means: 68 + 69 + - Identical content always has the same CID 70 + - You can verify data integrity by checking the CID 71 + - Deduplication happens automatically 72 + 73 + CIDs are computed from the entry's schema reference and data URLs, so any two entries that describe the same schema and the same shards are guaranteed to share a CID, no matter which index records them. 49 74 50 75 Create entries with content-addressable CIDs: 51 76 ··· 72 97 73 98 ## LocalIndex 100 + The `LocalIndex` is your team's dataset registry. It implements the `AbstractIndex` protocol, meaning code written against `LocalIndex` will also work with `AtmosphereIndex` when you're ready for federated sharing. 101 + 75 102 The index tracks datasets in Redis: 77 104 ```{python} ··· 87 114 88 115 ### Schema Management 117 + **Schema publishing** is how you ensure type consistency across your team. When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference. 118 + 119 + This enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code. 120 + 90 121 ```{python} 91 122 #| eval: false 92 123 # Publish a schema ··· 108 139 109 140 ## S3DataStore 142 + The `S3DataStore` implements the `AbstractDataStore` protocol for S3-compatible object storage. It works with: 143 + 144 + - **AWS S3**: Production-scale cloud storage 145 + - **MinIO**: Self-hosted S3-compatible storage (great for development) 146 + - **Cloudflare R2**: Cost-effective S3-compatible storage 147 + 148 + The data store handles uploading tar shards and creating signed URLs for streaming access. 149 + 111 150 For direct S3 operations: 112 151 113 152 ```{python} ··· 125 164 ``` 126 165 127 166 ## Complete Index Workflow 167 + 168 + Here's the typical workflow for publishing a dataset to your team: 169 + 170 + 1. 
**Create samples** using your `@packable` type 171 + 2. **Write to local tar** for staging 172 + 3. **Create a Dataset** wrapper 173 + 4. **Connect to index with data store** 174 + 5. **Publish schema** for type consistency 175 + 6. **Insert dataset** (uploads to S3, indexes in Redis) 176 + 177 + The index composition pattern (`LocalIndex(data_store=S3DataStore(...))`) is deliberate—it separates the concern of "where is metadata?" from "where is data?", making it easy to swap storage backends. 128 178 129 179 Use `LocalIndex` with `S3DataStore` to store datasets with S3 storage and Redis indexing: 130 180 ··· 176 226 ``` 177 227 178 228 ## Using load_dataset with Index 229 + 230 + The `load_dataset()` function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve `@local/` prefixed paths to the actual data URLs and apply the correct credentials automatically. 179 231 180 232 The `load_dataset()` function supports index lookup: 181 233 ··· 192 244 break 193 245 ``` 194 246 247 + ## What You've Learned 248 + 249 + You now understand team-scale storage in atdata: 250 + 251 + | Concept | Purpose | 252 + |---------|---------| 253 + | `LocalIndex` | Redis-backed dataset registry implementing `AbstractIndex` | 254 + | `S3DataStore` | S3-compatible object storage implementing `AbstractDataStore` | 255 + | `LocalDatasetEntry` | Content-addressed dataset entries with CIDs | 256 + | Schema publishing | Shared type definitions for team consistency | 257 + 258 + The same sample types you defined in the Quick Start work seamlessly here—the only change is where the data lives. 259 + 195 260 ## Next Steps 261 + 262 + ::: {.callout-tip} 263 + ## Ready for Public Sharing? 264 + The [Atmosphere Publishing](atmosphere.qmd) tutorial shows how to publish datasets to the ATProto network for decentralized, cross-organization discovery. 265 + ::: 196 266 197 267 - **[Atmosphere Publishing](atmosphere.qmd)** - Publish to ATProto federation 198 268 - **[Promotion Workflow](promotion.qmd)** - Migrate from local to atmosphere
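To make the content-addressing discussion in the LocalDatasetEntry section concrete, here is a minimal sketch of deriving an entry identity from a schema reference plus shard URLs. Real CIDs use multiformats encoding rather than a bare SHA-256 hex digest, and the sorting here is an illustrative choice, so treat this as a model of the behavior rather than the implementation:

```python
import hashlib

def entry_digest(schema_ref: str, data_urls: list[str]) -> str:
    # Hash the entry's descriptive content, not its storage location.
    h = hashlib.sha256()
    h.update(schema_ref.encode())
    for url in sorted(data_urls):  # illustrative: order-independent identity
        h.update(url.encode())
    return h.hexdigest()

a = entry_digest("MySample@1.0.0", ["s3://bucket/shard-000000.tar"])
b = entry_digest("MySample@1.0.0", ["s3://bucket/shard-000000.tar"])
assert a == b  # identical content, identical identity
```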
+40 -1
docs_src/tutorials/promotion.qmd
··· 3 3 description: "Migrate datasets from local storage to ATProto" 4 4 --- 5 5 6 - This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network. 6 + This tutorial demonstrates the workflow for migrating datasets from local Redis/S3 storage to the federated ATProto atmosphere network. Promotion is the bridge between **Layer 2** (team storage) and **Layer 3** (federation). 7 + 8 + ## Why Promotion? 9 + 10 + A common pattern in data science: 11 + 12 + 1. **Start private**: Develop and validate datasets within your team 13 + 2. **Go public**: Share successful datasets with the broader community 14 + 15 + Promotion handles this transition without re-processing your data. Instead of creating a new dataset from scratch, you're **lifting** an existing local dataset entry into the federated atmosphere. 16 + 17 + The workflow handles several complexities automatically: 18 + 19 + - **Schema deduplication**: If you've already published the same schema type and version, promotion reuses it 20 + - **URL preservation**: Data stays in place (unless you explicitly want to copy it) 21 + - **CID consistency**: Content identifiers remain valid across the transition 7 22 8 23 ## Overview 9 24 ··· 307 322 308 323 # 6. Others can now discover and load 309 324 # ds = atdata.load_dataset("@myhandle.bsky.social/feature-vectors-v1") 325 + ``` 326 + 327 + ## What You've Learned 328 + 329 + You now understand the promotion workflow: 330 + 331 + | Concept | Purpose | 332 + |---------|---------| 333 + | `promote_to_atmosphere()` | Lift local entries to federated network | 334 + | Schema deduplication | Avoid publishing duplicate schemas | 335 + | Data URL preservation | Keep data in place or copy to new storage | 336 + | Metadata enrichment | Add description, tags, license during promotion | 337 + 338 + Promotion completes atdata's three-layer story: you can now move seamlessly from local experimentation to team collaboration to public sharing, all with the same typed sample definitions. 339 + 340 + ## The Complete Journey 341 + 342 + ``` 343 + ┌──────────────────┐ insert ┌──────────────────┐ promote ┌──────────────────┐ 344 + │ Local Files │ ────────────→ │ Team Storage │ ────────────→ │ Federation │ 345 + │ │ │ │ │ │ 346 + │ tar files │ │ Redis + S3 │ │ ATProto PDS │ 347 + │ Dataset[T] │ │ LocalIndex │ │ AtmosphereIndex │ 348 + └──────────────────┘ └──────────────────┘ └──────────────────┘ 310 349 ``` 311 350 312 351 ## Next Steps
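The promotion steps above lend themselves to pseudocode. This sketch models schema deduplication and URL preservation with plain dictionaries; every name in it is a hypothetical stand-in, not the real `promote_to_atmosphere()` internals:

```python
# Hypothetical model of promotion: dedup the schema, keep the data URLs.
def promote(entry: dict, published_schemas: dict[str, str]) -> dict:
    schema_key = f"{entry['schema_name']}@{entry['schema_version']}"
    # Schema deduplication: reuse the ATProto schema record if one exists.
    schema_uri = published_schemas.get(schema_key)
    if schema_uri is None:
        schema_uri = f"at://did:plc:example/schema/{schema_key}"  # publish once
        published_schemas[schema_key] = schema_uri
    # URL preservation: the federated record references the same shards,
    # so no data is copied or re-processed.
    return {"schema": schema_uri, "data_urls": entry["data_urls"]}

schemas: dict[str, str] = {}
local = {"schema_name": "MySample", "schema_version": "1.0.0",
         "data_urls": ["s3://bucket/shard-000000.tar"]}
first = promote(local, schemas)
second = promote(local, schemas)
assert first["schema"] == second["schema"]  # dedup: schema published once
```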
+67 -1
docs_src/tutorials/quickstart.qmd
··· 3 3 description: "Get up and running with atdata in 5 minutes" 4 4 --- 5 5 6 - This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them. 6 + This guide walks you through the basics of atdata: defining sample types, writing datasets, and iterating over them. You'll learn the foundational patterns that enable type-safe, efficient dataset handling—the first layer of atdata's three-layer architecture. 7 + 8 + ## Where This Fits 9 + 10 + atdata is built around a simple progression: 11 + 12 + ``` 13 + Local Development → Team Storage → Federation 14 + ``` 15 + 16 + This tutorial covers **local development**—the foundation. Everything you learn here (typed samples, efficient iteration, lens transformations) carries forward as you scale to team storage and federated sharing. The key insight is that your sample types remain the same across all three layers; only the storage backend changes. 7 17 8 18 ## Installation 9 19 ··· 16 26 17 27 ## Define a Sample Type 18 28 29 + The core abstraction in atdata is the **PackableSample**—a typed, serializable data structure. Unlike raw dictionaries or ad-hoc classes, PackableSamples provide: 30 + 31 + - **Type safety**: Know your schema at write time, not training time 32 + - **Automatic serialization**: msgpack encoding with efficient NDArray handling 33 + - **Round-trip fidelity**: Data survives serialization without loss 34 + 19 35 Use the `@packable` decorator to create a typed sample: 20 36 21 37 ```{python} ··· 61 77 62 78 ## Write a Dataset 63 79 80 + atdata uses **WebDataset's tar format** for storage. This choice is deliberate: 81 + 82 + - **Streaming**: Process data without downloading entire datasets 83 + - **Sharding**: Split large datasets across multiple files for parallel I/O 84 + - **Proven**: Battle-tested at scale by organizations like Google, NVIDIA, and OpenAI 85 + 86 + The `as_wds` property on your sample provides the dictionary format WebDataset expects: 87 + 64 88 Use WebDataset's `TarWriter` to create dataset files: 65 89 66 90 ```{python} ··· 87 111 88 112 ## Load and Iterate 89 113 114 + The generic `Dataset[T]` class connects your sample type to WebDataset's streaming infrastructure. When you specify `Dataset[ImageSample]`, atdata knows how to deserialize the msgpack bytes back into fully-typed objects. 115 + 116 + **Automatic batch aggregation** is a key feature: when you iterate with `batch_size`, atdata returns `SampleBatch` objects that intelligently combine samples: 117 + 118 + - NDArray fields are **stacked** into a single array with a batch dimension 119 + - Other fields become **lists** of values 120 + 121 + This eliminates boilerplate collation code and works automatically with any PackableSample type. 122 + 90 123 Create a typed `Dataset` and iterate with batching: 91 124 92 125 ```{python} ··· 110 143 111 144 ## Shuffled Iteration 112 145 146 + Proper shuffling is critical for training. WebDataset provides **two-level shuffling**: 147 + 148 + 1. **Shard shuffling**: Randomize the order of tar files 149 + 2. **Sample shuffling**: Randomize samples within a buffer 150 + 151 + This approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset. 152 + 113 153 For training, use shuffled iteration: 114 154 115 155 ```{python} ··· 126 166 127 167 ## Use Lenses for Type Transformations 128 168 169 + **Lenses** are bidirectional transformations between sample types. 
They solve a common problem: you have a dataset with a rich schema, but a particular task only needs a subset of fields—or needs derived fields computed on-the-fly. 170 + 171 + Instead of creating separate datasets for each use case (duplicating storage and maintenance burden), lenses let you **view** the same underlying data through different type schemas. The design is borrowed from the lens abstraction in functional programming and enables: 172 + 173 + - **Schema reduction**: Drop fields you don't need 174 + - **Schema migration**: Handle version differences between datasets 175 + - **Derived features**: Compute fields on-the-fly during iteration 176 + 129 177 View datasets through different schemas: 130 178 131 179 ```{python} ··· 150 198 break 151 199 ``` 152 200 201 + ## What You've Learned 202 + 203 + You now understand atdata's foundational concepts: 204 + 205 + | Concept | Purpose | 206 + |---------|---------| 207 + | `@packable` | Create typed, serializable sample classes | 208 + | `Dataset[T]` | Typed iteration over WebDataset tar files | 209 + | `SampleBatch[T]` | Automatic aggregation with NDArray stacking | 210 + | `@lens` | Transform between sample types without data duplication | 211 + 212 + These patterns work identically whether your data lives on local disk, in team S3 storage, or on the ATProto network. The next tutorials show how to scale beyond local files. 213 + 153 214 ## Next Steps 215 + 216 + ::: {.callout-tip} 217 + ## Ready to Share with Your Team? 218 + The [Local Workflow](local-workflow.qmd) tutorial shows how to set up Redis + S3 storage for team-wide dataset discovery and sharing. 219 + ::: 154 220 155 221 - **[Local Workflow](local-workflow.qmd)** - Store datasets with Redis + S3 156 222 - **[Atmosphere Publishing](atmosphere.qmd)** - Publish to ATProto federation
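The quickstart's batch-aggregation rule (NDArray fields stacked, everything else collected into lists) is simple enough to demonstrate standalone. This sketch mirrors the documented semantics with plain dicts and NumPy; it is not `SampleBatch` itself:

```python
import numpy as np

def collate(samples: list[dict]) -> dict:
    # Mirrors the described SampleBatch behavior: stack arrays along a new
    # batch dimension, gather every other field into a plain list.
    out = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], np.ndarray):
            out[key] = np.stack(values)  # shape: (batch, *field_shape)
        else:
            out[key] = values            # list of per-sample values
    return out

batch = collate([
    {"image": np.zeros((8, 8)), "label": "cat"},
    {"image": np.ones((8, 8)), "label": "dog"},
])
assert batch["image"].shape == (2, 8, 8)
assert batch["label"] == ["cat", "dog"]
```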
+3 -3
src/atdata/_stub_manager.py
··· 241 241 fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB) 242 242 except (OSError, IOError): 243 243 # Lock unavailable (NFS, Windows, etc.) - proceed without lock 244 - # This is safe because atomic rename provides the real protection 245 - _ = None # Explicit no-op 244 + # Atomic rename provides the real protection 245 + pass 246 246 247 247 f.write(content) 248 248 f.flush() ··· 256 256 try: 257 257 temp_path.unlink() 258 258 except OSError: 259 - _ = None # Temp file cleanup failed, but we're re-raising anyway 259 + pass # Temp file cleanup failed, re-raising original error 260 260 raise 261 261 262 262 def _write_stub_atomic(self, path: Path, content: str) -> None:
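For readers without the surrounding file open: the pattern this hunk cleans up is write-to-temp, flush and fsync, then atomic rename, with the advisory flock as best-effort extra protection. A self-contained sketch of that pattern, with illustrative file names:

```python
import os
from pathlib import Path

def write_atomic(path: Path, content: str) -> None:
    temp_path = path.with_name(path.name + ".tmp")
    try:
        with open(temp_path, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())     # durable before it becomes visible
        os.replace(temp_path, path)  # atomic rename: readers see old or new, never partial
    except BaseException:
        try:
            temp_path.unlink()
        except OSError:
            pass  # best-effort cleanup; the original error matters more
        raise

write_atomic(Path("example_stub.pyi"), "# generated stub\n")
```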
+6 -3
src/atdata/dataset.py
···	443 443 subscripted syntax ``SampleBatch[MyType](samples)`` rather than 444 444 calling the constructor directly with an unsubscripted class. 445 445 """ 446 - # TODO The above has a line for "Parameters:" that should be "Type Parameters:"; this is a temporary fix for `quartodoc` auto-generation bugs. 446 + # Design note: The docstring uses "Parameters:" for type parameters because 447 + # quartodoc doesn't yet support "Type Parameters:" sections in generated docs. 447 448 448 449 def __init__( self, samples: Sequence[DT] ): 449 450 """Create a batch from a sequence of samples. ··· 573 574 subscripted syntax ``Dataset[MyType](url)`` rather than calling the 574 575 constructor directly with an unsubscripted class. 575 576 """ 576 - # TODO The above has a line for "Parameters:" that should be "Type Parameters:"; this is a temporary fix for `quartodoc` auto-generation bugs. 577 + # Design note: The docstring uses "Parameters:" for type parameters because 578 + # quartodoc doesn't yet support "Type Parameters:" sections in generated docs. 577 579 578 580 @property 579 581 def sample_type( self ) -> Type: ··· 632 634 self._source = source 633 635 # For compatibility, expose URL if source has list_shards 634 636 shards = source.list_shards() 635 - # TODO Expand out in brace notation the full shard list, rather than just using the first entry, in this fallback; add tests to make sure we catch this issue, as it wasn't showing up in our previous test suite. 637 + # Design note: url holds only the first shard, for legacy compatibility. 638 + # The full shard list remains available via list_shards(). 636 639 self.url = shards[0] if shards else "" 637 640 638 641 self._metadata: dict[str, Any] | None = None
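Context for the new design note: WebDataset conventionally names multi-shard datasets with brace notation, which the braceexpand package (a webdataset dependency) expands into the full shard list. The bucket URL here is illustrative:

```python
from braceexpand import braceexpand

# One brace pattern names every shard in the dataset.
pattern = "s3://bucket/shard-{000000..000002}.tar"
shards = list(braceexpand(pattern))
assert shards == [
    "s3://bucket/shard-000000.tar",
    "s3://bucket/shard-000001.tar",
    "s3://bucket/shard-000002.tar",
]
first = shards[0]  # what the legacy `url` attribute would expose
```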
+31
tests/test_integration_edge_cases.py
··· 105 105 assert len(batches) >= 1 106 106 assert len(batches[0].samples) == 1 107 107 108 + def test_empty_tar_iteration(self, tmp_path): 109 + """Iteration over empty tar file should yield no samples.""" 110 + import webdataset as wds 111 + 112 + tar_path = tmp_path / "empty-000000.tar" 113 + # Create empty tar file with no samples 114 + with wds.writer.TarWriter(str(tar_path)): 115 + pass 116 + 117 + ds = atdata.Dataset[EmptyCompatSample](str(tar_path)) 118 + 119 + # Ordered iteration should yield nothing 120 + samples = list(ds.ordered(batch_size=None)) 121 + assert samples == [] 122 + 123 + # Batched iteration should also yield nothing 124 + batches = list(ds.ordered(batch_size=10)) 125 + assert batches == [] 126 + 127 + def test_empty_tar_shuffled_iteration(self, tmp_path): 128 + """Shuffled iteration over empty tar should yield no samples.""" 129 + import webdataset as wds 130 + 131 + tar_path = tmp_path / "empty-shuffled-000000.tar" 132 + with wds.writer.TarWriter(str(tar_path)): 133 + pass 134 + 135 + ds = atdata.Dataset[EmptyCompatSample](str(tar_path)) 136 + samples = list(ds.shuffled(batch_size=None)) 137 + assert samples == [] 138 + 108 139 109 140 ## 110 141 # Primitive Type Coverage Tests