A loose federation of distributed, typed datasets

docs: regenerate HTML output from Quarto sources

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+162 -162
+6 -6
docs/index.html
··· 666 666 <section id="define-a-sample-type" class="level3"> 667 667 <h3 class="anchored" data-anchor-id="define-a-sample-type">1. Define a Sample Type</h3> 668 668 <p>The <code>@packable</code> decorator creates a serializable dataclass:</p> 669 - <div id="387a656e" class="cell"> 669 + <div id="ce9c727b" class="cell"> 670 670 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 671 671 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 672 672 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 681 681 <section id="create-and-write-samples" class="level3"> 682 682 <h3 class="anchored" data-anchor-id="create-and-write-samples">2. Create and Write Samples</h3> 683 683 <p>Use WebDataset’s standard TarWriter:</p> 684 - <div id="e31de370" class="cell"> 684 + <div id="a2250059" class="cell"> 685 685 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 686 686 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 687 687 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> ··· 701 701 <section id="load-and-iterate-with-type-safety" class="level3"> 702 702 <h3 class="anchored" data-anchor-id="load-and-iterate-with-type-safety">3. 
Load and Iterate with Type Safety</h3> 703 703 <p>The generic <code>Dataset[T]</code> provides typed access:</p> 704 - <div id="d035a8cb" class="cell"> 704 + <div id="86e5860f" class="cell"> 705 705 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-000000.tar"</span>)</span> 706 706 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 707 707 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> ··· 716 716 <section id="team-storage-with-redis-s3" class="level3"> 717 717 <h3 class="anchored" data-anchor-id="team-storage-with-redis-s3">Team Storage with Redis + S3</h3> 718 718 <p>When you’re ready to share with your team:</p> 719 - <div id="5691aefd" class="cell"> 719 + <div id="7a91c337" class="cell"> 720 720 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 721 721 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 722 722 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to team infrastructure</span></span> ··· 740 740 <section id="federation-with-atproto" class="level3"> 741 741 <h3 class="anchored" data-anchor-id="federation-with-atproto">Federation with ATProto</h3> 742 742 <p>For public or cross-organization sharing:</p> 743 - <div id="97baea4b" class="cell"> 743 + <div id="1d65da04" class="cell"> 744 744 <div class="sourceCode 
cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span> 745 745 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span> 746 746 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 762 762 <section id="huggingface-style-loading" class="level2"> 763 763 <h2 class="anchored" data-anchor-id="huggingface-style-loading">HuggingFace-Style Loading</h2> 764 764 <p>For convenient access to datasets:</p> 765 - <div id="a4e6c68a" class="cell"> 765 + <div id="f21a38dc" class="cell"> 766 766 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 767 767 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 768 768 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local files</span></span>
+14 -14
docs/reference/architecture.html
··· 657 657 <section id="packablesample-the-foundation" class="level3"> 658 658 <h3 class="anchored" data-anchor-id="packablesample-the-foundation">PackableSample: The Foundation</h3> 659 659 <p>Everything in atdata starts with <strong>PackableSample</strong>—a base class that makes Python dataclasses serializable with msgpack:</p> 660 - <div id="1d6c7714" class="cell"> 660 + <div id="89a48627" class="cell"> 661 661 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 662 662 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 663 663 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> image: NDArray <span class="co"># Automatically converted to/from bytes</span></span> ··· 680 680 <section id="dataset-typed-iteration" class="level3"> 681 681 <h3 class="anchored" data-anchor-id="dataset-typed-iteration">Dataset: Typed Iteration</h3> 682 682 <p>The <code>Dataset[T]</code> class wraps WebDataset tar archives with type information:</p> 683 - <div id="750a8ebc" class="cell"> 683 + <div id="3a0ab003" class="cell"> 684 684 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 685 685 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 686 686 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> ··· 704 704 <section id="samplebatch-automatic-aggregation" 
class="level3"> 705 705 <h3 class="anchored" data-anchor-id="samplebatch-automatic-aggregation">SampleBatch: Automatic Aggregation</h3> 706 706 <p>When iterating with <code>batch_size</code>, atdata returns <code>SampleBatch[T]</code> objects that aggregate sample attributes:</p> 707 - <div id="36aa2bc8" class="cell"> 707 + <div id="c43dd2d8" class="cell"> 708 708 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>batch <span class="op">=</span> SampleBatch[ImageSample](samples)</span> 709 709 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 710 710 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># NDArray fields → stacked numpy array with batch dimension</span></span> ··· 718 718 <section id="lens-schema-transformations" class="level3"> 719 719 <h3 class="anchored" data-anchor-id="lens-schema-transformations">Lens: Schema Transformations</h3> 720 720 <p>Lenses enable viewing datasets through different schemas without duplicating data:</p> 721 - <div id="a5b534a8" class="cell"> 721 + <div id="2dc95de3" class="cell"> 722 722 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 723 723 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 724 724 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> ··· 755 755 <li>WebDataset tar shards</li> 756 756 <li>Any S3-compatible storage (AWS, MinIO, Cloudflare R2)</li> 757 757 </ul> 758 - <div id="8792eb75" class="cell"> 758 + <div id="a2bdd84b" class="cell"> 759 759 <div class="sourceCode cell-code" 
id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>creds, bucket<span class="op">=</span><span class="st">"datasets"</span>)</span> 760 760 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> 761 761 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 783 783 <li>Store actual data shards as ATProto blobs</li> 784 784 <li>Fully decentralized—no external dependencies</li> 785 785 </ul> 786 - <div id="a5b8403d" class="cell"> 786 + <div id="d614873c" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 788 788 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 789 789 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 801 801 <section id="abstractindex" class="level3"> 802 802 <h3 class="anchored" data-anchor-id="abstractindex">AbstractIndex</h3> 803 803 <p>Common interface for both <code>LocalIndex</code> and <code>AtmosphereIndex</code>:</p> 804 - <div id="53fdc4fb" class="cell"> 804 + <div id="cb928d04" class="cell"> 805 805 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_dataset(index: AbstractIndex, name: <span class="bu">str</span>):</span> 806 806 <span id="cb8-2"><a href="#cb8-2" 
aria-hidden="true" tabindex="-1"></a> entry <span class="op">=</span> index.get_dataset(name)</span> 807 807 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> schema <span class="op">=</span> index.decode_schema(entry.schema_ref)</span> ··· 817 817 <section id="abstractdatastore" class="level3"> 818 818 <h3 class="anchored" data-anchor-id="abstractdatastore">AbstractDataStore</h3> 819 819 <p>Common interface for <code>S3DataStore</code> and <code>PDSBlobStore</code>:</p> 820 - <div id="76134918" class="cell"> 820 + <div id="e7b39bb2" class="cell"> 821 821 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_to_store(store: AbstractDataStore, dataset: Dataset):</span> 822 822 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> urls <span class="op">=</span> store.write_shards(dataset, prefix<span class="op">=</span><span class="st">"data/v1"</span>)</span> 823 823 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># Works with S3 or PDS blob storage</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 838 838 <p>A typical workflow progresses through three stages:</p> 839 839 <section id="stage-1-local-development" class="level3"> 840 840 <h3 class="anchored" data-anchor-id="stage-1-local-development">Stage 1: Local Development</h3> 841 - <div id="9ea69426" class="cell"> 841 + <div id="42b455c5" class="cell"> 842 842 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define type and create samples</span></span> 843 843 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" 
tabindex="-1"></a><span class="at">@atdata.packable</span></span> 844 844 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MySample:</span> ··· 856 856 </section> 857 857 <section id="stage-2-team-storage" class="level3"> 858 858 <h3 class="anchored" data-anchor-id="stage-2-team-storage">Stage 2: Team Storage</h3> 859 - <div id="c454dd0d" class="cell"> 859 + <div id="dca2cb0c" class="cell"> 860 860 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Set up team storage</span></span> 861 861 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>team_creds, bucket<span class="op">=</span><span class="st">"team-datasets"</span>)</span> 862 862 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> ··· 871 871 </section> 872 872 <section id="stage-3-federation" class="level3"> 873 873 <h3 class="anchored" data-anchor-id="stage-3-federation">Stage 3: Federation</h3> 874 - <div id="9dfedd67" class="cell"> 874 + <div id="b001011a" class="cell"> 875 875 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Promote to atmosphere</span></span> 876 876 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 877 877 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> ··· 904 904 <section 
id="custom-datasources" class="level3"> 905 905 <h3 class="anchored" data-anchor-id="custom-datasources">Custom DataSources</h3> 906 906 <p>Implement the <code>DataSource</code> protocol to add new storage backends:</p> 907 - <div id="e8e5bb8b" class="cell"> 907 + <div id="b258c8c1" class="cell"> 908 908 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MyCustomSource:</span> 909 909 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> list_shards(<span class="va">self</span>) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]: ...</span> 910 910 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> open_shard(<span class="va">self</span>, shard_id: <span class="bu">str</span>) <span class="op">-&gt;</span> IO[<span class="bu">bytes</span>]: ...</span> ··· 916 916 <section id="custom-lenses" class="level3"> 917 917 <h3 class="anchored" data-anchor-id="custom-lenses">Custom Lenses</h3> 918 918 <p>Register transformations between any PackableSample types:</p> 919 - <div id="d20b964d" class="cell"> 919 + <div id="69b20972" class="cell"> 920 920 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 921 921 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> my_transform(src: SourceType) <span class="op">-&gt;</span> TargetType:</span> 922 922 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> TargetType(...)</span> ··· 929 929 <section id="schema-extensions" class="level3"> 930 930 
<h3 class="anchored" data-anchor-id="schema-extensions">Schema Extensions</h3> 931 931 <p>The schema format supports custom metadata for domain-specific needs:</p> 932 - <div id="c00ad4da" class="cell"> 932 + <div id="acd9c2fa" class="cell"> 933 933 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>index.publish_schema(</span> 934 934 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> MySample,</span> 935 935 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> version<span class="op">=</span><span class="st">"1.0.0"</span>,</span>
+22 -22
docs/reference/atmosphere.html
··· 626 626 <section id="atmosphereclient" class="level2"> 627 627 <h2 class="anchored" data-anchor-id="atmosphereclient">AtmosphereClient</h2> 628 628 <p>The client handles authentication and record operations:</p> 629 - <div id="27b96b63" class="cell"> 629 + <div id="397fcc7e" class="cell"> 630 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 631 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 632 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 653 653 <section id="session-management" class="level3"> 654 654 <h3 class="anchored" data-anchor-id="session-management">Session Management</h3> 655 655 <p>Save and restore sessions to avoid re-authentication:</p> 656 - <div id="51285d9a" class="cell"> 656 + <div id="9b0dd285" class="cell"> 657 657 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export session for later</span></span> 658 658 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>session_string <span class="op">=</span> client.export_session()</span> 659 659 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 665 665 <section id="custom-pds" class="level3"> 666 666 <h3 class="anchored" data-anchor-id="custom-pds">Custom PDS</h3> 667 667 <p>Connect to a custom PDS instead of bsky.social:</p> 668 - <div id="5ff21d58" class="cell"> 668 + <div id="4f231236" class="cell"> 669 669 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code 
class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient(base_url<span class="op">=</span><span class="st">"https://pds.example.com"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 670 670 </div> 671 671 </section> ··· 673 673 <section id="pdsblobstore" class="level2"> 674 674 <h2 class="anchored" data-anchor-id="pdsblobstore">PDSBlobStore</h2> 675 675 <p>Store dataset shards as ATProto blobs for fully decentralized storage:</p> 676 - <div id="e07fa7d1" class="cell"> 676 + <div id="2a2824bd" class="cell"> 677 677 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, PDSBlobStore</span> 678 678 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 679 679 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 696 696 <section id="size-limits" class="level3"> 697 697 <h3 class="anchored" data-anchor-id="size-limits">Size Limits</h3> 698 698 <p>PDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use <code>maxcount</code> and <code>maxsize</code> parameters to control shard sizes:</p> 699 - <div id="3b40ac87" class="cell"> 699 + <div id="f36e6565" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 701 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 702 702 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> prefix<span class="op">=</span><span class="st">"large-data/v1"</span>,</span> ··· 709 709 <section id="blobsource" class="level2"> 710 710 <h2 class="anchored" data-anchor-id="blobsource">BlobSource</h2> 711 711 <p>Read datasets stored as PDS blobs:</p> 712 - <div id="fabb6a95" class="cell"> 712 + <div id="f533908b" class="cell"> 713 713 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 714 714 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 715 715 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From blob references</span></span> ··· 730 730 <section id="atmosphereindex" class="level2"> 731 731 <h2 class="anchored" data-anchor-id="atmosphereindex">AtmosphereIndex</h2> 732 732 <p>The unified interface for ATProto operations, implementing the AbstractIndex protocol:</p> 733 - <div id="25a6a7ea" class="cell"> 733 + <div id="a20f035d" class="cell"> 734 734 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> 
atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span> 735 735 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 736 736 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 745 745 </div> 746 746 <section id="publishing-schemas" class="level3"> 747 747 <h3 class="anchored" data-anchor-id="publishing-schemas">Publishing Schemas</h3> 748 - <div id="fa209d4e" class="cell"> 748 + <div id="7b342ab7" class="cell"> 749 749 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 750 750 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 751 751 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 766 766 </section> 767 767 <section id="publishing-datasets" class="level3"> 768 768 <h3 class="anchored" data-anchor-id="publishing-datasets">Publishing Datasets</h3> 769 - <div id="2263b2a7" class="cell"> 769 + <div id="b328fbe1" class="cell"> 770 770 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 771 771 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 772 772 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> ··· 784 784 </section> 785 785 <section id="listing-and-retrieving" class="level3"> 786 786 <h3 class="anchored" 
data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 787 - <div id="29a96ee6" class="cell"> 787 + <div id="7aa3400f" class="cell"> 788 788 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># List your datasets</span></span> 789 789 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.list_datasets():</span> 790 790 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 810 810 <p>For more control, use the individual publisher classes:</p> 811 811 <section id="schemapublisher" class="level3"> 812 812 <h3 class="anchored" data-anchor-id="schemapublisher">SchemaPublisher</h3> 813 - <div id="d22262e0" class="cell"> 813 + <div id="5573ad6c" class="cell"> 814 814 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaPublisher</span> 815 815 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span> 816 816 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> SchemaPublisher(client)</span> ··· 826 826 </section> 827 827 <section id="datasetpublisher" class="level3"> 828 828 <h3 class="anchored" data-anchor-id="datasetpublisher">DatasetPublisher</h3> 829 - <div id="cb68b12c" class="cell"> 829 + <div 
id="aa1667c5" class="cell"> 830 830 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetPublisher</span> 831 831 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a></span> 832 832 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> DatasetPublisher(client)</span> ··· 846 846 <p>There are two approaches to storing data as ATProto blobs:</p> 847 847 <p><strong>Approach 1: PDSBlobStore (Recommended)</strong></p> 848 848 <p>Use <code>PDSBlobStore</code> with <code>AtmosphereIndex</code> for automatic shard management:</p> 849 - <div id="c02c14c5" class="cell"> 849 + <div id="b92c0516" class="cell"> 850 850 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> PDSBlobStore, AtmosphereIndex</span> 851 851 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a></span> 852 852 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> ··· 865 865 </div> 866 866 <p><strong>Approach 2: Manual Blob Publishing</strong></p> 867 867 <p>For more control, use <code>DatasetPublisher.publish_with_blobs()</code> directly:</p> 868 - <div id="da360a3c" class="cell"> 868 + <div id="32c6279b" class="cell"> 869 869 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> io</span> 870 870 <span id="cb15-2"><a 
href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 871 871 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a></span> ··· 885 885 <span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 886 886 </div> 887 887 <p><strong>Loading Blob-Stored Datasets</strong></p> 888 - <div id="21624628" class="cell"> 888 + <div id="33b302ee" class="cell"> 889 889 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 890 890 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 891 891 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a></span> ··· 909 909 </section> 910 910 <section id="lenspublisher" class="level3"> 911 911 <h3 class="anchored" data-anchor-id="lenspublisher">LensPublisher</h3> 912 - <div id="dcbf5ff3" class="cell"> 912 + <div id="4e8e0489" class="cell"> 913 913 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> LensPublisher</span> 914 914 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a></span> 915 915 <span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> LensPublisher(client)</span> ··· 952 952 <p>For direct access to records, use the loader classes:</p> 953 953 <section 
id="schemaloader" class="level3"> 954 954 <h3 class="anchored" data-anchor-id="schemaloader">SchemaLoader</h3> 955 - <div id="92c02f3f" class="cell"> 955 + <div id="b6366add" class="cell"> 956 956 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaLoader</span> 957 957 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span> 958 958 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> SchemaLoader(client)</span> ··· 968 968 </section> 969 969 <section id="datasetloader" class="level3"> 970 970 <h3 class="anchored" data-anchor-id="datasetloader">DatasetLoader</h3> 971 - <div id="f6464268" class="cell"> 971 + <div id="a0cdfb2e" class="cell"> 972 972 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 973 973 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span> 974 974 <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> DatasetLoader(client)</span> ··· 996 996 </section> 997 997 <section id="lensloader" class="level3"> 998 998 <h3 class="anchored" data-anchor-id="lensloader">LensLoader</h3> 999 - <div id="76e6eff4" class="cell"> 999 + <div id="011735c6" class="cell"> 1000 1000 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> 
LensLoader</span> 1001 1001 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span> 1002 1002 <span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> LensLoader(client)</span> ··· 1021 1021 <section id="at-uris" class="level2"> 1022 1022 <h2 class="anchored" data-anchor-id="at-uris">AT URIs</h2> 1023 1023 <p>ATProto records are identified by AT URIs:</p> 1024 - <div id="8e669d61" class="cell"> 1024 + <div id="87eca79a" class="cell"> 1025 1025 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtUri</span> 1026 1026 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a></span> 1027 1027 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Parse an AT URI</span></span> ··· 1088 1088 <section id="complete-example" class="level2"> 1089 1089 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 1090 1090 <p>This example shows the full workflow using <code>PDSBlobStore</code> for decentralized storage:</p> 1091 - <div id="31a6584f" class="cell"> 1091 + <div id="ef08c41a" class="cell"> 1092 1092 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 1093 1093 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 1094 1094 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 1159 1159 <span id="cb22-68"><a 
href="#cb22-68" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 1160 1160 </div> 1161 1161 <p>For external URL storage (without <code>PDSBlobStore</code>):</p> 1162 - <div id="8e56d29e" class="cell"> 1162 + <div id="5a95b798" class="cell"> 1163 1163 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Use AtmosphereIndex without data_store</span></span> 1164 1164 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client)</span> 1165 1165 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a></span>
+13 -13
docs/reference/datasets.html
··· 603 603 <p>The <code>Dataset</code> class provides typed iteration over WebDataset tar files with automatic batching and lens transformations.</p> 604 604 <section id="creating-a-dataset" class="level2"> 605 605 <h2 class="anchored" data-anchor-id="creating-a-dataset">Creating a Dataset</h2> 606 - <div id="4eca6e9f" class="cell"> 606 + <div id="dc6c764f" class="cell"> 607 607 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 608 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 609 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 626 626 <section id="url-source-default" class="level3"> 627 627 <h3 class="anchored" data-anchor-id="url-source-default">URL Source (default)</h3> 628 628 <p>When you pass a string to <code>Dataset</code>, it automatically wraps it in a <code>URLSource</code>:</p> 629 - <div id="b7d503f8" class="cell"> 629 + <div id="43d34823" class="cell"> 630 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># These are equivalent:</span></span> 631 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 632 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](atdata.URLSource(<span class="st">"data-{000000..000009}.tar"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i 
class="bi"></i></button></pre></div> ··· 635 635 <section id="s3-source" class="level3"> 636 636 <h3 class="anchored" data-anchor-id="s3-source">S3 Source</h3> 637 637 <p>For private S3 buckets or S3-compatible storage (Cloudflare R2, MinIO), use <code>S3Source</code>:</p> 638 - <div id="50404edf" class="cell"> 638 + <div id="665f191c" class="cell"> 639 639 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># From explicit credentials</span></span> 640 640 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> atdata.S3Source(</span> 641 641 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> bucket<span class="op">=</span><span class="st">"my-bucket"</span>,</span> ··· 673 673 <section id="ordered-iteration" class="level3"> 674 674 <h3 class="anchored" data-anchor-id="ordered-iteration">Ordered Iteration</h3> 675 675 <p>Iterate through samples in their original order:</p> 676 - <div id="76ce4aad" class="cell"> 676 + <div id="f610e2e6" class="cell"> 677 677 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># With batching (default batch_size=1)</span></span> 678 678 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.ordered(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 679 679 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># numpy array (32, H, W, C)</span></span> ··· 687 687 <section id="shuffled-iteration" class="level3"> 688 688 <h3 class="anchored" 
data-anchor-id="shuffled-iteration">Shuffled Iteration</h3> 689 689 <p>Iterate with randomized order at both shard and sample levels:</p> 690 - <div id="4d78e922" class="cell"> 690 + <div id="31fd18b7" class="cell"> 691 691 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 692 692 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled</span></span> 693 693 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> process(batch)</span> ··· 718 718 <section id="samplebatch" class="level2"> 719 719 <h2 class="anchored" data-anchor-id="samplebatch">SampleBatch</h2> 720 720 <p>When iterating with a <code>batch_size</code>, each iteration yields a <code>SampleBatch</code> with automatic attribute aggregation.</p> 721 - <div id="d15cc252" class="cell"> 721 + <div id="a62e95b5" class="cell"> 722 722 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 723 723 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> Sample:</span> 724 724 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># shape (256,)</span></span> ··· 738 738 <section id="type-transformations-with-lenses" class="level2"> 739 739 <h2 class="anchored" data-anchor-id="type-transformations-with-lenses">Type Transformations with Lenses</h2> 740 740 <p>View a dataset through a different sample type using registered lenses:</p> 741 - <div id="5141d751" class="cell"> 741 + 
<div id="18cd1b39" class="cell"> 742 742 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 743 743 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 744 744 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> ··· 760 760 <section id="shard-list" class="level3"> 761 761 <h3 class="anchored" data-anchor-id="shard-list">Shard List</h3> 762 762 <p>Get the list of individual tar files:</p> 763 - <div id="43085fc4" class="cell"> 763 + <div id="06e7f4d9" class="cell"> 764 764 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 765 765 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>shards <span class="op">=</span> dataset.shard_list</span> 766 766 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># ['data-000000.tar', 'data-000001.tar', ..., 'data-000009.tar']</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 769 769 <section id="metadata" class="level3"> 770 770 <h3 class="anchored" data-anchor-id="metadata">Metadata</h3> 771 771 <p>Datasets can have associated metadata from a URL:</p> 772 - <div id="c3ff9553" class="cell"> 772 + <div id="4f4fd9f1" class="cell"> 773 773 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>dataset 
<span class="op">=</span> atdata.Dataset[Sample](</span> 774 774 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"data-{000000..000009}.tar"</span>,</span> 775 775 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> metadata_url<span class="op">=</span><span class="st">"https://example.com/metadata.msgpack"</span></span> ··· 783 783 <section id="writing-datasets" class="level2"> 784 784 <h2 class="anchored" data-anchor-id="writing-datasets">Writing Datasets</h2> 785 785 <p>Use WebDataset’s <code>TarWriter</code> or <code>ShardWriter</code> to create datasets:</p> 786 - <div id="2975692e" class="cell"> 786 + <div id="62cb57ad" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 788 788 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 789 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 806 806 <section id="parquet-export" class="level2"> 807 807 <h2 class="anchored" data-anchor-id="parquet-export">Parquet Export</h2> 808 808 <p>Export dataset contents to parquet format:</p> 809 - <div id="4cd9768b" class="cell"> 809 + <div id="d47c5a12" class="cell"> 810 810 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export entire dataset</span></span> 811 811 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset.to_parquet(<span class="st">"output.parquet"</span>)</span> 812 812 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" 
tabindex="-1"></a></span> ··· 857 857 <section id="source" class="level3"> 858 858 <h3 class="anchored" data-anchor-id="source">Source</h3> 859 859 <p>Access the underlying <code>DataSource</code>:</p> 860 - <div id="9d3ef77e" class="cell"> 860 + <div id="e4036932" class="cell"> 861 861 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data.tar"</span>)</span> 862 862 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> dataset.source <span class="co"># URLSource instance</span></span> 863 863 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(source.shard_list) <span class="co"># ['data.tar']</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 866 866 <section id="sample-type" class="level3"> 867 867 <h3 class="anchored" data-anchor-id="sample-type">Sample Type</h3> 868 868 <p>Get the type parameter used to create the dataset:</p> 869 - <div id="476a436b" class="cell"> 869 + <div id="3718300e" class="cell"> 870 870 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data.tar"</span>)</span> 871 871 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.sample_type) <span class="co"># &lt;class 'ImageSample'&gt;</span></span> 872 872 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.batch_type) <span class="co"># 
SampleBatch[ImageSample]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+10 -10
docs/reference/lenses.html
··· 595 595 <section id="creating-a-lens" class="level2"> 596 596 <h2 class="anchored" data-anchor-id="creating-a-lens">Creating a Lens</h2> 597 597 <p>Use the <code>@lens</code> decorator to define a getter:</p> 598 - <div id="f050b984" class="cell"> 598 + <div id="016cb2ac" class="cell"> 599 599 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 600 600 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 601 601 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 625 625 <section id="adding-a-putter" class="level2"> 626 626 <h2 class="anchored" data-anchor-id="adding-a-putter">Adding a Putter</h2> 627 627 <p>To enable bidirectional updates, add a putter:</p> 628 - <div id="aa1d9fcc" class="cell"> 628 + <div id="e207472f" class="cell"> 629 629 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@simplify.putter</span></span> 630 630 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify_put(view: SimpleSample, source: FullSample) <span class="op">-&gt;</span> FullSample:</span> 631 631 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> FullSample(</span> ··· 645 645 <section id="using-lenses-with-datasets" class="level2"> 646 646 <h2 class="anchored" data-anchor-id="using-lenses-with-datasets">Using Lenses with Datasets</h2> 647 647 <p>Lenses integrate with <code>Dataset.as_type()</code>:</p> 648 - <div id="4416cc21" class="cell"> 648 + <div id="744a9d2d" class="cell"> 649 649 <div 
class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[FullSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 650 650 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 651 651 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="co"># View through a different type</span></span> ··· 660 660 <section id="direct-lens-usage" class="level2"> 661 661 <h2 class="anchored" data-anchor-id="direct-lens-usage">Direct Lens Usage</h2> 662 662 <p>Lenses can also be called directly:</p> 663 - <div id="b709ace4" class="cell"> 663 + <div id="395185d3" class="cell"> 664 664 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 665 665 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 666 666 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>full <span class="op">=</span> FullSample(</span> ··· 689 689 <div class="tab-content"> 690 690 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 691 691 <p>If you get a view and immediately put it back, the source is unchanged:</p> 692 - <div id="8159732e" class="cell"> 692 + <div id="436b7dc7" class="cell"> 693 693 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>view <span class="op">=</span> lens.get(source)</span> 694 694 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.put(view, source) 
<span class="op">==</span> source</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 695 695 </div> 696 696 </div> 697 697 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 698 698 <p>If you put a view, getting it back yields that view:</p> 699 - <div id="7b0ac2e8" class="cell"> 699 + <div id="4c6fcf1c" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>updated <span class="op">=</span> lens.put(view, source)</span> 701 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.get(updated) <span class="op">==</span> view</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 702 702 </div> 703 703 </div> 704 704 <div id="tabset-1-3" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-3-tab"> 705 705 <p>Putting twice is equivalent to putting once with the final value:</p> 706 - <div id="c9f3015d" class="cell"> 706 + <div id="db0bfa78" class="cell"> 707 707 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>result1 <span class="op">=</span> lens.put(v2, lens.put(v1, source))</span> 708 708 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>result2 <span class="op">=</span> lens.put(v2, source)</span> 709 709 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> result1 <span class="op">==</span> result2</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 715 715 <section id="trivial-putter" class="level2"> 716 716 <h2 
class="anchored" data-anchor-id="trivial-putter">Trivial Putter</h2> 717 717 <p>If no putter is defined, a trivial putter is used that ignores view updates:</p> 718 - <div id="3f274fda" class="cell"> 718 + <div id="d326f43d" class="cell"> 719 719 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 720 720 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> extract_label(src: FullSample) <span class="op">-&gt;</span> SimpleSample:</span> 721 721 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimpleSample(label<span class="op">=</span>src.label, confidence<span class="op">=</span>src.confidence)</span> ··· 729 729 <section id="lensnetwork-registry" class="level2"> 730 730 <h2 class="anchored" data-anchor-id="lensnetwork-registry">LensNetwork Registry</h2> 731 731 <p>The <code>LensNetwork</code> is a singleton that stores all registered lenses:</p> 732 - <div id="2d560eec" class="cell"> 732 + <div id="a3896ed6" class="cell"> 733 733 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.lens <span class="im">import</span> LensNetwork</span> 734 734 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 735 735 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>network <span class="op">=</span> LensNetwork()</span> ··· 746 746 </section> 747 747 <section id="example-feature-extraction" class="level2"> 748 748 <h2 class="anchored" data-anchor-id="example-feature-extraction">Example: Feature Extraction</h2> 749 - <div id="f641a209" class="cell"> 749 + <div id="bbd444b8" 
class="cell"> 750 750 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 751 751 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> RawSample:</span> 752 752 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> audio: NDArray</span>
+12 -12
docs/reference/load-dataset.html
··· 604 604 </section> 605 605 <section id="basic-usage" class="level2"> 606 606 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 607 - <div id="7ed2b10d" class="cell"> 607 + <div id="753f1d4b" class="cell"> 608 608 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 609 609 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 610 610 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> ··· 627 627 <h2 class="anchored" data-anchor-id="path-formats">Path Formats</h2> 628 628 <section id="webdataset-brace-notation" class="level3"> 629 629 <h3 class="anchored" data-anchor-id="webdataset-brace-notation">WebDataset Brace Notation</h3> 630 - <div id="f28a7507" class="cell"> 630 + <div id="1f4e9142" class="cell"> 631 631 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Range notation</span></span> 632 632 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"data-{000000..000099}.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 633 633 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 637 637 </section> 638 638 <section id="glob-patterns" class="level3"> 639 639 <h3 class="anchored" data-anchor-id="glob-patterns">Glob Patterns</h3> 640 - <div id="c187d3d7" class="cell"> 640 + <div id="9fd0a092" class="cell"> 641 641 <div class="sourceCode cell-code" 
id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Match all tar files</span></span> 642 642 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"path/to/*.tar"</span>, MySample)</span> 643 643 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 647 647 </section> 648 648 <section id="local-directory" class="level3"> 649 649 <h3 class="anchored" data-anchor-id="local-directory">Local Directory</h3> 650 - <div id="dcd8bdec" class="cell"> 650 + <div id="8ddee3fc" class="cell"> 651 651 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Scans for .tar files</span></span> 652 652 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"./my-dataset/"</span>, MySample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 653 653 </div> 654 654 </section> 655 655 <section id="remote-urls" class="level3"> 656 656 <h3 class="anchored" data-anchor-id="remote-urls">Remote URLs</h3> 657 - <div id="b3a20847" class="cell"> 657 + <div id="6a120d61" class="cell"> 658 658 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># S3 (public buckets)</span></span> 659 659 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"s3://bucket/data-{000..099}.tar"</span>, MySample, split<span class="op">=</span><span 
class="st">"train"</span>)</span> 660 660 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 680 680 </section> 681 681 <section id="index-lookup" class="level3"> 682 682 <h3 class="anchored" data-anchor-id="index-lookup">Index Lookup</h3> 683 - <div id="079ac960" class="cell"> 683 + <div id="c4c8a01b" class="cell"> 684 684 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 685 685 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 686 686 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 747 747 <section id="datasetdict" class="level2"> 748 748 <h2 class="anchored" data-anchor-id="datasetdict">DatasetDict</h2> 749 749 <p>When loading without <code>split=</code>, returns a <code>DatasetDict</code>:</p> 750 - <div id="59cd381c" class="cell"> 750 + <div id="eed41d5d" class="cell"> 751 751 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data/"</span>, MySample)</span> 752 752 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 753 753 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Access splits</span></span> ··· 767 767 <section id="explicit-data-files" class="level2"> 768 768 <h2 class="anchored" data-anchor-id="explicit-data-files">Explicit Data Files</h2> 769 769 <p>Override automatic detection with <code>data_files</code>:</p> 770 - <div id="82ab3caf" class="cell"> 770 + <div id="a87676d3" class="cell"> 771 771 <div 
class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Single pattern</span></span> 772 772 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(</span> 773 773 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"path/to/"</span>,</span> ··· 796 796 <section id="streaming-mode" class="level2"> 797 797 <h2 class="anchored" data-anchor-id="streaming-mode">Streaming Mode</h2> 798 798 <p>The <code>streaming</code> parameter signals intent for streaming mode:</p> 799 - <div id="ac9975cc" class="cell"> 799 + <div id="d0ca2f60" class="cell"> 800 800 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Mark as streaming</span></span> 801 801 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, streaming<span class="op">=</span><span class="va">True</span>)</span> 802 802 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 821 821 <section id="auto-type-resolution" class="level2"> 822 822 <h2 class="anchored" data-anchor-id="auto-type-resolution">Auto Type Resolution</h2> 823 823 <p>When using index lookup, the sample type can be resolved automatically:</p> 824 - <div id="a7679439" class="cell"> 824 + <div id="eff9b3ae" class="cell"> 825 825 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 826 
826 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 827 827 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 835 835 </section> 836 836 <section id="error-handling" class="level2"> 837 837 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 838 - <div id="c1b9e6e6" class="cell"> 838 + <div id="70c2f7f5" class="cell"> 839 839 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 840 840 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> ds <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 841 841 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">FileNotFoundError</span>:</span> ··· 851 851 </section> 852 852 <section id="complete-example" class="level2"> 853 853 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 854 - <div id="5e83c9a3" class="cell"> 854 + <div id="076854da" class="cell"> 855 855 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 856 856 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 857 857 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
+11 -11
docs/reference/local-storage.html
··· 603 603 <section id="localindex" class="level2"> 604 604 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 605 605 <p>The index tracks datasets in Redis:</p> 606 - <div id="da288cbb" class="cell"> 606 + <div id="014371d7" class="cell"> 607 607 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 608 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 609 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Default connection (localhost:6379)</span></span> ··· 619 619 </div> 620 620 <section id="adding-entries" class="level3"> 621 621 <h3 class="anchored" data-anchor-id="adding-entries">Adding Entries</h3> 622 - <div id="084fdecb" class="cell"> 622 + <div id="258ff92b" class="cell"> 623 623 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 624 624 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 625 625 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 644 644 </section> 645 645 <section id="listing-and-retrieving" class="level3"> 646 646 <h3 class="anchored" data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 647 - <div id="2279c444" class="cell"> 647 + <div id="ecd210a1" class="cell"> 648 648 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate 
all entries</span></span> 649 649 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.entries:</span> 650 650 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>cid<span class="sc">}</span><span class="ss">"</span>)</span> ··· 676 676 </div> 677 677 </div> 678 678 <p>The Repo class combines S3 storage with Redis indexing:</p> 679 - <div id="e3f423c3" class="cell"> 679 + <div id="752fc198" class="cell"> 680 680 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Repo</span> 681 681 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 682 682 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From credentials file</span></span> ··· 696 696 <span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 697 697 </div> 698 698 <p><strong>Preferred approach</strong> - Use <code>LocalIndex</code> with <code>S3DataStore</code>:</p> 699 - <div id="6ed3c7e6" class="cell"> 699 + <div id="fc784c62" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 701 701 <span id="cb5-2"><a href="#cb5-2" 
aria-hidden="true" tabindex="-1"></a></span> 702 702 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 734 734 </section> 735 735 <section id="inserting-datasets" class="level3"> 736 736 <h3 class="anchored" data-anchor-id="inserting-datasets">Inserting Datasets</h3> 737 - <div id="2de48d7d" class="cell"> 737 + <div id="bb27e7db" class="cell"> 738 738 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 739 739 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 740 740 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 764 764 </section> 765 765 <section id="insert-options" class="level3"> 766 766 <h3 class="anchored" data-anchor-id="insert-options">Insert Options</h3> 767 - <div id="c60ddeb7" class="cell"> 767 + <div id="269b9b72" class="cell"> 768 768 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>entry, ds <span class="op">=</span> repo.insert(</span> 769 769 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 770 770 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 778 778 <section id="localdatasetentry" class="level2"> 779 779 <h2 class="anchored" data-anchor-id="localdatasetentry">LocalDatasetEntry</h2> 780 780 <p>Index entries provide content-addressable identification:</p> 781 - <div id="e6593f23" class="cell"> 781 + <div id="125f1ec6" class="cell"> 782 782 <div 
class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.get_entry_by_name(<span class="st">"my-dataset"</span>)</span> 783 783 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 784 784 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Core properties (IndexEntry protocol)</span></span> ··· 811 811 <section id="schema-storage" class="level2"> 812 812 <h2 class="anchored" data-anchor-id="schema-storage">Schema Storage</h2> 813 813 <p>Schemas can be stored and retrieved from the index:</p> 814 - <div id="0b531944" class="cell"> 814 + <div id="f5b4209d" class="cell"> 815 815 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 816 816 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 817 817 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 842 842 <section id="s3datastore" class="level2"> 843 843 <h2 class="anchored" data-anchor-id="s3datastore">S3DataStore</h2> 844 844 <p>For direct S3 operations without Redis indexing:</p> 845 - <div id="33f50fcb" class="cell"> 845 + <div id="ead4d102" class="cell"> 846 846 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 847 847 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 848 848 <span id="cb11-3"><a 
href="#cb11-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 864 864 </section> 865 865 <section id="complete-workflow-example" class="level2"> 866 866 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 867 - <div id="846813f0" class="cell"> 867 + <div id="96fe8292" class="cell"> 868 868 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 869 869 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 870 870 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
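The LocalDatasetEntry section of `local-storage.html` says index entries provide content-addressable identification. As a rough illustration of that idea only, the sketch below derives an identifier from an entry's content, so equal content always yields the same identifier. The `content_id` helper, the record fields, and the use of SHA-256 over canonical JSON are assumptions made for this example; they are not the CID format that `atdata` itself produces.

```python
import hashlib
import json


def content_id(record: dict) -> str:
    """Hypothetical helper: hash a record's canonical JSON form."""
    # Sorted keys and fixed separators make equal records serialize identically,
    # regardless of the order in which their keys were written.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order: the identifiers match.
a = content_id({"name": "my-dataset", "data_urls": ["data-000000.tar"]})
b = content_id({"data_urls": ["data-000000.tar"], "name": "my-dataset"})

# Different content: the identifier changes.
c = content_id({"name": "my-dataset", "data_urls": ["data-000001.tar"]})
```

Because the identifier depends only on the content, two indexes that hold the same entry can agree on its identity without coordinating.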
+12 -12
docs/reference/packable-samples.html
··· 598 598 <section id="the-packable-decorator" class="level2"> 599 599 <h2 class="anchored" data-anchor-id="the-packable-decorator">The <code>@packable</code> Decorator</h2> 600 600 <p>The recommended way to define a sample type is with the <code>@packable</code> decorator:</p> 601 - <div id="c5cea75b" class="cell"> 601 + <div id="a94874e3" class="cell"> 602 602 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 603 603 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 604 604 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 620 620 <h2 class="anchored" data-anchor-id="supported-field-types">Supported Field Types</h2> 621 621 <section id="primitives" class="level3"> 622 622 <h3 class="anchored" data-anchor-id="primitives">Primitives</h3> 623 - <div id="c8b5a276" class="cell"> 623 + <div id="550b2a4e" class="cell"> 624 624 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 625 625 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PrimitiveSample:</span> 626 626 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> name: <span class="bu">str</span></span> ··· 633 633 <section id="numpy-arrays" class="level3"> 634 634 <h3 class="anchored" data-anchor-id="numpy-arrays">NumPy Arrays</h3> 635 635 <p>Fields annotated as <code>NDArray</code> are automatically converted:</p> 636 - <div id="59e8610f" class="cell"> 636 + <div 
id="029f2fed" class="cell"> 637 637 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 638 638 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ArraySample:</span> 639 639 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Required array</span></span> ··· 655 655 </section> 656 656 <section id="lists" class="level3"> 657 657 <h3 class="anchored" data-anchor-id="lists">Lists</h3> 658 - <div id="e0a943cf" class="cell"> 658 + <div id="c70e0022" class="cell"> 659 659 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 660 660 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ListSample:</span> 661 661 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> tags: <span class="bu">list</span>[<span class="bu">str</span>]</span> ··· 667 667 <h2 class="anchored" data-anchor-id="serialization">Serialization</h2> 668 668 <section id="packing-to-bytes" class="level3"> 669 669 <h3 class="anchored" data-anchor-id="packing-to-bytes">Packing to Bytes</h3> 670 - <div id="15e89907" class="cell"> 670 + <div id="2a44172f" class="cell"> 671 671 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 672 672 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span 
class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 673 673 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span> ··· 681 681 </section> 682 682 <section id="unpacking-from-bytes" class="level3"> 683 683 <h3 class="anchored" data-anchor-id="unpacking-from-bytes">Unpacking from Bytes</h3> 684 - <div id="abf5f1f1" class="cell"> 684 + <div id="30d8b649" class="cell"> 685 685 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Deserialize from bytes</span></span> 686 686 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>restored <span class="op">=</span> ImageSample.from_bytes(packed_bytes)</span> 687 687 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 693 693 <section id="webdataset-format" class="level3"> 694 694 <h3 class="anchored" data-anchor-id="webdataset-format">WebDataset Format</h3> 695 695 <p>The <code>as_wds</code> property returns a dict ready for WebDataset:</p> 696 - <div id="e29904d9" class="cell"> 696 + <div id="71fdefd9" class="cell"> 697 697 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>wds_dict <span class="op">=</span> sample.as_wds</span> 698 698 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="co"># {'__key__': '1234...', 'msgpack': b'...'}</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 699 699 </div> 700 700 <p>Write samples to a tar file:</p> 701 - <div id="9a0b0a14" class="cell"> 701 + <div id="0d06403b" class="cell"> 702 702 <div class="sourceCode cell-code" id="cb8"><pre 
class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 703 703 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 704 704 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data-000000.tar"</span>) <span class="im">as</span> sink:</span> ··· 711 711 <section id="direct-inheritance-alternative" class="level2"> 712 712 <h2 class="anchored" data-anchor-id="direct-inheritance-alternative">Direct Inheritance (Alternative)</h2> 713 713 <p>You can also inherit directly from <code>PackableSample</code>:</p> 714 - <div id="8a5579f0" class="cell"> 714 + <div id="2cbfbd91" class="cell"> 715 715 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span class="im">import</span> dataclass</span> 716 716 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 717 717 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="at">@dataclass</span></span> ··· 749 749 <section id="the-_ensure_good-method" class="level3"> 750 750 <h3 class="anchored" data-anchor-id="the-_ensure_good-method">The <code>_ensure_good()</code> Method</h3> 751 751 <p>This method runs automatically after construction and handles NDArray conversion:</p> 752 - <div id="3a74bab4" class="cell"> 752 + <div id="3e22df34" class="cell"> 753 753 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> _ensure_good(<span 
class="va">self</span>):</span> 754 754 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> field <span class="kw">in</span> dataclasses.fields(<span class="va">self</span>):</span> 755 755 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> _is_possibly_ndarray_type(field.<span class="bu">type</span>):</span> ··· 765 765 <ul class="nav nav-tabs" role="tablist"><li class="nav-item" role="presentation"><a class="nav-link active" id="tabset-2-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-1" role="tab" aria-controls="tabset-2-1" aria-selected="true">Do</a></li><li class="nav-item" role="presentation"><a class="nav-link" id="tabset-2-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-2" role="tab" aria-controls="tabset-2-2" aria-selected="false">Don’t</a></li></ul> 766 766 <div class="tab-content"> 767 767 <div id="tabset-2-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-2-1-tab"> 768 - <div id="5efd4f23" class="cell"> 768 + <div id="5708dd52" class="cell"> 769 769 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 770 770 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> GoodSample:</span> 771 771 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Clear type annotation</span></span> ··· 775 775 </div> 776 776 </div> 777 777 <div id="tabset-2-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-2-2-tab"> 778 - <div id="dc1f3f7b" class="cell"> 778 + <div id="14b1ada0" class="cell"> 779 779 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a 
href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 780 780 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> BadSample:</span> 781 781 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># DON'T: Nested dataclasses not supported</span></span>
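The `_ensure_good()` section of `packable-samples.html` describes a method that runs automatically after construction, loops over `dataclasses.fields(self)`, and converts array-typed fields. The sketch below shows the same pattern with the standard library only, using `__post_init__` as the hook and `array.array` as a stand-in for `NDArray`. The class name `PostInitSample` and the bytes-to-array coercion rule are assumptions for illustration, not the library's actual implementation.

```python
import dataclasses
from array import array
from dataclasses import dataclass


@dataclass
class PostInitSample:
    # Hypothetical stand-in for an NDArray field: an array of doubles.
    features: array
    label: str

    def __post_init__(self):
        # Dataclasses call __post_init__ right after the generated __init__.
        for f in dataclasses.fields(self):
            value = getattr(self, f.name)
            # Coerce raw bytes into the annotated array type; leave other fields alone.
            if f.type is array and isinstance(value, (bytes, bytearray)):
                decoded = array("d")
                decoded.frombytes(bytes(value))
                setattr(self, f.name, decoded)


# Construct from serialized bytes; the field is converted on construction.
raw = array("d", [1.0, 2.5]).tobytes()
sample = PostInitSample(features=raw, label="cat")
```

Driving the conversion from the type annotation is what makes a clear annotation matter: a field without one would be skipped by a loop like this.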
+7 -7
docs/reference/promotion.html
··· 594 594 </section> 595 595 <section id="basic-usage" class="level2"> 596 596 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 597 - <div id="ae4e3261" class="cell"> 597 + <div id="b23db877" class="cell"> 598 598 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 599 599 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 600 600 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span> ··· 614 614 </section> 615 615 <section id="with-metadata" class="level2"> 616 616 <h2 class="anchored" data-anchor-id="with-metadata">With Metadata</h2> 617 - <div id="cb98a72f" class="cell"> 617 + <div id="2b2c6783" class="cell"> 618 618 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 619 619 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> entry,</span> 620 620 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 629 629 <section id="schema-deduplication" class="level2"> 630 630 <h2 class="anchored" data-anchor-id="schema-deduplication">Schema Deduplication</h2> 631 631 <p>The promotion workflow automatically checks for existing schemas:</p> 632 - <div id="eb4c0866" class="cell"> 632 + <div id="59a8d371" class="cell"> 633 633 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code 
class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 634 634 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 635 635 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 649 649 <div class="tab-content"> 650 650 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 651 651 <p>By default, promotion keeps the original data URLs:</p> 652 - <div id="324fbcb3" class="cell"> 652 + <div id="4808223f" class="cell"> 653 653 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 654 654 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 655 655 </div> ··· 662 662 </div> 663 663 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 664 664 <p>To copy data to a different storage location:</p> 665 - <div id="96e6b253" class="cell"> 665 + <div id="e6c590b5" class="cell"> 666 666 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 667 667 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 668 668 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create 
new data store</span></span> ··· 690 690 </section> 691 691 <section id="complete-workflow-example" class="level2"> 692 692 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 693 - <div id="761de7eb" class="cell"> 693 + <div id="01c84d05" class="cell"> 694 694 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 695 695 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 696 696 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 761 761 </section> 762 762 <section id="error-handling" class="level2"> 763 763 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 764 - <div id="62ff3365" class="cell"> 764 + <div id="c3957a63" class="cell"> 765 765 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 766 766 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span> 767 767 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span>
+12 -12
docs/reference/protocols.html
··· 615 615 <section id="indexentry-protocol" class="level2"> 616 616 <h2 class="anchored" data-anchor-id="indexentry-protocol">IndexEntry Protocol</h2> 617 617 <p>Represents a dataset entry in any index:</p> 618 - <div id="2316ad53" class="cell"> 618 + <div id="c302a855" class="cell"> 619 619 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry</span> 620 620 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 621 621 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_entry(entry: IndexEntry) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 669 669 <section id="abstractindex-protocol" class="level2"> 670 670 <h2 class="anchored" data-anchor-id="abstractindex-protocol">AbstractIndex Protocol</h2> 671 671 <p>Defines operations for managing schemas and datasets:</p> 672 - <div id="cbfe79b2" class="cell"> 672 + <div id="07a31265" class="cell"> 673 673 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex</span> 674 674 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 675 675 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> list_all_datasets(index: AbstractIndex) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 679 679 </div> 680 680 <section id="dataset-operations" class="level3"> 681 681 <h3 class="anchored" data-anchor-id="dataset-operations">Dataset Operations</h3> 682 - <div id="fe390f84" class="cell"> 682 + 
<div id="7678f752" class="cell"> 683 683 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert a dataset</span></span> 684 684 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> 685 685 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 697 697 </section> 698 698 <section id="schema-operations" class="level3"> 699 699 <h3 class="anchored" data-anchor-id="schema-operations">Schema Operations</h3> 700 - <div id="65361058" class="cell"> 700 + <div id="70f62633" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 702 702 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 703 703 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> MySample,</span> ··· 728 728 <section id="abstractdatastore-protocol" class="level2"> 729 729 <h2 class="anchored" data-anchor-id="abstractdatastore-protocol">AbstractDataStore Protocol</h2> 730 730 <p>Abstracts over different storage backends:</p> 731 - <div id="ce04ec55" class="cell"> 731 + <div id="52e50320" class="cell"> 732 732 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractDataStore</span> 733 733 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 734 734 <span id="cb5-3"><a href="#cb5-3" 
aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_dataset(store: AbstractDataStore, dataset) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]:</span> ··· 738 738 </div> 739 739 <section id="methods" class="level3"> 740 740 <h3 class="anchored" data-anchor-id="methods">Methods</h3> 741 - <div id="2079a29f" class="cell"> 741 + <div id="317eac4c" class="cell"> 742 742 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Write dataset shards</span></span> 743 743 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 744 744 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 765 765 <section id="datasource-protocol" class="level2"> 766 766 <h2 class="anchored" data-anchor-id="datasource-protocol">DataSource Protocol</h2> 767 767 <p>Abstracts over different data source backends for streaming dataset shards:</p> 768 - <div id="c4f4cfb3" class="cell"> 768 + <div id="8a53d5a4" class="cell"> 769 769 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 770 770 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 771 771 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> load_from_source(source: DataSource) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 778 778 </div> 779 779 <section id="methods-1" class="level3"> 780 780 <h3 class="anchored" data-anchor-id="methods-1">Methods</h3> 781 - <div id="aa0ab84d" class="cell"> 
781 + <div id="f867cfdc" class="cell"> 782 782 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Get list of shard identifiers</span></span> 783 783 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>shard_ids <span class="op">=</span> source.shard_list <span class="co"># ['data-000000.tar', 'data-000001.tar', ...]</span></span> 784 784 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a></span> ··· 801 801 <section id="creating-custom-data-sources" class="level3"> 802 802 <h3 class="anchored" data-anchor-id="creating-custom-data-sources">Creating Custom Data Sources</h3> 803 803 <p>Implement the <code>DataSource</code> protocol for custom backends:</p> 804 - <div id="8405d1b3" class="cell"> 804 + <div id="44575d54" class="cell"> 805 805 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> typing <span class="im">import</span> Iterator, IO</span> 806 806 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 807 807 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 839 839 <section id="using-protocols-for-polymorphism" class="level2"> 840 840 <h2 class="anchored" data-anchor-id="using-protocols-for-polymorphism">Using Protocols for Polymorphism</h2> 841 841 <p>Write code that works with any backend:</p> 842 - <div id="efc52af8" class="cell"> 842 + <div id="db1294c4" class="cell"> 843 843 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" 
tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex, IndexEntry</span> 844 844 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> Dataset</span> 845 845 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 910 910 <section id="type-checking" class="level2"> 911 911 <h2 class="anchored" data-anchor-id="type-checking">Type Checking</h2> 912 912 <p>Protocols are runtime-checkable:</p> 913 - <div id="e720c8ac" class="cell"> 913 + <div id="0bb97720" class="cell"> 914 914 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry, AbstractIndex</span> 915 915 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 916 916 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if object implements protocol</span></span> ··· 924 924 </section> 925 925 <section id="complete-example" class="level2"> 926 926 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 927 - <div id="67ba0947" class="cell"> 927 + <div id="e296723f" class="cell"> 928 928 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 929 929 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 930 930 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere 
<span class="im">import</span> AtmosphereClient, AtmosphereIndex</span>
+2 -2
docs/reference/uri-spec.html
··· 685 685 <h2 class="anchored" data-anchor-id="examples">Examples</h2> 686 686 <section id="local-development" class="level3"> 687 687 <h3 class="anchored" data-anchor-id="local-development">Local Development</h3> 688 - <div id="c582b03f" class="cell"> 688 + <div id="09088cac" class="cell"> 689 689 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Index</span> 690 690 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 691 691 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> Index()</span> ··· 704 704 </section> 705 705 <section id="atmosphere-atproto-federation" class="level3"> 706 706 <h3 class="anchored" data-anchor-id="atmosphere-atproto-federation">Atmosphere (ATProto Federation)</h3> 707 - <div id="33a39e75" class="cell"> 707 + <div id="1250c10e" class="cell"> 708 708 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> Client</span> 709 709 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 710 710 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> Client()</span>
+2 -2
docs/sitemap.xml
··· 66 66 </url> 67 67 <url> 68 68 <loc>https://github.com/your-org/atdata/api/index.html</loc> 69 - <lastmod>2026-01-27T06:24:20.044Z</lastmod> 69 + <lastmod>2026-01-27T06:39:59.502Z</lastmod> 70 70 </url> 71 71 <url> 72 72 <loc>https://github.com/your-org/atdata/api/IndexEntry.html</loc> ··· 130 130 </url> 131 131 <url> 132 132 <loc>https://github.com/your-org/atdata/api/Lens.html</loc> 133 - <lastmod>2026-01-27T06:24:20.108Z</lastmod> 133 + <lastmod>2026-01-27T06:39:59.563Z</lastmod> 134 134 </url> 135 135 <url> 136 136 <loc>https://github.com/your-org/atdata/api/DatasetLoader.html</loc>
+14 -14
docs/tutorials/atmosphere.html
··· 658 658 </section> 659 659 <section id="setup" class="level2"> 660 660 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 661 - <div id="d57cc9af" class="cell"> 661 + <div id="acf890d8" class="cell"> 662 662 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 663 663 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 664 664 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 678 678 </section> 679 679 <section id="define-sample-types" class="level2"> 680 680 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 681 - <div id="15a5c9ef" class="cell"> 681 + <div id="ba343730" class="cell"> 682 682 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 683 683 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 684 684 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing image data with metadata."""</span></span> ··· 697 697 <section id="type-introspection" class="level2"> 698 698 <h2 class="anchored" data-anchor-id="type-introspection">Type Introspection</h2> 699 699 <p>See what information is available from a PackableSample type:</p> 700 - <div id="d5d68ff6" class="cell"> 700 + <div id="d3147fcc" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode 
python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span class="im">import</span> fields, is_dataclass</span> 702 702 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 703 703 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Sample type: </span><span class="sc">{</span>ImageSample<span class="sc">.</span><span class="va">__name__</span><span class="sc">}</span><span class="ss">"</span>)</span> ··· 732 732 </ul> 733 733 <p>Understanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.</p> 734 734 <p>ATProto records are identified by AT URIs:</p> 735 - <div id="e2fe328c" class="cell"> 735 + <div id="2e4e2375" class="cell"> 736 736 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>uris <span class="op">=</span> [</span> 737 737 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789"</span>,</span> 738 738 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset"</span>,</span> ··· 750 750 <h2 class="anchored" data-anchor-id="authentication">Authentication</h2> 751 751 <p>The <code>AtmosphereClient</code> handles ATProto authentication. 
When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).</p> 752 752 <p>Connect to ATProto:</p> 753 - <div id="c05523d2" class="cell"> 753 + <div id="5e1f8209" class="cell"> 754 754 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 755 755 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"your.handle.social"</span>, <span class="st">"your-app-password"</span>)</span> 756 756 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 761 761 <section id="publish-a-schema" class="level2"> 762 762 <h2 class="anchored" data-anchor-id="publish-a-schema">Publish a Schema</h2> 763 763 <p>When you publish a schema to ATProto, it becomes a <strong>public, immutable record</strong> that others can reference. 
The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.</p> 764 - <div id="db404d7a" class="cell"> 764 + <div id="9fd4595f" class="cell"> 765 765 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>schema_publisher <span class="op">=</span> SchemaPublisher(client)</span> 766 766 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_uri <span class="op">=</span> schema_publisher.publish(</span> 767 767 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 774 774 </section> 775 775 <section id="list-your-schemas" class="level2"> 776 776 <h2 class="anchored" data-anchor-id="list-your-schemas">List Your Schemas</h2> 777 - <div id="c351a57b" class="cell"> 777 + <div id="d7b31f8e" class="cell"> 778 778 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>schema_loader <span class="op">=</span> SchemaLoader(client)</span> 779 779 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>schemas <span class="op">=</span> schema_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 780 780 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(schemas)<span class="sc">}</span><span class="ss"> schema(s)"</span>)</span> ··· 787 787 <h2 class="anchored" data-anchor-id="publish-a-dataset">Publish a Dataset</h2> 788 788 <section id="with-external-urls" class="level3"> 789 789 <h3 class="anchored" data-anchor-id="with-external-urls">With External URLs</h3> 790 - <div id="df8318cd" class="cell"> 790 + <div id="38a5ea19" 
class="cell"> 791 791 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 792 792 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span class="op">=</span> dataset_publisher.publish_with_urls(</span> 793 793 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 809 809 <li><strong>Federated replication</strong>: Relays can mirror your blobs for availability</li> 810 810 </ul> 811 811 <p>For fully decentralized storage, use <code>PDSBlobStore</code> to store dataset shards directly as ATProto blobs in your PDS:</p> 812 - <div id="6708e8dc" class="cell"> 812 + <div id="674acaa3" class="cell"> 813 813 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create store and index with blob storage</span></span> 814 814 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> 815 815 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span> ··· 853 853 </div> 854 854 <div class="callout-body-container callout-body"> 855 855 <p>Use <code>BlobSource</code> to stream directly from PDS blobs:</p> 856 - <div id="824c87cf" class="cell"> 856 + <div id="0f2fcbb6" class="cell"> 857 857 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" 
aria-hidden="true" tabindex="-1"></a><span class="co"># Create source from the blob URLs</span></span> 858 858 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> store.create_source(entry.data_urls)</span> 859 859 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 874 874 <h3 class="anchored" data-anchor-id="with-external-urls-1">With External URLs</h3> 875 875 <p>For larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. The ATProto record serves as the <strong>index entry</strong> while the actual data lives elsewhere.</p> 876 876 <p>For larger datasets or when using existing object storage:</p> 877 - <div id="65d324f2" class="cell"> 877 + <div id="8111fff9" class="cell"> 878 878 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 879 879 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span class="op">=</span> dataset_publisher.publish_with_urls(</span> 880 880 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 890 890 </section> 891 891 <section id="list-and-load-datasets" class="level2"> 892 892 <h2 class="anchored" data-anchor-id="list-and-load-datasets">List and Load Datasets</h2> 893 - <div id="39c2452a" class="cell"> 893 + <div id="ef2e681c" class="cell"> 894 894 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset_loader <span 
class="op">=</span> DatasetLoader(client)</span> 895 895 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>datasets <span class="op">=</span> dataset_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 896 896 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(datasets)<span class="sc">}</span><span class="ss"> dataset(s)"</span>)</span> ··· 905 905 </section> 906 906 <section id="load-a-dataset" class="level2"> 907 907 <h2 class="anchored" data-anchor-id="load-a-dataset">Load a Dataset</h2> 908 - <div id="5adb2946" class="cell"> 908 + <div id="a0076f9e" class="cell"> 909 909 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Check storage type</span></span> 910 910 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>storage_type <span class="op">=</span> dataset_loader.get_storage_type(<span class="bu">str</span>(blob_dataset_uri))</span> 911 911 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Storage type: </span><span class="sc">{</span>storage_type<span class="sc">}</span><span class="ss">"</span>)</span> ··· 933 933 </ol> 934 934 <p>Notice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.</p> 935 935 <p>This example shows the recommended workflow using <code>PDSBlobStore</code> for fully decentralized storage:</p> 936 - <div id="60e0b22d" class="cell"> 936 + <div id="92130535" class="cell"> 937 937 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" 
aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Define and create samples</span></span> 938 938 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 939 939 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> FeatureSample:</span>
+8 -8
docs/tutorials/local-workflow.html
··· 644 644 </section> 645 645 <section id="setup" class="level2"> 646 646 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 647 - <div id="abef9b71" class="cell"> 647 + <div id="366a6743" class="cell"> 648 648 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 649 649 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 650 650 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 654 654 </section> 655 655 <section id="define-sample-types" class="level2"> 656 656 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 657 - <div id="eb4a25ab" class="cell"> 657 + <div id="7dcf168a" class="cell"> 658 658 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 659 659 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> TrainingSample:</span> 660 660 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing features and label for training."""</span></span> ··· 678 678 </ul> 679 679 <p>CIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.</p> 680 680 <p>Create entries with content-addressable CIDs:</p> 681 - <div id="b26485ff" class="cell"> 681 + <div id="93a2dc43" class="cell"> 682 682 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code 
class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create an entry manually</span></span> 683 683 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> LocalDatasetEntry(</span> 684 684 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> _name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 711 711 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 712 712 <p>The <code>LocalIndex</code> is your team’s dataset registry. It implements the <code>AbstractIndex</code> protocol, meaning code written against <code>LocalIndex</code> will also work with <code>AtmosphereIndex</code> when you’re ready for federated sharing.</p> 713 713 <p>The index tracks datasets in Redis:</p> 714 - <div id="d23bcb72" class="cell"> 714 + <div id="e7cb9abe" class="cell"> 715 715 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> redis <span class="im">import</span> Redis</span> 716 716 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 717 717 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to Redis</span></span> ··· 724 724 <h3 class="anchored" data-anchor-id="schema-management">Schema Management</h3> 725 725 <p><strong>Schema publishing</strong> is how you ensure type consistency across your team. 
When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.</p> 726 726 <p>This enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.</p> 727 - <div id="39ba8cff" class="cell"> 727 + <div id="d8c57637" class="cell"> 728 728 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 729 729 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(TrainingSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 730 730 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Published schema: </span><span class="sc">{</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 753 753 </ul> 754 754 <p>The data store handles uploading tar shards and creating signed URLs for streaming access.</p> 755 755 <p>For direct S3 operations:</p> 756 - <div id="d1cd901f" class="cell"> 756 + <div id="dc2e8870" class="cell"> 757 757 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>creds <span class="op">=</span> {</span> 758 758 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ENDPOINT"</span>: <span class="st">"http://localhost:9000"</span>,</span> 759 759 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ACCESS_KEY_ID"</span>: <span class="st">"minioadmin"</span>,</span> ··· 
779 779 </ol> 780 780 <p>The index composition pattern (<code>LocalIndex(data_store=S3DataStore(...))</code>) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.</p> 781 781 <p>Use <code>LocalIndex</code> with <code>S3DataStore</code> to store datasets with S3 storage and Redis indexing:</p> 782 - <div id="6d91c4f2" class="cell"> 782 + <div id="d039393d" class="cell"> 783 783 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Create sample data</span></span> 784 784 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> 785 785 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> TrainingSample(</span> ··· 829 829 <h2 class="anchored" data-anchor-id="using-load_dataset-with-index">Using load_dataset with Index</h2> 830 830 <p>The <code>load_dataset()</code> function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve <code>@local/</code> prefixed paths to the actual data URLs and apply the correct credentials automatically.</p> 831 831 <p>The <code>load_dataset()</code> function supports index lookup:</p> 832 - <div id="4bf81ece" class="cell"> 832 + <div id="4f5d6513" class="cell"> 833 833 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 834 834 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 835 835 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local index</span></span>
+11 -11
docs/tutorials/promotion.html
··· 621 621 </section> 622 622 <section id="setup" class="level2"> 623 623 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 624 - <div id="30d648e4" class="cell"> 624 + <div id="b2a4cfbd" class="cell"> 625 625 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 626 626 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 627 627 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 634 634 <section id="prepare-a-local-dataset" class="level2"> 635 635 <h2 class="anchored" data-anchor-id="prepare-a-local-dataset">Prepare a Local Dataset</h2> 636 636 <p>First, set up a dataset in local storage:</p> 637 - <div id="8f0a8e10" class="cell"> 637 + <div id="f7229ae4" class="cell"> 638 638 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. 
Define sample type</span></span> 639 639 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 640 640 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ExperimentSample:</span> ··· 684 684 <section id="basic-promotion" class="level2"> 685 685 <h2 class="anchored" data-anchor-id="basic-promotion">Basic Promotion</h2> 686 686 <p>Promote the dataset to ATProto:</p> 687 - <div id="5b5a6c07" class="cell"> 687 + <div id="54490a42" class="cell"> 688 688 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to atmosphere</span></span> 689 689 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 690 690 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"myhandle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> ··· 697 697 <section id="promotion-with-metadata" class="level2"> 698 698 <h2 class="anchored" data-anchor-id="promotion-with-metadata">Promotion with Metadata</h2> 699 699 <p>Add description, tags, and license:</p> 700 - <div id="dc7703ae" class="cell"> 700 + <div id="517ea4f6" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 702 702 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> local_entry,</span> 703 703 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 713 713 <section id="schema-deduplication" class="level2"> 714 714 <h2 class="anchored" 
data-anchor-id="schema-deduplication">Schema Deduplication</h2> 715 715 <p>The promotion workflow automatically checks for existing schemas:</p> 716 - <div id="98721d33" class="cell"> 716 + <div id="f80cf10d" class="cell"> 717 717 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> _find_existing_schema</span> 718 718 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 719 719 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if schema already exists</span></span> ··· 725 725 <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="st">"No existing schema found, will publish new one"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 726 726 </div> 727 727 <p>When you promote multiple datasets with the same sample type:</p> 728 - <div id="172b5ab2" class="cell"> 728 + <div id="e623fe02" class="cell"> 729 729 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 730 730 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 731 731 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 740 740 <div class="tab-content"> 741 741 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 742 742 <p>By default, promotion keeps the original data URLs:</p> 743 - <div id="1c103291" class="cell"> 743 
+ <div id="fd052b0a" class="cell"> 744 744 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 745 745 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 746 746 </div> ··· 753 753 </div> 754 754 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 755 755 <p>To copy data to a different storage location:</p> 756 - <div id="09e9306a" class="cell"> 756 + <div id="87044fcf" class="cell"> 757 757 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 758 758 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 759 759 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create new data store</span></span> ··· 783 783 <section id="verify-on-atmosphere" class="level2"> 784 784 <h2 class="anchored" data-anchor-id="verify-on-atmosphere">Verify on Atmosphere</h2> 785 785 <p>After promotion, verify the dataset is accessible:</p> 786 - <div id="2be14e5f" class="cell"> 786 + <div id="ae122715" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereIndex</span> 788 788 <span id="cb10-2"><a 
href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 789 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>atm_index <span class="op">=</span> AtmosphereIndex(client)</span> ··· 804 804 </section> 805 805 <section id="error-handling" class="level2"> 806 806 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 807 - <div id="deda7bd9" class="cell"> 807 + <div id="f5fbc08d" class="cell"> 808 808 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 809 809 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span> 810 810 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span> ··· 828 828 </section> 829 829 <section id="complete-workflow" class="level2"> 830 830 <h2 class="anchored" data-anchor-id="complete-workflow">Complete Workflow</h2> 831 - <div id="3fd9070a" class="cell"> 831 + <div id="bcd3ed5a" class="cell"> 832 832 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Complete local-to-atmosphere workflow</span></span> 833 833 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 834 834 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span>
+6 -6
docs/tutorials/quickstart.html
··· 606 606 <li><strong>Round-trip fidelity</strong>: Data survives serialization without loss</li> 607 607 </ul> 608 608 <p>Use the <code>@packable</code> decorator to create a typed sample:</p> 609 - <div id="e779f1c8" class="cell"> 609 + <div id="5fe343cc" class="cell"> 610 610 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 611 611 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 612 612 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 627 627 </section> 628 628 <section id="create-sample-instances" class="level2"> 629 629 <h2 class="anchored" data-anchor-id="create-sample-instances">Create Sample Instances</h2> 630 - <div id="9017c045" class="cell"> 630 + <div id="ceeaea84" class="cell"> 631 631 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a single sample</span></span> 632 632 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 633 633 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> ··· 655 655 </ul> 656 656 <p>The <code>as_wds</code> property on your sample provides the dictionary format WebDataset expects:</p> 657 657 <p>Use WebDataset’s <code>TarWriter</code> to create dataset files:</p> 658 - <div id="de114376" class="cell"> 658 + <div 
id="21430fdb" class="cell"> 659 659 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 660 660 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 661 661 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create 100 samples</span></span> ··· 686 686 </ul> 687 687 <p>This eliminates boilerplate collation code and works automatically with any PackableSample type.</p> 688 688 <p>Create a typed <code>Dataset</code> and iterate with batching:</p> 689 - <div id="a3152d0f" class="cell"> 689 + <div id="f9c53332" class="cell"> 690 690 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load dataset with type</span></span> 691 691 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"my-dataset-000000.tar"</span>)</span> 692 692 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 713 713 </ol> 714 714 <p>This approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.</p> 715 715 <p>For training, use shuffled iteration:</p> 716 - <div id="64a443cd" class="cell"> 716 + <div id="7af05c86" class="cell"> 717 717 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span 
class="dv">32</span>):</span> 718 718 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled at shard and sample level</span></span> 719 719 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image</span> ··· 734 734 <li><strong>Derived features</strong>: Compute fields on-the-fly during iteration</li> 735 735 </ul> 736 736 <p>View datasets through different schemas:</p> 737 - <div id="cdc9da8b" class="cell"> 737 + <div id="a671173c" class="cell"> 738 738 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define a simplified view type</span></span> 739 739 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 740 740 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span>