A loose federation of distributed, typed datasets

fix(docs): update docstrings from Example: to Examples: format for proper code rendering

Griffe's Google docstring parser treats Example: (singular) as an admonition,
passing its raw text through, including :: markers. Examples: (plural) is
recognized as a proper examples section with code parsing, which lets quartodoc
render code blocks with the sourceCode/python/code-with-copy CSS classes.
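
For illustration, a minimal before/after sketch of the docstring change (the function and docstring text here are hypothetical, not copied from the repository):

```python
# Before: griffe treats "Example:" (singular) as an admonition, so the "::"
# marker and the doctest lines pass through as raw text in the rendered docs.
def load_shard(url: str):
    """Fetch a single shard.

    Example:
        ::

            >>> data = load_shard("data-000000.tar")
    """


# After: griffe parses "Examples:" (plural) as an examples section, so
# quartodoc renders the >>> lines as a highlighted, copyable code block.
def load_shard(url: str):
    """Fetch a single shard.

    Examples:
        >>> data = load_shard("data-000000.tar")
    """
```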

- Update all docstrings across 17 source files
- Regenerate quartodoc .qmd files and Quarto HTML output
- Update CLAUDE.md to document the correct docstring format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+1625 -1787
.chainlink/issues.db

This is a binary file and will not be displayed.

+3
CHANGELOG.md
···
25 25   - **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases
26 26
27 27   ### Changed
   28 + - Investigate quartodoc Example section rendering - missing CSS classes on pre/code tags (#401)
   29 + - Update all docstrings from Example: to Examples: format (#403)
   30 + - Create GitHub issues for v0.3 roadmap feature domains (#402)
28 31   - Expand Quarto documentation with architectural narrative (#395)
29 32   - Expand atmosphere tutorial with federation context (#400)
30 33   - Expand local-workflow tutorial with system narrative (#399)
+28 -38
CLAUDE.md
···
209 209
210 210   ## Docstring Formatting
211 211
212     - This project uses **Google-style docstrings** with quartodoc for API documentation generation. The most important formatting requirement is for **Example sections**.
    212 + This project uses **Google-style docstrings** with quartodoc for API documentation generation. The most important formatting requirement is for **Examples sections**.
213 213
214     - ### Example Section Format
    214 + ### Examples Section Format
215 215
216     - Example sections must use reStructuredText literal block syntax (`::`) to render correctly in quartodoc-generated documentation:
    216 + Use `Examples:` (plural) for code examples. This is recognized by griffe's Google docstring parser and rendered with proper syntax highlighting by quartodoc:
217 217
218 218   ```python
219 219   def my_function():
···
227 227       Returns:
228 228           Description of return value.
229 229
230     -     Example:
231     -         ::
232     -
233     -             >>> result = my_function()
234     -             >>> print(result)
235     -             'output'
    230 +     Examples:
    231 +         >>> result = my_function()
    232 +         >>> print(result)
    233 +         'output'
236 234       """
237 235   ```
238 236
239 237   **Key formatting rules:**
240 238
241     - 1. `Example:` with a colon, 4-space indented from the docstring margin
242     - 2. `::` on its own line, 8-space indented (4 more than `Example:`)
243     - 3. Blank line after `::`
244     - 4. Code examples indented 12 spaces (4 more than `::`)
245     - 5. Use `>>>` for Python prompts and `...` for continuation lines
    239 + 1. Use `Examples:` (plural, not `Example:` singular)
    240 + 2. Code examples are indented 8 spaces (4 more than `Examples:`)
    241 + 3. Use `>>>` for Python prompts and `...` for continuation lines
    242 + 4. No `::` marker needed - griffe handles the parsing automatically
246 243
247     - **Incorrect format (will not render properly):**
    244 + **Incorrect format (will not render with syntax highlighting):**
248 245   ```python
249     - Example:
250     -     >>> code_here()  # Wrong - missing :: and extra indentation
    246 + Example:  # Wrong - singular form is treated as an admonition
    247 +     ::  # Wrong - reST literal block marker not needed
    248 +     >>> code_here()
251 249   ```
252 250
253 251   **Correct format:**
254 252   ```python
255     - Example:
256     -     ::
257     -
258     -         >>> code_here()  # Correct - has :: and proper indentation
    253 + Examples:
    254 +     >>> code_here()  # Correct - plural form, proper indentation
259 255   ```
260 256
261 257   ### Multiple Examples
262 258
263     - For multiple examples, use the same pattern:
    259 + For multiple examples, continue in the same section:
264 260
265 261   ```python
266     - Example:
267     -     ::
268     -
269     -         >>> # First example
270     -         >>> x = create_thing()
    262 + Examples:
    263 +     >>> # First example
    264 +     >>> x = create_thing()
271 265
272     -         >>> # Second example
273     -         >>> y = other_thing()
    266 +     >>> # Second example
    267 +     >>> y = other_thing()
274 268   ```
275 269
276 270   ### Class and Method Docstrings
···
281 275   class MyClass:
282 276       """Class description.
283 277
284     -         Example:
285     -             ::
286     -
287     -                 >>> obj = MyClass()
288     -                 >>> obj.do_something()
    278 +         Examples:
    279 +             >>> obj = MyClass()
    280 +             >>> obj.do_something()
289 281       """
290 282
291 283       def method(self):
292 284           """Method description.
293 285
294     -             Example:
295     -                 ::
296     -
297     -                     >>> self.method()
    286 +             Examples:
    287 +                 >>> self.method()
298 288           """
299 289   ```
300 290
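
The parsing behavior that the CLAUDE.md rules above rely on can be sanity-checked against griffe directly. A rough sketch, not part of this commit, assuming a recent griffe that exposes `Docstring` and `Parser` at the top level (older releases import them from `griffe.dataclasses` and `griffe.docstrings.parsers`):

```python
from griffe import Docstring, Parser

SINGULAR = """Summary line.

Example:
    >>> 1 + 1
    2
"""

PLURAL = """Summary line.

Examples:
    >>> 1 + 1
    2
"""

for text in (SINGULAR, PLURAL):
    # Parse with the same Google-style parser that quartodoc uses via griffe.
    sections = Docstring(text, lineno=1).parse(Parser.google)
    # Per the commit message, "Example:" should come back as an admonition
    # section, while "Examples:" should come back as an examples section.
    print([section.kind.value for section in sections])
```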
+7 -8
docs/api/AbstractDataStore.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.AbstractDataStore" id="toc-atdata.AbstractDataStore" class="nav-link active" data-scroll-target="#atdata.AbstractDataStore">AbstractDataStore</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.AbstractDataStore.read_url" id="toc-atdata.AbstractDataStore.read_url" class="nav-link" data-scroll-target="#atdata.AbstractDataStore.read_url">read_url</a></li> ··· 426 426 <p>Protocol for data storage operations.</p> 427 427 <p>This protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)</p> 428 428 <p>The separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.</p> 429 - <section id="example" class="level2 doc-section doc-section-example"> 430 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 431 - <p>::</p> 432 - <pre><code>&gt;&gt;&gt; store = S3DataStore(credentials, bucket="my-bucket") 433 - &gt;&gt;&gt; urls = store.write_shards(dataset, prefix="training/v1") 434 - &gt;&gt;&gt; print(urls) 435 - ['s3://my-bucket/training/v1/shard-000000.tar', ...]</code></pre> 429 + <section id="examples" class="level2 doc-section doc-section-examples"> 430 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 431 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> store <span class="op">=</span> S3DataStore(credentials, bucket<span class="op">=</span><span class="st">"my-bucket"</span>)</span> 432 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> urls <span class="op">=</span> store.write_shards(dataset, prefix<span class="op">=</span><span class="st">"training/v1"</span>)</span> 433 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(urls)</span> 434 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>[<span class="st">'s3://my-bucket/training/v1/shard-000000.tar'</span>, ...]</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 436 435 </section> 437 436 <section id="methods" class="level2"> 438 437 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+22 -24
docs/api/AbstractIndex.html
··· 403 403 <li><a href="#atdata.AbstractIndex" id="toc-atdata.AbstractIndex" class="nav-link active" data-scroll-target="#atdata.AbstractIndex">AbstractIndex</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#optional-extensions" id="toc-optional-extensions" class="nav-link" data-scroll-target="#optional-extensions">Optional Extensions</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 408 408 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 409 409 <ul class="collapse"> ··· 436 436 <h2 class="doc-section doc-section-optional-extensions anchored" data-anchor-id="optional-extensions">Optional Extensions</h2> 437 437 <p>Some index implementations support additional features: - <code>data_store</code>: An AbstractDataStore for reading/writing dataset shards. If present, <code>load_dataset</code> will use it for S3 credential resolution.</p> 438 438 </section> 439 - <section id="example" class="level2 doc-section doc-section-example"> 440 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 441 - <p>::</p> 442 - <pre><code>&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None: 443 - ... # Publish schemas for different types 444 - ... schema1 = index.publish_schema(ImageSample, version="1.0.0") 445 - ... schema2 = index.publish_schema(TextSample, version="1.0.0") 446 - ... 447 - ... # Insert datasets of different types 448 - ... index.insert_dataset(image_ds, name="images") 449 - ... index.insert_dataset(text_ds, name="texts") 450 - ... 451 - ... # List all datasets (mixed types) 452 - ... for entry in index.list_datasets(): 453 - ... print(f"{entry.name} -&gt; {entry.schema_ref}")</code></pre> 439 + <section id="examples" class="level2 doc-section doc-section-examples"> 440 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 441 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="kw">def</span> publish_and_list(index: AbstractIndex) <span class="op">-&gt;</span> <span class="va">None</span>:</span> 442 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="co"># Publish schemas for different types</span></span> 443 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... schema1 <span class="op">=</span> index.publish_schema(ImageSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 444 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... schema2 <span class="op">=</span> index.publish_schema(TextSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 445 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 446 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>... <span class="co"># Insert datasets of different types</span></span> 447 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>... 
index.insert_dataset(image_ds, name<span class="op">=</span><span class="st">"images"</span>)</span> 448 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>... index.insert_dataset(text_ds, name<span class="op">=</span><span class="st">"texts"</span>)</span> 449 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>...</span> 450 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a>... <span class="co"># List all datasets (mixed types)</span></span> 451 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a>... <span class="cf">for</span> entry <span class="kw">in</span> index.list_datasets():</span> 452 + <span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss"> -&gt; </span><span class="sc">{</span>entry<span class="sc">.</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 454 453 </section> 455 454 <section id="attributes" class="level2"> 456 455 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2> ··· 596 595 </tbody> 597 596 </table> 598 597 </section> 599 - <section id="example-1" class="level4 doc-section doc-section-example"> 600 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 601 - <p>::</p> 602 - <pre><code>&gt;&gt;&gt; entry = index.get_dataset("my-dataset") 603 - &gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref) 604 - &gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0]) 605 - &gt;&gt;&gt; for sample in ds.ordered(): 606 - ... print(sample) # sample is instance of SampleType</code></pre> 598 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 599 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 600 + <div class="sourceCode" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> entry <span class="op">=</span> index.get_dataset(<span class="st">"my-dataset"</span>)</span> 601 + <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> SampleType <span class="op">=</span> index.decode_schema(entry.schema_ref)</span> 602 + <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> Dataset[SampleType](entry.data_urls[<span class="dv">0</span>])</span> 603 + <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> sample <span class="kw">in</span> ds.ordered():</span> 604 + <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample) <span class="co"># sample is instance of SampleType</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 607 605 </section> 608 606 </section> 609 607 <section id="atdata.AbstractIndex.get_dataset" class="level3">
+10 -11
docs/api/AtUri.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.AtUri" id="toc-atdata.atmosphere.AtUri" class="nav-link active" data-scroll-target="#atdata.atmosphere.AtUri">AtUri</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 424 424 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.AtUri(authority, collection, rkey)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 425 425 <p>Parsed AT Protocol URI.</p> 426 426 <p>AT URIs follow the format: at://<authority>/<collection>/<rkey></rkey></collection></authority></p> 427 - <section id="example" class="level2 doc-section doc-section-example"> 428 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 429 - <p>::</p> 430 - <pre><code>&gt;&gt;&gt; uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 431 - &gt;&gt;&gt; uri.authority 432 - 'did:plc:abc123' 433 - &gt;&gt;&gt; uri.collection 434 - 'ac.foundation.dataset.sampleSchema' 435 - &gt;&gt;&gt; uri.rkey 436 - 'xyz'</code></pre> 427 + <section id="examples" class="level2 doc-section doc-section-examples"> 428 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 429 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri <span class="op">=</span> AtUri.parse(<span class="st">"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz"</span>)</span> 430 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri.authority</span> 431 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co">'did:plc:abc123'</span></span> 432 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri.collection</span> 433 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="co">'ac.foundation.dataset.sampleSchema'</span></span> 434 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri.rkey</span> 435 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="co">'xyz'</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 437 436 </section> 438 437 <section id="attributes" class="level2"> 439 438 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+7 -8
docs/api/AtmosphereClient.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.AtmosphereClient" id="toc-atdata.atmosphere.AtmosphereClient" class="nav-link active" data-scroll-target="#atdata.atmosphere.AtmosphereClient">AtmosphereClient</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#note" id="toc-note" class="nav-link" data-scroll-target="#note">Note</a></li> 407 407 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 408 408 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> ··· 438 438 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.AtmosphereClient(base_url<span class="op">=</span><span class="va">None</span>, <span class="op">*</span>, _client<span class="op">=</span><span class="va">None</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 439 439 <p>ATProto client wrapper for atdata operations.</p> 440 440 <p>This class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).</p> 441 - <section id="example" class="level2 doc-section doc-section-example"> 442 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 443 - <p>::</p> 444 - <pre><code>&gt;&gt;&gt; client = AtmosphereClient() 445 - &gt;&gt;&gt; client.login("alice.bsky.social", "app-password") 446 - &gt;&gt;&gt; print(client.did) 447 - 'did:plc:...'</code></pre> 441 + <section id="examples" class="level2 doc-section doc-section-examples"> 442 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 443 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 444 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"alice.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 445 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(client.did)</span> 446 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="co">'did:plc:...'</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 448 447 </section> 449 448 <section id="note" class="level2 doc-section doc-section-note"> 450 449 <h2 class="doc-section doc-section-note anchored" data-anchor-id="note">Note</h2>
+13 -14
docs/api/AtmosphereIndex.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.AtmosphereIndex" id="toc-atdata.atmosphere.AtmosphereIndex" class="nav-link active" data-scroll-target="#atdata.atmosphere.AtmosphereIndex">AtmosphereIndex</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 431 431 <p>ATProto index implementing AbstractIndex protocol.</p> 432 432 <p>Wraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.</p> 433 433 <p>Optionally accepts a <code>PDSBlobStore</code> for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.</p> 434 - <section id="example" class="level2 doc-section doc-section-example"> 435 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 436 - <p>::</p> 437 - <pre><code>&gt;&gt;&gt; client = AtmosphereClient() 438 - &gt;&gt;&gt; client.login("handle.bsky.social", "app-password") 439 - &gt;&gt;&gt; 440 - &gt;&gt;&gt; # Without blob storage (external URLs only) 441 - &gt;&gt;&gt; index = AtmosphereIndex(client) 442 - &gt;&gt;&gt; 443 - &gt;&gt;&gt; # With PDS blob storage 444 - &gt;&gt;&gt; store = PDSBlobStore(client) 445 - &gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store) 446 - &gt;&gt;&gt; entry = index.insert_dataset(dataset, name="my-data")</code></pre> 434 + <section id="examples" class="level2 doc-section doc-section-examples"> 435 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 436 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 437 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 438 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 439 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Without blob storage (external URLs only)</span></span> 440 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index <span class="op">=</span> AtmosphereIndex(client)</span> 441 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 442 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># With PDS blob storage</span></span> 443 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> store <span class="op">=</span> PDSBlobStore(client)</span> 444 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index 
<span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span> 445 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> entry <span class="op">=</span> index.insert_dataset(dataset, name<span class="op">=</span><span class="st">"my-data"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 447 446 </section> 448 447 <section id="attributes" class="level2"> 449 448 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+11 -12
docs/api/BlobSource.html
··· 403 403 <li><a href="#atdata.BlobSource" id="toc-atdata.BlobSource" class="nav-link active" data-scroll-target="#atdata.BlobSource">BlobSource</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> 409 409 <li><a href="#atdata.BlobSource.from_refs" id="toc-atdata.BlobSource.from_refs" class="nav-link" data-scroll-target="#atdata.BlobSource.from_refs">from_refs</a></li> ··· 451 451 </tbody> 452 452 </table> 453 453 </section> 454 - <section id="example" class="level2 doc-section doc-section-example"> 455 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 456 - <p>::</p> 457 - <pre><code>&gt;&gt;&gt; source = BlobSource( 458 - ... blob_refs=[ 459 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 460 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 461 - ... ], 462 - ... ) 463 - &gt;&gt;&gt; for shard_id, stream in source.shards: 464 - ... process(stream)</code></pre> 454 + <section id="examples" class="level2 doc-section doc-section-examples"> 455 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 456 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> BlobSource(</span> 457 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... blob_refs<span class="op">=</span>[</span> 458 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... {<span class="st">"did"</span>: <span class="st">"did:plc:abc123"</span>, <span class="st">"cid"</span>: <span class="st">"bafyrei..."</span>},</span> 459 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... {<span class="st">"did"</span>: <span class="st">"did:plc:abc123"</span>, <span class="st">"cid"</span>: <span class="st">"bafyrei..."</span>},</span> 460 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>... ],</span> 461 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>... )</span> 462 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> shard_id, stream <span class="kw">in</span> source.shards:</span> 463 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>... process(stream)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 465 464 </section> 466 465 <section id="methods" class="level2"> 467 466 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+12 -13
docs/api/DataSource.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.DataSource" id="toc-atdata.DataSource" class="nav-link active" data-scroll-target="#atdata.DataSource">DataSource</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 426 426 <p>Protocol for data sources that provide streams to Dataset.</p> 427 427 <p>A DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)</p> 428 428 <p>The key method is <code>shards()</code>, which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects</p> 429 - <section id="example" class="level2 doc-section doc-section-example"> 430 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 431 - <p>::</p> 432 - <pre><code>&gt;&gt;&gt; source = S3Source( 433 - ... bucket="my-bucket", 434 - ... keys=["data-000.tar", "data-001.tar"], 435 - ... endpoint="https://r2.example.com", 436 - ... credentials=creds, 437 - ... ) 438 - &gt;&gt;&gt; ds = Dataset[MySample](source) 439 - &gt;&gt;&gt; for sample in ds.ordered(): 440 - ... print(sample)</code></pre> 429 + <section id="examples" class="level2 doc-section doc-section-examples"> 430 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 431 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> S3Source(</span> 432 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... bucket<span class="op">=</span><span class="st">"my-bucket"</span>,</span> 433 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... keys<span class="op">=</span>[<span class="st">"data-000.tar"</span>, <span class="st">"data-001.tar"</span>],</span> 434 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... endpoint<span class="op">=</span><span class="st">"https://r2.example.com"</span>,</span> 435 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>... credentials<span class="op">=</span>creds,</span> 436 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>... 
)</span> 437 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> Dataset[MySample](source)</span> 438 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> sample <span class="kw">in</span> ds.ordered():</span> 439 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 441 440 </section> 442 441 <section id="attributes" class="level2"> 443 442 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+18 -20
docs/api/Dataset.html
··· 404 404 <ul class="collapse"> 405 405 <li><a href="#parameters" id="toc-parameters" class="nav-link" data-scroll-target="#parameters">Parameters</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 407 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 408 408 <li><a href="#note" id="toc-note" class="nav-link" data-scroll-target="#note">Note</a></li> 409 409 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 410 410 <ul class="collapse"> ··· 479 479 </tbody> 480 480 </table> 481 481 </section> 482 - <section id="example" class="level2 doc-section doc-section-example"> 483 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 484 - <p>::</p> 485 - <pre><code>&gt;&gt;&gt; ds = Dataset[MyData]("path/to/data-{000000..000009}.tar") 486 - &gt;&gt;&gt; for sample in ds.ordered(batch_size=32): 487 - ... # sample is SampleBatch[MyData] with batch_size samples 488 - ... embeddings = sample.embeddings # shape: (32, ...) 489 - ... 490 - &gt;&gt;&gt; # Transform to a different view 491 - &gt;&gt;&gt; ds_view = ds.as_type(MyDataView)</code></pre> 482 + <section id="examples" class="level2 doc-section doc-section-examples"> 483 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 484 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> Dataset[MyData](<span class="st">"path/to/data-{000000..000009}.tar"</span>)</span> 485 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> sample <span class="kw">in</span> ds.ordered(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 486 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... <span class="co"># sample is SampleBatch[MyData] with batch_size samples</span></span> 487 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... 
embeddings <span class="op">=</span> sample.embeddings <span class="co"># shape: (32, ...)</span></span> 488 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 489 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Transform to a different view</span></span> 490 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds_view <span class="op">=</span> ds.as_type(MyDataView)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 492 491 </section> 493 492 <section id="note" class="level2 doc-section doc-section-note"> 494 493 <h2 class="doc-section doc-section-note anchored" data-anchor-id="note">Note</h2> ··· 817 816 ds.to_parquet("output.parquet", maxcount=10000)</code></pre> 818 817 <p>This creates multiple parquet files: <code>output-000000.parquet</code>, <code>output-000001.parquet</code>, etc.</p> 819 818 </section> 820 - <section id="example-1" class="level4 doc-section doc-section-example"> 821 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 822 - <p>::</p> 823 - <pre><code>&gt;&gt;&gt; ds = Dataset[MySample]("data.tar") 824 - &gt;&gt;&gt; # Small dataset - load all at once 825 - &gt;&gt;&gt; ds.to_parquet("output.parquet") 826 - &gt;&gt;&gt; 827 - &gt;&gt;&gt; # Large dataset - process in chunks 828 - &gt;&gt;&gt; ds.to_parquet("output.parquet", maxcount=50000)</code></pre> 819 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 820 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 821 + <div class="sourceCode" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> Dataset[MySample](<span class="st">"data.tar"</span>)</span> 822 + <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Small dataset - load all at once</span></span> 823 + <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds.to_parquet(<span class="st">"output.parquet"</span>)</span> 824 + <span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 825 + <span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Large dataset - process in chunks</span></span> 826 + <span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds.to_parquet(<span class="st">"output.parquet"</span>, maxcount<span class="op">=</span><span class="dv">50000</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 829 827 </section> 830 828 </section> 831 829 <section id="atdata.Dataset.wrap" class="level3">
+10 -11
docs/api/DatasetDict.html
··· 403 403 <li><a href="#atdata.DatasetDict" id="toc-atdata.DatasetDict" class="nav-link active" data-scroll-target="#atdata.DatasetDict">DatasetDict</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#parameters" id="toc-parameters" class="nav-link" data-scroll-target="#parameters">Parameters</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 408 408 </ul></li> 409 409 </ul> ··· 448 448 </tbody> 449 449 </table> 450 450 </section> 451 - <section id="example" class="level2 doc-section doc-section-example"> 452 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 453 - <p>::</p> 454 - <pre><code>&gt;&gt;&gt; ds_dict = load_dataset("path/to/data", MyData) 455 - &gt;&gt;&gt; train = ds_dict["train"] 456 - &gt;&gt;&gt; test = ds_dict["test"] 457 - &gt;&gt;&gt; 458 - &gt;&gt;&gt; # Iterate over all splits 459 - &gt;&gt;&gt; for split_name, dataset in ds_dict.items(): 460 - ... print(f"{split_name}: {len(dataset.shard_list)} shards")</code></pre> 451 + <section id="examples" class="level2 doc-section doc-section-examples"> 452 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 453 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data"</span>, MyData)</span> 454 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> train <span class="op">=</span> ds_dict[<span class="st">"train"</span>]</span> 455 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> test <span class="op">=</span> ds_dict[<span class="st">"test"</span>]</span> 456 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 457 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Iterate over all splits</span></span> 458 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> split_name, dataset <span class="kw">in</span> ds_dict.items():</span> 459 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>split_name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span><span class="bu">len</span>(dataset.shard_list)<span class="sc">}</span><span class="ss"> shards"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 461 460 </section> 462 461 <section id="attributes" class="level2"> 463 462 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+19 -21
docs/api/DatasetLoader.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.DatasetLoader" id="toc-atdata.atmosphere.DatasetLoader" class="nav-link active" data-scroll-target="#atdata.atmosphere.DatasetLoader">DatasetLoader</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.atmosphere.DatasetLoader.get" id="toc-atdata.atmosphere.DatasetLoader.get" class="nav-link" data-scroll-target="#atdata.atmosphere.DatasetLoader.get">get</a></li> ··· 430 430 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.DatasetLoader(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 431 431 <p>Loads dataset records from ATProto.</p> 432 432 <p>This class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.</p> 433 - <section id="example" class="level2 doc-section doc-section-example"> 434 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 435 - <p>::</p> 436 - <pre><code>&gt;&gt;&gt; client = AtmosphereClient() 437 - &gt;&gt;&gt; loader = DatasetLoader(client) 438 - &gt;&gt;&gt; 439 - &gt;&gt;&gt; # List available datasets 440 - &gt;&gt;&gt; datasets = loader.list() 441 - &gt;&gt;&gt; for ds in datasets: 442 - ... print(ds["name"], ds["schemaRef"]) 443 - &gt;&gt;&gt; 444 - &gt;&gt;&gt; # Get a specific dataset record 445 - &gt;&gt;&gt; record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz")</code></pre> 433 + <section id="examples" class="level2 doc-section doc-section-examples"> 434 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 435 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 436 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> loader <span class="op">=</span> DatasetLoader(client)</span> 437 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 438 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># List available datasets</span></span> 439 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> datasets <span class="op">=</span> loader.<span class="bu">list</span>()</span> 440 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> ds <span class="kw">in</span> datasets:</span> 441 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>... 
<span class="bu">print</span>(ds[<span class="st">"name"</span>], ds[<span class="st">"schemaRef"</span>])</span> 442 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 443 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Get a specific dataset record</span></span> 444 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> record <span class="op">=</span> loader.get(<span class="st">"at://did:plc:abc/ac.foundation.dataset.record/xyz"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 446 445 </section> 447 446 <section id="methods" class="level2"> 448 447 <h2 class="anchored" data-anchor-id="methods">Methods</h2> ··· 976 975 </tbody> 977 976 </table> 978 977 </section> 979 - <section id="example-1" class="level4 doc-section doc-section-example"> 980 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 981 - <p>::</p> 982 - <pre><code>&gt;&gt;&gt; loader = DatasetLoader(client) 983 - &gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType) 984 - &gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32): 985 - ... process(batch)</code></pre> 978 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 979 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 980 + <div class="sourceCode" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> loader <span class="op">=</span> DatasetLoader(client)</span> 981 + <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> dataset <span class="op">=</span> loader.to_dataset(uri, MySampleType)</span> 982 + <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 983 + <span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a>... process(batch)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 986 984 987 985 988 986 </section>
+15 -16
docs/api/DatasetPublisher.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.DatasetPublisher" id="toc-atdata.atmosphere.DatasetPublisher" class="nav-link active" data-scroll-target="#atdata.atmosphere.DatasetPublisher">DatasetPublisher</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.atmosphere.DatasetPublisher.publish" id="toc-atdata.atmosphere.DatasetPublisher.publish" class="nav-link" data-scroll-target="#atdata.atmosphere.DatasetPublisher.publish">publish</a></li> ··· 425 425 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.DatasetPublisher(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 426 426 <p>Publishes dataset index records to ATProto.</p> 427 427 <p>This class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.</p> 428 - <section id="example" class="level2 doc-section doc-section-example"> 429 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 430 - <p>::</p> 431 - <pre><code>&gt;&gt;&gt; dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 432 - &gt;&gt;&gt; 433 - &gt;&gt;&gt; client = AtmosphereClient() 434 - &gt;&gt;&gt; client.login("handle", "password") 435 - &gt;&gt;&gt; 436 - &gt;&gt;&gt; publisher = DatasetPublisher(client) 437 - &gt;&gt;&gt; uri = publisher.publish( 438 - ... dataset, 439 - ... name="My Training Data", 440 - ... description="Training data for my model", 441 - ... tags=["computer-vision", "training"], 442 - ... 
)</code></pre> 428 + <section id="examples" class="level2 doc-section doc-section-examples"> 429 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 430 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> dataset <span class="op">=</span> atdata.Dataset[MySample](<span class="st">"s3://bucket/data-{000000..000009}.tar"</span>)</span> 431 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 432 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 433 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"handle"</span>, <span class="st">"password"</span>)</span> 434 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 435 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> publisher <span class="op">=</span> DatasetPublisher(client)</span> 436 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri <span class="op">=</span> publisher.publish(</span> 437 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>... dataset,</span> 438 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>... name<span class="op">=</span><span class="st">"My Training Data"</span>,</span> 439 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a>... description<span class="op">=</span><span class="st">"Training data for my model"</span>,</span> 440 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a>... tags<span class="op">=</span>[<span class="st">"computer-vision"</span>, <span class="st">"training"</span>],</span> 441 + <span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>... )</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 443 442 </section> 444 443 <section id="methods" class="level2"> 445 444 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+11 -12
docs/api/DictSample.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.DictSample" id="toc-atdata.DictSample" class="nav-link active" data-scroll-target="#atdata.DictSample">DictSample</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#note" id="toc-note" class="nav-link" data-scroll-target="#note">Note</a></li> 407 407 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 408 408 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> ··· 433 433 <p>This class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (<code>sample.field</code>) and dict-style (<code>sample["field"]</code>) access to fields.</p> 434 434 <p><code>DictSample</code> is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema</p> 435 435 <p>To convert to a typed schema, use <code>Dataset.as_type()</code> with a <code>@packable</code>-decorated class. Every <code>@packable</code> class automatically registers a lens from <code>DictSample</code>, making this conversion seamless.</p> 436 - <section id="example" class="level2 doc-section doc-section-example"> 437 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 438 - <p>::</p> 439 - <pre><code>&gt;&gt;&gt; ds = load_dataset("path/to/data.tar") # Returns Dataset[DictSample] 440 - &gt;&gt;&gt; for sample in ds.ordered(): 441 - ... print(sample.some_field) # Attribute access 442 - ... print(sample["other_field"]) # Dict access 443 - ... print(sample.keys()) # Inspect available fields 444 - ... 445 - &gt;&gt;&gt; # Convert to typed schema 446 - &gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)</code></pre> 436 + <section id="examples" class="level2 doc-section doc-section-examples"> 437 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 438 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>) <span class="co"># Returns Dataset[DictSample]</span></span> 439 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> sample <span class="kw">in</span> ds.ordered():</span> 440 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample.some_field) <span class="co"># Attribute access</span></span> 441 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample[<span class="st">"other_field"</span>]) <span class="co"># Dict access</span></span> 442 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>... 
<span class="bu">print</span>(sample.keys()) <span class="co"># Inspect available fields</span></span> 443 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>...</span> 444 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Convert to typed schema</span></span> 445 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> typed_ds <span class="op">=</span> ds.as_type(MyTypedSample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 447 446 </section> 448 447 <section id="note" class="level2 doc-section doc-section-note"> 449 448 <h2 class="doc-section doc-section-note anchored" data-anchor-id="note">Note</h2>
+47 -51
docs/api/Lens.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.lens" id="toc-atdata.lens" class="nav-link active" data-scroll-target="#atdata.lens">lens</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#classes" id="toc-classes" class="nav-link" data-scroll-target="#classes">Classes</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.lens.Lens" id="toc-atdata.lens.Lens" class="nav-link" data-scroll-target="#atdata.lens.Lens">Lens</a></li> ··· 435 435 <li><code>@lens</code>: Decorator to create and register lens transformations</li> 436 436 </ul> 437 437 <p>Lenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).</p> 438 - <section id="example" class="level2 doc-section doc-section-example"> 439 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 440 - <p>::</p> 441 - <pre><code>&gt;&gt;&gt; @packable 442 - ... class FullData: 443 - ... name: str 444 - ... age: int 445 - ... embedding: NDArray 446 - ... 447 - &gt;&gt;&gt; @packable 448 - ... class NameOnly: 449 - ... name: str 450 - ... 451 - &gt;&gt;&gt; @lens 452 - ... def name_view(full: FullData) -&gt; NameOnly: 453 - ... return NameOnly(name=full.name) 454 - ... 455 - &gt;&gt;&gt; @name_view.putter 456 - ... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData: 457 - ... return FullData(name=view.name, age=source.age, 458 - ... embedding=source.embedding) 459 - ... 460 - &gt;&gt;&gt; ds = Dataset[FullData]("data.tar") 461 - &gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens</code></pre> 438 + <section id="examples" class="level2 doc-section doc-section-examples"> 439 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 440 + <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>packable</span> 441 + <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> FullData:</span> 442 + <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>... name: <span class="bu">str</span></span> 443 + <span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>... age: <span class="bu">int</span></span> 444 + <span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>... embedding: NDArray</span> 445 + <span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>...</span> 446 + <span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>packable</span> 447 + <span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> NameOnly:</span> 448 + <span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>... name: <span class="bu">str</span></span> 449 + <span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a>...</span> 450 + <span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>lens</span> 451 + <span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> name_view(full: FullData) <span class="op">-&gt;</span> NameOnly:</span> 452 + <span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> NameOnly(name<span class="op">=</span>full.name)</span> 453 + <span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>...</span> 454 + <span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>name_view.putter</span> 455 + <span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> name_view_put(view: NameOnly, source: FullData) <span class="op">-&gt;</span> FullData:</span> 456 + <span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> FullData(name<span class="op">=</span>view.name, age<span class="op">=</span>source.age,</span> 457 + <span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a>... embedding<span class="op">=</span>source.embedding)</span> 458 + <span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a>...</span> 459 + <span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> Dataset[FullData](<span class="st">"data.tar"</span>)</span> 460 + <span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds_names <span class="op">=</span> ds.as_type(NameOnly) <span class="co"># Uses registered lens</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 462 461 </section> 463 462 <section id="classes" class="level2"> 464 463 <h2 class="anchored" data-anchor-id="classes">Classes</h2> ··· 518 517 </tbody> 519 518 </table> 520 519 </section> 521 - <section id="example-1" class="level4 doc-section doc-section-example"> 522 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 523 - <p>::</p> 524 - <pre><code>&gt;&gt;&gt; @lens 525 - ... def name_lens(full: FullData) -&gt; NameOnly: 526 - ... return NameOnly(name=full.name) 527 - ... 528 - &gt;&gt;&gt; @name_lens.putter 529 - ... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData: 530 - ... return FullData(name=view.name, age=source.age)</code></pre> 520 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 521 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 522 + <div class="sourceCode" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>lens</span> 523 + <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> name_lens(full: FullData) <span class="op">-&gt;</span> NameOnly:</span> 524 + <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> NameOnly(name<span class="op">=</span>full.name)</span> 525 + <span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>...</span> 526 + <span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>name_lens.putter</span> 527 + <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> name_lens_put(view: NameOnly, source: FullData) <span class="op">-&gt;</span> FullData:</span> 528 + <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> FullData(name<span class="op">=</span>view.name, age<span class="op">=</span>source.age)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 531 529 </section> 532 530 <section id="methods" class="level4"> 533 531 <h4 class="anchored" data-anchor-id="methods">Methods</h4> ··· 693 691 </tbody> 694 692 </table> 695 693 </section> 696 - <section id="example-2" class="level6 doc-section doc-section-example"> 697 - <h6 class="doc-section doc-section-example anchored" data-anchor-id="example-2">Example</h6> 698 - <p>::</p> 699 - <pre><code>&gt;&gt;&gt; @my_lens.putter 700 - ... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType: 701 - ... return SourceType(...)</code></pre> 694 + <section id="examples-2" class="level6 doc-section doc-section-examples"> 695 + <h6 class="doc-section doc-section-examples anchored" data-anchor-id="examples-2">Examples</h6> 696 + <div class="sourceCode" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>my_lens.putter</span> 697 + <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> my_lens_put(view: ViewType, source: SourceType) <span class="op">-&gt;</span> SourceType:</span> 698 + <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> SourceType(field<span class="op">=</span>view.field, other<span class="op">=</span>source.other)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 702 699 </section> 703 700 </section> 704 701 </section> ··· 925 922 </tbody> 926 923 </table> 927 924 </section> 928 - <section id="example-3" class="level4 doc-section doc-section-example"> 929 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-3">Example</h4> 930 - <p>::</p> 931 - <pre><code>&gt;&gt;&gt; @lens 932 - ... def extract_name(full: FullData) -&gt; NameOnly: 933 - ... return NameOnly(name=full.name) 934 - ... 935 - &gt;&gt;&gt; @extract_name.putter 936 - ... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData: 937 - ... return FullData(name=view.name, age=source.age)</code></pre> 925 + <section id="examples-3" class="level4 doc-section doc-section-examples"> 926 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-3">Examples</h4> 927 + <div class="sourceCode" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>lens</span> 928 + <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> extract_name(full: FullData) <span class="op">-&gt;</span> NameOnly:</span> 929 + <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> NameOnly(name<span class="op">=</span>full.name)</span> 930 + <span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a>...</span> 931 + <span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>extract_name.putter</span> 932 + <span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> extract_name_put(view: NameOnly, source: FullData) <span class="op">-&gt;</span> FullData:</span> 933 + <span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> FullData(name<span class="op">=</span>view.name, age<span class="op">=</span>source.age)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 938 934 939 935 940 936 </section>
+10 -11
docs/api/LensLoader.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.LensLoader" id="toc-atdata.atmosphere.LensLoader" class="nav-link active" data-scroll-target="#atdata.atmosphere.LensLoader">LensLoader</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.atmosphere.LensLoader.find_by_schemas" id="toc-atdata.atmosphere.LensLoader.find_by_schemas" class="nav-link" data-scroll-target="#atdata.atmosphere.LensLoader.find_by_schemas">find_by_schemas</a></li> ··· 425 425 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.LensLoader(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 426 426 <p>Loads lens records from ATProto.</p> 427 427 <p>This class fetches lens transformation records. Note that actually using a lens requires installing the referenced code and importing it manually.</p> 428 - <section id="example" class="level2 doc-section doc-section-example"> 429 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 430 - <p>::</p> 431 - <pre><code>&gt;&gt;&gt; client = AtmosphereClient() 432 - &gt;&gt;&gt; loader = LensLoader(client) 433 - &gt;&gt;&gt; 434 - &gt;&gt;&gt; record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 435 - &gt;&gt;&gt; print(record["name"]) 436 - &gt;&gt;&gt; print(record["sourceSchema"]) 437 - &gt;&gt;&gt; print(record.get("getterCode", {}).get("repository"))</code></pre> 428 + <section id="examples" class="level2 doc-section doc-section-examples"> 429 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 430 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 431 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> loader <span class="op">=</span> LensLoader(client)</span> 432 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 433 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> record <span class="op">=</span> loader.get(<span class="st">"at://did:plc:abc/ac.foundation.dataset.lens/xyz"</span>)</span> 434 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(record[<span class="st">"name"</span>])</span> 435 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(record[<span class="st">"sourceSchema"</span>])</span> 436 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(record.get(<span class="st">"getterCode"</span>, {}).get(<span class="st">"repository"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 438 437 </section> 439 438 <section id="methods" class="level2"> 440 439 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+20 -21
docs/api/LensPublisher.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.LensPublisher" id="toc-atdata.atmosphere.LensPublisher" class="nav-link active" data-scroll-target="#atdata.atmosphere.LensPublisher">LensPublisher</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#security-note" id="toc-security-note" class="nav-link" data-scroll-target="#security-note">Security Note</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 425 425 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.LensPublisher(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 426 426 <p>Publishes Lens transformation records to ATProto.</p> 427 427 <p>This class creates lens records that reference source and target schemas and point to the transformation code in a git repository.</p> 428 - <section id="example" class="level2 doc-section doc-section-example"> 429 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 430 - <p>::</p> 431 - <pre><code>&gt;&gt;&gt; @atdata.lens 432 - ... def my_lens(source: SourceType) -&gt; TargetType: 433 - ... return TargetType(field=source.other_field) 434 - &gt;&gt;&gt; 435 - &gt;&gt;&gt; client = AtmosphereClient() 436 - &gt;&gt;&gt; client.login("handle", "password") 437 - &gt;&gt;&gt; 438 - &gt;&gt;&gt; publisher = LensPublisher(client) 439 - &gt;&gt;&gt; uri = publisher.publish( 440 - ... name="my_lens", 441 - ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 442 - ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 443 - ... code_repository="https://github.com/user/repo", 444 - ... code_commit="abc123def456", 445 - ... getter_path="mymodule.lenses:my_lens", 446 - ... putter_path="mymodule.lenses:my_lens_putter", 447 - ... )</code></pre> 428 + <section id="examples" class="level2 doc-section doc-section-examples"> 429 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 430 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>atdata.lens</span> 431 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">def</span> my_lens(source: SourceType) <span class="op">-&gt;</span> TargetType:</span> 432 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... <span class="cf">return</span> TargetType(field<span class="op">=</span>source.other_field)</span> 433 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 434 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 435 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"handle"</span>, <span class="st">"password"</span>)</span> 436 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 437 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> publisher <span class="op">=</span> LensPublisher(client)</span> 438 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri <span class="op">=</span> publisher.publish(</span> 439 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a>... name<span class="op">=</span><span class="st">"my_lens"</span>,</span> 440 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a>... source_schema_uri<span class="op">=</span><span class="st">"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source"</span>,</span> 441 + <span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>... target_schema_uri<span class="op">=</span><span class="st">"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target"</span>,</span> 442 + <span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a>... code_repository<span class="op">=</span><span class="st">"https://github.com/user/repo"</span>,</span> 443 + <span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>... code_commit<span class="op">=</span><span class="st">"abc123def456"</span>,</span> 444 + <span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a>... getter_path<span class="op">=</span><span class="st">"mymodule.lenses:my_lens"</span>,</span> 445 + <span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a>... putter_path<span class="op">=</span><span class="st">"mymodule.lenses:my_lens_putter"</span>,</span> 446 + <span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a>... )</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 448 447 </section> 449 448 <section id="security-note" class="level2 doc-section doc-section-security-note"> 450 449 <h2 class="doc-section doc-section-security-note anchored" data-anchor-id="security-note">Security Note</h2>
+7 -8
docs/api/PDSBlobStore.html
··· 403 403 <li><a href="#atdata.atmosphere.PDSBlobStore" id="toc-atdata.atmosphere.PDSBlobStore" class="nav-link active" data-scroll-target="#atdata.atmosphere.PDSBlobStore">PDSBlobStore</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> 409 409 <li><a href="#atdata.atmosphere.PDSBlobStore.create_source" id="toc-atdata.atmosphere.PDSBlobStore.create_source" class="nav-link" data-scroll-target="#atdata.atmosphere.PDSBlobStore.create_source">create_source</a></li> ··· 452 452 </tbody> 453 453 </table> 454 454 </section> 455 - <section id="example" class="level2 doc-section doc-section-example"> 456 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 457 - <p>::</p> 458 - <pre><code>&gt;&gt;&gt; store = PDSBlobStore(client) 459 - &gt;&gt;&gt; urls = store.write_shards(dataset, prefix="training/v1") 460 - &gt;&gt;&gt; # Returns AT URIs like: 461 - &gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]</code></pre> 455 + <section id="examples" class="level2 doc-section doc-section-examples"> 456 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 457 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> store <span class="op">=</span> PDSBlobStore(client)</span> 458 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> urls <span class="op">=</span> store.write_shards(dataset, prefix<span class="op">=</span><span class="st">"training/v1"</span>)</span> 459 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Returns AT URIs like:</span></span> 460 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># ['at://did:plc:abc/blob/bafyrei...', ...]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 462 461 </section> 463 462 <section id="methods" class="level2"> 464 463 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+12 -13
docs/api/Packable-protocol.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.Packable" id="toc-atdata.Packable" class="nav-link active" data-scroll-target="#atdata.Packable">Packable</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 427 427 <p>This protocol allows classes decorated with <code>@packable</code> to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.</p> 428 428 <p>Both <code>PackableSample</code> subclasses and <code>@packable</code>-decorated classes satisfy this protocol structurally.</p> 429 429 <p>The protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)</p> 430 - <section id="example" class="level2 doc-section doc-section-example"> 431 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 432 - <p>::</p> 433 - <pre><code>&gt;&gt;&gt; @packable 434 - ... class MySample: 435 - ... name: str 436 - ... value: int 437 - ... 438 - &gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None: 439 - ... # Type checker knows sample_type has from_bytes, packed, etc. 440 - ... instance = sample_type.from_bytes(data) 441 - ... print(instance.packed)</code></pre> 430 + <section id="examples" class="level2 doc-section doc-section-examples"> 431 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 432 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>packable</span> 433 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> MySample:</span> 434 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... name: <span class="bu">str</span></span> 435 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... value: <span class="bu">int</span></span> 436 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 437 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="kw">def</span> process(sample_type: Type[Packable]) <span class="op">-&gt;</span> <span class="va">None</span>:</span> 438 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>... <span class="co"># Type checker knows sample_type has from_bytes, packed, etc.</span></span> 439 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a>... instance <span class="op">=</span> sample_type.from_bytes(data)</span> 440 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(instance.packed)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 442 441 </section> 443 442 <section id="attributes" class="level2"> 444 443 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+11 -12
docs/api/PackableSample.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.PackableSample" id="toc-atdata.PackableSample" class="nav-link active" data-scroll-target="#atdata.PackableSample">PackableSample</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> ··· 426 426 <p>Base class for samples that can be serialized with msgpack.</p> 427 427 <p>This abstract base class provides automatic serialization/deserialization for dataclass-based samples. Fields annotated as <code>NDArray</code> or <code>NDArray | None</code> are automatically converted between numpy arrays and bytes during packing/unpacking.</p> 428 428 <p>Subclasses should be defined either by: 1. Direct inheritance with the <code>@dataclass</code> decorator 2. Using the <code>@packable</code> decorator (recommended)</p> 429 - <section id="example" class="level2 doc-section doc-section-example"> 430 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 431 - <p>::</p> 432 - <pre><code>&gt;&gt;&gt; @packable 433 - ... class MyData: 434 - ... name: str 435 - ... embeddings: NDArray 436 - ... 437 - &gt;&gt;&gt; sample = MyData(name="test", embeddings=np.array([1.0, 2.0])) 438 - &gt;&gt;&gt; packed = sample.packed # Serialize to bytes 439 - &gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize</code></pre> 429 + <section id="examples" class="level2 doc-section doc-section-examples"> 430 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 431 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>packable</span> 432 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> MyData:</span> 433 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... name: <span class="bu">str</span></span> 434 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... embeddings: NDArray</span> 435 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 436 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> sample <span class="op">=</span> MyData(name<span class="op">=</span><span class="st">"test"</span>, embeddings<span class="op">=</span>np.array([<span class="fl">1.0</span>, <span class="fl">2.0</span>]))</span> 437 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> packed <span class="op">=</span> sample.packed <span class="co"># Serialize to bytes</span></span> 438 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> restored <span class="op">=</span> MyData.from_bytes(packed) <span class="co"># Deserialize</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 440 439 </section> 441 440 <section id="attributes" class="level2"> 442 441 <h2 class="anchored" data-anchor-id="attributes">Attributes</h2>
+26 -29
docs/api/S3Source.html
··· 403 403 <li><a href="#atdata.S3Source" id="toc-atdata.S3Source" class="nav-link active" data-scroll-target="#atdata.S3Source">S3Source</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> 409 409 <li><a href="#atdata.S3Source.from_credentials" id="toc-atdata.S3Source.from_credentials" class="nav-link" data-scroll-target="#atdata.S3Source.from_credentials">from_credentials</a></li> ··· 480 480 </tbody> 481 481 </table> 482 482 </section> 483 - <section id="example" class="level2 doc-section doc-section-example"> 484 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 485 - <p>::</p> 486 - <pre><code>&gt;&gt;&gt; source = S3Source( 487 - ... bucket="my-datasets", 488 - ... keys=["train/shard-000.tar", "train/shard-001.tar"], 489 - ... endpoint="https://abc123.r2.cloudflarestorage.com", 490 - ... access_key="AKIAIOSFODNN7EXAMPLE", 491 - ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 492 - ... ) 493 - &gt;&gt;&gt; for shard_id, stream in source.shards: 494 - ... process(stream)</code></pre> 483 + <section id="examples" class="level2 doc-section doc-section-examples"> 484 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 485 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> S3Source(</span> 486 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... bucket<span class="op">=</span><span class="st">"my-datasets"</span>,</span> 487 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... keys<span class="op">=</span>[<span class="st">"train/shard-000.tar"</span>, <span class="st">"train/shard-001.tar"</span>],</span> 488 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... endpoint<span class="op">=</span><span class="st">"https://abc123.r2.cloudflarestorage.com"</span>,</span> 489 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>... access_key<span class="op">=</span><span class="st">"AKIAIOSFODNN7EXAMPLE"</span>,</span> 490 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>... secret_key<span class="op">=</span><span class="st">"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"</span>,</span> 491 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a>... )</span> 492 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> shard_id, stream <span class="kw">in</span> source.shards:</span> 493 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a>... process(stream)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 495 494 </section> 496 495 <section id="methods" class="level2"> 497 496 <h2 class="anchored" data-anchor-id="methods">Methods</h2> ··· 578 577 </tbody> 579 578 </table> 580 579 </section> 581 - <section id="example-1" class="level4 doc-section doc-section-example"> 582 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 583 - <p>::</p> 584 - <pre><code>&gt;&gt;&gt; creds = { 585 - ... "AWS_ACCESS_KEY_ID": "...", 586 - ... "AWS_SECRET_ACCESS_KEY": "...", 587 - ... "AWS_ENDPOINT": "https://r2.example.com", 588 - ... } 589 - &gt;&gt;&gt; source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"])</code></pre> 580 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 581 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 582 + <div class="sourceCode" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> creds <span class="op">=</span> {</span> 583 + <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>... <span class="st">"AWS_ACCESS_KEY_ID"</span>: <span class="st">"..."</span>,</span> 584 + <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>... <span class="st">"AWS_SECRET_ACCESS_KEY"</span>: <span class="st">"..."</span>,</span> 585 + <span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>... <span class="st">"AWS_ENDPOINT"</span>: <span class="st">"https://r2.example.com"</span>,</span> 586 + <span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>... }</span> 587 + <span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> S3Source.from_credentials(creds, <span class="st">"my-bucket"</span>, [<span class="st">"data.tar"</span>])</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 590 588 </section> 591 589 </section> 592 590 <section id="atdata.S3Source.from_urls" class="level3"> ··· 684 682 </tbody> 685 683 </table> 686 684 </section> 687 - <section id="example-2" class="level4 doc-section doc-section-example"> 688 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-2">Example</h4> 689 - <p>::</p> 690 - <pre><code>&gt;&gt;&gt; source = S3Source.from_urls( 691 - ... ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"], 692 - ... endpoint="https://r2.example.com", 693 - ... )</code></pre> 685 + <section id="examples-2" class="level4 doc-section doc-section-examples"> 686 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-2">Examples</h4> 687 + <div class="sourceCode" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> S3Source.from_urls(</span> 688 + <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>... [<span class="st">"s3://my-bucket/train-000.tar"</span>, <span class="st">"s3://my-bucket/train-001.tar"</span>],</span> 689 + <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>... endpoint<span class="op">=</span><span class="st">"https://r2.example.com"</span>,</span> 690 + <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>... )</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 694 691 </section> 695 692 </section> 696 693 <section id="atdata.S3Source.list_shards" class="level3">
+6 -7
docs/api/SampleBatch.html
··· 404 404 <ul class="collapse"> 405 405 <li><a href="#parameters" id="toc-parameters" class="nav-link" data-scroll-target="#parameters">Parameters</a></li> 406 406 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 407 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 407 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 408 408 <li><a href="#note" id="toc-note" class="nav-link" data-scroll-target="#note">Note</a></li> 409 409 </ul></li> 410 410 </ul> ··· 469 469 </tbody> 470 470 </table> 471 471 </section> 472 - <section id="example" class="level2 doc-section doc-section-example"> 473 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 474 - <p>::</p> 475 - <pre><code>&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3]) 476 - &gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...) 477 - &gt;&gt;&gt; batch.names # Returns list of names</code></pre> 472 + <section id="examples" class="level2 doc-section doc-section-examples"> 473 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 474 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> batch <span class="op">=</span> SampleBatch[MyData]([sample1, sample2, sample3])</span> 475 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> batch.embeddings <span class="co"># Returns stacked numpy array of shape (3, ...)</span></span> 476 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> batch.names <span class="co"># Returns list of names</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 478 477 </section> 479 478 <section id="note" class="level2 doc-section doc-section-note"> 480 479 <h2 class="doc-section doc-section-note anchored" data-anchor-id="note">Note</h2>
+10 -11
docs/api/SchemaLoader.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.SchemaLoader" id="toc-atdata.atmosphere.SchemaLoader" class="nav-link active" data-scroll-target="#atdata.atmosphere.SchemaLoader">SchemaLoader</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.atmosphere.SchemaLoader.get" id="toc-atdata.atmosphere.SchemaLoader.get" class="nav-link" data-scroll-target="#atdata.atmosphere.SchemaLoader.get">get</a></li> ··· 424 424 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.SchemaLoader(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 425 425 <p>Loads PackableSample schemas from ATProto.</p> 426 426 <p>This class fetches schema records from ATProto and can list available schemas from a repository.</p> 427 - <section id="example" class="level2 doc-section doc-section-example"> 428 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 429 - <p>::</p> 430 - <pre><code>&gt;&gt;&gt; client = AtmosphereClient() 431 - &gt;&gt;&gt; client.login("handle", "password") 432 - &gt;&gt;&gt; 433 - &gt;&gt;&gt; loader = SchemaLoader(client) 434 - &gt;&gt;&gt; schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 435 - &gt;&gt;&gt; print(schema["name"]) 436 - 'MySample'</code></pre> 427 + <section id="examples" class="level2 doc-section doc-section-examples"> 428 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 429 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 430 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"handle"</span>, <span class="st">"password"</span>)</span> 431 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 432 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> loader <span class="op">=</span> SchemaLoader(client)</span> 433 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> schema <span class="op">=</span> loader.get(<span class="st">"at://did:plc:.../ac.foundation.dataset.sampleSchema/..."</span>)</span> 434 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(schema[<span class="st">"name"</span>])</span> 435 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="co">'MySample'</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 437 436 </section> 438 437 <section id="methods" class="level2"> 439 438 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+15 -16
docs/api/SchemaPublisher.html
··· 402 402 <ul> 403 403 <li><a href="#atdata.atmosphere.SchemaPublisher" id="toc-atdata.atmosphere.SchemaPublisher" class="nav-link active" data-scroll-target="#atdata.atmosphere.SchemaPublisher">SchemaPublisher</a> 404 404 <ul class="collapse"> 405 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 405 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 406 406 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 407 407 <ul class="collapse"> 408 408 <li><a href="#atdata.atmosphere.SchemaPublisher.publish" id="toc-atdata.atmosphere.SchemaPublisher.publish" class="nav-link" data-scroll-target="#atdata.atmosphere.SchemaPublisher.publish">publish</a></li> ··· 423 423 <div class="sourceCode" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>atmosphere.SchemaPublisher(client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 424 424 <p>Publishes PackableSample schemas to ATProto.</p> 425 425 <p>This class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.</p> 426 - <section id="example" class="level2 doc-section doc-section-example"> 427 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 428 - <p>::</p> 429 - <pre><code>&gt;&gt;&gt; @atdata.packable 430 - ... class MySample: 431 - ... image: NDArray 432 - ... label: str 433 - ... 434 - &gt;&gt;&gt; client = AtmosphereClient() 435 - &gt;&gt;&gt; client.login("handle", "password") 436 - &gt;&gt;&gt; 437 - &gt;&gt;&gt; publisher = SchemaPublisher(client) 438 - &gt;&gt;&gt; uri = publisher.publish(MySample, version="1.0.0") 439 - &gt;&gt;&gt; print(uri) 440 - at://did:plc:.../ac.foundation.dataset.sampleSchema/...</code></pre> 426 + <section id="examples" class="level2 doc-section doc-section-examples"> 427 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 428 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>atdata.packable</span> 429 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> MySample:</span> 430 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... image: NDArray</span> 431 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... label: <span class="bu">str</span></span> 432 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 433 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client <span class="op">=</span> AtmosphereClient()</span> 434 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> client.login(<span class="st">"handle"</span>, <span class="st">"password"</span>)</span> 435 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 436 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> publisher <span class="op">=</span> SchemaPublisher(client)</span> 437 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri <span class="op">=</span> publisher.publish(MySample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 438 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(uri)</span> 439 + <span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a>at:<span class="op">//</span>did:plc:...<span class="op">/</span>ac.foundation.dataset.sampleSchema<span class="op">/</span>...</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 441 440 </section> 442 441 <section id="methods" class="level2"> 443 442 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+6 -7
docs/api/URLSource.html
··· 403 403 <li><a href="#atdata.URLSource" id="toc-atdata.URLSource" class="nav-link active" data-scroll-target="#atdata.URLSource">URLSource</a> 404 404 <ul class="collapse"> 405 405 <li><a href="#attributes" id="toc-attributes" class="nav-link" data-scroll-target="#attributes">Attributes</a></li> 406 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 406 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 407 407 <li><a href="#methods" id="toc-methods" class="nav-link" data-scroll-target="#methods">Methods</a> 408 408 <ul class="collapse"> 409 409 <li><a href="#atdata.URLSource.list_shards" id="toc-atdata.URLSource.list_shards" class="nav-link" data-scroll-target="#atdata.URLSource.list_shards">list_shards</a></li> ··· 445 445 </tbody> 446 446 </table> 447 447 </section> 448 - <section id="example" class="level2 doc-section doc-section-example"> 449 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 450 - <p>::</p> 451 - <pre><code>&gt;&gt;&gt; source = URLSource("https://example.com/train-{000..009}.tar") 452 - &gt;&gt;&gt; for shard_id, stream in source.shards: 453 - ... print(f"Streaming {shard_id}")</code></pre> 448 + <section id="examples" class="level2 doc-section doc-section-examples"> 449 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 450 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> source <span class="op">=</span> URLSource(<span class="st">"https://example.com/train-{000..009}.tar"</span>)</span> 451 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> shard_id, stream <span class="kw">in</span> source.shards:</span> 452 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(<span class="ss">f"Streaming </span><span class="sc">{</span>shard_id<span class="sc">}</span><span class="ss">"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 454 453 </section> 455 454 <section id="methods" class="level2"> 456 455 <h2 class="anchored" data-anchor-id="methods">Methods</h2>
+19 -20
docs/api/load_dataset.html
··· 405 405 <li><a href="#parameters" id="toc-parameters" class="nav-link" data-scroll-target="#parameters">Parameters</a></li> 406 406 <li><a href="#returns" id="toc-returns" class="nav-link" data-scroll-target="#returns">Returns</a></li> 407 407 <li><a href="#raises" id="toc-raises" class="nav-link" data-scroll-target="#raises">Raises</a></li> 408 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 408 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 409 409 </ul></li> 410 410 </ul> 411 411 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/api/load_dataset.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 540 540 </tbody> 541 541 </table> 542 542 </section> 543 - <section id="example" class="level2 doc-section doc-section-example"> 544 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 545 - <p>::</p> 546 - <pre><code>&gt;&gt;&gt; # Load without type - get DictSample for exploration 547 - &gt;&gt;&gt; ds = load_dataset("./data/train.tar", split="train") 548 - &gt;&gt;&gt; for sample in ds.ordered(): 549 - ... print(sample.keys()) # Explore fields 550 - ... print(sample["text"]) # Dict-style access 551 - ... print(sample.label) # Attribute access 552 - &gt;&gt;&gt; 553 - &gt;&gt;&gt; # Convert to typed schema 554 - &gt;&gt;&gt; typed_ds = ds.as_type(TextData) 555 - &gt;&gt;&gt; 556 - &gt;&gt;&gt; # Or load with explicit type directly 557 - &gt;&gt;&gt; train_ds = load_dataset("./data/train-*.tar", TextData, split="train") 558 - &gt;&gt;&gt; 559 - &gt;&gt;&gt; # Load from index with auto-type resolution 560 - &gt;&gt;&gt; index = LocalIndex() 561 - &gt;&gt;&gt; ds = load_dataset("@local/my-dataset", index=index, split="train")</code></pre> 543 + <section id="examples" class="level2 doc-section doc-section-examples"> 544 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 545 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Load without type - get DictSample for exploration</span></span> 546 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> load_dataset(<span class="st">"./data/train.tar"</span>, split<span class="op">=</span><span class="st">"train"</span>)</span> 547 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="cf">for</span> sample <span class="kw">in</span> ds.ordered():</span> 548 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample.keys()) <span class="co"># Explore fields</span></span> 549 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample[<span class="st">"text"</span>]) <span class="co"># Dict-style access</span></span> 550 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>... <span class="bu">print</span>(sample.label) <span class="co"># Attribute access</span></span> 551 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 552 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Convert to typed schema</span></span> 553 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> typed_ds <span class="op">=</span> ds.as_type(TextData)</span> 554 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 555 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Or load with explicit type directly</span></span> 556 + <span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> train_ds <span class="op">=</span> load_dataset(<span class="st">"./data/train-*.tar"</span>, TextData, split<span class="op">=</span><span class="st">"train"</span>)</span> 557 + <span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 558 + <span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Load from index with auto-type resolution</span></span> 559 + <span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index <span class="op">=</span> LocalIndex()</span> 560 + <span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ds <span class="op">=</span> load_dataset(<span class="st">"@local/my-dataset"</span>, index<span class="op">=</span>index, split<span class="op">=</span><span class="st">"train"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 562 561 563 562 564 563 </section>
+26 -29
docs/api/local.Index.html
··· 766 766 </tbody> 767 767 </table> 768 768 </section> 769 - <section id="example" class="level4 doc-section doc-section-example"> 770 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h4> 771 - <p>::</p> 772 - <pre><code>&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths: 773 - &gt;&gt;&gt; from local.MySample_1_0_0 import MySample 774 - &gt;&gt;&gt; 775 - &gt;&gt;&gt; # This gives full IDE autocomplete: 776 - &gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample) 777 - &gt;&gt;&gt; sample = DecodedType(text="hello", value=42) # IDE knows signature!</code></pre> 769 + <section id="examples" class="level4 doc-section doc-section-examples"> 770 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h4> 771 + <div class="sourceCode" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># After enabling auto_stubs and configuring IDE extraPaths:</span></span> 772 + <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="im">from</span> local.MySample_1_0_0 <span class="im">import</span> MySample</span> 773 + <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 774 + <span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># This gives full IDE autocomplete:</span></span> 775 + <span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> DecodedType <span class="op">=</span> index.decode_schema_as(ref, MySample)</span> 776 + <span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> sample <span class="op">=</span> DecodedType(text<span class="op">=</span><span class="st">"hello"</span>, value<span class="op">=</span><span class="dv">42</span>) <span class="co"># IDE knows signature!</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 778 777 </section> 779 778 <section id="note" class="level4 doc-section doc-section-note"> 780 779 <h4 class="doc-section doc-section-note anchored" data-anchor-id="note">Note</h4> ··· 1023 1022 </tbody> 1024 1023 </table> 1025 1024 </section> 1026 - <section id="example-1" class="level4 doc-section doc-section-example"> 1027 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-1">Example</h4> 1028 - <p>::</p> 1029 - <pre><code>&gt;&gt;&gt; index = LocalIndex(auto_stubs=True) 1030 - &gt;&gt;&gt; ref = index.publish_schema(MySample, version="1.0.0") 1031 - &gt;&gt;&gt; index.load_schema(ref) 1032 - &gt;&gt;&gt; print(index.get_import_path(ref)) 1033 - local.MySample_1_0_0 1034 - &gt;&gt;&gt; # Then in your code: 1035 - &gt;&gt;&gt; # from local.MySample_1_0_0 import MySample</code></pre> 1025 + <section id="examples-1" class="level4 doc-section doc-section-examples"> 1026 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-1">Examples</h4> 1027 + <div class="sourceCode" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index <span class="op">=</span> LocalIndex(auto_stubs<span class="op">=</span><span class="va">True</span>)</span> 1028 + <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> ref <span class="op">=</span> index.publish_schema(MySample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 1029 + <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index.load_schema(ref)</span> 1030 + <span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(index.get_import_path(ref))</span> 1031 + <span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a>local.MySample_1_0_0</span> 1032 + <span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Then in your code:</span></span> 1033 + <span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># from local.MySample_1_0_0 import MySample</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 1036 1034 </section> 1037 1035 </section> 1038 1036 <section id="atdata.local.Index.get_schema" class="level3"> ··· 1389 1387 </tbody> 1390 1388 </table> 1391 1389 </section> 1392 - <section id="example-2" class="level4 doc-section doc-section-example"> 1393 - <h4 class="doc-section doc-section-example anchored" data-anchor-id="example-2">Example</h4> 1394 - <p>::</p> 1395 - <pre><code>&gt;&gt;&gt; # Load and use immediately 1396 - &gt;&gt;&gt; MyType = index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 1397 - &gt;&gt;&gt; sample = MyType(name="hello", value=42) 1398 - &gt;&gt;&gt; 1399 - &gt;&gt;&gt; # Or access later via namespace 1400 - &gt;&gt;&gt; index.load_schema("atdata://local/sampleSchema/OtherType@1.0.0") 1401 - &gt;&gt;&gt; other = index.types.OtherType(data="test")</code></pre> 1390 + <section id="examples-2" class="level4 doc-section doc-section-examples"> 1391 + <h4 class="doc-section doc-section-examples anchored" data-anchor-id="examples-2">Examples</h4> 1392 + <div class="sourceCode" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Load and use immediately</span></span> 1393 + <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> MyType <span class="op">=</span> index.load_schema(<span class="st">"atdata://local/sampleSchema/MySample@1.0.0"</span>)</span> 1394 + <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> sample <span class="op">=</span> MyType(name<span class="op">=</span><span class="st">"hello"</span>, value<span class="op">=</span><span class="dv">42</span>)</span> 1395 + <span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 1396 + <span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Or access later via namespace</span></span> 1397 + <span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index.load_schema(<span class="st">"atdata://local/sampleSchema/OtherType@1.0.0"</span>)</span> 1398 + <span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> other <span class="op">=</span> index.types.OtherType(data<span class="op">=</span><span class="st">"test"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 1402 1399 </section> 1403 1400 </section> 1404 1401 <section id="atdata.local.Index.publish_schema" class="level3">
+11 -12
docs/api/packable.html
··· 474 474 </section> 475 475 <section id="examples" class="level2 doc-section doc-section-examples"> 476 476 <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 477 - <p>This is a test of the functionality::</p> 478 - <pre><code>@packable 479 - class MyData: 480 - name: str 481 - values: NDArray 482 - 483 - sample = MyData(name="test", values=np.array([1, 2, 3])) 484 - bytes_data = sample.packed 485 - restored = MyData.from_bytes(bytes_data) 486 - 487 - # Works with Packable-typed APIs 488 - index.publish_schema(MyData, version="1.0.0") # Type-safe</code></pre> 477 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="op">@</span>packable</span> 478 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>... <span class="kw">class</span> MyData:</span> 479 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>... name: <span class="bu">str</span></span> 480 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>... values: NDArray</span> 481 + <span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a>...</span> 482 + <span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> sample <span class="op">=</span> MyData(name<span class="op">=</span><span class="st">"test"</span>, values<span class="op">=</span>np.array([<span class="dv">1</span>, <span class="dv">2</span>, <span class="dv">3</span>]))</span> 483 + <span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> bytes_data <span class="op">=</span> sample.packed</span> 484 + <span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> restored <span class="op">=</span> MyData.from_bytes(bytes_data)</span> 485 + <span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span></span> 486 + <span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="co"># Works with Packable-typed APIs</span></span> 487 + <span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> index.publish_schema(MyData, version<span class="op">=</span><span class="st">"1.0.0"</span>) <span class="co"># Type-safe</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 489 488 490 489 491 490 </section>
+7 -8
docs/api/promote_to_atmosphere.html
··· 405 405 <li><a href="#parameters" id="toc-parameters" class="nav-link" data-scroll-target="#parameters">Parameters</a></li> 406 406 <li><a href="#returns" id="toc-returns" class="nav-link" data-scroll-target="#returns">Returns</a></li> 407 407 <li><a href="#raises" id="toc-raises" class="nav-link" data-scroll-target="#raises">Raises</a></li> 408 - <li><a href="#example" id="toc-example" class="nav-link" data-scroll-target="#example">Example</a></li> 408 + <li><a href="#examples" id="toc-examples" class="nav-link" data-scroll-target="#examples">Examples</a></li> 409 409 </ul></li> 410 410 </ul> 411 411 <div class="toc-actions"><ul><li><a href="https://github.com/your-org/atdata/edit/main/api/promote_to_atmosphere.qmd" class="toc-action"><i class="bi bi-github"></i>Edit this page</a></li><li><a href="https://github.com/your-org/atdata/issues/new" class="toc-action"><i class="bi empty"></i>Report an issue</a></li></ul></div></nav> ··· 538 538 </tbody> 539 539 </table> 540 540 </section> 541 - <section id="example" class="level2 doc-section doc-section-example"> 542 - <h2 class="doc-section doc-section-example anchored" data-anchor-id="example">Example</h2> 543 - <p>::</p> 544 - <pre><code>&gt;&gt;&gt; entry = local_index.get_dataset("mnist-train") 545 - &gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client) 546 - &gt;&gt;&gt; print(uri) 547 - at://did:plc:abc123/ac.foundation.dataset.datasetIndex/...</code></pre> 541 + <section id="examples" class="level2 doc-section doc-section-examples"> 542 + <h2 class="doc-section doc-section-examples anchored" data-anchor-id="examples">Examples</h2> 543 + <div class="sourceCode" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> entry <span class="op">=</span> local_index.get_dataset(<span class="st">"mnist-train"</span>)</span> 544 + <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span> 545 + <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">print</span>(uri)</span> 546 + <span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>at:<span class="op">//</span>did:plc:abc123<span class="op">/</span>ac.foundation.dataset.datasetIndex<span class="op">/</span>...</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 548 547 549 548 550 549 </section>
+6 -6
docs/index.html
··· 666 666 <section id="define-a-sample-type" class="level3"> 667 667 <h3 class="anchored" data-anchor-id="define-a-sample-type">1. Define a Sample Type</h3> 668 668 <p>The <code>@packable</code> decorator creates a serializable dataclass:</p> 669 - <div id="ce9c727b" class="cell"> 669 + <div id="bde2db89" class="cell"> 670 670 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 671 671 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 672 672 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 681 681 <section id="create-and-write-samples" class="level3"> 682 682 <h3 class="anchored" data-anchor-id="create-and-write-samples">2. Create and Write Samples</h3> 683 683 <p>Use WebDataset’s standard TarWriter:</p> 684 - <div id="a2250059" class="cell"> 684 + <div id="4fadf976" class="cell"> 685 685 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 686 686 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 687 687 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> ··· 701 701 <section id="load-and-iterate-with-type-safety" class="level3"> 702 702 <h3 class="anchored" data-anchor-id="load-and-iterate-with-type-safety">3. 
Load and Iterate with Type Safety</h3> 703 703 <p>The generic <code>Dataset[T]</code> provides typed access:</p> 704 - <div id="86e5860f" class="cell"> 704 + <div id="908f965e" class="cell"> 705 705 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-000000.tar"</span>)</span> 706 706 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 707 707 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> ··· 716 716 <section id="team-storage-with-redis-s3" class="level3"> 717 717 <h3 class="anchored" data-anchor-id="team-storage-with-redis-s3">Team Storage with Redis + S3</h3> 718 718 <p>When you’re ready to share with your team:</p> 719 - <div id="7a91c337" class="cell"> 719 + <div id="ac419917" class="cell"> 720 720 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 721 721 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 722 722 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to team infrastructure</span></span> ··· 740 740 <section id="federation-with-atproto" class="level3"> 741 741 <h3 class="anchored" data-anchor-id="federation-with-atproto">Federation with ATProto</h3> 742 742 <p>For public or cross-organization sharing:</p> 743 - <div id="1d65da04" class="cell"> 743 + <div id="de1b2de2" class="cell"> 744 744 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span> 745 745 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span> 746 746 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 762 762 <section id="huggingface-style-loading" class="level2"> 763 763 <h2 class="anchored" data-anchor-id="huggingface-style-loading">HuggingFace-Style Loading</h2> 764 764 <p>For convenient access to datasets:</p> 765 - <div id="f21a38dc" class="cell"> 765 + <div id="b7319285" class="cell"> 766 766 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 767 767 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 768 768 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local files</span></span>
+14 -14
docs/reference/architecture.html
··· 657 657 <section id="packablesample-the-foundation" class="level3"> 658 658 <h3 class="anchored" data-anchor-id="packablesample-the-foundation">PackableSample: The Foundation</h3> 659 659 <p>Everything in atdata starts with <strong>PackableSample</strong>—a base class that makes Python dataclasses serializable with msgpack:</p> 660 - <div id="89a48627" class="cell"> 660 + <div id="935164ef" class="cell"> 661 661 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 662 662 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 663 663 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> image: NDArray <span class="co"># Automatically converted to/from bytes</span></span> ··· 680 680 <section id="dataset-typed-iteration" class="level3"> 681 681 <h3 class="anchored" data-anchor-id="dataset-typed-iteration">Dataset: Typed Iteration</h3> 682 682 <p>The <code>Dataset[T]</code> class wraps WebDataset tar archives with type information:</p> 683 - <div id="3a0ab003" class="cell"> 683 + <div id="cb4b1f5d" class="cell"> 684 684 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 685 685 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 686 686 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> ··· 704 704 <section id="samplebatch-automatic-aggregation" class="level3"> 705 705 <h3 class="anchored" data-anchor-id="samplebatch-automatic-aggregation">SampleBatch: Automatic Aggregation</h3> 706 706 <p>When iterating with <code>batch_size</code>, atdata returns <code>SampleBatch[T]</code> objects that aggregate sample attributes:</p> 707 - <div id="c43dd2d8" class="cell"> 707 + <div id="dac750c2" class="cell"> 708 708 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>batch <span class="op">=</span> SampleBatch[ImageSample](samples)</span> 709 709 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 710 710 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># NDArray fields → stacked numpy array with batch dimension</span></span> ··· 718 718 <section id="lens-schema-transformations" class="level3"> 719 719 <h3 class="anchored" data-anchor-id="lens-schema-transformations">Lens: Schema Transformations</h3> 720 720 <p>Lenses enable viewing datasets through different schemas without duplicating data:</p> 721 - <div id="2dc95de3" class="cell"> 721 + <div id="06cb19a2" class="cell"> 722 722 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 723 723 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" 
tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 724 724 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> ··· 755 755 <li>WebDataset tar shards</li> 756 756 <li>Any S3-compatible storage (AWS, MinIO, Cloudflare R2)</li> 757 757 </ul> 758 - <div id="a2bdd84b" class="cell"> 758 + <div id="046d8bbf" class="cell"> 759 759 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>creds, bucket<span class="op">=</span><span class="st">"datasets"</span>)</span> 760 760 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> 761 761 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 783 783 <li>Store actual data shards as ATProto blobs</li> 784 784 <li>Fully decentralized—no external dependencies</li> 785 785 </ul> 786 - <div id="d614873c" class="cell"> 786 + <div id="eb3771a4" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 788 788 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> 789 789 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 801 801 <section id="abstractindex" class="level3"> 802 802 <h3 class="anchored" data-anchor-id="abstractindex">AbstractIndex</h3> 803 803 <p>Common interface for both <code>LocalIndex</code> and <code>AtmosphereIndex</code>:</p> 804 - <div id="cb928d04" class="cell"> 804 + <div id="d95e2c44" class="cell"> 805 805 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_dataset(index: AbstractIndex, name: <span class="bu">str</span>):</span> 806 806 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> entry <span class="op">=</span> index.get_dataset(name)</span> 807 807 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> schema <span class="op">=</span> index.decode_schema(entry.schema_ref)</span> ··· 817 817 <section id="abstractdatastore" class="level3"> 818 818 <h3 class="anchored" data-anchor-id="abstractdatastore">AbstractDataStore</h3> 819 819 <p>Common interface for <code>S3DataStore</code> and <code>PDSBlobStore</code>:</p> 820 - <div id="e7b39bb2" class="cell"> 820 + <div id="d5c8d1d6" class="cell"> 821 821 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_to_store(store: AbstractDataStore, dataset: Dataset):</span> 822 822 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> urls <span class="op">=</span> store.write_shards(dataset, prefix<span class="op">=</span><span class="st">"data/v1"</span>)</span> 823 823 <span id="cb9-3"><a 
href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># Works with S3 or PDS blob storage</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 838 838 <p>A typical workflow progresses through three stages:</p> 839 839 <section id="stage-1-local-development" class="level3"> 840 840 <h3 class="anchored" data-anchor-id="stage-1-local-development">Stage 1: Local Development</h3> 841 - <div id="42b455c5" class="cell"> 841 + <div id="d075eb1f" class="cell"> 842 842 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define type and create samples</span></span> 843 843 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 844 844 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MySample:</span> ··· 856 856 </section> 857 857 <section id="stage-2-team-storage" class="level3"> 858 858 <h3 class="anchored" data-anchor-id="stage-2-team-storage">Stage 2: Team Storage</h3> 859 - <div id="dca2cb0c" class="cell"> 859 + <div id="006a8036" class="cell"> 860 860 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Set up team storage</span></span> 861 861 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(credentials<span class="op">=</span>team_creds, bucket<span class="op">=</span><span class="st">"team-datasets"</span>)</span> 862 862 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex(data_store<span class="op">=</span>store)</span> ··· 871 871 </section> 872 872 <section id="stage-3-federation" class="level3"> 873 873 <h3 class="anchored" data-anchor-id="stage-3-federation">Stage 3: Federation</h3> 874 - <div id="b001011a" class="cell"> 874 + <div id="8964a63d" class="cell"> 875 875 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Promote to atmosphere</span></span> 876 876 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 877 877 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"handle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> ··· 904 904 <section id="custom-datasources" class="level3"> 905 905 <h3 class="anchored" data-anchor-id="custom-datasources">Custom DataSources</h3> 906 906 <p>Implement the <code>DataSource</code> protocol to add new storage backends:</p> 907 - <div id="b258c8c1" class="cell"> 907 + <div id="2aa480fc" class="cell"> 908 908 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> MyCustomSource:</span> 909 909 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> list_shards(<span 
class="va">self</span>) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]: ...</span> 910 910 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> <span class="kw">def</span> open_shard(<span class="va">self</span>, shard_id: <span class="bu">str</span>) <span class="op">-&gt;</span> IO[<span class="bu">bytes</span>]: ...</span> ··· 916 916 <section id="custom-lenses" class="level3"> 917 917 <h3 class="anchored" data-anchor-id="custom-lenses">Custom Lenses</h3> 918 918 <p>Register transformations between any PackableSample types:</p> 919 - <div id="69b20972" class="cell"> 919 + <div id="d65d4dda" class="cell"> 920 920 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 921 921 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> my_transform(src: SourceType) <span class="op">-&gt;</span> TargetType:</span> 922 922 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> TargetType(...)</span> ··· 929 929 <section id="schema-extensions" class="level3"> 930 930 <h3 class="anchored" data-anchor-id="schema-extensions">Schema Extensions</h3> 931 931 <p>The schema format supports custom metadata for domain-specific needs:</p> 932 - <div id="acd9c2fa" class="cell"> 932 + <div id="e2b76dd7" class="cell"> 933 933 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>index.publish_schema(</span> 934 934 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> MySample,</span> 935 935 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> version<span class="op">=</span><span class="st">"1.0.0"</span>,</span>
+22 -22
docs/reference/atmosphere.html
··· 626 626 <section id="atmosphereclient" class="level2"> 627 627 <h2 class="anchored" data-anchor-id="atmosphereclient">AtmosphereClient</h2> 628 628 <p>The client handles authentication and record operations:</p> 629 - <div id="397fcc7e" class="cell"> 629 + <div id="a55b5480" class="cell"> 630 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 631 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 632 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 653 653 <section id="session-management" class="level3"> 654 654 <h3 class="anchored" data-anchor-id="session-management">Session Management</h3> 655 655 <p>Save and restore sessions to avoid re-authentication:</p> 656 - <div id="9b0dd285" class="cell"> 656 + <div id="546ad519" class="cell"> 657 657 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export session for later</span></span> 658 658 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>session_string <span class="op">=</span> client.export_session()</span> 659 659 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 665 665 <section id="custom-pds" class="level3"> 666 666 <h3 class="anchored" data-anchor-id="custom-pds">Custom PDS</h3> 667 667 <p>Connect to a custom PDS instead of bsky.social:</p> 668 - <div id="4f231236" class="cell"> 668 + <div id="02895206" class="cell"> 669 669 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient(base_url<span class="op">=</span><span class="st">"https://pds.example.com"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 670 670 </div> 671 671 </section> ··· 673 673 <section id="pdsblobstore" class="level2"> 674 674 <h2 class="anchored" data-anchor-id="pdsblobstore">PDSBlobStore</h2> 675 675 <p>Store dataset shards as ATProto blobs for fully decentralized storage:</p> 676 - <div id="2a2824bd" class="cell"> 676 + <div id="ddcafefe" class="cell"> 677 677 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, PDSBlobStore</span> 678 678 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 679 679 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 696 696 <section id="size-limits" class="level3"> 697 697 <h3 class="anchored" data-anchor-id="size-limits">Size Limits</h3> 698 698 <p>PDS blobs typically have size limits (often 50MB-5GB depending on the PDS). 
Use <code>maxcount</code> and <code>maxsize</code> parameters to control shard sizes:</p> 699 - <div id="f36e6565" class="cell"> 699 + <div id="8d86e8e3" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 701 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 702 702 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> prefix<span class="op">=</span><span class="st">"large-data/v1"</span>,</span> ··· 709 709 <section id="blobsource" class="level2"> 710 710 <h2 class="anchored" data-anchor-id="blobsource">BlobSource</h2> 711 711 <p>Read datasets stored as PDS blobs:</p> 712 - <div id="f533908b" class="cell"> 712 + <div id="bc781aa8" class="cell"> 713 713 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 714 714 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 715 715 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From blob references</span></span> ··· 730 730 <section id="atmosphereindex" class="level2"> 731 731 <h2 class="anchored" data-anchor-id="atmosphereindex">AtmosphereIndex</h2> 732 732 <p>The unified interface for ATProto operations, implementing the AbstractIndex protocol:</p> 733 - <div id="a20f035d" class="cell"> 733 + <div id="5a53b688" class="cell"> 734 734 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex, PDSBlobStore</span> 735 735 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 736 736 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> ··· 745 745 </div> 746 746 <section id="publishing-schemas" class="level3"> 747 747 <h3 class="anchored" data-anchor-id="publishing-schemas">Publishing Schemas</h3> 748 - <div id="7b342ab7" class="cell"> 748 + <div id="582e7fb0" class="cell"> 749 749 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 750 750 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 751 751 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 766 766 </section> 767 767 <section id="publishing-datasets" class="level3"> 768 768 <h3 class="anchored" data-anchor-id="publishing-datasets">Publishing Datasets</h3> 769 - <div id="b328fbe1" class="cell"> 769 + <div id="f658a5cc" class="cell"> 770 770 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> 
atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 771 771 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 772 772 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> ··· 784 784 </section> 785 785 <section id="listing-and-retrieving" class="level3"> 786 786 <h3 class="anchored" data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 787 - <div id="7aa3400f" class="cell"> 787 + <div id="bc1fd369" class="cell"> 788 788 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># List your datasets</span></span> 789 789 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.list_datasets():</span> 790 790 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 810 810 <p>For more control, use the individual publisher classes:</p> 811 811 <section id="schemapublisher" class="level3"> 812 812 <h3 class="anchored" data-anchor-id="schemapublisher">SchemaPublisher</h3> 813 - <div id="5573ad6c" class="cell"> 813 + <div id="b4deca81" class="cell"> 814 814 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaPublisher</span> 815 815 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span> 816 816 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> SchemaPublisher(client)</span> ··· 826 826 </section> 827 827 <section id="datasetpublisher" class="level3"> 828 828 <h3 class="anchored" data-anchor-id="datasetpublisher">DatasetPublisher</h3> 829 - <div id="aa1667c5" class="cell"> 829 + <div id="626a0256" class="cell"> 830 830 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetPublisher</span> 831 831 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a></span> 832 832 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> DatasetPublisher(client)</span> ··· 846 846 <p>There are two approaches to storing data as ATProto blobs:</p> 847 847 <p><strong>Approach 1: PDSBlobStore (Recommended)</strong></p> 848 848 <p>Use <code>PDSBlobStore</code> with <code>AtmosphereIndex</code> for automatic shard management:</p> 849 - <div id="b92c0516" class="cell"> 849 + <div id="ff58eaa6" class="cell"> 850 850 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span 
class="im">from</span> atdata.atmosphere <span class="im">import</span> PDSBlobStore, AtmosphereIndex</span> 851 851 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a></span> 852 852 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> ··· 865 865 </div> 866 866 <p><strong>Approach 2: Manual Blob Publishing</strong></p> 867 867 <p>For more control, use <code>DatasetPublisher.publish_with_blobs()</code> directly:</p> 868 - <div id="32c6279b" class="cell"> 868 + <div id="46c8ca86" class="cell"> 869 869 <div class="sourceCode cell-code" id="cb15"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> io</span> 870 870 <span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 871 871 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a></span> ··· 885 885 <span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 886 886 </div> 887 887 <p><strong>Loading Blob-Stored Datasets</strong></p> 888 - <div id="33b302ee" class="cell"> 888 + <div id="01f518ca" class="cell"> 889 889 <div class="sourceCode cell-code" id="cb16"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 890 890 <span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> BlobSource</span> 891 891 <span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a></span> ··· 909 909 </section> 910 910 <section id="lenspublisher" class="level3"> 911 911 <h3 class="anchored" data-anchor-id="lenspublisher">LensPublisher</h3> 912 - <div id="4e8e0489" class="cell"> 912 + <div id="7733653d" class="cell"> 913 913 <div class="sourceCode cell-code" id="cb17"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> LensPublisher</span> 914 914 <span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a></span> 915 915 <span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a>publisher <span class="op">=</span> LensPublisher(client)</span> ··· 952 952 <p>For direct access to records, use the loader classes:</p> 953 953 <section id="schemaloader" class="level3"> 954 954 <h3 class="anchored" data-anchor-id="schemaloader">SchemaLoader</h3> 955 - <div id="b6366add" class="cell"> 955 + <div id="7ea59338" class="cell"> 956 956 <div class="sourceCode cell-code" id="cb18"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> SchemaLoader</span> 957 957 <span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span> 958 958 <span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a>loader <span 
class="op">=</span> SchemaLoader(client)</span> ··· 968 968 </section> 969 969 <section id="datasetloader" class="level3"> 970 970 <h3 class="anchored" data-anchor-id="datasetloader">DatasetLoader</h3> 971 - <div id="a0cdfb2e" class="cell"> 971 + <div id="18fc9b5a" class="cell"> 972 972 <div class="sourceCode cell-code" id="cb19"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> DatasetLoader</span> 973 973 <span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span> 974 974 <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> DatasetLoader(client)</span> ··· 996 996 </section> 997 997 <section id="lensloader" class="level3"> 998 998 <h3 class="anchored" data-anchor-id="lensloader">LensLoader</h3> 999 - <div id="011735c6" class="cell"> 999 + <div id="fa26f4fb" class="cell"> 1000 1000 <div class="sourceCode cell-code" id="cb20"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> LensLoader</span> 1001 1001 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span> 1002 1002 <span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a>loader <span class="op">=</span> LensLoader(client)</span> ··· 1021 1021 <section id="at-uris" class="level2"> 1022 1022 <h2 class="anchored" data-anchor-id="at-uris">AT URIs</h2> 1023 1023 <p>ATProto records are identified by AT URIs:</p> 1024 - <div id="87eca79a" class="cell"> 1024 + <div id="892f4f44" class="cell"> 1025 1025 <div class="sourceCode cell-code" id="cb21"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtUri</span> 1026 1026 <span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a></span> 1027 1027 <span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Parse an AT URI</span></span> ··· 1088 1088 <section id="complete-example" class="level2"> 1089 1089 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 1090 1090 <p>This example shows the full workflow using <code>PDSBlobStore</code> for decentralized storage:</p> 1091 - <div id="ef08c41a" class="cell"> 1091 + <div id="8ea58305" class="cell"> 1092 1092 <div class="sourceCode cell-code" id="cb22"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 1093 1093 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 1094 1094 <span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 1159 1159 <span id="cb22-68"><a href="#cb22-68" aria-hidden="true" tabindex="-1"></a> <span class="cf">break</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 1160 1160 </div> 1161 1161 <p>For external URL 
storage (without <code>PDSBlobStore</code>):</p> 1162 - <div id="5a95b798" class="cell"> 1162 + <div id="1f55ae9a" class="cell"> 1163 1163 <div class="sourceCode cell-code" id="cb23"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Use AtmosphereIndex without data_store</span></span> 1164 1164 <span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client)</span> 1165 1165 <span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a></span>
+13 -13
docs/reference/datasets.html
··· 603 603 <p>The <code>Dataset</code> class provides typed iteration over WebDataset tar files with automatic batching and lens transformations.</p> 604 604 <section id="creating-a-dataset" class="level2"> 605 605 <h2 class="anchored" data-anchor-id="creating-a-dataset">Creating a Dataset</h2> 606 - <div id="dc6c764f" class="cell"> 606 + <div id="ef6e2916" class="cell"> 607 607 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 608 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 609 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 626 626 <section id="url-source-default" class="level3"> 627 627 <h3 class="anchored" data-anchor-id="url-source-default">URL Source (default)</h3> 628 628 <p>When you pass a string to <code>Dataset</code>, it automatically wraps it in a <code>URLSource</code>:</p> 629 - <div id="43d34823" class="cell"> 629 + <div id="9cb82ec3" class="cell"> 630 630 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># These are equivalent:</span></span> 631 631 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 632 632 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](atdata.URLSource(<span class="st">"data-{000000..000009}.tar"</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 635 635 <section id="s3-source" class="level3"> 636 636 <h3 class="anchored" data-anchor-id="s3-source">S3 Source</h3> 637 637 <p>For private S3 buckets or S3-compatible storage (Cloudflare R2, MinIO), use <code>S3Source</code>:</p> 638 - <div id="665f191c" class="cell"> 638 + <div id="c2cb3b94" class="cell"> 639 639 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># From explicit credentials</span></span> 640 640 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> atdata.S3Source(</span> 641 641 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> bucket<span class="op">=</span><span class="st">"my-bucket"</span>,</span> ··· 673 673 <section id="ordered-iteration" class="level3"> 674 674 <h3 class="anchored" data-anchor-id="ordered-iteration">Ordered Iteration</h3> 675 675 <p>Iterate through samples in their original order:</p> 676 - <div id="f610e2e6" class="cell"> 676 + <div id="d3f45d97" class="cell"> 677 677 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># With batching (default batch_size=1)</span></span> 678 678 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span 
class="kw">in</span> dataset.ordered(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 679 679 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image <span class="co"># numpy array (32, H, W, C)</span></span> ··· 687 687 <section id="shuffled-iteration" class="level3"> 688 688 <h3 class="anchored" data-anchor-id="shuffled-iteration">Shuffled Iteration</h3> 689 689 <p>Iterate with randomized order at both shard and sample levels:</p> 690 - <div id="31fd18b7" class="cell"> 690 + <div id="fb5204d1" class="cell"> 691 691 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 692 692 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled</span></span> 693 693 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> process(batch)</span> ··· 718 718 <section id="samplebatch" class="level2"> 719 719 <h2 class="anchored" data-anchor-id="samplebatch">SampleBatch</h2> 720 720 <p>When iterating with a <code>batch_size</code>, each iteration yields a <code>SampleBatch</code> with automatic attribute aggregation.</p> 721 - <div id="a62e95b5" class="cell"> 721 + <div id="c1093b41" class="cell"> 722 722 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 723 723 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> Sample:</span> 724 724 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># shape (256,)</span></span> ··· 738 738 <section id="type-transformations-with-lenses" class="level2"> 739 739 <h2 class="anchored" data-anchor-id="type-transformations-with-lenses">Type Transformations with Lenses</h2> 740 740 <p>View a dataset through a different sample type using registered lenses:</p> 741 - <div id="18cd1b39" class="cell"> 741 + <div id="044cffe6" class="cell"> 742 742 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 743 743 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span> 744 744 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> label: <span class="bu">str</span></span> ··· 760 760 <section id="shard-list" class="level3"> 761 761 <h3 class="anchored" data-anchor-id="shard-list">Shard List</h3> 762 762 <p>Get the list of individual tar files:</p> 763 - <div id="06e7f4d9" class="cell"> 763 + <div id="ce9df6da" class="cell"> 764 764 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 765 765 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" 
tabindex="-1"></a>shards <span class="op">=</span> dataset.shard_list</span> 766 766 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="co"># ['data-000000.tar', 'data-000001.tar', ..., 'data-000009.tar']</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 769 769 <section id="metadata" class="level3"> 770 770 <h3 class="anchored" data-anchor-id="metadata">Metadata</h3> 771 771 <p>Datasets can have associated metadata from a URL:</p> 772 - <div id="4f4fd9f1" class="cell"> 772 + <div id="129c7a63" class="cell"> 773 773 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](</span> 774 774 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"data-{000000..000009}.tar"</span>,</span> 775 775 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> metadata_url<span class="op">=</span><span class="st">"https://example.com/metadata.msgpack"</span></span> ··· 783 783 <section id="writing-datasets" class="level2"> 784 784 <h2 class="anchored" data-anchor-id="writing-datasets">Writing Datasets</h2> 785 785 <p>Use WebDataset’s <code>TarWriter</code> or <code>ShardWriter</code> to create datasets:</p> 786 - <div id="62cb57ad" class="cell"> 786 + <div id="f29fbb2c" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 788 788 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 789 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 806 806 <section id="parquet-export" class="level2"> 807 807 <h2 class="anchored" data-anchor-id="parquet-export">Parquet Export</h2> 808 808 <p>Export dataset contents to parquet format:</p> 809 - <div id="d47c5a12" class="cell"> 809 + <div id="53060440" class="cell"> 810 810 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Export entire dataset</span></span> 811 811 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset.to_parquet(<span class="st">"output.parquet"</span>)</span> 812 812 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span> ··· 857 857 <section id="source" class="level3"> 858 858 <h3 class="anchored" data-anchor-id="source">Source</h3> 859 859 <p>Access the underlying <code>DataSource</code>:</p> 860 - <div id="e4036932" class="cell"> 860 + <div id="e315e899" class="cell"> 861 861 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[Sample](<span class="st">"data.tar"</span>)</span> 862 862 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> dataset.source <span class="co"># URLSource 
instance</span></span> 863 863 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(source.shard_list) <span class="co"># ['data.tar']</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 866 866 <section id="sample-type" class="level3"> 867 867 <h3 class="anchored" data-anchor-id="sample-type">Sample Type</h3> 868 868 <p>Get the type parameter used to create the dataset:</p> 869 - <div id="3718300e" class="cell"> 869 + <div id="c77919f7" class="cell"> 870 870 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"data.tar"</span>)</span> 871 871 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.sample_type) <span class="co"># &lt;class 'ImageSample'&gt;</span></span> 872 872 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(dataset.batch_type) <span class="co"># SampleBatch[ImageSample]</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+10 -10
docs/reference/lenses.html
··· 595 595 <section id="creating-a-lens" class="level2"> 596 596 <h2 class="anchored" data-anchor-id="creating-a-lens">Creating a Lens</h2> 597 597 <p>Use the <code>@lens</code> decorator to define a getter:</p> 598 - <div id="016cb2ac" class="cell"> 598 + <div id="390e6ab5" class="cell"> 599 599 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 600 600 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 601 601 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span> ··· 625 625 <section id="adding-a-putter" class="level2"> 626 626 <h2 class="anchored" data-anchor-id="adding-a-putter">Adding a Putter</h2> 627 627 <p>To enable bidirectional updates, add a putter:</p> 628 - <div id="e207472f" class="cell"> 628 + <div id="858a66a9" class="cell"> 629 629 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@simplify.putter</span></span> 630 630 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> simplify_put(view: SimpleSample, source: FullSample) <span class="op">-&gt;</span> FullSample:</span> 631 631 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> FullSample(</span> ··· 645 645 <section id="using-lenses-with-datasets" class="level2"> 646 646 <h2 class="anchored" data-anchor-id="using-lenses-with-datasets">Using Lenses with Datasets</h2> 647 647 <p>Lenses integrate with <code>Dataset.as_type()</code>:</p> 648 - <div id="744a9d2d" class="cell"> 648 + <div id="d9989ec0" class="cell"> 649 649 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[FullSample](<span class="st">"data-{000000..000009}.tar"</span>)</span> 650 650 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 651 651 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="co"># View through a different type</span></span> ··· 660 660 <section id="direct-lens-usage" class="level2"> 661 661 <h2 class="anchored" data-anchor-id="direct-lens-usage">Direct Lens Usage</h2> 662 662 <p>Lenses can also be called directly:</p> 663 - <div id="395185d3" class="cell"> 663 + <div id="33deda47" class="cell"> 664 664 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 665 665 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 666 666 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>full <span class="op">=</span> FullSample(</span> ··· 689 689 <div class="tab-content"> 690 690 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 691 691 <p>If you get a view and immediately put it back, the source is unchanged:</p> 692 - <div id="436b7dc7" 
class="cell"> 692 + <div id="fba0d404" class="cell"> 693 693 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>view <span class="op">=</span> lens.get(source)</span> 694 694 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.put(view, source) <span class="op">==</span> source</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 695 695 </div> 696 696 </div> 697 697 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 698 698 <p>If you put a view, getting it back yields that view:</p> 699 - <div id="4c6fcf1c" class="cell"> 699 + <div id="00a23052" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>updated <span class="op">=</span> lens.put(view, source)</span> 701 701 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> lens.get(updated) <span class="op">==</span> view</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 702 702 </div> 703 703 </div> 704 704 <div id="tabset-1-3" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-3-tab"> 705 705 <p>Putting twice is equivalent to putting once with the final value:</p> 706 - <div id="db0bfa78" class="cell"> 706 + <div id="2ca3d0ae" class="cell"> 707 707 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>result1 <span class="op">=</span> lens.put(v2, lens.put(v1, source))</span> 708 708 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>result2 <span class="op">=</span> lens.put(v2, source)</span> 709 709 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">assert</span> result1 <span class="op">==</span> result2</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> ··· 715 715 <section id="trivial-putter" class="level2"> 716 716 <h2 class="anchored" data-anchor-id="trivial-putter">Trivial Putter</h2> 717 717 <p>If no putter is defined, a trivial putter is used that ignores view updates:</p> 718 - <div id="d326f43d" class="cell"> 718 + <div id="e91f979c" class="cell"> 719 719 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.lens</span></span> 720 720 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> extract_label(src: FullSample) <span class="op">-&gt;</span> SimpleSample:</span> 721 721 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">return</span> SimpleSample(label<span class="op">=</span>src.label, confidence<span class="op">=</span>src.confidence)</span> ··· 729 729 <section id="lensnetwork-registry" class="level2"> 730 730 <h2 class="anchored" data-anchor-id="lensnetwork-registry">LensNetwork Registry</h2> 731 731 <p>The <code>LensNetwork</code> is a singleton that stores all 
registered lenses:</p> 732 - <div id="a3896ed6" class="cell"> 732 + <div id="fa3c962e" class="cell"> 733 733 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.lens <span class="im">import</span> LensNetwork</span> 734 734 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 735 735 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>network <span class="op">=</span> LensNetwork()</span> ··· 746 746 </section> 747 747 <section id="example-feature-extraction" class="level2"> 748 748 <h2 class="anchored" data-anchor-id="example-feature-extraction">Example: Feature Extraction</h2> 749 - <div id="bbd444b8" class="cell"> 749 + <div id="4af8323d" class="cell"> 750 750 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 751 751 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> RawSample:</span> 752 752 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> audio: NDArray</span>
+12 -12
docs/reference/load-dataset.html
··· 604 604 </section> 605 605 <section id="basic-usage" class="level2"> 606 606 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 607 - <div id="753f1d4b" class="cell"> 607 + <div id="44ca6bce" class="cell"> 608 608 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 609 609 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 610 610 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> ··· 627 627 <h2 class="anchored" data-anchor-id="path-formats">Path Formats</h2> 628 628 <section id="webdataset-brace-notation" class="level3"> 629 629 <h3 class="anchored" data-anchor-id="webdataset-brace-notation">WebDataset Brace Notation</h3> 630 - <div id="1f4e9142" class="cell"> 630 + <div id="8d0a44f7" class="cell"> 631 631 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Range notation</span></span> 632 632 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"data-{000000..000099}.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 633 633 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 637 637 </section> 638 638 <section id="glob-patterns" class="level3"> 639 639 <h3 class="anchored" data-anchor-id="glob-patterns">Glob Patterns</h3> 640 - <div id="9fd0a092" class="cell"> 640 + <div id="f88b246c" class="cell"> 641 641 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Match all tar files</span></span> 642 642 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"path/to/*.tar"</span>, MySample)</span> 643 643 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 647 647 </section> 648 648 <section id="local-directory" class="level3"> 649 649 <h3 class="anchored" data-anchor-id="local-directory">Local Directory</h3> 650 - <div id="8ddee3fc" class="cell"> 650 + <div id="dd112997" class="cell"> 651 651 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Scans for .tar files</span></span> 652 652 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"./my-dataset/"</span>, MySample)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 653 653 </div> 654 654 </section> 655 655 <section id="remote-urls" class="level3"> 656 656 <h3 class="anchored" data-anchor-id="remote-urls">Remote URLs</h3> 657 - <div id="6a120d61" class="cell"> 657 + <div id="fb8989cb" class="cell"> 658 658 <div class="sourceCode cell-code" id="cb5"><pre 
class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># S3 (public buckets)</span></span> 659 659 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(<span class="st">"s3://bucket/data-{000..099}.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 660 660 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 680 680 </section> 681 681 <section id="index-lookup" class="level3"> 682 682 <h3 class="anchored" data-anchor-id="index-lookup">Index Lookup</h3> 683 - <div id="c4c8a01b" class="cell"> 683 + <div id="415c2094" class="cell"> 684 684 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 685 685 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 686 686 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 747 747 <section id="datasetdict" class="level2"> 748 748 <h2 class="anchored" data-anchor-id="datasetdict">DatasetDict</h2> 749 749 <p>When loading without <code>split=</code>, returns a <code>DatasetDict</code>:</p> 750 - <div id="eed41d5d" class="cell"> 750 + <div id="22c1bc35" class="cell"> 751 751 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>ds_dict <span class="op">=</span> load_dataset(<span class="st">"path/to/data/"</span>, MySample)</span> 752 752 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 753 753 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Access splits</span></span> ··· 767 767 <section id="explicit-data-files" class="level2"> 768 768 <h2 class="anchored" data-anchor-id="explicit-data-files">Explicit Data Files</h2> 769 769 <p>Override automatic detection with <code>data_files</code>:</p> 770 - <div id="a87676d3" class="cell"> 770 + <div id="452b998f" class="cell"> 771 771 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Single pattern</span></span> 772 772 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>ds <span class="op">=</span> load_dataset(</span> 773 773 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"path/to/"</span>,</span> ··· 796 796 <section id="streaming-mode" class="level2"> 797 797 <h2 class="anchored" data-anchor-id="streaming-mode">Streaming Mode</h2> 798 798 <p>The <code>streaming</code> parameter signals intent for streaming mode:</p> 799 - <div id="d0ca2f60" class="cell"> 799 + <div id="9de4892c" class="cell"> 800 800 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Mark as streaming</span></span> 801 801 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>ds_dict <span 
class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, streaming<span class="op">=</span><span class="va">True</span>)</span> 802 802 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 821 821 <section id="auto-type-resolution" class="level2"> 822 822 <h2 class="anchored" data-anchor-id="auto-type-resolution">Auto Type Resolution</h2> 823 823 <p>When using index lookup, the sample type can be resolved automatically:</p> 824 - <div id="eff9b3ae" class="cell"> 824 + <div id="04d401c9" class="cell"> 825 825 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 826 826 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 827 827 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> LocalIndex()</span> ··· 835 835 </section> 836 836 <section id="error-handling" class="level2"> 837 837 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 838 - <div id="70c2f7f5" class="cell"> 838 + <div id="fdcf842d" class="cell"> 839 839 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 840 840 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> ds <span class="op">=</span> load_dataset(<span class="st">"path/to/data.tar"</span>, MySample, split<span class="op">=</span><span class="st">"train"</span>)</span> 841 841 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">FileNotFoundError</span>:</span> ··· 851 851 </section> 852 852 <section id="complete-example" class="level2"> 853 853 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 854 - <div id="076854da" class="cell"> 854 + <div id="6d45dd17" class="cell"> 855 855 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 856 856 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 857 857 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
+11 -11
docs/reference/local-storage.html
··· 603 603 <section id="localindex" class="level2"> 604 604 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 605 605 <p>The index tracks datasets in Redis:</p> 606 - <div id="014371d7" class="cell"> 606 + <div id="3120faf0" class="cell"> 607 607 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 608 608 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 609 609 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Default connection (localhost:6379)</span></span> ··· 619 619 </div> 620 620 <section id="adding-entries" class="level3"> 621 621 <h3 class="anchored" data-anchor-id="adding-entries">Adding Entries</h3> 622 - <div id="258ff92b" class="cell"> 622 + <div id="53371db7" class="cell"> 623 623 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 624 624 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 625 625 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span> ··· 644 644 </section> 645 645 <section id="listing-and-retrieving" class="level3"> 646 646 <h3 class="anchored" data-anchor-id="listing-and-retrieving">Listing and Retrieving</h3> 647 - <div id="ecd210a1" class="cell"> 647 + <div id="e5edf2cd" class="cell"> 648 648 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Iterate all entries</span></span> 649 649 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> entry <span class="kw">in</span> index.entries:</span> 650 650 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="ss">f"</span><span class="sc">{</span>entry<span class="sc">.</span>name<span class="sc">}</span><span class="ss">: </span><span class="sc">{</span>entry<span class="sc">.</span>cid<span class="sc">}</span><span class="ss">"</span>)</span> ··· 676 676 </div> 677 677 </div> 678 678 <p>The Repo class combines S3 storage with Redis indexing:</p> 679 - <div id="752fc198" class="cell"> 679 + <div id="57c27198" class="cell"> 680 680 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Repo</span> 681 681 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span> 682 682 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co"># From credentials file</span></span> ··· 696 696 <span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 697 697 </div> 698 698 <p><strong>Preferred approach</strong> - Use <code>LocalIndex</code> with 
<code>S3DataStore</code>:</p> 699 - <div id="fc784c62" class="cell"> 699 + <div id="9ddadfba" class="cell"> 700 700 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 701 701 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 702 702 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 734 734 </section> 735 735 <section id="inserting-datasets" class="level3"> 736 736 <h3 class="anchored" data-anchor-id="inserting-datasets">Inserting Datasets</h3> 737 - <div id="bb27e7db" class="cell"> 737 + <div id="7041abcb" class="cell"> 738 738 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 739 739 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 740 740 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 764 764 </section> 765 765 <section id="insert-options" class="level3"> 766 766 <h3 class="anchored" data-anchor-id="insert-options">Insert Options</h3> 767 - <div id="269b9b72" class="cell"> 767 + <div id="4f123ba3" class="cell"> 768 768 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>entry, ds <span class="op">=</span> repo.insert(</span> 769 769 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a> dataset,</span> 770 770 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 778 778 <section id="localdatasetentry" class="level2"> 779 779 <h2 class="anchored" data-anchor-id="localdatasetentry">LocalDatasetEntry</h2> 780 780 <p>Index entries provide content-addressable identification:</p> 781 - <div id="125f1ec6" class="cell"> 781 + <div id="0d213b56" class="cell"> 782 782 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.get_entry_by_name(<span class="st">"my-dataset"</span>)</span> 783 783 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 784 784 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Core properties (IndexEntry protocol)</span></span> ··· 811 811 <section id="schema-storage" class="level2"> 812 812 <h2 class="anchored" data-anchor-id="schema-storage">Schema Storage</h2> 813 813 <p>Schemas can be stored and retrieved from the index:</p> 814 - <div id="f5b4209d" class="cell"> 814 + <div id="2e22c50d" class="cell"> 815 815 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 816 816 <span id="cb10-2"><a href="#cb10-2" 
aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 817 817 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 842 842 <section id="s3datastore" class="level2"> 843 843 <h2 class="anchored" data-anchor-id="s3datastore">S3DataStore</h2> 844 844 <p>For direct S3 operations without Redis indexing:</p> 845 - <div id="ead4d102" class="cell"> 845 + <div id="3fe31407" class="cell"> 846 846 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 847 847 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 848 848 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> S3DataStore(</span> ··· 864 864 </section> 865 865 <section id="complete-workflow-example" class="level2"> 866 866 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 867 - <div id="96fe8292" class="cell"> 867 + <div id="42ed52ed" class="cell"> 868 868 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 869 869 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 870 870 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span>
+12 -12
docs/reference/packable-samples.html
··· 598 598 <section id="the-packable-decorator" class="level2"> 599 599 <h2 class="anchored" data-anchor-id="the-packable-decorator">The <code>@packable</code> Decorator</h2> 600 600 <p>The recommended way to define a sample type is with the <code>@packable</code> decorator:</p> 601 - <div id="a94874e3" class="cell"> 601 + <div id="4706cae6" class="cell"> 602 602 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 603 603 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 604 604 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 620 620 <h2 class="anchored" data-anchor-id="supported-field-types">Supported Field Types</h2> 621 621 <section id="primitives" class="level3"> 622 622 <h3 class="anchored" data-anchor-id="primitives">Primitives</h3> 623 - <div id="550b2a4e" class="cell"> 623 + <div id="e4fc5e65" class="cell"> 624 624 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 625 625 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> PrimitiveSample:</span> 626 626 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> name: <span class="bu">str</span></span> ··· 633 633 <section id="numpy-arrays" class="level3"> 634 634 <h3 class="anchored" data-anchor-id="numpy-arrays">NumPy Arrays</h3> 635 635 <p>Fields annotated as <code>NDArray</code> are automatically converted:</p> 636 - <div id="029f2fed" class="cell"> 636 + <div id="54fb169f" class="cell"> 637 637 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 638 638 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ArraySample:</span> 639 639 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Required array</span></span> ··· 655 655 </section> 656 656 <section id="lists" class="level3"> 657 657 <h3 class="anchored" data-anchor-id="lists">Lists</h3> 658 - <div id="c70e0022" class="cell"> 658 + <div id="ff4deef3" class="cell"> 659 659 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 660 660 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ListSample:</span> 661 661 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> tags: <span class="bu">list</span>[<span class="bu">str</span>]</span> ··· 667 667 <h2 class="anchored" data-anchor-id="serialization">Serialization</h2> 668 668 <section id="packing-to-bytes" class="level3"> 669 669 <h3 class="anchored" data-anchor-id="packing-to-bytes">Packing to Bytes</h3> 670 - <div id="2a44172f" class="cell"> 670 + <div 
id="bd357b91" class="cell"> 671 671 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 672 672 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> 673 673 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> label<span class="op">=</span><span class="st">"cat"</span>,</span> ··· 681 681 </section> 682 682 <section id="unpacking-from-bytes" class="level3"> 683 683 <h3 class="anchored" data-anchor-id="unpacking-from-bytes">Unpacking from Bytes</h3> 684 - <div id="30d8b649" class="cell"> 684 + <div id="67e2bcc5" class="cell"> 685 685 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Deserialize from bytes</span></span> 686 686 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>restored <span class="op">=</span> ImageSample.from_bytes(packed_bytes)</span> 687 687 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 693 693 <section id="webdataset-format" class="level3"> 694 694 <h3 class="anchored" data-anchor-id="webdataset-format">WebDataset Format</h3> 695 695 <p>The <code>as_wds</code> property returns a dict ready for WebDataset:</p> 696 - <div id="71fdefd9" class="cell"> 696 + <div id="adafdd43" class="cell"> 697 697 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>wds_dict <span class="op">=</span> sample.as_wds</span> 698 698 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="co"># {'__key__': '1234...', 'msgpack': b'...'}</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 699 699 </div> 700 700 <p>Write samples to a tar file:</p> 701 - <div id="0d06403b" class="cell"> 701 + <div id="5fd86dab" class="cell"> 702 702 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 703 703 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span> 704 704 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> wds.writer.TarWriter(<span class="st">"data-000000.tar"</span>) <span class="im">as</span> sink:</span> ··· 711 711 <section id="direct-inheritance-alternative" class="level2"> 712 712 <h2 class="anchored" data-anchor-id="direct-inheritance-alternative">Direct Inheritance (Alternative)</h2> 713 713 <p>You can also inherit directly from <code>PackableSample</code>:</p> 714 - <div id="2cbfbd91" class="cell"> 714 + <div id="9c493592" class="cell"> 715 715 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span 
class="im">import</span> dataclass</span> 716 716 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 717 717 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="at">@dataclass</span></span> ··· 749 749 <section id="the-_ensure_good-method" class="level3"> 750 750 <h3 class="anchored" data-anchor-id="the-_ensure_good-method">The <code>_ensure_good()</code> Method</h3> 751 751 <p>This method runs automatically after construction and handles NDArray conversion:</p> 752 - <div id="3e22df34" class="cell"> 752 + <div id="2397bcd3" class="cell"> 753 753 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> _ensure_good(<span class="va">self</span>):</span> 754 754 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a> <span class="cf">for</span> field <span class="kw">in</span> dataclasses.fields(<span class="va">self</span>):</span> 755 755 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> <span class="cf">if</span> _is_possibly_ndarray_type(field.<span class="bu">type</span>):</span> ··· 765 765 <ul class="nav nav-tabs" role="tablist"><li class="nav-item" role="presentation"><a class="nav-link active" id="tabset-2-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-1" role="tab" aria-controls="tabset-2-1" aria-selected="true">Do</a></li><li class="nav-item" role="presentation"><a class="nav-link" id="tabset-2-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-2-2" role="tab" aria-controls="tabset-2-2" aria-selected="false">Don’t</a></li></ul> 766 766 <div class="tab-content"> 767 767 <div id="tabset-2-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-2-1-tab"> 768 - <div id="5708dd52" class="cell"> 768 + <div id="b9bcc4bd" class="cell"> 769 769 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 770 770 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> GoodSample:</span> 771 771 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> features: NDArray <span class="co"># Clear type annotation</span></span> ··· 775 775 </div> 776 776 </div> 777 777 <div id="tabset-2-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-2-2-tab"> 778 - <div id="14b1ada0" class="cell"> 778 + <div id="3f23b423" class="cell"> 779 779 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 780 780 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> BadSample:</span> 781 781 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a> <span class="co"># DON'T: Nested dataclasses not supported</span></span>
+7 -7
docs/reference/promotion.html
··· 594 594 </section> 595 595 <section id="basic-usage" class="level2"> 596 596 <h2 class="anchored" data-anchor-id="basic-usage">Basic Usage</h2> 597 - <div id="b23db877" class="cell"> 597 + <div id="f83bcdc2" class="cell"> 598 598 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex</span> 599 599 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient</span> 600 600 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> promote_to_atmosphere</span> ··· 614 614 </section> 615 615 <section id="with-metadata" class="level2"> 616 616 <h2 class="anchored" data-anchor-id="with-metadata">With Metadata</h2> 617 - <div id="2b2c6783" class="cell"> 617 + <div id="3b1395cf" class="cell"> 618 618 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 619 619 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> entry,</span> 620 620 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 629 629 <section id="schema-deduplication" class="level2"> 630 630 <h2 class="anchored" data-anchor-id="schema-deduplication">Schema Deduplication</h2> 631 631 <p>The promotion workflow automatically checks for existing schemas:</p> 632 - <div id="59a8d371" class="cell"> 632 + <div id="509d97ed" class="cell"> 633 633 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 634 634 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 635 635 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span> ··· 649 649 <div class="tab-content"> 650 650 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 651 651 <p>By default, promotion keeps the original data URLs:</p> 652 - <div id="4808223f" class="cell"> 652 + <div id="85fac1e6" class="cell"> 653 653 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 654 654 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 655 655 </div> ··· 662 662 </div> 663 663 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 664 664 <p>To copy data to a different storage location:</p> 665 - <div id="e6c590b5" class="cell"> 665 + <div id="84eeff63" class="cell"> 666 666 <div class="sourceCode cell-code" 
id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 667 667 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 668 668 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create new data store</span></span> ··· 690 690 </section> 691 691 <section id="complete-workflow-example" class="level2"> 692 692 <h2 class="anchored" data-anchor-id="complete-workflow-example">Complete Workflow Example</h2> 693 - <div id="01c84d05" class="cell"> 693 + <div id="8f45fa86" class="cell"> 694 694 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 695 695 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 696 696 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 761 761 </section> 762 762 <section id="error-handling" class="level2"> 763 763 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 764 - <div id="c3957a63" class="cell"> 764 + <div id="2431b20a" class="cell"> 765 765 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 766 766 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(entry, local_index, client)</span> 767 767 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span>
+12 -12
docs/reference/protocols.html
··· 615 615 <section id="indexentry-protocol" class="level2"> 616 616 <h2 class="anchored" data-anchor-id="indexentry-protocol">IndexEntry Protocol</h2> 617 617 <p>Represents a dataset entry in any index:</p> 618 - <div id="c302a855" class="cell"> 618 + <div id="fe6ad9a5" class="cell"> 619 619 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry</span> 620 620 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span> 621 621 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> process_entry(entry: IndexEntry) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 669 669 <section id="abstractindex-protocol" class="level2"> 670 670 <h2 class="anchored" data-anchor-id="abstractindex-protocol">AbstractIndex Protocol</h2> 671 671 <p>Defines operations for managing schemas and datasets:</p> 672 - <div id="07a31265" class="cell"> 672 + <div id="439d68b1" class="cell"> 673 673 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex</span> 674 674 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 675 675 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> list_all_datasets(index: AbstractIndex) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 679 679 </div> 680 680 <section id="dataset-operations" class="level3"> 681 681 <h3 class="anchored" data-anchor-id="dataset-operations">Dataset Operations</h3> 682 - <div id="7678f752" class="cell"> 682 + <div id="8647c740" class="cell"> 683 683 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Insert a dataset</span></span> 684 684 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> index.insert_dataset(</span> 685 685 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 697 697 </section> 698 698 <section id="schema-operations" class="level3"> 699 699 <h3 class="anchored" data-anchor-id="schema-operations">Schema Operations</h3> 700 - <div id="70f62633" class="cell"> 700 + <div id="ed91bcba" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 702 702 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(</span> 703 703 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> MySample,</span> ··· 728 728 <section id="abstractdatastore-protocol" class="level2"> 729 729 <h2 class="anchored" data-anchor-id="abstractdatastore-protocol">AbstractDataStore Protocol</h2> 730 730 <p>Abstracts over different storage backends:</p> 731 - <div id="52e50320" class="cell"> 731 + <div id="ec8edb2e" 
class="cell"> 732 732 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractDataStore</span> 733 733 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 734 734 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> write_dataset(store: AbstractDataStore, dataset) <span class="op">-&gt;</span> <span class="bu">list</span>[<span class="bu">str</span>]:</span> ··· 738 738 </div> 739 739 <section id="methods" class="level3"> 740 740 <h3 class="anchored" data-anchor-id="methods">Methods</h3> 741 - <div id="317eac4c" class="cell"> 741 + <div id="f3a03a91" class="cell"> 742 742 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Write dataset shards</span></span> 743 743 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>urls <span class="op">=</span> store.write_shards(</span> 744 744 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> dataset,</span> ··· 765 765 <section id="datasource-protocol" class="level2"> 766 766 <h2 class="anchored" data-anchor-id="datasource-protocol">DataSource Protocol</h2> 767 767 <p>Abstracts over different data source backends for streaming dataset shards:</p> 768 - <div id="8a53d5a4" class="cell"> 768 + <div id="51efd4bf" class="cell"> 769 769 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 770 770 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span> 771 771 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> load_from_source(source: DataSource) <span class="op">-&gt;</span> <span class="va">None</span>:</span> ··· 778 778 </div> 779 779 <section id="methods-1" class="level3"> 780 780 <h3 class="anchored" data-anchor-id="methods-1">Methods</h3> 781 - <div id="f867cfdc" class="cell"> 781 + <div id="03a780ee" class="cell"> 782 782 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Get list of shard identifiers</span></span> 783 783 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>shard_ids <span class="op">=</span> source.shard_list <span class="co"># ['data-000000.tar', 'data-000001.tar', ...]</span></span> 784 784 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a></span> ··· 801 801 <section id="creating-custom-data-sources" class="level3"> 802 802 <h3 class="anchored" data-anchor-id="creating-custom-data-sources">Creating Custom Data Sources</h3> 803 803 <p>Implement the <code>DataSource</code> protocol for custom backends:</p> 804 - <div id="44575d54" class="cell"> 804 + <div id="594f4fdb" class="cell"> 805 805 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" 
aria-hidden="true" tabindex="-1"></a><span class="im">from</span> typing <span class="im">import</span> Iterator, IO</span> 806 806 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> DataSource</span> 807 807 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a></span> ··· 839 839 <section id="using-protocols-for-polymorphism" class="level2"> 840 840 <h2 class="anchored" data-anchor-id="using-protocols-for-polymorphism">Using Protocols for Polymorphism</h2> 841 841 <p>Write code that works with any backend:</p> 842 - <div id="db1294c4" class="cell"> 842 + <div id="1bffcf78" class="cell"> 843 843 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> AbstractIndex, IndexEntry</span> 844 844 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> Dataset</span> 845 845 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 910 910 <section id="type-checking" class="level2"> 911 911 <h2 class="anchored" data-anchor-id="type-checking">Type Checking</h2> 912 912 <p>Protocols are runtime-checkable:</p> 913 - <div id="0bb97720" class="cell"> 913 + <div id="7712ed61" class="cell"> 914 914 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata._protocols <span class="im">import</span> IndexEntry, AbstractIndex</span> 915 915 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a></span> 916 916 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if object implements protocol</span></span> ··· 924 924 </section> 925 925 <section id="complete-example" class="level2"> 926 926 <h2 class="anchored" data-anchor-id="complete-example">Complete Example</h2> 927 - <div id="e296723f" class="cell"> 927 + <div id="b6ead38d" class="cell"> 928 928 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> 929 929 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> LocalIndex, S3DataStore</span> 930 930 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereClient, AtmosphereIndex</span>
+2 -2
docs/reference/uri-spec.html
··· 685 685 <h2 class="anchored" data-anchor-id="examples">Examples</h2> 686 686 <section id="local-development" class="level3"> 687 687 <h3 class="anchored" data-anchor-id="local-development">Local Development</h3> 688 - <div id="09088cac" class="cell"> 688 + <div id="f60296a7" class="cell"> 689 689 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> Index</span> 690 690 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span> 691 691 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> Index()</span> ··· 704 704 </section> 705 705 <section id="atmosphere-atproto-federation" class="level3"> 706 706 <h3 class="anchored" data-anchor-id="atmosphere-atproto-federation">Atmosphere (ATProto Federation)</h3> 707 - <div id="1250c10e" class="cell"> 707 + <div id="339436d2" class="cell"> 708 708 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> Client</span> 709 709 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 710 710 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> Client()</span>
+110 -110
docs/search.json
··· 1072 1072 "href": "api/SchemaLoader.html", 1073 1073 "title": "SchemaLoader", 1074 1074 "section": "", 1075 - "text": "atmosphere.SchemaLoader(client)\nLoads PackableSample schemas from ATProto.\nThis class fetches schema records from ATProto and can list available schemas from a repository.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 1075 + "text": "atmosphere.SchemaLoader(client)\nLoads PackableSample schemas from ATProto.\nThis class fetches schema records from ATProto and can list available schemas from a repository.\n\n\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a schema record by AT URI.\n\n\nlist_all\nList schema records from a repository.\n\n\n\n\n\natmosphere.SchemaLoader.get(uri)\nFetch a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe schema record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a schema record.\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.SchemaLoader.list_all(repo=None, limit=100)\nList schema records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records." 
1076 1076 }, 1077 1077 { 1078 - "objectID": "api/SchemaLoader.html#example", 1079 - "href": "api/SchemaLoader.html#example", 1078 + "objectID": "api/SchemaLoader.html#examples", 1079 + "href": "api/SchemaLoader.html#examples", 1080 1080 "title": "SchemaLoader", 1081 1081 "section": "", 1082 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'" 1082 + "text": "&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; loader = SchemaLoader(client)\n&gt;&gt;&gt; schema = loader.get(\"at://did:plc:.../ac.foundation.dataset.sampleSchema/...\")\n&gt;&gt;&gt; print(schema[\"name\"])\n'MySample'" 1083 1083 }, 1084 1084 { 1085 1085 "objectID": "api/SchemaLoader.html#methods", ··· 1093 1093 "href": "api/BlobSource.html", 1094 1094 "title": "BlobSource", 1095 1095 "section": "", 1096 - "text": "BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())\nData source for ATProto PDS blob storage.\nStreams dataset shards stored as blobs on an ATProto Personal Data Server. Each shard is identified by a blob reference containing the DID and CID.\nThis source resolves blob references to HTTP URLs and streams the content directly, supporting efficient iteration over shards without downloading everything upfront.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 
1096 + "text": "BlobSource(blob_refs, pds_endpoint=None, _endpoint_cache=dict())\nData source for ATProto PDS blob storage.\nStreams dataset shards stored as blobs on an ATProto Personal Data Server. Each shard is identified by a blob reference containing the DID and CID.\nThis source resolves blob references to HTTP URLs and streams the content directly, supporting efficient iteration over shards without downloading everything upfront.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID.\n\n\n\n\n\n\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_refs\nCreate BlobSource from blob reference dicts.\n\n\nlist_shards\nReturn list of AT URI-style shard identifiers.\n\n\nopen_shard\nOpen a single shard by its AT URI.\n\n\n\n\n\nBlobSource.from_refs(refs, *, pds_endpoint=None)\nCreate BlobSource from blob reference dicts.\nAccepts blob references in the format returned by upload_blob: {\"$type\": \"blob\", \"ref\": {\"$link\": \"cid\"}, ...}\nAlso accepts simplified format: {\"did\": \"...\", \"cid\": \"...\"}\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrefs\nlist[dict]\nList of blob reference dicts.\nrequired\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint to use for all blobs.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nConfigured BlobSource.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf refs is empty or format is invalid.\n\n\n\n\n\n\n\nBlobSource.list_shards()\nReturn list of AT URI-style shard identifiers.\n\n\n\nBlobSource.open_shard(shard_id)\nOpen a single shard by its AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nAT URI of the shard (at://did/blob/cid).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreaming response body for reading the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards().\n\n\n\nValueError\nIf shard_id format is invalid." 1097 1097 }, 1098 1098 { 1099 1099 "objectID": "api/BlobSource.html#attributes", ··· 1103 1103 "text": "Name\nType\nDescription\n\n\n\n\nblob_refs\nlist[dict[str, str]]\nList of blob reference dicts with ‘did’ and ‘cid’ keys.\n\n\npds_endpoint\nstr | None\nOptional PDS endpoint URL. If not provided, resolved from DID." 1104 1104 }, 1105 1105 { 1106 - "objectID": "api/BlobSource.html#example", 1107 - "href": "api/BlobSource.html#example", 1106 + "objectID": "api/BlobSource.html#examples", 1107 + "href": "api/BlobSource.html#examples", 1108 1108 "title": "BlobSource", 1109 1109 "section": "", 1110 - "text": "::\n&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1110 + "text": "&gt;&gt;&gt; source = BlobSource(\n... blob_refs=[\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... {\"did\": \"did:plc:abc123\", \"cid\": \"bafyrei...\"},\n... ],\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
process(stream)" 1111 1111 }, 1112 1112 { 1113 1113 "objectID": "api/BlobSource.html#methods", ··· 1121 1121 "href": "api/AtmosphereClient.html", 1122 1122 "title": "AtmosphereClient", 1123 1123 "section": "", 1124 - "text": "atmosphere.AtmosphereClient(base_url=None, *, _client=None)\nATProto client wrapper for atdata operations.\nThis class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'\n\n\n\nThe password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. 
Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. 
Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1124 + "text": "atmosphere.AtmosphereClient(base_url=None, *, _client=None)\nATProto client wrapper for atdata operations.\nThis class wraps the atproto SDK client and provides higher-level methods for working with atdata records (schemas, datasets, lenses).\n\n\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'\n\n\n\nThe password should be an app-specific password, not your main account password. Create app passwords in your Bluesky account settings.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndid\nGet the DID of the authenticated user.\n\n\nhandle\nGet the handle of the authenticated user.\n\n\nis_authenticated\nCheck if the client has a valid session.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_record\nCreate a record in the user’s repository.\n\n\ndelete_record\nDelete a record.\n\n\nexport_session\nExport the current session for later reuse.\n\n\nget_blob\nDownload a blob from a PDS.\n\n\nget_blob_url\nGet the direct URL for fetching a blob.\n\n\nget_record\nFetch a record by AT URI.\n\n\nlist_datasets\nList dataset records.\n\n\nlist_lenses\nList lens records.\n\n\nlist_records\nList records in a collection.\n\n\nlist_schemas\nList schema records.\n\n\nlogin\nAuthenticate with the ATProto PDS.\n\n\nlogin_with_session\nAuthenticate using an exported session string.\n\n\nput_record\nCreate or update a record at a specific key.\n\n\nupload_blob\nUpload binary data as a blob to the PDS.\n\n\n\n\n\natmosphere.AtmosphereClient.create_record(\n collection,\n record,\n *,\n rkey=None,\n validate=False,\n)\nCreate a record in the user’s repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection (e.g., ‘ac.foundation.dataset.sampleSchema’).\nrequired\n\n\nrecord\ndict\nThe record data. Must include a ‘$type’ field.\nrequired\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema. 
Set to False for custom lexicons that the PDS doesn’t know about.\nFalse\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf record creation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.delete_record(uri, *, swap_commit=None)\nDelete a record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record to delete.\nrequired\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap delete.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf deletion fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.export_session()\nExport the current session for later reuse.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSession string that can be passed to login_with_session().\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob(did, cid)\nDownload a blob from a PDS.\nThis resolves the PDS endpoint from the DID document and fetches the blob directly from the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbytes\nThe blob data as bytes.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\nrequests.HTTPError\nIf blob fetch fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_blob_url(did, cid)\nGet the direct URL for fetching a blob.\nThis is useful for passing to WebDataset or other HTTP clients.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndid\nstr\nThe DID of the repository containing the blob.\nrequired\n\n\ncid\nstr\nThe CID of the blob.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nThe full URL for fetching the blob.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf PDS endpoint cannot be resolved.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.get_record(uri)\nFetch a record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe record data as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf record not found.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_datasets(repo=None, limit=100)\nList dataset records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_lenses(repo=None, limit=100)\nList lens records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_records(\n collection,\n *,\n repo=None,\n limit=100,\n cursor=None,\n)\nList records in a collection.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrepo\nOptional[str]\nThe DID of the repository to query. Defaults to the authenticated user’s repository.\nNone\n\n\nlimit\nint\nMaximum number of records to return (default 100).\n100\n\n\ncursor\nOptional[str]\nPagination cursor from a previous call.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nA tuple of (records, next_cursor). The cursor is None if there\n\n\n\nOptional[str]\nare no more records.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf repo is None and not authenticated.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.list_schemas(repo=None, limit=100)\nList schema records.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID to query. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login(handle, password)\nAuthenticate with the ATProto PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nhandle\nstr\nYour Bluesky handle (e.g., ‘alice.bsky.social’).\nrequired\n\n\npassword\nstr\nApp-specific password (not your main password).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\natproto.exceptions.AtProtocolError\nIf authentication fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.login_with_session(session_string)\nAuthenticate using an exported session string.\nThis allows reusing a session without re-authenticating, which helps avoid rate limits on session creation.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsession_string\nstr\nSession string from export_session().\nrequired\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.put_record(\n collection,\n rkey,\n record,\n *,\n validate=False,\n swap_commit=None,\n)\nCreate or update a record at a specific key.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncollection\nstr\nThe NSID of the record collection.\nrequired\n\n\nrkey\nstr\nThe record key.\nrequired\n\n\nrecord\ndict\nThe record data. 
Must include a ‘$type’ field.\nrequired\n\n\nvalidate\nbool\nWhether to validate against the Lexicon schema.\nFalse\n\n\nswap_commit\nOptional[str]\nOptional CID for compare-and-swap update.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf operation fails.\n\n\n\n\n\n\n\natmosphere.AtmosphereClient.upload_blob(\n data,\n mime_type='application/octet-stream',\n)\nUpload binary data as a blob to the PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nbytes\nBinary data to upload.\nrequired\n\n\nmime_type\nstr\nMIME type of the data (for reference, not enforced by PDS).\n'application/octet-stream'\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nA blob reference dict with keys: ‘$type’, ‘ref’, ‘mimeType’, ‘size’.\n\n\n\ndict\nThis can be embedded directly in record fields.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\natproto.exceptions.AtProtocolError\nIf upload fails." 1125 1125 }, 1126 1126 { 1127 - "objectID": "api/AtmosphereClient.html#example", 1128 - "href": "api/AtmosphereClient.html#example", 1127 + "objectID": "api/AtmosphereClient.html#examples", 1128 + "href": "api/AtmosphereClient.html#examples", 1129 1129 "title": "AtmosphereClient", 1130 1130 "section": "", 1131 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'" 1131 + "text": "&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"alice.bsky.social\", \"app-password\")\n&gt;&gt;&gt; print(client.did)\n'did:plc:...'" 1132 1132 }, 1133 1133 { 1134 1134 "objectID": "api/AtmosphereClient.html#note", ··· 1156 1156 "href": "api/load_dataset.html", 1157 1157 "title": "load_dataset", 1158 1158 "section": "", 1159 - "text": "load_dataset(\n path,\n sample_type=None,\n *,\n split=None,\n data_files=None,\n streaming=False,\n index=None,\n)\nLoad a dataset from local files, remote URLs, or an index.\nThis function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.\nWhen no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. 
Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1159 + "text": "load_dataset(\n path,\n sample_type=None,\n *,\n split=None,\n data_files=None,\n streaming=False,\n index=None,\n)\nLoad a dataset from local files, remote URLs, or an index.\nThis function provides a HuggingFace Datasets-style interface for loading atdata typed datasets. It handles path resolution, split detection, and returns either a single Dataset or a DatasetDict depending on the split parameter.\nWhen no sample_type is provided, returns a Dataset[DictSample] that provides dynamic dict-like access to fields. Use .as_type(MyType) to convert to a typed schema.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nstr\nPath to dataset. Can be: - Index lookup: “@handle/dataset-name” or “@local/dataset-name” - WebDataset brace notation: “path/to/{train,test}-{000..099}.tar” - Local directory: “./data/” (scans for .tar files) - Glob pattern: “path/to/.tar” - Remote URL: ”s3://bucket/path/data-.tar” - Single file: “path/to/data.tar”\nrequired\n\n\nsample_type\nType[ST] | None\nThe PackableSample subclass defining the schema. If None, returns Dataset[DictSample] with dynamic field access. Can also be resolved from an index when using @handle/dataset syntax.\nNone\n\n\nsplit\nstr | None\nWhich split to load. If None, returns a DatasetDict with all detected splits. If specified (e.g., “train”, “test”), returns a single Dataset for that split.\nNone\n\n\ndata_files\nstr | list[str] | dict[str, str | list[str]] | None\nOptional explicit mapping of data files. 
Can be: - str: Single file pattern - list[str]: List of file patterns (assigned to “train”) - dict[str, str | list[str]]: Explicit split -&gt; files mapping\nNone\n\n\nstreaming\nbool\nIf True, explicitly marks the dataset for streaming mode. Note: atdata Datasets are already lazy/streaming via WebDataset pipelines, so this parameter primarily signals intent.\nFalse\n\n\nindex\nOptional['AbstractIndex']\nOptional AbstractIndex for dataset lookup. Required when using @handle/dataset syntax. When provided with an indexed path, the schema can be auto-resolved from the index.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is None: DatasetDict with all detected splits.\n\n\n\nDataset[ST] | DatasetDict[ST]\nIf split is specified: Dataset for that split.\n\n\n\nDataset[ST] | DatasetDict[ST]\nType is ST if sample_type provided, otherwise DictSample.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index.\n\n\n\n\n\n\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1160 1160 }, 1161 1161 { 1162 1162 "objectID": "api/load_dataset.html#parameters", ··· 1180 1180 "text": "Name\nType\nDescription\n\n\n\n\n\nValueError\nIf the specified split is not found.\n\n\n\nFileNotFoundError\nIf no data files are found at the path.\n\n\n\nKeyError\nIf dataset not found in index." 1181 1181 }, 1182 1182 { 1183 - "objectID": "api/load_dataset.html#example", 1184 - "href": "api/load_dataset.html#example", 1183 + "objectID": "api/load_dataset.html#examples", 1184 + "href": "api/load_dataset.html#examples", 1185 1185 "title": "load_dataset", 1186 1186 "section": "", 1187 - "text": "::\n&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1187 + "text": "&gt;&gt;&gt; # Load without type - get DictSample for exploration\n&gt;&gt;&gt; ds = load_dataset(\"./data/train.tar\", split=\"train\")\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.keys()) # Explore fields\n... 
print(sample[\"text\"]) # Dict-style access\n... print(sample.label) # Attribute access\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(TextData)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or load with explicit type directly\n&gt;&gt;&gt; train_ds = load_dataset(\"./data/train-*.tar\", TextData, split=\"train\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Load from index with auto-type resolution\n&gt;&gt;&gt; index = LocalIndex()\n&gt;&gt;&gt; ds = load_dataset(\"@local/my-dataset\", index=index, split=\"train\")" 1188 1188 }, 1189 1189 { 1190 1190 "objectID": "api/promote_to_atmosphere.html", 1191 1191 "href": "api/promote_to_atmosphere.html", 1192 1192 "title": "promote_to_atmosphere", 1193 1193 "section": "", 1194 - "text": "promote.promote_to_atmosphere(\n local_entry,\n local_index,\n atmosphere_client,\n *,\n data_store=None,\n name=None,\n description=None,\n tags=None,\n license=None,\n)\nPromote a local dataset to the atmosphere network.\nThis function takes a locally-indexed dataset and publishes it to ATProto, making it discoverable on the federated atmosphere network.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1194 + "text": "promote.promote_to_atmosphere(\n local_entry,\n local_index,\n atmosphere_client,\n *,\n data_store=None,\n name=None,\n description=None,\n tags=None,\n license=None,\n)\nPromote a local dataset to the atmosphere network.\nThis function takes a locally-indexed dataset and publishes it to ATProto, making it discoverable on the federated atmosphere network.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlocal_entry\nLocalDatasetEntry\nThe LocalDatasetEntry to promote.\nrequired\n\n\nlocal_index\nLocalIndex\nLocal index containing the schema for this entry.\nrequired\n\n\natmosphere_client\nAtmosphereClient\nAuthenticated AtmosphereClient.\nrequired\n\n\ndata_store\nAbstractDataStore | None\nOptional data store for copying data to new location. If None, the existing data_urls are used as-is.\nNone\n\n\nname\nstr | None\nOverride name for the atmosphere record. 
Defaults to local name.\nNone\n\n\ndescription\nstr | None\nOptional description for the dataset.\nNone\n\n\ntags\nlist[str] | None\nOptional tags for discovery.\nNone\n\n\nlicense\nstr | None\nOptional license identifier.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the created atmosphere dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs.\n\n\n\n\n\n\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1195 1195 }, 1196 1196 { 1197 1197 "objectID": "api/promote_to_atmosphere.html#parameters", ··· 1215 1215 "text": "Name\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found in local index.\n\n\n\nValueError\nIf local entry has no data URLs." 1216 1216 }, 1217 1217 { 1218 - "objectID": "api/promote_to_atmosphere.html#example", 1219 - "href": "api/promote_to_atmosphere.html#example", 1218 + "objectID": "api/promote_to_atmosphere.html#examples", 1219 + "href": "api/promote_to_atmosphere.html#examples", 1220 1220 "title": "promote_to_atmosphere", 1221 1221 "section": "", 1222 - "text": "::\n&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1222 + "text": "&gt;&gt;&gt; entry = local_index.get_dataset(\"mnist-train\")\n&gt;&gt;&gt; uri = promote_to_atmosphere(entry, local_index, client)\n&gt;&gt;&gt; print(uri)\nat://did:plc:abc123/ac.foundation.dataset.datasetIndex/..." 1223 1223 }, 1224 1224 { 1225 1225 "objectID": "api/SchemaPublisher.html", 1226 1226 "href": "api/SchemaPublisher.html", 1227 1227 "title": "SchemaPublisher", 1228 1228 "section": "", 1229 - "text": "atmosphere.SchemaPublisher(client)\nPublishes PackableSample schemas to ATProto.\nThis class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.\n\n\n::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/...\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. 
If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 1229 + "text": "atmosphere.SchemaPublisher(client)\nPublishes PackableSample schemas to ATProto.\nThis class introspects a PackableSample class to extract its field definitions and publishes them as an ATProto schema record.\n\n\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/...\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\natmosphere.SchemaPublisher.publish(\n sample_type,\n *,\n name=None,\n version='1.0.0',\n description=None,\n metadata=None,\n rkey=None,\n)\nPublish a PackableSample schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[ST]\nThe PackableSample class to publish.\nrequired\n\n\nname\nOptional[str]\nHuman-readable name. Defaults to the class name.\nNone\n\n\nversion\nstr\nSemantic version string (e.g., ‘1.0.0’).\n'1.0.0'\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key. If not provided, a TID is generated.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created schema record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass or client is not authenticated.\n\n\n\nTypeError\nIf a field type is not supported." 1230 1230 }, 1231 1231 { 1232 - "objectID": "api/SchemaPublisher.html#example", 1233 - "href": "api/SchemaPublisher.html#example", 1232 + "objectID": "api/SchemaPublisher.html#examples", 1233 + "href": "api/SchemaPublisher.html#examples", 1234 1234 "title": "SchemaPublisher", 1235 1235 "section": "", 1236 - "text": "::\n&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/..." 1236 + "text": "&gt;&gt;&gt; @atdata.packable\n... class MySample:\n... image: NDArray\n... label: str\n...\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = SchemaPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; print(uri)\nat://did:plc:.../ac.foundation.dataset.sampleSchema/..." 
1237 1237 }, 1238 1238 { 1239 1239 "objectID": "api/SchemaPublisher.html#methods", ··· 1247 1247 "href": "api/DatasetPublisher.html", 1248 1248 "title": "DatasetPublisher", 1249 1249 "section": "", 1250 - "text": "atmosphere.DatasetPublisher(client)\nPublishes dataset index records to ATProto.\nThis class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.\n\n\n::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. 
Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 1250 + "text": "atmosphere.DatasetPublisher(client)\nPublishes dataset index records to ATProto.\nThis class creates dataset records that reference a schema and point to external storage (WebDataset URLs) or ATProto blobs.\n\n\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a dataset index record to ATProto.\n\n\npublish_with_blobs\nPublish a dataset with data stored as ATProto blobs.\n\n\npublish_with_urls\nPublish a dataset record with explicit URLs.\n\n\n\n\n\natmosphere.DatasetPublisher.publish(\n dataset,\n *,\n name,\n schema_uri=None,\n description=None,\n tags=None,\n license=None,\n auto_publish_schema=True,\n schema_version='1.0.0',\n rkey=None,\n)\nPublish a dataset index record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndataset\nDataset[ST]\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\nschema_uri\nOptional[str]\nAT URI of the schema record. 
If not provided and auto_publish_schema is True, the schema will be published.\nNone\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).\nNone\n\n\nauto_publish_schema\nbool\nIf True and schema_uri not provided, automatically publish the schema first.\nTrue\n\n\nschema_version\nstr\nVersion for auto-published schema.\n'1.0.0'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema_uri is not provided and auto_publish_schema is False.\n\n\n\n\n\n\n\natmosphere.DatasetPublisher.publish_with_blobs(\n blobs,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n mime_type='application/x-tar',\n rkey=None,\n)\nPublish a dataset with data stored as ATProto blobs.\nThis method uploads the provided data as blobs to the PDS and creates a dataset record referencing them. Suitable for smaller datasets that fit within blob size limits (typically 50MB per blob, configurable).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nblobs\nlist[bytes]\nList of binary data (e.g., tar shards) to upload as blobs.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nmime_type\nstr\nMIME type for the blobs (default: application/x-tar).\n'application/x-tar'\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record.\n\n\n\n\n\n\nBlobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.\n\n\n\n\natmosphere.DatasetPublisher.publish_with_urls(\n urls,\n schema_uri,\n *,\n name,\n description=None,\n tags=None,\n license=None,\n metadata=None,\n rkey=None,\n)\nPublish a dataset record with explicit URLs.\nThis method allows publishing a dataset record without having a Dataset object, useful for registering existing WebDataset files.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of WebDataset URLs with brace notation.\nrequired\n\n\nschema_uri\nstr\nAT URI of the schema record.\nrequired\n\n\nname\nstr\nHuman-readable dataset name.\nrequired\n\n\ndescription\nOptional[str]\nHuman-readable description.\nNone\n\n\ntags\nOptional[list[str]]\nSearchable tags for discovery.\nNone\n\n\nlicense\nOptional[str]\nSPDX license identifier.\nNone\n\n\nmetadata\nOptional[dict]\nArbitrary metadata dictionary.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created dataset record." 
1251 1251 }, 1252 1252 { 1253 - "objectID": "api/DatasetPublisher.html#example", 1254 - "href": "api/DatasetPublisher.html#example", 1253 + "objectID": "api/DatasetPublisher.html#examples", 1254 + "href": "api/DatasetPublisher.html#examples", 1255 1255 "title": "DatasetPublisher", 1256 1256 "section": "", 1257 - "text": "::\n&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )" 1257 + "text": "&gt;&gt;&gt; dataset = atdata.Dataset[MySample](\"s3://bucket/data-{000000..000009}.tar\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = DatasetPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... dataset,\n... name=\"My Training Data\",\n... description=\"Training data for my model\",\n... tags=[\"computer-vision\", \"training\"],\n... )" 1258 1258 }, 1259 1259 { 1260 1260 "objectID": "api/DatasetPublisher.html#methods", ··· 1268 1268 "href": "api/URLSource.html", 1269 1269 "title": "URLSource", 1270 1270 "section": "", 1271 - "text": "URLSource(url)\nData source for WebDataset-compatible URLs.\nWraps WebDataset’s gopen to open URLs using built-in handlers for http, https, pipe, gs, hf, sftp, etc. Supports brace expansion for shard patterns like “data-{000..099}.tar”.\nThis is the default source type when a string URL is passed to Dataset.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1271 + "text": "URLSource(url)\nData source for WebDataset-compatible URLs.\nWraps WebDataset’s gopen to open URLs using built-in handlers for http, https, pipe, gs, hf, sftp, etc. Supports brace expansion for shard patterns like “data-{000..099}.tar”.\nThis is the default source type when a string URL is passed to Dataset.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards.\n\n\n\n\n\n\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
print(f\"Streaming {shard_id}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nExpand brace pattern and return list of shard URLs.\n\n\nopen_shard\nOpen a single shard by URL.\n\n\n\n\n\nURLSource.list_shards()\nExpand brace pattern and return list of shard URLs.\n\n\n\nURLSource.open_shard(shard_id)\nOpen a single shard by URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nURL of the shard to open.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream from gopen.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1272 1272 }, 1273 1273 { 1274 1274 "objectID": "api/URLSource.html#attributes", ··· 1278 1278 "text": "Name\nType\nDescription\n\n\n\n\nurl\nstr\nURL or brace pattern for the shards." 1279 1279 }, 1280 1280 { 1281 - "objectID": "api/URLSource.html#example", 1282 - "href": "api/URLSource.html#example", 1281 + "objectID": "api/URLSource.html#examples", 1282 + "href": "api/URLSource.html#examples", 1283 1283 "title": "URLSource", 1284 1284 "section": "", 1285 - "text": "::\n&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")" 1285 + "text": "&gt;&gt;&gt; source = URLSource(\"https://example.com/train-{000..009}.tar\")\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... print(f\"Streaming {shard_id}\")" 1286 1286 }, 1287 1287 { 1288 1288 "objectID": "api/URLSource.html#methods", ··· 1366 1366 "href": "api/S3Source.html", 1367 1367 "title": "S3Source", 1368 1368 "section": "", 1369 - "text": "S3Source(\n bucket,\n keys,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n _client=None,\n)\nData source for S3-compatible storage with explicit credentials.\nUses boto3 to stream directly from S3, supporting: - Standard AWS S3 - S3-compatible endpoints (Cloudflare R2, MinIO, etc.) - Private buckets with credentials - IAM role authentication (when keys not provided)\nUnlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... 
process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1369 + "text": "S3Source(\n bucket,\n keys,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n _client=None,\n)\nData source for S3-compatible storage with explicit credentials.\nUses boto3 to stream directly from S3, supporting: - Standard AWS S3 - S3-compatible endpoints (Cloudflare R2, MinIO, etc.) - Private buckets with credentials - IAM role authentication (when keys not provided)\nUnlike URL-based approaches, this doesn’t require URL transformation or global gopen_schemes registration. Credentials are scoped to the source instance.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1).\n\n\n\n\n\n\n&gt;&gt;&gt; source = S3Source(\n... 
bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1370 1370 }, 1371 1371 { 1372 1372 "objectID": "api/S3Source.html#attributes", ··· 1376 1376 "text": "Name\nType\nDescription\n\n\n\n\nbucket\nstr\nS3 bucket name.\n\n\nkeys\nlist[str]\nList of object keys (paths within bucket).\n\n\nendpoint\nstr | None\nOptional custom endpoint URL for S3-compatible services.\n\n\naccess_key\nstr | None\nOptional AWS access key ID.\n\n\nsecret_key\nstr | None\nOptional AWS secret access key.\n\n\nregion\nstr | None\nOptional AWS region (defaults to us-east-1)." 
1377 1377 }, 1378 1378 { 1379 - "objectID": "api/S3Source.html#example", 1380 - "href": "api/S3Source.html#example", 1379 + "objectID": "api/S3Source.html#examples", 1380 + "href": "api/S3Source.html#examples", 1381 1381 "title": "S3Source", 1382 1382 "section": "", 1383 - "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1383 + "text": "&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-datasets\",\n... keys=[\"train/shard-000.tar\", \"train/shard-001.tar\"],\n... endpoint=\"https://abc123.r2.cloudflarestorage.com\",\n... access_key=\"AKIAIOSFODNN7EXAMPLE\",\n... secret_key=\"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\",\n... )\n&gt;&gt;&gt; for shard_id, stream in source.shards:\n... process(stream)" 1384 1384 }, 1385 1385 { 1386 1386 "objectID": "api/S3Source.html#methods", 1387 1387 "href": "api/S3Source.html#methods", 1388 1388 "title": "S3Source", 1389 1389 "section": "", 1390 - "text": "Name\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... 
)\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1390 + "text": "Name\nDescription\n\n\n\n\nfrom_credentials\nCreate S3Source from a credentials dictionary.\n\n\nfrom_urls\nCreate S3Source from s3:// URLs.\n\n\nlist_shards\nReturn list of S3 URIs for the shards.\n\n\nopen_shard\nOpen a single shard by S3 URI.\n\n\n\n\n\nS3Source.from_credentials(credentials, bucket, keys)\nCreate S3Source from a credentials dictionary.\nAccepts the same credential format used by S3DataStore.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncredentials\ndict[str, str]\nDict with AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_ENDPOINT.\nrequired\n\n\nbucket\nstr\nS3 bucket name.\nrequired\n\n\nkeys\nlist[str]\nList of object keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nConfigured S3Source.\n\n\n\n\n\n\n&gt;&gt;&gt; creds = {\n... \"AWS_ACCESS_KEY_ID\": \"...\",\n... \"AWS_SECRET_ACCESS_KEY\": \"...\",\n... \"AWS_ENDPOINT\": \"https://r2.example.com\",\n... }\n&gt;&gt;&gt; source = S3Source.from_credentials(creds, \"my-bucket\", [\"data.tar\"])\n\n\n\n\nS3Source.from_urls(\n urls,\n *,\n endpoint=None,\n access_key=None,\n secret_key=None,\n region=None,\n)\nCreate S3Source from s3:// URLs.\nParses s3://bucket/key URLs and extracts bucket and keys. All URLs must be in the same bucket.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of s3:// URLs.\nrequired\n\n\nendpoint\nstr | None\nOptional custom endpoint.\nNone\n\n\naccess_key\nstr | None\nOptional access key.\nNone\n\n\nsecret_key\nstr | None\nOptional secret key.\nNone\n\n\nregion\nstr | None\nOptional region.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'S3Source'\nS3Source configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid s3:// URLs or span multiple buckets.\n\n\n\n\n\n\n&gt;&gt;&gt; source = S3Source.from_urls(\n... [\"s3://my-bucket/train-000.tar\", \"s3://my-bucket/train-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... )\n\n\n\n\nS3Source.list_shards()\nReturn list of S3 URIs for the shards.\n\n\n\nS3Source.open_shard(shard_id)\nOpen a single shard by S3 URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nS3 URI of the shard (s3://bucket/key).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nStreamingBody for reading the object.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in list_shards()." 1391 1391 }, 1392 1392 { 1393 1393 "objectID": "api/local.LocalDatasetEntry.html", ··· 1415 1415 "href": "api/AbstractIndex.html", 1416 1416 "title": "AbstractIndex", 1417 1417 "section": "", 1418 - "text": "AbstractIndex()\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\nThis protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses\nA single index can hold datasets of many different sample types. 
The sample type is tracked via schema references, not as a generic parameter on the index.\n\n\nSome index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution.\n\n\n\n::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... 
print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1418 + "text": "AbstractIndex()\nProtocol for index operations - implemented by LocalIndex and AtmosphereIndex.\nThis protocol defines the common interface for managing dataset metadata: - Publishing and retrieving schemas - Inserting and listing datasets - (Future) Publishing and retrieving lenses\nA single index can hold datasets of many different sample types. The sample type is tracked via schema references, not as a generic parameter on the index.\n\n\nSome index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. 
If present, load_dataset will use it for S3 credential resolution.\n\n\n\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nOptional data store for reading/writing shards.\n\n\ndatasets\nLazily iterate over all dataset entries in this index.\n\n\nschemas\nLazily iterate over all schema records in this index.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. 
If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1419 1419 }, 1420 1420 { 1421 1421 "objectID": "api/AbstractIndex.html#optional-extensions", ··· 1425 1425 "text": "Some index implementations support additional features: - data_store: An AbstractDataStore for reading/writing dataset shards. If present, load_dataset will use it for S3 credential resolution." 1426 1426 }, 1427 1427 { 1428 - "objectID": "api/AbstractIndex.html#example", 1429 - "href": "api/AbstractIndex.html#example", 1428 + "objectID": "api/AbstractIndex.html#examples", 1429 + "href": "api/AbstractIndex.html#examples", 1430 1430 "title": "AbstractIndex", 1431 1431 "section": "", 1432 - "text": "::\n&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... print(f\"{entry.name} -&gt; {entry.schema_ref}\")" 1432 + "text": "&gt;&gt;&gt; def publish_and_list(index: AbstractIndex) -&gt; None:\n... # Publish schemas for different types\n... schema1 = index.publish_schema(ImageSample, version=\"1.0.0\")\n... schema2 = index.publish_schema(TextSample, version=\"1.0.0\")\n...\n... # Insert datasets of different types\n... index.insert_dataset(image_ds, name=\"images\")\n... index.insert_dataset(text_ds, name=\"texts\")\n...\n... # List all datasets (mixed types)\n... for entry in index.list_datasets():\n... 
print(f\"{entry.name} -&gt; {entry.schema_ref}\")" 1433 1433 }, 1434 1434 { 1435 1435 "objectID": "api/AbstractIndex.html#attributes", ··· 1443 1443 "href": "api/AbstractIndex.html#methods", 1444 1444 "title": "AbstractIndex", 1445 1445 "section": "", 1446 - "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. 
If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1446 + "text": "Name\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python Packable type from a stored schema.\n\n\nget_dataset\nGet a dataset entry by name or reference.\n\n\nget_schema\nGet a schema record by reference.\n\n\ninsert_dataset\nInsert a dataset into the index.\n\n\nlist_datasets\nGet all dataset entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list.\n\n\npublish_schema\nPublish a schema for a sample type.\n\n\n\n\n\nAbstractIndex.decode_schema(ref)\nReconstruct a Python Packable type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a Packable class matching the schema definition.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA dynamically generated Packable class with fields matching\n\n\n\nType[Packable]\nthe schema definition. The class can be used with\n\n\n\nType[Packable]\nDataset[T] to load and iterate over samples.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded (unsupported field types).\n\n\n\n\n\n\n&gt;&gt;&gt; entry = index.get_dataset(\"my-dataset\")\n&gt;&gt;&gt; SampleType = index.decode_schema(entry.schema_ref)\n&gt;&gt;&gt; ds = Dataset[SampleType](entry.data_urls[0])\n&gt;&gt;&gt; for sample in ds.ordered():\n... 
print(sample) # sample is instance of SampleType\n\n\n\n\nAbstractIndex.get_dataset(ref)\nGet a dataset entry by name or reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name, path, or full reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nAbstractIndex.get_schema(ref)\nGet a schema record by reference.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (local:// or at://).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with fields like ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\n\n\n\n\nAbstractIndex.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index.\nThe sample type is inferred from ds.sample_type. If schema_ref is not provided, the schema may be auto-published based on the sample type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register in the index (any sample type).\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional explicit schema reference. If not provided, the schema may be auto-published or inferred from ds.sample_type.\nNone\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIndexEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_datasets()\nGet all dataset entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[IndexEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nAbstractIndex.list_schemas()\nGet all schema records as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nAbstractIndex.publish_schema(sample_type, *, version='1.0.0', **kwargs)\nPublish a schema for a sample type.\nThe sample_type is accepted as type rather than Type[Packable] to support @packable-decorated classes, which satisfy the Packable protocol at runtime but cannot be statically verified by type checkers.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (PackableSample subclass or @packable-decorated). Validated at runtime via the @runtime_checkable Packable protocol.\nrequired\n\n\nversion\nstr\nSemantic version string for the schema.\n'1.0.0'\n\n\n**kwargs\n\nAdditional backend-specific options.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string:\n\n\n\nstr\n- Local: ‘local://schemas/{module.Class}@version’\n\n\n\nstr\n- Atmosphere: ‘at://did:plc:…/ac.foundation.dataset.sampleSchema/…’" 1447 1447 }, 1448 1448 { 1449 1449 "objectID": "api/AtmosphereIndexEntry.html", ··· 1464 1464 "href": "api/LensPublisher.html", 1465 1465 "title": "LensPublisher", 1466 1466 "section": "", 1467 - "text": "atmosphere.LensPublisher(client)\nPublishes Lens transformation records to ATProto.\nThis class creates lens records that reference source and target schemas and point to the transformation code in a git repository.\n\n\n::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... 
return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )\n\n\n\nLens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 1467 + "text": "atmosphere.LensPublisher(client)\nPublishes Lens transformation records to ATProto.\nThis class creates lens records that reference source and target schemas and point to the transformation code in a git repository.\n\n\n&gt;&gt;&gt; @atdata.lens\n... 
def my_lens(source: SourceType) -&gt; TargetType:\n... return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )\n\n\n\nLens code is stored as references to git repositories rather than inline code. This prevents arbitrary code execution from ATProto records. Users must manually install and trust lens implementations.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\npublish\nPublish a lens transformation record to ATProto.\n\n\npublish_from_lens\nPublish a lens record from an existing Lens object.\n\n\n\n\n\natmosphere.LensPublisher.publish(\n name,\n source_schema_uri,\n target_schema_uri,\n description=None,\n code_repository=None,\n code_commit=None,\n getter_path=None,\n putter_path=None,\n rkey=None,\n)\nPublish a lens transformation record to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\ncode_repository\nOptional[str]\nGit repository URL containing the lens code.\nNone\n\n\ncode_commit\nOptional[str]\nGit commit hash for reproducibility.\nNone\n\n\ngetter_path\nOptional[str]\nModule path to the getter function (e.g., ‘mymodule.lenses:my_getter’).\nNone\n\n\nputter_path\nOptional[str]\nModule path to the putter function (e.g., ‘mymodule.lenses:my_putter’).\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf code references are incomplete.\n\n\n\n\n\n\n\natmosphere.LensPublisher.publish_from_lens(\n lens_obj,\n *,\n name,\n source_schema_uri,\n target_schema_uri,\n code_repository,\n code_commit,\n description=None,\n rkey=None,\n)\nPublish a lens record from an existing Lens object.\nThis method extracts the getter and putter function names from the Lens object and publishes a record referencing them.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nlens_obj\nLens\nThe Lens object to publish.\nrequired\n\n\nname\nstr\nHuman-readable lens name.\nrequired\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nstr\nAT URI of the target schema.\nrequired\n\n\ncode_repository\nstr\nGit repository URL.\nrequired\n\n\ncode_commit\nstr\nGit commit hash.\nrequired\n\n\ndescription\nOptional[str]\nWhat this transformation does.\nNone\n\n\nrkey\nOptional[str]\nOptional explicit record key.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nThe AT URI of the created lens record." 
1468 1468 }, 1469 1469 { 1470 - "objectID": "api/LensPublisher.html#example", 1471 - "href": "api/LensPublisher.html#example", 1470 + "objectID": "api/LensPublisher.html#examples", 1471 + "href": "api/LensPublisher.html#examples", 1472 1472 "title": "LensPublisher", 1473 1473 "section": "", 1474 - "text": "::\n&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )" 1474 + "text": "&gt;&gt;&gt; @atdata.lens\n... def my_lens(source: SourceType) -&gt; TargetType:\n... return TargetType(field=source.other_field)\n&gt;&gt;&gt;\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle\", \"password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; publisher = LensPublisher(client)\n&gt;&gt;&gt; uri = publisher.publish(\n... name=\"my_lens\",\n... source_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/source\",\n... target_schema_uri=\"at://did:plc:abc/ac.foundation.dataset.sampleSchema/target\",\n... code_repository=\"https://github.com/user/repo\",\n... code_commit=\"abc123def456\",\n... getter_path=\"mymodule.lenses:my_lens\",\n... putter_path=\"mymodule.lenses:my_lens_putter\",\n... )" 1475 1475 }, 1476 1476 { 1477 1477 "objectID": "api/LensPublisher.html#security-note", ··· 1492 1492 "href": "api/SampleBatch.html", 1493 1493 "title": "SampleBatch", 1494 1494 "section": "", 1495 - "text": "SampleBatch(samples)\nA batch of samples with automatic attribute aggregation.\nThis class wraps a sequence of samples and provides magic __getattr__ access to aggregate sample attributes. When you access an attribute that exists on the sample type, it automatically aggregates values across all samples in the batch.\nNDArray fields are stacked into a numpy array with a batch dimension. Other fields are aggregated into a list.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1495 + "text": "SampleBatch(samples)\nA batch of samples with automatic attribute aggregation.\nThis class wraps a sequence of samples and provides magic __getattr__ access to aggregate sample attributes. When you access an attribute that exists on the sample type, it automatically aggregates values across all samples in the batch.\nNDArray fields are stacked into a numpy array with a batch dimension. 
Other fields are aggregated into a list.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nDT\n\nThe sample type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch.\n\n\n\n\n\n\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax SampleBatch[MyType](samples) rather than calling the constructor directly with an unsubscripted class." 1496 1496 }, 1497 1497 { 1498 1498 "objectID": "api/SampleBatch.html#parameters", ··· 1509 1509 "text": "Name\nType\nDescription\n\n\n\n\nsamples\n\nThe list of sample instances in this batch." 1510 1510 }, 1511 1511 { 1512 - "objectID": "api/SampleBatch.html#example", 1513 - "href": "api/SampleBatch.html#example", 1512 + "objectID": "api/SampleBatch.html#examples", 1513 + "href": "api/SampleBatch.html#examples", 1514 1514 "title": "SampleBatch", 1515 1515 "section": "", 1516 - "text": "::\n&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names" 1516 + "text": "&gt;&gt;&gt; batch = SampleBatch[MyData]([sample1, sample2, sample3])\n&gt;&gt;&gt; batch.embeddings # Returns stacked numpy array of shape (3, ...)\n&gt;&gt;&gt; batch.names # Returns list of names" 1517 1517 }, 1518 1518 { 1519 1519 "objectID": "api/SampleBatch.html#note", ··· 1637 1637 "href": "api/packable.html", 1638 1638 "title": "packable", 1639 1639 "section": "", 1640 - "text": "packable(cls)\nDecorator to convert a regular class into a PackableSample.\nThis decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.\nThe resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. 
The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures.\n\n\n\n\n\n\nThis is a test of the functionality::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1640 + "text": "packable(cls)\nDecorator to convert a regular class into a PackableSample.\nThis decorator transforms a class into a dataclass that inherits from PackableSample, enabling automatic msgpack serialization/deserialization with special handling for NDArray fields.\nThe resulting class satisfies the Packable protocol, making it compatible with all atdata APIs that accept packable types (e.g., publish_schema, lens transformations, etc.).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncls\ntype[_T]\nThe class to convert. Should have type annotations for its fields.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[_T]\nA new dataclass that inherits from PackableSample with the same\n\n\n\ntype[_T]\nname and annotations as the original class. The class satisfies the\n\n\n\ntype[_T]\nPackable protocol and can be used with Type[Packable] signatures.\n\n\n\n\n\n\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... values: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", values=np.array([1, 2, 3]))\n&gt;&gt;&gt; bytes_data = sample.packed\n&gt;&gt;&gt; restored = MyData.from_bytes(bytes_data)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Works with Packable-typed APIs\n&gt;&gt;&gt; index.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1641 1641 }, 1642 1642 { 1643 1643 "objectID": "api/packable.html#parameters", ··· 1658 1658 "href": "api/packable.html#examples", 1659 1659 "title": "packable", 1660 1660 "section": "", 1661 - "text": "This is a test of the functionality::\n@packable\nclass MyData:\n name: str\n values: NDArray\n\nsample = MyData(name=\"test\", values=np.array([1, 2, 3]))\nbytes_data = sample.packed\nrestored = MyData.from_bytes(bytes_data)\n\n# Works with Packable-typed APIs\nindex.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1661 + "text": "&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... values: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", values=np.array([1, 2, 3]))\n&gt;&gt;&gt; bytes_data = sample.packed\n&gt;&gt;&gt; restored = MyData.from_bytes(bytes_data)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Works with Packable-typed APIs\n&gt;&gt;&gt; index.publish_schema(MyData, version=\"1.0.0\") # Type-safe" 1662 1662 }, 1663 1663 { 1664 1664 "objectID": "api/Packable-protocol.html", 1665 1665 "href": "api/Packable-protocol.html", 1666 1666 "title": "Packable", 1667 1667 "section": "", 1668 - "text": "Packable()\nStructural protocol for packable sample types.\nThis protocol allows classes decorated with @packable to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.\nBoth PackableSample subclasses and @packable-decorated classes satisfy this protocol structurally.\nThe protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... 
value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1668 + "text": "Packable()\nStructural protocol for packable sample types.\nThis protocol allows classes decorated with @packable to be recognized as valid types for lens transformations and schema operations, even though the decorator doesn’t change the class’s nominal type at static analysis time.\nBoth PackableSample subclasses and @packable-decorated classes satisfy this protocol structurally.\nThe protocol captures the full interface needed for: - Lens type transformations (as_wds, from_data) - Schema publishing (class introspection via dataclass fields) - Serialization/deserialization (packed, from_bytes)\n\n\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nWebDataset-compatible representation with key and msgpack.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate instance from raw msgpack bytes.\n\n\nfrom_data\nCreate instance from unpacked msgpack data dictionary.\n\n\n\n\n\nPackable.from_bytes(bs)\nCreate instance from raw msgpack bytes.\n\n\n\nPackable.from_data(data)\nCreate instance from unpacked msgpack data dictionary." 1669 1669 }, 1670 1670 { 1671 - "objectID": "api/Packable-protocol.html#example", 1672 - "href": "api/Packable-protocol.html#example", 1671 + "objectID": "api/Packable-protocol.html#examples", 1672 + "href": "api/Packable-protocol.html#examples", 1673 1673 "title": "Packable", 1674 1674 "section": "", 1675 - "text": "::\n&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... print(instance.packed)" 1675 + "text": "&gt;&gt;&gt; @packable\n... class MySample:\n... name: str\n... value: int\n...\n&gt;&gt;&gt; def process(sample_type: Type[Packable]) -&gt; None:\n... # Type checker knows sample_type has from_bytes, packed, etc.\n... instance = sample_type.from_bytes(data)\n... 
print(instance.packed)" 1676 1676 }, 1677 1677 { 1678 1678 "objectID": "api/Packable-protocol.html#attributes", ··· 1693 1693 "href": "api/AtUri.html", 1694 1694 "title": "AtUri", 1695 1695 "section": "", 1696 - "text": "atmosphere.AtUri(authority, collection, rkey)\nParsed AT Protocol URI.\nAT URIs follow the format: at:////\n\n\n::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 1696 + "text": "atmosphere.AtUri(authority, collection, rkey)\nParsed AT Protocol URI.\nAT URIs follow the format: at:////\n\n\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nauthority\nThe DID or handle of the repository owner.\n\n\ncollection\nThe NSID of the record collection.\n\n\nrkey\nThe record key within the collection.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nparse\nParse an AT URI string into components.\n\n\n\n\n\natmosphere.AtUri.parse(uri)\nParse an AT URI string into components.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr\nAT URI string in format at://&lt;authority&gt;/&lt;collection&gt;/&lt;rkey&gt;\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtUri\nParsed AtUri instance.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the URI format is invalid." 
1697 1697 }, 1698 1698 { 1699 - "objectID": "api/AtUri.html#example", 1700 - "href": "api/AtUri.html#example", 1699 + "objectID": "api/AtUri.html#examples", 1700 + "href": "api/AtUri.html#examples", 1701 1701 "title": "AtUri", 1702 1702 "section": "", 1703 - "text": "::\n&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'" 1703 + "text": "&gt;&gt;&gt; uri = AtUri.parse(\"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz\")\n&gt;&gt;&gt; uri.authority\n'did:plc:abc123'\n&gt;&gt;&gt; uri.collection\n'ac.foundation.dataset.sampleSchema'\n&gt;&gt;&gt; uri.rkey\n'xyz'" 1704 1704 }, 1705 1705 { 1706 1706 "objectID": "api/AtUri.html#attributes", ··· 1742 1742 "href": "api/AbstractDataStore.html", 1743 1743 "title": "AbstractDataStore", 1744 1744 "section": "", 1745 - "text": "AbstractDataStore()\nProtocol for data storage operations.\nThis protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)\nThe separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.\n\n\n::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 
1745 + "text": "AbstractDataStore()\nProtocol for data storage operations.\nThis protocol abstracts over different storage backends for dataset data: - S3DataStore: S3-compatible object storage - PDSBlobStore: ATProto PDS blob storage (future)\nThe separation of index (metadata) from data store (actual files) allows flexible deployment: local index with S3 storage, atmosphere index with S3 storage, or atmosphere index with PDS blobs.\n\n\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nread_url\nResolve a storage URL for reading.\n\n\nsupports_streaming\nWhether this store supports streaming reads.\n\n\nwrite_shards\nWrite dataset shards to storage.\n\n\n\n\n\nAbstractDataStore.read_url(url)\nResolve a storage URL for reading.\nSome storage backends may need to transform URLs (e.g., signing S3 URLs or resolving blob references). This method returns a URL that can be used directly with WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nStorage URL to resolve.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nWebDataset-compatible URL for reading.\n\n\n\n\n\n\n\nAbstractDataStore.supports_streaming()\nWhether this store supports streaming reads.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue if the store supports efficient streaming (like S3),\n\n\n\nbool\nFalse if data must be fully downloaded first.\n\n\n\n\n\n\n\nAbstractDataStore.write_shards(ds, *, prefix, **kwargs)\nWrite dataset shards to storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nPath prefix for the shards (e.g., ‘datasets/mnist/v1’).\nrequired\n\n\n**kwargs\n\nBackend-specific options (e.g., maxcount for shard size).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for the written shards, suitable for use with\n\n\n\nlist[str]\nWebDataset or atdata.Dataset()." 1746 1746 }, 1747 1747 { 1748 - "objectID": "api/AbstractDataStore.html#example", 1749 - "href": "api/AbstractDataStore.html#example", 1748 + "objectID": "api/AbstractDataStore.html#examples", 1749 + "href": "api/AbstractDataStore.html#examples", 1750 1750 "title": "AbstractDataStore", 1751 1751 "section": "", 1752 - "text": "::\n&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]" 1752 + "text": "&gt;&gt;&gt; store = S3DataStore(credentials, bucket=\"my-bucket\")\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; print(urls)\n['s3://my-bucket/training/v1/shard-000000.tar', ...]" 1753 1753 }, 1754 1754 { 1755 1755 "objectID": "api/AbstractDataStore.html#methods", ··· 1763 1763 "href": "api/Dataset.html", 1764 1764 "title": "Dataset", 1765 1765 "section": "", 1766 - "text": "Dataset(source=None, metadata_url=None, *, url=None)\nA typed dataset built on WebDataset with lens transformations.\nThis class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. 
Samples are stored as msgpack-serialized data within WebDataset shards.\nThe dataset supports: - Ordered and shuffled iteration - Automatic batching with SampleBatch - Type transformations via the lens system (as_type()) - Export to parquet format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s).\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. 
Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1766 + "text": "Dataset(source=None, metadata_url=None, *, url=None)\nA typed dataset built on WebDataset with lens transformations.\nThis class wraps WebDataset tar archives and provides type-safe iteration over samples of a specific PackableSample type. 
Samples are stored as msgpack-serialized data within WebDataset shards.\nThe dataset supports: - Ordered and shuffled iteration - Automatic batching with SampleBatch - Type transformations via the lens system (as_type()) - Export to parquet format\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for this dataset, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s).\n\n\n\n\n\n\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)\n\n\n\nThis class uses Python’s __orig_class__ mechanism to extract the type parameter at runtime. Instances must be created using the subscripted syntax Dataset[MyType](url) rather than calling the constructor directly with an unsubscripted class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. 
Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1767 1767 }, 1768 1768 { 1769 1769 "objectID": "api/Dataset.html#parameters", ··· 1780 1780 "text": "Name\nType\nDescription\n\n\n\n\nurl\n\nWebDataset brace-notation URL for the tar file(s)." 
1781 1781 }, 1782 1782 { 1783 - "objectID": "api/Dataset.html#example", 1784 - "href": "api/Dataset.html#example", 1783 + "objectID": "api/Dataset.html#examples", 1784 + "href": "api/Dataset.html#examples", 1785 1785 "title": "Dataset", 1786 1786 "section": "", 1787 - "text": "::\n&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)" 1787 + "text": "&gt;&gt;&gt; ds = Dataset[MyData](\"path/to/data-{000000..000009}.tar\")\n&gt;&gt;&gt; for sample in ds.ordered(batch_size=32):\n... # sample is SampleBatch[MyData] with batch_size samples\n... embeddings = sample.embeddings # shape: (32, ...)\n...\n&gt;&gt;&gt; # Transform to a different view\n&gt;&gt;&gt; ds_view = ds.as_type(MyDataView)" 1788 1788 }, 1789 1789 { 1790 1790 "objectID": "api/Dataset.html#note", ··· 1798 1798 "href": "api/Dataset.html#methods", 1799 1799 "title": "Dataset", 1800 1800 "section": "", 1801 - "text": "Name\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. 
Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n::\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 
1801 + "text": "Name\nDescription\n\n\n\n\nas_type\nView this dataset through a different sample type using a registered lens.\n\n\nlist_shards\nGet list of individual dataset shards.\n\n\nordered\nIterate over the dataset in order\n\n\nshuffled\nIterate over the dataset in random order.\n\n\nto_parquet\nExport dataset contents to parquet format.\n\n\nwrap\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\nwrap_batch\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nDataset.as_type(other)\nView this dataset through a different sample type using a registered lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nother\nType[RT]\nThe target sample type to transform into. Must be a type derived from PackableSample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[RT]\nA new Dataset instance that yields samples of type other\n\n\n\nDataset[RT]\nby applying the appropriate lens transformation from the global\n\n\n\nDataset[RT]\nLensNetwork registry.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no registered lens exists between the current sample type and the target type.\n\n\n\n\n\n\n\nDataset.list_shards()\nGet list of individual dataset shards.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nA full (non-lazy) list of the individual tar files within the\n\n\n\nlist[str]\nsource WebDataset.\n\n\n\n\n\n\n\nDataset.ordered(batch_size=None)\nIterate over the dataset in order\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch_size (\n\nobj:int, optional): The size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nobj:webdataset.DataPipeline A data pipeline that iterates over\n\n\n\nIterable[ST]\nthe dataset in its original sample order\n\n\n\n\n\n\n\nDataset.shuffled(buffer_shards=100, buffer_samples=10000, batch_size=None)\nIterate over the dataset in random order.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbuffer_shards\nint\nNumber of shards to buffer for shuffling at the shard level. Larger values increase randomness but use more memory. Default: 100.\n100\n\n\nbuffer_samples\nint\nNumber of samples to buffer for shuffling within shards. Larger values increase randomness but use more memory. Default: 10,000.\n10000\n\n\nbatch_size\nint | None\nThe size of iterated batches. Default: None (unbatched). If None, iterates over one sample at a time with no batch dimension.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIterable[ST]\nA WebDataset data pipeline that iterates over the dataset in\n\n\n\nIterable[ST]\nrandomized order. If batch_size is not None, yields\n\n\n\nIterable[ST]\nSampleBatch[ST] instances; otherwise yields individual ST\n\n\n\nIterable[ST]\nsamples.\n\n\n\n\n\n\n\nDataset.to_parquet(path, sample_map=None, maxcount=None, **kwargs)\nExport dataset contents to parquet format.\nConverts all samples to a pandas DataFrame and saves to parquet file(s). Useful for interoperability with data analysis tools.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\npath\nPathlike\nOutput path for the parquet file. If maxcount is specified, files are named {stem}-{segment:06d}.parquet.\nrequired\n\n\nsample_map\nOptional[SampleExportMap]\nOptional function to convert samples to dictionaries. 
Defaults to dataclasses.asdict.\nNone\n\n\nmaxcount\nOptional[int]\nIf specified, split output into multiple files with at most this many samples each. Recommended for large datasets.\nNone\n\n\n**kwargs\n\nAdditional arguments passed to pandas.DataFrame.to_parquet(). Common options include compression, index, engine.\n{}\n\n\n\n\n\n\nMemory Usage: When maxcount=None (default), this method loads the entire dataset into memory as a pandas DataFrame before writing. For large datasets, this can cause memory exhaustion.\nFor datasets larger than available RAM, always specify maxcount::\n# Safe for large datasets - processes in chunks\nds.to_parquet(\"output.parquet\", maxcount=10000)\nThis creates multiple parquet files: output-000000.parquet, output-000001.parquet, etc.\n\n\n\n&gt;&gt;&gt; ds = Dataset[MySample](\"data.tar\")\n&gt;&gt;&gt; # Small dataset - load all at once\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Large dataset - process in chunks\n&gt;&gt;&gt; ds.to_parquet(\"output.parquet\", maxcount=50000)\n\n\n\n\nDataset.wrap(sample)\nWrap a raw msgpack sample into the appropriate dataset-specific type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample\nWDSRawSample\nA dictionary containing at minimum a 'msgpack' key with serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nST\nA deserialized sample of type ST, optionally transformed through\n\n\n\nST\na lens if as_type() was called.\n\n\n\n\n\n\n\nDataset.wrap_batch(batch)\nWrap a batch of raw msgpack samples into a typed SampleBatch.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbatch\nWDSRawBatch\nA dictionary containing a 'msgpack' key with a list of serialized sample bytes.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSampleBatch[ST]\nA SampleBatch[ST] containing deserialized samples, optionally\n\n\n\nSampleBatch[ST]\ntransformed through a lens if as_type() was called.\n\n\n\n\n\n\nThis implementation deserializes samples one at a time, then aggregates them into a batch." 1802 1802 }, 1803 1803 { 1804 1804 "objectID": "api/local.Index.html", 1805 1805 "href": "api/local.Index.html", 1806 1806 "title": "local.Index", 1807 1807 "section": "", 1808 - "text": "local.Index(\n redis=None,\n data_store=None,\n auto_stubs=False,\n stub_dir=None,\n **kwargs,\n)\nRedis-backed index for tracking datasets in a repository.\nImplements the AbstractIndex protocol. Maintains a registry of LocalDatasetEntry objects in Redis, allowing enumeration and lookup of stored datasets.\nWhen initialized with a data_store, insert_dataset() will write dataset shards to storage before indexing. 
Without a data_store, insert_dataset() only indexes existing URLs.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. 
The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; 
index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1808 + "text": "local.Index(\n redis=None,\n data_store=None,\n auto_stubs=False,\n stub_dir=None,\n **kwargs,\n)\nRedis-backed index for tracking datasets in a repository.\nImplements the AbstractIndex protocol. Maintains a registry of LocalDatasetEntry objects in Redis, allowing enumeration and lookup of stored datasets.\nWhen initialized with a data_store, insert_dataset() will write dataset shards to storage before indexing. 
Without a data_store, insert_dataset() only indexes existing URLs.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_redis\n\nRedis connection for index storage.\n\n\n_data_store\n\nOptional AbstractDataStore for writing dataset shards.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. 
The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; 
index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1809 1809 }, 1810 1810 { 1811 1811 "objectID": "api/local.Index.html#attributes", ··· 1819 1819 "href": "api/local.Index.html#methods", 1820 1820 "title": "local.Index", 1821 1821 "section": "", 1822 - "text": "Name\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. 
If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. 
Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. 
Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 
1822 + "text": "Name\nDescription\n\n\n\n\nadd_entry\nAdd a dataset to the index.\n\n\nclear_stubs\nRemove all auto-generated stub files.\n\n\ndecode_schema\nReconstruct a Python PackableSample type from a stored schema.\n\n\ndecode_schema_as\nDecode a schema with explicit type hint for IDE support.\n\n\nget_dataset\nGet a dataset entry by name (AbstractIndex protocol).\n\n\nget_entry\nGet an entry by its CID.\n\n\nget_entry_by_name\nGet an entry by its human-readable name.\n\n\nget_import_path\nGet the import path for a schema’s generated module.\n\n\nget_schema\nGet a schema record by reference (AbstractIndex protocol).\n\n\nget_schema_record\nGet a schema record as LocalSchemaRecord object.\n\n\ninsert_dataset\nInsert a dataset into the index (AbstractIndex protocol).\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_entries\nGet all index entries as a materialized list.\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\nload_schema\nLoad a schema and make it available in the types namespace.\n\n\npublish_schema\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nlocal.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)\nAdd a dataset to the index.\nCreates a LocalDatasetEntry for the dataset and persists it to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe dataset to add to the index.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference. If None, generates from sample type.\nNone\n\n\nmetadata\ndict | None\nOptional metadata dictionary. If None, uses ds._metadata if available.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nThe created LocalDatasetEntry object.\n\n\n\n\n\n\n\nlocal.Index.clear_stubs()\nRemove all auto-generated stub files.\nOnly works if auto_stubs was enabled when creating the Index.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nint\nNumber of stub files removed, or 0 if auto_stubs is disabled.\n\n\n\n\n\n\n\nlocal.Index.decode_schema(ref)\nReconstruct a Python PackableSample type from a stored schema.\nThis method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.\nIf auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nA PackableSample subclass - either imported from a generated module\n\n\n\nType[Packable]\n(if auto_stubs is enabled) or dynamically created.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\nlocal.Index.decode_schema_as(ref, type_hint)\nDecode a schema with explicit type hint for IDE support.\nThis is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. 
Use this when you have a stub file for the schema and want full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\ntype_hint\ntype[T]\nThe stub type to use for type hints. Import this from the generated stub file.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ntype[T]\nThe decoded type, cast to match the type_hint for IDE support.\n\n\n\n\n\n\n&gt;&gt;&gt; # After enabling auto_stubs and configuring IDE extraPaths:\n&gt;&gt;&gt; from local.MySample_1_0_0 import MySample\n&gt;&gt;&gt;\n&gt;&gt;&gt; # This gives full IDE autocomplete:\n&gt;&gt;&gt; DecodedType = index.decode_schema_as(ref, MySample)\n&gt;&gt;&gt; sample = DecodedType(text=\"hello\", value=42) # IDE knows signature!\n\n\n\nThe type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.\n\n\n\n\nlocal.Index.get_dataset(ref)\nGet a dataset entry by name (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nDataset name.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf dataset not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry(cid)\nGet an entry by its CID.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ncid\nstr\nContent identifier of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry for the given CID.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf entry not found.\n\n\n\n\n\n\n\nlocal.Index.get_entry_by_name(name)\nGet an entry by its human-readable name.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nname\nstr\nHuman-readable name of the entry.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nLocalDatasetEntry with the given name.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf no entry with that name exists.\n\n\n\n\n\n\n\nlocal.Index.get_import_path(ref)\nGet the import path for a schema’s generated module.\nWhen auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr | None\nImport path like “local.MySample_1_0_0”, or None if auto_stubs\n\n\n\nstr | None\nis disabled.\n\n\n\n\n\n\n&gt;&gt;&gt; index = LocalIndex(auto_stubs=True)\n&gt;&gt;&gt; ref = index.publish_schema(MySample, version=\"1.0.0\")\n&gt;&gt;&gt; index.load_schema(ref)\n&gt;&gt;&gt; print(index.get_import_path(ref))\nlocal.MySample_1_0_0\n&gt;&gt;&gt; # Then in your code:\n&gt;&gt;&gt; # from local.MySample_1_0_0 import MySample\n\n\n\n\nlocal.Index.get_schema(ref)\nGet a schema record by reference (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string. 
Supports both new format (atdata://local/sampleSchema/{name}@version) and legacy format (local://schemas/{module.Class}@version).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record as a dictionary with keys ‘name’, ‘version’,\n\n\n\ndict\n‘fields’, ‘$ref’, etc.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.get_schema_record(ref)\nGet a schema record as LocalSchemaRecord object.\nUse this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalSchemaRecord\nLocalSchemaRecord with schema details.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf reference format is invalid.\n\n\n\n\n\n\n\nlocal.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)\nInsert a dataset into the index (AbstractIndex protocol).\nIf a data_store was provided at initialization, writes dataset shards to storage first, then indexes the new URLs. Otherwise, indexes the dataset’s existing URL.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to register.\nrequired\n\n\nname\nstr\nHuman-readable name for the dataset.\nrequired\n\n\nschema_ref\nstr | None\nOptional schema reference.\nNone\n\n\n**kwargs\n\nAdditional options: - metadata: Optional metadata dict - prefix: Storage prefix (default: dataset name) - cache_local: If True, cache writes locally first\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLocalDatasetEntry\nIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\nlocal.Index.list_datasets()\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of IndexEntry for each dataset.\n\n\n\n\n\n\n\nlocal.Index.list_entries()\nGet all index entries as a materialized list.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[LocalDatasetEntry]\nList of all LocalDatasetEntry objects in the index.\n\n\n\n\n\n\n\nlocal.Index.list_schemas()\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\nlocal.Index.load_schema(ref)\nLoad a schema and make it available in the types namespace.\nThis method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the :attr:types namespace for easy access.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nSchema reference string (atdata://local/sampleSchema/… or legacy local://schemas/…).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nThe decoded PackableSample subclass. 
Also available via\n\n\n\nType[Packable]\nindex.types.&lt;ClassName&gt; after this call.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf schema not found.\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n&gt;&gt;&gt; # Load and use immediately\n&gt;&gt;&gt; MyType = index.load_schema(\"atdata://local/sampleSchema/MySample@1.0.0\")\n&gt;&gt;&gt; sample = MyType(name=\"hello\", value=42)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Or access later via namespace\n&gt;&gt;&gt; index.load_schema(\"atdata://local/sampleSchema/OtherType@1.0.0\")\n&gt;&gt;&gt; other = index.types.OtherType(data=\"test\")\n\n\n\n\nlocal.Index.publish_schema(sample_type, *, version=None, description=None)\nPublish a schema for a sample type to Redis.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\ntype\nA Packable type (@packable-decorated or PackableSample subclass).\nrequired\n\n\nversion\nstr | None\nSemantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists.\nNone\n\n\ndescription\nstr | None\nOptional human-readable description. If None, uses the class docstring.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nSchema reference string: ‘atdata://local/sampleSchema/{name}@version’.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf sample_type is not a dataclass.\n\n\n\nTypeError\nIf sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported." 1823 1823 }, 1824 1824 { 1825 1825 "objectID": "api/Lens.html", 1826 1826 "href": "api/Lens.html", 1827 1827 "title": "lens", 1828 1828 "section": "", 1829 - "text": "lens\nLens-based type transformations for datasets.\nThis module implements a lens system for bidirectional transformations between different sample types. Lenses enable viewing a dataset through different type schemas without duplicating the underlying data.\nKey components:\n\nLens: Bidirectional transformation with getter (S -&gt; V) and optional putter (V, S -&gt; S)\nLensNetwork: Global singleton registry for lens transformations\n@lens: Decorator to create and register lens transformations\n\nLenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).\n\n\n::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). 
It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. 
Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented.\n\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1829 + "text": "lens\nLens-based type transformations for datasets.\nThis module implements a lens system for bidirectional transformations between different sample types. Lenses enable viewing a dataset through different type schemas without duplicating the underlying data.\nKey components:\n\nLens: Bidirectional transformation with getter (S -&gt; V) and optional putter (V, S -&gt; S)\nLensNetwork: Global singleton registry for lens transformations\n@lens: Decorator to create and register lens transformations\n\nLenses support the functional programming concept of composable, well-behaved transformations that satisfy lens laws (GetPut and PutGet).\n\n\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... 
embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(field=view.field, other=source.other)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. 
Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented.\n\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1830 1830 }, 1831 1831 { 1832 - "objectID": "api/Lens.html#example", 1833 - "href": "api/Lens.html#example", 1832 + "objectID": "api/Lens.html#examples", 1833 + "href": "api/Lens.html#examples", 1834 1834 "title": "lens", 1835 1835 "section": "", 1836 - "text": "::\n&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens" 1836 + "text": "&gt;&gt;&gt; @packable\n... class FullData:\n... name: str\n... age: int\n... embedding: NDArray\n...\n&gt;&gt;&gt; @packable\n... class NameOnly:\n... name: str\n...\n&gt;&gt;&gt; @lens\n... def name_view(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_view.putter\n... def name_view_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age,\n... 
embedding=source.embedding)\n...\n&gt;&gt;&gt; ds = Dataset[FullData](\"data.tar\")\n&gt;&gt;&gt; ds_names = ds.as_type(NameOnly) # Uses registered lens" 1837 1837 }, 1838 1838 { 1839 1839 "objectID": "api/Lens.html#classes", 1840 1840 "href": "api/Lens.html#classes", 1841 1841 "title": "lens", 1842 1842 "section": "", 1843 - "text": "Name\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(...)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. 
It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented." 1843 + "text": "Name\nDescription\n\n\n\n\nLens\nA bidirectional transformation between two sample types.\n\n\nLensNetwork\nGlobal registry for lens transformations between sample types.\n\n\n\n\n\nlens.Lens(get, put=None)\nA bidirectional transformation between two sample types.\nA lens provides a way to view and update data of type S (source) as if it were type V (view). It consists of a getter that transforms S -&gt; V and an optional putter that transforms (V, S) -&gt; S, enabling updates to the view to be reflected back in the source.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nS\n\nThe source type, must derive from PackableSample.\nrequired\n\n\nV\n\nThe view type, must derive from PackableSample.\nrequired\n\n\n\n\n\n\n&gt;&gt;&gt; @lens\n... def name_lens(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @name_lens.putter\n... def name_lens_put(view: NameOnly, source: FullData) -&gt; FullData:\n... 
return FullData(name=view.name, age=source.age)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nTransform the source into the view type.\n\n\nput\nUpdate the source based on a modified view.\n\n\nputter\nDecorator to register a putter function for this lens.\n\n\n\n\n\nlens.Lens.get(s)\nTransform the source into the view type.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ns\nS\nThe source sample of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nV\nA view of the source as type V.\n\n\n\n\n\n\n\nlens.Lens.put(v, s)\nUpdate the source based on a modified view.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nv\nV\nThe modified view of type V.\nrequired\n\n\ns\nS\nThe original source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nS\nAn updated source of type S that reflects changes from the view.\n\n\n\n\n\n\n\nlens.Lens.putter(put)\nDecorator to register a putter function for this lens.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nput\nLensPutter[S, V]\nA function that takes a view of type V and source of type S, and returns an updated source of type S.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLensPutter[S, V]\nThe putter function, allowing this to be used as a decorator.\n\n\n\n\n\n\n&gt;&gt;&gt; @my_lens.putter\n... def my_lens_put(view: ViewType, source: SourceType) -&gt; SourceType:\n... return SourceType(field=view.field, other=source.other)\n\n\n\n\n\n\nlens.LensNetwork()\nGlobal registry for lens transformations between sample types.\nThis class implements a singleton pattern to maintain a global registry of all lenses decorated with @lens. It enables looking up transformations between different PackableSample types.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n_instance\n\nThe singleton instance of this class.\n\n\n_registry\nDict[LensSignature, Lens]\nDictionary mapping (source_type, view_type) tuples to their corresponding Lens objects.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nregister\nRegister a lens as the canonical transformation between two types.\n\n\ntransform\nLook up the lens transformation between two sample types.\n\n\n\n\n\nlens.LensNetwork.register(_lens)\nRegister a lens as the canonical transformation between two types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\n_lens\nLens\nThe lens to register. Will be stored in the registry under the key (_lens.source_type, _lens.view_type).\nrequired\n\n\n\n\n\n\nIf a lens already exists for the same type pair, it will be overwritten.\n\n\n\n\nlens.LensNetwork.transform(source, view)\nLook up the lens transformation between two sample types.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource\nDatasetType\nThe source sample type (must derive from PackableSample).\nrequired\n\n\nview\nDatasetType\nThe target view type (must derive from PackableSample).\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens\nThe registered Lens that transforms from source to view.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no lens has been registered for the given type pair.\n\n\n\n\n\n\nCurrently only supports direct transformations. Compositional transformations (chaining multiple lenses) are not yet implemented." 
1844 1844 }, 1845 1845 { 1846 1846 "objectID": "api/Lens.html#functions", 1847 1847 "href": "api/Lens.html#functions", 1848 1848 "title": "lens", 1849 1849 "section": "", 1850 - "text": "Name\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1850 + "text": "Name\nDescription\n\n\n\n\nlens\nDecorator to create and register a lens transformation.\n\n\n\n\n\nlens.lens(f)\nDecorator to create and register a lens transformation.\nThis decorator converts a getter function into a Lens object and automatically registers it in the global LensNetwork registry.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nf\nLensGetter[S, V]\nA getter function that transforms from source type S to view type V. Must have exactly one parameter with a type annotation.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nLens[S, V]\nA Lens[S, V] object that can be called to apply the transformation\n\n\n\nLens[S, V]\nor decorated with @lens_name.putter to add a putter function.\n\n\n\n\n\n\n&gt;&gt;&gt; @lens\n... def extract_name(full: FullData) -&gt; NameOnly:\n... return NameOnly(name=full.name)\n...\n&gt;&gt;&gt; @extract_name.putter\n... def extract_name_put(view: NameOnly, source: FullData) -&gt; FullData:\n... return FullData(name=view.name, age=source.age)" 1851 1851 }, 1852 1852 { 1853 1853 "objectID": "api/DatasetLoader.html", 1854 1854 "href": "api/DatasetLoader.html", 1855 1855 "title": "DatasetLoader", 1856 1856 "section": "", 1857 - "text": "atmosphere.DatasetLoader(client)\nLoads dataset records from ATProto.\nThis class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... 
print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. 
Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1857 + "text": "atmosphere.DatasetLoader(client)\nLoads dataset records from ATProto.\nThis class fetches dataset index records and can create Dataset objects from them. Note that loading a dataset requires having the corresponding Python class for the sample type.\n\n\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset 
record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1858 1858 }, 1859 1859 { 1860 - "objectID": "api/DatasetLoader.html#example", 1861 - "href": "api/DatasetLoader.html#example", 1860 + "objectID": "api/DatasetLoader.html#examples", 1861 + "href": "api/DatasetLoader.html#examples", 1862 1862 "title": "DatasetLoader", 1863 1863 "section": "", 1864 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... 
print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")" 1864 + "text": "&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # List available datasets\n&gt;&gt;&gt; datasets = loader.list()\n&gt;&gt;&gt; for ds in datasets:\n... print(ds[\"name\"], ds[\"schemaRef\"])\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Get a specific dataset record\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.record/xyz\")" 1865 1865 }, 1866 1866 { 1867 1867 "objectID": "api/DatasetLoader.html#methods", 1868 1868 "href": "api/DatasetLoader.html#methods", 1869 1869 "title": "DatasetLoader", 1870 1870 "section": "", 1871 - "text": "Name\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a 
dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1871 + "text": "Name\nDescription\n\n\n\n\nget\nFetch a dataset record by AT URI.\n\n\nget_blob_urls\nGet fetchable URLs for blob-stored dataset shards.\n\n\nget_blobs\nGet the blob references from a dataset record.\n\n\nget_metadata\nGet the metadata from a dataset record.\n\n\nget_storage_type\nGet the storage type of a dataset record.\n\n\nget_urls\nGet the WebDataset URLs from a dataset record.\n\n\nlist_all\nList dataset records from a repository.\n\n\nto_dataset\nCreate a Dataset object from an ATProto record.\n\n\n\n\n\natmosphere.DatasetLoader.get(uri)\nFetch a dataset record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe dataset record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a dataset record.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blob_urls(uri)\nGet fetchable URLs for blob-stored dataset shards.\nThis resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of URLs for fetching the blob data.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is not blobs or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_blobs(uri)\nGet the blob references from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of blob reference dicts with keys: $type, ref, mimeType, 
size.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not blobs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_metadata(uri)\nGet the metadata from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nOptional[dict]\nThe metadata dictionary, or None if no metadata.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_storage_type(uri)\nGet the storage type of a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nEither “external” or “blobs”.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf storage type is unknown.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.get_urls(uri)\nGet the WebDataset URLs from a dataset record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of WebDataset URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the storage type is not external URLs.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.list_all(repo=None, limit=100)\nList dataset records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of dataset records.\n\n\n\n\n\n\n\natmosphere.DatasetLoader.to_dataset(uri, sample_type)\nCreate a Dataset object from an ATProto record.\nThis method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.\nSupports both external URL storage and ATProto blob storage.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the dataset record.\nrequired\n\n\nsample_type\nType[ST]\nThe Python class for the sample type.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDataset[ST]\nA Dataset instance configured from the record.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf no storage URLs can be resolved.\n\n\n\n\n\n\n&gt;&gt;&gt; loader = DatasetLoader(client)\n&gt;&gt;&gt; dataset = loader.to_dataset(uri, MySampleType)\n&gt;&gt;&gt; for batch in dataset.shuffled(batch_size=32):\n... process(batch)" 1872 1872 }, 1873 1873 { 1874 1874 "objectID": "api/DataSource.html", 1875 1875 "href": "api/DataSource.html", 1876 1876 "title": "DataSource", 1877 1877 "section": "", 1878 - "text": "DataSource()\nProtocol for data sources that provide streams to Dataset.\nA DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)\nThe key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects\n\n\n::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... 
keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1878 + "text": "DataSource()\nProtocol for data sources that provide streams to Dataset.\nA DataSource abstracts over different ways of accessing dataset shards: - URLSource: Standard WebDataset-compatible URLs (http, https, pipe, gs, etc.) - S3Source: S3-compatible storage with explicit credentials - BlobSource: ATProto blob references (future)\nThe key method is shards(), which yields (identifier, stream) pairs. These are fed directly to WebDataset’s tar_file_expander, bypassing URL resolution entirely. This enables: - Private S3 repos with credentials - Custom endpoints (Cloudflare R2, MinIO) - ATProto blob streaming - Any other source that can provide file-like objects\n\n\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nshards\nLazily yield (identifier, stream) pairs for each shard.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nlist_shards\nGet list of shard identifiers without opening streams.\n\n\nopen_shard\nOpen a single shard by its identifier.\n\n\n\n\n\nDataSource.list_shards()\nGet list of shard identifiers without opening streams.\nUsed for metadata queries like counting shards without actually streaming data. Implementations should return identifiers that match what shards would yield.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of shard identifier strings.\n\n\n\n\n\n\n\nDataSource.open_shard(shard_id)\nOpen a single shard by its identifier.\nThis method enables random access to individual shards, which is required for PyTorch DataLoader worker splitting. 
Each worker opens only its assigned shards rather than iterating all shards.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nshard_id\nstr\nShard identifier from shard_list.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nIO[bytes]\nFile-like stream for reading the shard.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nKeyError\nIf shard_id is not in shard_list." 1879 1879 }, 1880 1880 { 1881 - "objectID": "api/DataSource.html#example", 1882 - "href": "api/DataSource.html#example", 1881 + "objectID": "api/DataSource.html#examples", 1882 + "href": "api/DataSource.html#examples", 1883 1883 "title": "DataSource", 1884 1884 "section": "", 1885 - "text": "::\n&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)" 1885 + "text": "&gt;&gt;&gt; source = S3Source(\n... bucket=\"my-bucket\",\n... keys=[\"data-000.tar\", \"data-001.tar\"],\n... endpoint=\"https://r2.example.com\",\n... credentials=creds,\n... )\n&gt;&gt;&gt; ds = Dataset[MySample](source)\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample)" 1886 1886 }, 1887 1887 { 1888 1888 "objectID": "api/DataSource.html#attributes", ··· 1903 1903 "href": "api/AtmosphereIndex.html", 1904 1904 "title": "AtmosphereIndex", 1905 1905 "section": "", 1906 - "text": "atmosphere.AtmosphereIndex(client, *, data_store=None)\nATProto index implementing AbstractIndex protocol.\nWraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.\nOptionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol).\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset 
by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 
1906 + "text": "atmosphere.AtmosphereIndex(client, *, data_store=None)\nATProto index implementing AbstractIndex protocol.\nWraps SchemaPublisher/Loader and DatasetPublisher/Loader to provide a unified interface compatible with LocalIndex.\nOptionally accepts a PDSBlobStore for writing dataset shards as ATProto blobs, enabling fully decentralized dataset storage.\n\n\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndata_store\nThe PDS blob store for writing shards, or None if not configured.\n\n\ndatasets\nLazily iterate over all dataset entries (AbstractIndex protocol).\n\n\nschemas\nLazily iterate over all schema records (AbstractIndex protocol).\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ndecode_schema\nReconstruct a Python type from a schema record.\n\n\nget_dataset\nGet a dataset by AT URI.\n\n\nget_schema\nGet a schema record by AT URI.\n\n\ninsert_dataset\nInsert a dataset into ATProto.\n\n\nlist_datasets\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\nlist_schemas\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\npublish_schema\nPublish a schema to ATProto.\n\n\n\n\n\natmosphere.AtmosphereIndex.decode_schema(ref)\nReconstruct a Python type from a schema record.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nType[Packable]\nDynamically generated Packable type.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf schema cannot be decoded.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_dataset(ref)\nGet a dataset by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the dataset record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the dataset.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.get_schema(ref)\nGet a schema record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nref\nstr\nAT URI of the schema record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nSchema record dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf record is not a schema.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.insert_dataset(\n ds,\n *,\n name,\n schema_ref=None,\n **kwargs,\n)\nInsert a dataset into ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\nDataset\nThe Dataset to publish.\nrequired\n\n\nname\nstr\nHuman-readable name.\nrequired\n\n\nschema_ref\nOptional[str]\nOptional schema AT URI. 
If None, auto-publishes schema.\nNone\n\n\n**kwargs\n\nAdditional options (description, tags, license).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAtmosphereIndexEntry\nAtmosphereIndexEntry for the inserted dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_datasets(repo=None)\nGet all dataset entries as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[AtmosphereIndexEntry]\nList of AtmosphereIndexEntry for each dataset.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.list_schemas(repo=None)\nGet all schema records as a materialized list (AbstractIndex protocol).\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nDID of repository. Defaults to authenticated user.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of schema records as dictionaries.\n\n\n\n\n\n\n\natmosphere.AtmosphereIndex.publish_schema(\n sample_type,\n *,\n version='1.0.0',\n **kwargs,\n)\nPublish a schema to ATProto.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsample_type\nType[Packable]\nA Packable type (PackableSample subclass or @packable-decorated).\nrequired\n\n\nversion\nstr\nSemantic version string.\n'1.0.0'\n\n\n**kwargs\n\nAdditional options (description, metadata).\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nAT URI of the schema record." 1907 1907 }, 1908 1908 { 1909 - "objectID": "api/AtmosphereIndex.html#example", 1910 - "href": "api/AtmosphereIndex.html#example", 1909 + "objectID": "api/AtmosphereIndex.html#examples", 1910 + "href": "api/AtmosphereIndex.html#examples", 1911 1911 "title": "AtmosphereIndex", 1912 1912 "section": "", 1913 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")" 1913 + "text": "&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; client.login(\"handle.bsky.social\", \"app-password\")\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Without blob storage (external URLs only)\n&gt;&gt;&gt; index = AtmosphereIndex(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; # With PDS blob storage\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; index = AtmosphereIndex(client, data_store=store)\n&gt;&gt;&gt; entry = index.insert_dataset(dataset, name=\"my-data\")" 1914 1914 }, 1915 1915 { 1916 1916 "objectID": "api/AtmosphereIndex.html#attributes", ··· 1931 1931 "href": "api/LensLoader.html", 1932 1932 "title": "LensLoader", 1933 1933 "section": "", 1934 - "text": "atmosphere.LensLoader(client)\nLoads lens records from ATProto.\nThis class fetches lens transformation records. 
Note that actually using a lens requires installing the referenced code and importing it manually.\n\n\n::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1934 + "text": "atmosphere.LensLoader(client)\nLoads lens records from ATProto.\nThis class fetches lens transformation records. Note that actually using a lens requires installing the referenced code and importing it manually.\n\n\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfind_by_schemas\nFind lenses that transform between specific schemas.\n\n\nget\nFetch a lens record by AT URI.\n\n\nlist_all\nList lens records from a repository.\n\n\n\n\n\natmosphere.LensLoader.find_by_schemas(\n source_schema_uri,\n target_schema_uri=None,\n repo=None,\n)\nFind lenses that transform between specific schemas.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsource_schema_uri\nstr\nAT URI of the source schema.\nrequired\n\n\ntarget_schema_uri\nOptional[str]\nOptional AT URI of the target schema. 
If not provided, returns all lenses from the source.\nNone\n\n\nrepo\nOptional[str]\nThe DID of the repository to search.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of matching lens records.\n\n\n\n\n\n\n\natmosphere.LensLoader.get(uri)\nFetch a lens record by AT URI.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nuri\nstr | AtUri\nThe AT URI of the lens record.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\ndict\nThe lens record as a dictionary.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf the record is not a lens record.\n\n\n\n\n\n\n\natmosphere.LensLoader.list_all(repo=None, limit=100)\nList lens records from a repository.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nrepo\nOptional[str]\nThe DID of the repository. Defaults to authenticated user.\nNone\n\n\nlimit\nint\nMaximum number of records to return.\n100\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[dict]\nList of lens records." 1935 1935 }, 1936 1936 { 1937 - "objectID": "api/LensLoader.html#example", 1938 - "href": "api/LensLoader.html#example", 1937 + "objectID": "api/LensLoader.html#examples", 1938 + "href": "api/LensLoader.html#examples", 1939 1939 "title": "LensLoader", 1940 1940 "section": "", 1941 - "text": "::\n&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))" 1941 + "text": "&gt;&gt;&gt; client = AtmosphereClient()\n&gt;&gt;&gt; loader = LensLoader(client)\n&gt;&gt;&gt;\n&gt;&gt;&gt; record = loader.get(\"at://did:plc:abc/ac.foundation.dataset.lens/xyz\")\n&gt;&gt;&gt; print(record[\"name\"])\n&gt;&gt;&gt; print(record[\"sourceSchema\"])\n&gt;&gt;&gt; print(record.get(\"getterCode\", {}).get(\"repository\"))" 1942 1942 }, 1943 1943 { 1944 1944 "objectID": "api/LensLoader.html#methods", ··· 1952 1952 "href": "api/DictSample.html", 1953 1953 "title": "DictSample", 1954 1954 "section": "", 1955 - "text": "DictSample(_data=None, **kwargs)\nDynamic sample type providing dict-like access to raw msgpack data.\nThis class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (sample.field) and dict-style (sample[\"field\"]) access to fields.\nDictSample is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema\nTo convert to a typed schema, use Dataset.as_type() with a @packable-decorated class. Every @packable class automatically registers a lens from DictSample, making this conversion seamless.\n\n\n::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)\n\n\n\nNDArray fields are stored as raw bytes in DictSample. 
They are only converted to numpy arrays when accessed through a typed sample class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1955 + "text": "DictSample(_data=None, **kwargs)\nDynamic sample type providing dict-like access to raw msgpack data.\nThis class is the default sample type for datasets when no explicit type is specified. It stores the raw unpacked msgpack data and provides both attribute-style (sample.field) and dict-style (sample[\"field\"]) access to fields.\nDictSample is useful for: - Exploring datasets without defining a schema first - Working with datasets that have variable schemas - Prototyping before committing to a typed schema\nTo convert to a typed schema, use Dataset.as_type() with a @packable-decorated class. Every @packable class automatically registers a lens from DictSample, making this conversion seamless.\n\n\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)\n\n\n\nNDArray fields are stored as raw bytes in DictSample. 
They are only converted to numpy arrays when accessed through a typed sample class.\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a DictSample from raw msgpack bytes.\n\n\nfrom_data\nCreate a DictSample from unpacked msgpack data.\n\n\nget\nGet a field value with optional default.\n\n\nitems\nReturn list of (field_name, value) tuples.\n\n\nkeys\nReturn list of field names.\n\n\nto_dict\nReturn a copy of the underlying data dictionary.\n\n\nvalues\nReturn list of field values.\n\n\n\n\n\nDictSample.from_bytes(bs)\nCreate a DictSample from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance with the unpacked data.\n\n\n\n\n\n\n\nDictSample.from_data(data)\nCreate a DictSample from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\ndict[str, Any]\nDictionary with field names as keys.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nDictSample\nNew DictSample instance wrapping the data.\n\n\n\n\n\n\n\nDictSample.get(key, default=None)\nGet a field value with optional default.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nkey\nstr\nField name to access.\nrequired\n\n\ndefault\nAny\nValue to return if field doesn’t exist.\nNone\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nAny\nThe field value or default.\n\n\n\n\n\n\n\nDictSample.items()\nReturn list of (field_name, value) tuples.\n\n\n\nDictSample.keys()\nReturn list of field names.\n\n\n\nDictSample.to_dict()\nReturn a copy of the underlying data dictionary.\n\n\n\nDictSample.values()\nReturn list of field values." 1956 1956 }, 1957 1957 { 1958 - "objectID": "api/DictSample.html#example", 1959 - "href": "api/DictSample.html#example", 1958 + "objectID": "api/DictSample.html#examples", 1959 + "href": "api/DictSample.html#examples", 1960 1960 "title": "DictSample", 1961 1961 "section": "", 1962 - "text": "::\n&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)" 1962 + "text": "&gt;&gt;&gt; ds = load_dataset(\"path/to/data.tar\") # Returns Dataset[DictSample]\n&gt;&gt;&gt; for sample in ds.ordered():\n... print(sample.some_field) # Attribute access\n... print(sample[\"other_field\"]) # Dict access\n... print(sample.keys()) # Inspect available fields\n...\n&gt;&gt;&gt; # Convert to typed schema\n&gt;&gt;&gt; typed_ds = ds.as_type(MyTypedSample)" 1963 1963 }, 1964 1964 { 1965 1965 "objectID": "api/DictSample.html#note", ··· 1987 1987 "href": "api/PDSBlobStore.html", 1988 1988 "title": "PDSBlobStore", 1989 1989 "section": "", 1990 - "text": "atmosphere.PDSBlobStore(client)\nPDS blob store implementing AbstractDataStore protocol.\nStores dataset shards as ATProto blobs, enabling decentralized dataset storage on the AT Protocol network.\nEach shard is written to a temporary tar file, then uploaded as a blob to the user’s PDS. 
The returned URLs are AT URIs that can be resolved to HTTP URLs for streaming.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance.\n\n\n\n\n\n\n::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1990 + "text": "atmosphere.PDSBlobStore(client)\nPDS blob store implementing AbstractDataStore protocol.\nStores dataset shards as ATProto blobs, enabling decentralized dataset storage on the AT Protocol network.\nEach shard is written to a temporary tar file, then uploaded as a blob to the user’s PDS. 
The returned URLs are AT URIs that can be resolved to HTTP URLs for streaming.\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance.\n\n\n\n\n\n\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]\n\n\n\n\n\n\nName\nDescription\n\n\n\n\ncreate_source\nCreate a BlobSource for reading these AT URIs.\n\n\nread_url\nResolve an AT URI blob reference to an HTTP URL.\n\n\nsupports_streaming\nPDS blobs support streaming via HTTP.\n\n\nwrite_shards\nWrite dataset shards as PDS blobs.\n\n\n\n\n\natmosphere.PDSBlobStore.create_source(urls)\nCreate a BlobSource for reading these AT URIs.\nThis is a convenience method for creating a DataSource that can stream the blobs written by this store.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurls\nlist[str]\nList of AT URIs from write_shards().\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\n'BlobSource'\nBlobSource configured for the given URLs.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URLs are not valid AT URIs.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.read_url(url)\nResolve an AT URI blob reference to an HTTP URL.\nTransforms at://did/blob/cid URIs to HTTP URLs that can be streamed by WebDataset.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nurl\nstr\nAT URI in format at://{did}/blob/{cid}.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nstr\nHTTP URL for fetching the blob via PDS API.\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf URL format is invalid or PDS cannot be resolved.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.supports_streaming()\nPDS blobs support streaming via HTTP.\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nbool\nTrue.\n\n\n\n\n\n\n\natmosphere.PDSBlobStore.write_shards(\n ds,\n *,\n prefix,\n maxcount=10000,\n maxsize=3000000000.0,\n **kwargs,\n)\nWrite dataset shards as PDS blobs.\nCreates tar archives from the dataset and uploads each as a blob to the authenticated user’s PDS.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nds\n'Dataset'\nThe Dataset to write.\nrequired\n\n\nprefix\nstr\nLogical path prefix for naming (used in shard names only).\nrequired\n\n\nmaxcount\nint\nMaximum samples per shard (default: 10000).\n10000\n\n\nmaxsize\nfloat\nMaximum shard size in bytes (default: 3GB, PDS limit).\n3000000000.0\n\n\n**kwargs\nAny\nAdditional args passed to wds.ShardWriter.\n{}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nlist[str]\nList of AT URIs for the written blobs, in format:\n\n\n\nlist[str]\nat://{did}/blob/{cid}\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nValueError\nIf not authenticated.\n\n\n\nRuntimeError\nIf no shards were written.\n\n\n\n\n\n\nPDS blobs have size limits (typically 50MB-5GB depending on PDS). Adjust maxcount/maxsize to stay within limits." 1991 1991 }, 1992 1992 { 1993 1993 "objectID": "api/PDSBlobStore.html#attributes", ··· 1997 1997 "text": "Name\nType\nDescription\n\n\n\n\nclient\n'AtmosphereClient'\nAuthenticated AtmosphereClient instance." 
1998 1998 }, 1999 1999 { 2000 - "objectID": "api/PDSBlobStore.html#example", 2001 - "href": "api/PDSBlobStore.html#example", 2000 + "objectID": "api/PDSBlobStore.html#examples", 2001 + "href": "api/PDSBlobStore.html#examples", 2002 2002 "title": "PDSBlobStore", 2003 2003 "section": "", 2004 - "text": "::\n&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]" 2004 + "text": "&gt;&gt;&gt; store = PDSBlobStore(client)\n&gt;&gt;&gt; urls = store.write_shards(dataset, prefix=\"training/v1\")\n&gt;&gt;&gt; # Returns AT URIs like:\n&gt;&gt;&gt; # ['at://did:plc:abc/blob/bafyrei...', ...]" 2005 2005 }, 2006 2006 { 2007 2007 "objectID": "api/PDSBlobStore.html#methods", ··· 2015 2015 "href": "api/PackableSample.html", 2016 2016 "title": "PackableSample", 2017 2017 "section": "", 2018 - "text": "PackableSample()\nBase class for samples that can be serialized with msgpack.\nThis abstract base class provides automatic serialization/deserialization for dataclass-based samples. Fields annotated as NDArray or NDArray | None are automatically converted between numpy arrays and bytes during packing/unpacking.\nSubclasses should be defined either by: 1. Direct inheritance with the @dataclass decorator 2. Using the @packable decorator (recommended)\n\n\n::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 2018 + "text": "PackableSample()\nBase class for samples that can be serialized with msgpack.\nThis abstract base class provides automatic serialization/deserialization for dataclass-based samples. Fields annotated as NDArray or NDArray | None are automatically converted between numpy arrays and bytes during packing/unpacking.\nSubclasses should be defined either by: 1. Direct inheritance with the @dataclass decorator 2. Using the @packable decorator (recommended)\n\n\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... 
embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nas_wds\nPack this sample’s data for writing to WebDataset.\n\n\npacked\nPack this sample’s data into msgpack bytes.\n\n\n\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nfrom_bytes\nCreate a sample instance from raw msgpack bytes.\n\n\nfrom_data\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nPackableSample.from_bytes(bs)\nCreate a sample instance from raw msgpack bytes.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nbs\nbytes\nRaw bytes from a msgpack-serialized sample.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nA new instance of this sample class deserialized from the bytes.\n\n\n\n\n\n\n\nPackableSample.from_data(data)\nCreate a sample instance from unpacked msgpack data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndata\nWDSRawSample\nDictionary with keys matching the sample’s field names.\nrequired\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\n\n\n\n\n\nSelf\nNew instance with NDArray fields auto-converted from bytes." 2019 2019 }, 2020 2020 { 2021 - "objectID": "api/PackableSample.html#example", 2022 - "href": "api/PackableSample.html#example", 2021 + "objectID": "api/PackableSample.html#examples", 2022 + "href": "api/PackableSample.html#examples", 2023 2023 "title": "PackableSample", 2024 2024 "section": "", 2025 - "text": "::\n&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize" 2025 + "text": "&gt;&gt;&gt; @packable\n... class MyData:\n... name: str\n... embeddings: NDArray\n...\n&gt;&gt;&gt; sample = MyData(name=\"test\", embeddings=np.array([1.0, 2.0]))\n&gt;&gt;&gt; packed = sample.packed # Serialize to bytes\n&gt;&gt;&gt; restored = MyData.from_bytes(packed) # Deserialize" 2026 2026 }, 2027 2027 { 2028 2028 "objectID": "api/PackableSample.html#attributes", ··· 2043 2043 "href": "api/DatasetDict.html", 2044 2044 "title": "DatasetDict", 2045 2045 "section": "", 2046 - "text": "DatasetDict(splits=None, sample_type=None, streaming=False)\nA dictionary of split names to Dataset instances.\nSimilar to HuggingFace’s DatasetDict, this provides a container for multiple dataset splits (train, test, validation, etc.) with convenience methods that operate across all splits.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired\n\n\n\n\n\n\n::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 
2046 + "text": "DatasetDict(splits=None, sample_type=None, streaming=False)\nA dictionary of split names to Dataset instances.\nSimilar to HuggingFace’s DatasetDict, this provides a container for multiple dataset splits (train, test, validation, etc.) with convenience methods that operate across all splits.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired\n\n\n\n\n\n\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")\n\n\n\n\n\n\nName\nDescription\n\n\n\n\nnum_shards\nNumber of shards in each split.\n\n\nsample_type\nThe sample type for datasets in this dict.\n\n\nstreaming\nWhether this DatasetDict was loaded in streaming mode." 2047 2047 }, 2048 2048 { 2049 2049 "objectID": "api/DatasetDict.html#parameters", ··· 2053 2053 "text": "Name\nType\nDescription\nDefault\n\n\n\n\nST\n\nThe sample type for all datasets in this dict.\nrequired" 2054 2054 }, 2055 2055 { 2056 - "objectID": "api/DatasetDict.html#example", 2057 - "href": "api/DatasetDict.html#example", 2056 + "objectID": "api/DatasetDict.html#examples", 2057 + "href": "api/DatasetDict.html#examples", 2058 2058 "title": "DatasetDict", 2059 2059 "section": "", 2060 - "text": "::\n&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")" 2060 + "text": "&gt;&gt;&gt; ds_dict = load_dataset(\"path/to/data\", MyData)\n&gt;&gt;&gt; train = ds_dict[\"train\"]\n&gt;&gt;&gt; test = ds_dict[\"test\"]\n&gt;&gt;&gt;\n&gt;&gt;&gt; # Iterate over all splits\n&gt;&gt;&gt; for split_name, dataset in ds_dict.items():\n... print(f\"{split_name}: {len(dataset.shard_list)} shards\")" 2061 2061 }, 2062 2062 { 2063 2063 "objectID": "api/DatasetDict.html#attributes",
+47 -47
docs/sitemap.xml
··· 2 2 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> 3 3 <url> 4 4 <loc>https://github.com/your-org/atdata/reference/protocols.html</loc> 5 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 5 + <lastmod>2026-01-28T18:46:20.894Z</lastmod> 6 6 </url> 7 7 <url> 8 8 <loc>https://github.com/your-org/atdata/reference/datasets.html</loc> 9 - <lastmod>2026-01-22T19:31:03.722Z</lastmod> 9 + <lastmod>2026-01-28T18:46:20.893Z</lastmod> 10 10 </url> 11 11 <url> 12 12 <loc>https://github.com/your-org/atdata/reference/architecture.html</loc> 13 - <lastmod>2026-01-27T06:13:33.690Z</lastmod> 13 + <lastmod>2026-01-28T19:56:53.889Z</lastmod> 14 14 </url> 15 15 <url> 16 16 <loc>https://github.com/your-org/atdata/reference/atmosphere.html</loc> 17 - <lastmod>2026-01-27T05:32:25.227Z</lastmod> 17 + <lastmod>2026-01-28T19:56:53.889Z</lastmod> 18 18 </url> 19 19 <url> 20 20 <loc>https://github.com/your-org/atdata/reference/local-storage.html</loc> 21 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 21 + <lastmod>2026-01-28T18:46:20.894Z</lastmod> 22 22 </url> 23 23 <url> 24 24 <loc>https://github.com/your-org/atdata/reference/uri-spec.html</loc> 25 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 25 + <lastmod>2026-01-28T18:46:20.895Z</lastmod> 26 26 </url> 27 27 <url> 28 28 <loc>https://github.com/your-org/atdata/tutorials/quickstart.html</loc> 29 - <lastmod>2026-01-27T06:16:24.980Z</lastmod> 29 + <lastmod>2026-01-28T19:56:53.890Z</lastmod> 30 30 </url> 31 31 <url> 32 32 <loc>https://github.com/your-org/atdata/tutorials/atmosphere.html</loc> 33 - <lastmod>2026-01-27T06:18:15.908Z</lastmod> 33 + <lastmod>2026-01-28T19:56:53.889Z</lastmod> 34 34 </url> 35 35 <url> 36 36 <loc>https://github.com/your-org/atdata/api/SchemaLoader.html</loc> 37 - <lastmod>2026-01-23T23:20:15.746Z</lastmod> 37 + <lastmod>2026-01-28T20:31:19.270Z</lastmod> 38 38 </url> 39 39 <url> 40 40 <loc>https://github.com/your-org/atdata/api/BlobSource.html</loc> 41 - <lastmod>2026-01-27T05:36:00.209Z</lastmod> 41 + <lastmod>2026-01-28T20:31:19.167Z</lastmod> 42 42 </url> 43 43 <url> 44 44 <loc>https://github.com/your-org/atdata/api/AtmosphereClient.html</loc> 45 - <lastmod>2026-01-23T23:20:15.723Z</lastmod> 45 + <lastmod>2026-01-28T20:31:19.237Z</lastmod> 46 46 </url> 47 47 <url> 48 48 <loc>https://github.com/your-org/atdata/api/load_dataset.html</loc> 49 - <lastmod>2026-01-24T19:19:45.334Z</lastmod> 49 + <lastmod>2026-01-28T20:31:19.114Z</lastmod> 50 50 </url> 51 51 <url> 52 52 <loc>https://github.com/your-org/atdata/api/promote_to_atmosphere.html</loc> 53 - <lastmod>2026-01-24T19:19:45.514Z</lastmod> 53 + <lastmod>2026-01-28T20:31:19.321Z</lastmod> 54 54 </url> 55 55 <url> 56 56 <loc>https://github.com/your-org/atdata/api/SchemaPublisher.html</loc> 57 - <lastmod>2026-01-23T23:20:15.742Z</lastmod> 57 + <lastmod>2026-01-28T20:31:19.265Z</lastmod> 58 58 </url> 59 59 <url> 60 60 <loc>https://github.com/your-org/atdata/api/DatasetPublisher.html</loc> 61 - <lastmod>2026-01-23T23:20:15.757Z</lastmod> 61 + <lastmod>2026-01-28T20:31:19.283Z</lastmod> 62 62 </url> 63 63 <url> 64 64 <loc>https://github.com/your-org/atdata/api/URLSource.html</loc> 65 - <lastmod>2026-01-24T19:19:45.367Z</lastmod> 65 + <lastmod>2026-01-28T20:31:19.150Z</lastmod> 66 66 </url> 67 67 <url> 68 68 <loc>https://github.com/your-org/atdata/api/index.html</loc> 69 - <lastmod>2026-01-27T06:39:59.502Z</lastmod> 69 + <lastmod>2026-01-28T20:50:17.801Z</lastmod> 70 70 </url> 71 71 <url> 72 72 <loc>https://github.com/your-org/atdata/api/IndexEntry.html</loc> 73 - 
<lastmod>2026-01-23T23:03:53.795Z</lastmod> 73 + <lastmod>2026-01-28T19:56:53.885Z</lastmod> 74 74 </url> 75 75 <url> 76 76 <loc>https://github.com/your-org/atdata/api/S3Source.html</loc> 77 - <lastmod>2026-01-24T19:19:45.376Z</lastmod> 77 + <lastmod>2026-01-28T20:31:19.160Z</lastmod> 78 78 </url> 79 79 <url> 80 80 <loc>https://github.com/your-org/atdata/api/local.LocalDatasetEntry.html</loc> 81 - <lastmod>2026-01-23T23:03:53.862Z</lastmod> 81 + <lastmod>2026-01-28T19:56:53.887Z</lastmod> 82 82 </url> 83 83 <url> 84 84 <loc>https://github.com/your-org/atdata/api/AbstractIndex.html</loc> 85 - <lastmod>2026-01-27T05:36:00.180Z</lastmod> 85 + <lastmod>2026-01-28T20:31:19.135Z</lastmod> 86 86 </url> 87 87 <url> 88 88 <loc>https://github.com/your-org/atdata/api/AtmosphereIndexEntry.html</loc> 89 - <lastmod>2026-01-23T23:03:53.910Z</lastmod> 89 + <lastmod>2026-01-28T19:56:53.884Z</lastmod> 90 90 </url> 91 91 <url> 92 92 <loc>https://github.com/your-org/atdata/api/LensPublisher.html</loc> 93 - <lastmod>2026-01-23T23:20:15.781Z</lastmod> 93 + <lastmod>2026-01-28T20:31:19.307Z</lastmod> 94 94 </url> 95 95 <url> 96 96 <loc>https://github.com/your-org/atdata/api/SampleBatch.html</loc> 97 - <lastmod>2026-01-23T23:20:15.589Z</lastmod> 97 + <lastmod>2026-01-28T20:31:19.088Z</lastmod> 98 98 </url> 99 99 <url> 100 100 <loc>https://github.com/your-org/atdata/index.html</loc> 101 - <lastmod>2026-01-27T06:14:32.068Z</lastmod> 101 + <lastmod>2026-01-28T19:56:53.888Z</lastmod> 102 102 </url> 103 103 <url> 104 104 <loc>https://github.com/your-org/atdata/api/packable.html</loc> 105 - <lastmod>2026-01-23T23:21:24.522Z</lastmod> 105 + <lastmod>2026-01-28T20:31:19.057Z</lastmod> 106 106 </url> 107 107 <url> 108 108 <loc>https://github.com/your-org/atdata/api/Packable-protocol.html</loc> 109 - <lastmod>2026-01-23T23:20:15.617Z</lastmod> 109 + <lastmod>2026-01-28T20:31:19.119Z</lastmod> 110 110 </url> 111 111 <url> 112 112 <loc>https://github.com/your-org/atdata/api/AtUri.html</loc> 113 - <lastmod>2026-01-23T23:20:15.791Z</lastmod> 113 + <lastmod>2026-01-28T20:31:19.317Z</lastmod> 114 114 </url> 115 115 <url> 116 116 <loc>https://github.com/your-org/atdata/api/local.S3DataStore.html</loc> 117 - <lastmod>2026-01-23T23:03:53.869Z</lastmod> 117 + <lastmod>2026-01-28T19:56:53.887Z</lastmod> 118 118 </url> 119 119 <url> 120 120 <loc>https://github.com/your-org/atdata/api/AbstractDataStore.html</loc> 121 - <lastmod>2026-01-23T23:20:15.638Z</lastmod> 121 + <lastmod>2026-01-28T20:31:19.141Z</lastmod> 122 122 </url> 123 123 <url> 124 124 <loc>https://github.com/your-org/atdata/api/Dataset.html</loc> 125 - <lastmod>2026-01-23T23:20:15.588Z</lastmod> 125 + <lastmod>2026-01-28T20:31:19.086Z</lastmod> 126 126 </url> 127 127 <url> 128 128 <loc>https://github.com/your-org/atdata/api/local.Index.html</loc> 129 - <lastmod>2026-01-27T05:36:00.238Z</lastmod> 129 + <lastmod>2026-01-28T20:31:19.196Z</lastmod> 130 130 </url> 131 131 <url> 132 132 <loc>https://github.com/your-org/atdata/api/Lens.html</loc> 133 - <lastmod>2026-01-27T06:39:59.563Z</lastmod> 133 + <lastmod>2026-01-28T20:50:17.859Z</lastmod> 134 134 </url> 135 135 <url> 136 136 <loc>https://github.com/your-org/atdata/api/DatasetLoader.html</loc> 137 - <lastmod>2026-01-23T23:20:15.773Z</lastmod> 137 + <lastmod>2026-01-28T20:31:19.298Z</lastmod> 138 138 </url> 139 139 <url> 140 140 <loc>https://github.com/your-org/atdata/api/DataSource.html</loc> 141 - <lastmod>2026-01-23T23:20:15.642Z</lastmod> 141 + <lastmod>2026-01-28T20:31:19.146Z</lastmod> 142 142 </url> 143 143 <url> 144 
144 <loc>https://github.com/your-org/atdata/api/AtmosphereIndex.html</loc> 145 - <lastmod>2026-01-27T05:36:00.293Z</lastmod> 145 + <lastmod>2026-01-28T20:31:19.251Z</lastmod> 146 146 </url> 147 147 <url> 148 148 <loc>https://github.com/your-org/atdata/api/LensLoader.html</loc> 149 - <lastmod>2026-01-23T23:20:15.788Z</lastmod> 149 + <lastmod>2026-01-28T20:31:19.314Z</lastmod> 150 150 </url> 151 151 <url> 152 152 <loc>https://github.com/your-org/atdata/api/DictSample.html</loc> 153 - <lastmod>2026-01-23T23:20:15.573Z</lastmod> 153 + <lastmod>2026-01-28T20:31:19.071Z</lastmod> 154 154 </url> 155 155 <url> 156 156 <loc>https://github.com/your-org/atdata/api/PDSBlobStore.html</loc> 157 - <lastmod>2026-01-27T05:36:00.303Z</lastmod> 157 + <lastmod>2026-01-28T20:31:19.261Z</lastmod> 158 158 </url> 159 159 <url> 160 160 <loc>https://github.com/your-org/atdata/api/PackableSample.html</loc> 161 - <lastmod>2026-01-23T23:20:15.564Z</lastmod> 161 + <lastmod>2026-01-28T20:31:19.062Z</lastmod> 162 162 </url> 163 163 <url> 164 164 <loc>https://github.com/your-org/atdata/api/DatasetDict.html</loc> 165 - <lastmod>2026-01-24T19:19:45.336Z</lastmod> 165 + <lastmod>2026-01-28T20:31:19.116Z</lastmod> 166 166 </url> 167 167 <url> 168 168 <loc>https://github.com/your-org/atdata/tutorials/promotion.html</loc> 169 - <lastmod>2026-01-27T06:18:38.425Z</lastmod> 169 + <lastmod>2026-01-28T19:56:53.890Z</lastmod> 170 170 </url> 171 171 <url> 172 172 <loc>https://github.com/your-org/atdata/tutorials/local-workflow.html</loc> 173 - <lastmod>2026-01-27T06:17:20.489Z</lastmod> 173 + <lastmod>2026-01-28T19:56:53.890Z</lastmod> 174 174 </url> 175 175 <url> 176 176 <loc>https://github.com/your-org/atdata/reference/promotion.html</loc> 177 - <lastmod>2026-01-22T19:31:03.723Z</lastmod> 177 + <lastmod>2026-01-28T18:46:20.894Z</lastmod> 178 178 </url> 179 179 <url> 180 180 <loc>https://github.com/your-org/atdata/reference/load-dataset.html</loc> 181 - <lastmod>2026-01-22T19:31:03.722Z</lastmod> 181 + <lastmod>2026-01-28T18:46:20.894Z</lastmod> 182 182 </url> 183 183 <url> 184 184 <loc>https://github.com/your-org/atdata/reference/lenses.html</loc> ··· 190 190 </url> 191 191 <url> 192 192 <loc>https://github.com/your-org/atdata/reference/deployment.html</loc> 193 - <lastmod>2026-01-22T20:19:56.455Z</lastmod> 193 + <lastmod>2026-01-28T19:56:53.889Z</lastmod> 194 194 </url> 195 195 <url> 196 196 <loc>https://github.com/your-org/atdata/reference/troubleshooting.html</loc> 197 - <lastmod>2026-01-22T20:18:56.494Z</lastmod> 197 + <lastmod>2026-01-28T19:56:53.889Z</lastmod> 198 198 </url> 199 199 </urlset>
+14 -14
docs/tutorials/atmosphere.html
··· 658 658 </section> 659 659 <section id="setup" class="level2"> 660 660 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 661 - <div id="acf890d8" class="cell"> 661 + <div id="be2b35d7" class="cell"> 662 662 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 663 663 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 664 664 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 678 678 </section> 679 679 <section id="define-sample-types" class="level2"> 680 680 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 681 - <div id="ba343730" class="cell"> 681 + <div id="f472e384" class="cell"> 682 682 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 683 683 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ImageSample:</span> 684 684 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing image data with metadata."""</span></span> ··· 697 697 <section id="type-introspection" class="level2"> 698 698 <h2 class="anchored" data-anchor-id="type-introspection">Type Introspection</h2> 699 699 <p>See what information is available from a PackableSample type:</p> 700 - <div id="d3147fcc" class="cell"> 700 + <div id="ffea79d0" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> dataclasses <span class="im">import</span> fields, is_dataclass</span> 702 702 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span> 703 703 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Sample type: </span><span class="sc">{</span>ImageSample<span class="sc">.</span><span class="va">__name__</span><span class="sc">}</span><span class="ss">"</span>)</span> ··· 732 732 </ul> 733 733 <p>Understanding AT URIs is essential for working with atmosphere datasets, as they’re how you reference schemas, datasets, and lenses.</p> 734 734 <p>ATProto records are identified by AT URIs:</p> 735 - <div id="2e4e2375" class="cell"> 735 + <div id="dfdc208c" class="cell"> 736 736 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>uris <span class="op">=</span> [</span> 737 737 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz789"</span>,</span> 738 738 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"at://alice.bsky.social/ac.foundation.dataset.record/my-dataset"</span>,</span> ··· 750 750 <h2 class="anchored" data-anchor-id="authentication">Authentication</h2> 751 751 <p>The 
<code>AtmosphereClient</code> handles ATProto authentication. When you authenticate, you’re proving ownership of your decentralized identity (DID), which gives you permission to create and modify records in your Personal Data Server (PDS).</p> 752 752 <p>Connect to ATProto:</p> 753 - <div id="5e1f8209" class="cell"> 753 + <div id="a4f475f4" class="cell"> 754 754 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 755 755 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"your.handle.social"</span>, <span class="st">"your-app-password"</span>)</span> 756 756 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a></span> ··· 761 761 <section id="publish-a-schema" class="level2"> 762 762 <h2 class="anchored" data-anchor-id="publish-a-schema">Publish a Schema</h2> 763 763 <p>When you publish a schema to ATProto, it becomes a <strong>public, immutable record</strong> that others can reference. The schema CID ensures that anyone can verify they’re using exactly the same type definition you published.</p> 764 - <div id="9fd4595f" class="cell"> 764 + <div id="96200176" class="cell"> 765 765 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>schema_publisher <span class="op">=</span> SchemaPublisher(client)</span> 766 766 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_uri <span class="op">=</span> schema_publisher.publish(</span> 767 767 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a> ImageSample,</span> ··· 774 774 </section> 775 775 <section id="list-your-schemas" class="level2"> 776 776 <h2 class="anchored" data-anchor-id="list-your-schemas">List Your Schemas</h2> 777 - <div id="d7b31f8e" class="cell"> 777 + <div id="076dac8a" class="cell"> 778 778 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>schema_loader <span class="op">=</span> SchemaLoader(client)</span> 779 779 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>schemas <span class="op">=</span> schema_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 780 780 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(schemas)<span class="sc">}</span><span class="ss"> schema(s)"</span>)</span> ··· 787 787 <h2 class="anchored" data-anchor-id="publish-a-dataset">Publish a Dataset</h2> 788 788 <section id="with-external-urls" class="level3"> 789 789 <h3 class="anchored" data-anchor-id="with-external-urls">With External URLs</h3> 790 - <div id="38a5ea19" class="cell"> 790 + <div id="6b04e0fa" class="cell"> 791 791 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 792 792 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span 
class="op">=</span> dataset_publisher.publish_with_urls(</span> 793 793 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 809 809 <li><strong>Federated replication</strong>: Relays can mirror your blobs for availability</li> 810 810 </ul> 811 811 <p>For fully decentralized storage, use <code>PDSBlobStore</code> to store dataset shards directly as ATProto blobs in your PDS:</p> 812 - <div id="674acaa3" class="cell"> 812 + <div id="703c6bab" class="cell"> 813 813 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create store and index with blob storage</span></span> 814 814 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>store <span class="op">=</span> PDSBlobStore(client)</span> 815 815 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>index <span class="op">=</span> AtmosphereIndex(client, data_store<span class="op">=</span>store)</span> ··· 853 853 </div> 854 854 <div class="callout-body-container callout-body"> 855 855 <p>Use <code>BlobSource</code> to stream directly from PDS blobs:</p> 856 - <div id="0f2fcbb6" class="cell"> 856 + <div id="930a5ae0" class="cell"> 857 857 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create source from the blob URLs</span></span> 858 858 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>source <span class="op">=</span> store.create_source(entry.data_urls)</span> 859 859 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a></span> ··· 874 874 <h3 class="anchored" data-anchor-id="with-external-urls-1">With External URLs</h3> 875 875 <p>For larger datasets that exceed PDS blob limits, or when you already have data in object storage, you can publish a dataset record that references external URLs. 
The ATProto record serves as the <strong>index entry</strong> while the actual data lives elsewhere.</p> 876 876 <p>For larger datasets or when using existing object storage:</p> 877 - <div id="8111fff9" class="cell"> 877 + <div id="f5218175" class="cell"> 878 878 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>dataset_publisher <span class="op">=</span> DatasetPublisher(client)</span> 879 879 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>dataset_uri <span class="op">=</span> dataset_publisher.publish_with_urls(</span> 880 880 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> urls<span class="op">=</span>[<span class="st">"s3://example-bucket/demo-data-{000000..000009}.tar"</span>],</span> ··· 890 890 </section> 891 891 <section id="list-and-load-datasets" class="level2"> 892 892 <h2 class="anchored" data-anchor-id="list-and-load-datasets">List and Load Datasets</h2> 893 - <div id="ef2e681c" class="cell"> 893 + <div id="be94f99c" class="cell"> 894 894 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a>dataset_loader <span class="op">=</span> DatasetLoader(client)</span> 895 895 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>datasets <span class="op">=</span> dataset_loader.list_all(limit<span class="op">=</span><span class="dv">10</span>)</span> 896 896 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Found </span><span class="sc">{</span><span class="bu">len</span>(datasets)<span class="sc">}</span><span class="ss"> dataset(s)"</span>)</span> ··· 905 905 </section> 906 906 <section id="load-a-dataset" class="level2"> 907 907 <h2 class="anchored" data-anchor-id="load-a-dataset">Load a Dataset</h2> 908 - <div id="a0076f9e" class="cell"> 908 + <div id="7b4bc5bb" class="cell"> 909 909 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Check storage type</span></span> 910 910 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>storage_type <span class="op">=</span> dataset_loader.get_storage_type(<span class="bu">str</span>(blob_dataset_uri))</span> 911 911 <span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Storage type: </span><span class="sc">{</span>storage_type<span class="sc">}</span><span class="ss">"</span>)</span> ··· 933 933 </ol> 934 934 <p>Notice how similar this is to the local workflow—the same sample types and patterns, just with a different storage backend.</p> 935 935 <p>This example shows the recommended workflow using <code>PDSBlobStore</code> for fully decentralized storage:</p> 936 - <div id="92130535" class="cell"> 936 + <div id="48092185" class="cell"> 937 937 <div class="sourceCode cell-code" id="cb14"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. 
Define and create samples</span></span> 938 938 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 939 939 <span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> FeatureSample:</span>
+8 -8
docs/tutorials/local-workflow.html
··· 644 644 </section> 645 645 <section id="setup" class="level2"> 646 646 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 647 - <div id="366a6743" class="cell"> 647 + <div id="0a2f50ff" class="cell"> 648 648 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 649 649 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 650 650 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 654 654 </section> 655 655 <section id="define-sample-types" class="level2"> 656 656 <h2 class="anchored" data-anchor-id="define-sample-types">Define Sample Types</h2> 657 - <div id="7dcf168a" class="cell"> 657 + <div id="cd43e33f" class="cell"> 658 658 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 659 659 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> TrainingSample:</span> 660 660 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="co">"""A sample containing features and label for training."""</span></span> ··· 678 678 </ul> 679 679 <p>CIDs are computed from the entry’s schema reference and data URLs, so the same logical dataset will have the same CID regardless of where it’s stored.</p> 680 680 <p>Create entries with content-addressable CIDs:</p> 681 - <div id="93a2dc43" class="cell"> 681 + <div id="6e2dd5de" class="cell"> 682 682 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create an entry manually</span></span> 683 683 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>entry <span class="op">=</span> LocalDatasetEntry(</span> 684 684 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> _name<span class="op">=</span><span class="st">"my-dataset"</span>,</span> ··· 711 711 <h2 class="anchored" data-anchor-id="localindex">LocalIndex</h2> 712 712 <p>The <code>LocalIndex</code> is your team’s dataset registry. 
It implements the <code>AbstractIndex</code> protocol, meaning code written against <code>LocalIndex</code> will also work with <code>AtmosphereIndex</code> when you’re ready for federated sharing.</p> 713 713 <p>The index tracks datasets in Redis:</p> 714 - <div id="e7cb9abe" class="cell"> 714 + <div id="a315adcf" class="cell"> 715 715 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> redis <span class="im">import</span> Redis</span> 716 716 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 717 717 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to Redis</span></span> ··· 724 724 <h3 class="anchored" data-anchor-id="schema-management">Schema Management</h3> 725 725 <p><strong>Schema publishing</strong> is how you ensure type consistency across your team. When you publish a schema, atdata stores the complete type definition (field names, types, metadata) so anyone can reconstruct the Python class from just the schema reference.</p> 726 726 <p>This enables a powerful workflow: share a dataset by sharing its name, and consumers can dynamically reconstruct the sample type without having the original Python code.</p> 727 - <div id="d8c57637" class="cell"> 727 + <div id="45693810" class="cell"> 728 728 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Publish a schema</span></span> 729 729 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>schema_ref <span class="op">=</span> index.publish_schema(TrainingSample, version<span class="op">=</span><span class="st">"1.0.0"</span>)</span> 730 730 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Published schema: </span><span class="sc">{</span>schema_ref<span class="sc">}</span><span class="ss">"</span>)</span> ··· 753 753 </ul> 754 754 <p>The data store handles uploading tar shards and creating signed URLs for streaming access.</p> 755 755 <p>For direct S3 operations:</p> 756 - <div id="dc2e8870" class="cell"> 756 + <div id="fdea4349" class="cell"> 757 757 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>creds <span class="op">=</span> {</span> 758 758 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ENDPOINT"</span>: <span class="st">"http://localhost:9000"</span>,</span> 759 759 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> <span class="st">"AWS_ACCESS_KEY_ID"</span>: <span class="st">"minioadmin"</span>,</span> ··· 779 779 </ol> 780 780 <p>The index composition pattern (<code>LocalIndex(data_store=S3DataStore(...))</code>) is deliberate—it separates the concern of “where is metadata?” from “where is data?”, making it easy to swap storage backends.</p> 781 781 <p>Use <code>LocalIndex</code> with <code>S3DataStore</code> to store datasets with S3 storage and Redis indexing:</p> 782 - <div id="d039393d" class="cell"> 782 + <div id="f7933f51" class="cell"> 783 783 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python 
code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Create sample data</span></span> 784 784 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>samples <span class="op">=</span> [</span> 785 785 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a> TrainingSample(</span> ··· 829 829 <h2 class="anchored" data-anchor-id="using-load_dataset-with-index">Using load_dataset with Index</h2> 830 830 <p>The <code>load_dataset()</code> function provides a HuggingFace-style API that abstracts away the details of where data lives. When you pass an index, it can resolve <code>@local/</code> prefixed paths to the actual data URLs and apply the correct credentials automatically.</p> 831 831 <p>The <code>load_dataset()</code> function supports index lookup:</p> 832 - <div id="4f5d6513" class="cell"> 832 + <div id="71d86430" class="cell"> 833 833 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata <span class="im">import</span> load_dataset</span> 834 834 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 835 835 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Load from local index</span></span>
+11 -11
docs/tutorials/promotion.html
··· 621 621 </section> 622 622 <section id="setup" class="level2"> 623 623 <h2 class="anchored" data-anchor-id="setup">Setup</h2> 624 - <div id="b2a4cfbd" class="cell"> 624 + <div id="e3fb0424" class="cell"> 625 625 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 626 626 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 627 627 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 634 634 <section id="prepare-a-local-dataset" class="level2"> 635 635 <h2 class="anchored" data-anchor-id="prepare-a-local-dataset">Prepare a Local Dataset</h2> 636 636 <p>First, set up a dataset in local storage:</p> 637 - <div id="f7229ae4" class="cell"> 637 + <div id="b9d50874" class="cell"> 638 638 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># 1. Define sample type</span></span> 639 639 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 640 640 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> ExperimentSample:</span> ··· 684 684 <section id="basic-promotion" class="level2"> 685 685 <h2 class="anchored" data-anchor-id="basic-promotion">Basic Promotion</h2> 686 686 <p>Promote the dataset to ATProto:</p> 687 - <div id="54490a42" class="cell"> 687 + <div id="03d13bda" class="cell"> 688 688 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Connect to atmosphere</span></span> 689 689 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>client <span class="op">=</span> AtmosphereClient()</span> 690 690 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a>client.login(<span class="st">"myhandle.bsky.social"</span>, <span class="st">"app-password"</span>)</span> ··· 697 697 <section id="promotion-with-metadata" class="level2"> 698 698 <h2 class="anchored" data-anchor-id="promotion-with-metadata">Promotion with Metadata</h2> 699 699 <p>Add description, tags, and license:</p> 700 - <div id="517ea4f6" class="cell"> 700 + <div id="78f1fe71" class="cell"> 701 701 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(</span> 702 702 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> local_entry,</span> 703 703 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> local_index,</span> ··· 713 713 <section id="schema-deduplication" class="level2"> 714 714 <h2 class="anchored" data-anchor-id="schema-deduplication">Schema Deduplication</h2> 715 715 <p>The promotion workflow automatically checks for existing schemas:</p> 716 - <div id="f80cf10d" class="cell"> 716 + <div id="89e88b54" class="cell"> 717 717 <div class="sourceCode cell-code" 
id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.promote <span class="im">import</span> _find_existing_schema</span> 718 718 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span> 719 719 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Check if schema already exists</span></span> ··· 725 725 <span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> <span class="bu">print</span>(<span class="st">"No existing schema found, will publish new one"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 726 726 </div> 727 727 <p>When you promote multiple datasets with the same sample type:</p> 728 - <div id="e623fe02" class="cell"> 728 + <div id="d5e0911d" class="cell"> 729 729 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># First promotion: publishes schema</span></span> 730 730 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>uri1 <span class="op">=</span> promote_to_atmosphere(entry1, local_index, client)</span> 731 731 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a></span> ··· 740 740 <div class="tab-content"> 741 741 <div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab"> 742 742 <p>By default, promotion keeps the original data URLs:</p> 743 - <div id="fd052b0a" class="cell"> 743 + <div id="c050b783" class="cell"> 744 744 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Data stays in original S3 location</span></span> 745 745 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div> 746 746 </div> ··· 753 753 </div> 754 754 <div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab"> 755 755 <p>To copy data to a different storage location:</p> 756 - <div id="87044fcf" class="cell"> 756 + <div id="6db17021" class="cell"> 757 757 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.local <span class="im">import</span> S3DataStore</span> 758 758 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span> 759 759 <span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create new data store</span></span> ··· 783 783 <section id="verify-on-atmosphere" class="level2"> 784 784 <h2 class="anchored" data-anchor-id="verify-on-atmosphere">Verify on Atmosphere</h2> 785 785 <p>After promotion, verify the dataset is accessible:</p> 786 - <div id="ae122715" class="cell"> 786 + <div id="46c36a34" class="cell"> 787 787 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a 
href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> atdata.atmosphere <span class="im">import</span> AtmosphereIndex</span> 788 788 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span> 789 789 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a>atm_index <span class="op">=</span> AtmosphereIndex(client)</span> ··· 804 804 </section> 805 805 <section id="error-handling" class="level2"> 806 806 <h2 class="anchored" data-anchor-id="error-handling">Error Handling</h2> 807 - <div id="f5fbc08d" class="cell"> 807 + <div id="ff95cdd2" class="cell"> 808 808 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="cf">try</span>:</span> 809 809 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> at_uri <span class="op">=</span> promote_to_atmosphere(local_entry, local_index, client)</span> 810 810 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="cf">except</span> <span class="pp">KeyError</span> <span class="im">as</span> e:</span> ··· 828 828 </section> 829 829 <section id="complete-workflow" class="level2"> 830 830 <h2 class="anchored" data-anchor-id="complete-workflow">Complete Workflow</h2> 831 - <div id="bcd3ed5a" class="cell"> 831 + <div id="54c7fb18" class="cell"> 832 832 <div class="sourceCode cell-code" id="cb12"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Complete local-to-atmosphere workflow</span></span> 833 833 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 834 834 <span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span>
+6 -6
docs/tutorials/quickstart.html
··· 606 606 <li><strong>Round-trip fidelity</strong>: Data survives serialization without loss</li> 607 607 </ul> 608 608 <p>Use the <code>@packable</code> decorator to create a typed sample:</p> 609 - <div id="5fe343cc" class="cell"> 609 + <div id="440626a1" class="cell"> 610 610 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span> 611 611 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> numpy.typing <span class="im">import</span> NDArray</span> 612 612 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> atdata</span> ··· 627 627 </section> 628 628 <section id="create-sample-instances" class="level2"> 629 629 <h2 class="anchored" data-anchor-id="create-sample-instances">Create Sample Instances</h2> 630 - <div id="ceeaea84" class="cell"> 630 + <div id="c6081379" class="cell"> 631 631 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Create a single sample</span></span> 632 632 <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>sample <span class="op">=</span> ImageSample(</span> 633 633 <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> image<span class="op">=</span>np.random.rand(<span class="dv">224</span>, <span class="dv">224</span>, <span class="dv">3</span>).astype(np.float32),</span> ··· 655 655 </ul> 656 656 <p>The <code>as_wds</code> property on your sample provides the dictionary format WebDataset expects:</p> 657 657 <p>Use WebDataset’s <code>TarWriter</code> to create dataset files:</p> 658 - <div id="21430fdb" class="cell"> 658 + <div id="f621e87f" class="cell"> 659 659 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> webdataset <span class="im">as</span> wds</span> 660 660 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span> 661 661 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># Create 100 samples</span></span> ··· 686 686 </ul> 687 687 <p>This eliminates boilerplate collation code and works automatically with any PackableSample type.</p> 688 688 <p>Create a typed <code>Dataset</code> and iterate with batching:</p> 689 - <div id="f9c53332" class="cell"> 689 + <div id="e60a7dc5" class="cell"> 690 690 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Load dataset with type</span></span> 691 691 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>dataset <span class="op">=</span> atdata.Dataset[ImageSample](<span class="st">"my-dataset-000000.tar"</span>)</span> 692 692 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span> ··· 713 713 </ol> 714 714 <p>This approach balances randomness with streaming efficiency—you get well-shuffled data without needing random access to the entire dataset.</p> 715 715 <p>For training, use shuffled 
iteration:</p> 716 - <div id="7af05c86" class="cell"> 716 + <div id="7dc74662" class="cell"> 717 717 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">for</span> batch <span class="kw">in</span> dataset.shuffled(batch_size<span class="op">=</span><span class="dv">32</span>):</span> 718 718 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> <span class="co"># Samples are shuffled at shard and sample level</span></span> 719 719 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> images <span class="op">=</span> batch.image</span> ··· 734 734 <li><strong>Derived features</strong>: Compute fields on-the-fly during iteration</li> 735 735 </ul> 736 736 <p>View datasets through different schemas:</p> 737 - <div id="a671173c" class="cell"> 737 + <div id="8494cd76" class="cell"> 738 738 <div class="sourceCode cell-code" id="cb8"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Define a simplified view type</span></span> 739 739 <span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="at">@atdata.packable</span></span> 740 740 <span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> SimplifiedSample:</span>
+7 -7
docs_src/api/AbstractDataStore.qmd
··· 14 14 flexible deployment: local index with S3 storage, atmosphere index with 15 15 S3 storage, or atmosphere index with PDS blobs. 16 16 17 - ## Example {.doc-section .doc-section-example} 18 - 19 - :: 17 + ## Examples {.doc-section .doc-section-examples} 20 18 21 - >>> store = S3DataStore(credentials, bucket="my-bucket") 22 - >>> urls = store.write_shards(dataset, prefix="training/v1") 23 - >>> print(urls) 24 - ['s3://my-bucket/training/v1/shard-000000.tar', ...] 19 + ```python 20 + >>> store = S3DataStore(credentials, bucket="my-bucket") 21 + >>> urls = store.write_shards(dataset, prefix="training/v1") 22 + >>> print(urls) 23 + ['s3://my-bucket/training/v1/shard-000000.tar', ...] 24 + ``` 25 25 26 26 ## Methods 27 27
+23 -23
docs_src/api/AbstractIndex.qmd
··· 20 20 - ``data_store``: An AbstractDataStore for reading/writing dataset shards. 21 21 If present, ``load_dataset`` will use it for S3 credential resolution. 22 22 23 - ## Example {.doc-section .doc-section-example} 24 - 25 - :: 23 + ## Examples {.doc-section .doc-section-examples} 26 24 27 - >>> def publish_and_list(index: AbstractIndex) -> None: 28 - ... # Publish schemas for different types 29 - ... schema1 = index.publish_schema(ImageSample, version="1.0.0") 30 - ... schema2 = index.publish_schema(TextSample, version="1.0.0") 31 - ... 32 - ... # Insert datasets of different types 33 - ... index.insert_dataset(image_ds, name="images") 34 - ... index.insert_dataset(text_ds, name="texts") 35 - ... 36 - ... # List all datasets (mixed types) 37 - ... for entry in index.list_datasets(): 38 - ... print(f"{entry.name} -> {entry.schema_ref}") 25 + ```python 26 + >>> def publish_and_list(index: AbstractIndex) -> None: 27 + ... # Publish schemas for different types 28 + ... schema1 = index.publish_schema(ImageSample, version="1.0.0") 29 + ... schema2 = index.publish_schema(TextSample, version="1.0.0") 30 + ... 31 + ... # Insert datasets of different types 32 + ... index.insert_dataset(image_ds, name="images") 33 + ... index.insert_dataset(text_ds, name="texts") 34 + ... 35 + ... # List all datasets (mixed types) 36 + ... for entry in index.list_datasets(): 37 + ... print(f"{entry.name} -> {entry.schema_ref}") 38 + ``` 39 39 40 40 ## Attributes 41 41 ··· 90 90 | | [KeyError](`KeyError`) | If schema not found. | 91 91 | | [ValueError](`ValueError`) | If schema cannot be decoded (unsupported field types). | 92 92 93 - #### Example {.doc-section .doc-section-example} 94 - 95 - :: 93 + #### Examples {.doc-section .doc-section-examples} 96 94 97 - >>> entry = index.get_dataset("my-dataset") 98 - >>> SampleType = index.decode_schema(entry.schema_ref) 99 - >>> ds = Dataset[SampleType](entry.data_urls[0]) 100 - >>> for sample in ds.ordered(): 101 - ... print(sample) # sample is instance of SampleType 95 + ```python 96 + >>> entry = index.get_dataset("my-dataset") 97 + >>> SampleType = index.decode_schema(entry.schema_ref) 98 + >>> ds = Dataset[SampleType](entry.data_urls[0]) 99 + >>> for sample in ds.ordered(): 100 + ... print(sample) # sample is instance of SampleType 101 + ``` 102 102 103 103 ### get_dataset { #atdata.AbstractIndex.get_dataset } 104 104
+10 -10
docs_src/api/AtUri.qmd
··· 8 8 9 9 AT URIs follow the format: at://<authority>/<collection>/<rkey> 10 10 11 - ## Example {.doc-section .doc-section-example} 11 + ## Examples {.doc-section .doc-section-examples} 12 12 13 - :: 14 - 15 - >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 16 - >>> uri.authority 17 - 'did:plc:abc123' 18 - >>> uri.collection 19 - 'ac.foundation.dataset.sampleSchema' 20 - >>> uri.rkey 21 - 'xyz' 13 + ```python 14 + >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 15 + >>> uri.authority 16 + 'did:plc:abc123' 17 + >>> uri.collection 18 + 'ac.foundation.dataset.sampleSchema' 19 + >>> uri.rkey 20 + 'xyz' 21 + ``` 22 22 23 23 ## Attributes 24 24
+7 -7
docs_src/api/AtmosphereClient.qmd
··· 9 9 This class wraps the atproto SDK client and provides higher-level methods 10 10 for working with atdata records (schemas, datasets, lenses). 11 11 12 - ## Example {.doc-section .doc-section-example} 13 - 14 - :: 12 + ## Examples {.doc-section .doc-section-examples} 15 13 16 - >>> client = AtmosphereClient() 17 - >>> client.login("alice.bsky.social", "app-password") 18 - >>> print(client.did) 19 - 'did:plc:...' 14 + ```python 15 + >>> client = AtmosphereClient() 16 + >>> client.login("alice.bsky.social", "app-password") 17 + >>> print(client.did) 18 + 'did:plc:...' 19 + ``` 20 20 21 21 ## Note {.doc-section .doc-section-note} 22 22
+13 -13
docs_src/api/AtmosphereIndex.qmd
··· 12 12 Optionally accepts a ``PDSBlobStore`` for writing dataset shards as 13 13 ATProto blobs, enabling fully decentralized dataset storage. 14 14 15 - ## Example {.doc-section .doc-section-example} 16 - 17 - :: 15 + ## Examples {.doc-section .doc-section-examples} 18 16 19 - >>> client = AtmosphereClient() 20 - >>> client.login("handle.bsky.social", "app-password") 21 - >>> 22 - >>> # Without blob storage (external URLs only) 23 - >>> index = AtmosphereIndex(client) 24 - >>> 25 - >>> # With PDS blob storage 26 - >>> store = PDSBlobStore(client) 27 - >>> index = AtmosphereIndex(client, data_store=store) 28 - >>> entry = index.insert_dataset(dataset, name="my-data") 17 + ```python 18 + >>> client = AtmosphereClient() 19 + >>> client.login("handle.bsky.social", "app-password") 20 + >>> 21 + >>> # Without blob storage (external URLs only) 22 + >>> index = AtmosphereIndex(client) 23 + >>> 24 + >>> # With PDS blob storage 25 + >>> store = PDSBlobStore(client) 26 + >>> index = AtmosphereIndex(client, data_store=store) 27 + >>> entry = index.insert_dataset(dataset, name="my-data") 28 + ``` 29 29 30 30 ## Attributes 31 31
+11 -11
docs_src/api/BlobSource.qmd
··· 20 20 | blob_refs | [list](`list`)\[[dict](`dict`)\[[str](`str`), [str](`str`)\]\] | List of blob reference dicts with 'did' and 'cid' keys. | 21 21 | pds_endpoint | [str](`str`) \| None | Optional PDS endpoint URL. If not provided, resolved from DID. | 22 22 23 - ## Example {.doc-section .doc-section-example} 24 - 25 - :: 23 + ## Examples {.doc-section .doc-section-examples} 26 24 27 - >>> source = BlobSource( 28 - ... blob_refs=[ 29 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 30 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 31 - ... ], 32 - ... ) 33 - >>> for shard_id, stream in source.shards: 34 - ... process(stream) 25 + ```python 26 + >>> source = BlobSource( 27 + ... blob_refs=[ 28 + ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 29 + ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 30 + ... ], 31 + ... ) 32 + >>> for shard_id, stream in source.shards: 33 + ... process(stream) 34 + ``` 35 35 36 36 ## Methods 37 37
+12 -12
docs_src/api/DataSource.qmd
··· 19 19 - ATProto blob streaming 20 20 - Any other source that can provide file-like objects 21 21 22 - ## Example {.doc-section .doc-section-example} 23 - 24 - :: 22 + ## Examples {.doc-section .doc-section-examples} 25 23 26 - >>> source = S3Source( 27 - ... bucket="my-bucket", 28 - ... keys=["data-000.tar", "data-001.tar"], 29 - ... endpoint="https://r2.example.com", 30 - ... credentials=creds, 31 - ... ) 32 - >>> ds = Dataset[MySample](source) 33 - >>> for sample in ds.ordered(): 34 - ... print(sample) 24 + ```python 25 + >>> source = S3Source( 26 + ... bucket="my-bucket", 27 + ... keys=["data-000.tar", "data-001.tar"], 28 + ... endpoint="https://r2.example.com", 29 + ... credentials=creds, 30 + ... ) 31 + >>> ds = Dataset[MySample](source) 32 + >>> for sample in ds.ordered(): 33 + ... print(sample) 34 + ``` 35 35 36 36 ## Attributes 37 37
+19 -19
docs_src/api/Dataset.qmd
··· 28 28 |--------|--------|----------------------------------------------------| 29 29 | url | | WebDataset brace-notation URL for the tar file(s). | 30 30 31 - ## Example {.doc-section .doc-section-example} 31 + ## Examples {.doc-section .doc-section-examples} 32 32 33 - :: 34 - 35 - >>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar") 36 - >>> for sample in ds.ordered(batch_size=32): 37 - ... # sample is SampleBatch[MyData] with batch_size samples 38 - ... embeddings = sample.embeddings # shape: (32, ...) 39 - ... 40 - >>> # Transform to a different view 41 - >>> ds_view = ds.as_type(MyDataView) 33 + ```python 34 + >>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar") 35 + >>> for sample in ds.ordered(batch_size=32): 36 + ... # sample is SampleBatch[MyData] with batch_size samples 37 + ... embeddings = sample.embeddings # shape: (32, ...) 38 + ... 39 + >>> # Transform to a different view 40 + >>> ds_view = ds.as_type(MyDataView) 41 + ``` 42 42 43 43 ## Note {.doc-section .doc-section-note} 44 44 ··· 182 182 This creates multiple parquet files: ``output-000000.parquet``, 183 183 ``output-000001.parquet``, etc. 184 184 185 - #### Example {.doc-section .doc-section-example} 185 + #### Examples {.doc-section .doc-section-examples} 186 186 187 - :: 188 - 189 - >>> ds = Dataset[MySample]("data.tar") 190 - >>> # Small dataset - load all at once 191 - >>> ds.to_parquet("output.parquet") 192 - >>> 193 - >>> # Large dataset - process in chunks 194 - >>> ds.to_parquet("output.parquet", maxcount=50000) 187 + ```python 188 + >>> ds = Dataset[MySample]("data.tar") 189 + >>> # Small dataset - load all at once 190 + >>> ds.to_parquet("output.parquet") 191 + >>> 192 + >>> # Large dataset - process in chunks 193 + >>> ds.to_parquet("output.parquet", maxcount=50000) 194 + ``` 195 195 196 196 ### wrap { #atdata.Dataset.wrap } 197 197
+10 -10
docs_src/api/DatasetDict.qmd
··· 16 16 |--------|--------|------------------------------------------------|------------| 17 17 | ST | | The sample type for all datasets in this dict. | _required_ | 18 18 19 - ## Example {.doc-section .doc-section-example} 19 + ## Examples {.doc-section .doc-section-examples} 20 20 21 - :: 22 - 23 - >>> ds_dict = load_dataset("path/to/data", MyData) 24 - >>> train = ds_dict["train"] 25 - >>> test = ds_dict["test"] 26 - >>> 27 - >>> # Iterate over all splits 28 - >>> for split_name, dataset in ds_dict.items(): 29 - ... print(f"{split_name}: {len(dataset.shard_list)} shards") 21 + ```python 22 + >>> ds_dict = load_dataset("path/to/data", MyData) 23 + >>> train = ds_dict["train"] 24 + >>> test = ds_dict["test"] 25 + >>> 26 + >>> # Iterate over all splits 27 + >>> for split_name, dataset in ds_dict.items(): 28 + ... print(f"{split_name}: {len(dataset.shard_list)} shards") 29 + ``` 30 30 31 31 ## Attributes 32 32
+20 -20
docs_src/api/DatasetLoader.qmd
··· 10 10 from them. Note that loading a dataset requires having the corresponding 11 11 Python class for the sample type. 12 12 13 - ## Example {.doc-section .doc-section-example} 14 - 15 - :: 13 + ## Examples {.doc-section .doc-section-examples} 16 14 17 - >>> client = AtmosphereClient() 18 - >>> loader = DatasetLoader(client) 19 - >>> 20 - >>> # List available datasets 21 - >>> datasets = loader.list() 22 - >>> for ds in datasets: 23 - ... print(ds["name"], ds["schemaRef"]) 24 - >>> 25 - >>> # Get a specific dataset record 26 - >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz") 15 + ```python 16 + >>> client = AtmosphereClient() 17 + >>> loader = DatasetLoader(client) 18 + >>> 19 + >>> # List available datasets 20 + >>> datasets = loader.list() 21 + >>> for ds in datasets: 22 + ... print(ds["name"], ds["schemaRef"]) 23 + >>> 24 + >>> # Get a specific dataset record 25 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz") 26 + ``` 27 27 28 28 ## Methods 29 29 ··· 245 245 |--------|----------------------------|-------------------------------------| 246 246 | | [ValueError](`ValueError`) | If no storage URLs can be resolved. | 247 247 248 - #### Example {.doc-section .doc-section-example} 249 - 250 - :: 248 + #### Examples {.doc-section .doc-section-examples} 251 249 252 - >>> loader = DatasetLoader(client) 253 - >>> dataset = loader.to_dataset(uri, MySampleType) 254 - >>> for batch in dataset.shuffled(batch_size=32): 255 - ... process(batch) 250 + ```python 251 + >>> loader = DatasetLoader(client) 252 + >>> dataset = loader.to_dataset(uri, MySampleType) 253 + >>> for batch in dataset.shuffled(batch_size=32): 254 + ... process(batch) 255 + ```
+15 -15
docs_src/api/DatasetPublisher.qmd
··· 9 9 This class creates dataset records that reference a schema and point to 10 10 external storage (WebDataset URLs) or ATProto blobs. 11 11 12 - ## Example {.doc-section .doc-section-example} 13 - 14 - :: 12 + ## Examples {.doc-section .doc-section-examples} 15 13 16 - >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 17 - >>> 18 - >>> client = AtmosphereClient() 19 - >>> client.login("handle", "password") 20 - >>> 21 - >>> publisher = DatasetPublisher(client) 22 - >>> uri = publisher.publish( 23 - ... dataset, 24 - ... name="My Training Data", 25 - ... description="Training data for my model", 26 - ... tags=["computer-vision", "training"], 27 - ... ) 14 + ```python 15 + >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 16 + >>> 17 + >>> client = AtmosphereClient() 18 + >>> client.login("handle", "password") 19 + >>> 20 + >>> publisher = DatasetPublisher(client) 21 + >>> uri = publisher.publish( 22 + ... dataset, 23 + ... name="My Training Data", 24 + ... description="Training data for my model", 25 + ... tags=["computer-vision", "training"], 26 + ... ) 27 + ``` 28 28 29 29 ## Methods 30 30
+11 -11
docs_src/api/DictSample.qmd
··· 20 20 ``@packable``-decorated class. Every ``@packable`` class automatically 21 21 registers a lens from ``DictSample``, making this conversion seamless. 22 22 23 - ## Example {.doc-section .doc-section-example} 24 - 25 - :: 23 + ## Examples {.doc-section .doc-section-examples} 26 24 27 - >>> ds = load_dataset("path/to/data.tar") # Returns Dataset[DictSample] 28 - >>> for sample in ds.ordered(): 29 - ... print(sample.some_field) # Attribute access 30 - ... print(sample["other_field"]) # Dict access 31 - ... print(sample.keys()) # Inspect available fields 32 - ... 33 - >>> # Convert to typed schema 34 - >>> typed_ds = ds.as_type(MyTypedSample) 25 + ```python 26 + >>> ds = load_dataset("path/to/data.tar") # Returns Dataset[DictSample] 27 + >>> for sample in ds.ordered(): 28 + ... print(sample.some_field) # Attribute access 29 + ... print(sample["other_field"]) # Dict access 30 + ... print(sample.keys()) # Inspect available fields 31 + ... 32 + >>> # Convert to typed schema 33 + >>> typed_ds = ds.as_type(MyTypedSample) 34 + ``` 35 35 36 36 ## Note {.doc-section .doc-section-note} 37 37
+50 -50
docs_src/api/Lens.qmd
··· 18 18 Lenses support the functional programming concept of composable, well-behaved 19 19 transformations that satisfy lens laws (GetPut and PutGet). 20 20 21 - ## Example {.doc-section .doc-section-example} 22 - 23 - :: 21 + ## Examples {.doc-section .doc-section-examples} 24 22 25 - >>> @packable 26 - ... class FullData: 27 - ... name: str 28 - ... age: int 29 - ... embedding: NDArray 30 - ... 31 - >>> @packable 32 - ... class NameOnly: 33 - ... name: str 34 - ... 35 - >>> @lens 36 - ... def name_view(full: FullData) -> NameOnly: 37 - ... return NameOnly(name=full.name) 38 - ... 39 - >>> @name_view.putter 40 - ... def name_view_put(view: NameOnly, source: FullData) -> FullData: 41 - ... return FullData(name=view.name, age=source.age, 42 - ... embedding=source.embedding) 43 - ... 44 - >>> ds = Dataset[FullData]("data.tar") 45 - >>> ds_names = ds.as_type(NameOnly) # Uses registered lens 23 + ```python 24 + >>> @packable 25 + ... class FullData: 26 + ... name: str 27 + ... age: int 28 + ... embedding: NDArray 29 + ... 30 + >>> @packable 31 + ... class NameOnly: 32 + ... name: str 33 + ... 34 + >>> @lens 35 + ... def name_view(full: FullData) -> NameOnly: 36 + ... return NameOnly(name=full.name) 37 + ... 38 + >>> @name_view.putter 39 + ... def name_view_put(view: NameOnly, source: FullData) -> FullData: 40 + ... return FullData(name=view.name, age=source.age, 41 + ... embedding=source.embedding) 42 + ... 43 + >>> ds = Dataset[FullData]("data.tar") 44 + >>> ds_names = ds.as_type(NameOnly) # Uses registered lens 45 + ``` 46 46 47 47 ## Classes 48 48 ··· 71 71 | S | | The source type, must derive from ``PackableSample``. | _required_ | 72 72 | V | | The view type, must derive from ``PackableSample``. | _required_ | 73 73 74 - #### Example {.doc-section .doc-section-example} 75 - 76 - :: 74 + #### Examples {.doc-section .doc-section-examples} 77 75 78 - >>> @lens 79 - ... def name_lens(full: FullData) -> NameOnly: 80 - ... return NameOnly(name=full.name) 81 - ... 82 - >>> @name_lens.putter 83 - ... def name_lens_put(view: NameOnly, source: FullData) -> FullData: 84 - ... return FullData(name=view.name, age=source.age) 76 + ```python 77 + >>> @lens 78 + ... def name_lens(full: FullData) -> NameOnly: 79 + ... return NameOnly(name=full.name) 80 + ... 81 + >>> @name_lens.putter 82 + ... def name_lens_put(view: NameOnly, source: FullData) -> FullData: 83 + ... return FullData(name=view.name, age=source.age) 84 + ``` 85 85 86 86 #### Methods 87 87 ··· 152 152 |--------|--------------------------------------------------------------------------------------|---------------------------------------------------------------| 153 153 | | [LensPutter](`atdata.lens.LensPutter`)\[[S](`atdata.lens.S`), [V](`atdata.lens.V`)\] | The putter function, allowing this to be used as a decorator. | 154 154 155 - ###### Example {.doc-section .doc-section-example} 156 - 157 - :: 155 + ###### Examples {.doc-section .doc-section-examples} 158 156 159 - >>> @my_lens.putter 160 - ... def my_lens_put(view: ViewType, source: SourceType) -> SourceType: 161 - ... return SourceType(...) 157 + ```python 158 + >>> @my_lens.putter 159 + ... def my_lens_put(view: ViewType, source: SourceType) -> SourceType: 160 + ... 
return SourceType(field=view.field, other=source.other) 161 + ``` 162 162 163 163 ### LensNetwork { #atdata.lens.LensNetwork } 164 164 ··· 267 267 | | [Lens](`atdata.lens.Lens`)\[[S](`atdata.lens.S`), [V](`atdata.lens.V`)\] | A ``Lens[S, V]`` object that can be called to apply the transformation | 268 268 | | [Lens](`atdata.lens.Lens`)\[[S](`atdata.lens.S`), [V](`atdata.lens.V`)\] | or decorated with ``@lens_name.putter`` to add a putter function. | 269 269 270 - #### Example {.doc-section .doc-section-example} 270 + #### Examples {.doc-section .doc-section-examples} 271 271 272 - :: 273 - 274 - >>> @lens 275 - ... def extract_name(full: FullData) -> NameOnly: 276 - ... return NameOnly(name=full.name) 277 - ... 278 - >>> @extract_name.putter 279 - ... def extract_name_put(view: NameOnly, source: FullData) -> FullData: 280 - ... return FullData(name=view.name, age=source.age) 272 + ```python 273 + >>> @lens 274 + ... def extract_name(full: FullData) -> NameOnly: 275 + ... return NameOnly(name=full.name) 276 + ... 277 + >>> @extract_name.putter 278 + ... def extract_name_put(view: NameOnly, source: FullData) -> FullData: 279 + ... return FullData(name=view.name, age=source.age) 280 + ```
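The lens laws named above (GetPut and PutGet) reduce to two round-trip checks. A minimal sketch, assuming the ``FullData``, ``NameOnly``, ``name_view``, and ``name_view_put`` definitions from the example above, and assuming a decorated lens is callable as its getter (as the rendered docstring states); the concrete values are illustrative only:

```python
import numpy as np

full = FullData(name="ada", age=36, embedding=np.zeros(4))  # illustrative values

# GetPut: putting back an unmodified view leaves the source unchanged.
roundtrip = name_view_put(name_view(full), full)
assert roundtrip.name == full.name and roundtrip.age == full.age

# PutGet: getting immediately after a put returns the view that was put.
view = NameOnly(name="grace")
assert name_view(name_view_put(view, full)).name == view.name
```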
+10 -10
docs_src/api/LensLoader.qmd
··· 10 10 using a lens requires installing the referenced code and importing 11 11 it manually. 12 12 13 - ## Example {.doc-section .doc-section-example} 13 + ## Examples {.doc-section .doc-section-examples} 14 14 15 - :: 16 - 17 - >>> client = AtmosphereClient() 18 - >>> loader = LensLoader(client) 19 - >>> 20 - >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 21 - >>> print(record["name"]) 22 - >>> print(record["sourceSchema"]) 23 - >>> print(record.get("getterCode", {}).get("repository")) 15 + ```python 16 + >>> client = AtmosphereClient() 17 + >>> loader = LensLoader(client) 18 + >>> 19 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 20 + >>> print(record["name"]) 21 + >>> print(record["sourceSchema"]) 22 + >>> print(record.get("getterCode", {}).get("repository")) 23 + ``` 24 24 25 25 ## Methods 26 26
+20 -20
docs_src/api/LensPublisher.qmd
··· 9 9 This class creates lens records that reference source and target schemas 10 10 and point to the transformation code in a git repository. 11 11 12 - ## Example {.doc-section .doc-section-example} 12 + ## Examples {.doc-section .doc-section-examples} 13 13 14 - :: 15 - 16 - >>> @atdata.lens 17 - ... def my_lens(source: SourceType) -> TargetType: 18 - ... return TargetType(field=source.other_field) 19 - >>> 20 - >>> client = AtmosphereClient() 21 - >>> client.login("handle", "password") 22 - >>> 23 - >>> publisher = LensPublisher(client) 24 - >>> uri = publisher.publish( 25 - ... name="my_lens", 26 - ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 27 - ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 28 - ... code_repository="https://github.com/user/repo", 29 - ... code_commit="abc123def456", 30 - ... getter_path="mymodule.lenses:my_lens", 31 - ... putter_path="mymodule.lenses:my_lens_putter", 32 - ... ) 14 + ```python 15 + >>> @atdata.lens 16 + ... def my_lens(source: SourceType) -> TargetType: 17 + ... return TargetType(field=source.other_field) 18 + >>> 19 + >>> client = AtmosphereClient() 20 + >>> client.login("handle", "password") 21 + >>> 22 + >>> publisher = LensPublisher(client) 23 + >>> uri = publisher.publish( 24 + ... name="my_lens", 25 + ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 26 + ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 27 + ... code_repository="https://github.com/user/repo", 28 + ... code_commit="abc123def456", 29 + ... getter_path="mymodule.lenses:my_lens", 30 + ... putter_path="mymodule.lenses:my_lens_putter", 31 + ... ) 32 + ``` 33 33 34 34 ## Security Note {.doc-section .doc-section-security-note} 35 35
+7 -7
docs_src/api/PDSBlobStore.qmd
··· 19 19 |--------|----------------------|------------------------------------------| 20 20 | client | \'AtmosphereClient\' | Authenticated AtmosphereClient instance. | 21 21 22 - ## Example {.doc-section .doc-section-example} 23 - 24 - :: 22 + ## Examples {.doc-section .doc-section-examples} 25 23 26 - >>> store = PDSBlobStore(client) 27 - >>> urls = store.write_shards(dataset, prefix="training/v1") 28 - >>> # Returns AT URIs like: 29 - >>> # ['at://did:plc:abc/blob/bafyrei...', ...] 24 + ```python 25 + >>> store = PDSBlobStore(client) 26 + >>> urls = store.write_shards(dataset, prefix="training/v1") 27 + >>> # Returns AT URIs like: 28 + >>> # ['at://did:plc:abc/blob/bafyrei...', ...] 29 + ``` 30 30 31 31 ## Methods 32 32
+12 -12
docs_src/api/Packable-protocol.qmd
··· 18 18 - Schema publishing (class introspection via dataclass fields) 19 19 - Serialization/deserialization (packed, from_bytes) 20 20 21 - ## Example {.doc-section .doc-section-example} 22 - 23 - :: 21 + ## Examples {.doc-section .doc-section-examples} 24 22 25 - >>> @packable 26 - ... class MySample: 27 - ... name: str 28 - ... value: int 29 - ... 30 - >>> def process(sample_type: Type[Packable]) -> None: 31 - ... # Type checker knows sample_type has from_bytes, packed, etc. 32 - ... instance = sample_type.from_bytes(data) 33 - ... print(instance.packed) 23 + ```python 24 + >>> @packable 25 + ... class MySample: 26 + ... name: str 27 + ... value: int 28 + ... 29 + >>> def process(sample_type: Type[Packable]) -> None: 30 + ... # Type checker knows sample_type has from_bytes, packed, etc. 31 + ... instance = sample_type.from_bytes(data) 32 + ... print(instance.packed) 33 + ``` 34 34 35 35 ## Attributes 36 36
+11 -11
docs_src/api/PackableSample.qmd
··· 15 15 1. Direct inheritance with the ``@dataclass`` decorator 16 16 2. Using the ``@packable`` decorator (recommended) 17 17 18 - ## Example {.doc-section .doc-section-example} 19 - 20 - :: 18 + ## Examples {.doc-section .doc-section-examples} 21 19 22 - >>> @packable 23 - ... class MyData: 24 - ... name: str 25 - ... embeddings: NDArray 26 - ... 27 - >>> sample = MyData(name="test", embeddings=np.array([1.0, 2.0])) 28 - >>> packed = sample.packed # Serialize to bytes 29 - >>> restored = MyData.from_bytes(packed) # Deserialize 20 + ```python 21 + >>> @packable 22 + ... class MyData: 23 + ... name: str 24 + ... embeddings: NDArray 25 + ... 26 + >>> sample = MyData(name="test", embeddings=np.array([1.0, 2.0])) 27 + >>> packed = sample.packed # Serialize to bytes 28 + >>> restored = MyData.from_bytes(packed) # Deserialize 29 + ``` 30 30 31 31 ## Attributes 32 32
+28 -28
docs_src/api/S3Source.qmd
··· 35 35 | secret_key | [str](`str`) \| None | Optional AWS secret access key. | 36 36 | region | [str](`str`) \| None | Optional AWS region (defaults to us-east-1). | 37 37 38 - ## Example {.doc-section .doc-section-example} 39 - 40 - :: 38 + ## Examples {.doc-section .doc-section-examples} 41 39 42 - >>> source = S3Source( 43 - ... bucket="my-datasets", 44 - ... keys=["train/shard-000.tar", "train/shard-001.tar"], 45 - ... endpoint="https://abc123.r2.cloudflarestorage.com", 46 - ... access_key="AKIAIOSFODNN7EXAMPLE", 47 - ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 48 - ... ) 49 - >>> for shard_id, stream in source.shards: 50 - ... process(stream) 40 + ```python 41 + >>> source = S3Source( 42 + ... bucket="my-datasets", 43 + ... keys=["train/shard-000.tar", "train/shard-001.tar"], 44 + ... endpoint="https://abc123.r2.cloudflarestorage.com", 45 + ... access_key="AKIAIOSFODNN7EXAMPLE", 46 + ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 47 + ... ) 48 + >>> for shard_id, stream in source.shards: 49 + ... process(stream) 50 + ``` 51 51 52 52 ## Methods 53 53 ··· 82 82 |--------|--------------|----------------------| 83 83 | | \'S3Source\' | Configured S3Source. | 84 84 85 - #### Example {.doc-section .doc-section-example} 86 - 87 - :: 85 + #### Examples {.doc-section .doc-section-examples} 88 86 89 - >>> creds = { 90 - ... "AWS_ACCESS_KEY_ID": "...", 91 - ... "AWS_SECRET_ACCESS_KEY": "...", 92 - ... "AWS_ENDPOINT": "https://r2.example.com", 93 - ... } 94 - >>> source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"]) 87 + ```python 88 + >>> creds = { 89 + ... "AWS_ACCESS_KEY_ID": "...", 90 + ... "AWS_SECRET_ACCESS_KEY": "...", 91 + ... "AWS_ENDPOINT": "https://r2.example.com", 92 + ... } 93 + >>> source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"]) 94 + ``` 95 95 96 96 ### from_urls { #atdata.S3Source.from_urls } 97 97 ··· 133 133 |--------|----------------------------|------------------------------------------------------------| 134 134 | | [ValueError](`ValueError`) | If URLs are not valid s3:// URLs or span multiple buckets. | 135 135 136 - #### Example {.doc-section .doc-section-example} 137 - 138 - :: 136 + #### Examples {.doc-section .doc-section-examples} 139 137 140 - >>> source = S3Source.from_urls( 141 - ... ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"], 142 - ... endpoint="https://r2.example.com", 143 - ... ) 138 + ```python 139 + >>> source = S3Source.from_urls( 140 + ... ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"], 141 + ... endpoint="https://r2.example.com", 142 + ... ) 143 + ``` 144 144 145 145 ### list_shards { #atdata.S3Source.list_shards } 146 146
+6 -6
docs_src/api/SampleBatch.qmd
··· 26 26 |---------|--------|---------------------------------------------| 27 27 | samples | | The list of sample instances in this batch. | 28 28 29 - ## Example {.doc-section .doc-section-example} 30 - 31 - :: 29 + ## Examples {.doc-section .doc-section-examples} 32 30 33 - >>> batch = SampleBatch[MyData]([sample1, sample2, sample3]) 34 - >>> batch.embeddings # Returns stacked numpy array of shape (3, ...) 35 - >>> batch.names # Returns list of names 31 + ```python 32 + >>> batch = SampleBatch[MyData]([sample1, sample2, sample3]) 33 + >>> batch.embeddings # Returns stacked numpy array of shape (3, ...) 34 + >>> batch.names # Returns list of names 35 + ``` 36 36 37 37 ## Note {.doc-section .doc-section-note} 38 38
+10 -10
docs_src/api/SchemaLoader.qmd
··· 9 9 This class fetches schema records from ATProto and can list available 10 10 schemas from a repository. 11 11 12 - ## Example {.doc-section .doc-section-example} 12 + ## Examples {.doc-section .doc-section-examples} 13 13 14 - :: 15 - 16 - >>> client = AtmosphereClient() 17 - >>> client.login("handle", "password") 18 - >>> 19 - >>> loader = SchemaLoader(client) 20 - >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 21 - >>> print(schema["name"]) 22 - 'MySample' 14 + ```python 15 + >>> client = AtmosphereClient() 16 + >>> client.login("handle", "password") 17 + >>> 18 + >>> loader = SchemaLoader(client) 19 + >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 20 + >>> print(schema["name"]) 21 + 'MySample' 22 + ``` 23 23 24 24 ## Methods 25 25
+15 -15
docs_src/api/SchemaPublisher.qmd
··· 9 9 This class introspects a PackableSample class to extract its field 10 10 definitions and publishes them as an ATProto schema record. 11 11 12 - ## Example {.doc-section .doc-section-example} 13 - 14 - :: 12 + ## Examples {.doc-section .doc-section-examples} 15 13 16 - >>> @atdata.packable 17 - ... class MySample: 18 - ... image: NDArray 19 - ... label: str 20 - ... 21 - >>> client = AtmosphereClient() 22 - >>> client.login("handle", "password") 23 - >>> 24 - >>> publisher = SchemaPublisher(client) 25 - >>> uri = publisher.publish(MySample, version="1.0.0") 26 - >>> print(uri) 27 - at://did:plc:.../ac.foundation.dataset.sampleSchema/... 14 + ```python 15 + >>> @atdata.packable 16 + ... class MySample: 17 + ... image: NDArray 18 + ... label: str 19 + ... 20 + >>> client = AtmosphereClient() 21 + >>> client.login("handle", "password") 22 + >>> 23 + >>> publisher = SchemaPublisher(client) 24 + >>> uri = publisher.publish(MySample, version="1.0.0") 25 + >>> print(uri) 26 + at://did:plc:.../ac.foundation.dataset.sampleSchema/... 27 + ``` 28 28 29 29 ## Methods 30 30
+6 -6
docs_src/api/URLSource.qmd
··· 18 18 |--------|--------------|--------------------------------------| 19 19 | url | [str](`str`) | URL or brace pattern for the shards. | 20 20 21 - ## Example {.doc-section .doc-section-example} 22 - 23 - :: 21 + ## Examples {.doc-section .doc-section-examples} 24 22 25 - >>> source = URLSource("https://example.com/train-{000..009}.tar") 26 - >>> for shard_id, stream in source.shards: 27 - ... print(f"Streaming {shard_id}") 23 + ```python 24 + >>> source = URLSource("https://example.com/train-{000..009}.tar") 25 + >>> for shard_id, stream in source.shards: 26 + ... print(f"Streaming {shard_id}") 27 + ``` 28 28 29 29 ## Methods 30 30
+19 -19
docs_src/api/load_dataset.qmd
··· 50 50 | | [FileNotFoundError](`FileNotFoundError`) | If no data files are found at the path. | 51 51 | | [KeyError](`KeyError`) | If dataset not found in index. | 52 52 53 - ## Example {.doc-section .doc-section-example} 53 + ## Examples {.doc-section .doc-section-examples} 54 54 55 - :: 56 - 57 - >>> # Load without type - get DictSample for exploration 58 - >>> ds = load_dataset("./data/train.tar", split="train") 59 - >>> for sample in ds.ordered(): 60 - ... print(sample.keys()) # Explore fields 61 - ... print(sample["text"]) # Dict-style access 62 - ... print(sample.label) # Attribute access 63 - >>> 64 - >>> # Convert to typed schema 65 - >>> typed_ds = ds.as_type(TextData) 66 - >>> 67 - >>> # Or load with explicit type directly 68 - >>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train") 69 - >>> 70 - >>> # Load from index with auto-type resolution 71 - >>> index = LocalIndex() 72 - >>> ds = load_dataset("@local/my-dataset", index=index, split="train") 55 + ```python 56 + >>> # Load without type - get DictSample for exploration 57 + >>> ds = load_dataset("./data/train.tar", split="train") 58 + >>> for sample in ds.ordered(): 59 + ... print(sample.keys()) # Explore fields 60 + ... print(sample["text"]) # Dict-style access 61 + ... print(sample.label) # Attribute access 62 + >>> 63 + >>> # Convert to typed schema 64 + >>> typed_ds = ds.as_type(TextData) 65 + >>> 66 + >>> # Or load with explicit type directly 67 + >>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train") 68 + >>> 69 + >>> # Load from index with auto-type resolution 70 + >>> index = LocalIndex() 71 + >>> ds = load_dataset("@local/my-dataset", index=index, split="train") 72 + ```
+29 -29
docs_src/api/local.Index.qmd
··· 150 150 |--------|-----------------------------------------|----------------------------------------------------------------| 151 151 | | [type](`type`)\[[T](`atdata.local.T`)\] | The decoded type, cast to match the type_hint for IDE support. | 152 152 153 - #### Example {.doc-section .doc-section-example} 153 + #### Examples {.doc-section .doc-section-examples} 154 154 155 - :: 156 - 157 - >>> # After enabling auto_stubs and configuring IDE extraPaths: 158 - >>> from local.MySample_1_0_0 import MySample 159 - >>> 160 - >>> # This gives full IDE autocomplete: 161 - >>> DecodedType = index.decode_schema_as(ref, MySample) 162 - >>> sample = DecodedType(text="hello", value=42) # IDE knows signature! 155 + ```python 156 + >>> # After enabling auto_stubs and configuring IDE extraPaths: 157 + >>> from local.MySample_1_0_0 import MySample 158 + >>> 159 + >>> # This gives full IDE autocomplete: 160 + >>> DecodedType = index.decode_schema_as(ref, MySample) 161 + >>> sample = DecodedType(text="hello", value=42) # IDE knows signature! 162 + ``` 163 163 164 164 #### Note {.doc-section .doc-section-note} 165 165 ··· 269 269 | | [str](`str`) \| None | Import path like "local.MySample_1_0_0", or None if auto_stubs | 270 270 | | [str](`str`) \| None | is disabled. | 271 271 272 - #### Example {.doc-section .doc-section-example} 272 + #### Examples {.doc-section .doc-section-examples} 273 273 274 - :: 275 - 276 - >>> index = LocalIndex(auto_stubs=True) 277 - >>> ref = index.publish_schema(MySample, version="1.0.0") 278 - >>> index.load_schema(ref) 279 - >>> print(index.get_import_path(ref)) 280 - local.MySample_1_0_0 281 - >>> # Then in your code: 282 - >>> # from local.MySample_1_0_0 import MySample 274 + ```python 275 + >>> index = LocalIndex(auto_stubs=True) 276 + >>> ref = index.publish_schema(MySample, version="1.0.0") 277 + >>> index.load_schema(ref) 278 + >>> print(index.get_import_path(ref)) 279 + local.MySample_1_0_0 280 + >>> # Then in your code: 281 + >>> # from local.MySample_1_0_0 import MySample 282 + ``` 283 283 284 284 ### get_schema { #atdata.local.Index.get_schema } 285 285 ··· 440 440 | | [KeyError](`KeyError`) | If schema not found. | 441 441 | | [ValueError](`ValueError`) | If schema cannot be decoded. | 442 442 443 - #### Example {.doc-section .doc-section-example} 443 + #### Examples {.doc-section .doc-section-examples} 444 444 445 - :: 446 - 447 - >>> # Load and use immediately 448 - >>> MyType = index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 449 - >>> sample = MyType(name="hello", value=42) 450 - >>> 451 - >>> # Or access later via namespace 452 - >>> index.load_schema("atdata://local/sampleSchema/OtherType@1.0.0") 453 - >>> other = index.types.OtherType(data="test") 445 + ```python 446 + >>> # Load and use immediately 447 + >>> MyType = index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 448 + >>> sample = MyType(name="hello", value=42) 449 + >>> 450 + >>> # Or access later via namespace 451 + >>> index.load_schema("atdata://local/sampleSchema/OtherType@1.0.0") 452 + >>> other = index.types.OtherType(data="test") 453 + ``` 454 454 455 455 ### publish_schema { #atdata.local.Index.publish_schema } 456 456
+13 -13
docs_src/api/packable.qmd
··· 30 30 31 31 ## Examples {.doc-section .doc-section-examples} 32 32 33 - This is a test of the functionality:: 34 - 35 - @packable 36 - class MyData: 37 - name: str 38 - values: NDArray 39 - 40 - sample = MyData(name="test", values=np.array([1, 2, 3])) 41 - bytes_data = sample.packed 42 - restored = MyData.from_bytes(bytes_data) 43 - 44 - # Works with Packable-typed APIs 45 - index.publish_schema(MyData, version="1.0.0") # Type-safe 33 + ```python 34 + >>> @packable 35 + ... class MyData: 36 + ... name: str 37 + ... values: NDArray 38 + ... 39 + >>> sample = MyData(name="test", values=np.array([1, 2, 3])) 40 + >>> bytes_data = sample.packed 41 + >>> restored = MyData.from_bytes(bytes_data) 42 + >>> 43 + >>> # Works with Packable-typed APIs 44 + >>> index.publish_schema(MyData, version="1.0.0") # Type-safe 45 + ```
+7 -7
docs_src/api/promote_to_atmosphere.qmd
··· 45 45 | | [KeyError](`KeyError`) | If schema not found in local index. | 46 46 | | [ValueError](`ValueError`) | If local entry has no data URLs. | 47 47 48 - ## Example {.doc-section .doc-section-example} 49 - 50 - :: 48 + ## Examples {.doc-section .doc-section-examples} 51 49 52 - >>> entry = local_index.get_dataset("mnist-train") 53 - >>> uri = promote_to_atmosphere(entry, local_index, client) 54 - >>> print(uri) 55 - at://did:plc:abc123/ac.foundation.dataset.datasetIndex/... 50 + ```python 51 + >>> entry = local_index.get_dataset("mnist-train") 52 + >>> uri = promote_to_atmosphere(entry, local_index, client) 53 + >>> print(uri) 54 + at://did:plc:abc123/ac.foundation.dataset.datasetIndex/... 55 + ```
+23 -33
src/atdata/_cid.py
··· 12 12 This ensures compatibility with ATProto's CID requirements and enables 13 13 seamless promotion from local storage to atmosphere (ATProto network). 14 14 15 - Example: 16 - :: 17 - 18 - >>> schema = {"name": "ImageSample", "version": "1.0.0", "fields": [...]} 19 - >>> cid = generate_cid(schema) 20 - >>> print(cid) 21 - bafyreihffx5a2e7k6r5zqgp5iwpjqr2gfyheqhzqtlxagvqjqyxzqpzqaa 15 + Examples: 16 + >>> schema = {"name": "ImageSample", "version": "1.0.0", "fields": [...]} 17 + >>> cid = generate_cid(schema) 18 + >>> print(cid) 19 + bafyreihffx5a2e7k6r5zqgp5iwpjqr2gfyheqhzqtlxagvqjqyxzqpzqaa 22 20 """ 23 21 24 22 import hashlib ··· 50 48 Raises: 51 49 ValueError: If the data cannot be encoded as DAG-CBOR. 52 50 53 - Example: 54 - :: 55 - 56 - >>> generate_cid({"name": "test", "value": 42}) 57 - 'bafyrei...' 51 + Examples: 52 + >>> generate_cid({"name": "test", "value": 42}) 53 + 'bafyrei...' 58 54 """ 59 55 # Encode data as DAG-CBOR 60 56 try: ··· 86 82 Returns: 87 83 CIDv1 string in base32 multibase format. 88 84 89 - Example: 90 - :: 91 - 92 - >>> cbor_bytes = libipld.encode_dag_cbor({"key": "value"}) 93 - >>> cid = generate_cid_from_bytes(cbor_bytes) 85 + Examples: 86 + >>> cbor_bytes = libipld.encode_dag_cbor({"key": "value"}) 87 + >>> cid = generate_cid_from_bytes(cbor_bytes) 94 88 """ 95 89 sha256_hash = hashlib.sha256(data_bytes).digest() 96 90 raw_cid_bytes = bytes([CID_VERSION_1, CODEC_DAG_CBOR, HASH_SHA256, SHA256_SIZE]) + sha256_hash ··· 107 101 Returns: 108 102 True if the CID matches the data, False otherwise. 109 103 110 - Example: 111 - :: 112 - 113 - >>> cid = generate_cid({"name": "test"}) 114 - >>> verify_cid(cid, {"name": "test"}) 115 - True 116 - >>> verify_cid(cid, {"name": "different"}) 117 - False 104 + Examples: 105 + >>> cid = generate_cid({"name": "test"}) 106 + >>> verify_cid(cid, {"name": "test"}) 107 + True 108 + >>> verify_cid(cid, {"name": "different"}) 109 + False 118 110 """ 119 111 expected_cid = generate_cid(data) 120 112 return cid == expected_cid ··· 130 122 Dictionary with 'version', 'codec', and 'hash' keys. 131 123 The 'hash' value is itself a dict with 'code', 'size', and 'digest'. 132 124 133 - Example: 134 - :: 135 - 136 - >>> info = parse_cid('bafyrei...') 137 - >>> info['version'] 138 - 1 139 - >>> info['codec'] 140 - 113 # 0x71 = dag-cbor 125 + Examples: 126 + >>> info = parse_cid('bafyrei...') 127 + >>> info['version'] 128 + 1 129 + >>> info['codec'] 130 + 113 # 0x71 = dag-cbor 141 131 """ 142 132 return libipld.decode_cid(cid) 143 133
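The ``_cid`` docstring above stresses that the identifiers are deterministic. A minimal sketch of what that means in practice, assuming ``atdata._cid`` is importable from this package and that libipld's DAG-CBOR encoder applies the spec's canonical map-key ordering (so dict insertion order does not affect the hash):

```python
from atdata._cid import generate_cid, verify_cid

# The same content in a different insertion order should yield the same CID.
a = generate_cid({"name": "ImageSample", "version": "1.0.0"})
b = generate_cid({"version": "1.0.0", "name": "ImageSample"})
assert a == b

# verify_cid recomputes the CID from the data, so either ordering verifies.
assert verify_cid(a, {"version": "1.0.0", "name": "ImageSample"})
```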
+40 -46
src/atdata/_hf_api.py
··· 9 9 - Built on WebDataset for efficient streaming of large datasets 10 10 - No Arrow caching layer (WebDataset handles remote/local transparently) 11 11 12 - Example: 13 - :: 14 - 15 - >>> import atdata 16 - >>> from atdata import load_dataset 17 - >>> 18 - >>> @atdata.packable 19 - ... class MyData: 20 - ... text: str 21 - ... label: int 22 - >>> 23 - >>> # Load a single split 24 - >>> ds = load_dataset("path/to/train-{000000..000099}.tar", MyData, split="train") 25 - >>> 26 - >>> # Load all splits (returns DatasetDict) 27 - >>> ds_dict = load_dataset("path/to/{train,test}-*.tar", MyData) 28 - >>> train_ds = ds_dict["train"] 12 + Examples: 13 + >>> import atdata 14 + >>> from atdata import load_dataset 15 + >>> 16 + >>> @atdata.packable 17 + ... class MyData: 18 + ... text: str 19 + ... label: int 20 + >>> 21 + >>> # Load a single split 22 + >>> ds = load_dataset("path/to/train-{000000..000099}.tar", MyData, split="train") 23 + >>> 24 + >>> # Load all splits (returns DatasetDict) 25 + >>> ds_dict = load_dataset("path/to/{train,test}-*.tar", MyData) 26 + >>> train_ds = ds_dict["train"] 29 27 """ 30 28 31 29 from __future__ import annotations ··· 70 68 Parameters: 71 69 ST: The sample type for all datasets in this dict. 72 70 73 - Example: 74 - :: 75 - 76 - >>> ds_dict = load_dataset("path/to/data", MyData) 77 - >>> train = ds_dict["train"] 78 - >>> test = ds_dict["test"] 79 - >>> 80 - >>> # Iterate over all splits 81 - >>> for split_name, dataset in ds_dict.items(): 82 - ... print(f"{split_name}: {len(dataset.shard_list)} shards") 71 + Examples: 72 + >>> ds_dict = load_dataset("path/to/data", MyData) 73 + >>> train = ds_dict["train"] 74 + >>> test = ds_dict["test"] 75 + >>> 76 + >>> # Iterate over all splits 77 + >>> for split_name, dataset in ds_dict.items(): 78 + ... print(f"{split_name}: {len(dataset.shard_list)} shards") 83 79 """ 84 80 # TODO The above has a line for "Parameters:" that should be "Type Parameters:"; this is a temporary fix for `quartodoc` auto-generation bugs. 85 81 ··· 613 609 FileNotFoundError: If no data files are found at the path. 614 610 KeyError: If dataset not found in index. 615 611 616 - Example: 617 - :: 618 - 619 - >>> # Load without type - get DictSample for exploration 620 - >>> ds = load_dataset("./data/train.tar", split="train") 621 - >>> for sample in ds.ordered(): 622 - ... print(sample.keys()) # Explore fields 623 - ... print(sample["text"]) # Dict-style access 624 - ... print(sample.label) # Attribute access 625 - >>> 626 - >>> # Convert to typed schema 627 - >>> typed_ds = ds.as_type(TextData) 628 - >>> 629 - >>> # Or load with explicit type directly 630 - >>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train") 631 - >>> 632 - >>> # Load from index with auto-type resolution 633 - >>> index = LocalIndex() 634 - >>> ds = load_dataset("@local/my-dataset", index=index, split="train") 612 + Examples: 613 + >>> # Load without type - get DictSample for exploration 614 + >>> ds = load_dataset("./data/train.tar", split="train") 615 + >>> for sample in ds.ordered(): 616 + ... print(sample.keys()) # Explore fields 617 + ... print(sample["text"]) # Dict-style access 618 + ... 
print(sample.label) # Attribute access 619 + >>> 620 + >>> # Convert to typed schema 621 + >>> typed_ds = ds.as_type(TextData) 622 + >>> 623 + >>> # Or load with explicit type directly 624 + >>> train_ds = load_dataset("./data/train-*.tar", TextData, split="train") 625 + >>> 626 + >>> # Load from index with auto-type resolution 627 + >>> index = LocalIndex() 628 + >>> ds = load_dataset("@local/my-dataset", index=index, split="train") 635 629 """ 636 630 # Handle @handle/dataset indexed path resolution 637 631 if _is_indexed_path(path):
+56 -70
src/atdata/_protocols.py
··· 19 19 AbstractIndex: Protocol for index operations (schemas, datasets, lenses) 20 20 AbstractDataStore: Protocol for data storage operations 21 21 22 - Example: 23 - :: 24 - 25 - >>> def process_datasets(index: AbstractIndex) -> None: 26 - ... for entry in index.list_datasets(): 27 - ... print(f"{entry.name}: {entry.data_urls}") 28 - ... 29 - >>> # Works with either LocalIndex or AtmosphereIndex 30 - >>> process_datasets(local_index) 31 - >>> process_datasets(atmosphere_index) 22 + Examples: 23 + >>> def process_datasets(index: AbstractIndex) -> None: 24 + ... for entry in index.list_datasets(): 25 + ... print(f"{entry.name}: {entry.data_urls}") 26 + ... 27 + >>> # Works with either LocalIndex or AtmosphereIndex 28 + >>> process_datasets(local_index) 29 + >>> process_datasets(atmosphere_index) 32 30 """ 33 31 34 32 from typing import ( ··· 67 65 - Schema publishing (class introspection via dataclass fields) 68 66 - Serialization/deserialization (packed, from_bytes) 69 67 70 - Example: 71 - :: 72 - 73 - >>> @packable 74 - ... class MySample: 75 - ... name: str 76 - ... value: int 77 - ... 78 - >>> def process(sample_type: Type[Packable]) -> None: 79 - ... # Type checker knows sample_type has from_bytes, packed, etc. 80 - ... instance = sample_type.from_bytes(data) 81 - ... print(instance.packed) 68 + Examples: 69 + >>> @packable 70 + ... class MySample: 71 + ... name: str 72 + ... value: int 73 + ... 74 + >>> def process(sample_type: Type[Packable]) -> None: 75 + ... # Type checker knows sample_type has from_bytes, packed, etc. 76 + ... instance = sample_type.from_bytes(data) 77 + ... print(instance.packed) 82 78 """ 83 79 84 80 @classmethod ··· 169 165 - ``data_store``: An AbstractDataStore for reading/writing dataset shards. 170 166 If present, ``load_dataset`` will use it for S3 credential resolution. 171 167 172 - Example: 173 - :: 174 - 175 - >>> def publish_and_list(index: AbstractIndex) -> None: 176 - ... # Publish schemas for different types 177 - ... schema1 = index.publish_schema(ImageSample, version="1.0.0") 178 - ... schema2 = index.publish_schema(TextSample, version="1.0.0") 179 - ... 180 - ... # Insert datasets of different types 181 - ... index.insert_dataset(image_ds, name="images") 182 - ... index.insert_dataset(text_ds, name="texts") 183 - ... 184 - ... # List all datasets (mixed types) 185 - ... for entry in index.list_datasets(): 186 - ... print(f"{entry.name} -> {entry.schema_ref}") 168 + Examples: 169 + >>> def publish_and_list(index: AbstractIndex) -> None: 170 + ... # Publish schemas for different types 171 + ... schema1 = index.publish_schema(ImageSample, version="1.0.0") 172 + ... schema2 = index.publish_schema(TextSample, version="1.0.0") 173 + ... 174 + ... # Insert datasets of different types 175 + ... index.insert_dataset(image_ds, name="images") 176 + ... index.insert_dataset(text_ds, name="texts") 177 + ... 178 + ... # List all datasets (mixed types) 179 + ... for entry in index.list_datasets(): 180 + ... print(f"{entry.name} -> {entry.schema_ref}") 187 181 """ 188 182 189 183 @property ··· 341 335 KeyError: If schema not found. 342 336 ValueError: If schema cannot be decoded (unsupported field types). 343 337 344 - Example: 345 - :: 346 - 347 - >>> entry = index.get_dataset("my-dataset") 348 - >>> SampleType = index.decode_schema(entry.schema_ref) 349 - >>> ds = Dataset[SampleType](entry.data_urls[0]) 350 - >>> for sample in ds.ordered(): 351 - ... 
print(sample) # sample is instance of SampleType 338 + Examples: 339 + >>> entry = index.get_dataset("my-dataset") 340 + >>> SampleType = index.decode_schema(entry.schema_ref) 341 + >>> ds = Dataset[SampleType](entry.data_urls[0]) 342 + >>> for sample in ds.ordered(): 343 + ... print(sample) # sample is instance of SampleType 352 344 """ 353 345 ... 354 346 ··· 368 360 flexible deployment: local index with S3 storage, atmosphere index with 369 361 S3 storage, or atmosphere index with PDS blobs. 370 362 371 - Example: 372 - :: 373 - 374 - >>> store = S3DataStore(credentials, bucket="my-bucket") 375 - >>> urls = store.write_shards(dataset, prefix="training/v1") 376 - >>> print(urls) 377 - ['s3://my-bucket/training/v1/shard-000000.tar', ...] 363 + Examples: 364 + >>> store = S3DataStore(credentials, bucket="my-bucket") 365 + >>> urls = store.write_shards(dataset, prefix="training/v1") 366 + >>> print(urls) 367 + ['s3://my-bucket/training/v1/shard-000000.tar', ...] 378 368 """ 379 369 380 370 def write_shards( ··· 443 433 - ATProto blob streaming 444 434 - Any other source that can provide file-like objects 445 435 446 - Example: 447 - :: 448 - 449 - >>> source = S3Source( 450 - ... bucket="my-bucket", 451 - ... keys=["data-000.tar", "data-001.tar"], 452 - ... endpoint="https://r2.example.com", 453 - ... credentials=creds, 454 - ... ) 455 - >>> ds = Dataset[MySample](source) 456 - >>> for sample in ds.ordered(): 457 - ... print(sample) 436 + Examples: 437 + >>> source = S3Source( 438 + ... bucket="my-bucket", 439 + ... keys=["data-000.tar", "data-001.tar"], 440 + ... endpoint="https://r2.example.com", 441 + ... credentials=creds, 442 + ... ) 443 + >>> ds = Dataset[MySample](source) 444 + >>> for sample in ds.ordered(): 445 + ... print(sample) 458 446 """ 459 447 460 448 @property ··· 467 455 Yields: 468 456 Tuple of (shard_identifier, file_like_stream). 469 457 470 - Example: 471 - :: 472 - 473 - >>> for shard_id, stream in source.shards: 474 - ... print(f"Processing {shard_id}") 475 - ... data = stream.read() 458 + Examples: 459 + >>> for shard_id, stream in source.shards: 460 + ... print(f"Processing {shard_id}") 461 + ... data = stream.read() 476 462 """ 477 463 ... 478 464
+27 -35
src/atdata/_schema_codec.py
··· 9 9 ``atmosphere/_types.py``, with field types supporting primitives, ndarrays, 10 10 arrays, and schema references. 11 11 12 - Example: 13 - :: 14 - 15 - >>> schema = { 16 - ... "name": "ImageSample", 17 - ... "version": "1.0.0", 18 - ... "fields": [ 19 - ... {"name": "image", "fieldType": {"$type": "...#ndarray", "dtype": "float32"}, "optional": False}, 20 - ... {"name": "label", "fieldType": {"$type": "...#primitive", "primitive": "str"}, "optional": False}, 21 - ... ] 22 - ... } 23 - >>> ImageSample = schema_to_type(schema) 24 - >>> sample = ImageSample(image=np.zeros((64, 64)), label="cat") 12 + Examples: 13 + >>> schema = { 14 + ... "name": "ImageSample", 15 + ... "version": "1.0.0", 16 + ... "fields": [ 17 + ... {"name": "image", "fieldType": {"$type": "...#ndarray", "dtype": "float32"}, "optional": False}, 18 + ... {"name": "label", "fieldType": {"$type": "...#primitive", "primitive": "str"}, "optional": False}, 19 + ... ] 20 + ... } 21 + >>> ImageSample = schema_to_type(schema) 22 + >>> sample = ImageSample(image=np.zeros((64, 64)), label="cat") 25 23 """ 26 24 27 25 from dataclasses import field, make_dataclass ··· 151 149 Raises: 152 150 ValueError: If schema is malformed or contains unsupported types. 153 151 154 - Example: 155 - :: 156 - 157 - >>> schema = index.get_schema("local://schemas/MySample@1.0.0") 158 - >>> MySample = schema_to_type(schema) 159 - >>> ds = Dataset[MySample]("data.tar") 160 - >>> for sample in ds.ordered(): 161 - ... print(sample) 152 + Examples: 153 + >>> schema = index.get_schema("local://schemas/MySample@1.0.0") 154 + >>> MySample = schema_to_type(schema) 155 + >>> ds = Dataset[MySample]("data.tar") 156 + >>> for sample in ds.ordered(): 157 + ... print(sample) 162 158 """ 163 159 # Check cache first 164 160 if use_cache: ··· 282 278 Returns: 283 279 String content for a .pyi stub file. 284 280 285 - Example: 286 - :: 287 - 288 - >>> schema = index.get_schema("atdata://local/sampleSchema/MySample@1.0.0") 289 - >>> stub_content = generate_stub(schema.to_dict()) 290 - >>> # Save to a stubs directory configured in your IDE 291 - >>> with open("stubs/my_sample.pyi", "w") as f: 292 - ... f.write(stub_content) 281 + Examples: 282 + >>> schema = index.get_schema("atdata://local/sampleSchema/MySample@1.0.0") 283 + >>> stub_content = generate_stub(schema.to_dict()) 284 + >>> # Save to a stubs directory configured in your IDE 285 + >>> with open("stubs/my_sample.pyi", "w") as f: 286 + ... f.write(stub_content) 293 287 """ 294 288 name = schema.get("name", "UnknownSample") 295 289 version = schema.get("version", "1.0.0") ··· 360 354 Returns: 361 355 String content for a .py module file. 362 356 363 - Example: 364 - :: 365 - 366 - >>> schema = index.get_schema("atdata://local/sampleSchema/MySample@1.0.0") 367 - >>> module_content = generate_module(schema.to_dict()) 368 - >>> # The module can be imported after being saved 357 + Examples: 358 + >>> schema = index.get_schema("atdata://local/sampleSchema/MySample@1.0.0") 359 + >>> module_content = generate_module(schema.to_dict()) 360 + >>> # The module can be imported after being saved 369 361 """ 370 362 name = schema.get("name", "UnknownSample") 371 363 version = schema.get("version", "1.0.0")
+49 -61
src/atdata/_sources.py
··· 13 13 By providing streams directly, we can support private repos, custom 14 14 endpoints, and future backends like ATProto blobs. 15 15 16 - Example: 17 - :: 18 - 19 - >>> # Standard URL (uses WebDataset's gopen) 20 - >>> source = URLSource("https://example.com/data-{000..009}.tar") 21 - >>> ds = Dataset[MySample](source) 22 - >>> 23 - >>> # Private S3 with credentials 24 - >>> source = S3Source( 25 - ... bucket="my-bucket", 26 - ... keys=["train/shard-000.tar", "train/shard-001.tar"], 27 - ... endpoint="https://my-r2.cloudflarestorage.com", 28 - ... access_key="...", 29 - ... secret_key="...", 30 - ... ) 31 - >>> ds = Dataset[MySample](source) 16 + Examples: 17 + >>> # Standard URL (uses WebDataset's gopen) 18 + >>> source = URLSource("https://example.com/data-{000..009}.tar") 19 + >>> ds = Dataset[MySample](source) 20 + >>> 21 + >>> # Private S3 with credentials 22 + >>> source = S3Source( 23 + ... bucket="my-bucket", 24 + ... keys=["train/shard-000.tar", "train/shard-001.tar"], 25 + ... endpoint="https://my-r2.cloudflarestorage.com", 26 + ... access_key="...", 27 + ... secret_key="...", 28 + ... ) 29 + >>> ds = Dataset[MySample](source) 32 30 """ 33 31 34 32 from __future__ import annotations ··· 54 52 Attributes: 55 53 url: URL or brace pattern for the shards. 56 54 57 - Example: 58 - :: 59 - 60 - >>> source = URLSource("https://example.com/train-{000..009}.tar") 61 - >>> for shard_id, stream in source.shards: 62 - ... print(f"Streaming {shard_id}") 55 + Examples: 56 + >>> source = URLSource("https://example.com/train-{000..009}.tar") 57 + >>> for shard_id, stream in source.shards: 58 + ... print(f"Streaming {shard_id}") 63 59 """ 64 60 65 61 url: str ··· 131 127 secret_key: Optional AWS secret access key. 132 128 region: Optional AWS region (defaults to us-east-1). 133 129 134 - Example: 135 - :: 136 - 137 - >>> source = S3Source( 138 - ... bucket="my-datasets", 139 - ... keys=["train/shard-000.tar", "train/shard-001.tar"], 140 - ... endpoint="https://abc123.r2.cloudflarestorage.com", 141 - ... access_key="AKIAIOSFODNN7EXAMPLE", 142 - ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 143 - ... ) 144 - >>> for shard_id, stream in source.shards: 145 - ... process(stream) 130 + Examples: 131 + >>> source = S3Source( 132 + ... bucket="my-datasets", 133 + ... keys=["train/shard-000.tar", "train/shard-001.tar"], 134 + ... endpoint="https://abc123.r2.cloudflarestorage.com", 135 + ... access_key="AKIAIOSFODNN7EXAMPLE", 136 + ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 137 + ... ) 138 + >>> for shard_id, stream in source.shards: 139 + ... process(stream) 146 140 """ 147 141 148 142 bucket: str ··· 258 252 Raises: 259 253 ValueError: If URLs are not valid s3:// URLs or span multiple buckets. 260 254 261 - Example: 262 - :: 263 - 264 - >>> source = S3Source.from_urls( 265 - ... ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"], 266 - ... endpoint="https://r2.example.com", 267 - ... ) 255 + Examples: 256 + >>> source = S3Source.from_urls( 257 + ... ["s3://my-bucket/train-000.tar", "s3://my-bucket/train-001.tar"], 258 + ... endpoint="https://r2.example.com", 259 + ... ) 268 260 """ 269 261 if not urls: 270 262 raise ValueError("urls cannot be empty") ··· 317 309 Returns: 318 310 Configured S3Source. 319 311 320 - Example: 321 - :: 322 - 323 - >>> creds = { 324 - ... "AWS_ACCESS_KEY_ID": "...", 325 - ... "AWS_SECRET_ACCESS_KEY": "...", 326 - ... "AWS_ENDPOINT": "https://r2.example.com", 327 - ... 
} 328 - >>> source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"]) 312 + Examples: 313 + >>> creds = { 314 + ... "AWS_ACCESS_KEY_ID": "...", 315 + ... "AWS_SECRET_ACCESS_KEY": "...", 316 + ... "AWS_ENDPOINT": "https://r2.example.com", 317 + ... } 318 + >>> source = S3Source.from_credentials(creds, "my-bucket", ["data.tar"]) 329 319 """ 330 320 return cls( 331 321 bucket=bucket, ··· 352 342 blob_refs: List of blob reference dicts with 'did' and 'cid' keys. 353 343 pds_endpoint: Optional PDS endpoint URL. If not provided, resolved from DID. 354 344 355 - Example: 356 - :: 357 - 358 - >>> source = BlobSource( 359 - ... blob_refs=[ 360 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 361 - ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 362 - ... ], 363 - ... ) 364 - >>> for shard_id, stream in source.shards: 365 - ... process(stream) 345 + Examples: 346 + >>> source = BlobSource( 347 + ... blob_refs=[ 348 + ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 349 + ... {"did": "did:plc:abc123", "cid": "bafyrei..."}, 350 + ... ], 351 + ... ) 352 + >>> for shard_id, stream in source.shards: 353 + ... process(stream) 366 354 """ 367 355 368 356 blob_refs: list[dict[str, str]]
+18 -22
src/atdata/_stub_manager.py
··· 8 8 can be imported at runtime. This allows ``decode_schema`` to return properly 9 9 typed classes that work with both static type checkers and runtime. 10 10 11 - Example: 12 - :: 13 - 14 - >>> from atdata.local import Index 15 - >>> 16 - >>> # Enable auto-stub generation 17 - >>> index = Index(auto_stubs=True) 18 - >>> 19 - >>> # Modules are generated automatically on decode_schema 20 - >>> MyType = index.decode_schema("atdata://local/sampleSchema/MySample@1.0.0") 21 - >>> # MyType is now properly typed for IDE autocomplete! 22 - >>> 23 - >>> # Get the stub directory path for IDE configuration 24 - >>> print(f"Add to IDE: {index.stub_dir}") 11 + Examples: 12 + >>> from atdata.local import Index 13 + >>> 14 + >>> # Enable auto-stub generation 15 + >>> index = Index(auto_stubs=True) 16 + >>> 17 + >>> # Modules are generated automatically on decode_schema 18 + >>> MyType = index.decode_schema("atdata://local/sampleSchema/MySample@1.0.0") 19 + >>> # MyType is now properly typed for IDE autocomplete! 20 + >>> 21 + >>> # Get the stub directory path for IDE configuration 22 + >>> print(f"Add to IDE: {index.stub_dir}") 25 23 """ 26 24 27 25 from pathlib import Path ··· 101 99 Args: 102 100 stub_dir: Directory to write module files. Defaults to ``~/.atdata/stubs/``. 103 101 104 - Example: 105 - :: 106 - 107 - >>> manager = StubManager() 108 - >>> schema_dict = {"name": "MySample", "version": "1.0.0", "fields": [...]} 109 - >>> SampleClass = manager.ensure_module(schema_dict) 110 - >>> print(manager.stub_dir) 111 - /Users/you/.atdata/stubs 102 + Examples: 103 + >>> manager = StubManager() 104 + >>> schema_dict = {"name": "MySample", "version": "1.0.0", "fields": [...]} 105 + >>> SampleClass = manager.ensure_module(schema_dict) 106 + >>> print(manager.stub_dir) 107 + /Users/you/.atdata/stubs 112 108 """ 113 109 114 110 def __init__(self, stub_dir: Optional[Union[str, Path]] = None):
+19 -23
src/atdata/atmosphere/__init__.py
··· 15 15 to work unchanged. These features are opt-in for users who want to publish 16 16 or discover datasets on the ATProto network. 17 17 18 - Example: 19 - :: 20 - 21 - >>> from atdata.atmosphere import AtmosphereClient, SchemaPublisher 22 - >>> 23 - >>> client = AtmosphereClient() 24 - >>> client.login("handle.bsky.social", "app-password") 25 - >>> 26 - >>> publisher = SchemaPublisher(client) 27 - >>> schema_uri = publisher.publish(MySampleType, version="1.0.0") 18 + Examples: 19 + >>> from atdata.atmosphere import AtmosphereClient, SchemaPublisher 20 + >>> 21 + >>> client = AtmosphereClient() 22 + >>> client.login("handle.bsky.social", "app-password") 23 + >>> 24 + >>> publisher = SchemaPublisher(client) 25 + >>> schema_uri = publisher.publish(MySampleType, version="1.0.0") 28 26 29 27 Note: 30 28 This module requires the ``atproto`` package to be installed:: ··· 106 104 Optionally accepts a ``PDSBlobStore`` for writing dataset shards as 107 105 ATProto blobs, enabling fully decentralized dataset storage. 108 106 109 - Example: 110 - :: 111 - 112 - >>> client = AtmosphereClient() 113 - >>> client.login("handle.bsky.social", "app-password") 114 - >>> 115 - >>> # Without blob storage (external URLs only) 116 - >>> index = AtmosphereIndex(client) 117 - >>> 118 - >>> # With PDS blob storage 119 - >>> store = PDSBlobStore(client) 120 - >>> index = AtmosphereIndex(client, data_store=store) 121 - >>> entry = index.insert_dataset(dataset, name="my-data") 107 + Examples: 108 + >>> client = AtmosphereClient() 109 + >>> client.login("handle.bsky.social", "app-password") 110 + >>> 111 + >>> # Without blob storage (external URLs only) 112 + >>> index = AtmosphereIndex(client) 113 + >>> 114 + >>> # With PDS blob storage 115 + >>> store = PDSBlobStore(client) 116 + >>> index = AtmosphereIndex(client, data_store=store) 117 + >>> entry = index.insert_dataset(dataset, name="my-data") 122 118 """ 123 119 124 120 def __init__(
+8 -10
src/atdata/atmosphere/_types.py
··· 19 19 20 20 AT URIs follow the format: at://<authority>/<collection>/<rkey> 21 21 22 - Example: 23 - :: 24 - 25 - >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 26 - >>> uri.authority 27 - 'did:plc:abc123' 28 - >>> uri.collection 29 - 'ac.foundation.dataset.sampleSchema' 30 - >>> uri.rkey 31 - 'xyz' 22 + Examples: 23 + >>> uri = AtUri.parse("at://did:plc:abc123/ac.foundation.dataset.sampleSchema/xyz") 24 + >>> uri.authority 25 + 'did:plc:abc123' 26 + >>> uri.collection 27 + 'ac.foundation.dataset.sampleSchema' 28 + >>> uri.rkey 29 + 'xyz' 32 30 """ 33 31 34 32 authority: str
+5 -7
src/atdata/atmosphere/client.py
··· 33 33 This class wraps the atproto SDK client and provides higher-level methods 34 34 for working with atdata records (schemas, datasets, lenses). 35 35 36 - Example: 37 - :: 38 - 39 - >>> client = AtmosphereClient() 40 - >>> client.login("alice.bsky.social", "app-password") 41 - >>> print(client.did) 42 - 'did:plc:...' 36 + Examples: 37 + >>> client = AtmosphereClient() 38 + >>> client.login("alice.bsky.social", "app-password") 39 + >>> print(client.did) 40 + 'did:plc:...' 43 41 44 42 Note: 45 43 The password should be an app-specific password, not your main account
+26 -30
src/atdata/atmosphere/lens.py
··· 31 31 This class creates lens records that reference source and target schemas 32 32 and point to the transformation code in a git repository. 33 33 34 - Example: 35 - :: 36 - 37 - >>> @atdata.lens 38 - ... def my_lens(source: SourceType) -> TargetType: 39 - ... return TargetType(field=source.other_field) 40 - >>> 41 - >>> client = AtmosphereClient() 42 - >>> client.login("handle", "password") 43 - >>> 44 - >>> publisher = LensPublisher(client) 45 - >>> uri = publisher.publish( 46 - ... name="my_lens", 47 - ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 48 - ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 49 - ... code_repository="https://github.com/user/repo", 50 - ... code_commit="abc123def456", 51 - ... getter_path="mymodule.lenses:my_lens", 52 - ... putter_path="mymodule.lenses:my_lens_putter", 53 - ... ) 34 + Examples: 35 + >>> @atdata.lens 36 + ... def my_lens(source: SourceType) -> TargetType: 37 + ... return TargetType(field=source.other_field) 38 + >>> 39 + >>> client = AtmosphereClient() 40 + >>> client.login("handle", "password") 41 + >>> 42 + >>> publisher = LensPublisher(client) 43 + >>> uri = publisher.publish( 44 + ... name="my_lens", 45 + ... source_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/source", 46 + ... target_schema_uri="at://did:plc:abc/ac.foundation.dataset.sampleSchema/target", 47 + ... code_repository="https://github.com/user/repo", 48 + ... code_commit="abc123def456", 49 + ... getter_path="mymodule.lenses:my_lens", 50 + ... putter_path="mymodule.lenses:my_lens_putter", 51 + ... ) 54 52 55 53 Security Note: 56 54 Lens code is stored as references to git repositories rather than ··· 195 193 using a lens requires installing the referenced code and importing 196 194 it manually. 197 195 198 - Example: 199 - :: 200 - 201 - >>> client = AtmosphereClient() 202 - >>> loader = LensLoader(client) 203 - >>> 204 - >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 205 - >>> print(record["name"]) 206 - >>> print(record["sourceSchema"]) 207 - >>> print(record.get("getterCode", {}).get("repository")) 196 + Examples: 197 + >>> client = AtmosphereClient() 198 + >>> loader = LensLoader(client) 199 + >>> 200 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.lens/xyz") 201 + >>> print(record["name"]) 202 + >>> print(record["sourceSchema"]) 203 + >>> print(record.get("getterCode", {}).get("repository")) 208 204 """ 209 205 210 206 def __init__(self, client: AtmosphereClient):
+29 -35
src/atdata/atmosphere/records.py
··· 31 31 This class creates dataset records that reference a schema and point to 32 32 external storage (WebDataset URLs) or ATProto blobs. 33 33 34 - Example: 35 - :: 36 - 37 - >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 38 - >>> 39 - >>> client = AtmosphereClient() 40 - >>> client.login("handle", "password") 41 - >>> 42 - >>> publisher = DatasetPublisher(client) 43 - >>> uri = publisher.publish( 44 - ... dataset, 45 - ... name="My Training Data", 46 - ... description="Training data for my model", 47 - ... tags=["computer-vision", "training"], 48 - ... ) 34 + Examples: 35 + >>> dataset = atdata.Dataset[MySample]("s3://bucket/data-{000000..000009}.tar") 36 + >>> 37 + >>> client = AtmosphereClient() 38 + >>> client.login("handle", "password") 39 + >>> 40 + >>> publisher = DatasetPublisher(client) 41 + >>> uri = publisher.publish( 42 + ... dataset, 43 + ... name="My Training Data", 44 + ... description="Training data for my model", 45 + ... tags=["computer-vision", "training"], 46 + ... ) 49 47 """ 50 48 51 49 def __init__(self, client: AtmosphereClient): ··· 267 265 from them. Note that loading a dataset requires having the corresponding 268 266 Python class for the sample type. 269 267 270 - Example: 271 - :: 272 - 273 - >>> client = AtmosphereClient() 274 - >>> loader = DatasetLoader(client) 275 - >>> 276 - >>> # List available datasets 277 - >>> datasets = loader.list() 278 - >>> for ds in datasets: 279 - ... print(ds["name"], ds["schemaRef"]) 280 - >>> 281 - >>> # Get a specific dataset record 282 - >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz") 268 + Examples: 269 + >>> client = AtmosphereClient() 270 + >>> loader = DatasetLoader(client) 271 + >>> 272 + >>> # List available datasets 273 + >>> datasets = loader.list() 274 + >>> for ds in datasets: 275 + ... print(ds["name"], ds["schemaRef"]) 276 + >>> 277 + >>> # Get a specific dataset record 278 + >>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz") 283 279 """ 284 280 285 281 def __init__(self, client: AtmosphereClient): ··· 478 474 Raises: 479 475 ValueError: If no storage URLs can be resolved. 480 476 481 - Example: 482 - :: 483 - 484 - >>> loader = DatasetLoader(client) 485 - >>> dataset = loader.to_dataset(uri, MySampleType) 486 - >>> for batch in dataset.shuffled(batch_size=32): 487 - ... process(batch) 477 + Examples: 478 + >>> loader = DatasetLoader(client) 479 + >>> dataset = loader.to_dataset(uri, MySampleType) 480 + >>> for batch in dataset.shuffled(batch_size=32): 481 + ... process(batch) 488 482 """ 489 483 # Import here to avoid circular import 490 484 from ..dataset import Dataset
+21 -25
src/atdata/atmosphere/schema.py
··· 37 37 This class introspects a PackableSample class to extract its field 38 38 definitions and publishes them as an ATProto schema record. 39 39 40 - Example: 41 - :: 42 - 43 - >>> @atdata.packable 44 - ... class MySample: 45 - ... image: NDArray 46 - ... label: str 47 - ... 48 - >>> client = AtmosphereClient() 49 - >>> client.login("handle", "password") 50 - >>> 51 - >>> publisher = SchemaPublisher(client) 52 - >>> uri = publisher.publish(MySample, version="1.0.0") 53 - >>> print(uri) 54 - at://did:plc:.../ac.foundation.dataset.sampleSchema/... 40 + Examples: 41 + >>> @atdata.packable 42 + ... class MySample: 43 + ... image: NDArray 44 + ... label: str 45 + ... 46 + >>> client = AtmosphereClient() 47 + >>> client.login("handle", "password") 48 + >>> 49 + >>> publisher = SchemaPublisher(client) 50 + >>> uri = publisher.publish(MySample, version="1.0.0") 51 + >>> print(uri) 52 + at://did:plc:.../ac.foundation.dataset.sampleSchema/... 55 53 """ 56 54 57 55 def __init__(self, client: AtmosphereClient): ··· 178 176 This class fetches schema records from ATProto and can list available 179 177 schemas from a repository. 180 178 181 - Example: 182 - :: 183 - 184 - >>> client = AtmosphereClient() 185 - >>> client.login("handle", "password") 186 - >>> 187 - >>> loader = SchemaLoader(client) 188 - >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 189 - >>> print(schema["name"]) 190 - 'MySample' 179 + Examples: 180 + >>> client = AtmosphereClient() 181 + >>> client.login("handle", "password") 182 + >>> 183 + >>> loader = SchemaLoader(client) 184 + >>> schema = loader.get("at://did:plc:.../ac.foundation.dataset.sampleSchema/...") 185 + >>> print(schema["name"]) 186 + 'MySample' 191 187 """ 192 188 193 189 def __init__(self, client: AtmosphereClient):
+15 -19
src/atdata/atmosphere/store.py
··· 6 6 This enables fully decentralized dataset storage where both metadata (records) 7 7 and data (blobs) live on the AT Protocol network. 8 8 9 - Example: 10 - :: 11 - 12 - >>> from atdata.atmosphere import AtmosphereClient, PDSBlobStore 13 - >>> 14 - >>> client = AtmosphereClient() 15 - >>> client.login("handle.bsky.social", "app-password") 16 - >>> 17 - >>> store = PDSBlobStore(client) 18 - >>> urls = store.write_shards(dataset, prefix="mnist/v1") 19 - >>> print(urls) 20 - ['at://did:plc:.../blob/bafyrei...', ...] 9 + Examples: 10 + >>> from atdata.atmosphere import AtmosphereClient, PDSBlobStore 11 + >>> 12 + >>> client = AtmosphereClient() 13 + >>> client.login("handle.bsky.social", "app-password") 14 + >>> 15 + >>> store = PDSBlobStore(client) 16 + >>> urls = store.write_shards(dataset, prefix="mnist/v1") 17 + >>> print(urls) 18 + ['at://did:plc:.../blob/bafyrei...', ...] 21 19 """ 22 20 23 21 from __future__ import annotations ··· 48 46 Attributes: 49 47 client: Authenticated AtmosphereClient instance. 50 48 51 - Example: 52 - :: 53 - 54 - >>> store = PDSBlobStore(client) 55 - >>> urls = store.write_shards(dataset, prefix="training/v1") 56 - >>> # Returns AT URIs like: 57 - >>> # ['at://did:plc:abc/blob/bafyrei...', ...] 49 + Examples: 50 + >>> store = PDSBlobStore(client) 51 + >>> urls = store.write_shards(dataset, prefix="training/v1") 52 + >>> # Returns AT URIs like: 53 + >>> # ['at://did:plc:abc/blob/bafyrei...', ...] 58 54 """ 59 55 60 56 client: "AtmosphereClient"
+61 -77
src/atdata/dataset.py
··· 13 13 during serialization, enabling efficient storage of numerical data in WebDataset 14 14 archives. 15 15 16 - Example: 17 - :: 18 - 19 - >>> @packable 20 - ... class ImageSample: 21 - ... image: NDArray 22 - ... label: str 23 - ... 24 - >>> ds = Dataset[ImageSample]("data-{000000..000009}.tar") 25 - >>> for batch in ds.shuffled(batch_size=32): 26 - ... images = batch.image # Stacked numpy array (32, H, W, C) 27 - ... labels = batch.label # List of 32 strings 16 + Examples: 17 + >>> @packable 18 + ... class ImageSample: 19 + ... image: NDArray 20 + ... label: str 21 + ... 22 + >>> ds = Dataset[ImageSample]("data-{000000..000009}.tar") 23 + >>> for batch in ds.shuffled(batch_size=32): 24 + ... images = batch.image # Stacked numpy array (32, H, W, C) 25 + ... labels = batch.label # List of 32 strings 28 26 """ 29 27 30 28 ## ··· 126 124 ``@packable``-decorated class. Every ``@packable`` class automatically 127 125 registers a lens from ``DictSample``, making this conversion seamless. 128 126 129 - Example: 130 - :: 131 - 132 - >>> ds = load_dataset("path/to/data.tar") # Returns Dataset[DictSample] 133 - >>> for sample in ds.ordered(): 134 - ... print(sample.some_field) # Attribute access 135 - ... print(sample["other_field"]) # Dict access 136 - ... print(sample.keys()) # Inspect available fields 137 - ... 138 - >>> # Convert to typed schema 139 - >>> typed_ds = ds.as_type(MyTypedSample) 127 + Examples: 128 + >>> ds = load_dataset("path/to/data.tar") # Returns Dataset[DictSample] 129 + >>> for sample in ds.ordered(): 130 + ... print(sample.some_field) # Attribute access 131 + ... print(sample["other_field"]) # Dict access 132 + ... print(sample.keys()) # Inspect available fields 133 + ... 134 + >>> # Convert to typed schema 135 + >>> typed_ds = ds.as_type(MyTypedSample) 140 136 141 137 Note: 142 138 NDArray fields are stored as raw bytes in DictSample. They are only ··· 289 285 1. Direct inheritance with the ``@dataclass`` decorator 290 286 2. Using the ``@packable`` decorator (recommended) 291 287 292 - Example: 293 - :: 294 - 295 - >>> @packable 296 - ... class MyData: 297 - ... name: str 298 - ... embeddings: NDArray 299 - ... 300 - >>> sample = MyData(name="test", embeddings=np.array([1.0, 2.0])) 301 - >>> packed = sample.packed # Serialize to bytes 302 - >>> restored = MyData.from_bytes(packed) # Deserialize 288 + Examples: 289 + >>> @packable 290 + ... class MyData: 291 + ... name: str 292 + ... embeddings: NDArray 293 + ... 294 + >>> sample = MyData(name="test", embeddings=np.array([1.0, 2.0])) 295 + >>> packed = sample.packed # Serialize to bytes 296 + >>> restored = MyData.from_bytes(packed) # Deserialize 303 297 """ 304 298 305 299 def _ensure_good( self ): ··· 430 424 Attributes: 431 425 samples: The list of sample instances in this batch. 432 426 433 - Example: 434 - :: 435 - 436 - >>> batch = SampleBatch[MyData]([sample1, sample2, sample3]) 437 - >>> batch.embeddings # Returns stacked numpy array of shape (3, ...) 438 - >>> batch.names # Returns list of names 427 + Examples: 428 + >>> batch = SampleBatch[MyData]([sample1, sample2, sample3]) 429 + >>> batch.embeddings # Returns stacked numpy array of shape (3, ...) 430 + >>> batch.names # Returns list of names 439 431 440 432 Note: 441 433 This class uses Python's ``__orig_class__`` mechanism to extract the ··· 557 549 Attributes: 558 550 url: WebDataset brace-notation URL for the tar file(s). 
559 551 560 - Example: 561 - :: 562 - 563 - >>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar") 564 - >>> for sample in ds.ordered(batch_size=32): 565 - ... # sample is SampleBatch[MyData] with batch_size samples 566 - ... embeddings = sample.embeddings # shape: (32, ...) 567 - ... 568 - >>> # Transform to a different view 569 - >>> ds_view = ds.as_type(MyDataView) 552 + Examples: 553 + >>> ds = Dataset[MyData]("path/to/data-{000000..000009}.tar") 554 + >>> for sample in ds.ordered(batch_size=32): 555 + ... # sample is SampleBatch[MyData] with batch_size samples 556 + ... embeddings = sample.embeddings # shape: (32, ...) 557 + ... 558 + >>> # Transform to a different view 559 + >>> ds_view = ds.as_type(MyDataView) 570 560 571 561 Note: 572 562 This class uses Python's ``__orig_class__`` mechanism to extract the ··· 679 669 Yields: 680 670 Shard identifiers (e.g., 'train-000000.tar', 'train-000001.tar'). 681 671 682 - Example: 683 - :: 684 - 685 - >>> for shard in ds.shards: 686 - ... print(f"Processing {shard}") 672 + Examples: 673 + >>> for shard in ds.shards: 674 + ... print(f"Processing {shard}") 687 675 """ 688 676 return iter(self._source.list_shards()) 689 677 ··· 851 839 This creates multiple parquet files: ``output-000000.parquet``, 852 840 ``output-000001.parquet``, etc. 853 841 854 - Example: 855 - :: 856 - 857 - >>> ds = Dataset[MySample]("data.tar") 858 - >>> # Small dataset - load all at once 859 - >>> ds.to_parquet("output.parquet") 860 - >>> 861 - >>> # Large dataset - process in chunks 862 - >>> ds.to_parquet("output.parquet", maxcount=50000) 842 + Examples: 843 + >>> ds = Dataset[MySample]("data.tar") 844 + >>> # Small dataset - load all at once 845 + >>> ds.to_parquet("output.parquet") 846 + >>> 847 + >>> # Large dataset - process in chunks 848 + >>> ds.to_parquet("output.parquet", maxcount=50000) 863 849 """ 864 850 ## 865 851 ··· 984 970 ``Packable`` protocol and can be used with ``Type[Packable]`` signatures. 985 971 986 972 Examples: 987 - This is a test of the functionality:: 988 - 989 - @packable 990 - class MyData: 991 - name: str 992 - values: NDArray 993 - 994 - sample = MyData(name="test", values=np.array([1, 2, 3])) 995 - bytes_data = sample.packed 996 - restored = MyData.from_bytes(bytes_data) 997 - 998 - # Works with Packable-typed APIs 999 - index.publish_schema(MyData, version="1.0.0") # Type-safe 973 + >>> @packable 974 + ... class MyData: 975 + ... name: str 976 + ... values: NDArray 977 + ... 978 + >>> sample = MyData(name="test", values=np.array([1, 2, 3])) 979 + >>> bytes_data = sample.packed 980 + >>> restored = MyData.from_bytes(bytes_data) 981 + >>> 982 + >>> # Works with Packable-typed APIs 983 + >>> index.publish_schema(MyData, version="1.0.0") # Type-safe 1000 984 """ 1001 985 1002 986 ##
+42 -50
src/atdata/lens.py
··· 14 14 Lenses support the functional programming concept of composable, well-behaved 15 15 transformations that satisfy lens laws (GetPut and PutGet). 16 16 17 - Example: 18 - :: 19 - 20 - >>> @packable 21 - ... class FullData: 22 - ... name: str 23 - ... age: int 24 - ... embedding: NDArray 25 - ... 26 - >>> @packable 27 - ... class NameOnly: 28 - ... name: str 29 - ... 30 - >>> @lens 31 - ... def name_view(full: FullData) -> NameOnly: 32 - ... return NameOnly(name=full.name) 33 - ... 34 - >>> @name_view.putter 35 - ... def name_view_put(view: NameOnly, source: FullData) -> FullData: 36 - ... return FullData(name=view.name, age=source.age, 37 - ... embedding=source.embedding) 38 - ... 39 - >>> ds = Dataset[FullData]("data.tar") 40 - >>> ds_names = ds.as_type(NameOnly) # Uses registered lens 17 + Examples: 18 + >>> @packable 19 + ... class FullData: 20 + ... name: str 21 + ... age: int 22 + ... embedding: NDArray 23 + ... 24 + >>> @packable 25 + ... class NameOnly: 26 + ... name: str 27 + ... 28 + >>> @lens 29 + ... def name_view(full: FullData) -> NameOnly: 30 + ... return NameOnly(name=full.name) 31 + ... 32 + >>> @name_view.putter 33 + ... def name_view_put(view: NameOnly, source: FullData) -> FullData: 34 + ... return FullData(name=view.name, age=source.age, 35 + ... embedding=source.embedding) 36 + ... 37 + >>> ds = Dataset[FullData]("data.tar") 38 + >>> ds_names = ds.as_type(NameOnly) # Uses registered lens 41 39 """ 42 40 43 41 ## ··· 92 90 S: The source type, must derive from ``PackableSample``. 93 91 V: The view type, must derive from ``PackableSample``. 94 92 95 - Example: 96 - :: 97 - 98 - >>> @lens 99 - ... def name_lens(full: FullData) -> NameOnly: 100 - ... return NameOnly(name=full.name) 101 - ... 102 - >>> @name_lens.putter 103 - ... def name_lens_put(view: NameOnly, source: FullData) -> FullData: 104 - ... return FullData(name=view.name, age=source.age) 93 + Examples: 94 + >>> @lens 95 + ... def name_lens(full: FullData) -> NameOnly: 96 + ... return NameOnly(name=full.name) 97 + ... 98 + >>> @name_lens.putter 99 + ... def name_lens_put(view: NameOnly, source: FullData) -> FullData: 100 + ... return FullData(name=view.name, age=source.age) 105 101 """ 106 102 # TODO The above has a line for "Parameters:" that should be "Type Parameters:"; this is a temporary fix for `quartodoc` auto-generation bugs. 107 103 ··· 163 159 Returns: 164 160 The putter function, allowing this to be used as a decorator. 165 161 166 - Example: 167 - :: 168 - 169 - >>> @my_lens.putter 170 - ... def my_lens_put(view: ViewType, source: SourceType) -> SourceType: 171 - ... return SourceType(...) 162 + Examples: 163 + >>> @my_lens.putter 164 + ... def my_lens_put(view: ViewType, source: SourceType) -> SourceType: 165 + ... return SourceType(field=view.field, other=source.other) 172 166 """ 173 167 ## 174 168 self._putter = put ··· 218 212 A ``Lens[S, V]`` object that can be called to apply the transformation 219 213 or decorated with ``@lens_name.putter`` to add a putter function. 220 214 221 - Example: 222 - :: 223 - 224 - >>> @lens 225 - ... def extract_name(full: FullData) -> NameOnly: 226 - ... return NameOnly(name=full.name) 227 - ... 228 - >>> @extract_name.putter 229 - ... def extract_name_put(view: NameOnly, source: FullData) -> FullData: 230 - ... return FullData(name=view.name, age=source.age) 215 + Examples: 216 + >>> @lens 217 + ... def extract_name(full: FullData) -> NameOnly: 218 + ... return NameOnly(name=full.name) 219 + ... 220 + >>> @extract_name.putter 221 + ... 
def extract_name_put(view: NameOnly, source: FullData) -> FullData: 222 + ... return FullData(name=view.name, age=source.age) 231 223 """ 232 224 ret = Lens[S, V]( f ) 233 225 _network.register( ret )
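The module docstring above notes that lenses should satisfy the GetPut and PutGet laws. Below is a minimal sketch of checking those laws for the `name_view` lens, assuming `FullData`, `NameOnly`, `name_view`, and `name_view_put` are defined as in that docstring and that a `@lens`-decorated object is callable as its getter:

```python
# Illustrative lens-law check; every name here comes from the example in the
# module docstring above and is assumed to be defined the same way.
import numpy as np

full = FullData(name="ada", age=36, embedding=np.zeros(4))

# GetPut: putting back the view you just got leaves the source fields intact.
restored = name_view_put(name_view(full), full)
assert restored.name == full.name and restored.age == full.age

# PutGet: getting from an updated source returns the view that was put in.
view = NameOnly(name="grace")
assert name_view(name_view_put(view, full)).name == view.name
```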
+31 -41
src/atdata/local.py
··· 84 84 loaded schema types. After calling ``index.load_schema(uri)``, the 85 85 schema's class becomes available as an attribute on this namespace. 86 86 87 - Example: 88 - :: 89 - 90 - >>> index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 91 - >>> MyType = index.types.MySample 92 - >>> sample = MyType(field1="hello", field2=42) 87 + Examples: 88 + >>> index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 89 + >>> MyType = index.types.MySample 90 + >>> sample = MyType(field1="hello", field2=42) 93 91 94 92 The namespace supports: 95 93 - Attribute access: ``index.types.MySample`` ··· 1027 1025 After calling :meth:`load_schema`, schema types become available 1028 1026 as attributes on this namespace. 1029 1027 1030 - Example: 1031 - :: 1032 - 1033 - >>> index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 1034 - >>> MyType = index.types.MySample 1035 - >>> sample = MyType(name="hello", value=42) 1028 + Examples: 1029 + >>> index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 1030 + >>> MyType = index.types.MySample 1031 + >>> sample = MyType(name="hello", value=42) 1036 1032 1037 1033 Returns: 1038 1034 SchemaNamespace containing all loaded schema types. ··· 1058 1054 KeyError: If schema not found. 1059 1055 ValueError: If schema cannot be decoded. 1060 1056 1061 - Example: 1062 - :: 1063 - 1064 - >>> # Load and use immediately 1065 - >>> MyType = index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 1066 - >>> sample = MyType(name="hello", value=42) 1067 - >>> 1068 - >>> # Or access later via namespace 1069 - >>> index.load_schema("atdata://local/sampleSchema/OtherType@1.0.0") 1070 - >>> other = index.types.OtherType(data="test") 1057 + Examples: 1058 + >>> # Load and use immediately 1059 + >>> MyType = index.load_schema("atdata://local/sampleSchema/MySample@1.0.0") 1060 + >>> sample = MyType(name="hello", value=42) 1061 + >>> 1062 + >>> # Or access later via namespace 1063 + >>> index.load_schema("atdata://local/sampleSchema/OtherType@1.0.0") 1064 + >>> other = index.types.OtherType(data="test") 1071 1065 """ 1072 1066 # Decode the schema (uses generated module if auto_stubs enabled) 1073 1067 cls = self.decode_schema(ref) ··· 1090 1084 Import path like "local.MySample_1_0_0", or None if auto_stubs 1091 1085 is disabled. 1092 1086 1093 - Example: 1094 - :: 1095 - 1096 - >>> index = LocalIndex(auto_stubs=True) 1097 - >>> ref = index.publish_schema(MySample, version="1.0.0") 1098 - >>> index.load_schema(ref) 1099 - >>> print(index.get_import_path(ref)) 1100 - local.MySample_1_0_0 1101 - >>> # Then in your code: 1102 - >>> # from local.MySample_1_0_0 import MySample 1087 + Examples: 1088 + >>> index = LocalIndex(auto_stubs=True) 1089 + >>> ref = index.publish_schema(MySample, version="1.0.0") 1090 + >>> index.load_schema(ref) 1091 + >>> print(index.get_import_path(ref)) 1092 + local.MySample_1_0_0 1093 + >>> # Then in your code: 1094 + >>> # from local.MySample_1_0_0 import MySample 1103 1095 """ 1104 1096 if self._stub_manager is None: 1105 1097 return None ··· 1551 1543 Returns: 1552 1544 The decoded type, cast to match the type_hint for IDE support. 1553 1545 1554 - Example: 1555 - :: 1556 - 1557 - >>> # After enabling auto_stubs and configuring IDE extraPaths: 1558 - >>> from local.MySample_1_0_0 import MySample 1559 - >>> 1560 - >>> # This gives full IDE autocomplete: 1561 - >>> DecodedType = index.decode_schema_as(ref, MySample) 1562 - >>> sample = DecodedType(text="hello", value=42) # IDE knows signature! 
1546 + Examples: 1547 + >>> # After enabling auto_stubs and configuring IDE extraPaths: 1548 + >>> from local.MySample_1_0_0 import MySample 1549 + >>> 1550 + >>> # This gives full IDE autocomplete: 1551 + >>> DecodedType = index.decode_schema_as(ref, MySample) 1552 + >>> sample = DecodedType(text="hello", value=42) # IDE knows signature! 1563 1553 1564 1554 Note: 1565 1555 The type_hint is only used for static type checking - at runtime,
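Taken together, the `LocalIndex` hunks above describe a publish / load / namespace round trip. A short sketch that strings those snippets together, assuming `MySample` is the same illustrative `@packable` class the docstrings use (with `name` and `value` fields):

```python
# Round trip assembled from the docstring snippets above (illustrative only).
index = LocalIndex(auto_stubs=True)
ref = index.publish_schema(MySample, version="1.0.0")

Loaded = index.load_schema(ref)        # returned class is usable immediately
sample = Loaded(name="hello", value=42)

# The same type is also reachable via the namespace, and auto_stubs exposes a
# generated import path for IDE autocomplete.
print(index.types.MySample)
print(index.get_import_path(ref))      # e.g. local.MySample_1_0_0
```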
+18 -22
src/atdata/promote.py
··· 4 4 ATProto atmosphere network. This enables sharing datasets with the broader 5 5 federation while maintaining schema consistency. 6 6 7 - Example: 8 - :: 9 - 10 - >>> from atdata.local import LocalIndex, Repo 11 - >>> from atdata.atmosphere import AtmosphereClient, AtmosphereIndex 12 - >>> from atdata.promote import promote_to_atmosphere 13 - >>> 14 - >>> # Setup 15 - >>> local_index = LocalIndex() 16 - >>> client = AtmosphereClient() 17 - >>> client.login("handle.bsky.social", "app-password") 18 - >>> 19 - >>> # Promote a dataset 20 - >>> entry = local_index.get_dataset("my-dataset") 21 - >>> at_uri = promote_to_atmosphere(entry, local_index, client) 7 + Examples: 8 + >>> from atdata.local import LocalIndex, Repo 9 + >>> from atdata.atmosphere import AtmosphereClient, AtmosphereIndex 10 + >>> from atdata.promote import promote_to_atmosphere 11 + >>> 12 + >>> # Setup 13 + >>> local_index = LocalIndex() 14 + >>> client = AtmosphereClient() 15 + >>> client.login("handle.bsky.social", "app-password") 16 + >>> 17 + >>> # Promote a dataset 18 + >>> entry = local_index.get_dataset("my-dataset") 19 + >>> at_uri = promote_to_atmosphere(entry, local_index, client) 22 20 """ 23 21 24 22 from typing import TYPE_CHECKING, Type ··· 128 126 KeyError: If schema not found in local index. 129 127 ValueError: If local entry has no data URLs. 130 128 131 - Example: 132 - :: 133 - 134 - >>> entry = local_index.get_dataset("mnist-train") 135 - >>> uri = promote_to_atmosphere(entry, local_index, client) 136 - >>> print(uri) 137 - at://did:plc:abc123/ac.foundation.dataset.datasetIndex/... 129 + Examples: 130 + >>> entry = local_index.get_dataset("mnist-train") 131 + >>> uri = promote_to_atmosphere(entry, local_index, client) 132 + >>> print(uri) 133 + at://did:plc:abc123/ac.foundation.dataset.datasetIndex/... 138 134 """ 139 135 from .atmosphere import DatasetPublisher 140 136 from ._schema_codec import schema_to_type
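The promote docstrings above document two failure modes: `KeyError` when the schema is missing from the local index and `ValueError` when the entry has no data URLs. A hedged sketch of the same flow with those cases handled; the handle, app password, and dataset name are placeholders taken from the examples above:

```python
# Promotion flow with the documented failure modes handled explicitly.
from atdata.local import LocalIndex
from atdata.atmosphere import AtmosphereClient
from atdata.promote import promote_to_atmosphere

local_index = LocalIndex()
client = AtmosphereClient()
client.login("handle.bsky.social", "app-password")

entry = local_index.get_dataset("mnist-train")
try:
    at_uri = promote_to_atmosphere(entry, local_index, client)
except KeyError:
    print("schema for this dataset is not in the local index")
except ValueError:
    print("dataset entry has no data URLs to publish")
else:
    print(f"published: {at_uri}")
```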