A loose federation of distributed, typed datasets
docs: add blob storage documentation to atmosphere.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+45
+1
CHANGELOG.md
···
11 11   ### Fixed
12 12
13 13   ### Changed
14 +    - Add blob storage documentation to atmosphere.md (#223)
14 15   - Add shared sample type definitions to conftest.py (#219)
15 16   - Add blob operation tests for DatasetLoader and DatasetPublisher (#220)
16 17   - Trim verbose docstrings on internal helper functions (#222)
+44
docs/atmosphere.md
···
164 164   )
165 165   ```
166 166
167 +     #### Blob Storage
168 +
169 +     For smaller datasets (up to ~50MB per shard), you can store data directly in ATProto blobs instead of external URLs:
170 +
171 +     ```python
172 +     import io
173 +     import webdataset as wds
174 +
175 +     # Create tar data in memory
176 +     tar_buffer = io.BytesIO()
177 +     with wds.writer.TarWriter(tar_buffer) as sink:
178 +         for i, sample in enumerate(samples):
179 +             sink.write({**sample.as_wds, "__key__": f"{i:06d}"})
180 +
181 +     # Publish with blob storage
182 +     uri = publisher.publish_with_blobs(
183 +         blobs=[tar_buffer.getvalue()],
184 +         schema_uri=schema_uri,
185 +         name="small-dataset",
186 +         description="Dataset stored in ATProto blobs",
187 +         tags=["small", "demo"],
188 +     )
189 +     ```
190 +
191 +     To load datasets with blob storage:
192 +
193 +     ```python
194 +     from atdata.atmosphere import DatasetLoader
195 +
196 +     loader = DatasetLoader(client)
197 +
198 +     # Check storage type
199 +     storage_type = loader.get_storage_type(uri)  # "external" or "blobs"
200 +
201 +     if storage_type == "blobs":
202 +         # Get blob URLs for direct access
203 +         blob_urls = loader.get_blob_urls(uri)
204 +
205 +     # to_dataset() handles both storage types automatically
206 +     dataset = loader.to_dataset(uri, MySample)
207 +     for batch in dataset.ordered(batch_size=32):
208 +         process(batch)
209 +     ```
210 +
167 211   ### LensPublisher
168 212
169 213   ```python
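
The added docs recommend blob storage only for shards up to roughly 50MB, but do not show how a caller might check that before publishing. The sketch below is one possible pre-flight check. It is not part of atdata: `build_tar_shard`, `fits_in_blob`, and `BLOB_SHARD_LIMIT_BYTES` are hypothetical names, the limit is assumed to mean 50 × 10^6 bytes, and the tar archive is built with the standard-library `tarfile` module rather than webdataset's `TarWriter`, so that it runs without extra dependencies.

```python
import io
import tarfile

# Assumption: the "~50MB per shard" guidance is read as 50 * 10**6 bytes.
BLOB_SHARD_LIMIT_BYTES = 50 * 1000 * 1000


def build_tar_shard(samples):
    """Pack (key, extension, payload) triples into an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, ext, payload in samples:
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    # Closing the TarFile leaves a caller-supplied buffer open, so this is safe.
    return buffer.getvalue()


def fits_in_blob(shard_bytes, limit=BLOB_SHARD_LIMIT_BYTES):
    """Return True when a shard is small enough for the blob-storage path."""
    return len(shard_bytes) <= limit


# Three tiny text samples keyed the same way as in the documented example.
samples = [(f"{i:06d}", "txt", f"sample {i}".encode()) for i in range(3)]
shard = build_tar_shard(samples)

# Pick the storage path based on the shard size.
storage = "blobs" if fits_in_blob(shard) else "external"
print(storage)
```

A shard that passes the check could then be handed to `publisher.publish_with_blobs(blobs=[shard], ...)` as in the documented example. Larger shards would fall back to the external-URL path described earlier in atmosphere.md.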