A loose federation of distributed, typed datasets
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

feat(api): add Packable protocol and standardize collection naming (xs/list_xs pattern)

Introduce Packable protocol in _protocols.py enabling type-safe APIs for
both PackableSample subclasses and @packable-decorated classes. Update
AbstractIndex, DataSource, and Dataset to use consistent naming pattern:
property `xs` for lazy iteration, method `list_xs()` for materialized lists.

Key changes:
- Packable protocol with from_data, from_bytes, packed, as_wds methods
- AbstractIndex: datasets/list_datasets(), schemas/list_schemas()
- DataSource/Dataset: shards property, list_shards() method
- Index: entries/list_entries() for local index
- Legacy aliases (shard_list, all_entries) preserved for backwards compat
- Update README with DictSample docs and load_dataset examples
- Add data_store to AbstractIndex for S3 credential resolution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+302 -109
.chainlink/issues.db

This is a binary file and will not be displayed.

+19
CHANGELOG.md
··· 25 25 - **Comprehensive integration test suite**: 593 tests covering E2E flows, error handling, edge cases 26 26 27 27 ### Changed 28 + - Review and address human-review.md feedback (#344) 29 + - Fix load_dataset overloads and AbstractIndex compatibility (#348) 30 + - Connect load_dataset to index data_store for S3 credentials (#361) 31 + - Fix load_dataset overload return types for DictSample (#360) 32 + - Add data_store to AbstractIndex protocol (#359) 33 + - Audit and fix xs/list_xs naming convention (#347) 34 + - Fix AtmosphereIndex: list_datasets/list_schemas return types (#357) 35 + - Refactor DataSource/Dataset: shards()/shard_list -> shards/list_shards() (#356) 36 + - Refactor local.py: entries/all_entries -> entries/list_entries (#355) 37 + - Update AbstractIndex protocol to match new naming convention (#358) 38 + - Investigate Redis index entry removal issue (#346) 39 + - Implement 'atdata diagnose' command for Redis health check (#354) 40 + - Implement 'atdata local up' command to run Redis + MinIO (#353) 41 + - Create atdata.cli module with entry point (#352) 42 + - Evaluate PackableSample → Packable protocol migration (#345) 43 + - Update publish_schema and other signatures to use Packable protocol (#351) 44 + - Update @packable decorator return type annotation (#350) 45 + - Define Packable protocol in _protocols.py (#349) 46 + - Review and update README for v0.2.2 release (#343) 28 47 - Streamline Dataset API with DictSample default type (#338) 29 48 - Add tests for DictSample and new API (#342) 30 49 - Update load_dataset default type to DictSample (#341)
+69 -18
README.md
··· 9 9 ## Features 10 10 11 11 - **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization 12 + - **Schema-free Exploration** - Load datasets without defining a schema first using `DictSample` 12 13 - **Lens Transformations** - Bidirectional, composable transformations between different dataset views 13 14 - **Automatic Batching** - Smart batch aggregation with numpy array stacking 14 15 - **WebDataset Integration** - Efficient storage and streaming for large-scale datasets ··· 26 27 27 28 ## Quick Start 28 29 29 - ### Defining Sample Types 30 + ### Loading Datasets 30 31 31 - Use the `@packable` decorator to create typed dataset samples: 32 + The primary way to load datasets is with `load_dataset()`: 33 + 34 + ```python 35 + from atdata import load_dataset 36 + 37 + # Load without specifying a type - returns Dataset[DictSample] 38 + ds = load_dataset("path/to/data.tar", split="train") 39 + 40 + # Explore the data 41 + for sample in ds.ordered(): 42 + print(sample.keys()) # See available fields 43 + print(sample["text"]) # Dict-style access 44 + print(sample.label) # Attribute access 45 + break 46 + ``` 47 + 48 + ### Defining Typed Schemas 49 + 50 + Once you understand your data, define a typed schema with `@packable`: 32 51 33 52 ```python 34 53 import atdata ··· 41 60 metadata: dict 42 61 ``` 43 62 44 - ### Creating Datasets 63 + ### Loading with Types 45 64 46 65 ```python 47 - # Create a dataset 48 - dataset = atdata.Dataset[ImageSample]("path/to/data-{000000..000009}.tar") 66 + # Load with explicit type 67 + ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train") 49 68 50 - # Iterate over samples in order 51 - for sample in dataset.ordered(batch_size=None): 69 + # Or convert from DictSample 70 + ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample) 71 + 72 + # Iterate over samples 73 + for sample in ds.ordered(): 52 74 print(f"Label: {sample.label}, Image shape: {sample.image.shape}") 53 75 54 76 # Iterate with shuffling and batching 55 - for batch in dataset.shuffled(batch_size=32): 77 + for batch in ds.shuffled(batch_size=32): 56 78 # batch.image is automatically stacked into shape (32, ...) 57 79 # batch.label is a list of 32 labels 58 80 process_batch(batch.image, batch.label) ··· 83 105 84 106 ## Core Concepts 85 107 108 + ### DictSample 109 + 110 + The default sample type for schema-free exploration. Provides both attribute and dict-style access: 111 + 112 + ```python 113 + ds = load_dataset("data.tar", split="train") 114 + 115 + for sample in ds.ordered(): 116 + # Dict-style access 117 + print(sample["field_name"]) 118 + 119 + # Attribute access 120 + print(sample.field_name) 121 + 122 + # Introspection 123 + print(sample.keys()) 124 + print(sample.to_dict()) 125 + ``` 126 + 86 127 ### PackableSample 87 128 88 - Base class for serializable samples. Fields annotated as `NDArray` are automatically handled: 129 + Base class for typed, serializable samples. Fields annotated as `NDArray` are automatically handled: 89 130 90 131 ```python 91 132 @atdata.packable ··· 94 135 optional_array: NDArray | None 95 136 regular_field: str 96 137 ``` 138 + 139 + Every `@packable` class automatically registers a lens from `DictSample`, enabling seamless conversion via `.as_type()`. 97 140 98 141 ### Lens 99 142 ··· 145 188 ```python 146 189 from atdata import load_dataset 147 190 148 - # Load from local path with glob patterns 149 - ds = load_dataset("./data/train-*.tar", sample_type=ImageSample) 191 + # Load without type for exploration (returns Dataset[DictSample]) 192 + ds = load_dataset("./data/train-*.tar", split="train") 193 + 194 + # Load with explicit type 195 + ds = load_dataset("./data/train-*.tar", ImageSample, split="train") 150 196 151 - # Load from brace notation 152 - ds = load_dataset("s3://bucket/data-{000000..000099}.tar", sample_type=ImageSample) 197 + # Load from S3 with brace notation 198 + ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train") 199 + 200 + # Load all splits (returns DatasetDict) 201 + ds_dict = load_dataset("./data", ImageSample) 202 + train_ds = ds_dict["train"] 203 + test_ds = ds_dict["test"] 153 204 154 - # Load with train/test splits 155 - ds = load_dataset("./data", sample_type=ImageSample, split="train") 205 + # Convert DictSample to typed schema 206 + ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample) 156 207 ``` 157 208 158 209 ## Development ··· 171 222 172 223 ```bash 173 224 # Run all tests with coverage 174 - pytest 225 + uv run pytest 175 226 176 227 # Run specific test file 177 - pytest tests/test_dataset.py 228 + uv run pytest tests/test_dataset.py 178 229 179 230 # Run single test 180 - pytest tests/test_lens.py::test_lens 231 + uv run pytest tests/test_lens.py::test_lens 181 232 ``` 182 233 183 234 ### Building
+3 -2
src/atdata/__init__.py
··· 58 58 ) 59 59 60 60 from ._protocols import ( 61 + Packable as Packable, 61 62 IndexEntry as IndexEntry, 62 63 AbstractIndex as AbstractIndex, 63 64 AbstractDataStore as AbstractDataStore, ··· 85 86 # ATProto integration (lazy import to avoid requiring atproto package) 86 87 from . import atmosphere as atmosphere 87 88 88 - 89 - # 89 + # CLI entry point 90 + from .cli import main as main
+78 -25
src/atdata/_protocols.py
··· 42 42 ) 43 43 44 44 if TYPE_CHECKING: 45 - from .dataset import PackableSample, Dataset 45 + from .dataset import Dataset 46 46 47 47 48 48 ## ··· 54 54 """Structural protocol for packable sample types. 55 55 56 56 This protocol allows classes decorated with ``@packable`` to be recognized 57 - as valid types for lens transformations, even though the decorator doesn't 58 - change the class's nominal type at static analysis time. 57 + as valid types for lens transformations and schema operations, even though 58 + the decorator doesn't change the class's nominal type at static analysis time. 59 59 60 60 Both ``PackableSample`` subclasses and ``@packable``-decorated classes 61 61 satisfy this protocol structurally. 62 62 63 - Note: 64 - This protocol is intentionally minimal - it only requires what's needed 65 - for structural compatibility. The actual ``PackableSample`` class has 66 - more methods, but lenses only need to know the type is packable. 63 + The protocol captures the full interface needed for: 64 + - Lens type transformations (as_wds, from_data) 65 + - Schema publishing (class introspection via dataclass fields) 66 + - Serialization/deserialization (packed, from_bytes) 67 + 68 + Example: 69 + >>> @packable 70 + ... class MySample: 71 + ... name: str 72 + ... value: int 73 + ... 74 + >>> def process(sample_type: Type[Packable]) -> None: 75 + ... # Type checker knows sample_type has from_bytes, packed, etc. 76 + ... instance = sample_type.from_bytes(data) 77 + ... print(instance.packed) 67 78 """ 68 79 80 + @classmethod 81 + def from_data(cls, data: dict[str, Any]) -> "Packable": 82 + """Create instance from unpacked msgpack data dictionary.""" 83 + ... 84 + 85 + @classmethod 86 + def from_bytes(cls, bs: bytes) -> "Packable": 87 + """Create instance from raw msgpack bytes.""" 88 + ... 89 + 90 + @property 91 + def packed(self) -> bytes: 92 + """Pack this sample's data into msgpack bytes.""" 93 + ... 94 + 69 95 @property 70 96 def as_wds(self) -> dict[str, Any]: 71 - """WebDataset-compatible representation.""" 97 + """WebDataset-compatible representation with __key__ and msgpack.""" 72 98 ... 73 99 74 100 ··· 134 160 A single index can hold datasets of many different sample types. The sample 135 161 type is tracked via schema references, not as a generic parameter on the index. 136 162 163 + Optional Extensions: 164 + Some index implementations support additional features: 165 + - ``data_store``: An AbstractDataStore for reading/writing dataset shards. 166 + If present, ``load_dataset`` will use it for S3 credential resolution. 167 + 137 168 Example: 138 169 >>> def publish_and_list(index: AbstractIndex) -> None: 139 170 ... # Publish schemas for different types ··· 148 179 ... for entry in index.list_datasets(): 149 180 ... print(f"{entry.name} -> {entry.schema_ref}") 150 181 """ 182 + 183 + # Optional data store (not required by protocol, but supported by some implementations) 184 + # Use hasattr() to check if available: if hasattr(index, 'data_store') and index.data_store 185 + data_store: Optional["AbstractDataStore"] 151 186 152 187 # Dataset operations 153 188 ··· 190 225 """ 191 226 ... 192 227 193 - def list_datasets(self) -> Iterator[IndexEntry]: 194 - """List all dataset entries in this index. 228 + @property 229 + def datasets(self) -> Iterator[IndexEntry]: 230 + """Lazily iterate over all dataset entries in this index. 195 231 196 232 Yields: 197 233 IndexEntry for each dataset (may be of different sample types). 198 234 """ 199 235 ... 200 236 237 + def list_datasets(self) -> list[IndexEntry]: 238 + """Get all dataset entries as a materialized list. 239 + 240 + Returns: 241 + List of IndexEntry for each dataset. 242 + """ 243 + ... 244 + 201 245 # Schema operations 202 246 203 247 def publish_schema( 204 248 self, 205 - sample_type: "Type[PackableSample]", 249 + sample_type: Type[Packable], 206 250 *, 207 251 version: str = "1.0.0", 208 252 **kwargs, ··· 210 254 """Publish a schema for a sample type. 211 255 212 256 Args: 213 - sample_type: The PackableSample subclass to publish. 257 + sample_type: A Packable type (PackableSample subclass or @packable-decorated). 214 258 version: Semantic version string for the schema. 215 259 **kwargs: Additional backend-specific options. 216 260 ··· 236 280 """ 237 281 ... 238 282 239 - def list_schemas(self) -> Iterator[dict]: 240 - """List all schema records in this index. 283 + @property 284 + def schemas(self) -> Iterator[dict]: 285 + """Lazily iterate over all schema records in this index. 241 286 242 287 Yields: 243 288 Schema records as dictionaries. 244 289 """ 245 290 ... 246 291 247 - def decode_schema(self, ref: str) -> "Type[PackableSample]": 248 - """Reconstruct a Python PackableSample type from a stored schema. 292 + def list_schemas(self) -> list[dict]: 293 + """Get all schema records as a materialized list. 294 + 295 + Returns: 296 + List of schema records as dictionaries. 297 + """ 298 + ... 299 + 300 + def decode_schema(self, ref: str) -> Type[Packable]: 301 + """Reconstruct a Python Packable type from a stored schema. 249 302 250 303 This method enables loading datasets without knowing the sample type 251 304 ahead of time. The index retrieves the schema record and dynamically 252 - generates a PackableSample subclass matching the schema definition. 305 + generates a Packable class matching the schema definition. 253 306 254 307 Args: 255 308 ref: Schema reference string (local:// or at://). 256 309 257 310 Returns: 258 - A dynamically generated PackableSample subclass with fields 259 - matching the schema definition. The class can be used with 311 + A dynamically generated Packable class with fields matching 312 + the schema definition. The class can be used with 260 313 ``Dataset[T]`` to load and iterate over samples. 261 314 262 315 Raises: ··· 373 426 ... print(sample) 374 427 """ 375 428 429 + @property 376 430 def shards(self) -> Iterator[tuple[str, IO[bytes]]]: 377 - """Yield (identifier, stream) pairs for each shard. 431 + """Lazily yield (identifier, stream) pairs for each shard. 378 432 379 433 The identifier is used for error messages and __url__ metadata. 380 434 The stream must be a file-like object that can be read by tarfile. ··· 383 437 Tuple of (shard_identifier, file_like_stream). 384 438 385 439 Example: 386 - >>> for shard_id, stream in source.shards(): 440 + >>> for shard_id, stream in source.shards: 387 441 ... print(f"Processing {shard_id}") 388 442 ... data = stream.read() 389 443 """ 390 444 ... 391 445 392 - @property 393 - def shard_list(self) -> list[str]: 394 - """List of shard identifiers without opening streams. 446 + def list_shards(self) -> list[str]: 447 + """Get list of shard identifiers without opening streams. 395 448 396 449 Used for metadata queries like counting shards without actually 397 450 streaming data. Implementations should return identifiers that 398 - match what shards() would yield. 451 + match what shards would yield. 399 452 400 453 Returns: 401 454 List of shard identifier strings.
+25 -13
src/atdata/_sources.py
··· 54 54 55 55 Example: 56 56 >>> source = URLSource("https://example.com/train-{000..009}.tar") 57 - >>> for shard_id, stream in source.shards(): 57 + >>> for shard_id, stream in source.shards: 58 58 ... print(f"Streaming {shard_id}") 59 59 """ 60 60 61 61 url: str 62 62 63 + def list_shards(self) -> list[str]: 64 + """Expand brace pattern and return list of shard URLs.""" 65 + return list(braceexpand.braceexpand(self.url)) 66 + 67 + # Legacy alias for backwards compatibility 63 68 @property 64 69 def shard_list(self) -> list[str]: 65 - """Expand brace pattern and return list of shard URLs.""" 66 - return list(braceexpand.braceexpand(self.url)) 70 + """Expand brace pattern and return list of shard URLs (deprecated, use list_shards()).""" 71 + return self.list_shards() 67 72 73 + @property 68 74 def shards(self) -> Iterator[tuple[str, IO[bytes]]]: 69 - """Yield (url, stream) pairs for each shard. 75 + """Lazily yield (url, stream) pairs for each shard. 70 76 71 77 Uses WebDataset's gopen to open URLs, which handles various schemes: 72 78 - http/https: via curl ··· 78 84 Yields: 79 85 Tuple of (url, file-like stream). 80 86 """ 81 - for url in self.shard_list: 87 + for url in self.list_shards(): 82 88 stream = wds.gopen(url, mode="rb") 83 89 yield url, stream 84 90 ··· 92 98 File-like stream from gopen. 93 99 94 100 Raises: 95 - KeyError: If shard_id is not in shard_list. 101 + KeyError: If shard_id is not in list_shards(). 96 102 """ 97 - if shard_id not in self.shard_list: 103 + if shard_id not in self.list_shards(): 98 104 raise KeyError(f"Shard not found: {shard_id}") 99 105 return wds.gopen(shard_id, mode="rb") 100 106 ··· 129 135 ... access_key="AKIAIOSFODNN7EXAMPLE", 130 136 ... secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", 131 137 ... ) 132 - >>> for shard_id, stream in source.shards(): 138 + >>> for shard_id, stream in source.shards: 133 139 ... process(stream) 134 140 """ 135 141 ··· 166 172 self._client = boto3.client("s3", **client_kwargs) 167 173 return self._client 168 174 175 + def list_shards(self) -> list[str]: 176 + """Return list of S3 URIs for the shards.""" 177 + return [f"s3://{self.bucket}/{key}" for key in self.keys] 178 + 179 + # Legacy alias for backwards compatibility 169 180 @property 170 181 def shard_list(self) -> list[str]: 171 - """Return list of S3 URIs for the shards.""" 172 - return [f"s3://{self.bucket}/{key}" for key in self.keys] 182 + """Return list of S3 URIs for the shards (deprecated, use list_shards()).""" 183 + return self.list_shards() 173 184 185 + @property 174 186 def shards(self) -> Iterator[tuple[str, IO[bytes]]]: 175 - """Yield (s3_uri, stream) pairs for each shard. 187 + """Lazily yield (s3_uri, stream) pairs for each shard. 176 188 177 189 Uses boto3 to get streaming response bodies, which are file-like 178 190 objects that can be read directly by tarfile. ··· 198 210 StreamingBody for reading the object. 199 211 200 212 Raises: 201 - KeyError: If shard_id is not in shard_list. 213 + KeyError: If shard_id is not in list_shards(). 202 214 """ 203 - if shard_id not in self.shard_list: 215 + if shard_id not in self.list_shards(): 204 216 raise KeyError(f"Shard not found: {shard_id}") 205 217 206 218 # Parse s3://bucket/key -> key
+46 -16
src/atdata/atmosphere/__init__.py
··· 44 44 ) 45 45 46 46 if TYPE_CHECKING: 47 - from ..dataset import PackableSample, Dataset 47 + from ..dataset import Dataset 48 + from .._protocols import Packable 48 49 49 50 50 51 class AtmosphereIndexEntry: ··· 119 120 self._schema_loader = SchemaLoader(client) 120 121 self._dataset_publisher = DatasetPublisher(client) 121 122 self._dataset_loader = DatasetLoader(client) 123 + # AtmosphereIndex doesn't support data_store (uses PDS blobs) 124 + self.data_store = None 122 125 123 126 # Dataset operations 124 127 ··· 168 171 record = self._dataset_loader.get(ref) 169 172 return AtmosphereIndexEntry(ref, record) 170 173 171 - def list_datasets(self, repo: Optional[str] = None) -> Iterator[AtmosphereIndexEntry]: 172 - """List dataset entries from a repository. 174 + @property 175 + def datasets(self) -> Iterator[AtmosphereIndexEntry]: 176 + """Lazily iterate over all dataset entries (AbstractIndex protocol). 173 177 174 - Args: 175 - repo: DID of repository. Defaults to authenticated user. 178 + Uses the authenticated user's repository. 176 179 177 180 Yields: 178 181 AtmosphereIndexEntry for each dataset. 179 182 """ 180 - records = self._dataset_loader.list_all(repo=repo) 183 + records = self._dataset_loader.list_all() 181 184 for rec in records: 182 185 uri = rec.get("uri", "") 183 186 yield AtmosphereIndexEntry(uri, rec.get("value", rec)) 184 187 188 + def list_datasets(self, repo: Optional[str] = None) -> list[AtmosphereIndexEntry]: 189 + """Get all dataset entries as a materialized list (AbstractIndex protocol). 190 + 191 + Args: 192 + repo: DID of repository. Defaults to authenticated user. 193 + 194 + Returns: 195 + List of AtmosphereIndexEntry for each dataset. 196 + """ 197 + records = self._dataset_loader.list_all(repo=repo) 198 + return [ 199 + AtmosphereIndexEntry(rec.get("uri", ""), rec.get("value", rec)) 200 + for rec in records 201 + ] 202 + 185 203 # Schema operations 186 204 187 205 def publish_schema( 188 206 self, 189 - sample_type: "Type[PackableSample]", 207 + sample_type: "Type[Packable]", 190 208 *, 191 209 version: str = "1.0.0", 192 210 **kwargs, ··· 194 212 """Publish a schema to ATProto. 195 213 196 214 Args: 197 - sample_type: The PackableSample subclass to publish. 215 + sample_type: A Packable type (PackableSample subclass or @packable-decorated). 198 216 version: Semantic version string. 199 217 **kwargs: Additional options (description, metadata). 200 218 ··· 223 241 """ 224 242 return self._schema_loader.get(ref) 225 243 226 - def list_schemas(self, repo: Optional[str] = None) -> Iterator[dict]: 227 - """List schema records from a repository. 244 + @property 245 + def schemas(self) -> Iterator[dict]: 246 + """Lazily iterate over all schema records (AbstractIndex protocol). 247 + 248 + Uses the authenticated user's repository. 249 + 250 + Yields: 251 + Schema records as dictionaries. 252 + """ 253 + records = self._schema_loader.list_all() 254 + for rec in records: 255 + yield rec.get("value", rec) 256 + 257 + def list_schemas(self, repo: Optional[str] = None) -> list[dict]: 258 + """Get all schema records as a materialized list (AbstractIndex protocol). 228 259 229 260 Args: 230 261 repo: DID of repository. Defaults to authenticated user. 231 262 232 - Yields: 233 - Schema records. 263 + Returns: 264 + List of schema records as dictionaries. 234 265 """ 235 266 records = self._schema_loader.list_all(repo=repo) 236 - for rec in records: 237 - yield rec.get("value", rec) 267 + return [rec.get("value", rec) for rec in records] 238 268 239 - def decode_schema(self, ref: str) -> "Type[PackableSample]": 269 + def decode_schema(self, ref: str) -> "Type[Packable]": 240 270 """Reconstruct a Python type from a schema record. 241 271 242 272 Args: 243 273 ref: AT URI of the schema record. 244 274 245 275 Returns: 246 - Dynamically generated PackableSample subclass. 276 + Dynamically generated Packable type. 247 277 248 278 Raises: 249 279 ValueError: If schema cannot be decoded.
+23 -8
src/atdata/dataset.py
··· 503 503 504 504 def run(self): 505 505 """Yield {url: shard_id} dicts for each shard.""" 506 - for shard_id in self.source.shard_list: 506 + for shard_id in self.source.list_shards(): 507 507 yield {"url": shard_id} 508 508 509 509 ··· 615 615 self.url = source 616 616 else: 617 617 self._source = source 618 - # For compatibility, expose URL if source has shard_list 619 - self.url = source.shard_list[0] if source.shard_list else "" 618 + # For compatibility, expose URL if source has list_shards 619 + shards = source.list_shards() 620 + # TODO Expand out in brace notation the full shard list, rather than just using the first entry, in this fallback; add tests to make sure we catch this issue, as it wasn't showing up in our previous test suite. 621 + self.url = shards[0] if shards else "" 620 622 621 623 self._metadata: dict[str, Any] | None = None 622 624 self.metadata_url: str | None = metadata_url ··· 652 654 ret._output_lens = lenses.transform( self.sample_type, ret.sample_type ) 653 655 return ret 654 656 655 - @property 656 - def shard_list( self ) -> list[str]: 657 - """List of individual dataset shards 657 + def list_shards( self ) -> list[str]: 658 + """Get list of individual dataset shards. 658 659 659 660 Returns: 660 661 A full (non-lazy) list of the individual ``tar`` files within the 661 662 source WebDataset. 662 663 """ 663 - return self._source.shard_list 664 + return self._source.list_shards() 665 + 666 + # Legacy alias for backwards compatibility 667 + @property 668 + def shard_list( self ) -> list[str]: 669 + """List of individual dataset shards (deprecated, use list_shards()).""" 670 + return self.list_shards() 664 671 665 672 @property 666 673 def metadata( self ) -> dict[str, Any] | None: ··· 919 926 ``PackableSample``, enabling automatic msgpack serialization/deserialization 920 927 with special handling for NDArray fields. 921 928 929 + The resulting class satisfies the ``Packable`` protocol, making it compatible 930 + with all atdata APIs that accept packable types (e.g., ``publish_schema``, 931 + lens transformations, etc.). 932 + 922 933 Args: 923 934 cls: The class to convert. Should have type annotations for its fields. 924 935 925 936 Returns: 926 937 A new dataclass that inherits from ``PackableSample`` with the same 927 - name and annotations as the original class. 938 + name and annotations as the original class. The class satisfies the 939 + ``Packable`` protocol and can be used with ``Type[Packable]`` signatures. 928 940 929 941 Example: 930 942 >>> @packable ··· 935 947 >>> sample = MyData(name="test", values=np.array([1, 2, 3])) 936 948 >>> bytes_data = sample.packed 937 949 >>> restored = MyData.from_bytes(bytes_data) 950 + >>> 951 + >>> # Works with Packable-typed APIs 952 + >>> index.publish_schema(MyData, version="1.0.0") # Type-safe 938 953 """ 939 954 940 955 ##
+34 -21
src/atdata/local.py
··· 30 30 is_ndarray_type, 31 31 extract_ndarray_dtype, 32 32 ) 33 - from atdata._protocols import IndexEntry, AbstractDataStore 33 + from atdata._protocols import IndexEntry, AbstractDataStore, Packable 34 34 35 35 from pathlib import Path 36 36 from uuid import uuid4 ··· 106 106 """ 107 107 108 108 def __init__(self) -> None: 109 - self._types: dict[str, Type[PackableSample]] = {} 109 + self._types: dict[str, Type[Packable]] = {} 110 110 111 - def _register(self, name: str, cls: Type[PackableSample]) -> None: 111 + def _register(self, name: str, cls: Type[Packable]) -> None: 112 112 """Register a schema type in the namespace.""" 113 113 self._types[name] = cls 114 114 ··· 142 142 names = ", ".join(sorted(self._types.keys())) 143 143 return f"SchemaNamespace({names})" 144 144 145 - def get(self, name: str, default: T | None = None) -> Type[PackableSample] | T | None: 145 + def get(self, name: str, default: T | None = None) -> Type[Packable] | T | None: 146 146 """Get a type by name, returning default if not found. 147 147 148 148 Args: ··· 355 355 ## 356 356 # Helpers 357 357 358 - def _kind_str_for_sample_type( st: Type[PackableSample] ) -> str: 358 + def _kind_str_for_sample_type( st: Type[Packable] ) -> str: 359 359 """Return fully-qualified 'module.name' string for a sample type.""" 360 360 return f'{st.__module__}.{st.__name__}' 361 361 ··· 435 435 _LEGACY_URI_PREFIX = "local://schemas/" 436 436 437 437 438 - def _schema_ref_from_type(sample_type: Type[PackableSample], version: str) -> str: 438 + def _schema_ref_from_type(sample_type: Type[Packable], version: str) -> str: 439 439 """Generate 'atdata://local/sampleSchema/{name}@{version}' reference.""" 440 440 return _make_schema_ref(sample_type.__name__, version) 441 441 ··· 506 506 507 507 508 508 def _build_schema_record( 509 - sample_type: Type[PackableSample], 509 + sample_type: Type[Packable], 510 510 *, 511 511 version: str, 512 512 description: str | None = None, ··· 1035 1035 """ 1036 1036 return self._schema_namespace 1037 1037 1038 - def load_schema(self, ref: str) -> Type[PackableSample]: 1038 + def load_schema(self, ref: str) -> Type[Packable]: 1039 1039 """Load a schema and make it available in the types namespace. 1040 1040 1041 1041 This method decodes the schema, optionally generates a Python module ··· 1107 1107 1108 1108 return f"{authority}.{module_name}" 1109 1109 1110 - @property 1111 - def all_entries(self) -> list[LocalDatasetEntry]: 1112 - """Get all index entries as a list. 1110 + def list_entries(self) -> list[LocalDatasetEntry]: 1111 + """Get all index entries as a materialized list. 1113 1112 1114 1113 Returns: 1115 1114 List of all LocalDatasetEntry objects in the index. 1116 1115 """ 1117 1116 return list(self.entries) 1117 + 1118 + # Legacy alias for backwards compatibility 1119 + @property 1120 + def all_entries(self) -> list[LocalDatasetEntry]: 1121 + """Get all index entries as a list (deprecated, use list_entries()).""" 1122 + return self.list_entries() 1118 1123 1119 1124 @property 1120 1125 def entries(self) -> Generator[LocalDatasetEntry, None, None]: ··· 1277 1282 """ 1278 1283 return self.get_entry_by_name(ref) 1279 1284 1280 - def list_datasets(self) -> Iterator[LocalDatasetEntry]: 1281 - """List all dataset entries (AbstractIndex protocol). 1285 + @property 1286 + def datasets(self) -> Generator[LocalDatasetEntry, None, None]: 1287 + """Lazily iterate over all dataset entries (AbstractIndex protocol). 1282 1288 1283 1289 Yields: 1284 1290 IndexEntry for each dataset. 1285 1291 """ 1286 1292 return self.entries 1293 + 1294 + def list_datasets(self) -> list[LocalDatasetEntry]: 1295 + """Get all dataset entries as a materialized list (AbstractIndex protocol). 1296 + 1297 + Returns: 1298 + List of IndexEntry for each dataset. 1299 + """ 1300 + return self.list_entries() 1287 1301 1288 1302 # Schema operations 1289 1303 ··· 1320 1334 1321 1335 def publish_schema( 1322 1336 self, 1323 - sample_type: Type[PackableSample], 1337 + sample_type: Type[Packable], 1324 1338 *, 1325 1339 version: str | None = None, 1326 1340 description: str | None = None, ··· 1453 1467 schema['$ref'] = _make_schema_ref(name, version) 1454 1468 yield LocalSchemaRecord.from_dict(schema) 1455 1469 1456 - def list_schemas(self) -> Iterator[dict]: 1457 - """List all schema records (AbstractIndex protocol). 1470 + def list_schemas(self) -> list[dict]: 1471 + """Get all schema records as a materialized list (AbstractIndex protocol). 1458 1472 1459 - Yields: 1460 - Schema records as dictionaries. 1473 + Returns: 1474 + List of schema records as dictionaries. 1461 1475 """ 1462 - for record in self.schemas: 1463 - yield record.to_dict() 1476 + return [record.to_dict() for record in self.schemas] 1464 1477 1465 - def decode_schema(self, ref: str) -> Type[PackableSample]: 1478 + def decode_schema(self, ref: str) -> Type[Packable]: 1466 1479 """Reconstruct a Python PackableSample type from a stored schema. 1467 1480 1468 1481 This method enables loading datasets without knowing the sample type
+2 -3
src/atdata/promote.py
··· 24 24 if TYPE_CHECKING: 25 25 from .local import LocalDatasetEntry, Index as LocalIndex 26 26 from .atmosphere import AtmosphereClient 27 - from .dataset import PackableSample 28 - from ._protocols import AbstractDataStore 27 + from ._protocols import AbstractDataStore, Packable 29 28 30 29 31 30 def _find_existing_schema( ··· 54 53 55 54 56 55 def _find_or_publish_schema( 57 - sample_type: "Type[PackableSample]", 56 + sample_type: "Type[Packable]", 58 57 version: str, 59 58 client: "AtmosphereClient", 60 59 description: str | None = None,
+3 -3
tests/test_sources.py
··· 62 62 ] 63 63 64 64 def test_shards_yields_streams(self, tmp_path): 65 - """shards() yields (url, stream) pairs.""" 65 + """shards property yields (url, stream) pairs.""" 66 66 # Create test tar file 67 67 tar_path = tmp_path / "test.tar" 68 68 create_test_tar(tar_path, [{"name": "test", "value": 42}]) 69 69 70 70 source = URLSource(str(tar_path)) 71 - shards = list(source.shards()) 71 + shards = list(source.shards) 72 72 73 73 assert len(shards) == 1 74 74 url, stream = shards[0] ··· 202 202 secret_key="SECRET", 203 203 ) 204 204 205 - shards = list(source.shards()) 205 + shards = list(source.shards) 206 206 207 207 assert len(shards) == 1 208 208 uri, stream = shards[0]