Added backup of human review files · foundation.ac/atdata@410e083

+16

.review/human-review.md

+ * We had talked previously about potentially moving `PackableSample` to a `Packable` protocol to simplify some type hints, etc. Let's go over the pros and cons of this. This shows up for the linting / typing of `local.Index.publish_schema`, where the way that `@packable` is working right now doesn't properly get the `PackableSample` superclass to register for this signature.
+ * We have an interesting persistent issue with Redis removing old records; can you think through why it seems like Redis resets our index entries somewhat inconsistently over time? Is there a Redis setting that might be responsible for this?
+ * We want to make sure that we keep to the pattern that `foo.xs` is an `@property` that gives a (lazy) iterable for `x`, while `foo.list_xs` is a fully evaluated list of all the `x`s that uses `foo.xs` under the hood. We should go through the full codebase to check that it follows this convention.
+ * `load_dataset` has a couple of issues:
+     * We updated how `Dataset` is initialized to be able to accommodate a number of different underlying sources of `wds`-compatible data, and we should make it so that the overloads for `load_dataset` properly connect up with this; for example:
+         * If `load_dataset` is coming from a specified local index with an S3 store, we should use the S3 credentials there as the source for the `Dataset` returned by `load_dataset`.
+         * If it's using an atproto location (like `'@maxine.science/mnist'`), the returned `Dataset` should use whatever storage mechanism is referenced in that atproto record from the network (for example, blobs as at-uris that are wrapped in a file-like interface for passing to `webdataset`).
+     * Calls like
+     ```python
+     ds = load_dataset( "@local/proto-text-samples-3", TextSample,
+         split = 'train',
+         #
+         index = index,
+     )
+     ```
+     are resulting in linting errors because of the way overloading is handled for `load_dataset`; could you take a look at how the overloads are implemented, to make sure that things are functioning as expected? In particular, it seems like the `AbstractIndex` protocol and `local.Index` don't play nicely for linting!
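The `foo.xs` / `foo.list_xs` convention from the review notes above could be sketched as follows. This is a minimal illustration, not the `atdata` implementation: `SchemaIndex` and its `_names` field are hypothetical stand-ins, and only the `schemas` / `list_schemas` naming mirrors what the notes describe.

```python
from typing import Iterator


class SchemaIndex:
    """Hypothetical container used only to illustrate the naming convention."""

    def __init__(self, names: list[str]) -> None:
        self._names = names

    @property
    def schemas(self) -> Iterator[str]:
        """Lazy iterable: nothing is materialized until the caller iterates."""
        return (name for name in self._names)

    def list_schemas(self) -> list[str]:
        """Fully evaluated list, built on top of the lazy `schemas` property."""
        return list(self.schemas)


index = SchemaIndex(["TrainingSample", "TextSample"])
first = next(index.schemas)       # pulls a single item from the lazy iterable
all_names = index.list_schemas()  # materializes every item
```

Keeping `list_schemas` as a thin wrapper over `schemas` leaves a single code path to audit when checking the codebase against the convention.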

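For the `Packable` protocol question in the review notes, a structural type might look like the sketch below. It assumes the only requirement is an `as_wds` property; the `Packable` protocol, the dictionary layout returned by `as_wds`, and this standalone `publish_schema` function are all hypothetical and are not taken from `atdata`.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Packable(Protocol):
    """Hypothetical structural type: any object exposing `as_wds` qualifies."""

    @property
    def as_wds(self) -> dict[str, Any]: ...


@dataclass
class TextSample:
    """Plain dataclass with no base class; it matches `Packable` by shape."""

    text: str
    category: str

    @property
    def as_wds(self) -> dict[str, Any]:
        # Assumed layout, for illustration only.
        return {"__key__": self.text, "json": {"text": self.text, "category": self.category}}


def publish_schema(sample_type: type[Packable], version: str) -> str:
    """Stand-in for an index method; it only builds a reference string."""
    return f"atdata://local/sampleSchema/{sample_type.__name__}@{version}"


sample = TextSample(text="hello", category="test")
ref = publish_schema(TextSample, version="1.0.1")
```

One trade-off to weigh: a protocol checks structure only, so any behavior currently inherited from `PackableSample` would have to be supplied some other way.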
+19

prototyping/human-review-atmosphere.ipynb

+ {
+  "cells": [
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "id": "87dec017",
+    "metadata": {},
+    "outputs": [],
+    "source": []
+   }
+  ],
+  "metadata": {
+   "language_info": {
+    "name": "python"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+ }

+613

prototyping/human-review-local.ipynb

··· 1 + { 2 + "cells": [ 3 + { 4 + "cell_type": "code", 5 + "execution_count": 1, 6 + "id": "df3f0691", 7 + "metadata": {}, 8 + "outputs": [], 9 + "source": [ 10 + "import numpy as np\n", 11 + "from numpy.typing import NDArray\n", 12 + "import atdata\n", 13 + "from atdata.local import LocalDatasetEntry, S3DataStore, Index\n", 14 + "import webdataset as wds" 15 + ] 16 + }, 17 + { 18 + "cell_type": "code", 19 + "execution_count": 34, 20 + "id": "1f7ea651", 21 + "metadata": {}, 22 + "outputs": [], 23 + "source": [ 24 + "@atdata.packable\n", 25 + "class TrainingSample:\n", 26 + " \"\"\"A sample containing features and label for training.\"\"\"\n", 27 + " features: NDArray\n", 28 + " label: int\n", 29 + "\n", 30 + "from dataclasses import dataclass\n", 31 + "\n", 32 + "@atdata.packable\n", 33 + "class TextSample( atdata.PackableSample ):\n", 34 + " \"\"\"A sample containing text data.\"\"\"\n", 35 + " text: str\n", 36 + " category: str" 37 + ] 38 + }, 39 + { 40 + "cell_type": "code", 41 + "execution_count": 3, 42 + "id": "55549f64", 43 + "metadata": {}, 44 + "outputs": [], 45 + "source": [ 46 + "x = TextSample(\n", 47 + " text = 'Hello',\n", 48 + " category = 'test',\n", 49 + ")" 50 + ] 51 + }, 52 + { 53 + "cell_type": "markdown", 54 + "id": "462a780b", 55 + "metadata": {}, 56 + "source": [ 57 + "---" 58 + ] 59 + }, 60 + { 61 + "cell_type": "code", 62 + "execution_count": 4, 63 + "id": "ed0821b9", 64 + "metadata": {}, 65 + "outputs": [ 66 + { 67 + "name": "stdout", 68 + "output_type": "stream", 69 + "text": [ 70 + "Bucket: analysis-hive\n", 71 + "Supports streaming: True\n", 72 + "LocalIndex connected\n" 73 + ] 74 + } 75 + ], 76 + "source": [ 77 + "# Connect to S3\n", 78 + "store = S3DataStore( '.credentials/r2-analysis-hive.env',\n", 79 + " bucket = \"analysis-hive\"\n", 80 + ")\n", 81 + "print(f\"Bucket: {store.bucket}\")\n", 82 + "print(f\"Supports streaming: {store.supports_streaming()}\")\n", 83 + "\n", 84 + "# Connect to Redis\n", 85 + "index = Index(\n", 86 + " 
data_store = store,\n", 87 + " auto_stubs = True,\n", 88 + ")\n", 89 + "print( \"LocalIndex connected\" )" 90 + ] 91 + }, 92 + { 93 + "cell_type": "code", 94 + "execution_count": 6, 95 + "id": "2fd2229f", 96 + "metadata": {}, 97 + "outputs": [ 98 + { 99 + "data": { 100 + "text/plain": [ 101 + "[{'name': 'TrainingSample',\n", 102 + " 'version': '1.0.0',\n", 103 + " 'fields': [{'name': 'features',\n", 104 + " 'fieldType': {'$type': 'local#ndarray', 'dtype': 'float32'},\n", 105 + " 'optional': False},\n", 106 + " {'name': 'label',\n", 107 + " 'fieldType': {'$type': 'local#primitive', 'primitive': 'int'},\n", 108 + " 'optional': False}],\n", 109 + " '$ref': 'atdata://local/sampleSchema/TrainingSample@1.0.0',\n", 110 + " 'description': 'A sample containing features and label for training.',\n", 111 + " 'createdAt': '2026-01-22T22:01:47.560660+00:00'},\n", 112 + " {'name': 'TextSample',\n", 113 + " 'version': '1.0.1',\n", 114 + " 'fields': [{'name': 'text',\n", 115 + " 'fieldType': {'$type': 'local#primitive', 'primitive': 'str'},\n", 116 + " 'optional': False},\n", 117 + " {'name': 'category',\n", 118 + " 'fieldType': {'$type': 'local#primitive', 'primitive': 'str'},\n", 119 + " 'optional': False}],\n", 120 + " '$ref': 'atdata://local/sampleSchema/TextSample@1.0.1',\n", 121 + " 'description': 'A sample containing text data.',\n", 122 + " 'createdAt': '2026-01-22T22:09:51.907476+00:00'}]" 123 + ] 124 + }, 125 + "execution_count": 6, 126 + "metadata": {}, 127 + "output_type": "execute_result" 128 + } 129 + ], 130 + "source": [ 131 + "list( index.list_schemas() )" 132 + ] 133 + }, 134 + { 135 + "cell_type": "code", 136 + "execution_count": 7, 137 + "id": "c23765ad", 138 + "metadata": {}, 139 + "outputs": [], 140 + "source": [ 141 + "s = next( index.schemas )" 142 + ] 143 + }, 144 + { 145 + "cell_type": "code", 146 + "execution_count": 8, 147 + "id": "b4be08f9", 148 + "metadata": {}, 149 + "outputs": [ 150 + { 151 + "data": { 152 + "text/plain": [ 153 + 
"'atdata://local/sampleSchema/TrainingSample@1.0.0'" 154 + ] 155 + }, 156 + "execution_count": 8, 157 + "metadata": {}, 158 + "output_type": "execute_result" 159 + } 160 + ], 161 + "source": [ 162 + "s.ref" 163 + ] 164 + }, 165 + { 166 + "cell_type": "code", 167 + "execution_count": 9, 168 + "id": "51829873", 169 + "metadata": {}, 170 + "outputs": [ 171 + { 172 + "name": "stdout", 173 + "output_type": "stream", 174 + "text": [ 175 + "Published schema: atdata://local/sampleSchema/TrainingSample@1.0.0\n", 176 + " - TrainingSample v1.0.0\n", 177 + " - TextSample v1.0.1\n", 178 + "Schema fields: ['features', 'label']\n", 179 + "Decoded type: TrainingSample\n" 180 + ] 181 + } 182 + ], 183 + "source": [ 184 + "# Publish a schema\n", 185 + "schema_ref = index.publish_schema( TrainingSample, version=\"1.0.0\")\n", 186 + "print(f\"Published schema: {schema_ref}\")\n", 187 + "\n", 188 + "# List all schemas\n", 189 + "for schema in index.list_schemas():\n", 190 + " print(f\" - {schema.get('name', 'Unknown')} v{schema.get('version', '?')}\")\n", 191 + "\n", 192 + "# Get schema record\n", 193 + "schema_record = index.get_schema(schema_ref)\n", 194 + "print(f\"Schema fields: {[f['name'] for f in schema_record.get('fields', [])]}\")\n", 195 + "\n", 196 + "# Decode schema back to a PackableSample class\n", 197 + "decoded_type = index.decode_schema(schema_ref)\n", 198 + "print(f\"Decoded type: {decoded_type.__name__}\")" 199 + ] 200 + }, 201 + { 202 + "cell_type": "code", 203 + "execution_count": 10, 204 + "id": "fadbddaa", 205 + "metadata": {}, 206 + "outputs": [ 207 + { 208 + "name": "stdout", 209 + "output_type": "stream", 210 + "text": [ 211 + "Published schema: atdata://local/sampleSchema/TextSample@1.0.1\n", 212 + " - TrainingSample v1.0.0\n", 213 + " - TextSample v1.0.1\n", 214 + "Schema fields: ['text', 'category']\n", 215 + "Decoded type: TextSample\n" 216 + ] 217 + } 218 + ], 219 + "source": [ 220 + "# Publish a schema\n", 221 + "schema_ref_2 = 
index.publish_schema(TextSample, version=\"1.0.1\")\n", 222 + "print(f\"Published schema: {schema_ref_2}\")\n", 223 + "\n", 224 + "# List all schemas\n", 225 + "for schema in index.list_schemas():\n", 226 + " print(f\" - {schema.get('name', 'Unknown')} v{schema.get('version', '?')}\")\n", 227 + "\n", 228 + "# Get schema record\n", 229 + "schema_record = index.get_schema(schema_ref_2)\n", 230 + "print(f\"Schema fields: {[f['name'] for f in schema_record.get('fields', [])]}\")\n", 231 + "\n", 232 + "# Decode schema back to a PackableSample class\n", 233 + "decoded_type = index.decode_schema(schema_ref_2)\n", 234 + "print(f\"Decoded type: {decoded_type.__name__}\")" 235 + ] 236 + }, 237 + { 238 + "cell_type": "code", 239 + "execution_count": 12, 240 + "id": "18e14e77", 241 + "metadata": {}, 242 + "outputs": [], 243 + "source": [ 244 + "from typing import TypeVar, TypeAlias, Generic, Callable, Any\n", 245 + "\n", 246 + "S = TypeVar( 'S', bound = atdata.PackableSample )\n", 247 + "V = TypeVar( 'V', bound = atdata.PackableSample )\n", 248 + "\n", 249 + "FromAnyTo = Callable[[Any], V]\n", 250 + "\n", 251 + "def make_local_lens( f: FromAnyTo[V], remote: type[S], local: type[V] ) -> atdata.Lens[S, V]:\n", 252 + " \"\"\"TODO\"\"\"\n", 253 + " @atdata.lens\n", 254 + " def _to_local( s: S ) -> V:\n", 255 + " return f( s )\n", 256 + " return _to_local" 257 + ] 258 + }, 259 + { 260 + "cell_type": "code", 261 + "execution_count": 18, 262 + "id": "8fede400", 263 + "metadata": {}, 264 + "outputs": [], 265 + "source": [ 266 + "index.load_schema( 'atdata://local/sampleSchema/TextSample@1.0.1' )\n", 267 + "TextSampleRemote = index.types.TextSample" 268 + ] 269 + }, 270 + { 271 + "cell_type": "code", 272 + "execution_count": 20, 273 + "id": "53979bee", 274 + "metadata": {}, 275 + "outputs": [], 276 + "source": [ 277 + "x = TextSampleRemote(\n", 278 + " text = 'hello',\n", 279 + " category = 'test',\n", 280 + ")" 281 + ] 282 + }, 283 + { 284 + "cell_type": "code", 285 + 
"execution_count": 26, 286 + "id": "5a3122bd", 287 + "metadata": {}, 288 + "outputs": [], 289 + "source": [ 290 + "def _to_text_sample( s: Any ) -> TextSample:\n", 291 + " return TextSample(\n", 292 + " text = s.text,\n", 293 + " category = s.category,\n", 294 + " )\n", 295 + "\n", 296 + "l = make_local_lens( _to_text_sample, TextSampleRemote, TextSample )" 297 + ] 298 + }, 299 + { 300 + "cell_type": "code", 301 + "execution_count": 27, 302 + "id": "a730c075", 303 + "metadata": {}, 304 + "outputs": [], 305 + "source": [ 306 + "y = l( x )" 307 + ] 308 + }, 309 + { 310 + "cell_type": "markdown", 311 + "id": "08b8d647", 312 + "metadata": {}, 313 + "source": [ 314 + "---" 315 + ] 316 + }, 317 + { 318 + "cell_type": "code", 319 + "execution_count": 29, 320 + "id": "55d944d0", 321 + "metadata": {}, 322 + "outputs": [ 323 + { 324 + "name": "stdout", 325 + "output_type": "stream", 326 + "text": [ 327 + "# writing data/TextSample_test-000000.tar 0 0.0 GB 0\n", 328 + "# writing data/TextSample_test-000001.tar 1000 0.0 GB 1000\n", 329 + "# writing data/TextSample_test-000002.tar 1000 0.0 GB 2000\n", 330 + "# writing data/TextSample_test-000003.tar 1000 0.0 GB 3000\n", 331 + "# writing data/TextSample_test-000004.tar 1000 0.0 GB 4000\n", 332 + "# writing data/TextSample_test-000005.tar 1000 0.0 GB 5000\n", 333 + "# writing data/TextSample_test-000006.tar 1000 0.0 GB 6000\n", 334 + "# writing data/TextSample_test-000007.tar 1000 0.0 GB 7000\n", 335 + "# writing data/TextSample_test-000008.tar 1000 0.0 GB 8000\n", 336 + "# writing data/TextSample_test-000009.tar 1000 0.0 GB 9000\n" 337 + ] 338 + } 339 + ], 340 + "source": [ 341 + "import webdataset as wds\n", 342 + "from uuid import uuid4\n", 343 + "\n", 344 + "data_pattern = 'data/TextSample_test-%06d.tar'\n", 345 + "\n", 346 + "with wds.writer.ShardWriter( data_pattern, maxcount = 1_000 ) as sink:\n", 347 + " for i in range( 10_000 ):\n", 348 + " new_sample = TextSample(\n", 349 + " text = str( uuid4() ),\n", 350 + " category 
= 'test',\n", 351 + " )\n", 352 + " sink.write( new_sample.as_wds )" 353 + ] 354 + }, 355 + { 356 + "cell_type": "code", 357 + "execution_count": 36, 358 + "id": "5978b632", 359 + "metadata": {}, 360 + "outputs": [], 361 + "source": [ 362 + "from atdata import load_dataset\n", 363 + "\n", 364 + "ds = (\n", 365 + " load_dataset( 'data/TextSample_test-{000000..000009}.tar',\n", 366 + " split = 'test'\n", 367 + " )\n", 368 + " .as_type( TextSample )\n", 369 + ")" 370 + ] 371 + }, 372 + { 373 + "cell_type": "code", 374 + "execution_count": 40, 375 + "id": "cc81a54e", 376 + "metadata": {}, 377 + "outputs": [], 378 + "source": [ 379 + "x = next( iter( ds.ordered() ) )" 380 + ] 381 + }, 382 + { 383 + "cell_type": "code", 384 + "execution_count": 41, 385 + "id": "3beac49d", 386 + "metadata": {}, 387 + "outputs": [ 388 + { 389 + "data": { 390 + "text/plain": [ 391 + "TextSample(text='d06a8072-5833-4867-9bc6-03baa3cee75b', category='test')" 392 + ] 393 + }, 394 + "execution_count": 41, 395 + "metadata": {}, 396 + "output_type": "execute_result" 397 + } 398 + ], 399 + "source": [ 400 + "x" 401 + ] 402 + }, 403 + { 404 + "cell_type": "code", 405 + "execution_count": 42, 406 + "id": "4ebfcc63", 407 + "metadata": {}, 408 + "outputs": [ 409 + { 410 + "name": "stdout", 411 + "output_type": "stream", 412 + "text": [ 413 + "# writing analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar 0 0.0 GB 0\n" 414 + ] 415 + } 416 + ], 417 + "source": [ 418 + "entry = index.insert_dataset( ds, \n", 419 + " name = 'proto-text-samples-3',\n", 420 + " prefix = 'prototyping',\n", 421 + " schema_ref = 'atdata://local/sampleSchema/TextSample@1.0.1',\n", 422 + ")" 423 + ] 424 + }, 425 + { 426 + "cell_type": "code", 427 + "execution_count": null, 428 + "id": "e74d68f6", 429 + "metadata": {}, 430 + "outputs": [ 431 + { 432 + "data": { 433 + "text/plain": [ 434 + "LocalDatasetEntry(name='proto-text-samples-3', schema_ref='atdata://local/sampleSchema/TextSample@1.0.1', 
data_urls=['s3://analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar'], metadata=None)" 435 + ] 436 + }, 437 + "execution_count": 43, 438 + "metadata": {}, 439 + "output_type": "execute_result" 440 + } 441 + ], 442 + "source": [ 443 + "entry" 444 + ] 445 + }, 446 + { 447 + "cell_type": "markdown", 448 + "id": "a51090c3", 449 + "metadata": {}, 450 + "source": [ 451 + "Notes:\n", 452 + "\n", 453 + "* We should make sure that the `s3` URI-scheme here is properly used\n", 454 + " * Should we be using the `https` URI since actually this is doing data streaming with `wds`? Or does this indicate that we should think more deeply about the `Dataset` API design and generalizing how we're setting up the `wds` data streaming ...\n", 455 + " * No matter what, we're definitely going to want to make sure that we incorporate the actual host details of the `LocalIndex`'s `S3DataStore` for this, since the S3 host is definitely not local.\n", 456 + " * Should there be underscores here? These feel like public properties ..." 
457 + ] 458 + }, 459 + { 460 + "cell_type": "markdown", 461 + "id": "90872fe7", 462 + "metadata": {}, 463 + "source": [ 464 + "---" 465 + ] 466 + }, 467 + { 468 + "cell_type": "code", 469 + "execution_count": 45, 470 + "id": "02abbcc2", 471 + "metadata": {}, 472 + "outputs": [ 473 + { 474 + "data": { 475 + "text/plain": [ 476 + "<atdata.dataset.Dataset at 0x114eb9160>" 477 + ] 478 + }, 479 + "execution_count": 45, 480 + "metadata": {}, 481 + "output_type": "execute_result" 482 + } 483 + ], 484 + "source": [ 485 + "ds" 486 + ] 487 + }, 488 + { 489 + "cell_type": "code", 490 + "execution_count": null, 491 + "id": "f0a50853", 492 + "metadata": {}, 493 + "outputs": [], 494 + "source": [] 495 + }, 496 + { 497 + "cell_type": "code", 498 + "execution_count": null, 499 + "id": "4a2736f0", 500 + "metadata": {}, 501 + "outputs": [ 502 + { 503 + "ename": "OSError", 504 + "evalue": "(\"((['curl', '--connect-timeout', '30', '--retry', '30', '--retry-delay', '2', '-f', '-s', '-L', 'https://f5bf77c06cb35b5136ff6d61ab4b7dbc.r2.cloudflarestorage.com/analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar'],), {'bufsize': 8192}): exit 22 (read) {}\", <webdataset.gopen.Pipe object at 0x11425e150>, 'https://f5bf77c06cb35b5136ff6d61ab4b7dbc.r2.cloudflarestorage.com/analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar')", 505 + "output_type": "error", 506 + "traceback": [ 507 + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", 508 + "\u001b[31mOSError\u001b[39m Traceback (most recent call last)", 509 + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[48]\u001b[39m\u001b[32m, line 11\u001b[39m\n\u001b[32m 4\u001b[39m ds = load_dataset( \u001b[33m\"\u001b[39m\u001b[33m@local/proto-text-samples-3\u001b[39m\u001b[33m\"\u001b[39m, TextSample,\n\u001b[32m 5\u001b[39m split = \u001b[33m'\u001b[39m\u001b[33mtrain\u001b[39m\u001b[33m'\u001b[39m,\n\u001b[32m 6\u001b[39m 
\u001b[38;5;66;03m#\u001b[39;00m\n\u001b[32m 7\u001b[39m index = index,\n\u001b[32m 8\u001b[39m )\n\u001b[32m 10\u001b[39m \u001b[38;5;66;03m# The index resolves the dataset name to URLs and schema\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m11\u001b[39m \u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mbatch\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mds\u001b[49m\u001b[43m.\u001b[49m\u001b[43mshuffled\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 12\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mbreak\u001b[39;49;00m\n", 510 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/pipeline.py:105\u001b[39m, in \u001b[36mDataPipeline.iterator\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 103\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m _ \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(\u001b[38;5;28mself\u001b[39m.repetitions):\n\u001b[32m 104\u001b[39m count = \u001b[32m0\u001b[39m\n\u001b[32m--> \u001b[39m\u001b[32m105\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43miterator1\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 106\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01myield\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\n\u001b[32m 107\u001b[39m \u001b[43m \u001b[49m\u001b[43mcount\u001b[49m\u001b[43m \u001b[49m\u001b[43m+\u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\n", 511 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/filters.py:520\u001b[39m, in \u001b[36m_map\u001b[39m\u001b[34m(data, f, 
handler)\u001b[39m\n\u001b[32m 505\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_map\u001b[39m(data, f, handler=reraise_exception):\n\u001b[32m 506\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 507\u001b[39m \u001b[33;03m Map samples through a function.\u001b[39;00m\n\u001b[32m 508\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 518\u001b[39m \u001b[33;03m Exception: If the handler doesn't handle an exception.\u001b[39;00m\n\u001b[32m 519\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m520\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 521\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mtry\u001b[39;49;00m\u001b[43m:\u001b[49m\n\u001b[32m 522\u001b[39m \u001b[43m \u001b[49m\u001b[43mresult\u001b[49m\u001b[43m \u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[43msample\u001b[49m\u001b[43m)\u001b[49m\n", 512 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/filters.py:358\u001b[39m, in \u001b[36m_shuffle\u001b[39m\u001b[34m(data, bufsize, initial, rng, seed, handler)\u001b[39m\n\u001b[32m 356\u001b[39m initial = \u001b[38;5;28mmin\u001b[39m(initial, bufsize)\n\u001b[32m 357\u001b[39m buf = []\n\u001b[32m--> \u001b[39m\u001b[32m358\u001b[39m \u001b[43m\u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 359\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mbuf\u001b[49m\u001b[43m.\u001b[49m\u001b[43mappend\u001b[49m\u001b[43m(\u001b[49m\u001b[43msample\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 360\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mbuf\u001b[49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[43m<\u001b[49m\u001b[43m \u001b[49m\u001b[43mbufsize\u001b[49m\u001b[43m:\u001b[49m\n", 513 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/tariterators.py:230\u001b[39m, in \u001b[36mgroup_by_keys\u001b[39m\u001b[34m(data, keys, lcase, suffixes, handler)\u001b[39m\n\u001b[32m 214\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Group tarfile contents by keys and yield samples.\u001b[39;00m\n\u001b[32m 215\u001b[39m \n\u001b[32m 216\u001b[39m \u001b[33;03mArgs:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 227\u001b[39m \u001b[33;03m Iterator over samples.\u001b[39;00m\n\u001b[32m 228\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 229\u001b[39m current_sample = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m230\u001b[39m \u001b[43m\u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mfilesample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 231\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01mtry\u001b[39;49;00m\u001b[43m:\u001b[49m\n\u001b[32m 232\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01massert\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43misinstance\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mfilesample\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mdict\u001b[39;49m\u001b[43m)\u001b[49m\n", 514 + "\u001b[36mFile 
\u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/tariterators.py:201\u001b[39m, in \u001b[36mtar_file_expander\u001b[39m\u001b[34m(data, handler, select_files, rename_files, eof_value)\u001b[39m\n\u001b[32m 199\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mException\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m exn:\n\u001b[32m 200\u001b[39m exn.args = exn.args + (source.get(\u001b[33m\"\u001b[39m\u001b[33mstream\u001b[39m\u001b[33m\"\u001b[39m), source.get(\u001b[33m\"\u001b[39m\u001b[33murl\u001b[39m\u001b[33m\"\u001b[39m))\n\u001b[32m--> \u001b[39m\u001b[32m201\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[43mhandler\u001b[49m\u001b[43m(\u001b[49m\u001b[43mexn\u001b[49m\u001b[43m)\u001b[49m:\n\u001b[32m 202\u001b[39m \u001b[38;5;28;01mcontinue\u001b[39;00m\n\u001b[32m 203\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n", 515 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/handlers.py:31\u001b[39m, in \u001b[36mreraise_exception\u001b[39m\u001b[34m(exn)\u001b[39m\n\u001b[32m 22\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mreraise_exception\u001b[39m(exn):\n\u001b[32m 23\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Re-raise the given exception.\u001b[39;00m\n\u001b[32m 24\u001b[39m \n\u001b[32m 25\u001b[39m \u001b[33;03m Args:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 29\u001b[39m \u001b[33;03m The input exception.\u001b[39;00m\n\u001b[32m 30\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m31\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m exn\n", 516 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/tariterators.py:184\u001b[39m, in \u001b[36mtar_file_expander\u001b[39m\u001b[34m(data, handler, select_files, rename_files, eof_value)\u001b[39m\n\u001b[32m 182\u001b[39m 
\u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(source, \u001b[38;5;28mdict\u001b[39m)\n\u001b[32m 183\u001b[39m \u001b[38;5;28;01massert\u001b[39;00m \u001b[33m\"\u001b[39m\u001b[33mstream\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m source\n\u001b[32m--> \u001b[39m\u001b[32m184\u001b[39m \u001b[43m\u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mtar_file_iterator\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 185\u001b[39m \u001b[43m \u001b[49m\u001b[43msource\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mstream\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 186\u001b[39m \u001b[43m \u001b[49m\u001b[43mhandler\u001b[49m\u001b[43m=\u001b[49m\u001b[43mhandler\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 187\u001b[39m \u001b[43m \u001b[49m\u001b[43mselect_files\u001b[49m\u001b[43m=\u001b[49m\u001b[43mselect_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 188\u001b[39m \u001b[43m \u001b[49m\u001b[43mrename_files\u001b[49m\u001b[43m=\u001b[49m\u001b[43mrename_files\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 189\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 190\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43;01massert\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43misinstance\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43msample\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mdict\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mand\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mdata\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mand\u001b[39;49;00m\u001b[43m 
\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mfname\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43msample\u001b[49m\n\u001b[32m 191\u001b[39m \u001b[43m \u001b[49m\u001b[43msample\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43m__url__\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[43murl\u001b[49m\n", 517 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/tariterators.py:128\u001b[39m, in \u001b[36mtar_file_iterator\u001b[39m\u001b[34m(fileobj, skip_meta, handler, select_files, rename_files)\u001b[39m\n\u001b[32m 109\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mtar_file_iterator\u001b[39m(\n\u001b[32m 110\u001b[39m fileobj: tarfile.TarFile,\n\u001b[32m 111\u001b[39m skip_meta: Optional[\u001b[38;5;28mstr\u001b[39m] = \u001b[33mr\u001b[39m\u001b[33m\"\u001b[39m\u001b[33m__[^/]*__($|/)\u001b[39m\u001b[33m\"\u001b[39m,\n\u001b[32m (...)\u001b[39m\u001b[32m 114\u001b[39m rename_files: Optional[Callable[[\u001b[38;5;28mstr\u001b[39m], \u001b[38;5;28mstr\u001b[39m]] = \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m 115\u001b[39m ) -> Iterator[Dict[\u001b[38;5;28mstr\u001b[39m, Any]]:\n\u001b[32m 116\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Iterate over tar file, yielding filename, content pairs for the given tar stream.\u001b[39;00m\n\u001b[32m 117\u001b[39m \n\u001b[32m 118\u001b[39m \u001b[33;03m Args:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 126\u001b[39m \u001b[33;03m A stream of samples.\u001b[39;00m\n\u001b[32m 127\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m128\u001b[39m stream = 
\u001b[43mtarfile\u001b[49m\u001b[43m.\u001b[49m\u001b[43mopen\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfileobj\u001b[49m\u001b[43m=\u001b[49m\u001b[43mfileobj\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmode\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mr|*\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 129\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m tarinfo \u001b[38;5;129;01min\u001b[39;00m stream:\n\u001b[32m 130\u001b[39m fname = tarinfo.name\n", 518 + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/tarfile.py:1883\u001b[39m, in \u001b[36mTarFile.open\u001b[39m\u001b[34m(cls, name, mode, fileobj, bufsize, **kwargs)\u001b[39m\n\u001b[32m 1880\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[33m\"\u001b[39m\u001b[33mmode must be \u001b[39m\u001b[33m'\u001b[39m\u001b[33mr\u001b[39m\u001b[33m'\u001b[39m\u001b[33m or \u001b[39m\u001b[33m'\u001b[39m\u001b[33mw\u001b[39m\u001b[33m'\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 1882\u001b[39m compresslevel = kwargs.pop(\u001b[33m\"\u001b[39m\u001b[33mcompresslevel\u001b[39m\u001b[33m\"\u001b[39m, \u001b[32m9\u001b[39m)\n\u001b[32m-> \u001b[39m\u001b[32m1883\u001b[39m stream = \u001b[43m_Stream\u001b[49m\u001b[43m(\u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfilemode\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcomptype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfileobj\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbufsize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1884\u001b[39m \u001b[43m \u001b[49m\u001b[43mcompresslevel\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1885\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m 1886\u001b[39m t = \u001b[38;5;28mcls\u001b[39m(name, filemode, stream, **kwargs)\n", 519 + "\u001b[36mFile 
\u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/tarfile.py:355\u001b[39m, in \u001b[36m_Stream.__init__\u001b[39m\u001b[34m(self, name, mode, comptype, fileobj, bufsize, compresslevel)\u001b[39m\n\u001b[32m 350\u001b[39m \u001b[38;5;28mself\u001b[39m._extfileobj = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 352\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m comptype == \u001b[33m'\u001b[39m\u001b[33m*\u001b[39m\u001b[33m'\u001b[39m:\n\u001b[32m 353\u001b[39m \u001b[38;5;66;03m# Enable transparent compression detection for the\u001b[39;00m\n\u001b[32m 354\u001b[39m \u001b[38;5;66;03m# stream interface\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m355\u001b[39m fileobj = \u001b[43m_StreamProxy\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfileobj\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 356\u001b[39m comptype = fileobj.getcomptype()\n\u001b[32m 358\u001b[39m \u001b[38;5;28mself\u001b[39m.name = name \u001b[38;5;129;01mor\u001b[39;00m \u001b[33m\"\u001b[39m\u001b[33m\"\u001b[39m\n", 520 + "\u001b[36mFile \u001b[39m\u001b[32m~/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/tarfile.py:583\u001b[39m, in \u001b[36m_StreamProxy.__init__\u001b[39m\u001b[34m(self, fileobj)\u001b[39m\n\u001b[32m 581\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, fileobj):\n\u001b[32m 582\u001b[39m \u001b[38;5;28mself\u001b[39m.fileobj = fileobj\n\u001b[32m--> \u001b[39m\u001b[32m583\u001b[39m \u001b[38;5;28mself\u001b[39m.buf = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mfileobj\u001b[49m\u001b[43m.\u001b[49m\u001b[43mread\u001b[49m\u001b[43m(\u001b[49m\u001b[43mBLOCKSIZE\u001b[49m\u001b[43m)\u001b[49m\n", 521 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/gopen.py:105\u001b[39m, in \u001b[36mPipe.read\u001b[39m\u001b[34m(self, *args, 
**kw)\u001b[39m\n\u001b[32m 95\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Wrap stream.read and checks status.\u001b[39;00m\n\u001b[32m 96\u001b[39m \n\u001b[32m 97\u001b[39m \u001b[33;03mArgs:\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 102\u001b[39m \u001b[33;03m The result of stream.read\u001b[39;00m\n\u001b[32m 103\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 104\u001b[39m result = \u001b[38;5;28mself\u001b[39m.stream.read(*args, **kw)\n\u001b[32m--> \u001b[39m\u001b[32m105\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mcheck_status\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 106\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m result\n", 522 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/gopen.py:77\u001b[39m, in \u001b[36mPipe.check_status\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 75\u001b[39m status = \u001b[38;5;28mself\u001b[39m.proc.poll()\n\u001b[32m 76\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m status \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m---> \u001b[39m\u001b[32m77\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mwait_for_child\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n", 523 + "\u001b[36mFile \u001b[39m\u001b[32m~/git-forecast/atdata/.venv/lib/python3.12/site-packages/webdataset/gopen.py:92\u001b[39m, in \u001b[36mPipe.wait_for_child\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 87\u001b[39m \u001b[38;5;28mprint\u001b[39m(\n\u001b[32m 88\u001b[39m \u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mpipe exit [\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m.status\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m 
\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mos.getpid()\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m:\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m.proc.pid\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m] \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m.args\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m \u001b[39m\u001b[38;5;132;01m{\u001b[39;00minfo\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m,\n\u001b[32m 89\u001b[39m file=sys.stderr,\n\u001b[32m 90\u001b[39m )\n\u001b[32m 91\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.status \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.ignore_status \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m.ignore_errors:\n\u001b[32m---> \u001b[39m\u001b[32m92\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIOError\u001b[39;00m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m.args\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m: exit \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m.status\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m (read) \u001b[39m\u001b[38;5;132;01m{\u001b[39;00minfo\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m\"\u001b[39m)\n", 524 + "\u001b[31mOSError\u001b[39m: (\"((['curl', '--connect-timeout', '30', '--retry', '30', '--retry-delay', '2', '-f', '-s', '-L', 'https://f5bf77c06cb35b5136ff6d61ab4b7dbc.r2.cloudflarestorage.com/analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar'],), {'bufsize': 8192}): exit 22 (read) {}\", <webdataset.gopen.Pipe object at 0x11425e150>, 'https://f5bf77c06cb35b5136ff6d61ab4b7dbc.r2.cloudflarestorage.com/analysis-hive/prototyping/data--2b5dd738-c9f6-46d4-8c31-e218531601be--000000.tar')" 525 + ] 526 + } 527 + ], 528 + "source": [ 529 + "from atdata import load_dataset\n", 530 + "\n", 531 + 
"# Load from local index\n", 532 + "ds = load_dataset( \"@local/proto-text-samples-3\", TextSample,\n", 533 + " split = 'train',\n", 534 + " #\n", 535 + " index = index,\n", 536 + ")\n", 537 + "\n", 538 + "# The index resolves the dataset name to URLs and schema\n", 539 + "for batch in ds.shuffled():\n", 540 + " break" 541 + ] 542 + }, 543 + { 544 + "cell_type": "markdown", 545 + "id": "3b30bb49", 546 + "metadata": {}, 547 + "source": [ 548 + "Notes:\n", 549 + "\n", 550 + "* This is also getting linting errors on `load_dataset` that there are no matching overloads." 551 + ] 552 + }, 553 + { 554 + "cell_type": "code", 555 + "execution_count": 36, 556 + "id": "5c2afcd2", 557 + "metadata": {}, 558 + "outputs": [ 559 + { 560 + "data": { 561 + "text/plain": [ 562 + "'s3://analysis-hive/prototyping/data--4a5ff662-803b-4700-81f4-45f288f6e565--000000.tar'" 563 + ] 564 + }, 565 + "execution_count": 36, 566 + "metadata": {}, 567 + "output_type": "execute_result" 568 + } 569 + ], 570 + "source": [ 571 + "ds.url" 572 + ] 573 + }, 574 + { 575 + "cell_type": "markdown", 576 + "id": "2e4238ba", 577 + "metadata": {}, 578 + "source": [ 579 + "Notes:\n", 580 + "\n", 581 + "* We're getting linting errors because of the protocol use for `AbstractIndex`; better to subclass, or is there a way for this to get the protocol adherence?\n", 582 + "* The S3 URI error is showing up here now because of how dataset loading works! The data is uploaded correctly on my end, but it can't be accessed because of this URI not being the correct way to access the data for `wds` streaming over `https`; we should think of how best to encode this!" 
583 + ] 584 + }, 585 + { 586 + "cell_type": "markdown", 587 + "id": "2bbedcd2", 588 + "metadata": {}, 589 + "source": [] 590 + } 591 + ], 592 + "metadata": { 593 + "kernelspec": { 594 + "display_name": "atdata", 595 + "language": "python", 596 + "name": "python3" 597 + }, 598 + "language_info": { 599 + "codemirror_mode": { 600 + "name": "ipython", 601 + "version": 3 602 + }, 603 + "file_extension": ".py", 604 + "mimetype": "text/x-python", 605 + "name": "python", 606 + "nbconvert_exporter": "python", 607 + "pygments_lexer": "ipython3", 608 + "version": "3.12.11" 609 + } 610 + }, 611 + "nbformat": 4, 612 + "nbformat_minor": 5 613 + }
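Sketch for the S3 URI note above: in the traceback, `curl -f` exits with status 22, which is what curl reports when the server returns an HTTP error (400 or above). That would be consistent with an unsigned request to the R2 endpoint being rejected, though the notebook output alone does not confirm the cause. One way to keep the `s3://` URI stored in the index and still stream with `wds` is to rewrite it at load time into a `pipe:` URL that shells out to the AWS CLI, which signs the request using whatever credentials are configured. The helper name `s3_to_pipe_url` is hypothetical and not part of `atdata`. The sketch assumes the `aws` CLI is installed and has working credentials for the bucket.

```python
import shlex
from typing import Optional
from urllib.parse import urlparse


def s3_to_pipe_url(s3_uri: str, endpoint_url: Optional[str] = None) -> str:
    """Rewrite an s3:// URI as a webdataset-style `pipe:` URL.

    Hypothetical helper (not part of atdata). Assumes the `aws` CLI is
    available on PATH and configured with credentials for the bucket.
    """
    parsed = urlparse(s3_uri)
    # Require the form s3://bucket/key; anything else is rejected early.
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not an s3://bucket/key URI: {s3_uri!r}")

    cmd = ["aws", "s3", "cp"]
    if endpoint_url is not None:
        # Needed for S3-compatible stores such as Cloudflare R2.
        cmd += ["--endpoint-url", endpoint_url]
    # A destination of "-" makes `aws s3 cp` stream the object to stdout.
    cmd += [s3_uri, "-"]

    # Quote each part, since the command string is run through a shell.
    return "pipe:" + " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    print(
        s3_to_pipe_url(
            "s3://analysis-hive/prototyping/"
            "data--4a5ff662-803b-4700-81f4-45f288f6e565--000000.tar",
            endpoint_url="https://f5bf77c06cb35b5136ff6d61ab4b7dbc.r2.cloudflarestorage.com",
        )
    )
```

The trade-off is that this pushes credential handling to the CLI environment instead of the index's S3 store settings. Presigned HTTPS URLs generated from the store's own credentials would be an alternative that avoids the CLI dependency.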
