# HuggingFace Datasets - Architecture Overview

Source: https://huggingface.co/docs/datasets/en/about_dataset_load

## How load_dataset Works (ELI5)

A dataset is a directory that contains:
- Data files in generic formats (JSON, CSV, Parquet, text, etc.)
- A dataset card (`README.md`) with documentation and YAML configuration

`load_dataset()` fetches the requested dataset locally or from the Hugging Face Hub.

### Automatic Format Detection

If the dataset only contains data files, `load_dataset()` automatically infers how to load them based on file extensions. Under the hood, it uses an appropriate `DatasetBuilder`:

| Format | Builder Class |
|--------|---------------|
| `.txt` | `datasets.packaged_modules.text.Text` |
| `.csv`, `.tsv` | `datasets.packaged_modules.csv.Csv` |
| `.json`, `.jsonl` | `datasets.packaged_modules.json.Json` |
| `.parquet` | `datasets.packaged_modules.parquet.Parquet` |
| `.arrow` | `datasets.packaged_modules.arrow.Arrow` |
| SQL | `datasets.packaged_modules.sql.Sql` |
| Image folders | `datasets.packaged_modules.imagefolder.ImageFolder` |
| Audio folders | `datasets.packaged_modules.audiofolder.AudioFolder` |
| WebDataset TAR | `datasets.packaged_modules.webdataset.WebDataset` |
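
The extension-based dispatch in the table can be pictured as a plain lookup. The sketch below is an illustration only: `EXTENSION_TO_BUILDER` and `infer_builder_name` are hypothetical names, not part of the library, and the real resolution logic handles more cases (compressed files, folders, metadata files).

```python
from pathlib import PurePosixPath

# Hypothetical lookup table mirroring the format table above; this is not
# the library's own registry, only an illustration.
EXTENSION_TO_BUILDER = {
    ".txt": "text",
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".tar": "webdataset",
}


def infer_builder_name(filename: str) -> str:
    """Return the builder name suggested by the file extension."""
    suffix = PurePosixPath(filename).suffix.lower()
    try:
        return EXTENSION_TO_BUILDER[suffix]
    except KeyError:
        raise ValueError(f"No builder known for extension {suffix!r}") from None


print(infer_builder_name("data/train.jsonl"))   # json
print(infer_builder_name("shards/part-0.tar"))  # webdataset
```

The returned names match the builder names passed as the first argument to `load_dataset()` elsewhere in these notes (`"csv"`, `"json"`, `"parquet"`, and so on).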

---

## Building a Dataset

Two main classes are responsible for building a dataset:

### BuilderConfig

Configuration class containing dataset attributes:

| Attribute | Description |
|-----------|-------------|
| `name` | Short name of the dataset |
| `version` | Dataset version identifier |
| `data_dir` | Path to local folder containing data files |
| `data_files` | Paths to local data files |
| `description` | Description of the dataset |

Custom attributes (like class labels) can be added by subclassing `BuilderConfig`.

Configuration can be populated:
1. Via predefined `BuilderConfig` instances in `DatasetBuilder.BUILDER_CONFIGS`
2. Via keyword arguments to `load_dataset()` (these override the predefined values)
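
The precedence rule (keyword arguments win over predefined values) can be sketched without the library. Everything here is hypothetical: `SketchConfig`, `PREDEFINED`, and `resolve_config` only stand in for `BuilderConfig`, `BUILDER_CONFIGS`, and the merging that `load_dataset()` performs.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a BuilderConfig; not the library class."""
    name: str = "default"
    version: str = "0.0.0"
    data_dir: Optional[str] = None
    description: str = ""


# Predefined configurations, playing the role of BUILDER_CONFIGS.
PREDEFINED = {
    "small": SketchConfig(name="small", version="1.0.0", data_dir="data/small"),
    "full": SketchConfig(name="full", version="1.0.0", data_dir="data/full"),
}


def resolve_config(name: str, **overrides) -> SketchConfig:
    """Start from a predefined config, then let keyword arguments win."""
    return replace(PREDEFINED[name], **overrides)


config = resolve_config("small", data_dir="/mnt/custom")
print(config.data_dir)  # /mnt/custom
print(config.version)   # 1.0.0
```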

### DatasetBuilder

Accesses `BuilderConfig` attributes to build the actual dataset.

Three main methods:

#### 1. `_info()` - Define dataset attributes

- Defines dataset attributes returned by `dataset.info`
- Specifies `Features` (schema with column names and types)

#### 2. `_split_generators()` - Organize data files

- Downloads or retrieves data files
- Uses `DownloadManager` for downloading/extracting
- Organizes files into splits via `SplitGenerator`
- Each `SplitGenerator` carries the keyword arguments passed to `_generate_examples()`

#### 3. `_generate_examples()` - Parse and yield examples

- Reads and parses data files for each split
- Yields examples as Python dicts matching the schema
- Uses a Python generator (memory efficient)
- Examples are buffered in `ArrowWriter` before being written to disk
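
The three steps can be imitated in plain Python to show how they hand off to each other. This is a sketch under stated assumptions, not the library's implementation: `RAW_FILES`, `info`, `split_generators`, `generate_examples`, and `build` are hypothetical names, the schema is a simple dict rather than `Features`, and a list stands in for the Arrow file on disk.

```python
import io
import json
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory "files" standing in for downloaded data.
RAW_FILES = {
    "train.jsonl": '{"text": "good", "label": 1}\n{"text": "bad", "label": 0}\n',
    "test.jsonl": '{"text": "fine", "label": 1}\n',
}


def info() -> Dict[str, str]:
    """Step 1: declare the schema (column name -> type name)."""
    return {"text": "string", "label": "int64"}


def split_generators() -> Dict[str, Dict[str, str]]:
    """Step 2: map each split to the kwargs its example generator needs."""
    return {"train": {"filename": "train.jsonl"}, "test": {"filename": "test.jsonl"}}


def generate_examples(filename: str) -> Iterator[Tuple[int, dict]]:
    """Step 3: parse one file lazily and yield (key, example) pairs."""
    for key, line in enumerate(io.StringIO(RAW_FILES[filename])):
        yield key, json.loads(line)


def build(buffer_size: int = 2) -> Dict[str, List[dict]]:
    """Drive the three steps, flushing examples in small batches."""
    schema = info()
    written: Dict[str, List[dict]] = {}
    for split, gen_kwargs in split_generators().items():
        rows, buffer = [], []
        for _, example in generate_examples(**gen_kwargs):
            assert set(example) == set(schema), "example does not match schema"
            buffer.append(example)
            if len(buffer) >= buffer_size:
                rows.extend(buffer)  # stands in for a write to disk
                buffer.clear()
        rows.extend(buffer)  # flush the remainder
        written[split] = rows
    return written


dataset = build()
print({split: len(rows) for split, rows in dataset.items()})  # {'train': 2, 'test': 1}
```

Because `generate_examples` is a generator, only `buffer_size` examples are held in the buffer at any time before they are flushed.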

---

## Data Flow

```
load_dataset("name", split="train")
                    │
                    ▼
┌───────────────────────────────────────┐
│ 1. Resolve dataset path               │
│    - Hub repo? Local dir? Builder?    │
└───────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│ 2. Load DatasetBuilder                │
│    - Auto-detect format               │
│    - Apply BuilderConfig              │
└───────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│ 3. Download & prepare (if not cached) │
│    - _split_generators() downloads    │
│    - _generate_examples() yields      │
│    - Arrow tables cached to disk      │
└───────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────┐
│ 4. Load from cache                    │
│    - Memory-map Arrow files           │
│    - Return Dataset/DatasetDict       │
└───────────────────────────────────────┘
```

---

## Caching

- Datasets are cached as Arrow tables in `~/.cache/huggingface/datasets`
- Subsequent loads use the cache (fast!)
- The cache location can be changed with the `cache_dir` parameter
- `download_mode` controls cache behavior:
  - `REUSE_DATASET_IF_EXISTS` (default): Use cache if available
  - `FORCE_REDOWNLOAD`: Re-download everything
  - `REUSE_CACHE_IF_EXISTS`: Reuse cache for downloads but regenerate dataset
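
The three modes differ only in which of the two cached artifacts (raw downloads, prepared dataset) gets reused. A small sketch of that decision table follows; `Mode`, `Plan`, and `plan_for` are hypothetical names used for illustration and are not the library's own classes.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    """Hypothetical mirror of the three download modes listed above."""
    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


class Plan(NamedTuple):
    reuse_downloads: bool  # keep previously downloaded raw files
    reuse_dataset: bool    # keep the previously prepared Arrow files


def plan_for(mode: Mode) -> Plan:
    """Translate a mode into what may be reused from the cache."""
    if mode is Mode.REUSE_DATASET_IF_EXISTS:
        return Plan(reuse_downloads=True, reuse_dataset=True)
    if mode is Mode.REUSE_CACHE_IF_EXISTS:
        return Plan(reuse_downloads=True, reuse_dataset=False)
    return Plan(reuse_downloads=False, reuse_dataset=False)


for mode in Mode:
    print(mode.name, plan_for(mode))
```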

---

## Streaming Mode

With `streaming=True`:
- No downloading or caching
- Data streamed on-the-fly during iteration
- Returns `IterableDataset` instead of `Dataset`
- Best for large datasets

```python
from datasets import load_dataset

ds = load_dataset("large_dataset", split="train", streaming=True)
for example in ds:
    process(example)  # process() is a placeholder; examples are fetched as needed
```

---

## Integrity Verification

`load_dataset()` verifies downloaded data:
- Number of splits in generated `DatasetDict`
- Number of samples in each split
- List of downloaded files
- SHA256 checksums (disabled by default)

Disable with `verification_mode="no_checks"` if needed.
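
To make the checksum step concrete, here is a minimal sketch of SHA256 verification using only the standard library. It is not how the library stores or compares its records; `sha256_of`, `verify`, and the in-memory `files` dict are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def verify(files: Dict[str, bytes], expected: Dict[str, str]) -> List[str]:
    """Return the names whose checksum is missing or does not match."""
    return sorted(
        name
        for name, data in files.items()
        if expected.get(name) != sha256_of(data)
    )


# Hypothetical payloads standing in for downloaded files.
files = {"train.csv": b"a,b\n1,2\n", "test.csv": b"a,b\n3,4\n"}
recorded = {name: sha256_of(data) for name, data in files.items()}

print(verify(files, recorded))        # []
files["test.csv"] = b"a,b\n3,5\n"     # simulate a corrupted download
print(verify(files, recorded))        # ['test.csv']
```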

---

## Key Design Patterns for atdata Integration

### Pattern 1: Path Resolution
HF Datasets supports multiple path types:
- Hub repository: `"username/dataset"`
- Local directory: `"./path/to/data"`
- Builder name: `"parquet"` with `data_files`

### Pattern 2: Split Handling
- `split=None` → `DatasetDict` with all splits
- `split="train"` → Single `Dataset`
- Split string algebra: `"train+test"`, `"train[:10%]"`
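
For an integration that wants to accept the same split strings, the algebra reduces to resolving each term to a row range. The parser below is a rough sketch, not the library's parser: `resolve` and `_to_index` are hypothetical, percentages are rounded with Python's built-in `round`, and the library's own rounding rules may differ.

```python
import re
from typing import Dict, List, Optional, Tuple

# One term: a split name with an optional [start:stop] slice, where each
# bound is an integer or a percentage and may be negative.
_TERM = re.compile(r"^(?P<split>\w+)(?:\[(?P<start>-?\d+%?)?:(?P<stop>-?\d+%?)?\])?$")


def _to_index(bound: Optional[str], num_rows: int) -> Optional[int]:
    """Convert '10', '-5', or '10%' to a row index (None means open-ended)."""
    if bound is None:
        return None
    if bound.endswith("%"):
        return round(int(bound[:-1]) * num_rows / 100)
    return int(bound)


def resolve(spec: str, sizes: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Resolve 'a[:10%]+b' into (split, first_row, end_row) triples."""
    resolved = []
    for term in spec.split("+"):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"Cannot parse split term: {term!r}")
        num_rows = sizes[match["split"]]
        start = _to_index(match["start"], num_rows)
        stop = _to_index(match["stop"], num_rows)
        first, end, _ = slice(start, stop).indices(num_rows)
        resolved.append((match["split"], first, end))
    return resolved


sizes = {"train": 1000, "test": 200}  # hypothetical split sizes
print(resolve("train+test", sizes))                # [('train', 0, 1000), ('test', 0, 200)]
print(resolve("train[:10%]+train[-80%:]", sizes))  # [('train', 0, 100), ('train', 200, 1000)]
```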

### Pattern 3: Lazy Loading
- Streaming mode for large datasets
- Generator-based iteration
- Buffer-based shuffling
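
Buffer-based shuffling approximates a full shuffle while holding only a fixed number of items in memory. The sketch below shows the general technique with the standard library; `buffered_shuffle` is a hypothetical helper and is not claimed to match the library's exact implementation.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item...
        buffer[index] = item     # ...and put the new item in its slot
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(sorted(shuffled) == list(range(20)))  # True
```

A larger buffer gives an order closer to a uniform shuffle at the cost of more memory; with a buffer of size 1 the input order is preserved.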

### Pattern 4: Format Abstraction
- Single API for multiple formats
- Auto-detection based on file extensions
- Builder-specific configuration via kwargs

### Pattern 5: Type System
- `Features` schema defines column types
- Automatic type inference with override capability
- Special types for media (Audio, Image, Video)
+308
.reference/huggingface-datasets/loading-guide.md
# HuggingFace Datasets - Loading Guide

Source: https://huggingface.co/docs/datasets/en/loading

## Overview

Data can be loaded from multiple sources:
- The Hugging Face Hub
- Local files (CSV, JSON, Parquet, etc.)
- In-memory data (dicts, lists, generators, DataFrames)
- SQL databases
- Remote URLs

---

## Loading from Hugging Face Hub

```python
from datasets import load_dataset

# Basic usage
dataset = load_dataset("lhoestq/demo1")

# Specific version (git tag, branch, or commit)
dataset = load_dataset("lhoestq/custom_squad", revision="main")

# Map data files to splits
data_files = {"train": "train.csv", "test": "test.csv"}
dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)

# Load subset of files with patterns
c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

# Load from subdirectory
c4_subset = load_dataset("allenai/c4", data_dir="en")
```

---

## Loading Local Files

### CSV

```python
from datasets import load_dataset

# Single file
dataset = load_dataset("csv", data_files="my_file.csv")

# Multiple files
dataset = load_dataset("csv", data_files=["file1.csv", "file2.csv"])

# With split mapping
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
```

### JSON

```python
from datasets import load_dataset

# Standard JSON lines format (one object per line)
dataset = load_dataset("json", data_files="my_file.json")

# Nested JSON with field parameter
# File: {"version": "0.1.0", "data": [{"a": 1}, {"a": 2}]}
dataset = load_dataset("json", data_files="my_file.json", field="data")

# Remote JSON
base_url = "https://example.com/data/"
dataset = load_dataset("json", data_files={
    "train": base_url + "train.json",
    "validation": base_url + "dev.json"
}, field="data")
```

### Parquet

```python
from datasets import load_dataset

# Local
dataset = load_dataset("parquet", data_files={'train': 'train.parquet', 'test': 'test.parquet'})

# Remote
base_url = "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ab/"
data_files = {"train": base_url + "train-00000-of-00001.parquet"}
wiki = load_dataset("parquet", data_files=data_files, split="train")
```

### Arrow

```python
from datasets import Dataset, load_dataset

# Via load_dataset
dataset = load_dataset("arrow", data_files={'train': 'train.arrow'})

# Direct memory mapping (faster, no cache)
dataset = Dataset.from_file("data.arrow")
```

### Text

```python
dataset = load_dataset("text", data_files="my_file.txt")
```

### WebDataset (TAR archives)

```python
from datasets import load_dataset

# Best used with streaming for large datasets
path = "path/to/train/*.tar"
dataset = load_dataset("webdataset", data_files={"train": path}, split="train", streaming=True)

# Remote WebDataset
base_url = "https://example.com/dataset/"
urls = [base_url + f"shard-{i:06d}.tar" for i in range(4)]
dataset = load_dataset("webdataset", data_files={"train": urls}, split="train", streaming=True)
```

### HDF5

```python
dataset = load_dataset("hdf5", data_files="data.h5")
```

### SQL Databases

```python
from datasets import Dataset

# Load entire table
dataset = Dataset.from_sql("data_table_name", con="sqlite:///sqlite_file.db")

# Load from query
dataset = Dataset.from_sql(
    "SELECT text FROM table WHERE length(text) > 100 LIMIT 10",
    con="sqlite:///sqlite_file.db"
)
```

---

## Loading In-Memory Data

### Python Dictionary

```python
from datasets import Dataset

my_dict = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
dataset = Dataset.from_dict(my_dict)
```

### Python List of Dictionaries

```python
from datasets import Dataset

my_list = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}]
dataset = Dataset.from_list(my_list)
```

### Python Generator

```python
from datasets import Dataset, IterableDataset

# For data larger than memory
def my_gen():
    for i in range(1, 1000000):
        yield {"a": i, "text": f"example {i}"}

dataset = Dataset.from_generator(my_gen)

# Sharded generator for distributed processing
def gen(shards):
    for shard in shards:
        with open(shard) as f:
            for line in f:
                yield {"line": line}

shards = [f"data{i}.txt" for i in range(32)]
ds = IterableDataset.from_generator(gen, gen_kwargs={"shards": shards})
ds = ds.shuffle(seed=42, buffer_size=10_000)
```

### Pandas DataFrame

```python
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
dataset = Dataset.from_pandas(df)
```

---

## Multiprocessing

Speed up loading with multiple processes:

```python
from datasets import load_dataset

# Each process handles a subset of shards
imagenet = load_dataset("timm/imagenet-1k-wds", num_proc=8)
```

---

## Slicing Splits

### String API

```python
import datasets

# Concatenate splits
train_test_ds = datasets.load_dataset("dataset_name", split="train+test")

# Select rows by index
train_10_20_ds = datasets.load_dataset("dataset_name", split="train[10:20]")

# Select by percentage
train_10pct_ds = datasets.load_dataset("dataset_name", split="train[:10%]")

# Combine percentage slices
train_10_80pct_ds = datasets.load_dataset("dataset_name", split="train[:10%]+train[-80%:]")

# Cross-validation splits
val_ds = datasets.load_dataset("dataset_name",
    split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)])
train_ds = datasets.load_dataset("dataset_name",
    split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])
```
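
The cross-validation comprehensions above are easier to read once expanded. This snippet only builds and prints the split strings, so it runs without the library; the variable names `val_splits` and `train_splits` are chosen here for illustration.

```python
# Ten validation windows of 10% each, and the matching training remainder.
val_splits = [f"train[{k}%:{k + 10}%]" for k in range(0, 100, 10)]
train_splits = [f"train[:{k}%]+train[{k + 10}%:]" for k in range(0, 100, 10)]

for val, train in list(zip(val_splits, train_splits))[:2]:
    print(val, "|", train)
# train[0%:10%] | train[:0%]+train[10%:]
# train[10%:20%] | train[:10%]+train[20%:]
```

Passing a list of split strings to `load_dataset()` returns one dataset per entry, so fold `i` pairs `val_ds[i]` with `train_ds[i]`.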

### ReadInstruction API

```python
import datasets

# Concatenate
ri = datasets.ReadInstruction("train") + datasets.ReadInstruction("test")
train_test_ds = datasets.load_dataset("dataset_name", split=ri)

# Percentage with rounding control
ri = datasets.ReadInstruction("train", from_=50, to=52, unit="%", rounding="pct1_dropremainder")
train_50_52_ds = datasets.load_dataset("dataset_name", split=ri)
```

---

## Specifying Features

Override auto-inferred features:

```python
from datasets import load_dataset, Features, Value, ClassLabel

# Define custom features
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
emotion_features = Features({
    'text': Value('string'),
    'label': ClassLabel(names=class_names)
})

# Apply when loading
dataset = load_dataset('csv', data_files='data.csv', features=emotion_features)

# Verify
print(dataset['train'].features)
# {'text': Value('string'), 'label': ClassLabel(names=['sadness', 'joy', ...])}
```

---

## Offline Mode

Use cached datasets without internet:

```bash
# Set environment variable
export HF_HUB_OFFLINE=1
```

```python
from datasets import load_dataset

# Will use cache only
dataset = load_dataset("dataset_name")
```

---

## Image/Audio/Video Datasets

### ImageFolder

```python
# Directory structure: images/{class_name}/{image_file}
dataset = load_dataset("imagefolder", data_dir="path/to/images", split="train")
```

### AudioFolder

```python
dataset = load_dataset("audiofolder", data_dir="path/to/audio", split="train")
```

### VideoFolder

```python
dataset = load_dataset("videofolder", data_dir="path/to/videos", split="train")
```