
Add README with training split documentation; rebalance train data; fix val

README: detailed write-up of all training splits, label conventions, known
generalization gaps, and validation composition.

Training: cap mathwriting_synthetic→20k and crohme_gen_2019→15k so synthetic
data doesn't dominate. typeset_mixed rises to ~16%, real mathwriting becomes
the dominant handwriting source at 34%. Total: ~124k samples.

Validation: all typeset_val (1k) + all typeset_mixed_val (500) + 250 sampled
from mathwriting_val = 1,750 total.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

+170 -6
+142
README.md
# gemma-4-typst-ocr

Fine-tuning vision-language models to transcribe mathematical expressions and
document fragments into [Typst](https://typst.app) notation.

Two model targets are supported:

- **Gemma 4 E2B** (`src/train.py`) -- Unsloth QLoRA, faster iteration
- **DeepSeek-OCR-2** (`src/train_deepseek.py`) -- 3B model, stronger baseline OCR

---

## Training data

### Overview

The training set combines handwritten math datasets with synthetically rendered
Typst documents. No dataset contains handwritten document fragments (math
embedded in running text); generalization to that domain requires the model to
transfer handwriting recognition and document-structure understanding
simultaneously.

### Splits

#### Handwritten -- real / semi-real

| Split | Samples | Notes |
|---|---|---|
| `mathwriting_train` | ~42,880 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| `mathwriting_symbols` | ~168 | Isolated symbol images from MathWriting. |
| `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |

#### Handwritten -- synthetic

Fully synthetic images generated from expression grammars or stroke models.
Useful for coverage of rare expressions but further from real photos.

| Split | Samples (raw) | Cap | Notes |
|---|---|---|---|
| `mathwriting_synthetic` | ~78,000 | **20,000** | Synthetic stroke-rendered images from MathWriting grammar. |
| `crohme_gen_2019` | ~40,000 | **15,000** | Generated CROHME-style images, 2019 grammar. |
| `crohme_gen_2023` | ~2,682 | none | Generated CROHME-style images, 2023 grammar. |
| `crohme_gen_syntactic` | ~15,653 | none | Syntactically diverse generated CROHME expressions. |

Caps are applied in `train_deepseek.py` to prevent synthetic data from
dominating training. After capping the effective mix (~124k samples total) is:

- Real/semi-real handwriting: ~43k (34%)
- Synthetic handwriting: ~53k (43%)
- Typeset document fragments: ~20k (16%)
- Typeset single equations: ~8k (6%)

#### Typeset -- single equations

| Split | Samples | Notes |
|---|---|---|
| `typeset_train` | ~8,000 | Typst-rendered single math expressions; targets are bare `$ ... $` display math. |

#### Typeset -- document fragments (mixed content)

The most important split for document-level generalization. Each sample is a
rendered Typst page fragment containing inline math embedded in prose, lists,
or tables. Page widths are randomized to cover both wide single-column and
narrow two-column layouts. Text spans may be bold, italic, or underlined.

| Split | Samples | Notes |
|---|---|---|
| `typeset_mixed_train` | ~20,000 | Paragraphs, multi-paragraph blocks, bullet/numbered lists, para+list combinations, and tables. Reflowable content rendered at 7 different widths (200--480 pt). |

Body types and their generation weights (see `src/generate_mixed.py`):

- **Tables** (15%): grid layout with inline math cells
- **Multi-paragraph** (15%): 2--4 paragraphs each with inline math
- **Para + list / list + para** (10%): mixed prose and enumeration
- **List** (18%): bullet or numbered list with math items
- **Inline sequence** (42%): single paragraph of mixed text and math tokens

Math content within mixed bodies includes fractions, integrals, sums, limits,
derivatives, matrices, polynomials (including operator/symbolic forms), and
schematic $n \times m$ matrices with ellipsis notation.
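As a rough illustration of the weighting scheme, body-type sampling could look like the following minimal sketch. The names and structure here are illustrative, not taken from `src/generate_mixed.py` (which is not shown in this commit); only the weights come from the list above:

```python
import random

# Body types and generation weights from the table above.
# Dict layout and function shape are assumptions for illustration.
BODY_TYPES = {
    "table": 0.15,
    "multi_paragraph": 0.15,
    "para_list": 0.10,
    "list": 0.18,
    "inline_sequence": 0.42,
}

def sample_body_type(rng: random.Random) -> str:
    """Pick one body type according to the documented generation weights."""
    names = list(BODY_TYPES)
    weights = list(BODY_TYPES.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Empirical frequencies over many draws should track the configured weights.
rng = random.Random(0)
counts = {name: 0 for name in BODY_TYPES}
for _ in range(10_000):
    counts[sample_body_type(rng)] += 1
```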
### Label conventions

- **Math-only splits** (`typeset_train`, `mathwriting_*`, `crohme_*`): manifest
  stores bare math expressions. `data.load_records()` wraps these as
  `$ ... $` at load time so every training target is valid Typst.
- **Mixed splits** (`typeset_mixed_*`): manifest stores complete body content
  with inline `$...$` delimiters already present. No wrapping applied.

### Validation

| Split | Samples used | Notes |
|---|---|---|
| `mathwriting_val` | 250 (sampled) | Covers real handwritten single equations. |
| `typeset_val` | ~1,000 (all) | Typeset single equations. |
| `typeset_mixed_val` | ~500 (all) | Document fragments; primary signal for layout generalization. |
| **Total** | **~1,750** | |

### Known gaps

- No handwritten document fragments exist in the training set. The model must
  transfer handwriting recognition (learned from single-equation data) and
  document-structure understanding (learned from typeset mixed data) jointly at
  inference time.
- No hybrid handwritten+typeset samples (e.g. a printed form with handwritten
  fill-in).
- `crohme_real_train` has only 9 samples -- real handwriting at document scale
  is essentially absent.

---

## Setup

```bash
uv sync
```

### Generate data

```bash
uv run generate-typeset   # typeset_train, typeset_val, typeset_test
uv run generate-mixed     # typeset_mixed_{train,val,test}
```

### Train

```bash
# DeepSeek-OCR-2 (recommended for 12 GB VRAM)
uv run train-deepseek --smoke-test   # validate forward+backward first
uv run train-deepseek

# Gemma 4 E2B (via Unsloth)
uv run train
```

### Evaluate

```bash
uv run evaluate
uv run probe-deepseek --n 10
```
+28 -6
src/train_deepseek.py
```diff
  model = get_peft_model(model, lora_cfg)
  model.print_trainable_parameters()

- train_records = load_records(TRAIN_SPLITS, dedupe=True)
- val_records = load_records(VAL_SPLITS, dedupe=False)
-
  import random as _random
  _rng = _random.Random(42)
- if len(val_records) > 500:
-     val_records = _rng.sample(val_records, 500)

- print(f"Train: {len(train_records):,} Val: {len(val_records):,}")
+ # Train: load each split individually, apply caps, then combine.
+ # Caps prevent synthetic-heavy splits from dominating.
+ # Policy: cap synthetics (mathwriting_synthetic, crohme_gen_2019); keep
+ # real and document-structure splits uncapped.
+ _TRAIN_CAPS = {
+     "mathwriting_synthetic": 20_000,
+     "crohme_gen_2019": 15_000,
+ }
+ train_records: list[dict] = []
+ for split in TRAIN_SPLITS:
+     recs = load_records([split], dedupe=True)
+     cap = _TRAIN_CAPS.get(split)
+     if cap and len(recs) > cap:
+         recs = _rng.sample(recs, cap)
+     train_records.extend(recs)
+ _rng.shuffle(train_records)
+
+ # Val: all typeset_val + all typeset_mixed_val + 250 from mathwriting_val.
+ _rng = _random.Random(42)
+ mw_val = load_records(["mathwriting_val"], dedupe=False)
+ typeset_val = load_records(["typeset_val"], dedupe=False)
+ mixed_val = load_records(["typeset_mixed_val"], dedupe=False)
+ mw_sample = _rng.sample(mw_val, min(250, len(mw_val)))
+ val_records = mw_sample + typeset_val + mixed_val
+ _rng.shuffle(val_records)
+
+ print(f"Train: {len(train_records):,} Val: {len(val_records):,} "
+       f"(mathwriting={len(mw_sample)}, typeset={len(typeset_val)}, mixed={len(mixed_val)})")

  train_ds = make_dataset(train_records, do_augment=True)
  val_ds = make_dataset(val_records, do_augment=False)
```
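For reference, the `$ ... $` wrapping that `data.load_records()` applies to math-only splits (per the README's label conventions) might look like this minimal sketch. The function name, the split-prefix check, and the manifest layout are assumptions; only the wrapping rule itself comes from the documentation:

```python
MIXED_PREFIX = "typeset_mixed"  # assumption: mixed splits share this prefix

def wrap_target(split: str, label: str) -> str:
    """Ensure every training target is valid Typst.

    Math-only manifests store bare expressions, so they are wrapped as
    display math; mixed manifests already contain complete bodies with
    inline $...$ delimiters and are left untouched.
    """
    if split.startswith(MIXED_PREFIX):
        # Mixed bodies already carry their own delimiters.
        return label
    # Bare math expression: wrap as Typst display math.
    return f"$ {label} $"

wrap_target("mathwriting_train", "x^2 + y^2 = r^2")   # -> "$ x^2 + y^2 = r^2 $"
wrap_target("typeset_mixed_train", "Let $x in RR$.")  # -> unchanged
```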