Update README: reflect hw_* splits, new body grammar weights, updated data mix

+63 -74

1 changed file

README.md

+63 -74

@@ -12,10 +12,9 @@
 ### Overview
 
 The training set combines handwritten math datasets with synthetically rendered
-Typst documents. No dataset contains handwritten document fragments (math
-embedded in running text); generalization to that domain requires the model to
-transfer handwriting recognition and document-structure understanding
-simultaneously.
+Typst documents. Handwriting-font splits (`hw_*`) cover both single-equation
+and structured document content rendered in 6 diverse fonts plus the Typst
+default, sampled uniformly.
 
 ### Splits
 
@@ -23,110 +22,92 @@
 
 | Split | Samples | Notes |
 |---|---|---|
-| `mathwriting_train` | ~42,880 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
-| `mathwriting_symbols` | ~168 | Isolated symbol images from MathWriting. |
+| `mathwriting_train` | ~143,096 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
+| `mathwriting_symbols` | ~6,091 | Isolated symbol images from MathWriting. |
 | `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |
 
 #### Handwritten -- synthetic
 
 Fully synthetic images generated from expression grammars or stroke models.
-Useful for coverage of rare expressions but further from real photos.
 
 | Split | Samples (raw) | Cap | Notes |
 |---|---|---|---|
-| `mathwriting_synthetic` | ~78,000 | **20,000** | Synthetic stroke-rendered images from MathWriting grammar. |
-| `crohme_gen_2019` | ~40,000 | **15,000** | Generated CROHME-style images, 2019 grammar. |
-| `crohme_gen_2023` | ~2,682 | none | Generated CROHME-style images, 2023 grammar. |
-| `crohme_gen_syntactic` | ~15,653 | none | Syntactically diverse generated CROHME expressions. |
+| `mathwriting_synthetic` | ~85,879 | **20,000** | Synthetic stroke-rendered images from MathWriting grammar. |
+| `crohme_gen_2019` | ~51,855 | **15,000** | Generated CROHME-style images, 2019 grammar. |
+| `crohme_gen_2023` | ~3,072 | none | Generated CROHME-style images, 2023 grammar. |
+| `crohme_gen_syntactic` | ~69,397 | **15,000** | Syntactically diverse generated CROHME expressions. |
 
-Caps are applied in `train.py` to prevent synthetic data from
-dominating training. After capping the effective mix is:
+#### Typeset / handwriting-font -- document fragments
 
-- Real/semi-real handwriting: ~43k (34%)
-- Synthetic handwriting: ~33k (27%)
-- Typeset document content: ~28k (23%)
-- Typeset single equations: ~8k (6%)
-- Small real splits: ~3k (2%)
-
-#### Typeset -- single equations
+Each sample is a rendered Typst page fragment containing inline math embedded
+in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts
+(Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or
+the Typst default (New Computer Modern), sampled uniformly (~14% default).
 
 | Split | Samples | Notes |
 |---|---|---|
-| `typeset_train` | ~8,000 | Typst-rendered single math expressions; targets are bare `$ ... $` display math. |
+| `hw_structured_train` | 15,000 | Whole document uses one uniformly-sampled font. |
+| `hw_mixed_train` | 10,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
 
-#### Typeset -- document fragments (mixed content)
-
-The most important split for document-level generalization. Each sample is a
-rendered Typst page fragment containing inline math embedded in prose, lists,
-or tables. Page widths are randomized to cover both wide single-column and
-narrow two-column layouts. Text spans may be bold, italic, or underlined.
+Body types and generation weights (see `src/generate_mixed.py`):
 
-| Split | Samples | Notes |
-|---|---|---|
-| `typeset_mixed_train` | ~20,000 | Paragraphs, multi-paragraph blocks, bullet/numbered lists, para+list combinations, and tables. Reflowable content rendered at 7 different widths (200--480 pt). |
-
-Body types and their generation weights (see `src/generate_mixed.py`):
-
-- **Tables** (15%): grid layout with inline math cells
-- **Multi-paragraph** (15%): 2--4 paragraphs each with inline math
-- **Para + list / list + para** (10%): mixed prose and enumeration
-- **List** (18%): bullet or numbered list with math items
-- **Inline sequence** (42%): single paragraph of mixed text and math tokens
+- **Bare math** (18%): single `$ expr $` -- lowest complexity, bridges to single-equation data
+- **Short inline** (15%): 1--2 tokens (math, text, or mixed)
+- **Longer inline** (23%): 3--7 tokens
+- **List** (12%): 2--5 bullet/numbered items with math content
+- **Para + list / list + para** (8%): prose introduction or conclusion around a list
+- **Multi-paragraph** (12%): 2--4 paragraphs separated by blank lines
+- **Table** (12%): 2--4 columns, 2--5 rows, inline math cells
 
-Math content within mixed bodies includes fractions, integrals, sums, limits,
-derivatives, matrices, polynomials (including operator/symbolic forms), and
-schematic $n \times m$ matrices with ellipsis notation.
+Page widths randomized (200--480 pt) for multi-block bodies. Text spans may be
+bold, italic, or underlined. Ink colour sampled from near-black palette.
 
 ### Label conventions
 
-- **Math-only splits** (`typeset_train`, `mathwriting_*`, `crohme_*`): manifest
-  stores bare math expressions. `data.load_records()` wraps these as
-  `$ ... $` at load time so every training target is valid Typst.
-- **Mixed splits** (`typeset_mixed_*`): manifest stores complete body content
-  with inline `$...$` delimiters already present. No wrapping applied.
+- **Math-only splits** (`mathwriting_*`, `crohme_*`): manifest stores bare math
+  expressions. `data.load_records()` wraps these as `$ ... $` at load time.
+- **hw_* splits**: manifest stores complete body content with inline `$...$`
+  delimiters already present. No wrapping applied.
 
 ### Effective training mix (after caps)
 
 | Split | Raw | Used | % |
 |---|---|---|---|
-| `mathwriting_train` | ~42,880 | **10,000** | 11% |
-| `mathwriting_synthetic` | ~78,152 | **20,000** | 22% |
-| `crohme_gen_2019` | ~40,069 | **15,000** | 16% |
-| `crohme_gen_syntactic` | ~15,653 | 15,653 | 17% |
-| `typeset_mixed_train` | ~20,000 | 20,000 | 22% |
-| `typeset_train` | ~8,000 | 8,000 | 9% |
-| `crohme_gen_2023` | ~2,682 | 2,682 | 3% |
-| `mathwriting_symbols` | ~168 | 168 | <1% |
+| `mathwriting_train` | ~143,096 | **10,000** | 11% |
+| `mathwriting_synthetic` | ~85,879 | **20,000** | 21% |
+| `crohme_gen_2019` | ~51,855 | **15,000** | 16% |
+| `crohme_gen_syntactic` | ~69,397 | **15,000** | 16% |
+| `hw_structured_train` | 15,000 | 15,000 | 16% |
+| `hw_mixed_train` | 10,000 | 10,000 | 11% |
+| `crohme_gen_2023` | ~3,072 | 3,072 | 3% |
+| `mathwriting_symbols` | ~6,091 | 6,091 | 6% |
 | `crohme_real_train` | ~9 | 9 | <1% |
-| **Total** | | **~91,500** | |
+| **Total** | | **~94,200** | |
 
 ### Validation
 
-| Split | Samples used | Notes |
+250 samples drawn per split (seed 42), capped to available:
+
+| Split | Samples | Notes |
 |---|---|---|
-| `mathwriting_val` | 250 (sampled, seed 42) | Real handwritten single equations. |
-| `typeset_val` | 250 (sampled, seed 42) | Typeset single equations. |
-| `typeset_mixed_val` | ~500 (all) | Document fragments; primary signal for layout generalization. |
-| **Total** | **~1,000** | |
+| `mathwriting_val` | 250 | Real handwritten single equations. |
+| `hw_structured_val` | 250 | Whole-doc font document fragments. |
+| `hw_mixed_val` | 250 | Per-block mixed-font document fragments. |
+| **Total** | **750** | |
 
 ### Test
 
 | Split | Notes |
 |---|---|
 | `mathwriting_test` | Held-out real handwritten equations. |
-| `typeset_test` | Held-out typeset equations. |
-| `typeset_mixed_test` | Held-out document fragments. |
+| `hw_structured_test` | Held-out whole-doc font document fragments. |
+| `hw_mixed_test` | Held-out mixed-font document fragments. |
 
 ### Known gaps
 
-- No handwritten document fragments exist in the training set. The model must
-  transfer handwriting recognition (learned from single-equation data) and
-  document structure understanding (learned from typeset mixed data) jointly at
-  inference time.
-- No hybrid handwritten+typeset samples (e.g. a printed form with handwritten
-  fill-in).
-- `crohme_real_train` has only 9 samples -- real handwriting at document scale
-  is essentially absent.
+- `hw_*` splits are font-based renders, not real handwriting photos. The model
+  still lacks real handwritten document fragments at scale.
+- `crohme_real_train` has only 9 samples.
 
 ---
 
@@ -139,18 +120,26 @@
 ### Generate data
 
 ```bash
-uv run generate-typeset  # typeset_train, typeset_val, typeset_test
-uv run generate-mixed  # typeset_mixed_{train,val,test}
+# Download handwriting fonts (once)
+uv run download-hw-fonts
+
+# Generate hw splits
+uv run generate-hw --mode hw --count 15000 --out data/hw_structured_train
+uv run generate-hw --mode mix --count 10000 --out data/hw_mixed_train
+uv run generate-hw --mode hw --count 500 --out data/hw_structured_val --seed 100
+uv run generate-hw --mode mix --count 500 --out data/hw_mixed_val --seed 100
+uv run generate-hw --mode hw --count 500 --out data/hw_structured_test --seed 200
+uv run generate-hw --mode mix --count 500 --out data/hw_mixed_test --seed 200
 ```
 
 ### Train
 
 ```bash
-uv run train
+uv run train --output-dir checkpoints/gemma4e2b-run1 --epochs 2
 ```
 
 ### Evaluate
 
 ```bash
-uv run evaluate --checkpoint checkpoints/gemma-4-e2b/final --n 100
+uv run evaluate --checkpoint checkpoints/gemma4e2b-run1/final --n 100
 ```
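
Review note: the figures in the new "Effective training mix (after caps)" table can be cross-checked with a few lines of Python. The sketch below recomputes the Used column and the percentages from the raw counts and caps given in the updated README. The `RAW_AND_CAP` table and the `effective_mix` helper are illustrative names, not code from `train.py`, and the 10,000 cap on `mathwriting_train` is taken from the table's Used column rather than from a documented cap.

```python
# Raw sample counts and caps as listed in the updated README tables.
# None means "no cap". This is a hypothetical cross-check, not the
# capping logic actually implemented in train.py.
RAW_AND_CAP = {
    "mathwriting_train": (143_096, 10_000),
    "mathwriting_synthetic": (85_879, 20_000),
    "crohme_gen_2019": (51_855, 15_000),
    "crohme_gen_syntactic": (69_397, 15_000),
    "hw_structured_train": (15_000, None),
    "hw_mixed_train": (10_000, None),
    "crohme_gen_2023": (3_072, None),
    "mathwriting_symbols": (6_091, None),
    "crohme_real_train": (9, None),
}


def effective_mix(raw_and_cap):
    """Return (used counts, total, percentage shares) after applying caps."""
    used = {
        name: raw if cap is None else min(raw, cap)
        for name, (raw, cap) in raw_and_cap.items()
    }
    total = sum(used.values())
    shares = {name: 100.0 * count / total for name, count in used.items()}
    return used, total, shares


used, total, shares = effective_mix(RAW_AND_CAP)
for name in used:
    print(f"{name:24s} {used[name]:>7,d} {shares[name]:5.1f}%")
print(f"{'total':24s} {total:>7,d}")
```

The computed total is 94,172 samples, which agrees with the README's rounded "~94,200", and the rounded shares match the table's percentage column.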

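Review note: the seven body-type weights in the new README (18, 15, 23, 12, 8, 12, 12) sum to 100, so they can be read directly as sampling probabilities. A minimal sketch of weighted selection follows, using Python's standard `random.choices`. The `BODY_WEIGHTS` table and `sample_body_type` function are hypothetical names chosen for illustration and are not taken from `src/generate_mixed.py`.

```python
import random

# Body-type weights in percent, copied from the updated README list.
# The keys are illustrative labels, not identifiers from generate_mixed.py.
BODY_WEIGHTS = {
    "bare_math": 18,
    "short_inline": 15,
    "longer_inline": 23,
    "list": 12,
    "para_list": 8,
    "multi_paragraph": 12,
    "table": 12,
}


def sample_body_type(rng):
    """Draw one body type with probability proportional to its weight."""
    names = list(BODY_WEIGHTS)
    weights = list(BODY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Draw a batch with a fixed seed and tally how often each type appears.
rng = random.Random(42)
counts = {name: 0 for name in BODY_WEIGHTS}
for _ in range(10_000):
    counts[sample_body_type(rng)] += 1

for name, count in counts.items():
    print(f"{name:16s} {count / 100:5.1f}% (target {BODY_WEIGHTS[name]}%)")
```

Over a large batch the observed frequencies should sit close to the listed percentages, which makes this a quick sanity check when the weights are edited.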