Update README · oscillatory.net/ocr-to-typst@2d4ec1f

+48 -24

2 changed files

+47 -24

README.md

···
  Fine-tuning vision-language models to transcribe mathematical expressions and
  document fragments into [Typst](https://typst.app) notation.

- Model: **Gemma 4 E2B** (`src/train.py`) -- Unsloth QLoRA fine-tuning.
+ Primary model: **DeepSeek-OCR-2** (`deepseek/src/train.py`) -- Unsloth QLoRA fine-tuning.
+ Prior work: Gemma 4 E2B (`src/train.py`) -- archived, underperformed DeepSeek-OCR-2.

  ---

···
  ### Overview

  The training set combines handwritten math datasets with synthetically rendered
- Typst documents. Handwriting-font splits (`typeset_*`) cover both single-equation
+ Typst documents. Typeset splits (`typeset_*`) cover both single-equation
  and structured document content rendered in 6 diverse fonts plus the Typst
  default, sampled uniformly.

···

  | Split | Samples | Notes |
  |---|---|---|
- | `typeset_uniform_train` | 15,000 | Whole document uses one uniformly-sampled font. |
- | `typeset_mixed_train` | 10,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
+ | `typeset_uniform_train` | 10,000 | Whole document uses one uniformly-sampled font. |
+ | `typeset_mixed_train` | 20,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
+ | `typeset_prose_train` | 5,000 | Prose-heavy fragments: paragraphs with inline math, minimal bare-math content. |

  Body types and generation weights (see `src/generate_mixed.py`):

···

  | Split | Raw | Used | % |
  |---|---|---|---|
- | `mathwriting_train` | ~143,096 | **10,000** | 11% |
- | `mathwriting_synthetic` | ~85,879 | **20,000** | 21% |
- | `crohme_gen_2019` | ~51,855 | **15,000** | 16% |
- | `crohme_gen_syntactic` | ~69,397 | **15,000** | 16% |
- | `typeset_uniform_train` | 15,000 | 15,000 | 16% |
- | `typeset_mixed_train` | 10,000 | 10,000 | 11% |
- | `crohme_gen_2023` | ~3,072 | 3,072 | 3% |
+ | `mathwriting_synthetic` | ~85,879 | **20,000** | 19% |
+ | `typeset_mixed_train` | 20,000 | 20,000 | 19% |
+ | `crohme_gen_2019` | ~51,855 | **15,000** | 14% |
+ | `crohme_gen_syntactic` | ~69,397 | **15,000** | 14% |
+ | `mathwriting_train` | ~143,096 | **10,000** | 10% |
+ | `typeset_uniform_train` | 10,000 | 10,000 | 10% |
  | `mathwriting_symbols` | ~6,091 | 6,091 | 6% |
+ | `typeset_prose_train` | 5,000 | **5,000** | 5% |
+ | `crohme_gen_2023` | ~3,072 | 3,072 | 3% |
  | `crohme_real_train` | ~9 | 9 | <1% |
- | **Total** | | **~94,200** | |
+ | **Total** | | **~104,200** | |

  ### Validation

- 250 samples drawn per split (seed 42), capped to available:
+ 250 samples drawn per split:

  | Split | Samples | Notes |
  |---|---|---|
  | `mathwriting_val` | 250 | Real handwritten single equations. |
  | `typeset_uniform_val` | 250 | Whole-doc font document fragments. |
  | `typeset_mixed_val` | 250 | Per-block mixed-font document fragments. |
- | **Total** | **750** | |
+ | `typeset_prose_val` | 250 | Prose-heavy document fragments. |
+ | **Total** | **1,000** | |

  ### Test

···
  | `mathwriting_test` | Held-out real handwritten equations. |
  | `typeset_uniform_test` | Held-out whole-doc font document fragments. |
  | `typeset_mixed_test` | Held-out mixed-font document fragments. |
+ | `typeset_prose_test` | Held-out prose-heavy document fragments. |

  ### Known gaps

  - `typeset_*` splits are font-based renders, not real handwriting photos. The model
  still lacks real handwritten document fragments at scale.
  - `crohme_real_train` has only 9 samples.
+ - No mixed handwritten+typeset document examples (future: synthetic handwriting generator).

  ---

  ## Setup

+ The root environment covers data generation. The DeepSeek training environment
+ is a separate `uv` project under `deepseek/`.
+
  ```bash
+ # Root env (data generation, evaluation)
  uv sync
+
+ # DeepSeek training env
+ cd deepseek && uv sync
  ```

  ### Generate data
···
  # Download handwriting fonts (once)
  uv run download-hw-fonts

- # Generate hw splits
+ uv run generate-typeset --mode prose --count 5000 --out data/typeset_prose_train --seed 101
+ uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_val --seed 103
+ uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_test --seed 107

+ uv run generate-typeset --mode uniform --count 10000 --out data/typeset_uniform_train --seed 109
+ uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_val --seed 113
+ uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_test --seed 127
+
+ uv run generate-typeset --mode mixed --count 20000 --out data/typeset_mixed_train --seed 131
+ uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_val --seed 137
+ uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_test --seed 139
  ```
- uv run generate-typeset --mode uniform --count 10000 --out data/typeset_uniform_train
- uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_val --seed 100
- uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_test --seed 200
- uv run generate-typeset --mode mixed --count 25000 --out data/typeset_mixed_train
- uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_val --seed 100
- uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_test --seed 200
+
+ ### Train (DeepSeek-OCR-2)
+
+ ```bash
+ cd deepseek
+ uv run train-deepseek --output-dir ../checkpoints/deepseek-ocr2-run1 --epochs 2
  ```

- ### Train
+ Smoke test (small caps, ~1 hour):

  ```bash
- uv run train --output-dir checkpoints/gemma4e2b-run1 --epochs 2
+ cd deepseek && sh run-smoke.sh
  ```

  ### Evaluate

  ```bash
- uv run evaluate --checkpoint checkpoints/gemma4e2b-run1/final --n 100
+ cd deepseek
+ uv run evaluate-deepseek --checkpoint ../checkpoints/deepseek-ocr2-run1/final --n 100
  ```
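
Review note: the new "Used" column and its percentages can be cross-checked with a few lines of arithmetic. The sketch below is not part of the commit; the counts are copied from the updated table in this diff, and the function name `mixture_shares` is a hypothetical helper introduced only for this check.

```python
# "Used" sample counts copied from the updated training-mixture table.
USED = {
    "mathwriting_synthetic": 20_000,
    "typeset_mixed_train": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
    "mathwriting_train": 10_000,
    "typeset_uniform_train": 10_000,
    "mathwriting_symbols": 6_091,
    "typeset_prose_train": 5_000,
    "crohme_gen_2023": 3_072,
    "crohme_real_train": 9,
}


def mixture_shares(counts):
    """Return each split's share of the total, in percent (hypothetical helper)."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"total: {sum(USED.values()):,}")
    for name, pct in mixture_shares(USED).items():
        print(f"{name:<24} {pct:5.1f}%")
```

Under these counts the exact total is 104,172, which agrees with the rounded `~104,200` in the table, and each share rounds to the percentage the new README lists.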

deepseek/src/train.py

···
      "crohme_gen_2019": 15_000,
      "crohme_gen_syntactic": 15_000,
      "mathwriting_train": 10_000,
+     "typeset_prose_train": 5_000,
  }

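
Review note: the hunk shows only the dict entries, not how they are consumed. Since the values match the bolded "Used" counts in the README table, one plausible reading is a per-split sample cap. The sketch below illustrates that reading only; the names `SPLIT_CAPS` and `cap_split`, the `seed` default, and the pass-through behaviour for uncapped splits are all assumptions, not code from `deepseek/src/train.py`.

```python
import random

# Hypothetical variable name; the values are the entries visible in the hunk.
SPLIT_CAPS = {
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
    "mathwriting_train": 10_000,
    "typeset_prose_train": 5_000,
}


def cap_split(name, samples, caps=SPLIT_CAPS, seed=0):
    """Return at most caps[name] samples, drawn without replacement.

    Splits with no entry in `caps`, or with fewer samples than the cap,
    are returned unchanged (assumed behaviour).
    """
    samples = list(samples)
    cap = caps.get(name)
    if cap is None or len(samples) <= cap:
        return samples
    # A dedicated RNG keeps the subsample reproducible for a given seed.
    return random.Random(seed).sample(samples, cap)


if __name__ == "__main__":
    raw = range(143_096)  # stand-in for the ~143,096 raw mathwriting_train items
    print(len(cap_split("mathwriting_train", raw)))
```

With a cap equal to the generated count, as for `typeset_prose_train` (5,000 generated, 5,000 cap), this sketch keeps every sample.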
