typst-ocr#
Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.
Primary model: DeepSeek-OCR-2 (deepseek/src/train.py) -- Unsloth QLoRA fine-tuning.
Prior work: Gemma 4 E2B (src/train.py) -- archived, underperformed DeepSeek-OCR-2.
Training data#
Overview#
The training set combines handwritten math datasets with synthetically rendered
Typst documents. The typeset splits (`typeset_*`) cover both single-equation
and structured document content. Each is rendered in one of seven fonts (six
diverse fonts plus the Typst default), sampled uniformly.
Splits#
Handwritten -- real / semi-real#
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_train` | ~143,096 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. Subsampled to 10,000 in the effective training mix. |
| `mathwriting_symbols` | ~6,091 | Isolated symbol images from MathWriting. |
| `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |
Handwritten -- synthetic#
Fully synthetic images generated from expression grammars or stroke models.
| Split | Samples (raw) | Cap | Notes |
|---|---|---|---|
| `mathwriting_synthetic` | ~85,879 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| `crohme_gen_2019` | ~51,855 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| `crohme_gen_2023` | ~3,072 | none | Generated CROHME-style images, 2023 grammar. |
| `crohme_gen_syntactic` | ~69,397 | 15,000 | Syntactically diverse generated CROHME expressions. |
Typeset / handwriting-font -- document fragments#
Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts (Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or the Typst default (New Computer Modern), sampled uniformly (~14% default).
| Split | Samples | Notes |
|---|---|---|
| `typeset_uniform_train` | 10,000 | Whole document uses one uniformly-sampled font. |
| `typeset_mixed_train` | 20,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
| `typeset_prose_train` | 5,000 | Prose-heavy fragments: paragraphs with inline math, minimal bare-math content. |
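The two font strategies described above can be sketched as follows. This is a minimal illustration, not the code in `src/generate_mixed.py`: the font names and the 55% figure come from this README, while the function names and the exact sampling logic are assumptions.

```python
import random

# Font names are taken from this README; everything else is illustrative.
HANDWRITING_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
ALL_FONTS = HANDWRITING_FONTS + [DEFAULT_FONT]


def sample_uniform_font(rng: random.Random) -> str:
    """Pick one font for the whole document; each of the 7 has probability 1/7."""
    return rng.choice(ALL_FONTS)


def sample_mixed_fonts(rng: random.Random, n_blocks: int,
                       p_handwriting: float = 0.55) -> list[str]:
    """Pick a font per block: a handwriting font with probability
    p_handwriting, otherwise the Typst default (hypothetical logic)."""
    return [
        rng.choice(HANDWRITING_FONTS) if rng.random() < p_handwriting
        else DEFAULT_FONT
        for _ in range(n_blocks)
    ]
```

With seven equally likely fonts, the default is chosen with probability 1/7, which matches the "~14% default" figure.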
Body types and generation weights (see src/generate_mixed.py):
- Bare math (18%): single `$ expr $` -- lowest complexity, bridges to single-equation data
- Short inline (15%): 1--2 tokens (math, text, or mixed)
- Longer inline (23%): 3--7 tokens
- List (12%): 2--5 bullet/numbered items with math content
- Para + list / list + para (8%): prose introduction or conclusion around a list
- Multi-paragraph (12%): 2--4 paragraphs separated by blank lines
- Table (12%): 2--4 columns, 2--5 rows, inline math cells
Page widths are randomized (200--480 pt) for multi-block bodies. Text spans may be bold, italic, or underlined. Ink colour is sampled from a near-black palette.
Label conventions#
- Math-only splits (`mathwriting_*`, `crohme_*`): manifest stores bare math expressions. `data.load_records()` wraps these as `$ ... $` at load time.
- `typeset_*` splits: manifest stores complete body content with inline `$...$` delimiters already present. No wrapping applied.
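The wrapping rule can be illustrated with a small helper. This is a sketch only: `wrap_label` and `MATH_ONLY_PREFIXES` are hypothetical names, and the real `data.load_records()` may be structured differently.

```python
# Hypothetical helper illustrating the label convention; not the project's API.
MATH_ONLY_PREFIXES = ("mathwriting_", "crohme_")


def wrap_label(split: str, label: str) -> str:
    """Wrap bare math labels as `$ ... $`; leave typeset_* labels untouched."""
    if split.startswith(MATH_ONLY_PREFIXES):
        return f"$ {label.strip()} $"
    return label


print(wrap_label("mathwriting_train", "x^2 + 1"))
print(wrap_label("typeset_prose_train", "Let $x$ be a real number."))
```

In Typst, spaces just inside the dollar signs make the expression block-level math, while `$...$` without them is inline math, so the two label forms are distinct on purpose.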
Effective training mix (after caps)#
| Split | Raw | Used | % |
|---|---|---|---|
| `mathwriting_synthetic` | ~85,879 | 20,000 | 19% |
| `typeset_mixed_train` | 20,000 | 20,000 | 19% |
| `crohme_gen_2019` | ~51,855 | 15,000 | 14% |
| `crohme_gen_syntactic` | ~69,397 | 15,000 | 14% |
| `mathwriting_train` | ~143,096 | 10,000 | 10% |
| `typeset_uniform_train` | 10,000 | 10,000 | 10% |
| `mathwriting_symbols` | ~6,091 | 6,091 | 6% |
| `typeset_prose_train` | 5,000 | 5,000 | 5% |
| `crohme_gen_2023` | ~3,072 | 3,072 | 3% |
| `crohme_real_train` | ~9 | 9 | <1% |
| Total | | ~104,200 | 100% |
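The "Used" and "%" columns follow from taking `min(raw, cap)` per split and dividing by the total. The sketch below reproduces that arithmetic with the approximate raw counts from this README. The 10,000 cap for `mathwriting_train` is inferred from the "Used" column rather than stated as a cap elsewhere, and the variable names are illustrative.

```python
# Approximate raw counts from this README.
RAW = {
    "mathwriting_synthetic": 85_879,
    "typeset_mixed_train": 20_000,
    "crohme_gen_2019": 51_855,
    "crohme_gen_syntactic": 69_397,
    "mathwriting_train": 143_096,
    "typeset_uniform_train": 10_000,
    "mathwriting_symbols": 6_091,
    "typeset_prose_train": 5_000,
    "crohme_gen_2023": 3_072,
    "crohme_real_train": 9,
}

# Caps from the tables above; the mathwriting_train value is inferred.
CAPS = {
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
    "mathwriting_train": 10_000,
}

# A split with no cap is used in full.
used = {name: min(raw, CAPS.get(name, raw)) for name, raw in RAW.items()}
total = sum(used.values())

for name, n in sorted(used.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {n:>7,d} {100 * n / total:5.1f}%")
print(f"{'total':24s} {total:>7,d}")
```

The exact sum of these counts is 104,172, which the table rounds to ~104,200.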
Validation#
250 samples drawn per split:
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_val` | 250 | Real handwritten single equations. |
| `typeset_uniform_val` | 250 | Whole-doc font document fragments. |
| `typeset_mixed_val` | 250 | Per-block mixed-font document fragments. |
| `typeset_prose_val` | 250 | Prose-heavy document fragments. |
| Total | 1,000 | |
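This README does not say how the 250 samples per split are selected. One plausible approach, shown here purely as an assumption, is a seeded draw without replacement, so that the validation subset stays identical across runs. The function name and default seed are hypothetical.

```python
import random


def draw_validation_subset(records: list, n: int = 250, seed: int = 0) -> list:
    """Return a reproducible random subset of at most n records.

    Hypothetical helper: the seed and sampling strategy are assumptions,
    not taken from the project code.
    """
    if len(records) <= n:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, n)


# Example with placeholder records standing in for one validation split.
subset = draw_validation_subset(list(range(500)))
print(len(subset))
```

Fixing the seed keeps validation metrics comparable between checkpoints, since every evaluation sees the same 4 x 250 = 1,000 samples.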
Test#
| Split | Notes |
|---|---|
| `mathwriting_test` | Held-out real handwritten equations. |
| `typeset_uniform_test` | Held-out whole-doc font document fragments. |
| `typeset_mixed_test` | Held-out mixed-font document fragments. |
| `typeset_prose_test` | Held-out prose-heavy document fragments. |
Known gaps#
- `typeset_*` splits are font-based renders, not real handwriting photos. The model still lacks real handwritten document fragments at scale.
- `crohme_real_train` has only 9 samples.
- No mixed handwritten+typeset document examples (future: synthetic handwriting generator).
Setup#
The root environment covers data generation. The DeepSeek training environment
is a separate uv project under deepseek/.
```sh
# Root env (data generation, evaluation)
uv sync

# DeepSeek training env
cd deepseek && uv sync
```
Generate data#
```sh
# Download handwriting fonts (once)
uv run download-hw-fonts

uv run generate-typeset --mode prose --count 5000 --out data/typeset_prose_train --seed 101
uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_val --seed 103
uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_test --seed 107

uv run generate-typeset --mode uniform --count 10000 --out data/typeset_uniform_train --seed 109
uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_val --seed 113
uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_test --seed 127

uv run generate-typeset --mode mixed --count 20000 --out data/typeset_mixed_train --seed 131
uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_val --seed 137
uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_test --seed 139
```
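The nine `generate-typeset` invocations above differ only in mode, count, output directory, and seed, so they can also be driven from a small table. The script below is an optional convenience sketch and is not part of the repository. It only uses the flags shown above, and `JOBS` and `build_command` are hypothetical names.

```python
import shlex

# (mode, count, output directory, seed), copied from the commands above.
JOBS = [
    ("prose", 5000, "data/typeset_prose_train", 101),
    ("prose", 500, "data/typeset_prose_val", 103),
    ("prose", 500, "data/typeset_prose_test", 107),
    ("uniform", 10000, "data/typeset_uniform_train", 109),
    ("uniform", 500, "data/typeset_uniform_val", 113),
    ("uniform", 500, "data/typeset_uniform_test", 127),
    ("mixed", 20000, "data/typeset_mixed_train", 131),
    ("mixed", 500, "data/typeset_mixed_val", 137),
    ("mixed", 500, "data/typeset_mixed_test", 139),
]


def build_command(mode: str, count: int, out: str, seed: int) -> list[str]:
    """Build the argv list for one generate-typeset run."""
    return [
        "uv", "run", "generate-typeset",
        "--mode", mode,
        "--count", str(count),
        "--out", out,
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for job in JOBS:
        print(shlex.join(build_command(*job)))
```

To execute rather than print, each argv list could be passed to `subprocess.run(cmd, check=True)` from the repository root. Every split keeps its own distinct seed, which keeps the train, val, and test sets reproducible and separately generated.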
Train (DeepSeek-OCR-2)#
```sh
cd deepseek
uv run train-deepseek --output-dir ../checkpoints/deepseek-ocr2-run1 --epochs 2
```
Smoke test (small caps, ~1 hour):
```sh
cd deepseek && sh run-smoke.sh
```
Evaluate#
```sh
cd deepseek
uv run evaluate-deepseek --checkpoint ../checkpoints/deepseek-ocr2-run1/final --n 100
```