typst-ocr#
Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.
Model: Gemma 4 E2B (src/train.py) -- Unsloth QLoRA fine-tuning.
Training data#
Overview#
The training set combines handwritten math datasets with synthetically rendered
Typst documents. Handwriting-font splits (hw_*) cover both single-equation
and structured document content rendered in 6 diverse fonts plus the Typst
default, sampled uniformly.
Splits#
Handwritten -- real / semi-real#
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_train` | ~143,096 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. Capped at 10,000 in the training mix. |
| `mathwriting_symbols` | ~6,091 | Isolated symbol images from MathWriting. |
| `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |
Handwritten -- synthetic#
Fully synthetic images generated from expression grammars or stroke models.
| Split | Samples (raw) | Cap | Notes |
|---|---|---|---|
| `mathwriting_synthetic` | ~85,879 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| `crohme_gen_2019` | ~51,855 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| `crohme_gen_2023` | ~3,072 | none | Generated CROHME-style images, 2023 grammar. |
| `crohme_gen_syntactic` | ~69,397 | 15,000 | Syntactically diverse generated CROHME expressions. |
Typeset / handwriting-font -- document fragments#
Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts (Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or the Typst default (New Computer Modern), sampled uniformly (~14% default).
| Split | Samples | Notes |
|---|---|---|
| `hw_structured_train` | 15,000 | Whole document uses one uniformly-sampled font. |
| `hw_mixed_train` | 10,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
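The two font policies in the table can be sketched as follows. This is a minimal illustration, not the generator's actual code: the function names are hypothetical, and the sketch assumes that in mixed mode any block that does not receive a handwriting font falls back to the Typst default.

```python
import random

# Font names come from the list above; everything else here is illustrative.
HW_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
ALL_FONTS = HW_FONTS + [DEFAULT_FONT]  # 7 choices, so the default gets 1/7 (~14%)


def fonts_structured(n_blocks: int, rng: random.Random) -> list[str]:
    """Whole-document mode: one uniformly sampled font shared by every block."""
    font = rng.choice(ALL_FONTS)
    return [font] * n_blocks


def fonts_mixed(n_blocks: int, rng: random.Random, hw_prob: float = 0.55) -> list[str]:
    """Per-block mode: each block independently gets a handwriting font with
    probability hw_prob. ASSUMPTION: the remaining blocks use the default font."""
    return [
        rng.choice(HW_FONTS) if rng.random() < hw_prob else DEFAULT_FONT
        for _ in range(n_blocks)
    ]


rng = random.Random(0)
print(fonts_structured(3, rng))  # the same font repeated three times
print(fonts_mixed(4, rng))       # fonts may differ from block to block
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for a given seed.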
Body types and generation weights (see src/generate_mixed.py):
- Bare math (18%): single `$ expr $` -- lowest complexity, bridges to single-equation data
- Short inline (15%): 1--2 tokens (math, text, or mixed)
- Longer inline (23%): 3--7 tokens
- List (12%): 2--5 bullet/numbered items with math content
- Para + list / list + para (8%): prose introduction or conclusion around a list
- Multi-paragraph (12%): 2--4 paragraphs separated by blank lines
- Table (12%): 2--4 columns, 2--5 rows, inline math cells
Page widths randomized (200--480 pt) for multi-block bodies. Text spans may be bold, italic, or underlined. Ink colour sampled from near-black palette.
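The generation weights above amount to a weighted categorical draw. The sketch below shows one way to express that; the dictionary keys and function names are hypothetical and are not taken from `src/generate_mixed.py`, and the uniform distribution over page widths is an assumption (the text only gives the 200--480 pt range).

```python
import random

# Weights in percent, copied from the list above. Key names are illustrative.
BODY_WEIGHTS = {
    "bare_math": 18,
    "short_inline": 15,
    "longer_inline": 23,
    "list": 12,
    "para_list": 8,
    "multi_paragraph": 12,
    "table": 12,
}
assert sum(BODY_WEIGHTS.values()) == 100  # the weights cover every body type


def sample_body_type(rng: random.Random) -> str:
    """Draw one body type with probability proportional to its weight."""
    kinds = list(BODY_WEIGHTS)
    weights = [BODY_WEIGHTS[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]


def sample_page_width(rng: random.Random) -> float:
    """Page width in points. ASSUMPTION: uniform over the 200-480 pt range."""
    return rng.uniform(200.0, 480.0)


rng = random.Random(0)
print(sample_body_type(rng), round(sample_page_width(rng), 1))
```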
Label conventions#
- Math-only splits (`mathwriting_*`, `crohme_*`): manifest stores bare math expressions. `data.load_records()` wraps these as `$ ... $` at load time.
- `hw_*` splits: manifest stores complete body content with inline `$...$` delimiters already present. No wrapping applied.
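As a rough illustration of the two conventions, the wrapping step might look like the sketch below. `to_label` is a hypothetical helper written for this example; it is not the body of `data.load_records()`, whose real signature and behavior are not shown here.

```python
# Split-name prefixes whose manifests hold bare math (from the list above).
MATH_ONLY_PREFIXES = ("mathwriting_", "crohme_")


def to_label(split: str, text: str) -> str:
    """Return the training label for one manifest entry.

    Math-only splits store bare expressions, so they are wrapped as `$ ... $`.
    hw_* splits already contain their own `$...$` delimiters and pass through.
    """
    if split.startswith(MATH_ONLY_PREFIXES):
        return f"$ {text.strip()} $"
    return text


print(to_label("mathwriting_train", "x^2 + 1"))
print(to_label("hw_mixed_train", "Let $x$ be a real number."))
```

In Typst, a math expression with whitespace just inside the dollar signs is typeset as a block equation, while `$x$` without the spaces is inline, so the `$ ... $` wrapper marks these labels as standalone equations.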
Effective training mix (after caps)#
| Split | Raw | Used | % |
|---|---|---|---|
| `mathwriting_train` | ~143,096 | 10,000 | 11% |
| `mathwriting_synthetic` | ~85,879 | 20,000 | 21% |
| `crohme_gen_2019` | ~51,855 | 15,000 | 16% |
| `crohme_gen_syntactic` | ~69,397 | 15,000 | 16% |
| `hw_structured_train` | 15,000 | 15,000 | 16% |
| `hw_mixed_train` | 10,000 | 10,000 | 11% |
| `crohme_gen_2023` | ~3,072 | 3,072 | 3% |
| `mathwriting_symbols` | ~6,091 | 6,091 | 6% |
| `crohme_real_train` | ~9 | 9 | <1% |
| Total | | ~94,200 | |
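The "Used" and "%" columns follow from taking `min(raw, cap)` per split and normalizing by the total. The short check below reproduces that arithmetic from the figures in the table; the raw counts are approximate in the source, so the result is approximate as well, and the variable names are illustrative.

```python
# Raw sample counts and caps as listed in the tables above (raw counts are approximate).
RAW = {
    "mathwriting_train": 143_096,
    "mathwriting_synthetic": 85_879,
    "crohme_gen_2019": 51_855,
    "crohme_gen_syntactic": 69_397,
    "hw_structured_train": 15_000,
    "hw_mixed_train": 10_000,
    "crohme_gen_2023": 3_072,
    "mathwriting_symbols": 6_091,
    "crohme_real_train": 9,
}
CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
}

# A split with no cap contributes all of its samples.
used = {name: min(n, CAPS.get(name, n)) for name, n in RAW.items()}
total = sum(used.values())

for name, n in used.items():
    print(f"{name:24s} {n:>7,d}  {100 * n / total:5.1f}%")
print(f"{'total':24s} {total:>7,d}")
```

The total comes to 94,172, which the table rounds to ~94,200.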
Validation#
250 samples are drawn per split (seed 42), or the whole split if it has fewer than 250:
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_val` | 250 | Real handwritten single equations. |
| `hw_structured_val` | 250 | Whole-doc font document fragments. |
| `hw_mixed_val` | 250 | Per-block mixed-font document fragments. |
| Total | 750 | |
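A seeded, capped draw of this kind can be sketched as below. `draw_validation` is a hypothetical helper for illustration; the project's own sampling code may differ in how it seeds or orders records.

```python
import random


def draw_validation(records: list, n: int = 250, seed: int = 42) -> list:
    """Draw up to n records without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(n, len(records))  # cap to what the split actually contains
    return rng.sample(records, k)


big_split = list(range(500))   # stand-in for a 500-sample validation split
tiny_split = list(range(9))    # stand-in for a split smaller than the cap

print(len(draw_validation(big_split)))   # 250
print(len(draw_validation(tiny_split)))  # 9
```

Creating a fresh `random.Random(seed)` on every call means each split is sampled independently of the order in which splits are processed.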
Test#
| Split | Notes |
|---|---|
| `mathwriting_test` | Held-out real handwritten equations. |
| `hw_structured_test` | Held-out whole-doc font document fragments. |
| `hw_mixed_test` | Held-out mixed-font document fragments. |
Known gaps#
- `hw_*` splits are font-based renders, not real handwriting photos. The model still lacks real handwritten document fragments at scale.
- `crohme_real_train` has only 9 samples.
Setup#
uv sync
Generate data#
# Download handwriting fonts (once)
uv run download-hw-fonts
# Generate hw splits
uv run generate-hw --mode hw --count 15000 --out data/hw_structured_train
uv run generate-hw --mode mix --count 10000 --out data/hw_mixed_train
uv run generate-hw --mode hw --count 500 --out data/hw_structured_val --seed 100
uv run generate-hw --mode mix --count 500 --out data/hw_mixed_val --seed 100
uv run generate-hw --mode hw --count 500 --out data/hw_structured_test --seed 200
uv run generate-hw --mode mix --count 500 --out data/hw_mixed_test --seed 200
Train#
uv run train --output-dir checkpoints/gemma4e2b-run1 --epochs 2
Evaluate#
uv run evaluate --checkpoint checkpoints/gemma4e2b-run1/final --n 100