The training set combines handwritten math datasets with synthetically rendered Typst documents. Handwriting-font splits (hw_*) cover both single-equation and structured document content rendered in 6 diverse fonts plus the Typst default, sampled uniformly.

Splits#

Handwritten -- real / semi-real#

Split	Samples	Notes
`mathwriting_train`	~143,096	Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. Capped at 10,000 in the effective training mix.
`mathwriting_symbols`	~6,091	Isolated symbol images from MathWriting.
`crohme_real_train`	9	Real CROHME competition handwriting samples. Tiny but genuine.

Handwritten -- synthetic#

Fully synthetic images generated from expression grammars or stroke models.

Split	Samples (raw)	Cap	Notes
`mathwriting_synthetic`	~85,879	20,000	Synthetic stroke-rendered images from MathWriting grammar.
`crohme_gen_2019`	~51,855	15,000	Generated CROHME-style images, 2019 grammar.
`crohme_gen_2023`	~3,072	none	Generated CROHME-style images, 2023 grammar.
`crohme_gen_syntactic`	~69,397	15,000	Syntactically diverse generated CROHME expressions.

Typeset / handwriting-font -- document fragments#

Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts (Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or the Typst default (New Computer Modern), sampled uniformly (~14% default).

Split	Samples	Notes
`hw_structured_train`	15,000	Whole document uses one uniformly-sampled font.
`hw_mixed_train`	10,000	Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies.
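
The two font-sampling modes can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the helper names are hypothetical, and it assumes that in mixed mode each block is decided independently and that blocks without a handwriting font fall back to the Typst default.

```python
import random

# Font names come from the description above; helper names are hypothetical.
HW_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
ALL_FONTS = HW_FONTS + [DEFAULT_FONT]


def sample_document_font(rng):
    """Structured mode: one font for the whole document, uniform over all 7."""
    return rng.choice(ALL_FONTS)


def sample_block_fonts(rng, n_blocks, hw_prob=0.55):
    """Mixed mode (assumed behaviour): each block independently gets a
    handwriting font with probability hw_prob, else the Typst default."""
    return [
        rng.choice(HW_FONTS) if rng.random() < hw_prob else DEFAULT_FONT
        for _ in range(n_blocks)
    ]


rng = random.Random(0)
doc_font = sample_document_font(rng)
block_fonts = sample_block_fonts(rng, n_blocks=4)
default_share = 1 / len(ALL_FONTS)  # 1/7, which is the ~14% quoted above
```

With uniform sampling over 7 fonts, the default font's share is 1/7, which matches the ~14% figure.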

Body types and generation weights (see src/generate_mixed.py):

Bare math (18%): single $ expr $ -- lowest complexity, bridges to single-equation data
Short inline (15%): 1--2 tokens (math, text, or mixed)
Longer inline (23%): 3--7 tokens
List (12%): 2--5 bullet/numbered items with math content
Para + list / list + para (8%): prose introduction or conclusion around a list
Multi-paragraph (12%): 2--4 paragraphs separated by blank lines
Table (12%): 2--4 columns, 2--5 rows, inline math cells
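
As a rough sketch of how these weights could drive generation, the snippet below draws a body type with `random.choices`. The percentages are copied from the list above; the dictionary keys and the `sample_body_type` helper are hypothetical and may not match what src/generate_mixed.py does.

```python
import random
from collections import Counter

# Weights (in percent) copied from the list above; keys are hypothetical labels.
BODY_WEIGHTS = {
    "bare_math": 18,
    "short_inline": 15,
    "longer_inline": 23,
    "list": 12,
    "para_list": 8,
    "multi_paragraph": 12,
    "table": 12,
}
assert sum(BODY_WEIGHTS.values()) == 100  # the listed weights cover all cases


def sample_body_type(rng):
    """Draw one body type in proportion to its weight."""
    names = list(BODY_WEIGHTS)
    weights = list(BODY_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
counts = Counter(sample_body_type(rng) for _ in range(10_000))
```

Over many draws, the counts should approach the listed percentages.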

Page widths are randomized (200--480 pt) for multi-block bodies. Text spans may be bold, italic, or underlined. Ink colour is sampled from a near-black palette.
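
One way to picture these randomizations is a small helper that emits Typst set rules and span markup. Everything here is an assumption made for illustration: the palette values, the helper names, the equal odds for each text style, and the choice to emit no page rule for single-block bodies are not taken from the generator.

```python
import random

# Hypothetical near-black palette; the real colours are not given in the text.
INK_PALETTE = ["#000000", "#111111", "#1a1a2e", "#222222"]


def page_preamble(rng, multi_block):
    """Build Typst set rules for one sample (illustrative only)."""
    lines = []
    if multi_block:
        width = rng.randint(200, 480)  # page width in pt, inclusive on both ends
        lines.append(f"#set page(width: {width}pt, height: auto)")
    ink = rng.choice(INK_PALETTE)
    lines.append(f'#set text(fill: rgb("{ink}"))')
    return "\n".join(lines)


def style_span(rng, text):
    """Wrap a text span in Typst bold, italic, or underline markup, or leave it."""
    style = rng.choice(["plain", "bold", "italic", "underline"])
    if style == "bold":
        return f"*{text}*"
    if style == "italic":
        return f"_{text}_"
    if style == "underline":
        return f"#underline[{text}]"
    return text
```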

Label conventions#

Math-only splits (mathwriting_*, crohme_*): manifest stores bare math expressions. data.load_records() wraps these as $ ... $ at load time.
hw_* splits: manifest stores complete body content with inline $...$ delimiters already present. No wrapping is applied.
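
The two conventions amount to a per-split rule like the one below. This is only a sketch: `wrap_label` is a hypothetical helper, and the actual logic inside data.load_records() may differ (for example in how whitespace is handled).

```python
def wrap_label(split, label):
    """Return the training target for a manifest label (hypothetical helper).

    hw_* splits already carry inline $...$ delimiters, so they pass through
    unchanged; math-only splits store bare expressions and get wrapped.
    """
    if split.startswith("hw_"):
        return label
    return f"$ {label.strip()} $"


math_target = wrap_label("mathwriting_train", "x^2")
doc_target = wrap_label("hw_mixed_train", "Let $x$ be real.")
```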

Effective training mix (after caps)#

Split	Raw	Used	%
`mathwriting_train`	~143,096	10,000	11%
`mathwriting_synthetic`	~85,879	20,000	21%
`crohme_gen_2019`	~51,855	15,000	16%
`crohme_gen_syntactic`	~69,397	15,000	16%
`hw_structured_train`	15,000	15,000	16%
`hw_mixed_train`	10,000	10,000	11%
`crohme_gen_2023`	~3,072	3,072	3%
`mathwriting_symbols`	~6,091	6,091	6%
`crohme_real_train`	9	9	<1%
Total		~94,200
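
The Used and % columns follow from applying each cap to the raw count. The arithmetic can be checked with a short script. The counts and caps are copied from the tables (raw counts are approximate in the source) and the variable names are illustrative.

```python
# Raw sample counts per split, as listed in the tables above.
RAW = {
    "mathwriting_train": 143_096,
    "mathwriting_synthetic": 85_879,
    "crohme_gen_2019": 51_855,
    "crohme_gen_syntactic": 69_397,
    "hw_structured_train": 15_000,
    "hw_mixed_train": 10_000,
    "crohme_gen_2023": 3_072,
    "mathwriting_symbols": 6_091,
    "crohme_real_train": 9,
}

# Per-split caps; splits not listed here are used in full.
CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
}

used = {name: min(n, CAPS.get(name, n)) for name, n in RAW.items()}
total = sum(used.values())  # 94,172, shown as ~94,200 in the table
shares = {name: 100 * n / total for name, n in used.items()}

for name, n in used.items():
    print(f"{name:24s} {n:>7,d} {shares[name]:5.1f}%")
```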

Validation#

250 samples are drawn per split (seed 42); a split with fewer than 250 samples is used in full:

Split	Samples	Notes
`mathwriting_val`	250	Real handwritten single equations.
`hw_structured_val`	250	Whole-doc font document fragments.
`hw_mixed_val`	250	Per-block mixed-font document fragments.
Total	750
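
A seeded, capped draw of this kind might look like the sketch below. `draw_validation` is a hypothetical helper; the real loader may sample differently (for instance by shuffling and slicing), so the selected indices are not guaranteed to match.

```python
import random


def draw_validation(records, n=250, seed=42):
    """Draw up to n records without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(n, len(records))  # cap to what the split actually contains
    return rng.sample(records, k)


subset = draw_validation(list(range(500)))
small = draw_validation(list(range(100)))
```

Using a dedicated `random.Random(seed)` instance keeps the draw independent of any other random state in the program.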

Test#

Split	Notes
`mathwriting_test`	Held-out real handwritten equations.
`hw_structured_test`	Held-out whole-doc font document fragments.
`hw_mixed_test`	Held-out mixed-font document fragments.

Known gaps#

hw_* splits are font-based renders, not photos of real handwriting. The training set still lacks real handwritten document fragments at scale.
crohme_real_train has only 9 samples.

Setup#

uv sync

Generate data#

# Download handwriting fonts (once)
uv run download-hw-fonts

# Generate hw splits
uv run generate-hw --mode hw  --count 15000 --out data/hw_structured_train
uv run generate-hw --mode mix --count 10000 --out data/hw_mixed_train
uv run generate-hw --mode hw  --count 500   --out data/hw_structured_val  --seed 100
uv run generate-hw --mode mix --count 500   --out data/hw_mixed_val       --seed 100
uv run generate-hw --mode hw  --count 500   --out data/hw_structured_test --seed 200
uv run generate-hw --mode mix --count 500   --out data/hw_mixed_test      --seed 200

Train#

uv run train --output-dir checkpoints/gemma4e2b-run1 --epochs 2

Evaluate#

uv run evaluate --checkpoint checkpoints/gemma4e2b-run1/final --n 100
