# typst-ocr
Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.
Two model targets are supported:

- Gemma 4 E2B (`src/train.py`) -- Unsloth QLoRA, faster iteration
- DeepSeek-OCR-2 (`src/train_deepseek.py`) -- 3B model, stronger baseline OCR
## Training data

### Overview
The training set combines handwritten math datasets with synthetically rendered Typst documents. No dataset contains handwritten document fragments (math embedded in running text); generalization to that domain requires the model to transfer handwriting recognition and document-structure understanding simultaneously.
### Splits

#### Handwritten -- real / semi-real
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_train` | ~42,880 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| `mathwriting_symbols` | ~168 | Isolated symbol images from MathWriting. |
| `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |
#### Handwritten -- synthetic
Fully synthetic images generated from expression grammars or stroke models. Useful for coverage of rare expressions but further from real photos.
| Split | Samples (raw) | Cap | Notes |
|---|---|---|---|
| `mathwriting_synthetic` | ~78,000 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| `crohme_gen_2019` | ~40,000 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| `crohme_gen_2023` | ~2,682 | none | Generated CROHME-style images, 2023 grammar. |
| `crohme_gen_syntactic` | ~15,653 | none | Syntactically diverse generated CROHME expressions. |
Caps are applied in `train_deepseek.py` to prevent synthetic data from dominating training. After capping, the effective mix is:
- Real/semi-real handwriting: ~43k (34%)
- Synthetic handwriting: ~33k (27%)
- Typeset document content: ~28k (23%)
- Typeset single equations: ~8k (6%)
- Small real splits: ~3k (2%)
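The capping step can be sketched as follows. This is a minimal illustration, not the actual `train_deepseek.py` code: `SPLIT_CAPS` and `cap_split` are hypothetical names, and only the cap values are taken from the tables in this README.

```python
import random

# Hypothetical per-split caps, mirroring the "Effective training mix" table below.
SPLIT_CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
}

def cap_split(records, split_name, seed=42):
    """Deterministically downsample a split to its cap; uncapped splits pass through."""
    cap = SPLIT_CAPS.get(split_name)
    if cap is None or len(records) <= cap:
        return list(records)
    return random.Random(seed).sample(records, cap)

raw = [{"id": i} for i in range(40_069)]
capped = cap_split(raw, "crohme_gen_2019")  # downsampled to 15,000
```

A fixed seed keeps the capped subset identical across runs, so the training mix is reproducible.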
#### Typeset -- single equations

| Split | Samples | Notes |
|---|---|---|
| `typeset_train` | ~8,000 | Typst-rendered single math expressions; targets are bare `$ ... $` display math. |
#### Typeset -- document fragments (mixed content)
The most important split for document-level generalization. Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Page widths are randomized to cover both wide single-column and narrow two-column layouts. Text spans may be bold, italic, or underlined.
| Split | Samples | Notes |
|---|---|---|
| `typeset_mixed_train` | ~20,000 | Paragraphs, multi-paragraph blocks, bullet/numbered lists, para+list combinations, and tables. Reflowable content rendered at 7 different widths (200--480 pt). |
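Width randomization can be sketched like this. The seven values below are illustrative points spanning the 200--480 pt range; the actual ladder in `src/generate_mixed.py` may differ.

```python
import random

# Illustrative ladder of 7 render widths (pt) spanning narrow two-column
# to wide single-column layouts; not the actual values from the generator.
WIDTHS_PT = [200, 247, 293, 340, 387, 433, 480]

def pick_width(rng: random.Random) -> int:
    """Pick a render width so the same body reflows into narrow or wide layouts."""
    return rng.choice(WIDTHS_PT)

width = pick_width(random.Random(42))
```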
Body types and their generation weights (see `src/generate_mixed.py`):
- Tables (15%): grid layout with inline math cells
- Multi-paragraph (15%): 2--4 paragraphs each with inline math
- Para + list / list + para (10%): mixed prose and enumeration
- List (18%): bullet or numbered list with math items
- Inline sequence (42%): single paragraph of mixed text and math tokens
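The weighted draw over body types can be sketched as follows; the weights come from the list above, but `BODY_WEIGHTS` and `sample_body_type` are hypothetical names rather than the generator's actual API.

```python
import random

# Generation weights (percent) from the list above; they sum to 100.
BODY_WEIGHTS = {
    "table": 15,
    "multi_paragraph": 15,
    "para_list": 10,
    "list": 18,
    "inline_sequence": 42,
}

def sample_body_type(rng: random.Random) -> str:
    """Draw a body type with probability proportional to its weight."""
    kinds = list(BODY_WEIGHTS)
    weights = list(BODY_WEIGHTS.values())
    return rng.choices(kinds, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {k: 0 for k in BODY_WEIGHTS}
for _ in range(10_000):
    counts[sample_body_type(rng)] += 1
# inline sequences should dominate, at roughly 42% of draws
```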
Math content within mixed bodies includes fractions, integrals, sums, limits, derivatives, matrices, polynomials (including operator/symbolic forms), and schematic $n \times m$ matrices with ellipsis notation.
### Label conventions

- Math-only splits (`typeset_train`, `mathwriting_*`, `crohme_*`): the manifest stores bare math expressions. `data.load_records()` wraps these as `$ ... $` at load time, so every training target is valid Typst.
- Mixed splits (`typeset_mixed_*`): the manifest stores complete body content with inline `$...$` delimiters already present. No wrapping is applied.
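The wrapping convention boils down to a few lines. This is a hypothetical standalone helper illustrating the rule; the real logic lives inside `data.load_records()`.

```python
def wrap_target(split: str, target: str) -> str:
    """Wrap bare math in Typst display-math delimiters for math-only splits.

    Mixed splits already carry inline $...$ delimiters and pass through
    unchanged. Hypothetical helper; the real logic is in data.load_records().
    """
    if split.startswith("typeset_mixed"):
        return target
    return f"$ {target} $"
```

For example, `wrap_target("mathwriting_train", "x^2 + 1")` yields `"$ x^2 + 1 $"`, while mixed-split bodies are returned unchanged.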
### Effective training mix (after caps)

| Split | Raw | Used | % |
|---|---|---|---|
| `mathwriting_train` | ~42,880 | 10,000 | 11% |
| `mathwriting_synthetic` | ~78,152 | 20,000 | 22% |
| `crohme_gen_2019` | ~40,069 | 15,000 | 16% |
| `crohme_gen_syntactic` | ~15,653 | 15,653 | 17% |
| `typeset_mixed_train` | ~20,000 | 20,000 | 22% |
| `typeset_train` | ~8,000 | 8,000 | 9% |
| `crohme_gen_2023` | ~2,682 | 2,682 | 3% |
| `mathwriting_symbols` | ~168 | 168 | <1% |
| `crohme_real_train` | ~9 | 9 | <1% |
| **Total** | | ~91,500 | |
### Validation

| Split | Samples used | Notes |
|---|---|---|
| `mathwriting_val` | 250 (sampled, seed 42) | Real handwritten single equations. |
| `typeset_val` | 250 (sampled, seed 42) | Typeset single equations. |
| `typeset_mixed_val` | ~500 (all) | Document fragments; primary signal for layout generalization. |
| **Total** | ~1,000 | |
### Test

| Split | Notes |
|---|---|
| `mathwriting_test` | Held-out real handwritten equations. |
| `typeset_test` | Held-out typeset equations. |
| `typeset_mixed_test` | Held-out document fragments. |
### Known gaps

- No handwritten document fragments exist in the training set. The model must transfer handwriting recognition (learned from single-equation data) and document-structure understanding (learned from typeset mixed data) jointly at inference time.
- No hybrid handwritten+typeset samples (e.g. a printed form with handwritten fill-in).
- `crohme_real_train` has only 9 samples -- real handwriting at document scale is essentially absent.
## Setup

```sh
uv sync
```
## Generate data

```sh
uv run generate-typeset   # typeset_train, typeset_val, typeset_test
uv run generate-mixed     # typeset_mixed_{train,val,test}
```
## Train

```sh
# DeepSeek-OCR-2 (recommended for 12 GB VRAM)
uv run train-deepseek --smoke-test   # validate forward+backward first
uv run train-deepseek

# Gemma 4 E2B (via Unsloth)
uv run train
```
## Evaluate

```sh
uv run evaluate
uv run probe-deepseek --n 10
```