# typst-ocr
Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.
Two model targets are supported:

- Gemma 4 E2B (`src/train.py`) -- Unsloth QLoRA, faster iteration
- DeepSeek-OCR-2 (`src/train_deepseek.py`) -- 3B model, stronger baseline OCR
## Training data

### Overview
The training set combines handwritten math datasets with synthetically rendered Typst documents. No dataset contains handwritten document fragments (math embedded in running text); generalization to that domain requires the model to transfer handwriting recognition and document-structure understanding simultaneously.
### Splits

#### Handwritten -- real / semi-real
| Split | Samples | Notes |
|---|---|---|
| `mathwriting_train` | ~42,880 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| `mathwriting_symbols` | ~168 | Isolated symbol images from MathWriting. |
| `crohme_real_train` | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |
#### Handwritten -- synthetic
Fully synthetic images generated from expression grammars or stroke models. Useful for coverage of rare expressions but further from real photos.
| Split | Samples (raw) | Cap | Notes |
|---|---|---|---|
| `mathwriting_synthetic` | ~78,000 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| `crohme_gen_2019` | ~40,000 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| `crohme_gen_2023` | ~2,682 | none | Generated CROHME-style images, 2023 grammar. |
| `crohme_gen_syntactic` | ~15,653 | none | Syntactically diverse generated CROHME expressions. |
Caps are applied in `train_deepseek.py` to prevent synthetic data from dominating training. After capping, the effective mix is:
- Real/semi-real handwriting: ~43k (34%)
- Synthetic handwriting: ~33k (27%)
- Typeset document content: ~28k (23%)
- Typeset single equations: ~8k (6%)
- Small real splits: ~3k (2%)
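The capping step can be sketched as follows. This is a minimal illustration, not the actual `train_deepseek.py` code: `SPLIT_CAPS` and `cap_split` are hypothetical names, and only the cap values are taken from the tables in this README.

```python
import random

# Hypothetical per-split caps, mirroring the "Effective training mix" table below.
SPLIT_CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
}

def cap_split(records, split_name, seed=42):
    """Deterministically downsample a split to its cap; uncapped splits pass through."""
    cap = SPLIT_CAPS.get(split_name)
    if cap is None or len(records) <= cap:
        return list(records)
    return random.Random(seed).sample(records, cap)

raw = [{"id": i} for i in range(40_069)]
capped = cap_split(raw, "crohme_gen_2019")  # downsampled to 15,000
```

A fixed seed keeps the capped subset identical across runs, so the training mix is reproducible.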
#### Typeset -- single equations

| Split | Samples | Notes |
|---|---|---|
| `typeset_train` | ~8,000 | Typst-rendered single math expressions; targets are bare `$ ... $` display math. |
#### Typeset -- document fragments (mixed content)
The most important split for document-level generalization. Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Page widths are randomized to cover both wide single-column and narrow two-column layouts. Text spans may be bold, italic, or underlined.
| Split | Samples | Notes |
|---|---|---|
| `typeset_mixed_train` | ~20,000 | Paragraphs, multi-paragraph blocks, bullet/numbered lists, para+list combinations, and tables. Reflowable content rendered at 7 different widths (200--480 pt). |
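Width randomization can be sketched like this. The seven values below are illustrative points spanning the 200--480 pt range; the actual ladder in `src/generate_mixed.py` may differ.

```python
import random

# Illustrative ladder of 7 render widths (pt) spanning narrow two-column
# to wide single-column layouts; not the actual values from the generator.
WIDTHS_PT = [200, 247, 293, 340, 387, 433, 480]

def pick_width(rng: random.Random) -> int:
    """Pick a render width so the same body reflows into narrow or wide layouts."""
    return rng.choice(WIDTHS_PT)

width = pick_width(random.Random(42))
```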
Body types and their generation weights (see `src/generate_mixed.py`):
- Tables (15%): grid layout with inline math cells
- Multi-paragraph (15%): 2--4 paragraphs each with inline math
- Para + list / list + para (10%): mixed prose and enumeration
- List (18%): bullet or numbered list with math items
- Inline sequence (42%): single paragraph of mixed text and math tokens
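The weighted draw over body types can be sketched as follows; the weights come from the list above, but `BODY_WEIGHTS` and `sample_body_type` are hypothetical names rather than the generator's actual API.

```python
import random

# Generation weights (percent) from the list above; they sum to 100.
BODY_WEIGHTS = {
    "table": 15,
    "multi_paragraph": 15,
    "para_list": 10,
    "list": 18,
    "inline_sequence": 42,
}

def sample_body_type(rng: random.Random) -> str:
    """Draw a body type with probability proportional to its weight."""
    kinds = list(BODY_WEIGHTS)
    weights = list(BODY_WEIGHTS.values())
    return rng.choices(kinds, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {k: 0 for k in BODY_WEIGHTS}
for _ in range(10_000):
    counts[sample_body_type(rng)] += 1
# inline sequences should dominate, at roughly 42% of draws
```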
Math content within mixed bodies includes fractions, integrals, sums, limits, derivatives, matrices, polynomials (including operator/symbolic forms), and schematic $n \times m$ matrices with ellipsis notation.
### Label conventions

- Math-only splits (`typeset_train`, `mathwriting_*`, `crohme_*`): the manifest stores bare math expressions. `data.load_records()` wraps these as `$ ... $` at load time, so every training target is valid Typst.
- Mixed splits (`typeset_mixed_*`): the manifest stores complete body content with inline `$...$` delimiters already present. No wrapping is applied.
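The wrapping convention boils down to a few lines. This is a hypothetical standalone helper illustrating the rule; the real logic lives inside `data.load_records()`.

```python
def wrap_target(split: str, target: str) -> str:
    """Wrap bare math in Typst display-math delimiters for math-only splits.

    Mixed splits already carry inline $...$ delimiters and pass through
    unchanged. Hypothetical helper; the real logic is in data.load_records().
    """
    if split.startswith("typeset_mixed"):
        return target
    return f"$ {target} $"
```

For example, `wrap_target("mathwriting_train", "x^2 + 1")` yields `"$ x^2 + 1 $"`, while mixed-split bodies are returned unchanged.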
### Effective training mix (after caps)

| Split | Raw | Used | % |
|---|---|---|---|
| `mathwriting_train` | ~42,880 | 10,000 | 11% |
| `mathwriting_synthetic` | ~78,152 | 20,000 | 22% |
| `crohme_gen_2019` | ~40,069 | 15,000 | 16% |
| `crohme_gen_syntactic` | ~15,653 | 15,653 | 17% |
| `typeset_mixed_train` | ~20,000 | 20,000 | 22% |
| `typeset_train` | ~8,000 | 8,000 | 9% |
| `crohme_gen_2023` | ~2,682 | 2,682 | 3% |
| `mathwriting_symbols` | ~168 | 168 | <1% |
| `crohme_real_train` | ~9 | 9 | <1% |
| **Total** | | ~91,500 | |
### Validation

| Split | Samples used | Notes |
|---|---|---|
| `mathwriting_val` | 250 (sampled, seed 42) | Real handwritten single equations. |
| `typeset_val` | 250 (sampled, seed 42) | Typeset single equations. |
| `typeset_mixed_val` | ~500 (all) | Document fragments; primary signal for layout generalization. |
| **Total** | ~1,000 | |
### Test

| Split | Notes |
|---|---|
| `mathwriting_test` | Held-out real handwritten equations. |
| `typeset_test` | Held-out typeset equations. |
| `typeset_mixed_test` | Held-out document fragments. |
### Known gaps

- No handwritten document fragments exist in the training set. The model must transfer handwriting recognition (learned from single-equation data) and document-structure understanding (learned from typeset mixed data) jointly at inference time.
- No hybrid handwritten+typeset samples (e.g. a printed form with handwritten fill-in).
- `crohme_real_train` has only 9 samples -- real handwriting at document scale is essentially absent.
## Setup

```sh
uv sync
```
## Generate data

```sh
uv run generate-typeset   # typeset_train, typeset_val, typeset_test
uv run generate-mixed     # typeset_mixed_{train,val,test}
```
## Train

```sh
# DeepSeek-OCR-2 (recommended for 12 GB VRAM)
uv run train-deepseek --smoke-test   # validate forward+backward first
uv run train-deepseek

# Gemma 4 E2B (via Unsloth)
uv run train
```
## Evaluate

```sh
uv run evaluate
uv run probe-deepseek --n 10
```