
# typst-ocr

Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.

Two model targets are supported:

- Gemma 4 E2B (src/train.py) -- Unsloth QLoRA, faster iteration
- DeepSeek-OCR-2 (src/train_deepseek.py) -- 3B model, stronger baseline OCR

## Training data

### Overview

The training set combines handwritten math datasets with synthetically rendered Typst documents. No dataset contains handwritten document fragments (math embedded in running text); generalization to that domain requires the model to transfer handwriting recognition and document-structure understanding simultaneously.

### Splits

#### Handwritten -- real / semi-real

| Split | Samples | Notes |
| --- | --- | --- |
| mathwriting_train | ~42,880 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| mathwriting_symbols | ~168 | Isolated symbol images from MathWriting. |
| crohme_real_train | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |

#### Handwritten -- synthetic

Fully synthetic images generated from expression grammars or stroke models. Useful for coverage of rare expressions but further from real photos.

| Split | Samples (raw) | Cap | Notes |
| --- | --- | --- | --- |
| mathwriting_synthetic | ~78,000 | 20,000 | Synthetic stroke-rendered images from the MathWriting grammar. |
| crohme_gen_2019 | ~40,000 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| crohme_gen_2023 | ~2,682 | none | Generated CROHME-style images, 2023 grammar. |
| crohme_gen_syntactic | ~15,653 | none | Syntactically diverse generated CROHME expressions. |

Caps are applied in train_deepseek.py to prevent synthetic data from dominating training. After capping the effective mix is:

- Real/semi-real handwriting: ~43k (34%)
- Synthetic handwriting: ~33k (27%)
- Typeset document content: ~28k (23%)
- Typeset single equations: ~8k (6%)
- Small real splits: ~3k (2%)
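The capping step can be sketched as follows. This is an illustrative re-implementation, not the code in src/train_deepseek.py; the function name, cap dictionary, and seed are assumptions.

```python
import random

# Illustrative per-split caps (names are assumptions; the real logic lives
# in src/train_deepseek.py). Splits without a cap pass through unchanged.
CAPS = {"mathwriting_synthetic": 20_000, "crohme_gen_2019": 15_000}

def apply_caps(records_by_split, caps=CAPS, seed=42):
    """Downsample capped splits so synthetic data cannot dominate the mix."""
    rng = random.Random(seed)  # fixed seed: reruns rebuild the same mix
    capped = {}
    for split, records in records_by_split.items():
        cap = caps.get(split)
        if cap is not None and len(records) > cap:
            capped[split] = rng.sample(records, cap)
        else:
            capped[split] = list(records)
    return capped
```

Using a seeded `random.Random` instance keeps the subsample reproducible across training runs.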

#### Typeset -- single equations

| Split | Samples | Notes |
| --- | --- | --- |
| typeset_train | ~8,000 | Typst-rendered single math expressions; targets are bare $ ... $ display math. |

#### Typeset -- document fragments (mixed content)

The most important split for document-level generalization. Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Page widths are randomized to cover both wide single-column and narrow two-column layouts. Text spans may be bold, italic, or underlined.
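The width randomization could look like the sketch below. Only the endpoints (200 and 480 pt) and the count of seven widths are stated here, so the intermediate values, the margin, and the function name are placeholders, not the values in src/generate_mixed.py.

```python
import random

# Placeholder widths: only the 200--480 pt range and the count of 7 are
# given; the exact intermediate values are assumptions.
WIDTHS_PT = (200, 240, 280, 320, 360, 420, 480)

def page_setup(rng: random.Random) -> str:
    """Emit a Typst `#set page(...)` line for one randomly chosen width.

    `height: auto` lets the page grow with the content; the margin value
    here is an assumption.
    """
    width = rng.choice(WIDTHS_PT)
    return f"#set page(width: {width}pt, height: auto, margin: 8pt)\n"
```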

| Split | Samples | Notes |
| --- | --- | --- |
| typeset_mixed_train | ~20,000 | Paragraphs, multi-paragraph blocks, bullet/numbered lists, para+list combinations, and tables. Reflowable content rendered at 7 different widths (200--480 pt). |

Body types and their generation weights (see src/generate_mixed.py):

- Tables (15%): grid layout with inline math cells
- Multi-paragraph (15%): 2--4 paragraphs each with inline math
- Para + list / list + para (10%): mixed prose and enumeration
- List (18%): bullet or numbered list with math items
- Inline sequence (42%): single paragraph of mixed text and math tokens
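Weights like these can be consumed directly with `random.choices`. A minimal sketch, assuming the key names (the real generator is src/generate_mixed.py):

```python
import random

# Generation weights from the list above; the key names are assumptions.
BODY_WEIGHTS = {
    "inline_sequence": 0.42,
    "list": 0.18,
    "tables": 0.15,
    "multi_paragraph": 0.15,
    "para_plus_list": 0.10,
}

def sample_body_type(rng: random.Random) -> str:
    """Pick one body type per generated sample, proportional to its weight."""
    kinds = list(BODY_WEIGHTS)
    return rng.choices(kinds, weights=[BODY_WEIGHTS[k] for k in kinds], k=1)[0]
```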

Math content within mixed bodies includes fractions, integrals, sums, limits, derivatives, matrices, polynomials (including operator/symbolic forms), and schematic $n \times m$ matrices with ellipsis notation.

### Label conventions

- Math-only splits (typeset_train, mathwriting_*, crohme_*): manifest stores bare math expressions. data.load_records() wraps these as $ ... $ at load time so every training target is valid Typst.
- Mixed splits (typeset_mixed_*): manifest stores complete body content with inline $...$ delimiters already present. No wrapping applied.
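The wrapping rule amounts to a dispatch on the split name. A hedged sketch — the real entry point is data.load_records(); this helper and its name are illustrative:

```python
# Sketch of the label convention above. Mixed splits already carry inline
# $...$ delimiters; math-only splits store bare expressions that are
# wrapped into Typst display math at load time.
def to_target(split: str, manifest_text: str) -> str:
    if split.startswith("typeset_mixed"):
        return manifest_text           # stored as complete body content
    return f"$ {manifest_text} $"      # bare math -> valid Typst math block
```

Note the `typeset_mixed` check must come first, since a plain `typeset` prefix test would also match the mixed splits.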

### Effective training mix (after caps)

| Split | Raw | Used | % |
| --- | --- | --- | --- |
| mathwriting_train | ~42,880 | 10,000 | 11% |
| mathwriting_synthetic | ~78,152 | 20,000 | 22% |
| crohme_gen_2019 | ~40,069 | 15,000 | 16% |
| crohme_gen_syntactic | ~15,653 | 15,653 | 17% |
| typeset_mixed_train | ~20,000 | 20,000 | 22% |
| typeset_train | ~8,000 | 8,000 | 9% |
| crohme_gen_2023 | ~2,682 | 2,682 | 3% |
| mathwriting_symbols | ~168 | 168 | <1% |
| crohme_real_train | ~9 | 9 | <1% |
| Total | | ~91,500 | |

### Validation

| Split | Samples used | Notes |
| --- | --- | --- |
| mathwriting_val | 250 (sampled, seed 42) | Real handwritten single equations. |
| typeset_val | 250 (sampled, seed 42) | Typeset single equations. |
| typeset_mixed_val | ~500 (all) | Document fragments; primary signal for layout generalization. |
| Total | ~1,000 | |
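The deterministic subsampling (250 examples, seed 42) could be implemented as below; the function name and signature are assumptions for illustration.

```python
import random

def sample_val(records, n=None, seed=42):
    """Fixed-seed validation subsample.

    Pass n=250 for the subsampled splits; n=None uses the split whole
    (as typeset_mixed_val is, per the table above).
    """
    if n is None or len(records) <= n:
        return list(records)
    return random.Random(seed).sample(records, n)
```

With a fixed seed, every run evaluates on the same validation subset, so metric changes reflect the model rather than sampling noise.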

### Test

| Split | Notes |
| --- | --- |
| mathwriting_test | Held-out real handwritten equations. |
| typeset_test | Held-out typeset equations. |
| typeset_mixed_test | Held-out document fragments. |

### Known gaps

- No handwritten document fragments exist in the training set. The model must transfer handwriting recognition (learned from single-equation data) and document-structure understanding (learned from typeset mixed data) jointly at inference time.
- No hybrid handwritten+typeset samples (e.g. a printed form with handwritten fill-in).
- crohme_real_train has only 9 samples -- real handwriting at document scale is essentially absent.

## Setup

```sh
uv sync
```

## Generate data

```sh
uv run generate-typeset   # typeset_train, typeset_val, typeset_test
uv run generate-mixed     # typeset_mixed_{train,val,test}
```

## Train

```sh
# DeepSeek-OCR-2 (recommended for 12 GB VRAM)
uv run train-deepseek --smoke-test   # validate forward+backward first
uv run train-deepseek

# Gemma 4 E2B (via Unsloth)
uv run train
```

## Evaluate

```sh
uv run evaluate
uv run probe-deepseek --n 10
```