typst-ocr#

Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.

  • Primary model: DeepSeek-OCR-2 (deepseek/src/train.py) -- fine-tuned with Unsloth QLoRA.
  • Prior work: Gemma 4 E2B (src/train.py) -- archived because it underperformed DeepSeek-OCR-2.


Training data#

Overview#

The training set combines handwritten math datasets with synthetically rendered Typst documents. Typeset splits (typeset_*) cover both single-equation and structured document content rendered in 6 diverse fonts plus the Typst default, sampled uniformly.

Splits#

Handwritten -- real / semi-real#

| Split | Samples | Notes |
| --- | --- | --- |
| mathwriting_train | ~143,096 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| mathwriting_symbols | ~6,091 | Isolated symbol images from MathWriting. |
| crohme_real_train | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |

Handwritten -- synthetic#

Fully synthetic images generated from expression grammars or stroke models.

| Split | Samples (raw) | Cap | Notes |
| --- | --- | --- | --- |
| mathwriting_synthetic | ~85,879 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| crohme_gen_2019 | ~51,855 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| crohme_gen_2023 | ~3,072 | none | Generated CROHME-style images, 2023 grammar. |
| crohme_gen_syntactic | ~69,397 | 15,000 | Syntactically diverse generated CROHME expressions. |

Typeset / handwriting-font -- document fragments#

Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts (Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or the Typst default (New Computer Modern), sampled uniformly (~14% default).
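
The "~14% default" figure follows directly from uniform sampling over 7 fonts (6 handwriting fonts plus the default). A minimal sketch of that sampling, assuming a hypothetical `sample_font` helper; the font names come from the paragraph above, but the code is illustrative and not taken from the repository:

```python
import random

# Font names are taken from the README; the sampling code itself is an
# illustrative assumption, not the repository's implementation.
HANDWRITING_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
ALL_FONTS = HANDWRITING_FONTS + [DEFAULT_FONT]


def sample_font(rng: random.Random) -> str:
    """Pick one of the 7 fonts with equal probability (hypothetical helper)."""
    return rng.choice(ALL_FONTS)


# With 7 equally likely fonts, the default is chosen with probability 1/7.
default_share = 1 / len(ALL_FONTS)
print(f"{default_share:.1%}")  # 14.3%
```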

| Split | Samples | Notes |
| --- | --- | --- |
| typeset_uniform_train | 10,000 | Whole document uses one uniformly-sampled font. |
| typeset_mixed_train | 20,000 | Per-paragraph font mixing (~55% of blocks get a handwriting font); requires multi-block bodies. |
| typeset_prose_train | 5,000 | Prose-heavy fragments: paragraphs with inline math, minimal bare-math content. |
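
Per-block font mixing for typeset_mixed_train might look roughly like the sketch below. The 55% probability is from the table above; the function name `assign_block_fonts` is hypothetical, and the choice to render the remaining blocks in the Typst default font is an assumption, since the README does not say what the other ~45% of blocks use:

```python
import random

HANDWRITING_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
HW_BLOCK_PROBABILITY = 0.55  # "~55% of blocks" per the README


def assign_block_fonts(n_blocks: int, rng: random.Random) -> list[str]:
    """Choose a font independently for each block (hypothetical helper).

    Assumption: blocks that do not get a handwriting font fall back to
    the Typst default font.
    """
    fonts = []
    for _ in range(n_blocks):
        if rng.random() < HW_BLOCK_PROBABILITY:
            fonts.append(rng.choice(HANDWRITING_FONTS))
        else:
            fonts.append(DEFAULT_FONT)
    return fonts


print(assign_block_fonts(4, random.Random(131)))
```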

Body types and generation weights (see src/generate_mixed.py):

  • Bare math (18%): single $ expr $ -- lowest complexity, bridges to single-equation data
  • Short inline (15%): 1--2 tokens (math, text, or mixed)
  • Longer inline (23%): 3--7 tokens
  • List (12%): 2--5 bullet/numbered items with math content
  • Para + list / list + para (8%): prose introduction or conclusion around a list
  • Multi-paragraph (12%): 2--4 paragraphs separated by blank lines
  • Table (12%): 2--4 columns, 2--5 rows, inline math cells
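
The weights above sum to 100%, so choosing a body type reduces to one weighted draw. A minimal sketch using the standard library; the dictionary keys and the `sample_body_type` name are illustrative and may not match what src/generate_mixed.py actually uses:

```python
import random

# Weights copied from the list above; the key names are hypothetical.
BODY_TYPE_WEIGHTS = {
    "bare_math": 18,
    "short_inline": 15,
    "longer_inline": 23,
    "list": 12,
    "para_with_list": 8,
    "multi_paragraph": 12,
    "table": 12,
}


def sample_body_type(rng: random.Random) -> str:
    """Draw one body type with probability proportional to its weight."""
    names = list(BODY_TYPE_WEIGHTS)
    weights = list(BODY_TYPE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


print(sample_body_type(random.Random(0)))
```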

Page widths are randomized (200--480 pt) for multi-block bodies. Text spans may be bold, italic, or underlined. Ink colour is sampled from a near-black palette.

Label conventions#

  • Math-only splits (mathwriting_*, crohme_*): manifest stores bare math expressions. data.load_records() wraps these as $ ... $ at load time.
  • typeset_* splits: manifest stores complete body content with inline $...$ delimiters already present. No wrapping applied.
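
The two conventions can be illustrated with a small sketch. The function `to_typst_label` is hypothetical and only mirrors the behaviour described above; the real logic lives in data.load_records(), whose signature is not shown in this README:

```python
def to_typst_label(split: str, label: str) -> str:
    """Return the training target for one manifest entry (illustrative).

    typeset_* splits already contain their own $...$ delimiters, so the
    label is returned unchanged. Math-only splits store bare expressions
    and are wrapped as "$ ... $".
    """
    if split.startswith("typeset_"):
        return label
    return f"$ {label.strip()} $"


print(to_typst_label("mathwriting_train", "x^2 + 1"))
print(to_typst_label("typeset_prose_train", "Let $x$ be real."))
```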

Effective training mix (after caps)#

| Split | Raw | Used | % |
| --- | --- | --- | --- |
| mathwriting_synthetic | ~85,879 | 20,000 | 19% |
| typeset_mixed_train | 20,000 | 20,000 | 19% |
| crohme_gen_2019 | ~51,855 | 15,000 | 14% |
| crohme_gen_syntactic | ~69,397 | 15,000 | 14% |
| mathwriting_train | ~143,096 | 10,000 | 10% |
| typeset_uniform_train | 10,000 | 10,000 | 10% |
| mathwriting_symbols | ~6,091 | 6,091 | 6% |
| typeset_prose_train | 5,000 | 5,000 | 5% |
| crohme_gen_2023 | ~3,072 | 3,072 | 3% |
| crohme_real_train | ~9 | 9 | <1% |
| Total | | ~104,200 | |
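
The "Used" and "%" columns can be reproduced by applying each cap and normalizing. The sketch below takes the raw counts at face value even though they are approximate, and treats the 10,000 figure for mathwriting_train as a cap because that is what the "Used" column implies; the variable names are illustrative:

```python
# Raw counts from the table above (approximate in the README).
RAW_COUNTS = {
    "mathwriting_synthetic": 85_879,
    "typeset_mixed_train": 20_000,
    "crohme_gen_2019": 51_855,
    "crohme_gen_syntactic": 69_397,
    "mathwriting_train": 143_096,
    "typeset_uniform_train": 10_000,
    "mathwriting_symbols": 6_091,
    "typeset_prose_train": 5_000,
    "crohme_gen_2023": 3_072,
    "crohme_real_train": 9,
}
# Caps; splits not listed here are used in full.
CAPS = {
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
    "mathwriting_train": 10_000,  # implied by the "Used" column
}

used = {name: min(raw, CAPS.get(name, raw)) for name, raw in RAW_COUNTS.items()}
total = sum(used.values())

for name, count in sorted(used.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {count:>7,d} {count / total:6.1%}")
print(f"{'Total':24s} {total:>7,d}")
```

The total comes to 104,172, which the table rounds to ~104,200.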

Validation#

250 samples drawn per split:

| Split | Samples | Notes |
| --- | --- | --- |
| mathwriting_val | 250 | Real handwritten single equations. |
| typeset_uniform_val | 250 | Whole-doc font document fragments. |
| typeset_mixed_val | 250 | Per-block mixed-font document fragments. |
| typeset_prose_val | 250 | Prose-heavy document fragments. |
| Total | 1,000 | |
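
Drawing a fixed-size validation subset from a larger split could be sketched as below. How the repository actually selects its 250 samples (seed, ordering, with or without shuffling) is not stated here, so the `draw_validation_subset` helper and its default seed are assumptions:

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def draw_validation_subset(records: Sequence[T], n: int = 250, seed: int = 0) -> list[T]:
    """Return up to n records sampled without replacement (hypothetical helper).

    A fixed seed keeps the subset identical across runs, so validation
    scores stay comparable between checkpoints.
    """
    if len(records) <= n:
        return list(records)
    return random.Random(seed).sample(list(records), n)


subset = draw_validation_subset(range(500))
print(len(subset))  # 250
```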

Test#

| Split | Notes |
| --- | --- |
| mathwriting_test | Held-out real handwritten equations. |
| typeset_uniform_test | Held-out whole-doc font document fragments. |
| typeset_mixed_test | Held-out mixed-font document fragments. |
| typeset_prose_test | Held-out prose-heavy document fragments. |

Known gaps#

  • typeset_* splits are font-based renders, not real handwriting photos. The model still lacks real handwritten document fragments at scale.
  • crohme_real_train has only 9 samples.
  • No mixed handwritten+typeset document examples (future: synthetic handwriting generator).

Setup#

The root environment covers data generation. The DeepSeek training environment is a separate uv project under deepseek/.

# Root env (data generation, evaluation)
uv sync

# DeepSeek training env
cd deepseek && uv sync

Generate data#

# Download handwriting fonts (once)
uv run download-hw-fonts

uv run generate-typeset --mode prose --count 5000 --out data/typeset_prose_train --seed 101
uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_val --seed 103
uv run generate-typeset --mode prose --count 500 --out data/typeset_prose_test --seed 107

uv run generate-typeset --mode uniform --count 10000 --out data/typeset_uniform_train --seed 109
uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_val --seed 113
uv run generate-typeset --mode uniform --count 500 --out data/typeset_uniform_test --seed 127

uv run generate-typeset --mode mixed --count 20000 --out data/typeset_mixed_train --seed 131
uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_val --seed 137
uv run generate-typeset --mode mixed --count 500 --out data/typeset_mixed_test --seed 139
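
The nine commands above differ only in mode, split, count, and seed, so they can also be produced from a small table. The sketch below only prints the commands as a dry run; it uses the flags exactly as they appear above, and the `SPLITS` table and `build_commands` helper are illustrative, not part of the repository:

```python
import shlex

# (mode, split, count, seed) copied from the commands above.
SPLITS = [
    ("prose", "train", 5000, 101),
    ("prose", "val", 500, 103),
    ("prose", "test", 500, 107),
    ("uniform", "train", 10000, 109),
    ("uniform", "val", 500, 113),
    ("uniform", "test", 500, 127),
    ("mixed", "train", 20000, 131),
    ("mixed", "val", 500, 137),
    ("mixed", "test", 500, 139),
]


def build_commands() -> list[list[str]]:
    """Build one generate-typeset argument list per split."""
    commands = []
    for mode, split, count, seed in SPLITS:
        commands.append([
            "uv", "run", "generate-typeset",
            "--mode", mode,
            "--count", str(count),
            "--out", f"data/typeset_{mode}_{split}",
            "--seed", str(seed),
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Each split gets its own seed, so regenerating one split does not change the others.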

Train (DeepSeek-OCR-2)#

cd deepseek
uv run train-deepseek --output-dir ../checkpoints/deepseek-ocr2-run1 --epochs 2

Smoke test (small caps, ~1 hour):

cd deepseek && sh run-smoke.sh

Evaluate#

cd deepseek
uv run evaluate-deepseek --checkpoint ../checkpoints/deepseek-ocr2-run1/final --n 100