typst-ocr#

Fine-tuning vision-language models to transcribe mathematical expressions and document fragments into Typst notation.

Model: Gemma 4 E2B, fine-tuned with QLoRA via Unsloth (see src/train.py).


Training data#

Overview#

The training set combines handwritten math datasets with synthetically rendered Typst documents. The handwriting-font splits (hw_*) cover both single equations and structured document content, each rendered in one of 6 diverse fonts or the Typst default, with the font sampled uniformly.

Splits#

Handwritten -- real / semi-real#

| Split | Samples | Notes |
| --- | --- | --- |
| mathwriting_train | ~143,096 | Google MathWriting: digitized pen strokes rendered to images. Closest data to real-world photos of handwriting. |
| mathwriting_symbols | ~6,091 | Isolated symbol images from MathWriting. |
| crohme_real_train | ~9 | Real CROHME competition handwriting samples. Tiny but genuine. |

Handwritten -- synthetic#

Fully synthetic images generated from expression grammars or stroke models.

| Split | Samples (raw) | Cap | Notes |
| --- | --- | --- | --- |
| mathwriting_synthetic | ~85,879 | 20,000 | Synthetic stroke-rendered images from MathWriting grammar. |
| crohme_gen_2019 | ~51,855 | 15,000 | Generated CROHME-style images, 2019 grammar. |
| crohme_gen_2023 | ~3,072 | none | Generated CROHME-style images, 2023 grammar. |
| crohme_gen_syntactic | ~69,397 | 15,000 | Syntactically diverse generated CROHME expressions. |

Typeset / handwriting-font -- document fragments#

Each sample is a rendered Typst page fragment containing inline math embedded in prose, lists, or tables. Content is rendered in one of 6 handwriting fonts (Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script, Special Elite) or the Typst default (New Computer Modern). The font is sampled uniformly over these 7 options, so about 14% of samples use the default.

| Split | Samples | Notes |
| --- | --- | --- |
| hw_structured_train | 15,000 | Whole document uses one uniformly-sampled font. |
| hw_mixed_train | 10,000 | Per-paragraph font mixing (~55% of blocks get hw font); requires multi-block bodies. |
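
The two font-assignment modes in the table can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function names are hypothetical, and it assumes that blocks in the mixed mode which do not get a handwriting font fall back to the Typst default.

```python
import random

# Font names are taken from the README; the sampling logic is an assumption.
HW_FONTS = [
    "Comic Neue", "Gochi Hand", "Handlee",
    "Oswald", "Dancing Script", "Special Elite",
]
DEFAULT_FONT = "New Computer Modern"
ALL_FONTS = HW_FONTS + [DEFAULT_FONT]


def pick_document_font(rng):
    """Structured mode: one font for the whole document, uniform over all 7."""
    return rng.choice(ALL_FONTS)


def pick_block_fonts(rng, n_blocks, hw_prob=0.55):
    """Mixed mode: each block independently gets a handwriting font with
    probability hw_prob; otherwise it uses the default (assumed fallback)."""
    return [
        rng.choice(HW_FONTS) if rng.random() < hw_prob else DEFAULT_FONT
        for _ in range(n_blocks)
    ]


rng = random.Random(0)
print(pick_document_font(rng))
print(pick_block_fonts(rng, n_blocks=4))
```

With 7 equally likely options, the default font is chosen with probability 1/7, which is where the ~14% figure comes from.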

Body types and generation weights (see src/generate_mixed.py):

  • Bare math (18%): single $ expr $ -- lowest complexity, bridges to single-equation data
  • Short inline (15%): 1--2 tokens (math, text, or mixed)
  • Longer inline (23%): 3--7 tokens
  • List (12%): 2--5 bullet/numbered items with math content
  • Para + list / list + para (8%): prose introduction or conclusion around a list
  • Multi-paragraph (12%): 2--4 paragraphs separated by blank lines
  • Table (12%): 2--4 columns, 2--5 rows, inline math cells

Page widths are randomized (200--480 pt) for multi-block bodies. Text spans may be bold, italic, or underlined. Ink colour is sampled from a near-black palette.
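
Weighted selection of a body type, plus a random page width, might look like the sketch below. The weights are copied from the list above; the dictionary keys, the function name, and the choice of which body types count as "multi-block" are assumptions made for illustration and may not match src/generate_mixed.py.

```python
import random

# Weights (in percent) copied from the README; key names are hypothetical.
BODY_WEIGHTS = {
    "bare_math": 18,
    "short_inline": 15,
    "longer_inline": 23,
    "list": 12,
    "para_list": 8,
    "multi_paragraph": 12,
    "table": 12,
}

# Assumption: these body types are treated as multi-block.
MULTI_BLOCK = {"list", "para_list", "multi_paragraph", "table"}


def sample_body_spec(rng):
    """Pick a body type by weight and, for multi-block bodies, a page width."""
    kinds = list(BODY_WEIGHTS)
    weights = [BODY_WEIGHTS[k] for k in kinds]
    kind = rng.choices(kinds, weights=weights, k=1)[0]
    # Page width in points, drawn uniformly from the documented range.
    width_pt = rng.uniform(200, 480) if kind in MULTI_BLOCK else None
    return kind, width_pt


rng = random.Random(0)
print(sample_body_spec(rng))
```

The weights sum to 100, so each one can be read directly as the expected percentage of generated samples.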

Label conventions#

  • Math-only splits (mathwriting_*, crohme_*): manifest stores bare math expressions. data.load_records() wraps these as $ ... $ at load time.
  • hw_* splits: manifest stores complete body content with inline $...$ delimiters already present. No wrapping applied.
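
The two conventions can be illustrated with a small helper. This is only a sketch of the rule described above, not the actual body of data.load_records(); the name normalize_label and its signature are hypothetical.

```python
# Split-name prefixes whose manifests store bare math (per the list above).
MATH_ONLY_PREFIXES = ("mathwriting_", "crohme_")


def normalize_label(split, label):
    """Return the training target for a manifest label.

    Math-only splits store bare math, so it is wrapped as `$ ... $`.
    Other splits (hw_*) already carry their own `$...$` delimiters.
    """
    if split.startswith(MATH_ONLY_PREFIXES):
        return f"$ {label.strip()} $"
    return label


print(normalize_label("crohme_gen_2019", "x^2 + 1"))          # $ x^2 + 1 $
print(normalize_label("hw_mixed_train", "Let $x$ be real."))  # unchanged
```

Wrapping at load time keeps the math-only manifests free of delimiters while giving the model a single target format across all splits.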

Effective training mix (after caps)#

| Split | Raw | Used | % |
| --- | --- | --- | --- |
| mathwriting_train | ~143,096 | 10,000 | 11% |
| mathwriting_synthetic | ~85,879 | 20,000 | 21% |
| crohme_gen_2019 | ~51,855 | 15,000 | 16% |
| crohme_gen_syntactic | ~69,397 | 15,000 | 16% |
| hw_structured_train | 15,000 | 15,000 | 16% |
| hw_mixed_train | 10,000 | 10,000 | 11% |
| crohme_gen_2023 | ~3,072 | 3,072 | 3% |
| mathwriting_symbols | ~6,091 | 6,091 | 6% |
| crohme_real_train | ~9 | 9 | <1% |
| Total | | ~94,200 | |
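
The "Used" and "%" columns follow from taking min(raw, cap) per split. The sketch below reproduces that arithmetic from the figures in this README; the variable and function names are illustrative, the raw counts are the approximate values listed above, and the 10,000 cap on mathwriting_train is read off the "Used" column.

```python
# Raw sizes as listed in the README (approximate for the external datasets).
RAW = {
    "mathwriting_train": 143_096,
    "mathwriting_synthetic": 85_879,
    "crohme_gen_2019": 51_855,
    "crohme_gen_syntactic": 69_397,
    "hw_structured_train": 15_000,
    "hw_mixed_train": 10_000,
    "crohme_gen_2023": 3_072,
    "mathwriting_symbols": 6_091,
    "crohme_real_train": 9,
}

# Caps; splits not listed here are used in full.
CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
    "crohme_gen_syntactic": 15_000,
}


def effective_mix(raw, caps):
    """Apply per-split caps and return (used counts, total)."""
    used = {split: min(n, caps.get(split, n)) for split, n in raw.items()}
    return used, sum(used.values())


used, total = effective_mix(RAW, CAPS)
for split, n in used.items():
    print(f"{split:24s} {n:>7,d} {100 * n / total:5.1f}%")
print(f"{'total':24s} {total:>7,d}")
```

With these inputs the total comes to 94,172, which the table rounds to ~94,200.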

Validation#

Up to 250 samples are drawn per split (seed 42); a split with fewer than 250 samples is used in full:

| Split | Samples | Notes |
| --- | --- | --- |
| mathwriting_val | 250 | Real handwritten single equations. |
| hw_structured_val | 250 | Whole-doc font document fragments. |
| hw_mixed_val | 250 | Per-block mixed-font document fragments. |
| Total | 750 | |
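
Seeded, capped sampling of a validation split could be sketched like this. The helper name is hypothetical, and whether the repository seeds one generator per split or shares one across splits is not stated here, so treat the per-call seeding as an assumption.

```python
import random


def sample_validation(records, n=250, seed=42):
    """Draw up to n records reproducibly; smaller splits are returned whole
    (in shuffled order)."""
    rng = random.Random(seed)  # assumption: a fresh seeded RNG per split
    k = min(n, len(records))
    return rng.sample(records, k)


# Stand-in data: a 500-item split and a 9-item split.
big_split = list(range(500))
tiny_split = list(range(9))
print(len(sample_validation(big_split)))   # 250
print(len(sample_validation(tiny_split)))  # 9
```

Fixing the seed means repeated evaluation runs score the same 250 items, so metrics stay comparable between checkpoints.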

Test#

| Split | Notes |
| --- | --- |
| mathwriting_test | Held-out real handwritten equations. |
| hw_structured_test | Held-out whole-doc font document fragments. |
| hw_mixed_test | Held-out mixed-font document fragments. |

Known gaps#

  • hw_* splits are font-based renders, not real handwriting photos. The model still lacks real handwritten document fragments at scale.
  • crohme_real_train has only 9 samples.

Setup#

uv sync

Generate data#

# Download handwriting fonts (once)
uv run download-hw-fonts

# Generate hw splits
uv run generate-hw --mode hw  --count 15000 --out data/hw_structured_train
uv run generate-hw --mode mix --count 10000 --out data/hw_mixed_train
uv run generate-hw --mode hw  --count 500   --out data/hw_structured_val  --seed 100
uv run generate-hw --mode mix --count 500   --out data/hw_mixed_val       --seed 100
uv run generate-hw --mode hw  --count 500   --out data/hw_structured_test --seed 200
uv run generate-hw --mode mix --count 500   --out data/hw_mixed_test      --seed 200

Train#

uv run train --output-dir checkpoints/gemma4e2b-run1 --epochs 2

Evaluate#

uv run evaluate --checkpoint checkpoints/gemma4e2b-run1/final --n 100