commits
Randomly inverts 50% of handwriting images during training to expose the
model to dark-background variants. Gated to _HANDWRITING_SPLITS only --
typeset splits are excluded because color-only-distinct emojis (colored
circles/squares) lose discriminative information under RGB inversion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
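The gating above can be sketched as a pure function (split names and the p=0.5 default are illustrative stand-ins, not the project's actual constants):

```python
import random

_HANDWRITING_SPLITS = {"mathwriting_train", "mathwriting_val"}  # illustrative names

def maybe_invert(pixels, split, p=0.5, rng=random):
    """Invert a flat list of 0-255 grayscale values with probability p,
    but only for handwriting splits; typeset splits pass through untouched."""
    if split in _HANDWRITING_SPLITS and rng.random() < p:
        return [255 - v for v in pixels]
    return pixels
```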
- _bg_color: sample image corners to detect background color
- _find_blocks: adaptive ink threshold for dark/light mode
- _transform_patch, _region_jitter, _augment: use bg color for all
canvas fills instead of hardcoded white; elastic transforms now use
pad-apply-crop to avoid white-fill artifacts
- augment_vis: simplify to single _augment call matching training exactly
- Update DVC manifests for regenerated typeset splits
- Expand _EMOJI pool with directional arrows, pin, food items
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
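The corner-sampling idea behind _bg_color can be sketched like this, operating on a plain 2-D list of grayscale rows rather than a PIL image (the k×k patch size and the median reduction are assumptions):

```python
def bg_color(img, k=3):
    """Estimate the background gray level by sampling the four k*k corner
    patches and taking the median, so a dark-mode page yields a dark fill."""
    h, w = len(img), len(img[0])
    samples = []
    for ys in (range(k), range(h - k, h)):
        for xs in (range(k), range(w - k, w)):
            samples.extend(img[y][x] for y in ys for x in xs)
    samples.sort()
    return samples[len(samples) // 2]
```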
- Extract all magic numbers into named AUG_* constants
- Fix affine crop to account for rotation corner shift and scale expansion
- Fix perspective transform clipping (pad before, crop after)
- Add _find_blocks, _transform_patch, _region_jitter for per-block jitter
- Cap per-block dy to half the inter-block gap to prevent overlap
- Soften patch elastic distortion (alpha 6→4 lowers amplitude, sigma 3→5 smooths the field)
- Content-type detection: lists get AUG_JITTER_LIST_MAX_DX=10 vs 40 for prose
- Increase ruled-line opacity (28-55→60-110) and probability (20%→30%)
- augment-vis: show orig / aug / aug+jitter columns; save NN_typst.txt;
print list detection tag and 80-char typst preview
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
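The per-block dy cap amounts to a small clamp (a sketch; the real jitter code presumably derives the gaps from _find_blocks output):

```python
def cap_dy(dy, gap_above, gap_below):
    """Clamp a block's vertical jitter so it can move at most half-way
    into the whitespace separating it from either neighbour."""
    return max(-gap_above / 2, min(gap_below / 2, dy))
```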
Adds a prose-only synthetic split (typeset_prose_{train,val,test}) rendered
in handwriting-style fonts with no math delimiters, addressing the prior bias
P(Math | Handwritten) ≈ 1 learned from math-only real handwriting datasets.
Also extends _augment() with ElasticTransform (baseline wobble) and
RandomPerspective (photographed-paper effect), plus optional ruled-line
overlay for notebook paper simulation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
modeling_deepseekocr2.py is the upstream model package with one local patch:
in the eval_mode branch of infer(), pass attention_mask=torch.ones_like(input_ids)
to generate() to suppress the spurious warning caused by pad_token_id == eos_token_id,
and reuse the already-computed _input_ids_cuda tensor in the decode step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Bump render PPI 150->250 in generate_typeset.py; increase margins y:8->12pt
- Rename typeset_structured_* splits to typeset_uniform_* (mode=uniform vs mixed)
- Consolidate generate_handwritten.py into generate_typeset.py; drop dead
generate_mixed.py and generate_handwritten.py entrypoints
- Regenerate typeset_uniform (10k train) and typeset_mixed (20k train) at new PPI
- Strengthen PROMPT: require raw Typst output, explicitly forbid LaTeX
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- LICENSE: 0BSD for project code and dataset generation scripts
- data/fonts/LICENSE-OFL-1.1.txt: covers Comic Neue, Gochi Hand, Handlee, Oswald, Dancing Script
- data/fonts/LICENSE-Apache-2.0.txt: covers Special Elite (Astigmatic)
- data/fonts/README.md: copyright notices and designer credits
- data/fonts/: flatten from data/fonts/handwriting/ (removed unnecessary nesting)
- src/download_hw_fonts.py, src/generate_handwritten.py: update default font path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- data/: remove typeset_train/val/test and typeset_mixed_* DVC tracking (old single-equation
and font-homogeneous splits); add typeset_structured_* and typeset_mixed_* DVC files
(renamed from hw_*; font-diverse renders with expanded body grammar)
- data/fonts/handwriting/: commit TTF files directly (33-172KB each)
- src/data.py: update split names to typeset_structured_* / typeset_mixed_*
- pyproject.toml: pin unsloth==2026.4.5 (2026.4.4 dropped gemma-4-E2B-it support)
- uv.lock: update accordingly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/download_hw_fonts.py: downloads 6 Google Fonts TTFs, strips WOFF wrappers,
instantiates variable fonts at wght=400 for full character coverage
- src/generate_handwritten.py: hw mode (whole-doc font) and mix mode (per-block
font mixing); 7-way uniform font sampling including Typst default; manifest
records clean body (no font directives)
- src/generate_mixed.py: expand generate_body -- add 18% bare math, 15% short
inline (1-2 tokens); reduce complex structured weight; min complexity now n=1
- src/data.py: replace typeset_* splits with hw_structured_* and hw_mixed_*;
update val sampling to use VAL_SPLITS
- src/train.py: fix val loading to use VAL_SPLITS from data.py; move import to
top level
- pyproject.toml: add generate-hw and download-hw-fonts entry points
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- train.py: remove dedupe, port per-split caps from DeepSeek script, add
argparse (--epochs, --lr, --output-dir, --cap), default output dir to
checkpoints/gemma-4-e2b, add tensorboard logging
- eval_deepseek.py: --n is now per-split cap (stratified sampling) instead
of a head slice across all splits combined
- data.py: add mathwriting_val/mathwriting_test to _MATH_ONLY_SPLITS so
bare-expression labels get $ ... $ wrapping at eval time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
collate_deepseek: pass zeros for local_crops so forward() takes the
global-only branch (145 features), not local+global (289 features).
The old code passed the real image for both slots, causing masked_scatter_
to inject local-crop features and discard the global view entirely.
Also: add eval_deepseek, mine_failures, infer_debug, train_hnm scripts;
add split tracking to data.py; save every 250 steps (keep 10); backfill
TensorBoard epoch + learning_rate scalars.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- report_to=["tensorboard"] for live training run metrics
- Add tensorboard>=2.20.0 and setuptools==81.0.0 deps (setuptools 82
removed pkg_resources which tensorboard 2.20 still imports)
- src/backfill_tb.py: parse stdout training log and write TB events
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trl 0.15 with report_to='none' suppresses stdout logging.
Empty list keeps external reporters off while preserving console output.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
batch=2, grad_accum=4 keeps effective batch 8 but improves GPU utilization.
~5.2 GB VRAM headroom at batch=1 should accommodate the extra activations.
Seed 42 -> 29979 for train/val sampling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Val: typeset_val now sampled to 250 (was all ~1000), matching mathwriting_val.
Total val: ~1000 (250 mathwriting + 250 typeset + 500 mixed).
README: add effective training mix table with caps applied, validation
table with exact sample counts, and test split listing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No longer Gemma 4 exclusive -- supports DeepSeek-OCR-2 and future backends.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- requires-python >= 3.12, .python-version pinned to 3.12
- Default epochs 3 -> 1 (~59h on RTX 3060 at current step time)
- Cap mathwriting_train at 10k (was 42k; real but over-represented)
- Cap mathwriting_synthetic at 20k, crohme_gen_2019 at 15k (unchanged)
- Total training samples: ~91k, ~11.4k optimizer steps per epoch
- Keep eager attention (MLA absorption trick is already efficient;
flash-attn source compile is too heavy for available hardware)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README: detailed write-up of all training splits, label conventions, known
generalization gaps, and validation composition.
Training: cap mathwriting_synthetic→20k and crohme_gen_2019→15k so synthetic
data doesn't dominate. typeset_mixed rises to ~16%, real mathwriting becomes
the dominant handwriting source at 34%. Total: ~124k samples.
Validation: all typeset_val (1k) + all typeset_mixed_val (500) + 250 sampled
from mathwriting_val = 1,750 total.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BASE/PATCHES shape prints fired on every forward pass during training.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two patches to modeling_deepseekocr2.py for training compatibility:
1. .clone(): breaks autograd leaf-variable link so masked_scatter_ on
inputs_embeds slice doesn't raise during backprop
2. .to(bfloat16): matches vision encoder dtype (prepare_model_for_kbit_training
upcasts embedding table to fp32; vision encoder stays bfloat16)
train_deepseek.py now imports DeepseekOCR2ForCausalLM directly from the local
src/deepseek_ocr2 module instead of trust_remote_code -- weights still fetched
from hub, only the forward() code is local and version-controlled.
Smoke test: forward OK (loss 16.85), backward OK.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copied from deepseek-ai/DeepSeek-OCR-2 commit aaa02f38.
No modifications -- patches for training compatibility follow in the next commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4-bit NF4 + LoRA r=16 targeting MLA attention and MLP layers.
Freezes SAM vision encoder (already strongly pretrained).
Custom DeepSeekTrainer subclass moves list-of-tuple images to device.
Smoke-test flag validates forward+backward before full training run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
src/collate_deepseek.py:
- Letterbox-pads images to 768×768 (gray fill, mean=0.5), normalizes
- Inserts 145 image tokens (12²+1) at sequence start
- Masks image+prompt prefix with -100; trains on response+EOS only
- Builds images_seq_mask and images_spatial_crop for forward()
- crop_mode=False: single global view, same tensor for both tuple slots
(TODO: validate images tuple format against model source)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
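The letterbox step can be sketched on a plain 2-D grayscale list (fill=128 approximates the mean=0.5 gray after /255 normalization; real code would also resize the long side to 768 first):

```python
def letterbox(img, size=768, fill=128):
    """Centre img on a size*size canvas filled with mid-gray.
    Assumes img already fits within the canvas."""
    h, w = len(img), len(img[0])
    top, left = (size - h) // 2, (size - w) // 2
    canvas = [[fill] * size for _ in range(size)]
    for y, row in enumerate(img):
        canvas[top + y][left:left + w] = row
    return canvas
```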
CROHME, MathWriting, and typeset_train manifests store bare math expressions.
Since images render as display math ($ expr $), training targets should be
valid Typst -- wrap at load_records() via _MATH_ONLY_SPLITS set rather than
touching manifests. Mixed splits already contain full body content.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
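The load-time wrapping amounts to the following (split names here are an illustrative subset of _MATH_ONLY_SPLITS, not the full set):

```python
_MATH_ONLY_SPLITS = {"typeset_train", "mathwriting_train", "crohme_gen_2019"}  # illustrative subset

def wrap_label(split, label):
    """Manifests for math-only splits store bare expressions; wrap them as
    Typst display math at load time so targets match the rendered images."""
    return f"$ {label} $" if split in _MATH_ONLY_SPLITS else label
```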
_styled() now accepts underline param, rendering as #underline[...] wrapping
any bold/italic markup. Bold, italic, and underline are each applied
independently at 25% (down from 30% for bold/italic).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
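A sketch of the independent styling rolls in Typst markup (the exact signature of _styled is assumed):

```python
import random

def styled(text, rng=random, p=0.25):
    """Apply bold, italic, and underline independently, each with prob p.
    Underline wraps whatever bold/italic markup was already applied."""
    if rng.random() < p:
        text = f"*{text}*"              # Typst strong
    if rng.random() < p:
        text = f"_{text}_"              # Typst emphasis
    if rng.random() < p:
        text = f"#underline[{text}]"
    return text
```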
generate_typeset.py:
- _polynomial(): explicit (a x^3 + b x^2 + c x + d), indexed (a_0 + a_1 x + ... + a_n x^n),
monic general form; variable pool includes operator letters T, A, D, L, X, S
- _schematic_matrix(): generic n×m with dots.c / dots.v / dots.down ellipsis
- New 0.73–0.76 probability slots (stole 3% from product/df/dx/partial)
generate_mixed.py:
- _multi_paragraph(): 2–4 inline seqs separated by blank lines (~15% of bodies)
- _para_then_list() / _list_then_para(): intro-para+list and list+outro-para (~10%)
- All reflowable bodies (multi-para, lists, mixed) get random fixed width from
_PARA_WIDTHS = [200, 240, 280, 320, 360, 420, 480]pt to cover narrow two-column
journal through wide single-column; tables stay width: auto
- generate_body() returns (body, page_width) tuple; _CONTENT_TEMPLATE parameterized
with {width}; render_content() accepts page_width arg
- Image hash keyed on "width:body" to avoid collisions across reflow variants
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tall math (sqrt of integrals, underbrace labels, nested radicals) can
exceed Typst's default em-based list spacing and visually overlap the
item above. Increase list and enum spacing to 1.0em.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
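The fix presumably amounts to a pair of Typst set rules along these lines (the 1.0em value is from the commit; placement in the template is assumed):

```typst
#set list(spacing: 1.0em)
#set enum(spacing: 1.0em)
```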
Replaces hardcoded 2x2/3x3 matrix branch with a shape table covering
1x2, 1x3, 1x4 (row vectors), 2x2, 2x3, 3x2, 3x3. Column vectors
excluded since vec() already handles those. Shape is chosen via weighted
sample; rows assembled generically so adding new shapes is trivial.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
generate_typeset.py:
- Add _CALLIGRAPHIC pool; cal(X) appears in _atom() at ~10% and cal(P)(expr) in tail
- Add phi.alt, epsilon.alt, theta.alt (LaTeX \varphi, \varepsilon, \vartheta variants)
- Add logic/sequent block (3%): tack.r, and/or/=>/<=>/xor, not, models, top/bot, type judgments
- Add vec(a,b) / vec(a,b,c) column vectors
- Add underbrace/overbrace/underbracket/overbracket/underparen/overparen with atom labels
- Add bra-ket Dirac notation: lr(|ψ⟩), lr(⟨ψ|), lr(⟨φ|ψ⟩), lr(⟨A⟩)
- Add intervals via lr(): closed [a,b], open (a,b), half-open [a,b) and (a,b]
- 3x3 matrices (35% of matrix branch, was 2x2 only)
- Cartesian product type signatures: f: A×B→C, f: A×B×C→D
- _atom() now includes calligraphic letters
generate_mixed.py:
- Add _generate_table(): 15% of bodies; 2-4 cols, 2-5 rows; cell complexity
inversely proportional to table size; delegates to _inline_seq for unified
content (inherits emoji, math/text mix automatically)
- Table inset (x:5pt, y:7pt) to prevent underbrace/overbrace label clipping
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces `d` with `dif` (Typst differential operator) in two contexts:
1. Derivative fractions: (d A)/(d B) and operator form d/(d z)
2. Integral differentials: d-tokens following an `integral` keyword
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
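A simplified version of the rewrite (the real rules are context-aware; this sketch only handles single-letter variables):

```python
import re

def fix_differentials(expr):
    """Replace derivative-fraction and integral `d` tokens with Typst `dif`."""
    # derivative fractions: (d A)/(d B) -> (dif A)/(dif B)
    expr = re.sub(r"\(d ([A-Za-z])\)/\(d ([A-Za-z])\)", r"(dif \1)/(dif \2)", expr)
    # differentials following an integral keyword: ... d x -> ... dif x
    if "integral" in expr:
        expr = re.sub(r"\bd ([a-z])\b", r"dif \1", expr)
    return expr
```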
- Web app search supports `-term` exclusion (e.g. `integral -dif`)
- search-labels CLI gains `--exclude` flag (repeatable)
- /items returns total count; UI shows it next to search box and in load-more button
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Needed for patterns containing non-word chars like ^prime.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
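Presumably the fix runs user patterns through re.escape before compiling, e.g.:

```python
import re

pattern = "^prime"                    # ^ is a regex anchor if left unescaped
rx = re.compile(re.escape(pattern))   # now matches the literal text ^prime
assert rx.search("x^prime + y")                        # literal match found
assert re.compile(pattern).search("x^prime + y") is None  # anchor: no match
```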
uv run search-labels gt
uv run search-labels prime --replace "'" --dry-run
uv run search-labels gt --replace >
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/review_app.py: FastAPI app serving a browser UI for paging through
dataset items, marking reviewed/flagged, and editing labels inline.
State persists in data/review.db (SQLite, git-ignored). Items table
reloads from manifests each startup; reviews/flags/edits are durable.
Cursor-based pagination, server-side filtering (pending/reviewed/
flagged/edited/all), per-split stats in toolbar.
- src/static/review.html: vanilla JS frontend. All interactions go
through an `actions` object; keyboard shortcut hook is stubbed out
(commented) for easy wiring later.
- src/apply_edits.py: CLI to flush edits table -> manifest.jsonl files,
with --dry-run and per-split filtering. Prints dvc add / git commit
instructions on completion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
src/probe_deepseek.py adds quantization support (--bits 4/8/16),
integrates with src.data.load_records/TEST_SPLITS and src.eval.normalize.
pyproject.toml entrypoint already pointed here.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Copied CROHME+MathWriting raster splits from eff-mer with clean names
- Renamed data/typeset -> data/typeset_train for consistency
- Unified RASTER_ROOT+TYPESET_ROOT into single DATA_ROOT in src/data.py
- typeset_train now in TRAIN_SPLITS directly; train.py simplified accordingly
- DVC initialized; all 14 splits tracked as .dvc pointer files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PEP 723 inline-dep scripts for isolated transformers==4.46.3 env:
- scripts/probe_deepseek.py: inference probe
- scripts/train_deepseek.py: QLoRA fine-tuning on Typst OCR data
Also adds addict, matplotlib, einops, easydict to pyproject deps
and a (non-functional) probe-deepseek entry point stub.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
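PEP 723 inline metadata pins each script's environment in a comment header; a minimal sketch (only the transformers pin is from the commit):

```python
# /// script
# dependencies = ["transformers==4.46.3"]
# ///
```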
Canvas-based frontend for drawing math expressions and getting Typst
output from the model. Includes probe.py cleanup and eff_mer/infer.py
whitespace normalization fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copies eff_mer package (encoder, decoder, vocab, data, train, infer)
into src/eff_mer/ and adds eff-mer-evaluate entrypoint. Paths updated
to resolve eff-mer data relative to sibling repo location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>