Cap typeset_val at 250; document training mix and splits in README

Val: typeset_val is now sampled down to 250 records (previously all ~1,000
were used), matching the mathwriting_val cap.
Total val: ~1,000 (250 mathwriting + 250 typeset + ~500 mixed), down from ~1,750.

README: add effective training mix table with caps applied, validation
table with exact sample counts, and test split listing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

oscillatory.net 3 weeks ago d4e31716 c4c0a79b

+30 -6

2 changed files


README.md (+26 -3)

@@ -88,14 +88,37 @@
 - **Mixed splits** (`typeset_mixed_*`): manifest stores complete body content
   with inline `$...$` delimiters already present. No wrapping applied.
 
+### Effective training mix (after caps)
+
+| Split | Raw | Used | % |
+|---|---|---|---|
+| `mathwriting_train` | ~42,880 | **10,000** | 11% |
+| `mathwriting_synthetic` | ~78,152 | **20,000** | 22% |
+| `crohme_gen_2019` | ~40,069 | **15,000** | 16% |
+| `crohme_gen_syntactic` | ~15,653 | 15,653 | 17% |
+| `typeset_mixed_train` | ~20,000 | 20,000 | 22% |
+| `typeset_train` | ~8,000 | 8,000 | 9% |
+| `crohme_gen_2023` | ~2,682 | 2,682 | 3% |
+| `mathwriting_symbols` | ~168 | 168 | <1% |
+| `crohme_real_train` | ~9 | 9 | <1% |
+| **Total** | | **~91,500** | |
+
 ### Validation
 
 | Split | Samples used | Notes |
 |---|---|---|
-| `mathwriting_val` | 250 (sampled) | Covers real handwritten single equations. |
-| `typeset_val` | ~1,000 (all) | Typeset single equations. |
+| `mathwriting_val` | 250 (sampled, seed 42) | Real handwritten single equations. |
+| `typeset_val` | 250 (sampled, seed 42) | Typeset single equations. |
 | `typeset_mixed_val` | ~500 (all) | Document fragments; primary signal for layout generalization. |
-| **Total** | **~1,750** | |
+| **Total** | **~1,000** | |
+
+### Test
+
+| Split | Notes |
+|---|---|
+| `mathwriting_test` | Held-out real handwritten equations. |
+| `typeset_test` | Held-out typeset equations. |
+| `typeset_mixed_test` | Held-out document fragments. |
 
 ### Known gaps
 
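The "Used" and "%" columns of the training mix table follow from the raw counts and the three caps. A minimal sketch of that arithmetic, assuming the approximate raw counts from the table; the names `RAW_COUNTS`, `CAPS`, and `effective_mix` are hypothetical and do not come from the repository:

```python
# Approximate raw counts, copied from the README table in this commit.
RAW_COUNTS = {
    "mathwriting_train": 42_880,
    "mathwriting_synthetic": 78_152,
    "crohme_gen_2019": 40_069,
    "crohme_gen_syntactic": 15_653,
    "typeset_mixed_train": 20_000,
    "typeset_train": 8_000,
    "crohme_gen_2023": 2_682,
    "mathwriting_symbols": 168,
    "crohme_real_train": 9,
}

# Hypothetical cap table: splits not listed here are used in full.
CAPS = {
    "mathwriting_train": 10_000,
    "mathwriting_synthetic": 20_000,
    "crohme_gen_2019": 15_000,
}


def effective_mix(raw, caps):
    """Apply per-split caps and return ({split: (used, share)}, total)."""
    used = {name: min(n, caps.get(name, n)) for name, n in raw.items()}
    total = sum(used.values())
    return {name: (n, n / total) for name, n in used.items()}, total


mix, total = effective_mix(RAW_COUNTS, CAPS)
for name, (n, share) in mix.items():
    print(f"{name:24s} {n:>7,} {share:6.1%}")
print(f"{'Total':24s} {total:>7,}")
```

With these inputs the total comes to 91,512, which the README rounds to ~91,500.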

src/train_deepseek.py (+4 -3)

@@ -185,12 +185,13 @@
 mw_val = load_records(["mathwriting_val"], dedupe=False)
 typeset_val = load_records(["typeset_val"], dedupe=False)
 mixed_val = load_records(["typeset_mixed_val"], dedupe=False)
-mw_sample = _rng.sample(mw_val, min(250, len(mw_val)))
-val_records = mw_sample + typeset_val + mixed_val
+mw_sample = _rng.sample(mw_val, min(250, len(mw_val)))
+ts_sample = _rng.sample(typeset_val, min(250, len(typeset_val)))
+val_records = mw_sample + ts_sample + mixed_val
 _rng.shuffle(val_records)
 
 print(f"Train: {len(train_records):,} Val: {len(val_records):,} "
-      f"(mathwriting={len(mw_sample)}, typeset={len(typeset_val)}, mixed={len(mixed_val)})")
+      f"(mathwriting={len(mw_sample)}, typeset={len(ts_sample)}, mixed={len(mixed_val)})")
 
 train_ds = make_dataset(train_records, do_augment=True)
 val_ds = make_dataset(val_records, do_augment=False)
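The capped sampling in this hunk can be tried in isolation. A minimal sketch, assuming `_rng` is a `random.Random` instance seeded with 42 (the README's "seed 42" note suggests this, but the seeding code is not part of the diff); the `cap_sample` helper and the placeholder record lists are hypothetical:

```python
import random


def cap_sample(records, cap, rng):
    """Return at most `cap` records, drawn without replacement."""
    return rng.sample(records, min(cap, len(records)))


# Assumed seeding; the real script defines _rng outside the shown hunk.
rng = random.Random(42)

# Placeholder records standing in for load_records(...) output.
mw_val = [f"mw_{i}" for i in range(1000)]
typeset_val = [f"ts_{i}" for i in range(1000)]
mixed_val = [f"mixed_{i}" for i in range(500)]

mw_sample = cap_sample(mw_val, 250, rng)
ts_sample = cap_sample(typeset_val, 250, rng)
val_records = mw_sample + ts_sample + mixed_val
rng.shuffle(val_records)

print(f"Val: {len(val_records):,} "
      f"(mathwriting={len(mw_sample)}, typeset={len(ts_sample)}, mixed={len(mixed_val)})")
```

The `min(cap, len(records))` guard matters because `Random.sample` raises `ValueError` when asked for more items than the population holds. If one generator is shared in this way, the mathwriting sample is drawn first and is unaffected by the new typeset draw, while the later shuffle order does change.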
