Add README with training split documentation; rebalance train data; fix val

README: detailed write-up of all training splits, label conventions, known
generalization gaps, and validation composition.

Training: cap mathwriting_synthetic at 20k and crohme_gen_2019 at 15k so
synthetic data doesn't dominate. typeset_mixed rises to ~16%, and real
mathwriting becomes the dominant handwriting source at 34%.
Total: ~124k samples.
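
A minimal sketch of the capping step, for illustration only. The helper
names (cap_split, apply_caps), the fixed seed, and the dict-of-sequences
layout are assumptions, not the code in this repository; only the two cap
values come from this commit.

```python
import random

# Caps taken from this commit; splits not listed here are left uncapped.
CAPS = {"mathwriting_synthetic": 20_000, "crohme_gen_2019": 15_000}


def cap_split(samples, cap, seed=0):
    """Return at most `cap` items, sampled without replacement (hypothetical helper)."""
    samples = list(samples)
    if len(samples) <= cap:
        return samples
    # A seeded RNG keeps the subsample reproducible across runs.
    return random.Random(seed).sample(samples, cap)


def apply_caps(splits, caps, seed=0):
    """Apply per-split caps to a {name: samples} mapping (hypothetical helper)."""
    return {
        name: cap_split(items, caps[name], seed) if name in caps else list(items)
        for name, items in splits.items()
    }
```

Sampling without replacement with a fixed seed is one reasonable choice;
taking the first N items would also satisfy the cap but could bias the
subset if the source files are ordered.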

Validation: all typeset_val (1k) + all typeset_mixed_val (500) + 250 sampled
from mathwriting_val = 1,750 total.
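
The validation composition above could be assembled roughly as follows.
This is a sketch under assumptions: build_val, its parameter names, and the
seed are hypothetical, and the real mathwriting_val size is not stated here.

```python
import random


def build_val(typeset_val, typeset_mixed_val, mathwriting_val,
              n_handwritten=250, seed=0):
    """Combine both typeset splits in full with a fixed-size handwriting sample."""
    rng = random.Random(seed)  # seeded so the validation set is stable
    sampled = rng.sample(list(mathwriting_val), n_handwritten)
    return list(typeset_val) + list(typeset_mixed_val) + sampled
```

With 1,000 typeset and 500 typeset_mixed items, the result has
1,000 + 500 + 250 = 1,750 entries regardless of how large mathwriting_val is,
as long as it holds at least 250 samples.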

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>