Overhaul Gemma training; fix eval stratification; fix mathwriting label format
- train.py: remove dedupe, port per-split caps from DeepSeek script, add
argparse (--epochs, --lr, --output-dir, --cap), default output dir to
checkpoints/gemma-4-e2b, add tensorboard logging
- eval_deepseek.py: --n is now per-split cap (stratified sampling) instead
of a head slice across all splits combined
- data.py: add mathwriting_val/mathwriting_test to _MATH_ONLY_SPLITS so
bare-expression labels get $ ... $ wrapping at eval time
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>