Switch to Python 3.12; 1 epoch; rebalance training caps

- requires-python >= 3.12; .python-version pinned to 3.12
- Default epochs 3 -> 1 (~59h on RTX 3060 at current step time)
- Cap mathwriting_train at 10k (was 42k; real but over-represented)
- Cap mathwriting_synthetic at 20k, crohme_gen_2019 at 15k (unchanged)
- Total training samples: ~91k, ~11.4k optimizer steps per epoch (step math below)
- Keep eager attention (the MLA absorption trick is already efficient;
  compiling flash-attn from source is too heavy for the available hardware)
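
For reference, the step math implied by the figures above: ~91,000 samples at
~11,400 optimizer steps per epoch works out to an effective batch of about 8
(91,000 / 11,400 ≈ 8.0), and ~59h per epoch corresponds to roughly
59 × 3,600 / 11,400 ≈ 18.6s per optimizer step.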

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

+6 -4
+1
.python-version
+3.12
+3 -1
pyproject.toml
···
 [project]
 name = "gemma-4-typst-ocr"
 version = "0.1.0"
-requires-python = ">=3.11"
+requires-python = ">=3.12"
 dependencies = [
     "unsloth[colab-new]",
     "trl>=0.15",
···
     "python-multipart>=0.0.9",
     "einops>=0.8.2",
     "easydict>=1.13",
+    "transformers==4.47.1",
 ]

 [project.scripts]
···
 [tool.hatch.build.targets.wheel]
 packages = ["src"]
+
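
As a quick environment sanity check after re-syncing (a minimal sketch: the
version floor and the transformers pin come from the diff above; everything
else is Python stdlib):

    import sys
    from importlib.metadata import version

    # The project floor moved from 3.11 to 3.12; fail fast on an old interpreter.
    assert sys.version_info >= (3, 12), "requires-python is now >=3.12"

    # Confirm the exact transformers pin from pyproject.toml actually resolved.
    assert version("transformers") == "4.47.1"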
+2 -3
src/train_deepseek.py
···
     model_id,
     quantization_config=_bnb_config(),
     use_safetensors=True,
-    # eager: flash_attn2 + grad-checkpointing combination can be fragile;
-    # switch to flash_attention_2 once smoke-test confirms stability.
     _attn_implementation="eager",
 )
 return model, tokenizer
···
 parser.add_argument("--smoke-test", action="store_true",
                     help="One forward+backward pass then exit")
 parser.add_argument("--output-dir", default="checkpoints/deepseek")
-parser.add_argument("--epochs", type=int, default=3)
+parser.add_argument("--epochs", type=int, default=1)
 parser.add_argument("--lr", type=float, default=1e-4)
 parser.add_argument("--lora-r", type=int, default=16)
 args = parser.parse_args()
···
 _TRAIN_CAPS = {
     "mathwriting_synthetic": 20_000,
     "crohme_gen_2019": 15_000,
+    "mathwriting_train": 10_000,  # real but large; cap to avoid dominating
 }
 train_records: list[dict] = []
 for split in TRAIN_SPLITS:
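
The last hunk cuts off before the loop body. For context, a minimal sketch of
how per-split caps like _TRAIN_CAPS are typically applied (load_split, the
fixed seed, and the record shape are assumptions for illustration, not this
repo's actual code):

    import random

    rng = random.Random(0)  # fixed seed so the capped subset is reproducible across runs

    train_records: list[dict] = []
    for split in TRAIN_SPLITS:
        records = load_split(split)  # assumed loader; the real one is not shown in the diff
        cap = _TRAIN_CAPS.get(split)
        if cap is not None and len(records) > cap:
            # Subsample instead of truncating so the cap doesn't bias toward
            # whatever order the split happens to be stored in.
            records = rng.sample(records, cap)
        train_records.extend(records)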