Wrap math-only split labels with display math delimiters at load time

The CROHME, MathWriting, and typeset (train/val/test) manifests store bare
math expressions. Since the images render as display math ($ expr $), training
targets should be valid Typst: wrap them in load_records() via the
_MATH_ONLY_SPLITS set rather than touching the manifests. Mixed splits already
contain full body content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

+12
src/data.py
···
 VAL_SPLITS = ["mathwriting_val", "typeset_val", "typeset_mixed_val"]
 TEST_SPLITS = ["mathwriting_test", "typeset_test", "typeset_mixed_test"]
 
+# Splits whose manifest typst field is a bare math expression (no $ delimiters).
+# These are wrapped as display math at load time so the training target is valid Typst.
+# Mixed splits already contain full body content with inline $...$ where needed.
+_MATH_ONLY_SPLITS = {
+    "crohme_gen_2019", "crohme_gen_2023", "crohme_gen_syntactic", "crohme_real_train",
+    "mathwriting_train", "mathwriting_synthetic", "mathwriting_symbols",
+    "typeset_train", "typeset_val", "typeset_test",
+}
+
 PROMPT = "Transcribe this image to Typst notation."
 BASE_MODEL = "unsloth/gemma-4-E2B-it"
 
···
     for name in split_names:
         manifest = root / name / "manifest.jsonl"
         base = (root / name).resolve()
+        math_only = name in _MATH_ONLY_SPLITS
         for line in manifest.read_text().splitlines():
             r = json.loads(line)
             typst = r.get("typst", "")
             if not typst or typst.startswith("ERROR:"):
                 continue
+            if math_only:
+                typst = f"$ {typst} $"
             if dedupe:
                 if typst in seen_exact:
                     continue
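
The per-record effect of this change can be sketched in isolation. The snippet below is a minimal illustration, not the code in src/data.py: the helper name `target_for_record` and the two-entry `MATH_ONLY_SPLITS` set are hypothetical stand-ins, and only the skip and wrap conditions are taken from the diff.

```python
import json
from typing import Optional

# Hypothetical two-entry stand-in for _MATH_ONLY_SPLITS (the real set is in the diff).
MATH_ONLY_SPLITS = {"mathwriting_train", "typeset_train"}


def target_for_record(split_name: str, line: str) -> Optional[str]:
    """Return the training target for one manifest line, or None if it is skipped."""
    record = json.loads(line)
    typst = record.get("typst", "")
    # Same skip rule as the diff: empty targets and render errors are dropped.
    if not typst or typst.startswith("ERROR:"):
        return None
    if split_name in MATH_ONLY_SPLITS:
        # A space after the opening $ and before the closing $ makes Typst
        # treat the equation as display (block) math.
        typst = f"$ {typst} $"
    return typst


# A math-only split is wrapped; a mixed split passes through unchanged.
print(target_for_record("mathwriting_train", '{"typst": "x^2"}'))
print(target_for_record("typeset_mixed_val", '{"typst": "Let $x$ be real."}'))
```

Because the wrap runs before the dedupe check in the diff, exact-match deduplication compares the wrapped strings.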