feat: move hanzi scripts into a cli

+1958 -2054
+91 -65
data/README.md
··· 1 1 # Data 2 2 3 - This is the data we use to construct our app json files. 4 - None of these files should be directly included into the app. 3 + Source files used to build the app's subject JSON. None of these files are 4 + directly included in the app bundle — everything is compiled to `www/static/gen/` 5 + by the CLI. 5 6 6 - Data is usually "safe" to change at will (as long as you aren't changing any ids), BUT you changes may be overwritten depending on how update scripts are run. See docs and generator scripts for more details. 7 + Data files are generally safe to edit by hand (as long as you don't change any 8 + ids), but changes to TSVs may be overwritten if the relevant generator command 9 + is re-run. See the CLI docs below for which commands affect which files. 7 10 8 - ## Scripts 11 + ## CLI 9 12 10 - The `scripts` directory includes scripts for creating our apps `data` files. These scripts assume existence of `{target}/order/characters.txt`, `{target}/order/order/vocabulary.csv`, `{target}/order/radicals.csv`. These define which subjects are used, and in what order. 13 + All data tooling runs through a single entry point: 11 14 12 - The main process looks something like this: 15 + ```sh 16 + deno task data <command> 17 + deno task data --help 18 + ``` 13 19 14 - 1. `1_gen_dicts.ts` script to take any new subjects, and append a line to `dictionaries/{subject}.tsv`. These files help us match words to definitions and chars. What makes a "new" subject is loosely defined as "the traditional characters are different". Since Hanz/ 15 - 2. `2_gen_audio.ts` script, which will use TTS to generate audio files from the dicts. `deno run -A data/scripts/2_gen_audio.ts zh_HK character 200` 16 - 3. `3_update_app_data.ts`, which will use data files from steps 1 and 2, as well as a reading the file locations from step 3, and compile this in to data.json 17 - 4. 
`4_update_progress_data.ts` updates the HSK and TOCFL lists 20 + ### Build commands 18 21 19 - Separately from the main scripts, there is also `gen_progress.ts`, which generates all the progress-checking json files used by the app. This should translate between Hanz and Hans using the `dictionaries`, rather than an external source, since all chars/words should be present within our app. 22 + Run by **code contributors** as part of the normal build process. 20 23 21 - ## Sources 24 + | Command | Description | 25 + | --- | --- | 26 + | `build` | Run `gen:app`, `gen:progress`, and `gen:licenses` in sequence | 27 + | `gen:app` | Compile subjects JSON for all user-lang × target-lang pairs | 28 + | `gen:audio <locale> <dict> [limit]` | Generate TTS audio via Azure (requires `.env` + `ffmpeg`) | 29 + | `gen:progress` | Generate HSK / TOCFL / JLPT progress JSON | 30 + | `gen:licenses` | Fetch and bundle license texts | 22 31 23 - This is where we store the raw data. Store as `.tsv` to make it easier to view the data in the text editor by adjusting tab size (.editorconfig). Every tsv file should have a `.js` file in `sources/scripts`, helping to parse the raw data from its original source. 32 + ### Studio commands 24 33 25 - | file | description | 26 - | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | 27 - | `hsk.tsv` | The HSK 3.0 vocab list used by China; bands 1-5 | 28 - | `hsk_missing.tsv` | Hacking Chinese's list of words missing from HSK | 29 - | `tocfl.tsv` | The TOCFL vocab list used by Taiwan | 30 - | `tocfl_missing.tsv` | Hacking Chinese's list of words missing from TOCFL | 31 - | `standard.tsv` | A list of the General Standard Chinese Characters (通用规范汉字表). The standardized list of 8105 simplified Chinese characters. These are listed in order. 
| 32 - | `sentences.tsv` | Sentences provided by https://tatoeba.org/en | 34 + Run by **translation and language data contributors** when updating source files. 35 + Not needed for normal code development. 36 + 37 + | Command | Description | 38 + | --- | --- | 39 + | `studio sort:dicts [--lang=zh_CN]` | Sort TSVs by id (default) or by a target language's curriculum order | 40 + | `studio update:dicts` | Add missing readings/meanings entries from dictionary sources | 41 + | `studio update:sentences` | Process raw Tatoeba sentence exports into app sentence TSVs | 33 42 34 43 ## Directory Layout 35 44 36 45 ``` 37 46 data/ 38 - ├── scripts/ # Build scripts 39 - ├── other/ # Other Data (licenses, voice data, other aux data) 40 - └── lang/ # Language Data 41 - ├── characters.tsv # id, hant, hans 42 - ├── radicals.tsv # id, hant, hans alt 43 - ├── vocabulary.tsv # id, hant, hans 47 + ├── cli/ # CLI entry point and commands (see cli/main.ts) 48 + ├── scripts/ # Legacy build scripts (superseded by cli/) 49 + ├── other/ # Auxiliary data (licenses.tsv, voices.json) 50 + └── lang/ # All language data 51 + ├── characters.tsv # Shared CJK character registry: id, hant, hans 52 + ├── radicals.tsv # Shared radical registry: id, hant, hans, alt 53 + ├── vocabulary.tsv # Shared CJK vocabulary registry: id, hant, hans 44 54 ├── zh_CN/ # Mainland Chinese target-language data 45 - │ ├── pronunciation.tsv # id pinyin zhuyin 46 - │ ├── progress/ # Data for tracking progress in other courses 55 + │ ├── readings.tsv # id, pinyin 56 + │ ├── progress/ # External word lists (HSK, frequency data) 47 57 │ └── order/ # Curriculum order 48 - │ ├── characters.txt # Row = level, columns = chars in that level 58 + │ ├── characters.txt # One line per level; chars in that level 49 59 │ ├── vocabulary.csv 50 60 │ └── radicals.csv 61 + ├── zh_HK/ # Cantonese target-language data 62 + │ └── (same structure as zh_CN/, readings use jyutping) 63 + ├── zh_TW/ # Taiwanese Mandarin target-language data 64 + │ 
└── (same structure as zh_CN/, readings use pinyin + zhuyin) 65 + ├── ja/ # Japanese target-language data 66 + │ ├── reading.characters.tsv # id, kunyomi, onyomi 67 + │ ├── reading.vocabulary.tsv # id, reading (hiragana) 68 + │ └── (order/, progress/ same structure as zh_CN/) 51 69 ├── en/ # English user-language data 52 - │ ├── characters.tsv # id, value (English meanings) 53 - │ ├── vocabulary.tsv # id, value 54 - │ ├── characters.meaning.tsv # id, hant, locale, en (mnemonic hints) 55 - │ ├── vocabulary.meaning.tsv 56 - │ ├── characters.reading.tsv 57 - │ ├── vocabulary.reading.tsv 58 - │ └── sentences/ # Locale-keyed example sentences 70 + │ ├── characters.tsv # id, hant, value (meanings) 71 + │ ├── vocabulary.tsv # id, hant, value 72 + │ ├── radicals.tsv # id, hant, name 73 + │ ├── hints/ # Mnemonic hints 74 + │ │ ├── meaning.characters.tsv 75 + │ │ ├── meaning.vocabulary.tsv 76 + │ │ ├── reading.characters.tsv 77 + │ │ └── reading.vocabulary.tsv 78 + │ ├── meanings/ # Locale-specific meaning overrides 79 + │ │ ├── ja.characters.tsv 80 + │ │ └── ja.vocabulary.tsv 81 + │ └── sentences/ # Example sentences keyed by target locale 59 82 │ ├── zh_CN.tsv 60 83 │ ├── zh_HK.tsv 61 - │ └── zh_TW.tsv 62 - ├── es/ # Spanish user-language data 63 - │ └── (same structure as en/) 64 - ├── zh_TW/ # Taiwanese Chinese Target Language data 65 - │ └── (same structure as zh_CN/) 66 - └── ja/ # Japanese Target Language data 67 - └── (same structure as zh_CN/) 84 + │ ├── zh_TW.tsv 85 + │ └── ja.tsv 86 + └── es/ # Spanish user-language data 87 + └── (same structure as en/) 68 88 ``` 69 89 70 - ## Order 90 + ## Curriculum Order 71 91 72 - This is where we store data for creating lessons. Use `.csv` to compact the format within the text editor. Order files are per target-locale, e.g. `zh_CN/order/vocabulary.csv` for Mainland-focused vocabulary. 73 - 74 - ### How we determine Lesson Num/Order 75 - 76 - This is very hand-wavey at the moment. 
But broadly: 77 - 78 - | LVL | CEFR | HSK | TOCFL | # Characters | # Words | 79 - | ----- | ---- | ----- | ------- | ---------------- | ---------- | 80 - | 1-10 | | | | | | 81 - | 11-20 | A1 | HSK 2 | TOCFL 1 | 200 most popular | | 82 - | 21-30 | A2 | HSK3 | TOCFL 2 | 900 Characters | 1200 Words | 83 - | 31-60 | B1-2 | HSK 5 | TOCFL 4 | 1900 Characters | 5000 Words | 92 + Order files define which subjects are taught and in what sequence. They live 93 + at `lang/{targetLang}/order/` and are the source of truth for what gets 94 + included in the compiled subject JSON. 84 95 85 - - Lvls 1-10 is less focused on specific milestones, and more just trying to gracefully introduce characters in a natural way. 86 - - Lvls 21-30 seems short because 1-20 should introduce vocab beyond the stated goals. Just the order might not lend itself as well to testing milestones. 96 + - `characters.txt` — one line per level, each character is one slug 97 + - `vocabulary.csv` — one row per level, comma-separated slugs 98 + - `radicals.csv` — one row per level, comma-separated slugs 87 99 88 - This is expressed broadly in `other/groups.txt`, and refined into `order/characters.txt`. Vocabulary are a bit more auxillary and are introduced at anytime after the characters are. So in practice, users should reach the word goals a little after the listed levels above. 100 + Subjects not present in any order file are included in the output with 101 + `hiddenAt` set and `level: 0` (accessible but not scheduled for review). 89 102 90 103 We currently do not, but in the future, we should also try and keep frequency in mind, to hopefully gracefully prepare users for things such as graded readers, such as [these](https://www.gradedchinesereaders.com) or [these](https://talktaiwanesemandarin.com/books/). 
Some frequency lists: 91 104 92 105 - tocfl data includes sFreq and wFreq 93 106 - https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php 94 107 95 - ## Dictionaries 108 + ### Rough level targets 109 + 110 + | LVL | CEFR | HSK | TOCFL | Characters | Words | 111 + | ----- | ---- | ----- | ------- | ---------- | ----- | 112 + | 1-10 | | | | | | 113 + | 11-20 | A1 | HSK 2 | TOCFL 1 | ~200 | | 114 + | 21-30 | A2 | HSK 3 | TOCFL 2 | ~900 | ~1200 | 115 + | 31-60 | B1-2 | HSK 5 | TOCFL 4 | ~1900 | ~5000 | 96 116 97 - These are used for mapping hanz/hans/en altogether. Not used right now, but `overrides.tsv` should include locale-specific overrides for 117 + ## Audio 98 118 99 - ## Future Support 119 + Audio is generated separately from the main build and is not committed to the 120 + repo. Run `gen:audio` for each locale after adding new curriculum items: 100 121 101 - Maybe we could look at adding other dialects, more regional targets: 122 + ```sh 123 + deno task data gen:audio zh_CN character 124 + deno task data gen:audio zh_HK character 125 + deno task data gen:audio zh_TW character 126 + ``` 102 127 103 - `cmn`, `hak`, `wuu`, `gan`, `nan`, `hnm`, `hsn`, `cjy`, `zh_SG`, `zh_MY` 128 + Requires `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in a `.env` file, and 129 + `ffmpeg` on `$PATH`. Generated files land in `www/static/gen/audio/{locale}/`.
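The order-file formats described in the README above (one line per level in `characters.txt`, one comma-separated row per level in the CSVs) could be parsed roughly as follows. This is a minimal sketch, not the repo's `readCharacterOrder` / `readLessonOrder`, whose implementations are not shown in this diff. The names `Placement`, `parseCharacterOrder`, `parseLessonOrder`, and `indexByLevel` are hypothetical.

```typescript
// Hypothetical helpers illustrating the order-file formats from the README.
// They are NOT the repo's readCharacterOrder / readLessonOrder.

/** Where a slug sits in the curriculum (both values are 1-based). */
interface Placement {
  level: number
  position: number
}

/** characters.txt: one line per level; every character on the line is one slug. */
function parseCharacterOrder(text: string): string[][] {
  return text
    .trimEnd()
    .split('\n')
    // Array.from iterates by code point, so hanzi outside the BMP stay whole.
    .map((line) => Array.from(line.trim()))
}

/** vocabulary.csv / radicals.csv: one row per level, comma-separated slugs. */
function parseLessonOrder(text: string): string[][] {
  return text
    .trimEnd()
    .split('\n')
    .map((line) => line.split(',').map((slug) => slug.trim()).filter(Boolean))
}

/** Flattens levels into a slug -> placement lookup; the first occurrence wins. */
function indexByLevel(levels: string[][]): Map<string, Placement> {
  const index = new Map<string, Placement>()
  levels.forEach((slugs, levelIndex) => {
    slugs.forEach((slug, posIndex) => {
      if (!index.has(slug)) {
        index.set(slug, { level: levelIndex + 1, position: posIndex + 1 })
      }
    })
  })
  return index
}
```

A lookup built this way makes it cheap to tell whether a subject is scheduled at all, which is the distinction the README draws between ordered subjects and those emitted with `level: 0`.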
+242
data/cli/commands/gen_app_data.ts
··· 1 + /** 2 + * Compiles all subject data into the app's JSON files. 3 + * 4 + * For each user-language × target-language pair, reads dictionaries, readings, 5 + * hints, meanings, and sentences, then writes a combined subjects file to 6 + * www/static/gen/lang/{userLang}/{targetLang}.json. 7 + * 8 + * This is the main build step — the output is what the app reads at runtime. 9 + * 10 + * Idempotent: existing subjects are preserved and merged; hand-edited fields 11 + * (mnemonics, hints) survive re-runs. Subjects removed from the curriculum 12 + * are marked with `hiddenAt` rather than deleted. 13 + * 14 + * Depends on: 15 + * - data/lang/ TSVs (dictionaries, readings, hints, meanings) 16 + * - data/lang/{userLang}/sentences/ (example sentences) 17 + * - www/static/gen/audio/ (audio files embedded in subjects; optional) 18 + */ 19 + 20 + import { Command } from '@cliffy/command' 21 + import { SubjectType } from '$/enums.ts' 22 + import type { Subject } from '$/models/subjects.ts' 23 + import { readAllLocaleHints, readDict, readMeaningOverrides, readMeanings, readReadingsMap } from '../utils/dict.ts' 24 + import { readTsv } from '../utils/fs.ts' 25 + import { readCharacterOrder, readLessonOrder } from '../utils/ordering.ts' 26 + import { loadSentences } from '../utils/sentences.ts' 27 + import { createSubject, readSubjectsMap, writeSubjects } from '../utils/subjects.ts' 28 + import { listAudioFiles } from '../utils/audio.ts' 29 + 30 + const USER_LANGS = ['en', 'es'] 31 + const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja'] 32 + 33 + export const genAppDataCmd = new Command() 34 + .description( 35 + 'Compile subject data into app JSON (www/static/gen/lang/{userLang}/{targetLang}.json). ' + 36 + 'Run after updating dictionaries, readings, or curriculum order. 
' + 37 + 'Audio files are included automatically if already present.', 38 + ) 39 + .action(() => { 40 + // Load shared dictionary data once for all language pairs 41 + const charDefs = readDict('lang/characters.tsv') 42 + const vocabDefs = readDict('lang/vocabulary.tsv') 43 + const audioFiles = listAudioFiles() 44 + 45 + for (const userLang of USER_LANGS) { 46 + const charMeanings = readMeanings(`lang/${userLang}/characters.tsv`) 47 + const vocabMeanings = readMeanings(`lang/${userLang}/vocabulary.tsv`) 48 + const charMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.characters.tsv`) 49 + const charReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.characters.tsv`) 50 + const vocabMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.vocabulary.tsv`) 51 + const vocabReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.vocabulary.tsv`) 52 + 53 + for (const targetLang of TARGET_LANGS) { 54 + console.log(`\nGenerating ${userLang}/${targetLang}`) 55 + const sentences = loadSentences(userLang, targetLang) 56 + 57 + const characterOrder = readCharacterOrder(targetLang) 58 + const vocabularyOrder = readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`) 59 + const radicalOrder = readLessonOrder(`lang/${targetLang}/order/radicals.csv`) 60 + 61 + const outPath = `lang/${userLang}/${targetLang}.json` 62 + // Preserves hand-edited fields (mnemonics, hints) across runs 63 + const existingSubjects = readSubjectsMap(outPath) 64 + const updated = new Set<string>() 65 + 66 + // ja splits readings by subject type; all other locales share one file 67 + const charReadingsMap = targetLang === 'ja' 68 + ? readReadingsMap('lang/ja/reading.characters.tsv') 69 + : readReadingsMap(`lang/${targetLang}/readings.tsv`) 70 + const vocabReadingsMap = targetLang === 'ja' 71 + ? 
readReadingsMap('lang/ja/reading.vocabulary.tsv') 72 + : readReadingsMap(`lang/${targetLang}/readings.tsv`) 73 + 74 + const charMeaningOverrides = readMeaningOverrides( 75 + `lang/${userLang}/meanings/${targetLang}.characters.tsv`, 76 + ) 77 + const vocabMeaningOverrides = readMeaningOverrides( 78 + `lang/${userLang}/meanings/${targetLang}.vocabulary.tsv`, 79 + ) 80 + 81 + // --- Characters --- 82 + console.log(` characters: ${characterOrder.length} levels`) 83 + characterOrder.forEach((slugs, index) => { 84 + const level = index + 1 85 + slugs.forEach((slug, posIndex) => { 86 + if (!slug) return 87 + const position = posIndex + 1 88 + const subject = existingSubjects[slug] ?? 89 + createSubject( 90 + slug, level, position, targetLang, 91 + charMeanings, vocabMeanings, sentences, 92 + charDefs, vocabDefs, audioFiles, 93 + ) 94 + subject.data.level = level 95 + subject.data.position = position 96 + 97 + const readings = charReadingsMap[subject.id] 98 + if (readings?.length) subject.data.readings = readings 99 + else if (subject.data.type !== SubjectType.Radical) { 100 + console.warn(` No readings for character ${slug} (${subject.id}) in ${targetLang}`) 101 + } 102 + 103 + const override = charMeaningOverrides[subject.id] 104 + if (override) { 105 + subject.data.meanings = override.split(';').map((def, i) => ({ 106 + value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true, 107 + })) 108 + } 109 + 110 + if (charMeaningHints[subject.id]) subject.data.meaningHint = charMeaningHints[subject.id] 111 + if (charReadingHints[subject.id]) subject.data.readingHint = charReadingHints[subject.id] 112 + existingSubjects[slug] = subject 113 + updated.add(slug) 114 + }) 115 + }) 116 + 117 + // --- Vocabulary --- 118 + console.log(` vocabulary: ${vocabularyOrder.length} levels`) 119 + vocabularyOrder.forEach((slugs, index) => { 120 + const level = index + 1 121 + slugs.forEach((slug, posIndex) => { 122 + if (!slug) return 123 + const position = posIndex + 1 124 + const subject 
= existingSubjects[slug] ?? 125 + createSubject( 126 + slug, level, position, targetLang, 127 + charMeanings, vocabMeanings, sentences, 128 + charDefs, vocabDefs, audioFiles, 129 + ) 130 + subject.data.level = level 131 + subject.data.position = position 132 + 133 + const readings = vocabReadingsMap[subject.id] 134 + if (readings?.length) subject.data.readings = readings 135 + else console.warn(` No readings for vocabulary ${slug} (${subject.id}) in ${targetLang}`) 136 + 137 + const override = vocabMeaningOverrides[subject.id] 138 + if (override) { 139 + subject.data.meanings = override.split(';').map((def, i) => ({ 140 + value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true, 141 + })) 142 + } 143 + 144 + if (vocabMeaningHints[subject.id]) subject.data.meaningHint = vocabMeaningHints[subject.id] 145 + if (vocabReadingHints[subject.id]) subject.data.readingHint = vocabReadingHints[subject.id] 146 + existingSubjects[slug] = subject 147 + updated.add(slug) 148 + }) 149 + }) 150 + 151 + // --- Radicals --- 152 + console.log(` radicals: ${radicalOrder.length} levels`) 153 + buildRadicals(targetLang, userLang, radicalOrder, existingSubjects, updated) 154 + 155 + writeSubjects( 156 + outPath, 157 + Object.values(existingSubjects).map((subject) => { 158 + if (updated.has(subject.data.slug)) { 159 + delete subject.hiddenAt 160 + } else { 161 + // Subjects removed from the curriculum are hidden, not deleted 162 + subject.hiddenAt = subject.hiddenAt ?? new Date() 163 + subject.data.level = 0 164 + subject.data.position = 0 165 + } 166 + return subject 167 + }), 168 + ) 169 + } 170 + } 171 + }) 172 + 173 + /** 174 + * Builds radical subjects from radicals.tsv and the curriculum radical order. 175 + * Parsed manually because the TSV has an optional trailing `alt` field that 176 + * trips up the strict CSV parser. 
177 + */ 178 + function buildRadicals( 179 + targetLang: string, 180 + userLang: string, 181 + radicalOrder: string[][], 182 + existingSubjects: Record<string, Subject>, 183 + updated: Set<string>, 184 + ): void { 185 + const byHant: Record<string, { id: string; hant: string; hans: string }> = {} 186 + const byAlt: Record<string, { id: string; hant: string; hans: string }> = {} 187 + 188 + Deno.readTextFileSync('./data/lang/radicals.tsv') 189 + .split('\n') 190 + .slice(1) // skip header 191 + .filter((line) => line.trim()) 192 + .forEach((line) => { 193 + const [id, hant, hans, alt] = line.split('\t') 194 + const row = { id, hant, hans } 195 + byHant[hant] = row 196 + if (alt) { 197 + alt.split(';').map((a) => a.trim()).filter(Boolean).forEach((a) => { byAlt[a] = row }) 198 + } 199 + }) 200 + 201 + const nameById: Record<string, string> = {} 202 + readTsv(`lang/${userLang}/radicals.tsv`).forEach((row) => { nameById[row.id] = row.name }) 203 + 204 + const isSimplified = targetLang === 'zh_CN' 205 + 206 + radicalOrder.forEach((chars, levelIndex) => { 207 + const level = levelIndex + 1 208 + chars.forEach((char, posIndex) => { 209 + const ch = char.trim() 210 + if (!ch) return 211 + const row = byHant[ch] || byAlt[ch] 212 + if (!row) { 213 + console.warn(`No radical found for: ${ch}`) 214 + return 215 + } 216 + const { id, hant, hans } = row 217 + const slug = hant 218 + const existing = existingSubjects[slug] 219 + existingSubjects[slug] = { 220 + ...(existing || {}), 221 + id, 222 + learnCards: ['meanings'], 223 + quizCards: ['meanings'], 224 + data: { 225 + ...(existing?.data || {}), 226 + character: isSimplified ? hans : hant, 227 + level, 228 + meanings: nameById[id] 229 + ? 
[{ value: nameById[id], isPrimary: true, isAcceptedAnswer: true }] 230 + : [], 231 + position: posIndex + 1, 232 + readings: [], 233 + requiredSubjects: [], 234 + slug, 235 + srsId: 2, 236 + type: SubjectType.Radical, 237 + }, 238 + } as Subject 239 + updated.add(slug) 240 + }) 241 + }) 242 + }
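The character and vocabulary loops in `gen_app_data.ts` above repeat the same `;`-separated override parsing, and the final `writeSubjects` call inlines the hide-instead-of-delete rule. A possible refactor is to pull both into small pure helpers that can be tested without touching the filesystem. This is a sketch only: `Meaning`, `SubjectLike`, `parseMeaningOverride`, and `applyVisibility` are hypothetical names, and `SubjectLike` is a cut-down stand-in for the repo's `Subject` type.

```typescript
// Hypothetical refactor sketch; type and function names are illustrative.

interface Meaning {
  value: string
  isPrimary: boolean
  isAcceptedAnswer: boolean
}

// Minimal stand-in for the repo's Subject type (assumption).
interface SubjectLike {
  hiddenAt?: Date
  data: { slug: string; level: number; position: number }
}

/** Splits a ';'-separated override into meanings; the first entry is primary. */
function parseMeaningOverride(override: string): Meaning[] {
  return override.split(';').map((def, i) => ({
    value: def.trim(),
    isPrimary: i === 0,
    isAcceptedAnswer: true,
  }))
}

/**
 * Un-hides subjects that are still in the curriculum and hides the rest.
 * Mutates the subjects in place, mirroring the inline logic in the command.
 */
function applyVisibility(
  subjects: SubjectLike[],
  updated: Set<string>,
  now: Date = new Date(),
): SubjectLike[] {
  return subjects.map((subject) => {
    if (updated.has(subject.data.slug)) {
      delete subject.hiddenAt
    } else {
      // Keep the original hide timestamp if the subject was already hidden.
      subject.hiddenAt = subject.hiddenAt ?? now
      subject.data.level = 0
      subject.data.position = 0
    }
    return subject
  })
}
```

Passing `now` as a parameter keeps the helper deterministic in tests; the command itself could keep relying on the default.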
+226
data/cli/commands/gen_audio.ts
··· 1 + /** 2 + * Generates TTS audio files via Azure Cognitive Services. 3 + * 4 + * Fetches audio for all characters or vocabulary that are missing a local file, 5 + * splitting each batch of 100 items into one Azure request, then using ffmpeg 6 + * to split the returned audio on silence boundaries. 7 + * 8 + * Requirements: 9 + * - AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in a .env file 10 + * - ffmpeg installed and on $PATH 11 + * 12 + * Usage: 13 + * hanzi gen-audio <locale> <dict> [limit] 14 + * 15 + * Examples: 16 + * hanzi gen-audio zh_CN character 17 + * hanzi gen-audio zh_HK vocabulary 100 18 + */ 19 + 20 + import { Command } from '@cliffy/command' 21 + import { load } from '@std/dotenv' 22 + import { ensureDir } from '@std/fs' 23 + import { writeAll } from '@std/io' 24 + import { join } from '@std/path' 25 + import { Locale } from '$/enums.ts' 26 + import type { Definition } from '../utils/dict.ts' 27 + import { getFilename, listAudioFiles, VOICE_IDS } from '../utils/audio.ts' 28 + import { readOrderedDefs } from '../utils/ordering.ts' 29 + 30 + const GEN_DIR = 'www/static/gen' 31 + const TEMP_DIR = join(GEN_DIR, 'audio', 'tmp') 32 + 33 + // ffmpeg silence detection parameters 34 + const MAX_NOISE_LEVEL = -40 35 + const SILENCE_SPLIT = 1 36 + const DETECT_STR = `silencedetect=noise=${MAX_NOISE_LEVEL}dB:d=${SILENCE_SPLIT}` 37 + const MATCH_SILENCE = /silence_start: ([\w.]+)[\s\S]+?silence_end: ([\w.]+)/g 38 + 39 + const VALID_LOCALES = [Locale.zh_CN, Locale.zh_HK, Locale.zh_TW] 40 + 41 + export const genAudioCmd = new Command() 42 + .description( 43 + 'Generate TTS audio files via Azure Cognitive Services. ' + 44 + 'Requires AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in .env, and ffmpeg on $PATH.', 45 + ) 46 + .arguments('<locale:string> <dict:string> [limit:number]') 47 + .action(async (_options, locale: string, dict: string, limit = 0) => { 48 + if (!VALID_LOCALES.includes(locale as Locale)) { 49 + console.error(`Invalid locale: ${locale}. 
Valid: ${VALID_LOCALES.join(', ')}`) 50 + Deno.exit(1) 51 + } 52 + if (!['character', 'vocabulary'].includes(dict)) { 53 + console.error(`Invalid dict: ${dict}. Use "character" or "vocabulary".`) 54 + Deno.exit(1) 55 + } 56 + 57 + const env = await load() 58 + await genAudio(locale as Locale, dict as 'character' | 'vocabulary', limit, env) 59 + console.log('COMPLETE!') 60 + Deno.exit(0) 61 + }) 62 + 63 + async function genAudio( 64 + locale: Locale, 65 + dict: 'character' | 'vocabulary', 66 + limit: number, 67 + env: Record<string, string>, 68 + ): Promise<void> { 69 + await ensureDir(TEMP_DIR) 70 + 71 + const missing = await findMissingAudioFiles(locale, dict, limit) 72 + if (!missing.length) { 73 + console.log('No missing audio files — nothing to do.') 74 + return 75 + } 76 + 77 + const ttsResults = await ttsAll(locale, missing, VOICE_IDS[locale], env) 78 + console.log('Source audio groups:', JSON.stringify(ttsResults.map((r) => r.fileName))) 79 + 80 + for (const { groupIndex, fileName, keys } of ttsResults) { 81 + if (!fileName) { 82 + console.warn(`Skipping group ${groupIndex}: no fileName (Azure request may have failed)`) 83 + continue 84 + } 85 + console.log('Processing group:', fileName) 86 + await writeAudioFiles(join(TEMP_DIR, fileName), locale, keys) 87 + } 88 + } 89 + 90 + async function findMissingAudioFiles( 91 + locale: Locale, 92 + dict: 'character' | 'vocabulary', 93 + limit: number, 94 + ): Promise<Definition[]> { 95 + await ensureDir(join(GEN_DIR, 'audio', locale)) 96 + const exists = new Set(listAudioFiles([locale])) 97 + const ordered = readOrderedDefs(dict, locale) 98 + const missing = ordered.filter(({ id }) => !exists.has(getFilename(id, locale))) 99 + return limit ? 
missing.slice(0, limit) : missing 100 + } 101 + 102 + function ttsAll( 103 + locale: Locale, 104 + subjects: Definition[], 105 + voiceId: string, 106 + env: Record<string, string>, 107 + ): Promise<{ groupIndex: number; fileName: string | null; keys: string[] }[]> { 108 + // Group into batches of 100 to stay within Azure SSML limits 109 + const groups: Definition[][] = [] 110 + subjects.forEach((subject, index) => { 111 + const groupIndex = Math.floor(index / 100) 112 + if (!groups[groupIndex]) groups[groupIndex] = [] 113 + groups[groupIndex].push(subject) 114 + }) 115 + 116 + console.log( 117 + `About to request TTS for ${subjects.length} items ` + 118 + `(${subjects[0]?.id} → ${subjects[subjects.length - 1]?.id}).`, 119 + ) 120 + if (!confirm('Proceed?')) { 121 + console.log('Aborted.') 122 + Deno.exit(0) 123 + } 124 + 125 + return Promise.all( 126 + groups.map(async (batch, groupIndex) => ({ 127 + groupIndex, 128 + fileName: await ttsAzure(batch.map((s) => s.hant), voiceId, locale, groupIndex, env), 129 + keys: batch.map((s) => s.id), 130 + })), 131 + ) 132 + } 133 + 134 + async function ttsAzure( 135 + texts: string[], 136 + voiceId: string, 137 + locale: Locale, 138 + groupIndex: number, 139 + env: Record<string, string>, 140 + ): Promise<string | null> { 141 + const SILENCE_BETWEEN_S = 2 142 + const region = env['AZURE_SPEECH_REGION'] 143 + const url = `https://${region}.tts.speech.microsoft.com/cognitiveservices/v1` 144 + 145 + const response = await fetch(url, { 146 + method: 'POST', 147 + headers: { 148 + 'Ocp-Apim-Subscription-Key': env['AZURE_SPEECH_KEY'], 149 + 'Content-Type': 'application/ssml+xml', 150 + 'X-Microsoft-OutputFormat': 'audio-16khz-128kbitrate-mono-mp3', 151 + 'User-Agent': 'curl', 152 + }, 153 + body: ` 154 + <speak version='1.0' xml:lang='${locale}'> 155 + <voice name='${voiceId}' xml:lang='${locale}'> 156 + <prosody rate="-20.00%"> 157 + ${texts.join(`, <break time="${SILENCE_BETWEEN_S}s"/> `)} 158 + </prosody> 159 + </voice> 160 + 
</speak> 161 + `, 162 + }) 163 + 164 + if (response.status > 399) { 165 + console.warn(`Azure error ${response.status}:`, await response.text()) 166 + return null 167 + } 168 + 169 + const fileName = `${locale}_${groupIndex}.mp3` 170 + const file = await Deno.open(join(TEMP_DIR, fileName), { create: true, write: true }) 171 + await writeAll(file, new Uint8Array(await response.arrayBuffer())) 172 + return fileName 173 + } 174 + 175 + async function writeAudioFiles( 176 + sourceFile: string, 177 + locale: Locale, 178 + keys: string[], 179 + ): Promise<void> { 180 + const audioDir = join(GEN_DIR, 'audio', locale) 181 + await Deno.mkdir(audioDir, { recursive: true }) 182 + 183 + // Use ffmpeg to detect silence boundaries between spoken words 184 + const { stderr } = await new Deno.Command('ffmpeg', { 185 + stdout: 'piped', 186 + args: ['-i', sourceFile, '-af', DETECT_STR, '-f', 'null', '-'], 187 + }).output() 188 + const detectOutput = new TextDecoder().decode(stderr) 189 + 190 + let match = MATCH_SILENCE.exec(detectOutput) 191 + let clipStartMS = 0 192 + let count = 0 193 + 194 + while (match) { 195 + const [_, silenceStartS, silenceEndS] = match 196 + const silenceStartMS = Math.round(1000 * parseFloat(silenceStartS)) 197 + // Shift end back slightly to avoid clipping the next word's start 198 + const silenceEndMS = Math.round(1000 * (parseFloat(silenceEndS) - 0.1)) 199 + 200 + const outFile = join(audioDir, getFilename(keys[count], locale)) 201 + const seek = `${Math.max(0, clipStartMS)}ms` 202 + const len = `${silenceStartMS - (clipStartMS + 0.1)}ms` 203 + 204 + await new Deno.Command('ffmpeg', { 205 + stdout: 'piped', 206 + args: ['-ss', seek, '-t', len, '-i', sourceFile, '-c:a', 'copy', outFile], 207 + }).output() 208 + 209 + count++ 210 + clipStartMS = silenceEndMS 211 + match = MATCH_SILENCE.exec(detectOutput) 212 + } 213 + 214 + // Write the final clip (no trailing silence) 215 + if (!keys[count]) { 216 + console.warn(`Key/audio mismatch in ${sourceFile} — got 
${count} clips for ${keys.length} keys`) 217 + return 218 + } 219 + const outFile = join(audioDir, getFilename(keys[count], locale)) 220 + await new Deno.Command('ffmpeg', { 221 + stdout: 'piped', 222 + args: ['-ss', `${Math.max(0, clipStartMS)}ms`, '-i', sourceFile, '-c:a', 'copy', outFile], 223 + }).output() 224 + 225 + console.log(` wrote ${count + 1} audio files to ${audioDir}`) 226 + }
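The loop in `writeAudioFiles` above interleaves parsing of ffmpeg's `silencedetect` log with the ffmpeg calls that cut each clip. One way to make the boundary arithmetic checkable on its own is to extract it into a pure function. The sketch below assumes the log format that the file's `MATCH_SILENCE` regex targets; `Clip` and `clipsFromSilenceLog` are hypothetical names, and it returns start/end offsets rather than reproducing the command's exact `-t` length expression.

```typescript
// Hypothetical extraction of the clip-boundary logic; not part of the commit.

interface Clip {
  startMs: number
  /** null means "until the end of the source file". */
  endMs: number | null
}

/**
 * Turns silencedetect output into clip ranges. Each detected silence closes
 * the current clip; the next clip starts slightly before the silence ends
 * (endPadS seconds) to avoid cutting off the start of the next word.
 */
function clipsFromSilenceLog(log: string, endPadS = 0.1): Clip[] {
  // Same pattern as MATCH_SILENCE, created locally so lastIndex is never shared.
  const pattern = /silence_start: ([\w.]+)[\s\S]+?silence_end: ([\w.]+)/g
  const clips: Clip[] = []
  let clipStartMs = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(log)) !== null) {
    const silenceStartMs = Math.round(1000 * parseFloat(match[1]))
    clips.push({ startMs: Math.max(0, clipStartMs), endMs: silenceStartMs })
    clipStartMs = Math.round(1000 * (parseFloat(match[2]) - endPadS))
  }
  // The final clip has no trailing silence, so it runs to the end of the file.
  clips.push({ startMs: Math.max(0, clipStartMs), endMs: null })
  return clips
}
```

With N detected silences the function returns N + 1 clips, so comparing `clips.length` with `keys.length` before invoking ffmpeg would surface a key/audio mismatch earlier than the warning at the end of `writeAudioFiles`.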
+21
data/cli/commands/gen_licenses.ts
···
+ /**
+  * Fetches license texts from URLs listed in data/other/licenses.tsv and
+  * writes the combined result to www/static/gen/licenses.json.
+  */
+
+ import { Command } from '@cliffy/command'
+ import { readTsv, writeAppJson } from '../utils/fs.ts'
+
+ export const genLicensesCmd = new Command()
+   .description('Fetch license texts and write to www/static/gen/licenses.json.')
+   .action(async () => {
+     const licenseList = readTsv('other/licenses.tsv')
+     const licenses = await Promise.all(
+       licenseList.map(async ({ name, href }) => {
+         const text = await (await fetch(href)).text()
+         return { name, href, text }
+       }),
+     )
+     writeAppJson('licenses.json', licenses)
+     console.log(` wrote licenses.json (${licenses.length} licenses)`)
+   })
+67
data/cli/commands/gen_progress.ts
···
+ /**
+  * Generates progress-tracking JSON files for external word lists.
+  *
+  * Writes to www/static/gen/progress/:
+  * - hsk.json — HSK 3.0 vocabulary bands (zh_CN)
+  * - tocfl.json — TOCFL vocabulary levels (zh_TW)
+  * - jlpt-kanji.json — JLPT kanji levels (ja)
+  * - jlpt-vocab.json — JLPT vocabulary levels (ja)
+  *
+  * These are used by the app's stats/progress pages to show users how many
+  * HSK or TOCFL words they've already learned.
+  */
+
+ import { Command } from '@cliffy/command'
+ import * as OpenCC from 'opencc-js'
+ import { readTsv, writeAppJson } from '../utils/fs.ts'
+
+ export const genProgressCmd = new Command()
+   .description(
+     'Generate HSK, TOCFL, and JLPT progress JSON files for www/static/gen/progress/.',
+   )
+   .action(() => {
+     const toSimplified = OpenCC.Converter({ from: 'hk', to: 'cn' })
+     const toTraditional = OpenCC.Converter({ from: 'cn', to: 'hk' })
+
+     Deno.mkdirSync('www/static/gen/progress', { recursive: true })
+
+     writeAppJson(
+       'progress/hsk.json',
+       readTsv('lang/zh_CN/progress/hsk.tsv').map((row) => ({
+         level: Number(row.band),
+         id: Number(row.no),
+         simplified: row.hans,
+         traditional: toTraditional(row.hans),
+       })),
+     )
+     console.log(' wrote progress/hsk.json')
+
+     writeAppJson(
+       'progress/tocfl.json',
+       readTsv('lang/zh_TW/progress/tocfl.tsv').map((row) => ({
+         level: Number(row.level),
+         id: Number(row.id),
+         simplified: toSimplified(row.hant),
+         traditional: row.hant,
+       })),
+     )
+     console.log(' wrote progress/tocfl.json')
+
+     writeAppJson(
+       'progress/jlpt-kanji.json',
+       readTsv('lang/ja/progress/jlpt-kanji.tsv').map((row) => ({
+         level: Number(row.level),
+         kanji: row.kanji,
+       })),
+     )
+     console.log(' wrote progress/jlpt-kanji.json')
+
+     writeAppJson(
+       'progress/jlpt-vocab.json',
+       readTsv('lang/ja/progress/jlpt-vocab.tsv').map((row) => ({
+         level: Number(row.level),
+         chars: row.chars,
+       })),
+     )
+     console.log(' wrote progress/jlpt-vocab.json')
+   })
+228
data/cli/commands/studio/dicts.ts
··· 1 + /** 2 + * Studio command: updates readings and meanings TSVs from dictionary sources. 3 + * 4 + * "Studio" commands are for data managers updating source files — they are not 5 + * part of the normal app build and do not need to be run by contributors. 6 + * 7 + * What this does: 8 + * - Appends missing entries to lang/{userLang}/characters.tsv and vocabulary.tsv 9 + * (new items get a "[todo: add definition]" placeholder) 10 + * - Fills in missing readings in the target-language readings TSVs using 11 + * dictionary lookups (pinyin, jyutping, zhuyin, kunyomi/onyomi, hiragana) 12 + * 13 + * Existing entries are never overwritten — only missing ones are added. 14 + * Run this after adding new characters or vocabulary to a curriculum order file. 15 + * 16 + * Generated files: 17 + * - lang/en/characters.tsv, lang/es/characters.tsv (meanings) 18 + * - lang/en/vocabulary.tsv, lang/es/vocabulary.tsv (meanings) 19 + * - lang/zh_CN/readings.tsv (pinyin) 20 + * - lang/zh_HK/readings.tsv (jyutping) 21 + * - lang/zh_TW/readings.tsv (pinyin + zhuyin) 22 + * - lang/ja/reading.characters.tsv (kunyomi + onyomi) 23 + * - lang/ja/reading.vocabulary.tsv (hiragana) 24 + */ 25 + 26 + import { Command } from '@cliffy/command' 27 + import pinyin from 'chinese-to-pinyin' 28 + import { p2z } from 'pinyin-to-zhuyin' 29 + import * as OpenCC from 'opencc-js' 30 + import { toJyutping } from '$/utils/jyutping.ts' 31 + import { type Definition, readDict } from '../../utils/dict.ts' 32 + import { readCsv, readJson, readTsv, writeTsv } from '../../utils/fs.ts' 33 + 34 + const USER_LANGS = ['en', 'es'] 35 + 36 + export const updateDictsCmd = new Command() 37 + .description( 38 + 'Update readings and meanings TSVs from dictionary sources. ' + 39 + 'Only adds missing entries — never overwrites existing ones. 
' + 40 + 'Run after adding new characters or vocabulary to a curriculum order file.', 41 + ) 42 + .action(async () => { 43 + const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' }) 44 + 45 + const characters = readDict('lang/characters.tsv') 46 + const vocabulary = readDict('lang/vocabulary.tsv') 47 + 48 + // Pinyin source: General Standard Chinese Characters (通用规范汉字表) 49 + const pinyinMap: Record<string, string> = {} 50 + readTsv('lang/zh_CN/source/standard.tsv').forEach((row) => { 51 + pinyinMap[row.hans] = row.pinyin 52 + }) 53 + 54 + // zh_TW pronunciation source (TOCFL 詞語表) 55 + const pinyinTwMap: Record<string, string> = {} 56 + const zhuyinMap: Record<string, string> = {} 57 + readCsv('lang/zh_TW/sources/詞語表202504.csv').forEach((row) => { 58 + pinyinTwMap[row.word] = row.pinyin 59 + zhuyinMap[row.word] = row.bopomofo 60 + }) 61 + 62 + // Japanese dictionary sources 63 + const kanji = readJson<Record<string, { kanji: string; meaning: string; kun: string[]; on: string[] }>>( 64 + 'lang/ja/source/kanji.json', 65 + ) 66 + const jaVocab = readJson<Record<string, [string, string]>>('lang/ja/source/vocab.json') 67 + 68 + function getPinyin(hans: string): string { 69 + return pinyinMap[hans] ?? pinyin(hans) 70 + } 71 + 72 + function getTwPinyin(hant: string): string { 73 + return pinyinTwMap[hant] ?? 
getPinyin(toCN(hant)) 74 + } 75 + 76 + function getZhuyin(hant: string): string { 77 + if (zhuyinMap[hant]) return zhuyinMap[hant] 78 + try { 79 + const result = p2z(getTwPinyin(hant), { 80 + tonemarks: true, 81 + inputHasToneMarks: true, 82 + convertPunctuation: false, 83 + }) 84 + if (!result?.trim()) throw new Error('Empty result') 85 + return result 86 + } catch (err) { 87 + console.warn(`Failed to convert "${hant}" pinyin to zhuyin:`, err) 88 + return '' 89 + } 90 + } 91 + 92 + // --- Meanings --- 93 + for (const userLang of USER_LANGS) { 94 + await updateMeanings(userLang, 'characters', characters) 95 + await updateMeanings(userLang, 'vocabulary', vocabulary) 96 + } 97 + 98 + // --- Readings --- 99 + updateReadings( 100 + 'lang/zh_CN/readings.tsv', 101 + ['id', 'pinyin'], 102 + (def) => ({ id: def.id, pinyin: getPinyin(def.hans) }), 103 + [...characters, ...vocabulary], 104 + ) 105 + updateReadings( 106 + 'lang/zh_HK/readings.tsv', 107 + ['id', 'jyutping'], 108 + (def) => ({ id: def.id, jyutping: toJyutping(def.hant) }), 109 + [...characters, ...vocabulary], 110 + ) 111 + updateReadings( 112 + 'lang/zh_TW/readings.tsv', 113 + ['id', 'pinyin', 'zhuyin'], 114 + (def) => ({ id: def.id, pinyin: getTwPinyin(def.hant), zhuyin: getZhuyin(def.hant) }), 115 + [...characters, ...vocabulary], 116 + ) 117 + 118 + // --- Japanese readings --- 119 + updateJaCharReadings(characters, kanji) 120 + updateJaVocabReadings(vocabulary, jaVocab) 121 + 122 + console.log('Done.') 123 + }) 124 + 125 + /** 126 + * Appends missing entries to lang/{userLang}/{type}.tsv. 127 + * New items get a "[todo: add definition]" placeholder value. 
128 + */ 129 + async function updateMeanings( 130 + userLang: string, 131 + type: 'characters' | 'vocabulary', 132 + defs: Definition[], 133 + ): Promise<void> { 134 + const filePath = `lang/${userLang}/${type}.tsv` 135 + const existing = new Map( 136 + readTsv(filePath).map((row) => [row.id, { hant: row.hant, value: row.value }]), 137 + ) 138 + 139 + let added = 0 140 + for (const def of defs) { 141 + if (!existing.has(def.id)) { 142 + existing.set(def.id, { hant: def.hant, value: '[todo: add definition]' }) 143 + added++ 144 + } 145 + } 146 + 147 + if (added > 0) { 148 + console.log(` ${filePath}: adding ${added} missing entries`) 149 + const rows = [...existing.entries()] 150 + .map(([id, { hant, value }]) => ({ id, hant, value })) 151 + .sort((a, b) => a.id.localeCompare(b.id)) 152 + writeTsv(filePath, ['id', 'hant', 'value'], rows) 153 + } 154 + } 155 + 156 + /** 157 + * Appends missing readings to a readings TSV. 158 + * Existing rows are preserved as-is (manual overrides are safe). 
159 + */ 160 + function updateReadings( 161 + filePath: string, 162 + columns: string[], 163 + compute: (def: Definition) => Record<string, string>, 164 + defs: Definition[], 165 + ): void { 166 + const existing = new Map<string, Record<string, string>>() 167 + try { 168 + readTsv(filePath).forEach((row) => existing.set(row.id, row)) 169 + } catch { /* file doesn't exist yet */ } 170 + 171 + let added = 0 172 + for (const def of defs) { 173 + if (!existing.has(def.id)) { 174 + existing.set(def.id, compute(def)) 175 + added++ 176 + } 177 + } 178 + 179 + if (added > 0) { 180 + console.log(` ${filePath}: adding ${added} entries`) 181 + writeTsv(filePath, columns, [...existing.values()]) 182 + } 183 + } 184 + 185 + function updateJaCharReadings( 186 + characters: Definition[], 187 + kanji: Record<string, { kun: string[]; on: string[] }>, 188 + ): void { 189 + type CharReading = { id: string; kunyomi: string; onyomi: string } 190 + const existing = new Map<string, CharReading>() 191 + try { 192 + readTsv('lang/ja/reading.characters.tsv').forEach((row) => 193 + existing.set(row.id, row as unknown as CharReading) 194 + ) 195 + } catch { /* file doesn't exist yet */ } 196 + 197 + for (const def of characters) { 198 + if (!def.ja || existing.has(def.id)) continue 199 + const data = kanji[def.ja] 200 + if (!data) continue 201 + existing.set(def.id, { id: def.id, kunyomi: data.kun.join('; '), onyomi: data.on.join('; ') }) 202 + } 203 + 204 + writeTsv('lang/ja/reading.characters.tsv', ['id', 'kunyomi', 'onyomi'], [...existing.values()]) 205 + console.log(' lang/ja/reading.characters.tsv: updated') 206 + } 207 + 208 + function updateJaVocabReadings( 209 + vocabulary: Definition[], 210 + jaVocab: Record<string, [string, string]>, 211 + ): void { 212 + const existing = new Map<string, { id: string; reading: string }>() 213 + try { 214 + readTsv('lang/ja/reading.vocabulary.tsv').forEach((row) => 215 + existing.set(row.id, { id: row.id, reading: row.reading }) 216 + ) 217 + } catch 
{ /* file doesn't exist yet */ } 218 + 219 + for (const def of vocabulary) { 220 + if (!def.ja || existing.has(def.id)) continue 221 + const entry = jaVocab[def.ja] 222 + if (!entry) continue 223 + existing.set(def.id, { id: def.id, reading: entry[0] }) 224 + } 225 + 226 + writeTsv('lang/ja/reading.vocabulary.tsv', ['id', 'reading'], [...existing.values()]) 227 + console.log(' lang/ja/reading.vocabulary.tsv: updated') 228 + }
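The append-only behaviour described for `updateReadings` (existing rows are kept, only missing ids are computed) can be sketched in memory as follows. `mergeMissing` and the sample rows are hypothetical names used only for illustration; the real command reads and writes TSVs through `readTsv` and `writeTsv`.

```typescript
// Hypothetical helper illustrating the append-only merge; not part of the CLI.
type Row = { id: string } & Record<string, string>

function mergeMissing(
  existingRows: Row[],
  ids: string[],
  compute: (id: string) => Row,
): { rows: Row[]; added: number } {
  // Index existing rows by id so hand-edited values win over computed ones.
  const byId = new Map<string, Row>()
  for (const row of existingRows) byId.set(row.id, row)

  let added = 0
  for (const id of ids) {
    if (byId.has(id)) continue // never overwrite an existing row
    byId.set(id, compute(id))
    added++
  }
  // Map iteration follows insertion order: existing rows first, new rows after.
  return { rows: Array.from(byId.values()), added }
}

const merged = mergeMissing(
  [{ id: 'c-00001', pinyin: 'yī (hand-edited)' }],
  ['c-00001', 'c-00002'],
  (id) => ({ id, pinyin: `computed for ${id}` }),
)
console.log(merged.added) // 1
console.log(merged.rows.map((row) => row.pinyin))
```

Because the merge is keyed by id, re-running it with the same inputs adds nothing, which is what makes manual overrides in the TSVs safe.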
+112
data/cli/commands/studio/ordering.ts
··· 1 + /** 2 + * Studio command: sorts user-language dictionary TSVs. 3 + * 4 + * "Studio" commands are for data managers updating source files — they are not 5 + * part of the normal app build. 6 + * 7 + * By default, sorts all user-language TSVs by subject id (the canonical committed state). 8 + * With --lang, sorts by that target language's curriculum order instead — useful when 9 + * you want to review a file in the order subjects appear in a specific course. 10 + * 11 + * Files sorted (in data/lang/{userLang}/): 12 + * - characters.tsv, vocabulary.tsv, radicals.tsv 13 + * - hints/meaning.characters.tsv, hints/meaning.vocabulary.tsv 14 + * - hints/reading.characters.tsv, hints/reading.vocabulary.tsv 15 + * - meanings/{targetLang}.characters.tsv, meanings/{targetLang}.vocabulary.tsv 16 + * 17 + * Examples: 18 + * hanzi studio sort:dicts 19 + * hanzi studio sort:dicts --lang=zh_CN 20 + */ 21 + 22 + import { Command } from '@cliffy/command' 23 + import { readDictByHant } from '../../utils/dict.ts' 24 + import { readCharacterOrder, readLessonOrder } from '../../utils/ordering.ts' 25 + import { DATA_ROOT, readTsv, writeTsv } from '../../utils/fs.ts' 26 + 27 + const USER_LANGS = ['en', 'es'] 28 + const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja'] 29 + 30 + export const sortDictsCmd = new Command() 31 + .description( 32 + 'Sort user-language dictionary TSVs. ' + 33 + 'Default: sort by id (canonical committed order). ' + 34 + 'With --lang: sort by that target language\'s curriculum order.', 35 + ) 36 + .option('--lang <targetLang:string>', 'Sort by curriculum order of this target language') 37 + .action(({ lang: targetLang }) => { 38 + if (targetLang && !TARGET_LANGS.includes(targetLang)) { 39 + console.error(`Unknown --lang: ${targetLang}. Valid: ${TARGET_LANGS.join(', ')}`) 40 + Deno.exit(1) 41 + } 42 + 43 + const charIdOrder = targetLang ? buildIdOrder('character', targetLang) : null 44 + const vocabIdOrder = targetLang ? 
buildIdOrder('vocabulary', targetLang) : null 45 + const meaningTargetLangs = targetLang ? [targetLang] : TARGET_LANGS 46 + 47 + for (const userLang of USER_LANGS) { 48 + console.log(`\nSorting lang/${userLang}/`) 49 + sortFile(`lang/${userLang}/characters.tsv`, charIdOrder) 50 + sortFile(`lang/${userLang}/radicals.tsv`, null) 51 + sortFile(`lang/${userLang}/vocabulary.tsv`, vocabIdOrder) 52 + sortFile(`lang/${userLang}/hints/meaning.characters.tsv`, charIdOrder) 53 + sortFile(`lang/${userLang}/hints/meaning.vocabulary.tsv`, vocabIdOrder) 54 + sortFile(`lang/${userLang}/hints/reading.characters.tsv`, charIdOrder) 55 + sortFile(`lang/${userLang}/hints/reading.vocabulary.tsv`, vocabIdOrder) 56 + for (const tl of meaningTargetLangs) { 57 + sortFile(`lang/${userLang}/meanings/${tl}.characters.tsv`, charIdOrder) 58 + sortFile(`lang/${userLang}/meanings/${tl}.vocabulary.tsv`, vocabIdOrder) 59 + } 60 + } 61 + console.log('\nDone.') 62 + }) 63 + 64 + /** Builds a map of subject id → curriculum position for a given type and target language. */ 65 + function buildIdOrder( 66 + type: 'character' | 'vocabulary', 67 + targetLang: string, 68 + ): Map<string, number> { 69 + const slugs = type === 'character' 70 + ? readCharacterOrder(targetLang).flat() 71 + : readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`).flat() 72 + const dictPath = type === 'character' ? 'lang/characters.tsv' : 'lang/vocabulary.tsv' 73 + const byHant = readDictByHant(dictPath) 74 + const map = new Map<string, number>() 75 + slugs.forEach((slug, i) => { 76 + const def = byHant[slug] 77 + if (def && !map.has(def.id)) map.set(def.id, i) 78 + }) 79 + return map 80 + } 81 + 82 + /** Reads the raw header line from a TSV (preserving original column order). */ 83 + function readHeaders(relPath: string): string[] { 84 + const text = Deno.readTextFileSync(DATA_ROOT + relPath) 85 + return text.split('\n')[0].split('\t').map((h) => h.replace('\r', '')) 86 + } 87 + 88 + /** Sorts a TSV file in place. 
Silently skips files that don't exist. */ 89 + function sortFile(relPath: string, orderMap: Map<string, number> | null): void { 90 + let rows: Record<string, string>[] 91 + let headers: string[] 92 + try { 93 + headers = readHeaders(relPath) 94 + rows = readTsv(relPath) 95 + } catch { 96 + return // file doesn't exist for this user/target lang combo 97 + } 98 + if (!rows.length) return 99 + 100 + if (orderMap) { 101 + rows.sort((a, b) => { 102 + const posA = orderMap.get(a.id) ?? Infinity 103 + const posB = orderMap.get(b.id) ?? Infinity 104 + return posA - posB 105 + }) 106 + } else { 107 + rows.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)) 108 + } 109 + 110 + writeTsv(relPath, headers, rows) 111 + console.log(` sorted: ${relPath}`) 112 + }
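The two sort modes (by id, or by curriculum position with unknown ids last) can be sketched as a single comparator. This is a variant under stated assumptions rather than the command's own code: `compareRows` is a hypothetical name, and it adds an id tie-break for rows missing from the curriculum, which `sortFile` itself does not do.

```typescript
type Row = { id: string } & Record<string, string>

// Hypothetical comparator factory covering both sort modes.
function compareRows(orderMap: Map<string, number> | null) {
  return (a: Row, b: Row): number => {
    if (orderMap) {
      const posA = orderMap.get(a.id) ?? Infinity
      const posB = orderMap.get(b.id) ?? Infinity
      // Infinity - Infinity is NaN, so compare explicitly instead of subtracting.
      if (posA !== posB) return posA < posB ? -1 : 1
    }
    // Default mode, and tie-break for ids that are not in the curriculum.
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0
  }
}

const sampleRows: Row[] = [
  { id: 'c-00003', value: 'three' },
  { id: 'c-00009', value: 'nine' },
  { id: 'c-00001', value: 'one' },
  { id: 'c-00002', value: 'two' },
]
// Assumed curriculum: c-00002 is taught first, then c-00003.
const curriculum = new Map<string, number>([['c-00002', 0], ['c-00003', 1]])

const byId = sampleRows.slice().sort(compareRows(null)).map((r) => r.id)
const byCurriculum = sampleRows.slice().sort(compareRows(curriculum)).map((r) => r.id)
console.log(byId) // [ 'c-00001', 'c-00002', 'c-00003', 'c-00009' ]
console.log(byCurriculum) // [ 'c-00002', 'c-00003', 'c-00001', 'c-00009' ]
```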
+170
data/cli/commands/studio/sentences.ts
··· 1 + /** 2 + * Processes raw Tatoeba sentence TSV exports into the app's sentence files. 3 + * 4 + * Reads source files from data/lang/{userLang}/sources/ and writes processed 5 + * sentences to data/lang/{userLang}/sentences/{targetLang}.tsv. 6 + * 7 + * Run this after downloading new sentence pairs from https://tatoeba.org. 8 + * Source files are auto-detected by name pattern — no date argument needed. 9 + * 10 + * Expected source filename format: 11 + * "Sentence pairs in {Language}-{Target} - {date}.tsv" 12 + * e.g. "Sentence pairs in English-Mandarin Chinese - 2026-03-13.tsv" 13 + * 14 + * Processing steps: 15 + * 1. Parse and validate rows (filters empty/too-long sentences) 16 + * 2. Convert zh_CN ↔ zh_TW using OpenCC 17 + * 3. Deduplicate by sentence text 18 + * 4. Sort by simplicity (characters appearing earlier in the curriculum score lower) 19 + */ 20 + 21 + import { Command } from '@cliffy/command' 22 + import { parse } from '@std/csv/parse' 23 + import { stringify } from '@std/csv/stringify' 24 + import * as OpenCC from 'opencc-js' 25 + import { readCharacterOrder } from '../../utils/ordering.ts' 26 + import { readDict } from '../../utils/dict.ts' 27 + 28 + const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' }) 29 + const toTW = OpenCC.Converter({ from: 'cn', to: 'tw' }) 30 + 31 + const USER_LANGS = ['en', 'es'] 32 + 33 + const SOURCE_TARGET_NAMES: Record<string, string> = { 34 + zh_CN: 'Mandarin Chinese', 35 + zh_HK: 'Cantonese', 36 + zh_TW: 'Mandarin Chinese', // TW sentences are converted from CN source 37 + ja: 'Japanese', 38 + } 39 + 40 + interface Sentence { 41 + id: number 42 + value: string 43 + enId: number 44 + en: string 45 + } 46 + 47 + export const updateSentencesCmd = new Command() 48 + .description( 49 + 'Process raw Tatoeba sentence exports into app sentence TSVs (data/lang/{userLang}/sentences/). 
' + 50 + 'Run after downloading new sentence pairs from tatoeba.org.', 51 + ) 52 + .action(() => { 53 + const charDefs = readDict('lang/characters.tsv') 54 + const hantToHans: Record<string, string> = Object.fromEntries( 55 + charDefs.map((d) => [d.hant, d.hans]), 56 + ) 57 + 58 + for (const langCode of USER_LANGS) { 59 + processLang(langCode, hantToHans) 60 + } 61 + console.log('Done.') 62 + }) 63 + 64 + function userLangName(langCode: string): string { 65 + return langCode === 'es' ? 'Spanish' : 'English' 66 + } 67 + 68 + /** Finds the first source file (in directory-listing order) matching a name prefix in the sources directory. */ 69 + function findSourceFile(langCode: string, targetName: string): string | null { 70 + const dir = `./data/lang/${langCode}/sources` 71 + const prefix = `Sentence pairs in ${userLangName(langCode)}-${targetName}` 72 + try { 73 + for (const entry of Deno.readDirSync(dir)) { 74 + if (entry.name.startsWith(prefix) && entry.name.endsWith('.tsv')) { 75 + return `${dir}/${entry.name}` 76 + } 77 + } 78 + } catch { /* directory missing */ } 79 + return null 80 + } 81 + 82 + function processLang(langCode: string, hantToHans: Record<string, string>): void { 83 + const cnPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.zh_CN) 84 + const hkPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.zh_HK) 85 + const jaPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.ja) 86 + 87 + if (!cnPath || !hkPath || !jaPath) { 88 + console.warn(`Source files not found for ${langCode} — skipping. 
Expected files in data/lang/${langCode}/sources/`) 89 + return 90 + } 91 + 92 + const readSourceFile = (path: string): Sentence[] => 93 + parse(Deno.readTextFileSync(path), { separator: '\t', lazyQuotes: true }) 94 + .map(([enIdStr, en, idStr, value]) => { 95 + const enId = parseInt(enIdStr) 96 + const id = parseInt(idStr) 97 + if (!enId || !id || !en || !value || value.length > 15) return null 98 + return { id, value, enId, en } 99 + }) 100 + .filter((row): row is Sentence => row != null) 101 + 102 + const cnRaw = readSourceFile(cnPath) 103 + const hkRaw = readSourceFile(hkPath) 104 + const jaRaw = readSourceFile(jaPath) 105 + 106 + const locales: [string, Sentence[]][] = [ 107 + ['zh_CN', cnRaw.map((s) => ({ ...s, value: toCN(s.value) }))], 108 + ['zh_TW', cnRaw.map((s) => ({ ...s, value: toTW(s.value) }))], 109 + ['zh_HK', hkRaw], 110 + ['ja', jaRaw], 111 + ] 112 + 113 + Deno.mkdirSync(`./data/lang/${langCode}/sentences`, { recursive: true }) 114 + 115 + for (const [locale, sentences] of locales) { 116 + const positionMap = buildPositionMap(locale, hantToHans) 117 + const processed = dedupeAndSort(sentences, positionMap) 118 + const outPath = `./data/lang/${langCode}/sentences/${locale}.tsv` 119 + Deno.writeTextFileSync( 120 + outPath, 121 + stringify(processed, { columns: ['id', 'value', 'enId', 'en'], separator: '\t' }), 122 + ) 123 + console.log(` wrote lang/${langCode}/sentences/${locale}.tsv (${processed.length} sentences)`) 124 + } 125 + } 126 + 127 + /** 128 + * Builds a map of character → curriculum position for scoring sentence simplicity. 129 + * Both hant and hans forms are indexed so scores work for any script variant. 
130 + */ 131 + function buildPositionMap(locale: string, hantToHans: Record<string, string>): Map<string, number> { 132 + const map = new Map<string, number>() 133 + let pos = 0 134 + for (const level of readCharacterOrder(locale)) { 135 + for (const hant of level) { 136 + map.set(hant, pos) 137 + const hans = hantToHans[hant] 138 + if (hans && hans !== hant) map.set(hans, pos) 139 + pos++ 140 + } 141 + } 142 + return map 143 + } 144 + 145 + /** 146 + * Scores a sentence by the sum of its characters' curriculum positions. 147 + * Sentences with only early-curriculum characters score lower (simpler). 148 + * The length factor only affects sentences of length 1, scaling their score by 1/1.3. 149 + */ 150 + function sentenceScore( 151 + sentence: string, 152 + positionMap: Map<string, number>, 153 + maxPos: number, 154 + ): number { 155 + let score = 0 156 + for (const char of sentence) score += positionMap.get(char) ?? maxPos 157 + return score * Math.min(1, sentence.length / 1.3) 158 + } 159 + 160 + /** Deduplicates by sentence text (keeps later id on collision), then sorts by simplicity. */ 161 + function dedupeAndSort(sentences: Sentence[], positionMap: Map<string, number>): Sentence[] { 162 + const seen: Record<string, Sentence> = {} 163 + for (const s of sentences) { 164 + if (!seen[s.value] || seen[s.value].id < s.id) seen[s.value] = s 165 + } 166 + const maxPos = positionMap.size 167 + return Object.values(seen).sort( 168 + (a, b) => sentenceScore(a.value, positionMap, maxPos) - sentenceScore(b.value, positionMap, maxPos), 169 + ) 170 + }
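A small worked example may help make the scoring concrete. The sketch below copies the `sentenceScore` formula and runs it against a made-up four-character curriculum; the position map and the sample sentences are assumptions chosen for illustration, not project data.

```typescript
// Same formula as sentenceScore above, reproduced so the example is self-contained.
function sentenceScore(
  sentence: string,
  positionMap: Map<string, number>,
  maxPos: number,
): number {
  let score = 0
  // Characters missing from the curriculum are charged the maximum position.
  for (const char of sentence) score += positionMap.get(char) ?? maxPos
  return score * Math.min(1, sentence.length / 1.3)
}

// Hypothetical curriculum: 我 is taught first, 愛 last.
const positions = new Map<string, number>([['我', 0], ['你', 1], ['好', 2], ['愛', 3]])
const maxPos = positions.size // 4, as in dedupeAndSort

console.log(sentenceScore('你好', positions, maxPos)) // 3  (1 + 2, factor 1)
console.log(sentenceScore('我愛你', positions, maxPos)) // 4  (0 + 3 + 1, factor 1)
console.log(sentenceScore('你們好', positions, maxPos)) // 7  (們 is unknown, so it costs 4)

// A one-character string has length / 1.3 below 1, so its score is scaled down.
const single = sentenceScore('好', positions, maxPos) // 2 * (1 / 1.3)

const ranked = ['你們好', '我愛你', '好', '你好'].sort(
  (a, b) => sentenceScore(a, positions, maxPos) - sentenceScore(b, positions, maxPos),
)
console.log(ranked)
```

With these inputs the lowest (simplest) score goes to the one-character entry, followed by the two-character sentence, so the ordering favours short sentences built from early-curriculum characters.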
+65
data/cli/main.ts
··· 1 + /** 2 + * Hanzi data CLI 3 + * 4 + * Entry point for all data pipeline and studio tools. 5 + * 6 + * Build commands (run by code contributors): 7 + * build Run gen:app, gen:progress, and gen:licenses in sequence 8 + * gen:app Compile subject JSON for all language pairs 9 + * gen:audio Generate TTS audio via Azure (requires .env + ffmpeg) 10 + * gen:progress Generate HSK / TOCFL / JLPT progress JSON 11 + * gen:licenses Fetch and bundle license texts 12 + * 13 + * Studio commands (run by translation and language data contributors): 14 + * studio update:dicts Add missing readings/meanings from dictionary sources 15 + * studio sort:dicts Sort TSVs by id or by curriculum order 16 + * studio update:sentences Process raw Tatoeba exports into sentence TSVs 17 + * 18 + * Usage: 19 + * deno task data <command> [options] 20 + * deno task data --help 21 + * deno task data studio --help 22 + */ 23 + 24 + import { Command } from '@cliffy/command' 25 + import { genAppDataCmd } from './commands/gen_app_data.ts' 26 + import { genAudioCmd } from './commands/gen_audio.ts' 27 + import { genProgressCmd } from './commands/gen_progress.ts' 28 + import { genLicensesCmd } from './commands/gen_licenses.ts' 29 + import { updateDictsCmd } from './commands/studio/dicts.ts' 30 + import { sortDictsCmd } from './commands/studio/ordering.ts' 31 + import { updateSentencesCmd } from './commands/studio/sentences.ts' 32 + 33 + const studioCmd = new Command() 34 + .description( 35 + 'Tools for translation and language data contributors. 
' + 36 + 'Run these when updating dictionary sources, readings, or curriculum ordering.', 37 + ) 38 + .command('update:dicts', updateDictsCmd) 39 + .command('update:sentences', updateSentencesCmd) 40 + .command('sort:dicts', sortDictsCmd) 41 + 42 + await new Command() 43 + .name('hanzi') 44 + .version('0.0.1') 45 + .description('CLI tools for building and managing Hanzi app data.') 46 + .action(function () { this.showHelp() }) 47 + // Build commands 48 + .command('build', new Command() 49 + .description('Run the full build: gen:app, gen:progress, and gen:licenses.') 50 + .action(async () => { 51 + console.log('=== gen:app ===') 52 + await genAppDataCmd.parse([]) 53 + console.log('\n=== gen:progress ===') 54 + await genProgressCmd.parse([]) 55 + console.log('\n=== gen:licenses ===') 56 + await genLicensesCmd.parse([]) 57 + }), 58 + ) 59 + .command('gen:app', genAppDataCmd) 60 + .command('gen:audio', genAudioCmd) 61 + .command('gen:progress', genProgressCmd) 62 + .command('gen:licenses', genLicensesCmd) 63 + // Studio commands 64 + .command('studio', studioCmd) 65 + .parse(Deno.args)
+36
data/cli/utils/audio.ts
··· 1 + /** 2 + * Audio file utilities: voice IDs, filename conventions, and file listing. 3 + */ 4 + 5 + import { join } from '@std/path' 6 + import { Locale } from '$/enums.ts' 7 + 8 + /** Azure Neural TTS voice ID used per locale. */ 9 + export const VOICE_IDS: Record<Locale, string> = { 10 + [Locale.zh_CN]: 'zh-CN-XiaoxiaoNeural', 11 + [Locale.zh_HK]: 'zh-HK-WanLungNeural', 12 + [Locale.zh_TW]: 'zh-TW-YunJheNeural', 13 + } 14 + 15 + /** Returns the audio filename for a given subject id and locale. */ 16 + export function getFilename(id: string, locale: Locale): string { 17 + return `${id}_${locale.replace('_', '-')}_${VOICE_IDS[locale]}.mp3` 18 + } 19 + 20 + /** 21 + * Returns all existing audio filenames (not full paths) for the given locales. 22 + * Skips locales whose audio directories don't exist yet (e.g. before first audio run). 23 + */ 24 + export function listAudioFiles(locales: string[] = ['zh_CN', 'zh_HK', 'zh_TW']): string[] { 25 + return locales 26 + .filter((locale) => locale !== 'tmp') 27 + .flatMap((locale) => { 28 + try { 29 + return Array.from(Deno.readDirSync(join('www/static/gen/audio', locale))) 30 + } catch { 31 + return [] 32 + } 33 + }) 34 + .filter(({ name }) => /.*\.mp3$/.test(name)) 35 + .map((file) => file.name) 36 + }
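The filename convention is easiest to see with a concrete value. The sketch below re-declares `VOICE_IDS` and `getFilename` with a local string-union `Locale` standing in for the app's enum; that stand-in assumes the enum values are the plain strings `zh_CN`, `zh_HK`, and `zh_TW`.

```typescript
// Local stand-in for the app's Locale enum (assumption: values are these strings).
type Locale = 'zh_CN' | 'zh_HK' | 'zh_TW'

const VOICE_IDS: Record<Locale, string> = {
  zh_CN: 'zh-CN-XiaoxiaoNeural',
  zh_HK: 'zh-HK-WanLungNeural',
  zh_TW: 'zh-TW-YunJheNeural',
}

function getFilename(id: string, locale: Locale): string {
  // replace() with a string pattern swaps only the first underscore,
  // which is enough because each locale code contains exactly one.
  return `${id}_${locale.replace('_', '-')}_${VOICE_IDS[locale]}.mp3`
}

const exampleName = getFilename('c-00001', 'zh_CN')
console.log(exampleName) // c-00001_zh-CN_zh-CN-XiaoxiaoNeural.mp3

// The same pattern used by listAudioFiles to keep only .mp3 entries.
console.log(/.*\.mp3$/.test(exampleName)) // true
```

Embedding the voice id in the filename means that switching a locale to a different voice produces new filenames rather than silently reusing audio from the old voice.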
+146
data/cli/utils/dict.ts
··· 1 + /** 2 + * Dictionary utilities: reading and interpreting the shared CJK dictionary files 3 + * and user-language meaning/hint/reading files. 4 + * 5 + * All files live under data/lang/ and are keyed by string ids (e.g. "c-00001"). 6 + */ 7 + 8 + import { Transliteration } from '$/enums.ts' 9 + import type { Reading } from '$/models/subjects.ts' 10 + import { readTsv } from './fs.ts' 11 + 12 + /** A single entry from characters.tsv, vocabulary.tsv, or radicals.tsv. */ 13 + export interface Definition { 14 + /** Unique string id (e.g. "c-00001" for characters, "v-00001" for vocabulary). */ 15 + id: string 16 + hans: string 17 + hant: string 18 + /** Japanese-specific form (kanji col in characters.tsv, ja col in vocabulary.tsv). */ 19 + ja?: string 20 + } 21 + 22 + export interface Hint { 23 + id: string 24 + locale: string 25 + en: string 26 + } 27 + 28 + /** Maps TSV column names to their Transliteration enum values. */ 29 + const COL_TYPE: Record<string, Transliteration> = { 30 + pinyin: Transliteration.Pinyin, 31 + jyutping: Transliteration.Jyutping, 32 + zhuyin: Transliteration.Zhuyin, 33 + kunyomi: Transliteration.Kunyomi, 34 + onyomi: Transliteration.Onyomi, 35 + reading: Transliteration.Hiragana, 36 + } 37 + 38 + /** Reads a CJK dictionary TSV (characters.tsv, vocabulary.tsv, or radicals.tsv). */ 39 + export function readDict(path: string): Definition[] { 40 + return readTsv(path).map((row) => ({ 41 + id: row.id, 42 + hans: row.hans, 43 + hant: row.hant, 44 + ja: row.kanji || row.ja || undefined, 45 + })) 46 + } 47 + 48 + /** Returns a dictionary indexed by traditional (hant) form. */ 49 + export function readDictByHant(path: string): Record<string, Definition> { 50 + return Object.fromEntries(readDict(path).map((d) => [d.hant, d])) 51 + } 52 + 53 + /** Returns a dictionary indexed by simplified (hans) form. 
*/ 54 + export function readDictByHans(path: string): Record<string, Definition> { 55 + return Object.fromEntries(readDict(path).map((d) => [d.hans, d])) 56 + } 57 + 58 + /** Returns a dictionary indexed by subject id. */ 59 + export function readDictById(path: string): Record<string, Definition> { 60 + return Object.fromEntries(readDict(path).map((d) => [d.id, d])) 61 + } 62 + 63 + /** 64 + * Reads a user-language meanings file (e.g. lang/en/characters.tsv). 65 + * Returns a map of subject id → meaning string. 66 + */ 67 + export function readMeanings(path: string): Record<string, string> { 68 + return Object.fromEntries(readTsv(path).map((row) => [row.id, row.value])) 69 + } 70 + 71 + /** 72 + * Reads a meaning-override TSV (e.g. lang/en/meanings/ja.characters.tsv). 73 + * Returns a map of subject id → semicolon-separated meaning string. 74 + * Returns an empty map if the file doesn't exist. 75 + */ 76 + export function readMeaningOverrides(path: string): Record<string, string> { 77 + try { 78 + return Object.fromEntries(readTsv(path).map((row) => [row.id, row.meaning])) 79 + } catch { 80 + return {} 81 + } 82 + } 83 + 84 + /** 85 + * Reads a readings TSV (e.g. lang/zh_CN/readings.tsv, lang/ja/reading.characters.tsv) 86 + * and returns a map of subject id → Reading[]. 87 + * 88 + * Each non-id column maps to a Transliteration type via COL_TYPE. Semicolon-separated 89 + * values produce multiple readings; the first value in the first column is isPrimary. 90 + * Returns an empty map if the file doesn't exist. 
91 + */ 92 + export function readReadingsMap(path: string): Record<string, Reading[]> { 93 + const result: Record<string, Reading[]> = {} 94 + try { 95 + const rows = readTsv(path) 96 + if (!rows.length) return result 97 + // Reverse so isPrimary goes to the first column 98 + const cols = Object.keys(rows[0]).filter((k) => k !== 'id').reverse() 99 + for (const row of rows) { 100 + const readings: Reading[] = [] 101 + let firstCol = true 102 + for (const col of cols) { 103 + const val = row[col] || '' 104 + const type = COL_TYPE[col] 105 + if (!type || !val) continue 106 + val.split(';').map((s) => s.trim()).filter(Boolean).forEach((value, i) => { 107 + readings.push({ value, type, isAcceptedAnswer: true, isPrimary: firstCol && i === 0 }) 108 + }) 109 + firstCol = false 110 + } 111 + if (readings.length) result[row.id] = readings 112 + } 113 + } catch { /* file missing — return empty */ } 114 + return result 115 + } 116 + 117 + /** Reads a hint TSV (lang/en/hints/*.tsv). Returns an empty array if the file doesn't exist. */ 118 + export function readHints(path: string): Hint[] { 119 + try { 120 + return readTsv(path).map((row) => ({ id: row.id, locale: row.locale, en: row.en })) 121 + } catch { 122 + return [] 123 + } 124 + } 125 + 126 + /** Reads hints and returns a nested map of subject id → locale → hint text. */ 127 + export function readHintsById(path: string): Record<string, Record<string, string>> { 128 + const result: Record<string, Record<string, string>> = {} 129 + for (const row of readHints(path)) { 130 + result[row.id] ??= {} 131 + result[row.id][row.locale] = row.en 132 + } 133 + return result 134 + } 135 + 136 + /** 137 + * Reads hints and returns only those with locale === 'ALL', keyed by subject id. 138 + * Used for hints that apply regardless of target language. 
139 + */ 140 + export function readAllLocaleHints(path: string): Record<string, string> { 141 + return Object.fromEntries( 142 + readHints(path) 143 + .filter((row) => row.locale === 'ALL') 144 + .map((row) => [row.id, row.en]), 145 + ) 146 + }
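To see how one readings row expands into `Reading` entries, here is an in-memory sketch of the loop in `readReadingsMap`. The `Reading` type is simplified (a plain string instead of the `Transliteration` enum), `rowToReadings` is a hypothetical name, and the sample row is invented. Note that, as the loop is written, the column list is reversed before iteration, so in this sketch the primary flag lands on the first value of the last non-empty column in the header.

```typescript
// Simplified stand-ins for the app's Reading model and COL_TYPE table.
type Reading = { value: string; type: string; isAcceptedAnswer: boolean; isPrimary: boolean }
const COL_TYPE: Record<string, string> = { kunyomi: 'kunyomi', onyomi: 'onyomi' }

function rowToReadings(row: Record<string, string>, headerCols: string[]): Reading[] {
  // Same column handling as readReadingsMap: drop id, then reverse.
  const cols = headerCols.filter((k) => k !== 'id').reverse()
  const readings: Reading[] = []
  let firstCol = true
  for (const col of cols) {
    const val = row[col] || ''
    const type = COL_TYPE[col]
    if (!type || !val) continue // empty columns do not consume the primary slot
    val.split(';').map((s) => s.trim()).filter(Boolean).forEach((value, i) => {
      readings.push({ value, type, isAcceptedAnswer: true, isPrimary: firstCol && i === 0 })
    })
    firstCol = false
  }
  return readings
}

const sampleReadings = rowToReadings(
  { id: 'c-00001', kunyomi: 'ひと; ひとつ', onyomi: 'イチ; イツ' },
  ['id', 'kunyomi', 'onyomi'],
)
console.log(sampleReadings.map((r) => `${r.value}${r.isPrimary ? ' (primary)' : ''}`))
// [ 'イチ (primary)', 'イツ', 'ひと', 'ひとつ' ]
```

If the `onyomi` cell were empty, the first `kunyomi` value would become primary instead, because the flag is only cleared after a column that actually contributed readings.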
+53
data/cli/utils/fs.ts
··· 1 + /** 2 + * Low-level file I/O for the data pipeline. 3 + * 4 + * This module only handles reading and writing raw files — no domain logic. 5 + * For dictionary, subject, ordering, or sentence helpers, see the sibling modules. 6 + */ 7 + 8 + import { parse } from '@std/csv/parse' 9 + import { stringify } from '@std/csv/stringify' 10 + import stringifyJSON from 'json-stringify-pretty-compact' 11 + 12 + /** Root directory for data source files (TSVs, CSVs, JSON sources). */ 13 + export const DATA_ROOT = './data/' 14 + 15 + /** Output directory for generated app JSON files served at runtime. */ 16 + export const APP_ROOT = './www/static/gen/' 17 + 18 + /** Reads a TSV from `data/`. Skips the header row; returns records keyed by column name. */ 19 + export function readTsv(input: string): Record<string, string>[] { 20 + const text = Deno.readTextFileSync(DATA_ROOT + input) 21 + return parse(text, { 22 + separator: '\t', 23 + lazyQuotes: true, 24 + skipFirstRow: true, 25 + }) as Record<string, string>[] 26 + } 27 + 28 + /** Reads a CSV from `data/`. Skips the header row; returns records keyed by column name. */ 29 + export function readCsv(input: string): Record<string, string>[] { 30 + const text = Deno.readTextFileSync(DATA_ROOT + input) 31 + return parse(text, { 32 + lazyQuotes: true, 33 + skipFirstRow: true, 34 + }) as Record<string, string>[] 35 + } 36 + 37 + /** Reads and parses a JSON file from `data/`. */ 38 + export function readJson<T = unknown>(input: string): T { 39 + return JSON.parse(Deno.readTextFileSync(DATA_ROOT + input)) as T 40 + } 41 + 42 + /** Writes rows as a TSV to `data/`. Column order follows the `columns` array. */ 43 + export function writeTsv(input: string, columns: string[], data: unknown[]): void { 44 + Deno.writeTextFileSync( 45 + DATA_ROOT + input, 46 + stringify(data as Record<string, string>[], { columns, separator: '\t' }), 47 + ) 48 + } 49 + 50 + /** Writes content as pretty-printed JSON to `www/static/gen/`. 
*/ 51 + export function writeAppJson(path: string, content: unknown): void { 52 + Deno.writeTextFileSync(APP_ROOT + path, stringifyJSON(content)) 53 + }
+55
data/cli/utils/ordering.ts
··· 1 + /** 2 + * Curriculum ordering utilities: reading the order files that define which 3 + * characters, vocabulary, and radicals are taught, and in what sequence. 4 + * 5 + * Order files live under data/lang/{targetLang}/order/. 6 + */ 7 + 8 + import { parse } from '@std/csv/parse' 9 + import { DATA_ROOT } from './fs.ts' 10 + import { readDictByHant, type Definition } from './dict.ts' 11 + 12 + /** 13 + * Reads a vocabulary or radical order CSV (e.g. lang/zh_CN/order/vocabulary.csv). 14 + * Returns an array of levels, each level being an array of slugs (traditional forms). 15 + */ 16 + export function readLessonOrder(input: string): string[][] { 17 + return parse(Deno.readTextFileSync(DATA_ROOT + input)) as string[][] 18 + } 19 + 20 + /** 21 + * Reads a character order file (e.g. lang/zh_CN/order/characters.txt). 22 + * Each line is one level; each character (code point) in the line is one slug. 23 + */ 24 + export function readCharacterOrder(targetLang: string): string[][] { 25 + const text = Deno.readTextFileSync(DATA_ROOT + `lang/${targetLang}/order/characters.txt`) 26 + return text.split('\n').map((row) => Array.from(row)) 27 + } 28 + 29 + /** 30 + * Returns the ordered set of definitions for items in the curriculum, deduplicated 31 + * and in curriculum order. Only includes items that are actually taught — not the 32 + * full dictionary. 33 + * 34 + * @param type - 'character' or 'vocabulary' 35 + * @param targetLang - target language code (default: 'zh_CN') 36 + */ 37 + export function readOrderedDefs( 38 + type: 'character' | 'vocabulary', 39 + targetLang = 'zh_CN', 40 + ): Definition[] { 41 + const [dictPath, slugsList] = type === 'character' 42 + ? 
['lang/characters.tsv', readCharacterOrder(targetLang).flat()] 43 + : ['lang/vocabulary.tsv', readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`).flat()] 44 + 45 + const byHant = readDictByHant(dictPath) 46 + const seen = new Set<string>() 47 + const result: Definition[] = [] 48 + for (const slug of slugsList) { 49 + if (!slug || seen.has(slug)) continue 50 + seen.add(slug) 51 + const def = byHant[slug] 52 + if (def) result.push(def) 53 + } 54 + return result 55 + }
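When a `characters.txt` line is split into individual slugs, the way the string is split matters for rare characters. The sketch below is a hedged, in-memory variant of the parsing step: `parseCharacterOrder` is a hypothetical name, and unlike `readCharacterOrder` it also drops blank lines and trailing carriage returns. It contrasts splitting by UTF-16 code unit with splitting by code point.

```typescript
// Hypothetical in-memory variant of the characters.txt parser.
function parseCharacterOrder(text: string): string[][] {
  return text
    .split('\n')
    .map((line) => line.replace(/\r$/, '')) // tolerate CRLF line endings
    .filter((line) => line.length > 0) // skip blank lines, e.g. a trailing newline
    .map((line) => Array.from(line)) // one entry per code point
}

// U+20BB7 lies outside the Basic Multilingual Plane, so it occupies
// two UTF-16 code units in a JavaScript string.
const rare = String.fromCodePoint(0x20bb7)

console.log(rare.split('').length) // 2  (split into two lone surrogates)
console.log(Array.from(rare).length) // 1  (kept as one character)

const levels = parseCharacterOrder(`一二三\n人${rare}\n`)
console.log(levels.length) // 2
console.log(levels.map((level) => level.length)) // [ 3, 2 ]
```

For order files that only contain Basic Multilingual Plane characters both approaches give the same result; the difference only shows up for supplementary-plane characters such as CJK Extension B.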
+2
data/cli/utils/progress/hsk.ts
··· 1 + // HSK progress data is generated directly in commands/gen_progress.ts. 2 + // This file is reserved for future extraction if the gen_progress command grows.
+2
data/cli/utils/progress/tocfl.ts
··· 1 + // TOCFL progress data is generated directly in commands/gen_progress.ts. 2 + // This file is reserved for future extraction if the gen_progress command grows.
+61
data/cli/utils/sentences.ts
··· 1 + /** 2 + * Sentence utilities: reading pre-processed example sentence files and building 3 + * the character-level indexes used when generating subject data. 4 + * 5 + * Sentence files live at data/lang/{userLang}/sentences/{targetLang}.tsv. 6 + */ 7 + 8 + import { distinct } from '@std/collections/distinct' 9 + import { parse } from '@std/csv/parse' 10 + import type { Locale } from '$/enums.ts' 11 + import { DATA_ROOT } from './fs.ts' 12 + 13 + export interface Sentences { 14 + /** Maps sentence text → user-language translation. */ 15 + bySentence: Record<string, string> 16 + /** Maps each individual character → all sentence texts that contain it. */ 17 + byChar: Map<string, string[]> 18 + /** All sentence texts, in curriculum-sorted order (simplest first). */ 19 + sorted: string[] 20 + } 21 + 22 + /** 23 + * Reads the pre-processed sentence TSV for a given user + target language pair. 24 + * Returns `{ bySentence, keys }`. Returns empty values if the file doesn't exist. 25 + */ 26 + export function readSentences( 27 + userLang: string, 28 + locale: Locale, 29 + ): { bySentence: Record<string, string>; keys: string[] } { 30 + let text = '' 31 + try { 32 + text = Deno.readTextFileSync(`${DATA_ROOT}lang/${userLang}/sentences/${locale}.tsv`) 33 + } catch { 34 + return { bySentence: {}, keys: [] } 35 + } 36 + const rows = parse(text, { separator: '\t', lazyQuotes: true }) 37 + const bySentence: Record<string, string> = {} 38 + const keys = distinct( 39 + rows.map(([_id, value, _enId, translation]) => { 40 + bySentence[value] = translation 41 + return value 42 + }), 43 + ) 44 + return { bySentence, keys } 45 + } 46 + 47 + /** 48 + * Loads and indexes sentences for a user + target language pair. 49 + * Builds `byChar` for fast per-character lookup when generating subject examples. 
 50 + */ 51 + export function loadSentences(userLang: string, targetLang: string): Sentences { 52 + const raw = readSentences(userLang, targetLang as Locale) 53 + const byChar = new Map<string, string[]>() 54 + for (const key of raw.keys) { 55 + for (const char of new Set(key)) { 56 + if (!byChar.has(char)) byChar.set(char, []) 57 + byChar.get(char)!.push(key) 58 + } 59 + } 60 + return { bySentence: raw.bySentence, byChar, sorted: raw.keys } 61 + }
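Review note: a minimal sketch of the per-character index idea used in `data/cli/utils/sentences.ts`, run against made-up sample sentences. The names `buildCharIndex` and `findExamples` are hypothetical and do not exist in the repo. The sketch iterates over the distinct characters of each sentence so that a sentence is listed at most once per character; that deduplication is a choice made here. The lookup mirrors how `createSubject` picks examples: single characters read the index directly, while longer terms start from the first character's bucket and keep only sentences containing the whole term.

```typescript
// Hypothetical helper: maps each character to the sentences that contain it.
function buildCharIndex(keys: string[]): Map<string, string[]> {
  const byChar = new Map<string, string[]>()
  for (const key of keys) {
    // new Set(key) yields each distinct character (code point) once.
    for (const char of new Set(key)) {
      const bucket = byChar.get(char) ?? []
      bucket.push(key)
      byChar.set(char, bucket)
    }
  }
  return byChar
}

// Hypothetical helper: returns up to `limit` sentences containing `term`.
function findExamples(index: Map<string, string[]>, term: string, limit = 3): string[] {
  const chars = [...term]
  const candidates = index.get(chars[0]) ?? []
  const hits = chars.length === 1
    ? candidates
    : candidates.filter((sentence) => sentence.includes(term))
  return hits.slice(0, limit)
}

// Sample data, invented for illustration only.
const sampleSentences = ['媽媽喜歡貓', '貓很可愛', '我是學生', '我喜歡貓']
const sampleIndex = buildCharIndex(sampleSentences)
const catExamples = findExamples(sampleIndex, '貓')
const likeExamples = findExamples(sampleIndex, '喜歡')
```

Because the input order is preserved inside each bucket, a curriculum-sorted sentence list yields the simplest matching sentences first.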
+234
data/cli/utils/subjects.ts
··· 1 + /** 2 + * Subject utilities: reading/writing compiled subject JSON files, and creating 3 + * new Subject objects from dictionary + curriculum data. 4 + * 5 + * Compiled subject files live at www/static/gen/lang/{userLang}/{targetLang}.json. 6 + */ 7 + 8 + import { distinct } from '@std/collections/distinct' 9 + import { dirname } from '@std/path' 10 + import stringifyJSON from 'json-stringify-pretty-compact' 11 + import { Locale, SubjectType } from '$/enums.ts' 12 + import type { Audio, Subject } from '$/models/subjects.ts' 13 + import { APP_ROOT } from './fs.ts' 14 + import type { Definition } from './dict.ts' 15 + import type { Sentences } from './sentences.ts' 16 + 17 + const { Character, Vocabulary } = SubjectType 18 + 19 + // --------------------------------------------------------------------------- 20 + // Subject I/O 21 + // --------------------------------------------------------------------------- 22 + 23 + /** Reads compiled subject JSON from `www/static/gen/`. Returns an empty array on error. */ 24 + export function readSubjects(input: string): Subject[] { 25 + try { 26 + return JSON.parse(Deno.readTextFileSync(APP_ROOT + input)) 27 + } catch { 28 + return [] 29 + } 30 + } 31 + 32 + /** 33 + * Reads compiled subject JSON and returns a map keyed by `data.slug`. 34 + * Subjects with a missing id or slug are skipped with a warning. 35 + */ 36 + export function readSubjectsMap(input: string): Record<string, Subject> { 37 + const map: Record<string, Subject> = {} 38 + readSubjects(input).forEach((subject) => { 39 + if (!subject.id || !subject.data?.slug) { 40 + console.warn( 41 + `Skipping subject with missing id/slug in ${input}:`, 42 + JSON.stringify(subject).slice(0, 120), 43 + ) 44 + return 45 + } 46 + map[subject.data.slug] = subject 47 + }) 48 + return map 49 + } 50 + 51 + /** 52 + * Writes compiled subjects to `www/static/gen/`. 
Before writing, subjects are: 53 + * - Filtered to require id, slug, and type (corrupt entries are dropped) 54 + * - Remapped with a stable property order for consistent diffs 55 + * - Sorted by level → type (Radical < Character < Vocabulary) → position 56 + */ 57 + export function writeSubjects(output: string, subjects: Subject[]): void { 58 + const levelAndPosition = new Set<string>() 59 + 60 + const toWrite = subjects 61 + .filter((subject) => { 62 + if (!subject.id || !subject.data?.slug || !subject.data?.type) { 63 + console.warn( 64 + 'Dropping invalid subject (missing id/slug/type):', 65 + JSON.stringify(subject).slice(0, 120), 66 + ) 67 + return false 68 + } 69 + return true 70 + }) 71 + .map((subject) => { 72 + const { data } = subject 73 + const levelPosition = `${data.type}-${data.level}-${data.position}` 74 + if (levelAndPosition.has(levelPosition) && levelPosition !== `${data.type}-0-0`) { 75 + console.warn(`Two subjects at same position ${levelPosition}: ${data.slug}`) 76 + } else { 77 + levelAndPosition.add(levelPosition) 78 + } 79 + // Explicit property order for stable JSON diffs 80 + return { 81 + id: subject.id, 82 + hiddenAt: subject.hiddenAt, 83 + learnCards: subject.learnCards?.length ? subject.learnCards : ['meanings'], 84 + quizCards: subject.quizCards?.length ? 
subject.quizCards : ['meanings', 'readings'], 85 + data: { 86 + audios: data.audios, 87 + character: data.character, 88 + requiredSubjects: data.requiredSubjects, 89 + examples: data.examples, 90 + level: data.level, 91 + meanings: data.meanings, 92 + meaningHint: data.meaningHint, 93 + meaningMnemonic: data.meaningMnemonic, 94 + position: data.position, 95 + readings: data.readings, 96 + readingHint: data.readingHint, 97 + readingMnemonic: data.readingMnemonic, 98 + slug: data.slug, 99 + srsId: data.srsId, 100 + type: data.type, 101 + }, 102 + } as Subject 103 + }) 104 + .sort((a, b) => { 105 + if (!a.data.level || !a.data.position) return 1 106 + if (!b.data.level || !b.data.position) return -1 107 + const levelDiff = a.data.level - b.data.level 108 + if (levelDiff) return levelDiff 109 + const typePriority: Record<string, number> = { Radical: 0, Character: 1, Vocabulary: 2 } 110 + const typeDiff = (typePriority[a.data.type] ?? 0) - (typePriority[b.data.type] ?? 0) 111 + if (typeDiff) return typeDiff 112 + return a.data.position - b.data.position 113 + }) 114 + 115 + const outPath = APP_ROOT + output 116 + Deno.mkdirSync(dirname(outPath), { recursive: true }) 117 + Deno.writeTextFileSync(outPath, stringifyJSON(toWrite)) 118 + } 119 + 120 + // --------------------------------------------------------------------------- 121 + // Subject creation 122 + // --------------------------------------------------------------------------- 123 + 124 + /** 125 + * Indexes for fast slug/hans/ja lookups. Built lazily on first call to createSubject. 126 + * We defer loading so that commands that don't need subject creation (gen-progress, 127 + * gen-licenses) don't pay the startup cost of reading the dictionary files. 
128 + */ 129 + let charBySlug: Record<string, Definition> | null = null 130 + let charByHans: Record<string, Definition> | null = null 131 + let charByJa: Record<string, Definition> | null = null 132 + let vocabBySlug: Record<string, Definition> | null = null 133 + let vocabByJa: Record<string, Definition> | null = null 134 + let audioMeta: Record<string, Record<string, Audio>> | null = null 135 + 136 + function initDicts( 137 + charDefs: Definition[], 138 + vocabDefs: Definition[], 139 + audioFiles: string[], 140 + ): void { 141 + if (charBySlug) return // already initialized 142 + charBySlug = Object.fromEntries(charDefs.map((d) => [d.hant, d])) 143 + charByHans = Object.fromEntries(charDefs.map((d) => [d.hans, d])) 144 + charByJa = Object.fromEntries(charDefs.filter((d) => d.ja).map((d) => [d.ja!, d])) 145 + vocabBySlug = Object.fromEntries(vocabDefs.map((d) => [d.hant, d])) 146 + vocabByJa = Object.fromEntries(vocabDefs.filter((d) => d.ja).map((d) => [d.ja!, d])) 147 + 148 + audioMeta = {} 149 + audioFiles.forEach((filename) => { 150 + // Filename format: {id}_{locale-hyphenated}_{voiceId}.mp3 151 + // Voice IDs use hyphens (not underscores), so splitting on _ is safe. 152 + const [idStr, localeHyphen, voiceId] = filename.replace('.mp3', '').split('_') 153 + const locale = localeHyphen?.replace('-', '_') 154 + if (!locale || !voiceId) return 155 + audioMeta![locale] ??= {} 156 + audioMeta![locale][idStr] = { url: filename, voiceId } 157 + }) 158 + } 159 + 160 + function getCharForLocale(targetLang: string, hans: string, hant: string, ja?: string): string { 161 + if (targetLang === 'ja') return ja || hant 162 + return targetLang === Locale.zh_CN ? hans : hant 163 + } 164 + 165 + /** 166 + * Creates a new Subject from dictionary and curriculum data. 167 + * Used when a slug has no existing entry in the output JSON. 
168 + * 169 + * @param charDefs - All character definitions (from lang/characters.tsv) 170 + * @param vocabDefs - All vocabulary definitions (from lang/vocabulary.tsv) 171 + * @param audioFiles - List of existing audio filenames (from listAudioFiles) 172 + */ 173 + export function createSubject( 174 + slug: string, 175 + level: number, 176 + position: number, 177 + targetLang: string, 178 + charMeanings: Record<string, string>, 179 + vocabMeanings: Record<string, string>, 180 + sentences: Sentences, 181 + charDefs: Definition[], 182 + vocabDefs: Definition[], 183 + audioFiles: string[], 184 + ): Subject { 185 + initDicts(charDefs, vocabDefs, audioFiles) 186 + 187 + const isVocab = slug.length > 1 188 + const dictEntry = isVocab 189 + ? (vocabBySlug![slug] || vocabByJa![slug]) 190 + : (charBySlug![slug] || charByHans![slug] || charByJa![slug]) 191 + 192 + if (!dictEntry) { 193 + console.error(`No valid dictionary entry for slug: ${slug}`) 194 + return { data: {} } as Subject 195 + } 196 + 197 + const { id, hans, hant, ja } = dictEntry 198 + const en = isVocab ? (vocabMeanings[id] || '') : (charMeanings[id] || '') 199 + const character = getCharForLocale(targetLang, hans, hant, ja) 200 + const charForSentences = targetLang === Locale.zh_CN ? hans : hant 201 + 202 + return { 203 + id, 204 + learnCards: ['meanings'], 205 + quizCards: ['meanings', 'readings'], 206 + data: { 207 + audios: [audioMeta![targetLang]?.[id]].filter((a): a is Audio => a != null), 208 + character, 209 + examples: ( 210 + charForSentences.length === 1 211 + ? (sentences.byChar.get(charForSentences) ?? []) 212 + : (sentences.byChar.get(charForSentences[0]) ?? 
[]).filter((key) => 213 + key.includes(charForSentences) 214 + ) 215 + ) 216 + .slice(0, 3) 217 + .map((value) => ({ value, translation: sentences.bySentence[value] })), 218 + level, 219 + meanings: en.split(';').map((def, i) => ({ 220 + value: def.trim(), 221 + isPrimary: i === 0, 222 + isAcceptedAnswer: true, 223 + })), 224 + position, 225 + readings: [], 226 + requiredSubjects: distinct( 227 + slug.split('').map((c) => charBySlug![c]?.id ?? ''), 228 + ).filter((reqId) => reqId && reqId !== charBySlug![slug]?.id), 229 + slug, 230 + srsId: level > 2 ? 1 : 2, 231 + type: isVocab ? Vocabulary : Character, 232 + }, 233 + } as Subject 234 + }
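Review note: the ordering documented for `writeSubjects` (level, then type, then position, with unplaced subjects last) can be sketched as a standalone comparator. This is an illustration under assumptions, not the repo's code: `SortKey` and `compareSubjects` are hypothetical names, the type strings assume `SubjectType` values are `'Radical'`, `'Character'`, and `'Vocabulary'` (as the `typePriority` keys in the diff suggest), and the sketch returns `0` when both subjects are unplaced, which is a variation chosen here so the comparator stays consistent.

```typescript
// Hypothetical flattened view of the fields the sort looks at.
type SortKey = { slug: string; type: string; level: number; position: number }

const typePriority: Record<string, number> = { Radical: 0, Character: 1, Vocabulary: 2 }

function compareSubjects(a: SortKey, b: SortKey): number {
  // Subjects without a level or position (e.g. hidden ones at 0/0) sort last.
  const aUnplaced = !a.level || !a.position
  const bUnplaced = !b.level || !b.position
  if (aUnplaced && bUnplaced) return 0
  if (aUnplaced) return 1
  if (bUnplaced) return -1
  return (a.level - b.level) ||
    ((typePriority[a.type] ?? 0) - (typePriority[b.type] ?? 0)) ||
    (a.position - b.position)
}

// Sample data, invented for illustration only.
const sampleSubjects: SortKey[] = [
  { slug: '好', type: 'Character', level: 1, position: 2 },
  { slug: '你好', type: 'Vocabulary', level: 1, position: 1 },
  { slug: '舊', type: 'Character', level: 0, position: 0 },
  { slug: '女', type: 'Radical', level: 1, position: 1 },
  { slug: '你', type: 'Character', level: 1, position: 1 },
  { slug: '學', type: 'Character', level: 2, position: 1 },
]
const sortedSlugs = [...sampleSubjects].sort(compareSubjects).map((s) => s.slug)
```

Within a level, radicals come before characters and characters before vocabulary, so a learner sees components before the items built from them.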
+26
data/deno.json
··· 1 + { 2 + "version": "0.0.1", 3 + "imports": { 4 + "@cliffy/ansi": "jsr:@cliffy/ansi@^1.0.0", 5 + "@cliffy/command": "jsr:@cliffy/command@^1.0.0", 6 + "@cliffy/flags": "jsr:@cliffy/flags@^1.0.0", 7 + "@cliffy/keycode": "jsr:@cliffy/keycode@^1.0.0", 8 + "@cliffy/keypress": "jsr:@cliffy/keypress@^1.0.0", 9 + "@cliffy/prompt": "jsr:@cliffy/prompt@^1.0.0", 10 + "@cliffy/table": "jsr:@cliffy/table@^1.0.0", 11 + "@std/collections": "jsr:@std/collections@^1.1.6", 12 + "@std/csv": "jsr:@std/csv@^1.0.6", 13 + "@std/dotenv": "jsr:@std/dotenv@^0.225.6", 14 + "@std/fs": "jsr:@std/fs@^1.0.23", 15 + "@std/io": "jsr:@std/io@^0.225.3", 16 + "@std/path": "jsr:@std/path@^1.1.4", 17 + "cc-cedict": "npm:cc-cedict@^1.1.1", 18 + "chinese-to-pinyin": "npm:chinese-to-pinyin@^1.3.1", 19 + "hanzi": "npm:hanzi@^2.1.5", 20 + "json-stringify-pretty-compact": "npm:json-stringify-pretty-compact@^4.0.0", 21 + "kuromoji": "npm:kuromoji@^0.1.2", 22 + "opencc-js": "npm:opencc-js@^1.0.5", 23 + "pinyin-to-zhuyin": "npm:pinyin-to-zhuyin@^1.0.3", 24 + "pinyin-tone-tool": "npm:pinyin-tone-tool@^1.0.5" 25 + } 26 + }
-1
data/other/licenses.tsv
··· 7 7 hanzi https://raw.githubusercontent.com/nieldlr/hanzi/master/LICENSE.txt 8 8 howler https://raw.githubusercontent.com/goldfire/howler.js/master/LICENSE.md 9 9 json-stringify-pretty-compact https://raw.githubusercontent.com/lydell/json-stringify-pretty-compact/main/LICENSE 10 - native-file-system-adapter https://raw.githubusercontent.com/jimmywarting/native-file-system-adapter/refs/heads/master/LICENSE 11 10 opencc-js https://raw.githubusercontent.com/nk2028/opencc-js/main/LICENSE 12 11 prettify-pinyin https://raw.githubusercontent.com/johnheroy/prettify-pinyin/master/README.md 13 12 tatoeba https://tatoeba.org/en/terms_of_use
-271
data/scripts/1_gen_dicts.ts
··· 1 - import { p2z } from 'pinyin-to-zhuyin' 2 - import pinyin from 'chinese-to-pinyin' 3 - import * as OpenCC from 'npm:opencc-js' 4 - import { toJyutping } from '$/utils/jyutping.ts' 5 - import { type Definition, readDict, readCsv, readJson, readTsv, writeTsv } from './shared/fs.ts' 6 - 7 - const characters = readDict('lang/characters.tsv') 8 - const vocabulary = readDict('lang/vocabulary.tsv') 9 - 10 - const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' }) 11 - 12 - const pinyinMap: Record<string, string> = {} 13 - readTsv('lang/zh_CN/source/standard.tsv') 14 - .forEach(({ hans, pinyin }) => pinyinMap[hans] = pinyin) 15 - 16 - const pinyinTwMap: Record<string, string> = {} 17 - const zhuyinMap: Record<string, string> = {} 18 - readCsv('lang/zh_TW/sources/詞語表202504.csv') 19 - .forEach(({ word: hant, bopomofo, pinyin }) => { 20 - pinyinTwMap[hant] = pinyin 21 - zhuyinMap[hant] = bopomofo 22 - }) 23 - 24 - const kanji = readJson('lang/ja/source/kanji.json') 25 - const jaVocab = readJson('lang/ja/source/vocab.json') 26 - 27 - /** 28 - * Use dictionaries to fill out data: 29 - * - userLang meanings (data/lang/en/{characters | vocabulary | radicals}.tsv) 30 - * - targetLang readings (data/lang/zh_CN/readings.tsv) 31 - */ 32 - 33 - await migrateIds() 34 - 35 - /** Fill out {characters | radicals | vocabulary}.tsv */ 36 - await updateMeanings('en') 37 - await updateMeanings('es') 38 - 39 - /** Populates zh_CN readings.tsv. Uses getPinyin to update the pinyin. 
*/ 40 - await updateReadings( 41 - 'lang/zh_CN/readings.tsv', 42 - ['id', 'pinyin'], 43 - (def) => ({ id: def.id, pinyin: getPinyin(def.hans), }) 44 - ) 45 - await updateReadings( 46 - 'lang/zh_HK/readings.tsv', 47 - ['id', 'jyutping'], 48 - ({ id, hant }) => ({ id: id, jyutping: toJyutping(hant) })) 49 - await updateReadings( 50 - 'lang/zh_TW/readings.tsv', 51 - ['id', 'pinyin', 'zhuyin'], 52 - ({ id, hant }) => ({ id, pinyin: getTwPinyin(hant), zhuyin: getZhuyin(hant) })) 53 - await ja() 54 - 55 - /** 56 - * One-time migration: rewrites user-lang meanings and hint files from legacy v2 numeric 57 - * ids to the new c-XXXXX / v-XXXXX string ids. Safe to re-run — skips files that are 58 - * already fully migrated. 59 - */ 60 - function migrateIds() { 61 - const charV2ToId = new Map<number, string>() 62 - characters.forEach((d) => { if (d.v2 != null) charV2ToId.set(d.v2, d.id) }) 63 - const vocabV2ToId = new Map<number, string>() 64 - vocabulary.forEach((d) => { if (d.v2 != null) vocabV2ToId.set(d.v2, d.id) }) 65 - 66 - const charFiles = [ 67 - { path: 'lang/en/characters.tsv', columns: ['id', 'value'] }, 68 - { path: 'lang/en/hints/meaning.characters.tsv', columns: ['id', 'hant', 'locale', 'en'] }, 69 - { path: 'lang/en/hints/reading.characters.tsv', columns: ['id', 'hant', 'locale', 'en'] }, 70 - { path: 'lang/es/characters.tsv', columns: ['id', 'value'] }, 71 - ] 72 - const vocabFiles = [ 73 - { path: 'lang/en/vocabulary.tsv', columns: ['id', 'value'] }, 74 - { path: 'lang/en/hints/meaning.vocabulary.tsv', columns: ['id', 'hant', 'locale', 'en'] }, 75 - { path: 'lang/en/hints/reading.vocabulary.tsv', columns: ['id', 'hant', 'locale', 'en'] }, 76 - { path: 'lang/es/vocabulary.tsv', columns: ['id', 'value'] }, 77 - ] 78 - 79 - for (const { path, columns } of charFiles) migrateFile(path, columns, charV2ToId) 80 - for (const { path, columns } of vocabFiles) migrateFile(path, columns, vocabV2ToId) 81 - 82 - // Add hant column to meanings files if not already present 83 
- const charIdToHant = new Map(characters.map((d) => [d.id, d.hant])) 84 - const vocabIdToHant = new Map(vocabulary.map((d) => [d.id, d.hant])) 85 - const radicalIdToHant = new Map( 86 - readTsv('lang/radicals.tsv').map((row) => [row.id as string, row.hant as string]), 87 - ) 88 - addHantColumn('lang/en/characters.tsv', charIdToHant) 89 - addHantColumn('lang/en/vocabulary.tsv', vocabIdToHant) 90 - addHantColumn('lang/en/radicals.tsv', radicalIdToHant) 91 - addHantColumn('lang/es/characters.tsv', charIdToHant) 92 - addHantColumn('lang/es/vocabulary.tsv', vocabIdToHant) 93 - addHantColumn('lang/es/radicals.tsv', radicalIdToHant) 94 - } 95 - 96 - function addHantColumn(filePath: string, idToHant: Map<string, string>) { 97 - const rows = readTsv(filePath) 98 - if (!rows.length || 'hant' in rows[0]) return // already has hant column 99 - const otherCols = Object.keys(rows[0]).filter((k) => k !== 'id') 100 - const newRows = rows.map((row) => ({ 101 - id: row.id, 102 - hant: idToHant.get(row.id as string) ?? 
'', 103 - ...Object.fromEntries(Object.entries(row).filter(([k]) => k !== 'id')), 104 - })) 105 - writeTsv(filePath, ['id', 'hant', ...otherCols], newRows) 106 - console.log(` ${filePath}: added hant column`) 107 - } 108 - 109 - function migrateFile(filePath: string, columns: string[], v2ToId: Map<number, string>) { 110 - const rows = readTsv(filePath) 111 - if (!rows.some((row) => !isNaN(Number(row.id)))) return // already migrated 112 - 113 - let migrated = 0 114 - const newRows = rows 115 - .map((row) => { 116 - const asNum = Number(row.id) 117 - if (isNaN(asNum)) return row // already a string id 118 - const newId = v2ToId.get(asNum) 119 - if (!newId) { 120 - console.warn(` Dropping orphaned v2 id ${row.id} in ${filePath}`) 121 - return null 122 - } 123 - migrated++ 124 - return { ...row, id: newId } 125 - }) 126 - .filter((row) => row != null) 127 - .sort((a, b) => (a.id as string).localeCompare(b.id as string)) 128 - 129 - console.log(` ${filePath}: migrated ${migrated} ids`) 130 - writeTsv(filePath, columns, newRows) 131 - } 132 - 133 - /** 134 - * Appends any characters/vocab from lang/characters.tsv and lang/vocabulary.tsv 135 - * that are not yet present in the meanings file. 136 - * Missing definitions get a `[todo: add definition]` placeholder. 137 - */ 138 - async function updateMeanings(userLang: string) { 139 - for (const type of ['characters', 'vocabulary'] as const) { 140 - const filePath = `lang/${userLang}/${type}.tsv` 141 - const defs: Definition[] = type === 'characters' ? 
characters : vocabulary 142 - 143 - const existing = new Map( 144 - readTsv(filePath).map((row) => [row.id as string, { hant: row.hant as string, value: row.value as string }]), 145 - ) 146 - 147 - let added = 0 148 - for (const def of defs) { 149 - if (!existing.has(def.id)) { 150 - existing.set(def.id, { hant: def.hant, value: '[todo: add definition]' }) 151 - added++ 152 - } 153 - } 154 - 155 - if (added > 0) { 156 - console.log(` ${userLang}/${type}.tsv: adding ${added} missing entries`) 157 - const rows = [...existing.entries()] 158 - .map(([id, { hant, value }]) => ({ id, hant, value })) 159 - .sort((a, b) => a.id.localeCompare(b.id)) 160 - writeTsv(filePath, ['id', 'hant', 'value'], rows) 161 - } 162 - } 163 - } 164 - 165 - /** 166 - * Reads an existing readings TSV, fills in any missing entries for all characters and 167 - * vocabulary using the provided compute function, then rewrites the file. 168 - * Existing rows are preserved as-is (manual overrides are safe). 169 - */ 170 - async function updateReadings( 171 - filePath: string, 172 - columns: string[], 173 - compute: (def: Definition) => Record<string, string>, 174 - ) { 175 - const existing = new Map<string, Record<string, unknown>>() 176 - readTsv(filePath).forEach((row) => existing.set(row.id as string, row)) 177 - 178 - let added = 0 179 - for (const def of [...characters, ...vocabulary]) { 180 - if (!existing.has(def.id)) { 181 - existing.set(def.id, compute(def)) 182 - added++ 183 - } 184 - } 185 - 186 - if (added > 0) { 187 - console.log(` ${filePath}: adding ${added} entries`) 188 - writeTsv(filePath, columns, [...existing.values()]) 189 - } 190 - } 191 - 192 - /** 193 - * Populates ja readings for characters and vocabulary. 
194 - */ 195 - async function ja() { 196 - // --- Characters --- 197 - type CharReading = { id: string; kunyomi: string; onyomi: string } 198 - const charReadings = new Map<string, CharReading>() 199 - 200 - readTsv('lang/ja/reading.characters.tsv') 201 - .forEach((row) => charReadings.set(row.id as string, row as unknown as CharReading)) 202 - 203 - for (const def of characters) { 204 - if (!def.ja || charReadings.has(def.id)) continue 205 - const data = getKanji(def.ja) 206 - if (!data) continue 207 - charReadings.set(def.id, { id: def.id, kunyomi: data.kun.join('; '), onyomi: data.on.join('; ') }) 208 - } 209 - 210 - writeTsv('lang/ja/reading.characters.tsv', ['id', 'kunyomi', 'onyomi'], [...charReadings.values()]) 211 - 212 - // --- Vocabulary --- 213 - const vocabReadings = new Map<string, { id: string; reading: string }>() 214 - 215 - readTsv('lang/ja/reading.vocabulary.tsv') 216 - .forEach((row) => vocabReadings.set(row.id as string, { id: row.id as string, reading: row.reading as string })) 217 - 218 - for (const def of vocabulary) { 219 - if (!def.ja || vocabReadings.has(def.id)) continue 220 - const data = getJaVocab(def.ja) 221 - if (!data) continue 222 - vocabReadings.set(def.id, { id: def.id, reading: data.reading }) 223 - } 224 - 225 - writeTsv('lang/ja/reading.vocabulary.tsv', ['id', 'reading'], [...vocabReadings.values()]) 226 - } 227 - 228 - function getPinyin(hans: string): string { 229 - if (pinyinMap[hans]) return pinyinMap[hans] 230 - return pinyin(hans) 231 - } 232 - 233 - function getTwPinyin(hant: string): string { 234 - if (pinyinTwMap[hant]) return pinyinTwMap[hant] 235 - return getPinyin(toCN(hant)) 236 - } 237 - 238 - function getZhuyin(hant: string): string { 239 - if (zhuyinMap[hant]) return zhuyinMap[hant] 240 - try { 241 - const pinyin = getTwPinyin(hant) 242 - // The p2z function expects input with tone marks or numbers 243 - const result = p2z(pinyin, { 244 - tonemarks: true, 245 - inputHasToneMarks: true, 246 - convertPunctuation: 
false 247 - }) 248 - if (!result || result.trim() === '') { 249 - throw new Error(`Empty result from conversion`) 250 - } 251 - return result 252 - } catch (error) { 253 - console.warn(`Failed to convert pinyin "${pinyin}" to zhuyin:`, error) 254 - return '' 255 - } 256 - } 257 - 258 - function getKanji(chars: string): { 259 - kanji: string 260 - meaning: string 261 - kun: string[] 262 - on: string[] 263 - } | undefined { 264 - return kanji[chars] 265 - } 266 - 267 - function getJaVocab(chars: string): { reading: string; meaning: string } | undefined { 268 - if (!jaVocab[chars]) return undefined 269 - const [reading, meaning] = jaVocab[chars] 270 - return { reading, meaning } 271 - }
-232
data/scripts/2_gen_audio.ts
··· 1 - /** 2 - * Script for generating audio files, via Azure 3 - * Requires an azure account and ffmpeg to be installed 4 - */ 5 - import { ensureDir } from '@std/fs' 6 - import { load } from '@std/dotenv' 7 - import { writeAll } from '@std/io' 8 - import { join } from '@std/path' 9 - import { Locale } from '$/enums.ts' 10 - import { Definition, listAudioFiles, readOrderedDefs } from './shared/fs.ts' 11 - 12 - const GEN_DIR = 'static/gen' 13 - const TEMP_DIR = join(GEN_DIR, 'audio', 'tmp') 14 - 15 - const voiceIds: Record<Locale, string> = { 16 - [Locale.zh_CN]: 'zh-CN-XiaoxiaoNeural', 17 - [Locale.zh_HK]: 'zh-HK-WanLungNeural', 18 - [Locale.zh_TW]: 'zh-TW-YunJheNeural', 19 - } 20 - 21 - const env = await load() 22 - 23 - const MAX_NOISE_LEVEL = -40 24 - const SILENCE_SPLIT = 1 25 - const DETECT_STR = `silencedetect=noise=${MAX_NOISE_LEVEL}dB:d=${SILENCE_SPLIT}` 26 - const MATCH_SILENCE = /silence_start: ([\w\.]+)[\s\S]+?silence_end: ([\w\.]+)/g 27 - 28 - const [locale, dict = 'character', limitStr] = Deno.args 29 - const limit = limitStr ? 
parseInt(limitStr) : 0 30 - 31 - if ( 32 - !locale || !['character', 'vocabulary'].includes(dict) || 33 - typeof limit != 'number' 34 - ) { 35 - throw new Error('invalid args') 36 - } 37 - 38 - await genAudio(locale as Locale) 39 - console.log('COMPLETE!') 40 - Deno.exit(0) 41 - 42 - async function genAudio(locale: Locale) { 43 - await ensureDir(TEMP_DIR) 44 - 45 - const ttsResults = await ttsAll( 46 - locale, 47 - await findMissingAudioFiles(locale), 48 - voiceIds[locale], 49 - ) 50 - 51 - console.log('source audio id: ', JSON.stringify(ttsResults)) 52 - 53 - for (const idx in ttsResults) { 54 - const { groupIndex, fileName, keys } = ttsResults[idx] 55 - if (!fileName) { 56 - console.warn(`Skipping group "${groupIndex}": no fileName`) 57 - continue 58 - } 59 - console.log('source audio saved: ', fileName) 60 - const tempAudio = join(TEMP_DIR, fileName) 61 - 62 - if (Object.keys(keys).length) { 63 - await writeAudioFiles(tempAudio, locale, keys) 64 - } 65 - } 66 - } 67 - 68 - async function findMissingAudioFiles(locale: Locale) { 69 - await ensureDir(join(GEN_DIR, 'audio', locale)) 70 - 71 - const exists = new Set([...listAudioFiles([locale])]) 72 - const ordered = readOrderedDefs(dict as 'character' | 'vocabulary') 73 - 74 - const missing = ordered.filter(({ id }) => !exists.has(getFilename(id, locale))) 75 - return limit ? 
missing.slice(0, limit) : missing 76 - } 77 - 78 - function ttsAll( 79 - locale: Locale, 80 - subjects: Definition[], 81 - voiceId: string, 82 - ): Promise<{ groupIndex: number; fileName: string | null; keys: string[] }[]> { 83 - const groups: Definition[][] = [] 84 - 85 - subjects.forEach((subject, index) => { 86 - const groupIndex = Math.floor(index / 100) 87 - if (!groups[groupIndex]) groups[groupIndex] = [] 88 - groups[groupIndex].push(subject) 89 - }) 90 - 91 - console.log( 92 - `About to request TTS for ${subjects.length} items, from ids ${ 93 - subjects[0]?.id 94 - } to ${subjects[subjects.length - 1]?.id}.`, 95 - ) 96 - const proceed = confirm('Do you want to proceed?') 97 - if (!proceed) { 98 - console.log('aborting...') 99 - Deno.exit(0) 100 - } else { 101 - console.log('processing...') 102 - } 103 - 104 - return Promise.all( 105 - groups.map(async (subjects, groupIndex) => { 106 - const texts = subjects.map((subject) => subject.hant) 107 - return { 108 - groupIndex, 109 - fileName: await ttsAzure(texts, voiceId, locale, groupIndex), 110 - keys: subjects.map((subject) => subject.id as string), 111 - } 112 - }), 113 - ) 114 - } 115 - 116 - async function ttsAzure( 117 - texts: string[], 118 - voiceId: string, 119 - locale: Locale, 120 - groupIndex: number, 121 - ): Promise<string | null> { 122 - const SILENCE_REQUEST = 2 123 - const region = env['AZURE_SPEECH_REGION'] 124 - const url = `https://${region}.tts.speech.microsoft.com/cognitiveservices/v1` 125 - 126 - const response = await fetch(url, { 127 - method: 'POST', 128 - headers: { 129 - 'Ocp-Apim-Subscription-Key': env['AZURE_SPEECH_KEY'], 130 - 'Content-Type': 'application/ssml+xml', 131 - 'X-Microsoft-OutputFormat': 'audio-16khz-128kbitrate-mono-mp3', 132 - 'User-Agent': 'curl', 133 - }, 134 - body: ` 135 - <speak version='1.0' xml:lang='${locale}'> 136 - <voice name='${voiceId}' xml:lang='${locale}'> 137 - <prosody rate="-20.00%"> 138 - ${texts.join(`, <break time="${SILENCE_REQUEST}s"/> `)} 139 - 
</prosody> 140 - </voice> 141 - </speak> 142 - `, 143 - }) 144 - if (response.status > 399) { 145 - console.warn(response) 146 - return null 147 - } 148 - const fileName = `${locale}_${groupIndex}.mp3` 149 - const filePath = join(TEMP_DIR, fileName) 150 - const file = await Deno.open(filePath, { create: true, write: true }) 151 - const arrayBuffer = new Uint8Array(await response.arrayBuffer()) 152 - await writeAll(file, arrayBuffer) 153 - return fileName 154 - } 155 - 156 - // /** 157 - // * 1. Splits a joined translation audio clip 158 - // * 2. Writes files for each translation, naming appropriately 159 - // */ 160 - async function writeAudioFiles( 161 - sourceURL: string, 162 - locale: Locale, 163 - keys: string[], 164 - ) { 165 - const audioDirLocation = join(GEN_DIR, 'audio', locale) 166 - 167 - try { 168 - await Deno.mkdir(audioDirLocation, { recursive: true }) 169 - } catch { /* Dir Exists */ } 170 - 171 - const detectSilence = new Deno.Command('ffmpeg', { 172 - stdout: 'piped', 173 - args: ['-i', sourceURL, '-af', DETECT_STR, '-f', 'null', '-'], 174 - }) 175 - 176 - const detectSilenceResult = (await detectSilence.output()).stderr 177 - const detectSilenceOutput = new TextDecoder().decode(detectSilenceResult) 178 - 179 - let match = MATCH_SILENCE.exec(detectSilenceOutput) 180 - let clipStartMS = 0 181 - let count = 0 182 - 183 - while (match) { 184 - const [_, nextSilenceStartS, nextSilenceEndS] = match 185 - const nextSilenceStartMS = Math.round(1000 * parseFloat(nextSilenceStartS)) 186 - 187 - // 0.1 is so we don't clip the beginning of the audio gen 188 - const nextSilenceEndMS = Math.round( 189 - 1000 * (parseFloat(nextSilenceEndS) - 0.1), 190 - ) 191 - 192 - const outFile = join(audioDirLocation, getFilename(keys[count], locale)) 193 - count = count + 1 194 - 195 - const seek = Math.max(0, clipStartMS) + 'ms' 196 - 197 - // 0.1 to maintain length after shifting nextSilenceEndMS 198 - const len = nextSilenceStartMS - (clipStartMS + 0.1) + 'ms' 199 - 200 
- const convert = new Deno.Command('ffmpeg', { 201 - stdout: 'piped', 202 - args: ['-ss', seek, '-t', len, '-i', sourceURL, '-c:a', 'copy', outFile], 203 - }) 204 - await convert.output() 205 - clipStartMS = nextSilenceEndMS 206 - match = MATCH_SILENCE.exec(detectSilenceOutput) 207 - } 208 - 209 - // last file 210 - if (!keys[count]) { 211 - console.warn(`Careful about mismatching: ${sourceURL}`) 212 - return 213 - } 214 - count = count + 1 215 - 216 - const outFile = join(audioDirLocation, getFilename(keys[count], locale)) 217 - const seek = Math.max(0, clipStartMS) + 'ms' 218 - const convert = new Deno.Command('ffmpeg', { 219 - stdout: 'piped', 220 - args: ['-ss', seek, '-i', sourceURL, '-c:a', 'copy', outFile], 221 - }) 222 - await convert.output() 223 - console.log(keys.length, count) 224 - } 225 - 226 - export function getFilename( 227 - id: string, 228 - locale: Locale, 229 - ) { 230 - const voiceId = voiceIds[locale] 231 - return `${id}_${locale.replace('_', '-')}_${voiceId}.mp3` 232 - }
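Review note: the removed `getFilename` and the parser added in `data/cli/utils/subjects.ts` both rely on the `{id}_{locale-hyphenated}_{voiceId}.mp3` naming scheme. The sketch below shows the round trip under that scheme. `audioFilename` and `parseAudioFilename` are hypothetical names, and the id `c-00001` is an invented sample in the `c-XXXXX` style mentioned in the migration comment. It assumes ids and voice ids never contain an underscore; if they did, the split would misassign the parts.

```typescript
// Hypothetical helper: builds a filename in the {id}_{locale}_{voiceId}.mp3 scheme.
function audioFilename(id: string, locale: string, voiceId: string): string {
  return `${id}_${locale.replace('_', '-')}_${voiceId}.mp3`
}

// Hypothetical helper: recovers the parts, or null if the name does not fit the scheme.
function parseAudioFilename(
  filename: string,
): { id: string; locale: string; voiceId: string } | null {
  const [id, localeHyphen, voiceId] = filename.replace('.mp3', '').split('_')
  // Locales are stored with a hyphen in filenames and an underscore elsewhere.
  const locale = localeHyphen?.replace('-', '_')
  if (!locale || !voiceId) return null
  return { id, locale, voiceId }
}

const sampleName = audioFilename('c-00001', 'zh_HK', 'zh-HK-WanLungNeural')
const sampleParts = parseAudioFilename(sampleName)
const badParts = parseAudioFilename('notes.mp3')
```

Returning `null` for names that do not match lets a caller skip stray files in the audio directory instead of indexing them under a wrong id.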
-204
data/scripts/3_update_app_data.ts
··· 1 - import { 2 - readAllLocaleHints, 3 - readCharacterOrder, 4 - readLessonOrder, 5 - readMeaningOverrides, 6 - readMeanings, 7 - readReadingsMap, 8 - readSubjectsMap, 9 - readTsv, 10 - writeSubjects, 11 - } from './shared/fs.ts' 12 - import { createSubject, loadSentences } from './shared/subject_utils.ts' 13 - import { SubjectType } from '$/enums.ts' 14 - import type { Subject } from '$/models/subjects.ts' 15 - 16 - const USER_LANGS = ['en', 'es'] 17 - const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja'] 18 - 19 - for (const userLang of USER_LANGS) { 20 - const charMeanings = readMeanings(`lang/${userLang}/characters.tsv`) 21 - const vocabMeanings = readMeanings(`lang/${userLang}/vocabulary.tsv`) 22 - const charMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.characters.tsv`) 23 - const charReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.character.tsv`) 24 - const vocabMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.vocabulary.tsv`) 25 - const vocabReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.vocabulary.tsv`) 26 - 27 - for (const targetLang of TARGET_LANGS) { 28 - console.log(`\nGenerating ${userLang}/${targetLang}`) 29 - const sentences = loadSentences(userLang, targetLang) 30 - console.log('loaded sentences') 31 - 32 - const characterOrder = readCharacterOrder(targetLang) 33 - const vocabularyOrder = readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`) 34 - const radicalOrder = readLessonOrder(`lang/${targetLang}/order/radicals.csv`) 35 - 36 - const outPath = `lang/${userLang}/${targetLang}.json` 37 - // Existing subjects keyed by slug — preserves hand-edited fields across runs 38 - const existingSubjects = readSubjectsMap(outPath) 39 - const updated: Set<string> = new Set() 40 - // Readings: ja splits by subject type; all other locales share one file 41 - const charReadingsMap = targetLang === 'ja' 42 - ? 
readReadingsMap('lang/ja/reading.characters.tsv') 43 - : readReadingsMap(`lang/${targetLang}/readings.tsv`) 44 - const vocabReadingsMap = targetLang === 'ja' 45 - ? readReadingsMap('lang/ja/reading.vocabulary.tsv') 46 - : readReadingsMap(`lang/${targetLang}/readings.tsv`) 47 - 48 - // Meaning overrides: locale-specific English meanings that replace the base meanings 49 - const charMeaningOverrides = readMeaningOverrides( 50 - `lang/${userLang}/meanings/${targetLang}.characters.tsv`, 51 - ) 52 - const vocabMeaningOverrides = readMeaningOverrides( 53 - `lang/${userLang}/meanings/${targetLang}.vocabulary.tsv`, 54 - ) 55 - 56 - // --- Characters --- 57 - console.log(` characters: ${characterOrder.length} levels`) 58 - characterOrder.forEach((slugs, index) => { 59 - const level = index + 1 60 - slugs.forEach((slug, posIndex) => { 61 - const position = posIndex + 1 62 - const subject = existingSubjects[slug] || 63 - createSubject(slug, level, position, targetLang, charMeanings, vocabMeanings, sentences) 64 - subject.data.level = level 65 - subject.data.position = position 66 - 67 - const readings = charReadingsMap[subject.id] 68 - if (readings?.length) subject.data.readings = readings 69 - else if (subject.data.type !== SubjectType.Radical) { 70 - console.warn(` No readings for character ${slug} (${subject.id}) in ${targetLang}`) 71 - } 72 - 73 - const override = charMeaningOverrides[subject.id] 74 - if (override) { 75 - subject.data.meanings = override.split(';').map((def, i) => ({ 76 - value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true, 77 - })) 78 - } 79 - 80 - if (charMeaningHints[subject.id]) subject.data.meaningHint = charMeaningHints[subject.id] 81 - if (charReadingHints[subject.id]) subject.data.readingHint = charReadingHints[subject.id] 82 - existingSubjects[slug] = subject 83 - updated.add(slug) 84 - }) 85 - }) 86 - 87 - // --- Vocabulary --- 88 - console.log(` vocabulary: ${vocabularyOrder.length} levels`) 89 - vocabularyOrder.forEach((slugs, index) 
=> { 90 - const level = index + 1 91 - slugs.forEach((slug, posIndex) => { 92 - const position = posIndex + 1 93 - const subject = existingSubjects[slug] || 94 - createSubject(slug, level, position, targetLang, charMeanings, vocabMeanings, sentences) 95 - subject.data.level = level 96 - subject.data.position = position 97 - 98 - const readings = vocabReadingsMap[subject.id] 99 - if (readings?.length) subject.data.readings = readings 100 - else console.warn(` No readings for vocabulary ${slug} (${subject.id}) in ${targetLang}`) 101 - 102 - const override = vocabMeaningOverrides[subject.id] 103 - if (override) { 104 - subject.data.meanings = override.split(';').map((def, i) => ({ 105 - value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true, 106 - })) 107 - } 108 - 109 - if (vocabMeaningHints[subject.id]) subject.data.meaningHint = vocabMeaningHints[subject.id] 110 - if (vocabReadingHints[subject.id]) subject.data.readingHint = vocabReadingHints[subject.id] 111 - existingSubjects[slug] = subject 112 - updated.add(slug) 113 - }) 114 - }) 115 - 116 - // --- Radicals --- 117 - console.log(` radicals: ${radicalOrder.length} levels`) 118 - buildRadicals(targetLang, userLang, radicalOrder, existingSubjects, updated) 119 - 120 - writeSubjects( 121 - outPath, 122 - Object.values(existingSubjects).map((subject) => { 123 - if (updated.has(subject.data.slug)) delete subject.hiddenAt 124 - else { 125 - subject.hiddenAt = subject.hiddenAt ?? 
new Date() 126 - subject.data.level = 0 127 - subject.data.position = 0 128 - } 129 - return subject 130 - }), 131 - ) 132 - } 133 - } 134 - 135 - function buildRadicals( 136 - targetLang: string, 137 - userLang: string, 138 - radicalOrder: string[][], 139 - existingSubjects: Record<string, Subject>, 140 - updated: Set<string>, 141 - ) { 142 - // Parse manually — the tsv has an optional trailing `alt` field that trips up the strict CSV parser 143 - const DATA_ROOT = './data/' 144 - const byHant: Record<string, { id: string; hant: string; hans: string }> = {} 145 - const byAlt: Record<string, { id: string; hant: string; hans: string }> = {} 146 - 147 - Deno.readTextFileSync(DATA_ROOT + 'lang/radicals.tsv') 148 - .split('\n') 149 - .slice(1) // skip header 150 - .filter((line) => line.trim()) 151 - .forEach((line) => { 152 - const [id, hant, hans, alt] = line.split('\t') 153 - const row = { id, hant, hans } 154 - byHant[hant] = row 155 - if (alt) { 156 - alt.split(';').map((a) => a.trim()).filter(Boolean).forEach((a) => { 157 - byAlt[a] = row 158 - }) 159 - } 160 - }) 161 - 162 - const nameById: Record<string, string> = {} 163 - readTsv(`lang/${userLang}/radicals.tsv`).forEach((row) => { 164 - nameById[row.id as string] = row.name as string 165 - }) 166 - 167 - radicalOrder.forEach((chars, levelIndex) => { 168 - const level = levelIndex + 1 169 - chars.forEach((char, posIndex) => { 170 - const ch = char.trim() 171 - if (!ch) return 172 - const row = byHant[ch] || byAlt[ch] 173 - if (!row) { 174 - console.warn(`No radical found for: ${ch}`) 175 - return 176 - } 177 - const { id, hant, hans } = row 178 - const name = nameById[id] || '' 179 - const slug = hant 180 - const existing = existingSubjects[slug] 181 - const character = targetLang === 'zh_CN' ? 
hans : hant 182 - 183 - existingSubjects[slug] = { 184 - ...(existing || {}), 185 - id, 186 - learnCards: ['meanings'], 187 - quizCards: ['meanings'], 188 - data: { 189 - ...(existing?.data || {}), 190 - character, 191 - level, 192 - meanings: name ? [{ value: name, isPrimary: true, isAcceptedAnswer: true }] : [], 193 - position: posIndex + 1, 194 - readings: [], 195 - requiredSubjects: [], 196 - slug, 197 - srsId: 2, 198 - type: SubjectType.Radical, 199 - }, 200 - } as Subject 201 - updated.add(slug) 202 - }) 203 - }) 204 - }
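The character and vocabulary loops in the removed `3_update_app_data.ts` both turn a semicolon-separated override cell into a list of meanings, with the first entry marked primary. A minimal standalone sketch of that step follows. The `Meaning` interface is a simplified stand-in (the real type lives in `$/models/subjects.ts` and may carry more fields), and `parseMeaningOverride` is a hypothetical helper name, since the script inlined this logic.

```typescript
// Simplified stand-in for the app's meaning type (assumption: the real
// model in $/models/subjects.ts may define additional fields).
interface Meaning {
  value: string
  isPrimary: boolean
  isAcceptedAnswer: boolean
}

// Hypothetical helper mirroring the inline `override.split(';').map(...)`:
// every definition is accepted, and only the first one is primary.
function parseMeaningOverride(override: string): Meaning[] {
  return override.split(';').map((def, i) => ({
    value: def.trim(),
    isPrimary: i === 0,
    isAcceptedAnswer: true,
  }))
}

const meanings = parseMeaningOverride('big; large ;great')
console.log(meanings.map((m) => m.value).join(' | '))
```

Pulling the split into one helper would also remove the duplication between the two loops, though that is a design suggestion rather than something the diff shows.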
-65
data/scripts/4_gen_progress.ts
···
- /**
-  * Generate json files that help users track progress.
-  * This includes:
-  * - HSK data
-  * - TOCFL data
-  * - Frequency data
-  *
-  * @todo
-  * All of these words should be accounted for in Hanzi Offline
-  * Therefore, use our dictionaries for hans/hant translation
-  */
- import stringify from 'json-stringify-pretty-compact'
- import * as OpenCC from 'opencc-js'
- import { readTsv } from './shared/fs.ts'
-
- const toSimplified = OpenCC.Converter({ from: 'hk', to: 'cn' })
- const toTraditional = OpenCC.Converter({ from: 'cn', to: 'hk' })
-
- Deno.writeTextFileSync(
-   'www/static/gen/progress/hsk.json',
-   stringify(
-     readTsv('lang/zh_CN/progress/hsk.tsv')
-       .map((data) => ({
-         level: Number(data.band),
-         id: Number(data.no),
-         simplified: data.hans,
-         traditional: toTraditional(data.hans),
-       })),
-   ),
- )
-
- Deno.writeTextFileSync(
-   'www/static/gen/progress/tocfl.json',
-   stringify(
-     readTsv('lang/zh_TW/progress/tocfl.tsv')
-       .map((data) => ({
-         level: Number(data.level),
-         id: Number(data.id),
-         simplified: toSimplified(data.hant),
-         traditional: data.hant,
-       })),
-   ),
- )
-
- Deno.writeTextFileSync(
-   'www/static/gen/progress/jlpt-kanji.json',
-   stringify(
-     readTsv('lang/ja/progress/jlpt-kanji.tsv')
-       .map((data) => ({
-         level: Number(data.level),
-         kanji: data.kanji,
-       })),
-   ),
- )
-
- Deno.writeTextFileSync(
-   'www/static/gen/progress/jlpt-vocab.json',
-   stringify(
-     readTsv('lang/ja/progress/jlpt-vocab.tsv')
-       .map((data) => ({
-         level: Number(data.level),
-         chars: data.chars,
-       })),
-   ),
- )
-13
data/scripts/5_gen_licenses.ts
···
- /**
-  * Write license texts
-  */
- import { readTsv, writeJSON } from './shared/fs.ts'
- const licenseList = readTsv('other/licenses.tsv')
-
- const licenses = licenseList
-   .map(async ({ name, href }) => {
-     const text = await (await fetch(href)).text()
-     return { name, href, text }
-   })
-
- writeJSON('licenses.json', await Promise.all(licenses))
-42
data/scripts/metadata/hsk.js
···
- /**
-  * Formats the data from Hacking Chinese's "Missing HSK" csv list.
-  * Adds HK Traditional script for reference, but simplified is the "key" here.
-  * @reference https://www.hackingchinese.com/what-important-words-are-missing-from-hsk/
-  * @reference https://www.hackingchinese.com/wp-content/uploads/2020/07/hacking-chinese_missing-hsk-words.csv
-  */
- import { parse } from 'jsr:@std/csv/parse'
- import { stringify } from 'jsr:@std/csv/stringify'
- import * as OpenCC from 'npm:opencc-js'
-
- const converter = OpenCC.Converter({ from: 'cn', to: 'hk' })
-
- const rows = parse(
-   Deno.readTextFileSync('./hacking-chinese_missing-hsk-words.csv'),
- )
-
- const results = []
- rows.forEach(([hsk3, hsk4, hsk5, hsk6]) => {
-   if (hsk3) results.push(translate(hsk3, 3))
-   if (hsk4) results.push(translate(hsk4, 4))
-   if (hsk5) results.push(translate(hsk5, 5))
-   if (hsk6) results.push(translate(hsk6, 6))
- })
-
- function translate(char, band) {
-   const simplified = char
-   const traditional = converter(char)
-   return { simplified, traditional, band }
- }
-
- Deno.writeTextFileSync(
-   'hsk_missing.tsv',
-   stringify(
-     results
-       .sort((a, b) => a.band - b.band)
-       .filter((i) => !i.simplified.includes('HSK')),
-     {
-       columns: ['band', 'simplified', 'traditional'],
-       separator: '\t',
-     },
-   ),
- )
-149
data/scripts/metadata/sentences.ts
··· 1 - /** 2 - * Parse and format sentence data from: 3 - * @reference https://tatoeba.org 4 - * 5 - * Mapping to Mandarin Chinese (assume zh_CN) and Cantonese (assume zh_HK) 6 - * Map all data from `en`, because there is much more en <-> hk/cn translation 7 - * mappings than there is hk <-> cn 8 - * 9 - * Try to use sentences where there is a unique 1:1 character association. 10 - */ 11 - import { parse } from 'jsr:@std/csv/parse' 12 - import { stringify } from 'jsr:@std/csv/stringify' 13 - import * as OpenCC from 'npm:opencc-js' 14 - import { readCharacterOrder, readDict } from '../shared/fs.ts' 15 - 16 - const charDefs = readDict('lang/characters.tsv') 17 - const hantToHans: Record<string, string> = Object.fromEntries(charDefs.map((d) => [d.hant, d.hans])) 18 - 19 - function buildPositionMap(locale: string): Map<string, number> { 20 - const map = new Map<string, number>() 21 - let pos = 0 22 - for (const level of readCharacterOrder(locale)) { 23 - for (const hant of level) { 24 - map.set(hant, pos) 25 - const hans = hantToHans[hant] 26 - if (hans && hans !== hant) map.set(hans, pos) 27 - pos++ 28 - } 29 - } 30 - return map 31 - } 32 - 33 - function sentenceSimplicity(sentence: string, positionMap: Map<string, number>, maxPos: number): number { 34 - let score = 0 35 - for (const char of sentence) { 36 - score += positionMap.get(char) ?? 
maxPos 37 - } 38 - score *= Math.min(1, sentence.length / 1.3) 39 - return score 40 - } 41 - 42 - interface Sentence { 43 - id: number 44 - value: string 45 - enId: number 46 - en: string 47 - } 48 - 49 - const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' }) 50 - const toTW = OpenCC.Converter({ from: 'cn', to: 'tw' }) 51 - 52 - const USER_LANGS = ['en', 'es'] 53 - const str: Record<string, string> = { 54 - es: 'Spanish', 55 - en: 'English', 56 - hk: 'Cantonese', 57 - cn: 'Mandarin Chinese', 58 - ja: 'Japanese' 59 - } 60 - const date = '2026-03-13' 61 - for (const langCode of USER_LANGS) { 62 - formatLang(langCode) 63 - } 64 - 65 - function formatLang(langCode: string) { 66 - const nameCn = `Sentence pairs in ${str[langCode]}-${str.cn} - ${date}.tsv` 67 - const nameHk = `Sentence pairs in ${str[langCode]}-${str.hk} - ${date}.tsv` 68 - const nameJa = `Sentence pairs in ${str[langCode]}-${str.ja} - ${date}.tsv` 69 - console.log("Updating Sentence pairs:") 70 - console.log(nameCn) 71 - console.log(nameHk) 72 - console.log(nameJa) 73 - 74 - const cnPath = `./data/lang/${langCode}/sources/${nameCn}` 75 - const hkPath = `./data/lang/${langCode}/sources/${nameHk}` 76 - const jaPath = `./data/lang/${langCode}/sources/${nameJa}` 77 - 78 - let cnText = '' 79 - let hkText = '' 80 - let jaText = '' 81 - try { 82 - cnText = Deno.readTextFileSync(cnPath) 83 - hkText = Deno.readTextFileSync(hkPath) 84 - jaText = Deno.readTextFileSync(jaPath) 85 - } catch { 86 - return // source files not present, skip 87 - } 88 - if (!hkText.trim() || !cnText.trim() || !jaText.trim()) return // empty stubs, skip 89 - 90 - const hkSentences = parse(hkText, { separator: '\t', lazyQuotes: true }) 91 - .map(([enIdStr, en, idStr, hk]) => { 92 - const enId = parseInt(enIdStr) 93 - const id = parseInt(idStr) 94 - if (!enId || !id || !en || !hk || hk.length > 15) return null 95 - return { id, value: hk, enId, en } 96 - }).filter((row) => row != null) 97 - 98 - const twSentences: Sentence[] = [] 99 - 
const cnSentences: Sentence[] = [] 100 - parse(cnText, { separator: '\t', lazyQuotes: true }) 101 - .forEach(([enIdStr, en, idStr, zh]) => { 102 - const enId = parseInt(enIdStr) 103 - const id = parseInt(idStr) 104 - if (!enId || !id || !en || !zh || zh.length > 15) return null 105 - cnSentences.push({ id, value: toCN(zh), enId, en }) 106 - twSentences.push({ id, value: toTW(zh), enId, en }) 107 - }) 108 - 109 - const jaSentences = parse(jaText, { separator: '\t', lazyQuotes: true }) 110 - .map(([enIdStr, en, idStr, ja]) => { 111 - const enId = parseInt(enIdStr) 112 - const id = parseInt(idStr) 113 - if (!enId || !id || !en || !ja || ja.length > 15) return null 114 - return { id, value: ja, enId, en } 115 - }).filter((row) => row != null) 116 - 117 - writeFile(langCode, 'ja', processSentences(jaSentences, buildPositionMap('ja'))) 118 - writeFile(langCode, 'zh_CN', processSentences(cnSentences, buildPositionMap('zh_CN'))) 119 - writeFile(langCode, 'zh_HK', processSentences(hkSentences, buildPositionMap('zh_HK'))) 120 - writeFile(langCode, 'zh_TW', processSentences(twSentences, buildPositionMap('zh_TW'))) 121 - } 122 - 123 - function writeFile(langCode: string, locale: string, sentences: Sentence[]) { 124 - const columns = ['id', 'value', 'enId', 'en'] 125 - Deno.mkdirSync(`./data/lang/${langCode}/sentences`, { recursive: true }) 126 - Deno.writeTextFileSync( 127 - `./data/lang/${langCode}/sentences/${locale}.tsv`, 128 - stringify(sentences, { columns, separator: '\t' }), 129 - ) 130 - } 131 - 132 - /** 133 - * Dedupes, and sorts by simplicity 134 - * @todo ignore punctiation. 。,!,? 
135 - */ 136 - function processSentences(sentences: Sentence[], positionMap: Map<string, number>): Sentence[] { 137 - const existing: Record<string, Sentence> = {} 138 - sentences.forEach((sentence) => { 139 - if ( 140 - !existing[sentence.value] || 141 - (existing[sentence.value].id < sentence.id) 142 - ) { 143 - existing[sentence.value] = sentence 144 - } 145 - }) 146 - const maxPos = positionMap.size 147 - return Object.values(existing) 148 - .sort((a, b) => sentenceSimplicity(a.value, positionMap, maxPos) - sentenceSimplicity(b.value, positionMap, maxPos)) 149 - }
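The removed `sentences.ts` deduplicates sentences by text (keeping the highest Tatoeba id) and then sorts them so that sentences built from early-curriculum characters come first. The sketch below reproduces that scoring with in-memory data only. The `Sentence` shape is trimmed to the two fields the logic needs, the sample position map is invented for illustration, and a `Map` replaces the original `Record` for deduplication.

```typescript
// Trimmed sentence shape (assumption: the original also carried `enId` and `en`).
interface Sentence {
  id: number
  value: string
}

// Lower score = simpler. Each character costs its curriculum position;
// characters missing from the curriculum cost `maxPos`. The length factor
// only dampens single-character sentences (1 / 1.3 < 1).
function sentenceSimplicity(sentence: string, positionMap: Map<string, number>, maxPos: number): number {
  let score = 0
  for (const char of sentence) score += positionMap.get(char) ?? maxPos
  return score * Math.min(1, sentence.length / 1.3)
}

// Dedupe by sentence text (highest id wins), then sort simplest first.
function processSentences(sentences: Sentence[], positionMap: Map<string, number>): Sentence[] {
  const byValue = new Map<string, Sentence>()
  for (const s of sentences) {
    const prev = byValue.get(s.value)
    if (!prev || prev.id < s.id) byValue.set(s.value, s)
  }
  const maxPos = positionMap.size
  return Array.from(byValue.values()).sort(
    (a, b) =>
      sentenceSimplicity(a.value, positionMap, maxPos) -
      sentenceSimplicity(b.value, positionMap, maxPos),
  )
}

// Invented sample curriculum: 我 is taught first, then 好, then 你.
const positions = new Map<string, number>([['我', 0], ['好', 1], ['你', 2]])
const sorted = processSentences(
  [
    { id: 1, value: '你好' },
    { id: 2, value: '我好' },
    { id: 3, value: '你好' }, // same text as id 1, higher id wins
    { id: 4, value: '龍好' }, // 龍 is not in the curriculum, so it costs maxPos
  ],
  positions,
)
console.log(sorted.map((s) => `${s.id}:${s.value}`).join(' '))
```

Note that punctuation is scored like any other unknown character here, which matches the open `@todo` in the removed file.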
-57
data/scripts/metadata/tocfl.js
··· 1 - /** 2 - * Mostly take straight from naer source. Trim things like pronunciation, since 3 - * we do these separately from the word list. I think the specific link breaks, 4 - * because it relies on release date. So go to the main page, find the `詞語表` 5 - * (三等七級詞語表) link for the .xlsx file, and replace the extension with `csv`. 6 - * 7 - * @reference https://coct.naer.edu.tw 8 - * @reference https://coct.naer.edu.tw/file/files/詞語表202504.csv 9 - */ 10 - import { parse } from 'jsr:@std/csv/parse' 11 - import { stringify } from 'jsr:@std/csv/stringify' 12 - import * as OpenCC from 'npm:opencc-js' 13 - 14 - const tocflDataFile = Deno.readTextFileSync('./詞語表202504.csv') 15 - const tocflRows = parse(tocflDataFile, { skipFirstRow: true }).map((row) => ({ 16 - id: row.id, 17 - traditional: row.word, 18 - level: (row.ji === '第1級') ? 0 : row.ji.match(/(\d)/)[0], 19 - category: row.situation, 20 - wfreq: row.wfreq, 21 - sfreq: row.sfreq, 22 - })) 23 - 24 - const converter = OpenCC.Converter({ from: 'tw', to: 'cn' }) 25 - 26 - const results = [] 27 - parse(Deno.readTextFileSync('./hacking-chinese_missing-tocfl-words.csv')) 28 - .forEach(([tocfl1, tocfl2, tocfl3, tocfl4, tocfl5]) => { 29 - if (tocfl1) results.push(translate(tocfl1, 1)) 30 - if (tocfl2) results.push(translate(tocfl2, 2)) 31 - if (tocfl3) results.push(translate(tocfl3, 3)) 32 - if (tocfl4) results.push(translate(tocfl4, 4)) 33 - if (tocfl5) results.push(translate(tocfl5, 5)) 34 - }) 35 - 36 - function translate(char, level) { 37 - return { simplified: converter(char), traditional: char, level } 38 - } 39 - 40 - const columns = ['id', 'level', 'category', 'traditional', 'wfreq', 'sfreq'] 41 - Deno.writeTextFileSync('tocfl.tsv', stringify( 42 - tocflRows.sort((a, b) => a.band - b.band), 43 - { columns, separator: '\t' }, 44 - )) 45 - 46 - Deno.writeTextFileSync( 47 - 'tocfl_missing.tsv', 48 - stringify( 49 - results 50 - .sort((a, b) => a.level - b.level) 51 - .filter((i) => !i.simplified.includes('TOCFL')), 
52 - { 53 - columns: ['level', 'simplified', 'traditional'], 54 - separator: '\t', 55 - }, 56 - ), 57 - )
-126
data/scripts/ordering/generate_words.js
··· 1 - import * as OpenCC from 'npm:opencc-js' 2 - import { 3 - readCharacterOrder, 4 - readLessonOrder, 5 - readTsv, 6 - } from '../shared/fs.ts' 7 - 8 - const toTraditional = OpenCC.Converter({ from: 'cn', to: 'hk' }) 9 - 10 - const charLevels = [new Set()] 11 - const totalChars = new Set() 12 - readCharacterOrder('order/zh_CN.characters.txt').forEach((row, index) => { 13 - row.forEach((char) => totalChars.add(char)) 14 - charLevels[index + 1] = new Set([...totalChars]) 15 - }) 16 - 17 - const sortedVocab = [] 18 - 19 - const hskToLevel = { 20 - [1]: 15, 21 - [2]: 20, 22 - [3]: 30, 23 - [4]: 40, 24 - [5]: 60, 25 - [6]: 60, 26 - } 27 - 28 - const tocflToLevel = { 29 - [1]: 20, 30 - [2]: 30, 31 - [3]: 45, 32 - [4]: 60, 33 - [5]: 60, 34 - } 35 - 36 - function deriveLevel(system, wordLevelStr, foundLevel) { 37 - const wordLevel = Number(wordLevelStr) 38 - if (!wordLevel || !foundLevel) return 0 39 - if (system === 'hsk') { 40 - if (foundLevel > hskToLevel[wordLevel]) return hskToLevel[wordLevel] 41 - return foundLevel 42 - } 43 - if (system === 'tocfl') { 44 - if (foundLevel > tocflToLevel[wordLevel]) return tocflToLevel[wordLevel] 45 - return foundLevel 46 - } 47 - } 48 - 49 - const allWords = new Set() 50 - 51 - readTsv('sources/hsk.tsv') 52 - .concat(readTsv('sources/hsk_missing.tsv')) 53 - .filter((item) => { 54 - if (!item) return false 55 - const traditional = toTraditional(item.simplified) 56 - if (allWords.has(traditional)) return false 57 - else allWords.add(traditional) 58 - return item.simplified.split('').length > 1 59 - }) 60 - .forEach((item) => { 61 - const traditional = toTraditional(item.simplified) 62 - const level = deriveLevel( 63 - 'hsk', 64 - item.band, 65 - charLevels.findIndex((level) => 66 - traditional.split('').every((char) => level.has(char)) 67 - ), 68 - ) 69 - if (level === -1) { 70 - console.warn('no level for word:', traditional) 71 - allWords.delete(traditional) 72 - } else { 73 - if (!sortedVocab[level]) sortedVocab[level] = [] 74 - 
sortedVocab[level].push(traditional) 75 - } 76 - }) 77 - 78 - readTsv('sources/tocfl.tsv') 79 - .concat(readTsv('sources/tocfl_missing.tsv')) 80 - .filter((item) => { 81 - if (!item) return false 82 - if (item.level > 4) return false 83 - if (allWords.has(item.traditional)) return false 84 - else allWords.add(item.traditional) 85 - return item.traditional.split('').length > 1 86 - }) 87 - .forEach((item) => { 88 - const traditional = item.traditional 89 - const level = deriveLevel( 90 - 'tocfl', 91 - item.level, 92 - charLevels.findIndex((level) => 93 - traditional.split('').every((char) => level.has(char)) 94 - ), 95 - ) 96 - if (level === -1) { 97 - console.warn('no level for word:', traditional) 98 - allWords.delete(traditional) 99 - } else { 100 - if (!sortedVocab[level]) sortedVocab[level] = [] 101 - sortedVocab[level].push(traditional) 102 - } 103 - }) 104 - 105 - const lessons = readLessonOrder('order/vocabulary.csv') 106 - allWords.clear() 107 - 108 - for (let i = 0; i < 60; i++) { 109 - const nextLesson = [] 110 - const lesson = lessons[i] || [] 111 - const vocab = sortedVocab[i + 1] || [] 112 - 113 - lesson.forEach((vocab) => { 114 - nextLesson.push(vocab) 115 - allWords.add(vocab) 116 - }) 117 - vocab.forEach((vocab) => { 118 - if (!allWords.has(vocab)) { 119 - nextLesson.push(vocab) 120 - } 121 - allWords.add(vocab) 122 - }) 123 - lessons[i] = nextLesson.join(',') 124 - } 125 - 126 - console.log(lessons.join('\n'))
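The removed `generate_words.js` places each word at the first level where all of its characters have already been taught, then caps that level with a per-band ceiling so that low-band words are not pushed too late. A small sketch of that placement rule is below. The `hskToLevel` table is copied from the removed file; `buildCharLevels`, `firstLevelFor`, and `capLevel` are hypothetical names for logic the script inlined, and the three-level character order is invented sample data.

```typescript
// HSK band -> latest allowed app level (values taken from the removed script).
const hskToLevel: Record<number, number> = { 1: 15, 2: 20, 3: 30, 4: 40, 5: 60, 6: 60 }

// levels[n] holds every character taught up to and including level n.
// levels[0] is empty so that array indexes line up with 1-based levels.
function buildCharLevels(order: string[][]): Set<string>[] {
  const levels: Set<string>[] = [new Set<string>()]
  const seen = new Set<string>()
  order.forEach((row, index) => {
    row.forEach((ch) => seen.add(ch))
    levels[index + 1] = new Set<string>(seen)
  })
  return levels
}

// First level at which every character of `word` is known, or -1 if never.
function firstLevelFor(word: string, levels: Set<string>[]): number {
  return levels.findIndex((level) => word.split('').every((ch) => level.has(ch)))
}

// Pull a word forward to its band ceiling if its characters arrive later.
function capLevel(foundLevel: number, ceiling: number): number {
  return foundLevel > ceiling ? ceiling : foundLevel
}

// Invented sample order: two characters per level.
const levels = buildCharLevels([['你', '好'], ['我', '們'], ['朋', '友']])
console.log(firstLevelFor('你們', levels)) // needs 們, which arrives at level 2
console.log(capLevel(25, hskToLevel[2])) // an HSK 2 word found at level 25 is capped
```

A `-1` result passes through `capLevel` unchanged, which is what lets the original script detect and warn about words whose characters are missing from the curriculum.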
-181
data/scripts/ordering/group_subjects.js
··· 1 - import hanzi from 'npm:hanzi' 2 - import * as OpenCC from 'npm:opencc-js' 3 - import { readTsv } from '../shared/fs.ts' 4 - 5 - hanzi.start() 6 - 7 - const toTraditional = OpenCC.Converter({ from: 'cn', to: 'hk' }) 8 - 9 - /** 10 - * Creates general groups of characters to help sort. 11 - * 12 - * Novice - lvls 1-10 should have all of these chars and words 13 - * Beginner - lvls 11-30 should have all of these chars and words 14 - * Intermediate - lvl 31-60 should have all of these chars and words 15 - * Advanced - unslotted as for now. 16 - */ 17 - const Novice = 1 18 - const Beginner = 2 19 - const Intermediate = 3 20 - 21 - let all = [] 22 - 23 - function push(lvl, char, toTranslate) { 24 - const chars = toTranslate ? toTraditional(char) : char 25 - if (chars) all.push([lvl, chars]) 26 - if (chars.length > 1) { 27 - chars 28 - .replace(/[() /〇|12¹²…)(\n]/g, '') 29 - .split('') 30 - .filter((c) => c.length > 0) 31 - .forEach((char) => all.push([lvl, char])) 32 - } 33 - } 34 - 35 - readTsv('sources/hsk.tsv') 36 - .concat(readTsv('sources/hsk_missing.tsv')) 37 - .forEach((item) => { 38 - const band = Number(item.band) 39 - if (band <= 2) push(Novice, item.simplified, true) 40 - if (band <= 3) push(Beginner, item.simplified, true) 41 - else if (band <= 6) push(Intermediate, item.simplified, true) 42 - }) 43 - 44 - readTsv('sources/tocfl.tsv') 45 - .concat(readTsv('sources/tocfl_missing.tsv')) 46 - .forEach((item) => { 47 - const level = Number(item.level) 48 - if (level <= 1) push(Novice, item.traditional, false) 49 - else if (level <= 2) push(Beginner, item.traditional, false) 50 - else if (level <= 4) push(Intermediate, item.traditional, false) 51 - }) 52 - 53 - all.sort((a, b) => a.level - b.level) 54 - 55 - const included = new Set() 56 - // Dedupe, keeping lower level 57 - all = all.filter(([_level, char]) => { 58 - if (included.has(char)) return false 59 - included.add(char) 60 - return true 61 - }) 62 - 63 - const chars = { 64 - [Novice]: '', 65 - 
[Beginner]: '', 66 - [Intermediate]: '', 67 - } 68 - 69 - const vocab = { 70 - [Novice]: '', 71 - [Beginner]: '', 72 - [Intermediate]: '', 73 - } 74 - 75 - all.forEach(([level, char], i) => { 76 - if (char.split('').length > 1) { 77 - if (i > 0) vocab[level] += ', ' 78 - vocab[level] += char 79 - } else { 80 - chars[level] += char 81 - } 82 - }) 83 - 84 - const noviceLvls = [ 85 - 18, 86 - 36, 87 - 30, 88 - 39, 89 - 43, 90 - 41, 91 - 33, 92 - 32, 93 - 37, 94 - 39, 95 - 38, 96 - 39, 97 - 36, 98 - 30, 99 - 38, 100 - 33, 101 - 37, 102 - 31, 103 - 33, 104 - 33, 105 - ] 106 - const beginnerLvls = [33, 33, 33, 31, 35, 33, 32, 35, 33, 33] 107 - const intermediateLvls = [ 108 - 36, 109 - 34, 110 - 34, 111 - 35, 112 - 33, 113 - 34, 114 - 34, 115 - 35, 116 - 36, 117 - 36, 118 - 34, 119 - 34, 120 - 33, 121 - 33, 122 - 36, 123 - 37, 124 - 37, 125 - 37, 126 - 35, 127 - 35, 128 - 36, 129 - 35, 130 - 35, 131 - 35, 132 - 36, 133 - 35, 134 - 33, 135 - 34, 136 - 36, 137 - 34, 138 - ] 139 - 140 - Deno.writeTextFileSync( 141 - './groups.gen.txt', 142 - ` 143 - Characters\n===\n\n 144 - Novice\n===\n${formatChars(chars[1], noviceLvls)}\n 145 - Beginner\n===\n${formatChars(chars[2], beginnerLvls)}\n 146 - Intermediate\n===\n${formatChars(chars[3], intermediateLvls)}\n 147 - 148 - Words\n===\n\n 149 - Novice\n===\n${vocab[1]}\n 150 - Beginner\n===\n${vocab[2]}\n 151 - Intermediate\n===\n${vocab[3]}\n 152 - `, 153 - ) 154 - 155 - function formatChars(chars, lvls) { 156 - const results = [] 157 - let count = 0 158 - 159 - chars.split('').sort((a, b) => { 160 - const compsA = hanzi.decompose(a, 3).components.length 161 - const compsB = hanzi.decompose(b, 3).components.length 162 - if (compsA - compsB) return compsA - compsB 163 - 164 - const numA = hanzi.getCharactersWithComponent(a).length 165 - const numB = hanzi.getCharactersWithComponent(b).length 166 - if (numB - numA) return numB - numA 167 - 168 - const freqA = hanzi.getCharacterFrequency(a)?.number || Infinity 169 - const freqB = 
hanzi.getCharacterFrequency(b)?.number || Infinity 170 - if (freqB - freqA) return freqA - freqB 171 - }).forEach((char) => { 172 - results.push(char) 173 - count++ 174 - if (count == lvls[0]) { 175 - count = 0 176 - lvls.shift() 177 - results.push('\n') 178 - } 179 - }) 180 - return results.join('') 181 - }
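`formatChars` in the removed `group_subjects.js` walks the sorted characters and starts a new row each time the running count reaches the next entry of a level-size list such as `noviceLvls`. The sketch below shows the same chunking as a pure function. It is a non-mutating variant written for illustration (the removed code shifted entries off the `lvls` array as it went), and `chunkBySizes` is a hypothetical name.

```typescript
// Split `chars` into consecutive rows whose lengths follow `sizes`.
// Assumption for this sketch: anything left over after the last size
// goes into one trailing row, which matches how the original loop
// stopped inserting line breaks once its size list ran out.
function chunkBySizes(chars: string[], sizes: number[]): string[][] {
  const rows: string[][] = []
  let start = 0
  for (const size of sizes) {
    if (start >= chars.length) break
    rows.push(chars.slice(start, start + size))
    start += size
  }
  if (start < chars.length) rows.push(chars.slice(start))
  return rows
}

const chunks = chunkBySizes(['一', '二', '三', '四', '五', '六'], [2, 3])
console.log(chunks.map((row) => row.join('')).join('\n'))
```

Keeping the size list untouched means the same constant can be reused across calls, which the shifting version did not allow.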
-113
data/scripts/ordering/sort_dicts.ts
··· 1 - /** 2 - * This script is for sorting dict files, for ease of manual update. 3 - * By default, we should sort by id; this is the default state, and 4 - * should be the state of these files when we commit. 5 - * 6 - * Otherwise, we accept a --lang={targetLang} (ja | zh_CN | zh_HK | zh_TW). 7 - * This property will sort the userLang dicts in order of the targetLang order. 8 - * 9 - * All files are optionally available. 10 - * The files to sort in data/lang/{userLang} are: 11 - * - /characters.tsv 12 - * - /radicals.tsv 13 - * - /vocabulary.tsv 14 - * - /hints/meaning.characters.tsv 15 - * - /hints/meaning.vocabulary.tsv 16 - * - /hints/reading.characters.tsv 17 - * - /hints/reading.vocabulary.tsv 18 - * - /meanings/{targetLang}.characters.tsv 19 - * - /meanings/{targetLang}.vocabulary.tsv 20 - * 21 - * @example deno run data/scripts/ordering/sort_dicts.ts 22 - * @example deno run data/scripts/ordering/sort_dicts.ts --lang=zh_CN 23 - */ 24 - 25 - import { 26 - readCharacterOrder, 27 - readDictByHant, 28 - readLessonOrder, 29 - readTsv, 30 - writeTsv, 31 - } from '../shared/fs.ts' 32 - 33 - const DATA_ROOT = './data/' 34 - const USER_LANGS = ['en', 'es'] 35 - const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja'] 36 - 37 - const langArg = Deno.args.find((a) => a.startsWith('--lang=')) 38 - const targetLang = langArg?.split('=')[1] ?? null 39 - 40 - if (targetLang && !TARGET_LANGS.includes(targetLang)) { 41 - console.error(`Unknown --lang: ${targetLang}. Valid: ${TARGET_LANGS.join(', ')}`) 42 - Deno.exit(1) 43 - } 44 - 45 - // Build id → curriculum-position maps when --lang is given 46 - const charIdOrder = targetLang ? buildIdOrder('character', targetLang) : null 47 - const vocabIdOrder = targetLang ? buildIdOrder('vocabulary', targetLang) : null 48 - 49 - function buildIdOrder(type: 'character' | 'vocabulary', lang: string): Map<string, number> { 50 - const slugs = type === 'character' 51 - ? 
readCharacterOrder(lang).flat() 52 - : readLessonOrder(`lang/${lang}/order/vocabulary.csv`).flat() 53 - const dictPath = type === 'character' ? 'lang/characters.tsv' : 'lang/vocabulary.tsv' 54 - const byHant = readDictByHant(dictPath) 55 - const map = new Map<string, number>() 56 - slugs.forEach((slug, i) => { 57 - const def = byHant[slug] 58 - if (def && !map.has(def.id)) map.set(def.id, i) 59 - }) 60 - return map 61 - } 62 - 63 - function readHeaders(relPath: string): string[] { 64 - const text = Deno.readTextFileSync(DATA_ROOT + relPath) 65 - return text.split('\n')[0].split('\t').map(text => text.replace("\r", "")) 66 - } 67 - 68 - // deno-lint-ignore no-explicit-any 69 - function sortRows(rows: any[], orderMap: Map<string, number> | null): any[] { 70 - if (orderMap) { 71 - return rows.sort((a, b) => { 72 - const posA = orderMap.get(a.id as string) ?? Infinity 73 - const posB = orderMap.get(b.id as string) ?? Infinity 74 - return posA - posB 75 - }) 76 - } 77 - return rows.sort((a, b) => { 78 - const idA = a.id as string 79 - const idB = b.id as string 80 - return idA < idB ? -1 : idA > idB ? 1 : 0 81 - }) 82 - } 83 - 84 - function sortFile(relPath: string, orderMap: Map<string, number> | null) { 85 - let rows: Record<string, unknown>[] 86 - let headers: string[] 87 - try { 88 - headers = readHeaders(relPath) 89 - rows = readTsv(relPath) as Record<string, unknown>[] 90 - } catch { 91 - return // file doesn't exist, skip 92 - } 93 - if (!rows.length) return 94 - writeTsv(relPath, headers, sortRows(rows, orderMap)) 95 - console.log(` sorted: ${relPath}`) 96 - } 97 - 98 - const meaningTargetLangs = targetLang ? 
[targetLang] : TARGET_LANGS 99 - 100 - for (const userLang of USER_LANGS) { 101 - console.log(`\nSorting lang/${userLang}/`) 102 - sortFile(`lang/${userLang}/characters.tsv`, charIdOrder) 103 - sortFile(`lang/${userLang}/radicals.tsv`, null) 104 - sortFile(`lang/${userLang}/vocabulary.tsv`, vocabIdOrder) 105 - sortFile(`lang/${userLang}/hints/meaning.characters.tsv`, charIdOrder) 106 - sortFile(`lang/${userLang}/hints/meaning.vocabulary.tsv`, vocabIdOrder) 107 - sortFile(`lang/${userLang}/hints/reading.characters.tsv`, charIdOrder) 108 - sortFile(`lang/${userLang}/hints/reading.vocabulary.tsv`, vocabIdOrder) 109 - for (const tl of meaningTargetLangs) { 110 - sortFile(`lang/${userLang}/meanings/${tl}.characters.tsv`, charIdOrder) 111 - sortFile(`lang/${userLang}/meanings/${tl}.vocabulary.tsv`, vocabIdOrder) 112 - } 113 - }
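The two sort modes in the removed `sort_dicts.ts` (by id for commits, by curriculum position when `--lang` is given) can be sketched in isolation as follows. This variant copies the array instead of sorting in place and returns `0` explicitly when two rows are both missing from the order map, so it is an illustration of the idea rather than the original function. `DictRow` is a hypothetical minimal row type, and the sample ids are invented.

```typescript
// Minimal row shape (assumption: real TSV rows carry more columns).
interface DictRow {
  id: string
}

// With an order map: sort by curriculum position, unknown ids last.
// Without one: sort by id, which is the state files should be committed in.
function sortRows<T extends DictRow>(rows: T[], orderMap: Map<string, number> | null): T[] {
  const copy = rows.slice()
  if (orderMap) {
    return copy.sort((a, b) => {
      const posA = orderMap.get(a.id) ?? Infinity
      const posB = orderMap.get(b.id) ?? Infinity
      if (posA === posB) return 0 // avoids Infinity - Infinity = NaN
      return posA < posB ? -1 : 1
    })
  }
  return copy.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
}

const dictRows: DictRow[] = [{ id: 'c3' }, { id: 'a1' }, { id: 'b2' }]
const curriculumOrder = new Map<string, number>([['b2', 0], ['c3', 1]])

const byId = sortRows(dictRows, null)
const byCurriculum = sortRows(dictRows, curriculumOrder)
console.log(byId.map((r) => r.id).join(','))
console.log(byCurriculum.map((r) => r.id).join(','))
```

Because the id comparison is lexicographic on strings, ids such as `10` and `9` would sort as text, which matches the behavior of the removed comparator.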
-379
data/scripts/shared/fs.ts
··· 1 - import { distinct } from '@std/collections/distinct' 2 - import { parse } from '@std/csv/parse' 3 - import { stringify } from '@std/csv/stringify' 4 - import { dirname, join } from '@std/path' 5 - import cedict from 'cc-cedict' 6 - import stringifyJSON from 'json-stringify-pretty-compact' 7 - import * as OpenCC from 'opencc-js' 8 - import { Transliteration, type Locale } from '$/enums.ts' 9 - import { Reading, Subject } from '$/models/subjects.ts' 10 - 11 - const toSimplified = OpenCC.Converter({ from: 'hk', to: 'cn' }) 12 - const DATA_ROOT = './data/' 13 - const APP_ROOT = './www/static/gen/' 14 - 15 - // For files under data/ (sources, lang/zh, lang/en, etc.) 16 - export function readTsv(input: string) { 17 - const text = Deno.readTextFileSync(DATA_ROOT + input) 18 - return parse(text, { 19 - separator: '\t', 20 - lazyQuotes: true, 21 - skipFirstRow: true, 22 - }) 23 - } 24 - 25 - export function readCsv(input: string) { 26 - const text = Deno.readTextFileSync(DATA_ROOT + input) 27 - return parse(text, { 28 - lazyQuotes: true, 29 - skipFirstRow: true, 30 - }) 31 - } 32 - 33 - export function readJson(input: string) { 34 - return JSON.parse(Deno.readTextFileSync(DATA_ROOT + input)) 35 - } 36 - 37 - // deno-lint-ignore no-explicit-any 38 - export function writeTsv(input: string, columns: string[], data: any) { 39 - Deno.writeTextFileSync( 40 - DATA_ROOT + input, 41 - stringify(data, { columns, separator: '\t' }), 42 - ) 43 - } 44 - 45 - export interface Definition { 46 - id: string 47 - v2?: number 48 - hans: string 49 - hant: string 50 - ja?: string // Japanese-specific form (kanji col in characters.tsv, ja col in vocabulary.tsv) 51 - } 52 - 53 - export function readDict(path: string): Definition[] { 54 - return readTsv(path).map((row) => ({ 55 - id: row.id as string, 56 - v2: row.v2 ? 
Number(row.v2) : undefined, 57 - hans: row.hans as string, 58 - hant: row.hant as string, 59 - ja: (row.kanji || row.ja) as string | undefined || undefined, 60 - })) 61 - } 62 - 63 - // Reads a user-language meanings file (lang/en/characters.tsv, lang/es/vocabulary.tsv, etc.) 64 - // Format: id, value 65 - export function readMeanings(path: string): Record<string, string> { 66 - return Object.fromEntries( 67 - readTsv(path).map((row) => [row.id as string, row.value as string]), 68 - ) 69 - } 70 - 71 - export function readDictByHant(path: string): Record<string, Definition> { 72 - const results: Record<string, Definition> = {} 73 - const data = readDict(path) 74 - data.forEach((row) => { 75 - results[row.hant] = row 76 - }) 77 - return results 78 - } 79 - 80 - export function readDictByHans(path: string): Record<string, Definition> { 81 - const results: Record<string, Definition> = {} 82 - const data = readDict(path) 83 - data.forEach((row) => { 84 - results[row.hans] = row 85 - }) 86 - return results 87 - } 88 - 89 - // All hanzi by id 90 - export function readDictById(path: string): Record<number, Definition> { 91 - const results: Record<number, Definition> = {} 92 - const data = readDict(path) 93 - data.forEach((row) => { 94 - results[row.id] = row 95 - }) 96 - return results 97 - } 98 - 99 - const options = { 100 - allowVariants: false, 101 - asObject: false, 102 - } 103 - 104 - export function appendToZhDict( 105 - zhPath: string, 106 - enPath: string, 107 - traditional: string, 108 - ) { 109 - const rows = readDict(zhPath).sort((a, b) => b.id - a.id) 110 - const id = (rows[0]?.id || 0) + 1 111 - 112 - const def = cedict.getByTraditional(traditional, null, options)?.[0] 113 - const simplified = def ? def.simplified : toSimplified(traditional) 114 - const english = def 115 - ? 
def.english.slice(0, 3).map((a: string) => a.trim()).join('; ') 116 - : '' 117 - 118 - if (!def) console.warn('No definition found for: ', traditional) 119 - 120 - const zhText = [id, traditional, simplified].join('\t') + '\n' 121 - Deno.writeTextFileSync(DATA_ROOT + zhPath, zhText, { append: true }) 122 - 123 - const enText = [id, english].join('\t') + '\n' 124 - Deno.writeTextFileSync(DATA_ROOT + enPath, enText, { append: true }) 125 - } 126 - 127 - // For `data/lessons` files 128 - export function readLessonOrder(input: string): string[][] { 129 - return parse(Deno.readTextFileSync(DATA_ROOT + input)) 130 - } 131 - 132 - export function readCharacterOrder(targetLang: string): string[][] { 133 - const text = Deno.readTextFileSync(DATA_ROOT + `lang/${targetLang}/order/characters.txt`) 134 - const rows = text.split('\n') 135 - return rows.map((row) => row.split('')) 136 - } 137 - 138 - // Returns only the definitions for items that appear in the curriculum order files, 139 - // in curriculum order, deduplicated. This ensures we only generate data/audio for 140 - // items that are actually taught, not the entire dictionary. 141 - export function readOrderedDefs(type: 'character' | 'vocabulary', targetLang = 'zh_CN'): Definition[] { 142 - const [dictPath, slugsList] = type === 'character' 143 - ? 
['lang/characters.tsv', readCharacterOrder(targetLang).flat()] 144 - : ['lang/vocabulary.tsv', readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`).flat()] 145 - 146 - const byHant = readDictByHant(dictPath) 147 - const seen = new Set<string>() 148 - const result: Definition[] = [] 149 - for (const slug of slugsList) { 150 - if (!slug || seen.has(slug)) continue 151 - seen.add(slug) 152 - const def = byHant[slug] 153 - if (def) result.push(def) 154 - } 155 - return result 156 - } 157 - 158 - export function readSubjects(input: string): Subject[] { 159 - try { 160 - const text = Deno.readTextFileSync(APP_ROOT + input) 161 - return JSON.parse(text) 162 - } catch { 163 - return [] 164 - } 165 - } 166 - 167 - export function readSubjectsMap(input: string): Record<string, Subject> { 168 - const subjectsMap: Record<string, Subject> = {} 169 - readSubjects(input) 170 - .forEach((subject) => { 171 - if (!subject.id || !subject.data?.slug) { 172 - console.warn(`Skipping subject with missing id/slug in ${input}:`, JSON.stringify(subject).slice(0, 120)) 173 - return 174 - } 175 - subjectsMap[subject.data.slug] = subject 176 - }) 177 - return subjectsMap 178 - } 179 - 180 - export function writeSubjects(output: string, subjects: Subject[]): void { 181 - const levelAndPosition = new Set() 182 - const toWrite = subjects 183 - // Drop any subject missing required identity fields — prevents corrupt entries 184 - // from perpetuating across pipeline runs (e.g. 
slugless subjects keyed as "undefined") 185 - .filter((subject) => { 186 - if (!subject.id || !subject.data?.slug || !subject.data?.type) { 187 - console.warn(`Dropping invalid subject (missing id/slug/type):`, JSON.stringify(subject).slice(0, 120)) 188 - return false 189 - } 190 - return true 191 - }) 192 - // Re-add properties to force json string write order 193 - .map((subject) => { 194 - const { data } = subject 195 - const levelPosition = `${data.type}-${data.level}-${data.position}` 196 - if (levelAndPosition.has(levelPosition) && levelPosition !== `${data.type}-0-0`) { 197 - console.warn(`Two with same position ${levelPosition} ${data.slug}`) 198 - } else { 199 - levelAndPosition.add(levelPosition) 200 - } 201 - return { 202 - id: subject.id, 203 - hiddenAt: subject.hiddenAt, 204 - learnCards: subject.learnCards?.length ? subject.learnCards : ['meanings'], 205 - quizCards: subject.quizCards?.length ? subject.quizCards : ['meanings', 'readings'], 206 - data: { 207 - audios: data.audios, 208 - character: data.character, 209 - requiredSubjects: data.requiredSubjects, 210 - examples: data.examples, 211 - level: data.level, 212 - meanings: data.meanings, 213 - meaningHint: data.meaningHint, 214 - meaningMnemonic: data.meaningMnemonic, 215 - position: data.position, 216 - readings: data.readings, 217 - readingHint: data.readingHint, 218 - readingMnemonic: data.readingMnemonic, 219 - slug: data.slug, 220 - srsId: data.srsId, 221 - type: data.type, 222 - }, 223 - } as Subject 224 - }) 225 - // Sort by: level, type (radical < character < vocabulary), position 226 - .sort((a, b) => { 227 - if (!a.data.level || !a.data.position) return 1 228 - if (!b.data.level || !b.data.position) return -1 229 - const levelDiff = a.data.level - b.data.level 230 - if (levelDiff) return levelDiff 231 - const typePriority: Record<string, number> = { Radical: 0, Character: 1, Vocabulary: 2 } 232 - const typeDiff = (typePriority[a.data.type] ?? 0) - (typePriority[b.data.type] ?? 
0) 233 - if (typeDiff) return typeDiff 234 - return a.data.position - b.data.position 235 - }) 236 - 237 - const outPath = APP_ROOT + output 238 - Deno.mkdirSync(dirname(outPath), { recursive: true }) 239 - Deno.writeTextFileSync(outPath, stringifyJSON(toWrite)) 240 - } 241 - 242 - // deno-lint-ignore no-explicit-any 243 - export function writeJSON(path: string, content: any) { 244 - Deno.writeTextFileSync(APP_ROOT + path, stringifyJSON(content)) 245 - } 246 - 247 - export function listAudioFiles( 248 - locales: string[] = ['zh_CN', 'zh_HK', 'zh_TW'], 249 - ): string[] { 250 - return locales 251 - .filter((locale) => locale !== 'tmp') 252 - .flatMap((locale) => 253 - Array.from(Deno.readDirSync(join('www/static/gen/audio', locale))) 254 - ) 255 - .filter(({ name }) => /.*\.mp3$/.test(name)) 256 - .map((file) => file.name) 257 - } 258 - 259 - export function readSentences(userLang: string, locale: Locale): { 260 - bySentence: Record<string, string> 261 - keys: string[] 262 - } { 263 - let text = '' 264 - try { 265 - text = Deno.readTextFileSync(`${DATA_ROOT}lang/${userLang}/sentences/${locale}.tsv`) 266 - } catch { 267 - return { bySentence: {}, keys: [] } 268 - } 269 - const sentences = parse( 270 - text, 271 - { separator: '\t', lazyQuotes: true }, 272 - ) 273 - const bySentence: { [chinese: string]: string } = {} 274 - const keys = distinct(sentences.map(([_id, text, _, english]) => { 275 - bySentence[text] = english 276 - return text 277 - })) 278 - 279 - return { bySentence, keys } 280 - } 281 - 282 - export interface Hint { 283 - id: string 284 - locale: string 285 - en: string 286 - } 287 - 288 - export function readHints(path: string): Hint[] { 289 - try { 290 - return readTsv(path).map((row) => ({ 291 - id: row.id as string, 292 - locale: row.locale as string, 293 - en: row.en as string, 294 - })) 295 - } catch { 296 - return [] 297 - } 298 - } 299 - 300 - // Maps column names in readings TSVs to their Transliteration type 301 - const COL_TYPE: 
Record<string, Transliteration> = { 302 - pinyin: Transliteration.Pinyin, 303 - jyutping: Transliteration.Jyutping, 304 - zhuyin: Transliteration.Zhuyin, 305 - kunyomi: Transliteration.Kunyomi, 306 - onyomi: Transliteration.Onyomi, 307 - reading: Transliteration.Hiragana, 308 - } 309 - 310 - /** 311 - * Reads a readings TSV (e.g. zh_CN/readings.tsv, ja/readings.characters.tsv) and 312 - * returns a map of subject id → Reading[]. Each non-id column becomes a Reading type; 313 - * semicolon-separated values produce multiple entries, with the first value isPrimary. 314 - * Returns empty map if the file doesn't exist. 315 - */ 316 - export function readReadingsMap(path: string): Record<string, Reading[]> { 317 - const result: Record<string, Reading[]> = {} 318 - try { 319 - const rows = readTsv(path) 320 - if (!rows.length) return result 321 - const cols = Object.keys(rows[0]).filter((k) => k !== 'id').reverse() 322 - for (const row of rows) { 323 - const id = row.id as string 324 - const readings: Reading[] = [] 325 - let firstCol = true 326 - for (const col of cols) { 327 - const val = (row[col] as string) || '' 328 - const type = COL_TYPE[col] 329 - if (!type || !val) continue 330 - val.split(';').map((s) => s.trim()).filter(Boolean).forEach((value, i) => { 331 - readings.push({ value, type, isAcceptedAnswer: true, isPrimary: firstCol && i === 0 }) 332 - }) 333 - firstCol = false 334 - } 335 - if (readings.length) result[id] = readings 336 - } 337 - } catch { /* file missing */ } 338 - return result 339 - } 340 - 341 - /** 342 - * Reads a meaning-override TSV (e.g. en/meanings/characters.ja.tsv, format: id, meaning). 343 - * Returns a map of subject id → semicolon-separated meaning string. 344 - * Returns empty map if the file doesn't exist. 
345 - */ 346 - export function readMeaningOverrides(path: string): Record<string, string> { 347 - try { 348 - return Object.fromEntries( 349 - readTsv(path).map((row) => [row.id as string, row.meaning as string]), 350 - ) 351 - } catch { 352 - return {} 353 - } 354 - } 355 - 356 - export function readHintsById( 357 - path: string, 358 - ): Record<string, Record<string, string>> { 359 - const results: Record<string, Record<string, string>> = {} 360 - const data = readHints(path) 361 - data.forEach((row) => { 362 - if (!results[row.id]) { 363 - results[row.id] = {} 364 - } 365 - results[row.id][row.locale] = row.en 366 - }) 367 - return results 368 - } 369 - 370 - export function readAllLocaleHints(path: string): Record<string, string> { 371 - const results: Record<string, string> = {} 372 - const data = readHints(path) 373 - data.forEach((row) => { 374 - if (row.locale === 'ALL') { 375 - results[row.id] = row.en 376 - } 377 - }) 378 - return results 379 - }
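Reviewer note: the removed `readReadingsMap` encodes a small rule that is easy to lose in the move to the CLI. Each non-id column holds semicolon-separated readings, and only the first value of the first non-empty column is marked primary. The sketch below restates that rule for a single row. `ReadingSketch` and `rowToReadings` are hypothetical names, not part of the repo. The column name stands in for the `Transliteration` enum, and the caller supplies the column order, where the original reverses the header order.

```typescript
// Hypothetical, simplified stand-in for the app's Reading type.
type ReadingSketch = {
  value: string
  type: string // column name used in place of the Transliteration enum (assumption)
  isAcceptedAnswer: boolean
  isPrimary: boolean
}

// Convert one parsed TSV row (column name -> cell text) into readings.
// `cols` is the ordered list of non-id columns to inspect.
function rowToReadings(row: Record<string, string>, cols: string[]): ReadingSketch[] {
  const readings: ReadingSketch[] = []
  let firstCol = true
  for (const col of cols) {
    const val = row[col] ?? ''
    if (!val) continue // empty cells do not consume the "first column" slot
    val.split(';').map((s) => s.trim()).filter(Boolean).forEach((value, i) => {
      readings.push({ value, type: col, isAcceptedAnswer: true, isPrimary: firstCol && i === 0 })
    })
    firstCol = false
  }
  return readings
}

// Usage: one column with two semicolon-separated readings.
const sample = rowToReadings({ id: '1', pinyin: 'háng; xíng' }, ['pinyin'])
console.log(sample.map((r) => `${r.value}:${r.isPrimary}`).join(', '))
```

If the CLI port keeps this behaviour, a row whose first column is empty should still end up with exactly one primary reading, taken from the next non-empty column.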
-109
data/scripts/shared/subject_utils.ts
··· 1 - import { distinct } from '@std/collections/distinct' 2 - import { Locale, SubjectType } from '$/enums.ts' 3 - import { Audio, Subject } from '$/models/subjects.ts' 4 - import { 5 - Definition, 6 - listAudioFiles, 7 - readDict, 8 - readSentences, 9 - } from './fs.ts' 10 - 11 - const { Character, Vocabulary } = SubjectType 12 - const { zh_CN } = Locale 13 - 14 - const charDefs = readDict('lang/characters.tsv') 15 - const charBySlug: Record<string, Definition> = Object.fromEntries(charDefs.map((d) => [d.hant, d])) 16 - const charByHans: Record<string, Definition> = Object.fromEntries(charDefs.map((d) => [d.hans, d])) 17 - const charByJa: Record<string, Definition> = Object.fromEntries( 18 - charDefs.filter((d) => d.ja).map((d) => [d.ja!, d]), 19 - ) 20 - const vocabDefs = readDict('lang/vocabulary.tsv') 21 - const vocabBySlug: Record<string, Definition> = Object.fromEntries(vocabDefs.map((d) => [d.hant, d])) 22 - const vocabByJa: Record<string, Definition> = Object.fromEntries( 23 - vocabDefs.filter((d) => d.ja).map((d) => [d.ja!, d]), 24 - ) 25 - const audioMeta: Record<string, Record<string, Audio>> = {} 26 - listAudioFiles().forEach((filename) => { 27 - const [idStr, localeStr, voiceId] = filename.replace('.mp3', '').split('_') 28 - const locale = localeStr.replace('-', '_') 29 - if (!voiceId) return 30 - audioMeta[locale] ??= {} 31 - audioMeta[locale][idStr] = { url: filename, voiceId } 32 - }) 33 - 34 - export type Sentences = { 35 - bySentence: Record<string, string> 36 - byChar: Map<string, string[]> 37 - sorted: string[] 38 - } 39 - 40 - export function loadSentences(userLang: string, targetLang: string): Sentences { 41 - const raw = readSentences(userLang, targetLang as Locale) 42 - const byChar = new Map<string, string[]>() 43 - for (const key of raw.keys) { 44 - for (const char of key) { 45 - if (!byChar.has(char)) byChar.set(char, []) 46 - byChar.get(char)!.push(key) 47 - } 48 - } 49 - return { bySentence: raw.bySentence, byChar, sorted: raw.keys } 
50 - } 51 - 52 - function getCharForLocale(targetLang: string, hans: string, hant: string, ja?: string): string { 53 - if (targetLang === 'ja') return ja || hant 54 - return targetLang === zh_CN ? hans : hant 55 - } 56 - 57 - export function createSubject( 58 - slug: string, 59 - level: number, 60 - position: number, 61 - targetLang: string, 62 - charMeanings: Record<string, string>, 63 - vocabMeanings: Record<string, string>, 64 - sentences: Sentences, 65 - ): Subject { 66 - const isVocab = slug.length > 1 67 - 68 - const dictEntry = isVocab 69 - ? (vocabBySlug[slug] || vocabByJa[slug]) 70 - : (charBySlug[slug] || charByHans[slug] || charByJa[slug]) 71 - if (!dictEntry) { 72 - console.error(`No valid id for ${slug}`) 73 - return { data: {}} 74 - } 75 - const { id, hans, hant, ja } = dictEntry 76 - const en = isVocab ? (vocabMeanings[id] || '') : (charMeanings[id] || '') 77 - const character = getCharForLocale(targetLang, hans, hant, ja) 78 - const charForSentences = targetLang === zh_CN ? hans : hant 79 - 80 - return { 81 - id, 82 - learnCards: ['meanings'], 83 - quizCards: ['meanings', 'readings'], 84 - data: { 85 - audios: [audioMeta[targetLang]?.[id]].filter((audio) => audio != null), 86 - character, 87 - examples: (charForSentences.length === 1 88 - ? (sentences.byChar.get(charForSentences) ?? []) 89 - : (sentences.byChar.get(charForSentences[0]) ?? []).filter((key) => key.includes(charForSentences)) 90 - ).slice(0, 3).map((value) => ({ value, translation: sentences.bySentence[value] })), 91 - level, 92 - meanings: en.split(';').map((def, i) => ({ 93 - value: def.trim(), 94 - isPrimary: i === 0, 95 - isAcceptedAnswer: true, 96 - })), 97 - position, 98 - readings: [], 99 - // All chars of slug, mapped to id, deduped, not identity 100 - requiredSubjects: distinct( 101 - slug.split('').map((c) => charBySlug[c]?.id ?? ''), 102 - ).filter((reqId: string) => reqId && reqId !== charBySlug[slug]?.id), 103 - slug, 104 - srsId: (level > 2) ? 
1 : 2, // Early levels use srs 2, which is faster 105 - type: isVocab ? Vocabulary : Character, 106 - }, 107 - } as Subject 108 - } 109 -
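Reviewer note: the removed module built its audio index at import time from filenames, splitting on `_` and normalising the locale part from `zh-CN` to `zh_CN`. The sketch below isolates that parsing so it can be checked without touching the file system. `AudioSketch` and `indexAudio` are hypothetical names, and the example filenames are assumptions about the `{id}_{locale}_{voiceId}.mp3` pattern implied by the split. Unlike the original, it also skips names with no locale segment, where the original would throw on `localeStr.replace`.

```typescript
// Hypothetical, simplified stand-in for the app's Audio model.
type AudioSketch = { url: string; voiceId: string }

// Build locale -> subject id -> audio metadata from a flat list of filenames.
function indexAudio(filenames: string[]): Record<string, Record<string, AudioSketch>> {
  const meta: Record<string, Record<string, AudioSketch>> = {}
  for (const filename of filenames) {
    const [idStr, localeStr, voiceId] = filename.replace('.mp3', '').split('_')
    // Skip anything that does not have all three segments.
    if (!localeStr || !voiceId) continue
    const locale = localeStr.replace('-', '_') // "zh-CN" -> "zh_CN"
    meta[locale] ??= {}
    meta[locale][idStr] = { url: filename, voiceId }
  }
  return meta
}

// Usage with assumed filenames; the second entry is ignored.
const audioIndex = indexAudio(['12_zh-CN_voiceA.mp3', 'notes.mp3'])
console.log(Object.keys(audioIndex))
```

Taking the filename list as a parameter, rather than reading the directory at module load, would also make the replacement code in the CLI easier to unit test.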
+3 -21
deno.json
···
1 1 {
2 2 "version": "3.3.1",
3 + "workspace": ["./data"],
3 4 "compilerOptions": {
4 5 "lib": [
5 6 "deno.ns",
···
34 35 "test:check": "deno check ./www/index.ts",
35 36 "test:unit": "deno test -A",
36 37 "test:update": "deno test -A -- --update source",
37 - "data:step-1": "deno run -A ./data/scripts/1_gen_dicts.ts",
38 - "data:step-3": "deno run -A ./data/scripts/3_update_app_data.ts",
39 - "data:step-4": "deno run -A ./data/scripts/4_gen_progress.ts",
40 - "data:step-5": "deno run -A ./data/scripts/5_gen_licenses.ts"
38 + "data": "deno run -A ./data/cli/main.ts"
41 39 },
42 40 "imports": {
43 41 "$/": "./www/",
···
50 48 "@leeoniya/ufuzzy": "npm:@leeoniya/ufuzzy@^1.0.19",
51 49 "@std/assert": "jsr:@std/assert@^1.0.19",
52 50 "@std/async": "jsr:@std/async@^1.2.0",
53 - "@std/collections": "jsr:@std/collections@^1.1.6",
54 - "@std/csv": "jsr:@std/csv@^1.0.6",
55 - "@std/dotenv": "jsr:@std/dotenv@^0.225.6",
56 - "@std/fs": "jsr:@std/fs@^1.0.23",
57 - "@std/io": "jsr:@std/io@^0.225.3",
58 - "@std/path": "jsr:@std/path@^1.1.4",
59 - "@std/semver": "jsr:@std/semver@^1.0.8",
60 - "@std/streams": "jsr:@std/streams@1.0.17",
61 51 "@zod/zod": "jsr:@zod/zod@^4.3.6",
62 - "cc-cedict": "npm:cc-cedict@^1.1.1",
63 - "chinese-to-pinyin": "npm:chinese-to-pinyin@^1.3.1",
64 52 "howler": "npm:howler@^2.2.4",
65 - "json-stringify-pretty-compact": "npm:json-stringify-pretty-compact@^4.0.0",
66 - "kuromoji": "npm:kuromoji@^0.1.2",
67 - "lit": "npm:lit@^3.3.2",
68 - "native-file-system-adapter": "npm:native-file-system-adapter@^3.0.1",
69 - "opencc-js": "npm:opencc-js@^1.0.5",
70 - "pinyin-to-zhuyin": "npm:pinyin-to-zhuyin@^1.0.3",
71 - "pinyin-tone-tool": "npm:pinyin-tone-tool@^1.0.5"
53 + "lit": "npm:lit@^3.3.2"
72 54 },
73 55 "allowScripts": [
74 56 "npm:@byojs/storage@0.12.1",
+118 -21
deno.lock
··· 6 6 "jsr:@civility/sync@~0.1.1": "0.1.1", 7 7 "jsr:@civility/ui@~0.2.2": "0.2.2", 8 8 "jsr:@civility/workers@~0.2.3": "0.2.3", 9 + "jsr:@cliffy/ansi@1": "1.0.0", 10 + "jsr:@cliffy/ansi@1.0.0": "1.0.0", 11 + "jsr:@cliffy/command@1": "1.0.0", 12 + "jsr:@cliffy/flags@1": "1.0.0", 13 + "jsr:@cliffy/flags@1.0.0": "1.0.0", 14 + "jsr:@cliffy/internal@1.0.0": "1.0.0", 15 + "jsr:@cliffy/keycode@1": "1.0.0", 16 + "jsr:@cliffy/keycode@1.0.0": "1.0.0", 17 + "jsr:@cliffy/keypress@1": "1.0.0", 18 + "jsr:@cliffy/prompt@1": "1.0.0", 19 + "jsr:@cliffy/table@1": "1.0.0", 20 + "jsr:@cliffy/table@1.0.0": "1.0.0", 9 21 "jsr:@inro/simple-tools@0.5.2": "0.5.2", 10 22 "jsr:@paulmillr/qr@~0.5.5": "0.5.5", 23 + "jsr:@std/assert@^1.0.18": "1.0.19", 11 24 "jsr:@std/assert@^1.0.19": "1.0.19", 12 25 "jsr:@std/async@^1.2.0": "1.2.0", 13 26 "jsr:@std/bytes@^1.0.6": "1.0.6", 14 27 "jsr:@std/collections@^1.1.0": "1.1.6", 15 28 "jsr:@std/collections@^1.1.6": "1.1.6", 16 - "jsr:@std/csv@*": "1.0.6", 17 29 "jsr:@std/csv@^1.0.6": "1.0.6", 18 30 "jsr:@std/dotenv@~0.225.6": "0.225.6", 31 + "jsr:@std/encoding@^1.0.10": "1.0.10", 32 + "jsr:@std/fmt@^1.0.9": "1.0.9", 19 33 "jsr:@std/fs@^1.0.17": "1.0.23", 20 34 "jsr:@std/fs@^1.0.23": "1.0.23", 21 35 "jsr:@std/html@^1.0.5": "1.0.5", ··· 23 37 "jsr:@std/io@~0.225.3": "0.225.3", 24 38 "jsr:@std/path@^1.1.4": "1.1.4", 25 39 "jsr:@std/semver@^1.0.8": "1.0.8", 26 - "jsr:@std/streams@1.0.17": "1.0.17", 27 40 "jsr:@std/streams@^1.0.9": "1.0.17", 41 + "jsr:@std/text@^1.0.17": "1.0.17", 28 42 "jsr:@zod/zod@^4.3.6": "4.3.6", 29 43 "npm:@byojs/storage@~0.12.1": "0.12.1", 30 44 "npm:@leeoniya/ufuzzy@^1.0.19": "1.0.19", 31 45 "npm:@tauri-apps/plugin-store@^2.2.0": "2.4.2", 32 46 "npm:cc-cedict@^1.1.1": "1.1.1", 33 47 "npm:chinese-to-pinyin@^1.3.1": "1.3.1", 48 + "npm:hanzi@^2.1.5": "2.1.5", 34 49 "npm:howler@^2.2.4": "2.2.4", 35 50 "npm:json-stringify-pretty-compact@4": "4.0.0", 36 51 "npm:kuromoji@~0.1.2": "0.1.2", ··· 68 83 "@civility/workers@0.2.3": { 69 84 
"integrity": "84130ff9b3c5d0ee133d8ed076dd86d5ea4a3bb8f49c06c114959eb4e0c66602" 70 85 }, 86 + "@cliffy/ansi@1.0.0": { 87 + "integrity": "987008f74e50aa72cc1517ffccc769711734a14927bc4599e052efe1b9a840e2", 88 + "dependencies": [ 89 + "jsr:@cliffy/internal", 90 + "jsr:@std/encoding", 91 + "jsr:@std/fmt", 92 + "jsr:@std/io" 93 + ] 94 + }, 95 + "@cliffy/command@1.0.0": { 96 + "integrity": "c52a241ea68857fcdaff4f3173eb404f8017d7bc35553b6f533c592b89dde7d2", 97 + "dependencies": [ 98 + "jsr:@cliffy/flags@1.0.0", 99 + "jsr:@cliffy/internal", 100 + "jsr:@cliffy/table@1.0.0", 101 + "jsr:@std/fmt", 102 + "jsr:@std/semver", 103 + "jsr:@std/text" 104 + ] 105 + }, 106 + "@cliffy/flags@1.0.0": { 107 + "integrity": "8b57698adc644da8f90422d58976362d41a4ebca39c312ca1c101585d0148feb", 108 + "dependencies": [ 109 + "jsr:@cliffy/internal", 110 + "jsr:@std/text" 111 + ] 112 + }, 113 + "@cliffy/internal@1.0.0": { 114 + "integrity": "1e17ccbcd5420093c0a93e5b3827bbdc9abac5195bacf187edc44665e54bdde6" 115 + }, 116 + "@cliffy/keycode@1.0.0": { 117 + "integrity": "755dbf007be110dcb5625f87eb61b362b6a0ca6835453af03ebd3b34d399cf14" 118 + }, 119 + "@cliffy/keypress@1.0.0": { 120 + "integrity": "dd2e33484bea5fedf9bad5ed4aa0248a53373427d70cb94de4aad3052f948cea", 121 + "dependencies": [ 122 + "jsr:@cliffy/internal", 123 + "jsr:@cliffy/keycode@1.0.0" 124 + ] 125 + }, 126 + "@cliffy/prompt@1.0.0": { 127 + "integrity": "48b4cd35199fda7832f35e1fe0a3e8bc2b1ea49ba57b4ec0e29e22db44e8ca9f", 128 + "dependencies": [ 129 + "jsr:@cliffy/ansi@1.0.0", 130 + "jsr:@cliffy/internal", 131 + "jsr:@cliffy/keycode@1.0.0", 132 + "jsr:@std/assert@^1.0.18", 133 + "jsr:@std/fmt", 134 + "jsr:@std/io", 135 + "jsr:@std/path", 136 + "jsr:@std/text" 137 + ] 138 + }, 139 + "@cliffy/table@1.0.0": { 140 + "integrity": "3fdaa9e1ef1ea62022108adabd826932bdea8dd05497079896febcd41322907f", 141 + "dependencies": [ 142 + "jsr:@std/fmt" 143 + ] 144 + }, 71 145 "@inro/simple-tools@0.5.2": { 72 146 "integrity": 
"cc34cd0914b9e0576d9bed9a66a91994123b73f3fd87a4e8db76880181731ee5", 73 147 "dependencies": [ ··· 98 172 "@std/csv@1.0.6": { 99 173 "integrity": "52ef0e62799a0028d278fa04762f17f9bd263fad9a8e7f98c14fbd371d62d9fd", 100 174 "dependencies": [ 101 - "jsr:@std/streams@^1.0.9" 175 + "jsr:@std/streams" 102 176 ] 103 177 }, 104 178 "@std/dotenv@0.225.6": { 105 179 "integrity": "1d6f9db72f565bd26790fa034c26e45ecb260b5245417be76c2279e5734c421b" 180 + }, 181 + "@std/encoding@1.0.10": { 182 + "integrity": "8783c6384a2d13abd5e9e87a7ae0520a30e9f56aeeaa3bdf910a3eaaf5c811a1" 183 + }, 184 + "@std/fmt@1.0.9": { 185 + "integrity": "2487343e8899fb2be5d0e3d35013e54477ada198854e52dd05ed0422eddcabe0" 106 186 }, 107 187 "@std/fs@1.0.23": { 108 188 "integrity": "3ecbae4ce4fee03b180fa710caff36bb5adb66631c46a6460aaad49515565a37", ··· 138 218 "jsr:@std/bytes" 139 219 ] 140 220 }, 221 + "@std/text@1.0.17": { 222 + "integrity": "4b2c4ef67ae5b6c1dfd447c81c83a43718f52e3c7e748d8b33f694aba9895f95" 223 + }, 141 224 "@zod/zod@4.3.6": { 142 225 "integrity": "7144e5e11f8ffc3cf6e2fca624f6597a8762898aac9868cc8938e9398b96ffe4" 143 226 } ··· 222 305 }, 223 306 "graceful-fs@4.2.11": { 224 307 "integrity": "sha512-RbJ5/jmFcNNCcDV5o9eTnBLJ/HszWV0P73bc+Ff4nS/rJj+YaS6IGyiOL0VoBYX+l1Wrl3k63h/KrH+nhJ0XvQ==" 308 + }, 309 + "hanzi@2.1.5": { 310 + "integrity": "sha512-kXLmMV39QStRX0L1VH7j+5sz8VvTWD2Igx4htQwicCopPZGCWXAqavbb+jG0kFVllGcE89K7pv2UvD+pURzwYQ==" 225 311 }, 226 312 "howler@2.2.4": { 227 313 "integrity": "sha512-iARIBPgcQrwtEr+tALF+rapJ8qSc+Set2GJQl7xT1MQzWaVkFebdJhR3alVlSiUf5U7nAANKuj3aWpwerocD5w==" ··· 363 449 "jsr:@inro/simple-tools@0.5.2", 364 450 "jsr:@std/assert@^1.0.19", 365 451 "jsr:@std/async@^1.2.0", 366 - "jsr:@std/collections@^1.1.6", 367 - "jsr:@std/csv@^1.0.6", 368 - "jsr:@std/dotenv@~0.225.6", 369 - "jsr:@std/fs@^1.0.23", 370 - "jsr:@std/io@~0.225.3", 371 - "jsr:@std/path@^1.1.4", 372 - "jsr:@std/semver@^1.0.8", 373 - "jsr:@std/streams@1.0.17", 374 452 "jsr:@zod/zod@^4.3.6", 375 453 
"npm:@byojs/storage@~0.12.1", 376 454 "npm:@leeoniya/ufuzzy@^1.0.19", 377 - "npm:cc-cedict@^1.1.1", 378 - "npm:chinese-to-pinyin@^1.3.1", 379 455 "npm:howler@^2.2.4", 380 - "npm:json-stringify-pretty-compact@4", 381 - "npm:kuromoji@~0.1.2", 382 - "npm:lit@^3.3.2", 383 - "npm:native-file-system-adapter@^3.0.1", 384 - "npm:opencc-js@^1.0.5", 385 - "npm:pinyin-to-zhuyin@^1.0.3", 386 - "npm:pinyin-tone-tool@^1.0.5" 387 - ] 456 + "npm:lit@^3.3.2" 457 + ], 458 + "members": { 459 + "data": { 460 + "dependencies": [ 461 + "jsr:@cliffy/ansi@1", 462 + "jsr:@cliffy/command@1", 463 + "jsr:@cliffy/flags@1", 464 + "jsr:@cliffy/keycode@1", 465 + "jsr:@cliffy/keypress@1", 466 + "jsr:@cliffy/prompt@1", 467 + "jsr:@cliffy/table@1", 468 + "jsr:@std/collections@^1.1.6", 469 + "jsr:@std/csv@^1.0.6", 470 + "jsr:@std/dotenv@~0.225.6", 471 + "jsr:@std/fs@^1.0.23", 472 + "jsr:@std/io@~0.225.3", 473 + "jsr:@std/path@^1.1.4", 474 + "npm:cc-cedict@^1.1.1", 475 + "npm:chinese-to-pinyin@^1.3.1", 476 + "npm:hanzi@^2.1.5", 477 + "npm:json-stringify-pretty-compact@4", 478 + "npm:kuromoji@~0.1.2", 479 + "npm:opencc-js@^1.0.5", 480 + "npm:pinyin-to-zhuyin@^1.0.3", 481 + "npm:pinyin-tone-tool@^1.0.5" 482 + ] 483 + } 484 + } 388 485 } 389 486 }
-5
www/static/gen/licenses.json
··· 40 40 "text": "The MIT License (MIT)\n\nCopyright (c) 2014, 2016, 2017, 2019, 2021, 2022, 2023 Simon Lydell\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n" 41 41 }, 42 42 { 43 - "name": "native-file-system-adapter", 44 - "href": "https://raw.githubusercontent.com/jimmywarting/native-file-system-adapter/refs/heads/master/LICENSE", 45 - "text": "MIT License\n\nCopyright (c) 2019 Jimmy Wärting\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS 
PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n" 46 - }, 47 - { 48 43 "name": "opencc-js", 49 44 "href": "https://raw.githubusercontent.com/nk2028/opencc-js/main/LICENSE", 50 45 "text": "MIT License\n\nCopyright (c) 2020-2021 The nk2028 Project\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"