···11# Data
2233-This is the data we use to construct our app json files.
44-None of these files should be directly included into the app.
33+Source files used to build the app's subject JSON. None of these files are
44+directly included in the app bundle — everything is compiled to `www/static/gen/`
55+by the CLI.
5666-Data is usually "safe" to change at will (as long as you aren't changing any ids), BUT you changes may be overwritten depending on how update scripts are run. See docs and generator scripts for more details.
77+Data files are generally safe to edit by hand (as long as you don't change any
88+ids), but changes to TSVs may be overwritten if the relevant generator command
99+is re-run. See the CLI docs below for which commands affect which files.
71088-## Scripts
1111+## CLI
9121010-The `scripts` directory includes scripts for creating our apps `data` files. These scripts assume existence of `{target}/order/characters.txt`, `{target}/order/order/vocabulary.csv`, `{target}/order/radicals.csv`. These define which subjects are used, and in what order.
1313+All data tooling runs through a single entry point:
11141212-The main process looks something like this:
1515+```sh
1616+deno task data <command>
1717+deno task data --help
1818+```
13191414-1. `1_gen_dicts.ts` script to take any new subjects, and append a line to `dictionaries/{subject}.tsv`. These files help us match words to definitions and chars. What makes a "new" subject is loosely defined as "the traditional characters are different". Since Hanz/
1515-2. `2_gen_audio.ts` script, which will use TTS to generate audio files from the dicts. `deno run -A data/scripts/2_gen_audio.ts zh_HK character 200`
1616-3. `3_update_app_data.ts`, which will use data files from steps 1 and 2, as well as a reading the file locations from step 3, and compile this in to data.json
1717-4. `4_update_progress_data.ts` updates the HSK and TOCFL lists
2020+### Build commands
18211919-Separately from the main scripts, there is also `gen_progress.ts`, which generates all the progress-checking json files used by the app. This should translate between Hanz and Hans using the `dictionaries`, rather than an external source, since all chars/words should be present within our app.
2222+Run by **code contributors** as part of the normal build process.
20232121-## Sources
2424+| Command | Description |
2525+| --- | --- |
2626+| `build` | Run `gen:app`, `gen:progress`, and `gen:licenses` in sequence |
2727+| `gen:app` | Compile subjects JSON for all user-lang × target-lang pairs |
2828+| `gen:audio <locale> <dict> [limit]` | Generate TTS audio via Azure (requires `.env` + `ffmpeg`) |
2929+| `gen:progress` | Generate HSK / TOCFL / JLPT progress JSON |
3030+| `gen:licenses` | Fetch and bundle license texts |
22312323-This is where we store the raw data. Store as `.tsv` to make it easier to view the data in the text editor by adjusting tab size (.editorconfig). Every tsv file should have a `.js` file in `sources/scripts`, helping to parse the raw data from its original source.
3232+### Studio commands
24332525-| file | description |
2626-| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
2727-| `hsk.tsv` | The HSK 3.0 vocab list used by China; bands 1-5 |
2828-| `hsk_missing.tsv` | Hacking Chinese's list of words missing from HSK |
2929-| `tocfl.tsv` | The TOCFL vocab list used by Taiwan |
3030-| `tocfl_missing.tsv` | Hacking Chinese's list of words missing from TOCFL |
3131-| `standard.tsv` | A list of the General Standard Chinese Characters (通用规范汉字表). The standardized list of 8105 simplified Chinese characters. These are listed in order. |
3232-| `sentences.tsv` | Sentences provided by https://tatoeba.org/en |
3434+Run by **translation and language data contributors** when updating source files.
3535+Not needed for normal code development.
3636+
3737+| Command | Description |
3838+| --- | --- |
3939+| `studio sort:dicts [--lang=zh_CN]` | Sort TSVs by id (default) or by a target language's curriculum order |
4040+| `studio update:dicts` | Add missing readings/meanings entries from dictionary sources |
4141+| `studio update:sentences` | Process raw Tatoeba sentence exports into app sentence TSVs |
33423443## Directory Layout
35443645```
3746data/
3838-├── scripts/ # Build scripts
3939-├── other/ # Other Data (licenses, voice data, other aux data)
4040-└── lang/ # Language Data
4141- ├── characters.tsv # id, hant, hans
4242- ├── radicals.tsv # id, hant, hans alt
4343- ├── vocabulary.tsv # id, hant, hans
4747+├── cli/ # CLI entry point and commands (see cli/main.ts)
4848+├── scripts/ # Legacy build scripts (superseded by cli/)
4949+├── other/ # Auxiliary data (licenses.tsv, voices.json)
5050+└── lang/ # All language data
5151+ ├── characters.tsv # Shared CJK character registry: id, hant, hans
5252+ ├── radicals.tsv # Shared radical registry: id, hant, hans, alt
5353+ ├── vocabulary.tsv # Shared CJK vocabulary registry: id, hant, hans
4454 ├── zh_CN/ # Mainland Chinese target-language data
4545- │ ├── pronunciation.tsv # id pinyin zhuyin
4646- │ ├── progress/ # Data for tracking progress in other courses
5555+ │ ├── readings.tsv # id, pinyin
5656+ │ ├── progress/ # External word lists (HSK, frequency data)
4757 │ └── order/ # Curriculum order
4848- │ ├── characters.txt # Row = level, columns = chars in that level
5858+ │ ├── characters.txt # One line per level; chars in that level
4959 │ ├── vocabulary.csv
5060 │ └── radicals.csv
6161+ ├── zh_HK/ # Cantonese target-language data
6262+ │ └── (same structure as zh_CN/, readings use jyutping)
6363+ ├── zh_TW/ # Taiwanese Mandarin target-language data
6464+ │ └── (same structure as zh_CN/, readings use pinyin + zhuyin)
6565+ ├── ja/ # Japanese target-language data
6666+ │ ├── reading.characters.tsv # id, kunyomi, onyomi
6767+ │ ├── reading.vocabulary.tsv # id, reading (hiragana)
6868+ │ └── (order/, progress/ same structure as zh_CN/)
5169 ├── en/ # English user-language data
5252- │ ├── characters.tsv # id, value (English meanings)
5353- │ ├── vocabulary.tsv # id, value
5454- │ ├── characters.meaning.tsv # id, hant, locale, en (mnemonic hints)
5555- │ ├── vocabulary.meaning.tsv
5656- │ ├── characters.reading.tsv
5757- │ ├── vocabulary.reading.tsv
5858- │ └── sentences/ # Locale-keyed example sentences
7070+ │ ├── characters.tsv # id, hant, value (meanings)
7171+ │ ├── vocabulary.tsv # id, hant, value
7272+ │ ├── radicals.tsv # id, hant, name
7373+ │ ├── hints/ # Mnemonic hints
7474+ │ │ ├── meaning.characters.tsv
7575+ │ │ ├── meaning.vocabulary.tsv
7676+ │ │ ├── reading.characters.tsv
7777+ │ │ └── reading.vocabulary.tsv
7878+ │ ├── meanings/ # Locale-specific meaning overrides
7979+ │ │ ├── ja.characters.tsv
8080+ │ │ └── ja.vocabulary.tsv
8181+ │ └── sentences/ # Example sentences keyed by target locale
5982 │ ├── zh_CN.tsv
6083 │ ├── zh_HK.tsv
6161- │ └── zh_TW.tsv
6262- ├── es/ # Spanish user-language data
6363- │ └── (same structure as en/)
6464- ├── zh_TW/ # Taiwanese Chinese Target Language data
6565- │ └── (same structure as zh_CN/)
6666- └── ja/ # Japanese Target Language data
6767- └── (same structure as zh_CN/)
8484+ │ ├── zh_TW.tsv
8585+ │ └── ja.tsv
8686+ └── es/ # Spanish user-language data
8787+ └── (same structure as en/)
6888```
69897070-## Order
9090+## Curriculum Order
71917272-This is where we store data for creating lessons. Use `.csv` to compact the format within the text editor. Order files are per target-locale, e.g. `zh_CN/order/vocabulary.csv` for Mainland-focused vocabulary.
7373-7474-### How we determine Lesson Num/Order
7575-7676-This is very hand-wavey at the moment. But broadly:
7777-7878-| LVL | CEFR | HSK | TOCFL | # Characters | # Words |
7979-| ----- | ---- | ----- | ------- | ---------------- | ---------- |
8080-| 1-10 | | | | | |
8181-| 11-20 | A1 | HSK 2 | TOCFL 1 | 200 most popular | |
8282-| 21-30 | A2 | HSK3 | TOCFL 2 | 900 Characters | 1200 Words |
8383-| 31-60 | B1-2 | HSK 5 | TOCFL 4 | 1900 Characters | 5000 Words |
9292+Order files define which subjects are taught and in what sequence. They live
9393+at `lang/{targetLang}/order/` and are the source of truth for what gets
9494+included in the compiled subject JSON.
84958585-- Lvls 1-10 is less focused on specific milestones, and more just trying to gracefully introduce characters in a natural way.
8686-- Lvls 21-30 seems short because 1-20 should introduce vocab beyond the stated goals. Just the order might not lend itself as well to testing milestones.
9696+- `characters.txt` — one line per level, each character is one slug
9797+- `vocabulary.csv` — one row per level, comma-separated slugs
9898+- `radicals.csv` — one row per level, comma-separated slugs
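
As a rough illustration of the row-per-level format described above, the sketch below parses order-file text into levels and looks up the level of a slug. `parseLessonOrder` and `levelOf` are hypothetical names, not the project's `readLessonOrder`, and the sketch assumes blank lines carry no meaning (the real files may treat them differently).

```ts
// Hypothetical helper (assumption): parses order-file text where each row is
// one level and slugs are separated by commas. Blank rows are skipped.
function parseLessonOrder(text: string): string[][] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => line.split(',').map((slug) => slug.trim()).filter(Boolean))
}

// Levels are 1-based: row index 0 is level 1. Returns 0 when the slug is
// not scheduled in any level.
function levelOf(order: string[][], slug: string): number {
  const index = order.findIndex((slugs) => slugs.includes(slug))
  return index === -1 ? 0 : index + 1
}

const sampleOrder = parseLessonOrder('一,二,三\n人,口\n')
console.log(levelOf(sampleOrder, '口'))
```

Returning `0` for an unscheduled slug mirrors the `level: 0` convention used for subjects that are not in any order file.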
87998888-This is expressed broadly in `other/groups.txt`, and refined into `order/characters.txt`. Vocabulary are a bit more auxillary and are introduced at anytime after the characters are. So in practice, users should reach the word goals a little after the listed levels above.
100100+Subjects not present in any order file are included in the output with
101101+`hiddenAt` set and `level: 0` (accessible but not scheduled for review).
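
The hide-rather-than-delete rule can be sketched as a small pure function. This is a simplified model, not the code in `gen_app_data.ts`: `SketchSubject` is a stand-in for the real `Subject` type, and `markHidden` is a hypothetical name.

```ts
// Simplified stand-in for the project's Subject type (assumption).
interface SketchSubject {
  slug: string
  level: number
  position: number
  hiddenAt?: Date
}

// Scheduled subjects lose any hiddenAt marker; unscheduled ones keep their
// original hiddenAt (or get `now`) and drop to level 0, position 0.
function markHidden(
  subjects: SketchSubject[],
  scheduled: Set<string>,
  now: Date,
): SketchSubject[] {
  return subjects.map((subject): SketchSubject => {
    if (scheduled.has(subject.slug)) {
      return { slug: subject.slug, level: subject.level, position: subject.position }
    }
    return { ...subject, level: 0, position: 0, hiddenAt: subject.hiddenAt ?? now }
  })
}
```

Keeping the earliest `hiddenAt` means a subject that stays out of the curriculum across several builds still records when it was first removed.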
8910290103We currently do not, but in the future, we should also try and keep frequency in mind, to hopefully gracefully prepare users for things such as graded readers, such as [these](https://www.gradedchinesereaders.com) or [these](https://talktaiwanesemandarin.com/books/). Some frequency lists:
9110492105- tocfl data includes sFreq and wFreq
93106- https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php
941079595-## Dictionaries
108108+### Rough level targets
109109+110110+| LVL | CEFR | HSK | TOCFL | Characters | Words |
111111+| ----- | ---- | ----- | ------- | ---------- | ----- |
112112+| 1-10 | | | | | |
113113+| 11-20 | A1 | HSK 2 | TOCFL 1 | ~200 | |
114114+| 21-30 | A2 | HSK 3 | TOCFL 2 | ~900 | ~1200 |
115115+| 31-60 | B1-2 | HSK 5 | TOCFL 4 | ~1900 | ~5000 |
961169797-These are used for mapping hanz/hans/en altogether. Not used right now, but `overrides.tsv` should include locale-specific overrides for
117117+## Audio
981189999-## Future Support
119119+Audio is generated separately from the main build and is not committed to the
120120+repo. Run `gen:audio` for each locale after adding new curriculum items:
100121101101-Maybe we could look at adding other dialects, more regional targets:
122122+```sh
123123+deno task data gen:audio zh_CN character
124124+deno task data gen:audio zh_HK character
125125+deno task data gen:audio zh_TW character
126126+```
102127103103-`cmn`, `hak`, `wuu`, `gan`, `nan`, `hnm`, `hsn`, `cjy`, `zh_SG`, `zh_MY`
128128+Requires `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` in a `.env` file, and
129129+`ffmpeg` on `$PATH`. Generated files land in `www/static/gen/audio/{locale}/`.
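
Because a missing key only surfaces once Azure rejects the request, it may help to check the `.env` values up front. The sketch below is an assumption about how such a check could look; `missingEnvKeys` is a hypothetical helper and is not part of the CLI.

```ts
// Hypothetical helper (assumption): returns the required keys that are
// absent or blank in an env record such as the one loaded from .env.
function missingEnvKeys(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((key) => !env[key]?.trim())
}

const REQUIRED_KEYS = ['AZURE_SPEECH_KEY', 'AZURE_SPEECH_REGION']
const absent = missingEnvKeys({ AZURE_SPEECH_KEY: 'example-key' }, REQUIRED_KEYS)
if (absent.length) console.error(`Missing in .env: ${absent.join(', ')}`)
```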
+242
data/cli/commands/gen_app_data.ts
···11+/**
22+ * Compiles all subject data into the app's JSON files.
33+ *
44+ * For each user-language × target-language pair, reads dictionaries, readings,
55+ * hints, meanings, and sentences, then writes a combined subjects file to
66+ * www/static/gen/lang/{userLang}/{targetLang}.json.
77+ *
88+ * This is the main build step — the output is what the app reads at runtime.
99+ *
1010+ * Idempotent: existing subjects are preserved and merged; hand-edited fields
1111+ * (mnemonics, hints) survive re-runs. Subjects removed from the curriculum
1212+ * are marked with `hiddenAt` rather than deleted.
1313+ *
1414+ * Depends on:
1515+ * - data/lang/ TSVs (dictionaries, readings, hints, meanings)
1616+ * - data/lang/{userLang}/sentences/ (example sentences)
1717+ * - www/static/gen/audio/ (audio files embedded in subjects; optional)
1818+ */
1919+2020+import { Command } from '@cliffy/command'
2121+import { SubjectType } from '$/enums.ts'
2222+import type { Subject } from '$/models/subjects.ts'
2323+import { readAllLocaleHints, readDict, readMeaningOverrides, readMeanings, readReadingsMap } from '../utils/dict.ts'
2424+import { readTsv } from '../utils/fs.ts'
2525+import { readCharacterOrder, readLessonOrder } from '../utils/ordering.ts'
2626+import { loadSentences } from '../utils/sentences.ts'
2727+import { createSubject, readSubjectsMap, writeSubjects } from '../utils/subjects.ts'
2828+import { listAudioFiles } from '../utils/audio.ts'
2929+3030+const USER_LANGS = ['en', 'es']
3131+const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja']
3232+3333+export const genAppDataCmd = new Command()
3434+ .description(
3535+ 'Compile subject data into app JSON (www/static/gen/lang/{userLang}/{targetLang}.json). ' +
3636+ 'Run after updating dictionaries, readings, or curriculum order. ' +
3737+ 'Audio files are included automatically if already present.',
3838+ )
3939+ .action(() => {
4040+ // Load shared dictionary data once for all language pairs
4141+ const charDefs = readDict('lang/characters.tsv')
4242+ const vocabDefs = readDict('lang/vocabulary.tsv')
4343+ const audioFiles = listAudioFiles()
4444+4545+ for (const userLang of USER_LANGS) {
4646+ const charMeanings = readMeanings(`lang/${userLang}/characters.tsv`)
4747+ const vocabMeanings = readMeanings(`lang/${userLang}/vocabulary.tsv`)
4848+ const charMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.characters.tsv`)
4949+ const charReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.characters.tsv`)
5050+ const vocabMeaningHints = readAllLocaleHints(`lang/${userLang}/hints/meaning.vocabulary.tsv`)
5151+ const vocabReadingHints = readAllLocaleHints(`lang/${userLang}/hints/reading.vocabulary.tsv`)
5252+5353+ for (const targetLang of TARGET_LANGS) {
5454+ console.log(`\nGenerating ${userLang}/${targetLang}`)
5555+ const sentences = loadSentences(userLang, targetLang)
5656+5757+ const characterOrder = readCharacterOrder(targetLang)
5858+ const vocabularyOrder = readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`)
5959+ const radicalOrder = readLessonOrder(`lang/${targetLang}/order/radicals.csv`)
6060+6161+ const outPath = `lang/${userLang}/${targetLang}.json`
6262+ // Preserves hand-edited fields (mnemonics, hints) across runs
6363+ const existingSubjects = readSubjectsMap(outPath)
6464+ const updated = new Set<string>()
6565+6666+ // ja splits readings by subject type; all other locales share one file
6767+ const charReadingsMap = targetLang === 'ja'
6868+ ? readReadingsMap('lang/ja/reading.characters.tsv')
6969+ : readReadingsMap(`lang/${targetLang}/readings.tsv`)
7070+ const vocabReadingsMap = targetLang === 'ja'
7171+ ? readReadingsMap('lang/ja/reading.vocabulary.tsv')
7272+ : readReadingsMap(`lang/${targetLang}/readings.tsv`)
7373+7474+ const charMeaningOverrides = readMeaningOverrides(
7575+ `lang/${userLang}/meanings/${targetLang}.characters.tsv`,
7676+ )
7777+ const vocabMeaningOverrides = readMeaningOverrides(
7878+ `lang/${userLang}/meanings/${targetLang}.vocabulary.tsv`,
7979+ )
8080+8181+ // --- Characters ---
8282+ console.log(` characters: ${characterOrder.length} levels`)
8383+ characterOrder.forEach((slugs, index) => {
8484+ const level = index + 1
8585+ slugs.forEach((slug, posIndex) => {
8686+ if (!slug) return
8787+ const position = posIndex + 1
8888+ const subject = existingSubjects[slug] ??
8989+ createSubject(
9090+ slug, level, position, targetLang,
9191+ charMeanings, vocabMeanings, sentences,
9292+ charDefs, vocabDefs, audioFiles,
9393+ )
9494+ subject.data.level = level
9595+ subject.data.position = position
9696+9797+ const readings = charReadingsMap[subject.id]
9898+ if (readings?.length) subject.data.readings = readings
9999+ else if (subject.data.type !== SubjectType.Radical) {
100100+ console.warn(` No readings for character ${slug} (${subject.id}) in ${targetLang}`)
101101+ }
102102+103103+ const override = charMeaningOverrides[subject.id]
104104+ if (override) {
105105+ subject.data.meanings = override.split(';').map((def, i) => ({
106106+ value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true,
107107+ }))
108108+ }
109109+110110+ if (charMeaningHints[subject.id]) subject.data.meaningHint = charMeaningHints[subject.id]
111111+ if (charReadingHints[subject.id]) subject.data.readingHint = charReadingHints[subject.id]
112112+ existingSubjects[slug] = subject
113113+ updated.add(slug)
114114+ })
115115+ })
116116+117117+ // --- Vocabulary ---
118118+ console.log(` vocabulary: ${vocabularyOrder.length} levels`)
119119+ vocabularyOrder.forEach((slugs, index) => {
120120+ const level = index + 1
121121+ slugs.forEach((slug, posIndex) => {
122122+ if (!slug) return
123123+ const position = posIndex + 1
124124+ const subject = existingSubjects[slug] ??
125125+ createSubject(
126126+ slug, level, position, targetLang,
127127+ charMeanings, vocabMeanings, sentences,
128128+ charDefs, vocabDefs, audioFiles,
129129+ )
130130+ subject.data.level = level
131131+ subject.data.position = position
132132+133133+ const readings = vocabReadingsMap[subject.id]
134134+ if (readings?.length) subject.data.readings = readings
135135+ else console.warn(` No readings for vocabulary ${slug} (${subject.id}) in ${targetLang}`)
136136+137137+ const override = vocabMeaningOverrides[subject.id]
138138+ if (override) {
139139+ subject.data.meanings = override.split(';').map((def, i) => ({
140140+ value: def.trim(), isPrimary: i === 0, isAcceptedAnswer: true,
141141+ }))
142142+ }
143143+144144+ if (vocabMeaningHints[subject.id]) subject.data.meaningHint = vocabMeaningHints[subject.id]
145145+ if (vocabReadingHints[subject.id]) subject.data.readingHint = vocabReadingHints[subject.id]
146146+ existingSubjects[slug] = subject
147147+ updated.add(slug)
148148+ })
149149+ })
150150+151151+ // --- Radicals ---
152152+ console.log(` radicals: ${radicalOrder.length} levels`)
153153+ buildRadicals(targetLang, userLang, radicalOrder, existingSubjects, updated)
154154+155155+ writeSubjects(
156156+ outPath,
157157+ Object.values(existingSubjects).map((subject) => {
158158+ if (updated.has(subject.data.slug)) {
159159+ delete subject.hiddenAt
160160+ } else {
161161+ // Subjects removed from the curriculum are hidden, not deleted
162162+ subject.hiddenAt = subject.hiddenAt ?? new Date()
163163+ subject.data.level = 0
164164+ subject.data.position = 0
165165+ }
166166+ return subject
167167+ }),
168168+ )
169169+ }
170170+ }
171171+ })
172172+173173+/**
174174+ * Builds radical subjects from radicals.tsv and the curriculum radical order.
175175+ * Parsed manually because the TSV has an optional trailing `alt` field that
176176+ * trips up the strict CSV parser.
177177+ */
178178+function buildRadicals(
179179+ targetLang: string,
180180+ userLang: string,
181181+ radicalOrder: string[][],
182182+ existingSubjects: Record<string, Subject>,
183183+ updated: Set<string>,
184184+): void {
185185+ const byHant: Record<string, { id: string; hant: string; hans: string }> = {}
186186+ const byAlt: Record<string, { id: string; hant: string; hans: string }> = {}
187187+188188+ Deno.readTextFileSync('./data/lang/radicals.tsv')
189189+ .split('\n')
190190+ .slice(1) // skip header
191191+ .filter((line) => line.trim())
192192+ .forEach((line) => {
193193+ const [id, hant, hans, alt] = line.split('\t')
194194+ const row = { id, hant, hans }
195195+ byHant[hant] = row
196196+ if (alt) {
197197+ alt.split(';').map((a) => a.trim()).filter(Boolean).forEach((a) => { byAlt[a] = row })
198198+ }
199199+ })
200200+201201+ const nameById: Record<string, string> = {}
202202+ readTsv(`lang/${userLang}/radicals.tsv`).forEach((row) => { nameById[row.id] = row.name })
203203+204204+ const isSimplified = targetLang === 'zh_CN'
205205+206206+ radicalOrder.forEach((chars, levelIndex) => {
207207+ const level = levelIndex + 1
208208+ chars.forEach((char, posIndex) => {
209209+ const ch = char.trim()
210210+ if (!ch) return
211211+ const row = byHant[ch] || byAlt[ch]
212212+ if (!row) {
213213+ console.warn(`No radical found for: ${ch}`)
214214+ return
215215+ }
216216+ const { id, hant, hans } = row
217217+ const slug = hant
218218+ const existing = existingSubjects[slug]
219219+ existingSubjects[slug] = {
220220+ ...(existing || {}),
221221+ id,
222222+ learnCards: ['meanings'],
223223+ quizCards: ['meanings'],
224224+ data: {
225225+ ...(existing?.data || {}),
226226+ character: isSimplified ? hans : hant,
227227+ level,
228228+ meanings: nameById[id]
229229+ ? [{ value: nameById[id], isPrimary: true, isAcceptedAnswer: true }]
230230+ : [],
231231+ position: posIndex + 1,
232232+ readings: [],
233233+ requiredSubjects: [],
234234+ slug,
235235+ srsId: 2,
236236+ type: SubjectType.Radical,
237237+ },
238238+ } as Subject
239239+ updated.add(slug)
240240+ })
241241+ })
242242+}
+226
data/cli/commands/gen_audio.ts
···11+/**
22+ * Generates TTS audio files via Azure Cognitive Services.
33+ *
44+ * Fetches audio for all characters or vocabulary that are missing a local file,
55+ * sending each batch of 100 items as one Azure request, then using ffmpeg
66+ * to split the returned audio on silence boundaries.
77+ *
88+ * Requirements:
99+ * - AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in a .env file
1010+ * - ffmpeg installed and on $PATH
1111+ *
1212+ * Usage:
1313+ * deno task data gen:audio <locale> <dict> [limit]
1414+ *
1515+ * Examples:
1616+ * deno task data gen:audio zh_CN character
1717+ * deno task data gen:audio zh_HK vocabulary 100
1818+ */
1919+2020+import { Command } from '@cliffy/command'
2121+import { load } from '@std/dotenv'
2222+import { ensureDir } from '@std/fs'
2323+import { writeAll } from '@std/io'
2424+import { join } from '@std/path'
2525+import { Locale } from '$/enums.ts'
2626+import type { Definition } from '../utils/dict.ts'
2727+import { getFilename, listAudioFiles, VOICE_IDS } from '../utils/audio.ts'
2828+import { readOrderedDefs } from '../utils/ordering.ts'
2929+3030+const GEN_DIR = 'www/static/gen'
3131+const TEMP_DIR = join(GEN_DIR, 'audio', 'tmp')
3232+3333+// ffmpeg silence detection parameters
3434+const MAX_NOISE_LEVEL = -40
3535+const SILENCE_SPLIT = 1
3636+const DETECT_STR = `silencedetect=noise=${MAX_NOISE_LEVEL}dB:d=${SILENCE_SPLIT}`
3737+const MATCH_SILENCE = /silence_start: ([\w.]+)[\s\S]+?silence_end: ([\w.]+)/g
3838+3939+const VALID_LOCALES = [Locale.zh_CN, Locale.zh_HK, Locale.zh_TW]
4040+4141+export const genAudioCmd = new Command()
4242+ .description(
4343+ 'Generate TTS audio files via Azure Cognitive Services. ' +
4444+ 'Requires AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in .env, and ffmpeg on $PATH.',
4545+ )
4646+ .arguments('<locale:string> <dict:string> [limit:number]')
4747+ .action(async (_options, locale: string, dict: string, limit = 0) => {
4848+ if (!VALID_LOCALES.includes(locale as Locale)) {
4949+ console.error(`Invalid locale: ${locale}. Valid: ${VALID_LOCALES.join(', ')}`)
5050+ Deno.exit(1)
5151+ }
5252+ if (!['character', 'vocabulary'].includes(dict)) {
5353+ console.error(`Invalid dict: ${dict}. Use "character" or "vocabulary".`)
5454+ Deno.exit(1)
5555+ }
5656+5757+ const env = await load()
5858+ await genAudio(locale as Locale, dict as 'character' | 'vocabulary', limit, env)
5959+ console.log('COMPLETE!')
6060+ Deno.exit(0)
6161+ })
6262+6363+async function genAudio(
6464+ locale: Locale,
6565+ dict: 'character' | 'vocabulary',
6666+ limit: number,
6767+ env: Record<string, string>,
6868+): Promise<void> {
6969+ await ensureDir(TEMP_DIR)
7070+7171+ const missing = await findMissingAudioFiles(locale, dict, limit)
7272+ if (!missing.length) {
7373+ console.log('No missing audio files — nothing to do.')
7474+ return
7575+ }
7676+7777+ const ttsResults = await ttsAll(locale, missing, VOICE_IDS[locale], env)
7878+ console.log('Source audio groups:', JSON.stringify(ttsResults.map((r) => r.fileName)))
7979+8080+ for (const { groupIndex, fileName, keys } of ttsResults) {
8181+ if (!fileName) {
8282+ console.warn(`Skipping group ${groupIndex}: no fileName (Azure request may have failed)`)
8383+ continue
8484+ }
8585+ console.log('Processing group:', fileName)
8686+ await writeAudioFiles(join(TEMP_DIR, fileName), locale, keys)
8787+ }
8888+}
8989+9090+async function findMissingAudioFiles(
9191+ locale: Locale,
9292+ dict: 'character' | 'vocabulary',
9393+ limit: number,
9494+): Promise<Definition[]> {
9595+ await ensureDir(join(GEN_DIR, 'audio', locale))
9696+ const exists = new Set(listAudioFiles([locale]))
9797+ const ordered = readOrderedDefs(dict, locale)
9898+ const missing = ordered.filter(({ id }) => !exists.has(getFilename(id, locale)))
9999+ return limit ? missing.slice(0, limit) : missing
100100+}
101101+102102+function ttsAll(
103103+ locale: Locale,
104104+ subjects: Definition[],
105105+ voiceId: string,
106106+ env: Record<string, string>,
107107+): Promise<{ groupIndex: number; fileName: string | null; keys: string[] }[]> {
108108+ // Group into batches of 100 to stay within Azure SSML limits
109109+ const groups: Definition[][] = []
110110+ subjects.forEach((subject, index) => {
111111+ const groupIndex = Math.floor(index / 100)
112112+ if (!groups[groupIndex]) groups[groupIndex] = []
113113+ groups[groupIndex].push(subject)
114114+ })
115115+116116+ console.log(
117117+ `About to request TTS for ${subjects.length} items ` +
118118+ `(${subjects[0]?.id} → ${subjects[subjects.length - 1]?.id}).`,
119119+ )
120120+ if (!confirm('Proceed?')) {
121121+ console.log('Aborted.')
122122+ Deno.exit(0)
123123+ }
124124+125125+ return Promise.all(
126126+ groups.map(async (batch, groupIndex) => ({
127127+ groupIndex,
128128+ fileName: await ttsAzure(batch.map((s) => s.hant), voiceId, locale, groupIndex, env),
129129+ keys: batch.map((s) => s.id),
130130+ })),
131131+ )
132132+}
133133+134134+async function ttsAzure(
135135+ texts: string[],
136136+ voiceId: string,
137137+ locale: Locale,
138138+ groupIndex: number,
139139+ env: Record<string, string>,
140140+): Promise<string | null> {
141141+ const SILENCE_BETWEEN_S = 2
142142+ const region = env['AZURE_SPEECH_REGION']
143143+ const url = `https://${region}.tts.speech.microsoft.com/cognitiveservices/v1`
144144+145145+ const response = await fetch(url, {
146146+ method: 'POST',
147147+ headers: {
148148+ 'Ocp-Apim-Subscription-Key': env['AZURE_SPEECH_KEY'],
149149+ 'Content-Type': 'application/ssml+xml',
150150+ 'X-Microsoft-OutputFormat': 'audio-16khz-128kbitrate-mono-mp3',
151151+ 'User-Agent': 'curl',
152152+ },
153153+ body: `
154154+ <speak version='1.0' xml:lang='${locale}'>
155155+ <voice name='${voiceId}' xml:lang='${locale}'>
156156+ <prosody rate="-20.00%">
157157+ ${texts.join(`, <break time="${SILENCE_BETWEEN_S}s"/> `)}
158158+ </prosody>
159159+ </voice>
160160+ </speak>
161161+ `,
162162+ })
163163+164164+ if (response.status > 399) {
165165+ console.warn(`Azure error ${response.status}:`, await response.text())
166166+ return null
167167+ }
168168+169169+ const fileName = `${locale}_${groupIndex}.mp3`
170170+ const file = await Deno.open(join(TEMP_DIR, fileName), { create: true, write: true, truncate: true })
171171+ await writeAll(file, new Uint8Array(await response.arrayBuffer())).finally(() => file.close())
172172+ return fileName
173173+}
174174+175175+async function writeAudioFiles(
176176+ sourceFile: string,
177177+ locale: Locale,
178178+ keys: string[],
179179+): Promise<void> {
180180+ const audioDir = join(GEN_DIR, 'audio', locale)
181181+ await Deno.mkdir(audioDir, { recursive: true })
182182+183183+ // Use ffmpeg to detect silence boundaries between spoken words
184184+ const { stderr } = await new Deno.Command('ffmpeg', {
185185+ stdout: 'piped',
186186+ args: ['-i', sourceFile, '-af', DETECT_STR, '-f', 'null', '-'],
187187+ }).output()
188188+ const detectOutput = new TextDecoder().decode(stderr)
189189+190190+ let match = MATCH_SILENCE.exec(detectOutput)
191191+ let clipStartMS = 0
192192+ let count = 0
193193+194194+ while (match) {
195195+ const [_, silenceStartS, silenceEndS] = match
196196+ const silenceStartMS = Math.round(1000 * parseFloat(silenceStartS))
197197+ // Shift end back slightly to avoid clipping the next word's start
198198+ const silenceEndMS = Math.round(1000 * (parseFloat(silenceEndS) - 0.1))
199199+200200+ const outFile = join(audioDir, getFilename(keys[count], locale))
201201+ const seek = `${Math.max(0, clipStartMS)}ms`
202202+ const len = `${silenceStartMS - (clipStartMS + 0.1)}ms`
203203+204204+ await new Deno.Command('ffmpeg', {
205205+ stdout: 'piped',
206206+ args: ['-ss', seek, '-t', len, '-i', sourceFile, '-c:a', 'copy', outFile],
207207+ }).output()
208208+209209+ count++
210210+ clipStartMS = silenceEndMS
211211+ match = MATCH_SILENCE.exec(detectOutput)
212212+ }
213213+214214+ // Write the final clip (no trailing silence)
215215+ if (!keys[count]) {
216216+ console.warn(`Key/audio mismatch in ${sourceFile} — got ${count} clips for ${keys.length} keys`)
217217+ return
218218+ }
219219+ const outFile = join(audioDir, getFilename(keys[count], locale))
220220+ await new Deno.Command('ffmpeg', {
221221+ stdout: 'piped',
222222+ args: ['-ss', `${Math.max(0, clipStartMS)}ms`, '-i', sourceFile, '-c:a', 'copy', outFile],
223223+ }).output()
224224+225225+ console.log(` wrote ${count + 1} audio files to ${audioDir}`)
226226+}
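
The splitting step in `writeAudioFiles` can be reasoned about separately from ffmpeg: N detected silences yield N + 1 clips. The sketch below models that as two pure functions, reusing the regex from this file. `parseSilences` and `clipsFromSilences` are hypothetical names, the sample log is hand-written rather than real ffmpeg output, and the sketch leaves out the 0.1 s padding the command applies.

```ts
interface Silence { startMs: number; endMs: number }
// endMs === null means "run to the end of the source file".
interface Clip { startMs: number; endMs: number | null }

// Extracts silence intervals from silencedetect log text. The regex is
// created per call, so no lastIndex state is shared between calls.
function parseSilences(ffmpegLog: string): Silence[] {
  const pattern = /silence_start: ([\w.]+)[\s\S]+?silence_end: ([\w.]+)/g
  const silences: Silence[] = []
  let match = pattern.exec(ffmpegLog)
  while (match) {
    silences.push({
      startMs: Math.round(1000 * parseFloat(match[1])),
      endMs: Math.round(1000 * parseFloat(match[2])),
    })
    match = pattern.exec(ffmpegLog)
  }
  return silences
}

// Each clip runs from the end of the previous silence to the start of the
// next one; the final clip has no trailing silence.
function clipsFromSilences(silences: Silence[]): Clip[] {
  const clips: Clip[] = []
  let clipStart = 0
  for (const silence of silences) {
    clips.push({ startMs: clipStart, endMs: silence.startMs })
    clipStart = silence.endMs
  }
  clips.push({ startMs: clipStart, endMs: null })
  return clips
}

const sampleLog = [
  'silence_start: 1.5',
  'silence_end: 3.5 | silence_duration: 2',
  'silence_start: 5',
  'silence_end: 7 | silence_duration: 2',
].join('\n')
console.log(clipsFromSilences(parseSilences(sampleLog)))
```

Comparing `clips.length` against `keys.length` before writing any files would catch a key/audio mismatch earlier than the check after the loop.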
+21
data/cli/commands/gen_licenses.ts
···11+/**
22+ * Fetches license texts from URLs listed in data/other/licenses.tsv and
33+ * writes the combined result to www/static/gen/licenses.json.
44+ */
55+66+import { Command } from '@cliffy/command'
77+import { readTsv, writeAppJson } from '../utils/fs.ts'
88+99+export const genLicensesCmd = new Command()
1010+ .description('Fetch license texts and write to www/static/gen/licenses.json.')
1111+ .action(async () => {
1212+ const licenseList = readTsv('other/licenses.tsv')
1313+ const licenses = await Promise.all(
1414+ licenseList.map(async ({ name, href }) => {
1515+ const text = await (await fetch(href)).text()
1616+ return { name, href, text }
1717+ }),
1818+ )
1919+ writeAppJson('licenses.json', licenses)
2020+ console.log(` wrote licenses.json (${licenses.length} licenses)`)
2121+ })
+67
data/cli/commands/gen_progress.ts
···11+/**
22+ * Generates progress-tracking JSON files for external word lists.
33+ *
44+ * Writes to www/static/gen/progress/:
55+ * - hsk.json — HSK 3.0 vocabulary bands (zh_CN)
66+ * - tocfl.json — TOCFL vocabulary levels (zh_TW)
77+ * - jlpt-kanji.json — JLPT kanji levels (ja)
88+ * - jlpt-vocab.json — JLPT vocabulary levels (ja)
99+ *
1010+ * These are used by the app's stats/progress pages to show users how many
1111+ * HSK, TOCFL, or JLPT items they've already learned.
1212+ */
1313+1414+import { Command } from '@cliffy/command'
1515+import * as OpenCC from 'opencc-js'
1616+import { readTsv, writeAppJson } from '../utils/fs.ts'
1717+1818+export const genProgressCmd = new Command()
1919+ .description(
2020+ 'Generate HSK, TOCFL, and JLPT progress JSON files for www/static/gen/progress/.',
2121+ )
2222+ .action(() => {
2323+ const toSimplified = OpenCC.Converter({ from: 'hk', to: 'cn' })
2424+ const toTraditional = OpenCC.Converter({ from: 'cn', to: 'hk' })
2525+2626+ Deno.mkdirSync('www/static/gen/progress', { recursive: true })
2727+2828+ writeAppJson(
2929+ 'progress/hsk.json',
3030+ readTsv('lang/zh_CN/progress/hsk.tsv').map((row) => ({
3131+ level: Number(row.band),
3232+ id: Number(row.no),
3333+ simplified: row.hans,
3434+ traditional: toTraditional(row.hans),
3535+ })),
3636+ )
3737+ console.log(' wrote progress/hsk.json')
3838+3939+ writeAppJson(
4040+ 'progress/tocfl.json',
4141+ readTsv('lang/zh_TW/progress/tocfl.tsv').map((row) => ({
4242+ level: Number(row.level),
4343+ id: Number(row.id),
4444+ simplified: toSimplified(row.hant),
4545+ traditional: row.hant,
4646+ })),
4747+ )
4848+ console.log(' wrote progress/tocfl.json')
4949+5050+ writeAppJson(
5151+ 'progress/jlpt-kanji.json',
5252+ readTsv('lang/ja/progress/jlpt-kanji.tsv').map((row) => ({
5353+ level: Number(row.level),
5454+ kanji: row.kanji,
5555+ })),
5656+ )
5757+ console.log(' wrote progress/jlpt-kanji.json')
5858+5959+ writeAppJson(
6060+ 'progress/jlpt-vocab.json',
6161+ readTsv('lang/ja/progress/jlpt-vocab.tsv').map((row) => ({
6262+ level: Number(row.level),
6363+ chars: row.chars,
6464+ })),
6565+ )
6666+ console.log(' wrote progress/jlpt-vocab.json')
6767+ })
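
The HSK mapping in `gen_progress.ts` above can be tried without the filesystem. This is a minimal sketch, not the command's code: `toHskEntries`, `HskRow`, and `fakeToTraditional` are hypothetical names, and the stub converter is a two-character lookup table standing in for the OpenCC converter.

```typescript
// Hypothetical row shape, mirroring the `band`, `no`, and `hans` columns read from hsk.tsv.
type HskRow = { band: string; no: string; hans: string }

type HskEntry = { level: number; id: number; simplified: string; traditional: string }

// Maps parsed TSV rows to progress entries. The converter is injected so that
// the real OpenCC converter (or a stub, as here) can be supplied by the caller.
function toHskEntries(rows: HskRow[], toTraditional: (s: string) => string): HskEntry[] {
  return rows.map((row) => ({
    level: Number(row.band),
    id: Number(row.no),
    simplified: row.hans,
    traditional: toTraditional(row.hans),
  }))
}

// Stub converter for illustration only: a tiny lookup table, not OpenCC.
const STUB_TABLE: Record<string, string> = { '爱': '愛', '学': '學' }
const fakeToTraditional = (s: string): string =>
  Array.from(s).map((ch) => STUB_TABLE[ch] ?? ch).join('')

const hskSample = toHskEntries(
  [{ band: '1', no: '1', hans: '爱' }, { band: '1', no: '2', hans: '学生' }],
  fakeToTraditional,
)
console.log(hskSample)
```

Injecting the converter keeps the mapping a pure function, so it can be checked in isolation from both the TSV reader and the JSON writer.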
+228
data/cli/commands/studio/dicts.ts
···11+/**
22+ * Studio command: updates readings and meanings TSVs from dictionary sources.
33+ *
44+ * "Studio" commands are for data managers updating source files — they are not
55+ * part of the normal app build and do not need to be run by code contributors.
66+ *
77+ * What this does:
88+ * - Appends missing entries to lang/{userLang}/characters.tsv and vocabulary.tsv
99+ * (new items get a "[todo: add definition]" placeholder)
1010+ * - Fills in missing readings in the target-language readings TSVs using
1111+ * dictionary lookups (pinyin, jyutping, zhuyin, kunyomi/onyomi, hiragana)
1212+ *
1313+ * Existing entries are never overwritten — only missing ones are added.
1414+ * Run this after adding new characters or vocabulary to a curriculum order file.
1515+ *
1616+ * Generated files:
1717+ * - lang/en/characters.tsv, lang/es/characters.tsv (meanings)
1818+ * - lang/en/vocabulary.tsv, lang/es/vocabulary.tsv (meanings)
1919+ * - lang/zh_CN/readings.tsv (pinyin)
2020+ * - lang/zh_HK/readings.tsv (jyutping)
2121+ * - lang/zh_TW/readings.tsv (pinyin + zhuyin)
2222+ * - lang/ja/reading.characters.tsv (kunyomi + onyomi)
2323+ * - lang/ja/reading.vocabulary.tsv (hiragana)
2424+ */
2525+2626+import { Command } from '@cliffy/command'
2727+import pinyin from 'chinese-to-pinyin'
2828+import { p2z } from 'pinyin-to-zhuyin'
2929+import * as OpenCC from 'opencc-js'
3030+import { toJyutping } from '$/utils/jyutping.ts'
3131+import { type Definition, readDict } from '../../utils/dict.ts'
3232+import { readCsv, readJson, readTsv, writeTsv } from '../../utils/fs.ts'
3333+3434+const USER_LANGS = ['en', 'es']
3535+3636+export const updateDictsCmd = new Command()
3737+ .description(
3838+ 'Update readings and meanings TSVs from dictionary sources. ' +
3939+ 'Only adds missing entries — never overwrites existing ones. ' +
4040+ 'Run after adding new characters or vocabulary to a curriculum order file.',
4141+ )
4242+ .action(async () => {
4343+ const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' })
4444+4545+ const characters = readDict('lang/characters.tsv')
4646+ const vocabulary = readDict('lang/vocabulary.tsv')
4747+4848+ // Pinyin source: General Standard Chinese Characters (通用规范汉字表)
4949+ const pinyinMap: Record<string, string> = {}
5050+ readTsv('lang/zh_CN/source/standard.tsv').forEach((row) => {
5151+ pinyinMap[row.hans] = row.pinyin
5252+ })
5353+5454+ // zh_TW pronunciation source (TOCFL 詞語表)
5555+ const pinyinTwMap: Record<string, string> = {}
5656+ const zhuyinMap: Record<string, string> = {}
5757+ readCsv('lang/zh_TW/sources/詞語表202504.csv').forEach((row) => {
5858+ pinyinTwMap[row.word] = row.pinyin
5959+ zhuyinMap[row.word] = row.bopomofo
6060+ })
6161+6262+ // Japanese dictionary sources
6363+ const kanji = readJson<Record<string, { kanji: string; meaning: string; kun: string[]; on: string[] }>>(
6464+ 'lang/ja/source/kanji.json',
6565+ )
6666+ const jaVocab = readJson<Record<string, [string, string]>>('lang/ja/source/vocab.json')
6767+6868+ function getPinyin(hans: string): string {
6969+ return pinyinMap[hans] ?? pinyin(hans)
7070+ }
7171+7272+ function getTwPinyin(hant: string): string {
7373+ return pinyinTwMap[hant] ?? getPinyin(toCN(hant))
7474+ }
7575+7676+ function getZhuyin(hant: string): string {
7777+ if (zhuyinMap[hant]) return zhuyinMap[hant]
7878+ try {
7979+ const result = p2z(getTwPinyin(hant), {
8080+ tonemarks: true,
8181+ inputHasToneMarks: true,
8282+ convertPunctuation: false,
8383+ })
8484+ if (!result?.trim()) throw new Error('Empty result')
8585+ return result
8686+ } catch (err) {
8787+ console.warn(`Failed to convert "${hant}" pinyin to zhuyin:`, err)
8888+ return ''
8989+ }
9090+ }
9191+9292+ // --- Meanings ---
9393+ for (const userLang of USER_LANGS) {
9494+ await updateMeanings(userLang, 'characters', characters)
9595+ await updateMeanings(userLang, 'vocabulary', vocabulary)
9696+ }
9797+9898+ // --- Readings ---
9999+ updateReadings(
100100+ 'lang/zh_CN/readings.tsv',
101101+ ['id', 'pinyin'],
102102+ (def) => ({ id: def.id, pinyin: getPinyin(def.hans) }),
103103+ [...characters, ...vocabulary],
104104+ )
105105+ updateReadings(
106106+ 'lang/zh_HK/readings.tsv',
107107+ ['id', 'jyutping'],
108108+ (def) => ({ id: def.id, jyutping: toJyutping(def.hant) }),
109109+ [...characters, ...vocabulary],
110110+ )
111111+ updateReadings(
112112+ 'lang/zh_TW/readings.tsv',
113113+ ['id', 'pinyin', 'zhuyin'],
114114+ (def) => ({ id: def.id, pinyin: getTwPinyin(def.hant), zhuyin: getZhuyin(def.hant) }),
115115+ [...characters, ...vocabulary],
116116+ )
117117+118118+ // --- Japanese readings ---
119119+ updateJaCharReadings(characters, kanji)
120120+ updateJaVocabReadings(vocabulary, jaVocab)
121121+122122+ console.log('Done.')
123123+ })
124124+125125+/**
126126+ * Appends missing entries to lang/{userLang}/{type}.tsv.
127127+ * New items get a "[todo: add definition]" placeholder value.
128128+ */
129129+async function updateMeanings(
130130+ userLang: string,
131131+ type: 'characters' | 'vocabulary',
132132+ defs: Definition[],
133133+): Promise<void> {
134134+ const filePath = `lang/${userLang}/${type}.tsv`
135135+ const existing = new Map(
136136+ readTsv(filePath).map((row) => [row.id, { hant: row.hant, value: row.value }]),
137137+ )
138138+139139+ let added = 0
140140+ for (const def of defs) {
141141+ if (!existing.has(def.id)) {
142142+ existing.set(def.id, { hant: def.hant, value: '[todo: add definition]' })
143143+ added++
144144+ }
145145+ }
146146+147147+ if (added > 0) {
148148+ console.log(` ${filePath}: adding ${added} missing entries`)
149149+ const rows = [...existing.entries()]
150150+ .map(([id, { hant, value }]) => ({ id, hant, value }))
151151+ .sort((a, b) => a.id.localeCompare(b.id))
152152+ writeTsv(filePath, ['id', 'hant', 'value'], rows)
153153+ }
154154+}
155155+156156+/**
157157+ * Appends missing readings to a readings TSV.
158158+ * Existing rows are preserved as-is (manual overrides are safe).
159159+ */
160160+function updateReadings(
161161+ filePath: string,
162162+ columns: string[],
163163+ compute: (def: Definition) => Record<string, string>,
164164+ defs: Definition[],
165165+): void {
166166+ const existing = new Map<string, Record<string, string>>()
167167+ try {
168168+ readTsv(filePath).forEach((row) => existing.set(row.id, row))
169169+ } catch { /* file doesn't exist yet */ }
170170+171171+ let added = 0
172172+ for (const def of defs) {
173173+ if (!existing.has(def.id)) {
174174+ existing.set(def.id, compute(def))
175175+ added++
176176+ }
177177+ }
178178+179179+ if (added > 0) {
180180+ console.log(` ${filePath}: adding ${added} entries`)
181181+ writeTsv(filePath, columns, [...existing.values()])
182182+ }
183183+}
184184+185185+function updateJaCharReadings(
186186+ characters: Definition[],
187187+ kanji: Record<string, { kun: string[]; on: string[] }>,
188188+): void {
189189+ type CharReading = { id: string; kunyomi: string; onyomi: string }
190190+ const existing = new Map<string, CharReading>()
191191+ try {
192192+ readTsv('lang/ja/reading.characters.tsv').forEach((row) =>
193193+ existing.set(row.id, row as unknown as CharReading)
194194+ )
195195+ } catch { /* file doesn't exist yet */ }
196196+197197+ for (const def of characters) {
198198+ if (!def.ja || existing.has(def.id)) continue
199199+ const data = kanji[def.ja]
200200+ if (!data) continue
201201+ existing.set(def.id, { id: def.id, kunyomi: data.kun.join('; '), onyomi: data.on.join('; ') })
202202+ }
203203+204204+ writeTsv('lang/ja/reading.characters.tsv', ['id', 'kunyomi', 'onyomi'], [...existing.values()])
205205+ console.log(' lang/ja/reading.characters.tsv: updated')
206206+}
207207+208208+function updateJaVocabReadings(
209209+ vocabulary: Definition[],
210210+ jaVocab: Record<string, [string, string]>,
211211+): void {
212212+ const existing = new Map<string, { id: string; reading: string }>()
213213+ try {
214214+ readTsv('lang/ja/reading.vocabulary.tsv').forEach((row) =>
215215+ existing.set(row.id, { id: row.id, reading: row.reading })
216216+ )
217217+ } catch { /* file doesn't exist yet */ }
218218+219219+ for (const def of vocabulary) {
220220+ if (!def.ja || existing.has(def.id)) continue
221221+ const entry = jaVocab[def.ja]
222222+ if (!entry) continue
223223+ existing.set(def.id, { id: def.id, reading: entry[0] })
224224+ }
225225+226226+ writeTsv('lang/ja/reading.vocabulary.tsv', ['id', 'reading'], [...existing.values()])
227227+ console.log(' lang/ja/reading.vocabulary.tsv: updated')
228228+}
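
The "only add what is missing" rule used by `updateReadings` in `dicts.ts` can be sketched in memory as follows. This is an illustration under assumptions: `addMissingRows`, `Def`, and `Row` are hypothetical names, and the sample values are made up.

```typescript
// Minimal stand-in for the Definition type; only the fields used here.
type Def = { id: string; hans: string }
type Row = Record<string, string>

// Adds a computed row for every definition whose id is not already present.
// Existing rows are left untouched, so hand-edited values survive a re-run.
function addMissingRows(
  existing: Map<string, Row>,
  defs: Def[],
  compute: (def: Def) => Row,
): number {
  let added = 0
  for (const def of defs) {
    if (existing.has(def.id)) continue
    existing.set(def.id, compute(def))
    added++
  }
  return added
}

// One row was edited by hand; the other definition has no row yet.
const readingRows = new Map<string, Row>([
  ['c-00001', { id: 'c-00001', pinyin: 'hand-edited' }],
])
const addedCount = addMissingRows(
  readingRows,
  [{ id: 'c-00001', hans: '一' }, { id: 'c-00002', hans: '二' }],
  (def) => ({ id: def.id, pinyin: `generated for ${def.hans}` }),
)
console.log(addedCount, Array.from(readingRows.values()))
```

Returning the number of added rows lets the caller skip the file write when nothing changed, which is the same check `updateReadings` makes with `added > 0`.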
+112
data/cli/commands/studio/ordering.ts
···11+/**
22+ * Studio command: sorts user-language dictionary TSVs.
33+ *
44+ * "Studio" commands are for data managers updating source files — they are not
55+ * part of the normal app build.
66+ *
77+ * By default, sorts all user-language TSVs by subject id (the canonical committed state).
88+ * With --lang, sorts by that target language's curriculum order instead — useful when
99+ * you want to review a file in the order subjects appear in a specific course.
1010+ *
1111+ * Files sorted (in data/lang/{userLang}/):
1212+ * - characters.tsv, vocabulary.tsv, radicals.tsv
1313+ * - hints/meaning.characters.tsv, hints/meaning.vocabulary.tsv
1414+ * - hints/reading.characters.tsv, hints/reading.vocabulary.tsv
1515+ * - meanings/{targetLang}.characters.tsv, meanings/{targetLang}.vocabulary.tsv
1616+ *
1717+ * Examples:
1818+ * deno task data studio sort:dicts
1919+ * deno task data studio sort:dicts --lang=zh_CN
2020+ */
2121+2222+import { Command } from '@cliffy/command'
2323+import { readDictByHant } from '../../utils/dict.ts'
2424+import { readCharacterOrder, readLessonOrder } from '../../utils/ordering.ts'
2525+import { DATA_ROOT, readTsv, writeTsv } from '../../utils/fs.ts'
2626+2727+const USER_LANGS = ['en', 'es']
2828+const TARGET_LANGS = ['zh_CN', 'zh_HK', 'zh_TW', 'ja']
2929+3030+export const sortDictsCmd = new Command()
3131+ .description(
3232+ 'Sort user-language dictionary TSVs. ' +
3333+ 'Default: sort by id (canonical committed order). ' +
3434+ 'With --lang: sort by that target language\'s curriculum order.',
3535+ )
3636+ .option('--lang <targetLang:string>', 'Sort by curriculum order of this target language')
3737+ .action(({ lang: targetLang }) => {
3838+ if (targetLang && !TARGET_LANGS.includes(targetLang)) {
3939+ console.error(`Unknown --lang: ${targetLang}. Valid: ${TARGET_LANGS.join(', ')}`)
4040+ Deno.exit(1)
4141+ }
4242+4343+ const charIdOrder = targetLang ? buildIdOrder('character', targetLang) : null
4444+ const vocabIdOrder = targetLang ? buildIdOrder('vocabulary', targetLang) : null
4545+ const meaningTargetLangs = targetLang ? [targetLang] : TARGET_LANGS
4646+4747+ for (const userLang of USER_LANGS) {
4848+ console.log(`\nSorting lang/${userLang}/`)
4949+ sortFile(`lang/${userLang}/characters.tsv`, charIdOrder)
5050+ sortFile(`lang/${userLang}/radicals.tsv`, null)
5151+ sortFile(`lang/${userLang}/vocabulary.tsv`, vocabIdOrder)
5252+ sortFile(`lang/${userLang}/hints/meaning.characters.tsv`, charIdOrder)
5353+ sortFile(`lang/${userLang}/hints/meaning.vocabulary.tsv`, vocabIdOrder)
5454+ sortFile(`lang/${userLang}/hints/reading.characters.tsv`, charIdOrder)
5555+ sortFile(`lang/${userLang}/hints/reading.vocabulary.tsv`, vocabIdOrder)
5656+ for (const tl of meaningTargetLangs) {
5757+ sortFile(`lang/${userLang}/meanings/${tl}.characters.tsv`, charIdOrder)
5858+ sortFile(`lang/${userLang}/meanings/${tl}.vocabulary.tsv`, vocabIdOrder)
5959+ }
6060+ }
6161+ console.log('\nDone.')
6262+ })
6363+6464+/** Builds a map of subject id → curriculum position for a given type and target language. */
6565+function buildIdOrder(
6666+ type: 'character' | 'vocabulary',
6767+ targetLang: string,
6868+): Map<string, number> {
6969+ const slugs = type === 'character'
7070+ ? readCharacterOrder(targetLang).flat()
7171+ : readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`).flat()
7272+ const dictPath = type === 'character' ? 'lang/characters.tsv' : 'lang/vocabulary.tsv'
7373+ const byHant = readDictByHant(dictPath)
7474+ const map = new Map<string, number>()
7575+ slugs.forEach((slug, i) => {
7676+ const def = byHant[slug]
7777+ if (def && !map.has(def.id)) map.set(def.id, i)
7878+ })
7979+ return map
8080+}
8181+8282+/** Reads the raw header line from a TSV (preserving original column order). */
8383+function readHeaders(relPath: string): string[] {
8484+ const text = Deno.readTextFileSync(DATA_ROOT + relPath)
8585+ return text.split('\n')[0].split('\t').map((h) => h.replace('\r', ''))
8686+}
8787+8888+/** Sorts a TSV file in place. Silently skips files that don't exist. */
8989+function sortFile(relPath: string, orderMap: Map<string, number> | null): void {
9090+ let rows: Record<string, string>[]
9191+ let headers: string[]
9292+ try {
9393+ headers = readHeaders(relPath)
9494+ rows = readTsv(relPath)
9595+ } catch {
9696+ return // file doesn't exist for this user/target lang combo
9797+ }
9898+ if (!rows.length) return
9999+100100+ if (orderMap) {
101101+ rows.sort((a, b) => {
102102+ const posA = orderMap.get(a.id) ?? Infinity
103103+ const posB = orderMap.get(b.id) ?? Infinity
104104+ return posA - posB
105105+ })
106106+ } else {
107107+ rows.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
108108+ }
109109+110110+ writeTsv(relPath, headers, rows)
111111+ console.log(` sorted: ${relPath}`)
112112+}
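
The two sort modes of `sortFile` in `ordering.ts` can be sketched on in-memory rows. This is a hedged variant rather than a copy: `sortRows` is a hypothetical name, it returns a sorted copy instead of writing a file, it uses `Number.MAX_SAFE_INTEGER` instead of `Infinity` for rows missing from the order map, and it breaks ties by id. Those three choices are assumptions of this sketch.

```typescript
type TsvRow = Record<string, string>

// Sorts rows by curriculum position when an order map is given, otherwise by id.
// Rows missing from the order map go last; ties fall back to id for a stable result.
function sortRows(rows: TsvRow[], orderMap: Map<string, number> | null): TsvRow[] {
  const byId = (a: TsvRow, b: TsvRow): number => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)
  if (!orderMap) return [...rows].sort(byId)
  return [...rows].sort((a, b) => {
    const posA = orderMap.get(a.id) ?? Number.MAX_SAFE_INTEGER
    const posB = orderMap.get(b.id) ?? Number.MAX_SAFE_INTEGER
    return posA - posB || byId(a, b)
  })
}

const sampleRows: TsvRow[] = [
  { id: 'c-00003', value: 'three' },
  { id: 'c-00001', value: 'one' },
  { id: 'c-00002', value: 'two' },
]
// Hypothetical course order: c-00002 is taught first, c-00001 is not in the course.
const courseOrder = new Map<string, number>([['c-00002', 0], ['c-00003', 1]])

const idOrderResult = sortRows(sampleRows, null).map((r) => r.id)
const courseOrderResult = sortRows(sampleRows, courseOrder).map((r) => r.id)
console.log(idOrderResult, courseOrderResult)
```

Using a finite sentinel avoids the `Infinity - Infinity` case, which evaluates to `NaN`; sort implementations treat that as "equal", so the explicit id tiebreak only makes the ordering of unlisted rows predictable.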
+170
data/cli/commands/studio/sentences.ts
···11+/**
22+ * Processes raw Tatoeba sentence TSV exports into the app's sentence files.
33+ *
44+ * Reads source files from data/lang/{userLang}/sources/ and writes processed
55+ * sentences to data/lang/{userLang}/sentences/{targetLang}.tsv.
66+ *
77+ * Run this after downloading new sentence pairs from https://tatoeba.org.
88+ * Source files are auto-detected by name pattern — no date argument needed.
99+ *
1010+ * Expected source filename format:
1111+ * "Sentence pairs in {Language}-{Target} - {date}.tsv"
1212+ * e.g. "Sentence pairs in English-Mandarin Chinese - 2026-03-13.tsv"
1313+ *
1414+ * Processing steps:
1515+ * 1. Parse and validate rows (filters empty/too-long sentences)
1616+ * 2. Convert zh_CN ↔ zh_TW using OpenCC
1717+ * 3. Deduplicate by sentence text
1818+ * 4. Sort by simplicity (characters appearing earlier in the curriculum score lower)
1919+ */
2020+2121+import { Command } from '@cliffy/command'
2222+import { parse } from '@std/csv/parse'
2323+import { stringify } from '@std/csv/stringify'
2424+import * as OpenCC from 'opencc-js'
2525+import { readCharacterOrder } from '../../utils/ordering.ts'
2626+import { readDict } from '../../utils/dict.ts'
2727+2828+const toCN = OpenCC.Converter({ from: 'tw', to: 'cn' })
2929+const toTW = OpenCC.Converter({ from: 'cn', to: 'tw' })
3030+3131+const USER_LANGS = ['en', 'es']
3232+3333+const SOURCE_TARGET_NAMES: Record<string, string> = {
3434+ zh_CN: 'Mandarin Chinese',
3535+ zh_HK: 'Cantonese',
3636+ zh_TW: 'Mandarin Chinese', // TW sentences are converted from CN source
3737+ ja: 'Japanese',
3838+}
3939+4040+interface Sentence {
4141+ id: number
4242+ value: string
4343+ enId: number
4444+ en: string
4545+}
4646+4747+export const updateSentencesCmd = new Command()
4848+ .description(
4949+ 'Process raw Tatoeba sentence exports into app sentence TSVs (data/lang/{userLang}/sentences/). ' +
5050+ 'Run after downloading new sentence pairs from tatoeba.org.',
5151+ )
5252+ .action(() => {
5353+ const charDefs = readDict('lang/characters.tsv')
5454+ const hantToHans: Record<string, string> = Object.fromEntries(
5555+ charDefs.map((d) => [d.hant, d.hans]),
5656+ )
5757+5858+ for (const langCode of USER_LANGS) {
5959+ processLang(langCode, hantToHans)
6060+ }
6161+ console.log('Done.')
6262+ })
6363+6464+function userLangName(langCode: string): string {
6565+ return langCode === 'es' ? 'Spanish' : 'English'
6666+}
6767+6868+/** Finds the most recent source file matching a name prefix in the sources directory. */
6969+function findSourceFile(langCode: string, targetName: string): string | null {
7070+ const dir = `./data/lang/${langCode}/sources`
7171+ const prefix = `Sentence pairs in ${userLangName(langCode)}-${targetName}`
7272+ try {
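7373+ // ISO dates in the filename sort lexicographically, so the last name is the newest.
7474+ const names = Array.from(Deno.readDirSync(dir), (entry) => entry.name)
7575+ .filter((name) => name.startsWith(prefix) && name.endsWith('.tsv'))
7676+ .sort()
7777+ if (names.length) return `${dir}/${names[names.length - 1]}`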
7878+ } catch { /* directory missing */ }
7979+ return null
8080+}
8181+8282+function processLang(langCode: string, hantToHans: Record<string, string>): void {
8383+ const cnPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.zh_CN)
8484+ const hkPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.zh_HK)
8585+ const jaPath = findSourceFile(langCode, SOURCE_TARGET_NAMES.ja)
8686+8787+ if (!cnPath || !hkPath || !jaPath) {
8888+ console.warn(`Source files not found for ${langCode} — skipping. Expected files in data/lang/${langCode}/sources/`)
8989+ return
9090+ }
9191+9292+ const readSourceFile = (path: string): Sentence[] =>
9393+ parse(Deno.readTextFileSync(path), { separator: '\t', lazyQuotes: true })
9494+ .map(([enIdStr, en, idStr, value]) => {
9595+ const enId = parseInt(enIdStr)
9696+ const id = parseInt(idStr)
9797+ if (!enId || !id || !en || !value || value.length > 15) return null
9898+ return { id, value, enId, en }
9999+ })
100100+ .filter((row): row is Sentence => row != null)
101101+102102+ const cnRaw = readSourceFile(cnPath)
103103+ const hkRaw = readSourceFile(hkPath)
104104+ const jaRaw = readSourceFile(jaPath)
105105+106106+ const locales: [string, Sentence[]][] = [
107107+ ['zh_CN', cnRaw.map((s) => ({ ...s, value: toCN(s.value) }))],
108108+ ['zh_TW', cnRaw.map((s) => ({ ...s, value: toTW(s.value) }))],
109109+ ['zh_HK', hkRaw],
110110+ ['ja', jaRaw],
111111+ ]
112112+113113+ Deno.mkdirSync(`./data/lang/${langCode}/sentences`, { recursive: true })
114114+115115+ for (const [locale, sentences] of locales) {
116116+ const positionMap = buildPositionMap(locale, hantToHans)
117117+ const processed = dedupeAndSort(sentences, positionMap)
118118+ const outPath = `./data/lang/${langCode}/sentences/${locale}.tsv`
119119+ Deno.writeTextFileSync(
120120+ outPath,
121121+ stringify(processed, { columns: ['id', 'value', 'enId', 'en'], separator: '\t' }),
122122+ )
123123+ console.log(` wrote lang/${langCode}/sentences/${locale}.tsv (${processed.length} sentences)`)
124124+ }
125125+}
126126+127127+/**
128128+ * Builds a map of character → curriculum position for scoring sentence simplicity.
129129+ * Both hant and hans forms are indexed so scores work for any script variant.
130130+ */
131131+function buildPositionMap(locale: string, hantToHans: Record<string, string>): Map<string, number> {
132132+ const map = new Map<string, number>()
133133+ let pos = 0
134134+ for (const level of readCharacterOrder(locale)) {
135135+ for (const hant of level) {
136136+ map.set(hant, pos)
137137+ const hans = hantToHans[hant]
138138+ if (hans && hans !== hant) map.set(hans, pos)
139139+ pos++
140140+ }
141141+ }
142142+ return map
143143+}
144144+145145+/**
146146+ * Scores a sentence by the sum of its characters' curriculum positions.
147147+ * Sentences with only early-curriculum characters score lower (simpler).
148148+ * The length factor min(1, length / 1.3) scales down single-character sentences only.
149149+ */
150150+function sentenceScore(
151151+ sentence: string,
152152+ positionMap: Map<string, number>,
153153+ maxPos: number,
154154+): number {
155155+ let score = 0
156156+ for (const char of sentence) score += positionMap.get(char) ?? maxPos
157157+ return score * Math.min(1, sentence.length / 1.3)
158158+}
159159+160160+/** Deduplicates by sentence text (keeps later id on collision), then sorts by simplicity. */
161161+function dedupeAndSort(sentences: Sentence[], positionMap: Map<string, number>): Sentence[] {
162162+ const seen: Record<string, Sentence> = {}
163163+ for (const s of sentences) {
164164+ if (!seen[s.value] || seen[s.value].id < s.id) seen[s.value] = s
165165+ }
166166+ const maxPos = positionMap.size
167167+ return Object.values(seen).sort(
168168+ (a, b) => sentenceScore(a.value, positionMap, maxPos) - sentenceScore(b.value, positionMap, maxPos),
169169+ )
170170+}
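
A worked example of the scoring formula from `sentences.ts` follows. It is a sketch under assumptions: `score` repeats the arithmetic of `sentenceScore`, while the position map and sample sentences are hypothetical.

```typescript
// Same arithmetic as sentenceScore: sum of per-character positions, with
// unknown characters counted as maxPos, scaled by min(1, length / 1.3).
function score(sentence: string, positions: Map<string, number>, maxPos: number): number {
  let total = 0
  for (const char of sentence) total += positions.get(char) ?? maxPos
  return total * Math.min(1, sentence.length / 1.3)
}

// Hypothetical curriculum: four characters at positions 0..3.
const positions = new Map<string, number>([['我', 0], ['你', 1], ['好', 2], ['龍', 3]])
const maxPos = positions.size // 4, the penalty for characters outside the curriculum

const twoChars = score('你好', positions, maxPos) // (1 + 2) * 1
const withUnknown = score('你好嗎', positions, maxPos) // (1 + 2 + 4) * 1, 嗎 is unknown
const oneChar = score('好', positions, maxPos) // 2 * (1 / 1.3)

const ranked = ['你好嗎', '你好', '好'].sort(
  (a, b) => score(a, positions, maxPos) - score(b, positions, maxPos),
)
console.log(twoChars, withUnknown, oneChar, ranked)
```

With these numbers the factor is 1 for any sentence of two or more UTF-16 units, so it only changes the score of the single-character sentence.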
+65
data/cli/main.ts
···11+/**
22+ * Hanzi data CLI
33+ *
44+ * Entry point for all data pipeline and studio tools.
55+ *
66+ * Build commands (run by code contributors):
77+ * build Run gen:app, gen:progress, and gen:licenses in sequence
88+ * gen:app Compile subject JSON for all language pairs
99+ * gen:audio Generate TTS audio via Azure (requires .env + ffmpeg)
1010+ * gen:progress Generate HSK / TOCFL / JLPT progress JSON
1111+ * gen:licenses Fetch and bundle license texts
1212+ *
1313+ * Studio commands (run by translation and language data contributors):
1414+ * studio update:dicts Add missing readings/meanings from dictionary sources
1515+ * studio sort:dicts Sort TSVs by id or by curriculum order
1616+ * studio update:sentences Process raw Tatoeba exports into sentence TSVs
1717+ *
1818+ * Usage:
1919+ * deno task data <command> [options]
2020+ * deno task data --help
2121+ * deno task data studio --help
2222+ */
2323+2424+import { Command } from '@cliffy/command'
2525+import { genAppDataCmd } from './commands/gen_app_data.ts'
2626+import { genAudioCmd } from './commands/gen_audio.ts'
2727+import { genProgressCmd } from './commands/gen_progress.ts'
2828+import { genLicensesCmd } from './commands/gen_licenses.ts'
2929+import { updateDictsCmd } from './commands/studio/dicts.ts'
3030+import { sortDictsCmd } from './commands/studio/ordering.ts'
3131+import { updateSentencesCmd } from './commands/studio/sentences.ts'
3232+3333+const studioCmd = new Command()
3434+ .description(
3535+ 'Tools for translation and language data contributors. ' +
3636+ 'Run these when updating dictionary sources, readings, or curriculum ordering.',
3737+ )
3838+ .command('update:dicts', updateDictsCmd)
3939+ .command('update:sentences', updateSentencesCmd)
4040+ .command('sort:dicts', sortDictsCmd)
4141+4242+await new Command()
4343+ .name('hanzi')
4444+ .version('0.0.1')
4545+ .description('CLI tools for building and managing Hanzi app data.')
4646+ .action(function () { this.showHelp() })
4747+ // Build commands
4848+ .command('build', new Command()
4949+ .description('Run the full build: gen:app, gen:progress, and gen:licenses.')
5050+ .action(async () => {
5151+ console.log('=== gen:app ===')
5252+ await genAppDataCmd.parse([])
5353+ console.log('\n=== gen:progress ===')
5454+ await genProgressCmd.parse([])
5555+ console.log('\n=== gen:licenses ===')
5656+ await genLicensesCmd.parse([])
5757+ }),
5858+ )
5959+ .command('gen:app', genAppDataCmd)
6060+ .command('gen:audio', genAudioCmd)
6161+ .command('gen:progress', genProgressCmd)
6262+ .command('gen:licenses', genLicensesCmd)
6363+ // Studio commands
6464+ .command('studio', studioCmd)
6565+ .parse(Deno.args)
+36
data/cli/utils/audio.ts
···11+/**
22+ * Audio file utilities: voice IDs, filename conventions, and file listing.
33+ */
44+55+import { join } from '@std/path'
66+import { Locale } from '$/enums.ts'
77+88+/** Azure Neural TTS voice ID used per locale. */
99+export const VOICE_IDS: Record<Locale, string> = {
1010+ [Locale.zh_CN]: 'zh-CN-XiaoxiaoNeural',
1111+ [Locale.zh_HK]: 'zh-HK-WanLungNeural',
1212+ [Locale.zh_TW]: 'zh-TW-YunJheNeural',
1313+}
1414+1515+/** Returns the audio filename for a given subject id and locale. */
1616+export function getFilename(id: string, locale: Locale): string {
1717+ return `${id}_${locale.replace('_', '-')}_${VOICE_IDS[locale]}.mp3`
1818+}
1919+2020+/**
2121+ * Returns all existing audio filenames (not full paths) for the given locales.
2222+ * Skips locales whose audio directories don't exist yet (e.g. before first audio run).
2323+ */
2424+export function listAudioFiles(locales: string[] = ['zh_CN', 'zh_HK', 'zh_TW']): string[] {
2525+ return locales
2626+ .filter((locale) => locale !== 'tmp')
2727+ .flatMap((locale) => {
2828+ try {
2929+ return Array.from(Deno.readDirSync(join('www/static/gen/audio', locale)))
3030+ } catch {
3131+ return []
3232+ }
3333+ })
3434+ .filter(({ name }) => /.*\.mp3$/.test(name))
3535+ .map((file) => file.name)
3636+}
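
The filename convention in `audio.ts` can be illustrated with plain strings. This is a sketch, assuming the `Locale` enum values are the underscore codes (`zh_CN` and so on); `LocaleCode`, `SKETCH_VOICE_IDS`, and `audioFilename` are hypothetical stand-ins for the real enum, `VOICE_IDS`, and `getFilename`.

```typescript
// Stand-in for the Locale enum imported from '$/enums.ts' (values assumed).
type LocaleCode = 'zh_CN' | 'zh_HK' | 'zh_TW'

// Copy of the voice table shown above, keyed by the assumed locale codes.
const SKETCH_VOICE_IDS: Record<LocaleCode, string> = {
  zh_CN: 'zh-CN-XiaoxiaoNeural',
  zh_HK: 'zh-HK-WanLungNeural',
  zh_TW: 'zh-TW-YunJheNeural',
}

// Builds `{id}_{locale with a hyphen}_{voice}.mp3`, following getFilename.
function audioFilename(id: string, locale: LocaleCode): string {
  return `${id}_${locale.replace('_', '-')}_${SKETCH_VOICE_IDS[locale]}.mp3`
}

const cnName = audioFilename('c-00001', 'zh_CN')
const hkName = audioFilename('v-00042', 'zh_HK')
console.log(cnName) // c-00001_zh-CN_zh-CN-XiaoxiaoNeural.mp3
console.log(hkName) // v-00042_zh-HK_zh-HK-WanLungNeural.mp3
```

Because the voice id is part of the name, switching a locale to a different voice produces new filenames rather than overwriting existing audio.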
+146
data/cli/utils/dict.ts
···11+/**
22+ * Dictionary utilities: reading and interpreting the shared CJK dictionary files
33+ * and user-language meaning/hint/reading files.
44+ *
55+ * All files live under data/lang/ and are keyed by string ids (e.g. "c-00001").
66+ */
77+88+import { Transliteration } from '$/enums.ts'
99+import type { Reading } from '$/models/subjects.ts'
1010+import { readTsv } from './fs.ts'
1111+1212+/** A single entry from characters.tsv, vocabulary.tsv, or radicals.tsv. */
1313+export interface Definition {
1414+ /** Unique string id (e.g. "c-00001" for characters, "v-00001" for vocabulary). */
1515+ id: string
1616+ hans: string
1717+ hant: string
1818+ /** Japanese-specific form (kanji col in characters.tsv, ja col in vocabulary.tsv). */
1919+ ja?: string
2020+}
2121+2222+export interface Hint {
2323+ id: string
2424+ locale: string
2525+ en: string
2626+}
2727+2828+/** Maps TSV column names to their Transliteration enum values. */
2929+const COL_TYPE: Record<string, Transliteration> = {
3030+ pinyin: Transliteration.Pinyin,
3131+ jyutping: Transliteration.Jyutping,
3232+ zhuyin: Transliteration.Zhuyin,
3333+ kunyomi: Transliteration.Kunyomi,
3434+ onyomi: Transliteration.Onyomi,
3535+ reading: Transliteration.Hiragana,
3636+}
3737+3838+/** Reads a CJK dictionary TSV (characters.tsv, vocabulary.tsv, or radicals.tsv). */
3939+export function readDict(path: string): Definition[] {
4040+ return readTsv(path).map((row) => ({
4141+ id: row.id,
4242+ hans: row.hans,
4343+ hant: row.hant,
4444+ ja: row.kanji || row.ja || undefined,
4545+ }))
4646+}
4747+4848+/** Returns a dictionary indexed by traditional (hant) form. */
4949+export function readDictByHant(path: string): Record<string, Definition> {
5050+ return Object.fromEntries(readDict(path).map((d) => [d.hant, d]))
5151+}
5252+5353+/** Returns a dictionary indexed by simplified (hans) form. */
5454+export function readDictByHans(path: string): Record<string, Definition> {
5555+ return Object.fromEntries(readDict(path).map((d) => [d.hans, d]))
5656+}
5757+5858+/** Returns a dictionary indexed by subject id. */
5959+export function readDictById(path: string): Record<string, Definition> {
6060+ return Object.fromEntries(readDict(path).map((d) => [d.id, d]))
6161+}
6262+6363+/**
6464+ * Reads a user-language meanings file (e.g. lang/en/characters.tsv).
6565+ * Returns a map of subject id → meaning string.
6666+ */
6767+export function readMeanings(path: string): Record<string, string> {
6868+ return Object.fromEntries(readTsv(path).map((row) => [row.id, row.value]))
6969+}
7070+7171+/**
7272+ * Reads a meaning-override TSV (e.g. lang/en/meanings/ja.characters.tsv).
7373+ * Returns a map of subject id → semicolon-separated meaning string.
7474+ * Returns an empty map if the file doesn't exist.
7575+ */
7676+export function readMeaningOverrides(path: string): Record<string, string> {
7777+ try {
7878+ return Object.fromEntries(readTsv(path).map((row) => [row.id, row.meaning]))
7979+ } catch {
8080+ return {}
8181+ }
8282+}
8383+8484+/**
8585+ * Reads a readings TSV (e.g. lang/zh_CN/readings.tsv, lang/ja/reading.characters.tsv)
8686+ * and returns a map of subject id → Reading[].
8787+ *
8888+ * Each non-id column maps to a Transliteration type via COL_TYPE. Semicolon-separated values
8989+ * produce multiple readings; the first value of the last non-empty column is isPrimary.
9090+ * Returns an empty map if the file doesn't exist.
9191+ */
9292+export function readReadingsMap(path: string): Record<string, Reading[]> {
9393+ const result: Record<string, Reading[]> = {}
9494+ try {
9595+ const rows = readTsv(path)
9696+ if (!rows.length) return result
9797+ // Columns are visited in reverse header order, so isPrimary goes to the last non-empty column
9898+ const cols = Object.keys(rows[0]).filter((k) => k !== 'id').reverse()
9999+ for (const row of rows) {
100100+ const readings: Reading[] = []
101101+ let firstCol = true
102102+ for (const col of cols) {
103103+ const val = row[col] || ''
104104+ const type = COL_TYPE[col]
105105+ if (!type || !val) continue
106106+ val.split(';').map((s) => s.trim()).filter(Boolean).forEach((value, i) => {
107107+ readings.push({ value, type, isAcceptedAnswer: true, isPrimary: firstCol && i === 0 })
108108+ })
109109+ firstCol = false
110110+ }
111111+ if (readings.length) result[row.id] = readings
112112+ }
113113+ } catch { /* file missing — return empty */ }
114114+ return result
115115+}
116116+117117+/** Reads a hint TSV (lang/en/hints/*.tsv). Returns an empty array if the file doesn't exist. */
118118+export function readHints(path: string): Hint[] {
119119+ try {
120120+ return readTsv(path).map((row) => ({ id: row.id, locale: row.locale, en: row.en }))
121121+ } catch {
122122+ return []
123123+ }
124124+}
125125+126126+/** Reads hints and returns a nested map of subject id → locale → hint text. */
127127+export function readHintsById(path: string): Record<string, Record<string, string>> {
128128+ const result: Record<string, Record<string, string>> = {}
129129+ for (const row of readHints(path)) {
130130+ result[row.id] ??= {}
131131+ result[row.id][row.locale] = row.en
132132+ }
133133+ return result
134134+}
135135+136136+/**
137137+ * Reads hints and returns only those with locale === 'ALL', keyed by subject id.
138138+ * Used for hints that apply regardless of target language.
139139+ */
140140+export function readAllLocaleHints(path: string): Record<string, string> {
141141+ return Object.fromEntries(
142142+ readHints(path)
143143+ .filter((row) => row.locale === 'ALL')
144144+ .map((row) => [row.id, row.en]),
145145+ )
146146+}
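
The two hint lookups in `dict.ts` (`readHintsById` and `readAllLocaleHints`) differ only in how they index the same rows. The sketch below shows both on in-memory data; `hintsById`, `allLocaleHints`, and the sample hint texts are hypothetical, and file reading is left out.

```typescript
type HintRow = { id: string; locale: string; en: string }

// Nested index: subject id -> locale -> hint text.
function hintsById(rows: HintRow[]): Record<string, Record<string, string>> {
  const result: Record<string, Record<string, string>> = {}
  for (const row of rows) {
    if (!result[row.id]) result[row.id] = {}
    result[row.id][row.locale] = row.en
  }
  return result
}

// Flat index of hints whose locale is 'ALL', keyed by subject id.
function allLocaleHints(rows: HintRow[]): Record<string, string> {
  const result: Record<string, string> = {}
  for (const row of rows) {
    if (row.locale === 'ALL') result[row.id] = row.en
  }
  return result
}

// Made-up rows: one subject has both a shared hint and a Japanese-only hint.
const hintRows: HintRow[] = [
  { id: 'c-00001', locale: 'ALL', en: 'Shared hint for every course.' },
  { id: 'c-00001', locale: 'ja', en: 'Hint shown only in the Japanese course.' },
  { id: 'c-00002', locale: 'zh_CN', en: 'Hint shown only in the zh_CN course.' },
]

const nestedHints = hintsById(hintRows)
const sharedHints = allLocaleHints(hintRows)
console.log(nestedHints, sharedHints)
```

A caller that wants a locale-specific hint with a shared fallback could read `nestedHints[id][locale]` first and then `sharedHints[id]`; whether the app does so is not shown in this change.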
+53
data/cli/utils/fs.ts
···11+/**
22+ * Low-level file I/O for the data pipeline.
33+ *
44+ * This module only handles reading and writing raw files — no domain logic.
55+ * For dictionary, subject, ordering, or sentence helpers, see the sibling modules.
66+ */
77+88+import { parse } from '@std/csv/parse'
99+import { stringify } from '@std/csv/stringify'
1010+import stringifyJSON from 'json-stringify-pretty-compact'
1111+1212+/** Root directory for data source files (TSVs, CSVs, JSON sources). */
1313+export const DATA_ROOT = './data/'
1414+1515+/** Output directory for generated app JSON files served at runtime. */
1616+export const APP_ROOT = './www/static/gen/'
1818+/** Reads a TSV from `data/`. Uses the first row as column names; returns records keyed by column name. */
1919+export function readTsv(input: string): Record<string, string>[] {
2020+ const text = Deno.readTextFileSync(DATA_ROOT + input)
2121+ return parse(text, {
2222+ separator: '\t',
2323+ lazyQuotes: true,
2424+ skipFirstRow: true,
2525+ }) as Record<string, string>[]
2626+}
2828+/** Reads a CSV from `data/`. Uses the first row as column names; returns records keyed by column name. */
2929+export function readCsv(input: string): Record<string, string>[] {
3030+ const text = Deno.readTextFileSync(DATA_ROOT + input)
3131+ return parse(text, {
3232+ lazyQuotes: true,
3333+ skipFirstRow: true,
3434+ }) as Record<string, string>[]
3535+}
3636+3737+/** Reads and parses a JSON file from `data/`. */
3838+export function readJson<T = unknown>(input: string): T {
3939+ return JSON.parse(Deno.readTextFileSync(DATA_ROOT + input)) as T
4040+}
4141+4242+/** Writes rows as a TSV to `data/`. Column order follows the `columns` array. */
4343+export function writeTsv(input: string, columns: string[], data: unknown[]): void {
4444+ Deno.writeTextFileSync(
4545+ DATA_ROOT + input,
4646+ stringify(data as Record<string, string>[], { columns, separator: '\t' }),
4747+ )
4848+}
4949+5050+/** Writes content as pretty-printed JSON to `www/static/gen/`. */
5151+export function writeAppJson(path: string, content: unknown): void {
5252+ Deno.writeTextFileSync(APP_ROOT + path, stringifyJSON(content))
5353+}
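
Review note: to make "records keyed by column name" concrete, here is a minimal sketch of the shape `readTsv` returns. `parseTsvRecords` is a hypothetical helper written for this note, not part of the module. It ignores quoting entirely, which the real code leaves to `parse` from `@std/csv`, and the sample row is made up.

```typescript
// Hypothetical helper for illustration only; the module itself uses `parse` from @std/csv.
// Assumes no quoted fields and no tabs or newlines inside values.
function parseTsvRecords(text: string): Record<string, string>[] {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0)
  if (lines.length === 0) return []
  const [headerLine, ...rows] = lines
  const columns = headerLine.split('\t')
  return rows.map((row) => {
    const cells = row.split('\t')
    const record: Record<string, string> = {}
    // Each cell is stored under the column name found at the same index in the header.
    columns.forEach((name, i) => {
      record[name] = cells[i] ?? ''
    })
    return record
  })
}

// Made-up sample data: one header row, one data row.
const records = parseTsvRecords('id\thant\thans\n1\t學\t学\n')
console.log(records)
```

The header row never appears in the result; it only supplies the keys.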
+55
data/cli/utils/ordering.ts
···11+/**
22+ * Curriculum ordering utilities: reading the order files that define which
33+ * characters, vocabulary, and radicals are taught, and in what sequence.
44+ *
55+ * Order files live under data/lang/{targetLang}/order/.
66+ */
77+88+import { parse } from '@std/csv/parse'
99+import { DATA_ROOT } from './fs.ts'
1010+import { readDictByHant, type Definition } from './dict.ts'
1111+1212+/**
1313+ * Reads a vocabulary or radical order CSV (e.g. lang/zh_CN/order/vocabulary.csv).
1414+ * Returns an array of levels, each level being an array of slugs (traditional forms).
1515+ */
1616+export function readLessonOrder(input: string): string[][] {
1717+ return parse(Deno.readTextFileSync(DATA_ROOT + input)) as string[][]
1818+}
1919+2020+/**
2121+ * Reads a character order file (e.g. lang/zh_CN/order/characters.txt).
2222+ * Each line is one level; each character in the line is one slug.
2323+ */
2424+export function readCharacterOrder(targetLang: string): string[][] {
2525+ const text = Deno.readTextFileSync(DATA_ROOT + `lang/${targetLang}/order/characters.txt`)
2626+ return text.split('\n').map((row) => [...row]) // spread splits by code point, so non-BMP characters stay whole
2727+}
2828+2929+/**
3030+ * Returns the ordered set of definitions for items in the curriculum, deduplicated
3131+ * and in curriculum order. Only includes items that are actually taught — not the
3232+ * full dictionary.
3333+ *
3434+ * @param type - 'character' or 'vocabulary'
3535+ * @param targetLang - target language code (default: 'zh_CN')
3636+ */
3737+export function readOrderedDefs(
3838+ type: 'character' | 'vocabulary',
3939+ targetLang = 'zh_CN',
4040+): Definition[] {
4141+ const [dictPath, slugsList] = type === 'character'
4242+ ? ['lang/characters.tsv', readCharacterOrder(targetLang).flat()]
4343+ : ['lang/vocabulary.tsv', readLessonOrder(`lang/${targetLang}/order/vocabulary.csv`).flat()]
4444+4545+ const byHant = readDictByHant(dictPath)
4646+ const seen = new Set<string>()
4747+ const result: Definition[] = []
4848+ for (const slug of slugsList) {
4949+ if (!slug || seen.has(slug)) continue
5050+ seen.add(slug)
5151+ const def = byHant[slug]
5252+ if (def) result.push(def)
5353+ }
5454+ return result
5555+}
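
Review note: the level parsing and the "first occurrence wins" deduplication in `readOrderedDefs` can be sketched without touching the file system. The function names and the sample order text below are hypothetical. The sketch splits rows with `Array.from`, which iterates by code point, an assumption chosen so that characters outside the Basic Multilingual Plane are not broken into surrogate halves.

```typescript
// Hypothetical, file-free version of the order parsing; names are illustrative.
function parseCharacterOrder(text: string): string[][] {
  // One line per level; Array.from iterates by code point rather than UTF-16 code unit.
  return text.split(/\r?\n/).map((row) => Array.from(row))
}

// Flattens levels and keeps only the first occurrence of each slug, in curriculum order.
function dedupeInOrder(levels: string[][]): string[] {
  const seen = new Set<string>()
  const result: string[] = []
  for (const slug of levels.flat()) {
    if (!slug || seen.has(slug)) continue
    seen.add(slug)
    result.push(slug)
  }
  return result
}

// Made-up order text: two levels plus a trailing newline.
const levels = parseCharacterOrder('一二三\n二四\n')
const ordered = dedupeInOrder(levels)
console.log(ordered)
```

A trailing newline produces an empty final level, which disappears once the levels are flattened.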
+2
data/cli/utils/progress/hsk.ts
···11+// HSK progress data is generated directly in commands/gen_progress.ts.
22+// This file is reserved for future extraction if the gen_progress command grows.
+2
data/cli/utils/progress/tocfl.ts
···11+// TOCFL progress data is generated directly in commands/gen_progress.ts.
22+// This file is reserved for future extraction if the gen_progress command grows.
+61
data/cli/utils/sentences.ts
···11+/**
22+ * Sentence utilities: reading pre-processed example sentence files and building
33+ * the character-level indexes used when generating subject data.
44+ *
55+ * Sentence files live at data/lang/{userLang}/sentences/{targetLang}.tsv.
66+ */
77+88+import { distinct } from '@std/collections/distinct'
99+import { parse } from '@std/csv/parse'
1010+import type { Locale } from '$/enums.ts'
1111+import { DATA_ROOT } from './fs.ts'
1212+1313+export interface Sentences {
1414+ /** Maps sentence text → user-language translation. */
1515+ bySentence: Record<string, string>
1616+ /** Maps each individual character → all sentence texts that contain it. */
1717+ byChar: Map<string, string[]>
1818+ /** All sentence texts, in curriculum-sorted order (simplest first). */
1919+ sorted: string[]
2020+}
2121+2222+/**
2323+ * Reads the pre-processed sentence TSV for a given user + target language pair.
2424+ * Returns `{ bySentence, keys }`, or empty values if the file can't be read (for example, when it doesn't exist).
2525+ */
2626+export function readSentences(
2727+ userLang: string,
2828+ locale: Locale,
2929+): { bySentence: Record<string, string>; keys: string[] } {
3030+ let text = ''
3131+ try {
3232+ text = Deno.readTextFileSync(`${DATA_ROOT}lang/${userLang}/sentences/${locale}.tsv`)
3333+ } catch {
3434+ return { bySentence: {}, keys: [] }
3535+ }
3636+ const rows = parse(text, { separator: '\t', lazyQuotes: true })
3737+ const bySentence: Record<string, string> = {}
3838+ const keys = distinct(
3939+ rows.map(([_id, value, _enId, translation]) => {
4040+ bySentence[value] = translation
4141+ return value
4242+ }),
4343+ )
4444+ return { bySentence, keys }
4545+}
4646+4747+/**
4848+ * Loads and indexes sentences for a user + target language pair.
4949+ * Builds `byChar` for fast per-character lookup when generating subject examples.
5050+ */
5151+export function loadSentences(userLang: string, targetLang: string): Sentences {
5252+ const raw = readSentences(userLang, targetLang as Locale)
5353+ const byChar = new Map<string, string[]>()
5454+ for (const key of raw.keys) {
5555+ for (const char of key) {
5656+ if (!byChar.has(char)) byChar.set(char, [])
5757+ byChar.get(char)!.push(key)
5858+ }
5959+ }
6060+ return { bySentence: raw.bySentence, byChar, sorted: raw.keys }
6161+}
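
Review note: a small sketch of how the `byChar` index from `loadSentences` gets used to find examples for a multi-character word, mirroring the lookup in `createSubject`. `buildByChar` and `examplesFor` are hypothetical names and the sentences are made-up sample data. One difference from the code under review is flagged in the comments: the sketch deduplicates candidates, because a sentence that repeats a character is indexed once per occurrence.

```typescript
// Hypothetical stand-in for the index built in loadSentences.
function buildByChar(sentences: string[]): Map<string, string[]> {
  const byChar = new Map<string, string[]>()
  for (const sentence of sentences) {
    for (const char of Array.from(sentence)) {
      const bucket = byChar.get(char) ?? []
      bucket.push(sentence) // pushed once per occurrence, as in the original loop
      byChar.set(char, bucket)
    }
  }
  return byChar
}

// Looks up candidates by the word's first character, then keeps sentences containing the whole word.
function examplesFor(word: string, byChar: Map<string, string[]>, limit = 3): string[] {
  const first = Array.from(word)[0]
  if (first === undefined) return []
  const candidates = byChar.get(first) ?? []
  // Deduplication is an addition of this sketch, not something the reviewed code does.
  return Array.from(new Set(candidates))
    .filter((sentence) => sentence.includes(word))
    .slice(0, limit)
}

const byChar = buildByChar(['你好', '你好嗎', '好好學習'])
const wordExamples = examplesFor('你好', byChar)
const charExamples = examplesFor('好', byChar)
console.log(wordExamples, charExamples)
```

Indexing by first character keeps the candidate list short before the more expensive `includes` check runs.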
+234
data/cli/utils/subjects.ts
···11+/**
22+ * Subject utilities: reading/writing compiled subject JSON files, and creating
33+ * new Subject objects from dictionary + curriculum data.
44+ *
55+ * Compiled subject files live at www/static/gen/lang/{userLang}/{targetLang}.json.
66+ */
77+88+import { distinct } from '@std/collections/distinct'
99+import { dirname } from '@std/path'
1010+import stringifyJSON from 'json-stringify-pretty-compact'
1111+import { Locale, SubjectType } from '$/enums.ts'
1212+import type { Audio, Subject } from '$/models/subjects.ts'
1313+import { APP_ROOT } from './fs.ts'
1414+import type { Definition } from './dict.ts'
1515+import type { Sentences } from './sentences.ts'
1616+1717+const { Character, Vocabulary } = SubjectType
1818+1919+// ---------------------------------------------------------------------------
2020+// Subject I/O
2121+// ---------------------------------------------------------------------------
2222+2323+/** Reads compiled subject JSON from `www/static/gen/`. Returns an empty array on error. */
2424+export function readSubjects(input: string): Subject[] {
2525+ try {
2626+ return JSON.parse(Deno.readTextFileSync(APP_ROOT + input))
2727+ } catch {
2828+ return []
2929+ }
3030+}
3131+3232+/**
3333+ * Reads compiled subject JSON and returns a map keyed by `data.slug`.
3434+ * Subjects with a missing id or slug are skipped with a warning.
3535+ */
3636+export function readSubjectsMap(input: string): Record<string, Subject> {
3737+ const map: Record<string, Subject> = {}
3838+ readSubjects(input).forEach((subject) => {
3939+ if (!subject.id || !subject.data?.slug) {
4040+ console.warn(
4141+ `Skipping subject with missing id/slug in ${input}:`,
4242+ JSON.stringify(subject).slice(0, 120),
4343+ )
4444+ return
4545+ }
4646+ map[subject.data.slug] = subject
4747+ })
4848+ return map
4949+}
5050+5151+/**
5252+ * Writes compiled subjects to `www/static/gen/`. Before writing, subjects are:
5353+ * - Filtered to require id, slug, and type (corrupt entries are dropped)
5454+ * - Remapped with a stable property order for consistent diffs
5555+ * - Sorted by level → type (Radical < Character < Vocabulary) → position
5656+ */
5757+export function writeSubjects(output: string, subjects: Subject[]): void {
5858+ const levelAndPosition = new Set<string>()
5959+6060+ const toWrite = subjects
6161+ .filter((subject) => {
6262+ if (!subject.id || !subject.data?.slug || !subject.data?.type) {
6363+ console.warn(
6464+ 'Dropping invalid subject (missing id/slug/type):',
6565+ JSON.stringify(subject).slice(0, 120),
6666+ )
6767+ return false
6868+ }
6969+ return true
7070+ })
7171+ .map((subject) => {
7272+ const { data } = subject
7373+ const levelPosition = `${data.type}-${data.level}-${data.position}`
7474+ if (levelAndPosition.has(levelPosition) && levelPosition !== `${data.type}-0-0`) {
7575+ console.warn(`Two subjects at same position ${levelPosition}: ${data.slug}`)
7676+ } else {
7777+ levelAndPosition.add(levelPosition)
7878+ }
7979+ // Explicit property order for stable JSON diffs
8080+ return {
8181+ id: subject.id,
8282+ hiddenAt: subject.hiddenAt,
8383+ learnCards: subject.learnCards?.length ? subject.learnCards : ['meanings'],
8484+ quizCards: subject.quizCards?.length ? subject.quizCards : ['meanings', 'readings'],
8585+ data: {
8686+ audios: data.audios,
8787+ character: data.character,
8888+ requiredSubjects: data.requiredSubjects,
8989+ examples: data.examples,
9090+ level: data.level,
9191+ meanings: data.meanings,
9292+ meaningHint: data.meaningHint,
9393+ meaningMnemonic: data.meaningMnemonic,
9494+ position: data.position,
9595+ readings: data.readings,
9696+ readingHint: data.readingHint,
9797+ readingMnemonic: data.readingMnemonic,
9898+ slug: data.slug,
9999+ srsId: data.srsId,
100100+ type: data.type,
101101+ },
102102+ } as Subject
103103+ })
104104+ .sort((a, b) => {
105105+ if (!a.data.level || !a.data.position) return (!b.data.level || !b.data.position) ? 0 : 1
106106+ if (!b.data.level || !b.data.position) return -1
107107+ const levelDiff = a.data.level - b.data.level
108108+ if (levelDiff) return levelDiff
109109+ const typePriority: Record<string, number> = { Radical: 0, Character: 1, Vocabulary: 2 }
110110+ const typeDiff = (typePriority[a.data.type] ?? 0) - (typePriority[b.data.type] ?? 0)
111111+ if (typeDiff) return typeDiff
112112+ return a.data.position - b.data.position
113113+ })
114114+115115+ const outPath = APP_ROOT + output
116116+ Deno.mkdirSync(dirname(outPath), { recursive: true })
117117+ Deno.writeTextFileSync(outPath, stringifyJSON(toWrite))
118118+}
119119+120120+// ---------------------------------------------------------------------------
121121+// Subject creation
122122+// ---------------------------------------------------------------------------
123123+124124+/**
125125+ * Indexes for fast slug/hans/ja lookups. Built lazily on first call to createSubject.
126126+ * We defer loading so that commands that don't need subject creation (gen-progress,
127127+ * gen-licenses) don't pay the startup cost of reading the dictionary files.
128128+ */
129129+let charBySlug: Record<string, Definition> | null = null
130130+let charByHans: Record<string, Definition> | null = null
131131+let charByJa: Record<string, Definition> | null = null
132132+let vocabBySlug: Record<string, Definition> | null = null
133133+let vocabByJa: Record<string, Definition> | null = null
134134+let audioMeta: Record<string, Record<string, Audio>> | null = null
135135+136136+function initDicts(
137137+ charDefs: Definition[],
138138+ vocabDefs: Definition[],
139139+ audioFiles: string[],
140140+): void {
141141+ if (charBySlug) return // already initialized; later calls reuse these indexes and ignore their arguments
142142+ charBySlug = Object.fromEntries(charDefs.map((d) => [d.hant, d]))
143143+ charByHans = Object.fromEntries(charDefs.map((d) => [d.hans, d]))
144144+ charByJa = Object.fromEntries(charDefs.filter((d) => d.ja).map((d) => [d.ja!, d]))
145145+ vocabBySlug = Object.fromEntries(vocabDefs.map((d) => [d.hant, d]))
146146+ vocabByJa = Object.fromEntries(vocabDefs.filter((d) => d.ja).map((d) => [d.ja!, d]))
147147+148148+ audioMeta = {}
149149+ audioFiles.forEach((filename) => {
150150+ // Filename format: {id}_{locale-hyphenated}_{voiceId}.mp3
151151+ // Voice IDs use hyphens (not underscores), so splitting on _ is safe.
152152+ const [idStr, localeHyphen, voiceId] = filename.replace('.mp3', '').split('_')
153153+ const locale = localeHyphen?.replace('-', '_')
154154+ if (!locale || !voiceId) return
155155+ audioMeta![locale] ??= {}
156156+ audioMeta![locale][idStr] = { url: filename, voiceId }
157157+ })
158158+}
159159+160160+function getCharForLocale(targetLang: string, hans: string, hant: string, ja?: string): string {
161161+ if (targetLang === 'ja') return ja || hant
162162+ return targetLang === Locale.zh_CN ? hans : hant
163163+}
164164+165165+/**
166166+ * Creates a new Subject from dictionary and curriculum data.
167167+ * Used when a slug has no existing entry in the output JSON.
168168+ *
169169+ * @param charDefs - All character definitions (from lang/characters.tsv)
170170+ * @param vocabDefs - All vocabulary definitions (from lang/vocabulary.tsv)
171171+ * @param audioFiles - List of existing audio filenames (from listAudioFiles)
172172+ */
173173+export function createSubject(
174174+ slug: string,
175175+ level: number,
176176+ position: number,
177177+ targetLang: string,
178178+ charMeanings: Record<string, string>,
179179+ vocabMeanings: Record<string, string>,
180180+ sentences: Sentences,
181181+ charDefs: Definition[],
182182+ vocabDefs: Definition[],
183183+ audioFiles: string[],
184184+): Subject {
185185+ initDicts(charDefs, vocabDefs, audioFiles)
187187+ const isVocab = [...slug].length > 1 // count code points, so a single non-BMP character is not treated as vocabulary
188188+ const dictEntry = isVocab
189189+ ? (vocabBySlug![slug] || vocabByJa![slug])
190190+ : (charBySlug![slug] || charByHans![slug] || charByJa![slug])
191191+192192+ if (!dictEntry) {
193193+ console.error(`No valid dictionary entry for slug: ${slug}`)
194194+ return { data: {} } as Subject
195195+ }
196196+197197+ const { id, hans, hant, ja } = dictEntry
198198+ const en = isVocab ? (vocabMeanings[id] || '') : (charMeanings[id] || '')
199199+ const character = getCharForLocale(targetLang, hans, hant, ja)
200200+ const charForSentences = targetLang === Locale.zh_CN ? hans : hant
201201+202202+ return {
203203+ id,
204204+ learnCards: ['meanings'],
205205+ quizCards: ['meanings', 'readings'],
206206+ data: {
207207+ audios: [audioMeta![targetLang]?.[id]].filter((a): a is Audio => a != null),
208208+ character,
209209+ examples: (
210210+ charForSentences.length === 1
211211+ ? (sentences.byChar.get(charForSentences) ?? [])
212212+ : (sentences.byChar.get(charForSentences[0]) ?? []).filter((key) =>
213213+ key.includes(charForSentences)
214214+ )
215215+ )
216216+ .slice(0, 3)
217217+ .map((value) => ({ value, translation: sentences.bySentence[value] })),
218218+ level,
219219+ meanings: en.split(';').map((def, i) => ({
220220+ value: def.trim(),
221221+ isPrimary: i === 0,
222222+ isAcceptedAnswer: true,
223223+ })),
224224+ position,
225225+ readings: [],
226226+ requiredSubjects: distinct(
227227+ [...slug].map((c) => charBySlug![c]?.id ?? ''),
228228+ ).filter((reqId) => reqId && reqId !== charBySlug![slug]?.id),
229229+ slug,
230230+ srsId: level > 2 ? 1 : 2,
231231+ type: isVocab ? Vocabulary : Character,
232232+ },
233233+ } as Subject
234234+}
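
Review note: the ordering that `writeSubjects` documents (level, then type, then position, with unplaced subjects last) can be checked in isolation with a sketch like the one below. `SortKey` and `compareSubjects` are hypothetical stand-ins rather than the app's `Subject` type, and the sample subjects are made up. The sketch assumes that two unplaced subjects should compare as equal, which keeps the comparator consistent no matter which argument comes first.

```typescript
// Minimal stand-in for the fields the comparator reads; not the app's Subject type.
interface SortKey {
  slug: string
  type: 'Radical' | 'Character' | 'Vocabulary'
  level?: number
  position?: number
}

const TYPE_PRIORITY: Record<SortKey['type'], number> = { Radical: 0, Character: 1, Vocabulary: 2 }

function compareSubjects(a: SortKey, b: SortKey): number {
  const aUnplaced = !a.level || !a.position
  const bUnplaced = !b.level || !b.position
  // Unplaced subjects sort last; two unplaced subjects compare equal (an assumption of this sketch).
  if (aUnplaced || bUnplaced) return Number(aUnplaced) - Number(bUnplaced)
  return (
    a.level! - b.level! ||
    TYPE_PRIORITY[a.type] - TYPE_PRIORITY[b.type] ||
    a.position! - b.position!
  )
}

// Made-up sample subjects, deliberately out of order.
const input: SortKey[] = [
  { slug: '好人', type: 'Vocabulary', level: 1, position: 1 },
  { slug: '丿', type: 'Radical', level: 2, position: 1 },
  { slug: '未', type: 'Character' },
  { slug: '好', type: 'Character', level: 1, position: 2 },
  { slug: '人', type: 'Character', level: 1, position: 1 },
]
const sortedSlugs = input.slice().sort(compareSubjects).map((s) => s.slug)
console.log(sortedSlugs)
```

Because level and position are tested for truthiness, a value of 0 counts as unplaced here, just as it does in the code under review.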
···4040 "text": "The MIT License (MIT)\n\nCopyright (c) 2014, 2016, 2017, 2019, 2021, 2022, 2023 Simon Lydell\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n"
4141 },
4242 {
4343- "name": "native-file-system-adapter",
4444- "href": "https://raw.githubusercontent.com/jimmywarting/native-file-system-adapter/refs/heads/master/LICENSE",
4545- "text": "MIT License\n\nCopyright (c) 2019 Jimmy Wärting\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
4646- },
4747- {
4843 "name": "opencc-js",
4944 "href": "https://raw.githubusercontent.com/nk2028/opencc-js/main/LICENSE",
5045 "text": "MIT License\n\nCopyright (c) 2020-2021 The nk2028 Project\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"