Data#
Source files used to build the app's subject JSON. None of these files are
included directly in the app bundle; the CLI compiles everything to
www/static/gen/.
Data files are generally safe to edit by hand (as long as you don't change any ids), but changes to TSVs may be overwritten if the relevant generator command is re-run. See the CLI docs below for which commands affect which files.
CLI#
All data tooling runs through a single entry point:
deno task data <command>
deno task data --help
Build commands#
Run by code contributors as part of the normal build process.
| Command | Description |
|---|---|
| build | Run gen:app, gen:progress, and gen:licenses in sequence |
| gen:app | Compile subjects JSON for all user-lang × target-lang pairs |
| gen:audio <locale> <dict> [limit] | Generate TTS audio via Azure (requires .env + ffmpeg) |
| gen:progress | Generate HSK / TOCFL / JLPT progress JSON |
| gen:licenses | Fetch and bundle license texts |
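To make "all user-lang × target-lang pairs" concrete, here is a minimal sketch that enumerates the combinations. The language codes are taken from the directory names under data/lang/ in this README; the `langPairs` helper and the `LangPair` type are hypothetical and are not taken from the actual CLI code.

```typescript
// Language codes as they appear under data/lang/ in this README.
const USER_LANGS = ["en", "es"];
const TARGET_LANGS = ["zh_CN", "zh_HK", "zh_TW", "ja"];

// Hypothetical shape for one compile target.
interface LangPair {
  userLang: string;
  targetLang: string;
}

// Hypothetical helper: every user language combined with every target language.
function langPairs(): LangPair[] {
  const pairs: LangPair[] = [];
  for (const userLang of USER_LANGS) {
    for (const targetLang of TARGET_LANGS) {
      pairs.push({ userLang, targetLang });
    }
  }
  return pairs;
}

console.log(langPairs().length); // prints 8 (2 user languages × 4 target languages)
```

If a new user or target language directory were added, the number of compiled outputs would presumably grow accordingly.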
Studio commands#
Run by translation and language data contributors when updating source files. Not needed for normal code development.
| Command | Description |
|---|---|
| studio sort:dicts [--lang=zh_CN] | Sort TSVs by id (default) or by a target language's curriculum order |
| studio update:dicts | Add missing readings/meanings entries from dictionary sources |
| studio update:sentences | Process raw Tatoeba sentence exports into app sentence TSVs |
Directory Layout#
data/
├── cli/                              # CLI entry point and commands (see cli/main.ts)
├── scripts/                          # Legacy build scripts (superseded by cli/)
├── other/                            # Auxiliary data (licenses.tsv, voices.json)
└── lang/                             # All language data
    ├── characters.tsv                # Shared CJK character registry: id, hant, hans
    ├── radicals.tsv                  # Shared radical registry: id, hant, hans, alt
    ├── vocabulary.tsv                # Shared CJK vocabulary registry: id, hant, hans
    ├── zh_CN/                        # Mainland Chinese target-language data
    │   ├── readings.tsv              # id, pinyin
    │   ├── progress/                 # External word lists (HSK, frequency data)
    │   └── order/                    # Curriculum order
    │       ├── characters.txt        # One line per level; chars in that level
    │       ├── vocabulary.csv
    │       └── radicals.csv
    ├── zh_HK/                        # Cantonese target-language data
    │   └── (same structure as zh_CN/, readings use jyutping)
    ├── zh_TW/                        # Taiwanese Mandarin target-language data
    │   └── (same structure as zh_CN/, readings use pinyin + zhuyin)
    ├── ja/                           # Japanese target-language data
    │   ├── reading.characters.tsv    # id, kunyomi, onyomi
    │   ├── reading.vocabulary.tsv    # id, reading (hiragana)
    │   └── (order/, progress/ same structure as zh_CN/)
    ├── en/                           # English user-language data
    │   ├── characters.tsv            # id, hant, value (meanings)
    │   ├── vocabulary.tsv            # id, hant, value
    │   ├── radicals.tsv              # id, hant, name
    │   ├── hints/                    # Mnemonic hints
    │   │   ├── meaning.characters.tsv
    │   │   ├── meaning.vocabulary.tsv
    │   │   ├── reading.characters.tsv
    │   │   └── reading.vocabulary.tsv
    │   ├── meanings/                 # Locale-specific meaning overrides
    │   │   ├── ja.characters.tsv
    │   │   └── ja.vocabulary.tsv
    │   └── sentences/                # Example sentences keyed by target locale
    │       ├── zh_CN.tsv
    │       ├── zh_HK.tsv
    │       ├── zh_TW.tsv
    │       └── ja.tsv
    └── es/                           # Spanish user-language data
        └── (same structure as en/)
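As an illustration of how a registry file such as lang/characters.tsv (columns id, hant, hans) might be read, here is a minimal sketch of a TSV parser. It assumes the first non-empty line is a header row naming the columns, which this README does not confirm. The `parseTsv` helper and the sample rows (including their ids) are hypothetical.

```typescript
type TsvRow = Record<string, string>;

// Assumption: the first non-empty line is a header row with the column names.
function parseTsv(text: string): TsvRow[] {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row: TsvRow = {};
    header.forEach((name, i) => {
      // Missing trailing cells become empty strings.
      row[name] = cells[i] ?? "";
    });
    return row;
  });
}

// Illustrative data only; the real ids in characters.tsv may look different.
const sample = "id\thant\thans\n1\t學\t学\n2\t國\t国\n";
const rows = parseTsv(sample);
console.log(rows[0].hans); // prints "学"
```

Keeping the id column untouched when editing by hand matters because other files (readings, meanings, hints) are keyed by the same id.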
Curriculum Order#
Order files define which subjects are taught and in what sequence. They live
at lang/{targetLang}/order/ and are the source of truth for which subjects
are assigned a level in the compiled subject JSON.
- characters.txt: one line per level; each character on the line is one slug
- vocabulary.csv: one row per level, comma-separated slugs
- radicals.csv: one row per level, comma-separated slugs
Subjects not present in any order file are included in the output with
hiddenAt set and level: 0 (accessible but not scheduled for review).
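The rules above can be sketched roughly as follows. This is not the CLI's actual implementation: the function names are hypothetical, levels are assumed to start at 1 with the first line or row, and a boolean `hidden` flag stands in for hiddenAt, whose real type is not described here.

```typescript
// characters.txt: line N holds the characters taught at level N (assumed 1-based).
// Each character on the line is treated as its own slug.
function parseCharacterOrder(text: string): Map<string, number> {
  const levels = new Map<string, number>();
  text.split(/\r?\n/).forEach((line, index) => {
    for (const char of line.trim()) {
      if (!levels.has(char)) levels.set(char, index + 1);
    }
  });
  return levels;
}

// vocabulary.csv / radicals.csv: row N holds comma-separated slugs for level N.
function parseCsvOrder(text: string): Map<string, number> {
  const levels = new Map<string, number>();
  text.split(/\r?\n/).forEach((row, index) => {
    for (const raw of row.split(",")) {
      const slug = raw.trim();
      if (slug !== "" && !levels.has(slug)) levels.set(slug, index + 1);
    }
  });
  return levels;
}

// Subjects absent from the order file get level 0 and are marked hidden.
// `hidden` is a stand-in for the hiddenAt field mentioned in this README.
function levelFor(
  slug: string,
  order: Map<string, number>,
): { level: number; hidden: boolean } {
  const level = order.get(slug);
  return level === undefined
    ? { level: 0, hidden: true }
    : { level, hidden: false };
}

const charOrder = parseCharacterOrder("一二三\n人大\n");
console.log(levelFor("大", charOrder)); // level 2, not hidden
console.log(levelFor("龍", charOrder)); // level 0, hidden
```

In this sketch a slug that appears on more than one line keeps its first (lowest) level; whether the real generator does the same is an assumption.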
The curriculum order does not currently take word frequency into account. In the future it should, so that learners are gradually prepared for material such as graded readers. Some frequency lists:
- TOCFL data includes sFreq and wFreq
- https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php
Rough level targets#
| Level | CEFR | HSK | TOCFL | Characters | Words |
|---|---|---|---|---|---|
| 1-10 | | | | | |
| 11-20 | A1 | HSK 2 | TOCFL 1 | ~200 | |
| 21-30 | A2 | HSK 3 | TOCFL 2 | ~900 | ~1200 |
| 31-60 | B1-2 | HSK 5 | TOCFL 4 | ~1900 | ~5000 |
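For tooling that needs these targets programmatically, the table could be mirrored as a small lookup. This is only a sketch: the `LevelBand` shape and the `bandForLevel` helper are hypothetical, and the values are copied from the rough targets above, with empty cells left undefined.

```typescript
// Hypothetical shape mirroring one row of the table above.
interface LevelBand {
  from: number;
  to: number;
  cefr?: string;
  hsk?: number;
  tocfl?: number;
}

const LEVEL_BANDS: LevelBand[] = [
  { from: 1, to: 10 }, // no targets listed for levels 1-10
  { from: 11, to: 20, cefr: "A1", hsk: 2, tocfl: 1 },
  { from: 21, to: 30, cefr: "A2", hsk: 3, tocfl: 2 },
  { from: 31, to: 60, cefr: "B1-2", hsk: 5, tocfl: 4 },
];

// Returns the band containing the level, or undefined if the level is out of range.
function bandForLevel(level: number): LevelBand | undefined {
  return LEVEL_BANDS.find((band) => level >= band.from && level <= band.to);
}

console.log(bandForLevel(25)?.cefr); // prints "A2"
```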
Audio#
Audio is generated separately from the main build and is not committed to the
repo. Run gen:audio for each locale after adding new curriculum items:
deno task data gen:audio zh_CN character
deno task data gen:audio zh_HK character
deno task data gen:audio zh_TW character
Requires AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in a .env file, and
ffmpeg on $PATH. Generated files land in www/static/gen/audio/{locale}/.
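Before starting a long audio run it can help to confirm the credentials are present. The sketch below checks only the two environment variables named above; the `missingAudioEnv` helper is hypothetical and is not part of the CLI, and the ffmpeg check is deliberately left out.

```typescript
// The two variables this README says gen:audio requires.
const REQUIRED_ENV_VARS = ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"];

// Hypothetical helper: returns the names of required variables that are unset or empty.
function missingAudioEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}

// Example with a partial environment (the value is a placeholder, not a real key).
const missing = missingAudioEnv({ AZURE_SPEECH_KEY: "placeholder" });
console.log(missing); // prints [ "AZURE_SPEECH_REGION" ]
```

In a Deno script the argument could come from `Deno.env.toObject()` once the .env file has been loaded; how the CLI itself loads .env is not described here.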