Data#

Source files used to build the app's subject JSON. None of these files are directly included in the app bundle — everything is compiled to www/static/gen/ by the CLI.

Data files are generally safe to edit by hand (as long as you don't change any ids), but changes to TSVs may be overwritten if the relevant generator command is re-run. See the CLI docs below for which commands affect which files.

CLI#

All data tooling runs through a single entry point:

deno task data <command>
deno task data --help

Build commands#

Run by code contributors as part of the normal build process.

Command	Description
`build`	Run `gen:app`, `gen:progress`, and `gen:licenses` in sequence
`gen:app`	Compile subjects JSON for all user-lang × target-lang pairs
`gen:audio <locale> <dict> [limit]`	Generate TTS audio via Azure (requires `.env` + `ffmpeg`)
`gen:progress`	Generate HSK / TOCFL / JLPT progress JSON
`gen:licenses`	Fetch and bundle license texts

Studio commands#

Run by translation and language data contributors when updating source files. Not needed for normal code development.

Command	Description
`studio sort:dicts [--lang=zh_CN]`	Sort TSVs by id (default) or by a target language's curriculum order
`studio update:dicts`	Add missing readings/meanings entries from dictionary sources
`studio update:sentences`	Process raw Tatoeba sentence exports into app sentence TSVs

Directory Layout#

data/
├── cli/            # CLI entry point and commands (see cli/main.ts)
├── scripts/        # Legacy build scripts (superseded by cli/)
├── other/          # Auxiliary data (licenses.tsv, voices.json)
└── lang/           # All language data
    ├── characters.tsv        # Shared CJK character registry: id, hant, hans
    ├── radicals.tsv          # Shared radical registry: id, hant, hans, alt
    ├── vocabulary.tsv        # Shared CJK vocabulary registry: id, hant, hans
    ├── zh_CN/                # Mainland Chinese target-language data
    │   ├── readings.tsv          # id, pinyin
    │   ├── progress/             # External word lists (HSK, frequency data)
    │   └── order/                # Curriculum order
    │       ├── characters.txt    # One line per level; chars in that level
    │       ├── vocabulary.csv
    │       └── radicals.csv
    ├── zh_HK/                # Cantonese target-language data
    │   └── (same structure as zh_CN/, readings use jyutping)
    ├── zh_TW/                # Taiwanese Mandarin target-language data
    │   └── (same structure as zh_CN/, readings use pinyin + zhuyin)
    ├── ja/                   # Japanese target-language data
    │   ├── reading.characters.tsv   # id, kunyomi, onyomi
    │   ├── reading.vocabulary.tsv   # id, reading (hiragana)
    │   └── (order/, progress/ same structure as zh_CN/)
    ├── en/                   # English user-language data
    │   ├── characters.tsv        # id, hant, value  (meanings)
    │   ├── vocabulary.tsv        # id, hant, value
    │   ├── radicals.tsv          # id, hant, name
    │   ├── hints/                # Mnemonic hints
    │   │   ├── meaning.characters.tsv
    │   │   ├── meaning.vocabulary.tsv
    │   │   ├── reading.characters.tsv
    │   │   └── reading.vocabulary.tsv
    │   ├── meanings/             # Locale-specific meaning overrides
    │   │   ├── ja.characters.tsv
    │   │   └── ja.vocabulary.tsv
    │   └── sentences/            # Example sentences keyed by target locale
    │       ├── zh_CN.tsv
    │       ├── zh_HK.tsv
    │       ├── zh_TW.tsv
    │       └── ja.tsv
    └── es/                   # Spanish user-language data
         └── (same structure as en/)
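
Most of the files above are plain tab-separated tables keyed by id. The sketch below shows one way to read such a file into records. It assumes there is no header row and no quoting or escaping, neither of which is confirmed here. `parseTsv` and the sample rows are hypothetical and not part of the CLI.

```typescript
// Hypothetical helper: splits TSV text into objects using the given column names.
// Assumes no header row and no quoted or escaped tabs.
function parseTsv<K extends string>(
  text: string,
  columns: readonly K[],
): Record<K, string>[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const cells = line.split("\t");
      const row = {} as Record<K, string>;
      columns.forEach((name, i) => {
        row[name] = cells[i] ?? ""; // missing trailing cells become empty strings
      });
      return row;
    });
}

// Columns for lang/characters.tsv as listed in the layout: id, hant, hans.
const characterColumns = ["id", "hant", "hans"] as const;

// Made-up sample data; the real ids may look different.
const sampleTsv = "1\t學\t学\n2\t國\t国\n";
const sampleRows = parseTsv(sampleTsv, characterColumns);
```

Passing the column names explicitly keeps the helper usable for files with different columns, such as the `id, pinyin` layout of zh_CN/readings.tsv.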

Curriculum Order#

Order files define which subjects are taught and in what sequence. They live at lang/{targetLang}/order/ and are the source of truth for which subjects are scheduled, and at which level, in the compiled subject JSON.

characters.txt: one line per level; each character on the line is one slug
vocabulary.csv: one row per level; comma-separated slugs
radicals.csv: one row per level; comma-separated slugs
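
Both formats can be read in a few lines. The sketch below assumes that levels are numbered from 1 in line order and that an empty line in the middle of a file still counts as a level. Neither is stated here. `parseCharacterOrder` and `parseCsvOrder` are hypothetical names.

```typescript
// Hypothetical helpers; "level = 1-based line number" is an assumption.
type LevelMap = Map<string, number>; // slug -> level

function splitLines(text: string): string[] {
  const lines = text.split(/\r?\n/);
  // Drop trailing empty lines only, so an empty level in the middle keeps its number.
  while (lines.length > 0 && (lines[lines.length - 1] ?? "").trim() === "") {
    lines.pop();
  }
  return lines;
}

// characters.txt: one line per level, every character on the line is a slug.
function parseCharacterOrder(text: string): LevelMap {
  const levels: LevelMap = new Map();
  splitLines(text).forEach((line, index) => {
    // Array.from iterates by code point, so characters outside the BMP stay intact.
    const chars = Array.from(line).filter((c) => c.trim() !== "");
    for (const char of chars) levels.set(char, index + 1);
  });
  return levels;
}

// vocabulary.csv and radicals.csv: one row per level, comma-separated slugs.
function parseCsvOrder(text: string): LevelMap {
  const levels: LevelMap = new Map();
  splitLines(text).forEach((line, index) => {
    const slugs = line.split(",").map((s) => s.trim()).filter((s) => s !== "");
    for (const slug of slugs) levels.set(slug, index + 1);
  });
  return levels;
}
```

If a slug appears on more than one line, the later line wins in this sketch. The real generator may instead treat that as an error.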

Subjects not present in any order file are included in the output with hiddenAt set and level: 0 (accessible but not scheduled for review).
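
That rule can be sketched as follows. Only the `hiddenAt` and `level` fields come from the description above. The `Subject` shape, the `applyOrder` helper, and the use of a timestamp string for `hiddenAt` are assumptions made for illustration.

```typescript
// Hypothetical subject shape; only `level` and `hiddenAt` are named in this document.
interface Subject {
  slug: string;
  level: number;
  hiddenAt?: string; // assumed to be a timestamp; the real type is not specified here
}

// Ordered subjects get their curriculum level; everything else is kept but hidden at level 0.
function applyOrder(
  slugs: string[],
  order: Map<string, number>,
  now: string,
): Subject[] {
  return slugs.map((slug) => {
    const level = order.get(slug);
    return level === undefined
      ? { slug, level: 0, hiddenAt: now }
      : { slug, level };
  });
}

// Made-up example: two ordered characters and one that is in no order file.
const sampleOrder = new Map([["一", 1], ["二", 1]]);
const sampleSubjects = applyOrder(["一", "二", "龘"], sampleOrder, "2024-01-01T00:00:00Z");
```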

Curriculum order does not currently take frequency into account. In the future it should, so that users are gradually prepared for material such as graded readers. Some frequency lists:

TOCFL data includes sFreq and wFreq
https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php

Rough level targets#

LVL	CEFR	HSK	TOCFL	Characters	Words
1-10
11-20	A1	HSK 2	TOCFL 1	~200
21-30	A2	HSK 3	TOCFL 2	~900	~1200
31-60	B1-2	HSK 5	TOCFL 4	~1900	~5000
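
For tooling that needs these targets programmatically, the table can be encoded as a range lookup. This is a sketch with hypothetical names (`LevelTarget`, `targetForLevel`). Levels 1-10 have no listed targets, so that row carries none, and the character and word counts are left out because the table does not fill them in for every row.

```typescript
// Hypothetical encoding of the rough level targets table.
interface LevelTarget {
  from: number;
  to: number;
  cefr?: string;
  hsk?: string;
  tocfl?: string;
}

const levelTargets: LevelTarget[] = [
  { from: 1, to: 10 },
  { from: 11, to: 20, cefr: "A1", hsk: "HSK 2", tocfl: "TOCFL 1" },
  { from: 21, to: 30, cefr: "A2", hsk: "HSK 3", tocfl: "TOCFL 2" },
  { from: 31, to: 60, cefr: "B1-2", hsk: "HSK 5", tocfl: "TOCFL 4" },
];

// Returns the row whose level range contains `level`, or undefined if out of range.
function targetForLevel(level: number): LevelTarget | undefined {
  return levelTargets.find((t) => level >= t.from && level <= t.to);
}
```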

Audio#

Audio is generated separately from the main build and is not committed to the repo. Run gen:audio for each locale after adding new curriculum items:

deno task data gen:audio zh_CN character
deno task data gen:audio zh_HK character
deno task data gen:audio zh_TW character

Requires AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in a .env file, and ffmpeg on $PATH. Generated files land in www/static/gen/audio/{locale}/.
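
A pre-flight check along these lines can catch a missing key before any Azure call is made. This is a sketch, not the CLI's actual validation. `missingAudioEnv` and `audioOutputDir` are hypothetical, and the environment is passed in as a plain object so the example does not depend on how the CLI loads .env.

```typescript
const REQUIRED_AUDIO_ENV = ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"] as const;

// Returns the names of required variables that are unset or empty.
function missingAudioEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_AUDIO_ENV.filter((name) => !env[name]);
}

// Directory that generated audio lands in for a locale, per the path above.
function audioOutputDir(locale: string): string {
  return `www/static/gen/audio/${locale}/`;
}
```

In Deno, `Deno.env.toObject()` returns an object of the right shape once the variables are present in the process environment.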
