Data#

Source files used to build the app's subject JSON. None of these files are directly included in the app bundle — everything is compiled to www/static/gen/ by the CLI.

Data files are generally safe to edit by hand (as long as you don't change any ids), but changes to TSVs may be overwritten if the relevant generator command is re-run. See the CLI docs below for which commands affect which files.
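Because ids must stay stable across hand edits, a quick before/after comparison can catch accidental changes. The following is a minimal sketch, not part of the CLI: `idsOf` and `idChanges` are hypothetical helpers, and the code assumes each TSV starts with a header row and that `id` is the first column.

```typescript
// Hypothetical helper: collect the ids from a TSV's first column.
// Assumes a header row, which is skipped.
function idsOf(tsv: string): Set<string> {
  const ids = new Set<string>();
  const lines = tsv.split(/\r?\n/).filter((line) => line !== "");
  for (const line of lines.slice(1)) {
    const id = line.split("\t")[0] ?? "";
    if (id !== "") ids.add(id);
  }
  return ids;
}

// Hypothetical helper: report ids that disappeared or appeared after an edit.
function idChanges(
  before: string,
  after: string,
): { removed: string[]; added: string[] } {
  const oldIds = idsOf(before);
  const newIds = idsOf(after);
  return {
    removed: Array.from(oldIds).filter((id) => !newIds.has(id)),
    added: Array.from(newIds).filter((id) => !oldIds.has(id)),
  };
}
```

An edit that only touches non-id columns should leave both `removed` and `added` empty.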

CLI#

All data tooling runs through a single entry point:

deno task data <command>
deno task data --help

Build commands#

Run by code contributors as part of the normal build process.

Command                             Description
build                               Run gen:app, gen:progress, and gen:licenses in sequence
gen:app                             Compile subjects JSON for all user-lang × target-lang pairs
gen:audio <locale> <dict> [limit]   Generate TTS audio via Azure (requires .env + ffmpeg)
gen:progress                        Generate HSK / TOCFL / JLPT progress JSON
gen:licenses                        Fetch and bundle license texts
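To make "all user-lang × target-lang pairs" concrete, here is a rough sketch of the enumeration. The locale codes are taken from the Directory Layout section; how gen:app actually discovers languages is not documented here, so the hard-coded lists and the `langPairs` function are assumptions for illustration only.

```typescript
// Assumed lists, copied from the directories under data/lang/.
// The real CLI may discover these differently.
const USER_LANGS = ["en", "es"];
const TARGET_LANGS = ["zh_CN", "zh_HK", "zh_TW", "ja"];

interface LangPair {
  userLang: string;
  targetLang: string;
}

// Enumerate every user-language / target-language combination.
function langPairs(): LangPair[] {
  const pairs: LangPair[] = [];
  for (const userLang of USER_LANGS) {
    for (const targetLang of TARGET_LANGS) {
      pairs.push({ userLang, targetLang });
    }
  }
  return pairs;
}
```

With two user languages and four target languages, this yields eight pairs, one compiled subject set per pair.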

Studio commands#

Run by translation and language data contributors when updating source files. Not needed for normal code development.

Command                             Description
studio sort:dicts [--lang=zh_CN]    Sort TSVs by id (default) or by a target language's curriculum order
studio update:dicts                 Add missing readings/meanings entries from dictionary sources
studio update:sentences             Process raw Tatoeba sentence exports into app sentence TSVs

Directory Layout#

data/
├── cli/            # CLI entry point and commands (see cli/main.ts)
├── scripts/        # Legacy build scripts (superseded by cli/)
├── other/          # Auxiliary data (licenses.tsv, voices.json)
└── lang/           # All language data
    ├── characters.tsv        # Shared CJK character registry: id, hant, hans
    ├── radicals.tsv          # Shared radical registry: id, hant, hans, alt
    ├── vocabulary.tsv        # Shared CJK vocabulary registry: id, hant, hans
    ├── zh_CN/                # Mainland Chinese target-language data
    │   ├── readings.tsv          # id, pinyin
    │   ├── progress/             # External word lists (HSK, frequency data)
    │   └── order/                # Curriculum order
    │       ├── characters.txt    # One line per level; chars in that level
    │       ├── vocabulary.csv
    │       └── radicals.csv
    ├── zh_HK/                # Cantonese target-language data
    │   └── (same structure as zh_CN/, readings use jyutping)
    ├── zh_TW/                # Taiwanese Mandarin target-language data
    │   └── (same structure as zh_CN/, readings use pinyin + zhuyin)
    ├── ja/                   # Japanese target-language data
    │   ├── reading.characters.tsv   # id, kunyomi, onyomi
    │   ├── reading.vocabulary.tsv   # id, reading (hiragana)
    │   └── (order/, progress/ same structure as zh_CN/)
    ├── en/                   # English user-language data
    │   ├── characters.tsv        # id, hant, value  (meanings)
    │   ├── vocabulary.tsv        # id, hant, value
    │   ├── radicals.tsv          # id, hant, name
    │   ├── hints/                # Mnemonic hints
    │   │   ├── meaning.characters.tsv
    │   │   ├── meaning.vocabulary.tsv
    │   │   ├── reading.characters.tsv
    │   │   └── reading.vocabulary.tsv
    │   ├── meanings/             # Locale-specific meaning overrides
    │   │   ├── ja.characters.tsv
    │   │   └── ja.vocabulary.tsv
    │   └── sentences/            # Example sentences keyed by target locale
    │       ├── zh_CN.tsv
    │       ├── zh_HK.tsv
    │       ├── zh_TW.tsv
    │       └── ja.tsv
    └── es/                   # Spanish user-language data
         └── (same structure as en/)
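
The shared registries and the per-language files are linked by id. As an illustration of that relationship, the sketch below joins rows from lang/characters.tsv (id, hant, hans) with rows from lang/en/characters.tsv (id, hant, value). It assumes each TSV has a header row naming its columns; `parseTsv` and `joinMeanings` are hypothetical names, not functions from the CLI.

```typescript
type Row = Record<string, string>;

// Hypothetical parser: assumes the first line is a header row of column names.
function parseTsv(text: string): Row[] {
  const lines = text.split(/\r?\n/).filter((line) => line !== "");
  const header = lines[0];
  if (header === undefined) return [];
  const columns = header.split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row: Row = {};
    columns.forEach((column, i) => {
      row[column] = cells[i] ?? "";
    });
    return row;
  });
}

interface CharacterEntry {
  id: string;
  hant: string;
  hans: string;
  meaning: string;
}

// Join registry rows with user-language meanings on the shared id column.
// Registry rows without a matching meaning get an empty string.
function joinMeanings(registry: Row[], meanings: Row[]): CharacterEntry[] {
  const meaningById = new Map<string, string>();
  meanings.forEach((row) => {
    meaningById.set(row["id"] ?? "", row["value"] ?? "");
  });
  return registry.map((row) => {
    const id = row["id"] ?? "";
    return {
      id,
      hant: row["hant"] ?? "",
      hans: row["hans"] ?? "",
      meaning: meaningById.get(id) ?? "",
    };
  });
}
```

This is also why changing an id by hand is unsafe: every file keyed on that id would silently stop matching.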

Curriculum Order#

Order files define which subjects are taught and in what sequence. They live at lang/{targetLang}/order/ and are the source of truth for what gets included in the compiled subject JSON.

  • characters.txt — one line per level, each character is one slug
  • vocabulary.csv — one row per level, comma-separated slugs
  • radicals.csv — one row per level, comma-separated slugs

Subjects not present in any order file are included in the output with hiddenAt set and level: 0 (accessible but not scheduled for review).
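
The rules above can be sketched as follows. This is an illustrative reading of the file formats, not the CLI's actual code: all function names are hypothetical, levels are assumed to be numbered from 1 by line position, and `hiddenAt` is assumed to be a timestamp string.

```typescript
// Split into lines, dropping only the empty line left by a final newline
// so that line position still maps to level.
function splitLevels(text: string): string[] {
  const lines = text.split(/\r?\n/);
  if (lines.length > 0 && lines[lines.length - 1] === "") lines.pop();
  return lines;
}

// characters.txt: one line per level, each character is one slug.
// Array.from iterates by code point, so rare characters stay intact.
function parseCharacterOrder(text: string): string[][] {
  return splitLevels(text).map((line) => Array.from(line.trim()));
}

// vocabulary.csv / radicals.csv: one row per level, comma-separated slugs.
function parseCsvOrder(text: string): string[][] {
  return splitLevels(text).map((line) =>
    line.split(",").map((slug) => slug.trim()).filter((slug) => slug !== "")
  );
}

// Map each slug to its 1-based level (assumed numbering); first occurrence wins.
function levelBySlug(levels: string[][]): Map<string, number> {
  const result = new Map<string, number>();
  levels.forEach((slugs, index) => {
    for (const slug of slugs) {
      if (!result.has(slug)) result.set(slug, index + 1);
    }
  });
  return result;
}

interface ScheduledSubject {
  slug: string;
  level: number;
  hiddenAt: string | null; // assumed to be a timestamp string when set
}

// Subjects missing from the order file get level 0 and hiddenAt set.
function assignLevel(
  slug: string,
  order: Map<string, number>,
  now: string,
): ScheduledSubject {
  const level = order.get(slug);
  return level === undefined
    ? { slug, level: 0, hiddenAt: now }
    : { slug, level, hiddenAt: null };
}
```

For example, with a characters.txt containing 一二 on its first line and 三 on its second, 三 is assigned level 2, while a character absent from the file comes out with level 0 and hiddenAt set.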

Order files do not currently take word frequency into account. In the future they should, so that the curriculum gradually prepares users for material such as graded readers.

Rough level targets#

LVL CEFR HSK TOCFL Characters Words
1-10
11-20 A1 HSK 2 TOCFL 1 ~200
21-30 A2 HSK 3 TOCFL 2 ~900 ~1200
31-60 B1-2 HSK 5 TOCFL 4 ~1900 ~5000

Audio#

Audio is generated separately from the main build and is not committed to the repo. Run gen:audio for each locale after adding new curriculum items:

deno task data gen:audio zh_CN character
deno task data gen:audio zh_HK character
deno task data gen:audio zh_TW character

Requires AZURE_SPEECH_KEY and AZURE_SPEECH_REGION in a .env file, and ffmpeg on $PATH. Generated files land in www/static/gen/audio/{locale}/.
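
Before running gen:audio, it can help to confirm that the .env file defines both variables. The sketch below is a standalone check, not part of the CLI: `parseDotEnv`, `missingAudioVars`, and `audioOutputDir` are hypothetical helpers, and the .env parsing handles only plain KEY=VALUE lines. It does not check for ffmpeg.

```typescript
// Minimal .env parser: KEY=VALUE lines, ignoring blank lines and # comments.
// Matching surrounding quotes on the value are stripped.
function parseDotEnv(text: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim().replace(/^(["'])(.*)\1$/, "$2");
    env[key] = value;
  }
  return env;
}

// The two variables this README says gen:audio requires.
const REQUIRED_AUDIO_VARS = ["AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"];

// Return the names of required variables that are absent or empty.
function missingAudioVars(env: Record<string, string>): string[] {
  return REQUIRED_AUDIO_VARS.filter((name) => (env[name] ?? "") === "");
}

// Output directory for a locale, following the path pattern given above.
function audioOutputDir(locale: string): string {
  return `www/static/gen/audio/${locale}/`;
}
```

An empty result from `missingAudioVars` means both Azure settings are present; it says nothing about whether the key is valid.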