pop/voice: jeffrey speech synthesis, two lanes (PT + diphone bank)
PT lane (research / vocal-tract music instrument):
- vendor Pink Trombone from dood.al directly (MIT, sha-pinned), not
a GitHub mirror (the zakaton fork is GPL-3 after its refactor)
- render-pt.mjs: node vm sandbox loads PT inline, renders sustained
vowels + keyframe trajectories at ~90 ms/render
- fit.py: CMA-ES over log-mel mean+std features + librosa pyin F0;
the PT->PT self-test recovers params, and a 2-keyframe joint fit
recovers the natural rising F0 contour from jeffrey-pvc word
recordings
- measure-jeffrey.py: mediapipe FaceLandmarker on 521 confirmed-jeffrey
IG photos -> 363 measurements -> jeffrey-anthropometry.json with
Fitch-Giedd VTL prior + lip aperture bound
- PHYSIOLOGY.md, MODEL-EXTENSIONS.md frame the personalized-PT paper
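Sketch of the "log-mel mean+std" term named in the fit.py bullet, for
readers who have not seen the idea: collapse each log-mel matrix to a
per-band mean and std over time, then compare the two summaries. This
is an illustration only, not fit.py itself. The function names, the use
of population std, and the plain Euclidean distance are all assumptions;
the real loss also involves an F0 term that is not shown here.

```python
import math
from statistics import fmean, pstdev


def summarize(frames):
    """Collapse a (frames x bands) log-mel matrix to per-band mean and std.

    `frames` is a list of equal-length rows, one row per time frame.
    Population std is an assumption; fit.py may use something else.
    """
    bands = list(zip(*frames))  # transpose: one tuple per mel band
    means = [fmean(band) for band in bands]
    stds = [pstdev(band) for band in bands]
    return means + stds


def spectral_distance(target_frames, candidate_frames):
    """Euclidean distance between the two mean+std summaries (hypothetical)."""
    t = summarize(target_frames)
    c = summarize(candidate_frames)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, c)))


if __name__ == "__main__":
    target = [[0.0, 1.0], [2.0, 3.0]]
    shifted = [[1.0, 2.0], [3.0, 4.0]]  # every band 1.0 louder
    print(spectral_distance(target, shifted))
```

A summary of this kind ignores frame order, so two renders with the
same frames in a different order score as identical.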
Pivot 2026-05-04: PT cannot reach phoneme-level intelligibility from
external trajectory fitting alone. PT reframed as a timbral musical
instrument; new mainline for shippable jeffrey speech is the diphone
bank.
Diphone lane (shippable AC-native TTS):
- diphone-targets.json: 1527 ARPAbet pairs with perceptual weights,
532 in tier-1 (vowel-vowel + CV/VC)
- 10 paragraph-form carriers in jeffrey's lowercase voice (~$0.39
ElevenLabs)
- align-paragraphs.py: torchaudio MMS_FA + g2p_en -> 1510 phonemes
aligned across 168.7 s of audio
- extract-diphones.py: 485 unique diphones, 3.2 MB bank, Hann-tapered
- synth-word.py: text -> g2p_en -> overlap-add concat -> jeffrey-shaped
speech, --rate 0.82 default for natural pacing
- end-to-end works on "the music is real" and 10 demo words
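Sketch of the concat step described in the extract-diphones.py and
synth-word.py bullets: split a phoneme list into adjacent pairs, taper
each unit with half-Hann ramps, and overlap-add neighbours so the ramps
cross-fade. This is a minimal illustration, not the scripts themselves.
All names here (to_diphones, hann_taper, overlap_add, the fade length)
are hypothetical; bank lookup, missing-diphone handling and the --rate
option are not modeled.

```python
import math


def to_diphones(phonemes):
    """Adjacent phoneme pairs, e.g. ["DH", "AH"] -> [("DH", "AH")]."""
    return list(zip(phonemes, phonemes[1:]))


def hann_taper(unit, fade):
    """Fade the first and last `fade` samples of `unit` with half-Hann ramps."""
    out = list(unit)
    fade = min(fade, len(out) // 2)
    for i in range(fade):
        # Ramp rises from near 0 to near 1 across the fade region.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / fade)
        out[i] *= w
        out[-1 - i] *= w
    return out


def overlap_add(units, fade):
    """Concatenate tapered units, overlapping neighbours by `fade` samples."""
    out = []
    for unit in units:
        if not out:
            out.extend(unit)
            continue
        n = min(fade, len(out), len(unit))
        # Sum the head of the new unit into the tail of the output.
        for i in range(n):
            out[len(out) - n + i] += unit[i]
        out.extend(unit[n:])
    return out


if __name__ == "__main__":
    print(to_diphones(["M", "Y", "UW", "Z"]))
    units = [hann_taper([1.0] * 8, 4) for _ in range(2)]
    joined = overlap_add(units, 4)
    print(len(joined), [round(x, 3) for x in joined])
```

With matching fade lengths the fade-out and fade-in weights sum to 1 at
every overlapped sample, so a steady signal keeps its level across the
join.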
Total session ElevenLabs spend: ~$0.65. Output: a 3.2 MB shippable
bank that pronounces arbitrary text in jeffrey's voice (with known
gaps at word boundaries, fixable next pass).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>