Add scream TTS exploration report · aesthetic.computer/core@1ed697c

+114

1 changed file

reports/2026-03-30-scream-tts-exploration.md

# Scream TTS Exploration Report

**Date:** 2026-03-30
**Goal:** Make the `scream` command produce actual screaming TTS audio

## Summary

Explored multiple TTS APIs and approaches to generate screamo/metal-style vocal delivery with intelligible words. No single API perfectly solves "screamo TTS with lyrics" yet, but several promising paths emerged.

## What Was Built

### Code Changes (pushed to main)

1. **`system/netlify/functions/say.js`**: Upgraded OpenAI TTS and added an ElevenLabs provider
   - `gpt-4o-mini-tts` model used when the `instructions` param is provided (emotional/style control)
   - Falls back to `tts-1` for normal speech (cheaper, faster)
   - ElevenLabs provider (`eleven`) with voice mapping and scream mode settings
   - New voices: ash, ballad, coral, sage, verse (OpenAI); Harry, Charlie, Callum, Liam, etc. (ElevenLabs)
   - Cache keys include instructions + scream flag for distinct entries

2. **`system/public/aesthetic.computer/lib/speech.mjs`**: Frontend passes `instructions` and `scream` to the API

3. **`system/public/aesthetic.computer/disks/say.mjs`**: New colon options:
   - `say:scream WORDS`: defaults to ElevenLabs with scream settings
   - `say:eleven WORDS`: normal ElevenLabs speech
   - `say:openai:scream WORDS`: OpenAI `gpt-4o-mini-tts` with screaming instructions

### Infrastructure

- ElevenLabs API key added to vault (`devcontainer.env`) and production lith server (`/opt/ac/system/.env`)
- Lith server restarted, production endpoint verified working

## API Comparison

### OpenAI `gpt-4o-mini-tts` with instructions
- **Strengths:** Words are always intelligible, `instructions` field gives some emotional control
- **Weaknesses:** Still sounds like "speaking loudly"; it can't break out of clean speech patterns
- **Best prompt:** "You are performing screamo metal vocals. Guttural, throat-shredding screams..."
- **Verdict:** Not screamy enough

### ElevenLabs TTS (stability=0.0, style=1.0)
- **Strengths:** More expressive than OpenAI, variable delivery, some voices get intense
- **Weaknesses:** Still fundamentally speech synthesis: "unhinged speaking", not "screaming"
- **Best voices for intensity:** Harry (Fierce Warrior), Charlie (Energetic), Callum (Husky Trickster)
- **Key settings:** `stability: 0.0`, `similarity_boost: 0.3-0.5`, `style: 1.0`
- **Verdict:** Getting close but not screamo

### ElevenLabs Sound Effects API (`/v1/sound-generation`)
- **Strengths:** Actually generates screaming sounds: metal vocals, harsh shrieks, death growls
- **Weaknesses:** Cannot reliably produce specific words/lyrics; it's a sound model, not a speech model
- **Best prompts:** "screamo metal vocals, guttural harsh screaming" / "deathcore breakdown vocals, pig squeal into death growl"
- **Verdict:** Best screams, but no word control

### ElevenLabs Speech-to-Speech (`/v1/speech-to-speech/{voice_id}`)
- **Strengths:** Takes audio input and transforms it through a voice, preserving word timing/content
- **Available on free tier!**
- **Approach:** Generate yelling TTS with OpenAI -> transform through ElevenLabs voice at 0 stability
- **Model:** `eleven_english_sts_v2`
- **Verdict:** Most promising hybrid; needs more testing with extreme source audio

## Test Samples Generated (`output/`)

### OpenAI gpt-4o-mini-tts
| File | Voice | Description |
|------|-------|-------------|
| `normal-test.mp3` | onyx (tts-1) | Baseline, no instructions |
| `scream-test.mp3` | onyx | Angry scream instructions |
| `scream-terror.mp3` | ash | Terror/panic instructions |
| `scream-extreme.mp3` | ash | Maximum intensity instructions |
| `scream-horror.mp3` | verse | Horror movie victim instructions |
| `screamo-openai-ash.mp3` | ash | Metal vocal instructions |

### ElevenLabs TTS (stability 0.0 to 0.15)
| File | Voice | Description |
|------|-------|-------------|
| `eleven-harry-scream.mp3` | Harry | stability 0.15 |
| `eleven-charlie-scream.mp3` | Charlie | stability 0.1 |
| `eleven-callum-scream.mp3` | Callum | stability 0.1 |
| `screamo-harry-0stab.mp3` | Harry | stability 0.0, screamo lyrics |
| `screamo-charlie-turbo.mp3` | Charlie (turbo) | stability 0.0, turbo model, 651K of chaos |
| `screamo-liam-0stab.mp3` | Liam | stability 0.0 |
| `screamo-callum-0stab.mp3` | Callum | stability 0.0 |
| `screamo-george-0stab.mp3` | George | British storyteller losing it |

### ElevenLabs Sound Effects (best raw screams, no word control)
| File | Prompt concept |
|------|---------------|
| `sfx-scream-metal.mp3` | Screamo metal vocals |
| `sfx-scream-horror.mp3` | Blood-curdling horror scream |
| `sfx-scream-deathcore.mp3` | Deathcore breakdown, pig squeal to growl |
| `sfx-scream-screamo.mp3` | Screamo punk like Thursday/Glassjaw |
| `sfx-words-getout.mp3` | Attempted "GET OUT" (garbled) |
| `sfx-words-burnit.mp3` | Attempted "BURN IT DOWN" (garbled) |
| `sfx-words-helpme.mp3` | Attempted "HELP ME" (garbled) |
| `sfx-words-nothing.mp3` | Attempted "I FEEL NOTHING" (garbled) |
| `sfx-just-no.mp3` | Attempted just "NO" repeated |
| `sfx-fire-concert.mp3` | Live concert "FIRE" |

### Speech-to-Speech Transforms (hybrid approach)
| File | Source -> Voice | Description |
|------|----------------|-------------|
| `sts-harry-scream.mp3` | OpenAI yelling -> Harry | stability 0.0 |
| `sts-charlie-scream.mp3` | OpenAI yelling -> Charlie | stability 0.0 |
| `sts-callum-from-openai.mp3` | OpenAI screamo -> Callum | Full screamo through Callum, 182K |

## Next Steps

1. **Voice Cloning (requires ElevenLabs Starter $5/mo):** Clone a voice from one of the good SFX screamo samples, then use it for TTS; this would combine screamo quality with word intelligibility

2. **Speech-to-Speech Pipeline:** Chain OpenAI gpt-4o-mini-tts (aggressive instructions) -> ElevenLabs STS (0 stability) as a two-step generation. Could be wired into `say.js` as `provider: "scream-hybrid"`

3. **Pre-recorded Samples:** For the `scream` broadcast command specifically, use actual screamo audio clips rather than generating them

4. **Suno/Udio Music APIs:** Music generation models can do actual harsh vocals with lyrics; worth exploring if a music gen API becomes available
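
The two-step Speech-to-Speech pipeline from Next Steps item 2 could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code in `say.js`: the function names (`buildOpenAIRequest`, `buildStsRequest`, `screamHybrid`), the default voice, and the `similarity_boost` default of 0.4 (borrowed from the TTS settings range in the report) are all hypothetical. The request builders are kept separate from the network calls so they can be checked without API keys. It assumes a runtime with global `fetch`, `FormData`, and `Blob` (for example Node 18 or later).

```javascript
// Hypothetical sketch of a "scream-hybrid" provider:
// OpenAI TTS (aggressive instructions) -> ElevenLabs speech-to-speech (0 stability).

const OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech";
const ELEVEN_STS_URL = "https://api.elevenlabs.io/v1/speech-to-speech";

// Step 1: describe the OpenAI request. `instructions` needs gpt-4o-mini-tts,
// so plain speech falls back to the cheaper tts-1, as the report describes.
function buildOpenAIRequest(text, { voice = "ash", instructions } = {}) {
  const body = {
    model: instructions ? "gpt-4o-mini-tts" : "tts-1",
    voice,
    input: text,
    response_format: "mp3",
  };
  if (instructions) body.instructions = instructions;
  return { url: OPENAI_SPEECH_URL, body };
}

// Step 2: describe the ElevenLabs speech-to-speech request.
// voice_settings is sent as a JSON-encoded string form field.
// The similarityBoost default is an assumption, not a tested value.
function buildStsRequest(voiceId, { stability = 0.0, similarityBoost = 0.4, style = 1.0 } = {}) {
  return {
    url: `${ELEVEN_STS_URL}/${encodeURIComponent(voiceId)}`,
    fields: {
      model_id: "eleven_english_sts_v2",
      voice_settings: JSON.stringify({ stability, similarity_boost: similarityBoost, style }),
    },
  };
}

// Chains both steps and resolves to the transformed audio bytes.
async function screamHybrid(text, voiceId, { openaiKey, elevenKey, instructions }) {
  const tts = buildOpenAIRequest(text, { instructions });
  const ttsRes = await fetch(tts.url, {
    method: "POST",
    headers: { Authorization: `Bearer ${openaiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(tts.body),
  });
  if (!ttsRes.ok) throw new Error(`OpenAI TTS failed: ${ttsRes.status}`);
  const sourceAudio = await ttsRes.arrayBuffer();

  const sts = buildStsRequest(voiceId);
  const form = new FormData();
  form.append("audio", new Blob([sourceAudio], { type: "audio/mpeg" }), "source.mp3");
  for (const [name, value] of Object.entries(sts.fields)) form.append(name, value);

  const stsRes = await fetch(sts.url, {
    method: "POST",
    headers: { "xi-api-key": elevenKey },
    body: form,
  });
  if (!stsRes.ok) throw new Error(`ElevenLabs STS failed: ${stsRes.status}`);
  return new Uint8Array(await stsRes.arrayBuffer());
}
```

Since the STS verdict notes that the source audio needs to be more extreme, keeping step 1 as its own builder makes it easy to swap in different `instructions` strings or voices while leaving the transform step unchanged.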
