Monorepo for Aesthetic.Computer aesthetic.computer

pop/big-pictures: per-syllable tiktok pipeline + folk bell accompaniment

initial check-in of the big-pictures pop song pipeline. score-driven
storyboard (.np → per-syllable slides), pitchsnap WORLD vocal
correction, FLUX word-image gen + OCR validation, render_frames.py
held-center keyframes (slides arrive td after audio onset, hold
through full sustain), finalize.mjs cover art with non-overlapping
bands. mary scored across all four traditional verses + chick-chick-
boom outro hook for the folk_backing gunshot SFX trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

+6407
+6
pop/.gitignore
# reference corpus stays vault-only — never commit third-party lyrics
references/*
!references/README.md

# track outputs
big-pictures/out/
+178
pop/AUTOTUNE-ALGORITHMS.md
# autotune algorithms — non-neural offline pitch correction

scope: the actual internals of pitch-correction, with code we can lift. neural systems (rvc, diffsinger, etc.) are the other agent's beat. this is dsp, c, and python — the stuff that runs on jeffrey's 8gb macbook without a gpu and finishes before lunch.

the problem with `pop/bin/pitchsnap.mjs` today: it slices a vocal by word and runs `rubberband -p N` per segment. that's a *static* shift — every sample inside the word moves by the same N semitones. the source's natural pitch contour rides along, which is the opposite of autotune. autotune *clamps* — it replaces f0 with a target curve.

## 1. what autotune actually does — five sentences

step 1: estimate f0 (pitch) frame-by-frame on the input, ~5 ms hop. step 2: for each frame, snap the detected midi note to the nearest member of an allowed scale (with hysteresis so a wobble around the boundary doesn't flip notes every frame). step 3: compute a per-frame correction ratio `target_f0 / source_f0`. step 4: shift the audio by that ratio while preserving formants — so the timbre stays human, the pitch becomes mechanical. step 5: stitch frames back together with phase coherence (or, with psola/world, resynth from decomposed parameters).

that's it. the magic is entirely in steps 1, 4, and 5 being good.

## 2. pitch detection — yin / pyin

naive autocorrelation (what `pitchsnap.mjs` uses today) is prone to octave errors because the autocorrelation peak at lag `t` repeats at `2t`, `3t`, `4t`, etc. any noise pushes the search to the wrong harmonic. yin (de cheveigné & kawahara, 2002) fixes this with the *cumulative mean normalized difference function* — the curve starts at 1 and only dips below the threshold when a true period is found, with the lower-lag (higher pitch) match preferred.

```python
# from patriceguyot/Yin (mit license) — compact form
import numpy as np

def differenceFunction(x, w, tau_max):
    # d(tau) = sum_i (x[i] - x[i+tau])^2, computed via fft for speed
    # w is the window length (x.size)
    x_cumsum = np.concatenate(([0.], (x*x).cumsum()))
    # zero-pad so the circular fft correlation stays linear up to tau_max
    # (next power of two here; the upstream repo picks a "nice" fft size instead)
    size_pad = 1 << int(w + tau_max - 1).bit_length()
    fc = np.fft.rfft(x, size_pad)
    conv = np.fft.irfft(fc * fc.conjugate())[:tau_max]
    return x_cumsum[w:w-tau_max:-1] + x_cumsum[w] - x_cumsum[:tau_max] - 2*conv

def cmndf(df, N):
    # normalize so curve starts at 1, drops < 1 at true period
    return np.insert(df[1:] * np.arange(1, N) / np.cumsum(df[1:]), 0, 1)

def getPitch(cmdf, tau_min, tau_max, harmo_th=0.1):
    tau = tau_min
    while tau < tau_max:
        if cmdf[tau] < harmo_th:
            while tau+1 < tau_max and cmdf[tau+1] < cmdf[tau]:
                tau += 1  # walk down to local minimum
            return tau
        tau += 1
    return 0  # no pitch found (unvoiced)
```

source: `patriceguyot/Yin/yin.py`, also vendored into `nvidia/mellotron`. pyin (mauch & dixon, 2014) extends yin with a probabilistic hmm over candidate periods — much more robust on noisy/breathy vocals. `librosa.yin` and `librosa.pyin` ship both.

## 3. pitch shifting — psola vs phase vocoder vs world resynth

three families, each with a tradeoff:

**td-psola** (charpentier & stella, 1986). estimate f0, find pitch-period peaks, extract two-period hann-windowed grains, re-space the grains at the *target* period, overlap-add. key property: formants are intrinsic to the grain shape, so they survive untouched — no separate formant flag needed. weak on polyphonic / unvoiced material. clean reference: `sannawag/TD-PSOLA`:

```python
# find_peaks and psola are defined in the sannawag/TD-PSOLA repo
def shift_pitch(signal, fs, f_ratio):
    peaks = find_peaks(signal, fs)        # autocorr-based pitch-mark detection
    return psola(signal, peaks, f_ratio)  # window+respace+overlap-add
```

**phase vocoder** (flanagan & golden, 1966; dolson 1986).
stft → for each bin, estimate true frequency from the phase derivative across hops → resynthesize at scaled frame rate → resample to original duration. this is what librosa's `effects.pitch_shift` does (time-stretch via phase vocoder, then resample). also the engine inside rubber band's `R2` mode. weakness: "phasiness" — bins drift apart, transients smear. fixes (laroche & dolson 1999, "phase locking") are what rubber band's `R3 (--finer)` mode adds.

**world resynth** (morise et al., 2016). decompose speech into three independent streams: f0, smoothed spectral envelope (cheaptrick), aperiodicity (d4c). modify any of them. resynth. this is the cleanest "replace the f0 curve" pipeline that exists outside the neural world — see section 4.

| method | formants | best on | failure mode |
|---|---|---|---|
| td-psola | preserved (intrinsic) | clean monophonic vocals | breathy / unvoiced |
| phase vocoder | requires extra filter | broadband / polyphonic | phasiness, transient smear |
| world resynth | preserved (envelope split out) | speech / vocals | not free of artifacts on heavy distortion |

## 4. world / pyworld — the f0-replace pipeline

this is the one. `pyworld` is the python wrapper around morise's c library. install: `pip install pyworld`. has wheels for arm64 macos. the canonical autotune flow is six lines:

```python
import pyworld as pw

# x: mono float64 numpy array, fs: sample rate in hz
# 1. decompose
_f0, t = pw.dio(x, fs)              # raw f0 candidates (or pw.harvest for accuracy)
f0 = pw.stonemask(x, _f0, t, fs)    # refine f0 to sub-frame precision
sp = pw.cheaptrick(x, f0, t, fs)    # smoothed spectral envelope (formants)
ap = pw.d4c(x, f0, t, fs)           # aperiodicity (breath, fricatives)

# 2. modify f0 — this is where autotune happens
f0_corrected = quantize_to_scale(f0, scale_midi_notes)  # see section 5

# 3. resynthesize
y = pw.synthesize(f0_corrected, sp, ap, fs)
```

note what's not here: there is no pitch-shift step. you literally write the new f0 curve into the synthesizer. `sp` (formants) and `ap` (breath texture) are unchanged, so the speaker's identity survives. this is the api surface we want.

`harvest` is slower but recovers from the octave errors that bite `dio` on jeffrey's takes. for vocals always prefer `harvest` + `stonemask`.

## 5. scale quantization with hysteresis

given an instantaneous f0 in hz and a target scale (set of midi numbers), the snap is straightforward — hz → midi via `69 + 12*log2(f/440)`, then nearest-member lookup. the trick is hysteresis: if you're sitting between two scale degrees and the f0 wobbles ±10 cents, naive nearest-snap will *flip notes every frame*, producing a digital trill. fix:

```python
import numpy as np

def quantize_to_scale(f0, scale_midi, prev_target=None,
                      hysteresis_cents=30, retain=1.0):
    out = np.zeros_like(f0)
    cur = prev_target
    for i, f in enumerate(f0):
        if f <= 0:  # unvoiced — keep silent
            out[i] = 0; continue
        midi = 69 + 12*np.log2(f/440)
        # candidates sorted by distance to current frame
        cands = sorted(scale_midi, key=lambda n: abs(n - midi))
        nearest = cands[0]
        if cur is not None and nearest != cur:
            # only switch if the new note is closer than the current one
            # by at least hysteresis_cents
            if abs(midi - nearest)*100 + hysteresis_cents > abs(midi - cur)*100:
                nearest = cur
        cur = nearest
        # `retain` interpolates between source pitch (0) and full snap (1)
        corrected_midi = midi + retain * (nearest - midi)
        out[i] = 440 * 2**((corrected_midi - 69)/12)
    return out
```

`retain=1.0` is full hard auto-tune (the t-pain sound). `retain=0.6` is melodyne-style — keeps human bend. `hysteresis_cents=30` to `50` kills the trill without slowing legit pitch transitions.
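
to see the trill that section 5 describes, here is a minimal self-contained sketch. it re-implements only the note-choice part of the snap (same switching rule as `quantize_to_scale`) and runs it on a synthetic pitch curve parked between two scale notes. the helper names (`snap`, `count_flips`), the scale fragment, and the wobble rate are all made up for illustration, not part of `pitchsnap.mjs`.

```python
import numpy as np

def hz_to_midi(f):
    return 69 + 12 * np.log2(f / 440.0)

def midi_to_hz(m):
    return 440.0 * 2 ** ((m - 69) / 12)

def snap(f0, scale_midi, hysteresis_cents=0.0):
    """return the chosen scale note per frame (0 for unvoiced frames)."""
    notes = np.zeros(len(f0))
    cur = None
    for i, f in enumerate(f0):
        if f <= 0:
            continue
        midi = hz_to_midi(f)
        nearest = min(scale_midi, key=lambda n: abs(n - midi))
        if cur is not None and nearest != cur:
            # stay on the current note unless the new one is closer
            # by at least hysteresis_cents
            if abs(midi - nearest) * 100 + hysteresis_cents > abs(midi - cur) * 100:
                nearest = cur
        cur = nearest
        notes[i] = nearest
    return notes

def count_flips(notes):
    """number of frame-to-frame note changes among voiced frames."""
    voiced = notes[notes > 0]
    return int(np.sum(np.diff(voiced) != 0))

# synthetic take: pitch sits on the boundary between midi 60 and 62,
# wobbling +/-10 cents (hypothetical numbers, chosen to provoke the trill)
frames = np.arange(200)
midi_curve = 61.0 + 0.10 * np.sin(2 * np.pi * frames / 20)
f0 = midi_to_hz(midi_curve)
scale = [60, 62, 64, 65, 67]  # hypothetical scale fragment

naive = count_flips(snap(f0, scale, hysteresis_cents=0))
sticky = count_flips(snap(f0, scale, hysteresis_cents=30))
print("flips without hysteresis:", naive)
print("flips with 30 cents:", sticky)
```

with no hysteresis the chosen note changes roughly every half wobble cycle; with 30 cents it locks to whichever note it picked first. a sustained move of about 20 cents past the midpoint would still get through, which is the behaviour the `30` to `50` range is after.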

## 6. shortlist — what we can plug in tonight

| lib | language | offline? | install | what we get |
|---|---|---|---|---|
| **pyworld** | python | yes | `pip install pyworld` | f0 extract + replace + resynth, formants intrinsic |
| psola | python | yes | `pip install psola` | td-psola via parselmouth/praat — needs target-pitch numpy array |
| librosa | python | yes | `pip install librosa` | `yin`, `pyin`, `effects.pitch_shift` (phase vocoder) |
| crepe | python | gpu-light | `pip install crepe` | neural pitch detector — slower but cleanest f0 |
| autotalent | c (ladspa) | yes | source build | reference c implementation, gpl2 |
| rubberband | c++ | yes | `brew install rubberband` (already installed) | phase vocoder pitch shift, no f0-replace |
| sannawag/TD-PSOLA | python | yes | `git clone` | minimal readable td-psola reference |

all of the python options install on apple silicon python 3.14 with native wheels in 2026. pyworld's only build dep is numpy + cython.

## 7. integration sketch — pitchsnap.mjs → world

the right move is to delegate the per-word pitch step to a small python helper that does world-style f0 replacement.
`pitchsnap.mjs` keeps its responsibilities (whisper alignment, score parsing, grid snap, slice extraction) and calls the helper for the actual correction:

```python
# pop/bin/pitchsnap_world.py — called per-word from pitchsnap.mjs
import sys, numpy as np, pyworld as pw, soundfile as sf

in_wav, out_wav, target_midi, retain = sys.argv[1:5]
target_midi = float(target_midi); retain = float(retain)

x, fs = sf.read(in_wav, dtype="float64")
if x.ndim > 1: x = x.mean(axis=1)  # downmix to mono

f0_raw, t = pw.harvest(x, fs, f0_floor=80, f0_ceil=600)
f0 = pw.stonemask(x, f0_raw, t, fs)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

# replace f0 with target: `retain` blends in log-frequency from the
# source pitch (0) to a full snap onto target_hz (1); unvoiced frames stay 0
target_hz = 440 * 2**((target_midi - 69)/12)
voiced = f0 > 0
f0_new = np.where(voiced, np.exp((1-retain)*np.log(np.maximum(f0,1e-6))
                                 + retain*np.log(target_hz)), 0)

y = pw.synthesize(f0_new, sp, ap, fs)
sf.write(out_wav, y.astype(np.float32), fs)
```

then in `pitchsnap.mjs`, replace the rubberband call inside the per-word loop with:

```javascript
import { spawnSync } from "node:child_process"; // if not already imported
import { resolve } from "node:path";

spawnSync("python3", [
  resolve(import.meta.dirname, "pitchsnap_world.py"),
  sliceWav, shiftedWav, String(targetMidi), String(retain)
], { stdio: "inherit" });
```

next phase: instead of a single `target_midi` per word, write a per-frame target curve that interpolates between syllable notes — that's the syllable-glide we already plan in `pitchsnap.mjs` curve mode, but executed by world instead of cross-fading two rubberband renders.

---

**recommendation: integrate pyworld first.** it's the only option in this list that lets us *replace* the f0 curve instead of shifting it, formants stay intact for free, and the install is one line.

```
pip install pyworld soundfile numpy
```
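
the per-frame target curve from the "next phase" note in section 7 could look roughly like the sketch below. it assumes syllables arrive as `(start_seconds, midi_note)` pairs and that frames sit on world's default 5 ms grid (the `t` array that `pw.harvest` returns). `glide_curve`, the `glide_s` portamento time, and the two-syllable example are hypothetical, not something `pitchsnap.mjs` has today.

```python
import numpy as np

def glide_curve(t, syllables, glide_s=0.06):
    """per-frame target f0 in hz for frame times t (seconds).

    syllables: list of (start_s, midi) sorted by start; each note holds
    until the next start. glide_s is the slide time into each new note.
    """
    starts = np.array([s for s, _ in syllables], dtype=float)
    notes = np.array([m for _, m in syllables], dtype=float)
    # index of the syllable active at each frame (frames before the first
    # start are clamped onto the first note)
    idx = np.clip(np.searchsorted(starts, t, side="right") - 1, 0, len(notes) - 1)
    cur = notes[idx]
    prev = notes[np.maximum(idx - 1, 0)]
    # 0 at the syllable start, 1 once the glide has finished
    ramp = np.clip((t - starts[idx]) / glide_s, 0.0, 1.0)
    midi = prev + ramp * (cur - prev)  # linear in semitones
    return 440.0 * 2 ** ((midi - 69) / 12)

# stand-in data: one second of frames, a fake f0 track with 10 unvoiced frames
t = np.arange(0, 1.0, 0.005)
syllables = [(0.0, 60), (0.5, 64)]  # hypothetical word: midi 60 then 64
target = glide_curve(t, syllables)
f0 = np.full_like(t, 250.0)
f0[:10] = 0.0
f0_new = np.where(f0 > 0, target, 0.0)  # unvoiced frames stay unvoiced
```

in the helper, `f0_new` would go straight into `pw.synthesize(f0_new, sp, ap, fs)` in place of the single-note curve. interpolating in midi rather than hz keeps the slide perceptually even.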
+173
pop/JEFFREY-VOICE.md
# jeffrey voice — for writing rap lyrics

a style guide for `pop/big-pictures/`. read this before drafting a lyric. all quotes below are real, pulled from the AC repo — papers, recap narrations, code comments. the voice you're chasing already exists; the job is to compress it into bars without flattening it.

## 0. the posture in one line

> "the music is real and the math just flipped" — `pop/big-pictures/plork.txt`

quiet conviction, present tense, plain words. nobody is being convinced. somebody is being shown.

## 1. recurring couplet patterns (lift these)

### a. "X is the Y, Y is the Z" — chained identity

the load-bearing one. jeffrey reaches for it whenever a system collapses into itself.

- "every piece a url / every url a score" (`ac.txt`)
- "the address is the score" (`ac.txt`)
- "the goal was the room. the room was the soul" (`plork.txt`)
- "the gear was the gate, the gate was the price tag" (`plork.txt`)
- "the deployment mechanism is a file, not a supply chain" (`plork.tex` §10)

lands when the chain is a real collapse — A actually equals B, not "is sort of like". don't fake it.

### b. "every X a Y" — universalizer

- "every piece a url" (`ac.txt`)
- "every URL is the program" (paraphrased through whole `papers/SCORE.md`)
- "every orchestra instrument has a name. now every laptop does too." (`plork.tex` §4)

works in hooks. one stress per side, no filler.

### c. "no X, no Y, no Z" — the negation triplet

- "no investors, no playbook, no series-a scheme" (`ac.txt`)
- "one-person stack, no runway, no dream" (`ac.txt`)
- "no userspace, no desktop environment, no package manager, no shell" (`plork.tex` §4)
- "no install, no app store, no auth at the door" (general AC posture, paraphrasable)

three is the number. four is academic. two is incomplete.

### d. "this is not a metaphor. it is a Y." — the slam-pivot

the canonical jeffrey line:

> "PLOrk'ing the planet is not a metaphor. It is a logistics problem. And the logistics just got a hundred times cheaper." (`plork.tex` §13)

already lifted into bar form:

- "the music is real — the logistics got cheap" (hook of `plork.txt`)
- "plork the planet — not a theory anymore, a tool" (`plork.txt`)

use this when the listener might be reaching for the metaphor read. block it. name what it actually is.

### e. "i'm not building a product, i'm building a Y." — the disavowal

- "i'm not pitching a product, i'm holding a stand" (`ac.txt`)
- "i'm not asking permission — i'm just lifting it" (`plork.txt`)
- "i'm not bitter — i'm honest, i keep score on the track" (`plork.txt`)

the second clause has to do real work. "not X, just Y" only lands if Y is more specific than X, not more abstract.

### f. "X was the Y" — reframing what something already was

- "the laptop orchestra was a beautiful idea that reached almost no one" (`plork.tex` §1)
- "this was not a barrier for Princeton. it was a barrier for everyone else." (`plork.tex` §1)
- "the surplus laptop is not a degraded general-purpose computer. it is an unrealized dedicated device." (`plork.tex` §5)
- "the whistlegraph is a score that teaches you how to play it" (`whistlegraph.tex` abstract)

structurally a setup-and-flip. the rap version puts the flip on the rhyme.

## 2. vocabulary jeffrey reaches for

real words from the corpus, with locations. lift these. they carry his fingerprint.

| word / phrase | example |
|---|---|
| **score** | "the address is the score" / "a score that teaches you how to play it" (`whistlegraph.tex`) |
| **address** | "the address is the score" (`ac.txt`); "color address" (`recap/audience/jeffrey-24h-2026-04-30.mjs`) |
| **piece** | "every piece a url" / "the piece was the wave" (`ac.txt`) |
| **real / it's real** | "the music is real" (used three times in `plork.txt`); "real, deadpan, very 'him'" (recap) |
| **the whole X** | "the whole move" / "the whole stack" / "the whole runtime" (`ac.txt`); "the whole app does less work when hidden" (`jeffrey-73h`) |
| **load-bearing** | "load-bearing infrastructure" (`jeffrey-24h-2026-05-01`); from `pop/VOICE.md`: "earned and load-bearing" |
| **honest** | "i'm not bitter — i'm honest" (`plork.txt`); "be honest about what doesn't work" (`papers/VOICE.md`) |
| **lands / it lands** | "fragments when they land" (`pop/VOICE.md`); "end with something that lands" (`papers/VOICE.md`) |
| **shipping / shipped** | "menuband shipped six releases in a row" (`jeffrey-73h`) |
| **logistics** | "it is a logistics problem. and the logistics just got a hundred times cheaper" (`plork.tex`) |
| **the gate / gated** | "the gear was the gate, the gate was the price tag" (`plork.txt`); "gated by enrollment" (`plork.tex`) |
| **kernel / flash a kernel** | "i flash a kernel from a usb and call it a school" (`plork.txt`) |
| **stack** | "one-person stack, no runway, no dream" (`ac.txt`); "the menubar audio stack" (`jeffrey-73h`) |
| **barely / under / sublinear** | "scales sublinearly with complexity" (`plork.tex`); "boot in under 8 seconds" |
| **the room** | "the goal was the room. the room was the soul" (`plork.txt`) |
| **tool / convivial** | "not a theory anymore, a tool" (`plork.txt`); "tools for conviviality" (`plork.tex` §8) |
| **planet / planetary** | "plork the planet" / "planetary laptop orchestra" (`plork.tex` everywhere) |
| **moat** | "build the research moat" (`papers/RESEARCH-DIRECTION.md`) |
| **fork / forkable** | "fork the whole runtime — same kid, same shape" (`ac.txt`); "URL-safe, and forkable" (`plork.tex`) |
| **embarrassingly** | "embarrassingly abundant" (`plork.tex` conclusion) |

avoid technical synonyms. when jeffrey means kernel, he says kernel. when he means url, he says url. don't translate `mjs` to "module" or `usb` to "drive". the specific word is the texture.

## 3. punctuation + rhythm tics

- **em-dash** for the swerve. comma is too soft, parens are academic, semicolons are forbidden.
  - "the music is real — the logistics got cheap"
  - "i'm not bitter — i'm honest, i keep score on the track"
- **period for the slam.** short final sentence after a longer setup. the bar that ends a verse is almost always shorter than the bar before it.
  - "It is a logistics problem. And the logistics just got a hundred times cheaper."
- **lowercase as default.** capitalize only where convention demands it (handles, proper nouns the listener actually knows). brand names like AC and PLOrk capitalize in papers but stay lowercase in lyric drafts.
- **fragments after a longer setup.** "fair. but the full stack was never the goal." — the one-word concession lands harder than the explanation.
- **never a question mark in a hook.** statements, not pitches. (the verse can ask a question; the hook closes one.)

## 4. what jeffrey AVOIDS (cut these from drafts)

from `pop/VOICE.md` and `papers/VOICE.md` directly + the corpus pattern:

- **academic hedging**: "we propose", "the authors", "in this paper we", "a novel approach to", "results suggest potential viability"
- **defensive citation stacking** — cite when it matters, not to prove you read things
- **ChatGPT prose tells**: "moreover", "furthermore", "it's worth noting", parallelism for parallelism's sake, three-item lists where two would do, smoothing the rough edge
- **rap cliche**: money, cars, club, ice, bottles, "ayy", "skrt", "let's get it", "you already know" — `pop/VOICE.md` is explicit: "AC has nothing to flex about and that's the point"
- **corporate vocab**: "ecosystem", "stakeholders", "leverage", "synergy", "vision statement"
- **generic emotion**: "tired" is not a feeling — `pop/VOICE.md`: "tired of explaining kidlisp at every dinner" is
- **explaining the vision before singing it.** the lyric should *be* the vision, not narrate it (`pop/VOICE.md`)
- **smoothing the catch.** if there's a catch, name it. plork.txt verse 2 opens with "look — i know what they'll say. it sounds like a hack" — that's the move.

## 5. lyric-specific patterns (the rap layer)

these don't appear in the prose corpus because they're rap-only. they extend the voice into bars without breaking it.

- **internal rhyme over end rhyme.** the rhyme should fall mid-bar at least as often as it falls on the four. (`pop/VOICE.md`)
  - working: "two-forty million laptops, **windows ten retired** / flash a kernel — every one of them just got **hired**" — internal "retired/hired" plus the period after "kernel"
- **the hook is the vision compressed to one repeatable line.** if you can't say what the song's about in the hook's first line, the hook's not done.
  - "the music is real — the logistics got cheap" (`plork.txt`)
  - "every piece a url / every url a score" (`ac.txt`)
- **second verse can be the limitations section.** confess the catch, then keep moving. exactly mirrors the papers' Limitations sections.
  - `plork.txt` v2: "look — i know what they'll say. it sounds like a hack / fifty-dollar laptops can't run the full stack / fair. but the full stack was never the goal"
- **technical names land plainly, no ceremony.** kidlisp, notepat, fedac, plork, mjs, hp-gl, ywft, rs-232 — drop them in unglossed. the listener catches up. don't say "my custom lisp called kidlisp"; just say kidlisp.
- **conviction quiet but absolute.** never "i think the music is real". just "the music is real". hedging is the audible giveaway that the line isn't from him.
- **ad-libs are minimal.** one or two per verse, low in the mix. "look —" works. "yeah yeah" doesn't.
- **the t-shirt rule (recap-derived).** every chapter image has a one-word command on the shirt that resolves at the AC prompt. lyric equivalent: every verse has at least one bar that names a real piece you can type and run. (notepat, kidlisp, $roz, plork…)

## 6. examples gallery — works / doesn't

pulled straight from `pop/big-pictures/*.txt` to mark what's already landing and what's still drifting toward generic.

**works:**

- `plork.txt` hook: "the music is real — the logistics got cheap / landfill's full of orchestras dreaming in their sleep" — em-dash slam, then a specific image (landfill, orchestras, sleeping). no abstraction.
- `plork.txt`: "i'm not bitter — i'm honest, i keep score on the track" — disavowal pattern + the double meaning on "score" is exactly the kind of pun jeffrey actually makes (score = music, score = tally).
- `plork.txt`: "the gear was the gate, the gate was the price tag" — chained-identity (pattern 1a) compressed into one bar. perfect.
- `plork.txt`: "look — i know what they'll say. it sounds like a hack / fifty-dollar laptops can't run the full stack / fair. but the full stack was never the goal / the goal was the room. the room was the soul" — the limitations-section move done as bars. the "fair." is the Adornoesque period-slam.
- `ac.txt` hook: "every piece a url / every url a score / push enter — you're inside / aesthetic dot computer" — pattern 1b twice, then an instruction, then the address. four bars, four moves, no waste.
- `ac.txt`: "i wanted a place my friends could just paint / where the paint was the piece, the piece was the wave" — pattern 1a, but starting from "i wanted" gives it the personal ground. this is the emo-rap-honesty register `pop/VOICE.md` asks for.

**doesn't (yet):**

- `ac.txt`: "i'm not pitching a product, i'm holding a stand" — pattern 1e is right but "holding a stand" is vaguer than what jeffrey would actually say. compare to "i'm not asking permission — i'm just lifting it" from `plork.txt`, which has a real verb. "stand" reads like the line was reaching for a rhyme with "hand". the second clause has to specify, not abstract.
- `ac.txt`: "the music is real and it works in your hand" — "the music is real" is load-bearing jeffrey, but "works in your hand" is a marketing flourish. cf. `plork.txt`'s "the music is real and the math just flipped" — same opener, then a specific claim. swap the second clause for something only AC can say.
- `plork.txt`: "i watched plork in a basement on a livestream / fifteen laptops in a hall doing rituals like a dream" — "rituals like a dream" is poetic-rap default. the rest of the song avoids that register; this couplet drifts. the fix is the specific number-noun ("fifteen laptops in a hall") without the "like a dream" softener.

## 7. checklist before you ship a draft

- [ ] does the hook pass the one-line-vision test? if you read just line one, do you know the song?
- [ ] is there at least one chained-identity bar (pattern 1a)?
- [ ] is there at least one period-slam at a verse end?
- [ ] does verse 2 own a limitation honestly?
- [ ] are technical names (kidlisp, notepat, fedac…) landed without gloss?
- [ ] every "the music is real / X is the Y" — does X actually equal Y, or did you reach?
- [ ] zero "ayy / skrt / let's get it / yeah yeah". zero "ecosystem / leverage / stakeholders". zero "moreover".
- [ ] read it aloud lowercase. does it sound like the recap narrations or like a poster?

---

*maintained by @jeffrey — update when the voice evolves. all examples above are real, pulled from the AC repo as of 2026-05-03.*
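
the mechanical items on the checklist (banned phrases, semicolons, question marks in a hook) are easy to check before the read-aloud pass. a minimal sketch, assuming a draft is plain text with one bar per line. `lint_lyric`, the `hook_lines` parameter, and the exact phrase list are hypothetical and only cover the phrases quoted in sections 3, 4, and 7. nothing like this exists in `pop/bin/` as far as this guide says.

```python
import re

# phrases quoted in sections 4 and 7 (not exhaustive)
BANNED = ["ayy", "skrt", "let's get it", "yeah yeah", "you already know",
          "ecosystem", "leverage", "stakeholders", "synergy",
          "moreover", "furthermore", "it's worth noting"]

def lint_lyric(text, hook_lines=()):
    """return (line_number, message) pairs for mechanical checklist hits.

    hook_lines: 1-based line numbers that belong to the hook, where a
    question mark counts as a problem.
    """
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        low = line.lower()
        for phrase in BANNED:
            if re.search(r"\b" + re.escape(phrase) + r"\b", low):
                problems.append((n, f"banned phrase: {phrase}"))
        if ";" in line:
            problems.append((n, "semicolon"))
        if n in hook_lines and "?" in line:
            problems.append((n, "question mark in hook"))
    return problems

draft = "every piece a url / every url a score\nayy, is the ecosystem real?"
hits = lint_lyric(draft, hook_lines={1, 2})
for hit in hits:
    print(hit)
```

this only catches the surface tells. whether X actually equals Y, or whether verse 2 owns a limitation, still needs a human read.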
+78
pop/RESEARCH-DIRECTION.md
# Research Direction · AC Pop

**Last updated**: 2026-05-03
**Author**: @jeffrey

---

## Posture

**bottom-up + compositional.** suno / udio are product-in, top-down, overused, and not compositional — you give them a prompt and get a finished song you didn't compose. that is not the AC posture. tracks here are built from AC's own instruments (the same notepat / chord / sinebells / beat primitives the recap waltz bed already uses), bar by bar. the composition is the artifact.

## Vocal Architecture (decided 2026-05-03)

big-pictures vocal = **jeffrey-pvc via ElevenLabs**. same `provider: "jeffrey", voice: "neutral:0"` already used by the 24h recap pipeline (`/api/say` proxy in `system/netlify/functions/`, called from `recap/bin/tts.mjs`). hooks and verses both. it's literally jeffrey's cloned voice — the most authentic option and already wired up.

**AC-native vocal was attempted and dropped on 2026-05-03.** built a 3-formant synth (`recap/bin/vocal.mjs`) and a `.np` score reader, rendered the plork hook over the trap bed. result: it sounded like tones, not voice. real vocoder/talkbox character would need glottal pulse, F4/F5, pitch jitter, breath, consonant articulation — a full research lane. not blocking track 1 on it. `vocal.mjs` stays in the repo as experimental research; may find use as a *melodic instrument* layer (formant-shaped lead) rather than a vocal.

if/when AC-native vocal resumes, the right starting point is **Pink Trombone** (Neil Thapen, 2017) — a browser-based physical model of the vocal tract that produces articulate speech-like sounds in pure JS. open-source, copy-pasteable. ports cleanly to a node renderer that consumes the `.np` score directly. see `reference_pink_trombone.md` in memory.

after the ElevenLabs stem returns, we run **WhisperX forced alignment** on it (the same dependency the recap pipeline already uses for subtitle timing — see `feedback_recap_subtitle_timing_drift.md`) to get per-word timestamps. those word boundaries:

1. **confirm** every word lands on or near a 16th-note grid line of the trap bed
2. drive the **bar-snap** pass that nudges drift to the nearest beat (per `feedback_recap_musical_snapping.md`, ~200ms tolerance)
3. become the score for any visual layer (subtitle overlays, notepat-style scrolling lyric track, AC piece tied to the track)
4. become the **edit unit** for vocal post-production (next section)

the **`.np` score** for each track stays useful: it pairs the lyrics with a pitch contour, which is the right input to a notepat-style scrolling lyric visual layer and to any future kidlisp-driven music piece.

## Vocal Post-Production (per-word)

once a vocal track has word boundaries (WhisperX for ElevenLabs verses, or directly from the `.np` score for `vocal.mjs` hooks), each word is an editable segment. the post-prod stage applies per-word edits driven by a recipe file (`pop/big-pictures/<slug>.edit.json`):

- **pitch** — shift up/down semitones, autotune to the scale of the bed (C minor pentatonic for trap)
- **elongate** — time-stretch to fit a target duration (rubberband / atempo, formant-preserving)
- **effect** — reverb / delay / distortion / vocoder / formant-shift / saturation, per-word
- **harmonize** — duplicate stem, pitch-shift each copy (+3rd, +5th, octave), mix back
- **realign** — move the word's start to match a target beat / 16th-note slot

the **aggression** is a per-edit (or per-track) knob: `gentle` (≤50ms nudge), `firm` (≤200ms), `aggressive` (snap to nearest beat regardless of distance), `off` (leave timestamps as-is). aggression applies independently to each operation — you can pitch-correct gently and realign aggressively, or vice versa.

post-prod is **generic across vocal sources** — works the same on ElevenLabs verse stems and `vocal.mjs` hook output. the `.edit.json` recipe is paired with the lyrics/score.

implementation hooks: ffmpeg filtergraph (`asetrate`, `atempo`, `aecho`, `aphaser`, `chorus`), rubberband for pitch+time independence, sox for finer effects.

## Current Goals

1. **Land the first big-pictures track** — one AC vision, ninety seconds, mixed and listenable end-to-end. proves the pipeline before scaling
2. **Route lyrics through jeffrey-pvc TTS** — feed plork lyrics into `/api/say` with the existing voice config (`provider: "jeffrey", voice: "neutral:0"`); cache the stem; mix over the trap bed
3. **Build a curated reference corpus** — ~10–20 emo rap tracks' lyrics in the vault, used as in-context style examples for the lyric generator (not training data)
4. **Write the lyric generator** — paper / vision → 16-bar verse + 4-line hook in jeffrey-pvc voice with emo-rap overlay
5. **WhisperX align + bar-snap** — run forced alignment on the jeffrey-pvc stem; emit per-word timestamps; nudge drift to bar grid
6. **Build the per-word post-prod stage** — `pop/bin/vocal-post.mjs`: takes a vocal track + word timestamps + `.edit.json` recipe → applies pitch / elongate / effect / harmonize / realign, with per-edit aggression
7. **Tune the AC-native trap bed** — extend `recap/bin/trap.mjs` (now landed) with: dedicated 808 sub voice, swing toggle, fill on bar 16, optional pan stage

## Open Questions

- copyright posture for reference lyrics: in-context style examples only, vault-only, never committed. enough?
- elevenlabs cadence control: how tightly can we pin the vocal to the bar grid? rap needs the stem to land on the beat, not float. WhisperX + snap should fix most drift; test before committing the lane
- jeffrey-pvc rap performance: the 24h-recap voice is calm + descriptive. does it deliver rap cadence at all, or do we need a separate voice variant ("jeffrey-pvc:rap")? test on the plork hook first
- "big pictures" the show vs. the format: is each track its own episode, or are they collected into albums? defer until track 3

## First Track Plan — `plork`

source: `papers/arxiv-plork/plork.tex`. core hook line already in the paper: *the music is real, the logistics just got cheaper.*

```
1. lyrics     → pop/big-pictures/plork.txt      [done — v1 draft]
2. score      → pop/big-pictures/plork.np       [done — hook + verse 1 + outro; visual layer / future]
3. trap bed   → recap/out/trap.mp3              [done — 16 bars, 140 BPM]
4. vocal stem → /api/say with jeffrey-pvc       [next: feed plork.txt as narration body]
5. align      → WhisperX → per-word timestamps
6. snap       → nudge words to nearest bar-grid 16th-note
7. post-prod  → vocal-post.mjs --edit plork.edit.json
                (pitch / elongate / effect / harmonize / realign, aggression-tunable)
8. mix        → pop/big-pictures/out/plork.mp3 (~1:30)
```

If the jeffrey-pvc TTS doesn't carry rap cadence — try a different ElevenLabs voice variant or work the lyric line-breaks / punctuation to coax better delivery before resorting to a different provider.
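
the snap step (plan item 6) and the aggression knob can be sketched in a few lines. this assumes word timestamps arrive as dicts with `word`, `start`, `end` in seconds, and reads "≤50ms nudge" as: snap only when the needed move is within the tolerance, otherwise leave the word alone. that reading, the `snap_words` name, and the `division` parameter are assumptions, not the settled design of `vocal-post.mjs`.

```python
# max allowed nudge per aggression level, in seconds (from the post-prod section)
TOLERANCE_S = {"off": 0.0, "gentle": 0.050, "firm": 0.200,
               "aggressive": float("inf")}

def grid_step(bpm, division=4):
    """seconds per grid slot: division=4 gives 16th notes, 1 gives beats."""
    return 60.0 / bpm / division

def snap_words(words, bpm=140, division=4, aggression="firm"):
    """shift each word onto the nearest grid line if the move is in tolerance.

    start and end move together, so word duration is preserved.
    """
    step = grid_step(bpm, division)
    tol = TOLERANCE_S[aggression]
    out = []
    for w in words:
        target = round(w["start"] / step) * step
        nudge = target - w["start"]
        if aggression == "off" or abs(nudge) > tol:
            nudge = 0.0  # too far (or snapping disabled): leave as-is
        out.append({**w,
                    "start": w["start"] + nudge,
                    "end": w["end"] + nudge,
                    "nudge_ms": round(nudge * 1000, 1)})
    return out

# made-up timestamps for two words of the hook
words = [{"word": "music", "start": 0.46, "end": 0.80},
         {"word": "real", "start": 0.91, "end": 1.20}]
gentle = snap_words(words, aggression="gentle")
firm = snap_words(words, aggression="firm")
for w in firm:
    print(w["word"], round(w["start"], 3), w["nudge_ms"])
```

at 140 BPM a 16th-note slot is about 107 ms, so no word is ever more than about 54 ms from a grid line. on that grid `firm` and `aggressive` behave the same, and the levels only separate on a coarser grid such as `division=1` (beats).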
+49
pop/SCORE.md
··· 1 + # Score for Pop 2 + 3 + ## Mill Mission 4 + 5 + `pop` is the research home for music that comes out of Aesthetic Computer — songs, instrumentals, and the writing about them. The papers platter pushes AC's thinking out as text. `pop` does the same job in the form of tracks: short, finished pieces of music that can leave the building and be heard. 6 + 7 + The mill is not a label. It is a research lane. Every track here exists because it was the most honest way to compress an idea — a feature, a vision, a paper, a moment in the project. If a thread can survive being written as a song, it was real. If it can survive being compressed to ninety seconds, it was essential. 8 + 9 + ## What This Is 10 + 11 + Pop tracks are one output of the AC research platter. They share source material with `papers/` — the same threads, the same readings, the same code — but render as audio. The platter feeds both. Some threads become papers, some become tracks, some become both. 12 + 13 + ## Posture 14 + 15 + **bottom-up + compositional.** tracks here are composed from AC's own instruments — the notepat sample bank, sinebells, chord, beat — the same primitives the recap waltz bed already uses. no suno-style end-to-end song generation; that's product-in, top-down, and not compositional. AI vocal (ElevenLabs) is the one exception, since vocal is performance on top of the composition, not the compositional substrate. 16 + 17 + ## Process 18 + 19 + ``` 20 + platter (raw material: notes, code, conversations, papers) 21 + → thread (a vision worth singing) 22 + → draft lyrics (in jeffrey-pvc voice + per-genre voice) 23 + → vocal + beat (per-lane pipeline) 24 + → mix (~1:30 mp3, audio-only) 25 + ``` 26 + 27 + Audio-only by default. No video, no chrome. If a track later becomes a video lane, that's a recap-side concern, not a `pop` concern. 28 + 29 + ## Swimlanes 30 + 31 + ### 1. big pictures (`big-pictures/`) 32 + 33 + Hip hop / trap dance versions of jeffrey's AC visions. 
Roughly **1:30 per track**. Lyrics rapped over a 4/4 trap bed (808s, triplet hats), one track per "big picture" — a single AC vision pulled from the papers platter (laptop orchestras, kidlisp, native OS, identity, latency, etc). 34 + 35 + Voice posture: emo-rap honesty. Conviction quiet but absolute. No flexing, no industry posture. Internal rhyme over end rhyme. The vision is the hook. 36 + 37 + See [`big-pictures/README.md`](big-pictures/README.md) for the format spec. 38 + 39 + ### 2. (open) 40 + 41 + More lanes will land here as they prove themselves. Candidates: kidlisp-as-instrument tracks, AC-native ensemble cuts, voice-memo-grade demo lane. None of them have earned a swimlane yet — they need a real track first. 42 + 43 + ## References 44 + 45 + Third-party lyrics (emo rap reference corpus, etc) live in the vault. They are not committed to this repo. See [`references/README.md`](references/README.md). 46 + 47 + --- 48 + 49 + *maintained by @jeffrey*
+172
pop/SPEECH-TO-SINGING-V2.md
··· 1 + # speech-to-singing — v2 follow-up 2 + 3 + v1 picked the saitou recipe and we built it. world (`pyworld`) lands the median pitch on target to ~2¢. but the holes are now audible — word-start stutters, held-note ringing, harsh sibilants, and outlier octave errors from naive autocorrelation in `pitchcheck.mjs`. v2 goes shopping for fixes to *those specific symptoms*, not for another full pipeline. 4 + 5 + three lanes: 6 + 7 + — **stay inside world, fix it.** parameter passes, second-pass correction, voicing-edge ramps. 8 + — **swap parts of world out.** psola via parselmouth, sinusoidal modeling, neural pitch detectors. 9 + — **bypass world.** rvc / so-vits-svc / nsf as a "real singer" lane. bigger payoff, bigger install. 10 + 11 + ## 1. better pitch detection — the quick outlier kill 12 + 13 + our `pitchcheck.mjs` autocorrelation is the *measurement* tool, not the renderer — but its octave errors inflate the mean drift report and make us chase ghosts. world internally uses `harvest` already, which is good. the question is: when *we* measure the post-render f0 to verify, what do we use? 14 + 15 + | tool | install | gpu | what we get | 16 + |---|---|---|---| 17 + | **librosa.pyin** | already installed in the env | no | viterbi-smoothed yin, monophonic-only — perfect for our use. one function call: `librosa.pyin(y, sr=sr, fmin=70, fmax=600)` (`sr` must be passed by keyword). drop-in replacement for `detectPitch()`. | 18 + | **torchcrepe** | `pip install torchcrepe` (~80 mb plus torch) | optional | pretrained cnn, sub-cent accuracy on clean voice. has a `viterbi=True` flag for smoothing. cpu-runnable, ~2x real-time on m-series. | 19 + | **rmvpe** | `pip install rmvpe` (~120 mb model) | optional | best-on-singing benchmark (87.2% rpa on mir-1k vs crepe 85.3%). robust to noise. used internally by rvc/applio. | 20 + | **fcpe / torchfcpe** | `pip install torchfcpe` (~30 mb) | optional | 2026 model. ~5× faster than rmvpe, ~77× faster than crepe. cpu-realtime.
accuracy on par with rmvpe on clean monophonic. | 21 + | **penn** | `pip install penn` (~50 mb) | optional | morrison's cross-domain neural pitch + periodicity, 11× rt on cpu, returns confidence — useful for voicing detection. | 22 + 23 + **verdict.** `librosa.pyin` is the immediate win — already on disk, two-line swap inside `pitchcheck.mjs`, removes the 170¢ mean-drift outliers in one move. for a future "second-pass autotune" (see §3), `torchfcpe` is the right pick — small install, cpu-friendly, sings well. 24 + 25 + integration sketch (pitchcheck.mjs): 26 + 27 + ```python 28 + # pop/bin/pitchcheck_pyin.py — child process called from pitchcheck.mjs 29 + import sys, librosa, soundfile as sf, numpy as np 30 + y, sr = sf.read(sys.argv[1]) 31 + if y.ndim > 1: y = y.mean(axis=1) 32 + f0, voiced, conf = librosa.pyin(y.astype(np.float32), sr=sr, 33 + fmin=70, fmax=600, frame_length=2048) 34 + # emit timestamp, hz, confidence csv 35 + ``` 36 + 37 + ## 2. fix the world stutter — voicing-transition pops 38 + 39 + the audible "skip" at word starts is the world synth flipping unvoiced→voiced abruptly. our current script removed the `voicing-ramp` because multiplying f0 by 0..1 forced the first 40ms to be unvoiced. correct fix is the opposite — *interpolate f0 across unvoiced regions before synthesis*, then synthesise, then mute the unvoiced portions in the time domain after. 
40 + 41 + this is the standard f0-contour smoother trick (zeehio / edinburgh speech tools, harvest paper §4): 42 + 43 + ```python 44 + # fill unvoiced gaps in f0 with linear interpolation 45 + voiced_idx = np.where(f0 > 0)[0] 46 + if len(voiced_idx) >= 2: 47 + f0_filled = np.interp(np.arange(len(f0)), voiced_idx, f0[voiced_idx]) 48 + else: 49 + f0_filled = f0 50 + # build unvoiced mask in time-domain *after* synth 51 + unvoiced_mask_audio = np.repeat(f0 == 0, samples_per_frame) 52 + y = pw.synthesize(f0_filled, sp, ap, fs) # smooth pitch through gaps 53 + # crossfade-mute the unvoiced regions (5 ms ramps at boundaries) 54 + y *= smooth_mute(unvoiced_mask_audio, ramp_ms=5) 55 + ``` 56 + 57 + this gives world a continuous f0 curve to work against (no abrupt 0→target jumps), then re-imposes the voiced/unvoiced structure as a smoothed amplitude mask. costs ~10 lines of python; addresses the *single largest* perceptual artifact we currently hit. 58 + 59 + related fix: **recompute the cheaptrick fft size**. default is auto-computed from `f0_floor=70`. raising `f0_floor` to 90 hz (jeffrey's voice never goes below 90) shrinks the analysis window, which reduces the "echo / ringing" on held vowels — that ring is cheaptrick smearing low-frequency formant estimates over too long a window. 60 + 61 + ```python 62 + fft_size = pw.get_cheaptrick_fft_size(fs, f0_floor=90.0) 63 + sp = pw.cheaptrick(x, f0, t, fs, fft_size=fft_size, f0_floor=90.0) 64 + ``` 65 + 66 + ## 3. second-pass correction — melodyne-lite 67 + 68 + even with world's f0-replace, we still see drift on outliers. add a dirt-cheap *second pass*: 69 + 70 + 1. measure actual f0 of world output (with `torchfcpe` or `librosa.pyin`). 71 + 2. compute residual cents-error per frame: `err = 1200 * log2(measured / target)`. 72 + 3. if median |err| over a sustained note > 15 cents, build a per-frame correction ratio and re-render with world (only that note). otherwise pass through.
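the three steps above can be sketched as plain functions. this is a minimal sketch under assumptions, not the shipped script: the function names are hypothetical, inputs are per-frame hz lists for one sustained note, and frames with f0 <= 0 are treated as unvoiced and skipped. the 15-cent threshold is the one from step 3.

```python
import math
from statistics import median


def cents_error(measured_hz, target_hz):
    """per-frame error in cents; None where either frame is unvoiced (<= 0)."""
    return [
        1200.0 * math.log2(m / t) if m > 0 and t > 0 else None
        for m, t in zip(measured_hz, target_hz)
    ]


def needs_second_pass(measured_hz, target_hz, threshold_cents=15.0):
    """true when the median absolute error over the note exceeds the threshold."""
    errs = [abs(e) for e in cents_error(measured_hz, target_hz) if e is not None]
    return bool(errs) and median(errs) > threshold_cents


def correction_ratios(measured_hz, target_hz):
    """per-frame ratio to multiply the rendered f0 by; 1.0 on unvoiced frames."""
    return [
        t / m if m > 0 and t > 0 else 1.0
        for m, t in zip(measured_hz, target_hz)
    ]
```

using the median rather than the mean keeps a single octave-error frame from triggering a re-render on its own.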
73 + 74 + this is what auto-tune internally does after its first snap pass. one extra world round-trip per problem note, ~50 lines, no new deps. works because the *first* pass got us to within ~2¢ median; the second pass surgically fixes the tail. 75 + 76 + ## 4. psola alternative — when world rings, try grains 77 + 78 + td-psola (`psola` on pypi, wraps parselmouth → praat) does pitch-shift in the time domain by repositioning glottal-pulse-aligned grains. **no spectral-envelope estimation step**, so no cheaptrick ringing. the cost: psola is monophonic-only and breathy/unvoiced material can break it. 79 + 80 + ```python 81 + import psola 82 + # target_pitch: numpy array same length as audio, hz per sample 83 + y = psola.vocode(audio, sample_rate, target_pitch=target_pitch_curve, 84 + fmin=70, fmax=600) 85 + ``` 86 + 87 + | | world | td-psola | 88 + |---|---|---| 89 + | held-note ringing | yes (cheaptrick smearing) | no (no spectral step) | 90 + | sibilant clipping | yes (envelope smoothes 's') | yes (psola breaks on unvoiced) | 91 + | word-start stutter | yes (v/uv transition) | less (grains overlap-add naturally) | 92 + | install | `pip install pyworld` | `pip install psola praat-parselmouth` | 93 + | cpu speed | fast | slow (~3× slower) | 94 + 95 + **recommended use:** dual-render — psola on sustained vowels, world on consonants — composited per-phoneme. that's a half-day build but addresses two of our four symptoms. 96 + 97 + ## 5. sinusoidal modeling — the slow lane 98 + 99 + `sms-tools` (mtg/upf, xavier serra), `loris` / `loristrck`, and `simpl` decompose audio into time-varying sinusoidal partials + a noise residual. you can shift each partial independently, which gives surgical control over harmonics vs sibilance vs breath. 100 + 101 + — `pip install sms-tools` (apple silicon wheels exist as of 2026). 102 + — per-partial pitch shifting: pure tones move, residual noise stays. 
103 + — **fixes sibilants directly**: classify partials by frequency band, leave 5–10 khz partials unshifted, shift only the harmonic stack below 4 khz. 104 + 105 + cost: orders of magnitude slower than world, more knobs, more code (~300 lines). flag this as a research-track experiment, not a production swap. 106 + 107 + ## 6. consonant-region artifact suppression — the cleanup pass 108 + 109 + even with everything above, world will still occasionally muck up "s", "k", "t" because it tries to apply f0 to noisy regions. the cheap fix is a *post-vocoder* spectral repair pass: 110 + 111 + — **de-esser**: dynamic compressor on the 5–8 khz band, sidechain-keyed by an envelope of that same band. classic broadcast trick. `scipy.signal.iirfilter` + a one-pole detector + per-sample gain in ~30 lines. or use the `pedalboard` library (spotify, `pip install pedalboard`) which has `Compressor` + `HighpassFilter` building blocks ready. 112 + — **transient gate**: at frame boundaries that fall inside unvoiced regions, replace world's output with the *original* recording's audio in that region, crossfaded ±5 ms. world handles vowels; original audio handles fricatives and stops. this is the "world for pitch, source for noise" composite. 113 + 114 + ```python 115 + # composite: keep world output on voiced frames, original audio on unvoiced 116 + voiced_audio = upsample(voiced_mask, fs) # 0..1 per sample, 5 ms ramps 117 + y_final = voiced_audio * y_world + (1 - voiced_audio) * x_original 118 + ``` 119 + 120 + this is the single biggest "sounds clean" win for the same effort budget as §2. probably do both. 121 + 122 + ## 7. neural lane — rvc / so-vits-svc / nsf 123 + 124 + if §1–§6 still fall short, the field's actual answer in 2026 is voice conversion: train a model on jeffrey's voice, feed it a sung melody (synthesised or hummed), get sung jeffrey out. 125 + 126 + | project | status | install | model size | cpu? 
| what it gives us | 127 + |---|---|---|---|---|---| 128 + | **rvc / applio** | active, large community | `git clone IAHispano/Applio` + cli | ~500 mb base + ~50 mb per voice | yes (~1× rt on m-series for non-real-time) | trained voice clone, melody-driven. uses rmvpe internally. | 129 + | **so-vits-svc-fork** (voicepaw) | actively maintained | `pip install so-vits-svc-fork` | ~200 mb base + ~150 mb per voice | yes (~3× rt cpu) | similar to rvc, slightly older arch. | 130 + | **diff-svc / diffsinger** | research-grade | `git clone prophesier/diff-svc` | ~800 mb | gpu strongly preferred | diffusion, highest quality, slow. | 131 + | **nnsvs + pc-nsf-hifigan** | active, score-driven | `pip install nnsvs` | ~400 mb | yes — nsf models specifically beat hifi-gan on cpu | feeds *score + lyrics*, not a guide vocal. complementary lane. | 132 + 133 + **critical caveat for our pipeline.** rvc and so-vits-svc need a *sung* input. they convert sung-anyone → sung-jeffrey. they don't solve speech→sung. so the pipeline becomes: 134 + 135 + ``` 136 + speech → world (saitou f0-replace) → rough sung → rvc (jeffrey clone) → polished sung 137 + ``` 138 + 139 + rvc as a *cleanup pass* downstream of our world output is the highest-ceiling option in this doc. it would mask all of §2/§4/§6 — rvc's neural decoder doesn't care if our world stage stutters, it resamples everything through the trained voice. estimated install + train: 1 day for setup, 4–8 hours of jeffrey audio collection, ~2 hours of training on cpu (overnight) or 30 min on a rented gpu. 140 + 141 + — **diff-pitcher** (jhu-lcap, waspaa 2023, `haidog-yaqub/DiffPitcher` on github) is the closest thing to "pitch-correct neurally without a trained singer". diffusion-based correction that takes out-of-tune audio + target midi, returns in-tune audio with timbre preserved. no voice training step. ~600 mb model, cpu-runnable but slow (~5× rt). 
worth a one-day evaluation if §1–§6 leaves residual artifacts on jeffrey's takes specifically. 142 + 143 + ## 8. nsf vocoders — better than world without leaving the dsp lane 144 + 145 + if we want to *replace* world entirely with something newer but stay non-neural-conversion, **pc-nsf-hifigan** (pitch-controllable neural source-filter, used by nnsvs as of 2025) takes `(f0, mel)` and produces a waveform. quality on singing beats world by a wide margin in the published evals (yamagishi lab samples, source-filter hifi-gan paper 2022). cpu-runnable, ~10× faster than world per second of audio. 146 + 147 + integration cost is the install (`pip install nnsvs` brings ~400 mb of torch + checkpoints) and re-fitting our pipeline to emit mel-spectrograms instead of cheaptrick envelopes. about a 2-day rewrite. defer until §1–§6 are exhausted. 148 + 149 + ## the new symptom-to-fix table 150 + 151 + | symptom | root cause | new fix proposed | section | 152 + |---|---|---|---| 153 + | word-start stutter | abrupt 0→target f0 at voicing transitions | interpolate f0 across unvoiced gaps + post-synth amplitude mask | §2 | 154 + | held-note ringing | cheaptrick analysis window too long | raise `f0_floor` to 90 hz, recompute fft_size | §2 | 155 + | harsh sibilants | world tries to apply f0 to fricatives | composite — world on voiced, source audio on unvoiced | §6 | 156 + | residual drift on outliers | one-pass correction insufficient | melodyne-lite second-pass over world output, measured by torchfcpe | §3 | 157 + | octave errors in `pitchcheck.mjs` report | naive autocorrelation | swap to `librosa.pyin` (already installed) | §1 | 158 + | occasional vowel mush | cheaptrick envelope over-smooths | dual-render with td-psola on sustained vowels | §4 | 159 + 160 + ## shortlist — three things to try next, ranked 161 + 162 + **1. f0-gap interpolation + voiced/unvoiced source compositing.** *highest payoff, ~3 hour build, no new deps.* sections 2 and 6 combined. 
this kills the two worst symptoms (stutter at word starts, harsh sibilants) inside `pitchsnap_world.py` with ~30 lines of numpy. `librosa.pyin` swap in `pitchcheck.mjs` rolls in for free. zero new install. 163 + 164 + **2. cheaptrick fft_size tuning + melodyne-lite second pass.** *medium payoff, ~half-day build, adds torchfcpe.* sections 2 (the `f0_floor=90` line) and 3. addresses held-note ringing and tail outliers. `pip install torchfcpe` is the only new dep — small, cpu-friendly, no gpu. 165 + 166 + **3. rvc/applio as a downstream cleanup pass.** *highest ceiling, ~2 day build, large install.* the "real" answer if signal-processing exhausts itself. world output → applio cli → polished sung jeffrey. requires collecting clean jeffrey audio (probably already in the recap corpus) and one overnight training run. defer until (1) and (2) ship and we have a tagged "world-only" baseline to a/b against. 167 + 168 + ## strongest single new finding 169 + 170 + **the word-start stutter is fixable inside world without changing the vocoder.** v1 assumed any pop at voicing transitions was a fundamental world limitation; the literature is clear it's a configuration problem. the fix is the standard "interpolate f0 through unvoiced gaps, synthesise on the smoothed curve, then re-impose the voiced mask in the time domain with crossfaded ramps" — used by every production pitch-correction pipeline, missing from our current script because we tried (and removed) the wrong version of it (multiplying f0 by 0→1 envelope, which forced unvoiced and made the problem worse). doing it the documented way (interpolate the contour, mute the *output* not the f0) is ~30 lines and should clean up the stutter completely. 171 + 172 + **recommended immediate action:** patch `pitchsnap_world.py` to interpolate f0 across unvoiced frames before synthesis, mute the unvoiced output via a 5 ms-ramped amplitude mask, and bump `f0_floor` to 90 hz in cheaptrick. 
that single commit should hit two of the four current symptoms (word-start stutter and held-note ringing). then swap `librosa.pyin` into `pitchcheck.mjs` to retire the autocorrelation outliers from the drift report. day one.
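the §2 sketch calls a `smooth_mute` helper without defining it. one possible shape, as a minimal sketch under assumptions: the `fs` parameter and its default are additions (the original call passes only the mask and `ramp_ms`), and the ramps are linear, produced by a moving average centred on each voiced/unvoiced boundary.

```python
import numpy as np


def smooth_mute(unvoiced_mask, ramp_ms=5.0, fs=44100):
    """per-sample gain: 1.0 on voiced audio, 0.0 on unvoiced, linear ramps between.

    unvoiced_mask: boolean array, one entry per audio sample.
    fs: sample rate in hz (assumed parameter, not in the original call).
    """
    gain = 1.0 - np.asarray(unvoiced_mask, dtype=np.float64)
    ramp = max(1, int(round(fs * ramp_ms / 1000.0)))
    if gain.size == 0 or ramp == 1:
        return gain
    # pad with edge values so the start and end of the file are not faded
    left = ramp // 2
    right = ramp - 1 - left
    padded = np.pad(gain, (left, right), mode="edge")
    kernel = np.ones(ramp) / ramp
    smoothed = np.convolve(padded, kernel, mode="valid")
    return np.clip(smoothed, 0.0, 1.0)
```

the mask and the synthesised audio need the same length before multiplying, so trimming or padding one to match the other is left to the caller.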
+102
pop/SPEECH-TO-SINGING.md
··· 1 + # speech-to-singing — survey 2 + 3 + a map of how the field gets sung output from spoken input, and which pieces drop into our existing `pitchsnap.mjs` lane. 4 + 5 + scope: jeffrey-pvc tts → sung melody. neural svc / svs lanes are noted but flagged as a separate research track (gpu, trained models, weeks of setup). 6 + 7 + ## the field 8 + 9 + three eras, all still in active use: 10 + 11 + **signal-processing era (2007–2014).** saitou et al.'s "speech-to-singing synthesis system" (waspaa 2007 / 2009) is the canonical paper. you read the lyrics out loud, hand it a score, and the system replaces the f0 contour with the score's contour, lengthens phonemes to fit notes, and reshapes the spectrum to add the singer's formant. it runs on the **straight** vocoder (kawahara) — analyse → modify (f0, duration, spectrum) → resynthesise. saitou explicitly lists the three things missing from speech: pitched melody, sustained vowels, and a singer's-formant ring. that decomposition is still the right mental model. 12 + 13 + **statistical-parametric era (2010–2018).** **sinsy** (nagoya institute of technology, hmm-based, bsd-licensed, on sourceforge / github) takes musicxml in and produces sung audio. it learns from a corpus what real singers do with a score. spiritually similar but you don't get to inject your own speech timbre — it's full synthesis from a model, not transformation of a recording. 14 + 15 + **neural era (2019–now).** parekh et al. (icassp 2020, "speech-to-singing conversion in an encoder-decoder framework") was the first end-to-end learned s2s — input is speech spectrogram + target melody, output is sung spectrogram. on the synthesis side, **diffsinger** (aaai 2022, diffusion + hifigan) and **nnsvs** (r9y9, 2022, the open-source successor to neutrino) take a score and a trained voice and produce a sung waveform. 
on the conversion side, **so-vits-svc** and **rvc** turn a guide vocal (sung melody) into the target singer's voice — they need a sung melody as input, so they don't solve our problem on their own; they'd sit *downstream* of a working s2s pipeline. 16 + 17 + ## the techniques 18 + 19 + each technique listed with what it actually does to the signal. the problem isn't that we don't have enough techniques — it's that pitchsnap currently uses one (rubberband segment-shifting) and skips the rest. 20 + 21 + — **f0 substitution.** decompose audio with a vocoder that gives you `(f0, spectral_envelope, aperiodicity)` as separate streams. throw the source f0 away, write a new f0 contour from the score, resynthesise. spectral envelope unchanged → vowel identity preserved → it still sounds like jeffrey, just with sung pitch. this is the saitou recipe and also what world / pyworld / praat do natively. our current rubberband path *shifts* the existing f0 by an interval, which is why "spoken prosody fights melody" — the prosody is *in* the f0 we're shifting. 22 + 23 + — **phoneme-aware time-stretch (vowel sustain).** detect vowel regions, stretch them; leave consonants alone. straight + saitou's "duration control model" does this. world's `synthesizeRequiem` lets you pass a per-frame time-warp. praat's manipulation object exposes a `durationtier` you can set per phoneme. without this step, "real" stretched 4× sounds like "rrrrreal" — every phoneme drags. with it, only the `ee` drags and you get "reeeeeal". 24 + 25 + — **phoneme alignment.** **montreal forced aligner** (mfa, kaldi-based, free, has english pretrained models) gives you per-phoneme timestamps from audio + transcript. **charsiu** (lingjzhu, wav2vec2-based, lighter than mfa) does the same with a smaller install and decent quality. either gives consonant/vowel boundaries — required for vowel-sustain to know which spans to stretch. 
26 + 27 + — **f0 detection.** our current `pitchsnap.mjs` uses naive autocorrelation in `detectPitch()` (lines 150–182), which the comments admit picks 2× / ½× on individual words. **pyin** (mauch & dixon 2014, in librosa as `librosa.pyin`) adds viterbi smoothing and is the standard for monophonic speech/singing. **crepe** (kim et al. 2018) is a cnn, more accurate, needs tensorflow. **dio + stonemask** in world is fast and clean for voice. swapping autocorrelation → pyin or world's dio is a one-day fix that removes octave errors immediately. 28 + 29 + — **f0 contour shaping.** real sung notes don't step. saitou models four sub-effects: 30 + - *overshoot* (~50 cents past the target on attack, decays to centre over ~200 ms) 31 + - *vibrato* (4.5–6.5 hz, 50–120 cents peak-to-peak, fades in after sustain begins) 32 + - *preparation* (slight dip below the target before a rising interval) 33 + - *fine fluctuation* (sub-cent jitter for naturalness) 34 + add these on top of a stepped target curve and a flat synthetic line starts to read as performed. 35 + 36 + — **singer's formant.** sundberg 1974 — trained singers cluster f3/f4/f5 around 2.5–3 khz for "ring" that cuts through an orchestra. in spectral-envelope terms: boost a ~500 hz-wide band centred near 2.8 khz by ~6–10 db. on a vocoder that exposes the spectral envelope (world, straight, praat) this is one filter operation per frame. we can fake it on raw audio with a parametric eq at +6 db / 2.8 khz / q=4, but it's cleaner inside a vocoder where the boost rides the formant rather than ringing. 37 + 38 + — **autotune (correction style).** the antares trick: run pyin → snap each frame's f0 to the nearest scale degree → resynthesise via a phase vocoder or psola. with a low retune-time you get the stepped t-pain effect; with higher retune-time you get gentle correction. the *aggressive* variant is identical to "f0 substitution from score" except the target curve comes from snapping rather than from a score. 
39 + 40 + — **psola / phase vocoder.** **td-psola** (moulines & charpentier 1990, in praat) modifies pitch and duration in the time domain by repositioning glottal-pulse-aligned grains; preserves formants by construction. phase vocoder (frequency-domain) is what rubberband and librosa use under the hood. for singing, psola tends to sound more natural on small shifts, phase vocoder handles bigger transformations better but smears transients. 41 + 42 + ## the tools 43 + 44 + | tool | language | install | gpu? | what it gives us | 45 + |---|---|---|---|---| 46 + | **rubberband** (cli, current) | c++ | already installed | no | pitch + time stretch, phase vocoder. has `--formant`, `--pitchmap`, `--smoothing` flags we're under-using. | 47 + | **world / pyworld** | c++ / python | `pip install pyworld` (~1 mb) | no | analyse → `(f0, sp, ap)` → modify f0 directly → resynthesise. cleanest path to f0 substitution. dio for f0, cheaptrick for envelope, d4c for aperiodicity. | 48 + | **praat** (cli + scripts) | c | `brew install praat` | no | manipulation object with pitchtier + durationtier, td-psola underneath. fully scriptable. heavier orchestration than pyworld but doesn't need python. | 49 + | **librosa** | python | `pip install librosa` (~50 mb) | no | `librosa.pyin` for f0 detection, `pitch_shift` / `time_stretch` (lower fidelity than rubberband). good as a measurement tool, not a render tool. | 50 + | **crepe** | python (tf) | `pip install crepe` (~500 mb tf) | optional | best-in-class f0 detection. heavy install for one feature. | 51 + | **mfa** | python (kaldi) | `conda install -c conda-forge montreal-forced-aligner` (~2 gb incl. acoustic models) | no | per-phoneme alignment from audio + transcript. heavy. | 52 + | **charsiu** | python (torch) | `pip install transformers + ckpt` (~500 mb) | optional | per-phoneme alignment, lighter than mfa, comparable accuracy. 
| 53 + | **sinsy** | c++ | source build | no | full hmm-based singer from musicxml; outputs sung audio, not a transformation tool — different lane. | 54 + | **nnsvs / diffsinger / so-vits-svc / rvc** | python (torch) | gpu strongly recommended; multi-gb models | yes | parallel research lane. produce excellent results but require trained voices and gpus. flagged out-of-scope-for-now. | 55 + 56 + ffmpeg's bundled rubberband filter is not available in our build — confirmed earlier; we shell out to the cli. 57 + 58 + ## what fixes our specific symptoms 59 + 60 + | symptom we hit | root cause | technique that fixes it | tool | 61 + |---|---|---|---| 62 + | "spoken prosody fights melody" | rubberband shifts existing f0 by interval; original speech contour rides on top of every shift | f0 substitution — replace, don't shift | pyworld (dio + stonemask + cheaptrick + synthesize) | 63 + | "vowels don't sustain" | word-level slices stretched uniformly drag consonants too; word-final vowels are too short to ring | phoneme alignment + selective vowel stretch | charsiu (or mfa) → per-phoneme durationtier in praat / world | 64 + | "sounds tonal not sung" | flat stepped pitch, no overshoot / vibrato / fine fluctuation; missing singer's formant ring | saitou f0-contour shaping + spectral-envelope boost at 2.5–3 khz | pyworld envelope edit + post-eq, or praat scripted | 65 + | "octave errors on shifts" | autocorrelation in `detectPitch()` picks 2× / ½× on some words | viterbi-smoothed f0 | librosa.pyin or world.dio | 66 + | "consonants get cut at word edges" | whisper word boundaries land mid-consonant | phoneme alignment, slice on phoneme edges | charsiu | 67 + | "segment boundaries click" | fixed 20 ms crossfade in segmented rendering doesn't align with glottal pulses | psola — grains aligned to f0 periods | praat td-psola, or world resynth (no clicks by construction) | 68 + | "formant character lost on big shifts" | rubberband `--formant` not currently passed in `pitchsnap.mjs` | 
enable formant preservation | rubberband `--formant` flag (one-line change) | 69 + 70 + ## shortlist 71 + 72 + ranked by effort vs. payoff. opinionated. 73 + 74 + **1. swap the rendering core to world (pyworld) for f0 substitution.** *highest payoff, ~2-day build.* this is the saitou recipe and addresses three of our top symptoms in one move: 75 + - rubberband segment-shift → world `(f0, sp, ap)` decompose, write target f0 from score, resynth. 76 + - kills "spoken prosody fights melody" (because we replace, not shift). 77 + - kills octave errors (dio is reliable on voice). 78 + - removes click artefacts at segment boundaries (no segmentation needed). 79 + - keeps jeffrey's timbre intact (spectral envelope unchanged). 80 + - new piece: a python child process called from `pitchsnap.mjs` that takes the slice + target f0 contour json and returns a wav. ~150 lines of python, ~1 mb pip install, no gpu. 81 + - immediately enables vibrato / overshoot / singer's-formant boost as f0/envelope edits in the same script. 82 + 83 + **2. add phoneme alignment via charsiu, switch from word-snap to phoneme-snap.** *medium payoff, ~1-day build, depends on (1) for full effect.* gets us: 84 + - clean consonant/vowel boundaries, no chopped tails. 85 + - vowel-sustain becomes a real operation: identify the vowel span, stretch only that. 86 + - lyric-carrying consonants stay short and crisp; the *note* lives on the vowel. 87 + - charsiu over mfa for install size — 500 mb vs. 2 gb, comparable quality on english. 88 + 89 + **3. saitou f0 sub-effects: vibrato + overshoot + fine fluctuation.** *lowest effort once (1) lands, big perceptual win.* once we own the f0 contour, layer: 90 + - 5.5 hz sine, 70 cents peak-to-peak, fading in after the first 150 ms of sustain → vibrato 91 + - 50 cents above target on attack, exponential decay over 200 ms → overshoot 92 + - low-pass-filtered noise at ±10 cents → fine fluctuation 93 + - all parameters in a recipe file, per-track tunable. 
94 + this is ~50 lines on top of the world renderer and is the single biggest "sounds sung" lever once the substrate is right. 95 + 96 + **deferred (parallel lane).** so-vits-svc / rvc / diffsinger / nnsvs as a separate research track. they produce the best results in the field today, but they need trained models on jeffrey's voice (so-vits-svc / rvc) or are full synthesizers from score (diffsinger / nnsvs) and skip our "preserve the speech recording" posture. start with (1)–(3); evaluate the neural lane only after we know what the signal-processing path actually sounds like on our material. 97 + 98 + ## the strongest pattern 99 + 100 + every working s2s system in the literature — saitou 2007, sinsy 2010, vocalistener 2009, diffsinger 2022, parekh 2020 — separates f0 from spectral envelope and treats them as independent edit streams. our current pipeline doesn't. we shift one composite signal in semitones, which means jeffrey's spoken prosody is still riding on top of every "snapped" note. the single biggest upgrade is moving to a vocoder that exposes those streams separately so we can replace f0 wholesale instead of shifting it. 101 + 102 + **top recommendation for the immediate next pitchsnap upgrade:** add a `--engine world` option to `pitchsnap.mjs` that shells out to a small `pitchsnap-world.py` for the per-word render. python takes the wav slice + target midi(s), runs `pyworld.dio` → `pyworld.stonemask` → `pyworld.cheaptrick` → `pyworld.d4c`, replaces f0 with the target curve (with a 200 ms attack overshoot and a 5.5 hz vibrato fading in after the first sustained quarter-second), and resynths with `pyworld.synthesize`. keeps the rest of the snap / scale-walk / score-mapping logic identical. one new dependency, one new flag, no neural models. that's the saitou pipeline and it should turn "tonal speech" into "sung jeffrey" in a single weekend.
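the target curve described above (stepped note + attack overshoot + delayed vibrato) can be sketched in numpy. this is a minimal sketch under assumptions, not the renderer itself: the function names are hypothetical, the 5 ms frame period matches pyworld's default, the overshoot and vibrato numbers are the shortlist values (50 cents, 200 ms, 5.5 hz, 70 cents peak-to-peak, 150 ms delay), and two details are guesses: the 200 ms decay is treated as three time constants, and the vibrato fades in linearly over 200 ms. fine fluctuation is left out.

```python
import numpy as np


def midi_to_hz(midi):
    """equal-tempered pitch, a4 = midi 69 = 440 hz."""
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def sung_f0_curve(midi_note, duration_s, frame_period_ms=5.0,
                  overshoot_cents=50.0, overshoot_decay_s=0.2,
                  vibrato_hz=5.5, vibrato_pp_cents=70.0,
                  vibrato_delay_s=0.15, vibrato_fade_s=0.2):
    """per-frame target f0 in hz for one held note."""
    n = int(round(duration_s * 1000.0 / frame_period_ms))
    t = np.arange(n) * frame_period_ms / 1000.0
    # overshoot: start above the target, decay exponentially toward it
    # (assumption: the stated decay time spans ~3 time constants)
    overshoot = overshoot_cents * np.exp(-t / (overshoot_decay_s / 3.0))
    # vibrato: silent until the delay, then fades in linearly (assumed shape)
    fade = np.clip((t - vibrato_delay_s) / vibrato_fade_s, 0.0, 1.0)
    vibrato = 0.5 * vibrato_pp_cents * fade * np.sin(
        2.0 * np.pi * vibrato_hz * (t - vibrato_delay_s))
    cents = overshoot + vibrato
    return midi_to_hz(midi_note) * 2.0 ** (cents / 1200.0)
```

the returned array is float64, which is the dtype pyworld expects for f0. unvoiced frames would still need to be handled by the caller.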
+45
pop/VOICE.md
··· 1 + # Voice Guide for AC Pop 2 + 3 + pop tracks should sound like @jeffrey writing songs, not like a brand writing copy. lyrics are not advertising. they are the most compressed form a vision can take before it stops being music. 4 + 5 + ## the basics 6 + 7 + - first person is fine. "i built this" not "the artist built" 8 + - lowercase default in lyric drafts. capitalization in mix metadata only when convention demands it 9 + - em dashes for asides — not semicolons, not parentheses 10 + - short lines. fragments when they land. no filler syllables to pad the bar 11 + - internal rhyme over end rhyme. the rhyme should fall mid-bar at least as often as it falls on the four 12 + - state the hard thing plainly. don't soften with hedging language 13 + - conviction is quiet but absolute — "the music is real" not "i'm feeling kinda confident about this" 14 + - humor is allowed. the line that makes you laugh on the first listen is doing work 15 + - be honest about what doesn't work. the second verse can be the limitations section 16 + 17 + ## what to avoid 18 + 19 + - flexing posture. money / cars / club bars. AC has nothing to flex about and that's the point 20 + - industry filler — "ayy", "skrt", "let's get it" — unless they're earned and load-bearing 21 + - explaining the vision before singing it. the lyric should *be* the vision, not narrate it 22 + - writing differently for different platforms. one lyric file, one mix, no platform variants 23 + - corporate vocabulary. "ecosystem", "stakeholders", "leverage" — drop them 24 + - generic emotion. pick the specific feeling. "tired" is not a feeling. "tired of explaining kidlisp at every dinner" is 25 + 26 + ## per-genre overlays 27 + 28 + each lane has its own voice on top of this base. 29 + 30 + ### big pictures (emo rap / trap) 31 + 32 + - emo rap honesty — vulnerability without self-pity. confess the thing, then keep moving 33 + - triplet flow allowed but not default. 
use it when the line needs urgency 34 + - ad-libs: minimal. one or two per verse, low in the mix, never "yeah yeah yeah" filler 35 + - the hook is the vision compressed to one line — repeatable, undeniable, true 36 + - 16 bars per verse, 4-line hook, ~1:30 total. roughly: hook → verse → hook → verse → hook → outro 37 + - reference voice: see vault. ingestion is in-context style, not training corpus 38 + 39 + ## examples of the voice working 40 + 41 + placeholders. real lines land here when the first track is mixed. 42 + 43 + --- 44 + 45 + *maintained by @jeffrey — update this when the voice evolves*
+76
pop/big-pictures/README.md
··· 1 + # big pictures 2 + 3 + audio-only hip hop / trap versions of jeffrey's AC visions. one vision per track. ~1:30 each. 4 + 5 + ## format spec 6 + 7 + - **length**: 90 seconds, ±10s 8 + - **structure**: hook (4 bars) → verse (16 bars) → hook → verse (16 bars) → hook → outro 9 + - **tempo**: ~140 BPM, 4/4 10 + - **bed**: trap — 808 sub, triplet hats, sparse snare on 3, room for the vocal 11 + - **vocal**: rapped, not sung. emo-rap honesty (see `../VOICE.md`) 12 + - **output**: single mp3 per track in `out/<slug>.mp3`. no video. 13 + 14 + ## source → track 15 + 16 + each track corresponds to one paper or one vision from the platter. the lyric is the compression of that paper into the form a song can carry. the hook is the vision in one line. 17 + 18 + ``` 19 + papers/arxiv-<slug>/<slug>.tex 20 + → pop/big-pictures/<slug>.txt (plain lyrics) 21 + → pop/big-pictures/<slug>.np (notepat score: NOTE:syllable per syllable) 22 + → pop/big-pictures/out/<slug>.mp3 (mix) 23 + ``` 24 + 25 + the `.np` (notepat) file is the score in the same notation as the folk-songs paper (`papers/arxiv-folk-songs/folk-songs.tex` §3). every syllable carries a pitch — making the lyric playable on notepat in song mode and renderable through `recap/bin/vocal.mjs` (formant synth) or any other pitch-driven voice. the file is its own URL when fed to `notepat.com?song=...`. 26 + 27 + ## lyric file format 28 + 29 + plain text. no metadata header. blocks separated by blank lines, labeled in lowercase: 30 + 31 + ``` 32 + hook 33 + <4 lines> 34 + 35 + verse 1 36 + <16 lines> 37 + 38 + hook 39 + 40 + verse 2 41 + <16 lines> 42 + 43 + hook 44 + 45 + outro 46 + <2-4 lines> 47 + ``` 48 + 49 + ## pipeline (planned) 50 + 51 + `recap/bin/big-pictures.mjs` — mirrors the recap cli pattern. cached per step so reruns cost nothing. 
52 + 53 + ``` 54 + read paper 55 + → draft lyrics (jeffrey-pvc voice + emo-rap overlay) 56 + → write notepat score (.np) — every syllable carries pitch (visual / kidlisp future) 57 + → AC-native trap bed (recap/bin/trap.mjs) 58 + → vocal stem: /api/say with jeffrey-pvc (provider:"jeffrey", voice:"neutral:0") 59 + → WhisperX forced alignment — per-word timestamps 60 + → snap drift to bar grid — ±200ms tolerance, 16th-note quantization 61 + → vocal-post per-word edits — pitch / elongate / effect / harmonize, aggression-tunable 62 + → mix (bed + vocal) 63 + → mp3 64 + ``` 65 + 66 + ## vocal source 67 + 68 + big-pictures uses **jeffrey-pvc via ElevenLabs** (the same voice the 24h recap pipeline already uses). hooks and verses both. it's literally jeffrey's cloned voice and is already wired up through `/api/say`. 69 + 70 + an AC-native formant-synth vocal was attempted and dropped on 2026-05-03 — it produced melodic tones but not voice; getting real vocoder/talkbox character would need glottal pulse + F4/F5 + pitch jitter + consonants, a full research lane. `recap/bin/vocal.mjs` remains in the repo as experimental research; may resurface as a *melodic instrument* layer (formant-shaped lead in the bed) rather than a vocal. 71 + 72 + **bottom-up posture preserved at the composition layer.** the bed is composed bar-by-bar from AC instruments (`trap.mjs` over `percussion.mjs`); the score is hand-written in `.np` notation. ElevenLabs is the *performance* on top of that composition. 73 + 74 + ## tracks 75 + 76 + none yet. first candidate: `plork` (laptop orchestras, planetary scale).
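The "snap drift to bar grid" step in the planned pipeline above can be sketched as a small pure function. This is one possible reading of that step, not the planned implementation: `snapToGrid`, its option names, and the rule "leave the onset alone when drift exceeds the tolerance" are assumptions.

```javascript
// Hypothetical helper: snap a word onset (ms) to the nearest grid line.
// division = grid lines per beat (4 → 16th notes in 4/4).
function snapToGrid(onsetMs, bpm, { division = 4, toleranceMs = 200 } = {}) {
  const stepMs = 60000 / bpm / division;          // length of one grid step
  const snapped = Math.round(onsetMs / stepMs) * stepMs;
  const driftMs = snapped - onsetMs;
  // ASSUMPTION: onsets outside the tolerance are left untouched.
  return Math.abs(driftMs) <= toleranceMs
    ? { ms: snapped, driftMs, snapped: true }
    : { ms: onsetMs, driftMs: 0, snapped: false };
}

// At 140 BPM a 16th note is about 107 ms long.
console.log(snapToGrid(1000, 140));
```

At 140 BPM the nearest 16th is never more than about 54 ms away, so under this reading the ±200 ms tolerance only takes effect on coarser grids (for example `division: 1`), or if drift is measured against a word's intended slot from the score rather than the nearest grid line.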
+29
pop/big-pictures/ac-da.txt
··· 1 + hook 2 + hvert stykke er en url 3 + hver url er et nodeark 4 + tryk enter — så er du inde 5 + æstetisk dot computer 6 + 7 + verse 1 8 + kidlisp i browseren 9 + notepat på qwerty 10 + samme samples, native og statisk 11 + dual-channel, under femti millisek 12 + hvert stykke — én mjs-fil 13 + adressen er kilden 14 + 15 + hook 16 + 17 + verse 2 18 + en mand alene, ingen runway 19 + jeg ville bare mine venner kunne male 20 + kildekoden er på github 21 + cachen holder altid 22 + jeg bygger ikke et produkt 23 + jeg bygger et lille sted 24 + 25 + hook 26 + 27 + outro 28 + tryk enter 29 + du er inde i det
+31
pop/big-pictures/ac.txt
··· 1 + hook 2 + every piece a url 3 + every url a score 4 + push enter — you're inside 5 + aesthetic dot computer 6 + 7 + verse 1 8 + kidlisp evaluates parens live in the browser 9 + notepat — qwerty, chromatic, two octaves over 10 + sample bank's the same on native and the static 11 + half-meg of api, every event with a purpose 12 + mjs is the unit, the address is the score 13 + i write a piece, i save it — that's the whole move 14 + 15 + hook 16 + 17 + verse 2 18 + one-person stack, no runway, no dream 19 + no investors, no playbook, no series-a scheme 20 + i wanted a place my friends could just paint 21 + where the paint was the piece, the piece was the wave 22 + the source is on github, the cache stays the cache 23 + fork the whole runtime — same kid, same shape 24 + i'm not pitching a product, i'm holding a stand 25 + the music is real and it works in your hand 26 + 27 + hook 28 + 29 + outro 30 + push enter 31 + you're inside it
+18
pop/big-pictures/amazing.np
# Amazing Grace — full verse 1. "New Britain" tune, William Walker 1835.
# notation: NOTE:syllable*beats
# Phrase 1+2 melody from papers/arxiv-folk-songs/folk-songs.tex:181
# (the canonical pentatonic transcription).
#
# key: G major pentatonic. octave 3 — sits in jeffrey-pvc's natural
# baritone. Climax at D4 on "blind" (line 4) is the highest note.
# Beat values follow the dotted-half / quarter pattern of 3/4 hymn
# delivery. Each line ends with an extended hold (5 beats) for the
# phrase-end breath.
#
# Use --beat-mode --bpm 70.

verse 1
D3:a-*1 G3:-ma-*3 B3:-zing*1 G3:grace*3 B3:how*1 A3:sweet*3 G3:the*1 B3:sound*5
D3:that*1 E3:saved*3 D3:a*1 B3:wretch*3 G3:like*1 B3:me*5
D3:i*1 G3:once*3 D4:was*1 D4:lost*3 B3:but*1 D4:now*3 A3:am*1 G3:found*5
B3:was*1 D4:blind*3 D4:but*1 B3:now*3 A3:i*1 G3:see*5
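The `NOTE:syllable*beats` notation used in this score can be read with a small tokenizer. This is a minimal sketch: the grammar is inferred from the `.np` files in this commit rather than from the notepat implementation, and `parseNpLine`, its field names, and the default of 1 beat when `*beats` is omitted are assumptions.

```javascript
// ASSUMED grammar: PITCH[octave]:syllable[*beats]
// e.g. "G3:-ma-*3", "E4:lamb", "Eb:the" (plork.np omits the octave).
const NP_TOKEN = /^([A-G][b#]?)(\d)?:([^*\s]+)(?:\*(\d+(?:\.\d+)?))?$/;

function parseNpLine(line) {
  return line.trim().split(/\s+/).filter(Boolean).map((tok) => {
    const m = NP_TOKEN.exec(tok);
    if (!m) throw new Error(`bad .np token: ${tok}`);
    const [, pitch, octave, syllable, beats] = m;
    return {
      pitch,                                           // "G", "Eb", ...
      octave: octave === undefined ? null : Number(octave),
      syllable: syllable.replace(/^-|-$/g, ""),        // strip continuation hyphens
      continuesWord: syllable.endsWith("-"),           // more syllables follow
      beats: beats === undefined ? 1 : Number(beats),  // ASSUMPTION: default 1 beat
    };
  });
}

console.log(parseNpLine("D3:a-*1 G3:-ma-*3 B3:-zing*1 G3:grace*3"));
```

Comment lines (`#`) and section labels such as `verse 1` are not tokens; a caller would need to filter those out before passing lines to this function.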
+5
pop/big-pictures/amazing.txt
··· 1 + verse 1 2 + amazing grace how sweet the sound 3 + that saved a wretch like me 4 + i once was lost but now am found 5 + was blind but now i see
+9
pop/big-pictures/elephant.np
# elephant × 4 — C minor i–VI–iv–V arpeggio progression
# 3 syllables per word: el-e-phant
# rep 1: i (Cm) → C3 Eb3 G3
# rep 2: VI (Ab) → Ab3 C4 Eb3
# rep 3: iv (Fm) → F3 Ab3 C4
# rep 4: V (G) → G3 Bb3 F3 (resolves back to root next bar)

verse 1
C3:el- Eb3:-e- G3:-phant Ab3:el- C4:-e- Eb3:-phant F3:el- Ab3:-e- C4:-phant G3:el- Bb3:-e- F3:-phant
+2
pop/big-pictures/elephant.txt
··· 1 + verse 1 2 + elephant elephant elephant elephant
+33
pop/big-pictures/mary.np
··· 1 + # Mary Had a Little Lamb — notepat score, all 4 traditional verses 2 + # plus the gen-alpha "chick chick BOOM no more lamb" outro. 3 + # notation: NOTE:syllable, hyphens for syllable continuation 4 + # (folk-songs paper §3, papers/arxiv-folk-songs/folk-songs.tex:152) 5 + # 6 + # key: C major. classic kindergarten melody. octave 4 — middle C and 7 + # above. WORLD engine keeps voice timbre intact even with the +12 to 8 + # +16 semitone shift from jeffrey-pvc's natural baritone. Same E-D-C-D- 9 + # E-E-E phrase repeats across all four verses. 10 + # 11 + # The closing "chick chick BOOM" verse triggers a gunshot SFX on the 12 + # BOOM syllable (folk_backing.py overlays a synthesized PISTOL preset 13 + # gunshot wherever the syllable text matches /^boom$/i). 14 + 15 + verse 1 16 + E4:mar- D4:-y C4:had D4:a E4:lit- E4:-tle E4:lamb 17 + D4:lit- D4:-tle E4:lamb D4:lit- D4:-tle E4:lamb 18 + E4:mar- D4:-y C4:had D4:a E4:lit- E4:-tle E4:lamb 19 + E4:its D4:fleece D4:was E4:white D4:as C4:snow 20 + E4:ev- D4:-ery- C4:-where D4:that E4:mar- E4:-y E4:went 21 + D4:mar- D4:-y E4:went D4:mar- D4:-y E4:went 22 + E4:ev- D4:-ery- C4:-where D4:that E4:mar- E4:-y E4:went 23 + E4:the D4:lamb D4:was E4:sure D4:to C4:go 24 + E4:fol- D4:-lowed C4:her D4:to E4:school E4:one E4:day 25 + D4:school D4:one E4:day D4:school D4:one E4:day 26 + E4:fol- D4:-lowed C4:her D4:to E4:school E4:one E4:day 27 + E4:which D4:was D4:a- E4:-gainst D4:the C4:rules 28 + E4:made D4:the C4:chil- D4:-dren E4:laugh E4:and E4:play 29 + D4:laugh D4:and E4:play D4:laugh D4:and E4:play 30 + E4:made D4:the C4:chil- D4:-dren E4:laugh E4:and E4:play 31 + E4:to D4:see D4:a E4:lamb D4:at C4:school 32 + E4:mar- D4:-y C4:had D4:a E4:lit- E4:-tle E4:lamb 33 + D4:chick D4:chick E4:boom D4:no D4:more E4:lamb*2
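The "+12 to +16 semitone shift" quoted in the header of this score can be checked with a little note arithmetic. This is a sketch under one assumption: the reference pitch `C3` for jeffrey-pvc's natural baritone is inferred from those numbers, not stated in the source. `noteToMidi` and `semitonesBetween` are hypothetical helper names.

```javascript
const PITCH_CLASS = { C: 0, D: 2, E: 4, F: 5, G: 7, A: 9, B: 11 };

// Scientific pitch notation to MIDI note number (C4 = 60, A4 = 69).
function noteToMidi(name) {
  const m = /^([A-G])([b#]?)(-?\d+)$/.exec(name);
  if (!m) throw new Error(`bad note name: ${name}`);
  const [, letter, accidental, octave] = m;
  const shift = accidental === "#" ? 1 : accidental === "b" ? -1 : 0;
  return 12 * (Number(octave) + 1) + PITCH_CLASS[letter] + shift;
}

const semitonesBetween = (from, to) => noteToMidi(to) - noteToMidi(from);

const REFERENCE = "C3"; // ASSUMPTION: stand-in for the voice's natural pitch
console.log(semitonesBetween(REFERENCE, "C4")); // 12 (lowest note in the score)
console.log(semitonesBetween(REFERENCE, "E4")); // 16 (highest note in the score)
```

With that reference, the score's range of C4 to E4 lands exactly on the +12 to +16 window the comment describes.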
+19
pop/big-pictures/mary.txt
··· 1 + verse 1 2 + mary had a little lamb 3 + little lamb, little lamb 4 + mary had a little lamb 5 + its fleece was white as snow 6 + everywhere that mary went 7 + mary went, mary went 8 + everywhere that mary went 9 + the lamb was sure to go 10 + followed her to school one day 11 + school one day, school one day 12 + followed her to school one day 13 + which was against the rules 14 + made the children laugh and play 15 + laugh and play, laugh and play 16 + made the children laugh and play 17 + to see a lamb at school 18 + mary had a little lamb 19 + chick chick boom, no more lamb
+43
pop/big-pictures/plork.np
··· 1 + # big pictures: plork — notepat score 2 + # notation: NOTE:syllable, hyphens for syllable continuation 3 + # (folk-songs paper §3, papers/arxiv-folk-songs/folk-songs.tex:152) 4 + # 5 + # key: C minor (pentatonic-leaning: C Eb F G Bb) 6 + # fits trap progression i VI iv V → Cm Ab Fm G 7 + # rap contour: verses sit close to root, peaks on key words; hook is 8 + # the most melodic line and lands on tonic. 9 + 10 + hook 11 + Eb:the C:mu- Eb:-sic G:is G:real Eb:the C:lo- Eb:-gis- F:-tics G:got C:cheap 12 + Eb:land- Eb:-fill's Eb:full Eb:of C:or- Eb:-ches- F:-tras G:dream- F:-ing Eb:in Eb:their C:sleep 13 + G:two- G:-for- G:-ty G:mil- G:-lion Bb:lap- G:-tops, F:win- F:-dows F:ten Eb:re- C:-tired 14 + Eb:flash Eb:a F:ker- Eb:-nel G:ev- G:-ery G:one Eb:of Eb:them F:just G:got C:hired 15 + 16 + verse 1 17 + C:i C:watched C:plork C:in C:a Eb:base- C:-ment C:on C:a Eb:live- C:-stream 18 + C:fif- C:-teen C:lap- C:-tops Eb:in Eb:a F:hall Eb:do- Eb:-ing Eb:rit- C:-u- C:-als C:like C:a G:dream 19 + Eb:two Eb:thou- C:-sand C:six F:prince- Eb:-ton C:had C:the Eb:keys C:to C:the C:hall 20 + Eb:twen- C:-ty Eb:twen- C:-ty F:half Eb:those Eb:or- C:-ches- C:-tras Eb:don't Eb:an- Eb:-swer C:the C:call 21 + C:co- C:-vid Eb:took Eb:the F:rooms F:and Eb:no- C:-bo- C:-dy Eb:built C:them C:back 22 + C:i'm C:not Eb:bit- C:-ter C:i'm Eb:hon- C:-est C:i Eb:keep G:score Eb:on C:the C:track 23 + C:the Eb:gear C:was C:the F:gate C:the F:gate C:was C:the G:price C:tag 24 + C:fif- C:-teen C:hun- C:-dred C:a Eb:seat C:that's C:not C:a Eb:school C:that's C:a G:flag 25 + Eb:mean- C:-while Eb:dell Eb:un- C:-loads F:pal- Eb:-lets C:at C:the Eb:auc- C:-tion C:door 26 + C:fif- C:-ty Eb:bucks C:du- C:-al Eb:core F:screen Eb:scratched C:on C:the C:floor 27 + C:i Eb:flash C:a Eb:ker- C:-nel C:from C:a C:u- C:-s- C:-b Eb:and Eb:call C:it C:a G:school 28 + G:plork F:the Eb:plan- C:-et Eb:not C:a Eb:the- C:-o- C:-ry Eb:an- C:-y- C:-more C:a C:tool 29 + Eb:ge C:wang Eb:made C:chuck Eb:pa- C:-pert Eb:made 
C:the F:tur- Eb:-tle C:and C:the C:kid 30 + Eb:il- C:-lich Eb:said Eb:de- C:-school F:the Eb:build- C:-ing C:was C:the C:lid 31 + C:i'm C:not Eb:ask- C:-ing F:per- Eb:-mis- C:-sion C:i'm Eb:just G:lift- F:-ing C:it 32 + Eb:the C:mu- Eb:-sic G:is G:real Eb:and Eb:the F:math G:just C:flipped 33 + 34 + hook 35 + 36 + # verse 2 — TBD (plain lyrics in plork.txt; pitch contour pending) 37 + 38 + hook 39 + 40 + outro 41 + Eb:plug Eb:it C:in Eb:lis- C:-ten G:close 42 + Eb:the C:mu- Eb:-sic G:is C:real 43 + Eb:the C:lo- Eb:-gis- F:-tics Eb:the C:lo- Eb:-gis- F:-tics G:got C:cheap
+50
pop/big-pictures/plork.txt
··· 1 + hook 2 + the music is real — the logistics got cheap 3 + landfill's full of orchestras dreaming in their sleep 4 + two-forty million laptops, windows ten retired 5 + flash a kernel — every one of them just got hired 6 + 7 + verse 1 8 + i watched plork in a basement on a livestream 9 + fifteen laptops in a hall doing rituals like a dream 10 + two thousand six — princeton had the keys to the hall 11 + twenty twenty — half those orchestras don't answer the call 12 + covid took the rooms and nobody built them back 13 + i'm not bitter — i'm honest, i keep score on the track 14 + the gear was the gate, the gate was the price tag 15 + fifteen hundred a seat — that's not a school, that's a flag 16 + meanwhile dell unloads pallets at the auction door 17 + fifty bucks, dual core, screen scratched on the floor 18 + i flash a kernel from a usb and call it a school 19 + plork the planet — not a theory anymore, a tool 20 + ge wang made chuck, papert made the turtle and the kid 21 + illich said deschool — the building was the lid 22 + i'm not asking permission — i'm just lifting it 23 + the music is real and the math just flipped 24 + 25 + hook 26 + 27 + verse 2 28 + look — i know what they'll say. it sounds like a hack 29 + fifty-dollar laptops can't run the full stack 30 + fair. but the full stack was never the goal 31 + the goal was the room. the room was the soul 32 + we'll lose the latency wars to the fiber-pumped bay 33 + pay attention to mesh networks — they're closer to the day 34 + i can't promise the speakers won't pop in the rain 35 + i can't promise the kernel won't panic on the train 36 + but princeton couldn't promise that either — they just had a wing 37 + small classroom, big idea — same hands, same swing 38 + i'm building from the bottom, you'll hear the plywood 39 + that's not the cheap part — that's the part that's good 40 + e-waste is a deadline. 
october was the gun 41 + microsoft retired ten — the count's begun 42 + two-forty million machines — pick the ones you want to use 43 + or every single one of them is going down with the refuse 44 + 45 + hook 46 + 47 + outro 48 + plug it in — listen close 49 + the music is real 50 + the logistics — the logistics got cheap
+10
pop/big-pictures/row.np
# Row Row Row Your Boat — notepat score
# notation: NOTE:syllable, hyphens for syllable continuation
# canonical c major round melody. octave 4 (middle C up to C5 on the
# "merrily" peak).

verse 1
C4:row C4:row C4:row D4:your E4:boat*3
E4:gen- D4:-tly E4:down F4:the G4:stream*3
C5:mer- C5:-ri- C5:-ly G4:mer- G4:-ri- G4:-ly E4:mer- E4:-ri- E4:-ly C4:mer- C4:-ri- C4:-ly
G4:life F4:is E4:but D4:a C4:dream*4
+5
pop/big-pictures/row.txt
··· 1 + verse 1 2 + row row row your boat 3 + gently down the stream 4 + merrily merrily merrily merrily 5 + life is but a dream
+6
pop/big-pictures/sentence.np
# the music is real — short hook phrase, c minor pentatonic
# rising contour landing on G3 for "real" (the emphasized word)
# 5 syllables: the / mu- / -sic / is / real

verse 1
C3:the Eb3:mu- Eb3:-sic F3:is G3:real
+2
pop/big-pictures/sentence.txt
··· 1 + verse 1 2 + the music is real
+6
pop/big-pictures/uncomfortable.np
# uncomfortable — single-word melody substrate.
# 5 syllables: un-com-fort-a-ble
# descending arc, c minor: G3 → F3 → Eb3 → D3 → C3.

verse 1
G3:un- F3:-com- Eb3:-fort- D3:-a- C3:-ble
+2
pop/big-pictures/uncomfortable.txt
··· 1 + verse 1 2 + uncomfortable
+79
pop/bin/align.mjs
#!/usr/bin/env node
// align.mjs — run whisper-cli on a vocal stem, emit per-word timestamps.
//
// Mirrors `recap/bin/transcribe.mjs` but takes the stem path as an
// argument and writes `<stem>-words.json` next to the source. Uses the
// same whisper.cpp model already in recap/models/. Caches by source
// content hash; --force to bypass.
//
// Output format (matches recap pipeline): [{text, fromMs, toMs}, ...]
//
// Usage:
//   node bin/align.mjs ../big-pictures/out/plork-hook-vocal.mp3
//   node bin/align.mjs <stem.mp3> --force

import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync, existsSync, unlinkSync } from "node:fs";
import { resolve, dirname } from "node:path";
import { fileURLToPath } from "node:url";
import { createHash } from "node:crypto";

const HERE = dirname(fileURLToPath(import.meta.url));
const ROOT = resolve(HERE, "..");
const REPO = resolve(ROOT, "..");
const MODEL = `${REPO}/recap/models/ggml-base.en.bin`;

const argv = process.argv.slice(2);
const force = argv.includes("--force");
// Resolve only when an argument was given: resolve(cwd, "") returns the
// cwd itself, which exists and would slip past the usage check below.
const stemArg = argv.find((a) => !a.startsWith("--"));
const stemPath = stemArg ? resolve(process.cwd(), stemArg) : null;

if (!stemPath || !existsSync(stemPath)) {
  console.error("usage: node bin/align.mjs <stem.mp3> [--force]");
  process.exit(1);
}
if (!existsSync(MODEL)) {
  console.error(`✗ missing whisper model: ${MODEL}\n download: curl -L -o ${MODEL} https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin`);
  process.exit(1);
}

const inputHash = createHash("sha256")
  .update(readFileSync(stemPath))
  .digest("hex").slice(0, 16);

// Strip whatever extension the stem has. Replacing only /\.mp3$/ would
// leave a non-mp3 path unchanged, so the words file would overwrite the
// source audio.
const outBase = stemPath.replace(/\.[^./\\]+$/, "");
const wordsPath = `${outBase}-words.json`;
const hashFile = `${wordsPath}.hash`;

if (!force && existsSync(wordsPath) && existsSync(hashFile)) {
  const cached = readFileSync(hashFile, "utf8").trim();
  if (cached === inputHash) {
    const words = JSON.parse(readFileSync(wordsPath, "utf8"));
    const last = words[words.length - 1];
    console.log(`✓ ${wordsPath} cached · ${words.length} words · hash ${inputHash} — skipping whisper`);
    if (last) console.log(`  audio ends at ${(last.toMs / 1000).toFixed(2)}s`);
    process.exit(0);
  }
}

console.log(`→ whisper-cli · ${stemPath}`);
const tmpJson = `${outBase}.json`;

// --max-len 1 with -sow (split on word) yields one segment per word.
// The original also passed "-ml 1", which is the short form of the same flag.
execFileSync(
  "whisper-cli",
  ["-m", MODEL, "-f", stemPath, "-ojf", "-of", outBase, "--max-len", "1", "-sow"],
  { stdio: ["ignore", "ignore", "inherit"] },
);

const raw = JSON.parse(readFileSync(tmpJson, "utf8"));
const words = raw.transcription
  .map((s) => ({ text: s.text.trim(), fromMs: s.offsets.from, toMs: s.offsets.to }))
  .filter((w) => w.text.length > 0);

writeFileSync(wordsPath, JSON.stringify(words, null, 2));
writeFileSync(hashFile, inputHash + "\n");
try { unlinkSync(tmpJson); } catch {}

// Guard against an empty transcript (silent stem) before reading `last`.
const last = words[words.length - 1];
const endsAt = last ? `${(last.toMs / 1000).toFixed(2)}s` : "no words";
console.log(`✓ ${wordsPath} · ${words.length} words · ${endsAt} · hash ${inputHash}`);
console.log(`  first 6: ${words.slice(0, 6).map(w => `${w.text}@${(w.fromMs/1000).toFixed(2)}s`).join(" ")}`);
+456
pop/bin/finalize.mjs
··· 1 + #!/usr/bin/env node 2 + // finalize.mjs — emit a finalized mp3 with ID3v2 metadata + embedded 3 + // auto-generated, timestamped cover art. Last leg of the 4 + // pop/big-pictures lane: takes a rendered track (output of any of the 5 + // align/pitchsnap/timefit pipeline) and produces a streaming-ready file. 6 + // 7 + // Cover generation runs entirely offline — no APIs, no canvas 8 + // dependencies. PNG is hand-rolled with zlib + a 6x10 bitmap font 9 + // extracted from fedac/native/src/font-6x10.h (public-domain X11 10 + // terminal font), scaled up to draw big titles at 3000×3000. 11 + // 12 + // Usage: 13 + // node bin/finalize.mjs --in out/ac-mix.mp3 --slug ac --title "aesthetic 24" 14 + // node bin/finalize.mjs --in out/mary-tuned.mp3 --slug mary 15 + // node bin/finalize.mjs --in out/plork-chorus.mp3 --slug plork --cover my-cover.png 16 + // node bin/finalize.mjs --in out/ac-mix.mp3 --slug ac --out custom-final.mp3 --force 17 + 18 + import { spawnSync } from "node:child_process"; 19 + import { writeFileSync, readFileSync, mkdirSync, existsSync, statSync } from "node:fs"; 20 + import { resolve, dirname, basename } from "node:path"; 21 + import { fileURLToPath } from "node:url"; 22 + import { createHash } from "node:crypto"; 23 + import { homedir } from "node:os"; 24 + import { deflateSync } from "node:zlib"; 25 + 26 + const HERE = dirname(fileURLToPath(import.meta.url)); 27 + const ROOT = resolve(HERE, ".."); 28 + 29 + // ── argv parsing (matches say.mjs / timefit.mjs idiom) ───────────────── 30 + function parseArgs(argv) { 31 + const flags = {}; 32 + const positional = []; 33 + for (let i = 0; i < argv.length; i++) { 34 + const a = argv[i]; 35 + if (a.startsWith("--")) { 36 + const k = a.slice(2); 37 + const next = argv[i + 1]; 38 + if (next !== undefined && !next.startsWith("--")) { flags[k] = next; i++; } 39 + else flags[k] = true; 40 + } else positional.push(a); 41 + } 42 + return { flags, positional }; 43 + } 44 + 45 + function expandHome(p) { 46 
+ if (!p || typeof p !== "string") return p; 47 + if (p === "~") return homedir(); 48 + if (p.startsWith("~/")) return resolve(homedir(), p.slice(2)); 49 + return p; 50 + } 51 + 52 + const { flags } = parseArgs(process.argv.slice(2)); 53 + 54 + if (!flags.in || !flags.slug) { 55 + console.error("usage: node bin/finalize.mjs --in <mp3> --slug <slug> [--title \"...\"] [--cover <png>] [--out <path>] [--force]"); 56 + process.exit(1); 57 + } 58 + 59 + const IN_PATH = resolve(process.cwd(), expandHome(flags.in)); 60 + if (!existsSync(IN_PATH)) { 61 + console.error(`✗ input mp3 not found: ${IN_PATH}`); 62 + process.exit(1); 63 + } 64 + 65 + const SLUG = String(flags.slug).trim().toLowerCase(); 66 + const TITLE = flags.title && flags.title !== true 67 + ? String(flags.title) 68 + : SLUG.replace(/[-_]+/g, " "); 69 + const FORCE = flags.force === true; 70 + const OUT_PATH = expandHome(flags.out) 71 + ? resolve(process.cwd(), expandHome(flags.out)) 72 + : `${ROOT}/big-pictures/out/${SLUG}-final.mp3`; 73 + const COVER_OVERRIDE = flags.cover && flags.cover !== true 74 + ? 
resolve(process.cwd(), expandHome(flags.cover)) 75 + : null; 76 + 77 + mkdirSync(dirname(OUT_PATH), { recursive: true }); 78 + 79 + // ── Probe input duration (mirror of timefit.mjs) ─────────────────────── 80 + const probe = spawnSync( 81 + "ffprobe", 82 + ["-v", "error", "-show_entries", "format=duration", 83 + "-of", "default=noprint_wrappers=1:nokey=1", IN_PATH], 84 + { encoding: "utf8" }, 85 + ); 86 + const inputDur = Number(probe.stdout.trim()); 87 + if (!(inputDur > 0)) { 88 + console.error(`✗ ffprobe could not read duration of ${IN_PATH}`); 89 + process.exit(1); 90 + } 91 + 92 + // ── Build metadata ───────────────────────────────────────────────────── 93 + const now = new Date(); 94 + const pad = (n) => String(n).padStart(2, "0"); 95 + const isoDate = `${now.getUTCFullYear()}-${pad(now.getUTCMonth() + 1)}-${pad(now.getUTCDate())}`; 96 + const isoYear = String(now.getUTCFullYear()); 97 + const isoStamp = now.toISOString(); 98 + const localStampShort = `${isoDate} ${pad(now.getUTCHours())}:${pad(now.getUTCMinutes())} UTC`; 99 + 100 + const ARTIST = "@jeffrey"; 101 + const ALBUM = "big pictures"; 102 + const GENRE = "emo trap"; 103 + 104 + // Lyric file (sibling .txt) → USLT 105 + const lyricPath = `${ROOT}/big-pictures/${SLUG}.txt`; 106 + let lyricsText = null; 107 + if (existsSync(lyricPath)) { 108 + lyricsText = readFileSync(lyricPath, "utf8").trim(); 109 + } 110 + 111 + const meta = { 112 + title: TITLE, 113 + artist: ARTIST, 114 + album_artist: ARTIST, 115 + album: ALBUM, 116 + date: isoDate, 117 + year: isoYear, 118 + genre: GENRE, 119 + comment: `rendered ${isoStamp} · source: ${basename(IN_PATH)}`, 120 + }; 121 + 122 + // ── Cache key (idempotent on input contents + key metadata) ──────────── 123 + const inputBuf = readFileSync(IN_PATH); 124 + const cacheKey = createHash("sha256").update(inputBuf).update(JSON.stringify({ 125 + slug: SLUG, title: TITLE, lyricsText, coverOverride: COVER_OVERRIDE, 126 + })).digest("hex").slice(0, 16); 127 + const 
hashFile = `${OUT_PATH}.hash`; 128 + 129 + if (!FORCE && existsSync(OUT_PATH) && existsSync(hashFile)) { 130 + const cached = readFileSync(hashFile, "utf8").trim(); 131 + if (cached === cacheKey) { 132 + const size = (statSync(OUT_PATH).size / 1024).toFixed(0); 133 + console.log(`✓ ${OUT_PATH} cached (${size} KB · hash ${cacheKey}) — skipping finalize`); 134 + process.exit(0); 135 + } 136 + } 137 + 138 + // ── 6×10 bitmap font (public-domain X11 fixed, mined from fedac) ─────── 139 + // Each glyph is 10 bytes; high 6 bits of each byte are the row pixels 140 + // (MSB = leftmost column). Glyph 0 = ASCII 32 (space). 95 glyphs total. 141 + const FONT_W = 6, FONT_H = 10; 142 + const FONT_B64 = 143 + "AAAAAAAAAAAAAAAgICAgIAAgAAAAUFBQAAAAAAAAAFBQ+FD4UFAAAAAgcKBwKHAgAAAASKhQIFCo" + 144 + "kAAAAECgoECokGgAAAAgICAAAAAAAAAAECBAQEAgEAAAAEAgEBAQIEAAAAAAiFD4UIgAAAAAACAg" + 145 + "+CAgAAAAAAAAAAAAMCBAAAAAAAD4AAAAAAAAAAAAAAAgcCAAAAgIECBAgIAAAAAgUIiIiFAgAAAA" + 146 + "IGCgICAg+AAAAHCICDBAgPgAAAD4CBAwCIhwAAAAEDBQkPgQEAAAAPiAsMgIiHAAAAAwQICwyIhw" + 147 + "AAAA+AgQECBAQAAAAHCIiHCIiHAAAABwiJhoCBBgAAAAACBwIAAgcCAAAAAgcCAAMCBAAAAIECBA" + 148 + "IBAIAAAAAAD4APgAAAAAAEAgEAgQIEAAAABwiBAgIAAgAAAAcIiYqLCAcAAAACBQiIj4iIgAAADw" + 149 + "SEhwSEjwAAAAcIiAgICIcAAAAPBISEhISPAAAAD4gIDwgID4AAAA+ICA8ICAgAAAAHCIgICYiHAA" + 150 + "AACIiIj4iIiIAAAAcCAgICAgcAAAADgQEBAQkGAAAACIkKDAoJCIAAAAgICAgICA+AAAAIiI2KiI" + 151 + "iIgAAACIiMiomIiIAAAAcIiIiIiIcAAAAPCIiPCAgIAAAABwiIiIiKhwCAAA8IiI8KCQiAAAAHCI" + 152 + "gHAIiHAAAAD4ICAgICAgAAAAiIiIiIiIcAAAAIiIiFBQUCAAAACIiIioqNiIAAAAiIhQIFCIiAAA" + 153 + "AIiIUCAgICAAAAD4CBAgQID4AAAAcEBAQEBAcAAAAICAQCAQCAgAAABwEBAQEBBwAAAAIFCIAAAA" + 154 + "AAAAAAAAAAAAAAD4ACAQAAAAAAAAAAAAAABwCHiIeAAAAICAsMiIyLAAAAAAAHCIgIhwAAAACAho" + 155 + "mIiYaAAAAAAAcIj4gHAAAAAwSEDwQEBAAAAAAAB4iIh4CIhwAICAsMiIiIgAAAAgAGAgICBwAAAA" + 156 + "CAAYCAgISEgwAICAiJDgkIgAAABgICAgICBwAAAAAADQqKioiAAAAAAAsMiIiIgAAAAAAHCIiIhw" + 157 + 
"AAAAAACwyIjIsICAAAAAaJiImGgICAAAALDIgICAAAAAAABwgHAI8AAAAEBA8EBASDAAAAAAAIiI" + 158 + "iJhoAAAAAACIiFBQIAAAAAAAiIioqFAAAAAAAIhQIFCIAAAAAACIiJhoCIhwAAAA+BAgQPgAAAAY" + 159 + "IBBgECAYAAAAICAgICAgIAAAAGAQIBggEGAAAABIqJAAAAAAAAA="; 160 + const FONT = Buffer.from(FONT_B64, "base64"); 161 + 162 + // pixel sample at (x, y) in glyph space [0..6) × [0..10). 163 + function glyphPixel(ch, x, y) { 164 + const code = ch.charCodeAt(0); 165 + if (code < 32 || code > 126) return false; 166 + const idx = code - 32; 167 + const row = FONT[idx * FONT_H + y]; 168 + if (row === undefined) return false; 169 + // High 6 bits hold the columns; bit 7 = leftmost. 170 + return (row & (0x80 >> x)) !== 0; 171 + } 172 + 173 + // Measure text width at scale (no kerning, fixed-pitch font). 174 + function textWidth(str, scale) { 175 + return str.length * FONT_W * scale; 176 + } 177 + 178 + // ── Pure-Node 8-bit RGB framebuffer ──────────────────────────────────── 179 + function makeFB(w, h, bg) { 180 + const buf = Buffer.alloc(w * h * 3); 181 + for (let i = 0; i < buf.length; i += 3) { 182 + buf[i] = bg[0]; buf[i + 1] = bg[1]; buf[i + 2] = bg[2]; 183 + } 184 + return { w, h, buf }; 185 + } 186 + 187 + function setPx(fb, x, y, rgb) { 188 + if (x < 0 || y < 0 || x >= fb.w || y >= fb.h) return; 189 + const i = (y * fb.w + x) * 3; 190 + fb.buf[i] = rgb[0]; fb.buf[i + 1] = rgb[1]; fb.buf[i + 2] = rgb[2]; 191 + } 192 + 193 + function fillRect(fb, x0, y0, w, h, rgb) { 194 + for (let y = y0; y < y0 + h; y++) { 195 + for (let x = x0; x < x0 + w; x++) setPx(fb, x, y, rgb); 196 + } 197 + } 198 + 199 + function drawText(fb, str, x0, y0, scale, rgb) { 200 + for (let i = 0; i < str.length; i++) { 201 + const ch = str[i]; 202 + const gx0 = x0 + i * FONT_W * scale; 203 + for (let gy = 0; gy < FONT_H; gy++) { 204 + for (let gx = 0; gx < FONT_W; gx++) { 205 + if (!glyphPixel(ch, gx, gy)) continue; 206 + // Stamp scale×scale square per pixel. 
207 + fillRect(fb, gx0 + gx * scale, y0 + gy * scale, scale, scale, rgb); 208 + } 209 + } 210 + } 211 + } 212 + 213 + // ── Hand-rolled PNG writer (deflate, RGB8, no filtering) ─────────────── 214 + function crc32(buf) { 215 + let c, table = crc32.table; 216 + if (!table) { 217 + table = new Uint32Array(256); 218 + for (let n = 0; n < 256; n++) { 219 + c = n; 220 + for (let k = 0; k < 8; k++) c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1); 221 + table[n] = c >>> 0; 222 + } 223 + crc32.table = table; 224 + } 225 + c = 0xFFFFFFFF; 226 + for (let i = 0; i < buf.length; i++) c = table[(c ^ buf[i]) & 0xFF] ^ (c >>> 8); 227 + return (c ^ 0xFFFFFFFF) >>> 0; 228 + } 229 + 230 + function chunk(type, data) { 231 + const len = Buffer.alloc(4); len.writeUInt32BE(data.length, 0); 232 + const tbuf = Buffer.from(type, "ascii"); 233 + const crcBuf = Buffer.alloc(4); 234 + crcBuf.writeUInt32BE(crc32(Buffer.concat([tbuf, data])), 0); 235 + return Buffer.concat([len, tbuf, data, crcBuf]); 236 + } 237 + 238 + function writePNG(fb) { 239 + const sig = Buffer.from([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]); 240 + // IHDR 241 + const ihdr = Buffer.alloc(13); 242 + ihdr.writeUInt32BE(fb.w, 0); 243 + ihdr.writeUInt32BE(fb.h, 4); 244 + ihdr[8] = 8; // bit depth 245 + ihdr[9] = 2; // color type: RGB 246 + ihdr[10] = 0; // compression 247 + ihdr[11] = 0; // filter 248 + ihdr[12] = 0; // interlace 249 + // IDAT — prepend 0x00 filter byte to each scanline 250 + const stride = fb.w * 3; 251 + const filtered = Buffer.alloc(fb.h * (stride + 1)); 252 + for (let y = 0; y < fb.h; y++) { 253 + filtered[y * (stride + 1)] = 0; 254 + fb.buf.copy(filtered, y * (stride + 1) + 1, y * stride, (y + 1) * stride); 255 + } 256 + const idat = deflateSync(filtered, { level: 6 }); 257 + return Buffer.concat([sig, chunk("IHDR", ihdr), chunk("IDAT", idat), chunk("IEND", Buffer.alloc(0))]); 258 + } 259 + 260 + // ── Slug-deterministic palette (HSL → RGB, AC-saturated) ─────────────── 261 + function 
hslToRgb(h, s, l) { 262 + // h ∈ [0..360), s ∈ [0..1], l ∈ [0..1] 263 + const c = (1 - Math.abs(2 * l - 1)) * s; 264 + const hp = h / 60; 265 + const x = c * (1 - Math.abs((hp % 2) - 1)); 266 + let r1 = 0, g1 = 0, b1 = 0; 267 + if (hp < 1) { r1 = c; g1 = x; } 268 + else if (hp < 2) { r1 = x; g1 = c; } 269 + else if (hp < 3) { g1 = c; b1 = x; } 270 + else if (hp < 4) { g1 = x; b1 = c; } 271 + else if (hp < 5) { r1 = x; b1 = c; } 272 + else { r1 = c; b1 = x; } 273 + const m = l - c / 2; 274 + return [Math.round((r1 + m) * 255), Math.round((g1 + m) * 255), Math.round((b1 + m) * 255)]; 275 + } 276 + 277 + function paletteFromSlug(slug) { 278 + const h = createHash("sha256").update(slug).digest(); 279 + const hue = (h[0] / 256) * 360; 280 + // AC palette: rich, saturated, fairly dark backgrounds with a complementary accent 281 + const bg = hslToRgb(hue, 0.72, 0.16); 282 + const accent = hslToRgb((hue + 36) % 360, 0.85, 0.58); 283 + // Bright cream type for max contrast against the dark bg 284 + const fg = hslToRgb(hue, 0.10, 0.96); 285 + const dim = hslToRgb(hue, 0.40, 0.62); 286 + return { bg, fg, accent, dim }; 287 + } 288 + 289 + // ── Compose the cover (3000×3000) ────────────────────────────────────── 290 + function buildCover(slug, title, stampShort) { 291 + const W = 3000, H = 3000; 292 + const pal = paletteFromSlug(slug); 293 + const fb = makeFB(W, H, pal.bg); 294 + 295 + // Top decoration bars — geometric, hand-drawn-feeling. 
296 + const h = createHash("sha256").update(slug).digest(); 297 + const barCount = 5 + (h[1] % 5); 298 + for (let i = 0; i < barCount; i++) { 299 + fillRect(fb, 200, 90 + i * 22, W - 400, 4, pal.dim); 300 + } 301 + 302 + // ── Vertical bands (no overlap) ───────────────────────────────────── 303 + // 90..200 top bars 304 + // 280..1080 title (scale ≤80) 305 + // 1180..1480 waveform 306 + // 1560..1780 stamp 307 + // 1860..2230 album mark "big pictures" 308 + // 2310..2870 handle "@jeffrey" 309 + // 2900..2980 bottom bars 310 + 311 + // Title — large, occupies most of the top band. 312 + const titleStr = String(title || slug).toLowerCase(); 313 + let titleScale = Math.floor((W - 400) / (titleStr.length * FONT_W)); 314 + titleScale = Math.min(titleScale, 80); 315 + titleScale = Math.max(titleScale, 28); 316 + const titleW = textWidth(titleStr, titleScale); 317 + const titleX = Math.round((W - titleW) / 2); 318 + const titleH = FONT_H * titleScale; 319 + const titleY = 280 + Math.round((800 - titleH) / 2); 320 + const shadowOff = Math.max(4, Math.round(titleScale * 0.18)); 321 + drawText(fb, titleStr, titleX + shadowOff, titleY + shadowOff, titleScale, pal.accent); 322 + drawText(fb, titleStr, titleX, titleY, titleScale, pal.fg); 323 + 324 + // Waveform glyph 325 + const wvBars = 32; 326 + const wvY = 1330; 327 + const wvW = Math.round(W * 0.66); 328 + const wvX = Math.round((W - wvW) / 2); 329 + const barWidth = Math.floor(wvW / wvBars) - 4; 330 + for (let i = 0; i < wvBars; i++) { 331 + const seed = h[i % h.length]; 332 + const amp = 30 + ((seed * (i + 1)) % 130); 333 + const x = wvX + i * Math.floor(wvW / wvBars); 334 + fillRect(fb, x, wvY - amp, barWidth, amp * 2, pal.accent); 335 + } 336 + 337 + // Timestamp — small, centered. 338 + const stampScale = 22; 339 + const stampW = textWidth(stampShort, stampScale); 340 + drawText(fb, stampShort, Math.round((W - stampW) / 2), 1560, stampScale, pal.dim); 341 + 342 + // Album mark "big pictures" — auto-fit width. 
343 + const mark = "big pictures"; 344 + let markScale = Math.floor((W - 300) / (mark.length * FONT_W)); 345 + markScale = Math.min(markScale, 38); 346 + markScale = Math.max(markScale, 24); 347 + const markW = textWidth(mark, markScale); 348 + const markH = FONT_H * markScale; 349 + const markY = 1860 + Math.round((370 - markH) / 2); 350 + drawText(fb, mark, Math.round((W - markW) / 2), markY, markScale, pal.fg); 351 + 352 + // Handle "@jeffrey" — biggest text on the cover, max-fit to width. 353 + const handle = "@jeffrey"; 354 + let handleScale = Math.floor((W - 240) / (handle.length * FONT_W)); 355 + handleScale = Math.min(handleScale, 70); 356 + handleScale = Math.max(handleScale, 28); 357 + const handleW = textWidth(handle, handleScale); 358 + const handleX = Math.round((W - handleW) / 2); 359 + const handleH = FONT_H * handleScale; 360 + const handleY = 2310 + Math.round((560 - handleH) / 2); 361 + const handleShadow = Math.max(6, Math.round(handleScale * 0.14)); 362 + drawText(fb, handle, handleX + handleShadow, handleY + handleShadow, handleScale, pal.accent); 363 + drawText(fb, handle, handleX, handleY, handleScale, pal.fg); 364 + 365 + // Bottom decoration bars 366 + for (let i = 0; i < barCount; i++) { 367 + fillRect(fb, 200, H - 90 - i * 14, W - 400, 4, pal.dim); 368 + } 369 + 370 + return writePNG(fb); 371 + } 372 + 373 + // ── Resolve cover path ───────────────────────────────────────────────── 374 + let coverPath; 375 + if (COVER_OVERRIDE) { 376 + if (!existsSync(COVER_OVERRIDE)) { 377 + console.error(`✗ --cover not found: ${COVER_OVERRIDE}`); 378 + process.exit(1); 379 + } 380 + coverPath = COVER_OVERRIDE; 381 + console.log(`→ using provided cover: ${coverPath}`); 382 + } else { 383 + coverPath = `${ROOT}/big-pictures/out/${SLUG}-cover.png`; 384 + console.log(`→ generating cover: ${coverPath} (3000×3000, ${localStampShort})`); 385 + const png = buildCover(SLUG, TITLE, localStampShort); 386 + writeFileSync(coverPath, png); 387 + console.log(` wrote 
${(png.length / 1024).toFixed(0)} KB`); 388 + } 389 + 390 + // ── ffmpeg: mux audio + cover, attach ID3v2 ──────────────────────────── 391 + // Notes: 392 + // * `-map 0:a -map 1` keeps audio + cover. 393 + // * `-c copy` preserves the source mp3 bitstream (no re-encode). 394 + // * `-disposition:v attached_pic` sets the APIC role. 395 + // * `-id3v2_version 3` keeps ID3v2.3 (best player support). 396 + // * `-write_id3v1 1` is a courtesy for legacy players. 397 + // * `-metadata:s:v` sets the per-stream cover description. 398 + const args = [ 399 + "-hide_banner", "-y", "-loglevel", "error", 400 + "-i", IN_PATH, 401 + "-i", coverPath, 402 + "-map", "0:a", 403 + "-map", "1", 404 + "-c", "copy", 405 + "-id3v2_version", "3", 406 + "-write_id3v1", "1", 407 + "-disposition:v", "attached_pic", 408 + "-metadata:s:v", "title=Album cover", 409 + "-metadata:s:v", "comment=Cover (front)", 410 + ]; 411 + 412 + for (const [k, v] of Object.entries(meta)) { 413 + args.push("-metadata", `${k}=${v}`); 414 + } 415 + if (lyricsText) { 416 + args.push("-metadata", `lyrics-eng=${lyricsText}`); 417 + } 418 + 419 + args.push(OUT_PATH); 420 + 421 + console.log(`→ ffmpeg mux · in=${basename(IN_PATH)} cover=${basename(coverPath)} → ${basename(OUT_PATH)}`); 422 + const ff = spawnSync("ffmpeg", args, { stdio: "inherit" }); 423 + if (ff.status !== 0) { 424 + console.error("✗ ffmpeg failed"); 425 + process.exit(1); 426 + } 427 + 428 + // ── Verify output ────────────────────────────────────────────────────── 429 + const verify = spawnSync( 430 + "ffprobe", 431 + ["-v", "error", "-show_entries", 432 + "format=duration:format_tags=title,artist,album,album_artist,date,genre,comment,lyrics-eng:stream=codec_type,codec_name,disposition:stream_tags=title,comment", 433 + "-of", "default=noprint_wrappers=1", OUT_PATH], 434 + { encoding: "utf8" }, 435 + ); 436 + 437 + const outDur = (() => { 438 + const m = (verify.stdout || "").match(/^duration=(.+)$/m); 439 + return m ? 
Number(m[1]) : null;
})();

const hasCover = (verify.stdout || "").includes("codec_type=video");
const drift = outDur !== null ? outDur - inputDur : null;

// Fail before recording the cache key. Assuming hashFile gates a
// skip-if-unchanged path, a run that produced a coverless file must
// not be remembered as up to date.
if (!hasCover) {
  console.error("✗ cover stream not detected in output — verify ffprobe report:");
  console.error(verify.stdout);
  process.exit(1);
}

writeFileSync(hashFile, cacheKey + "\n");
const outSize = (statSync(OUT_PATH).size / 1024).toFixed(0);
console.log(`✓ ${OUT_PATH} (${outSize} KB · hash ${cacheKey})`);
console.log(` duration ${outDur?.toFixed(3) ?? "?"}s` +
  (drift !== null ? ` (drift ${drift >= 0 ? "+" : ""}${drift.toFixed(3)}s)` : "") +
  ` · cover ${hasCover ? "embedded" : "MISSING"}` +
  ` · lyrics ${lyricsText ? "yes" : "no"}`);
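The slug-to-palette step in `finalize.mjs` above can be sketched on its own. This is a Python port for illustration only, not the script itself: `palette_from_slug` is a hypothetical name, it leans on the standard library's `colorsys` instead of the hand-written `hslToRgb`, and Python's `round` may differ from `Math.round` by one unit on exact halves.

```python
import colorsys
import hashlib


def palette_from_slug(slug: str) -> dict:
    """Derive a deterministic cover palette from a track slug.

    Mirrors the constants in paletteFromSlug: the first byte of the
    SHA-256 digest picks the hue, and fixed saturation/lightness pairs
    give a dark background, near-white type, an accent 36 degrees
    around the wheel, and a dimmed mid-tone.
    """
    digest = hashlib.sha256(slug.encode("utf-8")).digest()
    hue = digest[0] / 256 * 360  # degrees in [0, 360)

    def hsl(h, s, l):
        # colorsys takes (h, l, s), each in 0..1, and returns floats in 0..1.
        r, g, b = colorsys.hls_to_rgb((h % 360) / 360, l, s)
        return (round(r * 255), round(g * 255), round(b * 255))

    return {
        "bg": hsl(hue, 0.72, 0.16),
        "fg": hsl(hue, 0.10, 0.96),
        "accent": hsl(hue + 36, 0.85, 0.58),
        "dim": hsl(hue, 0.40, 0.62),
    }


if __name__ == "__main__":
    for name, rgb in palette_from_slug("mary").items():
        print(f"{name:7s} rgb{rgb}")
```

Because the hue comes only from the digest, the same slug always yields the same cover colours, which is what lets the cover be regenerated without storing any palette state.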
+313
pop/bin/musicxml_to_np.py
#!/usr/bin/env python3
"""
musicxml_to_np.py — convert a MusicXML lead sheet (melody + lyrics) to
AC's `.np` score format used by pitchsnap.mjs.

The .np format pairs one note with one syllable per token:
    NOTE:syllable*beats

  - NOTE     — scientific pitch notation, e.g. D3, G#3 (flats normalized to
               enharmonic sharps so the AC parser doesn't have to know "Bb3")
  - syllable — lowercase. MusicXML <syllabic> markers map to AC dashes:
                 single → "grace"
                 begin  → "a-"
                 middle → "-ma-"
                 end    → "-zing"
  - beats    — duration in quarter-note beats (MusicXML <duration> /
               <divisions>). Rounded to int when whole, otherwise 2-decimal.

Tied notes are merged into one logical note (sum of durations, lyric
from the first). Chord tones (non-melody pitches) and rests are
skipped — this is a melody-extraction pass for the AC vocal lane, not
a full MusicXML round-trip.

Usage:
    python bin/musicxml_to_np.py input.musicxml output.np \\
        [--bpm 70] [--key "G major"] [--title "Amazing Grace"]

Designed to be run on files from sources like Hymnary, Mutopia (after
converting the LilyPond source to MusicXML with a separate tool;
LilyPond's --output flag only names the output file), or any
MuseScore export.
30 + """ 31 + import argparse 32 + import sys 33 + import xml.etree.ElementTree as ET 34 + from pathlib import Path 35 + from fractions import Fraction 36 + 37 + 38 + def strip_ns(tag): 39 + """Strip XML namespace if present (MusicXML files in the wild are 40 + inconsistent about whether they declare a namespace).""" 41 + return tag.split("}", 1)[-1] if "}" in tag else tag 42 + 43 + 44 + def find(elem, name): 45 + """Find a direct child by local name, ignoring namespaces.""" 46 + if elem is None: 47 + return None 48 + for child in elem: 49 + if strip_ns(child.tag) == name: 50 + return child 51 + return None 52 + 53 + 54 + def findall(elem, name): 55 + if elem is None: 56 + return [] 57 + return [c for c in elem if strip_ns(c.tag) == name] 58 + 59 + 60 + # Flats → enharmonic sharps (AC pitch parser sticks to sharps) 61 + FLAT_TO_SHARP = {"D": "C#", "E": "D#", "G": "F#", "A": "G#", "B": "A#"} 62 + 63 + 64 + def pitch_to_np(step, alter, octave): 65 + name = step.upper() 66 + if alter == 1: 67 + name += "#" 68 + elif alter == 2: 69 + # Double sharp — rare. Roll forward one whole step. 
70 + roll = {"C": "D", "D": "E", "F": "G", "G": "A", "A": "B"} 71 + name = roll.get(name, name + "##") 72 + elif alter == -1: 73 + eq = FLAT_TO_SHARP.get(step.upper()) 74 + if eq is not None: 75 + name = eq 76 + elif step.upper() == "C": 77 + # Cb → B (same pitch, octave - 1) 78 + return f"B{octave - 1}" 79 + elif step.upper() == "F": 80 + # Fb → E 81 + name = "E" 82 + elif alter == -2: 83 + # Double flat — also rare; roll back one whole step 84 + roll = {"E": "D", "B": "A", "A": "G", "G": "F", "D": "C"} 85 + name = roll.get(name, name + "bb") 86 + return f"{name}{octave}" 87 + 88 + 89 + def syllabify(text, syllabic): 90 + text = (text or "").lower().strip() 91 + # Strip punctuation that would confuse the AC parser 92 + text = text.replace("_", "").replace(",", "").replace(".", "").replace("?", "").replace("!", "") 93 + if not text: 94 + return "_" 95 + if syllabic == "begin": 96 + return text + "-" 97 + if syllabic == "middle": 98 + return "-" + text + "-" 99 + if syllabic == "end": 100 + return "-" + text 101 + return text # single (or default) 102 + 103 + 104 + def beats_str(beats): 105 + if beats == int(beats): 106 + return str(int(beats)) 107 + # 2-dec, but trim trailing zeros (1.50 → 1.5) 108 + s = f"{beats:.2f}".rstrip("0").rstrip(".") 109 + return s or "0" 110 + 111 + 112 + class TiedAccumulator: 113 + """Collects duration across <tie type='start' | 'continue'> until we 114 + see <tie type='stop'>. The first note in the chain owns the lyric.""" 115 + 116 + def __init__(self): 117 + self.active = False 118 + self.pitch = None 119 + self.duration = 0 120 + self.divisions = 1 121 + self.lyric = None 122 + self.syllabic = None 123 + 124 + def reset(self): 125 + self.__init__() 126 + 127 + 128 + def extract_melody(part): 129 + """Walk one <part>, yielding (np_pitch, syllable, beats) tuples in 130 + order. 
Handles tied notes, chord tones, rests, key/voice changes.""" 131 + tokens = [] 132 + line_breaks = [] # measure index of each line break candidate 133 + 134 + divisions = 1 135 + tied = TiedAccumulator() 136 + main_voice = None 137 + 138 + measures = findall(part, "measure") 139 + for m_idx, measure in enumerate(measures): 140 + attrs = find(measure, "attributes") 141 + if attrs is not None: 142 + d = find(attrs, "divisions") 143 + if d is not None and d.text: 144 + divisions = int(d.text) 145 + 146 + for note in findall(measure, "note"): 147 + voice_el = find(note, "voice") 148 + voice = voice_el.text if voice_el is not None else "1" 149 + if main_voice is None: 150 + main_voice = voice 151 + if voice != main_voice: 152 + continue 153 + 154 + # Chord tone (non-first pitch in a chord) — skip; we only 155 + # transcribe the topmost melody. 156 + if find(note, "chord") is not None: 157 + continue 158 + 159 + duration_el = find(note, "duration") 160 + if duration_el is None or not duration_el.text: 161 + continue 162 + duration = int(duration_el.text) 163 + 164 + # Rest: flush any tied note, then skip 165 + if find(note, "rest") is not None: 166 + if tied.active: 167 + tokens.append(_emit_tied(tied)) 168 + tied.reset() 169 + continue 170 + 171 + pitch = find(note, "pitch") 172 + if pitch is None: 173 + continue 174 + 175 + step = (find(pitch, "step").text or "C").upper() 176 + octave = int((find(pitch, "octave").text or "4")) 177 + alter_el = find(pitch, "alter") 178 + alter = int(alter_el.text) if alter_el is not None and alter_el.text else 0 179 + np_pitch = pitch_to_np(step, alter, octave) 180 + 181 + # Lyric (only from notes that *start* a syllable) 182 + lyric = None 183 + syllabic = "single" 184 + for ly in findall(note, "lyric"): 185 + txt = find(ly, "text") 186 + if txt is not None and txt.text: 187 + lyric = txt.text 188 + syl = find(ly, "syllabic") 189 + syllabic = syl.text if syl is not None and syl.text else "single" 190 + break 191 + 192 + # Tie handling 
            ties = findall(note, "tie")
            tie_types = {t.attrib.get("type") for t in ties}

            if "start" in tie_types and "stop" not in tie_types:
                # Start of a tied chain
                if tied.active:
                    tokens.append(_emit_tied(tied))
                tied.active = True
                tied.pitch = np_pitch
                tied.duration = duration
                tied.divisions = divisions
                tied.lyric = lyric
                tied.syllabic = syllabic
            elif tie_types == {"start", "stop"}:
                # Continuation in the middle of a chain
                if tied.active and tied.pitch == np_pitch:
                    tied.duration += duration
                else:
                    # Stray; treat as new
                    if tied.active:
                        tokens.append(_emit_tied(tied))
                    tied.reset()
                    tokens.append((np_pitch, syllabify(lyric, syllabic),
                                   duration / divisions))
            elif "stop" in tie_types:
                # End of a tied chain
                if tied.active and tied.pitch == np_pitch:
                    tied.duration += duration
                    tokens.append(_emit_tied(tied))
                else:
                    # Stray stop: flush any chain still open on a different
                    # pitch so its note is not lost, then emit this note.
                    if tied.active:
                        tokens.append(_emit_tied(tied))
                    tokens.append((np_pitch, syllabify(lyric, syllabic),
                                   duration / divisions))
                tied.reset()
            else:
                # Plain note
                if tied.active:
                    tokens.append(_emit_tied(tied))
                    tied.reset()
                tokens.append((np_pitch, syllabify(lyric, syllabic),
                               duration / divisions))

        line_breaks.append(len(tokens))

    if tied.active:
        tokens.append(_emit_tied(tied))

    return tokens, line_breaks


def _emit_tied(t):
    return (t.pitch, syllabify(t.lyric, t.syllabic), t.duration / t.divisions)


def find_part(root):
    """Return the first <part> element.
Score may have <score-partwise> 248 + or <score-timewise> at the root, with namespaces, etc.""" 249 + for elem in root.iter(): 250 + if strip_ns(elem.tag) == "part": 251 + return elem 252 + return None 253 + 254 + 255 + def main(): 256 + ap = argparse.ArgumentParser() 257 + ap.add_argument("input", help="MusicXML file") 258 + ap.add_argument("output", help="Output .np file") 259 + ap.add_argument("--bpm", type=int, help="Tempo in BPM (written as a comment)") 260 + ap.add_argument("--key", help="Key name (written as a comment)") 261 + ap.add_argument("--title", help="Title (written as a comment)") 262 + ap.add_argument("--verse", default="verse 1", help='Verse heading (default: "verse 1")') 263 + ap.add_argument("--line-every", type=int, default=8, 264 + help="Wrap output to a new line every N notes (default: 8)") 265 + args = ap.parse_args() 266 + 267 + tree = ET.parse(args.input) 268 + root = tree.getroot() 269 + part = find_part(root) 270 + if part is None: 271 + print("✗ no <part> in MusicXML", file=sys.stderr) 272 + sys.exit(1) 273 + 274 + tokens, line_breaks = extract_melody(part) 275 + if not tokens: 276 + print("✗ no melody notes extracted", file=sys.stderr) 277 + sys.exit(1) 278 + 279 + # Build output 280 + lines = [] 281 + if args.title: 282 + lines.append(f"# {args.title}") 283 + if args.key: 284 + lines.append(f"# key: {args.key}") 285 + if args.bpm: 286 + lines.append(f"# Use --beat-mode --bpm {args.bpm}.") 287 + if lines: 288 + lines.append("") # blank separator 289 + 290 + lines.append(args.verse) 291 + 292 + # Wrap line every N notes (simple heuristic; user re-flows by hand) 293 + cur = [] 294 + for i, (pitch, syl, beats) in enumerate(tokens): 295 + cur.append(f"{pitch}:{syl}*{beats_str(beats)}") 296 + if len(cur) >= args.line_every: 297 + lines.append(" ".join(cur)) 298 + cur = [] 299 + if cur: 300 + lines.append(" ".join(cur)) 301 + 302 + Path(args.output).write_text("\n".join(lines) + "\n") 303 + print(f"✓ {args.output}") 304 + print(f" 
{len(tokens)} notes · {len(line_breaks)} measures · " 305 + f"≈{sum(t[2] for t in tokens):.1f} beats") 306 + # Show first line preview 307 + first_line = next((l for l in lines if l and not l.startswith("#") and l != args.verse), "") 308 + if first_line: 309 + print(f" first line: {first_line[:120]}{'…' if len(first_line) > 120 else ''}") 310 + 311 + 312 + if __name__ == "__main__": 313 + main()
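To make the `NOTE:syllable*beats` mapping in `musicxml_to_np.py` concrete, here is a reduced, self-contained sketch run on a three-note inline fragment. It is a simplification rather than the script: `to_np_tokens` and `SNIPPET` are hypothetical names, the fragment has no namespaces, ties, rests or chords, and only single sharps and the five simple flats are handled (the full converter also covers Cb, Fb and double accidentals).

```python
import xml.etree.ElementTree as ET

# Hypothetical three-note fragment: "a-ma-zing" on D3, G3, Bb3.
SNIPPET = """
<part id="P1">
  <measure number="1">
    <attributes><divisions>2</divisions></attributes>
    <note>
      <pitch><step>D</step><octave>3</octave></pitch>
      <duration>2</duration>
      <lyric><syllabic>begin</syllabic><text>A</text></lyric>
    </note>
    <note>
      <pitch><step>G</step><octave>3</octave></pitch>
      <duration>4</duration>
      <lyric><syllabic>middle</syllabic><text>ma</text></lyric>
    </note>
    <note>
      <pitch><step>B</step><alter>-1</alter><octave>3</octave></pitch>
      <duration>3</duration>
      <lyric><syllabic>end</syllabic><text>zing</text></lyric>
    </note>
  </measure>
</part>
"""

FLAT_TO_SHARP = {"D": "C#", "E": "D#", "G": "F#", "A": "G#", "B": "A#"}
DASHES = {"begin": "{}-", "middle": "-{}-", "end": "-{}", "single": "{}"}


def to_np_tokens(part_xml: str) -> list:
    part = ET.fromstring(part_xml)
    divisions = 1  # MusicXML: duration units per quarter note
    tokens = []
    for measure in part.iter("measure"):
        div_el = measure.find("attributes/divisions")
        if div_el is not None and div_el.text:
            divisions = int(div_el.text)
        for note in measure.findall("note"):
            pitch = note.find("pitch")
            if pitch is None:  # rest
                continue
            step = pitch.findtext("step")
            alter = int(pitch.findtext("alter", default="0"))
            if alter == 1:
                name = step + "#"
            elif alter == -1:
                name = FLAT_TO_SHARP[step]  # Cb/Fb deliberately not covered here
            else:
                name = step
            text = (note.findtext("lyric/text") or "_").lower()
            syllabic = note.findtext("lyric/syllabic", default="single")
            beats = int(note.findtext("duration")) / divisions
            # Two decimals, then trim trailing zeros: 1.50 -> 1.5, 2.00 -> 2
            beats_s = f"{beats:.2f}".rstrip("0").rstrip(".")
            octave = pitch.findtext("octave")
            tokens.append(f"{name}{octave}:{DASHES[syllabic].format(text)}*{beats_s}")
    return tokens


if __name__ == "__main__":
    print(" ".join(to_np_tokens(SNIPPET)))
    # prints: D3:a-*1 G3:-ma-*2 A#3:-zing*1.5
```

With `<divisions>` set to 2, a `<duration>` of 3 is one and a half quarter-note beats, and B-flat comes out as the enharmonic `A#3`, matching the format notes in the module docstring.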
+264
pop/bin/pitchcheck.mjs
#!/usr/bin/env node
// pitchcheck.mjs — measure the actual fundamental of each word in a
// rendered vocal stem and compare it to what pitchsnap *intended* to
// shift each word to. Reads the `.events.json` emitted by pitchsnap
// (avoids re-aligning the rendered output, since whisper alignment
// degrades on heavily shifted audio).
//
// Pitch detection: autocorrelation over a central window of each
// word's slice, restricted to voice range [80 Hz, 600 Hz]. Parabolic
// interpolation around the peak for sub-sample precision. Skip
// silence by RMS gate.
//
// Output: per-word table of expected vs measured + cents drift, and a
// summary of mean / median absolute drift. ±50¢ = quarter-tone, ±25¢
// = "in tune."
//
// Usage:
//   node bin/pitchcheck.mjs --vocal big-pictures/out/mary-sung.mp3
//   (auto-finds mary-sung.events.json next to the mp3)

import { spawnSync } from "node:child_process";
import { existsSync, readFileSync, mkdirSync, rmSync } from "node:fs";
import { resolve, dirname, basename } from "node:path";

function parseArgs(argv) {
  const flags = {};
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (!a.startsWith("--")) continue;
    const k = a.slice(2);
    const next = argv[i + 1];
    if (next !== undefined && !next.startsWith("--")) { flags[k] = next; i++; }
    else flags[k] = true;
  }
  return flags;
}

const flags = parseArgs(process.argv.slice(2));
// resolve(cwd, "") returns the cwd, which always exists, so a missing
// or valueless --vocal has to be rejected before resolving.
const vocalPath = typeof flags.vocal === "string"
  ? resolve(process.cwd(), flags.vocal)
  : null;
if (!vocalPath || !existsSync(vocalPath)) {
  console.error("usage: --vocal <pitchsnap-output.mp3>");
  process.exit(1);
}
const eventsPath = resolve(process.cwd(),
  typeof flags.events === "string"
    ? flags.events
    : vocalPath.replace(/\.mp3$/, ".events.json"));
if (!existsSync(eventsPath)) {
  console.error(`✗ events file not found: ${eventsPath}\n rerun pitchsnap.mjs to generate it.`);
process.exit(1); 49 + } 50 + const SAMPLE_RATE = 48_000; 51 + const F_MIN = Number(flags["f-min"]) || 80; 52 + const F_MAX = Number(flags["f-max"]) || 600; 53 + 54 + // ── helpers ─────────────────────────────────────────────────────────── 55 + function freqToMidi(f) { return 69 + 12 * Math.log2(f / 440); } 56 + function midiToName(midi) { 57 + const names = ["C","C#","D","Eb","E","F","F#","G","G#","A","Bb","B"]; 58 + const r = Math.round(midi); 59 + return `${names[((r % 12) + 12) % 12]}${Math.floor(r / 12) - 1}`; 60 + } 61 + 62 + function readWav(path) { 63 + const buf = readFileSync(path); 64 + let i = 12; 65 + while (i < buf.length - 8) { 66 + const id = buf.toString("ascii", i, i + 4); 67 + const size = buf.readUInt32LE(i + 4); 68 + if (id === "data") { 69 + i += 8; 70 + const samples = new Float32Array(size / 2); 71 + for (let j = 0; j < samples.length; j++) { 72 + samples[j] = buf.readInt16LE(i + j * 2) / 32768; 73 + } 74 + return samples; 75 + } 76 + i += 8 + size; 77 + } 78 + throw new Error(`no data chunk in ${path}`); 79 + } 80 + 81 + // Autocorrelation pitch detection. Naive but works for clean voice. 82 + // Skip first/last 20% of samples (attack/release transients). 
83 + function detectPitch(samples, sr, fmin, fmax) { 84 + if (samples.length < sr * 0.05) return null; // < 50ms — too short 85 + const start = Math.floor(samples.length * 0.2); 86 + const end = Math.floor(samples.length * 0.8); 87 + const win = samples.slice(start, end); 88 + 89 + // RMS gate — skip silence 90 + let rms = 0; 91 + for (let i = 0; i < win.length; i++) rms += win[i] * win[i]; 92 + rms = Math.sqrt(rms / win.length); 93 + if (rms < 0.005) return null; 94 + 95 + const lagMin = Math.floor(sr / fmax); 96 + const lagMax = Math.min(Math.floor(sr / fmin), Math.floor(win.length / 2)); 97 + 98 + let bestLag = lagMin; 99 + let bestScore = -Infinity; 100 + for (let lag = lagMin; lag <= lagMax; lag++) { 101 + let sum = 0; 102 + let n = win.length - lag; 103 + for (let i = 0; i < n; i++) sum += win[i] * win[i + lag]; 104 + sum /= n; 105 + if (sum > bestScore) { bestScore = sum; bestLag = lag; } 106 + } 107 + 108 + // Parabolic interpolation around peak for sub-sample precision 109 + let lagF = bestLag; 110 + if (bestLag > lagMin && bestLag < lagMax) { 111 + const acAt = (k) => { 112 + let s = 0; 113 + const n = win.length - k; 114 + for (let i = 0; i < n; i++) s += win[i] * win[i + k]; 115 + return s / n; 116 + }; 117 + const a = acAt(bestLag - 1); 118 + const b = acAt(bestLag); 119 + const c = acAt(bestLag + 1); 120 + const denom = a - 2 * b + c; 121 + if (Math.abs(denom) > 1e-9) lagF = bestLag - 0.5 * (c - a) / denom; 122 + } 123 + 124 + return sr / lagF; 125 + } 126 + 127 + // ── main ────────────────────────────────────────────────────────────── 128 + const events = JSON.parse(readFileSync(eventsPath, "utf8")); 129 + 130 + const tmpDir = `${dirname(vocalPath)}/.pitchcheck-tmp`; 131 + rmSync(tmpDir, { recursive: true, force: true }); 132 + mkdirSync(tmpDir, { recursive: true }); 133 + 134 + console.log( 135 + `→ pitchcheck · ${events.events.length} events against ${basename(eventsPath)}\n` + 136 + ` vocal=${basename(vocalPath)} · stretch=${events.stretch}× 
curve=${events.curve}\n`
);
console.log(` ${"i".padStart(3)} ${"word".padEnd(12)} ${"expected".padEnd(14)} ${"measured".padEnd(20)} drift`);
console.log(` ${"─".repeat(60)}`);

let drifts = [];
let confidentCount = 0;

for (const ev of events.events) {
  const startSec = ev.snappedStart;
  const endSec = startSec + ev.durSec;

  const sliceWav = `${tmpDir}/w${ev.i.toString().padStart(3,"0")}.wav`;
  spawnSync("ffmpeg",
    ["-hide_banner","-y","-loglevel","error",
     "-ss",startSec.toFixed(4),"-to",endSec.toFixed(4),
     "-i",vocalPath,
     "-c:a","pcm_s16le","-ar",String(SAMPLE_RATE),"-ac","1",sliceWav],
    { stdio: ["ignore","ignore","ignore"] });
  if (!existsSync(sliceWav)) continue;

  const samples = readWav(sliceWav);
  const f0 = detectPitch(samples, SAMPLE_RATE, F_MIN, F_MAX);

  if (f0 === null) {
    console.log(` ${ev.i.toString().padStart(3)} ${ev.text.padEnd(12)} ${ev.targetNote.padEnd(14)} ${"(silence)".padEnd(20)}`);
    continue;
  }

  const measuredMidi = freqToMidi(f0);
  const measuredName = midiToName(measuredMidi);
  const driftCents = (measuredMidi - ev.targetMidi) * 100;
  drifts.push(driftCents);
  confidentCount++;

  const driftStr = `${driftCents >= 0 ? "+" : ""}${driftCents.toFixed(0)}¢`;
  const measuredStr = `${measuredName} (${f0.toFixed(1)}Hz)`;
  console.log(
    ` ${ev.i.toString().padStart(3)} ${ev.text.padEnd(12)} ${ev.targetNote.padEnd(14)} ${measuredStr.padEnd(20)} ${driftStr}`
  );
}

rmSync(tmpDir, { recursive: true, force: true });

if (confidentCount === 0) {
  console.log("\n no confident measurements — too much silence or noise");
  process.exit(0);
}

drifts.sort((a, b) => Math.abs(a) - Math.abs(b));
const median = Math.abs(drifts[Math.floor(drifts.length / 2)]);
const mean = drifts.reduce((a, b) => a + Math.abs(b), 0) / drifts.length;
const max = Math.max(...drifts.map(Math.abs));

console.log(`\n summary · ${confidentCount}/${events.events.length} measured`);
console.log(` median |drift| = ${median.toFixed(0)}¢`);
console.log(` mean |drift| = ${mean.toFixed(0)}¢`);
console.log(` max |drift| = ${max.toFixed(0)}¢`);
console.log(`\n reference: ±50¢ = within a quarter-tone, ±25¢ = "in tune"`);

// ── Stutter detection ──────────────────────────────────────────────────
// Look for amplitude dips: the smoothed RMS (10 ms frames) falls below
// 30% of the track peak while the 50 ms on either side exceed 60% of
// it, with recovery searched over the next 80 ms. Such dips usually
// indicate WORLD phase resets / vocal skips. Only amplitude is checked
// here; f0-jump detection is not implemented in this pass.
{
  const fullSliceWav = `/tmp/pitchcheck-stutter-${Date.now()}.wav`;
  spawnSync("ffmpeg",
    ["-hide_banner", "-y", "-loglevel", "error",
     "-i", vocalPath,
     "-c:a", "pcm_s16le", "-ar", String(SAMPLE_RATE), "-ac", "1", fullSliceWav],
    { stdio: ["ignore", "ignore", "ignore"] });
  if (existsSync(fullSliceWav)) {
    const samples = readWav(fullSliceWav);
    // The decoded copy is only needed in memory; remove it so repeated
    // runs do not pile up full-length wavs in /tmp.
    rmSync(fullSliceWav, { force: true });
    const hop = Math.floor(0.010 * SAMPLE_RATE);
    const nF = Math.floor(samples.length / hop);
    const rms = new Float32Array(nF);
    for (let f = 0; f < nF; f++) {
      let r = 0;
      for (let j = 0; j < hop; j++) {
        const v = samples[f * hop + j];
        r += v * v;
      }
      rms[f] = Math.sqrt(r / hop);
    }
    // Smooth RMS with a 3-frame rolling mean for stability
    const sm = new Float32Array(nF);
    for (let f = 0; f < nF; f++) {
      let s = 0, c = 0;
      for (let k = -1; k <= 1; k++) {
        if (f + k >= 0 && f + k < nF) { s += rms[f + k]; c++; }
      }
      sm[f] = s / c;
    }
    const peak = sm.reduce((m, v) => v > m ?
v : m, 0); 230 + // Stutter = dip below 30% of peak that's surrounded by content > 60% 231 + const dipThr = peak * 0.30; 232 + const surroundThr = peak * 0.60; 233 + const stutters = []; 234 + for (let f = 5; f < nF - 5; f++) { 235 + if (sm[f] < dipThr) { 236 + // Check if surrounded by content 237 + let preMax = 0, postMax = 0; 238 + for (let k = 1; k <= 5; k++) { 239 + if (sm[f - k] > preMax) preMax = sm[f - k]; 240 + if (sm[f + k] > postMax) postMax = sm[f + k]; 241 + } 242 + if (preMax > surroundThr && postMax > surroundThr) { 243 + // It's a dip — check if it's a real stutter (recovers within 80ms) 244 + let recoveredBy = 8; 245 + for (let k = 1; k <= 8; k++) { 246 + if (f + k < nF && sm[f + k] > surroundThr) { recoveredBy = k; break; } 247 + } 248 + stutters.push({ time: f * 0.010, dipDepth: 1 - sm[f] / preMax, recoveryFrames: recoveredBy }); 249 + // skip ahead past this dip 250 + f += recoveredBy; 251 + } 252 + } 253 + } 254 + if (stutters.length === 0) { 255 + console.log(`\n stutters: none detected ✓`); 256 + } else { 257 + console.log(`\n stutters: ${stutters.length} amplitude dip${stutters.length === 1 ? "" : "s"} flagged`); 258 + for (const s of stutters.slice(0, 12)) { 259 + console.log(` ${s.time.toFixed(2)}s depth ${(s.dipDepth * 100).toFixed(0)}% recover ${s.recoveryFrames * 10}ms`); 260 + } 261 + if (stutters.length > 12) console.log(` ... and ${stutters.length - 12} more`); 262 + } 263 + } 264 + }
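The measurement at the heart of `pitchcheck.mjs` (central window, RMS gate, autocorrelation peak, parabolic refinement, cents drift against the target MIDI note) can be tried on a synthetic tone. This is a Python port for illustration under stated assumptions: `detect_pitch` and `drift_cents` are hypothetical names, the 8 kHz sample rate is chosen only to keep the pure-Python loop fast, and the demo narrows the search to 150–400 Hz so that twice the true period falls outside the lag range.

```python
import math


def detect_pitch(samples, sr, f_min=80.0, f_max=600.0):
    """Estimate f0 in Hz by naive autocorrelation, or None if unusable."""
    if len(samples) < sr * 0.05:  # under 50 ms: too short to measure
        return None
    # Central 60% only, skipping attack and release transients.
    win = samples[int(len(samples) * 0.2):int(len(samples) * 0.8)]
    rms = math.sqrt(sum(v * v for v in win) / len(win))
    if rms < 0.005:  # silence gate
        return None

    lag_min = int(sr / f_max)
    lag_max = min(int(sr / f_min), len(win) // 2)
    if lag_max < lag_min:
        return None

    def ac(lag):
        # Mean product of the window with itself shifted by `lag` samples.
        n = len(win) - lag
        return sum(win[i] * win[i + lag] for i in range(n)) / n

    best = max(range(lag_min, lag_max + 1), key=ac)

    # Parabolic interpolation through the three points around the peak
    # gives a fractional lag, hence sub-sample period resolution.
    lag_f = float(best)
    if lag_min < best < lag_max:
        a, b, c = ac(best - 1), ac(best), ac(best + 1)
        denom = a - 2 * b + c
        if abs(denom) > 1e-9:
            lag_f = best - 0.5 * (c - a) / denom
    return sr / lag_f


def drift_cents(f0, target_midi):
    """Signed distance from the target note, 100 cents per semitone."""
    measured_midi = 69 + 12 * math.log2(f0 / 440.0)
    return (measured_midi - target_midi) * 100


if __name__ == "__main__":
    sr = 8000
    tone = [0.5 * math.sin(2 * math.pi * 220.0 * n / sr) for n in range(1600)]
    f0 = detect_pitch(tone, sr, f_min=150.0, f_max=400.0)
    print(f"f0 = {f0:.2f} Hz, drift vs A3 = {drift_cents(f0, 57):+.1f} cents")
```

The narrowed range is not cosmetic. A pure tone correlates about as well at two periods as at one, so with the script's default 80–600 Hz range a detector of this kind can report the lower octave, which is the failure mode the `--autotune` comments in `pitchsnap.mjs` warn about.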
+1017
pop/bin/pitchsnap.mjs
··· 1 + #!/usr/bin/env node 2 + // pitchsnap.mjs — aggressive per-word post-prod, no elongation. 3 + // 4 + // For each whisper-aligned word: 5 + // 1. Snap its START to the nearest 16th-note slot at the target BPM 6 + // (no time-stretch — word duration stays natural) 7 + // 2. Pitch-shift to the target note from the .np score (proportional 8 + // word→syllable mapping, formant-preserving via rubberband) 9 + // 3. Place the pitched slice into a fresh buffer at the snapped start 10 + // 11 + // Inter-word gaps end up shifted slightly (sometimes longer, sometimes 12 + // shorter) — that's the snap. Words themselves keep natural speech 13 + // rate, so jeffrey-pvc doesn't sound rushed; only their *placement* 14 + // quantizes to the grid. 15 + // 16 + // Stretch ("lazy" mode): pass `--stretch FACTOR` to time-stretch every 17 + // word by FACTOR using rubberband (formant + pitch preserving), then 18 + // re-snap the stretched starts to the grid. 1.0 = natural, 1.5 = lazy 19 + // (50% longer per word), 2.0 = drone. Total track duration grows. 
//
// Usage:
//   node bin/pitchsnap.mjs --vocal big-pictures/out/ac-vocal.mp3 \
//     --score big-pictures/plork.np --section hook \
//     --bpm 140 --grid 16 --ref-note C3 \
//     --stretch 1.4 \
//     --out big-pictures/out/ac-snapped-pitched.mp3

import { spawnSync } from "node:child_process";
import { existsSync, mkdirSync, readFileSync, writeFileSync, rmSync } from "node:fs";
import { resolve, dirname, basename } from "node:path";
import { fileURLToPath } from "node:url";

const HERE = dirname(fileURLToPath(import.meta.url));
const POP_ROOT = resolve(HERE, "..");

function parseArgs(argv) {
  const flags = {};
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (!a.startsWith("--")) continue;
    const k = a.slice(2);
    const next = argv[i + 1];
    if (next !== undefined && !next.startsWith("--")) { flags[k] = next; i++; }
    else flags[k] = true;
  }
  return flags;
}

const flags = parseArgs(process.argv.slice(2));

// resolve(cwd, "") returns the cwd, which always exists, so a missing
// or valueless --vocal / --score has to be rejected before resolving.
const vocalPath = typeof flags.vocal === "string"
  ? resolve(process.cwd(), flags.vocal)
  : null;
if (!vocalPath || !existsSync(vocalPath)) {
  console.error("usage: --vocal <stem.mp3> --score <path.np> [--section hook] [--bpm 140] [--grid 16] [--ref-note C3] [--out path.mp3]");
  process.exit(1);
}
const wordsPath = resolve(process.cwd(), flags.words || vocalPath.replace(/\.mp3$/, "-words.json"));
if (!existsSync(wordsPath)) {
  console.error(`✗ words.json not found at ${wordsPath}. run bin/align.mjs first.`);
  process.exit(1);
}
const scorePath = typeof flags.score === "string"
  ? resolve(process.cwd(), flags.score)
  : null;
if (!scorePath || !existsSync(scorePath)) {
  console.error(`✗ --score file required (path to .np)`);
  process.exit(1);
}
const SECTION = (flags.section || "hook").toLowerCase();
const BPM = Number(flags.bpm) || 140;
const GRID = Number(flags.grid) || 16; // 16 = sixteenth notes per bar
const REF_NOTE = flags["ref-note"] || "C3";
const STRETCH = Number(flags.stretch) || 1.0; // 1.0 = natural, >1 = lazier
const CURVE = flags.curve || "flat"; // "flat" | "linear" | "bezier"
// Engine: "rubberband" (default — segmented per-syllable rubberband
// shifts, preserves source prosody shifted up/down) or "world" (calls
// pitchsnap_world.py for WORLD-vocoder f0 replacement, fully clamps
// pitch to target with vocal timbre intact). World requires the .venv
// at pop/.venv with pyworld + soundfile installed.
const ENGINE = flags.engine || "rubberband";
const RETAIN = Number(flags.retain ?? 1.0); // world only: 0 = source, 1 = clamp
const VIBRATO_HZ = Number(flags["vibrato-hz"] ?? 0);
const VIBRATO_CENTS = Number(flags["vibrato-cents"] ?? 0);
// Transpose every target note by N semitones at runtime (no need to
// rewrite the .np). Useful when the score is in a register too high
// for the source voice (jeffrey-pvc baritone ≈ C3, so kid-songs at
// C4-E4 sound chipmunky — try --transpose -12 to drop them an octave).
const TRANSPOSE = Number(flags.transpose ?? 0);
// Beat mode: interpret syllable `*weight` as BEATS at the given BPM,
// not as relative multipliers. Each word's stretch becomes
// (sum(beats) * 60/BPM) / naturalWordDuration. Required for real
// song timing — speech speeds rarely match the song's meter.
const BEAT_MODE = flags["beat-mode"] === true;
// Detect syllable boundaries within each word via librosa onset
// detection (in pitchsnap_world.py). For multi-syllable words this
// snaps the per-syllable pitch targets to natural energy peaks in
// the audio rather than weighted-proportional splits.
const DETECT_BOUNDARIES = flags["detect-boundaries"] === true;
// Scale walk: comma-separated notes (e.g. "C3,D3,Eb3,F3,G3,Ab3,Bb3,C4")
// that override the .np score's syllable pitches. The full scale is
// laid across each word's duration as evenly-spaced pitchmap
// waypoints — useful for single-word melody experiments where you
// want a slur through more notes than the word has syllables.
const SCALE_WALK = flags["scale-walk"]
  ? String(flags["scale-walk"]).split(",").map((s) => s.trim()).filter(Boolean)
  : null;
// autotune: "off" | "global" (median source pitch — uniform shifts, robust)
//         | "word" (per-word source — accurate but octave-error-prone).
// Off when the flag is omitted. A bare `--autotune` selects "global",
// because autocorrelation f0-detection occasionally finds 2× or ½× on
// individual words, causing audible octave jumps in "word" mode.
const AUTOTUNE = flags.autotune === true ? "global"
  : (flags.autotune || "off");
const SAMPLE_RATE = 48_000;
const OUT_PATH = flags.out
  ?
resolve(process.cwd(), flags.out) 114 + : vocalPath.replace(/\.mp3$/, "-snapped-pitched.mp3"); 115 + 116 + // ── helpers ─────────────────────────────────────────────────────────── 117 + const NOTE_TO_SEMI = { c: 0, d: 2, e: 4, f: 5, g: 7, a: 9, b: 11 }; 118 + function noteToMidi(p) { 119 + const m = p.trim().toLowerCase().match(/^([a-g])([#b]?)(-?\d+)$/); 120 + if (!m) throw new Error(`bad note: ${p}`); 121 + let semi = NOTE_TO_SEMI[m[1]]; 122 + if (m[2] === "#") semi += 1; 123 + if (m[2] === "b") semi -= 1; 124 + const oct = parseInt(m[3], 10); 125 + return 12 * (oct + 1) + semi; 126 + } 127 + 128 + function parseNp(text) { 129 + const sections = {}; 130 + let current = null; 131 + for (const raw of text.split("\n")) { 132 + const line = raw.trim(); 133 + if (!line || line.startsWith("#")) continue; 134 + if (!line.includes(":") && /^[a-z][a-z0-9 ]*$/.test(line)) { 135 + current = line.toLowerCase(); 136 + if (!sections[current]) sections[current] = []; 137 + continue; 138 + } 139 + if (!current) { current = "default"; sections[current] = []; } 140 + const tokens = line.split(/\s+/).filter(Boolean); 141 + for (const tok of tokens) { 142 + // Note: letter + optional sharp/flat + optional octave digit + ":" + syllable 143 + // Token grammar: <note>:<syllable>[*<weight>] 144 + // Weight is a relative duration multiplier (default 1). E.g. 145 + // `G3:-ma-*2` makes "ma" twice as long as a default syllable 146 + // when distributing time across the parent word. 147 + const m = tok.match(/^([A-Ga-g][#b]?\d?):(.+?)(?:\*([\d.]+))?$/); 148 + if (!m) continue; 149 + const note = m[1].charAt(0).toUpperCase() + m[1].slice(1); 150 + const weight = m[3] ? 
Number(m[3]) : 1; 151 + sections[current].push({ pitch: note, syl: m[2], weight }); 152 + } 153 + } 154 + return sections; 155 + } 156 + 157 + function probeDuration(p) { 158 + const r = spawnSync( 159 + "ffprobe", 160 + ["-v", "error", "-show_entries", "format=duration", 161 + "-of", "default=noprint_wrappers=1:nokey=1", p], 162 + { encoding: "utf8" } 163 + ); 164 + return Number(r.stdout.trim()); 165 + } 166 + 167 + // Heuristic syllable count for an English word — counts vowel groups, 168 + // drops trailing silent 'e'. Good enough for lyric-to-score alignment; 169 + // failure modes (e.g. "fire" → 1 vs. true 2) drift by ~1 syllable per 170 + // word but the proportional mapping recovers across line lengths. 171 + function syllableCount(word) { 172 + const w = word.toLowerCase().replace(/[^a-z]/g, ""); 173 + if (!w) return 1; 174 + const groups = w.match(/[aeiouy]+/g) || []; 175 + let count = groups.length; 176 + // Silent 'ed' suffix in past-tense verbs (saved, called, loved). 177 + // Exception: "ed" preceded by t/d IS pronounced (wanted, rated). 178 + if (w.endsWith("ed") && w.length > 2 && count > 1) { 179 + const beforeEd = w.charAt(w.length - 3); 180 + if (beforeEd !== "t" && beforeEd !== "d") count--; 181 + } 182 + // Silent trailing 'e' (rake, love, since, more) — but not after the 183 + // ed-rule has already fired. 184 + else if (w.endsWith("e") && count > 1) count--; 185 + // 'le' creates an extra syllable when preceded by a consonant 186 + // (apple, little, bottle). After silent-e adjustment. 187 + if (w.endsWith("le") && w.length > 2 && !"aeiouy".includes(w.charAt(w.length - 3))) count++; 188 + return Math.max(1, count); 189 + } 190 + 191 + // Autocorrelation pitch detection — same algorithm as pitchcheck.mjs. 192 + // Restricted to voice range, uses central window, parabolic interp. 
193 + function detectPitch(samples, sr, fmin = 80, fmax = 600) { 194 + if (samples.length < sr * 0.05) return null; 195 + const start = Math.floor(samples.length * 0.2); 196 + const end = Math.floor(samples.length * 0.8); 197 + const win = samples.slice(start, end); 198 + let rms = 0; 199 + for (let i = 0; i < win.length; i++) rms += win[i] * win[i]; 200 + rms = Math.sqrt(rms / win.length); 201 + if (rms < 0.005) return null; 202 + const lagMin = Math.floor(sr / fmax); 203 + const lagMax = Math.min(Math.floor(sr / fmin), Math.floor(win.length / 2)); 204 + let bestLag = lagMin, bestScore = -Infinity; 205 + for (let lag = lagMin; lag <= lagMax; lag++) { 206 + let sum = 0; 207 + const n = win.length - lag; 208 + for (let i = 0; i < n; i++) sum += win[i] * win[i + lag]; 209 + sum /= n; 210 + if (sum > bestScore) { bestScore = sum; bestLag = lag; } 211 + } 212 + let lagF = bestLag; 213 + if (bestLag > lagMin && bestLag < lagMax) { 214 + const acAt = (k) => { 215 + let s = 0; 216 + const n = win.length - k; 217 + for (let i = 0; i < n; i++) s += win[i] * win[i + k]; 218 + return s / n; 219 + }; 220 + const a = acAt(bestLag - 1), b = acAt(bestLag), c = acAt(bestLag + 1); 221 + const denom = a - 2 * b + c; 222 + if (Math.abs(denom) > 1e-9) lagF = bestLag - 0.5 * (c - a) / denom; 223 + } 224 + return sr / lagF; 225 + } 226 + function freqToMidi(f) { return 69 + 12 * Math.log2(f / 440); } 227 + 228 + function readWav(path) { 229 + // Read a 16-bit PCM mono wav written by ffmpeg into a Float32Array 230 + // of normalized samples. Skips RIFF/fmt headers via the data chunk. 
231 + const buf = readFileSync(path); 232 + // Find 'data' chunk 233 + let i = 12; 234 + while (i < buf.length - 8) { 235 + const id = buf.toString("ascii", i, i + 4); 236 + const size = buf.readUInt32LE(i + 4); 237 + if (id === "data") { 238 + i += 8; 239 + const samples = new Float32Array(size / 2); 240 + for (let j = 0; j < samples.length; j++) { 241 + samples[j] = buf.readInt16LE(i + j * 2) / 32768; 242 + } 243 + return samples; 244 + } 245 + i += 8 + size; 246 + } 247 + throw new Error(`no data chunk in ${path}`); 248 + } 249 + 250 + // ── load inputs ─────────────────────────────────────────────────────── 251 + const words = JSON.parse(readFileSync(wordsPath, "utf8")); 252 + const score = parseNp(readFileSync(scorePath, "utf8")); 253 + const syllables = score[SECTION]; 254 + if (!syllables || !syllables.length) { 255 + console.error(`✗ section '${SECTION}' empty in ${scorePath}`); 256 + process.exit(1); 257 + } 258 + 259 + const refMidi = noteToMidi(REF_NOTE); 260 + const beatSec = 60 / BPM; 261 + const stepSec = beatSec * 4 / GRID; // 16th = beatSec / 4 262 + 263 + const totalNaturalDur = probeDuration(vocalPath); 264 + 265 + const tmpDir = `${dirname(OUT_PATH)}/.${basename(OUT_PATH).replace(/\..*$/, "")}-ps-tmp`; 266 + rmSync(tmpDir, { recursive: true, force: true }); 267 + mkdirSync(tmpDir, { recursive: true }); 268 + 269 + // ── Pre-pass: measure source pitch across all words for "global" autotune ── 270 + // Slice every word once with ffmpeg, run autocorrelation, take median. 271 + // One pre-pass means the per-word loop below uses a stable reference. 272 + let globalSourceMidi = null; 273 + if (AUTOTUNE === "global") { 274 + const detected = []; 275 + for (let i = 0; i < words.length; i++) { 276 + const w = words[i]; 277 + const startSec = w.fromMs / 1000; 278 + const endSec = i < words.length - 1 ? 
words[i + 1].fromMs / 1000 : totalNaturalDur; 279 + if (startSec >= totalNaturalDur - 0.005) continue; 280 + const safeEnd = Math.min(endSec, totalNaturalDur); 281 + const probeWav = `${tmpDir}/probe${i.toString().padStart(3, "0")}.wav`; 282 + spawnSync( 283 + "ffmpeg", 284 + ["-hide_banner", "-y", "-loglevel", "error", 285 + "-ss", startSec.toFixed(4), "-to", safeEnd.toFixed(4), 286 + "-i", vocalPath, 287 + "-c:a", "pcm_s16le", "-ar", String(SAMPLE_RATE), "-ac", "1", probeWav], 288 + { stdio: ["ignore", "ignore", "ignore"] } 289 + ); 290 + if (!existsSync(probeWav)) continue; 291 + const samples = readWav(probeWav); 292 + const f = detectPitch(samples, SAMPLE_RATE, 80, 280); 293 + if (f !== null) detected.push(freqToMidi(f)); 294 + } 295 + if (detected.length) { 296 + detected.sort((a, b) => a - b); 297 + globalSourceMidi = detected[Math.floor(detected.length / 2)]; 298 + console.log(` global source pitch: median MIDI ${globalSourceMidi.toFixed(2)} from ${detected.length} words`); 299 + } else { 300 + console.warn(" ! global autotune: no pitch detections — falling back to ref-relative shifts"); 301 + } 302 + } 303 + 304 + console.log( 305 + `→ pitchsnap · ${words.length} whisper words → ${syllables.length} score syllables\n` + 306 + ` bpm=${BPM} grid=1/${GRID}-note (${(stepSec * 1000).toFixed(1)}ms) ref=${REF_NOTE} ` + 307 + `stretch=${STRETCH.toFixed(2)}× curve=${CURVE} autotune=${AUTOTUNE}` 308 + ); 309 + 310 + // ── per-word: extract slice → pitch shift → record snapped start ───── 311 + const slices = []; // { snappedStart, samples, text } 312 + let maxEndSec = 0; 313 + let sylCursor = 0; // syllable-aware mapping: advance by each word's syllable count 314 + 315 + // Beat-mode pre-pass: walk the score syllable cursor exactly the same 316 + // way the per-word loop will, so we can produce a beats-cumulative 317 + // start time per word. Without this, words begin at speech-time and 318 + // individually-stretched durations cause overlap. 
319 + let beatStarts = null; 320 + if (BEAT_MODE) { 321 + beatStarts = []; 322 + let beats = 0; 323 + let cur = 0; 324 + const beatSec = 60 / BPM; 325 + for (let i = 0; i < words.length; i++) { 326 + beatStarts.push(beats * beatSec); 327 + const ws = syllableCount(words[i].text); 328 + let wordBeats = 0; 329 + for (let k = cur; k < cur + ws && k < syllables.length; k++) { 330 + const sw = syllables[k] && typeof syllables[k].weight === "number" 331 + ? syllables[k].weight : 1; 332 + wordBeats += sw; 333 + } 334 + beats += wordBeats; 335 + cur += ws; 336 + } 337 + console.log(` beat-mode timeline: ${beats} beats total = ${(beats * beatSec).toFixed(2)}s @ ${BPM} BPM`); 338 + } 339 + 340 + for (let i = 0; i < words.length; i++) { 341 + const w = words[i]; 342 + const naturalStart = w.fromMs / 1000; 343 + const naturalEnd = i < words.length - 1 ? words[i + 1].fromMs / 1000 : totalNaturalDur; 344 + 345 + // Snap target: in BEAT_MODE, place each word at its cumulative beat 346 + // position from the score. Otherwise preserve the speech timeline 347 + // (scaled by global STRETCH and snapped to grid). 348 + const snappedStart = beatStarts 349 + ? beatStarts[i] 350 + : Math.round((naturalStart * STRETCH) / stepSec) * stepSec; 351 + 352 + // Syllable-aware mapping: each word claims `syllableCount(word)` 353 + // entries from the score, starting at sylCursor. Use the FIRST 354 + // syllable's pitch as the word's main target; the LAST syllable's 355 + // pitch as the next-target for curve glide. 356 + const wordSyls = syllableCount(w.text); 357 + const startSylIdx = Math.min(syllables.length - 1, sylCursor); 358 + const endSylIdx = Math.min(syllables.length - 1, sylCursor + wordSyls - 1); 359 + sylCursor += wordSyls; 360 + 361 + const syl = syllables[startSylIdx]; 362 + const noteStr = /\d/.test(syl.pitch) ? 
syl.pitch : syl.pitch + "3"; 363 + const targetMidi = noteToMidi(noteStr) + TRANSPOSE; 364 + let semitones = targetMidi - refMidi; 365 + 366 + // Next-target for curve mode: end syllable of this word (intra-word 367 + // glide for multi-syllable words) or the first syllable of the next word. 368 + let nextSemitones = semitones; 369 + if (endSylIdx > startSylIdx) { 370 + // Multi-syllable word — glide to its own last syllable 371 + const lastSyl = syllables[endSylIdx]; 372 + const lastNoteStr = /\d/.test(lastSyl.pitch) ? lastSyl.pitch : lastSyl.pitch + "3"; 373 + nextSemitones = noteToMidi(lastNoteStr) - refMidi; 374 + } else if (sylCursor < syllables.length) { 375 + // Single-syllable word — glide toward next word's first syllable 376 + const nextSyl = syllables[Math.min(syllables.length - 1, sylCursor)]; 377 + const nextNoteStr = /\d/.test(nextSyl.pitch) ? nextSyl.pitch : nextSyl.pitch + "3"; 378 + nextSemitones = noteToMidi(nextNoteStr) - refMidi; 379 + } 380 + 381 + // Guard: skip words whose start has gone past the natural duration. 382 + if (naturalStart >= totalNaturalDur - 0.005) { 383 + console.warn(` ! word ${i} (${w.text}) starts past audio end — skipping`); 384 + continue; 385 + } 386 + // Extend each word's slice end by 60ms so trailing consonants / 387 + // releases don't get cut. The next word will overlap slightly on 388 + // playback (handled by additive mixing) — better than chopped tails. 389 + const safeEnd = Math.min(naturalEnd + 0.06, totalNaturalDur); 390 + const sliceWav = `${tmpDir}/w${i.toString().padStart(3, "0")}.wav`; 391 + spawnSync( 392 + "ffmpeg", 393 + ["-hide_banner", "-y", "-loglevel", "error", 394 + "-ss", naturalStart.toFixed(4), "-to", safeEnd.toFixed(4), 395 + "-i", vocalPath, 396 + "-c:a", "pcm_s16le", "-ar", String(SAMPLE_RATE), "-ac", "1", sliceWav], 397 + { stdio: ["ignore", "ignore", "inherit"] } 398 + ); 399 + if (!existsSync(sliceWav)) { 400 + console.warn(` ! 
word ${i} (${w.text}) ffmpeg slice failed — skipping`); 401 + continue; 402 + } 403 + 404 + // Trim leading/trailing silence within the slice so WORLD analyses 405 + // only the actual word content. Whisper word boundaries can land 406 + // mid-vowel of the previous word or mid-consonant of the current 407 + // one; trimming on RMS-envelope at 5% of peak finds the real 408 + // start/end. Adds 8ms of padding either side so we don't cut into 409 + // attack transients. 410 + { 411 + const buf = readWav(sliceWav); 412 + const hop = Math.floor(0.010 * SAMPLE_RATE); 413 + const nFrames = Math.max(1, Math.floor(buf.length / hop)); 414 + const env = new Float32Array(nFrames); 415 + let peakE = 0; 416 + for (let f = 0; f < nFrames; f++) { 417 + let r = 0; 418 + const a = f * hop; 419 + const b = Math.min(buf.length, a + hop); 420 + for (let j = a; j < b; j++) r += buf[j] * buf[j]; 421 + env[f] = Math.sqrt(r / (b - a)); 422 + if (env[f] > peakE) peakE = env[f]; 423 + } 424 + if (peakE > 0.005) { 425 + const thr = peakE * 0.05; 426 + let s = 0; while (s < nFrames && env[s] < thr) s++; 427 + let e = nFrames - 1; while (e > s && env[e] < thr) e--; 428 + // ATTACK DETECTION (only for short slot allocations). 429 + // For words with allocated weight ≥ 3 beats (sustained notes 430 + // like 'found', 'see' on *5), keep the natural ramp-in intact 431 + // because rubberband needs that material to stretch into the 432 + // long sustain. Aggressively trimming a slow-attack 5-beat note 433 + // produces a clipped sustain — the word ends early in its slot 434 + // and the listener hears silence. Short words (1-2 beats) get 435 + // the full attack-detection trim so their attack lands on beat. 
436 + const wordBeats = syllables 437 + .slice(startSylIdx, endSylIdx + 1) 438 + .reduce((sum, s2) => sum + (s2.weight || 1), 0); 439 + if (wordBeats <= 2) { 440 + const lookaheadFrames = Math.min(25, e - s); // 250ms @ 10ms hop 441 + let maxRise = 0; 442 + let attackFrame = s; 443 + for (let af = s + 1; af < s + lookaheadFrames; af++) { 444 + const rise = env[af] - env[Math.max(s, af - 3)]; 445 + if (rise > maxRise) { 446 + maxRise = rise; 447 + attackFrame = af; 448 + } 449 + } 450 + if (attackFrame - s > 1) { 451 + const preRollFrames = Math.max(0, Math.floor(0.015 / 0.010)); 452 + s = Math.max(s, attackFrame - preRollFrames); 453 + } 454 + } 455 + const pad = Math.floor(0.008 * SAMPLE_RATE); 456 + const startSamp = Math.max(0, s * hop - pad); 457 + const endSamp = Math.min(buf.length, (e + 1) * hop + pad); 458 + if (endSamp > startSamp + Math.floor(0.030 * SAMPLE_RATE)) { 459 + // Write trimmed back to sliceWav so downstream steps see the cleaned cut. 460 + const trimmed = buf.slice(startSamp, endSamp); 461 + const sampleBytes = Buffer.alloc(trimmed.length * 2); 462 + for (let k = 0; k < trimmed.length; k++) { 463 + const v = Math.max(-1, Math.min(1, trimmed[k])); 464 + sampleBytes.writeInt16LE(Math.floor(v * 32767), k * 2); 465 + } 466 + // Minimal RIFF header for 48kHz mono 16-bit PCM 467 + const dataLen = sampleBytes.length; 468 + const header = Buffer.alloc(44); 469 + header.write("RIFF", 0); header.writeUInt32LE(36 + dataLen, 4); 470 + header.write("WAVE", 8); header.write("fmt ", 12); 471 + header.writeUInt32LE(16, 16); header.writeUInt16LE(1, 20); 472 + header.writeUInt16LE(1, 22); header.writeUInt32LE(SAMPLE_RATE, 24); 473 + header.writeUInt32LE(SAMPLE_RATE * 2, 28); header.writeUInt16LE(2, 32); 474 + header.writeUInt16LE(16, 34); header.write("data", 36); 475 + header.writeUInt32LE(dataLen, 40); 476 + writeFileSync(sliceWav, Buffer.concat([header, sampleBytes])); 477 + } 478 + } 479 + } 480 + 481 + // Autotune: shift to land ON the target note. 
482 + // - global: shift by (target − globalSourceMidi). Uniform across 483 + // all words, robust against per-word f0-detection octave errors. 484 + // - word: shift by (target − measured per-word source). Most 485 + // accurate when detection is clean, but susceptible to octave 486 + // errors that produce audible jumps. 487 + let sourceMidi = null; 488 + if (AUTOTUNE === "global" && globalSourceMidi !== null) { 489 + sourceMidi = globalSourceMidi; 490 + semitones = targetMidi - sourceMidi; 491 + } else if (AUTOTUNE === "word") { 492 + const probeSamples = readWav(sliceWav); 493 + const sourceF = detectPitch(probeSamples, SAMPLE_RATE, 80, 280); 494 + if (sourceF !== null) { 495 + sourceMidi = freqToMidi(sourceF); 496 + semitones = targetMidi - sourceMidi; 497 + } 498 + // If pitch detection failed, fall back to ref-based shift. 499 + } 500 + 501 + // Waypoints describe the pitch curve through this word; consumed by 502 + // both the rubberband segmented rendering and the sine + tick overlay below. 503 + let waypoints = []; 504 + 505 + // Segmented rendering — chops the source slice into N equal pieces 506 + // and pitch-shifts each to its own target. Step pitch changes that 507 + // are audibly distinct per segment. Bypasses rubberband's --pitchmap 508 + // which produced ambiguous timing on this version. 509 + // 510 + // Notes for the segments come from one of: 511 + // 1. SCALE_WALK (CLI override, walks notes regardless of syllables) 512 + // 2. .np per-syllable pitches (multi-syllable word mapped to its 513 + // syllable count's worth of score notes) 514 + let segNotes = null; 515 + let segWeights = null; 516 + if (SCALE_WALK && SCALE_WALK.length >= 2) { 517 + segNotes = SCALE_WALK.map((s) => /\d/.test(s) ? 
s : s + "3"); 518 + segWeights = segNotes.map(() => 1); 519 + } else if (CURVE !== "flat" && (endSylIdx - startSylIdx + 1) >= 1) { 520 + // Populate for ALL words (1+ syllables) so single-syllable words 521 + // also flow through the WORLD / beat-mode path. Previously gated 522 + // on >= 2, which silently skipped half the lyric. 523 + segNotes = []; 524 + segWeights = []; 525 + for (let k = startSylIdx; k <= endSylIdx; k++) { 526 + const sylk = syllables[k]; 527 + const baseNote = /\d/.test(sylk.pitch) ? sylk.pitch : sylk.pitch + "3"; 528 + if (TRANSPOSE !== 0) { 529 + const m = noteToMidi(baseNote) + TRANSPOSE; 530 + const names = ["C","C#","D","Eb","E","F","F#","G","G#","A","Bb","B"]; 531 + const oct = Math.floor(m / 12) - 1; 532 + const idx = ((m % 12) + 12) % 12; 533 + segNotes.push(`${names[idx]}${oct}`); 534 + } else { 535 + segNotes.push(baseNote); 536 + } 537 + segWeights.push(typeof sylk.weight === "number" ? sylk.weight : 1); 538 + } 539 + } 540 + 541 + // ── WORLD engine: replace f0 wholesale, no segmenting ──────────── 542 + if (segNotes && segNotes.length >= 1 && ENGINE === "world") { 543 + // Compute per-word stretch. 544 + let perWordStretch = STRETCH; 545 + if (BEAT_MODE && segWeights) { 546 + const naturalWordDur = readWav(sliceWav).length / SAMPLE_RATE; 547 + const beatSec = 60 / BPM; 548 + const targetDur = segWeights.reduce((a, b) => a + b, 0) * beatSec; 549 + if (naturalWordDur > 0.01) { 550 + // Cap raised from 8× to 20× — ElevenLabs utterances of short 551 + // words like "see" are ~250-350ms naturally and need to fill 552 + // 4-5 beat sustain slots. With the old 8× ceiling, sustained 553 + // notes ended early and the listener heard silence after. 
554 + perWordStretch = Math.max(0.5, Math.min(20.0, targetDur / naturalWordDur)); 555 + } 556 + console.log(` beat-mode · '${w.text}' ${segWeights.join("+")}b → ${targetDur.toFixed(2)}s @ ${perWordStretch.toFixed(2)}× stretch (nat=${naturalWordDur.toFixed(2)}s)`); 557 + } 558 + 559 + // Pre-stretch with rubberband (formant-preserving) so the stretched 560 + // wav is what WORLD analyses. This delivers per-word stretch on the 561 + // world engine path (WORLD itself doesn't stretch). 562 + let worldInputWav = sliceWav; 563 + if (Math.abs(perWordStretch - 1.0) >= 0.001) { 564 + worldInputWav = `${tmpDir}/w${i.toString().padStart(3,"0")}-stretched.wav`; 565 + const rb = spawnSync( 566 + "rubberband", 567 + ["-t", String(perWordStretch), sliceWav, worldInputWav], 568 + { stdio: ["ignore", "ignore", "ignore"] } 569 + ); 570 + if (rb.status !== 0 || !existsSync(worldInputWav)) worldInputWav = sliceWav; 571 + } 572 + const pieceWavWorld = `${tmpDir}/w${i.toString().padStart(3,"0")}-world.wav`; 573 + const venvPython = resolve(POP_ROOT, ".venv/bin/python"); 574 + const helperPath = resolve(POP_ROOT, "bin/pitchsnap_world.py"); 575 + const args = [ 576 + helperPath, worldInputWav, pieceWavWorld, 577 + "--notes", segNotes.join(","), 578 + "--retain", String(RETAIN), 579 + ]; 580 + if (segWeights && segWeights.some((w) => w !== 1)) { 581 + args.push("--weights", segWeights.join(",")); 582 + } 583 + if (DETECT_BOUNDARIES) args.push("--detect-boundaries"); 584 + if (VIBRATO_HZ > 0) { 585 + args.push("--vibrato-hz", String(VIBRATO_HZ)); 586 + args.push("--vibrato-cents", String(VIBRATO_CENTS)); 587 + } 588 + const r = spawnSync(venvPython, args, { stdio: ["ignore", "inherit", "inherit"] }); 589 + if (r.status !== 0 || !existsSync(pieceWavWorld)) { 590 + console.warn(` ! 
word ${i} world engine failed — falling back to rubberband`); 591 + } else { 592 + // Build waypoints (for sine overlay + events trace) 593 + const refForShift = (AUTOTUNE === "global" && globalSourceMidi !== null) 594 + ? globalSourceMidi : (sourceMidi !== null ? sourceMidi : refMidi); 595 + const wlen = readWav(pieceWavWorld).length; 596 + for (let k = 0; k < segNotes.length; k++) { 597 + const midiK = noteToMidi(segNotes[k]); 598 + waypoints.push({ 599 + sample: Math.floor((k / segNotes.length) * wlen), 600 + midi: midiK, 601 + semi: midiK - refForShift, 602 + }); 603 + } 604 + let samples = readWav(pieceWavWorld); 605 + // Hard-trim each word to its beat allocation (BEAT_MODE) so 606 + // stretched WORLD output doesn't bleed into the next word's slot 607 + // and create overlap/echo. 30ms cosine fade-out at the trim point 608 + // smooths the cut. Without this, words that ended up slightly 609 + // longer than their beat target leaked tonal tail into following 610 + // words → audible echo on the single voice. 
611 + if (BEAT_MODE && segWeights) { 612 + const beatSec = 60 / BPM; 613 + const allocatedSec = segWeights.reduce((a, b) => a + b, 0) * beatSec; 614 + const allocatedSamples = Math.floor(allocatedSec * SAMPLE_RATE); 615 + if (samples.length > allocatedSamples) { 616 + const trimmed = samples.slice(0, allocatedSamples); 617 + const fadeS = Math.min(Math.floor(0.030 * SAMPLE_RATE), Math.floor(trimmed.length / 8)); 618 + for (let k = 0; k < fadeS; k++) { 619 + const j = trimmed.length - fadeS + k; 620 + const env = 0.5 + 0.5 * Math.cos((Math.PI * k) / fadeS); 621 + trimmed[j] *= env; 622 + } 623 + samples = trimmed; 624 + } 625 + } 626 + slices.push({ 627 + snappedStart, samples, text: w.text, naturalStart, 628 + semitones, noteStr, targetMidi, sourceMidi, waypoints, 629 + }); 630 + const endSec = snappedStart + samples.length / SAMPLE_RATE; 631 + if (endSec > maxEndSec) maxEndSec = endSec; 632 + continue; 633 + } 634 + } 635 + 636 + if (segNotes && segNotes.length >= 2) { 637 + const naturalSamples = readWav(sliceWav).length; 638 + const refForShift = (AUTOTUNE === "global" && globalSourceMidi !== null) 639 + ? globalSourceMidi 640 + : (sourceMidi !== null ? sourceMidi : refMidi); 641 + 642 + const N = segNotes.length; 643 + // Weighted segment boundaries — per-syllable `*weight` from the .np 644 + // score lets us hold longer notes (e.g. "a-MAAA-zing"). 645 + const weights = (segWeights && segWeights.length === N) ? 
segWeights : segNotes.map(() => 1); 646 + const wTotal = weights.reduce((a, b) => a + b, 0) || N; 647 + const cumW = [0]; 648 + for (let k = 0; k < N; k++) cumW.push(cumW[k] + weights[k] / wTotal); 649 + 650 + const segPieces = []; 651 + for (let k = 0; k < N; k++) { 652 + const noteAtK = segNotes[k]; 653 + const midiAtK = noteToMidi(noteAtK); 654 + const semiAtK = midiAtK - refForShift; 655 + waypoints.push({ 656 + sample: Math.floor(cumW[k] * naturalSamples * STRETCH), 657 + midi: midiAtK, 658 + semi: semiAtK, 659 + }); 660 + 661 + // Slice source segment k by its weighted boundaries. 662 + const segStartSec = cumW[k] * (naturalSamples / SAMPLE_RATE); 663 + const segEndSec = cumW[k + 1] * (naturalSamples / SAMPLE_RATE); 664 + const segSrcWav = `${tmpDir}/w${i.toString().padStart(3,"0")}-seg${k}.wav`; 665 + spawnSync( 666 + "ffmpeg", 667 + ["-hide_banner", "-y", "-loglevel", "error", 668 + "-ss", segStartSec.toFixed(4), "-to", segEndSec.toFixed(4), 669 + "-i", sliceWav, 670 + "-c:a", "pcm_s16le", "-ar", String(SAMPLE_RATE), "-ac", "1", segSrcWav], 671 + { stdio: ["ignore", "ignore", "ignore"] } 672 + ); 673 + if (!existsSync(segSrcWav)) continue; 674 + 675 + // Pitch-shift + stretch this segment. 676 + const segOutWav = `${tmpDir}/w${i.toString().padStart(3,"0")}-seg${k}-p.wav`; 677 + const rb = spawnSync( 678 + "rubberband", 679 + ["-p", String(semiAtK), "-t", String(STRETCH), segSrcWav, segOutWav], 680 + { stdio: ["ignore", "ignore", "ignore"] } 681 + ); 682 + if (rb.status === 0 && existsSync(segOutWav)) segPieces.push(segOutWav); 683 + else segPieces.push(segSrcWav); 684 + } 685 + 686 + // Concat segments with overlap-add crossfade (20ms) so segment 687 + // boundaries don't click. Read each piece as samples, mix into a 688 + // running buffer with the tail of segment k overlapping the head 689 + // of segment k+1. 
690 + const xfade = Math.floor(0.020 * SAMPLE_RATE); 691 + const segBufs = segPieces.map((p) => readWav(p)); 692 + let totalLen = 0; 693 + for (const b of segBufs) totalLen += b.length; 694 + totalLen -= xfade * Math.max(0, segBufs.length - 1); 695 + const walked = new Float32Array(Math.max(0, totalLen) + xfade); 696 + let cursor = 0; 697 + for (let k = 0; k < segBufs.length; k++) { 698 + const b = segBufs[k]; 699 + const startK = cursor; 700 + for (let j = 0; j < b.length; j++) { 701 + let env = 1; 702 + if (k > 0 && j < xfade) env = j / xfade; 703 + if (k < segBufs.length - 1 && j >= b.length - xfade) { 704 + env = Math.min(env, (b.length - j) / xfade); 705 + } 706 + const dst = startK + j; 707 + if (dst >= 0 && dst < walked.length) walked[dst] += b[j] * env; 708 + } 709 + cursor += b.length - xfade; 710 + } 711 + // Trim trailing zeros if we overcounted. 712 + let endIdx = walked.length; 713 + while (endIdx > 0 && Math.abs(walked[endIdx - 1]) < 1e-9) endIdx--; 714 + const samples = walked.slice(0, endIdx); 715 + slices.push({ 716 + snappedStart, samples, text: w.text, naturalStart, 717 + semitones, noteStr, targetMidi, sourceMidi, waypoints, 718 + }); 719 + const endSec = snappedStart + samples.length / SAMPLE_RATE; 720 + if (endSec > maxEndSec) maxEndSec = endSec; 721 + continue; // segmented path (SCALE_WALK or multi-syllable word) is done; skip the single-shift path below 722 + } 723 + 724 + let pieceWav = sliceWav; 725 + const needsPitch = Math.abs(semitones) >= 0.01 || (CURVE !== "flat" && Math.abs(nextSemitones - semitones) >= 0.01); 726 + const needsStretch = Math.abs(STRETCH - 1.0) >= 0.001; 727 + if (needsPitch || needsStretch) { 728 + const target = `${tmpDir}/w${i.toString().padStart(3, "0")}-p.wav`; 729 + const rbArgs = []; 730 + 731 + // (waypoints declared at the per-word scope above) 732 + if (CURVE !== "flat") { 733 + const sliceSamples = readWav(sliceWav).length; 734 + const stretchedSamples = Math.floor(sliceSamples * STRETCH); 735 + 736 + const refForShift = (AUTOTUNE === "global"
&& globalSourceMidi !== null) 737 + ? globalSourceMidi 738 + : (sourceMidi !== null ? sourceMidi : refMidi); 739 + 740 + if (SCALE_WALK && SCALE_WALK.length >= 2) { 741 + for (let k = 0; k < SCALE_WALK.length; k++) { 742 + const noteAtK = /\d/.test(SCALE_WALK[k]) ? SCALE_WALK[k] : SCALE_WALK[k] + "3"; 743 + const midiAtK = noteToMidi(noteAtK); 744 + const sample = Math.floor((k / (SCALE_WALK.length - 1)) * stretchedSamples); 745 + waypoints.push({ sample, midi: midiAtK, semi: midiAtK - refForShift }); 746 + } 747 + } else if (CURVE === "linear" && (endSylIdx - startSylIdx + 1) >= 2) { 748 + const sylsInWord = endSylIdx - startSylIdx + 1; 749 + for (let k = 0; k < sylsInWord; k++) { 750 + const sylAtK = syllables[startSylIdx + k]; 751 + const noteAtK = /\d/.test(sylAtK.pitch) ? sylAtK.pitch : sylAtK.pitch + "3"; 752 + const midiAtK = noteToMidi(noteAtK); 753 + const sample = Math.floor((k / (sylsInWord - 1)) * stretchedSamples); 754 + waypoints.push({ sample, midi: midiAtK, semi: midiAtK - refForShift }); 755 + } 756 + } else if (CURVE === "linear") { 757 + waypoints.push({ sample: 0, midi: targetMidi, semi: semitones }); 758 + const nextMidi = (sylCursor < syllables.length) 759 + ? noteToMidi(/\d/.test(syllables[Math.min(syllables.length - 1, sylCursor)].pitch) 760 + ? 
syllables[Math.min(syllables.length - 1, sylCursor)].pitch 761 + : syllables[Math.min(syllables.length - 1, sylCursor)].pitch + "3") 762 + : targetMidi; 763 + waypoints.push({ sample: stretchedSamples, midi: nextMidi, semi: nextSemitones }); 764 + } else if (CURVE === "bezier") { 765 + const fifthOffset = 7; 766 + const midSemi = semitones + Math.sign(nextSemitones - semitones || 1) * (fifthOffset / 4); 767 + waypoints.push({ sample: 0, midi: targetMidi, semi: semitones }); 768 + waypoints.push({ sample: Math.floor(stretchedSamples / 2), midi: targetMidi + midSemi - semitones, semi: midSemi }); 769 + waypoints.push({ sample: stretchedSamples, midi: targetMidi + nextSemitones - semitones, semi: nextSemitones }); 770 + } 771 + 772 + const pitchmap = `${tmpDir}/w${i.toString().padStart(3, "0")}.pmap`; 773 + writeFileSync(pitchmap, waypoints.map((w) => `${w.sample} ${w.semi.toFixed(3)}`).join("\n") + "\n"); 774 + rbArgs.push("--pitchmap", pitchmap); 775 + rbArgs.push("-t", String(STRETCH)); 776 + } else { 777 + if (needsPitch) rbArgs.push("-p", String(semitones)); 778 + if (needsStretch) rbArgs.push("-t", String(STRETCH)); 779 + } 780 + 781 + rbArgs.push(sliceWav, target); 782 + const r = spawnSync("rubberband", rbArgs, { stdio: ["ignore", "ignore", "ignore"] }); 783 + if (r.status === 0 && existsSync(target)) { 784 + pieceWav = target; 785 + } else { 786 + // rubberband can fail on very short slices — fall back to natural. 787 + console.warn(` ! 
word ${i} (${w.text}) rubberband fell back to natural slice`); 788 + } 789 + } 790 + 791 + const samples = readWav(pieceWav); 792 + slices.push({ 793 + snappedStart, samples, text: w.text, naturalStart, 794 + semitones, noteStr, targetMidi, sourceMidi, 795 + waypoints, // for sine overlay + events trace 796 + }); 797 + const endSec = snappedStart + samples.length / SAMPLE_RATE; 798 + if (endSec > maxEndSec) maxEndSec = endSec; 799 + } 800 + 801 + // ── assemble buffer ─────────────────────────────────────────────────── 802 + const totalSamples = Math.ceil((maxEndSec + 0.5) * SAMPLE_RATE); 803 + const out = new Float32Array(totalSamples); 804 + 805 + // Word-boundary fade — 5ms cosine in/out per word slice. Just enough 806 + // to prevent splice clicks; longer fade-ins were softening attacks of 807 + // already-trimmed words, defeating the perceptual-onset alignment in 808 + // the per-word silence trim above. The LAST word doesn't fade out so 809 + // the song resolves naturally on its final note. 810 + const wordFadeS = Math.floor(0.005 * SAMPLE_RATE); 811 + for (let sIdx = 0; sIdx < slices.length; sIdx++) { 812 + const s = slices[sIdx]; 813 + const isLast = sIdx === slices.length - 1; 814 + const startIdx = Math.floor(s.snappedStart * SAMPLE_RATE); 815 + const len = s.samples.length; 816 + const fadeIn = Math.min(wordFadeS, Math.floor(len / 8)); 817 + const fadeOut = isLast ? 0 : Math.min(wordFadeS, Math.floor(len / 8)); 818 + const sustainEnd = len - fadeOut; 819 + for (let i = 0; i < len; i++) { 820 + const dst = startIdx + i; 821 + if (dst < 0 || dst >= out.length) continue; 822 + let env = 1; 823 + if (i < fadeIn) env = 0.5 - 0.5 * Math.cos((Math.PI * i) / fadeIn); 824 + else if (fadeOut > 0 && i >= sustainEnd) env = 0.5 - 0.5 * Math.cos((Math.PI * (len - i)) / fadeOut); 825 + out[dst] += s.samples[i] * env; 826 + } 827 + } 828 + 829 + // Normalize to a 0.85 linear peak (about -1.4 dBFS).
830 + let peak = 0; 831 + for (let i = 0; i < out.length; i++) { 832 + const a = Math.abs(out[i]); 833 + if (a > peak) peak = a; 834 + } 835 + if (peak > 0) { 836 + const norm = 0.85 / peak; 837 + for (let i = 0; i < out.length; i++) out[i] *= norm; 838 + } 839 + 840 + // Write float32 raw + ffmpeg → mp3. 841 + const rawPath = `${tmpDir}/out.f32.raw`; 842 + const buf = Buffer.alloc(out.length * 4); 843 + for (let i = 0; i < out.length; i++) buf.writeFloatLE(out[i], i * 4); 844 + writeFileSync(rawPath, buf); 845 + 846 + const ff = spawnSync( 847 + "ffmpeg", 848 + ["-hide_banner", "-y", "-loglevel", "error", 849 + "-f", "f32le", "-ar", String(SAMPLE_RATE), "-ac", "1", 850 + "-i", rawPath, 851 + "-c:a", "libmp3lame", "-q:a", "3", OUT_PATH], 852 + { stdio: "inherit" } 853 + ); 854 + if (ff.status !== 0) { 855 + console.error("✗ ffmpeg encode failed"); 856 + process.exit(1); 857 + } 858 + 859 + // ── summary ─────────────────────────────────────────────────────────── 860 + console.log(`\n ${"i".padStart(3)} ${"word".padEnd(14)} ${"natural→snapped".padEnd(20)} pitch`); 861 + console.log(` ${"─".repeat(55)}`); 862 + for (let i = 0; i < slices.length; i++) { 863 + const s = slices[i]; 864 + const arrow = s.semitones >= 0 ? "+" : ""; 865 + const drift = s.snappedStart - s.naturalStart; 866 + const driftMs = Math.round(drift * 1000); 867 + const driftStr = `${s.naturalStart.toFixed(2)}→${s.snappedStart.toFixed(2)}s (${driftMs >= 0 ? "+" : ""}${driftMs}ms)`; 868 + console.log( 869 + ` ${i.toString().padStart(3)} ${s.text.padEnd(14)} ${driftStr.padEnd(20)} ${s.noteStr} ${arrow}${s.semitones} st` 870 + ); 871 + } 872 + 873 + // ── Optional: render a sine reference under the vocal ───────────────── 874 + // Pure sine at each event's target pitch for the event's duration, 875 + // gated by a gentle envelope. Mixed at -12 dB by default. Useful for 876 + // auditing whether pitch shifts landed — you should hear the vocal 877 + // "tracking" the sine. 
Set --sine-overlay 0 to mute (still emits the 878 + // file), or --sine-overlay <gain> to override (e.g. 0.3 = -10 dB). 879 + const SINE_OVERLAY = flags["sine-overlay"] !== undefined; 880 + if (SINE_OVERLAY) { 881 + const sineGain = flags["sine-overlay"] === true ? 0.25 : Number(flags["sine-overlay"]); 882 + const sineBuf = new Float32Array(totalSamples); 883 + const twoPiOverSr = (2 * Math.PI) / SAMPLE_RATE; 884 + const ATTACK_SEC = 0.015; 885 + const RELEASE_SEC = 0.060; 886 + const sineXfade = Math.floor(0.030 * SAMPLE_RATE); // 30ms 887 + for (const s of slices) { 888 + const startIdx = Math.floor(s.snappedStart * SAMPLE_RATE); 889 + const len = s.samples.length; 890 + 891 + const wps = (s.waypoints && s.waypoints.length >= 2) 892 + ? s.waypoints.map((w) => ({ pos: Math.min(len, w.sample), midi: w.midi })) 893 + : [{ pos: 0, midi: s.targetMidi }, { pos: len, midi: s.targetMidi }]; 894 + 895 + // Sine reference — phase-continuous, with linear pitch crossfade 896 + // in a 30ms window centered on each waypoint boundary. The pitch 897 + // glides smoothly between adjacent held notes so there are no 898 + // glitches/clicks. Outer attack/release envelope applies only at 899 + // the very start/end of the whole word. 900 + let phase = 0; 901 + let segIdx = 0; 902 + const wordAtt = Math.floor(ATTACK_SEC * SAMPLE_RATE); 903 + const wordRel = Math.floor(RELEASE_SEC * SAMPLE_RATE); 904 + const wordSustainEnd = len - wordRel; 905 + for (let i = 0; i < len; i++) { 906 + const dst = startIdx + i; 907 + if (dst < 0 || dst >= sineBuf.length) continue; 908 + while (segIdx + 1 < wps.length && i >= wps[segIdx + 1].pos) segIdx++; 909 + const cur = wps[segIdx]; 910 + // Determine pitch with optional crossfade near the next waypoint. 
911 + let midiNow = cur.midi; 912 + const nxt = wps[segIdx + 1]; 913 + if (nxt) { 914 + const distToNext = nxt.pos - i; 915 + if (distToNext < sineXfade) { 916 + const t = 1 - distToNext / sineXfade; 917 + midiNow = cur.midi * (1 - t) + nxt.midi * t; 918 + } 919 + } 920 + const freq = 440 * Math.pow(2, (midiNow - 69) / 12); 921 + phase += twoPiOverSr * freq; 922 + 923 + let env; 924 + if (i < wordAtt) env = i / wordAtt; 925 + else if (i >= wordSustainEnd) env = Math.max(0, (len - i) / wordRel); 926 + else env = 1; 927 + sineBuf[dst] += Math.sin(phase) * env * sineGain; 928 + } 929 + 930 + // Ticks at each waypoint position so the listener can HEAR the 931 + // pitch grid changing. Loud and unambiguous — 1 kHz tone, 50 ms, 932 + // sharp exponential decay. Amplitude clipped to 0.95 so it always 933 + // pokes through the mix. 934 + const tickLen = Math.floor(0.050 * SAMPLE_RATE); 935 + for (const wp of wps) { 936 + const tickStart = startIdx + wp.pos; 937 + for (let j = 0; j < tickLen; j++) { 938 + const dst = tickStart + j; 939 + if (dst < 0 || dst >= sineBuf.length) continue; 940 + const env = Math.exp(-j / (tickLen * 0.20)); 941 + const tone = Math.sin(2 * Math.PI * 1000 * j / SAMPLE_RATE); 942 + sineBuf[dst] += tone * env * 0.95; 943 + } 944 + } 945 + } 946 + 947 + // Diagnostic: write the sine + tick layer alone so it can be 948 + // auditioned without the vocal masking it. 949 + const refOut = OUT_PATH.replace(/\.mp3$/, "-ref.mp3"); 950 + const refRaw = `${tmpDir}/ref.f32.raw`; 951 + let refPeak = 0; 952 + for (let i = 0; i < sineBuf.length; i++) { 953 + const a = Math.abs(sineBuf[i]); 954 + if (a > refPeak) refPeak = a; 955 + } 956 + const refScale = refPeak > 0 ? 
0.9 / refPeak : 1; 957 + const refBuf = Buffer.alloc(sineBuf.length * 4); 958 + for (let i = 0; i < sineBuf.length; i++) refBuf.writeFloatLE(sineBuf[i] * refScale, i * 4); 959 + writeFileSync(refRaw, refBuf); 960 + spawnSync( 961 + "ffmpeg", 962 + ["-hide_banner", "-y", "-loglevel", "error", 963 + "-f", "f32le", "-ar", String(SAMPLE_RATE), "-ac", "1", 964 + "-i", refRaw, 965 + "-c:a", "libmp3lame", "-q:a", "3", refOut], 966 + { stdio: "inherit" } 967 + ); 968 + console.log(` sine+tick reference: ${refOut}`); 969 + // Mix sineBuf into out (additively). 970 + for (let i = 0; i < out.length && i < sineBuf.length; i++) out[i] += sineBuf[i]; 971 + // Re-normalize because we just added energy. 972 + let p2 = 0; 973 + for (let i = 0; i < out.length; i++) { 974 + const a = Math.abs(out[i]); 975 + if (a > p2) p2 = a; 976 + } 977 + if (p2 > 0) { 978 + const norm = 0.85 / p2; 979 + for (let i = 0; i < out.length; i++) out[i] *= norm; 980 + } 981 + console.log(` sine overlay: enabled at gain ${sineGain.toFixed(2)}`); 982 + } 983 + 984 + // Emit an events file alongside the mp3 so pitchcheck.mjs and other 985 + // downstream tools can compare measured pitch vs intended target 986 + // without re-aligning the rendered output (whisper degrades on 987 + // heavily-shifted audio). 
988 + const eventsPath = OUT_PATH.replace(/\.mp3$/, ".events.json"); 989 + const eventsObj = { 990 + source: vocalPath, 991 + score: scorePath, 992 + section: SECTION, 993 + bpm: BPM, 994 + grid: GRID, 995 + refNote: REF_NOTE, 996 + refMidi, 997 + stretch: STRETCH, 998 + curve: CURVE, 999 + totalDur: maxEndSec, 1000 + events: slices.map((s, i) => ({ 1001 + i, 1002 + text: s.text, 1003 + naturalStart: s.naturalStart, 1004 + snappedStart: s.snappedStart, 1005 + durSec: s.samples.length / SAMPLE_RATE, 1006 + targetNote: s.noteStr, 1007 + targetMidi: s.targetMidi, 1008 + targetFreq: 440 * Math.pow(2, (s.targetMidi - 69) / 12), 1009 + sourceMidi: s.sourceMidi, 1010 + semitones: s.semitones, 1011 + })), 1012 + }; 1013 + writeFileSync(eventsPath, JSON.stringify(eventsObj, null, 2)); 1014 + 1015 + rmSync(tmpDir, { recursive: true, force: true }); 1016 + console.log(`\n✓ ${OUT_PATH} · ${maxEndSec.toFixed(2)}s`); 1017 + console.log(` events: ${eventsPath}`);
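The assembly stage of pitchsnap.mjs applies a short raised-cosine fade to each word slice and then scales the whole buffer to a fixed peak. The sketch below restates those two steps in Python so they can be checked in isolation. The function names (`fade_envelope`, `peak_normalize`, `to_dbfs`) are hypothetical and do not appear in the repository; only the formulas are taken from the JavaScript above.

```python
import math


def fade_envelope(i, length, fade_in, fade_out):
    """Raised-cosine gain for sample i of a slice that is `length` samples long.

    Mirrors the per-word envelope in the JavaScript loop: a half-cosine ramp
    up over the first `fade_in` samples, a half-cosine ramp down over the
    last `fade_out` samples, and unity gain in between.
    """
    if fade_in > 0 and i < fade_in:
        return 0.5 - 0.5 * math.cos(math.pi * i / fade_in)
    if fade_out > 0 and i >= length - fade_out:
        return 0.5 - 0.5 * math.cos(math.pi * (length - i) / fade_out)
    return 1.0


def peak_normalize(buf, target_peak=0.85):
    """Scale a list of samples so the largest absolute value equals target_peak."""
    peak = max((abs(v) for v in buf), default=0.0)
    if peak == 0.0:
        return list(buf)  # silent input: nothing to scale
    scale = target_peak / peak
    return [v * scale for v in buf]


def to_dbfs(linear_gain):
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20.0 * math.log10(linear_gain)
```

For reference, `to_dbfs(0.85)` is roughly -1.4 dB and `to_dbfs(0.25)` is roughly -12 dB, which is a quick way to sanity-check any gain constant against the level quoted next to it.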
+276
pop/bin/pitchsnap_world.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + pitchsnap_world.py — WORLD-vocoder-based pitch correction helper for 4 + pitchsnap.mjs. Replaces the f0 curve of a vocal slice with a target 5 + melody (one or more notes laid across the duration) while preserving 6 + the spectral envelope and aperiodicity — i.e. jeffrey's voice 7 + character is unchanged but the pitch lands on the score. 8 + 9 + Pipeline (Saitou 2007 / Morise et al. 2016): 10 + audio 11 + → harvest (f0 candidates) 12 + → stonemask (f0 refinement) 13 + → cheaptrick (spectral envelope = formants = voice identity) 14 + → d4c (aperiodicity = breath / consonants) 15 + → REPLACE f0 with target curve 16 + → synthesize 17 + 18 + Multi-syllable: --notes "C3,D3,Eb3,F3,G3" lays N evenly-spaced 19 + target pitches across the audio. Each segment holds steady; transitions 20 + crossfade in log-pitch space (one frame). 21 + 22 + Usage: 23 + pitchsnap_world.py <in.wav> <out.wav> \ 24 + --notes "C3,Eb3,G3" [--retain 1.0] [--vibrato-hz 5.5] \ 25 + [--vibrato-cents 0] [--xfade-ms 30] 26 + """ 27 + import argparse 28 + import sys 29 + import numpy as np 30 + import soundfile as sf 31 + import pyworld as pw 32 + 33 + try: 34 + import librosa 35 + HAS_LIBROSA = True 36 + except ImportError: 37 + HAS_LIBROSA = False 38 + 39 + NOTE_SEMI = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11} 40 + 41 + def note_to_midi(s: str) -> int: 42 + s = s.strip().lower() 43 + semi = NOTE_SEMI[s[0]] 44 + i = 1 45 + if i < len(s) and s[i] in "#b": 46 + semi += 1 if s[i] == "#" else -1 47 + i += 1 48 + octave = int(s[i:]) 49 + return 12 * (octave + 1) + semi 50 + 51 + def midi_to_hz(m: float) -> float: 52 + return 440.0 * (2.0 ** ((m - 69.0) / 12.0)) 53 + 54 + def main() -> int: 55 + p = argparse.ArgumentParser() 56 + p.add_argument("in_wav") 57 + p.add_argument("out_wav") 58 + p.add_argument("--notes", required=True, 59 + help='Comma-separated notes laid evenly across the audio, e.g. 
"C3,Eb3,G3"') 60 + p.add_argument("--weights", default=None, 61 + help='Comma-separated relative duration weights per note (default 1 each)') 62 + p.add_argument("--note-starts", default=None, 63 + help='Comma-separated absolute start times (seconds) per note. ' 64 + 'When provided, target curve is anchored at these times ' 65 + '(useful for whole-audio continuous resynth).') 66 + p.add_argument("--retain", type=float, default=1.0, 67 + help="0 = source f0 unchanged, 1 = full clamp to target (default)") 68 + p.add_argument("--vibrato-hz", type=float, default=0.0, 69 + help="Vibrato frequency in Hz (0 = off)") 70 + p.add_argument("--vibrato-cents", type=float, default=0.0, 71 + help="Vibrato depth in cents (peak-to-peak / 2)") 72 + p.add_argument("--vibrato-onset-ms", type=float, default=200.0, 73 + help="Delay before vibrato fades in (ms)") 74 + p.add_argument("--xfade-ms", type=float, default=80.0, 75 + help="Crossfade between adjacent target notes (ms). Larger " 76 + "values smooth pitch transitions but blur the attack.") 77 + p.add_argument("--voicing-ramp-ms", type=float, default=40.0, 78 + help="Ramp f0 in/out over this window at voiced/unvoiced " 79 + "boundaries (word starts/ends). Smooths the WORLD " 80 + "synth pop where pitch goes 0→target abruptly.") 81 + p.add_argument("--detect-boundaries", action="store_true", 82 + help="Use librosa onset detection to find natural syllable " 83 + "boundaries in the audio, rather than weight-proportional " 84 + "splits. 
Only applies when there are >= 2 notes.") 85 + args = p.parse_args() 86 + 87 + notes = [n.strip() for n in args.notes.split(",") if n.strip()] 88 + if not notes: 89 + print("✗ --notes required", file=sys.stderr) 90 + return 1 91 + target_midis = np.array([note_to_midi(n) for n in notes], dtype=np.float64) 92 + target_hzs = np.array([midi_to_hz(m) for m in target_midis]) 93 + 94 + if args.weights: 95 + weights = np.array([float(w) for w in args.weights.split(",")], dtype=np.float64) 96 + if len(weights) != len(notes): 97 + print(f"✗ --weights length {len(weights)} != notes length {len(notes)}", file=sys.stderr) 98 + return 1 99 + else: 100 + weights = np.ones(len(notes), dtype=np.float64) 101 + weights = weights / weights.sum() 102 + cum_w = np.concatenate([[0.0], np.cumsum(weights)]) 103 + 104 + x, fs = sf.read(args.in_wav, dtype="float64") 105 + if x.ndim > 1: 106 + x = x.mean(axis=1) 107 + 108 + # WORLD decompose. f0_floor=90 (jeffrey-pvc never goes below 90 Hz) 109 + # tightens cheaptrick's analysis window, which kills the held-note 110 + # "ring/echo" artifact (low-floor → wide window → formant smearing). 111 + f0_floor = 90.0 112 + f0_raw, t = pw.harvest(x, fs, f0_floor=f0_floor, f0_ceil=600.0, frame_period=5.0) 113 + f0 = pw.stonemask(x, f0_raw, t, fs) 114 + fft_size = pw.get_cheaptrick_fft_size(fs, f0_floor=f0_floor) 115 + sp = pw.cheaptrick(x, f0, t, fs, fft_size=fft_size, f0_floor=f0_floor) 116 + ap = pw.d4c(x, f0, t, fs, fft_size=fft_size) 117 + 118 + # Build per-frame target f0 curve. N notes laid across the time axis; 119 + # each note holds its pitch; transitions crossfade in log space over 120 + # `--xfade-ms` ms. 121 + n_frames = len(t) 122 + frame_period_ms = (t[1] - t[0]) * 1000.0 if n_frames > 1 else 5.0 123 + xfade_frames = max(1, int(args.xfade_ms / frame_period_ms)) 124 + 125 + # Weighted segment boundaries in frame indices. 
126 + seg_starts = (cum_w * n_frames).astype(np.int64) 127 + 128 + # Absolute start-time override — pin each note to a specific time 129 + # in the audio (seconds). Useful for whole-audio resynth where note 130 + # boundaries should align to whisper word timestamps, not be 131 + # distributed evenly across the file. 132 + if args.note_starts: 133 + starts_sec = [float(s) for s in args.note_starts.split(",") if s.strip()] 134 + if len(starts_sec) != len(notes): 135 + print(f"✗ --note-starts length {len(starts_sec)} != notes length {len(notes)}", file=sys.stderr) 136 + return 1 137 + seg_starts = np.array( 138 + [int(round(s / (frame_period_ms / 1000.0))) for s in starts_sec] + [n_frames], 139 + dtype=np.int64, 140 + ) 141 + seg_starts = np.clip(seg_starts, 0, n_frames) 142 + 143 + # Optionally replace with librosa-detected syllable boundaries — 144 + # finds onset frames in the audio and snaps the segment starts to 145 + # the nearest detected onset. Falls back to weighted splits if 146 + # detection finds the wrong number of onsets. 147 + if args.detect_boundaries and HAS_LIBROSA and len(notes) >= 2: 148 + # librosa needs float32 mono; use the WORLD-analyzed signal. 
149 + try: 150 + onset_env = librosa.onset.onset_strength(y=x.astype(np.float32), sr=fs, hop_length=256) 151 + onset_frames = librosa.onset.onset_detect( 152 + onset_envelope=onset_env, sr=fs, hop_length=256, 153 + backtrack=True, 154 + ) 155 + # Convert from librosa hop frames to WORLD frames 156 + librosa_to_world = (256 / fs) / (frame_period_ms / 1000.0) 157 + world_onset_frames = (onset_frames * librosa_to_world).astype(np.int64) 158 + # Always anchor first segment at 0; need (len(notes)-1) onset hits 159 + need = len(notes) - 1 160 + if len(world_onset_frames) >= need: 161 + # Pick the `need` largest-strength onsets 162 + strengths = onset_env[onset_frames] 163 + top_idx = np.argsort(strengths)[-need:] 164 + top_onsets = np.sort(world_onset_frames[top_idx]) 165 + detected = np.concatenate([[0], top_onsets, [n_frames]]) 166 + seg_starts = detected.astype(np.int64) 167 + print(f" librosa onsets: {len(onset_frames)} found, used {need} → {top_onsets.tolist()}") 168 + else: 169 + print(f" librosa onsets: only {len(onset_frames)} found, need {need} — using weighted splits") 170 + except Exception as e: 171 + print(f" librosa onset detection failed: {e} — using weighted splits") 172 + 173 + target_log = np.zeros(n_frames) 174 + for i in range(n_frames): 175 + # Which segment is this frame in? 176 + seg = int(np.searchsorted(seg_starts[1:], i, side="right")) 177 + seg = min(seg, len(notes) - 1) 178 + center_log = np.log(target_hzs[seg]) 179 + # Crossfade with the next segment if we're near its boundary. 180 + if seg + 1 < len(notes): 181 + next_start = seg_starts[seg + 1] 182 + dist_to_next = next_start - i 183 + if dist_to_next < xfade_frames: 184 + tval = 1.0 - (dist_to_next / xfade_frames) 185 + center_log = (1 - tval) * center_log + tval * np.log(target_hzs[seg + 1]) 186 + target_log[i] = center_log 187 + 188 + target_curve = np.exp(target_log) 189 + 190 + # Vibrato — sine LFO on top of target, fading in after onset. 
191 + if args.vibrato_hz > 0 and args.vibrato_cents > 0: 192 + time_sec = np.arange(n_frames) * (frame_period_ms / 1000.0) 193 + onset_sec = args.vibrato_onset_ms / 1000.0 194 + fade = np.clip((time_sec - onset_sec) / max(0.05, onset_sec), 0.0, 1.0) 195 + depth_ratio = (args.vibrato_cents / 100.0) / 12.0 # cents → semitones → log2 196 + lfo = np.sin(2 * np.pi * args.vibrato_hz * time_sec) * depth_ratio * fade 197 + target_curve = target_curve * (2.0 ** lfo) 198 + 199 + # Build the f0 curve passed to WORLD synth. Two key moves vs naive: 200 + # 201 + # 1. INTERPOLATE through unvoiced gaps before synthesis. WORLD pops 202 + # on f0=0→target jumps; feeding it a continuous curve eliminates 203 + # those phase-resets. Voiced/unvoiced structure gets re-imposed 204 + # in the time domain after synthesis (step 2). 205 + # 2. MUTE unvoiced regions post-synth with 5ms crossfade ramps — 206 + # keeps consonants/sibilants from being colored by tone. 207 + voiced = f0 > 0 208 + 209 + # Build the per-frame target the synth will see (interpolated curve). 210 + if args.retain >= 0.999: 211 + # Hard clamp to target on voiced frames; interpolate target 212 + # through unvoiced gaps so the synth has continuous pitch. 213 + f0_synth = target_curve.copy() 214 + else: 215 + log_src = np.log(np.maximum(f0, 1e-6)) 216 + log_tgt = np.log(target_curve) 217 + log_blend = (1.0 - args.retain) * log_src + args.retain * log_tgt 218 + # Interpolate the source side across unvoiced frames so the blend 219 + # has continuous data; otherwise unvoiced frames blend toward 0. 
220 + if voiced.sum() >= 2: 221 + voiced_idx = np.where(voiced)[0] 222 + log_src_interp = np.interp(np.arange(len(f0)), voiced_idx, log_src[voiced_idx]) 223 + log_blend_smooth = (1.0 - args.retain) * log_src_interp + args.retain * log_tgt 224 + f0_synth = np.exp(log_blend_smooth) 225 + else: 226 + f0_synth = np.exp(log_blend) 227 + 228 + f0_new = f0_synth # name kept for downstream code consistency 229 + 230 + y = pw.synthesize(f0_new, sp, ap, fs, frame_period=5.0) 231 + 232 + # Re-impose voiced/unvoiced structure as a time-domain amplitude 233 + # mask. Unvoiced regions get muted with 5ms crossfades at every 234 + # transition so consonants ("s", "k", "th") aren't colored by 235 + # whatever tone WORLD synthesized through the gap. This is the 236 + # "WORLD for pitch, original passes through for noise" trick. 237 + samples_per_frame = int(round(fs * 5.0 / 1000.0)) 238 + voiced_audio_mask = np.repeat(voiced.astype(np.float64), samples_per_frame) 239 + if len(voiced_audio_mask) < len(y): 240 + voiced_audio_mask = np.pad(voiced_audio_mask, (0, len(y) - len(voiced_audio_mask)), 241 + mode="edge") 242 + voiced_audio_mask = voiced_audio_mask[:len(y)] 243 + 244 + # Smooth the 0/1 mask with 5ms cosine crossfades at edges. 245 + ramp = int(0.005 * fs) 246 + if ramp > 1: 247 + edges = np.diff(voiced_audio_mask.astype(np.int8)) 248 + for idx in np.where(edges == 1)[0]: 249 + for k in range(ramp): 250 + pos = idx + 1 + k 251 + if pos < len(voiced_audio_mask): 252 + voiced_audio_mask[pos] *= 0.5 - 0.5 * np.cos(np.pi * (k + 1) / ramp) 253 + for idx in np.where(edges == -1)[0]: 254 + for k in range(ramp): 255 + pos = idx - k 256 + if pos >= 0: 257 + voiced_audio_mask[pos] *= 0.5 - 0.5 * np.cos(np.pi * (k + 1) / ramp) 258 + 259 + # Composite: WORLD audio in voiced regions, original audio in 260 + # unvoiced regions. Preserves natural sibilants and stops while 261 + # keeping the pitch-corrected vowels. mask is 1 in voiced, 0 in 262 + # unvoiced (with 5ms ramps at boundaries). 
263 + n = min(len(y), len(x), len(voiced_audio_mask)) 264 + y_composite = np.zeros(n) 265 + y_composite = voiced_audio_mask[:n] * y[:n] + (1.0 - voiced_audio_mask[:n]) * x[:n] 266 + 267 + sf.write(args.out_wav, y_composite.astype(np.float32), fs) 268 + 269 + # Print a one-line summary so pitchsnap.mjs can show it. 270 + voiced_pct = 100.0 * np.mean(voiced) 271 + print(f" world · {len(notes)} notes · {n_frames} frames · {voiced_pct:.0f}% voiced " 272 + f"· retain={args.retain} · vibrato={args.vibrato_hz}Hz/{args.vibrato_cents}¢") 273 + return 0 274 + 275 + if __name__ == "__main__": 276 + sys.exit(main())
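The core of pitchsnap_world.py is the per-frame target curve: notes are laid across the frames by weight, each note holds, and the last few frames before a boundary glide toward the next note in log-frequency space. The following is a minimal stdlib-only sketch of that step, assuming weight-proportional boundaries only (no `--note-starts`, no onset detection). `note_to_hz` and `build_target_curve` are illustrative names, and `bisect` stands in for the `np.searchsorted` call used in the script.

```python
import bisect
import math

NOTE_SEMI = {"c": 0, "d": 2, "e": 4, "f": 5, "g": 7, "a": 9, "b": 11}


def note_to_hz(name):
    """Convert a note name such as 'Eb3' to Hz (A4 = MIDI 69 = 440 Hz)."""
    s = name.strip().lower()
    semi = NOTE_SEMI[s[0]]
    rest = s[1:]
    if rest[:1] in ("#", "b"):
        semi += 1 if rest[0] == "#" else -1
        rest = rest[1:]
    midi = 12 * (int(rest) + 1) + semi
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def build_target_curve(notes, weights, n_frames, xfade_frames):
    """Return one target f0 value (Hz) per analysis frame."""
    hz = [note_to_hz(n) for n in notes]
    total = float(sum(weights))
    cum, acc = [0.0], 0.0
    for w in weights:
        acc += w / total
        cum.append(acc)
    # Segment boundaries as frame indices, truncated like astype(int64).
    seg_starts = [int(c * n_frames) for c in cum]

    curve = []
    for i in range(n_frames):
        seg = min(bisect.bisect_right(seg_starts[1:], i), len(notes) - 1)
        log_f0 = math.log(hz[seg])
        if seg + 1 < len(notes):
            dist = seg_starts[seg + 1] - i
            if dist < xfade_frames:
                # Linear blend in log-Hz, i.e. a geometric glide in Hz.
                t = 1.0 - dist / xfade_frames
                log_f0 = (1 - t) * log_f0 + t * math.log(hz[seg + 1])
        curve.append(math.exp(log_f0))
    return curve
```

Blending in log space means the halfway point of a glide between two notes is their geometric mean, which is the pitch that sounds midway to the ear. With 5 ms frames, an 80 ms crossfade corresponds to `xfade_frames=16`.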
+199
pop/bin/pitchwords.mjs
··· 1 + #!/usr/bin/env node 2 + // pitchwords.mjs — first stage of vocal-post: pitch each word in a 3 + // vocal stem to its target note from the .np score. 4 + // 5 + // Slice plan: for each whisper word at [fromMs, toMs], grab the audio 6 + // from fromMs to the NEXT word's fromMs (or end of file for the last 7 + // word). This keeps the inter-word silence with each word so the 8 + // concatenated output preserves rhythm. 9 + // 10 + // Pitch plan: walk whisper words in order; map each word to a syllable 11 + // in the .np section at the proportional index (whisper words are 12 + // usually fewer than .np syllables — multi-syllable words pick one 13 + // syllable's note). Pitch-shift the slice by (target − ref) semitones 14 + // using the `rubberband` CLI (formant-preserving). 15 + // 16 + // Aggressive by default — full target-note pitch shift, no smoothing. 17 + // Future: gentle / firm / off knobs per the post-prod memory. 18 + // 19 + // Usage: 20 + // node bin/pitchwords.mjs --vocal big-pictures/out/plork-hook-vocal.mp3 \ 21 + // --score big-pictures/plork.np --section hook \ 22 + // --ref-note C3 \ 23 + // --out big-pictures/out/plork-hook-pitched.mp3 24 + 25 + import { spawnSync } from "node:child_process"; 26 + import { existsSync, mkdirSync, readFileSync, writeFileSync, rmSync } from "node:fs"; 27 + import { resolve, dirname, basename } from "node:path"; 28 + 29 + // ── arg parse ───────────────────────────────────────────────────────── 30 + function parseArgs(argv) { 31 + const flags = {}; 32 + for (let i = 0; i < argv.length; i++) { 33 + const a = argv[i]; 34 + if (!a.startsWith("--")) continue; 35 + const k = a.slice(2); 36 + const next = argv[i + 1]; 37 + if (next !== undefined && !next.startsWith("--")) { flags[k] = next; i++; } 38 + else flags[k] = true; 39 + } 40 + return flags; 41 + } 42 + 43 + const flags = parseArgs(process.argv.slice(2)); 44 + 45 + const vocalPath = resolve(process.cwd(), flags.vocal || ""); 46 + if (!existsSync(vocalPath)) { 
47 + console.error("usage: --vocal <stem.mp3> --score <path.np> [--section hook] [--ref-note C3] [--out path.mp3]"); 48 + process.exit(1); 49 + } 50 + const wordsPath = resolve(process.cwd(), flags.words || vocalPath.replace(/\.mp3$/, "-words.json")); 51 + if (!existsSync(wordsPath)) { 52 + console.error(`✗ words.json not found at ${wordsPath}. run bin/align.mjs first.`); 53 + process.exit(1); 54 + } 55 + const scorePath = resolve(process.cwd(), flags.score || ""); 56 + if (!existsSync(scorePath)) { 57 + console.error(`✗ --score file required (path to .np)`); 58 + process.exit(1); 59 + } 60 + const SECTION = (flags.section || "hook").toLowerCase(); 61 + const REF_NOTE = flags["ref-note"] || "C3"; 62 + const OUT_PATH = flags.out 63 + ? resolve(process.cwd(), flags.out) 64 + : vocalPath.replace(/\.mp3$/, "-pitched.mp3"); 65 + 66 + // ── helpers ─────────────────────────────────────────────────────────── 67 + const NOTE_TO_SEMI = { c: 0, d: 2, e: 4, f: 5, g: 7, a: 9, b: 11 }; 68 + function noteToMidi(p) { 69 + const m = p.trim().toLowerCase().match(/^([a-g])([#b]?)(-?\d+)$/); 70 + if (!m) throw new Error(`bad note: ${p}`); 71 + let semi = NOTE_TO_SEMI[m[1]]; 72 + if (m[2] === "#") semi += 1; 73 + if (m[2] === "b") semi -= 1; 74 + const oct = parseInt(m[3], 10); 75 + return 12 * (oct + 1) + semi; 76 + } 77 + 78 + // Same parser shape as recap/bin/vocal.mjs's parseNp — flatten to a 79 + // list of {pitch, syl} per section. 
80 + function parseNp(text) { 81 + const sections = {}; 82 + let current = null; 83 + for (const raw of text.split("\n")) { 84 + const line = raw.trim(); 85 + if (!line || line.startsWith("#")) continue; 86 + if (!line.includes(":") && /^[a-z][a-z0-9 ]*$/.test(line)) { 87 + current = line.toLowerCase(); 88 + if (!sections[current]) sections[current] = []; 89 + continue; 90 + } 91 + if (!current) { current = "default"; sections[current] = []; } 92 + const tokens = line.split(/\s+/).filter(Boolean); 93 + for (const tok of tokens) { 94 + const m = tok.match(/^([A-Ga-g][#b]?):(.+)$/); 95 + if (!m) continue; 96 + const note = m[1].charAt(0).toUpperCase() + m[1].slice(1); 97 + sections[current].push({ pitch: note, syl: m[2] }); 98 + } 99 + } 100 + return sections; 101 + } 102 + 103 + function probeDuration(p) { 104 + const r = spawnSync( 105 + "ffprobe", 106 + ["-v", "error", "-show_entries", "format=duration", 107 + "-of", "default=noprint_wrappers=1:nokey=1", p], 108 + { encoding: "utf8" } 109 + ); 110 + return Number(r.stdout.trim()); 111 + } 112 + 113 + // ── main ────────────────────────────────────────────────────────────── 114 + const words = JSON.parse(readFileSync(wordsPath, "utf8")); 115 + const score = parseNp(readFileSync(scorePath, "utf8")); 116 + const syllables = score[SECTION]; 117 + if (!syllables || !syllables.length) { 118 + console.error(`✗ section '${SECTION}' empty in ${scorePath}`); 119 + process.exit(1); 120 + } 121 + 122 + const refMidi = noteToMidi(REF_NOTE); 123 + const totalDur = probeDuration(vocalPath); 124 + 125 + const tmpDir = `${dirname(OUT_PATH)}/.${basename(OUT_PATH).replace(/\..*$/, "")}-pw-tmp`; 126 + rmSync(tmpDir, { recursive: true, force: true }); 127 + mkdirSync(tmpDir, { recursive: true }); 128 + 129 + console.log(`→ pitchwords · ${words.length} whisper words → ${syllables.length} score syllables · ref=${REF_NOTE}`); 130 + 131 + const slicePaths = []; 132 + for (let i = 0; i < words.length; i++) { 133 + const w = words[i]; 134 + 
const startSec = w.fromMs / 1000; 135 + const endSec = i < words.length - 1 ? words[i + 1].fromMs / 1000 : totalDur; 136 + 137 + // Proportional mapping word i → syllable 138 + const sylIdx = Math.min(syllables.length - 1, 139 + Math.floor(i * syllables.length / words.length)); 140 + const syl = syllables[sylIdx]; 141 + const noteStr = /\d/.test(syl.pitch) ? syl.pitch : syl.pitch + "3"; 142 + const targetMidi = noteToMidi(noteStr); 143 + const semitones = targetMidi - refMidi; 144 + 145 + const sliceWav = `${tmpDir}/w${i.toString().padStart(3, "0")}.wav`; 146 + const cut = spawnSync( 147 + "ffmpeg", 148 + ["-hide_banner", "-y", "-loglevel", "error", 149 + "-ss", startSec.toFixed(4), "-to", endSec.toFixed(4), 150 + "-i", vocalPath, 151 + "-c:a", "pcm_s16le", "-ar", "48000", "-ac", "1", sliceWav], 152 + { stdio: ["ignore", "ignore", "inherit"] } 153 + ); 154 + if (cut.status !== 0) { 155 + console.error(`✗ ffmpeg slice failed at word ${i}`); 156 + process.exit(1); 157 + } 158 + 159 + let pieceWav = sliceWav; 160 + if (Math.abs(semitones) >= 0.01) { 161 + pieceWav = `${tmpDir}/w${i.toString().padStart(3, "0")}-p.wav`; 162 + const rb = spawnSync( 163 + "rubberband", 164 + ["-p", String(semitones), sliceWav, pieceWav], 165 + { stdio: ["ignore", "ignore", "ignore"] } 166 + ); 167 + if (rb.status !== 0) { 168 + console.error(`✗ rubberband failed at word ${i} (Δ${semitones} st)`); 169 + process.exit(1); 170 + } 171 + } 172 + slicePaths.push(pieceWav); 173 + 174 + const arrow = semitones >= 0 ? 
"+" : ""; 175 + console.log( 176 + ` ${i.toString().padStart(2, "0")} ${w.text.padEnd(14)} ` + 177 + `${startSec.toFixed(2)}–${endSec.toFixed(2)}s → ${syl.syl.padEnd(8)} ${noteStr} ` + 178 + `(${arrow}${semitones} st)` 179 + ); 180 + } 181 + 182 + // ── concat ───────────────────────────────────────────────────────────── 183 + const concatList = `${tmpDir}/concat.txt`; 184 + writeFileSync(concatList, slicePaths.map(p => `file '${p}'`).join("\n") + "\n"); 185 + 186 + const concat = spawnSync( 187 + "ffmpeg", 188 + ["-hide_banner", "-y", "-loglevel", "error", 189 + "-f", "concat", "-safe", "0", "-i", concatList, 190 + "-c:a", "libmp3lame", "-q:a", "3", OUT_PATH], 191 + { stdio: "inherit" } 192 + ); 193 + if (concat.status !== 0) { 194 + console.error("✗ ffmpeg concat failed"); 195 + process.exit(1); 196 + } 197 + 198 + rmSync(tmpDir, { recursive: true, force: true }); 199 + console.log(`✓ ${OUT_PATH}`);
+92
pop/bin/refine_onsets.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + refine_onsets.py — VALIDATOR (not modifier). 4 + 5 + The score (.np file → events.json with `snappedStart` beat positions) 6 + is the authoritative source of timing in this pipeline. The visual 7 + storyboard rigidly follows the score; any drift between score and 8 + audio is a *pitchsnap* problem to fix at the audio generation layer, 9 + not papered over visually. 10 + 11 + This script measures the gap between score-defined slide.start times 12 + and detected audio onsets, prints a report of mismatches, and exits. 13 + It does NOT modify the storyboard. Use the report to identify which 14 + words pitchsnap rendered late/early — that's where the audio pipeline 15 + needs slot-rigid trimming/padding. 16 + 17 + Usage: 18 + refine_onsets.py <storyboard.json> <audio.mp3> 19 + """ 20 + import json 21 + import sys 22 + from pathlib import Path 23 + 24 + import numpy as np 25 + import librosa 26 + 27 + SEARCH_WIN_S = 0.85 # how far (s) either side of slide.start to look for an onset 28 + MIN_FIX_MS = 25 # drift below this (ms) is reported without a warning flag 29 + # (avoids jitter from low-confidence detections) 30 + 31 + 32 + def main(): 33 + if len(sys.argv) < 3: 34 + print("usage: refine_onsets.py <storyboard.json> <audio>") 35 + sys.exit(1) 36 + storyboard_path = Path(sys.argv[1]) 37 + audio_path = sys.argv[2] 38 + 39 + sb = json.loads(storyboard_path.read_text()) 40 + slides = sb["slides"] 41 + audio_dur = float(sb["duration"]) 42 + 43 + print(f"loading audio: {audio_path}") 44 + y, sr = librosa.load(audio_path, sr=22050) 45 + print(f" {len(y)/sr:.2f}s @ {sr}Hz") 46 + 47 + print("detecting onsets…") 48 + onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512) 49 + onsets = librosa.onset.onset_detect( 50 + onset_envelope=onset_env, sr=sr, hop_length=512, 51 + backtrack=True, units="time", 52 + ) 53 + print(f" {len(onsets)} onsets detected") 54 + 55 + # For each slide, find nearest onset within search window 56 + print() 57 +
print(f" {'i':>3} {'text':<10} {'score':>8} {'audio':>8} {'drift':>7}") 58 + drift_total = 0.0 59 + n_off = 0 60 + n_bad = 0 61 + for i, s in enumerate(slides): 62 + score_start = float(s["start"]) 63 + win_lo = score_start - SEARCH_WIN_S 64 + win_hi = score_start + SEARCH_WIN_S 65 + nearby = onsets[(onsets >= win_lo) & (onsets <= win_hi)] 66 + if len(nearby) == 0: 67 + line = f" {i:>3} {s['text'][:10]:<10} {score_start:>8.3f} {'—':>8} {'no onset':>7}" 68 + else: 69 + audio_start = float(nearby[np.argmin(np.abs(nearby - score_start))]) 70 + drift_ms = (audio_start - score_start) * 1000 71 + tag = "" 72 + if abs(drift_ms) >= 200: 73 + tag = " ✗ BAD" 74 + n_bad += 1 75 + elif abs(drift_ms) >= MIN_FIX_MS: 76 + tag = " ⚠" 77 + n_off += 1 78 + drift_total += abs(drift_ms) 79 + line = f" {i:>3} {s['text'][:10]:<10} {score_start:>8.3f} {audio_start:>8.3f} {drift_ms:>+6.0f}ms{tag}" 80 + print(line) 81 + 82 + print() 83 + print(f" {n_bad} slides with >200ms audio drift (likely pitchsnap slot overflow)") 84 + print(f" {n_off} slides with 25-200ms drift") 85 + print(f" total drift: {drift_total:.0f}ms across {len(slides)} slides") 86 + print() 87 + print(" → storyboard NOT modified — score (snappedStart) remains authoritative.") 88 + print(" → fix audio drift in pitchsnap.mjs (slot-rigid trim/pad), not here.") 89 + 90 + 91 + if __name__ == "__main__": 92 + main()
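The validator's per-slide decision comes down to two small steps: pick the detected onset closest to the score time within a search window, then bucket the signed drift by magnitude. Here is a hedged sketch of that logic with plain lists in place of NumPy arrays. The helper names (`nearest_onset`, `classify`) and the string labels are invented for illustration, while the 0.85 s window and the 25 ms / 200 ms thresholds come from the script above.

```python
def nearest_onset(onsets, score_start, window_s=0.85):
    """Return the onset time closest to score_start, or None if none is in range."""
    nearby = [t for t in onsets if abs(t - score_start) <= window_s]
    if not nearby:
        return None
    return min(nearby, key=lambda t: abs(t - score_start))


def classify(drift_ms, warn_ms=25, bad_ms=200):
    """Bucket a signed drift (audio minus score, in ms) by its magnitude."""
    if abs(drift_ms) >= bad_ms:
        return "bad"   # flagged in the report as a likely slot overflow
    if abs(drift_ms) >= warn_ms:
        return "off"   # audible but tolerable drift
    return "ok"        # inside the jitter floor of the detector


def drift_report(slide_starts, onsets):
    """Yield (score_start, audio_start, drift_ms, label) for each slide."""
    for start in slide_starts:
        hit = nearest_onset(onsets, start)
        if hit is None:
            yield (start, None, None, "no onset")
        else:
            drift_ms = (hit - start) * 1000.0
            yield (start, hit, drift_ms, classify(drift_ms))
```

Because the window is wide relative to typical syllable spacing, the nearest onset can belong to a neighbouring word. A "bad" label is therefore a prompt to listen to that slot, not proof that the word itself is late.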
+890
pop/bin/render_frames.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + render_frames.py — frame-by-frame TikTok renderer for big-pictures. 4 + 5 + Reads a storyboard.json + word images + audio, produces a PNG frame 6 + sequence ready for ffmpeg encode. Per-character glyph extraction + own 7 + compositing lets us: 8 + 9 + - paste extracted character pixels onto programmatic backgrounds 10 + (no FLUX background artifacts) 11 + - smooth-gradient between slide colors during transitions 12 + - per-character vertical bounce driven by audio amplitude 13 + - loop-closure: last slide's slide-out reveals slide 0 (perfect loop) 14 + 15 + Usage: 16 + render_frames.py --storyboard <path> --img-dir <path> --audio <path> 17 + --frames-dir <path> --fps 30 18 + """ 19 + import argparse, json, os, sys 20 + from pathlib import Path 21 + import numpy as np 22 + from PIL import Image, ImageDraw 23 + from scipy.ndimage import binary_dilation, label, find_objects 24 + 25 + try: 26 + import soundfile as sf 27 + HAS_SF = True 28 + except ImportError: 29 + HAS_SF = False 30 + 31 + # CSS named colors → RGB. We only use what the storyboard's emotional 32 + # arc references; subset of the W3C CSS4 named colors. 
33 + CSS_COLORS = { 34 + "peachpuff": (255, 218, 185), 35 + "moccasin": (255, 228, 181), 36 + "wheat": (245, 222, 179), 37 + "khaki": (240, 230, 140), 38 + "palegoldenrod": (238, 232, 170), 39 + "lightyellow": (255, 255, 224), 40 + "lemonchiffon": (255, 250, 205), 41 + "burlywood": (222, 184, 135), 42 + "tan": (210, 180, 140), 43 + "rosybrown": (188, 143, 143), 44 + "thistle": (216, 191, 216), 45 + "lavender": (230, 230, 250), 46 + "mistyrose": (255, 228, 225), 47 + "skyblue": (135, 206, 235), 48 + "lightblue": (173, 216, 230), 49 + "mediumturquoise": (72, 209, 204), 50 + "mediumaquamarine": (102, 205, 170), 51 + "lightseagreen": (32, 178, 170), 52 + "palegreen": (152, 251, 152), 53 + "lightgreen": (144, 238, 144), 54 + "hotpink": (255, 105, 180), 55 + "deeppink": (255, 20, 147), 56 + "violet": (238, 130, 238), 57 + "orchid": (218, 112, 214), 58 + "salmon": (250, 128, 114), 59 + "gold": (255, 215, 0), 60 + "saddlebrown": (139, 69, 19), 61 + "darkred": (139, 0, 0), 62 + "indigo": (75, 0, 130), 63 + "darkolivegreen": (85, 107, 47), 64 + "maroon": (128, 0, 0), 65 + "darkgoldenrod": (184, 134, 11), 66 + "darkslateblue": (72, 61, 139), 67 + "darkslategray": (47, 79, 79), 68 + "ivory": (255, 255, 240), 69 + "white": (255, 255, 255), 70 + "navy": (0, 0, 128), 71 + "midnightblue": (25, 25, 112), 72 + "darkgreen": (0, 100, 0), 73 + "darkviolet": (148, 0, 211), 74 + "black": (0, 0, 0), 75 + } 76 + 77 + def color_rgb(name, default=(40, 40, 40)): 78 + return CSS_COLORS.get(name.lower(), default) 79 + 80 + # ── Glyph extraction ────────────────────────────────────────────────── 81 + # Connected-component segmentation. Foreground/background by corner 82 + # sampling, then dilate to merge within-letter pixel fragments, then 83 + # scipy.label() to find each letter's connected blob. Returns a list of 84 + # {img, w, h, x0, y0, baseline_y, top_y} sorted left-to-right. Includes 85 + # baseline + top so the renderer can align letters on a common y-line. 
86 + 87 + def extract_glyphs(word_img_path, foreground_threshold=38, dilate_iters=2, 88 + merge_dx_frac=0.20): 89 + img = Image.open(word_img_path).convert("RGBA") 90 + arr = np.array(img, dtype=np.int16) 91 + h, w, _ = arr.shape 92 + # Background sample from the 4 corners (16×16 each) 93 + s = 16 94 + corners = np.concatenate([ 95 + arr[:s, :s, :3].reshape(-1, 3), 96 + arr[:s, -s:, :3].reshape(-1, 3), 97 + arr[-s:, :s, :3].reshape(-1, 3), 98 + arr[-s:, -s:, :3].reshape(-1, 3), 99 + ]) 100 + bg = np.median(corners, axis=0) 101 + diff = np.linalg.norm(arr[:, :, :3] - bg, axis=2) 102 + fg_mask = (diff > foreground_threshold).astype(np.uint8) 103 + 104 + # Dilate so within-letter fragments merge into one component while 105 + # adjacent letters stay separate. 106 + dilated = binary_dilation(fg_mask, iterations=dilate_iters) 107 + labeled, n_components = label(dilated) 108 + objects = find_objects(labeled) 109 + 110 + raw = [] 111 + for ci, slc in enumerate(objects, 1): 112 + if slc is None: 113 + continue 114 + ys = slc[0] 115 + xs = slc[1] 116 + y0, y1 = int(ys.start), int(ys.stop) 117 + x0, x1 = int(xs.start), int(xs.stop) 118 + bw, bh = x1 - x0, y1 - y0 119 + if bw < 5 or bh < 10: 120 + continue 121 + # Cut from the ORIGINAL fg_mask (un-dilated) so glyph edges stay sharp 122 + local_mask = fg_mask[y0:y1, x0:x1] 123 + # Restrict to the labeled component so neighbor blobs in the same 124 + # bounding box don't bleed in 125 + comp_mask = (labeled[y0:y1, x0:x1] == ci).astype(np.uint8) 126 + cut = (local_mask & comp_mask).astype(np.uint8) 127 + if cut.sum() < 18: 128 + continue 129 + raw.append({"x0": x0, "x1": x1, "y0": y0, "y1": y1, 130 + "mask": cut, "rgb": arr[y0:y1, x0:x1, :3].astype(np.uint8)}) 131 + 132 + raw.sort(key=lambda g: g["x0"]) 133 + 134 + # Merge vertically-stacked components (e.g. dot of i / j) that share 135 + # an x-range. Merge threshold: x overlap > merge_dx_frac of the 136 + # smaller component's width. 
137 + merged = [] 138 + for g in raw: 139 + if merged: 140 + prev = merged[-1] 141 + ox = max(0, min(prev["x1"], g["x1"]) - max(prev["x0"], g["x0"])) 142 + min_w = max(1, min(prev["x1"] - prev["x0"], g["x1"] - g["x0"])) 143 + if ox / min_w > merge_dx_frac: 144 + # Merge: union bounding box, OR masks 145 + nx0 = min(prev["x0"], g["x0"]) 146 + nx1 = max(prev["x1"], g["x1"]) 147 + ny0 = min(prev["y0"], g["y0"]) 148 + ny1 = max(prev["y1"], g["y1"]) 149 + nmask = np.zeros((ny1 - ny0, nx1 - nx0), dtype=np.uint8) 150 + # Place prev mask 151 + py = prev["y0"] - ny0 152 + px = prev["x0"] - nx0 153 + nmask[py:py + (prev["y1"] - prev["y0"]), 154 + px:px + (prev["x1"] - prev["x0"])] |= prev["mask"] 155 + py2 = g["y0"] - ny0 156 + px2 = g["x0"] - nx0 157 + nmask[py2:py2 + (g["y1"] - g["y0"]), 158 + px2:px2 + (g["x1"] - g["x0"])] |= g["mask"] 159 + # Pull RGB from underlying image array via outer indexing 160 + nrgb = arr[ny0:ny1, nx0:nx1, :3].astype(np.uint8) 161 + merged[-1] = {"x0": nx0, "x1": nx1, "y0": ny0, "y1": ny1, 162 + "mask": nmask, "rgb": nrgb} 163 + continue 164 + merged.append(g) 165 + 166 + glyphs = [] 167 + for g in merged: 168 + bw = g["x1"] - g["x0"] 169 + bh = g["y1"] - g["y0"] 170 + rgba = np.zeros((bh, bw, 4), dtype=np.uint8) 171 + rgba[:, :, :3] = g["rgb"] 172 + rgba[:, :, 3] = g["mask"] * 255 173 + glyphs.append({ 174 + "img": Image.fromarray(rgba, "RGBA"), 175 + "w": int(bw), 176 + "h": int(bh), 177 + "x0": int(g["x0"]), 178 + "y0": int(g["y0"]), 179 + "y1": int(g["y1"]), 180 + }) 181 + return glyphs 182 + 183 + # ── Audio amplitude curve ───────────────────────────────────────────── 184 + def amplitude_curve(audio_path, total_dur, fps): 185 + if not HAS_SF: 186 + return np.zeros(int(total_dur * fps) + 1) 187 + audio, sr = sf.read(audio_path, dtype="float64") 188 + if audio.ndim > 1: 189 + audio = audio.mean(axis=1) 190 + n_frames = int(total_dur * fps) + 1 191 + samples_per_frame = sr / fps 192 + env = np.zeros(n_frames) 193 + for f in range(n_frames): 
194 + a = int(f * samples_per_frame) 195 + b = int(min(len(audio), a + samples_per_frame * 1.5)) 196 + if b > a: 197 + env[f] = np.sqrt(np.mean(audio[a:b] ** 2)) 198 + if env.max() > 0: 199 + env /= env.max() 200 + return env 201 + 202 + # ── Slide animation: where does word i sit at time t? ──────────────── 203 + # Returns the slide's x offset in pixels (a single number; no alpha is computed). 204 + # Animation phases (matching tiktok.mjs): 205 + # t < inStart → x = W (off right) 206 + # inStart..sStart → x: W → +CRAWL_AMP (slide in) 207 + # sStart..outStart → x: +CRAWL → -CRAWL (slow crawl) 208 + # outStart..sEnd → x: -CRAWL → -W (slide out) 209 + # t ≥ sEnd → x = -W 210 + 211 + CRAWL_AMP = 35 212 + TRANSITION_DUR_DEFAULT = 0.22 213 + 214 + def slide_position(t, slide, W): 215 + s_start = slide["start"] 216 + s_end = slide["end"] 217 + trans_in = slide.get("transitionMs", 220) / 1000.0 218 + in_start = max(0.0, s_start - trans_in) 219 + next_trans = slide.get("nextTransitionMs", 220) / 1000.0 220 + out_start = s_end - next_trans 221 + if t < in_start: 222 + return W 223 + if t < s_start: 224 + prog = (t - in_start) / max(0.001, trans_in) 225 + return W - (W - CRAWL_AMP) * prog 226 + if t < out_start: 227 + prog = (t - s_start) / max(0.001, out_start - s_start) 228 + return CRAWL_AMP - 2 * CRAWL_AMP * prog 229 + if t < s_end: 230 + prog = (t - out_start) / max(0.001, next_trans) 231 + return -CRAWL_AMP - (W - CRAWL_AMP) * prog 232 + return -W 233 + 234 + # ── Train camera: continuous global row of variable-width lanes ────── 235 + # Each word gets a lane sized to fit its rendered width + padding. 236 + # The camera pans from one lane center to the next over each slide's 237 + # duration — short words = fast pan, long words = slow pan. All words 238 + # share the same baseline y, all visible at once like one long line of 239 + # text scrolling past.
240 + def lane_center_world(i, lane_starts, lane_widths): 241 + return lane_starts[i] + lane_widths[i] / 2.0 242 + 243 + def cam_at_slide_start(i, lane_starts, lane_widths, screen_w): 244 + # Camera offset such that lane i is centered on screen. 245 + return lane_center_world(i, lane_starts, lane_widths) - screen_w / 2.0 246 + 247 + # Keyframe camera. Each slide owns TWO keyframes: 248 + # 1. (in_end_t, cs_i): slide centered after ramp-in 249 + # 2. (end_t, cs_i): still centered when the slide ends 250 + # Between adjacent slides, the camera traverses cs_i → cs_(i+1) 251 + # over (end_i .. in_end_(i+1)); that's the rapid transition. 252 + # 253 + # Transition duration is CAPPED so very long-held slides don't waste 254 + # motion budget on slow drifts; very short slides still get a usable 255 + # hold window between in/out ramps. 256 + HOLD_FRAC = 0.55 257 + TRANS_CAP_S = 0.55 # longer transitions, more groovy 258 + TRANS_MIN_S = 0.28 # min transition duration in seconds 259 + MAX_HOLD_S = 1.4 # intended cap on centered-hold duration, so long 260 + # sustains (e.g. 5-beat 'found' = 4.3s) drift 261 + # toward the next slide; currently unused, since 262 + # camera_x holds center for the full slide. 263 + 264 + def transition_dur(slide_dur): 265 + return max(TRANS_MIN_S, min(TRANS_CAP_S, slide_dur * (1 - HOLD_FRAC) / 2)) 266 + 267 + def camera_x(t, slides, lane_starts, lane_widths, screen_w, peek_px, 268 + loop_end_t=None): 269 + n = len(slides) 270 + if n == 0: 271 + return 0.0 272 + cs = lambda i: lane_starts[i] + lane_widths[i] / 2.0 - screen_w / 2.0 273 + LANE_W = lane_widths[0] if lane_widths else screen_w 274 + # Build keyframe list — 2 per slide (arrival + held-centered). 275 + # in_end is at slide.start + td: the camera ARRIVES centered td 276 + # AFTER the word's audio begins. So when a word starts being sung 277 + # the slide slides in from the right. The camera then holds at 278 + # cs(i) for the FULL duration of the slide (no peek-drift).
The 279 + # transition cs(i)→cs(i+1) happens during the NEXT slide's 280 + # opening td seconds. 281 + kfs = [] 282 + for i, s in enumerate(slides): 283 + d = s["end"] - s["start"] 284 + td = transition_dur(d) 285 + in_end = s["start"] + td 286 + # Held-centered keyframe at slide.end so the camera stays 287 + # rooted on the slide's letters for the full sustain, even 288 + # for *4/*5 long-held notes — no drifting away while the 289 + # word is still being sung. 290 + kfs.append((in_end, cs(i))) 291 + kfs.append((s["end"], cs(i))) 292 + # Phantom pre-roll keyframe at t=0: place the camera at cs(0) - 293 + # LANE_W (≡ cs(n-1) mod n*LANE_W) so frame 0 shows slide_(n-1) 294 + # centered and the transition to cs(0) plays out over the first 295 + # slide's opening td. This makes the very first slide also slide 296 + # in from the right exactly as the first word begins. 297 + if n > 0 and slides[0]["start"] < 0.001: 298 + kfs.insert(0, (0.0, cs(0) - LANE_W)) 299 + if loop_end_t is not None and loop_end_t > kfs[-1][0]: 300 + # During trailing silence (after the last slide ends but before 301 + # loop_end_t), keep the camera CENTERED on the last slide. Frame 302 + # N-1 lands at cs(n-1); frame 0 of the next iteration starts at 303 + # cs(0)-LANE_W ≡ cs(n-1) mod — seamless wrap with no off-center 304 + # drift while the final word's tail is still ringing. 305 + kfs.append((loop_end_t, cs(n - 1))) 306 + first_t, first_pos = kfs[0] 307 + last_t, last_pos = kfs[-1] 308 + if t <= first_t: 309 + return first_pos # clamp before the first keyframe (the pre-roll position when slide 0 starts at t=0) 310 + if t >= last_t: 311 + return last_pos 312 + # Find bracket — LINEAR interpolation (no cosine ease) for groovy, 313 + # constant-velocity transitions instead of springy fast-middle.
314 + for k in range(len(kfs) - 1): 315 + t0, p0 = kfs[k] 316 + t1, p1 = kfs[k + 1] 317 + if t0 <= t < t1: 318 + prog = (t - t0) / max(0.001, t1 - t0) 319 + return p0 + prog * (p1 - p0) 320 + return last_pos 321 + 322 + # ── Per-character "currently being sung" bounce + Dock zoom ────────── 323 + # As the slide plays, the character currently being vocalized bounces 324 + # high and zooms up. Neighbors bounce/zoom less. Like macOS dock hover 325 + # but the cursor moves left-to-right through the word over the slide's 326 + # duration. 327 + DOCK_WINDOW = 1.5 # how many "char slots" of influence each side 328 + MAX_BOUNCE_PX = 64 # peak bounce — more energetic, more responsive 329 + MAX_ZOOM = 1.00 330 + 331 + def char_emphasis(t, slide, char_idx, n_chars, amp_now): 332 + # Time window over which the singing-cursor traverses the word — 333 + # the slide's full audible duration (start → end). 334 + s_start = slide["start"] 335 + s_end = slide["end"] 336 + if n_chars <= 0 or t < s_start or t >= s_end: 337 + return 0.0 338 + progress = (t - s_start) / max(0.001, s_end - s_start) 339 + cursor_pos = progress * (n_chars - 1) 340 + distance = abs(char_idx - cursor_pos) 341 + if distance > DOCK_WINDOW: 342 + return 0.0 343 + weight = 0.5 + 0.5 * np.cos(np.pi * distance / DOCK_WINDOW) 344 + # Amp-driven response — minimal floor so quiet moments are quiet, 345 + # peak moments really pop. 346 + return weight * (0.10 + 0.90 * amp_now) 347 + 348 + def char_bounce_y(emphasis): 349 + return int(emphasis * MAX_BOUNCE_PX * -1) # negative = up 350 + 351 + def char_zoom(emphasis): 352 + return 1.0 + emphasis * (MAX_ZOOM - 1.0) 353 + 354 + 355 + # ── Live-waveform connector polyline ───────────────────────────────── 356 + # Renders a single thick scrolling waveform polyline between two words. 357 + # Y-position follows recent audio amplitude (older samples on left, 358 + # newer on right — feels like the audio is flowing through the dash 359 + # left-to-right). 
Color gradients from cur_color → nxt_color across 360 + # the polyline length, like an em-dash carrying the upcoming word. 361 + DASH_HALF_SPAN_PX = 110 # half-width — short flat punctuation 362 + DASH_AMP_HEIGHT_PX = 0 # FLAT — no audio modulation, just punctuation 363 + DASH_THICKNESS = 16 364 + DASH_N_POINTS = 6 # minimal — just enough for color gradient segments 365 + DASH_FRAMES_PER_POINT = 1 366 + DASH_ALPHA = 130 # mild ambient feel 367 + 368 + 369 + DASH_LOWRES_DIVISOR = 3 # render at 1/3 res, NEAREST upscale → pixelated 370 + 371 + def _draw_waveform_dash(img, gap_screen_x, midline_y, 372 + cur_color, nxt_color, amp_curve, frame_idx, fps, 373 + W, H): 374 + # Render the dash at low resolution onto a small RGBA layer, then 375 + # NEAREST-upscale to give it a chunky pixel-art / staircase look, 376 + # matching the letterforms' aesthetic. 377 + div = DASH_LOWRES_DIVISOR 378 + lo_w = (DASH_HALF_SPAN_PX * 2) // div + 4 379 + lo_h = (DASH_AMP_HEIGHT_PX * 2 + DASH_THICKNESS * 2) // div + 4 380 + layer = Image.new("RGBA", (lo_w, lo_h), (0, 0, 0, 0)) 381 + draw = ImageDraw.Draw(layer) 382 + n = DASH_N_POINTS 383 + cy = lo_h // 2 # midline within layer 384 + points = [] 385 + for k in range(n): 386 + x_lo = (k / (n - 1)) * (lo_w - 4) + 2 387 + # Flat horizontal line — uniform punctuation between words, 388 + # no audio modulation. The color gradient still runs across. 389 + points.append((x_lo, cy)) 390 + # Draw thick polyline at low-res; the upscale's NEAREST gives 391 + # natural staircases without antialiasing. 
392 + lo_thick = max(1, DASH_THICKNESS // div) 393 + for k in range(n - 1): 394 + t = k / (n - 2) 395 + col = tuple(int(cur_color[c] * (1 - t) + nxt_color[c] * t) 396 + for c in range(3)) + (255,) 397 + draw.line([points[k], points[k + 1]], fill=col, width=lo_thick) 398 + r = max(1, lo_thick // 2) 399 + for k, (px, py) in enumerate(points): 400 + t = k / (n - 1) 401 + col = tuple(int(cur_color[c] * (1 - t) + nxt_color[c] * t) 402 + for c in range(3)) + (255,) 403 + draw.ellipse([px - r, py - r, px + r, py + r], fill=col) 404 + # NEAREST upscale → pixelated, then dim alpha for ambient feel 405 + up = layer.resize((lo_w * div, lo_h * div), Image.NEAREST) 406 + up_arr = np.array(up) 407 + up_arr[:, :, 3] = (up_arr[:, :, 3].astype(np.float32) * (DASH_ALPHA / 255.0)).astype(np.uint8) 408 + up = Image.fromarray(up_arr, "RGBA") 409 + paste_x = int(gap_screen_x - up.width // 2) 410 + paste_y = int(midline_y - up.height // 2) 411 + img.paste(up, (paste_x, paste_y), up) 412 + 413 + # ── Main ───────────────────────────────────────────────────────────── 414 + def main(): 415 + p = argparse.ArgumentParser() 416 + p.add_argument("--storyboard", required=True) 417 + p.add_argument("--img-dir", required=True) 418 + p.add_argument("--audio", required=True) 419 + p.add_argument("--frames-dir", required=True) 420 + p.add_argument("--fps", type=int, default=30) 421 + args = p.parse_args() 422 + 423 + sb = json.load(open(args.storyboard)) 424 + slides = sb["slides"] 425 + W = sb["resolution"]["w"] 426 + H = sb["resolution"]["h"] 427 + total_dur = sb["duration"] 428 + fps = args.fps 429 + 430 + # Compute next-transition for each slide (used for slide-out) 431 + for i, s in enumerate(slides): 432 + nxt = slides[i + 1] if i + 1 < len(slides) else slides[0] # loop 433 + s["nextTransitionMs"] = nxt["transitionMs"] 434 + 435 + # Extract glyphs once per slide (cached in memory). 
436 + # Dedup against expected character count: count [a-z] in the slide 437 + # text; if FLUX produced more glyphs than expected, drop the 438 + # smallest ones (typically apostrophe ghosts / texture artifacts). 439 + print(f"→ extracting glyphs from {len(slides)} word images…") 440 + glyph_cache = [] 441 + for i, slide in enumerate(slides): 442 + path = os.path.join(args.img_dir, slide["image"]) 443 + try: 444 + glyphs = extract_glyphs(path) 445 + except Exception as e: 446 + print(f" ⚠ slide {i} '{slide['text']}' glyph extract failed: {e}") 447 + glyphs = [] 448 + expected_n = len("".join(c for c in slide["text"].lower() if c.isalpha())) 449 + if expected_n > 0 and len(glyphs) > expected_n: 450 + # Sort by area, keep the LARGEST `expected_n` (preserve order) 451 + sized = [(idx, g, g["w"] * g["h"]) for idx, g in enumerate(glyphs)] 452 + sized.sort(key=lambda x: -x[2]) 453 + keep_idxs = sorted(s[0] for s in sized[:expected_n]) 454 + glyphs = [glyphs[k] for k in keep_idxs] 455 + target_rgb = color_rgb(slide.get("letterColor", "black")) 456 + # Texture-preserving recolor: blend FLUX's source RGB toward 457 + # target_rgb at 70% — letters end up predominantly the slide's 458 + # letter color but retain ~30% of the source's intensity/hue 459 + # variation, so internal grayscale texture survives instead of 460 + # being flattened to a solid block. 
461 + BLEND_TO_TARGET = 0.70 462 + for g in glyphs: 463 + arr = np.array(g["img"]) 464 + mask = arr[:, :, 3] 465 + src = arr[:, :, :3].astype(np.float32) 466 + arr[:, :, 0] = (src[:, :, 0] * (1 - BLEND_TO_TARGET) + target_rgb[0] * BLEND_TO_TARGET).astype(np.uint8) 467 + arr[:, :, 1] = (src[:, :, 1] * (1 - BLEND_TO_TARGET) + target_rgb[1] * BLEND_TO_TARGET).astype(np.uint8) 468 + arr[:, :, 2] = (src[:, :, 2] * (1 - BLEND_TO_TARGET) + target_rgb[2] * BLEND_TO_TARGET).astype(np.uint8) 469 + arr[:, :, 3] = mask 470 + g["img"] = Image.fromarray(arr, "RGBA") 471 + glyph_cache.append(glyphs) 472 + print(f" slide {i:2d} '{slide['text']}' → {len(glyphs)} glyphs (expected {expected_n})") 473 + 474 + # Audio amplitude 475 + print("→ extracting audio amplitude…") 476 + amp = amplitude_curve(args.audio, total_dur, fps) 477 + 478 + # Render frames 479 + Path(args.frames_dir).mkdir(parents=True, exist_ok=True) 480 + n_frames = int(total_dur * fps) 481 + print(f"→ rendering {n_frames} frames @ {fps}fps to {args.frames_dir}…") 482 + 483 + KERN_SPACING = 4 # air between letters within a word 484 + BASE_SCALE = 0.85 # bigger now since one word per screen 485 + FOCUS_BOOST = 1.18 # spotlight only mildly above general 486 + LANE_W = W # one word per screen-width 487 + MIN_PAD = 80 # generous bg-color padding around each word 488 + # MAX_WORD_W must account for FOCUS_BOOST so even 7-letter words 489 + # fit within the viewport at peak spotlight scale, with comfortable 490 + # air on both sides for the dash + dynamic edges. 
491 + SIDE_AIR_PX = 200 492 + MAX_WORD_W = int((LANE_W - 2 * SIDE_AIR_PX) / FOCUS_BOOST) 493 + TARGET_WORD_H = int(H * 0.10) # ~192px tall — one-word focus 494 + PEEK_PX = int(W * 0.14) # ~150px of next word peeks in by hold-end 495 + CONSUME_ZONE = 0.40 # whole left ~40% of screen disintegrates 496 + LOOKAHEAD_ZONE = 0.82 497 + 498 + # Per-word display scale — height-normalized, then capped to fit 499 + # MAX_WORD_W so every word sits comfortably inside its fixed lane. 500 + per_word_scales = [] 501 + per_word_widths = [] 502 + per_word_baselines = [] # source-image y position of each word's baseline 503 + for glyphs in glyph_cache: 504 + if not glyphs: 505 + per_word_scales.append(BASE_SCALE) 506 + per_word_widths.append(180) 507 + per_word_baselines.append(0) 508 + continue 509 + max_h = max(g["h"] for g in glyphs) 510 + h_norm = TARGET_WORD_H / max(1, max_h) 511 + s_h = h_norm * BASE_SCALE 512 + natural_w = sum(g["w"] for g in glyphs) + KERN_SPACING * max(0, len(glyphs) - 1) 513 + s_w = MAX_WORD_W / max(1, natural_w) 514 + s = min(s_h, s_w) 515 + per_word_scales.append(s) 516 + per_word_widths.append(int(natural_w * s)) 517 + # Baseline = lowest glyph bottom in the word (max y1, in source-image 518 + # coords). Used for vertical alignment so letters share a common line. 519 + per_word_baselines.append(max(g["y1"] for g in glyphs)) 520 + 521 + # Lane geometry: every lane is LANE_W (= W) wide, so the viewport shows 522 + # one full lane when centered and parts of two during a transition. 523 + lane_widths = [LANE_W] * len(slides) 524 + lane_starts = [0.0] 525 + for lw in lane_widths: 526 + lane_starts.append(lane_starts[-1] + lw) 527 + total_lane_w = lane_starts[-1] 528 + 529 + # Spotlight peak: only the lane within ±half a lane-width of screen 530 + # center boosts above 1.0; everything else stays at uniform general 531 + # (BASE) scale. Sharper peak so the focused word is unambiguously 532 + # the spotlight.
533 + def focus_factor(lane_center_screen, screen_center_x, falloff_px): 534 + d = abs(lane_center_screen - screen_center_x) 535 + if d >= falloff_px: 536 + return 1.0 537 + ratio = d / falloff_px 538 + return 1.0 + (FOCUS_BOOST - 1.0) * (0.5 + 0.5 * np.cos(np.pi * ratio)) 539 + 540 + # Entry scale: when a lane is past the LOOKAHEAD_ZONE on the right, 541 + # render it smaller to telegraph "this word is on deck". As it slides 542 + # in, it grows to general scale before reaching the spotlight peak. 543 + def entry_scale(lane_center_screen, screen_w): 544 + threshold = screen_w * LOOKAHEAD_ZONE 545 + if lane_center_screen <= threshold: 546 + return 1.0 547 + out_x = screen_w * 1.05 548 + if lane_center_screen >= out_x: 549 + return 0.45 550 + ratio = (lane_center_screen - threshold) / max(1.0, out_x - threshold) 551 + eased = 0.5 - 0.5 * np.cos(np.pi * ratio) 552 + return 1.0 - 0.55 * eased 553 + 554 + # Demoscene consume effect: when a glyph drifts into the leftmost 555 + # CONSUME_ZONE of the screen, slice it into horizontal raster bands, 556 + # offset each band chaotically, fade alpha, chromatic-aberrate the 557 + # color channels. consume_factor 0..1 (0 = intact, 1 = gone). 
558 + def consume_factor_for(x_screen, screen_w): 559 + threshold = screen_w * CONSUME_ZONE 560 + if x_screen >= threshold: 561 + return 0.0 562 + if x_screen <= -screen_w * 0.05: 563 + return 1.0 564 + # Smooth ramp 0..1 as x moves from threshold → -W*0.05 565 + return min(1.0, max(0.0, 566 + (threshold - x_screen) / (threshold + screen_w * 0.05))) 567 + 568 + consume_rng = np.random.default_rng(0xACE51) 569 + def apply_consume(glyph_img, cf, seed): 570 + if cf <= 0.001: 571 + return glyph_img 572 + if cf >= 0.999: 573 + return None # fully consumed 574 + arr = np.array(glyph_img) 575 + h, w, _ = arr.shape 576 + out = np.zeros_like(arr) 577 + slice_h = max(2, h // 18) 578 + max_offset = int(w * cf * 1.6 + 12) 579 + # Per-slice deterministic offsets (seeded by glyph index + slice y) 580 + for y in range(0, h, slice_h): 581 + ye = min(h, y + slice_h) 582 + srng = np.random.default_rng((seed * 9176 + y * 7919) & 0xFFFFFFFF) 583 + offset = int(srng.uniform(-1, 1) * max_offset) 584 + offset = max(-w + 1, min(w - 1, offset)) 585 + if offset >= 0: 586 + out[y:ye, offset:w, :] = arr[y:ye, 0:w - offset, :] 587 + else: 588 + out[y:ye, 0:w + offset, :] = arr[y:ye, -offset:w, :] 589 + # Chromatic aberration: shift R left, B right by int(cf*6) px (at least 1) 590 + if cf > 0.15: 591 + shift = max(1, int(cf * 6)) 592 + r = np.zeros((h, w), dtype=np.uint8) 593 + b = np.zeros((h, w), dtype=np.uint8) 594 + r[:, :w - shift] = out[:, shift:w, 0] 595 + b[:, shift:w] = out[:, :w - shift, 2] 596 + out[:, :, 0] = r 597 + out[:, :, 2] = b 598 + # Alpha fade 599 + fade = max(0.0, 1.0 - cf) 600 + out[:, :, 3] = (out[:, :, 3].astype(np.float32) * fade).astype(np.uint8) 601 + return Image.fromarray(out, "RGBA") 602 + 603 + def fill_lane_bg(img, slide, lane_screen_x, lane_w): 604 + # Now a no-op — the gradient bg strip computed once per frame 605 + # supersedes per-lane fills (so lane boundaries blend smoothly 606 + # rather than hard-cutting between bg colors).
607 + pass 608 + 609 + BG_BLEND_PX = 240 # width of the transition zone between adjacent lanes 610 + # Pre-compute lane bg colors once (immutable across frames) 611 + bg_colors_arr = np.array( 612 + [color_rgb(s.get("bgColor", "black")) for s in slides], 613 + dtype=np.float32, 614 + ) 615 + 616 + def compute_bg_strip(x_cam_v): 617 + """Return a (W, 3) uint8 array — the bg color for each viewport 618 + column, with cosine-eased gradients across each lane boundary.""" 619 + n = len(slides) 620 + half = BG_BLEND_PX / 2.0 621 + xs = np.arange(W) 622 + world_xs = xs + x_cam_v 623 + rel_xs = world_xs % total_lane_w 624 + lane_idx = (rel_xs // LANE_W).astype(np.int64) % n 625 + in_lane_x = rel_xs - lane_idx * LANE_W 626 + cur_c = bg_colors_arr[lane_idx] 627 + prev_c = bg_colors_arr[(lane_idx - 1) % n] 628 + next_c = bg_colors_arr[(lane_idx + 1) % n] 629 + bg_out = cur_c.copy() 630 + # Right blend: in_lane_x in [LANE_W - half, LANE_W) 631 + rmask = in_lane_x >= (LANE_W - half) 632 + if rmask.any(): 633 + t = (in_lane_x[rmask] - (LANE_W - half)) / BG_BLEND_PX # 0..0.5 634 + eased = (0.5 - 0.5 * np.cos(np.pi * t))[:, np.newaxis] 635 + bg_out[rmask] = (1 - eased) * cur_c[rmask] + eased * next_c[rmask] 636 + # Left blend: in_lane_x in [0, half) 637 + lmask = in_lane_x < half 638 + if lmask.any(): 639 + t = 0.5 + in_lane_x[lmask] / BG_BLEND_PX # 0.5..1.0 640 + eased = (0.5 - 0.5 * np.cos(np.pi * t))[:, np.newaxis] 641 + bg_out[lmask] = (1 - eased) * prev_c[lmask] + eased * cur_c[lmask] 642 + return bg_out.clip(0, 255).astype(np.uint8) 643 + 644 + def paint_lane_glyphs(img, glyphs, lane_screen_x, lane_w, 645 + word_scale, focus_mult, baseline_src, 646 + per_char_zoom=None, per_char_bounce=None, 647 + slide_idx=0, suppress_consume=False): 648 + if not glyphs: 649 + return 650 + n_chars = len(glyphs) 651 + s = word_scale * focus_mult 652 + scaled_widths = [] 653 + scaled_heights = [] 654 + # below_baseline_src[i] = how far below baseline this glyph extends 655 + # in the SOURCE 
image (0 for letters touching baseline; positive 656 + # for descenders). top_offset_src = how high above baseline. (Both lists are currently unused.) 657 + below_src = [] 658 + top_offset_src = [] 659 + for ci, g in enumerate(glyphs): 660 + cz = per_char_zoom[ci] if per_char_zoom else 1.0 661 + scaled_widths.append(max(1, int(g["w"] * s * cz))) 662 + scaled_heights.append(max(1, int(g["h"] * s * cz))) 663 + below_src.append(max(0, baseline_src - g["y1"])) 664 + top_offset_src.append(max(0, g["y1"] - g["y0"])) 665 + total_w = sum(scaled_widths) + KERN_SPACING * max(0, n_chars - 1) 666 + lane_center_screen = lane_screen_x + lane_w / 2.0 667 + base_x = int(lane_center_screen - total_w / 2) 668 + # Baseline: place the row near vertical center of screen. 669 + baseline_y = H // 2 + scaled_heights[0] // 4 # nudge so top isn't clipped 670 + x_cur = base_x 671 + for ci, g in enumerate(glyphs): 672 + gw = scaled_widths[ci] 673 + gh = scaled_heights[ci] 674 + # LANCZOS gives clean letterforms; with MAX_ZOOM=1.0 the 675 + # scale is constant per slide so there's no frame-to-frame 676 + # shimmer. We then HARD-BINARIZE the alpha because LANCZOS 677 + # turns the binary mask into fractional values, which would 678 + # otherwise let bg color bleed through letter centers. 679 + if (gw, gh) != (g["w"], g["h"]): 680 + glyph_img = g["img"].resize((gw, gh), Image.LANCZOS) 681 + arr_g = np.array(glyph_img) 682 + arr_g[:, :, 3] = np.where(arr_g[:, :, 3] > 128, 255, 0).astype(np.uint8) 683 + glyph_img = Image.fromarray(arr_g, "RGBA") 684 + else: 685 + glyph_img = g["img"] 686 + bnc = per_char_bounce[ci] if per_char_bounce else 0 687 + # baseline_src is the lowest glyph bottom in the word (max y1), so 688 + # (baseline_src - y1) * s is how far this glyph's bottom sits ABOVE 689 + # that line: 0 for the lowest glyph (bottom = baseline_y), positive otherwise.
690 + glyph_bottom = baseline_y - int((baseline_src - g["y1"]) * s) + bnc 691 + gy = glyph_bottom - gh 692 + # Per-glyph consume effect: only fires once a slide is in 693 + # the past (its word being sung is over). Suppressed on the 694 + # current/active slide so words don't break up while sung. 695 + if not suppress_consume: 696 + cf = consume_factor_for(x_cur, W) 697 + if cf > 0.001: 698 + glyph_img = apply_consume(glyph_img, cf, slide_idx * 1000 + ci) 699 + if glyph_img is None: 700 + x_cur += gw + KERN_SPACING 701 + continue 702 + img.paste(glyph_img, (int(x_cur), gy), glyph_img) 703 + x_cur += gw + KERN_SPACING 704 + 705 + for f in range(n_frames): 706 + t = f / fps 707 + amp_now = amp[min(f, len(amp) - 1)] 708 + 709 + cur_idx = 0 710 + for i, s in enumerate(slides): 711 + if s["start"] <= t < s["end"]: 712 + cur_idx = i 713 + break 714 + 715 + x_cam = camera_x(t, slides, lane_starts, lane_widths, W, PEEK_PX, 716 + loop_end_t=(n_frames - 1) / fps) 717 + screen_center = W / 2.0 718 + # Tight falloff: spotlight only the lane near center; words 719 + # peeking in from the right stay at uniform general scale. 720 + falloff = LANE_W * 0.30 721 + 722 + # Build the gradient bg strip (W×3) for this frame and tile 723 + # it vertically to fill the canvas — replaces per-lane fills. 724 + bg_strip = compute_bg_strip(x_cam) 725 + img_arr = np.broadcast_to(bg_strip[np.newaxis, :, :], 726 + (H, W, 3)).copy() 727 + img = Image.fromarray(img_arr, "RGB") 728 + 729 + # Two-pass: bgs first (all colored strips), then glyphs (so the 730 + # focused word's overflow into neighbor lanes stays visible 731 + # instead of being painted over by the next lane's bg). 
732 + n = len(slides) 733 + visible = [] 734 + for loop_off in (-1, 0, 1): 735 + world_shift = loop_off * total_lane_w 736 + for i in range(n): 737 + lane_screen_x = lane_starts[i] + world_shift - x_cam 738 + if lane_screen_x >= W or lane_screen_x + lane_widths[i] <= 0: 739 + continue 740 + visible.append((i, loop_off, lane_screen_x)) 741 + 742 + # Pass 1: all bg strips 743 + for i, _, lane_screen_x in visible: 744 + fill_lane_bg(img, slides[i], lane_screen_x, lane_widths[i]) 745 + 746 + # Pass 2: all glyphs, focused word last so it sits on top 747 + # if it overlaps neighbors. Sort: non-active first, active last. 748 + visible_glyph_pass = sorted( 749 + visible, 750 + key=lambda v: 1 if (v[0] == cur_idx and v[1] == 0) else 0, 751 + ) 752 + for i, loop_off, lane_screen_x in visible_glyph_pass: 753 + slide_real = slides[i] 754 + glyphs = glyph_cache[i] 755 + lane_center_screen = lane_screen_x + lane_widths[i] / 2.0 756 + f_mult = focus_factor(lane_center_screen, screen_center, falloff) 757 + e_mult = entry_scale(lane_center_screen, W) 758 + scale_mult = f_mult * e_mult 759 + baseline_src = per_word_baselines[i] 760 + # Suppress consume on the CURRENT slide (still being sung) 761 + # AND on the next slide on its way in. Past slides (i < cur) 762 + # are the only ones eligible to break apart on their way out. 
763 + is_past = (i < cur_idx and loop_off == 0) or loop_off < 0 764 + suppress = not is_past 765 + if i == cur_idx and loop_off == 0 and glyphs: 766 + n_chars = len(glyphs) 767 + zooms = [] 768 + bounces = [] 769 + for ci in range(n_chars): 770 + em = char_emphasis(t, slide_real, ci, n_chars, amp_now) 771 + zooms.append(char_zoom(em)) 772 + bounces.append(char_bounce_y(em)) 773 + paint_lane_glyphs(img, glyphs, lane_screen_x, lane_widths[i], 774 + per_word_scales[i], scale_mult, baseline_src, 775 + zooms, bounces, slide_idx=i, 776 + suppress_consume=suppress) 777 + else: 778 + paint_lane_glyphs(img, glyphs, lane_screen_x, lane_widths[i], 779 + per_word_scales[i], scale_mult, baseline_src, 780 + slide_idx=i + loop_off * 1000, 781 + suppress_consume=suppress) 782 + 783 + # ── Em-dash waveform connectors at every gap ──────────────── 784 + # Each gap (between consecutive slides) has its own dash at a 785 + # fixed WORLD position (lane_starts[i+1]) — so as the camera 786 + # scrolls, the dashes scroll with it, sliding all the way off 787 + # screen left rather than disappearing at the edge. At any 788 + # frame, the viewport-clip filter naturally hides off-screen 789 + # gaps; usually one dash is visible (the gap to the right of 790 + # the centered word) with another sometimes entering/leaving. 
791 + n_slides = len(slides) 792 + midline_y = H // 2 - int(TARGET_WORD_H * 0.30) 793 + for i in range(n_slides): 794 + for loop_off in (-1, 0, 1): 795 + gap_world_x = (lane_starts[i + 1] 796 + + loop_off * total_lane_w) 797 + gap_screen_x = gap_world_x - x_cam 798 + if gap_screen_x < -DASH_HALF_SPAN_PX or gap_screen_x > W + DASH_HALF_SPAN_PX: 799 + continue 800 + cur_letter = color_rgb(slides[i].get("letterColor", "white")) 801 + nxt = slides[(i + 1) % n_slides] 802 + nxt_letter = color_rgb(nxt.get("letterColor", "white")) 803 + _draw_waveform_dash( 804 + img, gap_screen_x, midline_y, 805 + cur_letter, nxt_letter, 806 + amp, f, fps, W, H, 807 + ) 808 + 809 + # ── Whole-screen-left disintegration ──────────────────────── 810 + # Glitchy demoscene-style fade in the leftmost CONSUME_ZONE: 811 + # 1. Pixelation (aggressive downscale + NEAREST upscale) 812 + # 2. Horizontal raster slice break (per-band x-shift) 813 + # 3. Chromatic aberration (R/B channels split) 814 + # 4. Cosine-eased alpha fade to bg gradient 815 + zone_w = int(W * CONSUME_ZONE) 816 + if zone_w > 0: 817 + arr = np.array(img) 818 + arr_orig = arr[:, :zone_w, :].copy() # untouched left zone 819 + zone = arr_orig.copy() 820 + 821 + # 1. PIXELATION: downscale via BOX (anti-aliased average) 822 + # then NEAREST upscale for chunky pixels 823 + PIX_DIV = 6 824 + pw, ph = max(1, zone_w // PIX_DIV), max(1, H // PIX_DIV) 825 + zone_img = Image.fromarray(zone) 826 + small = zone_img.resize((pw, ph), Image.BOX) 827 + big = small.resize((pw * PIX_DIV, ph * PIX_DIV), Image.NEAREST) 828 + zone = np.array(big)[:H, :zone_w].copy() 829 + 830 + # 2. RASTER SLICE BREAK: per 14-row band, random horizontal 831 + # shift up to ±zone_w*0.12. Frame-deterministic seed so 832 + # glitch evolves chaotically across frames. 
833 + slice_h = 14 834 + rng = np.random.default_rng((f * 9173) & 0xFFFFFFFF) 835 + max_shift = int(zone_w * 0.12) 836 + out_zone = zone.copy() 837 + for y in range(0, H, slice_h): 838 + ye = min(H, y + slice_h) 839 + shift = int(rng.uniform(-1, 1) * max_shift) 840 + if shift > 0 and shift < zone_w: 841 + out_zone[y:ye, shift:, :] = zone[y:ye, :zone_w - shift, :] 842 + out_zone[y:ye, :shift, :] = bg_strip[np.newaxis, :shift, :] 843 + elif shift < 0 and -shift < zone_w: 844 + out_zone[y:ye, :zone_w + shift, :] = zone[y:ye, -shift:, :] 845 + fill_x_start = zone_w + shift 846 + if fill_x_start < zone_w: 847 + out_zone[y:ye, fill_x_start:, :] = ( 848 + bg_strip[np.newaxis, fill_x_start:zone_w, :] 849 + ) 850 + zone = out_zone 851 + 852 + # 3. CHROMATIC ABERRATION: shift R left, B right by 5px 853 + aberr = 5 854 + if zone_w > aberr * 2: 855 + chroma = zone.copy() 856 + chroma[:, :zone_w - aberr, 0] = zone[:, aberr:, 0] 857 + chroma[:, aberr:, 2] = zone[:, :zone_w - aberr, 2] 858 + zone = chroma 859 + 860 + # ── Smooth ramp from clean → glitched → bg ───────────── 861 + # cf: 0 at right edge of zone, 1 at far left 862 + cf = 1.0 - (np.arange(zone_w) / zone_w) 863 + # Glitch ramp: power curve (exponent 2.2) so glitch ramps in slowly 864 + # near the right edge of the zone (cf small → ramp far smaller) and 865 + # only fully kicks in toward the far left. Eliminates the 866 + # "harsh start" the user pointed out. 
867 + glitch_ramp = (cf ** 2.2)[np.newaxis, :, np.newaxis] 868 + # Blend ORIGINAL ↔ GLITCHED: 869 + blended = ( 870 + arr_orig.astype(np.float32) * (1 - glitch_ramp) 871 + + zone.astype(np.float32) * glitch_ramp 872 + ) 873 + # Bg fade: cosine-eased, separate ramp (less aggressive 874 + # since glitch already does most of the visual work) 875 + bg_ramp = (0.5 - 0.5 * np.cos(np.pi * cf)) * 0.85 876 + bg_ramp = bg_ramp[np.newaxis, :, np.newaxis] 877 + bg_zone = bg_strip[:zone_w][np.newaxis, :, :].astype(np.float32) 878 + arr[:, :zone_w, :] = ( 879 + blended * (1 - bg_ramp) + bg_zone * bg_ramp 880 + ).astype(np.uint8) 881 + img = Image.fromarray(arr, "RGB") 882 + 883 + img.save(os.path.join(args.frames_dir, f"f{f:05d}.png")) 884 + if f % 120 == 0: 885 + print(f" {f}/{n_frames} ({100*f/n_frames:.0f}%)") 886 + 887 + print(f"✓ {n_frames} frames written") 888 + 889 + if __name__ == "__main__": 890 + main()
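The two ramps that close out the consume-zone pass are easier to reason about for a single column and a single channel. The following is a minimal pure-Python sketch of that arithmetic, not the renderer itself: `ramp_weights` and `blend_channel` are hypothetical helper names, and the sample pixel values in the usage note are made up for illustration. The exponent 2.2 and the 0.85 ceiling are taken from the code above.

```python
import math


def ramp_weights(x: int, zone_w: int) -> tuple[float, float]:
    """Return (glitch_weight, bg_weight) for column x of the consume zone.

    Hypothetical helper: mirrors the per-column ramps in render_frames.py
    without NumPy. x = 0 is the far-left column of the zone.
    """
    cf = 1.0 - (x / zone_w)  # 1.0 at far left, approaching 0 at the right edge
    glitch = cf ** 2.2       # power curve: stays small near the right edge
    bg = (0.5 - 0.5 * math.cos(math.pi * cf)) * 0.85  # cosine ease, capped at 0.85
    return glitch, bg


def blend_channel(orig: float, glitched: float, bg_value: float,
                  x: int, zone_w: int) -> int:
    """Blend one channel value: original -> glitched -> background."""
    glitch, bg = ramp_weights(x, zone_w)
    blended = orig * (1.0 - glitch) + glitched * glitch
    # int() truncates, like the uint8 cast in the renderer for in-range values
    return int(blended * (1.0 - bg) + bg_value * bg)
```

At the far-left column the glitched pixel fully replaces the original and 85% of the result is background; halfway across the zone the background weight is 0.425 while the glitch weight is only about 0.22, which is why the effect starts gently.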
+152
pop/bin/repair_letter.py
··· 1 + #!/usr/bin/env python3 2 + """ 3 + repair_letter.py — patch a single bad letter into an existing word image. 4 + 5 + Given: the original word image, the bad-letter index, and a freshly- 6 + generated single-letter FLUX image (same color scheme as the word), 7 + extract the new letter, scale it to the size of the bad letter's 8 + bounding box, and composite it back into the word image at that spot. 9 + 10 + The replacement keeps the FLUX aesthetic (no font fallback) — we just 11 + swap pixel patches. 12 + 13 + Usage: 14 + repair_letter.py \\ 15 + --word-img big-pictures/out/.../word-001.jpg \\ 16 + --letter-img /tmp/replacement-a.jpg \\ 17 + --slot 2 \\ 18 + --expected-word grace \\ 19 + --bg-color peachpuff \\ 20 + --letters-color saddlebrown 21 + """ 22 + import argparse 23 + import json 24 + import sys 25 + from pathlib import Path 26 + 27 + import numpy as np 28 + from PIL import Image 29 + 30 + sys.path.insert(0, str(Path(__file__).parent)) 31 + from render_frames import extract_glyphs, color_rgb 32 + 33 + 34 + def main(): 35 + ap = argparse.ArgumentParser() 36 + ap.add_argument("--word-img", required=True) 37 + ap.add_argument("--letter-img", required=True) 38 + ap.add_argument("--slot", type=int, required=True, 39 + help="Index (0-based) of the bad letter in the word") 40 + ap.add_argument("--expected-word", required=True) 41 + ap.add_argument("--bg-color", required=True, 42 + help="CSS named color for the slide's background") 43 + ap.add_argument("--letters-color", required=True, 44 + help="CSS named color for the slide's letters") 45 + ap.add_argument("--out", help="Output path (default: overwrites --word-img)") 46 + args = ap.parse_args() 47 + 48 + out_path = args.out or args.word_img 49 + 50 + # 1. Extract glyphs from the original word image to find the bad 51 + # letter's source bounding box. 
52 + word_glyphs = extract_glyphs(args.word_img) 53 + if args.slot >= len(word_glyphs): 54 + print(json.dumps({"ok": False, "reason": f"slot {args.slot} out of range"})) 55 + return 56 + target = word_glyphs[args.slot] 57 + target_box = (target["x0"], target["y0"], 58 + target["x0"] + target["w"], target["y1"]) 59 + target_w = target["w"] 60 + target_h = target["h"] 61 + 62 + # 2. Extract the letter from the replacement single-letter image. 63 + repl_glyphs = extract_glyphs(args.letter_img) 64 + if not repl_glyphs: 65 + print(json.dumps({"ok": False, "reason": "no glyph extracted from replacement"})) 66 + return 67 + # If multiple components were detected, pick the largest by area — 68 + # FLUX may render speech bubbles or stray dots around a single letter. 69 + repl_glyphs.sort(key=lambda g: g["w"] * g["h"], reverse=True) 70 + repl = repl_glyphs[0] 71 + 72 + # 3. Scale the replacement glyph to match the target bounding box's 73 + # HEIGHT (preserve aspect ratio so the letter doesn't stretch). 74 + # NEAREST instead of LANCZOS — the source is pixel art (no 75 + # smoothing wanted) AND lanczos turns binary alpha into 76 + # fractional values, which PIL's paste then mixes the FLUX bg 77 + # color into the slide bg, creating a visible rectangular halo. 78 + scale = target_h / max(1, repl["h"]) 79 + new_w = max(1, int(repl["w"] * scale)) 80 + new_h = max(1, int(repl["h"] * scale)) 81 + repl_img = repl["img"].resize((new_w, new_h), Image.NEAREST) 82 + 83 + # 4. Recolor + hard-binarize alpha. Belt-and-suspenders against 84 + # halo bleed: where the alpha mask says "not letter", explicitly 85 + # overwrite RGB to the slide bg color so even partial-opacity 86 + # pixels render as bg, not as FLUX's original bg. 
87 + target_rgb = color_rgb(args.letters_color) 88 + bg_rgb = color_rgb(args.bg_color) 89 + arr = np.array(repl_img) 90 + raw_alpha = arr[:, :, 3] 91 + binary_mask = np.where(raw_alpha > 128, 255, 0).astype(np.uint8) 92 + arr[:, :, 0] = np.where(binary_mask > 0, target_rgb[0], bg_rgb[0]) 93 + arr[:, :, 1] = np.where(binary_mask > 0, target_rgb[1], bg_rgb[1]) 94 + arr[:, :, 2] = np.where(binary_mask > 0, target_rgb[2], bg_rgb[2]) 95 + arr[:, :, 3] = binary_mask 96 + repl_img = Image.fromarray(arr, "RGBA") 97 + 98 + # 5. Composite into the original word image. Sample the ACTUAL bg 99 + # color from the word image's corners — FLUX often renders the 100 + # spec'd bg as a different shade, so painting with the spec color 101 + # creates a visible rectangular patch. Sampled bg = same color 102 + # as the surrounding pixels = invisible patch. 103 + word_img = Image.open(args.word_img).convert("RGB") 104 + word_arr = np.array(word_img) 105 + s = 16 106 + corners = np.concatenate([ 107 + word_arr[:s, :s, :].reshape(-1, 3), 108 + word_arr[:s, -s:, :].reshape(-1, 3), 109 + word_arr[-s:, :s, :].reshape(-1, 3), 110 + word_arr[-s:, -s:, :].reshape(-1, 3), 111 + ]) 112 + actual_bg = tuple(int(c) for c in np.median(corners, axis=0)) 113 + # Also rebuild the replacement's non-letter pixels with this sampled 114 + # bg (replacing the slide-spec bg we used earlier as defense). 115 + repl_arr = np.array(repl_img) 116 + repl_mask = repl_arr[:, :, 3] 117 + repl_arr[:, :, 0] = np.where(repl_mask > 0, target_rgb[0], actual_bg[0]) 118 + repl_arr[:, :, 1] = np.where(repl_mask > 0, target_rgb[1], actual_bg[1]) 119 + repl_arr[:, :, 2] = np.where(repl_mask > 0, target_rgb[2], actual_bg[2]) 120 + repl_img = Image.fromarray(repl_arr, "RGBA") 121 + 122 + # Expand the patch rectangle slightly so we cover the dilation halo. 
123 + pad = 4 124 + x0 = max(0, target_box[0] - pad) 125 + y0 = max(0, target_box[1] - pad) 126 + x1 = min(word_img.width, target_box[2] + pad) 127 + y1 = min(word_img.height, target_box[3] + pad) 128 + # Paint over the old letter with the sampled bg (invisible patch) 129 + overlay = Image.new("RGB", (x1 - x0, y1 - y0), actual_bg) 130 + word_img.paste(overlay, (x0, y0)) 131 + # Paste the new letter centered horizontally inside the original 132 + # (unpadded) bbox; bottom-aligned to its bottom edge (target_box[3]). 133 + new_letter_x = target_box[0] + (target_w - new_w) // 2 134 + new_letter_y = target_box[3] - new_h 135 + word_img.paste(repl_img, (new_letter_x, new_letter_y), repl_img) 136 + 137 + # Save as PNG (lossless) to preserve the small letter counter that 138 + # JPEG would smear into oblivion. PIL on read auto-detects format 139 + # from file content, not extension, so the .jpg-named path still 140 + # works downstream. 141 + word_img.save(out_path, format="PNG") 142 + print(json.dumps({ 143 + "ok": True, 144 + "out": out_path, 145 + "target_box": [int(x) for x in target_box], 146 + "patch_box": [x0, y0, x1, y1], 147 + "new_letter_size": [new_w, new_h], 148 + })) 149 + 150 + 151 + if __name__ == "__main__": 152 + main()
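The scale-and-place arithmetic in steps 3 and 5 of repair_letter.py can be checked without any images. The sketch below is a hedged illustration only: `fit_and_place` is a hypothetical name, and it assumes the glyph height equals `y1 - y0` of the target box, which the script reads from `target["h"]` instead. It scales to the target height, keeps the aspect ratio, centers horizontally inside the original box, and bottom-aligns.

```python
def fit_and_place(target_box, repl_w, repl_h):
    """Return (new_w, new_h, paste_x, paste_y) for a replacement glyph.

    target_box is (x0, y0, x1, y1) of the bad letter in the word image.
    Hypothetical helper; assumes target height == y1 - y0.
    """
    x0, y0, x1, y1 = target_box
    target_w, target_h = x1 - x0, y1 - y0
    # Match the target HEIGHT only, so the letter is never stretched
    scale = target_h / max(1, repl_h)
    new_w = max(1, int(repl_w * scale))
    new_h = max(1, int(repl_h * scale))
    # Center horizontally in the original box; a wider glyph overhangs evenly
    paste_x = x0 + (target_w - new_w) // 2
    # Bottom-align to the original box's bottom edge
    paste_y = y1 - new_h
    return new_w, new_h, paste_x, paste_y
```

For example, a 40×120 replacement dropped into a 40×60 slot at `(100, 50, 140, 110)` is halved to 20×60 and pasted at `(110, 50)`. A replacement wider than the slot gets a paste position left of `x0`, so it overhangs both sides equally.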
+193
pop/bin/say.mjs
··· 1 + #!/usr/bin/env node 2 + // say.mjs — POST a lyric file to /api/say, cache the vocal stem. 3 + // 4 + // Mirrors `recap/bin/tts.mjs` but reads a plain lyric file (the 5 + // pop/big-pictures/<slug>.txt format) instead of an audience config: 6 + // - strips section headers (hook / verse 1 / verse 2 / outro) 7 + // - expands "hook" repeats inline (when a header appears alone it 8 + // repeats the named section's body) 9 + // - drops blank lines 10 + // - replaces em dashes with commas for cleaner TTS prosody 11 + // 12 + // Voice: jeffrey-pvc — same config as the 24h recap pipeline 13 + // (provider="jeffrey", voice="neutral:0"). Endpoint is the production 14 + // /api/say proxy, which costs real money via ElevenLabs — caches by 15 + // content-hash. Pass --force to bypass. 16 + // 17 + // Usage: 18 + // node bin/say.mjs ../big-pictures/plork.txt 19 + // node bin/say.mjs ../big-pictures/plork.txt --section hook 20 + // node bin/say.mjs ../big-pictures/plork.txt --out ../big-pictures/out/plork-vocal.mp3 21 + // node bin/say.mjs ../big-pictures/plork.txt --provider jeffrey --voice neutral:0 22 + 23 + import { writeFileSync, readFileSync, mkdirSync, existsSync } from "node:fs"; 24 + import { resolve, dirname, basename } from "node:path"; 25 + import { fileURLToPath } from "node:url"; 26 + import { createHash } from "node:crypto"; 27 + import { homedir } from "node:os"; 28 + 29 + const HERE = dirname(fileURLToPath(import.meta.url)); 30 + const ROOT = resolve(HERE, ".."); 31 + 32 + const argv = process.argv.slice(2); 33 + const flags = {}; 34 + const positional = []; 35 + for (let i = 0; i < argv.length; i++) { 36 + const a = argv[i]; 37 + if (a.startsWith("--")) { 38 + const key = a.slice(2); 39 + const next = argv[i + 1]; 40 + if (next !== undefined && !next.startsWith("--")) { flags[key] = next; i++; } 41 + else flags[key] = true; 42 + } else positional.push(a); 43 + } 44 + 45 + if (!positional[0]) { 46 + console.error("usage: node bin/say.mjs <lyric-file.txt> 
[--section hook] [--out path.mp3] [--force]"); 47 + process.exit(1); 48 + } 49 + 50 + function expandHome(p) { 51 + if (!p || typeof p !== "string") return p; 52 + if (p === "~") return homedir(); 53 + if (p.startsWith("~/")) return resolve(homedir(), p.slice(2)); 54 + return p; 55 + } 56 + 57 + const lyricPath = resolve(process.cwd(), positional[0]); 58 + if (!existsSync(lyricPath)) { 59 + console.error(`✗ lyric file not found: ${lyricPath}`); 60 + process.exit(1); 61 + } 62 + 63 + const slug = basename(lyricPath).replace(/\.[^.]+$/, ""); 64 + const SECTION = flags.section || null; // null = full track 65 + const PROVIDER = flags.provider || "jeffrey"; 66 + const VOICE_ID = flags.voice || "neutral:0"; 67 + const SPEED = Number(flags.speed ?? 1.0); // 0.7-1.2 for ElevenLabs jeffrey provider 68 + const STYLE = flags.style !== undefined ? Number(flags.style) : null; // 0-1 style exaggeration 69 + const STABILITY = flags.stability !== undefined ? Number(flags.stability) : null; // 0-1 70 + const SIMILARITY = flags.similarity !== undefined ? Number(flags.similarity) : null; // 0-1 71 + const FORCE = flags.force === true; 72 + const OUT_PATH = expandHome(flags.out) 73 + || `${ROOT}/big-pictures/out/${slug}${SECTION ? `-${SECTION.replace(/\s+/g, "_")}` : ""}-vocal.mp3`; 74 + 75 + // ── Parse lyric file ─────────────────────────────────────────────────── 76 + // Sections are headers (lowercase, no colon, alphabetic). Body lines 77 + // follow until the next header or EOF. A header line with no body 78 + // before the next header is a repeat marker. 79 + function parseLyrics(text) { 80 + const HEADER_RE = /^(hook|verse \d+|outro|bridge|chorus|intro)$/i; 81 + const sections = {}; // name → first-seen body lines 82 + const order = []; // array of section names (with repeats) 83 + let current = null; 84 + let buffer = []; 85 + 86 + const flush = () => { 87 + if (current === null) return; 88 + if (buffer.length === 0) { 89 + // Repeat marker — body comes from first occurrence. 
90 + order.push(current); 91 + } else { 92 + if (!sections[current]) sections[current] = buffer.slice(); 93 + order.push(current); 94 + } 95 + buffer = []; 96 + }; 97 + 98 + for (const raw of text.split("\n")) { 99 + const line = raw.trim(); 100 + if (!line) continue; 101 + if (HEADER_RE.test(line)) { 102 + flush(); 103 + current = line.toLowerCase(); 104 + continue; 105 + } 106 + if (current === null) { 107 + // Lyrics before any header — treat as 'intro' 108 + current = "intro"; 109 + } 110 + buffer.push(line); 111 + } 112 + flush(); 113 + 114 + return { sections, order }; 115 + } 116 + 117 + const { sections, order } = parseLyrics(readFileSync(lyricPath, "utf8")); 118 + 119 + if (SECTION && !sections[SECTION.toLowerCase()]) { 120 + console.error(`✗ section '${SECTION}' not found. available: ${Object.keys(sections).join(", ")}`); 121 + process.exit(1); 122 + } 123 + 124 + // ── Build narration text ─────────────────────────────────────────────── 125 + function cleanLine(line) { 126 + return line 127 + .replace(/—/g, ",") // em dash → comma for TTS pacing 128 + .replace(/–/g, ",") // en dash too 129 + .replace(/\s+/g, " ") 130 + .trim(); 131 + } 132 + 133 + let narration; 134 + if (SECTION) { 135 + const body = sections[SECTION.toLowerCase()]; 136 + narration = body.map(cleanLine).join("\n"); 137 + } else { 138 + const lines = []; 139 + for (const name of order) { 140 + const body = sections[name]; 141 + if (!body) continue; 142 + for (const l of body) lines.push(cleanLine(l)); 143 + lines.push(""); // blank line between sections 144 + } 145 + narration = lines.filter((l, i, a) => !(l === "" && a[i + 1] === "")).join("\n").trim(); 146 + } 147 + 148 + if (!narration) { 149 + console.error(`✗ no narration content extracted from ${lyricPath}`); 150 + process.exit(1); 151 + } 152 + 153 + // ── Build request + cache key ────────────────────────────────────────── 154 + const body = { from: narration, provider: PROVIDER, voice: VOICE_ID }; 155 + if (SPEED !== 1.0) 
body.speed = Math.max(0.7, Math.min(1.2, SPEED)); 156 + if (STYLE !== null && Number.isFinite(STYLE)) body.style = Math.max(0, Math.min(1, STYLE)); 157 + if (STABILITY !== null && Number.isFinite(STABILITY)) body.stability = Math.max(0, Math.min(1, STABILITY)); 158 + if (SIMILARITY !== null && Number.isFinite(SIMILARITY)) body.similarity = Math.max(0, Math.min(1, SIMILARITY)); 159 + const inputHash = createHash("sha256") 160 + .update(JSON.stringify(body)) 161 + .digest("hex").slice(0, 16); 162 + 163 + const hashFile = `${OUT_PATH}.hash`; 164 + mkdirSync(dirname(OUT_PATH), { recursive: true }); 165 + 166 + if (!FORCE && existsSync(OUT_PATH) && existsSync(hashFile)) { 167 + const cached = readFileSync(hashFile, "utf8").trim(); 168 + if (cached === inputHash) { 169 + const size = (readFileSync(OUT_PATH).length / 1024).toFixed(0); 170 + console.log(`✓ ${OUT_PATH} cached (${size} KB · hash ${inputHash}) — skipping /api/say`); 171 + process.exit(0); 172 + } 173 + } 174 + 175 + console.log(`→ POST /api/say · ${narration.length} chars · ${PROVIDER}/${VOICE_ID}` + (SPEED !== 1.0 ? ` · speed=${SPEED}` : "") + (STYLE !== null ? ` · style=${STYLE}` : "") + (SECTION ? ` · section=${SECTION}` : "")); 176 + console.log(` preview: ${narration.split("\n")[0].slice(0, 80)}…`); 177 + 178 + const res = await fetch("https://aesthetic.computer/api/say", { 179 + method: "POST", 180 + headers: { "Content-Type": "application/json" }, 181 + body: JSON.stringify(body), 182 + redirect: "follow", 183 + }); 184 + 185 + if (!res.ok) { 186 + console.error(`✗ /api/say returned ${res.status}: ${await res.text()}`); 187 + process.exit(1); 188 + } 189 + 190 + const buf = Buffer.from(await res.arrayBuffer()); 191 + writeFileSync(OUT_PATH, buf); 192 + writeFileSync(hashFile, inputHash + "\n"); 193 + console.log(`✓ ${OUT_PATH} (${(buf.length / 1024).toFixed(0)} KB · hash ${inputHash})`);
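The header and repeat-marker rules in `parseLyrics` are compact enough to restate on their own. What follows is a Python re-expression of that logic, written in Python only to match the pipeline's other helpers. It is a sketch, not a drop-in replacement for say.mjs, and `parse_lyrics`, `clean_line`, and `build_narration` are illustrative names. The sample lyric in the usage note is invented.

```python
import re

HEADER_RE = re.compile(r"^(hook|verse \d+|outro|bridge|chorus|intro)$", re.IGNORECASE)


def parse_lyrics(text):
    """Return (sections, order): first-seen bodies plus play order with repeats."""
    sections, order = {}, []
    current, buffer = None, []

    def flush():
        nonlocal buffer
        if current is None:
            return
        # A header with no body is a repeat marker: keep the first-seen body
        if buffer and current not in sections:
            sections[current] = list(buffer)
        order.append(current)
        buffer = []

    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        if HEADER_RE.match(line):
            flush()
            current = line.lower()
            continue
        if current is None:
            current = "intro"  # lyrics before any header
        buffer.append(line)
    flush()
    return sections, order


def clean_line(line):
    """Swap em/en dashes for commas and collapse whitespace, as cleanLine does."""
    line = line.replace("\u2014", ",").replace("\u2013", ",")
    return re.sub(r"\s+", " ", line).strip()


def build_narration(sections, order):
    lines = []
    for name in order:
        for body_line in sections.get(name, []):
            lines.append(clean_line(body_line))
        if sections.get(name):
            lines.append("")  # blank line between sections
    kept = [l for i, l in enumerate(lines)
            if not (l == "" and i + 1 < len(lines) and lines[i + 1] == "")]
    return "\n".join(kept).strip()
```

Given a file with `hook`, one body line, `verse 1`, one body line, and a bare trailing `hook`, the order comes out as hook, verse 1, hook, and the narration repeats the hook body at the end.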
+226
pop/bin/storyboard.mjs
··· 1 + #!/usr/bin/env node 2 + // storyboard.mjs — generate a hand-editable storyboard JSON for a 3 + // big-pictures track, driven directly from the .np SCORE (not from 4 + // the rendered audio's whisper alignment). 5 + // 6 + // Source of timing truth (in priority order): 7 + // 1. <slug>.np score — per-syllable note + beat-weight. THE LAW. 8 + // Each `note:syllable*weight` token becomes ONE slide. 9 + // 2. BPM (CLI flag or score comment) — converts beat positions to 10 + // seconds. 11 + // 3. The mixed audio's true duration (ffprobe) — only used as a 12 + // sanity-check on the storyboard total + as the loop endpoint. 13 + // 14 + // Each syllable becomes its own slide so multi-syllable words like 15 + // 'Amazing' (a-/ma-/zing on D3/G3/B3) get individual visual hits at 16 + // each note transition, instead of one static word held over 3 notes. 17 + // 18 + // Output: <slug>.storyboard.json. tiktok.mjs reads it verbatim and 19 + // FLUX-generates one word image per slide (so the syllable text is 20 + // the displayed word — "a", "ma", "zing", "grace", ...). 21 + // 22 + // Usage: 23 + // node bin/storyboard.mjs --slug amazing 24 + // [--score big-pictures/amazing.np] 25 + // [--section "verse 1"] 26 + // [--bpm 70] 27 + // [--audio big-pictures/out/amazing-final.mp3] 28 + 29 + import { spawnSync } from "node:child_process"; 30 + import { readFileSync, writeFileSync, existsSync } from "node:fs"; 31 + import { resolve } from "node:path"; 32 + 33 + const flags = {}; 34 + for (let i = 0; i < process.argv.length; i++) { 35 + const a = process.argv[i]; 36 + if (a.startsWith("--")) flags[a.slice(2)] = process.argv[i + 1]; 37 + } 38 + 39 + const SLUG = flags.slug || "amazing"; 40 + const POP = "/Users/jas/aesthetic-computer/pop"; 41 + const SCORE_PATH = flags.score 42 + ? 
resolve(process.cwd(), flags.score) 43 + : `${POP}/big-pictures/${SLUG}.np`; 44 + const SECTION = (flags.section || "verse 1").toLowerCase(); 45 + const BPM = Number(flags.bpm) || 70; 46 + const AUDIO = flags.audio 47 + ? resolve(process.cwd(), flags.audio) 48 + : `${POP}/big-pictures/out/${SLUG}-final.mp3`; 49 + const OUT = flags.out 50 + ? resolve(process.cwd(), flags.out) 51 + : `${POP}/big-pictures/out/${SLUG}.storyboard.json`; 52 + const IMG_DIR = flags["img-dir"] 53 + ? resolve(process.cwd(), flags["img-dir"]) 54 + : `${POP}/big-pictures/out/${SLUG}-tiktok-frames`; 55 + 56 + if (!existsSync(SCORE_PATH)) { 57 + console.error(`✗ score file missing: ${SCORE_PATH}`); 58 + process.exit(1); 59 + } 60 + if (!existsSync(AUDIO)) { 61 + console.error(`✗ audio missing: ${AUDIO}`); 62 + process.exit(1); 63 + } 64 + 65 + // ── Parse the .np score ───────────────────────────────────────────── 66 + // Each line in the section is space-separated `note:syllable*weight` 67 + // tokens. Syllables can be: 68 + // "a-" — start of multi-syllable word 69 + // "-ma-" — middle of multi-syllable word 70 + // "-zing" — end of multi-syllable word 71 + // "grace" — single-syllable word 72 + const scoreText = readFileSync(SCORE_PATH, "utf8"); 73 + const scoreLines = scoreText.split("\n"); 74 + const sectionStart = scoreLines.findIndex( 75 + (l) => l.trim().toLowerCase() === SECTION, 76 + ); 77 + if (sectionStart < 0) { 78 + console.error(`✗ section '${SECTION}' not found in ${SCORE_PATH}`); 79 + process.exit(1); 80 + } 81 + 82 + const syllables = []; 83 + for (let i = sectionStart + 1; i < scoreLines.length; i++) { 84 + const line = scoreLines[i].trim(); 85 + if (!line) break; // empty line ends section 86 + if (line.startsWith("#")) continue; 87 + if (/^[a-z]+ \d/i.test(line)) break; // next section header 88 + for (const tok of line.split(/\s+/)) { 89 + // Match note:syllable*weight, where note is something like D3, G#3, Eb4 90 + const m = 
tok.match(/^([A-Ga-g][#b]?-?\d):(.+?)\*(\d+(?:\.\d+)?)$/); 91 + if (!m) continue; 92 + syllables.push({ 93 + note: m[1], 94 + raw: m[2], 95 + weight: Number(m[3]), 96 + }); 97 + } 98 + } 99 + 100 + if (syllables.length === 0) { 101 + console.error(`✗ no syllables found under section '${SECTION}'`); 102 + process.exit(1); 103 + } 104 + 105 + // ── Probe true audio duration (sanity check only) ─────────────────── 106 + const probe = spawnSync("ffprobe", [ 107 + "-v", "error", "-show_entries", "format=duration", 108 + "-of", "default=noprint_wrappers=1:nokey=1", AUDIO, 109 + ], { encoding: "utf8" }); 110 + const audioDur = Number(probe.stdout.trim()); 111 + 112 + // 29-color emotional arc, extended from 26 to cover the per-syllable 113 + // expansion. Verse 1 (awakening) → verse 2 (humility) → verse 3 114 + // (climbing) → verse 4 (revelation). 115 + const EMOTIONAL_COLORS = [ 116 + // Awakening / morning 117 + { bg: "peachpuff", letters: "saddlebrown" }, 118 + { bg: "moccasin", letters: "darkred" }, 119 + { bg: "wheat", letters: "indigo" }, 120 + { bg: "khaki", letters: "darkolivegreen" }, 121 + { bg: "palegoldenrod", letters: "maroon" }, 122 + { bg: "lightyellow", letters: "darkgoldenrod" }, 123 + { bg: "lemonchiffon", letters: "darkslateblue" }, 124 + { bg: "papayawhip", letters: "saddlebrown" }, 125 + // Humility / grounded 126 + { bg: "burlywood", letters: "darkslategray" }, 127 + { bg: "tan", letters: "ivory" }, 128 + { bg: "rosybrown", letters: "white" }, 129 + { bg: "thistle", letters: "indigo" }, 130 + { bg: "lavender", letters: "darkviolet" }, 131 + { bg: "mistyrose", letters: "maroon" }, 132 + { bg: "plum", letters: "ivory" }, 133 + // Climbing / seeking 134 + { bg: "skyblue", letters: "navy" }, 135 + { bg: "lightblue", letters: "midnightblue" }, 136 + { bg: "mediumturquoise", letters: "darkslategray" }, 137 + { bg: "mediumaquamarine", letters: "ivory" }, 138 + { bg: "lightseagreen", letters: "lemonchiffon" }, 139 + { bg: "palegreen", letters: "darkgreen" 
}, 140 + { bg: "lightgreen", letters: "darkolivegreen" }, 141 + { bg: "aquamarine", letters: "darkslategray" }, 142 + // Revelation / bright 143 + { bg: "hotpink", letters: "white" }, 144 + { bg: "deeppink", letters: "lemonchiffon" }, 145 + { bg: "violet", letters: "white" }, 146 + { bg: "orchid", letters: "ivory" }, 147 + { bg: "salmon", letters: "white" }, 148 + { bg: "gold", letters: "black" }, 149 + ]; 150 + 151 + const TYPOGRAPHY_STYLES = [ 152 + "chunky pixel-art block letters, fat strokes, square pixels", 153 + "narrow tall pixel-art letters, condensed, square pixels", 154 + "wide squat pixel-art letters, low-resolution display style", 155 + "outlined pixel-art letters, hollow centers, single-pixel borders", 156 + "8-bit terminal pixel-art letters, retro arcade style", 157 + "thick rounded pixel-art letters, friendly chunky bitmap", 158 + ]; 159 + 160 + // ── Build slides directly from the score ───────────────────────────── 161 + const beatSec = 60.0 / BPM; 162 + let beatPos = 0; 163 + const slides = syllables.map((syl, i) => { 164 + const start = beatPos * beatSec; 165 + beatPos += syl.weight; 166 + const end = beatPos * beatSec; 167 + // Strip leading/trailing hyphens for the displayed text — those 168 + // are score-syntax markers indicating multi-syllable continuity. 
169 + const visible = syl.raw.replace(/^-|-$/g, "").replace(/[.,!?;:]/g, ""); 170 + const colorIdx = i % EMOTIONAL_COLORS.length; 171 + const typoIdx = i % TYPOGRAPHY_STYLES.length; 172 + const dur = end - start; 173 + const transitionMs = Math.round(Math.max(120, Math.min(450, dur * 280))); 174 + return { 175 + i, 176 + start: Number(start.toFixed(3)), 177 + end: Number(end.toFixed(3)), 178 + duration: Number(dur.toFixed(3)), 179 + text: visible, 180 + rawText: syl.raw, 181 + note: syl.note, 182 + weight: syl.weight, 183 + image: `word-${String(i).padStart(3, "0")}.jpg`, 184 + transition: "slideleft", 185 + transitionMs, 186 + bgColor: EMOTIONAL_COLORS[colorIdx].bg, 187 + letterColor: EMOTIONAL_COLORS[colorIdx].letters, 188 + typography: TYPOGRAPHY_STYLES[typoIdx], 189 + }; 190 + }); 191 + 192 + const totalScoreSec = beatPos * beatSec; 193 + const storyboard = { 194 + schema: "ac/big-pictures/storyboard@2", 195 + slug: SLUG, 196 + audio: AUDIO.replace(`${POP}/`, "pop/"), 197 + score: SCORE_PATH.replace(`${POP}/`, "pop/"), 198 + bpm: BPM, 199 + section: SECTION, 200 + // duration is the longer of the score total and the probed audio 201 + // length (the file may carry trailing silence); scoreDuration keeps the score-locked cycle. 
202 + duration: Number(Math.max(totalScoreSec, audioDur).toFixed(3)), 203 + scoreDuration: Number(totalScoreSec.toFixed(3)), 204 + audioDuration: Number(audioDur.toFixed(3)), 205 + resolution: { w: 1080, h: 1920 }, 206 + framerate: 30, 207 + imageDir: IMG_DIR.replace(`${POP}/`, "pop/"), 208 + defaults: { 209 + transition: "slideleft", 210 + transitionMs: 220, 211 + fontFamily: "/System/Library/Fonts/Supplemental/Futura.ttc", 212 + }, 213 + slides, 214 + }; 215 + 216 + writeFileSync(OUT, JSON.stringify(storyboard, null, 2)); 217 + console.log(`✓ ${OUT}`); 218 + console.log(` ${slides.length} slides · score=${totalScoreSec.toFixed(2)}s audio=${audioDur.toFixed(2)}s · ${BPM} BPM`); 219 + console.log(` first 5:`); 220 + for (const s of slides.slice(0, 5)) { 221 + console.log(` ${String(s.i).padStart(2)} '${s.text.padEnd(8)}' ${s.start.toFixed(2)}-${s.end.toFixed(2)}s (${s.duration.toFixed(2)}s × ${s.weight}b) → ${s.note}`); 222 + } 223 + console.log(` last 5:`); 224 + for (const s of slides.slice(-5)) { 225 + console.log(` ${String(s.i).padStart(2)} '${s.text.padEnd(8)}' ${s.start.toFixed(2)}-${s.end.toFixed(2)}s (${s.duration.toFixed(2)}s × ${s.weight}b) → ${s.note}`); 226 + }
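The score-to-timeline conversion in storyboard.mjs is a running beat counter multiplied by `60 / BPM`. Here is a hedged Python sketch of that arithmetic for a single score line. `slides_from_line` is a hypothetical name, and the weights in the usage note are invented; only the `note:syllable*weight` token shape and the a-/ma-/zing on D3/G3/B3 example come from the source. `math.floor(x + 0.5)` stands in for JavaScript's `Math.round`, because Python's built-in `round` rounds halves to even.

```python
import math
import re

# Same token shape the storyboard script matches: note:syllable*weight
TOKEN_RE = re.compile(r"^([A-Ga-g][#b]?-?\d):(.+?)\*(\d+(?:\.\d+)?)$")


def slides_from_line(line, bpm):
    """Turn one score line into per-syllable slides with start/end seconds."""
    beat_sec = 60.0 / bpm
    beat_pos = 0.0
    slides = []
    for tok in line.split():
        m = TOKEN_RE.match(tok)
        if not m:
            continue  # silently skip anything that is not a score token
        note, raw, weight = m.group(1), m.group(2), float(m.group(3))
        start = beat_pos * beat_sec
        beat_pos += weight
        end = beat_pos * beat_sec
        dur = end - start
        # Hyphens mark multi-syllable continuity; they are not displayed
        text = re.sub(r"[.,!?;:]", "", re.sub(r"^-|-$", "", raw))
        slides.append({
            "text": text,
            "note": note,
            "start": round(start, 3),
            "end": round(end, 3),
            # 280 ms per second of hold, clamped to 120..450 ms
            "transitionMs": math.floor(max(120, min(450, dur * 280)) + 0.5),
        })
    return slides
```

At 60 BPM one beat is one second, so `D3:a-*1 G3:-ma-*1 B3:-zing*2` yields slides "a", "ma", "zing" starting at 0, 1, and 2 seconds, with the two-beat "zing" hitting the 450 ms transition cap.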
+637
pop/bin/tiktok.mjs
··· 1 + #!/usr/bin/env node 2 + // tiktok.mjs — render a 9:16 TikTok video from a storyboard.json. 3 + // 4 + // The storyboard is the source of truth for slide timing, resolution, 5 + // framerate, and per-slide content. Generate it first via 6 + // `bin/storyboard.mjs --slug <slug>`. This renderer: 7 + // 8 + // 1. Reads <slug>.storyboard.json 9 + // 2. For each slide, generates an AI-rendered word image via FLUX 10 + // (cached in storyboard.imageDir; only re-renders missing slots) 11 + // 3. Concats into mp4 at the storyboard's framerate / resolution, 12 + // with each slide held for its (end - start) duration from the 13 + // beat-mode timeline 14 + // 4. Self-tests output frames against the storyboard timing 15 + // 16 + // Usage: 17 + // node bin/storyboard.mjs --slug amazing # generate storyboard 18 + // node bin/tiktok.mjs --slug amazing # render video 19 + 20 + import { spawnSync } from "node:child_process"; 21 + import { existsSync, readFileSync, writeFileSync, mkdirSync, rmSync } from "node:fs"; 22 + import { resolve } from "node:path"; 23 + import { createHash } from "node:crypto"; 24 + 25 + const NVIDIA_KEY = readFileSync( 26 + "/Users/jas/aesthetic-computer/aesthetic-computer-vault/.env", 27 + "utf8", 28 + ).match(/^NVIDIA_API_KEY=(\S+)/m)?.[1]; 29 + if (!NVIDIA_KEY) { 30 + console.error("✗ NVIDIA_API_KEY not found in vault .env"); 31 + process.exit(1); 32 + } 33 + 34 + const flags = {}; 35 + for (let i = 0; i < process.argv.length; i++) { 36 + const a = process.argv[i]; 37 + if (a.startsWith("--")) flags[a.slice(2)] = process.argv[i + 1]; 38 + } 39 + 40 + const SLUG = flags.slug || "amazing"; 41 + const POP = "/Users/jas/aesthetic-computer/pop"; 42 + const STORYBOARD_PATH = flags.storyboard 43 + ? resolve(process.cwd(), flags.storyboard) 44 + : `${POP}/big-pictures/out/${SLUG}.storyboard.json`; 45 + const OUT = flags.out 46 + ? 
resolve(process.cwd(), flags.out) 47 + : `${POP}/big-pictures/out/${SLUG}-tiktok.mp4`; 48 + 49 + if (!existsSync(STORYBOARD_PATH)) { 50 + console.error(`✗ storyboard missing: ${STORYBOARD_PATH}`); 51 + console.error(` generate with: node bin/storyboard.mjs --slug ${SLUG}`); 52 + process.exit(1); 53 + } 54 + const sb = JSON.parse(readFileSync(STORYBOARD_PATH, "utf8")); 55 + console.log(`→ storyboard: ${sb.slug} · ${sb.slides.length} slides · ${sb.duration}s @ ${sb.framerate}fps · ${sb.resolution.w}×${sb.resolution.h}`); 56 + 57 + // Resolve relative paths from the storyboard 58 + function resolvePath(p) { 59 + return p.startsWith("pop/") 60 + ? `${POP}/${p.slice(4)}` 61 + : resolve(process.cwd(), p); 62 + } 63 + const AUDIO = resolvePath(sb.audio); 64 + const IMG_DIR = resolvePath(sb.imageDir); 65 + mkdirSync(IMG_DIR, { recursive: true }); 66 + 67 + const W = sb.resolution.w === 1080 ? 768 : 1024; // FLUX aspect-matched 68 + const H = sb.resolution.h === 1920 ? 1344 : 1024; 69 + 70 + // (Color theme + typography come from the storyboard per-slide now — 71 + // emotional color arc with no repeats, plus serif/sans/mono variations.) 
72 + 73 + async function flux(prompt, seed) { 74 + const res = await fetch( 75 + "https://ai.api.nvidia.com/v1/genai/black-forest-labs/flux.1-schnell", 76 + { 77 + method: "POST", 78 + headers: { 79 + Authorization: `Bearer ${NVIDIA_KEY}`, 80 + "Content-Type": "application/json", 81 + Accept: "application/json", 82 + }, 83 + body: JSON.stringify({ prompt, cfg_scale: 0, width: W, height: H, seed, steps: 4 }), 84 + }, 85 + ); 86 + if (!res.ok) throw new Error(`flux ${res.status}: ${await res.text()}`); 87 + const j = await res.json(); 88 + const b64 = j.artifacts?.[0]?.base64 || j.image?.replace(/^data:image\/\w+;base64,/, "") || j.b64_json; 89 + if (!b64) throw new Error("flux: no image"); 90 + return Buffer.from(b64, "base64"); 91 + } 92 + 93 + // Glyph-count sanity check: extract characters via the same algorithm 94 + // the renderer uses; reject images that won't yield a clean per-char 95 + // composite (missing letters, extra stray blobs, or bad aspect ratios). 96 + function validateGlyphs(imagePath, expectedWord) { 97 + const r = spawnSync( 98 + `${POP}/.venv/bin/python`, 99 + [`${POP}/bin/validate_word.py`, imagePath, expectedWord], 100 + { encoding: "utf8" }, 101 + ); 102 + if (r.status !== 0) return { ok: false, diagnostic: "validate exit !=0" }; 103 + try { 104 + return JSON.parse(r.stdout.trim().split("\n").pop()); 105 + } catch { 106 + return { ok: false, diagnostic: "validate parse" }; 107 + } 108 + } 109 + 110 + // Dump the extracted glyphs of a word image to disk as separate PNGs. 111 + // Returns an array of {path, letter} so per-character OCR can ask 112 + // "what letter is this?" on each one. 
113 + function dumpGlyphs(imagePath, expectedWord, outDir) { 114 + const r = spawnSync( 115 + `${POP}/.venv/bin/python`, 116 + ["-c", ` 117 + import sys, json, os 118 + sys.path.insert(0, '${POP}/bin') 119 + from render_frames import extract_glyphs 120 + img = '${imagePath}' 121 + word = '${expectedWord}'.lower() 122 + letters = [c for c in word if c.isalpha()] 123 + out_dir = '${outDir}' 124 + os.makedirs(out_dir, exist_ok=True) 125 + glyphs = extract_glyphs(img) 126 + results = [] 127 + for i, g in enumerate(glyphs): 128 + if i >= len(letters): break 129 + p = os.path.join(out_dir, f'glyph_{i:02d}_{letters[i]}.png') 130 + g['img'].save(p) 131 + results.append({'path': p, 'letter': letters[i]}) 132 + print(json.dumps(results)) 133 + `], 134 + { encoding: "utf8" }, 135 + ); 136 + if (r.status !== 0) return []; 137 + try { 138 + return JSON.parse(r.stdout.trim().split("\n").pop()); 139 + } catch { 140 + return []; 141 + } 142 + } 143 + 144 + // Per-letter repair: when a winning image has correct glyph count but 145 + // specific glyphs are topologically wrong (e.g. an 'a' rendered as a 146 + // solid block), generate a fresh single-letter FLUX image for each bad 147 + // slot and composite it into the word image. Keeps the FLUX aesthetic 148 + // (no font fallback) — we just patch pixel patches. 149 + async function repairLetters(wordImagePath, slide, topologyFailures) { 150 + const bg = slide.bgColor || "cream"; 151 + const letters = slide.letterColor || "navy"; 152 + const typography = slide.typography || 153 + "chunky pixel-art block letters, fat strokes, square pixels"; 154 + let fixed = 0; 155 + for (const fail of topologyFailures) { 156 + const slotIdx = fail[0]; 157 + const letter = fail[1]; 158 + // Generate a single-letter image, validate it has the right topology, 159 + // up to 3 attempts. 
160 + const prompt = 161 + `the single capital letter shape "${letter.toUpperCase()}" / lowercase letter "${letter}" ` + 162 + `rendered LARGE and CENTERED, in ${typography}, ` + 163 + `STRICTLY TWO-TONE: solid ${letters} pixels and solid ${bg} background, ` + 164 + `the letter must show its proper anatomy — ` + 165 + (letter === 'a' || letter === 'e' || letter === 'o' ? "with a clearly visible enclosed counter (open inside the letter)" : "with proper letterform") + 166 + `, no other text, no other characters, no decorations, ` + 167 + `large bold pixel-art rendering, perfectly clean letterform, ` + 168 + `low-resolution pixel-perfect bitmap, 90s indie computing, ` + 169 + `NO script, NO cursive, NO connected strokes, NO gradient, NO texture, NO shadow`; 170 + 171 + let bestLetter = null; 172 + let bestRatio = -1; 173 + const expectedMinHole = LETTER_HOLE_MIN_JS[letter] || 0; 174 + for (let attempt = 0; attempt < 3; attempt++) { 175 + const buf = await flux(prompt, 9000 + slotIdx * 31 + attempt * 7919); 176 + if (buf.length < 8000) continue; 177 + // Save to a temp file so the Python validator can extract + score 178 + const tmp = `/tmp/repair-${slide.i}-${slotIdx}-${attempt}.jpg`; 179 + writeFileSync(tmp, buf); 180 + const r = spawnSync( 181 + `${POP}/.venv/bin/python`, 182 + ["-c", ` 183 + import sys, json 184 + sys.path.insert(0, '${POP}/bin') 185 + from validate_word import hole_ratio 186 + from render_frames import extract_glyphs 187 + g = extract_glyphs('${tmp}') 188 + if not g: 189 + print(json.dumps({'ok': False, 'reason': 'no glyph'})) 190 + else: 191 + g.sort(key=lambda x: x['w']*x['h'], reverse=True) 192 + r = hole_ratio(g[0]['img']) 193 + print(json.dumps({'ok': True, 'hole_ratio': r, 'w': int(g[0]['w']), 'h': int(g[0]['h'])})) 194 + `], 195 + { encoding: "utf8" }, 196 + ); 197 + let v; 198 + try { v = JSON.parse(r.stdout.trim().split("\n").pop()); } 199 + catch { continue; } 200 + if (!v.ok) continue; 201 + if (v.hole_ratio >= expectedMinHole && 
v.hole_ratio > bestRatio) { 202 + bestRatio = v.hole_ratio; 203 + bestLetter = tmp; 204 + } else if (bestLetter === null && v.hole_ratio > bestRatio) { 205 + bestRatio = v.hole_ratio; 206 + bestLetter = tmp; 207 + } 208 + // Early-out on a clear pass 209 + if (v.hole_ratio >= expectedMinHole * 1.5) break; 210 + } 211 + if (!bestLetter) { 212 + console.log(` repair ${slotIdx} '${letter}': no valid replacement found`); 213 + continue; 214 + } 215 + // Run repair_letter.py to composite 216 + const r = spawnSync( 217 + `${POP}/.venv/bin/python`, 218 + [ 219 + `${POP}/bin/repair_letter.py`, 220 + "--word-img", wordImagePath, 221 + "--letter-img", bestLetter, 222 + "--slot", String(slotIdx), 223 + "--expected-word", slide.text.toLowerCase().replace(/[^a-z]/g, ""), 224 + "--bg-color", bg, 225 + "--letters-color", letters, 226 + ], 227 + { encoding: "utf8" }, 228 + ); 229 + try { 230 + const out = JSON.parse(r.stdout.trim().split("\n").pop()); 231 + if (out.ok) { 232 + fixed++; 233 + console.log(` repair ${slotIdx} '${letter}': hole_ratio=${bestRatio.toFixed(3)} → patched`); 234 + } else { 235 + console.log(` repair ${slotIdx} '${letter}': composite failed (${out.reason})`); 236 + } 237 + } catch (e) { 238 + console.log(` repair ${slotIdx} '${letter}': composite parse failed`); 239 + } 240 + } 241 + return fixed; 242 + } 243 + 244 + // Mirror of LETTER_HOLE_MIN in validate_word.py for the JS side 245 + const LETTER_HOLE_MIN_JS = { 246 + 'a': 0.018, 'b': 0.040, 'd': 0.040, 'e': 0.015, 247 + 'g': 0.035, 'o': 0.050, 'p': 0.035, 'q': 0.035, 248 + 'A': 0.025, 'B': 0.025, 'D': 0.040, 'O': 0.050, 249 + 'P': 0.025, 'Q': 0.035, 'R': 0.018, 250 + }; 251 + 252 + // Per-character vision OCR — crop each extracted glyph, ask the vision 253 + // model what single letter it shows. (Currently unused — vision models 254 + // hallucinate badly on isolated pixel-art letters; topology + per-letter 255 + // repair are the actual quality gates. Kept for diagnostic use.) 
256 + async function perCharOCR(imagePath, expectedWord, outDir) { 257 + const glyphs = dumpGlyphs(imagePath, expectedWord, outDir); 258 + if (glyphs.length === 0) { 259 + return { ok: false, score: 0, results: [], reason: "no glyphs extracted" }; 260 + } 261 + const results = []; 262 + let correct = 0; 263 + for (const g of glyphs) { 264 + const buf = readFileSync(g.path); 265 + const dataUrl = `data:image/png;base64,${buf.toString("base64")}`; 266 + const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", { 267 + method: "POST", 268 + headers: { 269 + Authorization: `Bearer ${NVIDIA_KEY}`, 270 + "Content-Type": "application/json", 271 + Accept: "application/json", 272 + }, 273 + body: JSON.stringify({ 274 + model: "meta/llama-3.2-11b-vision-instruct", 275 + messages: [{ 276 + role: "user", 277 + content: [ 278 + { 279 + type: "text", 280 + text: `What single letter is shown in this image? Answer with ONLY the lowercase letter, nothing else. ` + 281 + `If it doesn't look like a clear, traditional letterform, answer "x".`, 282 + }, 283 + { type: "image_url", image_url: { url: dataUrl } }, 284 + ], 285 + }], 286 + max_tokens: 4, 287 + temperature: 0, 288 + }), 289 + }); 290 + let seen = "?"; 291 + if (res.ok) { 292 + const j = await res.json(); 293 + seen = ((j.choices?.[0]?.message?.content || "").trim().toLowerCase().match(/[a-z]/) || ["?"])[0]; 294 + } 295 + const match = seen === g.letter.toLowerCase(); 296 + if (match) correct++; 297 + results.push({ expected: g.letter, seen, ok: match }); 298 + } 299 + const score = correct / glyphs.length; 300 + return { 301 + ok: score === 1.0, 302 + score, 303 + results, 304 + reason: score < 1.0 ? `${results.filter(r => !r.ok).map(r => `${r.expected}→${r.seen}`).join(",")}` : "ok", 305 + }; 306 + } 307 + 308 + // OCR via NVIDIA vision LLM — three-question scored pass: 309 + // 1. What word is written? (returns the lowercase reading) 310 + // 2. 
Is the word fully visible, completely spelled, no truncation, 311 + // no extra letters or characters? (yes/no) 312 + // 3. Are the letters in standard, traditional, non-broken letterforms, 313 + // each one clearly distinct? (yes/no) 314 + // Returns a score (0..5) instead of a hard pass/fail so tiktok.mjs can 315 + // pick the best of N attempts probabilistically. 316 + async function ocrValidate(imageBuf, expectedWord) { 317 + const dataUrl = `data:image/jpeg;base64,${imageBuf.toString("base64")}`; 318 + async function ask(text) { 319 + const res = await fetch("https://integrate.api.nvidia.com/v1/chat/completions", { 320 + method: "POST", 321 + headers: { 322 + Authorization: `Bearer ${NVIDIA_KEY}`, 323 + "Content-Type": "application/json", 324 + Accept: "application/json", 325 + }, 326 + body: JSON.stringify({ 327 + model: "meta/llama-3.2-11b-vision-instruct", 328 + messages: [{ 329 + role: "user", 330 + content: [ 331 + { type: "text", text }, 332 + { type: "image_url", image_url: { url: dataUrl } }, 333 + ], 334 + }], 335 + max_tokens: 16, 336 + temperature: 0, 337 + }), 338 + }); 339 + if (!res.ok) return null; 340 + const j = await res.json(); 341 + return (j.choices?.[0]?.message?.content || "").trim(); 342 + } 343 + const seen = (await ask( 344 + `Read the word in this image. Answer with ONLY the lowercase word as you see it. ` + 345 + `If letters are missing or cut off, transcribe ONLY what's actually visible. ` + 346 + `If unreadable, answer "unreadable".` 347 + ) || "").toLowerCase().replace(/[^a-z']/g, "").trim() || null; 348 + 349 + // Strict equality. Apostrophe-stripped form also counts (e.g. "i'm" ≈ "im"). 
350 + const wordStripped = expectedWord.replace(/'/g, ""); 351 + const seenStripped = (seen || "").replace(/'/g, ""); 352 + const exact = seen === expectedWord || seenStripped === wordStripped; 353 + 354 + // Score: 355 + // +3 exact spelling match 356 + // +1 close (Levenshtein ≤ 1) but not exact 357 + // +1 completeness yes 358 + // +1 letterform standard yes 359 + let score = 0; 360 + if (exact) { 361 + score += 3; 362 + } else if (seen && levenshtein(seen, expectedWord) <= 1) { 363 + score += 1; 364 + } 365 + 366 + // Completeness check 367 + const complete = (await ask( 368 + `Is the word "${expectedWord}" written in this image fully visible and complete, ` + 369 + `with no missing letters, no truncation, no extra letters or symbols? ` + 370 + `Answer with ONLY "yes" or "no".` 371 + ) || "").toLowerCase(); 372 + const completeYes = complete.includes("yes"); 373 + if (completeYes) score += 1; 374 + 375 + // Letterform fidelity: standard, traditional, distinct, non-broken 376 + const fidelity = (await ask( 377 + `Look at the letters in this image. Are they standard, traditional letterforms — ` + 378 + `clearly readable, each letter distinct from the others, no broken or fragmented shapes, ` + 379 + `no decorative or unusual stylization that would make a letter hard to recognize? ` + 380 + `Answer with ONLY "yes" or "no".` 381 + ) || "").toLowerCase(); 382 + const fidelityYes = fidelity.includes("yes"); 383 + if (fidelityYes) score += 1; 384 + 385 + return { 386 + ok: exact && completeYes, 387 + seen, 388 + exact, 389 + complete: completeYes, 390 + fidelity: fidelityYes, 391 + score, 392 + reason: exact ? (completeYes ? 
"ok" : `incomplete: '${complete.slice(0, 24)}'`) 393 + : `mismatch: saw '${seen}'`, 394 + }; 395 + } 396 + 397 + // Tiny Levenshtein for "almost matches" credit 398 + function levenshtein(a, b) { 399 + if (a === b) return 0; 400 + const m = a.length, n = b.length; 401 + if (!m) return n; 402 + if (!n) return m; 403 + const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...Array(n).fill(0)]); 404 + for (let j = 0; j <= n; j++) dp[0][j] = j; 405 + for (let i = 1; i <= m; i++) { 406 + for (let j = 1; j <= n; j++) { 407 + dp[i][j] = a[i - 1] === b[j - 1] 408 + ? dp[i - 1][j - 1] 409 + : 1 + Math.min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]); 410 + } 411 + } 412 + return dp[m][n]; 413 + } 414 + 415 + console.log(`→ generating word images via NVIDIA FLUX + OCR validation (cached in ${IMG_DIR.replace(POP + "/", "")})…`); 416 + const seenHashes = new Set(); 417 + let nFresh = 0, nCached = 0, nOcrPass = 0, nOcrFail = 0; 418 + const MAX_ATTEMPTS = 9; 419 + 420 + function spellOut(word) { 421 + // Insert spaces between letters: "amazing" → "A M A Z I N G" 422 + return word.toUpperCase().split("").join(" "); 423 + } 424 + 425 + for (let i = 0; i < sb.slides.length; i++) { 426 + const slide = sb.slides[i]; 427 + const path = `${IMG_DIR}/${slide.image}`; 428 + if (existsSync(path)) { 429 + const buf = readFileSync(path); 430 + const h = createHash("sha256").update(buf).digest("hex").slice(0, 12); 431 + seenHashes.add(h); 432 + nCached++; 433 + continue; 434 + } 435 + const word = slide.text.toLowerCase().replace(/[^a-z']/g, ""); 436 + const bg = slide.bgColor || "cream"; 437 + const letters = slide.letterColor || "navy"; 438 + const typography = slide.typography || "chunky pixel-art block letters, fat strokes, square pixels"; 439 + // Strict TWO-TONE constraint so per-character extraction has a clean 440 + // foreground/background split. Pixelated, DISCONNECTED letters 441 + // (no script / cursive — characters need to extract individually). 
442 + const prompt = 443 + `the single word "${word}" (spelled ${spellOut(word)}) ` + 444 + `rendered in ${typography}, ` + 445 + `STRICTLY TWO-TONE: only solid ${letters} pixels and solid ${bg} pixels, no other colors, ` + 446 + `${letters} colored letters on a perfectly uniform solid flat ${bg} background, ` + 447 + `entire frame is a uniform ${bg} field except for the centered text "${word}", ` + 448 + `large bold typography, exactly the letters ${spellOut(word)} in order, ` + 449 + `each letter completely separated from the next with clear gaps between them, ` + 450 + `NO script, NO cursive, NO connected letters, NO ligatures, ` + 451 + `NO gradient, NO texture, NO noise, NO shading, NO anti-aliasing, ` + 452 + `NO other text, NO other words, NO objects, NO shadow, NO border, ` + 453 + `low-resolution pixel-perfect bitmap aesthetic, 90s indie computing`; 454 + 455 + // Backup prompt for words that hit FLUX's safety filter on the 456 + // verbose form (e.g. "but" came back 6KB every time). Direct probe 457 + // showed the *verbosity* itself trips FLUX; minimal prompts pass. 458 + // Letters get recolored downstream by extract_glyphs so we don't 459 + // need to specify colors here. bg is sampled from the saved image's 460 + // corners, so FLUX's default black bg is fine. 461 + const altPrompt = `pixel-art typography spelling "${word}"`; 462 + 463 + // Probabilistic best-pick: run ALL attempts, score each on 464 + // ocr (0..5) + glyph extraction (0..5) 465 + // and keep the highest scorer. This handles the "spelling mostly right 466 + // but with one weird letterform" case far better than first-pass-pass. 467 + const attempts = []; 468 + let tinyCount = 0; 469 + for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) { 470 + // After 2 consecutive tiny outputs, switch to the simplified prompt 471 + // that bypasses the verbose constraint cluster which sometimes 472 + // trips FLUX's safety filter. 473 + const usePrompt = tinyCount >= 2 ? 
altPrompt : prompt; 474 + let buf; 475 + try { 476 + buf = await flux(usePrompt, 200 + i * 7 + attempt * 1000); 477 + } catch (err) { 478 + console.log(` ${String(i).padStart(2)} '${word}' a${attempt}: flux error: ${String(err).slice(0, 80)}`); 479 + continue; 480 + } 481 + if (buf.length < 8000) { 482 + tinyCount++; 483 + console.log(` ${String(i).padStart(2)} '${word}' a${attempt}: tiny ${(buf.length / 1024).toFixed(0)}KB (safety filter?) ${tinyCount >= 2 ? "[switching to alt prompt]" : ""}`); 484 + continue; 485 + } 486 + tinyCount = 0; 487 + const h = createHash("sha256").update(buf).digest("hex").slice(0, 12); 488 + // Don't reject duplicates — if FLUX produced the same image again, 489 + // it might still be the best one. Just skip per-slide dupes. 490 + seenHashes.add(h); 491 + 492 + // Tentatively write so the Python validator can read it 493 + writeFileSync(path, buf); 494 + const ocr = await ocrValidate(buf, word); 495 + const gv = validateGlyphs(path, word); 496 + const score = (ocr.score || 0) + (gv.score || 0); 497 + attempts.push({ buf, h, ocr, gv, score, attempt }); 498 + console.log( 499 + ` ${String(i).padStart(2)} '${word}' a${attempt}: ` + 500 + `ocr=${ocr.score}/5 (seen='${ocr.seen}'${ocr.exact ? " ✓" : ""}` + 501 + `${ocr.complete ? " complete" : ""}${ocr.fidelity ? " standard" : ""})` + 502 + ` glyphs=${gv.found_n}/${gv.expected_n} score=${gv.score}/6 ` + 503 + `${gv.topology_failures && gv.topology_failures.length ? "[topology✗] " : ""}` + 504 + `→ total ${score}/11` 505 + ); 506 + // Always run all attempts — the topology check is strict enough now 507 + // that an "11/11 perfect" early-out missed cases where a higher 508 + // attempt was the only one with valid letterforms. 
509 + } 510 + 511 + if (attempts.length === 0) { 512 + // No usable FLUX output at all; fall back to previous slide's image 513 + if (i > 0) { 514 + const prevPath = `${IMG_DIR}/${sb.slides[i - 1].image}`; 515 + if (existsSync(prevPath)) { 516 + writeFileSync(path, readFileSync(prevPath)); 517 + nFresh++; 518 + nOcrFail++; 519 + console.warn(` ${String(i).padStart(2)} '${word}' fallback (previous-slide, no attempts)`); 520 + continue; 521 + } 522 + } 523 + console.error(` ✗ ${i} '${word}' no usable output — pipeline will fail`); 524 + continue; 525 + } 526 + 527 + // Hard priority chain: 528 + // 1. solid-bg attempts win over textured-bg ones (uniform bg > wood grain) 529 + // 2. count-correct attempts win over count-mismatch 530 + // 3. higher total score 531 + // 4. lower attempt number (cheaper / earlier was better) 532 + attempts.sort((a, b) => { 533 + const aSolid = (a.gv.bg_std ?? 99) <= 5.0 ? 1 : 0; 534 + const bSolid = (b.gv.bg_std ?? 99) <= 5.0 ? 1 : 0; 535 + if (aSolid !== bSolid) return bSolid - aSolid; 536 + const aCount = a.gv.found_n === a.gv.expected_n ? 1 : 0; 537 + const bCount = b.gv.found_n === b.gv.expected_n ? 1 : 0; 538 + if (aCount !== bCount) return bCount - aCount; 539 + return b.score - a.score || a.attempt - b.attempt; 540 + }); 541 + const best = attempts[0]; 542 + writeFileSync(path, best.buf); 543 + nFresh++; 544 + 545 + // Per-letter repair pass: any topology failures in the winner mean 546 + // the right letter count was achieved but specific glyph(s) were 547 + // rendered as wrong shapes (block-instead-of-a, missing-counter-b, 548 + // etc.). Target each one with a single-letter FLUX gen + composite. 
549 + let repaired = 0; 550 + if (best.gv.ok) { 551 + nOcrPass++; 552 + } else if (best.gv.topology_failures && best.gv.topology_failures.length > 0) { 553 + repaired = await repairLetters(path, slide, best.gv.topology_failures); 554 + // Don't re-validate post-repair — small replacement letters lose 555 + // their counters under the validator's dilation during extraction, 556 + // producing false negatives. We trust the per-letter generator 557 + // already verified hole_ratio in the *replacement* before pasting. 558 + if (repaired === best.gv.topology_failures.length) nOcrPass++; 559 + else nOcrFail++; 560 + console.log(` ${String(i).padStart(2)} '${word}' winner: a${best.attempt} score=${best.score}/11, repaired ${repaired}/${best.gv.topology_failures.length} letter(s)`); 561 + continue; 562 + } else { 563 + nOcrFail++; 564 + } 565 + console.log(` ${String(i).padStart(2)} '${word}' winner: a${best.attempt} score=${best.score}/11${best.gv.ok ? " ✓" : " (best avail; " + best.gv.diagnostic + ")"}`); 566 + } 567 + console.log(` ${nFresh} fresh (${nOcrPass} OCR ✓, ${nOcrFail} fallback) · ${nCached} cached`); 568 + 569 + // ── Per-character compositor (Python) ──────────────────────────────── 570 + // Glyph extraction → audio amplitude curve → frame-by-frame render 571 + // with bounce + gradient backgrounds + loop closure. ffmpeg encodes 572 + // the resulting PNG sequence + audio. 
const FRAMES_DIR = `${POP}/big-pictures/out/.${SLUG}-frames`;
rmSync(FRAMES_DIR, { recursive: true, force: true });

console.log(`→ rendering frames via per-character compositor…`);
const py = spawnSync(
  `${POP}/.venv/bin/python`,
  [
    `${POP}/bin/render_frames.py`,
    "--storyboard", STORYBOARD_PATH,
    "--img-dir", IMG_DIR,
    "--audio", AUDIO,
    "--frames-dir", FRAMES_DIR,
    "--fps", String(sb.framerate),
  ],
  { stdio: "inherit" },
);
if (py.status !== 0) {
  console.error("✗ render_frames.py failed");
  process.exit(1);
}

console.log(`→ ffmpeg encode @ ${sb.framerate}fps…`);

const ff = spawnSync("ffmpeg", [
  "-hide_banner", "-y", "-loglevel", "error", "-stats",
  "-framerate", String(sb.framerate),
  "-i", `${FRAMES_DIR}/f%05d.png`,
  "-i", AUDIO,
  "-c:v", "libx264", "-preset", "medium", "-crf", "20",
  "-pix_fmt", "yuv420p",
  "-c:a", "aac", "-b:a", "192k",
  "-shortest",
  OUT,
], { stdio: "inherit" });
if (ff.status !== 0) {
  console.error("✗ ffmpeg failed");
  process.exit(1);
}

// ── Verify timing: sample a few slides shortly after their start
// (0.2 s in, or the midpoint if the slide is shorter), expect distinct frames ─
console.log("→ verifying slide timing in output…");
// De-duplicate and bounds-check the sample indices: on short storyboards
// the quartile indices collide (which would report false [DUP]s) and
// index 1 may not exist at all.
const checks = [...new Set([
  0, 1, Math.floor(sb.slides.length / 4), Math.floor(sb.slides.length / 2),
  Math.floor(sb.slides.length * 3 / 4), sb.slides.length - 1,
])].filter((idx) => idx >= 0 && idx < sb.slides.length);
const tmp = `/tmp/tiktok-check-${Date.now()}`;
mkdirSync(tmp, { recursive: true });
const seen = new Set();
let dupes = 0;
for (const idx of checks) {
  const s = sb.slides[idx];
  const t = s.start + Math.min(0.2, s.duration / 2);
  const f = `${tmp}/check-${idx}.png`;
  spawnSync("ffmpeg", [
    "-hide_banner", "-y", "-loglevel", "error",
    "-ss", String(t), "-i", OUT, "-frames:v", "1", f,
  ], { stdio: "ignore" });
  if (!existsSync(f)) continue;
  const h = createHash("sha256").update(readFileSync(f)).digest("hex").slice(0, 12);
  const dup = seen.has(h);
  if (dup) dupes++;
  seen.add(h);
  console.log(`  slide ${String(idx).padStart(2)} '${s.text}' @ ${t.toFixed(2)}s · ${h}${dup ? " [DUP]" : ""}`);
}
rmSync(tmp, { recursive: true, force: true });
if (dupes > 0) console.warn(`  ⚠ ${dupes} duplicate sample frames — slides not changing`);
console.log(`✓ ${OUT}`);
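The "hard priority chain" that tiktok.mjs uses to pick a winner among FLUX attempts can be tried in isolation. The sketch below copies that comparator into a hypothetical `rankAttempts` helper and runs it on made-up attempt records. The helper name, the `BG_STD_MAX` constant, and the sample data are assumptions for illustration and are not part of this commit.

```javascript
// Sketch only: `rankAttempts` and `BG_STD_MAX` are hypothetical names.
// The comparator body mirrors the sort in tiktok.mjs.
const BG_STD_MAX = 5.0; // same threshold as the `<= 5.0` test in the diff

function rankAttempts(attempts) {
  // Copy first so the caller's array is left untouched.
  return [...attempts].sort((a, b) => {
    // 1. solid background beats textured (missing bg_std counts as textured)
    const aSolid = (a.gv.bg_std ?? 99) <= BG_STD_MAX ? 1 : 0;
    const bSolid = (b.gv.bg_std ?? 99) <= BG_STD_MAX ? 1 : 0;
    if (aSolid !== bSolid) return bSolid - aSolid;
    // 2. correct glyph count beats a mismatch
    const aCount = a.gv.found_n === a.gv.expected_n ? 1 : 0;
    const bCount = b.gv.found_n === b.gv.expected_n ? 1 : 0;
    if (aCount !== bCount) return bCount - aCount;
    // 3. higher total score, then 4. lower attempt number
    return b.score - a.score || a.attempt - b.attempt;
  });
}

// Made-up records carrying only the fields the comparator reads.
const sample = [
  { attempt: 0, score: 10, gv: { bg_std: 9.3, found_n: 4, expected_n: 4 } }, // textured bg
  { attempt: 1, score: 8, gv: { bg_std: 1.2, found_n: 3, expected_n: 4 } },  // wrong count
  { attempt: 2, score: 7, gv: { bg_std: 0.9, found_n: 4, expected_n: 4 } },
  { attempt: 3, score: 7, gv: { bg_std: 1.1, found_n: 4, expected_n: 4 } },
];

const order = rankAttempts(sample).map((a) => a.attempt);
console.log(order); // [ 2, 3, 1, 0 ]
```

Note that attempt 0 has the highest raw score but ranks last: the background and glyph-count gates are applied before the score is ever compared, which is what makes the chain "hard" rather than a weighted sum.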
+133
pop/bin/timefit.mjs
#!/usr/bin/env node
// timefit.mjs — uniformly time-compress (or stretch) a vocal stem to
// fit a target bar count at a target BPM. The first pass at the
// "realign" stage of the post-prod pipeline — global tempo only, no
// per-word work yet.
//
// Implements ffmpeg `atempo` chaining (each atempo instance is kept
// within 0.5–2.0, so factors outside that range cascade across
// multiple instances).
// With `--keep-pitch`, the stretch goes through ffmpeg's rubberband
// filter instead (requires an ffmpeg build with librubberband). The
// default is atempo, which also preserves pitch but produces audible
// artifacts above ~1.3x.
//
// Usage:
//   node bin/timefit.mjs <stem.mp3> --bars 4 --bpm 140
//   node bin/timefit.mjs <stem.mp3> --duration 6.86
//   node bin/timefit.mjs <stem.mp3> --bars 4 --bpm 140 --keep-pitch
//   node bin/timefit.mjs <stem.mp3> --bars 4 --bpm 140 --out path.mp3

import { spawnSync } from "node:child_process";
import { existsSync } from "node:fs";
import { resolve, dirname, basename, extname } from "node:path";

function parseArgs(argv) {
  const flags = {};
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a.startsWith("--")) {
      const k = a.slice(2);
      const next = argv[i + 1];
      if (next !== undefined && !next.startsWith("--")) { flags[k] = next; i++; }
      else flags[k] = true;
    } else positional.push(a);
  }
  return { flags, positional };
}

const { flags, positional } = parseArgs(process.argv.slice(2));

// resolve(cwd, "") returns cwd itself, which always exists, so a missing
// positional argument has to be detected before resolving.
const stemPath = positional[0] ? resolve(process.cwd(), positional[0]) : "";
if (!stemPath || !existsSync(stemPath)) {
  console.error("usage: node bin/timefit.mjs <stem.mp3> --bars N --bpm BPM [--keep-pitch] [--out path.mp3]");
  process.exit(1);
}

const stemDir = dirname(stemPath);
const stemBase = basename(stemPath, extname(stemPath));

// ── Source duration ───────────────────────────────────────────────────
const probe = spawnSync(
  "ffprobe",
  ["-v", "error", "-show_entries", "format=duration",
   "-of", "default=noprint_wrappers=1:nokey=1", stemPath],
  { encoding: "utf8" }
);
// stdout is null when ffprobe itself cannot be spawned.
const sourceDur = Number((probe.stdout || "").trim());
if (!(sourceDur > 0)) {
  console.error(`✗ ffprobe could not read duration of ${stemPath}`);
  process.exit(1);
}

// ── Target duration ───────────────────────────────────────────────────
let targetDur = null;
if (flags.duration !== undefined) targetDur = Number(flags.duration);
if (flags.bars !== undefined && flags.bpm !== undefined) {
  const bars = Number(flags.bars);
  const bpm = Number(flags.bpm);
  const beatsPerBar = Number(flags["beats-per-bar"] || 4);
  targetDur = bars * (60 / bpm) * beatsPerBar;
}
if (!(targetDur > 0)) {
  console.error("✗ specify --duration <sec> OR (--bars N --bpm BPM)");
  process.exit(1);
}

const factor = sourceDur / targetDur; // > 1 means speed up (compress)
console.log(`→ source ${sourceDur.toFixed(3)}s · target ${targetDur.toFixed(3)}s · factor ${factor.toFixed(3)}×`);

// ── Build ffmpeg filter chain ─────────────────────────────────────────
// Keep each atempo instance within 0.5..2.0. Cascade for factors outside.
function atempoChain(f) {
  const stages = [];
  let remaining = f;
  while (remaining > 2.0) { stages.push(2.0); remaining /= 2.0; }
  while (remaining < 0.5) { stages.push(0.5); remaining /= 0.5; }
  stages.push(remaining);
  return stages.map((s) => `atempo=${s.toFixed(6)}`).join(",");
}

const KEEP_PITCH = flags["keep-pitch"] === true;
const OUT_PATH = flags.out
  ? resolve(process.cwd(), flags.out)
  : `${stemDir}/${stemBase}-fit${factor >= 1 ? "ter" : "longer"}.mp3`;

let filter;
if (KEEP_PITCH) {
  // rubberband filter preserves pitch. Only available if ffmpeg was
  // built with librubberband.
  filter = `rubberband=tempo=${factor.toFixed(6)}`;
} else {
  filter = atempoChain(factor);
}

console.log(`→ filter · ${filter}`);
console.log(`→ writing · ${OUT_PATH}`);

const ff = spawnSync(
  "ffmpeg",
  ["-hide_banner", "-y", "-loglevel", "error",
   "-i", stemPath,
   "-filter:a", filter,
   "-c:a", "libmp3lame", "-q:a", "3",
   OUT_PATH],
  { stdio: "inherit" }
);
if (ff.status !== 0) {
  if (KEEP_PITCH) {
    console.error("✗ ffmpeg failed — your ffmpeg may not have rubberband. Retry without --keep-pitch.");
  } else {
    console.error("✗ ffmpeg failed");
  }
  process.exit(1);
}

// Verify output duration
const probeOut = spawnSync(
  "ffprobe",
  ["-v", "error", "-show_entries", "format=duration",
   "-of", "default=noprint_wrappers=1:nokey=1", OUT_PATH],
  { encoding: "utf8" }
);
const outDur = Number((probeOut.stdout || "").trim());
const drift = outDur - targetDur;
console.log(`✓ ${OUT_PATH} · ${outDur.toFixed(3)}s (drift ${drift >= 0 ? "+" : ""}${drift.toFixed(3)}s vs target)`);
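The two calculations at the heart of timefit.mjs, the target duration from bars and BPM and the cascaded `atempo` chain, can be checked by hand. The sketch below restates them as standalone functions; `targetDuration` is a hypothetical wrapper around the inline formula, the 8-second stem is a made-up input, and `atempoChain` is copied from the script.

```javascript
// Sketch only: `targetDuration` is a hypothetical name for the inline
// formula `bars * (60 / bpm) * beatsPerBar` used in timefit.mjs.
function targetDuration(bars, bpm, beatsPerBar = 4) {
  return bars * (60 / bpm) * beatsPerBar; // seconds
}

// Copied from timefit.mjs: split a tempo factor into stages that each
// stay within 0.5..2.0 and whose product equals the original factor.
function atempoChain(f) {
  const stages = [];
  let remaining = f;
  while (remaining > 2.0) { stages.push(2.0); remaining /= 2.0; }
  while (remaining < 0.5) { stages.push(0.5); remaining /= 0.5; }
  stages.push(remaining);
  return stages.map((s) => `atempo=${s.toFixed(6)}`).join(",");
}

// 4 bars of 4/4 at 140 BPM: 16 beats at 60/140 s each.
const target = targetDuration(4, 140);
console.log(target.toFixed(2)); // 6.86

// A made-up 8-second stem would need to be sped up by 8 / 6.857...
const factor = 8 / target;
console.log(factor.toFixed(3)); // 1.167

// Factors outside 0.5..2.0 are split across several atempo instances.
console.log(atempoChain(3));   // atempo=2.000000,atempo=1.500000
console.log(atempoChain(0.2)); // atempo=0.500000,atempo=0.500000,atempo=0.800000
```

The first result matches the `--duration 6.86` line in the script's usage comment, which is the same target expressed directly in seconds.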
+313
pop/bin/validate_word.py
#!/usr/bin/env python3
"""
validate_word.py — sanity-check + score a generated word image.

Re-uses the CC-based extract_glyphs() from render_frames.py. Reports:

  expected_n   — letter count from the expected word
  found_n      — number of glyph runs detected (post merge)
  height_cv    — coefficient of variation of glyph heights (0=uniform)
  baseline_cv  — coefficient of variation of glyph y1 (bottom)
  ok           — boolean: passes all hard checks
  score        — 0..5 quality score (higher = better extraction)
  diagnostic   — one-line reason if not ok

Score components:
  +2 if found_n == expected_n
  +1 if height_cv < 0.35 (uniform letter heights)
  +1 if baseline_cv < 0.05 (clean baseline)
  +1 if no slivers (every glyph w >= 6, h >= median*0.45)

Used by tiktok.mjs to (a) reject FLUX outputs that won't extract
cleanly and (b) pick the best of N attempts (MAX_ATTEMPTS in tiktok.mjs).

Usage:
  validate_word.py <image_path> <expected_word>
"""
import json
import sys
from pathlib import Path

import numpy as np
from scipy.ndimage import binary_fill_holes

sys.path.insert(0, str(Path(__file__).parent))
from render_frames import extract_glyphs

# Letter-aware topology: letters that have an enclosed counter.
# Threshold = minimum hole-area ratio (hole_pixels / bbox_area) the letter
# must show. Tuned from the FLUX corpus: real 'a' counters clock in
# around 5–15% of bbox; a fake "block-shape a" comes in under 2%.
41 + LETTER_HOLE_MIN = { 42 + 'a': 0.018, 'b': 0.040, 'd': 0.040, 'e': 0.015, 43 + 'g': 0.035, 'o': 0.050, 'p': 0.035, 'q': 0.035, 44 + 'A': 0.025, 'B': 0.025, 'D': 0.040, 'O': 0.050, 45 + 'P': 0.025, 'Q': 0.035, 'R': 0.018, 46 + } 47 + # Per-letter shape constraints — catches malformed letters that pass 48 + # topology (because they're not supposed to have a counter) but are 49 + # rendered as solid blocks or wrong proportions. 50 + # 51 + # Aspect = w / h. Density = foreground_pixels / bbox_area. 52 + # Tuned permissively so chunky pixel fonts pass; only egregious 53 + # block-shapes get rejected. Pixel art legitimately ranges high in 54 + # density for some letters (e, o), so we set per-letter ceilings. 55 + # Conservative — only catch egregious shape cases. Pixel-art chunky 56 + # fonts produce wider variation than non-pixel typography, so most 57 + # letter ranges have to be very permissive or we get false positives. 58 + # 'w' is the most distinctively-shaped letter (4 strokes with V-gaps) 59 + # so it gets the cleanest density+aspect signals. 
60 + LETTER_ASPECT = { 61 + 'i': (0.08, 0.55), 62 + 'j': (0.15, 0.65), 63 + 'l': (0.08, 0.60), 64 + 'm': (0.80, 1.80), 65 + 'w': (0.85, 1.80), 66 + } 67 + LETTER_DENSITY_MAX = { 68 + 'w': 0.62, 69 + } 70 + 71 + 72 + def hole_ratio(glyph_img): 73 + """Hole pixels / bbox area — measures how much of the bounding box is 74 + empty space enclosed inside the glyph (the counter).""" 75 + arr = np.array(glyph_img) 76 + mask = (arr[:, :, 3] > 128).astype(np.uint8) 77 + if mask.sum() == 0: 78 + return 0.0 79 + filled = binary_fill_holes(mask).astype(np.uint8) 80 + hole_px = int(filled.sum() - mask.sum()) 81 + bbox_area = int(mask.shape[0] * mask.shape[1]) 82 + return hole_px / max(1, bbox_area) 83 + 84 + 85 + def density(glyph_img): 86 + """Foreground pixels / bbox area.""" 87 + arr = np.array(glyph_img) 88 + mask = (arr[:, :, 3] > 128).astype(np.uint8) 89 + return float(mask.sum()) / max(1, mask.shape[0] * mask.shape[1]) 90 + 91 + 92 + def stroke_balance(glyph_img): 93 + """For letters with an enclosed counter (a/b/d/e/g/o/p/q): the 94 + left and right vertical strokes flanking the counter should be 95 + roughly equal thickness. Broken counter shapes (e.g. an 'o' that's 96 + actually a 'C' with the inner area filled solid) produce 3-4× 97 + stroke imbalance — topology check passes but the letter is wrong. 
98 + Returns max/min stroke width ratio across middle rows of the 99 + glyph; ~1.0 = balanced, ≥2.0 = broken.""" 100 + arr = np.array(glyph_img) 101 + mask = (arr[:, :, 3] > 128).astype(int) 102 + h, w = mask.shape 103 + if h < 12 or w < 6: 104 + return 1.0 105 + mid = mask[h // 4: 3 * h // 4] 106 + left_widths = [] 107 + right_widths = [] 108 + for row in mid: 109 + s = int(row.sum()) 110 + # Skip rows that are top/bottom bars (mostly fg) 111 + if s > w * 0.85 or s == 0: 112 + continue 113 + # Walk in from left 114 + lw = 0; in_run = False 115 + for x in range(w): 116 + if row[x]: 117 + in_run = True; lw += 1 118 + elif in_run: 119 + break 120 + # Walk in from right 121 + rw = 0; in_run = False 122 + for x in range(w - 1, -1, -1): 123 + if row[x]: 124 + in_run = True; rw += 1 125 + elif in_run: 126 + break 127 + # Only count rows with TWO separate strokes 128 + if lw > 0 and rw > 0 and lw + rw < s + 2: 129 + left_widths.append(lw) 130 + right_widths.append(rw) 131 + if len(left_widths) < 3: 132 + return 1.0 # not enough two-stroke rows; can't judge 133 + import statistics as st 134 + ml = st.median(left_widths) 135 + mr = st.median(right_widths) 136 + return max(ml, mr) / max(1, min(ml, mr)) 137 + 138 + 139 + def bg_uniformity(img_path): 140 + """Sample 8 16x16 patches around the perimeter and return the 141 + average per-channel std of the combined samples. 
Solid backgrounds 142 + score ~1; textured ones (wood grain, gradients) score 5-15+.""" 143 + from PIL import Image 144 + arr = np.array(Image.open(img_path).convert("RGB")) 145 + h, w, _ = arr.shape 146 + s = 16 147 + positions = [ 148 + (0, 0), (0, w // 2 - s // 2), (0, w - s), 149 + (h // 2 - s // 2, 0), (h // 2 - s // 2, w - s), 150 + (h - s, 0), (h - s, w // 2 - s // 2), (h - s, w - s), 151 + ] 152 + patches = [arr[y0:y0 + s, x0:x0 + s].reshape(-1, 3) for y0, x0 in positions] 153 + samples = np.concatenate(patches) 154 + return float(np.std(samples, axis=0).mean()) 155 + 156 + 157 + def coef_of_var(values): 158 + if not values: 159 + return 0.0 160 + mean = sum(values) / len(values) 161 + if mean == 0: 162 + return 0.0 163 + var = sum((v - mean) ** 2 for v in values) / len(values) 164 + return (var ** 0.5) / abs(mean) 165 + 166 + 167 + def main(): 168 + if len(sys.argv) < 3: 169 + print(json.dumps({"ok": False, "diagnostic": "usage error", "score": 0})) 170 + return 171 + img_path = sys.argv[1] 172 + expected_word = sys.argv[2].lower() 173 + expected_n = sum(1 for c in expected_word if c.isalpha()) 174 + try: 175 + glyphs = extract_glyphs(img_path) 176 + except Exception as e: 177 + print(json.dumps({ 178 + "ok": False, "score": 0, 179 + "diagnostic": f"extract failed: {e}", 180 + "expected_n": expected_n, "found_n": 0, 181 + })) 182 + return 183 + 184 + found_n = len(glyphs) 185 + widths = [int(g["w"]) for g in glyphs] 186 + heights = [int(g["h"]) for g in glyphs] 187 + bottoms = [int(g["y1"]) for g in glyphs] 188 + 189 + height_cv = coef_of_var(heights) 190 + # Baseline CV: variation in bottom-y across glyphs, normalized by 191 + # median height (so it's scale-invariant). 
192 + if heights: 193 + median_h = sorted(heights)[len(heights) // 2] 194 + if bottoms and median_h > 0: 195 + mean_b = sum(bottoms) / len(bottoms) 196 + var_b = sum((b - mean_b) ** 2 for b in bottoms) / len(bottoms) 197 + baseline_cv = (var_b ** 0.5) / median_h 198 + else: 199 + baseline_cv = 0.0 200 + else: 201 + median_h = 1 202 + baseline_cv = 1.0 203 + 204 + # Sliver check (very thin glyphs are usually FLUX artifacts) 205 + slivers = [] 206 + for i, (w, h) in enumerate(zip(widths, heights)): 207 + if w < 6 or h < median_h * 0.45: 208 + slivers.append(i) 209 + 210 + # Topology check: per-glyph hole ratio vs the letter at the 211 + # corresponding position in the expected word. This catches 212 + # block-shape glyphs that FLUX produced when the letterform was 213 + # supposed to have an enclosed counter. 214 + expected_letters = [c for c in expected_word if c.isalpha()] 215 + topology_failures = [] # (idx, letter, hole_ratio, expected_min) 216 + shape_failures = [] 217 + if found_n == expected_n: 218 + for i, (g, letter) in enumerate(zip(glyphs, expected_letters)): 219 + ll = letter.lower() 220 + if letter in LETTER_HOLE_MIN: 221 + r = hole_ratio(g["img"]) 222 + lo = LETTER_HOLE_MIN[letter] 223 + if r < lo: 224 + topology_failures.append((i, letter, round(r, 4), lo)) 225 + # Stroke-balance check for counter letters: catches the 226 + # 'C-shape with filled left half' case where topology 227 + # alone says "has hole" but the letter is broken. 
228 + bal = stroke_balance(g["img"]) 229 + if bal >= 2.2: 230 + topology_failures.append((i, letter, round(-bal, 4), -2.2)) 231 + # Aspect ratio: catches "letter rendered too square" (block w) 232 + if ll in LETTER_ASPECT: 233 + aspect = g["w"] / max(1, g["h"]) 234 + lo, hi = LETTER_ASPECT[ll] 235 + if aspect < lo or aspect > hi: 236 + shape_failures.append((i, letter, "aspect", round(aspect, 3), lo, hi)) 237 + # Density ceiling: catches "letter rendered as solid block" 238 + if ll in LETTER_DENSITY_MAX: 239 + d = density(g["img"]) 240 + dmax = LETTER_DENSITY_MAX[ll] 241 + if d > dmax: 242 + shape_failures.append((i, letter, "density", round(d, 3), 0.0, dmax)) 243 + 244 + # Bg uniformity: reject FLUX outputs with textured/non-solid bg 245 + # (the wood-grain "saved" case where the alpha mask wraps the 246 + # texture instead of just the letters). 247 + bg_std = bg_uniformity(img_path) 248 + 249 + # Hard checks 250 + ok = True 251 + diag = "ok" 252 + if bg_std > 5.0: 253 + ok = False 254 + diag = f"non-uniform bg (std={bg_std:.1f}, expected <5) — bg is textured" 255 + elif found_n != expected_n: 256 + ok = False 257 + diag = f"glyph count mismatch (found {found_n}, expected {expected_n})" 258 + elif slivers: 259 + ok = False 260 + diag = f"{len(slivers)} sliver glyph(s) — likely artifacts" 261 + elif baseline_cv > 0.18: 262 + ok = False 263 + diag = f"baseline misaligned (cv={baseline_cv:.2f})" 264 + elif height_cv > 0.55: 265 + ok = False 266 + diag = f"letter heights wildly inconsistent (cv={height_cv:.2f})" 267 + elif topology_failures: 268 + f = topology_failures[0] 269 + ok = False 270 + diag = (f"topology fail: pos {f[0]} '{f[1]}' hole_ratio={f[2]:.3f} " 271 + f"(expected {'<=' if f[3] < 0 else '>='} {abs(f[3]):.3f})") 272 + elif shape_failures: 273 + f = shape_failures[0] 274 + ok = False 275 + diag = (f"shape fail: pos {f[0]} '{f[1]}' {f[2]}={f[3]} " 276 + f"(expected {f[4]}..{f[5]})") 277 + 278 + # Score (0..8) — used for probabilistic best-pick 279 + score 
= 0 280 + if found_n == expected_n: 281 + score += 2 282 + if height_cv < 0.35: 283 + score += 1 284 + if baseline_cv < 0.05: 285 + score += 1 286 + if not slivers: 287 + score += 1 288 + if not topology_failures: 289 + score += 1 290 + if not shape_failures: 291 + score += 1 292 + if bg_std <= 5.0: 293 + score += 1 294 + 295 + print(json.dumps({ 296 + "ok": ok, 297 + "diagnostic": diag, 298 + "expected_n": expected_n, 299 + "found_n": found_n, 300 + "height_cv": round(height_cv, 3), 301 + "baseline_cv": round(baseline_cv, 3), 302 + "slivers": len(slivers), 303 + "topology_failures": topology_failures, 304 + "shape_failures": shape_failures, 305 + "bg_std": round(bg_std, 2), 306 + "widths": widths, 307 + "heights": heights, 308 + "score": score, 309 + })) 310 + 311 + 312 + if __name__ == "__main__": 313 + main()
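A minimal, stdlib-only sketch of the two alignment metrics that `main()` above gates on: the height coefficient of variation and the baseline CV normalized by median glyph height. The `baseline_cv` helper name and the glyph boxes are hypothetical, chosen only to show how the `0.18` baseline threshold behaves; nothing here comes from real FLUX output.

```python
def coef_of_var(values):
    """Population std divided by |mean|; 0.0 for empty or zero-mean input."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (var ** 0.5) / abs(mean)


def baseline_cv(bottoms, heights):
    """Std of glyph bottom-y values, normalized by the median glyph height.

    Hypothetical helper: it wraps the inline baseline computation from
    main() so it can be exercised on its own.
    """
    if not heights:
        return 1.0  # no glyphs at all: treat as maximally misaligned
    median_h = sorted(heights)[len(heights) // 2]
    if not bottoms or median_h <= 0:
        return 0.0
    mean_b = sum(bottoms) / len(bottoms)
    var_b = sum((b - mean_b) ** 2 for b in bottoms) / len(bottoms)
    return (var_b ** 0.5) / median_h


# Hypothetical glyph boxes: four letters, each 100 px tall.
heights = [100, 100, 100, 100]
aligned_bottoms = [200, 200, 200, 200]    # all letters sit on one baseline
drifting_bottoms = [200, 240, 200, 240]   # every other letter drops 40 px

aligned_cv = baseline_cv(aligned_bottoms, heights)
drifting_cv = baseline_cv(drifting_bottoms, heights)

print(aligned_cv, aligned_cv > 0.18)    # 0.0 False -> passes the hard check
print(drifting_cv, drifting_cv > 0.18)  # 0.2 True  -> "baseline misaligned"
print(coef_of_var(heights))             # 0.0 -> heights are consistent
```

Dividing by the median height rather than the mean bottom-y is what keeps the baseline metric scale-invariant: the same 40 px drift on 400 px letters would score 0.05 and pass.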
+18
pop/references/README.md
··· 1 + # references 2 + 3 + third-party reference material — emo rap lyrics, trap song breakdowns, beat references — does **not live in this repo**. 4 + 5 + it lives in the vault (jas.life, closed-source). this folder exists only as a pointer so the path is real and pipeline code can resolve a vault-side mount at runtime if needed. 6 + 7 + ## why not committed 8 + 9 + - copyright. the corpus is third-party lyrics used as in-context style examples for lyric generation. that's a fair-use-ish posture for a private notebook, not for a public github repo 10 + - voice hygiene. the AC repo should reflect AC's own writing. reference material is scaffolding, not output 11 + 12 + ## how it's used 13 + 14 + the lyric generator pulls a small curated set (~10–20 tracks) from the vault as in-context examples — cadence, internal rhyme, imagery — and never includes them in any committed artifact. 15 + 16 + ## vault path 17 + 18 + mounted at runtime. exact path TBD when the lyric generator lands.