# Hybrid Latent Sonic Architecture

This is the architecture for `kidlisp-wasm` audio export.

## Goal

Let the framebuffer cover the full color spectrum while the sound path covers a much wider synthesis range than a fixed visual-to-bytebeat mapping.

That means two things have to be true at once:

1. The buffer can act like an audio-bearing field.
2. The system does not hear that field the same way every frame.

## Core Model

`visible framebuffer -> hidden latent field -> changing listener -> audio experts -> stereo output`

The visible frame is still the thing we see.
The hidden latent field is the thing we hear.
The listener is a small feedforward network that decides how to scan and decode the latent field over time.
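The per-frame data flow above can be sketched as one step function. This is a toy illustration with stub implementations; none of these names (`frameStep`, `analyze`, `blendField`, and so on) are actual exports of `sonic-frame.mjs`.

```javascript
// Stub stages for one frame step (illustrative, not the real engine).
const analyze = frame => ({ mean: frame.reduce((a, b) => a + b, 0) / frame.length });
const encode = frame => frame.map(v => v / 255); // normalize bytes to 0..1
const blendField = (prev, next, k = 0.9) =>
  prev.map((v, i) => k * v + (1 - k) * next[i]); // latent field keeps memory
const listen = (features, state) => ({ gain: features.mean / 255, nextState: state });
const decode = (field, controls) =>
  field.map(v => [controls.gain * v, controls.gain * v]); // [left, right] per sample

function frameStep(framebuffer, latentField, listenerState) {
  const features = analyze(framebuffer);                      // visible frame
  const field = blendField(latentField, encode(framebuffer)); // hidden latent field
  const controls = listen(features, listenerState);           // changing listener
  const stereo = decode(field, controls);                     // experts -> stereo
  return { stereo, field, nextState: controls.nextState };
}
```

The key property is that the latent field and listener state persist between calls, so the same framebuffer can sound different on consecutive frames.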

## Layers

### 1. Visible Framebuffer

The rendered KidLisp or fixture frame is read in full every visual frame.
We extract:

- global color and luminance statistics
- spatial balance and centroid
- edge energy
- tile-local color features across the frame
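A minimal sketch of the first two feature groups, assuming an RGBA byte buffer (this is not the actual `sonic-frame.mjs` API, just the shape of the computation):

```javascript
// Global luminance and luminance-weighted spatial centroid from RGBA pixels.
function frameFeatures(pixels, width, height) {
  let lumSum = 0, cx = 0, cy = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      // Rec. 709 luma weights on 0..255 RGB, normalized to 0..1.
      const lum =
        (0.2126 * pixels[i] + 0.7152 * pixels[i + 1] + 0.0722 * pixels[i + 2]) / 255;
      lumSum += lum;
      cx += x * lum;
      cy += y * lum;
    }
  }
  const total = lumSum || 1; // avoid divide-by-zero on an all-black frame
  return {
    luminance: lumSum / (width * height),       // mean brightness, 0..1
    centroid: { x: cx / total, y: cy / total }, // brightness-weighted center
  };
}
```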

### 2. Hidden Latent Field

The frame is re-encoded into a continuous latent grid.
Each tile contributes to a latent vector, and that vector is blended with its prior value so the sonic field has memory.

This gives us:

- continuity across frames
- local sonic neighborhoods
- enough hidden state for simple visuals to still evolve sonically
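The blend-with-prior step can be written as an exponential moving average per tile. A minimal sketch, assuming the field is an array of per-tile numeric vectors (`updateLatentField` and its shape are illustrative, not the engine's real signature):

```javascript
// Blend each tile's new feature vector with its previous latent value.
// memory = 1 keeps the old field forever; memory = 0 forgets it instantly.
function updateLatentField(field, tileFeatures, memory = 0.9) {
  return field.map((vec, t) =>
    vec.map((v, d) => memory * v + (1 - memory) * tileFeatures[t][d])
  );
}
```

With `memory` near 1, a single bright frame leaves a long sonic trail, which is what lets simple visuals keep evolving.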

### 3. Changing Listener

A feedforward control network reads global frame features plus a persistent latent state.
It outputs:

- scan motion and stereo drift
- pitch and formant drift
- table warp and breath amount
- expert mixture weights
- living-system rules for the petri/bytebeat branch

So the sound is not just “what the image is.”
It is also “how the current listener chooses to hear the image.”
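The listener can be as small as a single dense layer. A toy sketch with made-up weights and control names (the real network's layout and outputs live in `sonic-frame.mjs`):

```javascript
// One feedforward step: frame features + persistent state in, controls out.
function listenerStep(features, state, weights) {
  const input = [...features, ...state];
  // One dense layer with tanh squashing; each row of weights is one output.
  const out = weights.map(row =>
    Math.tanh(row.reduce((sum, w, i) => sum + w * input[i], 0))
  );
  return {
    scan: out[0],            // scan motion / stereo drift, -1..1
    pitchDrift: out[1],      // pitch and formant drift
    warp: out[2],            // table warp / breath amount
    nextState: out.slice(3), // persistent state fed back on the next frame
  };
}
```

Because `nextState` feeds back in, identical frames still produce a listener that drifts over time.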

### 4. Decoder Experts

The latent field is decoded by a mixture of experts:

- `tonal`: additive sine-like harmonics
- `vocal`: voiced source plus moving formants
- `table`: raw PCM-style readout from the full RGB buffer
- `living`: petri/bytebeat emergent branch

The listener mixes these experts per frame and per read position.
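The per-position mixing can be sketched as a softmax over listener-supplied logits, assuming each expert is a function from a latent value to one sample (a simplified stand-in for the real decoder):

```javascript
// Softmax-weighted mixture of expert readings at one latent value.
function mixExperts(experts, logits, latentValue) {
  // Turn the listener's logits into mixture weights that sum to 1.
  const exps = logits.map(Math.exp);
  const sum = exps.reduce((a, b) => a + b, 0);
  const weights = exps.map(e => e / sum);
  // Weighted sum of each expert's reading of the latent value.
  return experts.reduce(
    (acc, expert, i) => acc + weights[i] * expert(latentValue),
    0
  );
}
```

Varying the logits per read position is what lets one frame blend `tonal` on the left of the field and `table` grit on the right.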

## Why This Matches The Artistic Goal

A fixed mapping like `red = sine` or `edges = noise` gets boring fast.
The hybrid latent system keeps a strong relation to the image, but it changes the interpretation over time.

That means:

- a pulsing square can drift between tone, table-like grit, and breathy vowel sound
- a gradient can span the whole spectrum without collapsing into one static drone
- a simple orbit can still sound alive because the listener is moving through the latent field

## Current Repo Pieces

- [`sonic-frame.mjs`](./sonic-frame.mjs) implements the hybrid latent engine.
- [`sonic-fixtures.mjs`](./sonic-fixtures.mjs) defines simple visual fixtures like `pulse-square` and `gradient-sweep`.
- [`mp4.mjs`](./mp4.mjs) renders KidLisp pieces or fixtures and muxes the soundtrack into MP4.

## Current Limits

This is a first pass, not a trained EnCodec-style codec yet.
The latent field is continuous and deterministic, but not learned from a corpus.

The architecture is in place and ready for the next steps:

- trainable latent encoders/decoders
- vector-quantized audio codes
- richer voice modeling
- explicit spectral/STFT buffer modes