# KidLisp Performance Analysis & Optimization Report

## Executive Summary

Analysis of the `$cow` piece rendering pipeline reveals several performance bottlenecks in the alpha compositing and embedded layer system. The piece composites two animated layers (`$39i` and `$r2f`) at 120fps, each driven by multiple timing expressions, and the estimated per-frame cost consumes the entire 8.33ms frame budget.

## Piece Analysis: $cow

### Structure
```
📁 $cow (composite layer)
 ├─ 📄 $39i (background effects layer)
 └─ 📄 $r2f (foreground zoom layer)
```

### Source Code Breakdown

**Main Compositor ($cow)**:
```kidlisp
($39i 0 0 w h 128)  ; Background layer at 50% opacity
($r2f 0 0 w h 128)  ; Foreground layer at 50% opacity  
(contrast 1.5)      ; Expensive post-processing effect
```

**Background Layer ($39i)**: 11 operations/frame
- Complex timing: 5 different timer expressions (0.1s, 1.5s, 1s, 2s, 0.3s)
- Heavy operations: `flood`, `scroll`, `zoom`, `blur`, `contrast`, `spin`
- Random point generation: `(repeat 30 point)`

**Foreground Layer ($r2f)**: 11 operations/frame  
- High-frequency zoom: `(0.1s (zoom (? 1.89 1 1.1 1.2)))` every 100ms
- Continuous effects: `scroll`, `spin`, `blur`
- Multiple flood fills: `(repeat 2 (flood ? ?))`

## Performance Bottlenecks

### 1. Alpha Compositing Pipeline

**Current Implementation** (`graph.mjs` lines 1558-1610):
```javascript
function blend(dst, src, si, di, alphaIn = 1) {
  // Branch A: Transparent pixel compositing (expensive)
  if (dst[di + 3] < 255 && src[si + 3] > 0) {
    const alphaSrc = (src[si + 3] * alphaIn) / 255;
    const alphaDst = dst[di + 3] / 255;
    const combinedAlpha = alphaSrc + (1.0 - alphaSrc) * alphaDst;
    // Per-channel floating-point math (3 divisions, 6 multiplications)
    for (let offset = 0; offset < 3; offset++) {
      dst[di + offset] = (src[si + offset] * alphaSrc + 
                         dst[di + offset] * (1.0 - alphaSrc) * alphaDst) / 
                         (combinedAlpha + epsilon);
    }
  } else {
    // Branch B: Opaque pixel compositing (faster integer math)
    const alpha = src[si + 3] * alphaIn + 1;
    const invAlpha = 256 - alpha;
    dst[di] = (alpha * src[si + 0] + invAlpha * dst[di + 0]) >> 8;
    // ... similar for G, B channels
  }
}
```

**Performance Issues**:
- **Floating-point overhead**: Branch A uses expensive division and floating-point arithmetic
- **Branch prediction**: The per-pixel choice between the transparent and opaque paths can cause branch mispredictions (and pipeline stalls) when the two cases are interleaved within a row
- **Memory access pattern**: Non-sequential pixel access in compositing loops

### 2. Layer Rendering Frequency

**Current Execution Pattern**:
- `$cow` renders at 120fps (8.33ms budget per frame)
- Each embedded layer renders independently 
- `$39i`: 5 timer expressions firing at different intervals
- `$r2f`: High-frequency zoom (10 times per second)
- Double alpha compositing: Each layer → main buffer → final output

**Frame Budget Breakdown** (estimated):
```
Per-frame operations (120fps = 8.33ms budget):
├─ $39i rendering:     ~2.5ms (timer evaluation + effects)
├─ $r2f rendering:     ~2.0ms (zoom calculations + blending)  
├─ Alpha compositing:  ~2.5ms (pixel-by-pixel blending)
├─ Contrast effect:    ~1.0ms (post-processing)
└─ Overhead:           ~0.33ms (timing, evaluation)
Total:                 ~8.33ms (100% budget utilization)
```
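
The budget arithmetic above can be checked with a short script. This is only a sketch: the costs are the report's estimates rather than profiler output, and `frameBudgetMs` and `estimatedCostsMs` are names introduced here for illustration.

```javascript
// Frame budget for a given refresh rate, in milliseconds.
function frameBudgetMs(fps) {
  return 1000 / fps;
}

// Estimated per-frame costs from the breakdown above (ms); estimates, not measurements.
const estimatedCostsMs = {
  layer39i: 2.5,
  layerR2f: 2.0,
  alphaCompositing: 2.5,
  contrast: 1.0,
  overhead: 0.33,
};

const totalMs = Object.values(estimatedCostsMs).reduce((sum, ms) => sum + ms, 0);
const budgetMs = frameBudgetMs(120);
const utilization = totalMs / budgetMs;

console.log(
  `budget ${budgetMs.toFixed(2)}ms, estimated ${totalMs.toFixed(2)}ms, ` +
    `${(utilization * 100).toFixed(0)}% utilized`
);
```

Any frame in which one of these stages runs longer than estimated would exceed the budget, which is why the estimates leave no headroom.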

### 3. Timing System Overhead

**Recent Fix Applied**: Removed `setTimeout`-based timing in favor of frame-based counting:
```javascript
// OLD (expensive): 
setTimeout(() => { /* execute timing expression */ }, delay);

// NEW (efficient):
if (this.frameCount - lastExecution >= targetFrames) {
  // execute immediately in frame context
}
```
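
As a minimal sketch of the frame-counting approach, the snippet below converts an interval in seconds into a frame count and fires when enough frames have elapsed. `secondsToFrames` and `FrameTimer` are hypothetical names for illustration, not the identifiers used in the codebase, and a fixed 120fps is assumed.

```javascript
// Hypothetical helper: converts an interval in seconds to a whole number of frames.
function secondsToFrames(seconds, fps = 120) {
  return Math.max(1, Math.round(seconds * fps));
}

// Illustrative frame-counting timer (an assumption, not the production class).
class FrameTimer {
  constructor(seconds, fps = 120) {
    this.targetFrames = secondsToFrames(seconds, fps);
    this.lastExecution = 0;
  }

  // Returns true on frames where the timer fires, and records that frame.
  tick(frameCount) {
    if (frameCount - this.lastExecution >= this.targetFrames) {
      this.lastExecution = frameCount;
      return true;
    }
    return false;
  }
}

// Usage: a 0.1s timer at 120fps fires every 12 frames.
const zoomTimer = new FrameTimer(0.1);
const firedOn = [];
for (let frame = 1; frame <= 36; frame++) {
  if (zoomTimer.tick(frame)) firedOn.push(frame);
}
console.log(firedOn); // [12, 24, 36]
```

Because the check is a single integer comparison inside the frame loop, no callback is scheduled and firing stays aligned to frame boundaries.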

**Remaining Issues**:
- Timer expressions are evaluated every frame even when not firing
- Evaluation context is switched between embedded layers on every frame
- Redundant timing key generation: `${head}-${cacheId}-${JSON.stringify(args)}`

## Optimization Opportunities

### 1. Alpha Compositing Optimizations

**A. SIMD-style Bulk Operations**
```javascript
// Instead of pixel-by-pixel blending, process in chunks
function fastBlendChunk(dst, src, dstIdx, srcIdx, count, alpha) {
  // One RGBA pixel (4 bytes) per iteration: integer math only, no per-pixel branching.
  // `alpha` is expected in the range 0..1.
  const alpha256 = (alpha * 256) | 0;
  const invAlpha = 256 - alpha256;
  
  for (let i = 0; i < count * 4; i += 4) {
    dst[dstIdx + i] = ((alpha256 * src[srcIdx + i] + invAlpha * dst[dstIdx + i]) >> 8);
    dst[dstIdx + i + 1] = ((alpha256 * src[srcIdx + i + 1] + invAlpha * dst[dstIdx + i + 1]) >> 8);
    dst[dstIdx + i + 2] = ((alpha256 * src[srcIdx + i + 2] + invAlpha * dst[dstIdx + i + 2]) >> 8);
    // Skip alpha channel for opaque blending
  }
}
```

**B. Pre-multiplied Alpha**
```javascript
// Store layers in pre-multiplied format to avoid runtime multiplication
function convertToPremultiplied(pixels) {
  for (let i = 0; i < pixels.length; i += 4) {
    const alpha = pixels[i + 3] / 255;
    pixels[i] *= alpha;     // R
    pixels[i + 1] *= alpha; // G  
    pixels[i + 2] *= alpha; // B
  }
}
```
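
Once both buffers are premultiplied, source-over blending reduces to `out = src + dst * (1 - srcAlpha)` for all four channels, with no division by the combined alpha. The sketch below illustrates that step; `blendPremultiplied` is a hypothetical name, and the `x / 255` step uses a common integer rounding trick rather than anything taken from `graph.mjs`.

```javascript
// Source-over blend for premultiplied RGBA bytes: out = src + dst * (1 - srcAlpha).
// Hypothetical helper; assumes BOTH buffers already hold premultiplied color.
function blendPremultiplied(dst, src, di, si) {
  const invAlpha = 255 - src[si + 3];
  for (let c = 0; c < 4; c++) {
    // Integer stand-in for Math.round(dst * invAlpha / 255).
    const t = dst[di + c] * invAlpha + 128;
    dst[di + c] = src[si + c] + ((t + (t >> 8)) >> 8);
  }
}

// Usage: 50% red (premultiplied) over opaque blue.
const srcPixel = new Uint8ClampedArray([128, 0, 0, 128]);
const dstPixel = new Uint8ClampedArray([0, 0, 255, 255]);
blendPremultiplied(dstPixel, srcPixel, 0, 0);
console.log(Array.from(dstPixel)); // [128, 0, 127, 255]
```

The alpha channel goes through the same formula as the color channels, so the transparent and opaque cases no longer need separate branches.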

**Estimated Gain**: 40-60% faster compositing

### 2. Layer Rendering Optimizations

**A. Dirty Rectangle Tracking**
```javascript
class LayerBuffer {
  constructor(width, height) {
    this.pixels = new Uint8Array(width * height * 4);
    this.dirtyBox = null; // Track changed regions
  }
  
  markDirty(x, y, w, h) {
    if (!this.dirtyBox) {
      this.dirtyBox = { x, y, w, h };
    } else {
      // Expand dirty box to include new region
      this.dirtyBox = expandBox(this.dirtyBox, { x, y, w, h });
    }
  }
}
```
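
The `expandBox` helper is not defined in the snippet above. One possible implementation, assuming boxes are plain `{ x, y, w, h }` objects, returns the smallest box that contains both inputs:

```javascript
// Hypothetical `expandBox`: smallest axis-aligned box containing both a and b.
// Assumes each box is { x, y, w, h } with non-negative w and h.
function expandBox(a, b) {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.w, b.x + b.w);
  const bottom = Math.max(a.y + a.h, b.y + b.h);
  return { x, y, w: right - x, h: bottom - y };
}

// Usage: merging two changed regions into one dirty box.
const merged = expandBox({ x: 10, y: 10, w: 5, h: 5 }, { x: 0, y: 12, w: 4, h: 20 });
console.log(merged); // { x: 0, y: 10, w: 15, h: 22 }
```

A single bounding box is cheap to maintain but can cover far more pixels than actually changed when edits land in opposite corners; a list of boxes would be the trade-off to consider if that turns out to be common.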

**B. Layer Caching Strategy**
```javascript
// Cache layer results when no animations are active
const layerCache = new Map();

function renderLayerWithCaching(layerId, hasActiveTimers) {
  if (!hasActiveTimers && layerCache.has(layerId)) {
    return layerCache.get(layerId);
  }
  
  const result = renderLayer(layerId);
  if (!hasActiveTimers) {
    layerCache.set(layerId, result);
  }
  return result;
}
```

**Estimated Gain**: 30-50% reduction in redundant layer rendering

### 3. Timing System Optimizations

**A. Timer Batching**
```javascript
// Group timers by interval for batch processing
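// Frame counts assume 120fps; the five intervals are those used by $39i
const timerBatches = {
  '0.1s': [], // 12-frame intervals
  '0.3s': [], // 36-frame intervals
  '1s': [],   // 120-frame intervals
  '1.5s': [], // 180-frame intervals
  '2s': [],   // 240-frame intervals
};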

function processTimerBatch(interval, frameCount) {
  if (frameCount % getFramesForInterval(interval) === 0) {
    timerBatches[interval].forEach(timer => timer.execute());
  }
}
```
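
`getFramesForInterval` is left undefined above. A minimal sketch, assuming interval labels always have the form `<number>s` and a fixed 120fps, could look like this:

```javascript
// Hypothetical `getFramesForInterval`: parses labels like "0.1s" or "1.5s"
// into a whole number of frames. The label format is an assumption.
function getFramesForInterval(interval, fps = 120) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(interval);
  if (!match) throw new Error(`Unsupported interval: ${interval}`);
  return Math.max(1, Math.round(parseFloat(match[1]) * fps));
}

// The five intervals used by $39i, mapped to frame counts at 120fps.
const framesByInterval = Object.fromEntries(
  ["0.1s", "0.3s", "1s", "1.5s", "2s"].map((label) => [label, getFramesForInterval(label)])
);
console.log(framesByInterval);
```

Note that the modulo check in `processTimerBatch` makes every timer in a batch fire on the same frames, counted from frame 0. Timers that must keep their own phase (for example, ones created mid-animation) would still need a per-timer `lastExecution` value.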

**B. Lazy Timer Key Generation**
```javascript
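// Cache timer keys to avoid repeated JSON.stringify calls.
// Assumes `args` is an object (e.g. an array) whose identity is stable across frames.
const timerKeyCache = new WeakMap();

function getTimerKey(head, cacheId, args) {
  // Keys also depend on head and cacheId, so cache per (args, head, cacheId).
  let perArgs = timerKeyCache.get(args);
  if (!perArgs) {
    perArgs = new Map();
    timerKeyCache.set(args, perArgs);
  }
  const prefix = `${head}-${cacheId}`;
  let key = perArgs.get(prefix);
  if (key === undefined) {
    key = `${prefix}-${JSON.stringify(args)}`;
    perArgs.set(prefix, key);
  }
  return key;
}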
```

**Estimated Gain**: 15-25% reduction in timing overhead

## Recommended Implementation Plan

### Phase 1: Critical Path Optimization (High Impact)
1. **Implement integer-only alpha compositing** for opaque blending cases
2. **Add SIMD-style bulk pixel operations** for large layer composites  
3. **Implement dirty rectangle tracking** for incremental updates

### Phase 2: Memory & Caching (Medium Impact)  
4. **Add layer result caching** for static content
5. **Implement pre-multiplied alpha storage** format
6. **Optimize timer key generation** with caching

### Phase 3: Advanced Optimizations (Lower Impact)
7. **WebGL compositing pipeline** for complex effects
8. **Worker thread layer rendering** for parallel processing
9. **Adaptive quality scaling** based on performance metrics

## Measurement & Validation

### Performance Metrics to Track
```javascript
// Add to graph.mjs for performance monitoring
const perfMetrics = {
  blendTime: 0,
  layerRenderTime: 0,
  timingOverhead: 0,
  frameDrops: 0
};

function measureBlendPerformance(fn) {
  const start = performance.now();
  fn();
  perfMetrics.blendTime += performance.now() - start;
}
```
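
`perfMetrics.frameDrops` needs a counting rule. One possible rule, sketched below under the assumption of a fixed 120fps target, treats any gap between consecutive frame timestamps that exceeds 1.5 frame budgets as one or more dropped frames. `createFrameDropCounter` and its `tolerance` parameter are hypothetical.

```javascript
// Hypothetical frame-drop counter. A gap between consecutive frames larger than
// `tolerance` frame budgets is counted as the number of frame slots missed.
function createFrameDropCounter(fps = 120, tolerance = 1.5) {
  const budgetMs = 1000 / fps;
  let lastTimestamp = null;
  let frameDrops = 0;
  return {
    // Call once per frame with a timestamp in ms (e.g. from performance.now()).
    recordFrame(timestampMs) {
      if (lastTimestamp !== null) {
        const gap = timestampMs - lastTimestamp;
        if (gap > budgetMs * tolerance) {
          frameDrops += Math.round(gap / budgetMs) - 1;
        }
      }
      lastTimestamp = timestampMs;
    },
    get frameDrops() {
      return frameDrops;
    },
  };
}

// Usage: the 25ms gap between the third and fourth frames spans three slots,
// so two frames are counted as dropped.
const dropCounter = createFrameDropCounter();
[0, 8.33, 16.67, 41.67, 50].forEach((t) => dropCounter.recordFrame(t));
console.log(dropCounter.frameDrops); // 2
```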

### Expected Performance Gains
- **Alpha Compositing**: 40-60% faster → ~1.5ms savings per frame
- **Layer Rendering**: 30-50% reduction → ~1.0ms savings per frame  
- **Timing Overhead**: 15-25% reduction → ~0.2ms savings per frame
- **Total Improvement**: ~2.7ms per frame (about 32% of the 8.33ms frame budget)

This would provide significant headroom for more complex effects and better frame stability at 120fps.

## Current Status

✅ **Completed**: Removed setTimeout-based timing (major architectural fix)  
🔄 **In Progress**: Performance measurement infrastructure  
📋 **Next**: Implement integer-only alpha compositing optimization

---
*Last Updated: September 6, 2025*  
*Analysis Target: $cow embedded layer composition*