KidLisp Performance Analysis & Optimization Report

Executive Summary

Analysis of the $cow piece rendering pipeline reveals several performance bottlenecks in the alpha compositing and embedded layer system. The piece composites two animated layers ($39i and $r2f) at 120fps with complex timing expressions, creating significant computational overhead.

Piece Analysis: $cow

Structure

📁 $cow (composite layer)
 ├─ 📄 $39i (background effects layer)
 └─ 📄 $r2f (foreground zoom layer)

Source Code Breakdown

Main Compositor ($cow):

($39i 0 0 w h 128)  ; Background layer at 50% opacity
($r2f 0 0 w h 128)  ; Foreground layer at 50% opacity  
(contrast 1.5)      ; Expensive post-processing effect

Background Layer ($39i): 11 operations/frame

Complex timing: 5 different timer expressions (0.1s, 1.5s, 1s, 2s, 0.3s)
Heavy operations: flood, scroll, zoom, blur, contrast, spin
Random point generation: (repeat 30 point)

Foreground Layer ($r2f): 11 operations/frame

High-frequency zoom: (0.1s (zoom (? 1.89 1 1.1 1.2))) every 100ms
Continuous effects: scroll, spin, blur
Multiple flood fills: (repeat 2 (flood ? ?))

Performance Bottlenecks

1. Alpha Compositing Pipeline

Current Implementation (graph.mjs lines 1558-1610):

function blend(dst, src, si, di, alphaIn = 1) {
  // Branch A: Transparent pixel compositing (expensive)
  if (dst[di + 3] < 255 && src[si + 3] > 0) {
    const alphaSrc = (src[si + 3] * alphaIn) / 255;
    const alphaDst = dst[di + 3] / 255;
    const combinedAlpha = alphaSrc + (1.0 - alphaSrc) * alphaDst;
    // Per-channel floating-point math (3 divisions, 9 multiplications per pixel)
    for (let offset = 0; offset < 3; offset++) {
      dst[di + offset] = (src[si + offset] * alphaSrc + 
                         dst[di + offset] * (1.0 - alphaSrc) * alphaDst) / 
                         (combinedAlpha + epsilon);
    }
  } else {
    // Branch B: Opaque pixel compositing (faster integer math)
    const alpha = src[si + 3] * alphaIn + 1;
    const invAlpha = 256 - alpha;
    dst[di] = (alpha * src[si + 0] + invAlpha * dst[di + 0]) >> 8;
    // ... similar for G, B channels
  }
}

Performance Issues:

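Floating-point overhead: Branch A does floating-point math for every blended pixel, including one division per color channel
Branch prediction: The per-pixel branch between the two paths can mispredict and stall the CPU pipeline when destination alpha varies from pixel to pixel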
Memory access pattern: Non-sequential pixel access in compositing loops

2. Layer Rendering Frequency

Current Execution Pattern:

$cow renders at 120fps (8.33ms budget per frame)
Each embedded layer renders independently
$39i: 5 timer expressions firing at different intervals
$r2f: High-frequency zoom (10 times per second)
Double alpha compositing: Each layer → main buffer → final output

Frame Budget Breakdown (estimated):

Per-frame operations (120fps = 8.33ms budget):
├─ $39i rendering:     ~2.5ms (timer evaluation + effects)
├─ $r2f rendering:     ~2.0ms (zoom calculations + blending)  
├─ Alpha compositing:  ~2.5ms (pixel-by-pixel blending)
├─ Contrast effect:    ~1.0ms (post-processing)
└─ Overhead:           ~0.33ms (timing, evaluation)
Total:                 ~8.33ms (100% budget utilization)

3. Timing System Overhead

Recent Fix Applied: Removed setTimeout-based timing in favor of frame-based counting:

// OLD (expensive): 
setTimeout(() => { /* execute timing expression */ }, delay);

// NEW (efficient):
if (this.frameCount - lastExecution >= targetFrames) {
  // execute immediately in frame context
}
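
The frame-counting check above can be sketched as a small self-contained helper. This is only an illustration of the pattern, not the KidLisp implementation: `FrameTimer`, `intervalFrames`, and `tick` are hypothetical names introduced here.

```javascript
// Hypothetical frame-based timer: fires a callback every `intervalFrames` frames.
class FrameTimer {
  constructor(intervalFrames, callback) {
    this.intervalFrames = intervalFrames;
    this.callback = callback;
    this.lastExecution = 0; // frame number of the most recent firing
  }

  // Call once per rendered frame with the current frame counter.
  tick(frameCount) {
    if (frameCount - this.lastExecution >= this.intervalFrames) {
      this.lastExecution = frameCount;
      this.callback(frameCount); // runs synchronously inside the frame
      return true;
    }
    return false;
  }
}

// Usage: at 120fps a 0.1s timer corresponds to 12 frames.
const fired = [];
const zoomTimer = new FrameTimer(12, (frame) => fired.push(frame));
for (let frame = 1; frame <= 36; frame++) zoomTimer.tick(frame);
console.log(fired); // [12, 24, 36]
```

Because the check runs inside the frame loop, the callback never executes between frames, which is the property the setTimeout version could not guarantee.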

Remaining Issues:

Timer expressions are evaluated every frame even when not firing
Context switching between embedded layers
Redundant timing key generation: the key ${head}-${cacheId}-${JSON.stringify(args)} is rebuilt on every evaluation

Optimization Opportunities

1. Alpha Compositing Optimizations

A. SIMD-style Bulk Operations

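// Instead of calling blend() once per pixel, blend a contiguous run of pixels in one loop
function fastBlendChunk(dst, src, dstIdx, srcIdx, count, alpha) {
  // alpha is a 0..1 opacity, converted once to 0..256 fixed point;
  // each loop iteration then blends one RGBA pixel (4 bytes) with integer math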
  const alpha256 = (alpha * 256) | 0;
  const invAlpha = 256 - alpha256;
  
  for (let i = 0; i < count * 4; i += 4) {
    dst[dstIdx + i] = ((alpha256 * src[srcIdx + i] + invAlpha * dst[dstIdx + i]) >> 8);
    dst[dstIdx + i + 1] = ((alpha256 * src[srcIdx + i + 1] + invAlpha * dst[dstIdx + i + 1]) >> 8);
    dst[dstIdx + i + 2] = ((alpha256 * src[srcIdx + i + 2] + invAlpha * dst[dstIdx + i + 2]) >> 8);
    // Skip alpha channel for opaque blending
  }
}
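
Since the `>> 8` shift truncates, it is worth checking how far the fixed-point result can drift from a floating-point blend. The sketch below restates the per-channel arithmetic of fastBlendChunk and compares it against a float reference; `blendFixed`, `blendFloat`, and `maxBlendError` are hypothetical names used only for this check.

```javascript
// Fixed-point blend of one channel, mirroring the arithmetic in fastBlendChunk.
function blendFixed(srcValue, dstValue, alpha) {
  const alpha256 = (alpha * 256) | 0;
  const invAlpha = 256 - alpha256;
  return (alpha256 * srcValue + invAlpha * dstValue) >> 8;
}

// Floating-point reference for the same blend.
function blendFloat(srcValue, dstValue, alpha) {
  return alpha * srcValue + (1 - alpha) * dstValue;
}

// Largest absolute difference over every 8-bit source/destination pair.
function maxBlendError(alpha) {
  let worst = 0;
  for (let s = 0; s < 256; s++) {
    for (let d = 0; d < 256; d++) {
      const err = Math.abs(blendFixed(s, d, alpha) - blendFloat(s, d, alpha));
      if (err > worst) worst = err;
    }
  }
  return worst;
}

// Opacity 128 out of 255, as used by both layers in $cow.
console.log(maxBlendError(128 / 255));
```

The error stays below two intensity levels for any opacity: quantizing alpha to 1/256 steps contributes less than one level, and the truncating shift contributes less than one more.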

B. Pre-multiplied Alpha

// Store layers in pre-multiplied format to avoid runtime multiplication
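function convertToPremultiplied(pixels) {
  for (let i = 0; i < pixels.length; i += 4) {
    const alpha = pixels[i + 3] / 255;
    // Round explicitly; a plain Uint8Array store would truncate the fraction
    pixels[i] = Math.round(pixels[i] * alpha);         // R
    pixels[i + 1] = Math.round(pixels[i + 1] * alpha); // G
    pixels[i + 2] = Math.round(pixels[i + 2] * alpha); // B
  }
}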
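
Once both buffers are premultiplied, the "over" operation reduces to out = src + dst * (1 - srcAlpha) on all four channels, with no division by a combined alpha. The following is a minimal sketch under that assumption; `div255` and `overPremultiplied` are hypothetical helpers, not functions from graph.mjs.

```javascript
// Exact rounded division by 255 for 0 <= value <= 65025, using shifts only.
function div255(value) {
  const t = value + 128;
  return (t + (t >> 8)) >> 8;
}

// Composite one premultiplied RGBA pixel from src over dst, in place.
// Assumes both buffers already hold premultiplied color values.
function overPremultiplied(dst, src, di, si) {
  const invSrcAlpha = 255 - src[si + 3];
  for (let c = 0; c < 4; c++) {
    // out = src + dst * (1 - srcAlpha); the alpha channel uses the same formula
    dst[di + c] = src[si + c] + div255(dst[di + c] * invSrcAlpha);
  }
}

// Usage: 50%-opaque red (premultiplied: R = 128) over opaque blue.
const dstPixel = new Uint8ClampedArray([0, 0, 255, 255]);
const srcPixel = new Uint8ClampedArray([128, 0, 0, 128]);
overPremultiplied(dstPixel, srcPixel, 0, 0);
console.log(Array.from(dstPixel)); // [128, 0, 127, 255]
```

The same loop body handles opaque and translucent destinations, so the two-branch structure of blend() is not needed on this path.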

Estimated Gain: 40-60% faster compositing

2. Layer Rendering Optimizations

A. Dirty Rectangle Tracking

class LayerBuffer {
  constructor(width, height) {
    this.pixels = new Uint8Array(width * height * 4);
    this.dirtyBox = null; // Track changed regions
  }
  
  markDirty(x, y, w, h) {
    if (!this.dirtyBox) {
      this.dirtyBox = { x, y, w, h };
    } else {
      // Expand dirty box to include new region
      this.dirtyBox = expandBox(this.dirtyBox, { x, y, w, h });
    }
  }
}
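
markDirty relies on an expandBox helper that is not shown. One plausible version, assuming boxes are plain { x, y, w, h } objects with (x, y) as the top-left corner, computes the smallest rectangle that contains both inputs:

```javascript
// Hypothetical helper: smallest box that contains both a and b.
// Boxes are { x, y, w, h } with (x, y) as the top-left corner.
function expandBox(a, b) {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.w, b.x + b.w);
  const bottom = Math.max(a.y + a.h, b.y + b.h);
  return { x, y, w: right - x, h: bottom - y };
}

console.log(expandBox({ x: 10, y: 10, w: 20, h: 20 }, { x: 25, y: 5, w: 10, h: 10 }));
// { x: 10, y: 5, w: 25, h: 25 }
```

A single bounding box is cheap to maintain but can grow to cover most of the layer when two small, distant regions change in the same frame, so the benefit depends on how localized each layer's updates are.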

B. Layer Caching Strategy

// Cache layer results when no animations are active
const layerCache = new Map();

function renderLayerWithCaching(layerId, hasActiveTimers) {
  if (!hasActiveTimers && layerCache.has(layerId)) {
    return layerCache.get(layerId);
  }
  
  const result = renderLayer(layerId);
  if (!hasActiveTimers) {
    layerCache.set(layerId, result);
  }
  return result;
}

Estimated Gain: 30-50% reduction in redundant layer rendering

3. Timing System Optimizations

A. Timer Batching

// Group timers by interval for batch processing
const timerBatches = {
  '0.1s': [], // 12-frame intervals
  '1s': [],   // 120-frame intervals  
  '1.5s': [], // 180-frame intervals
};

function processTimerBatch(interval, frameCount) {
  if (frameCount % getFramesForInterval(interval) === 0) {
    timerBatches[interval].forEach(timer => timer.execute());
  }
}
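
processTimerBatch depends on getFramesForInterval, which is not defined in this report. A minimal sketch, assuming interval labels always look like "0.1s" and the frame rate is fixed at 120fps, could be:

```javascript
// Hypothetical helper: convert an interval label such as "0.1s" into a frame count.
// The 120fps default matches the frame rate this report assumes for $cow.
function getFramesForInterval(interval, fps = 120) {
  const seconds = parseFloat(interval); // "1.5s" -> 1.5
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new RangeError(`Invalid timer interval: ${interval}`);
  }
  // Round to whole frames and never go below one frame.
  return Math.max(1, Math.round(seconds * fps));
}

// The five intervals used by $39i.
for (const label of ["0.1s", "0.3s", "1s", "1.5s", "2s"]) {
  console.log(label, getFramesForInterval(label));
}
```

Rounding matters here because a product such as 0.3 * 120 is not guaranteed to be an exact integer in floating point, and the modulo test in processTimerBatch only fires on whole frame counts.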

B. Lazy Timer Key Generation

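// Cache timer keys to avoid JSON.stringify overhead.
// Assumes each args array is a stable object that is reused across frames and
// is always paired with the same head and cacheId; a fresh array per frame never hits the cache.
const timerKeyCache = new WeakMap();

function getTimerKey(head, cacheId, args) {
  let key = timerKeyCache.get(args);
  if (key === undefined) {
    key = `${head}-${cacheId}-${JSON.stringify(args)}`;
    timerKeyCache.set(args, key);
  }
  return key;
}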

Estimated Gain: 15-25% reduction in timing overhead

Recommended Implementation Plan

Phase 1: Critical Path Optimization (High Impact)

Implement integer-only alpha compositing for opaque blending cases
Add SIMD-style bulk pixel operations for large layer composites
Implement dirty rectangle tracking for incremental updates

Phase 2: Memory & Caching (Medium Impact)

Add layer result caching for static content
Implement pre-multiplied alpha storage format
Optimize timer key generation with caching

Phase 3: Advanced Optimizations (Lower Impact)

WebGL compositing pipeline for complex effects
Worker thread layer rendering for parallel processing
Adaptive quality scaling based on performance metrics

Measurement & Validation

Performance Metrics to Track

// Add to graph.mjs for performance monitoring
const perfMetrics = {
  blendTime: 0,
  layerRenderTime: 0,
  timingOverhead: 0,
  frameDrops: 0
};

function measureBlendPerformance(fn) {
  const start = performance.now();
  fn();
  perfMetrics.blendTime += performance.now() - start;
}
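
The frameDrops counter needs a per-frame check against the budget. The sketch below shows one way to do that bookkeeping on a separate stats object; `createFrameStats`, `recordFrame`, and `averageFrameMs` are hypothetical names, and the sample durations are made up for illustration rather than measured.

```javascript
// Hypothetical per-frame bookkeeping to complement perfMetrics.
const FRAME_BUDGET_MS = 1000 / 120; // ~8.33ms at 120fps

function createFrameStats() {
  return { frames: 0, totalMs: 0, frameDrops: 0 };
}

// Record one frame's duration and count it as dropped if it exceeded the budget.
function recordFrame(stats, frameMs, budgetMs = FRAME_BUDGET_MS) {
  stats.frames += 1;
  stats.totalMs += frameMs;
  if (frameMs > budgetMs) stats.frameDrops += 1;
  return stats;
}

// Average frame time so far, or 0 before any frame has been recorded.
function averageFrameMs(stats) {
  return stats.frames === 0 ? 0 : stats.totalMs / stats.frames;
}

// Illustrative durations only; one of the four exceeds the 8.33ms budget.
const stats = createFrameStats();
[7.9, 8.1, 9.4, 8.0].forEach((ms) => recordFrame(stats, ms));
console.log(stats.frameDrops, averageFrameMs(stats).toFixed(2));
```

Tracking averages alongside drop counts helps separate a uniformly slow pipeline from one that is usually within budget but spikes on frames where several timers fire together.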

Expected Performance Gains

Alpha Compositing: 40-60% faster → ~1.5ms savings per frame
Layer Rendering: 30-50% reduction → ~1.0ms savings per frame
Timing Overhead: 15-25% reduction → ~0.2ms savings per frame
Total Improvement: ~2.7ms per frame (about 32% of the 8.33ms frame budget)

This would provide significant headroom for more complex effects and better frame stability at 120fps.
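
The totals above follow directly from the per-area estimates. As a quick arithmetic check, assuming the piece currently uses the full 8.33ms budget as the frame breakdown suggests (variable names are illustrative):

```javascript
// Figures copied from the estimates in this report (milliseconds per frame).
const frameBudgetMs = 1000 / 120;
const estimatedSavingsMs = {
  alphaCompositing: 1.5,
  layerRendering: 1.0,
  timingOverhead: 0.2,
};

const totalSavingsMs = Object.values(estimatedSavingsMs).reduce((sum, ms) => sum + ms, 0);
const shareOfBudget = totalSavingsMs / frameBudgetMs;
const remainingFrameMs = frameBudgetMs - totalSavingsMs;

console.log(totalSavingsMs.toFixed(1));        // 2.7
console.log((shareOfBudget * 100).toFixed(0)); // 32
console.log(remainingFrameMs.toFixed(2));      // 5.63
```

If the estimates hold, a frame would take roughly 5.63ms, leaving about 2.7ms of slack within each 120fps frame.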

Current Status

✅ Completed: Removed setTimeout-based timing (major architectural fix)
🔄 In Progress: Performance measurement infrastructure
📋 Next: Implement integer-only alpha compositing optimization

Last Updated: September 6, 2025
Analysis Target: $cow embedded layer composition
