KidLisp Performance Analysis & Optimization Report
Executive Summary
Analysis of the $cow piece rendering pipeline reveals several performance bottlenecks in the alpha compositing and embedded layer system. The piece composites two animated layers ($39i and $r2f) at 120fps with complex timing expressions, creating significant computational overhead.
Piece Analysis: $cow
Structure
📁 $cow (composite layer)
├─ 📄 $39i (background effects layer)
└─ 📄 $r2f (foreground zoom layer)
Source Code Breakdown
Main Compositor ($cow):
($39i 0 0 w h 128) ; Background layer at 50% opacity
($r2f 0 0 w h 128) ; Foreground layer at 50% opacity
(contrast 1.5) ; Expensive post-processing effect
Background Layer ($39i): 11 operations/frame
- Complex timing: 5 different timer expressions (0.1s, 1.5s, 1s, 2s, 0.3s)
- Heavy operations: flood, scroll, zoom, blur, contrast, spin
- Random point generation: (repeat 30 point)
Foreground Layer ($r2f): 11 operations/frame
- High-frequency zoom: (0.1s (zoom (? 1.89 1 1.1 1.2))) every 100ms
- Continuous effects: scroll, spin, blur
- Multiple flood fills: (repeat 2 (flood ? ?))
Performance Bottlenecks
1. Alpha Compositing Pipeline
Current Implementation (graph.mjs lines 1558-1610):
function blend(dst, src, si, di, alphaIn = 1) {
  // Branch A: Transparent pixel compositing (expensive)
  if (dst[di + 3] < 255 && src[si + 3] > 0) {
    const alphaSrc = (src[si + 3] * alphaIn) / 255;
    const alphaDst = dst[di + 3] / 255;
    const combinedAlpha = alphaSrc + (1.0 - alphaSrc) * alphaDst;
    // Per-channel floating-point math (3 divisions, 6 multiplications)
    for (let offset = 0; offset < 3; offset++) {
      dst[di + offset] = (src[si + offset] * alphaSrc +
        dst[di + offset] * (1.0 - alphaSrc) * alphaDst) /
        (combinedAlpha + epsilon);
    }
  } else {
    // Branch B: Opaque pixel compositing (faster integer math)
    const alpha = src[si + 3] * alphaIn + 1;
    const invAlpha = 256 - alpha;
    dst[di] = (alpha * src[si + 0] + invAlpha * dst[di + 0]) >> 8;
    // ... similar for G, B channels
  }
}
Performance Issues:
- Floating-point overhead: Branch A uses expensive division and floating-point arithmetic
- Branch prediction: The per-pixel transparent vs opaque branch can cause mispredictions and pipeline stalls when the two cases are interleaved
- Memory access pattern: Non-sequential pixel access in compositing loops
2. Layer Rendering Frequency
Current Execution Pattern:
- $cow renders at 120fps (8.33ms budget per frame)
- Each embedded layer renders independently
- $39i: 5 timer expressions firing at different intervals
- $r2f: High-frequency zoom (10 times per second)
- Double alpha compositing: Each layer → main buffer → final output
Frame Budget Breakdown (estimated):
Per-frame operations (120fps = 8.33ms budget):
├─ $39i rendering: ~2.5ms (timer evaluation + effects)
├─ $r2f rendering: ~2.0ms (zoom calculations + blending)
├─ Alpha compositing: ~2.5ms (pixel-by-pixel blending)
├─ Contrast effect: ~1.0ms (post-processing)
└─ Overhead: ~0.33ms (timing, evaluation)
Total: ~8.33ms (100% budget utilization)
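The arithmetic behind this breakdown can be checked with a short sketch. The cost figures are the estimates listed above, not measurements, and the names `frameBudgetMs`, `estimatedCosts`, and `budgetUtilization` are hypothetical helpers introduced only for illustration.

```javascript
// Hypothetical helper: per-frame time budget in milliseconds for a target fps.
function frameBudgetMs(fps) {
  return 1000 / fps;
}

// Estimated per-frame costs in ms, copied from the breakdown above.
// These are estimates from the report, not profiler output.
const estimatedCosts = {
  layer39i: 2.5,
  layerR2f: 2.0,
  alphaCompositing: 2.5,
  contrast: 1.0,
  overhead: 0.33,
};

// Sum the estimates and express the total as a fraction of the frame budget.
function budgetUtilization(costs, fps) {
  const total = Object.values(costs).reduce((sum, ms) => sum + ms, 0);
  return { total, utilization: total / frameBudgetMs(fps) };
}

const { total, utilization } = budgetUtilization(estimatedCosts, 120);
console.log(`${total.toFixed(2)}ms used, ${(utilization * 100).toFixed(1)}% of budget`);
```

With essentially the whole budget consumed, any single slow frame in one stage is likely to show up as a dropped frame, which is why the sections below target the largest line items first.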
3. Timing System Overhead
Recent Fix Applied: Removed setTimeout-based timing in favor of frame-based counting:
// OLD (expensive):
setTimeout(() => { /* execute timing expression */ }, delay);
// NEW (efficient):
if (this.frameCount - lastExecution >= targetFrames) {
  // execute immediately in frame context
}
Remaining Issues:
- Timer expressions are evaluated every frame even when not firing
- Context switching between embedded layers
- Redundant timing key generation:
${head}-${cacheId}-${JSON.stringify(args)}
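One possible way to address the first remaining issue is to track the earliest frame at which any timer is due, so that frames where nothing fires cost a single comparison. The following is a minimal sketch under that assumption; `TimerScheduler` and its methods are hypothetical and do not reflect the existing KidLisp timing code.

```javascript
// Hypothetical scheduler: remembers the earliest due frame across all timers
// so that a frame with nothing to fire returns after one comparison.
class TimerScheduler {
  constructor() {
    this.timers = [];
    this.earliestDue = Infinity;
  }

  // Register a callback that should run every `intervalFrames` frames.
  add(intervalFrames, callback) {
    const timer = { intervalFrames, callback, nextDue: intervalFrames };
    this.timers.push(timer);
    this.earliestDue = Math.min(this.earliestDue, timer.nextDue);
  }

  // Call once per frame; returns how many timers fired.
  tick(frameCount) {
    if (frameCount < this.earliestDue) return 0; // fast path: nothing is due
    let firedCount = 0;
    let earliest = Infinity;
    for (const timer of this.timers) {
      if (frameCount >= timer.nextDue) {
        timer.callback(frameCount);
        timer.nextDue = frameCount + timer.intervalFrames;
        firedCount++;
      }
      earliest = Math.min(earliest, timer.nextDue);
    }
    this.earliestDue = earliest;
    return firedCount;
  }
}

// Usage: a 0.1s timer (12 frames at 120fps) and a 1s timer (120 frames).
const scheduler = new TimerScheduler();
const zoomFrames = [];
scheduler.add(12, (frame) => zoomFrames.push(frame));
scheduler.add(120, () => {});
for (let frame = 1; frame <= 36; frame++) scheduler.tick(frame);
console.log(zoomFrames); // [12, 24, 36]
```

The linear scan on firing frames is deliberate: with only a handful of timers per layer, a priority queue would probably add more overhead than it removes.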
Optimization Opportunities
1. Alpha Compositing Optimizations
A. SIMD-style Bulk Operations
// Instead of calling blend() per pixel, blend a run of pixels in one tight loop
function fastBlendChunk(dst, src, dstIdx, srcIdx, count, alpha) {
  // Convert alpha (0..1) to fixed point (0..256) once per chunk,
  // so the inner loop uses only integer multiplies and shifts
  const alpha256 = (alpha * 256) | 0;
  const invAlpha = 256 - alpha256;
  for (let i = 0; i < count * 4; i += 4) {
    dst[dstIdx + i] = ((alpha256 * src[srcIdx + i] + invAlpha * dst[dstIdx + i]) >> 8);
    dst[dstIdx + i + 1] = ((alpha256 * src[srcIdx + i + 1] + invAlpha * dst[dstIdx + i + 1]) >> 8);
    dst[dstIdx + i + 2] = ((alpha256 * src[srcIdx + i + 2] + invAlpha * dst[dstIdx + i + 2]) >> 8);
    // Alpha channel is left untouched: this path assumes an opaque destination
  }
}
B. Pre-multiplied Alpha
// Store layers in pre-multiplied format to avoid runtime multiplication
function convertToPremultiplied(pixels) {
  for (let i = 0; i < pixels.length; i += 4) {
    const alpha = pixels[i + 3] / 255;
    pixels[i] *= alpha; // R
    pixels[i + 1] *= alpha; // G
    pixels[i + 2] *= alpha; // B
  }
}
Estimated Gain: 40-60% faster compositing
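To show why pre-multiplied storage helps, here is a minimal sketch of a source-over blend where both buffers are assumed to hold pre-multiplied RGBA. The function name `blendPremultiplied` is hypothetical, and the sketch assumes per-pixel alpha only (no extra layer opacity factor). Each channel then needs one multiply and one integer division, with no division by a combined alpha.

```javascript
// Source-over blend of one RGBA pixel. Assumes BOTH buffers are pre-multiplied:
//   out = src + dst * (255 - srcAlpha) / 255   (all four channels, alpha included)
function blendPremultiplied(dst, src, di, si) {
  const invAlpha = 255 - src[si + 3];
  for (let c = 0; c < 4; c++) {
    // Adding 127 before dividing rounds to the nearest integer.
    dst[di + c] = src[si + c] + (((dst[di + c] * invAlpha + 127) / 255) | 0);
  }
}

// Usage: red at alpha 128 is stored pre-multiplied as (128, 0, 0, 128).
const dstPixel = new Uint8ClampedArray([0, 0, 255, 255]); // opaque blue
const srcPixel = new Uint8ClampedArray([128, 0, 0, 128]); // half-transparent red
blendPremultiplied(dstPixel, srcPixel, 0, 0);
console.log(Array.from(dstPixel)); // [128, 0, 127, 255]
```

Because the same expression handles opaque and transparent destinations, this formulation would also remove the Branch A / Branch B split, at the cost of converting back to straight alpha if a consumer expects it.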
2. Layer Rendering Optimizations
A. Dirty Rectangle Tracking
class LayerBuffer {
  constructor(width, height) {
    this.pixels = new Uint8Array(width * height * 4);
    this.dirtyBox = null; // Track changed regions
  }
  markDirty(x, y, w, h) {
    if (!this.dirtyBox) {
      this.dirtyBox = { x, y, w, h };
    } else {
      // Expand dirty box to include new region
      this.dirtyBox = expandBox(this.dirtyBox, { x, y, w, h });
    }
  }
}
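The `LayerBuffer` sketch calls `expandBox` without defining it. One plausible definition is the bounding-box union of two rectangles, shown below together with a hypothetical `copyDirtyRegion` helper that composites only the dirty rows. Both functions are assumptions made for illustration, not existing graph.mjs code.

```javascript
// Hypothetical expandBox: the smallest rectangle that covers both a and b.
function expandBox(a, b) {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.w, b.x + b.w);
  const bottom = Math.max(a.y + a.h, b.y + b.h);
  return { x, y, w: right - x, h: bottom - y };
}

// Hypothetical copyDirtyRegion: copy only the dirty box from src to dst.
// Both buffers are RGBA typed arrays of the same dimensions, and the box is
// assumed to lie fully inside the buffer.
function copyDirtyRegion(dst, src, bufferWidth, box) {
  for (let row = box.y; row < box.y + box.h; row++) {
    const start = (row * bufferWidth + box.x) * 4;
    dst.set(src.subarray(start, start + box.w * 4), start);
  }
}

// Usage: merge two changed regions into one dirty box.
const merged = expandBox({ x: 10, y: 10, w: 5, h: 5 }, { x: 0, y: 12, w: 4, h: 10 });
console.log(merged); // { x: 0, y: 10, w: 15, h: 12 }
```

A single bounding box is simple but can over-cover when two small regions are far apart. Effects such as blur or zoom that touch the whole layer would mark the full buffer dirty, so the gain depends on how often a piece makes only localized changes.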
B. Layer Caching Strategy
// Cache layer results when no animations are active
const layerCache = new Map();
function renderLayerWithCaching(layerId, hasActiveTimers) {
  if (!hasActiveTimers && layerCache.has(layerId)) {
    return layerCache.get(layerId);
  }
  const result = renderLayer(layerId);
  if (!hasActiveTimers) {
    layerCache.set(layerId, result);
  }
  return result;
}
Estimated Gain: 30-50% reduction in redundant layer rendering
3. Timing System Optimizations
A. Timer Batching
// Group timers by interval for batch processing
const timerBatches = {
  '0.1s': [], // 12-frame intervals
  '1s': [], // 120-frame intervals
  '1.5s': [], // 180-frame intervals
};
function processTimerBatch(interval, frameCount) {
  if (frameCount % getFramesForInterval(interval) === 0) {
    timerBatches[interval].forEach(timer => timer.execute());
  }
}
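`processTimerBatch` depends on a `getFramesForInterval` helper that is not shown. A minimal sketch of what it might look like follows, assuming interval labels are a number of seconds followed by `s` and a fixed 120fps target. The default `fps` parameter and the error handling are illustrative choices.

```javascript
// Hypothetical helper: convert an interval label such as "0.1s" or "1.5s"
// into a whole number of frames at the given frame rate.
function getFramesForInterval(interval, fps = 120) {
  const seconds = parseFloat(interval); // parseFloat stops at the trailing "s"
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new RangeError(`Invalid timer interval: ${interval}`);
  }
  // Round to the nearest frame and never return 0: `frameCount % 0` is NaN,
  // so a zero-frame interval would silently never fire.
  return Math.max(1, Math.round(seconds * fps));
}

// The five intervals used by $39i, expressed in frames at 120fps.
for (const label of ["0.1s", "0.3s", "1s", "1.5s", "2s"]) {
  console.log(label, getFramesForInterval(label));
}
```

Rounding matters here because products such as 0.1 * 120 are not exact in floating point, and an unrounded frame count would make the modulo comparison unreliable.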
B. Lazy Timer Key Generation
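// Cache timer keys to avoid repeated JSON.stringify calls.
// Only effective when the same args array object is reused across frames;
// head and cacheId are part of the lookup so distinct timers never share a key.
const timerKeyCache = new WeakMap(); // args -> Map(prefix -> key)
function getTimerKey(head, cacheId, args) {
  const prefix = `${head}-${cacheId}`;
  let byPrefix = timerKeyCache.get(args);
  if (!byPrefix) timerKeyCache.set(args, (byPrefix = new Map()));
  let key = byPrefix.get(prefix);
  if (key === undefined) {
    key = `${prefix}-${JSON.stringify(args)}`;
    byPrefix.set(prefix, key);
  }
  return key;
}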
Estimated Gain: 15-25% reduction in timing overhead
Recommended Implementation Plan
Phase 1: Critical Path Optimization (High Impact)
- Implement integer-only alpha compositing for opaque blending cases
- Add SIMD-style bulk pixel operations for large layer composites
- Implement dirty rectangle tracking for incremental updates
Phase 2: Memory & Caching (Medium Impact)
- Add layer result caching for static content
- Implement pre-multiplied alpha storage format
- Optimize timer key generation with caching
Phase 3: Advanced Optimizations (Lower Impact)
- WebGL compositing pipeline for complex effects
- Worker thread layer rendering for parallel processing
- Adaptive quality scaling based on performance metrics
Measurement & Validation
Performance Metrics to Track
// Add to graph.mjs for performance monitoring
const perfMetrics = {
  blendTime: 0,
  layerRenderTime: 0,
  timingOverhead: 0,
  frameDrops: 0
};
function measureBlendPerformance(fn) {
  const start = performance.now();
  fn();
  perfMetrics.blendTime += performance.now() - start;
}
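The `frameDrops` counter is declared above but nothing updates it. One way it could be maintained is to time each whole frame and compare it with the budget, as in the sketch below. `createFrameTracker` is a hypothetical helper, and the injectable `now` clock is an assumption added so the logic can be exercised without real rendering.

```javascript
// Hypothetical per-frame tracker: accumulates frame time and counts frames
// that exceed the budget for the target frame rate.
function createFrameTracker(targetFps = 120, now = () => performance.now()) {
  const budgetMs = 1000 / targetFps;
  const stats = { frames: 0, totalMs: 0, frameDrops: 0 };
  return {
    stats,
    // Wrap one frame's worth of rendering work and record how long it took.
    measureFrame(renderFrame) {
      const start = now();
      renderFrame();
      const elapsed = now() - start;
      stats.frames++;
      stats.totalMs += elapsed;
      if (elapsed > budgetMs) stats.frameDrops++;
      return elapsed;
    },
    // Mean frame time in ms, or 0 before any frame has been measured.
    averageMs() {
      return stats.frames === 0 ? 0 : stats.totalMs / stats.frames;
    },
  };
}

// Usage with the real clock: wrap the per-frame render call.
const tracker = createFrameTracker(120);
tracker.measureFrame(() => {
  /* composite layers here */
});
console.log(tracker.stats.frames, tracker.stats.frameDrops);
```

Tracking the drop count alongside the average is useful because a piece can have an acceptable mean frame time while still stuttering on the frames where several timers fire together.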
Expected Performance Gains
- Alpha Compositing: 40-60% faster → ~1.5ms savings per frame
- Layer Rendering: 30-50% reduction → ~1.0ms savings per frame
- Timing Overhead: 15-25% reduction → ~0.2ms savings per frame
- Total Improvement: ~2.7ms per frame (about 32% of the 8.33ms frame budget)
This would provide significant headroom for more complex effects and better frame stability at 120fps.
Current Status
✅ Completed: Removed setTimeout-based timing (major architectural fix)
🔄 In Progress: Performance measurement infrastructure
📋 Next: Implement integer-only alpha compositing optimization
Last Updated: September 6, 2025
Analysis Target: $cow embedded layer composition