# Oven Architecture Report

**Generated:** 2026-02-13
**Server:** oven.aesthetic.computer (137.184.237.166)
**Uptime:** 44 days (OS), ~12 min since last oven restart

---

## 1. Machine Specs

| Resource | Value |
|----------|-------|
| **CPU** | 2 vCPUs (Intel, DO-Regular) |
| **RAM** | 1.97 GB total, ~635 MB used, ~1.3 GB available |
| **Swap** | None configured |
| **Disk** | 58 GB, 6.7 GB used (12%) |
| **OS** | Ubuntu 24.04.3 LTS (kernel 6.8.0-90) |
| **Node** | v20.20.0 |
| **Chrome** | 143.0.7499.40 (headless, Puppeteer-managed) |
| **ffmpeg** | 6.1.1 (system package, WebP + H.264 support) |

### Current Memory Breakdown (at rest with 1 active grab)

- **Node (`server.mjs`):** ~175 MB (8.6% of RAM)
- **Chrome main process:** ~202 MB (10%)
- **Chrome GPU process:** ~157 MB (7.7%)
- **Chrome network service:** ~125 MB (6.2%)
- **Chrome renderer(s):** ~65-100 MB each (3-5%)
- **Caddy:** ~32 MB
- **Total Chrome footprint:** ~600-700 MB
- **Peak memory observed in logs:** **1.4 GB** (during heavy grab batches)

**Verdict:** With 2 GB total and no swap, the machine is memory-constrained. A single Chrome instance plus Node already consumes ~850 MB at rest. During heavy workloads, peak memory hits 1.4 GB, leaving very little headroom. This is the primary bottleneck.

---

## 2. Architecture Overview

```
Internet → Caddy (port 443/80, gzip, TLS)
         → Express (port 3002)
         → Puppeteer (Chrome)
         → ffmpeg (WebP/MP4)
         → terser (JS minification)
         → DO Spaces (S3 storage)
         → MongoDB (metadata)
```

### Process Model

- **Single Node process** (`server.mjs`) — no clustering, no workers
- **Single Chrome browser** — shared instance, reused across all grab/icon/preview requests
- **Serial grab queue** — `grabRunning` boolean, one grab at a time, 100ms delay between jobs
- systemd manages the oven service with `Restart=always`, `RestartSec=10`
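To ground the shared-instance bullet above, here is a minimal sketch of the pattern (standard Puppeteer API; `getBrowser` and `withPage` are illustrative names, not the actual `grabber.mjs` exports):

```javascript
// Sketch of the single-shared-browser pattern described above.
// Function names are illustrative, not the real grabber.mjs API.
import puppeteer from "puppeteer";

let browser = null; // one Chrome for the whole Node process

async function getBrowser() {
  // Lazily launch a single headless Chrome and reuse it for every request.
  if (!browser) {
    browser = await puppeteer.launch({ headless: true });
  }
  return browser;
}

async function withPage(fn) {
  // Each grab/icon/preview request runs in a tab of the shared browser.
  const page = await (await getBrowser()).newPage();
  try {
    return await fn(page);
  } finally {
    await page.close(); // tabs are cheap; the browser itself stays up
  }
}
```

Every job funnels through this one browser serially; Priority 3 in section 5 relaxes exactly that constraint.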
### Key Modules

| Module | Purpose | Size |
|--------|---------|------|
| `server.mjs` | Express routes + dashboard HTML | 104 KB |
| `grabber.mjs` | Screenshot/WebP/icon capture via Puppeteer | 127 KB |
| `baker.mjs` | Tape (MP4) baking pipeline | 24 KB |
| `bundler.mjs` | KidLisp/JS piece HTML bundle generation | 44 KB |

---

## 3. API Endpoints (41 routes)

### Core Operations

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/` | GET | Dashboard (real-time WebSocket updates) |
| `/health` | GET | Health check |
| `/status` | GET | Server status + recent bakes |
| `/grab-status` | GET | Active grabs + queue state |

### Tape Baking (MP4)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/bake` | POST | Start tape bake (WebP frames → MP4) |
| `/bake-complete` | POST | Callback when bake finishes |
| `/bake-status` | POST | Check bake progress |

### Screenshots & WebP Captures (Grabber)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/grab` | POST | Trigger grab (screenshot/animation) |
| `/grab/:format/:width/:height/:piece` | GET | Direct grab with params |
| `/grab-ipfs` | POST | Grab + IPFS upload |
| `/grab-cleanup` | POST | Clean stale grabs |
| `/grab-clear` | POST | Clear all active grabs |
| `/icon/:size/:piece.png` | GET | Piece icon (cached → DO Spaces) |
| `/icon/:size/:piece.webp` | GET | Piece icon as WebP |
| `/preview/:size/:piece.png` | GET | Piece preview screenshot |

### OG Images

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/kidlisp-og.png` | GET | KidLisp OG image (for social sharing) |
| `/kidlisp-og` | GET | KidLisp OG page (HTML) |
| `/kidlisp-og/status` | GET | OG cache status |
| `/kidlisp-og/preview` | GET | OG preview page |
| `/notepat-og.png` | GET | Notepat OG image |
| `/kidlisp-backdrop.webp` | GET | KidLisp backdrop animation |
| `/kidlisp-backdrop` | GET | KidLisp backdrop page |

### App Screenshots

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/app-screenshots` | GET | App screenshot dashboard |
| `/app-screenshots/:preset/:piece.png` | GET | Screenshot by preset |
| `/app-screenshots/download/:piece` | GET | Download all presets as ZIP |
| `/api/app-screenshots/:piece` | GET | JSON metadata for screenshots |

### Bundle (HTML offline bundles)

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/bundle-html` | GET | Generate HTML bundle (SSE streaming) |
| `/bundle-prewarm` | POST | Prewarm bundle cache |
| `/bundle-status` | GET | Bundle cache status |

### Misc

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/frozen` | GET | List frozen pieces |
| `/api/frozen/:piece` | DELETE | Unfreeze a piece |
| `/keeps/latest` | GET | Latest keep thumbnail |
| `/keeps/latest/:piece` | GET | Latest keep for specific piece |
| `/keeps/all` | GET | All latest IPFS uploads |
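For orientation, here is how a client might exercise two of these routes (a sketch using standard `fetch`; the parameter values `webp/1200/630/prompt` are illustrative, and the exact contract lives in `server.mjs`):

```javascript
// Hypothetical client calls against the routes listed above.
// Parameter values are illustrative, not a documented contract.
const OVEN = "https://oven.aesthetic.computer";

// Direct grab: GET /grab/:format/:width/:height/:piece
const res = await fetch(`${OVEN}/grab/webp/1200/630/prompt`);
const bytes = await res.arrayBuffer(); // the rendered capture

// Queue state: GET /grab-status
const status = await (await fetch(`${OVEN}/grab-status`)).json();
console.log(status); // active grabs + queue depth
```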
---

## 4. Current Issues

### 4.1 Terser Not Found (FIXED in latest deploy)

The error log shows **92 minification failures** with `Cannot find package 'terser'`. This was from a previous deploy where `npm install` wasn't run after `terser` was added to `package.json`. The latest deploy (today) resolved this — the bundler is working and prewarm succeeds.

### 4.2 Repeated Service Crashes

The systemd journal shows **25 instances** of `Main process exited, code=exited, status=1/FAILURE`. These are likely from:

- Deploys that didn't run `npm install` before restarting
- OOM situations (no swap, peak memory hit 1.4 GB on a 2 GB machine)
- Chrome connection drops during heavy workloads

### 4.3 Serial Grab Queue (Primary Performance Bottleneck)

The grabber processes **one grab at a time** using a simple boolean lock:

```javascript
let grabRunning = false; // Only one grab runs at a time
```

Currently there are **19 items in the queue** (1 capturing, 18 queued). Each grab takes roughly 30-40 seconds (load page + wait for ready signal + capture 16 frames + ffmpeg encode + upload to Spaces). That means the current queue will take **~10-13 minutes** to clear.

### 4.4 No Swap Space

With 2 GB RAM and Chrome eating 600-700 MB at rest, there's no safety net. If a grab hits a memory-heavy piece (or multiple Chrome renderer processes spawn), the OOM killer can terminate the process.

### 4.5 Low File Descriptor Limit

`ulimit -n` is 1024 (the default). Chrome alone can use hundreds of FDs. Under heavy load this could cause `EMFILE` errors.

### 4.6 Stale PM2 Process

There's a PM2 daemon running (`PM2 v6.0.14`) from before the systemd migration. It's consuming 17 MB of RAM doing nothing.

---

## 5. Recommendations for Faster Parallel WebP Recording

### Priority 1: Upgrade the Droplet (Immediate Impact)

| Current | Recommended | Cost |
|---------|-------------|------|
| 2 vCPU / 2 GB | **4 vCPU / 8 GB** | ~$48/mo (vs ~$18/mo now) |

With 8 GB RAM you can comfortably run **3-4 concurrent Chrome tabs** for parallel captures. With 4 vCPUs, ffmpeg encoding can happen in parallel without blocking grabs.

### Priority 2: Add Swap (Quick Win, Free)

```bash
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

This prevents OOM kills during peak usage. Even slow swap is better than crashing.

### Priority 3: Parallel Grab Workers (Architecture Change)

Replace the serial `grabRunning` boolean with a **concurrency pool**:

```
Current:  [Queue] → [Single Worker] → [Upload]

Proposed: [Queue] → [Worker 1] → [Upload]
                  → [Worker 2] → [Upload]
                  → [Worker 3] → [Upload]
```

**Implementation approach:**

1. Keep the single shared browser, but add a **browser page pool** — launch N pages (tabs) in the same Chrome instance
2. Replace the `grabRunning` boolean with a semaphore/counter: `let grabsRunning = 0; const MAX_CONCURRENT_GRABS = 3;`
3. Each worker gets its own page from the pool, captures frames, encodes, uploads, then returns the page
4. Chrome tabs share memory more efficiently than separate browser instances (~65 MB per tab vs ~300+ MB per browser)

**Key changes in `grabber.mjs`:**

- `processGrabQueue()` — loop while `grabsRunning < MAX_CONCURRENT_GRABS && queue.length > 0`
- Page pool: pre-create N pages at startup, hand them out via `acquirePage()` / `releasePage()`
- ffmpeg calls already happen in child processes, so they parallelize naturally
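A minimal sketch of those changes together, assuming a pool sized to the concurrency limit (the job shape, `initPagePool`, and the body of `runGrab` are placeholders for the existing capture/encode/upload logic in `grabber.mjs`):

```javascript
// Sketch of a concurrency pool replacing the grabRunning boolean.
// Job shape and helper names are hypothetical placeholders.
const MAX_CONCURRENT_GRABS = 3;
let grabsRunning = 0;

const pagePool = []; // pre-created Puppeteer pages (tabs)

async function initPagePool(browser, size = MAX_CONCURRENT_GRABS) {
  for (let i = 0; i < size; i++) pagePool.push(await browser.newPage());
}

function acquirePage() {
  // Safe to pop unconditionally: in-flight jobs never exceed pool size.
  return pagePool.pop();
}

function releasePage(page) {
  pagePool.push(page);
}

function processGrabQueue(queue) {
  // Start jobs until the pool is saturated or the queue is empty.
  while (grabsRunning < MAX_CONCURRENT_GRABS && queue.length > 0) {
    const job = queue.shift();
    grabsRunning++;
    runGrab(job)
      .catch((err) => console.error("grab failed:", err))
      .finally(() => {
        grabsRunning--;
        processGrabQueue(queue); // pull the next job when a slot frees up
      });
  }
}

async function runGrab(job) {
  const page = acquirePage();
  try {
    await page.goto(job.url, { waitUntil: "networkidle0" });
    // Stand-in for the real 16-frame capture + ffmpeg encode + Spaces upload:
    await page.screenshot({ path: job.outPath });
  } finally {
    releasePage(page);
  }
}
```

Because ffmpeg already runs in child processes, three in-flight grabs keep Chrome and the encoders busy without blocking Node's event loop.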
**Expected improvement:** With 3 concurrent workers on a 4-CPU/8-GB droplet:

- Current: 19 queued items × ~35s each = **~11 minutes**
- Parallel: 19 items / 3 workers × ~35s = **~3.7 minutes** (a 3x speedup)

### Priority 4: Optimize Individual Grab Speed

- **Reduce the `acPieceReady` timeout** from 30s to 10s — pieces that don't signal ready within 10s probably won't at 30s either
- **Skip Google Analytics** in capture mode — add a `?noanalytics=true` param or block GA URLs in Chrome's request interception (eliminates `ERR_ABORTED` noise in logs)
- **Pre-render frame capture** — instead of 16 sequential `page.screenshot()` calls with delays, consider a client-side approach where the piece renders frames to an offscreen canvas and bundles them

### Priority 5: Separate Concerns (Long-term)

The oven server handles too many responsibilities in a single process:

- Screenshot/WebP capture (CPU + memory intensive)
- OG image generation (CPU intensive)
- Bundle HTML generation (CPU intensive during minification)
- Tape baking (CPU intensive)
- Dashboard serving
- Icon/preview caching

Consider splitting into:

1. **API gateway** (Express, lightweight) — routes, dashboard, status
2. **Capture workers** (Chrome + ffmpeg) — the heavy lifting, can be scaled independently
3. **Bundle worker** — terser minification, isolated from the capture workload

This could be done with Node worker threads, separate processes, or even separate droplets behind a load balancer.

### Quick Wins (Do Now)

1. **Kill stale PM2:** `pm2 kill` — frees 17 MB
2. **Add swap:** 2 GB swapfile — prevents OOM crashes
3. **Increase file limits:** add `LimitNOFILE=65536` to oven.service
4. **Clean up logs:** `journalctl --vacuum-time=7d`

---

## 6. Storage & CDN

| Storage | Bucket / Endpoint | Content |
|---------|-------------------|---------|
| DO Spaces (art) | `art-aesthetic-computer` | Source ZIPs, grab WebPs, icons |
| DO Spaces (blobs) | `at-blobs-aesthetic-computer` | Processed tapes (MP4), thumbnails |
| CDN | `art-aesthetic-computer.sfo3.cdn.digitaloceanspaces.com` | Public CDN for grabs/icons |
| CDN | `at-blobs.aesthetic.computer` | Public CDN for tapes |

- ac-source on oven: **640 files** in `/opt/oven/ac-source/`
- Total oven directory: **168 MB** (including node_modules)

---

## 7. Bundle Cache Status

- **Cache state:** Warm (189 core files minified)
- **Git version:** `64512591a`
- **ac-source synced:** 640 files
- **Pre-push hook:** Installed (`.git/hooks/pre-push` → `sync-source.sh`)
- **Prewarm:** Triggered on every `deploy.sh` restart

---

## 8. Summary

The oven is a **capable but resource-constrained** single-process server trying to do everything at once on a 2 vCPU / 2 GB droplet. The serial grab queue is the biggest performance bottleneck — with 18+ items queued, individual WebP recordings wait 10+ minutes.

**Fastest path to improvement:**

1. Add 2 GB swap (5 min of work, prevents crashes)
2. Upgrade to 4 vCPU / 8 GB ($30/mo more)
3. Implement parallel grab workers (code change in `grabber.mjs`)
4. Expected result: **3-4x faster WebP recording throughput**
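As a closing sanity check on that throughput claim, a back-of-envelope model (it assumes the ~35s average per grab from section 4.3 and perfect parallelism; rounding to whole batches makes it slightly more conservative than the ~3.7-minute figure in section 5):

```javascript
// Back-of-envelope queue-clear model. Assumes ~35 s per grab (section 4.3)
// and zero contention between workers, so treat the results as best-case.
function queueClearMinutes(items, workers, secondsPerGrab = 35) {
  const batches = Math.ceil(items / workers); // whole batches: slightly conservative
  return (batches * secondsPerGrab) / 60;
}

console.log(queueClearMinutes(19, 1).toFixed(1)); // "11.1" minutes: today's serial queue
console.log(queueClearMinutes(19, 3).toFixed(1)); // "4.1" minutes: with 3 parallel workers
```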