I want to design an OCaml library that builds in support for modifying the
program linked to it using Claude Code, and restarting itself with the fixes
automatically. The idea is for long-running services to regularly consult with
Claude (either on a fixed timetable, or urgently if something really unexpected
happens) and improve their own functionality. Claude should be used to analyse
patterns in the logs and determine whether to write code to handle a particular
case. Claude should not be used directly in the application datapath itself;
its role is to write code.

To make this work, the program needs to emit sufficient tracing data to be
useful to Claude when it does an inspection, but not so much that it overwhelms
the context window. Therefore, the first thing the library needs is some
mechanism to intercept the logging output of the program suitably. The OCaml
"logs" library is a good thing to standardise on here. It's also fine to use
the OCaml direct-style Eio library for all interactions.

Assume the code is running in a Linux environment with root level access. There
will also be a Zulip server available, with an API key that can be used to post
messages and interact with it.

This is an ambitious project, so before embarking on it, I need to think really
carefully about the design and tradeoffs, including seeking clarification where
necessary about what sorts of MCP servers or other support infrastructure will
be useful to making the library successful. I'm ok taking risks and trying unusual
approaches. The library will be called "Dancer" after the Hunter S. Thompson
quote "We're raising a generation of dancers, afraid to take one step out of
line."

## Architecture Design (v1)

### Core Components

1. **Log Interceptor & Buffer**
   - Hook into OCaml Logs library at the reporter level
   - Maintain a persistent buffer on disk, perhaps in SQLite, for analysis
   - Group consecutive identical errors with count
   - Tag logs with timestamp, module, and error type

2. **Pattern Detector**
   - Track log messages and their frequency
   - Use string matching to identify recurring patterns
   - Maintain a simple SQLite database of seen patterns
   - Trigger Claude consultation when:
     - New error pattern appears frequently (>10 times in 5 min)
     - Error rate spikes above baseline
     - Scheduled review (e.g., every 6 hours)

3. **Claude Consultation Manager**
   - Prepare context: recent logs + relevant source files
   - Ask Claude to:
     - Analyze the error pattern
     - Generate OCaml code to handle the case
     - Suggest where to integrate the fix
     - Test the fixes and trial a deployment
   - Store Claude's response and proposed changes

4. **Version Control Integration**
   - Each Claude fix creates a new git branch: `dancer/fix-<timestamp>-<error-hash>`
   - Use git worktrees for isolated changes:
     ```bash
     git worktree add ../dancer-fix-<id> -b dancer/fix-<id>
     ```
   - Apply Claude's changes in the worktree and include a changelog in the commits
   - Compile and test in isolation
   - If successful, merge to main and restart application
   - Have a script that can search for all the fix branches and update a central changelog ordered by time, suitable for a human to review regularly

5. **Restart Orchestration**
   - Library has a supervisor for process management of the application itself
   - Graceful shutdown: finish current requests with a timeout
   - State persistence before restart (if needed)
   - Automatic rollback to the previous successful binary if the restart fails
   - Health check after restart

6. **Zulip Integration**
   - Post proposed changes for human review
   - Emergency stop command
   - Status updates on consultations
   - Performance metrics before/after changes
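
The frequency trigger in the Pattern Detector (more than 10 occurrences in 5 minutes) could be sketched as a sliding window over timestamps. This is a minimal in-memory illustration only: the names `window`, `create` and `record` are hypothetical, and a real version would presumably be backed by the SQLite pattern database rather than a hash table.

```ocaml
(* Hypothetical sliding-window trigger: fire when a pattern is seen more
   than [threshold] times within [span] seconds. *)
type window = {
  threshold : int;                        (* e.g. 10 occurrences *)
  span : float;                           (* e.g. 300. seconds = 5 min *)
  seen : (string, float list) Hashtbl.t;  (* pattern -> recent timestamps *)
}

let create ?(threshold = 10) ?(span = 300.) () =
  { threshold; span; seen = Hashtbl.create 16 }

(* Record one occurrence of [pattern] at time [now] and return [true] if
   the pattern now exceeds the threshold inside the window. *)
let record w ~now pattern =
  let previous = Option.value ~default:[] (Hashtbl.find_opt w.seen pattern) in
  (* Drop timestamps that have fallen out of the window. *)
  let recent = List.filter (fun t -> now -. t <= w.span) (now :: previous) in
  Hashtbl.replace w.seen pattern recent;
  List.length recent > w.threshold
```

Passing `now` explicitly, rather than reading the clock inside `record`, keeps the rule easy to test without waiting for real time to pass.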

### Git Workflow Design

1. **Branch Strategy**
   ```
   main (production code)
   ├── dancer/fix-2024-01-15-1200-auth-error
   ├── dancer/fix-2024-01-15-1800-timeout-handler
   └── dancer/rollback-2024-01-15-1900 (if needed)
   ```

2. **Worktree Management**
   - Base directory: `/var/dancer/worktrees/`
   - Each fix gets its own worktree
   - Clean up old worktrees after successful merge
   - Keep failed attempts for analysis

3. **Change Process**
   ```ocaml
   type fix_status =
     | Proposed
     | Testing
     | Approved
     | Deployed
     | Rolled_back

   type fix_record = {
     id: string;
     branch: string;
     worktree: string;
     error_pattern: string;
     claude_solution: string;
     test_results: string option;
     status: fix_status;
     created_at: float;
   }
   ```
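
One way to keep `fix_status` changes honest is to spell out which transitions are allowed. The sketch below assumes a simple linear lifecycle with rollback only after deployment, which the design does not actually specify. `can_transition` and `branch_name` are hypothetical helpers, and both the epoch-seconds timestamp and the truncated MD5 digest are just one possible reading of the `<timestamp>` and `<error-hash>` parts of the branch name.

```ocaml
type fix_status = Proposed | Testing | Approved | Deployed | Rolled_back

(* Assumed lifecycle: Proposed -> Testing -> Approved -> Deployed,
   and a deployed fix may later be rolled back. *)
let can_transition ~from ~to_ =
  match from, to_ with
  | Proposed, Testing
  | Testing, Approved
  | Approved, Deployed
  | Deployed, Rolled_back -> true
  | _ -> false

(* Build a branch name of the form dancer/fix-<timestamp>-<error-hash>,
   using the first 8 hex characters of the pattern's MD5 digest. *)
let branch_name ~created_at ~error_pattern =
  let hash = Digest.to_hex (Digest.string error_pattern) in
  Printf.sprintf "dancer/fix-%.0f-%s" created_at (String.sub hash 0 8)
```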

### Simplified Log Management

1. **Log Format**
   ```ocaml
   type log_entry = {
     timestamp: float;
     level: Logs.level;
     source: string;  (* module name *)
     message: string;
     error_type: string option;
     stack_trace: string option;
   }
   ```

2. **Context Preparation for Claude**
   - Last 500 lines of logs
   - Error frequency summary
   - Relevant source file (where error originated)
   - Previous fix attempts for similar errors
   - System metrics (CPU, memory, request rate)
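
As a rough illustration of keeping the log context small, the sketch below collapses runs of consecutive identical lines into a single line with a count, then keeps only the last 500 lines. It works on plain strings rather than `log_entry` values to stay self-contained; `collapse_runs`, `last_n` and `prepare_logs` are hypothetical names, and the `(xN)` suffix is an arbitrary format choice.

```ocaml
(* Collapse runs of consecutive identical lines into "line (xN)". *)
let collapse_runs lines =
  let flush line count acc =
    if count > 1 then (Printf.sprintf "%s (x%d)" line count) :: acc
    else line :: acc
  in
  let rec go acc current count = function
    | [] -> List.rev (flush current count acc)
    | l :: rest when String.equal l current -> go acc current (count + 1) rest
    | l :: rest -> go (flush current count acc) l 1 rest
  in
  match lines with
  | [] -> []
  | first :: rest -> go [] first 1 rest

(* Keep only the last [n] elements of a list. *)
let last_n n lines =
  let len = List.length lines in
  List.filteri (fun i _ -> i >= len - n) lines

(* Deduplicate first, then truncate, so repeated errors do not crowd out
   older but distinct lines. *)
let prepare_logs ?(max_lines = 500) lines =
  lines |> collapse_runs |> last_n max_lines |> String.concat "\n"
```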

### Restart Safety Mechanisms

1. **Pre-Restart Checks**
   - Compile the modified code
   - Run unit tests if available
   - Check syntax and types with `ocamlc -i`
   - Verify no obvious issues (missing semicolons, etc.)

2. **Restart Process**
   ```bash
   # Save current version, remembering the tag so the rollback can refer to it
   TAG="dancer-before-$(date +%s)"
   git tag "$TAG"

   # Merge fix
   git merge --no-ff dancer/fix-<id>

   # Rebuild
   dune build

   # Graceful restart
   systemctl reload dancer-service || systemctl restart dancer-service

   # Health check; on failure, reset to the tag, rebuild and restart the old code
   ./health_check.sh || { git reset --hard "$TAG" && dune build && systemctl restart dancer-service; }
   ```

3. **Rollback Triggers**
   - Service fails to start
   - Health check fails after restart
   - Error rate increases by >50%
   - Memory usage spikes
   - Manual intervention via Zulip
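
The rollback triggers can be read as a single decision function over a before/after snapshot. The following is a sketch under stated assumptions: the `health` record, the `verdict` type and `evaluate` are hypothetical, and since the design does not say what counts as a memory "spike", the default factor of 2x over baseline is an invented placeholder.

```ocaml
(* Hypothetical snapshot of service health. *)
type health = {
  started : bool;           (* did the service come up at all? *)
  health_check_ok : bool;   (* result of the post-restart health check *)
  errors_per_min : float;
  memory_mb : float;
}

type verdict = Keep | Rollback of string  (* reason for the rollback *)

(* Compare [current] against the pre-deploy [baseline].
   [memory_factor] is an assumed threshold, not taken from the design. *)
let evaluate ?(memory_factor = 2.0) ~baseline ~current ~manual_stop () =
  if manual_stop then Rollback "manual intervention via Zulip"
  else if not current.started then Rollback "service failed to start"
  else if not current.health_check_ok then Rollback "health check failed"
  else if current.errors_per_min > 1.5 *. baseline.errors_per_min then
    Rollback "error rate increased by more than 50%"
  else if current.memory_mb > memory_factor *. baseline.memory_mb then
    Rollback "memory usage spiked"
  else Keep
```

Returning the reason alongside the verdict makes it straightforward to post an explanation to Zulip when a rollback happens.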

### MCP Server Requirements (Simplified)

1. **Git Server**
   - Local git repository with remote backup
   - Web interface for viewing changes
   - Webhook support for CI integration

2. **Monitoring Server**
   - Simple metrics collection (Prometheus/Grafana)
   - Log aggregation (just file-based initially)
   - Alert routing to Zulip

3. **Claude API Gateway**
   - Rate limiting
   - Cost tracking
   - Request/response logging
   - Fallback to manual mode if quota exceeded
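
The "fallback to manual mode if quota exceeded" behaviour of the gateway might be as simple as a running cost counter. This is only a sketch: the `budget` record, the dollar units and the function names are assumptions, not part of any Claude API.

```ocaml
type mode = Automatic | Manual

(* Hypothetical spending budget for Claude consultations. *)
type budget = {
  limit_usd : float;
  mutable spent_usd : float;
}

(* Record the cost of one consultation. *)
let charge b cost = b.spent_usd <- b.spent_usd +. cost

(* Once the budget is used up, stop consulting automatically and leave
   decisions to a human. *)
let mode b = if b.spent_usd >= b.limit_usd then Manual else Automatic
```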

### Implementation Phases (Simplified)

**Phase 1: Core Infrastructure (Week 1-2)**
- Log interception and buffering
- Basic error pattern detection
- Git worktree management
- Manual Claude consultation

**Phase 2: Automation (Week 3-4)**
- Automatic Claude triggers
- Code generation and application
- Restart orchestration
- Basic safety checks

**Phase 3: Monitoring & Safety (Week 5-6)**
- Zulip integration
- Rollback mechanisms
- Performance tracking
- Cost management

### Example Usage Flow

1. **Error Detection**
   ```ocaml
   (* Application code *)
   Logs.err (fun m -> m "Database connection failed: %s" error_msg);
   (* This error happens 20 times in 2 minutes *)
   ```

2. **Claude Consultation**
   ```
   Context: Database connection errors occurring frequently
   Pattern: "Database connection failed: Connection refused"

   Claude generates:
   - Exponential backoff retry logic
   - Connection pool management
   - Fallback to cached data
   ```

3. **Version Control**
   ```bash
   git worktree add ../dancer-fix-db-conn -b dancer/fix-db-conn
   cd ../dancer-fix-db-conn
   # Apply Claude's changes
   dune build
   # If successful, merge and restart
   ```

4. **Deployment**
   ```bash
   git checkout main
   git merge dancer/fix-db-conn
   systemctl restart dancer-service
   # Monitor for 5 minutes
   # If stable, cleanup worktree
   ```
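
For a sense of scale, the "exponential backoff retry logic" from step 2 could be as small as the sketch below. It is an illustration of the technique, not output from Claude: `with_backoff` is a hypothetical name, the defaults are arbitrary, and the `sleep` function is passed in as a parameter so the example does not depend on Unix or Eio.

```ocaml
(* Retry [f] with exponentially growing delays.
   [f] returns [Ok v] on success or [Error e] on failure; the delay doubles
   after each failed attempt, and the last error is returned once
   [max_attempts] attempts have been made. *)
let with_backoff ?(max_attempts = 5) ?(base_delay = 0.1) ~sleep f =
  let rec attempt n delay =
    match f () with
    | Ok v -> Ok v
    | Error e when n >= max_attempts -> Error e
    | Error _ ->
      sleep delay;
      attempt (n + 1) (delay *. 2.)
  in
  attempt 1 base_delay
```

In an Eio application the `sleep` argument would presumably wrap the clock's sleep function, with `f` being the database connection attempt.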

### Data Structures

```ocaml
module Dancer = struct
  type config = {
    claude_api_key: string;
    zulip_api_key: string;
    zulip_stream: string;
    max_context_size: int;        (* chars to send to Claude *)
    consultation_cooldown: float; (* seconds between consultations *)
    error_threshold: int;         (* errors before triggering *)
    restart_timeout: float;       (* max seconds for restart *)
    worktree_base: string;        (* base directory for git worktrees *)
  }

  type consultation_request = {
    pattern: string;
    occurrences: int;
    timespan: float;
    recent_logs: string;
    source_context: string option;
  }

  type consultation_response = {
    analysis: string;
    proposed_fix: string;
    target_file: string;
    confidence: float;
  }
end
```
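
To show how `consultation_cooldown` and `error_threshold` might interact, here is a small gating sketch. It copies only those two fields into a hypothetical `gate` record with a mutable `last_consultation` time, and it assumes that "errors before triggering" means at least `error_threshold` occurrences.

```ocaml
(* Hypothetical gate combining the threshold and cooldown settings. *)
type gate = {
  consultation_cooldown : float;           (* seconds between consultations *)
  error_threshold : int;                   (* errors before triggering *)
  mutable last_consultation : float option;
}

(* Return [true], and start a new cooldown period, only if enough errors
   have occurred and the previous consultation is old enough. *)
let should_consult g ~now ~occurrences =
  let cooled_down =
    match g.last_consultation with
    | None -> true
    | Some t -> now -. t >= g.consultation_cooldown
  in
  if occurrences >= g.error_threshold && cooled_down then begin
    g.last_consultation <- Some now;
    true
  end
  else false
```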

### Key Simplifications from Original Design

1. **No Dynamic Linking** - Just restart the process
2. **Simple Pattern Matching** - String comparison, no bloom filters
3. **Basic Git Workflow** - Branches and worktrees, no complex versioning
4. **Minimal Infrastructure** - SQLite instead of complex databases
5. **Simple Rollback** - Git reset instead of sophisticated mechanisms
6. **Direct Process Restart** - Using systemd/supervisor instead of hot-reload
7. **File-Based Logs** - No complex log aggregation initially
8. **Manual Approval Option** - Human can review via Zulip before deploy

## Library Decomposition Plan

### Core Libraries

1. **dancer-logs** - Log interception and buffering
   - Hook into OCaml Logs reporter
   - SQLite-backed circular buffer
   - Pattern normalization
   - Standalone testable

2. **dancer-patterns** - Pattern detection and tracking
   - Error pattern recognition
   - Frequency/acceleration tracking
   - Pattern database management
   - Trigger decision logic

3. **dancer-claude** - Claude CLI integration
   - Prompt construction
   - Response parsing
   - Context preparation
   - Token cost tracking

4. **dancer-git** - Git worktree management
   - Worktree creation/cleanup
   - Branch management
   - Safe merging operations
   - Rollback capabilities

5. **dancer-test** - Alcotest generation
   - Test template generation
   - Test execution in worktrees
   - Result parsing
   - Coverage tracking

6. **dancer-process** - Process management
   - Tmux orchestration
   - Service restart logic
   - Health checking
   - Graceful shutdown

7. **dancer-observe** - Observability
   - Metrics collection
   - SQLite time-series storage
   - Anomaly detection
   - Audit trail management

8. **dancer-spec** - Service specification
   - YAML spec parsing
   - Constraint validation
   - Fix validation against spec
   - Schema enforcement

9. **dancer-deploy** - Deployment pipeline
   - Staging environment setup
   - Promotion criteria evaluation
   - Production deployment
   - Rollback orchestration

10. **dancer-ui** - Human oversight interfaces
    - Web dashboard (Dream)
    - Terminal UI (Nottui)
    - WebSocket live updates
    - Audit log viewer
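
The "pattern normalization" step in `dancer-logs` is not defined further, so as one possible interpretation, the sketch below replaces every run of digits with a `<N>` placeholder so that messages differing only in numbers (ports, durations, ids) map to the same pattern. The `normalize` name and the placeholder are assumptions.

```ocaml
(* Replace each run of decimal digits with "<N>" so that messages which
   differ only in numeric details normalise to the same pattern. *)
let normalize msg =
  let buf = Buffer.create (String.length msg) in
  let in_digits = ref false in
  String.iter
    (fun c ->
      match c with
      | '0' .. '9' ->
        (* Emit the placeholder once, at the start of a digit run. *)
        if not !in_digits then Buffer.add_string buf "<N>";
        in_digits := true
      | c ->
        in_digits := false;
        Buffer.add_char buf c)
    msg;
  Buffer.contents buf
```

With plain string comparison on the normalised form, the simple string matching described earlier can still group variable messages together.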

### Implementation Order

**Phase 1: Foundation** (Week 1)
1. `dancer-logs` - Need log data first
2. `dancer-patterns` - Pattern detection on logs
3. `dancer-observe` - Basic metrics/storage

**Phase 2: Claude Integration** (Week 2)
4. `dancer-claude` - Claude consultation
5. `dancer-spec` - Service constraints
6. `dancer-test` - Test generation

**Phase 3: Deployment** (Week 3)
7. `dancer-git` - Worktree management
8. `dancer-process` - Process control
9. `dancer-deploy` - Staging/production

**Phase 4: Oversight** (Week 4)
10. `dancer-ui` - Dashboard and monitoring