My aggregated monorepo of OCaml code, automaintained

Add day10 agent-driven CI design and implementation plan docs

Design doc covers the data model (JSONL history, status.json), failure
categories, CLI commands, log retention policy, and agent operational
loop for the docs CI replacement.

Implementation plan breaks the work into 15 tasks with dependency graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+1200
+324
docs/plans/2026-03-08-day10-agent-ci-design.md
# Day10 Agent-Driven Docs CI — Design

## Context

Day10 is a documentation CI system for OCaml packages that builds packages
in containers using overlayfs layers and generates odoc documentation. It
currently supports batch builds, a web dashboard, and basic run tracking.

This design extends day10 to serve as a full replacement for the current
ocaml-docs-ci (sage), with an agent-first operational model: Claude runs
the service, monitors builds, diagnoses failures, retries transients, and
alerts a human only when needed.

## Requirements

1. **Failure classification**: Distinguish solver failures, depext
   unavailable, dependency build failures, package build failures, doc
   compile failures, doc link failures, and transient/infra failures.
   Categories are extensible strings, not a closed enum.

2. **Package history**: Track status over time per package-version per
   build (since one package-version may be built multiple times with
   different dependency sets). Detect transitions: new failures, new
   successes, flaky packages, persistent failures.

3. **Blessed-first priority**: Each package-version has one blessed build
   (the canonical universe for ocaml.org/p/). Blessed build status is the
   primary status. Blessed regressions are high-priority alerts.

4. **CLI tools for agent consumption**: Every command supports
   `--format json` for machine-readable output. The agent uses these to
   make decisions about retries, cascades, gc, and notifications.

5. **Log retention**: Keep logs on status change (pass→fail, fail→pass).
   Discard logs for stably-passing packages. Archive to spinning disk
   under pressure. Keep metadata (status, category, timestamp) always.

6. **Cascade reruns**: After the agent identifies and reruns transient
   failures, cascade to reverse dependencies that were stuck in
   dependency_failure state. Prioritise blessed builds.

7. **Pluggable notifications**: Configurable channel (slack, zulip,
   telegram, email). Rules for which events trigger alerts vs daily
   digests.

## Approach

Structured filesystem with index files (Approach B). All state is
inspectable with cat/jq. Convertible to Parquet later via clickhouse or
similar tooling.

## Data Model

### Per-package history

File: `packages/{name}.{version}/history.jsonl`

Append-only, one JSON line per build event:

```jsonl
{"ts":"2026-03-08T14:00:00Z","run":"2026-03-08-140000","build_hash":"build-abc123","status":"success","category":"success","compiler":"ocaml.5.3.0","blessed":true}
{"ts":"2026-03-09T10:00:00Z","run":"2026-03-09-100000","build_hash":"build-abc123","status":"failure","category":"build_failure","error":"exit code 2","compiler":"ocaml.5.3.0","blessed":true}
```

Fields:
- `ts`: ISO 8601 timestamp
- `run`: run identifier
- `build_hash`: the specific build layer hash
- `status`: "success" or "failure"
- `category`: failure classification string (extensible), or "success"
- `compiler`: which OCaml version the solver chose
- `blessed`: whether this is the blessed build
- `error`: optional error description
- `failed_dep`: optional, for dependency_failure — which dep failed
- `failed_dep_hash`: optional, the hash of the failed dep build

### Global status index

File: `status.json` — regenerated after each run.

```json
{
  "generated": "2026-03-10T08:30:00Z",
  "run_id": "2026-03-10-080000",
  "totals": {
    "blessed": {"success": 3800, "solver_failure": 100, "build_failure": 30, "doc_compile_failure": 15, "doc_link_failure": 5},
    "non_blessed": {"success": 14000, "build_failure": 500, "dependency_failure": 600}
  },
  "changes_since_last": [
    {"package": "cohttp.6.0.0", "build_hash": "build-xyz", "blessed": true, "from": "build_failure", "to": "success"},
    {"package": "lascar.0.7.0", "build_hash": "build-abc", "blessed": true, "from": "success", "to": "solver_failure"}
  ],
  "new_packages": ["alice.0.5.0", "bibfmt.0.9.2"]
}
```

### Extended run summary

File: `logs/runs/{run-id}/summary.json` — existing, extended with:
- `changes`: list of status transitions with build hashes
- `new_packages`: first-time packages
- `retries`: packages retried in this run and outcomes

### Notification config

File: `notifications.yaml`

```yaml
notifications:
  channel: slack
  rules:
    - event: blessed_regression
      priority: high
      notify: always
    - event: new_success
      priority: low
      notify: daily_digest
    - event: disk_warning
      priority: high
      notify: always
      threshold: 85%
    - event: run_complete
      priority: low
      notify: daily_digest
    - event: transient_cluster
      priority: high
      notify: always
      threshold: 10
```

## Failure Categories

Extensible strings. Initial set:

| Category | Meaning |
|----------|---------|
| `solver_failure` | No solution exists for this package |
| `depext_unavailable` | System dependency not installable |
| `dependency_failure` | A dependency failed to build |
| `build_failure` | Package itself failed to compile |
| `doc_compile_failure` | odoc compile step failed |
| `doc_link_failure` | odoc link/html-generate failed |
| `transient_failure` | Infra issue: network, disk, container crash |
| `success` | Everything passed |

New categories can be added as we encounter new failure modes in
production. No code changes needed — they are just strings.

## CLI Commands

### `day10 status [--format json|text]`

Overview of current state.

```
Run: 2026-03-10-080000 (3h ago)
Blessed: 3800 success / 30 build_failure / 15 doc_failure
Non-blessed: 14000 success / 500 build_failure / 600 dep_failure
Changes (blessed):
  + cohttp.6.0.0 build_failure → success
  - lascar.0.7.0 success → solver_failure
Disk: /cache 168G used / 400G (42%)
```

### `day10 query <package> [--history] [--builds] [--log] [--format json|text]`

Package detail.

```
cohttp_async_websocket.v0.17.0
  Blessed: build-abc123 (ocaml.5.3.0) — success
  Other builds:
    build-def456 (ocaml.4.14.2) — success
    build-ghi789 (ocaml.5.1.1) — dependency_failure (async.v0.17.0)
```

- `--history`: show history.jsonl entries, most recent first
- `--builds`: show all builds with their dependency sets
- `--log`: print the build/doc log for the blessed build

### `day10 failures [--blessed] [--category <cat>] [--since <run-id|date>] [--format json|text]`

List currently failing packages.

- `--blessed`: only blessed builds (most important for ocaml.org)
- `--category`: filter by failure category
- `--since`: only new failures since a given run

### `day10 changes [--since <run-id|date>] [--format json|text]`

Status transitions between runs. Core command for the agent's periodic
check.

### `day10 rerun <build-hash|package> [--force] [--format json|text]`

Retry a specific build or all failing builds of a package.

- By build hash: reruns that exact build
- By package name: lists builds, reruns failing ones
- `--force`: rebuild regardless of current status (clears cached layers)

### `day10 cascade [--blessed-first] [--dry-run] [--format json|text]`

After transient reruns succeed, cascade to reverse dependencies still in
`dependency_failure`. Used by the agent after analysing and retrying
transient failures.

- `--blessed-first`: prioritise blessed build cascades
- `--dry-run`: show what would be rerun

### `day10 rdeps <package> [--format json|text]`

Reverse dependencies from cached solutions. Filterable to packages
currently in `dependency_failure`.

### `day10 disk [--format json|text]`

Disk usage breakdown: base images, build layers, doc layers, logs,
archived.

### `day10 gc [--archive <path>] [--keep-runs N] [--stable-threshold N] [--dry-run]`

Garbage collection and log retention.

- `--archive`: move change-event logs to spinning disk
- `--keep-runs`: active run summaries to retain (default 30)
- `--stable-threshold`: runs of consistent status before discarding logs
  (default 5)
- `--dry-run`: report what would be reclaimed

### `day10 notify --channel <channel> --message <text>`

Send a notification. Used by the agent. Channel configuration from
`notifications.yaml`.

## Log Retention Policy

### Active tier (fast disk, /cache)

- Current build layers + logs for all packages
- History files (tiny, always kept)
- Last N run summaries (default 30)

### Archive tier (spinning disk)

- Logs from status-change events (pass→fail, fail→pass)
- Logs from currently-failing packages at archive time
- Old run summaries

### Discard

- Logs for packages consistently passing for N runs (default 5)
- Intermediate logs where only latest matters
- Orphaned layers with no package references

### Compaction

History files: remove redundant consecutive same-status entries older than
90 days. Keep the first and last of each run of identical statuses.

## Agent Operational Loop

```
1. Run completes
2. Regenerate status.json with change detection
3. Analyse new failures:
   - Read logs, classify failure category
   - Identify transients (network timeout, disk full, container crash)
4. Rerun transients: day10 rerun <build-hash> for each
5. After reruns complete: day10 cascade --blessed-first
6. Notify:
   - Blessed regressions → immediate high-priority alert
   - Summary of changes → daily digest
   - Disk warnings → immediate if above threshold
7. Disk management: day10 gc --archive if needed
8. Record observations in lab notebook
```

Opam-repository updates are triggered externally (webhook or similar) and
are out of scope for this design.
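
Step 6 of the loop splits status changes into immediate alerts and digest entries. A minimal sketch of that routing follows, assuming a blessed regression means a blessed build moving from `success` to any other category. The `change` record here is a hypothetical mirror of the `changes_since_last` entries in `status.json`, and `route_of_change` is an illustrative name, not part of day10.

```ocaml
(* Hypothetical mirror of one "changes_since_last" entry in status.json. *)
type change = {
  package : string;
  blessed : bool;
  from_status : string;
  to_status : string;
}

type route = Immediate | Daily_digest

(* Assumption: a regression is a blessed build that was passing and no
   longer is. Everything else waits for the daily digest. *)
let is_blessed_regression c =
  c.blessed && c.from_status = "success" && c.to_status <> "success"

let route_of_change c =
  if is_blessed_regression c then Immediate else Daily_digest

let () =
  let changes = [
    { package = "cohttp.6.0.0"; blessed = true;
      from_status = "build_failure"; to_status = "success" };
    { package = "lascar.0.7.0"; blessed = true;
      from_status = "success"; to_status = "solver_failure" };
  ] in
  List.iter (fun c ->
    let label = match route_of_change c with
      | Immediate -> "immediate alert"
      | Daily_digest -> "daily digest" in
    Printf.printf "%s: %s\n" c.package label)
    changes
```

In practice the rules would come from `notifications.yaml` rather than being hard-coded; the sketch only shows the default split described in the loop.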

## Directory Structure

```
/cache/day10-cache/
├── {os-key}/
│   ├── base/                          # Base container image
│   ├── build-{hash}/
│   │   ├── layer.json
│   │   └── build.log
│   ├── doc-{hash}/
│   │   ├── layer.json
│   │   └── odoc-voodoo-all.log
│   ├── packages/{name}.{version}/
│   │   ├── build-{hash} -> ...        # Symlinks to layers
│   │   ├── blessed-build -> build-{hash}
│   │   ├── blessed-docs -> doc-{hash}
│   │   └── history.jsonl              # Append-only status history
│   ├── logs/
│   │   └── runs/{run-id}/
│   │       ├── summary.json           # Extended with changes
│   │       ├── build/{package}.log
│   │       └── docs/{package}.log
│   └── status.json                    # Global index
├── solutions/{opam-sha}/
│   └── {package}.json
└── notifications.yaml

/mnt/spinning/day10-archive/
├── logs/                              # Archived change-event logs
└── runs/                              # Old run summaries
```

## Future Considerations

- **Parquet export**: status.json and history.jsonl are straightforward to
  convert to Parquet via clickhouse local or similar. The Makefile already
  has Parquet targets.
- **New failure categories**: added as strings with no code changes.
  The agent and notification rules learn new categories as they appear.
- **Multiple OS targets**: the os-key directory structure already supports
  this. Status/history would need per-os-key sections.
- **Web UI extensions**: the existing day10-web can read status.json and
  history.jsonl to show richer dashboards with change timelines.
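
As a rough illustration of the discard rule in the Log Retention Policy (logs for packages consistently passing for N runs), the decision could look like the sketch below. It assumes the caller supplies a package's recent categories, most recent first. `decide` and the `log_action` type are hypothetical names, and the archive tier is deliberately left out.

```ocaml
type log_action = Keep | Discard

(* First [n] elements of a list (fewer if the list is shorter). *)
let rec take n = function
  | x :: rest when n > 0 -> x :: take (n - 1) rest
  | _ -> []

(* [recent_categories] is most recent first. Logs become discardable only
   once the last [stable_threshold] runs were all "success". *)
let decide ~stable_threshold recent_categories =
  let window = take stable_threshold recent_categories in
  if stable_threshold > 0
     && List.length window = stable_threshold
     && List.for_all (String.equal "success") window
  then Discard
  else Keep

let () =
  let show = function Keep -> "keep" | Discard -> "discard" in
  let stable = ["success"; "success"; "success"; "success"; "success"] in
  let recovering = ["success"; "success"; "build_failure"; "success"; "success"] in
  Printf.printf "stable: %s\n" (show (decide ~stable_threshold:5 stable));
  Printf.printf "recovering: %s\n" (show (decide ~stable_threshold:5 recovering))
```

A package with fewer than N recorded runs is kept, which errs on the side of retaining logs for new packages.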
+876
docs/plans/2026-03-08-day10-agent-ci-plan.md
# Day10 Agent-Driven Docs CI — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Extend day10 with failure classification, package history, change detection, CLI query tools, log retention, cascade reruns, and pluggable notifications — enabling an agent-first operational model.

**Architecture:** Filesystem-based state (JSONL history files, JSON index files) with CLI commands for querying and management. All state inspectable with cat/jq. The day10_lib library gets new modules for history, status, and notifications. The CLI (bin/main.ml) gets new subcommands via cmdliner.

**Tech Stack:** OCaml 5.3+, cmdliner (<2.0), yojson, day10_lib, unix

---

### Task 1: History module — data types and JSONL I/O

**Files:**
- Create: `day10/lib/history.ml`
- Create: `day10/lib/history.mli`
- Modify: `day10/lib/dune` — add `history` to modules list

**Step 1: Write the history.mli interface**

```ocaml
(** Per-package build history, stored as append-only JSONL files. *)

type entry = {
  ts : string;                      (** ISO 8601 timestamp *)
  run : string;                     (** Run identifier *)
  build_hash : string;              (** Build layer hash *)
  status : string;                  (** "success" or "failure" *)
  category : string;                (** Failure category or "success" *)
  compiler : string;                (** OCaml version used *)
  blessed : bool;                   (** Whether this is the blessed build *)
  error : string option;            (** Error description if failed *)
  failed_dep : string option;       (** For dependency_failure *)
  failed_dep_hash : string option;  (** Hash of the failed dep *)
}

val entry_to_json : entry -> Yojson.Safe.t
val entry_of_json : Yojson.Safe.t -> entry

(** Append an entry to the history file for a package.
    File: [packages_dir]/[pkg_str]/history.jsonl *)
val append : packages_dir:string -> pkg_str:string -> entry -> unit

(** Read all history entries for a package, most recent first. *)
val read : packages_dir:string -> pkg_str:string -> entry list

(** Read only the most recent entry per build_hash for a package. *)
val read_latest : packages_dir:string -> pkg_str:string -> entry list

(** Read the most recent blessed entry for a package. *)
val read_blessed : packages_dir:string -> pkg_str:string -> entry option

(** Compact a history file: for consecutive same-status entries older
    than [max_age_days], keep only first and last. *)
val compact : packages_dir:string -> pkg_str:string -> max_age_days:int -> unit
```

**Step 2: Write history.ml implementation**

```ocaml
type entry = {
  ts : string;
  run : string;
  build_hash : string;
  status : string;
  category : string;
  compiler : string;
  blessed : bool;
  error : string option;
  failed_dep : string option;
  failed_dep_hash : string option;
}

let entry_to_json e =
  let fields = [
    "ts", `String e.ts;
    "run", `String e.run;
    "build_hash", `String e.build_hash;
    "status", `String e.status;
    "category", `String e.category;
    "compiler", `String e.compiler;
    "blessed", `Bool e.blessed;
  ] in
  (* Optional fields are emitted only when present. *)
  let fields = match e.error with
    | Some s -> ("error", `String s) :: fields | None -> fields in
  let fields = match e.failed_dep with
    | Some s -> ("failed_dep", `String s) :: fields | None -> fields in
  let fields = match e.failed_dep_hash with
    | Some s -> ("failed_dep_hash", `String s) :: fields | None -> fields in
  `Assoc fields

(* Lenient accessors: a missing or mistyped field falls back to a default. *)
let string_field j k =
  match Yojson.Safe.Util.member k j with
  | `String s -> s | _ -> ""

let string_field_opt j k =
  match Yojson.Safe.Util.member k j with
  | `String s -> Some s | _ -> None

let bool_field j k =
  match Yojson.Safe.Util.member k j with
  | `Bool b -> b | _ -> false

let entry_of_json j = {
  ts = string_field j "ts";
  run = string_field j "run";
  build_hash = string_field j "build_hash";
  status = string_field j "status";
  category = string_field j "category";
  compiler = string_field j "compiler";
  blessed = bool_field j "blessed";
  error = string_field_opt j "error";
  failed_dep = string_field_opt j "failed_dep";
  failed_dep_hash = string_field_opt j "failed_dep_hash";
}

let history_path ~packages_dir ~pkg_str =
  Filename.concat (Filename.concat packages_dir pkg_str) "history.jsonl"

let append ~packages_dir ~pkg_str entry =
  let path = history_path ~packages_dir ~pkg_str in
  let dir = Filename.dirname path in
  (* Assumes packages_dir itself already exists. *)
  if not (Sys.file_exists dir) then
    Unix.mkdir dir 0o755;
  let oc = open_out_gen [Open_append; Open_creat; Open_wronly] 0o644 path in
  Fun.protect ~finally:(fun () -> close_out oc) (fun () ->
    output_string oc (Yojson.Safe.to_string (entry_to_json entry));
    output_char oc '\n')

let read ~packages_dir ~pkg_str =
  let path = history_path ~packages_dir ~pkg_str in
  if not (Sys.file_exists path) then []
  else
    let ic = open_in path in
    Fun.protect ~finally:(fun () -> close_in ic) (fun () ->
      let entries = ref [] in
      (try while true do
        let line = input_line ic in
        if String.length line > 0 then
          entries := entry_of_json (Yojson.Safe.from_string line) :: !entries
      done with End_of_file -> ());
      !entries) (* Already in reverse = most recent first *)

let read_latest ~packages_dir ~pkg_str =
  let entries = read ~packages_dir ~pkg_str in
  let seen = Hashtbl.create 16 in
  List.filter (fun e ->
    if Hashtbl.mem seen e.build_hash then false
    else (Hashtbl.add seen e.build_hash (); true)
  ) entries

let read_blessed ~packages_dir ~pkg_str =
  let entries = read ~packages_dir ~pkg_str in
  List.find_opt (fun e -> e.blessed) entries

(* Format a Unix time as an ISO 8601 UTC timestamp. Timestamps in this
   fixed format sort lexicographically, so ages can be compared as strings. *)
let iso_of_time t =
  let tm = Unix.gmtime t in
  Printf.sprintf "%04d-%02d-%02dT%02d:%02d:%02dZ"
    (tm.Unix.tm_year + 1900) (tm.Unix.tm_mon + 1) tm.Unix.tm_mday
    tm.Unix.tm_hour tm.Unix.tm_min tm.Unix.tm_sec

let compact ~packages_dir ~pkg_str ~max_age_days =
  let path = history_path ~packages_dir ~pkg_str in
  if Sys.file_exists path then begin
    let entries = List.rev (read ~packages_dir ~pkg_str) in (* oldest first *)
    let cutoff =
      iso_of_time (Unix.gettimeofday () -. float_of_int max_age_days *. 86400.0) in
    let same a b = a.status = b.status && a.build_hash = b.build_hash in
    (* Drop an entry only if it is older than the cutoff and sits strictly
       inside a run: both neighbours share its status and build hash. The
       first and last entry of every run are therefore kept. *)
    let rec go prev acc = function
      | [] -> List.rev acc
      | e :: rest ->
        let inside_run =
          (match prev with Some p -> same p e | None -> false)
          && (match rest with next :: _ -> same e next | [] -> false) in
        go (Some e) (if inside_run && e.ts < cutoff then acc else e :: acc) rest
    in
    let compacted = go None [] entries in
    (* Write to a temporary file, then rename over the original. *)
    let tmp = path ^ ".tmp" in
    let oc = open_out tmp in
    Fun.protect ~finally:(fun () -> close_out oc) (fun () ->
      List.iter (fun e ->
        output_string oc (Yojson.Safe.to_string (entry_to_json e));
        output_char oc '\n') compacted);
    Sys.rename tmp path
  end
```

**Step 3: Update lib/dune**

Add `history` to the modules list in `day10/lib/dune`.

**Step 4: Build and verify**

Run: `opam exec -- dune build`
Expected: Clean build with new module available.
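
The "keep first and last of each run" rule behind `compact` is easy to get wrong, so it may help to check it on an in-memory list before wiring it to files. The sketch below is a stand-alone illustration, not the module's code. Its `entry` type is a cut-down stand-in for `History.entry`, `compact_entries` is a hypothetical name, and it assumes all timestamps share one fixed-width UTC ISO 8601 format so that string comparison orders them.

```ocaml
(* Cut-down stand-in for History.entry: only the fields the rule needs. *)
type entry = { ts : string; status : string; build_hash : string }

let same a b = a.status = b.status && a.build_hash = b.build_hash

(* [entries] is oldest first. Drops entries older than [cutoff] that sit
   strictly inside a run of identical status and build hash. *)
let compact_entries ~cutoff entries =
  let rec go prev acc = function
    | [] -> List.rev acc
    | e :: rest ->
      let inside_run =
        (match prev with Some p -> same p e | None -> false)
        && (match rest with next :: _ -> same e next | [] -> false)
      in
      let acc = if inside_run && e.ts < cutoff then acc else e :: acc in
      go (Some e) acc rest
  in
  go None [] entries

let () =
  let mk ts status = { ts; status; build_hash = "build-abc123" } in
  let history = [
    mk "2025-01-01T00:00:00Z" "success";
    mk "2025-01-02T00:00:00Z" "success";
    mk "2025-01-03T00:00:00Z" "success";
    mk "2025-01-04T00:00:00Z" "failure";
  ] in
  let kept = compact_entries ~cutoff:"2025-06-01T00:00:00Z" history in
  List.iter (fun e -> Printf.printf "%s %s\n" e.ts e.status) kept
```

With this input the middle `success` entry is dropped, while the first and last of the run and the `failure` entry remain.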

**Step 5: Commit**

```bash
git add day10/lib/history.ml day10/lib/history.mli day10/lib/dune
git commit -m "day10: add History module for per-package JSONL tracking"
```

---

### Task 2: Status index module — global state and change detection

**Files:**
- Create: `day10/lib/status_index.ml`
- Create: `day10/lib/status_index.mli`
- Modify: `day10/lib/dune` — add `status_index` to modules list

**Step 1: Write status_index.mli**

```ocaml
(** Global status index — regenerated after each run. *)

type change = {
  package : string;
  build_hash : string;
  blessed : bool;
  from_status : string;
  to_status : string;
}

type totals = (string * int) list
(** Association list of category -> count *)

type t = {
  generated : string;  (** ISO 8601 *)
  run_id : string;
  blessed_totals : totals;
  non_blessed_totals : totals;
  changes : change list;
  new_packages : string list;
}

val to_json : t -> Yojson.Safe.t
val of_json : Yojson.Safe.t -> t

(** Generate a status index by scanning all package history files.
    Compares with previous index (if exists) to detect changes. *)
val generate :
  packages_dir:string ->
  run_id:string ->
  previous:t option ->
  t

(** Write status.json to the given directory. *)
val write : dir:string -> t -> unit

(** Read status.json from the given directory. Returns None if missing. *)
val read : dir:string -> t option
```

**Step 2: Write status_index.ml implementation**

Implement the types and JSON serialization. The `generate` function:
- Scans all subdirectories of `packages_dir`
- Reads the latest history entry per build of each package (via `History.read_latest`)
- Tallies blessed/non-blessed totals by category
- Compares with `previous` to detect changes (status transitions)
- Identifies new packages (present now but absent from previous)

**Step 3: Update lib/dune, build, commit**

```bash
git commit -m "day10: add Status_index module for global state and change detection"
```

---

### Task 3: Failure classification in the build pipeline

**Files:**
- Modify: `day10/bin/main.ml` — after each build/solve result, classify and record history
- Modify: `day10/bin/util.ml` — helper to extract compiler version from solution

**Step 1: Identify the integration points**

In `main.ml`, the key function is the build result handler (around lines 254-266 and the build loop). Currently it writes layer.json and run logs. We need to also append to history.jsonl.

The integration points are:
- After solver returns `No_solution` → append history entry with `category:"solver_failure"`
- After a dep fails → append with `category:"dependency_failure"` and `failed_dep` field
- After build fails (exit_status != 0) → append with `category:"build_failure"`
- After build succeeds → append with `category:"success"`
- After doc generation → append with `category:"doc_compile_failure"`, `"doc_link_failure"`, or `"success"`

**Step 2: Add a record_build_result function**

```ocaml
(* Build a history entry from the result of one build and append it to the
   package's history.jsonl. Assumes Run_log.format_time yields ISO 8601. *)
let record_build_result ~packages_dir ~run_id ~pkg_str ~build_hash
    ~compiler ~blessed ~status ~category ?error ?failed_dep ?failed_dep_hash () =
  let entry : Day10_lib.History.entry = {
    ts = Day10_lib.Run_log.format_time (Unix.gettimeofday ());
    run = run_id;
    build_hash;
    status;
    category;
    compiler;
    blessed;
    error;
    failed_dep;
    failed_dep_hash;
  } in
  Day10_lib.History.append ~packages_dir ~pkg_str entry
```

**Step 3: Wire it into the solver failure path**

Find where `No_solution` is handled and add the history append call.

**Step 4: Wire it into the build success/failure paths**

Find where `layer.json` is written after a build completes and add the history append call. Extract the compiler version from the solution's package set.

**Step 5: Wire it into the doc generation paths**

Find where doc results (success/failure/skipped) are recorded and add history append calls for doc-specific failures.

**Step 6: Add depext detection**

In the build failure path, check if the build log contains patterns indicating missing system dependencies (e.g. "Package not found", "dpkg: dependency problems", "apt-get install"). If detected, use `category:"depext_unavailable"` instead of `"build_failure"`.
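
A minimal sketch of log-based classification follows, covering the depext patterns named in Step 6 and the "No space left on device" pattern from the transient step. The function names are hypothetical, and checking transient patterns first is an assumption made here (a full disk during `apt-get install` would otherwise look like a missing depext); the plan does not specify an order.

```ocaml
(* Naive substring search; adequate for scanning a log held in memory. *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(* Patterns taken from the plan text. Real logs will likely need more. *)
let depext_patterns =
  ["Package not found"; "dpkg: dependency problems"; "apt-get install"]

let transient_patterns = ["No space left on device"]

(* Classify a failed build from its log. Assumption: transient patterns
   take precedence over depext patterns. *)
let classify_build_log log =
  let matches patterns = List.exists (fun sub -> contains ~sub log) patterns in
  if matches transient_patterns then "transient_failure"
  else if matches depext_patterns then "depext_unavailable"
  else "build_failure"

let () =
  List.iter (fun log -> Printf.printf "%s\n" (classify_build_log log)) [
    "E: Package not found: libfoo-dev";
    "write error: No space left on device";
    "Error: Unbound module Foo";
  ]
```

Because categories are plain strings, adding a new pattern list and category needs no change to the history format.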

**Step 7: Add transient failure detection**

Check build logs for transient patterns: network timeouts, "No space left on device", container runtime errors. Use `category:"transient_failure"`.

**Step 8: Build and test manually**

Run a health-check and verify history.jsonl is created:
```bash
sg sudo -c './day10 health-check --cache-dir /cache/day10-cache --opam-repository /cache/opam-repository --ocaml-version ocaml.5.3.0 algaeff.2.0.0'
cat /cache/day10-cache/ubuntu-25.04-x86_64/packages/algaeff.2.0.0/history.jsonl
```

**Step 9: Commit**

```bash
git commit -m "day10: classify build results and record per-package history"
```

---

### Task 4: Status generation after runs

**Files:**
- Modify: `day10/bin/main.ml` — call Status_index.generate after batch/health-check runs
- Modify: `day10/lib/run_log.ml` and `run_log.mli` — extend summary with changes

**Step 1: Extend run_log summary type**

Add to the `summary` record:
```ocaml
changes : (string * string * string * string) list;
  (* (package, build_hash, from_status, to_status) *)
new_packages : string list;
retries : (string * string) list;  (* (package, outcome) *)
```

Update `summary_to_json` and `finish_run` accordingly.

**Step 2: Generate status.json after a batch run completes**

In the batch command handler, after `finish_run`:
```ocaml
let previous = Day10_lib.Status_index.read ~dir:os_dir in
let status = Day10_lib.Status_index.generate ~packages_dir ~run_id ~previous in
Day10_lib.Status_index.write ~dir:os_dir status
```

**Step 3: Generate status.json after health-check runs too**

Same pattern in the health-check handler.
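
The change detection that `Status_index.generate` performs can be pictured as a diff between two snapshots. The sketch below assumes each snapshot is an association list from a build key to its category. The key format, the `change` record and the `diff` function are illustrative only and are not the module's actual types.

```ocaml
type change = { key : string; from_status : string; to_status : string }

(* [previous] and [current] map a build key (for example
   "cohttp.6.0.0/build-xyz") to its latest category. Returns the status
   transitions and the keys that are new in [current]. *)
let diff ~previous ~current =
  let changes, fresh =
    List.fold_left (fun (changes, fresh) (key, now) ->
      match List.assoc_opt key previous with
      | None -> (changes, key :: fresh)
      | Some before when before <> now ->
        ({ key; from_status = before; to_status = now } :: changes, fresh)
      | Some _ -> (changes, fresh))
      ([], []) current
  in
  (List.rev changes, List.rev fresh)

let () =
  let previous = [
    "cohttp.6.0.0/build-xyz", "build_failure";
    "lascar.0.7.0/build-abc", "success";
  ] in
  let current = [
    "cohttp.6.0.0/build-xyz", "success";
    "lascar.0.7.0/build-abc", "success";
    "alice.0.5.0/build-new", "success";
  ] in
  let changes, fresh = diff ~previous ~current in
  List.iter (fun c ->
    Printf.printf "%s: %s -> %s\n" c.key c.from_status c.to_status) changes;
  List.iter (Printf.printf "new: %s\n") fresh
```

`List.assoc_opt` is linear, so a real implementation over tens of thousands of builds would more likely use a `Hashtbl` or a `Map`.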

**Step 4: Build, test with a real run, verify status.json**

```bash
cat /cache/day10-cache/ubuntu-25.04-x86_64/status.json | python3 -m json.tool
```

**Step 5: Commit**

```bash
git commit -m "day10: generate status.json with change detection after runs"
```

---

### Task 5: CLI — `day10 status` command

**Files:**
- Modify: `day10/bin/main.ml` — add `status` subcommand

**Step 1: Add the status subcommand**

Follow the existing cmdliner pattern. The command:
- Reads `status.json` from `{cache_dir}/{os_key}/`
- Reads disk usage by shelling out to `df` (the standard `Unix` module has no `statvfs` binding)
- Outputs human-readable text (default) or JSON (`--format json`)

```ocaml
(* format_totals is a helper still to be written: totals -> string. *)
let run_status ~cache_dir ~format () =
  let os_key = Config.detect_os_key () in
  let os_dir = Filename.concat cache_dir os_key in
  match Day10_lib.Status_index.read ~dir:os_dir with
  | None -> Printf.eprintf "No status index found. Run a build first.\n"; 1
  | Some status ->
    match format with
    | "json" ->
      print_string (Yojson.Safe.pretty_to_string (Day10_lib.Status_index.to_json status));
      0
    | _ ->
      (* Print human-readable summary *)
      Printf.printf "Run: %s\n" status.run_id;
      Printf.printf "Blessed: %s\n" (format_totals status.blessed_totals);
      Printf.printf "Non-blessed: %s\n" (format_totals status.non_blessed_totals);
      if status.changes <> [] then begin
        Printf.printf "Changes (blessed):\n";
        (* The annotation lets the compiler resolve the record fields. *)
        List.iter (fun (c : Day10_lib.Status_index.change) ->
          if c.blessed then
            Printf.printf "  %s %s %s → %s\n"
              (if c.to_status = "success" then "+" else "-")
              c.package c.from_status c.to_status
        ) status.changes
      end;
      (* Disk usage: parse the data line of POSIX `df -Pk`, whose columns
         are filesystem, 1024-blocks, used, available, capacity, mount. *)
      let ic = Unix.open_process_in
          (Printf.sprintf "df -Pk %s" (Filename.quote cache_dir)) in
      let _header = In_channel.input_line ic in
      let data = Option.value (In_channel.input_line ic) ~default:"" in
      ignore (Unix.close_process_in ic);
      let fields = List.filter (fun s -> s <> "") (String.split_on_char ' ' data) in
      (match fields with
       | _fs :: total_kb :: used_kb :: _ ->
         (match float_of_string_opt total_kb, float_of_string_opt used_kb with
          | Some total_kb, Some used_kb when total_kb > 0.0 ->
            Printf.printf "Disk: %.0fG used / %.0fG (%.0f%%)\n"
              (used_kb *. 1024.0 /. 1e9) (total_kb *. 1024.0 /. 1e9)
              (used_kb /. total_kb *. 100.0)
          | _ -> Printf.printf "Disk: unavailable\n")
       | _ -> Printf.printf "Disk: unavailable\n");
      0
```

**Step 2: Register the subcommand in the Cmd.group**

Add `status_cmd` to the list in the `Cmd.group` call at the bottom of main.ml.
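
The `run_status` code calls `format_totals`, which the plan does not define. One possible shape is sketched below, assuming `totals` is the `(string * int) list` from `Status_index` and that the output should resemble the `3800 success / 30 build_failure` line in the design doc.

```ocaml
(* Render a category -> count association list as "N category / ...". *)
let format_totals (totals : (string * int) list) =
  totals
  |> List.map (fun (category, count) -> Printf.sprintf "%d %s" count category)
  |> String.concat " / "

let () =
  print_endline
    (format_totals ["success", 3800; "build_failure", 30; "doc_compile_failure", 15])
```

The order of the output follows the order of the list, so callers that want `success` first need to sort before formatting.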

**Step 3: Build and test**

```bash
opam exec -- dune build
./_build/install/default/bin/day10 status --cache-dir /cache/day10-cache
./_build/install/default/bin/day10 status --cache-dir /cache/day10-cache --format json
```

**Step 4: Commit**

```bash
git commit -m "day10: add 'status' CLI command"
```

---

### Task 6: CLI — `day10 query` command

**Files:**
- Modify: `day10/bin/main.ml` — add `query` subcommand

**Step 1: Implement the query command**

The command:
- Takes a package name (required positional arg)
- Reads the packages directory to find all build symlinks
- Reads layer.json for each build to get deps/compiler/exit_status
- Reads blessed-build symlink to identify the blessed build
- With `--history`: reads and displays history.jsonl
- With `--builds`: shows all builds with dependency sets
- With `--log`: prints the build log from the blessed build's layer dir
- `--format json|text` for output format

**Step 2: Register the subcommand**

**Step 3: Build and test**

```bash
./_build/install/default/bin/day10 query --cache-dir /cache/day10-cache cohttp_async_websocket.v0.17.0
./_build/install/default/bin/day10 query --cache-dir /cache/day10-cache --history algaeff.2.0.0
./_build/install/default/bin/day10 query --cache-dir /cache/day10-cache --format json aez.0.3
```

**Step 4: Commit**

```bash
git commit -m "day10: add 'query' CLI command for package detail and history"
```

---

### Task 7: CLI — `day10 failures` command

**Files:**
- Modify: `day10/bin/main.ml` — add `failures` subcommand

**Step 1: Implement the failures command**

The command:
- Scans all packages' latest history entries
- Filters to failures only
- `--blessed`: only blessed builds
- `--category`: filter by category string
- `--since`: only failures newer than a given run-id or date
- Outputs list with package, build hash, category, compiler, error

**Step 2: Register, build, test**

```bash
./_build/install/default/bin/day10 failures --cache-dir /cache/day10-cache --blessed
./_build/install/default/bin/day10 failures --cache-dir /cache/day10-cache --category solver_failure
```

**Step 3: Commit**

```bash
git commit -m "day10: add 'failures' CLI command with category and blessed filters"
```

---

### Task 8: CLI — `day10 changes` command

**Files:**
- Modify: `day10/bin/main.ml` — add `changes` subcommand

**Step 1: Implement the changes command**

The command:
- Reads `status.json` for the changes list
- Or, with `--since`: scans history files for transitions after the given run/date
- Shows: package, build hash, blessed, from → to status

**Step 2: Register, build, test**

```bash
./_build/install/default/bin/day10 changes --cache-dir /cache/day10-cache
./_build/install/default/bin/day10 changes --cache-dir /cache/day10-cache --since 2026-03-08
```

**Step 3: Commit**

```bash
git commit -m "day10: add 'changes' CLI command for status transition tracking"
```

---

### Task 9: CLI — `day10 disk` command

**Files:**
- Modify: `day10/bin/main.ml` — add `disk` subcommand

**Step 1: Implement the disk command**

The command:
- Walks the cache directory tree
- Categorises by type: base image, build layers, doc layers, jtw layers, logs, solutions, packages (symlinks, tiny)
- Reports total and per-category disk usage
- Reports archive disk usage if archive path provided
- `--format json|text`

Use a simple `du`-like walk or shell out to `du -sb` for directories.

**Step 2: Register, build, test**

```bash
./_build/install/default/bin/day10 disk --cache-dir /cache/day10-cache
```

**Step 3: Commit**

```bash
git commit -m "day10: add 'disk' CLI command for cache usage breakdown"
```

---

### Task 10: CLI — `day10 rerun` command

**Files:**
- Modify: `day10/bin/main.ml` — add `rerun` subcommand

**Step 1: Implement the rerun command**

The command:
- Takes a build hash or package name as argument
- By build hash: reads layer.json to get the package and deps, re-invokes the health-check logic for that exact package with the same opam-repository
- By package name: finds all failing builds (from history), reruns each
- `--force`: deletes the existing layer directory before rebuilding
- Reports what was rerun and outcomes

**Step 2: Handle the force flag**

With `--force`, remove the build layer directory so the build system doesn't skip it as cached.
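
The target resolution from Step 1 could be sketched roughly as below. This is only an illustration under stated assumptions: the `build-` prefix test is inferred from the example invocation (`build-abc123`), and the `latest` record, the `"success"` status string, and the function names are hypothetical rather than part of the existing day10 code.

```ocaml
(* Hypothetical sketch: decide which builds a `rerun` argument refers to. *)

type target =
  | Build_hash of string  (* rerun exactly this build *)
  | Package of string     (* rerun every failing build of this package *)

(* Assumption: build hashes carry a "build-" prefix, as in the plan's
   example invocation. Anything else is treated as a package name. *)
let parse_target (arg : string) : target =
  if String.starts_with ~prefix:"build-" arg then Build_hash arg
  else Package arg

(* Hypothetical view of the newest history entry for one build. *)
type latest = {
  package : string;
  build_hash : string;
  status : string;  (* assumed: "success" or a failure category *)
}

(* Return the build hashes that should be rerun for a target.
   A build hash is passed through as-is; a package name selects
   only the builds whose latest status is not "success". *)
let builds_to_rerun (entries : latest list) (t : target) : string list =
  match t with
  | Build_hash h -> [ h ]
  | Package p ->
      entries
      |> List.filter (fun e -> e.package = p && e.status <> "success")
      |> List.map (fun e -> e.build_hash)
```

Keeping this step pure (no filesystem access) would make it easy to unit-test separately from the `--force` layer removal and the rebuild itself.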

**Step 3: Register, build, test**

```bash
./_build/install/default/bin/day10 rerun --cache-dir /cache/day10-cache --opam-repository /cache/opam-repository build-abc123
```

**Step 4: Commit**

```bash
git commit -m "day10: add 'rerun' CLI command for retrying failed builds"
```

---

### Task 11: CLI — `day10 rdeps` command

**Files:**
- Modify: `day10/bin/main.ml` — add `rdeps` subcommand

**Step 1: Implement the rdeps command**

The command:
- Takes a package name as argument
- Scans all cached solutions to build a reverse dependency map
- Lists packages that depend on the given package
- Optionally filters to only packages currently in `dependency_failure` state
- `--format json|text`

**Step 2: Register, build, test**

```bash
./_build/install/default/bin/day10 rdeps --cache-dir /cache/day10-cache dune.3.21.0
```

**Step 3: Commit**

```bash
git commit -m "day10: add 'rdeps' CLI command for reverse dependency lookup"
```

---

### Task 12: CLI — `day10 cascade` command

**Files:**
- Modify: `day10/bin/main.ml` — add `cascade` subcommand

**Step 1: Implement the cascade command**

The command:
- Finds all packages that recently transitioned to success (from status.json changes or history scan)
- For each, finds reverse deps in `dependency_failure` state (via rdeps logic)
- `--blessed-first`: sort to prioritise blessed builds
- `--dry-run`: list what would be rerun without doing it
- Without `--dry-run`: invokes rerun for each

**Step 2: Register, build, test**

```bash
./_build/install/default/bin/day10 cascade --cache-dir /cache/day10-cache --opam-repository /cache/opam-repository --dry-run
```

**Step 3: Commit**

```bash
git commit -m "day10: add 'cascade' CLI command for dependency-failure reruns"
```

---

### Task 13: Extended GC — log retention and archival

**Files:**
- Modify: `day10/lib/gc.ml` and `gc.mli` — add log retention functions
- Modify: `day10/bin/main.ml` — extend the gc command or add new gc subcommand

**Step 1: Add log retention to gc.mli**

```ocaml
type log_gc_config = {
  archive_dir : string option;  (** Spinning disk path *)
  keep_runs : int;              (** Active run summaries to retain *)
  stable_threshold : int;       (** Runs of same status before discarding logs *)
  compact_max_age_days : int;   (** History compaction threshold *)
}

type log_gc_result = {
  logs_archived : int;
  logs_deleted : int;
  runs_archived : int;
  runs_deleted : int;
  histories_compacted : int;
  bytes_reclaimed : int64;
}

val gc_logs :
  cache_dir:string ->
  os_key:string ->
  config:log_gc_config ->
  dry_run:bool ->
  log_gc_result
```

**Step 2: Implement gc_logs**

- Scan history files to classify each package: stable-pass (N consecutive success), stable-fail, changing, new
- For stable-pass packages: delete build logs (keep layer.json). If archive_dir set, move change-event logs there first.
- For stable-fail packages: keep latest log, archive older ones
- For changing packages: keep all logs (these are diagnostic)
- Archive old run summaries beyond keep_runs
- Compact history files older than compact_max_age_days

**Step 3: Add CLI flags to gc subcommand**

Add `--archive`, `--keep-runs`, `--stable-threshold` flags. If the gc subcommand doesn't already exist as a separate command, add it. (The existing `gc` is called internally during batch; we need a standalone CLI command.)
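
The per-package classification in Step 2 might look something like the following sketch. Several details are assumptions rather than anything the plan specifies: statuses are taken as plain strings with `"success"` meaning a pass, the list is ordered newest first, "new" is read as "fewer than `stable_threshold` recorded runs", "stable-fail" as "the last `stable_threshold` runs all failed", and new packages keep all their logs. The type and function names are hypothetical.

```ocaml
(* Hypothetical sketch of the stability classification used by gc_logs. *)

type stability = Stable_pass | Stable_fail | Changing | New

(* First [n] elements of a list (or fewer if the list is shorter). *)
let rec take n = function
  | x :: rest when n > 0 -> x :: take (n - 1) rest
  | _ -> []

(* [statuses] is one package's status history, newest first.
   Only the most recent [stable_threshold] runs are inspected. *)
let classify ~stable_threshold (statuses : string list) : stability =
  if stable_threshold <= 0 then invalid_arg "classify: stable_threshold";
  let recent = take stable_threshold statuses in
  if List.length recent < stable_threshold then New
  else if List.for_all (fun s -> s = "success") recent then Stable_pass
  else if List.for_all (fun s -> s <> "success") recent then Stable_fail
  else Changing

type log_action =
  | Delete_logs                 (* stable-pass: logs carry no new information *)
  | Keep_latest_archive_older   (* stable-fail: one log is enough to diagnose *)
  | Keep_all                    (* changing: logs are diagnostic *)

(* Map a classification to the retention action from Step 2.
   Treating New like Changing is an assumption. *)
let action_for = function
  | Stable_pass -> Delete_logs
  | Stable_fail -> Keep_latest_archive_older
  | Changing | New -> Keep_all
```

Separating the classification from the file operations would let `--dry-run` report the planned actions without touching the cache.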

**Step 4: Build, test with --dry-run**

```bash
./_build/install/default/bin/day10 gc --cache-dir /cache/day10-cache --dry-run
./_build/install/default/bin/day10 gc --cache-dir /cache/day10-cache --archive /mnt/spinning/day10-archive --dry-run
```

**Step 5: Commit**

```bash
git commit -m "day10: add log retention, archival, and history compaction to gc"
```

---

### Task 14: Notification module

**Files:**
- Create: `day10/lib/notify.ml`
- Create: `day10/lib/notify.mli`
- Modify: `day10/lib/dune` — add `notify` to modules list
- Modify: `day10/bin/main.ml` — add `notify` subcommand

**Step 1: Write notify.mli**

```ocaml
(** Pluggable notification system. *)

type channel = Slack | Zulip | Telegram | Email | Stdout

val channel_of_string : string -> channel
val channel_to_string : channel -> string

(** Send a message via the given channel.
    Channel-specific config is read from environment variables:
    - SLACK_WEBHOOK_URL
    - ZULIP_BOT_EMAIL, ZULIP_BOT_API_KEY, ZULIP_SERVER, ZULIP_STREAM
    - TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID
    - EMAIL_TO, EMAIL_FROM (uses sendmail)
    Returns 0 on success, non-zero on failure. *)
val send : channel:channel -> message:string -> int
```

**Step 2: Implement notify.ml**

- `Slack`: POST to webhook URL with JSON `{"text": message}`
- `Zulip`: POST to API with topic/content
- `Telegram`: POST to Bot API sendMessage
- `Email`: pipe to sendmail
- `Stdout`: just print (for testing)

Use `Unix.open_process` to shell out to `curl` for HTTP channels (avoids adding HTTP client dependency).
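
Because the message ends up inside both a JSON string and a shell command line, quoting deserves some care. The sketch below covers only the pure parts for the Slack channel: channel name conversion, JSON escaping, and building the `curl` command string. It is an illustration under assumptions: raising `Invalid_argument` on an unknown channel name, the helper names `json_escape`, `slack_payload` and `slack_curl_command`, and the particular `curl` options are all choices made here, not something the plan prescribes.

```ocaml
type channel = Slack | Zulip | Telegram | Email | Stdout

(* Assumption: unknown names raise Invalid_argument; names are
   matched case-insensitively. *)
let channel_of_string s =
  match String.lowercase_ascii s with
  | "slack" -> Slack
  | "zulip" -> Zulip
  | "telegram" -> Telegram
  | "email" -> Email
  | "stdout" -> Stdout
  | other -> invalid_arg ("unknown notification channel: " ^ other)

let channel_to_string = function
  | Slack -> "slack"
  | Zulip -> "zulip"
  | Telegram -> "telegram"
  | Email -> "email"
  | Stdout -> "stdout"

(* Escape a string so it can sit inside a JSON string literal:
   quotes, backslashes and control characters are escaped. *)
let json_escape s =
  let b = Buffer.create (String.length s + 16) in
  String.iter
    (function
      | '"' -> Buffer.add_string b "\\\""
      | '\\' -> Buffer.add_string b "\\\\"
      | '\n' -> Buffer.add_string b "\\n"
      | '\r' -> Buffer.add_string b "\\r"
      | '\t' -> Buffer.add_string b "\\t"
      | c when Char.code c < 0x20 ->
          Buffer.add_string b (Printf.sprintf "\\u%04x" (Char.code c))
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* JSON body for a Slack webhook: {"text": message}. *)
let slack_payload message =
  Printf.sprintf "{\"text\": \"%s\"}" (json_escape message)

(* Shell command for the POST. Filename.quote protects each argument
   from shell interpretation on the current platform. *)
let slack_curl_command ~webhook_url message =
  Printf.sprintf "curl -sS -X POST -H %s --data %s %s"
    (Filename.quote "Content-Type: application/json")
    (Filename.quote (slack_payload message))
    (Filename.quote webhook_url)
```

The resulting command string could then be handed to `Unix.open_process`, with the exit status obtained from `Unix.close_process` and mapped to the integer that `send` returns.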

**Step 3: Add `notify` CLI subcommand**

```bash
./_build/install/default/bin/day10 notify --channel stdout --message "Test notification"
```

**Step 4: Build, test with stdout channel**

**Step 5: Commit**

```bash
git commit -m "day10: add pluggable notification module and CLI command"
```

---

### Task 15: Integration test — full cycle

**Files:**
- No new files — this is a manual integration test

**Step 1: Run a batch build of a few packages**

```bash
sg sudo -c './_build/install/default/bin/day10 health-check \
  --cache-dir /cache/day10-cache \
  --opam-repository /cache/opam-repository \
  --ocaml-version ocaml.5.3.0 \
  algaeff.2.0.0'
```

**Step 2: Verify history was recorded**

```bash
cat /cache/day10-cache/ubuntu-25.04-x86_64/packages/algaeff.2.0.0/history.jsonl | jq .
```

**Step 3: Check status**

```bash
./_build/install/default/bin/day10 status --cache-dir /cache/day10-cache
./_build/install/default/bin/day10 status --cache-dir /cache/day10-cache --format json | jq .
```

**Step 4: Query a package**

```bash
./_build/install/default/bin/day10 query --cache-dir /cache/day10-cache cohttp_async_websocket.v0.17.0 --history
```

**Step 5: List failures**

```bash
./_build/install/default/bin/day10 failures --cache-dir /cache/day10-cache --blessed
```

**Step 6: Check changes**

```bash
./_build/install/default/bin/day10 changes --cache-dir /cache/day10-cache
```

**Step 7: Check disk**

```bash
./_build/install/default/bin/day10 disk --cache-dir /cache/day10-cache
```

**Step 8: Test GC dry-run**

```bash
./_build/install/default/bin/day10 gc --cache-dir /cache/day10-cache --dry-run
```

**Step 9: Test notification**

```bash
./_build/install/default/bin/day10 notify --channel stdout --message "Integration test passed"
```

**Step 10: Commit any fixes discovered during integration testing**

---

## Implementation Order & Dependencies

```
Task 1 (History module)
└→ Task 2 (Status_index module)
└→ Task 3 (Failure classification in pipeline)
└→ Task 4 (Status generation after runs)
└→ Task 5 (CLI: status)
└→ Task 8 (CLI: changes)
└→ Task 6 (CLI: query)
└→ Task 7 (CLI: failures)
Task 9 (CLI: disk) — independent
Task 10 (CLI: rerun) — independent, but uses history for filtering
Task 11 (CLI: rdeps) — independent
Task 12 (CLI: cascade) — depends on Task 10 + Task 11
Task 13 (Extended GC) — depends on Task 1 (history)
Task 14 (Notification) — independent
Task 15 (Integration test) — depends on all above
```

Tasks 5-9 and 14 can be parallelised once Tasks 1-4 are complete.