WIP: add deployment policy spec · adam.robins.wtf/sower@b6f66ca

+564

2 changed files


docs

+153

docs/plan-deployment-policy.org

* Deployment Policy — Implementation Plan

See [[file:spec-deployment-policy.org][spec-deployment-policy.org]] for the full specification.

** Compatibility Constraints

- Gardens and servers can be at different versions.
- Server must be upgraded first — it needs to accept both old and new formats.
- Old gardens send old fields (reboot_policy, allow_realtime, etc.) and must
  continue to work with a new server.
- New gardens sending policy rules must not break an old server — so the garden
  upgrade must happen after the server can accept both formats.
- Garden config files (client.json) are user-facing. Users need a migration
  path and deprecation warnings, not a hard cutover.

** Phase 1: Foundation

No behavior change. Build the new types and evaluation logic, fully tested.

*** Schema additions (sower_client)

- Add =SowerClient.Orchestration.Subscription.Policy= module
  - =Policy.Rule= schema: actions, triggers, window, confirm
  - Reuses existing =Subscription.Window= schema for the window field
- Add =policy= field to =SowerClient.Orchestration.Subscription= schema
  - Type: array of =Policy.Rule=, default: =[]= (empty means use default policy)
- Add to =sower_client.ex= schema list

*** Shared evaluator (sower_client)

- =Policy.evaluate(policy_rules, trigger, now, seed_type)=
  - Returns ={:allow, action}=, ={:confirm, action}=, or =:deny=
  - Walks the disruption hierarchy (restart → activate → stage)
  - Applies default policy when rules are empty
  - Validates actions against seed type
- =Policy.trigger_for_reason(audit_reason)=
  - Maps audit reasons to policy triggers
- Window evaluation logic with overnight span support (subsumes sow-160)

*** Database migration (sower)

- Add =policy= column (=:map= / jsonb array) to subscriptions table
- Column is nullable — nil means "use old fields" during transition

*** Tests

- Policy evaluator unit tests covering:
  - Each trigger type
  - Window matching (including overnight spans)
  - Disruption hierarchy (highest permitted action wins)
  - Default policy behavior
  - Confirm flag (confirm wins when multiple rules match)
  - Seed type action validation
  - Empty/nil policy → default

** Phase 2: Server-side adoption

Server uses policy for all deployment decisions. Old gardens still work.

*** Old-to-new conversion

- Add =Policy.from_legacy(subscription)= in sower_client
  - Converts reboot_policy, allow_realtime, poll_on_connect, window,
    activation_args into equivalent policy rules
  - Used at registration time when a garden sends old-format subscriptions
- Server's =register_subscription/2= and =sync_subscriptions/2=:
  - If subscription has =policy= rules: store as-is
  - If subscription has no policy (old garden): call =from_legacy=, store
    the converted policy alongside the old fields

*** Replace server-side evaluation

- =Sower.Workers.DeploySubscription.run/2=:
  - Replace =within_window?= check with =Policy.evaluate/3=
  - Trigger is =:realtime= (already known from worker context)
- =Sower.Orchestration.Deployment.deploy_subscription/2=:
  - Evaluate policy before creating deployment
  - =event_reason= already flows through opts — map to trigger
- Manual deploy path (UI):
  - Pass =:user_triggered= through to policy evaluation
  - Handle =:confirm= result (UI confirmation flow)
- =find_realtime_subscriptions/1=:
  - Replace =sub.allow_realtime= filter with policy check for realtime trigger

*** Audit reason updates

- Rename =retry= → =user_retry= in deployment_events enum (migration)
- Add =poll_on_connect= to deployment_events enum (migration)

*** UI updates

- Subscription show page: display policy rules instead of old fields
- Keep displaying old fields as fallback if policy is empty (transition period)

** Phase 3: Garden-side adoption

Garden uses policy for all deployment decisions.

*** Garden evaluation injection

- =Garden.Socket.handle_cast(:sync_subscriptions)=:
  - Evaluate policy for poll_on_connect subscriptions before deploying
  - Pass =:poll_on_connect= as audit reason to server
- =Garden.Scheduler=:
  - Evaluate policy before deploying on cron fire
- =Garden.Deployer=:
  - Replace =Garden.Seed.activation_mode(subscription)= with action from
    policy evaluation result
  - Replace =reboot_reason/1= logic — reboot decision comes from policy
    (=:restart= action permitted or not), not from =reboot_policy= field

*** Garden config support

- =SowerClient.Config.preprocess_subscriptions/1=:
  - Support =policy= key in subscription config (new format)
  - Continue supporting old keys (reboot_policy, etc.) — convert via
    =Policy.from_legacy= with deprecation warning logged
- Document new config format for users

** Phase 4: Cleanup

Remove old fields once all gardens have upgraded.

*** Remove old fields

- Remove from =SowerClient.Orchestration.Subscription= schema:
  =reboot_policy=, =allow_realtime=, =poll_on_connect=, =window=,
  =activation_args=
- Remove from server Ecto schema and changeset
- Remove from =register_subscription= and =create_subscription= field lists
- Remove =Policy.from_legacy/1= conversion code
- Remove =Garden.Seed.activation_mode/1=
- Remove =Sower.Orchestration.Subscription.within_window?/2=
- Remove =reboot_reason/1= from =Garden.Deployer=

*** Database migration

- Drop old columns: =reboot_policy=, =allow_realtime=, =window=,
  =activation_args= from subscriptions table

*** Config cleanup

- Remove support for old config keys in =preprocess_subscriptions=
- Deprecation warnings become errors

** Release Sequence

1. Release Phases 1+2+3 together → policy system fully functional on both
   server and garden. Old fields kept alongside new policy field.
   =Policy.from_legacy= converts old garden configs automatically with
   deprecation warnings. No user action required to upgrade.
2. After all gardens running release 1 → release Phase 4 cleanup (drop
   old fields, remove conversion code, deprecation warnings become errors).
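
The "window evaluation logic with overnight span support" item in Phase 1 could be sketched roughly as follows. This is an illustrative Python sketch, not the planned Elixir code. The names =DAYS= and =window_matches= are hypothetical, =now= is assumed to be a naive datetime already converted to the subscription's timezone, and the end of the window is assumed to be exclusive (the spec does not say either way). It follows the spec's rule that =days= names the day on which the window opens.

```python
from datetime import datetime

# Day names in datetime.weekday() order (Monday is 0).
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def window_matches(window, now):
    """Return True if `now` falls inside `window` (None means no constraint)."""
    if window is None:
        return True
    today = DAYS[now.weekday()]
    yesterday = DAYS[(now.weekday() - 1) % 7]
    # Zero-padded HH:MM strings compare correctly as plain strings.
    current = now.strftime("%H:%M")
    start, end = window["time_start"], window["time_end"]
    if start <= end:
        # Same-day window; assumed half-open [start, end).
        return today in window["days"] and start <= current < end
    # Overnight window: either it opened today and is still running,
    # or it opened yesterday and has not yet closed.
    opened_today = today in window["days"] and current >= start
    opened_yesterday = yesterday in window["days"] and current < end
    return opened_today or opened_yesterday


friday_night = {"days": ["fri"], "time_start": "22:00", "time_end": "06:00"}
print(window_matches(friday_night, datetime(2024, 1, 5, 23, 0)))  # Friday 23:00
print(window_matches(friday_night, datetime(2024, 1, 6, 5, 59)))  # Saturday 05:59
print(window_matches(friday_night, datetime(2024, 1, 6, 6, 0)))   # Saturday 06:00
```

Under these assumptions the first two calls fall inside the Friday window and the third falls just outside it.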

+411

docs/spec-deployment-policy.org

* Deployment Policy Rules — Specification

** Overview

A deployment policy is a list of rules attached to a subscription that controls
when and how deployment actions are permitted. The system uses *default deny* —
if no rule matches a given action/trigger/time combination, the action is blocked.

The subscription retains responsibility for *what* to deploy (seed identity,
matching rules, schedule). The policy controls *whether* a deployment action is
allowed to proceed.

** Design Principles

- Default deny: the absence of a matching rule means the action is not permitted.
- Allow-only: rules grant permission. There is no "deny" effect.
- Any-match: rules are unordered. If any rule matches, the action is allowed.
- Omitted fields mean "any": a rule with no =triggers= field matches all trigger types.
- Overlapping rules are harmless: redundancy is acceptable, not an error.
- Seed types define which actions are valid: the policy is validated against the
  seed type's supported actions.

** Actions

Actions represent what a deployment does, generalized across seed types.

| Action   | Description                                              | Disruption |
|----------+----------------------------------------------------------+------------|
| stage    | Download and pin a closure (gcroot). No runtime changes. | None       |
| activate | Apply the configuration via the seed type's activation.  | Service    |
| restart  | Full machine reboot.                                     | Full       |

*** Actions by Seed Type

| Seed Type    | stage | activate | restart |
|--------------+-------+----------+---------|
| nixos        | yes   | yes      | yes     |
| home-manager | yes   | yes      | no      |
| gcroot       | yes   | no       | no      |
| service      | yes   | yes      | no      |

This table will grow as new seed types are added. The policy engine should
reject rules that reference actions unsupported by the subscription's seed type.
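
The rejection of unsupported actions described above could look something like the following. This is a hedged Python sketch for illustration only (the real engine is planned for =sower_client= in Elixir). =SUPPORTED_ACTIONS= mirrors the seed type table, while the function name =unsupported_actions= and its return shape are assumptions made for this example.

```python
# Mirrors the "Actions by Seed Type" table.
SUPPORTED_ACTIONS = {
    "nixos": {"stage", "activate", "restart"},
    "home-manager": {"stage", "activate"},
    "gcroot": {"stage"},
    "service": {"stage", "activate"},
}


def unsupported_actions(rules, seed_type):
    """Return (rule_index, action) pairs the seed type cannot perform.

    An empty list means every rule is valid for this seed type.
    """
    supported = SUPPORTED_ACTIONS[seed_type]
    problems = []
    for index, rule in enumerate(rules):
        for action in rule.get("actions", []):
            if action not in supported:
                problems.append((index, action))
    return problems


policy = [
    {"actions": ["stage", "activate"]},
    {"actions": ["restart"], "confirm": True},
]
print(unsupported_actions(policy, "nixos"))
print(unsupported_actions(policy, "home-manager"))
```

Returning the offending pairs rather than a bare boolean lets a caller report which rule to fix. Whether the real validator reports errors this way is not stated in the spec.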

*** Activation Mode Derivation

The activation mode is not configured directly. It is derived from the set of
actions permitted at the time of deployment. This keeps policy rules purely
about permission while the system selects the appropriate mechanism.

**** NixOS

| Permitted Actions        | Activation Mode | Behavior                           |
|--------------------------+-----------------+------------------------------------|
| activate only            | switch          | Activate immediately, set as boot  |
| activate + restart       | boot            | Set as boot config, then reboot    |
| stage only               | (none)          | Download and pin closure only      |

When =activate= is permitted but =restart= is not, the system uses =switch= —
the configuration takes effect immediately without a reboot. When both
=activate= and =restart= are permitted, the system uses =boot= mode and
triggers a reboot, ensuring the full configuration (including kernel changes)
is applied.

This means a subscription with separate time windows for =activate= and
=restart= naturally produces the desired behavior:

- During activate-only windows: =switch= (non-disruptive)
- During activate+restart windows: =boot= + reboot (full apply)

**** Home-manager

| Permitted Actions | Activation Mode | Behavior                       |
|-------------------+-----------------+--------------------------------|
| activate          | switch          | Activate new generation        |
| stage only        | (none)          | Download and pin closure only  |

**** Service

| Permitted Actions | Activation Mode | Behavior                       |
|-------------------+-----------------+--------------------------------|
| activate          | restart         | Restart the service            |
| stage only        | (none)          | Download and pin closure only  |

**** Gcroot

| Permitted Actions | Activation Mode | Behavior                       |
|-------------------+-----------------+--------------------------------|
| stage             | (none)          | Download and pin closure only  |

** Triggers

Triggers represent how a deployment was initiated. These align with the
existing =deployment_events= audit system's =reason= field.

| Trigger          | Audit Reason          | Description                                      | Source    |
|------------------+-----------------------+--------------------------------------------------+-----------|
| manual           | user_triggered        | User clicked deploy in the UI or invoked via API | Human     |
| manual           | user_retry            | User retried a failed deployment                 | Human     |
| scheduled        | schedule_triggered    | Cron schedule fired and a new seed was found     | Automated |
| realtime         | realtime_triggered    | A matching seed was registered                   | Automated |
| poll_on_connect  | poll_on_connect       | Garden connected and polled for pending seeds    | Automated |

Multiple audit reasons can map to the same policy trigger. The policy engine
evaluates against the trigger column; the audit reason is preserved for
traceability. =user_retry= is treated identically to =manual= for policy
purposes.

** Windows

A window constrains when a rule applies based on day of week and time of day.
Windows naturally handle overnight spans — when =time_start= is after
=time_end=, the window wraps across midnight.

| Field      | Type     | Required | Description                            |
|------------+----------+----------+----------------------------------------|
| days       | string[] | yes      | Days of the week the window is active  |
| time_start | string   | yes      | Start of window (HH:MM)                |
| time_end   | string   | yes      | End of window (HH:MM)                  |

Day values: =mon=, =tue=, =wed=, =thu=, =fri=, =sat=, =sun=.

Timezone is specified once on the subscription (=timezone= field) and applies
to all windows in the policy.

When =time_start= > =time_end= (e.g. =22:00= to =06:00=), the window spans
midnight. The =days= field refers to the day the window *opens*. A rule with
=days: ["fri"], time_start: "22:00", time_end: "06:00"= covers Friday 22:00
through Saturday 06:00.

A rule with no =window= field is not time-constrained — it matches at any time.

** Rule Schema

A single policy rule:

| Field    | Type       | Required | Default | Description                                     |
|----------+------------+----------+---------+-------------------------------------------------|
| actions  | string[]   | yes      |         | Actions this rule permits                       |
| triggers | string[]   | no       | any     | Triggers this rule applies to. Omit for all.    |
| window   | object     | no       | none    | Time window constraint (see Windows section)    |
| confirm  | boolean    | no       | false   | Require explicit confirmation before proceeding |

Multiple rules are ORed (any-match). Within a rule, all fields are ANDed —
the action, trigger, and window must all match.

** Policy Evaluation

Actions form a disruption hierarchy. Each level subsumes the levels below it:

| Level | Action   | Implies         |
|-------+----------+-----------------|
| 3     | restart  | activate, stage |
| 2     | activate | stage           |
| 1     | stage    | (none)          |

On each deployment event with =(trigger, current_time)=:

1. For each action the seed type supports (from highest to lowest disruption):
   a. Filter rules where the action is in the rule's =actions= list.
   b. Filter rules where =trigger= is in the rule's =triggers= list (or triggers is omitted).
   c. For remaining rules, evaluate the =window= against =current_time= (or pass if no window).
   d. If any rule matches: the action is *permitted*.
2. Select the highest-disruption permitted action. All lower actions are
   implied and do not need separate rules.
3. Derive the activation mode from the selected action (see Activation Mode Derivation).
4. If the matching rule has =confirm: true=: block pending out-of-band approval
   before proceeding.
5. If no action is permitted: *deny* — do nothing.

** Relationship to Schedule

The subscription's =schedule= field defines *when to poll for new seeds*. The
policy defines *whether to act on what is found*. These are separate concerns.

A schedule that fires outside any policy window results in a poll that finds a
seed but does not deploy it. This is intentional — the seed will be deployed
when the next poll fires within the window, or when a manual deploy is triggered.

** Examples

*** Allow manual apply anytime, reboot only 2-4am

#+begin_src json
[
  {
    "actions": ["activate"],
    "triggers": ["manual"]
  },
  {
    "actions": ["restart"],
    "window": {
      "days": ["mon", "tue", "wed", "thu", "fri", "sat", "sun"],
      "time_start": "02:00",
      "time_end": "04:00"
    }
  }
]
#+end_src

*** Automated deploys on weekdays, reboots on weekends

#+begin_src json
[
  {
    "actions": ["activate"],
    "triggers": ["manual"]
  },
  {
    "actions": ["stage", "activate"],
    "triggers": ["scheduled", "realtime"],
    "window": {
      "days": ["mon", "tue", "wed", "thu", "fri"],
      "time_start": "09:00",
      "time_end": "17:00"
    }
  },
  {
    "actions": ["restart"],
    "window": {
      "days": ["fri"],
      "time_start": "22:00",
      "time_end": "06:00"
    }
  }
]
#+end_src

*** Staging only (gcroot subscription)

#+begin_src json
[
  {
    "actions": ["stage"],
    "triggers": ["scheduled", "realtime"]
  }
]
#+end_src

*** Everything allowed, manual confirmation for reboot

#+begin_src json
[
  {
    "actions": ["stage", "activate"]
  },
  {
    "actions": ["restart"],
    "confirm": true
  }
]
#+end_src

*** Overnight maintenance window

#+begin_src json
[
  {
    "actions": ["activate"],
    "triggers": ["manual"]
  },
  {
    "actions": ["activate", "restart"],
    "triggers": ["scheduled"],
    "window": {
      "days": ["tue", "thu"],
      "time_start": "23:00",
      "time_end": "05:00"
    }
  }
]
#+end_src

** Migration from Current Schema

*** Field Mapping

The following subscription fields are replaced by policy rules:

| Old Field        | Replacement                                              |
|------------------+----------------------------------------------------------|
| reboot_policy    | =restart= action in policy rules with window constraints |
| allow_realtime   | =realtime= trigger in policy rules                       |
| poll_on_connect  | =poll_on_connect= trigger in policy rules                |
| window           | =window= on policy rules                                 |
| activation_args  | Derived from permitted actions per seed type (see Activation Mode Derivation) |

*** Audit Reason Alignment

The existing =deployment_events.reason= enum needs updates to align with
policy triggers:

| Current Reason       | Change                                                    |
|----------------------+-----------------------------------------------------------|
| user_triggered       | Keep — maps to =manual= trigger                           |
| retry                | Rename to =user_retry= — maps to =manual= trigger         |
| schedule_triggered   | Keep — maps to =scheduled= trigger                        |
| realtime_triggered   | Keep — maps to =realtime= trigger                         |
| (new)                | Add =poll_on_connect= — maps to =poll_on_connect= trigger |
| superseded           | Keep — lifecycle event, not a trigger                     |
| stale                | Keep — lifecycle event, not a trigger                     |

The =poll_on_connect= audit reason is currently missing. Garden-side
poll-on-connect deploys do not pass audit context to the server. This must be
added so that:
1. The audit trail captures why the deployment was created.
2. The policy engine can evaluate the correct trigger type.

*** Shared Policy Evaluation Module

The policy engine lives in =sower_client= as
=SowerClient.Orchestration.Subscription.Policy=. Both the server (=sower=)
and the garden depend on =sower_client=, so one implementation serves both
sides.

#+begin_example
alias SowerClient.Orchestration.Subscription.Policy

Policy.evaluate(subscription.policy, trigger, now)
→ {:allow, :restart}      # highest permitted action
→ {:allow, :activate}
→ {:allow, :stage}
→ {:confirm, :restart}    # allowed but requires confirmation
→ :deny
#+end_example

The trigger is derived from the audit reason:

#+begin_example
Policy.trigger_for_reason(:user_triggered)     → :manual
Policy.trigger_for_reason(:user_retry)         → :manual
Policy.trigger_for_reason(:schedule_triggered) → :scheduled
Policy.trigger_for_reason(:realtime_triggered) → :realtime
Policy.trigger_for_reason(:poll_on_connect)    → :poll_on_connect
#+end_example

*** Injection Points

*Server-side:*

- =Sower.Workers.DeploySubscription.run/2= — currently checks =within_window?=.
  Replace with =Policy.evaluate/3=. Covers realtime deploys.
- =Sower.Orchestration.Deployment.deploy_subscription/2= — entry point for
  manual (UI) and schedule catch-up deploys. Evaluate policy before creating
  the deployment. The =event_reason= is already passed via opts.

*Garden-side:*

- =Garden.Socket.handle_cast(:sync_subscriptions)= — poll-on-connect path.
  Evaluate policy before triggering deploy. The subscription is already
  available from local storage.
- =Garden.Scheduler= — scheduled deploys. Evaluate policy before deploying.
  The subscription is already available.
- =Garden.Deployer= — derives activation mode from =subscription.activation_args=
  today. Replace with the action returned by policy evaluation. The seed
  deployment's =subscription_sid= is used to look up the subscription, which
  already happens in =find_subscription/1=.

*Both sides:*

The =within_window?= function on =Sower.Orchestration.Subscription= is
replaced entirely by the shared policy evaluator. The server-side window
schema (=embeds_one :window=) on subscriptions is replaced by windows
embedded within policy rules.

*** Other Implementation Notes

- =within_window?= must be updated to handle overnight spans (sow-160) as
  the same window logic is reused within the policy evaluator.
- The deploy button in the UI currently has no concept of trigger type. It
  will need to pass =user_triggered= (or =user_retry=) through to the policy
  engine so manual deploys can be distinguished from automated ones.

** Resolved Decisions

- *Staging must be explicit.* =stage= is not implicitly allowed. However,
  =activate= will always ensure staging is complete before proceeding — if a
  policy allows =activate= but not =stage= independently, staging happens as
  part of the activation flow, not as a separately-gated action.
- *Default policy is narrow but not empty.* When no policy is specified (or the
  policy list is empty), the following default applies:
  #+begin_src json
  [{"actions": ["activate"], "triggers": ["manual", "scheduled", "poll_on_connect"]}]
  #+end_src
  This ensures gardens are never stranded — they can always apply an upgrade
  through safe channels. Notably excluded: =realtime= (too aggressive for a
  default) and =restart= (too disruptive). Users must explicitly opt in to both.
  When a user provides any policy rules, the default is fully replaced — there
  is no merging.
- *Confirm blocks any trigger.* The =confirm= flag is not limited to manual
  triggers. When a rule with =confirm: true= matches, the action is blocked
  pending out-of-band approval, regardless of trigger type. If multiple rules
  match the same action and any of them has =confirm: true=, confirmation is
  required.
- *Realtime is gated at both ends.* The server will not send realtime deployment
  events to a garden unless =realtime= appears as a trigger in at least one
  policy rule. The garden will also reject realtime events not present in its
  policy. This avoids unnecessary traffic and provides defense in depth.
- *Inhibition is a separate layer.* Runtime inhibition (garden-side lock) is
  evaluated as a pre-check before policy evaluation. It is out of scope for
  this spec and will be specified separately.

** Future Considerations

- Additional window constraints (e.g. specific dates, week-of-month).
- Generic conditions (non-time-based variables) could be added alongside windows
  if needed, using an operator-based format (eq, in, between).
- Cross-subscription coordination (rolling deployments, staged rollouts) would
  be a separate orchestration layer above this policy system.
- An embedded scripting language (Lua via Luerl) could replace or supplement
  structured rules if the policy needs outgrow what windows and triggers can express.
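
As a rough illustration of the steps in the Policy Evaluation section, the hierarchy walk might be sketched as below. This is Python written for this example, not the Elixir module the spec describes. The names =evaluate=, =HIERARCHY=, =SUPPORTED_ACTIONS=, and =DEFAULT_POLICY= are hypothetical, and window matching is passed in as a caller-supplied predicate (=window_open=, also an assumption) so the sketch stays focused on rule filtering, the default policy, and the confirm flag.

```python
# Highest disruption first, per the disruption hierarchy table.
HIERARCHY = ["restart", "activate", "stage"]

# Mirrors the "Actions by Seed Type" table.
SUPPORTED_ACTIONS = {
    "nixos": {"stage", "activate", "restart"},
    "home-manager": {"stage", "activate"},
    "gcroot": {"stage"},
    "service": {"stage", "activate"},
}

# Default from "Resolved Decisions"; used only when no rules are given.
DEFAULT_POLICY = [
    {"actions": ["activate"], "triggers": ["manual", "scheduled", "poll_on_connect"]}
]


def evaluate(rules, trigger, seed_type, window_open=lambda window: True):
    """Return ("allow", action), ("confirm", action), or "deny"."""
    rules = rules or DEFAULT_POLICY  # an empty or missing policy means the default
    for action in HIERARCHY:
        if action not in SUPPORTED_ACTIONS[seed_type]:
            continue
        matching = [
            rule
            for rule in rules
            if action in rule["actions"]
            and ("triggers" not in rule or trigger in rule["triggers"])
            and ("window" not in rule or window_open(rule["window"]))
        ]
        if matching:
            # Confirmation is required if any matching rule asks for it.
            if any(rule.get("confirm", False) for rule in matching):
                return ("confirm", action)
            return ("allow", action)
    return "deny"


policy = [
    {"actions": ["stage", "activate"]},
    {"actions": ["restart"], "confirm": True},
]
print(evaluate(policy, "scheduled", "nixos"))
print(evaluate(policy, "scheduled", "home-manager"))
print(evaluate([], "realtime", "nixos"))
```

With the "everything allowed, manual confirmation for reboot" example, a NixOS subscription resolves to a confirm-gated =restart=, a home-manager subscription to a plain =activate=, and an empty policy denies the =realtime= trigger because the default excludes it.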
