docs: rebuild environment from scratch (+142)

README.md (+4)
````diff
@@ -166,6 +166,10 @@
 proxmox_password: foo
 ```
 
+### Operations
+
+- [Rebuild environment](docs/how-to-guides/rebuild-environment.md)
+
 ## TODOs
 
 - Fix OCI plain HTTP for local development
````
docs/how-to-guides/rebuild-environment.md (+138, new file)
# Rebuild environment

This is the runbook for destroying and recreating an environment such as
`staging`.

## Before you start

- Run the rebuild from a Linux host that can apply the NixOS parts.
- Make sure the repo on that host contains the changes you want to deploy.
- Make sure `infra/<env>/secrets.yaml` exists and can be decrypted.
- Make sure `settings.yaml` is ready for `toolbox secrets`.
- Expect a destructive rebuild to rotate the node SSH host key.
- Start a `tmux` session before running the long-lived commands in this guide.
  `terragrunt destroy --all`, `terragrunt apply --all`, and the `toolbox`
  publish steps can take long enough that you do not want them tied to a single
  terminal or SSH connection.

## Recreate infra

For example, in `staging`:

```sh
pushd ~/Documents/cloudlab/infra/staging
terragrunt destroy --all
popd
make infra bootstrap platform env=staging
```

During bootstrap you will be asked to enter some secrets; do that inside the
`tmux` session you started earlier (you may want to rotate your API keys at
the same time).

## Verify the rebuild

Run the smoke tests:

```sh
make test env=staging
```

## Known failure modes

### Flux says a release is ready, but the workload is missing

This happened with `cert-manager`, `dex`, and the Istio stack during rebuilds.
The live workload was gone, but helm-controller still had the old release state
in `flux-system`.

Fix:

1. Delete the stale `HelmRelease`
2. Delete the matching Helm storage secret in `flux-system`
3. Reconcile `platform`

Example:

```sh
kubectl -n flux-system delete helmrelease dex
kubectl -n flux-system delete secret sh.helm.release.v1.dex.v1
kubectl -n flux-system annotate kustomization platform \
  reconcile.fluxcd.io/requestedAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```

### A `HelmRelease` is stuck in a terminal failed state

This showed up with `cert-manager`. The workload was healthy, but
helm-controller refused to retry because the release was marked
`RetriesExceeded`.

Fix:

```sh
ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
kubectl -n flux-system annotate helmrelease cert-manager \
  reconcile.fluxcd.io/resetAt="$ts" \
  reconcile.fluxcd.io/requestedAt="$ts" \
  --overwrite
```

### New pods fail with `istio-cni ... Unauthorized`

This means the Istio CNI state on the node survived, but the Istio workloads in
`istio-system` did not.

Symptoms:

- pods stay in `ContainerCreating` or `Init`
- pod events contain `plugin type="istio-cni" ... Unauthorized`
- `istio-system` is empty or partially missing

Fix:

1. Delete the stale Istio `HelmRelease` objects
2. Delete the matching Helm storage secrets
3. Reconcile `platform`
4. Recreate any pods that were created while the CNI was broken

### Forgejo cannot bootstrap OIDC through the public Dex URL

Forgejo uses the in-cluster Dex service for bootstrap instead of
hairpinning through the public gateway.
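A quick way to confirm the in-cluster Dex endpoint is actually serving before digging into Forgejo is to probe its OIDC discovery document from a throwaway pod. This is only a sketch: the service name `dex`, port `5556`, and issuer path `/dex` are assumed Dex defaults, not values confirmed by this repo, so adjust them to the chart values in use.

```shell
# List what actually exists in the dex namespace first.
kubectl -n dex get svc

# Probe the OIDC discovery endpoint from inside the cluster.
# Service name `dex`, port 5556, and issuer path /dex are assumed defaults.
kubectl -n dex run dex-probe --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -fsS http://dex.dex.svc.cluster.local:5556/dex/.well-known/openid-configuration
```

If the discovery document comes back, the in-cluster path Forgejo depends on is fine and the problem is on the Forgejo side.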
If Forgejo init fails, check:

```sh
kubectl -n forgejo logs deploy/forgejo -c configure-gitea
kubectl get all -n dex
```

### Public `443` is broken even though Gateway and HTTPRoutes are healthy

This showed up after a restart when:

- `gateway-istio` was healthy
- `HTTPRoute`s were accepted
- public `curl` to `https://*.staging.khuedoan.com` still failed

Root cause:

- the host IPv6 DNAT rule for `2a01:4f9:c013:e5ee::1:443` still pointed to an
  old gateway pod IP

Checks:

```sh
kubectl -n kube-system get pod -l svccontroller.k3s.cattle.io/svcname=gateway-istio -o wide
ip6tables -t nat -S | grep '2a01:4f9:c013:e5ee::1/128 -p tcp -m tcp --dport 443'
```

If the DNAT target points at a dead pod IP, either recycle the `svclb` pod or
rewrite the top `PREROUTING` and `OUTPUT` rules to point at the live
`gateway-istio` Service ClusterIP.

## Current staging-specific notes

- `platform/staging` is wrapped in Flux resources to avoid the earlier
  bootstrap deadlocks around namespaces and privileged resources
- Dex client secrets and password hashes live in Vault, not plain YAML
- Forgejo bootstraps against the in-cluster Dex service
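The `svclb` recycle mentioned under the public `443` failure mode can be sketched as follows, using the same label selector as the checks in that section. The assumption (from how k3s ServiceLB behaves) is that deleting the pod is safe: k3s recreates it and reprograms the host DNAT rules against the live gateway pod IP.

```shell
# Recycle the svclb pod for gateway-istio; k3s recreates it and
# rewrites the stale ip6tables DNAT entries with the live pod IP.
kubectl -n kube-system delete pod \
  -l svccontroller.k3s.cattle.io/svcname=gateway-istio

# Re-run the check to confirm the DNAT target changed.
ip6tables -t nat -S | grep '2a01:4f9:c013:e5ee::1/128 -p tcp -m tcp --dport 443'
```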