proxmox_password: foo
```

### Operations

- [Rebuild environment](docs/how-to-guides/rebuild-environment.md)

## TODOs

- Fix OCI plain HTTP for local development
+138
docs/how-to-guides/rebuild-environment.md
# Rebuild environment

This is the runbook for destroying and recreating an environment such as
`staging`.

## Before you start

- Run the rebuild from a Linux host that can apply the NixOS parts.
- Make sure the repo on that host contains the changes you want to deploy.
- Make sure `infra/<env>/secrets.yaml` exists and can be decrypted.
- Make sure `settings.yaml` is ready for `toolbox secrets`.
- Expect a destructive rebuild to rotate the node SSH host key.
- Start a `tmux` session before running the long-lived commands in this guide.
  `terragrunt destroy --all`, `terragrunt apply --all`, and the `toolbox`
  publish steps can take long enough that you do not want them tied to a single
  terminal or SSH connection.

## Recreate infra

For example in `staging`:

```sh
pushd ~/Documents/cloudlab/infra/staging
terragrunt destroy --all
popd
make infra bootstrap platform env=staging
```

During bootstrap you will be prompted for some secrets; enter them in the
`tmux` session (you may want to rotate your API keys as well).

## Verify the rebuild

Run the smoke tests:

```sh
make test env=staging
```

## Known failure modes

### Flux says a release is ready, but the workload is missing

This happened with `cert-manager`, `dex`, and the Istio stack during rebuilds.
The live workload was gone, but helm-controller still had the old release state
in `flux-system`.

Fix:

1. Delete the stale `HelmRelease`
2. Delete the matching Helm storage secret in `flux-system`
3. Reconcile `platform`

Example:

```sh
kubectl -n flux-system delete helmrelease dex
kubectl -n flux-system delete secret sh.helm.release.v1.dex.v1
kubectl -n flux-system annotate kustomization platform \
  reconcile.fluxcd.io/requestedAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```
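
The secret name in the example follows Helm's storage naming scheme, `sh.helm.release.v1.<release>.v<revision>`, so a release that reached a later revision leaves secrets ending in `.v2`, `.v3`, and so on. The sketch below only builds that name. It assumes the Helm release name equals the `HelmRelease` name, as it does for `dex` above; `helm_secret_name` is a hypothetical helper, not part of this repo.

```sh
# Hypothetical helper: build the Helm storage secret name for a release.
# $1 = Helm release name, $2 = revision number (defaults to 1).
helm_secret_name() {
  release="$1"
  revision="${2:-1}"
  printf 'sh.helm.release.v1.%s.v%s\n' "$release" "$revision"
}

helm_secret_name dex 1   # prints sh.helm.release.v1.dex.v1
```

If the delete in the example reports that the secret is not found, listing the secrets in `flux-system` shows which revisions actually exist.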

### A `HelmRelease` is stuck in terminal failed state

This showed up with `cert-manager`. The workload was healthy, but
helm-controller refused to retry because the release was marked `RetriesExceeded`.

Fix:

```sh
ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
kubectl -n flux-system annotate helmrelease cert-manager \
  reconcile.fluxcd.io/resetAt="$ts" \
  reconcile.fluxcd.io/requestedAt="$ts" \
  --overwrite
```

### New pods fail with `istio-cni ... Unauthorized`

This means the Istio CNI state on the node survived, but the Istio workloads in
`istio-system` did not.

Symptoms:

- pods stay in `ContainerCreating` or `Init`
- pod events contain `plugin type="istio-cni" ... Unauthorized`
- `istio-system` is empty or partially missing

Fix:

1. Delete stale Istio `HelmRelease` objects
2. Delete the matching Helm storage secrets
3. Reconcile `platform`
4. Recreate any pods that were created while CNI was broken
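
Steps 1 to 3 can be sketched as a small loop. This is a rough sketch under several assumptions: the release names `istio-base` and `istiod` are placeholders (this guide does not list the real Istio `HelmRelease` names), each Helm release is named after its `HelmRelease`, and Helm storage lives in `flux-system`. The `run` and `reset_releases` helpers are hypothetical. The script prints the commands by default and only executes them when `DRY_RUN=0`.

```sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 in the environment to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 1-3: delete stale releases and their Helm storage secrets,
# then ask Flux to reconcile the `platform` Kustomization.
reset_releases() {
  for release in "$@"; do
    run kubectl -n flux-system delete helmrelease "$release" --ignore-not-found
    # Helm labels its storage secrets with owner=helm and name=<release>,
    # so one selector covers every stored revision.
    run kubectl -n flux-system delete secret -l "owner=helm,name=$release"
  done
  run kubectl -n flux-system annotate kustomization platform \
    "reconcile.fluxcd.io/requestedAt=$(date -u +%Y-%m-%dT%H:%M:%SZ)" --overwrite
}

# Placeholder names; list the real ones with
# `kubectl -n flux-system get helmrelease`.
reset_releases istio-base istiod
```

Step 4 is left manual on purpose: which pods to recreate depends on what was scheduled while the CNI was broken.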

### Forgejo cannot bootstrap OIDC through the public Dex URL

Forgejo uses the in-cluster Dex service for bootstrap instead of
hairpinning through the public gateway.

If Forgejo init fails, check:

```sh
kubectl -n forgejo logs deploy/forgejo -c configure-gitea
kubectl get all -n dex
```

### Public `443` is broken even though Gateway and HTTPRoutes are healthy

This showed up after restart when:

- `gateway-istio` was healthy
- `HTTPRoute`s were accepted
- public `curl` to `https://*.staging.khuedoan.com` still failed

Root cause:

- the host IPv6 DNAT rule for `2a01:4f9:c013:e5ee::1:443` still pointed to an
  old gateway pod IP

Checks:

```sh
kubectl -n kube-system get pod -l svccontroller.k3s.cattle.io/svcname=gateway-istio -o wide
ip6tables -t nat -S | grep '2a01:4f9:c013:e5ee::1/128 -p tcp -m tcp --dport 443'
```

If the DNAT target points at a dead pod IP, either recycle the `svclb` pod or
rewrite the top `PREROUTING` and `OUTPUT` rules to the live
`gateway-istio` Service ClusterIP.
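
To compare the DNAT target against live IPs without reading the rule by eye, the target address can be pulled out of the rule text. This is a minimal sketch, assuming the rule uses the `--to-destination [addr]:port` form; `dnat_target` is a hypothetical helper and the target address in the sample rule is made up.

```sh
# Hypothetical helper: read `ip6tables -S` rule lines on stdin and print
# the DNAT target address of the first rule that has one.
dnat_target() {
  sed -n 's/.*--to-destination \[\([0-9A-Fa-f:]*\)\]:[0-9]*.*/\1/p' | head -n 1
}

# Sample rule; the fd00::1234 target is invented for illustration.
rule='-A PREROUTING -d 2a01:4f9:c013:e5ee::1/128 -p tcp -m tcp --dport 443 -j DNAT --to-destination [fd00::1234]:443'
printf '%s\n' "$rule" | dnat_target   # prints fd00::1234
```

On the host, pipe the output of the `ip6tables` check above into `dnat_target` and compare the result with the IPs shown by `kubectl ... -o wide`.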

## Current staging-specific notes

- `platform/staging` is wrapped in Flux resources to avoid the earlier
  bootstrap deadlocks around namespaces and privileged resources
- Dex client secrets and password hashes live in Vault, not plain YAML
- Forgejo bootstraps against the in-cluster Dex service