configure docket redis URL + add build notes · zzstoatzz.io/my-prefect-server@7cde929

+178

2 changed files


deploy/prefect-values.yaml

@@ -8,9 +8,12 @@
   uiConfig:
     prefectUiApiUrl: "https://DOMAIN_PLACEHOLDER/api"

+  extraEnvVarsCM: prefect-docket-config
+
 backgroundServices:
   runAsSeparateDeployment: true
   replicaCount: 1
+  extraEnvVarsCM: prefect-docket-config

 postgresql:
   enabled: true

+175

notes/journey.md

# self-hosted prefect on hetzner: build notes

these notes capture the full journey of deploying a self-hosted Prefect OSS server from scratch. the goal is to feed these back into the Prefect docs as an end-to-end guide.

## starting point

nothing. empty directory, fresh hetzner project with an API token.

## step 1: hetzner VM with k3s

provisioned a single VM using terraform's hcloud provider, following the `relay/indigo` pattern from our other projects.

- **server type**: `cpx21` (3 vCPU, 4 GB RAM, ~€4/mo). note: `cx22` doesn't exist at the `ash` (ashburn) datacenter — shared AMD (`cpx`) vs shared Intel (`cx`) naming.
- **cloud-init**: installs k3s with `--tls-san $PUBLIC_IP`, waits for readiness, touches `/run/k3s-ready` as a signal file
- **firewall**: ports 22 (SSH), 80/443 (HTTP/HTTPS), 6443 (k8s API)
- **kubeconfig**: fetched via `scp`, rewritten to use the public IP instead of `127.0.0.1`

### pitfall: we initially tried bare Docker containers (not k3s)

the original plan called for Docker containers with Caddy as a reverse proxy. this worked but was a dead end for monitoring — the Prefect Grafana dashboards use Kubernetes-specific PromQL queries (recording rules, `namespace=`, `pod=` labels). we tore everything down and rebuilt on k3s.

**lesson**: if you want Prefect's official monitoring dashboards, run on Kubernetes.

## step 2: deploying the prefect stack via helm

the `just deploy` recipe handles all of this in one shot:

1. **cert-manager** — `jetstack/cert-manager` helm chart with CRDs enabled, plus a `ClusterIssuer` for Let's Encrypt (HTTP-01 via traefik)
2. **prefect auth secret** — `kubectl create secret generic prefect-auth` with the `AUTH_STRING` (format: `user:pass`)
3. **prefect server** — `prefect/prefect-server` helm chart:
   - 2 API server replicas (`server.replicaCount: 2`)
   - background services as a separate deployment (`backgroundServices.runAsSeparateDeployment: true`)
   - postgresql (bitnami subchart, 10Gi persistent volume)
   - redis (bitnami subchart, standalone mode)
   - basic auth enabled, referencing the `prefect-auth` secret
   - traefik ingress with cert-manager TLS annotation
   - `uiConfig.prefectUiApiUrl` set to the public HTTPS URL
4. **monitoring** — `prometheus-community/kube-prometheus-stack`:
   - lightweight config: alertmanager disabled, most kube component monitors disabled
   - prometheus: 30s scrape interval, 14d retention, 10Gi storage
   - grafana: sidecar enabled for auto-discovering dashboard ConfigMaps (label: `grafana_dashboard: "1"`)
   - node-exporter + kube-state-metrics enabled
5. **prefect dashboards** — loaded from `deploy/dashboards/*.json` as ConfigMaps with the grafana sidecar label
6. **prefect exporter** — `prefect/prometheus-prefect-exporter` helm chart, pointed at the in-cluster prefect API URL

### pitfall: helm secret ownership

if you manually `kubectl create secret` something that a helm chart wants to manage (like the postgresql password secret), helm will fail because the secret lacks helm ownership labels. solution: let helm manage its own secrets and pass values via `--set postgresql.auth.password=...`.

### pitfall: DNS caching blocks cert-manager

cert-manager's HTTP-01 solver needs to resolve your domain. if the node's systemd-resolved is caching a previous NXDOMAIN (from before you set the DNS record), cert-manager will fail. fix:

```bash
resolvectl dns eth0 8.8.8.8 1.1.1.1
resolvectl flush-caches
kubectl rollout restart deploy/coredns -n kube-system
```

## step 3: terraform for prefect resources

a second terraform layer (`prefect/`) uses the `prefecthq/prefect` provider to manage server-level resources:

```hcl
provider "prefect" {
  endpoint       = "https://${var.domain}/api"
  basic_auth_key = var.auth_string # same user:pass format
}

resource "prefect_work_pool" "k8s" {
  name = "kubernetes-pool"
  type = "kubernetes"
}
```

individual flow deployments are NOT managed by terraform — they're registered from code (see step 5).

### pitfall: work pool namespace default

the kubernetes work pool's base job template defaults `namespace` to `default`. if your worker's RBAC is scoped to the `prefect` namespace (as it should be), flow run jobs will fail with 403s. fix:

```bash
prefect work-pool get-default-base-job-template --type kubernetes > template.json
# edit template.json: set variables.properties.namespace.default to "prefect"
prefect work-pool update kubernetes-pool --base-job-template template.json
```

## step 4: kubernetes worker

deployed as a manual k8s Deployment (not the helm chart, which is oriented toward Cloud):

- **image**: `prefecthq/prefect:3-python3.11-kubernetes` — the `-kubernetes` tag includes the `kubernetes` python package. the base `3-latest` image does NOT.
- **RBAC**: namespace-scoped Role + RoleBinding in `prefect` namespace, matching the official helm chart's permissions:
  - `events, pods, pods/log, pods/status` — get, watch, list
  - `jobs` — get, list, watch, create, update, patch, delete
- **critical env var**: `PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_NAMESPACES=prefect` — the kubernetes worker uses kopf, which watches at cluster scope by default. this env var restricts it to a single namespace, so a namespace-scoped Role works (no ClusterRole needed).

### pitfall: kopf cluster-scope watching

without `PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_NAMESPACES`, the worker tries to list/watch all jobs and pods cluster-wide. with a namespace-scoped Role, this produces 403 errors. the fix is NOT to escalate to a ClusterRole — it's to tell the worker which namespace to watch.

## step 5: deploying flows

the key pattern is `flow.from_source()` + `.deploy()`:

```python
from prefect import flow

if __name__ == "__main__":
    flow.from_source(
        source="https://tangled.sh/zzstoatzz.io/my-prefect-server.git",
        entrypoint="flows/diagnostics.py:diagnostics",
    ).deploy(
        name="diagnostics-every-5m",
        work_pool_name="kubernetes-pool",
        cron="*/5 * * * *",
    )
```

- `from_source()` tells Prefect to use `git_clone` as the pull step — the worker clones the repo at runtime
- without `from_source()`, `.deploy()` tries to build a Docker image with your code baked in
- each flow run gets its own k8s Job/Pod — clean isolation, no dependency pollution
- flow code stays in git, not in the worker image or configmaps
- registration runs from a local machine (eventually CI): `uv run --with prefect flows/diagnostics.py`

## architecture summary

```
┌─────────────────────────────────────────────────────┐
│ hetzner cpx21 (k3s)                                 │
│                                                     │
│ prefect namespace:                                  │
│ ├── prefect-server (2 replicas) ─── traefik ─── TLS │
│ ├── prefect-background-services (1 replica)         │
│ ├── postgresql (1 replica, 10Gi PVC)                │
│ ├── redis (1 replica, standalone)                   │
│ ├── prefect-worker (1 replica)                      │
│ ├── prometheus-prefect-exporter                     │
│ └── flow run jobs (ephemeral, per-run)              │
│                                                     │
│ monitoring namespace:                               │
│ ├── prometheus                                      │
│ ├── grafana ─── traefik ─── TLS                     │
│ ├── node-exporter                                   │
│ └── kube-state-metrics                              │
│                                                     │
│ cert-manager namespace:                             │
│ └── cert-manager (ACME/Let's Encrypt)               │
└─────────────────────────────────────────────────────┘
```

## TODO: file issue on prefecthq/prefect-helm

**the `prefect-server` helm chart does not configure `PREFECT_SERVER_DOCKET_URL`.**

when `redis.enabled: true` and `backgroundServices.runAsSeparateDeployment: true`, the chart automatically sets all the other Redis env vars:
- `PREFECT_MESSAGING_BROKER`
- `PREFECT_MESSAGING_CACHE`
- `PREFECT_SERVER_EVENTS_CAUSAL_ORDERING`
- `PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE`
- `PREFECT_REDIS_MESSAGING_HOST` / `PORT` / `DB` / `PASSWORD`

but it does NOT set `PREFECT_SERVER_DOCKET_URL`, which defaults to `memory://` (single-server only). this means HA deployments using the helm chart have docket silently running in-memory mode, which breaks coordination of background services like the scheduler, late run detection, and automation triggers.

the fix should be: when `redis.enabled: true`, automatically set `PREFECT_SERVER_DOCKET_URL` to `redis://<redis-host>:6379/1` (using a separate db number from messaging, matching the docker-compose example in the self-hosted docs). this belongs in the `backgroundServices.envVars` helper in `_helpers.tpl`.

the current workaround is creating a separate ConfigMap with the env var and referencing it via `extraEnvVarsCM` on both `server` and `backgroundServices`, which is unnecessarily complex for what should be a built-in setting.

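
a minimal sketch of what the workaround ConfigMap could look like, generated as JSON so it can be piped into `kubectl apply -f -`. the redis service name `prefect-server-redis-master` is a hypothetical placeholder (check `kubectl get svc -n prefect` for the real one), and the sketch assumes redis needs no password in the URL. only the ConfigMap name `prefect-docket-config`, the env var, port 6379 and db 1 come from these notes.

```python
import json

# ASSUMPTION: the redis service name is a placeholder, not taken from the notes.
REDIS_HOST = "prefect-server-redis-master"
NAMESPACE = "prefect"


def docket_configmap(redis_host: str, namespace: str, db: int = 1) -> dict:
    """Build a ConfigMap manifest that points docket at redis."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "prefect-docket-config", "namespace": namespace},
        "data": {
            # db 1 keeps docket on a separate redis db from messaging
            "PREFECT_SERVER_DOCKET_URL": f"redis://{redis_host}:6379/{db}",
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML
    print(json.dumps(docket_configmap(REDIS_HOST, NAMESPACE), indent=2))
```

if redis does require a password, a ConfigMap is probably the wrong place for the URL, since ConfigMap data is stored in plain text; a Secret would be the safer home for it.
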

**action: create an issue at https://github.com/PrefectHQ/prefect-helm/issues with the above.**

## what's next

- custom "executive" grafana dashboard — single pane with infra + prefect overview
- CI for flow registration on tangled (.tangled CI, not github actions)
- contribute guide back to prefect docs
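
one small thing that could go into that guide: the manual `template.json` edit from the work pool namespace pitfall in step 3 can be scripted. a minimal sketch, assuming `template.json` is the file written by `prefect work-pool get-default-base-job-template`; the helper name `set_default_namespace` is made up for illustration.

```python
import json
from pathlib import Path


def set_default_namespace(template: dict, namespace: str) -> dict:
    """Set variables.properties.namespace.default in a base job template."""
    props = template.setdefault("variables", {}).setdefault("properties", {})
    # keep any other keys already present on the namespace property
    props.setdefault("namespace", {})["default"] = namespace
    return template


if __name__ == "__main__":
    path = Path("template.json")
    template = json.loads(path.read_text())
    updated = set_default_namespace(template, "prefect")
    path.write_text(json.dumps(updated, indent=2))
```

after running it, `prefect work-pool update kubernetes-pool --base-job-template template.json` applies the change as before.
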
