my prefect server setup (prefect-metrics.waow.tech)
python orchestration

bump to cpx31 (8GB), fix dashboard, add readme

- resize server from cpx21 (4GB) to cpx31 (8GB) for monitoring headroom
- fix pods-not-ready query (filter for == 1, not just phase label match)
- rename deployment from "diagnostics-every-5m" to "diagnostics"
- add minimal README with deployment steps
- update build notes with docket, dashboard, and memory optimization steps

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+169 -16
README.md  +56
+ self-hosted prefect OSS on a single hetzner VM (k3s), with monitoring.
+
+ <details>
+ <summary>deployment</summary>
+
+ ### prerequisites
+
+ - [terraform](https://developer.hashicorp.com/terraform/install)
+ - [just](https://just.systems)
+ - [uv](https://docs.astral.sh/uv)
+ - a hetzner cloud API token
+ - a domain with DNS you control
+
+ ### setup
+
+ ```bash
+ cp .env.example .env
+ # fill in: HCLOUD_TOKEN, POSTGRES_PASSWORD, AUTH_STRING, DOMAIN, LETSENCRYPT_EMAIL
+ ```
+
+ ### deploy
+
+ ```bash
+ just init            # terraform init
+ just infra           # create the VM
+ just kubeconfig      # wait for k3s, fetch kubeconfig
+ just deploy          # cert-manager, prefect server, monitoring, dashboards
+ just worker          # deploy the kubernetes worker
+ just prefect-init    # terraform init for prefect resources
+ just prefect-apply   # create work pool + variables
+ just register-flows  # register flow deployments from code
+ ```
+
+ after `deploy`, point your DNS:
+ - `$DOMAIN` → server IP (`just server-ip`)
+ - `$GRAFANA_DOMAIN` → same IP
+
+ ### verify
+
+ ```bash
+ just health  # curl the /api/health endpoint
+ just status  # node + pod resource usage
+ just prefect work-pool ls
+ ```
+
+ ### operations
+
+ ```bash
+ just logs                 # tail prefect-server logs (default)
+ just logs worker          # tail worker logs
+ just prefect flow-run ls  # run any prefect CLI command remotely
+ just dashboards           # reload grafana dashboards from deploy/dashboards/
+ just ssh                  # ssh into the server
+ ```
+
+ </details>
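For the setup step in the README above, a hedged sketch of what the resulting `.env` might look like. The variable names are the ones the README lists (plus `GRAFANA_DOMAIN`, which the DNS step references); the values and comments are placeholders, not taken from the repo's actual `.env.example`.

```bash
# hypothetical .env sketch; values are placeholders
HCLOUD_TOKEN=your-hetzner-cloud-api-token
POSTGRES_PASSWORD=change-me
AUTH_STRING=admin:change-me           # assumed to be basic-auth credentials for the prefect API/UI
DOMAIN=prefect.example.com
GRAFANA_DOMAIN=grafana.example.com    # referenced by the DNS step; assumed to also live in .env
LETSENCRYPT_EMAIL=you@example.com
```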
deploy/dashboards/executive-overview.json  +1 -1
···
      "options": { "colorMode": "background", "graphMode": "none", "reduceOptions": { "calcs": ["lastNotNull"] } },
      "title": "pods not ready",
      "type": "stat",
-     "targets": [{ "expr": "count(kube_pod_status_phase{namespace=~\"prefect|monitoring\", phase=~\"Failed|Unknown\"}) or vector(0)", "refId": "A" }]
+     "targets": [{ "expr": "count(kube_pod_status_phase{namespace=~\"prefect|monitoring\", phase=~\"Failed|Unknown\"} == 1) or vector(0)", "refId": "A" }]
    },
    {
      "datasource": { "type": "prometheus", "uid": "prometheus" },
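Why the `== 1` matters (editor's gloss, not from the repo): `kube_pod_status_phase` exposes one 0/1-valued series per pod per phase, so counting every series whose `phase` label matches `Failed|Unknown` also counts the zero-valued series of perfectly healthy pods. Filtering on the sample value keeps only pods actually in one of those phases:

```promql
# before: matches the phase *label* only, so healthy pods' 0-valued Failed/Unknown series are counted too
count(kube_pod_status_phase{namespace=~"prefect|monitoring", phase=~"Failed|Unknown"}) or vector(0)

# after: keep only series whose value is 1, i.e. pods currently in Failed or Unknown
count(kube_pod_status_phase{namespace=~"prefect|monitoring", phase=~"Failed|Unknown"} == 1) or vector(0)
```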
flows/diagnostics.py  +1 -1
···
      source="https://tangled.sh/zzstoatzz.io/my-prefect-server.git",
      entrypoint="flows/diagnostics.py:diagnostics",
  ).deploy(
-     name="diagnostics-every-5m",
+     name="diagnostics",
      work_pool_name="kubernetes-pool",
      cron="*/5 * * * *",
  )
infra/variables.tf  +2 -2
···
  }

  variable "server_type" {
-   description = "Hetzner server type (cpx21 = 3 vCPU, 4 GB RAM, 80 GB disk)"
+   description = "Hetzner server type (cpx31 = 4 vCPU, 8 GB RAM, 160 GB disk)"
    type        = string
-   default     = "cpx21"
+   default     = "cpx31"
  }

  variable "location" {
notes/journey.md  +109 -12
···
  └─────────────────────────────────────────────────────┘
  ```

- ## TODO: file issue on prefecthq/prefect-helm
+ ## step 6: redis/docket configuration
+
+ docket is Prefect's coordination layer for background services (scheduler, late run detection, automation triggers). it's configured via `PREFECT_SERVER_DOCKET_URL` and defaults to `memory://`, which only works for single-process deployments.
+
+ the helm chart auto-configures all the other Redis env vars (`PREFECT_MESSAGING_BROKER`, `PREFECT_MESSAGING_CACHE`, etc.) but does NOT set `PREFECT_SERVER_DOCKET_URL`. this means HA deployments silently run docket in-memory.
+
+ ### workaround: ConfigMap for docket URL
+
+ created a ConfigMap with `PREFECT_SERVER_DOCKET_URL` and referenced it via `extraEnvVarsCM` on both `server` and `backgroundServices`:
+
+ ```yaml
+ # in prefect-values.yaml
+ server:
+   extraEnvVarsCM: prefect-docket-config
+ backgroundServices:
+   extraEnvVarsCM: prefect-docket-config
+ ```
+
+ ### pitfall: Redis auth in docket URL

- **the `prefect-server` helm chart does not configure `PREFECT_SERVER_DOCKET_URL`.**
+ initial attempt used `redis://redis-host:6379/1` — failed with `redis.exceptions.AuthenticationError`. the bitnami Redis subchart enables auth by default. the URL must include the password:
+
+ ```
+ redis://:${REDIS_PASSWORD}@prefect-server-redis-master.prefect.svc.cluster.local:6379/1
+ ```

- when `redis.enabled: true` and `backgroundServices.runAsSeparateDeployment: true`, the chart automatically sets all the other Redis env vars:
- - `PREFECT_MESSAGING_BROKER`
- - `PREFECT_MESSAGING_CACHE`
- - `PREFECT_SERVER_EVENTS_CAUSAL_ORDERING`
- - `PREFECT_SERVER_CONCURRENCY_LEASE_STORAGE`
- - `PREFECT_REDIS_MESSAGING_HOST` / `PORT` / `DB` / `PASSWORD`
+ verified docket is working by checking Redis db 1:
+ ```bash
+ kubectl exec -it -n prefect prefect-server-redis-master-0 -- redis-cli -a "$PASS" -n 1 KEYS '*'
+ # shows docket coordination keys
+ ```

- but it does NOT set `PREFECT_SERVER_DOCKET_URL`, which defaults to `memory://` (single-server only). this means HA deployments using the helm chart have docket silently running in-memory mode, which breaks coordination of background services like the scheduler, late run detection, and automation triggers.
+ ### TODO: file issue on prefecthq/prefect-helm

- the fix should be: when `redis.enabled: true`, automatically set `PREFECT_SERVER_DOCKET_URL` to `redis://<redis-host>:6379/1` (using a separate db number from messaging, matching the docker-compose example in the self-hosted docs). this belongs in the `backgroundServices.envVars` helper in `_helpers.tpl`.
+ **the `prefect-server` helm chart should configure `PREFECT_SERVER_DOCKET_URL` automatically.**

- the current workaround is creating a separate ConfigMap with the env var and referencing it via `extraEnvVarsCM` on both `server` and `backgroundServices`, which is unnecessarily complex for what should be a built-in setting.
+ when `redis.enabled: true`, the chart should set `PREFECT_SERVER_DOCKET_URL` to `redis://<redis-host>:6379/1` (separate db from messaging, matching the docker-compose example in the self-hosted docs). this belongs in the `backgroundServices.envVars` helper in `_helpers.tpl`.

  **action: create an issue at https://github.com/PrefectHQ/prefect-helm/issues with the above.**

+ ## step 7: executive grafana dashboard
+
+ built a single-pane dashboard (`deploy/dashboards/executive-overview.json`) combining infra + prefect overview:
+
+ **prefect section:**
+ - stat panels: flows, deployments, work pools, workers, avg run time, active runs
+ - pie chart: flow runs by state (color-coded — green/completed, red/failed, blue/running, etc.)
+ - table: deployments with name, flow, status, work pool
+
+ **infrastructure section:**
+ - gauge panels: node CPU %, memory %, disk %
+ - stat panel: pods not ready (Failed|Unknown only — initially counted Succeeded pods from completed Jobs, which inflated the number to 51)
+ - time series: per-pod CPU and memory usage in the prefect namespace
+
+ dashboards are loaded as ConfigMaps with the `grafana_dashboard: "1"` label, auto-discovered by the grafana sidecar.
+
+ ### pitfall: pods-not-ready counting completed Jobs
+
+ initial query counted all pods not in `Running` or `Succeeded` phase. but completed flow run Jobs leave pods in `Succeeded` state, and a broad "not ready" filter caught them. fix: only count `Failed|Unknown` phases:
+ ```promql
+ count(kube_pod_status_phase{namespace=~"prefect|monitoring", phase=~"Failed|Unknown"}) or vector(0)
+ ```
+
+ ## step 8: memory optimization
+
+ the cpx21 node (4 GB RAM) was at ~80% memory usage. biggest consumers:
+ - grafana: ~340 Mi
+ - prometheus: ~301 Mi
+ - prefect API servers (2 replicas): ~170 Mi each
+
+ optimizations applied:
+ 1. scaled API server to 1 replica (sufficient for this use case)
+ 2. increased prometheus scrape interval from 30s to 60s
+ 3. reduced prometheus retention from 14d to 7d
+ 4. set explicit resource limits on prometheus (512Mi), grafana (256Mi), prometheus-operator (128Mi)
+
+ result: ~73% memory usage, stable.
+
+ ## step 9: justfile as the operations interface
+
+ early on, we were running long ad-hoc `kubectl` commands. consolidated everything into justfile recipes to keep operations repeatable:
+
+ - `just logs [component]` — parameterized log tailing (defaults to prefect-server)
+ - `just prefect *args` — proxy for running prefect CLI against the remote server
+ - `just dashboards` — reload grafana dashboards from `deploy/dashboards/`
+ - `just status` — `kubectl top nodes`, `kubectl top pods`, pod status for prefect + monitoring namespaces
+
+ **lesson**: if you're running it more than once, put it in the justfile.
+
+ ## architecture summary
+
+ ```
+ ┌─────────────────────────────────────────────────────┐
+ │ hetzner cpx21 (k3s)                                 │
+ │                                                     │
+ │ prefect namespace:                                  │
+ │ ├── prefect-server (1 replica) ─── traefik ─── TLS  │
+ │ ├── prefect-background-services (1 replica)         │
+ │ ├── postgresql (1 replica, 10Gi PVC)                │
+ │ ├── redis (1 replica, standalone)                   │
+ │ ├── prefect-worker (1 replica)                      │
+ │ ├── prometheus-prefect-exporter                     │
+ │ └── flow run jobs (ephemeral, per-run)              │
+ │                                                     │
+ │ monitoring namespace:                               │
+ │ ├── prometheus (60s scrape, 7d retention)           │
+ │ ├── grafana ─── traefik ─── TLS                     │
+ │ ├── node-exporter                                   │
+ │ └── kube-state-metrics                              │
+ │                                                     │
+ │ cert-manager namespace:                             │
+ │ └── cert-manager (ACME/Let's Encrypt)               │
+ └─────────────────────────────────────────────────────┘
+ ```
+
  ## what's next

- - custom "executive" grafana dashboard — single pane with infra + prefect overview
  - CI for flow registration on tangled (.tangled CI, not github actions)
+ - more interesting deployments with automations
+ - file upstream issue on prefecthq/prefect-helm for docket URL
  - contribute guide back to prefect docs
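The step 6 notes in the diff above reference a `prefect-docket-config` ConfigMap via `extraEnvVarsCM` but never show it. A minimal sketch of what such a ConfigMap could look like, reusing the Redis URL from the notes; the repo's real manifest may differ, and the password has to be rendered into the value at deploy time (a ConfigMap value is not interpolated by kubernetes):

```yaml
# hypothetical sketch of the ConfigMap referenced by extraEnvVarsCM (not the repo's actual manifest)
apiVersion: v1
kind: ConfigMap
metadata:
  name: prefect-docket-config
  namespace: prefect
data:
  # db 1 keeps docket separate from the messaging db; ${REDIS_PASSWORD} must be substituted before applying
  PREFECT_SERVER_DOCKET_URL: "redis://:${REDIS_PASSWORD}@prefect-server-redis-master.prefect.svc.cluster.local:6379/1"
```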
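The step 7 notes say dashboards ship as ConfigMaps carrying the `grafana_dashboard: "1"` label so the grafana sidecar auto-discovers them. A hedged sketch of that shape; the ConfigMap name and namespace are assumptions, and the repo's `just dashboards` recipe may generate this differently:

```yaml
# hypothetical dashboard ConfigMap; the grafana sidecar watches for the grafana_dashboard label
apiVersion: v1
kind: ConfigMap
metadata:
  name: executive-overview       # assumed name
  namespace: monitoring          # assumed namespace
  labels:
    grafana_dashboard: "1"       # label the sidecar auto-discovers
data:
  executive-overview.json: |
    { "title": "executive overview", "panels": [ "... contents of deploy/dashboards/executive-overview.json ..." ] }
```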
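Step 8's tuning (60s scrape, 7d retention, explicit memory limits) maps onto helm values roughly as below, assuming the monitoring stack is the kube-prometheus-stack chart, which the component list (prometheus-operator, grafana sidecar, node-exporter, kube-state-metrics) suggests; the key layout in the repo's actual values file may differ:

```yaml
# hedged sketch of the step 8 memory tuning, assuming kube-prometheus-stack style values
prometheus:
  prometheusSpec:
    scrapeInterval: 60s      # was 30s
    retention: 7d            # was 14d
    resources:
      limits:
        memory: 512Mi
grafana:
  resources:
    limits:
      memory: 256Mi
prometheusOperator:
  resources:
    limits:
      memory: 128Mi
```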
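Step 9 describes the justfile recipes only by behavior. A sketch of what two of them might look like, assuming the kubeconfig is fetched to a local `kubeconfig.yaml` and that `.env` provides `DOMAIN`; the repo's real recipes may differ:

```just
# hypothetical recipe sketches; the repo's actual justfile may differ

set dotenv-load := true    # make .env values (DOMAIN, etc.) available to recipes

# tail logs for a component in the prefect namespace (defaults to prefect-server)
logs component="prefect-server":
    kubectl --kubeconfig kubeconfig.yaml -n prefect logs -f deploy/{{component}}

# run any prefect CLI command against the remote server
prefect *args:
    PREFECT_API_URL="https://$DOMAIN/api" uv run prefect {{args}}
```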