First article about systemd (#9) · hauleth.dev/blog@8a90493

+1

.vale.ini

 [*.md]
 BasedOnStyles = proselint, write-good
+write-good.Passive = NO

+517

content/post/who-watches-watchmen-i.md

+++
title = "Who Watches Watchmen? - Part 1"
date = 2022-01-17T21:22:18+01:00
draft = true

description = """
A lot of applications use systems like Kubernetes for their deployment. In my
humble opinion this is often overkill, as a system that offers most of what
such tools provide is already present in your OS. In this article I will try to
present how to utilise the most popular system supervisor from Elixir
applications.
"""

[taxonomies]
tags = [
  "elixir",
  "programming",
  "systemd",
  "deployment"
]
+++

I gave a talk about this topic at CODE Beam V Americas, but I wasn't really
satisfied with it. In this post I will try to describe what my presentation was
meant to be about.

If you are wondering about the presentation, [the slides are on SpeakerDeck][slides].

[slides]: https://speakerdeck.com/hauleth/who-supervises-supervisors

## Abstract

Most operating systems are multi-process and multi-user. This has a lot of
positive aspects, like being able to do more than one thing at a time on our
devices, but it introduces a lot of complexity that in most cases is hidden
from users and developers. These things still need to be handled one way or
another. The most basic problems are:

- some processes need to be started before the user can interact with the OS
  in a meaningful (for them) way (for example mounting filesystems, logging,
  etc.)
- some processes require strict startup ordering, for example you may need
  logging to be started before starting the HTTP server
- the system operator somehow needs to know when a process is ready to do its
  work, which is often some time after the process starts
- the system operator should be able to check the process state when debugging
  is needed, most commonly via logs
- shutdown of processes should be handled in a way that allows other
  processes to shut down cleanly (for example an application that uses a DB
  should be down before the DB itself)

## Why do we need a system supervisor?

A system supervisor is a process started early in the OS boot that handles
starting and managing all other processes that run on our system. It is often
the init process (the first process started by the OS, running with PID 1) or
the first (and sometimes only) process started by the init process.
Popular examples of such supervisors (often integrated with init systems):

- SysV init, the "traditional" implementation that originates from UNIX System
  V (hence the name)
- BSD init, which with some variations is used in BSD-based OSes (NetBSD,
  FreeBSD); it shares some similarities with SysV init, and service
  descriptions are provided by shell scripts
- OpenRC, which also uses shell-based scripts for service descriptions, used by
  Linux distributions like Gentoo or Alpine
- `launchd`, used on Darwin (macOS, iPadOS, iOS, watchOS) systems, which uses
  XML-based `plists` for service descriptions
- `runit`, a small but quite capable init and supervisor, used for example
  by Void Linux
- Upstart, created by Canonical Ltd.
  as a replacement for the SysV-like init system
  in Ubuntu (no longer in use in Ubuntu), still used in some distributions like
  ChromeOS or Synology NAS
- `systemd` (this is the name, not "SystemD"), created by a Red Hat
  employee, the (in)famous Lennart Poettering, and later adopted by almost all
  major Linux distributions, which spawned some heated discussion about it

In this article I will focus on systemd and its approach to "new-style system
daemons".

---

**DISCLAIMER**

Each of the solutions mentioned above has its strong and weak points. I do not
want to start another flame war over whether systemd is good or not. It has
some good in it, and it has some bad in it, but we can say that it "won" over
the most used distributions, and despite our love or hate towards it, we need
to learn how to live with that.

---

## Why `systemd`?

`systemd` became a thing because the SysV approach to ordering service startup
was mildly irritating and non-parallelizable. In short, SysV starts processes
exactly in lexicographical order of the files in a given directory. This meant
that even if your service didn't need the DB at all, but it somehow ended up
further down the directory listing, you ended up waiting for the DB to start.
Additionally, SysV wasn't really monitoring services; it just assumed that when
a process forked itself to the background, it was "done" with startup and we
could continue. This is obviously not true in many cases. For example, if your
previous shutdown wasn't clean because of a power outage or another issue, then
your DB probably needs a bit of time to rebuild its state from the journal.
This causes even more slowdown for the processes further down the list. This is
highly undesirable in a modern, cloud-based environment, where you often start
machines on demand during autoscaling actions.
When there is a spike in traffic that needs autoscaling, the sooner a new
machine is in a usable state, the sooner it can take load from other machines.

Different tools take different approaches to solving that issue. `systemd`
takes an approach derived from `launchd`: do not do stuff that is not
needed. It achieved that by merging D-Bus into `systemd` itself and then
making all services D-Bus daemons (which are started on request), and
additionally it provides a bunch of triggers for those daemons. We can trigger
on actions of other services (obviously), but also on things like socket
activity, path creation/modification, mounts, connection or disconnection of
a device, time events, etc.

---

**DIGRESSION**

This is exactly the reason why `systemd` has its infamous "feature creep"; it
doesn't simply "digest" services like Cron or `udev`. It is not that these are
"tightly" intertwined into `systemd`. You can still replace them with their
older counterparts; you will just lose all the features these bring with them.

---

Such a lazy approach sometimes requires changes in the service itself. For
example, to let the supervisor know that you are ready (not just started), you
need some way to communicate with the supervisor. In `systemd` you can do so
via the UNIX socket pointed to by the `NOTIFY_SOCKET` environment variable
passed to your application. With the same socket you can implement another
useful feature: a watchdog/heartbeat process. This means that if for any reason
your process becomes non-responsive (but refuses to die), the supervisor will
forcefully bring the process down and restart it, assuming that the error was
accidental.

About restarting: we can define the behaviour of the service after the main
process dies.
It can be restarted regardless of the exit code, it can be restarted on
abnormal exit, it can remain shut down, etc. Does this ring a bell? This works
similarly to OTP supervisors, but "one level above". If your service utilizes
the system supervisor right, you can make your application almost ultimately
self-healing (by restarts).

## Basic setup

Now that we know a little about how and why `systemd` works the way it does, we
can go into the details of how to utilize that with services in Elixir.

As a base we will implement a super simple Plug application:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

```elixir
# hello/router.ex
defmodule Hello.Router do
  use Plug.Router

  plug :match
  plug :dispatch

  get "/" do
    send_resp(conn, 200, "Hello World!")
  end
end
```

I will also assume that we are using a [Mix release][mix-release] named `hello`
that we later copy to `/opt/hello`.
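
For reference, a release named `hello` could be declared in `mix.exs` roughly
as follows. This is only a sketch: the `Hello.MixProject` module name, the
Elixir requirement, and the dependency versions are assumptions, not taken from
the original project.

```elixir
# mix.exs (sketch; module name and versions are assumptions)
defmodule Hello.MixProject do
  use Mix.Project

  def project do
    [
      app: :hello,
      version: "0.1.0",
      elixir: "~> 1.12",
      start_permanent: Mix.env() == :prod,
      deps: deps(),
      # Declares the release that `mix release hello` builds
      releases: [
        hello: [include_executables_for: [:unix]]
      ]
    ]
  end

  def application do
    [
      extra_applications: [:logger],
      # Starts the supervision tree defined in Hello.Application
      mod: {Hello.Application, []}
    ]
  end

  defp deps do
    [
      {:plug_cowboy, "~> 2.5"},
      # Library used later in this article for the systemd integration
      {:systemd, "~> 0.6"}
    ]
  end
end
```

Running `MIX_ENV=prod mix release hello` places the release under
`_build/prod/rel/hello`, which can then be copied to `/opt/hello`.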

[mix-release]: https://hexdocs.pm/mix/Mix.Tasks.Release.html

### systemd unit file

We have only one thing left: we need to define our [`hello.service`][systemd.service]:

```ini
[Unit]
Description=Hello World service

[Service]
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
```

Now you can create a file with that content at
`/usr/local/lib/systemd/system/hello.service` and then start it with:

```
# systemctl start hello.service
```

This is the simplest service imaginable; however, from the start we have a few
issues there:

- It will run the service as the user running the supervisor, so if it is run
  by the global supervisor, it will run as `root`. You do not want to run
  anything as `root`.
- On error it will produce a (BEAM) crash dump, which may contain sensitive data.
- It can read (and, due to being run as `root`, write) everything in the system,
  like private data of other processes.

[systemd.service]: https://www.freedesktop.org/software/systemd/man/systemd.service.html#

## Service readiness

The Erlang VM isn't really the best tool out there when it comes to startup
times. In addition to that, our application may need some preparation steps
before it can be marked as "ready". This is a problem that I sometimes
encounter in Docker, where some containers do not really have any health check,
and then I need a loop with a check in some of the containers that depend on
another one. This "workaround" is frustrating, error prone, and can cause nasty
Heisenbugs when the timing is wrong.

Two possible solutions for this problem are:

- Readiness probe - another program that is run after the main process has
  started and checks whether our application is ready to work.
- Notification system, where our application uses some common protocol to
  inform the supervisor that it has finished setup and is ready for work.

systemd supports the second approach via [`sd_notify`][sd_notify]. The approach
there is simple - we have the `NOTIFY_SOCKET` environment variable that
contains the path to a Unix datagram socket, which we can use to send
information about the state of our application. This socket accepts a set of
different messages, but right now, for our purposes, we will focus on only a
few of them:

- `READY=1` - marks our service as ready, meaning it is ready to do its work
  (in our example, accept incoming HTTP connections). It needs to be sent
  within a given timespan after the start of the VM, otherwise the process
  will be killed and possibly restarted
- `STATUS=name` - sets the status of our application that can be checked via
  `systemctl status hello.service`; this gives us better insight into the
  high-level state without manually traversing the logs
- `RELOADING=1` - marks that our application is reloading, which in general may
  mean a lot of things, but here it will be used to mark `:init.restart/0`-like
  behaviour (due to [erlang/otp#4698][] there is a wrapper for that function in
  the `systemd` library). The process then needs to send `READY=1` within a
  given timespan, or it will be marked as malfunctioning and will be
  forcefully killed and possibly restarted
- `STOPPING=1` - marks that our application has begun shutting down and
  will be closing soon.
  If the process does not close within the given timespan, it
  will be forcefully killed

These messages give us enough power not only to mark the service as ready,
but also to provide additional information about the system state, so even an
operator who knows little about Erlang or our application runtime will be able
to understand what is going on.

The main thing is that systemd will wait with the activation of the dependants
of our service, and the `systemctl start` and `systemctl restart` commands
will wait as well, until our service declares that it is ready.

Usage of this feature is quite simple:

```ini
[Unit]
Description=Hello World service

[Service]
# Define `Type=` to `notify`
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min
```

And then in our supervision tree we need to add `:systemd.ready()` **after**
the last process needed for proper functioning of our application; in our
simple example it is after `Plug.Cowboy`:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(), # <-- a function call, as it returns a proper
                        #     child spec
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

Now restarting our service will not finish immediately, but will wait until our
service declares that it is ready.
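
To make the protocol itself less magical, here is a minimal sketch of what
sending such a message by hand could look like, without the `systemd` library.
The `Hello.Notify` module is hypothetical and exists only for illustration; the
handling of abstract sockets (a leading `@`) is an assumption about how the
address needs to be passed to `:gen_udp`. In a real application the library
call shown above remains the simpler choice.

```elixir
defmodule Hello.Notify do
  @moduledoc """
  Hypothetical helper showing the raw `sd_notify` protocol: a single datagram
  with a `KEY=value` payload sent to the socket named in `NOTIFY_SOCKET`.
  """

  @doc "Sends `message` (for example READY=1) to the supervisor, if there is one."
  def notify(message) when is_binary(message) do
    case System.get_env("NOTIFY_SOCKET") do
      # Not started by systemd with `Type=notify`: nothing to do
      nil ->
        :ok

      path ->
        # Unix datagram socket; port 0 is required for local addresses
        {:ok, socket} = :gen_udp.open(0, [:local])

        try do
          :gen_udp.send(socket, {:local, address(path)}, 0, message)
        after
          :gen_udp.close(socket)
        end
    end
  end

  # Assumption: a leading "@" denotes a Linux abstract socket, which is
  # addressed with a leading NUL byte instead
  defp address("@" <> rest), do: <<0>> <> rest
  defp address(path), do: path
end
```

Calling `Hello.Notify.notify("READY=1")` once the HTTP listener is up would
have the same effect as the `:systemd.ready()` child. Either way, the restart
command returns only once readiness has been reported: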

```shell
# systemctl restart hello.service
```

About `STOPPING=1` - the magic thing is that the `systemd` library takes care of
it for you. As soon as the system is scheduled to shut down, this message
will be sent automatically, and the operator will be notified about this fact.

We can also provide more information about the state of our application. As you
may have already noticed, we have [`Plug.Cowboy.Drainer`][] there. It is a
process that delays the shutdown of our application while there are still open
connections. This can take some time, so it would be handy if the operator
could see that the draining is in progress. We can easily achieve that by
changing our supervision tree again:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(),
      :systemd.set_status(down: [status: "drained"]),
      {Plug.Cowboy.Drainer, refs: :all, shutdown: 10_000},
      :systemd.set_status(down: [status: "draining"])
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

Children are stopped in reverse start order, so on shutdown the last entry
sets the status to `"draining"` before the drainer starts its work. Now when we
shut down our application with:

```shell
# systemctl stop hello.service
```

and we have some connections open to our service (you can simulate that with
`wrk`), then when we run `systemctl status hello.service` in a separate
terminal (the previous one will be blocked until our service shuts down) we
will be able to see something like:

```
● hello.service - Example Plug application
     Loaded: loaded (/usr/local/lib/systemd/system/hello.service; static; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Sat 2022-01-15 17:46:30 CET; 1s ago
   Main PID: 1327 (beam.smp)
     Status: "draining"
      Tasks: 19 (limit: 1136)
     Memory: 106.5M
```

You can see that the `Status` is set to `"draining"`. As soon as all
connections are drained it will change to `"drained"`, and then the
application will shut down and the service will be marked as `inactive`.

[sd_notify]: https://www.freedesktop.org/software/systemd/man/sd_notify.html
[erlang/otp#4698]: https://github.com/erlang/otp/issues/4698
[`Plug.Cowboy.Drainer`]: https://hexdocs.pm/plug_cowboy/2.5.2/Plug.Cowboy.Drainer.html

## Watchdog

The watchdog allows us to monitor our application for responsiveness (as
mentioned above). It is a simple feature that requires our application to ping
systemd within a specified interval, otherwise the application will be forcibly
shut down as malfunctioning. Fortunately for us, the `systemd` library that
provides our integration has that feature out of the box, so all we need to do
to achieve the expected result is set the `WatchdogSec=` option in our
`systemd.service` file:

```ini
[Unit]
Description=Hello World service

[Service]
Environment=PORT=80
Type=notify
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min
```

This configuration says that if the VM does not send a healthy message within
each 1 minute interval, then the service will be marked as malfunctioning.
From the application side we can manage the state of the watchdog in several
ways:

- By setting the `systemd.watchdog_check` configuration option we can configure
  the function that will be called on each check. If that function returns
  `true`, it means that the application is healthy and systemd should be
  notified with a ping; if it returns `false` or fails, the ping will be
  omitted.
- Manually sending a trigger message in case of detected problems via
  `:systemd.watchdog(:trigger)`; it will immediately mark the service as
  malfunctioning and will trigger the action defined in the service unit file
  (by default it will restart the application)
- Disabling the built-in watchdog process via `:systemd.watchdog(:disable)` and
  then manually sending `:systemd.watchdog(:ping)` within the expected
  intervals (discouraged)

## Security

We should start with changing the default user and group assigned to our
process. We can do so in 2 different ways:

1. Use some existing user and group by defining the `User=` and `Group=`
   directives in our service definition; or
2. Create an ephemeral user on demand before our service starts, by using the
   directive `DynamicUser=true` in the service definition.

I prefer the second option, as it additionally enables a lot of other security
related options, like creating a private `/tmp` directory, making the system
read-only, etc. This also has some disadvantages, like removing all of the
service's data on shutdown; however, there are options to keep some data
between launches.

In addition to that we can add `PrivateDevices=true`, which will hide all
physical devices from `/dev`, leaving only pseudo devices like `/dev/null` or
`/dev/urandom` (so you will be able to use the `:crypto` and `:ssl` modules
without problems).

The next thing we can do is [disable crash dumps generated by BEAM][crash].
While not strictly needed in this case, it is worth remembering that it isn't
hard to achieve; it only takes `Environment=ERL_CRASH_DUMP_SECONDS=0`.

Our new, more secure, `hello.service` will look like:

```ini
[Unit]
Description=Hello World service
Requires=network.target

[Service]
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min

# We need to add the capability to be able to bind to port 80
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# The bounding set only limits capabilities; the ambient set is what grants
# the capability to the non-root (dynamic) user
AmbientCapabilities=CAP_NET_BIND_SERVICE

# Hardening
DynamicUser=true
PrivateDevices=true
Environment=ERL_CRASH_DUMP_SECONDS=0
```

The problem with that configuration is that our service is now capable of
binding **any** port under 1024, so, for example, if there is some security
issue, a malicious party can open any of the restricted ports and then
serve whatever data they want there. This can be quite problematic, and the
solution for that problem will be covered in Part 2, where we will cover socket
passing and socket activation for our service.

With that we have achieved a basic level of isolation similar to what Docker
(or another container runtime) provides, but it does not require `overlayfs` or
anything more than what you already have on your machine. That means updates
done by your system package manager will be applied to all running services, so
you do not need to rebuild all your containers when a security patch is
issued for any of your dependencies.

Of course this only scratches the surface of what is possible with systemd
when it comes to hardening services.
More information can be found in the [RedHat
article][rh-systemd-hardening] and in the [`systemd-analyze security` command
output][systemd-analyze-security]. Possible features are:

- creating private networks for your services
- disallowing the creation of socket connections outside of a specified
  set of address families
- making only some paths readable
- hiding some paths from the process
- etc.

Covering just that topic is a little bit out of scope for this blog post, so
I encourage you to read the documentation of [`systemd.exec`][systemd.exec] and
the articles mentioned above for more details.

[crash]: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/crash_dumps
[rh-systemd-hardening]: https://www.redhat.com/sysadmin/mastering-systemd
[systemd-analyze-security]: https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20security%20%5BUNIT...%5D
[systemd.exec]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html

## Summary

This blog post is already quite lengthy, so I will split it into separate parts.
There will probably be 3 of them:

- [Part 1 - Basics, security, and FD passing (this one)](?1)
- Part 2 - Socket activation
- Part 3 - Logging

+9 -3

netlify.toml

+[build]
+command = "zola build"
+publish = "public/"
+
+[context.deploy-preview]
+command = "zola build --drafts"
+
 [[headers]]
 for = "/*"
 [headers.values]
+# Disable Google cohort tracking
 Permission-Policy = "interest-cohort=()"
+# Disallow showing the website in frames
 X-Frame-Options = "DENY"
 X-XSS-Protection = "1; mode=block"
-
-[context.deploy-preview]
-command = "zola build --drafts"

 [[redirects]]
 from = "/post"
