[*.md]
BasedOnStyles = proselint, write-good
write-good.Passive = NO
content/post/who-watches-watchmen-i.md
+++
title = "Who Watches Watchmen? - Part 1"
date = 2022-01-17T21:22:18+01:00
draft = true

description = """
A lot of applications use systems like Kubernetes for their deployment. In my
humble opinion this is often overkill, as a system that offers most of what such
tools provide is already present in your OS. In this article I will try to
present how to utilise the most popular system supervisor from Elixir
applications.
"""

[taxonomies]
tags = [
  "elixir",
  "programming",
  "systemd",
  "deployment"
]
+++

I gave a talk about this topic at CODE Beam V Americas, but I wasn't really
satisfied with it. In this post I will try to describe what my presentation was
meant to be about.

If you are wondering about the presentation, [the slides are on SpeakerDeck][slides].

[slides]: https://speakerdeck.com/hauleth/who-supervises-supervisors

## Abstract

Most operating systems are multi-process and multi-user. This has a lot of
positive aspects, like being able to do more than one thing at a time on our
devices, but it introduces a lot of complexity that in most cases is hidden
from users and developers. These things still need to be handled one way or
another. The most basic problems are:

- some processes need to be started before the user can interact with the OS
  in a meaningful (for them) way (for example mounting filesystems, logging,
  etc.)
- some processes require strict startup ordering, for example you may need
  logging to be started before starting the HTTP server
- the system operator somehow needs to know when a process is ready to do its
  work, which is often some time after the process starts
- the system operator should be able to check the process state when debugging
  is needed, most commonly via logs
- shutdown of processes should be handled in a way that allows other
  processes to shut down cleanly (for example an application that uses a DB
  should be down before the DB itself)

## Why do we need a system supervisor?

A system supervisor is a process started early in the OS boot that handles
starting and managing all other processes that will run on our system. It is
often the init process (the first process started by the OS, running with
PID 1), or it is the first (and sometimes only) process started by the init
process. Popular examples of such supervisors (often integrated with init
systems):

- SysV init, the "traditional" implementation that originates from UNIX System
  V (hence the name)
- BSD init, which with some variations is used in BSD-based OSes (NetBSD,
  FreeBSD); it shares some similarities with SysV init, and service
  descriptions are provided by shell scripts
- OpenRC, which also uses shell-based scripts for service descriptions, used by
  Linux distributions like Gentoo or Alpine
- `launchd`, used on Darwin (macOS, iPadOS, iOS, watchOS) systems, which uses
  XML-based `plists` for service descriptions
- `runit`, a small but quite capable init and supervisor, used for example
  by Void Linux
- Upstart, created by Canonical Ltd. as a replacement for the SysV-like init
  system in Ubuntu (no longer in use in Ubuntu), still used in some systems like
  ChromeOS or Synology NAS
- `systemd` (this is the name, not "SystemD"), created by a Red Hat
  employee, the (in)famous Lennart Poettering, and later adopted by almost all
  major Linux distributions, which spawned some heated discussion about it

In this article I will focus on systemd and its approach to "new-style system
daemons".

---

**DISCLAIMER**

Each of the solutions mentioned above has its strong and weak points. I do not
want to start another flame war about whether systemd is good or not. It has
some good in it, and it has some bad in it, but we can say that it "won" over
the most used distributions, and despite our love or hate towards it, we need to
learn how to live with that.

---

## Why `systemd`?

`systemd` became a thing because the SysV approach to ordering services' startup
was mildly irritating and non-parallelizable. In short, SysV starts processes
exactly in the lexicographical order of the files in a given directory. This
meant that even if your service didn't need the DB at all, but somehow ended up
further down the directory listing, you ended up waiting for the DB to start.
Additionally, SysV wasn't really monitoring services; it just assumed that when
a process forked itself to the background, it was "done" with the startup, and
we could continue. This is obviously not true in many cases. For example, if
your previous shutdown wasn't clean because of a power outage or another issue,
then your DB probably needs a bit of time to rebuild its state from the journal.
This causes even more slowdown for the processes further down the list. This is
highly undesirable in a modern, cloud-based environment, where you often start
machines on demand during autoscaling. When there is a spike in traffic that
needs autoscaling, the sooner a new machine is in a usable state, the sooner it
can take load from other machines.

Different tools take different approaches to solving that issue. `systemd`
takes an approach derived from `launchd`: do not do work that is not needed.
It achieves that by merging D-Bus into `systemd` itself and making all
services D-Bus daemons (which are started on request); additionally, it
provides a bunch of triggers for those daemons. We can trigger on the actions
of other services (obviously), but also on things like socket activity, path
creation/modification, mounts, connection or disconnection of a device, time
events, etc.

---

**DIGRESSION**

This is exactly the reason why `systemd` has its infamous "feature creep"; it
doesn't "digest" all services like Cron or `udev`. It is not that these are
"tightly" intertwined into `systemd`. You can still replace them with their
older counterparts; you will just lose all the features these bring with them.

---

Such a lazy approach sometimes requires changes to the service itself. For
example, to let the supervisor know that you are ready (not just started), you
need some way to communicate with the supervisor. In `systemd` you can do so via
the UNIX socket pointed to by the `NOTIFY_SOCKET` environment variable passed to
your application. With the same socket you can implement another useful feature:
a watchdog/heartbeat process. This means that if for any reason your process
becomes non-responsive (but refuses to die), then the supervisor will forcefully
bring the process down and restart it, assuming that the error was accidental.
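
As a rough illustration of that protocol, the sketch below sends a notification
by hand. The module name `NotifySketch` is hypothetical, and the code assumes
OTP's support for `:local` (Unix domain) sockets in `:gen_udp`; in practice the
`systemd` library used later in this post does all of this for you.

```elixir
defmodule NotifySketch do
  @moduledoc "Hypothetical helper for illustration, not part of any library."

  # Translate the value of NOTIFY_SOCKET into an address for :gen_udp.
  # A leading "@" denotes a Linux abstract socket, whose real name
  # starts with a NUL byte instead.
  def address(nil), do: :error
  def address("@" <> name), do: {:ok, {:local, <<0>> <> name}}
  def address("/" <> _ = path), do: {:ok, {:local, path}}
  def address(_other), do: :error

  # Send a single datagram such as "READY=1". When the service is not
  # run under systemd the variable is unset and nothing is sent.
  def notify(message, socket_path \\ System.get_env("NOTIFY_SOCKET")) do
    with {:ok, address} <- address(socket_path),
         {:ok, socket} <- :gen_udp.open(0, [:local]) do
      result = :gen_udp.send(socket, address, 0, message)
      :gen_udp.close(socket)
      result
    else
      :error -> :ignored
      {:error, _reason} = error -> error
    end
  end
end
```

Calling `NotifySketch.notify("READY=1")` outside of systemd returns `:ignored`,
which keeps the same release usable in development.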

Speaking of restarting, we can define the behaviour of the service after its
main process dies. It can be restarted regardless of the exit code, it can be
restarted only on abnormal exit, it can remain shut down, etc. Does this ring a
bell? This works similarly to OTP supervisors, but "one level above". If your
service uses the system supervisor right, you can make your application almost
completely self-healing (by restarts).
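
To make the comparison concrete, here is a small OTP-only sketch of the restart
types a supervisor knows. The mapping to `Restart=` values in the comments is a
rough analogy of mine, not an exact equivalence, and the `RestartDemo` module
is hypothetical.

```elixir
defmodule RestartDemo do
  # Rough correspondence (an analogy, not an exact equivalence):
  #   restart: :permanent  ~ Restart=always
  #   restart: :transient  ~ Restart=on-failure
  #   restart: :temporary  ~ Restart=no
  def child(id, restart) do
    %{
      id: id,
      # An Agent is just a convenient long-running process here.
      start: {Agent, :start_link, [fn -> :ok end, [name: id]]},
      restart: restart
    }
  end

  def start do
    children = [
      child(:always_back, :permanent),
      child(:stays_down, :temporary)
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```

After `RestartDemo.start()`, killing both processes brings `:always_back` back
under a new PID, while `:stays_down` remains stopped.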

## Basic setup

Now that we know a little about how and why `systemd` works the way it does, we
can go into the details of how to use it with services written in Elixir.

As a base we will implement a super simple Plug application:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

```elixir
# hello/router.ex
defmodule Hello.Router do
  use Plug.Router

  plug :match
  plug :dispatch

  get "/" do
    send_resp(conn, 200, "Hello World!")
  end
end
```

I will also assume that we are using a [Mix release][mix-release] named `hello`
that we later copy to `/opt/hello`.

[mix-release]: https://hexdocs.pm/mix/Mix.Tasks.Release.html

### systemd unit file

We have only one thing left: we need to define our [`hello.service`][systemd.service]:

```ini
[Unit]
Description=Hello World service

[Service]
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
```

Now you can create a file with that content at
`/usr/local/lib/systemd/system/hello.service` and then start it with:

```
# systemctl start hello.service
```

This is the simplest service imaginable, however from the start we have a few
issues there:

- It will run the service as the user running the supervisor, so if it is run
  using the global supervisor, it will run as `root`. You do not want to run
  anything as `root`.
- On error it will produce a (BEAM) crash dump, which may contain sensitive data.
- It can read (and, due to being run as `root`, write) everything in the system,
  like private data of other processes.

[systemd.service]: https://www.freedesktop.org/software/systemd/man/systemd.service.html#

## Service readiness

The Erlang VM isn't really the best tool out there when it comes to startup
times. In addition to that, our application may need some preparation steps
before it can be marked as "ready". This is a problem that I sometimes encounter
in Docker, where some containers do not really have any health check, and then I
need to have a loop with a check in the containers that depend on them. This
"workaround" is frustrating, error prone, and can cause nasty Heisenbugs when
the timing is wrong.

Two possible solutions for this problem are:

- A readiness probe - another program that is run after the main process is
  started and checks whether our application is ready to work.
- A notification system where our application uses some common protocol to
  inform the supervisor that it has finished setup and is ready for work.
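
For comparison, the first approach (essentially the loop I end up writing around
Docker containers) can be sketched as a small polling probe. `ReadinessProbe` is
a hypothetical module, and treating "the TCP port accepts connections" as
"ready" is an assumption that may be too weak for a real service.

```elixir
defmodule ReadinessProbe do
  # Try to connect to host:port until it succeeds or we run out of attempts.
  def wait(host, port, attempts \\ 20, delay_ms \\ 250)

  def wait(_host, _port, 0, _delay_ms), do: {:error, :timeout}

  def wait(host, port, attempts, delay_ms) do
    case :gen_tcp.connect(host, port, [:binary, active: false], 1_000) do
      {:ok, socket} ->
        # The port is open; we only wanted to know that, so close again.
        :gen_tcp.close(socket)
        :ok

      {:error, _reason} ->
        Process.sleep(delay_ms)
        wait(host, port, attempts - 1, delay_ms)
    end
  end
end
```

A dependent service would call something like
`ReadinessProbe.wait({127, 0, 0, 1}, 4000)` before doing its own work, which
shows the weakness of this approach: every dependant has to carry and tune such
a loop itself.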

systemd supports the second approach via [`sd_notify`][sd_notify]. The approach
there is simple: we have the `NOTIFY_SOCKET` environment variable that contains
the path to a Unix datagram socket, which we can use to send information about
the state of our application. This socket accepts a set of different messages,
but right now, for our purposes, we will focus on only a few of them:

- `READY=1` - marks our service as ready, that is, ready to do its work (for
  example accept incoming HTTP connections in our example). It needs to be sent
  within a given timespan after the start of the VM, otherwise the process will
  be killed and possibly restarted
- `STATUS=name` - sets the status of our application, which can be checked via
  `systemctl status hello.service`; this gives us better insight into the
  high-level state without manually traversing the logs
- `RELOADING=1` - marks that our application is reloading, which in general may
  mean a lot of things, but here it will be used to mark `:init.restart/0`-like
  behaviour (due to [erlang/otp#4698][] there is a wrapper for that function in
  the `systemd` library). The process then needs to send `READY=1` within a
  given timespan, or it will be marked as malfunctioning and will be forcefully
  killed and possibly restarted
- `STOPPING=1` - marks that our application began shutting down and will be
  closing soon. If the process does not exit within a given timespan, it will be
  forcefully killed
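
On the wire these are plain `KEY=value` assignments, and several of them can be
sent in one datagram separated by newlines. A minimal encoder, under the
hypothetical name `NotifyState`, might look like this:

```elixir
defmodule NotifyState do
  # Turn a keyword list such as [ready: 1, status: "warming up"] into
  # "READY=1\nSTATUS=warming up". Values must not contain newlines,
  # as a newline starts the next assignment.
  def encode(fields) do
    fields
    |> Enum.map(fn {key, value} ->
      String.upcase(Atom.to_string(key)) <> "=" <> to_string(value)
    end)
    |> Enum.join("\n")
  end
end
```

For example, `NotifyState.encode(ready: 1, status: "accepting connections")`
produces a payload that marks the service as ready and sets its status in a
single message.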

These messages give us enough power not only to mark the service as ready, but
also to provide additional information about the system state, so even an
operator who knows little about Erlang or our application runtime will be able
to understand what is going on.

The main thing is that systemd will hold back the activation of the dependants
of our service, and the `systemctl start` and `systemctl restart` commands will
wait, until our service declares that it is ready.

Using this feature is quite simple:

```ini
[Unit]
Description=Hello World service

[Service]
# Set `Type=` to `notify`
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min
```

And then in our supervision tree we need to add `:systemd.ready()` **after** the
last process needed for proper functioning of our application; in our simple
example that is after `Plug.Cowboy`:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(), # <-- it is a function call, as it returns a proper
                        #     child spec
      {Plug.Cowboy.Drainer, refs: :all}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

Now restarting our service will not finish immediately, but will wait until our
service declares that it is ready.

```shell
# systemctl restart hello.service
```

About `STOPPING=1`: the magic thing is that the `systemd` library takes care of
it for you. As soon as the system is scheduled to shut down, this message will
be sent automatically, and the operator will be notified about this fact.

We can also provide more information about the state of our application. As you
may have already noticed, we have [`Plug.Cowboy.Drainer`][] there. It is a
process that delays the shutdown of our application while there are still open
connections. This can take some time, so it would be handy if the operator could
see that draining is in progress. We can easily achieve that by changing our
supervision tree again:

```elixir
# hello/application.ex
defmodule Hello.Application do
  use Application

  def start(_type, _opts) do
    children = [
      {Plug.Cowboy, [scheme: :http, plug: Hello.Router] ++ cowboy_opts()},
      :systemd.ready(),
      :systemd.set_status(down: [status: "drained"]),
      {Plug.Cowboy.Drainer, refs: :all, shutdown: 10_000},
      :systemd.set_status(down: [status: "draining"])
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp cowboy_opts do
    [
      port: String.to_integer(System.get_env("PORT", "4000"))
    ]
  end
end
```

Now we can shut down our application with:

```shell
# systemctl stop hello.service
```

If we have some connections open to our service (you can simulate that with
`wrk`) and run `systemctl status hello.service` in a separate terminal (the
previous one will be blocked until our service shuts down), we will see
something like:

```
● hello.service - Example Plug application
     Loaded: loaded (/usr/local/lib/systemd/system/hello.service; static; vendor preset: enabled)
     Active: deactivating (stop-sigterm) since Sat 2022-01-15 17:46:30 CET; 1s ago
   Main PID: 1327 (beam.smp)
     Status: "draining"
      Tasks: 19 (limit: 1136)
     Memory: 106.5M
```
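
You can see that the `Status` is set to `"draining"`. As soon as all
connections are drained it will change to `"drained"`, and then the
application will shut down and the service will be marked as `inactive`. The
order of the two `set_status` children matters: a supervisor stops its children
in reverse start order, so `"draining"` is set before the drainer runs and
`"drained"` after it finishes.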
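
The draining example relies on the order in which a supervisor stops its
children. The following OTP-only sketch makes that order visible; the
`ShutdownOrder` module is hypothetical and simply reports to an observer
process when it is terminated.

```elixir
defmodule ShutdownOrder do
  use GenServer

  # Each marker reports its name to `observer` when it is terminated.
  def child_spec({name, observer}) do
    %{
      id: name,
      start: {GenServer, :start_link, [__MODULE__, {name, observer}]}
    }
  end

  @impl true
  def init(state) do
    # Trapping exits makes terminate/2 run on supervisor shutdown.
    Process.flag(:trap_exit, true)
    {:ok, state}
  end

  @impl true
  def terminate(_reason, {name, observer}) do
    send(observer, {:stopped, name})
  end
end
```

Starting `{ShutdownOrder, {:first, self()}}`, `{ShutdownOrder, {:second, self()}}`
and `{ShutdownOrder, {:third, self()}}` under one supervisor and then calling
`Supervisor.stop/1` delivers the `:stopped` message of the last child first.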

[sd_notify]: https://www.freedesktop.org/software/systemd/man/sd_notify.html
[erlang/otp#4698]: https://github.com/erlang/otp/issues/4698
[`Plug.Cowboy.Drainer`]: https://hexdocs.pm/plug_cowboy/2.5.2/Plug.Cowboy.Drainer.html

## Watchdog

The watchdog allows us to monitor our application for responsiveness (as
mentioned above). It is a simple feature that requires our application to ping
systemd within a specified interval, otherwise the application will be forcibly
shut down as malfunctioning. Fortunately for us, the `systemd` library that
provides our integration has that feature out of the box, so all we need to do
to achieve the expected result is set the `WatchdogSec=` option in our
`hello.service` file:

```ini
[Unit]
Description=Hello World service

[Service]
Environment=PORT=80
Type=notify
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min
```

This configuration says that if the VM does not send a health message within
each 1 minute interval, the service will be marked as malfunctioning. From the
application side we can manage the state of the watchdog in several ways:

- By setting the `systemd.watchdog_check` configuration option we can configure
  the function that will be called on each check. If that function returns
  `true`, the application is healthy and systemd should be notified with a
  ping; if it returns `false` or fails, the ping will be omitted.
- By manually sending a trigger message when a problem is detected, via
  `:systemd.watchdog(:trigger)`; it will immediately mark the service as
  malfunctioning and trigger the action defined in the service unit file (by
  default it will restart the application)
- By disabling the built-in watchdog process via `:systemd.watchdog(:disable)`
  and then manually sending `:systemd.watchdog(:ping)` within the expected
  intervals (discouraged)
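
The library handles the timing for you. Purely to illustrate what a manual
heartbeat (the discouraged third option) would have to compute, here is a
sketch based on the `WATCHDOG_USEC` environment variable that systemd passes to
a service with `WatchdogSec=` set. The `WatchdogInterval` module is
hypothetical, and pinging at half of the configured interval is the usual
recommendation rather than a requirement.

```elixir
defmodule WatchdogInterval do
  # systemd exports the WatchdogSec= value to the service as
  # WATCHDOG_USEC (microseconds). Pinging at half of that interval
  # leaves some slack for scheduling delays.
  def ping_every_ms(value \\ System.get_env("WATCHDOG_USEC")) do
    case value && Integer.parse(value) do
      {usec, ""} when usec > 0 -> {:ok, div(usec, 2_000)}
      _ -> :disabled
    end
  end
end
```

With `WatchdogSec=1min` the variable holds `60000000`, so the function returns
`{:ok, 30_000}`, that is, one ping every 30 seconds.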

## Security

We should start by changing the default user and group assigned to our
process. We can do so in 2 different ways:

1. Use an existing user and group by defining the `User=` and `Group=`
   directives in our service definition; or
2. Create an ephemeral user on demand before our service starts, by using the
   directive `DynamicUser=true` in the service definition.

I prefer the second option, as it additionally enables a lot of other security
related options, like creating a private `/tmp` directory, making the system
read-only, etc. It also has some disadvantages, like removing all of the
service's data on shutdown; however, there are options to keep some data between
launches.

In addition to that we can add `PrivateDevices=true`, which hides all
physical devices from `/dev`, leaving only pseudo devices like `/dev/null` or
`/dev/urandom` (so you will be able to use the `:crypto` and `:ssl` modules
without problems).

The next thing we can do is [disable crash dumps generated by BEAM][crash].
While not strictly needed in this case, it is worth remembering that it isn't
hard to achieve: just use `Environment=ERL_CRASH_DUMP_SECONDS=0`.

Our new, more secure, `hello.service` will look like:

```ini
[Unit]
Description=Hello World service
Requires=network.target

[Service]
Type=notify
Environment=PORT=80
ExecStart=/opt/hello/bin/hello start
WatchdogSec=1min

# We need to add the capability to be able to bind to port 80
# (the ambient set grants it to the non-root dynamic user)
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE

# Hardening
DynamicUser=true
PrivateDevices=true
Environment=ERL_CRASH_DUMP_SECONDS=0
```

The problem with that configuration is that our service is now capable of
binding to **any** port under 1024, so if there is some security issue, a
malicious party can open any of the restricted ports and serve whatever data
they want there. This can be quite problematic, and the solution for that
problem will be covered in Part 2, where we will cover socket passing and socket
activation for our service.

With that we have achieved a basic level of isolation, similar to what Docker
(or another container runtime) provides, but it does not require `overlayfs` or
anything more than what you already have on your machine. That means updates
done by your system package manager will be applied to all running services, so
you do not need to rebuild all your containers when a security patch is issued
for any of your dependencies.

Of course this only scratches the surface of what is possible with systemd when
it comes to hardening services. More information can be found in the [Red Hat
article][rh-systemd-hardening] and in the [`systemd-analyze security` command
output][systemd-analyze-security]. Possible features are:

- creating private networks for your services
- disallowing creation of socket connections that are outside of the specified
  set of families
- making only some paths readable
- hiding some paths from the process
- etc.

Covering that topic alone is a little bit out of scope for this blog post, so
I encourage you to read the documentation of [`systemd.exec`][systemd.exec] and
the articles mentioned above for more details.

[crash]: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/crash_dumps
[rh-systemd-hardening]: https://www.redhat.com/sysadmin/mastering-systemd
[systemd-analyze-security]: https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20security%20%5BUNIT...%5D
[systemd.exec]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html

## Summary

This blog post is already quite lengthy, so I will split it into separate parts.
There will probably be 3 of them:

- [Part 1 - Basics, security, and FD passing (this one)](?1)
- Part 2 - Socket activation
- Part 3 - Logging