Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC

When MFD_NOEXEC_SEAL was introduced, there was one big mistake: it didn't
have proper documentation. This led to a lot of confusion, especially
about whether a memfd created with the MFD_NOEXEC_SEAL flag is
sealable. Before MFD_NOEXEC_SEAL, a memfd had to be created with
MFD_ALLOW_SEALING explicitly set to be sealable, so it's a fair question.

As one might have noticed, unlike other flags in memfd_create,
MFD_NOEXEC_SEAL is actually a combination of multiple flags. The idea is
to make it easier to use memfd in the most common way, which is NOEXEC +
F_SEAL_EXEC + MFD_ALLOW_SEALING. This works with the sysctl
vm.memfd_noexec to help existing applications move to a more secure way
of using memfd.
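
As a rough illustration of the combination described above, the sketch
below creates a memfd with MFD_NOEXEC_SEAL and reads its seals back with
F_GET_SEALS. The helper names create_noexec_memfd() and
memfd_is_noexec_and_sealable() are hypothetical, and the fallback #defines
assume the values from the kernel UAPI headers, for systems whose libc
headers predate the flag.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values as in the kernel UAPI headers. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef F_SEAL_EXEC
#define F_SEAL_EXEC 0x0020
#endif

/*
 * Hypothetical helper: ask for a non-executable memfd.
 * Returns the fd, or -1 with errno set (for example EINVAL on a kernel
 * that does not know MFD_NOEXEC_SEAL).
 */
static int create_noexec_memfd(const char *name)
{
	return memfd_create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);
}

/*
 * Hypothetical helper: returns 1 if F_SEAL_EXEC is set and more seals can
 * still be added (F_SEAL_SEAL is not set), 0 otherwise, -1 on error.
 */
static int memfd_is_noexec_and_sealable(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -1;
	return (seals & F_SEAL_EXEC) && !(seals & F_SEAL_SEAL);
}
```

On a kernel that supports the flag, memfd_is_noexec_and_sealable() is
expected to return 1 right after creation, even though MFD_ALLOW_SEALING
was never passed.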

Proposals have been made to make MFD_NOEXEC_SEAL non-sealable unless
MFD_ALLOW_SEALING is set, to be consistent with other flags [1]. Those
proposals are based on the viewpoint that each flag is an atomic unit,
which is a reasonable assumption. However, MFD_NOEXEC_SEAL was designed
with the intent of promoting the most secure method of using memfd, and
therefore combines multiple functionalities into one bit.

Furthermore, MFD_NOEXEC_SEAL was added more than one year ago, and
multiple applications and distributions have backported and utilized
it. Altering the ABI now presents a degree of risk and may lead to
disruption.

MFD_NOEXEC_SEAL is a new flag, and applications must change their code to
use it. There is no backward compatibility problem.

When sysctl vm.memfd_noexec is 1 or 2, applications that set neither
MFD_NOEXEC_SEAL nor MFD_EXEC will get an MFD_NOEXEC_SEAL memfd. Old
applications might break; that is by design, and on such a system
vm.memfd_noexec = 0 should be used. This is also not a backward
compatibility problem.
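
The defaulting rule above can be modeled in a few lines of C. This is
only a userspace sketch of the described behavior, not the kernel's code:
resolve_memfd_flags() is a hypothetical name, the flag values are fallback
copies of the UAPI definitions, and refusing MFD_EXEC at level 2 is an
assumption taken from the sysctl description in the documentation this
patch adds.

```c
/* Fallbacks; values as in the kernel UAPI header <linux/memfd.h>. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/*
 * Hypothetical model of how the sysctl level (0, 1 or 2) affects the
 * flags passed to memfd_create(). Returns 0 and stores the effective
 * flags, or -1 if the request would be rejected. Passing both flags at
 * once is outside the scope of this sketch.
 */
static int resolve_memfd_flags(int sysctl_level, unsigned int flags,
			       unsigned int *effective)
{
	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
		/* The caller chose neither flag: the sysctl picks the default. */
		flags |= (sysctl_level >= 1) ? MFD_NOEXEC_SEAL : MFD_EXEC;
	}

	/* Level 2: a request that ends up without MFD_NOEXEC_SEAL is refused. */
	if (sysctl_level >= 2 && !(flags & MFD_NOEXEC_SEAL))
		return -1;

	*effective = flags;
	return 0;
}
```

In this model an old application that passes no flag gets MFD_EXEC at
level 0 and MFD_NOEXEC_SEAL at levels 1 and 2, while an explicit MFD_EXEC
is only refused at level 2.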

I propose to include this documentation patch to assist in clarifying the
semantics of MFD_NOEXEC_SEAL, thereby preventing any potential future
confusion.

Finally, I would like to express my gratitude to David Rheinsberg and
Barnabás Pőcze for initiating the discussion on the topic of sealability.

[1]
https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/

[jeffxu@chromium.org: updates per Randy]
Link: https://lkml.kernel.org/r/20240611034903.3456796-2-jeffxu@chromium.org
[jeffxu@chromium.org: v3]
Link: https://lkml.kernel.org/r/20240611231409.3899809-2-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240607203543.2151433-2-jeffxu@google.com
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Barnabás Pőcze <pobrn@protonmail.com>
Cc: Daniel Verkamp <dverkamp@chromium.org>
Cc: David Rheinsberg <david@readahead.eu>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Jeff Xu and committed by Andrew Morton
653c5c75 3afb76a6

+87
+1
Documentation/userspace-api/index.rst
···
    seccomp_filter
    landlock
    lsm
+   mfd_noexec
    spec_ctrl
    tee
+86
Documentation/userspace-api/mfd_noexec.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Introduction of non-executable mfd
+==================================
+:Author:
+    Daniel Verkamp <dverkamp@chromium.org>
+    Jeff Xu <jeffxu@chromium.org>
+
+:Contributor:
+    Aleksa Sarai <cyphar@cyphar.com>
+
+Since Linux introduced the memfd feature, memfds have always had their
+execute bit set, and the memfd_create() syscall doesn't allow setting
+it differently.
+
+However, in a secure-by-default system, such as ChromeOS, (where all
+executables should come from the rootfs, which is protected by verified
+boot), this executable nature of memfd opens a door for NoExec bypass
+and enables “confused deputy attack”. E.g., in VRP bug [1]: cros_vm
+process created a memfd to share the content with an external process,
+however the memfd is overwritten and used for executing arbitrary code
+and root escalation. [2] lists more VRP of this kind.
+
+On the other hand, executable memfd has its legit use: runc uses memfd’s
+seal and executable feature to copy the contents of the binary then
+execute them. For such a system, we need a solution to differentiate runc's
+use of executable memfds and an attacker's [3].
+
+To address those above:
+ - Let memfd_create() set X bit at creation time.
+ - Let memfd be sealed for modifying X bit when NX is set.
+ - Add a new pid namespace sysctl: vm.memfd_noexec to help applications in
+   migrating and enforcing non-executable MFD.
+
+User API
+========
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_NOEXEC_SEAL``
+    When MFD_NOEXEC_SEAL bit is set in the ``flags``, memfd is created
+    with NX. F_SEAL_EXEC is set and the memfd can't be modified to
+    add X later. MFD_ALLOW_SEALING is also implied.
+    This is the most common case for the application to use memfd.
+
+``MFD_EXEC``
+    When MFD_EXEC bit is set in the ``flags``, memfd is created with X.
+
+Note:
+    ``MFD_NOEXEC_SEAL`` implies ``MFD_ALLOW_SEALING``. In case that
+    an app doesn't want sealing, it can add F_SEAL_SEAL after creation.
+
+
+Sysctl:
+========
+``pid namespaced sysctl vm.memfd_noexec``
+
+The new pid namespaced sysctl vm.memfd_noexec has 3 values:
+
+ - 0: MEMFD_NOEXEC_SCOPE_EXEC
+        memfd_create() with neither MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+        MFD_EXEC was set.
+
+ - 1: MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
+        memfd_create() with neither MFD_EXEC nor MFD_NOEXEC_SEAL acts like
+        MFD_NOEXEC_SEAL was set.
+
+ - 2: MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
+        memfd_create() without MFD_NOEXEC_SEAL will be rejected.
+
+The sysctl allows finer control of memfd_create for old software that
+doesn't set the executable bit; for example, a container with
+vm.memfd_noexec=1 means the old software will create non-executable memfd
+by default while new software can create executable memfd by setting
+MFD_EXEC.
+
+The value of vm.memfd_noexec is passed to child namespace at creation
+time. In addition, the setting is hierarchical, i.e. during memfd_create,
+we will search from current ns to root ns and use the most restrictive
+setting.
+
+[1] https://crbug.com/1305267
+
+[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1
+
+[3] https://lwn.net/Articles/781013/
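
The hierarchical lookup described at the end of the new documentation
(search from the current namespace to the root and use the most
restrictive setting) can be sketched in userspace C as follows. struct
demo_ns and effective_memfd_noexec() are hypothetical stand-ins for
illustration; they are not the kernel's data structures or functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace: a sysctl value and a parent. */
struct demo_ns {
	int memfd_noexec;             /* 0, 1 or 2, as listed in the documentation */
	const struct demo_ns *parent; /* NULL for the root namespace */
};

/*
 * Walk from ns up to the root and return the largest value seen, which
 * is the most restrictive setting on the path.
 */
static int effective_memfd_noexec(const struct demo_ns *ns)
{
	int level = 0;

	for (; ns != NULL; ns = ns->parent) {
		if (ns->memfd_noexec > level)
			level = ns->memfd_noexec;
	}
	return level;
}
```

In this model, a child namespace set to 0 under a root set to 1 still
resolves to 1, so a value stored in a child cannot loosen what an
ancestor enforces.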