Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge patch series "procfs: make reference pidns more user-visible"

Aleksa Sarai <cyphar@cyphar.com> says:

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:

* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).

* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):

fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
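
Putting the fsconfig(2) variant together with the rest of the new mount API, a minimal sketch could look like the following. It assumes a kernel carrying this series and headers that define the SYS_fsopen family of syscall numbers. The helper name open_procfs_in_pidns() is hypothetical, and raw syscall(2) is used only to avoid depending on a particular libc's wrappers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the series): returns a detached procfs
 * mount fd whose pidns is the one referred to by nsfd, or -1 with errno set.
 */
int open_procfs_in_pidns(int nsfd)
{
	int fsfd, mountfd, saved;

	fsfd = syscall(SYS_fsopen, "proc", FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return -1;

	/* The new "pidns" parameter, passed as a file descriptor. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd) < 0)
		goto err;
	/* Instantiate the superblock with the configuration so far. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
		goto err;

	mountfd = syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
	if (mountfd < 0)
		goto err;

	close(fsfd);
	return mountfd;

err:
	/* Preserve the errno of the failing call across close(). */
	saved = errno;
	close(fsfd);
	errno = saved;
	return -1;
}
```

The returned fd can be used directly with faccessat(2) or openat(2), as the selftests in this series do, or attached to the filesystem tree with move_mount(2).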

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
use cases.
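
The ancestry half of that rule can be illustrated with a small userspace model. This is only a sketch: struct toy_pidns and toy_pidns_is_ancestor() are hypothetical stand-ins that mirror the shape of the pidns_is_ancestor() helper added in this series, and the CAP_SYS_ADMIN check is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct pid_namespace (illustration only). */
struct toy_pidns {
	unsigned int level;		/* depth below the initial pidns */
	struct toy_pidns *parent;	/* NULL for the initial pidns */
};

/* Walk up from child until we reach ancestor's level, then compare. */
bool toy_pidns_is_ancestor(const struct toy_pidns *child,
			   const struct toy_pidns *ancestor)
{
	const struct toy_pidns *ns;

	/* A shallower namespace can never be a descendant. */
	if (child->level < ancestor->level)
		return false;
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}
```

Note that under this rule a namespace counts as its own ancestor, so pidns= may point at the caller's own pid namespace, while a sibling namespace at the same level is rejected.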

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.

Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.

* patches from https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com:
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper

Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>

+336 -14
+8
Documentation/filesystems/proc.rst
···
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
 	subset=		Show only the specified subset of procfs.
+	pidns=		Specify a the namespace used by this procfs.
 	=========	========================================================
 
 hidepid=off or hidepid=0 means classic mode - everybody may access all
···
 
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
+
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
 
 Chapter 5: Filesystem behavior
 ==============================
+92 -6
fs/proc/root.c
···
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+	Opt_pidns,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
-	fsparam_u32("gid",	Opt_gid),
+	fsparam_u32("gid",		Opt_gid),
 	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
+	fsparam_file_or_string("pidns",	Opt_pidns),
 	{}
 };
 
···
 	return 0;
 }
 
+#ifdef CONFIG_PID_NS
+static int proc_parse_pidns_param(struct fs_context *fc,
+				  struct fs_parameter *param,
+				  struct fs_parse_result *result)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *target, *active = task_active_pid_ns(current);
+	struct ns_common *ns;
+	struct file *ns_filp __free(fput) = NULL;
+
+	switch (param->type) {
+	case fs_value_is_file:
+		/* came through fsconfig, steal the file reference */
+		ns_filp = no_free_ptr(param->file);
+		break;
+	case fs_value_is_string:
+		ns_filp = filp_open(param->string, O_RDONLY, 0);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		break;
+	}
+	if (!ns_filp)
+		ns_filp = ERR_PTR(-EBADF);
+	if (IS_ERR(ns_filp)) {
+		errorfc(fc, "could not get file from pidns argument");
+		return PTR_ERR(ns_filp);
+	}
+
+	if (!proc_ns_file(ns_filp))
+		return invalfc(fc, "pidns argument is not an nsfs file");
+	ns = get_proc_ns(file_inode(ns_filp));
+	if (ns->ops->type != CLONE_NEWPID)
+		return invalfc(fc, "pidns argument is not a pidns file");
+	target = container_of(ns, struct pid_namespace, ns);
+
+	/*
+	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
+	 * permission model should be the same as pidns_install().
+	 */
+	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
+		errorfc(fc, "insufficient permissions to set pidns");
+		return -EPERM;
+	}
+	if (!pidns_is_ancestor(target, active))
+		return invalfc(fc, "cannot set pidns to non-descendant pidns");
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(target);
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	return 0;
+}
+#endif /* CONFIG_PID_NS */
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
 
 	opt = fs_parse(fc, proc_fs_parameters, param, &result);
 	if (opt < 0)
···
 		break;
 
 	case Opt_hidepid:
-		if (proc_parse_hidepid_param(fc, param))
-			return -EINVAL;
+		err = proc_parse_hidepid_param(fc, param);
+		if (err)
+			return err;
 		break;
 
 	case Opt_subset:
-		if (proc_parse_subset_param(fc, param->string) < 0)
-			return -EINVAL;
+		err = proc_parse_subset_param(fc, param->string);
+		if (err)
+			return err;
 		break;
+
+	case Opt_pidns:
+#ifdef CONFIG_PID_NS
+		/*
+		 * We would have to RCU-protect every proc_pid_ns() or
+		 * proc_sb_info() access if we allowed this to be reconfigured
+		 * for an existing procfs instance. Luckily, procfs instances
+		 * are cheap to create, and mount-beneath would let you
+		 * atomically replace an instance even with overmounts.
+		 */
+		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
+			errorfc(fc, "cannot reconfigure pidns for existing procfs");
+			return -EBUSY;
+		}
+		err = proc_parse_pidns_param(fc, param, &result);
+		if (err)
+			return err;
+		break;
+#else
+		errorfc(fc, "pidns mount flag not supported on this system");
+		return -EOPNOTSUPP;
+#endif
 
 	default:
 		return -EINVAL;
···
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_pidns) &&
+	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
+		put_pid_ns(fs_info->pid_ns);
+		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
+9
include/linux/pid_namespace.h
···
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
 
+extern bool pidns_is_ancestor(struct pid_namespace *child,
+			      struct pid_namespace *ancestor);
+
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
 
···
 static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
+}
+
+static inline bool pidns_is_ancestor(struct pid_namespace *child,
+				     struct pid_namespace *ancestor)
+{
+	return false;
 }
 #endif /* CONFIG_PID_NS */
 
+14 -8
kernel/pid_namespace.c
···
 	put_pid_ns(to_pid_ns(ns));
 }
 
+bool pidns_is_ancestor(struct pid_namespace *child,
+		       struct pid_namespace *ancestor)
+{
+	struct pid_namespace *ns;
+
+	if (child->level < ancestor->level)
+		return false;
+	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
+		;
+	return ns == ancestor;
+}
+
 static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
 	struct nsproxy *nsproxy = nsset->nsproxy;
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+	struct pid_namespace *new = to_pid_ns(ns);
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
···
 	 * this maintains the property that processes and their
 	 * children can not escape their current pid namespace.
 	 */
-	if (new->level < active->level)
-		return -EINVAL;
-
-	ancestor = new;
-	while (ancestor->level > active->level)
-		ancestor = ancestor->parent;
-	if (ancestor != active)
+	if (!pidns_is_ancestor(new, active))
 		return -EINVAL;
 
 	put_pid_ns(nsproxy->pid_ns_for_children);
+1
tools/testing/selftests/proc/.gitignore
···
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
+1
tools/testing/selftests/proc/Makefile
···
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
+211
tools/testing/selftests/proc/proc-pidns.c
···
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define ASSERT_ERRNO(expected, _t, seen)				\
+	__EXPECT(expected, #expected,					\
+		({__typeof__(seen) _tmp_seen = (seen);			\
+		  _tmp_seen >= 0 ? _tmp_seen : -errno; }), #seen, _t, 1)
+
+#define ASSERT_ERRNO_EQ(expected, seen) \
+	ASSERT_ERRNO(expected, ==, seen)
+
+#define ASSERT_SUCCESS(seen) \
+	ASSERT_ERRNO(0, <=, seen)
+
+static int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0)
+		return -1;
+	return close(fd);
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_mntns);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(mkdir("/tmp/dummy", 0755));
+	ASSERT_SUCCESS(touch("/tmp/dummy/pidns"));
+	ASSERT_SUCCESS(mkdir("/tmp/dummy/proc", 0755));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_pidns);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_SUCCESS(pid);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy/pidns", NULL, MS_BIND, NULL));
+		ASSERT_SUCCESS(mount("proc", "/tmp/dummy/proc", "proc", 0, NULL));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy/pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->dummy_pidns);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_ERRNO_EQ(-EBUSY, mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy/pidns"));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_HARNESS_MAIN