mount: add FSMOUNT_NAMESPACE

Add an FSMOUNT_NAMESPACE flag to fsmount() that creates a new mount
namespace with the newly created filesystem attached to a copy of the
real rootfs. With this flag, fsmount() returns a namespace file
descriptor instead of an O_PATH mount fd, similar to how
OPEN_TREE_NAMESPACE works for open_tree().

This allows creating a new filesystem and immediately placing it in a
new mount namespace in a single operation, which is useful for container
runtimes and other namespace-based isolation mechanisms.
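
As a rough userspace sketch of that single-operation flow (not part of
this patch), the helper below creates a tmpfs context and asks fsmount()
for a new mount namespace. The helper name tmpfs_in_new_mount_ns() is
hypothetical, the FSMOUNT_NAMESPACE value is copied from this patch for
headers that predate it, and the raw syscall() invocations assume kernel
headers that define SYS_fsopen, SYS_fsconfig and SYS_fsmount. On kernels
without this patch, fsmount() is expected to reject the flag with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>   /* FSOPEN_CLOEXEC, FSCONFIG_CMD_CREATE, FSMOUNT_CLOEXEC */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FSMOUNT_NAMESPACE
#define FSMOUNT_NAMESPACE 0x00000002 /* assumption: value taken from this patch */
#endif

/*
 * Hypothetical helper: create a tmpfs instance and have fsmount() place it
 * in a new mount namespace. Returns a namespace fd on success, or -1 with
 * errno set on failure.
 */
static int tmpfs_in_new_mount_ns(void)
{
	int fsfd, nsfd, saved;

	/* Open a filesystem context for tmpfs. */
	fsfd = (int)syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return -1;

	/* Instantiate the superblock from the (default) configuration. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		saved = errno;
		close(fsfd);
		errno = saved;
		return -1;
	}

	/* With FSMOUNT_NAMESPACE the result is a namespace fd, not a mount fd. */
	nsfd = (int)syscall(SYS_fsmount, fsfd,
			    FSMOUNT_CLOEXEC | FSMOUNT_NAMESPACE, 0);
	saved = errno;
	close(fsfd);
	errno = saved;
	return nsfd;
}
```

A caller would presumably enter the namespace afterwards with
setns(nsfd, CLONE_NEWNS); the sketch stops short of that so it does not
change the calling process.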

The rootfs mount is created before copying the real rootfs for the new
namespace, meaning that the mount namespace id for the mount of the root
of the namespace is bigger than that of the child mounted on top of it.
We've never explicitly guaranteed such an ordering and I doubt anyone
relies on it. Accepting that lets us avoid copying the mount again and
also avoids having to massage may_copy_tree() to grant an exception for
fsmount->mnt->mnt_ns being NULL.

Link: https://patch.msgid.link/20260122-work-fsmount-namespace-v1-3-5ef0a886e646@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

Christian Brauner 2 months ago 5e8969bd ad4a3599

+31 -7

2 changed files

+30 -7

fs/namespace.c

··· 3118 3118
 	}
 
 	/*
-	 * We don't emulate unshare()ing a mount namespace. We stick
-	 * to the restrictions of creating detached bind-mounts. It
-	 * has a lot saner and simpler semantics.
+	 * We don't emulate unshare()ing a mount namespace. We stick to
+	 * the restrictions of creating detached bind-mounts. It has a
+	 * lot saner and simpler semantics.
 	 */
-	mnt = __do_loopback(path, recurse, copy_flags);
+	mnt = real_mount(path->mnt);
+	if (!mnt->mnt_ns) {
+		/*
+		 * If we're moving into a new mount namespace via
+		 * fsmount() swap the mount ids so the nullfs mount id
+		 * is the lowest in the mount namespace avoiding another
+		 * useless copy. This is fine we're not attached to any
+		 * mount namespace so the mount ids are pure decoration
+		 * at that point.
+		 */
+		swap(mnt->mnt_id_unique, new_ns_root->mnt_id_unique);
+		swap(mnt->mnt_id, new_ns_root->mnt_id);
+		mntget(&mnt->mnt);
+	} else {
+		mnt = __do_loopback(path, recurse, copy_flags);
+	}
 	scoped_guard(mount_writer) {
 		if (IS_ERR(mnt)) {
 			emptied_ns = new_ns;
··· 4416 4401
 	unsigned int mnt_flags = 0;
 	long ret;
 
-	if (!may_mount())
+	if ((flags & ~(FSMOUNT_CLOEXEC | FSMOUNT_NAMESPACE)) != 0)
+		return -EINVAL;
+
+	if ((flags & FSMOUNT_NAMESPACE) &&
+	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
 		return -EPERM;
 
-	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
-		return -EINVAL;
+	if (!(flags & FSMOUNT_NAMESPACE) && !may_mount())
+		return -EPERM;
 
 	if (attr_flags & ~FSMOUNT_VALID_FLAGS)
 		return -EINVAL;
··· 4490 4471
 	 * don't want to have to handle any errors incurred.
 	 */
 	vfs_clean_context(fc);
+
+	if (flags & FSMOUNT_NAMESPACE)
+		return FD_ADD((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0,
+			      open_new_namespace(&new_path, 0));
 
 	ns = alloc_mnt_ns(current->nsproxy->mnt_ns->user_ns, true);
 	if (IS_ERR(ns))
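
The reordered checks at the top of fsmount() can be modelled in
userspace as follows. This is only a sketch of the decision order in the
hunk above: check_fsmount_flags() is a hypothetical name, and the two
booleans stand in for ns_capable(current_user_ns(), CAP_SYS_ADMIN) and
may_mount(), which are not reproduced here.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef FSMOUNT_CLOEXEC
#define FSMOUNT_CLOEXEC   0x00000001
#endif
#ifndef FSMOUNT_NAMESPACE
#define FSMOUNT_NAMESPACE 0x00000002 /* value taken from this patch */
#endif

/*
 * Hypothetical model of the flag and permission checks, in the order the
 * patched fsmount() performs them. Returns 0 if the call may proceed.
 */
static long check_fsmount_flags(unsigned int flags, bool has_cap_sys_admin,
				bool may_mount)
{
	/* Unknown flags are now rejected before any permission check. */
	if ((flags & ~(FSMOUNT_CLOEXEC | FSMOUNT_NAMESPACE)) != 0)
		return -EINVAL;

	/* The namespace variant requires CAP_SYS_ADMIN in the caller's userns. */
	if ((flags & FSMOUNT_NAMESPACE) && !has_cap_sys_admin)
		return -EPERM;

	/* The classic variant keeps the may_mount() requirement. */
	if (!(flags & FSMOUNT_NAMESPACE) && !may_mount)
		return -EPERM;

	return 0;
}
```

One visible consequence of the reordering is that a caller passing an
unknown flag now gets EINVAL even when it would also have failed the
permission check, whereas the old order reported EPERM first.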

include/uapi/linux/mount.h

··· 110 110
  * fsmount() flags.
  */
 #define FSMOUNT_CLOEXEC		0x00000001
+#define FSMOUNT_NAMESPACE	0x00000002 /* Create the mount in a new mount namespace */
 
 /*
  * Mount attributes.
