Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge patch series "fs: add immutable rootfs"

Christian Brauner <brauner@kernel.org> says:

Currently pivot_root() doesn't work on the real rootfs because it
cannot be unmounted. Userspace has to do a recursive removal of the
initramfs contents manually before continuing the boot.

Really all we want from the real rootfs is to serve as the parent mount
for anything that is actually useful such as the tmpfs or ramfs for
initramfs unpacking or the rootfs itself. There's no need for the real
rootfs to actually be anything meaningful or useful. Add an immutable
rootfs called "nullfs" that can be selected via the "nullfs_rootfs"
kernel command line option.
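
A userspace component may want to know whether the option was given at boot. The following is a minimal sketch, assuming the option appears as a bare whitespace-delimited token (the setup handler in this series rejects any trailing characters such as "=1"). The helper name cmdline_has_flag is hypothetical and not part of the series; obtaining the string, for example from /proc/cmdline, is left to the caller. It does not try to reproduce every detail of the kernel's own command-line parsing, such as quoting.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch series): return true if
 * `flag` occurs in `cmdline` as a complete whitespace-delimited token.
 * "nullfs_rootfs=1" or "xnullfs_rootfs" therefore do not match.
 */
bool cmdline_has_flag(const char *cmdline, const char *flag)
{
	size_t flen = strlen(flag);
	const char *p = cmdline;

	if (flen == 0)
		return false;

	while (*p) {
		size_t tlen;

		/* Skip separators, then measure the next token. */
		p += strspn(p, " \t\n");
		tlen = strcspn(p, " \t\n");
		if (tlen == flen && strncmp(p, flag, flen) == 0)
			return true;
		p += tlen;
	}
	return false;
}
```

A caller would read /proc/cmdline into a buffer and pass it in together with "nullfs_rootfs".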

The kernel will mount a tmpfs/ramfs on top of it, unpack the initramfs
and fire up userspace which mounts the rootfs and can then just do:

    chdir(rootfs);
    pivot_root(".", ".");
    umount2(".", MNT_DETACH);

and be done with it. (Ofc, userspace can also choose to retain the
initramfs contents by using something like pivot_root(".", "/initramfs")
without unmounting it.)

Technically this also means that the rootfs mount in unprivileged
namespaces doesn't need to become MNT_LOCKED anymore as it's guaranteed
that the immutable rootfs remains permanently empty so there cannot be
anything revealed by unmounting the covering mount.

In the future this will also allow us to create completely empty mount
namespaces without the risk of leaking anything.

systemd already handles this all correctly as it tries to pivot_root()
first and falls back to MS_MOVE only when that fails.
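
The try-then-fall-back pattern can be sketched as below. This is not systemd's code; it is a hedged illustration whose fallback branch mirrors the MS_MOVE plus chroot sequence that the kernel's own init/do_mounts.c keeps for the non-nullfs case. The function and enum names are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Which path was taken; names are illustrative only. */
enum switch_method {
	SWITCH_FAILED = -1,
	SWITCH_PIVOT = 0,	/* pivot_root() worked */
	SWITCH_MOVE = 1,	/* fell back to MS_MOVE + chroot */
};

int switch_root_with_fallback(const char *new_root)
{
	if (chdir(new_root) < 0)
		return SWITCH_FAILED;

	/* Preferred path: pivot, then lazily detach the old root. */
	if (syscall(SYS_pivot_root, ".", ".") == 0) {
		if (umount2(".", MNT_DETACH) < 0)
			return SWITCH_FAILED;
		return SWITCH_PIVOT;
	}

	/*
	 * Fallback for a root that cannot be pivoted away from:
	 * move the new root over "/" and chroot into it. The old
	 * rootfs stays mounted underneath.
	 */
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return SWITCH_FAILED;
	if (chroot(".") < 0)
		return SWITCH_FAILED;
	if (chdir("/") < 0)
		return SWITCH_FAILED;

	return SWITCH_MOVE;
}
```

On the fallback path the initramfs contents are not freed, so a caller that cares about memory would still have to delete them before moving the mount, as described in the traditional procedure.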

This goes back to various discussions in previous years and an LPC 2024
presentation about this very topic.

* patches from https://patch.msgid.link/20260112-work-immutable-rootfs-v2-0-88dd1c34a204@kernel.org:
  docs: mention nullfs
  fs: add immutable rootfs
  fs: add init_pivot_root()
  fs: ensure that internal tmpfs mount gets mount id zero

Link: https://patch.msgid.link/20260112-work-immutable-rootfs-v2-0-88dd1c34a204@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>

+254 -67
+23 -9
Documentation/filesystems/ramfs-rootfs-initramfs.rst
···
 ---------------
 
 Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is
-always present in 2.6 systems. You can't unmount rootfs for approximately the
-same reason you can't kill the init process; rather than having special code
-to check for and handle an empty list, it's smaller and simpler for the kernel
-to just make sure certain lists can't become empty.
+always present in 2.6 systems. Traditionally, you can't unmount rootfs for
+approximately the same reason you can't kill the init process; rather than
+having special code to check for and handle an empty list, it's smaller and
+simpler for the kernel to just make sure certain lists can't become empty.
+
+However, if the kernel is booted with "nullfs_rootfs", an immutable empty
+filesystem called nullfs is used as the true root, with the mutable rootfs
+(tmpfs/ramfs) mounted on top of it. This allows pivot_root() and unmounting
+of the initramfs to work normally.
 
 Most systems just mount another filesystem over rootfs and ignore it. The
 amount of space an empty instance of ramfs takes up is tiny.
···
     program. See the switch_root utility, below.)
 
   - When switching another root device, initrd would pivot_root and then
-    umount the ramdisk. But initramfs is rootfs: you can neither pivot_root
-    rootfs, nor unmount it. Instead delete everything out of rootfs to
-    free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs
-    with the new root (cd /newmount; mount --move . /; chroot .), attach
-    stdin/stdout/stderr to the new /dev/console, and exec the new init.
+    umount the ramdisk. Traditionally, initramfs is rootfs: you can neither
+    pivot_root rootfs, nor unmount it. Instead delete everything out of
+    rootfs to free up the space (find -xdev / -exec rm '{}' ';'), overmount
+    rootfs with the new root (cd /newmount; mount --move . /; chroot .),
+    attach stdin/stdout/stderr to the new /dev/console, and exec the new init.
 
     Since this is a remarkably persnickety process (and involves deleting
     commands before you can run them), the klibc package introduced a helper
     program (utils/run_init.c) to do all this for you. Most other packages
     (such as busybox) have named this command "switch_root".
+
+    However, if the kernel is booted with "nullfs_rootfs", pivot_root() works
+    normally from the initramfs. Userspace can simply do::
+
+        chdir(new_root);
+        pivot_root(".", ".");
+        umount2(".", MNT_DETACH);
+
+    This is the preferred method when nullfs_rootfs is enabled.
 
 Populating initramfs:
 ---------------------
+1 -1
fs/Makefile
···
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_dirent.o fs_context.o fs_parser.o fsopen.o init.o \
 		kernel_read_file.o mnt_idmapping.o remap_range.o pidfs.o \
-		file_attr.o
+		file_attr.o nullfs.o
 
 obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
 obj-$(CONFIG_PROC_FS) += proc_namespace.o
+17
fs/init.c
···
 #include <linux/security.h>
 #include "internal.h"
 
+int __init init_pivot_root(const char *new_root, const char *put_old)
+{
+	struct path new_path __free(path_put) = {};
+	struct path old_path __free(path_put) = {};
+	int ret;
+
+	ret = kern_path(new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new_path);
+	if (ret)
+		return ret;
+
+	ret = kern_path(put_old, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old_path);
+	if (ret)
+		return ret;
+
+	return path_pivot_root(&new_path, &old_path);
+}
+
 int __init init_mount(const char *dev_name, const char *dir_name,
 		const char *type_page, unsigned long flags, void *data_page)
 {
+1
fs/internal.h
···
 int path_mount(const char *dev_name, const struct path *path,
 		const char *type_page, unsigned long flags, void *data_page);
 int path_umount(const struct path *path, int flags);
+int path_pivot_root(struct path *new, struct path *old);
 
 int show_path(struct seq_file *m, struct dentry *root);
 
+1
fs/mount.h
···
 #include <linux/ns_common.h>
 #include <linux/fs_pin.h>
 
+extern struct file_system_type nullfs_fs_type;
 extern struct list_head notify_list;
 
 struct mnt_namespace {
+124 -57
fs/namespace.c
···
 
 __setup("initramfs_options=", initramfs_options_setup);
 
+bool nullfs_rootfs = false;
+
+static int __init nullfs_rootfs_setup(char *str)
+{
+	if (*str)
+		return 0;
+	nullfs_rootfs = true;
+	return 1;
+}
+__setup("nullfs_rootfs", nullfs_rootfs_setup);
+
 static u64 event;
 static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
 static DEFINE_IDA(mnt_group_ida);
···
 	int res;
 
 	xa_lock(&mnt_id_xa);
-	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, XA_LIMIT(1, INT_MAX), GFP_KERNEL);
+	res = __xa_alloc(&mnt_id_xa, &mnt->mnt_id, mnt, xa_limit_31b, GFP_KERNEL);
 	if (!res)
 		mnt->mnt_id_unique = ++mnt_id_ctr;
 	xa_unlock(&mnt_id_xa);
···
 }
 EXPORT_SYMBOL(path_is_under);
 
-/*
- * pivot_root Semantics:
- * Moves the root file system of the current process to the directory put_old,
- * makes new_root as the new root file system of the current process, and sets
- * root/cwd of all processes which had them on the current root to new_root.
- *
- * Restrictions:
- * The new_root and put_old must be directories, and must not be on the
- * same file system as the current process root. The put_old must be
- * underneath new_root, i.e. adding a non-zero number of /.. to the string
- * pointed to by put_old must yield the same directory as new_root. No other
- * file system may be mounted on put_old. After all, new_root is a mountpoint.
- *
- * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
- * See Documentation/filesystems/ramfs-rootfs-initramfs.rst for alternatives
- * in this situation.
- *
- * Notes:
- *  - we don't move root/cwd if they are not at the root (reason: if something
- *    cared enough to change them, it's probably wrong to force them elsewhere)
- *  - it's okay to pick a root that isn't the root of a file system, e.g.
- *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
- *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
- *    first.
- */
-SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
-		const char __user *, put_old)
+int path_pivot_root(struct path *new, struct path *old)
 {
-	struct path new __free(path_put) = {};
-	struct path old __free(path_put) = {};
 	struct path root __free(path_put) = {};
 	struct mount *new_mnt, *root_mnt, *old_mnt, *root_parent, *ex_parent;
 	int error;
···
 	if (!may_mount())
 		return -EPERM;
 
-	error = user_path_at(AT_FDCWD, new_root,
-			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
-	if (error)
-		return error;
-
-	error = user_path_at(AT_FDCWD, put_old,
-			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
-	if (error)
-		return error;
-
-	error = security_sb_pivotroot(&old, &new);
+	error = security_sb_pivotroot(old, new);
 	if (error)
 		return error;
 
 	get_fs_root(current->fs, &root);
 
-	LOCK_MOUNT(old_mp, &old);
+	LOCK_MOUNT(old_mp, old);
 	old_mnt = old_mp.parent;
 	if (IS_ERR(old_mnt))
 		return PTR_ERR(old_mnt);
 
-	new_mnt = real_mount(new.mnt);
+	new_mnt = real_mount(new->mnt);
 	root_mnt = real_mount(root.mnt);
 	ex_parent = new_mnt->mnt_parent;
 	root_parent = root_mnt->mnt_parent;
···
 		return -EINVAL;
 	if (new_mnt->mnt.mnt_flags & MNT_LOCKED)
 		return -EINVAL;
-	if (d_unlinked(new.dentry))
+	if (d_unlinked(new->dentry))
 		return -ENOENT;
 	if (new_mnt == root_mnt || old_mnt == root_mnt)
 		return -EBUSY; /* loop, on the same file system */
···
 		return -EINVAL; /* not a mountpoint */
 	if (!mnt_has_parent(root_mnt))
 		return -EINVAL; /* absolute root */
-	if (!path_mounted(&new))
+	if (!path_mounted(new))
 		return -EINVAL; /* not a mountpoint */
 	if (!mnt_has_parent(new_mnt))
 		return -EINVAL; /* absolute root */
 	/* make sure we can reach put_old from new_root */
-	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, &new))
+	if (!is_path_reachable(old_mnt, old_mp.mp->m_dentry, new))
 		return -EINVAL;
 	/* make certain new is below the root */
-	if (!is_path_reachable(new_mnt, new.dentry, &root))
+	if (!is_path_reachable(new_mnt, new->dentry, &root))
 		return -EINVAL;
 	lock_mount_hash();
 	umount_mnt(new_mnt);
···
 	unlock_mount_hash();
 	mnt_notify_add(root_mnt);
 	mnt_notify_add(new_mnt);
-	chroot_fs_refs(&root, &new);
+	chroot_fs_refs(&root, new);
 	return 0;
+}
+
+/*
+ * pivot_root Semantics:
+ * Moves the root file system of the current process to the directory put_old,
+ * makes new_root as the new root file system of the current process, and sets
+ * root/cwd of all processes which had them on the current root to new_root.
+ *
+ * Restrictions:
+ * The new_root and put_old must be directories, and must not be on the
+ * same file system as the current process root. The put_old must be
+ * underneath new_root, i.e. adding a non-zero number of /.. to the string
+ * pointed to by put_old must yield the same directory as new_root. No other
+ * file system may be mounted on put_old. After all, new_root is a mountpoint.
+ *
+ * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem
+ * unless the kernel was booted with "nullfs_rootfs". See
+ * Documentation/filesystems/ramfs-rootfs-initramfs.rst for alternatives
+ * in this situation.
+ *
+ * Notes:
+ *  - we don't move root/cwd if they are not at the root (reason: if something
+ *    cared enough to change them, it's probably wrong to force them elsewhere)
+ *  - it's okay to pick a root that isn't the root of a file system, e.g.
+ *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
+ *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
+ *    first.
+ */
+SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
+		const char __user *, put_old)
+{
+	struct path new __free(path_put) = {};
+	struct path old __free(path_put) = {};
+	int error;
+
+	error = user_path_at(AT_FDCWD, new_root,
+			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
+	if (error)
+		return error;
+
+	error = user_path_at(AT_FDCWD, put_old,
+			     LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &old);
+	if (error)
+		return error;
+
+	return path_pivot_root(&new, &old);
 }
 
 static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt)
···
 
 static void __init init_mount_tree(void)
 {
-	struct vfsmount *mnt;
-	struct mount *m;
+	struct vfsmount *mnt, *nullfs_mnt;
+	struct mount *mnt_root;
 	struct path root;
+
+	/*
+	 * When nullfs is used, we create two mounts:
+	 *
+	 * (1) nullfs with mount id 1
+	 * (2) mutable rootfs with mount id 2
+	 *
+	 * with (2) mounted on top of (1).
+	 */
+	if (nullfs_rootfs) {
+		nullfs_mnt = vfs_kern_mount(&nullfs_fs_type, 0, "nullfs", NULL);
+		if (IS_ERR(nullfs_mnt))
+			panic("VFS: Failed to create nullfs");
+	}
 
 	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", initramfs_options);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
 
-	m = real_mount(mnt);
-	init_mnt_ns.root = m;
-	init_mnt_ns.nr_mounts = 1;
-	mnt_add_to_ns(&init_mnt_ns, m);
+	if (nullfs_rootfs) {
+		VFS_WARN_ON_ONCE(real_mount(nullfs_mnt)->mnt_id != 1);
+		VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 2);
+
+		/* The namespace root is the nullfs mnt. */
+		mnt_root = real_mount(nullfs_mnt);
+		init_mnt_ns.root = mnt_root;
+
+		/* Mount mutable rootfs on top of nullfs. */
+		root.mnt = nullfs_mnt;
+		root.dentry = nullfs_mnt->mnt_root;
+
+		LOCK_MOUNT_EXACT(mp, &root);
+		if (unlikely(IS_ERR(mp.parent)))
+			panic("VFS: Failed to mount rootfs on nullfs");
+		scoped_guard(mount_writer)
+			attach_mnt(real_mount(mnt), mp.parent, mp.mp);
+
+		pr_info("VFS: Finished mounting rootfs on nullfs\n");
+	} else {
+		VFS_WARN_ON_ONCE(real_mount(mnt)->mnt_id != 1);
+
+		/* The namespace root is the mutable rootfs. */
+		mnt_root = real_mount(mnt);
+		init_mnt_ns.root = mnt_root;
+	}
+
+	/*
+	 * We've dropped all locks here but that's fine. Not just are we
+	 * the only task that's running, there's no other mount
+	 * namespace in existence and the initial mount namespace is
+	 * completely empty until we add the mounts we just created.
+	 */
+	for (struct mount *p = mnt_root; p; p = next_mnt(p, mnt_root)) {
+		mnt_add_to_ns(&init_mnt_ns, p);
+		init_mnt_ns.nr_mounts++;
+	}
+
 	init_task.nsproxy->mnt_ns = &init_mnt_ns;
 	get_mnt_ns(&init_mnt_ns);
 
-	root.mnt = mnt;
-	root.dentry = mnt->mnt_root;
-
+	/* The root and pwd always point to the mutable rootfs. */
+	root.mnt = mnt;
+	root.dentry = mnt->mnt_root;
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
 
+70
fs/nullfs.c
···
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2026 Christian Brauner <brauner@kernel.org> */
+#include <linux/fs/super_types.h>
+#include <linux/fs_context.h>
+#include <linux/magic.h>
+
+static const struct super_operations nullfs_super_operations = {
+	.statfs = simple_statfs,
+};
+
+static int nullfs_fs_fill_super(struct super_block *s, struct fs_context *fc)
+{
+	struct inode *inode;
+
+	s->s_maxbytes = MAX_LFS_FILESIZE;
+	s->s_blocksize = PAGE_SIZE;
+	s->s_blocksize_bits = PAGE_SHIFT;
+	s->s_magic = NULL_FS_MAGIC;
+	s->s_op = &nullfs_super_operations;
+	s->s_export_op = NULL;
+	s->s_xattr = NULL;
+	s->s_time_gran = 1;
+	s->s_d_flags = 0;
+
+	inode = new_inode(s);
+	if (!inode)
+		return -ENOMEM;
+
+	/* nullfs is permanently empty... */
+	make_empty_dir_inode(inode);
+	simple_inode_init_ts(inode);
+	inode->i_ino = 1;
+	/* ... and immutable. */
+	inode->i_flags |= S_IMMUTABLE;
+
+	s->s_root = d_make_root(inode);
+	if (!s->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/*
+ * For now this is a single global instance. If needed we can make it
+ * mountable by userspace at which point we will need to make it
+ * multi-instance.
+ */
+static int nullfs_fs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, nullfs_fs_fill_super);
+}
+
+static const struct fs_context_operations nullfs_fs_context_ops = {
+	.get_tree = nullfs_fs_get_tree,
+};
+
+static int nullfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &nullfs_fs_context_ops;
+	fc->global = true;
+	fc->sb_flags = SB_NOUSER;
+	fc->s_iflags = SB_I_NOEXEC | SB_I_NODEV;
+	return 0;
+}
+
+struct file_system_type nullfs_fs_type = {
+	.name = "nullfs",
+	.init_fs_context = nullfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+1
include/linux/init_syscalls.h
···
 int __init init_rmdir(const char *pathname);
 int __init init_utimes(char *filename, struct timespec64 *ts);
 int __init init_dup(struct file *file);
+int __init init_pivot_root(const char *new_root, const char *put_old);
+1
include/uapi/linux/magic.h
···
 #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
 #define PID_FS_MAGIC 0x50494446 /* "PIDF" */
 #define GUEST_MEMFD_MAGIC 0x474d454d /* "GMEM" */
+#define NULL_FS_MAGIC 0x4E554C4C /* "NULL" */
 
 #endif /* __LINUX_MAGIC_H__ */
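
The magic numbers in this header are four ASCII bytes packed most-significant byte first, which is how 0x4E554C4C spells "NULL". The small sketch below decodes such a value; the helper name magic_to_ascii is hypothetical and exists only for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: unpack a 32-bit filesystem magic into its four
 * ASCII characters, most-significant byte first, and NUL-terminate it.
 */
void magic_to_ascii(uint32_t magic, char out[5])
{
	for (int i = 0; i < 4; i++)
		out[i] = (char)((magic >> (24 - 8 * i)) & 0xff);
	out[4] = '\0';
}
```

Passing 0x4E554C4C yields the bytes 'N', 'U', 'L', 'L', matching the comment next to NULL_FS_MAGIC.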
+14
init/do_mounts.c
···
 	mount_root(saved_root_name);
 out:
 	devtmpfs_mount();
+
+	if (nullfs_rootfs) {
+		if (init_pivot_root(".", ".")) {
+			pr_err("VFS: Failed to pivot into new rootfs\n");
+			return;
+		}
+		if (init_umount(".", MNT_DETACH)) {
+			pr_err("VFS: Failed to unmount old rootfs\n");
+			return;
+		}
+		pr_info("VFS: Pivoted into new rootfs\n");
+		return;
+	}
+
 	init_mount(".", "/", NULL, MS_MOVE, NULL);
 	init_chroot(".");
 }
+1
init/do_mounts.h
···
 void mount_root_generic(char *name, char *pretty_name, int flags);
 void mount_root(char *root_device_name);
 extern int root_mountflags;
+extern bool nullfs_rootfs;
 
 static inline __init int create_dev(char *name, dev_t dev)
 {