Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
"This contains the usual selection of misc updates for this cycle.

Features:

- Add "initramfs_options" parameter to set initramfs mount options.
This allows specific mount options to be set on the rootfs, e.g. to
limit its memory size

- Add RWF_NOSIGNAL flag for pwritev2()

This flag prevents SIGPIPE from being raised when writing to
disconnected pipes or sockets. The flag is handled directly by the
pipe filesystem and converted to the existing MSG_NOSIGNAL flag for
sockets

- Allow passing a pid namespace as a procfs mount option

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:

* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)

But this requires forking off a helper process because you cannot
just one-shot this using mount(2)

* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)

While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate

Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):

fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

Cleanups:

- Remove the last references to EXPORT_OP_ASYNC_LOCK

- Make file_remove_privs_flags() static

- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used

- Use try_cmpxchg() in start_dir_add()

- Use try_cmpxchg() in sb_init_done_wq()

- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()

- Remove vfs_ioctl() export

- Replace the rwlock with a spinlock in the epoll code, as the rwlock
causes priority inversion on PREEMPT_RT kernels

- Make ns_entries in fs/proc/namespaces const

- Use a switch statement in init_special_inode(), just like we do
in may_open()

- Use struct_size() in dir_add() in the initramfs code

- Use str_plural() in rd_load_image()

- Replace strcpy() with strscpy() in find_link()

- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()

- Remove unused arguments from fcntl_{g,s}et_rw_hint()

Fixes:

- Document @name parameter for name_contains_dotdot() helper

- Fix spelling mistake ("writen" -> "written") in a fs/namei.c comment

- Always return zero from replace_fd() instead of the file descriptor
number

- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow

- Fix debugfs mount options not being applied

- Verify the inode mode when loading it from disk in minixfs

- Verify the inode mode when loading it from disk in cramfs

- Don't trigger automounts with RESOLVE_NO_XDEV

If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them

- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints

- Fix unused variable warning in rd_load_image() on s390

- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD

- Use ns_capable_noaudit() when determining net sysctl permissions

- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
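
As an illustration of the RWF_NOSIGNAL item above, here is a minimal userspace sketch. It is not taken from the series: the helper names are hypothetical, the flag value is an assumption that should be checked against the installed uapi headers, and the call goes through syscall(2) rather than the glibc wrapper only to keep the sketch independent of the libc version.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value believed to match the 6.18 uapi headers; verify locally. */
#ifndef RWF_NOSIGNAL
#define RWF_NOSIGNAL 0x00000100
#endif

/*
 * Hypothetical helper: write buf to fd without raising SIGPIPE.
 * Returns bytes written, or -1 with errno set (EPIPE if the reader is
 * gone, EOPNOTSUPP on kernels that do not know the flag).
 */
static ssize_t write_nosignal(int fd, const void *buf, size_t len)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	/* pos_l = pos_h = -1 selects the current file position */
	return syscall(SYS_pwritev2, fd, &iov, 1UL, -1L, -1L, RWF_NOSIGNAL);
}

/* Hypothetical demo: write into a pipe whose read end is already closed. */
static int demo_closed_pipe(void)
{
	static const char msg[] = "hello\n";
	int fds[2];
	int saved;
	ssize_t n;

	if (pipe(fds) < 0)
		return -1;
	close(fds[0]);		/* no reader left */
	n = write_nosignal(fds[1], msg, strlen(msg));
	saved = (n < 0) ? errno : 0;
	close(fds[1]);
	return saved;		/* errno of the failed write, 0 on success */
}
```

Either outcome leaves the process alive: a kernel with the flag reports EPIPE, and an older kernel rejects the unknown flag before touching the pipe.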

* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...

+582 -264
+3
Documentation/admin-guide/kernel-parameters.txt
···

	rootflags=	[KNL] Set root filesystem mount option string

+	initramfs_options= [KNL]
+			Specify mount options for for the initramfs mount.
+
	rootfstype=	[KNL] Set root filesystem type

	rootwait	[KNL] Wait (indefinitely) for root device to show up.
+2 -2
Documentation/filesystems/porting.rst
···

 ->drop_inode() returns int now; it's called on final iput() with
 inode->i_lock held and it returns true if filesystems wants the inode to be
-dropped. As before, generic_drop_inode() is still the default and it's been
-updated appropriately. generic_delete_inode() is also alive and it consists
+dropped. As before, inode_generic_drop() is still the default and it's been
+updated appropriately. inode_just_drop() is also alive and it consists
 simply of return 1. Note that all actual eviction work is done by caller after
 ->drop_inode() returns.
+8
Documentation/filesystems/proc.rst
···
 hidepid= Set /proc/<pid>/ access mode.
 gid= Set the group authorized to learn processes information.
 subset= Show only the specified subset of procfs.
+pidns= Specify a the namespace used by this procfs.
 ========= ========================================================

 hidepid=off or hidepid=0 means classic mode - everybody may access all
···

 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
+
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).

 Chapter 5: Filesystem behavior
 ==============================
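
The FSCONFIG_SET_FD form of the pidns= option documented above could be driven from userspace roughly as follows. This is a sketch under stated assumptions, not code from the series: proc_mount_with_pidns() is a hypothetical helper, raw syscall(2) numbers are used in place of libc wrappers, and the caller is assumed to have CAP_SYS_ADMIN and a kernel that knows the option.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a detached procfs mount that uses the pid
 * namespace referred to by nsfd. Returns a mount fd on success or a
 * negative errno value on failure.
 */
static int proc_mount_with_pidns(int nsfd)
{
	int fsfd, mntfd, err;

	fsfd = (int)syscall(SYS_fsopen, "proc", FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return -errno;

	/* Same call shape as in the pull message: key "pidns", NULL value, fd in aux */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd) < 0 ||
	    syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		err = -errno;
		close(fsfd);
		return err;
	}

	mntfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
	err = -errno;		/* only meaningful if fsmount failed */
	close(fsfd);
	return mntfd < 0 ? err : mntfd;
}
```

The returned fd refers to a detached mount; attaching it somewhere would take a further move_mount(2) call, which is left out here. Because the namespace is chosen before FSCONFIG_CMD_CREATE, the sketch never tries to change the pidns of an existing instance, which the documentation says fails with -EBUSY.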
+2 -2
Documentation/filesystems/vfs.rst
···
	inode->i_lock spinlock held.

	This method should be either NULL (normal UNIX filesystem
-	semantics) or "generic_delete_inode" (for filesystems that do
+	semantics) or "inode_just_drop" (for filesystems that do
	not want to cache inodes - causing "delete_inode" to always be
	called regardless of the value of i_nlink)

-	The "generic_delete_inode()" behavior is equivalent to the old
+	The "inode_just_drop()" behavior is equivalent to the old
	practice of using "force_delete" in the put_inode() case, but
	does not have the races that the "force_delete()" approach had.
+1 -1
block/bdev.c
···
	.statfs = simple_statfs,
	.alloc_inode = bdev_alloc_inode,
	.free_inode = bdev_free_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.evict_inode = bdev_evict_inode,
 };
+1 -1
drivers/dax/super.c
···
	.alloc_inode = dax_alloc_inode,
	.destroy_inode = dax_destroy_inode,
	.free_inode = dax_free_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 };

 static int dax_init_fs_context(struct fs_context *fc)
+1 -1
drivers/misc/ibmasm/ibmasmfs.c
···

 static const struct super_operations ibmasmfs_s_ops = {
	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 };

 static const struct file_operations *ibmasmfs_dir_ops = &simple_dir_operations;
+1 -1
drivers/usb/gadget/function/f_fs.c
···
 /* Super block */
 static const struct super_operations ffs_sb_operations = {
	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 };

 struct ffs_sb_fill_data {
+1 -1
drivers/usb/gadget/legacy/inode.c
···

 static const struct super_operations gadget_fs_operations = {
	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 };

 static int
+1 -1
fs/9p/vfs_super.c
···

	v9ses = v9fs_inode2v9ses(inode);
	if (v9ses->cache & (CACHE_META|CACHE_LOOSE))
-		return generic_drop_inode(inode);
+		return inode_generic_drop(inode);
	/*
	 * in case of non cached mode always drop the
	 * inode because we want the inode attribute
+2 -2
fs/afs/inode.c
···
	_enter("");

	if (test_bit(AFS_VNODE_PSEUDODIR, &AFS_FS_I(inode)->flags))
-		return generic_delete_inode(inode);
+		return inode_just_drop(inode);
	else
-		return generic_drop_inode(inode);
+		return inode_generic_drop(inode);
 }

 /*
+1 -1
fs/btrfs/inode.c
···
	if (btrfs_root_refs(&root->root_item) == 0)
		return 1;
	else
-		return generic_drop_inode(inode);
+		return inode_generic_drop(inode);
 }

 static void init_once(void *foo)
+1 -1
fs/ceph/super.c
···
	.alloc_inode = ceph_alloc_inode,
	.free_inode = ceph_free_inode,
	.write_inode = ceph_write_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.evict_inode = ceph_evict_inode,
	.sync_fs = ceph_sync_fs,
	.put_super = ceph_put_super,
+1 -1
fs/configfs/mount.c
···

 static const struct super_operations configfs_ops = {
	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.free_inode = configfs_free_inode,
 };
+10 -1
fs/cramfs/inode.c
···
		inode_nohighmem(inode);
		inode->i_data.a_ops = &cramfs_aops;
		break;
-	default:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFIFO:
+	case S_IFSOCK:
		init_special_inode(inode, cramfs_inode->mode,
				old_decode_dev(cramfs_inode->size));
+		break;
+	default:
+		printk(KERN_DEBUG "CRAMFS: Invalid file type 0%04o for inode %lu.\n",
+		       inode->i_mode, inode->i_ino);
+		iget_failed(inode);
+		return ERR_PTR(-EIO);
	}

	inode->i_mode = cramfs_inode->mode;
+2 -2
fs/dcache.c
···
 {
	preempt_disable_nested();
	for (;;) {
-		unsigned n = dir->i_dir_seq;
-		if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
+		unsigned n = READ_ONCE(dir->i_dir_seq);
+		if (!(n & 1) && try_cmpxchg(&dir->i_dir_seq, &n, n + 1))
			return n;
		cpu_relax();
	}
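
The cmpxchg() to try_cmpxchg() conversion in the hunk above has a close userspace analogue in C11 atomics, sketched below. This is an illustration only, not kernel code: start_add() and end_add() are hypothetical names, and atomic_compare_exchange_strong() stands in for try_cmpxchg() because it also returns a boolean and writes the observed value back into the expected-value variable on failure.

```c
#include <stdatomic.h>

/*
 * Hypothetical analogue of the start_dir_add() loop: move an even
 * sequence number to odd, retrying while another writer holds it.
 * Returns the even value that was replaced.
 */
static unsigned start_add(_Atomic unsigned *seq)
{
	for (;;) {
		unsigned n = atomic_load_explicit(seq, memory_order_relaxed);

		/*
		 * On failure the observed value is stored into n, so no
		 * separate "== n" comparison of the result is needed.
		 */
		if (!(n & 1) && atomic_compare_exchange_strong(seq, &n, n + 1))
			return n;
	}
}

/* Hypothetical counterpart: publish the next even value when done. */
static void end_add(_Atomic unsigned *seq, unsigned n)
{
	atomic_store_explicit(seq, n + 2, memory_order_release);
}
```

A caller brackets its update with the pair: n = start_add(&seq); ... end_add(&seq, n);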
+1 -1
fs/efivarfs/super.c
···

 static const struct super_operations efivarfs_ops = {
	.statfs = efivarfs_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.alloc_inode = efivarfs_alloc_inode,
	.free_inode = efivarfs_free_inode,
	.show_options = efivarfs_show_options,
+26 -113
fs/eventpoll.c
···
  *
  * 1) epnested_mutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->lock (rwlock)
+ * 3) ep->lock (spinlock)
  *
  * The acquire order is the one listed above, from 1 to 3.
- * We need a rwlock (ep->lock) because we manipulate objects
+ * We need a spinlock (ep->lock) because we manipulate objects
  * from inside the poll callback, that might be triggered from
  * a wake_up() that in turn might be called from IRQ context.
  * So we can't sleep inside the poll callback and hence we need
···
	struct list_head rdllist;

	/* Lock which protects rdllist and ovflist */
-	rwlock_t lock;
+	spinlock_t lock;

	/* RB tree root used to store monitored fd structs */
	struct rb_root_cached rbr;
···
	 * in a lockless way.
	 */
	lockdep_assert_irqs_enabled();
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
	list_splice_init(&ep->rdllist, txlist);
	WRITE_ONCE(ep->ovflist, NULL);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }

 static void ep_done_scan(struct eventpoll *ep,
···
 {
	struct epitem *epi, *nepi;

-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
	/*
	 * During the time we spent inside the "sproc" callback, some
	 * other events might have been queued by the poll callback.
···
		wake_up(&ep->wq);
	}

-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);
 }

 static void ep_get(struct eventpoll *ep)
···

	rb_erase_cached(&epi->rbn, &ep->rbr);

-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);
	if (ep_is_linked(epi))
		list_del_init(&epi->rdllink);
-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);

	wakeup_source_unregister(ep_wakeup_source(epi));
	/*
···
		return -ENOMEM;

	mutex_init(&ep->mtx);
-	rwlock_init(&ep->lock);
+	spin_lock_init(&ep->lock);
	init_waitqueue_head(&ep->wq);
	init_waitqueue_head(&ep->poll_wait);
	INIT_LIST_HEAD(&ep->rdllist);
···
 #endif /* CONFIG_KCMP */

 /*
- * Adds a new entry to the tail of the list in a lockless way, i.e.
- * multiple CPUs are allowed to call this function concurrently.
- *
- * Beware: it is necessary to prevent any other modifications of the
- * existing list until all changes are completed, in other words
- * concurrent list_add_tail_lockless() calls should be protected
- * with a read lock, where write lock acts as a barrier which
- * makes sure all list_add_tail_lockless() calls are fully
- * completed.
- *
- * Also an element can be locklessly added to the list only in one
- * direction i.e. either to the tail or to the head, otherwise
- * concurrent access will corrupt the list.
- *
- * Return: %false if element has been already added to the list, %true
- * otherwise.
- */
-static inline bool list_add_tail_lockless(struct list_head *new,
-					  struct list_head *head)
-{
-	struct list_head *prev;
-
-	/*
-	 * This is simple 'new->next = head' operation, but cmpxchg()
-	 * is used in order to detect that same element has been just
-	 * added to the list from another CPU: the winner observes
-	 * new->next == new.
-	 */
-	if (!try_cmpxchg(&new->next, &new, head))
-		return false;
-
-	/*
-	 * Initially ->next of a new element must be updated with the head
-	 * (we are inserting to the tail) and only then pointers are atomically
-	 * exchanged. XCHG guarantees memory ordering, thus ->next should be
-	 * updated before pointers are actually swapped and pointers are
-	 * swapped before prev->next is updated.
-	 */
-
-	prev = xchg(&head->prev, new);
-
-	/*
-	 * It is safe to modify prev->next and new->prev, because a new element
-	 * is added only to the tail and new->next is updated before XCHG.
-	 */
-
-	prev->next = new;
-	new->prev = prev;
-
-	return true;
-}
-
-/*
- * Chains a new epi entry to the tail of the ep->ovflist in a lockless way,
- * i.e. multiple CPUs are allowed to call this function concurrently.
- *
- * Return: %false if epi element has been already chained, %true otherwise.
- */
-static inline bool chain_epi_lockless(struct epitem *epi)
-{
-	struct eventpoll *ep = epi->ep;
-
-	/* Fast preliminary check */
-	if (epi->next != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Check that the same epi has not been just chained from another CPU */
-	if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
-		return false;
-
-	/* Atomically exchange tail */
-	epi->next = xchg(&ep->ovflist, epi);
-
-	return true;
-}
-
-/*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
  * have events to report.
- *
- * This callback takes a read lock in order not to contend with concurrent
- * events from another file descriptor, thus all modifications to ->rdllist
- * or ->ovflist are lockless. Read lock is paired with the write lock from
- * ep_start/done_scan(), which stops all list modifications and guarantees
- * that lists state is seen correctly.
- *
- * Another thing worth to mention is that ep_poll_callback() can be called
- * concurrently for the same @epi from different CPUs if poll table was inited
- * with several wait queues entries. Plural wakeup from different CPUs of a
- * single wait queue is serialized by wq.lock, but the case when multiple wait
- * queues are used should be detected accordingly. This is detected using
- * cmpxchg() operation.
  */
 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
···
	unsigned long flags;
	int ewake = 0;

-	read_lock_irqsave(&ep->lock, flags);
+	spin_lock_irqsave(&ep->lock, flags);

	ep_set_busy_poll_napi_id(epi);

···
	 * chained in ep->ovflist and requeued later on.
	 */
	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-		if (chain_epi_lockless(epi))
+		if (epi->next == EP_UNACTIVE_PTR) {
+			epi->next = READ_ONCE(ep->ovflist);
+			WRITE_ONCE(ep->ovflist, epi);
			ep_pm_stay_awake_rcu(epi);
+		}
	} else if (!ep_is_linked(epi)) {
		/* In the usual case, add event to ready list. */
-		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
-			ep_pm_stay_awake_rcu(epi);
+		list_add_tail(&epi->rdllink, &ep->rdllist);
+		ep_pm_stay_awake_rcu(epi);
	}

	/*
···
		pwake++;

 out_unlock:
-	read_unlock_irqrestore(&ep->lock, flags);
+	spin_unlock_irqrestore(&ep->lock, flags);

	/* We have to call this outside the lock */
	if (pwake)
···
	}

	/* We have to drop the new item inside our item list to keep track of it */
-	write_lock_irq(&ep->lock);
+	spin_lock_irq(&ep->lock);

	/* record NAPI ID of new item if present */
	ep_set_busy_poll_napi_id(epi);
···
			pwake++;
	}

-	write_unlock_irq(&ep->lock);
+	spin_unlock_irq(&ep->lock);

	/* We have to call this outside the lock */
	if (pwake)
···
	 * list, push it inside.
	 */
	if (ep_item_poll(epi, &pt, 1)) {
-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
		if (!ep_is_linked(epi)) {
			list_add_tail(&epi->rdllink, &ep->rdllist);
			ep_pm_stay_awake(epi);
···
			if (waitqueue_active(&ep->poll_wait))
				pwake++;
		}
-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);
	}

	/* We have to call this outside the lock */
···
		init_wait(&wait);
		wait.func = ep_autoremove_wake_function;

-		write_lock_irq(&ep->lock);
+		spin_lock_irq(&ep->lock);
		/*
		 * Barrierless variant, waitqueue_active() is called under
		 * the same lock on wakeup ep_poll_callback() side, so it
···
		if (!eavail)
			__add_wait_queue_exclusive(&ep->wq, &wait);

-		write_unlock_irq(&ep->lock);
+		spin_unlock_irq(&ep->lock);

		if (!eavail)
			timed_out = !ep_schedule_timeout(to) ||
···
			eavail = 1;

		if (!list_empty_careful(&wait.entry)) {
-			write_lock_irq(&ep->lock);
+			spin_lock_irq(&ep->lock);
			/*
			 * If the thread timed out and is not on the wait queue,
			 * it means that the thread was woken up after its
···
			if (timed_out)
				eavail = list_empty(&wait.entry);
			__remove_wait_queue(&ep->wq, &wait);
-			write_unlock_irq(&ep->lock);
+			spin_unlock_irq(&ep->lock);
		}
	}
 }
+1 -1
fs/ext4/super.c
···

 static int ext4_drop_inode(struct inode *inode)
 {
-	int drop = generic_drop_inode(inode);
+	int drop = inode_generic_drop(inode);

	if (!drop)
		drop = fscrypt_drop_inode(inode);
+1 -1
fs/f2fs/super.c
···
		trace_f2fs_drop_inode(inode, 0);
		return 0;
	}
-	ret = generic_drop_inode(inode);
+	ret = inode_generic_drop(inode);
	if (!ret)
		ret = fscrypt_drop_inode(inode);
	trace_f2fs_drop_inode(inode, ret);
+4 -6
fs/fcntl.c
···
	}
 }

-static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
-			      unsigned long arg)
+static long fcntl_get_rw_hint(struct file *file, unsigned long arg)
 {
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
···
	return 0;
 }

-static long fcntl_set_rw_hint(struct file *file, unsigned int cmd,
-			      unsigned long arg)
+static long fcntl_set_rw_hint(struct file *file, unsigned long arg)
 {
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
···
		err = memfd_fcntl(filp, cmd, argi);
		break;
	case F_GET_RW_HINT:
-		err = fcntl_get_rw_hint(filp, cmd, arg);
+		err = fcntl_get_rw_hint(filp, arg);
		break;
	case F_SET_RW_HINT:
-		err = fcntl_set_rw_hint(filp, cmd, arg);
+		err = fcntl_set_rw_hint(filp, arg);
		break;
	default:
		break;
+4 -1
fs/file.c
···
	err = expand_files(files, fd);
	if (unlikely(err < 0))
		goto out_unlock;
-	return do_dup2(files, file, fd, flags);
+	err = do_dup2(files, file, fd, flags);
+	if (err < 0)
+		return err;
+	return 0;

 out_unlock:
	spin_unlock(&files->file_lock);
+1 -1
fs/fs-writeback.c
···
	dirty = dirty * 10 / 8;

	/* issue the writeback work */
-	work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
+	work = kzalloc(sizeof(*work), GFP_NOWAIT);
	if (work) {
		work->nr_pages = dirty;
		work->sync_mode = WB_SYNC_NONE;
+1 -1
fs/fuse/inode.c
···
	.free_inode = fuse_free_inode,
	.evict_inode = fuse_evict_inode,
	.write_inode = fuse_write_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.umount_begin = fuse_umount_begin,
	.statfs = fuse_statfs,
	.sync_fs = fuse_sync_fs,
+1 -1
fs/gfs2/super.c
···
	if (test_bit(SDF_EVICTING, &sdp->sd_flags))
		return 1;

-	return generic_drop_inode(inode);
+	return inode_generic_drop(inode);
 }

 /**
+1 -1
fs/hostfs/hostfs_kern.c
···
 static const struct super_operations hostfs_sbops = {
	.alloc_inode = hostfs_alloc_inode,
	.free_inode = hostfs_free_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.evict_inode = hostfs_evict_inode,
	.statfs = hostfs_statfs,
	.show_options = hostfs_show_options,
+18 -12
fs/inode.c
···
 EXPORT_SYMBOL(insert_inode_locked4);


-int generic_delete_inode(struct inode *inode)
+int inode_just_drop(struct inode *inode)
 {
	return 1;
 }
-EXPORT_SYMBOL(generic_delete_inode);
+EXPORT_SYMBOL(inode_just_drop);

 /*
  * Called when we're dropping the last reference
···
	if (op->drop_inode)
		drop = op->drop_inode(inode);
	else
-		drop = generic_drop_inode(inode);
+		drop = inode_generic_drop(inode);

	if (!drop &&
	    !(inode->i_state & I_DONTCACHE) &&
···
	return notify_change(idmap, dentry, &newattrs, NULL);
 }

-int file_remove_privs_flags(struct file *file, unsigned int flags)
+static int file_remove_privs_flags(struct file *file, unsigned int flags)
 {
	struct dentry *dentry = file_dentry(file);
	struct inode *inode = file_inode(file);
···
		inode_has_no_xattr(inode);
	return error;
 }
-EXPORT_SYMBOL_GPL(file_remove_privs_flags);

 /**
  * file_remove_privs - remove special file privileges (suid, capabilities)
···
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
	inode->i_mode = mode;
-	if (S_ISCHR(mode)) {
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFCHR:
		inode->i_fop = &def_chr_fops;
		inode->i_rdev = rdev;
-	} else if (S_ISBLK(mode)) {
+		break;
+	case S_IFBLK:
		if (IS_ENABLED(CONFIG_BLOCK))
			inode->i_fop = &def_blk_fops;
		inode->i_rdev = rdev;
-	} else if (S_ISFIFO(mode))
+		break;
+	case S_IFIFO:
		inode->i_fop = &pipefifo_fops;
-	else if (S_ISSOCK(mode))
-		; /* leave it no_open_fops */
-	else
+		break;
+	case S_IFSOCK:
+		/* leave it no_open_fops */
+		break;
+	default:
		printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for"
				  " inode %s:%lu\n", mode, inode->i_sb->s_id,
				  inode->i_ino);
+		break;
+	}
 }
 EXPORT_SYMBOL(init_special_inode);

···
  */
 void dump_inode(struct inode *inode, const char *reason)
 {
-	pr_warn("%s encountered for inode %px", reason, inode);
+	pr_warn("%s encountered for inode %px (%s)\n", reason, inode, inode->i_sb->s_type->name);
 }

 EXPORT_SYMBOL(dump_inode);
+2 -3
fs/ioctl.c
···
  *
  * Returns 0 on success, -errno on error.
  */
-int vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+static int vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
	int error = -ENOTTY;

···
 out:
	return error;
 }
-EXPORT_SYMBOL(vfs_ioctl);

 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
···
		goto out;
	}

-	size = offsetof(struct file_dedupe_range, info[count]);
+	size = struct_size(same, info, count);
	if (size > PAGE_SIZE) {
		ret = -ENOMEM;
		goto out;
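
The last hunk above swaps offsetof(..., info[count]) for struct_size(). A rough userspace sketch of what the helper computes follows. It is an approximation under stated assumptions rather than the kernel macro: struct range_hdr and range_hdr_size() are hypothetical, and the overflow builtins are GCC/Clang extensions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a header followed by a flexible array. */
struct range_hdr {
	uint64_t src_offset;
	uint16_t count;
	uint64_t info[];	/* flexible array member */
};

/*
 * Sketch of struct_size(p, info, count): the size of the header plus
 * count array elements, saturating to SIZE_MAX on overflow so that a
 * later "size > limit" check rejects the request.
 */
static size_t range_hdr_size(size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(count, sizeof(uint64_t), &bytes) ||
	    __builtin_add_overflow(bytes, sizeof(struct range_hdr), &bytes))
		return SIZE_MAX;
	return bytes;
}
```

A caller compares the result against its limit, much as ioctl_file_dedupe_range() compares size against PAGE_SIZE. The saturating behaviour means an absurd count cannot wrap around to a small allocation size.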
+1 -1
fs/kernfs/mount.c
···

 const struct super_operations kernfs_sops = {
	.statfs = kernfs_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
	.evict_inode = kernfs_evict_inode,

	.show_options = kernfs_sop_show_options,
+2 -2
fs/locks.c
···
  * To avoid blocking kernel daemons, such as lockd, that need to acquire POSIX
  * locks, the ->lock() interface may return asynchronously, before the lock has
  * been granted or denied by the underlying filesystem, if (and only if)
- * lm_grant is set. Additionally EXPORT_OP_ASYNC_LOCK in export_operations
- * flags need to be set.
+ * lm_grant is set. Additionally FOP_ASYNC_LOCK in file_operations fop_flags
+ * need to be set.
  *
  * Callers expecting ->lock() to return asynchronously will only use F_SETLK,
  * not F_SETLKW; they will set FL_SLEEP if (and only if) the request is for a
+7 -1
fs/minix/inode.c
···
		inode->i_op = &minix_symlink_inode_operations;
		inode_nohighmem(inode);
		inode->i_mapping->a_ops = &minix_aops;
-	} else
+	} else if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
+		   S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
		init_special_inode(inode, inode->i_mode, rdev);
+	} else {
+		printk(KERN_DEBUG "MINIX-fs: Invalid file type 0%04o for inode %lu.\n",
+		       inode->i_mode, inode->i_ino);
+		make_bad_inode(inode);
+	}
 }

 /*
+15 -7
fs/namei.c
···
	    dentry->d_inode)
		return -EISDIR;

+	/* No need to trigger automounts if mountpoint crossing is disabled. */
+	if (lookup_flags & LOOKUP_NO_XDEV)
+		return -EXDEV;
+
	if (count && (*count)++ >= MAXSYMLINKS)
		return -ELOOP;

···
		/* Allow the filesystem to manage the transit without i_rwsem
		 * being held. */
		if (flags & DCACHE_MANAGE_TRANSIT) {
+			if (lookup_flags & LOOKUP_NO_XDEV) {
+				ret = -EXDEV;
+				break;
+			}
			ret = path->dentry->d_op->d_manage(path, false);
			flags = smp_load_acquire(&path->dentry->d_flags);
			if (ret < 0)
···
				// here we know it's positive
				flags = path->dentry->d_flags;
				need_mntput = true;
+				if (unlikely(lookup_flags & LOOKUP_NO_XDEV)) {
+					ret = -EXDEV;
+					break;
+				}
				continue;
			}
		}
···
			return -ECHILD;
	}
	ret = traverse_mounts(path, &jumped, &nd->total_link_count, nd->flags);
-	if (jumped) {
-		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
-			ret = -EXDEV;
-		else
-			nd->state |= ND_JUMPED;
-	}
+	if (jumped)
+		nd->state |= ND_JUMPED;
	if (unlikely(ret)) {
		dput(path->dentry);
		if (path->mnt != nd->path.mnt)
···
		return -EPERM;
	/*
	 * Updating the link count will likely cause i_uid and i_gid to
-	 * be writen back improperly if their true value is unknown to
+	 * be written back improperly if their true value is unknown to
	 * the vfs.
	 */
	if (HAS_UNMAPPED_ID(idmap, inode))
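
From userspace, the LOOKUP_NO_XDEV paths touched above are reached through openat2() with RESOLVE_NO_XDEV. The sketch below shows one way to issue such a lookup. open_no_xdev() is a hypothetical helper, and the call is made through syscall(2) on the assumption that the C library provides no openat2() wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path relative to dirfd while refusing to
 * cross any mount point. Returns an fd, or -1 with errno set; EXDEV
 * means the walk would have left the starting mount.
 */
static int open_no_xdev(int dirfd, const char *path)
{
	struct open_how how;

	memset(&how, 0, sizeof(how));	/* unused fields must be zero */
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_NO_XDEV;

	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

With the fix in this series, a walk that reaches an automount point under RESOLVE_NO_XDEV fails with EXDEV before the automount is triggered. Previously the lookup did not traverse the automount, but could still cause it to be mounted.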
+72 -34
fs/namespace.c
···
 }
 __setup("mphash_entries=", set_mphash_entries);
 
+static char * __initdata initramfs_options;
+static int __init initramfs_options_setup(char *str)
+{
+	initramfs_options = str;
+	return 1;
+}
+
+__setup("initramfs_options=", initramfs_options_setup);
+
 static u64 event;
 static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
 static DEFINE_IDA(mnt_group_ida);
···
 static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id,
 			struct mnt_namespace *ns)
 {
-	struct path root __free(path_put) = {};
 	struct mount *m;
 	int err;
 
···
 	if (!s->mnt)
 		return -ENOENT;
 
-	err = grab_requested_root(ns, &root);
+	err = grab_requested_root(ns, &s->root);
 	if (err)
 		return err;
 
···
 	 * mounts to show users.
 	 */
 	m = real_mount(s->mnt);
-	if (!is_path_reachable(m, m->mnt.mnt_root, &root) &&
+	if (!is_path_reachable(m, m->mnt.mnt_root, &s->root) &&
 	    !ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	err = security_sb_statfs(s->mnt->mnt_root);
 	if (err)
 		return err;
-
-	s->root = root;
 
 	/*
 	 * Note that mount properties in mnt->mnt_flags, mnt->mnt_idmap
···
 	if (!ret)
 		ret = copy_statmount_to_user(ks);
 	kvfree(ks->seq.buf);
+	path_put(&ks->root);
 	if (retry_statmount(ret, &seq_size))
 		goto retry;
 	return ret;
 }
 
-static ssize_t do_listmount(struct mnt_namespace *ns, u64 mnt_parent_id,
-			    u64 last_mnt_id, u64 *mnt_ids, size_t nr_mnt_ids,
-			    bool reverse)
+struct klistmount {
+	u64 last_mnt_id;
+	u64 mnt_parent_id;
+	u64 *kmnt_ids;
+	u32 nr_mnt_ids;
+	struct mnt_namespace *ns;
+	struct path root;
+};
+
+static ssize_t do_listmount(struct klistmount *kls, bool reverse)
 {
-	struct path root __free(path_put) = {};
+	struct mnt_namespace *ns = kls->ns;
+	u64 mnt_parent_id = kls->mnt_parent_id;
+	u64 last_mnt_id = kls->last_mnt_id;
+	u64 *mnt_ids = kls->kmnt_ids;
+	size_t nr_mnt_ids = kls->nr_mnt_ids;
 	struct path orig;
 	struct mount *r, *first;
 	ssize_t ret;
 
 	rwsem_assert_held(&namespace_sem);
 
-	ret = grab_requested_root(ns, &root);
+	ret = grab_requested_root(ns, &kls->root);
 	if (ret)
 		return ret;
 
 	if (mnt_parent_id == LSMT_ROOT) {
-		orig = root;
+		orig = kls->root;
 	} else {
 		orig.mnt = lookup_mnt_in_ns(mnt_parent_id, ns);
 		if (!orig.mnt)
···
 	 * Don't trigger audit denials. We just want to determine what
 	 * mounts to show users.
 	 */
-	if (!is_path_reachable(real_mount(orig.mnt), orig.dentry, &root) &&
+	if (!is_path_reachable(real_mount(orig.mnt), orig.dentry, &kls->root) &&
 	    !ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
···
 	return ret;
 }
 
+static void __free_klistmount_free(const struct klistmount *kls)
+{
+	path_put(&kls->root);
+	kvfree(kls->kmnt_ids);
+	mnt_ns_release(kls->ns);
+}
+
+static inline int prepare_klistmount(struct klistmount *kls, struct mnt_id_req *kreq,
+				     size_t nr_mnt_ids)
+{
+
+	u64 last_mnt_id = kreq->param;
+
+	/* The first valid unique mount id is MNT_UNIQUE_ID_OFFSET + 1. */
+	if (last_mnt_id != 0 && last_mnt_id <= MNT_UNIQUE_ID_OFFSET)
+		return -EINVAL;
+
+	kls->last_mnt_id = last_mnt_id;
+
+	kls->nr_mnt_ids = nr_mnt_ids;
+	kls->kmnt_ids = kvmalloc_array(nr_mnt_ids, sizeof(*kls->kmnt_ids),
+				       GFP_KERNEL_ACCOUNT);
+	if (!kls->kmnt_ids)
+		return -ENOMEM;
+
+	kls->ns = grab_requested_mnt_ns(kreq);
+	if (!kls->ns)
+		return -ENOENT;
+
+	kls->mnt_parent_id = kreq->mnt_id;
+	return 0;
+}
+
 SYSCALL_DEFINE4(listmount, const struct mnt_id_req __user *, req,
 		u64 __user *, mnt_ids, size_t, nr_mnt_ids, unsigned int, flags)
 {
-	u64 *kmnt_ids __free(kvfree) = NULL;
+	struct klistmount kls __free(klistmount_free) = {};
 	const size_t maxcount = 1000000;
-	struct mnt_namespace *ns __free(mnt_ns_release) = NULL;
 	struct mnt_id_req kreq;
-	u64 last_mnt_id;
 	ssize_t ret;
 
 	if (flags & ~LISTMOUNT_REVERSE)
···
 	if (ret)
 		return ret;
 
-	last_mnt_id = kreq.param;
-	/* The first valid unique mount id is MNT_UNIQUE_ID_OFFSET + 1. */
-	if (last_mnt_id != 0 && last_mnt_id <= MNT_UNIQUE_ID_OFFSET)
-		return -EINVAL;
+	ret = prepare_klistmount(&kls, &kreq, nr_mnt_ids);
+	if (ret)
+		return ret;
 
-	kmnt_ids = kvmalloc_array(nr_mnt_ids, sizeof(*kmnt_ids),
-				  GFP_KERNEL_ACCOUNT);
-	if (!kmnt_ids)
-		return -ENOMEM;
-
-	ns = grab_requested_mnt_ns(&kreq);
-	if (!ns)
-		return -ENOENT;
-
-	if (kreq.mnt_ns_id && (ns != current->nsproxy->mnt_ns) &&
-	    !ns_capable_noaudit(ns->user_ns, CAP_SYS_ADMIN))
+	if (kreq.mnt_ns_id && (kls.ns != current->nsproxy->mnt_ns) &&
+	    !ns_capable_noaudit(kls.ns->user_ns, CAP_SYS_ADMIN))
 		return -ENOENT;
 
 	/*
···
 	 * listmount() doesn't care about any mount properties.
 	 */
 	scoped_guard(rwsem_read, &namespace_sem)
-		ret = do_listmount(ns, kreq.mnt_id, last_mnt_id, kmnt_ids,
-				   nr_mnt_ids, (flags & LISTMOUNT_REVERSE));
+		ret = do_listmount(&kls, (flags & LISTMOUNT_REVERSE));
 	if (ret <= 0)
 		return ret;
 
-	if (copy_to_user(mnt_ids, kmnt_ids, ret * sizeof(*mnt_ids)))
+	if (copy_to_user(mnt_ids, kls.kmnt_ids, ret * sizeof(*mnt_ids)))
 		return -EFAULT;
 
 	return ret;
···
 	struct mnt_namespace *ns;
 	struct path root;
 
-	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
+	mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", initramfs_options);
 	if (IS_ERR(mnt))
 		panic("Can't create rootfs");
 
+1 -1
fs/nfs/inode.c
···

 int nfs_drop_inode(struct inode *inode)
 {
-	return NFS_STALE(inode) || generic_drop_inode(inode);
+	return NFS_STALE(inode) || inode_generic_drop(inode);
 }
 EXPORT_SYMBOL_GPL(nfs_drop_inode);
+1 -1
fs/ocfs2/dlmfs/dlmfs.c
···
 	.alloc_inode = dlmfs_alloc_inode,
 	.free_inode = dlmfs_free_inode,
 	.evict_inode = dlmfs_evict_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 };
 
 static const struct inode_operations dlmfs_file_inode_operations = {
+1 -1
fs/orangefs/super.c
···
 	.free_inode = orangefs_free_inode,
 	.destroy_inode = orangefs_destroy_inode,
 	.write_inode = orangefs_write_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.statfs = orangefs_statfs,
 	.show_options = orangefs_show_options,
 };
+1 -1
fs/overlayfs/super.c
···
 	.alloc_inode = ovl_alloc_inode,
 	.free_inode = ovl_free_inode,
 	.destroy_inode = ovl_destroy_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.put_super = ovl_put_super,
 	.sync_fs = ovl_sync_fs,
 	.statfs = ovl_statfs,
+1 -1
fs/pidfs.c
···
 }
 
 static const struct super_operations pidfs_sops = {
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.evict_inode = pidfs_evict_inode,
 	.statfs = simple_statfs,
 };
+4 -2
fs/pipe.c
···
 	mutex_lock(&pipe->mutex);
 
 	if (!pipe->readers) {
-		send_sig(SIGPIPE, current, 0);
+		if ((iocb->ki_flags & IOCB_NOSIGNAL) == 0)
+			send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
 		goto out;
 	}
···

 	for (;;) {
 		if (!pipe->readers) {
-			send_sig(SIGPIPE, current, 0);
+			if ((iocb->ki_flags & IOCB_NOSIGNAL) == 0)
+				send_sig(SIGPIPE, current, 0);
 			if (!ret)
 				ret = -EPIPE;
 			break;
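From userspace, the pipe change above would be exercised roughly as follows. This is a minimal sketch, not code from the series: `write_without_sigpipe()` and the handler are hypothetical names, the fallback `RWF_NOSIGNAL` value is copied from the uapi hunk in this merge because older libc headers do not define it, and kernels without the flag are expected to reject the call rather than write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Value copied from the include/uapi/linux/fs.h hunk; older headers lack it. */
#ifndef RWF_NOSIGNAL
#define RWF_NOSIGNAL 0x00000100
#endif

static volatile sig_atomic_t sigpipe_seen;

static void on_sigpipe(int sig)
{
	(void)sig;
	sigpipe_seen = 1;
}

/*
 * Hypothetical helper: write to a pipe with no readers while asking the
 * kernel not to raise SIGPIPE.
 * Returns 0 on EPIPE without a signal, 1 if the running kernel rejects the
 * flag, and -1 on anything unexpected.
 */
int write_without_sigpipe(void)
{
	char msg[] = "hello";
	struct iovec iov = { .iov_base = msg, .iov_len = sizeof(msg) - 1 };
	struct sigaction sa, old;
	int fds[2], err;
	ssize_t n;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigpipe;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGPIPE, &sa, &old) < 0)
		return -1;
	sigpipe_seen = 0;

	if (pipe(fds) < 0) {
		sigaction(SIGPIPE, &old, NULL);
		return -1;
	}
	close(fds[0]);		/* no readers left */

	/* Offset -1 means "use the current position", which pipes require. */
	n = pwritev2(fds[1], &iov, 1, -1, RWF_NOSIGNAL);
	err = errno;

	close(fds[1]);
	sigaction(SIGPIPE, &old, NULL);

	if (n >= 0)
		return -1;	/* a write with no readers should not succeed */
	if (err != EPIPE)
		return 1;	/* e.g. EOPNOTSUPP on kernels without the flag */
	return sigpipe_seen ? -1 : 0;
}
```

A caller that gets 1 back would need to fall back to blocking or ignoring SIGPIPE itself, which is the workaround the flag is meant to replace.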
+1 -1
fs/proc/inode.c
···
 const struct super_operations proc_sops = {
 	.alloc_inode = proc_alloc_inode,
 	.free_inode = proc_free_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.evict_inode = proc_evict_inode,
 	.statfs = simple_statfs,
 	.show_options = proc_show_options,
+3 -3
fs/proc/namespaces.c
···
 #include "internal.h"
 
 
-static const struct proc_ns_operations *ns_entries[] = {
+static const struct proc_ns_operations *const ns_entries[] = {
 #ifdef CONFIG_NET_NS
 	&netns_operations,
 #endif
···
 static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct task_struct *task = get_proc_task(file_inode(file));
-	const struct proc_ns_operations **entry, **last;
+	const struct proc_ns_operations *const *entry, *const *last;
 
 	if (!task)
 		return -ENOENT;
···
 					struct dentry *dentry, unsigned int flags)
 {
 	struct task_struct *task = get_proc_task(dir);
-	const struct proc_ns_operations **entry, **last;
+	const struct proc_ns_operations *const *entry, *const *last;
 	unsigned int len = dentry->d_name.len;
 	struct dentry *res = ERR_PTR(-ENOENT);
 
+92 -6
fs/proc/root.c
···
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+	Opt_pidns,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
-	fsparam_u32("gid", Opt_gid),
+	fsparam_u32("gid", Opt_gid),
 	fsparam_string("hidepid", Opt_hidepid),
 	fsparam_string("subset", Opt_subset),
+	fsparam_file_or_string("pidns", Opt_pidns),
 	{}
 };
 
···
 	return 0;
 }
 
+#ifdef CONFIG_PID_NS
+static int proc_parse_pidns_param(struct fs_context *fc,
+				  struct fs_parameter *param,
+				  struct fs_parse_result *result)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *target, *active = task_active_pid_ns(current);
+	struct ns_common *ns;
+	struct file *ns_filp __free(fput) = NULL;
+
+	switch (param->type) {
+	case fs_value_is_file:
+		/* came through fsconfig, steal the file reference */
+		ns_filp = no_free_ptr(param->file);
+		break;
+	case fs_value_is_string:
+		ns_filp = filp_open(param->string, O_RDONLY, 0);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		break;
+	}
+	if (!ns_filp)
+		ns_filp = ERR_PTR(-EBADF);
+	if (IS_ERR(ns_filp)) {
+		errorfc(fc, "could not get file from pidns argument");
+		return PTR_ERR(ns_filp);
+	}
+
+	if (!proc_ns_file(ns_filp))
+		return invalfc(fc, "pidns argument is not an nsfs file");
+	ns = get_proc_ns(file_inode(ns_filp));
+	if (ns->ops->type != CLONE_NEWPID)
+		return invalfc(fc, "pidns argument is not a pidns file");
+	target = container_of(ns, struct pid_namespace, ns);
+
+	/*
+	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
+	 * permission model should be the same as pidns_install().
+	 */
+	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
+		errorfc(fc, "insufficient permissions to set pidns");
+		return -EPERM;
+	}
+	if (!pidns_is_ancestor(target, active))
+		return invalfc(fc, "cannot set pidns to non-descendant pidns");
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(target);
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	return 0;
+}
+#endif /* CONFIG_PID_NS */
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
 
 	opt = fs_parse(fc, proc_fs_parameters, param, &result);
 	if (opt < 0)
···
 		break;
 
 	case Opt_hidepid:
-		if (proc_parse_hidepid_param(fc, param))
-			return -EINVAL;
+		err = proc_parse_hidepid_param(fc, param);
+		if (err)
+			return err;
 		break;
 
 	case Opt_subset:
-		if (proc_parse_subset_param(fc, param->string) < 0)
-			return -EINVAL;
+		err = proc_parse_subset_param(fc, param->string);
+		if (err)
+			return err;
 		break;
+
+	case Opt_pidns:
+#ifdef CONFIG_PID_NS
+		/*
+		 * We would have to RCU-protect every proc_pid_ns() or
+		 * proc_sb_info() access if we allowed this to be reconfigured
+		 * for an existing procfs instance. Luckily, procfs instances
+		 * are cheap to create, and mount-beneath would let you
+		 * atomically replace an instance even with overmounts.
+		 */
+		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
+			errorfc(fc, "cannot reconfigure pidns for existing procfs");
+			return -EBUSY;
+		}
+		err = proc_parse_pidns_param(fc, param, &result);
+		if (err)
+			return err;
+		break;
+#else
+		errorfc(fc, "pidns mount flag not supported on this system");
+		return -EOPNOTSUPP;
+#endif
 
 	default:
 		return -EINVAL;
···
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_pidns) &&
+	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
+		put_pid_ns(fs_info->pid_ns);
+		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
+1 -1
fs/pstore/inode.c
···

 static const struct super_operations pstore_ops = {
 	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.evict_inode = pstore_evict_inode,
 	.show_options = pstore_show_options,
 };
+1 -1
fs/ramfs/inode.c
···

 static const struct super_operations ramfs_ops = {
 	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.show_options = ramfs_show_options,
 };
+9 -5
fs/read_write.c
···
 	if (len == 0)
 		return 0;
 
+	/*
+	 * Make sure return value doesn't overflow in 32bit compat mode. Also
+	 * limit the size for all cases except when calling ->copy_file_range().
+	 */
+	if (splice || !file_out->f_op->copy_file_range || in_compat_syscall())
+		len = min_t(size_t, MAX_RW_COUNT, len);
+
 	file_start_write(file_out);
 
 	/*
···
 					      len, flags);
 	} else if (!splice && file_in->f_op->remap_file_range && samesb) {
 		ret = file_in->f_op->remap_file_range(file_in, pos_in,
-				file_out, pos_out,
-				min_t(loff_t, MAX_RW_COUNT, len),
-				REMAP_FILE_CAN_SHORTEN);
+				file_out, pos_out, len, REMAP_FILE_CAN_SHORTEN);
 		/* fallback to splice */
 		if (ret <= 0)
 			splice = true;
···
 	 * to splicing from input file, while file_start_write() is held on
 	 * the output file on a different sb.
 	 */
-	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
-			       min_t(size_t, len, MAX_RW_COUNT), 0);
+	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
 done:
 	if (ret > 0) {
 		fsnotify_access(file_in);
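The effect of hoisting the clamp can be modelled in plain C. The sketch below is illustrative only: `clamp_copy_len()` and its boolean parameters are hypothetical stand-ins for the kernel's `splice`, `->copy_file_range` and `in_compat_syscall()` checks, and `MAX_RW_COUNT` is computed assuming 4 KiB pages.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel defines this as INT_MAX & PAGE_MASK. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_MAX_RW_COUNT ((size_t)INT_MAX & ~(MODEL_PAGE_SIZE - 1))

/*
 * Hypothetical model of the single up-front clamp: the length is left
 * untouched only when a native ->copy_file_range() will be called from a
 * non-compat syscall; every other path sees at most MAX_RW_COUNT bytes.
 */
size_t clamp_copy_len(size_t len, bool splice, bool has_copy_file_range,
		      bool compat)
{
	if (splice || !has_copy_file_range || compat)
		return len < MODEL_MAX_RW_COUNT ? len : MODEL_MAX_RW_COUNT;
	return len;
}
```

With the clamp applied once, the later `remap_file_range()` and `do_splice_direct()` calls can pass `len` through unchanged, which is what the removed `min_t()` wrappers reflect.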
+1 -1
fs/smb/client/cifsfs.c
···

 	/* no serverino => unconditional eviction */
 	return !(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SERVER_INUM) ||
-		generic_drop_inode(inode);
+		inode_generic_drop(inode);
 }
 
 static const struct super_operations cifs_super_ops = {
+5 -3
fs/super.c
···
 				      sb->s_id);
 	if (!wq)
 		return -ENOMEM;
+
+	old = NULL;
 	/*
 	 * This has to be atomic as more DIOs can race to create the workqueue
 	 */
-	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
-	/* Someone created workqueue before us? Free ours... */
-	if (old)
+	if (!try_cmpxchg(&sb->s_dio_done_wq, &old, wq)) {
+		/* Someone created workqueue before us? Free ours... */
 		destroy_workqueue(wq);
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sb_init_dio_done_wq);
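The `cmpxchg()` to `try_cmpxchg()` conversion has a close userspace analogue in C11's `atomic_compare_exchange_strong()`, which also returns a boolean and writes the observed value back into the expected slot on failure. The sketch below assumes that analogy; `struct workqueue`, `shared_wq`, `install_wq_once()` and `demo_install_twice()` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct workqueue { int id; };		/* hypothetical stand-in */

static _Atomic(struct workqueue *) shared_wq;	/* starts out NULL */

/*
 * Install wq only if nothing is installed yet. On failure, "old" holds the
 * pointer that won the race and our own allocation is released.
 */
bool install_wq_once(struct workqueue *wq)
{
	struct workqueue *old = NULL;

	if (!atomic_compare_exchange_strong(&shared_wq, &old, wq)) {
		assert(old != NULL);	/* someone installed theirs first */
		free(wq);
		return false;
	}
	return true;
}

/* Two installs in sequence: the first wins, the second is discarded. */
int demo_install_twice(void)
{
	struct workqueue *a = malloc(sizeof(*a));
	struct workqueue *b = malloc(sizeof(*b));
	bool first, second;

	if (!a || !b) {
		free(a);
		free(b);
		return -1;
	}
	first = install_wq_once(a);
	second = install_wq_once(b);	/* frees b */
	return first && !second && atomic_load(&shared_wq) == a;
}
```

The boolean form lets the success test and the exchange share one expression, which is why the separate `if (old)` check disappears in the hunk.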
+1 -1
fs/ubifs/super.c
···

 static int ubifs_drop_inode(struct inode *inode)
 {
-	int drop = generic_drop_inode(inode);
+	int drop = inode_generic_drop(inode);
 
 	if (!drop)
 		drop = fscrypt_drop_inode(inode);
+1 -1
fs/xfs/xfs_super.c
···
 		return 0;
 	}
 
-	return generic_drop_inode(inode);
+	return inode_generic_drop(inode);
 }
 
 STATIC void
+4 -6
include/linux/fs.h
···
 #define IOCB_APPEND (__force int) RWF_APPEND
 #define IOCB_ATOMIC (__force int) RWF_ATOMIC
 #define IOCB_DONTCACHE (__force int) RWF_DONTCACHE
+#define IOCB_NOSIGNAL (__force int) RWF_NOSIGNAL
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD (1 << 16)
···
 int vfs_fchmod(struct file *file, umode_t mode);
 int vfs_utimes(const struct path *path, struct timespec64 *times);
 
-int vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
-
 #ifdef CONFIG_COMPAT
 extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
 					unsigned long arg);
···

 /**
  * name_contains_dotdot - check if a file name contains ".." path components
- *
+ * @name: File path string to check
  * Search for ".." surrounded by either '/' or start/end of string.
  */
 static inline bool name_contains_dotdot(const char *name)
···
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
 extern int inode_needs_sync(struct inode *inode);
-extern int generic_delete_inode(struct inode *inode);
-static inline int generic_drop_inode(struct inode *inode)
+extern int inode_just_drop(struct inode *inode);
+static inline int inode_generic_drop(struct inode *inode)
 {
 	return !inode->i_nlink || inode_unhashed(inode);
 }
···
 extern struct inode *new_inode(struct super_block *sb);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int setattr_should_drop_suidgid(struct mnt_idmap *, struct inode *);
-extern int file_remove_privs_flags(struct file *file, unsigned int flags);
 extern int file_remove_privs(struct file *);
 int setattr_should_drop_sgid(struct mnt_idmap *idmap,
 			     const struct inode *inode);
+9
include/linux/pid_namespace.h
···
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
 
+extern bool pidns_is_ancestor(struct pid_namespace *child,
+			      struct pid_namespace *ancestor);
+
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
 
···
 static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
+}
+
+static inline bool pidns_is_ancestor(struct pid_namespace *child,
+				     struct pid_namespace *ancestor)
+{
+	return false;
 }
 #endif /* CONFIG_PID_NS */
 
+2 -1
include/trace/events/filelock.h
···
 		{ FL_SLEEP, "FL_SLEEP" }, \
 		{ FL_DOWNGRADE_PENDING, "FL_DOWNGRADE_PENDING" }, \
 		{ FL_UNLOCK_PENDING, "FL_UNLOCK_PENDING" }, \
-		{ FL_OFDLCK, "FL_OFDLCK" })
+		{ FL_OFDLCK, "FL_OFDLCK" }, \
+		{ FL_RECLAIM, "FL_RECLAIM"})
 
 #define show_fl_type(val) \
 	__print_symbolic(val, \
+4 -1
include/uapi/linux/fs.h
···
 /* buffered IO that drops the cache after reading or writing data */
 #define RWF_DONTCACHE ((__force __kernel_rwf_t)0x00000080)
 
+/* prevent pipe and socket writes from raising SIGPIPE */
+#define RWF_NOSIGNAL ((__force __kernel_rwf_t)0x00000100)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
-			 RWF_DONTCACHE)
+			 RWF_DONTCACHE | RWF_NOSIGNAL)
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
+1
init/Kconfig
···

 config INITRAMFS_PRESERVE_MTIME
 	bool "Preserve cpio archive mtimes in initramfs"
+	depends on BLK_DEV_INITRD
 	default y
 	help
 	  Each entry in an initramfs cpio archive carries an mtime value. When
+6 -8
init/do_mounts_rd.c
···
 #include <uapi/linux/cramfs_fs.h>
 #include <linux/initrd.h>
 #include <linux/string.h>
+#include <linux/string_choices.h>
 #include <linux/slab.h>
 
 #include "do_mounts.h"
···
 int __init rd_load_image(char *from)
 {
 	int res = 0;
-	unsigned long rd_blocks, devblocks;
+	unsigned long rd_blocks, devblocks, nr_disks;
 	int nblocks, i;
 	char *buf = NULL;
 	unsigned short rotate = 0;
 	decompress_fn decompressor = NULL;
-#if !defined(CONFIG_S390)
 	char rotator[4] = { '|' , '/' , '-' , '\\' };
-#endif
 
 	out_file = filp_open("/dev/ram", O_RDWR, 0);
 	if (IS_ERR(out_file))
···
 		goto done;
 	}
 
-	printk(KERN_NOTICE "RAMDISK: Loading %dKiB [%ld disk%s] into ram disk... ",
-		nblocks, ((nblocks-1)/devblocks)+1, nblocks>devblocks ? "s" : "");
+	nr_disks = (nblocks - 1) / devblocks + 1;
+	pr_notice("RAMDISK: Loading %dKiB [%ld disk%s] into ram disk... ",
+		  nblocks, nr_disks, str_plural(nr_disks));
 	for (i = 0; i < nblocks; i++) {
 		if (i && (i % devblocks == 0)) {
 			pr_cont("done disk #1.\n");
···
 		}
 		kernel_read(in_file, buf, BLOCK_SIZE, &in_pos);
 		kernel_write(out_file, buf, BLOCK_SIZE, &out_pos);
-#if !defined(CONFIG_S390)
-		if (!(i % 16)) {
+		if (!IS_ENABLED(CONFIG_S390) && !(i % 16)) {
 			pr_cont("%c\b", rotator[rotate & 0x3]);
 			rotate++;
 		}
-#endif
 	}
 	pr_cont("done.\n");
 
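The banner arithmetic in the hunk above is a rounded-up division followed by a plural suffix chosen from the resulting count. A small userspace sketch, assuming positive inputs; `ramdisk_nr_disks()` and `plural_suffix()` are hypothetical names, the latter mirroring what `str_plural()` returns.

```c
#include <assert.h>

/* Mirrors str_plural(): empty suffix for exactly one item, "s" otherwise. */
const char *plural_suffix(unsigned long n)
{
	return n == 1 ? "" : "s";
}

/* Number of disks needed to hold nblocks blocks, devblocks per disk. */
unsigned long ramdisk_nr_disks(int nblocks, unsigned long devblocks)
{
	assert(nblocks > 0 && devblocks > 0);
	return ((unsigned long)nblocks - 1) / devblocks + 1;
}
```

Because the suffix is now derived from `nr_disks` rather than from a separate `nblocks > devblocks` comparison, the count and the plural form cannot drift apart.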
+3 -2
init/initramfs.c
···
 #include <linux/init_syscalls.h>
 #include <linux/umh.h>
 #include <linux/security.h>
+#include <linux/overflow.h>
 
 #include "do_mounts.h"
 #include "initramfs_internal.h"
···
 	q->minor = minor;
 	q->ino = ino;
 	q->mode = mode;
-	strcpy(q->name, name);
+	strscpy(q->name, name);
 	q->next = NULL;
 	*p = q;
 	hardlink_seen = true;
···
 {
 	struct dir_entry *de;
 
-	de = kmalloc(sizeof(struct dir_entry) + nlen, GFP_KERNEL);
+	de = kmalloc(struct_size(de, name, nlen), GFP_KERNEL);
 	if (!de)
 		panic_show_mem("can't allocate dir_entry buffer");
 	INIT_LIST_HEAD(&de->list);
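The point of `struct_size()` over an open-coded `sizeof(...) + nlen` is that the addition saturates instead of wrapping. A userspace sketch of the same idea, under the assumption that a saturated size is treated as an allocation failure; this `struct dir_entry` layout, `dir_entry_size()` and `dir_entry_new()` are simplified, hypothetical stand-ins for the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct dir_entry {
	long mtime;
	char name[];		/* flexible array member */
};

/* Header plus nlen name bytes, saturating to SIZE_MAX on overflow. */
size_t dir_entry_size(size_t nlen)
{
	if (nlen > SIZE_MAX - sizeof(struct dir_entry))
		return SIZE_MAX;
	return sizeof(struct dir_entry) + nlen;
}

/* Allocate an entry holding a copy of name; NULL on overflow or OOM. */
struct dir_entry *dir_entry_new(const char *name, long mtime)
{
	size_t nlen = strlen(name) + 1;
	size_t size = dir_entry_size(nlen);
	struct dir_entry *de;

	if (size == SIZE_MAX)
		return NULL;
	de = malloc(size);
	if (!de)
		return NULL;
	de->mtime = mtime;
	memcpy(de->name, name, nlen);
	return de;
}

/* Round trip: the stored name should compare equal to the input. */
int demo_dir_entry(void)
{
	struct dir_entry *de = dir_entry_new("etc", 42);
	int ok = de && strcmp(de->name, "etc") == 0 && de->mtime == 42;

	free(de);
	return ok;
}
```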
+1 -1
kernel/bpf/inode.c
···

 const struct super_operations bpf_super_ops = {
 	.statfs = simple_statfs,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.show_options = bpf_show_options,
 	.free_inode = bpf_free_inode,
 };
+1 -1
kernel/pid.c
···
 		container_of(head->set, struct pid_namespace, set);
 	int mode = table->mode;
 
-	if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) ||
+	if (ns_capable_noaudit(pidns->user_ns, CAP_SYS_ADMIN) ||
 	    uid_eq(current_euid(), make_kuid(pidns->user_ns, 0)))
 		mode = (mode & S_IRWXU) >> 6;
 	else if (in_egroup_p(make_kgid(pidns->user_ns, 0)))
+14 -8
kernel/pid_namespace.c
···
 	put_pid_ns(to_pid_ns(ns));
 }
 
+bool pidns_is_ancestor(struct pid_namespace *child,
+		       struct pid_namespace *ancestor)
+{
+	struct pid_namespace *ns;
+
+	if (child->level < ancestor->level)
+		return false;
+	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
+		;
+	return ns == ancestor;
+}
+
 static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
 	struct nsproxy *nsproxy = nsset->nsproxy;
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+	struct pid_namespace *new = to_pid_ns(ns);
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
···
 	 * this maintains the property that processes and their
 	 * children can not escape their current pid namespace.
 	 */
-	if (new->level < active->level)
-		return -EINVAL;
-
-	ancestor = new;
-	while (ancestor->level > active->level)
-		ancestor = ancestor->parent;
-	if (ancestor != active)
+	if (!pidns_is_ancestor(new, active))
 		return -EINVAL;
 
 	put_pid_ns(nsproxy->pid_ns_for_children);
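The ancestry walk factored out into `pidns_is_ancestor()` can be tried in isolation on a toy namespace tree. The sketch below copies the loop from the hunk but runs on a hypothetical `struct ns` with only `level` and `parent`; the four static nodes are made-up fixtures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ns {
	unsigned int level;	/* depth below the initial namespace */
	struct ns *parent;
};

/*
 * Walk up from child until we reach the ancestor's level, then compare.
 * A namespace counts as its own ancestor, as in the kernel helper.
 */
bool ns_is_ancestor(const struct ns *child, const struct ns *ancestor)
{
	const struct ns *ns;

	if (child->level < ancestor->level)
		return false;
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* Toy tree: root -> a -> c, and root -> b. */
static struct ns ns_root = { 0, NULL };
static struct ns ns_a = { 1, &ns_root };
static struct ns ns_b = { 1, &ns_root };
static struct ns ns_c = { 2, &ns_a };
```

The level check up front is what keeps the loop from walking past the root: a shallower namespace can never be a descendant, so the walk only runs when `child` is at least as deep as `ancestor`.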
+1 -1
mm/shmem.c
···
 	.get_dquots = shmem_get_dquots,
 #endif
 	.evict_inode = shmem_evict_inode,
-	.drop_inode = generic_delete_inode,
+	.drop_inode = inode_just_drop,
 	.put_super = shmem_put_super,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	.nr_cached_objects = shmem_unused_huge_count,
+3
net/socket.c
···
 	if (sock->type == SOCK_SEQPACKET)
 		msg.msg_flags |= MSG_EOR;
 
+	if (iocb->ki_flags & IOCB_NOSIGNAL)
+		msg.msg_flags |= MSG_NOSIGNAL;
+
 	res = __sock_sendmsg(sock, &msg);
 	*from = msg.msg_iter;
 	return res;
+1
tools/testing/selftests/proc/.gitignore
···
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
+1
tools/testing/selftests/proc/Makefile
···
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
+211
tools/testing/selftests/proc/proc-pidns.c
···
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define ASSERT_ERRNO(expected, _t, seen) \
+	__EXPECT(expected, #expected, \
+		({__typeof__(seen) _tmp_seen = (seen); \
+		  _tmp_seen >= 0 ? _tmp_seen : -errno; }), #seen, _t, 1)
+
+#define ASSERT_ERRNO_EQ(expected, seen) \
+	ASSERT_ERRNO(expected, ==, seen)
+
+#define ASSERT_SUCCESS(seen) \
+	ASSERT_ERRNO(0, <=, seen)
+
+static int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0)
+		return -1;
+	return close(fd);
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_mntns);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(mkdir("/tmp/dummy", 0755));
+	ASSERT_SUCCESS(touch("/tmp/dummy/pidns"));
+	ASSERT_SUCCESS(mkdir("/tmp/dummy/proc", 0755));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_pidns);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_SUCCESS(pid);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy/pidns", NULL, MS_BIND, NULL));
+		ASSERT_SUCCESS(mount("proc", "/tmp/dummy/proc", "proc", 0, NULL));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy/pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->dummy_pidns);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_ERRNO_EQ(-EBUSY, mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy/pidns"));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_HARNESS_MAIN