Merge tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.

Scalability and performance:

- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing

- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks

- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures

- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing

- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch

- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder

Bug fixes and correctness:

- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash

- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract

- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists

- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures

- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern

API modernization:

- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems

- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling

- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns

- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int

- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag

Deprecation:

- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)

Documentation:

- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source

- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()

Testing:

- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX

Misc:

- Add missing <linux/init_task.h> include in fs_struct.c"

* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...

Linus Torvalds 4 months ago 9e355113 3304b3fe

+351 -293

39 changed files


Documentation

filesystems

nfs

exporting.rst

acl.c

btrfs

acl.c

buffer.c

ceph

acl.c

char_dev.c

dcache.c

exec.c

ext4

mballoc.c

resize.c

super.c

fat

dir.c

file.c

file_table.c

fs_struct.c

fuse

acl.c

dev.c

file.c

gfs2

acl.c

inode.c

jfs

acl.c

locks.c

namei.c

namespace.c

ntfs3

xattr.c

orangefs

acl.c

posix_acl.c

select.c

splice.c

include

linux

device_cgroup.h

exportfs.h

filelock.h

fs.h

ns_common_types.h

posix_acl_xattr.h

restart_block.h

init

Kconfig

initramfs_test.c

kernel

pid.c

+4 -36

Documentation/filesystems/nfs/exporting.rst

···

 A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
-super_block. This field must point to a "struct export_operations"
-struct which has the following members:
+super_block. This field must point to a struct export_operations
+which has the following members:

-  encode_fh (mandatory)
-    Takes a dentry and creates a filehandle fragment which may later be used
-    to find or create a dentry for the same object.
-
-  fh_to_dentry (mandatory)
-    Given a filehandle fragment, this should find the implied object and
-    create a dentry for it (possibly with d_obtain_alias).
-
-  fh_to_parent (optional but strongly recommended)
-    Given a filehandle fragment, this should find the parent of the
-    implied object and create a dentry for it (possibly with
-    d_obtain_alias). May fail if the filehandle fragment is too small.
-
-  get_parent (optional but strongly recommended)
-    When given a dentry for a directory, this should return a dentry for
-    the parent. Quite possibly the parent dentry will have been allocated
-    by d_alloc_anon. The default get_parent function just returns an error
-    so any filehandle lookup that requires finding a parent will fail.
-    ->lookup("..") is *not* used as a default as it can leave ".." entries
-    in the dcache which are too messy to work with.
-
-  get_name (optional)
-    When given a parent dentry and a child dentry, this should find a name
-    in the directory identified by the parent dentry, which leads to the
-    object identified by the child dentry. If no get_name function is
-    supplied, a default implementation is provided which uses vfs_readdir
-    to find potential names, and matches inode numbers to find the correct
-    match.
-
-  flags
-    Some filesystems may need to be handled differently than others. The
-    export_operations struct also includes a flags field that allows the
-    filesystem to communicate such information to nfsd. See the Export
-    Operations Flags section below for more explanation.
+.. kernel-doc:: include/linux/exportfs.h
+   :identifiers: struct export_operations

 A filehandle fragment consists of an array of 1 or more 4byte words,
 together with a one byte "type".

+3 -13

fs/9p/acl.c

···
 		if (retval)
 			goto err_out;

-		size = posix_acl_xattr_size(acl->a_count);
-
-		value = kzalloc(size, GFP_NOFS);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
 		if (!value) {
 			retval = -ENOMEM;
 			goto err_out;
 		}
-
-		retval = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (retval < 0)
-			goto err_out;
 	}

 	/*
···
 		return 0;

 	/* Set a setxattr request to server */
-	size = posix_acl_xattr_size(acl->a_count);
-	buffer = kmalloc(size, GFP_KERNEL);
+	buffer = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
 	if (!buffer)
 		return -ENOMEM;
-	retval = posix_acl_to_xattr(&init_user_ns, acl, buffer, size);
-	if (retval < 0)
-		goto err_free_out;
+
 	switch (type) {
 	case ACL_TYPE_ACCESS:
 		name = XATTR_NAME_POSIX_ACL_ACCESS;
···
 		BUG();
 	}
 	retval = v9fs_fid_xattr_set(fid, name, buffer, size, 0);
-err_free_out:
 	kfree(buffer);
 	return retval;
 }

+3 -7

fs/btrfs/acl.c

···
 int __btrfs_set_acl(struct btrfs_trans_handle *trans, struct inode *inode,
 		    struct posix_acl *acl, int type)
 {
-	int ret, size = 0;
+	int ret;
+	size_t size = 0;
 	const char *name;
 	char AUTO_KFREE(value);

···
 	if (acl) {
 		unsigned int nofs_flag;

-		size = posix_acl_xattr_size(acl->a_count);
 		/*
 		 * We're holding a transaction handle, so use a NOFS memory
 		 * allocation context to avoid deadlock if reclaim happens.
 		 */
 		nofs_flag = memalloc_nofs_save();
-		value = kmalloc(size, GFP_KERNEL);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
 		memalloc_nofs_restore(nofs_flag);
 		if (!value)
 			return -ENOMEM;
-
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (ret < 0)
-			return ret;
 	}

 	if (trans)

+5 -1

fs/buffer.c

···
 	if (!head)
 		return false;
 	blocksize = head->b_size;
-	to = min_t(unsigned, folio_size(folio) - from, count);
+	to = min(folio_size(folio) - from, count);
 	to = from + to;
 	if (from < blocksize && to > folio_size(folio) - blocksize)
 		return false;
···
 	BUG_ON(!folio_test_locked(folio));
 	if (folio_test_writeback(folio))
 		return false;
+
+	/* Misconfigured folio check */
+	if (WARN_ON_ONCE(!folio_buffers(folio)))
+		return true;

 	if (mapping == NULL) {		/* can this still happen? */
 		ret = drop_buffers(folio, &buffers_to_free);
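The min_t() to min() conversion in the hunk above matters because min_t() casts both operands to the named type before comparing them. A minimal userspace sketch of that hazard, assuming a 64-bit byte count; `MIN_T`, `clamp_truncating` and `clamp_full_width` are hypothetical stand-ins for illustration, not the kernel macros:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for min_t(): cast both sides, then compare. */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Truncating variant: both operands are squeezed into 32 bits first. */
uint64_t clamp_truncating(uint64_t remaining, uint64_t count)
{
	return MIN_T(uint32_t, remaining, count);
}

/* Full-width variant: compare the values as they are, no cast. */
uint64_t clamp_full_width(uint64_t remaining, uint64_t count)
{
	return remaining < count ? remaining : count;
}
```

With `remaining = 4096` and `count = 0x100000001`, the truncating variant sees `count` as 1 and picks it, while the full-width variant correctly picks 4096.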

+22 -28

fs/ceph/acl.c

···
 int ceph_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 		 struct posix_acl *acl, int type)
 {
-	int ret = 0, size = 0;
+	int ret = 0;
+	size_t size = 0;
 	const char *name = NULL;
 	char *value = NULL;
 	struct iattr newattrs;
···
 	}

 	if (acl) {
-		size = posix_acl_xattr_size(acl->a_count);
-		value = kmalloc(size, GFP_NOFS);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
 		if (!value) {
 			ret = -ENOMEM;
 			goto out;
 		}
-
-		ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (ret < 0)
-			goto out_free;
 	}

 	if (new_mode != old_mode) {
···
 	struct posix_acl *acl, *default_acl;
 	size_t val_size1 = 0, val_size2 = 0;
 	struct ceph_pagelist *pagelist = NULL;
-	void *tmp_buf = NULL;
+	void *tmp_buf1 = NULL, *tmp_buf2 = NULL;
 	int err;

 	err = posix_acl_create(dir, mode, &default_acl, &acl);
···
 	if (!default_acl && !acl)
 		return 0;

-	if (acl)
-		val_size1 = posix_acl_xattr_size(acl->a_count);
-	if (default_acl)
-		val_size2 = posix_acl_xattr_size(default_acl->a_count);
-
 	err = -ENOMEM;
-	tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL);
-	if (!tmp_buf)
-		goto out_err;
 	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
 	if (!pagelist)
 		goto out_err;
···

 	if (acl) {
 		size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS);
+
+		err = -ENOMEM;
+		tmp_buf1 = posix_acl_to_xattr(&init_user_ns, acl,
+					      &val_size1, GFP_KERNEL);
+		if (!tmp_buf1)
+			goto out_err;
 		err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
 		if (err)
 			goto out_err;
 		ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_ACCESS,
 					    len);
-		err = posix_acl_to_xattr(&init_user_ns, acl,
-					 tmp_buf, val_size1);
-		if (err < 0)
-			goto out_err;
 		ceph_pagelist_encode_32(pagelist, val_size1);
-		ceph_pagelist_append(pagelist, tmp_buf, val_size1);
+		ceph_pagelist_append(pagelist, tmp_buf1, val_size1);
 	}
 	if (default_acl) {
 		size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT);
+
+		err = -ENOMEM;
+		tmp_buf2 = posix_acl_to_xattr(&init_user_ns, default_acl,
+					      &val_size2, GFP_KERNEL);
+		if (!tmp_buf2)
+			goto out_err;
 		err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
 		if (err)
 			goto out_err;
 		ceph_pagelist_encode_string(pagelist,
 					  XATTR_NAME_POSIX_ACL_DEFAULT, len);
-		err = posix_acl_to_xattr(&init_user_ns, default_acl,
-					 tmp_buf, val_size2);
-		if (err < 0)
-			goto out_err;
 		ceph_pagelist_encode_32(pagelist, val_size2);
-		ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+		ceph_pagelist_append(pagelist, tmp_buf2, val_size2);
 	}

-	kfree(tmp_buf);
+	kfree(tmp_buf1);
+	kfree(tmp_buf2);

 	as_ctx->acl = acl;
 	as_ctx->default_acl = default_acl;
···
 out_err:
 	posix_acl_release(acl);
 	posix_acl_release(default_acl);
-	kfree(tmp_buf);
+	kfree(tmp_buf1);
+	kfree(tmp_buf2);
 	if (pagelist)
 		ceph_pagelist_release(pagelist);
 	return err;

+8 -11

fs/char_dev.c

···
 #include <linux/kdev_t.h>
 #include <linux/slab.h>
 #include <linux/string.h>
+#include <linux/cleanup.h>

 #include <linux/major.h>
 #include <linux/errno.h>
···
 __register_chrdev_region(unsigned int major, unsigned int baseminor,
 			   int minorct, const char *name)
 {
-	struct char_device_struct *cd, *curr, *prev = NULL;
+	struct char_device_struct *cd __free(kfree) = NULL;
+	struct char_device_struct *curr, *prev = NULL;
 	int ret;
 	int i;

···
 	if (cd == NULL)
 		return ERR_PTR(-ENOMEM);

-	mutex_lock(&chrdevs_lock);
+	guard(mutex)(&chrdevs_lock);

 	if (major == 0) {
 		ret = find_dynamic_major();
 		if (ret < 0) {
 			pr_err("CHRDEV \"%s\" dynamic allocation region is full\n",
 			       name);
-			goto out;
+			return ERR_PTR(ret);
 		}
 		major = ret;
 	}
···
 		if (curr->baseminor >= baseminor + minorct)
 			break;

-		goto out;
+		return ERR_PTR(ret);
 	}

 	cd->major = major;
···
 		prev->next = cd;
 	}

-	mutex_unlock(&chrdevs_lock);
-	return cd;
-out:
-	mutex_unlock(&chrdevs_lock);
-	kfree(cd);
-	return ERR_PTR(ret);
+	return_ptr(cd);
 }

 static struct char_device_struct *
···
 	kfree(cd);
 }

-static DEFINE_SPINLOCK(cdev_lock);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(cdev_lock);

 static struct kobject *cdev_get(struct cdev *p)
 {
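The guard(mutex) and __free(kfree) conversion above replaces the goto-based unwind with scope-exit handlers. A rough userspace sketch of the same idea, assuming a GCC or Clang compiler that supports `__attribute__((cleanup))`; `AUTO_FREE`, `AUTO_UNLOCK`, `register_region` and the counters are hypothetical names, and an integer stands in for the mutex:

```c
#include <assert.h>
#include <stdlib.h>

/* Counters so the effect of the scope-exit handlers can be observed. */
int lock_depth;
int frees_done;

/* Runs when an AUTO_FREE pointer goes out of scope. */
static void auto_free(int **slot)
{
	if (*slot) {
		free(*slot);
		frees_done++;
	}
}

/* Runs when an AUTO_UNLOCK handle goes out of scope. */
static void auto_unlock(int **lock)
{
	(**lock)--;
}

/* Hypothetical analogues of __free(kfree) and guard(mutex). */
#define AUTO_FREE	__attribute__((cleanup(auto_free)))
#define AUTO_UNLOCK	__attribute__((cleanup(auto_unlock)))

int *register_region(int fail)
{
	int *cd AUTO_FREE = malloc(sizeof(*cd));

	if (!cd)
		return NULL;

	lock_depth++;				/* "take" the lock */
	int *held AUTO_UNLOCK = &lock_depth;	/* dropped on every return path */

	if (fail)
		return NULL;	/* cd is freed and the lock dropped automatically */

	*cd = 42;
	int *ret = cd;
	cd = NULL;		/* hand ownership to the caller, like return_ptr() */
	return ret;
}
```

Handlers run in reverse declaration order, so the lock is dropped before the buffer is freed, which matches the order of the removed `out:` path.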

+1 -4

fs/dcache.c

···
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
-	if (!str)
-		return 0;
-	dhash_entries = simple_strtoul(str, &str, 0);
-	return 1;
+	return kstrtoul(str, 0, &dhash_entries) == 0;
 }
 __setup("dhash_entries=", set_dhash_entries);

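The practical difference in the hunk above is that the old parser silently stopped at the first invalid character, while the new one rejects the whole string. A userspace sketch of the two behaviours built on the standard `strtoul()`; `parse_ulong_strict` and `parse_ulong_loose` are hypothetical helpers that only approximate the kernel functions (kstrtoul() has further rules, such as tolerating a single trailing newline, that are not modelled here):

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the whole string or fail.
 * Returns 0 on success, a negative errno value otherwise.
 */
int parse_ulong_strict(const char *s, int base, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would skip whitespace and accept a sign; be stricter. */
	if (!s || !isalnum((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s || *end != '\0')
		return -EINVAL;	/* no digits, or trailing garbage */

	*res = val;
	return 0;
}

/* Loose parsing in the style of the removed code: stops at the first bad char. */
unsigned long parse_ulong_loose(const char *s)
{
	return strtoul(s, NULL, 0);
}
```

A boot parameter such as `dhash_entries=12abc` would be accepted as 12 by the loose variant and rejected by the strict one.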

+1 -1

fs/exec.c

···
 		return -E2BIG;

 	while (len > 0) {
-		unsigned int bytes_to_copy = min_t(unsigned int, len,
+		unsigned int bytes_to_copy = min(len,
 				min_not_zero(offset_in_page(pos), PAGE_SIZE));
 		struct page *page;


+1 -2

fs/ext4/mballoc.c

···
 		 * get the corresponding group metadata to work with.
 		 * For this we have goto again loop.
 		 */
-		thisgrp_len = min_t(unsigned int, (unsigned int)len,
-			EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff));
+		thisgrp_len = min(len, EXT4_BLOCKS_PER_GROUP(sb) - EXT4_C2B(sbi, blkoff));
 		clen = EXT4_NUM_B2C(sbi, thisgrp_len);

 		if (!ext4_sb_block_valid(sb, NULL, block, thisgrp_len)) {

+1 -1

fs/ext4/resize.c

···

 	/* Update the global fs size fields */
 	sbi->s_groups_count += flex_gd->count;
-	sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
+	sbi->s_blockfile_groups = min(sbi->s_groups_count,
 			(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));

 	/* Update the reserved block counts only once the new group is

+1 -1

fs/ext4/super.c

···
 		return -EINVAL;
 	}
 	sbi->s_groups_count = blocks_count;
-	sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
+	sbi->s_blockfile_groups = min(sbi->s_groups_count,
 			(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
 	if (((u64)sbi->s_groups_count * sbi->s_inodes_per_group) !=
 	    le32_to_cpu(es->s_inodes_count)) {

+2 -2

fs/fat/dir.c

···

 	/* Fill the long name slots. */
 	for (i = 0; i < long_bhs; i++) {
-		int copy = min_t(int, sb->s_blocksize - offset, size);
+		int copy = umin(sb->s_blocksize - offset, size);
 		memcpy(bhs[i]->b_data + offset, slots, copy);
 		mark_buffer_dirty_inode(bhs[i], dir);
 		offset = 0;
···
 	err = fat_sync_bhs(bhs, long_bhs);
 	if (!err && i < nr_bhs) {
 		/* Fill the short name slot. */
-		int copy = min_t(int, sb->s_blocksize - offset, size);
+		int copy = umin(sb->s_blocksize - offset, size);
 		memcpy(bhs[i]->b_data + offset, slots, copy);
 		mark_buffer_dirty_inode(bhs[i], dir);
 		if (IS_DIRSYNC(dir))

+1 -2

fs/fat/file.c

···
 	if (copy_from_user(&range, user_range, sizeof(range)))
 		return -EFAULT;

-	range.minlen = max_t(unsigned int, range.minlen,
-			     bdev_discard_granularity(sb->s_bdev));
+	range.minlen = max(range.minlen, bdev_discard_granularity(sb->s_bdev));

 	err = fat_trim_fs(inode, &range);
 	if (err < 0)

+5 -5

fs/file_table.c

···

 	f->f_flags = flags;
 	f->f_mode = OPEN_FMODE(flags);
+	/*
+	 * Disable permission and pre-content events for all files by default.
+	 * They may be enabled later by fsnotify_open_perm_and_set_mode().
+	 */
+	file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM);

 	f->f_op = NULL;
 	f->f_mapping = NULL;
···
 	 * refcount bumps we should reinitialize the reused file first.
 	 */
 	file_ref_init(&f->f_ref, 1);
-	/*
-	 * Disable permission and pre-content events for all files by default.
-	 * They may be enabled later by fsnotify_open_perm_and_set_mode().
-	 */
-	file_set_fsnotify_mode(f, FMODE_NONOTIFY_PERM);
 	return 0;
 }


fs/fs_struct.c

···
 #include <linux/path.h>
 #include <linux/slab.h>
 #include <linux/fs_struct.h>
+#include <linux/init_task.h>
 #include "internal.h"

 /*

+4 -8

fs/fuse/acl.c

···
 		 * them to be refreshed the next time they are used,
 		 * and it also updates i_ctime.
 		 */
-		size_t size = posix_acl_xattr_size(acl->a_count);
+		size_t size;
 		void *value;

-		if (size > PAGE_SIZE)
-			return -E2BIG;
-
-		value = kmalloc(size, GFP_KERNEL);
+		value = posix_acl_to_xattr(fc->user_ns, acl, &size, GFP_KERNEL);
 		if (!value)
 			return -ENOMEM;

-		ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
-		if (ret < 0) {
+		if (size > PAGE_SIZE) {
 			kfree(value);
-			return ret;
+			return -E2BIG;
 		}

 		/*

+1 -1

fs/fuse/dev.c

···
 			goto out_iput;

 		folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
-		nr_bytes = min_t(unsigned, num, folio_size(folio) - folio_offset);
+		nr_bytes = min(num, folio_size(folio) - folio_offset);
 		nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;

 		err = fuse_copy_folio(cs, &folio, folio_offset, nr_bytes, 0);

+3 -5

fs/fuse/file.c

···
 static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,
 				     unsigned int max_pages)
 {
-	return min_t(unsigned int,
-		     ((pos + len - 1) >> PAGE_SHIFT) -
-		     (pos >> PAGE_SHIFT) + 1,
-		     max_pages);
+	return min(((pos + len - 1) >> PAGE_SHIFT) - (pos >> PAGE_SHIFT) + 1,
+		   max_pages);
 }

 static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
···
 			struct folio *folio = page_folio(pages[i]);
 			unsigned int offset = start +
 				(folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
-			unsigned int len = min_t(unsigned int, ret, PAGE_SIZE - start);
+			unsigned int len = umin(ret, PAGE_SIZE - start);

 			ap->descs[ap->num_folios].offset = offset;
 			ap->descs[ap->num_folios].length = len;

+3 -10

fs/gfs2/acl.c

···
 int __gfs2_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
 	int error;
-	size_t len;
-	char *data;
+	size_t len = 0;
+	char *data = NULL;
 	const char *name = gfs2_acl_name(type);

 	if (acl) {
-		len = posix_acl_xattr_size(acl->a_count);
-		data = kmalloc(len, GFP_NOFS);
+		data = posix_acl_to_xattr(&init_user_ns, acl, &len, GFP_NOFS);
 		if (data == NULL)
 			return -ENOMEM;
-		error = posix_acl_to_xattr(&init_user_ns, acl, data, len);
-		if (error < 0)
-			goto out;
-	} else {
-		data = NULL;
-		len = 0;
 	}

 	error = __gfs2_xattr_set(inode, name, data, len, 0, GFS2_EATYPE_SYS);

+59 -34

fs/inode.c

···
 	return freed;
 }

-static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked);
+static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked);
+
 /*
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data, bool is_inode_hash_locked,
+				void *data, bool hash_locked,
 				bool *isnew)
 {
 	struct inode *inode = NULL;

-	if (is_inode_hash_locked)
+	if (hash_locked)
 		lockdep_assert_held(&inode_hash_lock);
 	else
 		lockdep_assert_not_held(&inode_hash_lock);
···
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode, is_inode_hash_locked);
+			__wait_on_freeing_inode(inode, hash_locked, true);
 			goto repeat;
 		}
 		if (unlikely(inode_state_read(inode) & I_CREATING)) {
···
  */
 static struct inode *find_inode_fast(struct super_block *sb,
 				struct hlist_head *head, unsigned long ino,
-				bool is_inode_hash_locked, bool *isnew)
+				bool hash_locked, bool *isnew)
 {
 	struct inode *inode = NULL;

-	if (is_inode_hash_locked)
+	if (hash_locked)
 		lockdep_assert_held(&inode_hash_lock);
 	else
 		lockdep_assert_not_held(&inode_hash_lock);
···
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode, is_inode_hash_locked);
+			__wait_on_freeing_inode(inode, hash_locked, true);
 			goto repeat;
 		}
 		if (unlikely(inode_state_read(inode) & I_CREATING)) {
···
 	while (1) {
 		struct inode *old = NULL;
 		spin_lock(&inode_hash_lock);
+repeat:
 		hlist_for_each_entry(old, head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
 				continue;
 			spin_lock(&old->i_lock);
-			if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) {
-				spin_unlock(&old->i_lock);
-				continue;
-			}
 			break;
 		}
 		if (likely(!old)) {
···
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
+		}
+		if (inode_state_read(old) & (I_FREEING | I_WILL_FREE)) {
+			__wait_on_freeing_inode(old, true, false);
+			old = NULL;
+			goto repeat;
 		}
 		if (unlikely(inode_state_read(old) & I_CREATING)) {
 			spin_unlock(&old->i_lock);
···
  * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
  * will DTRT.
  */
-static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
+static void __wait_on_freeing_inode(struct inode *inode, bool hash_locked, bool rcu_locked)
 {
 	struct wait_bit_queue_entry wqe;
 	struct wait_queue_head *wq_head;
+
+	VFS_BUG_ON(!hash_locked && !rcu_locked);

 	/*
 	 * Handle racing against evict(), see that routine for more details.
 	 */
 	if (unlikely(inode_unhashed(inode))) {
-		WARN_ON(is_inode_hash_locked);
+		WARN_ON(hash_locked);
 		spin_unlock(&inode->i_lock);
 		return;
 	}
···
 	wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
 	prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	rcu_read_unlock();
-	if (is_inode_hash_locked)
+	if (rcu_locked)
+		rcu_read_unlock();
+	if (hash_locked)
 		spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq_head, &wqe.wq_entry);
-	if (is_inode_hash_locked)
+	if (hash_locked)
 		spin_lock(&inode_hash_lock);
-	rcu_read_lock();
+	if (rcu_locked)
+		rcu_read_lock();
 }

 static __initdata unsigned long ihash_entries;
 static int __init set_ihash_entries(char *str)
 {
-	if (!str)
-		return 0;
-	ihash_entries = simple_strtoul(str, &str, 0);
-	return 1;
+	return kstrtoul(str, 0, &ihash_entries) == 0;
 }
 __setup("ihash_entries=", set_ihash_entries);

···
 EXPORT_SYMBOL(mode_strip_sgid);

 #ifdef CONFIG_DEBUG_VFS
-/*
- * Dump an inode.
+/**
+ * dump_inode - dump an inode.
+ * @inode: inode to dump
+ * @reason: reason for dumping
  *
- * TODO: add a proper inode dumping routine, this is a stub to get debug off the
- * ground.
- *
- * TODO: handle getting to fs type with get_kernel_nofault()?
- * See dump_mapping() above.
+ * If inode is an invalid pointer, we don't want to crash accessing it,
+ * so probe everything depending on it carefully with get_kernel_nofault().
  */
 void dump_inode(struct inode *inode, const char *reason)
 {
-	struct super_block *sb = inode->i_sb;
+	struct super_block *sb;
+	struct file_system_type *s_type;
+	const char *fs_name_ptr;
+	char fs_name[32] = {};
+	umode_t mode;
+	unsigned short opflags;
+	unsigned int flags;
+	unsigned int state;
+	int count;

-	pr_warn("%s encountered for inode %px\n"
-		"fs %s mode %ho opflags 0x%hx flags 0x%x state 0x%x count %d\n",
-		reason, inode, sb->s_type->name, inode->i_mode, inode->i_opflags,
-		inode->i_flags, inode_state_read_once(inode), atomic_read(&inode->i_count));
+	if (get_kernel_nofault(sb, &inode->i_sb) ||
+	    get_kernel_nofault(mode, &inode->i_mode) ||
+	    get_kernel_nofault(opflags, &inode->i_opflags) ||
+	    get_kernel_nofault(flags, &inode->i_flags)) {
+		pr_warn("%s: unreadable inode:%px\n", reason, inode);
+		return;
+	}
+
+	state = inode_state_read_once(inode);
+	count = atomic_read(&inode->i_count);
+
+	if (!sb ||
+	    get_kernel_nofault(s_type, &sb->s_type) || !s_type ||
+	    get_kernel_nofault(fs_name_ptr, &s_type->name) || !fs_name_ptr ||
+	    strncpy_from_kernel_nofault(fs_name, fs_name_ptr, sizeof(fs_name) - 1) < 0)
+		strscpy(fs_name, "<unknown, sb unreadable>");
+
+	pr_warn("%s: inode:%px fs:%s mode:%ho opflags:%#x flags:%#x state:%#x count:%d\n",
+		reason, inode, fs_name, mode, opflags, flags, state, count);
 }
-
 EXPORT_SYMBOL(dump_inode);
 #endif

+2 -7

fs/jfs/acl.c

···
 {
 	char *ea_name;
 	int rc;
-	int size = 0;
+	size_t size = 0;
 	char *value = NULL;

 	switch (type) {
···
 	}

 	if (acl) {
-		size = posix_acl_xattr_size(acl->a_count);
-		value = kmalloc(size, GFP_KERNEL);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
 		if (!value)
 			return -ENOMEM;
-		rc = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (rc < 0)
-			goto out;
 	}
 	rc = __jfs_setxattr(tid, inode, ea_name, value, size, 0);
-out:
 	kfree(value);

 	if (!rc)

+12 -2

fs/locks.c

···
 {
 	struct file_lock_context *ctx;

-	/* paired with cmpxchg() below */
 	ctx = locks_inode_context(inode);
 	if (likely(ctx) || type == F_UNLCK)
 		goto out;
···
 	 * Assign the pointer if it's not already assigned. If it is, then
 	 * free the context we just allocated.
 	 */
-	if (cmpxchg(&inode->i_flctx, NULL, ctx)) {
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_opflags & IOP_FLCTX)) {
+		VFS_BUG_ON_INODE(inode->i_flctx, inode);
+		WRITE_ONCE(inode->i_flctx, ctx);
+		/*
+		 * Paired with locks_inode_context().
+		 */
+		smp_store_release(&inode->i_opflags, inode->i_opflags | IOP_FLCTX);
+		spin_unlock(&inode->i_lock);
+	} else {
+		VFS_BUG_ON_INODE(!inode->i_flctx, inode);
+		spin_unlock(&inode->i_lock);
 		kmem_cache_free(flctx_cache, ctx);
 		ctx = locks_inode_context(inode);
 	}
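The hunk above stores the context pointer first and only then publishes a flag with a release store, so a reader that sees the flag is guaranteed to see the pointer. A minimal C11 sketch of that publish pattern; the reader side is an assumption (the matching header change is not shown here) and uses an acquire load as a conservative substitute for the consume-style fence the pull message mentions. `struct node`, `FLAG_HAS_CTX` and both functions are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FLAG_HAS_CTX 0x1u	/* hypothetical analogue of IOP_FLCTX */

struct ctx { int nr_locks; };

struct node {
	atomic_uint opflags;	/* hot field, checked on every lookup */
	struct ctx *ctx;	/* only valid once FLAG_HAS_CTX is visible */
};

/* Writer: callers are assumed to be serialized by a lock (not shown). */
int node_install_ctx(struct node *n, struct ctx *c)
{
	unsigned int flags = atomic_load_explicit(&n->opflags,
						  memory_order_relaxed);

	if (flags & FLAG_HAS_CTX)
		return 0;	/* a context was already installed */

	n->ctx = c;
	/* Release: the pointer store above becomes visible before the flag. */
	atomic_store_explicit(&n->opflags, flags | FLAG_HAS_CTX,
			      memory_order_release);
	return 1;
}

/* Reader: test the flag first, touch the pointer only if it is set. */
struct ctx *node_ctx(struct node *n)
{
	unsigned int flags = atomic_load_explicit(&n->opflags,
						  memory_order_acquire);

	return (flags & FLAG_HAS_CTX) ? n->ctx : NULL;
}
```

In the common case where no context exists, the reader touches only the flags word and never loads the pointer field, which is the access the pull message describes as causing false sharing.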

+3 -5

fs/namei.c

···
 {
 	struct dentry *parent = nd->path.dentry;

-	BUG_ON(!(nd->flags & LOOKUP_RCU));
+	VFS_BUG_ON(!(nd->flags & LOOKUP_RCU));

 	if (unlikely(nd->flags & LOOKUP_CACHED)) {
 		drop_links(nd);
···
 static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
 {
 	int res;
-	BUG_ON(!(nd->flags & LOOKUP_RCU));
+
+	VFS_BUG_ON(!(nd->flags & LOOKUP_RCU));

 	if (unlikely(nd->flags & LOOKUP_CACHED)) {
 		drop_links(nd);
···
 {
 	struct dentry *dentry = path->dentry;
 	unsigned int flags = dentry->d_flags;
-
-	if (likely(!(flags & DCACHE_MANAGED_DENTRY)))
-		return true;

 	if (unlikely(nd->flags & LOOKUP_NO_XDEV))
 		return false;
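Switching from BUG_ON() to VFS_BUG_ON() above means, per the pull message, that the check is only compiled in when CONFIG_DEBUG_VFS is enabled. A small sketch of how such a compile-time gated assertion can be built; `MY_DEBUG`, `DEBUG_ASSERT` and the helper functions are hypothetical, and this is not the kernel's actual definition:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical debug-only assertion. With MY_DEBUG undefined, the condition
 * is still type-checked through sizeof() but never evaluated, so it costs
 * nothing at run time.
 */
#ifdef MY_DEBUG
#define DEBUG_ASSERT(cond) do { if (!(cond)) abort(); } while (0)
#else
#define DEBUG_ASSERT(cond) ((void)sizeof(!(cond)))
#endif

int checks_evaluated;

/* Counts how often the check really runs. */
static int flag_is_set(unsigned int flags, unsigned int bit)
{
	checks_evaluated++;
	return (flags & bit) != 0;
}

/* The caller has already verified bit 0x1, so the assert is redundant here. */
unsigned int lookup_step(unsigned int flags)
{
	DEBUG_ASSERT(flag_is_set(flags, 0x1));
	return flags | 0x2;
}
```

Keeping the expression inside `sizeof()` means a typo in the condition still breaks the build even in non-debug configurations.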

+2 -8

fs/namespace.c

···
 static __initdata unsigned long mhash_entries;
 static int __init set_mhash_entries(char *str)
 {
-	if (!str)
-		return 0;
-	mhash_entries = simple_strtoul(str, &str, 0);
-	return 1;
+	return kstrtoul(str, 0, &mhash_entries) == 0;
 }
 __setup("mhash_entries=", set_mhash_entries);

 static __initdata unsigned long mphash_entries;
 static int __init set_mphash_entries(char *str)
 {
-	if (!str)
-		return 0;
-	mphash_entries = simple_strtoul(str, &str, 0);
-	return 1;
+	return kstrtoul(str, 0, &mphash_entries) == 0;
 }
 __setup("mphash_entries=", set_mphash_entries);


+1 -5

fs/ntfs3/xattr.c

···
 		value = NULL;
 		flags = XATTR_REPLACE;
 	} else {
-		size = posix_acl_xattr_size(acl->a_count);
-		value = kmalloc(size, GFP_NOFS);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_NOFS);
 		if (!value)
 			return -ENOMEM;
-		err = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (err < 0)
-			goto out;
 		flags = 0;
 	}


+1 -7

fs/orangefs/acl.c

···
 		     type);

 	if (acl) {
-		size = posix_acl_xattr_size(acl->a_count);
-		value = kmalloc(size, GFP_KERNEL);
+		value = posix_acl_to_xattr(&init_user_ns, acl, &size, GFP_KERNEL);
 		if (!value)
 			return -ENOMEM;
-
-		error = posix_acl_to_xattr(&init_user_ns, acl, value, size);
-		if (error < 0)
-			goto out;
 	}

 	gossip_debug(GOSSIP_ACL_DEBUG,
···
 	 */
 	error = orangefs_inode_setxattr(inode, name, value, size, 0);

-out:
 	kfree(value);
 	if (!error)
 		set_cached_acl(inode, type, acl);

+11 -10

fs/posix_acl.c

···
 /*
  * Convert from in-memory to extended attribute representation.
  */
-int
+void *
 posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
-		   void *buffer, size_t size)
+		   size_t *sizep, gfp_t gfp)
 {
-	struct posix_acl_xattr_header *ext_acl = buffer;
+	struct posix_acl_xattr_header *ext_acl;
 	struct posix_acl_xattr_entry *ext_entry;
-	int real_size, n;
+	size_t size;
+	int n;

-	real_size = posix_acl_xattr_size(acl->a_count);
-	if (!buffer)
-		return real_size;
-	if (real_size > size)
-		return -ERANGE;
+	size = posix_acl_xattr_size(acl->a_count);
+	ext_acl = kmalloc(size, gfp);
+	if (!ext_acl)
+		return NULL;

 	ext_entry = (void *)(ext_acl + 1);
 	ext_acl->a_version = cpu_to_le32(POSIX_ACL_XATTR_VERSION);
···
 			break;
 		}
 	}
-	return real_size;
+	*sizep = size;
+	return ext_acl;
 }
 EXPORT_SYMBOL (posix_acl_to_xattr);
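Note on the hunk above: the converter now sizes and allocates the buffer itself and hands back both the pointer and the length, so callers no longer need the size/kmalloc/convert/check sequence. A minimal userspace sketch of that calling convention follows. The demo_* types and functions are hypothetical stand-ins, malloc() replaces kmalloc(), and the encoding is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory ACL: up to 4 permission entries. */
struct demo_acl {
	size_t count;
	uint16_t perms[4];
};

/* Encoded form: a 32-bit version header followed by the entries. */
static size_t demo_xattr_size(size_t count)
{
	return sizeof(uint32_t) + count * sizeof(uint16_t);
}

/*
 * Allocating converter: returns a freshly allocated buffer and stores its
 * length in *sizep, or returns NULL if the allocation fails.
 */
static void *demo_acl_to_xattr(const struct demo_acl *acl, size_t *sizep)
{
	const uint32_t version = 2;
	size_t size = demo_xattr_size(acl->count);
	unsigned char *buf = malloc(size);

	if (!buf)
		return NULL;

	memcpy(buf, &version, sizeof(version));
	memcpy(buf + sizeof(version), acl->perms,
	       acl->count * sizeof(uint16_t));
	*sizep = size;
	return buf;
}

/* Caller side: one call, one failure mode, one free. Returns 0 on failure. */
static size_t demo_converted_size(size_t count)
{
	struct demo_acl acl = { .count = count, .perms = { 7, 5, 4, 0 } };
	size_t size = 0;
	void *value = demo_acl_to_xattr(&acl, &size);

	if (!value)
		return 0;
	free(value);
	return size;
}
```

The caller only has to handle a NULL return, which matches how the ntfs3 and orangefs hunks collapse to a single -ENOMEM check.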

+4 -8

fs/select.c

···
 {
 	struct pollfd __user *ufds = restart_block->poll.ufds;
 	int nfds = restart_block->poll.nfds;
-	struct timespec64 *to = NULL, end_time;
+	struct timespec64 *to = NULL;
 	int ret;

-	if (restart_block->poll.has_timeout) {
-		end_time.tv_sec = restart_block->poll.tv_sec;
-		end_time.tv_nsec = restart_block->poll.tv_nsec;
-		to = &end_time;
-	}
+	if (restart_block->poll.has_timeout)
+		to = &restart_block->poll.end_time;

 	ret = do_sys_poll(ufds, nfds, to);

···
 	restart_block->poll.nfds = nfds;

 	if (timeout_msecs >= 0) {
-		restart_block->poll.tv_sec = end_time.tv_sec;
-		restart_block->poll.tv_nsec = end_time.tv_nsec;
+		restart_block->poll.end_time = end_time;
 		restart_block->poll.has_timeout = 1;
 	} else
 		restart_block->poll.has_timeout = 0;
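Note on the hunk above: the old restart block kept the deadline in two unsigned long fields, which are 32 bits wide on 32-bit architectures, so a 64-bit tv_sec could lose its upper half across a restart. The sketch below illustrates the effect in userspace. The demo_* types are hypothetical, and uint32_t is used to model a 32-bit unsigned long regardless of the host.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct timespec64: seconds are 64-bit on every arch. */
struct demo_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Old layout as a 32-bit arch would see it (unsigned long == 32 bits). */
struct demo_poll_old {
	uint32_t tv_sec;
	uint32_t tv_nsec;
};

/* New layout: the whole timespec64 is stored. */
struct demo_poll_new {
	struct demo_timespec64 end_time;
};

/* Save and restore the seconds value through the old layout. */
static int64_t roundtrip_old(int64_t sec)
{
	struct demo_poll_old rb;

	rb.tv_sec = (uint32_t)sec;	/* upper 32 bits are dropped here */
	rb.tv_nsec = 0;
	return (int64_t)rb.tv_sec;
}

/* Save and restore the seconds value through the new layout. */
static int64_t roundtrip_new(int64_t sec)
{
	struct demo_poll_new rb = { .end_time = { .tv_sec = sec, .tv_nsec = 0 } };

	return rb.end_time.tv_sec;
}
```

Small values survive both layouts. Only values that need more than 32 bits are damaged by the old one.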

+1 -1

fs/splice.c

···

 	n = DIV_ROUND_UP(left + start, PAGE_SIZE);
 	for (i = 0; i < n; i++) {
-		int size = min_t(int, left, PAGE_SIZE - start);
+		int size = umin(left, PAGE_SIZE - start);

 		buf.page = pages[i];
 		buf.offset = start;

+1 -1

include/linux/device_cgroup.h

···
 	if (likely(!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode)))
 		return 0;

-	if (likely(!inode->i_rdev))
+	if (!inode->i_rdev)
 		return 0;

 	if (S_ISBLK(inode->i_mode))

+23 -10

include/linux/exportfs.h

···
  * @commit_metadata: commit metadata changes to stable storage
  *
  * See Documentation/filesystems/nfs/exporting.rst for details on how to use
- * this interface correctly.
+ * this interface correctly and the definition of the flags.
  *
- * encode_fh:
+ * @encode_fh:
  *    @encode_fh should store in the file handle fragment @fh (using at most
  *    @max_len bytes) information that can be used by @decode_fh to recover the
  *    file referred to by the &struct dentry @de. If @flag has CONNECTABLE bit
···
  *    greater than @max_len*4 bytes). On error @max_len contains the minimum
  *    size(in 4 byte unit) needed to encode the file handle.
  *
- * fh_to_dentry:
+ * @fh_to_dentry:
  *    @fh_to_dentry is given a &struct super_block (@sb) and a file handle
  *    fragment (@fh, @fh_len). It should return a &struct dentry which refers
  *    to the same file that the file handle fragment refers to. If it cannot,
···
  *    created with d_alloc_root. The caller can then find any other extant
  *    dentries by following the d_alias links.
  *
- * fh_to_parent:
+ * @fh_to_parent:
  *    Same as @fh_to_dentry, except that it returns a pointer to the parent
  *    dentry if it was encoded into the filehandle fragment by @encode_fh.
  *
- * get_name:
+ * @get_name:
  *    @get_name should find a name for the given @child in the given @parent
  *    directory. The name should be stored in the @name (with the
- *    understanding that it is already pointing to a %NAME_MAX+1 sized
+ *    understanding that it is already pointing to a %NAME_MAX + 1 sized
  *    buffer. get_name() should return %0 on success, a negative error code
  *    or error. @get_name will be called without @parent->i_rwsem held.
  *
- * get_parent:
+ * @get_parent:
  *    @get_parent should find the parent directory for the given @child which
  *    is also a directory. In the event that it cannot be found, or storage
  *    space cannot be allocated, a %ERR_PTR should be returned.
  *
- * permission:
+ * @permission:
  *    Allow filesystems to specify a custom permission function.
  *
- * open:
+ * @open:
  *    Allow filesystems to specify a custom open function.
  *
- * commit_metadata:
+ * @commit_metadata:
  *    @commit_metadata should commit metadata changes to stable storage.
+ *
+ * @get_uuid:
+ *    Get a filesystem unique signature exposed to clients.
+ *
+ * @map_blocks:
+ *    Map and, if necessary, allocate blocks for a layout.
+ *
+ * @commit_blocks:
+ *    Commit blocks in a layout once the client is done with them.
+ *
+ * @flags:
+ *    Allows the filesystem to communicate to nfsd that it may want to do things
+ *    differently when dealing with it.
  *
  * Locking rules:
  *    get_parent is called with child->d_inode->i_rwsem down

+14 -4

include/linux/filelock.h

···
 static inline struct file_lock_context *
 locks_inode_context(const struct inode *inode)
 {
-	return smp_load_acquire(&inode->i_flctx);
+	/*
+	 * Paired with smp_store_release in locks_get_lock_context().
+	 *
+	 * Ensures ->i_flctx will be visible if we spotted the flag.
+	 */
+	if (likely(!(smp_load_acquire(&inode->i_opflags) & IOP_FLCTX)))
+		return NULL;
+	return READ_ONCE(inode->i_flctx);
 }

 #else /* !CONFIG_FILE_LOCKING */
···
 	 * could end up racing with tasks trying to set a new lease on this
 	 * file.
 	 */
-	flctx = READ_ONCE(inode->i_flctx);
+	flctx = locks_inode_context(inode);
 	if (!flctx)
 		return 0;
 	smp_mb();
···
 	 * could end up racing with tasks trying to set a new lease on this
 	 * file.
 	 */
-	flctx = READ_ONCE(inode->i_flctx);
+	flctx = locks_inode_context(inode);
 	if (!flctx)
 		return 0;
 	smp_mb();
···

 static inline int break_layout(struct inode *inode, bool wait)
 {
+	struct file_lock_context *flctx;
+
 	smp_mb();
-	if (inode->i_flctx && !list_empty_careful(&inode->i_flctx->flc_lease)) {
+	flctx = locks_inode_context(inode);
+	if (flctx && !list_empty_careful(&flctx->flc_lease)) {
 		unsigned int flags = LEASE_BREAK_LAYOUT;

 		if (!wait)
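Note on the hunk above: readers now test a flag bit first and only touch ->i_flctx when the bit is set, with the writer storing the pointer before publishing the flag. A rough C11 analogue of that publish/observe pairing is sketched below. It is not kernel code: the demo_* names are hypothetical, C11 release/acquire atomics stand in for smp_store_release()/smp_load_acquire(), and writers are assumed to be serialized externally (the kernel takes i_lock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_FLCTX 0x0100u	/* hypothetical flag bit, mirrors IOP_FLCTX */

struct demo_ctx {
	int leases;
};

struct demo_inode {
	_Atomic unsigned int opflags;
	struct demo_ctx *flctx;
};

/* Writer: store the pointer first, then publish the flag with release. */
static void demo_publish_ctx(struct demo_inode *inode, struct demo_ctx *ctx)
{
	unsigned int flags =
		atomic_load_explicit(&inode->opflags, memory_order_relaxed);

	inode->flctx = ctx;
	atomic_store_explicit(&inode->opflags, flags | DEMO_FLCTX,
			      memory_order_release);
}

/* Reader: if the flag is visible (acquire), the pointer store is too. */
static struct demo_ctx *demo_inode_context(struct demo_inode *inode)
{
	if (!(atomic_load_explicit(&inode->opflags, memory_order_acquire) &
	      DEMO_FLCTX))
		return NULL;
	return inode->flctx;
}

/* Single-threaded usage check: returns 1 if the sequence behaves as expected. */
static int demo_check(void)
{
	struct demo_inode inode;
	struct demo_ctx ctx = { .leases = 0 };

	atomic_init(&inode.opflags, 0);
	inode.flctx = NULL;

	if (demo_inode_context(&inode) != NULL)
		return 0;
	demo_publish_ctx(&inode, &ctx);
	return demo_inode_context(&inode) == &ctx;
}
```

In the common case where no lock context exists, the reader returns after loading only the flags word and never reads the pointer field.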

include/linux/fs.h

···
 #define IOP_MGTIME	0x0020
 #define IOP_CACHED_LINK	0x0040
 #define IOP_FASTPERM_MAY_EXEC	0x0080
+#define IOP_FLCTX	0x0100

 /*
  * Inode state bits. Protected by inode->i_lock

+3 -1

include/linux/ns/ns_common_types.h

···
  * @ns_tree: namespace tree nodes and active reference count
  */
 struct ns_common {
+	struct {
+		refcount_t __ns_ref; /* do not use directly */
+	} ____cacheline_aligned_in_smp;
 	u32 ns_type;
 	struct dentry *stashed;
 	const struct proc_ns_operations *ops;
 	unsigned int inum;
-	refcount_t __ns_ref; /* do not use directly */
 	union {
 		struct ns_tree;
 		struct rcu_head ns_rcu;
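Note on the hunk above: wrapping the refcount in its own cacheline-aligned struct pushes the remaining fields onto a different cache line, so frequent refcount updates do not invalidate the line holding the read-mostly fields. A small C11 sketch of the layout effect follows. struct demo_ns_common is hypothetical, the 64-byte line size is an assumption, and alignas() stands in for the kernel's ____cacheline_aligned_in_smp.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumed cache line size */

struct demo_ns_common {
	/* Hot, frequently written counter gets a cache line to itself. */
	struct {
		alignas(DEMO_CACHELINE) int ref;
	};
	/* Read-mostly fields start on the next cache line. */
	unsigned int ns_type;
	void *stashed;
	unsigned int inum;
};

/* Offset of the first read-mostly field, for inspection. */
static size_t demo_cold_offset(void)
{
	return offsetof(struct demo_ns_common, ns_type);
}
```

The cost is a larger structure: the counter occupies a full line, and the whole struct is padded to a multiple of the line size.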

+3 -2

include/linux/posix_acl_xattr.h

···
 }
 #endif

-int posix_acl_to_xattr(struct user_namespace *user_ns,
-		       const struct posix_acl *acl, void *buffer, size_t size);
+extern void *posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
+				size_t *sizep, gfp_t gfp);
+
 static inline const char *posix_acl_xattr_name(int type)
 {
 	switch (type) {

+2 -2

include/linux/restart_block.h

···
 #define __LINUX_RESTART_BLOCK_H

 #include <linux/compiler.h>
+#include <linux/time64.h>
 #include <linux/types.h>

 struct __kernel_timespec;
···
 			struct pollfd __user *ufds;
 			int nfds;
 			int has_timeout;
-			unsigned long tv_sec;
-			unsigned long tv_nsec;
+			struct timespec64 end_time;
 		} poll;
 	};
 };

+5 -2

init/Kconfig

···
 	  arch_update_hw_pressure() and arch_scale_thermal_pressure().

 config BSD_PROCESS_ACCT
-	bool "BSD Process Accounting"
+	bool "BSD Process Accounting (DEPRECATED)"
 	depends on MULTIUSER
+	default n
 	help
 	  If you say Y here, a user level program will be able to instruct the
 	  kernel (via a special system call) to write process accounting
···
 	  command name, memory usage, controlling terminal etc. (the complete
 	  list is in the struct acct in <file:include/linux/acct.h>). It is
 	  up to the user level program to do useful things with this
-	  information. This is generally a good idea, so say Y.
+	  information. This mechanism is antiquated and has significant
+	  scalability issues. You probably want to use eBPF instead. Say
+	  N unless you really need this.

 config BSD_PROCESS_ACCT_V3
 	bool "BSD Process Accounting version 3 file format"

+48

init/initramfs_test.c

···
 	kfree(tbufs);
 }

+static void __init initramfs_test_fname_path_max(struct kunit *test)
+{
+	char *err;
+	size_t len;
+	struct kstat st0, st1;
+	char fdata[] = "this file data will not be unpacked";
+	struct test_fname_path_max {
+		char fname_oversize[PATH_MAX + 1];
+		char fname_ok[PATH_MAX];
+		char cpio_src[(CPIO_HDRLEN + PATH_MAX + 3 + sizeof(fdata)) * 2];
+	} *tbufs = kzalloc(sizeof(struct test_fname_path_max), GFP_KERNEL);
+	struct initramfs_test_cpio c[] = { {
+		.magic = "070701",
+		.ino = 1,
+		.mode = S_IFDIR | 0777,
+		.nlink = 1,
+		.namesize = sizeof(tbufs->fname_oversize),
+		.fname = tbufs->fname_oversize,
+		.filesize = sizeof(fdata),
+		.data = fdata,
+	}, {
+		.magic = "070701",
+		.ino = 2,
+		.mode = S_IFDIR | 0777,
+		.nlink = 1,
+		.namesize = sizeof(tbufs->fname_ok),
+		.fname = tbufs->fname_ok,
+	} };
+
+	memset(tbufs->fname_oversize, '/', sizeof(tbufs->fname_oversize) - 1);
+	memset(tbufs->fname_ok, '/', sizeof(tbufs->fname_ok) - 1);
+	memcpy(tbufs->fname_oversize, "fname_oversize",
+	       sizeof("fname_oversize") - 1);
+	memcpy(tbufs->fname_ok, "fname_ok", sizeof("fname_ok") - 1);
+	len = fill_cpio(c, ARRAY_SIZE(c), tbufs->cpio_src);
+
+	/* unpack skips over fname_oversize instead of returning an error */
+	err = unpack_to_rootfs(tbufs->cpio_src, len);
+	KUNIT_EXPECT_NULL(test, err);
+
+	KUNIT_EXPECT_EQ(test, init_stat("fname_oversize", &st0, 0), -ENOENT);
+	KUNIT_EXPECT_EQ(test, init_stat("fname_ok", &st1, 0), 0);
+	KUNIT_EXPECT_EQ(test, init_rmdir("fname_ok"), 0);
+
+	kfree(tbufs);
+}
+
 /*
  * The kunit_case/_suite struct cannot be marked as __initdata as this will be
  * used in debugfs to retrieve results after test has run.
···
 	KUNIT_CASE(initramfs_test_hardlink),
 	KUNIT_CASE(initramfs_test_many),
 	KUNIT_CASE(initramfs_test_fname_pad),
+	KUNIT_CASE(initramfs_test_fname_path_max),
 	{},
 };

+85 -46

kernel/pid.c

···
 		free_pid(pids[tmp]);
 }

-struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-		      size_t set_tid_size)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
+		      size_t arg_set_tid_size)
 {
+	int set_tid[MAX_PID_NS_LEVEL + 1] = {};
+	int pid_max[MAX_PID_NS_LEVEL + 1] = {};
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
 	int retval = -ENOMEM;
+	bool retried_preload;

 	/*
-	 * set_tid_size contains the size of the set_tid array. Starting at
+	 * arg_set_tid_size contains the size of the arg_set_tid array. Starting at
 	 * the most nested currently active PID namespace it tells alloc_pid()
 	 * which PID to set for a process in that most nested PID namespace
-	 * up to set_tid_size PID namespaces. It does not have to set the PID
-	 * for a process in all nested PID namespaces but set_tid_size must
+	 * up to arg_set_tid_size PID namespaces. It does not have to set the PID
+	 * for a process in all nested PID namespaces but arg_set_tid_size must
 	 * never be greater than the current ns->level + 1.
 	 */
-	if (set_tid_size > ns->level + 1)
+	if (arg_set_tid_size > ns->level + 1)
 		return ERR_PTR(-EINVAL);

+	/*
+	 * Prep before we take locks:
+	 *
+	 * 1. allocate and fill in pid struct
+	 */
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		return ERR_PTR(retval);

-	tmp = ns;
+	get_pid_ns(ns);
 	pid->level = ns->level;
+	refcount_set(&pid->count, 1);
+	spin_lock_init(&pid->lock);
+	for (type = 0; type < PIDTYPE_MAX; ++type)
+		INIT_HLIST_HEAD(&pid->tasks[type]);
+	init_waitqueue_head(&pid->wait_pidfd);
+	INIT_HLIST_HEAD(&pid->inodes);

-	for (i = ns->level; i >= 0; i--) {
-		int tid = 0;
-		int pid_max = READ_ONCE(tmp->pid_max);
+	/*
+	 * 2. perm check checkpoint_restore_ns_capable()
+	 *
+	 * This stores found pid_max to make sure the used value is the same should
+	 * later code need it.
+	 */
+	for (tmp = ns, i = ns->level; i >= 0; i--) {
+		pid_max[ns->level - i] = READ_ONCE(tmp->pid_max);

-		if (set_tid_size) {
-			tid = set_tid[ns->level - i];
+		if (arg_set_tid_size) {
+			int tid = set_tid[ns->level - i] = arg_set_tid[ns->level - i];

 			retval = -EINVAL;
-			if (tid < 1 || tid >= pid_max)
-				goto out_free;
+			if (tid < 1 || tid >= pid_max[ns->level - i])
+				goto out_abort;
 			/*
 			 * Also fail if a PID != 1 is requested and
 			 * no PID 1 exists.
 			 */
 			if (tid != 1 && !tmp->child_reaper)
-				goto out_free;
+				goto out_abort;
 			retval = -EPERM;
 			if (!checkpoint_restore_ns_capable(tmp->user_ns))
-				goto out_free;
-			set_tid_size--;
+				goto out_abort;
+			arg_set_tid_size--;
 		}

-		idr_preload(GFP_KERNEL);
-		spin_lock(&pidmap_lock);
+		tmp = tmp->parent;
+	}
+
+	/*
+	 * Prep is done, id allocation goes here:
+	 */
+	retried_preload = false;
+	idr_preload(GFP_KERNEL);
+	spin_lock(&pidmap_lock);
+	for (tmp = ns, i = ns->level; i >= 0;) {
+		int tid = set_tid[ns->level - i];

 		if (tid) {
 			nr = idr_alloc(&tmp->idr, NULL, tid,
···
 			 * alreay in use. Return EEXIST in that case.
 			 */
 			if (nr == -ENOSPC)
+
 				nr = -EEXIST;
 		} else {
 			int pid_min = 1;
···
 			 * a partially initialized PID (see below).
 			 */
 			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-					      pid_max, GFP_ATOMIC);
+					      pid_max[ns->level - i], GFP_ATOMIC);
+			if (nr == -ENOSPC)
+				nr = -EAGAIN;
 		}
-		spin_unlock(&pidmap_lock);
-		idr_preload_end();

-		if (nr < 0) {
-			retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+		if (unlikely(nr < 0)) {
+			/*
+			 * Preload more memory if idr_alloc{,cyclic} failed with -ENOMEM.
+			 *
+			 * The IDR API only allows us to preload memory for one call, while we may end
+			 * up doing several under pidmap_lock with GFP_ATOMIC. The situation may be
+			 * salvageable with GFP_KERNEL. But make sure to not loop indefinitely if preload
+			 * did not help (the routine unfortunately returns void, so we have no idea
+			 * if it got anywhere).
+			 *
+			 * The lock can be safely dropped and picked up as historically pid allocation
+			 * for different namespaces was *not* atomic -- we try to hold on to it the
+			 * entire time only for performance reasons.
+			 */
+			if (nr == -ENOMEM && !retried_preload) {
+				spin_unlock(&pidmap_lock);
+				idr_preload_end();
+				retried_preload = true;
+				idr_preload(GFP_KERNEL);
+				spin_lock(&pidmap_lock);
+				continue;
+			}
+			retval = nr;
 			goto out_free;
 		}

 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
 		tmp = tmp->parent;
+		i--;
+		retried_preload = false;
 	}

 	/*
···
 	 * is what we have exposed to userspace for a long time and it is
 	 * documented behavior for pid namespaces. So we can't easily
 	 * change it even if there were an error code better suited.
+	 *
+	 * This can't be done earlier because we need to preserve other
+	 * error conditions.
 	 */
 	retval = -ENOMEM;
-
-	get_pid_ns(ns);
-	refcount_set(&pid->count, 1);
-	spin_lock_init(&pid->lock);
-	for (type = 0; type < PIDTYPE_MAX; ++type)
-		INIT_HLIST_HEAD(&pid->tasks[type]);
-
-	init_waitqueue_head(&pid->wait_pidfd);
-	INIT_HLIST_HEAD(&pid->inodes);
-
-	upid = pid->numbers + ns->level;
-	idr_preload(GFP_KERNEL);
-	spin_lock(&pidmap_lock);
-	if (!(ns->pid_allocated & PIDNS_ADDING))
-		goto out_unlock;
+	if (unlikely(!(ns->pid_allocated & PIDNS_ADDING)))
+		goto out_free;
 	pidfs_add_pid(pid);
-	for ( ; upid >= pid->numbers; --upid) {
+	for (upid = pid->numbers + ns->level; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);
 		upid->ns->pid_allocated++;
···

 	return pid;

-out_unlock:
-	spin_unlock(&pidmap_lock);
-	idr_preload_end();
-	put_pid_ns(ns);
-
 out_free:
-	spin_lock(&pidmap_lock);
 	while (++i <= ns->level) {
 		upid = pid->numbers + i;
 		idr_remove(&upid->ns->idr, upid->nr);
···
 		idr_set_cursor(&ns->idr, 0);

 	spin_unlock(&pidmap_lock);
+	idr_preload_end();

+out_abort:
+	put_pid_ns(ns);
 	kmem_cache_free(ns->pid_cachep, pid);
 	return ERR_PTR(retval);
 }
