pidfs: persist information

Persist exit and coredump information independent of whether anyone
currently holds a pidfd for the struct pid.

The current scheme allocates pidfs dentries on-demand repeatedly.
This scheme is reaching its limits as it makes it impossible to pin
information that needs to be available after the task has exited or
coredumped and that should not be lost simply because the pidfd got
closed temporarily. The next opener should still see the stashed
information.

This is also a prerequisite for supporting extended attributes on
pidfds to allow attaching meta information to them.

If someone opens a pidfd for a struct pid a pidfs dentry is allocated
and stashed in pid->stashed. Once the last pidfd for the struct pid is
closed the pidfs dentry is released and removed from pid->stashed.

So if 10 callers create a pidfs dentry for the same struct pid
sequentially, i.e., each closing the pidfd before the next one creates
a new one, then a new pidfs dentry is allocated every time.

Because multiple tasks acquiring and releasing a pidfd for the same
struct pid can race with each other a task may still find a valid
pidfs dentry from the previous task in pid->stashed and reuse it. Or it
might find a dead dentry in there, fail to reuse it, and so stash a
new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry
but only one will succeed, the other ones will put their dentry.
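The stash-or-put behaviour described above can be sketched in userspace C. This is only a simplified model, not the kernel's implementation: `struct model_dentry`, `struct model_stash`, and the `model_*` helpers are hypothetical names, C11 atomics stand in for the kernel's own primitives, and the "dead dentry" case is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real definitions. */
struct model_dentry {
	int id;
};

struct model_stash {
	_Atomic(struct model_dentry *) stashed;	/* models pid->stashed */
};

static struct model_dentry *model_new_dentry(int id)
{
	struct model_dentry *d = malloc(sizeof(*d));

	if (d)
		d->id = id;
	return d;
}

/*
 * Try to stash @fresh. Returns the dentry that ends up being used: either
 * @fresh (this caller won) or the one somebody else stashed first (this
 * caller lost and puts, i.e. frees, its own dentry).
 */
static struct model_dentry *model_stash_dentry(struct model_stash *s,
					       struct model_dentry *fresh)
{
	struct model_dentry *expected = NULL;

	assert(fresh != NULL);
	if (atomic_compare_exchange_strong(&s->stashed, &expected, fresh))
		return fresh;

	free(fresh);
	return expected;
}

/* Last pidfd closed: release the dentry and clear the stash. */
static void model_unstash(struct model_stash *s)
{
	free(atomic_exchange(&s->stashed, (struct model_dentry *)NULL));
}
```

In this model, two callers that overlap share one dentry, while callers that open and close strictly one after another each pay for a fresh allocation, which is the cost the commit message points at.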

The current scheme aims to ensure that a pidfs dentry for a struct pid
can only be created if the task is still alive or if a pidfs dentry
already existed before the task was reaped and so exit information has
been stashed in the pidfs inode.

That's great except that it's buggy. If a pidfs dentry is stashed in
pid->stashed after pidfs_exit() but before __unhash_process() is called
we will return a pidfd for a reaped task without exit information being
available.

The pidfs_pid_valid() check does not guard against this race as it
doesn't sync at all with pidfs_exit(). The pid_has_task() check might be
successful simply because we're before __unhash_process() but after
pidfs_exit().
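The problematic ordering can be replayed step by step in a small userspace model. This is a sketch under simplifying assumptions: `struct old_pid` and the `old_*` helpers are hypothetical names that reduce the old scheme to three flags, and the interleaving is executed sequentially rather than with real concurrency.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the old scheme; not kernel code. */
struct old_pid {
	bool has_task;		/* true until __unhash_process() runs */
	bool dentry_stashed;	/* a pidfs dentry sits in pid->stashed */
	bool exit_info_set;	/* exit information was recorded in the inode */
};

/* Old pidfs_exit(): records exit info only if a dentry is already stashed. */
static void old_pidfs_exit(struct old_pid *pid)
{
	if (pid->dentry_stashed)
		pid->exit_info_set = true;
}

static void old_unhash_process(struct old_pid *pid)
{
	pid->has_task = false;
}

/* Old validity check: a pid with a task passes, otherwise exit info is required. */
static bool old_pid_valid(const struct old_pid *pid)
{
	if (!pid->has_task)
		return pid->exit_info_set;
	return true;
}

/* Opening a pidfd stashes a dentry if the validity check passes. */
static bool old_open_pidfd(struct old_pid *pid)
{
	if (!old_pid_valid(pid))
		return false;
	pid->dentry_stashed = true;
	return true;
}

/* Replays the bad interleaving; returns true if the bug shows. */
static bool replay_race(void)
{
	struct old_pid pid = { .has_task = true };
	bool opened;

	old_pidfs_exit(&pid);		/* nothing stashed, nothing recorded */
	opened = old_open_pidfd(&pid);	/* task still hashed, so this succeeds */
	old_unhash_process(&pid);	/* task is reaped */

	/* A pidfd was handed out, yet no exit information exists. */
	return opened && !pid.has_task && !pid.exit_info_set;
}
```

The model makes the gap visible: nothing in `old_open_pidfd()` orders itself against `old_pidfs_exit()`, so the open that lands between the two exit steps succeeds and the exit information is never recorded.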

Introduce a new scheme where the lifetime of information associated with
a pidfs entry (coredump and exit information) isn't bound to the
lifetime of the pidfs inode but to the lifetime of the struct pid itself.

The first time a pidfs dentry is allocated for a struct pid a struct
pidfs_attr will be allocated which will be used to store exit and
coredump information.

If all pidfds for the pidfs dentry are closed the dentry and inode can be
cleaned up but the struct pidfs_attr will stick around until the struct
pid itself is freed. This will ensure minimal memory usage while
persisting relevant information.
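A rough userspace model of the new lifetime rule may help here. It is a sketch only: `struct new_pid`, `struct model_attr`, `MODEL_PID_DEAD`, and the `new_*` helpers are hypothetical names, and a pthread mutex stands in for the spinlock the kernel code takes. The attribute block is allocated on first registration, survives any number of open/close cycles, and is released only when the pid itself is freed.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model types; not the kernel definitions. */
struct model_attr {
	int exit_code;
	bool exit_info_set;
};

/* Sentinel in the spirit of PIDFS_PID_DEAD; it is never dereferenced. */
#define MODEL_PID_DEAD ((struct model_attr *)(intptr_t)-ESRCH)

struct new_pid {
	pthread_mutex_t lock;
	struct model_attr *attr;
};

static void new_pid_init(struct new_pid *pid)
{
	pthread_mutex_init(&pid->lock, NULL);
	pid->attr = NULL;
}

/* First pidfd for this pid: install attr unless the pid is marked dead. */
static int new_register_pid(struct new_pid *pid)
{
	struct model_attr *fresh = calloc(1, sizeof(*fresh));
	int ret = 0;

	if (!fresh)
		return -ENOMEM;

	pthread_mutex_lock(&pid->lock);
	if (pid->attr == MODEL_PID_DEAD) {
		ret = -ESRCH;
	} else if (!pid->attr) {
		pid->attr = fresh;
		fresh = NULL;	/* ownership moved to the pid */
	}
	pthread_mutex_unlock(&pid->lock);

	free(fresh);	/* no-op when the allocation was installed */
	return ret;
}

/* Exit path: record exit info, or mark the pid dead if nobody registered. */
static void new_pidfs_exit(struct new_pid *pid, int exit_code)
{
	pthread_mutex_lock(&pid->lock);
	if (!pid->attr) {
		pid->attr = MODEL_PID_DEAD;
	} else if (pid->attr != MODEL_PID_DEAD) {
		pid->attr->exit_code = exit_code;
		pid->attr->exit_info_set = true;
	}
	pthread_mutex_unlock(&pid->lock);
}

/* Called when the pid itself is freed: only now does attr go away. */
static void new_free_pid(struct new_pid *pid)
{
	if (pid->attr != MODEL_PID_DEAD)
		free(pid->attr);
	pid->attr = NULL;
	pthread_mutex_destroy(&pid->lock);
}
```

Because registration and the exit path take the same lock in this model, a late registration either sees an installed attribute block (and later the exit information in it) or sees the dead marker and fails with `-ESRCH`. There is no window in which a pidfd exists without exit information ever becoming available.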

The new scheme has various advantages. First, it closes the race where
we end up handing out a pidfd for a reaped task for which no exit
information is available. Second, it minimizes memory usage. Third, it
removes the complex lifetime tracking via dentries when registering a
struct pid with pidfs. There's no need to get or put a
reference. Instead, the lifetime of exit and coredump information
associated with a struct pid is bound to the lifetime of struct pid
itself.

Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

Christian Brauner 1 year ago 8ec7c826 75215c97

+153 -69

4 changed files

+148 -68

fs/pidfs.c

···
 #include "internal.h"
 #include "mount.h"

+#define PIDFS_PID_DEAD ERR_PTR(-ESRCH)
+
 static struct kmem_cache *pidfs_cachep __ro_after_init;
+static struct kmem_cache *pidfs_attr_cachep __ro_after_init;

 /*
  * Stashes information that userspace needs to access even after the
···
 	__u64 cgroupid;
 	__s32 exit_code;
 	__u32 coredump_mask;
+};
+
+struct pidfs_attr {
+	struct pidfs_exit_info __pei;
+	struct pidfs_exit_info *exit_info;
 };

 struct pidfs_inode {
···

 	pid->ino = pidfs_ino_nr;
 	pid->stashed = NULL;
+	pid->attr = NULL;
 	pidfs_ino_nr++;

 	write_seqcount_begin(&pidmap_lock_seq);
···
 	write_seqcount_begin(&pidmap_lock_seq);
 	rb_erase(&pid->pidfs_node, &pidfs_ino_tree);
 	write_seqcount_end(&pidmap_lock_seq);
+}
+
+void pidfs_free_pid(struct pid *pid)
+{
+	/*
+	 * Any dentry must've been wiped from the pid by now.
+	 * Otherwise there's a reference count bug.
+	 */
+	VFS_WARN_ON_ONCE(pid->stashed);
+
+	if (!IS_ERR(pid->attr))
+		kfree(pid->attr);
 }

 #ifdef CONFIG_PROC_FS
···
 static long pidfd_info(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
-	struct inode *inode = file_inode(file);
 	struct pid *pid = pidfd_pid(file);
 	size_t usize = _IOC_SIZE(cmd);
 	struct pidfd_info kinfo = {};
 	struct pidfs_exit_info *exit_info;
 	struct user_namespace *user_ns;
 	struct task_struct *task;
+	struct pidfs_attr *attr;
 	const struct cred *c;
 	__u64 mask;

···
 	if (!pid_in_current_pidns(pid))
 		return -ESRCH;

+	attr = READ_ONCE(pid->attr);
 	if (mask & PIDFD_INFO_EXIT) {
-		exit_info = READ_ONCE(pidfs_i(inode)->exit_info);
+		exit_info = READ_ONCE(attr->exit_info);
 		if (exit_info) {
 			kinfo.mask |= PIDFD_INFO_EXIT;
 #ifdef CONFIG_CGROUPS
···

 	if (mask & PIDFD_INFO_COREDUMP) {
 		kinfo.mask |= PIDFD_INFO_COREDUMP;
-		kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask);
+		kinfo.coredump_mask = READ_ONCE(attr->__pei.coredump_mask);
 	}

 	task = get_pid_task(pid, PIDTYPE_PID);
···
  * task has been reaped which cannot happen until we're out of
  * release_task().
  *
- * If this struct pid is referred to by a pidfd then
- * stashed_dentry_get() will return the dentry and inode for that struct
- * pid. Since we've taken a reference on it there's now an additional
- * reference from the exit path on it. Which is fine. We're going to put
- * it again in a second and we know that the pid is kept alive anyway.
+ * If this struct pid has at least once been referred to by a pidfd then
+ * pid->attr will be allocated. If not we mark the struct pid as dead so
+ * anyone who is trying to register it with pidfs will fail to do so.
+ * Otherwise we would hand out pidfs for reaped tasks without having
+ * exit information available.
  *
- * Worst case is that we've filled in the info and immediately free the
- * dentry and inode afterwards since the pidfd has been closed. Since
+ * Worst case is that we've filled in the info and the pid gets freed
+ * right away in free_pid() when no one holds a pidfd anymore. Since
  * pidfs_exit() currently is placed after exit_task_work() we know that
- * it cannot be us aka the exiting task holding a pidfd to ourselves.
+ * it cannot be us aka the exiting task holding a pidfd to itself.
  */
 void pidfs_exit(struct task_struct *tsk)
 {
-	struct dentry *dentry;
+	struct pid *pid = task_pid(tsk);
+	struct pidfs_attr *attr;
+	struct pidfs_exit_info *exit_info;
+#ifdef CONFIG_CGROUPS
+	struct cgroup *cgrp;
+#endif

 	might_sleep();

-	dentry = stashed_dentry_get(&task_pid(tsk)->stashed);
-	if (dentry) {
-		struct inode *inode = d_inode(dentry);
-		struct pidfs_exit_info *exit_info = &pidfs_i(inode)->__pei;
-#ifdef CONFIG_CGROUPS
-		struct cgroup *cgrp;
-
-		rcu_read_lock();
-		cgrp = task_dfl_cgroup(tsk);
-		exit_info->cgroupid = cgroup_id(cgrp);
-		rcu_read_unlock();
-#endif
-		exit_info->exit_code = tsk->exit_code;
-
-		/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
-		smp_store_release(&pidfs_i(inode)->exit_info, &pidfs_i(inode)->__pei);
-		dput(dentry);
+	guard(spinlock_irq)(&pid->wait_pidfd.lock);
+	attr = pid->attr;
+	if (!attr) {
+		/*
+		 * No one ever held a pidfd for this struct pid.
+		 * Mark it as dead so no one can add a pidfs
+		 * entry anymore. We're about to be reaped and
+		 * so no exit information would be available.
+		 */
+		pid->attr = PIDFS_PID_DEAD;
+		return;
 	}
+
+	/*
+	 * If @pid->attr is set someone might still legitimately hold a
+	 * pidfd to @pid or someone might concurrently still be getting
+	 * a reference to an already stashed dentry from @pid->stashed.
+	 * So defer cleaning @pid->attr until the last reference to @pid
+	 * is put
+	 */
+
+	exit_info = &attr->__pei;
+
+#ifdef CONFIG_CGROUPS
+	rcu_read_lock();
+	cgrp = task_dfl_cgroup(tsk);
+	exit_info->cgroupid = cgroup_id(cgrp);
+	rcu_read_unlock();
+#endif
+	exit_info->exit_code = tsk->exit_code;
+
+	/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
+	smp_store_release(&attr->exit_info, &attr->__pei);
 }

 #ifdef CONFIG_COREDUMP
···
 {
 	struct pid *pid = cprm->pid;
 	struct pidfs_exit_info *exit_info;
-	struct dentry *dentry;
-	struct inode *inode;
+	struct pidfs_attr *attr;
 	__u32 coredump_mask = 0;

-	dentry = pid->stashed;
-	if (WARN_ON_ONCE(!dentry))
-		return;
+	attr = READ_ONCE(pid->attr);

-	inode = d_inode(dentry);
-	exit_info = &pidfs_i(inode)->__pei;
+	VFS_WARN_ON_ONCE(!attr);
+	VFS_WARN_ON_ONCE(attr == PIDFS_PID_DEAD);
+
+	exit_info = &attr->__pei;
 	/* Note how we were coredumped. */
 	coredump_mask = pidfs_coredump_mask(cprm->mm_flags);
 	/* Note that we actually did coredump. */
···

 static void pidfs_free_inode(struct inode *inode)
 {
-	kmem_cache_free(pidfs_cachep, pidfs_i(inode));
+	kfree(pidfs_i(inode));
 }

 static const struct super_operations pidfs_sops = {
···
 	 * recorded and published can be handled correctly.
 	 */
 	if (unlikely(!pid_has_task(pid, type))) {
-		struct inode *inode = d_inode(path->dentry);
-		return !!READ_ONCE(pidfs_i(inode)->exit_info);
+		struct pidfs_attr *attr;
+
+		attr = READ_ONCE(pid->attr);
+		if (!attr)
+			return false;
+		if (!READ_ONCE(attr->exit_info))
+			return false;
 	}

 	return true;
···
 	put_pid(pid);
 }

+/**
+ * pidfs_register_pid - register a struct pid in pidfs
+ * @pid: pid to pin
+ *
+ * Register a struct pid in pidfs. Needs to be paired with
+ * pidfs_put_pid() to not risk leaking the pidfs dentry and inode.
+ *
+ * Return: On success zero, on error a negative error code is returned.
+ */
+int pidfs_register_pid(struct pid *pid)
+{
+	struct pidfs_attr *new_attr __free(kfree) = NULL;
+	struct pidfs_attr *attr;
+
+	might_sleep();
+
+	if (!pid)
+		return 0;
+
+	attr = READ_ONCE(pid->attr);
+	if (unlikely(attr == PIDFS_PID_DEAD))
+		return PTR_ERR(PIDFS_PID_DEAD);
+	if (attr)
+		return 0;
+
+	new_attr = kmem_cache_zalloc(pidfs_attr_cachep, GFP_KERNEL);
+	if (!new_attr)
+		return -ENOMEM;
+
+	/* Synchronize with pidfs_exit(). */
+	guard(spinlock_irq)(&pid->wait_pidfd.lock);
+
+	attr = pid->attr;
+	if (unlikely(attr == PIDFS_PID_DEAD))
+		return PTR_ERR(PIDFS_PID_DEAD);
+	if (unlikely(attr))
+		return 0;
+
+	pid->attr = no_free_ptr(new_attr);
+	return 0;
+}
+
+static struct dentry *pidfs_stash_dentry(struct dentry **stashed,
+					 struct dentry *dentry)
+{
+	int ret;
+	struct pid *pid = d_inode(dentry)->i_private;
+
+	VFS_WARN_ON_ONCE(stashed != &pid->stashed);
+
+	ret = pidfs_register_pid(pid);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return stash_dentry(stashed, dentry);
+}
+
 static const struct stashed_operations pidfs_stashed_ops = {
-	.init_inode	= pidfs_init_inode,
-	.put_data	= pidfs_put_data,
+	.stash_dentry	= pidfs_stash_dentry,
+	.init_inode	= pidfs_init_inode,
+	.put_data	= pidfs_put_data,
 };

 static int pidfs_init_fs_context(struct fs_context *fc)
···
 }

 /**
- * pidfs_register_pid - register a struct pid in pidfs
- * @pid: pid to pin
- *
- * Register a struct pid in pidfs. Needs to be paired with
- * pidfs_put_pid() to not risk leaking the pidfs dentry and inode.
- *
- * Return: On success zero, on error a negative error code is returned.
- */
-int pidfs_register_pid(struct pid *pid)
-{
-	struct path path __free(path_put) = {};
-	int ret;
-
-	might_sleep();
-
-	if (!pid)
-		return 0;
-
-	ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
-	if (unlikely(ret))
-		return ret;
-	/* Keep the dentry and only put the reference to the mount. */
-	path.dentry = NULL;
-	return 0;
-}
-
-/**
  * pidfs_get_pid - pin a struct pid through pidfs
  * @pid: pid to pin
  *
···
 					 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT |
 					  SLAB_ACCOUNT | SLAB_PANIC),
 					 pidfs_inode_init_once);
+	pidfs_attr_cachep = kmem_cache_create("pidfs_attr_cache", sizeof(struct pidfs_attr), 0,
+					 (SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT |
+					  SLAB_ACCOUNT | SLAB_PANIC), NULL);
 	pidfs_mnt = kern_mount(&pidfs_type);
 	if (IS_ERR(pidfs_mnt))
 		panic("Failed to mount pidfs pseudo filesystem");
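The "sees either all or nothing" comment in pidfs_exit() relies on a publish-by-pointer pattern: fill in the backing storage first, then make it reachable with a release store. A minimal userspace sketch of that pattern follows. It assumes C11 atomics as a stand-in for smp_store_release(), uses an acquire load on the reader side where the kernel code uses READ_ONCE(), and all names (`struct pub_exit_info`, `struct pub_attr`, `publish_exit_info()`, `read_exit_info()`) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the exit info layout; not kernel code. */
struct pub_exit_info {
	uint64_t cgroupid;
	int32_t exit_code;
};

struct pub_attr {
	struct pub_exit_info pei;			/* backing storage */
	_Atomic(struct pub_exit_info *) exit_info;	/* NULL until published */
};

/* Writer: fill in every field, then publish the pointer with release order. */
static void publish_exit_info(struct pub_attr *attr, uint64_t cgroupid,
			      int32_t exit_code)
{
	attr->pei.cgroupid = cgroupid;
	attr->pei.exit_code = exit_code;
	atomic_store_explicit(&attr->exit_info, &attr->pei,
			      memory_order_release);
}

/*
 * Reader: NULL means nothing was published yet. A non-NULL result points
 * at fully initialised data because the acquire load pairs with the
 * release store above.
 */
static const struct pub_exit_info *read_exit_info(struct pub_attr *attr)
{
	return atomic_load_explicit(&attr->exit_info, memory_order_acquire);
}
```

A reader in this sketch never observes a half-written record: it gets either NULL or a pointer to the complete structure.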

include/linux/pid.h

···

 #define RESERVED_PIDS		300

+struct pidfs_attr;
+
 struct upid {
 	int nr;
 	struct pid_namespace *ns;
···
 		u64 ino;
 		struct rb_node pidfs_node;
 		struct dentry *stashed;
+		struct pidfs_attr *attr;
 	};
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];

include/linux/pidfs.h

···
 int pidfs_register_pid(struct pid *pid);
 void pidfs_get_pid(struct pid *pid);
 void pidfs_put_pid(struct pid *pid);
+void pidfs_free_pid(struct pid *pid);

 #endif /* _LINUX_PID_FS_H */

+1 -1

kernel/pid.c

···

 	ns = pid->numbers[pid->level].ns;
 	if (refcount_dec_and_test(&pid->count)) {
-		WARN_ON_ONCE(pid->stashed);
+		pidfs_free_pid(pid);
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
