sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:

- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
  during scheduler load, because the cgroup might be destroyed/unlinked
  while the zombie or dead task is still lingering on the scx_tasks list.

- ops.cgroup_exit() could be called before ops.exit_task() is called on all
  member tasks, leading to incorrect exit ordering.

Fix by moving the call to finish_task_switch() so that it runs right after the
final context switch away from the dying task, matching when
sched_class->task_dead() is called. Rename the function to sched_ext_dead() to
match the new calling context.

By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:

- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
  as cgroup_mutex prevents cgroup destruction while the task is still linked.

- All member tasks have ops.exit_task() called and are removed from scx_tasks
  before the cgroup can be destroyed and trigger ops.cgroup_exit().
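
The ordering argument above can be illustrated with a small userspace toy model. This is only a sketch, not kernel code: every toy_* name is hypothetical, and the model assumes, as a simplification, that a cgroup is torn down the moment its last member task is unlinked. It counts how often the cgroup would go away while the scheduler still tracks one of its tasks, under the old ordering (scheduler teardown deferred to the final reference drop) and the new one (scheduler teardown before the cgroup unlink).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_TASKS 3

/* Toy stand-ins only; none of these are kernel types or APIs. */
struct toy_cgroup {
	int nr_linked;		/* member tasks still linked to the cgroup */
	int nr_scx_tracked;	/* member tasks the scheduler still tracks */
	int violations;		/* teardowns seen while tasks were still tracked */
};

struct toy_task {
	struct toy_cgroup *cgrp;
	bool linked;
	bool scx_tracked;
};

/* Analogue of sched_ext_dead(): exit the task and drop it from the list. */
static void toy_scx_dead(struct toy_task *t)
{
	if (!t->scx_tracked)
		return;
	assert(t->cgrp->nr_scx_tracked > 0);
	t->scx_tracked = false;
	t->cgrp->nr_scx_tracked--;
}

/*
 * Analogue of cgroup_task_dead(): unlink the task. Assumption of this toy:
 * the cgroup is destroyed as soon as the last member is unlinked.
 */
static void toy_cgroup_task_dead(struct toy_task *t)
{
	struct toy_cgroup *cg = t->cgrp;

	if (!t->linked)
		return;
	assert(cg->nr_linked > 0);
	t->linked = false;
	if (--cg->nr_linked == 0 && cg->nr_scx_tracked > 0)
		cg->violations++;	/* cgroup exit before all task exits */
}

/* Returns the number of ordering violations for the chosen ordering. */
static int toy_simulate(bool scx_dead_first)
{
	struct toy_cgroup cg = {
		.nr_linked = TOY_NR_TASKS,
		.nr_scx_tracked = TOY_NR_TASKS,
		.violations = 0,
	};
	struct toy_task tasks[TOY_NR_TASKS];
	int i;

	for (i = 0; i < TOY_NR_TASKS; i++)
		tasks[i] = (struct toy_task){ .cgrp = &cg, .linked = true,
					      .scx_tracked = true };

	/* Final context switch away from each dying task. */
	for (i = 0; i < TOY_NR_TASKS; i++) {
		if (scx_dead_first)
			toy_scx_dead(&tasks[i]);
		toy_cgroup_task_dead(&tasks[i]);
	}

	/* Last reference drop, possibly much later (no-op in the new ordering). */
	for (i = 0; i < TOY_NR_TASKS; i++)
		toy_scx_dead(&tasks[i]);

	return cg.violations;
}
```

In this model, toy_simulate(false) reports one violation because the cgroup is torn down while all three tasks are still tracked, whereas toy_simulate(true) reports none.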

This fix is made possible by the cgroup_task_dead() split in the previous patch.

This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.

Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo 7 months ago 7900aa69 587eb08a

+9 -4

4 changed files


+2 -2

include/linux/sched/ext.h

@@ -208,14 +208,14 @@
 	struct list_head	tasks_node;
 };
 
-void sched_ext_free(struct task_struct *p);
+void sched_ext_dead(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
 void scx_softlockup(u32 dur_s);
 bool scx_rcu_cpu_stall(void);
 
 #else	/* !CONFIG_SCHED_CLASS_EXT */
 
-static inline void sched_ext_free(struct task_struct *p) {}
+static inline void sched_ext_dead(struct task_struct *p) {}
 static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
 static inline void scx_softlockup(u32 dur_s) {}
 static inline bool scx_rcu_cpu_stall(void) { return false; }

-1

kernel/fork.c

@@ -736,7 +736,6 @@
 	WARN_ON(tsk == current);
 
 	unwind_task_free(tsk);
-	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_task_free(tsk);
 	task_numa_free(tsk, true);

kernel/sched/core.c

@@ -5151,6 +5151,12 @@
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 
+		/*
+		 * sched_ext_dead() must come before cgroup_task_dead() to
+		 * prevent cgroups from being removed while its member tasks are
+		 * visible to SCX schedulers.
+		 */
+		sched_ext_dead(prev);
 		cgroup_task_dead(prev);
 
 		/* Task is done with its stack. */

+1 -1

kernel/sched/ext.c

@@ -2966,7 +2966,7 @@
 	percpu_up_read(&scx_fork_rwsem);
 }
 
-void sched_ext_free(struct task_struct *p)
+void sched_ext_dead(struct task_struct *p)
 {
 	unsigned long flags;
 
