Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable

scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
calling scx_ops_exit_task() on all tasks including dead ones. This can leave
a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.

If another task was in the process of changing the TASK_DEAD task's
scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
done with the task, that other task ends up calling scx_ops_disable_task()
on the dead task, which is in an inconsistent state, triggering a warning:

WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
...
RIP: 0010:scx_ops_disable_task+0x12c/0x160
...
Call Trace:
<TASK>
check_class_changed+0x2c/0x70
__sched_setscheduler+0x8a0/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f140d70ea5b

There is no reason to leave dead tasks on SCX when unloading the BPF
scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
the dead ones from SCX.

Signed-off-by: Tejun Heo <tj@kernel.org>

+8 -16
kernel/sched/ext.c
@@ -3998,30 +3998,22 @@
 	spin_lock_irq(&scx_tasks_lock);
 	scx_task_iter_init(&sti);
 	/*
-	 * Invoke scx_ops_exit_task() on all non-idle tasks, including
-	 * TASK_DEAD tasks. Because dead tasks may have a nonzero refcount,
-	 * we may not have invoked sched_ext_free() on them by the time a
-	 * scheduler is disabled. We must therefore exit the task here, or we'd
-	 * fail to invoke ops.exit_task(), as the scheduler will have been
-	 * unloaded by the time the task is subsequently exited on the
-	 * sched_ext_free() path.
+	 * The BPF scheduler is going away. All tasks including %TASK_DEAD ones
+	 * must be switched out and exited synchronously.
 	 */
 	while ((p = scx_task_iter_next_locked(&sti, true))) {
 		const struct sched_class *old_class = p->sched_class;
 		struct sched_enq_and_set_ctx ctx;
 
-		if (READ_ONCE(p->__state) != TASK_DEAD) {
-			sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE,
-					       &ctx);
+		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
 
-			p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
-			__setscheduler_prio(p, p->prio);
-			check_class_changing(task_rq(p), p, old_class);
+		p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
+		__setscheduler_prio(p, p->prio);
+		check_class_changing(task_rq(p), p, old_class);
 
-			sched_enq_and_set_task(&ctx);
+		sched_enq_and_set_task(&ctx);
 
-			check_class_changed(task_rq(p), p, old_class, p->prio);
-		}
+		check_class_changed(task_rq(p), p, old_class, p->prio);
 		scx_ops_exit_task(p);
 	}
 	scx_task_iter_exit(&sti);
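The inconsistency described in the commit message can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: every toy_* name below is hypothetical and is not a kernel type or function, there is no locking or run queue, and "exiting" a task is reduced to resetting a state field. It shows only why skipping the class switch for dead tasks while still exiting them leaves a task on the SCX class with a NONE state, and why switching every task out avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduling class and the SCX task state. */
enum toy_sched_class { TOY_CLASS_SCX, TOY_CLASS_FAIR };
enum toy_scx_state { TOY_SCX_NONE, TOY_SCX_ENABLED };

struct toy_task {
	bool dead;                        /* models __state == TASK_DEAD */
	enum toy_sched_class sched_class; /* models p->sched_class */
	enum toy_scx_state state;         /* models the SCX_TASK_* state */
};

/* A task still on the SCX class must not be in the NONE state. */
static bool toy_task_consistent(const struct toy_task *p)
{
	return !(p->sched_class == TOY_CLASS_SCX && p->state == TOY_SCX_NONE);
}

/*
 * Models the disable loop. With skip_dead == true it mimics the old
 * behavior (dead tasks keep their class but are still exited); with
 * skip_dead == false it mimics the new behavior (all tasks switched out).
 */
static void toy_disable_all(struct toy_task *tasks, size_t n, bool skip_dead)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_task *p = &tasks[i];

		if (!(skip_dead && p->dead))
			p->sched_class = TOY_CLASS_FAIR; /* switch out of SCX */

		p->state = TOY_SCX_NONE; /* models scx_ops_exit_task() */
	}
}

/* Counts tasks left in the inconsistent "on SCX, state NONE" combination. */
static size_t toy_count_inconsistent(const struct toy_task *tasks, size_t n)
{
	size_t bad = 0;

	for (size_t i = 0; i < n; i++)
		if (!toy_task_consistent(&tasks[i]))
			bad++;
	return bad;
}
```

In this model, running toy_disable_all() with skip_dead set on one live and one dead SCX task leaves the dead task inconsistent, while running it with skip_dead cleared leaves none. In the kernel, a later class change on such a task is what reaches scx_ops_disable_task() and triggers the warning quoted in the commit message.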