sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable

During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
on every task. To do this, it calls get_task_struct() on each iterated task,
drops the lock and then calls ops.init_task().
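
The pin, unlock, call, relock sequence described above can be sketched in
userspace C. This is a single-threaded illustration only: fake_lock,
fake_task, fake_init_task() and fake_init_all() are hypothetical stand-ins
and not kernel APIs, and the "lock" is just a flag that lets the callback
check it is not running locked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a real kernel lock and task_struct are far richer. */
struct fake_lock {
	bool held;
};

struct fake_task {
	unsigned int usage;	/* reference count */
	bool initialized;	/* set by the init callback */
};

static void fake_lock_acquire(struct fake_lock *l) { l->held = true; }
static void fake_lock_release(struct fake_lock *l) { l->held = false; }

/* Stands in for a sleepable callback: it must never run with the lock held. */
static int fake_init_task(struct fake_lock *l, struct fake_task *t)
{
	if (l->held)
		return -1;	/* would be a sleep-while-locked bug */
	t->initialized = true;
	return 0;
}

/* Pin each task under the lock, drop the lock, call the callback, unpin. */
static int fake_init_all(struct fake_lock *l, struct fake_task *tasks, size_t n)
{
	int ret = 0;

	fake_lock_acquire(l);
	for (size_t i = 0; i < n && !ret; i++) {
		struct fake_task *t = &tasks[i];

		t->usage++;		/* pin so it survives the unlocked window */
		fake_lock_release(l);

		ret = fake_init_task(l, t);

		t->usage--;		/* unpin */
		fake_lock_acquire(l);
	}
	fake_lock_release(l);
	return ret;
}
```

The unconditional pin is only safe when the count is known to be non-zero,
which is the assumption that breaks for tasks that are already being freed.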

However, a TASK_DEAD task may already have dropped its usage count to zero
and be waiting for an RCU grace period to be freed. If get_task_struct() is
called on such a task, use-after-free can happen. To avoid such situations,
scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
as they are never going to be scheduled again.

Unfortunately, a racing sched_setscheduler(2) can grab the task before the
task is unhashed and then continue to e.g. move the task from RT to SCX
after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
gone through scx_ops_init_task(), scx_ops_enable_task() called from
switching_to_scx() triggers the following warning:

sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
...
RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x84e/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
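
The "0 -> 3" in the warning can be read as a state machine check. The sketch
below is an assumption-laden illustration: the log only shows the raw values,
so the state names, their meanings and the exact transition rules here are
hypothetical and may differ from what kernel/sched/ext.c enforces.

```c
#include <stdbool.h>

/* Assumed numbering; the warning itself only prints the values 0 and 3. */
enum fake_task_state {
	FAKE_TASK_NONE = 0,	/* never went through init */
	FAKE_TASK_INIT = 1,	/* init callback has run */
	FAKE_TASK_READY = 2,	/* fully initialized, not yet enabled */
	FAKE_TASK_ENABLED = 3,	/* enabled on the BPF scheduler */
};

/* Return true if moving from @prev to @next is a legal transition. */
static bool fake_transition_valid(enum fake_task_state prev,
				  enum fake_task_state next)
{
	switch (next) {
	case FAKE_TASK_NONE:
		return true;			/* teardown is always allowed */
	case FAKE_TASK_INIT:
		return prev == FAKE_TASK_NONE;
	case FAKE_TASK_READY:
		return prev != FAKE_TASK_NONE;
	case FAKE_TASK_ENABLED:
		return prev == FAKE_TASK_READY;	/* must be initialized first */
	}
	return false;
}
```

Under these assumptions, a task that was skipped during init is still in
state 0, so enabling it (state 3) is rejected, matching the shape of the
warning above.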

As in the ops_disable path, it doesn't seem like a good idea to leave any
task in an inconsistent state, even when the task is dead. The root cause is
that ops_enable cannot reliably tell whether a task is truly dead (no one
else is looking at it and it's about to be freed) and was testing TASK_DEAD
as a proxy instead. Fix it by testing the task's usage count directly.

- ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
tasks, @include_dead is removed from scx_task_iter_next_locked() along
with dead task filtering.

- tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
fails.
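
The "take a reference only if the count is not already zero" semantics can be
sketched in userspace with C11 atomics. This is an illustration, not the
kernel implementation: fake_task, fake_refcount_inc_not_zero() and
fake_tryget_task() are hypothetical names, and the overflow and saturation
handling a real refcount would need is omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the usage count matters here. */
struct fake_task {
	atomic_uint usage;
};

/* Take a reference unless the count has already reached zero. */
static bool fake_refcount_inc_not_zero(atomic_uint *r)
{
	unsigned int old = atomic_load(r);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return true;
	}
	return false;	/* count was zero: the object is on its way out */
}

/* Return @t with an extra reference, or NULL if it can no longer be pinned. */
static struct fake_task *fake_tryget_task(struct fake_task *t)
{
	return fake_refcount_inc_not_zero(&t->usage) ? t : NULL;
}
```

A caller iterating tasks would skip any entry for which fake_tryget_task()
returns NULL and pair every successful call with a matching put.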

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>

Tejun Heo 2 years ago a8532fac 61eeb9a9

+18 -17

2 changed files

include/linux/sched/task.h

···
 	return t;
 }

+static inline struct task_struct *tryget_task_struct(struct task_struct *t)
+{
+	return refcount_inc_not_zero(&t->usage) ? t : NULL;
+}
+
 extern void __put_task_struct(struct task_struct *t);
 extern void __put_task_struct_rcu_cb(struct rcu_head *rhp);

+13 -17

kernel/sched/ext.c

···
  * whether they would like to filter out dead tasks. See scx_task_iter_init()
  * for details.
  */
-static struct task_struct *
-scx_task_iter_next_locked(struct scx_task_iter *iter, bool include_dead)
+static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 {
 	struct task_struct *p;
-retry:
+
 	scx_task_iter_rq_unlock(iter);

 	while ((p = scx_task_iter_next(iter))) {
···
 	iter->rq = task_rq_lock(p, &iter->rf);
 	iter->locked = p;
-
-	/*
-	 * If we see %TASK_DEAD, @p already disabled preemption, is about to do
-	 * the final __schedule(), won't ever need to be scheduled again and can
-	 * thus be safely ignored. If we don't see %TASK_DEAD, @p can't enter
-	 * the final __schedle() while we're locking its rq and thus will stay
-	 * alive until the rq is unlocked.
-	 */
-	if (!include_dead && READ_ONCE(p->__state) == TASK_DEAD)
-		goto retry;

 	return p;
 }
···
 	 * The BPF scheduler is going away. All tasks including %TASK_DEAD ones
 	 * must be switched out and exited synchronously.
 	 */
-	while ((p = scx_task_iter_next_locked(&sti, true))) {
+	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
 		struct sched_enq_and_set_ctx ctx;
···
 	spin_lock_irq(&scx_tasks_lock);

 	scx_task_iter_init(&sti);
-	while ((p = scx_task_iter_next_locked(&sti, false))) {
-		get_task_struct(p);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		/*
+		 * @p may already be dead, have lost all its usages counts and
+		 * be waiting for RCU grace period before being freed. @p can't
+		 * be initialized for SCX in such cases and should be ignored.
+		 */
+		if (!tryget_task_struct(p))
+			continue;
+
 		scx_task_iter_rq_unlock(&sti);
 		spin_unlock_irq(&scx_tasks_lock);
···
 	WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));

 	scx_task_iter_init(&sti);
-	while ((p = scx_task_iter_next_locked(&sti, false))) {
+	while ((p = scx_task_iter_next_locked(&sti))) {
 		const struct sched_class *old_class = p->sched_class;
 		struct sched_enq_and_set_ctx ctx;
