sched_ext: sync disable_irq_work in bpf_scx_unreg()

When unregistering my self-written scx scheduler, the following panic
occurs:

[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt

The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.

The expected ordering during teardown is:

  bitmap_zero(sch->has_op) + synchronize_rcu()
    -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page

bpf_scx_unreg() is supposed to enforce this order, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, so
kthread_flush_work() can be a noop. As a result, the caller drops the
struct_ops map too early and the JIT page is poisoned with
AARCH64_BREAK_FAULT before disable_workfn ever executes.

The subsequent dequeue_task() therefore still sees
SCX_HAS_OP(sch, quiescent) as true and calls ops.quiescent, which lands
on the poisoned page and triggers the BRK panic.
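
The race can be illustrated with a single-threaded userspace model. This is
only a sketch: every toy_* name below is hypothetical and merely stands in
for the role described above, and "waiting" for the irq_work is modeled by
running its handler inline, which is an assumption of the model rather than
how irq_work_sync() behaves in the kernel.

```c
#include <stdbool.h>

/*
 * Toy userspace model of the teardown race. Every name here is hypothetical
 * and only mirrors the roles described in the commit message; none of this
 * is kernel code.
 */
struct toy_sched {
	bool irq_work_pending;	/* stands in for sch->disable_irq_work being queued */
	bool work_queued;	/* stands in for sch->disable_work being queued */
	bool has_op;		/* stands in for SCX_HAS_OP(sch, quiescent) */
	bool jit_page_freed;	/* stands in for the poisoned struct_ops JIT page */
};

/* Models scx_disable(): only the irq_work is queued, not the kthread work. */
static void toy_disable(struct toy_sched *s)
{
	s->irq_work_pending = true;
}

/*
 * Models irq_work_sync(): the real call waits for the handler to finish;
 * this single-threaded model runs the handler instead. The handler is what
 * queues the kthread work.
 */
static void toy_irq_work_sync(struct toy_sched *s)
{
	if (s->irq_work_pending) {
		s->irq_work_pending = false;
		s->work_queued = true;
	}
}

/* Models kthread_flush_work(): a no-op unless the work is already queued. */
static void toy_flush_work(struct toy_sched *s)
{
	if (s->work_queued) {
		s->work_queued = false;
		s->has_op = false;	/* the disable work clears the op bits */
	}
}

/*
 * Models bpf_scx_unreg() followed by the struct_ops map being freed.
 * Returns true if a later ops.quiescent call would land on a freed page.
 */
static bool toy_unreg_hits_freed_page(bool sync_irq_work_first)
{
	struct toy_sched s = { .has_op = true };

	toy_disable(&s);
	if (sync_irq_work_first)
		toy_irq_work_sync(&s);
	toy_flush_work(&s);
	s.jit_page_freed = true;

	return s.has_op && s.jit_page_freed;
}
```

In this model, toy_unreg_hits_freed_page(false) corresponds to flushing
disable_work alone and reports the stale callback, while
toy_unreg_hits_freed_page(true) corresponds to syncing the irq_work first
and does not.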

Add a helper, scx_flush_disable_work(), so that future callers that need
to flush disable_work can use it.
Also convert the calls in scx_root_enable_workfn() and
scx_sub_enable_workfn(), which have a similar pattern in the error path.

Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

authored by Richard Cheng and committed by Tejun Heo 2 weeks ago 510a2705 4e3d7c89

+17 -3

1 changed file

kernel/sched/ext.c

@@ -5923,6 +5923,20 @@
 	irq_work_queue(&sch->disable_irq_work);
 }
 
+/**
+ * scx_flush_disable_work - flush the disable work and wait for it to finish
+ * @sch: the scheduler
+ *
+ * sch->disable_work might still not queued, causing kthread_flush_work()
+ * as a noop. Syncing the irq_work first is required to guarantee the
+ * kthread work has been queued before waiting for it.
+ */
+static void scx_flush_disable_work(struct scx_sched *sch)
+{
+	irq_work_sync(&sch->disable_irq_work);
+	kthread_flush_work(&sch->disable_work);
+}
+
 static void dump_newline(struct seq_buf *s)
 {
 	trace_sched_ext_dump("");
@@ -6823,7 +6837,7 @@
 	 * completion. sch's base reference will be put by bpf_scx_unreg().
 	 */
 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
-	kthread_flush_work(&sch->disable_work);
+	scx_flush_disable_work(sch);
 	cmd->ret = 0;
 }
 
@@ -7090,7 +7104,7 @@
 	percpu_up_write(&scx_fork_rwsem);
 err_disable:
 	mutex_unlock(&scx_enable_mutex);
-	kthread_flush_work(&sch->disable_work);
+	scx_flush_disable_work(sch);
 	cmd->ret = 0;
 }
 
@@ -7351,7 +7365,7 @@
 	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
 
 	scx_disable(sch, SCX_EXIT_UNREG);
-	kthread_flush_work(&sch->disable_work);
+	scx_flush_disable_work(sch);
 	RCU_INIT_POINTER(ops->priv, NULL);
 	kobject_put(&sch->kobj);
 }
