Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
"The merge window pulled in the cgroup sub-scheduler infrastructure,
and new AI reviews are accelerating bug reporting and fixing - hence
a larger-than-usual batch of fixes:

- Use-after-frees during scheduler load/unload:
    - The disable path could free the BPF scheduler while deferred
      irq_work / kthread work was still in flight
    - cgroup setter callbacks read the active scheduler outside the
      rwsem that synchronizes against teardown
Fix both, and reuse the disable drain in the enable error paths so
the BPF JIT page can't be freed under live callbacks.

- Several BPF op invocations didn't tell the framework which runqueue
was already locked, so helper kfuncs that re-acquire the runqueue
by CPU could deadlock on the held lock

Fix the affected callsites, including recursive parent-into-child
dispatch.

- The hardlockup notifier ran from NMI but eventually took a
non-NMI-safe lock. Bounce it through irq_work.

- A handful of bugs in the new sub-scheduler hierarchy:
    - helper kfuncs hard-coded the root instead of resolving the
      caller's scheduler
    - the enable error path tried to disable per-task state that had
      never been initialized, and leaked cpus_read_lock on the way
      out
    - a sysfs object was leaked on every load/unload
    - the dispatch fast-path used the root scheduler instead of the
      task's
    - a couple of CONFIG #ifdef guards were misclassified

- Verifier-time hardening: BPF programs of unrelated struct_ops types
(e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
Now rejected at load. Plus a few NULL and cross-task argument
checks on sched_ext kfuncs, and a selftest covering the new deny.

- rhashtable (Herbert): restore the insecure_elasticity toggle and
bounce the deferred-resize kick through irq_work to break a
lock-order cycle observable from raw-spinlock callers. sched_ext's
scheduler-instance hash is the first user of both.

- The bypass-mode load balancer used file-scope cpumasks; with
multiple scheduler instances now possible, those raced. Move to
per-instance cpumasks, plus a follow-up to skip tasks whose
recorded CPU is stale relative to the new owning runqueue.

- Smaller fixes:
    - a dispatch queue's first-task tracking misbehaved when a parked
      iterator cursor sat in the list
    - the runqueue's next-class wasn't promoted on local-queue
      enqueue, leaving an SCX task behind RT in edge cases
    - the reference qmap scheduler errored out on task-storage misses
      that are legitimate for cross-scheduler tasks"
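
The hardlockup change follows the usual NMI deferral shape: the NMI path only
records state with atomics and queues an NMI-safe irq_work, and the
lock-taking work happens later in the irq_work callback. A minimal standalone
sketch of that pattern (the names here are illustrative, not the code from
this series):

#include <linux/atomic.h>
#include <linux/irq_work.h>
#include <linux/printk.h>

/* State handed from NMI to IRQ context; -1 means "nothing pending". */
static atomic_t pending_cpu = ATOMIC_INIT(-1);

static void lockup_irq_workfn(struct irq_work *work)
{
        int cpu = atomic_xchg(&pending_cpu, -1);

        /* Regular (non-NMI-safe) locks may be taken from here. */
        if (cpu >= 0)
                pr_err("deferred hardlockup handling for CPU %d\n", cpu);
}

static DEFINE_IRQ_WORK(lockup_irq_work, lockup_irq_workfn);

/* Called from the NMI watchdog: record and kick, never take locks. */
static void report_hardlockup_from_nmi(int cpu)
{
        atomic_cmpxchg(&pending_cpu, -1, cpu);
        irq_work_queue(&lockup_irq_work);       /* NMI-safe */
}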

* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
sched_ext: Fix scx_flush_disable_work() UAF race
sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
sched_ext: Refuse cross-task select_cpu_from_kfunc calls
sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
sched_ext: Make bypass LB cpumasks per-scheduler
sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
sched_ext: Unregister sub_kset on scheduler disable
sched_ext: Defer scx_hardlockup() out of NMI
sched_ext: sync disable_irq_work in bpf_scx_unreg()
sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
...
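
Several of the callback fixes above ("Save and restore scx_locked_rq across
SCX_CALL_OP", the recursive parent-into-child dispatch) come down to treating
the per-CPU "this CPU currently holds this rq" hint as a stack rather than a
flag, so a nested op invocation restores its caller's value instead of
clearing it. A hedged sketch of that save/restore shape, with illustrative
names rather than the kernel's actual macros:

#include <linux/percpu.h>

struct rq;      /* stands in for the scheduler's runqueue type */

static DEFINE_PER_CPU(struct rq *, locked_rq_hint);

/*
 * Run @op with the hint pointing at @rq. The caller holds @rq's lock, so
 * preemption is off and __this_cpu_*() is safe. Because the previous value
 * is saved and restored, @op may itself re-enter this helper (a parent
 * scheduler dispatching into a child) without wiping the hint the outer
 * caller still relies on.
 */
static void call_op_with_locked_rq(struct rq *rq, void (*op)(void *), void *arg)
{
        struct rq *prev = __this_cpu_read(locked_rq_hint);

        __this_cpu_write(locked_rq_hint, rq);
        op(arg);
        __this_cpu_write(locked_rq_hint, prev);
}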

+436 -150
+5
include/linux/rhashtable-types.h
···
  #include <linux/alloc_tag.h>
  #include <linux/atomic.h>
  #include <linux/compiler.h>
+ #include <linux/irq_work_types.h>
  #include <linux/mutex.h>
  #include <linux/workqueue_types.h>

···
   * @head_offset: Offset of rhash_head in struct to be hashed
   * @max_size: Maximum size while expanding
   * @min_size: Minimum size while shrinking
+  * @insecure_elasticity: Set to true to disable chain length checks
   * @automatic_shrinking: Enable automatic shrinking of tables
   * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
   * @obj_hashfn: Function to hash object
···
          u16 head_offset;
          unsigned int max_size;
          u16 min_size;
+         bool insecure_elasticity;
          bool automatic_shrinking;
          rht_hashfn_t hashfn;
          rht_obj_hashfn_t obj_hashfn;
···
   * @p: Configuration parameters
   * @rhlist: True if this is an rhltable
   * @run_work: Deferred worker to expand/shrink asynchronously
+  * @run_irq_work: Bounces the @run_work kick through hard IRQ context.
   * @mutex: Mutex to protect current/future table swapping
   * @lock: Spin lock to protect walker list
   * @nelems: Number of elements in table
···
          struct rhashtable_params p;
          bool rhlist;
          struct work_struct run_work;
+         struct irq_work run_irq_work;
          struct mutex mutex;
          spinlock_t lock;
          atomic_t nelems;
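
A hedged usage sketch of the restored toggle - the object, table, and function
names below are illustrative, not from this series (sched_ext's real user is
the scheduler-instance hash in kernel/sched/ext.c). A table whose insert-side
callers already serialize externally sets insecure_elasticity to skip the
per-bucket chain-length check, and with the resize kick now bounced through
irq_work, insertion under a raw spinlock works:

#include <linux/rhashtable.h>
#include <linux/spinlock.h>
#include <linux/stddef.h>

/* Illustrative client object keyed by a fixed-size integer id. */
struct client_obj {
        u64 id;
        struct rhash_head node;
};

static const struct rhashtable_params client_hash_params = {
        .key_len                = sizeof_field(struct client_obj, id),
        .key_offset             = offsetof(struct client_obj, id),
        .head_offset            = offsetof(struct client_obj, node),
        .insecure_elasticity    = true, /* inserts serialized by client_lock */
};

static struct rhashtable client_hash;
static DEFINE_RAW_SPINLOCK(client_lock);

static int client_hash_setup(void)
{
        return rhashtable_init(&client_hash, &client_hash_params);
}

static int client_add(struct client_obj *obj)
{
        unsigned long flags;
        int ret;

        raw_spin_lock_irqsave(&client_lock, flags);
        ret = rhashtable_insert_fast(&client_hash, &obj->node,
                                     client_hash_params);
        raw_spin_unlock_irqrestore(&client_lock, flags);
        return ret;
}
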
+5 -3
include/linux/rhashtable.h
···

  #include <linux/err.h>
  #include <linux/errno.h>
+ #include <linux/irq_work.h>
  #include <linux/jhash.h>
  #include <linux/list_nulls.h>
  #include <linux/workqueue.h>
···
                  goto out;
          }

-         if (elasticity <= 0)
+         if (elasticity <= 0 && !params.insecure_elasticity)
                  goto slow_path;

          data = ERR_PTR(-E2BIG);
          if (unlikely(rht_grow_above_max(ht, tbl)))
                  goto out_unlock;

-         if (unlikely(rht_grow_above_100(ht, tbl)))
+         if (unlikely(rht_grow_above_100(ht, tbl)) &&
+             !params.insecure_elasticity)
                  goto slow_path;

          /* Inserting at head of list makes unlocking free. */
···
          rht_assign_unlock(tbl, bkt, obj, flags);

          if (rht_grow_above_75(ht, tbl))
-                 schedule_work(&ht->run_work);
+                 irq_work_queue(&ht->run_irq_work);

          data = NULL;
  out:
+276 -122
kernel/sched/ext.c
··· 32 32 .key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id), 33 33 .key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id), 34 34 .head_offset = offsetof(struct scx_sched, hash_node), 35 + .insecure_elasticity = true, /* inserted under scx_sched_lock */ 35 36 }; 36 37 37 38 static struct rhashtable scx_sched_hash; ··· 53 52 DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); 54 53 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED); 55 54 static DEFINE_RAW_SPINLOCK(scx_bypass_lock); 56 - static cpumask_var_t scx_bypass_lb_donee_cpumask; 57 - static cpumask_var_t scx_bypass_lb_resched_cpumask; 58 55 static bool scx_init_task_enabled; 59 56 static bool scx_switching_all; 60 57 DEFINE_STATIC_KEY_FALSE(__scx_switched_all); ··· 468 469 __this_cpu_write(scx_locked_rq_state, rq); 469 470 } 470 471 471 - #define SCX_CALL_OP(sch, op, rq, args...) \ 472 + /* 473 + * SCX ops can recurse via scx_bpf_sub_dispatch() - the inner call must not 474 + * clobber the outer's scx_locked_rq_state. Save it on entry, restore on exit. 475 + */ 476 + #define SCX_CALL_OP(sch, op, locked_rq, args...) \ 472 477 do { \ 473 - if (rq) \ 474 - update_locked_rq(rq); \ 478 + struct rq *__prev_locked_rq; \ 479 + \ 480 + if (locked_rq) { \ 481 + __prev_locked_rq = scx_locked_rq(); \ 482 + update_locked_rq(locked_rq); \ 483 + } \ 475 484 (sch)->ops.op(args); \ 476 - if (rq) \ 477 - update_locked_rq(NULL); \ 485 + if (locked_rq) \ 486 + update_locked_rq(__prev_locked_rq); \ 478 487 } while (0) 479 488 480 - #define SCX_CALL_OP_RET(sch, op, rq, args...) \ 489 + #define SCX_CALL_OP_RET(sch, op, locked_rq, args...) \ 481 490 ({ \ 491 + struct rq *__prev_locked_rq; \ 482 492 __typeof__((sch)->ops.op(args)) __ret; \ 483 493 \ 484 - if (rq) \ 485 - update_locked_rq(rq); \ 494 + if (locked_rq) { \ 495 + __prev_locked_rq = scx_locked_rq(); \ 496 + update_locked_rq(locked_rq); \ 497 + } \ 486 498 __ret = (sch)->ops.op(args); \ 487 - if (rq) \ 488 - update_locked_rq(NULL); \ 499 + if (locked_rq) \ 500 + update_locked_rq(__prev_locked_rq); \ 489 501 __ret; \ 490 502 }) 491 503 ··· 508 498 * those subject tasks. 509 499 * 510 500 * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - 511 - * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock 512 - * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if 513 - * kf_tasks[] is set, @p's scheduler-protected fields are stable. 501 + * either via the @locked_rq argument here, or (for ops.select_cpu()) via @p's 502 + * pi_lock held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. 503 + * So if kf_tasks[] is set, @p's scheduler-protected fields are stable. 514 504 * 515 505 * kf_tasks[] can not stack, so task-based SCX ops must not nest. The 516 506 * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants 517 507 * while a previous one is still in progress. 518 508 */ 519 - #define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \ 509 + #define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...) \ 520 510 do { \ 521 511 WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 522 512 current->scx.kf_tasks[0] = task; \ 523 - SCX_CALL_OP((sch), op, rq, task, ##args); \ 513 + SCX_CALL_OP((sch), op, locked_rq, task, ##args); \ 524 514 current->scx.kf_tasks[0] = NULL; \ 525 515 } while (0) 526 516 527 - #define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \ 517 + #define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) 
\ 528 518 ({ \ 529 519 __typeof__((sch)->ops.op(task, ##args)) __ret; \ 530 520 WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 531 521 current->scx.kf_tasks[0] = task; \ 532 - __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ 522 + __ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \ 533 523 current->scx.kf_tasks[0] = NULL; \ 534 524 __ret; \ 535 525 }) 536 526 537 - #define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \ 527 + #define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...) \ 538 528 ({ \ 539 529 __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ 540 530 WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 541 531 current->scx.kf_tasks[0] = task0; \ 542 532 current->scx.kf_tasks[1] = task1; \ 543 - __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ 533 + __ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \ 544 534 current->scx.kf_tasks[0] = NULL; \ 545 535 current->scx.kf_tasks[1] = NULL; \ 546 536 __ret; \ ··· 1398 1388 p->scx.flags &= ~SCX_TASK_IN_CUSTODY; 1399 1389 } 1400 1390 1401 - static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p, 1402 - u64 enq_flags) 1391 + static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq, 1392 + struct task_struct *p, u64 enq_flags) 1403 1393 { 1404 1394 struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1405 - bool preempt = false; 1406 1395 1407 - call_task_dequeue(scx_root, rq, p, 0); 1396 + call_task_dequeue(sch, rq, p, 0); 1397 + 1398 + /* 1399 + * Note that @rq's lock may be dropped between this enqueue and @p 1400 + * actually getting on CPU. This gives higher-class tasks (e.g. RT) 1401 + * an opportunity to wake up on @rq and prevent @p from running. 1402 + * Here are some concrete examples: 1403 + * 1404 + * Example 1: 1405 + * 1406 + * We dispatch two tasks from a single ops.dispatch(): 1407 + * - First, a local task to this CPU's local DSQ; 1408 + * - Second, a local/remote task to a remote CPU's local DSQ. 1409 + * We must drop the local rq lock in order to finish the second 1410 + * dispatch. In that time, an RT task can wake up on the local rq. 1411 + * 1412 + * Example 2: 1413 + * 1414 + * We dispatch a local/remote task to a remote CPU's local DSQ. 1415 + * We must drop the remote rq lock before the dispatched task can run, 1416 + * which gives an RT task an opportunity to wake up on the remote rq. 1417 + * 1418 + * Both examples work the same if we replace dispatching with moving 1419 + * the tasks from a user-created DSQ. 1420 + * 1421 + * We must detect these wakeups so that we can re-enqueue IMMED tasks 1422 + * from @rq's local DSQ. scx_wakeup_preempt() serves exactly this 1423 + * purpose, but for it to be invoked, we must ensure that we bump 1424 + * @rq->next_class to &ext_sched_class if it's currently idle. 1425 + * 1426 + * wakeup_preempt() does the bumping, and since we only invoke it if 1427 + * @rq->next_class is below &ext_sched_class, it will also 1428 + * resched_curr(rq). 1429 + */ 1430 + if (sched_class_above(p->sched_class, rq->next_class)) 1431 + wakeup_preempt(rq, p, 0); 1408 1432 1409 1433 /* 1410 1434 * If @rq is in balance, the CPU is already vacant and looking for the 1411 1435 * next task to run. No need to preempt or trigger resched after moving 1412 1436 * @p into its local DSQ. 1437 + * Note that the wakeup_preempt() above may have already triggered 1438 + * a resched if @rq->next_class was idle. 
It's harmless, since 1439 + * need_resched is cleared immediately after task pick. 1413 1440 */ 1414 1441 if (rq->scx.flags & SCX_RQ_IN_BALANCE) 1415 1442 return; ··· 1454 1407 if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr && 1455 1408 rq->curr->sched_class == &ext_sched_class) { 1456 1409 rq->curr->scx.slice = 0; 1457 - preempt = true; 1458 - } 1459 - 1460 - if (preempt || sched_class_above(&ext_sched_class, rq->curr->sched_class)) 1461 1410 resched_curr(rq); 1411 + } 1462 1412 } 1463 1413 1464 1414 static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, ··· 1538 1494 if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN)) 1539 1495 rcu_assign_pointer(dsq->first_task, p); 1540 1496 } else { 1541 - bool was_empty; 1542 - 1543 - was_empty = list_empty(&dsq->list); 1497 + /* 1498 + * dsq->list can contain parked BPF iterator cursors, so 1499 + * list_empty() here isn't a reliable proxy for "no real 1500 + * task in the DSQ". Test dsq->first_task directly. 1501 + */ 1544 1502 list_add_tail(&p->scx.dsq_list.node, &dsq->list); 1545 - if (was_empty && !(dsq->id & SCX_DSQ_FLAG_BUILTIN)) 1503 + if (!dsq->first_task && !(dsq->id & SCX_DSQ_FLAG_BUILTIN)) 1546 1504 rcu_assign_pointer(dsq->first_task, p); 1547 1505 } 1548 1506 } ··· 1564 1518 * concurrently in a non-atomic way. 1565 1519 */ 1566 1520 if (is_local) { 1567 - local_dsq_post_enq(dsq, p, enq_flags); 1521 + local_dsq_post_enq(sch, dsq, p, enq_flags); 1568 1522 } else { 1569 1523 /* 1570 1524 * Task on global/bypass DSQ: leave custody, task on ··· 2175 2129 schedule_reenq_local(rq, 0); 2176 2130 } 2177 2131 2178 - static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, 2132 + static void move_local_task_to_local_dsq(struct scx_sched *sch, 2133 + struct task_struct *p, u64 enq_flags, 2179 2134 struct scx_dispatch_q *src_dsq, 2180 2135 struct rq *dst_rq) 2181 2136 { ··· 2196 2149 dsq_inc_nr(dst_dsq, p, enq_flags); 2197 2150 p->scx.dsq = dst_dsq; 2198 2151 2199 - local_dsq_post_enq(dst_dsq, p, enq_flags); 2152 + local_dsq_post_enq(sch, dst_dsq, p, enq_flags); 2200 2153 } 2201 2154 2202 2155 /** ··· 2417 2370 /* @p is going from a non-local DSQ to a local DSQ */ 2418 2371 if (src_rq == dst_rq) { 2419 2372 task_unlink_from_dsq(p, src_dsq); 2420 - move_local_task_to_local_dsq(p, enq_flags, 2373 + move_local_task_to_local_dsq(sch, p, enq_flags, 2421 2374 src_dsq, dst_rq); 2422 2375 raw_spin_unlock(&src_dsq->lock); 2423 2376 } else { ··· 2470 2423 2471 2424 if (rq == task_rq) { 2472 2425 task_unlink_from_dsq(p, dsq); 2473 - move_local_task_to_local_dsq(p, enq_flags, dsq, rq); 2426 + move_local_task_to_local_dsq(sch, p, enq_flags, dsq, rq); 2474 2427 raw_spin_unlock(&dsq->lock); 2475 2428 return true; 2476 2429 } ··· 3230 3183 if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && 3231 3184 !scx_bypassing(sch_a, task_cpu(a))) 3232 3185 return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, 3233 - NULL, 3186 + task_rq(a), 3234 3187 (struct task_struct *)a, 3235 3188 (struct task_struct *)b); 3236 3189 else ··· 3678 3631 SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); 3679 3632 } 3680 3633 3634 + /* 3635 + * Undo a completed __scx_init_task(sch, p, false) when scx_enable_task() never 3636 + * ran. The task state has not been transitioned, so this mirrors the 3637 + * SCX_TASK_INIT branch in __scx_disable_and_exit_task(). 
3638 + */ 3639 + static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p) 3640 + { 3641 + struct scx_exit_task_args args = { .cancelled = true }; 3642 + 3643 + lockdep_assert_held(&p->pi_lock); 3644 + lockdep_assert_rq_held(task_rq(p)); 3645 + 3646 + if (SCX_HAS_OP(sch, exit_task)) 3647 + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); 3648 + } 3649 + 3681 3650 static void scx_disable_and_exit_task(struct scx_sched *sch, 3682 3651 struct task_struct *p) 3683 3652 { ··· 3702 3639 /* 3703 3640 * If set, @p exited between __scx_init_task() and scx_enable_task() in 3704 3641 * scx_sub_enable() and is initialized for both the associated sched and 3705 - * its parent. Disable and exit for the child too. 3642 + * its parent. Exit for the child too - scx_enable_task() never ran for 3643 + * it, so undo only init_task. 3706 3644 */ 3707 - if ((p->scx.flags & SCX_TASK_SUB_INIT) && 3708 - !WARN_ON_ONCE(!scx_enabling_sub_sched)) { 3709 - __scx_disable_and_exit_task(scx_enabling_sub_sched, p); 3645 + if (p->scx.flags & SCX_TASK_SUB_INIT) { 3646 + if (!WARN_ON_ONCE(!scx_enabling_sub_sched)) 3647 + scx_sub_init_cancel_task(scx_enabling_sub_sched, p); 3710 3648 p->scx.flags &= ~SCX_TASK_SUB_INIT; 3711 3649 } 3712 3650 ··· 4388 4324 4389 4325 void scx_group_set_weight(struct task_group *tg, unsigned long weight) 4390 4326 { 4391 - struct scx_sched *sch = scx_root; 4327 + struct scx_sched *sch; 4392 4328 4393 4329 percpu_down_read(&scx_cgroup_ops_rwsem); 4330 + sch = scx_root; 4394 4331 4395 4332 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && 4396 4333 tg->scx.weight != weight) ··· 4404 4339 4405 4340 void scx_group_set_idle(struct task_group *tg, bool idle) 4406 4341 { 4407 - struct scx_sched *sch = scx_root; 4342 + struct scx_sched *sch; 4408 4343 4409 4344 percpu_down_read(&scx_cgroup_ops_rwsem); 4345 + sch = scx_root; 4410 4346 4411 4347 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) 4412 4348 SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); ··· 4421 4355 void scx_group_set_bandwidth(struct task_group *tg, 4422 4356 u64 period_us, u64 quota_us, u64 burst_us) 4423 4357 { 4424 - struct scx_sched *sch = scx_root; 4358 + struct scx_sched *sch; 4425 4359 4426 4360 percpu_down_read(&scx_cgroup_ops_rwsem); 4361 + sch = scx_root; 4427 4362 4428 4363 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_bandwidth) && 4429 4364 (tg->scx.bw_period_us != period_us || ··· 4447 4380 return &cgrp_dfl_root.cgrp; 4448 4381 } 4449 4382 4450 - static struct cgroup *sch_cgroup(struct scx_sched *sch) 4451 - { 4452 - return sch->cgrp; 4453 - } 4454 - 4455 - /* for each descendant of @cgrp including self, set ->scx_sched to @sch */ 4456 - static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) 4457 - { 4458 - struct cgroup *pos; 4459 - struct cgroup_subsys_state *css; 4460 - 4461 - cgroup_for_each_live_descendant_pre(pos, css, cgrp) 4462 - rcu_assign_pointer(pos->scx_sched, sch); 4463 - } 4464 - 4465 4383 static void scx_cgroup_lock(void) 4466 4384 { 4467 4385 #ifdef CONFIG_EXT_GROUP_SCHED ··· 4464 4412 } 4465 4413 #else /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ 4466 4414 static struct cgroup *root_cgroup(void) { return NULL; } 4467 - static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } 4468 - static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} 4469 4415 static void scx_cgroup_lock(void) {} 4470 4416 static void scx_cgroup_unlock(void) {} 4471 4417 #endif /* CONFIG_EXT_GROUP_SCHED 
|| CONFIG_EXT_SUB_SCHED */ 4418 + 4419 + #ifdef CONFIG_EXT_SUB_SCHED 4420 + static struct cgroup *sch_cgroup(struct scx_sched *sch) 4421 + { 4422 + return sch->cgrp; 4423 + } 4424 + 4425 + /* for each descendant of @cgrp including self, set ->scx_sched to @sch */ 4426 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) 4427 + { 4428 + struct cgroup *pos; 4429 + struct cgroup_subsys_state *css; 4430 + 4431 + cgroup_for_each_live_descendant_pre(pos, css, cgrp) 4432 + rcu_assign_pointer(pos->scx_sched, sch); 4433 + } 4434 + #else /* CONFIG_EXT_SUB_SCHED */ 4435 + static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } 4436 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} 4437 + #endif /* CONFIG_EXT_SUB_SCHED */ 4472 4438 4473 4439 /* 4474 4440 * Omitted operations: ··· 4782 4712 irq_work_sync(&sch->disable_irq_work); 4783 4713 kthread_destroy_worker(sch->helper); 4784 4714 timer_shutdown_sync(&sch->bypass_lb_timer); 4715 + free_cpumask_var(sch->bypass_lb_donee_cpumask); 4716 + free_cpumask_var(sch->bypass_lb_resched_cpumask); 4785 4717 4786 4718 #ifdef CONFIG_EXT_SUB_SCHED 4787 4719 kfree(sch->cgrp_path); ··· 5010 4938 smp_processor_id(), dur_s); 5011 4939 } 5012 4940 4941 + /* 4942 + * scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(), 4943 + * which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing 4944 + * it from NMI context can lead to deadlocks. Defer via irq_work; the 4945 + * disable path runs off irq_work anyway. 4946 + */ 4947 + static atomic_t scx_hardlockup_cpu = ATOMIC_INIT(-1); 4948 + 4949 + static void scx_hardlockup_irq_workfn(struct irq_work *work) 4950 + { 4951 + int cpu = atomic_xchg(&scx_hardlockup_cpu, -1); 4952 + 4953 + if (cpu >= 0 && handle_lockup("hard lockup - CPU %d", cpu)) 4954 + printk_deferred(KERN_ERR "sched_ext: Hard lockup - CPU %d, disabling BPF scheduler\n", 4955 + cpu); 4956 + } 4957 + 4958 + static DEFINE_IRQ_WORK(scx_hardlockup_irq_work, scx_hardlockup_irq_workfn); 4959 + 5013 4960 /** 5014 4961 * scx_hardlockup - sched_ext hardlockup handler 5015 4962 * ··· 5037 4946 * Try kicking out the current scheduler in an attempt to recover the system to 5038 4947 * a good state before taking more drastic actions. 5039 4948 * 5040 - * Returns %true if sched_ext is enabled and abort was initiated, which may 5041 - * resolve the reported hardlockup. %false if sched_ext is not enabled or 5042 - * someone else already initiated abort. 4949 + * Queues an irq_work; the handle_lockup() call happens in IRQ context (see 4950 + * scx_hardlockup_irq_workfn). 4951 + * 4952 + * Returns %true if sched_ext is enabled and the work was queued, %false 4953 + * otherwise. 5043 4954 */ 5044 4955 bool scx_hardlockup(int cpu) 5045 4956 { 5046 - if (!handle_lockup("hard lockup - CPU %d", cpu)) 4957 + if (!rcu_access_pointer(scx_root)) 5047 4958 return false; 5048 4959 5049 - printk_deferred(KERN_ERR "sched_ext: Hard lockup - CPU %d, disabling BPF scheduler\n", 5050 - cpu); 4960 + atomic_cmpxchg(&scx_hardlockup_cpu, -1, cpu); 4961 + irq_work_queue(&scx_hardlockup_irq_work); 5051 4962 return true; 5052 4963 } 5053 4964 ··· 5092 4999 5093 5000 if (cpumask_empty(donee_mask)) 5094 5001 break; 5002 + 5003 + /* 5004 + * If an earlier pass placed @p on @donor_dsq from a different 5005 + * CPU and the donee hasn't consumed it yet, @p is still on the 5006 + * previous CPU and task_rq(@p) != @donor_rq. @p can't be moved 5007 + * without its rq locked. Skip. 
5008 + */ 5009 + if (task_rq(p) != donor_rq) 5010 + continue; 5095 5011 5096 5012 donee = cpumask_any_and_distribute(donee_mask, p->cpus_ptr); 5097 5013 if (donee >= nr_cpu_ids) ··· 5160 5058 static void bypass_lb_node(struct scx_sched *sch, int node) 5161 5059 { 5162 5060 const struct cpumask *node_mask = cpumask_of_node(node); 5163 - struct cpumask *donee_mask = scx_bypass_lb_donee_cpumask; 5164 - struct cpumask *resched_mask = scx_bypass_lb_resched_cpumask; 5061 + struct cpumask *donee_mask = sch->bypass_lb_donee_cpumask; 5062 + struct cpumask *resched_mask = sch->bypass_lb_resched_cpumask; 5165 5063 u32 nr_tasks = 0, nr_cpus = 0, nr_balanced = 0; 5166 5064 u32 nr_target, nr_donor_target; 5167 5065 u32 before_min = U32_MAX, before_max = 0; ··· 5800 5698 5801 5699 if (sch->ops.exit) 5802 5700 SCX_CALL_OP(sch, exit, NULL, sch->exit_info); 5701 + if (sch->sub_kset) 5702 + kset_unregister(sch->sub_kset); 5803 5703 kobject_del(&sch->kobj); 5804 5704 } 5805 5705 #else /* CONFIG_EXT_SUB_SCHED */ ··· 5933 5829 * could observe an object of the same name still in the hierarchy when 5934 5830 * the next scheduler is loaded. 5935 5831 */ 5832 + #ifdef CONFIG_EXT_SUB_SCHED 5833 + if (sch->sub_kset) 5834 + kset_unregister(sch->sub_kset); 5835 + #endif 5936 5836 kobject_del(&sch->kobj); 5937 5837 5938 5838 free_kick_syncs(); ··· 6027 5919 guard(preempt)(); 6028 5920 if (scx_claim_exit(sch, kind)) 6029 5921 irq_work_queue(&sch->disable_irq_work); 5922 + } 5923 + 5924 + /** 5925 + * scx_flush_disable_work - flush the disable work and wait for it to finish 5926 + * @sch: the scheduler 5927 + * 5928 + * sch->disable_work might still not queued, causing kthread_flush_work() 5929 + * as a noop. Syncing the irq_work first is required to guarantee the 5930 + * kthread work has been queued before waiting for it. 
5931 + */ 5932 + static void scx_flush_disable_work(struct scx_sched *sch) 5933 + { 5934 + int kind; 5935 + 5936 + do { 5937 + irq_work_sync(&sch->disable_irq_work); 5938 + kthread_flush_work(&sch->disable_work); 5939 + kind = atomic_read(&sch->exit_kind); 5940 + } while (kind != SCX_EXIT_NONE && kind != SCX_EXIT_DONE); 6030 5941 } 6031 5942 6032 5943 static void dump_newline(struct seq_buf *s) ··· 6159 6032 scx_dump_data.cpu = -1; 6160 6033 } 6161 6034 6162 - static void scx_dump_task(struct scx_sched *sch, 6163 - struct seq_buf *s, struct scx_dump_ctx *dctx, 6164 - struct task_struct *p, char marker) 6035 + static void scx_dump_task(struct scx_sched *sch, struct seq_buf *s, struct scx_dump_ctx *dctx, 6036 + struct rq *rq, struct task_struct *p, char marker) 6165 6037 { 6166 6038 static unsigned long bt[SCX_EXIT_BT_LEN]; 6167 6039 struct scx_sched *task_sch = scx_task_sched(p); ··· 6201 6075 6202 6076 if (SCX_HAS_OP(sch, dump_task)) { 6203 6077 ops_dump_init(s, " "); 6204 - SCX_CALL_OP(sch, dump_task, NULL, dctx, p); 6078 + SCX_CALL_OP(sch, dump_task, rq, dctx, p); 6205 6079 ops_dump_exit(); 6206 6080 } 6207 6081 ··· 6325 6199 used = seq_buf_used(&ns); 6326 6200 if (SCX_HAS_OP(sch, dump_cpu)) { 6327 6201 ops_dump_init(&ns, " "); 6328 - SCX_CALL_OP(sch, dump_cpu, NULL, 6329 - &dctx, cpu, idle); 6202 + SCX_CALL_OP(sch, dump_cpu, rq, &dctx, cpu, idle); 6330 6203 ops_dump_exit(); 6331 6204 } 6332 6205 ··· 6348 6223 6349 6224 if (rq->curr->sched_class == &ext_sched_class && 6350 6225 (dump_all_tasks || scx_task_on_sched(sch, rq->curr))) 6351 - scx_dump_task(sch, &s, &dctx, rq->curr, '*'); 6226 + scx_dump_task(sch, &s, &dctx, rq, rq->curr, '*'); 6352 6227 6353 6228 list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) 6354 6229 if (dump_all_tasks || scx_task_on_sched(sch, p)) 6355 - scx_dump_task(sch, &s, &dctx, p, ' '); 6230 + scx_dump_task(sch, &s, &dctx, rq, p, ' '); 6356 6231 next: 6357 6232 rq_unlock_irqrestore(rq, &rf); 6358 6233 } ··· 6562 6437 init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn); 6563 6438 kthread_init_work(&sch->disable_work, scx_disable_workfn); 6564 6439 timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0); 6440 + 6441 + if (!alloc_cpumask_var(&sch->bypass_lb_donee_cpumask, GFP_KERNEL)) { 6442 + ret = -ENOMEM; 6443 + goto err_stop_helper; 6444 + } 6445 + if (!alloc_cpumask_var(&sch->bypass_lb_resched_cpumask, GFP_KERNEL)) { 6446 + ret = -ENOMEM; 6447 + goto err_free_lb_cpumask; 6448 + } 6565 6449 sch->ops = *ops; 6566 6450 rcu_assign_pointer(ops->priv, sch); 6567 6451 ··· 6580 6446 char *buf = kzalloc(PATH_MAX, GFP_KERNEL); 6581 6447 if (!buf) { 6582 6448 ret = -ENOMEM; 6583 - goto err_stop_helper; 6449 + goto err_free_lb_resched; 6584 6450 } 6585 6451 cgroup_path(cgrp, buf, PATH_MAX); 6586 6452 sch->cgrp_path = kstrdup(buf, GFP_KERNEL); 6587 6453 kfree(buf); 6588 6454 if (!sch->cgrp_path) { 6589 6455 ret = -ENOMEM; 6590 - goto err_stop_helper; 6456 + goto err_free_lb_resched; 6591 6457 } 6592 6458 6593 6459 sch->cgrp = cgrp; ··· 6622 6488 #endif /* CONFIG_EXT_SUB_SCHED */ 6623 6489 return sch; 6624 6490 6625 - #ifdef CONFIG_EXT_SUB_SCHED 6491 + err_free_lb_resched: 6492 + free_cpumask_var(sch->bypass_lb_resched_cpumask); 6493 + err_free_lb_cpumask: 6494 + free_cpumask_var(sch->bypass_lb_donee_cpumask); 6626 6495 err_stop_helper: 6627 6496 kthread_destroy_worker(sch->helper); 6628 - #endif 6629 6497 err_free_pcpu: 6630 6498 for_each_possible_cpu(cpu) { 6631 6499 if (cpu == bypass_fail_cpu) ··· 6646 6510 err_free_sch: 6647 6511 kfree(sch); 
6648 6512 err_put_cgrp: 6649 - #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 6513 + #ifdef CONFIG_EXT_SUB_SCHED 6650 6514 cgroup_put(cgrp); 6651 6515 #endif 6652 6516 return ERR_PTR(ret); ··· 6737 6601 if (ret) 6738 6602 goto err_unlock; 6739 6603 6740 - #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 6604 + #ifdef CONFIG_EXT_SUB_SCHED 6741 6605 cgroup_get(cgrp); 6742 6606 #endif 6743 6607 sch = scx_alloc_and_add_sched(ops, cgrp, NULL); ··· 6775 6639 rcu_assign_pointer(scx_root, sch); 6776 6640 6777 6641 ret = scx_link_sched(sch); 6778 - if (ret) 6642 + if (ret) { 6643 + cpus_read_unlock(); 6779 6644 goto err_disable; 6645 + } 6780 6646 6781 6647 scx_idle_enable(ops); 6782 6648 ··· 6959 6821 * completion. sch's base reference will be put by bpf_scx_unreg(). 6960 6822 */ 6961 6823 scx_error(sch, "scx_root_enable() failed (%d)", ret); 6962 - kthread_flush_work(&sch->disable_work); 6824 + scx_flush_disable_work(sch); 6963 6825 cmd->ret = 0; 6964 6826 } 6965 6827 ··· 7210 7072 abort: 7211 7073 put_task_struct(p); 7212 7074 scx_task_iter_stop(&sti); 7213 - scx_enabling_sub_sched = NULL; 7214 7075 7076 + /* 7077 + * Undo __scx_init_task() for tasks we marked. scx_enable_task() never 7078 + * ran for @sch on them, so calling scx_disable_task() here would invoke 7079 + * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched 7080 + * must stay set until SUB_INIT is cleared from every marked task - 7081 + * scx_disable_and_exit_task() reads it when a task exits concurrently. 7082 + */ 7215 7083 scx_task_iter_start(&sti, sch->cgrp); 7216 7084 while ((p = scx_task_iter_next_locked(&sti))) { 7217 7085 if (p->scx.flags & SCX_TASK_SUB_INIT) { 7218 - __scx_disable_and_exit_task(sch, p); 7086 + scx_sub_init_cancel_task(sch, p); 7219 7087 p->scx.flags &= ~SCX_TASK_SUB_INIT; 7220 7088 } 7221 7089 } 7222 7090 scx_task_iter_stop(&sti); 7091 + scx_enabling_sub_sched = NULL; 7223 7092 err_unlock_and_disable: 7224 7093 /* we'll soon enter disable path, keep bypass on */ 7225 7094 scx_cgroup_unlock(); 7226 7095 percpu_up_write(&scx_fork_rwsem); 7227 7096 err_disable: 7228 7097 mutex_unlock(&scx_enable_mutex); 7229 - kthread_flush_work(&sch->disable_work); 7098 + scx_flush_disable_work(sch); 7230 7099 cmd->ret = 0; 7231 7100 } 7232 7101 ··· 7494 7349 struct scx_sched *sch = rcu_dereference_protected(ops->priv, true); 7495 7350 7496 7351 scx_disable(sch, SCX_EXIT_UNREG); 7497 - kthread_flush_work(&sch->disable_work); 7352 + scx_flush_disable_work(sch); 7498 7353 RCU_INIT_POINTER(ops->priv, NULL); 7499 7354 kobject_put(&sch->kobj); 7500 7355 } ··· 8178 8033 struct task_struct *p, u64 dsq_id, u64 enq_flags) 8179 8034 { 8180 8035 struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; 8181 - struct scx_sched *sch = src_dsq->sched; 8036 + struct scx_sched *sch; 8182 8037 struct rq *this_rq, *src_rq, *locked_rq; 8183 8038 bool dispatched = false; 8184 8039 bool in_balance; 8185 8040 unsigned long flags; 8041 + 8042 + /* 8043 + * The verifier considers an iterator slot initialized on any 8044 + * KF_ITER_NEW return, so a BPF program may legally reach here after 8045 + * bpf_iter_scx_dsq_new() failed and left @kit->dsq NULL. 
8046 + */ 8047 + if (unlikely(!src_dsq)) 8048 + return false; 8049 + 8050 + sch = src_dsq->sched; 8186 8051 8187 8052 if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) 8188 8053 return false; ··· 8681 8526 8682 8527 guard(rcu)(); 8683 8528 sch = scx_prog_sched(aux); 8684 - if (unlikely(!scx_task_on_sched(sch, p))) 8529 + if (unlikely(!sch || !scx_task_on_sched(sch, p))) 8685 8530 return false; 8686 8531 8687 8532 p->scx.slice = slice; ··· 8704 8549 8705 8550 guard(rcu)(); 8706 8551 sch = scx_prog_sched(aux); 8707 - if (unlikely(!scx_task_on_sched(sch, p))) 8552 + if (unlikely(!sch || !scx_task_on_sched(sch, p))) 8708 8553 return false; 8709 8554 8710 8555 p->scx.dsq_vtime = vtime; ··· 8788 8633 /** 8789 8634 * scx_bpf_dsq_nr_queued - Return the number of queued tasks 8790 8635 * @dsq_id: id of the DSQ 8636 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8791 8637 * 8792 8638 * Return the number of tasks in the DSQ matching @dsq_id. If not found, 8793 8639 * -%ENOENT is returned. 8794 8640 */ 8795 - __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id) 8641 + __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id, const struct bpf_prog_aux *aux) 8796 8642 { 8797 8643 struct scx_sched *sch; 8798 8644 struct scx_dispatch_q *dsq; ··· 8801 8645 8802 8646 preempt_disable(); 8803 8647 8804 - sch = rcu_dereference_sched(scx_root); 8648 + sch = scx_prog_sched(aux); 8805 8649 if (unlikely(!sch)) { 8806 8650 ret = -ENODEV; 8807 8651 goto out; ··· 8833 8677 /** 8834 8678 * scx_bpf_destroy_dsq - Destroy a custom DSQ 8835 8679 * @dsq_id: DSQ to destroy 8680 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8836 8681 * 8837 8682 * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with 8838 8683 * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is 8839 8684 * empty and no further tasks are dispatched to it. Ignored if called on a DSQ 8840 8685 * which doesn't exist. Can be called from any online scx_ops operations. 8841 8686 */ 8842 - __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id) 8687 + __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id, const struct bpf_prog_aux *aux) 8843 8688 { 8844 8689 struct scx_sched *sch; 8845 8690 8846 - rcu_read_lock(); 8847 - sch = rcu_dereference(scx_root); 8691 + guard(rcu)(); 8692 + sch = scx_prog_sched(aux); 8848 8693 if (sch) 8849 8694 destroy_dsq(sch, dsq_id); 8850 - rcu_read_unlock(); 8851 8695 } 8852 8696 8853 8697 /** ··· 9601 9445 BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU); 9602 9446 BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU); 9603 9447 BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS) 9604 - BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) 9605 - BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) 9448 + BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued, KF_IMPLICIT_ARGS) 9449 + BTF_ID_FLAGS(func, scx_bpf_destroy_dsq, KF_IMPLICIT_ARGS) 9606 9450 BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL) 9607 9451 BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS) 9608 9452 BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS) ··· 9635 9479 static const struct btf_kfunc_id_set scx_kfunc_set_any = { 9636 9480 .owner = THIS_MODULE, 9637 9481 .set = &scx_kfunc_ids_any, 9482 + .filter = scx_kfunc_context_filter, 9638 9483 }; 9639 9484 9640 9485 /* ··· 9683 9526 }; 9684 9527 9685 9528 /* 9686 - * Verifier-time filter for context-sensitive SCX kfuncs. 
Registered via the 9687 - * .filter field on each per-group btf_kfunc_id_set. The BPF core invokes this 9688 - * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or 9529 + * Verifier-time filter for SCX kfuncs. Registered via the .filter field on 9530 + * each per-group btf_kfunc_id_set. The BPF core invokes this for every kfunc 9531 + * call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or 9689 9532 * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the 9690 - * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. 9691 - * scx_kfunc_ids_any) by falling through to "allow" when none of the 9692 - * context-sensitive sets contain the kfunc. 9533 + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern by 9534 + * falling through to "allow" when none of the SCX sets contain the kfunc. 9693 9535 */ 9694 9536 int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) 9695 9537 { ··· 9697 9541 bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); 9698 9542 bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); 9699 9543 bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); 9544 + bool in_idle = btf_id_set8_contains(&scx_kfunc_ids_idle, kfunc_id); 9545 + bool in_any = btf_id_set8_contains(&scx_kfunc_ids_any, kfunc_id); 9700 9546 u32 moff, flags; 9701 9547 9702 - /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ 9703 - if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) 9548 + /* Not an SCX kfunc - allow. */ 9549 + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || 9550 + in_cpu_release || in_idle || in_any)) 9704 9551 return 0; 9705 9552 9706 9553 /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ 9707 9554 if (prog->type == BPF_PROG_TYPE_SYSCALL) 9708 - return (in_unlocked || in_select_cpu) ? 0 : -EACCES; 9555 + return (in_unlocked || in_select_cpu || in_idle || in_any) ? 0 : -EACCES; 9709 9556 9710 9557 if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) 9711 - return -EACCES; 9558 + return (in_any || in_idle) ? 0 : -EACCES; 9712 9559 9713 9560 /* 9714 9561 * add_subprog_and_kfunc() collects all kfunc calls, including dead code ··· 9724 9565 return 0; 9725 9566 9726 9567 /* 9727 - * Non-SCX struct_ops: only unlocked kfuncs are safe. The other 9728 - * context-sensitive kfuncs assume the rq lock is held by the SCX 9729 - * dispatch path, which doesn't apply to other struct_ops users. 9568 + * Non-SCX struct_ops: SCX kfuncs are not permitted. 9730 9569 */ 9731 9570 if (prog->aux->st_ops != &bpf_sched_ext_ops) 9732 - return in_unlocked ? 0 : -EACCES; 9571 + return -EACCES; 9733 9572 9734 9573 /* SCX struct_ops: check the per-op allow list. */ 9574 + if (in_any || in_idle) 9575 + return 0; 9576 + 9735 9577 moff = prog->aux->attach_st_ops_member_off; 9736 9578 flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; 9737 9579 ··· 9814 9654 if (ret < 0) { 9815 9655 pr_err("sched_ext: Failed to add global attributes\n"); 9816 9656 return ret; 9817 - } 9818 - 9819 - if (!alloc_cpumask_var(&scx_bypass_lb_donee_cpumask, GFP_KERNEL) || 9820 - !alloc_cpumask_var(&scx_bypass_lb_resched_cpumask, GFP_KERNEL)) { 9821 - pr_err("sched_ext: Failed to allocate cpumasks\n"); 9822 - return -ENOMEM; 9823 9657 } 9824 9658 9825 9659 return 0;
+18 -2
kernel/sched/ext_idle.c
···
           * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq
           * lock or @p's pi_lock. Three cases:
           *
-          * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock.
+          * - inside ops.select_cpu(): try_to_wake_up() holds the wake-up
+          *   task's pi_lock; the wake-up task is recorded in kf_tasks[0]
+          *   by SCX_CALL_OP_TASK_RET().
           * - other rq-locked SCX op: scx_locked_rq() points at the held rq.
           * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops):
           *   nothing held, take pi_lock ourselves.
+          *
+          * In the first two cases, BPF schedulers may pass an arbitrary task
+          * that the held lock doesn't cover. Refuse those.
           */
          if (this_rq()->scx.in_select_cpu) {
+                 if (!scx_kf_arg_task_ok(sch, p))
+                         return -EINVAL;
                  lockdep_assert_held(&p->pi_lock);
-         } else if (!scx_locked_rq()) {
+         } else if (scx_locked_rq()) {
+                 if (task_rq(p) != scx_locked_rq())
+                         goto cross_task;
+         } else {
                  raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
                  we_locked = true;
          }
···
                  raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);

          return cpu;
+
+ cross_task:
+         scx_error(sch, "select_cpu kfunc called cross-task on %s[%d]",
+                   p->comm, p->pid);
+         return -EINVAL;
  }
···
  static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
          .owner = THIS_MODULE,
          .set = &scx_kfunc_ids_idle,
+         .filter = scx_kfunc_context_filter,
  };

  /*
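
To make the cross-task rule concrete, a hedged BPF-side sketch using the
usual conventions from tools/sched_ext (the ops struct and names are
illustrative): calling an idle-selection kfunc such as
scx_bpf_select_cpu_dfl() on the wakee passed to ops.select_cpu() is covered
by the pi_lock that try_to_wake_up() holds; passing any other task is a
cross-task call and is now refused:

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
        bool is_idle = false;

        /*
         * OK: @p is the wakee whose pi_lock try_to_wake_up() holds, which
         * is what kf_tasks[0] records. Passing a different task here
         * (e.g. one looked up via bpf_task_from_pid()) would be a
         * cross-task call and now fails instead of racing on p->cpus_ptr.
         */
        return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}

SEC(".struct_ops.link")
struct sched_ext_ops sketch_select_ops = {
        .select_cpu     = (void *)sketch_select_cpu,
        .name           = "select_sketch",
};
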
+1
kernel/sched/ext_idle.h
···

  struct sched_ext_ops;

+ extern struct btf_id_set8 scx_kfunc_ids_idle;
  extern struct btf_id_set8 scx_kfunc_ids_select_cpu;

  void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops);
+2
kernel/sched/ext_internal.h
···
          struct irq_work disable_irq_work;
          struct kthread_work disable_work;
          struct timer_list bypass_lb_timer;
+         cpumask_var_t bypass_lb_donee_cpumask;
+         cpumask_var_t bypass_lb_resched_cpumask;
          struct rcu_work rcu_work;

          /* all ancestors including self */
+31 -5
lib/rhashtable.c
···

          mutex_unlock(&ht->mutex);

+         /*
+          * Re-arm via @run_work, not @run_irq_work.
+          * rhashtable_free_and_destroy() drains async work as irq_work_sync()
+          * followed by cancel_work_sync(). If this site queued irq_work while
+          * cancel_work_sync() was waiting for us, irq_work_sync() would already
+          * have returned and the stale irq_work could fire post-teardown.
+          * cancel_work_sync() natively handles self-requeue on @run_work.
+          */
          if (err)
                  schedule_work(&ht->run_work);
+ }
+
+ /*
+  * Insert-path callers can run under a raw spinlock (e.g. an insecure_elasticity
+  * user). Calling schedule_work() under that lock records caller_lock ->
+  * pool->lock -> pi_lock -> rq->__lock, closing a locking cycle if any of
+  * these is acquired in the reverse direction elsewhere. Bounce through
+  * irq_work so the schedule_work() runs with the caller's lock no longer held.
+  */
+ static void rht_deferred_irq_work(struct irq_work *irq_work)
+ {
+         struct rhashtable *ht = container_of(irq_work, struct rhashtable,
+                                              run_irq_work);
+
+         schedule_work(&ht->run_work);
  }

  static int rhashtable_insert_rehash(struct rhashtable *ht,
···
                  if (err == -EEXIST)
                          err = 0;
          } else
-                 schedule_work(&ht->run_work);
+                 irq_work_queue(&ht->run_irq_work);

          return err;

···

          /* Schedule async rehash to retry allocation in process context. */
          if (err == -ENOMEM)
-                 schedule_work(&ht->run_work);
+                 irq_work_queue(&ht->run_irq_work);

          return err;
  }
···
                  return NULL;
          }

-         if (elasticity <= 0)
+         if (elasticity <= 0 && !ht->p.insecure_elasticity)
                  return ERR_PTR(-EAGAIN);

          return ERR_PTR(-ENOENT);
···
          if (unlikely(rht_grow_above_max(ht, tbl)))
                  return ERR_PTR(-E2BIG);

-         if (unlikely(rht_grow_above_100(ht, tbl)))
+         if (unlikely(rht_grow_above_100(ht, tbl)) &&
+             !ht->p.insecure_elasticity)
                  return ERR_PTR(-EAGAIN);

          head = rht_ptr(bkt, tbl, hash);
···
                  rht_unlock(tbl, bkt, flags);

                  if (inserted && rht_grow_above_75(ht, tbl))
-                         schedule_work(&ht->run_work);
+                         irq_work_queue(&ht->run_irq_work);
          }
          } while (!IS_ERR_OR_NULL(new_tbl));

···
          RCU_INIT_POINTER(ht->tbl, tbl);

          INIT_WORK(&ht->run_work, rht_deferred_worker);
+         init_irq_work(&ht->run_irq_work, rht_deferred_irq_work);

          return 0;
  }
···
          struct bucket_table *tbl, *next_tbl;
          unsigned int i;

+         irq_work_sync(&ht->run_irq_work);
          cancel_work_sync(&ht->run_work);

          mutex_lock(&ht->mutex);
+6 -18
tools/sched_ext/scx_qmap.bpf.c
···

  static struct task_ctx *lookup_task_ctx(struct task_struct *p)
  {
-         struct task_ctx *tctx;
-
-         if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
-                 scx_bpf_error("task_ctx lookup failed");
-                 return NULL;
-         }
-         return tctx;
+         return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
  }

  s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
···
          s32 cpu;

          if (!(tctx = lookup_task_ctx(p)))
-                 return -ESRCH;
+                 return prev_cpu;

          if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
                  return prev_cpu;
···
           */
          if (prev) {
                  tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0);
-                 if (!tctx) {
-                         scx_bpf_error("task_ctx lookup failed");
-                         return;
-                 }
-
-                 tctx->core_sched_seq =
-                         core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
+                 if (tctx)
+                         tctx->core_sched_seq =
+                                 core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
          }
  }
···
          s64 qdist;

          tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
-         if (!tctx) {
-                 scx_bpf_error("task_ctx lookup failed");
+         if (!tctx)
                  return 0;
-         }

          qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];
+1
tools/testing/selftests/sched_ext/Makefile
···
          maximal \
          maybe_null \
          minimal \
+         non_scx_kfunc_deny \
          numa \
          allowed_cpus \
          peek_dsq \
+44
tools/testing/selftests/sched_ext/non_scx_kfunc_deny.bpf.c
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+ /*
+  * Verify that context-sensitive SCX kfuncs (even "unlocked" ones) are
+  * restricted to only SCX struct_ops programs. Non-SCX struct_ops programs,
+  * such as TCP congestion control programs, should be rejected by the BPF
+  * verifier when attempting to call these kfuncs.
+  *
+  * Copyright (C) 2026 Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
+  * Copyright (C) 2026 Cheng-Yang Chou <yphbchou0911@gmail.com>
+  */
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+ #include <bpf/bpf_tracing.h>
+
+ /* SCX kfunc from scx_kfunc_ids_any set */
+ void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
+
+ SEC("struct_ops/ssthresh")
+ __u32 BPF_PROG(tcp_ca_ssthresh, struct sock *sk)
+ {
+         /*
+          * This call should be rejected by the verifier because this is a
+          * TCP congestion control program (non-SCX struct_ops).
+          */
+         scx_bpf_kick_cpu(0, 0);
+         return 2;
+ }
+
+ SEC("struct_ops/cong_avoid")
+ void BPF_PROG(tcp_ca_cong_avoid, struct sock *sk, __u32 ack, __u32 acked) {}
+
+ SEC("struct_ops/undo_cwnd")
+ __u32 BPF_PROG(tcp_ca_undo_cwnd, struct sock *sk) { return 2; }
+
+ SEC(".struct_ops")
+ struct tcp_congestion_ops tcp_non_scx_ca = {
+         .ssthresh       = (void *)tcp_ca_ssthresh,
+         .cong_avoid     = (void *)tcp_ca_cong_avoid,
+         .undo_cwnd      = (void *)tcp_ca_undo_cwnd,
+         .name           = "tcp_kfunc_deny",
+ };
+
+ char _license[] SEC("license") = "GPL";
+47
tools/testing/selftests/sched_ext/non_scx_kfunc_deny.c
···
+ /* SPDX-License-Identifier: GPL-2.0 */
+ /*
+  * Verify that context-sensitive SCX kfuncs (even "unlocked" ones) are
+  * restricted to only SCX struct_ops programs. Non-SCX struct_ops programs,
+  * such as TCP congestion control programs, should be rejected by the BPF
+  * verifier when attempting to call these kfuncs.
+  *
+  * Copyright (C) 2026 Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
+  * Copyright (C) 2026 Cheng-Yang Chou <yphbchou0911@gmail.com>
+  */
+
+ #include <bpf/bpf.h>
+ #include <scx/common.h>
+ #include <unistd.h>
+ #include <errno.h>
+ #include <stdio.h>
+ #include "non_scx_kfunc_deny.bpf.skel.h"
+ #include "scx_test.h"
+
+ static enum scx_test_status run(void *ctx)
+ {
+         struct non_scx_kfunc_deny *skel;
+         int err;
+
+         skel = non_scx_kfunc_deny__open();
+         if (!skel) {
+                 SCX_ERR("Failed to open skel");
+                 return SCX_TEST_FAIL;
+         }
+
+         err = non_scx_kfunc_deny__load(skel);
+         non_scx_kfunc_deny__destroy(skel);
+
+         if (err == 0) {
+                 SCX_ERR("non-SCX BPF program loaded when it should have been rejected");
+                 return SCX_TEST_FAIL;
+         }
+
+         return SCX_TEST_PASS;
+ }
+
+ struct scx_test non_scx_kfunc_deny = {
+         .name = "non_scx_kfunc_deny",
+         .description = "Verify that non-SCX struct_ops programs cannot call SCX kfuncs",
+         .run = run,
+ };
+ REGISTER_SCX_TEST(&non_scx_kfunc_deny)