Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

- cgroup sub-scheduler groundwork

Multiple BPF schedulers can be attached to cgroups and the dispatch
path is made hierarchical. This involves substantial restructuring of
the core dispatch, bypass, watchdog, and dump paths to be
per-scheduler, along with new infrastructure for scheduler ownership
enforcement, lifecycle management, and cgroup subtree iteration

The enqueue path is not yet updated and will follow in a later cycle

- scx_bpf_dsq_reenq() generalized to support any DSQ including remote
local DSQs and user DSQs

Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
to local DSQs either run immediately or get reenqueued back through
ops.enqueue(), giving schedulers tighter control over queueing
latency

Also useful for opportunistic CPU sharing across sub-schedulers
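
As a rough sketch (not code from this series), a BPF scheduler could use the
new flag from ops.enqueue() as below, falling back to a shared DSQ when the
immediate insertion bounces back. SHARED_DSQ and the myscx_ prefix are
made-up names; scx_bpf_dsq_insert(), scx_bpf_task_cpu(), SCX_DSQ_LOCAL_ON,
SCX_SLICE_DFL and SCX_ENQ_REENQ are existing sched_ext interfaces:

  void BPF_STRUCT_OPS(myscx_enqueue, struct task_struct *p, u64 enq_flags)
  {
  	s32 cpu = scx_bpf_task_cpu(p);

  	/*
  	 * Try to run @p on its previous CPU right away. With SCX_ENQ_IMMED,
  	 * if the CPU can't take @p immediately the core reenqueues it
  	 * through this callback (with SCX_ENQ_REENQ set) instead of leaving
  	 * it queued behind other work.
  	 */
  	if (!(enq_flags & SCX_ENQ_REENQ)) {
  		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
  				   enq_flags | SCX_ENQ_IMMED);
  		return;
  	}

  	/* bounced back - queue on a shared user DSQ for ops.dispatch() */
  	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
  }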

- Previously, ops.dequeue() was only invoked when the core knew a task was
in BPF data structures; it missed scheduling property change events and
skipped the callback for non-local DSQ dispatches from ops.select_cpu()

Fixed to guarantee exactly one ops.dequeue() call when a task leaves
BPF scheduler custody
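
A trivial sketch (illustration only, not from the series) of relying on the
exactly-once guarantee for bookkeeping; the counters are made-up globals,
while SCX_DEQ_CORE_SCHED_EXEC is an existing dequeue flag:

  u64 nr_custody_exits;
  u64 nr_core_sched_picks;

  void BPF_STRUCT_OPS(myscx_dequeue, struct task_struct *p, u64 deq_flags)
  {
  	/* guaranteed to run exactly once each time @p leaves custody */
  	__sync_fetch_and_add(&nr_custody_exits, 1);

  	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
  		__sync_fetch_and_add(&nr_core_sched_picks, 1);
  }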

- Kfunc access validation moved from runtime to BPF verifier time,
removing runtime mask enforcement

- Idle SMT sibling prioritization in the idle CPU selection path
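
This benefits schedulers that rely on the built-in idle tracking, e.g. via
the existing scx_bpf_select_cpu_dfl() kfunc. A minimal ops.select_cpu()
along the lines of scx_simple (myscx_ is a made-up prefix):

  s32 BPF_STRUCT_OPS(myscx_select_cpu, struct task_struct *p, s32 prev_cpu,
  		   u64 wake_flags)
  {
  	bool is_idle = false;
  	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

  	/* an idle CPU (now preferring idle SMT siblings) was found */
  	if (is_idle)
  		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

  	return cpu;
  }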

- Documentation, selftest, and tooling updates. Misc bug fixes and
cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
sched_ext: Make string params of __ENUM_set() const
tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
sched_ext: Drop spurious warning on kick during scheduler disable
sched_ext: Warn on task-based SCX op recursion
sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
sched_ext: Remove runtime kfunc mask enforcement
sched_ext: Add verifier-time kfunc context filter
sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
sched_ext: Decouple kfunc unlocked-context check from kf_mask
sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
sched_ext: Drop TRACING access to select_cpu kfuncs
selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
selftests/sched_ext: Improve runner error reporting for invalid arguments
sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
sched_ext: Documentation: Add ops.dequeue() to task lifecycle
tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
...

+5257 -1266
+188 -19
Documentation/scheduler/sched-ext.rst
··· 93 93 # cat /sys/kernel/sched_ext/enable_seq 94 94 1 95 95 96 + Each running scheduler also exposes a per-scheduler ``events`` file under 97 + ``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic 98 + counters. Each counter occupies one ``name value`` line: 99 + 100 + .. code-block:: none 101 + 102 + # cat /sys/kernel/sched_ext/simple/events 103 + SCX_EV_SELECT_CPU_FALLBACK 0 104 + SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0 105 + SCX_EV_DISPATCH_KEEP_LAST 123 106 + SCX_EV_ENQ_SKIP_EXITING 0 107 + SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0 108 + SCX_EV_REENQ_IMMED 0 109 + SCX_EV_REENQ_LOCAL_REPEAT 0 110 + SCX_EV_REFILL_SLICE_DFL 456789 111 + SCX_EV_BYPASS_DURATION 0 112 + SCX_EV_BYPASS_DISPATCH 0 113 + SCX_EV_BYPASS_ACTIVATE 0 114 + SCX_EV_INSERT_NOT_OWNED 0 115 + SCX_EV_SUB_BYPASS_DISPATCH 0 116 + 117 + The counters are described in ``kernel/sched/ext_internal.h``; briefly: 118 + 119 + * ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by 120 + the task and the core scheduler silently picked a fallback CPU. 121 + * ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected 122 + to the global DSQ because the target CPU went offline. 123 + * ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other 124 + task was available (only when ``SCX_OPS_ENQ_LAST`` is not set). 125 + * ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ 126 + directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING`` is not set). 127 + * ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was 128 + dispatched to its local DSQ directly (only when 129 + ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set). 130 + * ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was 131 + re-enqueued because the target CPU was not available for immediate execution. 132 + * ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered 133 + another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ`` 134 + handling in the BPF scheduler. 135 + * ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the 136 + default value (``SCX_SLICE_DFL``). 137 + * ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode. 138 + * ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode. 139 + * ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated. 140 + * ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this 141 + scheduler into a DSQ; such attempts are silently ignored. 142 + * ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass 143 + DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``). 144 + 96 145 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 97 146 detailed information: 98 147 ··· 277 228 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 278 229 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 279 230 280 - A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 281 - by calling ``scx_bpf_dsq_insert()``. If the task is inserted into 282 - ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the 283 - local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 284 - Additionally, inserting directly from ``ops.select_cpu()`` will cause the 285 - ``ops.enqueue()`` callback to be skipped. 
286 - 287 231 Note that the scheduler core will ignore an invalid CPU selection, for 288 232 example, if it's outside the allowed cpumask of the task. 233 + 234 + A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 235 + by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``. 236 + 237 + If the task is inserted into ``SCX_DSQ_LOCAL`` from 238 + ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU 239 + is returned from ``ops.select_cpu()``. Additionally, inserting directly 240 + from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to 241 + be skipped. 242 + 243 + Any other attempt to store a task in BPF-internal data structures from 244 + ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being 245 + invoked. This is discouraged, as it can introduce racy behavior or 246 + inconsistent state. 289 247 290 248 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 291 249 task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` ··· 307 251 308 252 * Queue the task on the BPF side. 309 253 254 + **Task State Tracking and ops.dequeue() Semantics** 255 + 256 + A task is in the "BPF scheduler's custody" when the BPF scheduler is 257 + responsible for managing its lifecycle. A task enters custody when it is 258 + dispatched to a user DSQ or stored in the BPF scheduler's internal data 259 + structures. Custody is entered only from ``ops.enqueue()`` for those 260 + operations. The only exception is dispatching to a user DSQ from 261 + ``ops.select_cpu()``: although the task is not yet technically in BPF 262 + scheduler custody at that point, the dispatch has the same semantic 263 + effect as dispatching from ``ops.enqueue()`` for custody-related 264 + purposes. 265 + 266 + Once ``ops.enqueue()`` is called, the task may or may not enter custody 267 + depending on what the scheduler does: 268 + 269 + * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``, 270 + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler 271 + is done with the task - it either goes straight to a CPU's local run 272 + queue or to the global DSQ as a fallback. The task never enters (or 273 + exits) BPF custody, and ``ops.dequeue()`` will not be called. 274 + 275 + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the 276 + BPF scheduler's custody. When the task later leaves BPF custody 277 + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for 278 + sleep/property changes), ``ops.dequeue()`` will be called exactly 279 + once. 280 + 281 + * **Stored in BPF data structures** (e.g., internal BPF queues): the 282 + task is in BPF custody. ``ops.dequeue()`` will be called when it 283 + leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or 284 + on property change / sleep). 285 + 286 + When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked. 287 + The dequeue can happen for different reasons, distinguished by flags: 288 + 289 + 1. **Regular dispatch**: when a task in BPF custody is dispatched to a 290 + terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for 291 + execution), ``ops.dequeue()`` is triggered without any special flags. 292 + 293 + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and 294 + core scheduling picks a task for execution while it's still in BPF 295 + custody, ``ops.dequeue()`` is called with the 296 + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. 297 + 298 + 3. 
**Scheduling property change**: when a task property changes (via 299 + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, 300 + priority changes, CPU migrations, etc.) while the task is still in 301 + BPF custody, ``ops.dequeue()`` is called with the 302 + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. 303 + 304 + **Important**: Once a task has left BPF custody (e.g., after being 305 + dispatched to a terminal DSQ), property changes will not trigger 306 + ``ops.dequeue()``, since the task is no longer managed by the BPF 307 + scheduler. 308 + 310 309 3. When a CPU is ready to schedule, it first looks at its local DSQ. If 311 310 empty, it then looks at the global DSQ. If there still isn't a task to 312 311 run, ``ops.dispatch()`` is invoked which can use the following two ··· 375 264 rather than performing them immediately. There can be up to 376 265 ``ops.dispatch_max_batch`` pending tasks. 377 266 378 - * ``scx_bpf_move_to_local()`` moves a task from the specified non-local 267 + * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local 379 268 DSQ to the dispatching DSQ. This function cannot be called with any BPF 380 - locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions 269 + locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions 381 270 tasks before trying to move from the specified DSQ. 382 271 383 272 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, ··· 408 297 Task Lifecycle 409 298 -------------- 410 299 411 - The following pseudo-code summarizes the entire lifecycle of a task managed 412 - by a sched_ext scheduler: 300 + The following pseudo-code presents a rough overview of the entire lifecycle 301 + of a task managed by a sched_ext scheduler: 413 302 414 303 .. code-block:: c 415 304 ··· 422 311 423 312 ops.runnable(); /* Task becomes ready to run */ 424 313 425 - while (task is runnable) { 426 - if (task is not in a DSQ && task->scx.slice == 0) { 314 + while (task_is_runnable(task)) { 315 + if (task is not in a DSQ || task->scx.slice == 0) { 427 316 ops.enqueue(); /* Task can be added to a DSQ */ 317 + 318 + /* Task property change (i.e., affinity, nice, etc.)? */ 319 + if (sched_change(task)) { 320 + ops.dequeue(); /* Exiting BPF scheduler custody */ 321 + ops.quiescent(); 322 + 323 + /* Property change callback, e.g. 
ops.set_weight() */ 324 + 325 + ops.runnable(); 326 + continue; 327 + } 428 328 429 329 /* Any usable CPU becomes available */ 430 330 431 - ops.dispatch(); /* Task is moved to a local DSQ */ 331 + ops.dispatch(); /* Task is moved to a local DSQ */ 332 + ops.dequeue(); /* Exiting BPF scheduler custody */ 432 333 } 334 + 433 335 ops.running(); /* Task starts running on its assigned CPU */ 434 - while (task->scx.slice > 0 && task is runnable) 336 + 337 + while (task_is_runnable(task) && task->scx.slice > 0) { 435 338 ops.tick(); /* Called every 1/HZ seconds */ 339 + 340 + if (task->scx.slice == 0) 341 + ops.dispatch(); /* task->scx.slice can be refilled */ 342 + } 343 + 436 344 ops.stopping(); /* Task stops running (time slice expires or wait) */ 437 - 438 - /* Task's CPU becomes available */ 439 - 440 - ops.dispatch(); /* task->scx.slice can be refilled */ 441 345 } 442 346 443 347 ops.quiescent(); /* Task releases its assigned CPU (wait) */ ··· 460 334 461 335 ops.disable(); /* Disable BPF scheduling for the task */ 462 336 ops.exit_task(); /* Task is destroyed */ 337 + 338 + Note that the above pseudo-code does not cover all possible state transitions 339 + and edge cases, to name a few examples: 340 + 341 + * ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing 342 + property change on that task, in which case ``ops.dispatch()`` will be 343 + retried. 344 + 345 + * The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``, 346 + in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go 347 + straight to ``ops.running()``. 348 + 349 + * Property changes may occur at virtually any point during the task's lifecycle, 350 + not just when the task is queued and waiting to be dispatched. For example, 351 + changing a property of a running task will lead to the callback sequence 352 + ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) -> 353 + ``ops.runnable()`` -> ``ops.running()``. 354 + 355 + * A sched_ext task can be preempted by a task from a higher-priority scheduling 356 + class, in which case it will exit the tick-dispatch loop even though it is runnable 357 + and has a non-zero slice. 358 + 359 + See the "Scheduling Cycle" section for a more detailed description of how 360 + a freshly woken up task gets on a CPU. 463 361 464 362 Where to Look 465 363 ============= ··· 526 376 * ``scx_userland[.bpf].c``: A minimal scheduler demonstrating user space 527 377 scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order; 528 378 all others are scheduled in user space by a simple vruntime scheduler. 379 + 380 + Module Parameters 381 + ================= 382 + 383 + sched_ext exposes two module parameters under the ``sched_ext.`` prefix that 384 + control bypass-mode behaviour. These knobs are primarily for debugging; there 385 + is usually no reason to change them during normal operation. They can be read 386 + and written at runtime (mode 0600) via 387 + ``/sys/module/sched_ext/parameters/``. 388 + 389 + ``sched_ext.slice_bypass_us`` (default: 5000 µs) 390 + The time slice assigned to all tasks when the scheduler is in bypass mode, 391 + i.e. during BPF scheduler load, unload, and error recovery. Valid range is 392 + 100 µs to 100 ms. 393 + 394 + ``sched_ext.bypass_lb_intv_us`` (default: 500000 µs) 395 + The interval at which the bypass-mode load balancer redistributes tasks 396 + across CPUs. Set to 0 to disable load balancing during bypass mode. Valid 397 + range is 0 to 10 s. 
529 398 530 399 ABI Instability 531 400 ===============
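
As a quick illustration of the per-scheduler events file documented above, a
minimal (hypothetical) userspace reader might look like this, using the
"simple" scheduler name from the documentation example:

  #include <stdio.h>

  int main(void)
  {
  	FILE *f = fopen("/sys/kernel/sched_ext/simple/events", "r");
  	char name[64];
  	unsigned long long val;

  	if (!f) {
  		perror("fopen");
  		return 1;
  	}

  	/* each counter occupies one "NAME VALUE" line */
  	while (fscanf(f, "%63s %llu", name, &val) == 2)
  		printf("%-40s %llu\n", name, val);

  	fclose(f);
  	return 0;
  }
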
+4
include/linux/cgroup-defs.h
··· 17 17 #include <linux/refcount.h> 18 18 #include <linux/percpu-refcount.h> 19 19 #include <linux/percpu-rwsem.h> 20 + #include <linux/sched.h> 20 21 #include <linux/u64_stats_sync.h> 21 22 #include <linux/workqueue.h> 22 23 #include <linux/bpf-cgroup-defs.h> ··· 628 627 629 628 #ifdef CONFIG_BPF_SYSCALL 630 629 struct bpf_local_storage __rcu *bpf_cgrp_storage; 630 + #endif 631 + #ifdef CONFIG_EXT_SUB_SCHED 632 + struct scx_sched __rcu *scx_sched; 631 633 #endif 632 634 633 635 /* All ancestors including self */
+64 -45
include/linux/sched/ext.h
··· 62 62 SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU, 63 63 }; 64 64 65 + struct scx_deferred_reenq_user { 66 + struct list_head node; 67 + u64 flags; 68 + }; 69 + 70 + struct scx_dsq_pcpu { 71 + struct scx_dispatch_q *dsq; 72 + struct scx_deferred_reenq_user deferred_reenq_user; 73 + }; 74 + 65 75 /* 66 76 * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered 67 77 * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to ··· 88 78 u64 id; 89 79 struct rhash_head hash_node; 90 80 struct llist_node free_node; 81 + struct scx_sched *sched; 82 + struct scx_dsq_pcpu __percpu *pcpu; 91 83 struct rcu_head rcu; 92 84 }; 93 85 94 - /* scx_entity.flags */ 86 + /* sched_ext_entity.flags */ 95 87 enum scx_ent_flags { 96 88 SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ 89 + SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */ 97 90 SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ 98 91 SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ 92 + SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */ 93 + SCX_TASK_IMMED = 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */ 99 94 100 - SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ 95 + /* 96 + * Bits 8 and 9 are used to carry task state: 97 + * 98 + * NONE ops.init_task() not called yet 99 + * INIT ops.init_task() succeeded, but task can be cancelled 100 + * READY fully initialized, but not in sched_ext 101 + * ENABLED fully initialized and in sched_ext 102 + */ 103 + SCX_TASK_STATE_SHIFT = 8, /* bits 8 and 9 are used to carry task state */ 101 104 SCX_TASK_STATE_BITS = 2, 102 105 SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT, 103 106 104 - SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */ 105 - }; 107 + SCX_TASK_NONE = 0 << SCX_TASK_STATE_SHIFT, 108 + SCX_TASK_INIT = 1 << SCX_TASK_STATE_SHIFT, 109 + SCX_TASK_READY = 2 << SCX_TASK_STATE_SHIFT, 110 + SCX_TASK_ENABLED = 3 << SCX_TASK_STATE_SHIFT, 106 111 107 - /* scx_entity.flags & SCX_TASK_STATE_MASK */ 108 - enum scx_task_state { 109 - SCX_TASK_NONE, /* ops.init_task() not called yet */ 110 - SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */ 111 - SCX_TASK_READY, /* fully initialized, but not in sched_ext */ 112 - SCX_TASK_ENABLED, /* fully initialized and in sched_ext */ 112 + /* 113 + * Bits 12 and 13 are used to carry reenqueue reason. In addition to 114 + * %SCX_ENQ_REENQ flag, ops.enqueue() can also test for 115 + * %SCX_TASK_REENQ_REASON_NONE to distinguish reenqueues. 
116 + * 117 + * NONE not being reenqueued 118 + * KFUNC reenqueued by scx_bpf_dsq_reenq() and friends 119 + * IMMED reenqueued due to failed ENQ_IMMED 120 + * PREEMPTED preempted while running 121 + */ 122 + SCX_TASK_REENQ_REASON_SHIFT = 12, 123 + SCX_TASK_REENQ_REASON_BITS = 2, 124 + SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT, 113 125 114 - SCX_TASK_NR_STATES, 126 + SCX_TASK_REENQ_NONE = 0 << SCX_TASK_REENQ_REASON_SHIFT, 127 + SCX_TASK_REENQ_KFUNC = 1 << SCX_TASK_REENQ_REASON_SHIFT, 128 + SCX_TASK_REENQ_IMMED = 2 << SCX_TASK_REENQ_REASON_SHIFT, 129 + SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT, 130 + 131 + /* iteration cursor, not a task */ 132 + SCX_TASK_CURSOR = 1 << 31, 115 133 }; 116 134 117 135 /* scx_entity.dsq_flags */ 118 136 enum scx_ent_dsq_flags { 119 137 SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ 120 - }; 121 - 122 - /* 123 - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from 124 - * everywhere and the following bits track which kfunc sets are currently 125 - * allowed for %current. This simple per-task tracking works because SCX ops 126 - * nest in a limited way. BPF will likely implement a way to allow and disallow 127 - * kfuncs depending on the calling context which will replace this manual 128 - * mechanism. See scx_kf_allow(). 129 - */ 130 - enum scx_kf_mask { 131 - SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ 132 - /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ 133 - SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ 134 - /* 135 - * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and 136 - * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be 137 - * nested inside DISPATCH. 138 - */ 139 - SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ 140 - SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ 141 - SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ 142 - SCX_KF_REST = 1 << 4, /* other rq-locked operations */ 143 - 144 - __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | 145 - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 146 - __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 147 138 }; 148 139 149 140 enum scx_dsq_lnode_flags { ··· 160 149 u32 priv; /* can be used by iter cursor */ 161 150 }; 162 151 163 - #define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv) \ 152 + #define INIT_DSQ_LIST_CURSOR(__cursor, __dsq, __flags) \ 164 153 (struct scx_dsq_list_node) { \ 165 - .node = LIST_HEAD_INIT((__node).node), \ 154 + .node = LIST_HEAD_INIT((__cursor).node), \ 166 155 .flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags), \ 167 - .priv = (__priv), \ 156 + .priv = READ_ONCE((__dsq)->seq), \ 168 157 } 158 + 159 + struct scx_sched; 169 160 170 161 /* 171 162 * The following is embedded in task_struct and contains all fields necessary 172 163 * for a task to be scheduled by SCX. 173 164 */ 174 165 struct sched_ext_entity { 166 + #ifdef CONFIG_CGROUPS 167 + /* 168 + * Associated scx_sched. Updated either during fork or while holding 169 + * both p->pi_lock and rq lock. 
170 + */ 171 + struct scx_sched __rcu *sched; 172 + #endif 175 173 struct scx_dispatch_q *dsq; 174 + atomic_long_t ops_state; 175 + u64 ddsp_dsq_id; 176 + u64 ddsp_enq_flags; 176 177 struct scx_dsq_list_node dsq_list; /* dispatch order */ 177 178 struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ 178 179 u32 dsq_seq; ··· 194 171 s32 sticky_cpu; 195 172 s32 holding_cpu; 196 173 s32 selected_cpu; 197 - u32 kf_mask; /* see scx_kf_mask above */ 198 174 struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ 199 - atomic_long_t ops_state; 200 175 201 176 struct list_head runnable_node; /* rq->scx.runnable_list */ 202 177 unsigned long runnable_at; ··· 202 181 #ifdef CONFIG_SCHED_CORE 203 182 u64 core_sched_at; /* see scx_prio_less() */ 204 183 #endif 205 - u64 ddsp_dsq_id; 206 - u64 ddsp_enq_flags; 207 184 208 185 /* BPF scheduler modifiable fields */ 209 186
+4
init/Kconfig
··· 1190 1190 1191 1191 endif #CGROUP_SCHED 1192 1192 1193 + config EXT_SUB_SCHED 1194 + def_bool y 1195 + depends on SCHED_CLASS_EXT && CGROUPS 1196 + 1193 1197 config SCHED_MM_CID 1194 1198 def_bool y 1195 1199 depends on SMP && RSEQ
+5 -1
kernel/fork.c
··· 2514 2514 fd_install(pidfd, pidfile); 2515 2515 2516 2516 proc_fork_connector(p); 2517 - sched_post_fork(p); 2517 + /* 2518 + * sched_ext needs @p to be associated with its cgroup in its post_fork 2519 + * hook. cgroup_post_fork() should come before sched_post_fork(). 2520 + */ 2518 2521 cgroup_post_fork(p, args); 2522 + sched_post_fork(p); 2519 2523 perf_event_fork(p); 2520 2524 2521 2525 trace_task_newtask(p, clone_flags);
+1 -1
kernel/sched/core.c
··· 4776 4776 p->sched_class->task_fork(p); 4777 4777 raw_spin_unlock_irqrestore(&p->pi_lock, flags); 4778 4778 4779 - return scx_fork(p); 4779 + return scx_fork(p, kargs); 4780 4780 } 4781 4781 4782 4782 void sched_cancel_fork(struct task_struct *p)
+3062 -937
kernel/sched/ext.c
··· 9 9 #include <linux/btf_ids.h> 10 10 #include "ext_idle.h" 11 11 12 + static DEFINE_RAW_SPINLOCK(scx_sched_lock); 13 + 12 14 /* 13 15 * NOTE: sched_ext is in the process of growing multiple scheduler support and 14 16 * scx_root usage is in a transitional state. Naked dereferences are safe if the ··· 19 17 * are used as temporary markers to indicate that the dereferences need to be 20 18 * updated to point to the associated scheduler instances rather than scx_root. 21 19 */ 22 - static struct scx_sched __rcu *scx_root; 20 + struct scx_sched __rcu *scx_root; 21 + 22 + /* 23 + * All scheds, writers must hold both scx_enable_mutex and scx_sched_lock. 24 + * Readers can hold either or rcu_read_lock(). 25 + */ 26 + static LIST_HEAD(scx_sched_all); 27 + 28 + #ifdef CONFIG_EXT_SUB_SCHED 29 + static const struct rhashtable_params scx_sched_hash_params = { 30 + .key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id), 31 + .key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id), 32 + .head_offset = offsetof(struct scx_sched, hash_node), 33 + }; 34 + 35 + static struct rhashtable scx_sched_hash; 36 + #endif 23 37 24 38 /* 25 39 * During exit, a task may schedule after losing its PIDs. When disabling the ··· 51 33 DEFINE_STATIC_KEY_FALSE(__scx_enabled); 52 34 DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); 53 35 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED); 54 - static int scx_bypass_depth; 36 + static DEFINE_RAW_SPINLOCK(scx_bypass_lock); 55 37 static cpumask_var_t scx_bypass_lb_donee_cpumask; 56 38 static cpumask_var_t scx_bypass_lb_resched_cpumask; 57 - static bool scx_aborting; 58 39 static bool scx_init_task_enabled; 59 40 static bool scx_switching_all; 60 41 DEFINE_STATIC_KEY_FALSE(__scx_switched_all); 61 42 62 - /* 63 - * Tracks whether scx_enable() called scx_bypass(true). Used to balance bypass 64 - * depth on enable failure. Will be removed when bypass depth is moved into the 65 - * sched instance. 66 - */ 67 - static bool scx_bypassed_for_enable; 68 - 69 43 static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); 70 44 static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); 71 45 46 + #ifdef CONFIG_EXT_SUB_SCHED 72 47 /* 73 - * A monotically increasing sequence number that is incremented every time a 74 - * scheduler is enabled. This can be used by to check if any custom sched_ext 48 + * The sub sched being enabled. Used by scx_disable_and_exit_task() to exit 49 + * tasks for the sub-sched being enabled. Use a global variable instead of a 50 + * per-task field as all enables are serialized. 51 + */ 52 + static struct scx_sched *scx_enabling_sub_sched; 53 + #else 54 + #define scx_enabling_sub_sched (struct scx_sched *)NULL 55 + #endif /* CONFIG_EXT_SUB_SCHED */ 56 + 57 + /* 58 + * A monotonically increasing sequence number that is incremented every time a 59 + * scheduler is enabled. This can be used to check if any custom sched_ext 75 60 * scheduler has ever been used in the system. 76 61 */ 77 62 static atomic_long_t scx_enable_seq = ATOMIC_LONG_INIT(0); 78 63 79 64 /* 80 - * The maximum amount of time in jiffies that a task may be runnable without 81 - * being scheduled on a CPU. If this timeout is exceeded, it will trigger 82 - * scx_error(). 65 + * Watchdog interval. All scx_sched's share a single watchdog timer and the 66 + * interval is half of the shortest sch->watchdog_timeout. 83 67 */ 84 - static unsigned long scx_watchdog_timeout; 68 + static unsigned long scx_watchdog_interval; 85 69 86 70 /* 87 71 * The last time the delayed work was run. 
This delayed work relies on ··· 126 106 127 107 static LLIST_HEAD(dsqs_to_free); 128 108 129 - /* dispatch buf */ 130 - struct scx_dsp_buf_ent { 131 - struct task_struct *task; 132 - unsigned long qseq; 133 - u64 dsq_id; 134 - u64 enq_flags; 135 - }; 136 - 137 - static u32 scx_dsp_max_batch; 138 - 139 - struct scx_dsp_ctx { 140 - struct rq *rq; 141 - u32 cursor; 142 - u32 nr_tasks; 143 - struct scx_dsp_buf_ent buf[]; 144 - }; 145 - 146 - static struct scx_dsp_ctx __percpu *scx_dsp_ctx; 147 - 148 109 /* string formatting from BPF */ 149 110 struct scx_bstr_buf { 150 111 u64 data[MAX_BPRINTF_VARARGS]; ··· 136 135 static struct scx_bstr_buf scx_exit_bstr_buf; 137 136 138 137 /* ops debug dump */ 138 + static DEFINE_RAW_SPINLOCK(scx_dump_lock); 139 + 139 140 struct scx_dump_data { 140 141 s32 cpu; 141 142 bool first; ··· 159 156 * There usually is no reason to modify these as normal scheduler operation 160 157 * shouldn't be affected by them. The knobs are primarily for debugging. 161 158 */ 162 - static u64 scx_slice_dfl = SCX_SLICE_DFL; 163 159 static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC; 164 160 static unsigned int scx_bypass_lb_intv_us = SCX_BYPASS_LB_DFL_INTV_US; 165 161 ··· 195 193 #define CREATE_TRACE_POINTS 196 194 #include <trace/events/sched_ext.h> 197 195 198 - static void process_ddsp_deferred_locals(struct rq *rq); 196 + static void run_deferred(struct rq *rq); 199 197 static bool task_dead_and_done(struct task_struct *p); 200 - static u32 reenq_local(struct rq *rq); 201 198 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags); 199 + static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind); 202 200 static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind, 203 201 s64 exit_code, const char *fmt, va_list args); 204 202 ··· 229 227 return -(long)jiffies_to_msecs(now - at); 230 228 } 231 229 232 - /* if the highest set bit is N, return a mask with bits [N+1, 31] set */ 233 - static u32 higher_bits(u32 flags) 234 - { 235 - return ~((1 << fls(flags)) - 1); 236 - } 237 - 238 - /* return the mask with only the highest bit set */ 239 - static u32 highest_bit(u32 flags) 240 - { 241 - int bit = fls(flags); 242 - return ((u64)1 << bit) >> 1; 243 - } 244 - 245 230 static bool u32_before(u32 a, u32 b) 246 231 { 247 232 return (s32)(a - b) < 0; 248 233 } 249 234 250 - static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, 251 - struct task_struct *p) 235 + #ifdef CONFIG_EXT_SUB_SCHED 236 + /** 237 + * scx_parent - Find the parent sched 238 + * @sch: sched to find the parent of 239 + * 240 + * Returns the parent scheduler or %NULL if @sch is root. 241 + */ 242 + static struct scx_sched *scx_parent(struct scx_sched *sch) 252 243 { 253 - return sch->global_dsqs[cpu_to_node(task_cpu(p))]; 244 + if (sch->level) 245 + return sch->ancestors[sch->level - 1]; 246 + else 247 + return NULL; 248 + } 249 + 250 + /** 251 + * scx_next_descendant_pre - find the next descendant for pre-order walk 252 + * @pos: the current position (%NULL to initiate traversal) 253 + * @root: sched whose descendants to walk 254 + * 255 + * To be used by scx_for_each_descendant_pre(). Find the next descendant to 256 + * visit for pre-order traversal of @root's descendants. @root is included in 257 + * the iteration and the first node to be visited. 
258 + */ 259 + static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, 260 + struct scx_sched *root) 261 + { 262 + struct scx_sched *next; 263 + 264 + lockdep_assert(lockdep_is_held(&scx_enable_mutex) || 265 + lockdep_is_held(&scx_sched_lock)); 266 + 267 + /* if first iteration, visit @root */ 268 + if (!pos) 269 + return root; 270 + 271 + /* visit the first child if exists */ 272 + next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling); 273 + if (next) 274 + return next; 275 + 276 + /* no child, visit my or the closest ancestor's next sibling */ 277 + while (pos != root) { 278 + if (!list_is_last(&pos->sibling, &scx_parent(pos)->children)) 279 + return list_next_entry(pos, sibling); 280 + pos = scx_parent(pos); 281 + } 282 + 283 + return NULL; 284 + } 285 + 286 + static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) 287 + { 288 + return rhashtable_lookup(&scx_sched_hash, &cgroup_id, 289 + scx_sched_hash_params); 290 + } 291 + 292 + static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) 293 + { 294 + rcu_assign_pointer(p->scx.sched, sch); 295 + } 296 + #else /* CONFIG_EXT_SUB_SCHED */ 297 + static struct scx_sched *scx_parent(struct scx_sched *sch) { return NULL; } 298 + static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; } 299 + static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) { return NULL; } 300 + static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {} 301 + #endif /* CONFIG_EXT_SUB_SCHED */ 302 + 303 + /** 304 + * scx_is_descendant - Test whether sched is a descendant 305 + * @sch: sched to test 306 + * @ancestor: ancestor sched to test against 307 + * 308 + * Test whether @sch is a descendant of @ancestor. 309 + */ 310 + static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor) 311 + { 312 + if (sch->level < ancestor->level) 313 + return false; 314 + return sch->ancestors[ancestor->level] == ancestor; 315 + } 316 + 317 + /** 318 + * scx_for_each_descendant_pre - pre-order walk of a sched's descendants 319 + * @pos: iteration cursor 320 + * @root: sched to walk the descendants of 321 + * 322 + * Walk @root's descendants. @root is included in the iteration and the first 323 + * node to be visited. Must be called with either scx_enable_mutex or 324 + * scx_sched_lock held. 325 + */ 326 + #define scx_for_each_descendant_pre(pos, root) \ 327 + for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \ 328 + (pos) = scx_next_descendant_pre((pos), (root))) 329 + 330 + static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu) 331 + { 332 + return &sch->pnode[cpu_to_node(cpu)]->global_dsq; 254 333 } 255 334 256 335 static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id) ··· 347 264 return __setscheduler_class(p->policy, p->prio); 348 265 } 349 266 350 - /* 351 - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX 352 - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate 353 - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check 354 - * whether it's running from an allowed context. 355 - * 356 - * @mask is constant, always inline to cull the mask calculations. 
357 - */ 358 - static __always_inline void scx_kf_allow(u32 mask) 267 + static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu) 359 268 { 360 - /* nesting is allowed only in increasing scx_kf_mask order */ 361 - WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, 362 - "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", 363 - current->scx.kf_mask, mask); 364 - current->scx.kf_mask |= mask; 365 - barrier(); 269 + return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq; 366 270 } 367 271 368 - static void scx_kf_disallow(u32 mask) 272 + static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu) 369 273 { 370 - barrier(); 371 - current->scx.kf_mask &= ~mask; 274 + #ifdef CONFIG_EXT_SUB_SCHED 275 + /* 276 + * If @sch is a sub-sched which is bypassing, its tasks should go into 277 + * the bypass DSQs of the nearest ancestor which is not bypassing. The 278 + * not-bypassing ancestor is responsible for scheduling all tasks from 279 + * bypassing sub-trees. If all ancestors including root are bypassing, 280 + * all tasks should go to the root's bypass DSQs. 281 + * 282 + * Whenever a sched starts bypassing, all runnable tasks in its subtree 283 + * are re-enqueued after scx_bypassing() is turned on, guaranteeing that 284 + * all tasks are transferred to the right DSQs. 285 + */ 286 + while (scx_parent(sch) && scx_bypassing(sch, cpu)) 287 + sch = scx_parent(sch); 288 + #endif /* CONFIG_EXT_SUB_SCHED */ 289 + 290 + return bypass_dsq(sch, cpu); 291 + } 292 + 293 + /** 294 + * bypass_dsp_enabled - Check if bypass dispatch path is enabled 295 + * @sch: scheduler to check 296 + * 297 + * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled 298 + * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors 299 + * are bypassing. In the former case, the ancestor is not itself bypassing but 300 + * its bypass DSQs will be populated with bypassed tasks from descendants. Thus, 301 + * the ancestor's bypass dispatch path must be active even though its own 302 + * bypass_depth remains zero. 303 + * 304 + * This function checks bypass_dsp_enable_depth which is managed separately from 305 + * bypass_depth to enable this decoupling. See enable_bypass_dsp() and 306 + * disable_bypass_dsp(). 307 + */ 308 + static bool bypass_dsp_enabled(struct scx_sched *sch) 309 + { 310 + return unlikely(atomic_read(&sch->bypass_dsp_enable_depth)); 311 + } 312 + 313 + /** 314 + * rq_is_open - Is the rq available for immediate execution of an SCX task? 315 + * @rq: rq to test 316 + * @enq_flags: optional %SCX_ENQ_* of the task being enqueued 317 + * 318 + * Returns %true if @rq is currently open for executing an SCX task. After a 319 + * %false return, @rq is guaranteed to invoke SCX dispatch path at least once 320 + * before going to idle and not inserting a task into @rq's local DSQ after a 321 + * %false return doesn't cause @rq to stall. 322 + */ 323 + static bool rq_is_open(struct rq *rq, u64 enq_flags) 324 + { 325 + lockdep_assert_rq_held(rq); 326 + 327 + /* 328 + * A higher-priority class task is either running or in the process of 329 + * waking up on @rq. 330 + */ 331 + if (sched_class_above(rq->next_class, &ext_sched_class)) 332 + return false; 333 + 334 + /* 335 + * @rq is either in transition to or in idle and there is no 336 + * higher-priority class task waking up on it. 
337 + */ 338 + if (sched_class_above(&ext_sched_class, rq->next_class)) 339 + return true; 340 + 341 + /* 342 + * @rq is either picking, in transition to, or running an SCX task. 343 + */ 344 + 345 + /* 346 + * If we're in the dispatch path holding rq lock, $curr may or may not 347 + * be ready depending on whether the on-going dispatch decides to extend 348 + * $curr's slice. We say yes here and resolve it at the end of dispatch. 349 + * See balance_one(). 350 + */ 351 + if (rq->scx.flags & SCX_RQ_IN_BALANCE) 352 + return true; 353 + 354 + /* 355 + * %SCX_ENQ_PREEMPT clears $curr's slice if on SCX and kicks dispatch, 356 + * so allow it to avoid spuriously triggering reenq on a combined 357 + * PREEMPT|IMMED insertion. 358 + */ 359 + if (enq_flags & SCX_ENQ_PREEMPT) 360 + return true; 361 + 362 + /* 363 + * @rq is either in transition to or running an SCX task and can't go 364 + * idle without another SCX dispatch cycle. 365 + */ 366 + return false; 372 367 } 373 368 374 369 /* ··· 469 308 __this_cpu_write(scx_locked_rq_state, rq); 470 309 } 471 310 472 - #define SCX_CALL_OP(sch, mask, op, rq, args...) \ 311 + #define SCX_CALL_OP(sch, op, rq, args...) \ 473 312 do { \ 474 313 if (rq) \ 475 314 update_locked_rq(rq); \ 476 - if (mask) { \ 477 - scx_kf_allow(mask); \ 478 - (sch)->ops.op(args); \ 479 - scx_kf_disallow(mask); \ 480 - } else { \ 481 - (sch)->ops.op(args); \ 482 - } \ 315 + (sch)->ops.op(args); \ 483 316 if (rq) \ 484 317 update_locked_rq(NULL); \ 485 318 } while (0) 486 319 487 - #define SCX_CALL_OP_RET(sch, mask, op, rq, args...) \ 320 + #define SCX_CALL_OP_RET(sch, op, rq, args...) \ 488 321 ({ \ 489 322 __typeof__((sch)->ops.op(args)) __ret; \ 490 323 \ 491 324 if (rq) \ 492 325 update_locked_rq(rq); \ 493 - if (mask) { \ 494 - scx_kf_allow(mask); \ 495 - __ret = (sch)->ops.op(args); \ 496 - scx_kf_disallow(mask); \ 497 - } else { \ 498 - __ret = (sch)->ops.op(args); \ 499 - } \ 326 + __ret = (sch)->ops.op(args); \ 500 327 if (rq) \ 501 328 update_locked_rq(NULL); \ 502 329 __ret; \ 503 330 }) 504 331 505 332 /* 506 - * Some kfuncs are allowed only on the tasks that are subjects of the 507 - * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such 508 - * restrictions, the following SCX_CALL_OP_*() variants should be used when 509 - * invoking scx_ops operations that take task arguments. These can only be used 510 - * for non-nesting operations due to the way the tasks are tracked. 333 + * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments 334 + * and records them in current->scx.kf_tasks[] for the duration of the call. A 335 + * kfunc invoked from inside such an op can then use 336 + * scx_kf_arg_task_ok() to verify that its task argument is one of 337 + * those subject tasks. 511 338 * 512 - * kfuncs which can only operate on such tasks can in turn use 513 - * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on 514 - * the specific task. 339 + * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - 340 + * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock 341 + * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if 342 + * kf_tasks[] is set, @p's scheduler-protected fields are stable. 343 + * 344 + * kf_tasks[] can not stack, so task-based SCX ops must not nest. The 345 + * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants 346 + * while a previous one is still in progress. 
515 347 */ 516 - #define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ 348 + #define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \ 517 349 do { \ 518 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 350 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 519 351 current->scx.kf_tasks[0] = task; \ 520 - SCX_CALL_OP((sch), mask, op, rq, task, ##args); \ 352 + SCX_CALL_OP((sch), op, rq, task, ##args); \ 521 353 current->scx.kf_tasks[0] = NULL; \ 522 354 } while (0) 523 355 524 - #define SCX_CALL_OP_TASK_RET(sch, mask, op, rq, task, args...) \ 356 + #define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \ 525 357 ({ \ 526 358 __typeof__((sch)->ops.op(task, ##args)) __ret; \ 527 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 359 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 528 360 current->scx.kf_tasks[0] = task; \ 529 - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task, ##args); \ 361 + __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ 530 362 current->scx.kf_tasks[0] = NULL; \ 531 363 __ret; \ 532 364 }) 533 365 534 - #define SCX_CALL_OP_2TASKS_RET(sch, mask, op, rq, task0, task1, args...) \ 366 + #define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \ 535 367 ({ \ 536 368 __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ 537 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 369 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 538 370 current->scx.kf_tasks[0] = task0; \ 539 371 current->scx.kf_tasks[1] = task1; \ 540 - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task0, task1, ##args); \ 372 + __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ 541 373 current->scx.kf_tasks[0] = NULL; \ 542 374 current->scx.kf_tasks[1] = NULL; \ 543 375 __ret; \ 544 376 }) 545 377 546 - /* @mask is constant, always inline to cull unnecessary branches */ 547 - static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) 548 - { 549 - if (unlikely(!(current->scx.kf_mask & mask))) { 550 - scx_error(sch, "kfunc with mask 0x%x called from an operation only allowing 0x%x", 551 - mask, current->scx.kf_mask); 552 - return false; 553 - } 554 - 555 - /* 556 - * Enforce nesting boundaries. e.g. A kfunc which can be called from 557 - * DISPATCH must not be called if we're running DEQUEUE which is nested 558 - * inside ops.dispatch(). We don't need to check boundaries for any 559 - * blocking kfuncs as the verifier ensures they're only called from 560 - * sleepable progs. 
561 - */ 562 - if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && 563 - (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { 564 - scx_error(sch, "cpu_release kfunc called from a nested operation"); 565 - return false; 566 - } 567 - 568 - if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && 569 - (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { 570 - scx_error(sch, "dispatch kfunc called from a nested operation"); 571 - return false; 572 - } 573 - 574 - return true; 575 - } 576 - 577 378 /* see SCX_CALL_OP_TASK() */ 578 - static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, 579 - u32 mask, 379 + static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch, 580 380 struct task_struct *p) 581 381 { 582 - if (!scx_kf_allowed(sch, mask)) 583 - return false; 584 - 585 382 if (unlikely((p != current->scx.kf_tasks[0] && 586 383 p != current->scx.kf_tasks[1]))) { 587 384 scx_error(sch, "called on a task not being operated on"); ··· 549 430 return true; 550 431 } 551 432 433 + enum scx_dsq_iter_flags { 434 + /* iterate in the reverse dispatch order */ 435 + SCX_DSQ_ITER_REV = 1U << 16, 436 + 437 + __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, 438 + __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, 439 + 440 + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, 441 + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | 442 + __SCX_DSQ_ITER_HAS_SLICE | 443 + __SCX_DSQ_ITER_HAS_VTIME, 444 + }; 445 + 552 446 /** 553 447 * nldsq_next_task - Iterate to the next task in a non-local DSQ 554 - * @dsq: user dsq being iterated 448 + * @dsq: non-local dsq being iterated 555 449 * @cur: current position, %NULL to start iteration 556 450 * @rev: walk backwards 557 451 * ··· 604 472 for ((p) = nldsq_next_task((dsq), NULL, false); (p); \ 605 473 (p) = nldsq_next_task((dsq), (p), false)) 606 474 475 + /** 476 + * nldsq_cursor_next_task - Iterate to the next task given a cursor in a non-local DSQ 477 + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() 478 + * @dsq: non-local dsq being iterated 479 + * 480 + * Find the next task in a cursor based iteration. The caller must have 481 + * initialized @cursor using INIT_DSQ_LIST_CURSOR() and can release the DSQ lock 482 + * between the iteration steps. 483 + * 484 + * Only tasks which were queued before @cursor was initialized are visible. This 485 + * bounds the iteration and guarantees that vtime never jumps in the other 486 + * direction while iterating. 
487 + */ 488 + static struct task_struct *nldsq_cursor_next_task(struct scx_dsq_list_node *cursor, 489 + struct scx_dispatch_q *dsq) 490 + { 491 + bool rev = cursor->flags & SCX_DSQ_ITER_REV; 492 + struct task_struct *p; 493 + 494 + lockdep_assert_held(&dsq->lock); 495 + BUG_ON(!(cursor->flags & SCX_DSQ_LNODE_ITER_CURSOR)); 496 + 497 + if (list_empty(&cursor->node)) 498 + p = NULL; 499 + else 500 + p = container_of(cursor, struct task_struct, scx.dsq_list); 501 + 502 + /* skip cursors and tasks that were queued after @cursor init */ 503 + do { 504 + p = nldsq_next_task(dsq, p, rev); 505 + } while (p && unlikely(u32_before(cursor->priv, p->scx.dsq_seq))); 506 + 507 + if (p) { 508 + if (rev) 509 + list_move_tail(&cursor->node, &p->scx.dsq_list.node); 510 + else 511 + list_move(&cursor->node, &p->scx.dsq_list.node); 512 + } else { 513 + list_del_init(&cursor->node); 514 + } 515 + 516 + return p; 517 + } 518 + 519 + /** 520 + * nldsq_cursor_lost_task - Test whether someone else took the task since iteration 521 + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() 522 + * @rq: rq @p was on 523 + * @dsq: dsq @p was on 524 + * @p: target task 525 + * 526 + * @p is a task returned by nldsq_cursor_next_task(). The locks may have been 527 + * dropped and re-acquired inbetween. Verify that no one else took or is in the 528 + * process of taking @p from @dsq. 529 + * 530 + * On %false return, the caller can assume full ownership of @p. 531 + */ 532 + static bool nldsq_cursor_lost_task(struct scx_dsq_list_node *cursor, 533 + struct rq *rq, struct scx_dispatch_q *dsq, 534 + struct task_struct *p) 535 + { 536 + lockdep_assert_rq_held(rq); 537 + lockdep_assert_held(&dsq->lock); 538 + 539 + /* 540 + * @p could have already left $src_dsq, got re-enqueud, or be in the 541 + * process of being consumed by someone else. 542 + */ 543 + if (unlikely(p->scx.dsq != dsq || 544 + u32_before(cursor->priv, p->scx.dsq_seq) || 545 + p->scx.holding_cpu >= 0)) 546 + return true; 547 + 548 + /* if @p has stayed on @dsq, its rq couldn't have changed */ 549 + if (WARN_ON_ONCE(rq != task_rq(p))) 550 + return true; 551 + 552 + return false; 553 + } 607 554 608 555 /* 609 556 * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse] ··· 690 479 * changes without breaking backward compatibility. Can be used with 691 480 * bpf_for_each(). See bpf_iter_scx_dsq_*(). 692 481 */ 693 - enum scx_dsq_iter_flags { 694 - /* iterate in the reverse dispatch order */ 695 - SCX_DSQ_ITER_REV = 1U << 16, 696 - 697 - __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, 698 - __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, 699 - 700 - __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, 701 - __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | 702 - __SCX_DSQ_ITER_HAS_SLICE | 703 - __SCX_DSQ_ITER_HAS_VTIME, 704 - }; 705 - 706 482 struct bpf_iter_scx_dsq_kern { 707 483 struct scx_dsq_list_node cursor; 708 484 struct scx_dispatch_q *dsq; ··· 712 514 struct rq_flags rf; 713 515 u32 cnt; 714 516 bool list_locked; 517 + #ifdef CONFIG_EXT_SUB_SCHED 518 + struct cgroup *cgrp; 519 + struct cgroup_subsys_state *css_pos; 520 + struct css_task_iter css_iter; 521 + #endif 715 522 }; 716 523 717 524 /** 718 525 * scx_task_iter_start - Lock scx_tasks_lock and start a task iteration 719 526 * @iter: iterator to init 527 + * @cgrp: Optional root of cgroup subhierarchy to iterate 720 528 * 721 - * Initialize @iter and return with scx_tasks_lock held. Once initialized, @iter 722 - * must eventually be stopped with scx_task_iter_stop(). 529 + * Initialize @iter. 
Once initialized, @iter must eventually be stopped with 530 + * scx_task_iter_stop(). 531 + * 532 + * If @cgrp is %NULL, scx_tasks is used for iteration and this function returns 533 + * with scx_tasks_lock held and @iter->cursor inserted into scx_tasks. 534 + * 535 + * If @cgrp is not %NULL, @cgrp and its descendants' tasks are walked using 536 + * @iter->css_iter. The caller must be holding cgroup_lock() to prevent cgroup 537 + * task migrations. 538 + * 539 + * The two modes of iterations are largely independent and it's likely that 540 + * scx_tasks can be removed in favor of always using cgroup iteration if 541 + * CONFIG_SCHED_CLASS_EXT depends on CONFIG_CGROUPS. 723 542 * 724 543 * scx_tasks_lock and the rq lock may be released using scx_task_iter_unlock() 725 544 * between this and the first next() call or between any two next() calls. If ··· 747 532 * All tasks which existed when the iteration started are guaranteed to be 748 533 * visited as long as they are not dead. 749 534 */ 750 - static void scx_task_iter_start(struct scx_task_iter *iter) 535 + static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp) 751 536 { 752 537 memset(iter, 0, sizeof(*iter)); 753 538 539 + #ifdef CONFIG_EXT_SUB_SCHED 540 + if (cgrp) { 541 + lockdep_assert_held(&cgroup_mutex); 542 + iter->cgrp = cgrp; 543 + iter->css_pos = css_next_descendant_pre(NULL, &iter->cgrp->self); 544 + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); 545 + return; 546 + } 547 + #endif 754 548 raw_spin_lock_irq(&scx_tasks_lock); 755 549 756 550 iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; ··· 812 588 */ 813 589 static void scx_task_iter_stop(struct scx_task_iter *iter) 814 590 { 591 + #ifdef CONFIG_EXT_SUB_SCHED 592 + if (iter->cgrp) { 593 + if (iter->css_pos) 594 + css_task_iter_end(&iter->css_iter); 595 + __scx_task_iter_rq_unlock(iter); 596 + return; 597 + } 598 + #endif 815 599 __scx_task_iter_maybe_relock(iter); 816 600 list_del_init(&iter->cursor.tasks_node); 817 601 scx_task_iter_unlock(iter); ··· 843 611 cond_resched(); 844 612 } 845 613 614 + #ifdef CONFIG_EXT_SUB_SCHED 615 + if (iter->cgrp) { 616 + while (iter->css_pos) { 617 + struct task_struct *p; 618 + 619 + p = css_task_iter_next(&iter->css_iter); 620 + if (p) 621 + return p; 622 + 623 + css_task_iter_end(&iter->css_iter); 624 + iter->css_pos = css_next_descendant_pre(iter->css_pos, 625 + &iter->cgrp->self); 626 + if (iter->css_pos) 627 + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); 628 + } 629 + return NULL; 630 + } 631 + #endif 846 632 __scx_task_iter_maybe_relock(iter); 847 633 848 634 list_for_each_entry(pos, cursor, tasks_node) { ··· 1060 810 return -EPROTO; 1061 811 } 1062 812 1063 - static void run_deferred(struct rq *rq) 1064 - { 1065 - process_ddsp_deferred_locals(rq); 1066 - 1067 - if (local_read(&rq->scx.reenq_local_deferred)) { 1068 - local_set(&rq->scx.reenq_local_deferred, 0); 1069 - reenq_local(rq); 1070 - } 1071 - } 1072 - 1073 813 static void deferred_bal_cb_workfn(struct rq *rq) 1074 814 { 1075 815 run_deferred(rq); ··· 1085 845 static void schedule_deferred(struct rq *rq) 1086 846 { 1087 847 /* 1088 - * Queue an irq work. They are executed on IRQ re-enable which may take 1089 - * a bit longer than the scheduler hook in schedule_deferred_locked(). 848 + * This is the fallback when schedule_deferred_locked() can't use 849 + * the cheaper balance callback or wakeup hook paths (the target 850 + * CPU is not in balance or wakeup). 
Currently, this is primarily 851 + * hit by reenqueue operations targeting a remote CPU. 852 + * 853 + * Queue on the target CPU. The deferred work can run from any CPU 854 + * correctly - the _locked() path already processes remote rqs from 855 + * the calling CPU - but targeting the owning CPU allows IPI delivery 856 + * without waiting for the calling CPU to re-enable IRQs and is 857 + * cheaper as the reenqueue runs locally. 1090 858 */ 1091 - irq_work_queue(&rq->scx.deferred_irq_work); 859 + irq_work_queue_on(&rq->scx.deferred_irq_work, cpu_of(rq)); 1092 860 } 1093 861 1094 862 /** ··· 1144 896 * time to IRQ re-enable shouldn't be long. 1145 897 */ 1146 898 schedule_deferred(rq); 899 + } 900 + 901 + static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq, 902 + u64 reenq_flags, struct rq *locked_rq) 903 + { 904 + struct rq *rq; 905 + 906 + /* 907 + * Allowing reenqueues doesn't make sense while bypassing. This also 908 + * blocks from new reenqueues to be scheduled on dead scheds. 909 + */ 910 + if (unlikely(READ_ONCE(sch->bypass_depth))) 911 + return; 912 + 913 + if (dsq->id == SCX_DSQ_LOCAL) { 914 + rq = container_of(dsq, struct rq, scx.local_dsq); 915 + 916 + struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq)); 917 + struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local; 918 + 919 + /* 920 + * Pairs with smp_mb() in process_deferred_reenq_locals() and 921 + * guarantees that there is a reenq_local() afterwards. 922 + */ 923 + smp_mb(); 924 + 925 + if (list_empty(&drl->node) || 926 + (READ_ONCE(drl->flags) & reenq_flags) != reenq_flags) { 927 + 928 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 929 + 930 + if (list_empty(&drl->node)) 931 + list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals); 932 + WRITE_ONCE(drl->flags, drl->flags | reenq_flags); 933 + } 934 + } else if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN)) { 935 + rq = this_rq(); 936 + 937 + struct scx_dsq_pcpu *dsq_pcpu = per_cpu_ptr(dsq->pcpu, cpu_of(rq)); 938 + struct scx_deferred_reenq_user *dru = &dsq_pcpu->deferred_reenq_user; 939 + 940 + /* 941 + * Pairs with smp_mb() in process_deferred_reenq_users() and 942 + * guarantees that there is a reenq_user() afterwards. 
943 + */ 944 + smp_mb(); 945 + 946 + if (list_empty(&dru->node) || 947 + (READ_ONCE(dru->flags) & reenq_flags) != reenq_flags) { 948 + 949 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 950 + 951 + if (list_empty(&dru->node)) 952 + list_move_tail(&dru->node, &rq->scx.deferred_reenq_users); 953 + WRITE_ONCE(dru->flags, dru->flags | reenq_flags); 954 + } 955 + } else { 956 + scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id); 957 + return; 958 + } 959 + 960 + if (rq == locked_rq) 961 + schedule_deferred_locked(rq); 962 + else 963 + schedule_deferred(rq); 964 + } 965 + 966 + static void schedule_reenq_local(struct rq *rq, u64 reenq_flags) 967 + { 968 + struct scx_sched *root = rcu_dereference_sched(scx_root); 969 + 970 + if (WARN_ON_ONCE(!root)) 971 + return; 972 + 973 + schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq); 1147 974 } 1148 975 1149 976 /** ··· 1297 974 return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); 1298 975 } 1299 976 1300 - static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) 977 + static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 enq_flags) 1301 978 { 979 + /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ 980 + WRITE_ONCE(dsq->nr, dsq->nr + 1); 981 + 1302 982 /* 1303 - * scx_bpf_dsq_nr_queued() reads ->nr without locking. Use READ_ONCE() 1304 - * on the read side and WRITE_ONCE() on the write side to properly 1305 - * annotate the concurrent lockless access and avoid KCSAN warnings. 983 + * Once @p reaches a local DSQ, it can only leave it by being dispatched 984 + * to the CPU or dequeued. In both cases, the only way @p can go back to 985 + * the BPF sched is through enqueueing. If being inserted into a local 986 + * DSQ with IMMED, persist the state until the next enqueueing event in 987 + * do_enqueue_task() so that we can maintain IMMED protection through 988 + * e.g. SAVE/RESTORE cycles and slice extensions. 1306 989 */ 1307 - WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); 990 + if (enq_flags & SCX_ENQ_IMMED) { 991 + if (unlikely(dsq->id != SCX_DSQ_LOCAL)) { 992 + WARN_ON_ONCE(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK)); 993 + return; 994 + } 995 + p->scx.flags |= SCX_TASK_IMMED; 996 + } 997 + 998 + if (p->scx.flags & SCX_TASK_IMMED) { 999 + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1000 + 1001 + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 1002 + return; 1003 + 1004 + rq->scx.nr_immed++; 1005 + 1006 + /* 1007 + * If @rq already had other tasks or the current task is not 1008 + * done yet, @p can't go on the CPU immediately. Re-enqueue. 
1009 + */ 1010 + if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags))) 1011 + schedule_reenq_local(rq, 0); 1012 + } 1013 + } 1014 + 1015 + static void dsq_dec_nr(struct scx_dispatch_q *dsq, struct task_struct *p) 1016 + { 1017 + /* see dsq_inc_nr() */ 1018 + WRITE_ONCE(dsq->nr, dsq->nr - 1); 1019 + 1020 + if (p->scx.flags & SCX_TASK_IMMED) { 1021 + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1022 + 1023 + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL) || 1024 + WARN_ON_ONCE(rq->scx.nr_immed <= 0)) 1025 + return; 1026 + 1027 + rq->scx.nr_immed--; 1028 + } 1308 1029 } 1309 1030 1310 1031 static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p) 1311 1032 { 1312 - p->scx.slice = READ_ONCE(scx_slice_dfl); 1033 + p->scx.slice = READ_ONCE(sch->slice_dfl); 1313 1034 __scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1); 1035 + } 1036 + 1037 + /* 1038 + * Return true if @p is moving due to an internal SCX migration, false 1039 + * otherwise. 1040 + */ 1041 + static inline bool task_scx_migrating(struct task_struct *p) 1042 + { 1043 + /* 1044 + * We only need to check sticky_cpu: it is set to the destination 1045 + * CPU in move_remote_task_to_local_dsq() before deactivate_task() 1046 + * and cleared when the task is enqueued on the destination, so it 1047 + * is only non-negative during an internal SCX migration. 1048 + */ 1049 + return p->scx.sticky_cpu >= 0; 1050 + } 1051 + 1052 + /* 1053 + * Call ops.dequeue() if the task is in BPF custody and not migrating. 1054 + * Clears %SCX_TASK_IN_CUSTODY when the callback is invoked. 1055 + */ 1056 + static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, 1057 + struct task_struct *p, u64 deq_flags) 1058 + { 1059 + if (!(p->scx.flags & SCX_TASK_IN_CUSTODY) || task_scx_migrating(p)) 1060 + return; 1061 + 1062 + if (SCX_HAS_OP(sch, dequeue)) 1063 + SCX_CALL_OP_TASK(sch, dequeue, rq, p, deq_flags); 1064 + 1065 + p->scx.flags &= ~SCX_TASK_IN_CUSTODY; 1314 1066 } 1315 1067 1316 1068 static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p, ··· 1393 995 { 1394 996 struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1395 997 bool preempt = false; 998 + 999 + call_task_dequeue(scx_root, rq, p, 0); 1396 1000 1397 1001 /* 1398 1002 * If @rq is in balance, the CPU is already vacant and looking for the ··· 1414 1014 resched_curr(rq); 1415 1015 } 1416 1016 1417 - static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, 1418 - struct task_struct *p, u64 enq_flags) 1017 + static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, 1018 + struct scx_dispatch_q *dsq, struct task_struct *p, 1019 + u64 enq_flags) 1419 1020 { 1420 1021 bool is_local = dsq->id == SCX_DSQ_LOCAL; 1421 1022 ··· 1432 1031 scx_error(sch, "attempting to dispatch to a destroyed dsq"); 1433 1032 /* fall back to the global dsq */ 1434 1033 raw_spin_unlock(&dsq->lock); 1435 - dsq = find_global_dsq(sch, p); 1034 + dsq = find_global_dsq(sch, task_cpu(p)); 1436 1035 raw_spin_lock(&dsq->lock); 1437 1036 } 1438 1037 } ··· 1507 1106 WRITE_ONCE(dsq->seq, dsq->seq + 1); 1508 1107 p->scx.dsq_seq = dsq->seq; 1509 1108 1510 - dsq_mod_nr(dsq, 1); 1109 + dsq_inc_nr(dsq, p, enq_flags); 1511 1110 p->scx.dsq = dsq; 1111 + 1112 + /* 1113 + * Update custody and call ops.dequeue() before clearing ops_state: 1114 + * once ops_state is cleared, waiters in ops_dequeue() can proceed 1115 + * and dequeue_task_scx() will RMW p->scx.flags. 
If we clear 1116 + * ops_state first, both sides would modify p->scx.flags 1117 + * concurrently in a non-atomic way. 1118 + */ 1119 + if (is_local) { 1120 + local_dsq_post_enq(dsq, p, enq_flags); 1121 + } else { 1122 + /* 1123 + * Task on global/bypass DSQ: leave custody, task on 1124 + * non-terminal DSQ: enter custody. 1125 + */ 1126 + if (dsq->id == SCX_DSQ_GLOBAL || dsq->id == SCX_DSQ_BYPASS) 1127 + call_task_dequeue(sch, rq, p, 0); 1128 + else 1129 + p->scx.flags |= SCX_TASK_IN_CUSTODY; 1130 + 1131 + raw_spin_unlock(&dsq->lock); 1132 + } 1512 1133 1513 1134 /* 1514 1135 * We're transitioning out of QUEUEING or DISPATCHING. store_release to ··· 1538 1115 */ 1539 1116 if (enq_flags & SCX_ENQ_CLEAR_OPSS) 1540 1117 atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 1541 - 1542 - if (is_local) 1543 - local_dsq_post_enq(dsq, p, enq_flags); 1544 - else 1545 - raw_spin_unlock(&dsq->lock); 1546 1118 } 1547 1119 1548 1120 static void task_unlink_from_dsq(struct task_struct *p, ··· 1552 1134 } 1553 1135 1554 1136 list_del_init(&p->scx.dsq_list.node); 1555 - dsq_mod_nr(dsq, -1); 1137 + dsq_dec_nr(dsq, p); 1556 1138 1557 1139 if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) { 1558 1140 struct task_struct *first_task; ··· 1631 1213 1632 1214 static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch, 1633 1215 struct rq *rq, u64 dsq_id, 1634 - struct task_struct *p) 1216 + s32 tcpu) 1635 1217 { 1636 1218 struct scx_dispatch_q *dsq; 1637 1219 ··· 1642 1224 s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; 1643 1225 1644 1226 if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) 1645 - return find_global_dsq(sch, p); 1227 + return find_global_dsq(sch, tcpu); 1646 1228 1647 1229 return &cpu_rq(cpu)->scx.local_dsq; 1648 1230 } 1649 1231 1650 1232 if (dsq_id == SCX_DSQ_GLOBAL) 1651 - dsq = find_global_dsq(sch, p); 1233 + dsq = find_global_dsq(sch, tcpu); 1652 1234 else 1653 1235 dsq = find_user_dsq(sch, dsq_id); 1654 1236 1655 1237 if (unlikely(!dsq)) { 1656 - scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]", 1657 - dsq_id, p->comm, p->pid); 1658 - return find_global_dsq(sch, p); 1238 + scx_error(sch, "non-existent DSQ 0x%llx", dsq_id); 1239 + return find_global_dsq(sch, tcpu); 1659 1240 } 1660 1241 1661 1242 return dsq; ··· 1717 1300 { 1718 1301 struct rq *rq = task_rq(p); 1719 1302 struct scx_dispatch_q *dsq = 1720 - find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p); 1303 + find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p)); 1721 1304 u64 ddsp_enq_flags; 1722 1305 1723 1306 touch_core_sched_dispatch(rq, p); ··· 1762 1345 ddsp_enq_flags = p->scx.ddsp_enq_flags; 1763 1346 clear_direct_dispatch(p); 1764 1347 1765 - dispatch_enqueue(sch, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); 1348 + dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); 1766 1349 } 1767 1350 1768 1351 static bool scx_rq_online(struct rq *rq) ··· 1780 1363 static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, 1781 1364 int sticky_cpu) 1782 1365 { 1783 - struct scx_sched *sch = scx_root; 1366 + struct scx_sched *sch = scx_task_sched(p); 1784 1367 struct task_struct **ddsp_taskp; 1785 1368 struct scx_dispatch_q *dsq; 1786 1369 unsigned long qseq; 1787 1370 1788 1371 WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); 1789 1372 1790 - /* rq migration */ 1373 + /* internal movements - rq migration / RESTORE */ 1791 1374 if (sticky_cpu == cpu_of(rq)) 1792 1375 goto local_norefill; 1376 + 1377 + /* 1378 + * Clear persistent 
TASK_IMMED for fresh enqueues, see dsq_inc_nr(). 1379 + * Note that exiting and migration-disabled tasks that skip 1380 + * ops.enqueue() below will lose IMMED protection unless 1381 + * %SCX_OPS_ENQ_EXITING / %SCX_OPS_ENQ_MIGRATION_DISABLED are set. 1382 + */ 1383 + p->scx.flags &= ~SCX_TASK_IMMED; 1793 1384 1794 1385 /* 1795 1386 * If !scx_rq_online(), we already told the BPF scheduler that the CPU ··· 1807 1382 if (!scx_rq_online(rq)) 1808 1383 goto local; 1809 1384 1810 - if (scx_rq_bypassing(rq)) { 1385 + if (scx_bypassing(sch, cpu_of(rq))) { 1811 1386 __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); 1812 1387 goto bypass; 1813 1388 } ··· 1842 1417 WARN_ON_ONCE(*ddsp_taskp); 1843 1418 *ddsp_taskp = p; 1844 1419 1845 - SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); 1420 + SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags); 1846 1421 1847 1422 *ddsp_taskp = NULL; 1848 1423 if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) 1849 1424 goto direct; 1425 + 1426 + /* 1427 + * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY 1428 + * so ops.dequeue() is called when it leaves custody. 1429 + */ 1430 + p->scx.flags |= SCX_TASK_IN_CUSTODY; 1850 1431 1851 1432 /* 1852 1433 * If not directly dispatched, QUEUEING isn't clear yet and dispatch or ··· 1865 1434 direct_dispatch(sch, p, enq_flags); 1866 1435 return; 1867 1436 local_norefill: 1868 - dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags); 1437 + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags); 1869 1438 return; 1870 1439 local: 1871 1440 dsq = &rq->scx.local_dsq; 1872 1441 goto enqueue; 1873 1442 global: 1874 - dsq = find_global_dsq(sch, p); 1443 + dsq = find_global_dsq(sch, task_cpu(p)); 1875 1444 goto enqueue; 1876 1445 bypass: 1877 - dsq = &task_rq(p)->scx.bypass_dsq; 1446 + dsq = bypass_enq_target_dsq(sch, task_cpu(p)); 1878 1447 goto enqueue; 1879 1448 1880 1449 enqueue: ··· 1886 1455 touch_core_sched(rq, p); 1887 1456 refill_task_slice_dfl(sch, p); 1888 1457 clear_direct_dispatch(p); 1889 - dispatch_enqueue(sch, dsq, p, enq_flags); 1458 + dispatch_enqueue(sch, rq, dsq, p, enq_flags); 1890 1459 } 1891 1460 1892 1461 static bool task_runnable(const struct task_struct *p) ··· 1919 1488 1920 1489 static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_flags) 1921 1490 { 1922 - struct scx_sched *sch = scx_root; 1491 + struct scx_sched *sch = scx_task_sched(p); 1923 1492 int sticky_cpu = p->scx.sticky_cpu; 1924 1493 u64 enq_flags = core_enq_flags | rq->scx.extra_enq_flags; 1925 1494 1926 1495 if (enq_flags & ENQUEUE_WAKEUP) 1927 1496 rq->scx.flags |= SCX_RQ_IN_WAKEUP; 1928 - 1929 - if (sticky_cpu >= 0) 1930 - p->scx.sticky_cpu = -1; 1931 1497 1932 1498 /* 1933 1499 * Restoring a running task will be immediately followed by ··· 1946 1518 add_nr_running(rq, 1); 1947 1519 1948 1520 if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p)) 1949 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags); 1521 + SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags); 1950 1522 1951 1523 if (enq_flags & SCX_ENQ_WAKEUP) 1952 1524 touch_core_sched(rq, p); ··· 1956 1528 dl_server_start(&rq->ext_server); 1957 1529 1958 1530 do_enqueue_task(rq, p, enq_flags, sticky_cpu); 1531 + 1532 + if (sticky_cpu >= 0) 1533 + p->scx.sticky_cpu = -1; 1959 1534 out: 1960 1535 rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; 1961 1536 ··· 1969 1538 1970 1539 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) 1971 1540 { 1972 - struct scx_sched *sch = scx_root; 1541 + struct scx_sched *sch = 
scx_task_sched(p); 1973 1542 unsigned long opss; 1974 1543 1975 1544 /* dequeue is always temporary, don't reset runnable_at */ ··· 1988 1557 */ 1989 1558 BUG(); 1990 1559 case SCX_OPSS_QUEUED: 1991 - if (SCX_HAS_OP(sch, dequeue)) 1992 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, 1993 - p, deq_flags); 1994 - 1560 + /* A queued task must always be in BPF scheduler's custody */ 1561 + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)); 1995 1562 if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, 1996 1563 SCX_OPSS_NONE)) 1997 1564 break; ··· 2012 1583 BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); 2013 1584 break; 2014 1585 } 1586 + 1587 + /* 1588 + * Call ops.dequeue() if the task is still in BPF custody. 1589 + * 1590 + * The code that clears ops_state to %SCX_OPSS_NONE does not always 1591 + * clear %SCX_TASK_IN_CUSTODY: in dispatch_to_local_dsq(), when 1592 + * we're moving a task that was in %SCX_OPSS_DISPATCHING to a 1593 + * remote CPU's local DSQ, we only set ops_state to %SCX_OPSS_NONE 1594 + * so that a concurrent dequeue can proceed, but we clear 1595 + * %SCX_TASK_IN_CUSTODY only when we later enqueue or move the 1596 + * task. So we can see NONE + IN_CUSTODY here and we must handle 1597 + * it. Similarly, after waiting on %SCX_OPSS_DISPATCHING we see 1598 + * NONE but the task may still have %SCX_TASK_IN_CUSTODY set until 1599 + * it is enqueued on the destination. 1600 + */ 1601 + call_task_dequeue(sch, rq, p, deq_flags); 2015 1602 } 2016 1603 2017 - static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) 1604 + static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_flags) 2018 1605 { 2019 - struct scx_sched *sch = scx_root; 1606 + struct scx_sched *sch = scx_task_sched(p); 1607 + u64 deq_flags = core_deq_flags; 1608 + 1609 + /* 1610 + * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a property 1611 + * change (not sleep or core-sched pick). 
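 *
 * [Editor's illustration -- not part of this patch.] With the custody
 * tracking above, a scheduler that mirrors queued tasks in its own data
 * structures can count on ops.dequeue() firing once each time a task
 * leaves its custody, with SCX_DEQ_SCHED_CHANGE distinguishing
 * property-change dequeues on this path. A minimal sketch (task_ctx and
 * lookup_task_ctx() are hypothetical):
 *
 *	void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p,
 *			    u64 deq_flags)
 *	{
 *		struct task_ctx *tctx = lookup_task_ctx(p);
 *
 *		if (tctx)
 *			tctx->queued = false;
 *	}
 *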
1612 + */ 1613 + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) 1614 + deq_flags |= SCX_DEQ_SCHED_CHANGE; 2020 1615 2021 1616 if (!(p->scx.flags & SCX_TASK_QUEUED)) { 2022 1617 WARN_ON_ONCE(task_runnable(p)); ··· 2063 1610 */ 2064 1611 if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) { 2065 1612 update_curr_scx(rq); 2066 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false); 1613 + SCX_CALL_OP_TASK(sch, stopping, rq, p, false); 2067 1614 } 2068 1615 2069 1616 if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p)) 2070 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags); 1617 + SCX_CALL_OP_TASK(sch, quiescent, rq, p, deq_flags); 2071 1618 2072 1619 if (deq_flags & SCX_DEQ_SLEEP) 2073 1620 p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; ··· 2085 1632 2086 1633 static void yield_task_scx(struct rq *rq) 2087 1634 { 2088 - struct scx_sched *sch = scx_root; 2089 1635 struct task_struct *p = rq->donor; 1636 + struct scx_sched *sch = scx_task_sched(p); 2090 1637 2091 1638 if (SCX_HAS_OP(sch, yield)) 2092 - SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL); 1639 + SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL); 2093 1640 else 2094 1641 p->scx.slice = 0; 2095 1642 } 2096 1643 2097 1644 static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) 2098 1645 { 2099 - struct scx_sched *sch = scx_root; 2100 1646 struct task_struct *from = rq->donor; 1647 + struct scx_sched *sch = scx_task_sched(from); 2101 1648 2102 - if (SCX_HAS_OP(sch, yield)) 2103 - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, 2104 - from, to); 1649 + if (SCX_HAS_OP(sch, yield) && sch == scx_task_sched(to)) 1650 + return SCX_CALL_OP_2TASKS_RET(sch, yield, rq, from, to); 2105 1651 else 2106 1652 return false; 1653 + } 1654 + 1655 + static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) 1656 + { 1657 + /* 1658 + * Preemption between SCX tasks is implemented by resetting the victim 1659 + * task's slice to 0 and triggering reschedule on the target CPU. 1660 + * Nothing to do. 1661 + */ 1662 + if (p->sched_class == &ext_sched_class) 1663 + return; 1664 + 1665 + /* 1666 + * Getting preempted by a higher-priority class. Reenqueue IMMED tasks. 1667 + * This captures all preemption cases including: 1668 + * 1669 + * - A SCX task is currently running. 1670 + * 1671 + * - @rq is waking from idle due to a SCX task waking to it. 1672 + * 1673 + * - A higher-priority wakes up while SCX dispatch is in progress. 1674 + */ 1675 + if (rq->scx.nr_immed) 1676 + schedule_reenq_local(rq, 0); 2107 1677 } 2108 1678 2109 1679 static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, ··· 2146 1670 else 2147 1671 list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list); 2148 1672 2149 - dsq_mod_nr(dst_dsq, 1); 1673 + dsq_inc_nr(dst_dsq, p, enq_flags); 2150 1674 p->scx.dsq = dst_dsq; 2151 1675 2152 1676 local_dsq_post_enq(dst_dsq, p, enq_flags); ··· 2166 1690 { 2167 1691 lockdep_assert_rq_held(src_rq); 2168 1692 2169 - /* the following marks @p MIGRATING which excludes dequeue */ 1693 + /* 1694 + * Set sticky_cpu before deactivate_task() to properly mark the 1695 + * beginning of an SCX-internal migration. 
1696 + */ 1697 + p->scx.sticky_cpu = cpu_of(dst_rq); 2170 1698 deactivate_task(src_rq, p, 0); 2171 1699 set_task_cpu(p, cpu_of(dst_rq)); 2172 - p->scx.sticky_cpu = cpu_of(dst_rq); 2173 1700 2174 1701 raw_spin_rq_unlock(src_rq); 2175 1702 raw_spin_rq_lock(dst_rq); ··· 2212 1733 struct task_struct *p, struct rq *rq, 2213 1734 bool enforce) 2214 1735 { 2215 - int cpu = cpu_of(rq); 1736 + s32 cpu = cpu_of(rq); 2216 1737 2217 1738 WARN_ON_ONCE(task_cpu(p) == cpu); 2218 1739 ··· 2306 1827 !WARN_ON_ONCE(src_rq != task_rq(p)); 2307 1828 } 2308 1829 2309 - static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, 1830 + static bool consume_remote_task(struct rq *this_rq, 1831 + struct task_struct *p, u64 enq_flags, 2310 1832 struct scx_dispatch_q *dsq, struct rq *src_rq) 2311 1833 { 2312 1834 raw_spin_rq_unlock(this_rq); 2313 1835 2314 1836 if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { 2315 - move_remote_task_to_local_dsq(p, 0, src_rq, this_rq); 1837 + move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq); 2316 1838 return true; 2317 1839 } else { 2318 1840 raw_spin_rq_unlock(src_rq); ··· 2353 1873 dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); 2354 1874 if (src_rq != dst_rq && 2355 1875 unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { 2356 - dst_dsq = find_global_dsq(sch, p); 1876 + dst_dsq = find_global_dsq(sch, task_cpu(p)); 2357 1877 dst_rq = src_rq; 1878 + enq_flags |= SCX_ENQ_GDSQ_FALLBACK; 2358 1879 } 2359 1880 } else { 2360 1881 /* no need to migrate if destination is a non-local DSQ */ ··· 2386 1905 dispatch_dequeue_locked(p, src_dsq); 2387 1906 raw_spin_unlock(&src_dsq->lock); 2388 1907 2389 - dispatch_enqueue(sch, dst_dsq, p, enq_flags); 1908 + dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags); 2390 1909 } 2391 1910 2392 1911 return dst_rq; 2393 1912 } 2394 1913 2395 1914 static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq, 2396 - struct scx_dispatch_q *dsq) 1915 + struct scx_dispatch_q *dsq, u64 enq_flags) 2397 1916 { 2398 1917 struct task_struct *p; 2399 1918 retry: ··· 2418 1937 * the system into the bypass mode. This can easily live-lock the 2419 1938 * machine. If aborting, exit from all non-bypass DSQs. 2420 1939 */ 2421 - if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS) 1940 + if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS) 2422 1941 break; 2423 1942 2424 1943 if (rq == task_rq) { 2425 1944 task_unlink_from_dsq(p, dsq); 2426 - move_local_task_to_local_dsq(p, 0, dsq, rq); 1945 + move_local_task_to_local_dsq(p, enq_flags, dsq, rq); 2427 1946 raw_spin_unlock(&dsq->lock); 2428 1947 return true; 2429 1948 } 2430 1949 2431 1950 if (task_can_run_on_remote_rq(sch, p, rq, false)) { 2432 - if (likely(consume_remote_task(rq, p, dsq, task_rq))) 1951 + if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq))) 2433 1952 return true; 2434 1953 goto retry; 2435 1954 } ··· 2443 1962 { 2444 1963 int node = cpu_to_node(cpu_of(rq)); 2445 1964 2446 - return consume_dispatch_q(sch, rq, sch->global_dsqs[node]); 1965 + return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0); 2447 1966 } 2448 1967 2449 1968 /** ··· 2476 1995 * If dispatching to @rq that @p is already on, no lock dancing needed. 
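 *
 * [Editor's note -- illustration, not part of this patch.] The BPF
 * scheduler typically reaches this path by targeting another CPU's local
 * DSQ, e.g. scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, ...) from
 * ops.enqueue() or ops.dispatch(). If @p turns out not to be allowed to
 * run on the target CPU, the check further below falls back to a global
 * DSQ and tags the insertion with SCX_ENQ_GDSQ_FALLBACK.
 *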
2477 1996 */ 2478 1997 if (rq == src_rq && rq == dst_rq) { 2479 - dispatch_enqueue(sch, dst_dsq, p, 1998 + dispatch_enqueue(sch, rq, dst_dsq, p, 2480 1999 enq_flags | SCX_ENQ_CLEAR_OPSS); 2481 2000 return; 2482 2001 } 2483 2002 2484 2003 if (src_rq != dst_rq && 2485 2004 unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { 2486 - dispatch_enqueue(sch, find_global_dsq(sch, p), p, 2487 - enq_flags | SCX_ENQ_CLEAR_OPSS); 2005 + dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p, 2006 + enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK); 2488 2007 return; 2489 2008 } 2490 2009 ··· 2521 2040 */ 2522 2041 if (src_rq == dst_rq) { 2523 2042 p->scx.holding_cpu = -1; 2524 - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, 2043 + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, 2525 2044 enq_flags); 2526 2045 } else { 2527 2046 move_remote_task_to_local_dsq(p, enq_flags, ··· 2591 2110 if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch) 2592 2111 return; 2593 2112 2113 + /* see SCX_EV_INSERT_NOT_OWNED definition */ 2114 + if (unlikely(!scx_task_on_sched(sch, p))) { 2115 + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); 2116 + return; 2117 + } 2118 + 2594 2119 /* 2595 2120 * While we know @p is accessible, we don't yet have a claim on 2596 2121 * it - the BPF scheduler is allowed to dispatch tasks ··· 2621 2134 2622 2135 BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); 2623 2136 2624 - dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); 2137 + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p)); 2625 2138 2626 2139 if (dsq->id == SCX_DSQ_LOCAL) 2627 2140 dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 2628 2141 else 2629 - dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2142 + dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2630 2143 } 2631 2144 2632 2145 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq) 2633 2146 { 2634 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2147 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 2635 2148 u32 u; 2636 2149 2637 2150 for (u = 0; u < dspc->cursor; u++) { ··· 2658 2171 rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING; 2659 2172 } 2660 2173 2174 + /* 2175 + * One user of this function is scx_bpf_dispatch() which can be called 2176 + * recursively as sub-sched dispatches nest. Always inline to reduce stack usage 2177 + * from the call frame. 2178 + */ 2179 + static __always_inline bool 2180 + scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, 2181 + struct task_struct *prev, bool nested) 2182 + { 2183 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 2184 + int nr_loops = SCX_DSP_MAX_LOOPS; 2185 + s32 cpu = cpu_of(rq); 2186 + bool prev_on_sch = (prev->sched_class == &ext_sched_class) && 2187 + scx_task_on_sched(sch, prev); 2188 + 2189 + if (consume_global_dsq(sch, rq)) 2190 + return true; 2191 + 2192 + if (bypass_dsp_enabled(sch)) { 2193 + /* if @sch is bypassing, only the bypass DSQs are active */ 2194 + if (scx_bypassing(sch, cpu)) 2195 + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); 2196 + 2197 + #ifdef CONFIG_EXT_SUB_SCHED 2198 + /* 2199 + * If @sch isn't bypassing but its children are, @sch is 2200 + * responsible for making forward progress for both its own 2201 + * tasks that aren't bypassing and the bypassing descendants' 2202 + * tasks. The following implements a simple built-in behavior - 2203 + * let each CPU try to run the bypass DSQ every Nth time. 
2204 + * 2205 + * Later, if necessary, we can add an ops flag to suppress the 2206 + * auto-consumption and a kfunc to consume the bypass DSQ and, 2207 + * so that the BPF scheduler can fully control scheduling of 2208 + * bypassed tasks. 2209 + */ 2210 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 2211 + 2212 + if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) && 2213 + consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) { 2214 + __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1); 2215 + return true; 2216 + } 2217 + #endif /* CONFIG_EXT_SUB_SCHED */ 2218 + } 2219 + 2220 + if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) 2221 + return false; 2222 + 2223 + dspc->rq = rq; 2224 + 2225 + /* 2226 + * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, 2227 + * the local DSQ might still end up empty after a successful 2228 + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() 2229 + * produced some tasks, retry. The BPF scheduler may depend on this 2230 + * looping behavior to simplify its implementation. 2231 + */ 2232 + do { 2233 + dspc->nr_tasks = 0; 2234 + 2235 + if (nested) { 2236 + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); 2237 + } else { 2238 + /* stash @prev so that nested invocations can access it */ 2239 + rq->scx.sub_dispatch_prev = prev; 2240 + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); 2241 + rq->scx.sub_dispatch_prev = NULL; 2242 + } 2243 + 2244 + flush_dispatch_buf(sch, rq); 2245 + 2246 + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) { 2247 + rq->scx.flags |= SCX_RQ_BAL_KEEP; 2248 + return true; 2249 + } 2250 + if (rq->scx.local_dsq.nr) 2251 + return true; 2252 + if (consume_global_dsq(sch, rq)) 2253 + return true; 2254 + 2255 + /* 2256 + * ops.dispatch() can trap us in this loop by repeatedly 2257 + * dispatching ineligible tasks. Break out once in a while to 2258 + * allow the watchdog to run. As IRQ can't be enabled in 2259 + * balance(), we want to complete this scheduling cycle and then 2260 + * start a new one. IOW, we want to call resched_curr() on the 2261 + * next, most likely idle, task, not the current one. Use 2262 + * __scx_bpf_kick_cpu() for deferred kicking. 2263 + */ 2264 + if (unlikely(!--nr_loops)) { 2265 + scx_kick_cpu(sch, cpu, 0); 2266 + break; 2267 + } 2268 + } while (dspc->nr_tasks); 2269 + 2270 + /* 2271 + * Prevent the CPU from going idle while bypassed descendants have tasks 2272 + * queued. Without this fallback, bypassed tasks could stall if the host 2273 + * scheduler's ops.dispatch() doesn't yield any tasks. 2274 + */ 2275 + if (bypass_dsp_enabled(sch)) 2276 + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); 2277 + 2278 + return false; 2279 + } 2280 + 2661 2281 static int balance_one(struct rq *rq, struct task_struct *prev) 2662 2282 { 2663 2283 struct scx_sched *sch = scx_root; 2664 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2665 - bool prev_on_scx = prev->sched_class == &ext_sched_class; 2666 - bool prev_on_rq = prev->scx.flags & SCX_TASK_QUEUED; 2667 - int nr_loops = SCX_DSP_MAX_LOOPS; 2284 + s32 cpu = cpu_of(rq); 2668 2285 2669 2286 lockdep_assert_rq_held(rq); 2670 2287 rq->scx.flags |= SCX_RQ_IN_BALANCE; ··· 2783 2192 * emitted in switch_class(). 
2784 2193 */ 2785 2194 if (SCX_HAS_OP(sch, cpu_acquire)) 2786 - SCX_CALL_OP(sch, SCX_KF_REST, cpu_acquire, rq, 2787 - cpu_of(rq), NULL); 2195 + SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL); 2788 2196 rq->scx.cpu_released = false; 2789 2197 } 2790 2198 2791 - if (prev_on_scx) { 2199 + if (prev->sched_class == &ext_sched_class) { 2792 2200 update_curr_scx(rq); 2793 2201 2794 2202 /* ··· 2800 2210 * See scx_disable_workfn() for the explanation on the bypassing 2801 2211 * test. 2802 2212 */ 2803 - if (prev_on_rq && prev->scx.slice && !scx_rq_bypassing(rq)) { 2213 + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && 2214 + !scx_bypassing(sch, cpu)) { 2804 2215 rq->scx.flags |= SCX_RQ_BAL_KEEP; 2805 2216 goto has_tasks; 2806 2217 } ··· 2811 2220 if (rq->scx.local_dsq.nr) 2812 2221 goto has_tasks; 2813 2222 2814 - if (consume_global_dsq(sch, rq)) 2223 + if (scx_dispatch_sched(sch, rq, prev, false)) 2815 2224 goto has_tasks; 2816 2225 2817 - if (scx_rq_bypassing(rq)) { 2818 - if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq)) 2819 - goto has_tasks; 2820 - else 2821 - goto no_tasks; 2822 - } 2823 - 2824 - if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) 2825 - goto no_tasks; 2826 - 2827 - dspc->rq = rq; 2828 - 2829 - /* 2830 - * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, 2831 - * the local DSQ might still end up empty after a successful 2832 - * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() 2833 - * produced some tasks, retry. The BPF scheduler may depend on this 2834 - * looping behavior to simplify its implementation. 2835 - */ 2836 - do { 2837 - dspc->nr_tasks = 0; 2838 - 2839 - SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq, 2840 - cpu_of(rq), prev_on_scx ? prev : NULL); 2841 - 2842 - flush_dispatch_buf(sch, rq); 2843 - 2844 - if (prev_on_rq && prev->scx.slice) { 2845 - rq->scx.flags |= SCX_RQ_BAL_KEEP; 2846 - goto has_tasks; 2847 - } 2848 - if (rq->scx.local_dsq.nr) 2849 - goto has_tasks; 2850 - if (consume_global_dsq(sch, rq)) 2851 - goto has_tasks; 2852 - 2853 - /* 2854 - * ops.dispatch() can trap us in this loop by repeatedly 2855 - * dispatching ineligible tasks. Break out once in a while to 2856 - * allow the watchdog to run. As IRQ can't be enabled in 2857 - * balance(), we want to complete this scheduling cycle and then 2858 - * start a new one. IOW, we want to call resched_curr() on the 2859 - * next, most likely idle, task, not the current one. Use 2860 - * scx_kick_cpu() for deferred kicking. 2861 - */ 2862 - if (unlikely(!--nr_loops)) { 2863 - scx_kick_cpu(sch, cpu_of(rq), 0); 2864 - break; 2865 - } 2866 - } while (dspc->nr_tasks); 2867 - 2868 - no_tasks: 2869 2226 /* 2870 2227 * Didn't find another task to run. Keep running @prev unless 2871 2228 * %SCX_OPS_ENQ_LAST is in effect. 2872 2229 */ 2873 - if (prev_on_rq && 2874 - (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_rq_bypassing(rq))) { 2230 + if ((prev->scx.flags & SCX_TASK_QUEUED) && 2231 + (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_bypassing(sch, cpu))) { 2875 2232 rq->scx.flags |= SCX_RQ_BAL_KEEP; 2876 2233 __scx_add_event(sch, SCX_EV_DISPATCH_KEEP_LAST, 1); 2877 2234 goto has_tasks; ··· 2828 2289 return false; 2829 2290 2830 2291 has_tasks: 2292 + /* 2293 + * @rq may have extra IMMED tasks without reenq scheduled: 2294 + * 2295 + * - rq_is_open() can't reliably tell when and how slice is going to be 2296 + * modified for $curr and allows IMMED tasks to be queued while 2297 + * dispatch is in progress. 
2298 + * 2299 + * - A non-IMMED HEAD task can get queued in front of an IMMED task 2300 + * between the IMMED queueing and the subsequent scheduling event. 2301 + */ 2302 + if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed)) 2303 + schedule_reenq_local(rq, 0); 2304 + 2831 2305 rq->scx.flags &= ~SCX_RQ_IN_BALANCE; 2832 2306 return true; 2833 2307 } 2834 2308 2835 - static void process_ddsp_deferred_locals(struct rq *rq) 2836 - { 2837 - struct task_struct *p; 2838 - 2839 - lockdep_assert_rq_held(rq); 2840 - 2841 - /* 2842 - * Now that @rq can be unlocked, execute the deferred enqueueing of 2843 - * tasks directly dispatched to the local DSQs of other CPUs. See 2844 - * direct_dispatch(). Keep popping from the head instead of using 2845 - * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq 2846 - * temporarily. 2847 - */ 2848 - while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, 2849 - struct task_struct, scx.dsq_list.node))) { 2850 - struct scx_sched *sch = scx_root; 2851 - struct scx_dispatch_q *dsq; 2852 - u64 dsq_id = p->scx.ddsp_dsq_id; 2853 - u64 enq_flags = p->scx.ddsp_enq_flags; 2854 - 2855 - list_del_init(&p->scx.dsq_list.node); 2856 - clear_direct_dispatch(p); 2857 - 2858 - dsq = find_dsq_for_dispatch(sch, rq, dsq_id, p); 2859 - if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 2860 - dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 2861 - } 2862 - } 2863 - 2864 2309 static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) 2865 2310 { 2866 - struct scx_sched *sch = scx_root; 2311 + struct scx_sched *sch = scx_task_sched(p); 2867 2312 2868 2313 if (p->scx.flags & SCX_TASK_QUEUED) { 2869 2314 /* ··· 2862 2339 2863 2340 /* see dequeue_task_scx() on why we skip when !QUEUED */ 2864 2341 if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) 2865 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, running, rq, p); 2342 + SCX_CALL_OP_TASK(sch, running, rq, p); 2866 2343 2867 2344 clr_task_runnable(p, true); 2868 2345 ··· 2934 2411 .task = next, 2935 2412 }; 2936 2413 2937 - SCX_CALL_OP(sch, SCX_KF_CPU_RELEASE, cpu_release, rq, 2938 - cpu_of(rq), &args); 2414 + SCX_CALL_OP(sch, cpu_release, rq, cpu_of(rq), &args); 2939 2415 } 2940 2416 rq->scx.cpu_released = true; 2941 2417 } ··· 2943 2421 static void put_prev_task_scx(struct rq *rq, struct task_struct *p, 2944 2422 struct task_struct *next) 2945 2423 { 2946 - struct scx_sched *sch = scx_root; 2424 + struct scx_sched *sch = scx_task_sched(p); 2947 2425 2948 2426 /* see kick_sync_wait_bal_cb() */ 2949 2427 smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1); ··· 2952 2430 2953 2431 /* see dequeue_task_scx() on why we skip when !QUEUED */ 2954 2432 if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED)) 2955 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, true); 2433 + SCX_CALL_OP_TASK(sch, stopping, rq, p, true); 2956 2434 2957 2435 if (p->scx.flags & SCX_TASK_QUEUED) { 2958 2436 set_task_runnable(rq, p); ··· 2961 2439 * If @p has slice left and is being put, @p is getting 2962 2440 * preempted by a higher priority scheduler class or core-sched 2963 2441 * forcing a different task. Leave it at the head of the local 2964 - * DSQ. 2442 + * DSQ unless it was an IMMED task. IMMED tasks should not 2443 + * linger on a busy CPU, reenqueue them to the BPF scheduler. 
2965 2444 */ 2966 - if (p->scx.slice && !scx_rq_bypassing(rq)) { 2967 - dispatch_enqueue(sch, &rq->scx.local_dsq, p, 2968 - SCX_ENQ_HEAD); 2445 + if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) { 2446 + if (p->scx.flags & SCX_TASK_IMMED) { 2447 + p->scx.flags |= SCX_TASK_REENQ_PREEMPTED; 2448 + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 2449 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 2450 + } else { 2451 + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD); 2452 + } 2969 2453 goto switch_class; 2970 2454 } 2971 2455 ··· 3096 2568 if (keep_prev) { 3097 2569 p = prev; 3098 2570 if (!p->scx.slice) 3099 - refill_task_slice_dfl(rcu_dereference_sched(scx_root), p); 2571 + refill_task_slice_dfl(scx_task_sched(p), p); 3100 2572 } else { 3101 2573 p = first_local_task(rq); 3102 2574 if (!p) 3103 2575 return NULL; 3104 2576 3105 2577 if (unlikely(!p->scx.slice)) { 3106 - struct scx_sched *sch = rcu_dereference_sched(scx_root); 2578 + struct scx_sched *sch = scx_task_sched(p); 3107 2579 3108 - if (!scx_rq_bypassing(rq) && !sch->warned_zero_slice) { 2580 + if (!scx_bypassing(sch, cpu_of(rq)) && 2581 + !sch->warned_zero_slice) { 3109 2582 printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in %s()\n", 3110 2583 p->comm, p->pid, __func__); 3111 2584 sch->warned_zero_slice = true; ··· 3172 2643 bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, 3173 2644 bool in_fi) 3174 2645 { 3175 - struct scx_sched *sch = scx_root; 2646 + struct scx_sched *sch_a = scx_task_sched(a); 2647 + struct scx_sched *sch_b = scx_task_sched(b); 3176 2648 3177 2649 /* 3178 2650 * The const qualifiers are dropped from task_struct pointers when 3179 2651 * calling ops.core_sched_before(). Accesses are controlled by the 3180 2652 * verifier. 
3181 2653 */ 3182 - if (SCX_HAS_OP(sch, core_sched_before) && 3183 - !scx_rq_bypassing(task_rq(a))) 3184 - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, core_sched_before, 2654 + if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && 2655 + !scx_bypassing(sch_a, task_cpu(a))) 2656 + return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, 3185 2657 NULL, 3186 2658 (struct task_struct *)a, 3187 2659 (struct task_struct *)b); ··· 3193 2663 3194 2664 static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags) 3195 2665 { 3196 - struct scx_sched *sch = scx_root; 3197 - bool rq_bypass; 2666 + struct scx_sched *sch = scx_task_sched(p); 2667 + bool bypassing; 3198 2668 3199 2669 /* 3200 2670 * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it ··· 3209 2679 if (unlikely(wake_flags & WF_EXEC)) 3210 2680 return prev_cpu; 3211 2681 3212 - rq_bypass = scx_rq_bypassing(task_rq(p)); 3213 - if (likely(SCX_HAS_OP(sch, select_cpu)) && !rq_bypass) { 2682 + bypassing = scx_bypassing(sch, task_cpu(p)); 2683 + if (likely(SCX_HAS_OP(sch, select_cpu)) && !bypassing) { 3214 2684 s32 cpu; 3215 2685 struct task_struct **ddsp_taskp; 3216 2686 ··· 3218 2688 WARN_ON_ONCE(*ddsp_taskp); 3219 2689 *ddsp_taskp = p; 3220 2690 3221 - cpu = SCX_CALL_OP_TASK_RET(sch, 3222 - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, 3223 - select_cpu, NULL, p, prev_cpu, 3224 - wake_flags); 2691 + this_rq()->scx.in_select_cpu = true; 2692 + cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags); 2693 + this_rq()->scx.in_select_cpu = false; 3225 2694 p->scx.selected_cpu = cpu; 3226 2695 *ddsp_taskp = NULL; 3227 2696 if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) ··· 3239 2710 } 3240 2711 p->scx.selected_cpu = cpu; 3241 2712 3242 - if (rq_bypass) 2713 + if (bypassing) 3243 2714 __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); 3244 2715 return cpu; 3245 2716 } ··· 3253 2724 static void set_cpus_allowed_scx(struct task_struct *p, 3254 2725 struct affinity_context *ac) 3255 2726 { 3256 - struct scx_sched *sch = scx_root; 2727 + struct scx_sched *sch = scx_task_sched(p); 3257 2728 3258 2729 set_cpus_allowed_common(p, ac); 3259 2730 ··· 3269 2740 * designation pointless. Cast it away when calling the operation. 
3270 2741 */ 3271 2742 if (SCX_HAS_OP(sch, set_cpumask)) 3272 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL, 3273 - p, (struct cpumask *)p->cpus_ptr); 2743 + SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr); 3274 2744 } 3275 2745 3276 2746 static void handle_hotplug(struct rq *rq, bool online) 3277 2747 { 3278 2748 struct scx_sched *sch = scx_root; 3279 - int cpu = cpu_of(rq); 2749 + s32 cpu = cpu_of(rq); 3280 2750 3281 2751 atomic_long_inc(&scx_hotplug_seq); 3282 2752 ··· 3291 2763 scx_idle_update_selcpu_topology(&sch->ops); 3292 2764 3293 2765 if (online && SCX_HAS_OP(sch, cpu_online)) 3294 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); 2766 + SCX_CALL_OP(sch, cpu_online, NULL, cpu); 3295 2767 else if (!online && SCX_HAS_OP(sch, cpu_offline)) 3296 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_offline, NULL, cpu); 2768 + SCX_CALL_OP(sch, cpu_offline, NULL, cpu); 3297 2769 else 3298 2770 scx_exit(sch, SCX_EXIT_UNREG_KERN, 3299 2771 SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, ··· 3321 2793 rq->scx.flags &= ~SCX_RQ_ONLINE; 3322 2794 } 3323 2795 3324 - 3325 2796 static bool check_rq_for_timeouts(struct rq *rq) 3326 2797 { 3327 2798 struct scx_sched *sch; ··· 3334 2807 goto out_unlock; 3335 2808 3336 2809 list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) { 2810 + struct scx_sched *sch = scx_task_sched(p); 3337 2811 unsigned long last_runnable = p->scx.runnable_at; 3338 2812 3339 2813 if (unlikely(time_after(jiffies, 3340 - last_runnable + READ_ONCE(scx_watchdog_timeout)))) { 2814 + last_runnable + READ_ONCE(sch->watchdog_timeout)))) { 3341 2815 u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable); 3342 2816 3343 2817 scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, ··· 3355 2827 3356 2828 static void scx_watchdog_workfn(struct work_struct *work) 3357 2829 { 2830 + unsigned long intv; 3358 2831 int cpu; 3359 2832 3360 2833 WRITE_ONCE(scx_watchdog_timestamp, jiffies); ··· 3366 2837 3367 2838 cond_resched(); 3368 2839 } 3369 - queue_delayed_work(system_dfl_wq, to_delayed_work(work), 3370 - READ_ONCE(scx_watchdog_timeout) / 2); 2840 + 2841 + intv = READ_ONCE(scx_watchdog_interval); 2842 + if (intv < ULONG_MAX) 2843 + queue_delayed_work(system_dfl_wq, to_delayed_work(work), intv); 3371 2844 } 3372 2845 3373 2846 void scx_tick(struct rq *rq) 3374 2847 { 3375 - struct scx_sched *sch; 2848 + struct scx_sched *root; 3376 2849 unsigned long last_check; 3377 2850 3378 2851 if (!scx_enabled()) 3379 2852 return; 3380 2853 3381 - sch = rcu_dereference_bh(scx_root); 3382 - if (unlikely(!sch)) 2854 + root = rcu_dereference_bh(scx_root); 2855 + if (unlikely(!root)) 3383 2856 return; 3384 2857 3385 2858 last_check = READ_ONCE(scx_watchdog_timestamp); 3386 2859 if (unlikely(time_after(jiffies, 3387 - last_check + READ_ONCE(scx_watchdog_timeout)))) { 2860 + last_check + READ_ONCE(root->watchdog_timeout)))) { 3388 2861 u32 dur_ms = jiffies_to_msecs(jiffies - last_check); 3389 2862 3390 - scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, 2863 + scx_exit(root, SCX_EXIT_ERROR_STALL, 0, 3391 2864 "watchdog failed to check in for %u.%03us", 3392 2865 dur_ms / 1000, dur_ms % 1000); 3393 2866 } ··· 3399 2868 3400 2869 static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) 3401 2870 { 3402 - struct scx_sched *sch = scx_root; 2871 + struct scx_sched *sch = scx_task_sched(curr); 3403 2872 3404 2873 update_curr_scx(rq); 3405 2874 ··· 3407 2876 * While disabling, always resched and refresh core-sched timestamp as 3408 2877 * we can't trust the slice 
management or ops.core_sched_before(). 3409 2878 */ 3410 - if (scx_rq_bypassing(rq)) { 2879 + if (scx_bypassing(sch, cpu_of(rq))) { 3411 2880 curr->scx.slice = 0; 3412 2881 touch_core_sched(rq, curr); 3413 2882 } else if (SCX_HAS_OP(sch, tick)) { 3414 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, tick, rq, curr); 2883 + SCX_CALL_OP_TASK(sch, tick, rq, curr); 3415 2884 } 3416 2885 3417 2886 if (!curr->scx.slice) ··· 3440 2909 3441 2910 #endif /* CONFIG_EXT_GROUP_SCHED */ 3442 2911 3443 - static enum scx_task_state scx_get_task_state(const struct task_struct *p) 2912 + static u32 scx_get_task_state(const struct task_struct *p) 3444 2913 { 3445 - return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; 2914 + return p->scx.flags & SCX_TASK_STATE_MASK; 3446 2915 } 3447 2916 3448 - static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) 2917 + static void scx_set_task_state(struct task_struct *p, u32 state) 3449 2918 { 3450 - enum scx_task_state prev_state = scx_get_task_state(p); 2919 + u32 prev_state = scx_get_task_state(p); 3451 2920 bool warn = false; 3452 - 3453 - BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS)); 3454 2921 3455 2922 switch (state) { 3456 2923 case SCX_TASK_NONE: ··· 3463 2934 warn = prev_state != SCX_TASK_READY; 3464 2935 break; 3465 2936 default: 3466 - warn = true; 2937 + WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]", 2938 + prev_state, state, p->comm, p->pid); 3467 2939 return; 3468 2940 } 3469 2941 3470 - WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]", 2942 + WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]", 3471 2943 prev_state, state, p->comm, p->pid); 3472 2944 3473 2945 p->scx.flags &= ~SCX_TASK_STATE_MASK; 3474 - p->scx.flags |= state << SCX_TASK_STATE_SHIFT; 2946 + p->scx.flags |= state; 3475 2947 } 3476 2948 3477 - static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork) 2949 + static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) 3478 2950 { 3479 - struct scx_sched *sch = scx_root; 3480 2951 int ret; 3481 2952 3482 2953 p->scx.disallow = false; 3483 2954 3484 2955 if (SCX_HAS_OP(sch, init_task)) { 3485 2956 struct scx_init_task_args args = { 3486 - SCX_INIT_TASK_ARGS_CGROUP(tg) 2957 + SCX_INIT_TASK_ARGS_CGROUP(task_group(p)) 3487 2958 .fork = fork, 3488 2959 }; 3489 2960 3490 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init_task, NULL, 3491 - p, &args); 2961 + ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args); 3492 2962 if (unlikely(ret)) { 3493 2963 ret = ops_sanitize_err(sch, "init_task", ret); 3494 2964 return ret; 3495 2965 } 3496 2966 } 3497 2967 3498 - scx_set_task_state(p, SCX_TASK_INIT); 3499 - 3500 2968 if (p->scx.disallow) { 3501 - if (!fork) { 2969 + if (unlikely(scx_parent(sch))) { 2970 + scx_error(sch, "non-root ops.init_task() set task->scx.disallow for %s[%d]", 2971 + p->comm, p->pid); 2972 + } else if (unlikely(fork)) { 2973 + scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", 2974 + p->comm, p->pid); 2975 + } else { 3502 2976 struct rq *rq; 3503 2977 struct rq_flags rf; 3504 2978 ··· 3520 2988 } 3521 2989 3522 2990 task_rq_unlock(rq, p, &rf); 3523 - } else if (p->policy == SCHED_EXT) { 3524 - scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", 3525 - p->comm, p->pid); 3526 2991 } 3527 2992 } 3528 2993 3529 - p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 3530 2994 return 0; 3531 2995 } 3532 2996 3533 - 
static void scx_enable_task(struct task_struct *p) 2997 + static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) 3534 2998 { 3535 - struct scx_sched *sch = scx_root; 2999 + int ret; 3000 + 3001 + ret = __scx_init_task(sch, p, fork); 3002 + if (!ret) { 3003 + /* 3004 + * While @p's rq is not locked. @p is not visible to the rest of 3005 + * SCX yet and it's safe to update the flags and state. 3006 + */ 3007 + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 3008 + scx_set_task_state(p, SCX_TASK_INIT); 3009 + } 3010 + return ret; 3011 + } 3012 + 3013 + static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p) 3014 + { 3536 3015 struct rq *rq = task_rq(p); 3537 3016 u32 weight; 3538 3017 3539 3018 lockdep_assert_rq_held(rq); 3019 + 3020 + /* 3021 + * Verify the task is not in BPF scheduler's custody. If flag 3022 + * transitions are consistent, the flag should always be clear 3023 + * here. 3024 + */ 3025 + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); 3540 3026 3541 3027 /* 3542 3028 * Set the weight before calling ops.enable() so that the scheduler ··· 3568 3018 p->scx.weight = sched_weight_to_cgroup(weight); 3569 3019 3570 3020 if (SCX_HAS_OP(sch, enable)) 3571 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p); 3572 - scx_set_task_state(p, SCX_TASK_ENABLED); 3021 + SCX_CALL_OP_TASK(sch, enable, rq, p); 3573 3022 3574 3023 if (SCX_HAS_OP(sch, set_weight)) 3575 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, 3576 - p, p->scx.weight); 3024 + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); 3577 3025 } 3578 3026 3579 - static void scx_disable_task(struct task_struct *p) 3027 + static void scx_enable_task(struct scx_sched *sch, struct task_struct *p) 3580 3028 { 3581 - struct scx_sched *sch = scx_root; 3029 + __scx_enable_task(sch, p); 3030 + scx_set_task_state(p, SCX_TASK_ENABLED); 3031 + } 3032 + 3033 + static void scx_disable_task(struct scx_sched *sch, struct task_struct *p) 3034 + { 3582 3035 struct rq *rq = task_rq(p); 3583 3036 3584 3037 lockdep_assert_rq_held(rq); ··· 3590 3037 clear_direct_dispatch(p); 3591 3038 3592 3039 if (SCX_HAS_OP(sch, disable)) 3593 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); 3040 + SCX_CALL_OP_TASK(sch, disable, rq, p); 3594 3041 scx_set_task_state(p, SCX_TASK_READY); 3042 + 3043 + /* 3044 + * Verify the task is not in BPF scheduler's custody. If flag 3045 + * transitions are consistent, the flag should always be clear 3046 + * here. 
3047 + */ 3048 + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); 3595 3049 } 3596 3050 3597 - static void scx_exit_task(struct task_struct *p) 3051 + static void __scx_disable_and_exit_task(struct scx_sched *sch, 3052 + struct task_struct *p) 3598 3053 { 3599 - struct scx_sched *sch = scx_root; 3600 3054 struct scx_exit_task_args args = { 3601 3055 .cancelled = false, 3602 3056 }; 3603 3057 3058 + lockdep_assert_held(&p->pi_lock); 3604 3059 lockdep_assert_rq_held(task_rq(p)); 3605 3060 3606 3061 switch (scx_get_task_state(p)) { ··· 3620 3059 case SCX_TASK_READY: 3621 3060 break; 3622 3061 case SCX_TASK_ENABLED: 3623 - scx_disable_task(p); 3062 + scx_disable_task(sch, p); 3624 3063 break; 3625 3064 default: 3626 3065 WARN_ON_ONCE(true); ··· 3628 3067 } 3629 3068 3630 3069 if (SCX_HAS_OP(sch, exit_task)) 3631 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, exit_task, task_rq(p), 3632 - p, &args); 3070 + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); 3071 + } 3072 + 3073 + static void scx_disable_and_exit_task(struct scx_sched *sch, 3074 + struct task_struct *p) 3075 + { 3076 + __scx_disable_and_exit_task(sch, p); 3077 + 3078 + /* 3079 + * If set, @p exited between __scx_init_task() and scx_enable_task() in 3080 + * scx_sub_enable() and is initialized for both the associated sched and 3081 + * its parent. Disable and exit for the child too. 3082 + */ 3083 + if ((p->scx.flags & SCX_TASK_SUB_INIT) && 3084 + !WARN_ON_ONCE(!scx_enabling_sub_sched)) { 3085 + __scx_disable_and_exit_task(scx_enabling_sub_sched, p); 3086 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 3087 + } 3088 + 3089 + scx_set_task_sched(p, NULL); 3633 3090 scx_set_task_state(p, SCX_TASK_NONE); 3634 3091 } 3635 3092 ··· 3661 3082 INIT_LIST_HEAD(&scx->runnable_node); 3662 3083 scx->runnable_at = jiffies; 3663 3084 scx->ddsp_dsq_id = SCX_DSQ_INVALID; 3664 - scx->slice = READ_ONCE(scx_slice_dfl); 3085 + scx->slice = SCX_SLICE_DFL; 3665 3086 } 3666 3087 3667 3088 void scx_pre_fork(struct task_struct *p) ··· 3675 3096 percpu_down_read(&scx_fork_rwsem); 3676 3097 } 3677 3098 3678 - int scx_fork(struct task_struct *p) 3099 + int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) 3679 3100 { 3101 + s32 ret; 3102 + 3680 3103 percpu_rwsem_assert_held(&scx_fork_rwsem); 3681 3104 3682 - if (scx_init_task_enabled) 3683 - return scx_init_task(p, task_group(p), true); 3684 - else 3685 - return 0; 3105 + if (scx_init_task_enabled) { 3106 + #ifdef CONFIG_EXT_SUB_SCHED 3107 + struct scx_sched *sch = kargs->cset->dfl_cgrp->scx_sched; 3108 + #else 3109 + struct scx_sched *sch = scx_root; 3110 + #endif 3111 + ret = scx_init_task(sch, p, true); 3112 + if (!ret) 3113 + scx_set_task_sched(p, sch); 3114 + return ret; 3115 + } 3116 + 3117 + return 0; 3686 3118 } 3687 3119 3688 3120 void scx_post_fork(struct task_struct *p) ··· 3711 3121 struct rq *rq; 3712 3122 3713 3123 rq = task_rq_lock(p, &rf); 3714 - scx_enable_task(p); 3124 + scx_enable_task(scx_task_sched(p), p); 3715 3125 task_rq_unlock(rq, p, &rf); 3716 3126 } 3717 3127 } ··· 3731 3141 3732 3142 rq = task_rq_lock(p, &rf); 3733 3143 WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY); 3734 - scx_exit_task(p); 3144 + scx_disable_and_exit_task(scx_task_sched(p), p); 3735 3145 task_rq_unlock(rq, p, &rf); 3736 3146 } 3737 3147 ··· 3782 3192 raw_spin_unlock_irqrestore(&scx_tasks_lock, flags); 3783 3193 3784 3194 /* 3785 - * @p is off scx_tasks and wholly ours. scx_enable()'s READY -> ENABLED 3786 - * transitions can't race us. Disable ops for @p. 3195 + * @p is off scx_tasks and wholly ours. 
scx_root_enable()'s READY -> 3196 + * ENABLED transitions can't race us. Disable ops for @p. 3787 3197 */ 3788 3198 if (scx_get_task_state(p) != SCX_TASK_NONE) { 3789 3199 struct rq_flags rf; 3790 3200 struct rq *rq; 3791 3201 3792 3202 rq = task_rq_lock(p, &rf); 3793 - scx_exit_task(p); 3203 + scx_disable_and_exit_task(scx_task_sched(p), p); 3794 3204 task_rq_unlock(rq, p, &rf); 3795 3205 } 3796 3206 } ··· 3798 3208 static void reweight_task_scx(struct rq *rq, struct task_struct *p, 3799 3209 const struct load_weight *lw) 3800 3210 { 3801 - struct scx_sched *sch = scx_root; 3211 + struct scx_sched *sch = scx_task_sched(p); 3802 3212 3803 3213 lockdep_assert_rq_held(task_rq(p)); 3804 3214 ··· 3807 3217 3808 3218 p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); 3809 3219 if (SCX_HAS_OP(sch, set_weight)) 3810 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, 3811 - p, p->scx.weight); 3220 + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); 3812 3221 } 3813 3222 3814 3223 static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) ··· 3816 3227 3817 3228 static void switching_to_scx(struct rq *rq, struct task_struct *p) 3818 3229 { 3819 - struct scx_sched *sch = scx_root; 3230 + struct scx_sched *sch = scx_task_sched(p); 3820 3231 3821 3232 if (task_dead_and_done(p)) 3822 3233 return; 3823 3234 3824 - scx_enable_task(p); 3235 + scx_enable_task(sch, p); 3825 3236 3826 3237 /* 3827 3238 * set_cpus_allowed_scx() is not called while @p is associated with a 3828 3239 * different scheduler class. Keep the BPF scheduler up-to-date. 3829 3240 */ 3830 3241 if (SCX_HAS_OP(sch, set_cpumask)) 3831 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, rq, 3832 - p, (struct cpumask *)p->cpus_ptr); 3242 + SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr); 3833 3243 } 3834 3244 3835 3245 static void switched_from_scx(struct rq *rq, struct task_struct *p) ··· 3836 3248 if (task_dead_and_done(p)) 3837 3249 return; 3838 3250 3839 - scx_disable_task(p); 3251 + scx_disable_task(scx_task_sched(p), p); 3840 3252 } 3841 - 3842 - static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {} 3843 3253 3844 3254 static void switched_to_scx(struct rq *rq, struct task_struct *p) {} 3845 3255 ··· 3853 3267 return 0; 3854 3268 } 3855 3269 3270 + static void process_ddsp_deferred_locals(struct rq *rq) 3271 + { 3272 + struct task_struct *p; 3273 + 3274 + lockdep_assert_rq_held(rq); 3275 + 3276 + /* 3277 + * Now that @rq can be unlocked, execute the deferred enqueueing of 3278 + * tasks directly dispatched to the local DSQs of other CPUs. See 3279 + * direct_dispatch(). Keep popping from the head instead of using 3280 + * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq 3281 + * temporarily. 3282 + */ 3283 + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, 3284 + struct task_struct, scx.dsq_list.node))) { 3285 + struct scx_sched *sch = scx_task_sched(p); 3286 + struct scx_dispatch_q *dsq; 3287 + u64 dsq_id = p->scx.ddsp_dsq_id; 3288 + u64 enq_flags = p->scx.ddsp_enq_flags; 3289 + 3290 + list_del_init(&p->scx.dsq_list.node); 3291 + clear_direct_dispatch(p); 3292 + 3293 + dsq = find_dsq_for_dispatch(sch, rq, dsq_id, task_cpu(p)); 3294 + if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 3295 + dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 3296 + } 3297 + } 3298 + 3299 + /* 3300 + * Determine whether @p should be reenqueued from a local DSQ. 
3301 + * 3302 + * @reenq_flags is mutable and accumulates state across the DSQ walk: 3303 + * 3304 + * - %SCX_REENQ_TSR_NOT_FIRST: Set after the first task is visited. "First" 3305 + * tracks position in the DSQ list, not among IMMED tasks. A non-IMMED task at 3306 + * the head consumes the first slot. 3307 + * 3308 + * - %SCX_REENQ_TSR_RQ_OPEN: Set by reenq_local() before the walk if 3309 + * rq_is_open() is true. 3310 + * 3311 + * An IMMED task is kept (returns %false) only if it's the first task in the DSQ 3312 + * AND the current task is done — i.e. it will execute immediately. All other 3313 + * IMMED tasks are reenqueued. This means if a non-IMMED task sits at the head, 3314 + * every IMMED task behind it gets reenqueued. 3315 + * 3316 + * Reenqueued tasks go through ops.enqueue() with %SCX_ENQ_REENQ | 3317 + * %SCX_TASK_REENQ_IMMED. If the BPF scheduler dispatches back to the same local 3318 + * DSQ with %SCX_ENQ_IMMED while the CPU is still unavailable, this triggers 3319 + * another reenq cycle. Repetitions are bounded by %SCX_REENQ_LOCAL_MAX_REPEAT 3320 + * in process_deferred_reenq_locals(). 3321 + */ 3322 + static bool local_task_should_reenq(struct task_struct *p, u64 *reenq_flags, u32 *reason) 3323 + { 3324 + bool first; 3325 + 3326 + first = !(*reenq_flags & SCX_REENQ_TSR_NOT_FIRST); 3327 + *reenq_flags |= SCX_REENQ_TSR_NOT_FIRST; 3328 + 3329 + *reason = SCX_TASK_REENQ_KFUNC; 3330 + 3331 + if ((p->scx.flags & SCX_TASK_IMMED) && 3332 + (!first || !(*reenq_flags & SCX_REENQ_TSR_RQ_OPEN))) { 3333 + __scx_add_event(scx_task_sched(p), SCX_EV_REENQ_IMMED, 1); 3334 + *reason = SCX_TASK_REENQ_IMMED; 3335 + return true; 3336 + } 3337 + 3338 + return *reenq_flags & SCX_REENQ_ANY; 3339 + } 3340 + 3341 + static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags) 3342 + { 3343 + LIST_HEAD(tasks); 3344 + u32 nr_enqueued = 0; 3345 + struct task_struct *p, *n; 3346 + 3347 + lockdep_assert_rq_held(rq); 3348 + 3349 + if (WARN_ON_ONCE(reenq_flags & __SCX_REENQ_TSR_MASK)) 3350 + reenq_flags &= ~__SCX_REENQ_TSR_MASK; 3351 + if (rq_is_open(rq, 0)) 3352 + reenq_flags |= SCX_REENQ_TSR_RQ_OPEN; 3353 + 3354 + /* 3355 + * The BPF scheduler may choose to dispatch tasks back to 3356 + * @rq->scx.local_dsq. Move all candidate tasks off to a private list 3357 + * first to avoid processing the same tasks repeatedly. 3358 + */ 3359 + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, 3360 + scx.dsq_list.node) { 3361 + struct scx_sched *task_sch = scx_task_sched(p); 3362 + u32 reason; 3363 + 3364 + /* 3365 + * If @p is being migrated, @p's current CPU may not agree with 3366 + * its allowed CPUs and the migration_cpu_stop is about to 3367 + * deactivate and re-activate @p anyway. Skip re-enqueueing. 3368 + * 3369 + * While racing sched property changes may also dequeue and 3370 + * re-enqueue a migrating task while its current CPU and allowed 3371 + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to 3372 + * the current local DSQ for running tasks and thus are not 3373 + * visible to the BPF scheduler. 
3374 + */ 3375 + if (p->migration_pending) 3376 + continue; 3377 + 3378 + if (!scx_is_descendant(task_sch, sch)) 3379 + continue; 3380 + 3381 + if (!local_task_should_reenq(p, &reenq_flags, &reason)) 3382 + continue; 3383 + 3384 + dispatch_dequeue(rq, p); 3385 + 3386 + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) 3387 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3388 + p->scx.flags |= reason; 3389 + 3390 + list_add_tail(&p->scx.dsq_list.node, &tasks); 3391 + } 3392 + 3393 + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { 3394 + list_del_init(&p->scx.dsq_list.node); 3395 + 3396 + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 3397 + 3398 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3399 + nr_enqueued++; 3400 + } 3401 + 3402 + return nr_enqueued; 3403 + } 3404 + 3405 + static void process_deferred_reenq_locals(struct rq *rq) 3406 + { 3407 + u64 seq = ++rq->scx.deferred_reenq_locals_seq; 3408 + 3409 + lockdep_assert_rq_held(rq); 3410 + 3411 + while (true) { 3412 + struct scx_sched *sch; 3413 + u64 reenq_flags; 3414 + bool skip = false; 3415 + 3416 + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { 3417 + struct scx_deferred_reenq_local *drl = 3418 + list_first_entry_or_null(&rq->scx.deferred_reenq_locals, 3419 + struct scx_deferred_reenq_local, 3420 + node); 3421 + struct scx_sched_pcpu *sch_pcpu; 3422 + 3423 + if (!drl) 3424 + return; 3425 + 3426 + sch_pcpu = container_of(drl, struct scx_sched_pcpu, 3427 + deferred_reenq_local); 3428 + sch = sch_pcpu->sch; 3429 + 3430 + reenq_flags = drl->flags; 3431 + WRITE_ONCE(drl->flags, 0); 3432 + list_del_init(&drl->node); 3433 + 3434 + if (likely(drl->seq != seq)) { 3435 + drl->seq = seq; 3436 + drl->cnt = 0; 3437 + } else { 3438 + if (unlikely(++drl->cnt > SCX_REENQ_LOCAL_MAX_REPEAT)) { 3439 + scx_error(sch, "SCX_ENQ_REENQ on SCX_DSQ_LOCAL repeated %u times", 3440 + drl->cnt); 3441 + skip = true; 3442 + } 3443 + 3444 + __scx_add_event(sch, SCX_EV_REENQ_LOCAL_REPEAT, 1); 3445 + } 3446 + } 3447 + 3448 + if (!skip) { 3449 + /* see schedule_dsq_reenq() */ 3450 + smp_mb(); 3451 + 3452 + reenq_local(sch, rq, reenq_flags); 3453 + } 3454 + } 3455 + } 3456 + 3457 + static bool user_task_should_reenq(struct task_struct *p, u64 reenq_flags, u32 *reason) 3458 + { 3459 + *reason = SCX_TASK_REENQ_KFUNC; 3460 + return reenq_flags & SCX_REENQ_ANY; 3461 + } 3462 + 3463 + static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flags) 3464 + { 3465 + struct rq *locked_rq = rq; 3466 + struct scx_sched *sch = dsq->sched; 3467 + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, dsq, 0); 3468 + struct task_struct *p; 3469 + s32 nr_enqueued = 0; 3470 + 3471 + lockdep_assert_rq_held(rq); 3472 + 3473 + raw_spin_lock(&dsq->lock); 3474 + 3475 + while (likely(!READ_ONCE(sch->bypass_depth))) { 3476 + struct rq *task_rq; 3477 + u32 reason; 3478 + 3479 + p = nldsq_cursor_next_task(&cursor, dsq); 3480 + if (!p) 3481 + break; 3482 + 3483 + if (!user_task_should_reenq(p, reenq_flags, &reason)) 3484 + continue; 3485 + 3486 + task_rq = task_rq(p); 3487 + 3488 + if (locked_rq != task_rq) { 3489 + if (locked_rq) 3490 + raw_spin_rq_unlock(locked_rq); 3491 + if (unlikely(!raw_spin_rq_trylock(task_rq))) { 3492 + raw_spin_unlock(&dsq->lock); 3493 + raw_spin_rq_lock(task_rq); 3494 + raw_spin_lock(&dsq->lock); 3495 + } 3496 + locked_rq = task_rq; 3497 + 3498 + /* did we lose @p while switching locks? 
*/ 3499 + if (nldsq_cursor_lost_task(&cursor, task_rq, dsq, p)) 3500 + continue; 3501 + } 3502 + 3503 + /* @p is on @dsq, its rq and @dsq are locked */ 3504 + dispatch_dequeue_locked(p, dsq); 3505 + raw_spin_unlock(&dsq->lock); 3506 + 3507 + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) 3508 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3509 + p->scx.flags |= reason; 3510 + 3511 + do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1); 3512 + 3513 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3514 + 3515 + if (!(++nr_enqueued % SCX_TASK_ITER_BATCH)) { 3516 + raw_spin_rq_unlock(locked_rq); 3517 + locked_rq = NULL; 3518 + cpu_relax(); 3519 + } 3520 + 3521 + raw_spin_lock(&dsq->lock); 3522 + } 3523 + 3524 + list_del_init(&cursor.node); 3525 + raw_spin_unlock(&dsq->lock); 3526 + 3527 + if (locked_rq != rq) { 3528 + if (locked_rq) 3529 + raw_spin_rq_unlock(locked_rq); 3530 + raw_spin_rq_lock(rq); 3531 + } 3532 + } 3533 + 3534 + static void process_deferred_reenq_users(struct rq *rq) 3535 + { 3536 + lockdep_assert_rq_held(rq); 3537 + 3538 + while (true) { 3539 + struct scx_dispatch_q *dsq; 3540 + u64 reenq_flags; 3541 + 3542 + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { 3543 + struct scx_deferred_reenq_user *dru = 3544 + list_first_entry_or_null(&rq->scx.deferred_reenq_users, 3545 + struct scx_deferred_reenq_user, 3546 + node); 3547 + struct scx_dsq_pcpu *dsq_pcpu; 3548 + 3549 + if (!dru) 3550 + return; 3551 + 3552 + dsq_pcpu = container_of(dru, struct scx_dsq_pcpu, 3553 + deferred_reenq_user); 3554 + dsq = dsq_pcpu->dsq; 3555 + reenq_flags = dru->flags; 3556 + WRITE_ONCE(dru->flags, 0); 3557 + list_del_init(&dru->node); 3558 + } 3559 + 3560 + /* see schedule_dsq_reenq() */ 3561 + smp_mb(); 3562 + 3563 + BUG_ON(dsq->id & SCX_DSQ_FLAG_BUILTIN); 3564 + reenq_user(rq, dsq, reenq_flags); 3565 + } 3566 + } 3567 + 3568 + static void run_deferred(struct rq *rq) 3569 + { 3570 + process_ddsp_deferred_locals(rq); 3571 + 3572 + if (!list_empty(&rq->scx.deferred_reenq_locals)) 3573 + process_deferred_reenq_locals(rq); 3574 + 3575 + if (!list_empty(&rq->scx.deferred_reenq_users)) 3576 + process_deferred_reenq_users(rq); 3577 + } 3578 + 3856 3579 #ifdef CONFIG_NO_HZ_FULL 3857 3580 bool scx_can_stop_tick(struct rq *rq) 3858 3581 { 3859 3582 struct task_struct *p = rq->curr; 3860 - 3861 - if (scx_rq_bypassing(rq)) 3862 - return false; 3583 + struct scx_sched *sch = scx_task_sched(p); 3863 3584 3864 3585 if (p->sched_class != &ext_sched_class) 3865 3586 return true; 3587 + 3588 + if (scx_bypassing(sch, cpu_of(rq))) 3589 + return false; 3866 3590 3867 3591 /* 3868 3592 * @rq can dispatch from different DSQs, so we can't tell whether it ··· 4211 3315 .bw_quota_us = tg->scx.bw_quota_us, 4212 3316 .bw_burst_us = tg->scx.bw_burst_us }; 4213 3317 4214 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, 3318 + ret = SCX_CALL_OP_RET(sch, cgroup_init, 4215 3319 NULL, tg->css.cgroup, &args); 4216 3320 if (ret) 4217 3321 ret = ops_sanitize_err(sch, "cgroup_init", ret); ··· 4233 3337 4234 3338 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_exit) && 4235 3339 (tg->scx.flags & SCX_TG_INITED)) 4236 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, 4237 - tg->css.cgroup); 3340 + SCX_CALL_OP(sch, cgroup_exit, NULL, tg->css.cgroup); 4238 3341 tg->scx.flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); 4239 3342 } 4240 3343 ··· 4262 3367 continue; 4263 3368 4264 3369 if (SCX_HAS_OP(sch, cgroup_prep_move)) { 4265 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, 4266 - cgroup_prep_move, NULL, 3370 + ret = 
SCX_CALL_OP_RET(sch, cgroup_prep_move, NULL, 4267 3371 p, from, css->cgroup); 4268 3372 if (ret) 4269 3373 goto err; ··· 4277 3383 cgroup_taskset_for_each(p, css, tset) { 4278 3384 if (SCX_HAS_OP(sch, cgroup_cancel_move) && 4279 3385 p->scx.cgrp_moving_from) 4280 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, 3386 + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, 4281 3387 p, p->scx.cgrp_moving_from, css->cgroup); 4282 3388 p->scx.cgrp_moving_from = NULL; 4283 3389 } ··· 4298 3404 */ 4299 3405 if (SCX_HAS_OP(sch, cgroup_move) && 4300 3406 !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) 4301 - SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL, 3407 + SCX_CALL_OP_TASK(sch, cgroup_move, task_rq(p), 4302 3408 p, p->scx.cgrp_moving_from, 4303 3409 tg_cgrp(task_group(p))); 4304 3410 p->scx.cgrp_moving_from = NULL; ··· 4316 3422 cgroup_taskset_for_each(p, css, tset) { 4317 3423 if (SCX_HAS_OP(sch, cgroup_cancel_move) && 4318 3424 p->scx.cgrp_moving_from) 4319 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, 3425 + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, 4320 3426 p, p->scx.cgrp_moving_from, css->cgroup); 4321 3427 p->scx.cgrp_moving_from = NULL; 4322 3428 } ··· 4330 3436 4331 3437 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && 4332 3438 tg->scx.weight != weight) 4333 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_weight, NULL, 4334 - tg_cgrp(tg), weight); 3439 + SCX_CALL_OP(sch, cgroup_set_weight, NULL, tg_cgrp(tg), weight); 4335 3440 4336 3441 tg->scx.weight = weight; 4337 3442 ··· 4344 3451 percpu_down_read(&scx_cgroup_ops_rwsem); 4345 3452 4346 3453 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) 4347 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_idle, NULL, 4348 - tg_cgrp(tg), idle); 3454 + SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); 4349 3455 4350 3456 /* Update the task group's idle state */ 4351 3457 tg->scx.idle = idle; ··· 4363 3471 (tg->scx.bw_period_us != period_us || 4364 3472 tg->scx.bw_quota_us != quota_us || 4365 3473 tg->scx.bw_burst_us != burst_us)) 4366 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_bandwidth, NULL, 3474 + SCX_CALL_OP(sch, cgroup_set_bandwidth, NULL, 4367 3475 tg_cgrp(tg), period_us, quota_us, burst_us); 4368 3476 4369 3477 tg->scx.bw_period_us = period_us; ··· 4372 3480 4373 3481 percpu_up_read(&scx_cgroup_ops_rwsem); 4374 3482 } 3483 + #endif /* CONFIG_EXT_GROUP_SCHED */ 3484 + 3485 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 3486 + static struct cgroup *root_cgroup(void) 3487 + { 3488 + return &cgrp_dfl_root.cgrp; 3489 + } 3490 + 3491 + static struct cgroup *sch_cgroup(struct scx_sched *sch) 3492 + { 3493 + return sch->cgrp; 3494 + } 3495 + 3496 + /* for each descendant of @cgrp including self, set ->scx_sched to @sch */ 3497 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) 3498 + { 3499 + struct cgroup *pos; 3500 + struct cgroup_subsys_state *css; 3501 + 3502 + cgroup_for_each_live_descendant_pre(pos, css, cgrp) 3503 + rcu_assign_pointer(pos->scx_sched, sch); 3504 + } 4375 3505 4376 3506 static void scx_cgroup_lock(void) 4377 3507 { 3508 + #ifdef CONFIG_EXT_GROUP_SCHED 4378 3509 percpu_down_write(&scx_cgroup_ops_rwsem); 3510 + #endif 4379 3511 cgroup_lock(); 4380 3512 } 4381 3513 4382 3514 static void scx_cgroup_unlock(void) 4383 3515 { 4384 3516 cgroup_unlock(); 3517 + #ifdef CONFIG_EXT_GROUP_SCHED 4385 3518 percpu_up_write(&scx_cgroup_ops_rwsem); 3519 + #endif 4386 3520 } 4387 - 4388 - #else /* CONFIG_EXT_GROUP_SCHED */ 4389 - 3521 + 
#else /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ 3522 + static struct cgroup *root_cgroup(void) { return NULL; } 3523 + static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } 3524 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} 4390 3525 static void scx_cgroup_lock(void) {} 4391 3526 static void scx_cgroup_unlock(void) {} 4392 - 4393 - #endif /* CONFIG_EXT_GROUP_SCHED */ 3527 + #endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ 4394 3528 4395 3529 /* 4396 3530 * Omitted operations: 4397 - * 4398 - * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task 4399 - * isn't tied to the CPU at that point. Preemption is implemented by resetting 4400 - * the victim task's slice to 0 and triggering reschedule on the target CPU. 4401 3531 * 4402 3532 * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. 4403 3533 * ··· 4461 3547 #endif 4462 3548 }; 4463 3549 4464 - static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id) 3550 + static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id, 3551 + struct scx_sched *sch) 4465 3552 { 3553 + s32 cpu; 3554 + 4466 3555 memset(dsq, 0, sizeof(*dsq)); 4467 3556 4468 3557 raw_spin_lock_init(&dsq->lock); 4469 3558 INIT_LIST_HEAD(&dsq->list); 4470 3559 dsq->id = dsq_id; 3560 + dsq->sched = sch; 3561 + 3562 + dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu); 3563 + if (!dsq->pcpu) 3564 + return -ENOMEM; 3565 + 3566 + for_each_possible_cpu(cpu) { 3567 + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); 3568 + 3569 + pcpu->dsq = dsq; 3570 + INIT_LIST_HEAD(&pcpu->deferred_reenq_user.node); 3571 + } 3572 + 3573 + return 0; 3574 + } 3575 + 3576 + static void exit_dsq(struct scx_dispatch_q *dsq) 3577 + { 3578 + s32 cpu; 3579 + 3580 + for_each_possible_cpu(cpu) { 3581 + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); 3582 + struct scx_deferred_reenq_user *dru = &pcpu->deferred_reenq_user; 3583 + struct rq *rq = cpu_rq(cpu); 3584 + 3585 + /* 3586 + * There must have been a RCU grace period since the last 3587 + * insertion and @dsq should be off the deferred list by now. 
3588 + */ 3589 + if (WARN_ON_ONCE(!list_empty(&dru->node))) { 3590 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 3591 + list_del_init(&dru->node); 3592 + } 3593 + } 3594 + 3595 + free_percpu(dsq->pcpu); 3596 + } 3597 + 3598 + static void free_dsq_rcufn(struct rcu_head *rcu) 3599 + { 3600 + struct scx_dispatch_q *dsq = container_of(rcu, struct scx_dispatch_q, rcu); 3601 + 3602 + exit_dsq(dsq); 3603 + kfree(dsq); 4471 3604 } 4472 3605 4473 3606 static void free_dsq_irq_workfn(struct irq_work *irq_work) ··· 4523 3562 struct scx_dispatch_q *dsq, *tmp_dsq; 4524 3563 4525 3564 llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node) 4526 - kfree_rcu(dsq, rcu); 3565 + call_rcu(&dsq->rcu, free_dsq_rcufn); 4527 3566 } 4528 3567 4529 3568 static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn); ··· 4588 3627 if (!sch->ops.cgroup_exit) 4589 3628 continue; 4590 3629 4591 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, 4592 - css->cgroup); 3630 + SCX_CALL_OP(sch, cgroup_exit, NULL, css->cgroup); 4593 3631 } 4594 3632 } 4595 3633 ··· 4619 3659 continue; 4620 3660 } 4621 3661 4622 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, NULL, 3662 + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, 4623 3663 css->cgroup, &args); 4624 3664 if (ret) { 4625 3665 scx_error(sch, "ops.cgroup_init() failed (%d)", ret); ··· 4698 3738 .attrs = scx_global_attrs, 4699 3739 }; 4700 3740 3741 + static void free_pnode(struct scx_sched_pnode *pnode); 4701 3742 static void free_exit_info(struct scx_exit_info *ei); 4702 3743 4703 3744 static void scx_sched_free_rcu_work(struct work_struct *work) ··· 4707 3746 struct scx_sched *sch = container_of(rcu_work, struct scx_sched, rcu_work); 4708 3747 struct rhashtable_iter rht_iter; 4709 3748 struct scx_dispatch_q *dsq; 4710 - int node; 3749 + int cpu, node; 4711 3750 4712 - irq_work_sync(&sch->error_irq_work); 3751 + irq_work_sync(&sch->disable_irq_work); 4713 3752 kthread_destroy_worker(sch->helper); 3753 + timer_shutdown_sync(&sch->bypass_lb_timer); 3754 + 3755 + #ifdef CONFIG_EXT_SUB_SCHED 3756 + kfree(sch->cgrp_path); 3757 + if (sch_cgroup(sch)) 3758 + cgroup_put(sch_cgroup(sch)); 3759 + #endif /* CONFIG_EXT_SUB_SCHED */ 3760 + 3761 + for_each_possible_cpu(cpu) { 3762 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 3763 + 3764 + /* 3765 + * $sch would have entered bypass mode before the RCU grace 3766 + * period. As that blocks new deferrals, all 3767 + * deferred_reenq_local_node's must be off-list by now. 
3768 + */ 3769 + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node)); 3770 + 3771 + exit_dsq(bypass_dsq(sch, cpu)); 3772 + } 4714 3773 4715 3774 free_percpu(sch->pcpu); 4716 3775 4717 3776 for_each_node_state(node, N_POSSIBLE) 4718 - kfree(sch->global_dsqs[node]); 4719 - kfree(sch->global_dsqs); 3777 + free_pnode(sch->pnode[node]); 3778 + kfree(sch->pnode); 4720 3779 4721 3780 rhashtable_walk_enter(&sch->dsq_hash, &rht_iter); 4722 3781 do { 4723 3782 rhashtable_walk_start(&rht_iter); 4724 3783 4725 - while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq)) 3784 + while (!IS_ERR_OR_NULL((dsq = rhashtable_walk_next(&rht_iter)))) 4726 3785 destroy_dsq(sch, dsq->id); 4727 3786 4728 3787 rhashtable_walk_stop(&rht_iter); ··· 4759 3778 struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); 4760 3779 4761 3780 INIT_RCU_WORK(&sch->rcu_work, scx_sched_free_rcu_work); 4762 - queue_rcu_work(system_unbound_wq, &sch->rcu_work); 3781 + queue_rcu_work(system_dfl_wq, &sch->rcu_work); 4763 3782 } 4764 3783 4765 3784 static ssize_t scx_attr_ops_show(struct kobject *kobj, ··· 4788 3807 at += scx_attr_event_show(buf, at, &events, SCX_EV_DISPATCH_KEEP_LAST); 4789 3808 at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_EXITING); 4790 3809 at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 3810 + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_IMMED); 3811 + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_LOCAL_REPEAT); 4791 3812 at += scx_attr_event_show(buf, at, &events, SCX_EV_REFILL_SLICE_DFL); 4792 3813 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DURATION); 4793 3814 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DISPATCH); 4794 3815 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_ACTIVATE); 3816 + at += scx_attr_event_show(buf, at, &events, SCX_EV_INSERT_NOT_OWNED); 3817 + at += scx_attr_event_show(buf, at, &events, SCX_EV_SUB_BYPASS_DISPATCH); 4795 3818 return at; 4796 3819 } 4797 3820 SCX_ATTR(events); ··· 4815 3830 4816 3831 static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) 4817 3832 { 4818 - const struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); 3833 + const struct scx_sched *sch; 3834 + 3835 + /* 3836 + * scx_uevent() can be reached by both scx_sched kobjects (scx_ktype) 3837 + * and sub-scheduler kset kobjects (kset_ktype) through the parent 3838 + * chain walk. Filter out the latter to avoid invalid casts. 3839 + */ 3840 + if (kobj->ktype != &scx_ktype) 3841 + return 0; 3842 + 3843 + sch = container_of(kobj, struct scx_sched, kobj); 4819 3844 4820 3845 return add_uevent_var(env, "SCXOPS=%s", sch->ops.name); 4821 3846 } ··· 4854 3859 if (!scx_enabled()) 4855 3860 return true; 4856 3861 4857 - sch = rcu_dereference_sched(scx_root); 3862 + sch = scx_task_sched(p); 4858 3863 if (unlikely(!sch)) 4859 3864 return true; 4860 3865 ··· 4947 3952 * a good state before taking more drastic actions. 4948 3953 * 4949 3954 * Returns %true if sched_ext is enabled and abort was initiated, which may 4950 - * resolve the reported hardlockdup. %false if sched_ext is not enabled or 3955 + * resolve the reported hardlockup. %false if sched_ext is not enabled or 4951 3956 * someone else already initiated abort. 
4952 3957 */ 4953 3958 bool scx_hardlockup(int cpu) ··· 4960 3965 return true; 4961 3966 } 4962 3967 4963 - static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, 3968 + static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor, 4964 3969 struct cpumask *donee_mask, struct cpumask *resched_mask, 4965 3970 u32 nr_donor_target, u32 nr_donee_target) 4966 3971 { 4967 - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; 3972 + struct rq *donor_rq = cpu_rq(donor); 3973 + struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor); 4968 3974 struct task_struct *p, *n; 4969 - struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0); 3975 + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0); 4970 3976 s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target; 4971 3977 u32 nr_balanced = 0, min_delta_us; 4972 3978 ··· 4981 3985 if (delta < DIV_ROUND_UP(min_delta_us, READ_ONCE(scx_slice_bypass_us))) 4982 3986 return 0; 4983 3987 4984 - raw_spin_rq_lock_irq(rq); 3988 + raw_spin_rq_lock_irq(donor_rq); 4985 3989 raw_spin_lock(&donor_dsq->lock); 4986 3990 list_add(&cursor.node, &donor_dsq->list); 4987 3991 resume: ··· 4989 3993 n = nldsq_next_task(donor_dsq, n, false); 4990 3994 4991 3995 while ((p = n)) { 4992 - struct rq *donee_rq; 4993 3996 struct scx_dispatch_q *donee_dsq; 4994 3997 int donee; 4995 3998 ··· 5004 4009 if (donee >= nr_cpu_ids) 5005 4010 continue; 5006 4011 5007 - donee_rq = cpu_rq(donee); 5008 - donee_dsq = &donee_rq->scx.bypass_dsq; 4012 + donee_dsq = bypass_dsq(sch, donee); 5009 4013 5010 4014 /* 5011 4015 * $p's rq is not locked but $p's DSQ lock protects its 5012 4016 * scheduling properties making this test safe. 5013 4017 */ 5014 - if (!task_can_run_on_remote_rq(sch, p, donee_rq, false)) 4018 + if (!task_can_run_on_remote_rq(sch, p, cpu_rq(donee), false)) 5015 4019 continue; 5016 4020 5017 4021 /* ··· 5025 4031 * between bypass DSQs. 5026 4032 */ 5027 4033 dispatch_dequeue_locked(p, donor_dsq); 5028 - dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED); 4034 + dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED); 5029 4035 5030 4036 /* 5031 4037 * $donee might have been idle and need to be woken up. 
No need ··· 5040 4046 if (!(nr_balanced % SCX_BYPASS_LB_BATCH) && n) { 5041 4047 list_move_tail(&cursor.node, &n->scx.dsq_list.node); 5042 4048 raw_spin_unlock(&donor_dsq->lock); 5043 - raw_spin_rq_unlock_irq(rq); 4049 + raw_spin_rq_unlock_irq(donor_rq); 5044 4050 cpu_relax(); 5045 - raw_spin_rq_lock_irq(rq); 4051 + raw_spin_rq_lock_irq(donor_rq); 5046 4052 raw_spin_lock(&donor_dsq->lock); 5047 4053 goto resume; 5048 4054 } ··· 5050 4056 5051 4057 list_del_init(&cursor.node); 5052 4058 raw_spin_unlock(&donor_dsq->lock); 5053 - raw_spin_rq_unlock_irq(rq); 4059 + raw_spin_rq_unlock_irq(donor_rq); 5054 4060 5055 4061 return nr_balanced; 5056 4062 } ··· 5068 4074 5069 4075 /* count the target tasks and CPUs */ 5070 4076 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5071 - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); 4077 + u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr); 5072 4078 5073 4079 nr_tasks += nr; 5074 4080 nr_cpus++; ··· 5090 4096 5091 4097 cpumask_clear(donee_mask); 5092 4098 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5093 - if (READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr) < nr_target) 4099 + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target) 5094 4100 cpumask_set_cpu(cpu, donee_mask); 5095 4101 } 5096 4102 5097 4103 /* iterate !donee CPUs and see if they should be offloaded */ 5098 4104 cpumask_clear(resched_mask); 5099 4105 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5100 - struct rq *rq = cpu_rq(cpu); 5101 - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; 5102 - 5103 4106 if (cpumask_empty(donee_mask)) 5104 4107 break; 5105 4108 if (cpumask_test_cpu(cpu, donee_mask)) 5106 4109 continue; 5107 - if (READ_ONCE(donor_dsq->nr) <= nr_donor_target) 4110 + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target) 5108 4111 continue; 5109 4112 5110 - nr_balanced += bypass_lb_cpu(sch, rq, donee_mask, resched_mask, 4113 + nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask, 5111 4114 nr_donor_target, nr_target); 5112 4115 } 5113 4116 ··· 5112 4121 resched_cpu(cpu); 5113 4122 5114 4123 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5115 - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); 4124 + u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr); 5116 4125 5117 4126 after_min = min(nr, after_min); 5118 4127 after_max = max(nr, after_max); ··· 5134 4143 */ 5135 4144 static void scx_bypass_lb_timerfn(struct timer_list *timer) 5136 4145 { 5137 - struct scx_sched *sch; 4146 + struct scx_sched *sch = container_of(timer, struct scx_sched, bypass_lb_timer); 5138 4147 int node; 5139 4148 u32 intv_us; 5140 4149 5141 - sch = rcu_dereference_all(scx_root); 5142 - if (unlikely(!sch) || !READ_ONCE(scx_bypass_depth)) 4150 + if (!bypass_dsp_enabled(sch)) 5143 4151 return; 5144 4152 5145 4153 for_each_node_with_cpus(node) ··· 5149 4159 mod_timer(timer, jiffies + usecs_to_jiffies(intv_us)); 5150 4160 } 5151 4161 5152 - static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn); 4162 + static bool inc_bypass_depth(struct scx_sched *sch) 4163 + { 4164 + lockdep_assert_held(&scx_bypass_lock); 4165 + 4166 + WARN_ON_ONCE(sch->bypass_depth < 0); 4167 + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth + 1); 4168 + if (sch->bypass_depth != 1) 4169 + return false; 4170 + 4171 + WRITE_ONCE(sch->slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); 4172 + sch->bypass_timestamp = ktime_get_ns(); 4173 + scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); 4174 + return true; 4175 + } 4176 + 4177 + static bool dec_bypass_depth(struct scx_sched *sch) 4178 + { 4179 + 
lockdep_assert_held(&scx_bypass_lock); 4180 + 4181 + WARN_ON_ONCE(sch->bypass_depth < 1); 4182 + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth - 1); 4183 + if (sch->bypass_depth != 0) 4184 + return false; 4185 + 4186 + WRITE_ONCE(sch->slice_dfl, SCX_SLICE_DFL); 4187 + scx_add_event(sch, SCX_EV_BYPASS_DURATION, 4188 + ktime_get_ns() - sch->bypass_timestamp); 4189 + return true; 4190 + } 4191 + 4192 + static void enable_bypass_dsp(struct scx_sched *sch) 4193 + { 4194 + struct scx_sched *host = scx_parent(sch) ?: sch; 4195 + u32 intv_us = READ_ONCE(scx_bypass_lb_intv_us); 4196 + s32 ret; 4197 + 4198 + /* 4199 + * @sch->bypass_depth transitioning from 0 to 1 triggers enabling. 4200 + * Shouldn't stagger. 4201 + */ 4202 + if (WARN_ON_ONCE(test_and_set_bit(0, &sch->bypass_dsp_claim))) 4203 + return; 4204 + 4205 + /* 4206 + * When a sub-sched bypasses, its tasks are queued on the bypass DSQs of 4207 + * the nearest non-bypassing ancestor or root. As enable_bypass_dsp() is 4208 + * called iff @sch is not already bypassed due to an ancestor bypassing, 4209 + * we can assume that the parent is not bypassing and thus will be the 4210 + * host of the bypass DSQs. 4211 + * 4212 + * While the situation may change in the future, the following 4213 + * guarantees that the nearest non-bypassing ancestor or root has bypass 4214 + * dispatch enabled while a descendant is bypassing, which is all that's 4215 + * required. 4216 + * 4217 + * bypass_dsp_enabled() test is used to determine whether to enter the 4218 + * bypass dispatch handling path from both bypassing and hosting scheds. 4219 + * Bump enable depth on both @sch and bypass dispatch host. 4220 + */ 4221 + ret = atomic_inc_return(&sch->bypass_dsp_enable_depth); 4222 + WARN_ON_ONCE(ret <= 0); 4223 + 4224 + if (host != sch) { 4225 + ret = atomic_inc_return(&host->bypass_dsp_enable_depth); 4226 + WARN_ON_ONCE(ret <= 0); 4227 + } 4228 + 4229 + /* 4230 + * The LB timer will stop running if bypass dispatch is disabled. Start 4231 + * after enabling bypass dispatch. 4232 + */ 4233 + if (intv_us && !timer_pending(&host->bypass_lb_timer)) 4234 + mod_timer(&host->bypass_lb_timer, 4235 + jiffies + usecs_to_jiffies(intv_us)); 4236 + } 4237 + 4238 + /* may be called without holding scx_bypass_lock */ 4239 + static void disable_bypass_dsp(struct scx_sched *sch) 4240 + { 4241 + s32 ret; 4242 + 4243 + if (!test_and_clear_bit(0, &sch->bypass_dsp_claim)) 4244 + return; 4245 + 4246 + ret = atomic_dec_return(&sch->bypass_dsp_enable_depth); 4247 + WARN_ON_ONCE(ret < 0); 4248 + 4249 + if (scx_parent(sch)) { 4250 + ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth); 4251 + WARN_ON_ONCE(ret < 0); 4252 + } 4253 + } 5153 4254 5154 4255 /** 5155 4256 * scx_bypass - [Un]bypass scx_ops and guarantee forward progress 4257 + * @sch: sched to bypass 5156 4258 * @bypass: true for bypass, false for unbypass 5157 4259 * 5158 4260 * Bypassing guarantees that all runnable tasks make forward progress without ··· 5274 4192 * 5275 4193 * - scx_prio_less() reverts to the default core_sched_at order. 
5276 4194 */ 5277 - static void scx_bypass(bool bypass) 4195 + static void scx_bypass(struct scx_sched *sch, bool bypass) 5278 4196 { 5279 - static DEFINE_RAW_SPINLOCK(bypass_lock); 5280 - static unsigned long bypass_timestamp; 5281 - struct scx_sched *sch; 4197 + struct scx_sched *pos; 5282 4198 unsigned long flags; 5283 4199 int cpu; 5284 4200 5285 - raw_spin_lock_irqsave(&bypass_lock, flags); 5286 - sch = rcu_dereference_bh(scx_root); 4201 + raw_spin_lock_irqsave(&scx_bypass_lock, flags); 5287 4202 5288 4203 if (bypass) { 5289 - u32 intv_us; 5290 - 5291 - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth + 1); 5292 - WARN_ON_ONCE(scx_bypass_depth <= 0); 5293 - if (scx_bypass_depth != 1) 4204 + if (!inc_bypass_depth(sch)) 5294 4205 goto unlock; 5295 - WRITE_ONCE(scx_slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); 5296 - bypass_timestamp = ktime_get_ns(); 5297 - if (sch) 5298 - scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); 5299 4206 5300 - intv_us = READ_ONCE(scx_bypass_lb_intv_us); 5301 - if (intv_us && !timer_pending(&scx_bypass_lb_timer)) { 5302 - scx_bypass_lb_timer.expires = 5303 - jiffies + usecs_to_jiffies(intv_us); 5304 - add_timer_global(&scx_bypass_lb_timer); 5305 - } 4207 + enable_bypass_dsp(sch); 5306 4208 } else { 5307 - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth - 1); 5308 - WARN_ON_ONCE(scx_bypass_depth < 0); 5309 - if (scx_bypass_depth != 0) 4209 + if (!dec_bypass_depth(sch)) 5310 4210 goto unlock; 5311 - WRITE_ONCE(scx_slice_dfl, SCX_SLICE_DFL); 5312 - if (sch) 5313 - scx_add_event(sch, SCX_EV_BYPASS_DURATION, 5314 - ktime_get_ns() - bypass_timestamp); 5315 4211 } 5316 4212 5317 4213 /* 4214 + * Bypass state is propagated to all descendants - an scx_sched bypasses 4215 + * if itself or any of its ancestors are in bypass mode. 4216 + */ 4217 + raw_spin_lock(&scx_sched_lock); 4218 + scx_for_each_descendant_pre(pos, sch) { 4219 + if (pos == sch) 4220 + continue; 4221 + if (bypass) 4222 + inc_bypass_depth(pos); 4223 + else 4224 + dec_bypass_depth(pos); 4225 + } 4226 + raw_spin_unlock(&scx_sched_lock); 4227 + 4228 + /* 5318 4229 * No task property is changing. We just need to make sure all currently 5319 - * queued tasks are re-queued according to the new scx_rq_bypassing() 4230 + * queued tasks are re-queued according to the new scx_bypassing() 5320 4231 * state. As an optimization, walk each rq's runnable_list instead of 5321 4232 * the scx_tasks list. 5322 4233 * ··· 5321 4246 struct task_struct *p, *n; 5322 4247 5323 4248 raw_spin_rq_lock(rq); 4249 + raw_spin_lock(&scx_sched_lock); 5324 4250 5325 - if (bypass) { 5326 - WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); 5327 - rq->scx.flags |= SCX_RQ_BYPASSING; 5328 - } else { 5329 - WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING)); 5330 - rq->scx.flags &= ~SCX_RQ_BYPASSING; 4251 + scx_for_each_descendant_pre(pos, sch) { 4252 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu); 4253 + 4254 + if (pos->bypass_depth) 4255 + pcpu->flags |= SCX_SCHED_PCPU_BYPASSING; 4256 + else 4257 + pcpu->flags &= ~SCX_SCHED_PCPU_BYPASSING; 5331 4258 } 4259 + 4260 + raw_spin_unlock(&scx_sched_lock); 5332 4261 5333 4262 /* 5334 4263 * We need to guarantee that no tasks are on the BPF scheduler 5335 4264 * while bypassing. Either we see enabled or the enable path 5336 - * sees scx_rq_bypassing() before moving tasks to SCX. 4265 + * sees scx_bypassing() before moving tasks to SCX. 
5337 4266 */ 5338 4267 if (!scx_enabled()) { 5339 4268 raw_spin_rq_unlock(rq); ··· 5353 4274 */ 5354 4275 list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list, 5355 4276 scx.runnable_node) { 4277 + if (!scx_is_descendant(scx_task_sched(p), sch)) 4278 + continue; 4279 + 5356 4280 /* cycling deq/enq is enough, see the function comment */ 5357 4281 scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 5358 4282 /* nothing */ ; ··· 5369 4287 raw_spin_rq_unlock(rq); 5370 4288 } 5371 4289 4290 + /* disarming must come after moving all tasks out of the bypass DSQs */ 4291 + if (!bypass) 4292 + disable_bypass_dsp(sch); 5372 4293 unlock: 5373 - raw_spin_unlock_irqrestore(&bypass_lock, flags); 4294 + raw_spin_unlock_irqrestore(&scx_bypass_lock, flags); 5374 4295 } 5375 4296 5376 4297 static void free_exit_info(struct scx_exit_info *ei) ··· 5415 4330 return "unregistered from the main kernel"; 5416 4331 case SCX_EXIT_SYSRQ: 5417 4332 return "disabled by sysrq-S"; 4333 + case SCX_EXIT_PARENT: 4334 + return "parent exiting"; 5418 4335 case SCX_EXIT_ERROR: 5419 4336 return "runtime error"; 5420 4337 case SCX_EXIT_ERROR_BPF: ··· 5442 4355 } 5443 4356 } 5444 4357 5445 - static void scx_disable_workfn(struct kthread_work *work) 4358 + static void refresh_watchdog(void) 5446 4359 { 5447 - struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); 4360 + struct scx_sched *sch; 4361 + unsigned long intv = ULONG_MAX; 4362 + 4363 + /* take the shortest timeout and use its half for watchdog interval */ 4364 + rcu_read_lock(); 4365 + list_for_each_entry_rcu(sch, &scx_sched_all, all) 4366 + intv = max(min(intv, sch->watchdog_timeout / 2), 1); 4367 + rcu_read_unlock(); 4368 + 4369 + WRITE_ONCE(scx_watchdog_timestamp, jiffies); 4370 + WRITE_ONCE(scx_watchdog_interval, intv); 4371 + 4372 + if (intv < ULONG_MAX) 4373 + mod_delayed_work(system_dfl_wq, &scx_watchdog_work, intv); 4374 + else 4375 + cancel_delayed_work_sync(&scx_watchdog_work); 4376 + } 4377 + 4378 + static s32 scx_link_sched(struct scx_sched *sch) 4379 + { 4380 + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { 4381 + #ifdef CONFIG_EXT_SUB_SCHED 4382 + struct scx_sched *parent = scx_parent(sch); 4383 + s32 ret; 4384 + 4385 + if (parent) { 4386 + /* 4387 + * scx_claim_exit() propagates exit_kind transition to 4388 + * its sub-scheds while holding scx_sched_lock - either 4389 + * we can see the parent's non-NONE exit_kind or the 4390 + * parent can shoot us down. 
4391 + */ 4392 + if (atomic_read(&parent->exit_kind) != SCX_EXIT_NONE) { 4393 + scx_error(sch, "parent disabled"); 4394 + return -ENOENT; 4395 + } 4396 + 4397 + ret = rhashtable_lookup_insert_fast(&scx_sched_hash, 4398 + &sch->hash_node, scx_sched_hash_params); 4399 + if (ret) { 4400 + scx_error(sch, "failed to insert into scx_sched_hash (%d)", ret); 4401 + return ret; 4402 + } 4403 + 4404 + list_add_tail(&sch->sibling, &parent->children); 4405 + } 4406 + #endif /* CONFIG_EXT_SUB_SCHED */ 4407 + 4408 + list_add_tail_rcu(&sch->all, &scx_sched_all); 4409 + } 4410 + 4411 + refresh_watchdog(); 4412 + return 0; 4413 + } 4414 + 4415 + static void scx_unlink_sched(struct scx_sched *sch) 4416 + { 4417 + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { 4418 + #ifdef CONFIG_EXT_SUB_SCHED 4419 + if (scx_parent(sch)) { 4420 + rhashtable_remove_fast(&scx_sched_hash, &sch->hash_node, 4421 + scx_sched_hash_params); 4422 + list_del_init(&sch->sibling); 4423 + } 4424 + #endif /* CONFIG_EXT_SUB_SCHED */ 4425 + list_del_rcu(&sch->all); 4426 + } 4427 + 4428 + refresh_watchdog(); 4429 + } 4430 + 4431 + /* 4432 + * Called to disable future dumps and wait for in-progress one while disabling 4433 + * @sch. Once @sch becomes empty during disable, there's no point in dumping it. 4434 + * This prevents calling dump ops on a dead sch. 4435 + */ 4436 + static void scx_disable_dump(struct scx_sched *sch) 4437 + { 4438 + guard(raw_spinlock_irqsave)(&scx_dump_lock); 4439 + sch->dump_disabled = true; 4440 + } 4441 + 4442 + #ifdef CONFIG_EXT_SUB_SCHED 4443 + static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq); 4444 + 4445 + static void drain_descendants(struct scx_sched *sch) 4446 + { 4447 + /* 4448 + * Child scheds that finished the critical part of disabling will take 4449 + * themselves off @sch->children. Wait for it to drain. As propagation 4450 + * is recursive, empty @sch->children means that all proper descendant 4451 + * scheds reached unlinking stage. 4452 + */ 4453 + wait_event(scx_unlink_waitq, list_empty(&sch->children)); 4454 + } 4455 + 4456 + static void scx_fail_parent(struct scx_sched *sch, 4457 + struct task_struct *failed, s32 fail_code) 4458 + { 4459 + struct scx_sched *parent = scx_parent(sch); 4460 + struct scx_task_iter sti; 4461 + struct task_struct *p; 4462 + 4463 + scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler", 4464 + fail_code, failed->comm, failed->pid); 4465 + 4466 + /* 4467 + * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into 4468 + * it. This may cause downstream failures on the BPF side but $parent is 4469 + * dying anyway. 4470 + */ 4471 + scx_bypass(parent, true); 4472 + 4473 + scx_task_iter_start(&sti, sch->cgrp); 4474 + while ((p = scx_task_iter_next_locked(&sti))) { 4475 + if (scx_task_on_sched(parent, p)) 4476 + continue; 4477 + 4478 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 4479 + scx_disable_and_exit_task(sch, p); 4480 + rcu_assign_pointer(p->scx.sched, parent); 4481 + } 4482 + } 4483 + scx_task_iter_stop(&sti); 4484 + } 4485 + 4486 + static void scx_sub_disable(struct scx_sched *sch) 4487 + { 4488 + struct scx_sched *parent = scx_parent(sch); 4489 + struct scx_task_iter sti; 4490 + struct task_struct *p; 4491 + int ret; 4492 + 4493 + /* 4494 + * Guarantee forward progress and wait for descendants to be disabled. 4495 + * To limit disruptions, $parent is not bypassed. Tasks are fully 4496 + * prepped and then inserted back into $parent. 
4497 + */ 4498 + scx_bypass(sch, true); 4499 + drain_descendants(sch); 4500 + 4501 + /* 4502 + * Here, every runnable task is guaranteed to make forward progress and 4503 + * we can safely use blocking synchronization constructs. Actually 4504 + * disable ops. 4505 + */ 4506 + mutex_lock(&scx_enable_mutex); 4507 + percpu_down_write(&scx_fork_rwsem); 4508 + scx_cgroup_lock(); 4509 + 4510 + set_cgroup_sched(sch_cgroup(sch), parent); 4511 + 4512 + scx_task_iter_start(&sti, sch->cgrp); 4513 + while ((p = scx_task_iter_next_locked(&sti))) { 4514 + struct rq *rq; 4515 + struct rq_flags rf; 4516 + 4517 + /* filter out duplicate visits */ 4518 + if (scx_task_on_sched(parent, p)) 4519 + continue; 4520 + 4521 + /* 4522 + * By the time control reaches here, all descendant schedulers 4523 + * should already have been disabled. 4524 + */ 4525 + WARN_ON_ONCE(!scx_task_on_sched(sch, p)); 4526 + 4527 + /* 4528 + * If $p is about to be freed, nothing prevents $sch from 4529 + * unloading before $p reaches sched_ext_free(). Disable and 4530 + * exit $p right away. 4531 + */ 4532 + if (!tryget_task_struct(p)) { 4533 + scx_disable_and_exit_task(sch, p); 4534 + continue; 4535 + } 4536 + 4537 + scx_task_iter_unlock(&sti); 4538 + 4539 + /* 4540 + * $p is READY or ENABLED on @sch. Initialize for $parent, 4541 + * disable and exit from @sch, and then switch over to $parent. 4542 + * 4543 + * If a task fails to initialize for $parent, the only available 4544 + * action is disabling $parent too. While this allows disabling 4545 + * of a child sched to cause the parent scheduler to fail, the 4546 + * failure can only originate from ops.init_task() of the 4547 + * parent. A child can't directly affect the parent through its 4548 + * own failures. 4549 + */ 4550 + ret = __scx_init_task(parent, p, false); 4551 + if (ret) { 4552 + scx_fail_parent(sch, p, ret); 4553 + put_task_struct(p); 4554 + break; 4555 + } 4556 + 4557 + rq = task_rq_lock(p, &rf); 4558 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 4559 + /* 4560 + * $p is initialized for $parent and still attached to 4561 + * @sch. Disable and exit for @sch, switch over to 4562 + * $parent, override the state to READY to account for 4563 + * $p having already been initialized, and then enable. 4564 + */ 4565 + scx_disable_and_exit_task(sch, p); 4566 + scx_set_task_state(p, SCX_TASK_INIT); 4567 + rcu_assign_pointer(p->scx.sched, parent); 4568 + scx_set_task_state(p, SCX_TASK_READY); 4569 + scx_enable_task(parent, p); 4570 + } 4571 + task_rq_unlock(rq, p, &rf); 4572 + 4573 + put_task_struct(p); 4574 + } 4575 + scx_task_iter_stop(&sti); 4576 + 4577 + scx_disable_dump(sch); 4578 + 4579 + scx_cgroup_unlock(); 4580 + percpu_up_write(&scx_fork_rwsem); 4581 + 4582 + /* 4583 + * All tasks are moved off of @sch but there may still be on-going 4584 + * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use 4585 + * the expedited version as ancestors may be waiting in bypass mode. 4586 + * Also, tell the parent that there is no need to keep running bypass 4587 + * DSQs for us. 4588 + */ 4589 + synchronize_rcu_expedited(); 4590 + disable_bypass_dsp(sch); 4591 + 4592 + scx_unlink_sched(sch); 4593 + 4594 + mutex_unlock(&scx_enable_mutex); 4595 + 4596 + /* 4597 + * @sch is now unlinked from the parent's children list. Notify and call 4598 + * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called 4599 + * after unlinking and releasing all locks. See scx_claim_exit(). 
4600 + */ 4601 + wake_up_all(&scx_unlink_waitq); 4602 + 4603 + if (parent->ops.sub_detach && sch->sub_attached) { 4604 + struct scx_sub_detach_args sub_detach_args = { 4605 + .ops = &sch->ops, 4606 + .cgroup_path = sch->cgrp_path, 4607 + }; 4608 + SCX_CALL_OP(parent, sub_detach, NULL, 4609 + &sub_detach_args); 4610 + } 4611 + 4612 + if (sch->ops.exit) 4613 + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); 4614 + kobject_del(&sch->kobj); 4615 + } 4616 + #else /* CONFIG_EXT_SUB_SCHED */ 4617 + static void drain_descendants(struct scx_sched *sch) { } 4618 + static void scx_sub_disable(struct scx_sched *sch) { } 4619 + #endif /* CONFIG_EXT_SUB_SCHED */ 4620 + 4621 + static void scx_root_disable(struct scx_sched *sch) 4622 + { 5448 4623 struct scx_exit_info *ei = sch->exit_info; 5449 4624 struct scx_task_iter sti; 5450 4625 struct task_struct *p; 5451 - int kind, cpu; 4626 + int cpu; 5452 4627 5453 - kind = atomic_read(&sch->exit_kind); 5454 - while (true) { 5455 - if (kind == SCX_EXIT_DONE) /* already disabled? */ 5456 - return; 5457 - WARN_ON_ONCE(kind == SCX_EXIT_NONE); 5458 - if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) 5459 - break; 5460 - } 5461 - ei->kind = kind; 5462 - ei->reason = scx_exit_reason(ei->kind); 5463 - 5464 - /* guarantee forward progress by bypassing scx_ops */ 5465 - scx_bypass(true); 5466 - WRITE_ONCE(scx_aborting, false); 4628 + /* guarantee forward progress and wait for descendants to be disabled */ 4629 + scx_bypass(sch, true); 4630 + drain_descendants(sch); 5467 4631 5468 4632 switch (scx_set_enable_state(SCX_DISABLING)) { 5469 4633 case SCX_DISABLING: ··· 5741 4403 5742 4404 /* 5743 4405 * Shut down cgroup support before tasks so that the cgroup attach path 5744 - * doesn't race against scx_exit_task(). 4406 + * doesn't race against scx_disable_and_exit_task(). 5745 4407 */ 5746 4408 scx_cgroup_lock(); 5747 4409 scx_cgroup_exit(sch); ··· 5755 4417 5756 4418 scx_init_task_enabled = false; 5757 4419 5758 - scx_task_iter_start(&sti); 4420 + scx_task_iter_start(&sti, NULL); 5759 4421 while ((p = scx_task_iter_next_locked(&sti))) { 5760 4422 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; 5761 4423 const struct sched_class *old_class = p->sched_class; ··· 5770 4432 p->sched_class = new_class; 5771 4433 } 5772 4434 5773 - scx_exit_task(p); 4435 + scx_disable_and_exit_task(scx_task_sched(p), p); 5774 4436 } 5775 4437 scx_task_iter_stop(&sti); 4438 + 4439 + scx_disable_dump(sch); 4440 + 4441 + scx_cgroup_lock(); 4442 + set_cgroup_sched(sch_cgroup(sch), NULL); 4443 + scx_cgroup_unlock(); 4444 + 5776 4445 percpu_up_write(&scx_fork_rwsem); 5777 4446 5778 4447 /* ··· 5812 4467 } 5813 4468 5814 4469 if (sch->ops.exit) 5815 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, ei); 4470 + SCX_CALL_OP(sch, exit, NULL, ei); 5816 4471 5817 - cancel_delayed_work_sync(&scx_watchdog_work); 4472 + scx_unlink_sched(sch); 5818 4473 5819 4474 /* 5820 4475 * scx_root clearing must be inside cpus_read_lock(). 
See ··· 5831 4486 */ 5832 4487 kobject_del(&sch->kobj); 5833 4488 5834 - free_percpu(scx_dsp_ctx); 5835 - scx_dsp_ctx = NULL; 5836 - scx_dsp_max_batch = 0; 5837 4489 free_kick_syncs(); 5838 - 5839 - if (scx_bypassed_for_enable) { 5840 - scx_bypassed_for_enable = false; 5841 - scx_bypass(false); 5842 - } 5843 4490 5844 4491 mutex_unlock(&scx_enable_mutex); 5845 4492 5846 4493 WARN_ON_ONCE(scx_set_enable_state(SCX_DISABLED) != SCX_DISABLING); 5847 4494 done: 5848 - scx_bypass(false); 4495 + scx_bypass(sch, false); 5849 4496 } 5850 4497 5851 4498 /* ··· 5853 4516 5854 4517 lockdep_assert_preemption_disabled(); 5855 4518 4519 + if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) 4520 + kind = SCX_EXIT_ERROR; 4521 + 5856 4522 if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind)) 5857 4523 return false; 5858 4524 ··· 5864 4524 * flag to break potential live-lock scenarios, ensuring we can 5865 4525 * successfully reach scx_bypass(). 5866 4526 */ 5867 - WRITE_ONCE(scx_aborting, true); 4527 + WRITE_ONCE(sch->aborting, true); 4528 + 4529 + /* 4530 + * Propagate exits to descendants immediately. Each has a dedicated 4531 + * helper kthread and can run in parallel. While most of disabling is 4532 + * serialized, running them in separate threads allows parallelizing 4533 + * ops.exit(), which can take arbitrarily long prolonging bypass mode. 4534 + * 4535 + * To guarantee forward progress, this propagation must be in-line so 4536 + * that ->aborting is synchronously asserted for all sub-scheds. The 4537 + * propagation is also the interlocking point against sub-sched 4538 + * attachment. See scx_link_sched(). 4539 + * 4540 + * This doesn't cause recursions as propagation only takes place for 4541 + * non-propagation exits. 4542 + */ 4543 + if (kind != SCX_EXIT_PARENT) { 4544 + scoped_guard (raw_spinlock_irqsave, &scx_sched_lock) { 4545 + struct scx_sched *pos; 4546 + scx_for_each_descendant_pre(pos, sch) 4547 + scx_disable(pos, SCX_EXIT_PARENT); 4548 + } 4549 + } 4550 + 5868 4551 return true; 5869 4552 } 5870 4553 5871 - static void scx_disable(enum scx_exit_kind kind) 4554 + static void scx_disable_workfn(struct kthread_work *work) 5872 4555 { 5873 - struct scx_sched *sch; 4556 + struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); 4557 + struct scx_exit_info *ei = sch->exit_info; 4558 + int kind; 5874 4559 5875 - if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) 5876 - kind = SCX_EXIT_ERROR; 5877 - 5878 - rcu_read_lock(); 5879 - sch = rcu_dereference(scx_root); 5880 - if (sch) { 5881 - guard(preempt)(); 5882 - scx_claim_exit(sch, kind); 5883 - kthread_queue_work(sch->helper, &sch->disable_work); 4560 + kind = atomic_read(&sch->exit_kind); 4561 + while (true) { 4562 + if (kind == SCX_EXIT_DONE) /* already disabled? 
*/ 4563 + return; 4564 + WARN_ON_ONCE(kind == SCX_EXIT_NONE); 4565 + if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) 4566 + break; 5884 4567 } 5885 - rcu_read_unlock(); 4568 + ei->kind = kind; 4569 + ei->reason = scx_exit_reason(ei->kind); 4570 + 4571 + if (scx_parent(sch)) 4572 + scx_sub_disable(sch); 4573 + else 4574 + scx_root_disable(sch); 4575 + } 4576 + 4577 + static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind) 4578 + { 4579 + guard(preempt)(); 4580 + if (scx_claim_exit(sch, kind)) 4581 + irq_work_queue(&sch->disable_irq_work); 5886 4582 } 5887 4583 5888 4584 static void dump_newline(struct seq_buf *s) ··· 5936 4560 5937 4561 #ifdef CONFIG_TRACEPOINTS 5938 4562 if (trace_sched_ext_dump_enabled()) { 5939 - /* protected by scx_dump_state()::dump_lock */ 4563 + /* protected by scx_dump_lock */ 5940 4564 static char line_buf[SCX_EXIT_MSG_LEN]; 5941 4565 5942 4566 va_start(args, fmt); ··· 6032 4656 scx_dump_data.cpu = -1; 6033 4657 } 6034 4658 6035 - static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, 4659 + static void scx_dump_task(struct scx_sched *sch, 4660 + struct seq_buf *s, struct scx_dump_ctx *dctx, 6036 4661 struct task_struct *p, char marker) 6037 4662 { 6038 4663 static unsigned long bt[SCX_EXIT_BT_LEN]; 6039 - struct scx_sched *sch = scx_root; 4664 + struct scx_sched *task_sch = scx_task_sched(p); 4665 + const char *own_marker; 4666 + char sch_id_buf[32]; 6040 4667 char dsq_id_buf[19] = "(n/a)"; 6041 4668 unsigned long ops_state = atomic_long_read(&p->scx.ops_state); 6042 4669 unsigned int bt_len = 0; 4670 + 4671 + own_marker = task_sch == sch ? "*" : ""; 4672 + 4673 + if (task_sch->level == 0) 4674 + scnprintf(sch_id_buf, sizeof(sch_id_buf), "root"); 4675 + else 4676 + scnprintf(sch_id_buf, sizeof(sch_id_buf), "sub%d-%llu", 4677 + task_sch->level, task_sch->ops.sub_cgroup_id); 6043 4678 6044 4679 if (p->scx.dsq) 6045 4680 scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx", 6046 4681 (unsigned long long)p->scx.dsq->id); 6047 4682 6048 4683 dump_newline(s); 6049 - dump_line(s, " %c%c %s[%d] %+ldms", 4684 + dump_line(s, " %c%c %s[%d] %s%s %+ldms", 6050 4685 marker, task_state_to_char(p), p->comm, p->pid, 4686 + own_marker, sch_id_buf, 6051 4687 jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); 6052 4688 dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu", 6053 - scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, 4689 + scx_get_task_state(p) >> SCX_TASK_STATE_SHIFT, 4690 + p->scx.flags & ~SCX_TASK_STATE_MASK, 6054 4691 p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK, 6055 4692 ops_state >> SCX_OPSS_QSEQ_SHIFT); 6056 4693 dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s", ··· 6075 4686 6076 4687 if (SCX_HAS_OP(sch, dump_task)) { 6077 4688 ops_dump_init(s, " "); 6078 - SCX_CALL_OP(sch, SCX_KF_REST, dump_task, NULL, dctx, p); 4689 + SCX_CALL_OP(sch, dump_task, NULL, dctx, p); 6079 4690 ops_dump_exit(); 6080 4691 } 6081 4692 ··· 6088 4699 } 6089 4700 } 6090 4701 6091 - static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) 4702 + /* 4703 + * Dump scheduler state. If @dump_all_tasks is true, dump all tasks regardless 4704 + * of which scheduler they belong to. If false, only dump tasks owned by @sch. 4705 + * For SysRq-D dumps, @dump_all_tasks=false since all schedulers are dumped 4706 + * separately. For error dumps, @dump_all_tasks=true since only the failing 4707 + * scheduler is dumped. 
4708 + */ 4709 + static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, 4710 + size_t dump_len, bool dump_all_tasks) 6092 4711 { 6093 - static DEFINE_SPINLOCK(dump_lock); 6094 4712 static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n"; 6095 - struct scx_sched *sch = scx_root; 6096 4713 struct scx_dump_ctx dctx = { 6097 4714 .kind = ei->kind, 6098 4715 .exit_code = ei->exit_code, ··· 6108 4713 }; 6109 4714 struct seq_buf s; 6110 4715 struct scx_event_stats events; 6111 - unsigned long flags; 6112 4716 char *buf; 6113 4717 int cpu; 6114 4718 6115 - spin_lock_irqsave(&dump_lock, flags); 4719 + guard(raw_spinlock_irqsave)(&scx_dump_lock); 4720 + 4721 + if (sch->dump_disabled) 4722 + return; 6116 4723 6117 4724 seq_buf_init(&s, ei->dump, dump_len); 6118 4725 4726 + #ifdef CONFIG_EXT_SUB_SCHED 4727 + if (sch->level == 0) 4728 + dump_line(&s, "%s: root", sch->ops.name); 4729 + else 4730 + dump_line(&s, "%s: sub%d-%llu %s", 4731 + sch->ops.name, sch->level, sch->ops.sub_cgroup_id, 4732 + sch->cgrp_path); 4733 + #endif 6119 4734 if (ei->kind == SCX_EXIT_NONE) { 6120 4735 dump_line(&s, "Debug dump triggered by %s", ei->reason); 6121 4736 } else { ··· 6139 4734 6140 4735 if (SCX_HAS_OP(sch, dump)) { 6141 4736 ops_dump_init(&s, ""); 6142 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, dump, NULL, &dctx); 4737 + SCX_CALL_OP(sch, dump, NULL, &dctx); 6143 4738 ops_dump_exit(); 6144 4739 } 6145 4740 ··· 6199 4794 used = seq_buf_used(&ns); 6200 4795 if (SCX_HAS_OP(sch, dump_cpu)) { 6201 4796 ops_dump_init(&ns, " "); 6202 - SCX_CALL_OP(sch, SCX_KF_REST, dump_cpu, NULL, 4797 + SCX_CALL_OP(sch, dump_cpu, NULL, 6203 4798 &dctx, cpu, idle); 6204 4799 ops_dump_exit(); 6205 4800 } ··· 6221 4816 seq_buf_set_overflow(&s); 6222 4817 } 6223 4818 6224 - if (rq->curr->sched_class == &ext_sched_class) 6225 - scx_dump_task(&s, &dctx, rq->curr, '*'); 4819 + if (rq->curr->sched_class == &ext_sched_class && 4820 + (dump_all_tasks || scx_task_on_sched(sch, rq->curr))) 4821 + scx_dump_task(sch, &s, &dctx, rq->curr, '*'); 6226 4822 6227 4823 list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) 6228 - scx_dump_task(&s, &dctx, p, ' '); 4824 + if (dump_all_tasks || scx_task_on_sched(sch, p)) 4825 + scx_dump_task(sch, &s, &dctx, p, ' '); 6229 4826 next: 6230 4827 rq_unlock_irqrestore(rq, &rf); 6231 4828 } ··· 6242 4835 scx_dump_event(s, &events, SCX_EV_DISPATCH_KEEP_LAST); 6243 4836 scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_EXITING); 6244 4837 scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 4838 + scx_dump_event(s, &events, SCX_EV_REENQ_IMMED); 4839 + scx_dump_event(s, &events, SCX_EV_REENQ_LOCAL_REPEAT); 6245 4840 scx_dump_event(s, &events, SCX_EV_REFILL_SLICE_DFL); 6246 4841 scx_dump_event(s, &events, SCX_EV_BYPASS_DURATION); 6247 4842 scx_dump_event(s, &events, SCX_EV_BYPASS_DISPATCH); 6248 4843 scx_dump_event(s, &events, SCX_EV_BYPASS_ACTIVATE); 4844 + scx_dump_event(s, &events, SCX_EV_INSERT_NOT_OWNED); 4845 + scx_dump_event(s, &events, SCX_EV_SUB_BYPASS_DISPATCH); 6249 4846 6250 4847 if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker)) 6251 4848 memcpy(ei->dump + dump_len - sizeof(trunc_marker), 6252 4849 trunc_marker, sizeof(trunc_marker)); 6253 - 6254 - spin_unlock_irqrestore(&dump_lock, flags); 6255 4850 } 6256 4851 6257 - static void scx_error_irq_workfn(struct irq_work *irq_work) 4852 + static void scx_disable_irq_workfn(struct irq_work *irq_work) 6258 4853 { 6259 - struct scx_sched *sch = container_of(irq_work, struct scx_sched, error_irq_work); 4854 + 
struct scx_sched *sch = container_of(irq_work, struct scx_sched, disable_irq_work); 6260 4855 struct scx_exit_info *ei = sch->exit_info; 6261 4856 6262 4857 if (ei->kind >= SCX_EXIT_ERROR) 6263 - scx_dump_state(ei, sch->ops.exit_dump_len); 4858 + scx_dump_state(sch, ei, sch->ops.exit_dump_len, true); 6264 4859 6265 4860 kthread_queue_work(sch->helper, &sch->disable_work); 6266 4861 } ··· 6292 4883 ei->kind = kind; 6293 4884 ei->reason = scx_exit_reason(ei->kind); 6294 4885 6295 - irq_work_queue(&sch->error_irq_work); 4886 + irq_work_queue(&sch->disable_irq_work); 6296 4887 return true; 6297 4888 } 6298 4889 ··· 6323 4914 return 0; 6324 4915 } 6325 4916 6326 - static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops) 4917 + static void free_pnode(struct scx_sched_pnode *pnode) 4918 + { 4919 + if (!pnode) 4920 + return; 4921 + exit_dsq(&pnode->global_dsq); 4922 + kfree(pnode); 4923 + } 4924 + 4925 + static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node) 4926 + { 4927 + struct scx_sched_pnode *pnode; 4928 + 4929 + pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node); 4930 + if (!pnode) 4931 + return NULL; 4932 + 4933 + if (init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch)) { 4934 + kfree(pnode); 4935 + return NULL; 4936 + } 4937 + 4938 + return pnode; 4939 + } 4940 + 4941 + /* 4942 + * Allocate and initialize a new scx_sched. @cgrp's reference is always 4943 + * consumed whether the function succeeds or fails. 4944 + */ 4945 + static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops, 4946 + struct cgroup *cgrp, 4947 + struct scx_sched *parent) 6327 4948 { 6328 4949 struct scx_sched *sch; 6329 - int node, ret; 4950 + s32 level = parent ? parent->level + 1 : 0; 4951 + s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids; 6330 4952 6331 - sch = kzalloc_obj(*sch); 6332 - if (!sch) 6333 - return ERR_PTR(-ENOMEM); 4953 + sch = kzalloc_flex(*sch, ancestors, level + 1); 4954 + if (!sch) { 4955 + ret = -ENOMEM; 4956 + goto err_put_cgrp; 4957 + } 6334 4958 6335 4959 sch->exit_info = alloc_exit_info(ops->exit_dump_len); 6336 4960 if (!sch->exit_info) { ··· 6375 4933 if (ret < 0) 6376 4934 goto err_free_ei; 6377 4935 6378 - sch->global_dsqs = kzalloc_objs(sch->global_dsqs[0], nr_node_ids); 6379 - if (!sch->global_dsqs) { 4936 + sch->pnode = kzalloc_objs(sch->pnode[0], nr_node_ids); 4937 + if (!sch->pnode) { 6380 4938 ret = -ENOMEM; 6381 4939 goto err_free_hash; 6382 4940 } 6383 4941 6384 4942 for_each_node_state(node, N_POSSIBLE) { 6385 - struct scx_dispatch_q *dsq; 6386 - 6387 - dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node); 6388 - if (!dsq) { 4943 + sch->pnode[node] = alloc_pnode(sch, node); 4944 + if (!sch->pnode[node]) { 6389 4945 ret = -ENOMEM; 6390 - goto err_free_gdsqs; 4946 + goto err_free_pnode; 6391 4947 } 6392 - 6393 - init_dsq(dsq, SCX_DSQ_GLOBAL); 6394 - sch->global_dsqs[node] = dsq; 6395 4948 } 6396 4949 6397 - sch->pcpu = alloc_percpu(struct scx_sched_pcpu); 4950 + sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; 4951 + sch->pcpu = __alloc_percpu(struct_size_t(struct scx_sched_pcpu, 4952 + dsp_ctx.buf, sch->dsp_max_batch), 4953 + __alignof__(struct scx_sched_pcpu)); 6398 4954 if (!sch->pcpu) { 6399 4955 ret = -ENOMEM; 6400 - goto err_free_gdsqs; 4956 + goto err_free_pnode; 4957 + } 4958 + 4959 + for_each_possible_cpu(cpu) { 4960 + ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch); 4961 + if (ret) { 4962 + bypass_fail_cpu = cpu; 4963 + goto err_free_pcpu; 4964 + } 4965 + } 4966 + 4967 + 
for_each_possible_cpu(cpu) { 4968 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 4969 + 4970 + pcpu->sch = sch; 4971 + INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node); 6401 4972 } 6402 4973 6403 4974 sch->helper = kthread_run_worker(0, "sched_ext_helper"); ··· 6421 4966 6422 4967 sched_set_fifo(sch->helper->task); 6423 4968 4969 + if (parent) 4970 + memcpy(sch->ancestors, parent->ancestors, 4971 + level * sizeof(parent->ancestors[0])); 4972 + sch->ancestors[level] = sch; 4973 + sch->level = level; 4974 + 4975 + if (ops->timeout_ms) 4976 + sch->watchdog_timeout = msecs_to_jiffies(ops->timeout_ms); 4977 + else 4978 + sch->watchdog_timeout = SCX_WATCHDOG_MAX_TIMEOUT; 4979 + 4980 + sch->slice_dfl = SCX_SLICE_DFL; 6424 4981 atomic_set(&sch->exit_kind, SCX_EXIT_NONE); 6425 - init_irq_work(&sch->error_irq_work, scx_error_irq_workfn); 4982 + init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn); 6426 4983 kthread_init_work(&sch->disable_work, scx_disable_workfn); 4984 + timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0); 6427 4985 sch->ops = *ops; 6428 - ops->priv = sch; 4986 + rcu_assign_pointer(ops->priv, sch); 6429 4987 6430 4988 sch->kobj.kset = scx_kset; 6431 - ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 6432 - if (ret < 0) 6433 - goto err_stop_helper; 6434 4989 4990 + #ifdef CONFIG_EXT_SUB_SCHED 4991 + char *buf = kzalloc(PATH_MAX, GFP_KERNEL); 4992 + if (!buf) { 4993 + ret = -ENOMEM; 4994 + goto err_stop_helper; 4995 + } 4996 + cgroup_path(cgrp, buf, PATH_MAX); 4997 + sch->cgrp_path = kstrdup(buf, GFP_KERNEL); 4998 + kfree(buf); 4999 + if (!sch->cgrp_path) { 5000 + ret = -ENOMEM; 5001 + goto err_stop_helper; 5002 + } 5003 + 5004 + sch->cgrp = cgrp; 5005 + INIT_LIST_HEAD(&sch->children); 5006 + INIT_LIST_HEAD(&sch->sibling); 5007 + 5008 + if (parent) 5009 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, 5010 + &parent->sub_kset->kobj, 5011 + "sub-%llu", cgroup_id(cgrp)); 5012 + else 5013 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 5014 + 5015 + if (ret < 0) { 5016 + kobject_put(&sch->kobj); 5017 + return ERR_PTR(ret); 5018 + } 5019 + 5020 + if (ops->sub_attach) { 5021 + sch->sub_kset = kset_create_and_add("sub", NULL, &sch->kobj); 5022 + if (!sch->sub_kset) { 5023 + kobject_put(&sch->kobj); 5024 + return ERR_PTR(-ENOMEM); 5025 + } 5026 + } 5027 + #else /* CONFIG_EXT_SUB_SCHED */ 5028 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 5029 + if (ret < 0) { 5030 + kobject_put(&sch->kobj); 5031 + return ERR_PTR(ret); 5032 + } 5033 + #endif /* CONFIG_EXT_SUB_SCHED */ 6435 5034 return sch; 6436 5035 5036 + #ifdef CONFIG_EXT_SUB_SCHED 6437 5037 err_stop_helper: 6438 5038 kthread_destroy_worker(sch->helper); 5039 + #endif 6439 5040 err_free_pcpu: 5041 + for_each_possible_cpu(cpu) { 5042 + if (cpu == bypass_fail_cpu) 5043 + break; 5044 + exit_dsq(bypass_dsq(sch, cpu)); 5045 + } 6440 5046 free_percpu(sch->pcpu); 6441 - err_free_gdsqs: 5047 + err_free_pnode: 6442 5048 for_each_node_state(node, N_POSSIBLE) 6443 - kfree(sch->global_dsqs[node]); 6444 - kfree(sch->global_dsqs); 5049 + free_pnode(sch->pnode[node]); 5050 + kfree(sch->pnode); 6445 5051 err_free_hash: 6446 5052 rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL); 6447 5053 err_free_ei: 6448 5054 free_exit_info(sch->exit_info); 6449 5055 err_free_sch: 6450 5056 kfree(sch); 5057 + err_put_cgrp: 5058 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 5059 + cgroup_put(cgrp); 5060 + #endif 6451 5061 return ERR_PTR(ret); 6452 
5062 } 6453 5063 ··· 6561 5041 return -EINVAL; 6562 5042 } 6563 5043 6564 - if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT) 6565 - pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n"); 6566 - 6567 5044 if (ops->cpu_acquire || ops->cpu_release) 6568 5045 pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n"); 6569 5046 ··· 6580 5063 int ret; 6581 5064 }; 6582 5065 6583 - static void scx_enable_workfn(struct kthread_work *work) 5066 + static void scx_root_enable_workfn(struct kthread_work *work) 6584 5067 { 6585 - struct scx_enable_cmd *cmd = 6586 - container_of(work, struct scx_enable_cmd, work); 5068 + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); 6587 5069 struct sched_ext_ops *ops = cmd->ops; 5070 + struct cgroup *cgrp = root_cgroup(); 6588 5071 struct scx_sched *sch; 6589 5072 struct scx_task_iter sti; 6590 5073 struct task_struct *p; 6591 - unsigned long timeout; 6592 5074 int i, cpu, ret; 6593 5075 6594 5076 mutex_lock(&scx_enable_mutex); ··· 6601 5085 if (ret) 6602 5086 goto err_unlock; 6603 5087 6604 - sch = scx_alloc_and_add_sched(ops); 5088 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 5089 + cgroup_get(cgrp); 5090 + #endif 5091 + sch = scx_alloc_and_add_sched(ops, cgrp, NULL); 6605 5092 if (IS_ERR(sch)) { 6606 5093 ret = PTR_ERR(sch); 6607 5094 goto err_free_ksyncs; ··· 6616 5097 */ 6617 5098 WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED); 6618 5099 WARN_ON_ONCE(scx_root); 6619 - if (WARN_ON_ONCE(READ_ONCE(scx_aborting))) 6620 - WRITE_ONCE(scx_aborting, false); 6621 5100 6622 5101 atomic_long_set(&scx_nr_rejected, 0); 6623 5102 6624 - for_each_possible_cpu(cpu) 6625 - cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; 5103 + for_each_possible_cpu(cpu) { 5104 + struct rq *rq = cpu_rq(cpu); 5105 + 5106 + rq->scx.local_dsq.sched = sch; 5107 + rq->scx.cpuperf_target = SCX_CPUPERF_ONE; 5108 + } 6626 5109 6627 5110 /* 6628 5111 * Keep CPUs stable during enable so that the BPF scheduler can track ··· 6638 5117 */ 6639 5118 rcu_assign_pointer(scx_root, sch); 6640 5119 5120 + ret = scx_link_sched(sch); 5121 + if (ret) 5122 + goto err_disable; 5123 + 6641 5124 scx_idle_enable(ops); 6642 5125 6643 5126 if (sch->ops.init) { 6644 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); 5127 + ret = SCX_CALL_OP_RET(sch, init, NULL); 6645 5128 if (ret) { 6646 5129 ret = ops_sanitize_err(sch, "init", ret); 6647 5130 cpus_read_unlock(); ··· 6672 5147 if (ret) 6673 5148 goto err_disable; 6674 5149 6675 - WARN_ON_ONCE(scx_dsp_ctx); 6676 - scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; 6677 - scx_dsp_ctx = __alloc_percpu(struct_size_t(struct scx_dsp_ctx, buf, 6678 - scx_dsp_max_batch), 6679 - __alignof__(struct scx_dsp_ctx)); 6680 - if (!scx_dsp_ctx) { 6681 - ret = -ENOMEM; 6682 - goto err_disable; 6683 - } 6684 - 6685 - if (ops->timeout_ms) 6686 - timeout = msecs_to_jiffies(ops->timeout_ms); 6687 - else 6688 - timeout = SCX_WATCHDOG_MAX_TIMEOUT; 6689 - 6690 - WRITE_ONCE(scx_watchdog_timeout, timeout); 6691 - WRITE_ONCE(scx_watchdog_timestamp, jiffies); 6692 - queue_delayed_work(system_dfl_wq, &scx_watchdog_work, 6693 - READ_ONCE(scx_watchdog_timeout) / 2); 6694 - 6695 5150 /* 6696 5151 * Once __scx_enabled is set, %current can be switched to SCX anytime. 6697 5152 * This can lead to stalls as some BPF schedulers (e.g. userspace 6698 5153 * scheduling) may not function correctly before all tasks are switched. 
6699 5154 * Init in bypass mode to guarantee forward progress. 6700 5155 */ 6701 - scx_bypass(true); 6702 - scx_bypassed_for_enable = true; 5156 + scx_bypass(sch, true); 6703 5157 6704 5158 for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++) 6705 5159 if (((void (**)(void))ops)[i]) ··· 6710 5206 * never sees uninitialized tasks. 6711 5207 */ 6712 5208 scx_cgroup_lock(); 5209 + set_cgroup_sched(sch_cgroup(sch), sch); 6713 5210 ret = scx_cgroup_init(sch); 6714 5211 if (ret) 6715 5212 goto err_disable_unlock_all; 6716 5213 6717 - scx_task_iter_start(&sti); 5214 + scx_task_iter_start(&sti, NULL); 6718 5215 while ((p = scx_task_iter_next_locked(&sti))) { 6719 5216 /* 6720 5217 * @p may already be dead, have lost all its usages counts and ··· 6727 5222 6728 5223 scx_task_iter_unlock(&sti); 6729 5224 6730 - ret = scx_init_task(p, task_group(p), false); 5225 + ret = scx_init_task(sch, p, false); 6731 5226 if (ret) { 6732 5227 put_task_struct(p); 6733 5228 scx_task_iter_stop(&sti); ··· 6736 5231 goto err_disable_unlock_all; 6737 5232 } 6738 5233 5234 + scx_set_task_sched(p, sch); 6739 5235 scx_set_task_state(p, SCX_TASK_READY); 6740 5236 6741 5237 put_task_struct(p); ··· 6758 5252 * scx_tasks_lock. 6759 5253 */ 6760 5254 percpu_down_write(&scx_fork_rwsem); 6761 - scx_task_iter_start(&sti); 5255 + scx_task_iter_start(&sti, NULL); 6762 5256 while ((p = scx_task_iter_next_locked(&sti))) { 6763 5257 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; 6764 5258 const struct sched_class *old_class = p->sched_class; ··· 6771 5265 queue_flags |= DEQUEUE_CLASS; 6772 5266 6773 5267 scoped_guard (sched_change, p, queue_flags) { 6774 - p->scx.slice = READ_ONCE(scx_slice_dfl); 5268 + p->scx.slice = READ_ONCE(sch->slice_dfl); 6775 5269 p->sched_class = new_class; 6776 5270 } 6777 5271 } 6778 5272 scx_task_iter_stop(&sti); 6779 5273 percpu_up_write(&scx_fork_rwsem); 6780 5274 6781 - scx_bypassed_for_enable = false; 6782 - scx_bypass(false); 5275 + scx_bypass(sch, false); 6783 5276 6784 5277 if (!scx_tryset_enable_state(SCX_ENABLED, SCX_ENABLING)) { 6785 5278 WARN_ON_ONCE(atomic_read(&sch->exit_kind) == SCX_EXIT_NONE); ··· 6820 5315 * Flush scx_disable_work to ensure that error is reported before init 6821 5316 * completion. sch's base reference will be put by bpf_scx_unreg(). 6822 5317 */ 6823 - scx_error(sch, "scx_enable() failed (%d)", ret); 5318 + scx_error(sch, "scx_root_enable() failed (%d)", ret); 6824 5319 kthread_flush_work(&sch->disable_work); 6825 5320 cmd->ret = 0; 6826 5321 } 6827 5322 6828 - static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) 5323 + #ifdef CONFIG_EXT_SUB_SCHED 5324 + /* verify that a scheduler can be attached to @cgrp and return the parent */ 5325 + static struct scx_sched *find_parent_sched(struct cgroup *cgrp) 5326 + { 5327 + struct scx_sched *parent = cgrp->scx_sched; 5328 + struct scx_sched *pos; 5329 + 5330 + lockdep_assert_held(&scx_sched_lock); 5331 + 5332 + /* can't attach twice to the same cgroup */ 5333 + if (parent->cgrp == cgrp) 5334 + return ERR_PTR(-EBUSY); 5335 + 5336 + /* does $parent allow sub-scheds? 
*/ 5337 + if (!parent->ops.sub_attach) 5338 + return ERR_PTR(-EOPNOTSUPP); 5339 + 5340 + /* can't insert between $parent and its exiting children */ 5341 + list_for_each_entry(pos, &parent->children, sibling) 5342 + if (cgroup_is_descendant(pos->cgrp, cgrp)) 5343 + return ERR_PTR(-EBUSY); 5344 + 5345 + return parent; 5346 + } 5347 + 5348 + static bool assert_task_ready_or_enabled(struct task_struct *p) 5349 + { 5350 + u32 state = scx_get_task_state(p); 5351 + 5352 + switch (state) { 5353 + case SCX_TASK_READY: 5354 + case SCX_TASK_ENABLED: 5355 + return true; 5356 + default: 5357 + WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched", 5358 + state, p->comm, p->pid); 5359 + return false; 5360 + } 5361 + } 5362 + 5363 + static void scx_sub_enable_workfn(struct kthread_work *work) 5364 + { 5365 + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); 5366 + struct sched_ext_ops *ops = cmd->ops; 5367 + struct cgroup *cgrp; 5368 + struct scx_sched *parent, *sch; 5369 + struct scx_task_iter sti; 5370 + struct task_struct *p; 5371 + s32 i, ret; 5372 + 5373 + mutex_lock(&scx_enable_mutex); 5374 + 5375 + if (!scx_enabled()) { 5376 + ret = -ENODEV; 5377 + goto out_unlock; 5378 + } 5379 + 5380 + cgrp = cgroup_get_from_id(ops->sub_cgroup_id); 5381 + if (IS_ERR(cgrp)) { 5382 + ret = PTR_ERR(cgrp); 5383 + goto out_unlock; 5384 + } 5385 + 5386 + raw_spin_lock_irq(&scx_sched_lock); 5387 + parent = find_parent_sched(cgrp); 5388 + if (IS_ERR(parent)) { 5389 + raw_spin_unlock_irq(&scx_sched_lock); 5390 + ret = PTR_ERR(parent); 5391 + goto out_put_cgrp; 5392 + } 5393 + kobject_get(&parent->kobj); 5394 + raw_spin_unlock_irq(&scx_sched_lock); 5395 + 5396 + /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */ 5397 + sch = scx_alloc_and_add_sched(ops, cgrp, parent); 5398 + kobject_put(&parent->kobj); 5399 + if (IS_ERR(sch)) { 5400 + ret = PTR_ERR(sch); 5401 + goto out_unlock; 5402 + } 5403 + 5404 + ret = scx_link_sched(sch); 5405 + if (ret) 5406 + goto err_disable; 5407 + 5408 + if (sch->level >= SCX_SUB_MAX_DEPTH) { 5409 + scx_error(sch, "max nesting depth %d violated", 5410 + SCX_SUB_MAX_DEPTH); 5411 + goto err_disable; 5412 + } 5413 + 5414 + if (sch->ops.init) { 5415 + ret = SCX_CALL_OP_RET(sch, init, NULL); 5416 + if (ret) { 5417 + ret = ops_sanitize_err(sch, "init", ret); 5418 + scx_error(sch, "ops.init() failed (%d)", ret); 5419 + goto err_disable; 5420 + } 5421 + sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; 5422 + } 5423 + 5424 + if (validate_ops(sch, ops)) 5425 + goto err_disable; 5426 + 5427 + struct scx_sub_attach_args sub_attach_args = { 5428 + .ops = &sch->ops, 5429 + .cgroup_path = sch->cgrp_path, 5430 + }; 5431 + 5432 + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, 5433 + &sub_attach_args); 5434 + if (ret) { 5435 + ret = ops_sanitize_err(sch, "sub_attach", ret); 5436 + scx_error(sch, "parent rejected (%d)", ret); 5437 + goto err_disable; 5438 + } 5439 + sch->sub_attached = true; 5440 + 5441 + scx_bypass(sch, true); 5442 + 5443 + for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) 5444 + if (((void (**)(void))ops)[i]) 5445 + set_bit(i, sch->has_op); 5446 + 5447 + percpu_down_write(&scx_fork_rwsem); 5448 + scx_cgroup_lock(); 5449 + 5450 + /* 5451 + * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see 5452 + * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down. 
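For illustration, the attach handshake above means a parent scheduler only needs to implement ops.sub_attach()/ops.sub_detach() to admit or reject children. A minimal sketch, assuming the usual tools/sched_ext BPF_STRUCT_OPS macro and the argument struct names exactly as they appear in this diff; the accept-everything policy is invented:

.. code-block:: c

   /* Parent-side hooks: a child attaching to a descendant cgroup is
    * admitted only if sub_attach() returns 0.
    */
   s32 BPF_STRUCT_OPS(parent_sub_attach, struct scx_sub_attach_args *args)
   {
           /*
            * Illustrative policy: accept any child. A real parent could
            * inspect args->cgroup_path or args->ops before agreeing.
            */
           return 0;
   }

   void BPF_STRUCT_OPS(parent_sub_detach, struct scx_sub_detach_args *args)
   {
           /* nothing to tear down in this sketch */
   }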
5453 + */ 5454 + set_cgroup_sched(sch_cgroup(sch), sch); 5455 + if (!(cgrp->self.flags & CSS_ONLINE)) { 5456 + scx_error(sch, "cgroup is not online"); 5457 + goto err_unlock_and_disable; 5458 + } 5459 + 5460 + /* 5461 + * Initialize tasks for the new child $sch without exiting them for 5462 + * $parent so that the tasks can always be reverted back to $parent 5463 + * sched on child init failure. 5464 + */ 5465 + WARN_ON_ONCE(scx_enabling_sub_sched); 5466 + scx_enabling_sub_sched = sch; 5467 + 5468 + scx_task_iter_start(&sti, sch->cgrp); 5469 + while ((p = scx_task_iter_next_locked(&sti))) { 5470 + struct rq *rq; 5471 + struct rq_flags rf; 5472 + 5473 + /* 5474 + * Task iteration may visit the same task twice when racing 5475 + * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which 5476 + * finished __scx_init_task() and skip if set. 5477 + * 5478 + * A task may exit and get freed between __scx_init_task() 5479 + * completion and scx_enable_task(). In such cases, 5480 + * scx_disable_and_exit_task() must exit the task for both the 5481 + * parent and child scheds. 5482 + */ 5483 + if (p->scx.flags & SCX_TASK_SUB_INIT) 5484 + continue; 5485 + 5486 + /* see scx_root_enable() */ 5487 + if (!tryget_task_struct(p)) 5488 + continue; 5489 + 5490 + if (!assert_task_ready_or_enabled(p)) { 5491 + ret = -EINVAL; 5492 + goto abort; 5493 + } 5494 + 5495 + scx_task_iter_unlock(&sti); 5496 + 5497 + /* 5498 + * As $p is still on $parent, it can't be transitioned to INIT. 5499 + * Let's worry about task state later. Use __scx_init_task(). 5500 + */ 5501 + ret = __scx_init_task(sch, p, false); 5502 + if (ret) 5503 + goto abort; 5504 + 5505 + rq = task_rq_lock(p, &rf); 5506 + p->scx.flags |= SCX_TASK_SUB_INIT; 5507 + task_rq_unlock(rq, p, &rf); 5508 + 5509 + put_task_struct(p); 5510 + } 5511 + scx_task_iter_stop(&sti); 5512 + 5513 + /* 5514 + * All tasks are prepped. Disable/exit tasks for $parent and enable for 5515 + * the new @sch. 5516 + */ 5517 + scx_task_iter_start(&sti, sch->cgrp); 5518 + while ((p = scx_task_iter_next_locked(&sti))) { 5519 + /* 5520 + * Use clearing of %SCX_TASK_SUB_INIT to detect and skip 5521 + * duplicate iterations. 5522 + */ 5523 + if (!(p->scx.flags & SCX_TASK_SUB_INIT)) 5524 + continue; 5525 + 5526 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 5527 + /* 5528 + * $p must be either READY or ENABLED. If ENABLED, 5529 + * __scx_disabled_and_exit_task() first disables and 5530 + * makes it READY. However, after exiting $p, it will 5531 + * leave $p as READY. 5532 + */ 5533 + assert_task_ready_or_enabled(p); 5534 + __scx_disable_and_exit_task(parent, p); 5535 + 5536 + /* 5537 + * $p is now only initialized for @sch and READY, which 5538 + * is what we want. Assign it to @sch and enable. 
5539 + */ 5540 + rcu_assign_pointer(p->scx.sched, sch); 5541 + scx_enable_task(sch, p); 5542 + 5543 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 5544 + } 5545 + } 5546 + scx_task_iter_stop(&sti); 5547 + 5548 + scx_enabling_sub_sched = NULL; 5549 + 5550 + scx_cgroup_unlock(); 5551 + percpu_up_write(&scx_fork_rwsem); 5552 + 5553 + scx_bypass(sch, false); 5554 + 5555 + pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name); 5556 + kobject_uevent(&sch->kobj, KOBJ_ADD); 5557 + ret = 0; 5558 + goto out_unlock; 5559 + 5560 + out_put_cgrp: 5561 + cgroup_put(cgrp); 5562 + out_unlock: 5563 + mutex_unlock(&scx_enable_mutex); 5564 + cmd->ret = ret; 5565 + return; 5566 + 5567 + abort: 5568 + put_task_struct(p); 5569 + scx_task_iter_stop(&sti); 5570 + scx_enabling_sub_sched = NULL; 5571 + 5572 + scx_task_iter_start(&sti, sch->cgrp); 5573 + while ((p = scx_task_iter_next_locked(&sti))) { 5574 + if (p->scx.flags & SCX_TASK_SUB_INIT) { 5575 + __scx_disable_and_exit_task(sch, p); 5576 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 5577 + } 5578 + } 5579 + scx_task_iter_stop(&sti); 5580 + err_unlock_and_disable: 5581 + /* we'll soon enter disable path, keep bypass on */ 5582 + scx_cgroup_unlock(); 5583 + percpu_up_write(&scx_fork_rwsem); 5584 + err_disable: 5585 + mutex_unlock(&scx_enable_mutex); 5586 + kthread_flush_work(&sch->disable_work); 5587 + cmd->ret = 0; 5588 + } 5589 + 5590 + static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb, 5591 + unsigned long action, void *data) 5592 + { 5593 + struct cgroup *cgrp = data; 5594 + struct cgroup *parent = cgroup_parent(cgrp); 5595 + 5596 + if (!cgroup_on_dfl(cgrp)) 5597 + return NOTIFY_OK; 5598 + 5599 + switch (action) { 5600 + case CGROUP_LIFETIME_ONLINE: 5601 + /* inherit ->scx_sched from $parent */ 5602 + if (parent) 5603 + rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched); 5604 + break; 5605 + case CGROUP_LIFETIME_OFFLINE: 5606 + /* if there is a sched attached, shoot it down */ 5607 + if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp) 5608 + scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN, 5609 + SCX_ECODE_RSN_CGROUP_OFFLINE, 5610 + "cgroup %llu going offline", cgroup_id(cgrp)); 5611 + break; 5612 + } 5613 + 5614 + return NOTIFY_OK; 5615 + } 5616 + 5617 + static struct notifier_block scx_cgroup_lifetime_nb = { 5618 + .notifier_call = scx_cgroup_lifetime_notify, 5619 + }; 5620 + 5621 + static s32 __init scx_cgroup_lifetime_notifier_init(void) 5622 + { 5623 + return blocking_notifier_chain_register(&cgroup_lifetime_notifier, 5624 + &scx_cgroup_lifetime_nb); 5625 + } 5626 + core_initcall(scx_cgroup_lifetime_notifier_init); 5627 + #endif /* CONFIG_EXT_SUB_SCHED */ 5628 + 5629 + static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) 6829 5630 { 6830 5631 static struct kthread_worker *helper; 6831 5632 static DEFINE_MUTEX(helper_mutex); ··· 7158 5347 mutex_unlock(&helper_mutex); 7159 5348 } 7160 5349 7161 - kthread_init_work(&cmd.work, scx_enable_workfn); 5350 + #ifdef CONFIG_EXT_SUB_SCHED 5351 + if (ops->sub_cgroup_id > 1) 5352 + kthread_init_work(&cmd.work, scx_sub_enable_workfn); 5353 + else 5354 + #endif /* CONFIG_EXT_SUB_SCHED */ 5355 + kthread_init_work(&cmd.work, scx_root_enable_workfn); 7162 5356 cmd.ops = ops; 7163 5357 7164 5358 kthread_queue_work(READ_ONCE(helper), &cmd.work); ··· 7204 5388 7205 5389 t = btf_type_by_id(reg->btf, reg->btf_id); 7206 5390 if (t == task_struct_type) { 7207 - if (off >= offsetof(struct task_struct, scx.slice) && 7208 - off + size <= offsetofend(struct task_struct, scx.slice)) 5391 + /* 
5392 + * COMPAT: Will be removed in v6.23. 5393 + */ 5394 + if ((off >= offsetof(struct task_struct, scx.slice) && 5395 + off + size <= offsetofend(struct task_struct, scx.slice)) || 5396 + (off >= offsetof(struct task_struct, scx.dsq_vtime) && 5397 + off + size <= offsetofend(struct task_struct, scx.dsq_vtime))) { 5398 + pr_warn("sched_ext: Writing directly to p->scx.slice/dsq_vtime is deprecated, use scx_bpf_task_set_slice/dsq_vtime()"); 7209 5399 return SCALAR_VALUE; 7210 - if (off >= offsetof(struct task_struct, scx.dsq_vtime) && 7211 - off + size <= offsetofend(struct task_struct, scx.dsq_vtime)) 7212 - return SCALAR_VALUE; 5400 + } 5401 + 7213 5402 if (off >= offsetof(struct task_struct, scx.disallow) && 7214 5403 off + size <= offsetofend(struct task_struct, scx.disallow)) 7215 5404 return SCALAR_VALUE; ··· 7270 5449 case offsetof(struct sched_ext_ops, hotplug_seq): 7271 5450 ops->hotplug_seq = *(u64 *)(udata + moff); 7272 5451 return 1; 5452 + #ifdef CONFIG_EXT_SUB_SCHED 5453 + case offsetof(struct sched_ext_ops, sub_cgroup_id): 5454 + ops->sub_cgroup_id = *(u64 *)(udata + moff); 5455 + return 1; 5456 + #endif /* CONFIG_EXT_SUB_SCHED */ 7273 5457 } 7274 5458 7275 5459 return 0; 7276 5460 } 5461 + 5462 + #ifdef CONFIG_EXT_SUB_SCHED 5463 + static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog) 5464 + { 5465 + struct scx_sched *sch; 5466 + 5467 + guard(rcu)(); 5468 + sch = scx_prog_sched(prog->aux); 5469 + if (unlikely(!sch)) 5470 + return; 5471 + 5472 + scx_error(sch, "dispatch recursion detected"); 5473 + } 5474 + #endif /* CONFIG_EXT_SUB_SCHED */ 7277 5475 7278 5476 static int bpf_scx_check_member(const struct btf_type *t, 7279 5477 const struct btf_member *member, ··· 7311 5471 case offsetof(struct sched_ext_ops, cpu_offline): 7312 5472 case offsetof(struct sched_ext_ops, init): 7313 5473 case offsetof(struct sched_ext_ops, exit): 5474 + case offsetof(struct sched_ext_ops, sub_attach): 5475 + case offsetof(struct sched_ext_ops, sub_detach): 7314 5476 break; 7315 5477 default: 7316 5478 if (prog->sleepable) 7317 5479 return -EINVAL; 7318 5480 } 5481 + 5482 + #ifdef CONFIG_EXT_SUB_SCHED 5483 + /* 5484 + * Enable private stack for operations that can nest along the 5485 + * hierarchy. 5486 + * 5487 + * XXX - Ideally, we should only do this for scheds that allow 5488 + * sub-scheds and sub-scheds themselves but I don't know how to access 5489 + * struct_ops from here. 
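Correspondingly, bpf_scx_init_member() above copies ->sub_cgroup_id from the loading program, and scx_enable() (further down) routes any value above 1 to the sub-scheduler enable path. As a sketch, a child scheduler therefore only declares the cgroup it wants to own; the scheduler name, callbacks, and the assumption that the loader fills in the cgroup ID before attaching are all illustrative:

.. code-block:: c

   SEC(".struct_ops.link")
   struct sched_ext_ops child_ops = {
           .select_cpu     = (void *)child_select_cpu,
           .enqueue        = (void *)child_enqueue,
           .dispatch       = (void *)child_dispatch,
           /*
            * cgroup ID of the subtree to schedule. 0 or 1 (the root
            * cgroup) keeps the old behavior of loading as the root
            * scheduler.
            */
           .sub_cgroup_id  = 0,
           .name           = "child",
   };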
5490 + */ 5491 + switch (moff) { 5492 + case offsetof(struct sched_ext_ops, dispatch): 5493 + prog->aux->priv_stack_requested = true; 5494 + prog->aux->recursion_detected = scx_pstack_recursion_on_dispatch; 5495 + } 5496 + #endif /* CONFIG_EXT_SUB_SCHED */ 7319 5497 7320 5498 return 0; 7321 5499 } ··· 7346 5488 static void bpf_scx_unreg(void *kdata, struct bpf_link *link) 7347 5489 { 7348 5490 struct sched_ext_ops *ops = kdata; 7349 - struct scx_sched *sch = ops->priv; 5491 + struct scx_sched *sch = rcu_dereference_protected(ops->priv, true); 7350 5492 7351 - scx_disable(SCX_EXIT_UNREG); 5493 + scx_disable(sch, SCX_EXIT_UNREG); 7352 5494 kthread_flush_work(&sch->disable_work); 5495 + RCU_INIT_POINTER(ops->priv, NULL); 7353 5496 kobject_put(&sch->kobj); 7354 5497 } 7355 5498 ··· 7407 5548 static void sched_ext_ops__cgroup_set_weight(struct cgroup *cgrp, u32 weight) {} 7408 5549 static void sched_ext_ops__cgroup_set_bandwidth(struct cgroup *cgrp, u64 period_us, u64 quota_us, u64 burst_us) {} 7409 5550 static void sched_ext_ops__cgroup_set_idle(struct cgroup *cgrp, bool idle) {} 7410 - #endif 5551 + #endif /* CONFIG_EXT_GROUP_SCHED */ 5552 + static s32 sched_ext_ops__sub_attach(struct scx_sub_attach_args *args) { return -EINVAL; } 5553 + static void sched_ext_ops__sub_detach(struct scx_sub_detach_args *args) {} 7411 5554 static void sched_ext_ops__cpu_online(s32 cpu) {} 7412 5555 static void sched_ext_ops__cpu_offline(s32 cpu) {} 7413 5556 static s32 sched_ext_ops__init(void) { return -EINVAL; } ··· 7449 5588 .cgroup_set_bandwidth = sched_ext_ops__cgroup_set_bandwidth, 7450 5589 .cgroup_set_idle = sched_ext_ops__cgroup_set_idle, 7451 5590 #endif 5591 + .sub_attach = sched_ext_ops__sub_attach, 5592 + .sub_detach = sched_ext_ops__sub_detach, 7452 5593 .cpu_online = sched_ext_ops__cpu_online, 7453 5594 .cpu_offline = sched_ext_ops__cpu_offline, 7454 5595 .init = sched_ext_ops__init, ··· 7481 5618 7482 5619 static void sysrq_handle_sched_ext_reset(u8 key) 7483 5620 { 7484 - scx_disable(SCX_EXIT_SYSRQ); 5621 + struct scx_sched *sch; 5622 + 5623 + rcu_read_lock(); 5624 + sch = rcu_dereference(scx_root); 5625 + if (likely(sch)) 5626 + scx_disable(sch, SCX_EXIT_SYSRQ); 5627 + else 5628 + pr_info("sched_ext: BPF schedulers not loaded\n"); 5629 + rcu_read_unlock(); 7485 5630 } 7486 5631 7487 5632 static const struct sysrq_key_op sysrq_sched_ext_reset_op = { ··· 7502 5631 static void sysrq_handle_sched_ext_dump(u8 key) 7503 5632 { 7504 5633 struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" }; 5634 + struct scx_sched *sch; 7505 5635 7506 - if (scx_enabled()) 7507 - scx_dump_state(&ei, 0); 5636 + list_for_each_entry_rcu(sch, &scx_sched_all, all) 5637 + scx_dump_state(sch, &ei, 0, false); 7508 5638 } 7509 5639 7510 5640 static const struct sysrq_key_op sysrq_sched_ext_dump_op = { ··· 7600 5728 unsigned long *ksyncs; 7601 5729 s32 cpu; 7602 5730 7603 - if (unlikely(!ksyncs_pcpu)) { 7604 - pr_warn_once("kick_cpus_irq_workfn() called with NULL scx_kick_syncs"); 5731 + /* can race with free_kick_syncs() during scheduler disable */ 5732 + if (unlikely(!ksyncs_pcpu)) 7605 5733 return; 7606 - } 7607 5734 7608 5735 ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs; 7609 5736 ··· 7643 5772 */ 7644 5773 void print_scx_info(const char *log_lvl, struct task_struct *p) 7645 5774 { 7646 - struct scx_sched *sch = scx_root; 5775 + struct scx_sched *sch; 7647 5776 enum scx_enable_state state = scx_enable_state(); 7648 5777 const char *all = READ_ONCE(scx_switching_all) ? 
"+all" : ""; 7649 5778 char runnable_at_buf[22] = "?"; 7650 5779 struct sched_class *class; 7651 5780 unsigned long runnable_at; 7652 5781 7653 - if (state == SCX_DISABLED) 5782 + guard(rcu)(); 5783 + 5784 + sch = scx_task_sched_rcu(p); 5785 + 5786 + if (!sch) 7654 5787 return; 7655 5788 7656 5789 /* ··· 7681 5806 7682 5807 static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr) 7683 5808 { 5809 + struct scx_sched *sch; 5810 + 5811 + guard(rcu)(); 5812 + 5813 + sch = rcu_dereference(scx_root); 5814 + if (!sch) 5815 + return NOTIFY_OK; 5816 + 7684 5817 /* 7685 5818 * SCX schedulers often have userspace components which are sometimes 7686 5819 * involved in critial scheduling paths. PM operations involve freezing ··· 7699 5816 case PM_HIBERNATION_PREPARE: 7700 5817 case PM_SUSPEND_PREPARE: 7701 5818 case PM_RESTORE_PREPARE: 7702 - scx_bypass(true); 5819 + scx_bypass(sch, true); 7703 5820 break; 7704 5821 case PM_POST_HIBERNATION: 7705 5822 case PM_POST_SUSPEND: 7706 5823 case PM_POST_RESTORE: 7707 - scx_bypass(false); 5824 + scx_bypass(sch, false); 7708 5825 break; 7709 5826 } 7710 5827 ··· 7733 5850 struct rq *rq = cpu_rq(cpu); 7734 5851 int n = cpu_to_node(cpu); 7735 5852 7736 - init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); 7737 - init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS); 5853 + /* local_dsq's sch will be set during scx_root_enable() */ 5854 + BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL)); 5855 + 7738 5856 INIT_LIST_HEAD(&rq->scx.runnable_list); 7739 5857 INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals); 7740 5858 ··· 7744 5860 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n)); 7745 5861 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n)); 7746 5862 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_sync, GFP_KERNEL, n)); 5863 + raw_spin_lock_init(&rq->scx.deferred_reenq_lock); 5864 + INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals); 5865 + INIT_LIST_HEAD(&rq->scx.deferred_reenq_users); 7747 5866 rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn); 7748 5867 rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn); 7749 5868 ··· 7757 5870 register_sysrq_key('S', &sysrq_sched_ext_reset_op); 7758 5871 register_sysrq_key('D', &sysrq_sched_ext_dump_op); 7759 5872 INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); 5873 + 5874 + #ifdef CONFIG_EXT_SUB_SCHED 5875 + BUG_ON(rhashtable_init(&scx_sched_hash, &scx_sched_hash_params)); 5876 + #endif /* CONFIG_EXT_SUB_SCHED */ 7760 5877 } 7761 5878 7762 5879 7763 5880 /******************************************************************************** 7764 5881 * Helpers that can be called from the BPF scheduler. 
7765 5882 */ 7766 - static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, 7767 - u64 enq_flags) 5883 + static bool scx_vet_enq_flags(struct scx_sched *sch, u64 dsq_id, u64 *enq_flags) 7768 5884 { 7769 - if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) 7770 - return false; 5885 + bool is_local = dsq_id == SCX_DSQ_LOCAL || 5886 + (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON; 7771 5887 5888 + if (*enq_flags & SCX_ENQ_IMMED) { 5889 + if (unlikely(!is_local)) { 5890 + scx_error(sch, "SCX_ENQ_IMMED on a non-local DSQ 0x%llx", dsq_id); 5891 + return false; 5892 + } 5893 + } else if ((sch->ops.flags & SCX_OPS_ALWAYS_ENQ_IMMED) && is_local) { 5894 + *enq_flags |= SCX_ENQ_IMMED; 5895 + } 5896 + 5897 + return true; 5898 + } 5899 + 5900 + static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, 5901 + u64 dsq_id, u64 *enq_flags) 5902 + { 7772 5903 lockdep_assert_irqs_disabled(); 7773 5904 7774 5905 if (unlikely(!p)) { ··· 7794 5889 return false; 7795 5890 } 7796 5891 7797 - if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) { 7798 - scx_error(sch, "invalid enq_flags 0x%llx", enq_flags); 5892 + if (unlikely(*enq_flags & __SCX_ENQ_INTERNAL_MASK)) { 5893 + scx_error(sch, "invalid enq_flags 0x%llx", *enq_flags); 7799 5894 return false; 7800 5895 } 5896 + 5897 + /* see SCX_EV_INSERT_NOT_OWNED definition */ 5898 + if (unlikely(!scx_task_on_sched(sch, p))) { 5899 + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); 5900 + return false; 5901 + } 5902 + 5903 + if (!scx_vet_enq_flags(sch, dsq_id, enq_flags)) 5904 + return false; 7801 5905 7802 5906 return true; 7803 5907 } ··· 7814 5900 static void scx_dsq_insert_commit(struct scx_sched *sch, struct task_struct *p, 7815 5901 u64 dsq_id, u64 enq_flags) 7816 5902 { 7817 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 5903 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 7818 5904 struct task_struct *ddsp_task; 7819 5905 7820 5906 ddsp_task = __this_cpu_read(direct_dispatch_task); ··· 7823 5909 return; 7824 5910 } 7825 5911 7826 - if (unlikely(dspc->cursor >= scx_dsp_max_batch)) { 5912 + if (unlikely(dspc->cursor >= sch->dsp_max_batch)) { 7827 5913 scx_error(sch, "dispatch buffer overflow"); 7828 5914 return; 7829 5915 } ··· 7844 5930 * @dsq_id: DSQ to insert into 7845 5931 * @slice: duration @p can run for in nsecs, 0 to keep the current value 7846 5932 * @enq_flags: SCX_ENQ_* 5933 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7847 5934 * 7848 5935 * Insert @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe to 7849 5936 * call this function spuriously. Can be called from ops.enqueue(), ··· 7879 5964 * to check the return value. 7880 5965 */ 7881 5966 __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id, 7882 - u64 slice, u64 enq_flags) 5967 + u64 slice, u64 enq_flags, 5968 + const struct bpf_prog_aux *aux) 7883 5969 { 7884 5970 struct scx_sched *sch; 7885 5971 7886 5972 guard(rcu)(); 7887 - sch = rcu_dereference(scx_root); 5973 + sch = scx_prog_sched(aux); 7888 5974 if (unlikely(!sch)) 7889 5975 return false; 7890 5976 7891 - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) 5977 + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) 7892 5978 return false; 7893 5979 7894 5980 if (slice) ··· 7906 5990 * COMPAT: Will be removed in v6.23 along with the ___v2 suffix. 
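As a usage sketch of the flag handling above: %SCX_ENQ_IMMED is only accepted for local DSQs, and the ___v2 insert kfunc reports whether the insertion was taken (a task owned by another scheduler is now counted as %SCX_EV_INSERT_NOT_OWNED rather than raising an error). BPF programs never pass the hidden @aux argument; the verifier supplies it. The helper macros and the do-nothing fallback below are assumptions:

.. code-block:: c

   void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
   {
           s32 cpu = scx_bpf_task_cpu(p);

           /*
            * With SCX_ENQ_IMMED the task either starts running on @cpu
            * right away or is sent back through ops.enqueue(), bounding
            * how long it can sit unnoticed in the local DSQ.
            */
           if (scx_bpf_dsq_insert___v2(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                       enq_flags | SCX_ENQ_IMMED))
                   return;

           /* refused: @p isn't owned by this scheduler or the flags were bad */
   }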
7907 5991 */ 7908 5992 __bpf_kfunc void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, 7909 - u64 slice, u64 enq_flags) 5993 + u64 slice, u64 enq_flags, 5994 + const struct bpf_prog_aux *aux) 7910 5995 { 7911 - scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags); 5996 + scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags, aux); 7912 5997 } 7913 5998 7914 5999 static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p, 7915 6000 u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) 7916 6001 { 7917 - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) 6002 + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) 7918 6003 return false; 7919 6004 7920 6005 if (slice) ··· 7946 6029 * @args->slice: duration @p can run for in nsecs, 0 to keep the current value 7947 6030 * @args->vtime: @p's ordering inside the vtime-sorted queue of the target DSQ 7948 6031 * @args->enq_flags: SCX_ENQ_* 6032 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7949 6033 * 7950 6034 * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument 7951 6035 * limit. BPF programs should use scx_bpf_dsq_insert_vtime() which is provided ··· 7971 6053 */ 7972 6054 __bpf_kfunc bool 7973 6055 __scx_bpf_dsq_insert_vtime(struct task_struct *p, 7974 - struct scx_bpf_dsq_insert_vtime_args *args) 6056 + struct scx_bpf_dsq_insert_vtime_args *args, 6057 + const struct bpf_prog_aux *aux) 7975 6058 { 7976 6059 struct scx_sched *sch; 7977 6060 7978 6061 guard(rcu)(); 7979 6062 7980 - sch = rcu_dereference(scx_root); 6063 + sch = scx_prog_sched(aux); 7981 6064 if (unlikely(!sch)) 7982 6065 return false; 7983 6066 ··· 8000 6081 if (unlikely(!sch)) 8001 6082 return; 8002 6083 6084 + #ifdef CONFIG_EXT_SUB_SCHED 6085 + /* 6086 + * Disallow if any sub-scheds are attached. There is no way to tell 6087 + * which scheduler called us, just error out @p's scheduler. 
6088 + */ 6089 + if (unlikely(!list_empty(&sch->children))) { 6090 + scx_error(scx_task_sched(p), "__scx_bpf_dsq_insert_vtime() must be used"); 6091 + return; 6092 + } 6093 + #endif 6094 + 8003 6095 scx_dsq_insert_vtime(sch, p, dsq_id, slice, vtime, enq_flags); 8004 6096 } 8005 6097 8006 6098 __bpf_kfunc_end_defs(); 8007 6099 8008 6100 BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) 8009 - BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU) 8010 - BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_RCU) 8011 - BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_RCU) 6101 + BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_IMPLICIT_ARGS | KF_RCU) 6102 + BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_IMPLICIT_ARGS | KF_RCU) 6103 + BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_IMPLICIT_ARGS | KF_RCU) 8012 6104 BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU) 8013 6105 BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) 8014 6106 8015 6107 static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { 8016 6108 .owner = THIS_MODULE, 8017 6109 .set = &scx_kfunc_ids_enqueue_dispatch, 6110 + .filter = scx_kfunc_context_filter, 8018 6111 }; 8019 6112 8020 6113 static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, 8021 6114 struct task_struct *p, u64 dsq_id, u64 enq_flags) 8022 6115 { 8023 - struct scx_sched *sch = scx_root; 8024 6116 struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; 6117 + struct scx_sched *sch = src_dsq->sched; 8025 6118 struct rq *this_rq, *src_rq, *locked_rq; 8026 6119 bool dispatched = false; 8027 6120 bool in_balance; 8028 6121 unsigned long flags; 8029 6122 8030 - if (!scx_kf_allowed_if_unlocked() && 8031 - !scx_kf_allowed(sch, SCX_KF_DISPATCH)) 6123 + if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) 8032 6124 return false; 8033 6125 8034 6126 /* 8035 6127 * If the BPF scheduler keeps calling this function repeatedly, it can 8036 6128 * cause similar live-lock conditions as consume_dispatch_q(). 8037 6129 */ 8038 - if (unlikely(READ_ONCE(scx_aborting))) 6130 + if (unlikely(READ_ONCE(sch->aborting))) 8039 6131 return false; 6132 + 6133 + if (unlikely(!scx_task_on_sched(sch, p))) { 6134 + scx_error(sch, "scx_bpf_dsq_move[_vtime]() on %s[%d] but the task belongs to a different scheduler", 6135 + p->comm, p->pid); 6136 + return false; 6137 + } 8040 6138 8041 6139 /* 8042 6140 * Can be called from either ops.dispatch() locking this_rq() or any ··· 8078 6142 locked_rq = src_rq; 8079 6143 raw_spin_lock(&src_dsq->lock); 8080 6144 8081 - /* 8082 - * Did someone else get to it? @p could have already left $src_dsq, got 8083 - * re-enqueud, or be in the process of being consumed by someone else. 8084 - */ 8085 - if (unlikely(p->scx.dsq != src_dsq || 8086 - u32_before(kit->cursor.priv, p->scx.dsq_seq) || 8087 - p->scx.holding_cpu >= 0) || 8088 - WARN_ON_ONCE(src_rq != task_rq(p))) { 6145 + /* did someone else get to it while we dropped the locks? 
*/ 6146 + if (nldsq_cursor_lost_task(&kit->cursor, src_rq, src_dsq, p)) { 8089 6147 raw_spin_unlock(&src_dsq->lock); 8090 6148 goto out; 8091 6149 } 8092 6150 8093 6151 /* @p is still on $src_dsq and stable, determine the destination */ 8094 - dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p); 6152 + dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p)); 8095 6153 8096 6154 /* 8097 6155 * Apply vtime and slice updates before moving so that the new time is ··· 8119 6189 8120 6190 /** 8121 6191 * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots 6192 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8122 6193 * 8123 6194 * Can only be called from ops.dispatch(). 8124 6195 */ 8125 - __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void) 6196 + __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(const struct bpf_prog_aux *aux) 8126 6197 { 8127 6198 struct scx_sched *sch; 8128 6199 8129 6200 guard(rcu)(); 8130 6201 8131 - sch = rcu_dereference(scx_root); 6202 + sch = scx_prog_sched(aux); 8132 6203 if (unlikely(!sch)) 8133 6204 return 0; 8134 6205 8135 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 8136 - return 0; 8137 - 8138 - return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor); 6206 + return sch->dsp_max_batch - __this_cpu_read(sch->pcpu->dsp_ctx.cursor); 8139 6207 } 8140 6208 8141 6209 /** 8142 6210 * scx_bpf_dispatch_cancel - Cancel the latest dispatch 6211 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8143 6212 * 8144 6213 * Cancel the latest dispatch. Can be called multiple times to cancel further 8145 6214 * dispatches. Can only be called from ops.dispatch(). 8146 6215 */ 8147 - __bpf_kfunc void scx_bpf_dispatch_cancel(void) 6216 + __bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux) 8148 6217 { 8149 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 8150 6218 struct scx_sched *sch; 6219 + struct scx_dsp_ctx *dspc; 8151 6220 8152 6221 guard(rcu)(); 8153 6222 8154 - sch = rcu_dereference(scx_root); 6223 + sch = scx_prog_sched(aux); 8155 6224 if (unlikely(!sch)) 8156 6225 return; 8157 6226 8158 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 8159 - return; 6227 + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 8160 6228 8161 6229 if (dspc->cursor > 0) 8162 6230 dspc->cursor--; ··· 8164 6236 8165 6237 /** 8166 6238 * scx_bpf_dsq_move_to_local - move a task from a DSQ to the current CPU's local DSQ 8167 - * @dsq_id: DSQ to move task from 6239 + * @dsq_id: DSQ to move task from. Must be a user-created DSQ 6240 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6241 + * @enq_flags: %SCX_ENQ_* 8168 6242 * 8169 6243 * Move a task from the non-local DSQ identified by @dsq_id to the current CPU's 8170 - * local DSQ for execution. Can only be called from ops.dispatch(). 6244 + * local DSQ for execution with @enq_flags applied. Can only be called from 6245 + * ops.dispatch(). 6246 + * 6247 + * Built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) are not supported as 6248 + * sources. Local DSQs support reenqueueing (a task can be picked up for 6249 + * execution, dequeued for property changes, or reenqueued), but the BPF 6250 + * scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL 6251 + * is similar but also doesn't support reenqueueing, as it maps to multiple 6252 + * per-node DSQs making the scope difficult to define; this may change in the 6253 + * future. 
8171 6254 * 8172 6255 * This function flushes the in-flight dispatches from scx_bpf_dsq_insert() 8173 6256 * before trying to move from the specified DSQ. It may also grab rq locks and ··· 8187 6248 * Returns %true if a task has been moved, %false if there isn't any task to 8188 6249 * move. 8189 6250 */ 8190 - __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id) 6251 + __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags, 6252 + const struct bpf_prog_aux *aux) 8191 6253 { 8192 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 8193 6254 struct scx_dispatch_q *dsq; 8194 6255 struct scx_sched *sch; 6256 + struct scx_dsp_ctx *dspc; 8195 6257 8196 6258 guard(rcu)(); 8197 6259 8198 - sch = rcu_dereference(scx_root); 6260 + sch = scx_prog_sched(aux); 8199 6261 if (unlikely(!sch)) 8200 6262 return false; 8201 6263 8202 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 6264 + if (!scx_vet_enq_flags(sch, SCX_DSQ_LOCAL, &enq_flags)) 8203 6265 return false; 6266 + 6267 + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 8204 6268 8205 6269 flush_dispatch_buf(sch, dspc->rq); 8206 6270 ··· 8213 6271 return false; 8214 6272 } 8215 6273 8216 - if (consume_dispatch_q(sch, dspc->rq, dsq)) { 6274 + if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) { 8217 6275 /* 8218 6276 * A successfully consumed task can be dequeued before it starts 8219 6277 * running while the CPU is trying to migrate other dispatched ··· 8225 6283 } else { 8226 6284 return false; 8227 6285 } 6286 + } 6287 + 6288 + /* 6289 + * COMPAT: ___v2 was introduced in v7.1. Remove this and ___v2 tag in the future. 6290 + */ 6291 + __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id, const struct bpf_prog_aux *aux) 6292 + { 6293 + return scx_bpf_dsq_move_to_local___v2(dsq_id, 0, aux); 8228 6294 } 8229 6295 8230 6296 /** ··· 8330 6380 p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); 8331 6381 } 8332 6382 6383 + #ifdef CONFIG_EXT_SUB_SCHED 6384 + /** 6385 + * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler 6386 + * @cgroup_id: cgroup ID of the child scheduler to dispatch 6387 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6388 + * 6389 + * Allows a parent scheduler to trigger dispatching on one of its direct 6390 + * child schedulers. The child scheduler runs its dispatch operation to 6391 + * move tasks from dispatch queues to the local runqueue. 6392 + * 6393 + * Returns: true on success, false if cgroup_id is invalid, not a direct 6394 + * child, or caller lacks dispatch permission. 
6395 + */ 6396 + __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux) 6397 + { 6398 + struct rq *this_rq = this_rq(); 6399 + struct scx_sched *parent, *child; 6400 + 6401 + guard(rcu)(); 6402 + parent = scx_prog_sched(aux); 6403 + if (unlikely(!parent)) 6404 + return false; 6405 + 6406 + child = scx_find_sub_sched(cgroup_id); 6407 + 6408 + if (unlikely(!child)) 6409 + return false; 6410 + 6411 + if (unlikely(scx_parent(child) != parent)) { 6412 + scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu", 6413 + cgroup_id); 6414 + return false; 6415 + } 6416 + 6417 + return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev, 6418 + true); 6419 + } 6420 + #endif /* CONFIG_EXT_SUB_SCHED */ 6421 + 8333 6422 __bpf_kfunc_end_defs(); 8334 6423 8335 6424 BTF_KFUNCS_START(scx_kfunc_ids_dispatch) 8336 - BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) 8337 - BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) 8338 - BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local) 6425 + BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS) 6426 + BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS) 6427 + BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS) 6428 + BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS) 6429 + /* scx_bpf_dsq_move*() also in scx_kfunc_ids_unlocked: callable from unlocked contexts */ 8339 6430 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) 8340 6431 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) 8341 6432 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) 8342 6433 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) 6434 + #ifdef CONFIG_EXT_SUB_SCHED 6435 + BTF_ID_FLAGS(func, scx_bpf_sub_dispatch, KF_IMPLICIT_ARGS) 6436 + #endif 8343 6437 BTF_KFUNCS_END(scx_kfunc_ids_dispatch) 8344 6438 8345 6439 static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { 8346 6440 .owner = THIS_MODULE, 8347 6441 .set = &scx_kfunc_ids_dispatch, 6442 + .filter = scx_kfunc_context_filter, 8348 6443 }; 8349 - 8350 - static u32 reenq_local(struct rq *rq) 8351 - { 8352 - LIST_HEAD(tasks); 8353 - u32 nr_enqueued = 0; 8354 - struct task_struct *p, *n; 8355 - 8356 - lockdep_assert_rq_held(rq); 8357 - 8358 - /* 8359 - * The BPF scheduler may choose to dispatch tasks back to 8360 - * @rq->scx.local_dsq. Move all candidate tasks off to a private list 8361 - * first to avoid processing the same tasks repeatedly. 8362 - */ 8363 - list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, 8364 - scx.dsq_list.node) { 8365 - /* 8366 - * If @p is being migrated, @p's current CPU may not agree with 8367 - * its allowed CPUs and the migration_cpu_stop is about to 8368 - * deactivate and re-activate @p anyway. Skip re-enqueueing. 8369 - * 8370 - * While racing sched property changes may also dequeue and 8371 - * re-enqueue a migrating task while its current CPU and allowed 8372 - * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to 8373 - * the current local DSQ for running tasks and thus are not 8374 - * visible to the BPF scheduler. 
8375 - */ 8376 - if (p->migration_pending) 8377 - continue; 8378 - 8379 - dispatch_dequeue(rq, p); 8380 - list_add_tail(&p->scx.dsq_list.node, &tasks); 8381 - } 8382 - 8383 - list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { 8384 - list_del_init(&p->scx.dsq_list.node); 8385 - do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 8386 - nr_enqueued++; 8387 - } 8388 - 8389 - return nr_enqueued; 8390 - } 8391 6444 8392 6445 __bpf_kfunc_start_defs(); 8393 6446 8394 6447 /** 8395 6448 * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 6449 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8396 6450 * 8397 6451 * Iterate over all of the tasks currently enqueued on the local DSQ of the 8398 6452 * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of 8399 6453 * processed tasks. Can only be called from ops.cpu_release(). 8400 - * 8401 - * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void 8402 - * returning variant that can be called from anywhere. 8403 6454 */ 8404 - __bpf_kfunc u32 scx_bpf_reenqueue_local(void) 6455 + __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux) 8405 6456 { 8406 6457 struct scx_sched *sch; 8407 6458 struct rq *rq; 8408 6459 8409 6460 guard(rcu)(); 8410 - sch = rcu_dereference(scx_root); 6461 + sch = scx_prog_sched(aux); 8411 6462 if (unlikely(!sch)) 8412 - return 0; 8413 - 8414 - if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE)) 8415 6463 return 0; 8416 6464 8417 6465 rq = cpu_rq(smp_processor_id()); 8418 6466 lockdep_assert_rq_held(rq); 8419 6467 8420 - return reenq_local(rq); 6468 + return reenq_local(sch, rq, SCX_REENQ_ANY); 8421 6469 } 8422 6470 8423 6471 __bpf_kfunc_end_defs(); 8424 6472 8425 6473 BTF_KFUNCS_START(scx_kfunc_ids_cpu_release) 8426 - BTF_ID_FLAGS(func, scx_bpf_reenqueue_local) 6474 + BTF_ID_FLAGS(func, scx_bpf_reenqueue_local, KF_IMPLICIT_ARGS) 8427 6475 BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) 8428 6476 8429 6477 static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { 8430 6478 .owner = THIS_MODULE, 8431 6479 .set = &scx_kfunc_ids_cpu_release, 6480 + .filter = scx_kfunc_context_filter, 8432 6481 }; 8433 6482 8434 6483 __bpf_kfunc_start_defs(); ··· 8436 6487 * scx_bpf_create_dsq - Create a custom DSQ 8437 6488 * @dsq_id: DSQ to create 8438 6489 * @node: NUMA node to allocate from 6490 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8439 6491 * 8440 6492 * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable 8441 6493 * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. 8442 6494 */ 8443 - __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) 6495 + __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node, const struct bpf_prog_aux *aux) 8444 6496 { 8445 6497 struct scx_dispatch_q *dsq; 8446 6498 struct scx_sched *sch; ··· 8458 6508 if (!dsq) 8459 6509 return -ENOMEM; 8460 6510 8461 - init_dsq(dsq, dsq_id); 6511 + /* 6512 + * init_dsq() must be called in GFP_KERNEL context. Init it with NULL 6513 + * @sch and update afterwards. 
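Usage of scx_bpf_create_dsq() from the BPF side is unchanged by the ownership plumbing above (the @aux argument is implicit). A minimal sleepable init, with the DSQ ID an arbitrary choice of the sketch:

.. code-block:: c

   #define SHARED_DSQ 0    /* arbitrary user DSQ ID used by this example */

   s32 BPF_STRUCT_OPS_SLEEPABLE(example_init)
   {
           /* -1: allocate the DSQ on any NUMA node */
           return scx_bpf_create_dsq(SHARED_DSQ, -1);
   }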
6514 + */ 6515 + ret = init_dsq(dsq, dsq_id, NULL); 6516 + if (ret) { 6517 + kfree(dsq); 6518 + return ret; 6519 + } 8462 6520 8463 6521 rcu_read_lock(); 8464 6522 8465 - sch = rcu_dereference(scx_root); 8466 - if (sch) 6523 + sch = scx_prog_sched(aux); 6524 + if (sch) { 6525 + dsq->sched = sch; 8467 6526 ret = rhashtable_lookup_insert_fast(&sch->dsq_hash, &dsq->hash_node, 8468 6527 dsq_hash_params); 8469 - else 6528 + } else { 8470 6529 ret = -ENODEV; 6530 + } 8471 6531 8472 6532 rcu_read_unlock(); 8473 - if (ret) 6533 + if (ret) { 6534 + exit_dsq(dsq); 8474 6535 kfree(dsq); 6536 + } 8475 6537 return ret; 8476 6538 } 8477 6539 8478 6540 __bpf_kfunc_end_defs(); 8479 6541 8480 6542 BTF_KFUNCS_START(scx_kfunc_ids_unlocked) 8481 - BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) 6543 + BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE) 6544 + /* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */ 8482 6545 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) 8483 6546 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) 8484 6547 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) 8485 6548 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) 6549 + /* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */ 6550 + BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) 6551 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 6552 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) 8486 6553 BTF_KFUNCS_END(scx_kfunc_ids_unlocked) 8487 6554 8488 6555 static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { 8489 6556 .owner = THIS_MODULE, 8490 6557 .set = &scx_kfunc_ids_unlocked, 6558 + .filter = scx_kfunc_context_filter, 8491 6559 }; 8492 6560 8493 6561 __bpf_kfunc_start_defs(); ··· 8514 6546 * scx_bpf_task_set_slice - Set task's time slice 8515 6547 * @p: task of interest 8516 6548 * @slice: time slice to set in nsecs 6549 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8517 6550 * 8518 6551 * Set @p's time slice to @slice. Returns %true on success, %false if the 8519 6552 * calling scheduler doesn't have authority over @p. 8520 6553 */ 8521 - __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice) 6554 + __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice, 6555 + const struct bpf_prog_aux *aux) 8522 6556 { 6557 + struct scx_sched *sch; 6558 + 6559 + guard(rcu)(); 6560 + sch = scx_prog_sched(aux); 6561 + if (unlikely(!scx_task_on_sched(sch, p))) 6562 + return false; 6563 + 8523 6564 p->scx.slice = slice; 8524 6565 return true; 8525 6566 } ··· 8537 6560 * scx_bpf_task_set_dsq_vtime - Set task's virtual time for DSQ ordering 8538 6561 * @p: task of interest 8539 6562 * @vtime: virtual time to set 6563 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8540 6564 * 8541 6565 * Set @p's virtual time to @vtime. Returns %true on success, %false if the 8542 6566 * calling scheduler doesn't have authority over @p. 
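With direct writes to p->scx.slice and p->scx.dsq_vtime deprecated earlier in this series, the two kfuncs above become the expected way to adjust a task's slice and vtime, and both refuse tasks owned by a different scheduler. A hedged sketch (vtime_now stands in for an assumed per-scheduler clock):

.. code-block:: c

   static void set_deadline(struct task_struct *p, u64 vtime_now)
   {
           /* both return false if @p belongs to another scheduler */
           if (!scx_bpf_task_set_slice(p, SCX_SLICE_DFL) ||
               !scx_bpf_task_set_dsq_vtime(p, vtime_now))
                   bpf_printk("task %d is not ours", p->pid);
   }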
8543 6567 */ 8544 - __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime) 6568 + __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime, 6569 + const struct bpf_prog_aux *aux) 8545 6570 { 6571 + struct scx_sched *sch; 6572 + 6573 + guard(rcu)(); 6574 + sch = scx_prog_sched(aux); 6575 + if (unlikely(!scx_task_on_sched(sch, p))) 6576 + return false; 6577 + 8546 6578 p->scx.dsq_vtime = vtime; 8547 6579 return true; 8548 6580 } ··· 8573 6587 * lead to irq_work_queue() malfunction such as infinite busy wait for 8574 6588 * IRQ status update. Suppress kicking. 8575 6589 */ 8576 - if (scx_rq_bypassing(this_rq)) 6590 + if (scx_bypassing(sch, cpu_of(this_rq))) 8577 6591 goto out; 8578 6592 8579 6593 /* ··· 8613 6627 * scx_bpf_kick_cpu - Trigger reschedule on a CPU 8614 6628 * @cpu: cpu to kick 8615 6629 * @flags: %SCX_KICK_* flags 6630 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8616 6631 * 8617 6632 * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or 8618 6633 * trigger rescheduling on a busy CPU. This can be called from any online 8619 6634 * scx_ops operation and the actual kicking is performed asynchronously through 8620 6635 * an irq work. 8621 6636 */ 8622 - __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) 6637 + __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux *aux) 8623 6638 { 8624 6639 struct scx_sched *sch; 8625 6640 8626 6641 guard(rcu)(); 8627 - sch = rcu_dereference(scx_root); 6642 + sch = scx_prog_sched(aux); 8628 6643 if (likely(sch)) 8629 6644 scx_kick_cpu(sch, cpu, flags); 8630 6645 } ··· 8699 6712 * @it: iterator to initialize 8700 6713 * @dsq_id: DSQ to iterate 8701 6714 * @flags: %SCX_DSQ_ITER_* 6715 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8702 6716 * 8703 6717 * Initialize BPF iterator @it which can be used with bpf_for_each() to walk 8704 6718 * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes 8705 6719 * tasks which are already queued when this function is invoked. 
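From BPF the iterator is still consumed through bpf_for_each(); only the cursor bookkeeping moved into nldsq_cursor_next_task(). A sketch of the common iterate-then-move pattern, intended to run from ops.dispatch(); SHARED_DSQ and the BPF_FOR_EACH_ITER macro come from the tools/sched_ext headers and are assumptions here:

.. code-block:: c

   static bool pull_one_task(s32 cpu)
   {
           struct task_struct *p;

           /* only tasks already queued when the iterator was created are visited */
           bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
                   if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                           continue;
                   if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p,
                                        SCX_DSQ_LOCAL_ON | cpu, 0))
                           return true;
           }
           return false;
   }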
8706 6720 */ 8707 6721 __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, 8708 - u64 flags) 6722 + u64 flags, const struct bpf_prog_aux *aux) 8709 6723 { 8710 6724 struct bpf_iter_scx_dsq_kern *kit = (void *)it; 8711 6725 struct scx_sched *sch; ··· 8724 6736 */ 8725 6737 kit->dsq = NULL; 8726 6738 8727 - sch = rcu_dereference_check(scx_root, rcu_read_lock_bh_held()); 6739 + sch = scx_prog_sched(aux); 8728 6740 if (unlikely(!sch)) 8729 6741 return -ENODEV; 8730 6742 ··· 8735 6747 if (!kit->dsq) 8736 6748 return -ENOENT; 8737 6749 8738 - kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags, 8739 - READ_ONCE(kit->dsq->seq)); 6750 + kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, kit->dsq, flags); 8740 6751 8741 6752 return 0; 8742 6753 } ··· 8749 6762 __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) 8750 6763 { 8751 6764 struct bpf_iter_scx_dsq_kern *kit = (void *)it; 8752 - bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV; 8753 - struct task_struct *p; 8754 - unsigned long flags; 8755 6765 8756 6766 if (!kit->dsq) 8757 6767 return NULL; 8758 6768 8759 - raw_spin_lock_irqsave(&kit->dsq->lock, flags); 6769 + guard(raw_spinlock_irqsave)(&kit->dsq->lock); 8760 6770 8761 - if (list_empty(&kit->cursor.node)) 8762 - p = NULL; 8763 - else 8764 - p = container_of(&kit->cursor, struct task_struct, scx.dsq_list); 8765 - 8766 - /* 8767 - * Only tasks which were queued before the iteration started are 8768 - * visible. This bounds BPF iterations and guarantees that vtime never 8769 - * jumps in the other direction while iterating. 8770 - */ 8771 - do { 8772 - p = nldsq_next_task(kit->dsq, p, rev); 8773 - } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq))); 8774 - 8775 - if (p) { 8776 - if (rev) 8777 - list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node); 8778 - else 8779 - list_move(&kit->cursor.node, &p->scx.dsq_list.node); 8780 - } else { 8781 - list_del_init(&kit->cursor.node); 8782 - } 8783 - 8784 - raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); 8785 - 8786 - return p; 6771 + return nldsq_cursor_next_task(&kit->cursor, kit->dsq); 8787 6772 } 8788 6773 8789 6774 /** ··· 8784 6825 /** 8785 6826 * scx_bpf_dsq_peek - Lockless peek at the first element. 8786 6827 * @dsq_id: DSQ to examine. 6828 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8787 6829 * 8788 6830 * Read the first element in the DSQ. This is semantically equivalent to using 8789 6831 * the DSQ iterator, but is lockfree. Of course, like any lockless operation, ··· 8793 6833 * 8794 6834 * Returns the pointer, or NULL indicates an empty queue OR internal error. 8795 6835 */ 8796 - __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) 6836 + __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id, 6837 + const struct bpf_prog_aux *aux) 8797 6838 { 8798 6839 struct scx_sched *sch; 8799 6840 struct scx_dispatch_q *dsq; 8800 6841 8801 - sch = rcu_dereference(scx_root); 6842 + sch = scx_prog_sched(aux); 8802 6843 if (unlikely(!sch)) 8803 6844 return NULL; 8804 6845 ··· 8815 6854 } 8816 6855 8817 6856 return rcu_dereference(dsq->first_task); 6857 + } 6858 + 6859 + /** 6860 + * scx_bpf_dsq_reenq - Re-enqueue tasks on a DSQ 6861 + * @dsq_id: DSQ to re-enqueue 6862 + * @reenq_flags: %SCX_RENQ_* 6863 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6864 + * 6865 + * Iterate over all of the tasks currently enqueued on the DSQ identified by 6866 + * @dsq_id, and re-enqueue them in the BPF scheduler. 
The following DSQs are 6867 + * supported: 6868 + * 6869 + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu) 6870 + * - User DSQs 6871 + * 6872 + * Re-enqueues are performed asynchronously. Can be called from anywhere. 6873 + */ 6874 + __bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags, 6875 + const struct bpf_prog_aux *aux) 6876 + { 6877 + struct scx_sched *sch; 6878 + struct scx_dispatch_q *dsq; 6879 + 6880 + guard(preempt)(); 6881 + 6882 + sch = scx_prog_sched(aux); 6883 + if (unlikely(!sch)) 6884 + return; 6885 + 6886 + if (unlikely(reenq_flags & ~__SCX_REENQ_USER_MASK)) { 6887 + scx_error(sch, "invalid SCX_REENQ flags 0x%llx", reenq_flags); 6888 + return; 6889 + } 6890 + 6891 + /* not specifying any filter bits is the same as %SCX_REENQ_ANY */ 6892 + if (!(reenq_flags & __SCX_REENQ_FILTER_MASK)) 6893 + reenq_flags |= SCX_REENQ_ANY; 6894 + 6895 + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id()); 6896 + schedule_dsq_reenq(sch, dsq, reenq_flags, scx_locked_rq()); 6897 + } 6898 + 6899 + /** 6900 + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 6901 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6902 + * 6903 + * Iterate over all of the tasks currently enqueued on the local DSQ of the 6904 + * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from 6905 + * anywhere. 6906 + * 6907 + * This is now a special case of scx_bpf_dsq_reenq() and may be removed in the 6908 + * future. 6909 + */ 6910 + __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux) 6911 + { 6912 + scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0, aux); 8818 6913 } 8819 6914 8820 6915 __bpf_kfunc_end_defs(); ··· 8927 6910 * @fmt: error message format string 8928 6911 * @data: format string parameters packaged using ___bpf_fill() macro 8929 6912 * @data__sz: @data len, must end in '__sz' for the verifier 6913 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8930 6914 * 8931 6915 * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops 8932 6916 * disabling. 8933 6917 */ 8934 6918 __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, 8935 - unsigned long long *data, u32 data__sz) 6919 + unsigned long long *data, u32 data__sz, 6920 + const struct bpf_prog_aux *aux) 8936 6921 { 8937 6922 struct scx_sched *sch; 8938 6923 unsigned long flags; 8939 6924 8940 6925 raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 8941 - sch = rcu_dereference_bh(scx_root); 6926 + sch = scx_prog_sched(aux); 8942 6927 if (likely(sch) && 8943 6928 bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 8944 6929 scx_exit(sch, SCX_EXIT_UNREG_BPF, exit_code, "%s", scx_exit_bstr_buf.line); ··· 8952 6933 * @fmt: error message format string 8953 6934 * @data: format string parameters packaged using ___bpf_fill() macro 8954 6935 * @data__sz: @data len, must end in '__sz' for the verifier 6936 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8955 6937 * 8956 6938 * Indicate that the BPF scheduler encountered a fatal error and initiate ops 8957 6939 * disabling. 
8958 6940 */ 8959 6941 __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, 8960 - u32 data__sz) 6942 + u32 data__sz, const struct bpf_prog_aux *aux) 8961 6943 { 8962 6944 struct scx_sched *sch; 8963 6945 unsigned long flags; 8964 6946 8965 6947 raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 8966 - sch = rcu_dereference_bh(scx_root); 6948 + sch = scx_prog_sched(aux); 8967 6949 if (likely(sch) && 8968 6950 bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 8969 6951 scx_exit(sch, SCX_EXIT_ERROR_BPF, 0, "%s", scx_exit_bstr_buf.line); ··· 8976 6956 * @fmt: format string 8977 6957 * @data: format string parameters packaged using ___bpf_fill() macro 8978 6958 * @data__sz: @data len, must end in '__sz' for the verifier 6959 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8979 6960 * 8980 6961 * To be called through scx_bpf_dump() helper from ops.dump(), dump_cpu() and 8981 6962 * dump_task() to generate extra debug dump specific to the BPF scheduler. ··· 8985 6964 * multiple calls. The last line is automatically terminated. 8986 6965 */ 8987 6966 __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, 8988 - u32 data__sz) 6967 + u32 data__sz, const struct bpf_prog_aux *aux) 8989 6968 { 8990 6969 struct scx_sched *sch; 8991 6970 struct scx_dump_data *dd = &scx_dump_data; ··· 8994 6973 8995 6974 guard(rcu)(); 8996 6975 8997 - sch = rcu_dereference(scx_root); 6976 + sch = scx_prog_sched(aux); 8998 6977 if (unlikely(!sch)) 8999 6978 return; 9000 6979 ··· 9031 7010 } 9032 7011 9033 7012 /** 9034 - * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 9035 - * 9036 - * Iterate over all of the tasks currently enqueued on the local DSQ of the 9037 - * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from 9038 - * anywhere. 9039 - */ 9040 - __bpf_kfunc void scx_bpf_reenqueue_local___v2(void) 9041 - { 9042 - struct rq *rq; 9043 - 9044 - guard(preempt)(); 9045 - 9046 - rq = this_rq(); 9047 - local_set(&rq->scx.reenq_local_deferred, 1); 9048 - schedule_deferred(rq); 9049 - } 9050 - 9051 - /** 9052 7013 * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU 9053 7014 * @cpu: CPU of interest 7015 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9054 7016 * 9055 7017 * Return the maximum relative capacity of @cpu in relation to the most 9056 7018 * performant CPU in the system. The return value is in the range [1, 9057 7019 * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). 9058 7020 */ 9059 - __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) 7021 + __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux) 9060 7022 { 9061 7023 struct scx_sched *sch; 9062 7024 9063 7025 guard(rcu)(); 9064 7026 9065 - sch = rcu_dereference(scx_root); 7027 + sch = scx_prog_sched(aux); 9066 7028 if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) 9067 7029 return arch_scale_cpu_capacity(cpu); 9068 7030 else ··· 9055 7051 /** 9056 7052 * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU 9057 7053 * @cpu: CPU of interest 7054 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9058 7055 * 9059 7056 * Return the current relative performance of @cpu in relation to its maximum. 9060 7057 * The return value is in the range [1, %SCX_CPUPERF_ONE]. ··· 9067 7062 * 9068 7063 * The result is in the range [1, %SCX_CPUPERF_ONE]. 
9069 7064 */ 9070 - __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) 7065 + __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux) 9071 7066 { 9072 7067 struct scx_sched *sch; 9073 7068 9074 7069 guard(rcu)(); 9075 7070 9076 - sch = rcu_dereference(scx_root); 7071 + sch = scx_prog_sched(aux); 9077 7072 if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) 9078 7073 return arch_scale_freq_capacity(cpu); 9079 7074 else ··· 9084 7079 * scx_bpf_cpuperf_set - Set the relative performance target of a CPU 9085 7080 * @cpu: CPU of interest 9086 7081 * @perf: target performance level [0, %SCX_CPUPERF_ONE] 7082 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9087 7083 * 9088 7084 * Set the target performance level of @cpu to @perf. @perf is in linear 9089 7085 * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the ··· 9095 7089 * use. Consult hardware and cpufreq documentation for more information. The 9096 7090 * current performance level can be monitored using scx_bpf_cpuperf_cur(). 9097 7091 */ 9098 - __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) 7092 + __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_aux *aux) 9099 7093 { 9100 7094 struct scx_sched *sch; 9101 7095 9102 7096 guard(rcu)(); 9103 7097 9104 - sch = rcu_dereference(scx_root); 7098 + sch = scx_prog_sched(aux); 9105 7099 if (unlikely(!sch)) 9106 7100 return; 9107 7101 ··· 9211 7205 /** 9212 7206 * scx_bpf_cpu_rq - Fetch the rq of a CPU 9213 7207 * @cpu: CPU of the rq 7208 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9214 7209 */ 9215 - __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) 7210 + __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu, const struct bpf_prog_aux *aux) 9216 7211 { 9217 7212 struct scx_sched *sch; 9218 7213 9219 7214 guard(rcu)(); 9220 7215 9221 - sch = rcu_dereference(scx_root); 7216 + sch = scx_prog_sched(aux); 9222 7217 if (unlikely(!sch)) 9223 7218 return NULL; 9224 7219 ··· 9238 7231 9239 7232 /** 9240 7233 * scx_bpf_locked_rq - Return the rq currently locked by SCX 7234 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9241 7235 * 9242 7236 * Returns the rq if a rq lock is currently held by SCX. 9243 7237 * Otherwise emits an error and returns NULL. 9244 7238 */ 9245 - __bpf_kfunc struct rq *scx_bpf_locked_rq(void) 7239 + __bpf_kfunc struct rq *scx_bpf_locked_rq(const struct bpf_prog_aux *aux) 9246 7240 { 9247 7241 struct scx_sched *sch; 9248 7242 struct rq *rq; 9249 7243 9250 7244 guard(preempt)(); 9251 7245 9252 - sch = rcu_dereference_sched(scx_root); 7246 + sch = scx_prog_sched(aux); 9253 7247 if (unlikely(!sch)) 9254 7248 return NULL; 9255 7249 ··· 9266 7258 /** 9267 7259 * scx_bpf_cpu_curr - Return remote CPU's curr task 9268 7260 * @cpu: CPU of interest 7261 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9269 7262 * 9270 7263 * Callers must hold RCU read lock (KF_RCU). 
9271 7264 */ 9272 - __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu) 7265 + __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_aux *aux) 9273 7266 { 9274 7267 struct scx_sched *sch; 9275 7268 9276 7269 guard(rcu)(); 9277 7270 9278 - sch = rcu_dereference(scx_root); 7271 + sch = scx_prog_sched(aux); 9279 7272 if (unlikely(!sch)) 9280 7273 return NULL; 9281 7274 ··· 9285 7276 9286 7277 return rcu_dereference(cpu_rq(cpu)->curr); 9287 7278 } 9288 - 9289 - /** 9290 - * scx_bpf_task_cgroup - Return the sched cgroup of a task 9291 - * @p: task of interest 9292 - * 9293 - * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with 9294 - * from the scheduler's POV. SCX operations should use this function to 9295 - * determine @p's current cgroup as, unlike following @p->cgroups, 9296 - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all 9297 - * rq-locked operations. Can be called on the parameter tasks of rq-locked 9298 - * operations. The restriction guarantees that @p's rq is locked by the caller. 9299 - */ 9300 - #ifdef CONFIG_CGROUP_SCHED 9301 - __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) 9302 - { 9303 - struct task_group *tg = p->sched_task_group; 9304 - struct cgroup *cgrp = &cgrp_dfl_root.cgrp; 9305 - struct scx_sched *sch; 9306 - 9307 - guard(rcu)(); 9308 - 9309 - sch = rcu_dereference(scx_root); 9310 - if (unlikely(!sch)) 9311 - goto out; 9312 - 9313 - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p)) 9314 - goto out; 9315 - 9316 - cgrp = tg_cgrp(tg); 9317 - 9318 - out: 9319 - cgroup_get(cgrp); 9320 - return cgrp; 9321 - } 9322 - #endif 9323 7279 9324 7280 /** 9325 7281 * scx_bpf_now - Returns a high-performance monotonically non-decreasing ··· 9362 7388 scx_agg_event(events, e_cpu, SCX_EV_DISPATCH_KEEP_LAST); 9363 7389 scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_EXITING); 9364 7390 scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 7391 + scx_agg_event(events, e_cpu, SCX_EV_REENQ_IMMED); 7392 + scx_agg_event(events, e_cpu, SCX_EV_REENQ_LOCAL_REPEAT); 9365 7393 scx_agg_event(events, e_cpu, SCX_EV_REFILL_SLICE_DFL); 9366 7394 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DURATION); 9367 7395 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DISPATCH); 9368 7396 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_ACTIVATE); 7397 + scx_agg_event(events, e_cpu, SCX_EV_INSERT_NOT_OWNED); 7398 + scx_agg_event(events, e_cpu, SCX_EV_SUB_BYPASS_DISPATCH); 9369 7399 } 9370 7400 } 9371 7401 ··· 9403 7425 memcpy(events, &e_sys, events__sz); 9404 7426 } 9405 7427 7428 + #ifdef CONFIG_CGROUP_SCHED 7429 + /** 7430 + * scx_bpf_task_cgroup - Return the sched cgroup of a task 7431 + * @p: task of interest 7432 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7433 + * 7434 + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with 7435 + * from the scheduler's POV. SCX operations should use this function to 7436 + * determine @p's current cgroup as, unlike following @p->cgroups, 7437 + * @p->sched_task_group is stable for the duration of the SCX op. See 7438 + * SCX_CALL_OP_TASK() for details. 
7439 + */ 7440 + __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, 7441 + const struct bpf_prog_aux *aux) 7442 + { 7443 + struct task_group *tg = p->sched_task_group; 7444 + struct cgroup *cgrp = &cgrp_dfl_root.cgrp; 7445 + struct scx_sched *sch; 7446 + 7447 + guard(rcu)(); 7448 + 7449 + sch = scx_prog_sched(aux); 7450 + if (unlikely(!sch)) 7451 + goto out; 7452 + 7453 + if (!scx_kf_arg_task_ok(sch, p)) 7454 + goto out; 7455 + 7456 + cgrp = tg_cgrp(tg); 7457 + 7458 + out: 7459 + cgroup_get(cgrp); 7460 + return cgrp; 7461 + } 7462 + #endif /* CONFIG_CGROUP_SCHED */ 7463 + 9406 7464 __bpf_kfunc_end_defs(); 9407 7465 9408 7466 BTF_KFUNCS_START(scx_kfunc_ids_any) 9409 - BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_RCU); 9410 - BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_RCU); 9411 - BTF_ID_FLAGS(func, scx_bpf_kick_cpu) 7467 + BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU); 7468 + BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU); 7469 + BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS) 9412 7470 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) 9413 7471 BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) 9414 - BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_RCU_PROTECTED | KF_RET_NULL) 9415 - BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED) 7472 + BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL) 7473 + BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS) 7474 + BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS) 7475 + BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_IMPLICIT_ARGS | KF_ITER_NEW | KF_RCU_PROTECTED) 9416 7476 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL) 9417 7477 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY) 9418 - BTF_ID_FLAGS(func, scx_bpf_exit_bstr) 9419 - BTF_ID_FLAGS(func, scx_bpf_error_bstr) 9420 - BTF_ID_FLAGS(func, scx_bpf_dump_bstr) 9421 - BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2) 9422 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) 9423 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) 9424 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) 7478 + BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_IMPLICIT_ARGS) 7479 + BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_IMPLICIT_ARGS) 7480 + BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS) 7481 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS) 7482 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS) 7483 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS) 9425 7484 BTF_ID_FLAGS(func, scx_bpf_nr_node_ids) 9426 7485 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) 9427 7486 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) ··· 9466 7451 BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE) 9467 7452 BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) 9468 7453 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) 9469 - BTF_ID_FLAGS(func, scx_bpf_cpu_rq) 9470 - BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL) 9471 - BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED) 9472 - #ifdef CONFIG_CGROUP_SCHED 9473 - BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE) 9474 - #endif 7454 + BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS) 7455 + BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL) 7456 + BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED) 9475 7457 BTF_ID_FLAGS(func, scx_bpf_now) 9476 7458 BTF_ID_FLAGS(func, scx_bpf_events) 7459 + #ifdef CONFIG_CGROUP_SCHED 7460 + 
BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE) 7461 + #endif 9477 7462 BTF_KFUNCS_END(scx_kfunc_ids_any) 9478 7463 9479 7464 static const struct btf_kfunc_id_set scx_kfunc_set_any = { 9480 7465 .owner = THIS_MODULE, 9481 7466 .set = &scx_kfunc_ids_any, 9482 7467 }; 7468 + 7469 + /* 7470 + * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc 7471 + * group; an op may permit zero or more groups, with the union expressed in 7472 + * scx_kf_allow_flags[]. The verifier-time filter (scx_kfunc_context_filter()) 7473 + * consults this table to decide whether a context-sensitive kfunc is callable 7474 + * from a given SCX op. 7475 + */ 7476 + enum scx_kf_allow_flags { 7477 + SCX_KF_ALLOW_UNLOCKED = 1 << 0, 7478 + SCX_KF_ALLOW_CPU_RELEASE = 1 << 1, 7479 + SCX_KF_ALLOW_DISPATCH = 1 << 2, 7480 + SCX_KF_ALLOW_ENQUEUE = 1 << 3, 7481 + SCX_KF_ALLOW_SELECT_CPU = 1 << 4, 7482 + }; 7483 + 7484 + /* 7485 + * Map each SCX op to the union of kfunc groups it permits, indexed by 7486 + * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not 7487 + * context-sensitive. 7488 + */ 7489 + static const u32 scx_kf_allow_flags[] = { 7490 + [SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, 7491 + [SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, 7492 + [SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH, 7493 + [SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE, 7494 + [SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED, 7495 + [SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED, 7496 + #ifdef CONFIG_EXT_GROUP_SCHED 7497 + [SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED, 7498 + [SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED, 7499 + [SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED, 7500 + [SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED, 7501 + [SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED, 7502 + [SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED, 7503 + [SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED, 7504 + #endif /* CONFIG_EXT_GROUP_SCHED */ 7505 + [SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED, 7506 + [SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED, 7507 + [SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED, 7508 + [SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED, 7509 + [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED, 7510 + [SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED, 7511 + }; 7512 + 7513 + /* 7514 + * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the 7515 + * .filter field on each per-group btf_kfunc_id_set. The BPF core invokes this 7516 + * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or 7517 + * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the 7518 + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. 7519 + * scx_kfunc_ids_any) by falling through to "allow" when none of the 7520 + * context-sensitive sets contain the kfunc. 
7521 + */ 7522 + int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) 7523 + { 7524 + bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id); 7525 + bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id); 7526 + bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); 7527 + bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); 7528 + bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); 7529 + u32 moff, flags; 7530 + 7531 + /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ 7532 + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) 7533 + return 0; 7534 + 7535 + /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ 7536 + if (prog->type == BPF_PROG_TYPE_SYSCALL) 7537 + return (in_unlocked || in_select_cpu) ? 0 : -EACCES; 7538 + 7539 + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) 7540 + return -EACCES; 7541 + 7542 + /* 7543 + * add_subprog_and_kfunc() collects all kfunc calls, including dead code 7544 + * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets 7545 + * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set; 7546 + * do_check_main() re-runs the filter with st_ops set and enforces the 7547 + * actual restrictions. 7548 + */ 7549 + if (!prog->aux->st_ops) 7550 + return 0; 7551 + 7552 + /* 7553 + * Non-SCX struct_ops: only unlocked kfuncs are safe. The other 7554 + * context-sensitive kfuncs assume the rq lock is held by the SCX 7555 + * dispatch path, which doesn't apply to other struct_ops users. 7556 + */ 7557 + if (prog->aux->st_ops != &bpf_sched_ext_ops) 7558 + return in_unlocked ? 0 : -EACCES; 7559 + 7560 + /* SCX struct_ops: check the per-op allow list. */ 7561 + moff = prog->aux->attach_st_ops_member_off; 7562 + flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; 7563 + 7564 + if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked) 7565 + return 0; 7566 + if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release) 7567 + return 0; 7568 + if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch) 7569 + return 0; 7570 + if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue) 7571 + return 0; 7572 + if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu) 7573 + return 0; 7574 + 7575 + return -EACCES; 7576 + } 9483 7577 9484 7578 static int __init scx_init(void) 9485 7579 { ··· 9599 7475 * register_btf_kfunc_id_set() needs most of the system to be up. 9600 7476 * 9601 7477 * Some kfuncs are context-sensitive and can only be called from 9602 - * specific SCX ops. They are grouped into BTF sets accordingly. 9603 - * Unfortunately, BPF currently doesn't have a way of enforcing such 9604 - * restrictions. Eventually, the verifier should be able to enforce 9605 - * them. For now, register them the same and make each kfunc explicitly 9606 - * check using scx_kf_allowed(). 7478 + * specific SCX ops. They are grouped into per-context BTF sets, each 7479 + * registered with scx_kfunc_context_filter as its .filter callback. The 7480 + * BPF core dedups identical filter pointers per hook 7481 + * (btf_populate_kfunc_set()), so the filter is invoked exactly once per 7482 + * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op 7483 + * restrictions at verify time. 9607 7484 */ 9608 7485 if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, 9609 7486 &scx_kfunc_set_enqueue_dispatch)) ||
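A note on usage: the @aux parameters above are hidden implicit arguments, so BPF programs keep calling these kfuncs with their original argument lists. The sketch below is illustrative only; it assumes the usual scx/common.bpf.h kfunc declarations and a hypothetical scheduler name, and shows how ops.cpu_release() might use the generalized scx_bpf_dsq_reenq(). Since the kfunc is registered in scx_kfunc_ids_any, the verifier-time filter allows it from any op.

/*
 * Illustrative sketch, not part of the diff: when a higher-priority sched
 * class takes over a CPU, push everything still queued on that CPU's local
 * DSQ back to the BPF scheduler through ops.enqueue(). The implicit @aux
 * argument is supplied by the kernel and never written in BPF code.
 */
void BPF_STRUCT_OPS(example_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
	/*
	 * SCX_DSQ_LOCAL names the caller's CPU, i.e. the CPU being released.
	 * SCX_REENQ_ANY re-enqueues every queued task; leaving the filter
	 * bits clear defaults to the same behavior.
	 */
	scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, SCX_REENQ_ANY);
}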
+2 -2
kernel/sched/ext.h
··· 11 11 void scx_tick(struct rq *rq); 12 12 void init_scx_entity(struct sched_ext_entity *scx); 13 13 void scx_pre_fork(struct task_struct *p); 14 - int scx_fork(struct task_struct *p); 14 + int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs); 15 15 void scx_post_fork(struct task_struct *p); 16 16 void scx_cancel_fork(struct task_struct *p); 17 17 bool scx_can_stop_tick(struct rq *rq); ··· 44 44 45 45 static inline void scx_tick(struct rq *rq) {} 46 46 static inline void scx_pre_fork(struct task_struct *p) {} 47 - static inline int scx_fork(struct task_struct *p) { return 0; } 47 + static inline int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) { return 0; } 48 48 static inline void scx_post_fork(struct task_struct *p) {} 49 49 static inline void scx_cancel_fork(struct task_struct *p) {} 50 50 static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
+132 -67
kernel/sched/ext_idle.c
··· 368 368 369 369 /* 370 370 * Enable NUMA optimization only when there are multiple NUMA domains 371 - * among the online CPUs and the NUMA domains don't perfectly overlaps 371 + * among the online CPUs and the NUMA domains don't perfectly overlap 372 372 * with the LLC domains. 373 373 * 374 374 * If all CPUs belong to the same NUMA node and the same LLC domain, ··· 424 424 * - prefer the last used CPU to take advantage of cached data (L1, L2) and 425 425 * branch prediction optimizations. 426 426 * 427 - * 3. Pick a CPU within the same LLC (Last-Level Cache): 427 + * 3. Prefer @prev_cpu's SMT sibling: 428 + * - if @prev_cpu is busy and no fully idle core is available, try to 429 + * place the task on an idle SMT sibling of @prev_cpu; keeping the 430 + * task on the same core makes migration cheaper, preserves L1 cache 431 + * locality and reduces wakeup latency. 432 + * 433 + * 4. Pick a CPU within the same LLC (Last-Level Cache): 428 434 * - if the above conditions aren't met, pick a CPU that shares the same 429 435 * LLC, if the LLC domain is a subset of @cpus_allowed, to maintain 430 436 * cache locality. 431 437 * 432 - * 4. Pick a CPU within the same NUMA node, if enabled: 438 + * 5. Pick a CPU within the same NUMA node, if enabled: 433 439 * - choose a CPU from the same NUMA node, if the node cpumask is a 434 440 * subset of @cpus_allowed, to reduce memory access latency. 435 441 * 436 - * 5. Pick any idle CPU within the @cpus_allowed domain. 442 + * 6. Pick any idle CPU within the @cpus_allowed domain. 437 443 * 438 - * Step 3 and 4 are performed only if the system has, respectively, 444 + * Step 4 and 5 are performed only if the system has, respectively, 439 445 * multiple LLCs / multiple NUMA nodes (see scx_selcpu_topo_llc and 440 446 * scx_selcpu_topo_numa) and they don't contain the same subset of CPUs. 441 447 * ··· 622 616 goto out_unlock; 623 617 } 624 618 619 + #ifdef CONFIG_SCHED_SMT 620 + /* 621 + * Use @prev_cpu's sibling if it's idle. 622 + */ 623 + if (sched_smt_active()) { 624 + for_each_cpu_and(cpu, cpu_smt_mask(prev_cpu), allowed) { 625 + if (cpu == prev_cpu) 626 + continue; 627 + if (scx_idle_test_and_clear_cpu(cpu)) 628 + goto out_unlock; 629 + } 630 + } 631 + #endif 632 + 625 633 /* 626 634 * Search for any idle CPU in the same LLC domain. 627 635 */ ··· 787 767 * either enqueue() sees the idle bit or update_idle() sees the task 788 768 * that enqueue() queued. 789 769 */ 790 - if (SCX_HAS_OP(sch, update_idle) && do_notify && !scx_rq_bypassing(rq)) 791 - SCX_CALL_OP(sch, SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); 770 + if (SCX_HAS_OP(sch, update_idle) && do_notify && 771 + !scx_bypassing(sch, cpu_of(rq))) 772 + SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle); 792 773 } 793 774 794 775 static void reset_idle_masks(struct sched_ext_ops *ops) ··· 913 892 s32 prev_cpu, u64 wake_flags, 914 893 const struct cpumask *allowed, u64 flags) 915 894 { 916 - struct rq *rq; 917 - struct rq_flags rf; 895 + unsigned long irq_flags; 896 + bool we_locked = false; 918 897 s32 cpu; 919 898 920 899 if (!ops_cpu_valid(sch, prev_cpu, NULL)) ··· 924 903 return -EBUSY; 925 904 926 905 /* 927 - * If called from an unlocked context, acquire the task's rq lock, 928 - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. 906 + * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq 907 + * lock or @p's pi_lock. Three cases: 929 908 * 930 - * Otherwise, allow to use this kfunc only from ops.select_cpu() 931 - * and ops.select_enqueue(). 
909 + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. 910 + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. 911 + * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): 912 + * nothing held, take pi_lock ourselves. 932 913 */ 933 - if (scx_kf_allowed_if_unlocked()) { 934 - rq = task_rq_lock(p, &rf); 935 - } else { 936 - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) 937 - return -EPERM; 938 - rq = scx_locked_rq(); 939 - } 940 - 941 - /* 942 - * Validate locking correctness to access p->cpus_ptr and 943 - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; 944 - * otherwise, assert that p->pi_lock is held. 945 - */ 946 - if (!rq) 914 + if (this_rq()->scx.in_select_cpu) { 947 915 lockdep_assert_held(&p->pi_lock); 916 + } else if (!scx_locked_rq()) { 917 + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); 918 + we_locked = true; 919 + } 948 920 949 921 /* 950 922 * This may also be called from ops.enqueue(), so we need to handle ··· 956 942 allowed ?: p->cpus_ptr, flags); 957 943 } 958 944 959 - if (scx_kf_allowed_if_unlocked()) 960 - task_rq_unlock(rq, p, &rf); 945 + if (we_locked) 946 + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); 961 947 962 948 return cpu; 963 949 } ··· 966 952 * scx_bpf_cpu_node - Return the NUMA node the given @cpu belongs to, or 967 953 * trigger an error if @cpu is invalid 968 954 * @cpu: target CPU 955 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 969 956 */ 970 - __bpf_kfunc int scx_bpf_cpu_node(s32 cpu) 957 + __bpf_kfunc s32 scx_bpf_cpu_node(s32 cpu, const struct bpf_prog_aux *aux) 971 958 { 972 959 struct scx_sched *sch; 973 960 974 961 guard(rcu)(); 975 962 976 - sch = rcu_dereference(scx_root); 963 + sch = scx_prog_sched(aux); 977 964 if (unlikely(!sch) || !ops_cpu_valid(sch, cpu, NULL)) 978 965 return NUMA_NO_NODE; 979 966 return cpu_to_node(cpu); ··· 986 971 * @prev_cpu: CPU @p was on previously 987 972 * @wake_flags: %SCX_WAKE_* flags 988 973 * @is_idle: out parameter indicating whether the returned CPU is idle 974 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 989 975 * 990 976 * Can be called from ops.select_cpu(), ops.enqueue(), or from an unlocked 991 977 * context such as a BPF test_run() call, as long as built-in CPU selection ··· 997 981 * currently idle and thus a good candidate for direct dispatching. 998 982 */ 999 983 __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, 1000 - u64 wake_flags, bool *is_idle) 984 + u64 wake_flags, bool *is_idle, 985 + const struct bpf_prog_aux *aux) 1001 986 { 1002 987 struct scx_sched *sch; 1003 988 s32 cpu; 1004 989 1005 990 guard(rcu)(); 1006 991 1007 - sch = rcu_dereference(scx_root); 992 + sch = scx_prog_sched(aux); 1008 993 if (unlikely(!sch)) 1009 994 return -ENODEV; 1010 995 ··· 1033 1016 * @args->prev_cpu: CPU @p was on previously 1034 1017 * @args->wake_flags: %SCX_WAKE_* flags 1035 1018 * @args->flags: %SCX_PICK_IDLE* flags 1019 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1036 1020 * 1037 1021 * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument 1038 1022 * limit. 
BPF programs should use scx_bpf_select_cpu_and() which is provided ··· 1052 1034 */ 1053 1035 __bpf_kfunc s32 1054 1036 __scx_bpf_select_cpu_and(struct task_struct *p, const struct cpumask *cpus_allowed, 1055 - struct scx_bpf_select_cpu_and_args *args) 1037 + struct scx_bpf_select_cpu_and_args *args, 1038 + const struct bpf_prog_aux *aux) 1056 1039 { 1057 1040 struct scx_sched *sch; 1058 1041 1059 1042 guard(rcu)(); 1060 1043 1061 - sch = rcu_dereference(scx_root); 1044 + sch = scx_prog_sched(aux); 1062 1045 if (unlikely(!sch)) 1063 1046 return -ENODEV; 1064 1047 ··· 1081 1062 if (unlikely(!sch)) 1082 1063 return -ENODEV; 1083 1064 1065 + #ifdef CONFIG_EXT_SUB_SCHED 1066 + /* 1067 + * Disallow if any sub-scheds are attached. There is no way to tell 1068 + * which scheduler called us, just error out @p's scheduler. 1069 + */ 1070 + if (unlikely(!list_empty(&sch->children))) { 1071 + scx_error(scx_task_sched(p), "__scx_bpf_select_cpu_and() must be used"); 1072 + return -EINVAL; 1073 + } 1074 + #endif 1075 + 1084 1076 return select_cpu_from_kfunc(sch, p, prev_cpu, wake_flags, 1085 1077 cpus_allowed, flags); 1086 1078 } ··· 1100 1070 * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the 1101 1071 * idle-tracking per-CPU cpumask of a target NUMA node. 1102 1072 * @node: target NUMA node 1073 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1103 1074 * 1104 1075 * Returns an empty cpumask if idle tracking is not enabled, if @node is 1105 1076 * not valid, or running on a UP kernel. In this case the actual error will 1106 1077 * be reported to the BPF scheduler via scx_error(). 1107 1078 */ 1108 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) 1079 + __bpf_kfunc const struct cpumask * 1080 + scx_bpf_get_idle_cpumask_node(s32 node, const struct bpf_prog_aux *aux) 1109 1081 { 1110 1082 struct scx_sched *sch; 1111 1083 1112 1084 guard(rcu)(); 1113 1085 1114 - sch = rcu_dereference(scx_root); 1086 + sch = scx_prog_sched(aux); 1115 1087 if (unlikely(!sch)) 1116 1088 return cpu_none_mask; 1117 1089 ··· 1127 1095 /** 1128 1096 * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking 1129 1097 * per-CPU cpumask. 1098 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1130 1099 * 1131 1100 * Returns an empty mask if idle tracking is not enabled, or running on a 1132 1101 * UP kernel. 1133 1102 */ 1134 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) 1103 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(const struct bpf_prog_aux *aux) 1135 1104 { 1136 1105 struct scx_sched *sch; 1137 1106 1138 1107 guard(rcu)(); 1139 1108 1140 - sch = rcu_dereference(scx_root); 1109 + sch = scx_prog_sched(aux); 1141 1110 if (unlikely(!sch)) 1142 1111 return cpu_none_mask; 1143 1112 ··· 1158 1125 * idle-tracking, per-physical-core cpumask of a target NUMA node. Can be 1159 1126 * used to determine if an entire physical core is free. 1160 1127 * @node: target NUMA node 1128 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1161 1129 * 1162 1130 * Returns an empty cpumask if idle tracking is not enabled, if @node is 1163 1131 * not valid, or running on a UP kernel. In this case the actual error will 1164 1132 * be reported to the BPF scheduler via scx_error(). 
1165 1133 */ 1166 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) 1134 + __bpf_kfunc const struct cpumask * 1135 + scx_bpf_get_idle_smtmask_node(s32 node, const struct bpf_prog_aux *aux) 1167 1136 { 1168 1137 struct scx_sched *sch; 1169 1138 1170 1139 guard(rcu)(); 1171 1140 1172 - sch = rcu_dereference(scx_root); 1141 + sch = scx_prog_sched(aux); 1173 1142 if (unlikely(!sch)) 1174 1143 return cpu_none_mask; 1175 1144 ··· 1189 1154 * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, 1190 1155 * per-physical-core cpumask. Can be used to determine if an entire physical 1191 1156 * core is free. 1157 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1192 1158 * 1193 1159 * Returns an empty mask if idle tracking is not enabled, or running on a 1194 1160 * UP kernel. 1195 1161 */ 1196 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) 1162 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(const struct bpf_prog_aux *aux) 1197 1163 { 1198 1164 struct scx_sched *sch; 1199 1165 1200 1166 guard(rcu)(); 1201 1167 1202 - sch = rcu_dereference(scx_root); 1168 + sch = scx_prog_sched(aux); 1203 1169 if (unlikely(!sch)) 1204 1170 return cpu_none_mask; 1205 1171 ··· 1236 1200 /** 1237 1201 * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state 1238 1202 * @cpu: cpu to test and clear idle for 1203 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1239 1204 * 1240 1205 * Returns %true if @cpu was idle and its idle state was successfully cleared. 1241 1206 * %false otherwise. ··· 1244 1207 * Unavailable if ops.update_idle() is implemented and 1245 1208 * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. 1246 1209 */ 1247 - __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) 1210 + __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu, const struct bpf_prog_aux *aux) 1248 1211 { 1249 1212 struct scx_sched *sch; 1250 1213 1251 1214 guard(rcu)(); 1252 1215 1253 - sch = rcu_dereference(scx_root); 1216 + sch = scx_prog_sched(aux); 1254 1217 if (unlikely(!sch)) 1255 1218 return false; 1256 1219 ··· 1268 1231 * @cpus_allowed: Allowed cpumask 1269 1232 * @node: target NUMA node 1270 1233 * @flags: %SCX_PICK_IDLE_* flags 1234 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1271 1235 * 1272 1236 * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. 1273 1237 * ··· 1284 1246 * %SCX_OPS_BUILTIN_IDLE_PER_NODE is not set. 1285 1247 */ 1286 1248 __bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, 1287 - int node, u64 flags) 1249 + s32 node, u64 flags, 1250 + const struct bpf_prog_aux *aux) 1288 1251 { 1289 1252 struct scx_sched *sch; 1290 1253 1291 1254 guard(rcu)(); 1292 1255 1293 - sch = rcu_dereference(scx_root); 1256 + sch = scx_prog_sched(aux); 1294 1257 if (unlikely(!sch)) 1295 1258 return -ENODEV; 1296 1259 ··· 1306 1267 * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu 1307 1268 * @cpus_allowed: Allowed cpumask 1308 1269 * @flags: %SCX_PICK_IDLE_CPU_* flags 1270 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1309 1271 * 1310 1272 * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu 1311 1273 * number on success. -%EBUSY if no matching cpu was found. ··· 1326 1286 * scx_bpf_pick_idle_cpu_node() instead. 
1327 1287 */ 1328 1288 __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, 1329 - u64 flags) 1289 + u64 flags, const struct bpf_prog_aux *aux) 1330 1290 { 1331 1291 struct scx_sched *sch; 1332 1292 1333 1293 guard(rcu)(); 1334 1294 1335 - sch = rcu_dereference(scx_root); 1295 + sch = scx_prog_sched(aux); 1336 1296 if (unlikely(!sch)) 1337 1297 return -ENODEV; 1338 1298 ··· 1353 1313 * @cpus_allowed: Allowed cpumask 1354 1314 * @node: target NUMA node 1355 1315 * @flags: %SCX_PICK_IDLE_CPU_* flags 1316 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1356 1317 * 1357 1318 * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any 1358 1319 * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu ··· 1370 1329 * CPU. 1371 1330 */ 1372 1331 __bpf_kfunc s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, 1373 - int node, u64 flags) 1332 + s32 node, u64 flags, 1333 + const struct bpf_prog_aux *aux) 1374 1334 { 1375 1335 struct scx_sched *sch; 1376 1336 s32 cpu; 1377 1337 1378 1338 guard(rcu)(); 1379 1339 1380 - sch = rcu_dereference(scx_root); 1340 + sch = scx_prog_sched(aux); 1381 1341 if (unlikely(!sch)) 1382 1342 return -ENODEV; 1383 1343 ··· 1404 1362 * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU 1405 1363 * @cpus_allowed: Allowed cpumask 1406 1364 * @flags: %SCX_PICK_IDLE_CPU_* flags 1365 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1407 1366 * 1408 1367 * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any 1409 1368 * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu ··· 1419 1376 * scx_bpf_pick_any_cpu_node() instead. 1420 1377 */ 1421 1378 __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, 1422 - u64 flags) 1379 + u64 flags, const struct bpf_prog_aux *aux) 1423 1380 { 1424 1381 struct scx_sched *sch; 1425 1382 s32 cpu; 1426 1383 1427 1384 guard(rcu)(); 1428 1385 1429 - sch = rcu_dereference(scx_root); 1386 + sch = scx_prog_sched(aux); 1430 1387 if (unlikely(!sch)) 1431 1388 return -ENODEV; 1432 1389 ··· 1451 1408 __bpf_kfunc_end_defs(); 1452 1409 1453 1410 BTF_KFUNCS_START(scx_kfunc_ids_idle) 1454 - BTF_ID_FLAGS(func, scx_bpf_cpu_node) 1455 - BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE) 1456 - BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE) 1457 - BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE) 1458 - BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE) 1411 + BTF_ID_FLAGS(func, scx_bpf_cpu_node, KF_IMPLICIT_ARGS) 1412 + BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1413 + BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1414 + BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1415 + BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1459 1416 BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE) 1460 - BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle) 1461 - BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU) 1462 - BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) 1463 - BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU) 1464 - BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) 1465 - BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_RCU) 1466 - BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 1467 - BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU) 1417 + BTF_ID_FLAGS(func, 
scx_bpf_test_and_clear_cpu_idle, KF_IMPLICIT_ARGS) 1418 + BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) 1419 + BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU) 1420 + BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) 1421 + BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU) 1468 1422 BTF_KFUNCS_END(scx_kfunc_ids_idle) 1469 1423 1470 1424 static const struct btf_kfunc_id_set scx_kfunc_set_idle = { 1471 1425 .owner = THIS_MODULE, 1472 1426 .set = &scx_kfunc_ids_idle, 1427 + }; 1428 + 1429 + /* 1430 + * The select_cpu kfuncs internally call task_rq_lock() when invoked from an 1431 + * rq-unlocked context, and thus cannot be safely called from arbitrary tracing 1432 + * contexts where @p's pi_lock state is unknown. Keep them out of 1433 + * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed 1434 + * only to STRUCT_OPS and SYSCALL programs. 1435 + * 1436 + * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because 1437 + * they're callable from unlocked contexts in addition to ops.select_cpu() and 1438 + * ops.enqueue(). 1439 + */ 1440 + BTF_KFUNCS_START(scx_kfunc_ids_select_cpu) 1441 + BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) 1442 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 1443 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) 1444 + BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) 1445 + 1446 + static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { 1447 + .owner = THIS_MODULE, 1448 + .set = &scx_kfunc_ids_select_cpu, 1449 + .filter = scx_kfunc_context_filter, 1473 1450 }; 1474 1451 1475 1452 int scx_idle_init(void) ··· 1498 1435 1499 1436 ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) || 1500 1437 register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) || 1501 - register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle); 1438 + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle) || 1439 + register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) || 1440 + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_select_cpu); 1502 1441 1503 1442 return ret; 1504 1443 }
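For reference, a minimal sketch of how a BPF scheduler typically consumes the idle-selection kfuncs above (assuming the usual scx/common.bpf.h declarations; the scheduler name is hypothetical). The built-in path now also probes @prev_cpu's idle SMT sibling before widening the search to the LLC and NUMA node:

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p, s32 prev_cpu,
		   u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* an idle CPU was claimed; dispatch directly to its local DSQ */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}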
+2
kernel/sched/ext_idle.h
··· 12 12 13 13 struct sched_ext_ops; 14 14 15 + extern struct btf_id_set8 scx_kfunc_ids_select_cpu; 16 + 15 17 void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); 16 18 void scx_idle_init_masks(void); 17 19
+321 -23
kernel/sched/ext_internal.h
··· 6 6 * Copyright (c) 2025 Tejun Heo <tj@kernel.org> 7 7 */ 8 8 #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) 9 + #define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) 9 10 10 11 enum scx_consts { 11 12 SCX_DSP_DFL_MAX_BATCH = 32, ··· 25 24 */ 26 25 SCX_TASK_ITER_BATCH = 32, 27 26 27 + SCX_BYPASS_HOST_NTH = 2, 28 + 28 29 SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC, 29 30 SCX_BYPASS_LB_DONOR_PCT = 125, 30 31 SCX_BYPASS_LB_MIN_DELTA_DIV = 4, 31 32 SCX_BYPASS_LB_BATCH = 256, 33 + 34 + SCX_REENQ_LOCAL_MAX_REPEAT = 256, 35 + 36 + SCX_SUB_MAX_DEPTH = 4, 32 37 }; 33 38 34 39 enum scx_exit_kind { ··· 45 38 SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ 46 39 SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ 47 40 SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */ 41 + SCX_EXIT_PARENT, /* parent exiting */ 48 42 49 43 SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ 50 44 SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ ··· 70 62 enum scx_exit_code { 71 63 /* Reasons */ 72 64 SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, 65 + SCX_ECODE_RSN_CGROUP_OFFLINE = 2LLU << 32, 73 66 74 67 /* Actions */ 75 68 SCX_ECODE_ACT_RESTART = 1LLU << 48, ··· 184 175 SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 6, 185 176 186 177 /* 187 - * CPU cgroup support flags 178 + * If set, %SCX_ENQ_IMMED is assumed to be set on all local DSQ 179 + * enqueues. 188 180 */ 189 - SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* DEPRECATED, will be removed on 6.18 */ 181 + SCX_OPS_ALWAYS_ENQ_IMMED = 1LLU << 7, 190 182 191 183 SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | 192 184 SCX_OPS_ENQ_LAST | ··· 196 186 SCX_OPS_ALLOW_QUEUED_WAKEUP | 197 187 SCX_OPS_SWITCH_PARTIAL | 198 188 SCX_OPS_BUILTIN_IDLE_PER_NODE | 199 - SCX_OPS_HAS_CGROUP_WEIGHT, 189 + SCX_OPS_ALWAYS_ENQ_IMMED, 200 190 201 191 /* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */ 202 192 __SCX_OPS_INTERNAL_MASK = 0xffLLU << 56, ··· 223 213 bool cancelled; 224 214 }; 225 215 226 - /* argument container for ops->cgroup_init() */ 216 + /* argument container for ops.cgroup_init() */ 227 217 struct scx_cgroup_init_args { 228 218 /* the weight of the cgroup [1..10000] */ 229 219 u32 weight; ··· 246 236 }; 247 237 248 238 /* 249 - * Argument container for ops->cpu_acquire(). Currently empty, but may be 239 + * Argument container for ops.cpu_acquire(). Currently empty, but may be 250 240 * expanded in the future. 251 241 */ 252 242 struct scx_cpu_acquire_args {}; 253 243 254 - /* argument container for ops->cpu_release() */ 244 + /* argument container for ops.cpu_release() */ 255 245 struct scx_cpu_release_args { 256 246 /* the reason the CPU was preempted */ 257 247 enum scx_cpu_preempt_reason reason; ··· 260 250 struct task_struct *task; 261 251 }; 262 252 263 - /* 264 - * Informational context provided to dump operations. 
265 - */ 253 + /* informational context provided to dump operations */ 266 254 struct scx_dump_ctx { 267 255 enum scx_exit_kind kind; 268 256 s64 exit_code; 269 257 const char *reason; 270 258 u64 at_ns; 271 259 u64 at_jiffies; 260 + }; 261 + 262 + /* argument container for ops.sub_attach() */ 263 + struct scx_sub_attach_args { 264 + struct sched_ext_ops *ops; 265 + char *cgroup_path; 266 + }; 267 + 268 + /* argument container for ops.sub_detach() */ 269 + struct scx_sub_detach_args { 270 + struct sched_ext_ops *ops; 271 + char *cgroup_path; 272 272 }; 273 273 274 274 /** ··· 741 721 742 722 #endif /* CONFIG_EXT_GROUP_SCHED */ 743 723 724 + /** 725 + * @sub_attach: Attach a sub-scheduler 726 + * @args: argument container, see the struct definition 727 + * 728 + * Return 0 to accept the sub-scheduler. -errno to reject. 729 + */ 730 + s32 (*sub_attach)(struct scx_sub_attach_args *args); 731 + 732 + /** 733 + * @sub_detach: Detach a sub-scheduler 734 + * @args: argument container, see the struct definition 735 + */ 736 + void (*sub_detach)(struct scx_sub_detach_args *args); 737 + 744 738 /* 745 739 * All online ops must come before ops.cpu_online(). 746 740 */ ··· 796 762 */ 797 763 void (*exit)(struct scx_exit_info *info); 798 764 765 + /* 766 + * Data fields must comes after all ops fields. 767 + */ 768 + 799 769 /** 800 770 * @dispatch_max_batch: Max nr of tasks that dispatch() can dispatch 801 771 */ ··· 835 797 u64 hotplug_seq; 836 798 837 799 /** 800 + * @cgroup_id: When >1, attach the scheduler as a sub-scheduler on the 801 + * specified cgroup. 802 + */ 803 + u64 sub_cgroup_id; 804 + 805 + /** 838 806 * @name: BPF scheduler's name 839 807 * 840 808 * Must be a non-zero valid BPF object name including only isalnum(), ··· 850 806 char name[SCX_OPS_NAME_LEN]; 851 807 852 808 /* internal use only, must be NULL */ 853 - void *priv; 809 + void __rcu *priv; 854 810 }; 855 811 856 812 enum scx_opi { ··· 898 854 s64 SCX_EV_ENQ_SKIP_MIGRATION_DISABLED; 899 855 900 856 /* 857 + * The number of times a task, enqueued on a local DSQ with 858 + * SCX_ENQ_IMMED, was re-enqueued because the CPU was not available for 859 + * immediate execution. 860 + */ 861 + s64 SCX_EV_REENQ_IMMED; 862 + 863 + /* 864 + * The number of times a reenq of local DSQ caused another reenq of 865 + * local DSQ. This can happen when %SCX_ENQ_IMMED races against a higher 866 + * priority class task even if the BPF scheduler always satisfies the 867 + * prerequisites for %SCX_ENQ_IMMED at the time of enqueue. However, 868 + * that scenario is very unlikely and this count going up regularly 869 + * indicates that the BPF scheduler is handling %SCX_ENQ_REENQ 870 + * incorrectly causing recursive reenqueues. 871 + */ 872 + s64 SCX_EV_REENQ_LOCAL_REPEAT; 873 + 874 + /* 901 875 * Total number of times a task's time slice was refilled with the 902 876 * default value (SCX_SLICE_DFL). 903 877 */ ··· 935 873 * The number of times the bypassing mode has been activated. 936 874 */ 937 875 s64 SCX_EV_BYPASS_ACTIVATE; 876 + 877 + /* 878 + * The number of times the scheduler attempted to insert a task that it 879 + * doesn't own into a DSQ. Such attempts are ignored. 880 + * 881 + * As BPF schedulers are allowed to ignore dequeues, it's difficult to 882 + * tell whether such an attempt is from a scheduler malfunction or an 883 + * ignored dequeue around sub-sched enabling. If this count keeps going 884 + * up regardless of sub-sched enabling, it likely indicates a bug in the 885 + * scheduler. 
886 + */ 887 + s64 SCX_EV_INSERT_NOT_OWNED; 888 + 889 + /* 890 + * The number of times tasks from bypassing descendants are scheduled 891 + * from sub_bypass_dsq's. 892 + */ 893 + s64 SCX_EV_SUB_BYPASS_DISPATCH; 894 + }; 895 + 896 + struct scx_sched; 897 + 898 + enum scx_sched_pcpu_flags { 899 + SCX_SCHED_PCPU_BYPASSING = 1LLU << 0, 900 + }; 901 + 902 + /* dispatch buf */ 903 + struct scx_dsp_buf_ent { 904 + struct task_struct *task; 905 + unsigned long qseq; 906 + u64 dsq_id; 907 + u64 enq_flags; 908 + }; 909 + 910 + struct scx_dsp_ctx { 911 + struct rq *rq; 912 + u32 cursor; 913 + u32 nr_tasks; 914 + struct scx_dsp_buf_ent buf[]; 915 + }; 916 + 917 + struct scx_deferred_reenq_local { 918 + struct list_head node; 919 + u64 flags; 920 + u64 seq; 921 + u32 cnt; 938 922 }; 939 923 940 924 struct scx_sched_pcpu { 925 + struct scx_sched *sch; 926 + u64 flags; /* protected by rq lock */ 927 + 941 928 /* 942 929 * The event counters are in a per-CPU variable to minimize the 943 930 * accounting overhead. A system-wide view on the event counter is 944 931 * constructed when requested by scx_bpf_events(). 945 932 */ 946 933 struct scx_event_stats event_stats; 934 + 935 + struct scx_deferred_reenq_local deferred_reenq_local; 936 + struct scx_dispatch_q bypass_dsq; 937 + #ifdef CONFIG_EXT_SUB_SCHED 938 + u32 bypass_host_seq; 939 + #endif 940 + 941 + /* must be the last entry - contains flex array */ 942 + struct scx_dsp_ctx dsp_ctx; 943 + }; 944 + 945 + struct scx_sched_pnode { 946 + struct scx_dispatch_q global_dsq; 947 947 }; 948 948 949 949 struct scx_sched { ··· 1021 897 * per-node split isn't sufficient, it can be further split. 1022 898 */ 1023 899 struct rhashtable dsq_hash; 1024 - struct scx_dispatch_q **global_dsqs; 900 + struct scx_sched_pnode **pnode; 1025 901 struct scx_sched_pcpu __percpu *pcpu; 902 + 903 + u64 slice_dfl; 904 + u64 bypass_timestamp; 905 + s32 bypass_depth; 906 + 907 + /* bypass dispatch path enable state, see bypass_dsp_enabled() */ 908 + unsigned long bypass_dsp_claim; 909 + atomic_t bypass_dsp_enable_depth; 910 + 911 + bool aborting; 912 + bool dump_disabled; /* protected by scx_dump_lock */ 913 + u32 dsp_max_batch; 914 + s32 level; 1026 915 1027 916 /* 1028 917 * Updates to the following warned bitfields can race causing RMW issues ··· 1043 906 */ 1044 907 bool warned_zero_slice:1; 1045 908 bool warned_deprecated_rq:1; 909 + bool warned_unassoc_progs:1; 910 + 911 + struct list_head all; 912 + 913 + #ifdef CONFIG_EXT_SUB_SCHED 914 + struct rhash_head hash_node; 915 + 916 + struct list_head children; 917 + struct list_head sibling; 918 + struct cgroup *cgrp; 919 + char *cgrp_path; 920 + struct kset *sub_kset; 921 + 922 + bool sub_attached; 923 + #endif /* CONFIG_EXT_SUB_SCHED */ 924 + 925 + /* 926 + * The maximum amount of time in jiffies that a task may be runnable 927 + * without being scheduled on a CPU. If this timeout is exceeded, it 928 + * will trigger scx_error(). 
929 + */ 930 + unsigned long watchdog_timeout; 1046 931 1047 932 atomic_t exit_kind; 1048 933 struct scx_exit_info *exit_info; ··· 1072 913 struct kobject kobj; 1073 914 1074 915 struct kthread_worker *helper; 1075 - struct irq_work error_irq_work; 916 + struct irq_work disable_irq_work; 1076 917 struct kthread_work disable_work; 918 + struct timer_list bypass_lb_timer; 1077 919 struct rcu_work rcu_work; 920 + 921 + /* all ancestors including self */ 922 + struct scx_sched *ancestors[]; 1078 923 }; 1079 924 1080 925 enum scx_wake_flags { ··· 1105 942 SCX_ENQ_PREEMPT = 1LLU << 32, 1106 943 1107 944 /* 1108 - * The task being enqueued was previously enqueued on the current CPU's 1109 - * %SCX_DSQ_LOCAL, but was removed from it in a call to the 1110 - * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was 1111 - * invoked in a ->cpu_release() callback, and the task is again 1112 - * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the 1113 - * task will not be scheduled on the CPU until at least the next invocation 1114 - * of the ->cpu_acquire() callback. 945 + * Only allowed on local DSQs. Guarantees that the task either gets 946 + * on the CPU immediately and stays on it, or gets reenqueued back 947 + * to the BPF scheduler. It will never linger on a local DSQ or be 948 + * silently put back after preemption. 949 + * 950 + * The protection persists until the next fresh enqueue - it 951 + * survives SAVE/RESTORE cycles, slice extensions and preemption. 952 + * If the task can't stay on the CPU for any reason, it gets 953 + * reenqueued back to the BPF scheduler. 954 + * 955 + * Exiting and migration-disabled tasks bypass ops.enqueue() and 956 + * are placed directly on a local DSQ without IMMED protection 957 + * unless %SCX_OPS_ENQ_EXITING and %SCX_OPS_ENQ_MIGRATION_DISABLED 958 + * are set respectively. 959 + */ 960 + SCX_ENQ_IMMED = 1LLU << 33, 961 + 962 + /* 963 + * The task being enqueued was previously enqueued on a DSQ, but was 964 + * removed and is being re-enqueued. See SCX_TASK_REENQ_* flags to find 965 + * out why a given task is being reenqueued. 1115 966 */ 1116 967 SCX_ENQ_REENQ = 1LLU << 40, 1117 968 ··· 1146 969 SCX_ENQ_CLEAR_OPSS = 1LLU << 56, 1147 970 SCX_ENQ_DSQ_PRIQ = 1LLU << 57, 1148 971 SCX_ENQ_NESTED = 1LLU << 58, 972 + SCX_ENQ_GDSQ_FALLBACK = 1LLU << 59, /* fell back to global DSQ */ 1149 973 }; 1150 974 1151 975 enum scx_deq_flags { ··· 1160 982 * it hasn't been dispatched yet. Dequeue from the BPF side. 1161 983 */ 1162 984 SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, 985 + 986 + /* 987 + * The task is being dequeued due to a property change (e.g., 988 + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), 989 + * etc.). 
990 + */ 991 + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, 992 + }; 993 + 994 + enum scx_reenq_flags { 995 + /* low 16bits determine which tasks should be reenqueued */ 996 + SCX_REENQ_ANY = 1LLU << 0, /* all tasks */ 997 + 998 + __SCX_REENQ_FILTER_MASK = 0xffffLLU, 999 + 1000 + __SCX_REENQ_USER_MASK = SCX_REENQ_ANY, 1001 + 1002 + /* bits 32-35 used by task_should_reenq() */ 1003 + SCX_REENQ_TSR_RQ_OPEN = 1LLU << 32, 1004 + SCX_REENQ_TSR_NOT_FIRST = 1LLU << 33, 1005 + 1006 + __SCX_REENQ_TSR_MASK = 0xfLLU << 32, 1163 1007 }; 1164 1008 1165 1009 enum scx_pick_idle_cpu_flags { ··· 1361 1161 #define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1) 1362 1162 #define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK) 1363 1163 1164 + extern struct scx_sched __rcu *scx_root; 1364 1165 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state); 1166 + 1167 + int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id); 1365 1168 1366 1169 /* 1367 1170 * Return the rq currently locked from an scx callback, or NULL if no rq is ··· 1375 1172 return __this_cpu_read(scx_locked_rq_state); 1376 1173 } 1377 1174 1378 - static inline bool scx_kf_allowed_if_unlocked(void) 1175 + static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) 1379 1176 { 1380 - return !current->scx.kf_mask; 1177 + return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & 1178 + SCX_SCHED_PCPU_BYPASSING); 1381 1179 } 1382 1180 1383 - static inline bool scx_rq_bypassing(struct rq *rq) 1181 + #ifdef CONFIG_EXT_SUB_SCHED 1182 + /** 1183 + * scx_task_sched - Find scx_sched scheduling a task 1184 + * @p: task of interest 1185 + * 1186 + * Return @p's scheduler instance. Must be called with @p's pi_lock or rq lock 1187 + * held. 1188 + */ 1189 + static inline struct scx_sched *scx_task_sched(const struct task_struct *p) 1384 1190 { 1385 - return unlikely(rq->scx.flags & SCX_RQ_BYPASSING); 1191 + return rcu_dereference_protected(p->scx.sched, 1192 + lockdep_is_held(&p->pi_lock) || 1193 + lockdep_is_held(__rq_lockp(task_rq(p)))); 1386 1194 } 1195 + 1196 + /** 1197 + * scx_task_sched_rcu - Find scx_sched scheduling a task 1198 + * @p: task of interest 1199 + * 1200 + * Return @p's scheduler instance. The returned scx_sched is RCU protected. 1201 + */ 1202 + static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) 1203 + { 1204 + return rcu_dereference_all(p->scx.sched); 1205 + } 1206 + 1207 + /** 1208 + * scx_task_on_sched - Is a task on the specified sched? 1209 + * @sch: sched to test against 1210 + * @p: task of interest 1211 + * 1212 + * Returns %true if @p is on @sch, %false otherwise. 1213 + */ 1214 + static inline bool scx_task_on_sched(struct scx_sched *sch, 1215 + const struct task_struct *p) 1216 + { 1217 + return rcu_access_pointer(p->scx.sched) == sch; 1218 + } 1219 + 1220 + /** 1221 + * scx_prog_sched - Find scx_sched associated with a BPF prog 1222 + * @aux: aux passed in from BPF to a kfunc 1223 + * 1224 + * To be called from kfuncs. Return the scheduler instance associated with the 1225 + * BPF program given the implicit kfunc argument aux. The returned scx_sched is 1226 + * RCU protected. 
1227 + */ 1228 + static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) 1229 + { 1230 + struct sched_ext_ops *ops; 1231 + struct scx_sched *root; 1232 + 1233 + ops = bpf_prog_get_assoc_struct_ops(aux); 1234 + if (likely(ops)) 1235 + return rcu_dereference_all(ops->priv); 1236 + 1237 + root = rcu_dereference_all(scx_root); 1238 + if (root) { 1239 + /* 1240 + * COMPAT-v6.19: Schedulers built before sub-sched support was 1241 + * introduced may have unassociated non-struct_ops programs. 1242 + */ 1243 + if (!root->ops.sub_attach) 1244 + return root; 1245 + 1246 + if (!root->warned_unassoc_progs) { 1247 + printk_deferred(KERN_WARNING "sched_ext: Unassociated program %s (id %d)\n", 1248 + aux->name, aux->id); 1249 + root->warned_unassoc_progs = true; 1250 + } 1251 + } 1252 + 1253 + return NULL; 1254 + } 1255 + #else /* CONFIG_EXT_SUB_SCHED */ 1256 + static inline struct scx_sched *scx_task_sched(const struct task_struct *p) 1257 + { 1258 + return rcu_dereference_protected(scx_root, 1259 + lockdep_is_held(&p->pi_lock) || 1260 + lockdep_is_held(__rq_lockp(task_rq(p)))); 1261 + } 1262 + 1263 + static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) 1264 + { 1265 + return rcu_dereference_all(scx_root); 1266 + } 1267 + 1268 + static inline bool scx_task_on_sched(struct scx_sched *sch, 1269 + const struct task_struct *p) 1270 + { 1271 + return true; 1272 + } 1273 + 1274 + static struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) 1275 + { 1276 + return rcu_dereference_all(scx_root); 1277 + } 1278 + #endif /* CONFIG_EXT_SUB_SCHED */
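The SCX_ENQ_IMMED contract is easiest to see from the scheduler side. A hedged sketch follows, assuming the usual scx/common.bpf.h declarations; the scheduler name and SHARED_DSQ (a user DSQ such a scheduler would create in ops.init()) are hypothetical, and the re-enqueue handling is inferred from the flag descriptions above rather than taken from the diff. A scheduler that wants this behavior on every local dispatch can set SCX_OPS_ALWAYS_ENQ_IMMED instead of passing the flag each time.

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * A task coming back with SCX_ENQ_REENQ was pulled off a DSQ and
	 * handed back to the scheduler; park it on a shared user DSQ.
	 */
	if (enq_flags & SCX_ENQ_REENQ) {
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
		return;
	}

	/*
	 * SCX_ENQ_IMMED is only valid for local DSQs: the task either starts
	 * running on the target CPU or is re-enqueued through this callback.
	 */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p),
			   SCX_SLICE_DFL, SCX_ENQ_IMMED);
}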
+9 -3
kernel/sched/sched.h
··· 783 783 SCX_RQ_ONLINE = 1 << 0, 784 784 SCX_RQ_CAN_STOP_TICK = 1 << 1, 785 785 SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */ 786 - SCX_RQ_BYPASSING = 1 << 4, 787 786 SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */ 788 787 SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */ 789 788 ··· 798 799 u64 extra_enq_flags; /* see move_task_to_local_dsq() */ 799 800 u32 nr_running; 800 801 u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ 802 + bool in_select_cpu; 801 803 bool cpu_released; 802 804 u32 flags; 805 + u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ 803 806 u64 clock; /* current per-rq clock -- see scx_bpf_now() */ 804 807 cpumask_var_t cpus_to_kick; 805 808 cpumask_var_t cpus_to_kick_if_idle; ··· 810 809 cpumask_var_t cpus_to_sync; 811 810 bool kick_sync_pending; 812 811 unsigned long kick_sync; 813 - local_t reenq_local_deferred; 812 + 813 + struct task_struct *sub_dispatch_prev; 814 + 815 + raw_spinlock_t deferred_reenq_lock; 816 + u64 deferred_reenq_locals_seq; 817 + struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */ 818 + struct list_head deferred_reenq_users; /* user DSQs requesting reenq */ 814 819 struct balance_callback deferred_bal_cb; 815 820 struct balance_callback kick_sync_bal_cb; 816 821 struct irq_work deferred_irq_work; 817 822 struct irq_work kick_cpus_irq_work; 818 - struct scx_dispatch_q bypass_dsq; 819 823 }; 820 824 #endif /* CONFIG_SCHED_CLASS_EXT */ 821 825
+6 -2
tools/sched_ext/include/scx/bpf_arena_common.bpf.h
··· 15 15 #endif 16 16 17 17 #if defined(__BPF_FEATURE_ADDR_SPACE_CAST) && !defined(BPF_ARENA_FORCE_ASM) 18 + #ifndef __arena 18 19 #define __arena __attribute__((address_space(1))) 20 + #endif 19 21 #define __arena_global __attribute__((address_space(1))) 20 22 #define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */ 21 23 #define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */ ··· 83 81 void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt, 84 82 int node_id, __u64 flags) __ksym __weak; 85 83 void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; 84 + int bpf_arena_reserve_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; 86 85 87 86 /* 88 87 * Note that cond_break can only be portably used in the body of a breakable 89 88 * construct, whereas can_loop can be used anywhere. 90 89 */ 91 - #ifdef TEST 90 + #ifdef SCX_BPF_UNITTEST 92 91 #define can_loop true 93 92 #define __cond_break(expr) expr 94 93 #else ··· 168 165 }) 169 166 #endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */ 170 167 #endif /* __BPF_FEATURE_MAY_GOTO */ 171 - #endif /* TEST */ 168 + #endif /* SCX_BPF_UNITTEST */ 172 169 173 170 #define cond_break __cond_break(break) 174 171 #define cond_break_label(label) __cond_break(goto label) ··· 176 173 177 174 void bpf_preempt_disable(void) __weak __ksym; 178 175 void bpf_preempt_enable(void) __weak __ksym; 176 + ssize_t bpf_arena_mapping_nr_pages(void *p__map) __weak __ksym;
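The can_loop / cond_break helpers above (their unit-test stubs now gated by SCX_BPF_UNITTEST rather than TEST) exist to keep open-coded loops verifiable. A small illustrative sketch, assuming scx/common.bpf.h declarations for the kfuncs used; the helper itself is not from the diff:

/* Illustrative only: sum the tasks queued on every CPU's local DSQ. */
static __always_inline u32 example_count_local(void)
{
	u32 nr_cpus = scx_bpf_nr_cpu_ids();
	u32 cpu, nr = 0;

	/*
	 * can_loop works in any boolean context; cond_break would also work
	 * here because the for loop is a breakable construct.
	 */
	for (cpu = 0; cpu < nr_cpus && can_loop; cpu++)
		nr += scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu);

	return nr;
}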
+277
tools/sched_ext/include/scx/common.bpf.h
··· 291 291 }) 292 292 #endif /* ARRAY_ELEM_PTR */ 293 293 294 + /** 295 + * __sink - Hide @expr's value from the compiler and BPF verifier 296 + * @expr: The expression whose value should be opacified 297 + * 298 + * No-op at runtime. The empty inline assembly with a read-write constraint 299 + * ("+g") has two effects at compile/verify time: 300 + * 301 + * 1. Compiler: treats @expr as both read and written, preventing dead-code 302 + * elimination and keeping @expr (and any side effects that produced it) 303 + * alive. 304 + * 305 + * 2. BPF verifier: forgets the precise value/range of @expr ("makes it 306 + * imprecise"). The verifier normally tracks exact ranges for every register 307 + * and stack slot. While useful, precision means each distinct value creates a 308 + * separate verifier state. Inside loops this leads to state explosion - each 309 + * iteration carries different precise values so states never merge and the 310 + * verifier explores every iteration individually. 311 + * 312 + * Example - preventing loop state explosion:: 313 + * 314 + * u32 nr_intersects = 0, nr_covered = 0; 315 + * __sink(nr_intersects); 316 + * __sink(nr_covered); 317 + * bpf_for(i, 0, nr_nodes) { 318 + * if (intersects(cpumask, node_mask[i])) 319 + * nr_intersects++; 320 + * if (covers(cpumask, node_mask[i])) 321 + * nr_covered++; 322 + * } 323 + * 324 + * Without __sink(), the verifier tracks every possible (nr_intersects, 325 + * nr_covered) pair across iterations, causing "BPF program is too large". With 326 + * __sink(), the values become unknown scalars so all iterations collapse into 327 + * one reusable state. 328 + * 329 + * Example - keeping a reference alive:: 330 + * 331 + * struct task_struct *t = bpf_task_acquire(task); 332 + * __sink(t); 333 + * 334 + * Follows the convention from BPF selftests (bpf_misc.h). 335 + */ 336 + #define __sink(expr) asm volatile ("" : "+g"(expr)) 337 + 294 338 /* 295 339 * BPF declarations and helpers 296 340 */ ··· 380 336 381 337 /* cgroup */ 382 338 struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym; 339 + struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp) __ksym; 383 340 void bpf_cgroup_release(struct cgroup *cgrp) __ksym; 384 341 struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; 385 342 ··· 787 742 } 788 743 789 744 /* 745 + * ctzll -- Counts trailing zeros in an unsigned long long. If the input value 746 + * is zero, the return value is undefined. 747 + */ 748 + static inline int ctzll(u64 v) 749 + { 750 + #if (!defined(__BPF__) && defined(__SCX_TARGET_ARCH_x86)) || \ 751 + (defined(__BPF__) && defined(__clang_major__) && __clang_major__ >= 19) 752 + /* 753 + * Use the ctz builtin when: (1) building for native x86, or 754 + * (2) building for BPF with clang >= 19 (BPF backend supports 755 + * the intrinsic from clang 19 onward; earlier versions hit 756 + * "unimplemented opcode" in the backend). 757 + */ 758 + return __builtin_ctzll(v); 759 + #else 760 + /* 761 + * If neither the target architecture nor the toolchains support ctzll, 762 + * use software-based emulation. Let's use the De Bruijn sequence-based 763 + * approach to find LSB fastly. 
See the details of De Bruijn sequence: 764 + * 765 + * https://en.wikipedia.org/wiki/De_Bruijn_sequence 766 + * https://www.chessprogramming.org/BitScan#De_Bruijn_Multiplication 767 + */ 768 + const int lookup_table[64] = { 769 + 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, 770 + 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, 771 + 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, 772 + 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6, 773 + }; 774 + const u64 DEBRUIJN_CONSTANT = 0x03f79d71b4cb0a89ULL; 775 + unsigned int index; 776 + u64 lowest_bit; 777 + const int *lt; 778 + 779 + if (v == 0) 780 + return -1; 781 + 782 + /* 783 + * Isolate the least significant bit (LSB). 784 + * For example, if v = 0b...10100, then v & -v = 0b...00100 785 + */ 786 + lowest_bit = v & -v; 787 + 788 + /* 789 + * Each isolated bit produces a unique 6-bit value, guaranteed by the 790 + * De Bruijn property. Calculate a unique index into the lookup table 791 + * using the magic constant and a right shift. 792 + * 793 + * Multiplying by the 64-bit constant "spreads out" that 1-bit into a 794 + * unique pattern in the top 6 bits. This uniqueness property is 795 + * exactly what a De Bruijn sequence guarantees: Every possible 6-bit 796 + * pattern (in top bits) occurs exactly once for each LSB position. So, 797 + * the constant 0x03f79d71b4cb0a89ULL is carefully chosen to be a 798 + * De Bruijn sequence, ensuring no collisions in the table index. 799 + */ 800 + index = (lowest_bit * DEBRUIJN_CONSTANT) >> 58; 801 + 802 + /* 803 + * Lookup in a precomputed table. No collision is guaranteed by the 804 + * De Bruijn property. 805 + */ 806 + lt = MEMBER_VPTR(lookup_table, [index]); 807 + return (lt)? *lt : -1; 808 + #endif 809 + } 810 + 811 + /* 790 812 * Return a value proportionally scaled to the task's weight. 791 813 */ 792 814 static inline u64 scale_by_task_weight(const struct task_struct *p, u64 value) ··· 869 757 return value * 100 / p->scx.weight; 870 758 } 871 759 760 + 761 + /* 762 + * Get a random u64 from the kernel's pseudo-random generator. 763 + */ 764 + static inline u64 get_prandom_u64() 765 + { 766 + return ((u64)bpf_get_prandom_u32() << 32) | bpf_get_prandom_u32(); 767 + } 768 + 769 + /* 770 + * Define the shadow structure to avoid a compilation error when 771 + * vmlinux.h does not enable necessary kernel configs. The ___local 772 + * suffix is a CO-RE convention that tells the loader to match this 773 + * against the base struct rq in the kernel. The attribute 774 + * preserve_access_index tells the compiler to generate a CO-RE 775 + * relocation for these fields. 776 + */ 777 + struct rq___local { 778 + /* 779 + * A monotonically increasing clock per CPU. It is rq->clock minus 780 + * cumulative IRQ time and hypervisor steal time. Unlike rq->clock, 781 + * it does not advance during IRQ processing or hypervisor preemption. 782 + * It does advance during idle (the idle task counts as a running task 783 + * for this purpose). 784 + */ 785 + u64 clock_task; 786 + /* 787 + * Invariant version of clock_task scaled by CPU capacity and 788 + * frequency. For example, clock_pelt advances 2x slower on a CPU 789 + * with half the capacity. 790 + * 791 + * At idle exit, rq->clock_pelt jumps forward to resync with 792 + * clock_task. The kernel's rq_clock_pelt() corrects for this jump 793 + * by subtracting lost_idle_time, yielding a clock that appears 794 + * continuous across idle transitions. 
scx_clock_pelt() mirrors 795 + * rq_clock_pelt() by performing the same subtraction. 796 + */ 797 + u64 clock_pelt; 798 + /* 799 + * Accumulates the magnitude of each clock_pelt jump at idle exit. 800 + * Subtracting this from clock_pelt gives rq_clock_pelt(): a 801 + * continuous, capacity-invariant clock suitable for both task 802 + * execution time stamping and cross-idle measurements. 803 + */ 804 + unsigned long lost_idle_time; 805 + /* 806 + * Shadow of paravirt_steal_clock() (the hypervisor's cumulative 807 + * stolen time counter). Stays frozen while the hypervisor preempts 808 + * the vCPU; catches up the next time update_rq_clock_task() is 809 + * called. The delta is the stolen time not yet subtracted from 810 + * clock_task. 811 + * 812 + * Unlike irqtime->total (a plain kernel-side field), the live stolen 813 + * time counter lives in hypervisor-specific shared memory and has no 814 + * kernel-side equivalent readable from BPF in a hypervisor-agnostic 815 + * way. This field is therefore the only portable BPF-accessible 816 + * approximation of cumulative steal time. 817 + * 818 + * Available only when CONFIG_PARAVIRT_TIME_ACCOUNTING is on. 819 + */ 820 + u64 prev_steal_time_rq; 821 + } __attribute__((preserve_access_index)); 822 + 823 + extern struct rq runqueues __ksym; 824 + 825 + /* 826 + * Define the shadow structure to avoid a compilation error when 827 + * vmlinux.h does not enable necessary kernel configs. 828 + */ 829 + struct irqtime___local { 830 + /* 831 + * Cumulative IRQ time counter for this CPU, in nanoseconds. Advances 832 + * immediately at the exit of every hardirq and non-ksoftirqd softirq 833 + * via irqtime_account_irq(). ksoftirqd time is counted as normal 834 + * task time and is NOT included. NMI time is also NOT included. 835 + * 836 + * The companion field irqtime->sync (struct u64_stats_sync) protects 837 + * against 64-bit tearing on 32-bit architectures. On 64-bit kernels, 838 + * u64_stats_sync is an empty struct and all seqcount operations are 839 + * no-ops, so a plain BPF_CORE_READ of this field is safe. 840 + * 841 + * Available only when CONFIG_IRQ_TIME_ACCOUNTING is on. 842 + */ 843 + u64 total; 844 + } __attribute__((preserve_access_index)); 845 + 846 + /* 847 + * cpu_irqtime is a per-CPU variable defined only when 848 + * CONFIG_IRQ_TIME_ACCOUNTING is on. Declare it as __weak so the BPF 849 + * loader sets its address to 0 (rather than failing) when the symbol 850 + * is absent from the running kernel. 851 + */ 852 + extern struct irqtime___local cpu_irqtime __ksym __weak; 853 + 854 + static inline struct rq___local *get_current_rq(u32 cpu) 855 + { 856 + /* 857 + * This is a workaround to get an rq pointer since we decided to 858 + * deprecate scx_bpf_cpu_rq(). 859 + * 860 + * WARNING: The caller must hold the rq lock for @cpu. This is 861 + * guaranteed when called from scheduling callbacks (ops.running, 862 + * ops.stopping, ops.enqueue, ops.dequeue, ops.dispatch, etc.). 863 + * There is no runtime check available in BPF for kernel spinlock 864 + * state — correctness is enforced by calling context only. 865 + */ 866 + return (void *)bpf_per_cpu_ptr(&runqueues, cpu); 867 + } 868 + 869 + static inline u64 scx_clock_task(u32 cpu) 870 + { 871 + struct rq___local *rq = get_current_rq(cpu); 872 + 873 + /* Equivalent to the kernel's rq_clock_task(). */ 874 + return rq ? 
rq->clock_task : 0; 875 + } 876 + 877 + static inline u64 scx_clock_pelt(u32 cpu) 878 + { 879 + struct rq___local *rq = get_current_rq(cpu); 880 + 881 + /* 882 + * Equivalent to the kernel's rq_clock_pelt(): subtracts 883 + * lost_idle_time from clock_pelt to absorb the jump that occurs 884 + * when clock_pelt resyncs with clock_task at idle exit. The result 885 + * is a continuous, capacity-invariant clock safe for both task 886 + * execution time stamping and cross-idle measurements. 887 + */ 888 + return rq ? (rq->clock_pelt - rq->lost_idle_time) : 0; 889 + } 890 + 891 + static inline u64 scx_clock_virt(u32 cpu) 892 + { 893 + struct rq___local *rq; 894 + 895 + /* 896 + * Check field existence before calling get_current_rq() so we avoid 897 + * the per_cpu lookup entirely on kernels built without 898 + * CONFIG_PARAVIRT_TIME_ACCOUNTING. 899 + */ 900 + if (!bpf_core_field_exists(((struct rq___local *)0)->prev_steal_time_rq)) 901 + return 0; 902 + 903 + /* Lagging shadow of the kernel's paravirt_steal_clock(). */ 904 + rq = get_current_rq(cpu); 905 + return rq ? BPF_CORE_READ(rq, prev_steal_time_rq) : 0; 906 + } 907 + 908 + static inline u64 scx_clock_irq(u32 cpu) 909 + { 910 + struct irqtime___local *irqt; 911 + 912 + /* 913 + * bpf_core_type_exists() resolves at load time: if struct irqtime is 914 + * absent from kernel BTF (CONFIG_IRQ_TIME_ACCOUNTING off), the loader 915 + * patches this into an unconditional return 0, making the 916 + * bpf_per_cpu_ptr() call below dead code that the verifier never sees. 917 + */ 918 + if (!bpf_core_type_exists(struct irqtime___local)) 919 + return 0; 920 + 921 + /* Equivalent to the kernel's irq_time_read(). */ 922 + irqt = bpf_per_cpu_ptr(&cpu_irqtime, cpu); 923 + return irqt ? BPF_CORE_READ(irqt, total) : 0; 924 + } 872 925 873 926 #include "compat.bpf.h" 874 927 #include "enums.bpf.h"
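The clock helpers above are meant to be read from contexts that already hold the target CPU's rq lock. A minimal sketch (hypothetical callbacks and task-storage map, not part of the patch) of snapshotting them from ops.running()/ops.stopping(), where that locking requirement is satisfied:

struct clk_snap {
        u64 task_at;
        u64 irq_at;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct clk_snap);
} clk_snap_stor SEC(".maps");

void BPF_STRUCT_OPS(myscx_running, struct task_struct *p)
{
        struct clk_snap *cs = bpf_task_storage_get(&clk_snap_stor, p, 0,
                                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
        s32 cpu = scx_bpf_task_cpu(p);

        if (!cs)
                return;
        cs->task_at = scx_clock_task(cpu);
        cs->irq_at = scx_clock_irq(cpu);
}

void BPF_STRUCT_OPS(myscx_stopping, struct task_struct *p, bool runnable)
{
        struct clk_snap *cs = bpf_task_storage_get(&clk_snap_stor, p, 0, 0);
        s32 cpu = scx_bpf_task_cpu(p);

        if (!cs)
                return;
        /*
         * clock_task already excludes IRQ and steal time, so the first delta
         * is pure execution time; the IRQ delta is reported separately rather
         * than subtracted again.
         */
        bpf_printk("%s: ran %llu ns, cpu irq %llu ns", p->comm,
                   scx_clock_task(cpu) - cs->task_at,
                   scx_clock_irq(cpu) - cs->irq_at);
}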
+1 -4
tools/sched_ext/include/scx/common.h
··· 67 67 bpf_map__set_value_size((__skel)->maps.elfsec##_##arr, \ 68 68 sizeof((__skel)->elfsec##_##arr->arr[0]) * (n)); \ 69 69 (__skel)->elfsec##_##arr = \ 70 + (typeof((__skel)->elfsec##_##arr)) \ 70 71 bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz); \ 71 72 } while (0) 72 73 ··· 75 74 #include "compat.h" 76 75 #include "enums.h" 77 76 78 - /* not available when building kernel tools/sched_ext */ 79 - #if __has_include(<lib/sdt_task_defs.h>) 80 77 #include "bpf_arena_common.h" 81 - #include <lib/sdt_task_defs.h> 82 - #endif 83 78 84 79 #endif /* __SCHED_EXT_COMMON_H */
+52 -5
tools/sched_ext/include/scx/compat.bpf.h
··· 28 28 * 29 29 * scx_bpf_dispatch_from_dsq() and friends were added during v6.12 by 30 30 * 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()"). 31 + * 32 + * v7.1: scx_bpf_dsq_move_to_local___v2() to add @enq_flags. 31 33 */ 32 - bool scx_bpf_dsq_move_to_local___new(u64 dsq_id) __ksym __weak; 34 + bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags) __ksym __weak; 35 + bool scx_bpf_dsq_move_to_local___v1(u64 dsq_id) __ksym __weak; 33 36 void scx_bpf_dsq_move_set_slice___new(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym __weak; 34 37 void scx_bpf_dsq_move_set_vtime___new(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak; 35 38 bool scx_bpf_dsq_move___new(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; ··· 44 41 bool scx_bpf_dispatch_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 45 42 bool scx_bpf_dispatch_vtime_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 46 43 47 - #define scx_bpf_dsq_move_to_local(dsq_id) \ 48 - (bpf_ksym_exists(scx_bpf_dsq_move_to_local___new) ? \ 49 - scx_bpf_dsq_move_to_local___new((dsq_id)) : \ 50 - scx_bpf_consume___old((dsq_id))) 44 + #define scx_bpf_dsq_move_to_local(dsq_id, enq_flags) \ 45 + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v2) ? \ 46 + scx_bpf_dsq_move_to_local___v2((dsq_id), (enq_flags)) : \ 47 + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v1) ? \ 48 + scx_bpf_dsq_move_to_local___v1((dsq_id)) : \ 49 + scx_bpf_consume___old((dsq_id)))) 51 50 52 51 #define scx_bpf_dsq_move_set_slice(it__iter, slice) \ 53 52 (bpf_ksym_exists(scx_bpf_dsq_move_set_slice___new) ? \ ··· 106 101 p = bpf_iter_scx_dsq_next(&it); 107 102 bpf_iter_scx_dsq_destroy(&it); 108 103 return p; 104 + } 105 + 106 + /* 107 + * v7.1: scx_bpf_sub_dispatch() for sub-sched dispatch. Preserve until 108 + * we drop the compat layer for older kernels that lack the kfunc. 109 + */ 110 + bool scx_bpf_sub_dispatch___compat(u64 cgroup_id) __ksym __weak; 111 + 112 + static inline bool scx_bpf_sub_dispatch(u64 cgroup_id) 113 + { 114 + if (bpf_ksym_exists(scx_bpf_sub_dispatch___compat)) 115 + return scx_bpf_sub_dispatch___compat(cgroup_id); 116 + return false; 109 117 } 110 118 111 119 /** ··· 284 266 } 285 267 } 286 268 269 + /* 270 + * scx_bpf_select_cpu_and() is now an inline wrapper. Use this instead of 271 + * bpf_ksym_exists(scx_bpf_select_cpu_and) to test availability. 272 + */ 273 + #define __COMPAT_HAS_scx_bpf_select_cpu_and \ 274 + (bpf_core_type_exists(struct scx_bpf_select_cpu_and_args) || \ 275 + bpf_ksym_exists(scx_bpf_select_cpu_and___compat)) 276 + 287 277 /** 288 278 * scx_bpf_dsq_insert_vtime - Insert a task into the vtime priority queue of a DSQ 289 279 * @p: task_struct to insert ··· 399 373 scx_bpf_reenqueue_local___v2___compat(); 400 374 else 401 375 scx_bpf_reenqueue_local___v1(); 376 + } 377 + 378 + /* 379 + * v6.20: New scx_bpf_dsq_reenq() that allows re-enqueues on more DSQs. This 380 + * will eventually deprecate scx_bpf_reenqueue_local(). 
381 + */ 382 + void scx_bpf_dsq_reenq___compat(u64 dsq_id, u64 reenq_flags, const struct bpf_prog_aux *aux__prog) __ksym __weak; 383 + 384 + static inline bool __COMPAT_has_generic_reenq(void) 385 + { 386 + return bpf_ksym_exists(scx_bpf_dsq_reenq___compat); 387 + } 388 + 389 + static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags) 390 + { 391 + if (bpf_ksym_exists(scx_bpf_dsq_reenq___compat)) 392 + scx_bpf_dsq_reenq___compat(dsq_id, reenq_flags, NULL); 393 + else if (dsq_id == SCX_DSQ_LOCAL && reenq_flags == 0) 394 + scx_bpf_reenqueue_local(); 395 + else 396 + scx_bpf_error("kernel too old to reenqueue foreign local or user DSQs"); 402 397 } 403 398 404 399 /*
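As a usage sketch of the wrapper above (hypothetical DEFER_DSQ and timer callback, not part of the patch), a scheduler can park tasks on a user DSQ and periodically push them back through ops.enqueue() when the kernel provides the generalized kfunc, degrading to a no-op on older kernels:

#define DEFER_DSQ 100   /* assumed user DSQ created with scx_bpf_create_dsq() in ops.init() */

static int defer_timerfn(void *map, int *key, struct bpf_timer *timer)
{
        if (__COMPAT_has_generic_reenq())
                scx_bpf_dsq_reenq(DEFER_DSQ, 0);

        bpf_timer_start(timer, 10 * 1000 * 1000, 0);    /* rearm in 10ms */
        return 0;
}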
+51 -1
tools/sched_ext/include/scx/compat.h
··· 8 8 #define __SCX_COMPAT_H 9 9 10 10 #include <bpf/btf.h> 11 + #include <bpf/libbpf.h> 11 12 #include <fcntl.h> 12 13 #include <stdlib.h> 13 14 #include <unistd.h> ··· 116 115 #define SCX_OPS_ENQ_MIGRATION_DISABLED SCX_OPS_FLAG(SCX_OPS_ENQ_MIGRATION_DISABLED) 117 116 #define SCX_OPS_ALLOW_QUEUED_WAKEUP SCX_OPS_FLAG(SCX_OPS_ALLOW_QUEUED_WAKEUP) 118 117 #define SCX_OPS_BUILTIN_IDLE_PER_NODE SCX_OPS_FLAG(SCX_OPS_BUILTIN_IDLE_PER_NODE) 118 + #define SCX_OPS_ALWAYS_ENQ_IMMED SCX_OPS_FLAG(SCX_OPS_ALWAYS_ENQ_IMMED) 119 119 120 120 #define SCX_PICK_IDLE_FLAG(name) __COMPAT_ENUM_OR_ZERO("scx_pick_idle_cpu_flags", #name) 121 121 ··· 160 158 * COMPAT: 161 159 * - v6.17: ops.cgroup_set_bandwidth() 162 160 * - v6.19: ops.cgroup_set_idle() 161 + * - v7.1: ops.sub_attach(), ops.sub_detach(), ops.sub_cgroup_id 163 162 */ 164 163 #define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ 165 164 struct __scx_name *__skel; \ ··· 182 179 fprintf(stderr, "WARNING: kernel doesn't support ops.cgroup_set_idle()\n"); \ 183 180 __skel->struct_ops.__ops_name->cgroup_set_idle = NULL; \ 184 181 } \ 182 + if (__skel->struct_ops.__ops_name->sub_attach && \ 183 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_attach")) { \ 184 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_attach()\n"); \ 185 + __skel->struct_ops.__ops_name->sub_attach = NULL; \ 186 + } \ 187 + if (__skel->struct_ops.__ops_name->sub_detach && \ 188 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_detach")) { \ 189 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_detach()\n"); \ 190 + __skel->struct_ops.__ops_name->sub_detach = NULL; \ 191 + } \ 192 + if (__skel->struct_ops.__ops_name->sub_cgroup_id > 0 && \ 193 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_cgroup_id")) { \ 194 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_cgroup_id\n"); \ 195 + __skel->struct_ops.__ops_name->sub_cgroup_id = 0; \ 196 + } \ 185 197 __skel; \ 186 198 }) 187 199 200 + /* 201 + * Associate non-struct_ops BPF programs with the scheduler's struct_ops map so 202 + * that scx_prog_sched() can determine which scheduler a BPF program belongs 203 + * to. Requires libbpf >= 1.7. 204 + */ 205 + #if LIBBPF_MAJOR_VERSION > 1 || \ 206 + (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 7) 207 + static inline void __scx_ops_assoc_prog(struct bpf_program *prog, 208 + struct bpf_map *map, 209 + const char *ops_name) 210 + { 211 + s32 err = bpf_program__assoc_struct_ops(prog, map, NULL); 212 + if (err) 213 + fprintf(stderr, 214 + "ERROR: Failed to associate %s with %s: %d\n", 215 + bpf_program__name(prog), ops_name, err); 216 + } 217 + #else 218 + static inline void __scx_ops_assoc_prog(struct bpf_program *prog, 219 + struct bpf_map *map, 220 + const char *ops_name) 221 + { 222 + } 223 + #endif 224 + 188 225 #define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \ 226 + struct bpf_program *__prog; \ 189 227 UEI_SET_SIZE(__skel, __ops_name, __uei_name); \ 190 228 SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ 229 + bpf_object__for_each_program(__prog, (__skel)->obj) { \ 230 + if (bpf_program__type(__prog) == BPF_PROG_TYPE_STRUCT_OPS) \ 231 + continue; \ 232 + __scx_ops_assoc_prog(__prog, (__skel)->maps.__ops_name, \ 233 + #__ops_name); \ 234 + } \ 191 235 }) 192 236 193 237 /* 194 238 * New versions of bpftool now emit additional link placeholders for BPF maps, 195 239 * and set up BPF skeleton in such a way that libbpf will auto-attach BPF maps 196 - * automatically, assumming libbpf is recent enough (v1.5+). 
Old libbpf will do 240 + * automatically, assuming libbpf is recent enough (v1.5+). Old libbpf will do 197 241 * nothing with those links and won't attempt to auto-attach maps. 198 242 * 199 243 * To maintain compatibility with older libbpf while avoiding trying to attach
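For context, the program-association loop added to SCX_OPS_LOAD() slots transparently into the open/load/attach flow the example schedulers already follow; a minimal sketch of that flow (scheduler name "simple" assumed purely for illustration):

#include <unistd.h>
#include <scx/common.h>
#include "scx_simple.bpf.skel.h"

int main(void)
{
        struct scx_simple *skel;
        struct bpf_link *link;

        skel = SCX_OPS_OPEN(simple_ops, scx_simple);
        SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);        /* association happens here */
        link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple);

        while (!UEI_EXITED(skel, uei))
                sleep(1);

        bpf_link__destroy(link);
        UEI_REPORT(skel, uei);
        scx_simple__destroy(skel);
        return 0;
}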
+48 -13
tools/sched_ext/include/scx/enum_defs.autogen.h
··· 14 14 #define HAVE_SCX_EXIT_MSG_LEN 15 15 #define HAVE_SCX_EXIT_DUMP_DFL_LEN 16 16 #define HAVE_SCX_CPUPERF_ONE 17 - #define HAVE_SCX_OPS_TASK_ITER_BATCH 17 + #define HAVE_SCX_TASK_ITER_BATCH 18 + #define HAVE_SCX_BYPASS_HOST_NTH 19 + #define HAVE_SCX_BYPASS_LB_DFL_INTV_US 20 + #define HAVE_SCX_BYPASS_LB_DONOR_PCT 21 + #define HAVE_SCX_BYPASS_LB_MIN_DELTA_DIV 22 + #define HAVE_SCX_BYPASS_LB_BATCH 23 + #define HAVE_SCX_REENQ_LOCAL_MAX_REPEAT 24 + #define HAVE_SCX_SUB_MAX_DEPTH 18 25 #define HAVE_SCX_CPU_PREEMPT_RT 19 26 #define HAVE_SCX_CPU_PREEMPT_DL 20 27 #define HAVE_SCX_CPU_PREEMPT_STOP 21 28 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN 22 29 #define HAVE_SCX_DEQ_SLEEP 23 30 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC 31 + #define HAVE_SCX_DEQ_SCHED_CHANGE 24 32 #define HAVE_SCX_DSQ_FLAG_BUILTIN 25 33 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON 26 34 #define HAVE_SCX_DSQ_INVALID 27 35 #define HAVE_SCX_DSQ_GLOBAL 28 36 #define HAVE_SCX_DSQ_LOCAL 37 + #define HAVE_SCX_DSQ_BYPASS 29 38 #define HAVE_SCX_DSQ_LOCAL_ON 30 39 #define HAVE_SCX_DSQ_LOCAL_CPU_MASK 31 40 #define HAVE_SCX_DSQ_ITER_REV ··· 44 35 #define HAVE___SCX_DSQ_ITER_ALL_FLAGS 45 36 #define HAVE_SCX_DSQ_LNODE_ITER_CURSOR 46 37 #define HAVE___SCX_DSQ_LNODE_PRIV_SHIFT 38 + #define HAVE_SCX_ENABLING 39 + #define HAVE_SCX_ENABLED 40 + #define HAVE_SCX_DISABLING 41 + #define HAVE_SCX_DISABLED 47 42 #define HAVE_SCX_ENQ_WAKEUP 48 43 #define HAVE_SCX_ENQ_HEAD 49 44 #define HAVE_SCX_ENQ_CPU_SELECTED 50 45 #define HAVE_SCX_ENQ_PREEMPT 46 + #define HAVE_SCX_ENQ_IMMED 51 47 #define HAVE_SCX_ENQ_REENQ 52 48 #define HAVE_SCX_ENQ_LAST 53 49 #define HAVE___SCX_ENQ_INTERNAL_MASK 54 50 #define HAVE_SCX_ENQ_CLEAR_OPSS 55 51 #define HAVE_SCX_ENQ_DSQ_PRIQ 52 + #define HAVE_SCX_ENQ_NESTED 53 + #define HAVE_SCX_ENQ_GDSQ_FALLBACK 56 54 #define HAVE_SCX_TASK_DSQ_ON_PRIQ 57 55 #define HAVE_SCX_TASK_QUEUED 56 + #define HAVE_SCX_TASK_IN_CUSTODY 58 57 #define HAVE_SCX_TASK_RESET_RUNNABLE_AT 59 58 #define HAVE_SCX_TASK_DEQD_FOR_SLEEP 59 + #define HAVE_SCX_TASK_SUB_INIT 60 + #define HAVE_SCX_TASK_IMMED 60 61 #define HAVE_SCX_TASK_STATE_SHIFT 61 62 #define HAVE_SCX_TASK_STATE_BITS 62 63 #define HAVE_SCX_TASK_STATE_MASK 64 + #define HAVE_SCX_TASK_NONE 65 + #define HAVE_SCX_TASK_INIT 66 + #define HAVE_SCX_TASK_READY 67 + #define HAVE_SCX_TASK_ENABLED 68 + #define HAVE_SCX_TASK_REENQ_REASON_SHIFT 69 + #define HAVE_SCX_TASK_REENQ_REASON_BITS 70 + #define HAVE_SCX_TASK_REENQ_REASON_MASK 71 + #define HAVE_SCX_TASK_REENQ_NONE 72 + #define HAVE_SCX_TASK_REENQ_KFUNC 73 + #define HAVE_SCX_TASK_REENQ_IMMED 74 + #define HAVE_SCX_TASK_REENQ_PREEMPTED 63 75 #define HAVE_SCX_TASK_CURSOR 64 76 #define HAVE_SCX_ECODE_RSN_HOTPLUG 77 + #define HAVE_SCX_ECODE_RSN_CGROUP_OFFLINE 65 78 #define HAVE_SCX_ECODE_ACT_RESTART 79 + #define HAVE_SCX_EFLAG_INITIALIZED 66 80 #define HAVE_SCX_EXIT_NONE 67 81 #define HAVE_SCX_EXIT_DONE 68 82 #define HAVE_SCX_EXIT_UNREG 69 83 #define HAVE_SCX_EXIT_UNREG_BPF 70 84 #define HAVE_SCX_EXIT_UNREG_KERN 71 85 #define HAVE_SCX_EXIT_SYSRQ 86 + #define HAVE_SCX_EXIT_PARENT 72 87 #define HAVE_SCX_EXIT_ERROR 73 88 #define HAVE_SCX_EXIT_ERROR_BPF 74 89 #define HAVE_SCX_EXIT_ERROR_STALL ··· 113 80 #define HAVE_SCX_OPI_CPU_HOTPLUG_BEGIN 114 81 #define HAVE_SCX_OPI_CPU_HOTPLUG_END 115 82 #define HAVE_SCX_OPI_END 116 - #define HAVE_SCX_OPS_ENABLING 117 - #define HAVE_SCX_OPS_ENABLED 118 - #define HAVE_SCX_OPS_DISABLING 119 - #define HAVE_SCX_OPS_DISABLED 120 83 #define HAVE_SCX_OPS_KEEP_BUILTIN_IDLE 121 84 #define HAVE_SCX_OPS_ENQ_LAST 122 85 #define HAVE_SCX_OPS_ENQ_EXITING 123 
86 #define HAVE_SCX_OPS_SWITCH_PARTIAL 124 87 #define HAVE_SCX_OPS_ENQ_MIGRATION_DISABLED 125 88 #define HAVE_SCX_OPS_ALLOW_QUEUED_WAKEUP 126 - #define HAVE_SCX_OPS_HAS_CGROUP_WEIGHT 89 + #define HAVE_SCX_OPS_BUILTIN_IDLE_PER_NODE 90 + #define HAVE_SCX_OPS_ALWAYS_ENQ_IMMED 127 91 #define HAVE_SCX_OPS_ALL_FLAGS 92 + #define HAVE___SCX_OPS_INTERNAL_MASK 93 + #define HAVE_SCX_OPS_HAS_CPU_PREEMPT 128 94 #define HAVE_SCX_OPSS_NONE 129 95 #define HAVE_SCX_OPSS_QUEUEING 130 96 #define HAVE_SCX_OPSS_QUEUED 131 97 #define HAVE_SCX_OPSS_DISPATCHING 132 98 #define HAVE_SCX_OPSS_QSEQ_SHIFT 133 99 #define HAVE_SCX_PICK_IDLE_CORE 100 + #define HAVE_SCX_PICK_IDLE_IN_NODE 134 101 #define HAVE_SCX_OPS_NAME_LEN 135 102 #define HAVE_SCX_SLICE_DFL 103 + #define HAVE_SCX_SLICE_BYPASS 136 104 #define HAVE_SCX_SLICE_INF 105 + #define HAVE_SCX_REENQ_ANY 106 + #define HAVE___SCX_REENQ_FILTER_MASK 107 + #define HAVE___SCX_REENQ_USER_MASK 108 + #define HAVE_SCX_REENQ_TSR_RQ_OPEN 109 + #define HAVE_SCX_REENQ_TSR_NOT_FIRST 110 + #define HAVE___SCX_REENQ_TSR_MASK 137 111 #define HAVE_SCX_RQ_ONLINE 138 112 #define HAVE_SCX_RQ_CAN_STOP_TICK 139 - #define HAVE_SCX_RQ_BAL_PENDING 140 113 #define HAVE_SCX_RQ_BAL_KEEP 141 - #define HAVE_SCX_RQ_BYPASSING 142 114 #define HAVE_SCX_RQ_CLK_VALID 115 + #define HAVE_SCX_RQ_BAL_CB_PENDING 143 116 #define HAVE_SCX_RQ_IN_WAKEUP 144 117 #define HAVE_SCX_RQ_IN_BALANCE 145 - #define HAVE_SCX_TASK_NONE 146 - #define HAVE_SCX_TASK_INIT 147 - #define HAVE_SCX_TASK_READY 148 - #define HAVE_SCX_TASK_ENABLED 149 - #define HAVE_SCX_TASK_NR_STATES 118 + #define HAVE_SCX_SCHED_PCPU_BYPASSING 150 119 #define HAVE_SCX_TG_ONLINE 151 120 #define HAVE_SCX_TG_INITED 152 121 #define HAVE_SCX_WAKE_FORK
+11
tools/sched_ext/include/scx/enums.autogen.bpf.h
··· 67 67 const volatile u64 __SCX_TASK_DEQD_FOR_SLEEP __weak; 68 68 #define SCX_TASK_DEQD_FOR_SLEEP __SCX_TASK_DEQD_FOR_SLEEP 69 69 70 + const volatile u64 __SCX_TASK_SUB_INIT __weak; 71 + #define SCX_TASK_SUB_INIT __SCX_TASK_SUB_INIT 72 + 73 + const volatile u64 __SCX_TASK_IMMED __weak; 74 + #define SCX_TASK_IMMED __SCX_TASK_IMMED 75 + 70 76 const volatile u64 __SCX_TASK_STATE_SHIFT __weak; 71 77 #define SCX_TASK_STATE_SHIFT __SCX_TASK_STATE_SHIFT 72 78 ··· 121 115 const volatile u64 __SCX_ENQ_PREEMPT __weak; 122 116 #define SCX_ENQ_PREEMPT __SCX_ENQ_PREEMPT 123 117 118 + const volatile u64 __SCX_ENQ_IMMED __weak; 119 + #define SCX_ENQ_IMMED __SCX_ENQ_IMMED 120 + 124 121 const volatile u64 __SCX_ENQ_REENQ __weak; 125 122 #define SCX_ENQ_REENQ __SCX_ENQ_REENQ 126 123 ··· 136 127 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; 137 128 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ 138 129 130 + const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; 131 + #define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
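On the BPF side the new flags read like any other enq/deq flag once the loader fills in the values above. A minimal sketch (hypothetical enqueue callback and DSQ id; assumes SCX_ENQ_IMMED is OR'd into the insert flags like other enqueue flags) of requesting immediate execution on an idle CPU and falling back to a shared DSQ otherwise:

#define MY_DSQ 0        /* assumed user DSQ created in ops.init() */

void BPF_STRUCT_OPS(myscx_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

        if (cpu >= 0) {
                /*
                 * If the target CPU can't run the task right away, the task
                 * should come back through ops.enqueue() with SCX_ENQ_REENQ
                 * set instead of lingering on the local DSQ.
                 */
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                   enq_flags | SCX_ENQ_IMMED);
                return;
        }

        scx_bpf_dsq_insert(p, MY_DSQ, SCX_SLICE_DFL, enq_flags);
}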
+4
tools/sched_ext/include/scx/enums.autogen.h
··· 26 26 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_QUEUED); \ 27 27 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_RESET_RUNNABLE_AT); \ 28 28 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_DEQD_FOR_SLEEP); \ 29 + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_SUB_INIT); \ 30 + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_IMMED); \ 29 31 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_SHIFT); \ 30 32 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_BITS); \ 31 33 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_MASK); \ ··· 44 42 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_WAKEUP); \ 45 43 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_HEAD); \ 46 44 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_PREEMPT); \ 45 + SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_IMMED); \ 47 46 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_REENQ); \ 48 47 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ 49 48 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ 50 49 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ 50 + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ 51 51 } while (0)
+1 -1
tools/sched_ext/include/scx/enums.h
··· 9 9 #ifndef __SCX_ENUMS_H 10 10 #define __SCX_ENUMS_H 11 11 12 - static inline void __ENUM_set(u64 *val, char *type, char *name) 12 + static inline void __ENUM_set(u64 *val, const char *type, const char *name) 13 13 { 14 14 bool res; 15 15
+44 -22
tools/sched_ext/scx_central.bpf.c
··· 60 60 const volatile u64 slice_ns; 61 61 62 62 bool timer_pinned = true; 63 + bool timer_started; 63 64 u64 nr_total, nr_locals, nr_queued, nr_lost_pids; 64 65 u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries; 65 66 u64 nr_overflows; ··· 180 179 return false; 181 180 } 182 181 182 + static void start_central_timer(void) 183 + { 184 + struct bpf_timer *timer; 185 + u32 key = 0; 186 + int ret; 187 + 188 + if (likely(timer_started)) 189 + return; 190 + 191 + timer = bpf_map_lookup_elem(&central_timer, &key); 192 + if (!timer) { 193 + scx_bpf_error("failed to lookup central timer"); 194 + return; 195 + } 196 + 197 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 198 + /* 199 + * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a 200 + * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. 201 + * Retry without the PIN. This would be the perfect use case for 202 + * bpf_core_enum_value_exists() but the enum type doesn't have a name 203 + * and can't be used with bpf_core_enum_value_exists(). Oh well... 204 + */ 205 + if (ret == -EINVAL) { 206 + timer_pinned = false; 207 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); 208 + } 209 + 210 + if (ret) { 211 + scx_bpf_error("bpf_timer_start failed (%d)", ret); 212 + return; 213 + } 214 + 215 + timer_started = true; 216 + } 217 + 183 218 void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) 184 219 { 185 220 if (cpu == central_cpu) { 221 + start_central_timer(); 222 + 186 223 /* dispatch for all other CPUs first */ 187 224 __sync_fetch_and_add(&nr_dispatches, 1); 188 225 ··· 253 214 } 254 215 255 216 /* look for a task to run on the central CPU */ 256 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) 217 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) 257 218 return; 258 219 dispatch_to_cpu(central_cpu); 259 220 } else { 260 221 bool *gimme; 261 222 262 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) 223 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) 263 224 return; 264 225 265 226 gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); ··· 349 310 if (!timer) 350 311 return -ESRCH; 351 312 352 - if (bpf_get_smp_processor_id() != central_cpu) { 353 - scx_bpf_error("init from non-central CPU"); 354 - return -EINVAL; 355 - } 356 - 357 313 bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC); 358 314 bpf_timer_set_callback(timer, central_timerfn); 359 315 360 - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 361 - /* 362 - * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a 363 - * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. 364 - * Retry without the PIN. This would be the perfect use case for 365 - * bpf_core_enum_value_exists() but the enum type doesn't have a name 366 - * and can't be used with bpf_core_enum_value_exists(). Oh well... 367 - */ 368 - if (ret == -EINVAL) { 369 - timer_pinned = false; 370 - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); 371 - } 372 - if (ret) 373 - scx_bpf_error("bpf_timer_start failed (%d)", ret); 374 - return ret; 316 + scx_bpf_kick_cpu(central_cpu, 0); 317 + 318 + return 0; 375 319 } 376 320 377 321 void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
+1 -25
tools/sched_ext/scx_central.c
··· 5 5 * Copyright (c) 2022 David Vernet <dvernet@meta.com> 6 6 */ 7 7 #define _GNU_SOURCE 8 - #include <sched.h> 9 8 #include <stdio.h> 10 9 #include <unistd.h> 11 10 #include <inttypes.h> ··· 20 21 "\n" 21 22 "See the top-level comment in .bpf.c for more details.\n" 22 23 "\n" 23 - "Usage: %s [-s SLICE_US] [-c CPU]\n" 24 + "Usage: %s [-s SLICE_US] [-c CPU] [-v]\n" 24 25 "\n" 25 26 " -s SLICE_US Override slice duration\n" 26 27 " -c CPU Override the central CPU (default: 0)\n" ··· 48 49 struct bpf_link *link; 49 50 __u64 seq = 0, ecode; 50 51 __s32 opt; 51 - cpu_set_t *cpuset; 52 - size_t cpuset_size; 53 52 54 53 libbpf_set_print(libbpf_print_fn); 55 54 signal(SIGINT, sigint_handler); ··· 92 95 RESIZE_ARRAY(skel, data, cpu_started_at, skel->rodata->nr_cpu_ids); 93 96 94 97 SCX_OPS_LOAD(skel, central_ops, scx_central, uei); 95 - 96 - /* 97 - * Affinitize the loading thread to the central CPU, as: 98 - * - That's where the BPF timer is first invoked in the BPF program. 99 - * - We probably don't want this user space component to take up a core 100 - * from a task that would benefit from avoiding preemption on one of 101 - * the tickless cores. 102 - * 103 - * Until BPF supports pinning the timer, it's not guaranteed that it 104 - * will always be invoked on the central CPU. In practice, this 105 - * suffices the majority of the time. 106 - */ 107 - cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids); 108 - SCX_BUG_ON(!cpuset, "Failed to allocate cpuset"); 109 - cpuset_size = CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids); 110 - CPU_ZERO_S(cpuset_size, cpuset); 111 - CPU_SET_S(skel->rodata->central_cpu, cpuset_size, cpuset); 112 - SCX_BUG_ON(sched_setaffinity(0, cpuset_size, cpuset), 113 - "Failed to affinitize to central CPU %d (max %d)", 114 - skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1); 115 - CPU_FREE(cpuset); 116 98 117 99 link = SCX_OPS_ATTACH(skel, central_ops, scx_central); 118 100
+1 -1
tools/sched_ext/scx_cpu0.bpf.c
··· 66 66 void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev) 67 67 { 68 68 if (cpu == 0) 69 - scx_bpf_dsq_move_to_local(DSQ_CPU0); 69 + scx_bpf_dsq_move_to_local(DSQ_CPU0, 0); 70 70 } 71 71 72 72 s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
+13 -11
tools/sched_ext/scx_flatcg.bpf.c
··· 18 18 * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's 19 19 * share in that competition is 100/(200+100) == 1/3. B's eventual share in the 20 20 * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's 21 - * eventual shaer is the same at 1/6. D is only competing at the top level and 21 + * eventual share is the same at 1/6. D is only competing at the top level and 22 22 * its share is 200/(100+200) == 2/3. 23 23 * 24 24 * So, instead of hierarchically scheduling level-by-level, we can consider it ··· 551 551 * too much, determine the execution time by taking explicit timestamps 552 552 * instead of depending on @p->scx.slice. 553 553 */ 554 - if (!fifo_sched) 555 - p->scx.dsq_vtime += 556 - (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 554 + if (!fifo_sched) { 555 + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); 556 + 557 + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); 558 + } 557 559 558 560 taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); 559 561 if (!taskc) { ··· 662 660 goto out_free; 663 661 } 664 662 665 - if (!scx_bpf_dsq_move_to_local(cgid)) { 663 + if (!scx_bpf_dsq_move_to_local(cgid, 0)) { 666 664 bpf_cgroup_release(cgrp); 667 665 stat_inc(FCG_STAT_PNC_EMPTY); 668 666 goto out_stash; ··· 742 740 goto pick_next_cgroup; 743 741 744 742 if (time_before(now, cpuc->cur_at + cgrp_slice_ns)) { 745 - if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid)) { 743 + if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid, 0)) { 746 744 stat_inc(FCG_STAT_CNS_KEEP); 747 745 return; 748 746 } ··· 782 780 pick_next_cgroup: 783 781 cpuc->cur_at = now; 784 782 785 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ)) { 783 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ, 0)) { 786 784 cpuc->cur_cgid = 0; 787 785 return; 788 786 } ··· 824 822 if (!(cgc = find_cgrp_ctx(args->cgroup))) 825 823 return -ENOENT; 826 824 827 - p->scx.dsq_vtime = cgc->tvtime_now; 825 + scx_bpf_task_set_dsq_vtime(p, cgc->tvtime_now); 828 826 829 827 return 0; 830 828 } ··· 921 919 struct fcg_cgrp_ctx *from_cgc, *to_cgc; 922 920 s64 delta; 923 921 924 - /* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */ 922 + /* find_cgrp_ctx() triggers scx_bpf_error() on lookup failures */ 925 923 if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to))) 926 924 return; 927 925 928 926 delta = time_delta(p->scx.dsq_vtime, from_cgc->tvtime_now); 929 - p->scx.dsq_vtime = to_cgc->tvtime_now + delta; 927 + scx_bpf_task_set_dsq_vtime(p, to_cgc->tvtime_now + delta); 930 928 } 931 929 932 930 s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init) ··· 962 960 .cgroup_move = (void *)fcg_cgroup_move, 963 961 .init = (void *)fcg_init, 964 962 .exit = (void *)fcg_exit, 965 - .flags = SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING, 963 + .flags = SCX_OPS_ENQ_EXITING, 966 964 .name = "flatcg");
+13 -3
tools/sched_ext/scx_pair.c
··· 21 21 "\n" 22 22 "See the top-level comment in .bpf.c for more details.\n" 23 23 "\n" 24 - "Usage: %s [-S STRIDE]\n" 24 + "Usage: %s [-S STRIDE] [-v]\n" 25 25 "\n" 26 26 " -S STRIDE Override CPU pair stride (default: nr_cpus_ids / 2)\n" 27 27 " -v Print libbpf debug messages\n" ··· 48 48 struct bpf_link *link; 49 49 __u64 seq = 0, ecode; 50 50 __s32 stride, i, opt, outer_fd; 51 + __u32 pair_id = 0; 51 52 52 53 libbpf_set_print(libbpf_print_fn); 53 54 signal(SIGINT, sigint_handler); ··· 83 82 scx_pair__destroy(skel); 84 83 return -1; 85 84 } 85 + 86 + if (skel->rodata->nr_cpu_ids & 1) { 87 + fprintf(stderr, "scx_pair requires an even CPU count, got %u\n", 88 + skel->rodata->nr_cpu_ids); 89 + scx_pair__destroy(skel); 90 + return -1; 91 + } 92 + 86 93 bpf_map__set_max_entries(skel->maps.pair_ctx, skel->rodata->nr_cpu_ids / 2); 87 94 88 95 /* Resize arrays so their element count is equal to cpu count. */ ··· 118 109 119 110 skel->rodata_pair_cpu->pair_cpu[i] = j; 120 111 skel->rodata_pair_cpu->pair_cpu[j] = i; 121 - skel->rodata_pair_id->pair_id[i] = i; 122 - skel->rodata_pair_id->pair_id[j] = i; 112 + skel->rodata_pair_id->pair_id[i] = pair_id; 113 + skel->rodata_pair_id->pair_id[j] = pair_id; 123 114 skel->rodata_in_pair_idx->in_pair_idx[i] = 0; 124 115 skel->rodata_in_pair_idx->in_pair_idx[j] = 1; 116 + pair_id++; 125 117 126 118 printf("[%d, %d] ", i, j); 127 119 }
+172 -42
tools/sched_ext/scx_qmap.bpf.c
··· 11 11 * 12 12 * - BPF-side queueing using PIDs. 13 13 * - Sleepable per-task storage allocation using ops.prep_enable(). 14 - * - Using ops.cpu_release() to handle a higher priority scheduling class taking 15 - * the CPU away. 16 14 * - Core-sched support. 17 15 * 18 16 * This scheduler is primarily for demonstration and testing of sched_ext ··· 24 26 25 27 enum consts { 26 28 ONE_SEC_IN_NS = 1000000000, 29 + ONE_MSEC_IN_NS = 1000000, 30 + LOWPRI_INTV_NS = 10 * ONE_MSEC_IN_NS, 27 31 SHARED_DSQ = 0, 28 32 HIGHPRI_DSQ = 1, 33 + LOWPRI_DSQ = 2, 29 34 HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */ 30 35 }; 31 36 ··· 42 41 const volatile bool highpri_boosting; 43 42 const volatile bool print_dsqs_and_events; 44 43 const volatile bool print_msgs; 44 + const volatile u64 sub_cgroup_id; 45 45 const volatile s32 disallow_tgid; 46 46 const volatile bool suppress_dump; 47 + const volatile bool always_enq_immed; 48 + const volatile u32 immed_stress_nth; 47 49 48 50 u64 nr_highpri_queued; 49 51 u32 test_error_cnt; 52 + 53 + #define MAX_SUB_SCHEDS 8 54 + u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS]; 50 55 51 56 UEI_DEFINE(uei); 52 57 ··· 134 127 } cpu_ctx_stor SEC(".maps"); 135 128 136 129 /* Statistics */ 137 - u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq; 130 + u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0, nr_dequeued, nr_ddsp_from_enq; 138 131 u64 nr_core_sched_execed; 139 132 u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer; 140 133 u32 cpuperf_min, cpuperf_avg, cpuperf_max; ··· 144 137 { 145 138 s32 cpu; 146 139 147 - if (p->nr_cpus_allowed == 1 || 148 - scx_bpf_test_and_clear_cpu_idle(prev_cpu)) 140 + if (!always_enq_immed && p->nr_cpus_allowed == 1) 141 + return prev_cpu; 142 + 143 + if (scx_bpf_test_and_clear_cpu_idle(prev_cpu)) 149 144 return prev_cpu; 150 145 151 146 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); ··· 176 167 177 168 if (!(tctx = lookup_task_ctx(p))) 178 169 return -ESRCH; 170 + 171 + if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD)) 172 + return prev_cpu; 179 173 180 174 cpu = pick_direct_dispatch_cpu(p, prev_cpu); 181 175 ··· 214 202 void *ring; 215 203 s32 cpu; 216 204 217 - if (enq_flags & SCX_ENQ_REENQ) 205 + if (enq_flags & SCX_ENQ_REENQ) { 218 206 __sync_fetch_and_add(&nr_reenqueued, 1); 207 + if (scx_bpf_task_cpu(p) == 0) 208 + __sync_fetch_and_add(&nr_reenqueued_cpu0, 1); 209 + } 219 210 220 211 if (p->flags & PF_KTHREAD) { 221 212 if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth)) ··· 241 226 tctx->core_sched_seq = core_sched_tail_seqs[idx]++; 242 227 243 228 /* 229 + * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch 230 + * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1 231 + * and exercise the kernel IMMED reenqueue trigger paths. 232 + */ 233 + if (immed_stress_nth && !(enq_flags & SCX_ENQ_REENQ)) { 234 + static u32 immed_stress_cnt; 235 + 236 + if (!(++immed_stress_cnt % immed_stress_nth)) { 237 + tctx->force_local = false; 238 + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p), 239 + slice_ns, enq_flags); 240 + return; 241 + } 242 + } 243 + 244 + /* 244 245 * If qmap_select_cpu() is telling us to or this is the last runnable 245 246 * task on the CPU, enqueue locally. 
246 247 */ 247 248 if (tctx->force_local) { 248 249 tctx->force_local = false; 249 250 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags); 251 + return; 252 + } 253 + 254 + /* see lowpri_timerfn() */ 255 + if (__COMPAT_has_generic_reenq() && 256 + p->scx.weight < 2 && !(p->flags & PF_KTHREAD) && !(enq_flags & SCX_ENQ_REENQ)) { 257 + scx_bpf_dsq_insert(p, LOWPRI_DSQ, slice_ns, enq_flags); 250 258 return; 251 259 } 252 260 ··· 413 375 if (dispatch_highpri(false)) 414 376 return; 415 377 416 - if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ)) 378 + if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ, 0)) 417 379 return; 418 380 419 381 if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) { ··· 471 433 __sync_fetch_and_add(&nr_dispatched, 1); 472 434 473 435 scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0); 436 + 437 + /* 438 + * scx_qmap uses a global BPF queue that any CPU's 439 + * dispatch can pop from. If this CPU popped a task that 440 + * can't run here, it gets stranded on SHARED_DSQ after 441 + * consume_dispatch_q() skips it. Kick the task's home 442 + * CPU so it drains SHARED_DSQ. 443 + * 444 + * There's a race between the pop and the flush of the 445 + * buffered dsq_insert: 446 + * 447 + * CPU 0 (dispatching) CPU 1 (home, idle) 448 + * ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~ 449 + * pop from BPF queue 450 + * dsq_insert(buffered) 451 + * balance: 452 + * SHARED_DSQ empty 453 + * BPF queue empty 454 + * -> goes idle 455 + * flush -> on SHARED 456 + * kick CPU 1 457 + * wakes, drains task 458 + * 459 + * The kick prevents indefinite stalls but a per-CPU 460 + * kthread like ksoftirqd can be briefly stranded when 461 + * its home CPU enters idle with softirq pending, 462 + * triggering: 463 + * 464 + * "NOHZ tick-stop error: local softirq work is pending, handler #N!!!" 465 + * 466 + * from report_idle_softirq(). The kick lands shortly 467 + * after and the home CPU drains the task. This could be 468 + * avoided by e.g. dispatching pinned tasks to local or 469 + * global DSQs, but the current code is left as-is to 470 + * document this class of issue -- other schedulers 471 + * seeing similar warnings can use this as a reference. 472 + */ 473 + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) 474 + scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0); 475 + 474 476 bpf_task_release(p); 475 477 476 478 batch--; ··· 518 440 if (!batch || !scx_bpf_dispatch_nr_slots()) { 519 441 if (dispatch_highpri(false)) 520 442 return; 521 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 443 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 522 444 return; 523 445 } 524 446 if (!cpuc->dsp_cnt) ··· 526 448 } 527 449 528 450 cpuc->dsp_cnt = 0; 451 + } 452 + 453 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 454 + if (sub_sched_cgroup_ids[i] && 455 + scx_bpf_sub_dispatch(sub_sched_cgroup_ids[i])) 456 + return; 529 457 } 530 458 531 459 /* ··· 616 532 return task_qdist(a) > task_qdist(b); 617 533 } 618 534 619 - SEC("tp_btf/sched_switch") 620 - int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev, 621 - struct task_struct *next, unsigned long prev_state) 622 - { 623 - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) 624 - return 0; 625 - 626 - /* 627 - * If @cpu is taken by a higher priority scheduling class, it is no 628 - * longer available for executing sched_ext tasks. As we don't want the 629 - * tasks in @cpu's local dsq to sit there until @cpu becomes available 630 - * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ 631 - * handling in qmap_enqueue(). 
632 - */ 633 - switch (next->policy) { 634 - case 1: /* SCHED_FIFO */ 635 - case 2: /* SCHED_RR */ 636 - case 6: /* SCHED_DEADLINE */ 637 - scx_bpf_reenqueue_local(); 638 - } 639 - 640 - return 0; 641 - } 642 - 643 - void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args) 644 - { 645 - /* see qmap_sched_switch() to learn how to do this on newer kernels */ 646 - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) 647 - scx_bpf_reenqueue_local(); 648 - } 535 + /* 536 + * sched_switch tracepoint and cpu_release handlers are no longer needed. 537 + * With SCX_OPS_ALWAYS_ENQ_IMMED, wakeup_preempt_scx() reenqueues IMMED 538 + * tasks when a higher-priority scheduling class takes the CPU. 539 + */ 649 540 650 541 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, 651 542 struct scx_init_task_args *args) ··· 915 856 return 0; 916 857 } 917 858 859 + struct lowpri_timer { 860 + struct bpf_timer timer; 861 + }; 862 + 863 + struct { 864 + __uint(type, BPF_MAP_TYPE_ARRAY); 865 + __uint(max_entries, 1); 866 + __type(key, u32); 867 + __type(value, struct lowpri_timer); 868 + } lowpri_timer SEC(".maps"); 869 + 870 + /* 871 + * Nice 19 tasks are put into the lowpri DSQ. Every 10ms, reenq is triggered and 872 + * the tasks are transferred to SHARED_DSQ. 873 + */ 874 + static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer) 875 + { 876 + scx_bpf_dsq_reenq(LOWPRI_DSQ, 0); 877 + bpf_timer_start(timer, LOWPRI_INTV_NS, 0); 878 + return 0; 879 + } 880 + 918 881 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) 919 882 { 920 883 u32 key = 0; 921 884 struct bpf_timer *timer; 922 885 s32 ret; 923 886 924 - if (print_msgs) 887 + if (print_msgs && !sub_cgroup_id) 925 888 print_cpus(); 926 889 927 890 ret = scx_bpf_create_dsq(SHARED_DSQ, -1); ··· 958 877 return ret; 959 878 } 960 879 880 + ret = scx_bpf_create_dsq(LOWPRI_DSQ, -1); 881 + if (ret) 882 + return ret; 883 + 961 884 timer = bpf_map_lookup_elem(&monitor_timer, &key); 962 885 if (!timer) 963 886 return -ESRCH; 964 - 965 887 bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC); 966 888 bpf_timer_set_callback(timer, monitor_timerfn); 889 + ret = bpf_timer_start(timer, ONE_SEC_IN_NS, 0); 890 + if (ret) 891 + return ret; 967 892 968 - return bpf_timer_start(timer, ONE_SEC_IN_NS, 0); 893 + if (__COMPAT_has_generic_reenq()) { 894 + /* see lowpri_timerfn() */ 895 + timer = bpf_map_lookup_elem(&lowpri_timer, &key); 896 + if (!timer) 897 + return -ESRCH; 898 + bpf_timer_init(timer, &lowpri_timer, CLOCK_MONOTONIC); 899 + bpf_timer_set_callback(timer, lowpri_timerfn); 900 + ret = bpf_timer_start(timer, LOWPRI_INTV_NS, 0); 901 + if (ret) 902 + return ret; 903 + } 904 + 905 + return 0; 969 906 } 970 907 971 908 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) 972 909 { 973 910 UEI_RECORD(uei, ei); 911 + } 912 + 913 + s32 BPF_STRUCT_OPS(qmap_sub_attach, struct scx_sub_attach_args *args) 914 + { 915 + s32 i; 916 + 917 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 918 + if (!sub_sched_cgroup_ids[i]) { 919 + sub_sched_cgroup_ids[i] = args->ops->sub_cgroup_id; 920 + bpf_printk("attaching sub-sched[%d] on %s", 921 + i, args->cgroup_path); 922 + return 0; 923 + } 924 + } 925 + 926 + return -ENOSPC; 927 + } 928 + 929 + void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args) 930 + { 931 + s32 i; 932 + 933 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 934 + if (sub_sched_cgroup_ids[i] == args->ops->sub_cgroup_id) { 935 + sub_sched_cgroup_ids[i] = 0; 936 + bpf_printk("detaching sub-sched[%d] on %s", 937 + i, 
args->cgroup_path); 938 + break; 939 + } 940 + } 974 941 } 975 942 976 943 SCX_OPS_DEFINE(qmap_ops, ··· 1028 899 .dispatch = (void *)qmap_dispatch, 1029 900 .tick = (void *)qmap_tick, 1030 901 .core_sched_before = (void *)qmap_core_sched_before, 1031 - .cpu_release = (void *)qmap_cpu_release, 1032 902 .init_task = (void *)qmap_init_task, 1033 903 .dump = (void *)qmap_dump, 1034 904 .dump_cpu = (void *)qmap_dump_cpu, ··· 1035 907 .cgroup_init = (void *)qmap_cgroup_init, 1036 908 .cgroup_set_weight = (void *)qmap_cgroup_set_weight, 1037 909 .cgroup_set_bandwidth = (void *)qmap_cgroup_set_bandwidth, 910 + .sub_attach = (void *)qmap_sub_attach, 911 + .sub_detach = (void *)qmap_sub_detach, 1038 912 .cpu_online = (void *)qmap_cpu_online, 1039 913 .cpu_offline = (void *)qmap_cpu_offline, 1040 914 .init = (void *)qmap_init,
+25 -4
tools/sched_ext/scx_qmap.c
··· 10 10 #include <inttypes.h> 11 11 #include <signal.h> 12 12 #include <libgen.h> 13 + #include <sys/stat.h> 13 14 #include <bpf/bpf.h> 14 15 #include <scx/common.h> 15 16 #include "scx_qmap.bpf.skel.h" ··· 21 20 "See the top-level comment in .bpf.c for more details.\n" 22 21 "\n" 23 22 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n" 24 - " [-P] [-M] [-d PID] [-D LEN] [-p] [-v]\n" 23 + " [-P] [-M] [-H] [-d PID] [-D LEN] [-S] [-p] [-I] [-F COUNT] [-v]\n" 25 24 "\n" 26 25 " -s SLICE_US Override slice duration\n" 27 26 " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" ··· 36 35 " -D LEN Set scx_exit_info.dump buffer length\n" 37 36 " -S Suppress qmap-specific debug dump\n" 38 37 " -p Switch only tasks on SCHED_EXT policy instead of all\n" 38 + " -I Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n" 39 + " -F COUNT IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n" 39 40 " -v Print libbpf debug messages\n" 40 41 " -h Display this help and exit\n"; 41 42 ··· 70 67 71 68 skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL"); 72 69 73 - while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHd:D:Spvh")) != -1) { 70 + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHc:d:D:SpIF:vh")) != -1) { 74 71 switch (opt) { 75 72 case 's': 76 73 skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; ··· 99 96 case 'H': 100 97 skel->rodata->highpri_boosting = true; 101 98 break; 99 + case 'c': { 100 + struct stat st; 101 + if (stat(optarg, &st) < 0) { 102 + perror("stat"); 103 + return 1; 104 + } 105 + skel->struct_ops.qmap_ops->sub_cgroup_id = st.st_ino; 106 + skel->rodata->sub_cgroup_id = st.st_ino; 107 + break; 108 + } 102 109 case 'd': 103 110 skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); 104 111 if (skel->rodata->disallow_tgid < 0) ··· 122 109 break; 123 110 case 'p': 124 111 skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; 112 + break; 113 + case 'I': 114 + skel->rodata->always_enq_immed = true; 115 + skel->struct_ops.qmap_ops->flags |= SCX_OPS_ALWAYS_ENQ_IMMED; 116 + break; 117 + case 'F': 118 + skel->rodata->immed_stress_nth = strtoul(optarg, NULL, 0); 125 119 break; 126 120 case 'v': 127 121 verbose = true; ··· 146 126 long nr_enqueued = skel->bss->nr_enqueued; 147 127 long nr_dispatched = skel->bss->nr_dispatched; 148 128 149 - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", 129 + printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%"PRIu64"/%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", 150 130 nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, 151 - skel->bss->nr_reenqueued, skel->bss->nr_dequeued, 131 + skel->bss->nr_reenqueued, skel->bss->nr_reenqueued_cpu0, 132 + skel->bss->nr_dequeued, 152 133 skel->bss->nr_core_sched_execed, 153 134 skel->bss->nr_ddsp_from_enq); 154 135 printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
+3 -2
tools/sched_ext/scx_sdt.bpf.c
··· 317 317 }; 318 318 319 319 /* Zero out one word at a time. */ 320 - for (i = zero; i < alloc->pool.elem_size / 8 && can_loop; i++) { 320 + for (i = zero; i < (alloc->pool.elem_size - sizeof(struct sdt_data)) / 8 321 + && can_loop; i++) { 321 322 data->payload[i] = 0; 322 323 } 323 324 } ··· 644 643 645 644 void BPF_STRUCT_OPS(sdt_dispatch, s32 cpu, struct task_struct *prev) 646 645 { 647 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 646 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 648 647 } 649 648 650 649 s32 BPF_STRUCT_OPS_SLEEPABLE(sdt_init_task, struct task_struct *p,
+1 -1
tools/sched_ext/scx_sdt.c
··· 20 20 "\n" 21 21 "Modified version of scx_simple that demonstrates arena-based data structures.\n" 22 22 "\n" 23 - "Usage: %s [-f] [-v]\n" 23 + "Usage: %s [-v]\n" 24 24 "\n" 25 25 " -v Print libbpf debug messages\n" 26 26 " -h Display this help and exit\n";
+5 -3
tools/sched_ext/scx_simple.bpf.c
··· 89 89 90 90 void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev) 91 91 { 92 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 92 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 93 93 } 94 94 95 95 void BPF_STRUCT_OPS(simple_running, struct task_struct *p) ··· 121 121 * too much, determine the execution time by taking explicit timestamps 122 122 * instead of depending on @p->scx.slice. 123 123 */ 124 - p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 124 + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); 125 + 126 + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); 125 127 } 126 128 127 129 void BPF_STRUCT_OPS(simple_enable, struct task_struct *p) 128 130 { 129 - p->scx.dsq_vtime = vtime_now; 131 + scx_bpf_task_set_dsq_vtime(p, vtime_now); 130 132 } 131 133 132 134 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
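To make the inverse weight scaling above concrete (values picked for illustration): at the default weight of 100 the vtime charge equals the CPU time consumed, while a task at weight 200 that used its full default slice is charged slice * 100 / 200, i.e. half as much vtime for the same CPU time. Higher-weight tasks therefore advance through the vtime-ordered DSQ more slowly and receive a proportionally larger share.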
+1 -1
tools/sched_ext/scx_userland.c
··· 38 38 "\n" 39 39 "Try to reduce `sysctl kernel.pid_max` if this program triggers OOMs.\n" 40 40 "\n" 41 - "Usage: %s [-b BATCH]\n" 41 + "Usage: %s [-b BATCH] [-v]\n" 42 42 "\n" 43 43 " -b BATCH The number of tasks to batch when dispatching (default: 8)\n" 44 44 " -v Print libbpf debug messages\n"
+1
tools/testing/selftests/sched_ext/Makefile
··· 163 163 164 164 auto-test-targets := \ 165 165 create_dsq \ 166 + dequeue \ 166 167 enq_last_no_enq_fails \ 167 168 ddsp_bogus_dsq_fail \ 168 169 ddsp_vtimelocal_fail \
+389
tools/testing/selftests/sched_ext/dequeue.bpf.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + /*
3 +  * A scheduler that validates ops.dequeue() is called correctly:
4 +  * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
5 +  *   scheduler entirely: no ops.dequeue() should be called
6 +  * - Tasks dispatched to user DSQs from ops.enqueue() enter BPF custody:
7 +  *   ops.dequeue() must be called when they leave custody
8 +  * - Every ops.enqueue() dispatch to non-terminal DSQs is followed by
9 +  *   exactly one ops.dequeue() (validate 1:1 pairing and state machine)
10 +  *
11 +  * Copyright (c) 2026 NVIDIA Corporation.
12 +  */
13 +
14 + #include <scx/common.bpf.h>
15 +
16 + #define SHARED_DSQ 0
17 +
18 + /*
19 +  * BPF internal queue.
20 +  *
21 +  * Tasks are stored here and consumed from ops.dispatch(), validating that
22 +  * tasks on BPF internal structures still get ops.dequeue() when they
23 +  * leave.
24 +  */
25 + struct {
26 + 	__uint(type, BPF_MAP_TYPE_QUEUE);
27 + 	__uint(max_entries, 32768);
28 + 	__type(value, s32);
29 + } global_queue SEC(".maps");
30 +
31 + char _license[] SEC("license") = "GPL";
32 +
33 + UEI_DEFINE(uei);
34 +
35 + /*
36 +  * Counters to track the lifecycle of tasks:
37 +  * - enqueue_cnt: Number of times ops.enqueue() was called
38 +  * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
39 +  * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
40 +  * - change_dequeue_cnt: Number of property change dequeues
41 +  * - bpf_queue_full: Number of times the BPF internal queue was full
42 +  */
43 + u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt, bpf_queue_full;
44 +
45 + /*
46 +  * Test scenarios:
47 +  * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
48 +  *    scheduler, no dequeue callbacks)
49 +  * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
50 +  *    scheduler, no dequeue callbacks)
51 +  * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
52 +  *    dequeue callbacks expected)
53 +  * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
54 +  *    scheduler, no dequeue callbacks)
55 +  * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
56 +  *    scheduler, no dequeue callbacks)
57 +  * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
58 +  *    dequeue callbacks expected)
59 +  * 6) BPF internal queue from ops.enqueue(): store task PIDs in ops.enqueue(),
60 +  *    consume in ops.dispatch() and dispatch to local DSQ (validates dequeue
61 +  *    for tasks stored in internal BPF data structures)
62 +  */
63 + u32 test_scenario;
64 +
65 + /*
66 +  * Per-task state to track lifecycle and validate workflow semantics.
67 +  * State transitions:
68 +  * NONE -> ENQUEUED (on enqueue)
69 +  * NONE -> DISPATCHED (on direct dispatch to terminal DSQ)
70 +  * ENQUEUED -> DISPATCHED (on dispatch dequeue)
71 +  * DISPATCHED -> NONE (on property change dequeue or re-enqueue)
72 +  * ENQUEUED -> NONE (on property change dequeue before dispatch)
73 +  */
74 + enum task_state {
75 + 	TASK_NONE = 0,
76 + 	TASK_ENQUEUED,
77 + 	TASK_DISPATCHED,
78 + };
79 +
80 + struct task_ctx {
81 + 	enum task_state state;	/* Current state in the workflow */
82 + 	u64 enqueue_seq;	/* Sequence number for debugging */
83 + };
84 +
85 + struct {
86 + 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
87 + 	__uint(map_flags, BPF_F_NO_PREALLOC);
88 + 	__type(key, int);
89 + 	__type(value, struct task_ctx);
90 + } task_ctx_stor SEC(".maps");
91 +
92 + static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
93 + {
94 + 	return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
95 + }
96 +
97 + s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
98 + 		   s32 prev_cpu, u64 wake_flags)
99 + {
100 + 	struct task_ctx *tctx;
101 +
102 + 	tctx = try_lookup_task_ctx(p);
103 + 	if (!tctx)
104 + 		return prev_cpu;
105 +
106 + 	switch (test_scenario) {
107 + 	case 0:
108 + 		/*
109 + 		 * Direct dispatch to the local DSQ.
110 + 		 *
111 + 		 * Task bypasses BPF scheduler entirely: no enqueue
112 + 		 * tracking, no ops.dequeue() callbacks.
113 + 		 */
114 + 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
115 + 		tctx->state = TASK_DISPATCHED;
116 + 		break;
117 + 	case 1:
118 + 		/*
119 + 		 * Direct dispatch to the global DSQ.
120 + 		 *
121 + 		 * Task bypasses BPF scheduler entirely: no enqueue
122 + 		 * tracking, no ops.dequeue() callbacks.
123 + 		 */
124 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
125 + 		tctx->state = TASK_DISPATCHED;
126 + 		break;
127 + 	case 2:
128 + 		/*
129 + 		 * Dispatch to a shared user DSQ.
130 + 		 *
131 + 		 * Task enters BPF scheduler management: track
132 + 		 * enqueue/dequeue lifecycle and validate state
133 + 		 * transitions.
134 + 		 */
135 + 		if (tctx->state == TASK_ENQUEUED)
136 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
137 + 				      p->pid, p->comm, tctx->enqueue_seq);
138 +
139 + 		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
140 +
141 + 		__sync_fetch_and_add(&enqueue_cnt, 1);
142 +
143 + 		tctx->state = TASK_ENQUEUED;
144 + 		tctx->enqueue_seq++;
145 + 		break;
146 + 	}
147 +
148 + 	return prev_cpu;
149 + }
150 +
151 + void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
152 + {
153 + 	struct task_ctx *tctx;
154 + 	s32 pid = p->pid;
155 +
156 + 	tctx = try_lookup_task_ctx(p);
157 + 	if (!tctx)
158 + 		return;
159 +
160 + 	switch (test_scenario) {
161 + 	case 3:
162 + 		/*
163 + 		 * Direct dispatch to the local DSQ.
164 + 		 *
165 + 		 * Task bypasses BPF scheduler entirely: no enqueue
166 + 		 * tracking, no ops.dequeue() callbacks.
167 + 		 */
168 + 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
169 + 		tctx->state = TASK_DISPATCHED;
170 + 		break;
171 + 	case 4:
172 + 		/*
173 + 		 * Direct dispatch to the global DSQ.
174 + 		 *
175 + 		 * Task bypasses BPF scheduler entirely: no enqueue
176 + 		 * tracking, no ops.dequeue() callbacks.
177 + 		 */
178 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
179 + 		tctx->state = TASK_DISPATCHED;
180 + 		break;
181 + 	case 5:
182 + 		/*
183 + 		 * Dispatch to shared user DSQ.
184 + 		 *
185 + 		 * Task enters BPF scheduler management: track
186 + 		 * enqueue/dequeue lifecycle and validate state
187 + 		 * transitions.
188 + 		 */
189 + 		if (tctx->state == TASK_ENQUEUED)
190 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
191 + 				      p->pid, p->comm, tctx->enqueue_seq);
192 +
193 + 		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
194 +
195 + 		__sync_fetch_and_add(&enqueue_cnt, 1);
196 +
197 + 		tctx->state = TASK_ENQUEUED;
198 + 		tctx->enqueue_seq++;
199 + 		break;
200 + 	case 6:
201 + 		/*
202 + 		 * Store task in BPF internal queue.
203 + 		 *
204 + 		 * Task enters BPF scheduler management: track
205 + 		 * enqueue/dequeue lifecycle and validate state
206 + 		 * transitions.
207 + 		 */
208 + 		if (tctx->state == TASK_ENQUEUED)
209 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
210 + 				      p->pid, p->comm, tctx->enqueue_seq);
211 +
212 + 		if (bpf_map_push_elem(&global_queue, &pid, 0)) {
213 + 			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
214 + 			__sync_fetch_and_add(&bpf_queue_full, 1);
215 +
216 + 			tctx->state = TASK_DISPATCHED;
217 + 		} else {
218 + 			__sync_fetch_and_add(&enqueue_cnt, 1);
219 +
220 + 			tctx->state = TASK_ENQUEUED;
221 + 			tctx->enqueue_seq++;
222 + 		}
223 + 		break;
224 + 	default:
225 + 		/* For all other scenarios, dispatch to the global DSQ */
226 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
227 + 		tctx->state = TASK_DISPATCHED;
228 + 		break;
229 + 	}
230 +
231 + 	scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
232 + }
233 +
234 + void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags)
235 + {
236 + 	struct task_ctx *tctx;
237 +
238 + 	__sync_fetch_and_add(&dequeue_cnt, 1);
239 +
240 + 	tctx = try_lookup_task_ctx(p);
241 + 	if (!tctx)
242 + 		return;
243 +
244 + 	/*
245 + 	 * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global),
246 + 	 * ops.dequeue() should never be called because tasks bypass the
247 + 	 * BPF scheduler entirely. If we get here, it's a kernel bug.
248 + 	 */
249 + 	if (test_scenario == 0 || test_scenario == 3) {
250 + 		scx_bpf_error("%d (%s): dequeue called for local DSQ scenario",
251 + 			      p->pid, p->comm);
252 + 		return;
253 + 	}
254 +
255 + 	if (test_scenario == 1 || test_scenario == 4) {
256 + 		scx_bpf_error("%d (%s): dequeue called for global DSQ scenario",
257 + 			      p->pid, p->comm);
258 + 		return;
259 + 	}
260 +
261 + 	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
262 + 		/*
263 + 		 * Property change interrupting the workflow. Valid from
264 + 		 * both ENQUEUED and DISPATCHED states. Transitions task
265 + 		 * back to NONE state.
266 + 		 */
267 + 		__sync_fetch_and_add(&change_dequeue_cnt, 1);
268 +
269 + 		/* Validate state transition */
270 + 		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED)
271 + 			scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu",
272 + 				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
273 +
274 + 		/*
275 + 		 * Transition back to NONE: task outside scheduler control.
276 + 		 *
277 + 		 * Scenario 6: dispatch() checks tctx->state after popping a
278 + 		 * PID, if the task is in state NONE, it was dequeued by
279 + 		 * property change and must not be dispatched (this
280 + 		 * prevents "target CPU not allowed").
281 + 		 */
282 + 		tctx->state = TASK_NONE;
283 + 	} else {
284 + 		/*
285 + 		 * Regular dispatch dequeue: kernel is moving the task from
286 + 		 * BPF custody to a terminal DSQ. Normally we come from
287 + 		 * ENQUEUED state. We can also see TASK_NONE if the task
288 + 		 * was dequeued by property change (SCX_DEQ_SCHED_CHANGE)
289 + 		 * while it was already on a DSQ (dispatched but not yet
290 + 		 * consumed); in that case we just leave state as NONE.
291 + 		 */
292 + 		__sync_fetch_and_add(&dispatch_dequeue_cnt, 1);
293 +
294 + 		/*
295 + 		 * Must be ENQUEUED (normal path) or NONE (already dequeued
296 + 		 * by property change while on a DSQ).
297 + 		 */
298 + 		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_NONE)
299 + 			scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu",
300 + 				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
301 +
302 + 		if (tctx->state == TASK_ENQUEUED)
303 + 			tctx->state = TASK_DISPATCHED;
304 +
305 + 		/* NONE: leave as-is, task was already property-change dequeued */
306 + 	}
307 + }
308 +
309 + void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev)
310 + {
311 + 	if (test_scenario == 6) {
312 + 		struct task_ctx *tctx;
313 + 		struct task_struct *p;
314 + 		s32 pid;
315 +
316 + 		if (bpf_map_pop_elem(&global_queue, &pid))
317 + 			return;
318 +
319 + 		p = bpf_task_from_pid(pid);
320 + 		if (!p)
321 + 			return;
322 +
323 + 		/*
324 + 		 * If the task was dequeued by property change
325 + 		 * (ops.dequeue() set tctx->state = TASK_NONE), skip
326 + 		 * dispatch.
327 + 		 */
328 + 		tctx = try_lookup_task_ctx(p);
329 + 		if (!tctx || tctx->state == TASK_NONE) {
330 + 			bpf_task_release(p);
331 + 			return;
332 + 		}
333 +
334 + 		/*
335 + 		 * Dispatch to this CPU's local DSQ if allowed, otherwise
336 + 		 * fallback to the global DSQ.
337 + 		 */
338 + 		if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
339 + 			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
340 + 		else
341 + 			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
342 +
343 + 		bpf_task_release(p);
344 + 	} else {
345 + 		scx_bpf_dsq_move_to_local(SHARED_DSQ, 0);
346 + 	}
347 + }
348 +
349 + s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p,
350 + 		   struct scx_init_task_args *args)
351 + {
352 + 	struct task_ctx *tctx;
353 +
354 + 	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
355 + 				    BPF_LOCAL_STORAGE_GET_F_CREATE);
356 + 	if (!tctx)
357 + 		return -ENOMEM;
358 +
359 + 	return 0;
360 + }
361 +
362 + s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init)
363 + {
364 + 	s32 ret;
365 +
366 + 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
367 + 	if (ret)
368 + 		return ret;
369 +
370 + 	return 0;
371 + }
372 +
373 + void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei)
374 + {
375 + 	UEI_RECORD(uei, ei);
376 + }
377 +
378 + SEC(".struct_ops.link")
379 + struct sched_ext_ops dequeue_ops = {
380 + 	.select_cpu = (void *)dequeue_select_cpu,
381 + 	.enqueue = (void *)dequeue_enqueue,
382 + 	.dequeue = (void *)dequeue_dequeue,
383 + 	.dispatch = (void *)dequeue_dispatch,
384 + 	.init_task = (void *)dequeue_init_task,
385 + 	.init = (void *)dequeue_init,
386 + 	.exit = (void *)dequeue_exit,
387 + 	.flags = SCX_OPS_ENQ_LAST,
388 + 	.name = "dequeue_test",
389 + };
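The 1:1 pairing this scheduler asserts is what makes simple per-scheduler bookkeeping safe. As a rough illustration only (hypothetical names, not part of this series, reusing the SHARED_DSQ id from the test scheduler above), a scheduler that counts how many tasks it currently holds in BPF custody can rely on ops.dequeue() firing exactly once whenever a task leaves custody, whether through a dispatch or a scheduling-property change:

/*
 * Hypothetical fragment: nr_owned counts tasks currently parked on the
 * user DSQ. It only stays balanced because every enqueue into custody is
 * matched by exactly one ops.dequeue() call.
 */
static u64 nr_owned;

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	__sync_fetch_and_add(&nr_owned, 1);
}

void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
{
	/* Runs once per custody exit, whatever deq_flags carries. */
	__sync_fetch_and_add(&nr_owned, -1);
}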
+274
tools/testing/selftests/sched_ext/dequeue.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + /*
3 +  * Copyright (c) 2025 NVIDIA Corporation.
4 +  */
5 + #define _GNU_SOURCE
6 + #include <stdio.h>
7 + #include <unistd.h>
8 + #include <signal.h>
9 + #include <time.h>
10 + #include <bpf/bpf.h>
11 + #include <scx/common.h>
12 + #include <sys/wait.h>
13 + #include <sched.h>
14 + #include <pthread.h>
15 + #include "scx_test.h"
16 + #include "dequeue.bpf.skel.h"
17 +
18 + #define NUM_WORKERS 8
19 + #define AFFINITY_HAMMER_MS 500
20 +
21 + /*
22 +  * Worker function that creates enqueue/dequeue events via CPU work and
23 +  * sleep.
24 +  */
25 + static void worker_fn(int id)
26 + {
27 + 	int i;
28 + 	volatile int sum = 0;
29 +
30 + 	for (i = 0; i < 1000; i++) {
31 + 		volatile int j;
32 +
33 + 		/* Do some work to trigger scheduling events */
34 + 		for (j = 0; j < 10000; j++)
35 + 			sum += j;
36 +
37 + 		/* Sleep to trigger dequeue */
38 + 		usleep(1000 + (id * 100));
39 + 	}
40 +
41 + 	exit(0);
42 + }
43 +
44 + /*
45 +  * This thread changes workers' affinity from outside so that some changes
46 +  * hit tasks while they are still in the scheduler's queue and trigger
47 +  * property-change dequeues.
48 +  */
49 + static void *affinity_hammer_fn(void *arg)
50 + {
51 + 	pid_t *pids = arg;
52 + 	cpu_set_t cpuset;
53 + 	int i = 0, n = NUM_WORKERS;
54 + 	struct timespec start, now;
55 +
56 + 	clock_gettime(CLOCK_MONOTONIC, &start);
57 + 	while (1) {
58 + 		int w = i % n;
59 + 		int cpu = (i / n) % 4;
60 +
61 + 		CPU_ZERO(&cpuset);
62 + 		CPU_SET(cpu, &cpuset);
63 + 		sched_setaffinity(pids[w], sizeof(cpuset), &cpuset);
64 + 		i++;
65 +
66 + 		/* Check elapsed time every 256 iterations to limit gettime cost */
67 + 		if ((i & 255) == 0) {
68 + 			long long elapsed_ms;
69 +
70 + 			clock_gettime(CLOCK_MONOTONIC, &now);
71 + 			elapsed_ms = (now.tv_sec - start.tv_sec) * 1000LL +
72 + 				     (now.tv_nsec - start.tv_nsec) / 1000000;
73 + 			if (elapsed_ms >= AFFINITY_HAMMER_MS)
74 + 				break;
75 + 		}
76 + 	}
77 + 	return NULL;
78 + }
79 +
80 + static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario,
81 + 					 const char *scenario_name)
82 + {
83 + 	struct bpf_link *link;
84 + 	pid_t pids[NUM_WORKERS];
85 + 	pthread_t hammer;
86 +
87 + 	int i, status;
88 + 	u64 enq_start, deq_start,
89 + 	    dispatch_deq_start, change_deq_start, bpf_queue_full_start;
90 + 	u64 enq_delta, deq_delta,
91 + 	    dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta;
92 +
93 + 	/* Set the test scenario */
94 + 	skel->bss->test_scenario = scenario;
95 +
96 + 	/* Record starting counts */
97 + 	enq_start = skel->bss->enqueue_cnt;
98 + 	deq_start = skel->bss->dequeue_cnt;
99 + 	dispatch_deq_start = skel->bss->dispatch_dequeue_cnt;
100 + 	change_deq_start = skel->bss->change_dequeue_cnt;
101 + 	bpf_queue_full_start = skel->bss->bpf_queue_full;
102 +
103 + 	link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops);
104 + 	SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name);
105 +
106 + 	/* Fork worker processes to generate enqueue/dequeue events */
107 + 	for (i = 0; i < NUM_WORKERS; i++) {
108 + 		pids[i] = fork();
109 + 		SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i);
110 +
111 + 		if (pids[i] == 0) {
112 + 			worker_fn(i);
113 + 			/* Should not reach here */
114 + 			exit(1);
115 + 		}
116 + 	}
117 +
118 + 	/*
119 + 	 * Run an "affinity hammer" so that some property changes hit tasks
120 + 	 * while they are still in BPF custody (e.g., in user DSQ or BPF
121 + 	 * queue), triggering SCX_DEQ_SCHED_CHANGE dequeues.
122 + 	 */
123 + 	SCX_FAIL_IF(pthread_create(&hammer, NULL, affinity_hammer_fn, pids) != 0,
124 + 		    "Failed to create affinity hammer thread");
125 + 	pthread_join(hammer, NULL);
126 +
127 + 	/* Wait for all workers to complete */
128 + 	for (i = 0; i < NUM_WORKERS; i++) {
129 + 		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
130 + 			    "Failed to wait for worker %d", i);
131 + 		SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status);
132 + 	}
133 +
134 + 	bpf_link__destroy(link);
135 +
136 + 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG));
137 +
138 + 	/* Calculate deltas */
139 + 	enq_delta = skel->bss->enqueue_cnt - enq_start;
140 + 	deq_delta = skel->bss->dequeue_cnt - deq_start;
141 + 	dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start;
142 + 	change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start;
143 + 	bpf_queue_full_delta = skel->bss->bpf_queue_full - bpf_queue_full_start;
144 +
145 + 	printf("%s:\n", scenario_name);
146 + 	printf(" enqueues: %lu\n", (unsigned long)enq_delta);
147 + 	printf(" dequeues: %lu (dispatch: %lu, property_change: %lu)\n",
148 + 	       (unsigned long)deq_delta,
149 + 	       (unsigned long)dispatch_deq_delta,
150 + 	       (unsigned long)change_deq_delta);
151 + 	printf(" BPF queue full: %lu\n", (unsigned long)bpf_queue_full_delta);
152 +
153 + 	/*
154 + 	 * Validate enqueue/dequeue lifecycle tracking.
155 + 	 *
156 + 	 * For scenarios 0, 1, 3, 4 (local and global DSQs from
157 + 	 * ops.select_cpu() and ops.enqueue()), both enqueues and dequeues
158 + 	 * should be 0 because tasks bypass the BPF scheduler entirely:
159 + 	 * tasks never enter BPF scheduler's custody.
160 + 	 *
161 + 	 * For scenarios 2, 5, 6 (user DSQ or BPF internal queue) we expect
162 + 	 * both enqueues and dequeues.
163 + 	 *
164 + 	 * The BPF code does strict state machine validation with
165 + 	 * scx_bpf_error() to ensure the workflow semantics are correct.
166 + 	 *
167 + 	 * If we reach this point without errors, the semantics are
168 + 	 * validated correctly.
169 + 	 */
170 + 	if (scenario == 0 || scenario == 1 ||
171 + 	    scenario == 3 || scenario == 4) {
172 + 		/* Tasks bypass BPF scheduler completely */
173 + 		SCX_EQ(enq_delta, 0);
174 + 		SCX_EQ(deq_delta, 0);
175 + 		SCX_EQ(dispatch_deq_delta, 0);
176 + 		SCX_EQ(change_deq_delta, 0);
177 + 	} else {
178 + 		/*
179 + 		 * User DSQ from ops.enqueue() or ops.select_cpu(): tasks
180 + 		 * enter BPF scheduler's custody.
181 + 		 *
182 + 		 * Also validate 1:1 enqueue/dequeue pairing.
183 + 		 */
184 + 		SCX_GT(enq_delta, 0);
185 + 		SCX_GT(deq_delta, 0);
186 + 		SCX_EQ(enq_delta, deq_delta);
187 + 	}
188 +
189 + 	return SCX_TEST_PASS;
190 + }
191 +
192 + static enum scx_test_status setup(void **ctx)
193 + {
194 + 	struct dequeue *skel;
195 +
196 + 	skel = dequeue__open();
197 + 	SCX_FAIL_IF(!skel, "Failed to open skel");
198 + 	SCX_ENUM_INIT(skel);
199 + 	SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel");
200 +
201 + 	*ctx = skel;
202 +
203 + 	return SCX_TEST_PASS;
204 + }
205 +
206 + static enum scx_test_status run(void *ctx)
207 + {
208 + 	struct dequeue *skel = ctx;
209 + 	enum scx_test_status status;
210 +
211 + 	status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.select_cpu()");
212 + 	if (status != SCX_TEST_PASS)
213 + 		return status;
214 +
215 + 	status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.select_cpu()");
216 + 	if (status != SCX_TEST_PASS)
217 + 		return status;
218 +
219 + 	status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.select_cpu()");
220 + 	if (status != SCX_TEST_PASS)
221 + 		return status;
222 +
223 + 	status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.enqueue()");
224 + 	if (status != SCX_TEST_PASS)
225 + 		return status;
226 +
227 + 	status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.enqueue()");
228 + 	if (status != SCX_TEST_PASS)
229 + 		return status;
230 +
231 + 	status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.enqueue()");
232 + 	if (status != SCX_TEST_PASS)
233 + 		return status;
234 +
235 + 	status = run_scenario(skel, 6, "Scenario 6: BPF queue from ops.enqueue()");
236 + 	if (status != SCX_TEST_PASS)
237 + 		return status;
238 +
239 + 	printf("\n=== Summary ===\n");
240 + 	printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt);
241 + 	printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt);
242 + 	printf(" Dispatch dequeues: %lu (no flag, normal workflow)\n",
243 + 	       (unsigned long)skel->bss->dispatch_dequeue_cnt);
244 + 	printf(" Property change dequeues: %lu (SCX_DEQ_SCHED_CHANGE flag)\n",
245 + 	       (unsigned long)skel->bss->change_dequeue_cnt);
246 + 	printf(" BPF queue full: %lu\n",
247 + 	       (unsigned long)skel->bss->bpf_queue_full);
248 + 	printf("\nAll scenarios passed - no state machine violations detected\n");
249 + 	printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler\n");
250 + 	printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler\n");
251 + 	printf("-> Validated: User DSQ dispatch triggers ops.dequeue() callbacks\n");
252 + 	printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n");
253 + 	printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n");
254 + 	printf("-> Validated: No duplicate enqueues or invalid state transitions\n");
255 +
256 + 	return SCX_TEST_PASS;
257 + }
258 +
259 + static void cleanup(void *ctx)
260 + {
261 + 	struct dequeue *skel = ctx;
262 +
263 + 	dequeue__destroy(skel);
264 + }
265 +
266 + struct scx_test dequeue_test = {
267 + 	.name = "dequeue",
268 + 	.description = "Verify ops.dequeue() semantics",
269 + 	.setup = setup,
270 + 	.run = run,
271 + 	.cleanup = cleanup,
272 + };
273 +
274 + REGISTER_SCX_TEST(&dequeue_test)
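Assuming the usual kselftest build flow for tools/testing/selftests/sched_ext, this test can presumably be run on its own through the serial runner updated below, e.g. ./runner -t dequeue: the -t option filters by substring and the test registers itself under the name "dequeue".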
+1 -1
tools/testing/selftests/sched_ext/exit.bpf.c
···
41 41 	if (exit_point == EXIT_DISPATCH)
42 42 		EXIT_CLEANLY();
43 43 
44 - 	scx_bpf_dsq_move_to_local(DSQ_ID);
44 + 	scx_bpf_dsq_move_to_local(DSQ_ID, 0);
45 45 }
46 46 
47 47 void BPF_STRUCT_OPS(exit_enable, struct task_struct *p)
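The same one-line adjustment recurs in maximal.bpf.c, numa.bpf.c, peek_dsq.bpf.c and select_cpu_vtime.bpf.c further down: scx_bpf_dsq_move_to_local() takes an additional flags argument in this series, and the selftests all pass 0, presumably to keep the previous default behavior.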
+1 -1
tools/testing/selftests/sched_ext/exit.c
···
33 33 		skel = exit__open();
34 34 		SCX_ENUM_INIT(skel);
35 35 		skel->rodata->exit_point = tc;
36 - 		exit__load(skel);
36 + 		SCX_FAIL_IF(exit__load(skel), "Failed to load skel");
37 37 		link = bpf_map__attach_struct_ops(skel->maps.exit_ops);
38 38 		if (!link) {
39 39 			SCX_ERR("Failed to attach scheduler");
+1 -1
tools/testing/selftests/sched_ext/exit_test.h
···
17 17 	NUM_EXITS,
18 18 };
19 19 
20 - #endif // # __EXIT_TEST_H__
20 + #endif // __EXIT_TEST_H__
+7 -10
tools/testing/selftests/sched_ext/maximal.bpf.c
···
30 30 
31 31 void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev)
32 32 {
33 - 	scx_bpf_dsq_move_to_local(DSQ_ID);
33 + 	scx_bpf_dsq_move_to_local(DSQ_ID, 0);
34 34 }
35 35 
36 36 void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags)
···
67 67 void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle)
68 68 {}
69 69 
70 - void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu,
71 - 		    struct scx_cpu_acquire_args *args)
72 - {}
73 - 
74 - void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu,
75 - 		    struct scx_cpu_release_args *args)
76 - {}
70 + SEC("tp_btf/sched_switch")
71 + int BPF_PROG(maximal_sched_switch, bool preempt, struct task_struct *prev,
72 + 	     struct task_struct *next, unsigned int prev_state)
73 + {
74 + 	return 0;
75 + }
77 76 
78 77 void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu)
79 78 {}
···
149 150 	.set_weight = (void *) maximal_set_weight,
150 151 	.set_cpumask = (void *) maximal_set_cpumask,
151 152 	.update_idle = (void *) maximal_update_idle,
152 - 	.cpu_acquire = (void *) maximal_cpu_acquire,
153 - 	.cpu_release = (void *) maximal_cpu_release,
154 153 	.cpu_online = (void *) maximal_cpu_online,
155 154 	.cpu_offline = (void *) maximal_cpu_offline,
156 155 	.init_task = (void *) maximal_init_task,
+3
tools/testing/selftests/sched_ext/maximal.c
···
19 19 	SCX_ENUM_INIT(skel);
20 20 	SCX_FAIL_IF(maximal__load(skel), "Failed to load skel");
21 21 
22 + 	bpf_map__set_autoattach(skel->maps.maximal_ops, false);
23 + 	SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel");
24 + 
22 25 	*ctx = skel;
23 26 
24 27 	return SCX_TEST_PASS;
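maximal.bpf.c now also carries a plain tp_btf/sched_switch program (see the hunk above), so the skeleton has something to attach besides the struct_ops map. Disabling autoattach on maximal_ops before calling maximal__attach() here, and in reload_loop.c below, presumably lets the generated attach helper handle that program while the struct_ops map stays under the test's explicit control.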
+1 -1
tools/testing/selftests/sched_ext/numa.bpf.c
···
68 68 {
69 69 	int node = __COMPAT_scx_bpf_cpu_node(cpu);
70 70 
71 - 	scx_bpf_dsq_move_to_local(node);
71 + 	scx_bpf_dsq_move_to_local(node, 0);
72 72 }
73 73 
74 74 s32 BPF_STRUCT_OPS_SLEEPABLE(numa_init)
+5 -5
tools/testing/selftests/sched_ext/peek_dsq.bpf.c
···
95 95 		record_peek_result(task->pid);
96 96 
97 97 		/* Try to move this task to local */
98 - 		if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
98 + 		if (!moved && scx_bpf_dsq_move_to_local(dsq_id, 0) == 0) {
99 99 			moved = 1;
100 100 			break;
101 101 		}
···
156 156 		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
157 157 
158 158 		/* Now consume the task since we've peeked at it */
159 - 		scx_bpf_dsq_move_to_local(test_dsq_id);
159 + 		scx_bpf_dsq_move_to_local(test_dsq_id, 0);
160 160 
161 161 		/* Mark phase 1 as complete */
162 162 		phase1_complete = 1;
163 163 		bpf_printk("Phase 1 complete, starting phase 2 stress testing");
164 164 	} else if (!phase1_complete) {
165 165 		/* Still in phase 1, use real DSQ */
166 - 		scx_bpf_dsq_move_to_local(real_dsq_id);
166 + 		scx_bpf_dsq_move_to_local(real_dsq_id, 0);
167 167 	} else {
168 168 		/* Phase 2: Scan all DSQs in the pool and try to move a task */
169 169 		if (!scan_dsq_pool()) {
170 170 			/* No tasks found in DSQ pool, fall back to real DSQ */
171 - 			scx_bpf_dsq_move_to_local(real_dsq_id);
171 + 			scx_bpf_dsq_move_to_local(real_dsq_id, 0);
172 172 		}
173 173 	}
174 174 }
···
197 197 	}
198 198 	err = scx_bpf_create_dsq(real_dsq_id, -1);
199 199 	if (err) {
200 - 		scx_bpf_error("Failed to create DSQ %d: %d", test_dsq_id, err);
200 + 		scx_bpf_error("Failed to create DSQ %d: %d", real_dsq_id, err);
201 201 		return err;
202 202 	}
203 203 
+3
tools/testing/selftests/sched_ext/reload_loop.c
···
23 23 	SCX_ENUM_INIT(skel);
24 24 	SCX_FAIL_IF(maximal__load(skel), "Failed to load skel");
25 25 
26 + 	bpf_map__set_autoattach(skel->maps.maximal_ops, false);
27 + 	SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel");
28 + 
26 29 	return SCX_TEST_PASS;
27 30 }
28 31 
+5
tools/testing/selftests/sched_ext/rt_stall.c
···
119 119 {
120 120 	struct rt_stall *skel;
121 121 
122 + 	if (!__COMPAT_struct_has_field("rq", "ext_server")) {
123 + 		fprintf(stderr, "SKIP: ext DL server not supported\n");
124 + 		return SCX_TEST_SKIP;
125 + 	}
126 + 
122 127 	skel = rt_stall__open();
123 128 	SCX_FAIL_IF(!skel, "Failed to open");
124 129 	SCX_ENUM_INIT(skel);
+36 -4
tools/testing/selftests/sched_ext/runner.c
···
18 18 "It's required for the testcases to be serial, as only a single host-wide sched_ext\n"
19 19 "scheduler may be loaded at any given time."
20 20 "\n"
21 - "Usage: %s [-t TEST] [-h]\n"
21 + "Usage: %s [-t TEST] [-s] [-l] [-q]\n"
22 22 "\n"
23 23 " -t TEST Only run tests whose name includes this string\n"
24 24 " -s Include print output for skipped tests\n"
···
133 133 int main(int argc, char **argv)
134 134 {
135 135 	const char *filter = NULL;
136 + 	const char *failed_tests[MAX_SCX_TESTS];
137 + 	const char *skipped_tests[MAX_SCX_TESTS];
136 138 	unsigned testnum = 0, i;
137 139 	unsigned passed = 0, skipped = 0, failed = 0;
138 140 	int opt;
···
161 159 		default:
162 160 			fprintf(stderr, help_fmt, basename(argv[0]));
163 161 			return opt != 'h';
162 + 		}
163 + 	}
164 + 
165 + 	if (optind < argc) {
166 + 		fprintf(stderr, "Unexpected argument '%s'. Use -t to filter tests.\n",
167 + 			argv[optind]);
168 + 		return 1;
169 + 	}
170 + 
171 + 	if (filter) {
172 + 		for (i = 0; i < __scx_num_tests; i++) {
173 + 			if (!should_skip_test(&__scx_tests[i], filter))
174 + 				break;
175 + 		}
176 + 		if (i == __scx_num_tests) {
177 + 			fprintf(stderr, "No tests matched filter '%s'\n", filter);
178 + 			fprintf(stderr, "Available tests (use -l to list):\n");
179 + 			for (i = 0; i < __scx_num_tests; i++)
180 + 				fprintf(stderr, " %s\n", __scx_tests[i].name);
181 + 			return 1;
164 182 		}
165 183 	}
166 184 
···
220 198 			passed++;
221 199 			break;
222 200 		case SCX_TEST_SKIP:
223 - 			skipped++;
201 + 			skipped_tests[skipped++] = test->name;
224 202 			break;
225 203 		case SCX_TEST_FAIL:
226 - 			failed++;
204 + 			failed_tests[failed++] = test->name;
227 205 			break;
228 206 		}
229 207 	}
···
232 210 	printf("PASSED: %u\n", passed);
233 211 	printf("SKIPPED: %u\n", skipped);
234 212 	printf("FAILED: %u\n", failed);
213 + 	if (skipped > 0) {
214 + 		printf("\nSkipped tests:\n");
215 + 		for (i = 0; i < skipped; i++)
216 + 			printf(" - %s\n", skipped_tests[i]);
217 + 	}
218 + 	if (failed > 0) {
219 + 		printf("\nFailed tests:\n");
220 + 		for (i = 0; i < failed; i++)
221 + 			printf(" - %s\n", failed_tests[i]);
222 + 	}
235 223 
236 - 	return 0;
224 + 	return failed > 0 ? 1 : 0;
237 225 }
238 226 
239 227 void scx_test_register(struct scx_test *test)
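Taken together, the runner changes make results consumable without scraping the log: stray positional arguments are rejected, a -t filter that matches nothing lists the available tests, the summary names each skipped and failed test, and the process now exits non-zero when any test failed, so the exit status alone should be enough for CI use.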
+5 -3
tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
···
53 53 
54 54 void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p)
55 55 {
56 - 	if (scx_bpf_dsq_move_to_local(VTIME_DSQ))
56 + 	if (scx_bpf_dsq_move_to_local(VTIME_DSQ, 0))
57 57 		consumed = true;
58 58 }
59 59 
···
66 66 void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p,
67 67 		    bool runnable)
68 68 {
69 - 	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
69 + 	u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice);
70 + 
71 + 	scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta);
70 72 }
71 73 
72 74 void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p)
73 75 {
74 - 	p->scx.dsq_vtime = vtime_now;
76 + 	scx_bpf_task_set_dsq_vtime(p, vtime_now);
75 77 }
76 78 
77 79 s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init)
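The vtime bookkeeping in this test now goes through helpers instead of writing p->scx fields directly. Judging by the removed line, scale_by_task_weight_inverse() computes the same delta = used * 100 / weight scaling that used to be open-coded, so a task at twice the default weight (200) that burns a full SCX_SLICE_DFL accrues only half the vtime of a weight-100 task, and scx_bpf_task_set_dsq_vtime() replaces the direct store to p->scx.dsq_vtime.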
+1 -1
tools/testing/selftests/sched_ext/util.h
···
10 10 long file_read_long(const char *path);
11 11 int file_write_long(const char *path, long val);
12 12 
13 - #endif // __SCX_TEST_H__
13 + #endif // __SCX_TEST_UTIL_H__