Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

- cgroup sub-scheduler groundwork

Multiple BPF schedulers can be attached to cgroups and the dispatch
path is made hierarchical. This involves substantial restructuring of
the core dispatch, bypass, watchdog, and dump paths to be
per-scheduler, along with new infrastructure for scheduler ownership
enforcement, lifecycle management, and cgroup subtree iteration

The enqueue path is not yet updated and will follow in a later cycle

- scx_bpf_dsq_reenq() generalized to support any DSQ including remote
local DSQs and user DSQs

Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
to local DSQs either run immediately or get reenqueued back through
ops.enqueue(), giving schedulers tighter control over queueing
latency

Also useful for opportunistic CPU sharing across sub-schedulers
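
As a rough sketch (not code from this series), a BPF scheduler could use the
new flag from ops.enqueue() as below, falling back to a shared DSQ when the
immediate insertion bounces back. SHARED_DSQ and the myscx_ prefix are
made-up names; scx_bpf_dsq_insert(), scx_bpf_task_cpu(), SCX_DSQ_LOCAL_ON,
SCX_SLICE_DFL and SCX_ENQ_REENQ are existing sched_ext interfaces:

  void BPF_STRUCT_OPS(myscx_enqueue, struct task_struct *p, u64 enq_flags)
  {
  	s32 cpu = scx_bpf_task_cpu(p);

  	/*
  	 * Try to run @p on its previous CPU right away. With SCX_ENQ_IMMED,
  	 * if the CPU can't take @p immediately the core reenqueues it
  	 * through this callback (with SCX_ENQ_REENQ set) instead of leaving
  	 * it queued behind other work.
  	 */
  	if (!(enq_flags & SCX_ENQ_REENQ)) {
  		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
  				   enq_flags | SCX_ENQ_IMMED);
  		return;
  	}

  	/* bounced back - queue on a shared user DSQ for ops.dispatch() */
  	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
  }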

- Previously, ops.dequeue() was only invoked when the core knew a task was
in BPF data structures; it missed scheduling property change events and
skipped the callback for non-local DSQ dispatches from ops.select_cpu()

Fixed to guarantee exactly one ops.dequeue() call when a task leaves
BPF scheduler custody
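
A trivial sketch (illustration only, not from the series) of relying on the
exactly-once guarantee for bookkeeping; the counters are made-up globals,
while SCX_DEQ_CORE_SCHED_EXEC is an existing dequeue flag:

  u64 nr_custody_exits;
  u64 nr_core_sched_picks;

  void BPF_STRUCT_OPS(myscx_dequeue, struct task_struct *p, u64 deq_flags)
  {
  	/* guaranteed to run exactly once each time @p leaves custody */
  	__sync_fetch_and_add(&nr_custody_exits, 1);

  	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
  		__sync_fetch_and_add(&nr_core_sched_picks, 1);
  }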

- Kfunc access validation moved from runtime to BPF verifier time,
removing runtime mask enforcement

- Idle SMT sibling prioritization in the idle CPU selection path
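
This benefits schedulers that rely on the built-in idle tracking, e.g. via
the existing scx_bpf_select_cpu_dfl() kfunc. A minimal ops.select_cpu()
along the lines of scx_simple (myscx_ is a made-up prefix):

  s32 BPF_STRUCT_OPS(myscx_select_cpu, struct task_struct *p, s32 prev_cpu,
  		   u64 wake_flags)
  {
  	bool is_idle = false;
  	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

  	/* an idle CPU (now preferring idle SMT siblings) was found */
  	if (is_idle)
  		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

  	return cpu;
  }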

- Documentation, selftest, and tooling updates. Misc bug fixes and
cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
sched_ext: Make string params of __ENUM_set() const
tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
sched_ext: Drop spurious warning on kick during scheduler disable
sched_ext: Warn on task-based SCX op recursion
sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
sched_ext: Remove runtime kfunc mask enforcement
sched_ext: Add verifier-time kfunc context filter
sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
sched_ext: Decouple kfunc unlocked-context check from kf_mask
sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
sched_ext: Drop TRACING access to select_cpu kfuncs
selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
selftests/sched_ext: Improve runner error reporting for invalid arguments
sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
sched_ext: Documentation: Add ops.dequeue() to task lifecycle
tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
...

+5257 -1266
+188 -19
Documentation/scheduler/sched-ext.rst
··· 93 93 # cat /sys/kernel/sched_ext/enable_seq 94 94 1 95 95 96 + Each running scheduler also exposes a per-scheduler ``events`` file under 97 + ``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic 98 + counters. Each counter occupies one ``name value`` line: 99 + 100 + .. code-block:: none 101 + 102 + # cat /sys/kernel/sched_ext/simple/events 103 + SCX_EV_SELECT_CPU_FALLBACK 0 104 + SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0 105 + SCX_EV_DISPATCH_KEEP_LAST 123 106 + SCX_EV_ENQ_SKIP_EXITING 0 107 + SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0 108 + SCX_EV_REENQ_IMMED 0 109 + SCX_EV_REENQ_LOCAL_REPEAT 0 110 + SCX_EV_REFILL_SLICE_DFL 456789 111 + SCX_EV_BYPASS_DURATION 0 112 + SCX_EV_BYPASS_DISPATCH 0 113 + SCX_EV_BYPASS_ACTIVATE 0 114 + SCX_EV_INSERT_NOT_OWNED 0 115 + SCX_EV_SUB_BYPASS_DISPATCH 0 116 + 117 + The counters are described in ``kernel/sched/ext_internal.h``; briefly: 118 + 119 + * ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by 120 + the task and the core scheduler silently picked a fallback CPU. 121 + * ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected 122 + to the global DSQ because the target CPU went offline. 123 + * ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other 124 + task was available (only when ``SCX_OPS_ENQ_LAST`` is not set). 125 + * ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ 126 + directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING`` is not set). 127 + * ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was 128 + dispatched to its local DSQ directly (only when 129 + ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set). 130 + * ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was 131 + re-enqueued because the target CPU was not available for immediate execution. 132 + * ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered 133 + another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ`` 134 + handling in the BPF scheduler. 135 + * ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the 136 + default value (``SCX_SLICE_DFL``). 137 + * ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode. 138 + * ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode. 139 + * ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated. 140 + * ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this 141 + scheduler into a DSQ; such attempts are silently ignored. 142 + * ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass 143 + DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``). 144 + 96 145 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 97 146 detailed information: 98 147 ··· 277 228 scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 278 229 using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 279 230 280 - A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 281 - by calling ``scx_bpf_dsq_insert()``. If the task is inserted into 282 - ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the 283 - local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 284 - Additionally, inserting directly from ``ops.select_cpu()`` will cause the 285 - ``ops.enqueue()`` callback to be skipped. 
286 - 287 231 Note that the scheduler core will ignore an invalid CPU selection, for 288 232 example, if it's outside the allowed cpumask of the task. 233 + 234 + A task can be immediately inserted into a DSQ from ``ops.select_cpu()`` 235 + by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``. 236 + 237 + If the task is inserted into ``SCX_DSQ_LOCAL`` from 238 + ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU 239 + is returned from ``ops.select_cpu()``. Additionally, inserting directly 240 + from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to 241 + be skipped. 242 + 243 + Any other attempt to store a task in BPF-internal data structures from 244 + ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being 245 + invoked. This is discouraged, as it can introduce racy behavior or 246 + inconsistent state. 289 247 290 248 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 291 249 task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()`` ··· 307 251 308 252 * Queue the task on the BPF side. 309 253 254 + **Task State Tracking and ops.dequeue() Semantics** 255 + 256 + A task is in the "BPF scheduler's custody" when the BPF scheduler is 257 + responsible for managing its lifecycle. A task enters custody when it is 258 + dispatched to a user DSQ or stored in the BPF scheduler's internal data 259 + structures. Custody is entered only from ``ops.enqueue()`` for those 260 + operations. The only exception is dispatching to a user DSQ from 261 + ``ops.select_cpu()``: although the task is not yet technically in BPF 262 + scheduler custody at that point, the dispatch has the same semantic 263 + effect as dispatching from ``ops.enqueue()`` for custody-related 264 + purposes. 265 + 266 + Once ``ops.enqueue()`` is called, the task may or may not enter custody 267 + depending on what the scheduler does: 268 + 269 + * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``, 270 + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler 271 + is done with the task - it either goes straight to a CPU's local run 272 + queue or to the global DSQ as a fallback. The task never enters (or 273 + exits) BPF custody, and ``ops.dequeue()`` will not be called. 274 + 275 + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the 276 + BPF scheduler's custody. When the task later leaves BPF custody 277 + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for 278 + sleep/property changes), ``ops.dequeue()`` will be called exactly 279 + once. 280 + 281 + * **Stored in BPF data structures** (e.g., internal BPF queues): the 282 + task is in BPF custody. ``ops.dequeue()`` will be called when it 283 + leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or 284 + on property change / sleep). 285 + 286 + When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked. 287 + The dequeue can happen for different reasons, distinguished by flags: 288 + 289 + 1. **Regular dispatch**: when a task in BPF custody is dispatched to a 290 + terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for 291 + execution), ``ops.dequeue()`` is triggered without any special flags. 292 + 293 + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and 294 + core scheduling picks a task for execution while it's still in BPF 295 + custody, ``ops.dequeue()`` is called with the 296 + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. 297 + 298 + 3. 
**Scheduling property change**: when a task property changes (via 299 + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, 300 + priority changes, CPU migrations, etc.) while the task is still in 301 + BPF custody, ``ops.dequeue()`` is called with the 302 + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. 303 + 304 + **Important**: Once a task has left BPF custody (e.g., after being 305 + dispatched to a terminal DSQ), property changes will not trigger 306 + ``ops.dequeue()``, since the task is no longer managed by the BPF 307 + scheduler. 308 + 310 309 3. When a CPU is ready to schedule, it first looks at its local DSQ. If 311 310 empty, it then looks at the global DSQ. If there still isn't a task to 312 311 run, ``ops.dispatch()`` is invoked which can use the following two ··· 375 264 rather than performing them immediately. There can be up to 376 265 ``ops.dispatch_max_batch`` pending tasks. 377 266 378 - * ``scx_bpf_move_to_local()`` moves a task from the specified non-local 267 + * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local 379 268 DSQ to the dispatching DSQ. This function cannot be called with any BPF 380 - locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions 269 + locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions 381 270 tasks before trying to move from the specified DSQ. 382 271 383 272 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, ··· 408 297 Task Lifecycle 409 298 -------------- 410 299 411 - The following pseudo-code summarizes the entire lifecycle of a task managed 412 - by a sched_ext scheduler: 300 + The following pseudo-code presents a rough overview of the entire lifecycle 301 + of a task managed by a sched_ext scheduler: 413 302 414 303 .. code-block:: c 415 304 ··· 422 311 423 312 ops.runnable(); /* Task becomes ready to run */ 424 313 425 - while (task is runnable) { 426 - if (task is not in a DSQ && task->scx.slice == 0) { 314 + while (task_is_runnable(task)) { 315 + if (task is not in a DSQ || task->scx.slice == 0) { 427 316 ops.enqueue(); /* Task can be added to a DSQ */ 317 + 318 + /* Task property change (i.e., affinity, nice, etc.)? */ 319 + if (sched_change(task)) { 320 + ops.dequeue(); /* Exiting BPF scheduler custody */ 321 + ops.quiescent(); 322 + 323 + /* Property change callback, e.g. 
ops.set_weight() */ 324 + 325 + ops.runnable(); 326 + continue; 327 + } 428 328 429 329 /* Any usable CPU becomes available */ 430 330 431 - ops.dispatch(); /* Task is moved to a local DSQ */ 331 + ops.dispatch(); /* Task is moved to a local DSQ */ 332 + ops.dequeue(); /* Exiting BPF scheduler custody */ 432 333 } 334 + 433 335 ops.running(); /* Task starts running on its assigned CPU */ 434 - while (task->scx.slice > 0 && task is runnable) 336 + 337 + while (task_is_runnable(task) && task->scx.slice > 0) { 435 338 ops.tick(); /* Called every 1/HZ seconds */ 339 + 340 + if (task->scx.slice == 0) 341 + ops.dispatch(); /* task->scx.slice can be refilled */ 342 + } 343 + 436 344 ops.stopping(); /* Task stops running (time slice expires or wait) */ 437 - 438 - /* Task's CPU becomes available */ 439 - 440 - ops.dispatch(); /* task->scx.slice can be refilled */ 441 345 } 442 346 443 347 ops.quiescent(); /* Task releases its assigned CPU (wait) */ ··· 460 334 461 335 ops.disable(); /* Disable BPF scheduling for the task */ 462 336 ops.exit_task(); /* Task is destroyed */ 337 + 338 + Note that the above pseudo-code does not cover all possible state transitions 339 + and edge cases, to name a few examples: 340 + 341 + * ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing 342 + property change on that task, in which case ``ops.dispatch()`` will be 343 + retried. 344 + 345 + * The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``, 346 + in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go 347 + straight to ``ops.running()``. 348 + 349 + * Property changes may occur at virtually any point during the task's lifecycle, 350 + not just when the task is queued and waiting to be dispatched. For example, 351 + changing a property of a running task will lead to the callback sequence 352 + ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) -> 353 + ``ops.runnable()`` -> ``ops.running()``. 354 + 355 + * A sched_ext task can be preempted by a task from a higher-priority scheduling 356 + class, in which case it will exit the tick-dispatch loop even though it is runnable 357 + and has a non-zero slice. 358 + 359 + See the "Scheduling Cycle" section for a more detailed description of how 360 + a freshly woken up task gets on a CPU. 463 361 464 362 Where to Look 465 363 ============= ··· 526 376 * ``scx_userland[.bpf].c``: A minimal scheduler demonstrating user space 527 377 scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order; 528 378 all others are scheduled in user space by a simple vruntime scheduler. 379 + 380 + Module Parameters 381 + ================= 382 + 383 + sched_ext exposes two module parameters under the ``sched_ext.`` prefix that 384 + control bypass-mode behaviour. These knobs are primarily for debugging; there 385 + is usually no reason to change them during normal operation. They can be read 386 + and written at runtime (mode 0600) via 387 + ``/sys/module/sched_ext/parameters/``. 388 + 389 + ``sched_ext.slice_bypass_us`` (default: 5000 µs) 390 + The time slice assigned to all tasks when the scheduler is in bypass mode, 391 + i.e. during BPF scheduler load, unload, and error recovery. Valid range is 392 + 100 µs to 100 ms. 393 + 394 + ``sched_ext.bypass_lb_intv_us`` (default: 500000 µs) 395 + The interval at which the bypass-mode load balancer redistributes tasks 396 + across CPUs. Set to 0 to disable load balancing during bypass mode. Valid 397 + range is 0 to 10 s. 
529 398 530 399 ABI Instability 531 400 ===============
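
As a quick illustration of the per-scheduler events file documented above, a
minimal (hypothetical) userspace reader might look like this, using the
"simple" scheduler name from the documentation example:

  #include <stdio.h>

  int main(void)
  {
  	FILE *f = fopen("/sys/kernel/sched_ext/simple/events", "r");
  	char name[64];
  	unsigned long long val;

  	if (!f) {
  		perror("fopen");
  		return 1;
  	}

  	/* each counter occupies one "NAME VALUE" line */
  	while (fscanf(f, "%63s %llu", name, &val) == 2)
  		printf("%-40s %llu\n", name, val);

  	fclose(f);
  	return 0;
  }
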
+4
include/linux/cgroup-defs.h
··· 17 17 #include <linux/refcount.h> 18 18 #include <linux/percpu-refcount.h> 19 19 #include <linux/percpu-rwsem.h> 20 + #include <linux/sched.h> 20 21 #include <linux/u64_stats_sync.h> 21 22 #include <linux/workqueue.h> 22 23 #include <linux/bpf-cgroup-defs.h> ··· 628 627 629 628 #ifdef CONFIG_BPF_SYSCALL 630 629 struct bpf_local_storage __rcu *bpf_cgrp_storage; 630 + #endif 631 + #ifdef CONFIG_EXT_SUB_SCHED 632 + struct scx_sched __rcu *scx_sched; 631 633 #endif 632 634 633 635 /* All ancestors including self */
+64 -45
include/linux/sched/ext.h
··· 62 62 SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU, 63 63 }; 64 64 65 + struct scx_deferred_reenq_user { 66 + struct list_head node; 67 + u64 flags; 68 + }; 69 + 70 + struct scx_dsq_pcpu { 71 + struct scx_dispatch_q *dsq; 72 + struct scx_deferred_reenq_user deferred_reenq_user; 73 + }; 74 + 65 75 /* 66 76 * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered 67 77 * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to ··· 88 78 u64 id; 89 79 struct rhash_head hash_node; 90 80 struct llist_node free_node; 81 + struct scx_sched *sched; 82 + struct scx_dsq_pcpu __percpu *pcpu; 91 83 struct rcu_head rcu; 92 84 }; 93 85 94 - /* scx_entity.flags */ 86 + /* sched_ext_entity.flags */ 95 87 enum scx_ent_flags { 96 88 SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ 89 + SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */ 97 90 SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ 98 91 SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ 92 + SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */ 93 + SCX_TASK_IMMED = 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */ 99 94 100 - SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ 95 + /* 96 + * Bits 8 and 9 are used to carry task state: 97 + * 98 + * NONE ops.init_task() not called yet 99 + * INIT ops.init_task() succeeded, but task can be cancelled 100 + * READY fully initialized, but not in sched_ext 101 + * ENABLED fully initialized and in sched_ext 102 + */ 103 + SCX_TASK_STATE_SHIFT = 8, /* bits 8 and 9 are used to carry task state */ 101 104 SCX_TASK_STATE_BITS = 2, 102 105 SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT, 103 106 104 - SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */ 105 - }; 107 + SCX_TASK_NONE = 0 << SCX_TASK_STATE_SHIFT, 108 + SCX_TASK_INIT = 1 << SCX_TASK_STATE_SHIFT, 109 + SCX_TASK_READY = 2 << SCX_TASK_STATE_SHIFT, 110 + SCX_TASK_ENABLED = 3 << SCX_TASK_STATE_SHIFT, 106 111 107 - /* scx_entity.flags & SCX_TASK_STATE_MASK */ 108 - enum scx_task_state { 109 - SCX_TASK_NONE, /* ops.init_task() not called yet */ 110 - SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */ 111 - SCX_TASK_READY, /* fully initialized, but not in sched_ext */ 112 - SCX_TASK_ENABLED, /* fully initialized and in sched_ext */ 112 + /* 113 + * Bits 12 and 13 are used to carry reenqueue reason. In addition to 114 + * %SCX_ENQ_REENQ flag, ops.enqueue() can also test for 115 + * %SCX_TASK_REENQ_REASON_NONE to distinguish reenqueues. 
116 + * 117 + * NONE not being reenqueued 118 + * KFUNC reenqueued by scx_bpf_dsq_reenq() and friends 119 + * IMMED reenqueued due to failed ENQ_IMMED 120 + * PREEMPTED preempted while running 121 + */ 122 + SCX_TASK_REENQ_REASON_SHIFT = 12, 123 + SCX_TASK_REENQ_REASON_BITS = 2, 124 + SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT, 113 125 114 - SCX_TASK_NR_STATES, 126 + SCX_TASK_REENQ_NONE = 0 << SCX_TASK_REENQ_REASON_SHIFT, 127 + SCX_TASK_REENQ_KFUNC = 1 << SCX_TASK_REENQ_REASON_SHIFT, 128 + SCX_TASK_REENQ_IMMED = 2 << SCX_TASK_REENQ_REASON_SHIFT, 129 + SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT, 130 + 131 + /* iteration cursor, not a task */ 132 + SCX_TASK_CURSOR = 1 << 31, 115 133 }; 116 134 117 135 /* scx_entity.dsq_flags */ 118 136 enum scx_ent_dsq_flags { 119 137 SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ 120 - }; 121 - 122 - /* 123 - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from 124 - * everywhere and the following bits track which kfunc sets are currently 125 - * allowed for %current. This simple per-task tracking works because SCX ops 126 - * nest in a limited way. BPF will likely implement a way to allow and disallow 127 - * kfuncs depending on the calling context which will replace this manual 128 - * mechanism. See scx_kf_allow(). 129 - */ 130 - enum scx_kf_mask { 131 - SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ 132 - /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ 133 - SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ 134 - /* 135 - * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and 136 - * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be 137 - * nested inside DISPATCH. 138 - */ 139 - SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ 140 - SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ 141 - SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ 142 - SCX_KF_REST = 1 << 4, /* other rq-locked operations */ 143 - 144 - __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | 145 - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 146 - __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 147 138 }; 148 139 149 140 enum scx_dsq_lnode_flags { ··· 160 149 u32 priv; /* can be used by iter cursor */ 161 150 }; 162 151 163 - #define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv) \ 152 + #define INIT_DSQ_LIST_CURSOR(__cursor, __dsq, __flags) \ 164 153 (struct scx_dsq_list_node) { \ 165 - .node = LIST_HEAD_INIT((__node).node), \ 154 + .node = LIST_HEAD_INIT((__cursor).node), \ 166 155 .flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags), \ 167 - .priv = (__priv), \ 156 + .priv = READ_ONCE((__dsq)->seq), \ 168 157 } 158 + 159 + struct scx_sched; 169 160 170 161 /* 171 162 * The following is embedded in task_struct and contains all fields necessary 172 163 * for a task to be scheduled by SCX. 173 164 */ 174 165 struct sched_ext_entity { 166 + #ifdef CONFIG_CGROUPS 167 + /* 168 + * Associated scx_sched. Updated either during fork or while holding 169 + * both p->pi_lock and rq lock. 
170 + */ 171 + struct scx_sched __rcu *sched; 172 + #endif 175 173 struct scx_dispatch_q *dsq; 174 + atomic_long_t ops_state; 175 + u64 ddsp_dsq_id; 176 + u64 ddsp_enq_flags; 176 177 struct scx_dsq_list_node dsq_list; /* dispatch order */ 177 178 struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ 178 179 u32 dsq_seq; ··· 194 171 s32 sticky_cpu; 195 172 s32 holding_cpu; 196 173 s32 selected_cpu; 197 - u32 kf_mask; /* see scx_kf_mask above */ 198 174 struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ 199 - atomic_long_t ops_state; 200 175 201 176 struct list_head runnable_node; /* rq->scx.runnable_list */ 202 177 unsigned long runnable_at; ··· 202 181 #ifdef CONFIG_SCHED_CORE 203 182 u64 core_sched_at; /* see scx_prio_less() */ 204 183 #endif 205 - u64 ddsp_dsq_id; 206 - u64 ddsp_enq_flags; 207 184 208 185 /* BPF scheduler modifiable fields */ 209 186
+4
init/Kconfig
··· 1190 1190 1191 1191 endif #CGROUP_SCHED 1192 1192 1193 + config EXT_SUB_SCHED 1194 + def_bool y 1195 + depends on SCHED_CLASS_EXT && CGROUPS 1196 + 1193 1197 config SCHED_MM_CID 1194 1198 def_bool y 1195 1199 depends on SMP && RSEQ
+5 -1
kernel/fork.c
··· 2514 2514 fd_install(pidfd, pidfile); 2515 2515 2516 2516 proc_fork_connector(p); 2517 - sched_post_fork(p); 2517 + /* 2518 + * sched_ext needs @p to be associated with its cgroup in its post_fork 2519 + * hook. cgroup_post_fork() should come before sched_post_fork(). 2520 + */ 2518 2521 cgroup_post_fork(p, args); 2522 + sched_post_fork(p); 2519 2523 perf_event_fork(p); 2520 2524 2521 2525 trace_task_newtask(p, clone_flags);
+1 -1
kernel/sched/core.c
··· 4776 4776 p->sched_class->task_fork(p); 4777 4777 raw_spin_unlock_irqrestore(&p->pi_lock, flags); 4778 4778 4779 - return scx_fork(p); 4779 + return scx_fork(p, kargs); 4780 4780 } 4781 4781 4782 4782 void sched_cancel_fork(struct task_struct *p)
+3062 -937
kernel/sched/ext.c
··· 9 9 #include <linux/btf_ids.h> 10 10 #include "ext_idle.h" 11 11 12 + static DEFINE_RAW_SPINLOCK(scx_sched_lock); 13 + 12 14 /* 13 15 * NOTE: sched_ext is in the process of growing multiple scheduler support and 14 16 * scx_root usage is in a transitional state. Naked dereferences are safe if the ··· 19 17 * are used as temporary markers to indicate that the dereferences need to be 20 18 * updated to point to the associated scheduler instances rather than scx_root. 21 19 */ 22 - static struct scx_sched __rcu *scx_root; 20 + struct scx_sched __rcu *scx_root; 21 + 22 + /* 23 + * All scheds, writers must hold both scx_enable_mutex and scx_sched_lock. 24 + * Readers can hold either or rcu_read_lock(). 25 + */ 26 + static LIST_HEAD(scx_sched_all); 27 + 28 + #ifdef CONFIG_EXT_SUB_SCHED 29 + static const struct rhashtable_params scx_sched_hash_params = { 30 + .key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id), 31 + .key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id), 32 + .head_offset = offsetof(struct scx_sched, hash_node), 33 + }; 34 + 35 + static struct rhashtable scx_sched_hash; 36 + #endif 23 37 24 38 /* 25 39 * During exit, a task may schedule after losing its PIDs. When disabling the ··· 51 33 DEFINE_STATIC_KEY_FALSE(__scx_enabled); 52 34 DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); 53 35 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED); 54 - static int scx_bypass_depth; 36 + static DEFINE_RAW_SPINLOCK(scx_bypass_lock); 55 37 static cpumask_var_t scx_bypass_lb_donee_cpumask; 56 38 static cpumask_var_t scx_bypass_lb_resched_cpumask; 57 - static bool scx_aborting; 58 39 static bool scx_init_task_enabled; 59 40 static bool scx_switching_all; 60 41 DEFINE_STATIC_KEY_FALSE(__scx_switched_all); 61 42 62 - /* 63 - * Tracks whether scx_enable() called scx_bypass(true). Used to balance bypass 64 - * depth on enable failure. Will be removed when bypass depth is moved into the 65 - * sched instance. 66 - */ 67 - static bool scx_bypassed_for_enable; 68 - 69 43 static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); 70 44 static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); 71 45 46 + #ifdef CONFIG_EXT_SUB_SCHED 72 47 /* 73 - * A monotically increasing sequence number that is incremented every time a 74 - * scheduler is enabled. This can be used by to check if any custom sched_ext 48 + * The sub sched being enabled. Used by scx_disable_and_exit_task() to exit 49 + * tasks for the sub-sched being enabled. Use a global variable instead of a 50 + * per-task field as all enables are serialized. 51 + */ 52 + static struct scx_sched *scx_enabling_sub_sched; 53 + #else 54 + #define scx_enabling_sub_sched (struct scx_sched *)NULL 55 + #endif /* CONFIG_EXT_SUB_SCHED */ 56 + 57 + /* 58 + * A monotonically increasing sequence number that is incremented every time a 59 + * scheduler is enabled. This can be used to check if any custom sched_ext 75 60 * scheduler has ever been used in the system. 76 61 */ 77 62 static atomic_long_t scx_enable_seq = ATOMIC_LONG_INIT(0); 78 63 79 64 /* 80 - * The maximum amount of time in jiffies that a task may be runnable without 81 - * being scheduled on a CPU. If this timeout is exceeded, it will trigger 82 - * scx_error(). 65 + * Watchdog interval. All scx_sched's share a single watchdog timer and the 66 + * interval is half of the shortest sch->watchdog_timeout. 83 67 */ 84 - static unsigned long scx_watchdog_timeout; 68 + static unsigned long scx_watchdog_interval; 85 69 86 70 /* 87 71 * The last time the delayed work was run. 
This delayed work relies on ··· 126 106 127 107 static LLIST_HEAD(dsqs_to_free); 128 108 129 - /* dispatch buf */ 130 - struct scx_dsp_buf_ent { 131 - struct task_struct *task; 132 - unsigned long qseq; 133 - u64 dsq_id; 134 - u64 enq_flags; 135 - }; 136 - 137 - static u32 scx_dsp_max_batch; 138 - 139 - struct scx_dsp_ctx { 140 - struct rq *rq; 141 - u32 cursor; 142 - u32 nr_tasks; 143 - struct scx_dsp_buf_ent buf[]; 144 - }; 145 - 146 - static struct scx_dsp_ctx __percpu *scx_dsp_ctx; 147 - 148 109 /* string formatting from BPF */ 149 110 struct scx_bstr_buf { 150 111 u64 data[MAX_BPRINTF_VARARGS]; ··· 136 135 static struct scx_bstr_buf scx_exit_bstr_buf; 137 136 138 137 /* ops debug dump */ 138 + static DEFINE_RAW_SPINLOCK(scx_dump_lock); 139 + 139 140 struct scx_dump_data { 140 141 s32 cpu; 141 142 bool first; ··· 159 156 * There usually is no reason to modify these as normal scheduler operation 160 157 * shouldn't be affected by them. The knobs are primarily for debugging. 161 158 */ 162 - static u64 scx_slice_dfl = SCX_SLICE_DFL; 163 159 static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC; 164 160 static unsigned int scx_bypass_lb_intv_us = SCX_BYPASS_LB_DFL_INTV_US; 165 161 ··· 195 193 #define CREATE_TRACE_POINTS 196 194 #include <trace/events/sched_ext.h> 197 195 198 - static void process_ddsp_deferred_locals(struct rq *rq); 196 + static void run_deferred(struct rq *rq); 199 197 static bool task_dead_and_done(struct task_struct *p); 200 - static u32 reenq_local(struct rq *rq); 201 198 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags); 199 + static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind); 202 200 static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind, 203 201 s64 exit_code, const char *fmt, va_list args); 204 202 ··· 229 227 return -(long)jiffies_to_msecs(now - at); 230 228 } 231 229 232 - /* if the highest set bit is N, return a mask with bits [N+1, 31] set */ 233 - static u32 higher_bits(u32 flags) 234 - { 235 - return ~((1 << fls(flags)) - 1); 236 - } 237 - 238 - /* return the mask with only the highest bit set */ 239 - static u32 highest_bit(u32 flags) 240 - { 241 - int bit = fls(flags); 242 - return ((u64)1 << bit) >> 1; 243 - } 244 - 245 230 static bool u32_before(u32 a, u32 b) 246 231 { 247 232 return (s32)(a - b) < 0; 248 233 } 249 234 250 - static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, 251 - struct task_struct *p) 235 + #ifdef CONFIG_EXT_SUB_SCHED 236 + /** 237 + * scx_parent - Find the parent sched 238 + * @sch: sched to find the parent of 239 + * 240 + * Returns the parent scheduler or %NULL if @sch is root. 241 + */ 242 + static struct scx_sched *scx_parent(struct scx_sched *sch) 252 243 { 253 - return sch->global_dsqs[cpu_to_node(task_cpu(p))]; 244 + if (sch->level) 245 + return sch->ancestors[sch->level - 1]; 246 + else 247 + return NULL; 248 + } 249 + 250 + /** 251 + * scx_next_descendant_pre - find the next descendant for pre-order walk 252 + * @pos: the current position (%NULL to initiate traversal) 253 + * @root: sched whose descendants to walk 254 + * 255 + * To be used by scx_for_each_descendant_pre(). Find the next descendant to 256 + * visit for pre-order traversal of @root's descendants. @root is included in 257 + * the iteration and the first node to be visited. 
258 + */ 259 + static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, 260 + struct scx_sched *root) 261 + { 262 + struct scx_sched *next; 263 + 264 + lockdep_assert(lockdep_is_held(&scx_enable_mutex) || 265 + lockdep_is_held(&scx_sched_lock)); 266 + 267 + /* if first iteration, visit @root */ 268 + if (!pos) 269 + return root; 270 + 271 + /* visit the first child if exists */ 272 + next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling); 273 + if (next) 274 + return next; 275 + 276 + /* no child, visit my or the closest ancestor's next sibling */ 277 + while (pos != root) { 278 + if (!list_is_last(&pos->sibling, &scx_parent(pos)->children)) 279 + return list_next_entry(pos, sibling); 280 + pos = scx_parent(pos); 281 + } 282 + 283 + return NULL; 284 + } 285 + 286 + static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) 287 + { 288 + return rhashtable_lookup(&scx_sched_hash, &cgroup_id, 289 + scx_sched_hash_params); 290 + } 291 + 292 + static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) 293 + { 294 + rcu_assign_pointer(p->scx.sched, sch); 295 + } 296 + #else /* CONFIG_EXT_SUB_SCHED */ 297 + static struct scx_sched *scx_parent(struct scx_sched *sch) { return NULL; } 298 + static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; } 299 + static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) { return NULL; } 300 + static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {} 301 + #endif /* CONFIG_EXT_SUB_SCHED */ 302 + 303 + /** 304 + * scx_is_descendant - Test whether sched is a descendant 305 + * @sch: sched to test 306 + * @ancestor: ancestor sched to test against 307 + * 308 + * Test whether @sch is a descendant of @ancestor. 309 + */ 310 + static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor) 311 + { 312 + if (sch->level < ancestor->level) 313 + return false; 314 + return sch->ancestors[ancestor->level] == ancestor; 315 + } 316 + 317 + /** 318 + * scx_for_each_descendant_pre - pre-order walk of a sched's descendants 319 + * @pos: iteration cursor 320 + * @root: sched to walk the descendants of 321 + * 322 + * Walk @root's descendants. @root is included in the iteration and the first 323 + * node to be visited. Must be called with either scx_enable_mutex or 324 + * scx_sched_lock held. 325 + */ 326 + #define scx_for_each_descendant_pre(pos, root) \ 327 + for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \ 328 + (pos) = scx_next_descendant_pre((pos), (root))) 329 + 330 + static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu) 331 + { 332 + return &sch->pnode[cpu_to_node(cpu)]->global_dsq; 254 333 } 255 334 256 335 static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id) ··· 347 264 return __setscheduler_class(p->policy, p->prio); 348 265 } 349 266 350 - /* 351 - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX 352 - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate 353 - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check 354 - * whether it's running from an allowed context. 355 - * 356 - * @mask is constant, always inline to cull the mask calculations. 
357 - */ 358 - static __always_inline void scx_kf_allow(u32 mask) 267 + static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu) 359 268 { 360 - /* nesting is allowed only in increasing scx_kf_mask order */ 361 - WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, 362 - "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", 363 - current->scx.kf_mask, mask); 364 - current->scx.kf_mask |= mask; 365 - barrier(); 269 + return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq; 366 270 } 367 271 368 - static void scx_kf_disallow(u32 mask) 272 + static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu) 369 273 { 370 - barrier(); 371 - current->scx.kf_mask &= ~mask; 274 + #ifdef CONFIG_EXT_SUB_SCHED 275 + /* 276 + * If @sch is a sub-sched which is bypassing, its tasks should go into 277 + * the bypass DSQs of the nearest ancestor which is not bypassing. The 278 + * not-bypassing ancestor is responsible for scheduling all tasks from 279 + * bypassing sub-trees. If all ancestors including root are bypassing, 280 + * all tasks should go to the root's bypass DSQs. 281 + * 282 + * Whenever a sched starts bypassing, all runnable tasks in its subtree 283 + * are re-enqueued after scx_bypassing() is turned on, guaranteeing that 284 + * all tasks are transferred to the right DSQs. 285 + */ 286 + while (scx_parent(sch) && scx_bypassing(sch, cpu)) 287 + sch = scx_parent(sch); 288 + #endif /* CONFIG_EXT_SUB_SCHED */ 289 + 290 + return bypass_dsq(sch, cpu); 291 + } 292 + 293 + /** 294 + * bypass_dsp_enabled - Check if bypass dispatch path is enabled 295 + * @sch: scheduler to check 296 + * 297 + * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled 298 + * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors 299 + * are bypassing. In the former case, the ancestor is not itself bypassing but 300 + * its bypass DSQs will be populated with bypassed tasks from descendants. Thus, 301 + * the ancestor's bypass dispatch path must be active even though its own 302 + * bypass_depth remains zero. 303 + * 304 + * This function checks bypass_dsp_enable_depth which is managed separately from 305 + * bypass_depth to enable this decoupling. See enable_bypass_dsp() and 306 + * disable_bypass_dsp(). 307 + */ 308 + static bool bypass_dsp_enabled(struct scx_sched *sch) 309 + { 310 + return unlikely(atomic_read(&sch->bypass_dsp_enable_depth)); 311 + } 312 + 313 + /** 314 + * rq_is_open - Is the rq available for immediate execution of an SCX task? 315 + * @rq: rq to test 316 + * @enq_flags: optional %SCX_ENQ_* of the task being enqueued 317 + * 318 + * Returns %true if @rq is currently open for executing an SCX task. After a 319 + * %false return, @rq is guaranteed to invoke SCX dispatch path at least once 320 + * before going to idle and not inserting a task into @rq's local DSQ after a 321 + * %false return doesn't cause @rq to stall. 322 + */ 323 + static bool rq_is_open(struct rq *rq, u64 enq_flags) 324 + { 325 + lockdep_assert_rq_held(rq); 326 + 327 + /* 328 + * A higher-priority class task is either running or in the process of 329 + * waking up on @rq. 330 + */ 331 + if (sched_class_above(rq->next_class, &ext_sched_class)) 332 + return false; 333 + 334 + /* 335 + * @rq is either in transition to or in idle and there is no 336 + * higher-priority class task waking up on it. 
337 + */ 338 + if (sched_class_above(&ext_sched_class, rq->next_class)) 339 + return true; 340 + 341 + /* 342 + * @rq is either picking, in transition to, or running an SCX task. 343 + */ 344 + 345 + /* 346 + * If we're in the dispatch path holding rq lock, $curr may or may not 347 + * be ready depending on whether the on-going dispatch decides to extend 348 + * $curr's slice. We say yes here and resolve it at the end of dispatch. 349 + * See balance_one(). 350 + */ 351 + if (rq->scx.flags & SCX_RQ_IN_BALANCE) 352 + return true; 353 + 354 + /* 355 + * %SCX_ENQ_PREEMPT clears $curr's slice if on SCX and kicks dispatch, 356 + * so allow it to avoid spuriously triggering reenq on a combined 357 + * PREEMPT|IMMED insertion. 358 + */ 359 + if (enq_flags & SCX_ENQ_PREEMPT) 360 + return true; 361 + 362 + /* 363 + * @rq is either in transition to or running an SCX task and can't go 364 + * idle without another SCX dispatch cycle. 365 + */ 366 + return false; 372 367 } 373 368 374 369 /* ··· 469 308 __this_cpu_write(scx_locked_rq_state, rq); 470 309 } 471 310 472 - #define SCX_CALL_OP(sch, mask, op, rq, args...) \ 311 + #define SCX_CALL_OP(sch, op, rq, args...) \ 473 312 do { \ 474 313 if (rq) \ 475 314 update_locked_rq(rq); \ 476 - if (mask) { \ 477 - scx_kf_allow(mask); \ 478 - (sch)->ops.op(args); \ 479 - scx_kf_disallow(mask); \ 480 - } else { \ 481 - (sch)->ops.op(args); \ 482 - } \ 315 + (sch)->ops.op(args); \ 483 316 if (rq) \ 484 317 update_locked_rq(NULL); \ 485 318 } while (0) 486 319 487 - #define SCX_CALL_OP_RET(sch, mask, op, rq, args...) \ 320 + #define SCX_CALL_OP_RET(sch, op, rq, args...) \ 488 321 ({ \ 489 322 __typeof__((sch)->ops.op(args)) __ret; \ 490 323 \ 491 324 if (rq) \ 492 325 update_locked_rq(rq); \ 493 - if (mask) { \ 494 - scx_kf_allow(mask); \ 495 - __ret = (sch)->ops.op(args); \ 496 - scx_kf_disallow(mask); \ 497 - } else { \ 498 - __ret = (sch)->ops.op(args); \ 499 - } \ 326 + __ret = (sch)->ops.op(args); \ 500 327 if (rq) \ 501 328 update_locked_rq(NULL); \ 502 329 __ret; \ 503 330 }) 504 331 505 332 /* 506 - * Some kfuncs are allowed only on the tasks that are subjects of the 507 - * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such 508 - * restrictions, the following SCX_CALL_OP_*() variants should be used when 509 - * invoking scx_ops operations that take task arguments. These can only be used 510 - * for non-nesting operations due to the way the tasks are tracked. 333 + * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments 334 + * and records them in current->scx.kf_tasks[] for the duration of the call. A 335 + * kfunc invoked from inside such an op can then use 336 + * scx_kf_arg_task_ok() to verify that its task argument is one of 337 + * those subject tasks. 511 338 * 512 - * kfuncs which can only operate on such tasks can in turn use 513 - * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on 514 - * the specific task. 339 + * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - 340 + * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock 341 + * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if 342 + * kf_tasks[] is set, @p's scheduler-protected fields are stable. 343 + * 344 + * kf_tasks[] can not stack, so task-based SCX ops must not nest. The 345 + * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants 346 + * while a previous one is still in progress. 
515 347 */ 516 - #define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ 348 + #define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \ 517 349 do { \ 518 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 350 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 519 351 current->scx.kf_tasks[0] = task; \ 520 - SCX_CALL_OP((sch), mask, op, rq, task, ##args); \ 352 + SCX_CALL_OP((sch), op, rq, task, ##args); \ 521 353 current->scx.kf_tasks[0] = NULL; \ 522 354 } while (0) 523 355 524 - #define SCX_CALL_OP_TASK_RET(sch, mask, op, rq, task, args...) \ 356 + #define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \ 525 357 ({ \ 526 358 __typeof__((sch)->ops.op(task, ##args)) __ret; \ 527 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 359 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 528 360 current->scx.kf_tasks[0] = task; \ 529 - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task, ##args); \ 361 + __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ 530 362 current->scx.kf_tasks[0] = NULL; \ 531 363 __ret; \ 532 364 }) 533 365 534 - #define SCX_CALL_OP_2TASKS_RET(sch, mask, op, rq, task0, task1, args...) \ 366 + #define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \ 535 367 ({ \ 536 368 __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ 537 - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 369 + WARN_ON_ONCE(current->scx.kf_tasks[0]); \ 538 370 current->scx.kf_tasks[0] = task0; \ 539 371 current->scx.kf_tasks[1] = task1; \ 540 - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task0, task1, ##args); \ 372 + __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ 541 373 current->scx.kf_tasks[0] = NULL; \ 542 374 current->scx.kf_tasks[1] = NULL; \ 543 375 __ret; \ 544 376 }) 545 377 546 - /* @mask is constant, always inline to cull unnecessary branches */ 547 - static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) 548 - { 549 - if (unlikely(!(current->scx.kf_mask & mask))) { 550 - scx_error(sch, "kfunc with mask 0x%x called from an operation only allowing 0x%x", 551 - mask, current->scx.kf_mask); 552 - return false; 553 - } 554 - 555 - /* 556 - * Enforce nesting boundaries. e.g. A kfunc which can be called from 557 - * DISPATCH must not be called if we're running DEQUEUE which is nested 558 - * inside ops.dispatch(). We don't need to check boundaries for any 559 - * blocking kfuncs as the verifier ensures they're only called from 560 - * sleepable progs. 
561 - */ 562 - if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && 563 - (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { 564 - scx_error(sch, "cpu_release kfunc called from a nested operation"); 565 - return false; 566 - } 567 - 568 - if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && 569 - (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { 570 - scx_error(sch, "dispatch kfunc called from a nested operation"); 571 - return false; 572 - } 573 - 574 - return true; 575 - } 576 - 577 378 /* see SCX_CALL_OP_TASK() */ 578 - static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, 579 - u32 mask, 379 + static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch, 580 380 struct task_struct *p) 581 381 { 582 - if (!scx_kf_allowed(sch, mask)) 583 - return false; 584 - 585 382 if (unlikely((p != current->scx.kf_tasks[0] && 586 383 p != current->scx.kf_tasks[1]))) { 587 384 scx_error(sch, "called on a task not being operated on"); ··· 549 430 return true; 550 431 } 551 432 433 + enum scx_dsq_iter_flags { 434 + /* iterate in the reverse dispatch order */ 435 + SCX_DSQ_ITER_REV = 1U << 16, 436 + 437 + __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, 438 + __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, 439 + 440 + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, 441 + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | 442 + __SCX_DSQ_ITER_HAS_SLICE | 443 + __SCX_DSQ_ITER_HAS_VTIME, 444 + }; 445 + 552 446 /** 553 447 * nldsq_next_task - Iterate to the next task in a non-local DSQ 554 - * @dsq: user dsq being iterated 448 + * @dsq: non-local dsq being iterated 555 449 * @cur: current position, %NULL to start iteration 556 450 * @rev: walk backwards 557 451 * ··· 604 472 for ((p) = nldsq_next_task((dsq), NULL, false); (p); \ 605 473 (p) = nldsq_next_task((dsq), (p), false)) 606 474 475 + /** 476 + * nldsq_cursor_next_task - Iterate to the next task given a cursor in a non-local DSQ 477 + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() 478 + * @dsq: non-local dsq being iterated 479 + * 480 + * Find the next task in a cursor based iteration. The caller must have 481 + * initialized @cursor using INIT_DSQ_LIST_CURSOR() and can release the DSQ lock 482 + * between the iteration steps. 483 + * 484 + * Only tasks which were queued before @cursor was initialized are visible. This 485 + * bounds the iteration and guarantees that vtime never jumps in the other 486 + * direction while iterating. 
487 + */ 488 + static struct task_struct *nldsq_cursor_next_task(struct scx_dsq_list_node *cursor, 489 + struct scx_dispatch_q *dsq) 490 + { 491 + bool rev = cursor->flags & SCX_DSQ_ITER_REV; 492 + struct task_struct *p; 493 + 494 + lockdep_assert_held(&dsq->lock); 495 + BUG_ON(!(cursor->flags & SCX_DSQ_LNODE_ITER_CURSOR)); 496 + 497 + if (list_empty(&cursor->node)) 498 + p = NULL; 499 + else 500 + p = container_of(cursor, struct task_struct, scx.dsq_list); 501 + 502 + /* skip cursors and tasks that were queued after @cursor init */ 503 + do { 504 + p = nldsq_next_task(dsq, p, rev); 505 + } while (p && unlikely(u32_before(cursor->priv, p->scx.dsq_seq))); 506 + 507 + if (p) { 508 + if (rev) 509 + list_move_tail(&cursor->node, &p->scx.dsq_list.node); 510 + else 511 + list_move(&cursor->node, &p->scx.dsq_list.node); 512 + } else { 513 + list_del_init(&cursor->node); 514 + } 515 + 516 + return p; 517 + } 518 + 519 + /** 520 + * nldsq_cursor_lost_task - Test whether someone else took the task since iteration 521 + * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR() 522 + * @rq: rq @p was on 523 + * @dsq: dsq @p was on 524 + * @p: target task 525 + * 526 + * @p is a task returned by nldsq_cursor_next_task(). The locks may have been 527 + * dropped and re-acquired inbetween. Verify that no one else took or is in the 528 + * process of taking @p from @dsq. 529 + * 530 + * On %false return, the caller can assume full ownership of @p. 531 + */ 532 + static bool nldsq_cursor_lost_task(struct scx_dsq_list_node *cursor, 533 + struct rq *rq, struct scx_dispatch_q *dsq, 534 + struct task_struct *p) 535 + { 536 + lockdep_assert_rq_held(rq); 537 + lockdep_assert_held(&dsq->lock); 538 + 539 + /* 540 + * @p could have already left $src_dsq, got re-enqueud, or be in the 541 + * process of being consumed by someone else. 542 + */ 543 + if (unlikely(p->scx.dsq != dsq || 544 + u32_before(cursor->priv, p->scx.dsq_seq) || 545 + p->scx.holding_cpu >= 0)) 546 + return true; 547 + 548 + /* if @p has stayed on @dsq, its rq couldn't have changed */ 549 + if (WARN_ON_ONCE(rq != task_rq(p))) 550 + return true; 551 + 552 + return false; 553 + } 607 554 608 555 /* 609 556 * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse] ··· 690 479 * changes without breaking backward compatibility. Can be used with 691 480 * bpf_for_each(). See bpf_iter_scx_dsq_*(). 692 481 */ 693 - enum scx_dsq_iter_flags { 694 - /* iterate in the reverse dispatch order */ 695 - SCX_DSQ_ITER_REV = 1U << 16, 696 - 697 - __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, 698 - __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, 699 - 700 - __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, 701 - __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | 702 - __SCX_DSQ_ITER_HAS_SLICE | 703 - __SCX_DSQ_ITER_HAS_VTIME, 704 - }; 705 - 706 482 struct bpf_iter_scx_dsq_kern { 707 483 struct scx_dsq_list_node cursor; 708 484 struct scx_dispatch_q *dsq; ··· 712 514 struct rq_flags rf; 713 515 u32 cnt; 714 516 bool list_locked; 517 + #ifdef CONFIG_EXT_SUB_SCHED 518 + struct cgroup *cgrp; 519 + struct cgroup_subsys_state *css_pos; 520 + struct css_task_iter css_iter; 521 + #endif 715 522 }; 716 523 717 524 /** 718 525 * scx_task_iter_start - Lock scx_tasks_lock and start a task iteration 719 526 * @iter: iterator to init 527 + * @cgrp: Optional root of cgroup subhierarchy to iterate 720 528 * 721 - * Initialize @iter and return with scx_tasks_lock held. Once initialized, @iter 722 - * must eventually be stopped with scx_task_iter_stop(). 529 + * Initialize @iter. 
Once initialized, @iter must eventually be stopped with 530 + * scx_task_iter_stop(). 531 + * 532 + * If @cgrp is %NULL, scx_tasks is used for iteration and this function returns 533 + * with scx_tasks_lock held and @iter->cursor inserted into scx_tasks. 534 + * 535 + * If @cgrp is not %NULL, @cgrp and its descendants' tasks are walked using 536 + * @iter->css_iter. The caller must be holding cgroup_lock() to prevent cgroup 537 + * task migrations. 538 + * 539 + * The two modes of iterations are largely independent and it's likely that 540 + * scx_tasks can be removed in favor of always using cgroup iteration if 541 + * CONFIG_SCHED_CLASS_EXT depends on CONFIG_CGROUPS. 723 542 * 724 543 * scx_tasks_lock and the rq lock may be released using scx_task_iter_unlock() 725 544 * between this and the first next() call or between any two next() calls. If ··· 747 532 * All tasks which existed when the iteration started are guaranteed to be 748 533 * visited as long as they are not dead. 749 534 */ 750 - static void scx_task_iter_start(struct scx_task_iter *iter) 535 + static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp) 751 536 { 752 537 memset(iter, 0, sizeof(*iter)); 753 538 539 + #ifdef CONFIG_EXT_SUB_SCHED 540 + if (cgrp) { 541 + lockdep_assert_held(&cgroup_mutex); 542 + iter->cgrp = cgrp; 543 + iter->css_pos = css_next_descendant_pre(NULL, &iter->cgrp->self); 544 + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); 545 + return; 546 + } 547 + #endif 754 548 raw_spin_lock_irq(&scx_tasks_lock); 755 549 756 550 iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; ··· 812 588 */ 813 589 static void scx_task_iter_stop(struct scx_task_iter *iter) 814 590 { 591 + #ifdef CONFIG_EXT_SUB_SCHED 592 + if (iter->cgrp) { 593 + if (iter->css_pos) 594 + css_task_iter_end(&iter->css_iter); 595 + __scx_task_iter_rq_unlock(iter); 596 + return; 597 + } 598 + #endif 815 599 __scx_task_iter_maybe_relock(iter); 816 600 list_del_init(&iter->cursor.tasks_node); 817 601 scx_task_iter_unlock(iter); ··· 843 611 cond_resched(); 844 612 } 845 613 614 + #ifdef CONFIG_EXT_SUB_SCHED 615 + if (iter->cgrp) { 616 + while (iter->css_pos) { 617 + struct task_struct *p; 618 + 619 + p = css_task_iter_next(&iter->css_iter); 620 + if (p) 621 + return p; 622 + 623 + css_task_iter_end(&iter->css_iter); 624 + iter->css_pos = css_next_descendant_pre(iter->css_pos, 625 + &iter->cgrp->self); 626 + if (iter->css_pos) 627 + css_task_iter_start(iter->css_pos, 0, &iter->css_iter); 628 + } 629 + return NULL; 630 + } 631 + #endif 846 632 __scx_task_iter_maybe_relock(iter); 847 633 848 634 list_for_each_entry(pos, cursor, tasks_node) { ··· 1060 810 return -EPROTO; 1061 811 } 1062 812 1063 - static void run_deferred(struct rq *rq) 1064 - { 1065 - process_ddsp_deferred_locals(rq); 1066 - 1067 - if (local_read(&rq->scx.reenq_local_deferred)) { 1068 - local_set(&rq->scx.reenq_local_deferred, 0); 1069 - reenq_local(rq); 1070 - } 1071 - } 1072 - 1073 813 static void deferred_bal_cb_workfn(struct rq *rq) 1074 814 { 1075 815 run_deferred(rq); ··· 1085 845 static void schedule_deferred(struct rq *rq) 1086 846 { 1087 847 /* 1088 - * Queue an irq work. They are executed on IRQ re-enable which may take 1089 - * a bit longer than the scheduler hook in schedule_deferred_locked(). 848 + * This is the fallback when schedule_deferred_locked() can't use 849 + * the cheaper balance callback or wakeup hook paths (the target 850 + * CPU is not in balance or wakeup). 
Currently, this is primarily 851 + * hit by reenqueue operations targeting a remote CPU. 852 + * 853 + * Queue on the target CPU. The deferred work can run from any CPU 854 + * correctly - the _locked() path already processes remote rqs from 855 + * the calling CPU - but targeting the owning CPU allows IPI delivery 856 + * without waiting for the calling CPU to re-enable IRQs and is 857 + * cheaper as the reenqueue runs locally. 1090 858 */ 1091 - irq_work_queue(&rq->scx.deferred_irq_work); 859 + irq_work_queue_on(&rq->scx.deferred_irq_work, cpu_of(rq)); 1092 860 } 1093 861 1094 862 /** ··· 1144 896 * time to IRQ re-enable shouldn't be long. 1145 897 */ 1146 898 schedule_deferred(rq); 899 + } 900 + 901 + static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq, 902 + u64 reenq_flags, struct rq *locked_rq) 903 + { 904 + struct rq *rq; 905 + 906 + /* 907 + * Allowing reenqueues doesn't make sense while bypassing. This also 908 + * blocks from new reenqueues to be scheduled on dead scheds. 909 + */ 910 + if (unlikely(READ_ONCE(sch->bypass_depth))) 911 + return; 912 + 913 + if (dsq->id == SCX_DSQ_LOCAL) { 914 + rq = container_of(dsq, struct rq, scx.local_dsq); 915 + 916 + struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq)); 917 + struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local; 918 + 919 + /* 920 + * Pairs with smp_mb() in process_deferred_reenq_locals() and 921 + * guarantees that there is a reenq_local() afterwards. 922 + */ 923 + smp_mb(); 924 + 925 + if (list_empty(&drl->node) || 926 + (READ_ONCE(drl->flags) & reenq_flags) != reenq_flags) { 927 + 928 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 929 + 930 + if (list_empty(&drl->node)) 931 + list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals); 932 + WRITE_ONCE(drl->flags, drl->flags | reenq_flags); 933 + } 934 + } else if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN)) { 935 + rq = this_rq(); 936 + 937 + struct scx_dsq_pcpu *dsq_pcpu = per_cpu_ptr(dsq->pcpu, cpu_of(rq)); 938 + struct scx_deferred_reenq_user *dru = &dsq_pcpu->deferred_reenq_user; 939 + 940 + /* 941 + * Pairs with smp_mb() in process_deferred_reenq_users() and 942 + * guarantees that there is a reenq_user() afterwards. 
943 + */ 944 + smp_mb(); 945 + 946 + if (list_empty(&dru->node) || 947 + (READ_ONCE(dru->flags) & reenq_flags) != reenq_flags) { 948 + 949 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 950 + 951 + if (list_empty(&dru->node)) 952 + list_move_tail(&dru->node, &rq->scx.deferred_reenq_users); 953 + WRITE_ONCE(dru->flags, dru->flags | reenq_flags); 954 + } 955 + } else { 956 + scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id); 957 + return; 958 + } 959 + 960 + if (rq == locked_rq) 961 + schedule_deferred_locked(rq); 962 + else 963 + schedule_deferred(rq); 964 + } 965 + 966 + static void schedule_reenq_local(struct rq *rq, u64 reenq_flags) 967 + { 968 + struct scx_sched *root = rcu_dereference_sched(scx_root); 969 + 970 + if (WARN_ON_ONCE(!root)) 971 + return; 972 + 973 + schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq); 1147 974 } 1148 975 1149 976 /** ··· 1297 974 return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); 1298 975 } 1299 976 1300 - static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) 977 + static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 enq_flags) 1301 978 { 979 + /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ 980 + WRITE_ONCE(dsq->nr, dsq->nr + 1); 981 + 1302 982 /* 1303 - * scx_bpf_dsq_nr_queued() reads ->nr without locking. Use READ_ONCE() 1304 - * on the read side and WRITE_ONCE() on the write side to properly 1305 - * annotate the concurrent lockless access and avoid KCSAN warnings. 983 + * Once @p reaches a local DSQ, it can only leave it by being dispatched 984 + * to the CPU or dequeued. In both cases, the only way @p can go back to 985 + * the BPF sched is through enqueueing. If being inserted into a local 986 + * DSQ with IMMED, persist the state until the next enqueueing event in 987 + * do_enqueue_task() so that we can maintain IMMED protection through 988 + * e.g. SAVE/RESTORE cycles and slice extensions. 1306 989 */ 1307 - WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); 990 + if (enq_flags & SCX_ENQ_IMMED) { 991 + if (unlikely(dsq->id != SCX_DSQ_LOCAL)) { 992 + WARN_ON_ONCE(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK)); 993 + return; 994 + } 995 + p->scx.flags |= SCX_TASK_IMMED; 996 + } 997 + 998 + if (p->scx.flags & SCX_TASK_IMMED) { 999 + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1000 + 1001 + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 1002 + return; 1003 + 1004 + rq->scx.nr_immed++; 1005 + 1006 + /* 1007 + * If @rq already had other tasks or the current task is not 1008 + * done yet, @p can't go on the CPU immediately. Re-enqueue. 
1009 + */ 1010 + if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags))) 1011 + schedule_reenq_local(rq, 0); 1012 + } 1013 + } 1014 + 1015 + static void dsq_dec_nr(struct scx_dispatch_q *dsq, struct task_struct *p) 1016 + { 1017 + /* see dsq_inc_nr() */ 1018 + WRITE_ONCE(dsq->nr, dsq->nr - 1); 1019 + 1020 + if (p->scx.flags & SCX_TASK_IMMED) { 1021 + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1022 + 1023 + if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL) || 1024 + WARN_ON_ONCE(rq->scx.nr_immed <= 0)) 1025 + return; 1026 + 1027 + rq->scx.nr_immed--; 1028 + } 1308 1029 } 1309 1030 1310 1031 static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p) 1311 1032 { 1312 - p->scx.slice = READ_ONCE(scx_slice_dfl); 1033 + p->scx.slice = READ_ONCE(sch->slice_dfl); 1313 1034 __scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1); 1035 + } 1036 + 1037 + /* 1038 + * Return true if @p is moving due to an internal SCX migration, false 1039 + * otherwise. 1040 + */ 1041 + static inline bool task_scx_migrating(struct task_struct *p) 1042 + { 1043 + /* 1044 + * We only need to check sticky_cpu: it is set to the destination 1045 + * CPU in move_remote_task_to_local_dsq() before deactivate_task() 1046 + * and cleared when the task is enqueued on the destination, so it 1047 + * is only non-negative during an internal SCX migration. 1048 + */ 1049 + return p->scx.sticky_cpu >= 0; 1050 + } 1051 + 1052 + /* 1053 + * Call ops.dequeue() if the task is in BPF custody and not migrating. 1054 + * Clears %SCX_TASK_IN_CUSTODY when the callback is invoked. 1055 + */ 1056 + static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, 1057 + struct task_struct *p, u64 deq_flags) 1058 + { 1059 + if (!(p->scx.flags & SCX_TASK_IN_CUSTODY) || task_scx_migrating(p)) 1060 + return; 1061 + 1062 + if (SCX_HAS_OP(sch, dequeue)) 1063 + SCX_CALL_OP_TASK(sch, dequeue, rq, p, deq_flags); 1064 + 1065 + p->scx.flags &= ~SCX_TASK_IN_CUSTODY; 1314 1066 } 1315 1067 1316 1068 static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p, ··· 1393 995 { 1394 996 struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1395 997 bool preempt = false; 998 + 999 + call_task_dequeue(scx_root, rq, p, 0); 1396 1000 1397 1001 /* 1398 1002 * If @rq is in balance, the CPU is already vacant and looking for the ··· 1414 1014 resched_curr(rq); 1415 1015 } 1416 1016 1417 - static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, 1418 - struct task_struct *p, u64 enq_flags) 1017 + static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, 1018 + struct scx_dispatch_q *dsq, struct task_struct *p, 1019 + u64 enq_flags) 1419 1020 { 1420 1021 bool is_local = dsq->id == SCX_DSQ_LOCAL; 1421 1022 ··· 1432 1031 scx_error(sch, "attempting to dispatch to a destroyed dsq"); 1433 1032 /* fall back to the global dsq */ 1434 1033 raw_spin_unlock(&dsq->lock); 1435 - dsq = find_global_dsq(sch, p); 1034 + dsq = find_global_dsq(sch, task_cpu(p)); 1436 1035 raw_spin_lock(&dsq->lock); 1437 1036 } 1438 1037 } ··· 1507 1106 WRITE_ONCE(dsq->seq, dsq->seq + 1); 1508 1107 p->scx.dsq_seq = dsq->seq; 1509 1108 1510 - dsq_mod_nr(dsq, 1); 1109 + dsq_inc_nr(dsq, p, enq_flags); 1511 1110 p->scx.dsq = dsq; 1111 + 1112 + /* 1113 + * Update custody and call ops.dequeue() before clearing ops_state: 1114 + * once ops_state is cleared, waiters in ops_dequeue() can proceed 1115 + * and dequeue_task_scx() will RMW p->scx.flags. 
If we clear 1116 + * ops_state first, both sides would modify p->scx.flags 1117 + * concurrently in a non-atomic way. 1118 + */ 1119 + if (is_local) { 1120 + local_dsq_post_enq(dsq, p, enq_flags); 1121 + } else { 1122 + /* 1123 + * Task on global/bypass DSQ: leave custody, task on 1124 + * non-terminal DSQ: enter custody. 1125 + */ 1126 + if (dsq->id == SCX_DSQ_GLOBAL || dsq->id == SCX_DSQ_BYPASS) 1127 + call_task_dequeue(sch, rq, p, 0); 1128 + else 1129 + p->scx.flags |= SCX_TASK_IN_CUSTODY; 1130 + 1131 + raw_spin_unlock(&dsq->lock); 1132 + } 1512 1133 1513 1134 /* 1514 1135 * We're transitioning out of QUEUEING or DISPATCHING. store_release to ··· 1538 1115 */ 1539 1116 if (enq_flags & SCX_ENQ_CLEAR_OPSS) 1540 1117 atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 1541 - 1542 - if (is_local) 1543 - local_dsq_post_enq(dsq, p, enq_flags); 1544 - else 1545 - raw_spin_unlock(&dsq->lock); 1546 1118 } 1547 1119 1548 1120 static void task_unlink_from_dsq(struct task_struct *p, ··· 1552 1134 } 1553 1135 1554 1136 list_del_init(&p->scx.dsq_list.node); 1555 - dsq_mod_nr(dsq, -1); 1137 + dsq_dec_nr(dsq, p); 1556 1138 1557 1139 if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) { 1558 1140 struct task_struct *first_task; ··· 1631 1213 1632 1214 static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch, 1633 1215 struct rq *rq, u64 dsq_id, 1634 - struct task_struct *p) 1216 + s32 tcpu) 1635 1217 { 1636 1218 struct scx_dispatch_q *dsq; 1637 1219 ··· 1642 1224 s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; 1643 1225 1644 1226 if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) 1645 - return find_global_dsq(sch, p); 1227 + return find_global_dsq(sch, tcpu); 1646 1228 1647 1229 return &cpu_rq(cpu)->scx.local_dsq; 1648 1230 } 1649 1231 1650 1232 if (dsq_id == SCX_DSQ_GLOBAL) 1651 - dsq = find_global_dsq(sch, p); 1233 + dsq = find_global_dsq(sch, tcpu); 1652 1234 else 1653 1235 dsq = find_user_dsq(sch, dsq_id); 1654 1236 1655 1237 if (unlikely(!dsq)) { 1656 - scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]", 1657 - dsq_id, p->comm, p->pid); 1658 - return find_global_dsq(sch, p); 1238 + scx_error(sch, "non-existent DSQ 0x%llx", dsq_id); 1239 + return find_global_dsq(sch, tcpu); 1659 1240 } 1660 1241 1661 1242 return dsq; ··· 1717 1300 { 1718 1301 struct rq *rq = task_rq(p); 1719 1302 struct scx_dispatch_q *dsq = 1720 - find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p); 1303 + find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p)); 1721 1304 u64 ddsp_enq_flags; 1722 1305 1723 1306 touch_core_sched_dispatch(rq, p); ··· 1762 1345 ddsp_enq_flags = p->scx.ddsp_enq_flags; 1763 1346 clear_direct_dispatch(p); 1764 1347 1765 - dispatch_enqueue(sch, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); 1348 + dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); 1766 1349 } 1767 1350 1768 1351 static bool scx_rq_online(struct rq *rq) ··· 1780 1363 static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, 1781 1364 int sticky_cpu) 1782 1365 { 1783 - struct scx_sched *sch = scx_root; 1366 + struct scx_sched *sch = scx_task_sched(p); 1784 1367 struct task_struct **ddsp_taskp; 1785 1368 struct scx_dispatch_q *dsq; 1786 1369 unsigned long qseq; 1787 1370 1788 1371 WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); 1789 1372 1790 - /* rq migration */ 1373 + /* internal movements - rq migration / RESTORE */ 1791 1374 if (sticky_cpu == cpu_of(rq)) 1792 1375 goto local_norefill; 1376 + 1377 + /* 1378 + * Clear persistent 
TASK_IMMED for fresh enqueues, see dsq_inc_nr(). 1379 + * Note that exiting and migration-disabled tasks that skip 1380 + * ops.enqueue() below will lose IMMED protection unless 1381 + * %SCX_OPS_ENQ_EXITING / %SCX_OPS_ENQ_MIGRATION_DISABLED are set. 1382 + */ 1383 + p->scx.flags &= ~SCX_TASK_IMMED; 1793 1384 1794 1385 /* 1795 1386 * If !scx_rq_online(), we already told the BPF scheduler that the CPU ··· 1807 1382 if (!scx_rq_online(rq)) 1808 1383 goto local; 1809 1384 1810 - if (scx_rq_bypassing(rq)) { 1385 + if (scx_bypassing(sch, cpu_of(rq))) { 1811 1386 __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); 1812 1387 goto bypass; 1813 1388 } ··· 1842 1417 WARN_ON_ONCE(*ddsp_taskp); 1843 1418 *ddsp_taskp = p; 1844 1419 1845 - SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); 1420 + SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags); 1846 1421 1847 1422 *ddsp_taskp = NULL; 1848 1423 if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) 1849 1424 goto direct; 1425 + 1426 + /* 1427 + * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY 1428 + * so ops.dequeue() is called when it leaves custody. 1429 + */ 1430 + p->scx.flags |= SCX_TASK_IN_CUSTODY; 1850 1431 1851 1432 /* 1852 1433 * If not directly dispatched, QUEUEING isn't clear yet and dispatch or ··· 1865 1434 direct_dispatch(sch, p, enq_flags); 1866 1435 return; 1867 1436 local_norefill: 1868 - dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags); 1437 + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags); 1869 1438 return; 1870 1439 local: 1871 1440 dsq = &rq->scx.local_dsq; 1872 1441 goto enqueue; 1873 1442 global: 1874 - dsq = find_global_dsq(sch, p); 1443 + dsq = find_global_dsq(sch, task_cpu(p)); 1875 1444 goto enqueue; 1876 1445 bypass: 1877 - dsq = &task_rq(p)->scx.bypass_dsq; 1446 + dsq = bypass_enq_target_dsq(sch, task_cpu(p)); 1878 1447 goto enqueue; 1879 1448 1880 1449 enqueue: ··· 1886 1455 touch_core_sched(rq, p); 1887 1456 refill_task_slice_dfl(sch, p); 1888 1457 clear_direct_dispatch(p); 1889 - dispatch_enqueue(sch, dsq, p, enq_flags); 1458 + dispatch_enqueue(sch, rq, dsq, p, enq_flags); 1890 1459 } 1891 1460 1892 1461 static bool task_runnable(const struct task_struct *p) ··· 1919 1488 1920 1489 static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_flags) 1921 1490 { 1922 - struct scx_sched *sch = scx_root; 1491 + struct scx_sched *sch = scx_task_sched(p); 1923 1492 int sticky_cpu = p->scx.sticky_cpu; 1924 1493 u64 enq_flags = core_enq_flags | rq->scx.extra_enq_flags; 1925 1494 1926 1495 if (enq_flags & ENQUEUE_WAKEUP) 1927 1496 rq->scx.flags |= SCX_RQ_IN_WAKEUP; 1928 - 1929 - if (sticky_cpu >= 0) 1930 - p->scx.sticky_cpu = -1; 1931 1497 1932 1498 /* 1933 1499 * Restoring a running task will be immediately followed by ··· 1946 1518 add_nr_running(rq, 1); 1947 1519 1948 1520 if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p)) 1949 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags); 1521 + SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags); 1950 1522 1951 1523 if (enq_flags & SCX_ENQ_WAKEUP) 1952 1524 touch_core_sched(rq, p); ··· 1956 1528 dl_server_start(&rq->ext_server); 1957 1529 1958 1530 do_enqueue_task(rq, p, enq_flags, sticky_cpu); 1531 + 1532 + if (sticky_cpu >= 0) 1533 + p->scx.sticky_cpu = -1; 1959 1534 out: 1960 1535 rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; 1961 1536 ··· 1969 1538 1970 1539 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) 1971 1540 { 1972 - struct scx_sched *sch = scx_root; 1541 + struct scx_sched *sch = 
scx_task_sched(p); 1973 1542 unsigned long opss; 1974 1543 1975 1544 /* dequeue is always temporary, don't reset runnable_at */ ··· 1988 1557 */ 1989 1558 BUG(); 1990 1559 case SCX_OPSS_QUEUED: 1991 - if (SCX_HAS_OP(sch, dequeue)) 1992 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, 1993 - p, deq_flags); 1994 - 1560 + /* A queued task must always be in BPF scheduler's custody */ 1561 + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)); 1995 1562 if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, 1996 1563 SCX_OPSS_NONE)) 1997 1564 break; ··· 2012 1583 BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); 2013 1584 break; 2014 1585 } 1586 + 1587 + /* 1588 + * Call ops.dequeue() if the task is still in BPF custody. 1589 + * 1590 + * The code that clears ops_state to %SCX_OPSS_NONE does not always 1591 + * clear %SCX_TASK_IN_CUSTODY: in dispatch_to_local_dsq(), when 1592 + * we're moving a task that was in %SCX_OPSS_DISPATCHING to a 1593 + * remote CPU's local DSQ, we only set ops_state to %SCX_OPSS_NONE 1594 + * so that a concurrent dequeue can proceed, but we clear 1595 + * %SCX_TASK_IN_CUSTODY only when we later enqueue or move the 1596 + * task. So we can see NONE + IN_CUSTODY here and we must handle 1597 + * it. Similarly, after waiting on %SCX_OPSS_DISPATCHING we see 1598 + * NONE but the task may still have %SCX_TASK_IN_CUSTODY set until 1599 + * it is enqueued on the destination. 1600 + */ 1601 + call_task_dequeue(sch, rq, p, deq_flags); 2015 1602 } 2016 1603 2017 - static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) 1604 + static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_flags) 2018 1605 { 2019 - struct scx_sched *sch = scx_root; 1606 + struct scx_sched *sch = scx_task_sched(p); 1607 + u64 deq_flags = core_deq_flags; 1608 + 1609 + /* 1610 + * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a property 1611 + * change (not sleep or core-sched pick). 
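 *
 * [Editor's illustration -- not part of this patch.] With the custody
 * tracking above, a scheduler that mirrors queued tasks in its own data
 * structures can count on ops.dequeue() firing once each time a task
 * leaves its custody, with SCX_DEQ_SCHED_CHANGE distinguishing
 * property-change dequeues on this path. A minimal sketch (task_ctx and
 * lookup_task_ctx() are hypothetical):
 *
 *	void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p,
 *			    u64 deq_flags)
 *	{
 *		struct task_ctx *tctx = lookup_task_ctx(p);
 *
 *		if (tctx)
 *			tctx->queued = false;
 *	}
 *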
1612 + */ 1613 + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) 1614 + deq_flags |= SCX_DEQ_SCHED_CHANGE; 2020 1615 2021 1616 if (!(p->scx.flags & SCX_TASK_QUEUED)) { 2022 1617 WARN_ON_ONCE(task_runnable(p)); ··· 2063 1610 */ 2064 1611 if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) { 2065 1612 update_curr_scx(rq); 2066 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false); 1613 + SCX_CALL_OP_TASK(sch, stopping, rq, p, false); 2067 1614 } 2068 1615 2069 1616 if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p)) 2070 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags); 1617 + SCX_CALL_OP_TASK(sch, quiescent, rq, p, deq_flags); 2071 1618 2072 1619 if (deq_flags & SCX_DEQ_SLEEP) 2073 1620 p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; ··· 2085 1632 2086 1633 static void yield_task_scx(struct rq *rq) 2087 1634 { 2088 - struct scx_sched *sch = scx_root; 2089 1635 struct task_struct *p = rq->donor; 1636 + struct scx_sched *sch = scx_task_sched(p); 2090 1637 2091 1638 if (SCX_HAS_OP(sch, yield)) 2092 - SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL); 1639 + SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL); 2093 1640 else 2094 1641 p->scx.slice = 0; 2095 1642 } 2096 1643 2097 1644 static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) 2098 1645 { 2099 - struct scx_sched *sch = scx_root; 2100 1646 struct task_struct *from = rq->donor; 1647 + struct scx_sched *sch = scx_task_sched(from); 2101 1648 2102 - if (SCX_HAS_OP(sch, yield)) 2103 - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, 2104 - from, to); 1649 + if (SCX_HAS_OP(sch, yield) && sch == scx_task_sched(to)) 1650 + return SCX_CALL_OP_2TASKS_RET(sch, yield, rq, from, to); 2105 1651 else 2106 1652 return false; 1653 + } 1654 + 1655 + static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) 1656 + { 1657 + /* 1658 + * Preemption between SCX tasks is implemented by resetting the victim 1659 + * task's slice to 0 and triggering reschedule on the target CPU. 1660 + * Nothing to do. 1661 + */ 1662 + if (p->sched_class == &ext_sched_class) 1663 + return; 1664 + 1665 + /* 1666 + * Getting preempted by a higher-priority class. Reenqueue IMMED tasks. 1667 + * This captures all preemption cases including: 1668 + * 1669 + * - A SCX task is currently running. 1670 + * 1671 + * - @rq is waking from idle due to a SCX task waking to it. 1672 + * 1673 + * - A higher-priority wakes up while SCX dispatch is in progress. 1674 + */ 1675 + if (rq->scx.nr_immed) 1676 + schedule_reenq_local(rq, 0); 2107 1677 } 2108 1678 2109 1679 static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, ··· 2146 1670 else 2147 1671 list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list); 2148 1672 2149 - dsq_mod_nr(dst_dsq, 1); 1673 + dsq_inc_nr(dst_dsq, p, enq_flags); 2150 1674 p->scx.dsq = dst_dsq; 2151 1675 2152 1676 local_dsq_post_enq(dst_dsq, p, enq_flags); ··· 2166 1690 { 2167 1691 lockdep_assert_rq_held(src_rq); 2168 1692 2169 - /* the following marks @p MIGRATING which excludes dequeue */ 1693 + /* 1694 + * Set sticky_cpu before deactivate_task() to properly mark the 1695 + * beginning of an SCX-internal migration. 
1696 + */ 1697 + p->scx.sticky_cpu = cpu_of(dst_rq); 2170 1698 deactivate_task(src_rq, p, 0); 2171 1699 set_task_cpu(p, cpu_of(dst_rq)); 2172 - p->scx.sticky_cpu = cpu_of(dst_rq); 2173 1700 2174 1701 raw_spin_rq_unlock(src_rq); 2175 1702 raw_spin_rq_lock(dst_rq); ··· 2212 1733 struct task_struct *p, struct rq *rq, 2213 1734 bool enforce) 2214 1735 { 2215 - int cpu = cpu_of(rq); 1736 + s32 cpu = cpu_of(rq); 2216 1737 2217 1738 WARN_ON_ONCE(task_cpu(p) == cpu); 2218 1739 ··· 2306 1827 !WARN_ON_ONCE(src_rq != task_rq(p)); 2307 1828 } 2308 1829 2309 - static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, 1830 + static bool consume_remote_task(struct rq *this_rq, 1831 + struct task_struct *p, u64 enq_flags, 2310 1832 struct scx_dispatch_q *dsq, struct rq *src_rq) 2311 1833 { 2312 1834 raw_spin_rq_unlock(this_rq); 2313 1835 2314 1836 if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { 2315 - move_remote_task_to_local_dsq(p, 0, src_rq, this_rq); 1837 + move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq); 2316 1838 return true; 2317 1839 } else { 2318 1840 raw_spin_rq_unlock(src_rq); ··· 2353 1873 dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); 2354 1874 if (src_rq != dst_rq && 2355 1875 unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { 2356 - dst_dsq = find_global_dsq(sch, p); 1876 + dst_dsq = find_global_dsq(sch, task_cpu(p)); 2357 1877 dst_rq = src_rq; 1878 + enq_flags |= SCX_ENQ_GDSQ_FALLBACK; 2358 1879 } 2359 1880 } else { 2360 1881 /* no need to migrate if destination is a non-local DSQ */ ··· 2386 1905 dispatch_dequeue_locked(p, src_dsq); 2387 1906 raw_spin_unlock(&src_dsq->lock); 2388 1907 2389 - dispatch_enqueue(sch, dst_dsq, p, enq_flags); 1908 + dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags); 2390 1909 } 2391 1910 2392 1911 return dst_rq; 2393 1912 } 2394 1913 2395 1914 static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq, 2396 - struct scx_dispatch_q *dsq) 1915 + struct scx_dispatch_q *dsq, u64 enq_flags) 2397 1916 { 2398 1917 struct task_struct *p; 2399 1918 retry: ··· 2418 1937 * the system into the bypass mode. This can easily live-lock the 2419 1938 * machine. If aborting, exit from all non-bypass DSQs. 2420 1939 */ 2421 - if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS) 1940 + if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS) 2422 1941 break; 2423 1942 2424 1943 if (rq == task_rq) { 2425 1944 task_unlink_from_dsq(p, dsq); 2426 - move_local_task_to_local_dsq(p, 0, dsq, rq); 1945 + move_local_task_to_local_dsq(p, enq_flags, dsq, rq); 2427 1946 raw_spin_unlock(&dsq->lock); 2428 1947 return true; 2429 1948 } 2430 1949 2431 1950 if (task_can_run_on_remote_rq(sch, p, rq, false)) { 2432 - if (likely(consume_remote_task(rq, p, dsq, task_rq))) 1951 + if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq))) 2433 1952 return true; 2434 1953 goto retry; 2435 1954 } ··· 2443 1962 { 2444 1963 int node = cpu_to_node(cpu_of(rq)); 2445 1964 2446 - return consume_dispatch_q(sch, rq, sch->global_dsqs[node]); 1965 + return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0); 2447 1966 } 2448 1967 2449 1968 /** ··· 2476 1995 * If dispatching to @rq that @p is already on, no lock dancing needed. 
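 *
 * [Editor's note -- illustration, not part of this patch.] The BPF
 * scheduler typically reaches this path by targeting another CPU's local
 * DSQ, e.g. scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, ...) from
 * ops.enqueue() or ops.dispatch(). If @p turns out not to be allowed to
 * run on the target CPU, the check further below falls back to a global
 * DSQ and tags the insertion with SCX_ENQ_GDSQ_FALLBACK.
 *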
2477 1996 */ 2478 1997 if (rq == src_rq && rq == dst_rq) { 2479 - dispatch_enqueue(sch, dst_dsq, p, 1998 + dispatch_enqueue(sch, rq, dst_dsq, p, 2480 1999 enq_flags | SCX_ENQ_CLEAR_OPSS); 2481 2000 return; 2482 2001 } 2483 2002 2484 2003 if (src_rq != dst_rq && 2485 2004 unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { 2486 - dispatch_enqueue(sch, find_global_dsq(sch, p), p, 2487 - enq_flags | SCX_ENQ_CLEAR_OPSS); 2005 + dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p, 2006 + enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK); 2488 2007 return; 2489 2008 } 2490 2009 ··· 2521 2040 */ 2522 2041 if (src_rq == dst_rq) { 2523 2042 p->scx.holding_cpu = -1; 2524 - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, 2043 + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, 2525 2044 enq_flags); 2526 2045 } else { 2527 2046 move_remote_task_to_local_dsq(p, enq_flags, ··· 2591 2110 if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch) 2592 2111 return; 2593 2112 2113 + /* see SCX_EV_INSERT_NOT_OWNED definition */ 2114 + if (unlikely(!scx_task_on_sched(sch, p))) { 2115 + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); 2116 + return; 2117 + } 2118 + 2594 2119 /* 2595 2120 * While we know @p is accessible, we don't yet have a claim on 2596 2121 * it - the BPF scheduler is allowed to dispatch tasks ··· 2621 2134 2622 2135 BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); 2623 2136 2624 - dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); 2137 + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p)); 2625 2138 2626 2139 if (dsq->id == SCX_DSQ_LOCAL) 2627 2140 dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 2628 2141 else 2629 - dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2142 + dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2630 2143 } 2631 2144 2632 2145 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq) 2633 2146 { 2634 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2147 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 2635 2148 u32 u; 2636 2149 2637 2150 for (u = 0; u < dspc->cursor; u++) { ··· 2658 2171 rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING; 2659 2172 } 2660 2173 2174 + /* 2175 + * One user of this function is scx_bpf_dispatch() which can be called 2176 + * recursively as sub-sched dispatches nest. Always inline to reduce stack usage 2177 + * from the call frame. 2178 + */ 2179 + static __always_inline bool 2180 + scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, 2181 + struct task_struct *prev, bool nested) 2182 + { 2183 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 2184 + int nr_loops = SCX_DSP_MAX_LOOPS; 2185 + s32 cpu = cpu_of(rq); 2186 + bool prev_on_sch = (prev->sched_class == &ext_sched_class) && 2187 + scx_task_on_sched(sch, prev); 2188 + 2189 + if (consume_global_dsq(sch, rq)) 2190 + return true; 2191 + 2192 + if (bypass_dsp_enabled(sch)) { 2193 + /* if @sch is bypassing, only the bypass DSQs are active */ 2194 + if (scx_bypassing(sch, cpu)) 2195 + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); 2196 + 2197 + #ifdef CONFIG_EXT_SUB_SCHED 2198 + /* 2199 + * If @sch isn't bypassing but its children are, @sch is 2200 + * responsible for making forward progress for both its own 2201 + * tasks that aren't bypassing and the bypassing descendants' 2202 + * tasks. The following implements a simple built-in behavior - 2203 + * let each CPU try to run the bypass DSQ every Nth time. 
2204 + * 2205 + * Later, if necessary, we can add an ops flag to suppress the 2206 + * auto-consumption and a kfunc to consume the bypass DSQ and, 2207 + * so that the BPF scheduler can fully control scheduling of 2208 + * bypassed tasks. 2209 + */ 2210 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 2211 + 2212 + if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) && 2213 + consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) { 2214 + __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1); 2215 + return true; 2216 + } 2217 + #endif /* CONFIG_EXT_SUB_SCHED */ 2218 + } 2219 + 2220 + if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) 2221 + return false; 2222 + 2223 + dspc->rq = rq; 2224 + 2225 + /* 2226 + * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, 2227 + * the local DSQ might still end up empty after a successful 2228 + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() 2229 + * produced some tasks, retry. The BPF scheduler may depend on this 2230 + * looping behavior to simplify its implementation. 2231 + */ 2232 + do { 2233 + dspc->nr_tasks = 0; 2234 + 2235 + if (nested) { 2236 + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); 2237 + } else { 2238 + /* stash @prev so that nested invocations can access it */ 2239 + rq->scx.sub_dispatch_prev = prev; 2240 + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); 2241 + rq->scx.sub_dispatch_prev = NULL; 2242 + } 2243 + 2244 + flush_dispatch_buf(sch, rq); 2245 + 2246 + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) { 2247 + rq->scx.flags |= SCX_RQ_BAL_KEEP; 2248 + return true; 2249 + } 2250 + if (rq->scx.local_dsq.nr) 2251 + return true; 2252 + if (consume_global_dsq(sch, rq)) 2253 + return true; 2254 + 2255 + /* 2256 + * ops.dispatch() can trap us in this loop by repeatedly 2257 + * dispatching ineligible tasks. Break out once in a while to 2258 + * allow the watchdog to run. As IRQ can't be enabled in 2259 + * balance(), we want to complete this scheduling cycle and then 2260 + * start a new one. IOW, we want to call resched_curr() on the 2261 + * next, most likely idle, task, not the current one. Use 2262 + * __scx_bpf_kick_cpu() for deferred kicking. 2263 + */ 2264 + if (unlikely(!--nr_loops)) { 2265 + scx_kick_cpu(sch, cpu, 0); 2266 + break; 2267 + } 2268 + } while (dspc->nr_tasks); 2269 + 2270 + /* 2271 + * Prevent the CPU from going idle while bypassed descendants have tasks 2272 + * queued. Without this fallback, bypassed tasks could stall if the host 2273 + * scheduler's ops.dispatch() doesn't yield any tasks. 2274 + */ 2275 + if (bypass_dsp_enabled(sch)) 2276 + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0); 2277 + 2278 + return false; 2279 + } 2280 + 2661 2281 static int balance_one(struct rq *rq, struct task_struct *prev) 2662 2282 { 2663 2283 struct scx_sched *sch = scx_root; 2664 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2665 - bool prev_on_scx = prev->sched_class == &ext_sched_class; 2666 - bool prev_on_rq = prev->scx.flags & SCX_TASK_QUEUED; 2667 - int nr_loops = SCX_DSP_MAX_LOOPS; 2284 + s32 cpu = cpu_of(rq); 2668 2285 2669 2286 lockdep_assert_rq_held(rq); 2670 2287 rq->scx.flags |= SCX_RQ_IN_BALANCE; ··· 2783 2192 * emitted in switch_class(). 
2784 2193 */ 2785 2194 if (SCX_HAS_OP(sch, cpu_acquire)) 2786 - SCX_CALL_OP(sch, SCX_KF_REST, cpu_acquire, rq, 2787 - cpu_of(rq), NULL); 2195 + SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL); 2788 2196 rq->scx.cpu_released = false; 2789 2197 } 2790 2198 2791 - if (prev_on_scx) { 2199 + if (prev->sched_class == &ext_sched_class) { 2792 2200 update_curr_scx(rq); 2793 2201 2794 2202 /* ··· 2800 2210 * See scx_disable_workfn() for the explanation on the bypassing 2801 2211 * test. 2802 2212 */ 2803 - if (prev_on_rq && prev->scx.slice && !scx_rq_bypassing(rq)) { 2213 + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice && 2214 + !scx_bypassing(sch, cpu)) { 2804 2215 rq->scx.flags |= SCX_RQ_BAL_KEEP; 2805 2216 goto has_tasks; 2806 2217 } ··· 2811 2220 if (rq->scx.local_dsq.nr) 2812 2221 goto has_tasks; 2813 2222 2814 - if (consume_global_dsq(sch, rq)) 2223 + if (scx_dispatch_sched(sch, rq, prev, false)) 2815 2224 goto has_tasks; 2816 2225 2817 - if (scx_rq_bypassing(rq)) { 2818 - if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq)) 2819 - goto has_tasks; 2820 - else 2821 - goto no_tasks; 2822 - } 2823 - 2824 - if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) 2825 - goto no_tasks; 2826 - 2827 - dspc->rq = rq; 2828 - 2829 - /* 2830 - * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, 2831 - * the local DSQ might still end up empty after a successful 2832 - * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() 2833 - * produced some tasks, retry. The BPF scheduler may depend on this 2834 - * looping behavior to simplify its implementation. 2835 - */ 2836 - do { 2837 - dspc->nr_tasks = 0; 2838 - 2839 - SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq, 2840 - cpu_of(rq), prev_on_scx ? prev : NULL); 2841 - 2842 - flush_dispatch_buf(sch, rq); 2843 - 2844 - if (prev_on_rq && prev->scx.slice) { 2845 - rq->scx.flags |= SCX_RQ_BAL_KEEP; 2846 - goto has_tasks; 2847 - } 2848 - if (rq->scx.local_dsq.nr) 2849 - goto has_tasks; 2850 - if (consume_global_dsq(sch, rq)) 2851 - goto has_tasks; 2852 - 2853 - /* 2854 - * ops.dispatch() can trap us in this loop by repeatedly 2855 - * dispatching ineligible tasks. Break out once in a while to 2856 - * allow the watchdog to run. As IRQ can't be enabled in 2857 - * balance(), we want to complete this scheduling cycle and then 2858 - * start a new one. IOW, we want to call resched_curr() on the 2859 - * next, most likely idle, task, not the current one. Use 2860 - * scx_kick_cpu() for deferred kicking. 2861 - */ 2862 - if (unlikely(!--nr_loops)) { 2863 - scx_kick_cpu(sch, cpu_of(rq), 0); 2864 - break; 2865 - } 2866 - } while (dspc->nr_tasks); 2867 - 2868 - no_tasks: 2869 2226 /* 2870 2227 * Didn't find another task to run. Keep running @prev unless 2871 2228 * %SCX_OPS_ENQ_LAST is in effect. 2872 2229 */ 2873 - if (prev_on_rq && 2874 - (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_rq_bypassing(rq))) { 2230 + if ((prev->scx.flags & SCX_TASK_QUEUED) && 2231 + (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_bypassing(sch, cpu))) { 2875 2232 rq->scx.flags |= SCX_RQ_BAL_KEEP; 2876 2233 __scx_add_event(sch, SCX_EV_DISPATCH_KEEP_LAST, 1); 2877 2234 goto has_tasks; ··· 2828 2289 return false; 2829 2290 2830 2291 has_tasks: 2292 + /* 2293 + * @rq may have extra IMMED tasks without reenq scheduled: 2294 + * 2295 + * - rq_is_open() can't reliably tell when and how slice is going to be 2296 + * modified for $curr and allows IMMED tasks to be queued while 2297 + * dispatch is in progress. 
2298 + * 2299 + * - A non-IMMED HEAD task can get queued in front of an IMMED task 2300 + * between the IMMED queueing and the subsequent scheduling event. 2301 + */ 2302 + if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed)) 2303 + schedule_reenq_local(rq, 0); 2304 + 2831 2305 rq->scx.flags &= ~SCX_RQ_IN_BALANCE; 2832 2306 return true; 2833 2307 } 2834 2308 2835 - static void process_ddsp_deferred_locals(struct rq *rq) 2836 - { 2837 - struct task_struct *p; 2838 - 2839 - lockdep_assert_rq_held(rq); 2840 - 2841 - /* 2842 - * Now that @rq can be unlocked, execute the deferred enqueueing of 2843 - * tasks directly dispatched to the local DSQs of other CPUs. See 2844 - * direct_dispatch(). Keep popping from the head instead of using 2845 - * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq 2846 - * temporarily. 2847 - */ 2848 - while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, 2849 - struct task_struct, scx.dsq_list.node))) { 2850 - struct scx_sched *sch = scx_root; 2851 - struct scx_dispatch_q *dsq; 2852 - u64 dsq_id = p->scx.ddsp_dsq_id; 2853 - u64 enq_flags = p->scx.ddsp_enq_flags; 2854 - 2855 - list_del_init(&p->scx.dsq_list.node); 2856 - clear_direct_dispatch(p); 2857 - 2858 - dsq = find_dsq_for_dispatch(sch, rq, dsq_id, p); 2859 - if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 2860 - dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 2861 - } 2862 - } 2863 - 2864 2309 static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) 2865 2310 { 2866 - struct scx_sched *sch = scx_root; 2311 + struct scx_sched *sch = scx_task_sched(p); 2867 2312 2868 2313 if (p->scx.flags & SCX_TASK_QUEUED) { 2869 2314 /* ··· 2862 2339 2863 2340 /* see dequeue_task_scx() on why we skip when !QUEUED */ 2864 2341 if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) 2865 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, running, rq, p); 2342 + SCX_CALL_OP_TASK(sch, running, rq, p); 2866 2343 2867 2344 clr_task_runnable(p, true); 2868 2345 ··· 2934 2411 .task = next, 2935 2412 }; 2936 2413 2937 - SCX_CALL_OP(sch, SCX_KF_CPU_RELEASE, cpu_release, rq, 2938 - cpu_of(rq), &args); 2414 + SCX_CALL_OP(sch, cpu_release, rq, cpu_of(rq), &args); 2939 2415 } 2940 2416 rq->scx.cpu_released = true; 2941 2417 } ··· 2943 2421 static void put_prev_task_scx(struct rq *rq, struct task_struct *p, 2944 2422 struct task_struct *next) 2945 2423 { 2946 - struct scx_sched *sch = scx_root; 2424 + struct scx_sched *sch = scx_task_sched(p); 2947 2425 2948 2426 /* see kick_sync_wait_bal_cb() */ 2949 2427 smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1); ··· 2952 2430 2953 2431 /* see dequeue_task_scx() on why we skip when !QUEUED */ 2954 2432 if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED)) 2955 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, true); 2433 + SCX_CALL_OP_TASK(sch, stopping, rq, p, true); 2956 2434 2957 2435 if (p->scx.flags & SCX_TASK_QUEUED) { 2958 2436 set_task_runnable(rq, p); ··· 2961 2439 * If @p has slice left and is being put, @p is getting 2962 2440 * preempted by a higher priority scheduler class or core-sched 2963 2441 * forcing a different task. Leave it at the head of the local 2964 - * DSQ. 2442 + * DSQ unless it was an IMMED task. IMMED tasks should not 2443 + * linger on a busy CPU, reenqueue them to the BPF scheduler. 
2965 2444 */ 2966 - if (p->scx.slice && !scx_rq_bypassing(rq)) { 2967 - dispatch_enqueue(sch, &rq->scx.local_dsq, p, 2968 - SCX_ENQ_HEAD); 2445 + if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) { 2446 + if (p->scx.flags & SCX_TASK_IMMED) { 2447 + p->scx.flags |= SCX_TASK_REENQ_PREEMPTED; 2448 + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 2449 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 2450 + } else { 2451 + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD); 2452 + } 2969 2453 goto switch_class; 2970 2454 } 2971 2455 ··· 3096 2568 if (keep_prev) { 3097 2569 p = prev; 3098 2570 if (!p->scx.slice) 3099 - refill_task_slice_dfl(rcu_dereference_sched(scx_root), p); 2571 + refill_task_slice_dfl(scx_task_sched(p), p); 3100 2572 } else { 3101 2573 p = first_local_task(rq); 3102 2574 if (!p) 3103 2575 return NULL; 3104 2576 3105 2577 if (unlikely(!p->scx.slice)) { 3106 - struct scx_sched *sch = rcu_dereference_sched(scx_root); 2578 + struct scx_sched *sch = scx_task_sched(p); 3107 2579 3108 - if (!scx_rq_bypassing(rq) && !sch->warned_zero_slice) { 2580 + if (!scx_bypassing(sch, cpu_of(rq)) && 2581 + !sch->warned_zero_slice) { 3109 2582 printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in %s()\n", 3110 2583 p->comm, p->pid, __func__); 3111 2584 sch->warned_zero_slice = true; ··· 3172 2643 bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, 3173 2644 bool in_fi) 3174 2645 { 3175 - struct scx_sched *sch = scx_root; 2646 + struct scx_sched *sch_a = scx_task_sched(a); 2647 + struct scx_sched *sch_b = scx_task_sched(b); 3176 2648 3177 2649 /* 3178 2650 * The const qualifiers are dropped from task_struct pointers when 3179 2651 * calling ops.core_sched_before(). Accesses are controlled by the 3180 2652 * verifier. 
3181 2653 */ 3182 - if (SCX_HAS_OP(sch, core_sched_before) && 3183 - !scx_rq_bypassing(task_rq(a))) 3184 - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, core_sched_before, 2654 + if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && 2655 + !scx_bypassing(sch_a, task_cpu(a))) 2656 + return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, 3185 2657 NULL, 3186 2658 (struct task_struct *)a, 3187 2659 (struct task_struct *)b); ··· 3193 2663 3194 2664 static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags) 3195 2665 { 3196 - struct scx_sched *sch = scx_root; 3197 - bool rq_bypass; 2666 + struct scx_sched *sch = scx_task_sched(p); 2667 + bool bypassing; 3198 2668 3199 2669 /* 3200 2670 * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it ··· 3209 2679 if (unlikely(wake_flags & WF_EXEC)) 3210 2680 return prev_cpu; 3211 2681 3212 - rq_bypass = scx_rq_bypassing(task_rq(p)); 3213 - if (likely(SCX_HAS_OP(sch, select_cpu)) && !rq_bypass) { 2682 + bypassing = scx_bypassing(sch, task_cpu(p)); 2683 + if (likely(SCX_HAS_OP(sch, select_cpu)) && !bypassing) { 3214 2684 s32 cpu; 3215 2685 struct task_struct **ddsp_taskp; 3216 2686 ··· 3218 2688 WARN_ON_ONCE(*ddsp_taskp); 3219 2689 *ddsp_taskp = p; 3220 2690 3221 - cpu = SCX_CALL_OP_TASK_RET(sch, 3222 - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, 3223 - select_cpu, NULL, p, prev_cpu, 3224 - wake_flags); 2691 + this_rq()->scx.in_select_cpu = true; 2692 + cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags); 2693 + this_rq()->scx.in_select_cpu = false; 3225 2694 p->scx.selected_cpu = cpu; 3226 2695 *ddsp_taskp = NULL; 3227 2696 if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) ··· 3239 2710 } 3240 2711 p->scx.selected_cpu = cpu; 3241 2712 3242 - if (rq_bypass) 2713 + if (bypassing) 3243 2714 __scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1); 3244 2715 return cpu; 3245 2716 } ··· 3253 2724 static void set_cpus_allowed_scx(struct task_struct *p, 3254 2725 struct affinity_context *ac) 3255 2726 { 3256 - struct scx_sched *sch = scx_root; 2727 + struct scx_sched *sch = scx_task_sched(p); 3257 2728 3258 2729 set_cpus_allowed_common(p, ac); 3259 2730 ··· 3269 2740 * designation pointless. Cast it away when calling the operation. 
3270 2741 */ 3271 2742 if (SCX_HAS_OP(sch, set_cpumask)) 3272 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL, 3273 - p, (struct cpumask *)p->cpus_ptr); 2743 + SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr); 3274 2744 } 3275 2745 3276 2746 static void handle_hotplug(struct rq *rq, bool online) 3277 2747 { 3278 2748 struct scx_sched *sch = scx_root; 3279 - int cpu = cpu_of(rq); 2749 + s32 cpu = cpu_of(rq); 3280 2750 3281 2751 atomic_long_inc(&scx_hotplug_seq); 3282 2752 ··· 3291 2763 scx_idle_update_selcpu_topology(&sch->ops); 3292 2764 3293 2765 if (online && SCX_HAS_OP(sch, cpu_online)) 3294 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); 2766 + SCX_CALL_OP(sch, cpu_online, NULL, cpu); 3295 2767 else if (!online && SCX_HAS_OP(sch, cpu_offline)) 3296 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_offline, NULL, cpu); 2768 + SCX_CALL_OP(sch, cpu_offline, NULL, cpu); 3297 2769 else 3298 2770 scx_exit(sch, SCX_EXIT_UNREG_KERN, 3299 2771 SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, ··· 3321 2793 rq->scx.flags &= ~SCX_RQ_ONLINE; 3322 2794 } 3323 2795 3324 - 3325 2796 static bool check_rq_for_timeouts(struct rq *rq) 3326 2797 { 3327 2798 struct scx_sched *sch; ··· 3334 2807 goto out_unlock; 3335 2808 3336 2809 list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) { 2810 + struct scx_sched *sch = scx_task_sched(p); 3337 2811 unsigned long last_runnable = p->scx.runnable_at; 3338 2812 3339 2813 if (unlikely(time_after(jiffies, 3340 - last_runnable + READ_ONCE(scx_watchdog_timeout)))) { 2814 + last_runnable + READ_ONCE(sch->watchdog_timeout)))) { 3341 2815 u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable); 3342 2816 3343 2817 scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, ··· 3355 2827 3356 2828 static void scx_watchdog_workfn(struct work_struct *work) 3357 2829 { 2830 + unsigned long intv; 3358 2831 int cpu; 3359 2832 3360 2833 WRITE_ONCE(scx_watchdog_timestamp, jiffies); ··· 3366 2837 3367 2838 cond_resched(); 3368 2839 } 3369 - queue_delayed_work(system_dfl_wq, to_delayed_work(work), 3370 - READ_ONCE(scx_watchdog_timeout) / 2); 2840 + 2841 + intv = READ_ONCE(scx_watchdog_interval); 2842 + if (intv < ULONG_MAX) 2843 + queue_delayed_work(system_dfl_wq, to_delayed_work(work), intv); 3371 2844 } 3372 2845 3373 2846 void scx_tick(struct rq *rq) 3374 2847 { 3375 - struct scx_sched *sch; 2848 + struct scx_sched *root; 3376 2849 unsigned long last_check; 3377 2850 3378 2851 if (!scx_enabled()) 3379 2852 return; 3380 2853 3381 - sch = rcu_dereference_bh(scx_root); 3382 - if (unlikely(!sch)) 2854 + root = rcu_dereference_bh(scx_root); 2855 + if (unlikely(!root)) 3383 2856 return; 3384 2857 3385 2858 last_check = READ_ONCE(scx_watchdog_timestamp); 3386 2859 if (unlikely(time_after(jiffies, 3387 - last_check + READ_ONCE(scx_watchdog_timeout)))) { 2860 + last_check + READ_ONCE(root->watchdog_timeout)))) { 3388 2861 u32 dur_ms = jiffies_to_msecs(jiffies - last_check); 3389 2862 3390 - scx_exit(sch, SCX_EXIT_ERROR_STALL, 0, 2863 + scx_exit(root, SCX_EXIT_ERROR_STALL, 0, 3391 2864 "watchdog failed to check in for %u.%03us", 3392 2865 dur_ms / 1000, dur_ms % 1000); 3393 2866 } ··· 3399 2868 3400 2869 static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) 3401 2870 { 3402 - struct scx_sched *sch = scx_root; 2871 + struct scx_sched *sch = scx_task_sched(curr); 3403 2872 3404 2873 update_curr_scx(rq); 3405 2874 ··· 3407 2876 * While disabling, always resched and refresh core-sched timestamp as 3408 2877 * we can't trust the slice 
management or ops.core_sched_before(). 3409 2878 */ 3410 - if (scx_rq_bypassing(rq)) { 2879 + if (scx_bypassing(sch, cpu_of(rq))) { 3411 2880 curr->scx.slice = 0; 3412 2881 touch_core_sched(rq, curr); 3413 2882 } else if (SCX_HAS_OP(sch, tick)) { 3414 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, tick, rq, curr); 2883 + SCX_CALL_OP_TASK(sch, tick, rq, curr); 3415 2884 } 3416 2885 3417 2886 if (!curr->scx.slice) ··· 3440 2909 3441 2910 #endif /* CONFIG_EXT_GROUP_SCHED */ 3442 2911 3443 - static enum scx_task_state scx_get_task_state(const struct task_struct *p) 2912 + static u32 scx_get_task_state(const struct task_struct *p) 3444 2913 { 3445 - return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; 2914 + return p->scx.flags & SCX_TASK_STATE_MASK; 3446 2915 } 3447 2916 3448 - static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) 2917 + static void scx_set_task_state(struct task_struct *p, u32 state) 3449 2918 { 3450 - enum scx_task_state prev_state = scx_get_task_state(p); 2919 + u32 prev_state = scx_get_task_state(p); 3451 2920 bool warn = false; 3452 - 3453 - BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS)); 3454 2921 3455 2922 switch (state) { 3456 2923 case SCX_TASK_NONE: ··· 3463 2934 warn = prev_state != SCX_TASK_READY; 3464 2935 break; 3465 2936 default: 3466 - warn = true; 2937 + WARN_ONCE(1, "sched_ext: Invalid task state %d -> %d for %s[%d]", 2938 + prev_state, state, p->comm, p->pid); 3467 2939 return; 3468 2940 } 3469 2941 3470 - WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]", 2942 + WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]", 3471 2943 prev_state, state, p->comm, p->pid); 3472 2944 3473 2945 p->scx.flags &= ~SCX_TASK_STATE_MASK; 3474 - p->scx.flags |= state << SCX_TASK_STATE_SHIFT; 2946 + p->scx.flags |= state; 3475 2947 } 3476 2948 3477 - static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork) 2949 + static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) 3478 2950 { 3479 - struct scx_sched *sch = scx_root; 3480 2951 int ret; 3481 2952 3482 2953 p->scx.disallow = false; 3483 2954 3484 2955 if (SCX_HAS_OP(sch, init_task)) { 3485 2956 struct scx_init_task_args args = { 3486 - SCX_INIT_TASK_ARGS_CGROUP(tg) 2957 + SCX_INIT_TASK_ARGS_CGROUP(task_group(p)) 3487 2958 .fork = fork, 3488 2959 }; 3489 2960 3490 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init_task, NULL, 3491 - p, &args); 2961 + ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args); 3492 2962 if (unlikely(ret)) { 3493 2963 ret = ops_sanitize_err(sch, "init_task", ret); 3494 2964 return ret; 3495 2965 } 3496 2966 } 3497 2967 3498 - scx_set_task_state(p, SCX_TASK_INIT); 3499 - 3500 2968 if (p->scx.disallow) { 3501 - if (!fork) { 2969 + if (unlikely(scx_parent(sch))) { 2970 + scx_error(sch, "non-root ops.init_task() set task->scx.disallow for %s[%d]", 2971 + p->comm, p->pid); 2972 + } else if (unlikely(fork)) { 2973 + scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", 2974 + p->comm, p->pid); 2975 + } else { 3502 2976 struct rq *rq; 3503 2977 struct rq_flags rf; 3504 2978 ··· 3520 2988 } 3521 2989 3522 2990 task_rq_unlock(rq, p, &rf); 3523 - } else if (p->policy == SCHED_EXT) { 3524 - scx_error(sch, "ops.init_task() set task->scx.disallow for %s[%d] during fork", 3525 - p->comm, p->pid); 3526 2991 } 3527 2992 } 3528 2993 3529 - p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 3530 2994 return 0; 3531 2995 } 3532 2996 3533 - 
static void scx_enable_task(struct task_struct *p) 2997 + static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork) 3534 2998 { 3535 - struct scx_sched *sch = scx_root; 2999 + int ret; 3000 + 3001 + ret = __scx_init_task(sch, p, fork); 3002 + if (!ret) { 3003 + /* 3004 + * While @p's rq is not locked. @p is not visible to the rest of 3005 + * SCX yet and it's safe to update the flags and state. 3006 + */ 3007 + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 3008 + scx_set_task_state(p, SCX_TASK_INIT); 3009 + } 3010 + return ret; 3011 + } 3012 + 3013 + static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p) 3014 + { 3536 3015 struct rq *rq = task_rq(p); 3537 3016 u32 weight; 3538 3017 3539 3018 lockdep_assert_rq_held(rq); 3019 + 3020 + /* 3021 + * Verify the task is not in BPF scheduler's custody. If flag 3022 + * transitions are consistent, the flag should always be clear 3023 + * here. 3024 + */ 3025 + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); 3540 3026 3541 3027 /* 3542 3028 * Set the weight before calling ops.enable() so that the scheduler ··· 3568 3018 p->scx.weight = sched_weight_to_cgroup(weight); 3569 3019 3570 3020 if (SCX_HAS_OP(sch, enable)) 3571 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p); 3572 - scx_set_task_state(p, SCX_TASK_ENABLED); 3021 + SCX_CALL_OP_TASK(sch, enable, rq, p); 3573 3022 3574 3023 if (SCX_HAS_OP(sch, set_weight)) 3575 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, 3576 - p, p->scx.weight); 3024 + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); 3577 3025 } 3578 3026 3579 - static void scx_disable_task(struct task_struct *p) 3027 + static void scx_enable_task(struct scx_sched *sch, struct task_struct *p) 3580 3028 { 3581 - struct scx_sched *sch = scx_root; 3029 + __scx_enable_task(sch, p); 3030 + scx_set_task_state(p, SCX_TASK_ENABLED); 3031 + } 3032 + 3033 + static void scx_disable_task(struct scx_sched *sch, struct task_struct *p) 3034 + { 3582 3035 struct rq *rq = task_rq(p); 3583 3036 3584 3037 lockdep_assert_rq_held(rq); ··· 3590 3037 clear_direct_dispatch(p); 3591 3038 3592 3039 if (SCX_HAS_OP(sch, disable)) 3593 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); 3040 + SCX_CALL_OP_TASK(sch, disable, rq, p); 3594 3041 scx_set_task_state(p, SCX_TASK_READY); 3042 + 3043 + /* 3044 + * Verify the task is not in BPF scheduler's custody. If flag 3045 + * transitions are consistent, the flag should always be clear 3046 + * here. 
3047 + */ 3048 + WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY); 3595 3049 } 3596 3050 3597 - static void scx_exit_task(struct task_struct *p) 3051 + static void __scx_disable_and_exit_task(struct scx_sched *sch, 3052 + struct task_struct *p) 3598 3053 { 3599 - struct scx_sched *sch = scx_root; 3600 3054 struct scx_exit_task_args args = { 3601 3055 .cancelled = false, 3602 3056 }; 3603 3057 3058 + lockdep_assert_held(&p->pi_lock); 3604 3059 lockdep_assert_rq_held(task_rq(p)); 3605 3060 3606 3061 switch (scx_get_task_state(p)) { ··· 3620 3059 case SCX_TASK_READY: 3621 3060 break; 3622 3061 case SCX_TASK_ENABLED: 3623 - scx_disable_task(p); 3062 + scx_disable_task(sch, p); 3624 3063 break; 3625 3064 default: 3626 3065 WARN_ON_ONCE(true); ··· 3628 3067 } 3629 3068 3630 3069 if (SCX_HAS_OP(sch, exit_task)) 3631 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, exit_task, task_rq(p), 3632 - p, &args); 3070 + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); 3071 + } 3072 + 3073 + static void scx_disable_and_exit_task(struct scx_sched *sch, 3074 + struct task_struct *p) 3075 + { 3076 + __scx_disable_and_exit_task(sch, p); 3077 + 3078 + /* 3079 + * If set, @p exited between __scx_init_task() and scx_enable_task() in 3080 + * scx_sub_enable() and is initialized for both the associated sched and 3081 + * its parent. Disable and exit for the child too. 3082 + */ 3083 + if ((p->scx.flags & SCX_TASK_SUB_INIT) && 3084 + !WARN_ON_ONCE(!scx_enabling_sub_sched)) { 3085 + __scx_disable_and_exit_task(scx_enabling_sub_sched, p); 3086 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 3087 + } 3088 + 3089 + scx_set_task_sched(p, NULL); 3633 3090 scx_set_task_state(p, SCX_TASK_NONE); 3634 3091 } 3635 3092 ··· 3661 3082 INIT_LIST_HEAD(&scx->runnable_node); 3662 3083 scx->runnable_at = jiffies; 3663 3084 scx->ddsp_dsq_id = SCX_DSQ_INVALID; 3664 - scx->slice = READ_ONCE(scx_slice_dfl); 3085 + scx->slice = SCX_SLICE_DFL; 3665 3086 } 3666 3087 3667 3088 void scx_pre_fork(struct task_struct *p) ··· 3675 3096 percpu_down_read(&scx_fork_rwsem); 3676 3097 } 3677 3098 3678 - int scx_fork(struct task_struct *p) 3099 + int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) 3679 3100 { 3101 + s32 ret; 3102 + 3680 3103 percpu_rwsem_assert_held(&scx_fork_rwsem); 3681 3104 3682 - if (scx_init_task_enabled) 3683 - return scx_init_task(p, task_group(p), true); 3684 - else 3685 - return 0; 3105 + if (scx_init_task_enabled) { 3106 + #ifdef CONFIG_EXT_SUB_SCHED 3107 + struct scx_sched *sch = kargs->cset->dfl_cgrp->scx_sched; 3108 + #else 3109 + struct scx_sched *sch = scx_root; 3110 + #endif 3111 + ret = scx_init_task(sch, p, true); 3112 + if (!ret) 3113 + scx_set_task_sched(p, sch); 3114 + return ret; 3115 + } 3116 + 3117 + return 0; 3686 3118 } 3687 3119 3688 3120 void scx_post_fork(struct task_struct *p) ··· 3711 3121 struct rq *rq; 3712 3122 3713 3123 rq = task_rq_lock(p, &rf); 3714 - scx_enable_task(p); 3124 + scx_enable_task(scx_task_sched(p), p); 3715 3125 task_rq_unlock(rq, p, &rf); 3716 3126 } 3717 3127 } ··· 3731 3141 3732 3142 rq = task_rq_lock(p, &rf); 3733 3143 WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY); 3734 - scx_exit_task(p); 3144 + scx_disable_and_exit_task(scx_task_sched(p), p); 3735 3145 task_rq_unlock(rq, p, &rf); 3736 3146 } 3737 3147 ··· 3782 3192 raw_spin_unlock_irqrestore(&scx_tasks_lock, flags); 3783 3193 3784 3194 /* 3785 - * @p is off scx_tasks and wholly ours. scx_enable()'s READY -> ENABLED 3786 - * transitions can't race us. Disable ops for @p. 3195 + * @p is off scx_tasks and wholly ours. 
scx_root_enable()'s READY -> 3196 + * ENABLED transitions can't race us. Disable ops for @p. 3787 3197 */ 3788 3198 if (scx_get_task_state(p) != SCX_TASK_NONE) { 3789 3199 struct rq_flags rf; 3790 3200 struct rq *rq; 3791 3201 3792 3202 rq = task_rq_lock(p, &rf); 3793 - scx_exit_task(p); 3203 + scx_disable_and_exit_task(scx_task_sched(p), p); 3794 3204 task_rq_unlock(rq, p, &rf); 3795 3205 } 3796 3206 } ··· 3798 3208 static void reweight_task_scx(struct rq *rq, struct task_struct *p, 3799 3209 const struct load_weight *lw) 3800 3210 { 3801 - struct scx_sched *sch = scx_root; 3211 + struct scx_sched *sch = scx_task_sched(p); 3802 3212 3803 3213 lockdep_assert_rq_held(task_rq(p)); 3804 3214 ··· 3807 3217 3808 3218 p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); 3809 3219 if (SCX_HAS_OP(sch, set_weight)) 3810 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, 3811 - p, p->scx.weight); 3220 + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); 3812 3221 } 3813 3222 3814 3223 static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) ··· 3816 3227 3817 3228 static void switching_to_scx(struct rq *rq, struct task_struct *p) 3818 3229 { 3819 - struct scx_sched *sch = scx_root; 3230 + struct scx_sched *sch = scx_task_sched(p); 3820 3231 3821 3232 if (task_dead_and_done(p)) 3822 3233 return; 3823 3234 3824 - scx_enable_task(p); 3235 + scx_enable_task(sch, p); 3825 3236 3826 3237 /* 3827 3238 * set_cpus_allowed_scx() is not called while @p is associated with a 3828 3239 * different scheduler class. Keep the BPF scheduler up-to-date. 3829 3240 */ 3830 3241 if (SCX_HAS_OP(sch, set_cpumask)) 3831 - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, rq, 3832 - p, (struct cpumask *)p->cpus_ptr); 3242 + SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr); 3833 3243 } 3834 3244 3835 3245 static void switched_from_scx(struct rq *rq, struct task_struct *p) ··· 3836 3248 if (task_dead_and_done(p)) 3837 3249 return; 3838 3250 3839 - scx_disable_task(p); 3251 + scx_disable_task(scx_task_sched(p), p); 3840 3252 } 3841 - 3842 - static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {} 3843 3253 3844 3254 static void switched_to_scx(struct rq *rq, struct task_struct *p) {} 3845 3255 ··· 3853 3267 return 0; 3854 3268 } 3855 3269 3270 + static void process_ddsp_deferred_locals(struct rq *rq) 3271 + { 3272 + struct task_struct *p; 3273 + 3274 + lockdep_assert_rq_held(rq); 3275 + 3276 + /* 3277 + * Now that @rq can be unlocked, execute the deferred enqueueing of 3278 + * tasks directly dispatched to the local DSQs of other CPUs. See 3279 + * direct_dispatch(). Keep popping from the head instead of using 3280 + * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq 3281 + * temporarily. 3282 + */ 3283 + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, 3284 + struct task_struct, scx.dsq_list.node))) { 3285 + struct scx_sched *sch = scx_task_sched(p); 3286 + struct scx_dispatch_q *dsq; 3287 + u64 dsq_id = p->scx.ddsp_dsq_id; 3288 + u64 enq_flags = p->scx.ddsp_enq_flags; 3289 + 3290 + list_del_init(&p->scx.dsq_list.node); 3291 + clear_direct_dispatch(p); 3292 + 3293 + dsq = find_dsq_for_dispatch(sch, rq, dsq_id, task_cpu(p)); 3294 + if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 3295 + dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); 3296 + } 3297 + } 3298 + 3299 + /* 3300 + * Determine whether @p should be reenqueued from a local DSQ. 
3301 + * 3302 + * @reenq_flags is mutable and accumulates state across the DSQ walk: 3303 + * 3304 + * - %SCX_REENQ_TSR_NOT_FIRST: Set after the first task is visited. "First" 3305 + * tracks position in the DSQ list, not among IMMED tasks. A non-IMMED task at 3306 + * the head consumes the first slot. 3307 + * 3308 + * - %SCX_REENQ_TSR_RQ_OPEN: Set by reenq_local() before the walk if 3309 + * rq_is_open() is true. 3310 + * 3311 + * An IMMED task is kept (returns %false) only if it's the first task in the DSQ 3312 + * AND the current task is done — i.e. it will execute immediately. All other 3313 + * IMMED tasks are reenqueued. This means if a non-IMMED task sits at the head, 3314 + * every IMMED task behind it gets reenqueued. 3315 + * 3316 + * Reenqueued tasks go through ops.enqueue() with %SCX_ENQ_REENQ | 3317 + * %SCX_TASK_REENQ_IMMED. If the BPF scheduler dispatches back to the same local 3318 + * DSQ with %SCX_ENQ_IMMED while the CPU is still unavailable, this triggers 3319 + * another reenq cycle. Repetitions are bounded by %SCX_REENQ_LOCAL_MAX_REPEAT 3320 + * in process_deferred_reenq_locals(). 3321 + */ 3322 + static bool local_task_should_reenq(struct task_struct *p, u64 *reenq_flags, u32 *reason) 3323 + { 3324 + bool first; 3325 + 3326 + first = !(*reenq_flags & SCX_REENQ_TSR_NOT_FIRST); 3327 + *reenq_flags |= SCX_REENQ_TSR_NOT_FIRST; 3328 + 3329 + *reason = SCX_TASK_REENQ_KFUNC; 3330 + 3331 + if ((p->scx.flags & SCX_TASK_IMMED) && 3332 + (!first || !(*reenq_flags & SCX_REENQ_TSR_RQ_OPEN))) { 3333 + __scx_add_event(scx_task_sched(p), SCX_EV_REENQ_IMMED, 1); 3334 + *reason = SCX_TASK_REENQ_IMMED; 3335 + return true; 3336 + } 3337 + 3338 + return *reenq_flags & SCX_REENQ_ANY; 3339 + } 3340 + 3341 + static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags) 3342 + { 3343 + LIST_HEAD(tasks); 3344 + u32 nr_enqueued = 0; 3345 + struct task_struct *p, *n; 3346 + 3347 + lockdep_assert_rq_held(rq); 3348 + 3349 + if (WARN_ON_ONCE(reenq_flags & __SCX_REENQ_TSR_MASK)) 3350 + reenq_flags &= ~__SCX_REENQ_TSR_MASK; 3351 + if (rq_is_open(rq, 0)) 3352 + reenq_flags |= SCX_REENQ_TSR_RQ_OPEN; 3353 + 3354 + /* 3355 + * The BPF scheduler may choose to dispatch tasks back to 3356 + * @rq->scx.local_dsq. Move all candidate tasks off to a private list 3357 + * first to avoid processing the same tasks repeatedly. 3358 + */ 3359 + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, 3360 + scx.dsq_list.node) { 3361 + struct scx_sched *task_sch = scx_task_sched(p); 3362 + u32 reason; 3363 + 3364 + /* 3365 + * If @p is being migrated, @p's current CPU may not agree with 3366 + * its allowed CPUs and the migration_cpu_stop is about to 3367 + * deactivate and re-activate @p anyway. Skip re-enqueueing. 3368 + * 3369 + * While racing sched property changes may also dequeue and 3370 + * re-enqueue a migrating task while its current CPU and allowed 3371 + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to 3372 + * the current local DSQ for running tasks and thus are not 3373 + * visible to the BPF scheduler. 
3374 + */ 3375 + if (p->migration_pending) 3376 + continue; 3377 + 3378 + if (!scx_is_descendant(task_sch, sch)) 3379 + continue; 3380 + 3381 + if (!local_task_should_reenq(p, &reenq_flags, &reason)) 3382 + continue; 3383 + 3384 + dispatch_dequeue(rq, p); 3385 + 3386 + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) 3387 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3388 + p->scx.flags |= reason; 3389 + 3390 + list_add_tail(&p->scx.dsq_list.node, &tasks); 3391 + } 3392 + 3393 + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { 3394 + list_del_init(&p->scx.dsq_list.node); 3395 + 3396 + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 3397 + 3398 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3399 + nr_enqueued++; 3400 + } 3401 + 3402 + return nr_enqueued; 3403 + } 3404 + 3405 + static void process_deferred_reenq_locals(struct rq *rq) 3406 + { 3407 + u64 seq = ++rq->scx.deferred_reenq_locals_seq; 3408 + 3409 + lockdep_assert_rq_held(rq); 3410 + 3411 + while (true) { 3412 + struct scx_sched *sch; 3413 + u64 reenq_flags; 3414 + bool skip = false; 3415 + 3416 + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { 3417 + struct scx_deferred_reenq_local *drl = 3418 + list_first_entry_or_null(&rq->scx.deferred_reenq_locals, 3419 + struct scx_deferred_reenq_local, 3420 + node); 3421 + struct scx_sched_pcpu *sch_pcpu; 3422 + 3423 + if (!drl) 3424 + return; 3425 + 3426 + sch_pcpu = container_of(drl, struct scx_sched_pcpu, 3427 + deferred_reenq_local); 3428 + sch = sch_pcpu->sch; 3429 + 3430 + reenq_flags = drl->flags; 3431 + WRITE_ONCE(drl->flags, 0); 3432 + list_del_init(&drl->node); 3433 + 3434 + if (likely(drl->seq != seq)) { 3435 + drl->seq = seq; 3436 + drl->cnt = 0; 3437 + } else { 3438 + if (unlikely(++drl->cnt > SCX_REENQ_LOCAL_MAX_REPEAT)) { 3439 + scx_error(sch, "SCX_ENQ_REENQ on SCX_DSQ_LOCAL repeated %u times", 3440 + drl->cnt); 3441 + skip = true; 3442 + } 3443 + 3444 + __scx_add_event(sch, SCX_EV_REENQ_LOCAL_REPEAT, 1); 3445 + } 3446 + } 3447 + 3448 + if (!skip) { 3449 + /* see schedule_dsq_reenq() */ 3450 + smp_mb(); 3451 + 3452 + reenq_local(sch, rq, reenq_flags); 3453 + } 3454 + } 3455 + } 3456 + 3457 + static bool user_task_should_reenq(struct task_struct *p, u64 reenq_flags, u32 *reason) 3458 + { 3459 + *reason = SCX_TASK_REENQ_KFUNC; 3460 + return reenq_flags & SCX_REENQ_ANY; 3461 + } 3462 + 3463 + static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flags) 3464 + { 3465 + struct rq *locked_rq = rq; 3466 + struct scx_sched *sch = dsq->sched; 3467 + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, dsq, 0); 3468 + struct task_struct *p; 3469 + s32 nr_enqueued = 0; 3470 + 3471 + lockdep_assert_rq_held(rq); 3472 + 3473 + raw_spin_lock(&dsq->lock); 3474 + 3475 + while (likely(!READ_ONCE(sch->bypass_depth))) { 3476 + struct rq *task_rq; 3477 + u32 reason; 3478 + 3479 + p = nldsq_cursor_next_task(&cursor, dsq); 3480 + if (!p) 3481 + break; 3482 + 3483 + if (!user_task_should_reenq(p, reenq_flags, &reason)) 3484 + continue; 3485 + 3486 + task_rq = task_rq(p); 3487 + 3488 + if (locked_rq != task_rq) { 3489 + if (locked_rq) 3490 + raw_spin_rq_unlock(locked_rq); 3491 + if (unlikely(!raw_spin_rq_trylock(task_rq))) { 3492 + raw_spin_unlock(&dsq->lock); 3493 + raw_spin_rq_lock(task_rq); 3494 + raw_spin_lock(&dsq->lock); 3495 + } 3496 + locked_rq = task_rq; 3497 + 3498 + /* did we lose @p while switching locks? 
*/ 3499 + if (nldsq_cursor_lost_task(&cursor, task_rq, dsq, p)) 3500 + continue; 3501 + } 3502 + 3503 + /* @p is on @dsq, its rq and @dsq are locked */ 3504 + dispatch_dequeue_locked(p, dsq); 3505 + raw_spin_unlock(&dsq->lock); 3506 + 3507 + if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK)) 3508 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3509 + p->scx.flags |= reason; 3510 + 3511 + do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1); 3512 + 3513 + p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK; 3514 + 3515 + if (!(++nr_enqueued % SCX_TASK_ITER_BATCH)) { 3516 + raw_spin_rq_unlock(locked_rq); 3517 + locked_rq = NULL; 3518 + cpu_relax(); 3519 + } 3520 + 3521 + raw_spin_lock(&dsq->lock); 3522 + } 3523 + 3524 + list_del_init(&cursor.node); 3525 + raw_spin_unlock(&dsq->lock); 3526 + 3527 + if (locked_rq != rq) { 3528 + if (locked_rq) 3529 + raw_spin_rq_unlock(locked_rq); 3530 + raw_spin_rq_lock(rq); 3531 + } 3532 + } 3533 + 3534 + static void process_deferred_reenq_users(struct rq *rq) 3535 + { 3536 + lockdep_assert_rq_held(rq); 3537 + 3538 + while (true) { 3539 + struct scx_dispatch_q *dsq; 3540 + u64 reenq_flags; 3541 + 3542 + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) { 3543 + struct scx_deferred_reenq_user *dru = 3544 + list_first_entry_or_null(&rq->scx.deferred_reenq_users, 3545 + struct scx_deferred_reenq_user, 3546 + node); 3547 + struct scx_dsq_pcpu *dsq_pcpu; 3548 + 3549 + if (!dru) 3550 + return; 3551 + 3552 + dsq_pcpu = container_of(dru, struct scx_dsq_pcpu, 3553 + deferred_reenq_user); 3554 + dsq = dsq_pcpu->dsq; 3555 + reenq_flags = dru->flags; 3556 + WRITE_ONCE(dru->flags, 0); 3557 + list_del_init(&dru->node); 3558 + } 3559 + 3560 + /* see schedule_dsq_reenq() */ 3561 + smp_mb(); 3562 + 3563 + BUG_ON(dsq->id & SCX_DSQ_FLAG_BUILTIN); 3564 + reenq_user(rq, dsq, reenq_flags); 3565 + } 3566 + } 3567 + 3568 + static void run_deferred(struct rq *rq) 3569 + { 3570 + process_ddsp_deferred_locals(rq); 3571 + 3572 + if (!list_empty(&rq->scx.deferred_reenq_locals)) 3573 + process_deferred_reenq_locals(rq); 3574 + 3575 + if (!list_empty(&rq->scx.deferred_reenq_users)) 3576 + process_deferred_reenq_users(rq); 3577 + } 3578 + 3856 3579 #ifdef CONFIG_NO_HZ_FULL 3857 3580 bool scx_can_stop_tick(struct rq *rq) 3858 3581 { 3859 3582 struct task_struct *p = rq->curr; 3860 - 3861 - if (scx_rq_bypassing(rq)) 3862 - return false; 3583 + struct scx_sched *sch = scx_task_sched(p); 3863 3584 3864 3585 if (p->sched_class != &ext_sched_class) 3865 3586 return true; 3587 + 3588 + if (scx_bypassing(sch, cpu_of(rq))) 3589 + return false; 3866 3590 3867 3591 /* 3868 3592 * @rq can dispatch from different DSQs, so we can't tell whether it ··· 4211 3315 .bw_quota_us = tg->scx.bw_quota_us, 4212 3316 .bw_burst_us = tg->scx.bw_burst_us }; 4213 3317 4214 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, 3318 + ret = SCX_CALL_OP_RET(sch, cgroup_init, 4215 3319 NULL, tg->css.cgroup, &args); 4216 3320 if (ret) 4217 3321 ret = ops_sanitize_err(sch, "cgroup_init", ret); ··· 4233 3337 4234 3338 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_exit) && 4235 3339 (tg->scx.flags & SCX_TG_INITED)) 4236 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, 4237 - tg->css.cgroup); 3340 + SCX_CALL_OP(sch, cgroup_exit, NULL, tg->css.cgroup); 4238 3341 tg->scx.flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); 4239 3342 } 4240 3343 ··· 4262 3367 continue; 4263 3368 4264 3369 if (SCX_HAS_OP(sch, cgroup_prep_move)) { 4265 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, 4266 - cgroup_prep_move, NULL, 3370 + ret = 
SCX_CALL_OP_RET(sch, cgroup_prep_move, NULL, 4267 3371 p, from, css->cgroup); 4268 3372 if (ret) 4269 3373 goto err; ··· 4277 3383 cgroup_taskset_for_each(p, css, tset) { 4278 3384 if (SCX_HAS_OP(sch, cgroup_cancel_move) && 4279 3385 p->scx.cgrp_moving_from) 4280 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, 3386 + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, 4281 3387 p, p->scx.cgrp_moving_from, css->cgroup); 4282 3388 p->scx.cgrp_moving_from = NULL; 4283 3389 } ··· 4298 3404 */ 4299 3405 if (SCX_HAS_OP(sch, cgroup_move) && 4300 3406 !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) 4301 - SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL, 3407 + SCX_CALL_OP_TASK(sch, cgroup_move, task_rq(p), 4302 3408 p, p->scx.cgrp_moving_from, 4303 3409 tg_cgrp(task_group(p))); 4304 3410 p->scx.cgrp_moving_from = NULL; ··· 4316 3422 cgroup_taskset_for_each(p, css, tset) { 4317 3423 if (SCX_HAS_OP(sch, cgroup_cancel_move) && 4318 3424 p->scx.cgrp_moving_from) 4319 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, 3425 + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, 4320 3426 p, p->scx.cgrp_moving_from, css->cgroup); 4321 3427 p->scx.cgrp_moving_from = NULL; 4322 3428 } ··· 4330 3436 4331 3437 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && 4332 3438 tg->scx.weight != weight) 4333 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_weight, NULL, 4334 - tg_cgrp(tg), weight); 3439 + SCX_CALL_OP(sch, cgroup_set_weight, NULL, tg_cgrp(tg), weight); 4335 3440 4336 3441 tg->scx.weight = weight; 4337 3442 ··· 4344 3451 percpu_down_read(&scx_cgroup_ops_rwsem); 4345 3452 4346 3453 if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) 4347 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_idle, NULL, 4348 - tg_cgrp(tg), idle); 3454 + SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); 4349 3455 4350 3456 /* Update the task group's idle state */ 4351 3457 tg->scx.idle = idle; ··· 4363 3471 (tg->scx.bw_period_us != period_us || 4364 3472 tg->scx.bw_quota_us != quota_us || 4365 3473 tg->scx.bw_burst_us != burst_us)) 4366 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_bandwidth, NULL, 3474 + SCX_CALL_OP(sch, cgroup_set_bandwidth, NULL, 4367 3475 tg_cgrp(tg), period_us, quota_us, burst_us); 4368 3476 4369 3477 tg->scx.bw_period_us = period_us; ··· 4372 3480 4373 3481 percpu_up_read(&scx_cgroup_ops_rwsem); 4374 3482 } 3483 + #endif /* CONFIG_EXT_GROUP_SCHED */ 3484 + 3485 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 3486 + static struct cgroup *root_cgroup(void) 3487 + { 3488 + return &cgrp_dfl_root.cgrp; 3489 + } 3490 + 3491 + static struct cgroup *sch_cgroup(struct scx_sched *sch) 3492 + { 3493 + return sch->cgrp; 3494 + } 3495 + 3496 + /* for each descendant of @cgrp including self, set ->scx_sched to @sch */ 3497 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) 3498 + { 3499 + struct cgroup *pos; 3500 + struct cgroup_subsys_state *css; 3501 + 3502 + cgroup_for_each_live_descendant_pre(pos, css, cgrp) 3503 + rcu_assign_pointer(pos->scx_sched, sch); 3504 + } 4375 3505 4376 3506 static void scx_cgroup_lock(void) 4377 3507 { 3508 + #ifdef CONFIG_EXT_GROUP_SCHED 4378 3509 percpu_down_write(&scx_cgroup_ops_rwsem); 3510 + #endif 4379 3511 cgroup_lock(); 4380 3512 } 4381 3513 4382 3514 static void scx_cgroup_unlock(void) 4383 3515 { 4384 3516 cgroup_unlock(); 3517 + #ifdef CONFIG_EXT_GROUP_SCHED 4385 3518 percpu_up_write(&scx_cgroup_ops_rwsem); 3519 + #endif 4386 3520 } 4387 - 4388 - #else /* CONFIG_EXT_GROUP_SCHED */ 4389 - 3521 + 
#else /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ 3522 + static struct cgroup *root_cgroup(void) { return NULL; } 3523 + static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } 3524 + static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} 4390 3525 static void scx_cgroup_lock(void) {} 4391 3526 static void scx_cgroup_unlock(void) {} 4392 - 4393 - #endif /* CONFIG_EXT_GROUP_SCHED */ 3527 + #endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ 4394 3528 4395 3529 /* 4396 3530 * Omitted operations: 4397 - * 4398 - * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task 4399 - * isn't tied to the CPU at that point. Preemption is implemented by resetting 4400 - * the victim task's slice to 0 and triggering reschedule on the target CPU. 4401 3531 * 4402 3532 * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. 4403 3533 * ··· 4461 3547 #endif 4462 3548 }; 4463 3549 4464 - static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id) 3550 + static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id, 3551 + struct scx_sched *sch) 4465 3552 { 3553 + s32 cpu; 3554 + 4466 3555 memset(dsq, 0, sizeof(*dsq)); 4467 3556 4468 3557 raw_spin_lock_init(&dsq->lock); 4469 3558 INIT_LIST_HEAD(&dsq->list); 4470 3559 dsq->id = dsq_id; 3560 + dsq->sched = sch; 3561 + 3562 + dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu); 3563 + if (!dsq->pcpu) 3564 + return -ENOMEM; 3565 + 3566 + for_each_possible_cpu(cpu) { 3567 + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); 3568 + 3569 + pcpu->dsq = dsq; 3570 + INIT_LIST_HEAD(&pcpu->deferred_reenq_user.node); 3571 + } 3572 + 3573 + return 0; 3574 + } 3575 + 3576 + static void exit_dsq(struct scx_dispatch_q *dsq) 3577 + { 3578 + s32 cpu; 3579 + 3580 + for_each_possible_cpu(cpu) { 3581 + struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu); 3582 + struct scx_deferred_reenq_user *dru = &pcpu->deferred_reenq_user; 3583 + struct rq *rq = cpu_rq(cpu); 3584 + 3585 + /* 3586 + * There must have been a RCU grace period since the last 3587 + * insertion and @dsq should be off the deferred list by now. 
3588 + */ 3589 + if (WARN_ON_ONCE(!list_empty(&dru->node))) { 3590 + guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock); 3591 + list_del_init(&dru->node); 3592 + } 3593 + } 3594 + 3595 + free_percpu(dsq->pcpu); 3596 + } 3597 + 3598 + static void free_dsq_rcufn(struct rcu_head *rcu) 3599 + { 3600 + struct scx_dispatch_q *dsq = container_of(rcu, struct scx_dispatch_q, rcu); 3601 + 3602 + exit_dsq(dsq); 3603 + kfree(dsq); 4471 3604 } 4472 3605 4473 3606 static void free_dsq_irq_workfn(struct irq_work *irq_work) ··· 4523 3562 struct scx_dispatch_q *dsq, *tmp_dsq; 4524 3563 4525 3564 llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node) 4526 - kfree_rcu(dsq, rcu); 3565 + call_rcu(&dsq->rcu, free_dsq_rcufn); 4527 3566 } 4528 3567 4529 3568 static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn); ··· 4588 3627 if (!sch->ops.cgroup_exit) 4589 3628 continue; 4590 3629 4591 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, 4592 - css->cgroup); 3630 + SCX_CALL_OP(sch, cgroup_exit, NULL, css->cgroup); 4593 3631 } 4594 3632 } 4595 3633 ··· 4619 3659 continue; 4620 3660 } 4621 3661 4622 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, NULL, 3662 + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, 4623 3663 css->cgroup, &args); 4624 3664 if (ret) { 4625 3665 scx_error(sch, "ops.cgroup_init() failed (%d)", ret); ··· 4698 3738 .attrs = scx_global_attrs, 4699 3739 }; 4700 3740 3741 + static void free_pnode(struct scx_sched_pnode *pnode); 4701 3742 static void free_exit_info(struct scx_exit_info *ei); 4702 3743 4703 3744 static void scx_sched_free_rcu_work(struct work_struct *work) ··· 4707 3746 struct scx_sched *sch = container_of(rcu_work, struct scx_sched, rcu_work); 4708 3747 struct rhashtable_iter rht_iter; 4709 3748 struct scx_dispatch_q *dsq; 4710 - int node; 3749 + int cpu, node; 4711 3750 4712 - irq_work_sync(&sch->error_irq_work); 3751 + irq_work_sync(&sch->disable_irq_work); 4713 3752 kthread_destroy_worker(sch->helper); 3753 + timer_shutdown_sync(&sch->bypass_lb_timer); 3754 + 3755 + #ifdef CONFIG_EXT_SUB_SCHED 3756 + kfree(sch->cgrp_path); 3757 + if (sch_cgroup(sch)) 3758 + cgroup_put(sch_cgroup(sch)); 3759 + #endif /* CONFIG_EXT_SUB_SCHED */ 3760 + 3761 + for_each_possible_cpu(cpu) { 3762 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 3763 + 3764 + /* 3765 + * $sch would have entered bypass mode before the RCU grace 3766 + * period. As that blocks new deferrals, all 3767 + * deferred_reenq_local_node's must be off-list by now. 
3768 + */ 3769 + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node)); 3770 + 3771 + exit_dsq(bypass_dsq(sch, cpu)); 3772 + } 4714 3773 4715 3774 free_percpu(sch->pcpu); 4716 3775 4717 3776 for_each_node_state(node, N_POSSIBLE) 4718 - kfree(sch->global_dsqs[node]); 4719 - kfree(sch->global_dsqs); 3777 + free_pnode(sch->pnode[node]); 3778 + kfree(sch->pnode); 4720 3779 4721 3780 rhashtable_walk_enter(&sch->dsq_hash, &rht_iter); 4722 3781 do { 4723 3782 rhashtable_walk_start(&rht_iter); 4724 3783 4725 - while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq)) 3784 + while (!IS_ERR_OR_NULL((dsq = rhashtable_walk_next(&rht_iter)))) 4726 3785 destroy_dsq(sch, dsq->id); 4727 3786 4728 3787 rhashtable_walk_stop(&rht_iter); ··· 4759 3778 struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); 4760 3779 4761 3780 INIT_RCU_WORK(&sch->rcu_work, scx_sched_free_rcu_work); 4762 - queue_rcu_work(system_unbound_wq, &sch->rcu_work); 3781 + queue_rcu_work(system_dfl_wq, &sch->rcu_work); 4763 3782 } 4764 3783 4765 3784 static ssize_t scx_attr_ops_show(struct kobject *kobj, ··· 4788 3807 at += scx_attr_event_show(buf, at, &events, SCX_EV_DISPATCH_KEEP_LAST); 4789 3808 at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_EXITING); 4790 3809 at += scx_attr_event_show(buf, at, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 3810 + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_IMMED); 3811 + at += scx_attr_event_show(buf, at, &events, SCX_EV_REENQ_LOCAL_REPEAT); 4791 3812 at += scx_attr_event_show(buf, at, &events, SCX_EV_REFILL_SLICE_DFL); 4792 3813 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DURATION); 4793 3814 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DISPATCH); 4794 3815 at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_ACTIVATE); 3816 + at += scx_attr_event_show(buf, at, &events, SCX_EV_INSERT_NOT_OWNED); 3817 + at += scx_attr_event_show(buf, at, &events, SCX_EV_SUB_BYPASS_DISPATCH); 4795 3818 return at; 4796 3819 } 4797 3820 SCX_ATTR(events); ··· 4815 3830 4816 3831 static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) 4817 3832 { 4818 - const struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); 3833 + const struct scx_sched *sch; 3834 + 3835 + /* 3836 + * scx_uevent() can be reached by both scx_sched kobjects (scx_ktype) 3837 + * and sub-scheduler kset kobjects (kset_ktype) through the parent 3838 + * chain walk. Filter out the latter to avoid invalid casts. 3839 + */ 3840 + if (kobj->ktype != &scx_ktype) 3841 + return 0; 3842 + 3843 + sch = container_of(kobj, struct scx_sched, kobj); 4819 3844 4820 3845 return add_uevent_var(env, "SCXOPS=%s", sch->ops.name); 4821 3846 } ··· 4854 3859 if (!scx_enabled()) 4855 3860 return true; 4856 3861 4857 - sch = rcu_dereference_sched(scx_root); 3862 + sch = scx_task_sched(p); 4858 3863 if (unlikely(!sch)) 4859 3864 return true; 4860 3865 ··· 4947 3952 * a good state before taking more drastic actions. 4948 3953 * 4949 3954 * Returns %true if sched_ext is enabled and abort was initiated, which may 4950 - * resolve the reported hardlockdup. %false if sched_ext is not enabled or 3955 + * resolve the reported hardlockup. %false if sched_ext is not enabled or 4951 3956 * someone else already initiated abort. 
4952 3957 */ 4953 3958 bool scx_hardlockup(int cpu) ··· 4960 3965 return true; 4961 3966 } 4962 3967 4963 - static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, 3968 + static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor, 4964 3969 struct cpumask *donee_mask, struct cpumask *resched_mask, 4965 3970 u32 nr_donor_target, u32 nr_donee_target) 4966 3971 { 4967 - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; 3972 + struct rq *donor_rq = cpu_rq(donor); 3973 + struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor); 4968 3974 struct task_struct *p, *n; 4969 - struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0); 3975 + struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0); 4970 3976 s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target; 4971 3977 u32 nr_balanced = 0, min_delta_us; 4972 3978 ··· 4981 3985 if (delta < DIV_ROUND_UP(min_delta_us, READ_ONCE(scx_slice_bypass_us))) 4982 3986 return 0; 4983 3987 4984 - raw_spin_rq_lock_irq(rq); 3988 + raw_spin_rq_lock_irq(donor_rq); 4985 3989 raw_spin_lock(&donor_dsq->lock); 4986 3990 list_add(&cursor.node, &donor_dsq->list); 4987 3991 resume: ··· 4989 3993 n = nldsq_next_task(donor_dsq, n, false); 4990 3994 4991 3995 while ((p = n)) { 4992 - struct rq *donee_rq; 4993 3996 struct scx_dispatch_q *donee_dsq; 4994 3997 int donee; 4995 3998 ··· 5004 4009 if (donee >= nr_cpu_ids) 5005 4010 continue; 5006 4011 5007 - donee_rq = cpu_rq(donee); 5008 - donee_dsq = &donee_rq->scx.bypass_dsq; 4012 + donee_dsq = bypass_dsq(sch, donee); 5009 4013 5010 4014 /* 5011 4015 * $p's rq is not locked but $p's DSQ lock protects its 5012 4016 * scheduling properties making this test safe. 5013 4017 */ 5014 - if (!task_can_run_on_remote_rq(sch, p, donee_rq, false)) 4018 + if (!task_can_run_on_remote_rq(sch, p, cpu_rq(donee), false)) 5015 4019 continue; 5016 4020 5017 4021 /* ··· 5025 4031 * between bypass DSQs. 5026 4032 */ 5027 4033 dispatch_dequeue_locked(p, donor_dsq); 5028 - dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED); 4034 + dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED); 5029 4035 5030 4036 /* 5031 4037 * $donee might have been idle and need to be woken up. 
No need ··· 5040 4046 if (!(nr_balanced % SCX_BYPASS_LB_BATCH) && n) { 5041 4047 list_move_tail(&cursor.node, &n->scx.dsq_list.node); 5042 4048 raw_spin_unlock(&donor_dsq->lock); 5043 - raw_spin_rq_unlock_irq(rq); 4049 + raw_spin_rq_unlock_irq(donor_rq); 5044 4050 cpu_relax(); 5045 - raw_spin_rq_lock_irq(rq); 4051 + raw_spin_rq_lock_irq(donor_rq); 5046 4052 raw_spin_lock(&donor_dsq->lock); 5047 4053 goto resume; 5048 4054 } ··· 5050 4056 5051 4057 list_del_init(&cursor.node); 5052 4058 raw_spin_unlock(&donor_dsq->lock); 5053 - raw_spin_rq_unlock_irq(rq); 4059 + raw_spin_rq_unlock_irq(donor_rq); 5054 4060 5055 4061 return nr_balanced; 5056 4062 } ··· 5068 4074 5069 4075 /* count the target tasks and CPUs */ 5070 4076 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5071 - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); 4077 + u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr); 5072 4078 5073 4079 nr_tasks += nr; 5074 4080 nr_cpus++; ··· 5090 4096 5091 4097 cpumask_clear(donee_mask); 5092 4098 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5093 - if (READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr) < nr_target) 4099 + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target) 5094 4100 cpumask_set_cpu(cpu, donee_mask); 5095 4101 } 5096 4102 5097 4103 /* iterate !donee CPUs and see if they should be offloaded */ 5098 4104 cpumask_clear(resched_mask); 5099 4105 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5100 - struct rq *rq = cpu_rq(cpu); 5101 - struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq; 5102 - 5103 4106 if (cpumask_empty(donee_mask)) 5104 4107 break; 5105 4108 if (cpumask_test_cpu(cpu, donee_mask)) 5106 4109 continue; 5107 - if (READ_ONCE(donor_dsq->nr) <= nr_donor_target) 4110 + if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target) 5108 4111 continue; 5109 4112 5110 - nr_balanced += bypass_lb_cpu(sch, rq, donee_mask, resched_mask, 4113 + nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask, 5111 4114 nr_donor_target, nr_target); 5112 4115 } 5113 4116 ··· 5112 4121 resched_cpu(cpu); 5113 4122 5114 4123 for_each_cpu_and(cpu, cpu_online_mask, node_mask) { 5115 - u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr); 4124 + u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr); 5116 4125 5117 4126 after_min = min(nr, after_min); 5118 4127 after_max = max(nr, after_max); ··· 5134 4143 */ 5135 4144 static void scx_bypass_lb_timerfn(struct timer_list *timer) 5136 4145 { 5137 - struct scx_sched *sch; 4146 + struct scx_sched *sch = container_of(timer, struct scx_sched, bypass_lb_timer); 5138 4147 int node; 5139 4148 u32 intv_us; 5140 4149 5141 - sch = rcu_dereference_all(scx_root); 5142 - if (unlikely(!sch) || !READ_ONCE(scx_bypass_depth)) 4150 + if (!bypass_dsp_enabled(sch)) 5143 4151 return; 5144 4152 5145 4153 for_each_node_with_cpus(node) ··· 5149 4159 mod_timer(timer, jiffies + usecs_to_jiffies(intv_us)); 5150 4160 } 5151 4161 5152 - static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn); 4162 + static bool inc_bypass_depth(struct scx_sched *sch) 4163 + { 4164 + lockdep_assert_held(&scx_bypass_lock); 4165 + 4166 + WARN_ON_ONCE(sch->bypass_depth < 0); 4167 + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth + 1); 4168 + if (sch->bypass_depth != 1) 4169 + return false; 4170 + 4171 + WRITE_ONCE(sch->slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); 4172 + sch->bypass_timestamp = ktime_get_ns(); 4173 + scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); 4174 + return true; 4175 + } 4176 + 4177 + static bool dec_bypass_depth(struct scx_sched *sch) 4178 + { 4179 + 
lockdep_assert_held(&scx_bypass_lock); 4180 + 4181 + WARN_ON_ONCE(sch->bypass_depth < 1); 4182 + WRITE_ONCE(sch->bypass_depth, sch->bypass_depth - 1); 4183 + if (sch->bypass_depth != 0) 4184 + return false; 4185 + 4186 + WRITE_ONCE(sch->slice_dfl, SCX_SLICE_DFL); 4187 + scx_add_event(sch, SCX_EV_BYPASS_DURATION, 4188 + ktime_get_ns() - sch->bypass_timestamp); 4189 + return true; 4190 + } 4191 + 4192 + static void enable_bypass_dsp(struct scx_sched *sch) 4193 + { 4194 + struct scx_sched *host = scx_parent(sch) ?: sch; 4195 + u32 intv_us = READ_ONCE(scx_bypass_lb_intv_us); 4196 + s32 ret; 4197 + 4198 + /* 4199 + * @sch->bypass_depth transitioning from 0 to 1 triggers enabling. 4200 + * Shouldn't stagger. 4201 + */ 4202 + if (WARN_ON_ONCE(test_and_set_bit(0, &sch->bypass_dsp_claim))) 4203 + return; 4204 + 4205 + /* 4206 + * When a sub-sched bypasses, its tasks are queued on the bypass DSQs of 4207 + * the nearest non-bypassing ancestor or root. As enable_bypass_dsp() is 4208 + * called iff @sch is not already bypassed due to an ancestor bypassing, 4209 + * we can assume that the parent is not bypassing and thus will be the 4210 + * host of the bypass DSQs. 4211 + * 4212 + * While the situation may change in the future, the following 4213 + * guarantees that the nearest non-bypassing ancestor or root has bypass 4214 + * dispatch enabled while a descendant is bypassing, which is all that's 4215 + * required. 4216 + * 4217 + * bypass_dsp_enabled() test is used to determine whether to enter the 4218 + * bypass dispatch handling path from both bypassing and hosting scheds. 4219 + * Bump enable depth on both @sch and bypass dispatch host. 4220 + */ 4221 + ret = atomic_inc_return(&sch->bypass_dsp_enable_depth); 4222 + WARN_ON_ONCE(ret <= 0); 4223 + 4224 + if (host != sch) { 4225 + ret = atomic_inc_return(&host->bypass_dsp_enable_depth); 4226 + WARN_ON_ONCE(ret <= 0); 4227 + } 4228 + 4229 + /* 4230 + * The LB timer will stop running if bypass dispatch is disabled. Start 4231 + * after enabling bypass dispatch. 4232 + */ 4233 + if (intv_us && !timer_pending(&host->bypass_lb_timer)) 4234 + mod_timer(&host->bypass_lb_timer, 4235 + jiffies + usecs_to_jiffies(intv_us)); 4236 + } 4237 + 4238 + /* may be called without holding scx_bypass_lock */ 4239 + static void disable_bypass_dsp(struct scx_sched *sch) 4240 + { 4241 + s32 ret; 4242 + 4243 + if (!test_and_clear_bit(0, &sch->bypass_dsp_claim)) 4244 + return; 4245 + 4246 + ret = atomic_dec_return(&sch->bypass_dsp_enable_depth); 4247 + WARN_ON_ONCE(ret < 0); 4248 + 4249 + if (scx_parent(sch)) { 4250 + ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth); 4251 + WARN_ON_ONCE(ret < 0); 4252 + } 4253 + } 5153 4254 5154 4255 /** 5155 4256 * scx_bypass - [Un]bypass scx_ops and guarantee forward progress 4257 + * @sch: sched to bypass 5156 4258 * @bypass: true for bypass, false for unbypass 5157 4259 * 5158 4260 * Bypassing guarantees that all runnable tasks make forward progress without ··· 5274 4192 * 5275 4193 * - scx_prio_less() reverts to the default core_sched_at order. 
5276 4194 */ 5277 - static void scx_bypass(bool bypass) 4195 + static void scx_bypass(struct scx_sched *sch, bool bypass) 5278 4196 { 5279 - static DEFINE_RAW_SPINLOCK(bypass_lock); 5280 - static unsigned long bypass_timestamp; 5281 - struct scx_sched *sch; 4197 + struct scx_sched *pos; 5282 4198 unsigned long flags; 5283 4199 int cpu; 5284 4200 5285 - raw_spin_lock_irqsave(&bypass_lock, flags); 5286 - sch = rcu_dereference_bh(scx_root); 4201 + raw_spin_lock_irqsave(&scx_bypass_lock, flags); 5287 4202 5288 4203 if (bypass) { 5289 - u32 intv_us; 5290 - 5291 - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth + 1); 5292 - WARN_ON_ONCE(scx_bypass_depth <= 0); 5293 - if (scx_bypass_depth != 1) 4204 + if (!inc_bypass_depth(sch)) 5294 4205 goto unlock; 5295 - WRITE_ONCE(scx_slice_dfl, READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC); 5296 - bypass_timestamp = ktime_get_ns(); 5297 - if (sch) 5298 - scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1); 5299 4206 5300 - intv_us = READ_ONCE(scx_bypass_lb_intv_us); 5301 - if (intv_us && !timer_pending(&scx_bypass_lb_timer)) { 5302 - scx_bypass_lb_timer.expires = 5303 - jiffies + usecs_to_jiffies(intv_us); 5304 - add_timer_global(&scx_bypass_lb_timer); 5305 - } 4207 + enable_bypass_dsp(sch); 5306 4208 } else { 5307 - WRITE_ONCE(scx_bypass_depth, scx_bypass_depth - 1); 5308 - WARN_ON_ONCE(scx_bypass_depth < 0); 5309 - if (scx_bypass_depth != 0) 4209 + if (!dec_bypass_depth(sch)) 5310 4210 goto unlock; 5311 - WRITE_ONCE(scx_slice_dfl, SCX_SLICE_DFL); 5312 - if (sch) 5313 - scx_add_event(sch, SCX_EV_BYPASS_DURATION, 5314 - ktime_get_ns() - bypass_timestamp); 5315 4211 } 5316 4212 5317 4213 /* 4214 + * Bypass state is propagated to all descendants - an scx_sched bypasses 4215 + * if itself or any of its ancestors are in bypass mode. 4216 + */ 4217 + raw_spin_lock(&scx_sched_lock); 4218 + scx_for_each_descendant_pre(pos, sch) { 4219 + if (pos == sch) 4220 + continue; 4221 + if (bypass) 4222 + inc_bypass_depth(pos); 4223 + else 4224 + dec_bypass_depth(pos); 4225 + } 4226 + raw_spin_unlock(&scx_sched_lock); 4227 + 4228 + /* 5318 4229 * No task property is changing. We just need to make sure all currently 5319 - * queued tasks are re-queued according to the new scx_rq_bypassing() 4230 + * queued tasks are re-queued according to the new scx_bypassing() 5320 4231 * state. As an optimization, walk each rq's runnable_list instead of 5321 4232 * the scx_tasks list. 5322 4233 * ··· 5321 4246 struct task_struct *p, *n; 5322 4247 5323 4248 raw_spin_rq_lock(rq); 4249 + raw_spin_lock(&scx_sched_lock); 5324 4250 5325 - if (bypass) { 5326 - WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); 5327 - rq->scx.flags |= SCX_RQ_BYPASSING; 5328 - } else { 5329 - WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING)); 5330 - rq->scx.flags &= ~SCX_RQ_BYPASSING; 4251 + scx_for_each_descendant_pre(pos, sch) { 4252 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu); 4253 + 4254 + if (pos->bypass_depth) 4255 + pcpu->flags |= SCX_SCHED_PCPU_BYPASSING; 4256 + else 4257 + pcpu->flags &= ~SCX_SCHED_PCPU_BYPASSING; 5331 4258 } 4259 + 4260 + raw_spin_unlock(&scx_sched_lock); 5332 4261 5333 4262 /* 5334 4263 * We need to guarantee that no tasks are on the BPF scheduler 5335 4264 * while bypassing. Either we see enabled or the enable path 5336 - * sees scx_rq_bypassing() before moving tasks to SCX. 4265 + * sees scx_bypassing() before moving tasks to SCX. 
5337 4266 */ 5338 4267 if (!scx_enabled()) { 5339 4268 raw_spin_rq_unlock(rq); ··· 5353 4274 */ 5354 4275 list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list, 5355 4276 scx.runnable_node) { 4277 + if (!scx_is_descendant(scx_task_sched(p), sch)) 4278 + continue; 4279 + 5356 4280 /* cycling deq/enq is enough, see the function comment */ 5357 4281 scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 5358 4282 /* nothing */ ; ··· 5369 4287 raw_spin_rq_unlock(rq); 5370 4288 } 5371 4289 4290 + /* disarming must come after moving all tasks out of the bypass DSQs */ 4291 + if (!bypass) 4292 + disable_bypass_dsp(sch); 5372 4293 unlock: 5373 - raw_spin_unlock_irqrestore(&bypass_lock, flags); 4294 + raw_spin_unlock_irqrestore(&scx_bypass_lock, flags); 5374 4295 } 5375 4296 5376 4297 static void free_exit_info(struct scx_exit_info *ei) ··· 5415 4330 return "unregistered from the main kernel"; 5416 4331 case SCX_EXIT_SYSRQ: 5417 4332 return "disabled by sysrq-S"; 4333 + case SCX_EXIT_PARENT: 4334 + return "parent exiting"; 5418 4335 case SCX_EXIT_ERROR: 5419 4336 return "runtime error"; 5420 4337 case SCX_EXIT_ERROR_BPF: ··· 5442 4355 } 5443 4356 } 5444 4357 5445 - static void scx_disable_workfn(struct kthread_work *work) 4358 + static void refresh_watchdog(void) 5446 4359 { 5447 - struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); 4360 + struct scx_sched *sch; 4361 + unsigned long intv = ULONG_MAX; 4362 + 4363 + /* take the shortest timeout and use its half for watchdog interval */ 4364 + rcu_read_lock(); 4365 + list_for_each_entry_rcu(sch, &scx_sched_all, all) 4366 + intv = max(min(intv, sch->watchdog_timeout / 2), 1); 4367 + rcu_read_unlock(); 4368 + 4369 + WRITE_ONCE(scx_watchdog_timestamp, jiffies); 4370 + WRITE_ONCE(scx_watchdog_interval, intv); 4371 + 4372 + if (intv < ULONG_MAX) 4373 + mod_delayed_work(system_dfl_wq, &scx_watchdog_work, intv); 4374 + else 4375 + cancel_delayed_work_sync(&scx_watchdog_work); 4376 + } 4377 + 4378 + static s32 scx_link_sched(struct scx_sched *sch) 4379 + { 4380 + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { 4381 + #ifdef CONFIG_EXT_SUB_SCHED 4382 + struct scx_sched *parent = scx_parent(sch); 4383 + s32 ret; 4384 + 4385 + if (parent) { 4386 + /* 4387 + * scx_claim_exit() propagates exit_kind transition to 4388 + * its sub-scheds while holding scx_sched_lock - either 4389 + * we can see the parent's non-NONE exit_kind or the 4390 + * parent can shoot us down. 
4391 + */ 4392 + if (atomic_read(&parent->exit_kind) != SCX_EXIT_NONE) { 4393 + scx_error(sch, "parent disabled"); 4394 + return -ENOENT; 4395 + } 4396 + 4397 + ret = rhashtable_lookup_insert_fast(&scx_sched_hash, 4398 + &sch->hash_node, scx_sched_hash_params); 4399 + if (ret) { 4400 + scx_error(sch, "failed to insert into scx_sched_hash (%d)", ret); 4401 + return ret; 4402 + } 4403 + 4404 + list_add_tail(&sch->sibling, &parent->children); 4405 + } 4406 + #endif /* CONFIG_EXT_SUB_SCHED */ 4407 + 4408 + list_add_tail_rcu(&sch->all, &scx_sched_all); 4409 + } 4410 + 4411 + refresh_watchdog(); 4412 + return 0; 4413 + } 4414 + 4415 + static void scx_unlink_sched(struct scx_sched *sch) 4416 + { 4417 + scoped_guard(raw_spinlock_irq, &scx_sched_lock) { 4418 + #ifdef CONFIG_EXT_SUB_SCHED 4419 + if (scx_parent(sch)) { 4420 + rhashtable_remove_fast(&scx_sched_hash, &sch->hash_node, 4421 + scx_sched_hash_params); 4422 + list_del_init(&sch->sibling); 4423 + } 4424 + #endif /* CONFIG_EXT_SUB_SCHED */ 4425 + list_del_rcu(&sch->all); 4426 + } 4427 + 4428 + refresh_watchdog(); 4429 + } 4430 + 4431 + /* 4432 + * Called to disable future dumps and wait for in-progress one while disabling 4433 + * @sch. Once @sch becomes empty during disable, there's no point in dumping it. 4434 + * This prevents calling dump ops on a dead sch. 4435 + */ 4436 + static void scx_disable_dump(struct scx_sched *sch) 4437 + { 4438 + guard(raw_spinlock_irqsave)(&scx_dump_lock); 4439 + sch->dump_disabled = true; 4440 + } 4441 + 4442 + #ifdef CONFIG_EXT_SUB_SCHED 4443 + static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq); 4444 + 4445 + static void drain_descendants(struct scx_sched *sch) 4446 + { 4447 + /* 4448 + * Child scheds that finished the critical part of disabling will take 4449 + * themselves off @sch->children. Wait for it to drain. As propagation 4450 + * is recursive, empty @sch->children means that all proper descendant 4451 + * scheds reached unlinking stage. 4452 + */ 4453 + wait_event(scx_unlink_waitq, list_empty(&sch->children)); 4454 + } 4455 + 4456 + static void scx_fail_parent(struct scx_sched *sch, 4457 + struct task_struct *failed, s32 fail_code) 4458 + { 4459 + struct scx_sched *parent = scx_parent(sch); 4460 + struct scx_task_iter sti; 4461 + struct task_struct *p; 4462 + 4463 + scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler", 4464 + fail_code, failed->comm, failed->pid); 4465 + 4466 + /* 4467 + * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into 4468 + * it. This may cause downstream failures on the BPF side but $parent is 4469 + * dying anyway. 4470 + */ 4471 + scx_bypass(parent, true); 4472 + 4473 + scx_task_iter_start(&sti, sch->cgrp); 4474 + while ((p = scx_task_iter_next_locked(&sti))) { 4475 + if (scx_task_on_sched(parent, p)) 4476 + continue; 4477 + 4478 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 4479 + scx_disable_and_exit_task(sch, p); 4480 + rcu_assign_pointer(p->scx.sched, parent); 4481 + } 4482 + } 4483 + scx_task_iter_stop(&sti); 4484 + } 4485 + 4486 + static void scx_sub_disable(struct scx_sched *sch) 4487 + { 4488 + struct scx_sched *parent = scx_parent(sch); 4489 + struct scx_task_iter sti; 4490 + struct task_struct *p; 4491 + int ret; 4492 + 4493 + /* 4494 + * Guarantee forward progress and wait for descendants to be disabled. 4495 + * To limit disruptions, $parent is not bypassed. Tasks are fully 4496 + * prepped and then inserted back into $parent. 
4497 + */ 4498 + scx_bypass(sch, true); 4499 + drain_descendants(sch); 4500 + 4501 + /* 4502 + * Here, every runnable task is guaranteed to make forward progress and 4503 + * we can safely use blocking synchronization constructs. Actually 4504 + * disable ops. 4505 + */ 4506 + mutex_lock(&scx_enable_mutex); 4507 + percpu_down_write(&scx_fork_rwsem); 4508 + scx_cgroup_lock(); 4509 + 4510 + set_cgroup_sched(sch_cgroup(sch), parent); 4511 + 4512 + scx_task_iter_start(&sti, sch->cgrp); 4513 + while ((p = scx_task_iter_next_locked(&sti))) { 4514 + struct rq *rq; 4515 + struct rq_flags rf; 4516 + 4517 + /* filter out duplicate visits */ 4518 + if (scx_task_on_sched(parent, p)) 4519 + continue; 4520 + 4521 + /* 4522 + * By the time control reaches here, all descendant schedulers 4523 + * should already have been disabled. 4524 + */ 4525 + WARN_ON_ONCE(!scx_task_on_sched(sch, p)); 4526 + 4527 + /* 4528 + * If $p is about to be freed, nothing prevents $sch from 4529 + * unloading before $p reaches sched_ext_free(). Disable and 4530 + * exit $p right away. 4531 + */ 4532 + if (!tryget_task_struct(p)) { 4533 + scx_disable_and_exit_task(sch, p); 4534 + continue; 4535 + } 4536 + 4537 + scx_task_iter_unlock(&sti); 4538 + 4539 + /* 4540 + * $p is READY or ENABLED on @sch. Initialize for $parent, 4541 + * disable and exit from @sch, and then switch over to $parent. 4542 + * 4543 + * If a task fails to initialize for $parent, the only available 4544 + * action is disabling $parent too. While this allows disabling 4545 + * of a child sched to cause the parent scheduler to fail, the 4546 + * failure can only originate from ops.init_task() of the 4547 + * parent. A child can't directly affect the parent through its 4548 + * own failures. 4549 + */ 4550 + ret = __scx_init_task(parent, p, false); 4551 + if (ret) { 4552 + scx_fail_parent(sch, p, ret); 4553 + put_task_struct(p); 4554 + break; 4555 + } 4556 + 4557 + rq = task_rq_lock(p, &rf); 4558 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 4559 + /* 4560 + * $p is initialized for $parent and still attached to 4561 + * @sch. Disable and exit for @sch, switch over to 4562 + * $parent, override the state to READY to account for 4563 + * $p having already been initialized, and then enable. 4564 + */ 4565 + scx_disable_and_exit_task(sch, p); 4566 + scx_set_task_state(p, SCX_TASK_INIT); 4567 + rcu_assign_pointer(p->scx.sched, parent); 4568 + scx_set_task_state(p, SCX_TASK_READY); 4569 + scx_enable_task(parent, p); 4570 + } 4571 + task_rq_unlock(rq, p, &rf); 4572 + 4573 + put_task_struct(p); 4574 + } 4575 + scx_task_iter_stop(&sti); 4576 + 4577 + scx_disable_dump(sch); 4578 + 4579 + scx_cgroup_unlock(); 4580 + percpu_up_write(&scx_fork_rwsem); 4581 + 4582 + /* 4583 + * All tasks are moved off of @sch but there may still be on-going 4584 + * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use 4585 + * the expedited version as ancestors may be waiting in bypass mode. 4586 + * Also, tell the parent that there is no need to keep running bypass 4587 + * DSQs for us. 4588 + */ 4589 + synchronize_rcu_expedited(); 4590 + disable_bypass_dsp(sch); 4591 + 4592 + scx_unlink_sched(sch); 4593 + 4594 + mutex_unlock(&scx_enable_mutex); 4595 + 4596 + /* 4597 + * @sch is now unlinked from the parent's children list. Notify and call 4598 + * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called 4599 + * after unlinking and releasing all locks. See scx_claim_exit(). 
4600 + */ 4601 + wake_up_all(&scx_unlink_waitq); 4602 + 4603 + if (parent->ops.sub_detach && sch->sub_attached) { 4604 + struct scx_sub_detach_args sub_detach_args = { 4605 + .ops = &sch->ops, 4606 + .cgroup_path = sch->cgrp_path, 4607 + }; 4608 + SCX_CALL_OP(parent, sub_detach, NULL, 4609 + &sub_detach_args); 4610 + } 4611 + 4612 + if (sch->ops.exit) 4613 + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); 4614 + kobject_del(&sch->kobj); 4615 + } 4616 + #else /* CONFIG_EXT_SUB_SCHED */ 4617 + static void drain_descendants(struct scx_sched *sch) { } 4618 + static void scx_sub_disable(struct scx_sched *sch) { } 4619 + #endif /* CONFIG_EXT_SUB_SCHED */ 4620 + 4621 + static void scx_root_disable(struct scx_sched *sch) 4622 + { 5448 4623 struct scx_exit_info *ei = sch->exit_info; 5449 4624 struct scx_task_iter sti; 5450 4625 struct task_struct *p; 5451 - int kind, cpu; 4626 + int cpu; 5452 4627 5453 - kind = atomic_read(&sch->exit_kind); 5454 - while (true) { 5455 - if (kind == SCX_EXIT_DONE) /* already disabled? */ 5456 - return; 5457 - WARN_ON_ONCE(kind == SCX_EXIT_NONE); 5458 - if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) 5459 - break; 5460 - } 5461 - ei->kind = kind; 5462 - ei->reason = scx_exit_reason(ei->kind); 5463 - 5464 - /* guarantee forward progress by bypassing scx_ops */ 5465 - scx_bypass(true); 5466 - WRITE_ONCE(scx_aborting, false); 4628 + /* guarantee forward progress and wait for descendants to be disabled */ 4629 + scx_bypass(sch, true); 4630 + drain_descendants(sch); 5467 4631 5468 4632 switch (scx_set_enable_state(SCX_DISABLING)) { 5469 4633 case SCX_DISABLING: ··· 5741 4403 5742 4404 /* 5743 4405 * Shut down cgroup support before tasks so that the cgroup attach path 5744 - * doesn't race against scx_exit_task(). 4406 + * doesn't race against scx_disable_and_exit_task(). 5745 4407 */ 5746 4408 scx_cgroup_lock(); 5747 4409 scx_cgroup_exit(sch); ··· 5755 4417 5756 4418 scx_init_task_enabled = false; 5757 4419 5758 - scx_task_iter_start(&sti); 4420 + scx_task_iter_start(&sti, NULL); 5759 4421 while ((p = scx_task_iter_next_locked(&sti))) { 5760 4422 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; 5761 4423 const struct sched_class *old_class = p->sched_class; ··· 5770 4432 p->sched_class = new_class; 5771 4433 } 5772 4434 5773 - scx_exit_task(p); 4435 + scx_disable_and_exit_task(scx_task_sched(p), p); 5774 4436 } 5775 4437 scx_task_iter_stop(&sti); 4438 + 4439 + scx_disable_dump(sch); 4440 + 4441 + scx_cgroup_lock(); 4442 + set_cgroup_sched(sch_cgroup(sch), NULL); 4443 + scx_cgroup_unlock(); 4444 + 5776 4445 percpu_up_write(&scx_fork_rwsem); 5777 4446 5778 4447 /* ··· 5812 4467 } 5813 4468 5814 4469 if (sch->ops.exit) 5815 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, ei); 4470 + SCX_CALL_OP(sch, exit, NULL, ei); 5816 4471 5817 - cancel_delayed_work_sync(&scx_watchdog_work); 4472 + scx_unlink_sched(sch); 5818 4473 5819 4474 /* 5820 4475 * scx_root clearing must be inside cpus_read_lock(). 
See ··· 5831 4486 */ 5832 4487 kobject_del(&sch->kobj); 5833 4488 5834 - free_percpu(scx_dsp_ctx); 5835 - scx_dsp_ctx = NULL; 5836 - scx_dsp_max_batch = 0; 5837 4489 free_kick_syncs(); 5838 - 5839 - if (scx_bypassed_for_enable) { 5840 - scx_bypassed_for_enable = false; 5841 - scx_bypass(false); 5842 - } 5843 4490 5844 4491 mutex_unlock(&scx_enable_mutex); 5845 4492 5846 4493 WARN_ON_ONCE(scx_set_enable_state(SCX_DISABLED) != SCX_DISABLING); 5847 4494 done: 5848 - scx_bypass(false); 4495 + scx_bypass(sch, false); 5849 4496 } 5850 4497 5851 4498 /* ··· 5853 4516 5854 4517 lockdep_assert_preemption_disabled(); 5855 4518 4519 + if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) 4520 + kind = SCX_EXIT_ERROR; 4521 + 5856 4522 if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind)) 5857 4523 return false; 5858 4524 ··· 5864 4524 * flag to break potential live-lock scenarios, ensuring we can 5865 4525 * successfully reach scx_bypass(). 5866 4526 */ 5867 - WRITE_ONCE(scx_aborting, true); 4527 + WRITE_ONCE(sch->aborting, true); 4528 + 4529 + /* 4530 + * Propagate exits to descendants immediately. Each has a dedicated 4531 + * helper kthread and can run in parallel. While most of disabling is 4532 + * serialized, running them in separate threads allows parallelizing 4533 + * ops.exit(), which can take arbitrarily long prolonging bypass mode. 4534 + * 4535 + * To guarantee forward progress, this propagation must be in-line so 4536 + * that ->aborting is synchronously asserted for all sub-scheds. The 4537 + * propagation is also the interlocking point against sub-sched 4538 + * attachment. See scx_link_sched(). 4539 + * 4540 + * This doesn't cause recursions as propagation only takes place for 4541 + * non-propagation exits. 4542 + */ 4543 + if (kind != SCX_EXIT_PARENT) { 4544 + scoped_guard (raw_spinlock_irqsave, &scx_sched_lock) { 4545 + struct scx_sched *pos; 4546 + scx_for_each_descendant_pre(pos, sch) 4547 + scx_disable(pos, SCX_EXIT_PARENT); 4548 + } 4549 + } 4550 + 5868 4551 return true; 5869 4552 } 5870 4553 5871 - static void scx_disable(enum scx_exit_kind kind) 4554 + static void scx_disable_workfn(struct kthread_work *work) 5872 4555 { 5873 - struct scx_sched *sch; 4556 + struct scx_sched *sch = container_of(work, struct scx_sched, disable_work); 4557 + struct scx_exit_info *ei = sch->exit_info; 4558 + int kind; 5874 4559 5875 - if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)) 5876 - kind = SCX_EXIT_ERROR; 5877 - 5878 - rcu_read_lock(); 5879 - sch = rcu_dereference(scx_root); 5880 - if (sch) { 5881 - guard(preempt)(); 5882 - scx_claim_exit(sch, kind); 5883 - kthread_queue_work(sch->helper, &sch->disable_work); 4560 + kind = atomic_read(&sch->exit_kind); 4561 + while (true) { 4562 + if (kind == SCX_EXIT_DONE) /* already disabled? 
*/ 4563 + return; 4564 + WARN_ON_ONCE(kind == SCX_EXIT_NONE); 4565 + if (atomic_try_cmpxchg(&sch->exit_kind, &kind, SCX_EXIT_DONE)) 4566 + break; 5884 4567 } 5885 - rcu_read_unlock(); 4568 + ei->kind = kind; 4569 + ei->reason = scx_exit_reason(ei->kind); 4570 + 4571 + if (scx_parent(sch)) 4572 + scx_sub_disable(sch); 4573 + else 4574 + scx_root_disable(sch); 4575 + } 4576 + 4577 + static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind) 4578 + { 4579 + guard(preempt)(); 4580 + if (scx_claim_exit(sch, kind)) 4581 + irq_work_queue(&sch->disable_irq_work); 5886 4582 } 5887 4583 5888 4584 static void dump_newline(struct seq_buf *s) ··· 5936 4560 5937 4561 #ifdef CONFIG_TRACEPOINTS 5938 4562 if (trace_sched_ext_dump_enabled()) { 5939 - /* protected by scx_dump_state()::dump_lock */ 4563 + /* protected by scx_dump_lock */ 5940 4564 static char line_buf[SCX_EXIT_MSG_LEN]; 5941 4565 5942 4566 va_start(args, fmt); ··· 6032 4656 scx_dump_data.cpu = -1; 6033 4657 } 6034 4658 6035 - static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, 4659 + static void scx_dump_task(struct scx_sched *sch, 4660 + struct seq_buf *s, struct scx_dump_ctx *dctx, 6036 4661 struct task_struct *p, char marker) 6037 4662 { 6038 4663 static unsigned long bt[SCX_EXIT_BT_LEN]; 6039 - struct scx_sched *sch = scx_root; 4664 + struct scx_sched *task_sch = scx_task_sched(p); 4665 + const char *own_marker; 4666 + char sch_id_buf[32]; 6040 4667 char dsq_id_buf[19] = "(n/a)"; 6041 4668 unsigned long ops_state = atomic_long_read(&p->scx.ops_state); 6042 4669 unsigned int bt_len = 0; 4670 + 4671 + own_marker = task_sch == sch ? "*" : ""; 4672 + 4673 + if (task_sch->level == 0) 4674 + scnprintf(sch_id_buf, sizeof(sch_id_buf), "root"); 4675 + else 4676 + scnprintf(sch_id_buf, sizeof(sch_id_buf), "sub%d-%llu", 4677 + task_sch->level, task_sch->ops.sub_cgroup_id); 6043 4678 6044 4679 if (p->scx.dsq) 6045 4680 scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx", 6046 4681 (unsigned long long)p->scx.dsq->id); 6047 4682 6048 4683 dump_newline(s); 6049 - dump_line(s, " %c%c %s[%d] %+ldms", 4684 + dump_line(s, " %c%c %s[%d] %s%s %+ldms", 6050 4685 marker, task_state_to_char(p), p->comm, p->pid, 4686 + own_marker, sch_id_buf, 6051 4687 jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies)); 6052 4688 dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu", 6053 - scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK, 4689 + scx_get_task_state(p) >> SCX_TASK_STATE_SHIFT, 4690 + p->scx.flags & ~SCX_TASK_STATE_MASK, 6054 4691 p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK, 6055 4692 ops_state >> SCX_OPSS_QSEQ_SHIFT); 6056 4693 dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s", ··· 6075 4686 6076 4687 if (SCX_HAS_OP(sch, dump_task)) { 6077 4688 ops_dump_init(s, " "); 6078 - SCX_CALL_OP(sch, SCX_KF_REST, dump_task, NULL, dctx, p); 4689 + SCX_CALL_OP(sch, dump_task, NULL, dctx, p); 6079 4690 ops_dump_exit(); 6080 4691 } 6081 4692 ··· 6088 4699 } 6089 4700 } 6090 4701 6091 - static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) 4702 + /* 4703 + * Dump scheduler state. If @dump_all_tasks is true, dump all tasks regardless 4704 + * of which scheduler they belong to. If false, only dump tasks owned by @sch. 4705 + * For SysRq-D dumps, @dump_all_tasks=false since all schedulers are dumped 4706 + * separately. For error dumps, @dump_all_tasks=true since only the failing 4707 + * scheduler is dumped. 
4708 + */ 4709 + static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, 4710 + size_t dump_len, bool dump_all_tasks) 6092 4711 { 6093 - static DEFINE_SPINLOCK(dump_lock); 6094 4712 static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n"; 6095 - struct scx_sched *sch = scx_root; 6096 4713 struct scx_dump_ctx dctx = { 6097 4714 .kind = ei->kind, 6098 4715 .exit_code = ei->exit_code, ··· 6108 4713 }; 6109 4714 struct seq_buf s; 6110 4715 struct scx_event_stats events; 6111 - unsigned long flags; 6112 4716 char *buf; 6113 4717 int cpu; 6114 4718 6115 - spin_lock_irqsave(&dump_lock, flags); 4719 + guard(raw_spinlock_irqsave)(&scx_dump_lock); 4720 + 4721 + if (sch->dump_disabled) 4722 + return; 6116 4723 6117 4724 seq_buf_init(&s, ei->dump, dump_len); 6118 4725 4726 + #ifdef CONFIG_EXT_SUB_SCHED 4727 + if (sch->level == 0) 4728 + dump_line(&s, "%s: root", sch->ops.name); 4729 + else 4730 + dump_line(&s, "%s: sub%d-%llu %s", 4731 + sch->ops.name, sch->level, sch->ops.sub_cgroup_id, 4732 + sch->cgrp_path); 4733 + #endif 6119 4734 if (ei->kind == SCX_EXIT_NONE) { 6120 4735 dump_line(&s, "Debug dump triggered by %s", ei->reason); 6121 4736 } else { ··· 6139 4734 6140 4735 if (SCX_HAS_OP(sch, dump)) { 6141 4736 ops_dump_init(&s, ""); 6142 - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, dump, NULL, &dctx); 4737 + SCX_CALL_OP(sch, dump, NULL, &dctx); 6143 4738 ops_dump_exit(); 6144 4739 } 6145 4740 ··· 6199 4794 used = seq_buf_used(&ns); 6200 4795 if (SCX_HAS_OP(sch, dump_cpu)) { 6201 4796 ops_dump_init(&ns, " "); 6202 - SCX_CALL_OP(sch, SCX_KF_REST, dump_cpu, NULL, 4797 + SCX_CALL_OP(sch, dump_cpu, NULL, 6203 4798 &dctx, cpu, idle); 6204 4799 ops_dump_exit(); 6205 4800 } ··· 6221 4816 seq_buf_set_overflow(&s); 6222 4817 } 6223 4818 6224 - if (rq->curr->sched_class == &ext_sched_class) 6225 - scx_dump_task(&s, &dctx, rq->curr, '*'); 4819 + if (rq->curr->sched_class == &ext_sched_class && 4820 + (dump_all_tasks || scx_task_on_sched(sch, rq->curr))) 4821 + scx_dump_task(sch, &s, &dctx, rq->curr, '*'); 6226 4822 6227 4823 list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) 6228 - scx_dump_task(&s, &dctx, p, ' '); 4824 + if (dump_all_tasks || scx_task_on_sched(sch, p)) 4825 + scx_dump_task(sch, &s, &dctx, p, ' '); 6229 4826 next: 6230 4827 rq_unlock_irqrestore(rq, &rf); 6231 4828 } ··· 6242 4835 scx_dump_event(s, &events, SCX_EV_DISPATCH_KEEP_LAST); 6243 4836 scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_EXITING); 6244 4837 scx_dump_event(s, &events, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 4838 + scx_dump_event(s, &events, SCX_EV_REENQ_IMMED); 4839 + scx_dump_event(s, &events, SCX_EV_REENQ_LOCAL_REPEAT); 6245 4840 scx_dump_event(s, &events, SCX_EV_REFILL_SLICE_DFL); 6246 4841 scx_dump_event(s, &events, SCX_EV_BYPASS_DURATION); 6247 4842 scx_dump_event(s, &events, SCX_EV_BYPASS_DISPATCH); 6248 4843 scx_dump_event(s, &events, SCX_EV_BYPASS_ACTIVATE); 4844 + scx_dump_event(s, &events, SCX_EV_INSERT_NOT_OWNED); 4845 + scx_dump_event(s, &events, SCX_EV_SUB_BYPASS_DISPATCH); 6249 4846 6250 4847 if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker)) 6251 4848 memcpy(ei->dump + dump_len - sizeof(trunc_marker), 6252 4849 trunc_marker, sizeof(trunc_marker)); 6253 - 6254 - spin_unlock_irqrestore(&dump_lock, flags); 6255 4850 } 6256 4851 6257 - static void scx_error_irq_workfn(struct irq_work *irq_work) 4852 + static void scx_disable_irq_workfn(struct irq_work *irq_work) 6258 4853 { 6259 - struct scx_sched *sch = container_of(irq_work, struct scx_sched, error_irq_work); 4854 + 
struct scx_sched *sch = container_of(irq_work, struct scx_sched, disable_irq_work); 6260 4855 struct scx_exit_info *ei = sch->exit_info; 6261 4856 6262 4857 if (ei->kind >= SCX_EXIT_ERROR) 6263 - scx_dump_state(ei, sch->ops.exit_dump_len); 4858 + scx_dump_state(sch, ei, sch->ops.exit_dump_len, true); 6264 4859 6265 4860 kthread_queue_work(sch->helper, &sch->disable_work); 6266 4861 } ··· 6292 4883 ei->kind = kind; 6293 4884 ei->reason = scx_exit_reason(ei->kind); 6294 4885 6295 - irq_work_queue(&sch->error_irq_work); 4886 + irq_work_queue(&sch->disable_irq_work); 6296 4887 return true; 6297 4888 } 6298 4889 ··· 6323 4914 return 0; 6324 4915 } 6325 4916 6326 - static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops) 4917 + static void free_pnode(struct scx_sched_pnode *pnode) 4918 + { 4919 + if (!pnode) 4920 + return; 4921 + exit_dsq(&pnode->global_dsq); 4922 + kfree(pnode); 4923 + } 4924 + 4925 + static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node) 4926 + { 4927 + struct scx_sched_pnode *pnode; 4928 + 4929 + pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node); 4930 + if (!pnode) 4931 + return NULL; 4932 + 4933 + if (init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch)) { 4934 + kfree(pnode); 4935 + return NULL; 4936 + } 4937 + 4938 + return pnode; 4939 + } 4940 + 4941 + /* 4942 + * Allocate and initialize a new scx_sched. @cgrp's reference is always 4943 + * consumed whether the function succeeds or fails. 4944 + */ 4945 + static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops, 4946 + struct cgroup *cgrp, 4947 + struct scx_sched *parent) 6327 4948 { 6328 4949 struct scx_sched *sch; 6329 - int node, ret; 4950 + s32 level = parent ? parent->level + 1 : 0; 4951 + s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids; 6330 4952 6331 - sch = kzalloc_obj(*sch); 6332 - if (!sch) 6333 - return ERR_PTR(-ENOMEM); 4953 + sch = kzalloc_flex(*sch, ancestors, level + 1); 4954 + if (!sch) { 4955 + ret = -ENOMEM; 4956 + goto err_put_cgrp; 4957 + } 6334 4958 6335 4959 sch->exit_info = alloc_exit_info(ops->exit_dump_len); 6336 4960 if (!sch->exit_info) { ··· 6375 4933 if (ret < 0) 6376 4934 goto err_free_ei; 6377 4935 6378 - sch->global_dsqs = kzalloc_objs(sch->global_dsqs[0], nr_node_ids); 6379 - if (!sch->global_dsqs) { 4936 + sch->pnode = kzalloc_objs(sch->pnode[0], nr_node_ids); 4937 + if (!sch->pnode) { 6380 4938 ret = -ENOMEM; 6381 4939 goto err_free_hash; 6382 4940 } 6383 4941 6384 4942 for_each_node_state(node, N_POSSIBLE) { 6385 - struct scx_dispatch_q *dsq; 6386 - 6387 - dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node); 6388 - if (!dsq) { 4943 + sch->pnode[node] = alloc_pnode(sch, node); 4944 + if (!sch->pnode[node]) { 6389 4945 ret = -ENOMEM; 6390 - goto err_free_gdsqs; 4946 + goto err_free_pnode; 6391 4947 } 6392 - 6393 - init_dsq(dsq, SCX_DSQ_GLOBAL); 6394 - sch->global_dsqs[node] = dsq; 6395 4948 } 6396 4949 6397 - sch->pcpu = alloc_percpu(struct scx_sched_pcpu); 4950 + sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; 4951 + sch->pcpu = __alloc_percpu(struct_size_t(struct scx_sched_pcpu, 4952 + dsp_ctx.buf, sch->dsp_max_batch), 4953 + __alignof__(struct scx_sched_pcpu)); 6398 4954 if (!sch->pcpu) { 6399 4955 ret = -ENOMEM; 6400 - goto err_free_gdsqs; 4956 + goto err_free_pnode; 4957 + } 4958 + 4959 + for_each_possible_cpu(cpu) { 4960 + ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch); 4961 + if (ret) { 4962 + bypass_fail_cpu = cpu; 4963 + goto err_free_pcpu; 4964 + } 4965 + } 4966 + 4967 + 
for_each_possible_cpu(cpu) { 4968 + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); 4969 + 4970 + pcpu->sch = sch; 4971 + INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node); 6401 4972 } 6402 4973 6403 4974 sch->helper = kthread_run_worker(0, "sched_ext_helper"); ··· 6421 4966 6422 4967 sched_set_fifo(sch->helper->task); 6423 4968 4969 + if (parent) 4970 + memcpy(sch->ancestors, parent->ancestors, 4971 + level * sizeof(parent->ancestors[0])); 4972 + sch->ancestors[level] = sch; 4973 + sch->level = level; 4974 + 4975 + if (ops->timeout_ms) 4976 + sch->watchdog_timeout = msecs_to_jiffies(ops->timeout_ms); 4977 + else 4978 + sch->watchdog_timeout = SCX_WATCHDOG_MAX_TIMEOUT; 4979 + 4980 + sch->slice_dfl = SCX_SLICE_DFL; 6424 4981 atomic_set(&sch->exit_kind, SCX_EXIT_NONE); 6425 - init_irq_work(&sch->error_irq_work, scx_error_irq_workfn); 4982 + init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn); 6426 4983 kthread_init_work(&sch->disable_work, scx_disable_workfn); 4984 + timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0); 6427 4985 sch->ops = *ops; 6428 - ops->priv = sch; 4986 + rcu_assign_pointer(ops->priv, sch); 6429 4987 6430 4988 sch->kobj.kset = scx_kset; 6431 - ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 6432 - if (ret < 0) 6433 - goto err_stop_helper; 6434 4989 4990 + #ifdef CONFIG_EXT_SUB_SCHED 4991 + char *buf = kzalloc(PATH_MAX, GFP_KERNEL); 4992 + if (!buf) { 4993 + ret = -ENOMEM; 4994 + goto err_stop_helper; 4995 + } 4996 + cgroup_path(cgrp, buf, PATH_MAX); 4997 + sch->cgrp_path = kstrdup(buf, GFP_KERNEL); 4998 + kfree(buf); 4999 + if (!sch->cgrp_path) { 5000 + ret = -ENOMEM; 5001 + goto err_stop_helper; 5002 + } 5003 + 5004 + sch->cgrp = cgrp; 5005 + INIT_LIST_HEAD(&sch->children); 5006 + INIT_LIST_HEAD(&sch->sibling); 5007 + 5008 + if (parent) 5009 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, 5010 + &parent->sub_kset->kobj, 5011 + "sub-%llu", cgroup_id(cgrp)); 5012 + else 5013 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 5014 + 5015 + if (ret < 0) { 5016 + kobject_put(&sch->kobj); 5017 + return ERR_PTR(ret); 5018 + } 5019 + 5020 + if (ops->sub_attach) { 5021 + sch->sub_kset = kset_create_and_add("sub", NULL, &sch->kobj); 5022 + if (!sch->sub_kset) { 5023 + kobject_put(&sch->kobj); 5024 + return ERR_PTR(-ENOMEM); 5025 + } 5026 + } 5027 + #else /* CONFIG_EXT_SUB_SCHED */ 5028 + ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); 5029 + if (ret < 0) { 5030 + kobject_put(&sch->kobj); 5031 + return ERR_PTR(ret); 5032 + } 5033 + #endif /* CONFIG_EXT_SUB_SCHED */ 6435 5034 return sch; 6436 5035 5036 + #ifdef CONFIG_EXT_SUB_SCHED 6437 5037 err_stop_helper: 6438 5038 kthread_destroy_worker(sch->helper); 5039 + #endif 6439 5040 err_free_pcpu: 5041 + for_each_possible_cpu(cpu) { 5042 + if (cpu == bypass_fail_cpu) 5043 + break; 5044 + exit_dsq(bypass_dsq(sch, cpu)); 5045 + } 6440 5046 free_percpu(sch->pcpu); 6441 - err_free_gdsqs: 5047 + err_free_pnode: 6442 5048 for_each_node_state(node, N_POSSIBLE) 6443 - kfree(sch->global_dsqs[node]); 6444 - kfree(sch->global_dsqs); 5049 + free_pnode(sch->pnode[node]); 5050 + kfree(sch->pnode); 6445 5051 err_free_hash: 6446 5052 rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL); 6447 5053 err_free_ei: 6448 5054 free_exit_info(sch->exit_info); 6449 5055 err_free_sch: 6450 5056 kfree(sch); 5057 + err_put_cgrp: 5058 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 5059 + cgroup_put(cgrp); 5060 + #endif 6451 5061 return ERR_PTR(ret); 6452 
5062 } 6453 5063 ··· 6561 5041 return -EINVAL; 6562 5042 } 6563 5043 6564 - if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT) 6565 - pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n"); 6566 - 6567 5044 if (ops->cpu_acquire || ops->cpu_release) 6568 5045 pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n"); 6569 5046 ··· 6580 5063 int ret; 6581 5064 }; 6582 5065 6583 - static void scx_enable_workfn(struct kthread_work *work) 5066 + static void scx_root_enable_workfn(struct kthread_work *work) 6584 5067 { 6585 - struct scx_enable_cmd *cmd = 6586 - container_of(work, struct scx_enable_cmd, work); 5068 + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); 6587 5069 struct sched_ext_ops *ops = cmd->ops; 5070 + struct cgroup *cgrp = root_cgroup(); 6588 5071 struct scx_sched *sch; 6589 5072 struct scx_task_iter sti; 6590 5073 struct task_struct *p; 6591 - unsigned long timeout; 6592 5074 int i, cpu, ret; 6593 5075 6594 5076 mutex_lock(&scx_enable_mutex); ··· 6601 5085 if (ret) 6602 5086 goto err_unlock; 6603 5087 6604 - sch = scx_alloc_and_add_sched(ops); 5088 + #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED) 5089 + cgroup_get(cgrp); 5090 + #endif 5091 + sch = scx_alloc_and_add_sched(ops, cgrp, NULL); 6605 5092 if (IS_ERR(sch)) { 6606 5093 ret = PTR_ERR(sch); 6607 5094 goto err_free_ksyncs; ··· 6616 5097 */ 6617 5098 WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED); 6618 5099 WARN_ON_ONCE(scx_root); 6619 - if (WARN_ON_ONCE(READ_ONCE(scx_aborting))) 6620 - WRITE_ONCE(scx_aborting, false); 6621 5100 6622 5101 atomic_long_set(&scx_nr_rejected, 0); 6623 5102 6624 - for_each_possible_cpu(cpu) 6625 - cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; 5103 + for_each_possible_cpu(cpu) { 5104 + struct rq *rq = cpu_rq(cpu); 5105 + 5106 + rq->scx.local_dsq.sched = sch; 5107 + rq->scx.cpuperf_target = SCX_CPUPERF_ONE; 5108 + } 6626 5109 6627 5110 /* 6628 5111 * Keep CPUs stable during enable so that the BPF scheduler can track ··· 6638 5117 */ 6639 5118 rcu_assign_pointer(scx_root, sch); 6640 5119 5120 + ret = scx_link_sched(sch); 5121 + if (ret) 5122 + goto err_disable; 5123 + 6641 5124 scx_idle_enable(ops); 6642 5125 6643 5126 if (sch->ops.init) { 6644 - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); 5127 + ret = SCX_CALL_OP_RET(sch, init, NULL); 6645 5128 if (ret) { 6646 5129 ret = ops_sanitize_err(sch, "init", ret); 6647 5130 cpus_read_unlock(); ··· 6672 5147 if (ret) 6673 5148 goto err_disable; 6674 5149 6675 - WARN_ON_ONCE(scx_dsp_ctx); 6676 - scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH; 6677 - scx_dsp_ctx = __alloc_percpu(struct_size_t(struct scx_dsp_ctx, buf, 6678 - scx_dsp_max_batch), 6679 - __alignof__(struct scx_dsp_ctx)); 6680 - if (!scx_dsp_ctx) { 6681 - ret = -ENOMEM; 6682 - goto err_disable; 6683 - } 6684 - 6685 - if (ops->timeout_ms) 6686 - timeout = msecs_to_jiffies(ops->timeout_ms); 6687 - else 6688 - timeout = SCX_WATCHDOG_MAX_TIMEOUT; 6689 - 6690 - WRITE_ONCE(scx_watchdog_timeout, timeout); 6691 - WRITE_ONCE(scx_watchdog_timestamp, jiffies); 6692 - queue_delayed_work(system_dfl_wq, &scx_watchdog_work, 6693 - READ_ONCE(scx_watchdog_timeout) / 2); 6694 - 6695 5150 /* 6696 5151 * Once __scx_enabled is set, %current can be switched to SCX anytime. 6697 5152 * This can lead to stalls as some BPF schedulers (e.g. userspace 6698 5153 * scheduling) may not function correctly before all tasks are switched. 
6699 5154 * Init in bypass mode to guarantee forward progress. 6700 5155 */ 6701 - scx_bypass(true); 6702 - scx_bypassed_for_enable = true; 5156 + scx_bypass(sch, true); 6703 5157 6704 5158 for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++) 6705 5159 if (((void (**)(void))ops)[i]) ··· 6710 5206 * never sees uninitialized tasks. 6711 5207 */ 6712 5208 scx_cgroup_lock(); 5209 + set_cgroup_sched(sch_cgroup(sch), sch); 6713 5210 ret = scx_cgroup_init(sch); 6714 5211 if (ret) 6715 5212 goto err_disable_unlock_all; 6716 5213 6717 - scx_task_iter_start(&sti); 5214 + scx_task_iter_start(&sti, NULL); 6718 5215 while ((p = scx_task_iter_next_locked(&sti))) { 6719 5216 /* 6720 5217 * @p may already be dead, have lost all its usages counts and ··· 6727 5222 6728 5223 scx_task_iter_unlock(&sti); 6729 5224 6730 - ret = scx_init_task(p, task_group(p), false); 5225 + ret = scx_init_task(sch, p, false); 6731 5226 if (ret) { 6732 5227 put_task_struct(p); 6733 5228 scx_task_iter_stop(&sti); ··· 6736 5231 goto err_disable_unlock_all; 6737 5232 } 6738 5233 5234 + scx_set_task_sched(p, sch); 6739 5235 scx_set_task_state(p, SCX_TASK_READY); 6740 5236 6741 5237 put_task_struct(p); ··· 6758 5252 * scx_tasks_lock. 6759 5253 */ 6760 5254 percpu_down_write(&scx_fork_rwsem); 6761 - scx_task_iter_start(&sti); 5255 + scx_task_iter_start(&sti, NULL); 6762 5256 while ((p = scx_task_iter_next_locked(&sti))) { 6763 5257 unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE; 6764 5258 const struct sched_class *old_class = p->sched_class; ··· 6771 5265 queue_flags |= DEQUEUE_CLASS; 6772 5266 6773 5267 scoped_guard (sched_change, p, queue_flags) { 6774 - p->scx.slice = READ_ONCE(scx_slice_dfl); 5268 + p->scx.slice = READ_ONCE(sch->slice_dfl); 6775 5269 p->sched_class = new_class; 6776 5270 } 6777 5271 } 6778 5272 scx_task_iter_stop(&sti); 6779 5273 percpu_up_write(&scx_fork_rwsem); 6780 5274 6781 - scx_bypassed_for_enable = false; 6782 - scx_bypass(false); 5275 + scx_bypass(sch, false); 6783 5276 6784 5277 if (!scx_tryset_enable_state(SCX_ENABLED, SCX_ENABLING)) { 6785 5278 WARN_ON_ONCE(atomic_read(&sch->exit_kind) == SCX_EXIT_NONE); ··· 6820 5315 * Flush scx_disable_work to ensure that error is reported before init 6821 5316 * completion. sch's base reference will be put by bpf_scx_unreg(). 6822 5317 */ 6823 - scx_error(sch, "scx_enable() failed (%d)", ret); 5318 + scx_error(sch, "scx_root_enable() failed (%d)", ret); 6824 5319 kthread_flush_work(&sch->disable_work); 6825 5320 cmd->ret = 0; 6826 5321 } 6827 5322 6828 - static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) 5323 + #ifdef CONFIG_EXT_SUB_SCHED 5324 + /* verify that a scheduler can be attached to @cgrp and return the parent */ 5325 + static struct scx_sched *find_parent_sched(struct cgroup *cgrp) 5326 + { 5327 + struct scx_sched *parent = cgrp->scx_sched; 5328 + struct scx_sched *pos; 5329 + 5330 + lockdep_assert_held(&scx_sched_lock); 5331 + 5332 + /* can't attach twice to the same cgroup */ 5333 + if (parent->cgrp == cgrp) 5334 + return ERR_PTR(-EBUSY); 5335 + 5336 + /* does $parent allow sub-scheds? 
*/ 5337 + if (!parent->ops.sub_attach) 5338 + return ERR_PTR(-EOPNOTSUPP); 5339 + 5340 + /* can't insert between $parent and its exiting children */ 5341 + list_for_each_entry(pos, &parent->children, sibling) 5342 + if (cgroup_is_descendant(pos->cgrp, cgrp)) 5343 + return ERR_PTR(-EBUSY); 5344 + 5345 + return parent; 5346 + } 5347 + 5348 + static bool assert_task_ready_or_enabled(struct task_struct *p) 5349 + { 5350 + u32 state = scx_get_task_state(p); 5351 + 5352 + switch (state) { 5353 + case SCX_TASK_READY: 5354 + case SCX_TASK_ENABLED: 5355 + return true; 5356 + default: 5357 + WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched", 5358 + state, p->comm, p->pid); 5359 + return false; 5360 + } 5361 + } 5362 + 5363 + static void scx_sub_enable_workfn(struct kthread_work *work) 5364 + { 5365 + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); 5366 + struct sched_ext_ops *ops = cmd->ops; 5367 + struct cgroup *cgrp; 5368 + struct scx_sched *parent, *sch; 5369 + struct scx_task_iter sti; 5370 + struct task_struct *p; 5371 + s32 i, ret; 5372 + 5373 + mutex_lock(&scx_enable_mutex); 5374 + 5375 + if (!scx_enabled()) { 5376 + ret = -ENODEV; 5377 + goto out_unlock; 5378 + } 5379 + 5380 + cgrp = cgroup_get_from_id(ops->sub_cgroup_id); 5381 + if (IS_ERR(cgrp)) { 5382 + ret = PTR_ERR(cgrp); 5383 + goto out_unlock; 5384 + } 5385 + 5386 + raw_spin_lock_irq(&scx_sched_lock); 5387 + parent = find_parent_sched(cgrp); 5388 + if (IS_ERR(parent)) { 5389 + raw_spin_unlock_irq(&scx_sched_lock); 5390 + ret = PTR_ERR(parent); 5391 + goto out_put_cgrp; 5392 + } 5393 + kobject_get(&parent->kobj); 5394 + raw_spin_unlock_irq(&scx_sched_lock); 5395 + 5396 + /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */ 5397 + sch = scx_alloc_and_add_sched(ops, cgrp, parent); 5398 + kobject_put(&parent->kobj); 5399 + if (IS_ERR(sch)) { 5400 + ret = PTR_ERR(sch); 5401 + goto out_unlock; 5402 + } 5403 + 5404 + ret = scx_link_sched(sch); 5405 + if (ret) 5406 + goto err_disable; 5407 + 5408 + if (sch->level >= SCX_SUB_MAX_DEPTH) { 5409 + scx_error(sch, "max nesting depth %d violated", 5410 + SCX_SUB_MAX_DEPTH); 5411 + goto err_disable; 5412 + } 5413 + 5414 + if (sch->ops.init) { 5415 + ret = SCX_CALL_OP_RET(sch, init, NULL); 5416 + if (ret) { 5417 + ret = ops_sanitize_err(sch, "init", ret); 5418 + scx_error(sch, "ops.init() failed (%d)", ret); 5419 + goto err_disable; 5420 + } 5421 + sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; 5422 + } 5423 + 5424 + if (validate_ops(sch, ops)) 5425 + goto err_disable; 5426 + 5427 + struct scx_sub_attach_args sub_attach_args = { 5428 + .ops = &sch->ops, 5429 + .cgroup_path = sch->cgrp_path, 5430 + }; 5431 + 5432 + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, 5433 + &sub_attach_args); 5434 + if (ret) { 5435 + ret = ops_sanitize_err(sch, "sub_attach", ret); 5436 + scx_error(sch, "parent rejected (%d)", ret); 5437 + goto err_disable; 5438 + } 5439 + sch->sub_attached = true; 5440 + 5441 + scx_bypass(sch, true); 5442 + 5443 + for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) 5444 + if (((void (**)(void))ops)[i]) 5445 + set_bit(i, sch->has_op); 5446 + 5447 + percpu_down_write(&scx_fork_rwsem); 5448 + scx_cgroup_lock(); 5449 + 5450 + /* 5451 + * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see 5452 + * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down. 
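For illustration, the attach handshake above means a parent scheduler only needs to implement ops.sub_attach()/ops.sub_detach() to admit or reject children. A minimal sketch, assuming the usual tools/sched_ext BPF_STRUCT_OPS macro and the argument struct names exactly as they appear in this diff; the accept-everything policy is invented:

.. code-block:: c

   /* Parent-side hooks: a child attaching to a descendant cgroup is
    * admitted only if sub_attach() returns 0.
    */
   s32 BPF_STRUCT_OPS(parent_sub_attach, struct scx_sub_attach_args *args)
   {
           /*
            * Illustrative policy: accept any child. A real parent could
            * inspect args->cgroup_path or args->ops before agreeing.
            */
           return 0;
   }

   void BPF_STRUCT_OPS(parent_sub_detach, struct scx_sub_detach_args *args)
   {
           /* nothing to tear down in this sketch */
   }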
5453 + */ 5454 + set_cgroup_sched(sch_cgroup(sch), sch); 5455 + if (!(cgrp->self.flags & CSS_ONLINE)) { 5456 + scx_error(sch, "cgroup is not online"); 5457 + goto err_unlock_and_disable; 5458 + } 5459 + 5460 + /* 5461 + * Initialize tasks for the new child $sch without exiting them for 5462 + * $parent so that the tasks can always be reverted back to $parent 5463 + * sched on child init failure. 5464 + */ 5465 + WARN_ON_ONCE(scx_enabling_sub_sched); 5466 + scx_enabling_sub_sched = sch; 5467 + 5468 + scx_task_iter_start(&sti, sch->cgrp); 5469 + while ((p = scx_task_iter_next_locked(&sti))) { 5470 + struct rq *rq; 5471 + struct rq_flags rf; 5472 + 5473 + /* 5474 + * Task iteration may visit the same task twice when racing 5475 + * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which 5476 + * finished __scx_init_task() and skip if set. 5477 + * 5478 + * A task may exit and get freed between __scx_init_task() 5479 + * completion and scx_enable_task(). In such cases, 5480 + * scx_disable_and_exit_task() must exit the task for both the 5481 + * parent and child scheds. 5482 + */ 5483 + if (p->scx.flags & SCX_TASK_SUB_INIT) 5484 + continue; 5485 + 5486 + /* see scx_root_enable() */ 5487 + if (!tryget_task_struct(p)) 5488 + continue; 5489 + 5490 + if (!assert_task_ready_or_enabled(p)) { 5491 + ret = -EINVAL; 5492 + goto abort; 5493 + } 5494 + 5495 + scx_task_iter_unlock(&sti); 5496 + 5497 + /* 5498 + * As $p is still on $parent, it can't be transitioned to INIT. 5499 + * Let's worry about task state later. Use __scx_init_task(). 5500 + */ 5501 + ret = __scx_init_task(sch, p, false); 5502 + if (ret) 5503 + goto abort; 5504 + 5505 + rq = task_rq_lock(p, &rf); 5506 + p->scx.flags |= SCX_TASK_SUB_INIT; 5507 + task_rq_unlock(rq, p, &rf); 5508 + 5509 + put_task_struct(p); 5510 + } 5511 + scx_task_iter_stop(&sti); 5512 + 5513 + /* 5514 + * All tasks are prepped. Disable/exit tasks for $parent and enable for 5515 + * the new @sch. 5516 + */ 5517 + scx_task_iter_start(&sti, sch->cgrp); 5518 + while ((p = scx_task_iter_next_locked(&sti))) { 5519 + /* 5520 + * Use clearing of %SCX_TASK_SUB_INIT to detect and skip 5521 + * duplicate iterations. 5522 + */ 5523 + if (!(p->scx.flags & SCX_TASK_SUB_INIT)) 5524 + continue; 5525 + 5526 + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { 5527 + /* 5528 + * $p must be either READY or ENABLED. If ENABLED, 5529 + * __scx_disabled_and_exit_task() first disables and 5530 + * makes it READY. However, after exiting $p, it will 5531 + * leave $p as READY. 5532 + */ 5533 + assert_task_ready_or_enabled(p); 5534 + __scx_disable_and_exit_task(parent, p); 5535 + 5536 + /* 5537 + * $p is now only initialized for @sch and READY, which 5538 + * is what we want. Assign it to @sch and enable. 
5539 + */ 5540 + rcu_assign_pointer(p->scx.sched, sch); 5541 + scx_enable_task(sch, p); 5542 + 5543 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 5544 + } 5545 + } 5546 + scx_task_iter_stop(&sti); 5547 + 5548 + scx_enabling_sub_sched = NULL; 5549 + 5550 + scx_cgroup_unlock(); 5551 + percpu_up_write(&scx_fork_rwsem); 5552 + 5553 + scx_bypass(sch, false); 5554 + 5555 + pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name); 5556 + kobject_uevent(&sch->kobj, KOBJ_ADD); 5557 + ret = 0; 5558 + goto out_unlock; 5559 + 5560 + out_put_cgrp: 5561 + cgroup_put(cgrp); 5562 + out_unlock: 5563 + mutex_unlock(&scx_enable_mutex); 5564 + cmd->ret = ret; 5565 + return; 5566 + 5567 + abort: 5568 + put_task_struct(p); 5569 + scx_task_iter_stop(&sti); 5570 + scx_enabling_sub_sched = NULL; 5571 + 5572 + scx_task_iter_start(&sti, sch->cgrp); 5573 + while ((p = scx_task_iter_next_locked(&sti))) { 5574 + if (p->scx.flags & SCX_TASK_SUB_INIT) { 5575 + __scx_disable_and_exit_task(sch, p); 5576 + p->scx.flags &= ~SCX_TASK_SUB_INIT; 5577 + } 5578 + } 5579 + scx_task_iter_stop(&sti); 5580 + err_unlock_and_disable: 5581 + /* we'll soon enter disable path, keep bypass on */ 5582 + scx_cgroup_unlock(); 5583 + percpu_up_write(&scx_fork_rwsem); 5584 + err_disable: 5585 + mutex_unlock(&scx_enable_mutex); 5586 + kthread_flush_work(&sch->disable_work); 5587 + cmd->ret = 0; 5588 + } 5589 + 5590 + static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb, 5591 + unsigned long action, void *data) 5592 + { 5593 + struct cgroup *cgrp = data; 5594 + struct cgroup *parent = cgroup_parent(cgrp); 5595 + 5596 + if (!cgroup_on_dfl(cgrp)) 5597 + return NOTIFY_OK; 5598 + 5599 + switch (action) { 5600 + case CGROUP_LIFETIME_ONLINE: 5601 + /* inherit ->scx_sched from $parent */ 5602 + if (parent) 5603 + rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched); 5604 + break; 5605 + case CGROUP_LIFETIME_OFFLINE: 5606 + /* if there is a sched attached, shoot it down */ 5607 + if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp) 5608 + scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN, 5609 + SCX_ECODE_RSN_CGROUP_OFFLINE, 5610 + "cgroup %llu going offline", cgroup_id(cgrp)); 5611 + break; 5612 + } 5613 + 5614 + return NOTIFY_OK; 5615 + } 5616 + 5617 + static struct notifier_block scx_cgroup_lifetime_nb = { 5618 + .notifier_call = scx_cgroup_lifetime_notify, 5619 + }; 5620 + 5621 + static s32 __init scx_cgroup_lifetime_notifier_init(void) 5622 + { 5623 + return blocking_notifier_chain_register(&cgroup_lifetime_notifier, 5624 + &scx_cgroup_lifetime_nb); 5625 + } 5626 + core_initcall(scx_cgroup_lifetime_notifier_init); 5627 + #endif /* CONFIG_EXT_SUB_SCHED */ 5628 + 5629 + static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) 6829 5630 { 6830 5631 static struct kthread_worker *helper; 6831 5632 static DEFINE_MUTEX(helper_mutex); ··· 7158 5347 mutex_unlock(&helper_mutex); 7159 5348 } 7160 5349 7161 - kthread_init_work(&cmd.work, scx_enable_workfn); 5350 + #ifdef CONFIG_EXT_SUB_SCHED 5351 + if (ops->sub_cgroup_id > 1) 5352 + kthread_init_work(&cmd.work, scx_sub_enable_workfn); 5353 + else 5354 + #endif /* CONFIG_EXT_SUB_SCHED */ 5355 + kthread_init_work(&cmd.work, scx_root_enable_workfn); 7162 5356 cmd.ops = ops; 7163 5357 7164 5358 kthread_queue_work(READ_ONCE(helper), &cmd.work); ··· 7204 5388 7205 5389 t = btf_type_by_id(reg->btf, reg->btf_id); 7206 5390 if (t == task_struct_type) { 7207 - if (off >= offsetof(struct task_struct, scx.slice) && 7208 - off + size <= offsetofend(struct task_struct, scx.slice)) 5391 + /* 
5392 + * COMPAT: Will be removed in v6.23. 5393 + */ 5394 + if ((off >= offsetof(struct task_struct, scx.slice) && 5395 + off + size <= offsetofend(struct task_struct, scx.slice)) || 5396 + (off >= offsetof(struct task_struct, scx.dsq_vtime) && 5397 + off + size <= offsetofend(struct task_struct, scx.dsq_vtime))) { 5398 + pr_warn("sched_ext: Writing directly to p->scx.slice/dsq_vtime is deprecated, use scx_bpf_task_set_slice/dsq_vtime()"); 7209 5399 return SCALAR_VALUE; 7210 - if (off >= offsetof(struct task_struct, scx.dsq_vtime) && 7211 - off + size <= offsetofend(struct task_struct, scx.dsq_vtime)) 7212 - return SCALAR_VALUE; 5400 + } 5401 + 7213 5402 if (off >= offsetof(struct task_struct, scx.disallow) && 7214 5403 off + size <= offsetofend(struct task_struct, scx.disallow)) 7215 5404 return SCALAR_VALUE; ··· 7270 5449 case offsetof(struct sched_ext_ops, hotplug_seq): 7271 5450 ops->hotplug_seq = *(u64 *)(udata + moff); 7272 5451 return 1; 5452 + #ifdef CONFIG_EXT_SUB_SCHED 5453 + case offsetof(struct sched_ext_ops, sub_cgroup_id): 5454 + ops->sub_cgroup_id = *(u64 *)(udata + moff); 5455 + return 1; 5456 + #endif /* CONFIG_EXT_SUB_SCHED */ 7273 5457 } 7274 5458 7275 5459 return 0; 7276 5460 } 5461 + 5462 + #ifdef CONFIG_EXT_SUB_SCHED 5463 + static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog) 5464 + { 5465 + struct scx_sched *sch; 5466 + 5467 + guard(rcu)(); 5468 + sch = scx_prog_sched(prog->aux); 5469 + if (unlikely(!sch)) 5470 + return; 5471 + 5472 + scx_error(sch, "dispatch recursion detected"); 5473 + } 5474 + #endif /* CONFIG_EXT_SUB_SCHED */ 7277 5475 7278 5476 static int bpf_scx_check_member(const struct btf_type *t, 7279 5477 const struct btf_member *member, ··· 7311 5471 case offsetof(struct sched_ext_ops, cpu_offline): 7312 5472 case offsetof(struct sched_ext_ops, init): 7313 5473 case offsetof(struct sched_ext_ops, exit): 5474 + case offsetof(struct sched_ext_ops, sub_attach): 5475 + case offsetof(struct sched_ext_ops, sub_detach): 7314 5476 break; 7315 5477 default: 7316 5478 if (prog->sleepable) 7317 5479 return -EINVAL; 7318 5480 } 5481 + 5482 + #ifdef CONFIG_EXT_SUB_SCHED 5483 + /* 5484 + * Enable private stack for operations that can nest along the 5485 + * hierarchy. 5486 + * 5487 + * XXX - Ideally, we should only do this for scheds that allow 5488 + * sub-scheds and sub-scheds themselves but I don't know how to access 5489 + * struct_ops from here. 
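Correspondingly, bpf_scx_init_member() above copies ->sub_cgroup_id from the loading program, and scx_enable() (further down) routes any value above 1 to the sub-scheduler enable path. As a sketch, a child scheduler therefore only declares the cgroup it wants to own; the scheduler name, callbacks, and the assumption that the loader fills in the cgroup ID before attaching are all illustrative:

.. code-block:: c

   SEC(".struct_ops.link")
   struct sched_ext_ops child_ops = {
           .select_cpu     = (void *)child_select_cpu,
           .enqueue        = (void *)child_enqueue,
           .dispatch       = (void *)child_dispatch,
           /*
            * cgroup ID of the subtree to schedule. 0 or 1 (the root
            * cgroup) keeps the old behavior of loading as the root
            * scheduler.
            */
           .sub_cgroup_id  = 0,
           .name           = "child",
   };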
5490 + */ 5491 + switch (moff) { 5492 + case offsetof(struct sched_ext_ops, dispatch): 5493 + prog->aux->priv_stack_requested = true; 5494 + prog->aux->recursion_detected = scx_pstack_recursion_on_dispatch; 5495 + } 5496 + #endif /* CONFIG_EXT_SUB_SCHED */ 7319 5497 7320 5498 return 0; 7321 5499 } ··· 7346 5488 static void bpf_scx_unreg(void *kdata, struct bpf_link *link) 7347 5489 { 7348 5490 struct sched_ext_ops *ops = kdata; 7349 - struct scx_sched *sch = ops->priv; 5491 + struct scx_sched *sch = rcu_dereference_protected(ops->priv, true); 7350 5492 7351 - scx_disable(SCX_EXIT_UNREG); 5493 + scx_disable(sch, SCX_EXIT_UNREG); 7352 5494 kthread_flush_work(&sch->disable_work); 5495 + RCU_INIT_POINTER(ops->priv, NULL); 7353 5496 kobject_put(&sch->kobj); 7354 5497 } 7355 5498 ··· 7407 5548 static void sched_ext_ops__cgroup_set_weight(struct cgroup *cgrp, u32 weight) {} 7408 5549 static void sched_ext_ops__cgroup_set_bandwidth(struct cgroup *cgrp, u64 period_us, u64 quota_us, u64 burst_us) {} 7409 5550 static void sched_ext_ops__cgroup_set_idle(struct cgroup *cgrp, bool idle) {} 7410 - #endif 5551 + #endif /* CONFIG_EXT_GROUP_SCHED */ 5552 + static s32 sched_ext_ops__sub_attach(struct scx_sub_attach_args *args) { return -EINVAL; } 5553 + static void sched_ext_ops__sub_detach(struct scx_sub_detach_args *args) {} 7411 5554 static void sched_ext_ops__cpu_online(s32 cpu) {} 7412 5555 static void sched_ext_ops__cpu_offline(s32 cpu) {} 7413 5556 static s32 sched_ext_ops__init(void) { return -EINVAL; } ··· 7449 5588 .cgroup_set_bandwidth = sched_ext_ops__cgroup_set_bandwidth, 7450 5589 .cgroup_set_idle = sched_ext_ops__cgroup_set_idle, 7451 5590 #endif 5591 + .sub_attach = sched_ext_ops__sub_attach, 5592 + .sub_detach = sched_ext_ops__sub_detach, 7452 5593 .cpu_online = sched_ext_ops__cpu_online, 7453 5594 .cpu_offline = sched_ext_ops__cpu_offline, 7454 5595 .init = sched_ext_ops__init, ··· 7481 5618 7482 5619 static void sysrq_handle_sched_ext_reset(u8 key) 7483 5620 { 7484 - scx_disable(SCX_EXIT_SYSRQ); 5621 + struct scx_sched *sch; 5622 + 5623 + rcu_read_lock(); 5624 + sch = rcu_dereference(scx_root); 5625 + if (likely(sch)) 5626 + scx_disable(sch, SCX_EXIT_SYSRQ); 5627 + else 5628 + pr_info("sched_ext: BPF schedulers not loaded\n"); 5629 + rcu_read_unlock(); 7485 5630 } 7486 5631 7487 5632 static const struct sysrq_key_op sysrq_sched_ext_reset_op = { ··· 7502 5631 static void sysrq_handle_sched_ext_dump(u8 key) 7503 5632 { 7504 5633 struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" }; 5634 + struct scx_sched *sch; 7505 5635 7506 - if (scx_enabled()) 7507 - scx_dump_state(&ei, 0); 5636 + list_for_each_entry_rcu(sch, &scx_sched_all, all) 5637 + scx_dump_state(sch, &ei, 0, false); 7508 5638 } 7509 5639 7510 5640 static const struct sysrq_key_op sysrq_sched_ext_dump_op = { ··· 7600 5728 unsigned long *ksyncs; 7601 5729 s32 cpu; 7602 5730 7603 - if (unlikely(!ksyncs_pcpu)) { 7604 - pr_warn_once("kick_cpus_irq_workfn() called with NULL scx_kick_syncs"); 5731 + /* can race with free_kick_syncs() during scheduler disable */ 5732 + if (unlikely(!ksyncs_pcpu)) 7605 5733 return; 7606 - } 7607 5734 7608 5735 ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs; 7609 5736 ··· 7643 5772 */ 7644 5773 void print_scx_info(const char *log_lvl, struct task_struct *p) 7645 5774 { 7646 - struct scx_sched *sch = scx_root; 5775 + struct scx_sched *sch; 7647 5776 enum scx_enable_state state = scx_enable_state(); 7648 5777 const char *all = READ_ONCE(scx_switching_all) ? 
"+all" : ""; 7649 5778 char runnable_at_buf[22] = "?"; 7650 5779 struct sched_class *class; 7651 5780 unsigned long runnable_at; 7652 5781 7653 - if (state == SCX_DISABLED) 5782 + guard(rcu)(); 5783 + 5784 + sch = scx_task_sched_rcu(p); 5785 + 5786 + if (!sch) 7654 5787 return; 7655 5788 7656 5789 /* ··· 7681 5806 7682 5807 static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr) 7683 5808 { 5809 + struct scx_sched *sch; 5810 + 5811 + guard(rcu)(); 5812 + 5813 + sch = rcu_dereference(scx_root); 5814 + if (!sch) 5815 + return NOTIFY_OK; 5816 + 7684 5817 /* 7685 5818 * SCX schedulers often have userspace components which are sometimes 7686 5819 * involved in critial scheduling paths. PM operations involve freezing ··· 7699 5816 case PM_HIBERNATION_PREPARE: 7700 5817 case PM_SUSPEND_PREPARE: 7701 5818 case PM_RESTORE_PREPARE: 7702 - scx_bypass(true); 5819 + scx_bypass(sch, true); 7703 5820 break; 7704 5821 case PM_POST_HIBERNATION: 7705 5822 case PM_POST_SUSPEND: 7706 5823 case PM_POST_RESTORE: 7707 - scx_bypass(false); 5824 + scx_bypass(sch, false); 7708 5825 break; 7709 5826 } 7710 5827 ··· 7733 5850 struct rq *rq = cpu_rq(cpu); 7734 5851 int n = cpu_to_node(cpu); 7735 5852 7736 - init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); 7737 - init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS); 5853 + /* local_dsq's sch will be set during scx_root_enable() */ 5854 + BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL)); 5855 + 7738 5856 INIT_LIST_HEAD(&rq->scx.runnable_list); 7739 5857 INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals); 7740 5858 ··· 7744 5860 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n)); 7745 5861 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n)); 7746 5862 BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_sync, GFP_KERNEL, n)); 5863 + raw_spin_lock_init(&rq->scx.deferred_reenq_lock); 5864 + INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals); 5865 + INIT_LIST_HEAD(&rq->scx.deferred_reenq_users); 7747 5866 rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn); 7748 5867 rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn); 7749 5868 ··· 7757 5870 register_sysrq_key('S', &sysrq_sched_ext_reset_op); 7758 5871 register_sysrq_key('D', &sysrq_sched_ext_dump_op); 7759 5872 INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); 5873 + 5874 + #ifdef CONFIG_EXT_SUB_SCHED 5875 + BUG_ON(rhashtable_init(&scx_sched_hash, &scx_sched_hash_params)); 5876 + #endif /* CONFIG_EXT_SUB_SCHED */ 7760 5877 } 7761 5878 7762 5879 7763 5880 /******************************************************************************** 7764 5881 * Helpers that can be called from the BPF scheduler. 
7765 5882 */ 7766 - static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, 7767 - u64 enq_flags) 5883 + static bool scx_vet_enq_flags(struct scx_sched *sch, u64 dsq_id, u64 *enq_flags) 7768 5884 { 7769 - if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) 7770 - return false; 5885 + bool is_local = dsq_id == SCX_DSQ_LOCAL || 5886 + (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON; 7771 5887 5888 + if (*enq_flags & SCX_ENQ_IMMED) { 5889 + if (unlikely(!is_local)) { 5890 + scx_error(sch, "SCX_ENQ_IMMED on a non-local DSQ 0x%llx", dsq_id); 5891 + return false; 5892 + } 5893 + } else if ((sch->ops.flags & SCX_OPS_ALWAYS_ENQ_IMMED) && is_local) { 5894 + *enq_flags |= SCX_ENQ_IMMED; 5895 + } 5896 + 5897 + return true; 5898 + } 5899 + 5900 + static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, 5901 + u64 dsq_id, u64 *enq_flags) 5902 + { 7772 5903 lockdep_assert_irqs_disabled(); 7773 5904 7774 5905 if (unlikely(!p)) { ··· 7794 5889 return false; 7795 5890 } 7796 5891 7797 - if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) { 7798 - scx_error(sch, "invalid enq_flags 0x%llx", enq_flags); 5892 + if (unlikely(*enq_flags & __SCX_ENQ_INTERNAL_MASK)) { 5893 + scx_error(sch, "invalid enq_flags 0x%llx", *enq_flags); 7799 5894 return false; 7800 5895 } 5896 + 5897 + /* see SCX_EV_INSERT_NOT_OWNED definition */ 5898 + if (unlikely(!scx_task_on_sched(sch, p))) { 5899 + __scx_add_event(sch, SCX_EV_INSERT_NOT_OWNED, 1); 5900 + return false; 5901 + } 5902 + 5903 + if (!scx_vet_enq_flags(sch, dsq_id, enq_flags)) 5904 + return false; 7801 5905 7802 5906 return true; 7803 5907 } ··· 7814 5900 static void scx_dsq_insert_commit(struct scx_sched *sch, struct task_struct *p, 7815 5901 u64 dsq_id, u64 enq_flags) 7816 5902 { 7817 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 5903 + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 7818 5904 struct task_struct *ddsp_task; 7819 5905 7820 5906 ddsp_task = __this_cpu_read(direct_dispatch_task); ··· 7823 5909 return; 7824 5910 } 7825 5911 7826 - if (unlikely(dspc->cursor >= scx_dsp_max_batch)) { 5912 + if (unlikely(dspc->cursor >= sch->dsp_max_batch)) { 7827 5913 scx_error(sch, "dispatch buffer overflow"); 7828 5914 return; 7829 5915 } ··· 7844 5930 * @dsq_id: DSQ to insert into 7845 5931 * @slice: duration @p can run for in nsecs, 0 to keep the current value 7846 5932 * @enq_flags: SCX_ENQ_* 5933 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7847 5934 * 7848 5935 * Insert @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe to 7849 5936 * call this function spuriously. Can be called from ops.enqueue(), ··· 7879 5964 * to check the return value. 7880 5965 */ 7881 5966 __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id, 7882 - u64 slice, u64 enq_flags) 5967 + u64 slice, u64 enq_flags, 5968 + const struct bpf_prog_aux *aux) 7883 5969 { 7884 5970 struct scx_sched *sch; 7885 5971 7886 5972 guard(rcu)(); 7887 - sch = rcu_dereference(scx_root); 5973 + sch = scx_prog_sched(aux); 7888 5974 if (unlikely(!sch)) 7889 5975 return false; 7890 5976 7891 - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) 5977 + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) 7892 5978 return false; 7893 5979 7894 5980 if (slice) ··· 7906 5990 * COMPAT: Will be removed in v6.23 along with the ___v2 suffix. 
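As a usage sketch of the flag handling above: %SCX_ENQ_IMMED is only accepted for local DSQs, and the ___v2 insert kfunc reports whether the insertion was taken (a task owned by another scheduler is now counted as %SCX_EV_INSERT_NOT_OWNED rather than raising an error). BPF programs never pass the hidden @aux argument; the verifier supplies it. The helper macros and the do-nothing fallback below are assumptions:

.. code-block:: c

   void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
   {
           s32 cpu = scx_bpf_task_cpu(p);

           /*
            * With SCX_ENQ_IMMED the task either starts running on @cpu
            * right away or is sent back through ops.enqueue(), bounding
            * how long it can sit unnoticed in the local DSQ.
            */
           if (scx_bpf_dsq_insert___v2(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                       enq_flags | SCX_ENQ_IMMED))
                   return;

           /* refused: @p isn't owned by this scheduler or the flags were bad */
   }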
7907 5991 */ 7908 5992 __bpf_kfunc void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, 7909 - u64 slice, u64 enq_flags) 5993 + u64 slice, u64 enq_flags, 5994 + const struct bpf_prog_aux *aux) 7910 5995 { 7911 - scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags); 5996 + scx_bpf_dsq_insert___v2(p, dsq_id, slice, enq_flags, aux); 7912 5997 } 7913 5998 7914 5999 static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p, 7915 6000 u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) 7916 6001 { 7917 - if (!scx_dsq_insert_preamble(sch, p, enq_flags)) 6002 + if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags)) 7918 6003 return false; 7919 6004 7920 6005 if (slice) ··· 7946 6029 * @args->slice: duration @p can run for in nsecs, 0 to keep the current value 7947 6030 * @args->vtime: @p's ordering inside the vtime-sorted queue of the target DSQ 7948 6031 * @args->enq_flags: SCX_ENQ_* 6032 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7949 6033 * 7950 6034 * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument 7951 6035 * limit. BPF programs should use scx_bpf_dsq_insert_vtime() which is provided ··· 7971 6053 */ 7972 6054 __bpf_kfunc bool 7973 6055 __scx_bpf_dsq_insert_vtime(struct task_struct *p, 7974 - struct scx_bpf_dsq_insert_vtime_args *args) 6056 + struct scx_bpf_dsq_insert_vtime_args *args, 6057 + const struct bpf_prog_aux *aux) 7975 6058 { 7976 6059 struct scx_sched *sch; 7977 6060 7978 6061 guard(rcu)(); 7979 6062 7980 - sch = rcu_dereference(scx_root); 6063 + sch = scx_prog_sched(aux); 7981 6064 if (unlikely(!sch)) 7982 6065 return false; 7983 6066 ··· 8000 6081 if (unlikely(!sch)) 8001 6082 return; 8002 6083 6084 + #ifdef CONFIG_EXT_SUB_SCHED 6085 + /* 6086 + * Disallow if any sub-scheds are attached. There is no way to tell 6087 + * which scheduler called us, just error out @p's scheduler. 
6088 + */ 6089 + if (unlikely(!list_empty(&sch->children))) { 6090 + scx_error(scx_task_sched(p), "__scx_bpf_dsq_insert_vtime() must be used"); 6091 + return; 6092 + } 6093 + #endif 6094 + 8003 6095 scx_dsq_insert_vtime(sch, p, dsq_id, slice, vtime, enq_flags); 8004 6096 } 8005 6097 8006 6098 __bpf_kfunc_end_defs(); 8007 6099 8008 6100 BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) 8009 - BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU) 8010 - BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_RCU) 8011 - BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_RCU) 6101 + BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_IMPLICIT_ARGS | KF_RCU) 6102 + BTF_ID_FLAGS(func, scx_bpf_dsq_insert___v2, KF_IMPLICIT_ARGS | KF_RCU) 6103 + BTF_ID_FLAGS(func, __scx_bpf_dsq_insert_vtime, KF_IMPLICIT_ARGS | KF_RCU) 8012 6104 BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU) 8013 6105 BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) 8014 6106 8015 6107 static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { 8016 6108 .owner = THIS_MODULE, 8017 6109 .set = &scx_kfunc_ids_enqueue_dispatch, 6110 + .filter = scx_kfunc_context_filter, 8018 6111 }; 8019 6112 8020 6113 static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, 8021 6114 struct task_struct *p, u64 dsq_id, u64 enq_flags) 8022 6115 { 8023 - struct scx_sched *sch = scx_root; 8024 6116 struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; 6117 + struct scx_sched *sch = src_dsq->sched; 8025 6118 struct rq *this_rq, *src_rq, *locked_rq; 8026 6119 bool dispatched = false; 8027 6120 bool in_balance; 8028 6121 unsigned long flags; 8029 6122 8030 - if (!scx_kf_allowed_if_unlocked() && 8031 - !scx_kf_allowed(sch, SCX_KF_DISPATCH)) 6123 + if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) 8032 6124 return false; 8033 6125 8034 6126 /* 8035 6127 * If the BPF scheduler keeps calling this function repeatedly, it can 8036 6128 * cause similar live-lock conditions as consume_dispatch_q(). 8037 6129 */ 8038 - if (unlikely(READ_ONCE(scx_aborting))) 6130 + if (unlikely(READ_ONCE(sch->aborting))) 8039 6131 return false; 6132 + 6133 + if (unlikely(!scx_task_on_sched(sch, p))) { 6134 + scx_error(sch, "scx_bpf_dsq_move[_vtime]() on %s[%d] but the task belongs to a different scheduler", 6135 + p->comm, p->pid); 6136 + return false; 6137 + } 8040 6138 8041 6139 /* 8042 6140 * Can be called from either ops.dispatch() locking this_rq() or any ··· 8078 6142 locked_rq = src_rq; 8079 6143 raw_spin_lock(&src_dsq->lock); 8080 6144 8081 - /* 8082 - * Did someone else get to it? @p could have already left $src_dsq, got 8083 - * re-enqueud, or be in the process of being consumed by someone else. 8084 - */ 8085 - if (unlikely(p->scx.dsq != src_dsq || 8086 - u32_before(kit->cursor.priv, p->scx.dsq_seq) || 8087 - p->scx.holding_cpu >= 0) || 8088 - WARN_ON_ONCE(src_rq != task_rq(p))) { 6145 + /* did someone else get to it while we dropped the locks? 
*/ 6146 + if (nldsq_cursor_lost_task(&kit->cursor, src_rq, src_dsq, p)) { 8089 6147 raw_spin_unlock(&src_dsq->lock); 8090 6148 goto out; 8091 6149 } 8092 6150 8093 6151 /* @p is still on $src_dsq and stable, determine the destination */ 8094 - dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p); 6152 + dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p)); 8095 6153 8096 6154 /* 8097 6155 * Apply vtime and slice updates before moving so that the new time is ··· 8119 6189 8120 6190 /** 8121 6191 * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots 6192 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8122 6193 * 8123 6194 * Can only be called from ops.dispatch(). 8124 6195 */ 8125 - __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void) 6196 + __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(const struct bpf_prog_aux *aux) 8126 6197 { 8127 6198 struct scx_sched *sch; 8128 6199 8129 6200 guard(rcu)(); 8130 6201 8131 - sch = rcu_dereference(scx_root); 6202 + sch = scx_prog_sched(aux); 8132 6203 if (unlikely(!sch)) 8133 6204 return 0; 8134 6205 8135 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 8136 - return 0; 8137 - 8138 - return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor); 6206 + return sch->dsp_max_batch - __this_cpu_read(sch->pcpu->dsp_ctx.cursor); 8139 6207 } 8140 6208 8141 6209 /** 8142 6210 * scx_bpf_dispatch_cancel - Cancel the latest dispatch 6211 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8143 6212 * 8144 6213 * Cancel the latest dispatch. Can be called multiple times to cancel further 8145 6214 * dispatches. Can only be called from ops.dispatch(). 8146 6215 */ 8147 - __bpf_kfunc void scx_bpf_dispatch_cancel(void) 6216 + __bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux) 8148 6217 { 8149 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 8150 6218 struct scx_sched *sch; 6219 + struct scx_dsp_ctx *dspc; 8151 6220 8152 6221 guard(rcu)(); 8153 6222 8154 - sch = rcu_dereference(scx_root); 6223 + sch = scx_prog_sched(aux); 8155 6224 if (unlikely(!sch)) 8156 6225 return; 8157 6226 8158 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 8159 - return; 6227 + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 8160 6228 8161 6229 if (dspc->cursor > 0) 8162 6230 dspc->cursor--; ··· 8164 6236 8165 6237 /** 8166 6238 * scx_bpf_dsq_move_to_local - move a task from a DSQ to the current CPU's local DSQ 8167 - * @dsq_id: DSQ to move task from 6239 + * @dsq_id: DSQ to move task from. Must be a user-created DSQ 6240 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6241 + * @enq_flags: %SCX_ENQ_* 8168 6242 * 8169 6243 * Move a task from the non-local DSQ identified by @dsq_id to the current CPU's 8170 - * local DSQ for execution. Can only be called from ops.dispatch(). 6244 + * local DSQ for execution with @enq_flags applied. Can only be called from 6245 + * ops.dispatch(). 6246 + * 6247 + * Built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) are not supported as 6248 + * sources. Local DSQs support reenqueueing (a task can be picked up for 6249 + * execution, dequeued for property changes, or reenqueued), but the BPF 6250 + * scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL 6251 + * is similar but also doesn't support reenqueueing, as it maps to multiple 6252 + * per-node DSQs making the scope difficult to define; this may change in the 6253 + * future. 
8171 6254 * 8172 6255 * This function flushes the in-flight dispatches from scx_bpf_dsq_insert() 8173 6256 * before trying to move from the specified DSQ. It may also grab rq locks and ··· 8187 6248 * Returns %true if a task has been moved, %false if there isn't any task to 8188 6249 * move. 8189 6250 */ 8190 - __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id) 6251 + __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags, 6252 + const struct bpf_prog_aux *aux) 8191 6253 { 8192 - struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 8193 6254 struct scx_dispatch_q *dsq; 8194 6255 struct scx_sched *sch; 6256 + struct scx_dsp_ctx *dspc; 8195 6257 8196 6258 guard(rcu)(); 8197 6259 8198 - sch = rcu_dereference(scx_root); 6260 + sch = scx_prog_sched(aux); 8199 6261 if (unlikely(!sch)) 8200 6262 return false; 8201 6263 8202 - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) 6264 + if (!scx_vet_enq_flags(sch, SCX_DSQ_LOCAL, &enq_flags)) 8203 6265 return false; 6266 + 6267 + dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; 8204 6268 8205 6269 flush_dispatch_buf(sch, dspc->rq); 8206 6270 ··· 8213 6271 return false; 8214 6272 } 8215 6273 8216 - if (consume_dispatch_q(sch, dspc->rq, dsq)) { 6274 + if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) { 8217 6275 /* 8218 6276 * A successfully consumed task can be dequeued before it starts 8219 6277 * running while the CPU is trying to migrate other dispatched ··· 8225 6283 } else { 8226 6284 return false; 8227 6285 } 6286 + } 6287 + 6288 + /* 6289 + * COMPAT: ___v2 was introduced in v7.1. Remove this and ___v2 tag in the future. 6290 + */ 6291 + __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id, const struct bpf_prog_aux *aux) 6292 + { 6293 + return scx_bpf_dsq_move_to_local___v2(dsq_id, 0, aux); 8228 6294 } 8229 6295 8230 6296 /** ··· 8330 6380 p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); 8331 6381 } 8332 6382 6383 + #ifdef CONFIG_EXT_SUB_SCHED 6384 + /** 6385 + * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler 6386 + * @cgroup_id: cgroup ID of the child scheduler to dispatch 6387 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6388 + * 6389 + * Allows a parent scheduler to trigger dispatching on one of its direct 6390 + * child schedulers. The child scheduler runs its dispatch operation to 6391 + * move tasks from dispatch queues to the local runqueue. 6392 + * 6393 + * Returns: true on success, false if cgroup_id is invalid, not a direct 6394 + * child, or caller lacks dispatch permission. 
6395 + */ 6396 + __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux) 6397 + { 6398 + struct rq *this_rq = this_rq(); 6399 + struct scx_sched *parent, *child; 6400 + 6401 + guard(rcu)(); 6402 + parent = scx_prog_sched(aux); 6403 + if (unlikely(!parent)) 6404 + return false; 6405 + 6406 + child = scx_find_sub_sched(cgroup_id); 6407 + 6408 + if (unlikely(!child)) 6409 + return false; 6410 + 6411 + if (unlikely(scx_parent(child) != parent)) { 6412 + scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu", 6413 + cgroup_id); 6414 + return false; 6415 + } 6416 + 6417 + return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev, 6418 + true); 6419 + } 6420 + #endif /* CONFIG_EXT_SUB_SCHED */ 6421 + 8333 6422 __bpf_kfunc_end_defs(); 8334 6423 8335 6424 BTF_KFUNCS_START(scx_kfunc_ids_dispatch) 8336 - BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) 8337 - BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) 8338 - BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local) 6425 + BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS) 6426 + BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS) 6427 + BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS) 6428 + BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS) 6429 + /* scx_bpf_dsq_move*() also in scx_kfunc_ids_unlocked: callable from unlocked contexts */ 8339 6430 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) 8340 6431 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) 8341 6432 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) 8342 6433 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) 6434 + #ifdef CONFIG_EXT_SUB_SCHED 6435 + BTF_ID_FLAGS(func, scx_bpf_sub_dispatch, KF_IMPLICIT_ARGS) 6436 + #endif 8343 6437 BTF_KFUNCS_END(scx_kfunc_ids_dispatch) 8344 6438 8345 6439 static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { 8346 6440 .owner = THIS_MODULE, 8347 6441 .set = &scx_kfunc_ids_dispatch, 6442 + .filter = scx_kfunc_context_filter, 8348 6443 }; 8349 - 8350 - static u32 reenq_local(struct rq *rq) 8351 - { 8352 - LIST_HEAD(tasks); 8353 - u32 nr_enqueued = 0; 8354 - struct task_struct *p, *n; 8355 - 8356 - lockdep_assert_rq_held(rq); 8357 - 8358 - /* 8359 - * The BPF scheduler may choose to dispatch tasks back to 8360 - * @rq->scx.local_dsq. Move all candidate tasks off to a private list 8361 - * first to avoid processing the same tasks repeatedly. 8362 - */ 8363 - list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, 8364 - scx.dsq_list.node) { 8365 - /* 8366 - * If @p is being migrated, @p's current CPU may not agree with 8367 - * its allowed CPUs and the migration_cpu_stop is about to 8368 - * deactivate and re-activate @p anyway. Skip re-enqueueing. 8369 - * 8370 - * While racing sched property changes may also dequeue and 8371 - * re-enqueue a migrating task while its current CPU and allowed 8372 - * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to 8373 - * the current local DSQ for running tasks and thus are not 8374 - * visible to the BPF scheduler. 
8375 - */ 8376 - if (p->migration_pending) 8377 - continue; 8378 - 8379 - dispatch_dequeue(rq, p); 8380 - list_add_tail(&p->scx.dsq_list.node, &tasks); 8381 - } 8382 - 8383 - list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { 8384 - list_del_init(&p->scx.dsq_list.node); 8385 - do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 8386 - nr_enqueued++; 8387 - } 8388 - 8389 - return nr_enqueued; 8390 - } 8391 6444 8392 6445 __bpf_kfunc_start_defs(); 8393 6446 8394 6447 /** 8395 6448 * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 6449 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8396 6450 * 8397 6451 * Iterate over all of the tasks currently enqueued on the local DSQ of the 8398 6452 * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of 8399 6453 * processed tasks. Can only be called from ops.cpu_release(). 8400 - * 8401 - * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void 8402 - * returning variant that can be called from anywhere. 8403 6454 */ 8404 - __bpf_kfunc u32 scx_bpf_reenqueue_local(void) 6455 + __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux) 8405 6456 { 8406 6457 struct scx_sched *sch; 8407 6458 struct rq *rq; 8408 6459 8409 6460 guard(rcu)(); 8410 - sch = rcu_dereference(scx_root); 6461 + sch = scx_prog_sched(aux); 8411 6462 if (unlikely(!sch)) 8412 - return 0; 8413 - 8414 - if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE)) 8415 6463 return 0; 8416 6464 8417 6465 rq = cpu_rq(smp_processor_id()); 8418 6466 lockdep_assert_rq_held(rq); 8419 6467 8420 - return reenq_local(rq); 6468 + return reenq_local(sch, rq, SCX_REENQ_ANY); 8421 6469 } 8422 6470 8423 6471 __bpf_kfunc_end_defs(); 8424 6472 8425 6473 BTF_KFUNCS_START(scx_kfunc_ids_cpu_release) 8426 - BTF_ID_FLAGS(func, scx_bpf_reenqueue_local) 6474 + BTF_ID_FLAGS(func, scx_bpf_reenqueue_local, KF_IMPLICIT_ARGS) 8427 6475 BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) 8428 6476 8429 6477 static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { 8430 6478 .owner = THIS_MODULE, 8431 6479 .set = &scx_kfunc_ids_cpu_release, 6480 + .filter = scx_kfunc_context_filter, 8432 6481 }; 8433 6482 8434 6483 __bpf_kfunc_start_defs(); ··· 8436 6487 * scx_bpf_create_dsq - Create a custom DSQ 8437 6488 * @dsq_id: DSQ to create 8438 6489 * @node: NUMA node to allocate from 6490 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8439 6491 * 8440 6492 * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable 8441 6493 * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. 8442 6494 */ 8443 - __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) 6495 + __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node, const struct bpf_prog_aux *aux) 8444 6496 { 8445 6497 struct scx_dispatch_q *dsq; 8446 6498 struct scx_sched *sch; ··· 8458 6508 if (!dsq) 8459 6509 return -ENOMEM; 8460 6510 8461 - init_dsq(dsq, dsq_id); 6511 + /* 6512 + * init_dsq() must be called in GFP_KERNEL context. Init it with NULL 6513 + * @sch and update afterwards. 
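Usage of scx_bpf_create_dsq() from the BPF side is unchanged by the ownership plumbing above (the @aux argument is implicit). A minimal sleepable init, with the DSQ ID an arbitrary choice of the sketch:

.. code-block:: c

   #define SHARED_DSQ 0    /* arbitrary user DSQ ID used by this example */

   s32 BPF_STRUCT_OPS_SLEEPABLE(example_init)
   {
           /* -1: allocate the DSQ on any NUMA node */
           return scx_bpf_create_dsq(SHARED_DSQ, -1);
   }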
6514 + */ 6515 + ret = init_dsq(dsq, dsq_id, NULL); 6516 + if (ret) { 6517 + kfree(dsq); 6518 + return ret; 6519 + } 8462 6520 8463 6521 rcu_read_lock(); 8464 6522 8465 - sch = rcu_dereference(scx_root); 8466 - if (sch) 6523 + sch = scx_prog_sched(aux); 6524 + if (sch) { 6525 + dsq->sched = sch; 8467 6526 ret = rhashtable_lookup_insert_fast(&sch->dsq_hash, &dsq->hash_node, 8468 6527 dsq_hash_params); 8469 - else 6528 + } else { 8470 6529 ret = -ENODEV; 6530 + } 8471 6531 8472 6532 rcu_read_unlock(); 8473 - if (ret) 6533 + if (ret) { 6534 + exit_dsq(dsq); 8474 6535 kfree(dsq); 6536 + } 8475 6537 return ret; 8476 6538 } 8477 6539 8478 6540 __bpf_kfunc_end_defs(); 8479 6541 8480 6542 BTF_KFUNCS_START(scx_kfunc_ids_unlocked) 8481 - BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) 6543 + BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE) 6544 + /* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */ 8482 6545 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU) 8483 6546 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU) 8484 6547 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU) 8485 6548 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU) 6549 + /* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */ 6550 + BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) 6551 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 6552 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) 8486 6553 BTF_KFUNCS_END(scx_kfunc_ids_unlocked) 8487 6554 8488 6555 static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { 8489 6556 .owner = THIS_MODULE, 8490 6557 .set = &scx_kfunc_ids_unlocked, 6558 + .filter = scx_kfunc_context_filter, 8491 6559 }; 8492 6560 8493 6561 __bpf_kfunc_start_defs(); ··· 8514 6546 * scx_bpf_task_set_slice - Set task's time slice 8515 6547 * @p: task of interest 8516 6548 * @slice: time slice to set in nsecs 6549 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8517 6550 * 8518 6551 * Set @p's time slice to @slice. Returns %true on success, %false if the 8519 6552 * calling scheduler doesn't have authority over @p. 8520 6553 */ 8521 - __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice) 6554 + __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice, 6555 + const struct bpf_prog_aux *aux) 8522 6556 { 6557 + struct scx_sched *sch; 6558 + 6559 + guard(rcu)(); 6560 + sch = scx_prog_sched(aux); 6561 + if (unlikely(!scx_task_on_sched(sch, p))) 6562 + return false; 6563 + 8523 6564 p->scx.slice = slice; 8524 6565 return true; 8525 6566 } ··· 8537 6560 * scx_bpf_task_set_dsq_vtime - Set task's virtual time for DSQ ordering 8538 6561 * @p: task of interest 8539 6562 * @vtime: virtual time to set 6563 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8540 6564 * 8541 6565 * Set @p's virtual time to @vtime. Returns %true on success, %false if the 8542 6566 * calling scheduler doesn't have authority over @p. 
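With direct writes to p->scx.slice and p->scx.dsq_vtime deprecated earlier in this series, the two kfuncs above become the expected way to adjust a task's slice and vtime, and both refuse tasks owned by a different scheduler. A hedged sketch (vtime_now stands in for an assumed per-scheduler clock):

.. code-block:: c

   static void set_deadline(struct task_struct *p, u64 vtime_now)
   {
           /* both return false if @p belongs to another scheduler */
           if (!scx_bpf_task_set_slice(p, SCX_SLICE_DFL) ||
               !scx_bpf_task_set_dsq_vtime(p, vtime_now))
                   bpf_printk("task %d is not ours", p->pid);
   }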
8543 6567 */ 8544 - __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime) 6568 + __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime, 6569 + const struct bpf_prog_aux *aux) 8545 6570 { 6571 + struct scx_sched *sch; 6572 + 6573 + guard(rcu)(); 6574 + sch = scx_prog_sched(aux); 6575 + if (unlikely(!scx_task_on_sched(sch, p))) 6576 + return false; 6577 + 8546 6578 p->scx.dsq_vtime = vtime; 8547 6579 return true; 8548 6580 } ··· 8573 6587 * lead to irq_work_queue() malfunction such as infinite busy wait for 8574 6588 * IRQ status update. Suppress kicking. 8575 6589 */ 8576 - if (scx_rq_bypassing(this_rq)) 6590 + if (scx_bypassing(sch, cpu_of(this_rq))) 8577 6591 goto out; 8578 6592 8579 6593 /* ··· 8613 6627 * scx_bpf_kick_cpu - Trigger reschedule on a CPU 8614 6628 * @cpu: cpu to kick 8615 6629 * @flags: %SCX_KICK_* flags 6630 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8616 6631 * 8617 6632 * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or 8618 6633 * trigger rescheduling on a busy CPU. This can be called from any online 8619 6634 * scx_ops operation and the actual kicking is performed asynchronously through 8620 6635 * an irq work. 8621 6636 */ 8622 - __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) 6637 + __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux *aux) 8623 6638 { 8624 6639 struct scx_sched *sch; 8625 6640 8626 6641 guard(rcu)(); 8627 - sch = rcu_dereference(scx_root); 6642 + sch = scx_prog_sched(aux); 8628 6643 if (likely(sch)) 8629 6644 scx_kick_cpu(sch, cpu, flags); 8630 6645 } ··· 8699 6712 * @it: iterator to initialize 8700 6713 * @dsq_id: DSQ to iterate 8701 6714 * @flags: %SCX_DSQ_ITER_* 6715 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8702 6716 * 8703 6717 * Initialize BPF iterator @it which can be used with bpf_for_each() to walk 8704 6718 * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes 8705 6719 * tasks which are already queued when this function is invoked. 
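From BPF the iterator is still consumed through bpf_for_each(); only the cursor bookkeeping moved into nldsq_cursor_next_task(). A sketch of the common iterate-then-move pattern, intended to run from ops.dispatch(); SHARED_DSQ and the BPF_FOR_EACH_ITER macro come from the tools/sched_ext headers and are assumptions here:

.. code-block:: c

   static bool pull_one_task(s32 cpu)
   {
           struct task_struct *p;

           /* only tasks already queued when the iterator was created are visited */
           bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
                   if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                           continue;
                   if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p,
                                        SCX_DSQ_LOCAL_ON | cpu, 0))
                           return true;
           }
           return false;
   }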
8706 6720 */ 8707 6721 __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, 8708 - u64 flags) 6722 + u64 flags, const struct bpf_prog_aux *aux) 8709 6723 { 8710 6724 struct bpf_iter_scx_dsq_kern *kit = (void *)it; 8711 6725 struct scx_sched *sch; ··· 8724 6736 */ 8725 6737 kit->dsq = NULL; 8726 6738 8727 - sch = rcu_dereference_check(scx_root, rcu_read_lock_bh_held()); 6739 + sch = scx_prog_sched(aux); 8728 6740 if (unlikely(!sch)) 8729 6741 return -ENODEV; 8730 6742 ··· 8735 6747 if (!kit->dsq) 8736 6748 return -ENOENT; 8737 6749 8738 - kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags, 8739 - READ_ONCE(kit->dsq->seq)); 6750 + kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, kit->dsq, flags); 8740 6751 8741 6752 return 0; 8742 6753 } ··· 8749 6762 __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) 8750 6763 { 8751 6764 struct bpf_iter_scx_dsq_kern *kit = (void *)it; 8752 - bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV; 8753 - struct task_struct *p; 8754 - unsigned long flags; 8755 6765 8756 6766 if (!kit->dsq) 8757 6767 return NULL; 8758 6768 8759 - raw_spin_lock_irqsave(&kit->dsq->lock, flags); 6769 + guard(raw_spinlock_irqsave)(&kit->dsq->lock); 8760 6770 8761 - if (list_empty(&kit->cursor.node)) 8762 - p = NULL; 8763 - else 8764 - p = container_of(&kit->cursor, struct task_struct, scx.dsq_list); 8765 - 8766 - /* 8767 - * Only tasks which were queued before the iteration started are 8768 - * visible. This bounds BPF iterations and guarantees that vtime never 8769 - * jumps in the other direction while iterating. 8770 - */ 8771 - do { 8772 - p = nldsq_next_task(kit->dsq, p, rev); 8773 - } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq))); 8774 - 8775 - if (p) { 8776 - if (rev) 8777 - list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node); 8778 - else 8779 - list_move(&kit->cursor.node, &p->scx.dsq_list.node); 8780 - } else { 8781 - list_del_init(&kit->cursor.node); 8782 - } 8783 - 8784 - raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); 8785 - 8786 - return p; 6771 + return nldsq_cursor_next_task(&kit->cursor, kit->dsq); 8787 6772 } 8788 6773 8789 6774 /** ··· 8784 6825 /** 8785 6826 * scx_bpf_dsq_peek - Lockless peek at the first element. 8786 6827 * @dsq_id: DSQ to examine. 6828 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8787 6829 * 8788 6830 * Read the first element in the DSQ. This is semantically equivalent to using 8789 6831 * the DSQ iterator, but is lockfree. Of course, like any lockless operation, ··· 8793 6833 * 8794 6834 * Returns the pointer, or NULL indicates an empty queue OR internal error. 8795 6835 */ 8796 - __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id) 6836 + __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id, 6837 + const struct bpf_prog_aux *aux) 8797 6838 { 8798 6839 struct scx_sched *sch; 8799 6840 struct scx_dispatch_q *dsq; 8800 6841 8801 - sch = rcu_dereference(scx_root); 6842 + sch = scx_prog_sched(aux); 8802 6843 if (unlikely(!sch)) 8803 6844 return NULL; 8804 6845 ··· 8815 6854 } 8816 6855 8817 6856 return rcu_dereference(dsq->first_task); 6857 + } 6858 + 6859 + /** 6860 + * scx_bpf_dsq_reenq - Re-enqueue tasks on a DSQ 6861 + * @dsq_id: DSQ to re-enqueue 6862 + * @reenq_flags: %SCX_RENQ_* 6863 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6864 + * 6865 + * Iterate over all of the tasks currently enqueued on the DSQ identified by 6866 + * @dsq_id, and re-enqueue them in the BPF scheduler. 
The following DSQs are 6867 + * supported: 6868 + * 6869 + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu) 6870 + * - User DSQs 6871 + * 6872 + * Re-enqueues are performed asynchronously. Can be called from anywhere. 6873 + */ 6874 + __bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags, 6875 + const struct bpf_prog_aux *aux) 6876 + { 6877 + struct scx_sched *sch; 6878 + struct scx_dispatch_q *dsq; 6879 + 6880 + guard(preempt)(); 6881 + 6882 + sch = scx_prog_sched(aux); 6883 + if (unlikely(!sch)) 6884 + return; 6885 + 6886 + if (unlikely(reenq_flags & ~__SCX_REENQ_USER_MASK)) { 6887 + scx_error(sch, "invalid SCX_REENQ flags 0x%llx", reenq_flags); 6888 + return; 6889 + } 6890 + 6891 + /* not specifying any filter bits is the same as %SCX_REENQ_ANY */ 6892 + if (!(reenq_flags & __SCX_REENQ_FILTER_MASK)) 6893 + reenq_flags |= SCX_REENQ_ANY; 6894 + 6895 + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id()); 6896 + schedule_dsq_reenq(sch, dsq, reenq_flags, scx_locked_rq()); 6897 + } 6898 + 6899 + /** 6900 + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 6901 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 6902 + * 6903 + * Iterate over all of the tasks currently enqueued on the local DSQ of the 6904 + * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from 6905 + * anywhere. 6906 + * 6907 + * This is now a special case of scx_bpf_dsq_reenq() and may be removed in the 6908 + * future. 6909 + */ 6910 + __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux) 6911 + { 6912 + scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0, aux); 8818 6913 } 8819 6914 8820 6915 __bpf_kfunc_end_defs(); ··· 8927 6910 * @fmt: error message format string 8928 6911 * @data: format string parameters packaged using ___bpf_fill() macro 8929 6912 * @data__sz: @data len, must end in '__sz' for the verifier 6913 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8930 6914 * 8931 6915 * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops 8932 6916 * disabling. 8933 6917 */ 8934 6918 __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, 8935 - unsigned long long *data, u32 data__sz) 6919 + unsigned long long *data, u32 data__sz, 6920 + const struct bpf_prog_aux *aux) 8936 6921 { 8937 6922 struct scx_sched *sch; 8938 6923 unsigned long flags; 8939 6924 8940 6925 raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 8941 - sch = rcu_dereference_bh(scx_root); 6926 + sch = scx_prog_sched(aux); 8942 6927 if (likely(sch) && 8943 6928 bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 8944 6929 scx_exit(sch, SCX_EXIT_UNREG_BPF, exit_code, "%s", scx_exit_bstr_buf.line); ··· 8952 6933 * @fmt: error message format string 8953 6934 * @data: format string parameters packaged using ___bpf_fill() macro 8954 6935 * @data__sz: @data len, must end in '__sz' for the verifier 6936 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8955 6937 * 8956 6938 * Indicate that the BPF scheduler encountered a fatal error and initiate ops 8957 6939 * disabling. 
8958 6940 */ 8959 6941 __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, 8960 - u32 data__sz) 6942 + u32 data__sz, const struct bpf_prog_aux *aux) 8961 6943 { 8962 6944 struct scx_sched *sch; 8963 6945 unsigned long flags; 8964 6946 8965 6947 raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 8966 - sch = rcu_dereference_bh(scx_root); 6948 + sch = scx_prog_sched(aux); 8967 6949 if (likely(sch) && 8968 6950 bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 8969 6951 scx_exit(sch, SCX_EXIT_ERROR_BPF, 0, "%s", scx_exit_bstr_buf.line); ··· 8976 6956 * @fmt: format string 8977 6957 * @data: format string parameters packaged using ___bpf_fill() macro 8978 6958 * @data__sz: @data len, must end in '__sz' for the verifier 6959 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 8979 6960 * 8980 6961 * To be called through scx_bpf_dump() helper from ops.dump(), dump_cpu() and 8981 6962 * dump_task() to generate extra debug dump specific to the BPF scheduler. ··· 8985 6964 * multiple calls. The last line is automatically terminated. 8986 6965 */ 8987 6966 __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, 8988 - u32 data__sz) 6967 + u32 data__sz, const struct bpf_prog_aux *aux) 8989 6968 { 8990 6969 struct scx_sched *sch; 8991 6970 struct scx_dump_data *dd = &scx_dump_data; ··· 8994 6973 8995 6974 guard(rcu)(); 8996 6975 8997 - sch = rcu_dereference(scx_root); 6976 + sch = scx_prog_sched(aux); 8998 6977 if (unlikely(!sch)) 8999 6978 return; 9000 6979 ··· 9031 7010 } 9032 7011 9033 7012 /** 9034 - * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 9035 - * 9036 - * Iterate over all of the tasks currently enqueued on the local DSQ of the 9037 - * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from 9038 - * anywhere. 9039 - */ 9040 - __bpf_kfunc void scx_bpf_reenqueue_local___v2(void) 9041 - { 9042 - struct rq *rq; 9043 - 9044 - guard(preempt)(); 9045 - 9046 - rq = this_rq(); 9047 - local_set(&rq->scx.reenq_local_deferred, 1); 9048 - schedule_deferred(rq); 9049 - } 9050 - 9051 - /** 9052 7013 * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU 9053 7014 * @cpu: CPU of interest 7015 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9054 7016 * 9055 7017 * Return the maximum relative capacity of @cpu in relation to the most 9056 7018 * performant CPU in the system. The return value is in the range [1, 9057 7019 * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). 9058 7020 */ 9059 - __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) 7021 + __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux) 9060 7022 { 9061 7023 struct scx_sched *sch; 9062 7024 9063 7025 guard(rcu)(); 9064 7026 9065 - sch = rcu_dereference(scx_root); 7027 + sch = scx_prog_sched(aux); 9066 7028 if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) 9067 7029 return arch_scale_cpu_capacity(cpu); 9068 7030 else ··· 9055 7051 /** 9056 7052 * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU 9057 7053 * @cpu: CPU of interest 7054 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9058 7055 * 9059 7056 * Return the current relative performance of @cpu in relation to its maximum. 9060 7057 * The return value is in the range [1, %SCX_CPUPERF_ONE]. ··· 9067 7062 * 9068 7063 * The result is in the range [1, %SCX_CPUPERF_ONE]. 
9069 7064 */ 9070 - __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) 7065 + __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux) 9071 7066 { 9072 7067 struct scx_sched *sch; 9073 7068 9074 7069 guard(rcu)(); 9075 7070 9076 - sch = rcu_dereference(scx_root); 7071 + sch = scx_prog_sched(aux); 9077 7072 if (likely(sch) && ops_cpu_valid(sch, cpu, NULL)) 9078 7073 return arch_scale_freq_capacity(cpu); 9079 7074 else ··· 9084 7079 * scx_bpf_cpuperf_set - Set the relative performance target of a CPU 9085 7080 * @cpu: CPU of interest 9086 7081 * @perf: target performance level [0, %SCX_CPUPERF_ONE] 7082 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9087 7083 * 9088 7084 * Set the target performance level of @cpu to @perf. @perf is in linear 9089 7085 * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the ··· 9095 7089 * use. Consult hardware and cpufreq documentation for more information. The 9096 7090 * current performance level can be monitored using scx_bpf_cpuperf_cur(). 9097 7091 */ 9098 - __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) 7092 + __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_aux *aux) 9099 7093 { 9100 7094 struct scx_sched *sch; 9101 7095 9102 7096 guard(rcu)(); 9103 7097 9104 - sch = rcu_dereference(scx_root); 7098 + sch = scx_prog_sched(aux); 9105 7099 if (unlikely(!sch)) 9106 7100 return; 9107 7101 ··· 9211 7205 /** 9212 7206 * scx_bpf_cpu_rq - Fetch the rq of a CPU 9213 7207 * @cpu: CPU of the rq 7208 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9214 7209 */ 9215 - __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) 7210 + __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu, const struct bpf_prog_aux *aux) 9216 7211 { 9217 7212 struct scx_sched *sch; 9218 7213 9219 7214 guard(rcu)(); 9220 7215 9221 - sch = rcu_dereference(scx_root); 7216 + sch = scx_prog_sched(aux); 9222 7217 if (unlikely(!sch)) 9223 7218 return NULL; 9224 7219 ··· 9238 7231 9239 7232 /** 9240 7233 * scx_bpf_locked_rq - Return the rq currently locked by SCX 7234 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9241 7235 * 9242 7236 * Returns the rq if a rq lock is currently held by SCX. 9243 7237 * Otherwise emits an error and returns NULL. 9244 7238 */ 9245 - __bpf_kfunc struct rq *scx_bpf_locked_rq(void) 7239 + __bpf_kfunc struct rq *scx_bpf_locked_rq(const struct bpf_prog_aux *aux) 9246 7240 { 9247 7241 struct scx_sched *sch; 9248 7242 struct rq *rq; 9249 7243 9250 7244 guard(preempt)(); 9251 7245 9252 - sch = rcu_dereference_sched(scx_root); 7246 + sch = scx_prog_sched(aux); 9253 7247 if (unlikely(!sch)) 9254 7248 return NULL; 9255 7249 ··· 9266 7258 /** 9267 7259 * scx_bpf_cpu_curr - Return remote CPU's curr task 9268 7260 * @cpu: CPU of interest 7261 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 9269 7262 * 9270 7263 * Callers must hold RCU read lock (KF_RCU). 
9271 7264 */ 9272 - __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu) 7265 + __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_aux *aux) 9273 7266 { 9274 7267 struct scx_sched *sch; 9275 7268 9276 7269 guard(rcu)(); 9277 7270 9278 - sch = rcu_dereference(scx_root); 7271 + sch = scx_prog_sched(aux); 9279 7272 if (unlikely(!sch)) 9280 7273 return NULL; 9281 7274 ··· 9285 7276 9286 7277 return rcu_dereference(cpu_rq(cpu)->curr); 9287 7278 } 9288 - 9289 - /** 9290 - * scx_bpf_task_cgroup - Return the sched cgroup of a task 9291 - * @p: task of interest 9292 - * 9293 - * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with 9294 - * from the scheduler's POV. SCX operations should use this function to 9295 - * determine @p's current cgroup as, unlike following @p->cgroups, 9296 - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all 9297 - * rq-locked operations. Can be called on the parameter tasks of rq-locked 9298 - * operations. The restriction guarantees that @p's rq is locked by the caller. 9299 - */ 9300 - #ifdef CONFIG_CGROUP_SCHED 9301 - __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) 9302 - { 9303 - struct task_group *tg = p->sched_task_group; 9304 - struct cgroup *cgrp = &cgrp_dfl_root.cgrp; 9305 - struct scx_sched *sch; 9306 - 9307 - guard(rcu)(); 9308 - 9309 - sch = rcu_dereference(scx_root); 9310 - if (unlikely(!sch)) 9311 - goto out; 9312 - 9313 - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p)) 9314 - goto out; 9315 - 9316 - cgrp = tg_cgrp(tg); 9317 - 9318 - out: 9319 - cgroup_get(cgrp); 9320 - return cgrp; 9321 - } 9322 - #endif 9323 7279 9324 7280 /** 9325 7281 * scx_bpf_now - Returns a high-performance monotonically non-decreasing ··· 9362 7388 scx_agg_event(events, e_cpu, SCX_EV_DISPATCH_KEEP_LAST); 9363 7389 scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_EXITING); 9364 7390 scx_agg_event(events, e_cpu, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED); 7391 + scx_agg_event(events, e_cpu, SCX_EV_REENQ_IMMED); 7392 + scx_agg_event(events, e_cpu, SCX_EV_REENQ_LOCAL_REPEAT); 9365 7393 scx_agg_event(events, e_cpu, SCX_EV_REFILL_SLICE_DFL); 9366 7394 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DURATION); 9367 7395 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_DISPATCH); 9368 7396 scx_agg_event(events, e_cpu, SCX_EV_BYPASS_ACTIVATE); 7397 + scx_agg_event(events, e_cpu, SCX_EV_INSERT_NOT_OWNED); 7398 + scx_agg_event(events, e_cpu, SCX_EV_SUB_BYPASS_DISPATCH); 9369 7399 } 9370 7400 } 9371 7401 ··· 9403 7425 memcpy(events, &e_sys, events__sz); 9404 7426 } 9405 7427 7428 + #ifdef CONFIG_CGROUP_SCHED 7429 + /** 7430 + * scx_bpf_task_cgroup - Return the sched cgroup of a task 7431 + * @p: task of interest 7432 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 7433 + * 7434 + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with 7435 + * from the scheduler's POV. SCX operations should use this function to 7436 + * determine @p's current cgroup as, unlike following @p->cgroups, 7437 + * @p->sched_task_group is stable for the duration of the SCX op. See 7438 + * SCX_CALL_OP_TASK() for details. 
7439 + */ 7440 + __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, 7441 + const struct bpf_prog_aux *aux) 7442 + { 7443 + struct task_group *tg = p->sched_task_group; 7444 + struct cgroup *cgrp = &cgrp_dfl_root.cgrp; 7445 + struct scx_sched *sch; 7446 + 7447 + guard(rcu)(); 7448 + 7449 + sch = scx_prog_sched(aux); 7450 + if (unlikely(!sch)) 7451 + goto out; 7452 + 7453 + if (!scx_kf_arg_task_ok(sch, p)) 7454 + goto out; 7455 + 7456 + cgrp = tg_cgrp(tg); 7457 + 7458 + out: 7459 + cgroup_get(cgrp); 7460 + return cgrp; 7461 + } 7462 + #endif /* CONFIG_CGROUP_SCHED */ 7463 + 9406 7464 __bpf_kfunc_end_defs(); 9407 7465 9408 7466 BTF_KFUNCS_START(scx_kfunc_ids_any) 9409 - BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_RCU); 9410 - BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_RCU); 9411 - BTF_ID_FLAGS(func, scx_bpf_kick_cpu) 7467 + BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU); 7468 + BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU); 7469 + BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS) 9412 7470 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued) 9413 7471 BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) 9414 - BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_RCU_PROTECTED | KF_RET_NULL) 9415 - BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED) 7472 + BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL) 7473 + BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS) 7474 + BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS) 7475 + BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_IMPLICIT_ARGS | KF_ITER_NEW | KF_RCU_PROTECTED) 9416 7476 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL) 9417 7477 BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY) 9418 - BTF_ID_FLAGS(func, scx_bpf_exit_bstr) 9419 - BTF_ID_FLAGS(func, scx_bpf_error_bstr) 9420 - BTF_ID_FLAGS(func, scx_bpf_dump_bstr) 9421 - BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2) 9422 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) 9423 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) 9424 - BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) 7478 + BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_IMPLICIT_ARGS) 7479 + BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_IMPLICIT_ARGS) 7480 + BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS) 7481 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS) 7482 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS) 7483 + BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS) 9425 7484 BTF_ID_FLAGS(func, scx_bpf_nr_node_ids) 9426 7485 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) 9427 7486 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) ··· 9466 7451 BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE) 9467 7452 BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU) 9468 7453 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU) 9469 - BTF_ID_FLAGS(func, scx_bpf_cpu_rq) 9470 - BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL) 9471 - BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED) 9472 - #ifdef CONFIG_CGROUP_SCHED 9473 - BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE) 9474 - #endif 7454 + BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS) 7455 + BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL) 7456 + BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED) 9475 7457 BTF_ID_FLAGS(func, scx_bpf_now) 9476 7458 BTF_ID_FLAGS(func, scx_bpf_events) 7459 + #ifdef CONFIG_CGROUP_SCHED 7460 + 
BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE) 7461 + #endif 9477 7462 BTF_KFUNCS_END(scx_kfunc_ids_any) 9478 7463 9479 7464 static const struct btf_kfunc_id_set scx_kfunc_set_any = { 9480 7465 .owner = THIS_MODULE, 9481 7466 .set = &scx_kfunc_ids_any, 9482 7467 }; 7468 + 7469 + /* 7470 + * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc 7471 + * group; an op may permit zero or more groups, with the union expressed in 7472 + * scx_kf_allow_flags[]. The verifier-time filter (scx_kfunc_context_filter()) 7473 + * consults this table to decide whether a context-sensitive kfunc is callable 7474 + * from a given SCX op. 7475 + */ 7476 + enum scx_kf_allow_flags { 7477 + SCX_KF_ALLOW_UNLOCKED = 1 << 0, 7478 + SCX_KF_ALLOW_CPU_RELEASE = 1 << 1, 7479 + SCX_KF_ALLOW_DISPATCH = 1 << 2, 7480 + SCX_KF_ALLOW_ENQUEUE = 1 << 3, 7481 + SCX_KF_ALLOW_SELECT_CPU = 1 << 4, 7482 + }; 7483 + 7484 + /* 7485 + * Map each SCX op to the union of kfunc groups it permits, indexed by 7486 + * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not 7487 + * context-sensitive. 7488 + */ 7489 + static const u32 scx_kf_allow_flags[] = { 7490 + [SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, 7491 + [SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, 7492 + [SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH, 7493 + [SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE, 7494 + [SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED, 7495 + [SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED, 7496 + #ifdef CONFIG_EXT_GROUP_SCHED 7497 + [SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED, 7498 + [SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED, 7499 + [SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED, 7500 + [SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED, 7501 + [SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED, 7502 + [SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED, 7503 + [SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED, 7504 + #endif /* CONFIG_EXT_GROUP_SCHED */ 7505 + [SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED, 7506 + [SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED, 7507 + [SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED, 7508 + [SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED, 7509 + [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED, 7510 + [SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED, 7511 + }; 7512 + 7513 + /* 7514 + * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the 7515 + * .filter field on each per-group btf_kfunc_id_set. The BPF core invokes this 7516 + * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or 7517 + * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the 7518 + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. 7519 + * scx_kfunc_ids_any) by falling through to "allow" when none of the 7520 + * context-sensitive sets contain the kfunc. 
7521 + */ 7522 + int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) 7523 + { 7524 + bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id); 7525 + bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id); 7526 + bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); 7527 + bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); 7528 + bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); 7529 + u32 moff, flags; 7530 + 7531 + /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ 7532 + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) 7533 + return 0; 7534 + 7535 + /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ 7536 + if (prog->type == BPF_PROG_TYPE_SYSCALL) 7537 + return (in_unlocked || in_select_cpu) ? 0 : -EACCES; 7538 + 7539 + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) 7540 + return -EACCES; 7541 + 7542 + /* 7543 + * add_subprog_and_kfunc() collects all kfunc calls, including dead code 7544 + * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets 7545 + * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set; 7546 + * do_check_main() re-runs the filter with st_ops set and enforces the 7547 + * actual restrictions. 7548 + */ 7549 + if (!prog->aux->st_ops) 7550 + return 0; 7551 + 7552 + /* 7553 + * Non-SCX struct_ops: only unlocked kfuncs are safe. The other 7554 + * context-sensitive kfuncs assume the rq lock is held by the SCX 7555 + * dispatch path, which doesn't apply to other struct_ops users. 7556 + */ 7557 + if (prog->aux->st_ops != &bpf_sched_ext_ops) 7558 + return in_unlocked ? 0 : -EACCES; 7559 + 7560 + /* SCX struct_ops: check the per-op allow list. */ 7561 + moff = prog->aux->attach_st_ops_member_off; 7562 + flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; 7563 + 7564 + if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked) 7565 + return 0; 7566 + if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release) 7567 + return 0; 7568 + if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch) 7569 + return 0; 7570 + if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue) 7571 + return 0; 7572 + if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu) 7573 + return 0; 7574 + 7575 + return -EACCES; 7576 + } 9483 7577 9484 7578 static int __init scx_init(void) 9485 7579 { ··· 9599 7475 * register_btf_kfunc_id_set() needs most of the system to be up. 9600 7476 * 9601 7477 * Some kfuncs are context-sensitive and can only be called from 9602 - * specific SCX ops. They are grouped into BTF sets accordingly. 9603 - * Unfortunately, BPF currently doesn't have a way of enforcing such 9604 - * restrictions. Eventually, the verifier should be able to enforce 9605 - * them. For now, register them the same and make each kfunc explicitly 9606 - * check using scx_kf_allowed(). 7478 + * specific SCX ops. They are grouped into per-context BTF sets, each 7479 + * registered with scx_kfunc_context_filter as its .filter callback. The 7480 + * BPF core dedups identical filter pointers per hook 7481 + * (btf_populate_kfunc_set()), so the filter is invoked exactly once per 7482 + * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op 7483 + * restrictions at verify time. 9607 7484 */ 9608 7485 if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, 9609 7486 &scx_kfunc_set_enqueue_dispatch)) ||
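A note on usage: the @aux parameters above are hidden implicit arguments, so BPF programs keep calling these kfuncs with their original argument lists. The sketch below is illustrative only; it assumes the usual scx/common.bpf.h kfunc declarations and a hypothetical scheduler name, and shows how ops.cpu_release() might use the generalized scx_bpf_dsq_reenq(). Since the kfunc is registered in scx_kfunc_ids_any, the verifier-time filter allows it from any op.

/*
 * Illustrative sketch, not part of the diff: when a higher-priority sched
 * class takes over a CPU, push everything still queued on that CPU's local
 * DSQ back to the BPF scheduler through ops.enqueue(). The implicit @aux
 * argument is supplied by the kernel and never written in BPF code.
 */
void BPF_STRUCT_OPS(example_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
	/*
	 * SCX_DSQ_LOCAL names the caller's CPU, i.e. the CPU being released.
	 * SCX_REENQ_ANY re-enqueues every queued task; leaving the filter
	 * bits clear defaults to the same behavior.
	 */
	scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, SCX_REENQ_ANY);
}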
+2 -2
kernel/sched/ext.h
··· 11 11 void scx_tick(struct rq *rq); 12 12 void init_scx_entity(struct sched_ext_entity *scx); 13 13 void scx_pre_fork(struct task_struct *p); 14 - int scx_fork(struct task_struct *p); 14 + int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs); 15 15 void scx_post_fork(struct task_struct *p); 16 16 void scx_cancel_fork(struct task_struct *p); 17 17 bool scx_can_stop_tick(struct rq *rq); ··· 44 44 45 45 static inline void scx_tick(struct rq *rq) {} 46 46 static inline void scx_pre_fork(struct task_struct *p) {} 47 - static inline int scx_fork(struct task_struct *p) { return 0; } 47 + static inline int scx_fork(struct task_struct *p, struct kernel_clone_args *kargs) { return 0; } 48 48 static inline void scx_post_fork(struct task_struct *p) {} 49 49 static inline void scx_cancel_fork(struct task_struct *p) {} 50 50 static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
+132 -67
kernel/sched/ext_idle.c
··· 368 368 369 369 /* 370 370 * Enable NUMA optimization only when there are multiple NUMA domains 371 - * among the online CPUs and the NUMA domains don't perfectly overlaps 371 + * among the online CPUs and the NUMA domains don't perfectly overlap 372 372 * with the LLC domains. 373 373 * 374 374 * If all CPUs belong to the same NUMA node and the same LLC domain, ··· 424 424 * - prefer the last used CPU to take advantage of cached data (L1, L2) and 425 425 * branch prediction optimizations. 426 426 * 427 - * 3. Pick a CPU within the same LLC (Last-Level Cache): 427 + * 3. Prefer @prev_cpu's SMT sibling: 428 + * - if @prev_cpu is busy and no fully idle core is available, try to 429 + * place the task on an idle SMT sibling of @prev_cpu; keeping the 430 + * task on the same core makes migration cheaper, preserves L1 cache 431 + * locality and reduces wakeup latency. 432 + * 433 + * 4. Pick a CPU within the same LLC (Last-Level Cache): 428 434 * - if the above conditions aren't met, pick a CPU that shares the same 429 435 * LLC, if the LLC domain is a subset of @cpus_allowed, to maintain 430 436 * cache locality. 431 437 * 432 - * 4. Pick a CPU within the same NUMA node, if enabled: 438 + * 5. Pick a CPU within the same NUMA node, if enabled: 433 439 * - choose a CPU from the same NUMA node, if the node cpumask is a 434 440 * subset of @cpus_allowed, to reduce memory access latency. 435 441 * 436 - * 5. Pick any idle CPU within the @cpus_allowed domain. 442 + * 6. Pick any idle CPU within the @cpus_allowed domain. 437 443 * 438 - * Step 3 and 4 are performed only if the system has, respectively, 444 + * Step 4 and 5 are performed only if the system has, respectively, 439 445 * multiple LLCs / multiple NUMA nodes (see scx_selcpu_topo_llc and 440 446 * scx_selcpu_topo_numa) and they don't contain the same subset of CPUs. 441 447 * ··· 622 616 goto out_unlock; 623 617 } 624 618 619 + #ifdef CONFIG_SCHED_SMT 620 + /* 621 + * Use @prev_cpu's sibling if it's idle. 622 + */ 623 + if (sched_smt_active()) { 624 + for_each_cpu_and(cpu, cpu_smt_mask(prev_cpu), allowed) { 625 + if (cpu == prev_cpu) 626 + continue; 627 + if (scx_idle_test_and_clear_cpu(cpu)) 628 + goto out_unlock; 629 + } 630 + } 631 + #endif 632 + 625 633 /* 626 634 * Search for any idle CPU in the same LLC domain. 627 635 */ ··· 787 767 * either enqueue() sees the idle bit or update_idle() sees the task 788 768 * that enqueue() queued. 789 769 */ 790 - if (SCX_HAS_OP(sch, update_idle) && do_notify && !scx_rq_bypassing(rq)) 791 - SCX_CALL_OP(sch, SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); 770 + if (SCX_HAS_OP(sch, update_idle) && do_notify && 771 + !scx_bypassing(sch, cpu_of(rq))) 772 + SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle); 792 773 } 793 774 794 775 static void reset_idle_masks(struct sched_ext_ops *ops) ··· 913 892 s32 prev_cpu, u64 wake_flags, 914 893 const struct cpumask *allowed, u64 flags) 915 894 { 916 - struct rq *rq; 917 - struct rq_flags rf; 895 + unsigned long irq_flags; 896 + bool we_locked = false; 918 897 s32 cpu; 919 898 920 899 if (!ops_cpu_valid(sch, prev_cpu, NULL)) ··· 924 903 return -EBUSY; 925 904 926 905 /* 927 - * If called from an unlocked context, acquire the task's rq lock, 928 - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. 906 + * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq 907 + * lock or @p's pi_lock. Three cases: 929 908 * 930 - * Otherwise, allow to use this kfunc only from ops.select_cpu() 931 - * and ops.select_enqueue(). 
909 + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. 910 + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. 911 + * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): 912 + * nothing held, take pi_lock ourselves. 932 913 */ 933 - if (scx_kf_allowed_if_unlocked()) { 934 - rq = task_rq_lock(p, &rf); 935 - } else { 936 - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) 937 - return -EPERM; 938 - rq = scx_locked_rq(); 939 - } 940 - 941 - /* 942 - * Validate locking correctness to access p->cpus_ptr and 943 - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; 944 - * otherwise, assert that p->pi_lock is held. 945 - */ 946 - if (!rq) 914 + if (this_rq()->scx.in_select_cpu) { 947 915 lockdep_assert_held(&p->pi_lock); 916 + } else if (!scx_locked_rq()) { 917 + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); 918 + we_locked = true; 919 + } 948 920 949 921 /* 950 922 * This may also be called from ops.enqueue(), so we need to handle ··· 956 942 allowed ?: p->cpus_ptr, flags); 957 943 } 958 944 959 - if (scx_kf_allowed_if_unlocked()) 960 - task_rq_unlock(rq, p, &rf); 945 + if (we_locked) 946 + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); 961 947 962 948 return cpu; 963 949 } ··· 966 952 * scx_bpf_cpu_node - Return the NUMA node the given @cpu belongs to, or 967 953 * trigger an error if @cpu is invalid 968 954 * @cpu: target CPU 955 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 969 956 */ 970 - __bpf_kfunc int scx_bpf_cpu_node(s32 cpu) 957 + __bpf_kfunc s32 scx_bpf_cpu_node(s32 cpu, const struct bpf_prog_aux *aux) 971 958 { 972 959 struct scx_sched *sch; 973 960 974 961 guard(rcu)(); 975 962 976 - sch = rcu_dereference(scx_root); 963 + sch = scx_prog_sched(aux); 977 964 if (unlikely(!sch) || !ops_cpu_valid(sch, cpu, NULL)) 978 965 return NUMA_NO_NODE; 979 966 return cpu_to_node(cpu); ··· 986 971 * @prev_cpu: CPU @p was on previously 987 972 * @wake_flags: %SCX_WAKE_* flags 988 973 * @is_idle: out parameter indicating whether the returned CPU is idle 974 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 989 975 * 990 976 * Can be called from ops.select_cpu(), ops.enqueue(), or from an unlocked 991 977 * context such as a BPF test_run() call, as long as built-in CPU selection ··· 997 981 * currently idle and thus a good candidate for direct dispatching. 998 982 */ 999 983 __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, 1000 - u64 wake_flags, bool *is_idle) 984 + u64 wake_flags, bool *is_idle, 985 + const struct bpf_prog_aux *aux) 1001 986 { 1002 987 struct scx_sched *sch; 1003 988 s32 cpu; 1004 989 1005 990 guard(rcu)(); 1006 991 1007 - sch = rcu_dereference(scx_root); 992 + sch = scx_prog_sched(aux); 1008 993 if (unlikely(!sch)) 1009 994 return -ENODEV; 1010 995 ··· 1033 1016 * @args->prev_cpu: CPU @p was on previously 1034 1017 * @args->wake_flags: %SCX_WAKE_* flags 1035 1018 * @args->flags: %SCX_PICK_IDLE* flags 1019 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1036 1020 * 1037 1021 * Wrapper kfunc that takes arguments via struct to work around BPF's 5 argument 1038 1022 * limit. 
BPF programs should use scx_bpf_select_cpu_and() which is provided ··· 1052 1034 */ 1053 1035 __bpf_kfunc s32 1054 1036 __scx_bpf_select_cpu_and(struct task_struct *p, const struct cpumask *cpus_allowed, 1055 - struct scx_bpf_select_cpu_and_args *args) 1037 + struct scx_bpf_select_cpu_and_args *args, 1038 + const struct bpf_prog_aux *aux) 1056 1039 { 1057 1040 struct scx_sched *sch; 1058 1041 1059 1042 guard(rcu)(); 1060 1043 1061 - sch = rcu_dereference(scx_root); 1044 + sch = scx_prog_sched(aux); 1062 1045 if (unlikely(!sch)) 1063 1046 return -ENODEV; 1064 1047 ··· 1081 1062 if (unlikely(!sch)) 1082 1063 return -ENODEV; 1083 1064 1065 + #ifdef CONFIG_EXT_SUB_SCHED 1066 + /* 1067 + * Disallow if any sub-scheds are attached. There is no way to tell 1068 + * which scheduler called us, just error out @p's scheduler. 1069 + */ 1070 + if (unlikely(!list_empty(&sch->children))) { 1071 + scx_error(scx_task_sched(p), "__scx_bpf_select_cpu_and() must be used"); 1072 + return -EINVAL; 1073 + } 1074 + #endif 1075 + 1084 1076 return select_cpu_from_kfunc(sch, p, prev_cpu, wake_flags, 1085 1077 cpus_allowed, flags); 1086 1078 } ··· 1100 1070 * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the 1101 1071 * idle-tracking per-CPU cpumask of a target NUMA node. 1102 1072 * @node: target NUMA node 1073 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1103 1074 * 1104 1075 * Returns an empty cpumask if idle tracking is not enabled, if @node is 1105 1076 * not valid, or running on a UP kernel. In this case the actual error will 1106 1077 * be reported to the BPF scheduler via scx_error(). 1107 1078 */ 1108 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask_node(int node) 1079 + __bpf_kfunc const struct cpumask * 1080 + scx_bpf_get_idle_cpumask_node(s32 node, const struct bpf_prog_aux *aux) 1109 1081 { 1110 1082 struct scx_sched *sch; 1111 1083 1112 1084 guard(rcu)(); 1113 1085 1114 - sch = rcu_dereference(scx_root); 1086 + sch = scx_prog_sched(aux); 1115 1087 if (unlikely(!sch)) 1116 1088 return cpu_none_mask; 1117 1089 ··· 1127 1095 /** 1128 1096 * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking 1129 1097 * per-CPU cpumask. 1098 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1130 1099 * 1131 1100 * Returns an empty mask if idle tracking is not enabled, or running on a 1132 1101 * UP kernel. 1133 1102 */ 1134 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) 1103 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(const struct bpf_prog_aux *aux) 1135 1104 { 1136 1105 struct scx_sched *sch; 1137 1106 1138 1107 guard(rcu)(); 1139 1108 1140 - sch = rcu_dereference(scx_root); 1109 + sch = scx_prog_sched(aux); 1141 1110 if (unlikely(!sch)) 1142 1111 return cpu_none_mask; 1143 1112 ··· 1158 1125 * idle-tracking, per-physical-core cpumask of a target NUMA node. Can be 1159 1126 * used to determine if an entire physical core is free. 1160 1127 * @node: target NUMA node 1128 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1161 1129 * 1162 1130 * Returns an empty cpumask if idle tracking is not enabled, if @node is 1163 1131 * not valid, or running on a UP kernel. In this case the actual error will 1164 1132 * be reported to the BPF scheduler via scx_error(). 
1165 1133 */ 1166 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask_node(int node) 1134 + __bpf_kfunc const struct cpumask * 1135 + scx_bpf_get_idle_smtmask_node(s32 node, const struct bpf_prog_aux *aux) 1167 1136 { 1168 1137 struct scx_sched *sch; 1169 1138 1170 1139 guard(rcu)(); 1171 1140 1172 - sch = rcu_dereference(scx_root); 1141 + sch = scx_prog_sched(aux); 1173 1142 if (unlikely(!sch)) 1174 1143 return cpu_none_mask; 1175 1144 ··· 1189 1154 * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, 1190 1155 * per-physical-core cpumask. Can be used to determine if an entire physical 1191 1156 * core is free. 1157 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1192 1158 * 1193 1159 * Returns an empty mask if idle tracking is not enabled, or running on a 1194 1160 * UP kernel. 1195 1161 */ 1196 - __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) 1162 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(const struct bpf_prog_aux *aux) 1197 1163 { 1198 1164 struct scx_sched *sch; 1199 1165 1200 1166 guard(rcu)(); 1201 1167 1202 - sch = rcu_dereference(scx_root); 1168 + sch = scx_prog_sched(aux); 1203 1169 if (unlikely(!sch)) 1204 1170 return cpu_none_mask; 1205 1171 ··· 1236 1200 /** 1237 1201 * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state 1238 1202 * @cpu: cpu to test and clear idle for 1203 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1239 1204 * 1240 1205 * Returns %true if @cpu was idle and its idle state was successfully cleared. 1241 1206 * %false otherwise. ··· 1244 1207 * Unavailable if ops.update_idle() is implemented and 1245 1208 * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. 1246 1209 */ 1247 - __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) 1210 + __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu, const struct bpf_prog_aux *aux) 1248 1211 { 1249 1212 struct scx_sched *sch; 1250 1213 1251 1214 guard(rcu)(); 1252 1215 1253 - sch = rcu_dereference(scx_root); 1216 + sch = scx_prog_sched(aux); 1254 1217 if (unlikely(!sch)) 1255 1218 return false; 1256 1219 ··· 1268 1231 * @cpus_allowed: Allowed cpumask 1269 1232 * @node: target NUMA node 1270 1233 * @flags: %SCX_PICK_IDLE_* flags 1234 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1271 1235 * 1272 1236 * Pick and claim an idle cpu in @cpus_allowed from the NUMA node @node. 1273 1237 * ··· 1284 1246 * %SCX_OPS_BUILTIN_IDLE_PER_NODE is not set. 1285 1247 */ 1286 1248 __bpf_kfunc s32 scx_bpf_pick_idle_cpu_node(const struct cpumask *cpus_allowed, 1287 - int node, u64 flags) 1249 + s32 node, u64 flags, 1250 + const struct bpf_prog_aux *aux) 1288 1251 { 1289 1252 struct scx_sched *sch; 1290 1253 1291 1254 guard(rcu)(); 1292 1255 1293 - sch = rcu_dereference(scx_root); 1256 + sch = scx_prog_sched(aux); 1294 1257 if (unlikely(!sch)) 1295 1258 return -ENODEV; 1296 1259 ··· 1306 1267 * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu 1307 1268 * @cpus_allowed: Allowed cpumask 1308 1269 * @flags: %SCX_PICK_IDLE_CPU_* flags 1270 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1309 1271 * 1310 1272 * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu 1311 1273 * number on success. -%EBUSY if no matching cpu was found. ··· 1326 1286 * scx_bpf_pick_idle_cpu_node() instead. 
1327 1287 */ 1328 1288 __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, 1329 - u64 flags) 1289 + u64 flags, const struct bpf_prog_aux *aux) 1330 1290 { 1331 1291 struct scx_sched *sch; 1332 1292 1333 1293 guard(rcu)(); 1334 1294 1335 - sch = rcu_dereference(scx_root); 1295 + sch = scx_prog_sched(aux); 1336 1296 if (unlikely(!sch)) 1337 1297 return -ENODEV; 1338 1298 ··· 1353 1313 * @cpus_allowed: Allowed cpumask 1354 1314 * @node: target NUMA node 1355 1315 * @flags: %SCX_PICK_IDLE_CPU_* flags 1316 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1356 1317 * 1357 1318 * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any 1358 1319 * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu ··· 1370 1329 * CPU. 1371 1330 */ 1372 1331 __bpf_kfunc s32 scx_bpf_pick_any_cpu_node(const struct cpumask *cpus_allowed, 1373 - int node, u64 flags) 1332 + s32 node, u64 flags, 1333 + const struct bpf_prog_aux *aux) 1374 1334 { 1375 1335 struct scx_sched *sch; 1376 1336 s32 cpu; 1377 1337 1378 1338 guard(rcu)(); 1379 1339 1380 - sch = rcu_dereference(scx_root); 1340 + sch = scx_prog_sched(aux); 1381 1341 if (unlikely(!sch)) 1382 1342 return -ENODEV; 1383 1343 ··· 1404 1362 * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU 1405 1363 * @cpus_allowed: Allowed cpumask 1406 1364 * @flags: %SCX_PICK_IDLE_CPU_* flags 1365 + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs 1407 1366 * 1408 1367 * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any 1409 1368 * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu ··· 1419 1376 * scx_bpf_pick_any_cpu_node() instead. 1420 1377 */ 1421 1378 __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, 1422 - u64 flags) 1379 + u64 flags, const struct bpf_prog_aux *aux) 1423 1380 { 1424 1381 struct scx_sched *sch; 1425 1382 s32 cpu; 1426 1383 1427 1384 guard(rcu)(); 1428 1385 1429 - sch = rcu_dereference(scx_root); 1386 + sch = scx_prog_sched(aux); 1430 1387 if (unlikely(!sch)) 1431 1388 return -ENODEV; 1432 1389 ··· 1451 1408 __bpf_kfunc_end_defs(); 1452 1409 1453 1410 BTF_KFUNCS_START(scx_kfunc_ids_idle) 1454 - BTF_ID_FLAGS(func, scx_bpf_cpu_node) 1455 - BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_ACQUIRE) 1456 - BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE) 1457 - BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_ACQUIRE) 1458 - BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE) 1411 + BTF_ID_FLAGS(func, scx_bpf_cpu_node, KF_IMPLICIT_ARGS) 1412 + BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1413 + BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1414 + BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1415 + BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_IMPLICIT_ARGS | KF_ACQUIRE) 1459 1416 BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE) 1460 - BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle) 1461 - BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_RCU) 1462 - BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU) 1463 - BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_RCU) 1464 - BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU) 1465 - BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_RCU) 1466 - BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 1467 - BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU) 1417 + BTF_ID_FLAGS(func, 
scx_bpf_test_and_clear_cpu_idle, KF_IMPLICIT_ARGS) 1418 + BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) 1419 + BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU) 1420 + BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU) 1421 + BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU) 1468 1422 BTF_KFUNCS_END(scx_kfunc_ids_idle) 1469 1423 1470 1424 static const struct btf_kfunc_id_set scx_kfunc_set_idle = { 1471 1425 .owner = THIS_MODULE, 1472 1426 .set = &scx_kfunc_ids_idle, 1427 + }; 1428 + 1429 + /* 1430 + * The select_cpu kfuncs internally call task_rq_lock() when invoked from an 1431 + * rq-unlocked context, and thus cannot be safely called from arbitrary tracing 1432 + * contexts where @p's pi_lock state is unknown. Keep them out of 1433 + * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed 1434 + * only to STRUCT_OPS and SYSCALL programs. 1435 + * 1436 + * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because 1437 + * they're callable from unlocked contexts in addition to ops.select_cpu() and 1438 + * ops.enqueue(). 1439 + */ 1440 + BTF_KFUNCS_START(scx_kfunc_ids_select_cpu) 1441 + BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU) 1442 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU) 1443 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU) 1444 + BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) 1445 + 1446 + static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { 1447 + .owner = THIS_MODULE, 1448 + .set = &scx_kfunc_ids_select_cpu, 1449 + .filter = scx_kfunc_context_filter, 1473 1450 }; 1474 1451 1475 1452 int scx_idle_init(void) ··· 1498 1435 1499 1436 ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) || 1500 1437 register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) || 1501 - register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle); 1438 + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle) || 1439 + register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) || 1440 + register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_select_cpu); 1502 1441 1503 1442 return ret; 1504 1443 }
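For reference, a minimal sketch of how a BPF scheduler typically consumes the idle-selection kfuncs above (assuming the usual scx/common.bpf.h declarations; the scheduler name is hypothetical). The built-in path now also probes @prev_cpu's idle SMT sibling before widening the search to the LLC and NUMA node:

s32 BPF_STRUCT_OPS(example_select_cpu, struct task_struct *p, s32 prev_cpu,
		   u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu;

	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
	if (is_idle)
		/* an idle CPU was claimed; dispatch directly to its local DSQ */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

	return cpu;
}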
+2
kernel/sched/ext_idle.h
··· 12 12 13 13 struct sched_ext_ops; 14 14 15 + extern struct btf_id_set8 scx_kfunc_ids_select_cpu; 16 + 15 17 void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); 16 18 void scx_idle_init_masks(void); 17 19
+321 -23
kernel/sched/ext_internal.h
··· 6 6 * Copyright (c) 2025 Tejun Heo <tj@kernel.org> 7 7 */ 8 8 #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) 9 + #define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) 9 10 10 11 enum scx_consts { 11 12 SCX_DSP_DFL_MAX_BATCH = 32, ··· 25 24 */ 26 25 SCX_TASK_ITER_BATCH = 32, 27 26 27 + SCX_BYPASS_HOST_NTH = 2, 28 + 28 29 SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC, 29 30 SCX_BYPASS_LB_DONOR_PCT = 125, 30 31 SCX_BYPASS_LB_MIN_DELTA_DIV = 4, 31 32 SCX_BYPASS_LB_BATCH = 256, 33 + 34 + SCX_REENQ_LOCAL_MAX_REPEAT = 256, 35 + 36 + SCX_SUB_MAX_DEPTH = 4, 32 37 }; 33 38 34 39 enum scx_exit_kind { ··· 45 38 SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ 46 39 SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ 47 40 SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */ 41 + SCX_EXIT_PARENT, /* parent exiting */ 48 42 49 43 SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ 50 44 SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ ··· 70 62 enum scx_exit_code { 71 63 /* Reasons */ 72 64 SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, 65 + SCX_ECODE_RSN_CGROUP_OFFLINE = 2LLU << 32, 73 66 74 67 /* Actions */ 75 68 SCX_ECODE_ACT_RESTART = 1LLU << 48, ··· 184 175 SCX_OPS_BUILTIN_IDLE_PER_NODE = 1LLU << 6, 185 176 186 177 /* 187 - * CPU cgroup support flags 178 + * If set, %SCX_ENQ_IMMED is assumed to be set on all local DSQ 179 + * enqueues. 188 180 */ 189 - SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* DEPRECATED, will be removed on 6.18 */ 181 + SCX_OPS_ALWAYS_ENQ_IMMED = 1LLU << 7, 190 182 191 183 SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | 192 184 SCX_OPS_ENQ_LAST | ··· 196 186 SCX_OPS_ALLOW_QUEUED_WAKEUP | 197 187 SCX_OPS_SWITCH_PARTIAL | 198 188 SCX_OPS_BUILTIN_IDLE_PER_NODE | 199 - SCX_OPS_HAS_CGROUP_WEIGHT, 189 + SCX_OPS_ALWAYS_ENQ_IMMED, 200 190 201 191 /* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */ 202 192 __SCX_OPS_INTERNAL_MASK = 0xffLLU << 56, ··· 223 213 bool cancelled; 224 214 }; 225 215 226 - /* argument container for ops->cgroup_init() */ 216 + /* argument container for ops.cgroup_init() */ 227 217 struct scx_cgroup_init_args { 228 218 /* the weight of the cgroup [1..10000] */ 229 219 u32 weight; ··· 246 236 }; 247 237 248 238 /* 249 - * Argument container for ops->cpu_acquire(). Currently empty, but may be 239 + * Argument container for ops.cpu_acquire(). Currently empty, but may be 250 240 * expanded in the future. 251 241 */ 252 242 struct scx_cpu_acquire_args {}; 253 243 254 - /* argument container for ops->cpu_release() */ 244 + /* argument container for ops.cpu_release() */ 255 245 struct scx_cpu_release_args { 256 246 /* the reason the CPU was preempted */ 257 247 enum scx_cpu_preempt_reason reason; ··· 260 250 struct task_struct *task; 261 251 }; 262 252 263 - /* 264 - * Informational context provided to dump operations. 
265 - */ 253 + /* informational context provided to dump operations */ 266 254 struct scx_dump_ctx { 267 255 enum scx_exit_kind kind; 268 256 s64 exit_code; 269 257 const char *reason; 270 258 u64 at_ns; 271 259 u64 at_jiffies; 260 + }; 261 + 262 + /* argument container for ops.sub_attach() */ 263 + struct scx_sub_attach_args { 264 + struct sched_ext_ops *ops; 265 + char *cgroup_path; 266 + }; 267 + 268 + /* argument container for ops.sub_detach() */ 269 + struct scx_sub_detach_args { 270 + struct sched_ext_ops *ops; 271 + char *cgroup_path; 272 272 }; 273 273 274 274 /** ··· 741 721 742 722 #endif /* CONFIG_EXT_GROUP_SCHED */ 743 723 724 + /** 725 + * @sub_attach: Attach a sub-scheduler 726 + * @args: argument container, see the struct definition 727 + * 728 + * Return 0 to accept the sub-scheduler. -errno to reject. 729 + */ 730 + s32 (*sub_attach)(struct scx_sub_attach_args *args); 731 + 732 + /** 733 + * @sub_detach: Detach a sub-scheduler 734 + * @args: argument container, see the struct definition 735 + */ 736 + void (*sub_detach)(struct scx_sub_detach_args *args); 737 + 744 738 /* 745 739 * All online ops must come before ops.cpu_online(). 746 740 */ ··· 796 762 */ 797 763 void (*exit)(struct scx_exit_info *info); 798 764 765 + /* 766 + * Data fields must comes after all ops fields. 767 + */ 768 + 799 769 /** 800 770 * @dispatch_max_batch: Max nr of tasks that dispatch() can dispatch 801 771 */ ··· 835 797 u64 hotplug_seq; 836 798 837 799 /** 800 + * @cgroup_id: When >1, attach the scheduler as a sub-scheduler on the 801 + * specified cgroup. 802 + */ 803 + u64 sub_cgroup_id; 804 + 805 + /** 838 806 * @name: BPF scheduler's name 839 807 * 840 808 * Must be a non-zero valid BPF object name including only isalnum(), ··· 850 806 char name[SCX_OPS_NAME_LEN]; 851 807 852 808 /* internal use only, must be NULL */ 853 - void *priv; 809 + void __rcu *priv; 854 810 }; 855 811 856 812 enum scx_opi { ··· 898 854 s64 SCX_EV_ENQ_SKIP_MIGRATION_DISABLED; 899 855 900 856 /* 857 + * The number of times a task, enqueued on a local DSQ with 858 + * SCX_ENQ_IMMED, was re-enqueued because the CPU was not available for 859 + * immediate execution. 860 + */ 861 + s64 SCX_EV_REENQ_IMMED; 862 + 863 + /* 864 + * The number of times a reenq of local DSQ caused another reenq of 865 + * local DSQ. This can happen when %SCX_ENQ_IMMED races against a higher 866 + * priority class task even if the BPF scheduler always satisfies the 867 + * prerequisites for %SCX_ENQ_IMMED at the time of enqueue. However, 868 + * that scenario is very unlikely and this count going up regularly 869 + * indicates that the BPF scheduler is handling %SCX_ENQ_REENQ 870 + * incorrectly causing recursive reenqueues. 871 + */ 872 + s64 SCX_EV_REENQ_LOCAL_REPEAT; 873 + 874 + /* 901 875 * Total number of times a task's time slice was refilled with the 902 876 * default value (SCX_SLICE_DFL). 903 877 */ ··· 935 873 * The number of times the bypassing mode has been activated. 936 874 */ 937 875 s64 SCX_EV_BYPASS_ACTIVATE; 876 + 877 + /* 878 + * The number of times the scheduler attempted to insert a task that it 879 + * doesn't own into a DSQ. Such attempts are ignored. 880 + * 881 + * As BPF schedulers are allowed to ignore dequeues, it's difficult to 882 + * tell whether such an attempt is from a scheduler malfunction or an 883 + * ignored dequeue around sub-sched enabling. If this count keeps going 884 + * up regardless of sub-sched enabling, it likely indicates a bug in the 885 + * scheduler. 
886 + */ 887 + s64 SCX_EV_INSERT_NOT_OWNED; 888 + 889 + /* 890 + * The number of times tasks from bypassing descendants are scheduled 891 + * from sub_bypass_dsq's. 892 + */ 893 + s64 SCX_EV_SUB_BYPASS_DISPATCH; 894 + }; 895 + 896 + struct scx_sched; 897 + 898 + enum scx_sched_pcpu_flags { 899 + SCX_SCHED_PCPU_BYPASSING = 1LLU << 0, 900 + }; 901 + 902 + /* dispatch buf */ 903 + struct scx_dsp_buf_ent { 904 + struct task_struct *task; 905 + unsigned long qseq; 906 + u64 dsq_id; 907 + u64 enq_flags; 908 + }; 909 + 910 + struct scx_dsp_ctx { 911 + struct rq *rq; 912 + u32 cursor; 913 + u32 nr_tasks; 914 + struct scx_dsp_buf_ent buf[]; 915 + }; 916 + 917 + struct scx_deferred_reenq_local { 918 + struct list_head node; 919 + u64 flags; 920 + u64 seq; 921 + u32 cnt; 938 922 }; 939 923 940 924 struct scx_sched_pcpu { 925 + struct scx_sched *sch; 926 + u64 flags; /* protected by rq lock */ 927 + 941 928 /* 942 929 * The event counters are in a per-CPU variable to minimize the 943 930 * accounting overhead. A system-wide view on the event counter is 944 931 * constructed when requested by scx_bpf_events(). 945 932 */ 946 933 struct scx_event_stats event_stats; 934 + 935 + struct scx_deferred_reenq_local deferred_reenq_local; 936 + struct scx_dispatch_q bypass_dsq; 937 + #ifdef CONFIG_EXT_SUB_SCHED 938 + u32 bypass_host_seq; 939 + #endif 940 + 941 + /* must be the last entry - contains flex array */ 942 + struct scx_dsp_ctx dsp_ctx; 943 + }; 944 + 945 + struct scx_sched_pnode { 946 + struct scx_dispatch_q global_dsq; 947 947 }; 948 948 949 949 struct scx_sched { ··· 1021 897 * per-node split isn't sufficient, it can be further split. 1022 898 */ 1023 899 struct rhashtable dsq_hash; 1024 - struct scx_dispatch_q **global_dsqs; 900 + struct scx_sched_pnode **pnode; 1025 901 struct scx_sched_pcpu __percpu *pcpu; 902 + 903 + u64 slice_dfl; 904 + u64 bypass_timestamp; 905 + s32 bypass_depth; 906 + 907 + /* bypass dispatch path enable state, see bypass_dsp_enabled() */ 908 + unsigned long bypass_dsp_claim; 909 + atomic_t bypass_dsp_enable_depth; 910 + 911 + bool aborting; 912 + bool dump_disabled; /* protected by scx_dump_lock */ 913 + u32 dsp_max_batch; 914 + s32 level; 1026 915 1027 916 /* 1028 917 * Updates to the following warned bitfields can race causing RMW issues ··· 1043 906 */ 1044 907 bool warned_zero_slice:1; 1045 908 bool warned_deprecated_rq:1; 909 + bool warned_unassoc_progs:1; 910 + 911 + struct list_head all; 912 + 913 + #ifdef CONFIG_EXT_SUB_SCHED 914 + struct rhash_head hash_node; 915 + 916 + struct list_head children; 917 + struct list_head sibling; 918 + struct cgroup *cgrp; 919 + char *cgrp_path; 920 + struct kset *sub_kset; 921 + 922 + bool sub_attached; 923 + #endif /* CONFIG_EXT_SUB_SCHED */ 924 + 925 + /* 926 + * The maximum amount of time in jiffies that a task may be runnable 927 + * without being scheduled on a CPU. If this timeout is exceeded, it 928 + * will trigger scx_error(). 
929 + */ 930 + unsigned long watchdog_timeout; 1046 931 1047 932 atomic_t exit_kind; 1048 933 struct scx_exit_info *exit_info; ··· 1072 913 struct kobject kobj; 1073 914 1074 915 struct kthread_worker *helper; 1075 - struct irq_work error_irq_work; 916 + struct irq_work disable_irq_work; 1076 917 struct kthread_work disable_work; 918 + struct timer_list bypass_lb_timer; 1077 919 struct rcu_work rcu_work; 920 + 921 + /* all ancestors including self */ 922 + struct scx_sched *ancestors[]; 1078 923 }; 1079 924 1080 925 enum scx_wake_flags { ··· 1105 942 SCX_ENQ_PREEMPT = 1LLU << 32, 1106 943 1107 944 /* 1108 - * The task being enqueued was previously enqueued on the current CPU's 1109 - * %SCX_DSQ_LOCAL, but was removed from it in a call to the 1110 - * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was 1111 - * invoked in a ->cpu_release() callback, and the task is again 1112 - * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the 1113 - * task will not be scheduled on the CPU until at least the next invocation 1114 - * of the ->cpu_acquire() callback. 945 + * Only allowed on local DSQs. Guarantees that the task either gets 946 + * on the CPU immediately and stays on it, or gets reenqueued back 947 + * to the BPF scheduler. It will never linger on a local DSQ or be 948 + * silently put back after preemption. 949 + * 950 + * The protection persists until the next fresh enqueue - it 951 + * survives SAVE/RESTORE cycles, slice extensions and preemption. 952 + * If the task can't stay on the CPU for any reason, it gets 953 + * reenqueued back to the BPF scheduler. 954 + * 955 + * Exiting and migration-disabled tasks bypass ops.enqueue() and 956 + * are placed directly on a local DSQ without IMMED protection 957 + * unless %SCX_OPS_ENQ_EXITING and %SCX_OPS_ENQ_MIGRATION_DISABLED 958 + * are set respectively. 959 + */ 960 + SCX_ENQ_IMMED = 1LLU << 33, 961 + 962 + /* 963 + * The task being enqueued was previously enqueued on a DSQ, but was 964 + * removed and is being re-enqueued. See SCX_TASK_REENQ_* flags to find 965 + * out why a given task is being reenqueued. 1115 966 */ 1116 967 SCX_ENQ_REENQ = 1LLU << 40, 1117 968 ··· 1146 969 SCX_ENQ_CLEAR_OPSS = 1LLU << 56, 1147 970 SCX_ENQ_DSQ_PRIQ = 1LLU << 57, 1148 971 SCX_ENQ_NESTED = 1LLU << 58, 972 + SCX_ENQ_GDSQ_FALLBACK = 1LLU << 59, /* fell back to global DSQ */ 1149 973 }; 1150 974 1151 975 enum scx_deq_flags { ··· 1160 982 * it hasn't been dispatched yet. Dequeue from the BPF side. 1161 983 */ 1162 984 SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, 985 + 986 + /* 987 + * The task is being dequeued due to a property change (e.g., 988 + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), 989 + * etc.). 
990 + */ 991 + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, 992 + }; 993 + 994 + enum scx_reenq_flags { 995 + /* low 16bits determine which tasks should be reenqueued */ 996 + SCX_REENQ_ANY = 1LLU << 0, /* all tasks */ 997 + 998 + __SCX_REENQ_FILTER_MASK = 0xffffLLU, 999 + 1000 + __SCX_REENQ_USER_MASK = SCX_REENQ_ANY, 1001 + 1002 + /* bits 32-35 used by task_should_reenq() */ 1003 + SCX_REENQ_TSR_RQ_OPEN = 1LLU << 32, 1004 + SCX_REENQ_TSR_NOT_FIRST = 1LLU << 33, 1005 + 1006 + __SCX_REENQ_TSR_MASK = 0xfLLU << 32, 1163 1007 }; 1164 1008 1165 1009 enum scx_pick_idle_cpu_flags { ··· 1361 1161 #define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1) 1362 1162 #define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK) 1363 1163 1164 + extern struct scx_sched __rcu *scx_root; 1364 1165 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state); 1166 + 1167 + int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id); 1365 1168 1366 1169 /* 1367 1170 * Return the rq currently locked from an scx callback, or NULL if no rq is ··· 1375 1172 return __this_cpu_read(scx_locked_rq_state); 1376 1173 } 1377 1174 1378 - static inline bool scx_kf_allowed_if_unlocked(void) 1175 + static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) 1379 1176 { 1380 - return !current->scx.kf_mask; 1177 + return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & 1178 + SCX_SCHED_PCPU_BYPASSING); 1381 1179 } 1382 1180 1383 - static inline bool scx_rq_bypassing(struct rq *rq) 1181 + #ifdef CONFIG_EXT_SUB_SCHED 1182 + /** 1183 + * scx_task_sched - Find scx_sched scheduling a task 1184 + * @p: task of interest 1185 + * 1186 + * Return @p's scheduler instance. Must be called with @p's pi_lock or rq lock 1187 + * held. 1188 + */ 1189 + static inline struct scx_sched *scx_task_sched(const struct task_struct *p) 1384 1190 { 1385 - return unlikely(rq->scx.flags & SCX_RQ_BYPASSING); 1191 + return rcu_dereference_protected(p->scx.sched, 1192 + lockdep_is_held(&p->pi_lock) || 1193 + lockdep_is_held(__rq_lockp(task_rq(p)))); 1386 1194 } 1195 + 1196 + /** 1197 + * scx_task_sched_rcu - Find scx_sched scheduling a task 1198 + * @p: task of interest 1199 + * 1200 + * Return @p's scheduler instance. The returned scx_sched is RCU protected. 1201 + */ 1202 + static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) 1203 + { 1204 + return rcu_dereference_all(p->scx.sched); 1205 + } 1206 + 1207 + /** 1208 + * scx_task_on_sched - Is a task on the specified sched? 1209 + * @sch: sched to test against 1210 + * @p: task of interest 1211 + * 1212 + * Returns %true if @p is on @sch, %false otherwise. 1213 + */ 1214 + static inline bool scx_task_on_sched(struct scx_sched *sch, 1215 + const struct task_struct *p) 1216 + { 1217 + return rcu_access_pointer(p->scx.sched) == sch; 1218 + } 1219 + 1220 + /** 1221 + * scx_prog_sched - Find scx_sched associated with a BPF prog 1222 + * @aux: aux passed in from BPF to a kfunc 1223 + * 1224 + * To be called from kfuncs. Return the scheduler instance associated with the 1225 + * BPF program given the implicit kfunc argument aux. The returned scx_sched is 1226 + * RCU protected. 
1227 + */ 1228 + static inline struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) 1229 + { 1230 + struct sched_ext_ops *ops; 1231 + struct scx_sched *root; 1232 + 1233 + ops = bpf_prog_get_assoc_struct_ops(aux); 1234 + if (likely(ops)) 1235 + return rcu_dereference_all(ops->priv); 1236 + 1237 + root = rcu_dereference_all(scx_root); 1238 + if (root) { 1239 + /* 1240 + * COMPAT-v6.19: Schedulers built before sub-sched support was 1241 + * introduced may have unassociated non-struct_ops programs. 1242 + */ 1243 + if (!root->ops.sub_attach) 1244 + return root; 1245 + 1246 + if (!root->warned_unassoc_progs) { 1247 + printk_deferred(KERN_WARNING "sched_ext: Unassociated program %s (id %d)\n", 1248 + aux->name, aux->id); 1249 + root->warned_unassoc_progs = true; 1250 + } 1251 + } 1252 + 1253 + return NULL; 1254 + } 1255 + #else /* CONFIG_EXT_SUB_SCHED */ 1256 + static inline struct scx_sched *scx_task_sched(const struct task_struct *p) 1257 + { 1258 + return rcu_dereference_protected(scx_root, 1259 + lockdep_is_held(&p->pi_lock) || 1260 + lockdep_is_held(__rq_lockp(task_rq(p)))); 1261 + } 1262 + 1263 + static inline struct scx_sched *scx_task_sched_rcu(const struct task_struct *p) 1264 + { 1265 + return rcu_dereference_all(scx_root); 1266 + } 1267 + 1268 + static inline bool scx_task_on_sched(struct scx_sched *sch, 1269 + const struct task_struct *p) 1270 + { 1271 + return true; 1272 + } 1273 + 1274 + static struct scx_sched *scx_prog_sched(const struct bpf_prog_aux *aux) 1275 + { 1276 + return rcu_dereference_all(scx_root); 1277 + } 1278 + #endif /* CONFIG_EXT_SUB_SCHED */
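The SCX_ENQ_IMMED contract is easiest to see from the scheduler side. A hedged sketch follows, assuming the usual scx/common.bpf.h declarations; the scheduler name and SHARED_DSQ (a user DSQ such a scheduler would create in ops.init()) are hypothetical, and the re-enqueue handling is inferred from the flag descriptions above rather than taken from the diff. A scheduler that wants this behavior on every local dispatch can set SCX_OPS_ALWAYS_ENQ_IMMED instead of passing the flag each time.

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * A task coming back with SCX_ENQ_REENQ was pulled off a DSQ and
	 * handed back to the scheduler; park it on a shared user DSQ.
	 */
	if (enq_flags & SCX_ENQ_REENQ) {
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
		return;
	}

	/*
	 * SCX_ENQ_IMMED is only valid for local DSQs: the task either starts
	 * running on the target CPU or is re-enqueued through this callback.
	 */
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p),
			   SCX_SLICE_DFL, SCX_ENQ_IMMED);
}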
+9 -3
kernel/sched/sched.h
··· 783 783 SCX_RQ_ONLINE = 1 << 0, 784 784 SCX_RQ_CAN_STOP_TICK = 1 << 1, 785 785 SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */ 786 - SCX_RQ_BYPASSING = 1 << 4, 787 786 SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */ 788 787 SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */ 789 788 ··· 798 799 u64 extra_enq_flags; /* see move_task_to_local_dsq() */ 799 800 u32 nr_running; 800 801 u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ 802 + bool in_select_cpu; 801 803 bool cpu_released; 802 804 u32 flags; 805 + u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ 803 806 u64 clock; /* current per-rq clock -- see scx_bpf_now() */ 804 807 cpumask_var_t cpus_to_kick; 805 808 cpumask_var_t cpus_to_kick_if_idle; ··· 810 809 cpumask_var_t cpus_to_sync; 811 810 bool kick_sync_pending; 812 811 unsigned long kick_sync; 813 - local_t reenq_local_deferred; 812 + 813 + struct task_struct *sub_dispatch_prev; 814 + 815 + raw_spinlock_t deferred_reenq_lock; 816 + u64 deferred_reenq_locals_seq; 817 + struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */ 818 + struct list_head deferred_reenq_users; /* user DSQs requesting reenq */ 814 819 struct balance_callback deferred_bal_cb; 815 820 struct balance_callback kick_sync_bal_cb; 816 821 struct irq_work deferred_irq_work; 817 822 struct irq_work kick_cpus_irq_work; 818 - struct scx_dispatch_q bypass_dsq; 819 823 }; 820 824 #endif /* CONFIG_SCHED_CLASS_EXT */ 821 825
+6 -2
tools/sched_ext/include/scx/bpf_arena_common.bpf.h
··· 15 15 #endif 16 16 17 17 #if defined(__BPF_FEATURE_ADDR_SPACE_CAST) && !defined(BPF_ARENA_FORCE_ASM) 18 + #ifndef __arena 18 19 #define __arena __attribute__((address_space(1))) 20 + #endif 19 21 #define __arena_global __attribute__((address_space(1))) 20 22 #define cast_kern(ptr) /* nop for bpf prog. emitted by LLVM */ 21 23 #define cast_user(ptr) /* nop for bpf prog. emitted by LLVM */ ··· 83 81 void __arena* bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt, 84 82 int node_id, __u64 flags) __ksym __weak; 85 83 void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; 84 + int bpf_arena_reserve_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym __weak; 86 85 87 86 /* 88 87 * Note that cond_break can only be portably used in the body of a breakable 89 88 * construct, whereas can_loop can be used anywhere. 90 89 */ 91 - #ifdef TEST 90 + #ifdef SCX_BPF_UNITTEST 92 91 #define can_loop true 93 92 #define __cond_break(expr) expr 94 93 #else ··· 168 165 }) 169 166 #endif /* __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ */ 170 167 #endif /* __BPF_FEATURE_MAY_GOTO */ 171 - #endif /* TEST */ 168 + #endif /* SCX_BPF_UNITTEST */ 172 169 173 170 #define cond_break __cond_break(break) 174 171 #define cond_break_label(label) __cond_break(goto label) ··· 176 173 177 174 void bpf_preempt_disable(void) __weak __ksym; 178 175 void bpf_preempt_enable(void) __weak __ksym; 176 + ssize_t bpf_arena_mapping_nr_pages(void *p__map) __weak __ksym;
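The can_loop / cond_break helpers above (their unit-test stubs now gated by SCX_BPF_UNITTEST rather than TEST) exist to keep open-coded loops verifiable. A small illustrative sketch, assuming scx/common.bpf.h declarations for the kfuncs used; the helper itself is not from the diff:

/* Illustrative only: sum the tasks queued on every CPU's local DSQ. */
static __always_inline u32 example_count_local(void)
{
	u32 nr_cpus = scx_bpf_nr_cpu_ids();
	u32 cpu, nr = 0;

	/*
	 * can_loop works in any boolean context; cond_break would also work
	 * here because the for loop is a breakable construct.
	 */
	for (cpu = 0; cpu < nr_cpus && can_loop; cpu++)
		nr += scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu);

	return nr;
}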
+277
tools/sched_ext/include/scx/common.bpf.h
··· 291 291 }) 292 292 #endif /* ARRAY_ELEM_PTR */ 293 293 294 + /** 295 + * __sink - Hide @expr's value from the compiler and BPF verifier 296 + * @expr: The expression whose value should be opacified 297 + * 298 + * No-op at runtime. The empty inline assembly with a read-write constraint 299 + * ("+g") has two effects at compile/verify time: 300 + * 301 + * 1. Compiler: treats @expr as both read and written, preventing dead-code 302 + * elimination and keeping @expr (and any side effects that produced it) 303 + * alive. 304 + * 305 + * 2. BPF verifier: forgets the precise value/range of @expr ("makes it 306 + * imprecise"). The verifier normally tracks exact ranges for every register 307 + * and stack slot. While useful, precision means each distinct value creates a 308 + * separate verifier state. Inside loops this leads to state explosion - each 309 + * iteration carries different precise values so states never merge and the 310 + * verifier explores every iteration individually. 311 + * 312 + * Example - preventing loop state explosion:: 313 + * 314 + * u32 nr_intersects = 0, nr_covered = 0; 315 + * __sink(nr_intersects); 316 + * __sink(nr_covered); 317 + * bpf_for(i, 0, nr_nodes) { 318 + * if (intersects(cpumask, node_mask[i])) 319 + * nr_intersects++; 320 + * if (covers(cpumask, node_mask[i])) 321 + * nr_covered++; 322 + * } 323 + * 324 + * Without __sink(), the verifier tracks every possible (nr_intersects, 325 + * nr_covered) pair across iterations, causing "BPF program is too large". With 326 + * __sink(), the values become unknown scalars so all iterations collapse into 327 + * one reusable state. 328 + * 329 + * Example - keeping a reference alive:: 330 + * 331 + * struct task_struct *t = bpf_task_acquire(task); 332 + * __sink(t); 333 + * 334 + * Follows the convention from BPF selftests (bpf_misc.h). 335 + */ 336 + #define __sink(expr) asm volatile ("" : "+g"(expr)) 337 + 294 338 /* 295 339 * BPF declarations and helpers 296 340 */ ··· 380 336 381 337 /* cgroup */ 382 338 struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym; 339 + struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp) __ksym; 383 340 void bpf_cgroup_release(struct cgroup *cgrp) __ksym; 384 341 struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; 385 342 ··· 787 742 } 788 743 789 744 /* 745 + * ctzll -- Counts trailing zeros in an unsigned long long. If the input value 746 + * is zero, the return value is undefined. 747 + */ 748 + static inline int ctzll(u64 v) 749 + { 750 + #if (!defined(__BPF__) && defined(__SCX_TARGET_ARCH_x86)) || \ 751 + (defined(__BPF__) && defined(__clang_major__) && __clang_major__ >= 19) 752 + /* 753 + * Use the ctz builtin when: (1) building for native x86, or 754 + * (2) building for BPF with clang >= 19 (BPF backend supports 755 + * the intrinsic from clang 19 onward; earlier versions hit 756 + * "unimplemented opcode" in the backend). 757 + */ 758 + return __builtin_ctzll(v); 759 + #else 760 + /* 761 + * If neither the target architecture nor the toolchains support ctzll, 762 + * use software-based emulation. Let's use the De Bruijn sequence-based 763 + * approach to find LSB fastly. 
See the details of De Bruijn sequence: 764 + * 765 + * https://en.wikipedia.org/wiki/De_Bruijn_sequence 766 + * https://www.chessprogramming.org/BitScan#De_Bruijn_Multiplication 767 + */ 768 + const int lookup_table[64] = { 769 + 0, 1, 48, 2, 57, 49, 28, 3, 61, 58, 50, 42, 38, 29, 17, 4, 770 + 62, 55, 59, 36, 53, 51, 43, 22, 45, 39, 33, 30, 24, 18, 12, 5, 771 + 63, 47, 56, 27, 60, 41, 37, 16, 54, 35, 52, 21, 44, 32, 23, 11, 772 + 46, 26, 40, 15, 34, 20, 31, 10, 25, 14, 19, 9, 13, 8, 7, 6, 773 + }; 774 + const u64 DEBRUIJN_CONSTANT = 0x03f79d71b4cb0a89ULL; 775 + unsigned int index; 776 + u64 lowest_bit; 777 + const int *lt; 778 + 779 + if (v == 0) 780 + return -1; 781 + 782 + /* 783 + * Isolate the least significant bit (LSB). 784 + * For example, if v = 0b...10100, then v & -v = 0b...00100 785 + */ 786 + lowest_bit = v & -v; 787 + 788 + /* 789 + * Each isolated bit produces a unique 6-bit value, guaranteed by the 790 + * De Bruijn property. Calculate a unique index into the lookup table 791 + * using the magic constant and a right shift. 792 + * 793 + * Multiplying by the 64-bit constant "spreads out" that 1-bit into a 794 + * unique pattern in the top 6 bits. This uniqueness property is 795 + * exactly what a De Bruijn sequence guarantees: Every possible 6-bit 796 + * pattern (in top bits) occurs exactly once for each LSB position. So, 797 + * the constant 0x03f79d71b4cb0a89ULL is carefully chosen to be a 798 + * De Bruijn sequence, ensuring no collisions in the table index. 799 + */ 800 + index = (lowest_bit * DEBRUIJN_CONSTANT) >> 58; 801 + 802 + /* 803 + * Lookup in a precomputed table. No collision is guaranteed by the 804 + * De Bruijn property. 805 + */ 806 + lt = MEMBER_VPTR(lookup_table, [index]); 807 + return (lt)? *lt : -1; 808 + #endif 809 + } 810 + 811 + /* 790 812 * Return a value proportionally scaled to the task's weight. 791 813 */ 792 814 static inline u64 scale_by_task_weight(const struct task_struct *p, u64 value) ··· 869 757 return value * 100 / p->scx.weight; 870 758 } 871 759 760 + 761 + /* 762 + * Get a random u64 from the kernel's pseudo-random generator. 763 + */ 764 + static inline u64 get_prandom_u64() 765 + { 766 + return ((u64)bpf_get_prandom_u32() << 32) | bpf_get_prandom_u32(); 767 + } 768 + 769 + /* 770 + * Define the shadow structure to avoid a compilation error when 771 + * vmlinux.h does not enable necessary kernel configs. The ___local 772 + * suffix is a CO-RE convention that tells the loader to match this 773 + * against the base struct rq in the kernel. The attribute 774 + * preserve_access_index tells the compiler to generate a CO-RE 775 + * relocation for these fields. 776 + */ 777 + struct rq___local { 778 + /* 779 + * A monotonically increasing clock per CPU. It is rq->clock minus 780 + * cumulative IRQ time and hypervisor steal time. Unlike rq->clock, 781 + * it does not advance during IRQ processing or hypervisor preemption. 782 + * It does advance during idle (the idle task counts as a running task 783 + * for this purpose). 784 + */ 785 + u64 clock_task; 786 + /* 787 + * Invariant version of clock_task scaled by CPU capacity and 788 + * frequency. For example, clock_pelt advances 2x slower on a CPU 789 + * with half the capacity. 790 + * 791 + * At idle exit, rq->clock_pelt jumps forward to resync with 792 + * clock_task. The kernel's rq_clock_pelt() corrects for this jump 793 + * by subtracting lost_idle_time, yielding a clock that appears 794 + * continuous across idle transitions. 
scx_clock_pelt() mirrors 795 + * rq_clock_pelt() by performing the same subtraction. 796 + */ 797 + u64 clock_pelt; 798 + /* 799 + * Accumulates the magnitude of each clock_pelt jump at idle exit. 800 + * Subtracting this from clock_pelt gives rq_clock_pelt(): a 801 + * continuous, capacity-invariant clock suitable for both task 802 + * execution time stamping and cross-idle measurements. 803 + */ 804 + unsigned long lost_idle_time; 805 + /* 806 + * Shadow of paravirt_steal_clock() (the hypervisor's cumulative 807 + * stolen time counter). Stays frozen while the hypervisor preempts 808 + * the vCPU; catches up the next time update_rq_clock_task() is 809 + * called. The delta is the stolen time not yet subtracted from 810 + * clock_task. 811 + * 812 + * Unlike irqtime->total (a plain kernel-side field), the live stolen 813 + * time counter lives in hypervisor-specific shared memory and has no 814 + * kernel-side equivalent readable from BPF in a hypervisor-agnostic 815 + * way. This field is therefore the only portable BPF-accessible 816 + * approximation of cumulative steal time. 817 + * 818 + * Available only when CONFIG_PARAVIRT_TIME_ACCOUNTING is on. 819 + */ 820 + u64 prev_steal_time_rq; 821 + } __attribute__((preserve_access_index)); 822 + 823 + extern struct rq runqueues __ksym; 824 + 825 + /* 826 + * Define the shadow structure to avoid a compilation error when 827 + * vmlinux.h does not enable necessary kernel configs. 828 + */ 829 + struct irqtime___local { 830 + /* 831 + * Cumulative IRQ time counter for this CPU, in nanoseconds. Advances 832 + * immediately at the exit of every hardirq and non-ksoftirqd softirq 833 + * via irqtime_account_irq(). ksoftirqd time is counted as normal 834 + * task time and is NOT included. NMI time is also NOT included. 835 + * 836 + * The companion field irqtime->sync (struct u64_stats_sync) protects 837 + * against 64-bit tearing on 32-bit architectures. On 64-bit kernels, 838 + * u64_stats_sync is an empty struct and all seqcount operations are 839 + * no-ops, so a plain BPF_CORE_READ of this field is safe. 840 + * 841 + * Available only when CONFIG_IRQ_TIME_ACCOUNTING is on. 842 + */ 843 + u64 total; 844 + } __attribute__((preserve_access_index)); 845 + 846 + /* 847 + * cpu_irqtime is a per-CPU variable defined only when 848 + * CONFIG_IRQ_TIME_ACCOUNTING is on. Declare it as __weak so the BPF 849 + * loader sets its address to 0 (rather than failing) when the symbol 850 + * is absent from the running kernel. 851 + */ 852 + extern struct irqtime___local cpu_irqtime __ksym __weak; 853 + 854 + static inline struct rq___local *get_current_rq(u32 cpu) 855 + { 856 + /* 857 + * This is a workaround to get an rq pointer since we decided to 858 + * deprecate scx_bpf_cpu_rq(). 859 + * 860 + * WARNING: The caller must hold the rq lock for @cpu. This is 861 + * guaranteed when called from scheduling callbacks (ops.running, 862 + * ops.stopping, ops.enqueue, ops.dequeue, ops.dispatch, etc.). 863 + * There is no runtime check available in BPF for kernel spinlock 864 + * state — correctness is enforced by calling context only. 865 + */ 866 + return (void *)bpf_per_cpu_ptr(&runqueues, cpu); 867 + } 868 + 869 + static inline u64 scx_clock_task(u32 cpu) 870 + { 871 + struct rq___local *rq = get_current_rq(cpu); 872 + 873 + /* Equivalent to the kernel's rq_clock_task(). */ 874 + return rq ? 
rq->clock_task : 0; 875 + } 876 + 877 + static inline u64 scx_clock_pelt(u32 cpu) 878 + { 879 + struct rq___local *rq = get_current_rq(cpu); 880 + 881 + /* 882 + * Equivalent to the kernel's rq_clock_pelt(): subtracts 883 + * lost_idle_time from clock_pelt to absorb the jump that occurs 884 + * when clock_pelt resyncs with clock_task at idle exit. The result 885 + * is a continuous, capacity-invariant clock safe for both task 886 + * execution time stamping and cross-idle measurements. 887 + */ 888 + return rq ? (rq->clock_pelt - rq->lost_idle_time) : 0; 889 + } 890 + 891 + static inline u64 scx_clock_virt(u32 cpu) 892 + { 893 + struct rq___local *rq; 894 + 895 + /* 896 + * Check field existence before calling get_current_rq() so we avoid 897 + * the per_cpu lookup entirely on kernels built without 898 + * CONFIG_PARAVIRT_TIME_ACCOUNTING. 899 + */ 900 + if (!bpf_core_field_exists(((struct rq___local *)0)->prev_steal_time_rq)) 901 + return 0; 902 + 903 + /* Lagging shadow of the kernel's paravirt_steal_clock(). */ 904 + rq = get_current_rq(cpu); 905 + return rq ? BPF_CORE_READ(rq, prev_steal_time_rq) : 0; 906 + } 907 + 908 + static inline u64 scx_clock_irq(u32 cpu) 909 + { 910 + struct irqtime___local *irqt; 911 + 912 + /* 913 + * bpf_core_type_exists() resolves at load time: if struct irqtime is 914 + * absent from kernel BTF (CONFIG_IRQ_TIME_ACCOUNTING off), the loader 915 + * patches this into an unconditional return 0, making the 916 + * bpf_per_cpu_ptr() call below dead code that the verifier never sees. 917 + */ 918 + if (!bpf_core_type_exists(struct irqtime___local)) 919 + return 0; 920 + 921 + /* Equivalent to the kernel's irq_time_read(). */ 922 + irqt = bpf_per_cpu_ptr(&cpu_irqtime, cpu); 923 + return irqt ? BPF_CORE_READ(irqt, total) : 0; 924 + } 872 925 873 926 #include "compat.bpf.h" 874 927 #include "enums.bpf.h"
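The clock helpers above are meant to be read from contexts that already hold the target CPU's rq lock. A minimal sketch (hypothetical callbacks and task-storage map, not part of the patch) of snapshotting them from ops.running()/ops.stopping(), where that locking requirement is satisfied:

struct clk_snap {
        u64 task_at;
        u64 irq_at;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct clk_snap);
} clk_snap_stor SEC(".maps");

void BPF_STRUCT_OPS(myscx_running, struct task_struct *p)
{
        struct clk_snap *cs = bpf_task_storage_get(&clk_snap_stor, p, 0,
                                                   BPF_LOCAL_STORAGE_GET_F_CREATE);
        s32 cpu = scx_bpf_task_cpu(p);

        if (!cs)
                return;
        cs->task_at = scx_clock_task(cpu);
        cs->irq_at = scx_clock_irq(cpu);
}

void BPF_STRUCT_OPS(myscx_stopping, struct task_struct *p, bool runnable)
{
        struct clk_snap *cs = bpf_task_storage_get(&clk_snap_stor, p, 0, 0);
        s32 cpu = scx_bpf_task_cpu(p);

        if (!cs)
                return;
        /*
         * clock_task already excludes IRQ and steal time, so the first delta
         * is pure execution time; the IRQ delta is reported separately rather
         * than subtracted again.
         */
        bpf_printk("%s: ran %llu ns, cpu irq %llu ns", p->comm,
                   scx_clock_task(cpu) - cs->task_at,
                   scx_clock_irq(cpu) - cs->irq_at);
}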
+1 -4
tools/sched_ext/include/scx/common.h
··· 67 67 bpf_map__set_value_size((__skel)->maps.elfsec##_##arr, \ 68 68 sizeof((__skel)->elfsec##_##arr->arr[0]) * (n)); \ 69 69 (__skel)->elfsec##_##arr = \ 70 + (typeof((__skel)->elfsec##_##arr)) \ 70 71 bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz); \ 71 72 } while (0) 72 73 ··· 75 74 #include "compat.h" 76 75 #include "enums.h" 77 76 78 - /* not available when building kernel tools/sched_ext */ 79 - #if __has_include(<lib/sdt_task_defs.h>) 80 77 #include "bpf_arena_common.h" 81 - #include <lib/sdt_task_defs.h> 82 - #endif 83 78 84 79 #endif /* __SCHED_EXT_COMMON_H */
+52 -5
tools/sched_ext/include/scx/compat.bpf.h
··· 28 28 * 29 29 * scx_bpf_dispatch_from_dsq() and friends were added during v6.12 by 30 30 * 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()"). 31 + * 32 + * v7.1: scx_bpf_dsq_move_to_local___v2() to add @enq_flags. 31 33 */ 32 - bool scx_bpf_dsq_move_to_local___new(u64 dsq_id) __ksym __weak; 34 + bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags) __ksym __weak; 35 + bool scx_bpf_dsq_move_to_local___v1(u64 dsq_id) __ksym __weak; 33 36 void scx_bpf_dsq_move_set_slice___new(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym __weak; 34 37 void scx_bpf_dsq_move_set_vtime___new(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak; 35 38 bool scx_bpf_dsq_move___new(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; ··· 44 41 bool scx_bpf_dispatch_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 45 42 bool scx_bpf_dispatch_vtime_from_dsq___old(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 46 43 47 - #define scx_bpf_dsq_move_to_local(dsq_id) \ 48 - (bpf_ksym_exists(scx_bpf_dsq_move_to_local___new) ? \ 49 - scx_bpf_dsq_move_to_local___new((dsq_id)) : \ 50 - scx_bpf_consume___old((dsq_id))) 44 + #define scx_bpf_dsq_move_to_local(dsq_id, enq_flags) \ 45 + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v2) ? \ 46 + scx_bpf_dsq_move_to_local___v2((dsq_id), (enq_flags)) : \ 47 + (bpf_ksym_exists(scx_bpf_dsq_move_to_local___v1) ? \ 48 + scx_bpf_dsq_move_to_local___v1((dsq_id)) : \ 49 + scx_bpf_consume___old((dsq_id)))) 51 50 52 51 #define scx_bpf_dsq_move_set_slice(it__iter, slice) \ 53 52 (bpf_ksym_exists(scx_bpf_dsq_move_set_slice___new) ? \ ··· 106 101 p = bpf_iter_scx_dsq_next(&it); 107 102 bpf_iter_scx_dsq_destroy(&it); 108 103 return p; 104 + } 105 + 106 + /* 107 + * v7.1: scx_bpf_sub_dispatch() for sub-sched dispatch. Preserve until 108 + * we drop the compat layer for older kernels that lack the kfunc. 109 + */ 110 + bool scx_bpf_sub_dispatch___compat(u64 cgroup_id) __ksym __weak; 111 + 112 + static inline bool scx_bpf_sub_dispatch(u64 cgroup_id) 113 + { 114 + if (bpf_ksym_exists(scx_bpf_sub_dispatch___compat)) 115 + return scx_bpf_sub_dispatch___compat(cgroup_id); 116 + return false; 109 117 } 110 118 111 119 /** ··· 284 266 } 285 267 } 286 268 269 + /* 270 + * scx_bpf_select_cpu_and() is now an inline wrapper. Use this instead of 271 + * bpf_ksym_exists(scx_bpf_select_cpu_and) to test availability. 272 + */ 273 + #define __COMPAT_HAS_scx_bpf_select_cpu_and \ 274 + (bpf_core_type_exists(struct scx_bpf_select_cpu_and_args) || \ 275 + bpf_ksym_exists(scx_bpf_select_cpu_and___compat)) 276 + 287 277 /** 288 278 * scx_bpf_dsq_insert_vtime - Insert a task into the vtime priority queue of a DSQ 289 279 * @p: task_struct to insert ··· 399 373 scx_bpf_reenqueue_local___v2___compat(); 400 374 else 401 375 scx_bpf_reenqueue_local___v1(); 376 + } 377 + 378 + /* 379 + * v6.20: New scx_bpf_dsq_reenq() that allows re-enqueues on more DSQs. This 380 + * will eventually deprecate scx_bpf_reenqueue_local(). 
381 + */ 382 + void scx_bpf_dsq_reenq___compat(u64 dsq_id, u64 reenq_flags, const struct bpf_prog_aux *aux__prog) __ksym __weak; 383 + 384 + static inline bool __COMPAT_has_generic_reenq(void) 385 + { 386 + return bpf_ksym_exists(scx_bpf_dsq_reenq___compat); 387 + } 388 + 389 + static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags) 390 + { 391 + if (bpf_ksym_exists(scx_bpf_dsq_reenq___compat)) 392 + scx_bpf_dsq_reenq___compat(dsq_id, reenq_flags, NULL); 393 + else if (dsq_id == SCX_DSQ_LOCAL && reenq_flags == 0) 394 + scx_bpf_reenqueue_local(); 395 + else 396 + scx_bpf_error("kernel too old to reenqueue foreign local or user DSQs"); 402 397 } 403 398 404 399 /*
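As a usage sketch of the wrapper above (hypothetical DEFER_DSQ and timer callback, not part of the patch), a scheduler can park tasks on a user DSQ and periodically push them back through ops.enqueue() when the kernel provides the generalized kfunc, degrading to a no-op on older kernels:

#define DEFER_DSQ 100   /* assumed user DSQ created with scx_bpf_create_dsq() in ops.init() */

static int defer_timerfn(void *map, int *key, struct bpf_timer *timer)
{
        if (__COMPAT_has_generic_reenq())
                scx_bpf_dsq_reenq(DEFER_DSQ, 0);

        bpf_timer_start(timer, 10 * 1000 * 1000, 0);    /* rearm in 10ms */
        return 0;
}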
+51 -1
tools/sched_ext/include/scx/compat.h
··· 8 8 #define __SCX_COMPAT_H 9 9 10 10 #include <bpf/btf.h> 11 + #include <bpf/libbpf.h> 11 12 #include <fcntl.h> 12 13 #include <stdlib.h> 13 14 #include <unistd.h> ··· 116 115 #define SCX_OPS_ENQ_MIGRATION_DISABLED SCX_OPS_FLAG(SCX_OPS_ENQ_MIGRATION_DISABLED) 117 116 #define SCX_OPS_ALLOW_QUEUED_WAKEUP SCX_OPS_FLAG(SCX_OPS_ALLOW_QUEUED_WAKEUP) 118 117 #define SCX_OPS_BUILTIN_IDLE_PER_NODE SCX_OPS_FLAG(SCX_OPS_BUILTIN_IDLE_PER_NODE) 118 + #define SCX_OPS_ALWAYS_ENQ_IMMED SCX_OPS_FLAG(SCX_OPS_ALWAYS_ENQ_IMMED) 119 119 120 120 #define SCX_PICK_IDLE_FLAG(name) __COMPAT_ENUM_OR_ZERO("scx_pick_idle_cpu_flags", #name) 121 121 ··· 160 158 * COMPAT: 161 159 * - v6.17: ops.cgroup_set_bandwidth() 162 160 * - v6.19: ops.cgroup_set_idle() 161 + * - v7.1: ops.sub_attach(), ops.sub_detach(), ops.sub_cgroup_id 163 162 */ 164 163 #define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ 165 164 struct __scx_name *__skel; \ ··· 182 179 fprintf(stderr, "WARNING: kernel doesn't support ops.cgroup_set_idle()\n"); \ 183 180 __skel->struct_ops.__ops_name->cgroup_set_idle = NULL; \ 184 181 } \ 182 + if (__skel->struct_ops.__ops_name->sub_attach && \ 183 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_attach")) { \ 184 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_attach()\n"); \ 185 + __skel->struct_ops.__ops_name->sub_attach = NULL; \ 186 + } \ 187 + if (__skel->struct_ops.__ops_name->sub_detach && \ 188 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_detach")) { \ 189 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_detach()\n"); \ 190 + __skel->struct_ops.__ops_name->sub_detach = NULL; \ 191 + } \ 192 + if (__skel->struct_ops.__ops_name->sub_cgroup_id > 0 && \ 193 + !__COMPAT_struct_has_field("sched_ext_ops", "sub_cgroup_id")) { \ 194 + fprintf(stderr, "WARNING: kernel doesn't support ops.sub_cgroup_id\n"); \ 195 + __skel->struct_ops.__ops_name->sub_cgroup_id = 0; \ 196 + } \ 185 197 __skel; \ 186 198 }) 187 199 200 + /* 201 + * Associate non-struct_ops BPF programs with the scheduler's struct_ops map so 202 + * that scx_prog_sched() can determine which scheduler a BPF program belongs 203 + * to. Requires libbpf >= 1.7. 204 + */ 205 + #if LIBBPF_MAJOR_VERSION > 1 || \ 206 + (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 7) 207 + static inline void __scx_ops_assoc_prog(struct bpf_program *prog, 208 + struct bpf_map *map, 209 + const char *ops_name) 210 + { 211 + s32 err = bpf_program__assoc_struct_ops(prog, map, NULL); 212 + if (err) 213 + fprintf(stderr, 214 + "ERROR: Failed to associate %s with %s: %d\n", 215 + bpf_program__name(prog), ops_name, err); 216 + } 217 + #else 218 + static inline void __scx_ops_assoc_prog(struct bpf_program *prog, 219 + struct bpf_map *map, 220 + const char *ops_name) 221 + { 222 + } 223 + #endif 224 + 188 225 #define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \ 226 + struct bpf_program *__prog; \ 189 227 UEI_SET_SIZE(__skel, __ops_name, __uei_name); \ 190 228 SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ 229 + bpf_object__for_each_program(__prog, (__skel)->obj) { \ 230 + if (bpf_program__type(__prog) == BPF_PROG_TYPE_STRUCT_OPS) \ 231 + continue; \ 232 + __scx_ops_assoc_prog(__prog, (__skel)->maps.__ops_name, \ 233 + #__ops_name); \ 234 + } \ 191 235 }) 192 236 193 237 /* 194 238 * New versions of bpftool now emit additional link placeholders for BPF maps, 195 239 * and set up BPF skeleton in such a way that libbpf will auto-attach BPF maps 196 - * automatically, assumming libbpf is recent enough (v1.5+). 
Old libbpf will do 240 + * automatically, assuming libbpf is recent enough (v1.5+). Old libbpf will do 197 241 * nothing with those links and won't attempt to auto-attach maps. 198 242 * 199 243 * To maintain compatibility with older libbpf while avoiding trying to attach
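For context, the program-association loop added to SCX_OPS_LOAD() slots transparently into the open/load/attach flow the example schedulers already follow; a minimal sketch of that flow (scheduler name "simple" assumed purely for illustration):

#include <unistd.h>
#include <scx/common.h>
#include "scx_simple.bpf.skel.h"

int main(void)
{
        struct scx_simple *skel;
        struct bpf_link *link;

        skel = SCX_OPS_OPEN(simple_ops, scx_simple);
        SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);        /* association happens here */
        link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple);

        while (!UEI_EXITED(skel, uei))
                sleep(1);

        bpf_link__destroy(link);
        UEI_REPORT(skel, uei);
        scx_simple__destroy(skel);
        return 0;
}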
+48 -13
tools/sched_ext/include/scx/enum_defs.autogen.h
··· 14 14 #define HAVE_SCX_EXIT_MSG_LEN 15 15 #define HAVE_SCX_EXIT_DUMP_DFL_LEN 16 16 #define HAVE_SCX_CPUPERF_ONE 17 - #define HAVE_SCX_OPS_TASK_ITER_BATCH 17 + #define HAVE_SCX_TASK_ITER_BATCH 18 + #define HAVE_SCX_BYPASS_HOST_NTH 19 + #define HAVE_SCX_BYPASS_LB_DFL_INTV_US 20 + #define HAVE_SCX_BYPASS_LB_DONOR_PCT 21 + #define HAVE_SCX_BYPASS_LB_MIN_DELTA_DIV 22 + #define HAVE_SCX_BYPASS_LB_BATCH 23 + #define HAVE_SCX_REENQ_LOCAL_MAX_REPEAT 24 + #define HAVE_SCX_SUB_MAX_DEPTH 18 25 #define HAVE_SCX_CPU_PREEMPT_RT 19 26 #define HAVE_SCX_CPU_PREEMPT_DL 20 27 #define HAVE_SCX_CPU_PREEMPT_STOP 21 28 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN 22 29 #define HAVE_SCX_DEQ_SLEEP 23 30 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC 31 + #define HAVE_SCX_DEQ_SCHED_CHANGE 24 32 #define HAVE_SCX_DSQ_FLAG_BUILTIN 25 33 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON 26 34 #define HAVE_SCX_DSQ_INVALID 27 35 #define HAVE_SCX_DSQ_GLOBAL 28 36 #define HAVE_SCX_DSQ_LOCAL 37 + #define HAVE_SCX_DSQ_BYPASS 29 38 #define HAVE_SCX_DSQ_LOCAL_ON 30 39 #define HAVE_SCX_DSQ_LOCAL_CPU_MASK 31 40 #define HAVE_SCX_DSQ_ITER_REV ··· 44 35 #define HAVE___SCX_DSQ_ITER_ALL_FLAGS 45 36 #define HAVE_SCX_DSQ_LNODE_ITER_CURSOR 46 37 #define HAVE___SCX_DSQ_LNODE_PRIV_SHIFT 38 + #define HAVE_SCX_ENABLING 39 + #define HAVE_SCX_ENABLED 40 + #define HAVE_SCX_DISABLING 41 + #define HAVE_SCX_DISABLED 47 42 #define HAVE_SCX_ENQ_WAKEUP 48 43 #define HAVE_SCX_ENQ_HEAD 49 44 #define HAVE_SCX_ENQ_CPU_SELECTED 50 45 #define HAVE_SCX_ENQ_PREEMPT 46 + #define HAVE_SCX_ENQ_IMMED 51 47 #define HAVE_SCX_ENQ_REENQ 52 48 #define HAVE_SCX_ENQ_LAST 53 49 #define HAVE___SCX_ENQ_INTERNAL_MASK 54 50 #define HAVE_SCX_ENQ_CLEAR_OPSS 55 51 #define HAVE_SCX_ENQ_DSQ_PRIQ 52 + #define HAVE_SCX_ENQ_NESTED 53 + #define HAVE_SCX_ENQ_GDSQ_FALLBACK 56 54 #define HAVE_SCX_TASK_DSQ_ON_PRIQ 57 55 #define HAVE_SCX_TASK_QUEUED 56 + #define HAVE_SCX_TASK_IN_CUSTODY 58 57 #define HAVE_SCX_TASK_RESET_RUNNABLE_AT 59 58 #define HAVE_SCX_TASK_DEQD_FOR_SLEEP 59 + #define HAVE_SCX_TASK_SUB_INIT 60 + #define HAVE_SCX_TASK_IMMED 60 61 #define HAVE_SCX_TASK_STATE_SHIFT 61 62 #define HAVE_SCX_TASK_STATE_BITS 62 63 #define HAVE_SCX_TASK_STATE_MASK 64 + #define HAVE_SCX_TASK_NONE 65 + #define HAVE_SCX_TASK_INIT 66 + #define HAVE_SCX_TASK_READY 67 + #define HAVE_SCX_TASK_ENABLED 68 + #define HAVE_SCX_TASK_REENQ_REASON_SHIFT 69 + #define HAVE_SCX_TASK_REENQ_REASON_BITS 70 + #define HAVE_SCX_TASK_REENQ_REASON_MASK 71 + #define HAVE_SCX_TASK_REENQ_NONE 72 + #define HAVE_SCX_TASK_REENQ_KFUNC 73 + #define HAVE_SCX_TASK_REENQ_IMMED 74 + #define HAVE_SCX_TASK_REENQ_PREEMPTED 63 75 #define HAVE_SCX_TASK_CURSOR 64 76 #define HAVE_SCX_ECODE_RSN_HOTPLUG 77 + #define HAVE_SCX_ECODE_RSN_CGROUP_OFFLINE 65 78 #define HAVE_SCX_ECODE_ACT_RESTART 79 + #define HAVE_SCX_EFLAG_INITIALIZED 66 80 #define HAVE_SCX_EXIT_NONE 67 81 #define HAVE_SCX_EXIT_DONE 68 82 #define HAVE_SCX_EXIT_UNREG 69 83 #define HAVE_SCX_EXIT_UNREG_BPF 70 84 #define HAVE_SCX_EXIT_UNREG_KERN 71 85 #define HAVE_SCX_EXIT_SYSRQ 86 + #define HAVE_SCX_EXIT_PARENT 72 87 #define HAVE_SCX_EXIT_ERROR 73 88 #define HAVE_SCX_EXIT_ERROR_BPF 74 89 #define HAVE_SCX_EXIT_ERROR_STALL ··· 113 80 #define HAVE_SCX_OPI_CPU_HOTPLUG_BEGIN 114 81 #define HAVE_SCX_OPI_CPU_HOTPLUG_END 115 82 #define HAVE_SCX_OPI_END 116 - #define HAVE_SCX_OPS_ENABLING 117 - #define HAVE_SCX_OPS_ENABLED 118 - #define HAVE_SCX_OPS_DISABLING 119 - #define HAVE_SCX_OPS_DISABLED 120 83 #define HAVE_SCX_OPS_KEEP_BUILTIN_IDLE 121 84 #define HAVE_SCX_OPS_ENQ_LAST 122 85 #define HAVE_SCX_OPS_ENQ_EXITING 123 
86 #define HAVE_SCX_OPS_SWITCH_PARTIAL 124 87 #define HAVE_SCX_OPS_ENQ_MIGRATION_DISABLED 125 88 #define HAVE_SCX_OPS_ALLOW_QUEUED_WAKEUP 126 - #define HAVE_SCX_OPS_HAS_CGROUP_WEIGHT 89 + #define HAVE_SCX_OPS_BUILTIN_IDLE_PER_NODE 90 + #define HAVE_SCX_OPS_ALWAYS_ENQ_IMMED 127 91 #define HAVE_SCX_OPS_ALL_FLAGS 92 + #define HAVE___SCX_OPS_INTERNAL_MASK 93 + #define HAVE_SCX_OPS_HAS_CPU_PREEMPT 128 94 #define HAVE_SCX_OPSS_NONE 129 95 #define HAVE_SCX_OPSS_QUEUEING 130 96 #define HAVE_SCX_OPSS_QUEUED 131 97 #define HAVE_SCX_OPSS_DISPATCHING 132 98 #define HAVE_SCX_OPSS_QSEQ_SHIFT 133 99 #define HAVE_SCX_PICK_IDLE_CORE 100 + #define HAVE_SCX_PICK_IDLE_IN_NODE 134 101 #define HAVE_SCX_OPS_NAME_LEN 135 102 #define HAVE_SCX_SLICE_DFL 103 + #define HAVE_SCX_SLICE_BYPASS 136 104 #define HAVE_SCX_SLICE_INF 105 + #define HAVE_SCX_REENQ_ANY 106 + #define HAVE___SCX_REENQ_FILTER_MASK 107 + #define HAVE___SCX_REENQ_USER_MASK 108 + #define HAVE_SCX_REENQ_TSR_RQ_OPEN 109 + #define HAVE_SCX_REENQ_TSR_NOT_FIRST 110 + #define HAVE___SCX_REENQ_TSR_MASK 137 111 #define HAVE_SCX_RQ_ONLINE 138 112 #define HAVE_SCX_RQ_CAN_STOP_TICK 139 - #define HAVE_SCX_RQ_BAL_PENDING 140 113 #define HAVE_SCX_RQ_BAL_KEEP 141 - #define HAVE_SCX_RQ_BYPASSING 142 114 #define HAVE_SCX_RQ_CLK_VALID 115 + #define HAVE_SCX_RQ_BAL_CB_PENDING 143 116 #define HAVE_SCX_RQ_IN_WAKEUP 144 117 #define HAVE_SCX_RQ_IN_BALANCE 145 - #define HAVE_SCX_TASK_NONE 146 - #define HAVE_SCX_TASK_INIT 147 - #define HAVE_SCX_TASK_READY 148 - #define HAVE_SCX_TASK_ENABLED 149 - #define HAVE_SCX_TASK_NR_STATES 118 + #define HAVE_SCX_SCHED_PCPU_BYPASSING 150 119 #define HAVE_SCX_TG_ONLINE 151 120 #define HAVE_SCX_TG_INITED 152 121 #define HAVE_SCX_WAKE_FORK
+11
tools/sched_ext/include/scx/enums.autogen.bpf.h
··· 67 67 const volatile u64 __SCX_TASK_DEQD_FOR_SLEEP __weak; 68 68 #define SCX_TASK_DEQD_FOR_SLEEP __SCX_TASK_DEQD_FOR_SLEEP 69 69 70 + const volatile u64 __SCX_TASK_SUB_INIT __weak; 71 + #define SCX_TASK_SUB_INIT __SCX_TASK_SUB_INIT 72 + 73 + const volatile u64 __SCX_TASK_IMMED __weak; 74 + #define SCX_TASK_IMMED __SCX_TASK_IMMED 75 + 70 76 const volatile u64 __SCX_TASK_STATE_SHIFT __weak; 71 77 #define SCX_TASK_STATE_SHIFT __SCX_TASK_STATE_SHIFT 72 78 ··· 121 115 const volatile u64 __SCX_ENQ_PREEMPT __weak; 122 116 #define SCX_ENQ_PREEMPT __SCX_ENQ_PREEMPT 123 117 118 + const volatile u64 __SCX_ENQ_IMMED __weak; 119 + #define SCX_ENQ_IMMED __SCX_ENQ_IMMED 120 + 124 121 const volatile u64 __SCX_ENQ_REENQ __weak; 125 122 #define SCX_ENQ_REENQ __SCX_ENQ_REENQ 126 123 ··· 136 127 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; 137 128 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ 138 129 130 + const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; 131 + #define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
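On the BPF side the new flags read like any other enq/deq flag once the loader fills in the values above. A minimal sketch (hypothetical enqueue callback and DSQ id; assumes SCX_ENQ_IMMED is OR'd into the insert flags like other enqueue flags) of requesting immediate execution on an idle CPU and falling back to a shared DSQ otherwise:

#define MY_DSQ 0        /* assumed user DSQ created in ops.init() */

void BPF_STRUCT_OPS(myscx_enqueue, struct task_struct *p, u64 enq_flags)
{
        s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

        if (cpu >= 0) {
                /*
                 * If the target CPU can't run the task right away, the task
                 * should come back through ops.enqueue() with SCX_ENQ_REENQ
                 * set instead of lingering on the local DSQ.
                 */
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                   enq_flags | SCX_ENQ_IMMED);
                return;
        }

        scx_bpf_dsq_insert(p, MY_DSQ, SCX_SLICE_DFL, enq_flags);
}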
+4
tools/sched_ext/include/scx/enums.autogen.h
··· 26 26 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_QUEUED); \ 27 27 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_RESET_RUNNABLE_AT); \ 28 28 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_DEQD_FOR_SLEEP); \ 29 + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_SUB_INIT); \ 30 + SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_IMMED); \ 29 31 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_SHIFT); \ 30 32 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_BITS); \ 31 33 SCX_ENUM_SET(skel, scx_ent_flags, SCX_TASK_STATE_MASK); \ ··· 44 42 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_WAKEUP); \ 45 43 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_HEAD); \ 46 44 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_PREEMPT); \ 45 + SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_IMMED); \ 47 46 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_REENQ); \ 48 47 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ 49 48 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ 50 49 SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ 50 + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ 51 51 } while (0)
+1 -1
tools/sched_ext/include/scx/enums.h
··· 9 9 #ifndef __SCX_ENUMS_H 10 10 #define __SCX_ENUMS_H 11 11 12 - static inline void __ENUM_set(u64 *val, char *type, char *name) 12 + static inline void __ENUM_set(u64 *val, const char *type, const char *name) 13 13 { 14 14 bool res; 15 15
+44 -22
tools/sched_ext/scx_central.bpf.c
··· 60 60 const volatile u64 slice_ns; 61 61 62 62 bool timer_pinned = true; 63 + bool timer_started; 63 64 u64 nr_total, nr_locals, nr_queued, nr_lost_pids; 64 65 u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries; 65 66 u64 nr_overflows; ··· 180 179 return false; 181 180 } 182 181 182 + static void start_central_timer(void) 183 + { 184 + struct bpf_timer *timer; 185 + u32 key = 0; 186 + int ret; 187 + 188 + if (likely(timer_started)) 189 + return; 190 + 191 + timer = bpf_map_lookup_elem(&central_timer, &key); 192 + if (!timer) { 193 + scx_bpf_error("failed to lookup central timer"); 194 + return; 195 + } 196 + 197 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 198 + /* 199 + * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a 200 + * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. 201 + * Retry without the PIN. This would be the perfect use case for 202 + * bpf_core_enum_value_exists() but the enum type doesn't have a name 203 + * and can't be used with bpf_core_enum_value_exists(). Oh well... 204 + */ 205 + if (ret == -EINVAL) { 206 + timer_pinned = false; 207 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); 208 + } 209 + 210 + if (ret) { 211 + scx_bpf_error("bpf_timer_start failed (%d)", ret); 212 + return; 213 + } 214 + 215 + timer_started = true; 216 + } 217 + 183 218 void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) 184 219 { 185 220 if (cpu == central_cpu) { 221 + start_central_timer(); 222 + 186 223 /* dispatch for all other CPUs first */ 187 224 __sync_fetch_and_add(&nr_dispatches, 1); 188 225 ··· 253 214 } 254 215 255 216 /* look for a task to run on the central CPU */ 256 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) 217 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) 257 218 return; 258 219 dispatch_to_cpu(central_cpu); 259 220 } else { 260 221 bool *gimme; 261 222 262 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID)) 223 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ_ID, 0)) 263 224 return; 264 225 265 226 gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); ··· 349 310 if (!timer) 350 311 return -ESRCH; 351 312 352 - if (bpf_get_smp_processor_id() != central_cpu) { 353 - scx_bpf_error("init from non-central CPU"); 354 - return -EINVAL; 355 - } 356 - 357 313 bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC); 358 314 bpf_timer_set_callback(timer, central_timerfn); 359 315 360 - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 361 - /* 362 - * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a 363 - * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. 364 - * Retry without the PIN. This would be the perfect use case for 365 - * bpf_core_enum_value_exists() but the enum type doesn't have a name 366 - * and can't be used with bpf_core_enum_value_exists(). Oh well... 367 - */ 368 - if (ret == -EINVAL) { 369 - timer_pinned = false; 370 - ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); 371 - } 372 - if (ret) 373 - scx_bpf_error("bpf_timer_start failed (%d)", ret); 374 - return ret; 316 + scx_bpf_kick_cpu(central_cpu, 0); 317 + 318 + return 0; 375 319 } 376 320 377 321 void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
+1 -25
tools/sched_ext/scx_central.c
··· 5 5 * Copyright (c) 2022 David Vernet <dvernet@meta.com> 6 6 */ 7 7 #define _GNU_SOURCE 8 - #include <sched.h> 9 8 #include <stdio.h> 10 9 #include <unistd.h> 11 10 #include <inttypes.h> ··· 20 21 "\n" 21 22 "See the top-level comment in .bpf.c for more details.\n" 22 23 "\n" 23 - "Usage: %s [-s SLICE_US] [-c CPU]\n" 24 + "Usage: %s [-s SLICE_US] [-c CPU] [-v]\n" 24 25 "\n" 25 26 " -s SLICE_US Override slice duration\n" 26 27 " -c CPU Override the central CPU (default: 0)\n" ··· 48 49 struct bpf_link *link; 49 50 __u64 seq = 0, ecode; 50 51 __s32 opt; 51 - cpu_set_t *cpuset; 52 - size_t cpuset_size; 53 52 54 53 libbpf_set_print(libbpf_print_fn); 55 54 signal(SIGINT, sigint_handler); ··· 92 95 RESIZE_ARRAY(skel, data, cpu_started_at, skel->rodata->nr_cpu_ids); 93 96 94 97 SCX_OPS_LOAD(skel, central_ops, scx_central, uei); 95 - 96 - /* 97 - * Affinitize the loading thread to the central CPU, as: 98 - * - That's where the BPF timer is first invoked in the BPF program. 99 - * - We probably don't want this user space component to take up a core 100 - * from a task that would benefit from avoiding preemption on one of 101 - * the tickless cores. 102 - * 103 - * Until BPF supports pinning the timer, it's not guaranteed that it 104 - * will always be invoked on the central CPU. In practice, this 105 - * suffices the majority of the time. 106 - */ 107 - cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids); 108 - SCX_BUG_ON(!cpuset, "Failed to allocate cpuset"); 109 - cpuset_size = CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids); 110 - CPU_ZERO_S(cpuset_size, cpuset); 111 - CPU_SET_S(skel->rodata->central_cpu, cpuset_size, cpuset); 112 - SCX_BUG_ON(sched_setaffinity(0, cpuset_size, cpuset), 113 - "Failed to affinitize to central CPU %d (max %d)", 114 - skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1); 115 - CPU_FREE(cpuset); 116 98 117 99 link = SCX_OPS_ATTACH(skel, central_ops, scx_central); 118 100
+1 -1
tools/sched_ext/scx_cpu0.bpf.c
··· 66 66 void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev) 67 67 { 68 68 if (cpu == 0) 69 - scx_bpf_dsq_move_to_local(DSQ_CPU0); 69 + scx_bpf_dsq_move_to_local(DSQ_CPU0, 0); 70 70 } 71 71 72 72 s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
+13 -11
tools/sched_ext/scx_flatcg.bpf.c
··· 18 18 * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's 19 19 * share in that competition is 100/(200+100) == 1/3. B's eventual share in the 20 20 * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's 21 - * eventual shaer is the same at 1/6. D is only competing at the top level and 21 + * eventual share is the same at 1/6. D is only competing at the top level and 22 22 * its share is 200/(100+200) == 2/3. 23 23 * 24 24 * So, instead of hierarchically scheduling level-by-level, we can consider it ··· 551 551 * too much, determine the execution time by taking explicit timestamps 552 552 * instead of depending on @p->scx.slice. 553 553 */ 554 - if (!fifo_sched) 555 - p->scx.dsq_vtime += 556 - (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 554 + if (!fifo_sched) { 555 + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); 556 + 557 + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); 558 + } 557 559 558 560 taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); 559 561 if (!taskc) { ··· 662 660 goto out_free; 663 661 } 664 662 665 - if (!scx_bpf_dsq_move_to_local(cgid)) { 663 + if (!scx_bpf_dsq_move_to_local(cgid, 0)) { 666 664 bpf_cgroup_release(cgrp); 667 665 stat_inc(FCG_STAT_PNC_EMPTY); 668 666 goto out_stash; ··· 742 740 goto pick_next_cgroup; 743 741 744 742 if (time_before(now, cpuc->cur_at + cgrp_slice_ns)) { 745 - if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid)) { 743 + if (scx_bpf_dsq_move_to_local(cpuc->cur_cgid, 0)) { 746 744 stat_inc(FCG_STAT_CNS_KEEP); 747 745 return; 748 746 } ··· 782 780 pick_next_cgroup: 783 781 cpuc->cur_at = now; 784 782 785 - if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ)) { 783 + if (scx_bpf_dsq_move_to_local(FALLBACK_DSQ, 0)) { 786 784 cpuc->cur_cgid = 0; 787 785 return; 788 786 } ··· 824 822 if (!(cgc = find_cgrp_ctx(args->cgroup))) 825 823 return -ENOENT; 826 824 827 - p->scx.dsq_vtime = cgc->tvtime_now; 825 + scx_bpf_task_set_dsq_vtime(p, cgc->tvtime_now); 828 826 829 827 return 0; 830 828 } ··· 921 919 struct fcg_cgrp_ctx *from_cgc, *to_cgc; 922 920 s64 delta; 923 921 924 - /* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */ 922 + /* find_cgrp_ctx() triggers scx_bpf_error() on lookup failures */ 925 923 if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to))) 926 924 return; 927 925 928 926 delta = time_delta(p->scx.dsq_vtime, from_cgc->tvtime_now); 929 - p->scx.dsq_vtime = to_cgc->tvtime_now + delta; 927 + scx_bpf_task_set_dsq_vtime(p, to_cgc->tvtime_now + delta); 930 928 } 931 929 932 930 s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init) ··· 962 960 .cgroup_move = (void *)fcg_cgroup_move, 963 961 .init = (void *)fcg_init, 964 962 .exit = (void *)fcg_exit, 965 - .flags = SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING, 963 + .flags = SCX_OPS_ENQ_EXITING, 966 964 .name = "flatcg");
+13 -3
tools/sched_ext/scx_pair.c
··· 21 21 "\n" 22 22 "See the top-level comment in .bpf.c for more details.\n" 23 23 "\n" 24 - "Usage: %s [-S STRIDE]\n" 24 + "Usage: %s [-S STRIDE] [-v]\n" 25 25 "\n" 26 26 " -S STRIDE Override CPU pair stride (default: nr_cpus_ids / 2)\n" 27 27 " -v Print libbpf debug messages\n" ··· 48 48 struct bpf_link *link; 49 49 __u64 seq = 0, ecode; 50 50 __s32 stride, i, opt, outer_fd; 51 + __u32 pair_id = 0; 51 52 52 53 libbpf_set_print(libbpf_print_fn); 53 54 signal(SIGINT, sigint_handler); ··· 83 82 scx_pair__destroy(skel); 84 83 return -1; 85 84 } 85 + 86 + if (skel->rodata->nr_cpu_ids & 1) { 87 + fprintf(stderr, "scx_pair requires an even CPU count, got %u\n", 88 + skel->rodata->nr_cpu_ids); 89 + scx_pair__destroy(skel); 90 + return -1; 91 + } 92 + 86 93 bpf_map__set_max_entries(skel->maps.pair_ctx, skel->rodata->nr_cpu_ids / 2); 87 94 88 95 /* Resize arrays so their element count is equal to cpu count. */ ··· 118 109 119 110 skel->rodata_pair_cpu->pair_cpu[i] = j; 120 111 skel->rodata_pair_cpu->pair_cpu[j] = i; 121 - skel->rodata_pair_id->pair_id[i] = i; 122 - skel->rodata_pair_id->pair_id[j] = i; 112 + skel->rodata_pair_id->pair_id[i] = pair_id; 113 + skel->rodata_pair_id->pair_id[j] = pair_id; 123 114 skel->rodata_in_pair_idx->in_pair_idx[i] = 0; 124 115 skel->rodata_in_pair_idx->in_pair_idx[j] = 1; 116 + pair_id++; 125 117 126 118 printf("[%d, %d] ", i, j); 127 119 }
+172 -42
tools/sched_ext/scx_qmap.bpf.c
··· 11 11 * 12 12 * - BPF-side queueing using PIDs. 13 13 * - Sleepable per-task storage allocation using ops.prep_enable(). 14 - * - Using ops.cpu_release() to handle a higher priority scheduling class taking 15 - * the CPU away. 16 14 * - Core-sched support. 17 15 * 18 16 * This scheduler is primarily for demonstration and testing of sched_ext ··· 24 26 25 27 enum consts { 26 28 ONE_SEC_IN_NS = 1000000000, 29 + ONE_MSEC_IN_NS = 1000000, 30 + LOWPRI_INTV_NS = 10 * ONE_MSEC_IN_NS, 27 31 SHARED_DSQ = 0, 28 32 HIGHPRI_DSQ = 1, 33 + LOWPRI_DSQ = 2, 29 34 HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */ 30 35 }; 31 36 ··· 42 41 const volatile bool highpri_boosting; 43 42 const volatile bool print_dsqs_and_events; 44 43 const volatile bool print_msgs; 44 + const volatile u64 sub_cgroup_id; 45 45 const volatile s32 disallow_tgid; 46 46 const volatile bool suppress_dump; 47 + const volatile bool always_enq_immed; 48 + const volatile u32 immed_stress_nth; 47 49 48 50 u64 nr_highpri_queued; 49 51 u32 test_error_cnt; 52 + 53 + #define MAX_SUB_SCHEDS 8 54 + u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS]; 50 55 51 56 UEI_DEFINE(uei); 52 57 ··· 134 127 } cpu_ctx_stor SEC(".maps"); 135 128 136 129 /* Statistics */ 137 - u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq; 130 + u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0, nr_dequeued, nr_ddsp_from_enq; 138 131 u64 nr_core_sched_execed; 139 132 u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer; 140 133 u32 cpuperf_min, cpuperf_avg, cpuperf_max; ··· 144 137 { 145 138 s32 cpu; 146 139 147 - if (p->nr_cpus_allowed == 1 || 148 - scx_bpf_test_and_clear_cpu_idle(prev_cpu)) 140 + if (!always_enq_immed && p->nr_cpus_allowed == 1) 141 + return prev_cpu; 142 + 143 + if (scx_bpf_test_and_clear_cpu_idle(prev_cpu)) 149 144 return prev_cpu; 150 145 151 146 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); ··· 176 167 177 168 if (!(tctx = lookup_task_ctx(p))) 178 169 return -ESRCH; 170 + 171 + if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD)) 172 + return prev_cpu; 179 173 180 174 cpu = pick_direct_dispatch_cpu(p, prev_cpu); 181 175 ··· 214 202 void *ring; 215 203 s32 cpu; 216 204 217 - if (enq_flags & SCX_ENQ_REENQ) 205 + if (enq_flags & SCX_ENQ_REENQ) { 218 206 __sync_fetch_and_add(&nr_reenqueued, 1); 207 + if (scx_bpf_task_cpu(p) == 0) 208 + __sync_fetch_and_add(&nr_reenqueued_cpu0, 1); 209 + } 219 210 220 211 if (p->flags & PF_KTHREAD) { 221 212 if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth)) ··· 241 226 tctx->core_sched_seq = core_sched_tail_seqs[idx]++; 242 227 243 228 /* 229 + * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch 230 + * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1 231 + * and exercise the kernel IMMED reenqueue trigger paths. 232 + */ 233 + if (immed_stress_nth && !(enq_flags & SCX_ENQ_REENQ)) { 234 + static u32 immed_stress_cnt; 235 + 236 + if (!(++immed_stress_cnt % immed_stress_nth)) { 237 + tctx->force_local = false; 238 + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p), 239 + slice_ns, enq_flags); 240 + return; 241 + } 242 + } 243 + 244 + /* 244 245 * If qmap_select_cpu() is telling us to or this is the last runnable 245 246 * task on the CPU, enqueue locally. 
246 247 */ 247 248 if (tctx->force_local) { 248 249 tctx->force_local = false; 249 250 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags); 251 + return; 252 + } 253 + 254 + /* see lowpri_timerfn() */ 255 + if (__COMPAT_has_generic_reenq() && 256 + p->scx.weight < 2 && !(p->flags & PF_KTHREAD) && !(enq_flags & SCX_ENQ_REENQ)) { 257 + scx_bpf_dsq_insert(p, LOWPRI_DSQ, slice_ns, enq_flags); 250 258 return; 251 259 } 252 260 ··· 413 375 if (dispatch_highpri(false)) 414 376 return; 415 377 416 - if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ)) 378 + if (!nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ, 0)) 417 379 return; 418 380 419 381 if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) { ··· 471 433 __sync_fetch_and_add(&nr_dispatched, 1); 472 434 473 435 scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0); 436 + 437 + /* 438 + * scx_qmap uses a global BPF queue that any CPU's 439 + * dispatch can pop from. If this CPU popped a task that 440 + * can't run here, it gets stranded on SHARED_DSQ after 441 + * consume_dispatch_q() skips it. Kick the task's home 442 + * CPU so it drains SHARED_DSQ. 443 + * 444 + * There's a race between the pop and the flush of the 445 + * buffered dsq_insert: 446 + * 447 + * CPU 0 (dispatching) CPU 1 (home, idle) 448 + * ~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~ 449 + * pop from BPF queue 450 + * dsq_insert(buffered) 451 + * balance: 452 + * SHARED_DSQ empty 453 + * BPF queue empty 454 + * -> goes idle 455 + * flush -> on SHARED 456 + * kick CPU 1 457 + * wakes, drains task 458 + * 459 + * The kick prevents indefinite stalls but a per-CPU 460 + * kthread like ksoftirqd can be briefly stranded when 461 + * its home CPU enters idle with softirq pending, 462 + * triggering: 463 + * 464 + * "NOHZ tick-stop error: local softirq work is pending, handler #N!!!" 465 + * 466 + * from report_idle_softirq(). The kick lands shortly 467 + * after and the home CPU drains the task. This could be 468 + * avoided by e.g. dispatching pinned tasks to local or 469 + * global DSQs, but the current code is left as-is to 470 + * document this class of issue -- other schedulers 471 + * seeing similar warnings can use this as a reference. 472 + */ 473 + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) 474 + scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0); 475 + 474 476 bpf_task_release(p); 475 477 476 478 batch--; ··· 518 440 if (!batch || !scx_bpf_dispatch_nr_slots()) { 519 441 if (dispatch_highpri(false)) 520 442 return; 521 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 443 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 522 444 return; 523 445 } 524 446 if (!cpuc->dsp_cnt) ··· 526 448 } 527 449 528 450 cpuc->dsp_cnt = 0; 451 + } 452 + 453 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 454 + if (sub_sched_cgroup_ids[i] && 455 + scx_bpf_sub_dispatch(sub_sched_cgroup_ids[i])) 456 + return; 529 457 } 530 458 531 459 /* ··· 616 532 return task_qdist(a) > task_qdist(b); 617 533 } 618 534 619 - SEC("tp_btf/sched_switch") 620 - int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev, 621 - struct task_struct *next, unsigned long prev_state) 622 - { 623 - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) 624 - return 0; 625 - 626 - /* 627 - * If @cpu is taken by a higher priority scheduling class, it is no 628 - * longer available for executing sched_ext tasks. As we don't want the 629 - * tasks in @cpu's local dsq to sit there until @cpu becomes available 630 - * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ 631 - * handling in qmap_enqueue(). 
632 - */ 633 - switch (next->policy) { 634 - case 1: /* SCHED_FIFO */ 635 - case 2: /* SCHED_RR */ 636 - case 6: /* SCHED_DEADLINE */ 637 - scx_bpf_reenqueue_local(); 638 - } 639 - 640 - return 0; 641 - } 642 - 643 - void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args) 644 - { 645 - /* see qmap_sched_switch() to learn how to do this on newer kernels */ 646 - if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere()) 647 - scx_bpf_reenqueue_local(); 648 - } 535 + /* 536 + * sched_switch tracepoint and cpu_release handlers are no longer needed. 537 + * With SCX_OPS_ALWAYS_ENQ_IMMED, wakeup_preempt_scx() reenqueues IMMED 538 + * tasks when a higher-priority scheduling class takes the CPU. 539 + */ 649 540 650 541 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p, 651 542 struct scx_init_task_args *args) ··· 915 856 return 0; 916 857 } 917 858 859 + struct lowpri_timer { 860 + struct bpf_timer timer; 861 + }; 862 + 863 + struct { 864 + __uint(type, BPF_MAP_TYPE_ARRAY); 865 + __uint(max_entries, 1); 866 + __type(key, u32); 867 + __type(value, struct lowpri_timer); 868 + } lowpri_timer SEC(".maps"); 869 + 870 + /* 871 + * Nice 19 tasks are put into the lowpri DSQ. Every 10ms, reenq is triggered and 872 + * the tasks are transferred to SHARED_DSQ. 873 + */ 874 + static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer) 875 + { 876 + scx_bpf_dsq_reenq(LOWPRI_DSQ, 0); 877 + bpf_timer_start(timer, LOWPRI_INTV_NS, 0); 878 + return 0; 879 + } 880 + 918 881 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) 919 882 { 920 883 u32 key = 0; 921 884 struct bpf_timer *timer; 922 885 s32 ret; 923 886 924 - if (print_msgs) 887 + if (print_msgs && !sub_cgroup_id) 925 888 print_cpus(); 926 889 927 890 ret = scx_bpf_create_dsq(SHARED_DSQ, -1); ··· 958 877 return ret; 959 878 } 960 879 880 + ret = scx_bpf_create_dsq(LOWPRI_DSQ, -1); 881 + if (ret) 882 + return ret; 883 + 961 884 timer = bpf_map_lookup_elem(&monitor_timer, &key); 962 885 if (!timer) 963 886 return -ESRCH; 964 - 965 887 bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC); 966 888 bpf_timer_set_callback(timer, monitor_timerfn); 889 + ret = bpf_timer_start(timer, ONE_SEC_IN_NS, 0); 890 + if (ret) 891 + return ret; 967 892 968 - return bpf_timer_start(timer, ONE_SEC_IN_NS, 0); 893 + if (__COMPAT_has_generic_reenq()) { 894 + /* see lowpri_timerfn() */ 895 + timer = bpf_map_lookup_elem(&lowpri_timer, &key); 896 + if (!timer) 897 + return -ESRCH; 898 + bpf_timer_init(timer, &lowpri_timer, CLOCK_MONOTONIC); 899 + bpf_timer_set_callback(timer, lowpri_timerfn); 900 + ret = bpf_timer_start(timer, LOWPRI_INTV_NS, 0); 901 + if (ret) 902 + return ret; 903 + } 904 + 905 + return 0; 969 906 } 970 907 971 908 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) 972 909 { 973 910 UEI_RECORD(uei, ei); 911 + } 912 + 913 + s32 BPF_STRUCT_OPS(qmap_sub_attach, struct scx_sub_attach_args *args) 914 + { 915 + s32 i; 916 + 917 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 918 + if (!sub_sched_cgroup_ids[i]) { 919 + sub_sched_cgroup_ids[i] = args->ops->sub_cgroup_id; 920 + bpf_printk("attaching sub-sched[%d] on %s", 921 + i, args->cgroup_path); 922 + return 0; 923 + } 924 + } 925 + 926 + return -ENOSPC; 927 + } 928 + 929 + void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args) 930 + { 931 + s32 i; 932 + 933 + for (i = 0; i < MAX_SUB_SCHEDS; i++) { 934 + if (sub_sched_cgroup_ids[i] == args->ops->sub_cgroup_id) { 935 + sub_sched_cgroup_ids[i] = 0; 936 + bpf_printk("detaching sub-sched[%d] on %s", 937 + i, 
args->cgroup_path); 938 + break; 939 + } 940 + } 974 941 } 975 942 976 943 SCX_OPS_DEFINE(qmap_ops, ··· 1028 899 .dispatch = (void *)qmap_dispatch, 1029 900 .tick = (void *)qmap_tick, 1030 901 .core_sched_before = (void *)qmap_core_sched_before, 1031 - .cpu_release = (void *)qmap_cpu_release, 1032 902 .init_task = (void *)qmap_init_task, 1033 903 .dump = (void *)qmap_dump, 1034 904 .dump_cpu = (void *)qmap_dump_cpu, ··· 1035 907 .cgroup_init = (void *)qmap_cgroup_init, 1036 908 .cgroup_set_weight = (void *)qmap_cgroup_set_weight, 1037 909 .cgroup_set_bandwidth = (void *)qmap_cgroup_set_bandwidth, 910 + .sub_attach = (void *)qmap_sub_attach, 911 + .sub_detach = (void *)qmap_sub_detach, 1038 912 .cpu_online = (void *)qmap_cpu_online, 1039 913 .cpu_offline = (void *)qmap_cpu_offline, 1040 914 .init = (void *)qmap_init,
+25 -4
tools/sched_ext/scx_qmap.c
··· 10 10 #include <inttypes.h> 11 11 #include <signal.h> 12 12 #include <libgen.h> 13 + #include <sys/stat.h> 13 14 #include <bpf/bpf.h> 14 15 #include <scx/common.h> 15 16 #include "scx_qmap.bpf.skel.h" ··· 21 20 "See the top-level comment in .bpf.c for more details.\n" 22 21 "\n" 23 22 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n" 24 - " [-P] [-M] [-d PID] [-D LEN] [-p] [-v]\n" 23 + " [-P] [-M] [-H] [-d PID] [-D LEN] [-S] [-p] [-I] [-F COUNT] [-v]\n" 25 24 "\n" 26 25 " -s SLICE_US Override slice duration\n" 27 26 " -e COUNT Trigger scx_bpf_error() after COUNT enqueues\n" ··· 36 35 " -D LEN Set scx_exit_info.dump buffer length\n" 37 36 " -S Suppress qmap-specific debug dump\n" 38 37 " -p Switch only tasks on SCHED_EXT policy instead of all\n" 38 + " -I Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n" 39 + " -F COUNT IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n" 39 40 " -v Print libbpf debug messages\n" 40 41 " -h Display this help and exit\n"; 41 42 ··· 70 67 71 68 skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL"); 72 69 73 - while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHd:D:Spvh")) != -1) { 70 + while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PMHc:d:D:SpIF:vh")) != -1) { 74 71 switch (opt) { 75 72 case 's': 76 73 skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; ··· 99 96 case 'H': 100 97 skel->rodata->highpri_boosting = true; 101 98 break; 99 + case 'c': { 100 + struct stat st; 101 + if (stat(optarg, &st) < 0) { 102 + perror("stat"); 103 + return 1; 104 + } 105 + skel->struct_ops.qmap_ops->sub_cgroup_id = st.st_ino; 106 + skel->rodata->sub_cgroup_id = st.st_ino; 107 + break; 108 + } 102 109 case 'd': 103 110 skel->rodata->disallow_tgid = strtol(optarg, NULL, 0); 104 111 if (skel->rodata->disallow_tgid < 0) ··· 122 109 break; 123 110 case 'p': 124 111 skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL; 112 + break; 113 + case 'I': 114 + skel->rodata->always_enq_immed = true; 115 + skel->struct_ops.qmap_ops->flags |= SCX_OPS_ALWAYS_ENQ_IMMED; 116 + break; 117 + case 'F': 118 + skel->rodata->immed_stress_nth = strtoul(optarg, NULL, 0); 125 119 break; 126 120 case 'v': 127 121 verbose = true; ··· 146 126 long nr_enqueued = skel->bss->nr_enqueued; 147 127 long nr_dispatched = skel->bss->nr_dispatched; 148 128 149 - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", 129 + printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%"PRIu64"/%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n", 150 130 nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, 151 - skel->bss->nr_reenqueued, skel->bss->nr_dequeued, 131 + skel->bss->nr_reenqueued, skel->bss->nr_reenqueued_cpu0, 132 + skel->bss->nr_dequeued, 152 133 skel->bss->nr_core_sched_execed, 153 134 skel->bss->nr_ddsp_from_enq); 154 135 printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
+3 -2
tools/sched_ext/scx_sdt.bpf.c
··· 317 317 }; 318 318 319 319 /* Zero out one word at a time. */ 320 - for (i = zero; i < alloc->pool.elem_size / 8 && can_loop; i++) { 320 + for (i = zero; i < (alloc->pool.elem_size - sizeof(struct sdt_data)) / 8 321 + && can_loop; i++) { 321 322 data->payload[i] = 0; 322 323 } 323 324 } ··· 644 643 645 644 void BPF_STRUCT_OPS(sdt_dispatch, s32 cpu, struct task_struct *prev) 646 645 { 647 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 646 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 648 647 } 649 648 650 649 s32 BPF_STRUCT_OPS_SLEEPABLE(sdt_init_task, struct task_struct *p,
+1 -1
tools/sched_ext/scx_sdt.c
··· 20 20 "\n" 21 21 "Modified version of scx_simple that demonstrates arena-based data structures.\n" 22 22 "\n" 23 - "Usage: %s [-f] [-v]\n" 23 + "Usage: %s [-v]\n" 24 24 "\n" 25 25 " -v Print libbpf debug messages\n" 26 26 " -h Display this help and exit\n";
+5 -3
tools/sched_ext/scx_simple.bpf.c
··· 89 89 90 90 void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev) 91 91 { 92 - scx_bpf_dsq_move_to_local(SHARED_DSQ); 92 + scx_bpf_dsq_move_to_local(SHARED_DSQ, 0); 93 93 } 94 94 95 95 void BPF_STRUCT_OPS(simple_running, struct task_struct *p) ··· 121 121 * too much, determine the execution time by taking explicit timestamps 122 122 * instead of depending on @p->scx.slice. 123 123 */ 124 - p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 124 + u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice); 125 + 126 + scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta); 125 127 } 126 128 127 129 void BPF_STRUCT_OPS(simple_enable, struct task_struct *p) 128 130 { 129 - p->scx.dsq_vtime = vtime_now; 131 + scx_bpf_task_set_dsq_vtime(p, vtime_now); 130 132 } 131 133 132 134 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
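To make the inverse weight scaling above concrete (values picked for illustration): at the default weight of 100 the vtime charge equals the CPU time consumed, while a task at weight 200 that used its full default slice is charged slice * 100 / 200, i.e. half as much vtime for the same CPU time. Higher-weight tasks therefore advance through the vtime-ordered DSQ more slowly and receive a proportionally larger share.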
+1 -1
tools/sched_ext/scx_userland.c
··· 38 38 "\n" 39 39 "Try to reduce `sysctl kernel.pid_max` if this program triggers OOMs.\n" 40 40 "\n" 41 - "Usage: %s [-b BATCH]\n" 41 + "Usage: %s [-b BATCH] [-v]\n" 42 42 "\n" 43 43 " -b BATCH The number of tasks to batch when dispatching (default: 8)\n" 44 44 " -v Print libbpf debug messages\n"
+1
tools/testing/selftests/sched_ext/Makefile
··· 163 163 164 164 auto-test-targets := \ 165 165 create_dsq \ 166 + dequeue \ 166 167 enq_last_no_enq_fails \ 167 168 ddsp_bogus_dsq_fail \ 168 169 ddsp_vtimelocal_fail \
+389
tools/testing/selftests/sched_ext/dequeue.bpf.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + /*
3 +  * A scheduler that validates ops.dequeue() is called correctly:
4 +  * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
5 +  *   scheduler entirely: no ops.dequeue() should be called
6 +  * - Tasks dispatched to user DSQs from ops.enqueue() enter BPF custody:
7 +  *   ops.dequeue() must be called when they leave custody
8 +  * - Every ops.enqueue() dispatch to non-terminal DSQs is followed by
9 +  *   exactly one ops.dequeue() (validate 1:1 pairing and state machine)
10 +  *
11 +  * Copyright (c) 2026 NVIDIA Corporation.
12 +  */
13 +
14 + #include <scx/common.bpf.h>
15 +
16 + #define SHARED_DSQ 0
17 +
18 + /*
19 +  * BPF internal queue.
20 +  *
21 +  * Tasks are stored here and consumed from ops.dispatch(), validating that
22 +  * tasks on BPF internal structures still get ops.dequeue() when they
23 +  * leave.
24 +  */
25 + struct {
26 + 	__uint(type, BPF_MAP_TYPE_QUEUE);
27 + 	__uint(max_entries, 32768);
28 + 	__type(value, s32);
29 + } global_queue SEC(".maps");
30 +
31 + char _license[] SEC("license") = "GPL";
32 +
33 + UEI_DEFINE(uei);
34 +
35 + /*
36 +  * Counters to track the lifecycle of tasks:
37 +  * - enqueue_cnt: Number of times ops.enqueue() was called
38 +  * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
39 +  * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
40 +  * - change_dequeue_cnt: Number of property change dequeues
41 +  * - bpf_queue_full: Number of times the BPF internal queue was full
42 +  */
43 + u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt, bpf_queue_full;
44 +
45 + /*
46 +  * Test scenarios:
47 +  * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
48 +  *    scheduler, no dequeue callbacks)
49 +  * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
50 +  *    scheduler, no dequeue callbacks)
51 +  * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
52 +  *    dequeue callbacks expected)
53 +  * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
54 +  *    scheduler, no dequeue callbacks)
55 +  * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
56 +  *    scheduler, no dequeue callbacks)
57 +  * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
58 +  *    dequeue callbacks expected)
59 +  * 6) BPF internal queue from ops.enqueue(): store task PIDs in ops.enqueue(),
60 +  *    consume in ops.dispatch() and dispatch to local DSQ (validates dequeue
61 +  *    for tasks stored in internal BPF data structures)
62 +  */
63 + u32 test_scenario;
64 +
65 + /*
66 +  * Per-task state to track lifecycle and validate workflow semantics.
67 +  * State transitions:
68 +  * NONE -> ENQUEUED (on enqueue)
69 +  * NONE -> DISPATCHED (on direct dispatch to terminal DSQ)
70 +  * ENQUEUED -> DISPATCHED (on dispatch dequeue)
71 +  * DISPATCHED -> NONE (on property change dequeue or re-enqueue)
72 +  * ENQUEUED -> NONE (on property change dequeue before dispatch)
73 +  */
74 + enum task_state {
75 + 	TASK_NONE = 0,
76 + 	TASK_ENQUEUED,
77 + 	TASK_DISPATCHED,
78 + };
79 +
80 + struct task_ctx {
81 + 	enum task_state state;	/* Current state in the workflow */
82 + 	u64 enqueue_seq;	/* Sequence number for debugging */
83 + };
84 +
85 + struct {
86 + 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
87 + 	__uint(map_flags, BPF_F_NO_PREALLOC);
88 + 	__type(key, int);
89 + 	__type(value, struct task_ctx);
90 + } task_ctx_stor SEC(".maps");
91 +
92 + static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
93 + {
94 + 	return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
95 + }
96 +
97 + s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
98 + 		   s32 prev_cpu, u64 wake_flags)
99 + {
100 + 	struct task_ctx *tctx;
101 +
102 + 	tctx = try_lookup_task_ctx(p);
103 + 	if (!tctx)
104 + 		return prev_cpu;
105 +
106 + 	switch (test_scenario) {
107 + 	case 0:
108 + 		/*
109 + 		 * Direct dispatch to the local DSQ.
110 + 		 *
111 + 		 * Task bypasses BPF scheduler entirely: no enqueue
112 + 		 * tracking, no ops.dequeue() callbacks.
113 + 		 */
114 + 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
115 + 		tctx->state = TASK_DISPATCHED;
116 + 		break;
117 + 	case 1:
118 + 		/*
119 + 		 * Direct dispatch to the global DSQ.
120 + 		 *
121 + 		 * Task bypasses BPF scheduler entirely: no enqueue
122 + 		 * tracking, no ops.dequeue() callbacks.
123 + 		 */
124 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
125 + 		tctx->state = TASK_DISPATCHED;
126 + 		break;
127 + 	case 2:
128 + 		/*
129 + 		 * Dispatch to a shared user DSQ.
130 + 		 *
131 + 		 * Task enters BPF scheduler management: track
132 + 		 * enqueue/dequeue lifecycle and validate state
133 + 		 * transitions.
134 + 		 */
135 + 		if (tctx->state == TASK_ENQUEUED)
136 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
137 + 				      p->pid, p->comm, tctx->enqueue_seq);
138 +
139 + 		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
140 +
141 + 		__sync_fetch_and_add(&enqueue_cnt, 1);
142 +
143 + 		tctx->state = TASK_ENQUEUED;
144 + 		tctx->enqueue_seq++;
145 + 		break;
146 + 	}
147 +
148 + 	return prev_cpu;
149 + }
150 +
151 + void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
152 + {
153 + 	struct task_ctx *tctx;
154 + 	s32 pid = p->pid;
155 +
156 + 	tctx = try_lookup_task_ctx(p);
157 + 	if (!tctx)
158 + 		return;
159 +
160 + 	switch (test_scenario) {
161 + 	case 3:
162 + 		/*
163 + 		 * Direct dispatch to the local DSQ.
164 + 		 *
165 + 		 * Task bypasses BPF scheduler entirely: no enqueue
166 + 		 * tracking, no ops.dequeue() callbacks.
167 + 		 */
168 + 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
169 + 		tctx->state = TASK_DISPATCHED;
170 + 		break;
171 + 	case 4:
172 + 		/*
173 + 		 * Direct dispatch to the global DSQ.
174 + 		 *
175 + 		 * Task bypasses BPF scheduler entirely: no enqueue
176 + 		 * tracking, no ops.dequeue() callbacks.
177 + 		 */
178 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
179 + 		tctx->state = TASK_DISPATCHED;
180 + 		break;
181 + 	case 5:
182 + 		/*
183 + 		 * Dispatch to shared user DSQ.
184 + 		 *
185 + 		 * Task enters BPF scheduler management: track
186 + 		 * enqueue/dequeue lifecycle and validate state
187 + 		 * transitions.
188 + 		 */
189 + 		if (tctx->state == TASK_ENQUEUED)
190 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
191 + 				      p->pid, p->comm, tctx->enqueue_seq);
192 +
193 + 		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
194 +
195 + 		__sync_fetch_and_add(&enqueue_cnt, 1);
196 +
197 + 		tctx->state = TASK_ENQUEUED;
198 + 		tctx->enqueue_seq++;
199 + 		break;
200 + 	case 6:
201 + 		/*
202 + 		 * Store task in BPF internal queue.
203 + 		 *
204 + 		 * Task enters BPF scheduler management: track
205 + 		 * enqueue/dequeue lifecycle and validate state
206 + 		 * transitions.
207 + 		 */
208 + 		if (tctx->state == TASK_ENQUEUED)
209 + 			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
210 + 				      p->pid, p->comm, tctx->enqueue_seq);
211 +
212 + 		if (bpf_map_push_elem(&global_queue, &pid, 0)) {
213 + 			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
214 + 			__sync_fetch_and_add(&bpf_queue_full, 1);
215 +
216 + 			tctx->state = TASK_DISPATCHED;
217 + 		} else {
218 + 			__sync_fetch_and_add(&enqueue_cnt, 1);
219 +
220 + 			tctx->state = TASK_ENQUEUED;
221 + 			tctx->enqueue_seq++;
222 + 		}
223 + 		break;
224 + 	default:
225 + 		/* For all other scenarios, dispatch to the global DSQ */
226 + 		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
227 + 		tctx->state = TASK_DISPATCHED;
228 + 		break;
229 + 	}
230 +
231 + 	scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
232 + }
233 +
234 + void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags)
235 + {
236 + 	struct task_ctx *tctx;
237 +
238 + 	__sync_fetch_and_add(&dequeue_cnt, 1);
239 +
240 + 	tctx = try_lookup_task_ctx(p);
241 + 	if (!tctx)
242 + 		return;
243 +
244 + 	/*
245 + 	 * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global),
246 + 	 * ops.dequeue() should never be called because tasks bypass the
247 + 	 * BPF scheduler entirely. If we get here, it's a kernel bug.
248 + 	 */
249 + 	if (test_scenario == 0 || test_scenario == 3) {
250 + 		scx_bpf_error("%d (%s): dequeue called for local DSQ scenario",
251 + 			      p->pid, p->comm);
252 + 		return;
253 + 	}
254 +
255 + 	if (test_scenario == 1 || test_scenario == 4) {
256 + 		scx_bpf_error("%d (%s): dequeue called for global DSQ scenario",
257 + 			      p->pid, p->comm);
258 + 		return;
259 + 	}
260 +
261 + 	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
262 + 		/*
263 + 		 * Property change interrupting the workflow. Valid from
264 + 		 * both ENQUEUED and DISPATCHED states. Transitions task
265 + 		 * back to NONE state.
266 + 		 */
267 + 		__sync_fetch_and_add(&change_dequeue_cnt, 1);
268 +
269 + 		/* Validate state transition */
270 + 		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED)
271 + 			scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu",
272 + 				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
273 +
274 + 		/*
275 + 		 * Transition back to NONE: task outside scheduler control.
276 + 		 *
277 + 		 * Scenario 6: dispatch() checks tctx->state after popping a
278 + 		 * PID, if the task is in state NONE, it was dequeued by
279 + 		 * property change and must not be dispatched (this
280 + 		 * prevents "target CPU not allowed").
281 + 		 */
282 + 		tctx->state = TASK_NONE;
283 + 	} else {
284 + 		/*
285 + 		 * Regular dispatch dequeue: kernel is moving the task from
286 + 		 * BPF custody to a terminal DSQ. Normally we come from
287 + 		 * ENQUEUED state. We can also see TASK_NONE if the task
288 + 		 * was dequeued by property change (SCX_DEQ_SCHED_CHANGE)
289 + 		 * while it was already on a DSQ (dispatched but not yet
290 + 		 * consumed); in that case we just leave state as NONE.
291 + 		 */
292 + 		__sync_fetch_and_add(&dispatch_dequeue_cnt, 1);
293 +
294 + 		/*
295 + 		 * Must be ENQUEUED (normal path) or NONE (already dequeued
296 + 		 * by property change while on a DSQ).
297 + 		 */
298 + 		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_NONE)
299 + 			scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu",
300 + 				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
301 +
302 + 		if (tctx->state == TASK_ENQUEUED)
303 + 			tctx->state = TASK_DISPATCHED;
304 +
305 + 		/* NONE: leave as-is, task was already property-change dequeued */
306 + 	}
307 + }
308 +
309 + void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev)
310 + {
311 + 	if (test_scenario == 6) {
312 + 		struct task_ctx *tctx;
313 + 		struct task_struct *p;
314 + 		s32 pid;
315 +
316 + 		if (bpf_map_pop_elem(&global_queue, &pid))
317 + 			return;
318 +
319 + 		p = bpf_task_from_pid(pid);
320 + 		if (!p)
321 + 			return;
322 +
323 + 		/*
324 + 		 * If the task was dequeued by property change
325 + 		 * (ops.dequeue() set tctx->state = TASK_NONE), skip
326 + 		 * dispatch.
327 + 		 */
328 + 		tctx = try_lookup_task_ctx(p);
329 + 		if (!tctx || tctx->state == TASK_NONE) {
330 + 			bpf_task_release(p);
331 + 			return;
332 + 		}
333 +
334 + 		/*
335 + 		 * Dispatch to this CPU's local DSQ if allowed, otherwise
336 + 		 * fallback to the global DSQ.
337 + 		 */
338 + 		if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
339 + 			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
340 + 		else
341 + 			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
342 +
343 + 		bpf_task_release(p);
344 + 	} else {
345 + 		scx_bpf_dsq_move_to_local(SHARED_DSQ, 0);
346 + 	}
347 + }
348 +
349 + s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p,
350 + 		   struct scx_init_task_args *args)
351 + {
352 + 	struct task_ctx *tctx;
353 +
354 + 	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
355 + 				    BPF_LOCAL_STORAGE_GET_F_CREATE);
356 + 	if (!tctx)
357 + 		return -ENOMEM;
358 +
359 + 	return 0;
360 + }
361 +
362 + s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init)
363 + {
364 + 	s32 ret;
365 +
366 + 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
367 + 	if (ret)
368 + 		return ret;
369 +
370 + 	return 0;
371 + }
372 +
373 + void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei)
374 + {
375 + 	UEI_RECORD(uei, ei);
376 + }
377 +
378 + SEC(".struct_ops.link")
379 + struct sched_ext_ops dequeue_ops = {
380 + 	.select_cpu = (void *)dequeue_select_cpu,
381 + 	.enqueue = (void *)dequeue_enqueue,
382 + 	.dequeue = (void *)dequeue_dequeue,
383 + 	.dispatch = (void *)dequeue_dispatch,
384 + 	.init_task = (void *)dequeue_init_task,
385 + 	.init = (void *)dequeue_init,
386 + 	.exit = (void *)dequeue_exit,
387 + 	.flags = SCX_OPS_ENQ_LAST,
388 + 	.name = "dequeue_test",
389 + };
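The 1:1 pairing this scheduler asserts is what makes simple per-scheduler bookkeeping safe. As a rough illustration only (hypothetical names, not part of this series, reusing the SHARED_DSQ id from the test scheduler above), a scheduler that counts how many tasks it currently holds in BPF custody can rely on ops.dequeue() firing exactly once whenever a task leaves custody, whether through a dispatch or a scheduling-property change:

/*
 * Hypothetical fragment: nr_owned counts tasks currently parked on the
 * user DSQ. It only stays balanced because every enqueue into custody is
 * matched by exactly one ops.dequeue() call.
 */
static u64 nr_owned;

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
	__sync_fetch_and_add(&nr_owned, 1);
}

void BPF_STRUCT_OPS(example_dequeue, struct task_struct *p, u64 deq_flags)
{
	/* Runs once per custody exit, whatever deq_flags carries. */
	__sync_fetch_and_add(&nr_owned, -1);
}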
+274
tools/testing/selftests/sched_ext/dequeue.c
···
1 + // SPDX-License-Identifier: GPL-2.0
2 + /*
3 +  * Copyright (c) 2025 NVIDIA Corporation.
4 +  */
5 + #define _GNU_SOURCE
6 + #include <stdio.h>
7 + #include <unistd.h>
8 + #include <signal.h>
9 + #include <time.h>
10 + #include <bpf/bpf.h>
11 + #include <scx/common.h>
12 + #include <sys/wait.h>
13 + #include <sched.h>
14 + #include <pthread.h>
15 + #include "scx_test.h"
16 + #include "dequeue.bpf.skel.h"
17 +
18 + #define NUM_WORKERS 8
19 + #define AFFINITY_HAMMER_MS 500
20 +
21 + /*
22 +  * Worker function that creates enqueue/dequeue events via CPU work and
23 +  * sleep.
24 +  */
25 + static void worker_fn(int id)
26 + {
27 + 	int i;
28 + 	volatile int sum = 0;
29 +
30 + 	for (i = 0; i < 1000; i++) {
31 + 		volatile int j;
32 +
33 + 		/* Do some work to trigger scheduling events */
34 + 		for (j = 0; j < 10000; j++)
35 + 			sum += j;
36 +
37 + 		/* Sleep to trigger dequeue */
38 + 		usleep(1000 + (id * 100));
39 + 	}
40 +
41 + 	exit(0);
42 + }
43 +
44 + /*
45 +  * This thread changes workers' affinity from outside so that some changes
46 +  * hit tasks while they are still in the scheduler's queue and trigger
47 +  * property-change dequeues.
48 +  */
49 + static void *affinity_hammer_fn(void *arg)
50 + {
51 + 	pid_t *pids = arg;
52 + 	cpu_set_t cpuset;
53 + 	int i = 0, n = NUM_WORKERS;
54 + 	struct timespec start, now;
55 +
56 + 	clock_gettime(CLOCK_MONOTONIC, &start);
57 + 	while (1) {
58 + 		int w = i % n;
59 + 		int cpu = (i / n) % 4;
60 +
61 + 		CPU_ZERO(&cpuset);
62 + 		CPU_SET(cpu, &cpuset);
63 + 		sched_setaffinity(pids[w], sizeof(cpuset), &cpuset);
64 + 		i++;
65 +
66 + 		/* Check elapsed time every 256 iterations to limit gettime cost */
67 + 		if ((i & 255) == 0) {
68 + 			long long elapsed_ms;
69 +
70 + 			clock_gettime(CLOCK_MONOTONIC, &now);
71 + 			elapsed_ms = (now.tv_sec - start.tv_sec) * 1000LL +
72 + 				     (now.tv_nsec - start.tv_nsec) / 1000000;
73 + 			if (elapsed_ms >= AFFINITY_HAMMER_MS)
74 + 				break;
75 + 		}
76 + 	}
77 + 	return NULL;
78 + }
79 +
80 + static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario,
81 + 					 const char *scenario_name)
82 + {
83 + 	struct bpf_link *link;
84 + 	pid_t pids[NUM_WORKERS];
85 + 	pthread_t hammer;
86 +
87 + 	int i, status;
88 + 	u64 enq_start, deq_start,
89 + 	    dispatch_deq_start, change_deq_start, bpf_queue_full_start;
90 + 	u64 enq_delta, deq_delta,
91 + 	    dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta;
92 +
93 + 	/* Set the test scenario */
94 + 	skel->bss->test_scenario = scenario;
95 +
96 + 	/* Record starting counts */
97 + 	enq_start = skel->bss->enqueue_cnt;
98 + 	deq_start = skel->bss->dequeue_cnt;
99 + 	dispatch_deq_start = skel->bss->dispatch_dequeue_cnt;
100 + 	change_deq_start = skel->bss->change_dequeue_cnt;
101 + 	bpf_queue_full_start = skel->bss->bpf_queue_full;
102 +
103 + 	link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops);
104 + 	SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name);
105 +
106 + 	/* Fork worker processes to generate enqueue/dequeue events */
107 + 	for (i = 0; i < NUM_WORKERS; i++) {
108 + 		pids[i] = fork();
109 + 		SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i);
110 +
111 + 		if (pids[i] == 0) {
112 + 			worker_fn(i);
113 + 			/* Should not reach here */
114 + 			exit(1);
115 + 		}
116 + 	}
117 +
118 + 	/*
119 + 	 * Run an "affinity hammer" so that some property changes hit tasks
120 + 	 * while they are still in BPF custody (e.g., in user DSQ or BPF
121 + 	 * queue), triggering SCX_DEQ_SCHED_CHANGE dequeues.
122 + 	 */
123 + 	SCX_FAIL_IF(pthread_create(&hammer, NULL, affinity_hammer_fn, pids) != 0,
124 + 		    "Failed to create affinity hammer thread");
125 + 	pthread_join(hammer, NULL);
126 +
127 + 	/* Wait for all workers to complete */
128 + 	for (i = 0; i < NUM_WORKERS; i++) {
129 + 		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
130 + 			    "Failed to wait for worker %d", i);
131 + 		SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status);
132 + 	}
133 +
134 + 	bpf_link__destroy(link);
135 +
136 + 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG));
137 +
138 + 	/* Calculate deltas */
139 + 	enq_delta = skel->bss->enqueue_cnt - enq_start;
140 + 	deq_delta = skel->bss->dequeue_cnt - deq_start;
141 + 	dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start;
142 + 	change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start;
143 + 	bpf_queue_full_delta = skel->bss->bpf_queue_full - bpf_queue_full_start;
144 +
145 + 	printf("%s:\n", scenario_name);
146 + 	printf(" enqueues: %lu\n", (unsigned long)enq_delta);
147 + 	printf(" dequeues: %lu (dispatch: %lu, property_change: %lu)\n",
148 + 	       (unsigned long)deq_delta,
149 + 	       (unsigned long)dispatch_deq_delta,
150 + 	       (unsigned long)change_deq_delta);
151 + 	printf(" BPF queue full: %lu\n", (unsigned long)bpf_queue_full_delta);
152 +
153 + 	/*
154 + 	 * Validate enqueue/dequeue lifecycle tracking.
155 + 	 *
156 + 	 * For scenarios 0, 1, 3, 4 (local and global DSQs from
157 + 	 * ops.select_cpu() and ops.enqueue()), both enqueues and dequeues
158 + 	 * should be 0 because tasks bypass the BPF scheduler entirely:
159 + 	 * tasks never enter BPF scheduler's custody.
160 + 	 *
161 + 	 * For scenarios 2, 5, 6 (user DSQ or BPF internal queue) we expect
162 + 	 * both enqueues and dequeues.
163 + 	 *
164 + 	 * The BPF code does strict state machine validation with
165 + 	 * scx_bpf_error() to ensure the workflow semantics are correct.
166 + 	 *
167 + 	 * If we reach this point without errors, the semantics are
168 + 	 * validated correctly.
169 + 	 */
170 + 	if (scenario == 0 || scenario == 1 ||
171 + 	    scenario == 3 || scenario == 4) {
172 + 		/* Tasks bypass BPF scheduler completely */
173 + 		SCX_EQ(enq_delta, 0);
174 + 		SCX_EQ(deq_delta, 0);
175 + 		SCX_EQ(dispatch_deq_delta, 0);
176 + 		SCX_EQ(change_deq_delta, 0);
177 + 	} else {
178 + 		/*
179 + 		 * User DSQ from ops.enqueue() or ops.select_cpu(): tasks
180 + 		 * enter BPF scheduler's custody.
181 + 		 *
182 + 		 * Also validate 1:1 enqueue/dequeue pairing.
183 + 		 */
184 + 		SCX_GT(enq_delta, 0);
185 + 		SCX_GT(deq_delta, 0);
186 + 		SCX_EQ(enq_delta, deq_delta);
187 + 	}
188 +
189 + 	return SCX_TEST_PASS;
190 + }
191 +
192 + static enum scx_test_status setup(void **ctx)
193 + {
194 + 	struct dequeue *skel;
195 +
196 + 	skel = dequeue__open();
197 + 	SCX_FAIL_IF(!skel, "Failed to open skel");
198 + 	SCX_ENUM_INIT(skel);
199 + 	SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel");
200 +
201 + 	*ctx = skel;
202 +
203 + 	return SCX_TEST_PASS;
204 + }
205 +
206 + static enum scx_test_status run(void *ctx)
207 + {
208 + 	struct dequeue *skel = ctx;
209 + 	enum scx_test_status status;
210 +
211 + 	status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.select_cpu()");
212 + 	if (status != SCX_TEST_PASS)
213 + 		return status;
214 +
215 + 	status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.select_cpu()");
216 + 	if (status != SCX_TEST_PASS)
217 + 		return status;
218 +
219 + 	status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.select_cpu()");
220 + 	if (status != SCX_TEST_PASS)
221 + 		return status;
222 +
223 + 	status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.enqueue()");
224 + 	if (status != SCX_TEST_PASS)
225 + 		return status;
226 +
227 + 	status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.enqueue()");
228 + 	if (status != SCX_TEST_PASS)
229 + 		return status;
230 +
231 + 	status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.enqueue()");
232 + 	if (status != SCX_TEST_PASS)
233 + 		return status;
234 +
235 + 	status = run_scenario(skel, 6, "Scenario 6: BPF queue from ops.enqueue()");
236 + 	if (status != SCX_TEST_PASS)
237 + 		return status;
238 +
239 + 	printf("\n=== Summary ===\n");
240 + 	printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt);
241 + 	printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt);
242 + 	printf(" Dispatch dequeues: %lu (no flag, normal workflow)\n",
243 + 	       (unsigned long)skel->bss->dispatch_dequeue_cnt);
244 + 	printf(" Property change dequeues: %lu (SCX_DEQ_SCHED_CHANGE flag)\n",
245 + 	       (unsigned long)skel->bss->change_dequeue_cnt);
246 + 	printf(" BPF queue full: %lu\n",
247 + 	       (unsigned long)skel->bss->bpf_queue_full);
248 + 	printf("\nAll scenarios passed - no state machine violations detected\n");
249 + 	printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler\n");
250 + 	printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler\n");
251 + 	printf("-> Validated: User DSQ dispatch triggers ops.dequeue() callbacks\n");
252 + 	printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n");
253 + 	printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n");
254 + 	printf("-> Validated: No duplicate enqueues or invalid state transitions\n");
255 +
256 + 	return SCX_TEST_PASS;
257 + }
258 +
259 + static void cleanup(void *ctx)
260 + {
261 + 	struct dequeue *skel = ctx;
262 +
263 + 	dequeue__destroy(skel);
264 + }
265 +
266 + struct scx_test dequeue_test = {
267 + 	.name = "dequeue",
268 + 	.description = "Verify ops.dequeue() semantics",
269 + 	.setup = setup,
270 + 	.run = run,
271 + 	.cleanup = cleanup,
272 + };
273 +
274 + REGISTER_SCX_TEST(&dequeue_test)
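Assuming the usual kselftest build flow for tools/testing/selftests/sched_ext, this test can presumably be run on its own through the serial runner updated below, e.g. ./runner -t dequeue: the -t option filters by substring and the test registers itself under the name "dequeue".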
+1 -1
tools/testing/selftests/sched_ext/exit.bpf.c
···
41 41 	if (exit_point == EXIT_DISPATCH)
42 42 		EXIT_CLEANLY();
43 43 
44 - 	scx_bpf_dsq_move_to_local(DSQ_ID);
44 + 	scx_bpf_dsq_move_to_local(DSQ_ID, 0);
45 45 }
46 46 
47 47 void BPF_STRUCT_OPS(exit_enable, struct task_struct *p)
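The same one-line adjustment recurs in maximal.bpf.c, numa.bpf.c, peek_dsq.bpf.c and select_cpu_vtime.bpf.c further down: scx_bpf_dsq_move_to_local() takes an additional flags argument in this series, and the selftests all pass 0, presumably to keep the previous default behavior.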
+1 -1
tools/testing/selftests/sched_ext/exit.c
···
33 33 		skel = exit__open();
34 34 		SCX_ENUM_INIT(skel);
35 35 		skel->rodata->exit_point = tc;
36 - 		exit__load(skel);
36 + 		SCX_FAIL_IF(exit__load(skel), "Failed to load skel");
37 37 		link = bpf_map__attach_struct_ops(skel->maps.exit_ops);
38 38 		if (!link) {
39 39 			SCX_ERR("Failed to attach scheduler");
+1 -1
tools/testing/selftests/sched_ext/exit_test.h
···
17 17 	NUM_EXITS,
18 18 };
19 19 
20 - #endif // # __EXIT_TEST_H__
20 + #endif // __EXIT_TEST_H__
+7 -10
tools/testing/selftests/sched_ext/maximal.bpf.c
···
30 30 
31 31 void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev)
32 32 {
33 - 	scx_bpf_dsq_move_to_local(DSQ_ID);
33 + 	scx_bpf_dsq_move_to_local(DSQ_ID, 0);
34 34 }
35 35 
36 36 void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags)
···
67 67 void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle)
68 68 {}
69 69 
70 - void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu,
71 - 		    struct scx_cpu_acquire_args *args)
72 - {}
73 - 
74 - void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu,
75 - 		    struct scx_cpu_release_args *args)
76 - {}
70 + SEC("tp_btf/sched_switch")
71 + int BPF_PROG(maximal_sched_switch, bool preempt, struct task_struct *prev,
72 + 	     struct task_struct *next, unsigned int prev_state)
73 + {
74 + 	return 0;
75 + }
77 76 
78 77 void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu)
79 78 {}
···
149 150 	.set_weight = (void *) maximal_set_weight,
150 151 	.set_cpumask = (void *) maximal_set_cpumask,
151 152 	.update_idle = (void *) maximal_update_idle,
152 - 	.cpu_acquire = (void *) maximal_cpu_acquire,
153 - 	.cpu_release = (void *) maximal_cpu_release,
154 153 	.cpu_online = (void *) maximal_cpu_online,
155 154 	.cpu_offline = (void *) maximal_cpu_offline,
156 155 	.init_task = (void *) maximal_init_task,
+3
tools/testing/selftests/sched_ext/maximal.c
···
19 19 	SCX_ENUM_INIT(skel);
20 20 	SCX_FAIL_IF(maximal__load(skel), "Failed to load skel");
21 21 
22 + 	bpf_map__set_autoattach(skel->maps.maximal_ops, false);
23 + 	SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel");
24 + 
22 25 	*ctx = skel;
23 26 
24 27 	return SCX_TEST_PASS;
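maximal.bpf.c now also carries a plain tp_btf/sched_switch program (see the hunk above), so the skeleton has something to attach besides the struct_ops map. Disabling autoattach on maximal_ops before calling maximal__attach() here, and in reload_loop.c below, presumably lets the generated attach helper handle that program while the struct_ops map stays under the test's explicit control.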
+1 -1
tools/testing/selftests/sched_ext/numa.bpf.c
···
68 68 {
69 69 	int node = __COMPAT_scx_bpf_cpu_node(cpu);
70 70 
71 - 	scx_bpf_dsq_move_to_local(node);
71 + 	scx_bpf_dsq_move_to_local(node, 0);
72 72 }
73 73 
74 74 s32 BPF_STRUCT_OPS_SLEEPABLE(numa_init)
+5 -5
tools/testing/selftests/sched_ext/peek_dsq.bpf.c
···
95 95 		record_peek_result(task->pid);
96 96 
97 97 		/* Try to move this task to local */
98 - 		if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
98 + 		if (!moved && scx_bpf_dsq_move_to_local(dsq_id, 0) == 0) {
99 99 			moved = 1;
100 100 			break;
101 101 		}
···
156 156 		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
157 157 
158 158 		/* Now consume the task since we've peeked at it */
159 - 		scx_bpf_dsq_move_to_local(test_dsq_id);
159 + 		scx_bpf_dsq_move_to_local(test_dsq_id, 0);
160 160 
161 161 		/* Mark phase 1 as complete */
162 162 		phase1_complete = 1;
163 163 		bpf_printk("Phase 1 complete, starting phase 2 stress testing");
164 164 	} else if (!phase1_complete) {
165 165 		/* Still in phase 1, use real DSQ */
166 - 		scx_bpf_dsq_move_to_local(real_dsq_id);
166 + 		scx_bpf_dsq_move_to_local(real_dsq_id, 0);
167 167 	} else {
168 168 		/* Phase 2: Scan all DSQs in the pool and try to move a task */
169 169 		if (!scan_dsq_pool()) {
170 170 			/* No tasks found in DSQ pool, fall back to real DSQ */
171 - 			scx_bpf_dsq_move_to_local(real_dsq_id);
171 + 			scx_bpf_dsq_move_to_local(real_dsq_id, 0);
172 172 		}
173 173 	}
174 174 }
···
197 197 	}
198 198 	err = scx_bpf_create_dsq(real_dsq_id, -1);
199 199 	if (err) {
200 - 		scx_bpf_error("Failed to create DSQ %d: %d", test_dsq_id, err);
200 + 		scx_bpf_error("Failed to create DSQ %d: %d", real_dsq_id, err);
201 201 		return err;
202 202 	}
203 203 
+3
tools/testing/selftests/sched_ext/reload_loop.c
···
23 23 	SCX_ENUM_INIT(skel);
24 24 	SCX_FAIL_IF(maximal__load(skel), "Failed to load skel");
25 25 
26 + 	bpf_map__set_autoattach(skel->maps.maximal_ops, false);
27 + 	SCX_FAIL_IF(maximal__attach(skel), "Failed to attach skel");
28 + 
26 29 	return SCX_TEST_PASS;
27 30 }
28 31 
+5
tools/testing/selftests/sched_ext/rt_stall.c
···
119 119 {
120 120 	struct rt_stall *skel;
121 121 
122 + 	if (!__COMPAT_struct_has_field("rq", "ext_server")) {
123 + 		fprintf(stderr, "SKIP: ext DL server not supported\n");
124 + 		return SCX_TEST_SKIP;
125 + 	}
126 + 
122 127 	skel = rt_stall__open();
123 128 	SCX_FAIL_IF(!skel, "Failed to open");
124 129 	SCX_ENUM_INIT(skel);
+36 -4
tools/testing/selftests/sched_ext/runner.c
···
18 18 "It's required for the testcases to be serial, as only a single host-wide sched_ext\n"
19 19 "scheduler may be loaded at any given time."
20 20 "\n"
21 - "Usage: %s [-t TEST] [-h]\n"
21 + "Usage: %s [-t TEST] [-s] [-l] [-q]\n"
22 22 "\n"
23 23 " -t TEST Only run tests whose name includes this string\n"
24 24 " -s Include print output for skipped tests\n"
···
133 133 int main(int argc, char **argv)
134 134 {
135 135 	const char *filter = NULL;
136 + 	const char *failed_tests[MAX_SCX_TESTS];
137 + 	const char *skipped_tests[MAX_SCX_TESTS];
136 138 	unsigned testnum = 0, i;
137 139 	unsigned passed = 0, skipped = 0, failed = 0;
138 140 	int opt;
···
161 159 		default:
162 160 			fprintf(stderr, help_fmt, basename(argv[0]));
163 161 			return opt != 'h';
162 + 		}
163 + 	}
164 + 
165 + 	if (optind < argc) {
166 + 		fprintf(stderr, "Unexpected argument '%s'. Use -t to filter tests.\n",
167 + 			argv[optind]);
168 + 		return 1;
169 + 	}
170 + 
171 + 	if (filter) {
172 + 		for (i = 0; i < __scx_num_tests; i++) {
173 + 			if (!should_skip_test(&__scx_tests[i], filter))
174 + 				break;
175 + 		}
176 + 		if (i == __scx_num_tests) {
177 + 			fprintf(stderr, "No tests matched filter '%s'\n", filter);
178 + 			fprintf(stderr, "Available tests (use -l to list):\n");
179 + 			for (i = 0; i < __scx_num_tests; i++)
180 + 				fprintf(stderr, " %s\n", __scx_tests[i].name);
181 + 			return 1;
164 182 		}
165 183 	}
166 184 
···
220 198 			passed++;
221 199 			break;
222 200 		case SCX_TEST_SKIP:
223 - 			skipped++;
201 + 			skipped_tests[skipped++] = test->name;
224 202 			break;
225 203 		case SCX_TEST_FAIL:
226 - 			failed++;
204 + 			failed_tests[failed++] = test->name;
227 205 			break;
228 206 		}
229 207 	}
···
232 210 	printf("PASSED: %u\n", passed);
233 211 	printf("SKIPPED: %u\n", skipped);
234 212 	printf("FAILED: %u\n", failed);
213 + 	if (skipped > 0) {
214 + 		printf("\nSkipped tests:\n");
215 + 		for (i = 0; i < skipped; i++)
216 + 			printf(" - %s\n", skipped_tests[i]);
217 + 	}
218 + 	if (failed > 0) {
219 + 		printf("\nFailed tests:\n");
220 + 		for (i = 0; i < failed; i++)
221 + 			printf(" - %s\n", failed_tests[i]);
222 + 	}
235 223 
236 - 	return 0;
224 + 	return failed > 0 ? 1 : 0;
237 225 }
238 226 
239 227 void scx_test_register(struct scx_test *test)
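Taken together, the runner changes make results consumable without scraping the log: stray positional arguments are rejected, a -t filter that matches nothing lists the available tests, the summary names each skipped and failed test, and the process now exits non-zero when any test failed, so the exit status alone should be enough for CI use.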
+5 -3
tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
···
53 53 
54 54 void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p)
55 55 {
56 - 	if (scx_bpf_dsq_move_to_local(VTIME_DSQ))
56 + 	if (scx_bpf_dsq_move_to_local(VTIME_DSQ, 0))
57 57 		consumed = true;
58 58 }
59 59 
···
66 66 void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p,
67 67 		    bool runnable)
68 68 {
69 - 	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
69 + 	u64 delta = scale_by_task_weight_inverse(p, SCX_SLICE_DFL - p->scx.slice);
70 + 
71 + 	scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta);
70 72 }
71 73 
72 74 void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p)
73 75 {
74 - 	p->scx.dsq_vtime = vtime_now;
76 + 	scx_bpf_task_set_dsq_vtime(p, vtime_now);
75 77 }
76 78 
77 79 s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init)
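The vtime bookkeeping in this test now goes through helpers instead of writing p->scx fields directly. Judging by the removed line, scale_by_task_weight_inverse() computes the same delta = used * 100 / weight scaling that used to be open-coded, so a task at twice the default weight (200) that burns a full SCX_SLICE_DFL accrues only half the vtime of a weight-100 task, and scx_bpf_task_set_dsq_vtime() replaces the direct store to p->scx.dsq_vtime.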
+1 -1
tools/testing/selftests/sched_ext/util.h
···
10 10 long file_read_long(const char *path);
11 11 int file_write_long(const char *path, long val);
12 12 
13 - #endif // __SCX_TEST_H__
13 + #endif // __SCX_TEST_UTIL_H__