Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

+1

Documentation/scheduler/index.rst

···
21 21       sched-nice-design
22 22       sched-rt-group
23 23       sched-stats
24 +        sched-ext
24 25       sched-debug
25 26
26 27       text_files

+316

Documentation/scheduler/sched-ext.rst

··· 1 + ========================== 2 + Extensible Scheduler Class 3 + ========================== 4 + 5 + sched_ext is a scheduler class whose behavior can be defined by a set of BPF 6 + programs - the BPF scheduler. 7 + 8 + * sched_ext exports a full scheduling interface so that any scheduling 9 + algorithm can be implemented on top. 10 + 11 + * The BPF scheduler can group CPUs however it sees fit and schedule them 12 + together, as tasks aren't tied to specific CPUs at the time of wakeup. 13 + 14 + * The BPF scheduler can be turned on and off dynamically anytime. 15 + 16 + * The system integrity is maintained no matter what the BPF scheduler does. 17 + The default scheduling behavior is restored anytime an error is detected, 18 + a runnable task stalls, or on invoking the SysRq key sequence 19 + :kbd:`SysRq-S`. 20 + 21 + * When the BPF scheduler triggers an error, debug information is dumped to 22 + aid debugging. The debug dump is passed to and printed out by the 23 + scheduler binary. The debug dump can also be accessed through the 24 + `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D` 25 + triggers a debug dump. This doesn't terminate the BPF scheduler and can 26 + only be read through the tracepoint. 27 + 28 + Switching to and from sched_ext 29 + =============================== 30 + 31 + ``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and 32 + ``tools/sched_ext`` contains the example schedulers. The following config 33 + options should be enabled to use sched_ext: 34 + 35 + .. code-block:: none 36 + 37 + CONFIG_BPF=y 38 + CONFIG_SCHED_CLASS_EXT=y 39 + CONFIG_BPF_SYSCALL=y 40 + CONFIG_BPF_JIT=y 41 + CONFIG_DEBUG_INFO_BTF=y 42 + CONFIG_BPF_JIT_ALWAYS_ON=y 43 + CONFIG_BPF_JIT_DEFAULT_ON=y 44 + CONFIG_PAHOLE_HAS_SPLIT_BTF=y 45 + CONFIG_PAHOLE_HAS_BTF_TAG=y 46 + 47 + sched_ext is used only when the BPF scheduler is loaded and running. 
48 + 49 + If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be 50 + treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is 51 + loaded. 52 + 53 + When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set 54 + in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and 55 + ``SCHED_EXT`` tasks are scheduled by sched_ext. 56 + 57 + However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is 58 + set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled 59 + by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and 60 + ``SCHED_IDLE`` policies are scheduled by CFS. 61 + 62 + Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or 63 + detection of any internal error including stalled runnable tasks aborts the 64 + BPF scheduler and reverts all tasks back to CFS. 65 + 66 + .. code-block:: none 67 + 68 + # make -j16 -C tools/sched_ext 69 + # tools/sched_ext/scx_simple 70 + local=0 global=3 71 + local=5 global=24 72 + local=9 global=44 73 + local=13 global=56 74 + local=17 global=72 75 + ^CEXIT: BPF scheduler unregistered 76 + 77 + The current status of the BPF scheduler can be determined as follows: 78 + 79 + .. code-block:: none 80 + 81 + # cat /sys/kernel/sched_ext/state 82 + enabled 83 + # cat /sys/kernel/sched_ext/root/ops 84 + simple 85 + 86 + ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more 87 + detailed information: 88 + 89 + .. code-block:: none 90 + 91 + # tools/sched_ext/scx_show_state.py 92 + ops : simple 93 + enabled : 1 94 + switching_all : 1 95 + switched_all : 1 96 + enable_state : enabled (2) 97 + bypass_depth : 0 98 + nr_rejected : 0 99 + 100 + If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can 101 + be determined as follows: 102 + 103 + .. 
code-block:: none 104 + 105 + # grep ext /proc/self/sched 106 + ext.enabled : 1 107 + 108 + The Basics 109 + ========== 110 + 111 + Userspace can implement an arbitrary BPF scheduler by loading a set of BPF 112 + programs that implement ``struct sched_ext_ops``. The only mandatory field 113 + is ``ops.name`` which must be a valid BPF object name. All operations are 114 + optional. The following modified excerpt is from 115 + ``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler. 116 + 117 + .. code-block:: c 118 + 119 + /* 120 + * Decide which CPU a task should be migrated to before being 121 + * enqueued (either at wakeup, fork time, or exec time). If an 122 + * idle core is found by the default ops.select_cpu() implementation, 123 + * then dispatch the task directly to SCX_DSQ_LOCAL and skip the 124 + * ops.enqueue() callback. 125 + * 126 + * Note that this implementation has exactly the same behavior as the 127 + * default ops.select_cpu implementation. The behavior of the scheduler 128 + * would be exactly same if the implementation just didn't define the 129 + * simple_select_cpu() struct_ops prog. 130 + */ 131 + s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, 132 + s32 prev_cpu, u64 wake_flags) 133 + { 134 + s32 cpu; 135 + /* Need to initialize or the BPF verifier will reject the program */ 136 + bool direct = false; 137 + 138 + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct); 139 + 140 + if (direct) 141 + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 142 + 143 + return cpu; 144 + } 145 + 146 + /* 147 + * Do a direct dispatch of a task to the global DSQ. This ops.enqueue() 148 + * callback will only be invoked if we failed to find a core to dispatch 149 + * to in ops.select_cpu() above. 150 + * 151 + * Note that this implementation has exactly the same behavior as the 152 + * default ops.enqueue implementation, which just dispatches the task 153 + * to SCX_DSQ_GLOBAL. 
The behavior of the scheduler would be exactly same 154 + * if the implementation just didn't define the simple_enqueue struct_ops 155 + * prog. 156 + */ 157 + void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags) 158 + { 159 + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 160 + } 161 + 162 + s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) 163 + { 164 + /* 165 + * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and 166 + * SCHED_BATCH tasks should use sched_ext. 167 + */ 168 + return 0; 169 + } 170 + 171 + void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei) 172 + { 173 + exit_type = ei->type; 174 + } 175 + 176 + SEC(".struct_ops") 177 + struct sched_ext_ops simple_ops = { 178 + .select_cpu = (void *)simple_select_cpu, 179 + .enqueue = (void *)simple_enqueue, 180 + .init = (void *)simple_init, 181 + .exit = (void *)simple_exit, 182 + .name = "simple", 183 + }; 184 + 185 + Dispatch Queues 186 + --------------- 187 + 188 + To match the impedance between the scheduler core and the BPF scheduler, 189 + sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a 190 + priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``), 191 + and one local dsq per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage 192 + an arbitrary number of dsq's using ``scx_bpf_create_dsq()`` and 193 + ``scx_bpf_destroy_dsq()``. 194 + 195 + A CPU always executes a task from its local DSQ. A task is "dispatched" to a 196 + DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's 197 + local DSQ. 198 + 199 + When a CPU is looking for the next task to run, if the local DSQ is not 200 + empty, the first task is picked. Otherwise, the CPU tries to consume the 201 + global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()`` 202 + is invoked. 203 + 204 + Scheduling Cycle 205 + ---------------- 206 + 207 + The following briefly shows how a waking task is scheduled and executed. 
208 + 209 + 1. When a task is waking up, ``ops.select_cpu()`` is the first operation 210 + invoked. This serves two purposes. First, CPU selection optimization 211 + hint. Second, waking up the selected CPU if idle. 212 + 213 + The CPU selected by ``ops.select_cpu()`` is an optimization hint and not 214 + binding. The actual decision is made at the last step of scheduling. 215 + However, there is a small performance gain if the CPU 216 + ``ops.select_cpu()`` returns matches the CPU the task eventually runs on. 217 + 218 + A side-effect of selecting a CPU is waking it up from idle. While a BPF 219 + scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper, 220 + using ``ops.select_cpu()`` judiciously can be simpler and more efficient. 221 + 222 + A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by 223 + calling ``scx_bpf_dispatch()``. If the task is dispatched to 224 + ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the 225 + local DSQ of whichever CPU is returned from ``ops.select_cpu()``. 226 + Additionally, dispatching directly from ``ops.select_cpu()`` will cause the 227 + ``ops.enqueue()`` callback to be skipped. 228 + 229 + Note that the scheduler core will ignore an invalid CPU selection, for 230 + example, if it's outside the allowed cpumask of the task. 231 + 232 + 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the 233 + task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()`` 234 + can make one of the following decisions: 235 + 236 + * Immediately dispatch the task to either the global or local DSQ by 237 + calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or 238 + ``SCX_DSQ_LOCAL``, respectively. 239 + 240 + * Immediately dispatch the task to a custom DSQ by calling 241 + ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63. 242 + 243 + * Queue the task on the BPF side. 244 + 245 + 3. 
When a CPU is ready to schedule, it first looks at its local DSQ. If 246 + empty, it then looks at the global DSQ. If there still isn't a task to 247 + run, ``ops.dispatch()`` is invoked which can use the following two 248 + functions to populate the local DSQ. 249 + 250 + * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can 251 + be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``, 252 + ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()`` 253 + currently can't be called with BPF locks held, this is being worked on 254 + and will be supported. ``scx_bpf_dispatch()`` schedules dispatches 255 + rather than performing them immediately. There can be up to 256 + ``ops.dispatch_max_batch`` pending tasks. 257 + 258 + * ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ 259 + to the dispatching DSQ. This function cannot be called with any BPF 260 + locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks 261 + before trying to consume the specified DSQ. 262 + 263 + 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ, 264 + the CPU runs the first one. If empty, the following steps are taken: 265 + 266 + * Try to consume the global DSQ. If successful, run the task. 267 + 268 + * If ``ops.dispatch()`` has dispatched any tasks, retry #3. 269 + 270 + * If the previous task is an SCX task and still runnable, keep executing 271 + it (see ``SCX_OPS_ENQ_LAST``). 272 + 273 + * Go idle. 274 + 275 + Note that the BPF scheduler can always choose to dispatch tasks immediately 276 + in ``ops.enqueue()`` as illustrated in the above simple example. If only the 277 + built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as 278 + a task is never queued on the BPF scheduler and both the local and global 279 + DSQs are consumed automatically. 280 + 281 + ``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use 282 + ``scx_bpf_dispatch_vtime()`` for the priority queue.
Internal DSQs such as 283 + ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue 284 + dispatching, and must be dispatched to with ``scx_bpf_dispatch()``. See the 285 + function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for 286 + more information. 287 + 288 + Where to Look 289 + ============= 290 + 291 + * ``include/linux/sched/ext.h`` defines the core data structures, ops table 292 + and constants. 293 + 294 + * ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers. 295 + The functions prefixed with ``scx_bpf_`` can be called from the BPF 296 + scheduler. 297 + 298 + * ``tools/sched_ext/`` hosts example BPF scheduler implementations. 299 + 300 + * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a 301 + custom DSQ. 302 + 303 + * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five 304 + levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``. 305 + 306 + ABI Instability 307 + =============== 308 + 309 + The APIs provided by sched_ext to BPF scheduler programs have no stability 310 + guarantees. This includes the ops table callbacks and constants defined in 311 + ``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in 312 + ``kernel/sched/ext.c``. 313 + 314 + While we will attempt to provide a relatively stable API surface when 315 + possible, it is subject to change without warning between kernel 316 + versions.
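
The pick order described in the "Dispatch Queues" and "Scheduling Cycle" sections of sched-ext.rst (local DSQ first, then the global DSQ, then ``ops.dispatch()``) can be sketched as a small userspace toy model. This is only an illustration under stated assumptions: every ``demo_*`` name is hypothetical, the queues are plain ring buffers of task IDs, and nothing here is kernel or BPF API.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QCAP 16	/* hypothetical capacity, not a kernel limit */

/* Toy FIFO standing in for a DSQ; it holds non-negative task IDs. */
struct demo_dsq {
	int tasks[DEMO_QCAP];
	size_t head;	/* index of the next task to pop */
	size_t len;	/* number of queued tasks */
};

/* "Dispatch" a task to the tail of a DSQ. Returns 0, or -1 if it is full. */
static int demo_dispatch(struct demo_dsq *q, int task)
{
	assert(task >= 0);
	if (q->len == DEMO_QCAP)
		return -1;
	q->tasks[(q->head + q->len) % DEMO_QCAP] = task;
	q->len++;
	return 0;
}

/* Pop the first task of a DSQ, or return -1 if it is empty. */
static int demo_pop(struct demo_dsq *q)
{
	int task;

	if (q->len == 0)
		return -1;
	task = q->tasks[q->head];
	q->head = (q->head + 1) % DEMO_QCAP;
	q->len--;
	return task;
}

/*
 * Pick order from the documentation: the local DSQ first, then the
 * global DSQ. A return of -1 marks the point where the real scheduler
 * would invoke ops.dispatch() and, failing that, go idle.
 */
static int demo_pick_next(struct demo_dsq *local, struct demo_dsq *global)
{
	int task = demo_pop(local);

	if (task >= 0)
		return task;
	return demo_pop(global);
}
```

With task 7 on the global queue and task 3 on the local queue, ``demo_pick_next()`` returns 3 before 7, which is the ordering the documentation describes for the built-in DSQs.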

+13

MAINTAINERS

···
20511 20511   F: include/uapi/linux/sched.h
20512 20512   F: kernel/sched/
20513 20513
20514 +       SCHEDULER - SCHED_EXT
20515 +       R: Tejun Heo <tj@kernel.org>
20516 +       R: David Vernet <void@manifault.com>
20517 +       L: linux-kernel@vger.kernel.org
20518 +       S: Maintained
20519 +       W: https://github.com/sched-ext/scx
20520 +       T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
20521 +       F: include/linux/sched/ext.h
20522 +       F: kernel/sched/ext.h
20523 +       F: kernel/sched/ext.c
20524 +       F: tools/sched_ext/
20525 +       F: tools/testing/selftests/sched_ext
20526 +
20514 20527   SCIOSENSE ENS160 MULTI-GAS SENSOR DRIVER
20515 20528   M: Gustavo Silva <gustavograzs@gmail.com>
20516 20529   S: Maintained

+1

drivers/tty/sysrq.c

···
531 531       NULL, /* P */
532 532       NULL, /* Q */
533 533       &sysrq_replay_logs_op, /* R */
534 +         /* S: May be registered by sched_ext for resetting */
534 535       NULL, /* S */
535 536       NULL, /* T */
536 537       NULL, /* U */

+1

include/asm-generic/vmlinux.lds.h

···
133 133       *(__dl_sched_class) \
134 134       *(__rt_sched_class) \
135 135       *(__fair_sched_class) \
136 +         *(__ext_sched_class) \
136 137       *(__idle_sched_class) \
137 138       __sched_class_lowest = .;
138 139
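
The position of ``__ext_sched_class`` in this list matters: assuming the scheduler core ranks classes by their placement in the section (an earlier entry outranks a later one), sched_ext sits below the fair class and above idle. A minimal userspace sketch of that idea, where ``struct demo_class`` and all ``demo_*``/``DEMO_*`` names are hypothetical stand-ins and a plain array plays the role of the linker-ordered section:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sched_class; only identity matters. */
struct demo_class {
	const char *name;
};

/* Indices follow the section order visible in the hunk above. */
enum { DEMO_DL, DEMO_RT, DEMO_FAIR, DEMO_EXT, DEMO_IDLE, DEMO_NR };

/* One contiguous array plays the role of the linker-ordered section. */
static const struct demo_class demo_classes[DEMO_NR] = {
	[DEMO_DL]   = { "dl" },
	[DEMO_RT]   = { "rt" },
	[DEMO_FAIR] = { "fair" },
	[DEMO_EXT]  = { "ext" },
	[DEMO_IDLE] = { "idle" },
};

/* A class placed earlier (lower address) outranks one placed later. */
static int demo_class_above(const struct demo_class *a,
			    const struct demo_class *b)
{
	assert(a != NULL && b != NULL);
	return a < b;
}
```

Comparing addresses is only well defined here because both pointers refer to elements of the same array, which is the property the linker script provides for the real classes.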

+2 -2

include/linux/cgroup.h

···
29 29
30 30   struct kernel_clone_args;
31 31
32 -     #ifdef CONFIG_CGROUPS
33 -
34 32   /*
35 33    * All weight knobs on the default hierarchy should use the following min,
36 34    * default and max values. The default value is the logarithmic center of
···
37 39   #define CGROUP_WEIGHT_MIN 1
38 40   #define CGROUP_WEIGHT_DFL 100
39 41   #define CGROUP_WEIGHT_MAX 10000
42 +
43 +     #ifdef CONFIG_CGROUPS
40 44
41 45   enum {
42 46       CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */
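
The comment in this hunk is cut off by the diff view, but reading "logarithmic center" as referring to the MIN and MAX values (an assumption), the claim can be checked with integer arithmetic: a value is the logarithmic, that is geometric, center of a range when its square equals the product of the endpoints. The ``DEMO_*`` macros below copy the values from the hunk; ``demo_is_log_center()`` is a hypothetical helper.

```c
#include <assert.h>

/* Values copied from the hunk above. */
#define DEMO_WEIGHT_MIN 1UL
#define DEMO_WEIGHT_DFL 100UL
#define DEMO_WEIGHT_MAX 10000UL

/*
 * dfl is the logarithmic (geometric) center of [min, max] exactly when
 * dfl * dfl == min * max, i.e. dfl/min equals max/dfl.
 */
static int demo_is_log_center(unsigned long min, unsigned long dfl,
			      unsigned long max)
{
	assert(min > 0 && min <= dfl && dfl <= max);
	return dfl * dfl == min * max;
}
```

For the values above, 100 * 100 equals 1 * 10000, so the default allows the same 100x scaling in both directions.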

+5

include/linux/sched.h

··· 82 82 struct task_struct; 83 83 struct user_event_mm; 84 84 85 + #include <linux/sched/ext.h> 86 + 85 87 /* 86 88 * Task state bitmask. NOTE! These bits are also 87 89 * encoded in fs/proc/array.c: get_task_state(). ··· 832 830 struct sched_rt_entity rt; 833 831 struct sched_dl_entity dl; 834 832 struct sched_dl_entity *dl_server; 833 + #ifdef CONFIG_SCHED_CLASS_EXT 834 + struct sched_ext_entity scx; 835 + #endif 835 836 const struct sched_class *sched_class; 836 837 837 838 #ifdef CONFIG_SCHED_CORE

+215

include/linux/sched/ext.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst 4 + * 5 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 6 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 7 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 8 + */ 9 + #ifndef _LINUX_SCHED_EXT_H 10 + #define _LINUX_SCHED_EXT_H 11 + 12 + #ifdef CONFIG_SCHED_CLASS_EXT 13 + 14 + #include <linux/llist.h> 15 + #include <linux/rhashtable-types.h> 16 + 17 + enum scx_public_consts { 18 + SCX_OPS_NAME_LEN = 128, 19 + 20 + SCX_SLICE_DFL = 20 * 1000000, /* 20ms */ 21 + SCX_SLICE_INF = U64_MAX, /* infinite, implies nohz */ 22 + }; 23 + 24 + /* 25 + * DSQ (dispatch queue) IDs are 64bit of the format: 26 + * 27 + * Bits: [63] [62 .. 0] 28 + * [ B] [ ID ] 29 + * 30 + * B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs 31 + * ID: 63 bit ID 32 + * 33 + * Built-in IDs: 34 + * 35 + * Bits: [63] [62] [61..32] [31 .. 0] 36 + * [ 1] [ L] [ R ] [ V ] 37 + * 38 + * 1: 1 for built-in DSQs. 39 + * L: 1 for LOCAL_ON DSQ IDs, 0 for others 40 + * V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value. 41 + */ 42 + enum scx_dsq_id_flags { 43 + SCX_DSQ_FLAG_BUILTIN = 1LLU << 63, 44 + SCX_DSQ_FLAG_LOCAL_ON = 1LLU << 62, 45 + 46 + SCX_DSQ_INVALID = SCX_DSQ_FLAG_BUILTIN | 0, 47 + SCX_DSQ_GLOBAL = SCX_DSQ_FLAG_BUILTIN | 1, 48 + SCX_DSQ_LOCAL = SCX_DSQ_FLAG_BUILTIN | 2, 49 + SCX_DSQ_LOCAL_ON = SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON, 50 + SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU, 51 + }; 52 + 53 + /* 54 + * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered 55 + * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to 56 + * buffer between the scheduler core and the BPF scheduler. See the 57 + * documentation for more details. 
58 + */ 59 + struct scx_dispatch_q { 60 + raw_spinlock_t lock; 61 + struct list_head list; /* tasks in dispatch order */ 62 + struct rb_root priq; /* used to order by p->scx.dsq_vtime */ 63 + u32 nr; 64 + u32 seq; /* used by BPF iter */ 65 + u64 id; 66 + struct rhash_head hash_node; 67 + struct llist_node free_node; 68 + struct rcu_head rcu; 69 + }; 70 + 71 + /* scx_entity.flags */ 72 + enum scx_ent_flags { 73 + SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ 74 + SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ 75 + SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ 76 + 77 + SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ 78 + SCX_TASK_STATE_BITS = 2, 79 + SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT, 80 + 81 + SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */ 82 + }; 83 + 84 + /* scx_entity.flags & SCX_TASK_STATE_MASK */ 85 + enum scx_task_state { 86 + SCX_TASK_NONE, /* ops.init_task() not called yet */ 87 + SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */ 88 + SCX_TASK_READY, /* fully initialized, but not in sched_ext */ 89 + SCX_TASK_ENABLED, /* fully initialized and in sched_ext */ 90 + 91 + SCX_TASK_NR_STATES, 92 + }; 93 + 94 + /* scx_entity.dsq_flags */ 95 + enum scx_ent_dsq_flags { 96 + SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ 97 + }; 98 + 99 + /* 100 + * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from 101 + * everywhere and the following bits track which kfunc sets are currently 102 + * allowed for %current. This simple per-task tracking works because SCX ops 103 + * nest in a limited way. BPF will likely implement a way to allow and disallow 104 + * kfuncs depending on the calling context which will replace this manual 105 + * mechanism. See scx_kf_allow(). 
106 + */ 107 + enum scx_kf_mask { 108 + SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ 109 + /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ 110 + SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ 111 + /* ops.dequeue (in REST) may be nested inside DISPATCH */ 112 + SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ 113 + SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ 114 + SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ 115 + SCX_KF_REST = 1 << 4, /* other rq-locked operations */ 116 + 117 + __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | 118 + SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 119 + __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, 120 + }; 121 + 122 + enum scx_dsq_lnode_flags { 123 + SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0, 124 + 125 + /* high 16 bits can be for iter cursor flags */ 126 + __SCX_DSQ_LNODE_PRIV_SHIFT = 16, 127 + }; 128 + 129 + struct scx_dsq_list_node { 130 + struct list_head node; 131 + u32 flags; 132 + u32 priv; /* can be used by iter cursor */ 133 + }; 134 + 135 + /* 136 + * The following is embedded in task_struct and contains all fields necessary 137 + * for a task to be scheduled by SCX. 
138 + */ 139 + struct sched_ext_entity { 140 + struct scx_dispatch_q *dsq; 141 + struct scx_dsq_list_node dsq_list; /* dispatch order */ 142 + struct rb_node dsq_priq; /* p->scx.dsq_vtime order */ 143 + u32 dsq_seq; 144 + u32 dsq_flags; /* protected by DSQ lock */ 145 + u32 flags; /* protected by rq lock */ 146 + u32 weight; 147 + s32 sticky_cpu; 148 + s32 holding_cpu; 149 + u32 kf_mask; /* see scx_kf_mask above */ 150 + struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ 151 + atomic_long_t ops_state; 152 + 153 + struct list_head runnable_node; /* rq->scx.runnable_list */ 154 + unsigned long runnable_at; 155 + 156 + #ifdef CONFIG_SCHED_CORE 157 + u64 core_sched_at; /* see scx_prio_less() */ 158 + #endif 159 + u64 ddsp_dsq_id; 160 + u64 ddsp_enq_flags; 161 + 162 + /* BPF scheduler modifiable fields */ 163 + 164 + /* 165 + * Runtime budget in nsecs. This is usually set through 166 + * scx_bpf_dispatch() but can also be modified directly by the BPF 167 + * scheduler. Automatically decreased by SCX as the task executes. On 168 + * depletion, a scheduling event is triggered. 169 + * 170 + * This value is cleared to zero if the task is preempted by 171 + * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the 172 + * task ran. Use p->se.sum_exec_runtime instead. 173 + */ 174 + u64 slice; 175 + 176 + /* 177 + * Used to order tasks when dispatching to the vtime-ordered priority 178 + * queue of a dsq. This is usually set through scx_bpf_dispatch_vtime() 179 + * but can also be modified directly by the BPF scheduler. Modifying it 180 + * while a task is queued on a dsq may mangle the ordering and is not 181 + * recommended. 182 + */ 183 + u64 dsq_vtime; 184 + 185 + /* 186 + * If set, reject future sched_setscheduler(2) calls updating the policy 187 + * to %SCHED_EXT with -%EACCES. 188 + * 189 + * Can be set from ops.init_task() while the BPF scheduler is being 190 + * loaded (!scx_init_task_args->fork). 
If set and the task's policy is 191 + * already %SCHED_EXT, the task's policy is rejected and forcefully 192 + * reverted to %SCHED_NORMAL. The number of such events is reported 193 + * through /sys/kernel/debug/sched_ext::nr_rejected. Setting this flag 194 + * during fork is not allowed. 195 + */ 196 + bool disallow; /* reject switching into SCX */ 197 + 198 + /* cold fields */ 199 + #ifdef CONFIG_EXT_GROUP_SCHED 200 + struct cgroup *cgrp_moving_from; 201 + #endif 202 + /* must be the last field, see init_scx_entity() */ 203 + struct list_head tasks_node; 204 + }; 205 + 206 + void sched_ext_free(struct task_struct *p); 207 + void print_scx_info(const char *log_lvl, struct task_struct *p); 208 + 209 + #else /* !CONFIG_SCHED_CLASS_EXT */ 210 + 211 + static inline void sched_ext_free(struct task_struct *p) {} 212 + static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {} 213 + 214 + #endif /* CONFIG_SCHED_CLASS_EXT */ 215 + #endif /* _LINUX_SCHED_EXT_H */
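
The DSQ ID layout documented in this header (bit 63 marks a built-in DSQ, bit 62 marks a LOCAL_ON ID, the low 32 bits carry a CPU number or a pre-defined value) can be exercised with a short userspace sketch. The ``DEMO_*`` macros mirror the ``enum scx_dsq_id_flags`` values shown above; the ``demo_*`` helpers are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors of the enum scx_dsq_id_flags values from the header above. */
#define DEMO_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define DEMO_DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define DEMO_DSQ_GLOBAL		(DEMO_DSQ_FLAG_BUILTIN | 1)
#define DEMO_DSQ_LOCAL		(DEMO_DSQ_FLAG_BUILTIN | 2)
#define DEMO_DSQ_LOCAL_ON	(DEMO_DSQ_FLAG_BUILTIN | DEMO_DSQ_FLAG_LOCAL_ON)
#define DEMO_DSQ_LOCAL_CPU_MASK	0xffffffffULL

/* Bit 63 set: built-in DSQ. Bit 63 clear: user DSQ created by the ops. */
static int demo_dsq_is_builtin(uint64_t id)
{
	return (id & DEMO_DSQ_FLAG_BUILTIN) != 0;
}

/* Build the ID of the local DSQ of a specific CPU. */
static uint64_t demo_dsq_make_local_on(int cpu)
{
	assert(cpu >= 0);
	return DEMO_DSQ_LOCAL_ON | (uint64_t)cpu;
}

/* Return the CPU encoded in a LOCAL_ON ID, or -1 for any other ID. */
static int64_t demo_dsq_local_on_cpu(uint64_t id)
{
	if ((id & DEMO_DSQ_LOCAL_ON) != DEMO_DSQ_LOCAL_ON)
		return -1;
	return (int64_t)(id & DEMO_DSQ_LOCAL_CPU_MASK);
}
```

Any ID below 2^63 has bit 63 clear and is therefore a user DSQ, which matches the "DSQ ID which is smaller than 2^63" wording in the documentation.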

+7 -1

include/linux/sched/task.h

···
63 63   extern void init_idle(struct task_struct *idle, int cpu);
64 64
65 65   extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
66 -     extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
66 +     extern int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
67 +     extern void sched_cancel_fork(struct task_struct *p);
67 68   extern void sched_post_fork(struct task_struct *p);
68 69   extern void sched_dead(struct task_struct *p);
69 70
···
118 117   {
119 118       refcount_inc(&t->usage);
120 119       return t;
120 +     }
121 +
122 +     static inline struct task_struct *tryget_task_struct(struct task_struct *t)
123 +     {
124 +         return refcount_inc_not_zero(&t->usage) ? t : NULL;
121 125   }
122 126
123 127   extern void __put_task_struct(struct task_struct *t);
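
``tryget_task_struct()`` takes a reference only if the usage count has not already dropped to zero. The same increment-if-not-zero pattern can be sketched in userspace with C11 atomics. This is an illustration, not the kernel's ``refcount_t`` implementation: ``struct demo_obj`` and the ``demo_*`` functions are hypothetical, and saturation handling is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for task_struct. */
struct demo_obj {
	atomic_uint usage;
};

/* Take a reference only if the count is still non-zero. */
static bool demo_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;	/* object is already on its way to being freed */
}

/* Return obj with an extra reference, or NULL if it is already dead. */
static struct demo_obj *demo_tryget(struct demo_obj *obj)
{
	assert(obj != NULL);
	return demo_inc_not_zero(&obj->usage) ? obj : NULL;
}
```

A caller that gets NULL back must treat the object as gone, which is what lets iteration code skip entries that are being torn down concurrently.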

+32

include/trace/events/sched_ext.h

···
1 +     /* SPDX-License-Identifier: GPL-2.0 */
2 +     #undef TRACE_SYSTEM
3 +     #define TRACE_SYSTEM sched_ext
4 +
5 +     #if !defined(_TRACE_SCHED_EXT_H) || defined(TRACE_HEADER_MULTI_READ)
6 +     #define _TRACE_SCHED_EXT_H
7 +
8 +     #include <linux/tracepoint.h>
9 +
10 +    TRACE_EVENT(sched_ext_dump,
11 +
12 +        TP_PROTO(const char *line),
13 +
14 +        TP_ARGS(line),
15 +
16 +        TP_STRUCT__entry(
17 +            __string(line, line)
18 +        ),
19 +
20 +        TP_fast_assign(
21 +            __assign_str(line);
22 +        ),
23 +
24 +        TP_printk("%s",
25 +            __get_str(line)
26 +        )
27 +    );
28 +
29 +    #endif /* _TRACE_SCHED_EXT_H */
30 +
31 +    /* This part must be outside protection */
32 +    #include <trace/define_trace.h>

+1

include/uapi/linux/sched.h

···
118 118   /* SCHED_ISO: reserved but not implemented yet */
119 119   #define SCHED_IDLE 5
120 120   #define SCHED_DEADLINE 6
121 +     #define SCHED_EXT 7
121 122
122 123   /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
123 124   #define SCHED_RESET_ON_FORK 0x40000000
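
A task can request the new policy through the usual ``sched_setscheduler(2)`` call. The sketch below is a hedged illustration: ``demo_set_sched_ext()`` is a hypothetical helper, and the fallback ``#define`` is only there because older libc headers may not know the value from this hunk. Per sched-ext.rst, a task that sets ``SCHED_EXT`` while no BPF scheduler is loaded is treated as ``SCHED_NORMAL``; a kernel built without ``CONFIG_SCHED_CLASS_EXT`` is expected to reject the call with an error, though the exact errno is not stated in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7	/* value from the hunk above; older headers lack it */
#endif

/*
 * Ask the kernel to place @pid (0 means the calling thread) under the
 * SCHED_EXT policy. Returns 0 on success or a negative errno value.
 */
static int demo_set_sched_ext(pid_t pid)
{
	struct sched_param param;

	assert(pid >= 0);
	memset(&param, 0, sizeof(param));	/* non-RT policies use priority 0 */
	if (sched_setscheduler(pid, SCHED_EXT, &param) == -1)
		return -errno;
	return 0;
}
```

Callers should check the return value rather than assume success, since the outcome depends on the running kernel's configuration and on whether the loaded BPF scheduler has disallowed the task.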

+10

init/Kconfig

···
1025 1025       tasks.
1026 1026
1027 1027   if CGROUP_SCHED
1028 +       config GROUP_SCHED_WEIGHT
1029 +           def_bool n
1030 +
1028 1031   config FAIR_GROUP_SCHED
1029 1032       bool "Group scheduling for SCHED_OTHER"
1030 1033       depends on CGROUP_SCHED
1034 +           select GROUP_SCHED_WEIGHT
1031 1035       default CGROUP_SCHED
1032 1036
1033 1037   config CFS_BANDWIDTH
···
1055 1051       schedule realtime tasks for non-root users until you allocate
1056 1052       realtime bandwidth for them.
1057 1053       See Documentation/scheduler/sched-rt-group.rst for more information.
1054 +
1055 +       config EXT_GROUP_SCHED
1056 +           bool
1057 +           depends on SCHED_CLASS_EXT && CGROUP_SCHED
1058 +           select GROUP_SCHED_WEIGHT
1059 +           default y
1058 1060
1059 1061   endif #CGROUP_SCHED
1060 1062

+12

init/init_task.c

···
6 6     #include <linux/sched/sysctl.h>
7 7     #include <linux/sched/rt.h>
8 8     #include <linux/sched/task.h>
9 +     #include <linux/sched/ext.h>
9 10    #include <linux/init.h>
10 11   #include <linux/fs.h>
11 12   #include <linux/mm.h>
···
99 98     #endif
100 99    #ifdef CONFIG_CGROUP_SCHED
101 100       .sched_task_group = &root_task_group,
101 +     #endif
102 +     #ifdef CONFIG_SCHED_CLASS_EXT
103 +         .scx = {
104 +             .dsq_list.node = LIST_HEAD_INIT(init_task.scx.dsq_list.node),
105 +             .sticky_cpu = -1,
106 +             .holding_cpu = -1,
107 +             .runnable_node = LIST_HEAD_INIT(init_task.scx.runnable_node),
108 +             .runnable_at = INITIAL_JIFFIES,
109 +             .ddsp_dsq_id = SCX_DSQ_INVALID,
110 +             .slice = SCX_SLICE_DFL,
111 +         },
102 112   #endif
103 113       .ptraced = LIST_HEAD_INIT(init_task.ptraced),
104 114       .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),

+25

kernel/Kconfig.preempt

···
133 133         which is the likely usage by Linux distributions, there should
134 134         be no measurable impact on performance.
135 135
136 +     config SCHED_CLASS_EXT
137 +         bool "Extensible Scheduling Class"
138 +         depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
139 +         select STACKTRACE if STACKTRACE_SUPPORT
140 +         help
141 +           This option enables a new scheduler class sched_ext (SCX), which
142 +           allows scheduling policies to be implemented as BPF programs to
143 +           achieve the following:
136 144
145 +           - Ease of experimentation and exploration: Enabling rapid
146 +             iteration of new scheduling policies.
147 +           - Customization: Building application-specific schedulers which
148 +             implement policies that are not applicable to general-purpose
149 +             schedulers.
150 +           - Rapid scheduler deployments: Non-disruptive swap outs of
151 +             scheduling policies in production environments.
152 +
153 +           sched_ext leverages BPF struct_ops feature to define a structure
154 +           which exports function callbacks and flags to BPF programs that
155 +           wish to implement scheduling policies. The struct_ops structure
156 +           exported by sched_ext is struct sched_ext_ops, and is conceptually
157 +           similar to struct sched_class.
158 +
159 +           For more information:
160 +             Documentation/scheduler/sched-ext.rst
161 +             https://github.com/sched-ext/scx

+12 -5

kernel/fork.c

··· 23 23 #include <linux/sched/task.h> 24 24 #include <linux/sched/task_stack.h> 25 25 #include <linux/sched/cputime.h> 26 + #include <linux/sched/ext.h> 26 27 #include <linux/seq_file.h> 27 28 #include <linux/rtmutex.h> 28 29 #include <linux/init.h> ··· 970 969 WARN_ON(refcount_read(&tsk->usage)); 971 970 WARN_ON(tsk == current); 972 971 972 + sched_ext_free(tsk); 973 973 io_uring_free(tsk); 974 974 cgroup_free(tsk); 975 975 task_numa_free(tsk, true); ··· 2348 2346 2349 2347 retval = perf_event_init_task(p, clone_flags); 2350 2348 if (retval) 2351 - goto bad_fork_cleanup_policy; 2349 + goto bad_fork_sched_cancel_fork; 2352 2350 retval = audit_alloc(p); 2353 2351 if (retval) 2354 2352 goto bad_fork_cleanup_perf; ··· 2481 2479 * cgroup specific, it unconditionally needs to place the task on a 2482 2480 * runqueue. 2483 2481 */ 2484 - sched_cgroup_fork(p, args); 2482 + retval = sched_cgroup_fork(p, args); 2483 + if (retval) 2484 + goto bad_fork_cancel_cgroup; 2485 2485 2486 2486 /* 2487 2487 * From this point on we must avoid any synchronous user-space ··· 2529 2525 /* Don't start children in a dying pid namespace */ 2530 2526 if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) { 2531 2527 retval = -ENOMEM; 2532 - goto bad_fork_cancel_cgroup; 2528 + goto bad_fork_core_free; 2533 2529 } 2534 2530 2535 2531 /* Let kill terminate clone/fork in the middle */ 2536 2532 if (fatal_signal_pending(current)) { 2537 2533 retval = -EINTR; 2538 - goto bad_fork_cancel_cgroup; 2534 + goto bad_fork_core_free; 2539 2535 } 2540 2536 2541 2537 /* No more failure paths after this point. 
*/ ··· 2609 2605 2610 2606 return p; 2611 2607 2612 - bad_fork_cancel_cgroup: 2608 + bad_fork_core_free: 2613 2609 sched_core_free(p); 2614 2610 spin_unlock(&current->sighand->siglock); 2615 2611 write_unlock_irq(&tasklist_lock); 2612 + bad_fork_cancel_cgroup: 2616 2613 cgroup_cancel_fork(p, args); 2617 2614 bad_fork_put_pidfd: 2618 2615 if (clone_flags & CLONE_PIDFD) { ··· 2652 2647 audit_free(p); 2653 2648 bad_fork_cleanup_perf: 2654 2649 perf_event_free_task(p); 2650 + bad_fork_sched_cancel_fork: 2651 + sched_cancel_fork(p); 2655 2652 bad_fork_cleanup_policy: 2656 2653 lockdep_free_task(p); 2657 2654 #ifdef CONFIG_NUMA
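
The relabelled error paths in this hunk follow the usual goto-ladder rule: cleanup labels appear in the reverse order of the setup steps, so a failure jumps to the label that undoes only what has already succeeded. The toy function below models just the two jumps visible in the hunk (``perf_event_init_task()`` failing, and ``audit_alloc()`` failing after it). All ``demo_*``/``DEMO_*`` names are hypothetical and the cleanup calls are replaced by log entries so the order can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Which setup step fails in this toy run. */
enum demo_fail {
	DEMO_FAIL_NONE,
	DEMO_FAIL_PERF_INIT,	/* models perf_event_init_task() failing */
	DEMO_FAIL_AUDIT_ALLOC,	/* models audit_alloc() failing */
};

/* Records one character per cleanup step, in the order they run. */
struct demo_log {
	char buf[8];
	size_t len;
};

static void demo_note(struct demo_log *log, char c)
{
	assert(log->len + 1 < sizeof(log->buf));
	log->buf[log->len++] = c;
	log->buf[log->len] = '\0';
}

/* Returns 0 on success, -1 after unwinding on failure. */
static int demo_fork(enum demo_fail fail, struct demo_log *log)
{
	if (fail == DEMO_FAIL_PERF_INIT)
		goto out_sched_cancel;	/* perf state was never set up */
	if (fail == DEMO_FAIL_AUDIT_ALLOC)
		goto out_cleanup_perf;	/* perf state exists, undo it too */
	return 0;

out_cleanup_perf:
	demo_note(log, 'P');	/* stands in for perf_event_free_task() */
out_sched_cancel:
	demo_note(log, 'S');	/* stands in for sched_cancel_fork() */
	demo_note(log, 'L');	/* stands in for lockdep_free_task() */
	return -1;
}
```

A perf-init failure therefore logs "SL", while a later failure logs "PSL": the scheduler cancel step always runs before the older policy cleanup, matching where the new label was inserted.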

+11

kernel/sched/build_policy.c

···
16 16   #include <linux/sched/clock.h>
17 17   #include <linux/sched/cputime.h>
18 18   #include <linux/sched/hotplug.h>
19 +     #include <linux/sched/isolation.h>
19 20   #include <linux/sched/posix-timers.h>
20 21   #include <linux/sched/rt.h>
21 22
22 23   #include <linux/cpuidle.h>
23 24   #include <linux/jiffies.h>
25 +     #include <linux/kobject.h>
24 26   #include <linux/livepatch.h>
27 +     #include <linux/pm.h>
25 28   #include <linux/psi.h>
29 +     #include <linux/rhashtable.h>
30 +     #include <linux/seq_buf.h>
26 31   #include <linux/seqlock_api.h>
27 32   #include <linux/slab.h>
28 33   #include <linux/suspend.h>
29 34   #include <linux/tsacct_kern.h>
30 35   #include <linux/vtime.h>
36 +     #include <linux/sysrq.h>
37 +     #include <linux/percpu-rwsem.h>
31 38
32 39   #include <uapi/linux/sched/types.h>
33 40
···
58 51
59 52   #include "cputime.c"
60 53   #include "deadline.c"
54 +
55 +     #ifdef CONFIG_SCHED_CLASS_EXT
56 +     # include "ext.c"
57 +     #endif
61 58
62 59   #include "syscalls.c"

+218 -63

kernel/sched/core.c

··· 172 172 if (p->sched_class == &idle_sched_class) 173 173 return MAX_RT_PRIO + NICE_WIDTH; /* 140 */ 174 174 175 - return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */ 175 + if (task_on_scx(p)) 176 + return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */ 177 + 178 + return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */ 176 179 } 177 180 178 181 /* ··· 219 216 220 217 if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */ 221 218 return cfs_prio_less(a, b, in_fi); 219 + 220 + #ifdef CONFIG_SCHED_CLASS_EXT 221 + if (pa == MAX_RT_PRIO + MAX_NICE + 1) /* ext */ 222 + return scx_prio_less(a, b, in_fi); 223 + #endif 222 224 223 225 return false; 224 226 } ··· 1288 1280 return true; 1289 1281 1290 1282 /* 1291 - * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left; 1292 - * if there's more than one we need the tick for involuntary 1293 - * preemption. 1283 + * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks 1284 + * left. For CFS, if there's more than one we need the tick for 1285 + * involuntary preemption. For SCX, ask. 1294 1286 */ 1295 - if (rq->nr_running > 1) 1287 + if (scx_enabled() && !scx_can_stop_tick(rq)) 1288 + return false; 1289 + 1290 + if (rq->cfs.nr_running > 1) 1296 1291 return false; 1297 1292 1298 1293 /* ··· 1377 1366 * SCHED_OTHER tasks have to update their load when changing their 1378 1367 * weight 1379 1368 */ 1380 - if (update_load && p->sched_class == &fair_sched_class) 1381 - reweight_task(p, &lw); 1369 + if (update_load && p->sched_class->reweight_task) 1370 + p->sched_class->reweight_task(task_rq(p), p, &lw); 1382 1371 else 1383 1372 p->se.load = lw; 1384 1373 } ··· 2098 2087 } 2099 2088 2100 2089 /* 2090 + * ->switching_to() is called with the pi_lock and rq_lock held and must not 2091 + * mess with locking. 
2092 + */ 2093 + void check_class_changing(struct rq *rq, struct task_struct *p, 2094 + const struct sched_class *prev_class) 2095 + { 2096 + if (prev_class != p->sched_class && p->sched_class->switching_to) 2097 + p->sched_class->switching_to(rq, p); 2098 + } 2099 + 2100 + /* 2101 2101 * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock, 2102 2102 * use the balance_callback list if you want balancing. 2103 2103 * ··· 2378 2356 static inline bool is_cpu_allowed(struct task_struct *p, int cpu) 2379 2357 { 2380 2358 /* When not in the task's cpumask, no point in looking further. */ 2381 - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) 2359 + if (!task_allowed_on_cpu(p, cpu)) 2382 2360 return false; 2383 2361 2384 2362 /* migrate_disabled() must be allowed to finish. */ ··· 2387 2365 2388 2366 /* Non kernel threads are not allowed during either online or offline. */ 2389 2367 if (!(p->flags & PF_KTHREAD)) 2390 - return cpu_active(cpu) && task_cpu_possible(cpu, p); 2368 + return cpu_active(cpu); 2391 2369 2392 2370 /* KTHREAD_IS_PER_CPU is always allowed. */ 2393 2371 if (kthread_is_per_cpu(p)) ··· 3865 3843 static inline bool ttwu_queue_cond(struct task_struct *p, int cpu) 3866 3844 { 3867 3845 /* 3846 + * The BPF scheduler may depend on select_task_rq() being invoked during 3847 + * wakeups. In addition, @p may end up executing on a different CPU 3848 + * regardless of what happens in the wakeup path making the ttwu_queue 3849 + * optimization less meaningful. Skip if on SCX. 3850 + */ 3851 + if (task_on_scx(p)) 3852 + return false; 3853 + 3854 + /* 3868 3855 * Do not complicate things with the async wake_list while the CPU is 3869 3856 * in hotplug state. 
3870 3857 */ ··· 4447 4416 p->rt.on_rq = 0; 4448 4417 p->rt.on_list = 0; 4449 4418 4419 + #ifdef CONFIG_SCHED_CLASS_EXT 4420 + init_scx_entity(&p->scx); 4421 + #endif 4422 + 4450 4423 #ifdef CONFIG_PREEMPT_NOTIFIERS 4451 4424 INIT_HLIST_HEAD(&p->preempt_notifiers); 4452 4425 #endif ··· 4693 4658 4694 4659 if (dl_prio(p->prio)) 4695 4660 return -EAGAIN; 4696 - else if (rt_prio(p->prio)) 4661 + 4662 + scx_pre_fork(p); 4663 + 4664 + if (rt_prio(p->prio)) { 4697 4665 p->sched_class = &rt_sched_class; 4698 - else 4666 + #ifdef CONFIG_SCHED_CLASS_EXT 4667 + } else if (task_should_scx(p)) { 4668 + p->sched_class = &ext_sched_class; 4669 + #endif 4670 + } else { 4699 4671 p->sched_class = &fair_sched_class; 4672 + } 4700 4673 4701 4674 init_entity_runnable_average(&p->se); 4702 4675 ··· 4724 4681 return 0; 4725 4682 } 4726 4683 4727 - void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) 4684 + int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs) 4728 4685 { 4729 4686 unsigned long flags; 4730 4687 ··· 4751 4708 if (p->sched_class->task_fork) 4752 4709 p->sched_class->task_fork(p); 4753 4710 raw_spin_unlock_irqrestore(&p->pi_lock, flags); 4711 + 4712 + return scx_fork(p); 4713 + } 4714 + 4715 + void sched_cancel_fork(struct task_struct *p) 4716 + { 4717 + scx_cancel_fork(p); 4754 4718 } 4755 4719 4756 4720 void sched_post_fork(struct task_struct *p) 4757 4721 { 4758 4722 uclamp_post_fork(p); 4723 + scx_post_fork(p); 4759 4724 } 4760 4725 4761 4726 unsigned long to_ratio(u64 period, u64 runtime) ··· 5596 5545 calc_global_load_tick(rq); 5597 5546 sched_core_tick(rq); 5598 5547 task_tick_mm_cid(rq, curr); 5548 + scx_tick(rq); 5599 5549 5600 5550 rq_unlock(rq, &rf); 5601 5551 ··· 5609 5557 wq_worker_tick(curr); 5610 5558 5611 5559 #ifdef CONFIG_SMP 5612 - rq->idle_balance = idle_cpu(cpu); 5613 - sched_balance_trigger(rq); 5560 + if (!scx_switched_all()) { 5561 + rq->idle_balance = idle_cpu(cpu); 5562 + 
sched_balance_trigger(rq); 5563 + } 5614 5564 #endif 5615 5565 } 5616 5566 ··· 5902 5848 static void prev_balance(struct rq *rq, struct task_struct *prev, 5903 5849 struct rq_flags *rf) 5904 5850 { 5905 - #ifdef CONFIG_SMP 5851 + const struct sched_class *start_class = prev->sched_class; 5906 5852 const struct sched_class *class; 5853 + 5854 + #ifdef CONFIG_SCHED_CLASS_EXT 5855 + /* 5856 + * SCX requires a balance() call before every pick_next_task() including 5857 + * when waking up from SCHED_IDLE. If @start_class is below SCX, start 5858 + * from SCX instead. 5859 + */ 5860 + if (scx_enabled() && sched_class_above(&ext_sched_class, start_class)) 5861 + start_class = &ext_sched_class; 5862 + #endif 5863 + 5907 5864 /* 5908 5865 * We must do the balancing pass before put_prev_task(), such 5909 5866 * that when we release the rq->lock the task is in the same ··· 5923 5858 * We can terminate the balance pass as soon as we know there is 5924 5859 * a runnable task of @class priority or higher. 
5925 5860 */ 5926 - for_class_range(class, prev->sched_class, &idle_sched_class) { 5927 - if (class->balance(rq, prev, rf)) 5861 + for_active_class_range(class, start_class, &idle_sched_class) { 5862 + if (class->balance && class->balance(rq, prev, rf)) 5928 5863 break; 5929 5864 } 5930 - #endif 5931 5865 } 5932 5866 5933 5867 /* ··· 5939 5875 struct task_struct *p; 5940 5876 5941 5877 rq->dl_server = NULL; 5878 + 5879 + if (scx_enabled()) 5880 + goto restart; 5942 5881 5943 5882 /* 5944 5883 * Optimization: we know that if all tasks are in the fair class we can ··· 5968 5901 restart: 5969 5902 prev_balance(rq, prev, rf); 5970 5903 5971 - for_each_class(class) { 5904 + for_each_active_class(class) { 5972 5905 if (class->pick_next_task) { 5973 5906 p = class->pick_next_task(rq, prev); 5974 5907 if (p) ··· 6011 5944 6012 5945 rq->dl_server = NULL; 6013 5946 6014 - for_each_class(class) { 5947 + for_each_active_class(class) { 6015 5948 p = class->pick_task(rq); 6016 5949 if (p) 6017 5950 return p; ··· 7015 6948 p->sched_class = &dl_sched_class; 7016 6949 else if (rt_prio(prio)) 7017 6950 p->sched_class = &rt_sched_class; 6951 + #ifdef CONFIG_SCHED_CLASS_EXT 6952 + else if (task_should_scx(p)) 6953 + p->sched_class = &ext_sched_class; 6954 + #endif 7018 6955 else 7019 6956 p->sched_class = &fair_sched_class; 7020 6957 ··· 7164 7093 } 7165 7094 7166 7095 __setscheduler_prio(p, prio); 7096 + check_class_changing(rq, p, prev_class); 7167 7097 7168 7098 if (queued) 7169 7099 enqueue_task(rq, p, queue_flag); ··· 7577 7505 7578 7506 print_worker_info(KERN_INFO, p); 7579 7507 print_stop_info(KERN_INFO, p); 7508 + print_scx_info(KERN_INFO, p); 7580 7509 show_stack(p, NULL, KERN_INFO); 7581 7510 put_task_stack(p); 7582 7511 } ··· 8106 8033 cpuset_cpu_active(); 8107 8034 } 8108 8035 8036 + scx_rq_activate(rq); 8037 + 8109 8038 /* 8110 8039 * Put the rq online, if not already. 
This happens: 8111 8040 * ··· 8156 8081 synchronize_rcu(); 8157 8082 8158 8083 sched_set_rq_offline(rq, cpu); 8084 + 8085 + scx_rq_deactivate(rq); 8159 8086 8160 8087 /* 8161 8088 * When going down, decrement the number of cores with SMT present. ··· 8343 8266 int i; 8344 8267 8345 8268 /* Make sure the linker didn't screw up */ 8346 - BUG_ON(&idle_sched_class != &fair_sched_class + 1 || 8347 - &fair_sched_class != &rt_sched_class + 1 || 8348 - &rt_sched_class != &dl_sched_class + 1); 8349 8269 #ifdef CONFIG_SMP 8350 - BUG_ON(&dl_sched_class != &stop_sched_class + 1); 8270 + BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class)); 8271 + #endif 8272 + BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class)); 8273 + BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class)); 8274 + BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class)); 8275 + #ifdef CONFIG_SCHED_CLASS_EXT 8276 + BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class)); 8277 + BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class)); 8351 8278 #endif 8352 8279 8353 8280 wait_bit_init(); ··· 8375 8294 root_task_group.shares = ROOT_TASK_GROUP_LOAD; 8376 8295 init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL); 8377 8296 #endif /* CONFIG_FAIR_GROUP_SCHED */ 8297 + #ifdef CONFIG_EXT_GROUP_SCHED 8298 + root_task_group.scx_weight = CGROUP_WEIGHT_DFL; 8299 + #endif /* CONFIG_EXT_GROUP_SCHED */ 8378 8300 #ifdef CONFIG_RT_GROUP_SCHED 8379 8301 root_task_group.rt_se = (struct sched_rt_entity **)ptr; 8380 8302 ptr += nr_cpu_ids * sizeof(void **); ··· 8529 8445 balance_push_set(smp_processor_id(), false); 8530 8446 #endif 8531 8447 init_sched_fair_class(); 8448 + init_sched_ext_class(); 8532 8449 8533 8450 psi_init(); 8534 8451 ··· 8815 8730 if (!alloc_rt_sched_group(tg, parent)) 8816 8731 goto err; 8817 8732 8733 + scx_group_set_weight(tg, CGROUP_WEIGHT_DFL); 8818 8734 alloc_uclamp_sched_group(tg, parent); 8819 8735 8820 8736 return tg; ··· 8943 8857 
put_prev_task(rq, tsk); 8944 8858 8945 8859 sched_change_group(tsk, group); 8860 + scx_move_task(tsk); 8946 8861 8947 8862 if (queued) 8948 8863 enqueue_task(rq, tsk, queue_flags); ··· 8956 8869 */ 8957 8870 resched_curr(rq); 8958 8871 } 8959 - } 8960 - 8961 - static inline struct task_group *css_tg(struct cgroup_subsys_state *css) 8962 - { 8963 - return css ? container_of(css, struct task_group, css) : NULL; 8964 8872 } 8965 8873 8966 8874 static struct cgroup_subsys_state * ··· 8981 8899 { 8982 8900 struct task_group *tg = css_tg(css); 8983 8901 struct task_group *parent = css_tg(css->parent); 8902 + int ret; 8903 + 8904 + ret = scx_tg_online(tg); 8905 + if (ret) 8906 + return ret; 8984 8907 8985 8908 if (parent) 8986 8909 sched_online_group(tg, parent); ··· 8998 8911 #endif 8999 8912 9000 8913 return 0; 8914 + } 8915 + 8916 + static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css) 8917 + { 8918 + struct task_group *tg = css_tg(css); 8919 + 8920 + scx_tg_offline(tg); 9001 8921 } 9002 8922 9003 8923 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css) ··· 9024 8930 sched_unregister_group(tg); 9025 8931 } 9026 8932 9027 - #ifdef CONFIG_RT_GROUP_SCHED 9028 8933 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset) 9029 8934 { 8935 + #ifdef CONFIG_RT_GROUP_SCHED 9030 8936 struct task_struct *task; 9031 8937 struct cgroup_subsys_state *css; 9032 8938 ··· 9034 8940 if (!sched_rt_can_attach(css_tg(css), task)) 9035 8941 return -EINVAL; 9036 8942 } 9037 - return 0; 9038 - } 9039 8943 #endif 8944 + return scx_cgroup_can_attach(tset); 8945 + } 9040 8946 9041 8947 static void cpu_cgroup_attach(struct cgroup_taskset *tset) 9042 8948 { ··· 9045 8951 9046 8952 cgroup_taskset_for_each(task, css, tset) 9047 8953 sched_move_task(task); 8954 + 8955 + scx_cgroup_finish_attach(); 8956 + } 8957 + 8958 + static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset) 8959 + { 8960 + scx_cgroup_cancel_attach(tset); 9048 8961 } 9049 8962 9050 8963 
#ifdef CONFIG_UCLAMP_TASK_GROUP ··· 9228 9127 } 9229 9128 #endif /* CONFIG_UCLAMP_TASK_GROUP */ 9230 9129 9130 + #ifdef CONFIG_GROUP_SCHED_WEIGHT 9131 + static unsigned long tg_weight(struct task_group *tg) 9132 + { 9231 9133 #ifdef CONFIG_FAIR_GROUP_SCHED 9134 + return scale_load_down(tg->shares); 9135 + #else 9136 + return sched_weight_from_cgroup(tg->scx_weight); 9137 + #endif 9138 + } 9139 + 9232 9140 static int cpu_shares_write_u64(struct cgroup_subsys_state *css, 9233 9141 struct cftype *cftype, u64 shareval) 9234 9142 { 9143 + int ret; 9144 + 9235 9145 if (shareval > scale_load_down(ULONG_MAX)) 9236 9146 shareval = MAX_SHARES; 9237 - return sched_group_set_shares(css_tg(css), scale_load(shareval)); 9147 + ret = sched_group_set_shares(css_tg(css), scale_load(shareval)); 9148 + if (!ret) 9149 + scx_group_set_weight(css_tg(css), 9150 + sched_weight_to_cgroup(shareval)); 9151 + return ret; 9238 9152 } 9239 9153 9240 9154 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css, 9241 9155 struct cftype *cft) 9242 9156 { 9243 - struct task_group *tg = css_tg(css); 9244 - 9245 - return (u64) scale_load_down(tg->shares); 9157 + return tg_weight(css_tg(css)); 9246 9158 } 9159 + #endif /* CONFIG_GROUP_SCHED_WEIGHT */ 9247 9160 9248 9161 #ifdef CONFIG_CFS_BANDWIDTH 9249 9162 static DEFINE_MUTEX(cfs_constraints_mutex); ··· 9603 9488 return 0; 9604 9489 } 9605 9490 #endif /* CONFIG_CFS_BANDWIDTH */ 9606 - #endif /* CONFIG_FAIR_GROUP_SCHED */ 9607 9491 9608 9492 #ifdef CONFIG_RT_GROUP_SCHED 9609 9493 static int cpu_rt_runtime_write(struct cgroup_subsys_state *css, ··· 9630 9516 } 9631 9517 #endif /* CONFIG_RT_GROUP_SCHED */ 9632 9518 9633 - #ifdef CONFIG_FAIR_GROUP_SCHED 9519 + #ifdef CONFIG_GROUP_SCHED_WEIGHT 9634 9520 static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css, 9635 9521 struct cftype *cft) 9636 9522 { ··· 9640 9526 static int cpu_idle_write_s64(struct cgroup_subsys_state *css, 9641 9527 struct cftype *cft, s64 idle) 9642 9528 { 9643 - return 
sched_group_set_idle(css_tg(css), idle); 9529 + int ret; 9530 + 9531 + ret = sched_group_set_idle(css_tg(css), idle); 9532 + if (!ret) 9533 + scx_group_set_idle(css_tg(css), idle); 9534 + return ret; 9644 9535 } 9645 9536 #endif 9646 9537 9647 9538 static struct cftype cpu_legacy_files[] = { 9648 - #ifdef CONFIG_FAIR_GROUP_SCHED 9539 + #ifdef CONFIG_GROUP_SCHED_WEIGHT 9649 9540 { 9650 9541 .name = "shares", 9651 9542 .read_u64 = cpu_shares_read_u64, ··· 9760 9641 return 0; 9761 9642 } 9762 9643 9763 - #ifdef CONFIG_FAIR_GROUP_SCHED 9644 + #ifdef CONFIG_GROUP_SCHED_WEIGHT 9645 + 9764 9646 static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, 9765 9647 struct cftype *cft) 9766 9648 { 9767 - struct task_group *tg = css_tg(css); 9768 - u64 weight = scale_load_down(tg->shares); 9769 - 9770 - return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024); 9649 + return sched_weight_to_cgroup(tg_weight(css_tg(css))); 9771 9650 } 9772 9651 9773 9652 static int cpu_weight_write_u64(struct cgroup_subsys_state *css, 9774 - struct cftype *cft, u64 weight) 9653 + struct cftype *cft, u64 cgrp_weight) 9775 9654 { 9776 - /* 9777 - * cgroup weight knobs should use the common MIN, DFL and MAX 9778 - * values which are 1, 100 and 10000 respectively. While it loses 9779 - * a bit of range on both ends, it maps pretty well onto the shares 9780 - * value used by scheduler and the round-trip conversions preserve 9781 - * the original value over the entire range. 
9782 - */ 9783 - if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX) 9655 + unsigned long weight; 9656 + int ret; 9657 + 9658 + if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX) 9784 9659 return -ERANGE; 9785 9660 9786 - weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL); 9661 + weight = sched_weight_from_cgroup(cgrp_weight); 9787 9662 9788 - return sched_group_set_shares(css_tg(css), scale_load(weight)); 9663 + ret = sched_group_set_shares(css_tg(css), scale_load(weight)); 9664 + if (!ret) 9665 + scx_group_set_weight(css_tg(css), cgrp_weight); 9666 + return ret; 9789 9667 } 9790 9668 9791 9669 static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css, 9792 9670 struct cftype *cft) 9793 9671 { 9794 - unsigned long weight = scale_load_down(css_tg(css)->shares); 9672 + unsigned long weight = tg_weight(css_tg(css)); 9795 9673 int last_delta = INT_MAX; 9796 9674 int prio, delta; 9797 9675 ··· 9807 9691 struct cftype *cft, s64 nice) 9808 9692 { 9809 9693 unsigned long weight; 9810 - int idx; 9694 + int idx, ret; 9811 9695 9812 9696 if (nice < MIN_NICE || nice > MAX_NICE) 9813 9697 return -ERANGE; ··· 9816 9700 idx = array_index_nospec(idx, 40); 9817 9701 weight = sched_prio_to_weight[idx]; 9818 9702 9819 - return sched_group_set_shares(css_tg(css), scale_load(weight)); 9703 + ret = sched_group_set_shares(css_tg(css), scale_load(weight)); 9704 + if (!ret) 9705 + scx_group_set_weight(css_tg(css), 9706 + sched_weight_to_cgroup(weight)); 9707 + return ret; 9820 9708 } 9821 - #endif 9709 + #endif /* CONFIG_GROUP_SCHED_WEIGHT */ 9822 9710 9823 9711 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf, 9824 9712 long period, long quota) ··· 9882 9762 #endif 9883 9763 9884 9764 static struct cftype cpu_files[] = { 9885 - #ifdef CONFIG_FAIR_GROUP_SCHED 9765 + #ifdef CONFIG_GROUP_SCHED_WEIGHT 9886 9766 { 9887 9767 .name = "weight", 9888 9768 .flags = CFTYPE_NOT_ON_ROOT, ··· 9936 9816 struct cgroup_subsys 
cpu_cgrp_subsys = { 9937 9817 .css_alloc = cpu_cgroup_css_alloc, 9938 9818 .css_online = cpu_cgroup_css_online, 9819 + .css_offline = cpu_cgroup_css_offline, 9939 9820 .css_released = cpu_cgroup_css_released, 9940 9821 .css_free = cpu_cgroup_css_free, 9941 9822 .css_extra_stat_show = cpu_extra_stat_show, 9942 9823 .css_local_stat_show = cpu_local_stat_show, 9943 - #ifdef CONFIG_RT_GROUP_SCHED 9944 9824 .can_attach = cpu_cgroup_can_attach, 9945 - #endif 9946 9825 .attach = cpu_cgroup_attach, 9826 + .cancel_attach = cpu_cgroup_cancel_attach, 9947 9827 .legacy_cftypes = cpu_legacy_files, 9948 9828 .dfl_cftypes = cpu_files, 9949 9829 .early_init = true, ··· 10533 10413 t->mm_cid_active = 1; 10534 10414 } 10535 10415 #endif 10416 + 10417 + #ifdef CONFIG_SCHED_CLASS_EXT 10418 + void sched_deq_and_put_task(struct task_struct *p, int queue_flags, 10419 + struct sched_enq_and_set_ctx *ctx) 10420 + { 10421 + struct rq *rq = task_rq(p); 10422 + 10423 + lockdep_assert_rq_held(rq); 10424 + 10425 + *ctx = (struct sched_enq_and_set_ctx){ 10426 + .p = p, 10427 + .queue_flags = queue_flags, 10428 + .queued = task_on_rq_queued(p), 10429 + .running = task_current(rq, p), 10430 + }; 10431 + 10432 + update_rq_clock(rq); 10433 + if (ctx->queued) 10434 + dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK); 10435 + if (ctx->running) 10436 + put_prev_task(rq, p); 10437 + } 10438 + 10439 + void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx) 10440 + { 10441 + struct rq *rq = task_rq(ctx->p); 10442 + 10443 + lockdep_assert_rq_held(rq); 10444 + 10445 + if (ctx->queued) 10446 + enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK); 10447 + if (ctx->running) 10448 + set_next_task(rq, ctx->p); 10449 + } 10450 + #endif /* CONFIG_SCHED_CLASS_EXT */
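Review note on the cpu.weight hunks above: the removed comment states that cgroup weights (MIN 1, DFL 100, MAX 10000) map onto scheduler shares and that the round trip preserves the original value over the whole range. The sketch below is a userspace model of that arithmetic, not kernel code. It assumes the new `sched_weight_from_cgroup()` / `sched_weight_to_cgroup()` helpers keep the same `DIV_ROUND_CLOSEST_ULL` expressions as the removed inline code; the helper bodies are not part of this excerpt, and the `model_*` names are hypothetical.

```c
#include <stdint.h>

/* Values quoted from the removed comment in cpu_weight_write_u64(). */
#define CGROUP_WEIGHT_MIN 1
#define CGROUP_WEIGHT_DFL 100
#define CGROUP_WEIGHT_MAX 10000

/* Userspace stand-in for DIV_ROUND_CLOSEST_ULL() on unsigned operands. */
static uint64_t model_div_round_closest(uint64_t x, uint64_t d)
{
	return (x + d / 2) / d;
}

/* cgroup weight -> scheduler weight, as in the removed write path. */
static uint64_t model_weight_from_cgroup(uint64_t cgrp_weight)
{
	return model_div_round_closest(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
}

/* scheduler weight -> cgroup weight, as in the removed read path. */
static uint64_t model_weight_to_cgroup(uint64_t weight)
{
	return model_div_round_closest(weight * CGROUP_WEIGHT_DFL, 1024);
}

/* Returns 1 if every weight in [MIN, MAX] survives the round trip. */
static int model_round_trip_holds(void)
{
	uint64_t w;

	for (w = CGROUP_WEIGHT_MIN; w <= CGROUP_WEIGHT_MAX; w++) {
		if (model_weight_to_cgroup(model_weight_from_cgroup(w)) != w)
			return 0;
	}
	return 1;
}
```

If the assumption holds, this is why the write paths in the hunk can pass either `cgrp_weight` directly or `sched_weight_to_cgroup(weight)` to `scx_group_set_weight()` and stay consistent with what `sched_group_set_shares()` stores.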

+29 -21

kernel/sched/cpufreq_schedutil.c

···

 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);

+	if (!scx_switched_all())
+		util += cpu_util_cfs_boost(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
···
 }

 #ifdef CONFIG_NO_HZ_COMMON
-static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
 {
-	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
-	bool ret = idle_calls == sg_cpu->saved_idle_calls;
+	unsigned long idle_calls;
+	bool ret;
+
+	/*
+	 * The heuristics in this function is for the fair class. For SCX, the
+	 * performance target comes directly from the BPF scheduler. Let's just
+	 * follow it.
+	 */
+	if (scx_switched_all())
+		return false;
+
+	/* if capped by uclamp_max, always update to be in compliance */
+	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
+		return false;
+
+	/*
+	 * Maintain the frequency if the CPU has not been idle recently, as
+	 * reduction is likely to be premature.
+	 */
+	idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
+	ret = idle_calls == sg_cpu->saved_idle_calls;

 	sg_cpu->saved_idle_calls = idle_calls;
 	return ret;
 }
 #else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
 #endif /* CONFIG_NO_HZ_COMMON */

 /*
···
 		return;

 	next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
-	/*
-	 * Do not reduce the frequency if the CPU has not been idle
-	 * recently, as the reduction is likely to be premature then.
-	 *
-	 * Except when the rq is capped by uclamp_max.
-	 */
-	if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
-	    sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+
+	if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
 	    !sg_policy->need_freq_update) {
 		next_f = sg_policy->next_freq;

···
 	if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
 		return;

-	/*
-	 * Do not reduce the target performance level if the CPU has not been
-	 * idle recently, as the reduction is likely to be premature then.
-	 *
-	 * Except when the rq is capped by uclamp_max.
-	 */
-	if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
-	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+	if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;

 	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
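Review note on the schedutil hunks above: the change folds the uclamp check and the "CPU has not been idle recently" heuristic into `sugov_hold_freq()` and adds an early exit when every task is on sched_ext. The sketch below is a userspace model of that decision order only, written to make the control flow easy to check. `struct cpu_model` and the `model_*` functions are hypothetical stand-ins; the boolean fields replace the kernel calls named in the comments.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU governor state used in the diff. */
struct cpu_model {
	bool scx_switched_all;          /* models scx_switched_all() */
	bool uclamp_capped;             /* models uclamp_rq_is_capped() */
	unsigned long idle_calls;       /* models tick_nohz_get_idle_calls_cpu() */
	unsigned long saved_idle_calls; /* models sg_cpu->saved_idle_calls */
};

/* Mirrors the order of checks in sugov_hold_freq() from the diff. */
static bool model_hold_freq(struct cpu_model *c)
{
	bool ret;

	/* With all tasks on SCX, follow the BPF scheduler's target. */
	if (c->scx_switched_all)
		return false;

	/* When capped by uclamp_max, always allow the update. */
	if (c->uclamp_capped)
		return false;

	/* Hold only if the CPU has not gone idle since the last check. */
	ret = c->idle_calls == c->saved_idle_calls;
	c->saved_idle_calls = c->idle_calls;
	return ret;
}

/* Mirrors the caller: a reduction is suppressed while the hold applies. */
static unsigned long model_next_freq(struct cpu_model *c, unsigned long next_f,
				     unsigned long cur_f, bool need_update)
{
	if (model_hold_freq(c) && next_f < cur_f && !need_update)
		return cur_f;
	return next_f;
}
```

As in the diff, the two early returns leave `saved_idle_calls` untouched, so the idle-call comparison resumes from its previous snapshot once the CPU is no longer fully on SCX or capped.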

+3

kernel/sched/debug.c

···
 		P(dl.runtime);
 		P(dl.deadline);
 	}
+#ifdef CONFIG_SCHED_CLASS_EXT
+	__PS("ext.enabled", task_on_scx(p));
+#endif
 #undef PN_SCHEDSTAT
 #undef P_SCHEDSTAT


+7173

kernel/sched/ext.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst 4 + * 5 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 6 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 7 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 8 + */ 9 + #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) 10 + 11 + enum scx_consts { 12 + SCX_DSP_DFL_MAX_BATCH = 32, 13 + SCX_DSP_MAX_LOOPS = 32, 14 + SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ, 15 + 16 + SCX_EXIT_BT_LEN = 64, 17 + SCX_EXIT_MSG_LEN = 1024, 18 + SCX_EXIT_DUMP_DFL_LEN = 32768, 19 + 20 + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE, 21 + }; 22 + 23 + enum scx_exit_kind { 24 + SCX_EXIT_NONE, 25 + SCX_EXIT_DONE, 26 + 27 + SCX_EXIT_UNREG = 64, /* user-space initiated unregistration */ 28 + SCX_EXIT_UNREG_BPF, /* BPF-initiated unregistration */ 29 + SCX_EXIT_UNREG_KERN, /* kernel-initiated unregistration */ 30 + SCX_EXIT_SYSRQ, /* requested by 'S' sysrq */ 31 + 32 + SCX_EXIT_ERROR = 1024, /* runtime error, error msg contains details */ 33 + SCX_EXIT_ERROR_BPF, /* ERROR but triggered through scx_bpf_error() */ 34 + SCX_EXIT_ERROR_STALL, /* watchdog detected stalled runnable tasks */ 35 + }; 36 + 37 + /* 38 + * An exit code can be specified when exiting with scx_bpf_exit() or 39 + * scx_ops_exit(), corresponding to exit_kind UNREG_BPF and UNREG_KERN 40 + * respectively. The codes are 64bit of the format: 41 + * 42 + * Bits: [63 .. 48 47 .. 32 31 .. 0] 43 + * [ SYS ACT ] [ SYS RSN ] [ USR ] 44 + * 45 + * SYS ACT: System-defined exit actions 46 + * SYS RSN: System-defined exit reasons 47 + * USR : User-defined exit codes and reasons 48 + * 49 + * Using the above, users may communicate intention and context by ORing system 50 + * actions and/or system reasons with a user-defined exit code. 
51 + */ 52 + enum scx_exit_code { 53 + /* Reasons */ 54 + SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, 55 + 56 + /* Actions */ 57 + SCX_ECODE_ACT_RESTART = 1LLU << 48, 58 + }; 59 + 60 + /* 61 + * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is 62 + * being disabled. 63 + */ 64 + struct scx_exit_info { 65 + /* %SCX_EXIT_* - broad category of the exit reason */ 66 + enum scx_exit_kind kind; 67 + 68 + /* exit code if gracefully exiting */ 69 + s64 exit_code; 70 + 71 + /* textual representation of the above */ 72 + const char *reason; 73 + 74 + /* backtrace if exiting due to an error */ 75 + unsigned long *bt; 76 + u32 bt_len; 77 + 78 + /* informational message */ 79 + char *msg; 80 + 81 + /* debug dump */ 82 + char *dump; 83 + }; 84 + 85 + /* sched_ext_ops.flags */ 86 + enum scx_ops_flags { 87 + /* 88 + * Keep built-in idle tracking even if ops.update_idle() is implemented. 89 + */ 90 + SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0, 91 + 92 + /* 93 + * By default, if there are no other task to run on the CPU, ext core 94 + * keeps running the current task even after its slice expires. If this 95 + * flag is specified, such tasks are passed to ops.enqueue() with 96 + * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info. 97 + */ 98 + SCX_OPS_ENQ_LAST = 1LLU << 1, 99 + 100 + /* 101 + * An exiting task may schedule after PF_EXITING is set. In such cases, 102 + * bpf_task_from_pid() may not be able to find the task and if the BPF 103 + * scheduler depends on pid lookup for dispatching, the task will be 104 + * lost leading to various issues including RCU grace period stalls. 105 + * 106 + * To mask this problem, by default, unhashed tasks are automatically 107 + * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't 108 + * depend on pid lookups and wants to handle these tasks directly, the 109 + * following flag can be used. 
110 + */ 111 + SCX_OPS_ENQ_EXITING = 1LLU << 2, 112 + 113 + /* 114 + * If set, only tasks with policy set to SCHED_EXT are attached to 115 + * sched_ext. If clear, SCHED_NORMAL tasks are also included. 116 + */ 117 + SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, 118 + 119 + /* 120 + * CPU cgroup support flags 121 + */ 122 + SCX_OPS_HAS_CGROUP_WEIGHT = 1LLU << 16, /* cpu.weight */ 123 + 124 + SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | 125 + SCX_OPS_ENQ_LAST | 126 + SCX_OPS_ENQ_EXITING | 127 + SCX_OPS_SWITCH_PARTIAL | 128 + SCX_OPS_HAS_CGROUP_WEIGHT, 129 + }; 130 + 131 + /* argument container for ops.init_task() */ 132 + struct scx_init_task_args { 133 + /* 134 + * Set if ops.init_task() is being invoked on the fork path, as opposed 135 + * to the scheduler transition path. 136 + */ 137 + bool fork; 138 + #ifdef CONFIG_EXT_GROUP_SCHED 139 + /* the cgroup the task is joining */ 140 + struct cgroup *cgroup; 141 + #endif 142 + }; 143 + 144 + /* argument container for ops.exit_task() */ 145 + struct scx_exit_task_args { 146 + /* Whether the task exited before running on sched_ext. */ 147 + bool cancelled; 148 + }; 149 + 150 + /* argument container for ops->cgroup_init() */ 151 + struct scx_cgroup_init_args { 152 + /* the weight of the cgroup [1..10000] */ 153 + u32 weight; 154 + }; 155 + 156 + enum scx_cpu_preempt_reason { 157 + /* next task is being scheduled by &sched_class_rt */ 158 + SCX_CPU_PREEMPT_RT, 159 + /* next task is being scheduled by &sched_class_dl */ 160 + SCX_CPU_PREEMPT_DL, 161 + /* next task is being scheduled by &sched_class_stop */ 162 + SCX_CPU_PREEMPT_STOP, 163 + /* unknown reason for SCX being preempted */ 164 + SCX_CPU_PREEMPT_UNKNOWN, 165 + }; 166 + 167 + /* 168 + * Argument container for ops->cpu_acquire(). Currently empty, but may be 169 + * expanded in the future. 
170 + */ 171 + struct scx_cpu_acquire_args {}; 172 + 173 + /* argument container for ops->cpu_release() */ 174 + struct scx_cpu_release_args { 175 + /* the reason the CPU was preempted */ 176 + enum scx_cpu_preempt_reason reason; 177 + 178 + /* the task that's going to be scheduled on the CPU */ 179 + struct task_struct *task; 180 + }; 181 + 182 + /* 183 + * Informational context provided to dump operations. 184 + */ 185 + struct scx_dump_ctx { 186 + enum scx_exit_kind kind; 187 + s64 exit_code; 188 + const char *reason; 189 + u64 at_ns; 190 + u64 at_jiffies; 191 + }; 192 + 193 + /** 194 + * struct sched_ext_ops - Operation table for BPF scheduler implementation 195 + * 196 + * Userland can implement an arbitrary scheduling policy by implementing and 197 + * loading operations in this table. 198 + */ 199 + struct sched_ext_ops { 200 + /** 201 + * select_cpu - Pick the target CPU for a task which is being woken up 202 + * @p: task being woken up 203 + * @prev_cpu: the cpu @p was on before sleeping 204 + * @wake_flags: SCX_WAKE_* 205 + * 206 + * Decision made here isn't final. @p may be moved to any CPU while it 207 + * is getting dispatched for execution later. However, as @p is not on 208 + * the rq at this point, getting the eventual execution CPU right here 209 + * saves a small bit of overhead down the line. 210 + * 211 + * If an idle CPU is returned, the CPU is kicked and will try to 212 + * dispatch. While an explicit custom mechanism can be added, 213 + * select_cpu() serves as the default way to wake up idle CPUs. 214 + * 215 + * @p may be dispatched directly by calling scx_bpf_dispatch(). If @p 216 + * is dispatched, the ops.enqueue() callback will be skipped. Finally, 217 + * if @p is dispatched to SCX_DSQ_LOCAL, it will be dispatched to the 218 + * local DSQ of whatever CPU is returned by this callback. 
219 + */ 220 + s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags); 221 + 222 + /** 223 + * enqueue - Enqueue a task on the BPF scheduler 224 + * @p: task being enqueued 225 + * @enq_flags: %SCX_ENQ_* 226 + * 227 + * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch() 228 + * or enqueue on the BPF scheduler. If not directly dispatched, the bpf 229 + * scheduler owns @p and if it fails to dispatch @p, the task will 230 + * stall. 231 + * 232 + * If @p was dispatched from ops.select_cpu(), this callback is 233 + * skipped. 234 + */ 235 + void (*enqueue)(struct task_struct *p, u64 enq_flags); 236 + 237 + /** 238 + * dequeue - Remove a task from the BPF scheduler 239 + * @p: task being dequeued 240 + * @deq_flags: %SCX_DEQ_* 241 + * 242 + * Remove @p from the BPF scheduler. This is usually called to isolate 243 + * the task while updating its scheduling properties (e.g. priority). 244 + * 245 + * The ext core keeps track of whether the BPF side owns a given task or 246 + * not and can gracefully ignore spurious dispatches from BPF side, 247 + * which makes it safe to not implement this method. However, depending 248 + * on the scheduling logic, this can lead to confusing behaviors - e.g. 249 + * scheduling position not being updated across a priority change. 250 + */ 251 + void (*dequeue)(struct task_struct *p, u64 deq_flags); 252 + 253 + /** 254 + * dispatch - Dispatch tasks from the BPF scheduler and/or consume DSQs 255 + * @cpu: CPU to dispatch tasks for 256 + * @prev: previous task being switched out 257 + * 258 + * Called when a CPU's local dsq is empty. The operation should dispatch 259 + * one or more tasks from the BPF scheduler into the DSQs using 260 + * scx_bpf_dispatch() and/or consume user DSQs into the local DSQ using 261 + * scx_bpf_consume(). 262 + * 263 + * The maximum number of times scx_bpf_dispatch() can be called without 264 + * an intervening scx_bpf_consume() is specified by 265 + * ops.dispatch_max_batch. 
See the comments on top of the two functions 266 + * for more details. 267 + * 268 + * When not %NULL, @prev is an SCX task with its slice depleted. If 269 + * @prev is still runnable as indicated by set %SCX_TASK_QUEUED in 270 + * @prev->scx.flags, it is not enqueued yet and will be enqueued after 271 + * ops.dispatch() returns. To keep executing @prev, return without 272 + * dispatching or consuming any tasks. Also see %SCX_OPS_ENQ_LAST. 273 + */ 274 + void (*dispatch)(s32 cpu, struct task_struct *prev); 275 + 276 + /** 277 + * tick - Periodic tick 278 + * @p: task running currently 279 + * 280 + * This operation is called every 1/HZ seconds on CPUs which are 281 + * executing an SCX task. Setting @p->scx.slice to 0 will trigger an 282 + * immediate dispatch cycle on the CPU. 283 + */ 284 + void (*tick)(struct task_struct *p); 285 + 286 + /** 287 + * runnable - A task is becoming runnable on its associated CPU 288 + * @p: task becoming runnable 289 + * @enq_flags: %SCX_ENQ_* 290 + * 291 + * This and the following three functions can be used to track a task's 292 + * execution state transitions. A task becomes ->runnable() on a CPU, 293 + * and then goes through one or more ->running() and ->stopping() pairs 294 + * as it runs on the CPU, and eventually becomes ->quiescent() when it's 295 + * done running on the CPU. 296 + * 297 + * @p is becoming runnable on the CPU because it's 298 + * 299 + * - waking up (%SCX_ENQ_WAKEUP) 300 + * - being moved from another CPU 301 + * - being restored after temporarily taken off the queue for an 302 + * attribute change. 303 + * 304 + * This and ->enqueue() are related but not coupled. This operation 305 + * notifies @p's state transition and may not be followed by ->enqueue() 306 + * e.g. when @p is being dispatched to a remote CPU, or when @p is 307 + * being enqueued on a CPU experiencing a hotplug event. Likewise, a 308 + * task may be ->enqueue()'d without being preceded by this operation 309 + * e.g. 
after exhausting its slice. 310 + */ 311 + void (*runnable)(struct task_struct *p, u64 enq_flags); 312 + 313 + /** 314 + * running - A task is starting to run on its associated CPU 315 + * @p: task starting to run 316 + * 317 + * See ->runnable() for explanation on the task state notifiers. 318 + */ 319 + void (*running)(struct task_struct *p); 320 + 321 + /** 322 + * stopping - A task is stopping execution 323 + * @p: task stopping to run 324 + * @runnable: is task @p still runnable? 325 + * 326 + * See ->runnable() for explanation on the task state notifiers. If 327 + * !@runnable, ->quiescent() will be invoked after this operation 328 + * returns. 329 + */ 330 + void (*stopping)(struct task_struct *p, bool runnable); 331 + 332 + /** 333 + * quiescent - A task is becoming not runnable on its associated CPU 334 + * @p: task becoming not runnable 335 + * @deq_flags: %SCX_DEQ_* 336 + * 337 + * See ->runnable() for explanation on the task state notifiers. 338 + * 339 + * @p is becoming quiescent on the CPU because it's 340 + * 341 + * - sleeping (%SCX_DEQ_SLEEP) 342 + * - being moved to another CPU 343 + * - being temporarily taken off the queue for an attribute change 344 + * (%SCX_DEQ_SAVE) 345 + * 346 + * This and ->dequeue() are related but not coupled. This operation 347 + * notifies @p's state transition and may not be preceded by ->dequeue() 348 + * e.g. when @p is being dispatched to a remote CPU. 349 + */ 350 + void (*quiescent)(struct task_struct *p, u64 deq_flags); 351 + 352 + /** 353 + * yield - Yield CPU 354 + * @from: yielding task 355 + * @to: optional yield target task 356 + * 357 + * If @to is NULL, @from is yielding the CPU to other runnable tasks. 358 + * The BPF scheduler should ensure that other available tasks are 359 + * dispatched before the yielding task. Return value is ignored in this 360 + * case. 361 + * 362 + * If @to is not-NULL, @from wants to yield the CPU to @to. 
If the bpf 363 + * scheduler can implement the request, return %true; otherwise, %false. 364 + */ 365 + bool (*yield)(struct task_struct *from, struct task_struct *to); 366 + 367 + /** 368 + * core_sched_before - Task ordering for core-sched 369 + * @a: task A 370 + * @b: task B 371 + * 372 + * Used by core-sched to determine the ordering between two tasks. See 373 + * Documentation/admin-guide/hw-vuln/core-scheduling.rst for details on 374 + * core-sched. 375 + * 376 + * Both @a and @b are runnable and may or may not currently be queued on 377 + * the BPF scheduler. Should return %true if @a should run before @b. 378 + * %false if there's no required ordering or @b should run before @a. 379 + * 380 + * If not specified, the default is ordering them according to when they 381 + * became runnable. 382 + */ 383 + bool (*core_sched_before)(struct task_struct *a, struct task_struct *b); 384 + 385 + /** 386 + * set_weight - Set task weight 387 + * @p: task to set weight for 388 + * @weight: new weight [1..10000] 389 + * 390 + * Update @p's weight to @weight. 391 + */ 392 + void (*set_weight)(struct task_struct *p, u32 weight); 393 + 394 + /** 395 + * set_cpumask - Set CPU affinity 396 + * @p: task to set CPU affinity for 397 + * @cpumask: cpumask of cpus that @p can run on 398 + * 399 + * Update @p's CPU affinity to @cpumask. 400 + */ 401 + void (*set_cpumask)(struct task_struct *p, 402 + const struct cpumask *cpumask); 403 + 404 + /** 405 + * update_idle - Update the idle state of a CPU 406 + * @cpu: CPU to update the idle state for 407 + * @idle: whether entering or exiting the idle state 408 + * 409 + * This operation is called when @cpu enters or leaves the idle 410 + * state. 
By default, implementing this operation disables the built-in 411 + * idle CPU tracking and the following helpers become unavailable: 412 + * 413 + * - scx_bpf_select_cpu_dfl() 414 + * - scx_bpf_test_and_clear_cpu_idle() 415 + * - scx_bpf_pick_idle_cpu() 416 + * 417 + * The user also must implement ops.select_cpu() as the default 418 + * implementation relies on scx_bpf_select_cpu_dfl(). 419 + * 420 + * Specify the %SCX_OPS_KEEP_BUILTIN_IDLE flag to keep the built-in idle 421 + * tracking. 422 + */ 423 + void (*update_idle)(s32 cpu, bool idle); 424 + 425 + /** 426 + * cpu_acquire - A CPU is becoming available to the BPF scheduler 427 + * @cpu: The CPU being acquired by the BPF scheduler. 428 + * @args: Acquire arguments, see the struct definition. 429 + * 430 + * A CPU that was previously released from the BPF scheduler is now once 431 + * again under its control. 432 + */ 433 + void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args); 434 + 435 + /** 436 + * cpu_release - A CPU is taken away from the BPF scheduler 437 + * @cpu: The CPU being released by the BPF scheduler. 438 + * @args: Release arguments, see the struct definition. 439 + * 440 + * The specified CPU is no longer under the control of the BPF 441 + * scheduler. This could be because it was preempted by a higher 442 + * priority sched_class, though there may be other reasons as well. The 443 + * caller should consult @args->reason to determine the cause. 444 + */ 445 + void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args); 446 + 447 + /** 448 + * init_task - Initialize a task to run in a BPF scheduler 449 + * @p: task to initialize for BPF scheduling 450 + * @args: init arguments, see the struct definition 451 + * 452 + * Either we're loading a BPF scheduler or a new task is being forked. 453 + * Initialize @p for BPF scheduling. This operation may block and can 454 + * be used for allocations, and is called exactly once for a task. 
455 + * 456 + * Return 0 for success, -errno for failure. An error return while 457 + * loading will abort loading of the BPF scheduler. During a fork, it 458 + * will abort that specific fork. 459 + */ 460 + s32 (*init_task)(struct task_struct *p, struct scx_init_task_args *args); 461 + 462 + /** 463 + * exit_task - Exit a previously-running task from the system 464 + * @p: task to exit 465 + * 466 + * @p is exiting or the BPF scheduler is being unloaded. Perform any 467 + * necessary cleanup for @p. 468 + */ 469 + void (*exit_task)(struct task_struct *p, struct scx_exit_task_args *args); 470 + 471 + /** 472 + * enable - Enable BPF scheduling for a task 473 + * @p: task to enable BPF scheduling for 474 + * 475 + * Enable @p for BPF scheduling. enable() is called on @p any time it 476 + * enters SCX, and is always paired with a matching disable(). 477 + */ 478 + void (*enable)(struct task_struct *p); 479 + 480 + /** 481 + * disable - Disable BPF scheduling for a task 482 + * @p: task to disable BPF scheduling for 483 + * 484 + * @p is exiting, leaving SCX or the BPF scheduler is being unloaded. 485 + * Disable BPF scheduling for @p. A disable() call is always matched 486 + * with a prior enable() call. 487 + */ 488 + void (*disable)(struct task_struct *p); 489 + 490 + /** 491 + * dump - Dump BPF scheduler state on error 492 + * @ctx: debug dump context 493 + * 494 + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump. 495 + */ 496 + void (*dump)(struct scx_dump_ctx *ctx); 497 + 498 + /** 499 + * dump_cpu - Dump BPF scheduler state for a CPU on error 500 + * @ctx: debug dump context 501 + * @cpu: CPU to generate debug dump for 502 + * @idle: @cpu is currently idle without any runnable tasks 503 + * 504 + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump for 505 + * @cpu. If @idle is %true and this operation doesn't produce any 506 + * output, @cpu is skipped for dump. 
507 + */ 508 + void (*dump_cpu)(struct scx_dump_ctx *ctx, s32 cpu, bool idle); 509 + 510 + /** 511 + * dump_task - Dump BPF scheduler state for a runnable task on error 512 + * @ctx: debug dump context 513 + * @p: runnable task to generate debug dump for 514 + * 515 + * Use scx_bpf_dump() to generate BPF scheduler specific debug dump for 516 + * @p. 517 + */ 518 + void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p); 519 + 520 + #ifdef CONFIG_EXT_GROUP_SCHED 521 + /** 522 + * cgroup_init - Initialize a cgroup 523 + * @cgrp: cgroup being initialized 524 + * @args: init arguments, see the struct definition 525 + * 526 + * Either the BPF scheduler is being loaded or @cgrp created, initialize 527 + * @cgrp for sched_ext. This operation may block. 528 + * 529 + * Return 0 for success, -errno for failure. An error return while 530 + * loading will abort loading of the BPF scheduler. During cgroup 531 + * creation, it will abort the specific cgroup creation. 532 + */ 533 + s32 (*cgroup_init)(struct cgroup *cgrp, 534 + struct scx_cgroup_init_args *args); 535 + 536 + /** 537 + * cgroup_exit - Exit a cgroup 538 + * @cgrp: cgroup being exited 539 + * 540 + * Either the BPF scheduler is being unloaded or @cgrp destroyed, exit 541 + * @cgrp for sched_ext. This operation may block. 542 + */ 543 + void (*cgroup_exit)(struct cgroup *cgrp); 544 + 545 + /** 546 + * cgroup_prep_move - Prepare a task to be moved to a different cgroup 547 + * @p: task being moved 548 + * @from: cgroup @p is being moved from 549 + * @to: cgroup @p is being moved to 550 + * 551 + * Prepare @p for move from cgroup @from to @to. This operation may 552 + * block and can be used for allocations. 553 + * 554 + * Return 0 for success, -errno for failure. An error return aborts the 555 + * migration. 
556 + */ 557 + s32 (*cgroup_prep_move)(struct task_struct *p, 558 + struct cgroup *from, struct cgroup *to); 559 + 560 + /** 561 + * cgroup_move - Commit cgroup move 562 + * @p: task being moved 563 + * @from: cgroup @p is being moved from 564 + * @to: cgroup @p is being moved to 565 + * 566 + * Commit the move. @p is dequeued during this operation. 567 + */ 568 + void (*cgroup_move)(struct task_struct *p, 569 + struct cgroup *from, struct cgroup *to); 570 + 571 + /** 572 + * cgroup_cancel_move - Cancel cgroup move 573 + * @p: task whose cgroup move is being canceled 574 + * @from: cgroup @p was being moved from 575 + * @to: cgroup @p was being moved to 576 + * 577 + * @p was cgroup_prep_move()'d but failed before reaching cgroup_move(). 578 + * Undo the preparation. 579 + */ 580 + void (*cgroup_cancel_move)(struct task_struct *p, 581 + struct cgroup *from, struct cgroup *to); 582 + 583 + /** 584 + * cgroup_set_weight - A cgroup's weight is being changed 585 + * @cgrp: cgroup whose weight is being updated 586 + * @weight: new weight [1..10000] 587 + * 588 + * Update @cgrp's weight to @weight. 589 + */ 590 + void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight); 591 + #endif /* CONFIG_EXT_GROUP_SCHED */ 592 + 593 + /* 594 + * All online ops must come before ops.cpu_online(). 595 + */ 596 + 597 + /** 598 + * cpu_online - A CPU became online 599 + * @cpu: CPU which just came up 600 + * 601 + * @cpu just came online. @cpu will not call ops.enqueue() or 602 + * ops.dispatch(), nor run tasks associated with other CPUs beforehand. 603 + */ 604 + void (*cpu_online)(s32 cpu); 605 + 606 + /** 607 + * cpu_offline - A CPU is going offline 608 + * @cpu: CPU which is going offline 609 + * 610 + * @cpu is going offline. @cpu will not call ops.enqueue() or 611 + * ops.dispatch(), nor run tasks associated with other CPUs afterwards. 612 + */ 613 + void (*cpu_offline)(s32 cpu); 614 + 615 + /* 616 + * All CPU hotplug ops must come before ops.init(). 
617 + */ 618 + 619 + /** 620 + * init - Initialize the BPF scheduler 621 + */ 622 + s32 (*init)(void); 623 + 624 + /** 625 + * exit - Clean up after the BPF scheduler 626 + * @info: Exit info 627 + */ 628 + void (*exit)(struct scx_exit_info *info); 629 + 630 + /** 631 + * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch 632 + */ 633 + u32 dispatch_max_batch; 634 + 635 + /** 636 + * flags - %SCX_OPS_* flags 637 + */ 638 + u64 flags; 639 + 640 + /** 641 + * timeout_ms - The maximum amount of time, in milliseconds, that a 642 + * runnable task should be able to wait before being scheduled. The 643 + * maximum timeout may not exceed the default timeout of 30 seconds. 644 + * 645 + * Defaults to the maximum allowed timeout value of 30 seconds. 646 + */ 647 + u32 timeout_ms; 648 + 649 + /** 650 + * exit_dump_len - scx_exit_info.dump buffer length. If 0, the default 651 + * value of 32768 is used. 652 + */ 653 + u32 exit_dump_len; 654 + 655 + /** 656 + * hotplug_seq - A sequence number that may be set by the scheduler to 657 + * detect when a hotplug event has occurred during the loading process. 658 + * If 0, no detection occurs. Otherwise, the scheduler will fail to 659 + * load if the sequence number does not match @scx_hotplug_seq on the 660 + * enable path. 661 + */ 662 + u64 hotplug_seq; 663 + 664 + /** 665 + * name - BPF scheduler's name 666 + * 667 + * Must be a non-zero valid BPF object name including only isalnum(), 668 + * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the 669 + * BPF scheduler is enabled. 
670 + */ 671 + char name[SCX_OPS_NAME_LEN]; 672 + }; 673 + 674 + enum scx_opi { 675 + SCX_OPI_BEGIN = 0, 676 + SCX_OPI_NORMAL_BEGIN = 0, 677 + SCX_OPI_NORMAL_END = SCX_OP_IDX(cpu_online), 678 + SCX_OPI_CPU_HOTPLUG_BEGIN = SCX_OP_IDX(cpu_online), 679 + SCX_OPI_CPU_HOTPLUG_END = SCX_OP_IDX(init), 680 + SCX_OPI_END = SCX_OP_IDX(init), 681 + }; 682 + 683 + enum scx_wake_flags { 684 + /* expose select WF_* flags as enums */ 685 + SCX_WAKE_FORK = WF_FORK, 686 + SCX_WAKE_TTWU = WF_TTWU, 687 + SCX_WAKE_SYNC = WF_SYNC, 688 + }; 689 + 690 + enum scx_enq_flags { 691 + /* expose select ENQUEUE_* flags as enums */ 692 + SCX_ENQ_WAKEUP = ENQUEUE_WAKEUP, 693 + SCX_ENQ_HEAD = ENQUEUE_HEAD, 694 + 695 + /* high 32bits are SCX specific */ 696 + 697 + /* 698 + * Set the following to trigger preemption when calling 699 + * scx_bpf_dispatch() with a local dsq as the target. The slice of the 700 + * current task is cleared to zero and the CPU is kicked into the 701 + * scheduling path. Implies %SCX_ENQ_HEAD. 702 + */ 703 + SCX_ENQ_PREEMPT = 1LLU << 32, 704 + 705 + /* 706 + * The task being enqueued was previously enqueued on the current CPU's 707 + * %SCX_DSQ_LOCAL, but was removed from it in a call to the 708 + * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was 709 + * invoked in a ->cpu_release() callback, and the task is again 710 + * dispatched back to %SCX_DSQ_LOCAL by this current ->enqueue(), the 711 + * task will not be scheduled on the CPU until at least the next invocation 712 + * of the ->cpu_acquire() callback. 713 + */ 714 + SCX_ENQ_REENQ = 1LLU << 40, 715 + 716 + /* 717 + * The task being enqueued is the only task available for the cpu. By 718 + * default, ext core keeps executing such tasks but when 719 + * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the 720 + * %SCX_ENQ_LAST flag set. 721 + * 722 + * The BPF scheduler is responsible for triggering a follow-up 723 + * scheduling event. Otherwise, execution may stall. 
724 + */ 725 + SCX_ENQ_LAST = 1LLU << 41, 726 + 727 + /* high 8 bits are internal */ 728 + __SCX_ENQ_INTERNAL_MASK = 0xffLLU << 56, 729 + 730 + SCX_ENQ_CLEAR_OPSS = 1LLU << 56, 731 + SCX_ENQ_DSQ_PRIQ = 1LLU << 57, 732 + }; 733 + 734 + enum scx_deq_flags { 735 + /* expose select DEQUEUE_* flags as enums */ 736 + SCX_DEQ_SLEEP = DEQUEUE_SLEEP, 737 + 738 + /* high 32bits are SCX specific */ 739 + 740 + /* 741 + * The generic core-sched layer decided to execute the task even though 742 + * it hasn't been dispatched yet. Dequeue from the BPF side. 743 + */ 744 + SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, 745 + }; 746 + 747 + enum scx_pick_idle_cpu_flags { 748 + SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */ 749 + }; 750 + 751 + enum scx_kick_flags { 752 + /* 753 + * Kick the target CPU if idle. Guarantees that the target CPU goes 754 + * through at least one full scheduling cycle before going idle. If the 755 + * target CPU can be determined to be currently not idle and going to go 756 + * through a scheduling cycle before going idle, noop. 757 + */ 758 + SCX_KICK_IDLE = 1LLU << 0, 759 + 760 + /* 761 + * Preempt the current task and execute the dispatch path. If the 762 + * current task of the target CPU is an SCX task, its ->scx.slice is 763 + * cleared to zero before the scheduling path is invoked so that the 764 + * task expires and the dispatch path is invoked. 765 + */ 766 + SCX_KICK_PREEMPT = 1LLU << 1, 767 + 768 + /* 769 + * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will 770 + * return after the target CPU finishes picking the next task. 
771 + */ 772 + SCX_KICK_WAIT = 1LLU << 2, 773 + }; 774 + 775 + enum scx_tg_flags { 776 + SCX_TG_ONLINE = 1U << 0, 777 + SCX_TG_INITED = 1U << 1, 778 + }; 779 + 780 + enum scx_ops_enable_state { 781 + SCX_OPS_PREPPING, 782 + SCX_OPS_ENABLING, 783 + SCX_OPS_ENABLED, 784 + SCX_OPS_DISABLING, 785 + SCX_OPS_DISABLED, 786 + }; 787 + 788 + static const char *scx_ops_enable_state_str[] = { 789 + [SCX_OPS_PREPPING] = "prepping", 790 + [SCX_OPS_ENABLING] = "enabling", 791 + [SCX_OPS_ENABLED] = "enabled", 792 + [SCX_OPS_DISABLING] = "disabling", 793 + [SCX_OPS_DISABLED] = "disabled", 794 + }; 795 + 796 + /* 797 + * sched_ext_entity->ops_state 798 + * 799 + * Used to track the task ownership between the SCX core and the BPF scheduler. 800 + * State transitions look as follows: 801 + * 802 + * NONE -> QUEUEING -> QUEUED -> DISPATCHING 803 + * ^ | | 804 + * | v v 805 + * \-------------------------------/ 806 + * 807 + * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call 808 + * sites for explanations on the conditions being waited upon and why they are 809 + * safe. Transitions out of them into NONE or QUEUED must store_release and the 810 + * waiters should load_acquire. 811 + * 812 + * Tracking scx_ops_state enables sched_ext core to reliably determine whether 813 + * any given task can be dispatched by the BPF scheduler at all times and thus 814 + * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler 815 + * to try to dispatch any task anytime regardless of its state as the SCX core 816 + * can safely reject invalid dispatches. 
817 + */ 818 + enum scx_ops_state { 819 + SCX_OPSS_NONE, /* owned by the SCX core */ 820 + SCX_OPSS_QUEUEING, /* in transit to the BPF scheduler */ 821 + SCX_OPSS_QUEUED, /* owned by the BPF scheduler */ 822 + SCX_OPSS_DISPATCHING, /* in transit back to the SCX core */ 823 + 824 + /* 825 + * QSEQ brands each QUEUED instance so that, when dispatch races 826 + * dequeue/requeue, the dispatcher can tell whether it still has a claim 827 + * on the task being dispatched. 828 + * 829 + * As some 32bit archs can't do 64bit store_release/load_acquire, 830 + * p->scx.ops_state is atomic_long_t which leaves 30 bits for QSEQ on 831 + * 32bit machines. The dispatch race window QSEQ protects is very narrow 832 + * and runs with IRQ disabled. 30 bits should be sufficient. 833 + */ 834 + SCX_OPSS_QSEQ_SHIFT = 2, 835 + }; 836 + 837 + /* Use macros to ensure that the type is unsigned long for the masks */ 838 + #define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1) 839 + #define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK) 840 + 841 + /* 842 + * During exit, a task may schedule after losing its PIDs. When disabling the 843 + * BPF scheduler, we need to be able to iterate tasks in every state to 844 + * guarantee system safety. Maintain a dedicated task list which contains every 845 + * task between its fork and eventual free. 
846 + */ 847 + static DEFINE_SPINLOCK(scx_tasks_lock); 848 + static LIST_HEAD(scx_tasks); 849 + 850 + /* ops enable/disable */ 851 + static struct kthread_worker *scx_ops_helper; 852 + static DEFINE_MUTEX(scx_ops_enable_mutex); 853 + DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled); 854 + DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem); 855 + static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED); 856 + static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0); 857 + static bool scx_switching_all; 858 + DEFINE_STATIC_KEY_FALSE(__scx_switched_all); 859 + 860 + static struct sched_ext_ops scx_ops; 861 + static bool scx_warned_zero_slice; 862 + 863 + static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last); 864 + static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting); 865 + static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt); 866 + static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled); 867 + 868 + static struct static_key_false scx_has_op[SCX_OPI_END] = 869 + { [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT }; 870 + 871 + static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE); 872 + static struct scx_exit_info *scx_exit_info; 873 + 874 + static atomic_long_t scx_nr_rejected = ATOMIC_LONG_INIT(0); 875 + static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0); 876 + 877 + /* 878 + * The maximum amount of time in jiffies that a task may be runnable without 879 + * being scheduled on a CPU. If this timeout is exceeded, it will trigger 880 + * scx_ops_error(). 881 + */ 882 + static unsigned long scx_watchdog_timeout; 883 + 884 + /* 885 + * The last time the delayed work was run. This delayed work relies on 886 + * ksoftirqd being able to run to service timer interrupts, so it's possible 887 + * that this work itself could get wedged. To account for this, we check that 888 + * it's not stalled in the timer tick, and trigger an error if it is. 
889 + */ 890 + static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES; 891 + 892 + static struct delayed_work scx_watchdog_work; 893 + 894 + /* idle tracking */ 895 + #ifdef CONFIG_SMP 896 + #ifdef CONFIG_CPUMASK_OFFSTACK 897 + #define CL_ALIGNED_IF_ONSTACK 898 + #else 899 + #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp 900 + #endif 901 + 902 + static struct { 903 + cpumask_var_t cpu; 904 + cpumask_var_t smt; 905 + } idle_masks CL_ALIGNED_IF_ONSTACK; 906 + 907 + #endif /* CONFIG_SMP */ 908 + 909 + /* for %SCX_KICK_WAIT */ 910 + static unsigned long __percpu *scx_kick_cpus_pnt_seqs; 911 + 912 + /* 913 + * Direct dispatch marker. 914 + * 915 + * Non-NULL values are used for direct dispatch from enqueue path. A valid 916 + * pointer points to the task currently being enqueued. An ERR_PTR value is used 917 + * to indicate that direct dispatch has already happened. 918 + */ 919 + static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task); 920 + 921 + /* dispatch queues */ 922 + static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global; 923 + 924 + static const struct rhashtable_params dsq_hash_params = { 925 + .key_len = 8, 926 + .key_offset = offsetof(struct scx_dispatch_q, id), 927 + .head_offset = offsetof(struct scx_dispatch_q, hash_node), 928 + }; 929 + 930 + static struct rhashtable dsq_hash; 931 + static LLIST_HEAD(dsqs_to_free); 932 + 933 + /* dispatch buf */ 934 + struct scx_dsp_buf_ent { 935 + struct task_struct *task; 936 + unsigned long qseq; 937 + u64 dsq_id; 938 + u64 enq_flags; 939 + }; 940 + 941 + static u32 scx_dsp_max_batch; 942 + 943 + struct scx_dsp_ctx { 944 + struct rq *rq; 945 + u32 cursor; 946 + u32 nr_tasks; 947 + struct scx_dsp_buf_ent buf[]; 948 + }; 949 + 950 + static struct scx_dsp_ctx __percpu *scx_dsp_ctx; 951 + 952 + /* string formatting from BPF */ 953 + struct scx_bstr_buf { 954 + u64 data[MAX_BPRINTF_VARARGS]; 955 + char line[SCX_EXIT_MSG_LEN]; 956 + }; 957 + 958 + static 
DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock); 959 + static struct scx_bstr_buf scx_exit_bstr_buf; 960 + 961 + /* ops debug dump */ 962 + struct scx_dump_data { 963 + s32 cpu; 964 + bool first; 965 + s32 cursor; 966 + struct seq_buf *s; 967 + const char *prefix; 968 + struct scx_bstr_buf buf; 969 + }; 970 + 971 + static struct scx_dump_data scx_dump_data = { 972 + .cpu = -1, 973 + }; 974 + 975 + /* /sys/kernel/sched_ext interface */ 976 + static struct kset *scx_kset; 977 + static struct kobject *scx_root_kobj; 978 + 979 + #define CREATE_TRACE_POINTS 980 + #include <trace/events/sched_ext.h> 981 + 982 + static void process_ddsp_deferred_locals(struct rq *rq); 983 + static void scx_bpf_kick_cpu(s32 cpu, u64 flags); 984 + static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind, 985 + s64 exit_code, 986 + const char *fmt, ...); 987 + 988 + #define scx_ops_error_kind(err, fmt, args...) \ 989 + scx_ops_exit_kind((err), 0, fmt, ##args) 990 + 991 + #define scx_ops_exit(code, fmt, args...) \ 992 + scx_ops_exit_kind(SCX_EXIT_UNREG_KERN, (code), fmt, ##args) 993 + 994 + #define scx_ops_error(fmt, args...) 
\ 995 + scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args) 996 + 997 + #define SCX_HAS_OP(op) static_branch_likely(&scx_has_op[SCX_OP_IDX(op)]) 998 + 999 + static long jiffies_delta_msecs(unsigned long at, unsigned long now) 1000 + { 1001 + if (time_after(at, now)) 1002 + return jiffies_to_msecs(at - now); 1003 + else 1004 + return -(long)jiffies_to_msecs(now - at); 1005 + } 1006 + 1007 + /* if the highest set bit is N, return a mask with bits [N+1, 31] set */ 1008 + static u32 higher_bits(u32 flags) 1009 + { 1010 + return ~((1 << fls(flags)) - 1); 1011 + } 1012 + 1013 + /* return the mask with only the highest bit set */ 1014 + static u32 highest_bit(u32 flags) 1015 + { 1016 + int bit = fls(flags); 1017 + return ((u64)1 << bit) >> 1; 1018 + } 1019 + 1020 + static bool u32_before(u32 a, u32 b) 1021 + { 1022 + return (s32)(a - b) < 0; 1023 + } 1024 + 1025 + /* 1026 + * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX 1027 + * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate 1028 + * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check 1029 + * whether it's running from an allowed context. 1030 + * 1031 + * @mask is constant, always inline to cull the mask calculations. 1032 + */ 1033 + static __always_inline void scx_kf_allow(u32 mask) 1034 + { 1035 + /* nesting is allowed only in increasing scx_kf_mask order */ 1036 + WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, 1037 + "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", 1038 + current->scx.kf_mask, mask); 1039 + current->scx.kf_mask |= mask; 1040 + barrier(); 1041 + } 1042 + 1043 + static void scx_kf_disallow(u32 mask) 1044 + { 1045 + barrier(); 1046 + current->scx.kf_mask &= ~mask; 1047 + } 1048 + 1049 + #define SCX_CALL_OP(mask, op, args...) 
\ 1050 + do { \ 1051 + if (mask) { \ 1052 + scx_kf_allow(mask); \ 1053 + scx_ops.op(args); \ 1054 + scx_kf_disallow(mask); \ 1055 + } else { \ 1056 + scx_ops.op(args); \ 1057 + } \ 1058 + } while (0) 1059 + 1060 + #define SCX_CALL_OP_RET(mask, op, args...) \ 1061 + ({ \ 1062 + __typeof__(scx_ops.op(args)) __ret; \ 1063 + if (mask) { \ 1064 + scx_kf_allow(mask); \ 1065 + __ret = scx_ops.op(args); \ 1066 + scx_kf_disallow(mask); \ 1067 + } else { \ 1068 + __ret = scx_ops.op(args); \ 1069 + } \ 1070 + __ret; \ 1071 + }) 1072 + 1073 + /* 1074 + * Some kfuncs are allowed only on the tasks that are subjects of the 1075 + * in-progress scx_ops operation for, e.g., locking guarantees. To enforce such 1076 + * restrictions, the following SCX_CALL_OP_*() variants should be used when 1077 + * invoking scx_ops operations that take task arguments. These can only be used 1078 + * for non-nesting operations due to the way the tasks are tracked. 1079 + * 1080 + * kfuncs which can only operate on such tasks can in turn use 1081 + * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on 1082 + * the specific task. 1083 + */ 1084 + #define SCX_CALL_OP_TASK(mask, op, task, args...) \ 1085 + do { \ 1086 + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 1087 + current->scx.kf_tasks[0] = task; \ 1088 + SCX_CALL_OP(mask, op, task, ##args); \ 1089 + current->scx.kf_tasks[0] = NULL; \ 1090 + } while (0) 1091 + 1092 + #define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \ 1093 + ({ \ 1094 + __typeof__(scx_ops.op(task, ##args)) __ret; \ 1095 + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 1096 + current->scx.kf_tasks[0] = task; \ 1097 + __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \ 1098 + current->scx.kf_tasks[0] = NULL; \ 1099 + __ret; \ 1100 + }) 1101 + 1102 + #define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) 
\ 1103 + ({ \ 1104 + __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \ 1105 + BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ 1106 + current->scx.kf_tasks[0] = task0; \ 1107 + current->scx.kf_tasks[1] = task1; \ 1108 + __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \ 1109 + current->scx.kf_tasks[0] = NULL; \ 1110 + current->scx.kf_tasks[1] = NULL; \ 1111 + __ret; \ 1112 + }) 1113 + 1114 + /* @mask is constant, always inline to cull unnecessary branches */ 1115 + static __always_inline bool scx_kf_allowed(u32 mask) 1116 + { 1117 + if (unlikely(!(current->scx.kf_mask & mask))) { 1118 + scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x", 1119 + mask, current->scx.kf_mask); 1120 + return false; 1121 + } 1122 + 1123 + /* 1124 + * Enforce nesting boundaries. e.g. A kfunc which can be called from 1125 + * DISPATCH must not be called if we're running DEQUEUE which is nested 1126 + * inside ops.dispatch(). We don't need to check boundaries for any 1127 + * blocking kfuncs as the verifier ensures they're only called from 1128 + * sleepable progs. 
1129 + */ 1130 + if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && 1131 + (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { 1132 + scx_ops_error("cpu_release kfunc called from a nested operation"); 1133 + return false; 1134 + } 1135 + 1136 + if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && 1137 + (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { 1138 + scx_ops_error("dispatch kfunc called from a nested operation"); 1139 + return false; 1140 + } 1141 + 1142 + return true; 1143 + } 1144 + 1145 + /* see SCX_CALL_OP_TASK() */ 1146 + static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask, 1147 + struct task_struct *p) 1148 + { 1149 + if (!scx_kf_allowed(mask)) 1150 + return false; 1151 + 1152 + if (unlikely((p != current->scx.kf_tasks[0] && 1153 + p != current->scx.kf_tasks[1]))) { 1154 + scx_ops_error("called on a task not being operated on"); 1155 + return false; 1156 + } 1157 + 1158 + return true; 1159 + } 1160 + 1161 + static bool scx_kf_allowed_if_unlocked(void) 1162 + { 1163 + return !current->scx.kf_mask; 1164 + } 1165 + 1166 + /** 1167 + * nldsq_next_task - Iterate to the next task in a non-local DSQ 1168 + * @dsq: user dsq being iterated 1169 + * @cur: current position, %NULL to start iteration 1170 + * @rev: walk backwards 1171 + * 1172 + * Returns %NULL when iteration is finished. 
1173 + */ 1174 + static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq, 1175 + struct task_struct *cur, bool rev) 1176 + { 1177 + struct list_head *list_node; 1178 + struct scx_dsq_list_node *dsq_lnode; 1179 + 1180 + lockdep_assert_held(&dsq->lock); 1181 + 1182 + if (cur) 1183 + list_node = &cur->scx.dsq_list.node; 1184 + else 1185 + list_node = &dsq->list; 1186 + 1187 + /* find the next task, need to skip BPF iteration cursors */ 1188 + do { 1189 + if (rev) 1190 + list_node = list_node->prev; 1191 + else 1192 + list_node = list_node->next; 1193 + 1194 + if (list_node == &dsq->list) 1195 + return NULL; 1196 + 1197 + dsq_lnode = container_of(list_node, struct scx_dsq_list_node, 1198 + node); 1199 + } while (dsq_lnode->flags & SCX_DSQ_LNODE_ITER_CURSOR); 1200 + 1201 + return container_of(dsq_lnode, struct task_struct, scx.dsq_list); 1202 + } 1203 + 1204 + #define nldsq_for_each_task(p, dsq) \ 1205 + for ((p) = nldsq_next_task((dsq), NULL, false); (p); \ 1206 + (p) = nldsq_next_task((dsq), (p), false)) 1207 + 1208 + 1209 + /* 1210 + * BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse] 1211 + * dispatch order. BPF-visible iterator is opaque and larger to allow future 1212 + * changes without breaking backward compatibility. Can be used with 1213 + * bpf_for_each(). See bpf_iter_scx_dsq_*(). 
1214 + */ 1215 + enum scx_dsq_iter_flags { 1216 + /* iterate in the reverse dispatch order */ 1217 + SCX_DSQ_ITER_REV = 1U << 16, 1218 + 1219 + __SCX_DSQ_ITER_HAS_SLICE = 1U << 30, 1220 + __SCX_DSQ_ITER_HAS_VTIME = 1U << 31, 1221 + 1222 + __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV, 1223 + __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS | 1224 + __SCX_DSQ_ITER_HAS_SLICE | 1225 + __SCX_DSQ_ITER_HAS_VTIME, 1226 + }; 1227 + 1228 + struct bpf_iter_scx_dsq_kern { 1229 + struct scx_dsq_list_node cursor; 1230 + struct scx_dispatch_q *dsq; 1231 + u64 slice; 1232 + u64 vtime; 1233 + } __attribute__((aligned(8))); 1234 + 1235 + struct bpf_iter_scx_dsq { 1236 + u64 __opaque[6]; 1237 + } __attribute__((aligned(8))); 1238 + 1239 + 1240 + /* 1241 + * SCX task iterator. 1242 + */ 1243 + struct scx_task_iter { 1244 + struct sched_ext_entity cursor; 1245 + struct task_struct *locked; 1246 + struct rq *rq; 1247 + struct rq_flags rf; 1248 + }; 1249 + 1250 + /** 1251 + * scx_task_iter_init - Initialize a task iterator 1252 + * @iter: iterator to init 1253 + * 1254 + * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized, 1255 + * @iter must eventually be exited with scx_task_iter_exit(). 1256 + * 1257 + * scx_tasks_lock may be released between this and the first next() call or 1258 + * between any two next() calls. If scx_tasks_lock is released between two 1259 + * next() calls, the caller is responsible for ensuring that the task being 1260 + * iterated remains accessible either through RCU read lock or obtaining a 1261 + * reference count. 1262 + * 1263 + * All tasks which existed when the iteration started are guaranteed to be 1264 + * visited as long as they still exist. 
1265 + */ 1266 + static void scx_task_iter_init(struct scx_task_iter *iter) 1267 + { 1268 + lockdep_assert_held(&scx_tasks_lock); 1269 + 1270 + BUILD_BUG_ON(__SCX_DSQ_ITER_ALL_FLAGS & 1271 + ((1U << __SCX_DSQ_LNODE_PRIV_SHIFT) - 1)); 1272 + 1273 + iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR }; 1274 + list_add(&iter->cursor.tasks_node, &scx_tasks); 1275 + iter->locked = NULL; 1276 + } 1277 + 1278 + /** 1279 + * scx_task_iter_rq_unlock - Unlock rq locked by a task iterator 1280 + * @iter: iterator to unlock rq for 1281 + * 1282 + * If @iter is in the middle of a locked iteration, it may be locking the rq of 1283 + * the task currently being visited. Unlock the rq if so. This function can be 1284 + * safely called anytime during an iteration. 1285 + * 1286 + * Returns %true if the rq @iter was locking is unlocked. %false if @iter was 1287 + * not locking an rq. 1288 + */ 1289 + static bool scx_task_iter_rq_unlock(struct scx_task_iter *iter) 1290 + { 1291 + if (iter->locked) { 1292 + task_rq_unlock(iter->rq, iter->locked, &iter->rf); 1293 + iter->locked = NULL; 1294 + return true; 1295 + } else { 1296 + return false; 1297 + } 1298 + } 1299 + 1300 + /** 1301 + * scx_task_iter_exit - Exit a task iterator 1302 + * @iter: iterator to exit 1303 + * 1304 + * Exit a previously initialized @iter. Must be called with scx_tasks_lock held. 1305 + * If the iterator holds a task's rq lock, that rq lock is released. See 1306 + * scx_task_iter_init() for details. 1307 + */ 1308 + static void scx_task_iter_exit(struct scx_task_iter *iter) 1309 + { 1310 + lockdep_assert_held(&scx_tasks_lock); 1311 + 1312 + scx_task_iter_rq_unlock(iter); 1313 + list_del_init(&iter->cursor.tasks_node); 1314 + } 1315 + 1316 + /** 1317 + * scx_task_iter_next - Next task 1318 + * @iter: iterator to walk 1319 + * 1320 + * Visit the next task. See scx_task_iter_init() for details. 
1321 + */ 1322 + static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter) 1323 + { 1324 + struct list_head *cursor = &iter->cursor.tasks_node; 1325 + struct sched_ext_entity *pos; 1326 + 1327 + lockdep_assert_held(&scx_tasks_lock); 1328 + 1329 + list_for_each_entry(pos, cursor, tasks_node) { 1330 + if (&pos->tasks_node == &scx_tasks) 1331 + return NULL; 1332 + if (!(pos->flags & SCX_TASK_CURSOR)) { 1333 + list_move(cursor, &pos->tasks_node); 1334 + return container_of(pos, struct task_struct, scx); 1335 + } 1336 + } 1337 + 1338 + /* can't happen, should always terminate at scx_tasks above */ 1339 + BUG(); 1340 + } 1341 + 1342 + /** 1343 + * scx_task_iter_next_locked - Next non-idle task with its rq locked 1344 + * @iter: iterator to walk 1345 + * @include_dead: Whether we should include dead tasks in the iteration 1346 + * 1347 + * Visit the non-idle task with its rq lock held. Allows callers to specify 1348 + * whether they would like to filter out dead tasks. See scx_task_iter_init() 1349 + * for details. 1350 + */ 1351 + static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter) 1352 + { 1353 + struct task_struct *p; 1354 + 1355 + scx_task_iter_rq_unlock(iter); 1356 + 1357 + while ((p = scx_task_iter_next(iter))) { 1358 + /* 1359 + * scx_task_iter is used to prepare and move tasks into SCX 1360 + * while loading the BPF scheduler and vice-versa while 1361 + * unloading. The init_tasks ("swappers") should be excluded 1362 + * from the iteration because: 1363 + * 1364 + * - It's unsafe to use __setscheduler_prio() on an init_task to 1365 + * determine the sched_class to use as it won't preserve its 1366 + * idle_sched_class. 1367 + * 1368 + * - ops.init/exit_task() can easily be confused if called with 1369 + * init_tasks as they, e.g., share PID 0. 1370 + * 1371 + * As init_tasks are never scheduled through SCX, they can be 1372 + * skipped safely. 
Note that is_idle_task() which tests %PF_IDLE 1373 + * doesn't work here: 1374 + * 1375 + * - %PF_IDLE may not be set for an init_task whose CPU hasn't 1376 + * yet been onlined. 1377 + * 1378 + * - %PF_IDLE can be set on tasks that are not init_tasks. See 1379 + * play_idle_precise() used by CONFIG_IDLE_INJECT. 1380 + * 1381 + * Test for idle_sched_class as only init_tasks are on it. 1382 + */ 1383 + if (p->sched_class != &idle_sched_class) 1384 + break; 1385 + } 1386 + if (!p) 1387 + return NULL; 1388 + 1389 + iter->rq = task_rq_lock(p, &iter->rf); 1390 + iter->locked = p; 1391 + 1392 + return p; 1393 + } 1394 + 1395 + static enum scx_ops_enable_state scx_ops_enable_state(void) 1396 + { 1397 + return atomic_read(&scx_ops_enable_state_var); 1398 + } 1399 + 1400 + static enum scx_ops_enable_state 1401 + scx_ops_set_enable_state(enum scx_ops_enable_state to) 1402 + { 1403 + return atomic_xchg(&scx_ops_enable_state_var, to); 1404 + } 1405 + 1406 + static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to, 1407 + enum scx_ops_enable_state from) 1408 + { 1409 + int from_v = from; 1410 + 1411 + return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to); 1412 + } 1413 + 1414 + static bool scx_rq_bypassing(struct rq *rq) 1415 + { 1416 + return unlikely(rq->scx.flags & SCX_RQ_BYPASSING); 1417 + } 1418 + 1419 + /** 1420 + * wait_ops_state - Busy-wait the specified ops state to end 1421 + * @p: target task 1422 + * @opss: state to wait the end of 1423 + * 1424 + * Busy-wait for @p to transition out of @opss. This can only be used when the 1425 + * state part of @opss is %SCX_QUEUEING or %SCX_DISPATCHING. This function also 1426 + * has load_acquire semantics to ensure that the caller can see the updates made 1427 + * in the enqueueing and dispatching paths. 
1428 + */ 1429 + static void wait_ops_state(struct task_struct *p, unsigned long opss) 1430 + { 1431 + do { 1432 + cpu_relax(); 1433 + } while (atomic_long_read_acquire(&p->scx.ops_state) == opss); 1434 + } 1435 + 1436 + /** 1437 + * ops_cpu_valid - Verify a cpu number 1438 + * @cpu: cpu number which came from a BPF ops 1439 + * @where: extra information reported on error 1440 + * 1441 + * @cpu is a cpu number which came from the BPF scheduler and can be any value. 1442 + * Verify that it is in range and one of the possible cpus. If invalid, trigger 1443 + * an ops error. 1444 + */ 1445 + static bool ops_cpu_valid(s32 cpu, const char *where) 1446 + { 1447 + if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) { 1448 + return true; 1449 + } else { 1450 + scx_ops_error("invalid CPU %d%s%s", cpu, 1451 + where ? " " : "", where ?: ""); 1452 + return false; 1453 + } 1454 + } 1455 + 1456 + /** 1457 + * ops_sanitize_err - Sanitize a -errno value 1458 + * @ops_name: operation to blame on failure 1459 + * @err: -errno value to sanitize 1460 + * 1461 + * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return 1462 + * -%EPROTO. This is necessary because returning a rogue -errno up the chain can 1463 + * cause misbehaviors. For an example, a large negative return from 1464 + * ops.init_task() triggers an oops when passed up the call chain because the 1465 + * value fails IS_ERR() test after being encoded with ERR_PTR() and then is 1466 + * handled as a pointer. 
1467 + */ 1468 + static int ops_sanitize_err(const char *ops_name, s32 err) 1469 + { 1470 + if (err < 0 && err >= -MAX_ERRNO) 1471 + return err; 1472 + 1473 + scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err); 1474 + return -EPROTO; 1475 + } 1476 + 1477 + static void run_deferred(struct rq *rq) 1478 + { 1479 + process_ddsp_deferred_locals(rq); 1480 + } 1481 + 1482 + #ifdef CONFIG_SMP 1483 + static void deferred_bal_cb_workfn(struct rq *rq) 1484 + { 1485 + run_deferred(rq); 1486 + } 1487 + #endif 1488 + 1489 + static void deferred_irq_workfn(struct irq_work *irq_work) 1490 + { 1491 + struct rq *rq = container_of(irq_work, struct rq, scx.deferred_irq_work); 1492 + 1493 + raw_spin_rq_lock(rq); 1494 + run_deferred(rq); 1495 + raw_spin_rq_unlock(rq); 1496 + } 1497 + 1498 + /** 1499 + * schedule_deferred - Schedule execution of deferred actions on an rq 1500 + * @rq: target rq 1501 + * 1502 + * Schedule execution of deferred actions on @rq. Must be called with @rq 1503 + * locked. Deferred actions are executed with @rq locked but unpinned, and thus 1504 + * can unlock @rq to e.g. migrate tasks to other rqs. 1505 + */ 1506 + static void schedule_deferred(struct rq *rq) 1507 + { 1508 + lockdep_assert_rq_held(rq); 1509 + 1510 + #ifdef CONFIG_SMP 1511 + /* 1512 + * If in the middle of waking up a task, task_woken_scx() will be called 1513 + * afterwards which will then run the deferred actions, no need to 1514 + * schedule anything. 1515 + */ 1516 + if (rq->scx.flags & SCX_RQ_IN_WAKEUP) 1517 + return; 1518 + 1519 + /* 1520 + * If in balance, the balance callbacks will be called before rq lock is 1521 + * released. Schedule one. 1522 + */ 1523 + if (rq->scx.flags & SCX_RQ_IN_BALANCE) { 1524 + queue_balance_callback(rq, &rq->scx.deferred_bal_cb, 1525 + deferred_bal_cb_workfn); 1526 + return; 1527 + } 1528 + #endif 1529 + /* 1530 + * No scheduler hooks available. Queue an irq work. 
They are executed on 1531 + * IRQ re-enable which may take a bit longer than the scheduler hooks. 1532 + * The above WAKEUP and BALANCE paths should cover most of the cases and 1533 + * the time to IRQ re-enable shouldn't be long. 1534 + */ 1535 + irq_work_queue(&rq->scx.deferred_irq_work); 1536 + } 1537 + 1538 + /** 1539 + * touch_core_sched - Update timestamp used for core-sched task ordering 1540 + * @rq: rq to read clock from, must be locked 1541 + * @p: task to update the timestamp for 1542 + * 1543 + * Update @p->scx.core_sched_at timestamp. This is used by scx_prio_less() to 1544 + * implement global or local-DSQ FIFO ordering for core-sched. Should be called 1545 + * when a task becomes runnable and its turn on the CPU ends (e.g. slice 1546 + * exhaustion). 1547 + */ 1548 + static void touch_core_sched(struct rq *rq, struct task_struct *p) 1549 + { 1550 + lockdep_assert_rq_held(rq); 1551 + 1552 + #ifdef CONFIG_SCHED_CORE 1553 + /* 1554 + * It's okay to update the timestamp spuriously. Use 1555 + * sched_core_disabled() which is cheaper than enabled(). 1556 + * 1557 + * As this is used to determine ordering between tasks of sibling CPUs, 1558 + * it may be better to use per-core dispatch sequence instead. 1559 + */ 1560 + if (!sched_core_disabled()) 1561 + p->scx.core_sched_at = sched_clock_cpu(cpu_of(rq)); 1562 + #endif 1563 + } 1564 + 1565 + /** 1566 + * touch_core_sched_dispatch - Update core-sched timestamp on dispatch 1567 + * @rq: rq to read clock from, must be locked 1568 + * @p: task being dispatched 1569 + * 1570 + * If the BPF scheduler implements custom core-sched ordering via 1571 + * ops.core_sched_before(), @p->scx.core_sched_at is used to implement FIFO 1572 + * ordering within each local DSQ. This function is called from dispatch paths 1573 + * and updates @p->scx.core_sched_at if custom core-sched ordering is in effect. 
1574 + */ 1575 + static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p) 1576 + { 1577 + lockdep_assert_rq_held(rq); 1578 + 1579 + #ifdef CONFIG_SCHED_CORE 1580 + if (SCX_HAS_OP(core_sched_before)) 1581 + touch_core_sched(rq, p); 1582 + #endif 1583 + } 1584 + 1585 + static void update_curr_scx(struct rq *rq) 1586 + { 1587 + struct task_struct *curr = rq->curr; 1588 + s64 delta_exec; 1589 + 1590 + delta_exec = update_curr_common(rq); 1591 + if (unlikely(delta_exec <= 0)) 1592 + return; 1593 + 1594 + if (curr->scx.slice != SCX_SLICE_INF) { 1595 + curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec); 1596 + if (!curr->scx.slice) 1597 + touch_core_sched(rq, curr); 1598 + } 1599 + } 1600 + 1601 + static bool scx_dsq_priq_less(struct rb_node *node_a, 1602 + const struct rb_node *node_b) 1603 + { 1604 + const struct task_struct *a = 1605 + container_of(node_a, struct task_struct, scx.dsq_priq); 1606 + const struct task_struct *b = 1607 + container_of(node_b, struct task_struct, scx.dsq_priq); 1608 + 1609 + return time_before64(a->scx.dsq_vtime, b->scx.dsq_vtime); 1610 + } 1611 + 1612 + static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta) 1613 + { 1614 + /* scx_bpf_dsq_nr_queued() reads ->nr without locking, use WRITE_ONCE() */ 1615 + WRITE_ONCE(dsq->nr, dsq->nr + delta); 1616 + } 1617 + 1618 + static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p, 1619 + u64 enq_flags) 1620 + { 1621 + bool is_local = dsq->id == SCX_DSQ_LOCAL; 1622 + 1623 + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node)); 1624 + WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) || 1625 + !RB_EMPTY_NODE(&p->scx.dsq_priq)); 1626 + 1627 + if (!is_local) { 1628 + raw_spin_lock(&dsq->lock); 1629 + if (unlikely(dsq->id == SCX_DSQ_INVALID)) { 1630 + scx_ops_error("attempting to dispatch to a destroyed dsq"); 1631 + /* fall back to the global dsq */ 1632 + raw_spin_unlock(&dsq->lock); 1633 + dsq = &scx_dsq_global; 1634 + 
raw_spin_lock(&dsq->lock); 1635 + } 1636 + } 1637 + 1638 + if (unlikely((dsq->id & SCX_DSQ_FLAG_BUILTIN) && 1639 + (enq_flags & SCX_ENQ_DSQ_PRIQ))) { 1640 + /* 1641 + * SCX_DSQ_LOCAL and SCX_DSQ_GLOBAL DSQs always consume from 1642 + * their FIFO queues. To avoid confusion and accidentally 1643 + * starving vtime-dispatched tasks by FIFO-dispatched tasks, we 1644 + * disallow any internal DSQ from doing vtime ordering of 1645 + * tasks. 1646 + */ 1647 + scx_ops_error("cannot use vtime ordering for built-in DSQs"); 1648 + enq_flags &= ~SCX_ENQ_DSQ_PRIQ; 1649 + } 1650 + 1651 + if (enq_flags & SCX_ENQ_DSQ_PRIQ) { 1652 + struct rb_node *rbp; 1653 + 1654 + /* 1655 + * A PRIQ DSQ shouldn't be using FIFO enqueueing. As tasks are 1656 + * linked to both the rbtree and list on PRIQs, this can only be 1657 + * tested easily when adding the first task. 1658 + */ 1659 + if (unlikely(RB_EMPTY_ROOT(&dsq->priq) && 1660 + nldsq_next_task(dsq, NULL, false))) 1661 + scx_ops_error("DSQ ID 0x%016llx already had FIFO-enqueued tasks", 1662 + dsq->id); 1663 + 1664 + p->scx.dsq_flags |= SCX_TASK_DSQ_ON_PRIQ; 1665 + rb_add(&p->scx.dsq_priq, &dsq->priq, scx_dsq_priq_less); 1666 + 1667 + /* 1668 + * Find the previous task and insert after it on the list so 1669 + * that @dsq->list is vtime ordered. 
1670 + */ 1671 + rbp = rb_prev(&p->scx.dsq_priq); 1672 + if (rbp) { 1673 + struct task_struct *prev = 1674 + container_of(rbp, struct task_struct, 1675 + scx.dsq_priq); 1676 + list_add(&p->scx.dsq_list.node, &prev->scx.dsq_list.node); 1677 + } else { 1678 + list_add(&p->scx.dsq_list.node, &dsq->list); 1679 + } 1680 + } else { 1681 + /* a FIFO DSQ shouldn't be using PRIQ enqueuing */ 1682 + if (unlikely(!RB_EMPTY_ROOT(&dsq->priq))) 1683 + scx_ops_error("DSQ ID 0x%016llx already had PRIQ-enqueued tasks", 1684 + dsq->id); 1685 + 1686 + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) 1687 + list_add(&p->scx.dsq_list.node, &dsq->list); 1688 + else 1689 + list_add_tail(&p->scx.dsq_list.node, &dsq->list); 1690 + } 1691 + 1692 + /* seq records the order tasks are queued, used by BPF DSQ iterator */ 1693 + dsq->seq++; 1694 + p->scx.dsq_seq = dsq->seq; 1695 + 1696 + dsq_mod_nr(dsq, 1); 1697 + p->scx.dsq = dsq; 1698 + 1699 + /* 1700 + * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the 1701 + * direct dispatch path, but we clear them here because the direct 1702 + * dispatch verdict may be overridden on the enqueue path during e.g. 1703 + * bypass. 1704 + */ 1705 + p->scx.ddsp_dsq_id = SCX_DSQ_INVALID; 1706 + p->scx.ddsp_enq_flags = 0; 1707 + 1708 + /* 1709 + * We're transitioning out of QUEUEING or DISPATCHING. store_release to 1710 + * match waiters' load_acquire. 
1711 + */ 1712 + if (enq_flags & SCX_ENQ_CLEAR_OPSS) 1713 + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 1714 + 1715 + if (is_local) { 1716 + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq); 1717 + bool preempt = false; 1718 + 1719 + if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr && 1720 + rq->curr->sched_class == &ext_sched_class) { 1721 + rq->curr->scx.slice = 0; 1722 + preempt = true; 1723 + } 1724 + 1725 + if (preempt || sched_class_above(&ext_sched_class, 1726 + rq->curr->sched_class)) 1727 + resched_curr(rq); 1728 + } else { 1729 + raw_spin_unlock(&dsq->lock); 1730 + } 1731 + } 1732 + 1733 + static void task_unlink_from_dsq(struct task_struct *p, 1734 + struct scx_dispatch_q *dsq) 1735 + { 1736 + WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node)); 1737 + 1738 + if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) { 1739 + rb_erase(&p->scx.dsq_priq, &dsq->priq); 1740 + RB_CLEAR_NODE(&p->scx.dsq_priq); 1741 + p->scx.dsq_flags &= ~SCX_TASK_DSQ_ON_PRIQ; 1742 + } 1743 + 1744 + list_del_init(&p->scx.dsq_list.node); 1745 + dsq_mod_nr(dsq, -1); 1746 + } 1747 + 1748 + static void dispatch_dequeue(struct rq *rq, struct task_struct *p) 1749 + { 1750 + struct scx_dispatch_q *dsq = p->scx.dsq; 1751 + bool is_local = dsq == &rq->scx.local_dsq; 1752 + 1753 + if (!dsq) { 1754 + /* 1755 + * If !dsq && on-list, @p is on @rq's ddsp_deferred_locals. 1756 + * Unlinking is all that's needed to cancel. 1757 + */ 1758 + if (unlikely(!list_empty(&p->scx.dsq_list.node))) 1759 + list_del_init(&p->scx.dsq_list.node); 1760 + 1761 + /* 1762 + * When dispatching directly from the BPF scheduler to a local 1763 + * DSQ, the task isn't associated with any DSQ but 1764 + * @p->scx.holding_cpu may be set under the protection of 1765 + * %SCX_OPSS_DISPATCHING. 
1766 + */ 1767 + if (p->scx.holding_cpu >= 0) 1768 + p->scx.holding_cpu = -1; 1769 + 1770 + return; 1771 + } 1772 + 1773 + if (!is_local) 1774 + raw_spin_lock(&dsq->lock); 1775 + 1776 + /* 1777 + * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_* can't 1778 + * change underneath us. 1779 + */ 1780 + if (p->scx.holding_cpu < 0) { 1781 + /* @p must still be on @dsq, dequeue */ 1782 + task_unlink_from_dsq(p, dsq); 1783 + } else { 1784 + /* 1785 + * We're racing against dispatch_to_local_dsq() which already 1786 + * removed @p from @dsq and set @p->scx.holding_cpu. Clear the 1787 + * holding_cpu which tells dispatch_to_local_dsq() that it lost 1788 + * the race. 1789 + */ 1790 + WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node)); 1791 + p->scx.holding_cpu = -1; 1792 + } 1793 + p->scx.dsq = NULL; 1794 + 1795 + if (!is_local) 1796 + raw_spin_unlock(&dsq->lock); 1797 + } 1798 + 1799 + static struct scx_dispatch_q *find_user_dsq(u64 dsq_id) 1800 + { 1801 + return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params); 1802 + } 1803 + 1804 + static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id) 1805 + { 1806 + lockdep_assert(rcu_read_lock_any_held()); 1807 + 1808 + if (dsq_id == SCX_DSQ_GLOBAL) 1809 + return &scx_dsq_global; 1810 + else 1811 + return find_user_dsq(dsq_id); 1812 + } 1813 + 1814 + static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id, 1815 + struct task_struct *p) 1816 + { 1817 + struct scx_dispatch_q *dsq; 1818 + 1819 + if (dsq_id == SCX_DSQ_LOCAL) 1820 + return &rq->scx.local_dsq; 1821 + 1822 + if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { 1823 + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; 1824 + 1825 + if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict")) 1826 + return &scx_dsq_global; 1827 + 1828 + return &cpu_rq(cpu)->scx.local_dsq; 1829 + } 1830 + 1831 + dsq = find_non_local_dsq(dsq_id); 1832 + if (unlikely(!dsq)) { 1833 + scx_ops_error("non-existent DSQ 0x%llx for %s[%d]", 1834 + 
dsq_id, p->comm, p->pid); 1835 + return &scx_dsq_global; 1836 + } 1837 + 1838 + return dsq; 1839 + } 1840 + 1841 + static void mark_direct_dispatch(struct task_struct *ddsp_task, 1842 + struct task_struct *p, u64 dsq_id, 1843 + u64 enq_flags) 1844 + { 1845 + /* 1846 + * Mark that dispatch already happened from ops.select_cpu() or 1847 + * ops.enqueue() by spoiling direct_dispatch_task with a non-NULL value 1848 + * which can never match a valid task pointer. 1849 + */ 1850 + __this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH)); 1851 + 1852 + /* @p must match the task on the enqueue path */ 1853 + if (unlikely(p != ddsp_task)) { 1854 + if (IS_ERR(ddsp_task)) 1855 + scx_ops_error("%s[%d] already direct-dispatched", 1856 + p->comm, p->pid); 1857 + else 1858 + scx_ops_error("scheduling for %s[%d] but trying to direct-dispatch %s[%d]", 1859 + ddsp_task->comm, ddsp_task->pid, 1860 + p->comm, p->pid); 1861 + return; 1862 + } 1863 + 1864 + WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID); 1865 + WARN_ON_ONCE(p->scx.ddsp_enq_flags); 1866 + 1867 + p->scx.ddsp_dsq_id = dsq_id; 1868 + p->scx.ddsp_enq_flags = enq_flags; 1869 + } 1870 + 1871 + static void direct_dispatch(struct task_struct *p, u64 enq_flags) 1872 + { 1873 + struct rq *rq = task_rq(p); 1874 + struct scx_dispatch_q *dsq = 1875 + find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p); 1876 + 1877 + touch_core_sched_dispatch(rq, p); 1878 + 1879 + p->scx.ddsp_enq_flags |= enq_flags; 1880 + 1881 + /* 1882 + * We are in the enqueue path with @rq locked and pinned, and thus can't 1883 + * double lock a remote rq and enqueue to its local DSQ. For 1884 + * DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer 1885 + * the enqueue so that it's executed when @rq can be unlocked. 
1886 + */ 1887 + if (dsq->id == SCX_DSQ_LOCAL && dsq != &rq->scx.local_dsq) { 1888 + unsigned long opss; 1889 + 1890 + opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK; 1891 + 1892 + switch (opss & SCX_OPSS_STATE_MASK) { 1893 + case SCX_OPSS_NONE: 1894 + break; 1895 + case SCX_OPSS_QUEUEING: 1896 + /* 1897 + * As @p was never passed to the BPF side, _release is 1898 + * not strictly necessary. Still do it for consistency. 1899 + */ 1900 + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 1901 + break; 1902 + default: 1903 + WARN_ONCE(true, "sched_ext: %s[%d] has invalid ops state 0x%lx in direct_dispatch()", 1904 + p->comm, p->pid, opss); 1905 + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 1906 + break; 1907 + } 1908 + 1909 + WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node)); 1910 + list_add_tail(&p->scx.dsq_list.node, 1911 + &rq->scx.ddsp_deferred_locals); 1912 + schedule_deferred(rq); 1913 + return; 1914 + } 1915 + 1916 + dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); 1917 + } 1918 + 1919 + static bool scx_rq_online(struct rq *rq) 1920 + { 1921 + /* 1922 + * Test both cpu_active() and %SCX_RQ_ONLINE. %SCX_RQ_ONLINE indicates 1923 + * the online state as seen from the BPF scheduler. cpu_active() test 1924 + * guarantees that, if this function returns %true, %SCX_RQ_ONLINE will 1925 + * stay set until the current scheduling operation is complete even if 1926 + * we aren't locking @rq. 
1927 + */ 1928 + return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq))); 1929 + } 1930 + 1931 + static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, 1932 + int sticky_cpu) 1933 + { 1934 + struct task_struct **ddsp_taskp; 1935 + unsigned long qseq; 1936 + 1937 + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED)); 1938 + 1939 + /* rq migration */ 1940 + if (sticky_cpu == cpu_of(rq)) 1941 + goto local_norefill; 1942 + 1943 + /* 1944 + * If !scx_rq_online(), we already told the BPF scheduler that the CPU 1945 + * is offline and are just running the hotplug path. Don't bother the 1946 + * BPF scheduler. 1947 + */ 1948 + if (!scx_rq_online(rq)) 1949 + goto local; 1950 + 1951 + if (scx_rq_bypassing(rq)) 1952 + goto global; 1953 + 1954 + if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) 1955 + goto direct; 1956 + 1957 + /* see %SCX_OPS_ENQ_EXITING */ 1958 + if (!static_branch_unlikely(&scx_ops_enq_exiting) && 1959 + unlikely(p->flags & PF_EXITING)) 1960 + goto local; 1961 + 1962 + if (!SCX_HAS_OP(enqueue)) 1963 + goto global; 1964 + 1965 + /* DSQ bypass didn't trigger, enqueue on the BPF scheduler */ 1966 + qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT; 1967 + 1968 + WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); 1969 + atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); 1970 + 1971 + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); 1972 + WARN_ON_ONCE(*ddsp_taskp); 1973 + *ddsp_taskp = p; 1974 + 1975 + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); 1976 + 1977 + *ddsp_taskp = NULL; 1978 + if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) 1979 + goto direct; 1980 + 1981 + /* 1982 + * If not directly dispatched, QUEUEING isn't clear yet and dispatch or 1983 + * dequeue may be waiting. The store_release matches their load_acquire. 
1984 + */ 1985 + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); 1986 + return; 1987 + 1988 + direct: 1989 + direct_dispatch(p, enq_flags); 1990 + return; 1991 + 1992 + local: 1993 + /* 1994 + * For task-ordering, slice refill must be treated as implying the end 1995 + * of the current slice. Otherwise, the longer @p stays on the CPU, the 1996 + * higher priority it becomes from scx_prio_less()'s POV. 1997 + */ 1998 + touch_core_sched(rq, p); 1999 + p->scx.slice = SCX_SLICE_DFL; 2000 + local_norefill: 2001 + dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags); 2002 + return; 2003 + 2004 + global: 2005 + touch_core_sched(rq, p); /* see the comment in local: */ 2006 + p->scx.slice = SCX_SLICE_DFL; 2007 + dispatch_enqueue(&scx_dsq_global, p, enq_flags); 2008 + } 2009 + 2010 + static bool task_runnable(const struct task_struct *p) 2011 + { 2012 + return !list_empty(&p->scx.runnable_node); 2013 + } 2014 + 2015 + static void set_task_runnable(struct rq *rq, struct task_struct *p) 2016 + { 2017 + lockdep_assert_rq_held(rq); 2018 + 2019 + if (p->scx.flags & SCX_TASK_RESET_RUNNABLE_AT) { 2020 + p->scx.runnable_at = jiffies; 2021 + p->scx.flags &= ~SCX_TASK_RESET_RUNNABLE_AT; 2022 + } 2023 + 2024 + /* 2025 + * list_add_tail() must be used. scx_ops_bypass() depends on tasks being 2026 + * appended to the runnable_list. 
2027 + */ 2028 + list_add_tail(&p->scx.runnable_node, &rq->scx.runnable_list); 2029 + } 2030 + 2031 + static void clr_task_runnable(struct task_struct *p, bool reset_runnable_at) 2032 + { 2033 + list_del_init(&p->scx.runnable_node); 2034 + if (reset_runnable_at) 2035 + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 2036 + } 2037 + 2038 + static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags) 2039 + { 2040 + int sticky_cpu = p->scx.sticky_cpu; 2041 + 2042 + if (enq_flags & ENQUEUE_WAKEUP) 2043 + rq->scx.flags |= SCX_RQ_IN_WAKEUP; 2044 + 2045 + enq_flags |= rq->scx.extra_enq_flags; 2046 + 2047 + if (sticky_cpu >= 0) 2048 + p->scx.sticky_cpu = -1; 2049 + 2050 + /* 2051 + * Restoring a running task will be immediately followed by 2052 + * set_next_task_scx() which expects the task to not be on the BPF 2053 + * scheduler as tasks can only start running through local DSQs. Force 2054 + * direct-dispatch into the local DSQ by setting the sticky_cpu. 2055 + */ 2056 + if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p)) 2057 + sticky_cpu = cpu_of(rq); 2058 + 2059 + if (p->scx.flags & SCX_TASK_QUEUED) { 2060 + WARN_ON_ONCE(!task_runnable(p)); 2061 + goto out; 2062 + } 2063 + 2064 + set_task_runnable(rq, p); 2065 + p->scx.flags |= SCX_TASK_QUEUED; 2066 + rq->scx.nr_running++; 2067 + add_nr_running(rq, 1); 2068 + 2069 + if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p)) 2070 + SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); 2071 + 2072 + if (enq_flags & SCX_ENQ_WAKEUP) 2073 + touch_core_sched(rq, p); 2074 + 2075 + do_enqueue_task(rq, p, enq_flags, sticky_cpu); 2076 + out: 2077 + rq->scx.flags &= ~SCX_RQ_IN_WAKEUP; 2078 + } 2079 + 2080 + static void ops_dequeue(struct task_struct *p, u64 deq_flags) 2081 + { 2082 + unsigned long opss; 2083 + 2084 + /* dequeue is always temporary, don't reset runnable_at */ 2085 + clr_task_runnable(p, false); 2086 + 2087 + /* acquire ensures that we see the preceding updates on QUEUED */ 2088 + opss 
= atomic_long_read_acquire(&p->scx.ops_state); 2089 + 2090 + switch (opss & SCX_OPSS_STATE_MASK) { 2091 + case SCX_OPSS_NONE: 2092 + break; 2093 + case SCX_OPSS_QUEUEING: 2094 + /* 2095 + * QUEUEING is started and finished while holding @p's rq lock. 2096 + * As we're holding the rq lock now, we shouldn't see QUEUEING. 2097 + */ 2098 + BUG(); 2099 + case SCX_OPSS_QUEUED: 2100 + if (SCX_HAS_OP(dequeue)) 2101 + SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags); 2102 + 2103 + if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, 2104 + SCX_OPSS_NONE)) 2105 + break; 2106 + fallthrough; 2107 + case SCX_OPSS_DISPATCHING: 2108 + /* 2109 + * If @p is being dispatched from the BPF scheduler to a DSQ, 2110 + * wait for the transfer to complete so that @p doesn't get 2111 + * added to its DSQ after dequeueing is complete. 2112 + * 2113 + * As we're waiting on DISPATCHING with the rq locked, the 2114 + * dispatching side shouldn't try to lock the rq while 2115 + * DISPATCHING is set. See dispatch_to_local_dsq(). 2116 + * 2117 + * DISPATCHING shouldn't have qseq set and control can reach 2118 + * here with NONE @opss from the above QUEUED case block. 2119 + * Explicitly wait on %SCX_OPSS_DISPATCHING instead of @opss. 2120 + */ 2121 + wait_ops_state(p, SCX_OPSS_DISPATCHING); 2122 + BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); 2123 + break; 2124 + } 2125 + } 2126 + 2127 + static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags) 2128 + { 2129 + if (!(p->scx.flags & SCX_TASK_QUEUED)) { 2130 + WARN_ON_ONCE(task_runnable(p)); 2131 + return true; 2132 + } 2133 + 2134 + ops_dequeue(p, deq_flags); 2135 + 2136 + /* 2137 + * A currently running task which is going off @rq first gets dequeued 2138 + * and then stops running. As we want running <-> stopping transitions 2139 + * to be contained within runnable <-> quiescent transitions, trigger 2140 + * ->stopping() early here instead of in put_prev_task_scx(). 
2141 + * 2142 + * @p may go through multiple stopping <-> running transitions between 2143 + * here and put_prev_task_scx() if task attribute changes occur while 2144 + * balance_scx() leaves @rq unlocked. However, they don't contain any 2145 + * information meaningful to the BPF scheduler and can be suppressed by 2146 + * skipping the callbacks if the task is !QUEUED. 2147 + */ 2148 + if (SCX_HAS_OP(stopping) && task_current(rq, p)) { 2149 + update_curr_scx(rq); 2150 + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); 2151 + } 2152 + 2153 + if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p)) 2154 + SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); 2155 + 2156 + if (deq_flags & SCX_DEQ_SLEEP) 2157 + p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; 2158 + else 2159 + p->scx.flags &= ~SCX_TASK_DEQD_FOR_SLEEP; 2160 + 2161 + p->scx.flags &= ~SCX_TASK_QUEUED; 2162 + rq->scx.nr_running--; 2163 + sub_nr_running(rq, 1); 2164 + 2165 + dispatch_dequeue(rq, p); 2166 + return true; 2167 + } 2168 + 2169 + static void yield_task_scx(struct rq *rq) 2170 + { 2171 + struct task_struct *p = rq->curr; 2172 + 2173 + if (SCX_HAS_OP(yield)) 2174 + SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL); 2175 + else 2176 + p->scx.slice = 0; 2177 + } 2178 + 2179 + static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) 2180 + { 2181 + struct task_struct *from = rq->curr; 2182 + 2183 + if (SCX_HAS_OP(yield)) 2184 + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to); 2185 + else 2186 + return false; 2187 + } 2188 + 2189 + static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, 2190 + struct scx_dispatch_q *src_dsq, 2191 + struct rq *dst_rq) 2192 + { 2193 + struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq; 2194 + 2195 + /* @dsq is locked and @p is on @dst_rq */ 2196 + lockdep_assert_held(&src_dsq->lock); 2197 + lockdep_assert_rq_held(dst_rq); 2198 + 2199 + WARN_ON_ONCE(p->scx.holding_cpu >= 0); 2200 + 2201 + if (enq_flags & (SCX_ENQ_HEAD | 
SCX_ENQ_PREEMPT)) 2202 + list_add(&p->scx.dsq_list.node, &dst_dsq->list); 2203 + else 2204 + list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list); 2205 + 2206 + dsq_mod_nr(dst_dsq, 1); 2207 + p->scx.dsq = dst_dsq; 2208 + } 2209 + 2210 + #ifdef CONFIG_SMP 2211 + /** 2212 + * move_remote_task_to_local_dsq - Move a task from a foreign rq to a local DSQ 2213 + * @p: task to move 2214 + * @enq_flags: %SCX_ENQ_* 2215 + * @src_rq: rq to move the task from, locked on entry, released on return 2216 + * @dst_rq: rq to move the task into, locked on return 2217 + * 2218 + * Move @p which is currently on @src_rq to @dst_rq's local DSQ. 2219 + */ 2220 + static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, 2221 + struct rq *src_rq, struct rq *dst_rq) 2222 + { 2223 + lockdep_assert_rq_held(src_rq); 2224 + 2225 + /* the following marks @p MIGRATING which excludes dequeue */ 2226 + deactivate_task(src_rq, p, 0); 2227 + set_task_cpu(p, cpu_of(dst_rq)); 2228 + p->scx.sticky_cpu = cpu_of(dst_rq); 2229 + 2230 + raw_spin_rq_unlock(src_rq); 2231 + raw_spin_rq_lock(dst_rq); 2232 + 2233 + /* 2234 + * We want to pass scx-specific enq_flags but activate_task() will 2235 + * truncate the upper 32 bit. As we own @rq, we can pass them through 2236 + * @rq->scx.extra_enq_flags instead. 2237 + */ 2238 + WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)); 2239 + WARN_ON_ONCE(dst_rq->scx.extra_enq_flags); 2240 + dst_rq->scx.extra_enq_flags = enq_flags; 2241 + activate_task(dst_rq, p, 0); 2242 + dst_rq->scx.extra_enq_flags = 0; 2243 + } 2244 + 2245 + /* 2246 + * Similar to kernel/sched/core.c::is_cpu_allowed(). However, there are two 2247 + * differences: 2248 + * 2249 + * - is_cpu_allowed() asks "Can this task run on this CPU?" while 2250 + * task_can_run_on_remote_rq() asks "Can the BPF scheduler migrate the task to 2251 + * this CPU?". 
2252 + * 2253 + * While migration is disabled, is_cpu_allowed() has to say "yes" as the task 2254 + * must be allowed to finish on the CPU that it's currently on regardless of 2255 + * the CPU state. However, task_can_run_on_remote_rq() must say "no" as the 2256 + * BPF scheduler shouldn't attempt to migrate a task which has migration 2257 + * disabled. 2258 + * 2259 + * - The BPF scheduler is bypassed while the rq is offline and we can always say 2260 + * no to the BPF scheduler initiated migrations while offline. 2261 + */ 2262 + static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, 2263 + bool trigger_error) 2264 + { 2265 + int cpu = cpu_of(rq); 2266 + 2267 + /* 2268 + * We don't require the BPF scheduler to avoid dispatching to offline 2269 + * CPUs mostly for convenience but also because CPUs can go offline 2270 + * between scx_bpf_dispatch() calls and here. Trigger error iff the 2271 + * picked CPU is outside the allowed mask. 2272 + */ 2273 + if (!task_allowed_on_cpu(p, cpu)) { 2274 + if (trigger_error) 2275 + scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]", 2276 + cpu_of(rq), p->comm, p->pid); 2277 + return false; 2278 + } 2279 + 2280 + if (unlikely(is_migration_disabled(p))) 2281 + return false; 2282 + 2283 + if (!scx_rq_online(rq)) 2284 + return false; 2285 + 2286 + return true; 2287 + } 2288 + 2289 + /** 2290 + * unlink_dsq_and_lock_src_rq() - Unlink task from its DSQ and lock its task_rq 2291 + * @p: target task 2292 + * @dsq: locked DSQ @p is currently on 2293 + * @src_rq: rq @p is currently on, stable with @dsq locked 2294 + * 2295 + * Called with @dsq locked but no rq's locked. We want to move @p to a different 2296 + * DSQ, including any local DSQ, but are not locking @src_rq. Locking @src_rq is 2297 + * required when transferring into a local DSQ. 
Even when transferring into a 2298 + * non-local DSQ, it's better to use the same mechanism to protect against 2299 + * dequeues and maintain the invariant that @p->scx.dsq can only change while 2300 + * @src_rq is locked, which e.g. scx_dump_task() depends on. 2301 + * 2302 + * We want to grab @src_rq but that can deadlock if we try while locking @dsq, 2303 + * so we want to unlink @p from @dsq, drop its lock and then lock @src_rq. As 2304 + * this may race with dequeue, which can't drop the rq lock or fail, do a little 2305 + * dancing from our side. 2306 + * 2307 + * @p->scx.holding_cpu is set to this CPU before @dsq is unlocked. If @p gets 2308 + * dequeued after we unlock @dsq but before locking @src_rq, the holding_cpu 2309 + * would be cleared to -1. While other cpus may have updated it to different 2310 + * values afterwards, as this operation can't be preempted or recurse, the 2311 + * holding_cpu can never become this CPU again before we're done. Thus, we can 2312 + * tell whether we lost to dequeue by testing whether the holding_cpu still 2313 + * points to this CPU. See dispatch_dequeue() for the counterpart. 2314 + * 2315 + * On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is 2316 + * still valid. %false if lost to dequeue. 
2317 + */ 2318 + static bool unlink_dsq_and_lock_src_rq(struct task_struct *p, 2319 + struct scx_dispatch_q *dsq, 2320 + struct rq *src_rq) 2321 + { 2322 + s32 cpu = raw_smp_processor_id(); 2323 + 2324 + lockdep_assert_held(&dsq->lock); 2325 + 2326 + WARN_ON_ONCE(p->scx.holding_cpu >= 0); 2327 + task_unlink_from_dsq(p, dsq); 2328 + p->scx.holding_cpu = cpu; 2329 + 2330 + raw_spin_unlock(&dsq->lock); 2331 + raw_spin_rq_lock(src_rq); 2332 + 2333 + /* task_rq couldn't have changed if we're still the holding cpu */ 2334 + return likely(p->scx.holding_cpu == cpu) && 2335 + !WARN_ON_ONCE(src_rq != task_rq(p)); 2336 + } 2337 + 2338 + static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, 2339 + struct scx_dispatch_q *dsq, struct rq *src_rq) 2340 + { 2341 + raw_spin_rq_unlock(this_rq); 2342 + 2343 + if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) { 2344 + move_remote_task_to_local_dsq(p, 0, src_rq, this_rq); 2345 + return true; 2346 + } else { 2347 + raw_spin_rq_unlock(src_rq); 2348 + raw_spin_rq_lock(this_rq); 2349 + return false; 2350 + } 2351 + } 2352 + #else /* CONFIG_SMP */ 2353 + static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; } 2354 + static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; } 2355 + #endif /* CONFIG_SMP */ 2356 + 2357 + static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq) 2358 + { 2359 + struct task_struct *p; 2360 + retry: 2361 + /* 2362 + * The caller can't expect to successfully consume a task if the task's 2363 + * addition to @dsq isn't guaranteed to be visible somehow. Test 2364 + * @dsq->list without locking and skip if it seems empty. 
2365 + */ 2366 + if (list_empty(&dsq->list)) 2367 + return false; 2368 + 2369 + raw_spin_lock(&dsq->lock); 2370 + 2371 + nldsq_for_each_task(p, dsq) { 2372 + struct rq *task_rq = task_rq(p); 2373 + 2374 + if (rq == task_rq) { 2375 + task_unlink_from_dsq(p, dsq); 2376 + move_local_task_to_local_dsq(p, 0, dsq, rq); 2377 + raw_spin_unlock(&dsq->lock); 2378 + return true; 2379 + } 2380 + 2381 + if (task_can_run_on_remote_rq(p, rq, false)) { 2382 + if (likely(consume_remote_task(rq, p, dsq, task_rq))) 2383 + return true; 2384 + goto retry; 2385 + } 2386 + } 2387 + 2388 + raw_spin_unlock(&dsq->lock); 2389 + return false; 2390 + } 2391 + 2392 + /** 2393 + * dispatch_to_local_dsq - Dispatch a task to a local dsq 2394 + * @rq: current rq which is locked 2395 + * @dst_dsq: destination DSQ 2396 + * @p: task to dispatch 2397 + * @enq_flags: %SCX_ENQ_* 2398 + * 2399 + * We're holding @rq lock and want to dispatch @p to @dst_dsq which is a local 2400 + * DSQ. This function performs all the synchronization dancing needed because 2401 + * local DSQs are protected with rq locks. 2402 + * 2403 + * The caller must have exclusive ownership of @p (e.g. through 2404 + * %SCX_OPSS_DISPATCHING). 2405 + */ 2406 + static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq, 2407 + struct task_struct *p, u64 enq_flags) 2408 + { 2409 + struct rq *src_rq = task_rq(p); 2410 + struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); 2411 + 2412 + /* 2413 + * We're synchronized against dequeue through DISPATCHING. As @p can't 2414 + * be dequeued, its task_rq and cpus_allowed are stable too. 2415 + * 2416 + * If dispatching to @rq that @p is already on, no lock dancing needed. 
2417 + */ 2418 + if (rq == src_rq && rq == dst_rq) { 2419 + dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2420 + return; 2421 + } 2422 + 2423 + #ifdef CONFIG_SMP 2424 + if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) { 2425 + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2426 + return; 2427 + } 2428 + 2429 + /* 2430 + * @p is on a possibly remote @src_rq which we need to lock to move the 2431 + * task. If dequeue is in progress, it'd be locking @src_rq and waiting 2432 + * on DISPATCHING, so we can't grab @src_rq lock while holding 2433 + * DISPATCHING. 2434 + * 2435 + * As DISPATCHING guarantees that @p is wholly ours, we can pretend that 2436 + * we're moving from a DSQ and use the same mechanism - mark the task 2437 + * under transfer with holding_cpu, release DISPATCHING and then follow 2438 + * the same protocol. See unlink_dsq_and_lock_src_rq(). 2439 + */ 2440 + p->scx.holding_cpu = raw_smp_processor_id(); 2441 + 2442 + /* store_release ensures that dequeue sees the above */ 2443 + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE); 2444 + 2445 + /* switch to @src_rq lock */ 2446 + if (rq != src_rq) { 2447 + raw_spin_rq_unlock(rq); 2448 + raw_spin_rq_lock(src_rq); 2449 + } 2450 + 2451 + /* task_rq couldn't have changed if we're still the holding cpu */ 2452 + if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && 2453 + !WARN_ON_ONCE(src_rq != task_rq(p))) { 2454 + /* 2455 + * If @p is staying on the same rq, there's no need to go 2456 + * through the full deactivate/activate cycle. Optimize by 2457 + * abbreviating move_remote_task_to_local_dsq(). 
2458 + */ 2459 + if (src_rq == dst_rq) { 2460 + p->scx.holding_cpu = -1; 2461 + dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags); 2462 + } else { 2463 + move_remote_task_to_local_dsq(p, enq_flags, 2464 + src_rq, dst_rq); 2465 + } 2466 + 2467 + /* if the destination CPU is idle, wake it up */ 2468 + if (sched_class_above(p->sched_class, dst_rq->curr->sched_class)) 2469 + resched_curr(dst_rq); 2470 + } 2471 + 2472 + /* switch back to @rq lock */ 2473 + if (rq != dst_rq) { 2474 + raw_spin_rq_unlock(dst_rq); 2475 + raw_spin_rq_lock(rq); 2476 + } 2477 + #else /* CONFIG_SMP */ 2478 + BUG(); /* control can not reach here on UP */ 2479 + #endif /* CONFIG_SMP */ 2480 + } 2481 + 2482 + /** 2483 + * finish_dispatch - Asynchronously finish dispatching a task 2484 + * @rq: current rq which is locked 2485 + * @p: task to finish dispatching 2486 + * @qseq_at_dispatch: qseq when @p started getting dispatched 2487 + * @dsq_id: destination DSQ ID 2488 + * @enq_flags: %SCX_ENQ_* 2489 + * 2490 + * Dispatching to local DSQs may need to wait for queueing to complete or 2491 + * require rq lock dancing. As we don't wanna do either while inside 2492 + * ops.dispatch() to avoid locking order inversion, we split dispatching into 2493 + * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the 2494 + * task and its qseq. Once ops.dispatch() returns, this function is called to 2495 + * finish up. 2496 + * 2497 + * There is no guarantee that @p is still valid for dispatching or even that it 2498 + * was valid in the first place. Make sure that the task is still owned by the 2499 + * BPF scheduler and claim the ownership before dispatching. 2500 + */ 2501 + static void finish_dispatch(struct rq *rq, struct task_struct *p, 2502 + unsigned long qseq_at_dispatch, 2503 + u64 dsq_id, u64 enq_flags) 2504 + { 2505 + struct scx_dispatch_q *dsq; 2506 + unsigned long opss; 2507 + 2508 + touch_core_sched_dispatch(rq, p); 2509 + retry: 2510 + /* 2511 + * No need for _acquire here. 
@p is accessed only after a successful 2512 + * try_cmpxchg to DISPATCHING. 2513 + */ 2514 + opss = atomic_long_read(&p->scx.ops_state); 2515 + 2516 + switch (opss & SCX_OPSS_STATE_MASK) { 2517 + case SCX_OPSS_DISPATCHING: 2518 + case SCX_OPSS_NONE: 2519 + /* someone else already got to it */ 2520 + return; 2521 + case SCX_OPSS_QUEUED: 2522 + /* 2523 + * If qseq doesn't match, @p has gone through at least one 2524 + * dispatch/dequeue and re-enqueue cycle between 2525 + * scx_bpf_dispatch() and here and we have no claim on it. 2526 + */ 2527 + if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch) 2528 + return; 2529 + 2530 + /* 2531 + * While we know @p is accessible, we don't yet have a claim on 2532 + * it - the BPF scheduler is allowed to dispatch tasks 2533 + * spuriously and there can be a racing dequeue attempt. Let's 2534 + * claim @p by atomically transitioning it from QUEUED to 2535 + * DISPATCHING. 2536 + */ 2537 + if (likely(atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, 2538 + SCX_OPSS_DISPATCHING))) 2539 + break; 2540 + goto retry; 2541 + case SCX_OPSS_QUEUEING: 2542 + /* 2543 + * do_enqueue_task() is in the process of transferring the task 2544 + * to the BPF scheduler while holding @p's rq lock. As we aren't 2545 + * holding any kernel or BPF resource that the enqueue path may 2546 + * depend upon, it's safe to wait. 
2547 + */ 2548 + wait_ops_state(p, opss); 2549 + goto retry; 2550 + } 2551 + 2552 + BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); 2553 + 2554 + dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p); 2555 + 2556 + if (dsq->id == SCX_DSQ_LOCAL) 2557 + dispatch_to_local_dsq(rq, dsq, p, enq_flags); 2558 + else 2559 + dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); 2560 + } 2561 + 2562 + static void flush_dispatch_buf(struct rq *rq) 2563 + { 2564 + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2565 + u32 u; 2566 + 2567 + for (u = 0; u < dspc->cursor; u++) { 2568 + struct scx_dsp_buf_ent *ent = &dspc->buf[u]; 2569 + 2570 + finish_dispatch(rq, ent->task, ent->qseq, ent->dsq_id, 2571 + ent->enq_flags); 2572 + } 2573 + 2574 + dspc->nr_tasks += dspc->cursor; 2575 + dspc->cursor = 0; 2576 + } 2577 + 2578 + static int balance_one(struct rq *rq, struct task_struct *prev) 2579 + { 2580 + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 2581 + bool prev_on_scx = prev->sched_class == &ext_sched_class; 2582 + int nr_loops = SCX_DSP_MAX_LOOPS; 2583 + 2584 + lockdep_assert_rq_held(rq); 2585 + rq->scx.flags |= SCX_RQ_IN_BALANCE; 2586 + rq->scx.flags &= ~SCX_RQ_BAL_KEEP; 2587 + 2588 + if (static_branch_unlikely(&scx_ops_cpu_preempt) && 2589 + unlikely(rq->scx.cpu_released)) { 2590 + /* 2591 + * If the previous sched_class for the current CPU was not SCX, 2592 + * notify the BPF scheduler that it again has control of the 2593 + * core. This callback complements ->cpu_release(), which is 2594 + * emitted in scx_next_task_picked(). 2595 + */ 2596 + if (SCX_HAS_OP(cpu_acquire)) 2597 + SCX_CALL_OP(0, cpu_acquire, cpu_of(rq), NULL); 2598 + rq->scx.cpu_released = false; 2599 + } 2600 + 2601 + if (prev_on_scx) { 2602 + update_curr_scx(rq); 2603 + 2604 + /* 2605 + * If @prev is runnable & has slice left, it has priority and 2606 + * fetching more just increases latency for the fetched tasks. 2607 + * Tell pick_task_scx() to keep running @prev. 
If the BPF 2608 + * scheduler wants to handle this explicitly, it should 2609 + * implement ->cpu_release(). 2610 + * 2611 + * See scx_ops_disable_workfn() for the explanation on the 2612 + * bypassing test. 2613 + */ 2614 + if ((prev->scx.flags & SCX_TASK_QUEUED) && 2615 + prev->scx.slice && !scx_rq_bypassing(rq)) { 2616 + rq->scx.flags |= SCX_RQ_BAL_KEEP; 2617 + goto has_tasks; 2618 + } 2619 + } 2620 + 2621 + /* if there already are tasks to run, nothing to do */ 2622 + if (rq->scx.local_dsq.nr) 2623 + goto has_tasks; 2624 + 2625 + if (consume_dispatch_q(rq, &scx_dsq_global)) 2626 + goto has_tasks; 2627 + 2628 + if (!SCX_HAS_OP(dispatch) || scx_rq_bypassing(rq) || !scx_rq_online(rq)) 2629 + goto no_tasks; 2630 + 2631 + dspc->rq = rq; 2632 + 2633 + /* 2634 + * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock, 2635 + * the local DSQ might still end up empty after a successful 2636 + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() 2637 + * produced some tasks, retry. The BPF scheduler may depend on this 2638 + * looping behavior to simplify its implementation. 2639 + */ 2640 + do { 2641 + dspc->nr_tasks = 0; 2642 + 2643 + SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), 2644 + prev_on_scx ? prev : NULL); 2645 + 2646 + flush_dispatch_buf(rq); 2647 + 2648 + if (rq->scx.local_dsq.nr) 2649 + goto has_tasks; 2650 + if (consume_dispatch_q(rq, &scx_dsq_global)) 2651 + goto has_tasks; 2652 + 2653 + /* 2654 + * ops.dispatch() can trap us in this loop by repeatedly 2655 + * dispatching ineligible tasks. Break out once in a while to 2656 + * allow the watchdog to run. As IRQ can't be enabled in 2657 + * balance(), we want to complete this scheduling cycle and then 2658 + * start a new one. IOW, we want to call resched_curr() on the 2659 + * next, most likely idle, task, not the current one. Use 2660 + * scx_bpf_kick_cpu() for deferred kicking. 
2661 + */ 2662 + if (unlikely(!--nr_loops)) { 2663 + scx_bpf_kick_cpu(cpu_of(rq), 0); 2664 + break; 2665 + } 2666 + } while (dspc->nr_tasks); 2667 + 2668 + no_tasks: 2669 + /* 2670 + * Didn't find another task to run. Keep running @prev unless 2671 + * %SCX_OPS_ENQ_LAST is in effect. 2672 + */ 2673 + if ((prev->scx.flags & SCX_TASK_QUEUED) && 2674 + (!static_branch_unlikely(&scx_ops_enq_last) || 2675 + scx_rq_bypassing(rq))) { 2676 + rq->scx.flags |= SCX_RQ_BAL_KEEP; 2677 + goto has_tasks; 2678 + } 2679 + rq->scx.flags &= ~SCX_RQ_IN_BALANCE; 2680 + return false; 2681 + 2682 + has_tasks: 2683 + rq->scx.flags &= ~SCX_RQ_IN_BALANCE; 2684 + return true; 2685 + } 2686 + 2687 + static int balance_scx(struct rq *rq, struct task_struct *prev, 2688 + struct rq_flags *rf) 2689 + { 2690 + int ret; 2691 + 2692 + rq_unpin_lock(rq, rf); 2693 + 2694 + ret = balance_one(rq, prev); 2695 + 2696 + #ifdef CONFIG_SCHED_SMT 2697 + /* 2698 + * When core-sched is enabled, this ops.balance() call will be followed 2699 + * by pick_task_scx() on this CPU and the SMT siblings. Balance the 2700 + * siblings too. 2701 + */ 2702 + if (sched_core_enabled(rq)) { 2703 + const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq)); 2704 + int scpu; 2705 + 2706 + for_each_cpu_andnot(scpu, smt_mask, cpumask_of(cpu_of(rq))) { 2707 + struct rq *srq = cpu_rq(scpu); 2708 + struct task_struct *sprev = srq->curr; 2709 + 2710 + WARN_ON_ONCE(__rq_lockp(rq) != __rq_lockp(srq)); 2711 + update_rq_clock(srq); 2712 + balance_one(srq, sprev); 2713 + } 2714 + } 2715 + #endif 2716 + rq_repin_lock(rq, rf); 2717 + 2718 + return ret; 2719 + } 2720 + 2721 + static void process_ddsp_deferred_locals(struct rq *rq) 2722 + { 2723 + struct task_struct *p; 2724 + 2725 + lockdep_assert_rq_held(rq); 2726 + 2727 + /* 2728 + * Now that @rq can be unlocked, execute the deferred enqueueing of 2729 + * tasks directly dispatched to the local DSQs of other CPUs. See 2730 + * direct_dispatch(). 
Keep popping from the head instead of using 2731 + * list_for_each_entry_safe() as dispatch_to_local_dsq() may unlock @rq 2732 + * temporarily. 2733 + */ 2734 + while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals, 2735 + struct task_struct, scx.dsq_list.node))) { 2736 + struct scx_dispatch_q *dsq; 2737 + 2738 + list_del_init(&p->scx.dsq_list.node); 2739 + 2740 + dsq = find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p); 2741 + if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL)) 2742 + dispatch_to_local_dsq(rq, dsq, p, p->scx.ddsp_enq_flags); 2743 + } 2744 + } 2745 + 2746 + static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) 2747 + { 2748 + if (p->scx.flags & SCX_TASK_QUEUED) { 2749 + /* 2750 + * Core-sched might decide to execute @p before it is 2751 + * dispatched. Call ops_dequeue() to notify the BPF scheduler. 2752 + */ 2753 + ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC); 2754 + dispatch_dequeue(rq, p); 2755 + } 2756 + 2757 + p->se.exec_start = rq_clock_task(rq); 2758 + 2759 + /* see dequeue_task_scx() on why we skip when !QUEUED */ 2760 + if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) 2761 + SCX_CALL_OP_TASK(SCX_KF_REST, running, p); 2762 + 2763 + clr_task_runnable(p, true); 2764 + 2765 + /* 2766 + * @p is getting newly scheduled or got kicked after someone updated its 2767 + * slice. Refresh whether tick can be stopped. See scx_can_stop_tick(). 2768 + */ 2769 + if ((p->scx.slice == SCX_SLICE_INF) != 2770 + (bool)(rq->scx.flags & SCX_RQ_CAN_STOP_TICK)) { 2771 + if (p->scx.slice == SCX_SLICE_INF) 2772 + rq->scx.flags |= SCX_RQ_CAN_STOP_TICK; 2773 + else 2774 + rq->scx.flags &= ~SCX_RQ_CAN_STOP_TICK; 2775 + 2776 + sched_update_tick_dependency(rq); 2777 + 2778 + /* 2779 + * For now, let's refresh the load_avgs just when transitioning 2780 + * in and out of nohz. In the future, we might want to add a 2781 + * mechanism which calls the following periodically on 2782 + * tick-stopped CPUs. 
2783 + */ 2784 + update_other_load_avgs(rq); 2785 + } 2786 + } 2787 + 2788 + static enum scx_cpu_preempt_reason 2789 + preempt_reason_from_class(const struct sched_class *class) 2790 + { 2791 + #ifdef CONFIG_SMP 2792 + if (class == &stop_sched_class) 2793 + return SCX_CPU_PREEMPT_STOP; 2794 + #endif 2795 + if (class == &dl_sched_class) 2796 + return SCX_CPU_PREEMPT_DL; 2797 + if (class == &rt_sched_class) 2798 + return SCX_CPU_PREEMPT_RT; 2799 + return SCX_CPU_PREEMPT_UNKNOWN; 2800 + } 2801 + 2802 + static void switch_class(struct rq *rq, struct task_struct *next) 2803 + { 2804 + const struct sched_class *next_class = next->sched_class; 2805 + 2806 + #ifdef CONFIG_SMP 2807 + /* 2808 + * Pairs with the smp_load_acquire() issued by a CPU in 2809 + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a 2810 + * resched. 2811 + */ 2812 + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); 2813 + #endif 2814 + if (!static_branch_unlikely(&scx_ops_cpu_preempt)) 2815 + return; 2816 + 2817 + /* 2818 + * The callback is conceptually meant to convey that the CPU is no 2819 + * longer under the control of SCX. Therefore, don't invoke the callback 2820 + * if the next class is below SCX (in which case the BPF scheduler has 2821 + * actively decided not to schedule any tasks on the CPU). 2822 + */ 2823 + if (sched_class_above(&ext_sched_class, next_class)) 2824 + return; 2825 + 2826 + /* 2827 + * At this point we know that SCX was preempted by a higher priority 2828 + * sched_class, so invoke the ->cpu_release() callback if we have not 2829 + * done so already. We only send the callback once between SCX being 2830 + * preempted, and it regaining control of the CPU. 2831 + * 2832 + * ->cpu_release() complements ->cpu_acquire(), which is emitted the 2833 + * next time that balance_scx() is invoked. 
2834 + */ 2835 + if (!rq->scx.cpu_released) { 2836 + if (SCX_HAS_OP(cpu_release)) { 2837 + struct scx_cpu_release_args args = { 2838 + .reason = preempt_reason_from_class(next_class), 2839 + .task = next, 2840 + }; 2841 + 2842 + SCX_CALL_OP(SCX_KF_CPU_RELEASE, 2843 + cpu_release, cpu_of(rq), &args); 2844 + } 2845 + rq->scx.cpu_released = true; 2846 + } 2847 + } 2848 + 2849 + static void put_prev_task_scx(struct rq *rq, struct task_struct *p, 2850 + struct task_struct *next) 2851 + { 2852 + update_curr_scx(rq); 2853 + 2854 + /* see dequeue_task_scx() on why we skip when !QUEUED */ 2855 + if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) 2856 + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); 2857 + 2858 + if (p->scx.flags & SCX_TASK_QUEUED) { 2859 + set_task_runnable(rq, p); 2860 + 2861 + /* 2862 + * If @p has slice left and is being put, @p is getting 2863 + * preempted by a higher priority scheduler class or core-sched 2864 + * forcing a different task. Leave it at the head of the local 2865 + * DSQ. 2866 + */ 2867 + if (p->scx.slice && !scx_rq_bypassing(rq)) { 2868 + dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD); 2869 + return; 2870 + } 2871 + 2872 + /* 2873 + * If @p is runnable but we're about to enter a lower 2874 + * sched_class, %SCX_OPS_ENQ_LAST must be set. Tell 2875 + * ops.enqueue() that @p is the only one available for this cpu, 2876 + * which should trigger an explicit follow-up scheduling event. 
2877 + */ 2878 + if (sched_class_above(&ext_sched_class, next->sched_class)) { 2879 + WARN_ON_ONCE(!static_branch_unlikely(&scx_ops_enq_last)); 2880 + do_enqueue_task(rq, p, SCX_ENQ_LAST, -1); 2881 + } else { 2882 + do_enqueue_task(rq, p, 0, -1); 2883 + } 2884 + } 2885 + 2886 + if (next && next->sched_class != &ext_sched_class) 2887 + switch_class(rq, next); 2888 + } 2889 + 2890 + static struct task_struct *first_local_task(struct rq *rq) 2891 + { 2892 + return list_first_entry_or_null(&rq->scx.local_dsq.list, 2893 + struct task_struct, scx.dsq_list.node); 2894 + } 2895 + 2896 + static struct task_struct *pick_task_scx(struct rq *rq) 2897 + { 2898 + struct task_struct *prev = rq->curr; 2899 + struct task_struct *p; 2900 + 2901 + /* 2902 + * If balance_scx() is telling us to keep running @prev, replenish slice 2903 + * if necessary and keep running @prev. Otherwise, pop the first one 2904 + * from the local DSQ. 2905 + * 2906 + * WORKAROUND: 2907 + * 2908 + * %SCX_RQ_BAL_KEEP should be set iff @prev is on SCX as it must just 2909 + * have gone through balance_scx(). Unfortunately, there currently is a 2910 + * bug where fair could say yes on balance() but no on pick_task(), 2911 + * which then ends up calling pick_task_scx() without preceding 2912 + * balance_scx(). 2913 + * 2914 + * For now, ignore cases where @prev is not on SCX. This isn't great and 2915 + * can theoretically lead to stalls. However, for switch_all cases, this 2916 + * happens only while a BPF scheduler is being loaded or unloaded, and, 2917 + * for partial cases, fair will likely keep triggering this CPU. 2918 + * 2919 + * Once fair is fixed, restore WARN_ON_ONCE(). 
2920 + */ 2921 + if ((rq->scx.flags & SCX_RQ_BAL_KEEP) && 2922 + prev->sched_class == &ext_sched_class) { 2923 + p = prev; 2924 + if (!p->scx.slice) 2925 + p->scx.slice = SCX_SLICE_DFL; 2926 + } else { 2927 + p = first_local_task(rq); 2928 + if (!p) 2929 + return NULL; 2930 + 2931 + if (unlikely(!p->scx.slice)) { 2932 + if (!scx_rq_bypassing(rq) && !scx_warned_zero_slice) { 2933 + printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n", 2934 + p->comm, p->pid); 2935 + scx_warned_zero_slice = true; 2936 + } 2937 + p->scx.slice = SCX_SLICE_DFL; 2938 + } 2939 + } 2940 + 2941 + return p; 2942 + } 2943 + 2944 + #ifdef CONFIG_SCHED_CORE 2945 + /** 2946 + * scx_prio_less - Task ordering for core-sched 2947 + * @a: task A 2948 + * @b: task B 2949 + * 2950 + * Core-sched is implemented as an additional scheduling layer on top of the 2951 + * usual sched_class'es and needs to find out the expected task ordering. For 2952 + * SCX, core-sched calls this function to interrogate the task ordering. 2953 + * 2954 + * Unless overridden by ops.core_sched_before(), @p->scx.core_sched_at is used 2955 + * to implement the default task ordering. The older the timestamp, the higher 2956 + * priority the task - the global FIFO ordering matching the default scheduling 2957 + * behavior. 2958 + * 2959 + * When ops.core_sched_before() is enabled, @p->scx.core_sched_at is used to 2960 + * implement FIFO ordering within each local DSQ. See pick_task_scx(). 2961 + */ 2962 + bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, 2963 + bool in_fi) 2964 + { 2965 + /* 2966 + * The const qualifiers are dropped from task_struct pointers when 2967 + * calling ops.core_sched_before(). Accesses are controlled by the 2968 + * verifier. 
2969 + */ 2970 + if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a))) 2971 + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, 2972 + (struct task_struct *)a, 2973 + (struct task_struct *)b); 2974 + else 2975 + return time_after64(a->scx.core_sched_at, b->scx.core_sched_at); 2976 + } 2977 + #endif /* CONFIG_SCHED_CORE */ 2978 + 2979 + #ifdef CONFIG_SMP 2980 + 2981 + static bool test_and_clear_cpu_idle(int cpu) 2982 + { 2983 + #ifdef CONFIG_SCHED_SMT 2984 + /* 2985 + * SMT mask should be cleared whether we can claim @cpu or not. The SMT 2986 + * cluster is not wholly idle either way. This also prevents 2987 + * scx_pick_idle_cpu() from getting caught in an infinite loop. 2988 + */ 2989 + if (sched_smt_active()) { 2990 + const struct cpumask *smt = cpu_smt_mask(cpu); 2991 + 2992 + /* 2993 + * If offline, @cpu is not its own sibling and 2994 + * scx_pick_idle_cpu() can get caught in an infinite loop as 2995 + * @cpu is never cleared from idle_masks.smt. Ensure that @cpu 2996 + * is eventually cleared. 
2997 + */ 2998 + if (cpumask_intersects(smt, idle_masks.smt)) 2999 + cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); 3000 + else if (cpumask_test_cpu(cpu, idle_masks.smt)) 3001 + __cpumask_clear_cpu(cpu, idle_masks.smt); 3002 + } 3003 + #endif 3004 + return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); 3005 + } 3006 + 3007 + static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) 3008 + { 3009 + int cpu; 3010 + 3011 + retry: 3012 + if (sched_smt_active()) { 3013 + cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); 3014 + if (cpu < nr_cpu_ids) 3015 + goto found; 3016 + 3017 + if (flags & SCX_PICK_IDLE_CORE) 3018 + return -EBUSY; 3019 + } 3020 + 3021 + cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); 3022 + if (cpu >= nr_cpu_ids) 3023 + return -EBUSY; 3024 + 3025 + found: 3026 + if (test_and_clear_cpu_idle(cpu)) 3027 + return cpu; 3028 + else 3029 + goto retry; 3030 + } 3031 + 3032 + static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, 3033 + u64 wake_flags, bool *found) 3034 + { 3035 + s32 cpu; 3036 + 3037 + *found = false; 3038 + 3039 + if (!static_branch_likely(&scx_builtin_idle_enabled)) { 3040 + scx_ops_error("built-in idle tracking is disabled"); 3041 + return prev_cpu; 3042 + } 3043 + 3044 + /* 3045 + * If WAKE_SYNC, the waker's local DSQ is empty, and the system is 3046 + * underutilized, wake up @p to the local DSQ of the waker. Checking 3047 + * only for an empty local DSQ is insufficient as it could give the 3048 + * wakee an unfair advantage when the system is oversaturated. 3049 + * Checking only for the presence of idle CPUs is also insufficient as 3050 + * the local DSQ of the waker could have tasks piled up on it even if 3051 + * there is an idle core elsewhere on the system. 
3052 + */ 3053 + cpu = smp_processor_id(); 3054 + if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 && 3055 + !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING) && 3056 + cpu_rq(cpu)->scx.local_dsq.nr == 0) { 3057 + if (cpumask_test_cpu(cpu, p->cpus_ptr)) 3058 + goto cpu_found; 3059 + } 3060 + 3061 + if (p->nr_cpus_allowed == 1) { 3062 + if (test_and_clear_cpu_idle(prev_cpu)) { 3063 + cpu = prev_cpu; 3064 + goto cpu_found; 3065 + } else { 3066 + return prev_cpu; 3067 + } 3068 + } 3069 + 3070 + /* 3071 + * If CPU has SMT, any wholly idle CPU is likely a better pick than 3072 + * partially idle @prev_cpu. 3073 + */ 3074 + if (sched_smt_active()) { 3075 + if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && 3076 + test_and_clear_cpu_idle(prev_cpu)) { 3077 + cpu = prev_cpu; 3078 + goto cpu_found; 3079 + } 3080 + 3081 + cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); 3082 + if (cpu >= 0) 3083 + goto cpu_found; 3084 + } 3085 + 3086 + if (test_and_clear_cpu_idle(prev_cpu)) { 3087 + cpu = prev_cpu; 3088 + goto cpu_found; 3089 + } 3090 + 3091 + cpu = scx_pick_idle_cpu(p->cpus_ptr, 0); 3092 + if (cpu >= 0) 3093 + goto cpu_found; 3094 + 3095 + return prev_cpu; 3096 + 3097 + cpu_found: 3098 + *found = true; 3099 + return cpu; 3100 + } 3101 + 3102 + static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags) 3103 + { 3104 + /* 3105 + * sched_exec() calls with %WF_EXEC when @p is about to exec(2) as it 3106 + * can be a good migration opportunity with low cache and memory 3107 + * footprint. Returning a CPU different than @prev_cpu triggers 3108 + * immediate rq migration. However, for SCX, as the current rq 3109 + * association doesn't dictate where the task is going to run, this 3110 + * doesn't fit well. If necessary, we can later add a dedicated method 3111 + * which can decide to preempt self to force it through the regular 3112 + * scheduling path. 
3113 + */ 3114 + if (unlikely(wake_flags & WF_EXEC)) 3115 + return prev_cpu; 3116 + 3117 + if (SCX_HAS_OP(select_cpu)) { 3118 + s32 cpu; 3119 + struct task_struct **ddsp_taskp; 3120 + 3121 + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); 3122 + WARN_ON_ONCE(*ddsp_taskp); 3123 + *ddsp_taskp = p; 3124 + 3125 + cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, 3126 + select_cpu, p, prev_cpu, wake_flags); 3127 + *ddsp_taskp = NULL; 3128 + if (ops_cpu_valid(cpu, "from ops.select_cpu()")) 3129 + return cpu; 3130 + else 3131 + return prev_cpu; 3132 + } else { 3133 + bool found; 3134 + s32 cpu; 3135 + 3136 + cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found); 3137 + if (found) { 3138 + p->scx.slice = SCX_SLICE_DFL; 3139 + p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL; 3140 + } 3141 + return cpu; 3142 + } 3143 + } 3144 + 3145 + static void task_woken_scx(struct rq *rq, struct task_struct *p) 3146 + { 3147 + run_deferred(rq); 3148 + } 3149 + 3150 + static void set_cpus_allowed_scx(struct task_struct *p, 3151 + struct affinity_context *ac) 3152 + { 3153 + set_cpus_allowed_common(p, ac); 3154 + 3155 + /* 3156 + * The effective cpumask is stored in @p->cpus_ptr which may temporarily 3157 + * differ from the configured one in @p->cpus_mask. Always tell the bpf 3158 + * scheduler the effective one. 3159 + * 3160 + * Fine-grained memory write control is enforced by BPF making the const 3161 + * designation pointless. Cast it away when calling the operation. 3162 + */ 3163 + if (SCX_HAS_OP(set_cpumask)) 3164 + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, 3165 + (struct cpumask *)p->cpus_ptr); 3166 + } 3167 + 3168 + static void reset_idle_masks(void) 3169 + { 3170 + /* 3171 + * Consider all online cpus idle. Should converge to the actual state 3172 + * quickly. 
3173 + */ 3174 + cpumask_copy(idle_masks.cpu, cpu_online_mask); 3175 + cpumask_copy(idle_masks.smt, cpu_online_mask); 3176 + } 3177 + 3178 + void __scx_update_idle(struct rq *rq, bool idle) 3179 + { 3180 + int cpu = cpu_of(rq); 3181 + 3182 + if (SCX_HAS_OP(update_idle)) { 3183 + SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle); 3184 + if (!static_branch_unlikely(&scx_builtin_idle_enabled)) 3185 + return; 3186 + } 3187 + 3188 + if (idle) 3189 + cpumask_set_cpu(cpu, idle_masks.cpu); 3190 + else 3191 + cpumask_clear_cpu(cpu, idle_masks.cpu); 3192 + 3193 + #ifdef CONFIG_SCHED_SMT 3194 + if (sched_smt_active()) { 3195 + const struct cpumask *smt = cpu_smt_mask(cpu); 3196 + 3197 + if (idle) { 3198 + /* 3199 + * idle_masks.smt handling is racy but that's fine as 3200 + * it's only for optimization and self-correcting. 3201 + */ 3202 + for_each_cpu(cpu, smt) { 3203 + if (!cpumask_test_cpu(cpu, idle_masks.cpu)) 3204 + return; 3205 + } 3206 + cpumask_or(idle_masks.smt, idle_masks.smt, smt); 3207 + } else { 3208 + cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); 3209 + } 3210 + } 3211 + #endif 3212 + } 3213 + 3214 + static void handle_hotplug(struct rq *rq, bool online) 3215 + { 3216 + int cpu = cpu_of(rq); 3217 + 3218 + atomic_long_inc(&scx_hotplug_seq); 3219 + 3220 + if (online && SCX_HAS_OP(cpu_online)) 3221 + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); 3222 + else if (!online && SCX_HAS_OP(cpu_offline)) 3223 + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu); 3224 + else 3225 + scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, 3226 + "cpu %d going %s, exiting scheduler", cpu, 3227 + online ? 
"online" : "offline"); 3228 + } 3229 + 3230 + void scx_rq_activate(struct rq *rq) 3231 + { 3232 + handle_hotplug(rq, true); 3233 + } 3234 + 3235 + void scx_rq_deactivate(struct rq *rq) 3236 + { 3237 + handle_hotplug(rq, false); 3238 + } 3239 + 3240 + static void rq_online_scx(struct rq *rq) 3241 + { 3242 + rq->scx.flags |= SCX_RQ_ONLINE; 3243 + } 3244 + 3245 + static void rq_offline_scx(struct rq *rq) 3246 + { 3247 + rq->scx.flags &= ~SCX_RQ_ONLINE; 3248 + } 3249 + 3250 + #else /* CONFIG_SMP */ 3251 + 3252 + static bool test_and_clear_cpu_idle(int cpu) { return false; } 3253 + static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; } 3254 + static void reset_idle_masks(void) {} 3255 + 3256 + #endif /* CONFIG_SMP */ 3257 + 3258 + static bool check_rq_for_timeouts(struct rq *rq) 3259 + { 3260 + struct task_struct *p; 3261 + struct rq_flags rf; 3262 + bool timed_out = false; 3263 + 3264 + rq_lock_irqsave(rq, &rf); 3265 + list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node) { 3266 + unsigned long last_runnable = p->scx.runnable_at; 3267 + 3268 + if (unlikely(time_after(jiffies, 3269 + last_runnable + scx_watchdog_timeout))) { 3270 + u32 dur_ms = jiffies_to_msecs(jiffies - last_runnable); 3271 + 3272 + scx_ops_error_kind(SCX_EXIT_ERROR_STALL, 3273 + "%s[%d] failed to run for %u.%03us", 3274 + p->comm, p->pid, 3275 + dur_ms / 1000, dur_ms % 1000); 3276 + timed_out = true; 3277 + break; 3278 + } 3279 + } 3280 + rq_unlock_irqrestore(rq, &rf); 3281 + 3282 + return timed_out; 3283 + } 3284 + 3285 + static void scx_watchdog_workfn(struct work_struct *work) 3286 + { 3287 + int cpu; 3288 + 3289 + WRITE_ONCE(scx_watchdog_timestamp, jiffies); 3290 + 3291 + for_each_online_cpu(cpu) { 3292 + if (unlikely(check_rq_for_timeouts(cpu_rq(cpu)))) 3293 + break; 3294 + 3295 + cond_resched(); 3296 + } 3297 + queue_delayed_work(system_unbound_wq, to_delayed_work(work), 3298 + scx_watchdog_timeout / 2); 3299 + } 3300 + 3301 + void 
scx_tick(struct rq *rq) 3302 + { 3303 + unsigned long last_check; 3304 + 3305 + if (!scx_enabled()) 3306 + return; 3307 + 3308 + last_check = READ_ONCE(scx_watchdog_timestamp); 3309 + if (unlikely(time_after(jiffies, 3310 + last_check + READ_ONCE(scx_watchdog_timeout)))) { 3311 + u32 dur_ms = jiffies_to_msecs(jiffies - last_check); 3312 + 3313 + scx_ops_error_kind(SCX_EXIT_ERROR_STALL, 3314 + "watchdog failed to check in for %u.%03us", 3315 + dur_ms / 1000, dur_ms % 1000); 3316 + } 3317 + 3318 + update_other_load_avgs(rq); 3319 + } 3320 + 3321 + static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) 3322 + { 3323 + update_curr_scx(rq); 3324 + 3325 + /* 3326 + * While disabling, always resched and refresh core-sched timestamp as 3327 + * we can't trust the slice management or ops.core_sched_before(). 3328 + */ 3329 + if (scx_rq_bypassing(rq)) { 3330 + curr->scx.slice = 0; 3331 + touch_core_sched(rq, curr); 3332 + } else if (SCX_HAS_OP(tick)) { 3333 + SCX_CALL_OP(SCX_KF_REST, tick, curr); 3334 + } 3335 + 3336 + if (!curr->scx.slice) 3337 + resched_curr(rq); 3338 + } 3339 + 3340 + #ifdef CONFIG_EXT_GROUP_SCHED 3341 + static struct cgroup *tg_cgrp(struct task_group *tg) 3342 + { 3343 + /* 3344 + * If CGROUP_SCHED is disabled, @tg is NULL. If @tg is an autogroup, 3345 + * @tg->css.cgroup is NULL. In both cases, @tg can be treated as the 3346 + * root cgroup. 
3347 + */ 3348 + if (tg && tg->css.cgroup) 3349 + return tg->css.cgroup; 3350 + else 3351 + return &cgrp_dfl_root.cgrp; 3352 + } 3353 + 3354 + #define SCX_INIT_TASK_ARGS_CGROUP(tg) .cgroup = tg_cgrp(tg), 3355 + 3356 + #else /* CONFIG_EXT_GROUP_SCHED */ 3357 + 3358 + #define SCX_INIT_TASK_ARGS_CGROUP(tg) 3359 + 3360 + #endif /* CONFIG_EXT_GROUP_SCHED */ 3361 + 3362 + static enum scx_task_state scx_get_task_state(const struct task_struct *p) 3363 + { 3364 + return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT; 3365 + } 3366 + 3367 + static void scx_set_task_state(struct task_struct *p, enum scx_task_state state) 3368 + { 3369 + enum scx_task_state prev_state = scx_get_task_state(p); 3370 + bool warn = false; 3371 + 3372 + BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS)); 3373 + 3374 + switch (state) { 3375 + case SCX_TASK_NONE: 3376 + break; 3377 + case SCX_TASK_INIT: 3378 + warn = prev_state != SCX_TASK_NONE; 3379 + break; 3380 + case SCX_TASK_READY: 3381 + warn = prev_state == SCX_TASK_NONE; 3382 + break; 3383 + case SCX_TASK_ENABLED: 3384 + warn = prev_state != SCX_TASK_READY; 3385 + break; 3386 + default: 3387 + warn = true; 3388 + return; 3389 + } 3390 + 3391 + WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]", 3392 + prev_state, state, p->comm, p->pid); 3393 + 3394 + p->scx.flags &= ~SCX_TASK_STATE_MASK; 3395 + p->scx.flags |= state << SCX_TASK_STATE_SHIFT; 3396 + } 3397 + 3398 + static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool fork) 3399 + { 3400 + int ret; 3401 + 3402 + p->scx.disallow = false; 3403 + 3404 + if (SCX_HAS_OP(init_task)) { 3405 + struct scx_init_task_args args = { 3406 + SCX_INIT_TASK_ARGS_CGROUP(tg) 3407 + .fork = fork, 3408 + }; 3409 + 3410 + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args); 3411 + if (unlikely(ret)) { 3412 + ret = ops_sanitize_err("init_task", ret); 3413 + return ret; 3414 + } 3415 + } 3416 + 3417 + scx_set_task_state(p, 
SCX_TASK_INIT); 3418 + 3419 + if (p->scx.disallow) { 3420 + if (!fork) { 3421 + struct rq *rq; 3422 + struct rq_flags rf; 3423 + 3424 + rq = task_rq_lock(p, &rf); 3425 + 3426 + /* 3427 + * We're in the load path and @p->policy will be applied 3428 + * right after. Reverting @p->policy here and rejecting 3429 + * %SCHED_EXT transitions from scx_check_setscheduler() 3430 + * guarantees that if ops.init_task() sets @p->disallow, 3431 + * @p can never be in SCX. 3432 + */ 3433 + if (p->policy == SCHED_EXT) { 3434 + p->policy = SCHED_NORMAL; 3435 + atomic_long_inc(&scx_nr_rejected); 3436 + } 3437 + 3438 + task_rq_unlock(rq, p, &rf); 3439 + } else if (p->policy == SCHED_EXT) { 3440 + scx_ops_error("ops.init_task() set task->scx.disallow for %s[%d] during fork", 3441 + p->comm, p->pid); 3442 + } 3443 + } 3444 + 3445 + p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT; 3446 + return 0; 3447 + } 3448 + 3449 + static void scx_ops_enable_task(struct task_struct *p) 3450 + { 3451 + u32 weight; 3452 + 3453 + lockdep_assert_rq_held(task_rq(p)); 3454 + 3455 + /* 3456 + * Set the weight before calling ops.enable() so that the scheduler 3457 + * doesn't see a stale value if they inspect the task struct. 
3458 + */ 3459 + if (task_has_idle_policy(p)) 3460 + weight = WEIGHT_IDLEPRIO; 3461 + else 3462 + weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO]; 3463 + 3464 + p->scx.weight = sched_weight_to_cgroup(weight); 3465 + 3466 + if (SCX_HAS_OP(enable)) 3467 + SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); 3468 + scx_set_task_state(p, SCX_TASK_ENABLED); 3469 + 3470 + if (SCX_HAS_OP(set_weight)) 3471 + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); 3472 + } 3473 + 3474 + static void scx_ops_disable_task(struct task_struct *p) 3475 + { 3476 + lockdep_assert_rq_held(task_rq(p)); 3477 + WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED); 3478 + 3479 + if (SCX_HAS_OP(disable)) 3480 + SCX_CALL_OP(SCX_KF_REST, disable, p); 3481 + scx_set_task_state(p, SCX_TASK_READY); 3482 + } 3483 + 3484 + static void scx_ops_exit_task(struct task_struct *p) 3485 + { 3486 + struct scx_exit_task_args args = { 3487 + .cancelled = false, 3488 + }; 3489 + 3490 + lockdep_assert_rq_held(task_rq(p)); 3491 + 3492 + switch (scx_get_task_state(p)) { 3493 + case SCX_TASK_NONE: 3494 + return; 3495 + case SCX_TASK_INIT: 3496 + args.cancelled = true; 3497 + break; 3498 + case SCX_TASK_READY: 3499 + break; 3500 + case SCX_TASK_ENABLED: 3501 + scx_ops_disable_task(p); 3502 + break; 3503 + default: 3504 + WARN_ON_ONCE(true); 3505 + return; 3506 + } 3507 + 3508 + if (SCX_HAS_OP(exit_task)) 3509 + SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args); 3510 + scx_set_task_state(p, SCX_TASK_NONE); 3511 + } 3512 + 3513 + void init_scx_entity(struct sched_ext_entity *scx) 3514 + { 3515 + /* 3516 + * init_idle() calls this function again after fork sequence is 3517 + * complete. Don't touch ->tasks_node as it's already linked. 
3518 + */ 3519 + memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node)); 3520 + 3521 + INIT_LIST_HEAD(&scx->dsq_list.node); 3522 + RB_CLEAR_NODE(&scx->dsq_priq); 3523 + scx->sticky_cpu = -1; 3524 + scx->holding_cpu = -1; 3525 + INIT_LIST_HEAD(&scx->runnable_node); 3526 + scx->runnable_at = jiffies; 3527 + scx->ddsp_dsq_id = SCX_DSQ_INVALID; 3528 + scx->slice = SCX_SLICE_DFL; 3529 + } 3530 + 3531 + void scx_pre_fork(struct task_struct *p) 3532 + { 3533 + /* 3534 + * BPF scheduler enable/disable paths want to be able to iterate and 3535 + * update all tasks which can become complex when racing forks. As 3536 + * enable/disable are very cold paths, let's use a percpu_rwsem to 3537 + * exclude forks. 3538 + */ 3539 + percpu_down_read(&scx_fork_rwsem); 3540 + } 3541 + 3542 + int scx_fork(struct task_struct *p) 3543 + { 3544 + percpu_rwsem_assert_held(&scx_fork_rwsem); 3545 + 3546 + if (scx_enabled()) 3547 + return scx_ops_init_task(p, task_group(p), true); 3548 + else 3549 + return 0; 3550 + } 3551 + 3552 + void scx_post_fork(struct task_struct *p) 3553 + { 3554 + if (scx_enabled()) { 3555 + scx_set_task_state(p, SCX_TASK_READY); 3556 + 3557 + /* 3558 + * Enable the task immediately if it's running on sched_ext. 3559 + * Otherwise, it'll be enabled in switching_to_scx() if and 3560 + * when it's ever configured to run with a SCHED_EXT policy. 
3561 + */ 3562 + if (p->sched_class == &ext_sched_class) { 3563 + struct rq_flags rf; 3564 + struct rq *rq; 3565 + 3566 + rq = task_rq_lock(p, &rf); 3567 + scx_ops_enable_task(p); 3568 + task_rq_unlock(rq, p, &rf); 3569 + } 3570 + } 3571 + 3572 + spin_lock_irq(&scx_tasks_lock); 3573 + list_add_tail(&p->scx.tasks_node, &scx_tasks); 3574 + spin_unlock_irq(&scx_tasks_lock); 3575 + 3576 + percpu_up_read(&scx_fork_rwsem); 3577 + } 3578 + 3579 + void scx_cancel_fork(struct task_struct *p) 3580 + { 3581 + if (scx_enabled()) { 3582 + struct rq *rq; 3583 + struct rq_flags rf; 3584 + 3585 + rq = task_rq_lock(p, &rf); 3586 + WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY); 3587 + scx_ops_exit_task(p); 3588 + task_rq_unlock(rq, p, &rf); 3589 + } 3590 + 3591 + percpu_up_read(&scx_fork_rwsem); 3592 + } 3593 + 3594 + void sched_ext_free(struct task_struct *p) 3595 + { 3596 + unsigned long flags; 3597 + 3598 + spin_lock_irqsave(&scx_tasks_lock, flags); 3599 + list_del_init(&p->scx.tasks_node); 3600 + spin_unlock_irqrestore(&scx_tasks_lock, flags); 3601 + 3602 + /* 3603 + * @p is off scx_tasks and wholly ours. scx_ops_enable()'s READY -> 3604 + * ENABLED transitions can't race us. Disable ops for @p. 
3605 + */ 3606 + if (scx_get_task_state(p) != SCX_TASK_NONE) { 3607 + struct rq_flags rf; 3608 + struct rq *rq; 3609 + 3610 + rq = task_rq_lock(p, &rf); 3611 + scx_ops_exit_task(p); 3612 + task_rq_unlock(rq, p, &rf); 3613 + } 3614 + } 3615 + 3616 + static void reweight_task_scx(struct rq *rq, struct task_struct *p, 3617 + const struct load_weight *lw) 3618 + { 3619 + lockdep_assert_rq_held(task_rq(p)); 3620 + 3621 + p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); 3622 + if (SCX_HAS_OP(set_weight)) 3623 + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); 3624 + } 3625 + 3626 + static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) 3627 + { 3628 + } 3629 + 3630 + static void switching_to_scx(struct rq *rq, struct task_struct *p) 3631 + { 3632 + scx_ops_enable_task(p); 3633 + 3634 + /* 3635 + * set_cpus_allowed_scx() is not called while @p is associated with a 3636 + * different scheduler class. Keep the BPF scheduler up-to-date. 3637 + */ 3638 + if (SCX_HAS_OP(set_cpumask)) 3639 + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, 3640 + (struct cpumask *)p->cpus_ptr); 3641 + } 3642 + 3643 + static void switched_from_scx(struct rq *rq, struct task_struct *p) 3644 + { 3645 + scx_ops_disable_task(p); 3646 + } 3647 + 3648 + static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} 3649 + static void switched_to_scx(struct rq *rq, struct task_struct *p) {} 3650 + 3651 + int scx_check_setscheduler(struct task_struct *p, int policy) 3652 + { 3653 + lockdep_assert_rq_held(task_rq(p)); 3654 + 3655 + /* if disallow, reject transitioning into SCX */ 3656 + if (scx_enabled() && READ_ONCE(p->scx.disallow) && 3657 + p->policy != policy && policy == SCHED_EXT) 3658 + return -EACCES; 3659 + 3660 + return 0; 3661 + } 3662 + 3663 + #ifdef CONFIG_NO_HZ_FULL 3664 + bool scx_can_stop_tick(struct rq *rq) 3665 + { 3666 + struct task_struct *p = rq->curr; 3667 + 3668 + if (scx_rq_bypassing(rq)) 3669 + return 
false; 3670 + 3671 + if (p->sched_class != &ext_sched_class) 3672 + return true; 3673 + 3674 + /* 3675 + * @rq can dispatch from different DSQs, so we can't tell whether it 3676 + * needs the tick or not by looking at nr_running. Allow stopping ticks 3677 + * iff the BPF scheduler indicated so. See set_next_task_scx(). 3678 + */ 3679 + return rq->scx.flags & SCX_RQ_CAN_STOP_TICK; 3680 + } 3681 + #endif 3682 + 3683 + #ifdef CONFIG_EXT_GROUP_SCHED 3684 + 3685 + DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem); 3686 + static bool cgroup_warned_missing_weight; 3687 + static bool cgroup_warned_missing_idle; 3688 + 3689 + static void scx_cgroup_warn_missing_weight(struct task_group *tg) 3690 + { 3691 + if (scx_ops_enable_state() == SCX_OPS_DISABLED || 3692 + cgroup_warned_missing_weight) 3693 + return; 3694 + 3695 + if ((scx_ops.flags & SCX_OPS_HAS_CGROUP_WEIGHT) || !tg->css.parent) 3696 + return; 3697 + 3698 + pr_warn("sched_ext: \"%s\" does not implement cgroup cpu.weight\n", 3699 + scx_ops.name); 3700 + cgroup_warned_missing_weight = true; 3701 + } 3702 + 3703 + static void scx_cgroup_warn_missing_idle(struct task_group *tg) 3704 + { 3705 + if (scx_ops_enable_state() == SCX_OPS_DISABLED || 3706 + cgroup_warned_missing_idle) 3707 + return; 3708 + 3709 + if (!tg->idle) 3710 + return; 3711 + 3712 + pr_warn("sched_ext: \"%s\" does not implement cgroup cpu.idle\n", 3713 + scx_ops.name); 3714 + cgroup_warned_missing_idle = true; 3715 + } 3716 + 3717 + int scx_tg_online(struct task_group *tg) 3718 + { 3719 + int ret = 0; 3720 + 3721 + WARN_ON_ONCE(tg->scx_flags & (SCX_TG_ONLINE | SCX_TG_INITED)); 3722 + 3723 + percpu_down_read(&scx_cgroup_rwsem); 3724 + 3725 + scx_cgroup_warn_missing_weight(tg); 3726 + 3727 + if (SCX_HAS_OP(cgroup_init)) { 3728 + struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; 3729 + 3730 + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, 3731 + tg->css.cgroup, &args); 3732 + if (!ret) 3733 + tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED; 
3734 + else 3735 + ret = ops_sanitize_err("cgroup_init", ret); 3736 + } else { 3737 + tg->scx_flags |= SCX_TG_ONLINE; 3738 + } 3739 + 3740 + percpu_up_read(&scx_cgroup_rwsem); 3741 + return ret; 3742 + } 3743 + 3744 + void scx_tg_offline(struct task_group *tg) 3745 + { 3746 + WARN_ON_ONCE(!(tg->scx_flags & SCX_TG_ONLINE)); 3747 + 3748 + percpu_down_read(&scx_cgroup_rwsem); 3749 + 3750 + if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED)) 3751 + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup); 3752 + tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); 3753 + 3754 + percpu_up_read(&scx_cgroup_rwsem); 3755 + } 3756 + 3757 + int scx_cgroup_can_attach(struct cgroup_taskset *tset) 3758 + { 3759 + struct cgroup_subsys_state *css; 3760 + struct task_struct *p; 3761 + int ret; 3762 + 3763 + /* released in scx_finish/cancel_attach() */ 3764 + percpu_down_read(&scx_cgroup_rwsem); 3765 + 3766 + if (!scx_enabled()) 3767 + return 0; 3768 + 3769 + cgroup_taskset_for_each(p, css, tset) { 3770 + struct cgroup *from = tg_cgrp(task_group(p)); 3771 + struct cgroup *to = tg_cgrp(css_tg(css)); 3772 + 3773 + WARN_ON_ONCE(p->scx.cgrp_moving_from); 3774 + 3775 + /* 3776 + * sched_move_task() omits identity migrations. Let's match the 3777 + * behavior so that ops.cgroup_prep_move() and ops.cgroup_move() 3778 + * always match one-to-one. 
3779 + */ 3780 + if (from == to) 3781 + continue; 3782 + 3783 + if (SCX_HAS_OP(cgroup_prep_move)) { 3784 + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, 3785 + p, from, css->cgroup); 3786 + if (ret) 3787 + goto err; 3788 + } 3789 + 3790 + p->scx.cgrp_moving_from = from; 3791 + } 3792 + 3793 + return 0; 3794 + 3795 + err: 3796 + cgroup_taskset_for_each(p, css, tset) { 3797 + if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) 3798 + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, 3799 + p->scx.cgrp_moving_from, css->cgroup); 3800 + p->scx.cgrp_moving_from = NULL; 3801 + } 3802 + 3803 + percpu_up_read(&scx_cgroup_rwsem); 3804 + return ops_sanitize_err("cgroup_prep_move", ret); 3805 + } 3806 + 3807 + void scx_move_task(struct task_struct *p) 3808 + { 3809 + if (!scx_enabled()) 3810 + return; 3811 + 3812 + /* 3813 + * We're called from sched_move_task() which handles both cgroup and 3814 + * autogroup moves. Ignore the latter. 3815 + * 3816 + * Also ignore exiting tasks, because in the exit path tasks transition 3817 + * from the autogroup to the root group, so task_group_is_autogroup() 3818 + * alone isn't able to catch exiting autogroup tasks. This is safe for 3819 + * cgroup_move(), because cgroup migrations never happen for PF_EXITING 3820 + * tasks. 3821 + */ 3822 + if (task_group_is_autogroup(task_group(p)) || (p->flags & PF_EXITING)) 3823 + return; 3824 + 3825 + /* 3826 + * @p must have ops.cgroup_prep_move() called on it and thus 3827 + * cgrp_moving_from set. 
3828 + */ 3829 + if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) 3830 + SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p, 3831 + p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); 3832 + p->scx.cgrp_moving_from = NULL; 3833 + } 3834 + 3835 + void scx_cgroup_finish_attach(void) 3836 + { 3837 + percpu_up_read(&scx_cgroup_rwsem); 3838 + } 3839 + 3840 + void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) 3841 + { 3842 + struct cgroup_subsys_state *css; 3843 + struct task_struct *p; 3844 + 3845 + if (!scx_enabled()) 3846 + goto out_unlock; 3847 + 3848 + cgroup_taskset_for_each(p, css, tset) { 3849 + if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) 3850 + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, 3851 + p->scx.cgrp_moving_from, css->cgroup); 3852 + p->scx.cgrp_moving_from = NULL; 3853 + } 3854 + out_unlock: 3855 + percpu_up_read(&scx_cgroup_rwsem); 3856 + } 3857 + 3858 + void scx_group_set_weight(struct task_group *tg, unsigned long weight) 3859 + { 3860 + percpu_down_read(&scx_cgroup_rwsem); 3861 + 3862 + if (tg->scx_weight != weight) { 3863 + if (SCX_HAS_OP(cgroup_set_weight)) 3864 + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, 3865 + tg_cgrp(tg), weight); 3866 + tg->scx_weight = weight; 3867 + } 3868 + 3869 + percpu_up_read(&scx_cgroup_rwsem); 3870 + } 3871 + 3872 + void scx_group_set_idle(struct task_group *tg, bool idle) 3873 + { 3874 + percpu_down_read(&scx_cgroup_rwsem); 3875 + scx_cgroup_warn_missing_idle(tg); 3876 + percpu_up_read(&scx_cgroup_rwsem); 3877 + } 3878 + 3879 + static void scx_cgroup_lock(void) 3880 + { 3881 + percpu_down_write(&scx_cgroup_rwsem); 3882 + } 3883 + 3884 + static void scx_cgroup_unlock(void) 3885 + { 3886 + percpu_up_write(&scx_cgroup_rwsem); 3887 + } 3888 + 3889 + #else /* CONFIG_EXT_GROUP_SCHED */ 3890 + 3891 + static inline void scx_cgroup_lock(void) {} 3892 + static inline void scx_cgroup_unlock(void) {} 3893 + 3894 + #endif /* CONFIG_EXT_GROUP_SCHED */ 3895 + 3896 + /* 3897 + 
* Omitted operations: 3898 + * 3899 + * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task 3900 + * isn't tied to the CPU at that point. Preemption is implemented by resetting 3901 + * the victim task's slice to 0 and triggering reschedule on the target CPU. 3902 + * 3903 + * - migrate_task_rq: Unnecessary as task to cpu mapping is transient. 3904 + * 3905 + * - task_fork/dead: We need fork/dead notifications for all tasks regardless of 3906 + * their current sched_class. Call them directly from sched core instead. 3907 + */ 3908 + DEFINE_SCHED_CLASS(ext) = { 3909 + .enqueue_task = enqueue_task_scx, 3910 + .dequeue_task = dequeue_task_scx, 3911 + .yield_task = yield_task_scx, 3912 + .yield_to_task = yield_to_task_scx, 3913 + 3914 + .wakeup_preempt = wakeup_preempt_scx, 3915 + 3916 + .balance = balance_scx, 3917 + .pick_task = pick_task_scx, 3918 + 3919 + .put_prev_task = put_prev_task_scx, 3920 + .set_next_task = set_next_task_scx, 3921 + 3922 + #ifdef CONFIG_SMP 3923 + .select_task_rq = select_task_rq_scx, 3924 + .task_woken = task_woken_scx, 3925 + .set_cpus_allowed = set_cpus_allowed_scx, 3926 + 3927 + .rq_online = rq_online_scx, 3928 + .rq_offline = rq_offline_scx, 3929 + #endif 3930 + 3931 + .task_tick = task_tick_scx, 3932 + 3933 + .switching_to = switching_to_scx, 3934 + .switched_from = switched_from_scx, 3935 + .switched_to = switched_to_scx, 3936 + .reweight_task = reweight_task_scx, 3937 + .prio_changed = prio_changed_scx, 3938 + 3939 + .update_curr = update_curr_scx, 3940 + 3941 + #ifdef CONFIG_UCLAMP_TASK 3942 + .uclamp_enabled = 1, 3943 + #endif 3944 + }; 3945 + 3946 + static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id) 3947 + { 3948 + memset(dsq, 0, sizeof(*dsq)); 3949 + 3950 + raw_spin_lock_init(&dsq->lock); 3951 + INIT_LIST_HEAD(&dsq->list); 3952 + dsq->id = dsq_id; 3953 + } 3954 + 3955 + static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node) 3956 + { 3957 + struct scx_dispatch_q *dsq; 3958 + int ret; 
3959 + 3960 + if (dsq_id & SCX_DSQ_FLAG_BUILTIN) 3961 + return ERR_PTR(-EINVAL); 3962 + 3963 + dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node); 3964 + if (!dsq) 3965 + return ERR_PTR(-ENOMEM); 3966 + 3967 + init_dsq(dsq, dsq_id); 3968 + 3969 + ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node, 3970 + dsq_hash_params); 3971 + if (ret) { 3972 + kfree(dsq); 3973 + return ERR_PTR(ret); 3974 + } 3975 + return dsq; 3976 + } 3977 + 3978 + static void free_dsq_irq_workfn(struct irq_work *irq_work) 3979 + { 3980 + struct llist_node *to_free = llist_del_all(&dsqs_to_free); 3981 + struct scx_dispatch_q *dsq, *tmp_dsq; 3982 + 3983 + llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node) 3984 + kfree_rcu(dsq, rcu); 3985 + } 3986 + 3987 + static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn); 3988 + 3989 + static void destroy_dsq(u64 dsq_id) 3990 + { 3991 + struct scx_dispatch_q *dsq; 3992 + unsigned long flags; 3993 + 3994 + rcu_read_lock(); 3995 + 3996 + dsq = find_user_dsq(dsq_id); 3997 + if (!dsq) 3998 + goto out_unlock_rcu; 3999 + 4000 + raw_spin_lock_irqsave(&dsq->lock, flags); 4001 + 4002 + if (dsq->nr) { 4003 + scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)", 4004 + dsq->id, dsq->nr); 4005 + goto out_unlock_dsq; 4006 + } 4007 + 4008 + if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params)) 4009 + goto out_unlock_dsq; 4010 + 4011 + /* 4012 + * Mark dead by invalidating ->id to prevent dispatch_enqueue() from 4013 + * queueing more tasks. As this function can be called from anywhere, 4014 + * freeing is bounced through an irq work to avoid nesting RCU 4015 + * operations inside scheduler locks. 
4016 + */ 4017 + dsq->id = SCX_DSQ_INVALID; 4018 + llist_add(&dsq->free_node, &dsqs_to_free); 4019 + irq_work_queue(&free_dsq_irq_work); 4020 + 4021 + out_unlock_dsq: 4022 + raw_spin_unlock_irqrestore(&dsq->lock, flags); 4023 + out_unlock_rcu: 4024 + rcu_read_unlock(); 4025 + } 4026 + 4027 + #ifdef CONFIG_EXT_GROUP_SCHED 4028 + static void scx_cgroup_exit(void) 4029 + { 4030 + struct cgroup_subsys_state *css; 4031 + 4032 + percpu_rwsem_assert_held(&scx_cgroup_rwsem); 4033 + 4034 + /* 4035 + * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk 4036 + * cgroups and exit all the inited ones, all online cgroups are exited. 4037 + */ 4038 + rcu_read_lock(); 4039 + css_for_each_descendant_post(css, &root_task_group.css) { 4040 + struct task_group *tg = css_tg(css); 4041 + 4042 + if (!(tg->scx_flags & SCX_TG_INITED)) 4043 + continue; 4044 + tg->scx_flags &= ~SCX_TG_INITED; 4045 + 4046 + if (!scx_ops.cgroup_exit) 4047 + continue; 4048 + 4049 + if (WARN_ON_ONCE(!css_tryget(css))) 4050 + continue; 4051 + rcu_read_unlock(); 4052 + 4053 + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup); 4054 + 4055 + rcu_read_lock(); 4056 + css_put(css); 4057 + } 4058 + rcu_read_unlock(); 4059 + } 4060 + 4061 + static int scx_cgroup_init(void) 4062 + { 4063 + struct cgroup_subsys_state *css; 4064 + int ret; 4065 + 4066 + percpu_rwsem_assert_held(&scx_cgroup_rwsem); 4067 + 4068 + cgroup_warned_missing_weight = false; 4069 + cgroup_warned_missing_idle = false; 4070 + 4071 + /* 4072 + * scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk 4073 + * cgroups and init, all online cgroups are initialized. 
4074 + */ 4075 + rcu_read_lock(); 4076 + css_for_each_descendant_pre(css, &root_task_group.css) { 4077 + struct task_group *tg = css_tg(css); 4078 + struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; 4079 + 4080 + scx_cgroup_warn_missing_weight(tg); 4081 + scx_cgroup_warn_missing_idle(tg); 4082 + 4083 + if ((tg->scx_flags & 4084 + (SCX_TG_ONLINE | SCX_TG_INITED)) != SCX_TG_ONLINE) 4085 + continue; 4086 + 4087 + if (!scx_ops.cgroup_init) { 4088 + tg->scx_flags |= SCX_TG_INITED; 4089 + continue; 4090 + } 4091 + 4092 + if (WARN_ON_ONCE(!css_tryget(css))) 4093 + continue; 4094 + rcu_read_unlock(); 4095 + 4096 + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, 4097 + css->cgroup, &args); 4098 + if (ret) { 4099 + css_put(css); 4100 + return ret; 4101 + } 4102 + tg->scx_flags |= SCX_TG_INITED; 4103 + 4104 + rcu_read_lock(); 4105 + css_put(css); 4106 + } 4107 + rcu_read_unlock(); 4108 + 4109 + return 0; 4110 + } 4111 + 4112 + #else 4113 + static void scx_cgroup_exit(void) {} 4114 + static int scx_cgroup_init(void) { return 0; } 4115 + #endif 4116 + 4117 + 4118 + /******************************************************************************** 4119 + * Sysfs interface and ops enable/disable. 
4120 + */ 4121 + 4122 + #define SCX_ATTR(_name) \ 4123 + static struct kobj_attribute scx_attr_##_name = { \ 4124 + .attr = { .name = __stringify(_name), .mode = 0444 }, \ 4125 + .show = scx_attr_##_name##_show, \ 4126 + } 4127 + 4128 + static ssize_t scx_attr_state_show(struct kobject *kobj, 4129 + struct kobj_attribute *ka, char *buf) 4130 + { 4131 + return sysfs_emit(buf, "%s\n", 4132 + scx_ops_enable_state_str[scx_ops_enable_state()]); 4133 + } 4134 + SCX_ATTR(state); 4135 + 4136 + static ssize_t scx_attr_switch_all_show(struct kobject *kobj, 4137 + struct kobj_attribute *ka, char *buf) 4138 + { 4139 + return sysfs_emit(buf, "%d\n", READ_ONCE(scx_switching_all)); 4140 + } 4141 + SCX_ATTR(switch_all); 4142 + 4143 + static ssize_t scx_attr_nr_rejected_show(struct kobject *kobj, 4144 + struct kobj_attribute *ka, char *buf) 4145 + { 4146 + return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_nr_rejected)); 4147 + } 4148 + SCX_ATTR(nr_rejected); 4149 + 4150 + static ssize_t scx_attr_hotplug_seq_show(struct kobject *kobj, 4151 + struct kobj_attribute *ka, char *buf) 4152 + { 4153 + return sysfs_emit(buf, "%ld\n", atomic_long_read(&scx_hotplug_seq)); 4154 + } 4155 + SCX_ATTR(hotplug_seq); 4156 + 4157 + static struct attribute *scx_global_attrs[] = { 4158 + &scx_attr_state.attr, 4159 + &scx_attr_switch_all.attr, 4160 + &scx_attr_nr_rejected.attr, 4161 + &scx_attr_hotplug_seq.attr, 4162 + NULL, 4163 + }; 4164 + 4165 + static const struct attribute_group scx_global_attr_group = { 4166 + .attrs = scx_global_attrs, 4167 + }; 4168 + 4169 + static void scx_kobj_release(struct kobject *kobj) 4170 + { 4171 + kfree(kobj); 4172 + } 4173 + 4174 + static ssize_t scx_attr_ops_show(struct kobject *kobj, 4175 + struct kobj_attribute *ka, char *buf) 4176 + { 4177 + return sysfs_emit(buf, "%s\n", scx_ops.name); 4178 + } 4179 + SCX_ATTR(ops); 4180 + 4181 + static struct attribute *scx_sched_attrs[] = { 4182 + &scx_attr_ops.attr, 4183 + NULL, 4184 + }; 4185 + 
ATTRIBUTE_GROUPS(scx_sched); 4186 + 4187 + static const struct kobj_type scx_ktype = { 4188 + .release = scx_kobj_release, 4189 + .sysfs_ops = &kobj_sysfs_ops, 4190 + .default_groups = scx_sched_groups, 4191 + }; 4192 + 4193 + static int scx_uevent(const struct kobject *kobj, struct kobj_uevent_env *env) 4194 + { 4195 + return add_uevent_var(env, "SCXOPS=%s", scx_ops.name); 4196 + } 4197 + 4198 + static const struct kset_uevent_ops scx_uevent_ops = { 4199 + .uevent = scx_uevent, 4200 + }; 4201 + 4202 + /* 4203 + * Used by sched_fork() and __setscheduler_prio() to pick the matching 4204 + * sched_class. dl/rt are already handled. 4205 + */ 4206 + bool task_should_scx(struct task_struct *p) 4207 + { 4208 + if (!scx_enabled() || 4209 + unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING)) 4210 + return false; 4211 + if (READ_ONCE(scx_switching_all)) 4212 + return true; 4213 + return p->policy == SCHED_EXT; 4214 + } 4215 + 4216 + /** 4217 + * scx_ops_bypass - [Un]bypass scx_ops and guarantee forward progress 4218 + * 4219 + * Bypassing guarantees that all runnable tasks make forward progress without 4220 + * trusting the BPF scheduler. We can't grab any mutexes or rwsems as they might 4221 + * be held by tasks that the BPF scheduler is forgetting to run, which 4222 + * unfortunately also excludes toggling the static branches. 4223 + * 4224 + * Let's work around by overriding a couple ops and modifying behaviors based on 4225 + * the DISABLING state and then cycling the queued tasks through dequeue/enqueue 4226 + * to force global FIFO scheduling. 4227 + * 4228 + * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order. 4229 + * %SCX_OPS_ENQ_LAST is also ignored. 4230 + * 4231 + * b. ops.dispatch() is ignored. 4232 + * 4233 + * c. balance_scx() does not set %SCX_RQ_BAL_KEEP on non-zero slice as slice 4234 + * can't be trusted. Whenever a tick triggers, the running task is rotated to 4235 + * the tail of the queue with core_sched_at touched. 
4236 + * 4237 + * d. pick_next_task() suppresses zero slice warning. 4238 + * 4239 + * e. scx_bpf_kick_cpu() is disabled to avoid irq_work malfunction during PM 4240 + * operations. 4241 + * 4242 + * f. scx_prio_less() reverts to the default core_sched_at order. 4243 + */ 4244 + static void scx_ops_bypass(bool bypass) 4245 + { 4246 + int depth, cpu; 4247 + 4248 + if (bypass) { 4249 + depth = atomic_inc_return(&scx_ops_bypass_depth); 4250 + WARN_ON_ONCE(depth <= 0); 4251 + if (depth != 1) 4252 + return; 4253 + } else { 4254 + depth = atomic_dec_return(&scx_ops_bypass_depth); 4255 + WARN_ON_ONCE(depth < 0); 4256 + if (depth != 0) 4257 + return; 4258 + } 4259 + 4260 + /* 4261 + * No task property is changing. We just need to make sure all currently 4262 + * queued tasks are re-queued according to the new scx_rq_bypassing() 4263 + * state. As an optimization, walk each rq's runnable_list instead of 4264 + * the scx_tasks list. 4265 + * 4266 + * This function can't trust the scheduler and thus can't use 4267 + * cpus_read_lock(). Walk all possible CPUs instead of online. 4268 + */ 4269 + for_each_possible_cpu(cpu) { 4270 + struct rq *rq = cpu_rq(cpu); 4271 + struct rq_flags rf; 4272 + struct task_struct *p, *n; 4273 + 4274 + rq_lock_irqsave(rq, &rf); 4275 + 4276 + if (bypass) { 4277 + WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING); 4278 + rq->scx.flags |= SCX_RQ_BYPASSING; 4279 + } else { 4280 + WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING)); 4281 + rq->scx.flags &= ~SCX_RQ_BYPASSING; 4282 + } 4283 + 4284 + /* 4285 + * We need to guarantee that no tasks are on the BPF scheduler 4286 + * while bypassing. Either we see enabled or the enable path 4287 + * sees scx_rq_bypassing() before moving tasks to SCX. 
		 */
		if (!scx_enabled()) {
			rq_unlock_irqrestore(rq, &rf);
			continue;
		}

		/*
		 * The use of list_for_each_entry_safe_reverse() is required
		 * because each task is going to be removed from and added back
		 * to the runnable_list during iteration. Because they're added
		 * to the tail of the list, safe reverse iteration can still
		 * visit all nodes.
		 */
		list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list,
						 scx.runnable_node) {
			struct sched_enq_and_set_ctx ctx;

			/* cycling deq/enq is enough, see the function comment */
			sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
			sched_enq_and_set_task(&ctx);
		}

		rq_unlock_irqrestore(rq, &rf);

		/* kick to restore ticks */
		resched_cpu(cpu);
	}
}

static void free_exit_info(struct scx_exit_info *ei)
{
	kfree(ei->dump);
	kfree(ei->msg);
	kfree(ei->bt);
	kfree(ei);
}

static struct scx_exit_info *alloc_exit_info(size_t exit_dump_len)
{
	struct scx_exit_info *ei;

	ei = kzalloc(sizeof(*ei), GFP_KERNEL);
	if (!ei)
		return NULL;

	ei->bt = kcalloc(SCX_EXIT_BT_LEN, sizeof(ei->bt[0]), GFP_KERNEL);
	ei->msg = kzalloc(SCX_EXIT_MSG_LEN, GFP_KERNEL);
	ei->dump = kzalloc(exit_dump_len, GFP_KERNEL);

	if (!ei->bt || !ei->msg || !ei->dump) {
		free_exit_info(ei);
		return NULL;
	}

	return ei;
}

static const char *scx_exit_reason(enum scx_exit_kind kind)
{
	switch (kind) {
	case SCX_EXIT_UNREG:
		return "unregistered from user space";
	case SCX_EXIT_UNREG_BPF:
		return "unregistered from BPF";
	case SCX_EXIT_UNREG_KERN:
		return "unregistered from the main kernel";
	case SCX_EXIT_SYSRQ:
		return "disabled by sysrq-S";
	case SCX_EXIT_ERROR:
		return "runtime error";
	case SCX_EXIT_ERROR_BPF:
		return "scx_bpf_error";
	case SCX_EXIT_ERROR_STALL:
		return "runnable task stall";
	default:
		return "<UNKNOWN>";
	}
}

static void scx_ops_disable_workfn(struct kthread_work *work)
{
	struct scx_exit_info *ei = scx_exit_info;
	struct scx_task_iter sti;
	struct task_struct *p;
	struct rhashtable_iter rht_iter;
	struct scx_dispatch_q *dsq;
	int i, kind;

	kind = atomic_read(&scx_exit_kind);
	while (true) {
		/*
		 * NONE indicates that a new scx_ops has been registered since
		 * disable was scheduled - don't kill the new ops. DONE
		 * indicates that the ops has already been disabled.
		 */
		if (kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)
			return;
		if (atomic_try_cmpxchg(&scx_exit_kind, &kind, SCX_EXIT_DONE))
			break;
	}
	ei->kind = kind;
	ei->reason = scx_exit_reason(ei->kind);

	/* guarantee forward progress by bypassing scx_ops */
	scx_ops_bypass(true);

	switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) {
	case SCX_OPS_DISABLING:
		WARN_ONCE(true, "sched_ext: duplicate disabling instance?");
		break;
	case SCX_OPS_DISABLED:
		pr_warn("sched_ext: ops error detected without ops (%s)\n",
			scx_exit_info->msg);
		WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
			     SCX_OPS_DISABLING);
		goto done;
	default:
		break;
	}

	/*
	 * Here, every runnable task is guaranteed to make forward progress and
	 * we can safely use blocking synchronization constructs. Actually
	 * disable ops.
	 */
	mutex_lock(&scx_ops_enable_mutex);

	static_branch_disable(&__scx_switched_all);
	WRITE_ONCE(scx_switching_all, false);

	/*
	 * Avoid racing against fork and cgroup changes. See scx_ops_enable()
	 * for explanation on the locking order.
	 */
	percpu_down_write(&scx_fork_rwsem);
	cpus_read_lock();
	scx_cgroup_lock();

	spin_lock_irq(&scx_tasks_lock);
	scx_task_iter_init(&sti);
	/*
	 * The BPF scheduler is going away. All tasks including %TASK_DEAD ones
	 * must be switched out and exited synchronously.
	 */
	while ((p = scx_task_iter_next_locked(&sti))) {
		const struct sched_class *old_class = p->sched_class;
		struct sched_enq_and_set_ctx ctx;

		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);

		p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
		__setscheduler_prio(p, p->prio);
		check_class_changing(task_rq(p), p, old_class);

		sched_enq_and_set_task(&ctx);

		check_class_changed(task_rq(p), p, old_class, p->prio);
		scx_ops_exit_task(p);
	}
	scx_task_iter_exit(&sti);
	spin_unlock_irq(&scx_tasks_lock);

	/* no task is on scx, turn off all the switches and flush in-progress calls */
	static_branch_disable_cpuslocked(&__scx_ops_enabled);
	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
		static_branch_disable_cpuslocked(&scx_has_op[i]);
	static_branch_disable_cpuslocked(&scx_ops_enq_last);
	static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
	static_branch_disable_cpuslocked(&scx_ops_cpu_preempt);
	static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
	synchronize_rcu();

	scx_cgroup_exit();

	scx_cgroup_unlock();
	cpus_read_unlock();
	percpu_up_write(&scx_fork_rwsem);

	if (ei->kind >= SCX_EXIT_ERROR) {
		pr_err("sched_ext: BPF scheduler \"%s\" disabled (%s)\n",
		       scx_ops.name, ei->reason);

		if (ei->msg[0] != '\0')
			pr_err("sched_ext: %s: %s\n", scx_ops.name, ei->msg);

		stack_trace_print(ei->bt, ei->bt_len, 2);
	} else {
		pr_info("sched_ext: BPF scheduler \"%s\" disabled (%s)\n",
			scx_ops.name, ei->reason);
	}

	if (scx_ops.exit)
		SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);

	cancel_delayed_work_sync(&scx_watchdog_work);

	/*
	 * Delete the kobject from the hierarchy eagerly in addition to just
	 * dropping a reference. Otherwise, if the object is deleted
	 * asynchronously, sysfs could observe an object of the same name still
	 * in the hierarchy when another scheduler is loaded.
	 */
	kobject_del(scx_root_kobj);
	kobject_put(scx_root_kobj);
	scx_root_kobj = NULL;

	memset(&scx_ops, 0, sizeof(scx_ops));

	rhashtable_walk_enter(&dsq_hash, &rht_iter);
	do {
		rhashtable_walk_start(&rht_iter);

		while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq))
			destroy_dsq(dsq->id);

		rhashtable_walk_stop(&rht_iter);
	} while (dsq == ERR_PTR(-EAGAIN));
	rhashtable_walk_exit(&rht_iter);

	free_percpu(scx_dsp_ctx);
	scx_dsp_ctx = NULL;
	scx_dsp_max_batch = 0;

	free_exit_info(scx_exit_info);
	scx_exit_info = NULL;

	mutex_unlock(&scx_ops_enable_mutex);

	WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
		     SCX_OPS_DISABLING);
done:
	scx_ops_bypass(false);
}

static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);

static void schedule_scx_ops_disable_work(void)
{
	struct kthread_worker *helper = READ_ONCE(scx_ops_helper);

	/*
	 * We may be called spuriously before the first bpf_sched_ext_reg(). If
	 * scx_ops_helper isn't set up yet, there's nothing to do.
	 */
	if (helper)
		kthread_queue_work(helper, &scx_ops_disable_work);
}

static void scx_ops_disable(enum scx_exit_kind kind)
{
	int none = SCX_EXIT_NONE;

	if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
		kind = SCX_EXIT_ERROR;

	atomic_try_cmpxchg(&scx_exit_kind, &none, kind);

	schedule_scx_ops_disable_work();
}

static void dump_newline(struct seq_buf *s)
{
	trace_sched_ext_dump("");

	/* @s may be zero sized and seq_buf triggers WARN if so */
	if (s->size)
		seq_buf_putc(s, '\n');
}

static __printf(2, 3) void dump_line(struct seq_buf *s, const char *fmt, ...)
{
	va_list args;

#ifdef CONFIG_TRACEPOINTS
	if (trace_sched_ext_dump_enabled()) {
		/* protected by scx_dump_state()::dump_lock */
		static char line_buf[SCX_EXIT_MSG_LEN];

		va_start(args, fmt);
		vscnprintf(line_buf, sizeof(line_buf), fmt, args);
		va_end(args);

		trace_sched_ext_dump(line_buf);
	}
#endif
	/* @s may be zero sized and seq_buf triggers WARN if so */
	if (s->size) {
		va_start(args, fmt);
		seq_buf_vprintf(s, fmt, args);
		va_end(args);

		seq_buf_putc(s, '\n');
	}
}

static void dump_stack_trace(struct seq_buf *s, const char *prefix,
			     const unsigned long *bt, unsigned int len)
{
	unsigned int i;

	for (i = 0; i < len; i++)
		dump_line(s, "%s%pS", prefix, (void *)bt[i]);
}

static void ops_dump_init(struct seq_buf *s, const char *prefix)
{
	struct scx_dump_data *dd = &scx_dump_data;

	lockdep_assert_irqs_disabled();

	dd->cpu = smp_processor_id(); /* allow scx_bpf_dump() */
	dd->first = true;
	dd->cursor = 0;
	dd->s = s;
	dd->prefix = prefix;
}

static void ops_dump_flush(void)
{
	struct scx_dump_data *dd = &scx_dump_data;
	char *line = dd->buf.line;

	if (!dd->cursor)
		return;

	/*
	 * There's something to flush and this is the first line. Insert a blank
	 * line to distinguish ops dump.
	 */
	if (dd->first) {
		dump_newline(dd->s);
		dd->first = false;
	}

	/*
	 * There may be multiple lines in $line. Scan and emit each line
	 * separately.
	 */
	while (true) {
		char *end = line;
		char c;

		while (*end != '\n' && *end != '\0')
			end++;

		/*
		 * If $line overflowed, it may not have newline at the end.
		 * Always emit with a newline.
		 */
		c = *end;
		*end = '\0';
		dump_line(dd->s, "%s%s", dd->prefix, line);
		if (c == '\0')
			break;

		/* move to the next line */
		end++;
		if (*end == '\0')
			break;
		line = end;
	}

	dd->cursor = 0;
}

static void ops_dump_exit(void)
{
	ops_dump_flush();
	scx_dump_data.cpu = -1;
}

static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx,
			  struct task_struct *p, char marker)
{
	static unsigned long bt[SCX_EXIT_BT_LEN];
	char dsq_id_buf[19] = "(n/a)";
	unsigned long ops_state = atomic_long_read(&p->scx.ops_state);
	unsigned int bt_len = 0;

	if (p->scx.dsq)
		scnprintf(dsq_id_buf, sizeof(dsq_id_buf), "0x%llx",
			  (unsigned long long)p->scx.dsq->id);

	dump_newline(s);
	dump_line(s, " %c%c %s[%d] %+ldms",
		  marker, task_state_to_char(p), p->comm, p->pid,
		  jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies));
	dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu",
		  scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK,
		  p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK,
		  ops_state >> SCX_OPSS_QSEQ_SHIFT);
	dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s dsq_vtime=%llu",
		  p->scx.sticky_cpu, p->scx.holding_cpu, dsq_id_buf,
		  p->scx.dsq_vtime);
	dump_line(s, " cpus=%*pb", cpumask_pr_args(p->cpus_ptr));

	if (SCX_HAS_OP(dump_task)) {
		ops_dump_init(s, " ");
		SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p);
		ops_dump_exit();
	}

#ifdef CONFIG_STACKTRACE
	bt_len = stack_trace_save_tsk(p, bt, SCX_EXIT_BT_LEN, 1);
#endif
	if (bt_len) {
		dump_newline(s);
		dump_stack_trace(s, " ", bt, bt_len);
	}
}

static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
{
	static DEFINE_SPINLOCK(dump_lock);
	static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n";
	struct scx_dump_ctx dctx = {
		.kind = ei->kind,
		.exit_code = ei->exit_code,
		.reason = ei->reason,
		.at_ns = ktime_get_ns(),
		.at_jiffies = jiffies,
	};
	struct seq_buf s;
	unsigned long flags;
	char *buf;
	int cpu;

	spin_lock_irqsave(&dump_lock, flags);

	seq_buf_init(&s, ei->dump, dump_len);

	if (ei->kind == SCX_EXIT_NONE) {
		dump_line(&s, "Debug dump triggered by %s", ei->reason);
	} else {
		dump_line(&s, "%s[%d] triggered exit kind %d:",
			  current->comm, current->pid, ei->kind);
		dump_line(&s, " %s (%s)", ei->reason, ei->msg);
		dump_newline(&s);
		dump_line(&s, "Backtrace:");
		dump_stack_trace(&s, " ", ei->bt, ei->bt_len);
	}

	if (SCX_HAS_OP(dump)) {
		ops_dump_init(&s, "");
		SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx);
		ops_dump_exit();
	}

	dump_newline(&s);
	dump_line(&s, "CPU states");
	dump_line(&s, "----------");

	for_each_possible_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);
		struct rq_flags rf;
		struct task_struct *p;
		struct seq_buf ns;
		size_t avail, used;
		bool idle;

		rq_lock(rq, &rf);

		idle = list_empty(&rq->scx.runnable_list) &&
			rq->curr->sched_class == &idle_sched_class;

		if (idle && !SCX_HAS_OP(dump_cpu))
			goto next;

		/*
		 * We don't yet know whether ops.dump_cpu() will produce output
		 * and we may want to skip the default CPU dump if it doesn't.
		 * Use a nested seq_buf to generate the standard dump so that we
		 * can decide whether to commit later.
		 */
		avail = seq_buf_get_buf(&s, &buf);
		seq_buf_init(&ns, buf, avail);

		dump_newline(&ns);
		dump_line(&ns, "CPU %-4d: nr_run=%u flags=0x%x cpu_rel=%d ops_qseq=%lu pnt_seq=%lu",
			  cpu, rq->scx.nr_running, rq->scx.flags,
			  rq->scx.cpu_released, rq->scx.ops_qseq,
			  rq->scx.pnt_seq);
		dump_line(&ns, " curr=%s[%d] class=%ps",
			  rq->curr->comm, rq->curr->pid,
			  rq->curr->sched_class);
		if (!cpumask_empty(rq->scx.cpus_to_kick))
			dump_line(&ns, " cpus_to_kick : %*pb",
				  cpumask_pr_args(rq->scx.cpus_to_kick));
		if (!cpumask_empty(rq->scx.cpus_to_kick_if_idle))
			dump_line(&ns, " idle_to_kick : %*pb",
				  cpumask_pr_args(rq->scx.cpus_to_kick_if_idle));
		if (!cpumask_empty(rq->scx.cpus_to_preempt))
			dump_line(&ns, " cpus_to_preempt: %*pb",
				  cpumask_pr_args(rq->scx.cpus_to_preempt));
		if (!cpumask_empty(rq->scx.cpus_to_wait))
			dump_line(&ns, " cpus_to_wait : %*pb",
				  cpumask_pr_args(rq->scx.cpus_to_wait));

		used = seq_buf_used(&ns);
		if (SCX_HAS_OP(dump_cpu)) {
			ops_dump_init(&ns, " ");
			SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle);
			ops_dump_exit();
		}

		/*
		 * If idle && nothing generated by ops.dump_cpu(), there's
		 * nothing interesting. Skip.
		 */
		if (idle && used == seq_buf_used(&ns))
			goto next;

		/*
		 * $s may already have overflowed when $ns was created. If so,
		 * calling commit on it will trigger BUG.
		 */
		if (avail) {
			seq_buf_commit(&s, seq_buf_used(&ns));
			if (seq_buf_has_overflowed(&ns))
				seq_buf_set_overflow(&s);
		}

		if (rq->curr->sched_class == &ext_sched_class)
			scx_dump_task(&s, &dctx, rq->curr, '*');

		list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node)
			scx_dump_task(&s, &dctx, p, ' ');
	next:
		rq_unlock(rq, &rf);
	}

	if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
		memcpy(ei->dump + dump_len - sizeof(trunc_marker),
		       trunc_marker, sizeof(trunc_marker));

	spin_unlock_irqrestore(&dump_lock, flags);
}

static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
{
	struct scx_exit_info *ei = scx_exit_info;

	if (ei->kind >= SCX_EXIT_ERROR)
		scx_dump_state(ei, scx_ops.exit_dump_len);

	schedule_scx_ops_disable_work();
}

static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);

static __printf(3, 4) void scx_ops_exit_kind(enum scx_exit_kind kind,
					     s64 exit_code,
					     const char *fmt, ...)
{
	struct scx_exit_info *ei = scx_exit_info;
	int none = SCX_EXIT_NONE;
	va_list args;

	if (!atomic_try_cmpxchg(&scx_exit_kind, &none, kind))
		return;

	ei->exit_code = exit_code;

	if (kind >= SCX_EXIT_ERROR)
		ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);

	va_start(args, fmt);
	vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
	va_end(args);

	/*
	 * Set ei->kind and ->reason for scx_dump_state(). They'll be set again
	 * in scx_ops_disable_workfn().
	 */
	ei->kind = kind;
	ei->reason = scx_exit_reason(ei->kind);

	irq_work_queue(&scx_ops_error_irq_work);
}

static struct kthread_worker *scx_create_rt_helper(const char *name)
{
	struct kthread_worker *helper;

	helper = kthread_create_worker(0, name);
	if (helper)
		sched_set_fifo(helper->task);
	return helper;
}

static void check_hotplug_seq(const struct sched_ext_ops *ops)
{
	unsigned long long global_hotplug_seq;

	/*
	 * If a hotplug event has occurred between when a scheduler was
	 * initialized, and when we were able to attach, exit and notify user
	 * space about it.
	 */
	if (ops->hotplug_seq) {
		global_hotplug_seq = atomic_long_read(&scx_hotplug_seq);
		if (ops->hotplug_seq != global_hotplug_seq) {
			scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
				     "expected hotplug seq %llu did not match actual %llu",
				     ops->hotplug_seq, global_hotplug_seq);
		}
	}
}

static int validate_ops(const struct sched_ext_ops *ops)
{
	/*
	 * It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
	 * ops.enqueue() callback isn't implemented.
	 */
	if ((ops->flags & SCX_OPS_ENQ_LAST) && !ops->enqueue) {
		scx_ops_error("SCX_OPS_ENQ_LAST requires ops.enqueue() to be implemented");
		return -EINVAL;
	}

	return 0;
}

static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
{
	struct scx_task_iter sti;
	struct task_struct *p;
	unsigned long timeout;
	int i, cpu, ret;

	if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
			   cpu_possible_mask)) {
		pr_err("sched_ext: Not compatible with \"isolcpus=\" domain isolation");
		return -EINVAL;
	}

	mutex_lock(&scx_ops_enable_mutex);

	if (!scx_ops_helper) {
		WRITE_ONCE(scx_ops_helper,
			   scx_create_rt_helper("sched_ext_ops_helper"));
		if (!scx_ops_helper) {
			ret = -ENOMEM;
			goto err_unlock;
		}
	}

	if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
		ret = -EBUSY;
		goto err_unlock;
	}

	scx_root_kobj = kzalloc(sizeof(*scx_root_kobj), GFP_KERNEL);
	if (!scx_root_kobj) {
		ret = -ENOMEM;
		goto err_unlock;
	}

	scx_root_kobj->kset = scx_kset;
	ret = kobject_init_and_add(scx_root_kobj, &scx_ktype, NULL, "root");
	if (ret < 0)
		goto err;

	scx_exit_info = alloc_exit_info(ops->exit_dump_len);
	if (!scx_exit_info) {
		ret = -ENOMEM;
		goto err_del;
	}

	/*
	 * Set scx_ops, transition to PREPPING and clear exit info to arm the
	 * disable path. Failure triggers full disabling from here on.
	 */
	scx_ops = *ops;

	WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
		     SCX_OPS_DISABLED);

	atomic_set(&scx_exit_kind, SCX_EXIT_NONE);
	scx_warned_zero_slice = false;

	atomic_long_set(&scx_nr_rejected, 0);

	for_each_possible_cpu(cpu)
		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;

	/*
	 * Keep CPUs stable during enable so that the BPF scheduler can track
	 * online CPUs by watching ->on/offline_cpu() after ->init().
	 */
	cpus_read_lock();

	if (scx_ops.init) {
		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init);
		if (ret) {
			ret = ops_sanitize_err("init", ret);
			goto err_disable_unlock_cpus;
		}
	}

	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
		if (((void (**)(void))ops)[i])
			static_branch_enable_cpuslocked(&scx_has_op[i]);

	cpus_read_unlock();

	ret = validate_ops(ops);
	if (ret)
		goto err_disable;

	WARN_ON_ONCE(scx_dsp_ctx);
	scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
	scx_dsp_ctx = __alloc_percpu(struct_size_t(struct scx_dsp_ctx, buf,
						   scx_dsp_max_batch),
				     __alignof__(struct scx_dsp_ctx));
	if (!scx_dsp_ctx) {
		ret = -ENOMEM;
		goto err_disable;
	}

	if (ops->timeout_ms)
		timeout = msecs_to_jiffies(ops->timeout_ms);
	else
		timeout = SCX_WATCHDOG_MAX_TIMEOUT;

	WRITE_ONCE(scx_watchdog_timeout, timeout);
	WRITE_ONCE(scx_watchdog_timestamp, jiffies);
	queue_delayed_work(system_unbound_wq, &scx_watchdog_work,
			   scx_watchdog_timeout / 2);

	/*
	 * Lock out forks, cgroup on/offlining and moves before opening the
	 * floodgate so that they don't wander into the operations prematurely.
	 *
	 * We don't need to keep the CPUs stable but static_branch_*() requires
	 * cpus_read_lock() and scx_cgroup_rwsem must nest inside
	 * cpu_hotplug_lock because of the following dependency chain:
	 *
	 *   cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem
	 *
	 * So, we need to do cpus_read_lock() before scx_cgroup_lock() and use
	 * static_branch_*_cpuslocked().
	 *
	 * Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the
	 * following dependency chain:
	 *
	 *   scx_fork_rwsem --> pernet_ops_rwsem --> cpu_hotplug_lock
	 */
	percpu_down_write(&scx_fork_rwsem);
	cpus_read_lock();
	scx_cgroup_lock();

	check_hotplug_seq(ops);

	for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
		if (((void (**)(void))ops)[i])
			static_branch_enable_cpuslocked(&scx_has_op[i]);

	if (ops->flags & SCX_OPS_ENQ_LAST)
		static_branch_enable_cpuslocked(&scx_ops_enq_last);

	if (ops->flags & SCX_OPS_ENQ_EXITING)
		static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
	if (scx_ops.cpu_acquire || scx_ops.cpu_release)
		static_branch_enable_cpuslocked(&scx_ops_cpu_preempt);

	if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
		reset_idle_masks();
		static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
	} else {
		static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
	}

	/*
	 * All cgroups should be initialized before letting in tasks. cgroup
	 * on/offlining and task migrations are already locked out.
	 */
	ret = scx_cgroup_init();
	if (ret)
		goto err_disable_unlock_all;

	static_branch_enable_cpuslocked(&__scx_ops_enabled);

	/*
	 * Enable ops for every task. Fork is excluded by scx_fork_rwsem
	 * preventing new tasks from being added. No need to exclude tasks
	 * leaving as sched_ext_free() can handle both prepped and enabled
	 * tasks. Prep all tasks first and then enable them with preemption
	 * disabled.
	 */
	spin_lock_irq(&scx_tasks_lock);

	scx_task_iter_init(&sti);
	while ((p = scx_task_iter_next_locked(&sti))) {
		/*
		 * @p may already be dead, have lost all its usages counts and
		 * be waiting for RCU grace period before being freed. @p can't
		 * be initialized for SCX in such cases and should be ignored.
		 */
		if (!tryget_task_struct(p))
			continue;

		scx_task_iter_rq_unlock(&sti);
		spin_unlock_irq(&scx_tasks_lock);

		ret = scx_ops_init_task(p, task_group(p), false);
		if (ret) {
			put_task_struct(p);
			spin_lock_irq(&scx_tasks_lock);
			scx_task_iter_exit(&sti);
			spin_unlock_irq(&scx_tasks_lock);
			pr_err("sched_ext: ops.init_task() failed (%d) for %s[%d] while loading\n",
			       ret, p->comm, p->pid);
			goto err_disable_unlock_all;
		}

		put_task_struct(p);
		spin_lock_irq(&scx_tasks_lock);
	}
	scx_task_iter_exit(&sti);

	/*
	 * All tasks are prepped but are still ops-disabled. Ensure that
	 * %current can't be scheduled out and switch everyone.
	 * preempt_disable() is necessary because we can't guarantee that
	 * %current won't be starved if scheduled out while switching.
	 */
	preempt_disable();

	/*
	 * From here on, the disable path must assume that tasks have ops
	 * enabled and need to be recovered.
	 *
	 * Transition to ENABLING fails iff the BPF scheduler has already
	 * triggered scx_bpf_error(). Returning an error code here would lose
	 * the recorded error information. Exit indicating success so that the
	 * error is notified through ops.exit() with all the details.
	 */
	if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
		preempt_enable();
		spin_unlock_irq(&scx_tasks_lock);
		WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
		ret = 0;
		goto err_disable_unlock_all;
	}

	/*
	 * We're fully committed and can't fail. The PREPPED -> ENABLED
	 * transitions here are synchronized against sched_ext_free() through
	 * scx_tasks_lock.
	 */
	WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));

	scx_task_iter_init(&sti);
	while ((p = scx_task_iter_next_locked(&sti))) {
		const struct sched_class *old_class = p->sched_class;
		struct sched_enq_and_set_ctx ctx;

		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);

		scx_set_task_state(p, SCX_TASK_READY);
		__setscheduler_prio(p, p->prio);
		check_class_changing(task_rq(p), p, old_class);

		sched_enq_and_set_task(&ctx);

		check_class_changed(task_rq(p), p, old_class, p->prio);
	}
	scx_task_iter_exit(&sti);

	spin_unlock_irq(&scx_tasks_lock);
	preempt_enable();
	scx_cgroup_unlock();
	cpus_read_unlock();
	percpu_up_write(&scx_fork_rwsem);

	/* see above ENABLING transition for the explanation on exiting with 0 */
	if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
		WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
		ret = 0;
		goto err_disable;
	}

	if (!(ops->flags & SCX_OPS_SWITCH_PARTIAL))
		static_branch_enable(&__scx_switched_all);

	pr_info("sched_ext: BPF scheduler \"%s\" enabled%s\n",
		scx_ops.name, scx_switched_all() ? "" : " (partial)");
	kobject_uevent(scx_root_kobj, KOBJ_ADD);
	mutex_unlock(&scx_ops_enable_mutex);

	return 0;

err_del:
	kobject_del(scx_root_kobj);
err:
	kobject_put(scx_root_kobj);
	scx_root_kobj = NULL;
	if (scx_exit_info) {
		free_exit_info(scx_exit_info);
		scx_exit_info = NULL;
	}
err_unlock:
	mutex_unlock(&scx_ops_enable_mutex);
	return ret;

err_disable_unlock_all:
	scx_cgroup_unlock();
	percpu_up_write(&scx_fork_rwsem);
err_disable_unlock_cpus:
	cpus_read_unlock();
err_disable:
	mutex_unlock(&scx_ops_enable_mutex);
	/* must be fully disabled before returning */
	scx_ops_disable(SCX_EXIT_ERROR);
	kthread_flush_work(&scx_ops_disable_work);
	return ret;
}


/********************************************************************************
 * bpf_struct_ops plumbing.
 */
#include <linux/bpf_verifier.h>
#include <linux/bpf.h>
#include <linux/btf.h>

extern struct btf *btf_vmlinux;
static const struct btf_type *task_struct_type;
static u32 task_struct_type_id;

static bool set_arg_maybe_null(const char *op, int arg_n, int off, int size,
			       enum bpf_access_type type,
			       const struct bpf_prog *prog,
			       struct bpf_insn_access_aux *info)
{
	struct btf *btf = bpf_get_btf_vmlinux();
	const struct bpf_struct_ops_desc *st_ops_desc;
	const struct btf_member *member;
	const struct btf_type *t;
	u32 btf_id, member_idx;
	const char *mname;

	/* struct_ops op args are all sequential, 64-bit numbers */
	if (off != arg_n * sizeof(__u64))
		return false;

	/* btf_id should be the type id of struct sched_ext_ops */
	btf_id = prog->aux->attach_btf_id;
	st_ops_desc = bpf_struct_ops_find(btf, btf_id);
	if (!st_ops_desc)
		return false;

	/* BTF type of struct sched_ext_ops */
	t = st_ops_desc->type;

	member_idx = prog->expected_attach_type;
	if (member_idx >= btf_type_vlen(t))
		return false;

	/*
	 * Get the member name of this struct_ops program, which corresponds to
	 * a field in struct sched_ext_ops. For example, the member name of the
	 * dispatch struct_ops program (callback) is "dispatch".
	 */
	member = &btf_type_member(t)[member_idx];
	mname = btf_name_by_offset(btf_vmlinux, member->name_off);

	if (!strcmp(mname, op)) {
		/*
		 * The value is a pointer to a type (struct task_struct) given
		 * by a BTF ID (PTR_TO_BTF_ID). It is trusted (PTR_TRUSTED),
		 * however, can be a NULL (PTR_MAYBE_NULL). The BPF program
		 * should check the pointer to make sure it is not NULL before
		 * using it, or the verifier will reject the program.
		 *
		 * Longer term, this is something that should be addressed by
		 * BTF, and be fully contained within the verifier.
		 */
		info->reg_type = PTR_MAYBE_NULL | PTR_TO_BTF_ID | PTR_TRUSTED;
		info->btf = btf_vmlinux;
		info->btf_id = task_struct_type_id;

		return true;
	}

	return false;
}

static bool bpf_scx_is_valid_access(int off, int size,
				    enum bpf_access_type type,
				    const struct bpf_prog *prog,
				    struct bpf_insn_access_aux *info)
{
	if (type != BPF_READ)
		return false;
	if (set_arg_maybe_null("dispatch", 1, off, size, type, prog, info) ||
	    set_arg_maybe_null("yield", 1, off, size, type, prog, info))
		return true;
	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
		return false;
	if (off % size != 0)
		return false;

	return btf_ctx_access(off, size, type, prog, info);
}

static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
				     const struct bpf_reg_state *reg, int off,
				     int size)
{
	const struct btf_type *t;

	t = btf_type_by_id(reg->btf, reg->btf_id);
	if (t == task_struct_type) {
		if (off >= offsetof(struct task_struct, scx.slice) &&
		    off + size <= offsetofend(struct task_struct, scx.slice))
			return SCALAR_VALUE;
		if (off >= offsetof(struct task_struct, scx.dsq_vtime) &&
		    off + size <= offsetofend(struct task_struct, scx.dsq_vtime))
			return SCALAR_VALUE;
		if (off >= offsetof(struct task_struct, scx.disallow) &&
		    off + size <= offsetofend(struct task_struct, scx.disallow))
			return SCALAR_VALUE;
	}

	return -EACCES;
}

static const struct bpf_func_proto *
bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
	switch (func_id) {
	case BPF_FUNC_task_storage_get:
		return &bpf_task_storage_get_proto;
	case BPF_FUNC_task_storage_delete:
		return &bpf_task_storage_delete_proto;
default: 5326 + return bpf_base_func_proto(func_id, prog); 5327 + } 5328 + } 5329 + 5330 + static const struct bpf_verifier_ops bpf_scx_verifier_ops = { 5331 + .get_func_proto = bpf_scx_get_func_proto, 5332 + .is_valid_access = bpf_scx_is_valid_access, 5333 + .btf_struct_access = bpf_scx_btf_struct_access, 5334 + }; 5335 + 5336 + static int bpf_scx_init_member(const struct btf_type *t, 5337 + const struct btf_member *member, 5338 + void *kdata, const void *udata) 5339 + { 5340 + const struct sched_ext_ops *uops = udata; 5341 + struct sched_ext_ops *ops = kdata; 5342 + u32 moff = __btf_member_bit_offset(t, member) / 8; 5343 + int ret; 5344 + 5345 + switch (moff) { 5346 + case offsetof(struct sched_ext_ops, dispatch_max_batch): 5347 + if (*(u32 *)(udata + moff) > INT_MAX) 5348 + return -E2BIG; 5349 + ops->dispatch_max_batch = *(u32 *)(udata + moff); 5350 + return 1; 5351 + case offsetof(struct sched_ext_ops, flags): 5352 + if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS) 5353 + return -EINVAL; 5354 + ops->flags = *(u64 *)(udata + moff); 5355 + return 1; 5356 + case offsetof(struct sched_ext_ops, name): 5357 + ret = bpf_obj_name_cpy(ops->name, uops->name, 5358 + sizeof(ops->name)); 5359 + if (ret < 0) 5360 + return ret; 5361 + if (ret == 0) 5362 + return -EINVAL; 5363 + return 1; 5364 + case offsetof(struct sched_ext_ops, timeout_ms): 5365 + if (msecs_to_jiffies(*(u32 *)(udata + moff)) > 5366 + SCX_WATCHDOG_MAX_TIMEOUT) 5367 + return -E2BIG; 5368 + ops->timeout_ms = *(u32 *)(udata + moff); 5369 + return 1; 5370 + case offsetof(struct sched_ext_ops, exit_dump_len): 5371 + ops->exit_dump_len = 5372 + *(u32 *)(udata + moff) ?: SCX_EXIT_DUMP_DFL_LEN; 5373 + return 1; 5374 + case offsetof(struct sched_ext_ops, hotplug_seq): 5375 + ops->hotplug_seq = *(u64 *)(udata + moff); 5376 + return 1; 5377 + } 5378 + 5379 + return 0; 5380 + } 5381 + 5382 + static int bpf_scx_check_member(const struct btf_type *t, 5383 + const struct btf_member *member, 5384 + const struct bpf_prog 
*prog) 5385 + { 5386 + u32 moff = __btf_member_bit_offset(t, member) / 8; 5387 + 5388 + switch (moff) { 5389 + case offsetof(struct sched_ext_ops, init_task): 5390 + #ifdef CONFIG_EXT_GROUP_SCHED 5391 + case offsetof(struct sched_ext_ops, cgroup_init): 5392 + case offsetof(struct sched_ext_ops, cgroup_exit): 5393 + case offsetof(struct sched_ext_ops, cgroup_prep_move): 5394 + #endif 5395 + case offsetof(struct sched_ext_ops, cpu_online): 5396 + case offsetof(struct sched_ext_ops, cpu_offline): 5397 + case offsetof(struct sched_ext_ops, init): 5398 + case offsetof(struct sched_ext_ops, exit): 5399 + break; 5400 + default: 5401 + if (prog->sleepable) 5402 + return -EINVAL; 5403 + } 5404 + 5405 + return 0; 5406 + } 5407 + 5408 + static int bpf_scx_reg(void *kdata, struct bpf_link *link) 5409 + { 5410 + return scx_ops_enable(kdata, link); 5411 + } 5412 + 5413 + static void bpf_scx_unreg(void *kdata, struct bpf_link *link) 5414 + { 5415 + scx_ops_disable(SCX_EXIT_UNREG); 5416 + kthread_flush_work(&scx_ops_disable_work); 5417 + } 5418 + 5419 + static int bpf_scx_init(struct btf *btf) 5420 + { 5421 + s32 type_id; 5422 + 5423 + type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT); 5424 + if (type_id < 0) 5425 + return -EINVAL; 5426 + task_struct_type = btf_type_by_id(btf, type_id); 5427 + task_struct_type_id = type_id; 5428 + 5429 + return 0; 5430 + } 5431 + 5432 + static int bpf_scx_update(void *kdata, void *old_kdata, struct bpf_link *link) 5433 + { 5434 + /* 5435 + * sched_ext does not support updating the actively-loaded BPF 5436 + * scheduler, as registering a BPF scheduler can always fail if the 5437 + * scheduler returns an error code for e.g. ops.init(), ops.init_task(), 5438 + * etc. Similarly, we can always race with unregistration happening 5439 + * elsewhere, such as with sysrq. 
5440 + */ 5441 + return -EOPNOTSUPP; 5442 + } 5443 + 5444 + static int bpf_scx_validate(void *kdata) 5445 + { 5446 + return 0; 5447 + } 5448 + 5449 + static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) { return -EINVAL; } 5450 + static void enqueue_stub(struct task_struct *p, u64 enq_flags) {} 5451 + static void dequeue_stub(struct task_struct *p, u64 enq_flags) {} 5452 + static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {} 5453 + static void tick_stub(struct task_struct *p) {} 5454 + static void runnable_stub(struct task_struct *p, u64 enq_flags) {} 5455 + static void running_stub(struct task_struct *p) {} 5456 + static void stopping_stub(struct task_struct *p, bool runnable) {} 5457 + static void quiescent_stub(struct task_struct *p, u64 deq_flags) {} 5458 + static bool yield_stub(struct task_struct *from, struct task_struct *to) { return false; } 5459 + static bool core_sched_before_stub(struct task_struct *a, struct task_struct *b) { return false; } 5460 + static void set_weight_stub(struct task_struct *p, u32 weight) {} 5461 + static void set_cpumask_stub(struct task_struct *p, const struct cpumask *mask) {} 5462 + static void update_idle_stub(s32 cpu, bool idle) {} 5463 + static void cpu_acquire_stub(s32 cpu, struct scx_cpu_acquire_args *args) {} 5464 + static void cpu_release_stub(s32 cpu, struct scx_cpu_release_args *args) {} 5465 + static s32 init_task_stub(struct task_struct *p, struct scx_init_task_args *args) { return -EINVAL; } 5466 + static void exit_task_stub(struct task_struct *p, struct scx_exit_task_args *args) {} 5467 + static void enable_stub(struct task_struct *p) {} 5468 + static void disable_stub(struct task_struct *p) {} 5469 + #ifdef CONFIG_EXT_GROUP_SCHED 5470 + static s32 cgroup_init_stub(struct cgroup *cgrp, struct scx_cgroup_init_args *args) { return -EINVAL; } 5471 + static void cgroup_exit_stub(struct cgroup *cgrp) {} 5472 + static s32 cgroup_prep_move_stub(struct task_struct *p, struct cgroup 
*from, struct cgroup *to) { return -EINVAL; } 5473 + static void cgroup_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {} 5474 + static void cgroup_cancel_move_stub(struct task_struct *p, struct cgroup *from, struct cgroup *to) {} 5475 + static void cgroup_set_weight_stub(struct cgroup *cgrp, u32 weight) {} 5476 + #endif 5477 + static void cpu_online_stub(s32 cpu) {} 5478 + static void cpu_offline_stub(s32 cpu) {} 5479 + static s32 init_stub(void) { return -EINVAL; } 5480 + static void exit_stub(struct scx_exit_info *info) {} 5481 + static void dump_stub(struct scx_dump_ctx *ctx) {} 5482 + static void dump_cpu_stub(struct scx_dump_ctx *ctx, s32 cpu, bool idle) {} 5483 + static void dump_task_stub(struct scx_dump_ctx *ctx, struct task_struct *p) {} 5484 + 5485 + static struct sched_ext_ops __bpf_ops_sched_ext_ops = { 5486 + .select_cpu = select_cpu_stub, 5487 + .enqueue = enqueue_stub, 5488 + .dequeue = dequeue_stub, 5489 + .dispatch = dispatch_stub, 5490 + .tick = tick_stub, 5491 + .runnable = runnable_stub, 5492 + .running = running_stub, 5493 + .stopping = stopping_stub, 5494 + .quiescent = quiescent_stub, 5495 + .yield = yield_stub, 5496 + .core_sched_before = core_sched_before_stub, 5497 + .set_weight = set_weight_stub, 5498 + .set_cpumask = set_cpumask_stub, 5499 + .update_idle = update_idle_stub, 5500 + .cpu_acquire = cpu_acquire_stub, 5501 + .cpu_release = cpu_release_stub, 5502 + .init_task = init_task_stub, 5503 + .exit_task = exit_task_stub, 5504 + .enable = enable_stub, 5505 + .disable = disable_stub, 5506 + #ifdef CONFIG_EXT_GROUP_SCHED 5507 + .cgroup_init = cgroup_init_stub, 5508 + .cgroup_exit = cgroup_exit_stub, 5509 + .cgroup_prep_move = cgroup_prep_move_stub, 5510 + .cgroup_move = cgroup_move_stub, 5511 + .cgroup_cancel_move = cgroup_cancel_move_stub, 5512 + .cgroup_set_weight = cgroup_set_weight_stub, 5513 + #endif 5514 + .cpu_online = cpu_online_stub, 5515 + .cpu_offline = cpu_offline_stub, 5516 + .init = init_stub, 5517 
+ .exit = exit_stub, 5518 + .dump = dump_stub, 5519 + .dump_cpu = dump_cpu_stub, 5520 + .dump_task = dump_task_stub, 5521 + }; 5522 + 5523 + static struct bpf_struct_ops bpf_sched_ext_ops = { 5524 + .verifier_ops = &bpf_scx_verifier_ops, 5525 + .reg = bpf_scx_reg, 5526 + .unreg = bpf_scx_unreg, 5527 + .check_member = bpf_scx_check_member, 5528 + .init_member = bpf_scx_init_member, 5529 + .init = bpf_scx_init, 5530 + .update = bpf_scx_update, 5531 + .validate = bpf_scx_validate, 5532 + .name = "sched_ext_ops", 5533 + .owner = THIS_MODULE, 5534 + .cfi_stubs = &__bpf_ops_sched_ext_ops 5535 + }; 5536 + 5537 + 5538 + /******************************************************************************** 5539 + * System integration and init. 5540 + */ 5541 + 5542 + static void sysrq_handle_sched_ext_reset(u8 key) 5543 + { 5544 + if (scx_ops_helper) 5545 + scx_ops_disable(SCX_EXIT_SYSRQ); 5546 + else 5547 + pr_info("sched_ext: BPF scheduler not yet used\n"); 5548 + } 5549 + 5550 + static const struct sysrq_key_op sysrq_sched_ext_reset_op = { 5551 + .handler = sysrq_handle_sched_ext_reset, 5552 + .help_msg = "reset-sched-ext(S)", 5553 + .action_msg = "Disable sched_ext and revert all tasks to CFS", 5554 + .enable_mask = SYSRQ_ENABLE_RTNICE, 5555 + }; 5556 + 5557 + static void sysrq_handle_sched_ext_dump(u8 key) 5558 + { 5559 + struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" }; 5560 + 5561 + if (scx_enabled()) 5562 + scx_dump_state(&ei, 0); 5563 + } 5564 + 5565 + static const struct sysrq_key_op sysrq_sched_ext_dump_op = { 5566 + .handler = sysrq_handle_sched_ext_dump, 5567 + .help_msg = "dump-sched-ext(D)", 5568 + .action_msg = "Trigger sched_ext debug dump", 5569 + .enable_mask = SYSRQ_ENABLE_RTNICE, 5570 + }; 5571 + 5572 + static bool can_skip_idle_kick(struct rq *rq) 5573 + { 5574 + lockdep_assert_rq_held(rq); 5575 + 5576 + /* 5577 + * We can skip idle kicking if @rq is going to go through at least one 5578 + * full SCX scheduling cycle before going 
idle. Just checking whether 5579 + * curr is not idle is insufficient because we could be racing 5580 + * balance_one() trying to pull the next task from a remote rq, which 5581 + * may fail, and @rq may become idle afterwards. 5582 + * 5583 + * The race window is small and we don't and can't guarantee that @rq is 5584 + * only kicked while idle anyway. Skip only when sure. 5585 + */ 5586 + return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE); 5587 + } 5588 + 5589 + static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs) 5590 + { 5591 + struct rq *rq = cpu_rq(cpu); 5592 + struct scx_rq *this_scx = &this_rq->scx; 5593 + bool should_wait = false; 5594 + unsigned long flags; 5595 + 5596 + raw_spin_rq_lock_irqsave(rq, flags); 5597 + 5598 + /* 5599 + * During CPU hotplug, a CPU may depend on kicking itself to make 5600 + * forward progress. Allow kicking self regardless of online state. 5601 + */ 5602 + if (cpu_online(cpu) || cpu == cpu_of(this_rq)) { 5603 + if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) { 5604 + if (rq->curr->sched_class == &ext_sched_class) 5605 + rq->curr->scx.slice = 0; 5606 + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); 5607 + } 5608 + 5609 + if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) { 5610 + pseqs[cpu] = rq->scx.pnt_seq; 5611 + should_wait = true; 5612 + } 5613 + 5614 + resched_curr(rq); 5615 + } else { 5616 + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); 5617 + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); 5618 + } 5619 + 5620 + raw_spin_rq_unlock_irqrestore(rq, flags); 5621 + 5622 + return should_wait; 5623 + } 5624 + 5625 + static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq) 5626 + { 5627 + struct rq *rq = cpu_rq(cpu); 5628 + unsigned long flags; 5629 + 5630 + raw_spin_rq_lock_irqsave(rq, flags); 5631 + 5632 + if (!can_skip_idle_kick(rq) && 5633 + (cpu_online(cpu) || cpu == cpu_of(this_rq))) 5634 + resched_curr(rq); 5635 + 5636 + raw_spin_rq_unlock_irqrestore(rq, 
flags); 5637 + } 5638 + 5639 + static void kick_cpus_irq_workfn(struct irq_work *irq_work) 5640 + { 5641 + struct rq *this_rq = this_rq(); 5642 + struct scx_rq *this_scx = &this_rq->scx; 5643 + unsigned long *pseqs = this_cpu_ptr(scx_kick_cpus_pnt_seqs); 5644 + bool should_wait = false; 5645 + s32 cpu; 5646 + 5647 + for_each_cpu(cpu, this_scx->cpus_to_kick) { 5648 + should_wait |= kick_one_cpu(cpu, this_rq, pseqs); 5649 + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick); 5650 + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); 5651 + } 5652 + 5653 + for_each_cpu(cpu, this_scx->cpus_to_kick_if_idle) { 5654 + kick_one_cpu_if_idle(cpu, this_rq); 5655 + cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); 5656 + } 5657 + 5658 + if (!should_wait) 5659 + return; 5660 + 5661 + for_each_cpu(cpu, this_scx->cpus_to_wait) { 5662 + unsigned long *wait_pnt_seq = &cpu_rq(cpu)->scx.pnt_seq; 5663 + 5664 + if (cpu != cpu_of(this_rq)) { 5665 + /* 5666 + * Pairs with smp_store_release() issued by this CPU in 5667 + * scx_next_task_picked() on the resched path. 5668 + * 5669 + * We busy-wait here to guarantee that no other task can 5670 + * be scheduled on our core before the target CPU has 5671 + * entered the resched path. 5672 + */ 5673 + while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) 5674 + cpu_relax(); 5675 + } 5676 + 5677 + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); 5678 + } 5679 + } 5680 + 5681 + /** 5682 + * print_scx_info - print out sched_ext scheduler state 5683 + * @log_lvl: the log level to use when printing 5684 + * @p: target task 5685 + * 5686 + * If a sched_ext scheduler is enabled, print the name and state of the 5687 + * scheduler. If @p is on sched_ext, print further information about the task. 5688 + * 5689 + * This function can be safely called on any task as long as the task_struct 5690 + * itself is accessible. While safe, this function isn't synchronized and may 5691 + * print mixed-up or garbled output of limited length. 
5692 + */ 5693 + void print_scx_info(const char *log_lvl, struct task_struct *p) 5694 + { 5695 + enum scx_ops_enable_state state = scx_ops_enable_state(); 5696 + const char *all = READ_ONCE(scx_switching_all) ? "+all" : ""; 5697 + char runnable_at_buf[22] = "?"; 5698 + struct sched_class *class; 5699 + unsigned long runnable_at; 5700 + 5701 + if (state == SCX_OPS_DISABLED) 5702 + return; 5703 + 5704 + /* 5705 + * Carefully check if the task was running on sched_ext, and then 5706 + * carefully copy the time it's been runnable, and its state. 5707 + */ 5708 + if (copy_from_kernel_nofault(&class, &p->sched_class, sizeof(class)) || 5709 + class != &ext_sched_class) { 5710 + printk("%sSched_ext: %s (%s%s)", log_lvl, scx_ops.name, 5711 + scx_ops_enable_state_str[state], all); 5712 + return; 5713 + } 5714 + 5715 + if (!copy_from_kernel_nofault(&runnable_at, &p->scx.runnable_at, 5716 + sizeof(runnable_at))) 5717 + scnprintf(runnable_at_buf, sizeof(runnable_at_buf), "%+ldms", 5718 + jiffies_delta_msecs(runnable_at, jiffies)); 5719 + 5720 + /* print everything onto one line to conserve console space */ 5721 + printk("%sSched_ext: %s (%s%s), task: runnable_at=%s", 5722 + log_lvl, scx_ops.name, scx_ops_enable_state_str[state], all, 5723 + runnable_at_buf); 5724 + } 5725 + 5726 + static int scx_pm_handler(struct notifier_block *nb, unsigned long event, void *ptr) 5727 + { 5728 + /* 5729 + * SCX schedulers often have userspace components which are sometimes 5730 + * involved in critical scheduling paths. PM operations involve freezing 5731 + * userspace which can lead to scheduling misbehaviors including stalls. 5732 + * Let's bypass while PM operations are in progress. 
5733 + */ 5734 + switch (event) { 5735 + case PM_HIBERNATION_PREPARE: 5736 + case PM_SUSPEND_PREPARE: 5737 + case PM_RESTORE_PREPARE: 5738 + scx_ops_bypass(true); 5739 + break; 5740 + case PM_POST_HIBERNATION: 5741 + case PM_POST_SUSPEND: 5742 + case PM_POST_RESTORE: 5743 + scx_ops_bypass(false); 5744 + break; 5745 + } 5746 + 5747 + return NOTIFY_OK; 5748 + } 5749 + 5750 + static struct notifier_block scx_pm_notifier = { 5751 + .notifier_call = scx_pm_handler, 5752 + }; 5753 + 5754 + void __init init_sched_ext_class(void) 5755 + { 5756 + s32 cpu, v; 5757 + 5758 + /* 5759 + * The following is to prevent the compiler from optimizing out the enum 5760 + * definitions so that BPF scheduler implementations can use them 5761 + * through the generated vmlinux.h. 5762 + */ 5763 + WRITE_ONCE(v, SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP | SCX_KICK_PREEMPT | 5764 + SCX_TG_ONLINE); 5765 + 5766 + BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); 5767 + init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL); 5768 + #ifdef CONFIG_SMP 5769 + BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); 5770 + BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); 5771 + #endif 5772 + scx_kick_cpus_pnt_seqs = 5773 + __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids, 5774 + __alignof__(scx_kick_cpus_pnt_seqs[0])); 5775 + BUG_ON(!scx_kick_cpus_pnt_seqs); 5776 + 5777 + for_each_possible_cpu(cpu) { 5778 + struct rq *rq = cpu_rq(cpu); 5779 + 5780 + init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL); 5781 + INIT_LIST_HEAD(&rq->scx.runnable_list); 5782 + INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals); 5783 + 5784 + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick, GFP_KERNEL)); 5785 + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL)); 5786 + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_preempt, GFP_KERNEL)); 5787 + BUG_ON(!zalloc_cpumask_var(&rq->scx.cpus_to_wait, GFP_KERNEL)); 5788 + init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn); 5789 + 
init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn); 5790 + 5791 + if (cpu_online(cpu)) 5792 + cpu_rq(cpu)->scx.flags |= SCX_RQ_ONLINE; 5793 + } 5794 + 5795 + register_sysrq_key('S', &sysrq_sched_ext_reset_op); 5796 + register_sysrq_key('D', &sysrq_sched_ext_dump_op); 5797 + INIT_DELAYED_WORK(&scx_watchdog_work, scx_watchdog_workfn); 5798 + } 5799 + 5800 + 5801 + /******************************************************************************** 5802 + * Helpers that can be called from the BPF scheduler. 5803 + */ 5804 + #include <linux/btf_ids.h> 5805 + 5806 + __bpf_kfunc_start_defs(); 5807 + 5808 + /** 5809 + * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu() 5810 + * @p: task_struct to select a CPU for 5811 + * @prev_cpu: CPU @p was on previously 5812 + * @wake_flags: %SCX_WAKE_* flags 5813 + * @is_idle: out parameter indicating whether the returned CPU is idle 5814 + * 5815 + * Can only be called from ops.select_cpu() if the built-in CPU selection is 5816 + * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set. 5817 + * @p, @prev_cpu and @wake_flags match ops.select_cpu(). 5818 + * 5819 + * Returns the picked CPU with *@is_idle indicating whether the picked CPU is 5820 + * currently idle and thus a good candidate for direct dispatching. 
5821 + */ 5822 + __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, 5823 + u64 wake_flags, bool *is_idle) 5824 + { 5825 + if (!scx_kf_allowed(SCX_KF_SELECT_CPU)) { 5826 + *is_idle = false; 5827 + return prev_cpu; 5828 + } 5829 + #ifdef CONFIG_SMP 5830 + return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle); 5831 + #else 5832 + *is_idle = false; 5833 + return prev_cpu; 5834 + #endif 5835 + } 5836 + 5837 + __bpf_kfunc_end_defs(); 5838 + 5839 + BTF_KFUNCS_START(scx_kfunc_ids_select_cpu) 5840 + BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU) 5841 + BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) 5842 + 5843 + static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { 5844 + .owner = THIS_MODULE, 5845 + .set = &scx_kfunc_ids_select_cpu, 5846 + }; 5847 + 5848 + static bool scx_dispatch_preamble(struct task_struct *p, u64 enq_flags) 5849 + { 5850 + if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) 5851 + return false; 5852 + 5853 + lockdep_assert_irqs_disabled(); 5854 + 5855 + if (unlikely(!p)) { 5856 + scx_ops_error("called with NULL task"); 5857 + return false; 5858 + } 5859 + 5860 + if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) { 5861 + scx_ops_error("invalid enq_flags 0x%llx", enq_flags); 5862 + return false; 5863 + } 5864 + 5865 + return true; 5866 + } 5867 + 5868 + static void scx_dispatch_commit(struct task_struct *p, u64 dsq_id, u64 enq_flags) 5869 + { 5870 + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 5871 + struct task_struct *ddsp_task; 5872 + 5873 + ddsp_task = __this_cpu_read(direct_dispatch_task); 5874 + if (ddsp_task) { 5875 + mark_direct_dispatch(ddsp_task, p, dsq_id, enq_flags); 5876 + return; 5877 + } 5878 + 5879 + if (unlikely(dspc->cursor >= scx_dsp_max_batch)) { 5880 + scx_ops_error("dispatch buffer overflow"); 5881 + return; 5882 + } 5883 + 5884 + dspc->buf[dspc->cursor++] = (struct scx_dsp_buf_ent){ 5885 + .task = p, 5886 + .qseq = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK, 5887 + 
.dsq_id = dsq_id, 5888 + .enq_flags = enq_flags, 5889 + }; 5890 + } 5891 + 5892 + __bpf_kfunc_start_defs(); 5893 + 5894 + /** 5895 + * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ 5896 + * @p: task_struct to dispatch 5897 + * @dsq_id: DSQ to dispatch to 5898 + * @slice: duration @p can run for in nsecs, 0 to keep the current value 5899 + * @enq_flags: SCX_ENQ_* 5900 + * 5901 + * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe 5902 + * to call this function spuriously. Can be called from ops.enqueue(), 5903 + * ops.select_cpu(), and ops.dispatch(). 5904 + * 5905 + * When called from ops.select_cpu() or ops.enqueue(), it's for direct dispatch 5906 + * and @p must match the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be 5907 + * used to target the local DSQ of a CPU other than the enqueueing one. Use 5908 + * ops.select_cpu() to be on the target CPU in the first place. 5909 + * 5910 + * When called from ops.select_cpu(), @enq_flags and @dsq_id are stored, and @p 5911 + * will be directly dispatched to the corresponding dispatch queue after 5912 + * ops.select_cpu() returns. If @p is dispatched to SCX_DSQ_LOCAL, it will be 5913 + * dispatched to the local DSQ of the CPU returned by ops.select_cpu(). 5914 + * @enq_flags are OR'd with the enqueue flags on the enqueue path before the 5915 + * task is dispatched. 5916 + * 5917 + * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id 5918 + * and this function can be called up to ops.dispatch_max_batch times to dispatch 5919 + * multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of the 5920 + * remaining slots. scx_bpf_consume() flushes the batch and resets the counter. 5921 + * 5922 + * This function doesn't have any locking restrictions and may be called under 5923 + * BPF locks (in the future when BPF introduces more flexible locking). 5924 + * 5925 + * @p is allowed to run for @slice. 
The scheduling path is triggered on slice 5926 + * exhaustion. If zero, the current residual slice is maintained. If 5927 + * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with 5928 + * scx_bpf_kick_cpu() to trigger scheduling. 5929 + */ 5930 + __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, 5931 + u64 enq_flags) 5932 + { 5933 + if (!scx_dispatch_preamble(p, enq_flags)) 5934 + return; 5935 + 5936 + if (slice) 5937 + p->scx.slice = slice; 5938 + else 5939 + p->scx.slice = p->scx.slice ?: 1; 5940 + 5941 + scx_dispatch_commit(p, dsq_id, enq_flags); 5942 + } 5943 + 5944 + /** 5945 + * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ 5946 + * @p: task_struct to dispatch 5947 + * @dsq_id: DSQ to dispatch to 5948 + * @slice: duration @p can run for in nsecs, 0 to keep the current value 5949 + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ 5950 + * @enq_flags: SCX_ENQ_* 5951 + * 5952 + * Dispatch @p into the vtime priority queue of the DSQ identified by @dsq_id. 5953 + * Tasks queued into the priority queue are ordered by @vtime and always 5954 + * consumed after the tasks in the FIFO queue. All other aspects are identical 5955 + * to scx_bpf_dispatch(). 5956 + * 5957 + * @vtime ordering is according to time_before64() which considers wrapping. A 5958 + * numerically larger vtime may indicate an earlier position in the ordering and 5959 + * vice-versa. 
5960 + */ 5961 + __bpf_kfunc void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, 5962 + u64 slice, u64 vtime, u64 enq_flags) 5963 + { 5964 + if (!scx_dispatch_preamble(p, enq_flags)) 5965 + return; 5966 + 5967 + if (slice) 5968 + p->scx.slice = slice; 5969 + else 5970 + p->scx.slice = p->scx.slice ?: 1; 5971 + 5972 + p->scx.dsq_vtime = vtime; 5973 + 5974 + scx_dispatch_commit(p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); 5975 + } 5976 + 5977 + __bpf_kfunc_end_defs(); 5978 + 5979 + BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch) 5980 + BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU) 5981 + BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU) 5982 + BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) 5983 + 5984 + static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { 5985 + .owner = THIS_MODULE, 5986 + .set = &scx_kfunc_ids_enqueue_dispatch, 5987 + }; 5988 + 5989 + static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit, 5990 + struct task_struct *p, u64 dsq_id, 5991 + u64 enq_flags) 5992 + { 5993 + struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq; 5994 + struct rq *this_rq, *src_rq, *dst_rq, *locked_rq; 5995 + bool dispatched = false; 5996 + bool in_balance; 5997 + unsigned long flags; 5998 + 5999 + if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH)) 6000 + return false; 6001 + 6002 + /* 6003 + * Can be called from either ops.dispatch() locking this_rq() or any 6004 + * context where no rq lock is held. If latter, lock @p's task_rq which 6005 + * we'll likely need anyway. 
6006 + */ 6007 + src_rq = task_rq(p); 6008 + 6009 + local_irq_save(flags); 6010 + this_rq = this_rq(); 6011 + in_balance = this_rq->scx.flags & SCX_RQ_IN_BALANCE; 6012 + 6013 + if (in_balance) { 6014 + if (this_rq != src_rq) { 6015 + raw_spin_rq_unlock(this_rq); 6016 + raw_spin_rq_lock(src_rq); 6017 + } 6018 + } else { 6019 + raw_spin_rq_lock(src_rq); 6020 + } 6021 + 6022 + locked_rq = src_rq; 6023 + raw_spin_lock(&src_dsq->lock); 6024 + 6025 + /* 6026 + * Did someone else get to it? @p could have already left $src_dsq, been 6027 + * re-enqueued, or be in the process of being consumed by someone else. 6028 + */ 6029 + if (unlikely(p->scx.dsq != src_dsq || 6030 + u32_before(kit->cursor.priv, p->scx.dsq_seq) || 6031 + p->scx.holding_cpu >= 0) || 6032 + WARN_ON_ONCE(src_rq != task_rq(p))) { 6033 + raw_spin_unlock(&src_dsq->lock); 6034 + goto out; 6035 + } 6036 + 6037 + /* @p is still on $src_dsq and stable, determine the destination */ 6038 + dst_dsq = find_dsq_for_dispatch(this_rq, dsq_id, p); 6039 + 6040 + if (dst_dsq->id == SCX_DSQ_LOCAL) { 6041 + dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); 6042 + if (!task_can_run_on_remote_rq(p, dst_rq, true)) { 6043 + dst_dsq = &scx_dsq_global; 6044 + dst_rq = src_rq; 6045 + } 6046 + } else { 6047 + /* no need to migrate if destination is a non-local DSQ */ 6048 + dst_rq = src_rq; 6049 + } 6050 + 6051 + /* 6052 + * Move @p into $dst_dsq. If $dst_dsq is the local DSQ of a different 6053 + * CPU, @p will be migrated. 
6054 + */ 6055 + if (dst_dsq->id == SCX_DSQ_LOCAL) { 6056 + /* @p is going from a non-local DSQ to a local DSQ */ 6057 + if (src_rq == dst_rq) { 6058 + task_unlink_from_dsq(p, src_dsq); 6059 + move_local_task_to_local_dsq(p, enq_flags, 6060 + src_dsq, dst_rq); 6061 + raw_spin_unlock(&src_dsq->lock); 6062 + } else { 6063 + raw_spin_unlock(&src_dsq->lock); 6064 + move_remote_task_to_local_dsq(p, enq_flags, 6065 + src_rq, dst_rq); 6066 + locked_rq = dst_rq; 6067 + } 6068 + } else { 6069 + /* 6070 + * @p is going from a non-local DSQ to a non-local DSQ. As 6071 + * $src_dsq is already locked, do an abbreviated dequeue. 6072 + */ 6073 + task_unlink_from_dsq(p, src_dsq); 6074 + p->scx.dsq = NULL; 6075 + raw_spin_unlock(&src_dsq->lock); 6076 + 6077 + if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_VTIME) 6078 + p->scx.dsq_vtime = kit->vtime; 6079 + dispatch_enqueue(dst_dsq, p, enq_flags); 6080 + } 6081 + 6082 + if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_SLICE) 6083 + p->scx.slice = kit->slice; 6084 + 6085 + dispatched = true; 6086 + out: 6087 + if (in_balance) { 6088 + if (this_rq != locked_rq) { 6089 + raw_spin_rq_unlock(locked_rq); 6090 + raw_spin_rq_lock(this_rq); 6091 + } 6092 + } else { 6093 + raw_spin_rq_unlock_irqrestore(locked_rq, flags); 6094 + } 6095 + 6096 + kit->cursor.flags &= ~(__SCX_DSQ_ITER_HAS_SLICE | 6097 + __SCX_DSQ_ITER_HAS_VTIME); 6098 + return dispatched; 6099 + } 6100 + 6101 + __bpf_kfunc_start_defs(); 6102 + 6103 + /** 6104 + * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots 6105 + * 6106 + * Can only be called from ops.dispatch(). 6107 + */ 6108 + __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void) 6109 + { 6110 + if (!scx_kf_allowed(SCX_KF_DISPATCH)) 6111 + return 0; 6112 + 6113 + return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor); 6114 + } 6115 + 6116 + /** 6117 + * scx_bpf_dispatch_cancel - Cancel the latest dispatch 6118 + * 6119 + * Cancel the latest dispatch. 
Can be called multiple times to cancel further 6120 + * dispatches. Can only be called from ops.dispatch(). 6121 + */ 6122 + __bpf_kfunc void scx_bpf_dispatch_cancel(void) 6123 + { 6124 + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 6125 + 6126 + if (!scx_kf_allowed(SCX_KF_DISPATCH)) 6127 + return; 6128 + 6129 + if (dspc->cursor > 0) 6130 + dspc->cursor--; 6131 + else 6132 + scx_ops_error("dispatch buffer underflow"); 6133 + } 6134 + 6135 + /** 6136 + * scx_bpf_consume - Transfer a task from a DSQ to the current CPU's local DSQ 6137 + * @dsq_id: DSQ to consume 6138 + * 6139 + * Consume a task from the non-local DSQ identified by @dsq_id and transfer it 6140 + * to the current CPU's local DSQ for execution. Can only be called from 6141 + * ops.dispatch(). 6142 + * 6143 + * This function flushes the in-flight dispatches from scx_bpf_dispatch() before 6144 + * trying to consume the specified DSQ. It may also grab rq locks and thus can't 6145 + * be called under any BPF locks. 6146 + * 6147 + * Returns %true if a task has been consumed, %false if there isn't any task to 6148 + * consume. 6149 + */ 6150 + __bpf_kfunc bool scx_bpf_consume(u64 dsq_id) 6151 + { 6152 + struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx); 6153 + struct scx_dispatch_q *dsq; 6154 + 6155 + if (!scx_kf_allowed(SCX_KF_DISPATCH)) 6156 + return false; 6157 + 6158 + flush_dispatch_buf(dspc->rq); 6159 + 6160 + dsq = find_non_local_dsq(dsq_id); 6161 + if (unlikely(!dsq)) { 6162 + scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id); 6163 + return false; 6164 + } 6165 + 6166 + if (consume_dispatch_q(dspc->rq, dsq)) { 6167 + /* 6168 + * A successfully consumed task can be dequeued before it starts 6169 + * running while the CPU is trying to migrate other dispatched 6170 + * tasks. Bump nr_tasks to tell balance_scx() to retry on empty 6171 + * local DSQ. 
6172 + */ 6173 + dspc->nr_tasks++; 6174 + return true; 6175 + } else { 6176 + return false; 6177 + } 6178 + } 6179 + 6180 + /** 6181 + * scx_bpf_dispatch_from_dsq_set_slice - Override slice when dispatching from DSQ 6182 + * @it__iter: DSQ iterator in progress 6183 + * @slice: duration the dispatched task can run for in nsecs 6184 + * 6185 + * Override the slice of the next task that will be dispatched from @it__iter 6186 + * using scx_bpf_dispatch[_vtime]_from_dsq(). If this function is not called, 6187 + * the previous slice duration is kept. 6188 + */ 6189 + __bpf_kfunc void scx_bpf_dispatch_from_dsq_set_slice( 6190 + struct bpf_iter_scx_dsq *it__iter, u64 slice) 6191 + { 6192 + struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter; 6193 + 6194 + kit->slice = slice; 6195 + kit->cursor.flags |= __SCX_DSQ_ITER_HAS_SLICE; 6196 + } 6197 + 6198 + /** 6199 + * scx_bpf_dispatch_from_dsq_set_vtime - Override vtime when dispatching from DSQ 6200 + * @it__iter: DSQ iterator in progress 6201 + * @vtime: task's ordering inside the vtime-sorted queue of the target DSQ 6202 + * 6203 + * Override the vtime of the next task that will be dispatched from @it__iter 6204 + * using scx_bpf_dispatch_vtime_from_dsq(). If this function is not called, the 6205 + * previous vtime is kept. If scx_bpf_dispatch_from_dsq() is used to 6206 + * dispatch the next task, the override is ignored and cleared. 
6207 + */ 6208 + __bpf_kfunc void scx_bpf_dispatch_from_dsq_set_vtime( 6209 + struct bpf_iter_scx_dsq *it__iter, u64 vtime) 6210 + { 6211 + struct bpf_iter_scx_dsq_kern *kit = (void *)it__iter; 6212 + 6213 + kit->vtime = vtime; 6214 + kit->cursor.flags |= __SCX_DSQ_ITER_HAS_VTIME; 6215 + } 6216 + 6217 + /** 6218 + * scx_bpf_dispatch_from_dsq - Move a task from DSQ iteration to a DSQ 6219 + * @it__iter: DSQ iterator in progress 6220 + * @p: task to transfer 6221 + * @dsq_id: DSQ to move @p to 6222 + * @enq_flags: SCX_ENQ_* 6223 + * 6224 + * Transfer @p which is on the DSQ currently iterated by @it__iter to the DSQ 6225 + * specified by @dsq_id. All DSQs - local DSQs, global DSQ and user DSQs - can 6226 + * be the destination. 6227 + * 6228 + * For the transfer to be successful, @p must still be on the DSQ and have been 6229 + * queued before the DSQ iteration started. This function doesn't care whether 6230 + * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have 6231 + * been queued before the iteration started. 6232 + * 6233 + * @p's slice is kept by default. Use scx_bpf_dispatch_from_dsq_set_slice() to 6234 + * update. 6235 + * 6236 + * Can be called from ops.dispatch() or any BPF context which doesn't hold a rq 6237 + * lock (e.g. BPF timers or SYSCALL programs). 6238 + * 6239 + * Returns %true if @p has been consumed, %false if @p had already been consumed 6240 + * or dequeued. 
6241 + */ 6242 + __bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, 6243 + struct task_struct *p, u64 dsq_id, 6244 + u64 enq_flags) 6245 + { 6246 + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter, 6247 + p, dsq_id, enq_flags); 6248 + } 6249 + 6250 + /** 6251 + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ 6252 + * @it__iter: DSQ iterator in progress 6253 + * @p: task to transfer 6254 + * @dsq_id: DSQ to move @p to 6255 + * @enq_flags: SCX_ENQ_* 6256 + * 6257 + * Transfer @p which is on the DSQ currently iterated by @it__iter to the 6258 + * priority queue of the DSQ specified by @dsq_id. The destination must be a 6259 + * user DSQ as only user DSQs support priority queue. 6260 + * 6261 + * @p's slice and vtime are kept by default. Use 6262 + * scx_bpf_dispatch_from_dsq_set_slice() and 6263 + * scx_bpf_dispatch_from_dsq_set_vtime() to update. 6264 + * 6265 + * All other aspects are identical to scx_bpf_dispatch_from_dsq(). See 6266 + * scx_bpf_dispatch_vtime() for more information on @vtime. 
6267 + */ 6268 + __bpf_kfunc bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, 6269 + struct task_struct *p, u64 dsq_id, 6270 + u64 enq_flags) 6271 + { 6272 + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter, 6273 + p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); 6274 + } 6275 + 6276 + __bpf_kfunc_end_defs(); 6277 + 6278 + BTF_KFUNCS_START(scx_kfunc_ids_dispatch) 6279 + BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots) 6280 + BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel) 6281 + BTF_ID_FLAGS(func, scx_bpf_consume) 6282 + BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice) 6283 + BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime) 6284 + BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU) 6285 + BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU) 6286 + BTF_KFUNCS_END(scx_kfunc_ids_dispatch) 6287 + 6288 + static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { 6289 + .owner = THIS_MODULE, 6290 + .set = &scx_kfunc_ids_dispatch, 6291 + }; 6292 + 6293 + __bpf_kfunc_start_defs(); 6294 + 6295 + /** 6296 + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ 6297 + * 6298 + * Iterate over all of the tasks currently enqueued on the local DSQ of the 6299 + * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of 6300 + * processed tasks. Can only be called from ops.cpu_release(). 6301 + */ 6302 + __bpf_kfunc u32 scx_bpf_reenqueue_local(void) 6303 + { 6304 + LIST_HEAD(tasks); 6305 + u32 nr_enqueued = 0; 6306 + struct rq *rq; 6307 + struct task_struct *p, *n; 6308 + 6309 + if (!scx_kf_allowed(SCX_KF_CPU_RELEASE)) 6310 + return 0; 6311 + 6312 + rq = cpu_rq(smp_processor_id()); 6313 + lockdep_assert_rq_held(rq); 6314 + 6315 + /* 6316 + * The BPF scheduler may choose to dispatch tasks back to 6317 + * @rq->scx.local_dsq. Move all candidate tasks off to a private list 6318 + * first to avoid processing the same tasks repeatedly. 
6319 + */ 6320 + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list, 6321 + scx.dsq_list.node) { 6322 + /* 6323 + * If @p is being migrated, @p's current CPU may not agree with 6324 + * its allowed CPUs and the migration_cpu_stop is about to 6325 + * deactivate and re-activate @p anyway. Skip re-enqueueing. 6326 + * 6327 + * While racing sched property changes may also dequeue and 6328 + * re-enqueue a migrating task while its current CPU and allowed 6329 + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to 6330 + * the current local DSQ for running tasks and thus are not 6331 + * visible to the BPF scheduler. 6332 + */ 6333 + if (p->migration_pending) 6334 + continue; 6335 + 6336 + dispatch_dequeue(rq, p); 6337 + list_add_tail(&p->scx.dsq_list.node, &tasks); 6338 + } 6339 + 6340 + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) { 6341 + list_del_init(&p->scx.dsq_list.node); 6342 + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1); 6343 + nr_enqueued++; 6344 + } 6345 + 6346 + return nr_enqueued; 6347 + } 6348 + 6349 + __bpf_kfunc_end_defs(); 6350 + 6351 + BTF_KFUNCS_START(scx_kfunc_ids_cpu_release) 6352 + BTF_ID_FLAGS(func, scx_bpf_reenqueue_local) 6353 + BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) 6354 + 6355 + static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { 6356 + .owner = THIS_MODULE, 6357 + .set = &scx_kfunc_ids_cpu_release, 6358 + }; 6359 + 6360 + __bpf_kfunc_start_defs(); 6361 + 6362 + /** 6363 + * scx_bpf_create_dsq - Create a custom DSQ 6364 + * @dsq_id: DSQ to create 6365 + * @node: NUMA node to allocate from 6366 + * 6367 + * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable 6368 + * scx callback, and any BPF_PROG_TYPE_SYSCALL prog. 
6369 + */ 6370 + __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) 6371 + { 6372 + if (unlikely(node >= (int)nr_node_ids || 6373 + (node < 0 && node != NUMA_NO_NODE))) 6374 + return -EINVAL; 6375 + return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node)); 6376 + } 6377 + 6378 + __bpf_kfunc_end_defs(); 6379 + 6380 + BTF_KFUNCS_START(scx_kfunc_ids_unlocked) 6381 + BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE) 6382 + BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU) 6383 + BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU) 6384 + BTF_KFUNCS_END(scx_kfunc_ids_unlocked) 6385 + 6386 + static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { 6387 + .owner = THIS_MODULE, 6388 + .set = &scx_kfunc_ids_unlocked, 6389 + }; 6390 + 6391 + __bpf_kfunc_start_defs(); 6392 + 6393 + /** 6394 + * scx_bpf_kick_cpu - Trigger reschedule on a CPU 6395 + * @cpu: cpu to kick 6396 + * @flags: %SCX_KICK_* flags 6397 + * 6398 + * Kick @cpu into rescheduling. This can be used to wake up an idle CPU or 6399 + * trigger rescheduling on a busy CPU. This can be called from any online 6400 + * scx_ops operation and the actual kicking is performed asynchronously through 6401 + * an irq work. 6402 + */ 6403 + __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags) 6404 + { 6405 + struct rq *this_rq; 6406 + unsigned long irq_flags; 6407 + 6408 + if (!ops_cpu_valid(cpu, NULL)) 6409 + return; 6410 + 6411 + local_irq_save(irq_flags); 6412 + 6413 + this_rq = this_rq(); 6414 + 6415 + /* 6416 + * While bypassing for PM ops, IRQ handling may not be online which can 6417 + * lead to irq_work_queue() malfunction such as infinite busy wait for 6418 + * IRQ status update. Suppress kicking. 6419 + */ 6420 + if (scx_rq_bypassing(this_rq)) 6421 + goto out; 6422 + 6423 + /* 6424 + * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting 6425 + * rq locks. We can probably be smarter and avoid bouncing if called 6426 + * from ops which don't hold a rq lock. 
6427 + */ 6428 + if (flags & SCX_KICK_IDLE) { 6429 + struct rq *target_rq = cpu_rq(cpu); 6430 + 6431 + if (unlikely(flags & (SCX_KICK_PREEMPT | SCX_KICK_WAIT))) 6432 + scx_ops_error("PREEMPT/WAIT cannot be used with SCX_KICK_IDLE"); 6433 + 6434 + if (raw_spin_rq_trylock(target_rq)) { 6435 + if (can_skip_idle_kick(target_rq)) { 6436 + raw_spin_rq_unlock(target_rq); 6437 + goto out; 6438 + } 6439 + raw_spin_rq_unlock(target_rq); 6440 + } 6441 + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle); 6442 + } else { 6443 + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick); 6444 + 6445 + if (flags & SCX_KICK_PREEMPT) 6446 + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); 6447 + if (flags & SCX_KICK_WAIT) 6448 + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait); 6449 + } 6450 + 6451 + irq_work_queue(&this_rq->scx.kick_cpus_irq_work); 6452 + out: 6453 + local_irq_restore(irq_flags); 6454 + } 6455 + 6456 + /** 6457 + * scx_bpf_dsq_nr_queued - Return the number of queued tasks 6458 + * @dsq_id: id of the DSQ 6459 + * 6460 + * Return the number of tasks in the DSQ matching @dsq_id. If not found, 6461 + * -%ENOENT is returned. 
6462 + */ 6463 + __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id) 6464 + { 6465 + struct scx_dispatch_q *dsq; 6466 + s32 ret; 6467 + 6468 + preempt_disable(); 6469 + 6470 + if (dsq_id == SCX_DSQ_LOCAL) { 6471 + ret = READ_ONCE(this_rq()->scx.local_dsq.nr); 6472 + goto out; 6473 + } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) { 6474 + s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK; 6475 + 6476 + if (ops_cpu_valid(cpu, NULL)) { 6477 + ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr); 6478 + goto out; 6479 + } 6480 + } else { 6481 + dsq = find_non_local_dsq(dsq_id); 6482 + if (dsq) { 6483 + ret = READ_ONCE(dsq->nr); 6484 + goto out; 6485 + } 6486 + } 6487 + ret = -ENOENT; 6488 + out: 6489 + preempt_enable(); 6490 + return ret; 6491 + } 6492 + 6493 + /** 6494 + * scx_bpf_destroy_dsq - Destroy a custom DSQ 6495 + * @dsq_id: DSQ to destroy 6496 + * 6497 + * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with 6498 + * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is 6499 + * empty and no further tasks are dispatched to it. Ignored if called on a DSQ 6500 + * which doesn't exist. Can be called from any online scx_ops operations. 6501 + */ 6502 + __bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id) 6503 + { 6504 + destroy_dsq(dsq_id); 6505 + } 6506 + 6507 + /** 6508 + * bpf_iter_scx_dsq_new - Create a DSQ iterator 6509 + * @it: iterator to initialize 6510 + * @dsq_id: DSQ to iterate 6511 + * @flags: %SCX_DSQ_ITER_* 6512 + * 6513 + * Initialize BPF iterator @it which can be used with bpf_for_each() to walk 6514 + * tasks in the DSQ specified by @dsq_id. Iteration using @it only includes 6515 + * tasks which are already queued when this function is invoked. 
6516 + */ 6517 + __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, 6518 + u64 flags) 6519 + { 6520 + struct bpf_iter_scx_dsq_kern *kit = (void *)it; 6521 + 6522 + BUILD_BUG_ON(sizeof(struct bpf_iter_scx_dsq_kern) > 6523 + sizeof(struct bpf_iter_scx_dsq)); 6524 + BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) != 6525 + __alignof__(struct bpf_iter_scx_dsq)); 6526 + 6527 + if (flags & ~__SCX_DSQ_ITER_USER_FLAGS) 6528 + return -EINVAL; 6529 + 6530 + kit->dsq = find_non_local_dsq(dsq_id); 6531 + if (!kit->dsq) 6532 + return -ENOENT; 6533 + 6534 + INIT_LIST_HEAD(&kit->cursor.node); 6535 + kit->cursor.flags |= SCX_DSQ_LNODE_ITER_CURSOR | flags; 6536 + kit->cursor.priv = READ_ONCE(kit->dsq->seq); 6537 + 6538 + return 0; 6539 + } 6540 + 6541 + /** 6542 + * bpf_iter_scx_dsq_next - Progress a DSQ iterator 6543 + * @it: iterator to progress 6544 + * 6545 + * Return the next task. See bpf_iter_scx_dsq_new(). 6546 + */ 6547 + __bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) 6548 + { 6549 + struct bpf_iter_scx_dsq_kern *kit = (void *)it; 6550 + bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV; 6551 + struct task_struct *p; 6552 + unsigned long flags; 6553 + 6554 + if (!kit->dsq) 6555 + return NULL; 6556 + 6557 + raw_spin_lock_irqsave(&kit->dsq->lock, flags); 6558 + 6559 + if (list_empty(&kit->cursor.node)) 6560 + p = NULL; 6561 + else 6562 + p = container_of(&kit->cursor, struct task_struct, scx.dsq_list); 6563 + 6564 + /* 6565 + * Only tasks which were queued before the iteration started are 6566 + * visible. This bounds BPF iterations and guarantees that vtime never 6567 + * jumps in the other direction while iterating. 
6568 + */ 6569 + do { 6570 + p = nldsq_next_task(kit->dsq, p, rev); 6571 + } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq))); 6572 + 6573 + if (p) { 6574 + if (rev) 6575 + list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node); 6576 + else 6577 + list_move(&kit->cursor.node, &p->scx.dsq_list.node); 6578 + } else { 6579 + list_del_init(&kit->cursor.node); 6580 + } 6581 + 6582 + raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); 6583 + 6584 + return p; 6585 + } 6586 + 6587 + /** 6588 + * bpf_iter_scx_dsq_destroy - Destroy a DSQ iterator 6589 + * @it: iterator to destroy 6590 + * 6591 + * Undo bpf_iter_scx_dsq_new(). 6592 + */ 6593 + __bpf_kfunc void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) 6594 + { 6595 + struct bpf_iter_scx_dsq_kern *kit = (void *)it; 6596 + 6597 + if (!kit->dsq) 6598 + return; 6599 + 6600 + if (!list_empty(&kit->cursor.node)) { 6601 + unsigned long flags; 6602 + 6603 + raw_spin_lock_irqsave(&kit->dsq->lock, flags); 6604 + list_del_init(&kit->cursor.node); 6605 + raw_spin_unlock_irqrestore(&kit->dsq->lock, flags); 6606 + } 6607 + kit->dsq = NULL; 6608 + } 6609 + 6610 + __bpf_kfunc_end_defs(); 6611 + 6612 + static s32 __bstr_format(u64 *data_buf, char *line_buf, size_t line_size, 6613 + char *fmt, unsigned long long *data, u32 data__sz) 6614 + { 6615 + struct bpf_bprintf_data bprintf_data = { .get_bin_args = true }; 6616 + s32 ret; 6617 + 6618 + if (data__sz % 8 || data__sz > MAX_BPRINTF_VARARGS * 8 || 6619 + (data__sz && !data)) { 6620 + scx_ops_error("invalid data=%p and data__sz=%u", 6621 + (void *)data, data__sz); 6622 + return -EINVAL; 6623 + } 6624 + 6625 + ret = copy_from_kernel_nofault(data_buf, data, data__sz); 6626 + if (ret < 0) { 6627 + scx_ops_error("failed to read data fields (%d)", ret); 6628 + return ret; 6629 + } 6630 + 6631 + ret = bpf_bprintf_prepare(fmt, UINT_MAX, data_buf, data__sz / 8, 6632 + &bprintf_data); 6633 + if (ret < 0) { 6634 + scx_ops_error("format preparation failed (%d)", ret); 
6635 + return ret; 6636 + } 6637 + 6638 + ret = bstr_printf(line_buf, line_size, fmt, 6639 + bprintf_data.bin_args); 6640 + bpf_bprintf_cleanup(&bprintf_data); 6641 + if (ret < 0) { 6642 + scx_ops_error("(\"%s\", %p, %u) failed to format", 6643 + fmt, data, data__sz); 6644 + return ret; 6645 + } 6646 + 6647 + return ret; 6648 + } 6649 + 6650 + static s32 bstr_format(struct scx_bstr_buf *buf, 6651 + char *fmt, unsigned long long *data, u32 data__sz) 6652 + { 6653 + return __bstr_format(buf->data, buf->line, sizeof(buf->line), 6654 + fmt, data, data__sz); 6655 + } 6656 + 6657 + __bpf_kfunc_start_defs(); 6658 + 6659 + /** 6660 + * scx_bpf_exit_bstr - Gracefully exit the BPF scheduler. 6661 + * @exit_code: Exit value to pass to user space via struct scx_exit_info. 6662 + * @fmt: error message format string 6663 + * @data: format string parameters packaged using ___bpf_fill() macro 6664 + * @data__sz: @data len, must end in '__sz' for the verifier 6665 + * 6666 + * Indicate that the BPF scheduler wants to exit gracefully, and initiate ops 6667 + * disabling. 6668 + */ 6669 + __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt, 6670 + unsigned long long *data, u32 data__sz) 6671 + { 6672 + unsigned long flags; 6673 + 6674 + raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 6675 + if (bstr_format(&scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 6676 + scx_ops_exit_kind(SCX_EXIT_UNREG_BPF, exit_code, "%s", 6677 + scx_exit_bstr_buf.line); 6678 + raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags); 6679 + } 6680 + 6681 + /** 6682 + * scx_bpf_error_bstr - Indicate fatal error 6683 + * @fmt: error message format string 6684 + * @data: format string parameters packaged using ___bpf_fill() macro 6685 + * @data__sz: @data len, must end in '__sz' for the verifier 6686 + * 6687 + * Indicate that the BPF scheduler encountered a fatal error and initiate ops 6688 + * disabling. 
6689 + */ 6690 + __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data, 6691 + u32 data__sz) 6692 + { 6693 + unsigned long flags; 6694 + 6695 + raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags); 6696 + if (bstr_format(&scx_exit_bstr_buf, fmt, data, data__sz) >= 0) 6697 + scx_ops_exit_kind(SCX_EXIT_ERROR_BPF, 0, "%s", 6698 + scx_exit_bstr_buf.line); 6699 + raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags); 6700 + } 6701 + 6702 + /** 6703 + * scx_bpf_dump_bstr - Generate extra debug dump specific to the BPF scheduler 6704 + * @fmt: format string 6705 + * @data: format string parameters packaged using ___bpf_fill() macro 6706 + * @data__sz: @data len, must end in '__sz' for the verifier 6707 + * 6708 + * To be called through scx_bpf_dump() helper from ops.dump(), dump_cpu() and 6709 + * dump_task() to generate extra debug dump specific to the BPF scheduler. 6710 + * 6711 + * The extra dump may be multiple lines. A single line may be split over 6712 + * multiple calls. The last line is automatically terminated. 6713 + */ 6714 + __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, 6715 + u32 data__sz) 6716 + { 6717 + struct scx_dump_data *dd = &scx_dump_data; 6718 + struct scx_bstr_buf *buf = &dd->buf; 6719 + s32 ret; 6720 + 6721 + if (raw_smp_processor_id() != dd->cpu) { 6722 + scx_ops_error("scx_bpf_dump() must only be called from ops.dump() and friends"); 6723 + return; 6724 + } 6725 + 6726 + /* append the formatted string to the line buf */ 6727 + ret = __bstr_format(buf->data, buf->line + dd->cursor, 6728 + sizeof(buf->line) - dd->cursor, fmt, data, data__sz); 6729 + if (ret < 0) { 6730 + dump_line(dd->s, "%s[!] 
(\"%s\", %p, %u) failed to format (%d)", 6731 + dd->prefix, fmt, data, data__sz, ret); 6732 + return; 6733 + } 6734 + 6735 + dd->cursor += ret; 6736 + dd->cursor = min_t(s32, dd->cursor, sizeof(buf->line)); 6737 + 6738 + if (!dd->cursor) 6739 + return; 6740 + 6741 + /* 6742 + * If the line buf overflowed or ends in a newline, flush it into the 6743 + * dump. This is to allow the caller to generate a single line over 6744 + * multiple calls. As ops_dump_flush() can also handle multiple lines in 6745 + * the line buf, the only case which can lead to an unexpected 6746 + * truncation is when the caller keeps generating newlines in the middle 6747 + * instead of the end consecutively. Don't do that. 6748 + */ 6749 + if (dd->cursor >= sizeof(buf->line) || buf->line[dd->cursor - 1] == '\n') 6750 + ops_dump_flush(); 6751 + } 6752 + 6753 + /** 6754 + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU 6755 + * @cpu: CPU of interest 6756 + * 6757 + * Return the maximum relative capacity of @cpu in relation to the most 6758 + * performant CPU in the system. The return value is in the range [1, 6759 + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). 6760 + */ 6761 + __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) 6762 + { 6763 + if (ops_cpu_valid(cpu, NULL)) 6764 + return arch_scale_cpu_capacity(cpu); 6765 + else 6766 + return SCX_CPUPERF_ONE; 6767 + } 6768 + 6769 + /** 6770 + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU 6771 + * @cpu: CPU of interest 6772 + * 6773 + * Return the current relative performance of @cpu in relation to its maximum. 6774 + * The return value is in the range [1, %SCX_CPUPERF_ONE]. 6775 + * 6776 + * The current performance level of a CPU in relation to the maximum performance 6777 + * available in the system can be calculated as follows: 6778 + * 6779 + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE 6780 + * 6781 + * The result is in the range [1, %SCX_CPUPERF_ONE]. 
6782 + */ 6783 + __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) 6784 + { 6785 + if (ops_cpu_valid(cpu, NULL)) 6786 + return arch_scale_freq_capacity(cpu); 6787 + else 6788 + return SCX_CPUPERF_ONE; 6789 + } 6790 + 6791 + /** 6792 + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU 6793 + * @cpu: CPU of interest 6794 + * @perf: target performance level [0, %SCX_CPUPERF_ONE] 6795 + * @flags: %SCX_CPUPERF_* flags 6796 + * 6797 + * Set the target performance level of @cpu to @perf. @perf is in linear 6798 + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the 6799 + * schedutil cpufreq governor chooses the target frequency. 6800 + * 6801 + * The actual performance level chosen, CPU grouping, and the overhead and 6802 + * latency of the operations are dependent on the hardware and cpufreq driver in 6803 + * use. Consult hardware and cpufreq documentation for more information. The 6804 + * current performance level can be monitored using scx_bpf_cpuperf_cur(). 6805 + */ 6806 + __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) 6807 + { 6808 + if (unlikely(perf > SCX_CPUPERF_ONE)) { 6809 + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); 6810 + return; 6811 + } 6812 + 6813 + if (ops_cpu_valid(cpu, NULL)) { 6814 + struct rq *rq = cpu_rq(cpu); 6815 + 6816 + rq->scx.cpuperf_target = perf; 6817 + 6818 + rcu_read_lock_sched_notrace(); 6819 + cpufreq_update_util(cpu_rq(cpu), 0); 6820 + rcu_read_unlock_sched_notrace(); 6821 + } 6822 + } 6823 + 6824 + /** 6825 + * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs 6826 + * 6827 + * All valid CPU IDs in the system are smaller than the returned value. 
6828 + */ 6829 + __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void) 6830 + { 6831 + return nr_cpu_ids; 6832 + } 6833 + 6834 + /** 6835 + * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask 6836 + */ 6837 + __bpf_kfunc const struct cpumask *scx_bpf_get_possible_cpumask(void) 6838 + { 6839 + return cpu_possible_mask; 6840 + } 6841 + 6842 + /** 6843 + * scx_bpf_get_online_cpumask - Get a referenced kptr to cpu_online_mask 6844 + */ 6845 + __bpf_kfunc const struct cpumask *scx_bpf_get_online_cpumask(void) 6846 + { 6847 + return cpu_online_mask; 6848 + } 6849 + 6850 + /** 6851 + * scx_bpf_put_cpumask - Release a possible/online cpumask 6852 + * @cpumask: cpumask to release 6853 + */ 6854 + __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) 6855 + { 6856 + /* 6857 + * Empty function body because we aren't actually acquiring or releasing 6858 + * a reference to a global cpumask, which is read-only in the caller and 6859 + * is never released. The acquire / release semantics here are just used 6860 + * to make the cpumask a trusted pointer in the caller. 6861 + */ 6862 + } 6863 + 6864 + /** 6865 + * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking 6866 + * per-CPU cpumask. 6867 + * 6868 + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP kernel. 6869 + */ 6870 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) 6871 + { 6872 + if (!static_branch_likely(&scx_builtin_idle_enabled)) { 6873 + scx_ops_error("built-in idle tracking is disabled"); 6874 + return cpu_none_mask; 6875 + } 6876 + 6877 + #ifdef CONFIG_SMP 6878 + return idle_masks.cpu; 6879 + #else 6880 + return cpu_none_mask; 6881 + #endif 6882 + } 6883 + 6884 + /** 6885 + * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, 6886 + * per-physical-core cpumask. Can be used to determine if an entire physical 6887 + * core is free. 
6888 + * 6889 + * Returns an empty cpumask if idle tracking is not enabled, or running on a UP kernel. 6890 + */ 6891 + __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) 6892 + { 6893 + if (!static_branch_likely(&scx_builtin_idle_enabled)) { 6894 + scx_ops_error("built-in idle tracking is disabled"); 6895 + return cpu_none_mask; 6896 + } 6897 + 6898 + #ifdef CONFIG_SMP 6899 + if (sched_smt_active()) 6900 + return idle_masks.smt; 6901 + else 6902 + return idle_masks.cpu; 6903 + #else 6904 + return cpu_none_mask; 6905 + #endif 6906 + } 6907 + 6908 + /** 6909 + * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to 6910 + * either the percpu, or SMT idle-tracking cpumask. 6911 + */ 6912 + __bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask) 6913 + { 6914 + /* 6915 + * Empty function body because we aren't actually acquiring or releasing 6916 + * a reference to a global idle cpumask, which is read-only in the 6917 + * caller and is never released. The acquire / release semantics here 6918 + * are just used to make the cpumask a trusted pointer in the caller. 6919 + */ 6920 + } 6921 + 6922 + /** 6923 + * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state 6924 + * @cpu: cpu to test and clear idle for 6925 + * 6926 + * Returns %true if @cpu was idle and its idle state was successfully cleared. 6927 + * %false otherwise. 6928 + * 6929 + * Unavailable if ops.update_idle() is implemented and 6930 + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. 
6931 + */ 6932 + __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) 6933 + { 6934 + if (!static_branch_likely(&scx_builtin_idle_enabled)) { 6935 + scx_ops_error("built-in idle tracking is disabled"); 6936 + return false; 6937 + } 6938 + 6939 + if (ops_cpu_valid(cpu, NULL)) 6940 + return test_and_clear_cpu_idle(cpu); 6941 + else 6942 + return false; 6943 + } 6944 + 6945 + /** 6946 + * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu 6947 + * @cpus_allowed: Allowed cpumask 6948 + * @flags: %SCX_PICK_IDLE_CPU_* flags 6949 + * 6950 + * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu 6951 + * number on success. -%EBUSY if no matching cpu was found. 6952 + * 6953 + * Idle CPU tracking may race against CPU scheduling state transitions. For 6954 + * example, this function may return -%EBUSY as CPUs are transitioning into the 6955 + * idle state. If the caller then assumes that there will be dispatch events on 6956 + * the CPUs as they were all busy, the scheduler may end up stalling with CPUs 6957 + * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and 6958 + * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch 6959 + * event in the near future. 6960 + * 6961 + * Unavailable if ops.update_idle() is implemented and 6962 + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set. 6963 + */ 6964 + __bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, 6965 + u64 flags) 6966 + { 6967 + if (!static_branch_likely(&scx_builtin_idle_enabled)) { 6968 + scx_ops_error("built-in idle tracking is disabled"); 6969 + return -EBUSY; 6970 + } 6971 + 6972 + return scx_pick_idle_cpu(cpus_allowed, flags); 6973 + } 6974 + 6975 + /** 6976 + * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU 6977 + * @cpus_allowed: Allowed cpumask 6978 + * @flags: %SCX_PICK_IDLE_CPU_* flags 6979 + * 6980 + * Pick and claim an idle cpu in @cpus_allowed. 
If none is available, pick any 6981 + * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu 6982 + * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is 6983 + * empty. 6984 + * 6985 + * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not 6986 + * set, this function can't tell which CPUs are idle and will always pick any 6987 + * CPU. 6988 + */ 6989 + __bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, 6990 + u64 flags) 6991 + { 6992 + s32 cpu; 6993 + 6994 + if (static_branch_likely(&scx_builtin_idle_enabled)) { 6995 + cpu = scx_pick_idle_cpu(cpus_allowed, flags); 6996 + if (cpu >= 0) 6997 + return cpu; 6998 + } 6999 + 7000 + cpu = cpumask_any_distribute(cpus_allowed); 7001 + if (cpu < nr_cpu_ids) 7002 + return cpu; 7003 + else 7004 + return -EBUSY; 7005 + } 7006 + 7007 + /** 7008 + * scx_bpf_task_running - Is task currently running? 7009 + * @p: task of interest 7010 + */ 7011 + __bpf_kfunc bool scx_bpf_task_running(const struct task_struct *p) 7012 + { 7013 + return task_rq(p)->curr == p; 7014 + } 7015 + 7016 + /** 7017 + * scx_bpf_task_cpu - CPU a task is currently associated with 7018 + * @p: task of interest 7019 + */ 7020 + __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p) 7021 + { 7022 + return task_cpu(p); 7023 + } 7024 + 7025 + /** 7026 + * scx_bpf_cpu_rq - Fetch the rq of a CPU 7027 + * @cpu: CPU of the rq 7028 + */ 7029 + __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu) 7030 + { 7031 + if (!ops_cpu_valid(cpu, NULL)) 7032 + return NULL; 7033 + 7034 + return cpu_rq(cpu); 7035 + } 7036 + 7037 + /** 7038 + * scx_bpf_task_cgroup - Return the sched cgroup of a task 7039 + * @p: task of interest 7040 + * 7041 + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with 7042 + * from the scheduler's POV. 
SCX operations should use this function to
+ * determine @p's current cgroup as, unlike following @p->cgroups,
+ * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
+ * rq-locked operations. Can be called on the parameter tasks of rq-locked
+ * operations. The restriction guarantees that @p's rq is locked by the caller.
+ */
+#ifdef CONFIG_CGROUP_SCHED
+__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg = p->sched_task_group;
+	struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
+
+	if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
+		goto out;
+
+	/*
+	 * A task_group may either be a cgroup or an autogroup. In the latter
+	 * case, @tg->css.cgroup is %NULL. A task_group can't become the other
+	 * kind once created.
+	 */
+	if (tg && tg->css.cgroup)
+		cgrp = tg->css.cgroup;
+	else
+		cgrp = &cgrp_dfl_root.cgrp;
+out:
+	cgroup_get(cgrp);
+	return cgrp;
+}
+#endif
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_any)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
+BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
+BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_ITER_NEW | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
+BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
+BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
+BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
+#ifdef CONFIG_CGROUP_SCHED
+BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_RCU | KF_ACQUIRE)
+#endif
+BTF_KFUNCS_END(scx_kfunc_ids_any)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_any = {
+	.owner			= THIS_MODULE,
+	.set			= &scx_kfunc_ids_any,
+};
+
+static int __init scx_init(void)
+{
+	int ret;
+
+	/*
+	 * kfunc registration can't be done from init_sched_ext_class() as
+	 * register_btf_kfunc_id_set() needs most of the system to be up.
+	 *
+	 * Some kfuncs are context-sensitive and can only be called from
+	 * specific SCX ops. They are grouped into BTF sets accordingly.
+	 * Unfortunately, BPF currently doesn't have a way of enforcing such
+	 * restrictions. Eventually, the verifier should be able to enforce
+	 * them. For now, register them the same and make each kfunc explicitly
+	 * check using scx_kf_allowed().
+	 */
+	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_select_cpu)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_enqueue_dispatch)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_dispatch)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_cpu_release)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_unlocked)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+					     &scx_kfunc_set_unlocked)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					     &scx_kfunc_set_any)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+					     &scx_kfunc_set_any)) ||
+	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+					     &scx_kfunc_set_any))) {
+		pr_err("sched_ext: Failed to register kfunc sets (%d)\n", ret);
+		return ret;
+	}
+
+	ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
+	if (ret) {
+		pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
+		return ret;
+	}
+
+	ret = register_pm_notifier(&scx_pm_notifier);
+	if (ret) {
+		pr_err("sched_ext: Failed to register PM notifier (%d)\n", ret);
+		return ret;
+	}
+
+	scx_kset = kset_create_and_add("sched_ext", &scx_uevent_ops, kernel_kobj);
+	if (!scx_kset) {
+		pr_err("sched_ext: Failed to create /sys/kernel/sched_ext\n");
+		return -ENOMEM;
+	}
+
+	ret = sysfs_create_group(&scx_kset->kobj, &scx_global_attr_group);
+	if (ret < 0) {
+		pr_err("sched_ext: Failed to add global attributes\n");
+		return ret;
+	}
+
+	return 0;
+}
+__initcall(scx_init);
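The kfunc registration chain in scx_init() above relies on `||` short-circuiting: each `(ret = f())` assigns and then tests the result, so the first nonzero return stops the chain and its value is what remains in `ret`. A minimal userspace sketch of that idiom follows; `register_set()`, `register_all()` and the call log are hypothetical stand-ins for illustration, not kernel APIs.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for register_btf_kfunc_id_set(): records that it
 * was called and returns a caller-chosen status (0 on success, negative
 * on failure).
 */
#define MAX_CALLS 8
static int call_log[MAX_CALLS];
static size_t nr_calls;

static int register_set(int id, int status)
{
	if (nr_calls < MAX_CALLS)
		call_log[nr_calls++] = id;
	return status;
}

/*
 * Same shape as the chain in scx_init(): the second registration returns
 * fail_code, so when fail_code is nonzero the third is never attempted.
 */
static int register_all(int fail_code)
{
	int ret;

	nr_calls = 0;
	if ((ret = register_set(1, 0)) ||
	    (ret = register_set(2, fail_code)) ||
	    (ret = register_set(3, 0)))
		return ret;	/* status of the first failing call */

	return 0;
}
```

Calling `register_all(-22)` stops after two calls and returns -22, while `register_all(0)` runs all three. The extra parentheses around each assignment are what keep compilers from warning about assignment used as a condition.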

+91

kernel/sched/ext.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ */
+#ifdef CONFIG_SCHED_CLASS_EXT
+
+void scx_tick(struct rq *rq);
+void init_scx_entity(struct sched_ext_entity *scx);
+void scx_pre_fork(struct task_struct *p);
+int scx_fork(struct task_struct *p);
+void scx_post_fork(struct task_struct *p);
+void scx_cancel_fork(struct task_struct *p);
+bool scx_can_stop_tick(struct rq *rq);
+void scx_rq_activate(struct rq *rq);
+void scx_rq_deactivate(struct rq *rq);
+int scx_check_setscheduler(struct task_struct *p, int policy);
+bool task_should_scx(struct task_struct *p);
+void init_sched_ext_class(void);
+
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+	if (scx_enabled())
+		return cpu_rq(cpu)->scx.cpuperf_target;
+	else
+		return 0;
+}
+
+static inline bool task_on_scx(const struct task_struct *p)
+{
+	return scx_enabled() && p->sched_class == &ext_sched_class;
+}
+
+#ifdef CONFIG_SCHED_CORE
+bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
+		   bool in_fi);
+#endif
+
+#else	/* CONFIG_SCHED_CLASS_EXT */
+
+static inline void scx_tick(struct rq *rq) {}
+static inline void scx_pre_fork(struct task_struct *p) {}
+static inline int scx_fork(struct task_struct *p) { return 0; }
+static inline void scx_post_fork(struct task_struct *p) {}
+static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
+static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
+static inline void scx_rq_activate(struct rq *rq) {}
+static inline void scx_rq_deactivate(struct rq *rq) {}
+static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; }
+static inline bool task_on_scx(const struct task_struct *p) { return false; }
+static inline void init_sched_ext_class(void) {}
+
+#endif	/* CONFIG_SCHED_CLASS_EXT */
+
+#if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
+void __scx_update_idle(struct rq *rq, bool idle);
+
+static inline void scx_update_idle(struct rq *rq, bool idle)
+{
+	if (scx_enabled())
+		__scx_update_idle(rq, idle);
+}
+#else
+static inline void scx_update_idle(struct rq *rq, bool idle) {}
+#endif
+
+#ifdef CONFIG_CGROUP_SCHED
+#ifdef CONFIG_EXT_GROUP_SCHED
+int scx_tg_online(struct task_group *tg);
+void scx_tg_offline(struct task_group *tg);
+int scx_cgroup_can_attach(struct cgroup_taskset *tset);
+void scx_move_task(struct task_struct *p);
+void scx_cgroup_finish_attach(void);
+void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
+void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
+void scx_group_set_idle(struct task_group *tg, bool idle);
+#else	/* CONFIG_EXT_GROUP_SCHED */
+static inline int scx_tg_online(struct task_group *tg) { return 0; }
+static inline void scx_tg_offline(struct task_group *tg) {}
+static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
+static inline void scx_move_task(struct task_struct *p) {}
+static inline void scx_cgroup_finish_attach(void) {}
+static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
+static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
+static inline void scx_group_set_idle(struct task_group *tg, bool idle) {}
+#endif	/* CONFIG_EXT_GROUP_SCHED */
+#endif	/* CONFIG_CGROUP_SCHED */
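The task_on_scx() helper in kernel/sched/ext.h above combines two checks: whether a BPF scheduler is loaded at all, and whether the task's class pointer is the ext class. The userspace sketch below mirrors only that logic. The `struct task`, the two class objects and the plain `ops_enabled` flag are hypothetical simplifications; in the kernel the first check is a static key rather than an ordinary variable.

```c
#include <stdbool.h>

/* Toy stand-ins for struct sched_class and struct task_struct. */
struct sched_class { const char *name; };
struct task { const struct sched_class *sched_class; };

static const struct sched_class fair_class = { "fair" };
static const struct sched_class ext_class = { "ext" };

/* Plain flag standing in for the __scx_ops_enabled static key. */
static bool ops_enabled;

/* A task counts as "on ext" only if ext is enabled AND it is in the ext class. */
static bool task_on_ext(const struct task *p)
{
	return ops_enabled && p->sched_class == &ext_class;
}
```

Testing the enabled flag first means that, with no scheduler loaded, the class comparison is never reached.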

+7 -16

kernel/sched/fair.c

···
 	}
 }
 
-void reweight_task(struct task_struct *p, const struct load_weight *lw)
+static void reweight_task_fair(struct rq *rq, struct task_struct *p,
+			       const struct load_weight *lw)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
···
 	/*
 	 * BATCH and IDLE tasks do not preempt others.
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL))
+	if (unlikely(!normal_policy(p->policy)))
 		return;
 
 	cfs_rq = cfs_rq_of(se);
···
 
 static bool __update_blocked_others(struct rq *rq, bool *done)
 {
-	const struct sched_class *curr_class;
-	u64 now = rq_clock_pelt(rq);
-	unsigned long hw_pressure;
-	bool decayed;
+	bool updated;
 
 	/*
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
-
-	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
-
-	/* hw_pressure doesn't care about invariance */
-	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
-		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
-		  update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
-		  update_irq_load_avg(rq, 0);
+	updated = update_other_load_avgs(rq);
 
 	if (others_have_blocked(rq))
 		*done = false;
 
-	return decayed;
+	return updated;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
···
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 
+	.reweight_task		= reweight_task_fair,
 	.prio_changed		= prio_changed_fair,
 	.switched_from		= switched_from_fair,
 	.switched_to		= switched_to_fair,

+2

kernel/sched/idle.c

···
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
 	dl_server_update_idle_time(rq, prev);
+	scx_update_idle(rq, false);
 }
 
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
+	scx_update_idle(rq, true);
 	schedstat_inc(rq->sched_goidle);
 	next->se.exec_start = rq_clock_task(rq);
 }

+20

kernel/sched/pelt.c

···
 	return ret;
 }
 #endif
+
+/*
+ * Load avg and utilization metrics need to be updated periodically and before
+ * consumption. This function updates the metrics for all subsystems except for
+ * the fair class. @rq must be locked and have its clock updated.
+ */
+bool update_other_load_avgs(struct rq *rq)
+{
+	u64 now = rq_clock_pelt(rq);
+	const struct sched_class *curr_class = rq->curr->sched_class;
+	unsigned long hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
+
+	lockdep_assert_rq_held(rq);
+
+	/* hw_pressure doesn't care about invariance */
+	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
+		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+		update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
+		update_irq_load_avg(rq, 0);
+}
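update_other_load_avgs() above joins its four updaters with bitwise `|` rather than logical `||`. In C, `|` evaluates every operand, so all four signals are refreshed even when the first one already reports a change; `||` would stop at the first nonzero result. That this is the reason for the choice is an inference from the code, not something the patch states. The sketch below shows the difference with a hypothetical `update_signal()` that only counts its invocations.

```c
#include <stdbool.h>

static int nr_updates;	/* how many updaters actually ran */

/* Hypothetical updater: does its bookkeeping and reports whether it decayed. */
static int update_signal(int decayed)
{
	nr_updates++;
	return decayed;
}

/* Bitwise |: every operand is evaluated, so all four updaters run. */
static bool update_all(void)
{
	nr_updates = 0;
	return update_signal(1) | update_signal(0) |
	       update_signal(0) | update_signal(0);
}

/* Logical ||: evaluation stops at the first nonzero result. */
static bool update_until_first(void)
{
	nr_updates = 0;
	return update_signal(1) || update_signal(0) ||
	       update_signal(0) || update_signal(0);
}
```

Both functions return true here, but only `update_all()` leaves `nr_updates` at 4. Note that `|` guarantees every operand is evaluated, not the order in which that happens.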

+1

kernel/sched/pelt.h

···
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
+bool update_other_load_avgs(struct rq *rq);
 
 #ifdef CONFIG_SCHED_HW_PRESSURE
 int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity);

+174 -16

kernel/sched/sched.h

···
 	return policy == SCHED_IDLE;
 }
 
+static inline int normal_policy(int policy)
+{
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (policy == SCHED_EXT)
+		return true;
+#endif
+	return policy == SCHED_NORMAL;
+}
+
 static inline int fair_policy(int policy)
 {
-	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+	return normal_policy(policy) || policy == SCHED_BATCH;
 }
 
 static inline int rt_policy(int policy)
···
  */
 #define shr_bound(val, shift)							\
 	(val >> min_t(typeof(shift), shift, BITS_PER_TYPE(typeof(val)) - 1))
+
+/*
+ * cgroup weight knobs should use the common MIN, DFL and MAX values which are
+ * 1, 100 and 10000 respectively. While it loses a bit of range on both ends, it
+ * maps pretty well onto the shares value used by scheduler and the round-trip
+ * conversions preserve the original value over the entire range.
+ */
+static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
+{
+	return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
+}
+
+static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
+{
+	return clamp_t(unsigned long,
+		       DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
+		       CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
+}
+
 /*
  * !! For sched_setattr_nocheck() (kernel) only !!
···
 	struct rt_bandwidth	rt_bandwidth;
 #endif
 
+#ifdef CONFIG_EXT_GROUP_SCHED
+	u32			scx_flags;	/* SCX_TG_* */
+	u32			scx_weight;
+#endif
+
 	struct rcu_head		rcu;
 	struct list_head	list;
 
···
 
 };
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
 /*
···
 static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
 {
 	return walk_tg_tree_from(&root_task_group, down, up, data);
+}
+
+static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct task_group, css) : NULL;
 }
 
 extern int tg_nop(struct task_group *tg, void *data);
···
 static inline void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next) { }
 #endif /* CONFIG_SMP */
+#else /* !CONFIG_FAIR_GROUP_SCHED */
+static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares) { return 0; }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #else /* CONFIG_CGROUP_SCHED */
···
 #endif
 # define u64_u32_load(var)		u64_u32_load_copy(var, var##_copy)
 # define u64_u32_store(var, val)	u64_u32_store_copy(var, var##_copy, val)
+
+struct balance_callback {
+	struct balance_callback *next;
+	void (*func)(struct rq *rq);
+};
 
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
···
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+/* scx_rq->flags, protected by the rq lock */
+enum scx_rq_flags {
+	/*
+	 * A hotplugged CPU starts scheduling before rq_online_scx(). Track
+	 * ops.cpu_on/offline() state so that ops.enqueue/dispatch() are called
+	 * only while the BPF scheduler considers the CPU to be online.
+	 */
+	SCX_RQ_ONLINE		= 1 << 0,
+	SCX_RQ_CAN_STOP_TICK	= 1 << 1,
+	SCX_RQ_BAL_KEEP		= 1 << 2, /* balance decided to keep current */
+	SCX_RQ_BYPASSING	= 1 << 3,
+
+	SCX_RQ_IN_WAKEUP	= 1 << 16,
+	SCX_RQ_IN_BALANCE	= 1 << 17,
+};
+
+struct scx_rq {
+	struct scx_dispatch_q	local_dsq;
+	struct list_head	runnable_list;		/* runnable tasks on this rq */
+	struct list_head	ddsp_deferred_locals;	/* deferred ddsps from enq */
+	unsigned long		ops_qseq;
+	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
+	u32			nr_running;
+	u32			flags;
+	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
+	bool			cpu_released;
+	cpumask_var_t		cpus_to_kick;
+	cpumask_var_t		cpus_to_kick_if_idle;
+	cpumask_var_t		cpus_to_preempt;
+	cpumask_var_t		cpus_to_wait;
+	unsigned long		pnt_seq;
+	struct balance_callback	deferred_bal_cb;
+	struct irq_work		deferred_irq_work;
+	struct irq_work		kick_cpus_irq_work;
+};
+#endif /* CONFIG_SCHED_CLASS_EXT */
 
 static inline int rt_bandwidth_enabled(void)
 {
···
 DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
-struct balance_callback {
-	struct balance_callback *next;
-	void (*func)(struct rq *rq);
-};
-
 /*
  * This is the main, per-CPU runqueue data structure.
  *
···
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;
+#ifdef CONFIG_SCHED_CLASS_EXT
+	struct scx_rq		scx;
+#endif
 
 	struct sched_dl_entity	fair_server;
 
···
 
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 
+	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	struct task_struct *(*pick_task)(struct rq *rq);
 	/*
 	 * Optional! When implemented pick_next_task() should be equivalent to:
···
 	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
 
 #ifdef CONFIG_SMP
-	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
 
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
···
 	 * cannot assume the switched_from/switched_to pair is serialized by
 	 * rq->lock. They are however serialized by p->pi_lock.
 	 */
+	void (*switching_to) (struct rq *this_rq, struct task_struct *task);
 	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
 	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
+	void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
+			      const struct load_weight *lw);
 	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
 			      int oldprio);
 
···
 extern struct sched_class __sched_class_highest[];
 extern struct sched_class __sched_class_lowest[];
 
+extern const struct sched_class stop_sched_class;
+extern const struct sched_class dl_sched_class;
+extern const struct sched_class rt_sched_class;
+extern const struct sched_class fair_sched_class;
+extern const struct sched_class idle_sched_class;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+extern const struct sched_class ext_sched_class;
+
+DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);	/* SCX BPF scheduler loaded */
+DECLARE_STATIC_KEY_FALSE(__scx_switched_all);	/* all fair class tasks on SCX */
+
+#define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
+#define scx_switched_all()	static_branch_unlikely(&__scx_switched_all)
+#else /* !CONFIG_SCHED_CLASS_EXT */
+#define scx_enabled()		false
+#define scx_switched_all()	false
+#endif /* !CONFIG_SCHED_CLASS_EXT */
+
+/*
+ * Iterate only active classes. SCX can take over all fair tasks or be
+ * completely disabled. If the former, skip fair. If the latter, skip SCX.
+ */
+static inline const struct sched_class *next_active_class(const struct sched_class *class)
+{
+	class++;
+#ifdef CONFIG_SCHED_CLASS_EXT
+	if (scx_switched_all() && class == &fair_sched_class)
+		class++;
+	if (!scx_enabled() && class == &ext_sched_class)
+		class++;
+#endif
+	return class;
+}
+
 #define for_class_range(class, _from, _to) \
 	for (class = (_from); class < (_to); class++)
 
 #define for_each_class(class) \
 	for_class_range(class, __sched_class_highest, __sched_class_lowest)
 
-#define sched_class_above(_a, _b)	((_a) < (_b))
+#define for_active_class_range(class, _from, _to)				\
+	for (class = (_from); class != (_to); class = next_active_class(class))
 
-extern const struct sched_class stop_sched_class;
-extern const struct sched_class dl_sched_class;
-extern const struct sched_class rt_sched_class;
-extern const struct sched_class fair_sched_class;
-extern const struct sched_class idle_sched_class;
+#define for_each_active_class(class)						\
+	for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
+
+#define sched_class_above(_a, _b)	((_a) < (_b))
 
 static inline bool sched_stop_runnable(struct rq *rq)
 {
···
 extern int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx);
 extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx);
 
+static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu)
+{
+	/* When not in the task's cpumask, no point in looking further. */
+	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+		return false;
+
+	/* Can @cpu run a user thread? */
+	if (!(p->flags & PF_KTHREAD) && !task_cpu_possible(cpu, p))
+		return false;
+
+	return true;
+}
+
 static inline cpumask_t *alloc_user_cpus_ptr(int node)
 {
 	/*
···
 extern int push_cpu_stop(void *arg);
 
 #else /* !CONFIG_SMP: */
+
+static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu)
+{
+	return true;
+}
 
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
 					 struct affinity_context *ctx)
···
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
-
-extern void reweight_task(struct task_struct *p, const struct load_weight *lw);
 
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
···
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
 
+#else /* !CONFIG_SMP */
+static inline bool update_other_load_avgs(struct rq *rq) { return false; }
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_UCLAMP_TASK
···
 extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
 extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
 
+extern void check_class_changing(struct rq *rq, struct task_struct *p,
+				 const struct sched_class *prev_class);
 extern void check_class_changed(struct rq *rq, struct task_struct *p,
 				const struct sched_class *prev_class,
 				int oldprio);
···
 }
 
 #endif
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+/*
+ * Used by SCX in the enable/disable paths to move tasks between sched_classes
+ * and establish invariants.
+ */
+struct sched_enq_and_set_ctx {
+	struct task_struct	*p;
+	int			queue_flags;
+	bool			queued;
+	bool			running;
+};
+
+void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
+			    struct sched_enq_and_set_ctx *ctx);
+void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
+
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
+#include "ext.h"
 
 #endif /* _KERNEL_SCHED_SCHED_H */
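The comment on sched_weight_from_cgroup() and sched_weight_to_cgroup() in the sched.h hunk above says the round-trip conversion preserves the original value over the whole cgroup weight range. The userspace sketch below checks that claim by brute force. The constants 1, 100 and 10000 come from that comment; `div_round_closest()` is a simplified stand-in for the kernel's DIV_ROUND_CLOSEST_ULL() that is valid only for unsigned operands, and the function names are shortened for illustration.

```c
#define CGROUP_WEIGHT_MIN	1UL
#define CGROUP_WEIGHT_DFL	100UL
#define CGROUP_WEIGHT_MAX	10000UL

/* Simplified stand-in for DIV_ROUND_CLOSEST_ULL(), unsigned operands only. */
static unsigned long long div_round_closest(unsigned long long x,
					    unsigned long long d)
{
	return (x + d / 2) / d;
}

/* cgroup weight (1..10000) -> scheduler weight, 100 mapping to 1024. */
static unsigned long weight_from_cgroup(unsigned long cgrp_weight)
{
	return div_round_closest((unsigned long long)cgrp_weight * 1024,
				 CGROUP_WEIGHT_DFL);
}

/* scheduler weight -> cgroup weight, clamped to the valid cgroup range. */
static unsigned long weight_to_cgroup(unsigned long weight)
{
	unsigned long w = div_round_closest(
		(unsigned long long)weight * CGROUP_WEIGHT_DFL, 1024);

	if (w < CGROUP_WEIGHT_MIN)
		w = CGROUP_WEIGHT_MIN;
	if (w > CGROUP_WEIGHT_MAX)
		w = CGROUP_WEIGHT_MAX;
	return w;
}

/* Returns the first cgroup weight that fails the round trip, or 0 if none. */
static unsigned long first_round_trip_failure(void)
{
	unsigned long w;

	for (w = CGROUP_WEIGHT_MIN; w <= CGROUP_WEIGHT_MAX; w++)
		if (weight_to_cgroup(weight_from_cgroup(w)) != w)
			return w;
	return 0;
}
```

The round trip holds because scaling by 10.24 and rounding introduces an error of at most 0.5, which shrinks to under 0.05 when divided back down, so the second rounding always recovers the original integer.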

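The next_active_class() helper added to sched.h skips the fair class when sched_ext has taken over all fair tasks, and skips the ext class when no BPF scheduler is loaded. The sketch below imitates that walk over a toy class table. The enum order (stop, dl, rt, fair, ext, idle) is an assumption about the kernel's class layout that is consistent with the helper's logic, and the two flags are plain variables standing in for scx_enabled() and scx_switched_all().

```c
#include <stdbool.h>

/* Toy class table in assumed priority order; indices imitate pointers. */
enum { STOP, DL, RT, FAIR, EXT, IDLE, NR_CLASSES };

static bool ext_enabled;	/* stands in for scx_enabled() */
static bool switched_all;	/* stands in for scx_switched_all() */

/* Same skipping logic as next_active_class(), on array indices. */
static int next_active(int class)
{
	class++;
	if (switched_all && class == FAIR)
		class++;
	if (!ext_enabled && class == EXT)
		class++;
	return class;
}

/* Walk all active classes; out[] receives the visited indices. */
static int walk_active(int out[NR_CLASSES])
{
	int n = 0;

	for (int class = STOP; class != NR_CLASSES; class = next_active(class))
		out[n++] = class;
	return n;
}
```

The loop compares with `!=` rather than `<`, matching for_active_class_range(). This works because the skips never step past the end marker: the last class is idle, which is never skipped.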
+7

kernel/sched/syscalls.c

···
 		goto unlock;
 	}
 
+	retval = scx_check_setscheduler(p, policy);
+	if (retval)
+		goto unlock;
+
 	/*
 	 * If not changing anything there's no need to proceed further,
 	 * but store a possible modification of reset_on_fork.
···
 		__setscheduler_prio(p, newprio);
 	}
 	__setscheduler_uclamp(p, attr);
+	check_class_changing(rq, p, prev_class);
 
 	if (queued) {
 		/*
···
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
+	case SCHED_EXT:
 		ret = 0;
 		break;
 	}
···
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
+	case SCHED_EXT:
 		ret = 0;
 	}
 	return ret;

+1

lib/dump_stack.c

···
 
 	print_worker_info(log_lvl, current);
 	print_stop_info(log_lvl, current);
+	print_scx_info(log_lvl, current);
 }
 
 /**

+9 -1

tools/Makefile

···
 	@echo '  pci                    - PCI tools'
 	@echo '  perf                   - Linux performance measurement and analysis tool'
 	@echo '  selftests              - various kernel selftests'
+	@echo '  sched_ext              - sched_ext example schedulers'
 	@echo '  bootconfig             - boot config tool'
 	@echo '  spi                    - spi tools'
 	@echo '  tmon                   - thermal monitoring and tuning tool'
···
 perf: FORCE
 	$(Q)mkdir -p $(PERF_O) .
 	$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir=
+
+sched_ext: FORCE
+	$(call descend,sched_ext)
 
 selftests: FORCE
 	$(call descend,testing/$@)
···
 	$(Q)mkdir -p $(PERF_O) .
 	$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir= clean
 
+sched_ext_clean:
+	$(call descend,sched_ext,clean)
+
 selftests_clean:
 	$(call descend,testing/$(@:_clean=),clean)
 
···
 		mm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
 		freefall_clean build_clean libbpf_clean libsubcmd_clean \
 		gpio_clean objtool_clean leds_clean wmi_clean pci_clean firmware_clean debugging_clean \
-		intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean
+		intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean \
+		sched_ext_clean
 
 .PHONY: FORCE

+2

tools/sched_ext/.gitignore

···
+tools/
+build/

+246

tools/sched_ext/Makefile

···
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+include ../build/Build.include
+include ../scripts/Makefile.arch
+include ../scripts/Makefile.include
+
+all: all_targets
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi
+CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu
+CLANG_TARGET_FLAGS_hexagon := hexagon-linux-musl
+CLANG_TARGET_FLAGS_m68k := m68k-linux-gnu
+CLANG_TARGET_FLAGS_mips := mipsel-linux-gnu
+CLANG_TARGET_FLAGS_powerpc := powerpc64le-linux-gnu
+CLANG_TARGET_FLAGS_riscv := riscv64-linux-gnu
+CLANG_TARGET_FLAGS_s390 := s390x-linux-gnu
+CLANG_TARGET_FLAGS_x86 := x86_64-linux-gnu
+CLANG_TARGET_FLAGS := $(CLANG_TARGET_FLAGS_$(ARCH))
+
+ifeq ($(CROSS_COMPILE),)
+ifeq ($(CLANG_TARGET_FLAGS),)
+$(error Specify CROSS_COMPILE or add '--target=' option to lib.mk)
+else
+CLANG_FLAGS += --target=$(CLANG_TARGET_FLAGS)
+endif # CLANG_TARGET_FLAGS
+else
+CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
+endif # CROSS_COMPILE
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := $(CROSS_COMPILE)gcc
+endif # LLVM
+
+CURDIR := $(abspath .)
+TOOLSDIR := $(abspath ..)
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(abspath ../../include/generated)
+GENHDR := $(GENDIR)/autoconf.h
+
+ifeq ($(O),)
+OUTPUT_DIR := $(CURDIR)/build
+else
+OUTPUT_DIR := $(O)/build
+endif # O
+OBJ_DIR := $(OUTPUT_DIR)/obj
+INCLUDE_DIR := $(OUTPUT_DIR)/include
+BPFOBJ_DIR := $(OBJ_DIR)/libbpf
+SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
+BINDIR := $(OUTPUT_DIR)/bin
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+ifneq ($(CROSS_COMPILE),)
+HOST_BUILD_DIR := $(OBJ_DIR)/host
+HOST_OUTPUT_DIR := host-tools
+HOST_INCLUDE_DIR := $(HOST_OUTPUT_DIR)/include
+else
+HOST_BUILD_DIR := $(OBJ_DIR)
+HOST_OUTPUT_DIR := $(OUTPUT_DIR)
+HOST_INCLUDE_DIR := $(INCLUDE_DIR)
+endif
+HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a
+RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids
+DEFAULT_BPFTOOL := $(HOST_OUTPUT_DIR)/sbin/bpftool
+
+VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \
+		     $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \
+		     ../../vmlinux \
+		     /sys/kernel/btf/vmlinux \
+		     /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+  GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+	  -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+	  -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS = -lelf -lz -lpthread
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+			grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+	| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+	     $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+	     -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \
+	     -I$(INCLUDE_DIR) -I$(APIDIR) \
+	     -I../../include \
+	     $(call get_sys_includes,$(CLANG)) \
+	     -Wall -Wno-compare-distinct-pointer-types \
+	     -O2 -mcpu=v3
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf \
+	       $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids \
+	       $(INCLUDE_DIR) $(SCXOBJ_DIR) $(BINDIR))
+
+$(MAKE_DIRS):
+	$(call msg,MKDIR,,$@)
+	$(Q)mkdir -p $@
+
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+	   $(APIDIR)/linux/bpf.h \
+	   | $(OBJ_DIR)/libbpf
+	$(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \
+		    EXTRA_CFLAGS='-g -O0 -fPIC' \
+		    DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+		    $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool
+	$(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+		    ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \
+		    EXTRA_CFLAGS='-g -O0' \
+		    OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \
+		    LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/ \
+		    LIBBPF_DESTDIR=$(HOST_OUTPUT_DIR)/ \
+		    prefix= DESTDIR=$(HOST_OUTPUT_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+	$(call msg,GEN,,$@)
+	$(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+	$(call msg,CP,,$@)
+	$(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h include/scx/*.h \
+		       | $(BPFOBJ) $(SCXOBJ_DIR)
+	$(call msg,CLNG-BPF,,$(notdir $@))
+	$(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
+$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL)
+	$(eval sched=$(notdir $@))
+	$(call msg,GEN-SKEL,,$(sched))
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+	$(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+	$(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
+	$(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
+
+SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
+
+c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
+
+$(addprefix $(BINDIR)/,$(c-sched-targets)): \
+	$(BINDIR)/%: \
+		$(filter-out %.bpf.c,%.c) \
+		$(INCLUDE_DIR)/%.bpf.skel.h \
+		$(SCX_COMMON_DEPS)
+	$(eval sched=$(notdir $@))
+	$(CC) $(CFLAGS) -c $(sched).c -o $(SCXOBJ_DIR)/$(sched).o
+	$(CC) -o $@ $(SCXOBJ_DIR)/$(sched).o $(HOST_BPFOBJ) $(LDFLAGS)
+
+$(c-sched-targets): %: $(BINDIR)/%
+
+install: all
+	$(Q)mkdir -p $(DESTDIR)/usr/local/bin/
+	$(Q)cp $(BINDIR)/* $(DESTDIR)/usr/local/bin/
+
+clean:
+	rm -rf $(OUTPUT_DIR) $(HOST_OUTPUT_DIR)
+	rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
+	rm -f $(c-sched-targets)
+
+help:
+	@echo   'Building targets'
+	@echo   '================'
+	@echo   ''
+	@echo   '  all		  - Compile all schedulers'
+	@echo   ''
+	@echo   'Alternatively, you may compile individual schedulers:'
+	@echo   ''
+	@printf '  %s\n' $(c-sched-targets)
+	@echo   ''
+	@echo   'For any scheduler build target, you may specify an alternative'
+	@echo   'build output path with the O= environment variable. For example:'
+	@echo   ''
+	@echo   '   O=/tmp/sched_ext make all'
+	@echo   ''
+	@echo   'will compile all schedulers, and emit the build artifacts to'
+	@echo   '/tmp/sched_ext/build.'
+	@echo   ''
+	@echo   ''
+	@echo   'Installing targets'
+	@echo   '=================='
+	@echo   ''
+	@echo   '  install	  - Compile and install all schedulers to /usr/local/bin.'
+	@echo   '		    You may specify the DESTDIR= environment variable'
+	@echo   '		    to indicate a prefix for /usr/local/bin. For example:'
+	@echo   ''
+	@echo   '                     DESTDIR=/tmp/sched_ext make install'
+	@echo   ''
+	@echo   '		    will build the schedulers in CWD/build, and'
+	@echo   '		    install the schedulers to /tmp/sched_ext/usr/local/bin.'
+	@echo   ''
+	@echo   ''
+	@echo   'Cleaning targets'
+	@echo   '================'
+	@echo   ''
+	@echo   '  clean		  - Remove all generated files'
+
+all_targets: $(c-sched-targets)
+
+.PHONY: all all_targets $(c-sched-targets) clean help
+
+# delete failed targets
+.DELETE_ON_ERROR:
+
+# keep intermediate (.bpf.skel.h, .bpf.o, etc) targets
+.SECONDARY:

+270

tools/sched_ext/README.md

··· 1 + SCHED_EXT EXAMPLE SCHEDULERS 2 + ============================ 3 + 4 + # Introduction 5 + 6 + This directory contains a number of example sched_ext schedulers. These 7 + schedulers are meant to provide examples of different types of schedulers 8 + that can be built using sched_ext, and illustrate how various features of 9 + sched_ext can be used. 10 + 11 + Some of the examples are performant, production-ready schedulers. That is, for 12 + the correct workload and with the correct tuning, they may be deployed in a 13 + production environment with acceptable or possibly even improved performance. 14 + Others are just examples that, in practice, would not provide acceptable 15 + performance (though they could be improved to get there). 16 + 17 + This README will describe these example schedulers, including describing the 18 + types of workloads or scenarios they're designed to accommodate, and whether or 19 + not they're production ready. For more details on any of these schedulers, 20 + please see the header comment in their .bpf.c file. 21 + 22 + 23 + # Compiling the examples 24 + 25 + There are a few toolchain dependencies for compiling the example schedulers. 26 + 27 + ## Toolchain dependencies 28 + 29 + 1. clang >= 16.0.0 30 + 31 + The schedulers are BPF programs, and therefore must be compiled with clang. gcc 32 + is actively working on adding a BPF backend compiler as well, but is still 33 + missing some features such as BTF type tags which are necessary for using 34 + kptrs. 35 + 36 + 2. pahole >= 1.25 37 + 38 + You may need pahole in order to generate BTF from DWARF. 39 + 40 + 3. rust >= 1.70.0 41 + 42 + Rust schedulers use features present in the rust toolchain >= 1.70.0. You 43 + should be able to use the stable build from rustup, but if that doesn't 44 + work, try using the rustup nightly build. 45 + 46 + There are other requirements as well, such as make, but these are the main / 47 + non-trivial ones. 
48 + 49 + ## Compiling the kernel 50 + 51 + In order to run a sched_ext scheduler, you'll have to run a kernel compiled 52 + with the patches in this repository, and with a minimum set of necessary 53 + Kconfig options: 54 + 55 + ``` 56 + CONFIG_BPF=y 57 + CONFIG_SCHED_CLASS_EXT=y 58 + CONFIG_BPF_SYSCALL=y 59 + CONFIG_BPF_JIT=y 60 + CONFIG_DEBUG_INFO_BTF=y 61 + ``` 62 + 63 + It's also recommended that you include the following Kconfig options: 64 + 65 + ``` 66 + CONFIG_BPF_JIT_ALWAYS_ON=y 67 + CONFIG_BPF_JIT_DEFAULT_ON=y 68 + CONFIG_PAHOLE_HAS_SPLIT_BTF=y 69 + CONFIG_PAHOLE_HAS_BTF_TAG=y 70 + ``` 71 + 72 + There is a `Kconfig` file in this directory whose contents you can append to 73 + your local `.config` file, as long as there are no conflicts with any existing 74 + options in the file. 75 + 76 + ## Getting a vmlinux.h file 77 + 78 + You may notice that most of the example schedulers include a "vmlinux.h" file. 79 + This is a large, auto-generated header file that contains all of the types 80 + defined in some vmlinux binary that was compiled with 81 + [BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig 82 + options specified above). 83 + 84 + The header file is created using `bpftool`, by passing it a vmlinux binary 85 + compiled with BTF as follows: 86 + 87 + ```bash 88 + $ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h 89 + ``` 90 + 91 + `bpftool` analyzes all of the BTF encodings in the binary, and produces a 92 + header file that can be included by BPF programs to access those types. For 93 + example, using vmlinux.h allows a scheduler to access fields defined directly 94 + in vmlinux as follows: 95 + 96 + ```c 97 + #include "vmlinux.h" 98 + // vmlinux.h is also implicitly included by scx_common.bpf.h. 99 + #include "scx_common.bpf.h" 100 + 101 + /* 102 + * vmlinux.h provides definitions for struct task_struct and 103 + * struct scx_enable_args. 
104 + */ 105 + void BPF_STRUCT_OPS(example_enable, struct task_struct *p, 106 + struct scx_enable_args *args) 107 + { 108 + bpf_printk("Task %s enabled in example scheduler", p->comm); 109 + } 110 + 111 + // vmlinux.h provides the definition for struct sched_ext_ops. 112 + SEC(".struct_ops.link") 113 + struct sched_ext_ops example_ops = { 114 + .enable = (void *)example_enable, 115 + .name = "example", 116 + }; 117 + ``` 118 + 119 + The scheduler build system will generate this vmlinux.h file as part of the 120 + scheduler build pipeline. It looks for a vmlinux file in the following 121 + dependency order: 122 + 123 + 1. If the O= environment variable is defined, at `$O/vmlinux` 124 + 2. If the KBUILD_OUTPUT= environment variable is defined, at 125 + `$KBUILD_OUTPUT/vmlinux` 126 + 3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're 127 + compiling the schedulers) 128 + 4. `/sys/kernel/btf/vmlinux` 129 + 5. `/boot/vmlinux-$(uname -r)` 130 + 131 + In other words, if you have compiled a kernel in your local repo, its vmlinux 132 + file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of 133 + the kernel you're currently running on. This means that if you're running on a 134 + kernel with sched_ext support, you may not need to compile a local kernel at 135 + all. 136 + 137 + ### Aside on CO-RE 138 + 139 + One of the cooler features of BPF is that it supports 140 + [CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run 141 + Everywhere). This feature allows you to reference fields inside of structs with 142 + types defined internal to the kernel, and not have to recompile if you load the 143 + BPF program on a different kernel with the field at a different offset. In our 144 + example above, we print out a task name with `p->comm`. CO-RE would perform 145 + relocations for that access when the program is loaded to ensure that it's 146 + referencing the correct offset for the currently running kernel. 
147 + 148 + ## Compiling the schedulers 149 + 150 + Once you have your toolchain set up, and a vmlinux that can be used to generate 151 + a full vmlinux.h file, you can compile the schedulers using `make`: 152 + 153 + ```bash 154 + $ make -j$(nproc) 155 + ``` 156 + 157 + # Example schedulers 158 + 159 + This directory contains the following example schedulers. These schedulers are 160 + for testing and demonstrating different aspects of sched_ext. While some may be 161 + useful in limited scenarios, they are not intended to be practical. 162 + 163 + For more scheduler implementations, tools and documentation, visit 164 + https://github.com/sched-ext/scx. 165 + 166 + ## scx_simple 167 + 168 + A simple scheduler that provides an example of a minimal sched_ext scheduler. 169 + scx_simple can be run in either global weighted vtime mode, or FIFO mode. 170 + 171 + Though very simple, in limited scenarios, this scheduler can perform reasonably 172 + well on single-socket systems with a unified L3 cache. 173 + 174 + ## scx_qmap 175 + 176 + Another simple, yet slightly more complex scheduler that provides an example of 177 + a basic weighted FIFO queuing policy. It also provides examples of some common 178 + useful BPF features, such as sleepable per-task storage allocation in the 179 + `ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to 180 + enqueue tasks. It also illustrates how core-sched support could be implemented. 181 + 182 + ## scx_central 183 + 184 + A "central" scheduler where scheduling decisions are made from a single CPU. 185 + This scheduler illustrates how scheduling decisions can be dispatched from a 186 + single CPU, allowing other cores to run with infinite slices, without timer 187 + ticks, and without having to incur the overhead of making scheduling decisions. 188 + 189 + The approach demonstrated by this scheduler may be useful for any workload that 190 + benefits from minimizing scheduling overhead and timer ticks. 
An example of 191 + where this could be particularly useful is running VMs, where running with 192 + infinite slices and no timer ticks allows the VM to avoid unnecessary expensive 193 + vmexits. 194 + 195 + ## scx_flatcg 196 + 197 + A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical 198 + weight-based cgroup CPU control by flattening the cgroup hierarchy into a single 199 + layer, by compounding the active weight share at each level. The effect of this 200 + is a much more performant CPU controller, which does not need to descend down 201 + cgroup trees in order to properly compute a cgroup's share. 202 + 203 + Similar to scx_simple, in limited scenarios, this scheduler can perform 204 + reasonably well on single-socket systems with a unified L3 cache and show 205 + significantly lowered hierarchical scheduling overhead. 206 + 207 + 208 + # Troubleshooting 209 + 210 + There are a number of common issues that you may run into when building the 211 + schedulers. We'll go over some of the common ones here. 212 + 213 + ## Build Failures 214 + 215 + ### Old version of clang 216 + 217 + ``` 218 + error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole 219 + _Static_assert(SCX_DSQ_FLAG_BUILTIN, 220 + ^~~~~~~~~~~~~~~~~~~~ 221 + 1 error generated. 222 + ``` 223 + 224 + This means you built the kernel or the schedulers with an older version of 225 + clang than what's supported (i.e. older than 16.0.0). To remediate this: 226 + 227 + 1. `which clang` to make sure you're using a sufficiently new version of clang. 228 + 229 + 2. `make fullclean` in the root path of the repository, and rebuild the kernel 230 + and schedulers. 231 + 232 + 3. Rebuild the kernel, and then your example schedulers. 233 + 234 + The schedulers are also cleaned if you invoke `make mrproper` in the root 235 + directory of the tree. 
236 + 237 + ### Stale kernel build / incomplete vmlinux.h file 238 + 239 + As described above, you'll need a `vmlinux.h` file that was generated from a 240 + vmlinux built with BTF, and with sched_ext support enabled. If you don't, 241 + you'll see errors such as the following which indicate that a type being 242 + referenced in a scheduler is unknown: 243 + 244 + ``` 245 + /path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info' 246 + 247 + const struct scx_exit_info *ei) 248 + 249 + ^ 250 + ``` 251 + 252 + To resolve this, please follow the steps above in 253 + [Getting a vmlinux.h file](#getting-a-vmlinuxh-file) to ensure your 254 + schedulers are using a vmlinux.h file that includes the requisite types. 255 + 256 + ## Misc 257 + 258 + ### llvm: [OFF] 259 + 260 + You may see the following output when building the schedulers: 261 + 262 + ``` 263 + Auto-detecting system features: 264 + ... clang-bpf-co-re: [ on ] 265 + ... llvm: [ OFF ] 266 + ... libcap: [ on ] 267 + ... libbfd: [ on ] 268 + ``` 269 + 270 + Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.
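
As a rough illustration of the "compounding the active weight share at each level" idea in the scx_flatcg description above, the sketch below computes a flattened share as the product of per-level weight fractions. It is plain userspace C written for this note, not code from scx_flatcg: `struct cg_level` and `flattened_share()` are hypothetical names, and the per-level totals are assumed to already count only active cgroups.

```c
#include <stddef.h>

/* Hypothetical: one level on the path from the root down to a cgroup. */
struct cg_level {
	unsigned int weight;        /* weight of the cgroup on the path at this level */
	unsigned int sibling_total; /* sum of active weights at this level, including it */
};

/*
 * Compound the per-level weight fractions into one share in [0, 1].
 * Returns 0 if some level has no active weight at all.
 */
static double flattened_share(const struct cg_level *path, size_t depth)
{
	double share = 1.0;

	for (size_t i = 0; i < depth; i++) {
		if (path[i].sibling_total == 0)
			return 0.0;
		share *= (double)path[i].weight / path[i].sibling_total;
	}
	return share;
}
```

For example, a cgroup with weight 100 among siblings totalling 300, nested under a parent with weight 200 among siblings totalling 400, gets (200/400) * (100/300) = 1/6. Once each cgroup carries a single number like this, comparing cgroups no longer requires walking the tree.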

+11

tools/sched_ext/include/bpf-compat/gnu/stubs.h

··· 1 + /* 2 + * Dummy gnu/stubs.h. clang can end up including /usr/include/gnu/stubs.h when 3 + * compiling BPF files although its content doesn't play any role. The file in 4 + * turn includes stubs-64.h or stubs-32.h depending on whether __x86_64__ is 5 + * defined. When compiling a BPF source, __x86_64__ isn't set and thus 6 + * stubs-32.h is selected. However, the file is not there if the system doesn't 7 + * have 32bit glibc devel package installed leading to a build failure. 8 + * 9 + * The problem is worked around by making this file available in the include 10 + * search paths before the system one when building BPF. 11 + */

+412

tools/sched_ext/include/scx/common.bpf.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 6 + */ 7 + #ifndef __SCX_COMMON_BPF_H 8 + #define __SCX_COMMON_BPF_H 9 + 10 + #include "vmlinux.h" 11 + #include <bpf/bpf_helpers.h> 12 + #include <bpf/bpf_tracing.h> 13 + #include <asm-generic/errno.h> 14 + #include "user_exit_info.h" 15 + 16 + #define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */ 17 + #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ 18 + #define PF_EXITING 0x00000004 19 + #define CLOCK_MONOTONIC 1 20 + 21 + /* 22 + * Earlier versions of clang/pahole lost upper 32bits in 64bit enums which can 23 + * lead to really confusing misbehaviors. Let's trigger a build failure. 24 + */ 25 + static inline void ___vmlinux_h_sanity_check___(void) 26 + { 27 + _Static_assert(SCX_DSQ_FLAG_BUILTIN, 28 + "bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole"); 29 + } 30 + 31 + s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym; 32 + s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym; 33 + void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym; 34 + void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym; 35 + u32 scx_bpf_dispatch_nr_slots(void) __ksym; 36 + void scx_bpf_dispatch_cancel(void) __ksym; 37 + bool scx_bpf_consume(u64 dsq_id) __ksym; 38 + void scx_bpf_dispatch_from_dsq_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym; 39 + void scx_bpf_dispatch_from_dsq_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym; 40 + bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 41 + bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct 
task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak; 42 + u32 scx_bpf_reenqueue_local(void) __ksym; 43 + void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; 44 + s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym; 45 + void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; 46 + int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags) __ksym __weak; 47 + struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak; 48 + void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak; 49 + void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; 50 + void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; 51 + void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak; 52 + u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak; 53 + u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak; 54 + void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak; 55 + u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; 56 + const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; 57 + const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; 58 + void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak; 59 + const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym; 60 + const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym; 61 + void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym; 62 + bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym; 63 + s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; 64 + s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym; 65 + bool scx_bpf_task_running(const struct task_struct *p) __ksym; 66 + s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym; 67 + struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym; 68 + struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym; 69 + 70 + /* 71 + * Use the following as @it__iter when 
calling 72 + * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops. 73 + */ 74 + #define BPF_FOR_EACH_ITER (&___it) 75 + 76 + static inline __attribute__((format(printf, 1, 2))) 77 + void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {} 78 + 79 + /* 80 + * Helper macro for initializing the fmt and variadic argument inputs to both 81 + * bstr exit kfuncs. Callers of this macro should use ___fmt and ___param to 82 + * refer to the initialized list of inputs to the bstr kfunc. 83 + */ 84 + #define scx_bpf_bstr_preamble(fmt, args...) \ 85 + static char ___fmt[] = fmt; \ 86 + /* \ 87 + * Note that ___param[] must have at least one \ 88 + * element to keep the verifier happy. \ 89 + */ \ 90 + unsigned long long ___param[___bpf_narg(args) ?: 1] = {}; \ 91 + \ 92 + _Pragma("GCC diagnostic push") \ 93 + _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \ 94 + ___bpf_fill(___param, args); \ 95 + _Pragma("GCC diagnostic pop") \ 96 + 97 + /* 98 + * scx_bpf_exit() wraps the scx_bpf_exit_bstr() kfunc with variadic arguments 99 + * instead of an array of u64. Using this macro will cause the scheduler to 100 + * exit cleanly with the specified exit code being passed to user space. 101 + */ 102 + #define scx_bpf_exit(code, fmt, args...) \ 103 + ({ \ 104 + scx_bpf_bstr_preamble(fmt, args) \ 105 + scx_bpf_exit_bstr(code, ___fmt, ___param, sizeof(___param)); \ 106 + ___scx_bpf_bstr_format_checker(fmt, ##args); \ 107 + }) 108 + 109 + /* 110 + * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments 111 + * instead of an array of u64. Invoking this macro will cause the scheduler to 112 + * exit in an erroneous state, with diagnostic information being passed to the 113 + * user. 114 + */ 115 + #define scx_bpf_error(fmt, args...) 
\ 116 + ({ \ 117 + scx_bpf_bstr_preamble(fmt, args) \ 118 + scx_bpf_error_bstr(___fmt, ___param, sizeof(___param)); \ 119 + ___scx_bpf_bstr_format_checker(fmt, ##args); \ 120 + }) 121 + 122 + /* 123 + * scx_bpf_dump() wraps the scx_bpf_dump_bstr() kfunc with variadic arguments 124 + * instead of an array of u64. To be used from ops.dump() and friends. 125 + */ 126 + #define scx_bpf_dump(fmt, args...) \ 127 + ({ \ 128 + scx_bpf_bstr_preamble(fmt, args) \ 129 + scx_bpf_dump_bstr(___fmt, ___param, sizeof(___param)); \ 130 + ___scx_bpf_bstr_format_checker(fmt, ##args); \ 131 + }) 132 + 133 + #define BPF_STRUCT_OPS(name, args...) \ 134 + SEC("struct_ops/"#name) \ 135 + BPF_PROG(name, ##args) 136 + 137 + #define BPF_STRUCT_OPS_SLEEPABLE(name, args...) \ 138 + SEC("struct_ops.s/"#name) \ 139 + BPF_PROG(name, ##args) 140 + 141 + /** 142 + * RESIZABLE_ARRAY - Generates annotations for an array that may be resized 143 + * @elfsec: the data section of the BPF program in which to place the array 144 + * @arr: the name of the array 145 + * 146 + * libbpf has an API for setting map value sizes. Since data sections (i.e. 147 + * bss, data, rodata) themselves are maps, a data section can be resized. If 148 + * a data section has an array as its last element, the BTF info for that 149 + * array will be adjusted so that length of the array is extended to meet the 150 + * new length of the data section. This macro annotates an array to have an 151 + * element count of one with the assumption that this array can be resized 152 + * within the userspace program. It also annotates the section specifier so 153 + * this array exists in a custom sub data section which can be resized 154 + * independently. 155 + * 156 + * See RESIZE_ARRAY() for the userspace convenience macro for resizing an 157 + * array declared with RESIZABLE_ARRAY(). 
158 + */ 159 + #define RESIZABLE_ARRAY(elfsec, arr) arr[1] SEC("."#elfsec"."#arr) 160 + 161 + /** 162 + * MEMBER_VPTR - Obtain the verified pointer to a struct or array member 163 + * @base: struct or array to index 164 + * @member: dereferenced member (e.g. .field, [idx0][idx1], .field[idx0] ...) 165 + * 166 + * The verifier often gets confused by the instruction sequence the compiler 167 + * generates for indexing struct fields or arrays. This macro forces the 168 + * compiler to generate a code sequence which first calculates the byte offset, 169 + * checks it against the struct or array size and add that byte offset to 170 + * generate the pointer to the member to help the verifier. 171 + * 172 + * Ideally, we want to abort if the calculated offset is out-of-bounds. However, 173 + * BPF currently doesn't support abort, so evaluate to %NULL instead. The caller 174 + * must check for %NULL and take appropriate action to appease the verifier. To 175 + * avoid confusing the verifier, it's best to check for %NULL and dereference 176 + * immediately. 177 + * 178 + * vptr = MEMBER_VPTR(my_array, [i][j]); 179 + * if (!vptr) 180 + * return error; 181 + * *vptr = new_value; 182 + * 183 + * sizeof(@base) should encompass the memory area to be accessed and thus can't 184 + * be a pointer to the area. Use `MEMBER_VPTR(*ptr, .member)` instead of 185 + * `MEMBER_VPTR(ptr, ->member)`. 
186 + */ 187 + #define MEMBER_VPTR(base, member) (typeof((base) member) *) \ 188 + ({ \ 189 + u64 __base = (u64)&(base); \ 190 + u64 __addr = (u64)&((base) member) - __base; \ 191 + _Static_assert(sizeof(base) >= sizeof((base) member), \ 192 + "@base is smaller than @member, is @base a pointer?"); \ 193 + asm volatile ( \ 194 + "if %0 <= %[max] goto +2\n" \ 195 + "%0 = 0\n" \ 196 + "goto +1\n" \ 197 + "%0 += %1\n" \ 198 + : "+r"(__addr) \ 199 + : "r"(__base), \ 200 + [max]"i"(sizeof(base) - sizeof((base) member))); \ 201 + __addr; \ 202 + }) 203 + 204 + /** 205 + * ARRAY_ELEM_PTR - Obtain the verified pointer to an array element 206 + * @arr: array to index into 207 + * @i: array index 208 + * @n: number of elements in array 209 + * 210 + * Similar to MEMBER_VPTR() but is intended for use with arrays where the 211 + * element count needs to be explicit. 212 + * It can be used in cases where a global array is defined with an initial 213 + * size but is intended to be resized before loading the BPF program. 214 + * Without this version of the macro, MEMBER_VPTR() will use the compile time 215 + * size of the array to compute the max, which will result in rejection by 216 + * the verifier. 217 + */ 218 + #define ARRAY_ELEM_PTR(arr, i, n) (typeof(arr[i]) *) \ 219 + ({ \ 220 + u64 __base = (u64)arr; \ 221 + u64 __addr = (u64)&(arr[i]) - __base; \ 222 + asm volatile ( \ 223 + "if %0 <= %[max] goto +2\n" \ 224 + "%0 = 0\n" \ 225 + "goto +1\n" \ 226 + "%0 += %1\n" \ 227 + : "+r"(__addr) \ 228 + : "r"(__base), \ 229 + [max]"r"(sizeof(arr[0]) * ((n) - 1))); \ 230 + __addr; \ 231 + }) 232 + 233 + 234 + /* 235 + * BPF declarations and helpers 236 + */ 237 + 238 + /* list and rbtree */ 239 + #define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node))) 240 + #define private(name) SEC(".data." 
#name) __hidden __attribute__((aligned(8))) 241 + 242 + void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym; 243 + void bpf_obj_drop_impl(void *kptr, void *meta) __ksym; 244 + 245 + #define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL)) 246 + #define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL) 247 + 248 + void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym; 249 + void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym; 250 + struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym; 251 + struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym; 252 + struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root, 253 + struct bpf_rb_node *node) __ksym; 254 + int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node, 255 + bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b), 256 + void *meta, __u64 off) __ksym; 257 + #define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0) 258 + 259 + struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym; 260 + 261 + void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym; 262 + #define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL) 263 + 264 + /* task */ 265 + struct task_struct *bpf_task_from_pid(s32 pid) __ksym; 266 + struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym; 267 + void bpf_task_release(struct task_struct *p) __ksym; 268 + 269 + /* cgroup */ 270 + struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym; 271 + void bpf_cgroup_release(struct cgroup *cgrp) __ksym; 272 + struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym; 273 + 274 + /* css iteration */ 275 + struct bpf_iter_css; 276 + struct cgroup_subsys_state; 277 + extern int bpf_iter_css_new(struct bpf_iter_css *it, 278 + struct cgroup_subsys_state *start, 279 + unsigned int flags) 
__weak __ksym; 280 + extern struct cgroup_subsys_state * 281 + bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym; 282 + extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym; 283 + 284 + /* cpumask */ 285 + struct bpf_cpumask *bpf_cpumask_create(void) __ksym; 286 + struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask) __ksym; 287 + void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym; 288 + u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym; 289 + u32 bpf_cpumask_first_zero(const struct cpumask *cpumask) __ksym; 290 + void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; 291 + void bpf_cpumask_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; 292 + bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym; 293 + bool bpf_cpumask_test_and_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; 294 + bool bpf_cpumask_test_and_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym; 295 + void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym; 296 + void bpf_cpumask_clear(struct bpf_cpumask *cpumask) __ksym; 297 + bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1, 298 + const struct cpumask *src2) __ksym; 299 + void bpf_cpumask_or(struct bpf_cpumask *dst, const struct cpumask *src1, 300 + const struct cpumask *src2) __ksym; 301 + void bpf_cpumask_xor(struct bpf_cpumask *dst, const struct cpumask *src1, 302 + const struct cpumask *src2) __ksym; 303 + bool bpf_cpumask_equal(const struct cpumask *src1, const struct cpumask *src2) __ksym; 304 + bool bpf_cpumask_intersects(const struct cpumask *src1, const struct cpumask *src2) __ksym; 305 + bool bpf_cpumask_subset(const struct cpumask *src1, const struct cpumask *src2) __ksym; 306 + bool bpf_cpumask_empty(const struct cpumask *cpumask) __ksym; 307 + bool bpf_cpumask_full(const struct cpumask *cpumask) __ksym; 308 + void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym; 309 + 
u32 bpf_cpumask_any_distribute(const struct cpumask *cpumask) __ksym; 310 + u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1, 311 + const struct cpumask *src2) __ksym; 312 + 313 + /* rcu */ 314 + void bpf_rcu_read_lock(void) __ksym; 315 + void bpf_rcu_read_unlock(void) __ksym; 316 + 317 + 318 + /* 319 + * Other helpers 320 + */ 321 + 322 + /* useful compiler attributes */ 323 + #define likely(x) __builtin_expect(!!(x), 1) 324 + #define unlikely(x) __builtin_expect(!!(x), 0) 325 + #define __maybe_unused __attribute__((__unused__)) 326 + 327 + /* 328 + * READ/WRITE_ONCE() are from kernel (include/asm-generic/rwonce.h). They 329 + * prevent compiler from caching, redoing or reordering reads or writes. 330 + */ 331 + typedef __u8 __attribute__((__may_alias__)) __u8_alias_t; 332 + typedef __u16 __attribute__((__may_alias__)) __u16_alias_t; 333 + typedef __u32 __attribute__((__may_alias__)) __u32_alias_t; 334 + typedef __u64 __attribute__((__may_alias__)) __u64_alias_t; 335 + 336 + static __always_inline void __read_once_size(const volatile void *p, void *res, int size) 337 + { 338 + switch (size) { 339 + case 1: *(__u8_alias_t *) res = *(volatile __u8_alias_t *) p; break; 340 + case 2: *(__u16_alias_t *) res = *(volatile __u16_alias_t *) p; break; 341 + case 4: *(__u32_alias_t *) res = *(volatile __u32_alias_t *) p; break; 342 + case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break; 343 + default: 344 + barrier(); 345 + __builtin_memcpy((void *)res, (const void *)p, size); 346 + barrier(); 347 + } 348 + } 349 + 350 + static __always_inline void __write_once_size(volatile void *p, void *res, int size) 351 + { 352 + switch (size) { 353 + case 1: *(volatile __u8_alias_t *) p = *(__u8_alias_t *) res; break; 354 + case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break; 355 + case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break; 356 + case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break; 357 + 
default: 358 + barrier(); 359 + __builtin_memcpy((void *)p, (const void *)res, size); 360 + barrier(); 361 + } 362 + } 363 + 364 + #define READ_ONCE(x) \ 365 + ({ \ 366 + union { typeof(x) __val; char __c[1]; } __u = \ 367 + { .__c = { 0 } }; \ 368 + __read_once_size(&(x), __u.__c, sizeof(x)); \ 369 + __u.__val; \ 370 + }) 371 + 372 + #define WRITE_ONCE(x, val) \ 373 + ({ \ 374 + union { typeof(x) __val; char __c[1]; } __u = \ 375 + { .__val = (val) }; \ 376 + __write_once_size(&(x), __u.__c, sizeof(x)); \ 377 + __u.__val; \ 378 + }) 379 + 380 + /* 381 + * log2_u32 - Compute the base 2 logarithm of a 32-bit exponential value. 382 + * @v: The value for which we're computing the base 2 logarithm. 383 + */ 384 + static inline u32 log2_u32(u32 v) 385 + { 386 + u32 r; 387 + u32 shift; 388 + 389 + r = (v > 0xFFFF) << 4; v >>= r; 390 + shift = (v > 0xFF) << 3; v >>= shift; r |= shift; 391 + shift = (v > 0xF) << 2; v >>= shift; r |= shift; 392 + shift = (v > 0x3) << 1; v >>= shift; r |= shift; 393 + r |= (v >> 1); 394 + return r; 395 + } 396 + 397 + /* 398 + * log2_u64 - Compute the base 2 logarithm of a 64-bit exponential value. 399 + * @v: The value for which we're computing the base 2 logarithm. 400 + */ 401 + static inline u32 log2_u64(u64 v) 402 + { 403 + u32 hi = v >> 32; 404 + if (hi) 405 + return log2_u32(hi) + 32 + 1; 406 + else 407 + return log2_u32(v) + 1; 408 + } 409 + 410 + #include "compat.bpf.h" 411 + 412 + #endif /* __SCX_COMMON_BPF_H */
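
To make the behavior of the `log2_u32()` / `log2_u64()` helpers near the end of common.bpf.h concrete, here is a userspace transcription that can be compiled and tried outside BPF. The arithmetic is copied from the header; the only assumptions are the swap of the kernel `u32`/`u64` types for `<stdint.h>` types and an explicit cast in the 64-bit fallback path.

```c
#include <stdint.h>

/* Same bit-twiddling as log2_u32() in common.bpf.h, with <stdint.h> types. */
static inline uint32_t log2_u32(uint32_t v)
{
	uint32_t r, shift;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);
	return r;
}

/* Same structure as log2_u64() in common.bpf.h, including the "+ 1". */
static inline uint32_t log2_u64(uint64_t v)
{
	uint32_t hi = v >> 32;

	if (hi)
		return log2_u32(hi) + 32 + 1;
	else
		return log2_u32((uint32_t)v) + 1;
}
```

With these definitions, `log2_u32(1)` is 0 and `log2_u32(1024)` is 10, while `log2_u64(1)` is 1, `log2_u64(1024)` is 11 and `log2_u64(1ULL << 32)` is 33. In other words, as written in this header the 64-bit helper returns one more than the 32-bit helper for the same nonzero value, which is worth keeping in mind when mixing the two.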

+75

tools/sched_ext/include/scx/common.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 6 + */ 7 + #ifndef __SCHED_EXT_COMMON_H 8 + #define __SCHED_EXT_COMMON_H 9 + 10 + #ifdef __KERNEL__ 11 + #error "Should not be included by BPF programs" 12 + #endif 13 + 14 + #include <stdarg.h> 15 + #include <stdio.h> 16 + #include <stdlib.h> 17 + #include <stdint.h> 18 + #include <errno.h> 19 + 20 + typedef uint8_t u8; 21 + typedef uint16_t u16; 22 + typedef uint32_t u32; 23 + typedef uint64_t u64; 24 + typedef int8_t s8; 25 + typedef int16_t s16; 26 + typedef int32_t s32; 27 + typedef int64_t s64; 28 + 29 + #define SCX_BUG(__fmt, ...) \ 30 + do { \ 31 + fprintf(stderr, "[SCX_BUG] %s:%d", __FILE__, __LINE__); \ 32 + if (errno) \ 33 + fprintf(stderr, " (%s)\n", strerror(errno)); \ 34 + else \ 35 + fprintf(stderr, "\n"); \ 36 + fprintf(stderr, __fmt __VA_OPT__(,) __VA_ARGS__); \ 37 + fprintf(stderr, "\n"); \ 38 + \ 39 + exit(EXIT_FAILURE); \ 40 + } while (0) 41 + 42 + #define SCX_BUG_ON(__cond, __fmt, ...) \ 43 + do { \ 44 + if (__cond) \ 45 + SCX_BUG((__fmt) __VA_OPT__(,) __VA_ARGS__); \ 46 + } while (0) 47 + 48 + /** 49 + * RESIZE_ARRAY - Convenience macro for resizing a BPF array 50 + * @__skel: the skeleton containing the array 51 + * @elfsec: the data section of the BPF program in which the array exists 52 + * @arr: the name of the array 53 + * @n: the desired array element count 54 + * 55 + * For BPF arrays declared with RESIZABLE_ARRAY(), this macro performs two 56 + * operations. It resizes the map which corresponds to the custom data 57 + * section that contains the target array. As a side effect, the BTF info for 58 + * the array is adjusted so that the array length is sized to cover the new 59 + * data section size. 
The second operation is reassigning the skeleton pointer 60 + * for that custom data section so that it points to the newly memory mapped 61 + * region. 62 + */ 63 + #define RESIZE_ARRAY(__skel, elfsec, arr, n) \ 64 + do { \ 65 + size_t __sz; \ 66 + bpf_map__set_value_size((__skel)->maps.elfsec##_##arr, \ 67 + sizeof((__skel)->elfsec##_##arr->arr[0]) * (n)); \ 68 + (__skel)->elfsec##_##arr = \ 69 + bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz); \ 70 + } while (0) 71 + 72 + #include "user_exit_info.h" 73 + #include "compat.h" 74 + 75 + #endif /* __SCHED_EXT_COMMON_H */

+28

tools/sched_ext/include/scx/compat.bpf.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2024 David Vernet <dvernet@meta.com> 6 + */ 7 + #ifndef __SCX_COMPAT_BPF_H 8 + #define __SCX_COMPAT_BPF_H 9 + 10 + #define __COMPAT_ENUM_OR_ZERO(__type, __ent) \ 11 + ({ \ 12 + __type __ret = 0; \ 13 + if (bpf_core_enum_value_exists(__type, __ent)) \ 14 + __ret = __ent; \ 15 + __ret; \ 16 + }) 17 + 18 + /* 19 + * Define sched_ext_ops. This may be expanded to define multiple variants for 20 + * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH(). 21 + */ 22 + #define SCX_OPS_DEFINE(__name, ...) \ 23 + SEC(".struct_ops.link") \ 24 + struct sched_ext_ops __name = { \ 25 + __VA_ARGS__, \ 26 + }; 27 + 28 + #endif /* __SCX_COMPAT_BPF_H */

+186

tools/sched_ext/include/scx/compat.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2024 David Vernet <dvernet@meta.com> 6 + */ 7 + #ifndef __SCX_COMPAT_H 8 + #define __SCX_COMPAT_H 9 + 10 + #include <bpf/btf.h> 11 + #include <fcntl.h> 12 + #include <stdlib.h> 13 + #include <unistd.h> 14 + 15 + struct btf *__COMPAT_vmlinux_btf __attribute__((weak)); 16 + 17 + static inline void __COMPAT_load_vmlinux_btf(void) 18 + { 19 + if (!__COMPAT_vmlinux_btf) { 20 + __COMPAT_vmlinux_btf = btf__load_vmlinux_btf(); 21 + SCX_BUG_ON(!__COMPAT_vmlinux_btf, "btf__load_vmlinux_btf()"); 22 + } 23 + } 24 + 25 + static inline bool __COMPAT_read_enum(const char *type, const char *name, u64 *v) 26 + { 27 + const struct btf_type *t; 28 + const char *n; 29 + s32 tid; 30 + int i; 31 + 32 + __COMPAT_load_vmlinux_btf(); 33 + 34 + tid = btf__find_by_name(__COMPAT_vmlinux_btf, type); 35 + if (tid < 0) 36 + return false; 37 + 38 + t = btf__type_by_id(__COMPAT_vmlinux_btf, tid); 39 + SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid); 40 + 41 + if (btf_is_enum(t)) { 42 + struct btf_enum *e = btf_enum(t); 43 + 44 + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { 45 + n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off); 46 + SCX_BUG_ON(!n, "btf__name_by_offset()"); 47 + if (!strcmp(n, name)) { 48 + *v = e[i].val; 49 + return true; 50 + } 51 + } 52 + } else if (btf_is_enum64(t)) { 53 + struct btf_enum64 *e = btf_enum64(t); 54 + 55 + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { 56 + n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off); 57 + SCX_BUG_ON(!n, "btf__name_by_offset()"); 58 + if (!strcmp(n, name)) { 59 + *v = btf_enum64_value(&e[i]); 60 + return true; 61 + } 62 + } 63 + } 64 + 65 + return false; 66 + } 67 + 68 + #define __COMPAT_ENUM_OR_ZERO(__type, __ent) \ 69 + ({ \ 70 + u64 __val = 0; \ 71 + __COMPAT_read_enum(__type, __ent, &__val); \ 72 + __val; \ 73 + }) 74 + 75 + static 
inline bool __COMPAT_has_ksym(const char *ksym) 76 + { 77 + __COMPAT_load_vmlinux_btf(); 78 + return btf__find_by_name(__COMPAT_vmlinux_btf, ksym) >= 0; 79 + } 80 + 81 + static inline bool __COMPAT_struct_has_field(const char *type, const char *field) 82 + { 83 + const struct btf_type *t; 84 + const struct btf_member *m; 85 + const char *n; 86 + s32 tid; 87 + int i; 88 + 89 + __COMPAT_load_vmlinux_btf(); 90 + tid = btf__find_by_name_kind(__COMPAT_vmlinux_btf, type, BTF_KIND_STRUCT); 91 + if (tid < 0) 92 + return false; 93 + 94 + t = btf__type_by_id(__COMPAT_vmlinux_btf, tid); 95 + SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid); 96 + 97 + m = btf_members(t); 98 + 99 + for (i = 0; i < BTF_INFO_VLEN(t->info); i++) { 100 + n = btf__name_by_offset(__COMPAT_vmlinux_btf, m[i].name_off); 101 + SCX_BUG_ON(!n, "btf__name_by_offset()"); 102 + if (!strcmp(n, field)) 103 + return true; 104 + } 105 + 106 + return false; 107 + } 108 + 109 + #define SCX_OPS_SWITCH_PARTIAL \ 110 + __COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_SWITCH_PARTIAL") 111 + 112 + static inline long scx_hotplug_seq(void) 113 + { 114 + int fd; 115 + char buf[32]; 116 + ssize_t len; 117 + long val; 118 + 119 + fd = open("/sys/kernel/sched_ext/hotplug_seq", O_RDONLY); 120 + if (fd < 0) 121 + return -ENOENT; 122 + 123 + len = read(fd, buf, sizeof(buf) - 1); 124 + SCX_BUG_ON(len <= 0, "read failed (%ld)", len); 125 + buf[len] = 0; 126 + close(fd); 127 + 128 + val = strtoul(buf, NULL, 10); 129 + SCX_BUG_ON(val < 0, "invalid num hotplug events: %lu", val); 130 + 131 + return val; 132 + } 133 + 134 + /* 135 + * struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE() 136 + * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load 137 + * and attach it, backward compatibility is automatically maintained where 138 + * reasonable. 139 + * 140 + * ec7e3b0463e1 ("implement-ops") in https://github.com/sched-ext/sched_ext is 141 + * the current minimum required kernel version. 
142 + */ 143 + #define SCX_OPS_OPEN(__ops_name, __scx_name) ({ \ 144 + struct __scx_name *__skel; \ 145 + \ 146 + SCX_BUG_ON(!__COMPAT_struct_has_field("sched_ext_ops", "dump"), \ 147 + "sched_ext_ops.dump() missing, kernel too old?"); \ 148 + \ 149 + __skel = __scx_name##__open(); \ 150 + SCX_BUG_ON(!__skel, "Could not open " #__scx_name); \ 151 + __skel->struct_ops.__ops_name->hotplug_seq = scx_hotplug_seq(); \ 152 + __skel; \ 153 + }) 154 + 155 + #define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({ \ 156 + UEI_SET_SIZE(__skel, __ops_name, __uei_name); \ 157 + SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel"); \ 158 + }) 159 + 160 + /* 161 + * New versions of bpftool now emit additional link placeholders for BPF maps, 162 + * and set up BPF skeleton in such a way that libbpf will auto-attach BPF maps 163 + * automatically, assuming libbpf is recent enough (v1.5+). Old libbpf will do 164 + * nothing with those links and won't attempt to auto-attach maps. 165 + * 166 + * To maintain compatibility with older libbpf while avoiding trying to attach 167 + * twice, disable the autoattach feature on newer libbpf. 168 + */ 169 + #if LIBBPF_MAJOR_VERSION > 1 || \ 170 + (LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 5) 171 + #define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) \ 172 + bpf_map__set_autoattach((__skel)->maps.__ops_name, false) 173 + #else 174 + #define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) do {} while (0) 175 + #endif 176 + 177 + #define SCX_OPS_ATTACH(__skel, __ops_name, __scx_name) ({ \ 178 + struct bpf_link *__link; \ 179 + __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name); \ 180 + SCX_BUG_ON(__scx_name##__attach((__skel)), "Failed to attach skel"); \ 181 + __link = bpf_map__attach_struct_ops((__skel)->maps.__ops_name); \ 182 + SCX_BUG_ON(!__link, "Failed to attach struct_ops"); \ 183 + __link; \ 184 + }) 185 + 186 + #endif /* __SCX_COMPAT_H */

+111

tools/sched_ext/include/scx/user_exit_info.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Define struct user_exit_info which is shared between BPF and userspace parts 4 + * to communicate exit status and other information. 5 + * 6 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 7 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 8 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 9 + */ 10 + #ifndef __USER_EXIT_INFO_H 11 + #define __USER_EXIT_INFO_H 12 + 13 + enum uei_sizes { 14 + UEI_REASON_LEN = 128, 15 + UEI_MSG_LEN = 1024, 16 + UEI_DUMP_DFL_LEN = 32768, 17 + }; 18 + 19 + struct user_exit_info { 20 + int kind; 21 + s64 exit_code; 22 + char reason[UEI_REASON_LEN]; 23 + char msg[UEI_MSG_LEN]; 24 + }; 25 + 26 + #ifdef __bpf__ 27 + 28 + #include "vmlinux.h" 29 + #include <bpf/bpf_core_read.h> 30 + 31 + #define UEI_DEFINE(__name) \ 32 + char RESIZABLE_ARRAY(data, __name##_dump); \ 33 + const volatile u32 __name##_dump_len; \ 34 + struct user_exit_info __name SEC(".data") 35 + 36 + #define UEI_RECORD(__uei_name, __ei) ({ \ 37 + bpf_probe_read_kernel_str(__uei_name.reason, \ 38 + sizeof(__uei_name.reason), (__ei)->reason); \ 39 + bpf_probe_read_kernel_str(__uei_name.msg, \ 40 + sizeof(__uei_name.msg), (__ei)->msg); \ 41 + bpf_probe_read_kernel_str(__uei_name##_dump, \ 42 + __uei_name##_dump_len, (__ei)->dump); \ 43 + if (bpf_core_field_exists((__ei)->exit_code)) \ 44 + __uei_name.exit_code = (__ei)->exit_code; \ 45 + /* use __sync to force memory barrier */ \ 46 + __sync_val_compare_and_swap(&__uei_name.kind, __uei_name.kind, \ 47 + (__ei)->kind); \ 48 + }) 49 + 50 + #else /* !__bpf__ */ 51 + 52 + #include <stdio.h> 53 + #include <stdbool.h> 54 + 55 + /* no need to call the following explicitly if SCX_OPS_LOAD() is used */ 56 + #define UEI_SET_SIZE(__skel, __ops_name, __uei_name) ({ \ 57 + u32 __len = (__skel)->struct_ops.__ops_name->exit_dump_len ?: UEI_DUMP_DFL_LEN; \ 58 + (__skel)->rodata->__uei_name##_dump_len = __len; \ 59 + RESIZE_ARRAY((__skel), data, __uei_name##_dump, 
__len); \ 60 + }) 61 + 62 + #define UEI_EXITED(__skel, __uei_name) ({ \ 63 + /* use __sync to force memory barrier */ \ 64 + __sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1); \ 65 + }) 66 + 67 + #define UEI_REPORT(__skel, __uei_name) ({ \ 68 + struct user_exit_info *__uei = &(__skel)->data->__uei_name; \ 69 + char *__uei_dump = (__skel)->data_##__uei_name##_dump->__uei_name##_dump; \ 70 + if (__uei_dump[0] != '\0') { \ 71 + fputs("\nDEBUG DUMP\n", stderr); \ 72 + fputs("================================================================================\n\n", stderr); \ 73 + fputs(__uei_dump, stderr); \ 74 + fputs("\n================================================================================\n\n", stderr); \ 75 + } \ 76 + fprintf(stderr, "EXIT: %s", __uei->reason); \ 77 + if (__uei->msg[0] != '\0') \ 78 + fprintf(stderr, " (%s)", __uei->msg); \ 79 + fputs("\n", stderr); \ 80 + __uei->exit_code; \ 81 + }) 82 + 83 + /* 84 + * We can't import vmlinux.h while compiling user C code. Let's duplicate 85 + * scx_exit_code definition. 86 + */ 87 + enum scx_exit_code { 88 + /* Reasons */ 89 + SCX_ECODE_RSN_HOTPLUG = 1LLU << 32, 90 + 91 + /* Actions */ 92 + SCX_ECODE_ACT_RESTART = 1LLU << 48, 93 + }; 94 + 95 + enum uei_ecode_mask { 96 + UEI_ECODE_USER_MASK = ((1LLU << 32) - 1), 97 + UEI_ECODE_SYS_RSN_MASK = ((1LLU << 16) - 1) << 32, 98 + UEI_ECODE_SYS_ACT_MASK = ((1LLU << 16) - 1) << 48, 99 + }; 100 + 101 + /* 102 + * These macros interpret the ecode returned from UEI_REPORT(). 103 + */ 104 + #define UEI_ECODE_USER(__ecode) ((__ecode) & UEI_ECODE_USER_MASK) 105 + #define UEI_ECODE_SYS_RSN(__ecode) ((__ecode) & UEI_ECODE_SYS_RSN_MASK) 106 + #define UEI_ECODE_SYS_ACT(__ecode) ((__ecode) & UEI_ECODE_SYS_ACT_MASK) 107 + 108 + #define UEI_ECODE_RESTART(__ecode) (UEI_ECODE_SYS_ACT((__ecode)) == SCX_ECODE_ACT_RESTART) 109 + 110 + #endif /* __bpf__ */ 111 + #endif /* __USER_EXIT_INFO_H */
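The exit code returned by UEI_REPORT() packs three fields into 64 bits: a user-defined value in bits 0-31, a system reason in bits 32-47 and a system action in bits 48-63. A minimal sketch of the decoding follows, assuming local stand-in constants (the ECODE_* names and ecode_* helpers are hypothetical; the real definitions are the enums and UEI_ECODE_* macros above).

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring the values in user_exit_info.h. */
#define ECODE_RSN_HOTPLUG	(UINT64_C(1) << 32)
#define ECODE_ACT_RESTART	(UINT64_C(1) << 48)

#define ECODE_USER_MASK		((UINT64_C(1) << 32) - 1)
#define ECODE_SYS_RSN_MASK	(((UINT64_C(1) << 16) - 1) << 32)
#define ECODE_SYS_ACT_MASK	(((UINT64_C(1) << 16) - 1) << 48)

/* Bits 0-31: value chosen by the scheduler itself. */
static inline uint64_t ecode_user(uint64_t ecode)
{
	return ecode & ECODE_USER_MASK;
}

/* Bits 32-47: system-defined reason, e.g. a CPU hotplug event. */
static inline uint64_t ecode_sys_rsn(uint64_t ecode)
{
	return ecode & ECODE_SYS_RSN_MASK;
}

/* Bits 48-63: system-defined action requested from userspace. */
static inline uint64_t ecode_sys_act(uint64_t ecode)
{
	return ecode & ECODE_SYS_ACT_MASK;
}

/* True when the action field asks the loader to restart the scheduler. */
static inline bool ecode_wants_restart(uint64_t ecode)
{
	return ecode_sys_act(ecode) == ECODE_ACT_RESTART;
}
```

The sketch uses macros rather than enumerators because values above the range of int are only accepted as enum constants through compiler extensions before C23.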

+361

tools/sched_ext/scx_central.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A central FIFO sched_ext scheduler which demonstrates the following: 4 + * 5 + * a. Making all scheduling decisions from one CPU: 6 + * 7 + * The central CPU is the only one making scheduling decisions. All other 8 + * CPUs kick the central CPU when they run out of tasks to run. 9 + * 10 + * There is one global BPF queue and the central CPU schedules all CPUs by 11 + * dispatching from the global queue to each CPU's local dsq from dispatch(). 12 + * This isn't the most straightforward design, e.g. it'd be easier to bounce 13 + * through per-CPU BPF queues. The current design is chosen to maximally 14 + * utilize and verify various SCX mechanisms such as LOCAL_ON dispatching. 15 + * 16 + * b. Tickless operation 17 + * 18 + * All tasks are dispatched with the infinite slice which allows stopping the 19 + * ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full 20 + * parameter. The tickless operation can be observed through 21 + * /proc/interrupts. 22 + * 23 + * Periodic switching is enforced by a periodic timer checking all CPUs and 24 + * preempting them as necessary. On kernels without BPF_F_TIMER_CPU_PIN, the 25 + * BPF timer can't be pinned to a specific CPU, so the periodic timer may run 26 + * on a CPU other than the central CPU. 27 + * 28 + * c. Preemption 29 + * 30 + * Kthreads are unconditionally queued to the head of a matching local dsq 31 + * and dispatched with SCX_ENQ_PREEMPT. This ensures that a kthread is always 32 + * prioritized over user threads, which is required for ensuring forward 33 + * progress as e.g. the periodic timer may run on a ksoftirqd and if the 34 + * ksoftirqd gets starved by a user thread, there may not be anything else to 35 + * vacate that user thread. 36 + * 37 + * SCX_KICK_PREEMPT is used to trigger scheduling and CPUs to move to the 38 + * next tasks. 39 + * 40 + * This scheduler is designed to maximize usage of various SCX mechanisms.
A 41 + * more practical implementation would likely put the scheduling loop outside 42 + * the central CPU's dispatch() path and add some form of priority mechanism. 43 + * 44 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 45 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 46 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 47 + */ 48 + #include <scx/common.bpf.h> 49 + 50 + char _license[] SEC("license") = "GPL"; 51 + 52 + enum { 53 + FALLBACK_DSQ_ID = 0, 54 + MS_TO_NS = 1000LLU * 1000, 55 + TIMER_INTERVAL_NS = 1 * MS_TO_NS, 56 + }; 57 + 58 + const volatile s32 central_cpu; 59 + const volatile u32 nr_cpu_ids = 1; /* !0 for veristat, set during init */ 60 + const volatile u64 slice_ns = SCX_SLICE_DFL; 61 + 62 + bool timer_pinned = true; 63 + u64 nr_total, nr_locals, nr_queued, nr_lost_pids; 64 + u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries; 65 + u64 nr_overflows; 66 + 67 + UEI_DEFINE(uei); 68 + 69 + struct { 70 + __uint(type, BPF_MAP_TYPE_QUEUE); 71 + __uint(max_entries, 4096); 72 + __type(value, s32); 73 + } central_q SEC(".maps"); 74 + 75 + /* can't use percpu map due to bad lookups */ 76 + bool RESIZABLE_ARRAY(data, cpu_gimme_task); 77 + u64 RESIZABLE_ARRAY(data, cpu_started_at); 78 + 79 + struct central_timer { 80 + struct bpf_timer timer; 81 + }; 82 + 83 + struct { 84 + __uint(type, BPF_MAP_TYPE_ARRAY); 85 + __uint(max_entries, 1); 86 + __type(key, u32); 87 + __type(value, struct central_timer); 88 + } central_timer SEC(".maps"); 89 + 90 + static bool vtime_before(u64 a, u64 b) 91 + { 92 + return (s64)(a - b) < 0; 93 + } 94 + 95 + s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p, 96 + s32 prev_cpu, u64 wake_flags) 97 + { 98 + /* 99 + * Steer wakeups to the central CPU as much as possible to avoid 100 + * disturbing other CPUs. It's safe to blindly return the central cpu as 101 + * select_cpu() is a hint and if @p can't be on it, the kernel will 102 + * automatically pick a fallback CPU. 
103 + */ 104 + return central_cpu; 105 + } 106 + 107 + void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags) 108 + { 109 + s32 pid = p->pid; 110 + 111 + __sync_fetch_and_add(&nr_total, 1); 112 + 113 + /* 114 + * Push per-cpu kthreads at the head of local dsq's and preempt the 115 + * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked 116 + * behind other threads which is necessary for forward progress 117 + * guarantee as we depend on the BPF timer which may run from ksoftirqd. 118 + */ 119 + if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) { 120 + __sync_fetch_and_add(&nr_locals, 1); 121 + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF, 122 + enq_flags | SCX_ENQ_PREEMPT); 123 + return; 124 + } 125 + 126 + if (bpf_map_push_elem(&central_q, &pid, 0)) { 127 + __sync_fetch_and_add(&nr_overflows, 1); 128 + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags); 129 + return; 130 + } 131 + 132 + __sync_fetch_and_add(&nr_queued, 1); 133 + 134 + if (!scx_bpf_task_running(p)) 135 + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); 136 + } 137 + 138 + static bool dispatch_to_cpu(s32 cpu) 139 + { 140 + struct task_struct *p; 141 + s32 pid; 142 + 143 + bpf_repeat(BPF_MAX_LOOPS) { 144 + if (bpf_map_pop_elem(&central_q, &pid)) 145 + break; 146 + 147 + __sync_fetch_and_sub(&nr_queued, 1); 148 + 149 + p = bpf_task_from_pid(pid); 150 + if (!p) { 151 + __sync_fetch_and_add(&nr_lost_pids, 1); 152 + continue; 153 + } 154 + 155 + /* 156 + * If we can't run the task at the top, do the dumb thing and 157 + * bounce it to the fallback dsq. 158 + */ 159 + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) { 160 + __sync_fetch_and_add(&nr_mismatches, 1); 161 + scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0); 162 + bpf_task_release(p); 163 + /* 164 + * We might run out of dispatch buffer slots if we continue dispatching 165 + * to the fallback DSQ, without dispatching to the local DSQ of the 166 + * target CPU. 
In such a case, break the loop now as the 167 + * next dispatch operation would fail. 168 + */ 169 + if (!scx_bpf_dispatch_nr_slots()) 170 + break; 171 + continue; 172 + } 173 + 174 + /* dispatch to local and mark that @cpu doesn't need more */ 175 + scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0); 176 + 177 + if (cpu != central_cpu) 178 + scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE); 179 + 180 + bpf_task_release(p); 181 + return true; 182 + } 183 + 184 + return false; 185 + } 186 + 187 + void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev) 188 + { 189 + if (cpu == central_cpu) { 190 + /* dispatch for all other CPUs first */ 191 + __sync_fetch_and_add(&nr_dispatches, 1); 192 + 193 + bpf_for(cpu, 0, nr_cpu_ids) { 194 + bool *gimme; 195 + 196 + if (!scx_bpf_dispatch_nr_slots()) 197 + break; 198 + 199 + /* central's gimme is never set */ 200 + gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); 201 + if (!gimme || !*gimme) 202 + continue; 203 + 204 + if (dispatch_to_cpu(cpu)) 205 + *gimme = false; 206 + } 207 + 208 + /* 209 + * Retry if we ran out of dispatch buffer slots as we might have 210 + * skipped some CPUs and also need to dispatch for self. The ext 211 + * core automatically retries if the local dsq is empty but we 212 + * can't rely on that as we're dispatching for other CPUs too. 213 + * Kick self explicitly to retry.
214 + */ 215 + if (!scx_bpf_dispatch_nr_slots()) { 216 + __sync_fetch_and_add(&nr_retries, 1); 217 + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); 218 + return; 219 + } 220 + 221 + /* look for a task to run on the central CPU */ 222 + if (scx_bpf_consume(FALLBACK_DSQ_ID)) 223 + return; 224 + dispatch_to_cpu(central_cpu); 225 + } else { 226 + bool *gimme; 227 + 228 + if (scx_bpf_consume(FALLBACK_DSQ_ID)) 229 + return; 230 + 231 + gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids); 232 + if (gimme) 233 + *gimme = true; 234 + 235 + /* 236 + * Force dispatch on the scheduling CPU so that it finds a task 237 + * to run for us. 238 + */ 239 + scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT); 240 + } 241 + } 242 + 243 + void BPF_STRUCT_OPS(central_running, struct task_struct *p) 244 + { 245 + s32 cpu = scx_bpf_task_cpu(p); 246 + u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); 247 + if (started_at) 248 + *started_at = bpf_ktime_get_ns() ?: 1; /* 0 indicates idle */ 249 + } 250 + 251 + void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable) 252 + { 253 + s32 cpu = scx_bpf_task_cpu(p); 254 + u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); 255 + if (started_at) 256 + *started_at = 0; 257 + } 258 + 259 + static int central_timerfn(void *map, int *key, struct bpf_timer *timer) 260 + { 261 + u64 now = bpf_ktime_get_ns(); 262 + u64 nr_to_kick = nr_queued; 263 + s32 i, curr_cpu; 264 + 265 + curr_cpu = bpf_get_smp_processor_id(); 266 + if (timer_pinned && (curr_cpu != central_cpu)) { 267 + scx_bpf_error("Central timer ran on CPU %d, not central CPU %d", 268 + curr_cpu, central_cpu); 269 + return 0; 270 + } 271 + 272 + bpf_for(i, 0, nr_cpu_ids) { 273 + s32 cpu = (nr_timers + i) % nr_cpu_ids; 274 + u64 *started_at; 275 + 276 + if (cpu == central_cpu) 277 + continue; 278 + 279 + /* kick iff the current one exhausted its slice */ 280 + started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids); 281 + if (started_at && 
*started_at && 282 + vtime_before(now, *started_at + slice_ns)) 283 + continue; 284 + 285 + /* and there's something pending */ 286 + if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) || 287 + scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu)) 288 + ; 289 + else if (nr_to_kick) 290 + nr_to_kick--; 291 + else 292 + continue; 293 + 294 + scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT); 295 + } 296 + 297 + bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 298 + __sync_fetch_and_add(&nr_timers, 1); 299 + return 0; 300 + } 301 + 302 + int BPF_STRUCT_OPS_SLEEPABLE(central_init) 303 + { 304 + u32 key = 0; 305 + struct bpf_timer *timer; 306 + int ret; 307 + 308 + ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1); 309 + if (ret) 310 + return ret; 311 + 312 + timer = bpf_map_lookup_elem(&central_timer, &key); 313 + if (!timer) 314 + return -ESRCH; 315 + 316 + if (bpf_get_smp_processor_id() != central_cpu) { 317 + scx_bpf_error("init from non-central CPU"); 318 + return -EINVAL; 319 + } 320 + 321 + bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC); 322 + bpf_timer_set_callback(timer, central_timerfn); 323 + 324 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN); 325 + /* 326 + * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a 327 + * kernel which doesn't have it, bpf_timer_start() will return -EINVAL. 328 + * Retry without the PIN. This would be the perfect use case for 329 + * bpf_core_enum_value_exists() but the enum type doesn't have a name 330 + * and can't be used with bpf_core_enum_value_exists(). Oh well... 
331 + */ 332 + if (ret == -EINVAL) { 333 + timer_pinned = false; 334 + ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0); 335 + } 336 + if (ret) 337 + scx_bpf_error("bpf_timer_start failed (%d)", ret); 338 + return ret; 339 + } 340 + 341 + void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei) 342 + { 343 + UEI_RECORD(uei, ei); 344 + } 345 + 346 + SCX_OPS_DEFINE(central_ops, 347 + /* 348 + * We are offloading all scheduling decisions to the central CPU 349 + * and thus being the last task on a given CPU doesn't mean 350 + * anything special. Enqueue the last tasks like any other tasks. 351 + */ 352 + .flags = SCX_OPS_ENQ_LAST, 353 + 354 + .select_cpu = (void *)central_select_cpu, 355 + .enqueue = (void *)central_enqueue, 356 + .dispatch = (void *)central_dispatch, 357 + .running = (void *)central_running, 358 + .stopping = (void *)central_stopping, 359 + .init = (void *)central_init, 360 + .exit = (void *)central_exit, 361 + .name = "central");
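The timer callback above decides whether a CPU's slice has expired with vtime_before(), which compares two u64 timestamps through a signed difference so that the result stays correct when the counter wraps around. A minimal userspace sketch of that comparison and of the slice check follows; the _sketch suffix and the slice_still_running() helper are hypothetical names introduced here for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe "a is earlier than b" for monotonically increasing
 * 64-bit counters. Valid while the two values are less than 2^63 apart.
 */
static inline bool vtime_before_sketch(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Mirrors the test in central_timerfn(): a CPU is left alone while it is
 * running a task (started_at != 0) whose slice has not yet been used up.
 */
static inline bool slice_still_running(uint64_t now, uint64_t started_at,
				       uint64_t slice_ns)
{
	return started_at && vtime_before_sketch(now, started_at + slice_ns);
}
```

A plain `a < b` comparison would give the wrong answer for the pair (UINT64_MAX, 5), where the first value is the earlier one and the counter has since wrapped; the signed difference handles that case.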

+135

tools/sched_ext/scx_central.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2022 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2022 David Vernet <dvernet@meta.com> 6 + */ 7 + #define _GNU_SOURCE 8 + #include <sched.h> 9 + #include <stdio.h> 10 + #include <unistd.h> 11 + #include <inttypes.h> 12 + #include <signal.h> 13 + #include <libgen.h> 14 + #include <bpf/bpf.h> 15 + #include <scx/common.h> 16 + #include "scx_central.bpf.skel.h" 17 + 18 + const char help_fmt[] = 19 + "A central FIFO sched_ext scheduler.\n" 20 + "\n" 21 + "See the top-level comment in .bpf.c for more details.\n" 22 + "\n" 23 + "Usage: %s [-s SLICE_US] [-c CPU]\n" 24 + "\n" 25 + " -s SLICE_US Override slice duration\n" 26 + " -c CPU Override the central CPU (default: 0)\n" 27 + " -v Print libbpf debug messages\n" 28 + " -h Display this help and exit\n"; 29 + 30 + static bool verbose; 31 + static volatile int exit_req; 32 + 33 + static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) 34 + { 35 + if (level == LIBBPF_DEBUG && !verbose) 36 + return 0; 37 + return vfprintf(stderr, format, args); 38 + } 39 + 40 + static void sigint_handler(int dummy) 41 + { 42 + exit_req = 1; 43 + } 44 + 45 + int main(int argc, char **argv) 46 + { 47 + struct scx_central *skel; 48 + struct bpf_link *link; 49 + __u64 seq = 0, ecode; 50 + __s32 opt; 51 + cpu_set_t *cpuset; 52 + 53 + libbpf_set_print(libbpf_print_fn); 54 + signal(SIGINT, sigint_handler); 55 + signal(SIGTERM, sigint_handler); 56 + restart: 57 + skel = SCX_OPS_OPEN(central_ops, scx_central); 58 + 59 + skel->rodata->central_cpu = 0; 60 + skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus(); 61 + 62 + while ((opt = getopt(argc, argv, "s:c:pvh")) != -1) { 63 + switch (opt) { 64 + case 's': 65 + skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000; 66 + break; 67 + case 'c': 68 + skel->rodata->central_cpu = strtoul(optarg, NULL, 0); 69 + break; 70 + case 'v': 
71 + verbose = true; 72 + break; 73 + default: 74 + fprintf(stderr, help_fmt, basename(argv[0])); 75 + return opt != 'h'; 76 + } 77 + } 78 + 79 + /* Resize arrays so their element count is equal to cpu count. */ 80 + RESIZE_ARRAY(skel, data, cpu_gimme_task, skel->rodata->nr_cpu_ids); 81 + RESIZE_ARRAY(skel, data, cpu_started_at, skel->rodata->nr_cpu_ids); 82 + 83 + SCX_OPS_LOAD(skel, central_ops, scx_central, uei); 84 + 85 + /* 86 + * Affinitize the loading thread to the central CPU, as: 87 + * - That's where the BPF timer is first invoked in the BPF program. 88 + * - We probably don't want this user space component to take up a core 89 + * from a task that would benefit from avoiding preemption on one of 90 + * the tickless cores. 91 + * 92 + * Until BPF supports pinning the timer, it's not guaranteed that it 93 + * will always be invoked on the central CPU. In practice, this 94 + * suffices the majority of the time. 95 + */ 96 + cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids); 97 + SCX_BUG_ON(!cpuset, "Failed to allocate cpuset"); 98 + CPU_ZERO_S(CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids), cpuset); 99 + CPU_SET_S(skel->rodata->central_cpu, CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids), cpuset); 100 + SCX_BUG_ON(sched_setaffinity(0, CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids), cpuset), 101 + "Failed to affinitize to central CPU %d (max %d)", 102 + skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1); 103 + CPU_FREE(cpuset); 104 + 105 + link = SCX_OPS_ATTACH(skel, central_ops, scx_central); 106 + 107 + if (!skel->data->timer_pinned) 108 + printf("WARNING : BPF_F_TIMER_CPU_PIN not available, timer not pinned to central\n"); 109 + 110 + while (!exit_req && !UEI_EXITED(skel, uei)) { 111 + printf("[SEQ %llu]\n", seq++); 112 + printf("total :%10" PRIu64 " local:%10" PRIu64 " queued:%10" PRIu64 " lost:%10" PRIu64 "\n", 113 + skel->bss->nr_total, 114 + skel->bss->nr_locals, 115 + skel->bss->nr_queued, 116 + skel->bss->nr_lost_pids); 117 + printf("timer :%10" PRIu64 " dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n", 118 + skel->bss->nr_timers, 119 +
skel->bss->nr_dispatches, 120 + skel->bss->nr_mismatches, 121 + skel->bss->nr_retries); 122 + printf("overflow:%10" PRIu64 "\n", 123 + skel->bss->nr_overflows); 124 + fflush(stdout); 125 + sleep(1); 126 + } 127 + 128 + bpf_link__destroy(link); 129 + ecode = UEI_REPORT(skel, uei); 130 + scx_central__destroy(skel); 131 + 132 + if (UEI_ECODE_RESTART(ecode)) 133 + goto restart; 134 + return 0; 135 + }
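The affinity step in main() works with a CPU set allocated by CPU_ALLOC(). Such a set is only CPU_ALLOC_SIZE(n) bytes long, which can be smaller than a full cpu_set_t, so it is normally handled with the size-aware CPU_*_S() macros from glibc. A minimal sketch of building a single-CPU set that way follows; alloc_single_cpu_set() is a hypothetical helper and not part of scx_central.c.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for CPU_ALLOC() and the CPU_*_S() macros */
#endif
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: allocate a CPU set sized for @nr_cpus CPUs with
 * only @cpu set, and store the set's byte size in @size_out.
 *
 * Returns NULL if @cpu is out of range or the allocation fails. The
 * caller releases the set with CPU_FREE().
 */
static cpu_set_t *alloc_single_cpu_set(int cpu, int nr_cpus, size_t *size_out)
{
	cpu_set_t *set;
	size_t size;

	if (cpu < 0 || cpu >= nr_cpus)
		return NULL;

	set = CPU_ALLOC(nr_cpus);
	if (!set)
		return NULL;

	/* The _S variants touch only the bytes that were actually allocated. */
	size = CPU_ALLOC_SIZE(nr_cpus);
	CPU_ZERO_S(size, set);
	CPU_SET_S(cpu, size, set);

	*size_out = size;
	return set;
}
```

The size stored in @size_out is also the value sched_setaffinity() expects as its cpusetsize argument, for example `sched_setaffinity(0, size, set)`.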

+949

tools/sched_ext/scx_flatcg.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A demo sched_ext flattened cgroup hierarchy scheduler. It implements 4 + * hierarchical weight-based cgroup CPU control by flattening the cgroup 5 + * hierarchy into a single layer by compounding the active weight share at each 6 + * level. Consider the following hierarchy with weights in parentheses: 7 + * 8 + * R + A (100) + B (100) 9 + * | \ C (100) 10 + * \ D (200) 11 + * 12 + * Ignoring the root and threaded cgroups, only B, C and D can contain tasks. 13 + * Let's say all three have runnable tasks. The total share that each of these 14 + * three cgroups is entitled to can be calculated by compounding its share at 15 + * each level. 16 + * 17 + * For example, B is competing against C and in that competition its share is 18 + * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's 19 + * share in that competition is 100/(200+100) == 1/3. B's eventual share in the 20 + * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's 21 + * eventual share is the same at 1/6. D is only competing at the top level and 22 + * its share is 200/(100+200) == 2/3. 23 + * 24 + * So, instead of hierarchically scheduling level-by-level, we can consider it 25 + * as B, C and D competing against each other with respective shares of 1/6, 1/6 and 2/3 26 + * and keep updating the eventual shares as the cgroups' runnable states change. 27 + * 28 + * This flattening of hierarchy can bring a substantial performance gain when 29 + * the cgroup hierarchy is nested multiple levels deep. In a simple benchmark using 30 + * wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it 31 + * outperforms CFS by ~3% with CPU controller disabled and by ~10% with two 32 + * apache instances competing with 2:1 weight ratio nested four levels deep. 33 + * 34 + * However, the gain comes at the cost of not being able to properly handle 35 + * thundering herd of cgroups.
For example, if many cgroups which are nested 36 + * behind a low priority parent cgroup wake up around the same time, they may be 37 + * able to consume more CPU cycles than they are entitled to. In many use cases, 38 + * this isn't a real concern especially given the performance gain. Also, there 39 + * are ways to mitigate the problem further by e.g. introducing an extra 40 + * scheduling layer on cgroup delegation boundaries. 41 + * 42 + * The scheduler first picks the cgroup to run and then schedules the tasks 43 + * within by using nested weighted vtime scheduling by default. The 44 + * cgroup-internal scheduling can be switched to FIFO with the -f option. 45 + */ 46 + #include <scx/common.bpf.h> 47 + #include "scx_flatcg.h" 48 + 49 + /* 50 + * Maximum number of retries to find a valid cgroup. 51 + */ 52 + #define CGROUP_MAX_RETRIES 1024 53 + 54 + char _license[] SEC("license") = "GPL"; 55 + 56 + const volatile u32 nr_cpus = 32; /* !0 for veristat, set during init */ 57 + const volatile u64 cgrp_slice_ns = SCX_SLICE_DFL; 58 + const volatile bool fifo_sched; 59 + 60 + u64 cvtime_now; 61 + UEI_DEFINE(uei); 62 + 63 + struct { 64 + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); 65 + __type(key, u32); 66 + __type(value, u64); 67 + __uint(max_entries, FCG_NR_STATS); 68 + } stats SEC(".maps"); 69 + 70 + static void stat_inc(enum fcg_stat_idx idx) 71 + { 72 + u32 idx_v = idx; 73 + 74 + u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v); 75 + if (cnt_p) 76 + (*cnt_p)++; 77 + } 78 + 79 + struct fcg_cpu_ctx { 80 + u64 cur_cgid; 81 + u64 cur_at; 82 + }; 83 + 84 + struct { 85 + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); 86 + __type(key, u32); 87 + __type(value, struct fcg_cpu_ctx); 88 + __uint(max_entries, 1); 89 + } cpu_ctx SEC(".maps"); 90 + 91 + struct { 92 + __uint(type, BPF_MAP_TYPE_CGRP_STORAGE); 93 + __uint(map_flags, BPF_F_NO_PREALLOC); 94 + __type(key, int); 95 + __type(value, struct fcg_cgrp_ctx); 96 + } cgrp_ctx SEC(".maps"); 97 + 98 + struct cgv_node { 99 + struct
bpf_rb_node rb_node; 100 + __u64 cvtime; 101 + __u64 cgid; 102 + }; 103 + 104 + private(CGV_TREE) struct bpf_spin_lock cgv_tree_lock; 105 + private(CGV_TREE) struct bpf_rb_root cgv_tree __contains(cgv_node, rb_node); 106 + 107 + struct cgv_node_stash { 108 + struct cgv_node __kptr *node; 109 + }; 110 + 111 + struct { 112 + __uint(type, BPF_MAP_TYPE_HASH); 113 + __uint(max_entries, 16384); 114 + __type(key, __u64); 115 + __type(value, struct cgv_node_stash); 116 + } cgv_node_stash SEC(".maps"); 117 + 118 + struct fcg_task_ctx { 119 + u64 bypassed_at; 120 + }; 121 + 122 + struct { 123 + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); 124 + __uint(map_flags, BPF_F_NO_PREALLOC); 125 + __type(key, int); 126 + __type(value, struct fcg_task_ctx); 127 + } task_ctx SEC(".maps"); 128 + 129 + /* gets inc'd on weight tree changes to expire the cached hweights */ 130 + u64 hweight_gen = 1; 131 + 132 + static u64 div_round_up(u64 dividend, u64 divisor) 133 + { 134 + return (dividend + divisor - 1) / divisor; 135 + } 136 + 137 + static bool vtime_before(u64 a, u64 b) 138 + { 139 + return (s64)(a - b) < 0; 140 + } 141 + 142 + static bool cgv_node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) 143 + { 144 + struct cgv_node *cgc_a, *cgc_b; 145 + 146 + cgc_a = container_of(a, struct cgv_node, rb_node); 147 + cgc_b = container_of(b, struct cgv_node, rb_node); 148 + 149 + return cgc_a->cvtime < cgc_b->cvtime; 150 + } 151 + 152 + static struct fcg_cpu_ctx *find_cpu_ctx(void) 153 + { 154 + struct fcg_cpu_ctx *cpuc; 155 + u32 idx = 0; 156 + 157 + cpuc = bpf_map_lookup_elem(&cpu_ctx, &idx); 158 + if (!cpuc) { 159 + scx_bpf_error("cpu_ctx lookup failed"); 160 + return NULL; 161 + } 162 + return cpuc; 163 + } 164 + 165 + static struct fcg_cgrp_ctx *find_cgrp_ctx(struct cgroup *cgrp) 166 + { 167 + struct fcg_cgrp_ctx *cgc; 168 + 169 + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); 170 + if (!cgc) { 171 + scx_bpf_error("cgrp_ctx lookup failed for cgid %llu", cgrp->kn->id); 172 + return 
NULL;
+	}
+	return cgc;
+}
+
+static struct fcg_cgrp_ctx *find_ancestor_cgrp_ctx(struct cgroup *cgrp, int level)
+{
+	struct fcg_cgrp_ctx *cgc;
+
+	cgrp = bpf_cgroup_ancestor(cgrp, level);
+	if (!cgrp) {
+		scx_bpf_error("ancestor cgroup lookup failed");
+		return NULL;
+	}
+
+	cgc = find_cgrp_ctx(cgrp);
+	if (!cgc)
+		scx_bpf_error("ancestor cgrp_ctx lookup failed");
+	bpf_cgroup_release(cgrp);
+	return cgc;
+}
+
+static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
+{
+	int level;
+
+	if (!cgc->nr_active) {
+		stat_inc(FCG_STAT_HWT_SKIP);
+		return;
+	}
+
+	if (cgc->hweight_gen == hweight_gen) {
+		stat_inc(FCG_STAT_HWT_CACHE);
+		return;
+	}
+
+	stat_inc(FCG_STAT_HWT_UPDATES);
+	bpf_for(level, 0, cgrp->level + 1) {
+		struct fcg_cgrp_ctx *cgc;
+		bool is_active;
+
+		cgc = find_ancestor_cgrp_ctx(cgrp, level);
+		if (!cgc)
+			break;
+
+		if (!level) {
+			cgc->hweight = FCG_HWEIGHT_ONE;
+			cgc->hweight_gen = hweight_gen;
+		} else {
+			struct fcg_cgrp_ctx *pcgc;
+
+			pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1);
+			if (!pcgc)
+				break;
+
+			/*
+			 * We can be opportunistic here and not grab the
+			 * cgv_tree_lock and deal with the occasional races.
+			 * However, hweight updates are already cached and
+			 * relatively low-frequency. Let's just do the
+			 * straightforward thing.
233 + */ 234 + bpf_spin_lock(&cgv_tree_lock); 235 + is_active = cgc->nr_active; 236 + if (is_active) { 237 + cgc->hweight_gen = pcgc->hweight_gen; 238 + cgc->hweight = 239 + div_round_up(pcgc->hweight * cgc->weight, 240 + pcgc->child_weight_sum); 241 + } 242 + bpf_spin_unlock(&cgv_tree_lock); 243 + 244 + if (!is_active) { 245 + stat_inc(FCG_STAT_HWT_RACE); 246 + break; 247 + } 248 + } 249 + } 250 + } 251 + 252 + static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc) 253 + { 254 + u64 delta, cvtime, max_budget; 255 + 256 + /* 257 + * A node which is on the rbtree can't be pointed to from elsewhere yet 258 + * and thus can't be updated and repositioned. Instead, we collect the 259 + * vtime deltas separately and apply it asynchronously here. 260 + */ 261 + delta = cgc->cvtime_delta; 262 + __sync_fetch_and_sub(&cgc->cvtime_delta, delta); 263 + cvtime = cgv_node->cvtime + delta; 264 + 265 + /* 266 + * Allow a cgroup to carry the maximum budget proportional to its 267 + * hweight such that a full-hweight cgroup can immediately take up half 268 + * of the CPUs at the most while staying at the front of the rbtree. 
269 + */ 270 + max_budget = (cgrp_slice_ns * nr_cpus * cgc->hweight) / 271 + (2 * FCG_HWEIGHT_ONE); 272 + if (vtime_before(cvtime, cvtime_now - max_budget)) 273 + cvtime = cvtime_now - max_budget; 274 + 275 + cgv_node->cvtime = cvtime; 276 + } 277 + 278 + static void cgrp_enqueued(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc) 279 + { 280 + struct cgv_node_stash *stash; 281 + struct cgv_node *cgv_node; 282 + u64 cgid = cgrp->kn->id; 283 + 284 + /* paired with cmpxchg in try_pick_next_cgroup() */ 285 + if (__sync_val_compare_and_swap(&cgc->queued, 0, 1)) { 286 + stat_inc(FCG_STAT_ENQ_SKIP); 287 + return; 288 + } 289 + 290 + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); 291 + if (!stash) { 292 + scx_bpf_error("cgv_node lookup failed for cgid %llu", cgid); 293 + return; 294 + } 295 + 296 + /* NULL if the node is already on the rbtree */ 297 + cgv_node = bpf_kptr_xchg(&stash->node, NULL); 298 + if (!cgv_node) { 299 + stat_inc(FCG_STAT_ENQ_RACE); 300 + return; 301 + } 302 + 303 + bpf_spin_lock(&cgv_tree_lock); 304 + cgrp_cap_budget(cgv_node, cgc); 305 + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); 306 + bpf_spin_unlock(&cgv_tree_lock); 307 + } 308 + 309 + static void set_bypassed_at(struct task_struct *p, struct fcg_task_ctx *taskc) 310 + { 311 + /* 312 + * Tell fcg_stopping() that this bypassed the regular scheduling path 313 + * and should be force charged to the cgroup. 0 is used to indicate that 314 + * the task isn't bypassing, so if the current runtime is 0, go back by 315 + * one nanosecond. 
316 + */ 317 + taskc->bypassed_at = p->se.sum_exec_runtime ?: (u64)-1; 318 + } 319 + 320 + s32 BPF_STRUCT_OPS(fcg_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) 321 + { 322 + struct fcg_task_ctx *taskc; 323 + bool is_idle = false; 324 + s32 cpu; 325 + 326 + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle); 327 + 328 + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); 329 + if (!taskc) { 330 + scx_bpf_error("task_ctx lookup failed"); 331 + return cpu; 332 + } 333 + 334 + /* 335 + * If select_cpu_dfl() is recommending local enqueue, the target CPU is 336 + * idle. Follow it and charge the cgroup later in fcg_stopping() after 337 + * the fact. 338 + */ 339 + if (is_idle) { 340 + set_bypassed_at(p, taskc); 341 + stat_inc(FCG_STAT_LOCAL); 342 + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); 343 + } 344 + 345 + return cpu; 346 + } 347 + 348 + void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags) 349 + { 350 + struct fcg_task_ctx *taskc; 351 + struct cgroup *cgrp; 352 + struct fcg_cgrp_ctx *cgc; 353 + 354 + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); 355 + if (!taskc) { 356 + scx_bpf_error("task_ctx lookup failed"); 357 + return; 358 + } 359 + 360 + /* 361 + * Use the direct dispatching and force charging to deal with tasks with 362 + * custom affinities so that we don't have to worry about per-cgroup 363 + * dq's containing tasks that can't be executed from some CPUs. 364 + */ 365 + if (p->nr_cpus_allowed != nr_cpus) { 366 + set_bypassed_at(p, taskc); 367 + 368 + /* 369 + * The global dq is deprioritized as we don't want to let tasks 370 + * to boost themselves by constraining its cpumask. The 371 + * deprioritization is rather severe, so let's not apply that to 372 + * per-cpu kernel threads. This is ham-fisted. We probably wanna 373 + * implement per-cgroup fallback dq's instead so that we have 374 + * more control over when tasks with custom cpumask get issued. 
375 + */ 376 + if (p->nr_cpus_allowed == 1 && (p->flags & PF_KTHREAD)) { 377 + stat_inc(FCG_STAT_LOCAL); 378 + scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags); 379 + } else { 380 + stat_inc(FCG_STAT_GLOBAL); 381 + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 382 + } 383 + return; 384 + } 385 + 386 + cgrp = scx_bpf_task_cgroup(p); 387 + cgc = find_cgrp_ctx(cgrp); 388 + if (!cgc) 389 + goto out_release; 390 + 391 + if (fifo_sched) { 392 + scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags); 393 + } else { 394 + u64 tvtime = p->scx.dsq_vtime; 395 + 396 + /* 397 + * Limit the amount of budget that an idling task can accumulate 398 + * to one slice. 399 + */ 400 + if (vtime_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL)) 401 + tvtime = cgc->tvtime_now - SCX_SLICE_DFL; 402 + 403 + scx_bpf_dispatch_vtime(p, cgrp->kn->id, SCX_SLICE_DFL, 404 + tvtime, enq_flags); 405 + } 406 + 407 + cgrp_enqueued(cgrp, cgc); 408 + out_release: 409 + bpf_cgroup_release(cgrp); 410 + } 411 + 412 + /* 413 + * Walk the cgroup tree to update the active weight sums as tasks wake up and 414 + * sleep. The weight sums are used as the base when calculating the proportion a 415 + * given cgroup or task is entitled to at each level. 416 + */ 417 + static void update_active_weight_sums(struct cgroup *cgrp, bool runnable) 418 + { 419 + struct fcg_cgrp_ctx *cgc; 420 + bool updated = false; 421 + int idx; 422 + 423 + cgc = find_cgrp_ctx(cgrp); 424 + if (!cgc) 425 + return; 426 + 427 + /* 428 + * In most cases, a hot cgroup would have multiple threads going to 429 + * sleep and waking up while the whole cgroup stays active. In leaf 430 + * cgroups, ->nr_runnable which is updated with __sync operations gates 431 + * ->nr_active updates, so that we don't have to grab the cgv_tree_lock 432 + * repeatedly for a busy cgroup which is staying active. 
433 + */ 434 + if (runnable) { 435 + if (__sync_fetch_and_add(&cgc->nr_runnable, 1)) 436 + return; 437 + stat_inc(FCG_STAT_ACT); 438 + } else { 439 + if (__sync_sub_and_fetch(&cgc->nr_runnable, 1)) 440 + return; 441 + stat_inc(FCG_STAT_DEACT); 442 + } 443 + 444 + /* 445 + * If @cgrp is becoming runnable, its hweight should be refreshed after 446 + * it's added to the weight tree so that enqueue has the up-to-date 447 + * value. If @cgrp is becoming quiescent, the hweight should be 448 + * refreshed before it's removed from the weight tree so that the usage 449 + * charging which happens afterwards has access to the latest value. 450 + */ 451 + if (!runnable) 452 + cgrp_refresh_hweight(cgrp, cgc); 453 + 454 + /* propagate upwards */ 455 + bpf_for(idx, 0, cgrp->level) { 456 + int level = cgrp->level - idx; 457 + struct fcg_cgrp_ctx *cgc, *pcgc = NULL; 458 + bool propagate = false; 459 + 460 + cgc = find_ancestor_cgrp_ctx(cgrp, level); 461 + if (!cgc) 462 + break; 463 + if (level) { 464 + pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1); 465 + if (!pcgc) 466 + break; 467 + } 468 + 469 + /* 470 + * We need the propagation protected by a lock to synchronize 471 + * against weight changes. There's no reason to drop the lock at 472 + * each level but bpf_spin_lock() doesn't want any function 473 + * calls while locked. 
474 + */ 475 + bpf_spin_lock(&cgv_tree_lock); 476 + 477 + if (runnable) { 478 + if (!cgc->nr_active++) { 479 + updated = true; 480 + if (pcgc) { 481 + propagate = true; 482 + pcgc->child_weight_sum += cgc->weight; 483 + } 484 + } 485 + } else { 486 + if (!--cgc->nr_active) { 487 + updated = true; 488 + if (pcgc) { 489 + propagate = true; 490 + pcgc->child_weight_sum -= cgc->weight; 491 + } 492 + } 493 + } 494 + 495 + bpf_spin_unlock(&cgv_tree_lock); 496 + 497 + if (!propagate) 498 + break; 499 + } 500 + 501 + if (updated) 502 + __sync_fetch_and_add(&hweight_gen, 1); 503 + 504 + if (runnable) 505 + cgrp_refresh_hweight(cgrp, cgc); 506 + } 507 + 508 + void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags) 509 + { 510 + struct cgroup *cgrp; 511 + 512 + cgrp = scx_bpf_task_cgroup(p); 513 + update_active_weight_sums(cgrp, true); 514 + bpf_cgroup_release(cgrp); 515 + } 516 + 517 + void BPF_STRUCT_OPS(fcg_running, struct task_struct *p) 518 + { 519 + struct cgroup *cgrp; 520 + struct fcg_cgrp_ctx *cgc; 521 + 522 + if (fifo_sched) 523 + return; 524 + 525 + cgrp = scx_bpf_task_cgroup(p); 526 + cgc = find_cgrp_ctx(cgrp); 527 + if (cgc) { 528 + /* 529 + * @cgc->tvtime_now always progresses forward as tasks start 530 + * executing. The test and update can be performed concurrently 531 + * from multiple CPUs and thus racy. Any error should be 532 + * contained and temporary. Let's just live with it. 533 + */ 534 + if (vtime_before(cgc->tvtime_now, p->scx.dsq_vtime)) 535 + cgc->tvtime_now = p->scx.dsq_vtime; 536 + } 537 + bpf_cgroup_release(cgrp); 538 + } 539 + 540 + void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable) 541 + { 542 + struct fcg_task_ctx *taskc; 543 + struct cgroup *cgrp; 544 + struct fcg_cgrp_ctx *cgc; 545 + 546 + /* 547 + * Scale the execution time by the inverse of the weight and charge. 
548 + * 549 + * Note that the default yield implementation yields by setting 550 + * @p->scx.slice to zero and the following would treat the yielding task 551 + * as if it has consumed all its slice. If this penalizes yielding tasks 552 + * too much, determine the execution time by taking explicit timestamps 553 + * instead of depending on @p->scx.slice. 554 + */ 555 + if (!fifo_sched) 556 + p->scx.dsq_vtime += 557 + (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 558 + 559 + taskc = bpf_task_storage_get(&task_ctx, p, 0, 0); 560 + if (!taskc) { 561 + scx_bpf_error("task_ctx lookup failed"); 562 + return; 563 + } 564 + 565 + if (!taskc->bypassed_at) 566 + return; 567 + 568 + cgrp = scx_bpf_task_cgroup(p); 569 + cgc = find_cgrp_ctx(cgrp); 570 + if (cgc) { 571 + __sync_fetch_and_add(&cgc->cvtime_delta, 572 + p->se.sum_exec_runtime - taskc->bypassed_at); 573 + taskc->bypassed_at = 0; 574 + } 575 + bpf_cgroup_release(cgrp); 576 + } 577 + 578 + void BPF_STRUCT_OPS(fcg_quiescent, struct task_struct *p, u64 deq_flags) 579 + { 580 + struct cgroup *cgrp; 581 + 582 + cgrp = scx_bpf_task_cgroup(p); 583 + update_active_weight_sums(cgrp, false); 584 + bpf_cgroup_release(cgrp); 585 + } 586 + 587 + void BPF_STRUCT_OPS(fcg_cgroup_set_weight, struct cgroup *cgrp, u32 weight) 588 + { 589 + struct fcg_cgrp_ctx *cgc, *pcgc = NULL; 590 + 591 + cgc = find_cgrp_ctx(cgrp); 592 + if (!cgc) 593 + return; 594 + 595 + if (cgrp->level) { 596 + pcgc = find_ancestor_cgrp_ctx(cgrp, cgrp->level - 1); 597 + if (!pcgc) 598 + return; 599 + } 600 + 601 + bpf_spin_lock(&cgv_tree_lock); 602 + if (pcgc && cgc->nr_active) 603 + pcgc->child_weight_sum += (s64)weight - cgc->weight; 604 + cgc->weight = weight; 605 + bpf_spin_unlock(&cgv_tree_lock); 606 + } 607 + 608 + static bool try_pick_next_cgroup(u64 *cgidp) 609 + { 610 + struct bpf_rb_node *rb_node; 611 + struct cgv_node_stash *stash; 612 + struct cgv_node *cgv_node; 613 + struct fcg_cgrp_ctx *cgc; 614 + struct cgroup *cgrp; 615 + u64 cgid; 616 + 
617 + /* pop the front cgroup and wind cvtime_now accordingly */ 618 + bpf_spin_lock(&cgv_tree_lock); 619 + 620 + rb_node = bpf_rbtree_first(&cgv_tree); 621 + if (!rb_node) { 622 + bpf_spin_unlock(&cgv_tree_lock); 623 + stat_inc(FCG_STAT_PNC_NO_CGRP); 624 + *cgidp = 0; 625 + return true; 626 + } 627 + 628 + rb_node = bpf_rbtree_remove(&cgv_tree, rb_node); 629 + bpf_spin_unlock(&cgv_tree_lock); 630 + 631 + if (!rb_node) { 632 + /* 633 + * This should never happen. bpf_rbtree_first() was called 634 + * above while the tree lock was held, so the node should 635 + * always be present. 636 + */ 637 + scx_bpf_error("node could not be removed"); 638 + return true; 639 + } 640 + 641 + cgv_node = container_of(rb_node, struct cgv_node, rb_node); 642 + cgid = cgv_node->cgid; 643 + 644 + if (vtime_before(cvtime_now, cgv_node->cvtime)) 645 + cvtime_now = cgv_node->cvtime; 646 + 647 + /* 648 + * If lookup fails, the cgroup's gone. Free and move on. See 649 + * fcg_cgroup_exit(). 650 + */ 651 + cgrp = bpf_cgroup_from_id(cgid); 652 + if (!cgrp) { 653 + stat_inc(FCG_STAT_PNC_GONE); 654 + goto out_free; 655 + } 656 + 657 + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); 658 + if (!cgc) { 659 + bpf_cgroup_release(cgrp); 660 + stat_inc(FCG_STAT_PNC_GONE); 661 + goto out_free; 662 + } 663 + 664 + if (!scx_bpf_consume(cgid)) { 665 + bpf_cgroup_release(cgrp); 666 + stat_inc(FCG_STAT_PNC_EMPTY); 667 + goto out_stash; 668 + } 669 + 670 + /* 671 + * Successfully consumed from the cgroup. This will be our current 672 + * cgroup for the new slice. Refresh its hweight. 673 + */ 674 + cgrp_refresh_hweight(cgrp, cgc); 675 + 676 + bpf_cgroup_release(cgrp); 677 + 678 + /* 679 + * As the cgroup may have more tasks, add it back to the rbtree. Note 680 + * that here we charge the full slice upfront and then exact later 681 + * according to the actual consumption. This prevents lowpri thundering 682 + * herd from saturating the machine. 
683 + */ 684 + bpf_spin_lock(&cgv_tree_lock); 685 + cgv_node->cvtime += cgrp_slice_ns * FCG_HWEIGHT_ONE / (cgc->hweight ?: 1); 686 + cgrp_cap_budget(cgv_node, cgc); 687 + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); 688 + bpf_spin_unlock(&cgv_tree_lock); 689 + 690 + *cgidp = cgid; 691 + stat_inc(FCG_STAT_PNC_NEXT); 692 + return true; 693 + 694 + out_stash: 695 + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); 696 + if (!stash) { 697 + stat_inc(FCG_STAT_PNC_GONE); 698 + goto out_free; 699 + } 700 + 701 + /* 702 + * Paired with cmpxchg in cgrp_enqueued(). If they see the following 703 + * transition, they'll enqueue the cgroup. If they are earlier, we'll 704 + * see their task in the dq below and requeue the cgroup. 705 + */ 706 + __sync_val_compare_and_swap(&cgc->queued, 1, 0); 707 + 708 + if (scx_bpf_dsq_nr_queued(cgid)) { 709 + bpf_spin_lock(&cgv_tree_lock); 710 + bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less); 711 + bpf_spin_unlock(&cgv_tree_lock); 712 + stat_inc(FCG_STAT_PNC_RACE); 713 + } else { 714 + cgv_node = bpf_kptr_xchg(&stash->node, cgv_node); 715 + if (cgv_node) { 716 + scx_bpf_error("unexpected !NULL cgv_node stash"); 717 + goto out_free; 718 + } 719 + } 720 + 721 + return false; 722 + 723 + out_free: 724 + bpf_obj_drop(cgv_node); 725 + return false; 726 + } 727 + 728 + void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev) 729 + { 730 + struct fcg_cpu_ctx *cpuc; 731 + struct fcg_cgrp_ctx *cgc; 732 + struct cgroup *cgrp; 733 + u64 now = bpf_ktime_get_ns(); 734 + bool picked_next = false; 735 + 736 + cpuc = find_cpu_ctx(); 737 + if (!cpuc) 738 + return; 739 + 740 + if (!cpuc->cur_cgid) 741 + goto pick_next_cgroup; 742 + 743 + if (vtime_before(now, cpuc->cur_at + cgrp_slice_ns)) { 744 + if (scx_bpf_consume(cpuc->cur_cgid)) { 745 + stat_inc(FCG_STAT_CNS_KEEP); 746 + return; 747 + } 748 + stat_inc(FCG_STAT_CNS_EMPTY); 749 + } else { 750 + stat_inc(FCG_STAT_CNS_EXPIRE); 751 + } 752 + 753 + /* 754 + * The 
current cgroup is expiring. It was already charged a full slice. 755 + * Calculate the actual usage and accumulate the delta. 756 + */ 757 + cgrp = bpf_cgroup_from_id(cpuc->cur_cgid); 758 + if (!cgrp) { 759 + stat_inc(FCG_STAT_CNS_GONE); 760 + goto pick_next_cgroup; 761 + } 762 + 763 + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0); 764 + if (cgc) { 765 + /* 766 + * We want to update the vtime delta and then look for the next 767 + * cgroup to execute but the latter needs to be done in a loop 768 + * and we can't keep the lock held. Oh well... 769 + */ 770 + bpf_spin_lock(&cgv_tree_lock); 771 + __sync_fetch_and_add(&cgc->cvtime_delta, 772 + (cpuc->cur_at + cgrp_slice_ns - now) * 773 + FCG_HWEIGHT_ONE / (cgc->hweight ?: 1)); 774 + bpf_spin_unlock(&cgv_tree_lock); 775 + } else { 776 + stat_inc(FCG_STAT_CNS_GONE); 777 + } 778 + 779 + bpf_cgroup_release(cgrp); 780 + 781 + pick_next_cgroup: 782 + cpuc->cur_at = now; 783 + 784 + if (scx_bpf_consume(SCX_DSQ_GLOBAL)) { 785 + cpuc->cur_cgid = 0; 786 + return; 787 + } 788 + 789 + bpf_repeat(CGROUP_MAX_RETRIES) { 790 + if (try_pick_next_cgroup(&cpuc->cur_cgid)) { 791 + picked_next = true; 792 + break; 793 + } 794 + } 795 + 796 + /* 797 + * This only happens if try_pick_next_cgroup() races against enqueue 798 + * path for more than CGROUP_MAX_RETRIES times, which is extremely 799 + * unlikely and likely indicates an underlying bug. There shouldn't be 800 + * any stall risk as the race is against enqueue. 801 + */ 802 + if (!picked_next) 803 + stat_inc(FCG_STAT_PNC_FAIL); 804 + } 805 + 806 + s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p, 807 + struct scx_init_task_args *args) 808 + { 809 + struct fcg_task_ctx *taskc; 810 + struct fcg_cgrp_ctx *cgc; 811 + 812 + /* 813 + * @p is new. Let's ensure that its task_ctx is available. We can sleep 814 + * in this function and the following will automatically use GFP_KERNEL. 
815 + */ 816 + taskc = bpf_task_storage_get(&task_ctx, p, 0, 817 + BPF_LOCAL_STORAGE_GET_F_CREATE); 818 + if (!taskc) 819 + return -ENOMEM; 820 + 821 + taskc->bypassed_at = 0; 822 + 823 + if (!(cgc = find_cgrp_ctx(args->cgroup))) 824 + return -ENOENT; 825 + 826 + p->scx.dsq_vtime = cgc->tvtime_now; 827 + 828 + return 0; 829 + } 830 + 831 + int BPF_STRUCT_OPS_SLEEPABLE(fcg_cgroup_init, struct cgroup *cgrp, 832 + struct scx_cgroup_init_args *args) 833 + { 834 + struct fcg_cgrp_ctx *cgc; 835 + struct cgv_node *cgv_node; 836 + struct cgv_node_stash empty_stash = {}, *stash; 837 + u64 cgid = cgrp->kn->id; 838 + int ret; 839 + 840 + /* 841 + * Technically incorrect as cgroup ID is full 64bit while dq ID is 842 + * 63bit. Should not be a problem in practice and easy to spot in the 843 + * unlikely case that it breaks. 844 + */ 845 + ret = scx_bpf_create_dsq(cgid, -1); 846 + if (ret) 847 + return ret; 848 + 849 + cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 850 + BPF_LOCAL_STORAGE_GET_F_CREATE); 851 + if (!cgc) { 852 + ret = -ENOMEM; 853 + goto err_destroy_dsq; 854 + } 855 + 856 + cgc->weight = args->weight; 857 + cgc->hweight = FCG_HWEIGHT_ONE; 858 + 859 + ret = bpf_map_update_elem(&cgv_node_stash, &cgid, &empty_stash, 860 + BPF_NOEXIST); 861 + if (ret) { 862 + if (ret != -ENOMEM) 863 + scx_bpf_error("unexpected stash creation error (%d)", 864 + ret); 865 + goto err_destroy_dsq; 866 + } 867 + 868 + stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid); 869 + if (!stash) { 870 + scx_bpf_error("unexpected cgv_node stash lookup failure"); 871 + ret = -ENOENT; 872 + goto err_destroy_dsq; 873 + } 874 + 875 + cgv_node = bpf_obj_new(struct cgv_node); 876 + if (!cgv_node) { 877 + ret = -ENOMEM; 878 + goto err_del_cgv_node; 879 + } 880 + 881 + cgv_node->cgid = cgid; 882 + cgv_node->cvtime = cvtime_now; 883 + 884 + cgv_node = bpf_kptr_xchg(&stash->node, cgv_node); 885 + if (cgv_node) { 886 + scx_bpf_error("unexpected !NULL cgv_node stash"); 887 + ret = -EBUSY; 888 + goto err_drop; 
+	}
+
+	return 0;
+
+err_drop:
+	bpf_obj_drop(cgv_node);
+err_del_cgv_node:
+	bpf_map_delete_elem(&cgv_node_stash, &cgid);
+err_destroy_dsq:
+	scx_bpf_destroy_dsq(cgid);
+	return ret;
+}
+
+void BPF_STRUCT_OPS(fcg_cgroup_exit, struct cgroup *cgrp)
+{
+	u64 cgid = cgrp->kn->id;
+
+	/*
+	 * For now, there's no way to find and remove the cgv_node if it's on
+	 * the cgv_tree. Let's drain them in the dispatch path as they get
+	 * popped off the front of the tree.
+	 */
+	bpf_map_delete_elem(&cgv_node_stash, &cgid);
+	scx_bpf_destroy_dsq(cgid);
+}
+
+void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p,
+		    struct cgroup *from, struct cgroup *to)
+{
+	struct fcg_cgrp_ctx *from_cgc, *to_cgc;
+	s64 vtime_delta;
+
+	/* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */
+	if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to)))
+		return;
+
+	vtime_delta = p->scx.dsq_vtime - from_cgc->tvtime_now;
+	p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta;
+}
+
+void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(flatcg_ops,
+	       .select_cpu		= (void *)fcg_select_cpu,
+	       .enqueue			= (void *)fcg_enqueue,
+	       .dispatch		= (void *)fcg_dispatch,
+	       .runnable		= (void *)fcg_runnable,
+	       .running			= (void *)fcg_running,
+	       .stopping		= (void *)fcg_stopping,
+	       .quiescent		= (void *)fcg_quiescent,
+	       .init_task		= (void *)fcg_init_task,
+	       .cgroup_set_weight	= (void *)fcg_cgroup_set_weight,
+	       .cgroup_init		= (void *)fcg_cgroup_init,
+	       .cgroup_exit		= (void *)fcg_cgroup_exit,
+	       .cgroup_move		= (void *)fcg_cgroup_move,
+	       .exit			= (void *)fcg_exit,
+	       .flags			= SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING,
+	       .name			= "flatcg");
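
The hierarchical weight ("hweight") that cgrp_refresh_hweight() computes above can be illustrated outside BPF. The following is a minimal sketch, not the scheduler's code: `flatten_hweight()`, `HWEIGHT_ONE`, and the two input arrays are hypothetical names chosen for this example. It assumes the path from the root to a cgroup is given as one weight per level plus the sum of active sibling weights at that level, and it reuses the round-up division and wrap-safe vtime comparison from the helpers in the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fixed-point "one", mirroring FCG_HWEIGHT_ONE (1 << 16). */
#define HWEIGHT_ONE (1ULL << 16)

/* Round-up integer division, same arithmetic as div_round_up() in the diff. */
static uint64_t div_round_up(uint64_t dividend, uint64_t divisor)
{
	assert(divisor != 0);
	return (dividend + divisor - 1) / divisor;
}

/*
 * Wrap-safe "a is earlier than b": the unsigned difference is reinterpreted
 * as signed, so the result stays correct across u64 wraparound as long as
 * the two values are less than 2^63 apart.
 */
static bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Hypothetical helper: walk from the root's child down to the target cgroup.
 * weights[i] is the cgroup's own weight at depth i + 1 and sibling_sums[i]
 * is the sum of active child weights of its parent. The root starts at
 * HWEIGHT_ONE and each level scales the share by weight / sibling_sum.
 */
static uint64_t flatten_hweight(const uint32_t *weights,
				const uint64_t *sibling_sums, int depth)
{
	uint64_t hweight = HWEIGHT_ONE;

	for (int i = 0; i < depth; i++)
		hweight = div_round_up(hweight * weights[i], sibling_sums[i]);
	return hweight;
}
```

With weights {100, 100} and sibling sums {200, 400}, the cgroup ends up with half of a quarter of the machine, i.e. an hweight of 8192 out of 65536. Rounding up keeps a very low-weight cgroup from collapsing to an hweight of zero.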

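The comments in cgrp_enqueued() and try_pick_next_cgroup() describe a pair of compare-and-swap operations on `queued`. A rough userspace sketch of that handshake follows, using C11 atomics in place of the `__sync` builtins. `struct cgrp_state`, `enqueue_task()`, and `retire_empty()` are hypothetical names, and `nr_tasks` is only a stand-in for the length of the cgroup's dispatch queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-cgroup state for this sketch only. */
struct cgrp_state {
	atomic_uint queued;	/* 1 while the cgroup is considered queued */
	atomic_uint nr_tasks;	/* stand-in for the cgroup DSQ's length */
};

/*
 * Enqueue side: the task is made visible first, then the 0 -> 1 transition
 * is attempted. Only the caller that wins the transition has to put the
 * cgroup on the tree; everyone else can skip that work.
 */
static bool enqueue_task(struct cgrp_state *cg)
{
	unsigned int expected = 0;

	atomic_fetch_add(&cg->nr_tasks, 1);
	return atomic_compare_exchange_strong(&cg->queued, &expected, 1);
}

/*
 * Dispatch side, called after the cgroup's queue was found empty: clear the
 * flag first, then re-check the queue. A task that slipped in before the
 * flag was cleared is seen by the re-check, so the caller requeues the
 * cgroup (return value true) instead of losing the wakeup.
 */
static bool retire_empty(struct cgrp_state *cg)
{
	unsigned int expected = 1;

	atomic_compare_exchange_strong(&cg->queued, &expected, 0);
	return atomic_load(&cg->nr_tasks) > 0;
}
```

The ordering is what matters here: enqueuers publish the task before touching the flag, and the dispatcher clears the flag before re-checking the queue. In the diff itself, the remaining overlap (an enqueuer winning the transition while the node is already back on the tree) is caught by bpf_kptr_xchg() returning NULL and is counted as FCG_STAT_ENQ_RACE.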
+233

tools/sched_ext/scx_flatcg.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 6 + */ 7 + #include <stdio.h> 8 + #include <signal.h> 9 + #include <unistd.h> 10 + #include <libgen.h> 11 + #include <limits.h> 12 + #include <inttypes.h> 13 + #include <fcntl.h> 14 + #include <time.h> 15 + #include <bpf/bpf.h> 16 + #include <scx/common.h> 17 + #include "scx_flatcg.h" 18 + #include "scx_flatcg.bpf.skel.h" 19 + 20 + #ifndef FILEID_KERNFS 21 + #define FILEID_KERNFS 0xfe 22 + #endif 23 + 24 + const char help_fmt[] = 25 + "A flattened cgroup hierarchy sched_ext scheduler.\n" 26 + "\n" 27 + "See the top-level comment in .bpf.c for more details.\n" 28 + "\n" 29 + "Usage: %s [-s SLICE_US] [-i INTERVAL] [-f] [-v]\n" 30 + "\n" 31 + " -s SLICE_US Override slice duration\n" 32 + " -i INTERVAL Report interval\n" 33 + " -f Use FIFO scheduling instead of weighted vtime scheduling\n" 34 + " -v Print libbpf debug messages\n" 35 + " -h Display this help and exit\n"; 36 + 37 + static bool verbose; 38 + static volatile int exit_req; 39 + 40 + static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args) 41 + { 42 + if (level == LIBBPF_DEBUG && !verbose) 43 + return 0; 44 + return vfprintf(stderr, format, args); 45 + } 46 + 47 + static void sigint_handler(int dummy) 48 + { 49 + exit_req = 1; 50 + } 51 + 52 + static float read_cpu_util(__u64 *last_sum, __u64 *last_idle) 53 + { 54 + FILE *fp; 55 + char buf[4096]; 56 + char *line, *cur = NULL, *tok; 57 + __u64 sum = 0, idle = 0; 58 + __u64 delta_sum, delta_idle; 59 + int idx; 60 + 61 + fp = fopen("/proc/stat", "r"); 62 + if (!fp) { 63 + perror("fopen(\"/proc/stat\")"); 64 + return 0.0; 65 + } 66 + 67 + if (!fgets(buf, sizeof(buf), fp)) { 68 + perror("fgets(\"/proc/stat\")"); 69 + fclose(fp); 70 + return 0.0; 71 + } 72 + fclose(fp); 73 + 74 + line = buf; 75 + for (idx = 0; 
(tok = strtok_r(line, " \n", &cur)); idx++) { 76 + char *endp = NULL; 77 + __u64 v; 78 + 79 + if (idx == 0) { 80 + line = NULL; 81 + continue; 82 + } 83 + v = strtoull(tok, &endp, 0); 84 + if (!endp || *endp != '\0') { 85 + fprintf(stderr, "failed to parse %dth field of /proc/stat (\"%s\")\n", 86 + idx, tok); 87 + continue; 88 + } 89 + sum += v; 90 + if (idx == 4) 91 + idle = v; 92 + } 93 + 94 + delta_sum = sum - *last_sum; 95 + delta_idle = idle - *last_idle; 96 + *last_sum = sum; 97 + *last_idle = idle; 98 + 99 + return delta_sum ? (float)(delta_sum - delta_idle) / delta_sum : 0.0; 100 + } 101 + 102 + static void fcg_read_stats(struct scx_flatcg *skel, __u64 *stats) 103 + { 104 + __u64 cnts[FCG_NR_STATS][skel->rodata->nr_cpus]; 105 + __u32 idx; 106 + 107 + memset(stats, 0, sizeof(stats[0]) * FCG_NR_STATS); 108 + 109 + for (idx = 0; idx < FCG_NR_STATS; idx++) { 110 + int ret, cpu; 111 + 112 + ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats), 113 + &idx, cnts[idx]); 114 + if (ret < 0) 115 + continue; 116 + for (cpu = 0; cpu < skel->rodata->nr_cpus; cpu++) 117 + stats[idx] += cnts[idx][cpu]; 118 + } 119 + } 120 + 121 + int main(int argc, char **argv) 122 + { 123 + struct scx_flatcg *skel; 124 + struct bpf_link *link; 125 + struct timespec intv_ts = { .tv_sec = 2, .tv_nsec = 0 }; 126 + bool dump_cgrps = false; 127 + __u64 last_cpu_sum = 0, last_cpu_idle = 0; 128 + __u64 last_stats[FCG_NR_STATS] = {}; 129 + unsigned long seq = 0; 130 + __s32 opt; 131 + __u64 ecode; 132 + 133 + libbpf_set_print(libbpf_print_fn); 134 + signal(SIGINT, sigint_handler); 135 + signal(SIGTERM, sigint_handler); 136 + restart: 137 + skel = SCX_OPS_OPEN(flatcg_ops, scx_flatcg); 138 + 139 + skel->rodata->nr_cpus = libbpf_num_possible_cpus(); 140 + 141 + while ((opt = getopt(argc, argv, "s:i:dfvh")) != -1) { 142 + double v; 143 + 144 + switch (opt) { 145 + case 's': 146 + v = strtod(optarg, NULL); 147 + skel->rodata->cgrp_slice_ns = v * 1000; 148 + break; 149 + case 'i': 150 + v = 
strtod(optarg, NULL); 151 + intv_ts.tv_sec = v; 152 + intv_ts.tv_nsec = (v - (float)intv_ts.tv_sec) * 1000000000; 153 + break; 154 + case 'd': 155 + dump_cgrps = true; 156 + break; 157 + case 'f': 158 + skel->rodata->fifo_sched = true; 159 + break; 160 + case 'v': 161 + verbose = true; 162 + break; 163 + case 'h': 164 + default: 165 + fprintf(stderr, help_fmt, basename(argv[0])); 166 + return opt != 'h'; 167 + } 168 + } 169 + 170 + printf("slice=%.1lfms intv=%.1lfs dump_cgrps=%d", 171 + (double)skel->rodata->cgrp_slice_ns / 1000000.0, 172 + (double)intv_ts.tv_sec + (double)intv_ts.tv_nsec / 1000000000.0, 173 + dump_cgrps); 174 + 175 + SCX_OPS_LOAD(skel, flatcg_ops, scx_flatcg, uei); 176 + link = SCX_OPS_ATTACH(skel, flatcg_ops, scx_flatcg); 177 + 178 + while (!exit_req && !UEI_EXITED(skel, uei)) { 179 + __u64 acc_stats[FCG_NR_STATS]; 180 + __u64 stats[FCG_NR_STATS]; 181 + float cpu_util; 182 + int i; 183 + 184 + cpu_util = read_cpu_util(&last_cpu_sum, &last_cpu_idle); 185 + 186 + fcg_read_stats(skel, acc_stats); 187 + for (i = 0; i < FCG_NR_STATS; i++) 188 + stats[i] = acc_stats[i] - last_stats[i]; 189 + 190 + memcpy(last_stats, acc_stats, sizeof(acc_stats)); 191 + 192 + printf("\n[SEQ %6lu cpu=%5.1lf hweight_gen=%" PRIu64 "]\n", 193 + seq++, cpu_util * 100.0, skel->data->hweight_gen); 194 + printf(" act:%6llu deact:%6llu global:%6llu local:%6llu\n", 195 + stats[FCG_STAT_ACT], 196 + stats[FCG_STAT_DEACT], 197 + stats[FCG_STAT_GLOBAL], 198 + stats[FCG_STAT_LOCAL]); 199 + printf("HWT cache:%6llu update:%6llu skip:%6llu race:%6llu\n", 200 + stats[FCG_STAT_HWT_CACHE], 201 + stats[FCG_STAT_HWT_UPDATES], 202 + stats[FCG_STAT_HWT_SKIP], 203 + stats[FCG_STAT_HWT_RACE]); 204 + printf("ENQ skip:%6llu race:%6llu\n", 205 + stats[FCG_STAT_ENQ_SKIP], 206 + stats[FCG_STAT_ENQ_RACE]); 207 + printf("CNS keep:%6llu expire:%6llu empty:%6llu gone:%6llu\n", 208 + stats[FCG_STAT_CNS_KEEP], 209 + stats[FCG_STAT_CNS_EXPIRE], 210 + stats[FCG_STAT_CNS_EMPTY], 211 + 
stats[FCG_STAT_CNS_GONE]); 212 + printf("PNC next:%6llu empty:%6llu nocgrp:%6llu gone:%6llu race:%6llu fail:%6llu\n", 213 + stats[FCG_STAT_PNC_NEXT], 214 + stats[FCG_STAT_PNC_EMPTY], 215 + stats[FCG_STAT_PNC_NO_CGRP], 216 + stats[FCG_STAT_PNC_GONE], 217 + stats[FCG_STAT_PNC_RACE], 218 + stats[FCG_STAT_PNC_FAIL]); 219 + printf("BAD remove:%6llu\n", 220 + acc_stats[FCG_STAT_BAD_REMOVAL]); 221 + fflush(stdout); 222 + 223 + nanosleep(&intv_ts, NULL); 224 + } 225 + 226 + bpf_link__destroy(link); 227 + ecode = UEI_REPORT(skel, uei); 228 + scx_flatcg__destroy(skel); 229 + 230 + if (UEI_ECODE_RESTART(ecode)) 231 + goto restart; 232 + return 0; 233 + }
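
read_cpu_util() above derives CPU utilization from two successive readings of the aggregate `cpu` line in /proc/stat. The sketch below shows the same delta arithmetic on in-memory strings so it can be tried without the scheduler. `struct cpu_sample`, `parse_cpu_line()`, and `cpu_util()` are hypothetical names, and unlike the original, this version rejects a line with an unparsable field instead of skipping the field.

```c
#define _POSIX_C_SOURCE 200809L	/* for strtok_r() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for one reading of the aggregate "cpu" line. */
struct cpu_sample {
	uint64_t sum;	/* all time fields added together */
	uint64_t idle;	/* fourth numeric field */
};

/* Parse one "cpu ..." line. Returns 0 on success, -1 on malformed input. */
static int parse_cpu_line(const char *line, struct cpu_sample *out)
{
	char buf[512];
	char *tok, *save = NULL;
	int idx = 0;

	assert(line && out);
	if (strlen(line) >= sizeof(buf))
		return -1;
	strcpy(buf, line);

	out->sum = 0;
	out->idle = 0;
	for (tok = strtok_r(buf, " \n", &save); tok;
	     tok = strtok_r(NULL, " \n", &save), idx++) {
		char *endp = NULL;
		uint64_t v;

		if (idx == 0)	/* skip the "cpu" label */
			continue;
		v = strtoull(tok, &endp, 10);
		if (*endp != '\0')
			return -1;
		out->sum += v;
		if (idx == 4)	/* label + user, nice, system, then idle */
			out->idle = v;
	}
	return idx > 4 ? 0 : -1;
}

/* Busy fraction between two samples; 0.0 if no time has elapsed. */
static double cpu_util(const struct cpu_sample *prev,
		       const struct cpu_sample *cur)
{
	uint64_t dsum = cur->sum - prev->sum;
	uint64_t didle = cur->idle - prev->idle;

	return dsum ? (double)(dsum - didle) / dsum : 0.0;
}
```

Because the counters in /proc/stat only ever grow, a single reading says nothing about current load. The value printed as `cpu=` in the report loop is always the busy share of the interval between two readings.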

+51

tools/sched_ext/scx_flatcg.h

···
+#ifndef __SCX_EXAMPLE_FLATCG_H
+#define __SCX_EXAMPLE_FLATCG_H
+
+enum {
+	FCG_HWEIGHT_ONE		= 1LLU << 16,
+};
+
+enum fcg_stat_idx {
+	FCG_STAT_ACT,
+	FCG_STAT_DEACT,
+	FCG_STAT_LOCAL,
+	FCG_STAT_GLOBAL,
+
+	FCG_STAT_HWT_UPDATES,
+	FCG_STAT_HWT_CACHE,
+	FCG_STAT_HWT_SKIP,
+	FCG_STAT_HWT_RACE,
+
+	FCG_STAT_ENQ_SKIP,
+	FCG_STAT_ENQ_RACE,
+
+	FCG_STAT_CNS_KEEP,
+	FCG_STAT_CNS_EXPIRE,
+	FCG_STAT_CNS_EMPTY,
+	FCG_STAT_CNS_GONE,
+
+	FCG_STAT_PNC_NO_CGRP,
+	FCG_STAT_PNC_NEXT,
+	FCG_STAT_PNC_EMPTY,
+	FCG_STAT_PNC_GONE,
+	FCG_STAT_PNC_RACE,
+	FCG_STAT_PNC_FAIL,
+
+	FCG_STAT_BAD_REMOVAL,
+
+	FCG_NR_STATS,
+};
+
+struct fcg_cgrp_ctx {
+	u32			nr_active;
+	u32			nr_runnable;
+	u32			queued;
+	u32			weight;
+	u32			hweight;
+	u64			child_weight_sum;
+	u64			hweight_gen;
+	s64			cvtime_delta;
+	u64			tvtime_now;
+};
+
+#endif /* __SCX_EXAMPLE_FLATCG_H */
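
FCG_HWEIGHT_ONE in this header is the fixed-point value for a full share of the machine. As a hedged illustration of how scx_flatcg.bpf.c uses it, the sketch below reproduces two pieces of arithmetic from that file: the per-slice vtime charge, which scales inversely with hweight, and the budget cap from cgrp_cap_budget(). `slice_charge()` and `cap_budget()` are hypothetical names, and the 20 ms slice used in the note after the block is an arbitrary example value, not a claim about the scheduler's default.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point "one", mirroring FCG_HWEIGHT_ONE (1 << 16). */
#define HWEIGHT_ONE (1ULL << 16)

/*
 * vtime charged for one slice. A cgroup holding the full hweight is charged
 * exactly slice_ns; a cgroup holding half of it is charged twice as much, so
 * its cvtime advances faster and it is picked less often. A zero hweight is
 * treated as 1 to avoid dividing by zero, as the "?: 1" in the diff does.
 */
static uint64_t slice_charge(uint64_t slice_ns, uint32_t hweight)
{
	assert(slice_ns > 0);
	return slice_ns * HWEIGHT_ONE / (hweight ? hweight : 1);
}

/*
 * Clamp cvtime so that it lags cvtime_now by at most max_budget. The budget
 * is proportional to hweight: a full-hweight cgroup may bank enough credit
 * to occupy half of the CPUs for one slice.
 */
static uint64_t cap_budget(uint64_t cvtime, uint64_t cvtime_now,
			   uint64_t slice_ns, uint32_t nr_cpus,
			   uint32_t hweight)
{
	uint64_t max_budget = slice_ns * nr_cpus * hweight / (2 * HWEIGHT_ONE);

	/* wrap-safe "cvtime is before cvtime_now - max_budget" */
	if ((int64_t)(cvtime - (cvtime_now - max_budget)) < 0)
		cvtime = cvtime_now - max_budget;
	return cvtime;
}
```

For a 20 ms slice on 8 CPUs, a full-hweight cgroup can lag the global clock by at most 80 ms of vtime. A cgroup that slept for a long time therefore re-enters the tree at `cvtime_now - 80ms` rather than at its stale, much older cvtime.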

+827

tools/sched_ext/scx_qmap.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A simple five-level FIFO queue scheduler.
 *
 * There are five FIFOs implemented using BPF_MAP_TYPE_QUEUE. A task gets
 * assigned to one depending on its compound weight. Each CPU round robins
 * through the FIFOs and dispatches more from FIFOs with higher indices - 1 from
 * queue0, 2 from queue1, 4 from queue2 and so on.
 *
 * This scheduler demonstrates:
 *
 * - BPF-side queueing using PIDs.
 * - Sleepable per-task storage allocation using ops.init_task().
 * - Using ops.cpu_release() to handle a higher priority scheduling class taking
 *   the CPU away.
 * - Core-sched support.
 *
 * This scheduler is primarily for demonstration and testing of sched_ext
 * features and unlikely to be useful for actual workloads.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
#include <scx/common.bpf.h>

enum consts {
	ONE_SEC_IN_NS		= 1000000000,
	SHARED_DSQ		= 0,
	HIGHPRI_DSQ		= 1,
	HIGHPRI_WEIGHT		= 8668,		/* this is what -20 maps to */
};

char _license[] SEC("license") = "GPL";

const volatile u64 slice_ns = SCX_SLICE_DFL;
const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
const volatile bool highpri_boosting;
const volatile bool print_shared_dsq;
const volatile s32 disallow_tgid;
const volatile bool suppress_dump;

u64 nr_highpri_queued;
u32 test_error_cnt;

UEI_DEFINE(uei);

struct qmap {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__uint(max_entries, 4096);
	__type(value, u32);
} queue0 SEC(".maps"),
  queue1 SEC(".maps"),
  queue2 SEC(".maps"),
  queue3 SEC(".maps"),
  queue4 SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 5);
	__type(key, int);
	__array(values, struct qmap);
} queue_arr SEC(".maps") = {
	.values = {
		[0] = &queue0,
		[1] = &queue1,
		[2] = &queue2,
		[3] = &queue3,
		[4] = &queue4,
	},
};

/*
 * If enabled, the CPU performance target is set from the queue index
 * according to the following table.
 */
static const u32 qidx_to_cpuperf_target[] = {
	[0] = SCX_CPUPERF_ONE * 0 / 4,
	[1] = SCX_CPUPERF_ONE * 1 / 4,
	[2] = SCX_CPUPERF_ONE * 2 / 4,
	[3] = SCX_CPUPERF_ONE * 3 / 4,
	[4] = SCX_CPUPERF_ONE * 4 / 4,
};

/*
 * Per-queue sequence numbers to implement core-sched ordering.
 *
 * Tail seq is assigned to each queued task and incremented. Head seq tracks the
 * sequence number of the latest dispatched task. The distance between a
 * task's seq and the associated queue's head seq is called the queue distance
 * and used when comparing two tasks for ordering. See qmap_core_sched_before().
 */
static u64 core_sched_head_seqs[5];
static u64 core_sched_tail_seqs[5];

/* Per-task scheduling context */
struct task_ctx {
	bool	force_local;	/* Dispatch directly to local_dsq */
	bool	highpri;
	u64	core_sched_seq;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctx_stor SEC(".maps");

struct cpu_ctx {
	u64	dsp_idx;	/* dispatch index */
	u64	dsp_cnt;	/* remaining count */
	u32	avg_weight;
	u32	cpuperf_target;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct cpu_ctx);
} cpu_ctx_stor SEC(".maps");

/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;

static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
{
	s32 cpu;

	if (p->nr_cpus_allowed == 1 ||
	    scx_bpf_test_and_clear_cpu_idle(prev_cpu))
		return prev_cpu;

	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
	if (cpu >= 0)
		return cpu;

	return -1;
}

static struct task_ctx *lookup_task_ctx(struct task_struct *p)
{
	struct task_ctx *tctx;

	if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
		scx_bpf_error("task_ctx lookup failed");
		return NULL;
	}
	return tctx;
}

s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	struct task_ctx *tctx;
	s32 cpu;

	if (!(tctx = lookup_task_ctx(p)))
		return -ESRCH;

	cpu = pick_direct_dispatch_cpu(p, prev_cpu);

	if (cpu >= 0) {
		tctx->force_local = true;
		return cpu;
	} else {
		return prev_cpu;
	}
}

static int weight_to_idx(u32 weight)
{
	/* Coarsely map the compound weight to a FIFO. */
	if (weight <= 25)
		return 0;
	else if (weight <= 50)
		return 1;
	else if (weight < 200)
		return 2;
	else if (weight < 400)
		return 3;
	else
		return 4;
}

void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
{
	static u32 user_cnt, kernel_cnt;
	struct task_ctx *tctx;
	u32 pid = p->pid;
	int idx = weight_to_idx(p->scx.weight);
	void *ring;
	s32 cpu;

	if (p->flags & PF_KTHREAD) {
		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
			return;
	} else {
		if (stall_user_nth && !(++user_cnt % stall_user_nth))
			return;
	}

	if (test_error_cnt && !--test_error_cnt)
		scx_bpf_error("test triggering error");

	if (!(tctx = lookup_task_ctx(p)))
		return;

	/*
	 * All enqueued tasks must have their core_sched_seq updated for correct
	 * core-sched ordering. Also, take a look at the end of qmap_dispatch().
	 */
	tctx->core_sched_seq = core_sched_tail_seqs[idx]++;

	/*
	 * If qmap_select_cpu() is telling us to or this is the last runnable
	 * task on the CPU, enqueue locally.
	 */
	if (tctx->force_local) {
		tctx->force_local = false;
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
		return;
	}

	/* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
	if (!(enq_flags & SCX_ENQ_WAKEUP) &&
	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
		__sync_fetch_and_add(&nr_ddsp_from_enq, 1);
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
		return;
	}

	/*
	 * If the task was re-enqueued due to the CPU being preempted by a
	 * higher priority scheduling class, just re-enqueue the task directly
	 * on the global DSQ. As we want another CPU to pick it up, find and
	 * kick an idle CPU.
	 */
	if (enq_flags & SCX_ENQ_REENQ) {
		s32 cpu;

		scx_bpf_dispatch(p, SHARED_DSQ, 0, enq_flags);
		cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
		if (cpu >= 0)
			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
		return;
	}

	ring = bpf_map_lookup_elem(&queue_arr, &idx);
	if (!ring) {
		scx_bpf_error("failed to find ring %d", idx);
		return;
	}

	/* Queue on the selected FIFO. If the FIFO overflows, punt to global. */
	if (bpf_map_push_elem(ring, &pid, 0)) {
		scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
		return;
	}

	if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) {
		tctx->highpri = true;
		__sync_fetch_and_add(&nr_highpri_queued, 1);
	}
	__sync_fetch_and_add(&nr_enqueued, 1);
}

/*
 * The BPF queue map doesn't support removal and sched_ext can handle spurious
 * dispatches. qmap_dequeue() is only used to collect statistics.
 */
void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
{
	__sync_fetch_and_add(&nr_dequeued, 1);
	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
		__sync_fetch_and_add(&nr_core_sched_execed, 1);
}

static void update_core_sched_head_seq(struct task_struct *p)
{
	int idx = weight_to_idx(p->scx.weight);
	struct task_ctx *tctx;

	if ((tctx = lookup_task_ctx(p)))
		core_sched_head_seqs[idx] = tctx->core_sched_seq;
}

/*
 * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement a silly
 * selective priority boosting mechanism by scanning SHARED_DSQ looking for
 * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first. This
 * makes a minor difference only when dsp_batch is larger than 1.
 *
 * scx_bpf_dispatch[_vtime]_from_dsq() are allowed both from ops.dispatch() and
 * non-rq-lock holding BPF programs. As a demonstration, this function is called
 * from qmap_dispatch() and monitor_timerfn().
 */
static bool dispatch_highpri(bool from_timer)
{
	struct task_struct *p;
	s32 this_cpu = bpf_get_smp_processor_id();

	/* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
	bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
		static u64 highpri_seq;
		struct task_ctx *tctx;

		if (!(tctx = lookup_task_ctx(p)))
			return false;

		if (tctx->highpri) {
			/* exercise the set_*() and vtime interface too */
			scx_bpf_dispatch_from_dsq_set_slice(
				BPF_FOR_EACH_ITER, slice_ns * 2);
			scx_bpf_dispatch_from_dsq_set_vtime(
				BPF_FOR_EACH_ITER, highpri_seq++);
			scx_bpf_dispatch_vtime_from_dsq(
				BPF_FOR_EACH_ITER, p, HIGHPRI_DSQ, 0);
		}
	}

	/*
	 * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
	 * is found.
	 */
	bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
		bool dispatched = false;
		s32 cpu;

		if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
			cpu = this_cpu;
		else
			cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);

		if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
					      SCX_DSQ_LOCAL_ON | cpu,
					      SCX_ENQ_PREEMPT)) {
			if (cpu == this_cpu) {
				dispatched = true;
				__sync_fetch_and_add(&nr_expedited_local, 1);
			} else {
				__sync_fetch_and_add(&nr_expedited_remote, 1);
			}
			if (from_timer)
				__sync_fetch_and_add(&nr_expedited_from_timer, 1);
		} else {
			__sync_fetch_and_add(&nr_expedited_lost, 1);
		}

		if (dispatched)
			return true;
	}

	return false;
}

void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	struct cpu_ctx *cpuc;
	struct task_ctx *tctx;
	u32 zero = 0, batch = dsp_batch ?: 1;
	void *fifo;
	s32 i, pid;

	if (dispatch_highpri(false))
		return;

	if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ))
		return;

	if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
		/*
		 * PID 2 should be kthreadd which should mostly be idle and off
		 * the scheduler. Let's keep dispatching it to force the kernel
		 * to call this function over and over again.
		 */
		p = bpf_task_from_pid(2);
		if (p) {
			scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, 0);
			bpf_task_release(p);
			return;
		}
	}

	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
		scx_bpf_error("failed to look up cpu_ctx");
		return;
	}

	for (i = 0; i < 5; i++) {
		/* Advance the dispatch cursor and pick the fifo. */
		if (!cpuc->dsp_cnt) {
			cpuc->dsp_idx = (cpuc->dsp_idx + 1) % 5;
			cpuc->dsp_cnt = 1 << cpuc->dsp_idx;
		}

		fifo = bpf_map_lookup_elem(&queue_arr, &cpuc->dsp_idx);
		if (!fifo) {
			scx_bpf_error("failed to find ring %llu", cpuc->dsp_idx);
			return;
		}

		/* Dispatch or advance. */
		bpf_repeat(BPF_MAX_LOOPS) {
			struct task_ctx *tctx;

			if (bpf_map_pop_elem(fifo, &pid))
				break;

			p = bpf_task_from_pid(pid);
			if (!p)
				continue;

			if (!(tctx = lookup_task_ctx(p))) {
				bpf_task_release(p);
				return;
			}

			if (tctx->highpri)
				__sync_fetch_and_sub(&nr_highpri_queued, 1);

			update_core_sched_head_seq(p);
			__sync_fetch_and_add(&nr_dispatched, 1);

			scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
			bpf_task_release(p);

			batch--;
			cpuc->dsp_cnt--;
			if (!batch || !scx_bpf_dispatch_nr_slots()) {
				if (dispatch_highpri(false))
					return;
				scx_bpf_consume(SHARED_DSQ);
				return;
			}
			if (!cpuc->dsp_cnt)
				break;
		}

		cpuc->dsp_cnt = 0;
	}

	/*
	 * No other tasks. @prev will keep running. Update its core_sched_seq as
	 * if the task were enqueued and dispatched immediately.
	 */
	if (prev) {
		tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0);
		if (!tctx) {
			scx_bpf_error("task_ctx lookup failed");
			return;
		}

		tctx->core_sched_seq =
			core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
	}
}

void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
{
	struct cpu_ctx *cpuc;
	u32 zero = 0;
	int idx;

	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
		scx_bpf_error("failed to look up cpu_ctx");
		return;
	}

	/*
	 * Use the running avg of weights to select the target cpuperf level.
	 * This is a demonstration of the cpuperf feature rather than a
	 * practical strategy to regulate CPU frequency.
	 */
	cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
	idx = weight_to_idx(cpuc->avg_weight);
	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];

	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
}

/*
 * The distance from the head of the queue scaled by the weight of the queue.
 * The lower the number, the older the task and the higher the priority.
 */
static s64 task_qdist(struct task_struct *p)
{
	int idx = weight_to_idx(p->scx.weight);
	struct task_ctx *tctx;
	s64 qdist;

	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
	if (!tctx) {
		scx_bpf_error("task_ctx lookup failed");
		return 0;
	}

	qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];

	/*
	 * As queue index increments, the priority doubles. The queue w/ index 3
	 * is dispatched twice as frequently as 2. Reflect the difference by
	 * scaling qdists accordingly. Note that the shift amount needs to be
	 * flipped depending on the sign to avoid flipping priority direction.
	 */
	if (qdist >= 0)
		return qdist << (4 - idx);
	else
		return qdist << idx;
}

/*
 * This is called to determine the task ordering when core-sched is picking
 * tasks to execute on SMT siblings and should encode about the same ordering as
 * the regular scheduling path. Use the priority-scaled distances from the head
 * of the queues to compare the two tasks which should be consistent with the
 * dispatch path behavior.
 */
bool BPF_STRUCT_OPS(qmap_core_sched_before,
		    struct task_struct *a, struct task_struct *b)
{
	return task_qdist(a) > task_qdist(b);
}

void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
{
	u32 cnt;

	/*
	 * Called when @cpu is taken by a higher priority scheduling class. This
	 * makes @cpu no longer available for executing sched_ext tasks. As we
	 * don't want the tasks in @cpu's local dsq to sit there until @cpu
	 * becomes available again, re-enqueue them into the global dsq. See
	 * %SCX_ENQ_REENQ handling in qmap_enqueue().
	 */
	cnt = scx_bpf_reenqueue_local();
	if (cnt)
		__sync_fetch_and_add(&nr_reenqueued, cnt);
}

s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
		   struct scx_init_task_args *args)
{
	if (p->tgid == disallow_tgid)
		p->scx.disallow = true;

	/*
	 * @p is new. Let's ensure that its task_ctx is available. We can sleep
	 * in this function and the following will automatically use GFP_KERNEL.
	 */
	if (bpf_task_storage_get(&task_ctx_stor, p, 0,
				 BPF_LOCAL_STORAGE_GET_F_CREATE))
		return 0;
	else
		return -ENOMEM;
}

void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
{
	s32 i, pid;

	if (suppress_dump)
		return;

	bpf_for(i, 0, 5) {
		void *fifo;

		if (!(fifo = bpf_map_lookup_elem(&queue_arr, &i)))
			return;

		scx_bpf_dump("QMAP FIFO[%d]:", i);
		bpf_repeat(4096) {
			if (bpf_map_pop_elem(fifo, &pid))
				break;
			scx_bpf_dump(" %d", pid);
		}
		scx_bpf_dump("\n");
	}
}

void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle)
{
	u32 zero = 0;
	struct cpu_ctx *cpuc;

	if (suppress_dump || idle)
		return;
	if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
		return;

	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
		     cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
		     cpuc->cpuperf_target);
}

void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
{
	struct task_ctx *taskc;

	if (suppress_dump)
		return;
	if (!(taskc = bpf_task_storage_get(&task_ctx_stor, p, 0, 0)))
		return;

	scx_bpf_dump("QMAP: force_local=%d core_sched_seq=%llu",
		     taskc->force_local, taskc->core_sched_seq);
}

/*
 * Print out the online and possible CPU map using bpf_printk() as a
 * demonstration of using the cpumask kfuncs and ops.cpu_on/offline().
 */
static void print_cpus(void)
{
	const struct cpumask *possible, *online;
	s32 cpu;
	char buf[128] = "", *p;
	int idx;

	possible = scx_bpf_get_possible_cpumask();
	online = scx_bpf_get_online_cpumask();

	idx = 0;
	bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) {
		if (!(p = MEMBER_VPTR(buf, [idx++])))
			break;
		if (bpf_cpumask_test_cpu(cpu, online))
			*p++ = 'O';
		else if (bpf_cpumask_test_cpu(cpu, possible))
			*p++ = 'X';
		else
			*p++ = ' ';

		if ((cpu & 7) == 7) {
			if (!(p = MEMBER_VPTR(buf, [idx++])))
				break;
			*p++ = '|';
		}
	}
	buf[sizeof(buf) - 1] = '\0';

	scx_bpf_put_cpumask(online);
	scx_bpf_put_cpumask(possible);

	bpf_printk("CPUS: |%s", buf);
}

void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
{
	bpf_printk("CPU %d coming online", cpu);
	/* @cpu is already online at this point */
	print_cpus();
}

void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
{
	bpf_printk("CPU %d going offline", cpu);
	/* @cpu is still online at this point */
	print_cpus();
}

struct monitor_timer {
	struct bpf_timer timer;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, struct monitor_timer);
} monitor_timer SEC(".maps");

/*
 * Print out the min, avg and max performance levels of CPUs every second to
 * demonstrate the cpuperf interface.
 */
static void monitor_cpuperf(void)
{
	u32 zero = 0, nr_cpu_ids;
	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
	const struct cpumask *online;
	int i, nr_online_cpus = 0;

	nr_cpu_ids = scx_bpf_nr_cpu_ids();
	online = scx_bpf_get_online_cpumask();

	bpf_for(i, 0, nr_cpu_ids) {
		struct cpu_ctx *cpuc;
		u32 cap, cur;

		if (!bpf_cpumask_test_cpu(i, online))
			continue;
		nr_online_cpus++;

		/* collect the capacity and current cpuperf */
		cap = scx_bpf_cpuperf_cap(i);
		cur = scx_bpf_cpuperf_cur(i);

		cur_min = cur < cur_min ? cur : cur_min;
		cur_max = cur > cur_max ? cur : cur_max;

		/*
		 * $cur is relative to $cap. Scale it down accordingly so that
		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
		 * makes sense.
		 */
		cur_sum += cur * cap / SCX_CPUPERF_ONE;
		cap_sum += cap;

		if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
			scx_bpf_error("failed to look up cpu_ctx");
			goto out;
		}

		/* collect target */
		cur = cpuc->cpuperf_target;
		target_sum += cur;
		target_min = cur < target_min ? cur : target_min;
		target_max = cur > target_max ? cur : target_max;
	}

	cpuperf_min = cur_min;
	cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
	cpuperf_max = cur_max;

	cpuperf_target_min = target_min;
	cpuperf_target_avg = target_sum / nr_online_cpus;
	cpuperf_target_max = target_max;
out:
	scx_bpf_put_cpumask(online);
}

/*
 * Dump the currently queued tasks in the shared DSQ to demonstrate the usage of
 * scx_bpf_dsq_nr_queued() and DSQ iterator. Raise the dispatch batch count to
 * see meaningful dumps in the trace pipe.
 */
static void dump_shared_dsq(void)
{
	struct task_struct *p;
	s32 nr;

	if (!(nr = scx_bpf_dsq_nr_queued(SHARED_DSQ)))
		return;

	bpf_printk("Dumping %d tasks in SHARED_DSQ in reverse order", nr);

	bpf_rcu_read_lock();
	bpf_for_each(scx_dsq, p, SHARED_DSQ, SCX_DSQ_ITER_REV)
		bpf_printk("%s[%d]", p->comm, p->pid);
	bpf_rcu_read_unlock();
}

static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
{
	bpf_rcu_read_lock();
	dispatch_highpri(true);
	bpf_rcu_read_unlock();

	monitor_cpuperf();

	if (print_shared_dsq)
		dump_shared_dsq();

	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
	return 0;
}

s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
	u32 key = 0;
	struct bpf_timer *timer;
	s32 ret;

	print_cpus();

	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
	if (ret)
		return ret;

	ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1);
	if (ret)
		return ret;

	timer = bpf_map_lookup_elem(&monitor_timer, &key);
	if (!timer)
		return -ESRCH;

	bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
	bpf_timer_set_callback(timer, monitor_timerfn);

	return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
}

void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

SCX_OPS_DEFINE(qmap_ops,
	       .select_cpu		= (void *)qmap_select_cpu,
	       .enqueue			= (void *)qmap_enqueue,
	       .dequeue			= (void *)qmap_dequeue,
	       .dispatch		= (void *)qmap_dispatch,
	       .tick			= (void *)qmap_tick,
	       .core_sched_before	= (void *)qmap_core_sched_before,
	       .cpu_release		= (void *)qmap_cpu_release,
	       .init_task		= (void *)qmap_init_task,
	       .dump			= (void *)qmap_dump,
	       .dump_cpu		= (void *)qmap_dump_cpu,
	       .dump_task		= (void *)qmap_dump_task,
	       .cpu_online		= (void *)qmap_cpu_online,
	       .cpu_offline		= (void *)qmap_cpu_offline,
	       .init			= (void *)qmap_init,
	       .exit			= (void *)qmap_exit,
	       .timeout_ms		= 5000U,
	       .name			= "qmap");
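
The dispatch cursor in qmap_dispatch() gives each FIFO a budget of `1 << index` before moving on, which is where the "1 from queue0, 2 from queue1, 4 from queue2" ratio in the header comment comes from. The following is a minimal userspace sketch of only that cursor logic, not part of the patch. `struct cursor`, `next_fifo()` and `count_picks()` are hypothetical names made up for illustration, and the sketch assumes every FIFO always has a task ready, whereas the real code resets the budget and advances when a FIFO runs empty.

```c
#include <stdint.h>

#define NR_FIFOS 5

/* Hypothetical stand-in for the dsp_idx/dsp_cnt pair in struct cpu_ctx. */
struct cursor {
	uint64_t idx;	/* FIFO currently being drained */
	uint64_t cnt;	/* dispatches left before moving to the next FIFO */
};

/*
 * Return the index of the FIFO to pop from next. Assumes the FIFO is never
 * empty, so the budget is always consumed in full.
 */
static int next_fifo(struct cursor *c)
{
	if (!c->cnt) {
		c->idx = (c->idx + 1) % NR_FIFOS;
		c->cnt = 1ULL << c->idx;
	}
	c->cnt--;
	return (int)c->idx;
}

/* Count how many of the first @nr picks land on each FIFO. */
static void count_picks(unsigned int nr, unsigned int hist[NR_FIFOS])
{
	struct cursor c = { 0, 0 };
	unsigned int i;

	for (i = 0; i < NR_FIFOS; i++)
		hist[i] = 0;
	for (i = 0; i < nr; i++)
		hist[next_fifo(&c)]++;
}
```

One full rotation is 1 + 2 + 4 + 8 + 16 = 31 picks, so under the always-busy assumption the highest FIFO receives roughly half of all dispatches.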

+153

tools/sched_ext/scx_qmap.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <inttypes.h>
#include <signal.h>
#include <libgen.h>
#include <bpf/bpf.h>
#include <scx/common.h>
#include "scx_qmap.bpf.skel.h"

const char help_fmt[] =
"A simple five-level FIFO queue sched_ext scheduler.\n"
"\n"
"See the top-level comment in .bpf.c for more details.\n"
"\n"
"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
"       [-P] [-H] [-d PID] [-D LEN] [-S] [-p] [-v]\n"
"\n"
"  -s SLICE_US   Override slice duration\n"
"  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
"  -t COUNT      Stall every COUNT'th user thread\n"
"  -T COUNT      Stall every COUNT'th kernel thread\n"
"  -l COUNT      Trigger dispatch infinite looping after COUNT dispatches\n"
"  -b COUNT      Dispatch up to COUNT tasks together\n"
"  -P            Print out DSQ content to trace_pipe every second, use with -b\n"
"  -H            Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
"  -d PID        Disallow a process from switching into SCHED_EXT (-1 for self)\n"
"  -D LEN        Set scx_exit_info.dump buffer length\n"
"  -S            Suppress qmap-specific debug dump\n"
"  -p            Switch only tasks on SCHED_EXT policy instead of all\n"
"  -v            Print libbpf debug messages\n"
"  -h            Display this help and exit\n";

static bool verbose;
static volatile int exit_req;

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
	if (level == LIBBPF_DEBUG && !verbose)
		return 0;
	return vfprintf(stderr, format, args);
}

static void sigint_handler(int dummy)
{
	exit_req = 1;
}

int main(int argc, char **argv)
{
	struct scx_qmap *skel;
	struct bpf_link *link;
	int opt;

	libbpf_set_print(libbpf_print_fn);
	signal(SIGINT, sigint_handler);
	signal(SIGTERM, sigint_handler);

	skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);

	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
		switch (opt) {
		case 's':
			/* SLICE_US is given in microseconds; the BPF side wants ns */
			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
			break;
		case 'e':
			skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
			break;
		case 't':
			skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0);
			break;
		case 'T':
			skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
			break;
		case 'l':
			skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0);
			break;
		case 'b':
			skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
			break;
		case 'P':
			skel->rodata->print_shared_dsq = true;
			break;
		case 'H':
			skel->rodata->highpri_boosting = true;
			break;
		case 'd':
			skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
			if (skel->rodata->disallow_tgid < 0)
				skel->rodata->disallow_tgid = getpid();
			break;
		case 'D':
			skel->struct_ops.qmap_ops->exit_dump_len = strtoul(optarg, NULL, 0);
			break;
		case 'S':
			skel->rodata->suppress_dump = true;
			break;
		case 'p':
			skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL;
			break;
		case 'v':
			verbose = true;
			break;
		default:
			fprintf(stderr, help_fmt, basename(argv[0]));
			return opt != 'h';
		}
	}

	SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
	link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap);

	while (!exit_req && !UEI_EXITED(skel, uei)) {
		long nr_enqueued = skel->bss->nr_enqueued;
		long nr_dispatched = skel->bss->nr_dispatched;

		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
		       skel->bss->nr_core_sched_execed,
		       skel->bss->nr_ddsp_from_enq);
		printf("         exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
		       skel->bss->nr_expedited_local,
		       skel->bss->nr_expedited_remote,
		       skel->bss->nr_expedited_from_timer,
		       skel->bss->nr_expedited_lost);
		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
			       skel->bss->cpuperf_min,
			       skel->bss->cpuperf_avg,
			       skel->bss->cpuperf_max,
			       skel->bss->cpuperf_target_min,
			       skel->bss->cpuperf_target_avg,
			       skel->bss->cpuperf_target_max);
		fflush(stdout);
		sleep(1);
	}

	bpf_link__destroy(link);
	UEI_REPORT(skel, uei);
	scx_qmap__destroy(skel);
	/*
	 * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart
	 * on CPU hotplug events.
	 */
	return 0;
}

+39

tools/sched_ext/scx_show_state.py

#!/usr/bin/env drgn
#
# Copyright (C) 2024 Tejun Heo <tj@kernel.org>
# Copyright (C) 2024 Meta Platforms, Inc. and affiliates.

desc = """
This is a drgn script to show the current sched_ext state.
For more info on drgn, visit https://github.com/osandov/drgn.
"""

import drgn
import sys

def err(s):
    print(s, file=sys.stderr, flush=True)
    sys.exit(1)

def read_int(name):
    return int(prog[name].value_())

def read_atomic(name):
    return prog[name].counter.value_()

def read_static_key(name):
    return prog[name].key.enabled.counter.value_()

def ops_state_str(state):
    return prog['scx_ops_enable_state_str'][state].string_().decode()

ops = prog['scx_ops']
enable_state = read_atomic("scx_ops_enable_state_var")

print(f'ops           : {ops.name.string_().decode()}')
print(f'enabled       : {read_static_key("__scx_ops_enabled")}')
print(f'switching_all : {read_int("scx_switching_all")}')
print(f'switched_all  : {read_static_key("__scx_switched_all")}')
print(f'enable_state  : {ops_state_str(enable_state)} ({enable_state})')
print(f'bypass_depth  : {read_atomic("scx_ops_bypass_depth")}')
print(f'nr_rejected   : {read_atomic("scx_nr_rejected")}')

+156

tools/sched_ext/scx_simple.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A simple scheduler.
+ *
+ * By default, it operates as a simple global weighted vtime scheduler and can
+ * be switched to FIFO scheduling. It also demonstrates the following niceties.
+ *
+ * - Statistics tracking how many tasks are queued to local and global dsq's.
+ * - Termination notification for userspace.
+ *
+ * While very simple, this scheduler should work reasonably well on CPUs with a
+ * uniform L3 cache topology. While preemption is not implemented, the fact that
+ * the scheduling queue is shared across all CPUs means that whatever is at the
+ * front of the queue is likely to be executed fairly quickly given enough
+ * number of CPUs. The FIFO scheduling mode may be beneficial to some workloads
+ * but comes with the usual problems with FIFO scheduling where saturating
+ * threads can easily drown out interactive ones.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool fifo_sched;
+
+static u64 vtime_now;
+UEI_DEFINE(uei);
+
+/*
+ * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues
+ * (meaning, cannot be dispatched to with scx_bpf_dispatch_vtime()). We
+ * therefore create a separate DSQ with ID 0 that we dispatch to and consume
+ * from. If scx_simple only supported global FIFO scheduling, then we could
+ * just use SCX_DSQ_GLOBAL.
+ */
+#define SHARED_DSQ 0
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(u32));
+	__uint(value_size, sizeof(u64));
+	__uint(max_entries, 2);			/* [local, global] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+	if (cnt_p)
+		(*cnt_p)++;
+}
+
+static inline bool vtime_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+	bool is_idle = false;
+	s32 cpu;
+
+	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
+	if (is_idle) {
+		stat_inc(0);	/* count local queueing */
+		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+	}
+
+	return cpu;
+}
+
+void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	stat_inc(1);	/* count global queueing */
+
+	if (fifo_sched) {
+		scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
+	} else {
+		u64 vtime = p->scx.dsq_vtime;
+
+		/*
+		 * Limit the amount of budget that an idling task can accumulate
+		 * to one slice.
+		 */
+		if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
+			vtime = vtime_now - SCX_SLICE_DFL;
+
+		scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
+				       enq_flags);
+	}
+}
+
+void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
+{
+	scx_bpf_consume(SHARED_DSQ);
+}
+
+void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
+{
+	if (fifo_sched)
+		return;
+
+	/*
+	 * Global vtime always progresses forward as tasks start executing. The
+	 * test and update can be performed concurrently from multiple CPUs and
+	 * thus racy. Any error should be contained and temporary. Let's just
+	 * live with it.
+	 */
+	if (vtime_before(vtime_now, p->scx.dsq_vtime))
+		vtime_now = p->scx.dsq_vtime;
+}
+
+void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
+{
+	if (fifo_sched)
+		return;
+
+	/*
+	 * Scale the execution time by the inverse of the weight and charge.
+	 *
+	 * Note that the default yield implementation yields by setting
+	 * @p->scx.slice to zero and the following would treat the yielding task
+	 * as if it has consumed all its slice. If this penalizes yielding tasks
+	 * too much, determine the execution time by taking explicit timestamps
+	 * instead of depending on @p->scx.slice.
+	 */
+	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
+}
+
+void BPF_STRUCT_OPS(simple_enable, struct task_struct *p)
+{
+	p->scx.dsq_vtime = vtime_now;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
+{
+	return scx_bpf_create_dsq(SHARED_DSQ, -1);
+}
+
+void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(simple_ops,
+	       .select_cpu		= (void *)simple_select_cpu,
+	       .enqueue			= (void *)simple_enqueue,
+	       .dispatch		= (void *)simple_dispatch,
+	       .running			= (void *)simple_running,
+	       .stopping		= (void *)simple_stopping,
+	       .enable			= (void *)simple_enable,
+	       .init			= (void *)simple_init,
+	       .exit			= (void *)simple_exit,
+	       .name			= "simple");
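
Note: the vtime bookkeeping in scx_simple.bpf.c rests on two small pieces of arithmetic: a wraparound-safe comparison (`vtime_before()`) and a charge that scales consumed runtime by the inverse of the task weight. What follows is a minimal userspace sketch of that arithmetic only, not the scheduler itself. `SLICE_NS`, `WEIGHT_DFL`, `clamp_vtime()`, `charge_vtime()`, and `vtime_examples()` are hypothetical names introduced for illustration, and the 20 ms slice is an assumed stand-in for `SCX_SLICE_DFL`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; the kernel defines the real values. */
#define SLICE_NS   20000000ULL	/* assumed 20 ms default slice */
#define WEIGHT_DFL 100ULL	/* weight assumed for a default-priority task */

/* True if a is earlier than b, even if the 64-bit counter has wrapped. */
static inline bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Clamp a waking task's vtime so it lags the global clock by at most one slice. */
static inline uint64_t clamp_vtime(uint64_t vtime, uint64_t vtime_now)
{
	if (vtime_before(vtime, vtime_now - SLICE_NS))
		return vtime_now - SLICE_NS;
	return vtime;
}

/* Charge used runtime, scaled by the inverse of the task weight. */
static inline uint64_t charge_vtime(uint64_t vtime, uint64_t used_ns, uint64_t weight)
{
	return vtime + used_ns * WEIGHT_DFL / weight;
}

/* Small self-check showing how the helpers behave. */
void vtime_examples(void)
{
	/* A value just below the wrap point still sorts before zero. */
	assert(vtime_before(UINT64_MAX, 0));
	/* A long-idle task is pulled forward to one slice behind "now". */
	assert(clamp_vtime(0, 3 * SLICE_NS) == 2 * SLICE_NS);
	/* Doubling the weight halves the vtime charged for the same runtime. */
	assert(charge_vtime(0, 1000, 200) == 500);
}
```

Under these assumptions, a task with weight 200 accumulates vtime half as fast as a weight-100 task for the same runtime, so it sorts earlier in the shared DSQ and ends up with roughly twice the CPU time.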

+107

tools/sched_ext/scx_simple.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_simple.bpf.skel.h"
+
+const char help_fmt[] =
+"A simple sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-f] [-v]\n"
+"\n"
+"  -f            Use FIFO scheduling instead of weighted vtime scheduling\n"
+"  -v            Print libbpf debug messages\n"
+"  -h            Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int simple)
+{
+	exit_req = 1;
+}
+
+static void read_stats(struct scx_simple *skel, __u64 *stats)
+{
+	int nr_cpus = libbpf_num_possible_cpus();
+	__u64 cnts[2][nr_cpus];
+	__u32 idx;
+
+	memset(stats, 0, sizeof(stats[0]) * 2);
+
+	for (idx = 0; idx < 2; idx++) {
+		int ret, cpu;
+
+		ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+					  &idx, cnts[idx]);
+		if (ret < 0)
+			continue;
+		for (cpu = 0; cpu < nr_cpus; cpu++)
+			stats[idx] += cnts[idx][cpu];
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct scx_simple *skel;
+	struct bpf_link *link;
+	__u32 opt;
+	__u64 ecode;
+
+	libbpf_set_print(libbpf_print_fn);
+	signal(SIGINT, sigint_handler);
+	signal(SIGTERM, sigint_handler);
+restart:
+	skel = SCX_OPS_OPEN(simple_ops, scx_simple);
+
+	while ((opt = getopt(argc, argv, "fvh")) != -1) {
+		switch (opt) {
+		case 'f':
+			skel->rodata->fifo_sched = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			fprintf(stderr, help_fmt, basename(argv[0]));
+			return opt != 'h';
+		}
+	}
+
+	SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);
+	link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple);
+
+	while (!exit_req && !UEI_EXITED(skel, uei)) {
+		__u64 stats[2];
+
+		read_stats(skel, stats);
+		printf("local=%llu global=%llu\n", stats[0], stats[1]);
+		fflush(stdout);
+		sleep(1);
+	}
+
+	bpf_link__destroy(link);
+	ecode = UEI_REPORT(skel, uei);
+	scx_simple__destroy(skel);
+
+	if (UEI_ECODE_RESTART(ecode))
+		goto restart;
+	return 0;
+}

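Note: `read_stats()` in scx_simple.c reads a `BPF_MAP_TYPE_PERCPU_ARRAY`. A userspace lookup of one key in such a map fills a buffer with one value per possible CPU, and the caller then sums those values. The sketch below shows that aggregation step alone, assuming the buffer has already been filled. `sum_percpu()` and `sum_percpu_example()` are hypothetical helper names used for illustration; they are not part of libbpf or the scx tools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum one counter across CPUs; vals holds one 64-bit value per possible CPU. */
uint64_t sum_percpu(const uint64_t *vals, size_t nr_cpus)
{
	uint64_t total = 0;

	for (size_t cpu = 0; cpu < nr_cpus; cpu++)
		total += vals[cpu];
	return total;
}

/* Example: four CPUs that each counted a few queueing events. */
void sum_percpu_example(void)
{
	const uint64_t local_cnt[4] = { 1, 2, 3, 4 };

	assert(sum_percpu(local_cnt, 4) == 10);
}
```

Keeping the counters per-CPU lets the BPF side increment them without atomics; the cost is that every reader has to do this summation.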
+6

tools/testing/selftests/sched_ext/.gitignore

+*
+!*.c
+!*.h
+!Makefile
+!.gitignore
+!config

+218

tools/testing/selftests/sched_ext/Makefile

+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+include ../../../build/Build.include
+include ../../../scripts/Makefile.arch
+include ../../../scripts/Makefile.include
+include ../lib.mk
+
+ifneq ($(LLVM),)
+ifneq ($(filter %/,$(LLVM)),)
+LLVM_PREFIX := $(LLVM)
+else ifneq ($(filter -%,$(LLVM)),)
+LLVM_SUFFIX := $(LLVM)
+endif
+
+CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
+else
+CC := gcc
+endif # LLVM
+
+ifneq ($(CROSS_COMPILE),)
+$(error CROSS_COMPILE not supported for scx selftests)
+endif # CROSS_COMPILE
+
+CURDIR := $(abspath .)
+REPOROOT := $(abspath ../../../..)
+TOOLSDIR := $(REPOROOT)/tools
+LIBDIR := $(TOOLSDIR)/lib
+BPFDIR := $(LIBDIR)/bpf
+TOOLSINCDIR := $(TOOLSDIR)/include
+BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
+APIDIR := $(TOOLSINCDIR)/uapi
+GENDIR := $(REPOROOT)/include/generated
+GENHDR := $(GENDIR)/autoconf.h
+SCXTOOLSDIR := $(TOOLSDIR)/sched_ext
+SCXTOOLSINCDIR := $(TOOLSDIR)/sched_ext/include
+
+OUTPUT_DIR := $(CURDIR)/build
+OBJ_DIR := $(OUTPUT_DIR)/obj
+INCLUDE_DIR := $(OUTPUT_DIR)/include
+BPFOBJ_DIR := $(OBJ_DIR)/libbpf
+SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
+BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
+LIBBPF_OUTPUT := $(OBJ_DIR)/libbpf/libbpf.a
+DEFAULT_BPFTOOL := $(OUTPUT_DIR)/sbin/bpftool
+HOST_BUILD_DIR := $(OBJ_DIR)
+HOST_OUTPUT_DIR := $(OUTPUT_DIR)
+
+VMLINUX_BTF_PATHS ?= ../../../../vmlinux \
+		     /sys/kernel/btf/vmlinux \
+		     /boot/vmlinux-$(shell uname -r)
+VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
+ifeq ($(VMLINUX_BTF),)
+$(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
+endif
+
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+
+ifneq ($(wildcard $(GENHDR)),)
+  GENFLAGS := -DHAVE_GENHDR
+endif
+
+CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS) \
+	  -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR) \
+	  -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include -I$(SCXTOOLSINCDIR)
+
+# Silence some warnings when compiled with clang
+ifneq ($(LLVM),)
+CFLAGS += -Wno-unused-command-line-argument
+endif
+
+LDFLAGS = -lelf -lz -lpthread -lzstd
+
+IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null | \
+			grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
+
+# Get Clang's default includes on this system, as opposed to those seen by
+# '-target bpf'. This fixes "missing" files on some architectures/distros,
+# such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
+#
+# Use '-idirafter': Don't interfere with include mechanics except where the
+# build would have failed anyways.
+define get_sys_includes
+$(shell $(1) -v -E - </dev/null 2>&1 \
+	| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
+$(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
+endef
+
+BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH) \
+	     $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian) \
+	     -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat \
+	     -I$(INCLUDE_DIR) -I$(APIDIR) -I$(SCXTOOLSINCDIR) \
+	     -I$(REPOROOT)/include \
+	     $(call get_sys_includes,$(CLANG)) \
+	     -Wall -Wno-compare-distinct-pointer-types \
+	     -Wno-incompatible-function-pointer-types \
+	     -O2 -mcpu=v3
+
+# sort removes libbpf duplicates when not cross-building
+MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(OBJ_DIR)/libbpf \
+	       $(OBJ_DIR)/bpftool $(OBJ_DIR)/resolve_btfids \
+	       $(INCLUDE_DIR) $(SCXOBJ_DIR))
+
+$(MAKE_DIRS):
+	$(call msg,MKDIR,,$@)
+	$(Q)mkdir -p $@
+
+$(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile) \
+	   $(APIDIR)/linux/bpf.h \
+	   | $(OBJ_DIR)/libbpf
+	$(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/ \
+		    EXTRA_CFLAGS='-g -O0 -fPIC' \
+		    DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
+
+$(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile) \
+		    $(LIBBPF_OUTPUT) | $(OBJ_DIR)/bpftool
+	$(Q)$(MAKE) $(submake_extras) -C $(BPFTOOLDIR) \
+		    ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD) \
+		    EXTRA_CFLAGS='-g -O0' \
+		    OUTPUT=$(OBJ_DIR)/bpftool/ \
+		    LIBBPF_OUTPUT=$(OBJ_DIR)/libbpf/ \
+		    LIBBPF_DESTDIR=$(OUTPUT_DIR)/ \
+		    prefix= DESTDIR=$(OUTPUT_DIR)/ install-bin
+
+$(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
+ifeq ($(VMLINUX_H),)
+	$(call msg,GEN,,$@)
+	$(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
+else
+	$(call msg,CP,,$@)
+	$(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h | $(BPFOBJ) $(SCXOBJ_DIR)
+	$(call msg,CLNG-BPF,,$(notdir $@))
+	$(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
+
+$(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL) | $(INCLUDE_DIR)
+	$(eval sched=$(notdir $@))
+	$(call msg,GEN-SKEL,,$(sched))
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
+	$(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
+	$(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
+	$(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
+	$(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
+
+################
+# C schedulers #
+################
+
+override define CLEAN
+	rm -rf $(OUTPUT_DIR)
+	rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
+	rm -f $(TEST_GEN_PROGS)
+	rm -f runner
+endef
+
+# Every testcase takes all of the BPF progs as dependencies by default. This
+# allows testcases to load any BPF scheduler, which is useful for testcases
+# that don't need their own prog to run their test.
+all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubst %.c,%.skel.h,$(prog)))
+
+auto-test-targets := \
+	create_dsq \
+	enq_last_no_enq_fails \
+	enq_select_cpu_fails \
+	ddsp_bogus_dsq_fail \
+	ddsp_vtimelocal_fail \
+	dsp_local_on \
+	exit \
+	hotplug \
+	init_enable_count \
+	maximal \
+	maybe_null \
+	minimal \
+	prog_run \
+	reload_loop \
+	select_cpu_dfl \
+	select_cpu_dfl_nodispatch \
+	select_cpu_dispatch \
+	select_cpu_dispatch_bad_dsq \
+	select_cpu_dispatch_dbl_dsp \
+	select_cpu_vtime \
+	test_example \
+
+testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
+
+$(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR)
+	$(CC) $(CFLAGS) -c $< -o $@
+
+# Create all of the test targets object files, whose testcase objects will be
+# registered into the runner in ELF constructors.
+#
+# Note that we must do double expansion here in order to support conditionally
+# compiling BPF object files only if one is present, as the wildcard Make
+# function doesn't support using implicit rules otherwise.
+$(testcase-targets): $(SCXOBJ_DIR)/%.o: %.c $(SCXOBJ_DIR)/runner.o $(all_test_bpfprogs) | $(SCXOBJ_DIR)
+	$(eval test=$(patsubst %.o,%.c,$(notdir $@)))
+	$(CC) $(CFLAGS) -c $< -o $@ $(SCXOBJ_DIR)/runner.o
+
+$(SCXOBJ_DIR)/util.o: util.c | $(SCXOBJ_DIR)
+	$(CC) $(CFLAGS) -c $< -o $@
+
+runner: $(SCXOBJ_DIR)/runner.o $(SCXOBJ_DIR)/util.o $(BPFOBJ) $(testcase-targets)
+	@echo "$(testcase-targets)"
+	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
+
+TEST_GEN_PROGS := runner
+
+all: runner
+
+.PHONY: all clean help
+
+.DEFAULT_GOAL := all
+
+.DELETE_ON_ERROR:
+
+.SECONDARY:

+9

tools/testing/selftests/sched_ext/config

+CONFIG_SCHED_DEBUG=y
+CONFIG_SCHED_CLASS_EXT=y
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_EXT_GROUP_SCHED=y
+CONFIG_BPF=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_DEBUG_INFO=y
+CONFIG_DEBUG_INFO_BTF=y

+58

tools/testing/selftests/sched_ext/create_dsq.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Create and destroy DSQs in a loop.
+ *
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+void BPF_STRUCT_OPS(create_dsq_exit_task, struct task_struct *p,
+		    struct scx_exit_task_args *args)
+{
+	scx_bpf_destroy_dsq(p->pid);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init_task, struct task_struct *p,
+			     struct scx_init_task_args *args)
+{
+	s32 err;
+
+	err = scx_bpf_create_dsq(p->pid, -1);
+	if (err)
+		scx_bpf_error("Failed to create DSQ for %s[%d]",
+			      p->comm, p->pid);
+
+	return err;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init)
+{
+	u32 i;
+	s32 err;
+
+	bpf_for(i, 0, 1024) {
+		err = scx_bpf_create_dsq(i, -1);
+		if (err) {
+			scx_bpf_error("Failed to create DSQ %d", i);
+			return 0;
+		}
+	}
+
+	bpf_for(i, 0, 1024) {
+		scx_bpf_destroy_dsq(i);
+	}
+
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops create_dsq_ops = {
+	.init_task		= create_dsq_init_task,
+	.exit_task		= create_dsq_exit_task,
+	.init			= create_dsq_init,
+	.name			= "create_dsq",
+};

+57

tools/testing/selftests/sched_ext/create_dsq.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "create_dsq.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct create_dsq *skel;
+
+	skel = create_dsq__open_and_load();
+	if (!skel) {
+		SCX_ERR("Failed to open and load skel");
+		return SCX_TEST_FAIL;
+	}
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct create_dsq *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.create_dsq_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct create_dsq *skel = ctx;
+
+	create_dsq__destroy(skel);
+}
+
+struct scx_test create_dsq = {
+	.name = "create_dsq",
+	.description = "Create and destroy a dsq in a loop",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&create_dsq)

+42

tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+
+	if (cpu >= 0) {
+		/*
+		 * If we dispatch to a bogus DSQ that will fall back to the
+		 * builtin global DSQ, we fail gracefully.
+		 */
+		scx_bpf_dispatch_vtime(p, 0xcafef00d, SCX_SLICE_DFL,
+				       p->scx.dsq_vtime, 0);
+		return cpu;
+	}
+
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops ddsp_bogus_dsq_fail_ops = {
+	.select_cpu		= ddsp_bogus_dsq_fail_select_cpu,
+	.exit			= ddsp_bogus_dsq_fail_exit,
+	.name			= "ddsp_bogus_dsq_fail",
+	.timeout_ms		= 1000U,
+};

+57

tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "ddsp_bogus_dsq_fail.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct ddsp_bogus_dsq_fail *skel;
+
+	skel = ddsp_bogus_dsq_fail__open_and_load();
+	SCX_FAIL_IF(!skel, "Failed to open and load skel");
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct ddsp_bogus_dsq_fail *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.ddsp_bogus_dsq_fail_ops);
+	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+	sleep(1);
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct ddsp_bogus_dsq_fail *skel = ctx;
+
+	ddsp_bogus_dsq_fail__destroy(skel);
+}
+
+struct scx_test ddsp_bogus_dsq_fail = {
+	.name = "ddsp_bogus_dsq_fail",
+	.description = "Verify we gracefully fail, and fall back to using a "
+		       "built-in DSQ, if we do a direct dispatch to an invalid"
+		       " DSQ in ops.select_cpu()",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&ddsp_bogus_dsq_fail)

+39

tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(ddsp_vtimelocal_fail_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
+
+	if (cpu >= 0) {
+		/* Shouldn't be allowed to vtime dispatch to a builtin DSQ. */
+		scx_bpf_dispatch_vtime(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
+				       p->scx.dsq_vtime, 0);
+		return cpu;
+	}
+
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(ddsp_vtimelocal_fail_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops ddsp_vtimelocal_fail_ops = {
+	.select_cpu		= ddsp_vtimelocal_fail_select_cpu,
+	.exit			= ddsp_vtimelocal_fail_exit,
+	.name			= "ddsp_vtimelocal_fail",
+	.timeout_ms		= 1000U,
+};

+56

tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "ddsp_vtimelocal_fail.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct ddsp_vtimelocal_fail *skel;
+
+	skel = ddsp_vtimelocal_fail__open_and_load();
+	SCX_FAIL_IF(!skel, "Failed to open and load skel");
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct ddsp_vtimelocal_fail *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.ddsp_vtimelocal_fail_ops);
+	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+	sleep(1);
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct ddsp_vtimelocal_fail *skel = ctx;
+
+	ddsp_vtimelocal_fail__destroy(skel);
+}
+
+struct scx_test ddsp_vtimelocal_fail = {
+	.name = "ddsp_vtimelocal_fail",
+	.description = "Verify we gracefully fail, and fall back to using a "
+		       "built-in DSQ, if we do a direct vtime dispatch to a "
+		       "built-in DSQ in ops.select_cpu()",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&ddsp_vtimelocal_fail)

+65

tools/testing/selftests/sched_ext/dsp_local_on.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+const volatile s32 nr_cpus;
+
+UEI_DEFINE(uei);
+
+struct {
+	__uint(type, BPF_MAP_TYPE_QUEUE);
+	__uint(max_entries, 8192);
+	__type(value, s32);
+} queue SEC(".maps");
+
+s32 BPF_STRUCT_OPS(dsp_local_on_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(dsp_local_on_enqueue, struct task_struct *p,
+		    u64 enq_flags)
+{
+	s32 pid = p->pid;
+
+	if (bpf_map_push_elem(&queue, &pid, 0))
+		scx_bpf_error("Failed to enqueue %s[%d]", p->comm, p->pid);
+}
+
+void BPF_STRUCT_OPS(dsp_local_on_dispatch, s32 cpu, struct task_struct *prev)
+{
+	s32 pid, target;
+	struct task_struct *p;
+
+	if (bpf_map_pop_elem(&queue, &pid))
+		return;
+
+	p = bpf_task_from_pid(pid);
+	if (!p)
+		return;
+
+	target = bpf_get_prandom_u32() % nr_cpus;
+
+	scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, 0);
+	bpf_task_release(p);
+}
+
+void BPF_STRUCT_OPS(dsp_local_on_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops dsp_local_on_ops = {
+	.select_cpu		= dsp_local_on_select_cpu,
+	.enqueue		= dsp_local_on_enqueue,
+	.dispatch		= dsp_local_on_dispatch,
+	.exit			= dsp_local_on_exit,
+	.name			= "dsp_local_on",
+	.timeout_ms		= 1000U,
+};

+58

tools/testing/selftests/sched_ext/dsp_local_on.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "dsp_local_on.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct dsp_local_on *skel;
+
+	skel = dsp_local_on__open();
+	SCX_FAIL_IF(!skel, "Failed to open");
+
+	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
+	SCX_FAIL_IF(dsp_local_on__load(skel), "Failed to load skel");
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct dsp_local_on *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.dsp_local_on_ops);
+	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+	/* Just sleeping is fine, plenty of scheduling events happening */
+	sleep(1);
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct dsp_local_on *skel = ctx;
+
+	dsp_local_on__destroy(skel);
+}
+
+struct scx_test dsp_local_on = {
+	.name = "dsp_local_on",
+	.description = "Verify we can directly dispatch tasks to local DSQs "
+		       "from ops.dispatch()",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&dsp_local_on)

+21

tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A scheduler that validates the behavior of direct dispatching with a default
+ * select_cpu implementation.
+ *
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC(".struct_ops.link")
+struct sched_ext_ops enq_last_no_enq_fails_ops = {
+	.name			= "enq_last_no_enq_fails",
+	/* Need to define ops.enqueue() with SCX_OPS_ENQ_LAST */
+	.flags			= SCX_OPS_ENQ_LAST,
+	.timeout_ms		= 1000U,
+};

+60

tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "enq_last_no_enq_fails.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct enq_last_no_enq_fails *skel;
+
+	skel = enq_last_no_enq_fails__open_and_load();
+	if (!skel) {
+		SCX_ERR("Failed to open and load skel");
+		return SCX_TEST_FAIL;
+	}
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct enq_last_no_enq_fails *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.enq_last_no_enq_fails_ops);
+	if (link) {
+		SCX_ERR("Incorrectly succeeded in attaching scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct enq_last_no_enq_fails *skel = ctx;
+
+	enq_last_no_enq_fails__destroy(skel);
+}
+
+struct scx_test enq_last_no_enq_fails = {
+	.name = "enq_last_no_enq_fails",
+	.description = "Verify we fail to load a scheduler if we specify "
+		       "the SCX_OPS_ENQ_LAST flag without defining "
+		       "ops.enqueue()",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&enq_last_no_enq_fails)

+43

tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+/* Manually specify the signature until the kfunc is added to the scx repo. */
+s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
+			   bool *found) __ksym;
+
+s32 BPF_STRUCT_OPS(enq_select_cpu_fails_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(enq_select_cpu_fails_enqueue, struct task_struct *p,
+		    u64 enq_flags)
+{
+	/*
+	 * Need to initialize the variable or the verifier will fail to load.
+	 * Improving these semantics is actively being worked on.
+	 */
+	bool found = false;
+
+	/* Can only call from ops.select_cpu() */
+	scx_bpf_select_cpu_dfl(p, 0, 0, &found);
+
+	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops enq_select_cpu_fails_ops = {
+	.select_cpu		= enq_select_cpu_fails_select_cpu,
+	.enqueue		= enq_select_cpu_fails_enqueue,
+	.name			= "enq_select_cpu_fails",
+	.timeout_ms		= 1000U,
+};

+61

tools/testing/selftests/sched_ext/enq_select_cpu_fails.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2023 David Vernet <dvernet@meta.com>
+ * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "enq_select_cpu_fails.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct enq_select_cpu_fails *skel;
+
+	skel = enq_select_cpu_fails__open_and_load();
+	if (!skel) {
+		SCX_ERR("Failed to open and load skel");
+		return SCX_TEST_FAIL;
+	}
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct enq_select_cpu_fails *skel = ctx;
+	struct bpf_link *link;
+
+	link = bpf_map__attach_struct_ops(skel->maps.enq_select_cpu_fails_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	sleep(1);
+
+	bpf_link__destroy(link);
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct enq_select_cpu_fails *skel = ctx;
+
+	enq_select_cpu_fails__destroy(skel);
+}
+
+struct scx_test enq_select_cpu_fails = {
+	.name = "enq_select_cpu_fails",
+	.description = "Verify we fail to call scx_bpf_select_cpu_dfl() "
+		       "from ops.enqueue()",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&enq_select_cpu_fails)

+84

tools/testing/selftests/sched_ext/exit.bpf.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+#include "exit_test.h"
+
+const volatile int exit_point;
+UEI_DEFINE(uei);
+
+#define EXIT_CLEANLY() scx_bpf_exit(exit_point, "%d", exit_point)
+
+s32 BPF_STRUCT_OPS(exit_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	bool found;
+
+	if (exit_point == EXIT_SELECT_CPU)
+		EXIT_CLEANLY();
+
+	return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
+}
+
+void BPF_STRUCT_OPS(exit_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	if (exit_point == EXIT_ENQUEUE)
+		EXIT_CLEANLY();
+
+	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p)
+{
+	if (exit_point == EXIT_DISPATCH)
+		EXIT_CLEANLY();
+
+	scx_bpf_consume(SCX_DSQ_GLOBAL);
+}
+
+void BPF_STRUCT_OPS(exit_enable, struct task_struct *p)
+{
+	if (exit_point == EXIT_ENABLE)
+		EXIT_CLEANLY();
+}
+
+s32 BPF_STRUCT_OPS(exit_init_task, struct task_struct *p,
+		    struct scx_init_task_args *args)
+{
+	if (exit_point == EXIT_INIT_TASK)
+		EXIT_CLEANLY();
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(exit_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init)
+{
+	if (exit_point == EXIT_INIT)
+		EXIT_CLEANLY();
+
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops exit_ops = {
+	.select_cpu		= exit_select_cpu,
+	.enqueue		= exit_enqueue,
+	.dispatch		= exit_dispatch,
+	.init_task		= exit_init_task,
+	.enable			= exit_enable,
+	.exit			= exit_exit,
+	.init			= exit_init,
+	.name			= "exit",
+	.timeout_ms		= 1000U,
+};

+55

tools/testing/selftests/sched_ext/exit.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
#include <bpf/bpf.h>
#include <sched.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "exit.bpf.skel.h"
#include "scx_test.h"

#include "exit_test.h"

static enum scx_test_status run(void *ctx)
{
	enum exit_test_case tc;

	for (tc = 0; tc < NUM_EXITS; tc++) {
		struct exit *skel;
		struct bpf_link *link;
		char buf[16];

		skel = exit__open();
		skel->rodata->exit_point = tc;
		exit__load(skel);
		link = bpf_map__attach_struct_ops(skel->maps.exit_ops);
		if (!link) {
			SCX_ERR("Failed to attach scheduler");
			exit__destroy(skel);
			return SCX_TEST_FAIL;
		}

		/* Assumes uei.kind is written last */
		while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
			sched_yield();

		SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
		SCX_EQ(skel->data->uei.exit_code, tc);
		sprintf(buf, "%d", tc);
		SCX_ASSERT(!strcmp(skel->data->uei.msg, buf));
		bpf_link__destroy(link);
		exit__destroy(skel);
	}

	return SCX_TEST_PASS;
}

struct scx_test exit_test = {
	.name = "exit",
	.description = "Verify we can cleanly exit a scheduler in multiple places",
	.run = run,
};
REGISTER_SCX_TEST(&exit_test)
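The exit test pairs each exit point with a message that is just the exit point formatted with "%d", and the userspace side rebuilds that string to compare. The following is a minimal userspace sketch of that check only, with no BPF involved. `struct demo_exit_record`, `demo_exit_cleanly()`, `demo_check()` and `demo_run_all()` are hypothetical names introduced for illustration; the enum is copied from exit_test.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copied from exit_test.h: one entry per callback that may trigger the exit. */
enum exit_test_case {
	EXIT_SELECT_CPU,
	EXIT_ENQUEUE,
	EXIT_DISPATCH,
	EXIT_ENABLE,
	EXIT_INIT_TASK,
	EXIT_INIT,
	NUM_EXITS,
};

/* Hypothetical stand-in for the recorded exit info, not the real UEI layout. */
struct demo_exit_record {
	long long exit_code;
	char msg[16];
};

/* Mimics EXIT_CLEANLY(): the code and the "%d" message both carry the exit point. */
static void demo_exit_cleanly(struct demo_exit_record *rec, int exit_point)
{
	assert(rec != NULL);
	rec->exit_code = exit_point;
	snprintf(rec->msg, sizeof(rec->msg), "%d", exit_point);
}

/* Mirrors the comparisons in run(): returns 1 when code and message match tc. */
static int demo_check(const struct demo_exit_record *rec, int tc)
{
	char buf[16];

	snprintf(buf, sizeof(buf), "%d", tc);
	return rec->exit_code == tc && !strcmp(rec->msg, buf);
}

/* Walks every exit point, as the loop in run() does; returns how many matched. */
static int demo_run_all(void)
{
	struct demo_exit_record rec;
	int tc, passed = 0;

	for (tc = 0; tc < NUM_EXITS; tc++) {
		demo_exit_cleanly(&rec, tc);
		passed += demo_check(&rec, tc);
	}
	return passed;
}
```

Calling `demo_run_all()` should report one match per enum entry; a record produced for one exit point does not satisfy `demo_check()` for a different one.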

+20

tools/testing/selftests/sched_ext/exit_test.h

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */

#ifndef __EXIT_TEST_H__
#define __EXIT_TEST_H__

enum exit_test_case {
	EXIT_SELECT_CPU,
	EXIT_ENQUEUE,
	EXIT_DISPATCH,
	EXIT_ENABLE,
	EXIT_INIT_TASK,
	EXIT_INIT,
	NUM_EXITS,
};

#endif  // # __EXIT_TEST_H__

+61

tools/testing/selftests/sched_ext/hotplug.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

#include "hotplug_test.h"

UEI_DEFINE(uei);

void BPF_STRUCT_OPS(hotplug_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

static void exit_from_hotplug(s32 cpu, bool onlining)
{
	/*
	 * Ignored, just used to verify that we can invoke blocking kfuncs
	 * from the hotplug path.
	 */
	scx_bpf_create_dsq(0, -1);

	s64 code = SCX_ECODE_ACT_RESTART | HOTPLUG_EXIT_RSN;

	if (onlining)
		code |= HOTPLUG_ONLINING;

	scx_bpf_exit(code, "hotplug event detected (%d going %s)", cpu,
		     onlining ? "online" : "offline");
}

void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_online, s32 cpu)
{
	exit_from_hotplug(cpu, true);
}

void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_offline, s32 cpu)
{
	exit_from_hotplug(cpu, false);
}

SEC(".struct_ops.link")
struct sched_ext_ops hotplug_cb_ops = {
	.cpu_online		= hotplug_cpu_online,
	.cpu_offline		= hotplug_cpu_offline,
	.exit			= hotplug_exit,
	.name			= "hotplug_cbs",
	.timeout_ms		= 1000U,
};

SEC(".struct_ops.link")
struct sched_ext_ops hotplug_nocb_ops = {
	.exit			= hotplug_exit,
	.name			= "hotplug_nocbs",
	.timeout_ms		= 1000U,
};

+168

tools/testing/selftests/sched_ext/hotplug.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
#include <bpf/bpf.h>
#include <sched.h>
#include <scx/common.h>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

#include "hotplug_test.h"
#include "hotplug.bpf.skel.h"
#include "scx_test.h"
#include "util.h"

const char *online_path = "/sys/devices/system/cpu/cpu1/online";

static bool is_cpu_online(void)
{
	return file_read_long(online_path) > 0;
}

static void toggle_online_status(bool online)
{
	long val = online ? 1 : 0;
	int ret;

	ret = file_write_long(online_path, val);
	if (ret != 0)
		fprintf(stderr, "Failed to bring CPU %s (%s)",
			online ? "online" : "offline", strerror(errno));
}

static enum scx_test_status setup(void **ctx)
{
	if (!is_cpu_online())
		return SCX_TEST_SKIP;

	return SCX_TEST_PASS;
}

static enum scx_test_status test_hotplug(bool onlining, bool cbs_defined)
{
	struct hotplug *skel;
	struct bpf_link *link;
	long kind, code;

	SCX_ASSERT(is_cpu_online());

	skel = hotplug__open_and_load();
	SCX_ASSERT(skel);

	/* Testing the offline -> online path, so go offline before starting */
	if (onlining)
		toggle_online_status(0);

	if (cbs_defined) {
		kind = SCX_KIND_VAL(SCX_EXIT_UNREG_BPF);
		code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | HOTPLUG_EXIT_RSN;
		if (onlining)
			code |= HOTPLUG_ONLINING;
	} else {
		kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
		code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
		       SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
	}

	if (cbs_defined)
		link = bpf_map__attach_struct_ops(skel->maps.hotplug_cb_ops);
	else
		link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);

	if (!link) {
		SCX_ERR("Failed to attach scheduler");
		hotplug__destroy(skel);
		return SCX_TEST_FAIL;
	}

	toggle_online_status(onlining ? 1 : 0);

	while (!UEI_EXITED(skel, uei))
		sched_yield();

	SCX_EQ(skel->data->uei.kind, kind);
	SCX_EQ(UEI_REPORT(skel, uei), code);

	if (!onlining)
		toggle_online_status(1);

	bpf_link__destroy(link);
	hotplug__destroy(skel);

	return SCX_TEST_PASS;
}

static enum scx_test_status test_hotplug_attach(void)
{
	struct hotplug *skel;
	struct bpf_link *link;
	enum scx_test_status status = SCX_TEST_PASS;
	long kind, code;

	SCX_ASSERT(is_cpu_online());
	SCX_ASSERT(scx_hotplug_seq() > 0);

	skel = SCX_OPS_OPEN(hotplug_nocb_ops, hotplug);
	SCX_ASSERT(skel);

	SCX_OPS_LOAD(skel, hotplug_nocb_ops, hotplug, uei);

	/*
	 * Take the CPU offline to increment the global hotplug seq, which
	 * should cause attach to fail due to us setting the hotplug seq above
	 */
	toggle_online_status(0);
	link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);

	toggle_online_status(1);

	SCX_ASSERT(link);
	while (!UEI_EXITED(skel, uei))
		sched_yield();

	kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
	code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
	       SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
	SCX_EQ(skel->data->uei.kind, kind);
	SCX_EQ(UEI_REPORT(skel, uei), code);

	bpf_link__destroy(link);
	hotplug__destroy(skel);

	return status;
}

static enum scx_test_status run(void *ctx)
{

#define HP_TEST(__onlining, __cbs_defined) ({				\
	if (test_hotplug(__onlining, __cbs_defined) != SCX_TEST_PASS)	\
		return SCX_TEST_FAIL;					\
})

	HP_TEST(true, true);
	HP_TEST(false, true);
	HP_TEST(true, false);
	HP_TEST(false, false);

#undef HP_TEST

	return test_hotplug_attach();
}

static void cleanup(void *ctx)
{
	toggle_online_status(1);
}

struct scx_test hotplug_test = {
	.name = "hotplug",
	.description = "Verify hotplug behavior",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&hotplug_test)

+15

tools/testing/selftests/sched_ext/hotplug_test.h

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */

#ifndef __HOTPLUG_TEST_H__
#define __HOTPLUG_TEST_H__

enum hotplug_test_flags {
	HOTPLUG_EXIT_RSN = 1LLU << 0,
	HOTPLUG_ONLINING = 1LLU << 1,
};

#endif  // # __HOTPLUG_TEST_H__
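The hotplug test packs its own flags into the low bits of the exit code and ORs them with a restart action bit, and the userspace side rebuilds the same value to compare. A rough sketch of that composition is below. `DEMO_ECODE_ACT_RESTART` is an assumed stand-in for `SCX_ECODE_ACT_RESTART`, whose real value comes from the kernel headers and is not shown in this diff; `demo_hotplug_code()` and `demo_code_is_onlining()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From hotplug_test.h: test-private bits carried in the exit code. */
#define HOTPLUG_EXIT_RSN	(1ULL << 0)
#define HOTPLUG_ONLINING	(1ULL << 1)

/*
 * ASSUMPTION: illustrative stand-in for SCX_ECODE_ACT_RESTART. Only the fact
 * that it does not overlap the test-private low bits matters here.
 */
#define DEMO_ECODE_ACT_RESTART	(1ULL << 48)

/* Build the exit code the way exit_from_hotplug() does. */
static uint64_t demo_hotplug_code(bool onlining)
{
	uint64_t code = DEMO_ECODE_ACT_RESTART | HOTPLUG_EXIT_RSN;

	if (onlining)
		code |= HOTPLUG_ONLINING;

	/* The action bit and the private flags must stay distinguishable. */
	assert((DEMO_ECODE_ACT_RESTART & (HOTPLUG_EXIT_RSN | HOTPLUG_ONLINING)) == 0);
	return code;
}

/* Decode the direction of the hotplug event from a recorded exit code. */
static bool demo_code_is_onlining(uint64_t code)
{
	return (code & HOTPLUG_ONLINING) != 0;
}
```

Because each flag occupies its own bit, the onlining and offlining cases differ in exactly one bit, which is what lets the test tell them apart with a single equality check.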

+53

tools/testing/selftests/sched_ext/init_enable_count.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A scheduler that verifies that we do proper counting of init, enable, etc
 * callbacks.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

u64 init_task_cnt, exit_task_cnt, enable_cnt, disable_cnt;
u64 init_fork_cnt, init_transition_cnt;

s32 BPF_STRUCT_OPS_SLEEPABLE(cnt_init_task, struct task_struct *p,
			     struct scx_init_task_args *args)
{
	__sync_fetch_and_add(&init_task_cnt, 1);

	if (args->fork)
		__sync_fetch_and_add(&init_fork_cnt, 1);
	else
		__sync_fetch_and_add(&init_transition_cnt, 1);

	return 0;
}

void BPF_STRUCT_OPS(cnt_exit_task, struct task_struct *p)
{
	__sync_fetch_and_add(&exit_task_cnt, 1);
}

void BPF_STRUCT_OPS(cnt_enable, struct task_struct *p)
{
	__sync_fetch_and_add(&enable_cnt, 1);
}

void BPF_STRUCT_OPS(cnt_disable, struct task_struct *p)
{
	__sync_fetch_and_add(&disable_cnt, 1);
}

SEC(".struct_ops.link")
struct sched_ext_ops init_enable_count_ops = {
	.init_task	= cnt_init_task,
	.exit_task	= cnt_exit_task,
	.enable		= cnt_enable,
	.disable	= cnt_disable,
	.name		= "init_enable_count",
};
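The counters in this scheduler are bumped with `__sync_fetch_and_add()` because the callbacks can run concurrently on several CPUs. As a loose userspace analogy (threads instead of BPF callbacks, and assuming a GCC or Clang toolchain that provides the `__sync` builtins), the same pattern could be sketched like this. All `demo_` names and the thread and iteration counts are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DEMO_THREADS	4
#define DEMO_ITERS	10000

static uint64_t demo_init_task_cnt, demo_init_fork_cnt, demo_init_transition_cnt;

/* Hypothetical stand-in for struct scx_init_task_args; only .fork is modeled. */
struct demo_init_args {
	int fork;
};

/* Plays the role of cnt_init_task(): every call bumps the total and one bucket. */
static void *demo_init_task(void *arg)
{
	const struct demo_init_args *args = arg;
	int i;

	assert(args != NULL);
	for (i = 0; i < DEMO_ITERS; i++) {
		__sync_fetch_and_add(&demo_init_task_cnt, 1);
		if (args->fork)
			__sync_fetch_and_add(&demo_init_fork_cnt, 1);
		else
			__sync_fetch_and_add(&demo_init_transition_cnt, 1);
	}
	return NULL;
}

/* Run the workers concurrently; returns 0 on success, -1 if a thread failed to start. */
static int demo_count_callbacks(void)
{
	pthread_t threads[DEMO_THREADS];
	struct demo_init_args args[DEMO_THREADS];
	int i, started = 0, err = 0;

	for (i = 0; i < DEMO_THREADS; i++) {
		args[i].fork = i % 2;	/* alternate fork and transition paths */
		if (pthread_create(&threads[i], NULL, demo_init_task, &args[i])) {
			err = -1;
			break;
		}
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	return err;
}
```

With plain `cnt++` the totals could come up short under contention; with the atomic add, the two buckets always sum to the total, which is the invariant the userspace half of the test relies on.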

+166

tools/testing/selftests/sched_ext/init_enable_count.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <bpf/bpf.h>
#include <scx/common.h>
#include <sys/wait.h>
#include "scx_test.h"
#include "init_enable_count.bpf.skel.h"

#define SCHED_EXT 7

static struct init_enable_count *
open_load_prog(bool global)
{
	struct init_enable_count *skel;

	skel = init_enable_count__open();
	SCX_BUG_ON(!skel, "Failed to open skel");

	if (!global)
		skel->struct_ops.init_enable_count_ops->flags |= SCX_OPS_SWITCH_PARTIAL;

	SCX_BUG_ON(init_enable_count__load(skel), "Failed to load skel");

	return skel;
}

static enum scx_test_status run_test(bool global)
{
	struct init_enable_count *skel;
	struct bpf_link *link;
	const u32 num_children = 5, num_pre_forks = 1024;
	int ret, i, status;
	struct sched_param param = {};
	pid_t pids[num_pre_forks];

	skel = open_load_prog(global);

	/*
	 * Fork a bunch of children before we attach the scheduler so that we
	 * ensure (at least in practical terms) that there are more tasks that
	 * transition from SCHED_OTHER -> SCHED_EXT than there are tasks that
	 * take the fork() path either below or in other processes.
	 */
	for (i = 0; i < num_pre_forks; i++) {
		pids[i] = fork();
		SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");
		if (pids[i] == 0) {
			sleep(1);
			exit(0);
		}
	}

	link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
	SCX_FAIL_IF(!link, "Failed to attach struct_ops");

	for (i = 0; i < num_pre_forks; i++) {
		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
			    "Failed to wait for pre-forked child\n");

		SCX_FAIL_IF(status != 0, "Pre-forked child %d exited with status %d\n", i,
			    status);
	}

	bpf_link__destroy(link);
	SCX_GE(skel->bss->init_task_cnt, num_pre_forks);
	SCX_GE(skel->bss->exit_task_cnt, num_pre_forks);

	link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
	SCX_FAIL_IF(!link, "Failed to attach struct_ops");

	/* SCHED_EXT children */
	for (i = 0; i < num_children; i++) {
		pids[i] = fork();
		SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");

		if (pids[i] == 0) {
			ret = sched_setscheduler(0, SCHED_EXT, &param);
			SCX_BUG_ON(ret, "Failed to set sched to sched_ext");

			/*
			 * Reset to SCHED_OTHER for half of them. Counts for
			 * everything should still be the same regardless, as
			 * ops.disable() is invoked even if a task is still on
			 * SCHED_EXT before it exits.
			 */
			if (i % 2 == 0) {
				ret = sched_setscheduler(0, SCHED_OTHER, &param);
				SCX_BUG_ON(ret, "Failed to reset sched to normal");
			}
			exit(0);
		}
	}
	for (i = 0; i < num_children; i++) {
		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
			    "Failed to wait for SCX child\n");

		SCX_FAIL_IF(status != 0, "SCX child %d exited with status %d\n", i,
			    status);
	}

	/* SCHED_OTHER children */
	for (i = 0; i < num_children; i++) {
		pids[i] = fork();
		if (pids[i] == 0)
			exit(0);
	}

	for (i = 0; i < num_children; i++) {
		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
			    "Failed to wait for normal child\n");

		SCX_FAIL_IF(status != 0, "Normal child %d exited with status %d\n", i,
			    status);
	}

	bpf_link__destroy(link);

	SCX_GE(skel->bss->init_task_cnt, 2 * num_children);
	SCX_GE(skel->bss->exit_task_cnt, 2 * num_children);

	if (global) {
		SCX_GE(skel->bss->enable_cnt, 2 * num_children);
		SCX_GE(skel->bss->disable_cnt, 2 * num_children);
	} else {
		SCX_EQ(skel->bss->enable_cnt, num_children);
		SCX_EQ(skel->bss->disable_cnt, num_children);
	}
	/*
	 * We forked a ton of tasks before we attached the scheduler above, so
	 * this should be fine. Technically it could be flaky if a ton of forks
	 * are happening at the same time in other processes, but that should
	 * be exceedingly unlikely.
	 */
	SCX_GT(skel->bss->init_transition_cnt, skel->bss->init_fork_cnt);
	SCX_GE(skel->bss->init_fork_cnt, 2 * num_children);

	init_enable_count__destroy(skel);

	return SCX_TEST_PASS;
}

static enum scx_test_status run(void *ctx)
{
	enum scx_test_status status;

	status = run_test(true);
	if (status != SCX_TEST_PASS)
		return status;

	return run_test(false);
}

struct scx_test init_enable_count = {
	.name = "init_enable_count",
	.description = "Verify we do the correct amount of counting of init, "
		       "enable, etc callbacks.",
	.run = run,
};
REGISTER_SCX_TEST(&init_enable_count)
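The test above repeats one pattern three times: fork a batch of children, let each exit, then reap every child with `waitpid()` and check its status. A self-contained sketch of just that pattern, without any scheduler attached, might look like the following. `demo_fork_and_reap()` and the `DEMO_MAX_CHILDREN` limit are hypothetical and not part of the selftests.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_MAX_CHILDREN 64

/*
 * Fork n short-lived children, then reap each one and count how many exited
 * with status 0. Returns -1 if fork() or waitpid() fails; children that were
 * already forked are still reaped in that case.
 */
static int demo_fork_and_reap(int n)
{
	pid_t pids[DEMO_MAX_CHILDREN];
	int i, status, forked = 0, clean = 0, failed = 0;

	assert(n >= 0 && n <= DEMO_MAX_CHILDREN);

	for (i = 0; i < n; i++) {
		pids[i] = fork();
		if (pids[i] < 0) {
			failed = 1;
			break;
		}
		if (pids[i] == 0)
			_exit(0);	/* child: exit immediately and cleanly */
		forked++;
	}

	for (i = 0; i < forked; i++) {
		if (waitpid(pids[i], &status, 0) != pids[i]) {
			failed = 1;
			continue;
		}
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			clean++;
	}

	return failed ? -1 : clean;
}
```

The sketch uses `WIFEXITED()` and `WEXITSTATUS()` rather than comparing the raw status to zero; for a child that exits with 0 the two checks agree, but the macros also distinguish a child killed by a signal.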

+164

tools/testing/selftests/sched_ext/maximal.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A scheduler with every callback defined.
 *
 * This scheduler defines every callback.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

s32 BPF_STRUCT_OPS(maximal_select_cpu, struct task_struct *p, s32 prev_cpu,
		   u64 wake_flags)
{
	return prev_cpu;
}

void BPF_STRUCT_OPS(maximal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(maximal_dequeue, struct task_struct *p, u64 deq_flags)
{}

void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev)
{
	scx_bpf_consume(SCX_DSQ_GLOBAL);
}

void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags)
{}

void BPF_STRUCT_OPS(maximal_running, struct task_struct *p)
{}

void BPF_STRUCT_OPS(maximal_stopping, struct task_struct *p, bool runnable)
{}

void BPF_STRUCT_OPS(maximal_quiescent, struct task_struct *p, u64 deq_flags)
{}

bool BPF_STRUCT_OPS(maximal_yield, struct task_struct *from,
		    struct task_struct *to)
{
	return false;
}

bool BPF_STRUCT_OPS(maximal_core_sched_before, struct task_struct *a,
		    struct task_struct *b)
{
	return false;
}

void BPF_STRUCT_OPS(maximal_set_weight, struct task_struct *p, u32 weight)
{}

void BPF_STRUCT_OPS(maximal_set_cpumask, struct task_struct *p,
		    const struct cpumask *cpumask)
{}

void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle)
{}

void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu,
		    struct scx_cpu_acquire_args *args)
{}

void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu,
		    struct scx_cpu_release_args *args)
{}

void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu)
{}

void BPF_STRUCT_OPS(maximal_cpu_offline, s32 cpu)
{}

s32 BPF_STRUCT_OPS(maximal_init_task, struct task_struct *p,
		   struct scx_init_task_args *args)
{
	return 0;
}

void BPF_STRUCT_OPS(maximal_enable, struct task_struct *p)
{}

void BPF_STRUCT_OPS(maximal_exit_task, struct task_struct *p,
		    struct scx_exit_task_args *args)
{}

void BPF_STRUCT_OPS(maximal_disable, struct task_struct *p)
{}

s32 BPF_STRUCT_OPS(maximal_cgroup_init, struct cgroup *cgrp,
		   struct scx_cgroup_init_args *args)
{
	return 0;
}

void BPF_STRUCT_OPS(maximal_cgroup_exit, struct cgroup *cgrp)
{}

s32 BPF_STRUCT_OPS(maximal_cgroup_prep_move, struct task_struct *p,
		   struct cgroup *from, struct cgroup *to)
{
	return 0;
}

void BPF_STRUCT_OPS(maximal_cgroup_move, struct task_struct *p,
		    struct cgroup *from, struct cgroup *to)
{}

void BPF_STRUCT_OPS(maximal_cgroup_cancel_move, struct task_struct *p,
		    struct cgroup *from, struct cgroup *to)
{}

void BPF_STRUCT_OPS(maximal_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
{}

s32 BPF_STRUCT_OPS_SLEEPABLE(maximal_init)
{
	return 0;
}

void BPF_STRUCT_OPS(maximal_exit, struct scx_exit_info *info)
{}

SEC(".struct_ops.link")
struct sched_ext_ops maximal_ops = {
	.select_cpu		= maximal_select_cpu,
	.enqueue		= maximal_enqueue,
	.dequeue		= maximal_dequeue,
	.dispatch		= maximal_dispatch,
	.runnable		= maximal_runnable,
	.running		= maximal_running,
	.stopping		= maximal_stopping,
	.quiescent		= maximal_quiescent,
	.yield			= maximal_yield,
	.core_sched_before	= maximal_core_sched_before,
	.set_weight		= maximal_set_weight,
	.set_cpumask		= maximal_set_cpumask,
	.update_idle		= maximal_update_idle,
	.cpu_acquire		= maximal_cpu_acquire,
	.cpu_release		= maximal_cpu_release,
	.cpu_online		= maximal_cpu_online,
	.cpu_offline		= maximal_cpu_offline,
	.init_task		= maximal_init_task,
	.enable			= maximal_enable,
	.exit_task		= maximal_exit_task,
	.disable		= maximal_disable,
	.cgroup_init		= maximal_cgroup_init,
	.cgroup_exit		= maximal_cgroup_exit,
	.cgroup_prep_move	= maximal_cgroup_prep_move,
	.cgroup_move		= maximal_cgroup_move,
	.cgroup_cancel_move	= maximal_cgroup_cancel_move,
	.cgroup_set_weight	= maximal_cgroup_set_weight,
	.init			= maximal_init,
	.exit			= maximal_exit,
	.name			= "maximal",
};

+51

tools/testing/selftests/sched_ext/maximal.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
#include <bpf/bpf.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "maximal.bpf.skel.h"
#include "scx_test.h"

static enum scx_test_status setup(void **ctx)
{
	struct maximal *skel;

	skel = maximal__open_and_load();
	SCX_FAIL_IF(!skel, "Failed to open and load skel");
	*ctx = skel;

	return SCX_TEST_PASS;
}

static enum scx_test_status run(void *ctx)
{
	struct maximal *skel = ctx;
	struct bpf_link *link;

	link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
	SCX_FAIL_IF(!link, "Failed to attach scheduler");

	bpf_link__destroy(link);

	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	struct maximal *skel = ctx;

	maximal__destroy(skel);
}

struct scx_test maximal = {
	.name = "maximal",
	.description = "Verify we can load a scheduler with every callback defined",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&maximal)

+36

tools/testing/selftests/sched_ext/maybe_null.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

u64 vtime_test;

void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
{}

void BPF_STRUCT_OPS(maybe_null_success_dispatch, s32 cpu, struct task_struct *p)
{
	if (p != NULL)
		vtime_test = p->scx.dsq_vtime;
}

bool BPF_STRUCT_OPS(maybe_null_success_yield, struct task_struct *from,
		    struct task_struct *to)
{
	if (to)
		bpf_printk("Yielding to %s[%d]", to->comm, to->pid);

	return false;
}

SEC(".struct_ops.link")
struct sched_ext_ops maybe_null_success = {
	.dispatch	= maybe_null_success_dispatch,
	.yield		= maybe_null_success_yield,
	.enable		= maybe_null_running,
	.name		= "minimal",
};

+49

tools/testing/selftests/sched_ext/maybe_null.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */
#include <bpf/bpf.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "maybe_null.bpf.skel.h"
#include "maybe_null_fail_dsp.bpf.skel.h"
#include "maybe_null_fail_yld.bpf.skel.h"
#include "scx_test.h"

static enum scx_test_status run(void *ctx)
{
	struct maybe_null *skel;
	struct maybe_null_fail_dsp *fail_dsp;
	struct maybe_null_fail_yld *fail_yld;

	skel = maybe_null__open_and_load();
	if (!skel) {
		SCX_ERR("Failed to open and load maybe_null skel");
		return SCX_TEST_FAIL;
	}
	maybe_null__destroy(skel);

	fail_dsp = maybe_null_fail_dsp__open_and_load();
	if (fail_dsp) {
		maybe_null_fail_dsp__destroy(fail_dsp);
		SCX_ERR("Should failed to open and load maybe_null_fail_dsp skel");
		return SCX_TEST_FAIL;
	}

	fail_yld = maybe_null_fail_yld__open_and_load();
	if (fail_yld) {
		maybe_null_fail_yld__destroy(fail_yld);
		SCX_ERR("Should failed to open and load maybe_null_fail_yld skel");
		return SCX_TEST_FAIL;
	}

	return SCX_TEST_PASS;
}

struct scx_test maybe_null = {
	.name = "maybe_null",
	.description = "Verify if PTR_MAYBE_NULL work for .dispatch",
	.run = run,
};
REGISTER_SCX_TEST(&maybe_null)

+25

tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

u64 vtime_test;

void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
{}

void BPF_STRUCT_OPS(maybe_null_fail_dispatch, s32 cpu, struct task_struct *p)
{
	vtime_test = p->scx.dsq_vtime;
}

SEC(".struct_ops.link")
struct sched_ext_ops maybe_null_fail = {
	.dispatch	= maybe_null_fail_dispatch,
	.enable		= maybe_null_running,
	.name		= "maybe_null_fail_dispatch",
};

+28

tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

u64 vtime_test;

void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
{}

bool BPF_STRUCT_OPS(maybe_null_fail_yield, struct task_struct *from,
		    struct task_struct *to)
{
	bpf_printk("Yielding to %s[%d]", to->comm, to->pid);

	return false;
}

SEC(".struct_ops.link")
struct sched_ext_ops maybe_null_fail = {
	.yield		= maybe_null_fail_yield,
	.enable		= maybe_null_running,
	.name		= "maybe_null_fail_yield",
};
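The three maybe_null programs differ only in whether the task pointer is checked before it is dereferenced: the guarded variant is expected to load, and the two unguarded ones are expected to be rejected at load time. In plain userspace C nothing enforces that rule, but the guard itself can be sketched as follows. `struct demo_task`, `demo_dispatch()` and `demo_usage()` are hypothetical stand-ins, not kernel types or selftest code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one task field the test reads. */
struct demo_task {
	uint64_t dsq_vtime;
};

static uint64_t demo_vtime_test;

/*
 * Guarded read, shaped like maybe_null_success_dispatch(): the pointer may be
 * NULL, so it is only dereferenced inside the check.
 * Returns 1 if the value was recorded, 0 if p was NULL.
 */
static int demo_dispatch(const struct demo_task *p)
{
	if (p != NULL) {
		demo_vtime_test = p->dsq_vtime;
		return 1;
	}
	return 0;
}

/* Usage: both the NULL and the non-NULL case are handled without a fault. */
static void demo_usage(void)
{
	struct demo_task t = { .dsq_vtime = 42 };

	assert(demo_dispatch(NULL) == 0);
	assert(demo_dispatch(&t) == 1 && demo_vtime_test == 42);
}
```

Dropping the `if (p != NULL)` check would turn this into the shape of `maybe_null_fail_dispatch()`, which the selftest expects to fail to load.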

+21

tools/testing/selftests/sched_ext/minimal.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A completely minimal scheduler.
 *
 * This scheduler defines the absolute minimal set of struct sched_ext_ops
 * fields: its name. It should _not_ fail to be loaded, and can be used to
 * exercise the default scheduling paths in ext.c.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
	.name		= "minimal",
};

+58

tools/testing/selftests/sched_ext/minimal.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
#include <bpf/bpf.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "minimal.bpf.skel.h"
#include "scx_test.h"

static enum scx_test_status setup(void **ctx)
{
	struct minimal *skel;

	skel = minimal__open_and_load();
	if (!skel) {
		SCX_ERR("Failed to open and load skel");
		return SCX_TEST_FAIL;
	}
	*ctx = skel;

	return SCX_TEST_PASS;
}

static enum scx_test_status run(void *ctx)
{
	struct minimal *skel = ctx;
	struct bpf_link *link;

	link = bpf_map__attach_struct_ops(skel->maps.minimal_ops);
	if (!link) {
		SCX_ERR("Failed to attach scheduler");
		return SCX_TEST_FAIL;
	}

	bpf_link__destroy(link);

	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	struct minimal *skel = ctx;

	minimal__destroy(skel);
}

struct scx_test minimal = {
	.name = "minimal",
	.description = "Verify we can load a fully minimal scheduler",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&minimal)
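Most tests in this directory follow the same shape: an optional setup() that stores a context pointer, a run() that receives it, and an optional cleanup(). The part of the runner that drives these callbacks is not visible here, so the sketch below is only an assumption about how such a driver could work: it skips run() and cleanup() when setup() does not pass. Every `demo_` name is hypothetical and deliberately distinct from the real `struct scx_test`.

```c
#include <assert.h>
#include <stddef.h>

enum demo_status { DEMO_PASS, DEMO_SKIP, DEMO_FAIL };

/* Hypothetical mirror of the fields the tests fill in on struct scx_test. */
struct demo_test {
	const char *name;
	enum demo_status (*setup)(void **ctx);
	enum demo_status (*run)(void *ctx);
	void (*cleanup)(void *ctx);
};

/* ASSUMED driver order: setup (optional) -> run -> cleanup (optional). */
static enum demo_status demo_run_test(const struct demo_test *t)
{
	void *ctx = NULL;
	enum demo_status status;

	assert(t != NULL && t->run != NULL);

	if (t->setup) {
		status = t->setup(&ctx);
		if (status != DEMO_PASS)
			return status;	/* nothing was set up, so nothing to clean */
	}

	status = t->run(ctx);

	if (t->cleanup)
		t->cleanup(ctx);

	return status;
}

/* A tiny sample test that records how often each callback ran. */
static int demo_calls[3];

static enum demo_status demo_setup(void **ctx)
{
	demo_calls[0]++;
	*ctx = demo_calls;	/* hand the context to run() and cleanup() */
	return DEMO_PASS;
}

static enum demo_status demo_run(void *ctx)
{
	demo_calls[1]++;
	return ctx == demo_calls ? DEMO_PASS : DEMO_FAIL;
}

static void demo_cleanup(void *ctx)
{
	(void)ctx;
	demo_calls[2]++;
}
```

Passing the context through a `void **` in setup() is what lets minimal.c and maximal.c keep their skeleton pointer out of global state, in contrast to reload_loop.c, which uses a file-scope static instead.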

+33

tools/testing/selftests/sched_ext/prog_run.bpf.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * A scheduler that validates that we can invoke sched_ext kfuncs in
 * BPF_PROG_TYPE_SYSCALL programs.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */

#include <scx/common.bpf.h>

UEI_DEFINE(uei);

char _license[] SEC("license") = "GPL";

SEC("syscall")
int BPF_PROG(prog_run_syscall)
{
	scx_bpf_create_dsq(0, -1);
	scx_bpf_exit(0xdeadbeef, "Exited from PROG_RUN");
	return 0;
}

void BPF_STRUCT_OPS(prog_run_exit, struct scx_exit_info *ei)
{
	UEI_RECORD(uei, ei);
}

SEC(".struct_ops.link")
struct sched_ext_ops prog_run_ops = {
	.exit		= prog_run_exit,
	.name		= "prog_run",
};

+78

tools/testing/selftests/sched_ext/prog_run.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
#include <bpf/bpf.h>
#include <sched.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "prog_run.bpf.skel.h"
#include "scx_test.h"

static enum scx_test_status setup(void **ctx)
{
	struct prog_run *skel;

	skel = prog_run__open_and_load();
	if (!skel) {
		SCX_ERR("Failed to open and load skel");
		return SCX_TEST_FAIL;
	}
	*ctx = skel;

	return SCX_TEST_PASS;
}

static enum scx_test_status run(void *ctx)
{
	struct prog_run *skel = ctx;
	struct bpf_link *link;
	int prog_fd, err = 0;

	prog_fd = bpf_program__fd(skel->progs.prog_run_syscall);
	if (prog_fd < 0) {
		SCX_ERR("Failed to get BPF_PROG_RUN prog");
		return SCX_TEST_FAIL;
	}

	LIBBPF_OPTS(bpf_test_run_opts, topts);

	link = bpf_map__attach_struct_ops(skel->maps.prog_run_ops);
	if (!link) {
		SCX_ERR("Failed to attach scheduler");
		close(prog_fd);
		return SCX_TEST_FAIL;
	}

	err = bpf_prog_test_run_opts(prog_fd, &topts);
	SCX_EQ(err, 0);

	/* Assumes uei.kind is written last */
	while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
		sched_yield();

	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
	SCX_EQ(skel->data->uei.exit_code, 0xdeadbeef);
	close(prog_fd);
	bpf_link__destroy(link);

	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	struct prog_run *skel = ctx;

	prog_run__destroy(skel);
}

struct scx_test prog_run = {
	.name = "prog_run",
	.description = "Verify we can call into a scheduler with BPF_PROG_RUN, and invoke kfuncs",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&prog_run)

+75

tools/testing/selftests/sched_ext/reload_loop.c

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
#include <bpf/bpf.h>
#include <pthread.h>
#include <scx/common.h>
#include <sys/wait.h>
#include <unistd.h>
#include "maximal.bpf.skel.h"
#include "scx_test.h"

static struct maximal *skel;
static pthread_t threads[2];

bool force_exit = false;

static enum scx_test_status setup(void **ctx)
{
	skel = maximal__open_and_load();
	if (!skel) {
		SCX_ERR("Failed to open and load skel");
		return SCX_TEST_FAIL;
	}

	return SCX_TEST_PASS;
}

static void *do_reload_loop(void *arg)
{
	u32 i;

	for (i = 0; i < 1024 && !force_exit; i++) {
		struct bpf_link *link;

		link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
		if (link)
			bpf_link__destroy(link);
	}

	return NULL;
}

static enum scx_test_status run(void *ctx)
{
	int err;
	void *ret;

	err = pthread_create(&threads[0], NULL, do_reload_loop, NULL);
	SCX_FAIL_IF(err, "Failed to create thread 0");

	err = pthread_create(&threads[1], NULL, do_reload_loop, NULL);
	SCX_FAIL_IF(err, "Failed to create thread 1");

	SCX_FAIL_IF(pthread_join(threads[0], &ret), "thread 0 failed");
	SCX_FAIL_IF(pthread_join(threads[1], &ret), "thread 1 failed");

	return SCX_TEST_PASS;
}

static void cleanup(void *ctx)
{
	force_exit = true;
	maximal__destroy(skel);
}

struct scx_test reload_loop = {
	.name = "reload_loop",
	.description = "Stress test loading and unloading schedulers repeatedly in a tight loop",
	.setup = setup,
	.run = run,
	.cleanup = cleanup,
};
REGISTER_SCX_TEST(&reload_loop)

+201

tools/testing/selftests/sched_ext/runner.c

+ /* SPDX-License-Identifier: GPL-2.0 */
+ /*
+  * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+  * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+  * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+  */
+ #include <stdio.h>
+ #include <string.h>
+ #include <unistd.h>
+ #include <signal.h>
+ #include <libgen.h>
+ #include <bpf/bpf.h>
+ #include "scx_test.h"
+
+ const char help_fmt[] =
+ "The runner for sched_ext tests.\n"
+ "\n"
+ "The runner is statically linked against all testcases, and runs them all serially.\n"
+ "It's required for the testcases to be serial, as only a single host-wide sched_ext\n"
+ "scheduler may be loaded at any given time.\n"
+ "\n"
+ "Usage: %s [-t TEST] [-s] [-q] [-h]\n"
+ "\n"
+ "  -t TEST       Only run tests whose name includes this string\n"
+ "  -s            Include print output for skipped tests\n"
+ "  -q            Don't print the test descriptions during run\n"
+ "  -h            Display this help and exit\n";
+
+ static volatile int exit_req;
+ static bool quiet, print_skipped;
+
+ #define MAX_SCX_TESTS 2048
+
+ static struct scx_test __scx_tests[MAX_SCX_TESTS];
+ static unsigned __scx_num_tests = 0;
+
+ static void sigint_handler(int simple)
+ {
+ 	exit_req = 1;
+ }
+
+ static void print_test_preamble(const struct scx_test *test, bool quiet)
+ {
+ 	printf("===== START =====\n");
+ 	printf("TEST: %s\n", test->name);
+ 	if (!quiet)
+ 		printf("DESCRIPTION: %s\n", test->description);
+ 	printf("OUTPUT:\n");
+ }
+
+ static const char *status_to_result(enum scx_test_status status)
+ {
+ 	switch (status) {
+ 	case SCX_TEST_PASS:
+ 	case SCX_TEST_SKIP:
+ 		return "ok";
+ 	case SCX_TEST_FAIL:
+ 		return "not ok";
+ 	default:
+ 		return "<UNKNOWN>";
+ 	}
+ }
+
+ static void print_test_result(const struct scx_test *test,
+ 			      enum scx_test_status status,
+ 			      unsigned int testnum)
+ {
+ 	const char *result = status_to_result(status);
+ 	const char *directive = status == SCX_TEST_SKIP ? "SKIP " : "";
+
+ 	printf("%s %u %s # %s\n", result, testnum, test->name, directive);
+ 	printf("===== END =====\n");
+ }
+
+ static bool should_skip_test(const struct scx_test *test, const char *filter)
+ {
+ 	return !strstr(test->name, filter);
+ }
+
+ static enum scx_test_status run_test(const struct scx_test *test)
+ {
+ 	enum scx_test_status status;
+ 	void *context = NULL;
+
+ 	if (test->setup) {
+ 		status = test->setup(&context);
+ 		if (status != SCX_TEST_PASS)
+ 			return status;
+ 	}
+
+ 	status = test->run(context);
+
+ 	if (test->cleanup)
+ 		test->cleanup(context);
+
+ 	return status;
+ }
+
+ static bool test_valid(const struct scx_test *test)
+ {
+ 	if (!test) {
+ 		fprintf(stderr, "NULL test detected\n");
+ 		return false;
+ 	}
+
+ 	if (!test->name) {
+ 		fprintf(stderr,
+ 			"Test with no name found. Must specify test name.\n");
+ 		return false;
+ 	}
+
+ 	if (!test->description) {
+ 		fprintf(stderr, "Test %s requires description.\n", test->name);
+ 		return false;
+ 	}
+
+ 	if (!test->run) {
+ 		fprintf(stderr, "Test %s has no run() callback\n", test->name);
+ 		return false;
+ 	}
+
+ 	return true;
+ }
+
+ int main(int argc, char **argv)
+ {
+ 	const char *filter = NULL;
+ 	unsigned testnum = 0, i;
+ 	unsigned passed = 0, skipped = 0, failed = 0;
+ 	int opt;
+
+ 	signal(SIGINT, sigint_handler);
+ 	signal(SIGTERM, sigint_handler);
+
+ 	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ 	while ((opt = getopt(argc, argv, "qst:h")) != -1) {
+ 		switch (opt) {
+ 		case 'q':
+ 			quiet = true;
+ 			break;
+ 		case 's':
+ 			print_skipped = true;
+ 			break;
+ 		case 't':
+ 			filter = optarg;
+ 			break;
+ 		default:
+ 			fprintf(stderr, help_fmt, basename(argv[0]));
+ 			return opt != 'h';
+ 		}
+ 	}
+
+ 	for (i = 0; i < __scx_num_tests; i++) {
+ 		enum scx_test_status status;
+ 		struct scx_test *test = &__scx_tests[i];
+
+ 		if (filter && should_skip_test(test, filter)) {
+ 			/*
+ 			 * Printing the skipped tests and their preambles can
+ 			 * add a lot of noise to the runner output. Printing
+ 			 * this is only really useful for CI, so let's skip it
+ 			 * by default.
+ 			 */
+ 			if (print_skipped) {
+ 				print_test_preamble(test, quiet);
+ 				print_test_result(test, SCX_TEST_SKIP, ++testnum);
+ 			}
+ 			continue;
+ 		}
+
+ 		print_test_preamble(test, quiet);
+ 		status = run_test(test);
+ 		print_test_result(test, status, ++testnum);
+ 		switch (status) {
+ 		case SCX_TEST_PASS:
+ 			passed++;
+ 			break;
+ 		case SCX_TEST_SKIP:
+ 			skipped++;
+ 			break;
+ 		case SCX_TEST_FAIL:
+ 			failed++;
+ 			break;
+ 		}
+ 	}
+ 	printf("\n\n=============================\n\n");
+ 	printf("RESULTS:\n\n");
+ 	printf("PASSED: %u\n", passed);
+ 	printf("SKIPPED: %u\n", skipped);
+ 	printf("FAILED: %u\n", failed);
+
+ 	return 0;
+ }
+
+ void scx_test_register(struct scx_test *test)
+ {
+ 	SCX_BUG_ON(!test_valid(test), "Invalid test found");
+ 	SCX_BUG_ON(__scx_num_tests >= MAX_SCX_TESTS, "Maximum tests exceeded");
+
+ 	__scx_tests[__scx_num_tests++] = *test;
+ }
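The runner never lists its testcases explicitly: each test file registers itself through `scx_test_register()` from a constructor function that runs before `main()`. The following is a minimal standalone sketch of that pattern, not the selftest code itself. The names `test_register`, `REGISTER_TEST`, `CONCAT`, `registry`, and `num_tests` are hypothetical, and `__attribute__((constructor))` is assumed to be available (GCC and Clang). The two-level `CONCAT` helper is an addition of this sketch so that `__LINE__` is expanded before token pasting; pasting `__LINE__` directly with `##` leaves it unexpanded.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TESTS 16

struct test {
	const char *name;
};

/* Hypothetical registry, filled in before main() starts. */
static struct test registry[MAX_TESTS];
static unsigned int num_tests;

static void test_register(const struct test *t)
{
	assert(t && t->name);
	assert(num_tests < MAX_TESTS);
	registry[num_tests++] = *t;	/* copy, as the runner does */
}

/* Two-level paste so that __LINE__ is expanded to its number first. */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)

/* Emits a uniquely named constructor that registers the given test. */
#define REGISTER_TEST(t)					\
	__attribute__((constructor))				\
	static void CONCAT(register_test_, __LINE__)(void)	\
	{							\
		test_register(t);				\
	}

static struct test alpha = { .name = "alpha" };
REGISTER_TEST(&alpha)

static struct test beta = { .name = "beta" };
REGISTER_TEST(&beta)
```

By the time `main()` runs, `num_tests` reflects every `REGISTER_TEST()` use linked into the binary. The order in which constructors run is not something this sketch relies on.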

+131

tools/testing/selftests/sched_ext/scx_test.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 5 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 6 + */ 7 + 8 + #ifndef __SCX_TEST_H__ 9 + #define __SCX_TEST_H__ 10 + 11 + #include <errno.h> 12 + #include <scx/common.h> 13 + #include <scx/compat.h> 14 + 15 + enum scx_test_status { 16 + SCX_TEST_PASS = 0, 17 + SCX_TEST_SKIP, 18 + SCX_TEST_FAIL, 19 + }; 20 + 21 + #define EXIT_KIND(__ent) __COMPAT_ENUM_OR_ZERO("scx_exit_kind", #__ent) 22 + 23 + struct scx_test { 24 + /** 25 + * name - The name of the testcase. 26 + */ 27 + const char *name; 28 + 29 + /** 30 + * description - A description of your testcase: what it tests and is 31 + * meant to validate. 32 + */ 33 + const char *description; 34 + 35 + /* 36 + * setup - Setup the test. 37 + * @ctx: A pointer to a context object that will be passed to run and 38 + * cleanup. 39 + * 40 + * An optional callback that allows a testcase to perform setup for its 41 + * run. A test may return SCX_TEST_SKIP to skip the run. 42 + */ 43 + enum scx_test_status (*setup)(void **ctx); 44 + 45 + /* 46 + * run - Run the test. 47 + * @ctx: Context set in the setup() callback. If @ctx was not set in 48 + * setup(), it is NULL. 49 + * 50 + * The main test. Callers should return one of: 51 + * 52 + * - SCX_TEST_PASS: Test passed 53 + * - SCX_TEST_SKIP: Test should be skipped 54 + * - SCX_TEST_FAIL: Test failed 55 + * 56 + * This callback must be defined. 57 + */ 58 + enum scx_test_status (*run)(void *ctx); 59 + 60 + /* 61 + * cleanup - Perform cleanup following the test 62 + * @ctx: Context set in the setup() callback. If @ctx was not set in 63 + * setup(), it is NULL. 64 + * 65 + * An optional callback that allows a test to perform cleanup after 66 + * being run. This callback is run even if the run() callback returns 67 + * SCX_TEST_SKIP or SCX_TEST_FAIL. 
It is not run if setup() returns 68 + * SCX_TEST_SKIP or SCX_TEST_FAIL. 69 + */ 70 + void (*cleanup)(void *ctx); 71 + }; 72 + 73 + void scx_test_register(struct scx_test *test); 74 + 75 + #define REGISTER_SCX_TEST(__test) \ 76 + __attribute__((constructor)) \ 77 + static void ___scxregister##__LINE__(void) \ 78 + { \ 79 + scx_test_register(__test); \ 80 + } 81 + 82 + #define SCX_ERR(__fmt, ...) \ 83 + do { \ 84 + fprintf(stderr, "ERR: %s:%d\n", __FILE__, __LINE__); \ 85 + fprintf(stderr, __fmt"\n", ##__VA_ARGS__); \ 86 + } while (0) 87 + 88 + #define SCX_FAIL(__fmt, ...) \ 89 + do { \ 90 + SCX_ERR(__fmt, ##__VA_ARGS__); \ 91 + return SCX_TEST_FAIL; \ 92 + } while (0) 93 + 94 + #define SCX_FAIL_IF(__cond, __fmt, ...) \ 95 + do { \ 96 + if (__cond) \ 97 + SCX_FAIL(__fmt, ##__VA_ARGS__); \ 98 + } while (0) 99 + 100 + #define SCX_GT(_x, _y) SCX_FAIL_IF((_x) <= (_y), "Expected %s > %s (%lu > %lu)", \ 101 + #_x, #_y, (u64)(_x), (u64)(_y)) 102 + #define SCX_GE(_x, _y) SCX_FAIL_IF((_x) < (_y), "Expected %s >= %s (%lu >= %lu)", \ 103 + #_x, #_y, (u64)(_x), (u64)(_y)) 104 + #define SCX_LT(_x, _y) SCX_FAIL_IF((_x) >= (_y), "Expected %s < %s (%lu < %lu)", \ 105 + #_x, #_y, (u64)(_x), (u64)(_y)) 106 + #define SCX_LE(_x, _y) SCX_FAIL_IF((_x) > (_y), "Expected %s <= %s (%lu <= %lu)", \ 107 + #_x, #_y, (u64)(_x), (u64)(_y)) 108 + #define SCX_EQ(_x, _y) SCX_FAIL_IF((_x) != (_y), "Expected %s == %s (%lu == %lu)", \ 109 + #_x, #_y, (u64)(_x), (u64)(_y)) 110 + #define SCX_ASSERT(_x) SCX_FAIL_IF(!(_x), "Expected %s to be true (%lu)", \ 111 + #_x, (u64)(_x)) 112 + 113 + #define SCX_ECODE_VAL(__ecode) ({ \ 114 + u64 __val = 0; \ 115 + bool __found = false; \ 116 + \ 117 + __found = __COMPAT_read_enum("scx_exit_code", #__ecode, &__val); \ 118 + SCX_ASSERT(__found); \ 119 + (s64)__val; \ 120 + }) 121 + 122 + #define SCX_KIND_VAL(__kind) ({ \ 123 + u64 __val = 0; \ 124 + bool __found = false; \ 125 + \ 126 + __found = __COMPAT_read_enum("scx_exit_kind", #__kind, &__val); \ 127 + 
SCX_ASSERT(__found); \ 128 + __val; \ 129 + }) 130 + 131 + #endif /* __SCX_TEST_H__ */
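The callback comments in `struct scx_test` describe a lifecycle: `cleanup()` runs whatever `run()` returns, but is not called when `setup()` returns anything other than a pass. A small standalone sketch of that contract follows. It uses hypothetical names (`run_one`, `ST_PASS`, `cleanup_calls`, and the demo callbacks) rather than the selftest types, and only illustrates the ordering, not the real runner.

```c
#include <assert.h>
#include <stddef.h>

enum status { ST_PASS = 0, ST_SKIP, ST_FAIL };

struct test {
	enum status (*setup)(void **ctx);	/* optional */
	enum status (*run)(void *ctx);		/* required */
	void (*cleanup)(void *ctx);		/* optional */
};

/* Counts cleanup() invocations so the lifecycle can be observed. */
static int cleanup_calls;

static enum status run_one(const struct test *t)
{
	void *ctx = NULL;
	enum status st;

	if (t->setup) {
		st = t->setup(&ctx);
		if (st != ST_PASS)
			return st;	/* neither run() nor cleanup() is called */
	}

	st = t->run(ctx);

	if (t->cleanup)
		t->cleanup(ctx);	/* called regardless of run()'s result */

	return st;
}

/* Demo callbacks (hypothetical). */
static int shared = 10;

static enum status setup_ok(void **ctx) { *ctx = &shared; return ST_PASS; }
static enum status setup_skip(void **ctx) { (void)ctx; return ST_SKIP; }
static enum status run_fail(void *ctx) { return ctx == &shared ? ST_FAIL : ST_PASS; }
static void count_cleanup(void *ctx) { (void)ctx; cleanup_calls++; }

static const struct test fails_in_run = { setup_ok, run_fail, count_cleanup };
static const struct test skipped_in_setup = { setup_skip, run_fail, count_cleanup };
```

Calling `run_one(&fails_in_run)` increments `cleanup_calls`, while `run_one(&skipped_in_setup)` leaves it untouched, so a test that allocates resources in `setup()` has to release them itself before returning a non-pass status.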

+40

tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates the behavior of direct dispatching with a default 4 + * select_cpu implementation. 5 + * 6 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 7 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 8 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 9 + */ 10 + 11 + #include <scx/common.bpf.h> 12 + 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + bool saw_local = false; 16 + 17 + static bool task_is_test(const struct task_struct *p) 18 + { 19 + return !bpf_strncmp(p->comm, 9, "select_cpu"); 20 + } 21 + 22 + void BPF_STRUCT_OPS(select_cpu_dfl_enqueue, struct task_struct *p, 23 + u64 enq_flags) 24 + { 25 + const struct cpumask *idle_mask = scx_bpf_get_idle_cpumask(); 26 + 27 + if (task_is_test(p) && 28 + bpf_cpumask_test_cpu(scx_bpf_task_cpu(p), idle_mask)) { 29 + saw_local = true; 30 + } 31 + scx_bpf_put_idle_cpumask(idle_mask); 32 + 33 + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); 34 + } 35 + 36 + SEC(".struct_ops.link") 37 + struct sched_ext_ops select_cpu_dfl_ops = { 38 + .enqueue = select_cpu_dfl_enqueue, 39 + .name = "select_cpu_dfl", 40 + };

+72

tools/testing/selftests/sched_ext/select_cpu_dfl.c

+ /* SPDX-License-Identifier: GPL-2.0 */
+ /*
+  * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
+  * Copyright (c) 2023 David Vernet <dvernet@meta.com>
+  * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
+  */
+ #include <bpf/bpf.h>
+ #include <scx/common.h>
+ #include <sys/wait.h>
+ #include <unistd.h>
+ #include "select_cpu_dfl.bpf.skel.h"
+ #include "scx_test.h"
+
+ #define NUM_CHILDREN 1028
+
+ static enum scx_test_status setup(void **ctx)
+ {
+ 	struct select_cpu_dfl *skel;
+
+ 	skel = select_cpu_dfl__open_and_load();
+ 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
+ 	*ctx = skel;
+
+ 	return SCX_TEST_PASS;
+ }
+
+ static enum scx_test_status run(void *ctx)
+ {
+ 	struct select_cpu_dfl *skel = ctx;
+ 	struct bpf_link *link;
+ 	pid_t pids[NUM_CHILDREN];
+ 	int i, status;
+
+ 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_ops);
+ 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ 	for (i = 0; i < NUM_CHILDREN; i++) {
+ 		pids[i] = fork();
+ 		if (pids[i] == 0) {
+ 			sleep(1);
+ 			exit(0);
+ 		}
+ 	}
+
+ 	for (i = 0; i < NUM_CHILDREN; i++) {
+ 		SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
+ 		SCX_EQ(status, 0);
+ 	}
+
+ 	SCX_ASSERT(!skel->bss->saw_local);
+
+ 	bpf_link__destroy(link);
+
+ 	return SCX_TEST_PASS;
+ }
+
+ static void cleanup(void *ctx)
+ {
+ 	struct select_cpu_dfl *skel = ctx;
+
+ 	select_cpu_dfl__destroy(skel);
+ }
+
+ struct scx_test select_cpu_dfl = {
+ 	.name = "select_cpu_dfl",
+ 	.description = "Verify the default ops.select_cpu() dispatches tasks "
+ 		       "when idle cores are found, and skips ops.enqueue()",
+ 	.setup = setup,
+ 	.run = run,
+ 	.cleanup = cleanup,
+ };
+ REGISTER_SCX_TEST(&select_cpu_dfl)

+89

tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates the behavior of direct dispatching with a default 4 + * select_cpu implementation, and with the SCX_OPS_ENQ_DFL_NO_DISPATCH ops flag 5 + * specified. 6 + * 7 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 8 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 9 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 10 + */ 11 + 12 + #include <scx/common.bpf.h> 13 + 14 + char _license[] SEC("license") = "GPL"; 15 + 16 + bool saw_local = false; 17 + 18 + /* Per-task scheduling context */ 19 + struct task_ctx { 20 + bool force_local; /* CPU changed by ops.select_cpu() */ 21 + }; 22 + 23 + struct { 24 + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); 25 + __uint(map_flags, BPF_F_NO_PREALLOC); 26 + __type(key, int); 27 + __type(value, struct task_ctx); 28 + } task_ctx_stor SEC(".maps"); 29 + 30 + /* Manually specify the signature until the kfunc is added to the scx repo. */ 31 + s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, 32 + bool *found) __ksym; 33 + 34 + s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_select_cpu, struct task_struct *p, 35 + s32 prev_cpu, u64 wake_flags) 36 + { 37 + struct task_ctx *tctx; 38 + s32 cpu; 39 + 40 + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); 41 + if (!tctx) { 42 + scx_bpf_error("task_ctx lookup failed"); 43 + return -ESRCH; 44 + } 45 + 46 + cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, 47 + &tctx->force_local); 48 + 49 + return cpu; 50 + } 51 + 52 + void BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_enqueue, struct task_struct *p, 53 + u64 enq_flags) 54 + { 55 + u64 dsq_id = SCX_DSQ_GLOBAL; 56 + struct task_ctx *tctx; 57 + 58 + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0); 59 + if (!tctx) { 60 + scx_bpf_error("task_ctx lookup failed"); 61 + return; 62 + } 63 + 64 + if (tctx->force_local) { 65 + dsq_id = SCX_DSQ_LOCAL; 66 + tctx->force_local = false; 67 + saw_local = true; 68 + } 69 + 70 + 
scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags); 71 + } 72 + 73 + s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_init_task, 74 + struct task_struct *p, struct scx_init_task_args *args) 75 + { 76 + if (bpf_task_storage_get(&task_ctx_stor, p, 0, 77 + BPF_LOCAL_STORAGE_GET_F_CREATE)) 78 + return 0; 79 + else 80 + return -ENOMEM; 81 + } 82 + 83 + SEC(".struct_ops.link") 84 + struct sched_ext_ops select_cpu_dfl_nodispatch_ops = { 85 + .select_cpu = select_cpu_dfl_nodispatch_select_cpu, 86 + .enqueue = select_cpu_dfl_nodispatch_enqueue, 87 + .init_task = select_cpu_dfl_nodispatch_init_task, 88 + .name = "select_cpu_dfl_nodispatch", 89 + };

+72

tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 5 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 6 + */ 7 + #include <bpf/bpf.h> 8 + #include <scx/common.h> 9 + #include <sys/wait.h> 10 + #include <unistd.h> 11 + #include "select_cpu_dfl_nodispatch.bpf.skel.h" 12 + #include "scx_test.h" 13 + 14 + #define NUM_CHILDREN 1028 15 + 16 + static enum scx_test_status setup(void **ctx) 17 + { 18 + struct select_cpu_dfl_nodispatch *skel; 19 + 20 + skel = select_cpu_dfl_nodispatch__open_and_load(); 21 + SCX_FAIL_IF(!skel, "Failed to open and load skel"); 22 + *ctx = skel; 23 + 24 + return SCX_TEST_PASS; 25 + } 26 + 27 + static enum scx_test_status run(void *ctx) 28 + { 29 + struct select_cpu_dfl_nodispatch *skel = ctx; 30 + struct bpf_link *link; 31 + pid_t pids[NUM_CHILDREN]; 32 + int i, status; 33 + 34 + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_nodispatch_ops); 35 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 36 + 37 + for (i = 0; i < NUM_CHILDREN; i++) { 38 + pids[i] = fork(); 39 + if (pids[i] == 0) { 40 + sleep(1); 41 + exit(0); 42 + } 43 + } 44 + 45 + for (i = 0; i < NUM_CHILDREN; i++) { 46 + SCX_EQ(waitpid(pids[i], &status, 0), pids[i]); 47 + SCX_EQ(status, 0); 48 + } 49 + 50 + SCX_ASSERT(skel->bss->saw_local); 51 + 52 + bpf_link__destroy(link); 53 + 54 + return SCX_TEST_PASS; 55 + } 56 + 57 + static void cleanup(void *ctx) 58 + { 59 + struct select_cpu_dfl_nodispatch *skel = ctx; 60 + 61 + select_cpu_dfl_nodispatch__destroy(skel); 62 + } 63 + 64 + struct scx_test select_cpu_dfl_nodispatch = { 65 + .name = "select_cpu_dfl_nodispatch", 66 + .description = "Verify behavior of scx_bpf_select_cpu_dfl() in " 67 + "ops.select_cpu()", 68 + .setup = setup, 69 + .run = run, 70 + .cleanup = cleanup, 71 + }; 72 + REGISTER_SCX_TEST(&select_cpu_dfl_nodispatch)

+41

tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates the behavior of direct dispatching with a default 4 + * select_cpu implementation. 5 + * 6 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 7 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 8 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 9 + */ 10 + 11 + #include <scx/common.bpf.h> 12 + 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + s32 BPF_STRUCT_OPS(select_cpu_dispatch_select_cpu, struct task_struct *p, 16 + s32 prev_cpu, u64 wake_flags) 17 + { 18 + u64 dsq_id = SCX_DSQ_LOCAL; 19 + s32 cpu = prev_cpu; 20 + 21 + if (scx_bpf_test_and_clear_cpu_idle(cpu)) 22 + goto dispatch; 23 + 24 + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); 25 + if (cpu >= 0) 26 + goto dispatch; 27 + 28 + dsq_id = SCX_DSQ_GLOBAL; 29 + cpu = prev_cpu; 30 + 31 + dispatch: 32 + scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, 0); 33 + return cpu; 34 + } 35 + 36 + SEC(".struct_ops.link") 37 + struct sched_ext_ops select_cpu_dispatch_ops = { 38 + .select_cpu = select_cpu_dispatch_select_cpu, 39 + .name = "select_cpu_dispatch", 40 + .timeout_ms = 1000U, 41 + };

+70

tools/testing/selftests/sched_ext/select_cpu_dispatch.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 5 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 6 + */ 7 + #include <bpf/bpf.h> 8 + #include <scx/common.h> 9 + #include <sys/wait.h> 10 + #include <unistd.h> 11 + #include "select_cpu_dispatch.bpf.skel.h" 12 + #include "scx_test.h" 13 + 14 + #define NUM_CHILDREN 1028 15 + 16 + static enum scx_test_status setup(void **ctx) 17 + { 18 + struct select_cpu_dispatch *skel; 19 + 20 + skel = select_cpu_dispatch__open_and_load(); 21 + SCX_FAIL_IF(!skel, "Failed to open and load skel"); 22 + *ctx = skel; 23 + 24 + return SCX_TEST_PASS; 25 + } 26 + 27 + static enum scx_test_status run(void *ctx) 28 + { 29 + struct select_cpu_dispatch *skel = ctx; 30 + struct bpf_link *link; 31 + pid_t pids[NUM_CHILDREN]; 32 + int i, status; 33 + 34 + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_ops); 35 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 36 + 37 + for (i = 0; i < NUM_CHILDREN; i++) { 38 + pids[i] = fork(); 39 + if (pids[i] == 0) { 40 + sleep(1); 41 + exit(0); 42 + } 43 + } 44 + 45 + for (i = 0; i < NUM_CHILDREN; i++) { 46 + SCX_EQ(waitpid(pids[i], &status, 0), pids[i]); 47 + SCX_EQ(status, 0); 48 + } 49 + 50 + bpf_link__destroy(link); 51 + 52 + return SCX_TEST_PASS; 53 + } 54 + 55 + static void cleanup(void *ctx) 56 + { 57 + struct select_cpu_dispatch *skel = ctx; 58 + 59 + select_cpu_dispatch__destroy(skel); 60 + } 61 + 62 + struct scx_test select_cpu_dispatch = { 63 + .name = "select_cpu_dispatch", 64 + .description = "Test direct dispatching to built-in DSQs from " 65 + "ops.select_cpu()", 66 + .setup = setup, 67 + .run = run, 68 + .cleanup = cleanup, 69 + }; 70 + REGISTER_SCX_TEST(&select_cpu_dispatch)

+37

tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates the behavior of direct dispatching with a default 4 + * select_cpu implementation. 5 + * 6 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 7 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 8 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 9 + */ 10 + 11 + #include <scx/common.bpf.h> 12 + 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + UEI_DEFINE(uei); 16 + 17 + s32 BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_select_cpu, struct task_struct *p, 18 + s32 prev_cpu, u64 wake_flags) 19 + { 20 + /* Dispatching to a random DSQ should fail. */ 21 + scx_bpf_dispatch(p, 0xcafef00d, SCX_SLICE_DFL, 0); 22 + 23 + return prev_cpu; 24 + } 25 + 26 + void BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_exit, struct scx_exit_info *ei) 27 + { 28 + UEI_RECORD(uei, ei); 29 + } 30 + 31 + SEC(".struct_ops.link") 32 + struct sched_ext_ops select_cpu_dispatch_bad_dsq_ops = { 33 + .select_cpu = select_cpu_dispatch_bad_dsq_select_cpu, 34 + .exit = select_cpu_dispatch_bad_dsq_exit, 35 + .name = "select_cpu_dispatch_bad_dsq", 36 + .timeout_ms = 1000U, 37 + };

+56

tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 5 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 6 + */ 7 + #include <bpf/bpf.h> 8 + #include <scx/common.h> 9 + #include <sys/wait.h> 10 + #include <unistd.h> 11 + #include "select_cpu_dispatch_bad_dsq.bpf.skel.h" 12 + #include "scx_test.h" 13 + 14 + static enum scx_test_status setup(void **ctx) 15 + { 16 + struct select_cpu_dispatch_bad_dsq *skel; 17 + 18 + skel = select_cpu_dispatch_bad_dsq__open_and_load(); 19 + SCX_FAIL_IF(!skel, "Failed to open and load skel"); 20 + *ctx = skel; 21 + 22 + return SCX_TEST_PASS; 23 + } 24 + 25 + static enum scx_test_status run(void *ctx) 26 + { 27 + struct select_cpu_dispatch_bad_dsq *skel = ctx; 28 + struct bpf_link *link; 29 + 30 + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_bad_dsq_ops); 31 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 32 + 33 + sleep(1); 34 + 35 + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); 36 + bpf_link__destroy(link); 37 + 38 + return SCX_TEST_PASS; 39 + } 40 + 41 + static void cleanup(void *ctx) 42 + { 43 + struct select_cpu_dispatch_bad_dsq *skel = ctx; 44 + 45 + select_cpu_dispatch_bad_dsq__destroy(skel); 46 + } 47 + 48 + struct scx_test select_cpu_dispatch_bad_dsq = { 49 + .name = "select_cpu_dispatch_bad_dsq", 50 + .description = "Verify graceful failure if we direct-dispatch to a " 51 + "bogus DSQ in ops.select_cpu()", 52 + .setup = setup, 53 + .run = run, 54 + .cleanup = cleanup, 55 + }; 56 + REGISTER_SCX_TEST(&select_cpu_dispatch_bad_dsq)

+38

tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates the behavior of direct dispatching with a default 4 + * select_cpu implementation. 5 + * 6 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 7 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 8 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 9 + */ 10 + 11 + #include <scx/common.bpf.h> 12 + 13 + char _license[] SEC("license") = "GPL"; 14 + 15 + UEI_DEFINE(uei); 16 + 17 + s32 BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_select_cpu, struct task_struct *p, 18 + s32 prev_cpu, u64 wake_flags) 19 + { 20 + /* Dispatching twice in a row is disallowed. */ 21 + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); 22 + scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); 23 + 24 + return prev_cpu; 25 + } 26 + 27 + void BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_exit, struct scx_exit_info *ei) 28 + { 29 + UEI_RECORD(uei, ei); 30 + } 31 + 32 + SEC(".struct_ops.link") 33 + struct sched_ext_ops select_cpu_dispatch_dbl_dsp_ops = { 34 + .select_cpu = select_cpu_dispatch_dbl_dsp_select_cpu, 35 + .exit = select_cpu_dispatch_dbl_dsp_exit, 36 + .name = "select_cpu_dispatch_dbl_dsp", 37 + .timeout_ms = 1000U, 38 + };

+56

tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2023 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2023 David Vernet <dvernet@meta.com> 5 + * Copyright (c) 2023 Tejun Heo <tj@kernel.org> 6 + */ 7 + #include <bpf/bpf.h> 8 + #include <scx/common.h> 9 + #include <sys/wait.h> 10 + #include <unistd.h> 11 + #include "select_cpu_dispatch_dbl_dsp.bpf.skel.h" 12 + #include "scx_test.h" 13 + 14 + static enum scx_test_status setup(void **ctx) 15 + { 16 + struct select_cpu_dispatch_dbl_dsp *skel; 17 + 18 + skel = select_cpu_dispatch_dbl_dsp__open_and_load(); 19 + SCX_FAIL_IF(!skel, "Failed to open and load skel"); 20 + *ctx = skel; 21 + 22 + return SCX_TEST_PASS; 23 + } 24 + 25 + static enum scx_test_status run(void *ctx) 26 + { 27 + struct select_cpu_dispatch_dbl_dsp *skel = ctx; 28 + struct bpf_link *link; 29 + 30 + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_dbl_dsp_ops); 31 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 32 + 33 + sleep(1); 34 + 35 + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR)); 36 + bpf_link__destroy(link); 37 + 38 + return SCX_TEST_PASS; 39 + } 40 + 41 + static void cleanup(void *ctx) 42 + { 43 + struct select_cpu_dispatch_dbl_dsp *skel = ctx; 44 + 45 + select_cpu_dispatch_dbl_dsp__destroy(skel); 46 + } 47 + 48 + struct scx_test select_cpu_dispatch_dbl_dsp = { 49 + .name = "select_cpu_dispatch_dbl_dsp", 50 + .description = "Verify graceful failure if we dispatch twice to a " 51 + "DSQ in ops.select_cpu()", 52 + .setup = setup, 53 + .run = run, 54 + .cleanup = cleanup, 55 + }; 56 + REGISTER_SCX_TEST(&select_cpu_dispatch_dbl_dsp)

+92

tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * A scheduler that validates that enqueue flags are properly stored and 4 + * applied at dispatch time when a task is directly dispatched from 5 + * ops.select_cpu(). We validate this by using scx_bpf_dispatch_vtime(), and 6 + * making the test a very basic vtime scheduler. 7 + * 8 + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. 9 + * Copyright (c) 2024 David Vernet <dvernet@meta.com> 10 + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> 11 + */ 12 + 13 + #include <scx/common.bpf.h> 14 + 15 + char _license[] SEC("license") = "GPL"; 16 + 17 + volatile bool consumed; 18 + 19 + static u64 vtime_now; 20 + 21 + #define VTIME_DSQ 0 22 + 23 + static inline bool vtime_before(u64 a, u64 b) 24 + { 25 + return (s64)(a - b) < 0; 26 + } 27 + 28 + static inline u64 task_vtime(const struct task_struct *p) 29 + { 30 + u64 vtime = p->scx.dsq_vtime; 31 + 32 + if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL)) 33 + return vtime_now - SCX_SLICE_DFL; 34 + else 35 + return vtime; 36 + } 37 + 38 + s32 BPF_STRUCT_OPS(select_cpu_vtime_select_cpu, struct task_struct *p, 39 + s32 prev_cpu, u64 wake_flags) 40 + { 41 + s32 cpu; 42 + 43 + cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0); 44 + if (cpu >= 0) 45 + goto ddsp; 46 + 47 + cpu = prev_cpu; 48 + scx_bpf_test_and_clear_cpu_idle(cpu); 49 + ddsp: 50 + scx_bpf_dispatch_vtime(p, VTIME_DSQ, SCX_SLICE_DFL, task_vtime(p), 0); 51 + return cpu; 52 + } 53 + 54 + void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p) 55 + { 56 + if (scx_bpf_consume(VTIME_DSQ)) 57 + consumed = true; 58 + } 59 + 60 + void BPF_STRUCT_OPS(select_cpu_vtime_running, struct task_struct *p) 61 + { 62 + if (vtime_before(vtime_now, p->scx.dsq_vtime)) 63 + vtime_now = p->scx.dsq_vtime; 64 + } 65 + 66 + void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p, 67 + bool runnable) 68 + { 69 + p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight; 70 + } 71 + 72 
+ void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p) 73 + { 74 + p->scx.dsq_vtime = vtime_now; 75 + } 76 + 77 + s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init) 78 + { 79 + return scx_bpf_create_dsq(VTIME_DSQ, -1); 80 + } 81 + 82 + SEC(".struct_ops.link") 83 + struct sched_ext_ops select_cpu_vtime_ops = { 84 + .select_cpu = select_cpu_vtime_select_cpu, 85 + .dispatch = select_cpu_vtime_dispatch, 86 + .running = select_cpu_vtime_running, 87 + .stopping = select_cpu_vtime_stopping, 88 + .enable = select_cpu_vtime_enable, 89 + .init = select_cpu_vtime_init, 90 + .name = "select_cpu_vtime", 91 + .timeout_ms = 1000U, 92 + };
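The `vtime_before()` helper above compares two unsigned 64-bit virtual times through a signed difference, which keeps the ordering correct when the counter wraps around. `task_vtime()` then limits how far behind the current time a task's vtime may lag. A standalone userspace sketch of both ideas follows. `clamp_vtime` and its parameters are hypothetical names for this illustration, and the cast from `uint64_t` to `int64_t` is assumed to wrap in two's complement, as it does on mainstream compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a is earlier than b, treating the 64-bit counter as circular.
 * The unsigned subtraction wraps; reinterpreting the result as signed
 * yields a negative value when a is "behind" b by less than 2^63.
 */
static bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Hypothetical equivalent of task_vtime(): a task may lag the current
 * time by at most one slice, so long sleepers do not bank unlimited credit.
 */
static uint64_t clamp_vtime(uint64_t vtime, uint64_t now, uint64_t slice)
{
	if (vtime_before(vtime, now - slice))
		return now - slice;
	return vtime;
}
```

A plain `a < b` would report `UINT64_MAX` as later than `0`, whereas `vtime_before(UINT64_MAX, 0)` treats it as one tick earlier, which is the intended reading once the counter has wrapped.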

+59

tools/testing/selftests/sched_ext/select_cpu_vtime.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright (c) 2024 Meta Platforms, Inc. and affiliates. 4 + * Copyright (c) 2024 David Vernet <dvernet@meta.com> 5 + * Copyright (c) 2024 Tejun Heo <tj@kernel.org> 6 + */ 7 + #include <bpf/bpf.h> 8 + #include <scx/common.h> 9 + #include <sys/wait.h> 10 + #include <unistd.h> 11 + #include "select_cpu_vtime.bpf.skel.h" 12 + #include "scx_test.h" 13 + 14 + static enum scx_test_status setup(void **ctx) 15 + { 16 + struct select_cpu_vtime *skel; 17 + 18 + skel = select_cpu_vtime__open_and_load(); 19 + SCX_FAIL_IF(!skel, "Failed to open and load skel"); 20 + *ctx = skel; 21 + 22 + return SCX_TEST_PASS; 23 + } 24 + 25 + static enum scx_test_status run(void *ctx) 26 + { 27 + struct select_cpu_vtime *skel = ctx; 28 + struct bpf_link *link; 29 + 30 + SCX_ASSERT(!skel->bss->consumed); 31 + 32 + link = bpf_map__attach_struct_ops(skel->maps.select_cpu_vtime_ops); 33 + SCX_FAIL_IF(!link, "Failed to attach scheduler"); 34 + 35 + sleep(1); 36 + 37 + SCX_ASSERT(skel->bss->consumed); 38 + 39 + bpf_link__destroy(link); 40 + 41 + return SCX_TEST_PASS; 42 + } 43 + 44 + static void cleanup(void *ctx) 45 + { 46 + struct select_cpu_vtime *skel = ctx; 47 + 48 + select_cpu_vtime__destroy(skel); 49 + } 50 + 51 + struct scx_test select_cpu_vtime = { 52 + .name = "select_cpu_vtime", 53 + .description = "Test doing direct vtime-dispatching from " 54 + "ops.select_cpu(), to a non-built-in DSQ", 55 + .setup = setup, 56 + .run = run, 57 + .cleanup = cleanup, 58 + }; 59 + REGISTER_SCX_TEST(&select_cpu_vtime)

+49

tools/testing/selftests/sched_ext/test_example.c

+ /* SPDX-License-Identifier: GPL-2.0 */
+ /*
+  * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+  * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
+  * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+  */
+ #include <bpf/bpf.h>
+ #include <scx/common.h>
+ #include "scx_test.h"
+
+ static bool setup_called = false;
+ static bool run_called = false;
+ static bool cleanup_called = false;
+
+ static int context = 10;
+
+ static enum scx_test_status setup(void **ctx)
+ {
+ 	setup_called = true;
+ 	*ctx = &context;
+
+ 	return SCX_TEST_PASS;
+ }
+
+ static enum scx_test_status run(void *ctx)
+ {
+ 	int *arg = ctx;
+
+ 	SCX_ASSERT(setup_called);
+ 	SCX_ASSERT(!run_called && !cleanup_called);
+ 	SCX_EQ(*arg, context);
+
+ 	run_called = true;
+ 	return SCX_TEST_PASS;
+ }
+
+ static void cleanup(void *ctx)
+ {
+ 	SCX_BUG_ON(!run_called || cleanup_called, "Wrong callbacks invoked");
+ }
+
+ struct scx_test example = {
+ 	.name = "example",
+ 	.description = "Validate the basic function of the test suite itself",
+ 	.setup = setup,
+ 	.run = run,
+ 	.cleanup = cleanup,
+ };
+ REGISTER_SCX_TEST(&example)

+71

tools/testing/selftests/sched_ext/util.c

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+/* Returns read len on success, or -errno on failure. */
+static ssize_t read_text(const char *path, char *buf, size_t max_len)
+{
+	ssize_t len;
+	int fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return -errno;
+
+	len = read(fd, buf, max_len - 1);
+
+	if (len >= 0)
+		buf[len] = 0;
+
+	close(fd);
+	return len < 0 ? -errno : len;
+}
+
+/* Returns written len on success, or -errno on failure. */
+static ssize_t write_text(const char *path, char *buf, ssize_t len)
+{
+	int fd;
+	ssize_t written;
+
+	fd = open(path, O_WRONLY | O_APPEND);
+	if (fd < 0)
+		return -errno;
+
+	written = write(fd, buf, len);
+	close(fd);
+	return written < 0 ? -errno : written;
+}
+
+long file_read_long(const char *path)
+{
+	char buf[128];
+
+
+	if (read_text(path, buf, sizeof(buf)) <= 0)
+		return -1;
+
+	return atol(buf);
+}
+
+int file_write_long(const char *path, long val)
+{
+	char buf[64];
+	int ret;
+
+	ret = sprintf(buf, "%ld", val);
+	if (ret < 0)
+		return ret;
+
+	if (write_text(path, buf, ret) <= 0)
+		return -1;
+
+	return 0;
+}

+13

tools/testing/selftests/sched_ext/util.h

+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <void@manifault.com>
+ */
+
+#ifndef __SCX_TEST_UTIL_H__
+#define __SCX_TEST_UTIL_H__
+
+long file_read_long(const char *path);
+int file_write_long(const char *path, long val);
+
+#endif // __SCX_TEST_UTIL_H__
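
The two helpers declared above read and write a `long` as decimal text in a file. As a rough, self-contained sketch of that contract, the code below round-trips a value through a temporary file. The `demo_` functions are hypothetical re-implementations written for this illustration, not the selftest code. They keep the same return conventions (0 or -1 for the writer, the parsed value or -1 for the reader), and the temporary path template is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Append val as decimal text to an existing file.
 * Returns 0 on success, -1 on failure. Only the formatted bytes are
 * written, so no trailing buffer contents reach the file.
 */
static int demo_write_long(const char *path, long val)
{
	char buf[64];
	ssize_t written;
	int len, fd;

	len = snprintf(buf, sizeof(buf), "%ld", val);
	/* A long always fits in 64 bytes of decimal text. */
	assert(len > 0 && (size_t)len < sizeof(buf));

	/* No O_CREAT: like a sysfs or procfs knob, the file must exist. */
	fd = open(path, O_WRONLY | O_APPEND);
	if (fd < 0)
		return -1;

	written = write(fd, buf, len);
	close(fd);
	return written == len ? 0 : -1;
}

/* Parse the leading decimal number in a file. Returns -1 on failure. */
static long demo_read_long(const char *path)
{
	char buf[128];
	ssize_t len;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	len = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (len <= 0)
		return -1;

	buf[len] = '\0';
	return atol(buf);
}

/* Round-trip val through a fresh temporary file; -1 signals an error. */
long demo_roundtrip(long val)
{
	char path[] = "/tmp/scx_demo_XXXXXX";	/* hypothetical template */
	long out = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	close(fd);

	if (demo_write_long(path, val) == 0)
		out = demo_read_long(path);

	unlink(path);
	return out;
}
```

Because -1 doubles as the error value, a caller cannot distinguish a stored -1 from a failed read. That is acceptable for knobs that hold non-negative values but worth keeping in mind elsewhere.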
