.. _sched-ext:

==========================
Extensible Scheduler Class
==========================

sched_ext is a scheduler class whose behavior can be defined by a set of BPF
programs - the BPF scheduler.

* sched_ext exports a full scheduling interface so that any scheduling
  algorithm can be implemented on top.

* The BPF scheduler can group CPUs however it sees fit and schedule them
  together, as tasks aren't tied to specific CPUs at the time of wakeup.

* The BPF scheduler can be turned on and off dynamically anytime.

* The system integrity is maintained no matter what the BPF scheduler does.
  The default scheduling behavior is restored anytime an error is detected,
  a runnable task stalls, or on invoking the SysRq key sequence
  `SysRq-S`.

* When the BPF scheduler triggers an error, debug information is dumped to
  aid debugging. The debug dump is passed to and printed out by the
  scheduler binary. The debug dump can also be accessed through the
  `sched_ext_dump` tracepoint. The SysRq key sequence `SysRq-D`
  triggers a debug dump. This doesn't terminate the BPF scheduler and can
  only be read through the tracepoint.

Switching to and from sched_ext
===============================

``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
``tools/sched_ext`` contains the example schedulers. The following config
options should be enabled to use sched_ext:

.. code-block:: none

    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y

sched_ext is used only when the BPF scheduler is loaded and running.

If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
treated as ``SCHED_NORMAL`` and scheduled by the fair-class scheduler until the
BPF scheduler is loaded.

When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
``SCHED_EXT`` tasks are scheduled by sched_ext.

However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
``SCHED_IDLE`` policies are scheduled by the fair-class scheduler, which has
higher sched_class precedence than ``SCHED_EXT``.

Terminating the sched_ext scheduler program, triggering `SysRq-S`, or
detection of any internal error including stalled runnable tasks aborts the
BPF scheduler and reverts all tasks back to the fair-class scheduler.

.. code-block:: none

    # make -j16 -C tools/sched_ext
    # tools/sched_ext/build/bin/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered

The current status of the BPF scheduler can be determined as follows:

.. code-block:: none

    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple

You can check if any BPF scheduler has ever been loaded since boot by examining
this monotonically incrementing counter (a value of zero indicates that no BPF
scheduler has been loaded):

.. code-block:: none

    # cat /sys/kernel/sched_ext/enable_seq
    1

Each running scheduler also exposes a per-scheduler ``events`` file under
``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic
counters. Each counter occupies one ``name value`` line:

.. code-block:: none

    # cat /sys/kernel/sched_ext/simple/events
    SCX_EV_SELECT_CPU_FALLBACK 0
    SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0
    SCX_EV_DISPATCH_KEEP_LAST 123
    SCX_EV_ENQ_SKIP_EXITING 0
    SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0
    SCX_EV_REENQ_IMMED 0
    SCX_EV_REENQ_LOCAL_REPEAT 0
    SCX_EV_REFILL_SLICE_DFL 456789
    SCX_EV_BYPASS_DURATION 0
    SCX_EV_BYPASS_DISPATCH 0
    SCX_EV_BYPASS_ACTIVATE 0
    SCX_EV_INSERT_NOT_OWNED 0
    SCX_EV_SUB_BYPASS_DISPATCH 0

The counters are described in ``kernel/sched/ext_internal.h``; briefly:

* ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by
  the task and the core scheduler silently picked a fallback CPU.
* ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected
  to the global DSQ because the target CPU went offline.
* ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other
  task was available (only when ``SCX_OPS_ENQ_LAST`` is not set).
* ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ
  directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING`` is not
  set).
* ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was
  dispatched to its local DSQ directly (only when
  ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set).
* ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was
  re-enqueued because the target CPU was not available for immediate
  execution.
* ``SCX_EV_REENQ_LOCAL_REPEAT``: a re-enqueue of the local DSQ triggered
  another re-enqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ``
  handling in the BPF scheduler.
* ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the
  default value (``SCX_SLICE_DFL``).
* ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode.
* ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode.
* ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated.
* ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this
  scheduler into a DSQ; such attempts are silently ignored.
* ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass
  DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``).

``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
detailed information:

.. code-block:: none

    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0
    enable_seq    : 1

Whether a given task is on sched_ext can be determined as follows:

.. code-block:: none

    # grep ext /proc/self/sched
    ext.enabled : 1

The Basics
==========

Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
programs that implement ``struct sched_ext_ops``. The only mandatory field
is ``ops.name`` which must be a valid BPF object name. All operations are
optional. The following modified excerpt is from
``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.

.. code-block:: c

    /*
     * Decide which CPU a task should be migrated to before being
     * enqueued (either at wakeup, fork time, or exec time). If an
     * idle core is found by the default ops.select_cpu() implementation,
     * then insert the task directly into SCX_DSQ_LOCAL and skip the
     * ops.enqueue() callback.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.select_cpu implementation. The behavior of the scheduler
     * would be exactly the same if the implementation just didn't define
     * the simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
        s32 cpu;
        /* Need to initialize or the BPF verifier will reject the program */
        bool direct = false;

        cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);

        if (direct)
            scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

        return cpu;
    }

    /*
     * Do a direct insertion of a task to the global DSQ. This ops.enqueue()
     * callback will only be invoked if we failed to find a core to insert
     * into in ops.select_cpu() above.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.enqueue implementation, which just dispatches the task
     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly
     * the same if the implementation just didn't define the simple_enqueue
     * struct_ops prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
        scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
        /*
         * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
         * SCHED_BATCH tasks should use sched_ext.
         */
        return 0;
    }

    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
    {
        exit_type = ei->type;
    }

    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
        .select_cpu = (void *)simple_select_cpu,
        .enqueue    = (void *)simple_enqueue,
        .init       = (void *)simple_init,
        .exit       = (void *)simple_exit,
        .name       = "simple",
    };

Dispatch Queues
---------------

To match the impedance between the scheduler core and the BPF scheduler,
sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
``scx_bpf_destroy_dsq()``.

A CPU always executes a task from its local DSQ. A task is "inserted" into a
DSQ. A task in a non-local DSQ is "moved" into the target CPU's local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to move a task
from the global DSQ. If that doesn't yield a runnable task either,
``ops.dispatch()`` is invoked.

Scheduling Cycle
----------------

The following briefly shows how a waking task is scheduled and executed.

1. When a task is waking up, ``ops.select_cpu()`` is the first operation
   invoked. This serves two purposes. First, it provides a CPU selection
   optimization hint. Second, it wakes up the selected CPU if idle.

   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
   binding. The actual decision is made at the last step of scheduling.
   However, there is a small performance gain if the CPU
   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.

   A side-effect of selecting a CPU is waking it up from idle. While a BPF
   scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

   Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.

   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
   by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.

   If the task is inserted into ``SCX_DSQ_LOCAL`` from
   ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
   is returned from ``ops.select_cpu()``. Additionally, inserting directly
   from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
   be skipped.

   Any other attempt to store a task in BPF-internal data structures from
   ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
   invoked. This is discouraged, as it can introduce racy behavior or
   inconsistent state.

2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
   task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
   can make one of the following decisions:

   * Immediately insert the task into either the global or a local DSQ by
     calling ``scx_bpf_dsq_insert()`` with one of the following options:
     ``SCX_DSQ_GLOBAL``, ``SCX_DSQ_LOCAL``, or ``SCX_DSQ_LOCAL_ON | cpu``.

   * Immediately insert the task into a custom DSQ by calling
     ``scx_bpf_dsq_insert()`` with a DSQ ID which is smaller than 2^63.

   * Queue the task on the BPF side.

   **Task State Tracking and ops.dequeue() Semantics**

   A task is in the "BPF scheduler's custody" when the BPF scheduler is
   responsible for managing its lifecycle. A task enters custody when it is
   dispatched to a user DSQ or stored in the BPF scheduler's internal data
   structures. Custody is entered only from ``ops.enqueue()`` for those
   operations. The only exception is dispatching to a user DSQ from
   ``ops.select_cpu()``: although the task is not yet technically in BPF
   scheduler custody at that point, the dispatch has the same semantic
   effect as dispatching from ``ops.enqueue()`` for custody-related
   purposes.

   Once ``ops.enqueue()`` is called, the task may or may not enter custody
   depending on what the scheduler does:

   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
     is done with the task - it either goes straight to a CPU's local run
     queue or to the global DSQ as a fallback. The task never enters (or
     exits) BPF custody, and ``ops.dequeue()`` will not be called.

   * **Dispatched to user-created DSQs** (custom DSQs): the task enters the
     BPF scheduler's custody. When the task later leaves BPF custody
     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
     sleep/property changes), ``ops.dequeue()`` will be called exactly
     once.

   * **Stored in BPF data structures** (e.g., internal BPF queues): the
     task is in BPF custody. ``ops.dequeue()`` will be called when it
     leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
     on property change / sleep).

   When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
   The dequeue can happen for different reasons, distinguished by flags:

   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
      execution), ``ops.dequeue()`` is triggered without any special flags.

   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
      core scheduling picks a task for execution while it's still in BPF
      custody, ``ops.dequeue()`` is called with the
      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.

   3. **Scheduling property change**: when a task property changes (via
      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
      priority changes, CPU migrations, etc.) while the task is still in
      BPF custody, ``ops.dequeue()`` is called with the
      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.

   **Important**: Once a task has left BPF custody (e.g., after being
   dispatched to a terminal DSQ), property changes will not trigger
   ``ops.dequeue()``, since the task is no longer managed by the BPF
   scheduler.

3. When a CPU is ready to schedule, it first looks at its local DSQ. If
   empty, it then looks at the global DSQ. If there still isn't a task to
   run, ``ops.dispatch()`` is invoked, which can use the following two
   functions to populate the local DSQ.

   * ``scx_bpf_dsq_insert()`` inserts a task to a DSQ. Any target DSQ can be
     used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dsq_insert()``
     currently can't be called with BPF locks held, this is being worked on
     and will be supported. ``scx_bpf_dsq_insert()`` schedules insertions
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.

   * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified
     non-local DSQ to the dispatching CPU's local DSQ. This function cannot
     be called with any BPF locks held. ``scx_bpf_dsq_move_to_local()``
     flushes the pending insertions before trying to move from the
     specified DSQ.

4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
   the CPU runs the first one. If empty, the following steps are taken:

   * Try to move from the global DSQ. If successful, run the task.

   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.

   * If the previous task is an SCX task and still runnable, keep executing
     it (see ``SCX_OPS_ENQ_LAST``).

   * Go idle.

Note that the BPF scheduler can always choose to dispatch tasks immediately
in ``ops.enqueue()`` as illustrated in the above simple example. If only the
built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
a task is never queued on the BPF scheduler and both the local and global
DSQs are consumed automatically.

``scx_bpf_dsq_insert()`` inserts the task on the FIFO of the target DSQ. Use
``scx_bpf_dsq_insert_vtime()`` for the priority queue. Internal DSQs such as
``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
dispatching, and must be dispatched to with ``scx_bpf_dsq_insert()``. See
the function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c``
for more information.

Task Lifecycle
--------------

The following pseudo-code presents a rough overview of the entire lifecycle
of a task managed by a sched_ext scheduler:

.. code-block:: c

    ops.init_task();            /* A new task is created */
    ops.enable();               /* Enable BPF scheduling for the task */

    while (task in SCHED_EXT) {
        if (task can migrate)
            ops.select_cpu();   /* Called on wakeup (optimization) */

        ops.runnable();         /* Task becomes ready to run */

        while (task_is_runnable(task)) {
            if (task is not in a DSQ || task->scx.slice == 0) {
                ops.enqueue();  /* Task can be added to a DSQ */

                /* Task property change (i.e., affinity, nice, etc.)? */
                if (sched_change(task)) {
                    ops.dequeue();   /* Exiting BPF scheduler custody */
                    ops.quiescent();

                    /* Property change callback, e.g. ops.set_weight() */

                    ops.runnable();
                    continue;
                }

                /* Any usable CPU becomes available */

                ops.dispatch();      /* Task is moved to a local DSQ */
                ops.dequeue();       /* Exiting BPF scheduler custody */
            }

            ops.running();           /* Task starts running on its assigned CPU */

            while (task_is_runnable(task) && task->scx.slice > 0) {
                ops.tick();          /* Called every 1/HZ seconds */

                if (task->scx.slice == 0)
                    ops.dispatch();  /* task->scx.slice can be refilled */
            }

            ops.stopping();          /* Task stops running (slice expiry or wait) */
        }

        ops.quiescent();             /* Task releases its assigned CPU (wait) */
    }

    ops.disable();              /* Disable BPF scheduling for the task */
    ops.exit_task();            /* Task is destroyed */

Note that the above pseudo-code does not cover all possible state transitions
and edge cases. To name a few examples:

* ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing
  property change on that task, in which case ``ops.dispatch()`` will be
  retried.

* The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``,
  in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we
  go straight to ``ops.running()``.

* Property changes may occur at virtually any point during the task's
  lifecycle, not just when the task is queued and waiting to be dispatched.
  For example, changing a property of a running task will lead to the
  callback sequence ``ops.stopping()`` -> ``ops.quiescent()`` -> (property
  change callback) -> ``ops.runnable()`` -> ``ops.running()``.

* A sched_ext task can be preempted by a task from a higher-priority
  scheduling class, in which case it will exit the tick-dispatch loop even
  though it is still runnable and has a non-zero slice.

See the "Scheduling Cycle" section for a more detailed description of how
a freshly woken up task gets on a CPU.

Where to Look
=============

* ``include/linux/sched/ext.h`` defines the core data structures, ops table
  and constants.

* ``kernel/sched/ext.c`` contains the sched_ext core implementation and
  helpers. The functions prefixed with ``scx_bpf_`` can be called from the
  BPF scheduler.

* ``kernel/sched/ext_idle.c`` contains the built-in idle CPU selection
  policy.

* ``tools/sched_ext/`` hosts example BPF scheduler implementations.

  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
    custom DSQ.

  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.

  * ``scx_central[.bpf].c``: A central FIFO scheduler where all scheduling
    decisions are made on one CPU, demonstrating ``LOCAL_ON`` dispatching,
    tickless operation, and kthread preemption.

  * ``scx_cpu0[.bpf].c``: A scheduler that queues all tasks to a shared DSQ
    and only dispatches them on CPU0 in FIFO order. Useful for testing
    bypass behavior.

  * ``scx_flatcg[.bpf].c``: A flattened cgroup hierarchy scheduler
    implementing hierarchical weight-based cgroup CPU control by
    compounding each cgroup's share at every level into a single flat
    scheduling layer.

  * ``scx_pair[.bpf].c``: A core-scheduling example that always makes
    sibling CPU pairs execute tasks from the same CPU cgroup.

  * ``scx_sdt[.bpf].c``: A variation of ``scx_simple`` demonstrating BPF
    arena memory management for per-task data.

  * ``scx_userland[.bpf].c``: A minimal scheduler demonstrating user space
    scheduling. Tasks with CPU affinity are direct-dispatched in FIFO
    order; all others are scheduled in user space by a simple vruntime
    scheduler.

Module Parameters
=================

sched_ext exposes two module parameters under the ``sched_ext.`` prefix that
control bypass-mode behavior. These knobs are primarily for debugging; there
is usually no reason to change them during normal operation. They can be read
and written at runtime (mode 0600) via
``/sys/module/sched_ext/parameters/``.

``sched_ext.slice_bypass_us`` (default: 5000 µs)
    The time slice assigned to all tasks when the scheduler is in bypass
    mode, i.e. during BPF scheduler load, unload, and error recovery. Valid
    range is 100 µs to 100 ms.

``sched_ext.bypass_lb_intv_us`` (default: 500000 µs)
    The interval at which the bypass-mode load balancer redistributes tasks
    across CPUs. Set to 0 to disable load balancing during bypass mode.
    Valid range is 0 to 10 s.

ABI Instability
===============

The APIs provided by sched_ext to BPF scheduler programs have no stability
guarantees. This includes the ops table callbacks and constants defined in
``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
``kernel/sched/ext.c`` and ``kernel/sched/ext_idle.c``.

While we will attempt to provide a relatively stable API surface when
possible, they are subject to change without warning between kernel
versions.