sched_ext: Don't call put_prev_task_scx() before picking the next task

fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:

pick_next_task() := pick_task() + set_next_task(.first = true)

to:

pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

making it pick_next_task()'s responsibility to invoke put_prev_task(). This
reordering allows pick_task() to be shared between the regular and core-sched
paths and lets put_prev_task() know the next task.
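
The new ordering can be illustrated with a small userspace model. This is
only a sketch: every toy_* name, struct trace, and the step enum are
hypothetical stand-ins invented for illustration and are not kernel APIs.
It shows nothing more than the call order described above, with
put_prev_task() running after pick_task() so that it can see the next task.

```c
#include <stddef.h>

/* Toy model, not kernel code. All names here are hypothetical. */
enum step { STEP_PICK, STEP_PUT_PREV, STEP_SET_NEXT };

struct toy_task {
	int id;
};

/* Records the order in which the three steps ran. */
struct trace {
	enum step steps[3];
	size_t nr;
	const struct toy_task *next_seen_by_put_prev;
};

static void record(struct trace *t, enum step s)
{
	if (t->nr < 3)
		t->steps[t->nr++] = s;
}

/* Stand-in for pick_task(): just returns the candidate it is given. */
static struct toy_task *toy_pick_task(struct trace *t, struct toy_task *cand)
{
	record(t, STEP_PICK);
	return cand;
}

/* Stand-in for put_prev_task(): it now receives the next task. */
static void toy_put_prev_task(struct trace *t, struct toy_task *prev,
			      struct toy_task *next)
{
	(void)prev;
	t->next_seen_by_put_prev = next;
	record(t, STEP_PUT_PREV);
}

/* Stand-in for set_next_task(.first = true). */
static void toy_set_next_task(struct trace *t, struct toy_task *next)
{
	(void)next;
	record(t, STEP_SET_NEXT);
}

/* pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task() */
static struct toy_task *toy_pick_next_task(struct trace *t,
					   struct toy_task *prev,
					   struct toy_task *cand)
{
	struct toy_task *next = toy_pick_task(t, cand);

	if (!next)
		return NULL;

	toy_put_prev_task(t, prev, next);
	toy_set_next_task(t, next);
	return next;
}
```

Calling toy_pick_next_task() with a non-NULL candidate records the steps in
the order pick, put_prev, set_next, and the put_prev step sees the picked
task.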

sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes,
70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
before picking the first task as a workaround.

Clean it up and adopt the conventions that other sched classes are
following.

The logic for keeping the current task running was spread across multiple
functions and required the task to be put on the local DSQ before picking:

- balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
runnable, hasn't exhausted its slice, and thus should keep running.

- put_prev_task_scx() enqueued the task to the local DSQ if
SCX_TASK_BAL_KEEP was set. It also called do_enqueue_task() with
SCX_ENQ_LAST if it was the only runnable task. do_enqueue_task() in turn
decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST.

Consolidate the logic in balance_one() as it always knows whether it is
going to keep the current task. balance_one() now considers all conditions
where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from
the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
updated to pick the current task if SCX_TASK_BAL_KEEP is set.
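
The consolidated flow can be sketched as a userspace model. This is a
simplified illustration under stated assumptions, not the kernel
implementation: all toy_* and TOY_* names are hypothetical, the local DSQ is
reduced to a single head pointer, the dispatch loop is omitted, and @prev is
assumed to be an SCX task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code. All names here are hypothetical. */
#define TOY_TASK_QUEUED		(1u << 0)	/* models SCX_TASK_QUEUED */
#define TOY_TASK_BAL_KEEP	(1u << 1)	/* models SCX_TASK_BAL_KEEP */
#define TOY_SLICE_DFL		20u		/* models SCX_SLICE_DFL */

struct toy_task {
	unsigned int flags;
	unsigned int slice;
};

struct toy_rq {
	struct toy_task *local_head;	/* first task on the local DSQ, or NULL */
	bool enq_last;			/* models SCX_OPS_ENQ_LAST */
	bool bypassing;			/* models scx_ops_bypassing() */
};

/* Models balance_one(): decides in one place whether @prev is kept. */
static bool toy_balance_one(struct toy_rq *rq, struct toy_task *prev)
{
	bool runnable = (prev->flags & TOY_TASK_QUEUED) != 0;

	/* Runnable with slice left: @prev has priority. */
	if (runnable && prev->slice && !rq->bypassing) {
		prev->flags |= TOY_TASK_BAL_KEEP;
		return true;
	}

	/* Something else is already queued locally. */
	if (rq->local_head)
		return true;

	/* (The dispatch loop is omitted in this sketch.) */

	/* Nothing else to run: keep @prev unless ENQ_LAST is in effect. */
	if (runnable && (!rq->enq_last || rq->bypassing)) {
		prev->flags |= TOY_TASK_BAL_KEEP;
		return true;
	}
	return false;
}

/* Models pick_next_task_scx(): honor BAL_KEEP, else pop the local DSQ. */
static struct toy_task *toy_pick_next_task(struct toy_rq *rq,
					   struct toy_task *prev)
{
	if (prev->flags & TOY_TASK_BAL_KEEP) {
		prev->flags &= ~TOY_TASK_BAL_KEEP;
		if (!prev->slice)
			prev->slice = TOY_SLICE_DFL;
		return prev;
	}
	return rq->local_head;
}
```

In this model, a runnable @prev with an exhausted slice and an empty local
DSQ keeps running with a replenished slice, unless enq_last is set, in which
case the balance step reports no tasks and the pick returns NULL.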

The workaround put_prev_task[_scx]() calls are replaced with
put_prev_set_next_task().

This causes two behavior changes observable from the BPF scheduler:

- When a task keeps running, it no longer goes through an enqueue/dequeue
cycle and thus the ops.stopping/running() transitions. The new behavior is
better and all the existing schedulers should be able to handle it.

- The BPF scheduler can no longer keep executing the current task by
enqueueing an SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is
specified, the BPF scheduler is responsible for resuming execution after
each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where
scheduling decisions are not made on the local CPU - e.g. central or
userspace-driven scheduling - and the new behavior is more logical and
shouldn't pose any problems. The SCX_OPS_ENQ_LAST demonstration in scx_qmap
is dropped as it doesn't fit that well anymore, and the last task handling
is moved to the end of qmap_dispatch().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>

Tejun Heo 2 years ago 7c65ae81 d7b01aef

+81 -73

2 changed files

+63 -69

kernel/sched/ext.c

···
 	 * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with the
 	 * %SCX_ENQ_LAST flag set.
 	 *
-	 * If the BPF scheduler wants to continue executing the task,
-	 * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
-	 * If the task gets queued on a different dsq or the BPF side, the BPF
-	 * scheduler is responsible for triggering a follow-up scheduling event.
-	 * Otherwise, Execution may stall.
+	 * The BPF scheduler is responsible for triggering a follow-up
+	 * scheduling event. Otherwise, Execution may stall.
 	 */
 	SCX_ENQ_LAST		= 1LLU << 41,
 
···
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_ops_bypassing()) {
-		if (enq_flags & SCX_ENQ_LAST)
-			goto local;
-		else
-			goto global;
-	}
+	if (scx_ops_bypassing())
+		goto global;
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
···
 	/* see %SCX_OPS_ENQ_EXITING */
 	if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
 	    unlikely(p->flags & PF_EXITING))
-		goto local;
-
-	/* see %SCX_OPS_ENQ_LAST */
-	if (!static_branch_unlikely(&scx_ops_enq_last) &&
-	    (enq_flags & SCX_ENQ_LAST))
 		goto local;
 
 	if (!SCX_HAS_OP(enqueue))
···
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
 	bool prev_on_scx = prev->sched_class == &ext_sched_class;
 	int nr_loops = SCX_DSP_MAX_LOOPS;
-	bool has_tasks = false;
 
 	lockdep_assert_rq_held(rq);
 	rq->scx.flags |= SCX_RQ_IN_BALANCE;
···
 		/*
 		 * If @prev is runnable & has slice left, it has priority and
 		 * fetching more just increases latency for the fetched tasks.
-		 * Tell put_prev_task_scx() to put @prev on local_dsq. If the
-		 * BPF scheduler wants to handle this explicitly, it should
-		 * implement ->cpu_released().
+		 * Tell pick_next_task_scx() to keep running @prev. If the BPF
+		 * scheduler wants to handle this explicitly, it should
+		 * implement ->cpu_release().
 		 *
 		 * See scx_ops_disable_workfn() for the explanation on the
 		 * bypassing test.
···
 		goto has_tasks;
 
 	if (!SCX_HAS_OP(dispatch) || scx_ops_bypassing() || !scx_rq_online(rq))
-		goto out;
+		goto no_tasks;
 
 	dspc->rq = rq;
 
···
 		}
 	} while (dspc->nr_tasks);
 
-	goto out;
+no_tasks:
+	/*
+	 * Didn't find another task to run. Keep running @prev unless
+	 * %SCX_OPS_ENQ_LAST is in effect.
+	 */
+	if ((prev->scx.flags & SCX_TASK_QUEUED) &&
+	    (!static_branch_unlikely(&scx_ops_enq_last) || scx_ops_bypassing())) {
+		if (local)
+			prev->scx.flags |= SCX_TASK_BAL_KEEP;
+		goto has_tasks;
+	}
+	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
+	return false;
 
 has_tasks:
-	has_tasks = true;
-out:
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
-	return has_tasks;
+	return true;
 }
 
 static int balance_scx(struct rq *rq, struct task_struct *prev,
···
 	if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
 		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);
 
-	/*
-	 * If we're being called from put_prev_task_balance(), balance_scx() may
-	 * have decided that @p should keep running.
-	 */
-	if (p->scx.flags & SCX_TASK_BAL_KEEP) {
-		p->scx.flags &= ~SCX_TASK_BAL_KEEP;
-		set_task_runnable(rq, p);
-		dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
-		return;
-	}
-
 	if (p->scx.flags & SCX_TASK_QUEUED) {
+		p->scx.flags &= ~SCX_TASK_BAL_KEEP;
+
 		set_task_runnable(rq, p);
 
 		/*
-		 * If @p has slice left and balance_scx() didn't tag it for
-		 * keeping, @p is getting preempted by a higher priority
-		 * scheduler class or core-sched forcing a different task. Leave
-		 * it at the head of the local DSQ.
+		 * If @p has slice left and is being put, @p is getting
+		 * preempted by a higher priority scheduler class or core-sched
+		 * forcing a different task. Leave it at the head of the local
+		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_ops_bypassing()) {
 			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
···
 		}
 
 		/*
-		 * If we're in the pick_next_task path, balance_scx() should
-		 * have already populated the local DSQ if there are any other
-		 * available tasks. If empty, tell ops.enqueue() that @p is the
-		 * only one available for this cpu. ops.enqueue() should put it
-		 * on the local DSQ so that the subsequent pick_next_task_scx()
-		 * can find the task unless it wants to trigger a separate
-		 * follow-up scheduling event.
+		 * If @p is runnable but we're about to enter a lower
+		 * sched_class, %SCX_OPS_ENQ_LAST must be set. Tell
+		 * ops.enqueue() that @p is the only one available for this cpu,
+		 * which should trigger an explicit follow-up scheduling event.
 		 */
-		if (list_empty(&rq->scx.local_dsq.list))
+		if (sched_class_above(&ext_sched_class, next->sched_class)) {
+			WARN_ON_ONCE(!static_branch_unlikely(&scx_ops_enq_last));
 			do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
-		else
+		} else {
 			do_enqueue_task(rq, p, 0, -1);
+		}
 	}
 }
 
···
 {
 	struct task_struct *p;
 
-	if (prev->sched_class == &ext_sched_class)
-		put_prev_task_scx(rq, prev, NULL);
+	/*
+	 * If balance_scx() is telling us to keep running @prev, replenish slice
+	 * if necessary and keep running @prev. Otherwise, pop the first one
+	 * from the local DSQ.
+	 */
+	if (prev->scx.flags & SCX_TASK_BAL_KEEP) {
+		prev->scx.flags &= ~SCX_TASK_BAL_KEEP;
+		p = prev;
+		if (!p->scx.slice)
+			p->scx.slice = SCX_SLICE_DFL;
+	} else {
+		p = first_local_task(rq);
+		if (!p)
+			return NULL;
 
-	p = first_local_task(rq);
-	if (!p)
-		return NULL;
-
-	if (prev->sched_class != &ext_sched_class)
-		prev->sched_class->put_prev_task(rq, prev, p);
-
-	set_next_task_scx(rq, p, true);
-
-	if (unlikely(!p->scx.slice)) {
-		if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
-			printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
-					p->comm, p->pid);
-			scx_warned_zero_slice = true;
+		if (unlikely(!p->scx.slice)) {
+			if (!scx_ops_bypassing() && !scx_warned_zero_slice) {
+				printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
+						p->comm, p->pid);
+				scx_warned_zero_slice = true;
+			}
+			p->scx.slice = SCX_SLICE_DFL;
 		}
-		p->scx.slice = SCX_SLICE_DFL;
 	}
+
+	put_prev_set_next_task(rq, prev, p);
 
 	return p;
 }
···
  * to force global FIFO scheduling.
  *
  * a. ops.enqueue() is ignored and tasks are queued in simple global FIFO order.
+ *    %SCX_OPS_ENQ_LAST is also ignored.
  *
  * b. ops.dispatch() is ignored.
  *
- * c. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value can't be
- *    trusted. Whenever a tick triggers, the running task is rotated to the tail
- *    of the queue with core_sched_at touched.
+ * c. balance_scx() does not set %SCX_TASK_BAL_KEEP on non-zero slice as slice
+ *    can't be trusted. Whenever a tick triggers, the running task is rotated to
+ *    the tail of the queue with core_sched_at touched.
  *
  * d. pick_next_task() suppresses zero slice warning.
  *

+18 -4

tools/sched_ext/scx_qmap.bpf.c

···
 
 	/*
 	 * All enqueued tasks must have their core_sched_seq updated for correct
-	 * core-sched ordering, which is why %SCX_OPS_ENQ_LAST is specified in
-	 * qmap_ops.flags.
+	 * core-sched ordering. Also, take a look at the end of qmap_dispatch().
 	 */
 	tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
 
···
 	 * If qmap_select_cpu() is telling us to or this is the last runnable
 	 * task on the CPU, enqueue locally.
 	 */
-	if (tctx->force_local || (enq_flags & SCX_ENQ_LAST)) {
+	if (tctx->force_local) {
 		tctx->force_local = false;
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
 		return;
···
 {
 	struct task_struct *p;
 	struct cpu_ctx *cpuc;
+	struct task_ctx *tctx;
 	u32 zero = 0, batch = dsp_batch ?: 1;
 	void *fifo;
 	s32 i, pid;
···
 		}
 
 		cpuc->dsp_cnt = 0;
+	}
+
+	/*
+	 * No other tasks. @prev will keep running. Update its core_sched_seq as
+	 * if the task were enqueued and dispatched immediately.
+	 */
+	if (prev) {
+		tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0);
+		if (!tctx) {
+			scx_bpf_error("task_ctx lookup failed");
+			return;
+		}
+
+		tctx->core_sched_seq =
+			core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
 	}
 }
 
···
 	       .cpu_offline		= (void *)qmap_cpu_offline,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
-	       .flags			= SCX_OPS_ENQ_LAST,
 	       .timeout_ms		= 5000U,
 	       .name			= "qmap");
