sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Bypass mode routes tasks through fallback dispatch queues. The fallback was
originally a single global DSQ; b7b3b2dbae73 ("sched_ext: Split the global DSQ
per NUMA node") split it into per-node DSQs to resolve NUMA-related livelocks.

Dan Schatzberg found that per-node DSQs can still livelock when many threads
are pinned to different small CPU subsets: each CPU must scan past many
incompatible tasks to find one it can run, which causes severe contention on
machines with high CPU counts.
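The scanning cost described above can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct model_task,
shared_queue_pick() and the 64-bit affinity mask are hypothetical names chosen
for illustration. It counts how many entries of a shared queue a CPU has to
inspect before it finds a task whose affinity allows that CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of a task on a shared fallback queue. */
struct model_task {
	uint64_t allowed_cpus;	/* bit N set => task may run on CPU N */
};

/*
 * Return the index of the first task in the shared queue that @cpu may run,
 * or -1 if there is none. *scanned receives the number of entries inspected,
 * which is the work that every CPU repeats under the shared queue's lock.
 */
static int shared_queue_pick(const struct model_task *q, size_t len,
			     unsigned int cpu, size_t *scanned)
{
	assert(cpu < 64);

	*scanned = 0;
	for (size_t i = 0; i < len; i++) {
		(*scanned)++;
		if (q[i].allowed_cpus & (UINT64_C(1) << cpu))
			return (int)i;
	}
	return -1;
}
```

In this model, when each queued task is pinned to a different CPU, a CPU
whose task sits at the tail inspects the whole queue, and a CPU with no
compatible task inspects the whole queue and still finds nothing.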

Switch to per-CPU bypass DSQs. Each task is queued on the bypass DSQ of its
current CPU. Default idle CPU selection and direct dispatch handle most cases
well.
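As a rough illustration of the per-CPU arrangement, the following user-space
sketch keeps one FIFO per CPU. All names here (model_bypass,
model_bypass_enqueue(), model_bypass_consume(), the fixed sizes) are
hypothetical, and the model assumes that a task's current CPU is always one it
is allowed to run on. The actual kernel change uses struct scx_dispatch_q and
consume_dispatch_q(), as shown in the diff.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS	4
#define MODEL_QLEN	8

/* Hypothetical per-CPU FIFO of task ids, stored as a ring buffer. */
struct model_cpu_queue {
	int tasks[MODEL_QLEN];
	size_t head;	/* index of the oldest entry */
	size_t count;	/* number of queued entries */
};

static struct model_cpu_queue model_bypass[MODEL_NR_CPUS];

/* Queue @task_id on the bypass queue of @cur_cpu; -1 if the queue is full. */
static int model_bypass_enqueue(int task_id, unsigned int cur_cpu)
{
	assert(cur_cpu < MODEL_NR_CPUS);

	struct model_cpu_queue *q = &model_bypass[cur_cpu];

	if (q->count == MODEL_QLEN)
		return -1;
	q->tasks[(q->head + q->count) % MODEL_QLEN] = task_id;
	q->count++;
	return 0;
}

/* Pop the oldest task queued on @cpu, or return -1 if its queue is empty. */
static int model_bypass_consume(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);

	struct model_cpu_queue *q = &model_bypass[cpu];

	if (q->count == 0)
		return -1;

	int task_id = q->tasks[q->head];

	q->head = (q->head + 1) % MODEL_QLEN;
	q->count--;
	return task_id;
}
```

A CPU in this model only ever touches its own queue, so it never skips over
tasks pinned elsewhere. The cost is that a CPU with an empty queue stays idle
even if another CPU's queue is long, which is the skew problem the commit
message goes on to describe.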

This introduces a failure mode when tasks concentrate on one CPU in an
over-saturated system. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
kept separate from the local DSQ to enable load balancing: local DSQs are
protected by rq locks, which prevents efficient scanning and transfer across
CPUs, and this is especially problematic when the system is already contended.

v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo 7 months ago 61debc25 3546119f

+15 -3

3 changed files


include/linux/sched/ext.h
kernel/sched/ext.c
kernel/sched/sched.h

include/linux/sched/ext.h

···
 	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
 	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
 	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
+	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
 	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
 	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
 };

+13 -3

kernel/sched/ext.c

···
 
 	if (scx_rq_bypassing(rq)) {
 		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto global;
+		goto bypass;
 	}
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
···
 	goto enqueue;
 global:
 	dsq = find_global_dsq(sch, p);
+	goto enqueue;
+bypass:
+	dsq = &task_rq(p)->scx.bypass_dsq;
 	goto enqueue;
 
 enqueue:
···
 	if (consume_global_dsq(sch, rq))
 		goto has_tasks;
 
-	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
-	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
+	if (scx_rq_bypassing(rq)) {
+		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
+			goto has_tasks;
+		else
+			goto no_tasks;
+	}
+
+	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
 		goto no_tasks;
 
 	dspc->rq = rq;
···
 		int  n = cpu_to_node(cpu);
 
 		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
 		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
 

kernel/sched/sched.h

···
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
+	struct scx_dispatch_q	bypass_dsq;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
 
