Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched_ext: Fix SCX_KICK_WAIT to work reliably

SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.

However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
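
The regression can be modelled in a few lines of plain C. This is a hypothetical userspace sketch, not kernel code: `model_rq`, `model_switch_broken()` and `model_waiter_released()` are invented names, and the only thing taken from the description above is the condition under which pnt_seq gets bumped.

```c
#include <stdbool.h>

/* Hypothetical model with just two scheduling classes. */
enum model_class { MODEL_CLASS_SCX, MODEL_CLASS_OTHER };

struct model_rq {
	unsigned long pnt_seq;		/* stands in for rq->scx.pnt_seq */
	enum model_class cur_class;	/* class of the task currently running */
};

/*
 * Models the broken behaviour: the counter is bumped only from the
 * switch_class() path, i.e. when the CPU leaves SCX for another class.
 */
void model_switch_broken(struct model_rq *rq, enum model_class next)
{
	if (rq->cur_class == MODEL_CLASS_SCX && next != MODEL_CLASS_SCX)
		rq->pnt_seq++;
	rq->cur_class = next;
}

/*
 * Returns true if a waiter that sampled pnt_seq before the task switch
 * would observe a change afterwards and stop waiting.
 */
bool model_waiter_released(enum model_class prev, enum model_class next)
{
	struct model_rq rq = { .pnt_seq = 0, .cur_class = prev };
	unsigned long sampled = rq.pnt_seq;

	model_switch_broken(&rq, next);
	return rq.pnt_seq != sampled;
}
```

In this model, a switch from one SCX task to another leaves the counter untouched, so a waiter that relies on it would keep spinning.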

This fix builds on commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
pick_task_scx()"), which refactored the pick path and made
put_prev_task_scx() the natural place to track task switches for
SCX_KICK_WAIT. The fix moves the pnt_seq increment to put_prev_task_scx()
and also increments it in pick_task_scx() to handle cases where the same
task is re-selected, whether by BPF scheduler decision or slice refill. The
resulting semantics: if the current task on the target CPU is SCX,
SCX_KICK_WAIT waits until the CPU enters the scheduling path. This provides
a sufficient guarantee for use cases like core scheduling while keeping the
operation self-contained within SCX.
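
The semantics described above can be sketched as a small userspace model using C11 atomics. Everything named `model_*` is hypothetical and only approximates the kernel code: the release store stands in for smp_store_release(), the acquire load for the waiter side, and `curr_is_scx` for the `cur_class == &ext_sched_class` check.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; not the kernel's struct rq. */
struct model_cpu {
	atomic_ulong pnt_seq;	/* stands in for rq->scx.pnt_seq */
	bool curr_is_scx;	/* is the running task an SCX task? */
};

/* Analog of smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1). */
void model_bump_pnt_seq(struct model_cpu *c)
{
	unsigned long v = atomic_load_explicit(&c->pnt_seq,
					       memory_order_relaxed);

	atomic_store_explicit(&c->pnt_seq, v + 1, memory_order_release);
}

/* Both scheduling-path entry points bump the counter after the fix. */
void model_put_prev_task(struct model_cpu *c) { model_bump_pnt_seq(c); }
void model_pick_task(struct model_cpu *c) { model_bump_pnt_seq(c); }

/*
 * Kicker side: snapshot the counter only if the target currently runs
 * an SCX task. Returns false when there is nothing to wait for.
 */
bool model_kick_snapshot(struct model_cpu *c, unsigned long *pseq)
{
	if (!c->curr_is_scx)
		return false;
	*pseq = atomic_load_explicit(&c->pnt_seq, memory_order_relaxed);
	return true;
}

/* Waiter side: has the target entered the scheduling path since? */
bool model_wait_done(struct model_cpu *c, unsigned long pseq)
{
	return atomic_load_explicit(&c->pnt_seq, memory_order_acquire) != pseq;
}
```

In this model, re-picking the same task still goes through `model_pick_task()`, so the waiter is released even when no task switch happens.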

v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
      re-selection (Andrea Righi).
    - Use smp_cond_load_acquire() for the busy-wait loop for better
      architecture optimization (Peter Zijlstra).
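
For readers unfamiliar with the conditional-load form, the shape of the new wait loop can be approximated in userspace as follows. This is a rough sketch assuming gcc or clang (it uses GNU statement expressions); `model_cond_load_acquire()`, `model_cpu_relax()` and `model_wait_for_change()` are hypothetical names, and the kernel macro may be implemented differently per architecture.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for cpu_relax(); a no-op in this sketch. */
static inline void model_cpu_relax(void) { }

/*
 * Spin until cond_expr becomes true, where cond_expr may refer to VAL,
 * the most recently loaded value. Evaluates to that value, which was
 * loaded with acquire ordering.
 */
#define model_cond_load_acquire(ptr, cond_expr) ({			\
	unsigned long VAL;						\
	for (;;) {							\
		VAL = atomic_load_explicit((ptr),			\
					   memory_order_acquire);	\
		if (cond_expr)						\
			break;						\
		model_cpu_relax();					\
	}								\
	VAL;								\
})

/* Wait until the sequence counter differs from the snapshot. */
unsigned long model_wait_for_change(atomic_ulong *seq, unsigned long snapshot)
{
	return model_cond_load_acquire(seq, VAL != snapshot);
}
```

Writing the condition in terms of VAL keeps the load and the test inside a single construct, which is what leaves room for an architecture to substitute a more efficient wait than a plain load-and-relax loop.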

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Tejun Heo a379fa1e a9c1fbbd

+30 -22
+26 -20
kernel/sched/ext.c
···
 	struct scx_sched *sch = scx_root;
 	const struct sched_class *next_class = next->sched_class;
 
-	/*
-	 * Pairs with the smp_load_acquire() issued by a CPU in
-	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
-	 * resched.
-	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 	if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
 		return;
 
···
 			      struct task_struct *next)
 {
 	struct scx_sched *sch = scx_root;
+
+	/* see kick_cpus_irq_workfn() */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
 	update_curr_scx(rq);
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
···
 	struct task_struct *prev = rq->curr;
 	bool keep_prev, kick_idle = false;
 	struct task_struct *p;
+
+	/* see kick_cpus_irq_workfn() */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 
 	rq_modified_clear(rq);
 
···
 	}
 
 	if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
-		pseqs[cpu] = rq->scx.pnt_seq;
-		should_wait = true;
+		if (cur_class == &ext_sched_class) {
+			pseqs[cpu] = rq->scx.pnt_seq;
+			should_wait = true;
+		} else {
+			cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+		}
 	}
 
 	resched_curr(rq);
···
 	for_each_cpu(cpu, this_scx->cpus_to_wait) {
 		unsigned long *wait_pnt_seq = &cpu_rq(cpu)->scx.pnt_seq;
 
-		if (cpu != cpu_of(this_rq)) {
-			/*
-			 * Pairs with smp_store_release() issued by this CPU in
-			 * switch_class() on the resched path.
-			 *
-			 * We busy-wait here to guarantee that no other task can
-			 * be scheduled on our core before the target CPU has
-			 * entered the resched path.
-			 */
-			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
-				cpu_relax();
-		}
+		/*
+		 * Busy-wait until the task running at the time of kicking is no
+		 * longer running. This can be used to implement e.g. core
+		 * scheduling.
+		 *
+		 * smp_cond_load_acquire() pairs with store_releases in
+		 * pick_task_scx() and put_prev_task_scx(). The former breaks
+		 * the wait if SCX's scheduling path is entered even if the same
+		 * task is picked subsequently. The latter is necessary to break
+		 * the wait when $cpu is taken by a higher sched class.
+		 */
+		if (cpu != cpu_of(this_rq))
+			smp_cond_load_acquire(wait_pnt_seq, VAL != pseqs[cpu]);
 
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
 	}
+4 -2
kernel/sched/ext_internal.h
···
 	SCX_KICK_PREEMPT	= 1LLU << 1,
 
 	/*
-	 * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
-	 * return after the target CPU finishes picking the next task.
+	 * The scx_bpf_kick_cpu() call will return after the current SCX task of
+	 * the target CPU switches out. This can be used to implement e.g. core
+	 * scheduling. This has no effect if the current task on the target CPU
+	 * is not on SCX.
 	 */
 	SCX_KICK_WAIT		= 1LLU << 2,
 };