Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/fair: Fix zero_vruntime tracking fix

John reported that stress-ng-yield could make his machine unhappy and
managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix
zero_vruntime tracking").

The combination of yield and that commit was specific enough to
hypothesize the following scenario:

Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.

Therefore, the eligible task will run and, since all these tasks do is
yield, be promoted a full slice. This causes it to jump over the other
task; now the other task is eligible and current no longer is. So we
schedule.

Since we are runnable, there is no {de,en}queue. All we have is the
__{en,de}queue_entity() from {put_prev,set_next}_task(). But per the
fingered commit, those two no longer move zero_vruntime.

All that moves zero_vruntime are tick and full {de,en}queue.

This means that if the two tasks playing leapfrog can pick up enough
speed to reach the overflow point inside one tick's worth of time, we're
up a creek.

Additionally, when multiple cgroups are involved, there is no guarantee
the tick will in fact hit every cgroup in a timely manner. Statistically
speaking it will, but those same statistics do not rule out the
possibility of one cgroup not getting a tick for a significant amount of
time -- however unlikely.

Therefore, just like with the yield() case, force an update at the end
of every slice. This ensures the update is never more than a single
slice behind and the whole thing is within 2 lag bounds as per the
comment on entity_key().

Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260401132355.081530332@infradead.org

+3 -7
kernel/sched/fair.c

@@ -707,7 +707,7 @@
  * Called in:
  * - place_entity() -- before enqueue
  * - update_entity_lag() -- before dequeue
- * - entity_tick()
+ * - update_deadline() -- slice expiration
  *
  * This means it is one entry 'behind' but that puts it close enough to where
  * the bound on entity_key() is at most two lag bounds.
@@ -1131,6 +1131,7 @@
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	avg_vruntime(cfs_rq);
 
 	/*
 	 * The task has consumed its request, reschedule.
@@ -5594,11 +5593,6 @@
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
 
-	/*
-	 * Pulls along cfs_rq::zero_vruntime.
-	 */
-	avg_vruntime(cfs_rq);
-
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -9124,7 +9128,7 @@
 	 */
 	if (entity_eligible(cfs_rq, se)) {
 		se->vruntime = se->deadline;
-		se->deadline += calc_delta_fair(se->slice, se);
+		update_deadline(cfs_rq, se);
 	}
 }