sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")

Zicheng Qu reported that place_entity() doesn't work right, because
avg_vruntime() always includes cfs_rq->curr when it is ->on_rq.

Specifically, the lag scaling in place_entity() relies on
avg_vruntime() being the state *before* placement of the new entity.
However in this case avg_vruntime() will actually already include the
entity, which breaks things.
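
As an illustration of why the pre-placement state matters, here is a
minimal floating-point sketch. It assumes the lag scaling described in
the place_entity() comment, vl_i = (W + w_i)*vl'_i / W, where V and W
are the average and total weight *before* the entity is added. The
helper names place_avg() and place_vruntime() are hypothetical and do
not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted average vruntime over the first n entities (toy model). */
static double place_avg(const double *v, const double *w, size_t n)
{
	double sum_wv = 0.0, sum_w = 0.0;

	for (size_t i = 0; i < n; i++) {
		sum_wv += w[i] * v[i];
		sum_w += w[i];
	}
	assert(sum_w > 0.0);
	return sum_wv / sum_w;
}

/*
 * Pick a vruntime for a new entity of weight w so that, once it is
 * included in the average, its lag equals 'lag'. V and W must
 * describe the queue without the new entity.
 */
static double place_vruntime(double V, double W, double w, double lag)
{
	assert(W > 0.0);
	return V - lag * (W + w) / W;
}
```

If V and W already contain the entity being placed, the (W + w)/W
factor counts its weight twice and the resulting lag is off, which is
the breakage described above.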

Also, Zicheng Qu argues that avg_vruntime should be invariant under
reweight. IOW commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") was wrong!
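
The invariance claim can be checked numerically with a small sketch.
It uses floating point instead of the kernel's fixed-point arithmetic,
and the names reweight_avg() and reweight_vruntime() are hypothetical.
The adjustment is v' = V - (V - v)*w/w', i.e. the lag (V - v)*w is
preserved across the weight change.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted average vruntime of n entities (toy model). */
static double reweight_avg(const double *v, const double *w, size_t n)
{
	double sum_wv = 0.0, sum_w = 0.0;

	for (size_t i = 0; i < n; i++) {
		sum_wv += w[i] * v[i];
		sum_w += w[i];
	}
	assert(sum_w > 0.0);
	return sum_wv / sum_w;
}

/* New vruntime of an entity whose weight changes from w_old to w_new. */
static double reweight_vruntime(double V, double v, double w_old, double w_new)
{
	assert(w_new > 0.0);
	return V - (V - v) * w_old / w_new;
}
```

With two entities at vruntime 100 and 200 and weights 1 and 3, V is
175. Reweighting the first entity to weight 2 moves it to 137.5, and
the average over the new weights is still 175.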

The issue reported in 6d71a9c61604 could possibly be explained by
rounding artifacts. Notably, the extreme weight '2' is outside of the
range of avg_vruntime/sum_w_vruntime, since that uses
scale_load_down(). By scaling vruntime by the real weight, but
accounting it in the average with a factor 1024 more, the average moves
significantly. However, that is now cured.
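
To make the factor 1024 concrete, here is a sketch that assumes the
64-bit definition of scale_load_down(): weights carry 10 extra bits of
fixed-point precision, and a nonzero weight is clamped to at least 2
after the shift. The functions scale_load_down_sketch() and
accounted_over_real() are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>

#define TOY_FIXEDPOINT_SHIFT 10	/* assumed value of SCHED_FIXEDPOINT_SHIFT */

/* Stand-in for scale_load_down(): shift down, clamp nonzero to >= 2. */
static unsigned long scale_load_down_sketch(unsigned long w)
{
	unsigned long s = w >> TOY_FIXEDPOINT_SHIFT;

	if (!w)
		return 0;
	return s < 2 ? 2 : s;
}

/*
 * How much heavier the entity looks in the average than its real
 * weight says, both expressed in unscaled units.
 */
static unsigned long accounted_over_real(unsigned long w)
{
	assert(w > 0);
	return (scale_load_down_sketch(w) << TOY_FIXEDPOINT_SHIFT) / w;
}
```

Under these assumptions a nice-0 weight (1024 << 10) is accounted at
ratio 1, while the extreme weight 2 is accounted at ratio 1024.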

Tested by reverting 66951e4860d3 ("sched/fair: Fix update_cfs_group()
vs DELAY_DEQUEUE") and tracing vruntime and vlag figures again.

Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260219080625.066102672%40infradead.org

Peter Zijlstra 4 months ago 101f3498 4823725d

+124 -24

1 changed file

kernel/sched/fair.c

···
  *
  * -r_max < lag < max(r_max, q)
  */
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avruntime)
 {
 	u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
 	s64 vlag, limit;

-	WARN_ON_ONCE(!se->on_rq);
-
-	vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	vlag = avruntime - se->vruntime;
 	limit = calc_delta_fair(max_slice, se);

-	se->vlag = clamp(vlag, -limit, limit);
+	return clamp(vlag, -limit, limit);
+}
+
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	WARN_ON_ONCE(!se->on_rq);
+
+	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
 }

 /*
···
 		    se_weight(se) * -se->avg.load_sum);
 }

-static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
+static void
+rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
+{
+	unsigned long old_weight = se->load.weight;
+
+	/*
+	 * VRUNTIME
+	 * --------
+	 *
+	 * COROLLARY #1: The virtual runtime of the entity needs to be
+	 * adjusted if re-weight at !0-lag point.
+	 *
+	 * Proof: For contradiction assume this is not true, so we can
+	 * re-weight without changing vruntime at !0-lag point.
+	 *
+	 *             Weight    VRuntime   Avg-VRuntime
+	 *     before    w          v            V
+	 *      after    w'         v'           V'
+	 *
+	 * Since lag needs to be preserved through re-weight:
+	 *
+	 *	lag = (V - v)*w = (V'- v')*w', where v = v'
+	 *	==>	V' = (V - v)*w/w' + v		(1)
+	 *
+	 * Let W be the total weight of the entities before reweight,
+	 * since V' is the new weighted average of entities:
+	 *
+	 *	V' = (WV + w'v - wv) / (W + w' - w)	(2)
+	 *
+	 * by using (1) & (2) we obtain:
+	 *
+	 *	(WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+	 *	==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+	 *	==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+	 *	==>	(V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+	 *
+	 * Since we are doing at !0-lag point which means V != v, we
+	 * can simplify (3):
+	 *
+	 *	==>	W / (W + w' - w) = w / w'
+	 *	==>	Ww' = Ww + ww' - ww
+	 *	==>	W * (w' - w) = w * (w' - w)
+	 *	==>	W = w	(re-weight indicates w' != w)
+	 *
+	 * So the cfs_rq contains only one entity, hence vruntime of
+	 * the entity @v should always equal to the cfs_rq's weighted
+	 * average vruntime @V, which means we will always re-weight
+	 * at 0-lag point, thus breach assumption. Proof completed.
+	 *
+	 *
+	 * COROLLARY #2: Re-weight does NOT affect weighted average
+	 * vruntime of all the entities.
+	 *
+	 * Proof: According to corollary #1, Eq. (1) should be:
+	 *
+	 *	(V - v)*w = (V' - v')*w'
+	 *	==>	v' = V' - (V - v)*w/w'		(4)
+	 *
+	 * According to the weighted average formula, we have:
+	 *
+	 *	V' = (WV - wv + w'v') / (W - w + w')
+	 *	   = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+	 *	   = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+	 *	   = (WV + w'V' - Vw) / (W - w + w')
+	 *
+	 *	==>  V'*(W - w + w') = WV + w'V' - Vw
+	 *	==>	V' * (W - w) = (W - w) * V	(5)
+	 *
+	 * If the entity is the only one in the cfs_rq, then reweight
+	 * always occurs at 0-lag point, so V won't change. Or else
+	 * there are other entities, hence W != w, then Eq. (5) turns
+	 * into V' = V. So V won't change in either case, proof done.
+	 *
+	 *
+	 * So according to corollary #1 & #2, the effect of re-weight
+	 * on vruntime should be:
+	 *
+	 *	v' = V' - (V - v) * w / w'		(4)
+	 *	   = V  - (V - v) * w / w'
+	 *	   = V  - vl * w / w'
+	 *	   = V  - vl'
+	 */
+	se->vlag = div64_long(se->vlag * old_weight, weight);
+
+	/*
+	 * DEADLINE
+	 * --------
+	 *
+	 * When the weight changes, the virtual time slope changes and
+	 * we should adjust the relative virtual deadline accordingly.
+	 *
+	 *	d' = v' + (d - v)*w/w'
+	 *	   = V' - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  - (V - v)*w/w' + (d - v)*w/w'
+	 *	   = V  + (d - V)*w/w'
+	 */
+	if (se->rel_deadline)
+		se->deadline = div64_long(se->deadline * old_weight, weight);
+
+	if (rel_vprot)
+		se->vprot = div64_long(se->vprot * old_weight, weight);
+}

 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
 	bool curr = cfs_rq->curr == se;
 	bool rel_vprot = false;
-	u64 vprot;
+	u64 avruntime = 0;

 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		update_curr(cfs_rq);
-		update_entity_lag(cfs_rq, se);
-		se->deadline -= se->vruntime;
+		avruntime = avg_vruntime(cfs_rq);
+		se->vlag = entity_lag(cfs_rq, se, avruntime);
+		se->deadline -= avruntime;
 		se->rel_deadline = 1;
 		if (curr && protect_slice(se)) {
-			vprot = se->vprot - se->vruntime;
+			se->vprot -= avruntime;
 			rel_vprot = true;
 		}

···
 	}
 	dequeue_load_avg(cfs_rq, se);

-	/*
-	 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
-	 * we need to scale se->vlag when w_i changes.
-	 */
-	se->vlag = div64_long(se->vlag * se->load.weight, weight);
-	if (se->rel_deadline)
-		se->deadline = div64_long(se->deadline * se->load.weight, weight);
-
-	if (rel_vprot)
-		vprot = div64_long(vprot * se->load.weight, weight);
+	rescale_entity(se, weight, rel_vprot);

 	update_load_set(&se->load, weight);

 	do {
 		u32 divider = get_pelt_divider(&se->avg);
-
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
 	} while (0);

 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
 		if (rel_vprot)
-			se->vprot = se->vruntime + vprot;
+			se->vprot += avruntime;
+		se->deadline += avruntime;
+		se->rel_deadline = 0;
+		se->vruntime = avruntime - se->vlag;
+
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
···

 	se->vruntime = vruntime - lag;

-	if (se->rel_deadline) {
+	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
 		se->deadline += se->vruntime;
 		se->rel_deadline = 0;
 		return;
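
The reweight flow in the patch (make lag and deadline relative to the
average, rescale by w/w', then make them absolute again) can be
sketched on a toy entity as follows. This is a simplified model, not
the kernel code: struct toy_entity and toy_reweight() are hypothetical,
the lag clamp from entity_lag() and the vprot handling are left out,
and plain truncating integer division stands in for div64_long().

```c
#include <assert.h>
#include <stdint.h>

struct toy_entity {
	int64_t vruntime;
	int64_t deadline;
	int64_t vlag;
	int64_t weight;
};

/* Reweight 'se' around the average V, which stays unchanged. */
static void toy_reweight(struct toy_entity *se, int64_t V, int64_t new_weight)
{
	assert(se->weight > 0 && new_weight > 0);

	/* Express lag and deadline relative to the average. */
	se->vlag = V - se->vruntime;
	se->deadline -= V;

	/* Scale both by w/w', as in rescale_entity(). */
	se->vlag = se->vlag * se->weight / new_weight;
	se->deadline = se->deadline * se->weight / new_weight;
	se->weight = new_weight;

	/* Back to absolute values: d' = V + (d - V)*w/w', v' = V - vl'. */
	se->deadline += V;
	se->vruntime = V - se->vlag;
}
```

For an entity at vruntime 1000 with deadline 1600 and weight 2, and an
average of 1200, doubling the weight to 4 gives vruntime 1100 and
deadline 1400. The weighted lag (V - v)*w is 400 before and after.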
