Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/fair: Fix zero_vruntime tracking

It turns out that zero_vruntime tracking is broken when there is but a single
task running. The current update paths are through __{en,de}queue_entity(),
but when there is but a single task, pick_next_task() will always return that
one task, and put_prev_set_next_task() will end up reaching neither function.

This can cause entity_key() to grow indefinitely large and cause overflows,
leading to much pain and suffering.
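
To see why, here is a paraphrased sketch (not the verbatim kernel code) of
the generic pick path, roughly as it appears in kernel/sched/sched.h: with a
lone runnable task, prev == next, so the early return skips both class hooks,
and the __{en,de}queue_entity() update paths never run.

static inline void
put_prev_set_next_task(struct rq *rq, struct task_struct *prev,
		       struct task_struct *next)
{
	if (next == prev)
		return;		/* lone task: both hooks below are skipped */

	/*
	 * Only these hooks reach put_prev_entity()/set_next_entity(),
	 * and thus __{en,de}queue_entity(), where zero_vruntime was
	 * being updated.
	 */
	prev->sched_class->put_prev_task(rq, prev, next);
	next->sched_class->set_next_task(rq, next, true);
}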

Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
are called from {set_next,put_prev}_entity(), is problematic because:

- set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
This means avg_vruntime() will see the removal but not yet see se as current,
so the entity goes missing from the accounting.

- put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
NULL. This means avg_vruntime() will see the addition *and* still see prev as
current, leading to double accounting.

Both cases are incorrect/inconsistent.
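
Condensed into a sketch (eliding the unrelated stats and buddy handling in
both functions), the ordering described above is:

static void set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (se->on_rq)
		__dequeue_entity(cfs_rq, se);	/* zero_vruntime updated here... */
	cfs_rq->curr = se;			/* ...before se is current: counted nowhere */
}

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	if (prev->on_rq)
		__enqueue_entity(cfs_rq, prev);	/* zero_vruntime updated here... */
	cfs_rq->curr = NULL;			/* ...while prev is still current: counted twice */
}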

Noting that avg_vruntime() is already called on each {en,de}queue, remove the
explicit update_zero_vruntime() calls from __{en,de}queue_entity() (which
removes an extra 64-bit division on each {en,de}queue path) and have
avg_vruntime() update zero_vruntime itself.
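
The bookkeeping behind this stays cheap because shifting the reference point
is a linear update. With key_i = v_i - zero_vruntime and
sum_w_vruntime = \Sum w_i * key_i, moving zero_vruntime by d gives:

  \Sum w_i * (v_i - (zero_vruntime + d)) = sum_w_vruntime - d * \Sum w_i

which is exactly the single multiply-and-subtract that update_zero_vruntime()
performs on sum_w_vruntime in the patch below; the shift itself needs no
division.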

Additionally, have the tick call avg_vruntime() -- discarding the result,
purely for the side effect of updating zero_vruntime.

While there, optimize avg_vruntime() by noting that the average of a single
value is trivial to compute: with one runnable entity the weighted average is
just that entity's vruntime, so the division can be skipped entirely.

Test case (pin the shell to CPU 1, start a lone busy loop on CPU 2, then pull
CPU 2's zero_vruntime and current-task lines from the sched debug output):
# taskset -c -p 1 $$
# taskset -c 2 bash -c 'while :; do :; done&'
# cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"

PRE (the gap between the task's vruntime and zero_vruntime keeps growing):
.zero_vruntime : 31316.407903
>R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 382548.253179
>R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /

POST (zero_vruntime tracks the lone task's vruntime exactly):
.zero_vruntime : 17259.709467
>R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 18702.723356
>R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /

Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260219080624.438854780%40infradead.org

+57 -27
kernel/sched/fair.c
···
 	return vruntime_cmp(a->deadline, "<", b->deadline);
 }
 
+/*
+ * Per avg_vruntime() below, cfs_rq::zero_vruntime is only slightly stale
+ * and this value should be no more than two lag bounds. Which puts it in the
+ * general order of:
+ *
+ * (slice + TICK_NSEC) << NICE_0_LOAD_SHIFT
+ *
+ * which is around 44 bits in size (on 64bit); that is 20 for
+ * NICE_0_LOAD_SHIFT, another 20 for NSEC_PER_MSEC and then a handful for
+ * however many msec the actual slice+tick ends up being.
+ *
+ * (disregarding the actual divide-by-weight part makes for the worst case
+ * weight of 2, which nicely cancels vs the fuzz in zero_vruntime not actually
+ * being the zero-lag point).
+ */
 static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime);
···
 }
 
 static inline
-void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
 {
 	/*
-	 * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight
+	 * v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d*sum_weight
 	 */
 	cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
+	cfs_rq->zero_vruntime += delta;
 }
 
 /*
- * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
+ * Specifically: avg_vruntime() + 0 must result in entity_eligible() := true
  * For this to be so, the result of this function must have a left bias.
+ *
+ * Called in:
+ * - place_entity() -- before enqueue
+ * - update_entity_lag() -- before dequeue
+ * - entity_tick()
+ *
+ * This means it is one entry 'behind' but that puts it close enough to where
+ * the bound on entity_key() is at most two lag bounds.
  */
 u64 avg_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
-	long load = cfs_rq->sum_weight;
+	long weight = cfs_rq->sum_weight;
+	s64 delta = 0;
 
-	if (curr && curr->on_rq) {
-		unsigned long weight = scale_load_down(curr->load.weight);
+	if (curr && !curr->on_rq)
+		curr = NULL;
 
-		avg += entity_key(cfs_rq, curr) * weight;
-		load += weight;
-	}
+	if (weight) {
+		s64 runtime = cfs_rq->sum_w_vruntime;
 
-	if (load) {
+		if (curr) {
+			unsigned long w = scale_load_down(curr->load.weight);
+
+			runtime += entity_key(cfs_rq, curr) * w;
+			weight += w;
+		}
+
 		/* sign flips effective floor / ceiling */
-		if (avg < 0)
-			avg -= (load - 1);
-		avg = div_s64(avg, load);
+		if (runtime < 0)
+			runtime -= (weight - 1);
+
+		delta = div_s64(runtime, weight);
+	} else if (curr) {
+		/*
+		 * When there is but one element, it is the average.
+		 */
+		delta = curr->vruntime - cfs_rq->zero_vruntime;
 	}
 
-	return cfs_rq->zero_vruntime + avg;
+	update_zero_vruntime(cfs_rq, delta);
+
+	return cfs_rq->zero_vruntime;
 }
 
 /*
···
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	return vruntime_eligible(cfs_rq, se->vruntime);
-}
-
-static void update_zero_vruntime(struct cfs_rq *cfs_rq)
-{
-	u64 vruntime = avg_vruntime(cfs_rq);
-	s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
-
-	sum_w_vruntime_update(cfs_rq, delta);
-
-	cfs_rq->zero_vruntime = vruntime;
 }
 
 static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
···
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	sum_w_vruntime_add(cfs_rq, se);
-	update_zero_vruntime(cfs_rq);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
···
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	sum_w_vruntime_sub(cfs_rq, se);
-	update_zero_vruntime(cfs_rq);
 }
 
 struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
···
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+
+	/*
+	 * Pulls along cfs_rq::zero_vruntime.
+	 */
+	avg_vruntime(cfs_rq);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*