sched: psi: fix bogus pressure spikes from aggregation race

Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Johannes Weiner and committed by

Linus Torvalds 2 years ago 3840cbe2 8c245fe7

+12 -14

1 changed file

expand all

kernel

sched

psi.c

+12 -14

kernel/sched/psi.c

··· 769 769 } 770 770 771 771 static void psi_group_change(struct psi_group *group, int cpu, 772 - unsigned int clear, unsigned int set, u64 now, 772 + unsigned int clear, unsigned int set, 773 773 bool wake_clock) 774 774 { 775 775 struct psi_group_cpu *groupc; 776 776 unsigned int t, m; 777 777 u32 state_mask; 778 + u64 now; 778 779 779 780 lockdep_assert_rq_held(cpu_rq(cpu)); 780 781 groupc = per_cpu_ptr(group->pcpu, cpu); ··· 790 789 * SOME and FULL time these may have resulted in. 791 790 */ 792 791 write_seqcount_begin(&groupc->seq); 792 + now = cpu_clock(cpu); 793 793 794 794 /* 795 795 * Start with TSK_ONCPU, which doesn't have a corresponding ··· 901 899 { 902 900 int cpu = task_cpu(task); 903 901 struct psi_group *group; 904 - u64 now; 905 902 906 903 if (!task->pid) 907 904 return; 908 905 909 906 psi_flags_change(task, clear, set); 910 907 911 - now = cpu_clock(cpu); 912 - 913 908 group = task_psi_group(task); 914 909 do { 915 - psi_group_change(group, cpu, clear, set, now, true); 910 + psi_group_change(group, cpu, clear, set, true); 916 911 } while ((group = group->parent)); 917 912 } 918 913 ··· 918 919 { 919 920 struct psi_group *group, *common = NULL; 920 921 int cpu = task_cpu(prev); 921 - u64 now = cpu_clock(cpu); 922 922 923 923 if (next->pid) { 924 924 psi_flags_change(next, 0, TSK_ONCPU); ··· 934 936 break; 935 937 } 936 938 937 - psi_group_change(group, cpu, 0, TSK_ONCPU, now, true); 939 + psi_group_change(group, cpu, 0, TSK_ONCPU, true); 938 940 } while ((group = group->parent)); 939 941 } 940 942 ··· 972 974 do { 973 975 if (group == common) 974 976 break; 975 - psi_group_change(group, cpu, clear, set, now, wake_clock); 977 + psi_group_change(group, cpu, clear, set, wake_clock); 976 978 } while ((group = group->parent)); 977 979 978 980 /* ··· 984 986 if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { 985 987 clear &= ~TSK_ONCPU; 986 988 for (; group; group = group->parent) 987 - psi_group_change(group, cpu, clear, set, now, wake_clock); 989 + psi_group_change(group, cpu, clear, set, wake_clock); 988 990 } 989 991 } 990 992 } ··· 995 997 int cpu = task_cpu(curr); 996 998 struct psi_group *group; 997 999 struct psi_group_cpu *groupc; 998 - u64 now, irq; 999 1000 s64 delta; 1001 + u64 irq; 1000 1002 1001 1003 if (static_branch_likely(&psi_disabled)) 1002 1004 return; ··· 1009 1011 if (prev && task_psi_group(prev) == group) 1010 1012 return; 1011 1013 1012 - now = cpu_clock(cpu); 1013 1014 irq = irq_time_read(cpu); 1014 1015 delta = (s64)(irq - rq->psi_irq_time); 1015 1016 if (delta < 0) ··· 1016 1019 rq->psi_irq_time = irq; 1017 1020 1018 1021 do { 1022 + u64 now; 1023 + 1019 1024 if (!group->enabled) 1020 1025 continue; 1021 1026 1022 1027 groupc = per_cpu_ptr(group->pcpu, cpu); 1023 1028 1024 1029 write_seqcount_begin(&groupc->seq); 1030 + now = cpu_clock(cpu); 1025 1031 1026 1032 record_times(groupc, now); 1027 1033 groupc->times[PSI_IRQ_FULL] += delta; ··· 1223 1223 for_each_possible_cpu(cpu) { 1224 1224 struct rq *rq = cpu_rq(cpu); 1225 1225 struct rq_flags rf; 1226 - u64 now; 1227 1226 1228 1227 rq_lock_irq(rq, &rf); 1229 - now = cpu_clock(cpu); 1230 - psi_group_change(group, cpu, 0, 0, now, true); 1228 + psi_group_change(group, cpu, 0, 0, true); 1231 1229 rq_unlock_irq(rq, &rf); 1232 1230 } 1233 1231 }

Configure Feed

Configure Feed