Merge tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
"A set of scheduler updates:

- Prevent PSI state corruption when schedule() races with cgroup
move.

A recent commit combined two PSI callbacks to reduce the number of
cgroup tree updates, but missed that schedule() can drop rq::lock
for load balancing, which opens the race window for
cgroup_move_task() which then observes half updated state.

The fix is to use only task::psi_flags instead of looking at the
potentially mismatching scheduler state.

- Prevent an out-of-bounds access in uclamp caused by a rounding
division which can lead to an off-by-one error exceeding the
buckets array size.

- Prevent unfairness caused by missing load decay when a task is
attached to a cfs runqueue.

The old load of the task was attached to the runqueue and never
removed. Fix it by enforcing the load update through the hierarchy
for unthrottled run queue instances.

- A documentation fix for the 'sched_verbose' command line option"

* tag 'sched-urgent-2021-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix unfairness caused by missing load decay
sched: Fix out-of-bound access in uclamp
psi: Fix psi state corruption when schedule() races with cgroup move
sched,doc: sched_debug_verbose cmdline should be sched_verbose

Linus Torvalds 5 years ago 9819f682 732a27a0

+37 -15

4 changed files

Documentation/scheduler/sched-domains.rst
kernel/sched/core.c
kernel/sched/fair.c
kernel/sched/psi.c

+1 -1

Documentation/scheduler/sched-domains.rst

@@ -74,7 +74,7 @@
 calling set_sched_topology() with this array as the parameter.
 
 The sched-domains debugging infrastructure can be enabled by enabling
-CONFIG_SCHED_DEBUG and adding 'sched_debug_verbose' to your cmdline. If you
+CONFIG_SCHED_DEBUG and adding 'sched_verbose' to your cmdline. If you
 forgot to tweak your cmdline, you can also flip the
 /sys/kernel/debug/sched/verbose knob. This enables an error checking parse of
 the sched domains which should catch most possible errors (described above). It

+1 -1

kernel/sched/core.c

@@ -938,7 +938,7 @@
 
 static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
 {
-	return clamp_value / UCLAMP_BUCKET_DELTA;
+	return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
 }
 
 static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
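The off-by-one can be reproduced with plain integer arithmetic. The following is a userspace sketch, not kernel code: it assumes a configuration with 20 buckets and a capacity scale of 1024, uses a simplified unsigned-only form of the rounding division, and the function names `bucket_id_unclamped` and `bucket_id_clamped` are made up for illustration.

```c
/* Assumed configuration for illustration: capacity scale 1024, 20 buckets. */
#define SCHED_CAPACITY_SCALE 1024u
#define UCLAMP_BUCKETS       20u

/* Simplified rounding division for unsigned operands only. */
#define DIV_ROUND_CLOSEST(x, d) (((x) + ((d) / 2)) / (d))

/* Bucket width obtained by rounding to nearest: (1024 + 10) / 20 = 51. */
#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)

/* Pre-fix mapping: the quotient can equal UCLAMP_BUCKETS. */
static unsigned int bucket_id_unclamped(unsigned int clamp_value)
{
    return clamp_value / UCLAMP_BUCKET_DELTA;
}

/* Post-fix mapping: the quotient is capped at the last valid index. */
static unsigned int bucket_id_clamped(unsigned int clamp_value)
{
    unsigned int id = clamp_value / UCLAMP_BUCKET_DELTA;

    return id < UCLAMP_BUCKETS - 1 ? id : UCLAMP_BUCKETS - 1;
}
```

Under these assumed values the bucket width rounds down to 51, so a clamp value of 1024 maps to index 20 in the unclamped version, one past the valid range 0..19. The clamped version returns 19 for the same input.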

+9 -3

kernel/sched/fair.c

@@ -10878,16 +10878,22 @@
 {
 	struct cfs_rq *cfs_rq;
 
+	list_add_leaf_cfs_rq(cfs_rq_of(se));
+
 	/* Start to propagate at parent */
 	se = se->parent;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
+		if (!cfs_rq_throttled(cfs_rq)){
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			list_add_leaf_cfs_rq(cfs_rq);
+			continue;
+		}
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		if (list_add_leaf_cfs_rq(cfs_rq))
+			break;
 	}
 }
 #else
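The change in control flow can be compared in a toy model. This is only a sketch under loud assumptions: `struct toy_cfs_rq`, `toy_add_leaf()`, `toy_propagate_old()` and `toy_propagate_new()` are hypothetical, the walk follows a chain of run queues rather than scheduling entities, and `toy_add_leaf()` reduces the leaf-list bookkeeping to "mark as listed, report whether the parent is already listed or absent". It is not the kernel's list logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a cfs runqueue. */
struct toy_cfs_rq {
    struct toy_cfs_rq *parent;
    bool throttled;
    bool on_list;      /* would this rq be visited by periodic load decay? */
    int load_updates;  /* how often the load update reached this level */
};

/*
 * Assumed simplification of the leaf-list insertion: mark the rq as
 * listed and return true when the branch is connected to the tree.
 */
static bool toy_add_leaf(struct toy_cfs_rq *rq)
{
    if (rq->on_list)
        return true;
    rq->on_list = true;
    return rq->parent == NULL || rq->parent->on_list;
}

/* Shape of the old loop: stop at the first throttled level. */
static void toy_propagate_old(struct toy_cfs_rq *leaf)
{
    for (struct toy_cfs_rq *rq = leaf->parent; rq; rq = rq->parent) {
        if (rq->throttled)
            break;
        rq->load_updates++;
    }
}

/* Shape of the new loop: update every unthrottled level, list all levels. */
static void toy_propagate_new(struct toy_cfs_rq *leaf)
{
    toy_add_leaf(leaf);

    for (struct toy_cfs_rq *rq = leaf->parent; rq; rq = rq->parent) {
        if (!rq->throttled) {
            rq->load_updates++;
            toy_add_leaf(rq);
            continue;
        }
        /* Throttled: skip the update, stop only once the branch is connected. */
        if (toy_add_leaf(rq))
            break;
    }
}
```

For a three-level chain (leaf, throttled middle, root) with nothing listed yet, the old shape never reaches the root and lists nothing. The new shape lists all three levels and updates the root once, while still skipping the update on the throttled level.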

+26 -10

kernel/sched/psi.c

@@ -972,7 +972,7 @@
  */
 void cgroup_move_task(struct task_struct *task, struct css_set *to)
 {
-	unsigned int task_flags = 0;
+	unsigned int task_flags;
 	struct rq_flags rf;
 	struct rq *rq;
 
@@ -987,15 +987,31 @@
 
 	rq = task_rq_lock(task, &rf);
 
-	if (task_on_rq_queued(task)) {
-		task_flags = TSK_RUNNING;
-		if (task_current(rq, task))
-			task_flags |= TSK_ONCPU;
-	} else if (task->in_iowait)
-		task_flags = TSK_IOWAIT;
-
-	if (task->in_memstall)
-		task_flags |= TSK_MEMSTALL;
+	/*
+	 * We may race with schedule() dropping the rq lock between
+	 * deactivating prev and switching to next. Because the psi
+	 * updates from the deactivation are deferred to the switch
+	 * callback to save cgroup tree updates, the task's scheduling
+	 * state here is not coherent with its psi state:
+	 *
+	 * schedule()                   cgroup_move_task()
+	 *   rq_lock()
+	 *   deactivate_task()
+	 *     p->on_rq = 0
+	 *     psi_dequeue() // defers TSK_RUNNING & TSK_IOWAIT updates
+	 *   pick_next_task()
+	 *     rq_unlock()
+	 *                                rq_lock()
+	 *                                psi_task_change() // old cgroup
+	 *                                task->cgroups = to
+	 *                                psi_task_change() // new cgroup
+	 *                                rq_unlock()
+	 *     rq_lock()
+	 *   psi_sched_switch() // does deferred updates in new cgroup
+	 *
+	 * Don't rely on the scheduling state. Use psi_flags instead.
+	 */
+	task_flags = task->psi_flags;
 
 	if (task_flags)
 		psi_task_change(task, task_flags, 0);
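The interleaving described in the comment can be replayed sequentially in a toy model. This is a sketch under stated assumptions, not the kernel's PSI implementation: all `toy_*` names are hypothetical, only a single `TSK_RUNNING` flag with a made-up value is tracked, and the per-group counter is a signed `int` so that the imbalance shows up as -1 rather than as an unsigned wraparound.

```c
#include <stdbool.h>

#define TSK_RUNNING 1u  /* toy flag value, not the kernel's bit layout */

struct toy_group { int nr_running; };

struct toy_task {
    bool on_rq;               /* scheduler state */
    unsigned int psi_flags;   /* what PSI has accounted for this task */
    struct toy_group *group;
};

struct toy_result { int old_nr_running; int new_nr_running; };

/* Clear and set flags, charging the task's current group. */
static void toy_psi_change(struct toy_task *t, unsigned int clear, unsigned int set)
{
    if (clear & TSK_RUNNING)
        t->group->nr_running--;
    if (set & TSK_RUNNING)
        t->group->nr_running++;
    t->psi_flags = (t->psi_flags & ~clear) | set;
}

/* Move the task, deriving the flags either from psi_flags or from on_rq. */
static void toy_move(struct toy_task *t, struct toy_group *to, bool use_psi_flags)
{
    unsigned int flags = use_psi_flags ? t->psi_flags
                                       : (t->on_rq ? TSK_RUNNING : 0u);

    if (flags)
        toy_psi_change(t, flags, 0);   /* leave old group */
    t->group = to;
    if (flags)
        toy_psi_change(t, 0, flags);   /* enter new group */
}

/* Replay: deactivate (PSI update deferred), move, then deferred update. */
static struct toy_result toy_race(bool use_psi_flags)
{
    struct toy_group old_grp = { 0 }, new_grp = { 0 };
    struct toy_task t = { .on_rq = true, .psi_flags = 0, .group = &old_grp };

    toy_psi_change(&t, 0, TSK_RUNNING);     /* task runnable in old group */

    t.on_rq = false;                        /* deactivation, PSI deferred */
    toy_move(&t, &new_grp, use_psi_flags);  /* runs while rq lock is dropped */
    toy_psi_change(&t, TSK_RUNNING, 0);     /* deferred update at switch */

    return (struct toy_result){ old_grp.nr_running, new_grp.nr_running };
}
```

In this model, deriving the flags from `on_rq` leaves the old group stuck at 1 and drives the new group to -1, because the move sees a task that looks dequeued while PSI still counts it as running. Using `psi_flags` moves the accounted state along with the task, and both groups end at 0.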
