sched: Reorder some fields in struct rq

This colocates some hot fields in "struct rq" to be on the same cache line
as others that are often accessed at the same time or in similar ways.

Using data from a Google-internal fleet-scale profiler, I found three
distinct groups of hot fields in struct rq:

- (1) The runqueue lock: __lock.

- (2) Those accessed from hot code in pick_next_task_fair():
      nr_running, nr_numa_running, nr_preferred_running,
      ttwu_pending, cpu_capacity, curr, idle.

- (3) Those accessed from some other hot codepaths, e.g.
      update_curr(), update_rq_clock(), and scheduler_tick():
      clock_task, clock_pelt, clock, lost_idle_time,
      clock_update_flags, clock_pelt_idle, clock_idle.

The cycles spent on accessing these different groups of fields broke down
roughly as follows:

- 50% on group (1) (the runqueue lock, always read-write)
- 39% on group (2) (load:store ratio around 38:1)
- 8% on group (3) (load:store ratio around 5:1)
- 3% on all the other fields

Most of the fields in group (3) are already in a cache line grouping; this
patch just adds "clock" and "clock_update_flags" to that group. The fields
in group (2) are scattered across several cache lines; the main effect of
this patch is to group them together, on a single line at the beginning of
the structure. A few other less performance-critical fields (nr_switches,
numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
were also reordered to reduce holes in the data structure.

The runqueue lock is acquired from many different contexts and is almost
always accessed using an atomic operation. Putting it on either of the
cache lines for groups (2) or (3) would slow down accesses to those
read-mostly fields dramatically, because each lock operation would pull
the line away from the CPUs that are only reading it.
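The layout idea described above can be illustrated in userspace. This is a minimal sketch, not the kernel's definition: "struct demo_rq", CACHE_LINE, cache_line_of() and demo_layout_ok() are hypothetical names, the lock is an int stand-in for raw_spinlock_t, and C11 _Alignas plays the role of "____cacheline_aligned".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed line size; some machines use 128 */

/* Hypothetical stand-in for struct rq; only the layout idea is real. */
struct demo_rq {
	/* Read-mostly hot group, packed at the start of the structure. */
	unsigned int	nr_running;
	unsigned int	ttwu_pending;
	unsigned long	cpu_capacity;
	void		*curr;
	void		*idle;

	/* Start a fresh cache line so the lock stays away from the group. */
	_Alignas(CACHE_LINE) uint64_t nr_switches;
	int		lock;	/* stand-in for raw_spinlock_t */
};

/* Index of the cache line that holds the byte at the given offset. */
static inline size_t cache_line_of(size_t offset)
{
	return offset / CACHE_LINE;
}

/* Returns 1 when the hot group sits on line 0 and the lock does not. */
static inline int demo_layout_ok(void)
{
	return cache_line_of(offsetof(struct demo_rq, nr_running)) == 0 &&
	       cache_line_of(offsetof(struct demo_rq, idle)) == 0 &&
	       cache_line_of(offsetof(struct demo_rq, lock)) == 1;
}
```

A check like demo_layout_ok() only looks at offsets within the structure; it says nothing about how the object itself is aligned in memory, which has to be handled separately.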

To test this, I wrote a focused load test that would put load on the
pick_next_task_fair() path. A parent process would fork many child
processes, and each child would nanosleep() for 1 msec many times in a
loop. The load test was monitored with "perf", and I looked at the number
of cycles that were spent with sched_balance_rq() on the stack. The test
was reliably spending ~5% of all of its cycles there. I ran it 100 times
on a pair of 2-socket Intel Haswell machines (72 vCPUs per machine) - one
running the tip of sched/core, the other running this change - using 360
child processes and 8192 1-msec sleeps per child. The mean cycle count
dropped from 5.14B to 4.91B, or a *4.6% decrease* in relevant scheduler
cycles.
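The shape of the load test described above might look roughly like the following. This is a sketch under assumptions, not the program that was actually run: child_loop() and run_load_test() are hypothetical names, and error handling is reduced to the minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Child body: sleep for 1 msec, "sleeps" times in a row. */
static void child_loop(int sleeps)
{
	struct timespec one_msec = { .tv_sec = 0, .tv_nsec = 1000000 };

	for (int i = 0; i < sleeps; i++)
		nanosleep(&one_msec, NULL);
}

/*
 * Fork "children" processes that each run child_loop(), then reap them.
 * Returns the number of children that exited cleanly, or -1 if a fork()
 * failed part-way through.
 */
static int run_load_test(int children, int sleeps)
{
	int forked = 0;
	int clean = 0;

	for (int i = 0; i < children; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			child_loop(sleeps);
			_exit(0);
		}
		forked++;
	}

	/* Reap every child that was started, even after a failed fork(). */
	for (int i = 0; i < forked; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			clean++;
	}

	return forked == children ? clean : -1;
}
```

With the parameters quoted above, the call would be run_load_test(360, 8192), made from a small main() and profiled with "perf" while it runs.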

Given that this change reduces cache misses in a very hot kernel codepath,
there's likely to be additional application performance improvement due to
reduced cache conflicts from kernel data structures.

On a Power11 system with 128-byte cache lines, my test showed a ~5%
decrease in relevant scheduler cycles, along with a slight increase in user
time - both positive indicators. This data comes from:
https://lore.kernel.org/lkml/affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com/
This holds even though the new "____cacheline_aligned", which puts the
runqueue lock on the next cache line, adds 64 more bytes of padding on
those machines. This patch does not change the size of
"struct rq" on machines with 64-byte cache lines.
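One way to read the padding remark above is to compare the gap between the hot group and the next aligned member for the two line sizes. The sketch below is an illustration under assumptions: HOT_GROUP, "struct rq_line64", "struct rq_line128" and the helper names are hypothetical, the member types are simplified, and C11 _Alignas stands in for "____cacheline_aligned".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the hot group; not the kernel's definition. */
#define HOT_GROUP					\
	unsigned int	nr_running;			\
	unsigned int	nr_numa_running;		\
	unsigned int	nr_preferred_running;		\
	unsigned int	ttwu_pending;			\
	unsigned long	cpu_capacity;			\
	void		*curr;				\
	void		*idle;

/* Same members, with the lock line aligned for 64-byte cache lines. */
struct rq_line64 {
	HOT_GROUP
	_Alignas(64) uint64_t nr_switches;
};

/* Same members, with the lock line aligned for 128-byte cache lines. */
struct rq_line128 {
	HOT_GROUP
	_Alignas(128) uint64_t nr_switches;
};

/* Bytes occupied by the hot group, which ends with "idle". */
#define HOT_BYTES(type)	(offsetof(type, idle) + sizeof(void *))

/* Unused bytes between the end of the hot group and nr_switches. */
#define LINE_PAD(type)	(offsetof(type, nr_switches) - HOT_BYTES(type))

static inline size_t padding_64(void)
{
	return LINE_PAD(struct rq_line64);
}

static inline size_t padding_128(void)
{
	return LINE_PAD(struct rq_line128);
}
```

The hot group is smaller than 64 bytes here, so the aligned member lands at offset 64 in one structure and at offset 128 in the other; padding_128() is therefore 64 bytes larger than padding_64(), which matches the figure quoted above for this simplified layout.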

I also ran "hackbench" to try to test this change, but it didn't show
conclusive results. Looking at a CPU cycle profile of the hackbench run,
it was spending 95% of its cycles inside __alloc_skb(), __kfree_skb(), or
kmem_cache_free() - almost all of which was spent updating memcg counters
or contending on the list_lock in kmem_cache_node. And it spent less than
0.5% of its cycles inside either schedule() or try_to_wake_up(). So it's
not surprising that it didn't show useful results here.

The "__no_randomize_layout" was added to reflect the fact that performance
of code that references this data structure is unusually sensitive to
placement of its members.

Signed-off-by: Blake Jones <blakejones@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reviewed-by: Josh Don <joshdon@google.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251202023743.1524247-1-blakejones@google.com

Authored by Blake Jones and committed by Peter Zijlstra 5 months ago. 89951fc1 55b39b0c

+47 -28

1 changed file

kernel/sched/sched.h

···
  * acquire operations must be ordered by ascending &runqueue.
  */
 struct rq {
-	/* runqueue lock: */
-	raw_spinlock_t		__lock;
-
+	/*
+	 * The following members are loaded together, without holding the
+	 * rq->lock, in an extremely hot loop in update_sg_lb_stats()
+	 * (called from pick_next_task()). To reduce cache pollution from
+	 * this operation, they are placed together on this dedicated cache
+	 * line. Even though some of them are frequently modified, they are
+	 * loaded much more frequently than they are stored.
+	 */
 	unsigned int		nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
-	unsigned int		numa_migrate_on;
 #endif
+	unsigned int		ttwu_pending;
+	unsigned long		cpu_capacity;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*donor;  /* Scheduling context */
+	struct task_struct __rcu	*curr;   /* Execution context */
+#else
+	union {
+		struct task_struct __rcu *donor; /* Scheduler context */
+		struct task_struct __rcu *curr;  /* Execution context */
+	};
+#endif
+	struct task_struct	*idle;
+	/* padding left here deliberately */
+
+	/*
+	 * The next cacheline holds the (hot) runqueue lock, as well as
+	 * some other less performance-critical fields.
+	 */
+	u64			nr_switches	____cacheline_aligned;
+
+	/* runqueue lock: */
+	raw_spinlock_t		__lock;
+
 #ifdef CONFIG_NO_HZ_COMMON
-	unsigned long		last_blocked_load_update_tick;
-	unsigned int		has_blocked_load;
-	call_single_data_t	nohz_csd;
 	unsigned int		nohz_tick_stopped;
 	atomic_t		nohz_flags;
+	unsigned int		has_blocked_load;
+	unsigned long		last_blocked_load_update_tick;
+	call_single_data_t	nohz_csd;
 #endif /* CONFIG_NO_HZ_COMMON */
-
-	unsigned int		ttwu_pending;
-	u64			nr_switches;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/* Utilization clamp values based on CPU's RUNNABLE tasks */
···
 	struct list_head	*tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int		numa_migrate_on;
+#endif
 	/*
 	 * This is part of a global counter where only the total sum
 	 * over all CPUs matters. A task can increase this counter on
···
 	 */
 	unsigned long		nr_uninterruptible;
 
-#ifdef CONFIG_SCHED_PROXY_EXEC
-	struct task_struct __rcu	*donor;  /* Scheduling context */
-	struct task_struct __rcu	*curr;   /* Execution context */
-#else
-	union {
-		struct task_struct __rcu *donor; /* Scheduler context */
-		struct task_struct __rcu *curr;  /* Execution context */
-	};
-#endif
 	struct sched_dl_entity	*dl_server;
-	struct task_struct	*idle;
 	struct task_struct	*stop;
 	const struct sched_class *next_class;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;
 
-	unsigned int		clock_update_flags;
-	u64			clock;
-	/* Ensure that all clocks are in the same cache line */
+	/*
+	 * The following fields of clock data are frequently referenced
+	 * and updated together, and should go on their own cache line.
+	 */
 	u64			clock_task ____cacheline_aligned;
 	u64			clock_pelt;
+	u64			clock;
 	unsigned long		lost_idle_time;
+	unsigned int		clock_update_flags;
 	u64			clock_pelt_idle;
 	u64			clock_idle;
+
 #ifndef CONFIG_64BIT
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
-
-	atomic_t		nr_iowait;
 
 	u64			last_seen_need_resched_ns;
 	int			ticks_without_resched;
···
 
 	struct root_domain	*rd;
 	struct sched_domain __rcu	*sd;
-
-	unsigned long		cpu_capacity;
 
 	struct balance_callback *balance_callback;
 
···
 	call_single_data_t	cfsb_csd;
 	struct list_head	cfsb_csd_list;
 #endif
-};
+
+	atomic_t		nr_iowait;
+} __no_randomize_layout;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
