.. SPDX-License-Identifier: GPL-2.0

=============
CPU Isolation
=============

Introduction
============

"CPU Isolation" means dedicating a CPU exclusively to a given workload,
without any undesired interference from the kernel.

These interferences, commonly referred to as "noise", can be triggered
by asynchronous events (interrupts, timers, scheduler preemption by
workqueues and kthreads, ...) or synchronous events (syscalls and page
faults).

Such noise usually goes unnoticed. After all, synchronous events are a
component of the requested kernel service. And asynchronous events are
either sufficiently well distributed by the scheduler when executed as
tasks, or reasonably fast when executed as interrupts. The timer
interrupt can even fire 1024 times per second without a significant
and measurable impact most of the time.

However some rare and extreme workloads can be quite sensitive to
those kinds of noise. This is the case, for example, with high
bandwidth network processing that can't afford to lose a single
packet, or with very low latency network processing. Typically those
use cases involve DPDK, bypassing the kernel networking stack and
accessing the networking device directly from userspace.

In order to run a CPU with no kernel noise, or with limited kernel
noise, the related housekeeping work needs to be either shut down,
migrated or offloaded.

Housekeeping
============

In CPU isolation terminology, housekeeping is the work, often
asynchronous, that the kernel needs to perform in order to maintain
all of its services. It matches the noise and disturbances enumerated
above, except that when at least one CPU is isolated, housekeeping may
make use of further coping mechanisms if CPU-tied work must be
offloaded.

Housekeeping CPUs are the non-isolated CPUs, to which the kernel noise
is moved away from the isolated CPUs.

The isolation can be implemented in several ways depending on the
nature of the noise:

- Unbound work, where "unbound" means not tied to any CPU, can simply
  be migrated away from isolated CPUs to housekeeping CPUs. This is
  the case for unbound workqueues, kthreads and timers.

- Bound work, where "bound" means tied to a specific CPU, usually
  can't be moved away as-is by nature. Either:

  - The work must switch to a locked implementation. This is the case
    for RCU with CONFIG_RCU_NOCB_CPU.

  - The related feature must be shut down and considered incompatible
    with isolated CPUs. E.g.: the lockup watchdog, unreliable
    clocksources, etc.

  - An elaborate and heavyweight coping mechanism stands as a
    replacement. E.g.: the timer tick is shut down on nohz_full CPUs,
    but with the constraint of running a single task on them. A
    significant cost penalty is added on kernel entry/exit and a
    residual 1Hz scheduler tick is offloaded to housekeeping CPUs.

In any case, housekeeping work has to be handled, which is why there
must be at least one housekeeping CPU in the system, preferably more
if the machine has many CPUs, for example one per node on NUMA
systems.

CPU isolation also often implies a tradeoff between noise-free
isolated CPUs and added overhead on housekeeping CPUs, sometimes even
on isolated CPUs entering the kernel.

Isolation features
==================

Different levels of isolation can be configured in the kernel, each of
which has its own drawbacks and tradeoffs.

Scheduler domain isolation
--------------------------

This feature isolates a CPU from the scheduler topology. As a result,
the target CPU isn't part of load balancing. Tasks won't migrate
either from or to it unless explicitly affined.

As a side effect, the CPU is also isolated from unbound workqueues and
unbound kthreads.

Requirements
~~~~~~~~~~~~

- CONFIG_CPUSETS=y for the cpusets-based interface

Tradeoffs
~~~~~~~~~

By nature, the overall system load is less evenly distributed, since
some CPUs are excluded from global load balancing.

Interfaces
~~~~~~~~~~

- The cpuset isolated partitions described in
  Documentation/admin-guide/cgroup-v2.rst are recommended because
  they are tunable at runtime.

- The "isolcpus=" kernel boot parameter with the "domain" flag is a
  less flexible alternative that doesn't allow for runtime
  reconfiguration.

IRQ isolation
-------------

Isolate the IRQs whenever possible, so that they don't fire on the
target CPUs.

Interfaces
~~~~~~~~~~

- The /proc/irq/\*/smp_affinity files, as explained in detail in
  Documentation/core-api/irq/irq-affinity.rst.

- The "irqaffinity=" kernel boot parameter for a default setting.

- The "managed_irq" flag of the "isolcpus=" kernel boot parameter,
  which tries a best-effort affinity override for managed IRQs.

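As an illustration of the first interface, the sketch below steers a
hypothetical IRQ (number 42 here, purely an example; pick a real one
from /proc/interrupts) onto housekeeping CPUs 0-6, whose affinity
bitmask is 0x7f:

```shell
#!/bin/sh
# Affinity bitmask covering CPUs 0-6: bits 0..6 set, i.e. 0x7f.
mask=$(printf '%x' $(( (1 << 7) - 1 )))
echo "housekeeping mask: $mask"

# Hypothetical IRQ number; adjust to an IRQ listed in /proc/interrupts.
irq=42

# Requires root; some IRQs (e.g. per-CPU ones) reject affinity changes.
if [ -w "/proc/irq/$irq/smp_affinity" ]; then
	echo "$mask" > "/proc/irq/$irq/smp_affinity"
fi
```

Note that managed IRQs have their affinity computed by the kernel and
ignore such writes, which is what the "managed_irq" flag addresses.
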
Full Dynticks (aka nohz_full)
-----------------------------

Full dynticks extends the dynticks idle mode, which stops the tick
when the CPU is idle, to CPUs running a single task in userspace. That
is, the timer tick is stopped whenever the environment allows it.

Global timer callbacks are also isolated from the nohz_full CPUs.

Requirements
~~~~~~~~~~~~

- CONFIG_NO_HZ_FULL=y

Constraints
~~~~~~~~~~~

- The isolated CPUs must run a single task only. Multitasking requires
  the tick to maintain preemption. This is usually fine since such
  workloads usually can't stand the latency of random context
  switches.

- No calls to the kernel should be made from isolated CPUs, at the
  risk of triggering random noise.

- No use of POSIX CPU timers on isolated CPUs.

- The architecture must have a stable and reliable clocksource (no
  unreliable TSC that requires the watchdog).

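To verify the clocksource constraint, the current clocksource can be
read from sysfs. This is only a sketch; the path is the usual sysfs
location but may be absent on some configurations:

```shell
#!/bin/sh
F=/sys/devices/system/clocksource/clocksource0/current_clocksource
# Fall back to "unknown" if the sysfs file is missing.
if [ -r "$F" ]; then
	cs=$(cat "$F")
else
	cs=unknown
fi
echo "current clocksource: $cs"
```

On x86, a stable TSC typically shows up here as "tsc".
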
Tradeoffs
~~~~~~~~~

In terms of cost, this is the most invasive isolation feature. It is
assumed to be used when the workload spends most of its time in
userspace and doesn't rely on the kernel except for preparatory work,
because:

- RCU adds more overhead due to the locked, offloaded and threaded
  callback processing (the same behaviour as with the "rcu_nocbs"
  boot parameter).

- Kernel entry/exit through syscalls, exceptions and IRQs is more
  costly due to the fully ordered RmW operations that maintain
  userspace as an RCU extended quiescent state. Also the CPU time is
  accounted at kernel boundaries instead of periodically from the
  tick.

- Housekeeping CPUs must run a residual 1Hz remote scheduler tick on
  behalf of the isolated CPUs.

Checklist
=========

You have set up each of the above isolation features but you still
observe jitter that trashes your workload? Make sure to check a few
elements before proceeding.

Some of these checklist items are similar to those for real-time
workloads:

- Use mlock() to prevent your pages from being swapped out. Page
  faults are usually not compatible with jitter sensitive workloads.

- Avoid SMT to prevent your hardware thread from being "preempted"
  by another one.

- CPU frequency changes may induce subtle sorts of jitter in a
  workload. Cpufreq should be used and tuned with caution.

- Deep C-states may result in latency issues upon wake-up. If this
  happens to be a problem, C-states can be limited via kernel boot
  parameters such as processor.max_cstate or intel_idle.max_cstate.
  More fine-grained tunings are described in
  Documentation/admin-guide/pm/cpuidle.rst.

- Your system may be subject to firmware-originating interrupts; x86
  has System Management Interrupts (SMIs) for example. Check your
  system BIOS for settings that disable such interference, and with
  some luck your vendor will have BIOS tuning guidance for low-latency
  operation.

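As a quick check for the SMT item above, recent kernels expose an SMT
control file in sysfs; a minimal sketch, assuming that interface is
present:

```shell
#!/bin/sh
F=/sys/devices/system/cpu/smt/control
# Possible values include "on", "off", "forceoff" and "notsupported".
if [ -r "$F" ]; then
	smt=$(cat "$F")
else
	smt=unavailable
fi
echo "SMT control: $smt"
```
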
Full isolation example
======================

In this example, the system has 8 CPUs and the 8th one is to be fully
isolated. Since CPU numbering starts from 0, the 8th CPU is CPU 7.

Kernel parameters
-----------------

Set the following kernel boot parameters to disable SMT and set up
tick and IRQ isolation:

- Full dynticks: nohz_full=7

- IRQ isolation: irqaffinity=0-6

- Managed IRQ isolation: isolcpus=managed_irq,7

- Prevent SMT: nosmt

The full command line is then::

  nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt

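After rebooting, the effective parameters can be double-checked
against the expected command line. A minimal sketch follows; on a live
system, read /proc/cmdline instead of the hardcoded string used here
for illustration:

```shell
#!/bin/sh
# On a real system: cmdline=$(cat /proc/cmdline)
cmdline="nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt"

# Verify each expected parameter is present.
for param in nohz_full=7 irqaffinity=0-6 isolcpus=managed_irq,7 nosmt; do
	case " $cmdline " in
	*" $param "*) echo "found: $param" ;;
	*) echo "missing: $param" ;;
	esac
done
```
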
CPUSET configuration (cgroup v2)
--------------------------------

Assuming cgroup v2 is mounted at /sys/fs/cgroup, the following script
isolates CPU 7 from the scheduler domains.

::

  cd /sys/fs/cgroup
  # Activate the cpuset controller
  echo +cpuset > cgroup.subtree_control
  # Create the partition to be isolated
  mkdir test
  cd test
  echo +cpuset > cgroup.subtree_control
  # Isolate CPU 7
  echo 7 > cpuset.cpus
  echo "isolated" > cpuset.cpus.partition

The userspace workload
----------------------

To fake a pure userspace workload, the program below runs a dummy
userspace loop on the isolated CPU 7.

::

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          // Move the current task to the isolated cpuset (bound to CPU 7)
          int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);

          if (fd < 0) {
                  perror("Can't open cpuset file");
                  return 1;
          }

          // Writing "0" moves the calling task itself
          if (write(fd, "0\n", 2) < 0) {
                  perror("Can't move task to cpuset");
                  return 1;
          }
          close(fd);

          // Run an endless dummy loop until the launcher kills us
          while (1)
                  ;

          return 0;
  }

Build it and save it for a later step:

::

  # gcc user_loop.c -o user_loop

The launcher
------------

The launcher below runs the above program for 10 seconds and traces
the noise resulting from preempting tasks and IRQs.

::

  TRACING=/sys/kernel/tracing
  # Make sure tracing is off for now
  echo 0 > $TRACING/tracing_on
  # Flush previous traces
  echo > $TRACING/trace
  # Record disturbances from other tasks
  echo 1 > $TRACING/events/sched/sched_switch/enable
  # Record disturbances from interrupts
  echo 1 > $TRACING/events/irq_vectors/enable
  # Now we can start tracing
  echo 1 > $TRACING/tracing_on
  # Run the dummy user_loop for 10 seconds on CPU 7
  ./user_loop &
  USER_LOOP_PID=$!
  sleep 10
  kill $USER_LOOP_PID
  # Disable tracing and save the traces from CPU 7 to a file
  echo 0 > $TRACING/tracing_on
  cat $TRACING/per_cpu/cpu7/trace > trace.7

If no specific problem arose, the content of trace.7 should look like
the following:

::

  <idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
  user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253

That is, no specific noise triggered between the first trace entry and
the second one during the 10 seconds while user_loop was running.

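To quantify the remaining noise, the recorded events can simply be
counted. The sketch below builds a small sample trace for
illustration; on a real run, point TRACE at the trace.7 file saved by
the launcher:

```shell
#!/bin/sh
# Sample trace standing in for a real trace.7 file.
TRACE=trace.7
cat > "$TRACE" <<'EOF'
<idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 ...
user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253
EOF

# Count context switches and interrupt vector entries separately.
switches=$(grep -c 'sched_switch:' "$TRACE")
vectors=$(grep -c '_entry: vector=' "$TRACE")
echo "context switches: $switches, IRQ vectors: $vectors"
```

Anything beyond a couple of events here deserves the finer-grained
tracing described in the next section.
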
Debugging
=========

Of course things are never so easy, especially on this matter.
Chances are that actual noise will be observed in the aforementioned
trace.7 file.

The best way to investigate further is to enable finer-grained
tracepoints, such as those of the subsystems producing asynchronous
events: workqueue, timer, irq_vectors, etc. It can also be
interesting to enable the tick_stop event to diagnose why the tick is
retained when that happens.

Some tools may also be useful for higher level analysis:

- Documentation/tools/rtla/rtla.rst provides a suite of tools to
  analyze latency and noise in the system. For example,
  Documentation/tools/rtla/rtla-osnoise.rst runs a kernel tracer that
  analyzes the noise and outputs a summary of it.

- dynticks-testing does something similar to rtla-osnoise but in
  userspace. It is available at
  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git