Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.

One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.

Solve this by moving the kthreads affinity update from cpuset to the
consolidated kthread code, so that preferred affinities are honoured
and applied against the updated cpuset isolated partitions.

The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per Tejun's nice
suggestion.

As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot-defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU-based synchronization. This is a big step toward making
nohz_full= also mutable through cpuset in the future"
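
The affinity rule this series applies to unbound kthreads can be sketched in userspace C. This is only a model under stated assumptions: a 64-bit integer stands in for struct cpumask, and the names cpumask_model_t and effective_affinity() are hypothetical, not kernel APIs. It shows the preferred mask being intersected with the housekeeping mask, with a fallback to the whole housekeeping mask when the intersection is empty.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Model of the selection rule: keep the preferred CPUs that are also
 * housekeeping CPUs; if none remain, use every housekeeping CPU.
 */
static inline cpumask_model_t effective_affinity(cpumask_model_t preferred,
						 cpumask_model_t housekeeping)
{
	cpumask_model_t mask;

	/* At least one housekeeping CPU is expected to exist. */
	assert(housekeeping != 0);

	mask = preferred & housekeeping;
	return mask ? mask : housekeeping;
}
```

With housekeeping CPUs 0-5 (0x3F), a preference for CPUs 4-7 (0xF0) is narrowed to CPUs 4-5, while a preference for CPUs 6-7 only (0xC0) falls back to all housekeeping CPUs.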

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...

Linus Torvalds 4 months ago d16738a4 0506158a

+554 -222

28 changed files


Documentation/arch/arm64/asymmetric-32bit.rst
Documentation/core-api/housekeeping.rst
Documentation/core-api/index.rst
arch/arm64/kernel/cpufeature.c
block/blk-mq.c
drivers/base/cpu.c
drivers/pci/pci-driver.c
include/linux/cpuhplock.h
include/linux/cpuset.h
include/linux/kthread.h
include/linux/memcontrol.h
include/linux/mmu_context.h
include/linux/pci.h
include/linux/percpu-rwsem.h
include/linux/sched/isolation.h
include/linux/vmstat.h
include/linux/workqueue.h
init/Kconfig
kernel/cgroup/cpuset.c
kernel/cpu.c
kernel/kthread.c
kernel/sched/isolation.c
kernel/sched/sched.h
kernel/time/timer_migration.c
kernel/workqueue.c
mm/memcontrol.c
mm/vmstat.c
net/core/net-sysfs.c

+8 -4

Documentation/arch/arm64/asymmetric-32bit.rst

···
 ``KVM_EXIT_FAIL_ENTRY`` and will remain non-runnable until successfully
 re-initialised by a subsequent ``KVM_ARM_VCPU_INIT`` operation.

-NOHZ FULL
----------
+SCHEDULER DOMAIN ISOLATION
+--------------------------

-To avoid perturbing an adaptive-ticks CPU (specified using
-``nohz_full=``) when a 32-bit task is forcefully migrated, these CPUs
+To avoid perturbing a boot-defined domain isolated CPU (specified using
+``isolcpus=[domain]``) when a 32-bit task is forcefully migrated, these CPUs
 are treated as 64-bit-only when support for asymmetric 32-bit systems
 is enabled.
+
+However as opposed to boot-defined domain isolation, runtime-defined domain
+isolation using cpuset isolated partition is not advised on asymmetric
+32-bit systems and will result in undefined behaviour.

+111

Documentation/core-api/housekeeping.rst

···
+======================================
+Housekeeping
+======================================
+
+
+CPU Isolation moves away kernel work that may otherwise run on any CPU.
+The purpose of its related features is to reduce the OS jitter that some
+extreme workloads can't stand, such as in some DPDK usecases.
+
+The kernel work moved away by CPU isolation is commonly described as
+"housekeeping" because it includes ground work that performs cleanups,
+statistics maintainance and actions relying on them, memory release,
+various deferrals etc...
+
+Sometimes housekeeping is just some unbound work (unbound workqueues,
+unbound timers, ...) that gets easily assigned to non-isolated CPUs.
+But sometimes housekeeping is tied to a specific CPU and requires
+elaborated tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote
+scheduler tick, etc...).
+
+Thus, a housekeeping CPU can be considered as the reverse of an isolated
+CPU. It is simply a CPU that can execute housekeeping work. There must
+always be at least one online housekeeping CPU at any time. The CPUs that
+are not isolated are automatically assigned as housekeeping.
+
+Housekeeping is currently divided in four features described
+by the ``enum hk_type type``:
+
+1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain
+   isolation performed through ``isolcpus=domain`` boot parameter or
+   isolated cpuset partitions in cgroup v2. This includes scheduler
+   load balancing, unbound workqueues and timers.
+
+2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation
+   performed through ``nohz_full=`` or ``isolcpus=nohz`` boot
+   parameters. This includes remote scheduler tick, vmstat and lockup
+   watchdog.
+
+3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed
+   IRQ isolation performed through ``isolcpus=managed_irq``.
+
+4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain
+   isolation performed through ``isolcpus=domain`` only. It is similar
+   to HK_TYPE_DOMAIN except it ignores the isolation performed by
+   cpusets.
+
+
+Housekeeping cpumasks
+=================================
+
+Housekeeping cpumasks include the CPUs that can execute the work moved
+away by the matching isolation feature. These cpumasks are returned by
+the following function::
+
+	const struct cpumask *housekeeping_cpumask(enum hk_type type)
+
+By default, if neither ``nohz_full=``, nor ``isolcpus``, nor cpuset's
+isolated partitions are used, which covers most usecases, this function
+returns the cpu_possible_mask.
+
+Otherwise the function returns the cpumask complement of the isolation
+feature. For example:
+
+With isolcpus=domain,7 the following will return a mask with all possible
+CPUs except 7::
+
+	housekeeping_cpumask(HK_TYPE_DOMAIN)
+
+Similarly with nohz_full=5,6 the following will return a mask with all
+possible CPUs except 5,6::
+
+	housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
+
+
+Synchronization against cpusets
+=================================
+
+Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating,
+modifying or deleting an isolated partition.
+
+The users of HK_TYPE_DOMAIN cpumask must then make sure to synchronize
+properly against cpuset in order to make sure that:
+
+1. The cpumask snapshot stays coherent.
+
+2. No housekeeping work is queued on a newly made isolated CPU.
+
+3. Pending housekeeping work that was queued to a non isolated
+   CPU which just turned isolated through cpuset must be flushed
+   before the related created/modified isolated partition is made
+   available to userspace.
+
+This synchronization is maintained by an RCU based scheme. The cpuset update
+side waits for an RCU grace period after updating the HK_TYPE_DOMAIN
+cpumask and before flushing pending works. On the read side, care must be
+taken to gather the housekeeping target election and the work enqueue within
+the same RCU read side critical section.
+
+A typical layout example would look like this on the update side
+(``housekeeping_update()``)::
+
+	rcu_assign_pointer(housekeeping_cpumasks[type], trial);
+	synchronize_rcu();
+	flush_workqueue(example_workqueue);
+
+And then on the read side::
+
+	rcu_read_lock();
+	cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);
+	queue_work_on(cpu, example_workqueue, work);
+	rcu_read_unlock();
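
The complement relation described in this new document can be sketched as a small userspace model. Everything named hk_model here is hypothetical and only illustrates the documented semantics with 64-bit bitmasks: HK_TYPE_DOMAIN_BOOT excludes the CPUs isolated at boot, and HK_TYPE_DOMAIN additionally excludes the CPUs of cpuset isolated partitions, which makes it a subset of the former.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model types; these are not the kernel definitions. */
enum hk_model_type {
	HK_MODEL_DOMAIN_BOOT,	/* inverse of boot-time isolation only */
	HK_MODEL_DOMAIN,	/* also excludes cpuset isolated partitions */
};

struct hk_model {
	uint64_t possible;		/* all possible CPUs */
	uint64_t boot_isolated;		/* e.g. isolcpus=domain,... */
	uint64_t cpuset_isolated;	/* CPUs in isolated partitions */
};

/* Return the housekeeping mask as the complement of the isolated set. */
static inline uint64_t hk_model_cpumask(const struct hk_model *hk,
					enum hk_model_type type)
{
	uint64_t isolated = hk->boot_isolated;

	if (type == HK_MODEL_DOMAIN)
		isolated |= hk->cpuset_isolated;

	return hk->possible & ~isolated;
}

/* A CPU is isolated when it is absent from the domain housekeeping mask. */
static inline int hk_model_cpu_is_isolated(const struct hk_model *hk,
					   unsigned int cpu)
{
	assert(cpu < 64);
	return !(hk_model_cpumask(hk, HK_MODEL_DOMAIN) & (1ULL << cpu));
}
```

On an 8-CPU model with CPU 7 isolated at boot and CPUs 5-6 placed in an isolated partition, the boot mask keeps CPUs 0-6 and the domain mask keeps CPUs 0-4.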

Documentation/core-api/index.rst

···
    symbol-namespaces
    asm-annotations
    real-time/index
+   housekeeping.rst

 Data structures and low-level utilities
 =======================================

+3 -3

arch/arm64/kernel/cpufeature.c

···

 const struct cpumask *task_cpu_fallback_mask(struct task_struct *p)
 {
-	return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_TICK));
+	return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_DOMAIN));
 }

 static int __init parse_32bit_el0_param(char *str)
···
 	bool cpu_32bit = false;

 	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
-		if (!housekeeping_cpu(cpu, HK_TYPE_TICK))
-			pr_info("Treating adaptive-ticks CPU %u as 64-bit only\n", cpu);
+		if (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
+			pr_info("Treating domain isolated CPU %u as 64-bit only\n", cpu);
 		else
 			cpu_32bit = true;
 	}

+5 -1

block/blk-mq.c

···

 		/*
 		 * Rule out isolated CPUs from hctx->cpumask to avoid
-		 * running block kworker on isolated CPUs
+		 * running block kworker on isolated CPUs.
+		 * FIXME: cpuset should propagate further changes to isolated CPUs
+		 * here.
 		 */
+		rcu_read_lock();
 		for_each_cpu(cpu, hctx->cpumask) {
 			if (cpu_is_isolated(cpu))
 				cpumask_clear_cpu(cpu, hctx->cpumask);
 		}
+		rcu_read_unlock();

 		/*
 		 * Initialize batch roundrobin counts

+1 -1

drivers/base/cpu.c

···
 		return -ENOMEM;

 	cpumask_andnot(isolated, cpu_possible_mask,
-		       housekeeping_cpumask(HK_TYPE_DOMAIN));
+		       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 	len = sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(isolated));

 	free_cpumask_var(isolated);

+53 -20

drivers/pci/pci-driver.c

···
 	const struct pci_device_id *id;
 };

-static long local_pci_probe(void *_ddi)
+static int local_pci_probe(struct drv_dev_and_id *ddi)
 {
-	struct drv_dev_and_id *ddi = _ddi;
 	struct pci_dev *pci_dev = ddi->dev;
 	struct pci_driver *pci_drv = ddi->drv;
 	struct device *dev = &pci_dev->dev;
···
 	return 0;
 }

+static struct workqueue_struct *pci_probe_wq;
+
+struct pci_probe_arg {
+	struct drv_dev_and_id *ddi;
+	struct work_struct work;
+	int ret;
+};
+
+static void local_pci_probe_callback(struct work_struct *work)
+{
+	struct pci_probe_arg *arg = container_of(work, struct pci_probe_arg, work);
+
+	arg->ret = local_pci_probe(arg->ddi);
+}
+
 static bool pci_physfn_is_probed(struct pci_dev *dev)
 {
 #ifdef CONFIG_PCI_IOV
···
 	dev->is_probed = 1;

 	cpu_hotplug_disable();
-
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
 	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
+		error = local_pci_probe(&ddi);
 	} else {
-		cpumask_var_t wq_domain_mask;
+		struct pci_probe_arg arg = { .ddi = &ddi };

-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
+		INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
+		/*
+		 * The target election and the enqueue of the work must be within
+		 * the same RCU read side section so that when the workqueue pool
+		 * is flushed after a housekeeping cpumask update, further readers
+		 * are guaranteed to queue the probing work to the appropriate
+		 * targets.
+		 */
+		rcu_read_lock();
 		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+				      housekeeping_cpumask(HK_TYPE_DOMAIN));
+
+		if (cpu < nr_cpu_ids) {
+			struct workqueue_struct *wq = pci_probe_wq;
+
+			if (WARN_ON_ONCE(!wq))
+				wq = system_percpu_wq;
+			queue_work_on(cpu, wq, &arg.work);
+			rcu_read_unlock();
+			flush_work(&arg.work);
+			error = arg.ret;
+		} else {
+			rcu_read_unlock();
+			error = local_pci_probe(&ddi);
+		}
+
+		destroy_work_on_stack(&arg.work);
 	}

-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
-out:
 	dev->is_probed = 0;
 	cpu_hotplug_enable();
 	return error;
+}
+
+void pci_probe_flush_workqueue(void)
+{
+	flush_workqueue(pci_probe_wq);
 }

 /**
···
 static int __init pci_driver_init(void)
 {
 	int ret;
+
+	pci_probe_wq = alloc_workqueue("sync_wq", WQ_PERCPU, 0);
+	if (!pci_probe_wq)
+		return -ENOMEM;

 	ret = bus_register(&pci_bus_type);
 	if (ret)

include/linux/cpuhplock.h

···
 struct device;

 extern int lockdep_is_cpus_held(void);
+extern int lockdep_is_cpus_write_held(void);

 #ifdef CONFIG_HOTPLUG_CPU
 void cpus_write_lock(void);

+2 -6

include/linux/cpuset.h

···
 #include <linux/mmu_context.h>
 #include <linux/jump_label.h>

+extern bool lockdep_is_cpuset_held(void);
+
 #ifdef CONFIG_CPUSETS

 /*
···
 extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
···
 }

 static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
-{
-	return false;
-}
-
-static inline bool cpuset_cpu_is_isolated(int cpu)
 {
 	return false;
 }

include/linux/kthread.h

···
 void kthread_parkme(void);
 void kthread_exit(long result) __noreturn;
 void kthread_complete_and_exit(struct completion *, long) __noreturn;
+int kthreads_update_housekeeping(void);

 int kthreadd(void *unused);
 extern struct task_struct *kthreadd_task;

include/linux/memcontrol.h

···
 	return id;
 }

+void mem_cgroup_flush_workqueue(void);
+
 extern int mem_cgroup_init(void);
 #else /* CONFIG_MEMCG */

···
 {
 	return 0;
 }
+
+static inline void mem_cgroup_flush_workqueue(void) { }

 static inline int mem_cgroup_init(void) { return 0; }
 #endif /* CONFIG_MEMCG */

+1 -1

include/linux/mmu_context.h

···
 #ifndef task_cpu_possible_mask
 # define task_cpu_possible_mask(p)	cpu_possible_mask
 # define task_cpu_possible(cpu, p)	true
-# define task_cpu_fallback_mask(p)	housekeeping_cpumask(HK_TYPE_TICK)
+# define task_cpu_fallback_mask(p)	housekeeping_cpumask(HK_TYPE_DOMAIN)
 #else
 # define task_cpu_possible(cpu, p)	cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
 #endif

include/linux/pci.h

···
 					struct pci_ops *ops, void *sysdata,
 					struct list_head *resources);
 int pci_host_probe(struct pci_host_bridge *bridge);
+void pci_probe_flush_workqueue(void);
 int pci_bus_insert_busn_res(struct pci_bus *b, int bus, int busmax);
 int pci_bus_update_busn_res_end(struct pci_bus *b, int busmax);
 void pci_bus_release_busn_res(struct pci_bus *b);
···
 	_PCI_NOP(o, dword, u32 x)
 _PCI_NOP_ALL(read, *)
 _PCI_NOP_ALL(write,)
+
+static inline void pci_probe_flush_workqueue(void) { }

 static inline struct pci_dev *pci_get_device(unsigned int vendor,
 					     unsigned int device,

include/linux/percpu-rwsem.h

···
 	__percpu_init_rwsem(sem, #sem, &rwsem_key);		\
 })

+#define percpu_rwsem_is_write_held(sem)	lockdep_is_held_type(sem, 0)
 #define percpu_rwsem_is_held(sem)	lockdep_is_held(sem)
 #define percpu_rwsem_assert_held(sem)	lockdep_assert_held(sem)


+12 -4

include/linux/sched/isolation.h

···
 #define _LINUX_SCHED_ISOLATION_H

 #include <linux/cpumask.h>
-#include <linux/cpuset.h>
 #include <linux/init.h>
 #include <linux/tick.h>

 enum hk_type {
+	/* Inverse of boot-time isolcpus= argument */
+	HK_TYPE_DOMAIN_BOOT,
+	/*
+	 * Same as HK_TYPE_DOMAIN_BOOT but also includes the
+	 * inverse of cpuset isolated partitions. As such it
+	 * is always a subset of HK_TYPE_DOMAIN_BOOT.
+	 */
 	HK_TYPE_DOMAIN,
+	/* Inverse of boot-time isolcpus=managed_irq argument */
 	HK_TYPE_MANAGED_IRQ,
+	/* Inverse of boot-time nohz_full= or isolcpus=nohz arguments */
 	HK_TYPE_KERNEL_NOISE,
 	HK_TYPE_MAX,

···
 extern bool housekeeping_enabled(enum hk_type type);
 extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
 extern bool housekeeping_test_cpu(int cpu, enum hk_type type);
+extern int housekeeping_update(struct cpumask *isol_mask);
 extern void __init housekeeping_init(void);

 #else
···
 	return true;
 }

+static inline int housekeeping_update(struct cpumask *isol_mask) { return 0; }
 static inline void housekeeping_init(void) { }
 #endif /* CONFIG_CPU_ISOLATION */

···

 static inline bool cpu_is_isolated(int cpu)
 {
-	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
-	       !housekeeping_test_cpu(cpu, HK_TYPE_TICK) ||
-	       cpuset_cpu_is_isolated(cpu);
+	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN);
 }

 #endif /* _LINUX_SCHED_ISOLATION_H */

include/linux/vmstat.h

···
 int calculate_normal_threshold(struct zone *zone);
 void set_pgdat_percpu_threshold(pg_data_t *pgdat,
 				int (*calculate_pressure)(struct zone *));
+void vmstat_flush_workqueue(void);
 #else /* CONFIG_SMP */

 /*
···
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void vmstat_flush_workqueue(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_zonestat *pzstats) { }

+1 -1

include/linux/workqueue.h

···
 void free_workqueue_attrs(struct workqueue_attrs *attrs);
 int apply_workqueue_attrs(struct workqueue_struct *wq,
 			  const struct workqueue_attrs *attrs);
-extern int workqueue_unbound_exclude_cpumask(cpumask_var_t cpumask);
+extern int workqueue_unbound_housekeeping_update(const struct cpumask *hk);

 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
 			struct work_struct *work);

init/Kconfig

···
 	bool "Cpuset controller"
 	depends on SMP
 	select UNION_FIND
+	select CPU_ISOLATION
 	help
 	  This option will let you create and manage CPUSETs which
 	  allow dynamically partitioning a system into sets of CPUs and

+17 -36

kernel/cgroup/cpuset.c

···
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
-#include <linux/export.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
 #include <linux/sched/deadline.h>
···
  * cpuset_mutex crtical section.
  */
 static bool isolated_cpus_updating;
-
-/*
- * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
- */
-static cpumask_var_t boot_hk_cpus;
-static bool have_boot_isolcpus;

 /*
  * A flag to force sched domain rebuild at the end of an operation.
···
 	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 }
+
+#ifdef CONFIG_LOCKDEP
+bool lockdep_is_cpuset_held(void)
+{
+	return lockdep_is_held(&cpuset_mutex);
+}
+#endif

 static DEFINE_SPINLOCK(callback_lock);

···

 		if (top_cs) {
 			/*
+			 * PF_KTHREAD tasks are handled by housekeeping.
 			 * PF_NO_SETAFFINITY tasks are ignored.
-			 * All per cpu kthreads should have PF_NO_SETAFFINITY
-			 * flag set, see kthread_set_per_cpu().
 			 */
-			if (task->flags & PF_NO_SETAFFINITY)
+			if (task->flags & (PF_KTHREAD | PF_NO_SETAFFINITY))
 				continue;
 			cpumask_andnot(new_cpus, possible_mask, subpartitions_cpus);
 		} else {
···
  * @new_cpus: cpu mask
  * Return: true if there is conflict, false otherwise
  *
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
+ * CPUs outside of HK_TYPE_DOMAIN_BOOT, if defined, can only be used in an
  * isolated partition.
  */
 static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
 {
-	if (!have_boot_isolcpus)
+	if (!housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
 		return false;

-	if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
+	if ((prstate != PRS_ISOLATED) &&
+	    !cpumask_subset(new_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT)))
 		return true;

 	return false;
···
 	if (!isolated_cpus_updating)
 		return;

-	lockdep_assert_cpus_held();
-
-	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
-	WARN_ON_ONCE(ret < 0);
-
-	ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
+	ret = housekeeping_update(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);

 	isolated_cpus_updating = false;
 }
-
-/**
- * cpuset_cpu_is_isolated - Check if the given CPU is isolated
- * @cpu: the CPU number to be checked
- * Return: true if CPU is used in an isolated partition, false otherwise
- */
-bool cpuset_cpu_is_isolated(int cpu)
-{
-	return cpumask_test_cpu(cpu, isolated_cpus);
-}
-EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);

 /**
  * rm_siblings_excl_cpus - Remove exclusive CPUs that are used by sibling cpusets
···

 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

-	have_boot_isolcpus = housekeeping_enabled(HK_TYPE_DOMAIN);
-	if (have_boot_isolcpus) {
-		BUG_ON(!alloc_cpumask_var(&boot_hk_cpus, GFP_KERNEL));
-		cpumask_copy(boot_hk_cpus, housekeeping_cpumask(HK_TYPE_DOMAIN));
-		cpumask_andnot(isolated_cpus, cpu_possible_mask, boot_hk_cpus);
-	}
+	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
+		cpumask_andnot(isolated_cpus, cpu_possible_mask,
+			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));

 	return 0;
 }
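
The conflict test in prstate_housekeeping_conflict() reduces to a subset check, which can be sketched in userspace. This is a model only: the MODEL_PRS_* values and model_prstate_conflict() are hypothetical names, the numeric values of the partition states are not taken from the kernel, and 64-bit bitmasks replace cpumasks. A non-isolated partition conflicts as soon as it requests a CPU that lies outside the boot-time housekeeping mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical partition states for the model; values are arbitrary. */
enum model_prstate {
	MODEL_PRS_MEMBER,
	MODEL_PRS_ROOT,
	MODEL_PRS_ISOLATED,
};

/*
 * Return true when new_cpus contains a CPU isolated at boot while the
 * partition itself is not isolated. Without boot-time isolation there
 * is never a conflict.
 */
static inline bool model_prstate_conflict(bool boot_isolation_enabled,
					  enum model_prstate prstate,
					  uint64_t new_cpus,
					  uint64_t hk_domain_boot)
{
	if (!boot_isolation_enabled)
		return false;

	/* new_cpus is a subset of hk_domain_boot iff no bit falls outside. */
	return prstate != MODEL_PRS_ISOLATED &&
	       (new_cpus & ~hk_domain_boot) != 0;
}
```

With CPUs 6-7 isolated at boot (housekeeping mask 0x3F), a root partition asking for CPUs 5-6 conflicts, whereas the same request from an isolated partition does not.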

+16 -26

kernel/cpu.c

···
 {
 	return percpu_rwsem_is_held(&cpu_hotplug_lock);
 }
+
+int lockdep_is_cpus_write_held(void)
+{
+	return percpu_rwsem_is_write_held(&cpu_hotplug_lock);
+}
 #endif

 static void lockdep_acquire_cpus_lock(void)
···

 	cpus_write_lock();

+	/*
+	 * Keep at least one housekeeping cpu onlined to avoid generating
+	 * an empty sched_domain span.
+	 */
+	if (cpumask_any_and(cpu_online_mask,
+			    housekeeping_cpumask(HK_TYPE_DOMAIN)) >= nr_cpu_ids) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	cpuhp_tasks_frozen = tasks_frozen;

 	prev_state = cpuhp_set_state(cpu, st, target);
···
 	return ret;
 }

-struct cpu_down_work {
-	unsigned int		cpu;
-	enum cpuhp_state	target;
-};
-
-static long __cpu_down_maps_locked(void *arg)
-{
-	struct cpu_down_work *work = arg;
-
-	return _cpu_down(work->cpu, 0, work->target);
-}
-
 static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
 {
-	struct cpu_down_work work = { .cpu = cpu, .target = target, };
-
 	/*
 	 * If the platform does not support hotplug, report it explicitly to
 	 * differentiate it from a transient offlining failure.
···
 		return -EOPNOTSUPP;
 	if (cpu_hotplug_disabled)
 		return -EBUSY;
-
-	/*
-	 * Ensure that the control task does not run on the to be offlined
-	 * CPU to prevent a deadlock against cfs_b->period_timer.
-	 * Also keep at least one housekeeping cpu onlined to avoid generating
-	 * an empty sched_domain span.
-	 */
-	for_each_cpu_and(cpu, cpu_online_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
-		if (cpu != work.cpu)
-			return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
-	}
-	return -EBUSY;
+	return _cpu_down(cpu, 0, target);
 }

 static int cpu_down(unsigned int cpu, enum cpuhp_state target)

+121 -69

kernel/kthread.c

···
 static LIST_HEAD(kthread_create_list);
 struct task_struct *kthreadd_task;

-static LIST_HEAD(kthreads_hotplug);
-static DEFINE_MUTEX(kthreads_hotplug_lock);
+static LIST_HEAD(kthread_affinity_list);
+static DEFINE_MUTEX(kthread_affinity_lock);

 struct kthread_create_info
 {
···
 	/* To store the full name if task comm is truncated. */
 	char *full_name;
 	struct task_struct *task;
-	struct list_head hotplug_node;
+	struct list_head affinity_node;
 	struct cpumask *preferred_affinity;
 };

···

 	init_completion(&kthread->exited);
 	init_completion(&kthread->parked);
-	INIT_LIST_HEAD(&kthread->hotplug_node);
+	INIT_LIST_HEAD(&kthread->affinity_node);
 	p->vfork_done = &kthread->exited;

 	kthread->task = p;
···
 {
 	struct kthread *kthread = to_kthread(current);
 	kthread->result = result;
-	if (!list_empty(&kthread->hotplug_node)) {
-		mutex_lock(&kthreads_hotplug_lock);
-		list_del(&kthread->hotplug_node);
-		mutex_unlock(&kthreads_hotplug_lock);
+	if (!list_empty(&kthread->affinity_node)) {
+		mutex_lock(&kthread_affinity_lock);
+		list_del(&kthread->affinity_node);
+		mutex_unlock(&kthread_affinity_lock);

 		if (kthread->preferred_affinity) {
 			kfree(kthread->preferred_affinity);
···
 {
 	const struct cpumask *pref;

+	guard(rcu)();
+
 	if (kthread->preferred_affinity) {
 		pref = kthread->preferred_affinity;
 	} else {
-		if (WARN_ON_ONCE(kthread->node == NUMA_NO_NODE))
-			return;
-		pref = cpumask_of_node(kthread->node);
+		if (kthread->node == NUMA_NO_NODE)
+			pref = housekeeping_cpumask(HK_TYPE_DOMAIN);
+		else
+			pref = cpumask_of_node(kthread->node);
 	}

-	cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	cpumask_and(cpumask, pref, housekeeping_cpumask(HK_TYPE_DOMAIN));
 	if (cpumask_empty(cpumask))
-		cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_KTHREAD));
+		cpumask_copy(cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 }

 static void kthread_affine_node(void)
···
 	struct kthread *kthread = to_kthread(current);
 	cpumask_var_t affinity;

-	WARN_ON_ONCE(kthread_is_per_cpu(current));
+	if (WARN_ON_ONCE(kthread_is_per_cpu(current)))
+		return;

-	if (kthread->node == NUMA_NO_NODE) {
-		housekeeping_affine(current, HK_TYPE_KTHREAD);
-	} else {
-		if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
-			WARN_ON_ONCE(1);
-			return;
-		}
-
-		mutex_lock(&kthreads_hotplug_lock);
-		WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
-		list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
-		/*
-		 * The node cpumask is racy when read from kthread() but:
-		 * - a racing CPU going down will either fail on the subsequent
-		 *   call to set_cpus_allowed_ptr() or be migrated to housekeepers
-		 *   afterwards by the scheduler.
-		 * - a racing CPU going up will be handled by kthreads_online_cpu()
-		 */
-		kthread_fetch_affinity(kthread, affinity);
-		set_cpus_allowed_ptr(current, affinity);
-		mutex_unlock(&kthreads_hotplug_lock);
-
-		free_cpumask_var(affinity);
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL)) {
+		WARN_ON_ONCE(1);
+		return;
 	}
+
+	mutex_lock(&kthread_affinity_lock);
+	WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+	list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
+	/*
+	 * The node cpumask is racy when read from kthread() but:
+	 * - a racing CPU going down will either fail on the subsequent
+	 *   call to set_cpus_allowed_ptr() or be migrated to housekeepers
+	 *   afterwards by the scheduler.
+	 * - a racing CPU going up will be handled by kthreads_online_cpu()
+	 */
+	kthread_fetch_affinity(kthread, affinity);
+	set_cpus_allowed_ptr(current, affinity);
+	mutex_unlock(&kthread_affinity_lock);
+
+	free_cpumask_var(affinity);
 }

 static int kthread(void *_create)
···

 	self->started = 1;

+	/*
+	 * Apply default node affinity if no call to kthread_bind[_mask]() nor
+	 * kthread_affine_preferred() was issued before the first wake-up.
+	 */
 	if (!(current->flags & PF_NO_SETAFFINITY) && !self->preferred_affinity)
 		kthread_affine_node();

···
 	/* Setup a clean context for our children to inherit. */
 	set_task_comm(tsk, comm);
 	ignore_signals(tsk);
-	set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_KTHREAD));
 	set_mems_allowed(node_states[N_MEMORY]);

 	current->flags |= PF_NOFREEZE;
 	cgroup_init_kthreadd();
+
+	kthread_affine_node();

 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
···
 	return 0;
 }

+/**
+ * kthread_affine_preferred - Define a kthread's preferred affinity
+ * @p: thread created by kthread_create().
+ * @mask: preferred mask of CPUs (might not be online, must be possible) for @p
+ *        to run on.
+ *
+ * Similar to kthread_bind_mask() except that the affinity is not a requirement
+ * but rather a preference that can be constrained by CPU isolation or CPU hotplug.
+ * Must be called before the first wakeup of the kthread.
+ *
+ * Returns 0 if the affinity has been applied.
+ */
 int kthread_affine_preferred(struct task_struct *p, const struct cpumask *mask)
 {
 	struct kthread *kthread = to_kthread(p);
···
 		goto out;
 	}

-	mutex_lock(&kthreads_hotplug_lock);
+	mutex_lock(&kthread_affinity_lock);
 	cpumask_copy(kthread->preferred_affinity, mask);
-	WARN_ON_ONCE(!list_empty(&kthread->hotplug_node));
-	list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
+	WARN_ON_ONCE(!list_empty(&kthread->affinity_node));
+	list_add_tail(&kthread->affinity_node, &kthread_affinity_list);
 	kthread_fetch_affinity(kthread, affinity);

 	scoped_guard (raw_spinlock_irqsave, &p->pi_lock)
 		set_cpus_allowed_force(p, affinity);

-	mutex_unlock(&kthreads_hotplug_lock);
+	mutex_unlock(&kthread_affinity_lock);
 out:
 	free_cpumask_var(affinity);

 	return ret;
 }
 EXPORT_SYMBOL_GPL(kthread_affine_preferred);
+
+static int kthreads_update_affinity(bool force)
+{
+	cpumask_var_t affinity;
+	struct kthread *k;
+	int ret;
+
+	guard(mutex)(&kthread_affinity_lock);
+
+	if (list_empty(&kthread_affinity_list))
+		return 0;
+
+	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = 0;
+
+	list_for_each_entry(k, &kthread_affinity_list, affinity_node) {
+		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
+				 kthread_is_per_cpu(k->task))) {
+			ret = -EINVAL;
+			continue;
+		}
+
+		/*
+		 * Unbound kthreads without preferred affinity are already affine
+		 * to housekeeping, whether those CPUs are online or not. So no need
+		 * to handle newly online CPUs for them. However housekeeping changes
+		 * have to be applied.
+		 *
+		 * But kthreads with a preferred affinity or node are different:
+		 * if none of their preferred CPUs are online and part of
+		 * housekeeping at the same time, they must be affine to housekeeping.
+		 * But as soon as one of their preferred CPU becomes online, they must
+		 * be affine to them.
+		 */
+		if (force || k->preferred_affinity || k->node != NUMA_NO_NODE) {
+			kthread_fetch_affinity(k, affinity);
+			set_cpus_allowed_ptr(k->task, affinity);
+		}
+	}
+
+	free_cpumask_var(affinity);
+
+	return ret;
+}
+
+/**
+ * kthreads_update_housekeeping - Update kthreads affinity on cpuset change
+ *
+ * When cpuset changes a partition type to/from "isolated" or updates related
+ * cpumasks, propagate the housekeeping cpumask change to preferred kthreads
+ * affinity.
+ *
+ * Returns 0 if successful, -ENOMEM if temporary mask couldn't
+ * be allocated or -EINVAL in case of internal error.
+ */
+int kthreads_update_housekeeping(void)
+{
+	return kthreads_update_affinity(true);
+}

 /*
  * Re-affine kthreads according to their preferences
···
  */
 static int kthreads_online_cpu(unsigned int cpu)
 {
-	cpumask_var_t affinity;
-	struct kthread *k;
-	int ret;
-
-	guard(mutex)(&kthreads_hotplug_lock);
-
-	if (list_empty(&kthreads_hotplug))
-		return 0;
-
-	if (!zalloc_cpumask_var(&affinity, GFP_KERNEL))
-		return -ENOMEM;
-
-	ret = 0;
-
-	list_for_each_entry(k, &kthreads_hotplug, hotplug_node) {
-		if (WARN_ON_ONCE((k->task->flags & PF_NO_SETAFFINITY) ||
-				 kthread_is_per_cpu(k->task))) {
-			ret = -EINVAL;
-			continue;
-		}
-		kthread_fetch_affinity(k, affinity);
-		set_cpus_allowed_ptr(k->task, affinity);
-	}
-
-	free_cpumask_var(affinity);
-
-	return ret;
+	return kthreads_update_affinity(false);
903 } 1008 904 1009 905 static int kthreads_init(void)
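
The fallback at the end of kthread_fetch_affinity() shown in the hunk above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `cpumask_model_t` and `model_fetch_affinity()` are hypothetical names, a 64-bit word stands in for `struct cpumask`, and only the "intersect with housekeeping, else fall back to housekeeping" step is modelled, not the kernel's locking or node handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Model of the fallback visible at the end of kthread_fetch_affinity():
 * restrict the preferred mask to housekeeping CPUs and, if nothing is
 * left, fall back to the whole housekeeping mask.
 */
static cpumask_model_t model_fetch_affinity(cpumask_model_t preferred,
					    cpumask_model_t housekeeping)
{
	cpumask_model_t effective = preferred & housekeeping;

	/* Assumption of this model: at least one housekeeping CPU exists. */
	assert(housekeeping != 0);

	return effective ? effective : housekeeping;
}
```

With a preference for CPUs 2-3 and housekeeping CPUs 0-3, the model keeps CPUs 2-3. Once CPUs 2-3 are isolated and housekeeping shrinks to CPUs 0-1, the result falls back to CPUs 0-1, which is the behaviour the comment in kthreads_update_affinity() describes.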

+119 -26

kernel/sched/isolation.c

···
  *
  */
 #include <linux/sched/isolation.h>
+#include <linux/pci.h>
 #include "sched.h"
 
 enum hk_flags {
+	HK_FLAG_DOMAIN_BOOT = BIT(HK_TYPE_DOMAIN_BOOT),
 	HK_FLAG_DOMAIN = BIT(HK_TYPE_DOMAIN),
 	HK_FLAG_MANAGED_IRQ = BIT(HK_TYPE_MANAGED_IRQ),
 	HK_FLAG_KERNEL_NOISE = BIT(HK_TYPE_KERNEL_NOISE),
···
 EXPORT_SYMBOL_GPL(housekeeping_overridden);
 
 struct housekeeping {
-	cpumask_var_t cpumasks[HK_TYPE_MAX];
+	struct cpumask __rcu *cpumasks[HK_TYPE_MAX];
 	unsigned long flags;
 };
 
···
 
 bool housekeeping_enabled(enum hk_type type)
 {
-	return !!(housekeeping.flags & BIT(type));
+	return !!(READ_ONCE(housekeeping.flags) & BIT(type));
 }
 EXPORT_SYMBOL_GPL(housekeeping_enabled);
+
+static bool housekeeping_dereference_check(enum hk_type type)
+{
+	if (IS_ENABLED(CONFIG_LOCKDEP) && type == HK_TYPE_DOMAIN) {
+		/* Cpuset isn't even writable yet? */
+		if (system_state <= SYSTEM_SCHEDULING)
+			return true;
+
+		/* CPU hotplug write locked, so cpuset partition can't be overwritten */
+		if (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_write_held())
+			return true;
+
+		/* Cpuset lock held, partitions not writable */
+		if (IS_ENABLED(CONFIG_CPUSETS) && lockdep_is_cpuset_held())
+			return true;
+
+		return false;
+	}
+
+	return true;
+}
+
+static inline struct cpumask *housekeeping_cpumask_dereference(enum hk_type type)
+{
+	return rcu_dereference_all_check(housekeeping.cpumasks[type],
+					 housekeeping_dereference_check(type));
+}
+
+const struct cpumask *housekeeping_cpumask(enum hk_type type)
+{
+	const struct cpumask *mask = NULL;
+
+	if (static_branch_unlikely(&housekeeping_overridden)) {
+		if (READ_ONCE(housekeeping.flags) & BIT(type))
+			mask = housekeeping_cpumask_dereference(type);
+	}
+	if (!mask)
+		mask = cpu_possible_mask;
+	return mask;
+}
+EXPORT_SYMBOL_GPL(housekeeping_cpumask);
 
 int housekeeping_any_cpu(enum hk_type type)
 {
···
 
 	if (static_branch_unlikely(&housekeeping_overridden)) {
 		if (housekeeping.flags & BIT(type)) {
-			cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
+			cpu = sched_numa_find_closest(housekeeping_cpumask(type), smp_processor_id());
 			if (cpu < nr_cpu_ids)
 				return cpu;
 
-			cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
+			cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);
 			if (likely(cpu < nr_cpu_ids))
 				return cpu;
 			/*
···
 }
 EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
 
-const struct cpumask *housekeeping_cpumask(enum hk_type type)
-{
-	if (static_branch_unlikely(&housekeeping_overridden))
-		if (housekeeping.flags & BIT(type))
-			return housekeeping.cpumasks[type];
-	return cpu_possible_mask;
-}
-EXPORT_SYMBOL_GPL(housekeeping_cpumask);
-
 void housekeeping_affine(struct task_struct *t, enum hk_type type)
 {
 	if (static_branch_unlikely(&housekeeping_overridden))
 		if (housekeeping.flags & BIT(type))
-			set_cpus_allowed_ptr(t, housekeeping.cpumasks[type]);
+			set_cpus_allowed_ptr(t, housekeeping_cpumask(type));
 }
 EXPORT_SYMBOL_GPL(housekeeping_affine);
 
 bool housekeeping_test_cpu(int cpu, enum hk_type type)
 {
-	if (static_branch_unlikely(&housekeeping_overridden))
-		if (housekeeping.flags & BIT(type))
-			return cpumask_test_cpu(cpu, housekeeping.cpumasks[type]);
+	if (static_branch_unlikely(&housekeeping_overridden) &&
+	    READ_ONCE(housekeeping.flags) & BIT(type))
+		return cpumask_test_cpu(cpu, housekeeping_cpumask(type));
 	return true;
 }
 EXPORT_SYMBOL_GPL(housekeeping_test_cpu);
+
+int housekeeping_update(struct cpumask *isol_mask)
+{
+	struct cpumask *trial, *old = NULL;
+	int err;
+
+	lockdep_assert_cpus_held();
+
+	trial = kmalloc(cpumask_size(), GFP_KERNEL);
+	if (!trial)
+		return -ENOMEM;
+
+	cpumask_andnot(trial, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT), isol_mask);
+	if (!cpumask_intersects(trial, cpu_online_mask)) {
+		kfree(trial);
+		return -EINVAL;
+	}
+
+	if (!housekeeping.flags)
+		static_branch_enable_cpuslocked(&housekeeping_overridden);
+
+	if (housekeeping.flags & HK_FLAG_DOMAIN)
+		old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);
+	else
+		WRITE_ONCE(housekeeping.flags, housekeeping.flags | HK_FLAG_DOMAIN);
+	rcu_assign_pointer(housekeeping.cpumasks[HK_TYPE_DOMAIN], trial);
+
+	synchronize_rcu();
+
+	pci_probe_flush_workqueue();
+	mem_cgroup_flush_workqueue();
+	vmstat_flush_workqueue();
+
+	err = workqueue_unbound_housekeeping_update(housekeeping_cpumask(HK_TYPE_DOMAIN));
+	WARN_ON_ONCE(err < 0);
+
+	err = tmigr_isolated_exclude_cpumask(isol_mask);
+	WARN_ON_ONCE(err < 0);
+
+	err = kthreads_update_housekeeping();
+	WARN_ON_ONCE(err < 0);
+
+	kfree(old);
+
+	return 0;
+}
 
 void __init housekeeping_init(void)
 {
···
 
 	if (housekeeping.flags & HK_FLAG_KERNEL_NOISE)
 		sched_tick_offload_init();
-
+	/*
+	 * Realloc with a proper allocator so that any cpumask update
+	 * can indifferently free the old version with kfree().
+	 */
 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
+		struct cpumask *omask, *nmask = kmalloc(cpumask_size(), GFP_KERNEL);
+
+		if (WARN_ON_ONCE(!nmask))
+			return;
+
+		omask = rcu_dereference(housekeeping.cpumasks[type]);
+
 		/* We need at least one CPU to handle housekeeping work */
-		WARN_ON_ONCE(cpumask_empty(housekeeping.cpumasks[type]));
+		WARN_ON_ONCE(cpumask_empty(omask));
+		cpumask_copy(nmask, omask);
+		RCU_INIT_POINTER(housekeeping.cpumasks[type], nmask);
+		memblock_free(omask, cpumask_size());
 	}
 }
 
 static void __init housekeeping_setup_type(enum hk_type type,
 					   cpumask_var_t housekeeping_staging)
 {
+	struct cpumask *mask = memblock_alloc_or_panic(cpumask_size(), SMP_CACHE_BYTES);
 
-	alloc_bootmem_cpumask_var(&housekeeping.cpumasks[type]);
-	cpumask_copy(housekeeping.cpumasks[type],
-		     housekeeping_staging);
+	cpumask_copy(mask, housekeeping_staging);
+	RCU_INIT_POINTER(housekeeping.cpumasks[type], mask);
 }
 
 static int __init housekeeping_setup(char *str, unsigned long flags)
···
 
 		for_each_set_bit(type, &iter_flags, HK_TYPE_MAX) {
 			if (!cpumask_equal(housekeeping_staging,
-					   housekeeping.cpumasks[type])) {
+					   housekeeping_cpumask(type))) {
 				pr_warn("Housekeeping: nohz_full= must match isolcpus=\n");
 				goto free_housekeeping_staging;
 			}
···
 	iter_flags = flags & (HK_FLAG_KERNEL_NOISE | HK_FLAG_DOMAIN);
 	first_cpu = (type == HK_TYPE_MAX || !iter_flags) ? 0 :
 		    cpumask_first_and_and(cpu_present_mask,
-				    housekeeping_staging, housekeeping.cpumasks[type]);
+				    housekeeping_staging, housekeeping_cpumask(type));
 	if (first_cpu >= min(nr_cpu_ids, setup_max_cpus)) {
 		pr_warn("Housekeeping: must include one present CPU "
 			"neither in nohz_full= nor in isolcpus=domain, "
···
 
 		if (!strncmp(str, "domain,", 7)) {
 			str += 7;
-			flags |= HK_FLAG_DOMAIN;
+			flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
 			continue;
 		}
 
···
 
 	/* Default behaviour for isolcpus without flags */
 	if (!flags)
-		flags |= HK_FLAG_DOMAIN;
+		flags |= HK_FLAG_DOMAIN | HK_FLAG_DOMAIN_BOOT;
 
 	return housekeeping_setup(str, flags);
 }
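
The mask arithmetic at the start of housekeeping_update() can be sketched in user space as follows. This is a simplified model, not the kernel implementation: `struct hk_model` and `model_housekeeping_update()` are hypothetical names, masks are plain 64-bit words, and the RCU publication, workqueue flushes and propagation to timers, workqueues and kthreads are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Hypothetical model state: the two domain masks tracked by housekeeping. */
struct hk_model {
	cpumask_model_t domain_boot;	/* fixed at boot through isolcpus= */
	cpumask_model_t domain;		/* boot mask minus cpuset isolated CPUs */
};

/*
 * Model of the trial computation in housekeeping_update(): the new
 * HK_TYPE_DOMAIN mask is the boot mask with the cpuset isolated CPUs
 * removed, and it is rejected if no online CPU remains in it.
 */
static int model_housekeeping_update(struct hk_model *hk,
				     cpumask_model_t isolated,
				     cpumask_model_t online)
{
	cpumask_model_t trial;

	assert(hk);

	trial = hk->domain_boot & ~isolated;
	if (!(trial & online))
		return -EINVAL;	/* the previous mask stays in place */

	hk->domain = trial;
	return 0;
}
```

Because the trial mask is always derived from the boot mask rather than from the current one, removing a cpuset isolated partition later restores the CPUs it had taken away, while CPUs isolated with isolcpus= never come back.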

kernel/sched/sched.h

···
 #include <linux/context_tracking.h>
 #include <linux/cpufreq.h>
 #include <linux/cpumask_api.h>
+#include <linux/cpuset.h>
 #include <linux/ctype.h>
 #include <linux/file.h>
 #include <linux/fs_api.h>
···
 #include <linux/ktime_api.h>
 #include <linux/lockdep_api.h>
 #include <linux/lockdep.h>
+#include <linux/memblock.h>
+#include <linux/memcontrol.h>
 #include <linux/minmax.h>
 #include <linux/mm.h>
 #include <linux/module.h>
···
 #include <linux/types.h>
 #include <linux/u64_stats_sync_api.h>
 #include <linux/uaccess.h>
+#include <linux/vmstat.h>
 #include <linux/wait_api.h>
 #include <linux/wait_bit.h>
 #include <linux/workqueue_api.h>

+17 -8

kernel/time/timer_migration.c

···
 {
 	if (!static_branch_unlikely(&tmigr_exclude_isolated))
 		return false;
-	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
-		cpuset_cpu_is_isolated(cpu)) &&
-	       housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE);
+	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) &&
+		housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE));
 }
 
 /*
···
 	return 0;
 }
 
-static int tmigr_set_cpu_available(unsigned int cpu)
+static int __tmigr_set_cpu_available(unsigned int cpu)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
 
 	/* Check whether CPU data was successfully initialized */
 	if (WARN_ON_ONCE(!tmc->tmgroup))
 		return -EINVAL;
-
-	if (tmigr_is_isolated(cpu))
-		return 0;
 
 	guard(mutex)(&tmigr_available_mutex);
 
···
 	return 0;
 }
 
+static int tmigr_set_cpu_available(unsigned int cpu)
+{
+	if (tmigr_is_isolated(cpu))
+		return 0;
+
+	return __tmigr_set_cpu_available(cpu);
+}
+
 static void tmigr_cpu_isolate(struct work_struct *ignored)
 {
 	tmigr_clear_cpu_available(smp_processor_id());
···
 
 static void tmigr_cpu_unisolate(struct work_struct *ignored)
 {
-	tmigr_set_cpu_available(smp_processor_id());
+	/*
+	 * Don't call tmigr_is_isolated() ->housekeeping_cpu() directly because
+	 * the cpuset mutex is correctly held by the workqueue caller but lockdep
+	 * doesn't know that.
+	 */
+	__tmigr_set_cpu_available(smp_processor_id());
 }
 
 /**
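
The simplification of tmigr_is_isolated() relies on HK_TYPE_DOMAIN now covering cpuset isolated partitions as well as isolcpus=. A user-space sketch can check that the old and new predicates agree under that assumption. The names below (`model_old_is_isolated()`, `model_new_is_isolated()`, `cpumask_model_t`) are hypothetical, masks are 64-bit words, and the static branch check is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

static bool model_test_cpu(unsigned int cpu, cpumask_model_t mask)
{
	assert(cpu < 64);
	return (mask >> cpu) & 1;
}

/* Old predicate: boot-time domain mask plus a separate cpuset check. */
static bool model_old_is_isolated(unsigned int cpu, cpumask_model_t boot_domain,
				  cpumask_model_t cpuset_isolated,
				  cpumask_model_t kernel_noise)
{
	return (!model_test_cpu(cpu, boot_domain) ||
		model_test_cpu(cpu, cpuset_isolated)) &&
	       model_test_cpu(cpu, kernel_noise);
}

/* New predicate: the domain mask already excludes cpuset isolated CPUs. */
static bool model_new_is_isolated(unsigned int cpu, cpumask_model_t domain,
				  cpumask_model_t kernel_noise)
{
	return !model_test_cpu(cpu, domain) &&
	       model_test_cpu(cpu, kernel_noise);
}
```

When the new domain mask is computed as `boot_domain & ~cpuset_isolated`, as housekeeping_update() does, both predicates return the same value for every CPU, so the explicit cpuset_cpu_is_isolated() call becomes redundant.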

+10 -7

kernel/workqueue.c

···
 }
 
 /**
- * workqueue_unbound_exclude_cpumask - Exclude given CPUs from unbound cpumask
- * @exclude_cpumask: the cpumask to be excluded from wq_unbound_cpumask
+ * workqueue_unbound_housekeeping_update - Propagate housekeeping cpumask update
+ * @hk: the new housekeeping cpumask
  *
- * This function can be called from cpuset code to provide a set of isolated
- * CPUs that should be excluded from wq_unbound_cpumask.
+ * Update the unbound workqueue cpumask on top of the new housekeeping cpumask such
+ * that the effective unbound affinity is the intersection of the new housekeeping
+ * with the requested affinity set via nohz_full=/isolcpus= or sysfs.
+ *
+ * Return: 0 on success and -errno on failure.
  */
-int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)
+int workqueue_unbound_housekeeping_update(const struct cpumask *hk)
 {
 	cpumask_var_t cpumask;
 	int ret = 0;
···
 	 * (HK_TYPE_WQ ∩ HK_TYPE_DOMAIN) house keeping mask and rewritten
 	 * by any subsequent write to workqueue/cpumask sysfs file.
 	 */
-	if (!cpumask_andnot(cpumask, wq_requested_unbound_cpumask, exclude_cpumask))
+	if (!cpumask_and(cpumask, wq_requested_unbound_cpumask, hk))
 		cpumask_copy(cpumask, wq_requested_unbound_cpumask);
 	if (!cpumask_equal(cpumask, wq_unbound_cpumask))
 		ret = workqueue_apply_unbound_cpumask(cpumask);
 
 	/* Save the current isolated cpumask & export it via sysfs */
 	if (!ret)
-		cpumask_copy(wq_isolated_cpumask, exclude_cpumask);
+		cpumask_andnot(wq_isolated_cpumask, cpu_possible_mask, hk);
 
 	mutex_unlock(&wq_pool_mutex);
 	free_cpumask_var(cpumask);
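
The switch from an "exclude" mask to a housekeeping mask changes the arithmetic in this function from an and-not to an intersection. The following user-space sketch models only that arithmetic. `struct wq_model` and `model_wq_housekeeping_update()` are hypothetical names, masks are 64-bit words, and the model assumes applying the new unbound mask always succeeds, which the kernel code does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpumask_model_t;

/* Hypothetical model of the three workqueue masks touched by the update. */
struct wq_model {
	cpumask_model_t requested;	/* models wq_requested_unbound_cpumask */
	cpumask_model_t unbound;	/* models wq_unbound_cpumask */
	cpumask_model_t isolated;	/* models wq_isolated_cpumask */
};

static void model_wq_housekeeping_update(struct wq_model *wq,
					 cpumask_model_t hk,
					 cpumask_model_t possible)
{
	cpumask_model_t effective;

	assert(wq);

	/* Effective affinity: requested CPUs that are also housekeeping. */
	effective = wq->requested & hk;

	/* An empty intersection falls back to the requested mask. */
	if (!effective)
		effective = wq->requested;

	wq->unbound = effective;

	/* The exported isolated mask is now derived from the housekeeping one. */
	wq->isolated = possible & ~hk;
}
```

With a requested mask of CPUs 0-7 and housekeeping CPUs 0-3, the model yields an unbound mask of CPUs 0-3 and an isolated mask of CPUs 4-7.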

+27 -4

mm/memcontrol.c

···
 /* BPF memory accounting disabled? */
 static bool cgroup_memory_nobpf __ro_after_init;
 
+static struct workqueue_struct *memcg_wq __ro_after_init;
+
 static struct kmem_cache *memcg_cachep;
 static struct kmem_cache *memcg_pn_cachep;
 
···
 	return flush;
 }
 
+static void schedule_drain_work(int cpu, struct work_struct *work)
+{
+	/*
+	 * Protect housekeeping cpumask read and work enqueue together
+	 * in the same RCU critical section so that later cpuset isolated
+	 * partition update only need to wait for an RCU GP and flush the
+	 * pending work on newly isolated CPUs.
+	 */
+	guard(rcu)();
+	if (!cpu_is_isolated(cpu))
+		queue_work_on(cpu, memcg_wq, work);
+}
+
 /*
  * Drains all per-CPU charge caches for given root_memcg resp. subtree
  * of the hierarchy under it.
···
 				      &memcg_st->flags)) {
 			if (cpu == curcpu)
 				drain_local_memcg_stock(&memcg_st->work);
-			else if (!cpu_is_isolated(cpu))
-				schedule_work_on(cpu, &memcg_st->work);
+			else
+				schedule_drain_work(cpu, &memcg_st->work);
 		}
 
 		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
···
 				      &obj_st->flags)) {
 			if (cpu == curcpu)
 				drain_local_obj_stock(&obj_st->work);
-			else if (!cpu_is_isolated(cpu))
-				schedule_work_on(cpu, &obj_st->work);
+			else
+				schedule_drain_work(cpu, &obj_st->work);
 		}
 	}
 	migrate_enable();
···
 	refill_stock(memcg, nr_pages);
 }
 
+void mem_cgroup_flush_workqueue(void)
+{
+	flush_workqueue(memcg_wq);
+}
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;
···
 
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
+
+	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);
+	WARN_ON(!memcg_wq);
 
 	for_each_possible_cpu(cpu) {
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,

+11 -4

mm/vmstat.c

···
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+void vmstat_flush_workqueue(void)
+{
+	flush_workqueue(mm_percpu_wq);
+}
+
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
···
 		 * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
 		 * for all isolated CPUs to avoid interference with the isolated workload.
 		 */
-		if (cpu_is_isolated(cpu))
-			continue;
+		scoped_guard(rcu) {
+			if (cpu_is_isolated(cpu))
+				continue;
 
-		if (!delayed_work_pending(dw) && need_update(cpu))
-			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+			if (!delayed_work_pending(dw) && need_update(cpu))
+				queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
+		}
 
 		cond_resched();
 	}

+1 -1

net/core/net-sysfs.c

···
 int rps_cpumask_housekeeping(struct cpumask *mask)
 {
 	if (!cpumask_empty(mask)) {
-		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
 		if (cpumask_empty(mask))
 			return -EINVAL;
