Merge tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup



Pull cgroup updates from Tejun Heo:

- Defer task cgroup unlink until after the dying task's final context
switch so that controllers see the cgroup properly populated until
the task is truly gone

- cpuset cleanups and simplifications.

Enforce that domain isolated CPUs stay in root or isolated partitions
and fail if isolated+nohz_full would leave no housekeeping CPU. Fix
sched/deadline root domain handling during CPU hot-unplug and race
for tasks in attaching cpusets

- Misc fixes including memory reclaim protection documentation and
selftest KTAP conformance
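The housekeeping rule in the cpuset item above can be pictured with plain bitmasks. What follows is a minimal userspace sketch, assuming at most 64 CPUs packed into a `uint64_t`. The function name `can_isolate` and its parameter names are illustrative only and are not kernel APIs; the kernel code in this pull works on `struct cpumask` and also short-circuits in cases this sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, one bit per CPU.
 *   nohz_hk:   CPUs still doing housekeeping under nohz_full
 *   domain_hk: CPUs still doing housekeeping under boot-time domain isolation
 *   isolated:  CPUs already isolated at run time
 *   active:    CPUs currently active
 *   add:       CPUs about to become isolated
 * Returns true if at least one full housekeeping CPU would remain.
 */
static bool can_isolate(uint64_t nohz_hk, uint64_t domain_hk,
			uint64_t isolated, uint64_t active, uint64_t add)
{
	uint64_t full_hk = nohz_hk & domain_hk & ~isolated & active;

	return (full_hk & ~add) != 0;
}
```

With CPUs 0-1 as the only nohz_full housekeeping CPUs on a 4-CPU system, isolating CPU 0 alone would be accepted by this model, while isolating both CPU 0 and CPU 1 would be refused.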

* tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: Treat cpusets in attaching as populated
sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()
docs: cgroup: No special handling of unpopulated memcgs
docs: cgroup: Note about sibling relative reclaim protection
docs: cgroup: Explain reclaim protection target
selftests/cgroup: conform test to KTAP format output
cpuset: remove need_rebuild_sched_domains
cpuset: remove global remote_children list
cpuset: simplify node setting on error
cgroup: include missing header for struct irq_work
cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
cgroup/cpuset: Globally track isolated_cpus update
cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
cgroup: Defer task cgroup unlink until after the task is done switching out
cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
...

Linus Torvalds 6 months ago 8449d325 2b601457

+437 -207

20 changed files


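Documentation/admin-guide/cgroup-v2.rst
include/linux/cgroup.h
include/linux/cpuset.h
include/linux/sched.h
kernel/cgroup/cgroup.c
kernel/cgroup/cpuset-internal.h
kernel/cgroup/cpuset.c
kernel/exit.c
kernel/fork.c
kernel/sched/autogroup.c
kernel/sched/core.c
kernel/sched/deadline.c
tools/testing/selftests/cgroup/test_core.c
tools/testing/selftests/cgroup/test_cpu.c
tools/testing/selftests/cgroup/test_cpuset.c
tools/testing/selftests/cgroup/test_freezer.c
tools/testing/selftests/cgroup/test_kill.c
tools/testing/selftests/cgroup/test_kmem.c
tools/testing/selftests/cgroup/test_memcontrol.c
tools/testing/selftests/cgroup/test_zswap.c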

+25 -6

Documentation/admin-guide/cgroup-v2.rst

···
      5-2. Memory
        5-2-1. Memory Interface Files
        5-2-2. Usage Guidelines
-       5-2-3. Memory Ownership
+       5-2-3. Reclaim Protection
+       5-2-4. Memory Ownership
      5-3. IO
        5-3-1. IO Interface Files
        5-3-2. Writeback
···
 	smaller overages.

 	Effective min boundary is limited by memory.min values of
-	all ancestor cgroups. If there is memory.min overcommitment
+	ancestor cgroups. If there is memory.min overcommitment
 	(child cgroup or cgroups are requiring more protected memory
 	than parent will allow), then each child cgroup will get
 	the part of parent's protection proportional to its
···

 	Putting more memory than generally available under this
 	protection is discouraged and may lead to constant OOMs.
-
-	If a memory cgroup is not populated with processes,
-	its memory.min is ignored.

   memory.low
 	A read-write single value file which exists on non-root
···
 	smaller overages.

 	Effective low boundary is limited by memory.low values of
-	all ancestor cgroups. If there is memory.low overcommitment
+	ancestor cgroups. If there is memory.low overcommitment
 	(child cgroup or cgroups are requiring more protected memory
 	than parent will allow), then each child cgroup will get
 	the part of parent's protection proportional to its
···
 memory; unfortunately, memory pressure monitoring mechanism isn't
 implemented yet.

+Reclaim Protection
+~~~~~~~~~~~~~~~~~~
+
+The protection configured with "memory.low" or "memory.min" applies relatively
+to the target of the reclaim (i.e. any of memory cgroup limits, proactive
+memory.reclaim or global reclaim apparently located in the root cgroup).
+The protection value configured for B applies unchanged to the reclaim
+targeting A (i.e. caused by competition with the sibling E)::
+
+		root - ... - A - B - C
+		              \    ` D
+		               ` E
+
+When the reclaim targets ancestors of A, the effective protection of B is
+capped by the protection value configured for A (and any other intermediate
+ancestors between A and the target).
+
+To express indifference about relative sibling protection, it is suggested to
+use memory_recursiveprot. Configuring all descendants of a parent with finite
+protection to "max" works but it may unnecessarily skew memory.events:low
+field.

 Memory Ownership
 ~~~~~~~~~~~~~~~~
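The capping rule described in the new "Reclaim Protection" text can be sketched as a walk up the hierarchy. This is a simplified model, not the kernel's implementation: the `struct memcg` layout and the `effective_protection` helper are hypothetical, and the sketch ignores actual usage, proportional distribution among overcommitted siblings, and memory_recursiveprot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup: a parent link plus the configured
 * memory.low or memory.min value (units are arbitrary here). */
struct memcg {
	const struct memcg *parent;
	unsigned long protection;
};

/*
 * Effective protection of @cg for a reclaim that targets @target, which is
 * assumed to be an ancestor of @cg. Values configured on @target itself or
 * above it do not apply; every cgroup strictly between them caps the result.
 */
static unsigned long effective_protection(const struct memcg *cg,
					  const struct memcg *target)
{
	unsigned long eff;

	assert(cg != NULL && cg != target);
	eff = cg->protection;
	for (cg = cg->parent; cg != NULL && cg != target; cg = cg->parent)
		if (cg->protection < eff)
			eff = cg->protection;
	return eff;
}
```

In the diagram's terms, if B is configured with 100 and A with 50, a reclaim targeting A sees B's full 100 in this model, while a reclaim targeting an ancestor of A sees B capped at 50.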

+8 -6

include/linux/cgroup.h

···
 				struct kernel_clone_args *kargs);
 extern void cgroup_post_fork(struct task_struct *p,
 			     struct kernel_clone_args *kargs);
-void cgroup_exit(struct task_struct *p);
-void cgroup_release(struct task_struct *p);
-void cgroup_free(struct task_struct *p);
+void cgroup_task_exit(struct task_struct *p);
+void cgroup_task_dead(struct task_struct *p);
+void cgroup_task_release(struct task_struct *p);
+void cgroup_task_free(struct task_struct *p);

 int cgroup_init_early(void);
 int cgroup_init(void);
···
 				      struct kernel_clone_args *kargs) {}
 static inline void cgroup_post_fork(struct task_struct *p,
 				    struct kernel_clone_args *kargs) {}
-static inline void cgroup_exit(struct task_struct *p) {}
-static inline void cgroup_release(struct task_struct *p) {}
-static inline void cgroup_free(struct task_struct *p) {}
+static inline void cgroup_task_exit(struct task_struct *p) {}
+static inline void cgroup_task_dead(struct task_struct *p) {}
+static inline void cgroup_task_release(struct task_struct *p) {}
+static inline void cgroup_task_free(struct task_struct *p) {}

 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }

+8 -1

include/linux/cpuset.h

···
 extern void dec_dl_tasks_cs(struct task_struct *task);
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
+extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern bool cpuset_cpu_is_isolated(int cpu);
···
 static inline void cpuset_lock(void) { }
 static inline void cpuset_unlock(void) { }

+static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
+					struct cpumask *mask)
+{
+	cpumask_copy(mask, task_cpu_possible_mask(p));
+}
+
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
 {
-	cpumask_copy(mask, task_cpu_possible_mask(p));
+	cpuset_cpus_allowed_locked(p, mask);
 }

 static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)

+4 -1

include/linux/sched.h

···
 	struct css_set __rcu		*cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head		cg_list;
-#endif
+#ifdef CONFIG_PREEMPT_RT
+	struct llist_node		cg_dead_lnode;
+#endif	/* CONFIG_PREEMPT_RT */
+#endif	/* CONFIG_CGROUPS */
 #ifdef CONFIG_X86_CPU_RESCTRL
 	u32				closid;
 	u32				rmid;

+77 -16

kernel/cgroup/cgroup.c

···
 #include <linux/sched/deadline.h>
 #include <linux/psi.h>
 #include <linux/nstree.h>
+#include <linux/irq_work.h>
 #include <net/sock.h>

 #define CREATE_TRACE_POINTS
···
 static int cgroup_addrm_files(struct cgroup_subsys_state *css,
 			      struct cgroup *cgrp, struct cftype cfts[],
 			      bool is_add);
+static void cgroup_rt_init(void);

 #ifdef CONFIG_DEBUG_CGROUP_REF
 #define CGROUP_REF_FN_ATTRS	noinline
···
 		/*
 		 * We are synchronized through cgroup_threadgroup_rwsem
 		 * against PF_EXITING setting such that we can't race
-		 * against cgroup_exit()/cgroup_free() dropping the css_set.
+		 * against cgroup_task_dead()/cgroup_task_free() dropping
+		 * the css_set.
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);

···
 	BUG_ON(ss_rstat_init(NULL));

 	get_user_ns(init_cgroup_ns.user_ns);
+	cgroup_rt_init();

 	cgroup_lock();

···
 }

 /**
- * cgroup_exit - detach cgroup from exiting task
+ * cgroup_task_exit - detach cgroup from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
  * Description: Detach cgroup from @tsk.
  *
  */
-void cgroup_exit(struct task_struct *tsk)
+void cgroup_task_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
-	struct css_set *cset;
 	int i;

-	spin_lock_irq(&css_set_lock);
+	/* see cgroup_post_fork() for details */
+	do_each_subsys_mask(ss, i, have_exit_callback) {
+		ss->exit(tsk);
+	} while_each_subsys_mask();
+}
+
+static void do_cgroup_task_dead(struct task_struct *tsk)
+{
+	struct css_set *cset;
+	unsigned long flags;
+
+	spin_lock_irqsave(&css_set_lock, flags);

 	WARN_ON_ONCE(list_empty(&tsk->cg_list));
 	cset = task_css_set(tsk);
···
 		     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
 		cgroup_update_frozen(task_dfl_cgroup(tsk));

-	spin_unlock_irq(&css_set_lock);
-
-	/* see cgroup_post_fork() for details */
-	do_each_subsys_mask(ss, i, have_exit_callback) {
-		ss->exit(tsk);
-	} while_each_subsys_mask();
+	spin_unlock_irqrestore(&css_set_lock, flags);
 }

-void cgroup_release(struct task_struct *task)
+#ifdef CONFIG_PREEMPT_RT
+/*
+ * cgroup_task_dead() is called from finish_task_switch() which doesn't allow
+ * scheduling even in RT. As the task_dead path requires grabbing css_set_lock,
+ * this lead to sleeping in the invalid context warning bug. css_set_lock is too
+ * big to become a raw_spinlock. The task_dead path doesn't need to run
+ * synchronously but can't be delayed indefinitely either as the dead task pins
+ * the cgroup and task_struct can be pinned indefinitely. Bounce through lazy
+ * irq_work to allow batching while ensuring timely completion.
+ */
+static DEFINE_PER_CPU(struct llist_head, cgrp_dead_tasks);
+static DEFINE_PER_CPU(struct irq_work, cgrp_dead_tasks_iwork);
+
+static void cgrp_dead_tasks_iwork_fn(struct irq_work *iwork)
+{
+	struct llist_node *lnode;
+	struct task_struct *task, *next;
+
+	lnode = llist_del_all(this_cpu_ptr(&cgrp_dead_tasks));
+	llist_for_each_entry_safe(task, next, lnode, cg_dead_lnode) {
+		do_cgroup_task_dead(task);
+		put_task_struct(task);
+	}
+}
+
+static void __init cgroup_rt_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		init_llist_head(per_cpu_ptr(&cgrp_dead_tasks, cpu));
+		per_cpu(cgrp_dead_tasks_iwork, cpu) =
+			IRQ_WORK_INIT_LAZY(cgrp_dead_tasks_iwork_fn);
+	}
+}
+
+void cgroup_task_dead(struct task_struct *task)
+{
+	get_task_struct(task);
+	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
+	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
+}
+#else	/* CONFIG_PREEMPT_RT */
+static void __init cgroup_rt_init(void) {}
+
+void cgroup_task_dead(struct task_struct *task)
+{
+	do_cgroup_task_dead(task);
+}
+#endif	/* CONFIG_PREEMPT_RT */
+
+void cgroup_task_release(struct task_struct *task)
 {
 	struct cgroup_subsys *ss;
 	int ssid;
···
 	do_each_subsys_mask(ss, ssid, have_release_callback) {
 		ss->release(task);
 	} while_each_subsys_mask();
+}
+
+void cgroup_task_free(struct task_struct *task)
+{
+	struct css_set *cset = task_css_set(task);

 	if (!list_empty(&task->cg_list)) {
 		spin_lock_irq(&css_set_lock);
···
 		list_del_init(&task->cg_list);
 		spin_unlock_irq(&css_set_lock);
 	}
-}

-void cgroup_free(struct task_struct *task)
-{
-	struct css_set *cset = task_css_set(task);
 	put_css_set(cset);
 }

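The PREEMPT_RT path in the hunk above queues dead tasks on a lock-free list and processes them later in a batch. A rough userspace analogue of that push-then-drain pattern, using C11 atomics, might look like the sketch below. The names `lnode`, `lhead`, `lpush`, `ldel_all` and `drain` are hypothetical stand-ins, not the kernel's llist API, and nothing here models irq_work: the caller decides when to drain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for llist_node and llist_head. */
struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/* Push one node; safe against concurrent pushers (compare llist_add()). */
static void lpush(struct lhead *head, struct lnode *node)
{
	struct lnode *old;

	assert(node != NULL);
	old = atomic_load(&head->first);
	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&head->first, &old, node));
}

/* Detach the whole list in one atomic step (compare llist_del_all()). */
static struct lnode *ldel_all(struct lhead *head)
{
	return atomic_exchange(&head->first, (struct lnode *)NULL);
}

/*
 * Deferred handler: process everything queued so far and return the count.
 * @fn may be NULL, in which case nodes are only counted.
 */
static int drain(struct lhead *head, void (*fn)(struct lnode *))
{
	struct lnode *node = ldel_all(head), *next;
	int n = 0;

	for (; node != NULL; node = next) {
		next = node->next;	/* read before fn() may reuse the node */
		if (fn)
			fn(node);
		n++;
	}
	return n;
}
```

Because the producer only performs a compare-and-swap, it never has to take a sleeping lock; the heavier work happens in `drain`, which sees the nodes in last-in, first-out order.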

+7 -6

kernel/cgroup/cpuset-internal.h

···
 	/* for custom sched domain */
 	int relax_domain_level;

-	/* number of valid local child partitions */
-	int nr_subparts;
-
 	/* partition root state */
 	int partition_root_state;
+
+	/*
+	 * Whether cpuset is a remote partition.
+	 * It used to be a list anchoring all remote partitions — we can switch back
+	 * to a list if we need to iterate over the remote partitions.
+	 */
+	bool remote_partition;

 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
···

 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
-
-	/* Remote partition silbling list anchored at remote_children */
-	struct list_head remote_sibling;

 	/* Used to merge intersecting subsets for generate_sched_domains */
 	struct uf_node node;

+221 -136

kernel/cgroup/cpuset.c

···
 static cpumask_var_t	isolated_cpus;

 /*
+ * isolated_cpus updating flag (protected by cpuset_mutex)
+ * Set if isolated_cpus is going to be updated in the current
+ * cpuset_mutex crtical section.
+ */
+static bool isolated_cpus_updating;
+
+/*
  * Housekeeping (HK_TYPE_DOMAIN) CPUs at boot
  */
 static cpumask_var_t	boot_hk_cpus;
 static bool		have_boot_isolcpus;
-
-/* List of remote partition root children */
-static struct list_head	remote_children;

 /*
  * A flag to force sched domain rebuild at the end of an operation.
···
 		  BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
 	.partition_root_state = PRS_ROOT,
 	.relax_domain_level = -1,
-	.remote_sibling = LIST_HEAD_INIT(top_cpuset.remote_sibling),
+	.remote_partition = false,
 };

 /*
···
 		(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }

+static inline bool cpuset_is_populated(struct cpuset *cs)
+{
+	lockdep_assert_held(&cpuset_mutex);
+
+	/* Cpusets in the process of attaching should be considered as populated */
+	return cgroup_is_populated(cs->css.cgroup) ||
+		cs->attach_in_progress;
+}
+
 /**
  * partition_is_populated - check if partition has tasks
  * @cs: partition root to be checked
  * @excluded_child: a child cpuset to be excluded in task checking
  * Return: true if there are tasks, false otherwise
  *
- * It is assumed that @cs is a valid partition root. @excluded_child should
- * be non-NULL when this cpuset is going to become a partition itself.
+ * @cs should be a valid partition root or going to become a partition root.
+ * @excluded_child should be non-NULL when this cpuset is going to become a
+ * partition itself.
+ *
+ * Note that a remote partition is not allowed underneath a valid local
+ * or remote partition. So if a non-partition root child is populated,
+ * the whole partition is considered populated.
  */
 static inline bool partition_is_populated(struct cpuset *cs,
 					  struct cpuset *excluded_child)
 {
-	struct cgroup_subsys_state *css;
-	struct cpuset *child;
+	struct cpuset *cp;
+	struct cgroup_subsys_state *pos_css;

-	if (cs->css.cgroup->nr_populated_csets)
+	/*
+	 * We cannot call cs_is_populated(cs) directly, as
+	 * nr_populated_domain_children may include populated
+	 * csets from descendants that are partitions.
+	 */
+	if (cs->css.cgroup->nr_populated_csets ||
+	    cs->attach_in_progress)
 		return true;
-	if (!excluded_child && !cs->nr_subparts)
-		return cgroup_is_populated(cs->css.cgroup);

 	rcu_read_lock();
-	cpuset_for_each_child(child, css, cs) {
-		if (child == excluded_child)
+	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
+		if (cp == cs || cp == excluded_child)
 			continue;
-		if (is_partition_valid(child))
+
+		if (is_partition_valid(cp)) {
+			pos_css = css_rightmost_descendant(pos_css);
 			continue;
-		if (cgroup_is_populated(child->css.cgroup)) {
+		}
+
+		if (cpuset_is_populated(cp)) {
 			rcu_read_unlock();
 			return true;
 		}
···
 	 * be changed to have empty cpus_allowed or mems_allowed.
 	 */
 	ret = -ENOSPC;
-	if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
+	if (cpuset_is_populated(cur)) {
 		if (!cpumask_empty(cur->cpus_allowed) &&
 		    cpumask_empty(trial->cpus_allowed))
 			goto out;
···

 	lockdep_assert_held(&callback_lock);

-	cs->nr_subparts = 0;
 	if (cpumask_empty(cs->exclusive_cpus)) {
 		cpumask_clear(cs->effective_xcpus);
 		if (is_cpu_exclusive(cs))
···
 		cpumask_or(isolated_cpus, isolated_cpus, xcpus);
 	else
 		cpumask_andnot(isolated_cpus, isolated_cpus, xcpus);
+
+	isolated_cpus_updating = true;
 }

 /*
···
  * @new_prs: new partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be added
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_add(int new_prs, struct cpuset *parent,
+static void partition_xcpus_add(int new_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(new_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
···
 	if (parent == &top_cpuset)
 		cpumask_or(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (new_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (new_prs != parent->partition_root_state)
 		isolated_cpus_update(parent->partition_root_state, new_prs,
 				     xcpus);

 	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
 }

 /*
···
  * @old_prs: old partition_root_state
  * @parent: parent cpuset
  * @xcpus: exclusive CPUs to be removed
- * Return: true if isolated_cpus modified, false otherwise
  *
  * Remote partition if parent == NULL
  */
-static bool partition_xcpus_del(int old_prs, struct cpuset *parent,
+static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 				struct cpumask *xcpus)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(old_prs < 0);
 	lockdep_assert_held(&callback_lock);
 	if (!parent)
···
 	if (parent == &top_cpuset)
 		cpumask_andnot(subpartitions_cpus, subpartitions_cpus, xcpus);

-	isolcpus_updated = (old_prs != parent->partition_root_state);
-	if (isolcpus_updated)
+	if (old_prs != parent->partition_root_state)
 		isolated_cpus_update(old_prs, parent->partition_root_state,
 				     xcpus);

 	cpumask_and(xcpus, xcpus, cpu_active_mask);
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
-	return isolcpus_updated;
 }

-static void update_isolation_cpumasks(bool isolcpus_updated)
+/*
+ * isolated_cpus_can_update - check for isolated & nohz_full conflicts
+ * @add_cpus: cpu mask for cpus that are going to be isolated
+ * @del_cpus: cpu mask for cpus that are no longer isolated, can be NULL
+ * Return: false if there is conflict, true otherwise
+ *
+ * If nohz_full is enabled and we have isolated CPUs, their combination must
+ * still leave housekeeping CPUs.
+ *
+ * TBD: Should consider merging this function into
+ *      prstate_housekeeping_conflict().
+ */
+static bool isolated_cpus_can_update(struct cpumask *add_cpus,
+				     struct cpumask *del_cpus)
+{
+	cpumask_var_t full_hk_cpus;
+	int res = true;
+
+	if (!housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+		return true;
+
+	if (del_cpus && cpumask_weight_and(del_cpus,
+			housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
+		return true;
+
+	if (!alloc_cpumask_var(&full_hk_cpus, GFP_KERNEL))
+		return false;
+
+	cpumask_and(full_hk_cpus, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+		    housekeeping_cpumask(HK_TYPE_DOMAIN));
+	cpumask_andnot(full_hk_cpus, full_hk_cpus, isolated_cpus);
+	cpumask_and(full_hk_cpus, full_hk_cpus, cpu_active_mask);
+	if (!cpumask_weight_andnot(full_hk_cpus, add_cpus))
+		res = false;
+
+	free_cpumask_var(full_hk_cpus);
+	return res;
+}
+
+/*
+ * prstate_housekeeping_conflict - check for partition & housekeeping conflicts
+ * @prstate: partition root state to be checked
+ * @new_cpus: cpu mask
+ * Return: true if there is conflict, false otherwise
+ *
+ * CPUs outside of boot_hk_cpus, if defined, can only be used in an
+ * isolated partition.
+ */
+static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
+{
+	if (!have_boot_isolcpus)
+		return false;
+
+	if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
+		return true;
+
+	return false;
+}
+
+/*
+ * update_isolation_cpumasks - Update external isolation related CPU masks
+ *
+ * The following external CPU masks will be updated if necessary:
+ * - workqueue unbound cpumask
+ */
+static void update_isolation_cpumasks(void)
 {
 	int ret;

-	lockdep_assert_cpus_held();
-
-	if (!isolcpus_updated)
+	if (!isolated_cpus_updating)
 		return;
+
+	lockdep_assert_cpus_held();

 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);

 	ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+
+	isolated_cpus_updating = false;
 }

 /**
···

 static inline bool is_remote_partition(struct cpuset *cs)
 {
-	return !list_empty(&cs->remote_sibling);
+	return cs->remote_partition;
 }

 static inline bool is_local_partition(struct cpuset *cs)
···
 static int remote_partition_enable(struct cpuset *cs, int new_prs,
 				   struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	/*
 	 * The user must have sysadmin privilege.
 	 */
···
 	if (!cpumask_intersects(tmp->new_cpus, cpu_active_mask) ||
 	    cpumask_subset(top_cpuset.effective_cpus, tmp->new_cpus))
 		return PERR_INVCPUS;
+	if (((new_prs == PRS_ISOLATED) &&
+	     !isolated_cpus_can_update(tmp->new_cpus, NULL)) ||
+	    prstate_housekeeping_conflict(new_prs, tmp->new_cpus))
+		return PERR_HKEEPING;

 	spin_lock_irq(&callback_lock);
-	isolcpus_updated = partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
-	list_add(&cs->remote_sibling, &remote_children);
+	partition_xcpus_add(new_prs, NULL, tmp->new_cpus);
+	cs->remote_partition = true;
 	cpumask_copy(cs->effective_xcpus, tmp->new_cpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks(isolcpus_updated);
+	update_isolation_cpumasks();
 	cpuset_force_rebuild();
 	cs->prs_err = 0;

···
  */
 static void remote_partition_disable(struct cpuset *cs, struct tmpmasks *tmp)
 {
-	bool isolcpus_updated;
-
 	WARN_ON_ONCE(!is_remote_partition(cs));
 	WARN_ON_ONCE(!cpumask_subset(cs->effective_xcpus, subpartitions_cpus));

 	spin_lock_irq(&callback_lock);
-	list_del_init(&cs->remote_sibling);
-	isolcpus_updated = partition_xcpus_del(cs->partition_root_state,
-					       NULL, cs->effective_xcpus);
+	cs->remote_partition = false;
+	partition_xcpus_del(cs->partition_root_state, NULL, cs->effective_xcpus);
 	if (cs->prs_err)
 		cs->partition_root_state = -cs->partition_root_state;
 	else
···
 	compute_excpus(cs, cs->effective_xcpus);
 	reset_partition_data(cs);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks(isolcpus_updated);
+	update_isolation_cpumasks();
 	cpuset_force_rebuild();

 	/*
···
 {
 	bool adding, deleting;
 	int prs = cs->partition_root_state;
-	int isolcpus_updated = 0;

 	if (WARN_ON_ONCE(!is_remote_partition(cs)))
 		return;
···
 		else if (cpumask_intersects(tmp->addmask, subpartitions_cpus) ||
 			 cpumask_subset(top_cpuset.effective_cpus, tmp->addmask))
 			cs->prs_err = PERR_NOCPUS;
+		else if ((prs == PRS_ISOLATED) &&
+			 !isolated_cpus_can_update(tmp->addmask, tmp->delmask))
+			cs->prs_err = PERR_HKEEPING;
 		if (cs->prs_err)
 			goto invalidate;
 	}

 	spin_lock_irq(&callback_lock);
 	if (adding)
-		isolcpus_updated += partition_xcpus_add(prs, NULL, tmp->addmask);
+		partition_xcpus_add(prs, NULL, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_del(prs, NULL, tmp->delmask);
+		partition_xcpus_del(prs, NULL, tmp->delmask);
 	/*
 	 * Need to update effective_xcpus and exclusive_cpus now as
 	 * update_sibling_cpumasks() below may iterate back to the same cs.
···
 	if (xcpus)
 		cpumask_copy(cs->exclusive_cpus, xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks(isolcpus_updated);
+	update_isolation_cpumasks();
 	if (adding || deleting)
 		cpuset_force_rebuild();

···

 invalidate:
 	remote_partition_disable(cs, tmp);
-}
-
-/*
- * prstate_housekeeping_conflict - check for partition & housekeeping conflicts
- * @prstate: partition root state to be checked
- * @new_cpus: cpu mask
- * Return: true if there is conflict, false otherwise
- *
- * CPUs outside of boot_hk_cpus, if defined, can only be used in an
- * isolated partition.
- */
-static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
-{
-	if (!have_boot_isolcpus)
-		return false;
-
-	if ((prstate != PRS_ISOLATED) && !cpumask_subset(new_cpus, boot_hk_cpus))
-		return true;
-
-	return false;
 }

 /**
···
 	int deleting;	/* Deleting cpus from parent's effective_cpus	*/
 	int old_prs, new_prs;
 	int part_error = PERR_NONE;	/* Partition error? */
-	int subparts_delta = 0;
-	int isolcpus_updated = 0;
 	struct cpumask *xcpus = user_xcpus(cs);
+	int parent_prs = parent->partition_root_state;
 	bool nocpu;

 	lockdep_assert_held(&cpuset_mutex);
···
 		if (is_partition_valid(parent))
 			adding = cpumask_and(tmp->addmask,
 						xcpus, parent->effective_xcpus);
-		if (old_prs > 0) {
+		if (old_prs > 0)
 			new_prs = -old_prs;
-			subparts_delta--;
-		}
+
 		goto write_error;
 	}

···
 		if (prstate_housekeeping_conflict(new_prs, xcpus))
 			return PERR_HKEEPING;

+		if ((new_prs == PRS_ISOLATED) && (new_prs != parent_prs) &&
+		    !isolated_cpus_can_update(xcpus, NULL))
+			return PERR_HKEEPING;
+
 		if (tasks_nocpu_error(parent, cs, xcpus))
 			return PERR_NOCPUS;

···
 		WARN_ON_ONCE(!cpumask_subset(tmp->new_cpus, parent->effective_cpus));

 		deleting = true;
-		subparts_delta++;
 	} else if (cmd == partcmd_disable) {
 		/*
 		 * May need to add cpus back to parent's effective_cpus
···
 		if (is_partition_valid(cs)) {
 			cpumask_copy(tmp->addmask, cs->effective_xcpus);
 			adding = true;
-			subparts_delta--;
 		}
 		new_prs = PRS_MEMBER;
 	} else if (newmask) {
···
 		 *
 		 * For invalid partition:
 		 *   delmask = newmask & parent->effective_xcpus
+		 *   The partition may become valid soon.
 		 */
 		if (is_partition_invalid(cs)) {
 			adding = false;
···
 			deleting = cpumask_and(tmp->delmask, tmp->delmask,
 					       parent->effective_xcpus);
 		}
+
+		/*
+		 * TBD: Invalidate a currently valid child root partition may
+		 * still break isolated_cpus_can_update() rule if parent is an
+		 * isolated partition.
+		 */
+		if (is_partition_valid(cs) && (old_prs != parent_prs)) {
+			if ((parent_prs == PRS_ROOT) &&
+			    /* Adding to parent means removing isolated CPUs */
+			    !isolated_cpus_can_update(tmp->delmask, tmp->addmask))
+				part_error = PERR_HKEEPING;
+			if ((parent_prs == PRS_ISOLATED) &&
+			    /* Adding to parent means adding isolated CPUs */
+			    !isolated_cpus_can_update(tmp->addmask, tmp->delmask))
+				part_error = PERR_HKEEPING;
+		}
+
 		/*
 		 * The new CPUs to be removed from parent's effective CPUs
 		 * must be present.
···
 		switch (cs->partition_root_state) {
 		case PRS_ROOT:
 		case PRS_ISOLATED:
-			if (part_error) {
+			if (part_error)
 				new_prs = -old_prs;
-				subparts_delta--;
-			}
 			break;
 		case PRS_INVALID_ROOT:
 		case PRS_INVALID_ISOLATED:
-			if (!part_error) {
+			if (!part_error)
 				new_prs = -old_prs;
-				subparts_delta++;
-			}
 			break;
 		}
 	}
···
 	 * newly deleted ones will be added back to effective_cpus.
 	 */
 	spin_lock_irq(&callback_lock);
-	if (old_prs != new_prs) {
+	if (old_prs != new_prs)
 		cs->partition_root_state = new_prs;
-		if (new_prs <= 0)
-			cs->nr_subparts = 0;
-	}
+
 	/*
 	 * Adding to parent's effective_cpus means deletion CPUs from cs
 	 * and vice versa.
 	 */
 	if (adding)
-		isolcpus_updated += partition_xcpus_del(old_prs, parent,
-							tmp->addmask);
+		partition_xcpus_del(old_prs, parent, tmp->addmask);
 	if (deleting)
-		isolcpus_updated += partition_xcpus_add(new_prs, parent,
-							tmp->delmask);
+		partition_xcpus_add(new_prs, parent, tmp->delmask);

-	if (is_partition_valid(parent)) {
-		parent->nr_subparts += subparts_delta;
-		WARN_ON_ONCE(parent->nr_subparts < 0);
-	}
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks(isolcpus_updated);
+	update_isolation_cpumasks();

 	if ((old_prs != new_prs) && (cmd == partcmd_update))
 		update_partition_exclusive_flag(cs, new_prs);
···
 			 */
 			spin_lock_irq(&callback_lock);
 			make_partition_invalid(child);
-			cs->nr_subparts--;
-			child->nr_subparts = 0;
 			spin_unlock_irq(&callback_lock);
 			notify_partition_change(child, old_prs);
 			continue;
···
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
-	bool need_rebuild_sched_domains = false;
 	int old_prs, new_prs;

 	rcu_read_lock();
···
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    is_sched_load_balance(cp) &&
 		   (!cpuset_v2() || is_partition_valid(cp)))
-			need_rebuild_sched_domains = true;
+			cpuset_force_rebuild();

 		rcu_read_lock();
 		css_put(&cp->css);
 	}
 	rcu_read_unlock();
-
-	if (need_rebuild_sched_domains)
-		cpuset_force_rebuild();
 }

 /**
···
 	 */
 	retval = nodelist_parse(buf, trialcs->mems_allowed);
 	if (retval < 0)
-		goto done;
+		return retval;

 	if (!nodes_subset(trialcs->mems_allowed,
-			  top_cpuset.mems_allowed)) {
-		retval = -EINVAL;
-		goto done;
-	}
+			  top_cpuset.mems_allowed))
+		return -EINVAL;

-	if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed)) {
-		retval = 0;		/* Too easy - nothing to do */
-		goto done;
-	}
+	/* No change? nothing to do */
+	if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed))
+		return 0;
+
 	retval = validate_change(cs, trialcs);
 	if (retval < 0)
-		goto done;
+		return retval;

 	check_insane_mems_config(&trialcs->mems_allowed);

···

 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
-done:
-	return retval;
+	return 0;
 }

 bool current_cpuset_is_being_rebound(void)
···
 		 * A change in load balance state only, no change in cpumasks.
 		 * Need to update isolated_cpus.
 		 */
-		isolcpus_updated = true;
+		if (((new_prs == PRS_ISOLATED) &&
+		     !isolated_cpus_can_update(cs->effective_xcpus, NULL)) ||
+		    prstate_housekeeping_conflict(new_prs, cs->effective_xcpus))
+			err = PERR_HKEEPING;
+		else
+			isolcpus_updated = true;
 	} else {
 		/*
 		 * Switching back to member is always allowed even if it
···
 	else if (isolcpus_updated)
 		isolated_cpus_update(old_prs, new_prs, cs->effective_xcpus);
 	spin_unlock_irq(&callback_lock);
-	update_isolation_cpumasks(isolcpus_updated);
+	update_isolation_cpumasks();

 	/* Force update if switching back to member & update effective_xcpus */
 	update_cpumasks_hier(cs, &tmpmask, !new_prs);
···
 	__set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	fmeter_init(&cs->fmeter);
 	cs->relax_domain_level = -1;
-	INIT_LIST_HEAD(&cs->remote_sibling);

 	/* Set CS_MEMORY_MIGRATE for default hierarchy */
 	if (cpuset_v2())
···
 	nodes_setall(top_cpuset.effective_mems);

 	fmeter_init(&top_cpuset.fmeter);
-	INIT_LIST_HEAD(&remote_children);

 	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

···
 	 */
 	if (!cpumask_empty(subpartitions_cpus)) {
 		if (cpumask_subset(&new_cpus, subpartitions_cpus)) {
-			top_cpuset.nr_subparts = 0;
 			cpumask_clear(subpartitions_cpus);
 		} else {
 			cpumask_andnot(&new_cpus, &new_cpus,
···
 	BUG_ON(!cpuset_migrate_mm_wq);
 }

-/**
- * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
- * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
- * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
- *
- * Description: Returns the cpumask_var_t cpus_allowed of the cpuset
- * attached to the specified @tsk. Guaranteed to return some non-empty
- * subset of cpu_active_mask, even if this means going outside the
- * tasks cpuset, except when the task is in the top cpuset.
- **/
-
-void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
+/*
+ * Return cpus_allowed mask from a task's cpuset.
+ */
+static void __cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
 {
-	unsigned long flags;
 	struct cpuset *cs;
-
-	spin_lock_irqsave(&callback_lock, flags);

 	cs = task_cs(tsk);
 	if (cs != &top_cpuset)
···
 		if (!cpumask_intersects(pmask, cpu_active_mask))
 			cpumask_copy(pmask, possible_mask);
 	}
+}

+/**
+ * cpuset_cpus_allowed_locked - return cpus_allowed mask from a task's cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
+ * @pmask: pointer to struct cpumask variable to receive cpus_allowed set.
4165 + * 4166 + * Similir to cpuset_cpus_allowed() except that the caller must have acquired 4167 + * cpuset_mutex. 4168 + */ 4169 + void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask) 4170 + { 4171 + lockdep_assert_held(&cpuset_mutex); 4172 + __cpuset_cpus_allowed_locked(tsk, pmask); 4173 + } 4174 + 4175 + /** 4176 + * cpuset_cpus_allowed - return cpus_allowed mask from a task's cpuset. 4177 + * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed. 4178 + * @pmask: pointer to struct cpumask variable to receive cpus_allowed set. 4179 + * 4180 + * Description: Returns the cpumask_var_t cpus_allowed of the cpuset 4181 + * attached to the specified @tsk. Guaranteed to return some non-empty 4182 + * subset of cpu_active_mask, even if this means going outside the 4183 + * tasks cpuset, except when the task is in the top cpuset. 4184 + **/ 4185 + 4186 + void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) 4187 + { 4188 + unsigned long flags; 4189 + 4190 + spin_lock_irqsave(&callback_lock, flags); 4191 + __cpuset_cpus_allowed_locked(tsk, pmask); 4213 4192 spin_unlock_irqrestore(&callback_lock, flags); 4214 4193 } 4215 4194
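The cpuset.c change above splits one function into a lock-agnostic helper plus two wrappers that differ only in their locking contract. A minimal user-space sketch of that shape follows; it is not kernel code. The `demo_*` names are hypothetical, and plain boolean flags stand in for `callback_lock` and `cpuset_mutex` so that the example stays single-threaded and self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins: flags model the two locks, no real locking. */
static bool demo_spinlock_held;          /* models callback_lock */
static bool demo_mutex_held;             /* models cpuset_mutex */
static unsigned long demo_mask = 0x0fUL; /* models a cpus_allowed bitmap */

/* Core helper: assumes the caller already provides exclusion. */
static unsigned long demo_mask_nolock(void)
{
	return demo_mask;
}

/* Variant for callers that already hold the outer mutex. */
unsigned long demo_mask_locked(void)
{
	assert(demo_mutex_held); /* plays the role of lockdep_assert_held() */
	return demo_mask_nolock();
}

/* Variant for everyone else: takes the inner lock around the helper. */
unsigned long demo_mask_get(void)
{
	unsigned long mask;

	demo_spinlock_held = true;
	mask = demo_mask_nolock();
	demo_spinlock_held = false;
	return mask;
}

void demo_mutex_lock(void)   { demo_mutex_held = true; }
void demo_mutex_unlock(void) { demo_mutex_held = false; }
```

Keeping the body in one helper means the two entry points cannot drift apart; only the precondition differs.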

+2 -2

kernel/exit.c

···
 	rcu_read_unlock();

 	pidfs_exit(p);
-	cgroup_release(p);
+	cgroup_task_release(p);

 	/* Retrieve @thread_pid before __unhash_process() may set it to NULL. */
 	thread_pid = task_pid(p);
···
 	exit_thread(tsk);

 	sched_autogroup_exit_task(tsk);
-	cgroup_exit(tsk);
+	cgroup_task_exit(tsk);

 	/*
 	 * FIXME: do that only when needed, using sched_exit tracepoint

+1 -1

kernel/fork.c

···
 	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
-	cgroup_free(tsk);
+	cgroup_task_free(tsk);
 	task_numa_free(tsk, true);
 	security_task_free(tsk);
 	exit_creds(tsk);

+2 -2

kernel/sched/autogroup.c

···
  * this process can already run with task_group() == prev->tg or we can
  * race with cgroup code which can read autogroup = prev under rq->lock.
  * In the latter case for_each_thread() can not miss a migrating thread,
- * cpu_cgroup_attach() must not be possible after cgroup_exit() and it
- * can't be removed from thread list, we hold ->siglock.
+ * cpu_cgroup_attach() must not be possible after cgroup_task_exit()
+ * and it can't be removed from thread list, we hold ->siglock.
  *
  * If an exiting thread was already removed from thread list we rely on
  * sched_autogroup_exit_task().

kernel/sched/core.c

···
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);

+		cgroup_task_dead(prev);
+
 		/* Task is done with its stack. */
 		put_task_stack(prev);


+48 -6

kernel/sched/deadline.c

···
 	return NULL;
 }

+/* Access rule: must be called on local CPU with preemption disabled */
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);

 static int find_later_rq(struct task_struct *task)
···
 					GFP_KERNEL, cpu_to_node(i));
 }

+/*
+ * This function always returns a non-empty bitmap in @cpus. This is because
+ * if a root domain has reserved bandwidth for DL tasks, the DL bandwidth
+ * check will prevent CPU hotplug from deactivating all CPUs in that domain.
+ */
+static void dl_get_task_effective_cpus(struct task_struct *p, struct cpumask *cpus)
+{
+	const struct cpumask *hk_msk;
+
+	hk_msk = housekeeping_cpumask(HK_TYPE_DOMAIN);
+	if (housekeeping_enabled(HK_TYPE_DOMAIN)) {
+		if (!cpumask_intersects(p->cpus_ptr, hk_msk)) {
+			/*
+			 * CPUs isolated by isolcpu="domain" always belong to
+			 * def_root_domain.
+			 */
+			cpumask_andnot(cpus, cpu_active_mask, hk_msk);
+			return;
+		}
+	}
+
+	/*
+	 * If a root domain holds a DL task, it must have active CPUs. So
+	 * active CPUs can always be found by walking up the task's cpuset
+	 * hierarchy up to the partition root.
+	 */
+	cpuset_cpus_allowed_locked(p, cpus);
+}
+
+/* The caller should hold cpuset_mutex */
 void dl_add_task_root_domain(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
 	struct dl_bw *dl_b;
+	unsigned int cpu;
+	struct cpumask *msk = this_cpu_cpumask_var_ptr(local_cpu_mask_dl);

 	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
 	if (!dl_task(p) || dl_entity_is_special(&p->dl)) {
···
 		return;
 	}

-	rq = __task_rq_lock(p, &rf);
-
+	/*
+	 * Get an active rq, whose rq->rd traces the correct root
+	 * domain.
+	 * Ideally this would be under cpuset reader lock until rq->rd is
+	 * fetched. However, sleepable locks cannot nest inside pi_lock, so we
+	 * rely on the caller of dl_add_task_root_domain() holds 'cpuset_mutex'
+	 * to guarantee the CPU stays in the cpuset.
+	 */
+	dl_get_task_effective_cpus(p, msk);
+	cpu = cpumask_first_and(cpu_active_mask, msk);
+	BUG_ON(cpu >= nr_cpu_ids);
+	rq = cpu_rq(cpu);
 	dl_b = &rq->rd->dl_bw;
+	/* End of fetching rd */
+
 	raw_spin_lock(&dl_b->lock);
-
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
-
 	raw_spin_unlock(&dl_b->lock);
-
-	task_rq_unlock(rq, p, &rf);
+	raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
 }

 void dl_clear_root_domain(struct root_domain *rd)
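The new code in dl_add_task_root_domain() picks a runqueue by taking the first CPU set in both the active mask and the task's effective mask, and treats an out-of-range result as a bug. A small sketch of that selection on a 64-bit bitmap follows, assuming the convention (implied by the `BUG_ON(cpu >= nr_cpu_ids)` check above) that an empty intersection yields a value at or past the CPU count. `DEMO_NR_CPUS`, `demo_first_and()` and `demo_pick_cpu()` are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 64u /* hypothetical CPU count for this model */

/* Index of the lowest bit set in both masks, or DEMO_NR_CPUS if none. */
unsigned int demo_first_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (both & (UINT64_C(1) << cpu))
			return cpu;
	}
	return DEMO_NR_CPUS;
}

/* Pick a CPU and insist that one exists, like the BUG_ON() in the diff. */
unsigned int demo_pick_cpu(uint64_t active, uint64_t effective)
{
	unsigned int cpu = demo_first_and(active, effective);

	assert(cpu < DEMO_NR_CPUS);
	return cpu;
}
```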

+4 -3

tools/testing/selftests/cgroup/test_core.c

···
 int main(int argc, char *argv[])
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), &nsdelegate)) {
 		if (setup_named_v1_root(root, sizeof(root), CG_NAMED_NAME))
 			ksft_exit_skip("cgroup v2 isn't mounted and could not setup named v1 hierarchy\n");
···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

 	cleanup_named_v1_root(root);
-	return ret;
+	ksft_finished();
 }
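The selftest conversions all follow the same shape: announce a header and a plan, emit one result line per test, and let the framework derive the exit status instead of tracking `ret` by hand. The sketch below imitates that flow without `kselftest.h`. The `demo_*` functions are hypothetical, and the line formats follow the general KTAP layout (`ok` or `not ok`, test number, description, optional `# SKIP`), not necessarily the exact strings the `ksft_*` helpers print.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum demo_result { DEMO_PASS, DEMO_FAIL, DEMO_SKIP };

/* Format one result line into buf; returns 1 if it counts as a failure. */
int demo_format_result(char *buf, size_t len, unsigned int num,
		       const char *name, enum demo_result res)
{
	assert(buf && name);

	switch (res) {
	case DEMO_PASS:
		snprintf(buf, len, "ok %u %s", num, name);
		return 0;
	case DEMO_SKIP:
		snprintf(buf, len, "ok %u %s # SKIP", num, name);
		return 0;
	default:
		snprintf(buf, len, "not ok %u %s", num, name);
		return 1;
	}
}

/* Print version line, plan, and results; exit status comes from the tally. */
int demo_run(const enum demo_result *results, const char *const *names,
	     unsigned int n)
{
	char line[128];
	unsigned int i;
	int failed = 0;

	printf("KTAP version 1\n");
	printf("1..%u\n", n);
	for (i = 0; i < n; i++) {
		failed += demo_format_result(line, sizeof(line), i + 1,
					     names[i], results[i]);
		puts(line);
	}
	return failed ? 1 : 0;
}
```

Deriving the status from the tally is what lets each converted `main()` drop its `ret` variable.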

+4 -3

tools/testing/selftests/cgroup/test_cpu.c

···
 int main(int argc, char *argv[])
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");

···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_cpuset.c

···
 int main(int argc, char *argv[])
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");

···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_freezer.c

···
 int main(int argc, char *argv[])
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_kill.c

···
 int main(int argc, char *argv[])
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_kmem.c

···
 int main(int argc, char **argv)
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");

···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_memcontrol.c

···
 int main(int argc, char **argv)
 {
 	char root[PATH_MAX];
-	int i, proc_status, ret = EXIT_SUCCESS;
+	int i, proc_status;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");

···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }

+4 -3

tools/testing/selftests/cgroup/test_zswap.c

···
 int main(int argc, char **argv)
 {
 	char root[PATH_MAX];
-	int i, ret = EXIT_SUCCESS;
+	int i;

+	ksft_print_header();
+	ksft_set_plan(ARRAY_SIZE(tests));
 	if (cg_find_unified_root(root, sizeof(root), NULL))
 		ksft_exit_skip("cgroup v2 isn't mounted\n");

···
 			ksft_test_result_skip("%s\n", tests[i].name);
 			break;
 		default:
-			ret = EXIT_FAILURE;
 			ksft_test_result_fail("%s\n", tests[i].name);
 			break;
 		}
 	}

-	return ret;
+	ksft_finished();
 }
