Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- cpuset changes:

   - Continue separating v1 and v2 implementations by moving more
     v1-specific logic into cpuset-v1.c

   - Improve partition handling. Sibling partitions are no longer
     invalidated on cpuset.cpus conflict, cpuset.cpus changes no longer
     fail in v2, and effective_xcpus computation is made consistent

   - Fix partition effective CPUs overlap that caused a warning on
     cpuset removal when sibling partitions shared CPUs

- Increase the maximum cgroup subsystem count from 16 to 32 to
  accommodate future subsystem additions

- Misc cleanups and selftest improvements including switching to
  css_is_online() helper, removing dead code and stale documentation
  references, using lockdep_assert_cpuset_lock_held() consistently, and
  adding polling helpers for asynchronously updated cgroup statistics

* tag 'cgroup-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: fix overlap of partition effective CPUs
cgroup: increase maximum subsystem count from 16 to 32
cgroup: Remove stale cpu.rt.max reference from documentation
cpuset: replace direct lockdep_assert_held() with lockdep_assert_cpuset_lock_held()
cgroup/cpuset: Move the v1 empty cpus/mems check to cpuset1_validate_change()
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Consistently compute effective_xcpus in update_cpumasks_hier()
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cpuset: remove dead code in cpuset-v1.c
cpuset: remove v1-specific code from generate_sched_domains
cpuset: separate generate_sched_domains for v1 and v2
cpuset: move update_domain_attr_tree to cpuset_v1.c
cpuset: add cpuset1_init helper for v1 initialization
cpuset: add cpuset1_online_css helper for v1-specific operations
cpuset: add lockdep_assert_cpuset_lock_held helper
cpuset: Remove unnecessary checks in rebuild_sched_domains_locked
cgroup: switch to css_is_online() helper
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for waiting on nr_dying_descendants
selftests: cgroup: make test_memcg_sock robust against delayed sock stats
...

+595 -477
+27 -17
Documentation/admin-guide/cgroup-v2.rst
···
 resource is mandatory for execution of processes, process migrations
 may be rejected.
 
-"cpu.rt.max" hard-allocates realtime slices and is an example of this
-type.
-
 
 Interface Files
 ===============
···
        Users can manually set it to a value that is different from
        "cpuset.cpus".  One constraint in setting it is that the list of
        CPUs must be exclusive with respect to "cpuset.cpus.exclusive"
-       of its sibling.  If "cpuset.cpus.exclusive" of a sibling cgroup
-       isn't set, its "cpuset.cpus" value, if set, cannot be a subset
-       of it to leave at least one CPU available when the exclusive
-       CPUs are taken away.
+       and "cpuset.cpus.exclusive.effective" of its siblings.  Another
+       constraint is that it cannot be a superset of "cpuset.cpus"
+       of its sibling in order to leave at least one CPU available to
+       that sibling when the exclusive CPUs are taken away.
 
        For a parent cgroup, any one of its exclusive CPUs can only
        be distributed to at most one of its child cgroups.  Having an
···
        of this file will always be a subset of its parent's
        "cpuset.cpus.exclusive.effective" if its parent is not the root
        cgroup.  It will also be a subset of "cpuset.cpus.exclusive"
-       if it is set.  If "cpuset.cpus.exclusive" is not set, it is
-       treated to have an implicit value of "cpuset.cpus" in the
-       formation of local partition.
+       if it is set.  This file should only be non-empty if either
+       "cpuset.cpus.exclusive" is set or when the current cpuset is
+       a valid partition root.
 
   cpuset.cpus.isolated
        A read-only and root cgroup only multiple values file.
···
        There are two types of partitions - local and remote.  A local
        partition is one whose parent cgroup is also a valid partition
        root.  A remote partition is one whose parent cgroup is not a
-       valid partition root itself.  Writing to "cpuset.cpus.exclusive"
-       is optional for the creation of a local partition as its
-       "cpuset.cpus.exclusive" file will assume an implicit value that
-       is the same as "cpuset.cpus" if it is not set.  Writing the
-       proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
-       before the target partition root is mandatory for the creation
-       of a remote partition.
+       valid partition root itself.
+
+       Writing to "cpuset.cpus.exclusive" is optional for the creation
+       of a local partition as its "cpuset.cpus.exclusive" file will
+       assume an implicit value that is the same as "cpuset.cpus" if it
+       is not set.  Writing the proper "cpuset.cpus.exclusive" values
+       down the cgroup hierarchy before the target partition root is
+       mandatory for the creation of a remote partition.
+
+       Not all the CPUs requested in "cpuset.cpus.exclusive" can be
+       used to form a new partition.  Only those that were present
+       in its parent's "cpuset.cpus.exclusive.effective" control
+       file can be used.  For partitions created without setting
+       "cpuset.cpus.exclusive", exclusive CPUs specified in sibling's
+       "cpuset.cpus.exclusive" or "cpuset.cpus.exclusive.effective"
+       also cannot be used.
 
        Currently, a remote partition cannot be created under a local
        partition.  All the ancestors of a remote partition root except
···
 
        The root cgroup is always a partition root and its state cannot
        be changed.  All other non-root cgroups start out as "member".
+       Even though the "cpuset.cpus.exclusive*" and "cpuset.cpus"
+       control files are not present in the root cgroup, they are
+       implicitly the same as the "/sys/devices/system/cpu/possible"
+       sysfs file.
 
        When set to "root", the current cgroup is the root of a new
        partition or scheduling domain.  The set of exclusive CPUs is
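The partition interface described in the hunk above is driven entirely by writes to cpuset control files. Below is a minimal, illustrative C sketch (not part of this series) that creates a local partition; the mount point /sys/fs/cgroup, the cgroup name "part0", and the CPU range "2-3" are assumptions, and error handling is kept to a minimum.

/*
 * Illustrative sketch only: carve CPUs 2-3 into a local partition as
 * described in the cgroup-v2.rst hunk above.  Paths, the cgroup name
 * and the CPU list are assumptions.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        /* Make the cpuset controller available to child cgroups. */
        write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset");

        /* Create the child cgroup that will become the partition root. */
        mkdir("/sys/fs/cgroup/part0", 0755);

        /*
         * Request CPUs 2-3.  Setting cpuset.cpus.exclusive is optional
         * for a local partition but makes the intent explicit.
         */
        write_str("/sys/fs/cgroup/part0/cpuset.cpus", "2-3");
        write_str("/sys/fs/cgroup/part0/cpuset.cpus.exclusive", "2-3");

        /* Turn the cgroup into a partition root. */
        return write_str("/sys/fs/cgroup/part0/cpuset.cpus.partition", "root");
}

Per the updated documentation, only CPUs present in the parent's "cpuset.cpus.exclusive.effective" can actually be granted, so reading "cpuset.cpus.partition" back is the way to confirm whether the partition became valid.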
+1 -1
fs/fs-writeback.c
···
 
        css = mem_cgroup_css_from_folio(folio);
        /* dead cgroups shouldn't contribute to inode ownership arbitration */
-       if (!(css->flags & CSS_ONLINE))
+       if (!css_is_online(css))
                return;
 
        id = css->id;
+4 -4
include/linux/cgroup-defs.h
···
         * one which may have more subsystems enabled.  Controller knobs
         * are made available iff it's enabled in ->subtree_control.
         */
-       u16 subtree_control;
-       u16 subtree_ss_mask;
-       u16 old_subtree_control;
-       u16 old_subtree_ss_mask;
+       u32 subtree_control;
+       u32 subtree_ss_mask;
+       u32 old_subtree_control;
+       u32 old_subtree_ss_mask;
 
        /* Private pointers for each registered subsystem */
        struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
+2
include/linux/cpuset.h
···
 extern void dec_dl_tasks_cs(struct task_struct *task);
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
+extern void lockdep_assert_cpuset_lock_held(void);
 extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
···
 static inline void dec_dl_tasks_cs(struct task_struct *task) { }
 static inline void cpuset_lock(void) { }
 static inline void cpuset_unlock(void) { }
+static inline void lockdep_assert_cpuset_lock_held(void) { }
 
 static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
                                               struct cpumask *mask)
+1 -1
include/linux/memcontrol.h
···
 {
        if (mem_cgroup_disabled())
                return true;
-       return !!(memcg->css.flags & CSS_ONLINE);
+       return css_is_online(&memcg->css);
 }
 
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+1 -1
include/trace/events/cgroup.h
···
 
        TP_STRUCT__entry(
                __field(        int,            root            )
-               __field(        u16,            ss_mask         )
+               __field(        u32,            ss_mask         )
                __string(       name,           root->name      )
        ),
 
+4 -4
kernel/cgroup/cgroup-internal.h
···
        bool cpuset_clone_children;
        bool none;              /* User explicitly requested empty subsystem */
        bool all_ss;            /* Seen 'all' option */
-       u16 subsys_mask;        /* Selected subsystems */
+       u32 subsys_mask;        /* Selected subsystems */
        char *name;             /* Hierarchy name */
        char *release_agent;    /* Path for release notifications */
 };
···
        struct cgroup_taskset   tset;
 
        /* subsystems affected by migration */
-       u16                     ss_mask;
+       u32                     ss_mask;
 };
 
 #define CGROUP_TASKSET_INIT(tset)      \
···
 void cgroup_favor_dynmods(struct cgroup_root *root, bool favor);
 void cgroup_free_root(struct cgroup_root *root);
 void init_cgroup_root(struct cgroup_fs_context *ctx);
-int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
-int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
+int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask);
+int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask);
 int cgroup_do_get_tree(struct fs_context *fc);
 
 int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
+6 -6
kernel/cgroup/cgroup-v1.c
···
 #define CGROUP_PIDLIST_DESTROY_DELAY   HZ
 
 /* Controllers blocked by the commandline in v1 */
-static u16 cgroup_no_v1_mask;
+static u32 cgroup_no_v1_mask;
 
 /* disable named v1 mounts */
 static bool cgroup_no_v1_named;
···
 static int check_cgroupfs_options(struct fs_context *fc)
 {
        struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
-       u16 mask = U16_MAX;
-       u16 enabled = 0;
+       u32 mask = U32_MAX;
+       u32 enabled = 0;
        struct cgroup_subsys *ss;
        int i;
 
 #ifdef CONFIG_CPUSETS
-       mask = ~((u16)1 << cpuset_cgrp_id);
+       mask = ~((u32)1 << cpuset_cgrp_id);
 #endif
        for_each_subsys(ss, i)
                if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) &&
···
        struct kernfs_root *kf_root = kernfs_root_from_sb(fc->root->d_sb);
        struct cgroup_root *root = cgroup_root_from_kf(kf_root);
        int ret = 0;
-       u16 added_mask, removed_mask;
+       u32 added_mask, removed_mask;
 
        cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
···
                        continue;
 
                if (!strcmp(token, "all")) {
-                       cgroup_no_v1_mask = U16_MAX;
+                       cgroup_no_v1_mask = U32_MAX;
                        continue;
                }
 
+25 -25
kernel/cgroup/cgroup.c
··· 203 203 bool cgrp_dfl_visible; 204 204 205 205 /* some controllers are not supported in the default hierarchy */ 206 - static u16 cgrp_dfl_inhibit_ss_mask; 206 + static u32 cgrp_dfl_inhibit_ss_mask; 207 207 208 208 /* some controllers are implicitly enabled on the default hierarchy */ 209 - static u16 cgrp_dfl_implicit_ss_mask; 209 + static u32 cgrp_dfl_implicit_ss_mask; 210 210 211 211 /* some controllers can be threaded on the default hierarchy */ 212 - static u16 cgrp_dfl_threaded_ss_mask; 212 + static u32 cgrp_dfl_threaded_ss_mask; 213 213 214 214 /* The list of hierarchy roots */ 215 215 LIST_HEAD(cgroup_roots); ··· 231 231 * These bitmasks identify subsystems with specific features to avoid 232 232 * having to do iterative checks repeatedly. 233 233 */ 234 - static u16 have_fork_callback __read_mostly; 235 - static u16 have_exit_callback __read_mostly; 236 - static u16 have_release_callback __read_mostly; 237 - static u16 have_canfork_callback __read_mostly; 234 + static u32 have_fork_callback __read_mostly; 235 + static u32 have_exit_callback __read_mostly; 236 + static u32 have_release_callback __read_mostly; 237 + static u32 have_canfork_callback __read_mostly; 238 238 239 239 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); 240 240 ··· 472 472 } 473 473 474 474 /* subsystems visibly enabled on a cgroup */ 475 - static u16 cgroup_control(struct cgroup *cgrp) 475 + static u32 cgroup_control(struct cgroup *cgrp) 476 476 { 477 477 struct cgroup *parent = cgroup_parent(cgrp); 478 - u16 root_ss_mask = cgrp->root->subsys_mask; 478 + u32 root_ss_mask = cgrp->root->subsys_mask; 479 479 480 480 if (parent) { 481 - u16 ss_mask = parent->subtree_control; 481 + u32 ss_mask = parent->subtree_control; 482 482 483 483 /* threaded cgroups can only have threaded controllers */ 484 484 if (cgroup_is_threaded(cgrp)) ··· 493 493 } 494 494 495 495 /* subsystems enabled on a cgroup */ 496 - static u16 cgroup_ss_mask(struct cgroup *cgrp) 496 + static u32 cgroup_ss_mask(struct cgroup *cgrp) 497 497 { 498 498 struct cgroup *parent = cgroup_parent(cgrp); 499 499 500 500 if (parent) { 501 - u16 ss_mask = parent->subtree_ss_mask; 501 + u32 ss_mask = parent->subtree_ss_mask; 502 502 503 503 /* threaded cgroups can only have threaded controllers */ 504 504 if (cgroup_is_threaded(cgrp)) ··· 1633 1633 * This function calculates which subsystems need to be enabled if 1634 1634 * @subtree_control is to be applied while restricted to @this_ss_mask. 
1635 1635 */ 1636 - static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask) 1636 + static u32 cgroup_calc_subtree_ss_mask(u32 subtree_control, u32 this_ss_mask) 1637 1637 { 1638 - u16 cur_ss_mask = subtree_control; 1638 + u32 cur_ss_mask = subtree_control; 1639 1639 struct cgroup_subsys *ss; 1640 1640 int ssid; 1641 1641 ··· 1644 1644 cur_ss_mask |= cgrp_dfl_implicit_ss_mask; 1645 1645 1646 1646 while (true) { 1647 - u16 new_ss_mask = cur_ss_mask; 1647 + u32 new_ss_mask = cur_ss_mask; 1648 1648 1649 1649 do_each_subsys_mask(ss, ssid, cur_ss_mask) { 1650 1650 new_ss_mask |= ss->depends_on; ··· 1848 1848 return ret; 1849 1849 } 1850 1850 1851 - int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask) 1851 + int rebind_subsystems(struct cgroup_root *dst_root, u32 ss_mask) 1852 1852 { 1853 1853 struct cgroup *dcgrp = &dst_root->cgrp; 1854 1854 struct cgroup_subsys *ss; 1855 1855 int ssid, ret; 1856 - u16 dfl_disable_ss_mask = 0; 1856 + u32 dfl_disable_ss_mask = 0; 1857 1857 1858 1858 lockdep_assert_held(&cgroup_mutex); 1859 1859 ··· 2149 2149 set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags); 2150 2150 } 2151 2151 2152 - int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) 2152 + int cgroup_setup_root(struct cgroup_root *root, u32 ss_mask) 2153 2153 { 2154 2154 LIST_HEAD(tmp_links); 2155 2155 struct cgroup *root_cgrp = &root->cgrp; ··· 3131 3131 put_task_struct(task); 3132 3132 } 3133 3133 3134 - static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask) 3134 + static void cgroup_print_ss_mask(struct seq_file *seq, u32 ss_mask) 3135 3135 { 3136 3136 struct cgroup_subsys *ss; 3137 3137 bool printed = false; ··· 3496 3496 cgroup_apply_control_disable(cgrp); 3497 3497 } 3498 3498 3499 - static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable) 3499 + static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u32 enable) 3500 3500 { 3501 - u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask; 3501 + u32 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask; 3502 3502 3503 3503 /* if nothing is getting enabled, nothing to worry about */ 3504 3504 if (!enable) ··· 3541 3541 char *buf, size_t nbytes, 3542 3542 loff_t off) 3543 3543 { 3544 - u16 enable = 0, disable = 0; 3544 + u32 enable = 0, disable = 0; 3545 3545 struct cgroup *cgrp, *child; 3546 3546 struct cgroup_subsys *ss; 3547 3547 char *tok; ··· 4945 4945 4946 4946 rcu_read_lock(); 4947 4947 css_for_each_child(child, css) { 4948 - if (child->flags & CSS_ONLINE) { 4948 + if (css_is_online(child)) { 4949 4949 ret = true; 4950 4950 break; 4951 4951 } ··· 5750 5750 5751 5751 lockdep_assert_held(&cgroup_mutex); 5752 5752 5753 - if (!(css->flags & CSS_ONLINE)) 5753 + if (!css_is_online(css)) 5754 5754 return; 5755 5755 5756 5756 if (ss->css_offline) ··· 6347 6347 struct cgroup_subsys *ss; 6348 6348 int ssid; 6349 6349 6350 - BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16); 6350 + BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 32); 6351 6351 BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files)); 6352 6352 BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files)); 6353 6353 BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
+47 -7
kernel/cgroup/cpuset-internal.h
··· 9 9 #include <linux/cpuset.h> 10 10 #include <linux/spinlock.h> 11 11 #include <linux/union_find.h> 12 + #include <linux/sched/isolation.h> 12 13 13 14 /* See "Frequency meter" comments, below. */ 14 15 ··· 145 144 */ 146 145 nodemask_t old_mems_allowed; 147 146 148 - struct fmeter fmeter; /* memory_pressure filter */ 149 - 150 147 /* 151 148 * Tasks are being attached to this cpuset. Used to prevent 152 149 * zeroing cpus/mems_allowed between ->can_attach() and ->attach(). 153 150 */ 154 151 int attach_in_progress; 155 - 156 - /* for custom sched domain */ 157 - int relax_domain_level; 158 152 159 153 /* partition root state */ 160 154 int partition_root_state; ··· 175 179 /* Handle for cpuset.cpus.partition */ 176 180 struct cgroup_file partition_file; 177 181 182 + #ifdef CONFIG_CPUSETS_V1 183 + struct fmeter fmeter; /* memory_pressure filter */ 184 + 185 + /* for custom sched domain */ 186 + int relax_domain_level; 187 + 178 188 /* Used to merge intersecting subsets for generate_sched_domains */ 179 189 struct uf_node node; 190 + #endif 180 191 }; 192 + 193 + extern struct cpuset top_cpuset; 181 194 182 195 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css) 183 196 { ··· 245 240 return test_bit(CS_SPREAD_SLAB, &cs->flags); 246 241 } 247 242 243 + /* 244 + * Helper routine for generate_sched_domains(). 245 + * Do cpusets a, b have overlapping effective cpus_allowed masks? 246 + */ 247 + static inline int cpusets_overlap(struct cpuset *a, struct cpuset *b) 248 + { 249 + return cpumask_intersects(a->effective_cpus, b->effective_cpus); 250 + } 251 + 252 + static inline int nr_cpusets(void) 253 + { 254 + /* jump label reference count + the top-level cpuset */ 255 + return static_key_count(&cpusets_enabled_key.key) + 1; 256 + } 257 + 258 + static inline bool cpuset_is_populated(struct cpuset *cs) 259 + { 260 + lockdep_assert_cpuset_lock_held(); 261 + 262 + /* Cpusets in the process of attaching should be considered as populated */ 263 + return cgroup_is_populated(cs->css.cgroup) || 264 + cs->attach_in_progress; 265 + } 266 + 248 267 /** 249 268 * cpuset_for_each_child - traverse online children of a cpuset 250 269 * @child_cs: loop cursor pointing to the current child ··· 314 285 */ 315 286 #ifdef CONFIG_CPUSETS_V1 316 287 extern struct cftype cpuset1_files[]; 317 - void fmeter_init(struct fmeter *fmp); 318 288 void cpuset1_update_task_spread_flags(struct cpuset *cs, 319 289 struct task_struct *tsk); 320 290 void cpuset1_update_tasks_flags(struct cpuset *cs); ··· 321 293 struct cpumask *new_cpus, nodemask_t *new_mems, 322 294 bool cpus_updated, bool mems_updated); 323 295 int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial); 296 + bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2); 297 + void cpuset1_init(struct cpuset *cs); 298 + void cpuset1_online_css(struct cgroup_subsys_state *css); 299 + int cpuset1_generate_sched_domains(cpumask_var_t **domains, 300 + struct sched_domain_attr **attributes); 301 + 324 302 #else 325 - static inline void fmeter_init(struct fmeter *fmp) {} 326 303 static inline void cpuset1_update_task_spread_flags(struct cpuset *cs, 327 304 struct task_struct *tsk) {} 328 305 static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {} ··· 336 303 bool cpus_updated, bool mems_updated) {} 337 304 static inline int cpuset1_validate_change(struct cpuset *cur, 338 305 struct cpuset *trial) { return 0; } 306 + static inline bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, 307 + struct cpuset *cs2) { return 
false; } 308 + static inline void cpuset1_init(struct cpuset *cs) {} 309 + static inline void cpuset1_online_css(struct cgroup_subsys_state *css) {} 310 + static inline int cpuset1_generate_sched_domains(cpumask_var_t **domains, 311 + struct sched_domain_attr **attributes) { return 0; }; 312 + 339 313 #endif /* CONFIG_CPUSETS_V1 */ 340 314 341 315 #endif /* __CPUSET_INTERNAL_H */
+270 -1
kernel/cgroup/cpuset-v1.c
··· 62 62 #define FM_SCALE 1000 /* faux fixed point scale */ 63 63 64 64 /* Initialize a frequency meter */ 65 - void fmeter_init(struct fmeter *fmp) 65 + static void fmeter_init(struct fmeter *fmp) 66 66 { 67 67 fmp->cnt = 0; 68 68 fmp->val = 0; ··· 368 368 if (par && !is_cpuset_subset(trial, par)) 369 369 goto out; 370 370 371 + /* 372 + * Cpusets with tasks - existing or newly being attached - can't 373 + * be changed to have empty cpus_allowed or mems_allowed. 374 + */ 375 + ret = -ENOSPC; 376 + if (cpuset_is_populated(cur)) { 377 + if (!cpumask_empty(cur->cpus_allowed) && 378 + cpumask_empty(trial->cpus_allowed)) 379 + goto out; 380 + if (!nodes_empty(cur->mems_allowed) && 381 + nodes_empty(trial->mems_allowed)) 382 + goto out; 383 + } 384 + 371 385 ret = 0; 372 386 out: 373 387 return ret; 388 + } 389 + 390 + /* 391 + * cpuset1_cpus_excl_conflict() - Check if two cpusets have exclusive CPU conflicts 392 + * to legacy (v1) 393 + * @cs1: first cpuset to check 394 + * @cs2: second cpuset to check 395 + * 396 + * Returns: true if CPU exclusivity conflict exists, false otherwise 397 + * 398 + * If either cpuset is CPU exclusive, their allowed CPUs cannot intersect. 399 + */ 400 + bool cpuset1_cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2) 401 + { 402 + if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2)) 403 + return cpumask_intersects(cs1->cpus_allowed, 404 + cs2->cpus_allowed); 405 + 406 + return false; 374 407 } 375 408 376 409 #ifdef CONFIG_PROC_PID_CPUSET ··· 530 497 out_unlock: 531 498 cpuset_full_unlock(); 532 499 return retval; 500 + } 501 + 502 + void cpuset1_init(struct cpuset *cs) 503 + { 504 + fmeter_init(&cs->fmeter); 505 + cs->relax_domain_level = -1; 506 + } 507 + 508 + void cpuset1_online_css(struct cgroup_subsys_state *css) 509 + { 510 + struct cpuset *tmp_cs; 511 + struct cgroup_subsys_state *pos_css; 512 + struct cpuset *cs = css_cs(css); 513 + struct cpuset *parent = parent_cs(cs); 514 + 515 + lockdep_assert_cpus_held(); 516 + lockdep_assert_cpuset_lock_held(); 517 + 518 + if (is_spread_page(parent)) 519 + set_bit(CS_SPREAD_PAGE, &cs->flags); 520 + if (is_spread_slab(parent)) 521 + set_bit(CS_SPREAD_SLAB, &cs->flags); 522 + 523 + if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags)) 524 + return; 525 + 526 + /* 527 + * Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is 528 + * set. This flag handling is implemented in cgroup core for 529 + * historical reasons - the flag may be specified during mount. 530 + * 531 + * Currently, if any sibling cpusets have exclusive cpus or mem, we 532 + * refuse to clone the configuration - thereby refusing the task to 533 + * be entered, and as a result refusing the sys_unshare() or 534 + * clone() which initiated it. If this becomes a problem for some 535 + * users who wish to allow that scenario, then this could be 536 + * changed to grant parent->cpus_allowed-sibling_cpus_exclusive 537 + * (and likewise for mems) to the new cgroup. 
538 + */ 539 + rcu_read_lock(); 540 + cpuset_for_each_child(tmp_cs, pos_css, parent) { 541 + if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) { 542 + rcu_read_unlock(); 543 + return; 544 + } 545 + } 546 + rcu_read_unlock(); 547 + 548 + cpuset_callback_lock_irq(); 549 + cs->mems_allowed = parent->mems_allowed; 550 + cs->effective_mems = parent->mems_allowed; 551 + cpumask_copy(cs->cpus_allowed, parent->cpus_allowed); 552 + cpumask_copy(cs->effective_cpus, parent->cpus_allowed); 553 + cpuset_callback_unlock_irq(); 554 + } 555 + 556 + static void 557 + update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c) 558 + { 559 + if (dattr->relax_domain_level < c->relax_domain_level) 560 + dattr->relax_domain_level = c->relax_domain_level; 561 + } 562 + 563 + static void update_domain_attr_tree(struct sched_domain_attr *dattr, 564 + struct cpuset *root_cs) 565 + { 566 + struct cpuset *cp; 567 + struct cgroup_subsys_state *pos_css; 568 + 569 + rcu_read_lock(); 570 + cpuset_for_each_descendant_pre(cp, pos_css, root_cs) { 571 + /* skip the whole subtree if @cp doesn't have any CPU */ 572 + if (cpumask_empty(cp->cpus_allowed)) { 573 + pos_css = css_rightmost_descendant(pos_css); 574 + continue; 575 + } 576 + 577 + if (is_sched_load_balance(cp)) 578 + update_domain_attr(dattr, cp); 579 + } 580 + rcu_read_unlock(); 581 + } 582 + 583 + /* 584 + * cpuset1_generate_sched_domains() 585 + * 586 + * Finding the best partition (set of domains): 587 + * The double nested loops below over i, j scan over the load 588 + * balanced cpusets (using the array of cpuset pointers in csa[]) 589 + * looking for pairs of cpusets that have overlapping cpus_allowed 590 + * and merging them using a union-find algorithm. 591 + * 592 + * The union of the cpus_allowed masks from the set of all cpusets 593 + * having the same root then form the one element of the partition 594 + * (one sched domain) to be passed to partition_sched_domains(). 595 + */ 596 + int cpuset1_generate_sched_domains(cpumask_var_t **domains, 597 + struct sched_domain_attr **attributes) 598 + { 599 + struct cpuset *cp; /* top-down scan of cpusets */ 600 + struct cpuset **csa; /* array of all cpuset ptrs */ 601 + int csn; /* how many cpuset ptrs in csa so far */ 602 + int i, j; /* indices for partition finding loops */ 603 + cpumask_var_t *doms; /* resulting partition; i.e. 
sched domains */ 604 + struct sched_domain_attr *dattr; /* attributes for custom domains */ 605 + int ndoms = 0; /* number of sched domains in result */ 606 + int nslot; /* next empty doms[] struct cpumask slot */ 607 + struct cgroup_subsys_state *pos_css; 608 + int nslot_update; 609 + 610 + lockdep_assert_cpuset_lock_held(); 611 + 612 + doms = NULL; 613 + dattr = NULL; 614 + csa = NULL; 615 + 616 + /* Special case for the 99% of systems with one, full, sched domain */ 617 + if (is_sched_load_balance(&top_cpuset)) { 618 + ndoms = 1; 619 + doms = alloc_sched_domains(ndoms); 620 + if (!doms) 621 + goto done; 622 + 623 + dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL); 624 + if (dattr) { 625 + *dattr = SD_ATTR_INIT; 626 + update_domain_attr_tree(dattr, &top_cpuset); 627 + } 628 + cpumask_and(doms[0], top_cpuset.effective_cpus, 629 + housekeeping_cpumask(HK_TYPE_DOMAIN)); 630 + 631 + goto done; 632 + } 633 + 634 + csa = kmalloc_array(nr_cpusets(), sizeof(cp), GFP_KERNEL); 635 + if (!csa) 636 + goto done; 637 + csn = 0; 638 + 639 + rcu_read_lock(); 640 + cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) { 641 + if (cp == &top_cpuset) 642 + continue; 643 + 644 + /* 645 + * Continue traversing beyond @cp iff @cp has some CPUs and 646 + * isn't load balancing. The former is obvious. The 647 + * latter: All child cpusets contain a subset of the 648 + * parent's cpus, so just skip them, and then we call 649 + * update_domain_attr_tree() to calc relax_domain_level of 650 + * the corresponding sched domain. 651 + */ 652 + if (!cpumask_empty(cp->cpus_allowed) && 653 + !(is_sched_load_balance(cp) && 654 + cpumask_intersects(cp->cpus_allowed, 655 + housekeeping_cpumask(HK_TYPE_DOMAIN)))) 656 + continue; 657 + 658 + if (is_sched_load_balance(cp) && 659 + !cpumask_empty(cp->effective_cpus)) 660 + csa[csn++] = cp; 661 + 662 + /* skip @cp's subtree */ 663 + pos_css = css_rightmost_descendant(pos_css); 664 + continue; 665 + } 666 + rcu_read_unlock(); 667 + 668 + for (i = 0; i < csn; i++) 669 + uf_node_init(&csa[i]->node); 670 + 671 + /* Merge overlapping cpusets */ 672 + for (i = 0; i < csn; i++) { 673 + for (j = i + 1; j < csn; j++) { 674 + if (cpusets_overlap(csa[i], csa[j])) 675 + uf_union(&csa[i]->node, &csa[j]->node); 676 + } 677 + } 678 + 679 + /* Count the total number of domains */ 680 + for (i = 0; i < csn; i++) { 681 + if (uf_find(&csa[i]->node) == &csa[i]->node) 682 + ndoms++; 683 + } 684 + 685 + /* 686 + * Now we know how many domains to create. 687 + * Convert <csn, csa> to <ndoms, doms> and populate cpu masks. 688 + */ 689 + doms = alloc_sched_domains(ndoms); 690 + if (!doms) 691 + goto done; 692 + 693 + /* 694 + * The rest of the code, including the scheduler, can deal with 695 + * dattr==NULL case. No need to abort if alloc fails. 
696 + */ 697 + dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr), 698 + GFP_KERNEL); 699 + 700 + for (nslot = 0, i = 0; i < csn; i++) { 701 + nslot_update = 0; 702 + for (j = i; j < csn; j++) { 703 + if (uf_find(&csa[j]->node) == &csa[i]->node) { 704 + struct cpumask *dp = doms[nslot]; 705 + 706 + if (i == j) { 707 + nslot_update = 1; 708 + cpumask_clear(dp); 709 + if (dattr) 710 + *(dattr + nslot) = SD_ATTR_INIT; 711 + } 712 + cpumask_or(dp, dp, csa[j]->effective_cpus); 713 + cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN)); 714 + if (dattr) 715 + update_domain_attr_tree(dattr + nslot, csa[j]); 716 + } 717 + } 718 + if (nslot_update) 719 + nslot++; 720 + } 721 + BUG_ON(nslot != ndoms); 722 + 723 + done: 724 + kfree(csa); 725 + 726 + /* 727 + * Fallback to the default domain if kmalloc() failed. 728 + * See comments in partition_sched_domains(). 729 + */ 730 + if (doms == NULL) 731 + ndoms = 1; 732 + 733 + *domains = doms; 734 + *attributes = dattr; 735 + return ndoms; 533 736 } 534 737 535 738 /*
+122 -381
kernel/cgroup/cpuset.c
··· 119 119 * For simplicity, a local partition can be created under a local or remote 120 120 * partition but a remote partition cannot have any partition root in its 121 121 * ancestor chain except the cgroup root. 122 + * 123 + * A valid partition can be formed by setting exclusive_cpus or cpus_allowed 124 + * if exclusive_cpus is not set. In the case of partition with empty 125 + * exclusive_cpus, all the conflicting exclusive CPUs specified in the 126 + * following cpumasks of sibling cpusets will be removed from its 127 + * cpus_allowed in determining its effective_xcpus. 128 + * - effective_xcpus 129 + * - exclusive_cpus 130 + * 131 + * The "cpuset.cpus.exclusive" control file should be used for setting up 132 + * partition if the users want to get as many CPUs as possible. 122 133 */ 123 134 #define PRS_MEMBER 0 124 135 #define PRS_ROOT 1 ··· 212 201 * If cpu_online_mask is used while a hotunplug operation is happening in 213 202 * parallel, we may leave an offline CPU in cpu_allowed or some other masks. 214 203 */ 215 - static struct cpuset top_cpuset = { 204 + struct cpuset top_cpuset = { 216 205 .flags = BIT(CS_CPU_EXCLUSIVE) | 217 206 BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE), 218 207 .partition_root_state = PRS_ROOT, 219 - .relax_domain_level = -1, 220 - .remote_partition = false, 221 208 }; 222 209 223 210 /* ··· 268 259 void cpuset_unlock(void) 269 260 { 270 261 mutex_unlock(&cpuset_mutex); 262 + } 263 + 264 + void lockdep_assert_cpuset_lock_held(void) 265 + { 266 + lockdep_assert_held(&cpuset_mutex); 271 267 } 272 268 273 269 /** ··· 333 319 */ 334 320 static inline void dec_attach_in_progress_locked(struct cpuset *cs) 335 321 { 336 - lockdep_assert_held(&cpuset_mutex); 322 + lockdep_assert_cpuset_lock_held(); 337 323 338 324 cs->attach_in_progress--; 339 325 if (!cs->attach_in_progress) ··· 365 351 { 366 352 return cpuset_v2() || 367 353 (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE); 368 - } 369 - 370 - static inline bool cpuset_is_populated(struct cpuset *cs) 371 - { 372 - lockdep_assert_held(&cpuset_mutex); 373 - 374 - /* Cpusets in the process of attaching should be considered as populated */ 375 - return cgroup_is_populated(cs->css.cgroup) || 376 - cs->attach_in_progress; 377 354 } 378 355 379 356 /** ··· 608 603 609 604 /** 610 605 * cpus_excl_conflict - Check if two cpusets have exclusive CPU conflicts 611 - * @cs1: first cpuset to check 612 - * @cs2: second cpuset to check 606 + * @trial: the trial cpuset to be checked 607 + * @sibling: a sibling cpuset to be checked against 608 + * @xcpus_changed: set if exclusive_cpus has been set 613 609 * 614 610 * Returns: true if CPU exclusivity conflict exists, false otherwise 615 611 * 616 612 * Conflict detection rules: 617 - * 1. If either cpuset is CPU exclusive, they must be mutually exclusive 618 - * 2. exclusive_cpus masks cannot intersect between cpusets 619 - * 3. The allowed CPUs of one cpuset cannot be a subset of another's exclusive CPUs 613 + * o cgroup v1 614 + * See cpuset1_cpus_excl_conflict() 615 + * o cgroup v2 616 + * - The exclusive_cpus values cannot overlap. 617 + * - New exclusive_cpus cannot be a superset of a sibling's cpus_allowed. 
620 618 */ 621 - static inline bool cpus_excl_conflict(struct cpuset *cs1, struct cpuset *cs2) 619 + static inline bool cpus_excl_conflict(struct cpuset *trial, struct cpuset *sibling, 620 + bool xcpus_changed) 622 621 { 623 - /* If either cpuset is exclusive, check if they are mutually exclusive */ 624 - if (is_cpu_exclusive(cs1) || is_cpu_exclusive(cs2)) 625 - return !cpusets_are_exclusive(cs1, cs2); 622 + if (!cpuset_v2()) 623 + return cpuset1_cpus_excl_conflict(trial, sibling); 624 + 625 + /* The cpus_allowed of a sibling cpuset cannot be a subset of the new exclusive_cpus */ 626 + if (xcpus_changed && !cpumask_empty(sibling->cpus_allowed) && 627 + cpumask_subset(sibling->cpus_allowed, trial->exclusive_cpus)) 628 + return true; 626 629 627 630 /* Exclusive_cpus cannot intersect */ 628 - if (cpumask_intersects(cs1->exclusive_cpus, cs2->exclusive_cpus)) 629 - return true; 630 - 631 - /* The cpus_allowed of one cpuset cannot be a subset of another cpuset's exclusive_cpus */ 632 - if (!cpumask_empty(cs1->cpus_allowed) && 633 - cpumask_subset(cs1->cpus_allowed, cs2->exclusive_cpus)) 634 - return true; 635 - 636 - if (!cpumask_empty(cs2->cpus_allowed) && 637 - cpumask_subset(cs2->cpus_allowed, cs1->exclusive_cpus)) 638 - return true; 639 - 640 - return false; 631 + return cpumask_intersects(trial->exclusive_cpus, sibling->exclusive_cpus); 641 632 } 642 633 643 634 static inline bool mems_excl_conflict(struct cpuset *cs1, struct cpuset *cs2) ··· 667 666 { 668 667 struct cgroup_subsys_state *css; 669 668 struct cpuset *c, *par; 669 + bool xcpus_changed; 670 670 int ret = 0; 671 671 672 672 rcu_read_lock(); ··· 682 680 goto out; 683 681 684 682 par = parent_cs(cur); 685 - 686 - /* 687 - * Cpusets with tasks - existing or newly being attached - can't 688 - * be changed to have empty cpus_allowed or mems_allowed. 689 - */ 690 - ret = -ENOSPC; 691 - if (cpuset_is_populated(cur)) { 692 - if (!cpumask_empty(cur->cpus_allowed) && 693 - cpumask_empty(trial->cpus_allowed)) 694 - goto out; 695 - if (!nodes_empty(cur->mems_allowed) && 696 - nodes_empty(trial->mems_allowed)) 697 - goto out; 698 - } 699 683 700 684 /* 701 685 * We can't shrink if we won't have enough room for SCHED_DEADLINE ··· 710 722 * overlap. exclusive_cpus cannot overlap with each other if set. 711 723 */ 712 724 ret = -EINVAL; 725 + xcpus_changed = !cpumask_equal(cur->exclusive_cpus, trial->exclusive_cpus); 713 726 cpuset_for_each_child(c, css, par) { 714 727 if (c == cur) 715 728 continue; 716 - if (cpus_excl_conflict(trial, c)) 729 + if (cpus_excl_conflict(trial, c, xcpus_changed)) 717 730 goto out; 718 731 if (mems_excl_conflict(trial, c)) 719 732 goto out; ··· 727 738 } 728 739 729 740 #ifdef CONFIG_SMP 730 - /* 731 - * Helper routine for generate_sched_domains(). 732 - * Do cpusets a, b have overlapping effective cpus_allowed masks? 
733 - */ 734 - static int cpusets_overlap(struct cpuset *a, struct cpuset *b) 735 - { 736 - return cpumask_intersects(a->effective_cpus, b->effective_cpus); 737 - } 738 - 739 - static void 740 - update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c) 741 - { 742 - if (dattr->relax_domain_level < c->relax_domain_level) 743 - dattr->relax_domain_level = c->relax_domain_level; 744 - return; 745 - } 746 - 747 - static void update_domain_attr_tree(struct sched_domain_attr *dattr, 748 - struct cpuset *root_cs) 749 - { 750 - struct cpuset *cp; 751 - struct cgroup_subsys_state *pos_css; 752 - 753 - rcu_read_lock(); 754 - cpuset_for_each_descendant_pre(cp, pos_css, root_cs) { 755 - /* skip the whole subtree if @cp doesn't have any CPU */ 756 - if (cpumask_empty(cp->cpus_allowed)) { 757 - pos_css = css_rightmost_descendant(pos_css); 758 - continue; 759 - } 760 - 761 - if (is_sched_load_balance(cp)) 762 - update_domain_attr(dattr, cp); 763 - } 764 - rcu_read_unlock(); 765 - } 766 - 767 - /* Must be called with cpuset_mutex held. */ 768 - static inline int nr_cpusets(void) 769 - { 770 - /* jump label reference count + the top-level cpuset */ 771 - return static_key_count(&cpusets_enabled_key.key) + 1; 772 - } 773 741 774 742 /* 775 743 * generate_sched_domains() ··· 766 820 * convenient format, that can be easily compared to the prior 767 821 * value to determine what partition elements (sched domains) 768 822 * were changed (added or removed.) 769 - * 770 - * Finding the best partition (set of domains): 771 - * The double nested loops below over i, j scan over the load 772 - * balanced cpusets (using the array of cpuset pointers in csa[]) 773 - * looking for pairs of cpusets that have overlapping cpus_allowed 774 - * and merging them using a union-find algorithm. 775 - * 776 - * The union of the cpus_allowed masks from the set of all cpusets 777 - * having the same root then form the one element of the partition 778 - * (one sched domain) to be passed to partition_sched_domains(). 779 - * 780 823 */ 781 824 static int generate_sched_domains(cpumask_var_t **domains, 782 825 struct sched_domain_attr **attributes) 783 826 { 784 827 struct cpuset *cp; /* top-down scan of cpusets */ 785 828 struct cpuset **csa; /* array of all cpuset ptrs */ 786 - int csn; /* how many cpuset ptrs in csa so far */ 787 829 int i, j; /* indices for partition finding loops */ 788 830 cpumask_var_t *doms; /* resulting partition; i.e. 
sched domains */ 789 831 struct sched_domain_attr *dattr; /* attributes for custom domains */ 790 832 int ndoms = 0; /* number of sched domains in result */ 791 - int nslot; /* next empty doms[] struct cpumask slot */ 792 833 struct cgroup_subsys_state *pos_css; 793 - bool root_load_balance = is_sched_load_balance(&top_cpuset); 794 - bool cgrpv2 = cpuset_v2(); 795 - int nslot_update; 834 + 835 + if (!cpuset_v2()) 836 + return cpuset1_generate_sched_domains(domains, attributes); 796 837 797 838 doms = NULL; 798 839 dattr = NULL; 799 840 csa = NULL; 800 841 801 842 /* Special case for the 99% of systems with one, full, sched domain */ 802 - if (root_load_balance && cpumask_empty(subpartitions_cpus)) { 803 - single_root_domain: 843 + if (cpumask_empty(subpartitions_cpus)) { 804 844 ndoms = 1; 805 - doms = alloc_sched_domains(ndoms); 806 - if (!doms) 807 - goto done; 808 - 809 - dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL); 810 - if (dattr) { 811 - *dattr = SD_ATTR_INIT; 812 - update_domain_attr_tree(dattr, &top_cpuset); 813 - } 814 - cpumask_and(doms[0], top_cpuset.effective_cpus, 815 - housekeeping_cpumask(HK_TYPE_DOMAIN)); 816 - 817 - goto done; 845 + /* !csa will be checked and can be correctly handled */ 846 + goto generate_doms; 818 847 } 819 848 820 849 csa = kmalloc_array(nr_cpusets(), sizeof(cp), GFP_KERNEL); 821 850 if (!csa) 822 851 goto done; 823 - csn = 0; 824 852 853 + /* Find how many partitions and cache them to csa[] */ 825 854 rcu_read_lock(); 826 - if (root_load_balance) 827 - csa[csn++] = &top_cpuset; 828 855 cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) { 829 - if (cp == &top_cpuset) 830 - continue; 831 - 832 - if (cgrpv2) 833 - goto v2; 834 - 835 - /* 836 - * v1: 837 - * Continue traversing beyond @cp iff @cp has some CPUs and 838 - * isn't load balancing. The former is obvious. The 839 - * latter: All child cpusets contain a subset of the 840 - * parent's cpus, so just skip them, and then we call 841 - * update_domain_attr_tree() to calc relax_domain_level of 842 - * the corresponding sched domain. 843 - */ 844 - if (!cpumask_empty(cp->cpus_allowed) && 845 - !(is_sched_load_balance(cp) && 846 - cpumask_intersects(cp->cpus_allowed, 847 - housekeeping_cpumask(HK_TYPE_DOMAIN)))) 848 - continue; 849 - 850 - if (is_sched_load_balance(cp) && 851 - !cpumask_empty(cp->effective_cpus)) 852 - csa[csn++] = cp; 853 - 854 - /* skip @cp's subtree */ 855 - pos_css = css_rightmost_descendant(pos_css); 856 - continue; 857 - 858 - v2: 859 856 /* 860 857 * Only valid partition roots that are not isolated and with 861 - * non-empty effective_cpus will be saved into csn[]. 858 + * non-empty effective_cpus will be saved into csa[]. 862 859 */ 863 860 if ((cp->partition_root_state == PRS_ROOT) && 864 861 !cpumask_empty(cp->effective_cpus)) 865 - csa[csn++] = cp; 862 + csa[ndoms++] = cp; 866 863 867 864 /* 868 865 * Skip @cp's subtree if not a partition root and has no ··· 816 927 } 817 928 rcu_read_unlock(); 818 929 819 - /* 820 - * If there are only isolated partitions underneath the cgroup root, 821 - * we can optimize out unneeded sched domains scanning. 
822 - */ 823 - if (root_load_balance && (csn == 1)) 824 - goto single_root_domain; 825 - 826 - for (i = 0; i < csn; i++) 827 - uf_node_init(&csa[i]->node); 828 - 829 - /* Merge overlapping cpusets */ 830 - for (i = 0; i < csn; i++) { 831 - for (j = i + 1; j < csn; j++) { 832 - if (cpusets_overlap(csa[i], csa[j])) { 930 + for (i = 0; i < ndoms; i++) { 931 + for (j = i + 1; j < ndoms; j++) { 932 + if (cpusets_overlap(csa[i], csa[j])) 833 933 /* 834 934 * Cgroup v2 shouldn't pass down overlapping 835 935 * partition root cpusets. 836 936 */ 837 - WARN_ON_ONCE(cgrpv2); 838 - uf_union(&csa[i]->node, &csa[j]->node); 839 - } 937 + WARN_ON_ONCE(1); 840 938 } 841 939 } 842 940 843 - /* Count the total number of domains */ 844 - for (i = 0; i < csn; i++) { 845 - if (uf_find(&csa[i]->node) == &csa[i]->node) 846 - ndoms++; 847 - } 848 - 849 - /* 850 - * Now we know how many domains to create. 851 - * Convert <csn, csa> to <ndoms, doms> and populate cpu masks. 852 - */ 941 + generate_doms: 853 942 doms = alloc_sched_domains(ndoms); 854 943 if (!doms) 855 944 goto done; ··· 844 977 * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a 845 978 * subset of HK_TYPE_DOMAIN housekeeping CPUs. 846 979 */ 847 - if (cgrpv2) { 848 - for (i = 0; i < ndoms; i++) { 849 - /* 850 - * The top cpuset may contain some boot time isolated 851 - * CPUs that need to be excluded from the sched domain. 852 - */ 853 - if (csa[i] == &top_cpuset) 854 - cpumask_and(doms[i], csa[i]->effective_cpus, 855 - housekeeping_cpumask(HK_TYPE_DOMAIN)); 856 - else 857 - cpumask_copy(doms[i], csa[i]->effective_cpus); 858 - if (dattr) 859 - dattr[i] = SD_ATTR_INIT; 860 - } 861 - goto done; 980 + for (i = 0; i < ndoms; i++) { 981 + /* 982 + * The top cpuset may contain some boot time isolated 983 + * CPUs that need to be excluded from the sched domain. 
984 + */ 985 + if (!csa || csa[i] == &top_cpuset) 986 + cpumask_and(doms[i], top_cpuset.effective_cpus, 987 + housekeeping_cpumask(HK_TYPE_DOMAIN)); 988 + else 989 + cpumask_copy(doms[i], csa[i]->effective_cpus); 990 + if (dattr) 991 + dattr[i] = SD_ATTR_INIT; 862 992 } 863 - 864 - for (nslot = 0, i = 0; i < csn; i++) { 865 - nslot_update = 0; 866 - for (j = i; j < csn; j++) { 867 - if (uf_find(&csa[j]->node) == &csa[i]->node) { 868 - struct cpumask *dp = doms[nslot]; 869 - 870 - if (i == j) { 871 - nslot_update = 1; 872 - cpumask_clear(dp); 873 - if (dattr) 874 - *(dattr + nslot) = SD_ATTR_INIT; 875 - } 876 - cpumask_or(dp, dp, csa[j]->effective_cpus); 877 - cpumask_and(dp, dp, housekeeping_cpumask(HK_TYPE_DOMAIN)); 878 - if (dattr) 879 - update_domain_attr_tree(dattr + nslot, csa[j]); 880 - } 881 - } 882 - if (nslot_update) 883 - nslot++; 884 - } 885 - BUG_ON(nslot != ndoms); 886 993 887 994 done: 888 995 kfree(csa); ··· 896 1055 int cpu; 897 1056 u64 cookie = ++dl_cookie; 898 1057 899 - lockdep_assert_held(&cpuset_mutex); 1058 + lockdep_assert_cpuset_lock_held(); 900 1059 lockdep_assert_cpus_held(); 901 1060 lockdep_assert_held(&sched_domains_mutex); 902 1061 ··· 941 1100 */ 942 1101 void rebuild_sched_domains_locked(void) 943 1102 { 944 - struct cgroup_subsys_state *pos_css; 945 1103 struct sched_domain_attr *attr; 946 1104 cpumask_var_t *doms; 947 - struct cpuset *cs; 948 1105 int ndoms; 1106 + int i; 949 1107 950 1108 lockdep_assert_cpus_held(); 951 - lockdep_assert_held(&cpuset_mutex); 1109 + lockdep_assert_cpuset_lock_held(); 952 1110 force_sd_rebuild = false; 953 - 954 - /* 955 - * If we have raced with CPU hotplug, return early to avoid 956 - * passing doms with offlined cpu to partition_sched_domains(). 957 - * Anyways, cpuset_handle_hotplug() will rebuild sched domains. 958 - * 959 - * With no CPUs in any subpartitions, top_cpuset's effective CPUs 960 - * should be the same as the active CPUs, so checking only top_cpuset 961 - * is enough to detect racing CPU offlines. 962 - */ 963 - if (cpumask_empty(subpartitions_cpus) && 964 - !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask)) 965 - return; 966 - 967 - /* 968 - * With subpartition CPUs, however, the effective CPUs of a partition 969 - * root should be only a subset of the active CPUs. Since a CPU in any 970 - * partition root could be offlined, all must be checked. 971 - */ 972 - if (!cpumask_empty(subpartitions_cpus)) { 973 - rcu_read_lock(); 974 - cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) { 975 - if (!is_partition_valid(cs)) { 976 - pos_css = css_rightmost_descendant(pos_css); 977 - continue; 978 - } 979 - if (!cpumask_subset(cs->effective_cpus, 980 - cpu_active_mask)) { 981 - rcu_read_unlock(); 982 - return; 983 - } 984 - } 985 - rcu_read_unlock(); 986 - } 987 1111 988 1112 /* Generate domain masks and attrs */ 989 1113 ndoms = generate_sched_domains(&doms, &attr); 1114 + 1115 + /* 1116 + * cpuset_hotplug_workfn is invoked synchronously now, thus this 1117 + * function should not race with CPU hotplug. And the effective CPUs 1118 + * must not include any offline CPUs. Passing an offline CPU in the 1119 + * doms to partition_sched_domains() will trigger a kernel panic. 1120 + * 1121 + * We perform a final check here: if the doms contains any 1122 + * offline CPUs, a warning is emitted and we return directly to 1123 + * prevent the panic. 
1124 + */ 1125 + for (i = 0; i < ndoms; ++i) { 1126 + if (WARN_ON_ONCE(!cpumask_subset(doms[i], cpu_active_mask))) 1127 + return; 1128 + } 990 1129 991 1130 /* Have scheduler rebuild the domains */ 992 1131 partition_sched_domains(ndoms, doms, attr); ··· 1322 1501 int retval = 0; 1323 1502 1324 1503 if (cpumask_empty(excpus)) 1325 - return retval; 1504 + return 0; 1326 1505 1327 1506 /* 1328 - * Exclude exclusive CPUs from siblings 1507 + * Remove exclusive CPUs from siblings 1329 1508 */ 1330 1509 rcu_read_lock(); 1331 1510 cpuset_for_each_child(sibling, css, parent) { 1511 + struct cpumask *sibling_xcpus; 1512 + 1332 1513 if (sibling == cs) 1333 1514 continue; 1334 1515 1335 - if (cpumask_intersects(excpus, sibling->exclusive_cpus)) { 1336 - cpumask_andnot(excpus, excpus, sibling->exclusive_cpus); 1337 - retval++; 1338 - continue; 1339 - } 1340 - if (cpumask_intersects(excpus, sibling->effective_xcpus)) { 1341 - cpumask_andnot(excpus, excpus, sibling->effective_xcpus); 1516 + /* 1517 + * If exclusive_cpus is defined, effective_xcpus will always 1518 + * be a subset. Otherwise, effective_xcpus will only be set 1519 + * in a valid partition root. 1520 + */ 1521 + sibling_xcpus = cpumask_empty(sibling->exclusive_cpus) 1522 + ? sibling->effective_xcpus 1523 + : sibling->exclusive_cpus; 1524 + 1525 + if (cpumask_intersects(excpus, sibling_xcpus)) { 1526 + cpumask_andnot(excpus, excpus, sibling_xcpus); 1342 1527 retval++; 1343 1528 } 1344 1529 } ··· 1633 1806 int parent_prs = parent->partition_root_state; 1634 1807 bool nocpu; 1635 1808 1636 - lockdep_assert_held(&cpuset_mutex); 1809 + lockdep_assert_cpuset_lock_held(); 1637 1810 WARN_ON_ONCE(is_remote_partition(cs)); /* For local partition only */ 1638 1811 1639 1812 /* ··· 2142 2315 spin_lock_irq(&callback_lock); 2143 2316 cpumask_copy(cp->effective_cpus, tmp->new_cpus); 2144 2317 cp->partition_root_state = new_prs; 2145 - if (!cpumask_empty(cp->exclusive_cpus) && (cp != cs)) 2146 - compute_excpus(cp, cp->effective_xcpus); 2147 - 2148 2318 /* 2149 - * Make sure effective_xcpus is properly set for a valid 2150 - * partition root. 2319 + * Need to compute effective_xcpus if either exclusive_cpus 2320 + * is non-empty or it is a valid partition root. 2151 2321 */ 2152 - if ((new_prs > 0) && cpumask_empty(cp->exclusive_cpus)) 2153 - cpumask_and(cp->effective_xcpus, 2154 - cp->cpus_allowed, parent->effective_xcpus); 2155 - else if (new_prs < 0) 2322 + if ((new_prs > 0) || !cpumask_empty(cp->exclusive_cpus)) 2323 + compute_excpus(cp, cp->effective_xcpus); 2324 + if (new_prs <= 0) 2156 2325 reset_partition_data(cp); 2157 2326 spin_unlock_irq(&callback_lock); 2158 2327 ··· 2201 2378 struct cpuset *sibling; 2202 2379 struct cgroup_subsys_state *pos_css; 2203 2380 2204 - lockdep_assert_held(&cpuset_mutex); 2381 + lockdep_assert_cpuset_lock_held(); 2205 2382 2206 2383 /* 2207 2384 * Check all its siblings and call update_cpumasks_hier() ··· 2210 2387 * It is possible a change in parent's effective_cpus 2211 2388 * due to a change in a child partition's effective_xcpus will impact 2212 2389 * its siblings even if they do not inherit parent's effective_cpus 2213 - * directly. 2390 + * directly. It should not impact valid partition. 2214 2391 * 2215 2392 * The update_cpumasks_hier() function may sleep. So we have to 2216 2393 * release the RCU read lock before calling it. 
2217 2394 */ 2218 2395 rcu_read_lock(); 2219 2396 cpuset_for_each_child(sibling, pos_css, parent) { 2220 - if (sibling == cs) 2397 + if (sibling == cs || is_partition_valid(sibling)) 2221 2398 continue; 2222 - if (!is_partition_valid(sibling)) { 2223 - compute_effective_cpumask(tmp->new_cpus, sibling, 2224 - parent); 2225 - if (cpumask_equal(tmp->new_cpus, sibling->effective_cpus)) 2226 - continue; 2227 - } else if (is_remote_partition(sibling)) { 2228 - /* 2229 - * Change in a sibling cpuset won't affect a remote 2230 - * partition root. 2231 - */ 2399 + 2400 + compute_effective_cpumask(tmp->new_cpus, sibling, 2401 + parent); 2402 + if (cpumask_equal(tmp->new_cpus, sibling->effective_cpus)) 2232 2403 continue; 2233 - } 2234 2404 2235 2405 if (!css_tryget_online(&sibling->css)) 2236 2406 continue; ··· 2277 2461 return PERR_NOCPUS; 2278 2462 2279 2463 return PERR_NONE; 2280 - } 2281 - 2282 - static int cpus_allowed_validate_change(struct cpuset *cs, struct cpuset *trialcs, 2283 - struct tmpmasks *tmp) 2284 - { 2285 - int retval; 2286 - struct cpuset *parent = parent_cs(cs); 2287 - 2288 - retval = validate_change(cs, trialcs); 2289 - 2290 - if ((retval == -EINVAL) && cpuset_v2()) { 2291 - struct cgroup_subsys_state *css; 2292 - struct cpuset *cp; 2293 - 2294 - /* 2295 - * The -EINVAL error code indicates that partition sibling 2296 - * CPU exclusivity rule has been violated. We still allow 2297 - * the cpumask change to proceed while invalidating the 2298 - * partition. However, any conflicting sibling partitions 2299 - * have to be marked as invalid too. 2300 - */ 2301 - trialcs->prs_err = PERR_NOTEXCL; 2302 - rcu_read_lock(); 2303 - cpuset_for_each_child(cp, css, parent) { 2304 - struct cpumask *xcpus = user_xcpus(trialcs); 2305 - 2306 - if (is_partition_valid(cp) && 2307 - cpumask_intersects(xcpus, cp->effective_xcpus)) { 2308 - rcu_read_unlock(); 2309 - update_parent_effective_cpumask(cp, partcmd_invalidate, NULL, tmp); 2310 - rcu_read_lock(); 2311 - } 2312 - } 2313 - rcu_read_unlock(); 2314 - retval = 0; 2315 - } 2316 - return retval; 2317 2464 } 2318 2465 2319 2466 /** ··· 2338 2559 if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed)) 2339 2560 return 0; 2340 2561 2341 - if (alloc_tmpmasks(&tmp)) 2342 - return -ENOMEM; 2343 - 2344 2562 compute_trialcs_excpus(trialcs, cs); 2345 2563 trialcs->prs_err = PERR_NONE; 2346 2564 2347 - retval = cpus_allowed_validate_change(cs, trialcs, &tmp); 2565 + retval = validate_change(cs, trialcs); 2348 2566 if (retval < 0) 2349 - goto out_free; 2567 + return retval; 2568 + 2569 + if (alloc_tmpmasks(&tmp)) 2570 + return -ENOMEM; 2350 2571 2351 2572 /* 2352 2573 * Check all the descendants in update_cpumasks_hier() if ··· 2369 2590 /* Update CS_SCHED_LOAD_BALANCE and/or sched_domains, if necessary */ 2370 2591 if (cs->partition_root_state) 2371 2592 update_partition_sd_lb(cs, old_prs); 2372 - out_free: 2593 + 2373 2594 free_tmpmasks(&tmp); 2374 2595 return retval; 2375 2596 } ··· 3028 3249 3029 3250 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task) 3030 3251 { 3031 - lockdep_assert_held(&cpuset_mutex); 3252 + lockdep_assert_cpuset_lock_held(); 3032 3253 3033 3254 if (cs != &top_cpuset) 3034 3255 guarantee_active_cpus(task, cpus_attach); ··· 3384 3605 return ERR_PTR(-ENOMEM); 3385 3606 3386 3607 __set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags); 3387 - fmeter_init(&cs->fmeter); 3388 - cs->relax_domain_level = -1; 3608 + cpuset1_init(cs); 3389 3609 3390 3610 /* Set CS_MEMORY_MIGRATE for default hierarchy */ 3391 3611 if 
(cpuset_v2()) ··· 3397 3619 { 3398 3620 struct cpuset *cs = css_cs(css); 3399 3621 struct cpuset *parent = parent_cs(cs); 3400 - struct cpuset *tmp_cs; 3401 - struct cgroup_subsys_state *pos_css; 3402 3622 3403 3623 if (!parent) 3404 3624 return 0; 3405 3625 3406 3626 cpuset_full_lock(); 3407 - if (is_spread_page(parent)) 3408 - set_bit(CS_SPREAD_PAGE, &cs->flags); 3409 - if (is_spread_slab(parent)) 3410 - set_bit(CS_SPREAD_SLAB, &cs->flags); 3411 3627 /* 3412 3628 * For v2, clear CS_SCHED_LOAD_BALANCE if parent is isolated 3413 3629 */ ··· 3416 3644 cs->effective_mems = parent->effective_mems; 3417 3645 } 3418 3646 spin_unlock_irq(&callback_lock); 3647 + cpuset1_online_css(css); 3419 3648 3420 - if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags)) 3421 - goto out_unlock; 3422 - 3423 - /* 3424 - * Clone @parent's configuration if CGRP_CPUSET_CLONE_CHILDREN is 3425 - * set. This flag handling is implemented in cgroup core for 3426 - * historical reasons - the flag may be specified during mount. 3427 - * 3428 - * Currently, if any sibling cpusets have exclusive cpus or mem, we 3429 - * refuse to clone the configuration - thereby refusing the task to 3430 - * be entered, and as a result refusing the sys_unshare() or 3431 - * clone() which initiated it. If this becomes a problem for some 3432 - * users who wish to allow that scenario, then this could be 3433 - * changed to grant parent->cpus_allowed-sibling_cpus_exclusive 3434 - * (and likewise for mems) to the new cgroup. 3435 - */ 3436 - rcu_read_lock(); 3437 - cpuset_for_each_child(tmp_cs, pos_css, parent) { 3438 - if (is_mem_exclusive(tmp_cs) || is_cpu_exclusive(tmp_cs)) { 3439 - rcu_read_unlock(); 3440 - goto out_unlock; 3441 - } 3442 - } 3443 - rcu_read_unlock(); 3444 - 3445 - spin_lock_irq(&callback_lock); 3446 - cs->mems_allowed = parent->mems_allowed; 3447 - cs->effective_mems = parent->mems_allowed; 3448 - cpumask_copy(cs->cpus_allowed, parent->cpus_allowed); 3449 - cpumask_copy(cs->effective_cpus, parent->cpus_allowed); 3450 - spin_unlock_irq(&callback_lock); 3451 - out_unlock: 3452 3649 cpuset_full_unlock(); 3453 3650 return 0; 3454 3651 } ··· 3617 3876 cpumask_setall(top_cpuset.exclusive_cpus); 3618 3877 nodes_setall(top_cpuset.effective_mems); 3619 3878 3620 - fmeter_init(&top_cpuset.fmeter); 3879 + cpuset1_init(&top_cpuset); 3621 3880 3622 3881 BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL)); 3623 3882 ··· 3951 4210 */ 3952 4211 void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask) 3953 4212 { 3954 - lockdep_assert_held(&cpuset_mutex); 4213 + lockdep_assert_cpuset_lock_held(); 3955 4214 __cpuset_cpus_allowed_locked(tsk, pmask); 3956 4215 } 3957 4216
+1 -1
kernel/cgroup/debug.c
···
 }
 
 static void cgroup_masks_read_one(struct seq_file *seq, const char *name,
-                                 u16 mask)
+                                 u32 mask)
 {
        struct cgroup_subsys *ss;
        int ssid;
+1 -1
mm/memcontrol.c
··· 283 283 /* page_folio() is racy here, but the entire function is racy anyway */ 284 284 memcg = folio_memcg_check(page_folio(page)); 285 285 286 - while (memcg && !(memcg->css.flags & CSS_ONLINE)) 286 + while (memcg && !css_is_online(&memcg->css)) 287 287 memcg = parent_mem_cgroup(memcg); 288 288 if (memcg) 289 289 ino = cgroup_ino(memcg->css.cgroup);
+1 -1
mm/page_owner.c
··· 530 530 if (!memcg) 531 531 goto out_unlock; 532 532 533 - online = (memcg->css.flags & CSS_ONLINE); 533 + online = css_is_online(&memcg->css); 534 534 cgroup_name(memcg->css.cgroup, name, sizeof(name)); 535 535 ret += scnprintf(kbuf + ret, count - ret, 536 536 "Charged %sto %smemcg %s\n",
+21
tools/testing/selftests/cgroup/lib/cgroup_util.c
··· 168 168 return atol(ptr + strlen(key)); 169 169 } 170 170 171 + long cg_read_key_long_poll(const char *cgroup, const char *control, 172 + const char *key, long expected, int retries, 173 + useconds_t wait_interval_us) 174 + { 175 + long val = -1; 176 + int i; 177 + 178 + for (i = 0; i < retries; i++) { 179 + val = cg_read_key_long(cgroup, control, key); 180 + if (val < 0) 181 + return val; 182 + 183 + if (val == expected) 184 + break; 185 + 186 + usleep(wait_interval_us); 187 + } 188 + 189 + return val; 190 + } 191 + 171 192 long cg_read_lc(const char *cgroup, const char *control) 172 193 { 173 194 char buf[PAGE_SIZE];
+5
tools/testing/selftests/cgroup/lib/include/cgroup_util.h
··· 17 17 #define CG_NAMED_NAME "selftest" 18 18 #define CG_PATH_FORMAT (!cg_test_v1_named ? "0::%s" : (":name=" CG_NAMED_NAME ":%s")) 19 19 20 + #define DEFAULT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */ 21 + 20 22 /* 21 23 * Checks if two given values differ by less than err% of their sum. 22 24 */ ··· 66 64 extern long cg_read_long(const char *cgroup, const char *control); 67 65 extern long cg_read_long_fd(int fd); 68 66 long cg_read_key_long(const char *cgroup, const char *control, const char *key); 67 + long cg_read_key_long_poll(const char *cgroup, const char *control, 68 + const char *key, long expected, int retries, 69 + useconds_t wait_interval_us); 69 70 extern long cg_read_lc(const char *cgroup, const char *control); 70 71 extern int cg_write(const char *cgroup, const char *control, char *buf); 71 72 extern int cg_open(const char *cgroup, const char *control, int flags);
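A minimal usage sketch of the new polling helper (not part of the diff; the cgroup path, retry count, and error handling here are illustrative only, while cg_read_key_long_poll() and DEFAULT_WAIT_INTERVAL_US are as added above):

	/*
	 * Wait up to ~8s (80 retries x 100 ms) for the parent cgroup's
	 * dying descendants to be reclaimed.  "parent" is a hypothetical
	 * cgroup path created earlier with cg_create().
	 */
	long dead = cg_read_key_long_poll(parent, "cgroup.stat",
					  "nr_dying_descendants ", 0,
					  80, DEFAULT_WAIT_INTERVAL_US);
	if (dead)	/* non-zero: read error or counter never converged */
		return KSFT_FAIL;

This mirrors the call pattern adopted below by test_kmem.c and test_memcontrol.c, replacing open-coded sleep loops.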
+22 -7
tools/testing/selftests/cgroup/test_cpuset_prs.sh
··· 269 269 " C0-3:S+ C1-3:S+ C2-3 . X2-3 X3:P2 . . 0 A1:0-2|A2:3|A3:3 A1:P0|A2:P2 3" 270 270 " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2 . 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3" 271 271 " C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-1|A2:1|A3:2-3 A1:P0|A3:P2 2-3" 272 - " C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-3|A2:1-3|A3:2-3|B1:2-3 A1:P0|A3:P0|B1:P-2" 272 + " C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-1|A2:1|A3:1|B1:2-3 A1:P0|A3:P0|B1:P2" 273 273 " C0-3:S+ C1-3:S+ C2-3 C4-5 . . . P2 0 B1:4-5 B1:P2 4-5" 274 274 " C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2 0 A3:2-3|B1:4 A3:P2|B1:P2 2-4" 275 275 " C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2:C1-3 P2 0 A3:2-3|B1:4 A3:P2|B1:P2 2-4" ··· 318 318 # Invalid to valid local partition direct transition tests 319 319 " C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3|XA1:1-3|A2:1-3:XA2: A1:P2|A2:P-2 1-3" 320 320 " C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2|XA1:1-3|A2:3:XA2:3 A1:P2|A2:P2 1-3" 321 - " C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:4-6 A1:P-2|B1:P0" 321 + " C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4|B1:5-6 A1:P2|B1:P0" 322 322 " C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3|B1:4-6 A1:P2|B1:P0 0-3" 323 323 324 324 # Local partition invalidation tests ··· 388 388 " C0-1:S+ C1 . C2-3 . P2 . . 0 A1:0-1|A2:1 A1:P0|A2:P-2" 389 389 " C0-1:S+ C1:P2 . C2-3 P1 . . . 0 A1:0|A2:1 A1:P1|A2:P2 0-1|1" 390 390 391 - # A non-exclusive cpuset.cpus change will invalidate partition and its siblings 392 - " C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P-1|B1:P0" 393 - " C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P-1|B1:P-1" 394 - " C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2|B1:2-3 A1:P0|B1:P-1" 391 + # A non-exclusive cpuset.cpus change will not invalidate its siblings partition. 392 + " C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2|B1:3 A1:P1|B1:P0" 393 + " C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-1|XA1:0-1|B1:2-3 A1:P1|B1:P1" 394 + " C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-1|B1:2-3 A1:P0|B1:P1" 395 395 396 396 # cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it 397 397 " C0-3 . . C4-5 X5 . . . 0 A1:0-3|B1:4-5" ··· 417 417 " CX1-4:S+ CX2-4:P2 . C5-6 . . . P1:C3-6 0 A1:1|A2:2-4|B1:5-6 \ 418 418 A1:P0|A2:P2:B1:P-1 2-4" 419 419 420 + # When multiple partitions with conflicting cpuset.cpus are created, the 421 + # latter created ones will only get what are left of the available exclusive 422 + # CPUs. 423 + " C1-3:P1 . . . . . . C3-5:P1 0 A1:1-3|B1:4-5:XB1:4-5 A1:P1|B1:P1" 424 + 425 + # cpuset.cpus can be set to a subset of sibling's cpuset.cpus.exclusive 426 + " C1-3:X1-3 . . C4-5 . . . C1-2 0 A1:1-3|B1:1-2" 427 + 428 + # cpuset.cpus can become empty with task in it as it inherits parent's effective CPUs 429 + " C1-3:S+ C2 . . . T:C . . 0 A1:1-3|A2:1-3" 430 + 420 431 # old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS 421 432 # ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ -------- 422 433 # Failure cases: ··· 438 427 # Changes to cpuset.cpus.exclusive that violate exclusivity rule is rejected 439 428 " C0-3 . . C4-5 X0-3 . . X3-5 1 A1:0-3|B1:4-5" 440 429 441 - # cpuset.cpus cannot be a subset of sibling cpuset.cpus.exclusive 430 + # cpuset.cpus.exclusive cannot be set to a superset of sibling's cpuset.cpus 442 431 " C0-3 . . C4-5 X3-5 . . . 1 A1:0-3|B1:4-5" 443 432 ) 444 433 ··· 488 477 . . X1-2:P2 X4-5:P1 . 
X1-7:P2 p1:3|c11:1-2|c12:4:c22:5-6 \ 489 478 p1:P0|p2:P1|c11:P2|c12:P1|c22:P2 \ 490 479 1-2,4-6|1-2,5-6" 480 + # c12 whose cpuset.cpus CPUs are all granted to c11 will become invalid partition 481 + " C1-5:P1:S+ . C1-4:P1 C2-3 . . \ 482 + . . . P1 . . p1:5|c11:1-4|c12:5 \ 483 + p1:P1|c11:P1|c12:P-1" 491 484 ) 492 485 493 486 #
+15 -18
tools/testing/selftests/cgroup/test_kmem.c
··· 26 26 */ 27 27 #define MAX_VMSTAT_ERROR (4096 * 64 * get_nprocs()) 28 28 29 + #define KMEM_DEAD_WAIT_RETRIES 80 29 30 30 31 static int alloc_dcache(const char *cgroup, void *arg) 31 32 { ··· 307 306 { 308 307 int ret = KSFT_FAIL; 309 308 char *parent; 310 - long dead; 311 - int i; 312 - int max_time = 20; 309 + long dead = -1; 313 310 314 311 parent = cg_name(root, "kmem_dead_cgroups_test"); 315 312 if (!parent) ··· 322 323 if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30)) 323 324 goto cleanup; 324 325 325 - for (i = 0; i < max_time; i++) { 326 - dead = cg_read_key_long(parent, "cgroup.stat", 327 - "nr_dying_descendants "); 328 - if (dead == 0) { 329 - ret = KSFT_PASS; 330 - break; 331 - } 332 - /* 333 - * Reclaiming cgroups might take some time, 334 - * let's wait a bit and repeat. 335 - */ 336 - sleep(1); 337 - if (i > 5) 338 - printf("Waiting time longer than 5s; wait: %ds (dead: %ld)\n", i, dead); 339 - } 326 + /* 327 + * Allow up to ~8s for reclaim of dying descendants to complete. 328 + * This is a generous upper bound derived from stress testing, not 329 + * from a specific kernel constant, and can be adjusted if reclaim 330 + * behavior changes in the future. 331 + */ 332 + dead = cg_read_key_long_poll(parent, "cgroup.stat", 333 + "nr_dying_descendants ", 0, KMEM_DEAD_WAIT_RETRIES, 334 + DEFAULT_WAIT_INTERVAL_US); 335 + if (dead) 336 + goto cleanup; 337 + 338 + ret = KSFT_PASS; 340 339 341 340 cleanup: 342 341 cg_destroy(parent);
+19 -1
tools/testing/selftests/cgroup/test_memcontrol.c
··· 21 21 #include "kselftest.h" 22 22 #include "cgroup_util.h" 23 23 24 + #define MEMCG_SOCKSTAT_WAIT_RETRIES 30 25 + 24 26 static bool has_localevents; 25 27 static bool has_recursiveprot; 26 28 ··· 1386 1384 int bind_retries = 5, ret = KSFT_FAIL, pid, err; 1387 1385 unsigned short port; 1388 1386 char *memcg; 1387 + long sock_post = -1; 1389 1388 1390 1389 memcg = cg_name(root, "memcg_test"); 1391 1390 if (!memcg) ··· 1435 1432 if (cg_read_long(memcg, "memory.current") < 0) 1436 1433 goto cleanup; 1437 1434 1438 - if (cg_read_key_long(memcg, "memory.stat", "sock ")) 1435 + /* 1436 + * memory.stat is updated asynchronously via the memcg rstat 1437 + * flushing worker, which runs periodically (every 2 seconds, 1438 + * see FLUSH_TIME). On a busy system, the "sock " counter may 1439 + * stay non-zero for a short period of time after the TCP 1440 + * connection is closed and all socket memory has been 1441 + * uncharged. 1442 + * 1443 + * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some 1444 + * scheduling slack) and require that the "sock " counter 1445 + * eventually drops to zero. 1446 + */ 1447 + sock_post = cg_read_key_long_poll(memcg, "memory.stat", "sock ", 0, 1448 + MEMCG_SOCKSTAT_WAIT_RETRIES, 1449 + DEFAULT_WAIT_INTERVAL_US); 1450 + if (sock_post) 1439 1451 goto cleanup; 1440 1452 1441 1453 ret = KSFT_PASS;