Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

- cpuset now supports an isolated cpus.partition type, which will enable
dynamic CPU isolation

- pids.peak added to remember the max number of pids used

- holes in cgroup namespace plugged

- internal cleanups

* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
cgroup: use strscpy() is more robust and safer
iocost_monitor: reorder BlkgIterator
cgroup: simplify code in cgroup_apply_control
cgroup: Make cgroup_get_from_id() prettier
cgroup/cpuset: remove unreachable code
cgroup: Remove CFTYPE_PRESSURE
cgroup: Improve cftype add/rm error handling
kselftest/cgroup: Add cpuset v2 partition root state test
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
cgroup/cpuset: Relocate a code block in validate_change()
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Relax constraints to partition & cpus changes
cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
cgroup/cpuset: Miscellaneous cleanups & add helper functions
cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
cgroup: add pids.peak interface for pids controller
cgroup: Remove data-race around cgrp_dfl_visible
cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
...

+1531 -431
+76 -58
Documentation/admin-guide/cgroup-v2.rst
···

 	It accepts only the following input values when written to.

-	  ======== ================================
-	  "root"   a partition root
-	  "member" a non-root member of a partition
-	  ======== ================================
+	  ========== =====================================
+	  "member"   Non-root member of a partition
+	  "root"     Partition root
+	  "isolated" Partition root without load balancing
+	  ========== =====================================

-	When set to be a partition root, the current cgroup is the
-	root of a new partition or scheduling domain that comprises
-	itself and all its descendants except those that are separate
-	partition roots themselves and their descendants. The root
-	cgroup is always a partition root.
+	The root cgroup is always a partition root and its state
+	cannot be changed. All other non-root cgroups start out as
+	"member".

-	There are constraints on where a partition root can be set.
-	It can only be set in a cgroup if all the following conditions
-	are true.
+	When set to "root", the current cgroup is the root of a new
+	partition or scheduling domain that comprises itself and all
+	its descendants except those that are separate partition roots
+	themselves and their descendants.

-	1) The "cpuset.cpus" is not empty and the list of CPUs are
-	   exclusive, i.e. they are not shared by any of its siblings.
-	2) The parent cgroup is a partition root.
-	3) The "cpuset.cpus" is also a proper subset of the parent's
-	   "cpuset.cpus.effective".
-	4) There is no child cgroups with cpuset enabled. This is for
-	   eliminating corner cases that have to be handled if such a
-	   condition is allowed.
+	When set to "isolated", the CPUs in that partition root will
+	be in an isolated state without any load balancing from the
+	scheduler. Tasks placed in such a partition with multiple
+	CPUs should be carefully distributed and bound to each of the
+	individual CPUs for optimal performance.

-	Setting it to partition root will take the CPUs away from the
-	effective CPUs of the parent cgroup. Once it is set, this
-	file cannot be reverted back to "member" if there are any child
-	cgroups with cpuset enabled.
+	The value shown in "cpuset.cpus.effective" of a partition root
+	is the CPUs that the partition root can dedicate to a potential
+	new child partition root. The new child subtracts available
+	CPUs from its parent "cpuset.cpus.effective".

-	A parent partition cannot distribute all its CPUs to its
-	child partitions. There must be at least one cpu left in the
-	parent partition.
+	A partition root ("root" or "isolated") can be in one of the
+	two possible states - valid or invalid. An invalid partition
+	root is in a degraded state where some state information may
+	be retained, but behaves more like a "member".

-	Once becoming a partition root, changes to "cpuset.cpus" is
-	generally allowed as long as the first condition above is true,
-	the change will not take away all the CPUs from the parent
-	partition and the new "cpuset.cpus" value is a superset of its
-	children's "cpuset.cpus" values.
+	All possible state transitions among "member", "root" and
+	"isolated" are allowed.

-	Sometimes, external factors like changes to ancestors'
-	"cpuset.cpus" or cpu hotplug can cause the state of the partition
-	root to change. On read, the "cpuset.sched.partition" file
-	can show the following values.
+	On read, the "cpuset.cpus.partition" file can show the following
+	values.

-	  ============== ==============================
-	  "member"       Non-root member of a partition
-	  "root"         Partition root
-	  "root invalid" Invalid partition root
-	  ============== ==============================
+	  ============================= =====================================
+	  "member"                      Non-root member of a partition
+	  "root"                        Partition root
+	  "isolated"                    Partition root without load balancing
+	  "root invalid (<reason>)"     Invalid partition root
+	  "isolated invalid (<reason>)" Invalid isolated partition root
+	  ============================= =====================================

-	It is a partition root if the first 2 partition root conditions
-	above are true and at least one CPU from "cpuset.cpus" is
-	granted by the parent cgroup.
+	In the case of an invalid partition root, a descriptive string on
+	why the partition is invalid is included within parentheses.

-	A partition root can become invalid if none of CPUs requested
-	in "cpuset.cpus" can be granted by the parent cgroup or the
-	parent cgroup is no longer a partition root itself. In this
-	case, it is not a real partition even though the restriction
-	of the first partition root condition above will still apply.
-	The cpu affinity of all the tasks in the cgroup will then be
-	associated with CPUs in the nearest ancestor partition.
+	For a partition root to become valid, the following conditions
+	must be met.

-	An invalid partition root can be transitioned back to a
-	real partition root if at least one of the requested CPUs
-	can now be granted by its parent. In this case, the cpu
-	affinity of all the tasks in the formerly invalid partition
-	will be associated to the CPUs of the newly formed partition.
-	Changing the partition state of an invalid partition root to
-	"member" is always allowed even if child cpusets are present.
+	1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
+	   are not shared by any of its siblings (exclusivity rule).
+	2) The parent cgroup is a valid partition root.
+	3) The "cpuset.cpus" is not empty and must contain at least
+	   one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+	4) The "cpuset.cpus.effective" cannot be empty unless there is
+	   no task associated with this partition.
+
+	External events like hotplug or changes to "cpuset.cpus" can
+	cause a valid partition root to become invalid and vice versa.
+	Note that a task cannot be moved to a cgroup with empty
+	"cpuset.cpus.effective".
+
+	For a valid partition root with the sibling cpu exclusivity
+	rule enabled, changes made to "cpuset.cpus" that violate the
+	exclusivity rule will invalidate the partition as well as its
+	sibiling partitions with conflicting cpuset.cpus values. So
+	care must be taking in changing "cpuset.cpus".
+
+	A valid non-root parent partition may distribute out all its CPUs
+	to its child partitions when there is no task associated with it.
+
+	Care must be taken to change a valid partition root to
+	"member" as all its child partitions, if present, will become
+	invalid causing disruption to tasks running in those child
+	partitions. These inactivated partitions could be recovered if
+	their parent is switched back to a partition root with a proper
+	set of "cpuset.cpus".
+
+	Poll and inotify events are triggered whenever the state of
+	"cpuset.cpus.partition" changes. That includes changes caused
+	by write to "cpuset.cpus.partition", cpu hotplug or other
+	changes that modify the validity status of the partition.
+	This will allow user space agents to monitor unexpected changes
+	to "cpuset.cpus.partition" without the need to do continuous
+	polling.


 Device controller
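The read-side values documented in this hunk ("member", "root", "isolated", and the two "invalid (<reason>)" forms) are easy to consume from user space. The following is a minimal user-space sketch of a parser for one line read from "cpuset.cpus.partition". The type and function names (part_type, part_state, parse_partition_state) are hypothetical and exist only for this illustration; they are not kernel or libc API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only; not part of any kernel API. */
enum part_type { PART_MEMBER, PART_ROOT, PART_ISOLATED, PART_UNKNOWN };

struct part_state {
	enum part_type type;
	bool valid;		/* false for "... invalid (<reason>)" and unknown input */
	char reason[128];	/* empty unless the partition is invalid */
};

/* Parse one line as read from cpuset.cpus.partition. */
static struct part_state parse_partition_state(const char *line)
{
	struct part_state st = { PART_UNKNOWN, false, "" };
	const char *rest;

	if (strncmp(line, "member", 6) == 0) {
		st.type = PART_MEMBER;
		rest = line + 6;
	} else if (strncmp(line, "root", 4) == 0) {
		st.type = PART_ROOT;
		rest = line + 4;
	} else if (strncmp(line, "isolated", 8) == 0) {
		st.type = PART_ISOLATED;
		rest = line + 8;
	} else {
		return st;
	}

	/* Reject look-alikes such as "rootfs": the keyword must end here. */
	if (*rest != '\0' && *rest != '\n' && *rest != ' ') {
		st.type = PART_UNKNOWN;
		return st;
	}
	st.valid = true;

	/* An invalid partition root reads as "<type> invalid (<reason>)". */
	if (strncmp(rest, " invalid", 8) == 0) {
		const char *open = strchr(rest, '(');
		const char *close = open ? strrchr(open, ')') : NULL;

		st.valid = false;
		if (close) {
			size_t len = (size_t)(close - open) - 1;

			if (len >= sizeof(st.reason))
				len = sizeof(st.reason) - 1;
			memcpy(st.reason, open + 1, len);
			st.reason[len] = '\0';
		}
	}
	return st;
}
```

A monitoring agent could call this after each poll or inotify wakeup on the file and log the reason string when the partition turns invalid.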
+2 -2
block/blk-cgroup-fc-appid.c
···
 		return -EINVAL;

 	cgrp = cgroup_get_from_id(cgrp_id);
-	if (!cgrp)
-		return -ENOENT;
+	if (IS_ERR(cgrp))
+		return PTR_ERR(cgrp);
 	css = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
 	if (!css) {
 		ret = -ENOENT;
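The change above follows from cgroup_get_from_id() now returning an ERR_PTR-encoded errno instead of NULL, so a plain NULL check would no longer catch failures. As a rough illustration, here is a user-space sketch of that convention. ERR_PTR(), PTR_ERR() and IS_ERR() are re-implemented as simplified stand-ins for the kernel helpers, and demo_group, demo_group_get_from_id() and use_group() are hypothetical names invented for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space stand-ins for the kernel's ERR_PTR helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical object type and lookup, for illustration only. */
struct demo_group { unsigned long long id; };

static struct demo_group known_group = { 42 };

/* Returns the object or an encoded errno, never NULL. */
static struct demo_group *demo_group_get_from_id(unsigned long long id)
{
	if (id != known_group.id)
		return ERR_PTR(-ENOENT);
	return &known_group;
}

/* Caller written in the style of the updated check in the hunk above. */
static int use_group(unsigned long long id)
{
	struct demo_group *grp = demo_group_get_from_id(id);

	if (IS_ERR(grp))
		return (int)PTR_ERR(grp);	/* propagate the callee's errno */
	return 0;
}
```

Note that IS_ERR(NULL) is false, which is why callers must switch from the NULL test to IS_ERR() in the same change that alters the return convention.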
+11 -7
include/linux/cgroup-defs.h
···
 	CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */
 	CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */
 	CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */
-	CFTYPE_PRESSURE = (1 << 6), /* only if pressure feature is enabled */

 	/* internal flags, do not use outside cgroup core proper */
 	__CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */
 	__CFTYPE_NOT_ON_DFL = (1 << 17), /* not on default hierarchy */
+	__CFTYPE_ADDED = (1 << 18),
 };

 /*
···
 	/*
 	 * The depth this cgroup is at. The root is at depth zero and each
 	 * step down the hierarchy increments the level. This along with
-	 * ancestor_ids[] can determine whether a given cgroup is a
+	 * ancestors[] can determine whether a given cgroup is a
 	 * descendant of another without traversing the hierarchy.
 	 */
 	int level;
···
 	/* Used to store internal freezer state */
 	struct cgroup_freezer_state freezer;

-	/* ids of the ancestors at each level including self */
-	u64 ancestor_ids[];
+	/* All ancestors including self */
+	struct cgroup *ancestors[];
 };

 /*
···
 	/* Unique id for this hierarchy. */
 	int hierarchy_id;

-	/* The root cgroup. Root is destroyed on its release. */
+	/*
+	 * The root cgroup. The containing cgroup_root will be destroyed on its
+	 * release. cgrp->ancestors[0] will be used overflowing into the
+	 * following field. cgrp_ancestor_storage must immediately follow.
+	 */
 	struct cgroup cgrp;

-	/* for cgrp->ancestor_ids[0] */
-	u64 cgrp_ancestor_id_storage;
+	/* must follow cgrp for cgrp->ancestors[0], see above */
+	struct cgroup *cgrp_ancestor_storage;

 	/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
 	atomic_t nr_cgrps;
+3 -10
include/linux/cgroup.h
···
 {
 	if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
 		return false;
-	return cgrp->ancestor_ids[ancestor->level] == cgroup_id(ancestor);
+	return cgrp->ancestors[ancestor->level] == ancestor;
 }

 /**
···
 static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
 					     int ancestor_level)
 {
-	if (cgrp->level < ancestor_level)
+	if (ancestor_level < 0 || ancestor_level > cgrp->level)
 		return NULL;
-	while (cgrp && cgrp->level > ancestor_level)
-		cgrp = cgroup_parent(cgrp);
-	return cgrp;
+	return cgrp->ancestors[ancestor_level];
 }

 /**
···

 static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 {}
-
-static inline struct cgroup *cgroup_get_from_id(u64 id)
-{
-	return NULL;
-}
 #endif /* !CONFIG_CGROUPS */

 #ifdef CONFIG_CGROUPS
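The two helpers above rely on each cgroup carrying a per-level array of ancestor pointers, so both the descendant test and the ancestor lookup become a single array index instead of a walk up the tree. The following user-space sketch shows the idea on a stripped-down tree. The node type and the node_create(), node_is_descendant() and node_ancestor() functions are hypothetical miniatures written for this example; the sketch assumes a single hierarchy and therefore omits the root comparison the kernel performs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of a tree node: only the fields the checks need. */
struct node {
	int level;			/* the root is at level 0 */
	struct node *parent;
	struct node *ancestors[];	/* ancestors[i] is the ancestor at level i, self included */
};

/* Allocate a node and fill ancestors[] by walking up once, at creation time. */
static struct node *node_create(struct node *parent)
{
	int level = parent ? parent->level + 1 : 0;
	struct node *n, *t;

	n = calloc(1, sizeof(*n) + (size_t)(level + 1) * sizeof(n->ancestors[0]));
	if (!n)
		return NULL;
	n->level = level;
	n->parent = parent;
	for (t = n; t; t = t->parent)
		n->ancestors[t->level] = t;
	return n;
}

/* O(1) descendant test: compare the pointer stored at the ancestor's level. */
static bool node_is_descendant(const struct node *n, const struct node *ancestor)
{
	if (n->level < ancestor->level)
		return false;
	return n->ancestors[ancestor->level] == ancestor;
}

/* O(1) ancestor lookup with the same bounds check as the updated helper. */
static struct node *node_ancestor(struct node *n, int ancestor_level)
{
	if (ancestor_level < 0 || ancestor_level > n->level)
		return NULL;
	return n->ancestors[ancestor_level];
}
```

The cost is one pointer per level per node, paid once at creation, in exchange for constant-time lookups afterwards.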
+2
kernel/cgroup/cgroup-internal.h
···

 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 		       bool threadgroup);
+void cgroup_attach_lock(bool lock_threadgroup);
+void cgroup_attach_unlock(bool lock_threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
 					     bool *locked)
 	__acquires(&cgroup_threadgroup_rwsem);
+2 -4
kernel/cgroup/cgroup-v1.c
···
 	int retval = 0;

 	mutex_lock(&cgroup_mutex);
-	cpus_read_lock();
-	percpu_down_write(&cgroup_threadgroup_rwsem);
+	cgroup_attach_lock(true);
 	for_each_root(root) {
 		struct cgroup *from_cgrp;

···
 		if (retval)
 			break;
 	}
-	percpu_up_write(&cgroup_threadgroup_rwsem);
-	cpus_read_unlock();
+	cgroup_attach_unlock(true);
 	mutex_unlock(&cgroup_mutex);

 	return retval;
+89 -54
kernel/cgroup/cgroup.c
···

 static struct file_system_type cgroup2_fs_type;
 static struct cftype cgroup_base_files[];
+static struct cftype cgroup_psi_files[];

 /* cgroup optional features */
 enum cgroup_opt_features {
···
 	css->flags &= ~CSS_VISIBLE;

 	if (!css->ss) {
-		if (cgroup_on_dfl(cgrp))
-			cfts = cgroup_base_files;
-		else
-			cfts = cgroup1_base_files;
-
-		cgroup_addrm_files(css, cgrp, cfts, false);
+		if (cgroup_on_dfl(cgrp)) {
+			cgroup_addrm_files(css, cgrp,
+					   cgroup_base_files, false);
+			if (cgroup_psi_enabled())
+				cgroup_addrm_files(css, cgrp,
+						   cgroup_psi_files, false);
+		} else {
+			cgroup_addrm_files(css, cgrp,
+					   cgroup1_base_files, false);
+		}
 	} else {
 		list_for_each_entry(cfts, &css->ss->cfts, node)
 			cgroup_addrm_files(css, cgrp, cfts, false);
···
 		return 0;

 	if (!css->ss) {
-		if (cgroup_on_dfl(cgrp))
-			cfts = cgroup_base_files;
-		else
-			cfts = cgroup1_base_files;
+		if (cgroup_on_dfl(cgrp)) {
+			ret = cgroup_addrm_files(&cgrp->self, cgrp,
+						 cgroup_base_files, true);
+			if (ret < 0)
+				return ret;

-		ret = cgroup_addrm_files(&cgrp->self, cgrp, cfts, true);
-		if (ret < 0)
-			return ret;
+			if (cgroup_psi_enabled()) {
+				ret = cgroup_addrm_files(&cgrp->self, cgrp,
+							 cgroup_psi_files, true);
+				if (ret < 0)
+					return ret;
+			}
+		} else {
+			cgroup_addrm_files(css, cgrp,
+					   cgroup1_base_files, true);
+		}
 	} else {
 		list_for_each_entry(cfts, &css->ss->cfts, node) {
 			ret = cgroup_addrm_files(css, cgrp, cfts, true);
···
 	}
 	root_cgrp->kn = kernfs_root_to_node(root->kf_root);
 	WARN_ON_ONCE(cgroup_ino(root_cgrp) != 1);
-	root_cgrp->ancestor_ids[0] = cgroup_id(root_cgrp);
+	root_cgrp->ancestors[0] = root_cgrp;

 	ret = css_populate_dir(&root_cgrp->self);
 	if (ret)
···
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	int ret;

-	cgrp_dfl_visible = true;
+	WRITE_ONCE(cgrp_dfl_visible, true);
 	cgroup_get_live(&cgrp_dfl_root.cgrp);
 	ctx->root = &cgrp_dfl_root;

···
 		ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
 	} else {
 		/* if no hierarchy exists, everyone is in "/" */
-		ret = strlcpy(buf, "/", buflen);
+		ret = strscpy(buf, "/", buflen);
 	}

 	spin_unlock_irq(&css_set_lock);
···
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-static void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(bool lock_threadgroup)
 {
 	cpus_read_lock();
 	if (lock_threadgroup)
···
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
  * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
  */
-static void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(bool lock_threadgroup)
 {
 	if (lock_threadgroup)
 		percpu_up_write(&cgroup_threadgroup_rwsem);
···
 	 * making the following cgroup_update_dfl_csses() properly update
 	 * css associations of all tasks in the subtree.
 	 */
-	ret = cgroup_update_dfl_csses(cgrp);
-	if (ret)
-		return ret;
-
-	return 0;
+	return cgroup_update_dfl_csses(cgrp);
 }

 /**
···
 restart:
 	for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) {
 		/* does cft->flags tell us to skip this file on @cgrp? */
-		if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
-			continue;
 		if ((cft->flags & __CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp))
 			continue;
 		if ((cft->flags & __CFTYPE_NOT_ON_DFL) && cgroup_on_dfl(cgrp))
···
 		cft->ss = NULL;

 		/* revert flags set by cgroup core while adding @cfts */
-		cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL);
+		cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL |
+				__CFTYPE_ADDED);
 	}
 }

 static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 {
 	struct cftype *cft;
+	int ret = 0;

 	for (cft = cfts; cft->name[0] != '\0'; cft++) {
 		struct kernfs_ops *kf_ops;

 		WARN_ON(cft->ss || cft->kf_ops);

-		if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
-			continue;
+		if (cft->flags & __CFTYPE_ADDED) {
+			ret = -EBUSY;
+			break;
+		}

 		if (cft->seq_start)
 			kf_ops = &cgroup_kf_ops;
···
 		if (cft->max_write_len && cft->max_write_len != PAGE_SIZE) {
 			kf_ops = kmemdup(kf_ops, sizeof(*kf_ops), GFP_KERNEL);
 			if (!kf_ops) {
-				cgroup_exit_cftypes(cfts);
-				return -ENOMEM;
+				ret = -ENOMEM;
+				break;
 			}
 			kf_ops->atomic_write_len = cft->max_write_len;
 		}

 		cft->kf_ops = kf_ops;
 		cft->ss = ss;
+		cft->flags |= __CFTYPE_ADDED;
 	}

-	return 0;
+	if (ret)
+		cgroup_exit_cftypes(cfts);
+	return ret;
 }

 static int cgroup_rm_cftypes_locked(struct cftype *cfts)
 {
 	lockdep_assert_held(&cgroup_mutex);
-
-	if (!cfts || !cfts[0].ss)
-		return -ENOENT;

 	list_del(&cfts->node);
 	cgroup_apply_cftypes(cfts, false);
···
 int cgroup_rm_cftypes(struct cftype *cfts)
 {
 	int ret;
+
+	if (!cfts || cfts[0].name[0] == '\0')
+		return 0;
+
+	if (!(cfts[0].flags & __CFTYPE_ADDED))
+		return -ENOENT;

 	mutex_lock(&cgroup_mutex);
 	ret = cgroup_rm_cftypes_locked(cfts);
···
 		.name = "cpu.stat",
 		.seq_show = cpu_stat_show,
 	},
+	{ } /* terminate */
+};
+
+static struct cftype cgroup_psi_files[] = {
 #ifdef CONFIG_PSI
 	{
 		.name = "io.pressure",
-		.flags = CFTYPE_PRESSURE,
 		.seq_show = cgroup_io_pressure_show,
 		.write = cgroup_io_pressure_write,
 		.poll = cgroup_pressure_poll,
···
 	},
 	{
 		.name = "memory.pressure",
-		.flags = CFTYPE_PRESSURE,
 		.seq_show = cgroup_memory_pressure_show,
 		.write = cgroup_memory_pressure_write,
 		.poll = cgroup_pressure_poll,
···
 	},
 	{
 		.name = "cpu.pressure",
-		.flags = CFTYPE_PRESSURE,
 		.seq_show = cgroup_cpu_pressure_show,
 		.write = cgroup_cpu_pressure_write,
 		.poll = cgroup_pressure_poll,
···
 	int ret;

 	/* allocate the cgroup and its ID, 0 is reserved for the root */
-	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+	cgrp = kzalloc(struct_size(cgrp, ancestors, (level + 1)), GFP_KERNEL);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);

···

 	spin_lock_irq(&css_set_lock);
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
-		cgrp->ancestor_ids[tcgrp->level] = cgroup_id(tcgrp);
+		cgrp->ancestors[tcgrp->level] = tcgrp;

 		if (tcgrp != cgrp) {
 			tcgrp->nr_descendants++;
···

 	BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16);
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
+	BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));

 	cgroup_rstat_boot();
···
 /*
  * cgroup_get_from_id : get the cgroup associated with cgroup id
  * @id: cgroup id
- * On success return the cgrp, on failure return NULL
+ * On success return the cgrp or ERR_PTR on failure
+ * Only cgroups within current task's cgroup NS are valid.
  */
 struct cgroup *cgroup_get_from_id(u64 id)
 {
 	struct kernfs_node *kn;
-	struct cgroup *cgrp = NULL;
+	struct cgroup *cgrp, *root_cgrp;

 	kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
 	if (!kn)
-		goto out;
+		return ERR_PTR(-ENOENT);

-	if (kernfs_type(kn) != KERNFS_DIR)
-		goto put;
+	if (kernfs_type(kn) != KERNFS_DIR) {
+		kernfs_put(kn);
+		return ERR_PTR(-ENOENT);
+	}

 	rcu_read_lock();

···
 		cgrp = NULL;

 	rcu_read_unlock();
-put:
 	kernfs_put(kn);
-out:
+
+	if (!cgrp)
+		return ERR_PTR(-ENOENT);
+
+	spin_lock_irq(&css_set_lock);
+	root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+	if (!cgroup_is_descendant(cgrp, root_cgrp)) {
+		cgroup_put(cgrp);
+		return ERR_PTR(-ENOENT);
+	}
+
 	return cgrp;
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_id);
···
 		struct cgroup *cgrp;
 		int ssid, count = 0;

-		if (root == &cgrp_dfl_root && !cgrp_dfl_visible)
+		if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
 			continue;

 		seq_printf(m, "%d:", root->hierarchy_id);
···
 {
 	struct kernfs_node *kn;
 	struct cgroup *cgrp = ERR_PTR(-ENOENT);
+	struct cgroup *root_cgrp;

-	kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+	spin_lock_irq(&css_set_lock);
+	root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
+	kn = kernfs_walk_and_get(root_cgrp->kn, path);
+	spin_unlock_irq(&css_set_lock);
 	if (!kn)
 		goto out;

···
 		if (!(cft->flags & CFTYPE_NS_DELEGATABLE))
 			continue;

-		if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
-			continue;
-
 		if (prefix)
 			ret += snprintf(buf + ret, size - ret, "%s.", prefix);

···
 	int ssid;
 	ssize_t ret = 0;

-	ret = show_delegatable_files(cgroup_base_files, buf, PAGE_SIZE - ret,
-				     NULL);
+	ret = show_delegatable_files(cgroup_base_files, buf + ret,
+				     PAGE_SIZE - ret, NULL);
+	if (cgroup_psi_enabled())
+		ret += show_delegatable_files(cgroup_psi_files, buf + ret,
+					      PAGE_SIZE - ret, NULL);

 	for_each_subsys(ss, ssid)
 		ret += show_delegatable_files(ss->dfl_cftypes, buf + ret,
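The cftype changes in this file replace the old "is ->ss set" heuristic with an explicit "added" flag: initialization refuses an array that is already marked, rolls the whole array back on any failure, and removal distinguishes an empty array (nothing to do) from one that was never added. Below is a simplified user-space sketch of that pattern. The file_type struct, the FILE_ADDED flag and the init_types(), exit_types() and rm_types() functions are hypothetical stand-ins written for this example; they only mirror the control flow of the diff and leave out locking, kernfs ops and list handling.

```c
#include <errno.h>
#include <stddef.h>

#define FILE_ADDED (1u << 18)	/* plays the role of __CFTYPE_ADDED */

/* Hypothetical stand-in for struct cftype; an empty name terminates an array. */
struct file_type {
	char name[64];
	unsigned int flags;
};

/* Revert the flag set by init_types() on every entry of the array. */
static void exit_types(struct file_type *fts)
{
	struct file_type *ft;

	for (ft = fts; ft->name[0] != '\0'; ft++)
		ft->flags &= ~FILE_ADDED;
}

/* Mark every entry as added; refuse arrays that are already marked. */
static int init_types(struct file_type *fts)
{
	struct file_type *ft;
	int ret = 0;

	for (ft = fts; ft->name[0] != '\0'; ft++) {
		if (ft->flags & FILE_ADDED) {
			ret = -EBUSY;
			break;
		}
		ft->flags |= FILE_ADDED;
	}

	/* As in the diff, a failure rolls back the whole array. */
	if (ret)
		exit_types(fts);
	return ret;
}

/* Removing an empty array is a no-op; removing a never-added one is an error. */
static int rm_types(struct file_type *fts)
{
	if (!fts || fts[0].name[0] == '\0')
		return 0;
	if (!(fts[0].flags & FILE_ADDED))
		return -ENOENT;
	exit_types(fts);
	return 0;
}
```

Collecting the error in ret and breaking out of the loop gives a single cleanup path, which is the same restructuring the diff applies to cgroup_init_cftypes().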
+533 -280
kernel/cgroup/cpuset.c
···
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/kmod.h>
+#include <linux/kthread.h>
 #include <linux/list.h>
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
···
 	int val;	/* most recent output value */
 	time64_t time;	/* clock (secs) when val computed */
 	spinlock_t lock;	/* guards read or write of above */
+};
+
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+	PERR_NONE = 0,
+	PERR_INVCPUS,
+	PERR_INVPARENT,
+	PERR_NOTPART,
+	PERR_NOTEXCL,
+	PERR_NOCPUS,
+	PERR_HOTPLUG,
+	PERR_CPUSEMPTY,
+};
+
+static const char * const perr_strings[] = {
+	[PERR_INVCPUS] = "Invalid cpu list in cpuset.cpus",
+	[PERR_INVPARENT] = "Parent is an invalid partition root",
+	[PERR_NOTPART] = "Parent is not a partition root",
+	[PERR_NOTEXCL] = "Cpu list in cpuset.cpus not exclusive",
+	[PERR_NOCPUS] = "Parent unable to distribute cpu downstream",
+	[PERR_HOTPLUG] = "No cpu available due to hotplug",
+	[PERR_CPUSEMPTY] = "cpuset.cpus is empty",
 };

 struct cpuset {
···
 	int use_parent_ecpus;
 	int child_ecpus_count;

+	/* Invalid partition error code, not lock protected */
+	enum prs_errcode prs_err;
+
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
 };
···
 /*
  * Partition root states:
  *
- * 0 - not a partition root
- *
+ * 0 - member (not a partition root)
  * 1 - partition root
- *
+ * 2 - partition root without load balancing (isolated)
  * -1 - invalid partition root
- * None of the cpus in cpus_allowed can be put into the parent's
- * subparts_cpus. In this case, the cpuset is not a real partition
- * root anymore. However, the CPU_EXCLUSIVE bit will still be set
- * and the cpuset can be restored back to a partition root if the
- * parent cpuset can give more CPUs back to this child cpuset.
+ * -2 - invalid isolated partition root
  */
-#define PRS_DISABLED 0
-#define PRS_ENABLED 1
-#define PRS_ERROR -1
+#define PRS_MEMBER 0
+#define PRS_ROOT 1
+#define PRS_ISOLATED 2
+#define PRS_INVALID_ROOT -1
+#define PRS_INVALID_ISOLATED -2
+
+static inline bool is_prs_invalid(int prs_state)
+{
+	return prs_state < 0;
+}

 /*
  * Temporary cpumasks for working with partitions that are passed among
···
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }

-static inline int is_partition_root(const struct cpuset *cs)
+static inline int is_partition_valid(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
+}
+
+static inline int is_partition_invalid(const struct cpuset *cs)
+{
+	return cs->partition_root_state < 0;
+}
+
+/*
+ * Callers should hold callback_lock to modify partition_root_state.
+ */
+static inline void make_partition_invalid(struct cpuset *cs)
+{
+	if (is_partition_valid(cs))
+		cs->partition_root_state = -cs->partition_root_state;
 }

 /*
  * Send notification event of whenever partition_root_state changes.
  */
-static inline void notify_partition_change(struct cpuset *cs,
-					   int old_prs, int new_prs)
+static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 {
-	if (old_prs != new_prs)
-		cgroup_file_notify(&cs->partition_file);
+	if (old_prs == cs->partition_root_state)
+		return;
+	cgroup_file_notify(&cs->partition_file);
+
+	/* Reset prs_err if not invalid */
+	if (is_partition_valid(cs))
+		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }

 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
-	.partition_root_state = PRS_ENABLED,
+	.partition_root_state = PRS_ROOT,
 };

 /**
···
 {
 	return cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
 	      (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
+}
+
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+					  struct cpuset *excluded_child)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *child;
+
+	if (cs->css.cgroup->nr_populated_csets)
+		return true;
+	if (!excluded_child && !cs->nr_subparts_cpus)
+		return cgroup_is_populated(cs->css.cgroup);
+
+	rcu_read_lock();
+	cpuset_for_each_child(child, css, cs) {
+		if (child == excluded_child)
+			continue;
+		if (is_partition_valid(child))
+			continue;
+		if (cgroup_is_populated(child->css.cgroup)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+	return false;
 }

 /*
···
 	par = parent_cs(cur);

 	/*
-	 * If either I or some sibling (!= me) is exclusive, we can't
-	 * overlap
-	 */
-	ret = -EINVAL;
-	cpuset_for_each_child(c, css, par) {
-		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
-		    c != cur &&
-		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
-			goto out;
-		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
-		    c != cur &&
-		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
-			goto out;
-	}
-
-	/*
 	 * Cpusets with tasks - existing or newly being attached - can't
 	 * be changed to have empty cpus_allowed or mems_allowed.
 	 */
···
 	    !cpuset_cpumask_can_shrink(cur->cpus_allowed,
 				       trial->cpus_allowed))
 		goto out;
+
+	/*
+	 * If either I or some sibling (!= me) is exclusive, we can't
+	 * overlap
+	 */
+	ret = -EINVAL;
+	cpuset_for_each_child(c, css, par) {
+		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
+		    c != cur &&
+		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
+			goto out;
+		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
+		    c != cur &&
+		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
+			goto out;
+	}

 	ret = 0;
 out:
···
 			csa[csn++] = cp;

 		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_root(cp))
+		if (!is_partition_valid(cp))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
···
 	if (top_cpuset.nr_subparts_cpus) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
-			if (!is_partition_root(cs)) {
+			if (!is_partition_valid(cs)) {
 				pos_css = css_rightmost_descendant(pos_css);
 				continue;
 			}
···
 {
 	struct css_task_iter it;
 	struct task_struct *task;
+	bool top_cs = cs == &top_cpuset;

 	css_task_iter_start(&cs->css, 0, &it);
-	while ((task = css_task_iter_next(&it)))
+	while ((task = css_task_iter_next(&it))) {
+		/*
+		 * Percpu kthreads in top_cpuset are ignored
+		 */
+		if (top_cs && (task->flags & PF_KTHREAD) &&
+		    kthread_is_per_cpu(task))
+			continue;
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
+	}
 	css_task_iter_end(&it);
 }

···
 	partcmd_enable,		/* Enable partition root */
 	partcmd_disable,	/* Disable partition root */
 	partcmd_update,		/* Update parent's subparts_cpus */
+	partcmd_invalidate,	/* Make partition invalid */
 };

+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
+		       int turning_on);
 /**
  * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
  * @cpuset: The cpuset that requests change in partition root state
  * @cmd: Partition root state change command
  * @newmask: Optional new cpumask for partcmd_update
  * @tmp: Temporary addmask and delmask
- * Return: 0, 1 or an error code
+ * Return: 0 or a partition root state error code
  *
  * For partcmd_enable, the cpuset is being transformed from a non-partition
  * root to a partition root. The cpus_allowed mask of the given cpuset will
···
  * For partcmd_disable, the cpuset is being transformed from a partition
  * root back to a non-partition root. Any CPUs in cpus_allowed that are in
  * parent's subparts_cpus will be taken away from that cpumask and put back
- * into parent's effective_cpus. 0 should always be returned.
+ * into parent's effective_cpus. 0 will always be returned.
  *
- * For partcmd_update, if the optional newmask is specified, the cpu
- * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same. The cpuset should either
- * be a partition root or an invalid partition root. The partition root
- * state may change if newmask is NULL and none of the requested CPUs can
- * be granted by the parent. The function will return 1 if changes to
- * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
- * Error code should only be returned when newmask is non-NULL.
+ * For partcmd_update, if the optional newmask is specified, the cpu list is
+ * to be changed from cpus_allowed to newmask. Otherwise, cpus_allowed is
+ * assumed to remain the same. The cpuset should either be a valid or invalid
+ * partition root. The partition root state may change from valid to invalid
+ * or vice versa. An error code will only be returned if transitioning from
+ * invalid to valid violates the exclusivity rule.
+ *
+ * For partcmd_invalidate, the current partition will be made invalid.
  *
  * The partcmd_enable and partcmd_disable commands are used by
- * update_prstate(). The partcmd_update command is used by
- * update_cpumasks_hier() with newmask NULL and update_cpumask() with
- * newmask set.
+ * update_prstate(). An error code may be returned and the caller will check
+ * for error.
  *
- * The checking is more strict when enabling partition root than the
- * other two commands.
- *
- * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change().
+ * The partcmd_update command is used by update_cpumasks_hier() with newmask
+ * NULL and update_cpumask() with newmask set. The partcmd_invalidate is used
+ * by update_cpumask() with NULL newmask. In both cases, the callers won't
+ * check for error and so partition_root_state and prs_error will be updated
+ * directly.
  */
-static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 					  struct cpumask *newmask,
 					  struct tmpmasks *tmp)
 {
-	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *parent = parent_cs(cs);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	int old_prs, new_prs;
-	bool part_error = false;	/* Partition error?
*/ 1216 + int part_error = PERR_NONE; /* Partition error? */ 1313 1217 1314 1218 percpu_rwsem_assert_held(&cpuset_rwsem); 1315 1219 ··· 1316 1224 * The new cpumask, if present, or the current cpus_allowed must 1317 1225 * not be empty. 1318 1226 */ 1319 - if (!is_partition_root(parent) || 1320 - (newmask && cpumask_empty(newmask)) || 1321 - (!newmask && cpumask_empty(cpuset->cpus_allowed))) 1322 - return -EINVAL; 1227 + if (!is_partition_valid(parent)) { 1228 + return is_partition_invalid(parent) 1229 + ? PERR_INVPARENT : PERR_NOTPART; 1230 + } 1231 + if ((newmask && cpumask_empty(newmask)) || 1232 + (!newmask && cpumask_empty(cs->cpus_allowed))) 1233 + return PERR_CPUSEMPTY; 1323 1234 1324 1235 /* 1325 - * Enabling/disabling partition root is not allowed if there are 1326 - * online children. 1327 - */ 1328 - if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) 1329 - return -EBUSY; 1330 - 1331 - /* 1332 - * Enabling partition root is not allowed if not all the CPUs 1333 - * can be granted from parent's effective_cpus or at least one 1334 - * CPU will be left after that. 1335 - */ 1336 - if ((cmd == partcmd_enable) && 1337 - (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || 1338 - cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))) 1339 - return -EINVAL; 1340 - 1341 - /* 1342 - * A cpumask update cannot make parent's effective_cpus become empty. 1236 + * new_prs will only be changed for the partcmd_update and 1237 + * partcmd_invalidate commands. 1343 1238 */ 1344 1239 adding = deleting = false; 1345 - old_prs = new_prs = cpuset->partition_root_state; 1240 + old_prs = new_prs = cs->partition_root_state; 1346 1241 if (cmd == partcmd_enable) { 1347 - cpumask_copy(tmp->addmask, cpuset->cpus_allowed); 1242 + /* 1243 + * Enabling partition root is not allowed if cpus_allowed 1244 + * doesn't overlap parent's cpus_allowed. 
1245 + */ 1246 + if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed)) 1247 + return PERR_INVCPUS; 1248 + 1249 + /* 1250 + * A parent can be left with no CPU as long as there is no 1251 + * task directly associated with the parent partition. 1252 + */ 1253 + if (!cpumask_intersects(cs->cpus_allowed, parent->effective_cpus) && 1254 + partition_is_populated(parent, cs)) 1255 + return PERR_NOCPUS; 1256 + 1257 + cpumask_copy(tmp->addmask, cs->cpus_allowed); 1348 1258 adding = true; 1349 1259 } else if (cmd == partcmd_disable) { 1350 - deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, 1260 + /* 1261 + * Need to remove cpus from parent's subparts_cpus for valid 1262 + * partition root. 1263 + */ 1264 + deleting = !is_prs_invalid(old_prs) && 1265 + cpumask_and(tmp->delmask, cs->cpus_allowed, 1351 1266 parent->subparts_cpus); 1267 + } else if (cmd == partcmd_invalidate) { 1268 + if (is_prs_invalid(old_prs)) 1269 + return 0; 1270 + 1271 + /* 1272 + * Make the current partition invalid. It is assumed that 1273 + * invalidation is caused by violating cpu exclusivity rule. 
1274 + */ 1275 + deleting = cpumask_and(tmp->delmask, cs->cpus_allowed, 1276 + parent->subparts_cpus); 1277 + if (old_prs > 0) { 1278 + new_prs = -old_prs; 1279 + part_error = PERR_NOTEXCL; 1280 + } 1352 1281 } else if (newmask) { 1353 1282 /* 1354 1283 * partcmd_update with newmask: 1355 1284 * 1285 + * Compute add/delete mask to/from subparts_cpus 1286 + * 1356 1287 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus 1357 - * addmask = newmask & parent->effective_cpus 1288 + * addmask = newmask & parent->cpus_allowed 1358 1289 * & ~parent->subparts_cpus 1359 1290 */ 1360 - cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask); 1291 + cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask); 1361 1292 deleting = cpumask_and(tmp->delmask, tmp->delmask, 1362 1293 parent->subparts_cpus); 1363 1294 1364 - cpumask_and(tmp->addmask, newmask, parent->effective_cpus); 1295 + cpumask_and(tmp->addmask, newmask, parent->cpus_allowed); 1365 1296 adding = cpumask_andnot(tmp->addmask, tmp->addmask, 1366 1297 parent->subparts_cpus); 1367 1298 /* 1368 - * Return error if the new effective_cpus could become empty. 1299 + * Make partition invalid if parent's effective_cpus could 1300 + * become empty and there are tasks in the parent. 1369 1301 */ 1370 1302 if (adding && 1371 - cpumask_equal(parent->effective_cpus, tmp->addmask)) { 1372 - if (!deleting) 1373 - return -EINVAL; 1374 - /* 1375 - * As some of the CPUs in subparts_cpus might have 1376 - * been offlined, we need to compute the real delmask 1377 - * to confirm that. 
1378 - */ 1379 - if (!cpumask_and(tmp->addmask, tmp->delmask, 1380 - cpu_active_mask)) 1381 - return -EINVAL; 1382 - cpumask_copy(tmp->addmask, parent->effective_cpus); 1303 + cpumask_subset(parent->effective_cpus, tmp->addmask) && 1304 + !cpumask_intersects(tmp->delmask, cpu_active_mask) && 1305 + partition_is_populated(parent, cs)) { 1306 + part_error = PERR_NOCPUS; 1307 + adding = false; 1308 + deleting = cpumask_and(tmp->delmask, cs->cpus_allowed, 1309 + parent->subparts_cpus); 1383 1310 } 1384 1311 } else { 1385 1312 /* 1386 1313 * partcmd_update w/o newmask: 1387 1314 * 1388 - * addmask = cpus_allowed & parent->effective_cpus 1315 + * delmask = cpus_allowed & parent->subparts_cpus 1316 + * addmask = cpus_allowed & parent->cpus_allowed 1317 + * & ~parent->subparts_cpus 1389 1318 * 1390 - * Note that parent's subparts_cpus may have been 1391 - * pre-shrunk in case there is a change in the cpu list. 1392 - * So no deletion is needed. 1319 + * This gets invoked either due to a hotplug event or from 1320 + * update_cpumasks_hier(). This can cause the state of a 1321 + * partition root to transition from valid to invalid or vice 1322 + * versa. So we still need to compute the addmask and delmask. 1323 + 1324 + * A partition error happens when: 1325 + * 1) Cpuset is valid partition, but parent does not distribute 1326 + * out any CPUs. 1327 + * 2) Parent has tasks and all its effective CPUs will have 1328 + * to be distributed out. 
1393 1329 */ 1394 - adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, 1395 - parent->effective_cpus); 1396 - part_error = cpumask_equal(tmp->addmask, 1397 - parent->effective_cpus); 1330 + cpumask_and(tmp->addmask, cs->cpus_allowed, 1331 + parent->cpus_allowed); 1332 + adding = cpumask_andnot(tmp->addmask, tmp->addmask, 1333 + parent->subparts_cpus); 1334 + 1335 + if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) || 1336 + (adding && 1337 + cpumask_subset(parent->effective_cpus, tmp->addmask) && 1338 + partition_is_populated(parent, cs))) { 1339 + part_error = PERR_NOCPUS; 1340 + adding = false; 1341 + } 1342 + 1343 + if (part_error && is_partition_valid(cs) && 1344 + parent->nr_subparts_cpus) 1345 + deleting = cpumask_and(tmp->delmask, cs->cpus_allowed, 1346 + parent->subparts_cpus); 1398 1347 } 1348 + if (part_error) 1349 + WRITE_ONCE(cs->prs_err, part_error); 1399 1350 1400 1351 if (cmd == partcmd_update) { 1401 - int prev_prs = cpuset->partition_root_state; 1402 - 1403 1352 /* 1404 - * Check for possible transition between PRS_ENABLED 1405 - * and PRS_ERROR. 1353 + * Check for possible transition between valid and invalid 1354 + * partition root. 1406 1355 */ 1407 - switch (cpuset->partition_root_state) { 1408 - case PRS_ENABLED: 1356 + switch (cs->partition_root_state) { 1357 + case PRS_ROOT: 1358 + case PRS_ISOLATED: 1409 1359 if (part_error) 1410 - new_prs = PRS_ERROR; 1360 + new_prs = -old_prs; 1411 1361 break; 1412 - case PRS_ERROR: 1362 + case PRS_INVALID_ROOT: 1363 + case PRS_INVALID_ISOLATED: 1413 1364 if (!part_error) 1414 - new_prs = PRS_ENABLED; 1365 + new_prs = -old_prs; 1415 1366 break; 1416 1367 } 1417 - /* 1418 - * Set part_error if previously in invalid state. 
1419 - */ 1420 - part_error = (prev_prs == PRS_ERROR); 1421 - } 1422 - 1423 - if (!part_error && (new_prs == PRS_ERROR)) 1424 - return 0; /* Nothing need to be done */ 1425 - 1426 - if (new_prs == PRS_ERROR) { 1427 - /* 1428 - * Remove all its cpus from parent's subparts_cpus. 1429 - */ 1430 - adding = false; 1431 - deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed, 1432 - parent->subparts_cpus); 1433 1368 } 1434 1369 1435 1370 if (!adding && !deleting && (new_prs == old_prs)) 1436 1371 return 0; 1372 + 1373 + /* 1374 + * Transitioning between invalid to valid or vice versa may require 1375 + * changing CS_CPU_EXCLUSIVE and CS_SCHED_LOAD_BALANCE. 1376 + */ 1377 + if (old_prs != new_prs) { 1378 + if (is_prs_invalid(old_prs) && !is_cpu_exclusive(cs) && 1379 + (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0)) 1380 + return PERR_NOTEXCL; 1381 + if (is_prs_invalid(new_prs) && is_cpu_exclusive(cs)) 1382 + update_flag(CS_CPU_EXCLUSIVE, cs, 0); 1383 + } 1437 1384 1438 1385 /* 1439 1386 * Change the parent's subparts_cpus. ··· 1500 1369 parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus); 1501 1370 1502 1371 if (old_prs != new_prs) 1503 - cpuset->partition_root_state = new_prs; 1372 + cs->partition_root_state = new_prs; 1504 1373 1505 1374 spin_unlock_irq(&callback_lock); 1506 - notify_partition_change(cpuset, old_prs, new_prs); 1507 1375 1508 - return cmd == partcmd_update; 1376 + if (adding || deleting) 1377 + update_tasks_cpumask(parent); 1378 + 1379 + /* 1380 + * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary. 1381 + * rebuild_sched_domains_locked() may be called. 
1382 + */ 1383 + if (old_prs != new_prs) { 1384 + if (old_prs == PRS_ISOLATED) 1385 + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); 1386 + else if (new_prs == PRS_ISOLATED) 1387 + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); 1388 + } 1389 + notify_partition_change(cs, old_prs); 1390 + return 0; 1509 1391 } 1510 1392 1511 1393 /* 1512 1394 * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree 1513 1395 * @cs: the cpuset to consider 1514 1396 * @tmp: temp variables for calculating effective_cpus & partition setup 1397 + * @force: don't skip any descendant cpusets if set 1515 1398 * 1516 1399 * When configured cpumask is changed, the effective cpumasks of this cpuset 1517 1400 * and all its descendants need to be updated. ··· 1534 1389 * 1535 1390 * Called with cpuset_rwsem held 1536 1391 */ 1537 - static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) 1392 + static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, 1393 + bool force) 1538 1394 { 1539 1395 struct cpuset *cp; 1540 1396 struct cgroup_subsys_state *pos_css; ··· 1545 1399 rcu_read_lock(); 1546 1400 cpuset_for_each_descendant_pre(cp, pos_css, cs) { 1547 1401 struct cpuset *parent = parent_cs(cp); 1402 + bool update_parent = false; 1548 1403 1549 1404 compute_effective_cpumask(tmp->new_cpus, cp, parent); 1550 1405 1551 1406 /* 1552 1407 * If it becomes empty, inherit the effective mask of the 1553 - * parent, which is guaranteed to have some CPUs. 1408 + * parent, which is guaranteed to have some CPUs unless 1409 + * it is a partition root that has explicitly distributed 1410 + * out all its CPUs. 
1554 1411 */ 1555 1412 if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) { 1413 + if (is_partition_valid(cp) && 1414 + cpumask_equal(cp->cpus_allowed, cp->subparts_cpus)) 1415 + goto update_parent_subparts; 1416 + 1556 1417 cpumask_copy(tmp->new_cpus, parent->effective_cpus); 1557 1418 if (!cp->use_parent_ecpus) { 1558 1419 cp->use_parent_ecpus = true; ··· 1573 1420 1574 1421 /* 1575 1422 * Skip the whole subtree if the cpumask remains the same 1576 - * and has no partition root state. 1423 + * and has no partition root state and force flag not set. 1577 1424 */ 1578 - if (!cp->partition_root_state && 1425 + if (!cp->partition_root_state && !force && 1579 1426 cpumask_equal(tmp->new_cpus, cp->effective_cpus)) { 1580 1427 pos_css = css_rightmost_descendant(pos_css); 1581 1428 continue; 1582 1429 } 1583 1430 1431 + update_parent_subparts: 1584 1432 /* 1585 1433 * update_parent_subparts_cpumask() should have been called 1586 1434 * for cs already in update_cpumask(). We should also call ··· 1591 1437 old_prs = new_prs = cp->partition_root_state; 1592 1438 if ((cp != cs) && old_prs) { 1593 1439 switch (parent->partition_root_state) { 1594 - case PRS_DISABLED: 1595 - /* 1596 - * If parent is not a partition root or an 1597 - * invalid partition root, clear its state 1598 - * and its CS_CPU_EXCLUSIVE flag. 1599 - */ 1600 - WARN_ON_ONCE(cp->partition_root_state 1601 - != PRS_ERROR); 1602 - new_prs = PRS_DISABLED; 1603 - 1604 - /* 1605 - * clear_bit() is an atomic operation and 1606 - * readers aren't interested in the state 1607 - * of CS_CPU_EXCLUSIVE anyway. So we can 1608 - * just update the flag without holding 1609 - * the callback_lock. 
1610 - */ 1611 - clear_bit(CS_CPU_EXCLUSIVE, &cp->flags); 1440 + case PRS_ROOT: 1441 + case PRS_ISOLATED: 1442 + update_parent = true; 1612 1443 break; 1613 1444 1614 - case PRS_ENABLED: 1615 - if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) 1616 - update_tasks_cpumask(parent); 1617 - break; 1618 - 1619 - case PRS_ERROR: 1445 + default: 1620 1446 /* 1621 - * When parent is invalid, it has to be too. 1447 + * When parent is not a partition root or is 1448 + * invalid, child partition roots become 1449 + * invalid too. 1622 1450 */ 1623 - new_prs = PRS_ERROR; 1451 + if (is_partition_valid(cp)) 1452 + new_prs = -cp->partition_root_state; 1453 + WRITE_ONCE(cp->prs_err, 1454 + is_partition_invalid(parent) 1455 + ? PERR_INVPARENT : PERR_NOTPART); 1624 1456 break; 1625 1457 } 1626 1458 } ··· 1615 1475 continue; 1616 1476 rcu_read_unlock(); 1617 1477 1478 + if (update_parent) { 1479 + update_parent_subparts_cpumask(cp, partcmd_update, NULL, 1480 + tmp); 1481 + /* 1482 + * The cpuset partition_root_state may become 1483 + * invalid. Capture it. 1484 + */ 1485 + new_prs = cp->partition_root_state; 1486 + } 1487 + 1618 1488 spin_lock_irq(&callback_lock); 1619 1489 1620 - cpumask_copy(cp->effective_cpus, tmp->new_cpus); 1621 - if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) { 1490 + if (cp->nr_subparts_cpus && !is_partition_valid(cp)) { 1491 + /* 1492 + * Put all active subparts_cpus back to effective_cpus. 1493 + */ 1494 + cpumask_or(tmp->new_cpus, tmp->new_cpus, 1495 + cp->subparts_cpus); 1496 + cpumask_and(tmp->new_cpus, tmp->new_cpus, 1497 + cpu_active_mask); 1622 1498 cp->nr_subparts_cpus = 0; 1623 1499 cpumask_clear(cp->subparts_cpus); 1624 - } else if (cp->nr_subparts_cpus) { 1500 + } 1501 + 1502 + cpumask_copy(cp->effective_cpus, tmp->new_cpus); 1503 + if (cp->nr_subparts_cpus) { 1625 1504 /* 1626 1505 * Make sure that effective_cpus & subparts_cpus 1627 1506 * are mutually exclusive. 
1628 - * 1629 - * In the unlikely event that effective_cpus 1630 - * becomes empty. we clear cp->nr_subparts_cpus and 1631 - * let its child partition roots to compete for 1632 - * CPUs again. 1633 1507 */ 1634 1508 cpumask_andnot(cp->effective_cpus, cp->effective_cpus, 1635 1509 cp->subparts_cpus); 1636 - if (cpumask_empty(cp->effective_cpus)) { 1637 - cpumask_copy(cp->effective_cpus, tmp->new_cpus); 1638 - cpumask_clear(cp->subparts_cpus); 1639 - cp->nr_subparts_cpus = 0; 1640 - } else if (!cpumask_subset(cp->subparts_cpus, 1641 - tmp->new_cpus)) { 1642 - cpumask_andnot(cp->subparts_cpus, 1643 - cp->subparts_cpus, tmp->new_cpus); 1644 - cp->nr_subparts_cpus 1645 - = cpumask_weight(cp->subparts_cpus); 1646 - } 1647 1510 } 1648 1511 1649 - if (new_prs != old_prs) 1650 - cp->partition_root_state = new_prs; 1651 - 1512 + cp->partition_root_state = new_prs; 1652 1513 spin_unlock_irq(&callback_lock); 1653 - notify_partition_change(cp, old_prs, new_prs); 1514 + 1515 + notify_partition_change(cp, old_prs); 1654 1516 1655 1517 WARN_ON(!is_in_v2_mode() && 1656 1518 !cpumask_equal(cp->cpus_allowed, cp->effective_cpus)); ··· 1668 1526 if (!cpumask_empty(cp->cpus_allowed) && 1669 1527 is_sched_load_balance(cp) && 1670 1528 (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) || 1671 - is_partition_root(cp))) 1529 + is_partition_valid(cp))) 1672 1530 need_rebuild_sched_domains = true; 1673 1531 1674 1532 rcu_read_lock(); ··· 1712 1570 continue; 1713 1571 1714 1572 rcu_read_unlock(); 1715 - update_cpumasks_hier(sibling, tmp); 1573 + update_cpumasks_hier(sibling, tmp, false); 1716 1574 rcu_read_lock(); 1717 1575 css_put(&sibling->css); 1718 1576 } ··· 1730 1588 { 1731 1589 int retval; 1732 1590 struct tmpmasks tmp; 1591 + bool invalidate = false; 1733 1592 1734 1593 /* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */ 1735 1594 if (cs == &top_cpuset) ··· 1758 1615 if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed)) 1759 1616 return 0; 1760 1617 1761 - retval = 
validate_change(cs, trialcs); 1762 - if (retval < 0) 1763 - return retval; 1764 - 1765 1618 #ifdef CONFIG_CPUMASK_OFFSTACK 1766 1619 /* 1767 1620 * Use the cpumasks in trialcs for tmpmasks when they are pointers ··· 1768 1629 tmp.new_cpus = trialcs->cpus_allowed; 1769 1630 #endif 1770 1631 1632 + retval = validate_change(cs, trialcs); 1633 + 1634 + if ((retval == -EINVAL) && cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) { 1635 + struct cpuset *cp, *parent; 1636 + struct cgroup_subsys_state *css; 1637 + 1638 + /* 1639 + * The -EINVAL error code indicates that partition sibling 1640 + * CPU exclusivity rule has been violated. We still allow 1641 + * the cpumask change to proceed while invalidating the 1642 + * partition. However, any conflicting sibling partitions 1643 + * have to be marked as invalid too. 1644 + */ 1645 + invalidate = true; 1646 + rcu_read_lock(); 1647 + parent = parent_cs(cs); 1648 + cpuset_for_each_child(cp, css, parent) 1649 + if (is_partition_valid(cp) && 1650 + cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) { 1651 + rcu_read_unlock(); 1652 + update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp); 1653 + rcu_read_lock(); 1654 + } 1655 + rcu_read_unlock(); 1656 + retval = 0; 1657 + } 1658 + if (retval < 0) 1659 + return retval; 1660 + 1771 1661 if (cs->partition_root_state) { 1772 - /* Cpumask of a partition root cannot be empty */ 1773 - if (cpumask_empty(trialcs->cpus_allowed)) 1774 - return -EINVAL; 1775 - if (update_parent_subparts_cpumask(cs, partcmd_update, 1776 - trialcs->cpus_allowed, &tmp) < 0) 1777 - return -EINVAL; 1662 + if (invalidate) 1663 + update_parent_subparts_cpumask(cs, partcmd_invalidate, 1664 + NULL, &tmp); 1665 + else 1666 + update_parent_subparts_cpumask(cs, partcmd_update, 1667 + trialcs->cpus_allowed, &tmp); 1778 1668 } 1779 1669 1670 + compute_effective_cpumask(trialcs->effective_cpus, trialcs, 1671 + parent_cs(cs)); 1780 1672 spin_lock_irq(&callback_lock); 1781 1673 cpumask_copy(cs->cpus_allowed, 
trialcs->cpus_allowed); 1782 1674 1783 1675 /* 1784 - * Make sure that subparts_cpus is a subset of cpus_allowed. 1676 + * Make sure that subparts_cpus, if not empty, is a subset of 1677 + * cpus_allowed. Clear subparts_cpus if partition not valid or 1678 + * empty effective cpus with tasks. 1785 1679 */ 1786 1680 if (cs->nr_subparts_cpus) { 1787 - cpumask_and(cs->subparts_cpus, cs->subparts_cpus, cs->cpus_allowed); 1788 - cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); 1681 + if (!is_partition_valid(cs) || 1682 + (cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus) && 1683 + partition_is_populated(cs, NULL))) { 1684 + cs->nr_subparts_cpus = 0; 1685 + cpumask_clear(cs->subparts_cpus); 1686 + } else { 1687 + cpumask_and(cs->subparts_cpus, cs->subparts_cpus, 1688 + cs->cpus_allowed); 1689 + cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); 1690 + } 1789 1691 } 1790 1692 spin_unlock_irq(&callback_lock); 1791 1693 1792 - update_cpumasks_hier(cs, &tmp); 1694 + /* effective_cpus will be updated here */ 1695 + update_cpumasks_hier(cs, &tmp, false); 1793 1696 1794 1697 if (cs->partition_root_state) { 1795 1698 struct cpuset *parent = parent_cs(cs); ··· 2207 2026 return err; 2208 2027 } 2209 2028 2210 - /* 2029 + /** 2211 2030 * update_prstate - update partition_root_state 2212 - * cs: the cpuset to update 2213 - * new_prs: new partition root state 2031 + * @cs: the cpuset to update 2032 + * @new_prs: new partition root state 2033 + * Return: 0 if successful, != 0 if error 2214 2034 * 2215 2035 * Call with cpuset_rwsem held. 
2216 2036 */ 2217 2037 static int update_prstate(struct cpuset *cs, int new_prs) 2218 2038 { 2219 - int err, old_prs = cs->partition_root_state; 2039 + int err = PERR_NONE, old_prs = cs->partition_root_state; 2040 + bool sched_domain_rebuilt = false; 2220 2041 struct cpuset *parent = parent_cs(cs); 2221 2042 struct tmpmasks tmpmask; 2222 2043 ··· 2226 2043 return 0; 2227 2044 2228 2045 /* 2229 - * Cannot force a partial or invalid partition root to a full 2230 - * partition root. 2046 + * For a previously invalid partition root, leave it at being 2047 + * invalid if new_prs is not "member". 2231 2048 */ 2232 - if (new_prs && (old_prs == PRS_ERROR)) 2233 - return -EINVAL; 2049 + if (new_prs && is_prs_invalid(old_prs)) { 2050 + cs->partition_root_state = -new_prs; 2051 + return 0; 2052 + } 2234 2053 2235 2054 if (alloc_cpumasks(NULL, &tmpmask)) 2236 2055 return -ENOMEM; 2237 2056 2238 - err = -EINVAL; 2239 2057 if (!old_prs) { 2240 2058 /* 2241 2059 * Turning on partition root requires setting the 2242 2060 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed 2243 - * cannot be NULL. 2061 + * cannot be empty. 2244 2062 */ 2245 - if (cpumask_empty(cs->cpus_allowed)) 2063 + if (cpumask_empty(cs->cpus_allowed)) { 2064 + err = PERR_CPUSEMPTY; 2246 2065 goto out; 2066 + } 2247 2067 2248 2068 err = update_flag(CS_CPU_EXCLUSIVE, cs, 1); 2249 - if (err) 2069 + if (err) { 2070 + err = PERR_NOTEXCL; 2250 2071 goto out; 2072 + } 2251 2073 2252 2074 err = update_parent_subparts_cpumask(cs, partcmd_enable, 2253 2075 NULL, &tmpmask); ··· 2260 2072 update_flag(CS_CPU_EXCLUSIVE, cs, 0); 2261 2073 goto out; 2262 2074 } 2075 + 2076 + if (new_prs == PRS_ISOLATED) { 2077 + /* 2078 + * Disable the load balance flag should not return an 2079 + * error unless the system is running out of memory. 
2080 + */ 2081 + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); 2082 + sched_domain_rebuilt = true; 2083 + } 2084 + } else if (old_prs && new_prs) { 2085 + /* 2086 + * A change in load balance state only, no change in cpumasks. 2087 + */ 2088 + update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED)); 2089 + sched_domain_rebuilt = true; 2090 + goto out; /* Sched domain is rebuilt in update_flag() */ 2263 2091 } else { 2264 2092 /* 2265 - * Turning off partition root will clear the 2266 - * CS_CPU_EXCLUSIVE bit. 2093 + * Switching back to member is always allowed even if it 2094 + * disables child partitions. 2267 2095 */ 2268 - if (old_prs == PRS_ERROR) { 2269 - update_flag(CS_CPU_EXCLUSIVE, cs, 0); 2270 - err = 0; 2271 - goto out; 2272 - } 2096 + update_parent_subparts_cpumask(cs, partcmd_disable, NULL, 2097 + &tmpmask); 2273 2098 2274 - err = update_parent_subparts_cpumask(cs, partcmd_disable, 2275 - NULL, &tmpmask); 2276 - if (err) 2277 - goto out; 2099 + /* 2100 + * If there are child partitions, they will all become invalid. 2101 + */ 2102 + if (unlikely(cs->nr_subparts_cpus)) { 2103 + spin_lock_irq(&callback_lock); 2104 + cs->nr_subparts_cpus = 0; 2105 + cpumask_clear(cs->subparts_cpus); 2106 + compute_effective_cpumask(cs->effective_cpus, cs, parent); 2107 + spin_unlock_irq(&callback_lock); 2108 + } 2278 2109 2279 2110 /* Turning off CS_CPU_EXCLUSIVE will not return error */ 2280 2111 update_flag(CS_CPU_EXCLUSIVE, cs, 0); 2112 + 2113 + if (!is_sched_load_balance(cs)) { 2114 + /* Make sure load balance is on */ 2115 + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); 2116 + sched_domain_rebuilt = true; 2117 + } 2281 2118 } 2282 2119 2283 - /* 2284 - * Update cpumask of parent's tasks except when it is the top 2285 - * cpuset as some system daemons cannot be mapped to other CPUs. 
2286 - */ 2287 - if (parent != &top_cpuset) 2288 - update_tasks_cpumask(parent); 2120 + update_tasks_cpumask(parent); 2289 2121 2290 2122 if (parent->child_ecpus_count) 2291 2123 update_sibling_cpumasks(parent, cs, &tmpmask); 2292 2124 2293 - rebuild_sched_domains_locked(); 2125 + if (!sched_domain_rebuilt) 2126 + rebuild_sched_domains_locked(); 2294 2127 out: 2295 - if (!err) { 2296 - spin_lock_irq(&callback_lock); 2297 - cs->partition_root_state = new_prs; 2298 - spin_unlock_irq(&callback_lock); 2299 - notify_partition_change(cs, old_prs, new_prs); 2300 - } 2128 + /* 2129 + * Make partition invalid if an error happen 2130 + */ 2131 + if (err) 2132 + new_prs = -new_prs; 2133 + spin_lock_irq(&callback_lock); 2134 + cs->partition_root_state = new_prs; 2135 + spin_unlock_irq(&callback_lock); 2136 + /* 2137 + * Update child cpusets, if present. 2138 + * Force update if switching back to member. 2139 + */ 2140 + if (!list_empty(&cs->css.children)) 2141 + update_cpumasks_hier(cs, &tmpmask, !new_prs); 2301 2142 2143 + notify_partition_change(cs, old_prs); 2302 2144 free_cpumasks(NULL, &tmpmask); 2303 - return err; 2145 + return 0; 2304 2146 } 2305 2147 2306 2148 /* ··· 2454 2236 ret = -ENOSPC; 2455 2237 if (!is_in_v2_mode() && 2456 2238 (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) 2239 + goto out_unlock; 2240 + 2241 + /* 2242 + * Task cannot be moved to a cpuset with empty effective cpus. 
2243 + */ 2244 + if (cpumask_empty(cs->effective_cpus)) 2457 2245 goto out_unlock; 2458 2246 2459 2247 cgroup_taskset_for_each(task, css, tset) { ··· 2822 2598 static int sched_partition_show(struct seq_file *seq, void *v) 2823 2599 { 2824 2600 struct cpuset *cs = css_cs(seq_css(seq)); 2601 + const char *err, *type = NULL; 2825 2602 2826 2603 switch (cs->partition_root_state) { 2827 - case PRS_ENABLED: 2604 + case PRS_ROOT: 2828 2605 seq_puts(seq, "root\n"); 2829 2606 break; 2830 - case PRS_DISABLED: 2607 + case PRS_ISOLATED: 2608 + seq_puts(seq, "isolated\n"); 2609 + break; 2610 + case PRS_MEMBER: 2831 2611 seq_puts(seq, "member\n"); 2832 2612 break; 2833 - case PRS_ERROR: 2834 - seq_puts(seq, "root invalid\n"); 2613 + case PRS_INVALID_ROOT: 2614 + type = "root"; 2615 + fallthrough; 2616 + case PRS_INVALID_ISOLATED: 2617 + if (!type) 2618 + type = "isolated"; 2619 + err = perr_strings[READ_ONCE(cs->prs_err)]; 2620 + if (err) 2621 + seq_printf(seq, "%s invalid (%s)\n", type, err); 2622 + else 2623 + seq_printf(seq, "%s invalid\n", type); 2835 2624 break; 2836 2625 } 2837 2626 return 0; ··· 2863 2626 * Convert "root" to ENABLED, and convert "member" to DISABLED. 
 	 */
 	if (!strcmp(buf, "root"))
-		val = PRS_ENABLED;
+		val = PRS_ROOT;
 	else if (!strcmp(buf, "member"))
-		val = PRS_DISABLED;
+		val = PRS_MEMBER;
+	else if (!strcmp(buf, "isolated"))
+		val = PRS_ISOLATED;
 	else
 		return -EINVAL;

···
 	cpus_read_lock();
 	percpu_down_write(&cpuset_rwsem);

-	if (is_partition_root(cs))
+	if (is_partition_valid(cs))
 		update_prstate(cs, 0);

 	if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
···
 			    struct cpumask *new_cpus, nodemask_t *new_mems,
 			    bool cpus_updated, bool mems_updated)
 {
-	if (cpumask_empty(new_cpus))
+	/* A partition root is allowed to have empty effective cpus */
+	if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
 		cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
 	if (nodes_empty(*new_mems))
 		*new_mems = parent_cs(cs)->effective_mems;
···

 	/*
 	 * In the unlikely event that a partition root has empty
-	 * effective_cpus or its parent becomes erroneous, we have to
-	 * transition it to the erroneous state.
+	 * effective_cpus with tasks, we will have to invalidate child
+	 * partitions, if present, by setting nr_subparts_cpus to 0 to
+	 * reclaim their cpus.
 	 */
-	if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
-	   (parent->partition_root_state == PRS_ERROR))) {
+	if (cs->nr_subparts_cpus && is_partition_valid(cs) &&
+	    cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) {
+		spin_lock_irq(&callback_lock);
+		cs->nr_subparts_cpus = 0;
+		cpumask_clear(cs->subparts_cpus);
+		spin_unlock_irq(&callback_lock);
+		compute_effective_cpumask(&new_cpus, cs, parent);
+	}
+
+	/*
+	 * Force the partition to become invalid if either one of
+	 * the following conditions hold:
+	 * 1) empty effective cpus but not valid empty partition.
+	 * 2) parent is invalid or doesn't grant any cpus to child
+	 *    partitions.
+	 */
+	if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
+	   (cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
+		int old_prs, parent_prs;
+
+		update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
 		if (cs->nr_subparts_cpus) {
 			spin_lock_irq(&callback_lock);
 			cs->nr_subparts_cpus = 0;
···
 			compute_effective_cpumask(&new_cpus, cs, parent);
 		}

-		/*
-		 * If the effective_cpus is empty because the child
-		 * partitions take away all the CPUs, we can keep
-		 * the current partition and let the child partitions
-		 * fight for available CPUs.
-		 */
-		if ((parent->partition_root_state == PRS_ERROR) ||
-		     cpumask_empty(&new_cpus)) {
-			int old_prs;
-
-			update_parent_subparts_cpumask(cs, partcmd_disable,
-						       NULL, tmp);
-			old_prs = cs->partition_root_state;
-			if (old_prs != PRS_ERROR) {
-				spin_lock_irq(&callback_lock);
-				cs->partition_root_state = PRS_ERROR;
-				spin_unlock_irq(&callback_lock);
-				notify_partition_change(cs, old_prs, PRS_ERROR);
-			}
+		old_prs = cs->partition_root_state;
+		parent_prs = parent->partition_root_state;
+		if (is_partition_valid(cs)) {
+			spin_lock_irq(&callback_lock);
+			make_partition_invalid(cs);
+			spin_unlock_irq(&callback_lock);
+			if (is_prs_invalid(parent_prs))
+				WRITE_ONCE(cs->prs_err, PERR_INVPARENT);
+			else if (!parent_prs)
+				WRITE_ONCE(cs->prs_err, PERR_NOTPART);
+			else
+				WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
+			notify_partition_change(cs, old_prs);
 		}
 		cpuset_force_rebuild();
 	}

 	/*
-	 * On the other hand, an erroneous partition root may be transitioned
-	 * back to a regular one or a partition root with no CPU allocated
-	 * from the parent may change to erroneous.
+	 * On the other hand, an invalid partition root may be transitioned
+	 * back to a regular one.
 	 */
-	if (is_partition_root(parent) &&
-	    ((cs->partition_root_state == PRS_ERROR) ||
-	     !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
-	    update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
-		cpuset_force_rebuild();
+	else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
+		update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
+		if (is_partition_valid(cs))
+			cpuset_force_rebuild();
+	}

 update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
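The cpuset hunks above can be summarized with a small user-space model. The sketch below is illustrative Python, not kernel code: it mirrors how a string written to cpuset.cpus.partition is mapped to a state, and how the error reason is chosen when a valid partition is forced invalid. The numeric state values follow the `P<v>` legend used by the selftest in this commit; representing invalid states as negative numbers, and the helper names `parse_partition_write` and `invalidation_reason`, are assumptions made for illustration.

```python
import errno

# Assumed numeric states, following the selftest legend
# (0: member, 1: root, 2: isolated, negative: invalid partition root).
PRS_MEMBER, PRS_ROOT, PRS_ISOLATED = 0, 1, 2


def parse_partition_write(buf):
    """Map a string written to cpuset.cpus.partition to a state.

    Hypothetical helper: raises OSError(EINVAL) for any other input,
    mirroring the `return -EINVAL` branch in the diff.
    """
    states = {"root": PRS_ROOT, "member": PRS_MEMBER, "isolated": PRS_ISOLATED}
    try:
        return states[buf.strip()]
    except KeyError:
        raise OSError(errno.EINVAL, "invalid partition type") from None


def invalidation_reason(parent_prs):
    """Pick the reason recorded when a valid partition becomes invalid.

    Same three-way decision as the hotplug hunk: invalid parent,
    parent that is a plain member, or anything else.
    """
    if parent_prs < 0:             # parent is itself an invalid partition root
        return "PERR_INVPARENT"
    if parent_prs == PRS_MEMBER:   # parent is not a partition root at all
        return "PERR_NOTPART"
    return "PERR_HOTPLUG"          # parent is valid; the CPUs went away
```

For example, `parse_partition_write("isolated\n")` yields `PRS_ISOLATED`, while any unrecognized string is rejected with `EINVAL`.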
+35 -2
kernel/cgroup/pids.c
···
 	 */
 	atomic64_t			counter;
 	atomic64_t			limit;
+	int64_t				watermark;

 	/* Handle for "pids.events" */
 	struct cgroup_file		events_file;
···
 static void pids_css_free(struct cgroup_subsys_state *css)
 {
 	kfree(css_pids(css));
+}
+
+static void pids_update_watermark(struct pids_cgroup *p, int64_t nr_pids)
+{
+	/*
+	 * This is racy, but we don't need perfectly accurate tallying of
+	 * the watermark, and this lets us avoid extra atomic overhead.
+	 */
+	if (nr_pids > READ_ONCE(p->watermark))
+		WRITE_ONCE(p->watermark, nr_pids);
 }

 /**
···
 {
 	struct pids_cgroup *p;

-	for (p = pids; parent_pids(p); p = parent_pids(p))
-		atomic64_add(num, &p->counter);
+	for (p = pids; parent_pids(p); p = parent_pids(p)) {
+		int64_t new = atomic64_add_return(num, &p->counter);
+
+		pids_update_watermark(p, new);
+	}
 }

 /**
···
 		 */
 		if (new > limit)
 			goto revert;
+
+		/*
+		 * Not technically accurate if we go over limit somewhere up
+		 * the hierarchy, but that's tolerable for the watermark.
+		 */
+		pids_update_watermark(p, new);
 	}

 	return 0;
···
 	return atomic64_read(&pids->counter);
 }

+static s64 pids_peak_read(struct cgroup_subsys_state *css,
+			  struct cftype *cft)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	return READ_ONCE(pids->watermark);
+}
+
 static int pids_events_show(struct seq_file *sf, void *v)
 {
 	struct pids_cgroup *pids = css_pids(seq_css(sf));
···
 		.name = "current",
 		.read_s64 = pids_current_read,
 		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "peak",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = pids_peak_read,
 	},
 	{
 		.name = "events",
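To make the pids.peak semantics concrete, here is a minimal single-threaded Python model of the charge path shown above. It is only a sketch: the class and function names are hypothetical, and the kernel's atomics and deliberately racy watermark update are replaced by plain integer arithmetic. What it does preserve is the structure of the loop (every cgroup from the target up to, but not including, the root is charged) and the rule that the watermark only ever grows.

```python
class PidsCgroup:
    """Toy stand-in for struct pids_cgroup (illustrative only)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.counter = 0     # what pids.current would report
        self.watermark = 0   # what pids.peak would report


def update_watermark(p, nr_pids):
    # The watermark never decreases; it only tracks the highest count seen.
    if nr_pids > p.watermark:
        p.watermark = nr_pids


def charge(pids, num):
    # Walk from the target cgroup towards the root; the root itself
    # (the node without a parent) is skipped, as in the kernel loop.
    p = pids
    while p.parent is not None:
        p.counter += num
        update_watermark(p, p.counter)
        p = p.parent


def uncharge(pids, num):
    # Uncharging lowers the counter but leaves the watermark untouched.
    p = pids
    while p.parent is not None:
        p.counter -= num
        p = p.parent
```

In this model, charging three pids to a leaf, releasing two and charging one more leaves the counter at 2 while the watermark stays at 3, which is the behaviour the commit message describes as remembering the maximum number of pids used.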
+2 -2
mm/memcontrol.c
···
 	struct mem_cgroup *memcg;

 	cgrp = cgroup_get_from_id(ino);
-	if (!cgrp)
-		return ERR_PTR(-ENOENT);
+	if (IS_ERR(cgrp))
+		return ERR_CAST(cgrp);

 	css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
 	if (css)
+5 -4
net/netfilter/nft_socket.c
···
 nft_sock_get_eval_cgroupv2(u32 *dest, struct sock *sk, const struct nft_pktinfo *pkt, u32 level)
 {
 	struct cgroup *cgrp;
+	u64 cgid;

 	if (!sk_fullsock(sk))
 		return false;

-	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
-	if (level > cgrp->level)
+	cgrp = cgroup_ancestor(sock_cgroup_ptr(&sk->sk_cgrp_data), level);
+	if (!cgrp)
 		return false;

-	memcpy(dest, &cgrp->ancestor_ids[level], sizeof(u64));
-
+	cgid = cgroup_id(cgrp);
+	memcpy(dest, &cgid, sizeof(u64));
 	return true;
 }
 #endif
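The nft_socket change replaces a direct read of `ancestor_ids[level]` with an ancestor lookup followed by `cgroup_id()`. The Python sketch below models that lookup on a toy tree. The `Cgroup` class and `socket_cgroup_id` helper are hypothetical names, and the parent-walking loop is an assumption about how such a lookup can be done, not a copy of the kernel helper. The point is the contract visible in the diff: a level deeper than the cgroup's own level yields no result, otherwise the ID of the ancestor at that level is returned.

```python
class Cgroup:
    """Toy cgroup node: the root has level 0, children are one level deeper."""

    def __init__(self, cgid, parent=None):
        self.id = cgid
        self.parent = parent
        self.level = 0 if parent is None else parent.level + 1


def cgroup_ancestor(cgrp, level):
    """Return the ancestor of cgrp at the given level, or None if too deep."""
    if level < 0 or level > cgrp.level:
        return None
    while cgrp.level > level:
        cgrp = cgrp.parent
    return cgrp


def socket_cgroup_id(cgrp, level):
    """Model of the new lookup: ancestor first, then its ID."""
    ancestor = cgroup_ancestor(cgrp, level)
    return None if ancestor is None else ancestor.id
```

With a chain root(1) → a(10) → b(20), asking for level 1 from `b` gives 10, and asking for level 3 gives `None`, which corresponds to the `return false` path in the C code.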
+5 -5
tools/cgroup/iocost_monitor.py
···
 }

 class BlkgIterator:
+    def __init__(self, root_blkcg, q_id, include_dying=False):
+        self.include_dying = include_dying
+        self.blkgs = []
+        self.walk(root_blkcg, q_id, '')
+
     def blkcg_name(blkcg):
         return blkcg.css.cgroup.kn.name.string_().decode('utf-8')

···
         for c in list_for_each_entry('struct blkcg',
                                      blkcg.css.children.address_of_(), 'css.sibling'):
             self.walk(c, q_id, path)
-
-    def __init__(self, root_blkcg, q_id, include_dying=False):
-        self.include_dying = include_dying
-        self.blkgs = []
-        self.walk(root_blkcg, q_id, '')

     def __iter__(self):
         return iter(self.blkgs)
+1 -1
tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
···
 			break;

 		// convert cgroup-id to a map index
-		cgrp_id = BPF_CORE_READ(cgrp, ancestor_ids[i]);
+		cgrp_id = BPF_CORE_READ(cgrp, ancestors[i], kn, id);
 		elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id);
 		if (!elem)
 			continue;
+1
tools/testing/selftests/cgroup/.gitignore
···
 test_kmem
 test_kill
 test_cpu
+wait_inotify
+3 -2
tools/testing/selftests/cgroup/Makefile
···
 # SPDX-License-Identifier: GPL-2.0
 CFLAGS += -Wall -pthread

-all:
+all: ${HELPER_PROGS}

 TEST_FILES     := with_stress.sh
-TEST_PROGS     := test_stress.sh
+TEST_PROGS     := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
 TEST_GEN_PROGS += test_core
+674
tools/testing/selftests/cgroup/test_cpuset_prs.sh
··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + # 4 + # Test for cpuset v2 partition root state (PRS) 5 + # 6 + # The sched verbose flag is set, if available, so that the console log 7 + # can be examined for the correct setting of scheduling domain. 8 + # 9 + 10 + skip_test() { 11 + echo "$1" 12 + echo "Test SKIPPED" 13 + exit 0 14 + } 15 + 16 + [[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!" 17 + 18 + # Set sched verbose flag, if available 19 + [[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose 20 + 21 + # Get wait_inotify location 22 + WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify 23 + 24 + # Find cgroup v2 mount point 25 + CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}') 26 + [[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!" 27 + 28 + CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//") 29 + [[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!" 30 + 31 + # Set verbose flag and delay factor 32 + PROG=$1 33 + VERBOSE= 34 + DELAY_FACTOR=1 35 + while [[ "$1" = -* ]] 36 + do 37 + case "$1" in 38 + -v) VERBOSE=1 39 + break 40 + ;; 41 + -d) DELAY_FACTOR=$2 42 + shift 43 + break 44 + ;; 45 + *) echo "Usage: $PROG [-v] [-d <delay-factor>" 46 + exit 47 + ;; 48 + esac 49 + shift 50 + done 51 + 52 + cd $CGROUP2 53 + echo +cpuset > cgroup.subtree_control 54 + [[ -d test ]] || mkdir test 55 + cd test 56 + 57 + # Pause in ms 58 + pause() 59 + { 60 + DELAY=$1 61 + LOOP=0 62 + while [[ $LOOP -lt $DELAY_FACTOR ]] 63 + do 64 + sleep $DELAY 65 + ((LOOP++)) 66 + done 67 + return 0 68 + } 69 + 70 + console_msg() 71 + { 72 + MSG=$1 73 + echo "$MSG" 74 + echo "" > /dev/console 75 + echo "$MSG" > /dev/console 76 + pause 0.01 77 + } 78 + 79 + test_partition() 80 + { 81 + EXPECTED_VAL=$1 82 + echo $EXPECTED_VAL > cpuset.cpus.partition 83 + [[ $? 
-eq 0 ]] || exit 1 84 + ACTUAL_VAL=$(cat cpuset.cpus.partition) 85 + [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && { 86 + echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL" 87 + echo "Test FAILED" 88 + exit 1 89 + } 90 + } 91 + 92 + test_effective_cpus() 93 + { 94 + EXPECTED_VAL=$1 95 + ACTUAL_VAL=$(cat cpuset.cpus.effective) 96 + [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && { 97 + echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'" 98 + echo "Test FAILED" 99 + exit 1 100 + } 101 + } 102 + 103 + # Adding current process to cgroup.procs as a test 104 + test_add_proc() 105 + { 106 + OUTSTR="$1" 107 + ERRMSG=$((echo $$ > cgroup.procs) |& cat) 108 + echo $ERRMSG | grep -q "$OUTSTR" 109 + [[ $? -ne 0 ]] && { 110 + echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'" 111 + echo "Test FAILED" 112 + exit 1 113 + } 114 + echo $$ > $CGROUP2/cgroup.procs # Move out the task 115 + } 116 + 117 + # 118 + # Testing the new "isolated" partition root type 119 + # 120 + test_isolated() 121 + { 122 + echo 2-3 > cpuset.cpus 123 + TYPE=$(cat cpuset.cpus.partition) 124 + [[ $TYPE = member ]] || echo member > cpuset.cpus.partition 125 + 126 + console_msg "Change from member to root" 127 + test_partition root 128 + 129 + console_msg "Change from root to isolated" 130 + test_partition isolated 131 + 132 + console_msg "Change from isolated to member" 133 + test_partition member 134 + 135 + console_msg "Change from member to isolated" 136 + test_partition isolated 137 + 138 + console_msg "Change from isolated to root" 139 + test_partition root 140 + 141 + console_msg "Change from root to member" 142 + test_partition member 143 + 144 + # 145 + # Testing partition root with no cpu 146 + # 147 + console_msg "Distribute all cpus to child partition" 148 + echo +cpuset > cgroup.subtree_control 149 + test_partition root 150 + 151 + mkdir A1 152 + cd A1 153 + echo 2-3 > cpuset.cpus 154 + test_partition root 155 + test_effective_cpus 2-3 156 + cd .. 
157 + test_effective_cpus "" 158 + 159 + console_msg "Moving task to partition test" 160 + test_add_proc "No space left" 161 + cd A1 162 + test_add_proc "" 163 + cd .. 164 + 165 + console_msg "Shrink and expand child partition" 166 + cd A1 167 + echo 2 > cpuset.cpus 168 + cd .. 169 + test_effective_cpus 3 170 + cd A1 171 + echo 2-3 > cpuset.cpus 172 + cd .. 173 + test_effective_cpus "" 174 + 175 + # Cleaning up 176 + console_msg "Cleaning up" 177 + echo $$ > $CGROUP2/cgroup.procs 178 + [[ -d A1 ]] && rmdir A1 179 + } 180 + 181 + # 182 + # Cpuset controller state transition test matrix. 183 + # 184 + # Cgroup test hierarchy 185 + # 186 + # test -- A1 -- A2 -- A3 187 + # \- B1 188 + # 189 + # P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid) 190 + # C<l> = add cpu-list 191 + # S<p> = use prefix in subtree_control 192 + # T = put a task into cgroup 193 + # O<c>-<v> = Write <v> to CPU online file of <c> 194 + # 195 + SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1" 196 + TEST_MATRIX=( 197 + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate 198 + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ 199 + " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1" 200 + " S+ C0-1 . . C2-3 P1 . . . 0 " 201 + " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 " 202 + " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 " 203 + " S+ C0-1:S+ . . C2-3 . . . P1 0 " 204 + " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 " 205 + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 " 206 + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 " 207 + " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5" 208 + " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5" 209 + " S+ C0-1 . . C2-3:P1 . . . C2 0 " 210 + " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5" 211 + " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3" 212 + " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3" 213 + " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1" 214 + " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 
0 A1:3,A2:3 A1:P1,A2:P0" 215 + " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2" 216 + " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1" 217 + " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 218 + 219 + # CPU offlining cases: 220 + " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3" 221 + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3" 222 + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3" 223 + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3" 224 + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3" 225 + " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" 226 + " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" 227 + " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" 228 + " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" 229 + " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1" 230 + " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1" 231 + " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1" 232 + " S+ C2-3:P1:S+ C3:P1 . . . T:O3-0 . . 0 A1:2,A2:2 A1:P1,A2:P-1" 233 + " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 234 + " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" 235 + " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1" 236 + " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" 237 + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1" 238 + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1" 239 + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 240 + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 241 + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 242 + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 
0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" 243 + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" 244 + 245 + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate 246 + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ 247 + # 248 + # Incorrect change to cpuset.cpus invalidates partition root 249 + # 250 + # Adding CPUs to partition root that are not in parent's 251 + # cpuset.cpus is allowed, but those extra CPUs are ignored. 252 + " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:,A2:2-3 A1:P1,A2:P1" 253 + 254 + # Taking away all CPUs from parent or itself if there are tasks 255 + # will make the partition invalid. 256 + " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1" 257 + " S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" 258 + " S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" 259 + 260 + # Changing a partition root to member makes child partitions invalid 261 + " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P-1" 262 + " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1" 263 + 264 + # cpuset.cpus can contains cpus not in parent's cpuset.cpus as long 265 + # as they overlap. 266 + " S+ C2-3:P1:S+ . . . . C3-4:P1 . . 0 A1:2,A2:3 A1:P1,A2:P1" 267 + 268 + # Deletion of CPUs distributed to child cgroup is allowed. 269 + " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 0 A1:4-5,A2:4-5" 270 + 271 + # To become a valid partition root, cpuset.cpus must overlap parent's 272 + # cpuset.cpus. 273 + " S+ C0-1:P1 . . C2-3 S+ C4-5:P1 . . 0 A1:0-1,A2:0-1 A1:P1,A2:P-1" 274 + 275 + # Enabling partition with child cpusets is allowed 276 + " S+ C0-1:S+ C1 . C2-3 P1 . . . 0 A1:0-1,A2:1 A1:P1" 277 + 278 + # A partition root with non-partition root parent is invalid, but it 279 + # can be made valid if its parent becomes a partition root too. 280 + " S+ C0-1:S+ C1 . C2-3 . P2 . . 
0 A1:0-1,A2:1 A1:P0,A2:P-2" 281 + " S+ C0-1:S+ C1:P2 . C2-3 P1 . . . 0 A1:0,A2:1 A1:P1,A2:P2" 282 + 283 + # A non-exclusive cpuset.cpus change will invalidate partition and its siblings 284 + " S+ C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P0" 285 + " S+ C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P-1" 286 + " S+ C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P0,B1:P-1" 287 + 288 + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate 289 + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ 290 + # Failure cases: 291 + 292 + # A task cannot be added to a partition with no cpu 293 + " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1" 294 + ) 295 + 296 + # 297 + # Write to the cpu online file 298 + # $1 - <c>-<v> where <c> = cpu number, <v> value to be written 299 + # 300 + write_cpu_online() 301 + { 302 + CPU=${1%-*} 303 + VAL=${1#*-} 304 + CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online 305 + if [[ $VAL -eq 0 ]] 306 + then 307 + OFFLINE_CPUS="$OFFLINE_CPUS $CPU" 308 + else 309 + [[ -n "$OFFLINE_CPUS" ]] && { 310 + OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\ 311 + sort | uniq -u) 312 + } 313 + fi 314 + echo $VAL > $CPUFILE 315 + pause 0.01 316 + } 317 + 318 + # 319 + # Set controller state 320 + # $1 - cgroup directory 321 + # $2 - state 322 + # $3 - showerr 323 + # 324 + # The presence of ":" in state means transition from one to the next. 325 + # 326 + set_ctrl_state() 327 + { 328 + TMPMSG=/tmp/.msg_$$ 329 + CGRP=$1 330 + STATE=$2 331 + SHOWERR=${3}${VERBOSE} 332 + CTRL=${CTRL:=$CONTROLLER} 333 + HASERR=0 334 + REDIRECT="2> $TMPMSG" 335 + [[ -z "$STATE" || "$STATE" = '.' 
]] && return 0 336 + 337 + rm -f $TMPMSG 338 + for CMD in $(echo $STATE | sed -e "s/:/ /g") 339 + do 340 + TFILE=$CGRP/cgroup.procs 341 + SFILE=$CGRP/cgroup.subtree_control 342 + PFILE=$CGRP/cpuset.cpus.partition 343 + CFILE=$CGRP/cpuset.cpus 344 + S=$(expr substr $CMD 1 1) 345 + if [[ $S = S ]] 346 + then 347 + PREFIX=${CMD#?} 348 + COMM="echo ${PREFIX}${CTRL} > $SFILE" 349 + eval $COMM $REDIRECT 350 + elif [[ $S = C ]] 351 + then 352 + CPUS=${CMD#?} 353 + COMM="echo $CPUS > $CFILE" 354 + eval $COMM $REDIRECT 355 + elif [[ $S = P ]] 356 + then 357 + VAL=${CMD#?} 358 + case $VAL in 359 + 0) VAL=member 360 + ;; 361 + 1) VAL=root 362 + ;; 363 + 2) VAL=isolated 364 + ;; 365 + *) 366 + echo "Invalid partition state - $VAL" 367 + exit 1 368 + ;; 369 + esac 370 + COMM="echo $VAL > $PFILE" 371 + eval $COMM $REDIRECT 372 + elif [[ $S = O ]] 373 + then 374 + VAL=${CMD#?} 375 + write_cpu_online $VAL 376 + elif [[ $S = T ]] 377 + then 378 + COMM="echo 0 > $TFILE" 379 + eval $COMM $REDIRECT 380 + fi 381 + RET=$? 382 + [[ $RET -ne 0 ]] && { 383 + [[ -n "$SHOWERR" ]] && { 384 + echo "$COMM" 385 + cat $TMPMSG 386 + } 387 + HASERR=1 388 + } 389 + pause 0.01 390 + rm -f $TMPMSG 391 + done 392 + return $HASERR 393 + } 394 + 395 + set_ctrl_state_noerr() 396 + { 397 + CGRP=$1 398 + STATE=$2 399 + [[ -d $CGRP ]] || mkdir $CGRP 400 + set_ctrl_state $CGRP $STATE 1 401 + [[ $? -ne 0 ]] && { 402 + echo "ERROR: Failed to set $2 to cgroup $1!" 403 + exit 1 404 + } 405 + } 406 + 407 + online_cpus() 408 + { 409 + [[ -n "OFFLINE_CPUS" ]] && { 410 + for C in $OFFLINE_CPUS 411 + do 412 + write_cpu_online ${C}-1 413 + done 414 + } 415 + } 416 + 417 + # 418 + # Return 1 if the list of effective cpus isn't the same as the initial list. 419 + # 420 + reset_cgroup_states() 421 + { 422 + echo 0 > $CGROUP2/cgroup.procs 423 + online_cpus 424 + rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1 425 + set_ctrl_state . 
S- 426 + pause 0.01 427 + } 428 + 429 + dump_states() 430 + { 431 + for DIR in A1 A1/A2 A1/A2/A3 B1 432 + do 433 + ECPUS=$DIR/cpuset.cpus.effective 434 + PRS=$DIR/cpuset.cpus.partition 435 + [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)" 436 + [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)" 437 + done 438 + } 439 + 440 + # 441 + # Check effective cpus 442 + # $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]* 443 + # 444 + check_effective_cpus() 445 + { 446 + CHK_STR=$1 447 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") 448 + do 449 + set -- $(echo $CHK | sed -e "s/:/ /g") 450 + CGRP=$1 451 + CPUS=$2 452 + [[ $CGRP = A2 ]] && CGRP=A1/A2 453 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 454 + FILE=$CGRP/cpuset.cpus.effective 455 + [[ -e $FILE ]] || return 1 456 + [[ $CPUS = $(cat $FILE) ]] || return 1 457 + done 458 + } 459 + 460 + # 461 + # Check cgroup states 462 + # $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]* 463 + # 464 + check_cgroup_states() 465 + { 466 + CHK_STR=$1 467 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") 468 + do 469 + set -- $(echo $CHK | sed -e "s/:/ /g") 470 + CGRP=$1 471 + STATE=$2 472 + FILE= 473 + EVAL=$(expr substr $STATE 2 2) 474 + [[ $CGRP = A2 ]] && CGRP=A1/A2 475 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 476 + 477 + case $STATE in 478 + P*) FILE=$CGRP/cpuset.cpus.partition 479 + ;; 480 + *) echo "Unknown state: $STATE!" 
481 + exit 1 482 + ;; 483 + esac 484 + VAL=$(cat $FILE) 485 + 486 + case "$VAL" in 487 + member) VAL=0 488 + ;; 489 + root) VAL=1 490 + ;; 491 + isolated) 492 + VAL=2 493 + ;; 494 + "root invalid"*) 495 + VAL=-1 496 + ;; 497 + "isolated invalid"*) 498 + VAL=-2 499 + ;; 500 + esac 501 + [[ $EVAL != $VAL ]] && return 1 502 + done 503 + return 0 504 + } 505 + 506 + # 507 + # Run cpuset state transition test 508 + # $1 - test matrix name 509 + # 510 + # This test is somewhat fragile as delays (sleep x) are added in various 511 + # places to make sure state changes are fully propagated before the next 512 + # action. These delays may need to be adjusted if running in a slower machine. 513 + # 514 + run_state_test() 515 + { 516 + TEST=$1 517 + CONTROLLER=cpuset 518 + CPULIST=0-6 519 + I=0 520 + eval CNT="\${#$TEST[@]}" 521 + 522 + reset_cgroup_states 523 + echo $CPULIST > cpuset.cpus 524 + echo root > cpuset.cpus.partition 525 + console_msg "Running state transition test ..." 526 + 527 + while [[ $I -lt $CNT ]] 528 + do 529 + echo "Running test $I ..." > /dev/console 530 + eval set -- "\${$TEST[$I]}" 531 + ROOT=$1 532 + OLD_A1=$2 533 + OLD_A2=$3 534 + OLD_A3=$4 535 + OLD_B1=$5 536 + NEW_A1=$6 537 + NEW_A2=$7 538 + NEW_A3=$8 539 + NEW_B1=$9 540 + RESULT=${10} 541 + ECPUS=${11} 542 + STATES=${12} 543 + 544 + set_ctrl_state_noerr . $ROOT 545 + set_ctrl_state_noerr A1 $OLD_A1 546 + set_ctrl_state_noerr A1/A2 $OLD_A2 547 + set_ctrl_state_noerr A1/A2/A3 $OLD_A3 548 + set_ctrl_state_noerr B1 $OLD_B1 549 + RETVAL=0 550 + set_ctrl_state A1 $NEW_A1; ((RETVAL += $?)) 551 + set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?)) 552 + set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?)) 553 + set_ctrl_state B1 $NEW_B1; ((RETVAL += $?)) 554 + 555 + [[ $RETVAL -ne $RESULT ]] && { 556 + echo "Test $TEST[$I] failed result check!" 557 + eval echo \"\${$TEST[$I]}\" 558 + dump_states 559 + online_cpus 560 + exit 1 561 + } 562 + 563 + [[ -n "$ECPUS" && "$ECPUS" != . 
]] && { 564 + check_effective_cpus $ECPUS 565 + [[ $? -ne 0 ]] && { 566 + echo "Test $TEST[$I] failed effective CPU check!" 567 + eval echo \"\${$TEST[$I]}\" 568 + echo 569 + dump_states 570 + online_cpus 571 + exit 1 572 + } 573 + } 574 + 575 + [[ -n "$STATES" ]] && { 576 + check_cgroup_states $STATES 577 + [[ $? -ne 0 ]] && { 578 + echo "FAILED: Test $TEST[$I] failed states check!" 579 + eval echo \"\${$TEST[$I]}\" 580 + echo 581 + dump_states 582 + online_cpus 583 + exit 1 584 + } 585 + } 586 + 587 + reset_cgroup_states 588 + # 589 + # Check to see if effective cpu list changes 590 + # 591 + pause 0.05 592 + NEWLIST=$(cat cpuset.cpus.effective) 593 + [[ $NEWLIST != $CPULIST ]] && { 594 + echo "Effective cpus changed to $NEWLIST after test $I!" 595 + exit 1 596 + } 597 + [[ -n "$VERBOSE" ]] && echo "Test $I done." 598 + ((I++)) 599 + done 600 + echo "All $I tests of $TEST PASSED." 601 + 602 + echo member > cpuset.cpus.partition 603 + } 604 + 605 + # 606 + # Wait for inotify event for the given file and read it 607 + # $1: cgroup file to wait for 608 + # $2: file to store the read result 609 + # 610 + wait_inotify() 611 + { 612 + CGROUP_FILE=$1 613 + OUTPUT_FILE=$2 614 + 615 + $WAIT_INOTIFY $CGROUP_FILE 616 + cat $CGROUP_FILE > $OUTPUT_FILE 617 + } 618 + 619 + # 620 + # Test if inotify events are properly generated when going into and out of 621 + # invalid partition state. 622 + # 623 + test_inotify() 624 + { 625 + ERR=0 626 + PRS=/tmp/.prs_$$ 627 + [[ -f $WAIT_INOTIFY ]] || { 628 + echo "wait_inotify not found, inotify test SKIPPED." 629 + return 630 + } 631 + 632 + pause 0.01 633 + echo 1 > cpuset.cpus 634 + echo 0 > cgroup.procs 635 + echo root > cpuset.cpus.partition 636 + pause 0.01 637 + rm -f $PRS 638 + wait_inotify $PWD/cpuset.cpus.partition $PRS & 639 + pause 0.01 640 + set_ctrl_state . "O1-0" 641 + pause 0.01 642 + check_cgroup_states ".:P-1" 643 + if [[ $? 
-ne 0 ]] 644 + then 645 + echo "FAILED: Inotify test - partition not invalid" 646 + ERR=1 647 + elif [[ ! -f $PRS ]] 648 + then 649 + echo "FAILED: Inotify test - event not generated" 650 + ERR=1 651 + kill %1 652 + elif [[ $(cat $PRS) != "root invalid"* ]] 653 + then 654 + echo "FAILED: Inotify test - incorrect state" 655 + cat $PRS 656 + ERR=1 657 + fi 658 + online_cpus 659 + echo member > cpuset.cpus.partition 660 + echo 0 > ../cgroup.procs 661 + if [[ $ERR -ne 0 ]] 662 + then 663 + exit 1 664 + else 665 + echo "Inotify test PASSED" 666 + fi 667 + } 668 + 669 + run_state_test TEST_MATRIX 670 + test_isolated 671 + test_inotify 672 + echo "All tests PASSED." 673 + cd .. 674 + rmdir test
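The test matrix in test_cpuset_prs.sh uses a compact notation (`C<l>`, `P<v>`, `S<p>`, `T`, `O<c>-<v>`, joined by `:`) that `set_ctrl_state` expands into file writes. The following Python sketch re-expresses that expansion so the notation is easier to read. It is an illustrative translation under the assumption that it matches the shell logic above. The function names are hypothetical, and unlike the script it returns the list of (file, value) writes instead of performing them.

```python
PARTITION_NAMES = {"0": "member", "1": "root", "2": "isolated"}
PARTITION_CODES = {"member": 0, "root": 1, "isolated": 2}


def parse_state(state, controller="cpuset"):
    """Expand a spec such as 'C0-3:P1:S+' into ordered (file, value) writes."""
    if not state or state == ".":
        return []          # '.' means "leave this cgroup alone"
    actions = []
    for cmd in state.split(":"):
        if not cmd:
            continue
        kind, arg = cmd[0], cmd[1:]
        if kind == "S":    # enable/disable the controller in the subtree
            actions.append(("cgroup.subtree_control", arg + controller))
        elif kind == "C":  # set the cpu list
            actions.append(("cpuset.cpus", arg))
        elif kind == "P":  # set the partition type
            if arg not in PARTITION_NAMES:
                raise ValueError(f"Invalid partition state - {arg}")
            actions.append(("cpuset.cpus.partition", PARTITION_NAMES[arg]))
        elif kind == "O":  # <cpu>-<value> written to the cpu online file
            cpu, _, val = arg.rpartition("-")
            actions.append((f"/sys/devices/system/cpu/cpu{cpu}/online", val))
        elif kind == "T":  # move the writing task into the cgroup
            actions.append(("cgroup.procs", "0"))
    return actions


def partition_code(text):
    """Map the content of cpuset.cpus.partition to the matrix's P<v> value."""
    text = text.strip()
    if text.startswith("root invalid"):
        return -1
    if text.startswith("isolated invalid"):
        return -2
    return PARTITION_CODES[text]
```

For instance, the setup string `C2-3:P1:S+` from `SETUP_A123_PARTITIONS` expands to three writes: the cpu list `2-3`, the partition type `root`, and `+cpuset` to the subtree control file.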
+87
tools/testing/selftests/cgroup/wait_inotify.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+	fprintf(stderr, msg, file);
+	exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+	char *cmd = argv[0];
+	int c, fd;
+	struct pollfd fds = { .events = POLLIN, };
+
+	while ((c = getopt(argc, argv, "v")) != -1) {
+		switch (c) {
+		case 'v':
+			verbose++;
+			break;
+		}
+		argv++, argc--;
+	}
+
+	if (argc != 2) {
+		fprintf(stderr, usage, cmd);
+		return -1;
+	}
+	file = argv[1];
+	fd = open(file, O_RDONLY);
+	if (fd < 0)
+		fail_message("Cgroup file %s not found!\n");
+	close(fd);
+
+	fd = inotify_init();
+	if (fd < 0)
+		fail_message("inotify_init() fails on %s!\n");
+	if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+		fail_message("inotify_add_watch() fails on %s!\n");
+	fds.fd = fd;
+
+	/*
+	 * poll waiting loop
+	 */
+	for (;;) {
+		int ret = poll(&fds, 1, 10000);
+
+		if (ret < 0) {
+			if (errno == EINTR)
+				continue;
+			perror("poll");
+			exit(1);
+		}
+		if ((ret > 0) && (fds.revents & POLLIN))
+			break;
+	}
+	if (verbose) {
+		struct inotify_event events[10];
+		long len;
+
+		usleep(1000);
+		len = read(fd, events, sizeof(events));
+		printf("Number of events read = %ld\n",
+			len/sizeof(struct inotify_event));
+	}
+	close(fd);
+	return 0;
+}
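For readers who want to experiment with the same wait pattern outside the selftest, the sketch below reproduces the core of wait_inotify.c in Python. Python's standard library has no inotify wrapper, so this version calls the libc functions `inotify_init()` and `inotify_add_watch()` through `ctypes`, which assumes a Linux system. The helper names `open_watch` and `wait_modified` are hypothetical. One deliberate difference from the C program: on timeout this sketch returns `False` instead of looping forever.

```python
import ctypes
import ctypes.util
import os
import select

IN_MODIFY = 0x00000002  # event mask value from <sys/inotify.h>

# Load libc; fall back to the symbols already linked into the interpreter.
_libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)


def open_watch(path):
    """Create an inotify descriptor watching `path` for modifications."""
    fd = _libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init() failed")
    if _libc.inotify_add_watch(fd, os.fsencode(path), IN_MODIFY) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch() failed")
    return fd


def wait_modified(fd, timeout_ms=10000):
    """Block until an event is readable on fd; False if the wait timed out."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(revents & select.POLLIN for _, revents in events)
```

A caller would typically run `fd = open_watch(path)`, trigger or await the state change, call `wait_modified(fd)`, and then `os.close(fd)`. The watch must be set up before the change happens, which is why the shell test starts the helper in the background before offlining a CPU.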